[rs6000] inline expansion of memcmp using vsx

Message ID ba141ddb-418b-6c06-03e3-8bd06fd02766@linux.ibm.com
State New

Commit Message

Aaron Sawdey Nov. 14, 2018, 4:42 p.m. UTC
This patch generalizes some of the functions added earlier to do vsx expansion of strncmp
so that they can also generate the code needed for memcmp. I reorganized
expand_block_compare() a little to be able to make use of this there. The vsx code is more
compact, so I've changed the default block compare inline limit to 63 bytes. The vsx
code is only used if there are at least 16 bytes to compare, as this means we don't need
complex code to compare less than one chunk. If vsx is not available the limit is cut
in half. The performance is good: vsx memcmp is considerably faster than the gpr inline code
if the strings are equal and is comparable if the strings have a 10% chance of being
equal (spread across the string).
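
As a rough illustration of the 16-byte chunking and the halved limit described
above, here is a minimal portable C sketch. It is not the RTL the patch emits:
overlap_memcmp, compare_chunk and effective_inline_limit are hypothetical names,
and memcmp on one chunk stands in for a vector load pair plus compare. The last
chunk is moved back so it ends exactly at the final byte, which is why at least
16 bytes are required.

```c
#include <stddef.h>
#include <string.h>

enum { CHUNK = 16 };  /* one vector register's worth of bytes */

/* Hypothetical stand-in for one pair of 16-byte loads plus a vector compare. */
static int compare_chunk(const unsigned char *a, const unsigned char *b)
{
    return memcmp(a, b, CHUNK);
}

/* Compare n bytes in 16-byte chunks.  If the tail is shorter than a chunk,
   back the offset up so the final chunk overlaps bytes already compared.  */
int overlap_memcmp(const void *p1, const void *p2, size_t n)
{
    const unsigned char *a = p1;
    const unsigned char *b = p2;
    size_t offset = 0;

    if (n < CHUNK)
        return memcmp(a, b, n);  /* too short for the overlap trick */

    while (offset < n) {
        if (n - offset < CHUNK)
            offset = n - CHUNK;  /* overlap: chunk ends at byte n */
        int r = compare_chunk(a + offset, b + offset);
        if (r != 0)
            return r;
        offset += CHUNK;
    }
    return 0;
}

/* Limit adjustment when the vector path is not used: 63 becomes 31.  */
unsigned effective_inline_limit(unsigned limit, int use_vec)
{
    if (!use_vec && limit > 1)
        limit = ((limit + 1) / 2) - 1;
    return limit;
}
```

The overlapped bytes are already known to be equal, so any difference found in
the final chunk lies in the bytes not yet compared and the result sign is
unaffected.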

Currently regtesting, ok for trunk if tests pass?

Thanks!
   Aaron

2018-11-14  Aaron Sawdey  <acsawdey@linux.ibm.com>

	* config/rs6000/rs6000-string.c (emit_vsx_zero_reg): New function.
	(expand_cmp_vec_sequence): Rename and modify
	expand_strncmp_vec_sequence.
	(emit_final_compare_vec): Rename and modify emit_final_str_compare_vec.
	(generate_6432_conversion): New function.
	(expand_block_compare): Add support for vsx.
	(expand_block_compare_gpr): New function.
	* config/rs6000/rs6000.opt (rs6000_block_compare_inline_limit): Increase
	default limit to 63 because of more compact vsx code.
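
For readers unfamiliar with the 64->32 conversion that generate_6432_conversion
factors out, the following C sketch models the idea behind the
subfc/subfe/popcntd sequence. It is only an approximation under stated
assumptions: diff_to_memcmp_result and popcount64 are hypothetical names, and
the two inputs are assumed to have been loaded so that unsigned integer order
matches byte order (big-endian loads, or byte-reversed loads on little-endian).

```c
#include <stdint.h>

/* Portable population count, so the sketch needs no compiler builtins. */
static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Reduce a 64-bit unsigned comparison to a 32-bit value with the same
   <0 / ==0 / >0 property that memcmp must return.  */
int diff_to_memcmp_result(uint64_t a, uint64_t b)
{
    uint64_t low = a - b;              /* models subfc: 64-bit difference */
    int32_t high = (a < b) ? -1 : 0;   /* models subfe H,H,H: all ones on borrow */
    return high | popcount64(low);     /* popcount is 0..64, nonzero iff a != b */
}
```

Simply truncating the 64-bit difference to 32 bits would not work, since the
low half can be zero or have the wrong sign even when the operands differ; the
population count keeps "nonzero" intact while fitting in a small range.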

Comments

Richard Biener Nov. 15, 2018, 10:02 a.m. UTC | #1
On Wed, Nov 14, 2018 at 5:43 PM Aaron Sawdey <acsawdey@linux.ibm.com> wrote:
>
> This patch generalizes some of the functions added earlier to do vsx expansion of strncmp
> so that they can also generate the code needed for memcmp. I reorganized
> expand_block_compare() a little to be able to make use of this there. The vsx code is more
> compact, so I've changed the default block compare inline limit to 63 bytes. The vsx
> code is only used if there are at least 16 bytes to compare, as this means we don't need
> complex code to compare less than one chunk. If vsx is not available the limit is cut
> in half. The performance is good: vsx memcmp is considerably faster than the gpr inline code
> if the strings are equal and is comparable if the strings have a 10% chance of being
> equal (spread across the string).

How is performance affected if there are char-sized stores to one of the
strings/memory blocks shortly before the compare?
Can Power still do store forwarding in this case?
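
To make the access pattern in question concrete, here is a small hypothetical
C example (tag_matches is an illustrative name, not taken from the patch or any
benchmark). It writes a buffer one byte at a time and then immediately compares
all 16 bytes, which is the kind of fixed-size memcmp that could be expanded
inline with 16-byte vector loads. It only shows the pattern; it does not
measure or demonstrate how the hardware handles it.

```c
#include <string.h>

/* Byte-sized stores immediately followed by a 16-byte compare of the
   same memory.  EXPECTED must point to at least 16 bytes.  */
int tag_matches(const char *expected, char c)
{
    char buf[16];

    for (int i = 0; i < 16; i++)
        buf[i] = (char)(c + i);              /* sixteen 1-byte stores */

    /* Constant-length compare: a candidate for inline expansion, where a
       single wide load would read bytes written by the narrow stores.  */
    return memcmp(buf, expected, 16) == 0;
}
```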

> Currently regtesting, ok for trunk if tests pass?
>
> Thanks!
>    Aaron
>
> 2018-11-14  Aaron Sawdey  <acsawdey@linux.ibm.com>
>
>         * config/rs6000/rs6000-string.c (emit_vsx_zero_reg): New function.
>         (expand_cmp_vec_sequence): Rename and modify
>         expand_strncmp_vec_sequence.
>         (emit_final_compare_vec): Rename and modify emit_final_str_compare_vec.
>         (generate_6432_conversion): New function.
>         (expand_block_compare): Add support for vsx.
>         (expand_block_compare_gpr): New function.
>         * config/rs6000/rs6000.opt (rs6000_block_compare_inline_limit): Increase
>         default limit to 63 because of more compact vsx code.
>
>
>
>
> Index: gcc/config/rs6000/rs6000-string.c
> ===================================================================
> --- gcc/config/rs6000/rs6000-string.c   (revision 266034)
> +++ gcc/config/rs6000/rs6000-string.c   (working copy)
> @@ -615,6 +615,283 @@
>      }
>  }
>
> +static rtx
> +emit_vsx_zero_reg()
> +{
> +  unsigned int i;
> +  rtx zr[16];
> +  for (i = 0; i < 16; i++)
> +    zr[i] = GEN_INT (0);
> +  rtvec zv = gen_rtvec_v (16, zr);
> +  rtx zero_reg = gen_reg_rtx (V16QImode);
> +  rs6000_expand_vector_init (zero_reg, gen_rtx_PARALLEL (V16QImode, zv));
> +  return zero_reg;
> +}
> +
> +/* Generate the sequence of compares for strcmp/strncmp using vec/vsx
> +   instructions.
> +
> +   BYTES_TO_COMPARE is the number of bytes to be compared.
> +   ORIG_SRC1 is the unmodified rtx for the first string.
> +   ORIG_SRC2 is the unmodified rtx for the second string.
> +   S1ADDR is the register to use for the base address of the first string.
> +   S2ADDR is the register to use for the base address of the second string.
> +   OFF_REG is the register to use for the string offset for loads.
> +   S1DATA is the register for loading the first string.
> +   S2DATA is the register for loading the second string.
> +   VEC_RESULT is the rtx for the vector result indicating the byte difference.
> +   EQUALITY_COMPARE_REST is a flag to indicate we need to make a cleanup call
> +   to strcmp/strncmp if we have equality at the end of the inline comparison.
> +   P_CLEANUP_LABEL is a pointer to rtx for a label we generate if we need code
> +   to clean up and generate the final comparison result.
> +   FINAL_MOVE_LABEL is rtx for a label we can branch to when we can just
> +   set the final result.
> +   CHECKZERO indicates whether the sequence should check for zero bytes
> +   for use doing strncmp, or not (for use doing memcmp).  */
> +static void
> +expand_cmp_vec_sequence (unsigned HOST_WIDE_INT bytes_to_compare,
> +                        rtx orig_src1, rtx orig_src2,
> +                        rtx s1addr, rtx s2addr, rtx off_reg,
> +                        rtx s1data, rtx s2data, rtx vec_result,
> +                        bool equality_compare_rest, rtx *p_cleanup_label,
> +                        rtx final_move_label, bool checkzero)
> +{
> +  machine_mode load_mode;
> +  unsigned int load_mode_size;
> +  unsigned HOST_WIDE_INT cmp_bytes = 0;
> +  unsigned HOST_WIDE_INT offset = 0;
> +  rtx zero_reg = NULL;
> +
> +  gcc_assert (p_cleanup_label != NULL);
> +  rtx cleanup_label = *p_cleanup_label;
> +
> +  emit_move_insn (s1addr, force_reg (Pmode, XEXP (orig_src1, 0)));
> +  emit_move_insn (s2addr, force_reg (Pmode, XEXP (orig_src2, 0)));
> +
> +  if (checkzero && !TARGET_P9_VECTOR)
> +    zero_reg = emit_vsx_zero_reg();
> +
> +  while (bytes_to_compare > 0)
> +    {
> +      /* VEC/VSX compare sequence for P8:
> +        check each 16B with:
> +        lxvd2x 32,28,8
> +        lxvd2x 33,29,8
> +        vcmpequb 2,0,1  # compare strings
> +        vcmpequb 4,0,3  # compare w/ 0
> +        xxlorc 37,36,34       # first FF byte is either mismatch or end of string
> +        vcmpequb. 7,5,3  # reg 7 contains 0
> +        bnl 6,.Lmismatch
> +
> +        For the P8 LE case, we use lxvd2x and compare full 16 bytes
> +        but then use vgbbd and a shift to get two bytes with the
> +        information we need in the correct order.
> +
> +        VEC/VSX compare sequence if TARGET_P9_VECTOR:
> +        lxvb16x/lxvb16x     # load 16B of each string
> +        vcmpnezb.           # produces difference location or zero byte location
> +        bne 6,.Lmismatch
> +
> +        Use the overlapping compare trick for the last block if it is
> +        less than 16 bytes.
> +      */
> +
> +      load_mode = V16QImode;
> +      load_mode_size = GET_MODE_SIZE (load_mode);
> +
> +      if (bytes_to_compare >= load_mode_size)
> +       cmp_bytes = load_mode_size;
> +      else
> +       {
> +         /* Move this load back so it doesn't go past the end.  P8/P9
> +            can do this efficiently.  This is never called with less
> +            than 16 bytes so we should always be able to do this.  */
> +         unsigned int extra_bytes = load_mode_size - bytes_to_compare;
> +         cmp_bytes = bytes_to_compare;
> +         gcc_assert (offset > extra_bytes);
> +         offset -= extra_bytes;
> +         cmp_bytes = load_mode_size;
> +         bytes_to_compare = cmp_bytes;
> +       }
> +
> +      /* The offset currently used is always kept in off_reg so that the
> +        cleanup code on P8 can use it to extract the differing byte.  */
> +      emit_move_insn (off_reg, GEN_INT (offset));
> +
> +      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
> +      do_load_for_compare_from_addr (load_mode, s1data, addr1, orig_src1);
> +      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
> +      do_load_for_compare_from_addr (load_mode, s2data, addr2, orig_src2);
> +
> +      /* Cases to handle.  A and B are chunks of the two strings.
> +        1: Not end of comparison:
> +        A != B: branch to cleanup code to compute result.
> +        A == B: next block
> +        2: End of the inline comparison:
> +        A != B: branch to cleanup code to compute result.
> +        A == B: call strcmp/strncmp
> +        3: compared requested N bytes:
> +        A == B: branch to result 0.
> +        A != B: cleanup code to compute result.  */
> +
> +      unsigned HOST_WIDE_INT remain = bytes_to_compare - cmp_bytes;
> +
> +      if (checkzero)
> +       {
> +         if (TARGET_P9_VECTOR)
> +           emit_insn (gen_vcmpnezb_p (vec_result, s1data, s2data));
> +         else
> +           {
> +             /* Emit instructions to do comparison and zero check.  */
> +             rtx cmp_res = gen_reg_rtx (load_mode);
> +             rtx cmp_zero = gen_reg_rtx (load_mode);
> +             rtx cmp_combined = gen_reg_rtx (load_mode);
> +             emit_insn (gen_altivec_eqv16qi (cmp_res, s1data, s2data));
> +             emit_insn (gen_altivec_eqv16qi (cmp_zero, s1data, zero_reg));
> +             emit_insn (gen_orcv16qi3 (vec_result, cmp_zero, cmp_res));
> +             emit_insn (gen_altivec_vcmpequb_p (cmp_combined, vec_result, zero_reg));
> +           }
> +       }
> +      else
> +       emit_insn (gen_altivec_vcmpequb_p (vec_result, s1data, s2data));
> +
> +      bool branch_to_cleanup = (remain > 0 || equality_compare_rest);
> +      rtx cr6 = gen_rtx_REG (CCmode, CR6_REGNO);
> +      rtx dst_label;
> +      rtx cmp_rtx;
> +      if (branch_to_cleanup)
> +       {
> +         /* Branch to cleanup code, otherwise fall through to do more
> +            compares.  P8 and P9 use different CR bits because on P8
> +            we are looking at the result of a comparison vs a
> +            register of zeroes so the all-true condition means no
> +            difference or zero was found.  On P9, vcmpnezb sets a byte
> +            to 0xff if there is a mismatch or zero, so the all-false
> +            condition indicates we found no difference or zero.  */
> +         if (!cleanup_label)
> +           cleanup_label = gen_label_rtx ();
> +         dst_label = cleanup_label;
> +         if (TARGET_P9_VECTOR && checkzero)
> +           cmp_rtx = gen_rtx_NE (VOIDmode, cr6, const0_rtx);
> +         else
> +           cmp_rtx = gen_rtx_GE (VOIDmode, cr6, const0_rtx);
> +       }
> +      else
> +       {
> +         /* Branch to final return or fall through to cleanup,
> +            result is already set to 0.  */
> +         dst_label = final_move_label;
> +         if (TARGET_P9_VECTOR && checkzero)
> +           cmp_rtx = gen_rtx_EQ (VOIDmode, cr6, const0_rtx);
> +         else
> +           cmp_rtx = gen_rtx_LT (VOIDmode, cr6, const0_rtx);
> +       }
> +
> +      rtx lab_ref = gen_rtx_LABEL_REF (VOIDmode, dst_label);
> +      rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, cmp_rtx,
> +                                        lab_ref, pc_rtx);
> +      rtx j2 = emit_jump_insn (gen_rtx_SET (pc_rtx, ifelse));
> +      JUMP_LABEL (j2) = dst_label;
> +      LABEL_NUSES (dst_label) += 1;
> +
> +      offset += cmp_bytes;
> +      bytes_to_compare -= cmp_bytes;
> +    }
> +  *p_cleanup_label = cleanup_label;
> +  return;
> +}
> +
> +/* Generate the final sequence that identifies the differing
> +   byte and generates the final result, taking into account
> +   zero bytes:
> +
> +   P8:
> +        vgbbd 0,0
> +        vsldoi 0,0,0,9
> +        mfvsrd 9,32
> +        addi 10,9,-1    # count trailing zero bits
> +        andc 9,10,9
> +        popcntd 9,9
> +        lbzx 10,28,9    # use that offset to load differing byte
> +        lbzx 3,29,9
> +        subf 3,3,10     # subtract for final result
> +
> +   P9:
> +        vclzlsbb            # counts trailing bytes with lsb=0
> +        vextublx            # extract differing byte
> +
> +   STR1 is the reg rtx for data from string 1.
> +   STR2 is the reg rtx for data from string 2.
> +   RESULT is the reg rtx for the comparison result.
> +   S1ADDR is the register to use for the base address of the first string.
> +   S2ADDR is the register to use for the base address of the second string.
> +   ORIG_SRC1 is the unmodified rtx for the first string.
> +   ORIG_SRC2 is the unmodified rtx for the second string.
> +   OFF_REG is the register to use for the string offset for loads.
> +   VEC_RESULT is the rtx for the vector result indicating the byte difference.  */
> +
> +static void
> +emit_final_compare_vec (rtx str1, rtx str2, rtx result,
> +                       rtx s1addr, rtx s2addr,
> +                       rtx orig_src1, rtx orig_src2,
> +                       rtx off_reg, rtx vec_result)
> +{
> +
> +  if (TARGET_P9_VECTOR)
> +    {
> +      rtx diffix = gen_reg_rtx (SImode);
> +      rtx chr1 = gen_reg_rtx (SImode);
> +      rtx chr2 = gen_reg_rtx (SImode);
> +      rtx chr1_di = simplify_gen_subreg (DImode, chr1, SImode, 0);
> +      rtx chr2_di = simplify_gen_subreg (DImode, chr2, SImode, 0);
> +      emit_insn (gen_vclzlsbb_v16qi (diffix, vec_result));
> +      emit_insn (gen_vextublx (chr1, diffix, str1));
> +      emit_insn (gen_vextublx (chr2, diffix, str2));
> +      do_sub3 (result, chr1_di, chr2_di);
> +    }
> +  else
> +    {
> +      gcc_assert (TARGET_P8_VECTOR);
> +      rtx diffix = gen_reg_rtx (DImode);
> +      rtx result_gbbd = gen_reg_rtx (V16QImode);
> +      /* Since each byte of the input is either 00 or FF, the bytes in
> +        dw0 and dw1 after vgbbd are all identical to each other.  */
> +      emit_insn (gen_p8v_vgbbd (result_gbbd, vec_result));
> +      /* For LE, we shift by 9 and get BA in the low two bytes then CTZ.
> +        For BE, we shift by 7 and get AB in the high two bytes then CLZ.  */
> +      rtx result_shifted = gen_reg_rtx (V16QImode);
> +      int shift_amt = (BYTES_BIG_ENDIAN) ? 7 : 9;
> +      emit_insn (gen_altivec_vsldoi_v16qi (result_shifted,result_gbbd,result_gbbd, GEN_INT (shift_amt)));
> +
> +      rtx diffix_df = simplify_gen_subreg (DFmode, diffix, DImode, 0);
> +      emit_insn (gen_p8_mfvsrd_3_v16qi (diffix_df, result_shifted));
> +      rtx count = gen_reg_rtx (DImode);
> +
> +      if (BYTES_BIG_ENDIAN)
> +       emit_insn (gen_clzdi2 (count, diffix));
> +      else
> +       emit_insn (gen_ctzdi2 (count, diffix));
> +
> +      /* P8 doesn't have a good solution for extracting one byte from
> +        a vsx reg like vextublx on P9 so we just compute the offset
> +        of the differing byte and load it from each string.  */
> +      do_add3 (off_reg, off_reg, count);
> +
> +      rtx chr1 = gen_reg_rtx (QImode);
> +      rtx chr2 = gen_reg_rtx (QImode);
> +      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
> +      do_load_for_compare_from_addr (QImode, chr1, addr1, orig_src1);
> +      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
> +      do_load_for_compare_from_addr (QImode, chr2, addr2, orig_src2);
> +      machine_mode rmode = GET_MODE (result);
> +      rtx chr1_rm = simplify_gen_subreg (rmode, chr1, QImode, 0);
> +      rtx chr2_rm = simplify_gen_subreg (rmode, chr2, QImode, 0);
> +      do_sub3 (result, chr1_rm, chr2_rm);
> +    }
> +
> +  return;
> +}
> +
>  /* Expand a block compare operation using loop code, and return true
>     if successful.  Return false if we should let the compiler generate
>     normal code, probably a memcmp call.
> @@ -1343,106 +1620,80 @@
>    return true;
>  }
>
> -/* Expand a block compare operation, and return true if successful.
> -   Return false if we should let the compiler generate normal code,
> -   probably a memcmp call.
> +/* Generate code to convert a DImode-plus-carry subtract result into
> +   a SImode result that has the same <0 / ==0 / >0 properties to
> +   produce the final result from memcmp.
>
> -   OPERANDS[0] is the target (result).
> -   OPERANDS[1] is the first source.
> -   OPERANDS[2] is the second source.
> -   OPERANDS[3] is the length.
> -   OPERANDS[4] is the alignment.  */
> -bool
> -expand_block_compare (rtx operands[])
> +   TARGET is the rtx for the register to receive the memcmp result.
> +   SUB_RESULT is the rtx for the register containing the subtract result.  */
> +
> +void
> +generate_6432_conversion(rtx target, rtx sub_result)
>  {
> -  rtx target = operands[0];
> -  rtx orig_src1 = operands[1];
> -  rtx orig_src2 = operands[2];
> -  rtx bytes_rtx = operands[3];
> -  rtx align_rtx = operands[4];
> -  HOST_WIDE_INT cmp_bytes = 0;
> -  rtx src1 = orig_src1;
> -  rtx src2 = orig_src2;
> +  /* We need to produce DI result from sub, then convert to target SI
> +     while maintaining <0 / ==0 / >0 properties.  This sequence works:
> +     subfc L,A,B
> +     subfe H,H,H
> +     popcntd L,L
> +     rldimi L,H,6,0
>
> -  /* This case is complicated to handle because the subtract
> -     with carry instructions do not generate the 64-bit
> -     carry and so we must emit code to calculate it ourselves.
> -     We choose not to implement this yet.  */
> -  if (TARGET_32BIT && TARGET_POWERPC64)
> -    return false;
> +     This is an alternate one Segher cooked up if somebody
> +     wants to expand this for something that doesn't have popcntd:
> +     subfc L,a,b
> +     subfe H,x,x
> +     addic t,L,-1
> +     subfe v,t,L
> +     or z,v,H
>
> -  bool isP7 = (rs6000_tune == PROCESSOR_POWER7);
> +     And finally, p9 can just do this:
> +     cmpld A,B
> +     setb r */
>
> -  /* Allow this param to shut off all expansion.  */
> -  if (rs6000_block_compare_inline_limit == 0)
> -    return false;
> -
> -  /* targetm.slow_unaligned_access -- don't do unaligned stuff.
> -     However slow_unaligned_access returns true on P7 even though the
> -     performance of this code is good there.  */
> -  if (!isP7
> -      && (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1))
> -         || targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src2))))
> -    return false;
> -
> -  /* Unaligned l*brx traps on P7 so don't do this.  However this should
> -     not affect much because LE isn't really supported on P7 anyway.  */
> -  if (isP7 && !BYTES_BIG_ENDIAN)
> -    return false;
> -
> -  /* If this is not a fixed size compare, try generating loop code and
> -     if that fails just call memcmp.  */
> -  if (!CONST_INT_P (bytes_rtx))
> -    return expand_compare_loop (operands);
> -
> -  /* This must be a fixed size alignment.  */
> -  if (!CONST_INT_P (align_rtx))
> -    return false;
> -
> -  unsigned int base_align = UINTVAL (align_rtx) / BITS_PER_UNIT;
> -
> -  gcc_assert (GET_MODE (target) == SImode);
> -
> -  /* Anything to move?  */
> -  unsigned HOST_WIDE_INT bytes = UINTVAL (bytes_rtx);
> -  if (bytes == 0)
> -    return true;
> -
> -  rtx tmp_reg_src1 = gen_reg_rtx (word_mode);
> -  rtx tmp_reg_src2 = gen_reg_rtx (word_mode);
> -  /* P7/P8 code uses cond for subfc. but P9 uses
> -     it for cmpld which needs CCUNSmode.  */
> -  rtx cond;
> -  if (TARGET_P9_MISC)
> -    cond = gen_reg_rtx (CCUNSmode);
> +  if (TARGET_64BIT)
> +    {
> +      rtx tmp_reg_ca = gen_reg_rtx (DImode);
> +      emit_insn (gen_subfdi3_carry_in_xx (tmp_reg_ca));
> +      rtx popcnt = gen_reg_rtx (DImode);
> +      emit_insn (gen_popcntddi2 (popcnt, sub_result));
> +      rtx tmp2 = gen_reg_rtx (DImode);
> +      emit_insn (gen_iordi3 (tmp2, popcnt, tmp_reg_ca));
> +      emit_insn (gen_movsi (target, gen_lowpart (SImode, tmp2)));
> +    }
>    else
> -    cond = gen_reg_rtx (CCmode);
> +    {
> +      rtx tmp_reg_ca = gen_reg_rtx (SImode);
> +      emit_insn (gen_subfsi3_carry_in_xx (tmp_reg_ca));
> +      rtx popcnt = gen_reg_rtx (SImode);
> +      emit_insn (gen_popcntdsi2 (popcnt, sub_result));
> +      emit_insn (gen_iorsi3 (target, popcnt, tmp_reg_ca));
> +    }
> +}
>
> -  /* Strategy phase.  How many ops will this take and should we expand it?  */
> +/* Generate memcmp expansion using in-line non-loop GPR instructions.
> +   The bool return indicates whether code for a 64->32 conversion
> +   should be generated.
>
> -  unsigned HOST_WIDE_INT offset = 0;
> -  machine_mode load_mode =
> -    select_block_compare_mode (offset, bytes, base_align);
> -  unsigned int load_mode_size = GET_MODE_SIZE (load_mode);
> +   BYTES is the number of bytes to be compared.
> +   BASE_ALIGN is the minimum alignment for both blocks to compare.
> +   ORIG_SRC1 is the original pointer to the first block to compare.
> +   ORIG_SRC2 is the original pointer to the second block to compare.
> +   SUB_RESULT is the reg rtx for the result from the final subtract.
> +   COND is rtx for a condition register that will be used for the final
> +   compare on power9 or better.
> +   FINAL_RESULT is the reg rtx for the final memcmp result.
> +   P_CONVERT_LABEL is a pointer to rtx that will be used to store the
> +   label generated for a branch to the 64->32 code, if such a branch
> +   is needed.
> +   P_FINAL_LABEL is a pointer to rtx that will be used to store the label
> +   for the end of the memcmp if a branch there is needed.
> +*/
>
> -  /* We don't want to generate too much code.  The loop code can take
> -     over for lengths greater than 31 bytes.  */
> -  unsigned HOST_WIDE_INT max_bytes = rs6000_block_compare_inline_limit;
> -  if (!IN_RANGE (bytes, 1, max_bytes))
> -    return expand_compare_loop (operands);
> -
> -  /* The code generated for p7 and older is not faster than glibc
> -     memcmp if alignment is small and length is not short, so bail
> -     out to avoid those conditions.  */
> -  if (!TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
> -      && ((base_align == 1 && bytes > 16)
> -         || (base_align == 2 && bytes > 32)))
> -    return false;
> -
> -  bool generate_6432_conversion = false;
> -  rtx convert_label = NULL;
> -  rtx final_label = NULL;
> -
> +bool
> +expand_block_compare_gpr(unsigned HOST_WIDE_INT bytes, unsigned int base_align,
> +                        rtx orig_src1, rtx orig_src2,
> +                        rtx sub_result, rtx cond, rtx final_result,
> +                        rtx *p_convert_label, rtx *p_final_label)
> +{
>    /* Example of generated code for 18 bytes aligned 1 byte.
>       Compiled with -fno-reorder-blocks for clarity.
>               ldbrx 10,31,8
> @@ -1473,6 +1724,18 @@
>       if the difference is found there, then a final block of HImode that skips
>       the DI->SI conversion.  */
>
> +  unsigned HOST_WIDE_INT offset = 0;
> +  unsigned int load_mode_size;
> +  HOST_WIDE_INT cmp_bytes = 0;
> +  rtx src1 = orig_src1;
> +  rtx src2 = orig_src2;
> +  rtx tmp_reg_src1 = gen_reg_rtx (word_mode);
> +  rtx tmp_reg_src2 = gen_reg_rtx (word_mode);
> +  bool need_6432_conv = false;
> +  rtx convert_label = NULL;
> +  rtx final_label = NULL;
> +  machine_mode load_mode;
> +
>    while (bytes > 0)
>      {
>        unsigned int align = compute_current_alignment (base_align, offset);
> @@ -1536,15 +1799,15 @@
>         }
>
>        int remain = bytes - cmp_bytes;
> -      if (GET_MODE_SIZE (GET_MODE (target)) > GET_MODE_SIZE (load_mode))
> +      if (GET_MODE_SIZE (GET_MODE (final_result)) > GET_MODE_SIZE (load_mode))
>         {
> -         /* Target is larger than load size so we don't need to
> +         /* Final_result is larger than load size so we don't need to
>              reduce result size.  */
>
>           /* We previously did a block that need 64->32 conversion but
>              the current block does not, so a label is needed to jump
>              to the end.  */
> -         if (generate_6432_conversion && !final_label)
> +         if (need_6432_conv && !final_label)
>             final_label = gen_label_rtx ();
>
>           if (remain > 0)
> @@ -1557,7 +1820,7 @@
>               rtx tmp = gen_rtx_MINUS (word_mode, tmp_reg_src1, tmp_reg_src2);
>               rtx cr = gen_reg_rtx (CCmode);
>               rs6000_emit_dot_insn (tmp_reg_src2, tmp, 2, cr);
> -             emit_insn (gen_movsi (target,
> +             emit_insn (gen_movsi (final_result,
>                                     gen_lowpart (SImode, tmp_reg_src2)));
>               rtx ne_rtx = gen_rtx_NE (VOIDmode, cr, const0_rtx);
>               rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, ne_rtx,
> @@ -1572,11 +1835,11 @@
>                 {
>                   emit_insn (gen_subdi3 (tmp_reg_src2, tmp_reg_src1,
>                                          tmp_reg_src2));
> -                 emit_insn (gen_movsi (target,
> +                 emit_insn (gen_movsi (final_result,
>                                         gen_lowpart (SImode, tmp_reg_src2)));
>                 }
>               else
> -               emit_insn (gen_subsi3 (target, tmp_reg_src1, tmp_reg_src2));
> +               emit_insn (gen_subsi3 (final_result, tmp_reg_src1, tmp_reg_src2));
>
>               if (final_label)
>                 {
> @@ -1591,9 +1854,9 @@
>        else
>         {
>           /* Do we need a 64->32 conversion block? We need the 64->32
> -            conversion even if target size == load_mode size because
> +            conversion even if final_result size == load_mode size because
>              the subtract generates one extra bit.  */
> -         generate_6432_conversion = true;
> +         need_6432_conv = true;
>
>           if (remain > 0)
>             {
> @@ -1604,20 +1867,27 @@
>               rtx cvt_ref = gen_rtx_LABEL_REF (VOIDmode, convert_label);
>               if (TARGET_P9_MISC)
>                 {
> -               /* Generate a compare, and convert with a setb later.  */
> +               /* Generate a compare, and convert with a setb later.
> +                  Use cond that is passed in because the caller needs
> +                  to use it for the 64->32 conversion later.  */
>                   rtx cmp = gen_rtx_COMPARE (CCUNSmode, tmp_reg_src1,
>                                              tmp_reg_src2);
>                   emit_insn (gen_rtx_SET (cond, cmp));
>                 }
>               else
> -               /* Generate a subfc. and use the longer
> -                  sequence for conversion.  */
> -               if (TARGET_64BIT)
> -                 emit_insn (gen_subfdi3_carry_dot2 (tmp_reg_src2, tmp_reg_src2,
> -                                                    tmp_reg_src1, cond));
> -               else
> -                 emit_insn (gen_subfsi3_carry_dot2 (tmp_reg_src2, tmp_reg_src2,
> -                                                    tmp_reg_src1, cond));
> +               {
> +                 /* Generate a subfc. and use the longer sequence for
> +                    conversion.  Cond is not used outside this
> +                    function in this case.  */
> +                 cond = gen_reg_rtx (CCmode);
> +                 if (TARGET_64BIT)
> +                   emit_insn (gen_subfdi3_carry_dot2 (sub_result, tmp_reg_src2,
> +                                                      tmp_reg_src1, cond));
> +                 else
> +                   emit_insn (gen_subfsi3_carry_dot2 (sub_result, tmp_reg_src2,
> +                                                      tmp_reg_src1, cond));
> +               }
> +
>               rtx ne_rtx = gen_rtx_NE (VOIDmode, cond, const0_rtx);
>               rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, ne_rtx,
>                                                  cvt_ref, pc_rtx);
> @@ -1637,10 +1907,10 @@
>                 }
>               else
>                 if (TARGET_64BIT)
> -                 emit_insn (gen_subfdi3_carry (tmp_reg_src2, tmp_reg_src2,
> +                 emit_insn (gen_subfdi3_carry (sub_result, tmp_reg_src2,
>                                                 tmp_reg_src1));
>                 else
> -                 emit_insn (gen_subfsi3_carry (tmp_reg_src2, tmp_reg_src2,
> +                 emit_insn (gen_subfsi3_carry (sub_result, tmp_reg_src2,
>                                                 tmp_reg_src1));
>             }
>         }
> @@ -1649,51 +1919,162 @@
>        bytes -= cmp_bytes;
>      }
>
> -  if (generate_6432_conversion)
> +  if (convert_label)
> +    *p_convert_label = convert_label;
> +  if (final_label)
> +    *p_final_label = final_label;
> +  return need_6432_conv;
> +}
> +
> +/* Expand a block compare operation, and return true if successful.
> +   Return false if we should let the compiler generate normal code,
> +   probably a memcmp call.
> +
> +   OPERANDS[0] is the target (result).
> +   OPERANDS[1] is the first source.
> +   OPERANDS[2] is the second source.
> +   OPERANDS[3] is the length.
> +   OPERANDS[4] is the alignment.  */
> +bool
> +expand_block_compare (rtx operands[])
> +{
> +  rtx target = operands[0];
> +  rtx orig_src1 = operands[1];
> +  rtx orig_src2 = operands[2];
> +  rtx bytes_rtx = operands[3];
> +  rtx align_rtx = operands[4];
> +
> +  /* This case is complicated to handle because the subtract
> +     with carry instructions do not generate the 64-bit
> +     carry and so we must emit code to calculate it ourselves.
> +     We choose not to implement this yet.  */
> +  if (TARGET_32BIT && TARGET_POWERPC64)
> +    return false;
> +
> +  bool isP7 = (rs6000_tune == PROCESSOR_POWER7);
> +
> +  /* Allow this param to shut off all expansion.  */
> +  if (rs6000_block_compare_inline_limit == 0)
> +    return false;
> +
> +  /* targetm.slow_unaligned_access -- don't do unaligned stuff.
> +     However slow_unaligned_access returns true on P7 even though the
> +     performance of this code is good there.  */
> +  if (!isP7
> +      && (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1))
> +         || targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src2))))
> +    return false;
> +
> +  /* Unaligned l*brx traps on P7 so don't do this.  However this should
> +     not affect much because LE isn't really supported on P7 anyway.  */
> +  if (isP7 && !BYTES_BIG_ENDIAN)
> +    return false;
> +
> +  /* If this is not a fixed size compare, try generating loop code and
> +     if that fails just call memcmp.  */
> +  if (!CONST_INT_P (bytes_rtx))
> +    return expand_compare_loop (operands);
> +
> +  /* This must be a fixed size alignment.  */
> +  if (!CONST_INT_P (align_rtx))
> +    return false;
> +
> +  unsigned int base_align = UINTVAL (align_rtx) / BITS_PER_UNIT;
> +
> +  gcc_assert (GET_MODE (target) == SImode);
> +
> +  /* Anything to move?  */
> +  unsigned HOST_WIDE_INT bytes = UINTVAL (bytes_rtx);
> +  if (bytes == 0)
> +    return true;
> +
> +  /* P7/P8 code uses cond for subfc. but P9 uses
> +     it for cmpld which needs CCUNSmode.  */
> +  rtx cond = NULL;
> +  if (TARGET_P9_MISC)
> +    cond = gen_reg_rtx (CCUNSmode);
> +
> +  /* Is it OK to use vec/vsx for this.  TARGET_VSX means we have at
> +     least POWER7 but we use TARGET_EFFICIENT_UNALIGNED_VSX which is
> +     at least POWER8.  That way we can rely on overlapping compares to
> +     do the final comparison of less than 16 bytes.  Also I do not
> +     want to deal with making this work for 32 bits.  In addition, we
> +     have to make sure that we have at least P8_VECTOR (we don't allow
> +     P9_VECTOR without P8_VECTOR).  */
> +  int use_vec = (bytes >= 16 && !TARGET_32BIT
> +                && TARGET_EFFICIENT_UNALIGNED_VSX && TARGET_P8_VECTOR);
> +
> +  /* We don't want to generate too much code.  The loop code can take
> +     over for lengths greater than 31 bytes.  */
> +  unsigned HOST_WIDE_INT max_bytes = rs6000_block_compare_inline_limit;
> +
> +  /* Don't generate too much code if vsx was disabled.  */
> +  if (!use_vec && max_bytes > 1)
> +    max_bytes = ((max_bytes + 1) / 2) - 1;
> +
> +  if (!IN_RANGE (bytes, 1, max_bytes))
> +    return expand_compare_loop (operands);
> +
> +  /* The code generated for p7 and older is not faster than glibc
> +     memcmp if alignment is small and length is not short, so bail
> +     out to avoid those conditions.  */
> +  if (!TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
> +      && ((base_align == 1 && bytes > 16)
> +         || (base_align == 2 && bytes > 32)))
> +    return false;
> +
> +  rtx final_label = NULL;
> +
> +  if (use_vec)
>      {
> -      if (convert_label)
> -       emit_label (convert_label);
> +      rtx final_move_label = gen_label_rtx ();
> +      rtx s1addr = gen_reg_rtx (Pmode);
> +      rtx s2addr = gen_reg_rtx (Pmode);
> +      rtx off_reg = gen_reg_rtx (Pmode);
> +      rtx cleanup_label = NULL;
> +      rtx vec_result = gen_reg_rtx (V16QImode);
> +      rtx s1data = gen_reg_rtx (V16QImode);
> +      rtx s2data = gen_reg_rtx (V16QImode);
> +      rtx result_reg = gen_reg_rtx (word_mode);
> +      emit_move_insn (result_reg, GEN_INT (0));
>
> -      /* We need to produce DI result from sub, then convert to target SI
> -        while maintaining <0 / ==0 / >0 properties.  This sequence works:
> -        subfc L,A,B
> -        subfe H,H,H
> -        popcntd L,L
> -        rldimi L,H,6,0
> +      expand_cmp_vec_sequence (bytes, orig_src1, orig_src2,
> +                              s1addr, s2addr, off_reg, s1data, s2data,
> +                              vec_result, false,
> +                              &cleanup_label, final_move_label, false);
>
> -        This is an alternate one Segher cooked up if somebody
> -        wants to expand this for something that doesn't have popcntd:
> -        subfc L,a,b
> -        subfe H,x,x
> -        addic t,L,-1
> -        subfe v,t,L
> -        or z,v,H
> +      if (cleanup_label)
> +       emit_label (cleanup_label);
>
> -        And finally, p9 can just do this:
> -        cmpld A,B
> -        setb r */
> +      emit_insn (gen_one_cmplv16qi2 (vec_result, vec_result));
>
> -      if (TARGET_P9_MISC)
> +      emit_final_compare_vec (s1data, s2data, result_reg,
> +                             s1addr, s2addr, orig_src1, orig_src2,
> +                             off_reg, vec_result);
> +
> +      emit_label (final_move_label);
> +      emit_insn (gen_movsi (target,
> +                           gen_lowpart (SImode, result_reg)));
> +    }
> +  else
> +    { /* generate GPR code */
> +
> +      rtx convert_label = NULL;
> +      rtx sub_result = gen_reg_rtx (word_mode);
> +      bool need_6432_conversion =
> +       expand_block_compare_gpr(bytes, base_align,
> +                                orig_src1, orig_src2,
> +                                sub_result, cond, target,
> +                                &convert_label, &final_label);
> +
> +      if (need_6432_conversion)
>         {
> -         emit_insn (gen_setb_unsigned (target, cond));
> -       }
> -      else
> -       {
> -         if (TARGET_64BIT)
> -           {
> -             rtx tmp_reg_ca = gen_reg_rtx (DImode);
> -             emit_insn (gen_subfdi3_carry_in_xx (tmp_reg_ca));
> -             emit_insn (gen_popcntddi2 (tmp_reg_src2, tmp_reg_src2));
> -             emit_insn (gen_iordi3 (tmp_reg_src2, tmp_reg_src2, tmp_reg_ca));
> -             emit_insn (gen_movsi (target, gen_lowpart (SImode, tmp_reg_src2)));
> -           }
> +         if (convert_label)
> +           emit_label (convert_label);
> +         if (TARGET_P9_MISC)
> +           emit_insn (gen_setb_unsigned (target, cond));
>           else
> -           {
> -             rtx tmp_reg_ca = gen_reg_rtx (SImode);
> -             emit_insn (gen_subfsi3_carry_in_xx (tmp_reg_ca));
> -             emit_insn (gen_popcntdsi2 (tmp_reg_src2, tmp_reg_src2));
> -             emit_insn (gen_iorsi3 (target, tmp_reg_src2, tmp_reg_ca));
> -           }
> +           generate_6432_conversion(target, sub_result);
>         }
>      }
>
> @@ -1700,7 +2081,6 @@
>    if (final_label)
>      emit_label (final_label);
>
> -  gcc_assert (bytes == 0);
>    return true;
>  }
>
> @@ -1808,7 +2188,7 @@
>         }
>        rtx addr1 = gen_rtx_PLUS (Pmode, src1_addr, offset_rtx);
>        rtx addr2 = gen_rtx_PLUS (Pmode, src2_addr, offset_rtx);
> -
> +
>        do_load_for_compare_from_addr (load_mode, tmp_reg_src1, addr1, orig_src1);
>        do_load_for_compare_from_addr (load_mode, tmp_reg_src2, addr2, orig_src2);
>
> @@ -1966,176 +2346,6 @@
>    return;
>  }
>
> -/* Generate the sequence of compares for strcmp/strncmp using vec/vsx
> -   instructions.
> -
> -   BYTES_TO_COMPARE is the number of bytes to be compared.
> -   ORIG_SRC1 is the unmodified rtx for the first string.
> -   ORIG_SRC2 is the unmodified rtx for the second string.
> -   S1ADDR is the register to use for the base address of the first string.
> -   S2ADDR is the register to use for the base address of the second string.
> -   OFF_REG is the register to use for the string offset for loads.
> -   S1DATA is the register for loading the first string.
> -   S2DATA is the register for loading the second string.
> -   VEC_RESULT is the rtx for the vector result indicating the byte difference.
> -   EQUALITY_COMPARE_REST is a flag to indicate we need to make a cleanup call
> -   to strcmp/strncmp if we have equality at the end of the inline comparison.
> -   P_CLEANUP_LABEL is a pointer to rtx for a label we generate if we need code to clean up
> -   and generate the final comparison result.
> -   FINAL_MOVE_LABEL is rtx for a label we can branch to when we can just
> -   set the final result.  */
> -static void
> -expand_strncmp_vec_sequence (unsigned HOST_WIDE_INT bytes_to_compare,
> -                            rtx orig_src1, rtx orig_src2,
> -                            rtx s1addr, rtx s2addr, rtx off_reg,
> -                            rtx s1data, rtx s2data,
> -                            rtx vec_result, bool equality_compare_rest,
> -                            rtx *p_cleanup_label, rtx final_move_label)
> -{
> -  machine_mode load_mode;
> -  unsigned int load_mode_size;
> -  unsigned HOST_WIDE_INT cmp_bytes = 0;
> -  unsigned HOST_WIDE_INT offset = 0;
> -
> -  gcc_assert (p_cleanup_label != NULL);
> -  rtx cleanup_label = *p_cleanup_label;
> -
> -  emit_move_insn (s1addr, force_reg (Pmode, XEXP (orig_src1, 0)));
> -  emit_move_insn (s2addr, force_reg (Pmode, XEXP (orig_src2, 0)));
> -
> -  unsigned int i;
> -  rtx zr[16];
> -  for (i = 0; i < 16; i++)
> -    zr[i] = GEN_INT (0);
> -  rtvec zv = gen_rtvec_v (16, zr);
> -  rtx zero_reg = gen_reg_rtx (V16QImode);
> -  rs6000_expand_vector_init (zero_reg, gen_rtx_PARALLEL (V16QImode, zv));
> -
> -  while (bytes_to_compare > 0)
> -    {
> -      /* VEC/VSX compare sequence for P8:
> -        check each 16B with:
> -        lxvd2x 32,28,8
> -        lxvd2x 33,29,8
> -        vcmpequb 2,0,1  # compare strings
> -        vcmpequb 4,0,3  # compare w/ 0
> -        xxlorc 37,36,34       # first FF byte is either mismatch or end of string
> -        vcmpequb. 7,5,3  # reg 7 contains 0
> -        bnl 6,.Lmismatch
> -
> -        For the P8 LE case, we use lxvd2x and compare full 16 bytes
> -        but then use use vgbbd and a shift to get two bytes with the
> -        information we need in the correct order.
> -
> -        VEC/VSX compare sequence if TARGET_P9_VECTOR:
> -        lxvb16x/lxvb16x     # load 16B of each string
> -        vcmpnezb.           # produces difference location or zero byte location
> -        bne 6,.Lmismatch
> -
> -        Use the overlapping compare trick for the last block if it is
> -        less than 16 bytes.
> -      */
> -
> -      load_mode = V16QImode;
> -      load_mode_size = GET_MODE_SIZE (load_mode);
> -
> -      if (bytes_to_compare >= load_mode_size)
> -       cmp_bytes = load_mode_size;
> -      else
> -       {
> -         /* Move this load back so it doesn't go past the end.  P8/P9
> -            can do this efficiently.  This is never called with less
> -            than 16 bytes so we should always be able to do this.  */
> -         unsigned int extra_bytes = load_mode_size - bytes_to_compare;
> -         cmp_bytes = bytes_to_compare;
> -         gcc_assert (offset > extra_bytes);
> -         offset -= extra_bytes;
> -         cmp_bytes = load_mode_size;
> -         bytes_to_compare = cmp_bytes;
> -       }
> -
> -      /* The offset currently used is always kept in off_reg so that the
> -        cleanup code on P8 can use it to extract the differing byte.  */
> -      emit_move_insn (off_reg, GEN_INT (offset));
> -
> -      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
> -      do_load_for_compare_from_addr (load_mode, s1data, addr1, orig_src1);
> -      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
> -      do_load_for_compare_from_addr (load_mode, s2data, addr2, orig_src2);
> -
> -      /* Cases to handle.  A and B are chunks of the two strings.
> -        1: Not end of comparison:
> -        A != B: branch to cleanup code to compute result.
> -        A == B: next block
> -        2: End of the inline comparison:
> -        A != B: branch to cleanup code to compute result.
> -        A == B: call strcmp/strncmp
> -        3: compared requested N bytes:
> -        A == B: branch to result 0.
> -        A != B: cleanup code to compute result.  */
> -
> -      unsigned HOST_WIDE_INT remain = bytes_to_compare - cmp_bytes;
> -
> -      if (TARGET_P9_VECTOR)
> -       emit_insn (gen_vcmpnezb_p (vec_result, s1data, s2data));
> -      else
> -       {
> -         /* Emit instructions to do comparison and zero check.  */
> -         rtx cmp_res = gen_reg_rtx (load_mode);
> -         rtx cmp_zero = gen_reg_rtx (load_mode);
> -         rtx cmp_combined = gen_reg_rtx (load_mode);
> -         emit_insn (gen_altivec_eqv16qi (cmp_res, s1data, s2data));
> -         emit_insn (gen_altivec_eqv16qi (cmp_zero, s1data, zero_reg));
> -         emit_insn (gen_orcv16qi3 (vec_result, cmp_zero, cmp_res));
> -         emit_insn (gen_altivec_vcmpequb_p (cmp_combined, vec_result, zero_reg));
> -       }
> -
> -      bool branch_to_cleanup = (remain > 0 || equality_compare_rest);
> -      rtx cr6 = gen_rtx_REG (CCmode, CR6_REGNO);
> -      rtx dst_label;
> -      rtx cmp_rtx;
> -      if (branch_to_cleanup)
> -       {
> -         /* Branch to cleanup code, otherwise fall through to do more
> -            compares.  P8 and P9 use different CR bits because on P8
> -            we are looking at the result of a comparsion vs a
> -            register of zeroes so the all-true condition means no
> -            difference or zero was found.  On P9, vcmpnezb sets a byte
> -            to 0xff if there is a mismatch or zero, so the all-false
> -            condition indicates we found no difference or zero.  */
> -         if (!cleanup_label)
> -           cleanup_label = gen_label_rtx ();
> -         dst_label = cleanup_label;
> -         if (TARGET_P9_VECTOR)
> -           cmp_rtx = gen_rtx_NE (VOIDmode, cr6, const0_rtx);
> -         else
> -           cmp_rtx = gen_rtx_GE (VOIDmode, cr6, const0_rtx);
> -       }
> -      else
> -       {
> -         /* Branch to final return or fall through to cleanup,
> -            result is already set to 0.  */
> -         dst_label = final_move_label;
> -         if (TARGET_P9_VECTOR)
> -           cmp_rtx = gen_rtx_EQ (VOIDmode, cr6, const0_rtx);
> -         else
> -           cmp_rtx = gen_rtx_LT (VOIDmode, cr6, const0_rtx);
> -       }
> -
> -      rtx lab_ref = gen_rtx_LABEL_REF (VOIDmode, dst_label);
> -      rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, cmp_rtx,
> -                                        lab_ref, pc_rtx);
> -      rtx j2 = emit_jump_insn (gen_rtx_SET (pc_rtx, ifelse));
> -      JUMP_LABEL (j2) = dst_label;
> -      LABEL_NUSES (dst_label) += 1;
> -
> -      offset += cmp_bytes;
> -      bytes_to_compare -= cmp_bytes;
> -    }
> -  *p_cleanup_label = cleanup_label;
> -  return;
> -}
> -
>  /* Generate the final sequence that identifies the differing
>     byte and generates the final result, taking into account
>     zero bytes:
> @@ -2190,97 +2400,6 @@
>    return;
>  }
>
> -/* Generate the final sequence that identifies the differing
> -   byte and generates the final result, taking into account
> -   zero bytes:
> -
> -   P8:
> -        vgbbd 0,0
> -        vsldoi 0,0,0,9
> -        mfvsrd 9,32
> -        addi 10,9,-1    # count trailing zero bits
> -        andc 9,10,9
> -        popcntd 9,9
> -        lbzx 10,28,9    # use that offset to load differing byte
> -        lbzx 3,29,9
> -        subf 3,3,10     # subtract for final result
> -
> -   P9:
> -        vclzlsbb            # counts trailing bytes with lsb=0
> -        vextublx            # extract differing byte
> -
> -   STR1 is the reg rtx for data from string 1.
> -   STR2 is the reg rtx for data from string 2.
> -   RESULT is the reg rtx for the comparison result.
> -   S1ADDR is the register to use for the base address of the first string.
> -   S2ADDR is the register to use for the base address of the second string.
> -   ORIG_SRC1 is the unmodified rtx for the first string.
> -   ORIG_SRC2 is the unmodified rtx for the second string.
> -   OFF_REG is the register to use for the string offset for loads.
> -   VEC_RESULT is the rtx for the vector result indicating the byte difference.
> -  */
> -
> -static void
> -emit_final_str_compare_vec (rtx str1, rtx str2, rtx result,
> -                           rtx s1addr, rtx s2addr,
> -                           rtx orig_src1, rtx orig_src2,
> -                           rtx off_reg, rtx vec_result)
> -{
> -  if (TARGET_P9_VECTOR)
> -    {
> -      rtx diffix = gen_reg_rtx (SImode);
> -      rtx chr1 = gen_reg_rtx (SImode);
> -      rtx chr2 = gen_reg_rtx (SImode);
> -      rtx chr1_di = simplify_gen_subreg (DImode, chr1, SImode, 0);
> -      rtx chr2_di = simplify_gen_subreg (DImode, chr2, SImode, 0);
> -      emit_insn (gen_vclzlsbb_v16qi (diffix, vec_result));
> -      emit_insn (gen_vextublx (chr1, diffix, str1));
> -      emit_insn (gen_vextublx (chr2, diffix, str2));
> -      do_sub3 (result, chr1_di, chr2_di);
> -    }
> -  else
> -    {
> -      gcc_assert (TARGET_P8_VECTOR);
> -      rtx diffix = gen_reg_rtx (DImode);
> -      rtx result_gbbd = gen_reg_rtx (V16QImode);
> -      /* Since each byte of the input is either 00 or FF, the bytes in
> -        dw0 and dw1 after vgbbd are all identical to each other.  */
> -      emit_insn (gen_p8v_vgbbd (result_gbbd, vec_result));
> -      /* For LE, we shift by 9 and get BA in the low two bytes then CTZ.
> -        For BE, we shift by 7 and get AB in the high two bytes then CLZ.  */
> -      rtx result_shifted = gen_reg_rtx (V16QImode);
> -      int shift_amt = (BYTES_BIG_ENDIAN) ? 7 : 9;
> -      emit_insn (gen_altivec_vsldoi_v16qi (result_shifted,result_gbbd,result_gbbd, GEN_INT (shift_amt)));
> -
> -      rtx diffix_df = simplify_gen_subreg (DFmode, diffix, DImode, 0);
> -      emit_insn (gen_p8_mfvsrd_3_v16qi (diffix_df, result_shifted));
> -      rtx count = gen_reg_rtx (DImode);
> -
> -      if (BYTES_BIG_ENDIAN)
> -       emit_insn (gen_clzdi2 (count, diffix));
> -      else
> -       emit_insn (gen_ctzdi2 (count, diffix));
> -
> -      /* P8 doesn't have a good solution for extracting one byte from
> -        a vsx reg like vextublx on P9 so we just compute the offset
> -        of the differing byte and load it from each string.  */
> -      do_add3 (off_reg, off_reg, count);
> -
> -      rtx chr1 = gen_reg_rtx (QImode);
> -      rtx chr2 = gen_reg_rtx (QImode);
> -      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
> -      do_load_for_compare_from_addr (QImode, chr1, addr1, orig_src1);
> -      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
> -      do_load_for_compare_from_addr (QImode, chr2, addr2, orig_src2);
> -      machine_mode rmode = GET_MODE (result);
> -      rtx chr1_rm = simplify_gen_subreg (rmode, chr1, QImode, 0);
> -      rtx chr2_rm = simplify_gen_subreg (rmode, chr2, QImode, 0);
> -      do_sub3 (result, chr1_rm, chr2_rm);
> -    }
> -
> -  return;
> -}
> -
>  /* Expand a string compare operation with length, and return
>     true if successful.  Return false if we should let the
>     compiler generate normal code, probably a strncmp call.
> @@ -2490,13 +2609,13 @@
>        off_reg = gen_reg_rtx (Pmode);
>        vec_result = gen_reg_rtx (load_mode);
>        emit_move_insn (result_reg, GEN_INT (0));
> -      expand_strncmp_vec_sequence (compare_length,
> -                                  orig_src1, orig_src2,
> -                                  s1addr, s2addr, off_reg,
> -                                  tmp_reg_src1, tmp_reg_src2,
> -                                  vec_result,
> -                                  equality_compare_rest,
> -                                  &cleanup_label, final_move_label);
> +      expand_cmp_vec_sequence (compare_length,
> +                              orig_src1, orig_src2,
> +                              s1addr, s2addr, off_reg,
> +                              tmp_reg_src1, tmp_reg_src2,
> +                              vec_result,
> +                              equality_compare_rest,
> +                              &cleanup_label, final_move_label, true);
>      }
>    else
>      expand_strncmp_gpr_sequence (compare_length, base_align,
> @@ -2545,9 +2664,9 @@
>      emit_label (cleanup_label);
>
>    if (use_vec)
> -    emit_final_str_compare_vec (tmp_reg_src1, tmp_reg_src2, result_reg,
> -                               s1addr, s2addr, orig_src1, orig_src2,
> -                               off_reg, vec_result);
> +    emit_final_compare_vec (tmp_reg_src1, tmp_reg_src2, result_reg,
> +                           s1addr, s2addr, orig_src1, orig_src2,
> +                           off_reg, vec_result);
>    else
>      emit_final_str_compare_gpr (tmp_reg_src1, tmp_reg_src2, result_reg);
>
> Index: gcc/config/rs6000/rs6000.opt
> ===================================================================
> --- gcc/config/rs6000/rs6000.opt        (revision 266034)
> +++ gcc/config/rs6000/rs6000.opt        (working copy)
> @@ -326,7 +326,7 @@
>  Max number of bytes to move inline.
>
>  mblock-compare-inline-limit=
> -Target Report Var(rs6000_block_compare_inline_limit) Init(31) RejectNegative Joined UInteger Save
> +Target Report Var(rs6000_block_compare_inline_limit) Init(63) RejectNegative Joined UInteger Save
>  Max number of bytes to compare without loops.
>
>  mblock-compare-inline-loop-limit=
>
>
> --
> Aaron Sawdey, Ph.D.  acsawdey@linux.vnet.ibm.com
> 050-2/C113  (507) 253-7520 home: 507/263-0782
> IBM Linux Technology Center - PPC Toolchain
>
Aaron Sawdey Nov. 15, 2018, 8:53 p.m. UTC | #2
On 11/15/18 4:02 AM, Richard Biener wrote:
> On Wed, Nov 14, 2018 at 5:43 PM Aaron Sawdey <acsawdey@linux.ibm.com> wrote:
>>
>> This patch generalizes some of the functions added earlier to do vsx expansion of strncmp
>> so that they can also generate the code needed for memcmp. I reorganized
>> expand_block_compare() a little to be able to make use of this there. The vsx code is more
>> compact so I've changed the default block compare inline limit to 63 bytes. The vsx
>> code is only used if there is at least 16 bytes to compare as this means we don't have to
>> do complex code to compare less than one chunk. If vsx is not available the limit is cut
>> in half. The performance is good, vsx memcmp is considerably faster than the gpr inline code
>> if the strings are equal and is comparable if the strings have a 10% chance of being
>> equal (spread across the string).
> 
> How is performance affected if there are close earlier char-size
> stores to one of the string/memory?
> Can power still do store forwarding in this case?

Store forwarding between scalar and vector is not great, but it's
better than having to make a plt call to memcmp(), which may well use
vsx anyway. I had set the crossover between scalar and vsx at 16 bytes
because the vsx code is more compact, and the performance is similar
for 16-32 byte sizes. But you could make an argument for switching at
33 bytes. That way builtin memcmp of 33-64 bytes would use inline vsx
code instead of a memcmp() call. At 33 bytes the vsx inline code is 3x
faster than a memcmp() call, so it would likely remain faster even if
there was an ugly vector-load-hit-scalar-store. Also, compares of small
structures of 32 bytes or less would use scalar code, the same as in
gcc 8, and would avoid this issue.
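
For reference, the size dispatch being discussed can be modeled in plain
C. This is only a sketch of the checks in the patched
expand_block_compare(), not the rs6000 code itself: choose_strategy, the
enum, and the vsx_min parameter are hypothetical names, have_vsx stands
in for the !TARGET_32BIT && TARGET_EFFICIENT_UNALIGNED_VSX &&
TARGET_P8_VECTOR test, and the zero-length, non-constant-length and P7
alignment bail-outs are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the strategy choice; not part of the patch.  */
enum strategy { INLINE_GPR, INLINE_VSX, LOOP_OR_CALL };

/* BYTES is the constant compare length (nonzero; the patch returns
   early for zero).
   HAVE_VSX models the target checks that gate the vsx path.
   INLINE_LIMIT models -mblock-compare-inline-limit (63 after the patch).
   VSX_MIN is the scalar/vsx crossover: 16 in the patch.  Making it a
   parameter is an assumption, used here to try other crossovers.  */
static enum strategy
choose_strategy (unsigned long bytes, bool have_vsx,
                 unsigned long inline_limit, unsigned long vsx_min)
{
  assert (bytes > 0);

  bool use_vec = have_vsx && bytes >= vsx_min;

  /* Without vsx the inline limit is roughly halved, as in the patch.  */
  unsigned long max_bytes = inline_limit;
  if (!use_vec && max_bytes > 1)
    max_bytes = ((max_bytes + 1) / 2) - 1;

  /* Too long to expand inline: fall back to loop code or a call.  */
  if (bytes > max_bytes)
    return LOOP_OR_CALL;

  return use_vec ? INLINE_VSX : INLINE_GPR;
}
```

With the patch defaults (inline_limit 63, vsx_min 16) this gives
INLINE_GPR for 8 bytes, INLINE_VSX for 16 through 63 bytes, and
LOOP_OR_CALL for 64 bytes; with have_vsx false the inline limit drops to
31 bytes. Passing 33 for vsx_min is one way to model the alternative
crossover, assuming it would be implemented as a change to the
bytes >= 16 test.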

  Aaron

> 
>> Currently regtesting, ok for trunk if tests pass?
>>
>> Thanks!
>>    Aaron
>>
>> 2018-11-14  Aaron Sawdey  <acsawdey@linux.ibm.com>
>>
>>         * config/rs6000/rs6000-string.c (emit_vsx_zero_reg): New function.
>>         (expand_cmp_vec_sequence): Rename and modify
>>         expand_strncmp_vec_sequence.
>>         (emit_final_compare_vec): Rename and modify emit_final_str_compare_vec.
>>         (generate_6432_conversion): New function.
>>         (expand_block_compare): Add support for vsx.
>>         (expand_block_compare_gpr): New function.
>>         * config/rs6000/rs6000.opt (rs6000_block_compare_inline_limit): Increase
>>         default limit to 63 because of more compact vsx code.
>>
>>
>>
>>
>> Index: gcc/config/rs6000/rs6000-string.c
>> ===================================================================
>> --- gcc/config/rs6000/rs6000-string.c   (revision 266034)
>> +++ gcc/config/rs6000/rs6000-string.c   (working copy)
>> @@ -615,6 +615,283 @@
>>      }
>>  }
>>
>> +static rtx
>> +emit_vsx_zero_reg()
>> +{
>> +  unsigned int i;
>> +  rtx zr[16];
>> +  for (i = 0; i < 16; i++)
>> +    zr[i] = GEN_INT (0);
>> +  rtvec zv = gen_rtvec_v (16, zr);
>> +  rtx zero_reg = gen_reg_rtx (V16QImode);
>> +  rs6000_expand_vector_init (zero_reg, gen_rtx_PARALLEL (V16QImode, zv));
>> +  return zero_reg;
>> +}
>> +
>> +/* Generate the sequence of compares for strcmp/strncmp using vec/vsx
>> +   instructions.
>> +
>> +   BYTES_TO_COMPARE is the number of bytes to be compared.
>> +   ORIG_SRC1 is the unmodified rtx for the first string.
>> +   ORIG_SRC2 is the unmodified rtx for the second string.
>> +   S1ADDR is the register to use for the base address of the first string.
>> +   S2ADDR is the register to use for the base address of the second string.
>> +   OFF_REG is the register to use for the string offset for loads.
>> +   S1DATA is the register for loading the first string.
>> +   S2DATA is the register for loading the second string.
>> +   VEC_RESULT is the rtx for the vector result indicating the byte difference.
>> +   EQUALITY_COMPARE_REST is a flag to indicate we need to make a cleanup call
>> +   to strcmp/strncmp if we have equality at the end of the inline comparison.
>> +   P_CLEANUP_LABEL is a pointer to rtx for a label we generate if we need code
>> +   to clean up and generate the final comparison result.
>> +   FINAL_MOVE_LABEL is rtx for a label we can branch to when we can just
>> +   set the final result.
>> +   CHECKZERO indicates whether the sequence should check for zero bytes
>> +   for use doing strncmp, or not (for use doing memcmp).  */
>> +static void
>> +expand_cmp_vec_sequence (unsigned HOST_WIDE_INT bytes_to_compare,
>> +                        rtx orig_src1, rtx orig_src2,
>> +                        rtx s1addr, rtx s2addr, rtx off_reg,
>> +                        rtx s1data, rtx s2data, rtx vec_result,
>> +                        bool equality_compare_rest, rtx *p_cleanup_label,
>> +                        rtx final_move_label, bool checkzero)
>> +{
>> +  machine_mode load_mode;
>> +  unsigned int load_mode_size;
>> +  unsigned HOST_WIDE_INT cmp_bytes = 0;
>> +  unsigned HOST_WIDE_INT offset = 0;
>> +  rtx zero_reg = NULL;
>> +
>> +  gcc_assert (p_cleanup_label != NULL);
>> +  rtx cleanup_label = *p_cleanup_label;
>> +
>> +  emit_move_insn (s1addr, force_reg (Pmode, XEXP (orig_src1, 0)));
>> +  emit_move_insn (s2addr, force_reg (Pmode, XEXP (orig_src2, 0)));
>> +
>> +  if (checkzero && !TARGET_P9_VECTOR)
>> +    zero_reg = emit_vsx_zero_reg();
>> +
>> +  while (bytes_to_compare > 0)
>> +    {
>> +      /* VEC/VSX compare sequence for P8:
>> +        check each 16B with:
>> +        lxvd2x 32,28,8
>> +        lxvd2x 33,29,8
>> +        vcmpequb 2,0,1  # compare strings
>> +        vcmpequb 4,0,3  # compare w/ 0
>> +        xxlorc 37,36,34       # first FF byte is either mismatch or end of string
>> +        vcmpequb. 7,5,3  # reg 7 contains 0
>> +        bnl 6,.Lmismatch
>> +
>> +        For the P8 LE case, we use lxvd2x and compare full 16 bytes
>> +        but then use vgbbd and a shift to get two bytes with the
>> +        information we need in the correct order.
>> +
>> +        VEC/VSX compare sequence if TARGET_P9_VECTOR:
>> +        lxvb16x/lxvb16x     # load 16B of each string
>> +        vcmpnezb.           # produces difference location or zero byte location
>> +        bne 6,.Lmismatch
>> +
>> +        Use the overlapping compare trick for the last block if it is
>> +        less than 16 bytes.
>> +      */
>> +
>> +      load_mode = V16QImode;
>> +      load_mode_size = GET_MODE_SIZE (load_mode);
>> +
>> +      if (bytes_to_compare >= load_mode_size)
>> +       cmp_bytes = load_mode_size;
>> +      else
>> +       {
>> +         /* Move this load back so it doesn't go past the end.  P8/P9
>> +            can do this efficiently.  This is never called with less
>> +            than 16 bytes so we should always be able to do this.  */
>> +         unsigned int extra_bytes = load_mode_size - bytes_to_compare;
>> +         cmp_bytes = bytes_to_compare;
>> +         gcc_assert (offset > extra_bytes);
>> +         offset -= extra_bytes;
>> +         cmp_bytes = load_mode_size;
>> +         bytes_to_compare = cmp_bytes;
>> +       }
>> +
>> +      /* The offset currently used is always kept in off_reg so that the
>> +        cleanup code on P8 can use it to extract the differing byte.  */
>> +      emit_move_insn (off_reg, GEN_INT (offset));
>> +
>> +      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
>> +      do_load_for_compare_from_addr (load_mode, s1data, addr1, orig_src1);
>> +      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
>> +      do_load_for_compare_from_addr (load_mode, s2data, addr2, orig_src2);
>> +
>> +      /* Cases to handle.  A and B are chunks of the two strings.
>> +        1: Not end of comparison:
>> +        A != B: branch to cleanup code to compute result.
>> +        A == B: next block
>> +        2: End of the inline comparison:
>> +        A != B: branch to cleanup code to compute result.
>> +        A == B: call strcmp/strncmp
>> +        3: compared requested N bytes:
>> +        A == B: branch to result 0.
>> +        A != B: cleanup code to compute result.  */
>> +
>> +      unsigned HOST_WIDE_INT remain = bytes_to_compare - cmp_bytes;
>> +
>> +      if (checkzero)
>> +       {
>> +         if (TARGET_P9_VECTOR)
>> +           emit_insn (gen_vcmpnezb_p (vec_result, s1data, s2data));
>> +         else
>> +           {
>> +             /* Emit instructions to do comparison and zero check.  */
>> +             rtx cmp_res = gen_reg_rtx (load_mode);
>> +             rtx cmp_zero = gen_reg_rtx (load_mode);
>> +             rtx cmp_combined = gen_reg_rtx (load_mode);
>> +             emit_insn (gen_altivec_eqv16qi (cmp_res, s1data, s2data));
>> +             emit_insn (gen_altivec_eqv16qi (cmp_zero, s1data, zero_reg));
>> +             emit_insn (gen_orcv16qi3 (vec_result, cmp_zero, cmp_res));
>> +             emit_insn (gen_altivec_vcmpequb_p (cmp_combined, vec_result, zero_reg));
>> +           }
>> +       }
>> +      else
>> +       emit_insn (gen_altivec_vcmpequb_p (vec_result, s1data, s2data));
>> +
>> +      bool branch_to_cleanup = (remain > 0 || equality_compare_rest);
>> +      rtx cr6 = gen_rtx_REG (CCmode, CR6_REGNO);
>> +      rtx dst_label;
>> +      rtx cmp_rtx;
>> +      if (branch_to_cleanup)
>> +       {
>> +         /* Branch to cleanup code, otherwise fall through to do more
>> +            compares.  P8 and P9 use different CR bits because on P8
>> +            we are looking at the result of a comparison vs a
>> +            register of zeroes so the all-true condition means no
>> +            difference or zero was found.  On P9, vcmpnezb sets a byte
>> +            to 0xff if there is a mismatch or zero, so the all-false
>> +            condition indicates we found no difference or zero.  */
>> +         if (!cleanup_label)
>> +           cleanup_label = gen_label_rtx ();
>> +         dst_label = cleanup_label;
>> +         if (TARGET_P9_VECTOR && checkzero)
>> +           cmp_rtx = gen_rtx_NE (VOIDmode, cr6, const0_rtx);
>> +         else
>> +           cmp_rtx = gen_rtx_GE (VOIDmode, cr6, const0_rtx);
>> +       }
>> +      else
>> +       {
>> +         /* Branch to final return or fall through to cleanup,
>> +            result is already set to 0.  */
>> +         dst_label = final_move_label;
>> +         if (TARGET_P9_VECTOR && checkzero)
>> +           cmp_rtx = gen_rtx_EQ (VOIDmode, cr6, const0_rtx);
>> +         else
>> +           cmp_rtx = gen_rtx_LT (VOIDmode, cr6, const0_rtx);
>> +       }
>> +
>> +      rtx lab_ref = gen_rtx_LABEL_REF (VOIDmode, dst_label);
>> +      rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, cmp_rtx,
>> +                                        lab_ref, pc_rtx);
>> +      rtx j2 = emit_jump_insn (gen_rtx_SET (pc_rtx, ifelse));
>> +      JUMP_LABEL (j2) = dst_label;
>> +      LABEL_NUSES (dst_label) += 1;
>> +
>> +      offset += cmp_bytes;
>> +      bytes_to_compare -= cmp_bytes;
>> +    }
>> +  *p_cleanup_label = cleanup_label;
>> +  return;
>> +}
>> +
>> +/* Generate the final sequence that identifies the differing
>> +   byte and generates the final result, taking into account
>> +   zero bytes:
>> +
>> +   P8:
>> +        vgbbd 0,0
>> +        vsldoi 0,0,0,9
>> +        mfvsrd 9,32
>> +        addi 10,9,-1    # count trailing zero bits
>> +        andc 9,10,9
>> +        popcntd 9,9
>> +        lbzx 10,28,9    # use that offset to load differing byte
>> +        lbzx 3,29,9
>> +        subf 3,3,10     # subtract for final result
>> +
>> +   P9:
>> +        vclzlsbb            # counts trailing bytes with lsb=0
>> +        vextublx            # extract differing byte
>> +
>> +   STR1 is the reg rtx for data from string 1.
>> +   STR2 is the reg rtx for data from string 2.
>> +   RESULT is the reg rtx for the comparison result.
>> +   S1ADDR is the register to use for the base address of the first string.
>> +   S2ADDR is the register to use for the base address of the second string.
>> +   ORIG_SRC1 is the unmodified rtx for the first string.
>> +   ORIG_SRC2 is the unmodified rtx for the second string.
>> +   OFF_REG is the register to use for the string offset for loads.
>> +   VEC_RESULT is the rtx for the vector result indicating the byte difference.  */
>> +
>> +static void
>> +emit_final_compare_vec (rtx str1, rtx str2, rtx result,
>> +                       rtx s1addr, rtx s2addr,
>> +                       rtx orig_src1, rtx orig_src2,
>> +                       rtx off_reg, rtx vec_result)
>> +{
>> +
>> +  if (TARGET_P9_VECTOR)
>> +    {
>> +      rtx diffix = gen_reg_rtx (SImode);
>> +      rtx chr1 = gen_reg_rtx (SImode);
>> +      rtx chr2 = gen_reg_rtx (SImode);
>> +      rtx chr1_di = simplify_gen_subreg (DImode, chr1, SImode, 0);
>> +      rtx chr2_di = simplify_gen_subreg (DImode, chr2, SImode, 0);
>> +      emit_insn (gen_vclzlsbb_v16qi (diffix, vec_result));
>> +      emit_insn (gen_vextublx (chr1, diffix, str1));
>> +      emit_insn (gen_vextublx (chr2, diffix, str2));
>> +      do_sub3 (result, chr1_di, chr2_di);
>> +    }
>> +  else
>> +    {
>> +      gcc_assert (TARGET_P8_VECTOR);
>> +      rtx diffix = gen_reg_rtx (DImode);
>> +      rtx result_gbbd = gen_reg_rtx (V16QImode);
>> +      /* Since each byte of the input is either 00 or FF, the bytes in
>> +        dw0 and dw1 after vgbbd are all identical to each other.  */
>> +      emit_insn (gen_p8v_vgbbd (result_gbbd, vec_result));
>> +      /* For LE, we shift by 9 and get BA in the low two bytes then CTZ.
>> +        For BE, we shift by 7 and get AB in the high two bytes then CLZ.  */
>> +      rtx result_shifted = gen_reg_rtx (V16QImode);
>> +      int shift_amt = (BYTES_BIG_ENDIAN) ? 7 : 9;
>> +      emit_insn (gen_altivec_vsldoi_v16qi (result_shifted, result_gbbd,
>> +                                          result_gbbd, GEN_INT (shift_amt)));
>> +
>> +      rtx diffix_df = simplify_gen_subreg (DFmode, diffix, DImode, 0);
>> +      emit_insn (gen_p8_mfvsrd_3_v16qi (diffix_df, result_shifted));
>> +      rtx count = gen_reg_rtx (DImode);
>> +
>> +      if (BYTES_BIG_ENDIAN)
>> +       emit_insn (gen_clzdi2 (count, diffix));
>> +      else
>> +       emit_insn (gen_ctzdi2 (count, diffix));
>> +
>> +      /* P8 doesn't have a good solution for extracting one byte from
>> +        a vsx reg like vextublx on P9 so we just compute the offset
>> +        of the differing byte and load it from each string.  */
>> +      do_add3 (off_reg, off_reg, count);
>> +
>> +      rtx chr1 = gen_reg_rtx (QImode);
>> +      rtx chr2 = gen_reg_rtx (QImode);
>> +      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
>> +      do_load_for_compare_from_addr (QImode, chr1, addr1, orig_src1);
>> +      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
>> +      do_load_for_compare_from_addr (QImode, chr2, addr2, orig_src2);
>> +      machine_mode rmode = GET_MODE (result);
>> +      rtx chr1_rm = simplify_gen_subreg (rmode, chr1, QImode, 0);
>> +      rtx chr2_rm = simplify_gen_subreg (rmode, chr2, QImode, 0);
>> +      do_sub3 (result, chr1_rm, chr2_rm);
>> +    }
>> +
>> +  return;
>> +}
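
For readers following the P8 path above, here is a rough scalar model of
what the vgbbd/vsldoi/ctz sequence computes for one 16-byte chunk. It is
only a sketch: model_final_compare is a hypothetical name, it works on
plain byte arrays rather than vector registers, and it handles the
"no difference" case only so that the model is total.

```c
#include <stdint.h>

/* Hypothetical scalar stand-in for the vector sequence: returns the
   memcmp-style result for one 16-byte chunk.  */
static int
model_final_compare (const unsigned char *s1, const unsigned char *s2)
{
  /* Bit i of mask is set when byte i differs.  On P8 the patch gathers
     the equivalent bits with vgbbd plus a shift.  */
  uint16_t mask = 0;
  for (int i = 0; i < 16; i++)
    if (s1[i] != s2[i])
      mask |= (uint16_t) (1u << i);

  if (mask == 0)
    return 0;

  /* Count trailing zero bits to find the index of the first
     differing byte.  */
  int idx = 0;
  while (!(mask & 1u))
    {
      mask >>= 1;
      idx++;
    }

  /* Subtract the two differing bytes for the final result.  */
  return (int) s1[idx] - (int) s2[idx];
}
```

The real code adds the computed index to OFF_REG and reloads the byte
from each string, because P8 has no cheap single-byte extract from a
vsx register.
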
>> +
>>  /* Expand a block compare operation using loop code, and return true
>>     if successful.  Return false if we should let the compiler generate
>>     normal code, probably a memcmp call.
>> @@ -1343,106 +1620,80 @@
>>    return true;
>>  }
>>
>> -/* Expand a block compare operation, and return true if successful.
>> -   Return false if we should let the compiler generate normal code,
>> -   probably a memcmp call.
>> +/* Generate code to convert a DImode-plus-carry subtract result into
>> +   a SImode result that has the same <0 / ==0 / >0 properties to
>> +   produce the final result from memcmp.
>>
>> -   OPERANDS[0] is the target (result).
>> -   OPERANDS[1] is the first source.
>> -   OPERANDS[2] is the second source.
>> -   OPERANDS[3] is the length.
>> -   OPERANDS[4] is the alignment.  */
>> -bool
>> -expand_block_compare (rtx operands[])
>> +   TARGET is the rtx for the register to receive the memcmp result.
>> +   SUB_RESULT is the rtx for the register containing the subtract result.  */
>> +
>> +void
>> +generate_6432_conversion (rtx target, rtx sub_result)
>>  {
>> -  rtx target = operands[0];
>> -  rtx orig_src1 = operands[1];
>> -  rtx orig_src2 = operands[2];
>> -  rtx bytes_rtx = operands[3];
>> -  rtx align_rtx = operands[4];
>> -  HOST_WIDE_INT cmp_bytes = 0;
>> -  rtx src1 = orig_src1;
>> -  rtx src2 = orig_src2;
>> +  /* We need to produce DI result from sub, then convert to target SI
>> +     while maintaining <0 / ==0 / >0 properties.  This sequence works:
>> +     subfc L,A,B
>> +     subfe H,H,H
>> +     popcntd L,L
>> +     rldimi L,H,6,0
>>
>> -  /* This case is complicated to handle because the subtract
>> -     with carry instructions do not generate the 64-bit
>> -     carry and so we must emit code to calculate it ourselves.
>> -     We choose not to implement this yet.  */
>> -  if (TARGET_32BIT && TARGET_POWERPC64)
>> -    return false;
>> +     This is an alternate one Segher cooked up if somebody
>> +     wants to expand this for something that doesn't have popcntd:
>> +     subfc L,a,b
>> +     subfe H,x,x
>> +     addic t,L,-1
>> +     subfe v,t,L
>> +     or z,v,H
>>
>> -  bool isP7 = (rs6000_tune == PROCESSOR_POWER7);
>> +     And finally, p9 can just do this:
>> +     cmpld A,B
>> +     setb r */
>>
>> -  /* Allow this param to shut off all expansion.  */
>> -  if (rs6000_block_compare_inline_limit == 0)
>> -    return false;
>> -
>> -  /* targetm.slow_unaligned_access -- don't do unaligned stuff.
>> -     However slow_unaligned_access returns true on P7 even though the
>> -     performance of this code is good there.  */
>> -  if (!isP7
>> -      && (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1))
>> -         || targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src2))))
>> -    return false;
>> -
>> -  /* Unaligned l*brx traps on P7 so don't do this.  However this should
>> -     not affect much because LE isn't really supported on P7 anyway.  */
>> -  if (isP7 && !BYTES_BIG_ENDIAN)
>> -    return false;
>> -
>> -  /* If this is not a fixed size compare, try generating loop code and
>> -     if that fails just call memcmp.  */
>> -  if (!CONST_INT_P (bytes_rtx))
>> -    return expand_compare_loop (operands);
>> -
>> -  /* This must be a fixed size alignment.  */
>> -  if (!CONST_INT_P (align_rtx))
>> -    return false;
>> -
>> -  unsigned int base_align = UINTVAL (align_rtx) / BITS_PER_UNIT;
>> -
>> -  gcc_assert (GET_MODE (target) == SImode);
>> -
>> -  /* Anything to move?  */
>> -  unsigned HOST_WIDE_INT bytes = UINTVAL (bytes_rtx);
>> -  if (bytes == 0)
>> -    return true;
>> -
>> -  rtx tmp_reg_src1 = gen_reg_rtx (word_mode);
>> -  rtx tmp_reg_src2 = gen_reg_rtx (word_mode);
>> -  /* P7/P8 code uses cond for subfc. but P9 uses
>> -     it for cmpld which needs CCUNSmode.  */
>> -  rtx cond;
>> -  if (TARGET_P9_MISC)
>> -    cond = gen_reg_rtx (CCUNSmode);
>> +  if (TARGET_64BIT)
>> +    {
>> +      rtx tmp_reg_ca = gen_reg_rtx (DImode);
>> +      emit_insn (gen_subfdi3_carry_in_xx (tmp_reg_ca));
>> +      rtx popcnt = gen_reg_rtx (DImode);
>> +      emit_insn (gen_popcntddi2 (popcnt, sub_result));
>> +      rtx tmp2 = gen_reg_rtx (DImode);
>> +      emit_insn (gen_iordi3 (tmp2, popcnt, tmp_reg_ca));
>> +      emit_insn (gen_movsi (target, gen_lowpart (SImode, tmp2)));
>> +    }
>>    else
>> -    cond = gen_reg_rtx (CCmode);
>> +    {
>> +      rtx tmp_reg_ca = gen_reg_rtx (SImode);
>> +      emit_insn (gen_subfsi3_carry_in_xx (tmp_reg_ca));
>> +      rtx popcnt = gen_reg_rtx (SImode);
>> +      emit_insn (gen_popcntdsi2 (popcnt, sub_result));
>> +      emit_insn (gen_iorsi3 (target, popcnt, tmp_reg_ca));
>> +    }
>> +}
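
A small C model of the popcntd-based conversion may help when reviewing
this function. It is a sketch under assumptions: the names are
hypothetical, A and B stand for the two 64-bit chunks already loaded in
memory byte order, and the borrow is modeled with a compare rather than
the carry bit.

```c
#include <stdint.h>

/* Portable population count, standing in for popcntd.  */
static int
model_popcount64 (uint64_t x)
{
  int n = 0;
  while (x)
    {
      x &= x - 1;   /* clear the lowest set bit */
      n++;
    }
  return n;
}

/* Hypothetical model of the 64->32 conversion: the result has the
   same <0 / ==0 / >0 property as an unsigned compare of A and B.  */
static int
model_6432_conversion (uint64_t a, uint64_t b)
{
  uint64_t l = a - b;                     /* subfc: low 64 bits of A - B */
  uint64_t h = (a < b) ? UINT64_MAX : 0;  /* subfe H,H,H: all ones on borrow */

  /* popcntd maps any nonzero difference to 1..64; OR-ing in H forces
     every bit on when A < B.  */
  uint64_t merged = (uint64_t) model_popcount64 (l) | h;

  /* Take the low 32 bits as the SImode result.  */
  uint32_t low = (uint32_t) merged;
  return low == UINT32_MAX ? -1 : (int) low;
}
```

The point of the popcount is that a nonzero 64-bit difference might have
all of its low 32 bits clear, so simply truncating the subtract result
could turn "different" into 0.
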
>>
>> -  /* Strategy phase.  How many ops will this take and should we expand it?  */
>> +/* Generate memcmp expansion using in-line non-loop GPR instructions.
>> +   The bool return indicates whether code for a 64->32 conversion
>> +   should be generated.
>>
>> -  unsigned HOST_WIDE_INT offset = 0;
>> -  machine_mode load_mode =
>> -    select_block_compare_mode (offset, bytes, base_align);
>> -  unsigned int load_mode_size = GET_MODE_SIZE (load_mode);
>> +   BYTES is the number of bytes to be compared.
>> +   BASE_ALIGN is the minimum alignment for both blocks to compare.
>> +   ORIG_SRC1 is the original pointer to the first block to compare.
>> +   ORIG_SRC2 is the original pointer to the second block to compare.
>> +   SUB_RESULT is the reg rtx for the result from the final subtract.
>> +   COND is rtx for a condition register that will be used for the final
>> +   compare on power9 or better.
>> +   FINAL_RESULT is the reg rtx for the final memcmp result.
>> +   P_CONVERT_LABEL is a pointer to rtx that will be used to store the
>> +   label generated for a branch to the 64->32 code, if such a branch
>> +   is needed.
>> +   P_FINAL_LABEL is a pointer to rtx that will be used to store the label
>> +   for the end of the memcmp if a branch there is needed.  */
>>
>> -  /* We don't want to generate too much code.  The loop code can take
>> -     over for lengths greater than 31 bytes.  */
>> -  unsigned HOST_WIDE_INT max_bytes = rs6000_block_compare_inline_limit;
>> -  if (!IN_RANGE (bytes, 1, max_bytes))
>> -    return expand_compare_loop (operands);
>> -
>> -  /* The code generated for p7 and older is not faster than glibc
>> -     memcmp if alignment is small and length is not short, so bail
>> -     out to avoid those conditions.  */
>> -  if (!TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
>> -      && ((base_align == 1 && bytes > 16)
>> -         || (base_align == 2 && bytes > 32)))
>> -    return false;
>> -
>> -  bool generate_6432_conversion = false;
>> -  rtx convert_label = NULL;
>> -  rtx final_label = NULL;
>> -
>> +bool
>> +expand_block_compare_gpr (unsigned HOST_WIDE_INT bytes,
>> +                         unsigned int base_align,
>> +                         rtx orig_src1, rtx orig_src2,
>> +                         rtx sub_result, rtx cond, rtx final_result,
>> +                         rtx *p_convert_label, rtx *p_final_label)
>> +{
>>    /* Example of generated code for 18 bytes aligned 1 byte.
>>       Compiled with -fno-reorder-blocks for clarity.
>>               ldbrx 10,31,8
>> @@ -1473,6 +1724,18 @@
>>       if the difference is found there, then a final block of HImode that skips
>>       the DI->SI conversion.  */
>>
>> +  unsigned HOST_WIDE_INT offset = 0;
>> +  unsigned int load_mode_size;
>> +  HOST_WIDE_INT cmp_bytes = 0;
>> +  rtx src1 = orig_src1;
>> +  rtx src2 = orig_src2;
>> +  rtx tmp_reg_src1 = gen_reg_rtx (word_mode);
>> +  rtx tmp_reg_src2 = gen_reg_rtx (word_mode);
>> +  bool need_6432_conv = false;
>> +  rtx convert_label = NULL;
>> +  rtx final_label = NULL;
>> +  machine_mode load_mode;
>> +
>>    while (bytes > 0)
>>      {
>>        unsigned int align = compute_current_alignment (base_align, offset);
>> @@ -1536,15 +1799,15 @@
>>         }
>>
>>        int remain = bytes - cmp_bytes;
>> -      if (GET_MODE_SIZE (GET_MODE (target)) > GET_MODE_SIZE (load_mode))
>> +      if (GET_MODE_SIZE (GET_MODE (final_result)) > GET_MODE_SIZE (load_mode))
>>         {
>> -         /* Target is larger than load size so we don't need to
>> +         /* Final_result is larger than load size so we don't need to
>>              reduce result size.  */
>>
>>           /* We previously did a block that need 64->32 conversion but
>>              the current block does not, so a label is needed to jump
>>              to the end.  */
>> -         if (generate_6432_conversion && !final_label)
>> +         if (need_6432_conv && !final_label)
>>             final_label = gen_label_rtx ();
>>
>>           if (remain > 0)
>> @@ -1557,7 +1820,7 @@
>>               rtx tmp = gen_rtx_MINUS (word_mode, tmp_reg_src1, tmp_reg_src2);
>>               rtx cr = gen_reg_rtx (CCmode);
>>               rs6000_emit_dot_insn (tmp_reg_src2, tmp, 2, cr);
>> -             emit_insn (gen_movsi (target,
>> +             emit_insn (gen_movsi (final_result,
>>                                     gen_lowpart (SImode, tmp_reg_src2)));
>>               rtx ne_rtx = gen_rtx_NE (VOIDmode, cr, const0_rtx);
>>               rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, ne_rtx,
>> @@ -1572,11 +1835,11 @@
>>                 {
>>                   emit_insn (gen_subdi3 (tmp_reg_src2, tmp_reg_src1,
>>                                          tmp_reg_src2));
>> -                 emit_insn (gen_movsi (target,
>> +                 emit_insn (gen_movsi (final_result,
>>                                         gen_lowpart (SImode, tmp_reg_src2)));
>>                 }
>>               else
>> -               emit_insn (gen_subsi3 (target, tmp_reg_src1, tmp_reg_src2));
>> +               emit_insn (gen_subsi3 (final_result, tmp_reg_src1, tmp_reg_src2));
>>
>>               if (final_label)
>>                 {
>> @@ -1591,9 +1854,9 @@
>>        else
>>         {
>>           /* Do we need a 64->32 conversion block? We need the 64->32
>> -            conversion even if target size == load_mode size because
>> +            conversion even if final_result size == load_mode size because
>>              the subtract generates one extra bit.  */
>> -         generate_6432_conversion = true;
>> +         need_6432_conv = true;
>>
>>           if (remain > 0)
>>             {
>> @@ -1604,20 +1867,27 @@
>>               rtx cvt_ref = gen_rtx_LABEL_REF (VOIDmode, convert_label);
>>               if (TARGET_P9_MISC)
>>                 {
>> -               /* Generate a compare, and convert with a setb later.  */
>> +               /* Generate a compare, and convert with a setb later.
>> +                  Use cond that is passed in because the caller needs
>> +                  to use it for the 64->32 conversion later.  */
>>                   rtx cmp = gen_rtx_COMPARE (CCUNSmode, tmp_reg_src1,
>>                                              tmp_reg_src2);
>>                   emit_insn (gen_rtx_SET (cond, cmp));
>>                 }
>>               else
>> -               /* Generate a subfc. and use the longer
>> -                  sequence for conversion.  */
>> -               if (TARGET_64BIT)
>> -                 emit_insn (gen_subfdi3_carry_dot2 (tmp_reg_src2, tmp_reg_src2,
>> -                                                    tmp_reg_src1, cond));
>> -               else
>> -                 emit_insn (gen_subfsi3_carry_dot2 (tmp_reg_src2, tmp_reg_src2,
>> -                                                    tmp_reg_src1, cond));
>> +               {
>> +                 /* Generate a subfc. and use the longer sequence for
>> +                    conversion.  Cond is not used outside this
>> +                    function in this case.  */
>> +                 cond = gen_reg_rtx (CCmode);
>> +                 if (TARGET_64BIT)
>> +                   emit_insn (gen_subfdi3_carry_dot2 (sub_result, tmp_reg_src2,
>> +                                                      tmp_reg_src1, cond));
>> +                 else
>> +                   emit_insn (gen_subfsi3_carry_dot2 (sub_result, tmp_reg_src2,
>> +                                                      tmp_reg_src1, cond));
>> +               }
>> +
>>               rtx ne_rtx = gen_rtx_NE (VOIDmode, cond, const0_rtx);
>>               rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, ne_rtx,
>>                                                  cvt_ref, pc_rtx);
>> @@ -1637,10 +1907,10 @@
>>                 }
>>               else
>>                 if (TARGET_64BIT)
>> -                 emit_insn (gen_subfdi3_carry (tmp_reg_src2, tmp_reg_src2,
>> +                 emit_insn (gen_subfdi3_carry (sub_result, tmp_reg_src2,
>>                                                 tmp_reg_src1));
>>                 else
>> -                 emit_insn (gen_subfsi3_carry (tmp_reg_src2, tmp_reg_src2,
>> +                 emit_insn (gen_subfsi3_carry (sub_result, tmp_reg_src2,
>>                                                 tmp_reg_src1));
>>             }
>>         }
>> @@ -1649,51 +1919,162 @@
>>        bytes -= cmp_bytes;
>>      }
>>
>> -  if (generate_6432_conversion)
>> +  if (convert_label)
>> +    *p_convert_label = convert_label;
>> +  if (final_label)
>> +    *p_final_label = final_label;
>> +  return need_6432_conv;
>> +}
>> +
>> +/* Expand a block compare operation, and return true if successful.
>> +   Return false if we should let the compiler generate normal code,
>> +   probably a memcmp call.
>> +
>> +   OPERANDS[0] is the target (result).
>> +   OPERANDS[1] is the first source.
>> +   OPERANDS[2] is the second source.
>> +   OPERANDS[3] is the length.
>> +   OPERANDS[4] is the alignment.  */
>> +bool
>> +expand_block_compare (rtx operands[])
>> +{
>> +  rtx target = operands[0];
>> +  rtx orig_src1 = operands[1];
>> +  rtx orig_src2 = operands[2];
>> +  rtx bytes_rtx = operands[3];
>> +  rtx align_rtx = operands[4];
>> +
>> +  /* This case is complicated to handle because the subtract
>> +     with carry instructions do not generate the 64-bit
>> +     carry and so we must emit code to calculate it ourselves.
>> +     We choose not to implement this yet.  */
>> +  if (TARGET_32BIT && TARGET_POWERPC64)
>> +    return false;
>> +
>> +  bool isP7 = (rs6000_tune == PROCESSOR_POWER7);
>> +
>> +  /* Allow this param to shut off all expansion.  */
>> +  if (rs6000_block_compare_inline_limit == 0)
>> +    return false;
>> +
>> +  /* targetm.slow_unaligned_access -- don't do unaligned stuff.
>> +     However slow_unaligned_access returns true on P7 even though the
>> +     performance of this code is good there.  */
>> +  if (!isP7
>> +      && (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1))
>> +         || targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src2))))
>> +    return false;
>> +
>> +  /* Unaligned l*brx traps on P7 so don't do this.  However this should
>> +     not affect much because LE isn't really supported on P7 anyway.  */
>> +  if (isP7 && !BYTES_BIG_ENDIAN)
>> +    return false;
>> +
>> +  /* If this is not a fixed size compare, try generating loop code and
>> +     if that fails just call memcmp.  */
>> +  if (!CONST_INT_P (bytes_rtx))
>> +    return expand_compare_loop (operands);
>> +
>> +  /* This must be a fixed size alignment.  */
>> +  if (!CONST_INT_P (align_rtx))
>> +    return false;
>> +
>> +  unsigned int base_align = UINTVAL (align_rtx) / BITS_PER_UNIT;
>> +
>> +  gcc_assert (GET_MODE (target) == SImode);
>> +
>> +  /* Anything to move?  */
>> +  unsigned HOST_WIDE_INT bytes = UINTVAL (bytes_rtx);
>> +  if (bytes == 0)
>> +    return true;
>> +
>> +  /* P7/P8 code uses cond for subfc. but P9 uses
>> +     it for cmpld which needs CCUNSmode.  */
>> +  rtx cond = NULL;
>> +  if (TARGET_P9_MISC)
>> +    cond = gen_reg_rtx (CCUNSmode);
>> +
>> +  /* Is it OK to use vec/vsx for this?  TARGET_VSX means we have at
>> +     least POWER7 but we use TARGET_EFFICIENT_UNALIGNED_VSX which is
>> +     at least POWER8.  That way we can rely on overlapping compares to
>> +     do the final comparison of less than 16 bytes.  Also I do not
>> +     want to deal with making this work for 32 bits.  In addition, we
>> +     have to make sure that we have at least P8_VECTOR (we don't allow
>> +     P9_VECTOR without P8_VECTOR).  */
>> +  int use_vec = (bytes >= 16 && !TARGET_32BIT
>> +                && TARGET_EFFICIENT_UNALIGNED_VSX && TARGET_P8_VECTOR);
>> +
>> +  /* We don't want to generate too much code.  The loop code can take
>> +     over for lengths greater than 31 bytes.  */
>> +  unsigned HOST_WIDE_INT max_bytes = rs6000_block_compare_inline_limit;
>> +
>> +  /* Don't generate too much code if vsx was disabled.  */
>> +  if (!use_vec && max_bytes > 1)
>> +    max_bytes = ((max_bytes + 1) / 2) - 1;
>> +
>> +  if (!IN_RANGE (bytes, 1, max_bytes))
>> +    return expand_compare_loop (operands);
>> +
>> +  /* The code generated for p7 and older is not faster than glibc
>> +     memcmp if alignment is small and length is not short, so bail
>> +     out to avoid those conditions.  */
>> +  if (!TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
>> +      && ((base_align == 1 && bytes > 16)
>> +         || (base_align == 2 && bytes > 32)))
>> +    return false;
>> +
>> +  rtx final_label = NULL;
>> +
>> +  if (use_vec)
>>      {
>> -      if (convert_label)
>> -       emit_label (convert_label);
>> +      rtx final_move_label = gen_label_rtx ();
>> +      rtx s1addr = gen_reg_rtx (Pmode);
>> +      rtx s2addr = gen_reg_rtx (Pmode);
>> +      rtx off_reg = gen_reg_rtx (Pmode);
>> +      rtx cleanup_label = NULL;
>> +      rtx vec_result = gen_reg_rtx (V16QImode);
>> +      rtx s1data = gen_reg_rtx (V16QImode);
>> +      rtx s2data = gen_reg_rtx (V16QImode);
>> +      rtx result_reg = gen_reg_rtx (word_mode);
>> +      emit_move_insn (result_reg, GEN_INT (0));
>>
>> -      /* We need to produce DI result from sub, then convert to target SI
>> -        while maintaining <0 / ==0 / >0 properties.  This sequence works:
>> -        subfc L,A,B
>> -        subfe H,H,H
>> -        popcntd L,L
>> -        rldimi L,H,6,0
>> +      expand_cmp_vec_sequence (bytes, orig_src1, orig_src2,
>> +                              s1addr, s2addr, off_reg, s1data, s2data,
>> +                              vec_result, false,
>> +                              &cleanup_label, final_move_label, false);
>>
>> -        This is an alternate one Segher cooked up if somebody
>> -        wants to expand this for something that doesn't have popcntd:
>> -        subfc L,a,b
>> -        subfe H,x,x
>> -        addic t,L,-1
>> -        subfe v,t,L
>> -        or z,v,H
>> +      if (cleanup_label)
>> +       emit_label (cleanup_label);
>>
>> -        And finally, p9 can just do this:
>> -        cmpld A,B
>> -        setb r */
>> +      emit_insn (gen_one_cmplv16qi2 (vec_result, vec_result));
>>
>> -      if (TARGET_P9_MISC)
>> +      emit_final_compare_vec (s1data, s2data, result_reg,
>> +                             s1addr, s2addr, orig_src1, orig_src2,
>> +                             off_reg, vec_result);
>> +
>> +      emit_label (final_move_label);
>> +      emit_insn (gen_movsi (target,
>> +                           gen_lowpart (SImode, result_reg)));
>> +    }
>> +  else
>> +    { /* generate GPR code */
>> +
>> +      rtx convert_label = NULL;
>> +      rtx sub_result = gen_reg_rtx (word_mode);
>> +      bool need_6432_conversion =
>> +       expand_block_compare_gpr (bytes, base_align,
>> +                                 orig_src1, orig_src2,
>> +                                 sub_result, cond, target,
>> +                                 &convert_label, &final_label);
>> +
>> +      if (need_6432_conversion)
>>         {
>> -         emit_insn (gen_setb_unsigned (target, cond));
>> -       }
>> -      else
>> -       {
>> -         if (TARGET_64BIT)
>> -           {
>> -             rtx tmp_reg_ca = gen_reg_rtx (DImode);
>> -             emit_insn (gen_subfdi3_carry_in_xx (tmp_reg_ca));
>> -             emit_insn (gen_popcntddi2 (tmp_reg_src2, tmp_reg_src2));
>> -             emit_insn (gen_iordi3 (tmp_reg_src2, tmp_reg_src2, tmp_reg_ca));
>> -             emit_insn (gen_movsi (target, gen_lowpart (SImode, tmp_reg_src2)));
>> -           }
>> +         if (convert_label)
>> +           emit_label (convert_label);
>> +         if (TARGET_P9_MISC)
>> +           emit_insn (gen_setb_unsigned (target, cond));
>>           else
>> -           {
>> -             rtx tmp_reg_ca = gen_reg_rtx (SImode);
>> -             emit_insn (gen_subfsi3_carry_in_xx (tmp_reg_ca));
>> -             emit_insn (gen_popcntdsi2 (tmp_reg_src2, tmp_reg_src2));
>> -             emit_insn (gen_iorsi3 (target, tmp_reg_src2, tmp_reg_ca));
>> -           }
>> +           generate_6432_conversion (target, sub_result);
>>         }
>>      }
>>
>> @@ -1700,7 +2081,6 @@
>>    if (final_label)
>>      emit_label (final_label);
>>
>> -  gcc_assert (bytes == 0);
>>    return true;
>>  }
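
To make the effect of the new limit handling concrete, here is a tiny
model of the max_bytes adjustment in expand_block_compare. The function
name is hypothetical, and the use_vec argument stands in for the
bytes >= 16 / !TARGET_32BIT / TARGET_EFFICIENT_UNALIGNED_VSX /
TARGET_P8_VECTOR test.

```c
#include <stdbool.h>

/* Hypothetical model of the inline limit: the configured limit is used
   as-is for the vsx path and roughly halved for the GPR path.  */
static unsigned long
model_inline_limit (unsigned long limit, bool use_vec)
{
  if (!use_vec && limit > 1)
    limit = ((limit + 1) / 2) - 1;
  return limit;
}
```

With the new default of 63 this gives 31 for the GPR path. Note that a
configured limit of 2 becomes 0 without vsx, so every fixed-size compare
would then go to expand_compare_loop.
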
>>
>> @@ -1808,7 +2188,7 @@
>>         }
>>        rtx addr1 = gen_rtx_PLUS (Pmode, src1_addr, offset_rtx);
>>        rtx addr2 = gen_rtx_PLUS (Pmode, src2_addr, offset_rtx);
>> -
>> +
>>        do_load_for_compare_from_addr (load_mode, tmp_reg_src1, addr1, orig_src1);
>>        do_load_for_compare_from_addr (load_mode, tmp_reg_src2, addr2, orig_src2);
>>
>> @@ -1966,176 +2346,6 @@
>>    return;
>>  }
>>
>> -/* Generate the sequence of compares for strcmp/strncmp using vec/vsx
>> -   instructions.
>> -
>> -   BYTES_TO_COMPARE is the number of bytes to be compared.
>> -   ORIG_SRC1 is the unmodified rtx for the first string.
>> -   ORIG_SRC2 is the unmodified rtx for the second string.
>> -   S1ADDR is the register to use for the base address of the first string.
>> -   S2ADDR is the register to use for the base address of the second string.
>> -   OFF_REG is the register to use for the string offset for loads.
>> -   S1DATA is the register for loading the first string.
>> -   S2DATA is the register for loading the second string.
>> -   VEC_RESULT is the rtx for the vector result indicating the byte difference.
>> -   EQUALITY_COMPARE_REST is a flag to indicate we need to make a cleanup call
>> -   to strcmp/strncmp if we have equality at the end of the inline comparison.
>> -   P_CLEANUP_LABEL is a pointer to rtx for a label we generate if we need code to clean up
>> -   and generate the final comparison result.
>> -   FINAL_MOVE_LABEL is rtx for a label we can branch to when we can just
>> -   set the final result.  */
>> -static void
>> -expand_strncmp_vec_sequence (unsigned HOST_WIDE_INT bytes_to_compare,
>> -                            rtx orig_src1, rtx orig_src2,
>> -                            rtx s1addr, rtx s2addr, rtx off_reg,
>> -                            rtx s1data, rtx s2data,
>> -                            rtx vec_result, bool equality_compare_rest,
>> -                            rtx *p_cleanup_label, rtx final_move_label)
>> -{
>> -  machine_mode load_mode;
>> -  unsigned int load_mode_size;
>> -  unsigned HOST_WIDE_INT cmp_bytes = 0;
>> -  unsigned HOST_WIDE_INT offset = 0;
>> -
>> -  gcc_assert (p_cleanup_label != NULL);
>> -  rtx cleanup_label = *p_cleanup_label;
>> -
>> -  emit_move_insn (s1addr, force_reg (Pmode, XEXP (orig_src1, 0)));
>> -  emit_move_insn (s2addr, force_reg (Pmode, XEXP (orig_src2, 0)));
>> -
>> -  unsigned int i;
>> -  rtx zr[16];
>> -  for (i = 0; i < 16; i++)
>> -    zr[i] = GEN_INT (0);
>> -  rtvec zv = gen_rtvec_v (16, zr);
>> -  rtx zero_reg = gen_reg_rtx (V16QImode);
>> -  rs6000_expand_vector_init (zero_reg, gen_rtx_PARALLEL (V16QImode, zv));
>> -
>> -  while (bytes_to_compare > 0)
>> -    {
>> -      /* VEC/VSX compare sequence for P8:
>> -        check each 16B with:
>> -        lxvd2x 32,28,8
>> -        lxvd2x 33,29,8
>> -        vcmpequb 2,0,1  # compare strings
>> -        vcmpequb 4,0,3  # compare w/ 0
>> -        xxlorc 37,36,34       # first FF byte is either mismatch or end of string
>> -        vcmpequb. 7,5,3  # reg 7 contains 0
>> -        bnl 6,.Lmismatch
>> -
>> -        For the P8 LE case, we use lxvd2x and compare full 16 bytes
>> -        but then use use vgbbd and a shift to get two bytes with the
>> -        information we need in the correct order.
>> -
>> -        VEC/VSX compare sequence if TARGET_P9_VECTOR:
>> -        lxvb16x/lxvb16x     # load 16B of each string
>> -        vcmpnezb.           # produces difference location or zero byte location
>> -        bne 6,.Lmismatch
>> -
>> -        Use the overlapping compare trick for the last block if it is
>> -        less than 16 bytes.
>> -      */
>> -
>> -      load_mode = V16QImode;
>> -      load_mode_size = GET_MODE_SIZE (load_mode);
>> -
>> -      if (bytes_to_compare >= load_mode_size)
>> -       cmp_bytes = load_mode_size;
>> -      else
>> -       {
>> -         /* Move this load back so it doesn't go past the end.  P8/P9
>> -            can do this efficiently.  This is never called with less
>> -            than 16 bytes so we should always be able to do this.  */
>> -         unsigned int extra_bytes = load_mode_size - bytes_to_compare;
>> -         cmp_bytes = bytes_to_compare;
>> -         gcc_assert (offset > extra_bytes);
>> -         offset -= extra_bytes;
>> -         cmp_bytes = load_mode_size;
>> -         bytes_to_compare = cmp_bytes;
>> -       }
>> -
>> -      /* The offset currently used is always kept in off_reg so that the
>> -        cleanup code on P8 can use it to extract the differing byte.  */
>> -      emit_move_insn (off_reg, GEN_INT (offset));
>> -
>> -      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
>> -      do_load_for_compare_from_addr (load_mode, s1data, addr1, orig_src1);
>> -      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
>> -      do_load_for_compare_from_addr (load_mode, s2data, addr2, orig_src2);
>> -
>> -      /* Cases to handle.  A and B are chunks of the two strings.
>> -        1: Not end of comparison:
>> -        A != B: branch to cleanup code to compute result.
>> -        A == B: next block
>> -        2: End of the inline comparison:
>> -        A != B: branch to cleanup code to compute result.
>> -        A == B: call strcmp/strncmp
>> -        3: compared requested N bytes:
>> -        A == B: branch to result 0.
>> -        A != B: cleanup code to compute result.  */
>> -
>> -      unsigned HOST_WIDE_INT remain = bytes_to_compare - cmp_bytes;
>> -
>> -      if (TARGET_P9_VECTOR)
>> -       emit_insn (gen_vcmpnezb_p (vec_result, s1data, s2data));
>> -      else
>> -       {
>> -         /* Emit instructions to do comparison and zero check.  */
>> -         rtx cmp_res = gen_reg_rtx (load_mode);
>> -         rtx cmp_zero = gen_reg_rtx (load_mode);
>> -         rtx cmp_combined = gen_reg_rtx (load_mode);
>> -         emit_insn (gen_altivec_eqv16qi (cmp_res, s1data, s2data));
>> -         emit_insn (gen_altivec_eqv16qi (cmp_zero, s1data, zero_reg));
>> -         emit_insn (gen_orcv16qi3 (vec_result, cmp_zero, cmp_res));
>> -         emit_insn (gen_altivec_vcmpequb_p (cmp_combined, vec_result, zero_reg));
>> -       }
>> -
>> -      bool branch_to_cleanup = (remain > 0 || equality_compare_rest);
>> -      rtx cr6 = gen_rtx_REG (CCmode, CR6_REGNO);
>> -      rtx dst_label;
>> -      rtx cmp_rtx;
>> -      if (branch_to_cleanup)
>> -       {
>> -         /* Branch to cleanup code, otherwise fall through to do more
>> -            compares.  P8 and P9 use different CR bits because on P8
>> -            we are looking at the result of a comparsion vs a
>> -            register of zeroes so the all-true condition means no
>> -            difference or zero was found.  On P9, vcmpnezb sets a byte
>> -            to 0xff if there is a mismatch or zero, so the all-false
>> -            condition indicates we found no difference or zero.  */
>> -         if (!cleanup_label)
>> -           cleanup_label = gen_label_rtx ();
>> -         dst_label = cleanup_label;
>> -         if (TARGET_P9_VECTOR)
>> -           cmp_rtx = gen_rtx_NE (VOIDmode, cr6, const0_rtx);
>> -         else
>> -           cmp_rtx = gen_rtx_GE (VOIDmode, cr6, const0_rtx);
>> -       }
>> -      else
>> -       {
>> -         /* Branch to final return or fall through to cleanup,
>> -            result is already set to 0.  */
>> -         dst_label = final_move_label;
>> -         if (TARGET_P9_VECTOR)
>> -           cmp_rtx = gen_rtx_EQ (VOIDmode, cr6, const0_rtx);
>> -         else
>> -           cmp_rtx = gen_rtx_LT (VOIDmode, cr6, const0_rtx);
>> -       }
>> -
>> -      rtx lab_ref = gen_rtx_LABEL_REF (VOIDmode, dst_label);
>> -      rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, cmp_rtx,
>> -                                        lab_ref, pc_rtx);
>> -      rtx j2 = emit_jump_insn (gen_rtx_SET (pc_rtx, ifelse));
>> -      JUMP_LABEL (j2) = dst_label;
>> -      LABEL_NUSES (dst_label) += 1;
>> -
>> -      offset += cmp_bytes;
>> -      bytes_to_compare -= cmp_bytes;
>> -    }
>> -  *p_cleanup_label = cleanup_label;
>> -  return;
>> -}
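
The "overlapping compare trick" mentioned in the comment above can be
modeled in scalar C as follows. This is a sketch, not the emitted code:
model_overlapping_compare is a hypothetical name, the inner loop stands
in for one 16-byte vector compare, and the memcmp fallback exists only
because the model has to do something for short inputs (the patch only
uses the vector path for 16 bytes or more).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model: compare N bytes in 16-byte chunks, moving the
   last chunk back so it ends exactly at byte N.  */
static int
model_overlapping_compare (const unsigned char *s1,
                           const unsigned char *s2, size_t n)
{
  if (n < 16)
    return memcmp (s1, s2, n);

  size_t offset = 0;
  size_t remaining = n;
  while (remaining > 0)
    {
      if (remaining < 16)
        {
          /* Move the load back so it does not run past the end.  The
             re-read bytes are already known to be equal.  */
          offset -= 16 - remaining;
          remaining = 16;
        }

      /* Stand-in for one 16-byte vector compare.  */
      for (size_t i = 0; i < 16; i++)
        if (s1[offset + i] != s2[offset + i])
          return (int) s1[offset + i] - (int) s2[offset + i];

      offset += 16;
      remaining -= 16;
    }
  return 0;
}
```

Because the overlapped bytes compared equal in the previous chunk, the
first mismatch found in the final chunk is still the first mismatch
overall.
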
>> -
>>  /* Generate the final sequence that identifies the differing
>>     byte and generates the final result, taking into account
>>     zero bytes:
>> @@ -2190,97 +2400,6 @@
>>    return;
>>  }
>>
>> -/* Generate the final sequence that identifies the differing
>> -   byte and generates the final result, taking into account
>> -   zero bytes:
>> -
>> -   P8:
>> -        vgbbd 0,0
>> -        vsldoi 0,0,0,9
>> -        mfvsrd 9,32
>> -        addi 10,9,-1    # count trailing zero bits
>> -        andc 9,10,9
>> -        popcntd 9,9
>> -        lbzx 10,28,9    # use that offset to load differing byte
>> -        lbzx 3,29,9
>> -        subf 3,3,10     # subtract for final result
>> -
>> -   P9:
>> -        vclzlsbb            # counts trailing bytes with lsb=0
>> -        vextublx            # extract differing byte
>> -
>> -   STR1 is the reg rtx for data from string 1.
>> -   STR2 is the reg rtx for data from string 2.
>> -   RESULT is the reg rtx for the comparison result.
>> -   S1ADDR is the register to use for the base address of the first string.
>> -   S2ADDR is the register to use for the base address of the second string.
>> -   ORIG_SRC1 is the unmodified rtx for the first string.
>> -   ORIG_SRC2 is the unmodified rtx for the second string.
>> -   OFF_REG is the register to use for the string offset for loads.
>> -   VEC_RESULT is the rtx for the vector result indicating the byte difference.
>> -  */
>> -
>> -static void
>> -emit_final_str_compare_vec (rtx str1, rtx str2, rtx result,
>> -                           rtx s1addr, rtx s2addr,
>> -                           rtx orig_src1, rtx orig_src2,
>> -                           rtx off_reg, rtx vec_result)
>> -{
>> -  if (TARGET_P9_VECTOR)
>> -    {
>> -      rtx diffix = gen_reg_rtx (SImode);
>> -      rtx chr1 = gen_reg_rtx (SImode);
>> -      rtx chr2 = gen_reg_rtx (SImode);
>> -      rtx chr1_di = simplify_gen_subreg (DImode, chr1, SImode, 0);
>> -      rtx chr2_di = simplify_gen_subreg (DImode, chr2, SImode, 0);
>> -      emit_insn (gen_vclzlsbb_v16qi (diffix, vec_result));
>> -      emit_insn (gen_vextublx (chr1, diffix, str1));
>> -      emit_insn (gen_vextublx (chr2, diffix, str2));
>> -      do_sub3 (result, chr1_di, chr2_di);
>> -    }
>> -  else
>> -    {
>> -      gcc_assert (TARGET_P8_VECTOR);
>> -      rtx diffix = gen_reg_rtx (DImode);
>> -      rtx result_gbbd = gen_reg_rtx (V16QImode);
>> -      /* Since each byte of the input is either 00 or FF, the bytes in
>> -        dw0 and dw1 after vgbbd are all identical to each other.  */
>> -      emit_insn (gen_p8v_vgbbd (result_gbbd, vec_result));
>> -      /* For LE, we shift by 9 and get BA in the low two bytes then CTZ.
>> -        For BE, we shift by 7 and get AB in the high two bytes then CLZ.  */
>> -      rtx result_shifted = gen_reg_rtx (V16QImode);
>> -      int shift_amt = (BYTES_BIG_ENDIAN) ? 7 : 9;
>> -      emit_insn (gen_altivec_vsldoi_v16qi (result_shifted,result_gbbd,result_gbbd, GEN_INT (shift_amt)));
>> -
>> -      rtx diffix_df = simplify_gen_subreg (DFmode, diffix, DImode, 0);
>> -      emit_insn (gen_p8_mfvsrd_3_v16qi (diffix_df, result_shifted));
>> -      rtx count = gen_reg_rtx (DImode);
>> -
>> -      if (BYTES_BIG_ENDIAN)
>> -       emit_insn (gen_clzdi2 (count, diffix));
>> -      else
>> -       emit_insn (gen_ctzdi2 (count, diffix));
>> -
>> -      /* P8 doesn't have a good solution for extracting one byte from
>> -        a vsx reg like vextublx on P9 so we just compute the offset
>> -        of the differing byte and load it from each string.  */
>> -      do_add3 (off_reg, off_reg, count);
>> -
>> -      rtx chr1 = gen_reg_rtx (QImode);
>> -      rtx chr2 = gen_reg_rtx (QImode);
>> -      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
>> -      do_load_for_compare_from_addr (QImode, chr1, addr1, orig_src1);
>> -      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
>> -      do_load_for_compare_from_addr (QImode, chr2, addr2, orig_src2);
>> -      machine_mode rmode = GET_MODE (result);
>> -      rtx chr1_rm = simplify_gen_subreg (rmode, chr1, QImode, 0);
>> -      rtx chr2_rm = simplify_gen_subreg (rmode, chr2, QImode, 0);
>> -      do_sub3 (result, chr1_rm, chr2_rm);
>> -    }
>> -
>> -  return;
>> -}
>> -
>>  /* Expand a string compare operation with length, and return
>>     true if successful.  Return false if we should let the
>>     compiler generate normal code, probably a strncmp call.
>> @@ -2490,13 +2609,13 @@
>>        off_reg = gen_reg_rtx (Pmode);
>>        vec_result = gen_reg_rtx (load_mode);
>>        emit_move_insn (result_reg, GEN_INT (0));
>> -      expand_strncmp_vec_sequence (compare_length,
>> -                                  orig_src1, orig_src2,
>> -                                  s1addr, s2addr, off_reg,
>> -                                  tmp_reg_src1, tmp_reg_src2,
>> -                                  vec_result,
>> -                                  equality_compare_rest,
>> -                                  &cleanup_label, final_move_label);
>> +      expand_cmp_vec_sequence (compare_length,
>> +                              orig_src1, orig_src2,
>> +                              s1addr, s2addr, off_reg,
>> +                              tmp_reg_src1, tmp_reg_src2,
>> +                              vec_result,
>> +                              equality_compare_rest,
>> +                              &cleanup_label, final_move_label, true);
>>      }
>>    else
>>      expand_strncmp_gpr_sequence (compare_length, base_align,
>> @@ -2545,9 +2664,9 @@
>>      emit_label (cleanup_label);
>>
>>    if (use_vec)
>> -    emit_final_str_compare_vec (tmp_reg_src1, tmp_reg_src2, result_reg,
>> -                               s1addr, s2addr, orig_src1, orig_src2,
>> -                               off_reg, vec_result);
>> +    emit_final_compare_vec (tmp_reg_src1, tmp_reg_src2, result_reg,
>> +                           s1addr, s2addr, orig_src1, orig_src2,
>> +                           off_reg, vec_result);
>>    else
>>      emit_final_str_compare_gpr (tmp_reg_src1, tmp_reg_src2, result_reg);
>>
>> Index: gcc/config/rs6000/rs6000.opt
>> ===================================================================
>> --- gcc/config/rs6000/rs6000.opt        (revision 266034)
>> +++ gcc/config/rs6000/rs6000.opt        (working copy)
>> @@ -326,7 +326,7 @@
>>  Max number of bytes to move inline.
>>
>>  mblock-compare-inline-limit=
>> -Target Report Var(rs6000_block_compare_inline_limit) Init(31) RejectNegative Joined UInteger Save
>> +Target Report Var(rs6000_block_compare_inline_limit) Init(63) RejectNegative Joined UInteger Save
>>  Max number of bytes to compare without loops.
>>
>>  mblock-compare-inline-loop-limit=
>>
>>
>> --
>> Aaron Sawdey, Ph.D.  acsawdey@linux.vnet.ibm.com
>> 050-2/C113  (507) 253-7520 home: 507/263-0782
>> IBM Linux Technology Center - PPC Toolchain
>>
>
Segher Boessenkool Nov. 16, 2018, 9:52 p.m. UTC | #3
Hi Aaron,

On Wed, Nov 14, 2018 at 10:42:44AM -0600, Aaron Sawdey wrote:
> +static rtx
> +emit_vsx_zero_reg()
> +{
> +  unsigned int i;
> +  rtx zr[16];
> +  for (i = 0; i < 16; i++)
> +    zr[i] = GEN_INT (0);
> +  rtvec zv = gen_rtvec_v (16, zr);
> +  rtx zero_reg = gen_reg_rtx (V16QImode);
> +  rs6000_expand_vector_init (zero_reg, gen_rtx_PARALLEL (V16QImode, zv));
> +  return zero_reg;
> +}

use CONST0_RTX (V16QImode) ?

> +      emit_insn (gen_altivec_vsldoi_v16qi (result_shifted,result_gbbd,result_gbbd, GEN_INT (shift_amt)));

This line is a bit^H^H^H^H^Hsomewhat^H^H^H^H^H^H^H^Hvery terribly quite much too long :-)

And there should be spaces after each comma.

This would have been a lot easier to review if you could have separated
the refactoring to a first patch, and the actual changes to a second.

But it's okay for trunk.  Please fix the long line, and maybe look if
CONST0_RTX helps you.  Thanks for the patch!

(Also fine with switching the cutoff to 33).


Segher
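
For anyone reading along, the "overlapping compare" that the patch uses for a final block shorter than 16 bytes can be modelled in plain C. This is only a sketch under stated assumptions: memcmp_overlap_model, first_diff_in_chunk and CHUNK are made-up names, the byte loop merely stands in for the vector load plus vcmpequb, and none of it is the RTL that the expander emits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 16  /* Size of one vector load; hypothetical name.  */

/* Stand-in for a 16-byte vector load and compare.  Returns the index
   of the first differing byte, or CHUNK if the chunks are equal.  */
static size_t
first_diff_in_chunk (const unsigned char *a, const unsigned char *b)
{
  for (size_t i = 0; i < CHUNK; i++)
    if (a[i] != b[i])
      return i;
  return CHUNK;
}

/* Model of the chunked compare.  Lengths below CHUNK are handed to
   memcmp, mirroring the patch's rule that the vector path is only
   used when there are at least 16 bytes to compare.  */
int
memcmp_overlap_model (const void *s1, const void *s2, size_t n)
{
  const unsigned char *a = s1;
  const unsigned char *b = s2;
  size_t offset = 0;
  size_t remaining = n;

  if (n < CHUNK)
    return memcmp (s1, s2, n);

  while (remaining > 0)
    {
      if (remaining < CHUNK)
        {
          /* Move the last load back so it ends exactly at byte n.
             The bytes it re-reads have already compared equal.  */
          size_t extra = CHUNK - remaining;
          assert (offset >= extra);
          offset -= extra;
          remaining = CHUNK;
        }

      size_t i = first_diff_in_chunk (a + offset, b + offset);
      if (i < CHUNK)
        return (int) a[offset + i] - (int) b[offset + i];

      offset += CHUNK;
      remaining -= CHUNK;
    }
  return 0;
}
```

Because the re-read bytes are known to be equal, the first mismatch in the final chunk always lies in the part that had not been compared yet, so no separate code path for short tails is needed.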

Patch

Index: gcc/config/rs6000/rs6000-string.c
===================================================================
--- gcc/config/rs6000/rs6000-string.c	(revision 266034)
+++ gcc/config/rs6000/rs6000-string.c	(working copy)
@@ -615,6 +615,283 @@ 
     }
 }

+static rtx
+emit_vsx_zero_reg()
+{
+  unsigned int i;
+  rtx zr[16];
+  for (i = 0; i < 16; i++)
+    zr[i] = GEN_INT (0);
+  rtvec zv = gen_rtvec_v (16, zr);
+  rtx zero_reg = gen_reg_rtx (V16QImode);
+  rs6000_expand_vector_init (zero_reg, gen_rtx_PARALLEL (V16QImode, zv));
+  return zero_reg;
+}
+
+/* Generate the sequence of compares for strcmp/strncmp/memcmp using vec/vsx
+   instructions.
+
+   BYTES_TO_COMPARE is the number of bytes to be compared.
+   ORIG_SRC1 is the unmodified rtx for the first string.
+   ORIG_SRC2 is the unmodified rtx for the second string.
+   S1ADDR is the register to use for the base address of the first string.
+   S2ADDR is the register to use for the base address of the second string.
+   OFF_REG is the register to use for the string offset for loads.
+   S1DATA is the register for loading the first string.
+   S2DATA is the register for loading the second string.
+   VEC_RESULT is the rtx for the vector result indicating the byte difference.
+   EQUALITY_COMPARE_REST is a flag to indicate we need to make a cleanup call
+   to strcmp/strncmp if we have equality at the end of the inline comparison.
+   P_CLEANUP_LABEL is a pointer to rtx for a label we generate if we need code
+   to clean up and generate the final comparison result.
+   FINAL_MOVE_LABEL is rtx for a label we can branch to when we can just
+   set the final result.
+   CHECKZERO indicates whether the sequence should check for zero bytes
+   for use doing strncmp, or not (for use doing memcmp).  */
+static void
+expand_cmp_vec_sequence (unsigned HOST_WIDE_INT bytes_to_compare,
+			 rtx orig_src1, rtx orig_src2,
+			 rtx s1addr, rtx s2addr, rtx off_reg,
+			 rtx s1data, rtx s2data, rtx vec_result,
+			 bool equality_compare_rest, rtx *p_cleanup_label,
+			 rtx final_move_label, bool checkzero)
+{
+  machine_mode load_mode;
+  unsigned int load_mode_size;
+  unsigned HOST_WIDE_INT cmp_bytes = 0;
+  unsigned HOST_WIDE_INT offset = 0;
+  rtx zero_reg = NULL;
+
+  gcc_assert (p_cleanup_label != NULL);
+  rtx cleanup_label = *p_cleanup_label;
+
+  emit_move_insn (s1addr, force_reg (Pmode, XEXP (orig_src1, 0)));
+  emit_move_insn (s2addr, force_reg (Pmode, XEXP (orig_src2, 0)));
+
+  if (checkzero && !TARGET_P9_VECTOR)
+    zero_reg = emit_vsx_zero_reg();
+
+  while (bytes_to_compare > 0)
+    {
+      /* VEC/VSX compare sequence for P8:
+	 check each 16B with:
+	 lxvd2x 32,28,8
+	 lxvd2x 33,29,8
+	 vcmpequb 2,0,1  # compare strings
+	 vcmpequb 4,0,3  # compare w/ 0
+	 xxlorc 37,36,34       # first FF byte is either mismatch or end of string
+	 vcmpequb. 7,5,3  # reg 7 contains 0
+	 bnl 6,.Lmismatch
+
+	 For the P8 LE case, we use lxvd2x and compare full 16 bytes
+	 but then use vgbbd and a shift to get two bytes with the
+	 information we need in the correct order.
+
+	 VEC/VSX compare sequence if TARGET_P9_VECTOR:
+	 lxvb16x/lxvb16x     # load 16B of each string
+	 vcmpnezb.           # produces difference location or zero byte location
+	 bne 6,.Lmismatch
+
+	 Use the overlapping compare trick for the last block if it is
+	 less than 16 bytes.
+      */
+
+      load_mode = V16QImode;
+      load_mode_size = GET_MODE_SIZE (load_mode);
+
+      if (bytes_to_compare >= load_mode_size)
+	cmp_bytes = load_mode_size;
+      else
+	{
+	  /* Move this load back so it doesn't go past the end.  P8/P9
+	     can do this efficiently.  This is never called with less
+	     than 16 bytes so we should always be able to do this.  */
+	  unsigned int extra_bytes = load_mode_size - bytes_to_compare;
+	  cmp_bytes = bytes_to_compare;
+	  gcc_assert (offset > extra_bytes);
+	  offset -= extra_bytes;
+	  cmp_bytes = load_mode_size;
+	  bytes_to_compare = cmp_bytes;
+	}
+
+      /* The offset currently used is always kept in off_reg so that the
+	 cleanup code on P8 can use it to extract the differing byte.  */
+      emit_move_insn (off_reg, GEN_INT (offset));
+
+      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
+      do_load_for_compare_from_addr (load_mode, s1data, addr1, orig_src1);
+      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
+      do_load_for_compare_from_addr (load_mode, s2data, addr2, orig_src2);
+
+      /* Cases to handle.  A and B are chunks of the two strings.
+	 1: Not end of comparison:
+	 A != B: branch to cleanup code to compute result.
+	 A == B: next block
+	 2: End of the inline comparison:
+	 A != B: branch to cleanup code to compute result.
+	 A == B: call strcmp/strncmp
+	 3: compared requested N bytes:
+	 A == B: branch to result 0.
+	 A != B: cleanup code to compute result.  */
+
+      unsigned HOST_WIDE_INT remain = bytes_to_compare - cmp_bytes;
+
+      if (checkzero)
+	{
+	  if (TARGET_P9_VECTOR)
+	    emit_insn (gen_vcmpnezb_p (vec_result, s1data, s2data));
+	  else
+	    {
+	      /* Emit instructions to do comparison and zero check.  */
+	      rtx cmp_res = gen_reg_rtx (load_mode);
+	      rtx cmp_zero = gen_reg_rtx (load_mode);
+	      rtx cmp_combined = gen_reg_rtx (load_mode);
+	      emit_insn (gen_altivec_eqv16qi (cmp_res, s1data, s2data));
+	      emit_insn (gen_altivec_eqv16qi (cmp_zero, s1data, zero_reg));
+	      emit_insn (gen_orcv16qi3 (vec_result, cmp_zero, cmp_res));
+	      emit_insn (gen_altivec_vcmpequb_p (cmp_combined, vec_result, zero_reg));
+	    }
+	}
+      else
+	emit_insn (gen_altivec_vcmpequb_p (vec_result, s1data, s2data));
+
+      bool branch_to_cleanup = (remain > 0 || equality_compare_rest);
+      rtx cr6 = gen_rtx_REG (CCmode, CR6_REGNO);
+      rtx dst_label;
+      rtx cmp_rtx;
+      if (branch_to_cleanup)
+	{
+	  /* Branch to cleanup code, otherwise fall through to do more
+	     compares.  P8 and P9 use different CR bits because on P8
+	     we are looking at the result of a comparison vs a
+	     register of zeroes so the all-true condition means no
+	     difference or zero was found.  On P9, vcmpnezb sets a byte
+	     to 0xff if there is a mismatch or zero, so the all-false
+	     condition indicates we found no difference or zero.  */
+	  if (!cleanup_label)
+	    cleanup_label = gen_label_rtx ();
+	  dst_label = cleanup_label;
+	  if (TARGET_P9_VECTOR && checkzero)
+	    cmp_rtx = gen_rtx_NE (VOIDmode, cr6, const0_rtx);
+	  else
+	    cmp_rtx = gen_rtx_GE (VOIDmode, cr6, const0_rtx);
+	}
+      else
+	{
+	  /* Branch to final return or fall through to cleanup,
+	     result is already set to 0.  */
+	  dst_label = final_move_label;
+	  if (TARGET_P9_VECTOR && checkzero)
+	    cmp_rtx = gen_rtx_EQ (VOIDmode, cr6, const0_rtx);
+	  else
+	    cmp_rtx = gen_rtx_LT (VOIDmode, cr6, const0_rtx);
+	}
+
+      rtx lab_ref = gen_rtx_LABEL_REF (VOIDmode, dst_label);
+      rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, cmp_rtx,
+					 lab_ref, pc_rtx);
+      rtx j2 = emit_jump_insn (gen_rtx_SET (pc_rtx, ifelse));
+      JUMP_LABEL (j2) = dst_label;
+      LABEL_NUSES (dst_label) += 1;
+
+      offset += cmp_bytes;
+      bytes_to_compare -= cmp_bytes;
+    }
+  *p_cleanup_label = cleanup_label;
+  return;
+}
+
+/* Generate the final sequence that identifies the differing
+   byte and generates the final result, taking into account
+   zero bytes:
+
+   P8:
+        vgbbd 0,0
+        vsldoi 0,0,0,9
+        mfvsrd 9,32
+        addi 10,9,-1    # count trailing zero bits
+        andc 9,10,9
+        popcntd 9,9
+        lbzx 10,28,9    # use that offset to load differing byte
+        lbzx 3,29,9
+        subf 3,3,10     # subtract for final result
+
+   P9:
+	 vclzlsbb            # counts trailing bytes with lsb=0
+	 vextublx            # extract differing byte
+
+   STR1 is the reg rtx for data from string 1.
+   STR2 is the reg rtx for data from string 2.
+   RESULT is the reg rtx for the comparison result.
+   S1ADDR is the register to use for the base address of the first string.
+   S2ADDR is the register to use for the base address of the second string.
+   ORIG_SRC1 is the unmodified rtx for the first string.
+   ORIG_SRC2 is the unmodified rtx for the second string.
+   OFF_REG is the register to use for the string offset for loads.
+   VEC_RESULT is the rtx for the vector result indicating the byte difference.  */
+
+static void
+emit_final_compare_vec (rtx str1, rtx str2, rtx result,
+			rtx s1addr, rtx s2addr,
+			rtx orig_src1, rtx orig_src2,
+			rtx off_reg, rtx vec_result)
+{
+
+  if (TARGET_P9_VECTOR)
+    {
+      rtx diffix = gen_reg_rtx (SImode);
+      rtx chr1 = gen_reg_rtx (SImode);
+      rtx chr2 = gen_reg_rtx (SImode);
+      rtx chr1_di = simplify_gen_subreg (DImode, chr1, SImode, 0);
+      rtx chr2_di = simplify_gen_subreg (DImode, chr2, SImode, 0);
+      emit_insn (gen_vclzlsbb_v16qi (diffix, vec_result));
+      emit_insn (gen_vextublx (chr1, diffix, str1));
+      emit_insn (gen_vextublx (chr2, diffix, str2));
+      do_sub3 (result, chr1_di, chr2_di);
+    }
+  else
+    {
+      gcc_assert (TARGET_P8_VECTOR);
+      rtx diffix = gen_reg_rtx (DImode);
+      rtx result_gbbd = gen_reg_rtx (V16QImode);
+      /* Since each byte of the input is either 00 or FF, the bytes in
+	 dw0 and dw1 after vgbbd are all identical to each other.  */
+      emit_insn (gen_p8v_vgbbd (result_gbbd, vec_result));
+      /* For LE, we shift by 9 and get BA in the low two bytes then CTZ.
+	 For BE, we shift by 7 and get AB in the high two bytes then CLZ.  */
+      rtx result_shifted = gen_reg_rtx (V16QImode);
+      int shift_amt = (BYTES_BIG_ENDIAN) ? 7 : 9;
+      emit_insn (gen_altivec_vsldoi_v16qi (result_shifted,result_gbbd,result_gbbd, GEN_INT (shift_amt)));
+
+      rtx diffix_df = simplify_gen_subreg (DFmode, diffix, DImode, 0);
+      emit_insn (gen_p8_mfvsrd_3_v16qi (diffix_df, result_shifted));
+      rtx count = gen_reg_rtx (DImode);
+
+      if (BYTES_BIG_ENDIAN)
+	emit_insn (gen_clzdi2 (count, diffix));
+      else
+	emit_insn (gen_ctzdi2 (count, diffix));
+
+      /* P8 doesn't have a good solution for extracting one byte from
+	 a vsx reg like vextublx on P9 so we just compute the offset
+	 of the differing byte and load it from each string.  */
+      do_add3 (off_reg, off_reg, count);
+
+      rtx chr1 = gen_reg_rtx (QImode);
+      rtx chr2 = gen_reg_rtx (QImode);
+      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
+      do_load_for_compare_from_addr (QImode, chr1, addr1, orig_src1);
+      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
+      do_load_for_compare_from_addr (QImode, chr2, addr2, orig_src2);
+      machine_mode rmode = GET_MODE (result);
+      rtx chr1_rm = simplify_gen_subreg (rmode, chr1, QImode, 0);
+      rtx chr2_rm = simplify_gen_subreg (rmode, chr2, QImode, 0);
+      do_sub3 (result, chr1_rm, chr2_rm);
+    }
+
+  return;
+}
+
 /* Expand a block compare operation using loop code, and return true
    if successful.  Return false if we should let the compiler generate
    normal code, probably a memcmp call.
@@ -1343,106 +1620,80 @@ 
   return true;
 }

-/* Expand a block compare operation, and return true if successful.
-   Return false if we should let the compiler generate normal code,
-   probably a memcmp call.
+/* Generate code to convert a DImode-plus-carry subtract result into
+   a SImode result that has the same <0 / ==0 / >0 properties to
+   produce the final result from memcmp.

-   OPERANDS[0] is the target (result).
-   OPERANDS[1] is the first source.
-   OPERANDS[2] is the second source.
-   OPERANDS[3] is the length.
-   OPERANDS[4] is the alignment.  */
-bool
-expand_block_compare (rtx operands[])
+   TARGET is the rtx for the register to receive the memcmp result.
+   SUB_RESULT is the rtx for the register containing the subtract result.  */
+
+void
+generate_6432_conversion(rtx target, rtx sub_result)
 {
-  rtx target = operands[0];
-  rtx orig_src1 = operands[1];
-  rtx orig_src2 = operands[2];
-  rtx bytes_rtx = operands[3];
-  rtx align_rtx = operands[4];
-  HOST_WIDE_INT cmp_bytes = 0;
-  rtx src1 = orig_src1;
-  rtx src2 = orig_src2;
+  /* We need to produce DI result from sub, then convert to target SI
+     while maintaining <0 / ==0 / >0 properties.  This sequence works:
+     subfc L,A,B
+     subfe H,H,H
+     popcntd L,L
+     rldimi L,H,6,0

-  /* This case is complicated to handle because the subtract
-     with carry instructions do not generate the 64-bit
-     carry and so we must emit code to calculate it ourselves.
-     We choose not to implement this yet.  */
-  if (TARGET_32BIT && TARGET_POWERPC64)
-    return false;
+     This is an alternate one Segher cooked up if somebody
+     wants to expand this for something that doesn't have popcntd:
+     subfc L,a,b
+     subfe H,x,x
+     addic t,L,-1
+     subfe v,t,L
+     or z,v,H

-  bool isP7 = (rs6000_tune == PROCESSOR_POWER7);
+     And finally, p9 can just do this:
+     cmpld A,B
+     setb r */

-  /* Allow this param to shut off all expansion.  */
-  if (rs6000_block_compare_inline_limit == 0)
-    return false;
-
-  /* targetm.slow_unaligned_access -- don't do unaligned stuff.
-     However slow_unaligned_access returns true on P7 even though the
-     performance of this code is good there.  */
-  if (!isP7
-      && (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1))
-	  || targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src2))))
-    return false;
-
-  /* Unaligned l*brx traps on P7 so don't do this.  However this should
-     not affect much because LE isn't really supported on P7 anyway.  */
-  if (isP7 && !BYTES_BIG_ENDIAN)
-    return false;
-
-  /* If this is not a fixed size compare, try generating loop code and
-     if that fails just call memcmp.  */
-  if (!CONST_INT_P (bytes_rtx))
-    return expand_compare_loop (operands);
-
-  /* This must be a fixed size alignment.  */
-  if (!CONST_INT_P (align_rtx))
-    return false;
-
-  unsigned int base_align = UINTVAL (align_rtx) / BITS_PER_UNIT;
-
-  gcc_assert (GET_MODE (target) == SImode);
-
-  /* Anything to move?  */
-  unsigned HOST_WIDE_INT bytes = UINTVAL (bytes_rtx);
-  if (bytes == 0)
-    return true;
-
-  rtx tmp_reg_src1 = gen_reg_rtx (word_mode);
-  rtx tmp_reg_src2 = gen_reg_rtx (word_mode);
-  /* P7/P8 code uses cond for subfc. but P9 uses
-     it for cmpld which needs CCUNSmode.  */
-  rtx cond;
-  if (TARGET_P9_MISC)
-    cond = gen_reg_rtx (CCUNSmode);
+  if (TARGET_64BIT)
+    {
+      rtx tmp_reg_ca = gen_reg_rtx (DImode);
+      emit_insn (gen_subfdi3_carry_in_xx (tmp_reg_ca));
+      rtx popcnt = gen_reg_rtx (DImode);
+      emit_insn (gen_popcntddi2 (popcnt, sub_result));
+      rtx tmp2 = gen_reg_rtx (DImode);
+      emit_insn (gen_iordi3 (tmp2, popcnt, tmp_reg_ca));
+      emit_insn (gen_movsi (target, gen_lowpart (SImode, tmp2)));
+    }
   else
-    cond = gen_reg_rtx (CCmode);
+    {
+      rtx tmp_reg_ca = gen_reg_rtx (SImode);
+      emit_insn (gen_subfsi3_carry_in_xx (tmp_reg_ca));
+      rtx popcnt = gen_reg_rtx (SImode);
+      emit_insn (gen_popcntdsi2 (popcnt, sub_result));
+      emit_insn (gen_iorsi3 (target, popcnt, tmp_reg_ca));
+    }
+}

-  /* Strategy phase.  How many ops will this take and should we expand it?  */
+/* Generate memcmp expansion using in-line non-loop GPR instructions.
+   The bool return indicates whether code for a 64->32 conversion
+   should be generated.

-  unsigned HOST_WIDE_INT offset = 0;
-  machine_mode load_mode =
-    select_block_compare_mode (offset, bytes, base_align);
-  unsigned int load_mode_size = GET_MODE_SIZE (load_mode);
+   BYTES is the number of bytes to be compared.
+   BASE_ALIGN is the minimum alignment for both blocks to compare.
+   ORIG_SRC1 is the original pointer to the first block to compare.
+   ORIG_SRC2 is the original pointer to the second block to compare.
+   SUB_RESULT is the reg rtx for the result from the final subtract.
+   COND is rtx for a condition register that will be used for the final
+   compare on power9 or better.
+   FINAL_RESULT is the reg rtx for the final memcmp result.
+   P_CONVERT_LABEL is a pointer to rtx that will be used to store the
+   label generated for a branch to the 64->32 code, if such a branch
+   is needed.
+   P_FINAL_LABEL is a pointer to rtx that will be used to store the label
+   for the end of the memcmp if a branch there is needed.
+*/

-  /* We don't want to generate too much code.  The loop code can take
-     over for lengths greater than 31 bytes.  */
-  unsigned HOST_WIDE_INT max_bytes = rs6000_block_compare_inline_limit;
-  if (!IN_RANGE (bytes, 1, max_bytes))
-    return expand_compare_loop (operands);
-
-  /* The code generated for p7 and older is not faster than glibc
-     memcmp if alignment is small and length is not short, so bail
-     out to avoid those conditions.  */
-  if (!TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
-      && ((base_align == 1 && bytes > 16)
-	  || (base_align == 2 && bytes > 32)))
-    return false;
-
-  bool generate_6432_conversion = false;
-  rtx convert_label = NULL;
-  rtx final_label = NULL;
-
+bool
+expand_block_compare_gpr(unsigned HOST_WIDE_INT bytes, unsigned int base_align,
+			 rtx orig_src1, rtx orig_src2,
+			 rtx sub_result, rtx cond, rtx final_result,
+			 rtx *p_convert_label, rtx *p_final_label)
+{
   /* Example of generated code for 18 bytes aligned 1 byte.
      Compiled with -fno-reorder-blocks for clarity.
              ldbrx 10,31,8
@@ -1473,6 +1724,18 @@ 
      if the difference is found there, then a final block of HImode that skips
      the DI->SI conversion.  */

+  unsigned HOST_WIDE_INT offset = 0;
+  unsigned int load_mode_size;
+  HOST_WIDE_INT cmp_bytes = 0;
+  rtx src1 = orig_src1;
+  rtx src2 = orig_src2;
+  rtx tmp_reg_src1 = gen_reg_rtx (word_mode);
+  rtx tmp_reg_src2 = gen_reg_rtx (word_mode);
+  bool need_6432_conv = false;
+  rtx convert_label = NULL;
+  rtx final_label = NULL;
+  machine_mode load_mode;
+
   while (bytes > 0)
     {
       unsigned int align = compute_current_alignment (base_align, offset);
@@ -1536,15 +1799,15 @@ 
 	}

       int remain = bytes - cmp_bytes;
-      if (GET_MODE_SIZE (GET_MODE (target)) > GET_MODE_SIZE (load_mode))
+      if (GET_MODE_SIZE (GET_MODE (final_result)) > GET_MODE_SIZE (load_mode))
 	{
-	  /* Target is larger than load size so we don't need to
+	  /* Final_result is larger than load size so we don't need to
 	     reduce result size.  */

 	  /* We previously did a block that need 64->32 conversion but
 	     the current block does not, so a label is needed to jump
 	     to the end.  */
-	  if (generate_6432_conversion && !final_label)
+	  if (need_6432_conv && !final_label)
 	    final_label = gen_label_rtx ();

 	  if (remain > 0)
@@ -1557,7 +1820,7 @@ 
 	      rtx tmp = gen_rtx_MINUS (word_mode, tmp_reg_src1, tmp_reg_src2);
 	      rtx cr = gen_reg_rtx (CCmode);
 	      rs6000_emit_dot_insn (tmp_reg_src2, tmp, 2, cr);
-	      emit_insn (gen_movsi (target,
+	      emit_insn (gen_movsi (final_result,
 				    gen_lowpart (SImode, tmp_reg_src2)));
 	      rtx ne_rtx = gen_rtx_NE (VOIDmode, cr, const0_rtx);
 	      rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, ne_rtx,
@@ -1572,11 +1835,11 @@ 
 		{
 		  emit_insn (gen_subdi3 (tmp_reg_src2, tmp_reg_src1,
 					 tmp_reg_src2));
-		  emit_insn (gen_movsi (target,
+		  emit_insn (gen_movsi (final_result,
 					gen_lowpart (SImode, tmp_reg_src2)));
 		}
 	      else
-		emit_insn (gen_subsi3 (target, tmp_reg_src1, tmp_reg_src2));
+		emit_insn (gen_subsi3 (final_result, tmp_reg_src1, tmp_reg_src2));

 	      if (final_label)
 		{
@@ -1591,9 +1854,9 @@ 
       else
 	{
 	  /* Do we need a 64->32 conversion block? We need the 64->32
-	     conversion even if target size == load_mode size because
+	     conversion even if final_result size == load_mode size because
 	     the subtract generates one extra bit.  */
-	  generate_6432_conversion = true;
+	  need_6432_conv = true;

 	  if (remain > 0)
 	    {
@@ -1604,20 +1867,27 @@ 
 	      rtx cvt_ref = gen_rtx_LABEL_REF (VOIDmode, convert_label);
 	      if (TARGET_P9_MISC)
 		{
-		/* Generate a compare, and convert with a setb later.  */
+		/* Generate a compare, and convert with a setb later.
+		   Use cond that is passed in because the caller needs
+		   to use it for the 64->32 conversion later.  */
 		  rtx cmp = gen_rtx_COMPARE (CCUNSmode, tmp_reg_src1,
 					     tmp_reg_src2);
 		  emit_insn (gen_rtx_SET (cond, cmp));
 		}
 	      else
-		/* Generate a subfc. and use the longer
-		   sequence for conversion.  */
-		if (TARGET_64BIT)
-		  emit_insn (gen_subfdi3_carry_dot2 (tmp_reg_src2, tmp_reg_src2,
-						     tmp_reg_src1, cond));
-		else
-		  emit_insn (gen_subfsi3_carry_dot2 (tmp_reg_src2, tmp_reg_src2,
-						     tmp_reg_src1, cond));
+		{
+		  /* Generate a subfc. and use the longer sequence for
+		     conversion.  Cond is not used outside this
+		     function in this case.  */
+		  cond = gen_reg_rtx (CCmode);
+		  if (TARGET_64BIT)
+		    emit_insn (gen_subfdi3_carry_dot2 (sub_result, tmp_reg_src2,
+						       tmp_reg_src1, cond));
+		  else
+		    emit_insn (gen_subfsi3_carry_dot2 (sub_result, tmp_reg_src2,
+						       tmp_reg_src1, cond));
+		}
+
 	      rtx ne_rtx = gen_rtx_NE (VOIDmode, cond, const0_rtx);
 	      rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, ne_rtx,
 						 cvt_ref, pc_rtx);
@@ -1637,10 +1907,10 @@ 
 		}
 	      else
 		if (TARGET_64BIT)
-		  emit_insn (gen_subfdi3_carry (tmp_reg_src2, tmp_reg_src2,
+		  emit_insn (gen_subfdi3_carry (sub_result, tmp_reg_src2,
 						tmp_reg_src1));
 		else
-		  emit_insn (gen_subfsi3_carry (tmp_reg_src2, tmp_reg_src2,
+		  emit_insn (gen_subfsi3_carry (sub_result, tmp_reg_src2,
 						tmp_reg_src1));
 	    }
 	}
@@ -1649,51 +1919,162 @@ 
       bytes -= cmp_bytes;
     }

-  if (generate_6432_conversion)
+  if (convert_label)
+    *p_convert_label = convert_label;
+  if (final_label)
+    *p_final_label = final_label;
+  return need_6432_conv;
+}
+
+/* Expand a block compare operation, and return true if successful.
+   Return false if we should let the compiler generate normal code,
+   probably a memcmp call.
+
+   OPERANDS[0] is the target (result).
+   OPERANDS[1] is the first source.
+   OPERANDS[2] is the second source.
+   OPERANDS[3] is the length.
+   OPERANDS[4] is the alignment.  */
+bool
+expand_block_compare (rtx operands[])
+{
+  rtx target = operands[0];
+  rtx orig_src1 = operands[1];
+  rtx orig_src2 = operands[2];
+  rtx bytes_rtx = operands[3];
+  rtx align_rtx = operands[4];
+
+  /* This case is complicated to handle because the subtract
+     with carry instructions do not generate the 64-bit
+     carry and so we must emit code to calculate it ourselves.
+     We choose not to implement this yet.  */
+  if (TARGET_32BIT && TARGET_POWERPC64)
+    return false;
+
+  bool isP7 = (rs6000_tune == PROCESSOR_POWER7);
+
+  /* Allow this param to shut off all expansion.  */
+  if (rs6000_block_compare_inline_limit == 0)
+    return false;
+
+  /* targetm.slow_unaligned_access -- don't do unaligned stuff.
+     However slow_unaligned_access returns true on P7 even though the
+     performance of this code is good there.  */
+  if (!isP7
+      && (targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src1))
+	  || targetm.slow_unaligned_access (word_mode, MEM_ALIGN (orig_src2))))
+    return false;
+
+  /* Unaligned l*brx traps on P7 so don't do this.  However this should
+     not affect much because LE isn't really supported on P7 anyway.  */
+  if (isP7 && !BYTES_BIG_ENDIAN)
+    return false;
+
+  /* If this is not a fixed size compare, try generating loop code and
+     if that fails just call memcmp.  */
+  if (!CONST_INT_P (bytes_rtx))
+    return expand_compare_loop (operands);
+
+  /* This must be a fixed size alignment.  */
+  if (!CONST_INT_P (align_rtx))
+    return false;
+
+  unsigned int base_align = UINTVAL (align_rtx) / BITS_PER_UNIT;
+
+  gcc_assert (GET_MODE (target) == SImode);
+
+  /* Anything to move?  */
+  unsigned HOST_WIDE_INT bytes = UINTVAL (bytes_rtx);
+  if (bytes == 0)
+    return true;
+
+  /* P7/P8 code uses cond for subfc. but P9 uses
+     it for cmpld which needs CCUNSmode.  */
+  rtx cond = NULL;
+  if (TARGET_P9_MISC)
+    cond = gen_reg_rtx (CCUNSmode);
+
+  /* Is it OK to use vec/vsx for this?  TARGET_VSX means we have at
+     least POWER7 but we use TARGET_EFFICIENT_UNALIGNED_VSX which is
+     at least POWER8.  That way we can rely on overlapping compares to
+     do the final comparison of less than 16 bytes.  Also I do not
+     want to deal with making this work for 32 bits.  In addition, we
+     have to make sure that we have at least P8_VECTOR (we don't allow
+     P9_VECTOR without P8_VECTOR).  */
+  int use_vec = (bytes >= 16 && !TARGET_32BIT
+		 && TARGET_EFFICIENT_UNALIGNED_VSX && TARGET_P8_VECTOR);
+
+  /* We don't want to generate too much code.  The loop code can take
+     over for lengths greater than 31 bytes.  */
+  unsigned HOST_WIDE_INT max_bytes = rs6000_block_compare_inline_limit;
+
+  /* Don't generate too much code if vsx was disabled.  */
+  if (!use_vec && max_bytes > 1)
+    max_bytes = ((max_bytes + 1) / 2) - 1;
+
+  if (!IN_RANGE (bytes, 1, max_bytes))
+    return expand_compare_loop (operands);
+
+  /* The code generated for p7 and older is not faster than glibc
+     memcmp if alignment is small and length is not short, so bail
+     out to avoid those conditions.  */
+  if (!TARGET_EFFICIENT_OVERLAPPING_UNALIGNED
+      && ((base_align == 1 && bytes > 16)
+	  || (base_align == 2 && bytes > 32)))
+    return false;
+
+  rtx final_label = NULL;
+
+  if (use_vec)
     {
-      if (convert_label)
-	emit_label (convert_label);
+      rtx final_move_label = gen_label_rtx ();
+      rtx s1addr = gen_reg_rtx (Pmode);
+      rtx s2addr = gen_reg_rtx (Pmode);
+      rtx off_reg = gen_reg_rtx (Pmode);
+      rtx cleanup_label = NULL;
+      rtx vec_result = gen_reg_rtx (V16QImode);
+      rtx s1data = gen_reg_rtx (V16QImode);
+      rtx s2data = gen_reg_rtx (V16QImode);
+      rtx result_reg = gen_reg_rtx (word_mode);
+      emit_move_insn (result_reg, GEN_INT (0));

-      /* We need to produce DI result from sub, then convert to target SI
-	 while maintaining <0 / ==0 / >0 properties.  This sequence works:
-	 subfc L,A,B
-	 subfe H,H,H
-	 popcntd L,L
-	 rldimi L,H,6,0
+      expand_cmp_vec_sequence (bytes, orig_src1, orig_src2,
+			       s1addr, s2addr, off_reg, s1data, s2data,
+			       vec_result, false,
+			       &cleanup_label, final_move_label, false);

-	 This is an alternate one Segher cooked up if somebody
-	 wants to expand this for something that doesn't have popcntd:
-	 subfc L,a,b
-	 subfe H,x,x
-	 addic t,L,-1
-	 subfe v,t,L
-	 or z,v,H
+      if (cleanup_label)
+	emit_label (cleanup_label);

-	 And finally, p9 can just do this:
-	 cmpld A,B
-	 setb r */
+      emit_insn (gen_one_cmplv16qi2 (vec_result, vec_result));

-      if (TARGET_P9_MISC)
+      emit_final_compare_vec (s1data, s2data, result_reg,
+			      s1addr, s2addr, orig_src1, orig_src2,
+			      off_reg, vec_result);
+
+      emit_label (final_move_label);
+      emit_insn (gen_movsi (target,
+			    gen_lowpart (SImode, result_reg)));
+    }
+  else
+    { /* Generate GPR code.  */
+
+      rtx convert_label = NULL;
+      rtx sub_result = gen_reg_rtx (word_mode);
+      bool need_6432_conversion
+	= expand_block_compare_gpr (bytes, base_align,
+				    orig_src1, orig_src2,
+				    sub_result, cond, target,
+				    &convert_label, &final_label);
+
+      if (need_6432_conversion)
 	{
-	  emit_insn (gen_setb_unsigned (target, cond));
-	}
-      else
-	{
-	  if (TARGET_64BIT)
-	    {
-	      rtx tmp_reg_ca = gen_reg_rtx (DImode);
-	      emit_insn (gen_subfdi3_carry_in_xx (tmp_reg_ca));
-	      emit_insn (gen_popcntddi2 (tmp_reg_src2, tmp_reg_src2));
-	      emit_insn (gen_iordi3 (tmp_reg_src2, tmp_reg_src2, tmp_reg_ca));
-	      emit_insn (gen_movsi (target, gen_lowpart (SImode, tmp_reg_src2)));
-	    }
+	  if (convert_label)
+	    emit_label (convert_label);
+	  if (TARGET_P9_MISC)
+	    emit_insn (gen_setb_unsigned (target, cond));
 	  else
-	    {
-	      rtx tmp_reg_ca = gen_reg_rtx (SImode);
-	      emit_insn (gen_subfsi3_carry_in_xx (tmp_reg_ca));
-	      emit_insn (gen_popcntdsi2 (tmp_reg_src2, tmp_reg_src2));
-	      emit_insn (gen_iorsi3 (target, tmp_reg_src2, tmp_reg_ca));
-	    }
+	    generate_6432_conversion (target, sub_result);
 	}
     }

@@ -1700,7 +2081,6 @@ 
   if (final_label)
     emit_label (final_label);

-  gcc_assert (bytes == 0);
   return true;
 }

@@ -1808,7 +2188,7 @@ 
 	}
       rtx addr1 = gen_rtx_PLUS (Pmode, src1_addr, offset_rtx);
       rtx addr2 = gen_rtx_PLUS (Pmode, src2_addr, offset_rtx);
-	
+
       do_load_for_compare_from_addr (load_mode, tmp_reg_src1, addr1, orig_src1);
       do_load_for_compare_from_addr (load_mode, tmp_reg_src2, addr2, orig_src2);

@@ -1966,176 +2346,6 @@ 
   return;
 }

-/* Generate the sequence of compares for strcmp/strncmp using vec/vsx
-   instructions.
-
-   BYTES_TO_COMPARE is the number of bytes to be compared.
-   ORIG_SRC1 is the unmodified rtx for the first string.
-   ORIG_SRC2 is the unmodified rtx for the second string.
-   S1ADDR is the register to use for the base address of the first string.
-   S2ADDR is the register to use for the base address of the second string.
-   OFF_REG is the register to use for the string offset for loads.
-   S1DATA is the register for loading the first string.
-   S2DATA is the register for loading the second string.
-   VEC_RESULT is the rtx for the vector result indicating the byte difference.
-   EQUALITY_COMPARE_REST is a flag to indicate we need to make a cleanup call
-   to strcmp/strncmp if we have equality at the end of the inline comparison.
-   P_CLEANUP_LABEL is a pointer to rtx for a label we generate if we need code to clean up
-   and generate the final comparison result.
-   FINAL_MOVE_LABEL is rtx for a label we can branch to when we can just
-   set the final result.  */
-static void
-expand_strncmp_vec_sequence (unsigned HOST_WIDE_INT bytes_to_compare,
-			     rtx orig_src1, rtx orig_src2,
-			     rtx s1addr, rtx s2addr, rtx off_reg,
-			     rtx s1data, rtx s2data,
-			     rtx vec_result, bool equality_compare_rest,
-			     rtx *p_cleanup_label, rtx final_move_label)
-{
-  machine_mode load_mode;
-  unsigned int load_mode_size;
-  unsigned HOST_WIDE_INT cmp_bytes = 0;
-  unsigned HOST_WIDE_INT offset = 0;
-
-  gcc_assert (p_cleanup_label != NULL);
-  rtx cleanup_label = *p_cleanup_label;
-
-  emit_move_insn (s1addr, force_reg (Pmode, XEXP (orig_src1, 0)));
-  emit_move_insn (s2addr, force_reg (Pmode, XEXP (orig_src2, 0)));
-
-  unsigned int i;
-  rtx zr[16];
-  for (i = 0; i < 16; i++)
-    zr[i] = GEN_INT (0);
-  rtvec zv = gen_rtvec_v (16, zr);
-  rtx zero_reg = gen_reg_rtx (V16QImode);
-  rs6000_expand_vector_init (zero_reg, gen_rtx_PARALLEL (V16QImode, zv));
-
-  while (bytes_to_compare > 0)
-    {
-      /* VEC/VSX compare sequence for P8:
-	 check each 16B with:
-	 lxvd2x 32,28,8
-	 lxvd2x 33,29,8
-	 vcmpequb 2,0,1  # compare strings
-	 vcmpequb 4,0,3  # compare w/ 0
-	 xxlorc 37,36,34       # first FF byte is either mismatch or end of string
-	 vcmpequb. 7,5,3  # reg 7 contains 0
-	 bnl 6,.Lmismatch
-
-	 For the P8 LE case, we use lxvd2x and compare full 16 bytes
-	 but then use use vgbbd and a shift to get two bytes with the
-	 information we need in the correct order.
-
-	 VEC/VSX compare sequence if TARGET_P9_VECTOR:
-	 lxvb16x/lxvb16x     # load 16B of each string
-	 vcmpnezb.           # produces difference location or zero byte location
-	 bne 6,.Lmismatch
-
-	 Use the overlapping compare trick for the last block if it is
-	 less than 16 bytes.
-      */
-
-      load_mode = V16QImode;
-      load_mode_size = GET_MODE_SIZE (load_mode);
-
-      if (bytes_to_compare >= load_mode_size)
-	cmp_bytes = load_mode_size;
-      else
-	{
-	  /* Move this load back so it doesn't go past the end.  P8/P9
-	     can do this efficiently.  This is never called with less
-	     than 16 bytes so we should always be able to do this.  */
-	  unsigned int extra_bytes = load_mode_size - bytes_to_compare;
-	  cmp_bytes = bytes_to_compare;
-	  gcc_assert (offset > extra_bytes);
-	  offset -= extra_bytes;
-	  cmp_bytes = load_mode_size;
-	  bytes_to_compare = cmp_bytes;
-	}
-
-      /* The offset currently used is always kept in off_reg so that the
-	 cleanup code on P8 can use it to extract the differing byte.  */
-      emit_move_insn (off_reg, GEN_INT (offset));
-
-      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
-      do_load_for_compare_from_addr (load_mode, s1data, addr1, orig_src1);
-      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
-      do_load_for_compare_from_addr (load_mode, s2data, addr2, orig_src2);
-
-      /* Cases to handle.  A and B are chunks of the two strings.
-	 1: Not end of comparison:
-	 A != B: branch to cleanup code to compute result.
-	 A == B: next block
-	 2: End of the inline comparison:
-	 A != B: branch to cleanup code to compute result.
-	 A == B: call strcmp/strncmp
-	 3: compared requested N bytes:
-	 A == B: branch to result 0.
-	 A != B: cleanup code to compute result.  */
-
-      unsigned HOST_WIDE_INT remain = bytes_to_compare - cmp_bytes;
-
-      if (TARGET_P9_VECTOR)
-	emit_insn (gen_vcmpnezb_p (vec_result, s1data, s2data));
-      else
-	{
-	  /* Emit instructions to do comparison and zero check.  */
-	  rtx cmp_res = gen_reg_rtx (load_mode);
-	  rtx cmp_zero = gen_reg_rtx (load_mode);
-	  rtx cmp_combined = gen_reg_rtx (load_mode);
-	  emit_insn (gen_altivec_eqv16qi (cmp_res, s1data, s2data));
-	  emit_insn (gen_altivec_eqv16qi (cmp_zero, s1data, zero_reg));
-	  emit_insn (gen_orcv16qi3 (vec_result, cmp_zero, cmp_res));
-	  emit_insn (gen_altivec_vcmpequb_p (cmp_combined, vec_result, zero_reg));
-	}
-
-      bool branch_to_cleanup = (remain > 0 || equality_compare_rest);
-      rtx cr6 = gen_rtx_REG (CCmode, CR6_REGNO);
-      rtx dst_label;
-      rtx cmp_rtx;
-      if (branch_to_cleanup)
-	{
-	  /* Branch to cleanup code, otherwise fall through to do more
-	     compares.  P8 and P9 use different CR bits because on P8
-	     we are looking at the result of a comparsion vs a
-	     register of zeroes so the all-true condition means no
-	     difference or zero was found.  On P9, vcmpnezb sets a byte
-	     to 0xff if there is a mismatch or zero, so the all-false
-	     condition indicates we found no difference or zero.  */
-	  if (!cleanup_label)
-	    cleanup_label = gen_label_rtx ();
-	  dst_label = cleanup_label;
-	  if (TARGET_P9_VECTOR)
-	    cmp_rtx = gen_rtx_NE (VOIDmode, cr6, const0_rtx);
-	  else
-	    cmp_rtx = gen_rtx_GE (VOIDmode, cr6, const0_rtx);
-	}
-      else
-	{
-	  /* Branch to final return or fall through to cleanup,
-	     result is already set to 0.  */
-	  dst_label = final_move_label;
-	  if (TARGET_P9_VECTOR)
-	    cmp_rtx = gen_rtx_EQ (VOIDmode, cr6, const0_rtx);
-	  else
-	    cmp_rtx = gen_rtx_LT (VOIDmode, cr6, const0_rtx);
-	}
-
-      rtx lab_ref = gen_rtx_LABEL_REF (VOIDmode, dst_label);
-      rtx ifelse = gen_rtx_IF_THEN_ELSE (VOIDmode, cmp_rtx,
-					 lab_ref, pc_rtx);
-      rtx j2 = emit_jump_insn (gen_rtx_SET (pc_rtx, ifelse));
-      JUMP_LABEL (j2) = dst_label;
-      LABEL_NUSES (dst_label) += 1;
-
-      offset += cmp_bytes;
-      bytes_to_compare -= cmp_bytes;
-    }
-  *p_cleanup_label = cleanup_label;
-  return;
-}
-
 /* Generate the final sequence that identifies the differing
    byte and generates the final result, taking into account
    zero bytes:
@@ -2190,97 +2400,6 @@ 
   return;
 }

-/* Generate the final sequence that identifies the differing
-   byte and generates the final result, taking into account
-   zero bytes:
-
-   P8:
-        vgbbd 0,0
-        vsldoi 0,0,0,9
-        mfvsrd 9,32
-        addi 10,9,-1    # count trailing zero bits
-        andc 9,10,9
-        popcntd 9,9
-        lbzx 10,28,9    # use that offset to load differing byte
-        lbzx 3,29,9
-        subf 3,3,10     # subtract for final result
-
-   P9:
-	 vclzlsbb            # counts trailing bytes with lsb=0
-	 vextublx            # extract differing byte
-
-   STR1 is the reg rtx for data from string 1.
-   STR2 is the reg rtx for data from string 2.
-   RESULT is the reg rtx for the comparison result.
-   S1ADDR is the register to use for the base address of the first string.
-   S2ADDR is the register to use for the base address of the second string.
-   ORIG_SRC1 is the unmodified rtx for the first string.
-   ORIG_SRC2 is the unmodified rtx for the second string.
-   OFF_REG is the register to use for the string offset for loads.
-   VEC_RESULT is the rtx for the vector result indicating the byte difference.
-  */
-
-static void
-emit_final_str_compare_vec (rtx str1, rtx str2, rtx result,
-			    rtx s1addr, rtx s2addr,
-			    rtx orig_src1, rtx orig_src2,
-			    rtx off_reg, rtx vec_result)
-{
-  if (TARGET_P9_VECTOR)
-    {
-      rtx diffix = gen_reg_rtx (SImode);
-      rtx chr1 = gen_reg_rtx (SImode);
-      rtx chr2 = gen_reg_rtx (SImode);
-      rtx chr1_di = simplify_gen_subreg (DImode, chr1, SImode, 0);
-      rtx chr2_di = simplify_gen_subreg (DImode, chr2, SImode, 0);
-      emit_insn (gen_vclzlsbb_v16qi (diffix, vec_result));
-      emit_insn (gen_vextublx (chr1, diffix, str1));
-      emit_insn (gen_vextublx (chr2, diffix, str2));
-      do_sub3 (result, chr1_di, chr2_di);
-    }
-  else
-    {
-      gcc_assert (TARGET_P8_VECTOR);
-      rtx diffix = gen_reg_rtx (DImode);
-      rtx result_gbbd = gen_reg_rtx (V16QImode);
-      /* Since each byte of the input is either 00 or FF, the bytes in
-	 dw0 and dw1 after vgbbd are all identical to each other.  */
-      emit_insn (gen_p8v_vgbbd (result_gbbd, vec_result));
-      /* For LE, we shift by 9 and get BA in the low two bytes then CTZ.
-	 For BE, we shift by 7 and get AB in the high two bytes then CLZ.  */
-      rtx result_shifted = gen_reg_rtx (V16QImode);
-      int shift_amt = (BYTES_BIG_ENDIAN) ? 7 : 9;
-      emit_insn (gen_altivec_vsldoi_v16qi (result_shifted,result_gbbd,result_gbbd, GEN_INT (shift_amt)));
-
-      rtx diffix_df = simplify_gen_subreg (DFmode, diffix, DImode, 0);
-      emit_insn (gen_p8_mfvsrd_3_v16qi (diffix_df, result_shifted));
-      rtx count = gen_reg_rtx (DImode);
-
-      if (BYTES_BIG_ENDIAN)
-	emit_insn (gen_clzdi2 (count, diffix));
-      else
-	emit_insn (gen_ctzdi2 (count, diffix));
-
-      /* P8 doesn't have a good solution for extracting one byte from
-	 a vsx reg like vextublx on P9 so we just compute the offset
-	 of the differing byte and load it from each string.  */
-      do_add3 (off_reg, off_reg, count);
-
-      rtx chr1 = gen_reg_rtx (QImode);
-      rtx chr2 = gen_reg_rtx (QImode);
-      rtx addr1 = gen_rtx_PLUS (Pmode, s1addr, off_reg);
-      do_load_for_compare_from_addr (QImode, chr1, addr1, orig_src1);
-      rtx addr2 = gen_rtx_PLUS (Pmode, s2addr, off_reg);
-      do_load_for_compare_from_addr (QImode, chr2, addr2, orig_src2);
-      machine_mode rmode = GET_MODE (result);
-      rtx chr1_rm = simplify_gen_subreg (rmode, chr1, QImode, 0);
-      rtx chr2_rm = simplify_gen_subreg (rmode, chr2, QImode, 0);
-      do_sub3 (result, chr1_rm, chr2_rm);
-    }
-
-  return;
-}
-
 /* Expand a string compare operation with length, and return
    true if successful.  Return false if we should let the
    compiler generate normal code, probably a strncmp call.
@@ -2490,13 +2609,13 @@ 
       off_reg = gen_reg_rtx (Pmode);
       vec_result = gen_reg_rtx (load_mode);
       emit_move_insn (result_reg, GEN_INT (0));
-      expand_strncmp_vec_sequence (compare_length,
-				   orig_src1, orig_src2,
-				   s1addr, s2addr, off_reg,
-				   tmp_reg_src1, tmp_reg_src2,
-				   vec_result,
-				   equality_compare_rest,
-				   &cleanup_label, final_move_label);
+      expand_cmp_vec_sequence (compare_length,
+			       orig_src1, orig_src2,
+			       s1addr, s2addr, off_reg,
+			       tmp_reg_src1, tmp_reg_src2,
+			       vec_result,
+			       equality_compare_rest,
+			       &cleanup_label, final_move_label, true);
     }
   else
     expand_strncmp_gpr_sequence (compare_length, base_align,
@@ -2545,9 +2664,9 @@ 
     emit_label (cleanup_label);

   if (use_vec)
-    emit_final_str_compare_vec (tmp_reg_src1, tmp_reg_src2, result_reg,
-				s1addr, s2addr, orig_src1, orig_src2,
-				off_reg, vec_result);
+    emit_final_compare_vec (tmp_reg_src1, tmp_reg_src2, result_reg,
+			    s1addr, s2addr, orig_src1, orig_src2,
+			    off_reg, vec_result);
   else
     emit_final_str_compare_gpr (tmp_reg_src1, tmp_reg_src2, result_reg);

Index: gcc/config/rs6000/rs6000.opt
===================================================================
--- gcc/config/rs6000/rs6000.opt	(revision 266034)
+++ gcc/config/rs6000/rs6000.opt	(working copy)
@@ -326,7 +326,7 @@ 
 Max number of bytes to move inline.

 mblock-compare-inline-limit=
-Target Report Var(rs6000_block_compare_inline_limit) Init(31) RejectNegative Joined UInteger Save
+Target Report Var(rs6000_block_compare_inline_limit) Init(63) RejectNegative Joined UInteger Save
 Max number of bytes to compare without loops.

 mblock-compare-inline-loop-limit=