
[ARM] Improve ARM memset inlining

Message ID 000301cf9781$fd11c7d0$f7355770$@arm.com
State New

Commit Message

Bin Cheng July 4, 2014, 12:17 p.m. UTC
> -----Original Message-----
> From: Ramana Radhakrishnan [mailto:ramana.gcc@googlemail.com]
> Sent: Friday, June 27, 2014 9:22 AM
> To: Bin Cheng
> Cc: Richard Earnshaw; gcc-patches
> Subject: Re: [PATCH ARM] Improve ARM memset inlining
> 
> On Tue, May 6, 2014 at 5:59 AM, bin.cheng <bin.cheng@arm.com> wrote:
> >
> >
> >> -----Original Message-----
> >> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
> >> owner@gcc.gnu.org] On Behalf Of bin.cheng
> >> Sent: Monday, May 05, 2014 3:21 PM
> >> To: Richard Earnshaw
> >> Cc: gcc-patches@gcc.gnu.org
> >> Subject: RE: [PATCH ARM] Improve ARM memset inlining
> >>
> >> Hi Richard, thanks for reviewing.  I embedded answers to your
> >> comments and also updated the patch.
> >>
> >> > -----Original Message-----
> >> > From: Richard Earnshaw
> >> > Sent: Friday, May 02, 2014 10:00 PM
> >> > To: Bin Cheng
> >> > Cc: gcc-patches@gcc.gnu.org
> >> > Subject: Re: [PATCH ARM] Improve ARM memset inlining
> >> >
> >> > On 30/04/14 03:52, bin.cheng wrote:
> >> > > Hi,
> >> > > This patch expands small memset calls into direct memory-set
> >> > > instructions by introducing a "setmemsi" pattern.  For processors
> >> > > without NEON support, it expands memset using general store
> >> > > instructions, for example strd for 4-byte aligned addresses.
> >> > > For processors with NEON support, it expands memset using NEON
> >> > > instructions like vstr and the miscellaneous vst1.* instructions,
> >> > > for both aligned and unaligned cases.
> >> > >
> >> > > This patch depends on
> >> > > http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html, otherwise
> >> > > vst1.64 will be generated for 32-bit aligned memory units.
> >> > >
> >> > > There is one piece of leftover work for this patch: since vst1.*
> >> > > instructions only support the post-increment addressing mode, the
> >> > > inlined memset for unaligned NEON cases should look like:
> >> > >   vmov.i32   q8, #...
> >> > >   vst1.8     {q8}, [r3]!
> >> > >   vst1.8     {q8}, [r3]!
> >> > >   vst1.8     {q8}, [r3]!
> >> > >   vst1.8     {q8}, [r3]
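For reference, a minimal C sketch of the kind of call this expansion
targets (the function name, length, and byte value here are invented
for illustration):

    #include <string.h>

    void
    set_buf (char *p)
    {
      /* A constant length <= 64 and a constant byte value make this a
         candidate for the inlined NEON sequence shown above when the
         alignment of P is unknown.  */
      memset (p, 0xab, 64);
    }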
> >> >
> >> > Other than for zero, I'd expect the vmov to be vmov.i8 to move an
> >> > arbitrary byte value into all lanes in a vector.
> >> I just used vmov.i32 as an example.  The element size is actually
> >> calculated by function neon_valid_immediate, which works as expected
> >> I think.
> >>
> >> > After that, if the alignment is known to be more than 8-bit, I'd
> >> > expect the vst1 instructions (with the exception of the last store
> >> > if the length is not a multiple of the alignment) to use
> >> >
> >> >     vst1.<align> {reg}, [addr-reg :<align>]!
> >> >
> >> > Hence, for 16-bit aligned data, we want
> >> >
> >> >     vst1.16 {q8}, [r3:16]!
> >> Did I miss something important?  It seems to me the explicit
> >> alignment notes supported are 64/128/256.  So what do you mean by
> >> 16-bit alignment here?
> >>
> >> >
> >> > > But for now, GCC can't do this, and the code below is generated:
> >> > >   vmov.i32   q8, #...
> >> > >   vst1.8     {q8}, [r3]
> >> > >   add        r2,   r3,  #16
> >> > >   add        r3,   r2,  #16
> >> > >   vst1.8     {q8}, [r2]
> >> > >   vst1.8     {q8}, [r3]
> >> > >   add        r2,   r3,  #16
> >> > >   vst1.8     {q8}, [r2]
> >> > >
> >> > > I investigated this issue.  The root cause lies in the rtx costs
> >> > > returned by the ARM backend.  Anyway, I think this is another
> >> > > issue and should be fixed in a separate patch.
> 
> OK, it looks like Charles B. from Linaro has run into the same thing
> and has some fixes to suggest in the costs.
> 
> >> > >
> >> > > Bootstrapped and regression-tested on cortex-a15, with and without
> >> > > NEON support.  Is it OK?
> >> > >
> >> >
> >> > Some more comments inline.
> >> >
> >> > > Thanks,
> >> > > bin
> >> > >
> >> > >
> >> > > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
> >> > >
> >> > >   PR target/55701
> >> > >   * config/arm/arm.md (setmem): New pattern.
> >> > >   * config/arm/arm-protos.h (struct tune_params): New field.
> >> > >   (arm_gen_setmem): New prototype.
> >> > >   * config/arm/arm.c (arm_slowmul_tune): Initialize new field.
> >> > >   (arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
> >> > >   (arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
> >> > >   (arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
> >> > >   (arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
> >> > >   (arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
> >> > >   (arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
> >> > >   (arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
> >> > >   (arm_const_inline_cost): New function.
> >> > >   (arm_block_set_max_insns): New function.
> >> > >   (arm_block_set_straight_profit_p): New function.
> >> > >   (arm_block_set_vect_profit_p): New function.
> >> > >   (arm_block_set_unaligned_vect): New function.
> >> > >   (arm_block_set_aligned_vect): New function.
> >> > >   (arm_block_set_unaligned_straight): New function.
> >> > >   (arm_block_set_aligned_straight): New function.
> >> > >   (arm_block_set_vect, arm_gen_setmem): New functions.
> >> > >
> >> > > gcc/testsuite/ChangeLog
> >> > > 2014-04-29  Bin Cheng  <bin.cheng@arm.com>
> >> > >
> >> > >   PR target/55701
> >> > >   * gcc.target/arm/memset-inline-1.c: New test.
> >> > >   * gcc.target/arm/memset-inline-2.c: New test.
> >> > >   * gcc.target/arm/memset-inline-3.c: New test.
> >> > >   * gcc.target/arm/memset-inline-4.c: New test.
> >> > >   * gcc.target/arm/memset-inline-5.c: New test.
> >> > >   * gcc.target/arm/memset-inline-6.c: New test.
> >> > >   * gcc.target/arm/memset-inline-7.c: New test.
> >> > >   * gcc.target/arm/memset-inline-8.c: New test.
> >> > >   * gcc.target/arm/memset-inline-9.c: New test.
> >> > >
> >> > >
> >> > > j1328-20140429.txt
> >> > >
> >> > >
> >> > > Index: gcc/config/arm/arm.c
> >> > >
> >> > > ===================================================================
> >> > > --- gcc/config/arm/arm.c  (revision 209852)
> >> > > +++ gcc/config/arm/arm.c  (working copy)
> >> > > @@ -1585,10 +1585,11 @@ const struct tune_params
> arm_slowmul_tune
> >> =
> >> > >    true,                                          /* Prefer constant
> >> > pool.  */
> >> > >    arm_default_branch_cost,
> >> > >    false,                                 /* Prefer LDRD/STRD.  */
> >> > > -  {true, true},                                  /* Prefer non short
> >> > circuit.  */
> >> > > -  &arm_default_vec_cost,                        /* Vectorizer costs.
> >> */
> >> > > -  false,                                        /* Prefer Neon for
> >> 64-bits bitops.  */
> >> > > -  false, false                                  /* Prefer 32-bit
> >> encodings.  */
> >> > > +  {true, true},                          /* Prefer non short circuit.
> >> */
> >> > > +  &arm_default_vec_cost,                /* Vectorizer costs.  */
> >> > > +  false,                                /* Prefer Neon for 64-bits
> >> bitops.  */
> >> > > +  false, false,                         /* Prefer 32-bit encodings.
> > */
> >> > > +  false                                 /* Prefer Neon for stringops.
> >> */
> >> > >  };
> >> > >
> >> >
> >> > Please make sure that all the white space before the comments is
> >> > using TAB, not spaces.  Similarly for the other tables.
> >> Fixed.
> >>
> >> >
> >> > > @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
> >> > >                          NULL_RTX, NULL_RTX, 0, 0));
> >> > >  }
> >> > >
> >> > > +/* Cost of loading a SImode constant.  */
> >> > > +static inline int
> >> > > +arm_const_inline_cost (rtx val)
> >> > > +{
> >> > > +  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
> >> > > +                           NULL_RTX, NULL_RTX, 0, 0);
> >> > > +}
> >> > > +
> >> >
> >> > This could be used more widely if you passed the SET in as a
> >> > parameter (there are cases in arm_new_rtx_cost that could use it,
> >> > for example).
> >> > Also, you want to enable sub-targets (it's only once you can't create
> >> > new pseudos that this is not safe), so the penultimate argument in the
> >> > call to arm_gen_constant should be 1.
> >> Fixed.
> 
> 
> 
> >>
> >> >
> >> > >  /* Return true if it is worthwhile to split a 64-bit constant into two
> >> > >     32-bit operations.  This is the case if optimizing for size, or
> >> > >     if we have load delay slots, or if one 32-bit part can be done with
> >> > > @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx *comparison, rtx * op
> >> > >
> >> > >  }
> >> > >
> >> > > +/* Maximum number of instructions to set block of memory.  */
> >> > > +static int
> >> > > +arm_block_set_max_insns (void)
> >> > > +{
> >> > > +  return (optimize_function_for_size_p (cfun) ? 4 : 8);
> >> > > +}
> >> >
> >> > I think the non-size_p alternative should really be a parameter in
> >> > the
> >> per-cpu
> >> > costs table.
> >> Fixed.
> >>
> >> >
> >> > > +
> >> > > +/* Return TRUE if it's profitable to set block of memory for straight
> >> > > +   case.  */
> >> >
> >> > "Straight" is confusing here.  Do you mean non-vectorized?  If so,
> >> > then non_vect might be clearer.
> >> Fixed.
> >>
> >> >
> >> > The arguments should really be documented (see comment below
> about
> >> > align, for example).
> >> Fixed.
> >>
> >> >
> >> > > +static bool
> >> > > +arm_block_set_straight_profit_p (rtx val,
> >> > > +                                 unsigned HOST_WIDE_INT length,
> >> > > +                                 unsigned HOST_WIDE_INT align,
> >> > > +                                 bool unaligned_p, bool use_strd_p)
> >> > > +{
> >> > > +  int num = 0;
> >> > > +  /* For leftovers in bytes of 0-7, we can set the memory block using
> >> > > +     strb/strh/str with minimum instruction number.  */
> >> > > +  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
> >> >
> >> > This should be marked const.
> >> Fixed.
> >>
> >> >
> >> > > +
> >> > > +  if (unaligned_p)
> >> > > +    {
> >> > > +      num = arm_const_inline_cost (val);
> >> > > +      num += length / align + length % align;
> >> >
> >> > Isn't align in bits here, when you really want it in bytes?
> >> All alignments are in bytes, starting from the "setmem" pattern.
> >>
> >> >
> >> > What if align > 4 bytes?
> >> Then it's the "!unaligned_p" case and handled by other arms of this
> >> if statement.
> >>
> >> >
> >> > > +    }
> >> > > +  else if (use_strd_p)
> >> > > +    {
> >> > > +      num = arm_const_double_inline_cost (val);
> >> > > +      num += (length >> 3) + leftover[length & 7];
> >> > > +    }
> >> > > +  else
> >> > > +    {
> >> > > +      num = arm_const_inline_cost (val);
> >> > > +      num += (length >> 2) + leftover[length & 3];
> >> > > +    }
> >> > > +
> >> > > +  /* We may be able to combine last pair STRH/STRB into a single STR
> >> > > +     by shifting one byte back.  */
> >> > > +  if (unaligned_access && length > 3 && (length & 3) == 3)
> >> > > +    num--;
> >> > > +
> >> > > +  return (num <= arm_block_set_max_insns ());
> >> > > +}
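As a standalone model of the count computed above (a sketch only; the
helper name is invented, and it ignores the constant-loading cost that
the real code adds via arm_const_inline_cost or
arm_const_double_inline_cost):

    /* Sketch: number of word stores for a LENGTH-byte non-vectorized
       memset, plus the leftover table, minus one when the trailing
       STRH/STRB pair can be merged into a single unaligned STR.  */
    static int
    count_word_stores (unsigned int length, int unaligned_access)
    {
      static const int leftover[4] = {0, 1, 1, 2};
      int num = (length >> 2) + leftover[length & 3];
      if (unaligned_access && length > 3 && (length & 3) == 3)
        num--;
      return num;
    }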
> >> > > +
> >> > > +/* Return TRUE if it's profitable to set block of memory for
> >> > > +   vector case.  */
> >> > > +static bool
> >> > > +arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
> >> > > +                             unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED,
> >> > > +                             bool unaligned_p, enum machine_mode mode)
> >> >
> >> > I'm not sure what you mean by unaligned here.  Again, documenting
> >> > the arguments might help.
> >> Fixed.
> >>
> >> >
> >> > > +{
> >> > > +  int num;
> >> > > +  unsigned int nelt = GET_MODE_NUNITS (mode);
> >> > > +
> >> > > +  /* Num of instruction loading constant value.  */
> >> >
> >> > Use either "Number" or, in this case, simply drop that bit and write:
> >> >   /* Instruction loading constant value.  */
> >> Fixed.
> >>
> >> >
> >> > > +  num = 1;
> >> > > +  /* Num of store instructions.  */
> >> >
> >> > Likewise.
> >> >
> >> > > +  num += (length + nelt - 1) / nelt;
> >> > > +  /* Num of address adjusting instructions.  */
> >> >
> >> > Can't we work on the premise that the address adjusting
> >> > instructions will
> >> be
> >> > merged into the stores?  I know you said that they currently do
> >> > not, but that's not a problem that this bit of code should have to
> >> > worry
> > about.
> >> Fixed.
> >>
> >> >
> >> > > +  if (unaligned_p)
> >> > > +    /* For unaligned case, it's one less than the store instructions.  */
> >> > > +    num += (length + nelt - 1) / nelt - 1;
> >> > > +  else if ((length & 3) != 0)
> >> > > +    /* For aligned case, it's one if bytes leftover can only be stored
> >> > > +       by mis-aligned store instruction.  */
> >> > > +    num++;
> >> > > +
> >> > > +  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
> >> > > +  if (!unaligned_p && mode == V16QImode)
> >> > > +    num--;
> >> > > +
> >> > > +  return (num <= arm_block_set_max_insns ());
> >> > > +}
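The vector case can be modelled the same way (again a sketch with an
invented name, mirroring the code as posted, including the per-store
address updates that the review above suggests should be assumed
merged, and omitting the aligned-V16QI first-store credit):

    /* Sketch: instruction count for a LENGTH-byte memset using vectors
       of NELT bytes: one constant load, the stores themselves, and for
       the unaligned path one address update per store except the last.  */
    static int
    count_vect_insns (unsigned int length, unsigned int nelt,
                      int unaligned_p)
    {
      int num = 1;                                /* Constant load.  */
      num += (length + nelt - 1) / nelt;          /* Stores.  */
      if (unaligned_p)
        num += (length + nelt - 1) / nelt - 1;    /* Address updates.  */
      else if ((length & 3) != 0)
        num++;                                    /* Mis-aligned tail.  */
      return num;
    }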
> >> > > +
> >> > > +/* Set a block of memory using vectorization instructions for the
> >> > > +   unaligned case.  We fill the first LENGTH bytes of the memory
> >> > > +   area starting from DSTBASE with byte constant VALUE.  ALIGN is
> >> > > +   the alignment requirement of memory.  */
> >> >
> >> > What's the return value mean?
> >> Documented.
> >>
> >> >
> >> > > +static bool
> >> > > +arm_block_set_unaligned_vect (rtx dstbase,
> >> > > +                              unsigned HOST_WIDE_INT length,
> >> > > +                              unsigned HOST_WIDE_INT value,
> >> > > +                              unsigned HOST_WIDE_INT align)
> >> > > +{
> >> > > +  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
> >> >
> >> > Don't mix initialized declarations with unitialized ones on the
> >> > same
> > line.
> >> You
> >> > don't appear to use either I or J until their first use in the loop
> >> control below,
> >> > so why initialize them here?
> >> Fixed.
> >>
> >> >
> >> > > +  rtx dst, mem;
> >> > > +  rtx val_elt, val_vec, reg;
> >> > > +  rtx rval[MAX_VECT_LEN];
> >> > > +  rtx (*gen_func) (rtx, rtx);
> >> > > +  enum machine_mode mode;
> >> > > +  unsigned HOST_WIDE_INT v = value;
> >> > > +
> >> > > +  gcc_assert ((align & 0x3) != 0);
> >> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> >> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);
> >> > > +  if (length >= nelt_v16)
> >> > > +    {
> >> > > +      mode = V16QImode;
> >> > > +      gen_func = gen_movmisalignv16qi;
> >> > > +    }
> >> > > +  else
> >> > > +    {
> >> > > +      mode = V8QImode;
> >> > > +      gen_func = gen_movmisalignv8qi;
> >> > > +    }
> >> > > +  nelt_mode = GET_MODE_NUNITS (mode);
> >> > > +  gcc_assert (length >= nelt_mode);
> >> > > +  /* Skip if it isn't profitable.  */
> >> > > +  if (!arm_block_set_vect_profit_p (length, align, true, mode))
> >> > > +    return false;
> >> > > +
> >> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> >> > > +  mem = adjust_automodify_address (dstbase, mode, dst, 0);
> >> > > +
> >> > > +  v = sext_hwi (v, BITS_PER_WORD);
> >> > > +  val_elt = GEN_INT (v);
> >> > > +  for (; j < nelt_mode; j++)
> >> > > +    rval[j] = val_elt;
> >> >
> >> > Is this the first use of J?  If so, initialize it here.
> >> >
> >> > > +
> >> > > +  reg = gen_reg_rtx (mode);
> >> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
> >> > > +  /* Emit instruction loading the constant value.  */
> >> > > +  emit_move_insn (reg, val_vec);
> >> > > +
> >> > > +  /* Handle nelt_mode bytes in a vector.  */
> >> > > +  for (; (i + nelt_mode <= length); i += nelt_mode)
> >> >
> >> > Similarly for I.
> >> >
> >> > > +    {
> >> > > +      emit_insn ((*gen_func) (mem, reg));
> >> > > +      if (i + 2 * nelt_mode <= length)
> >> > > +        emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
> >> > > +    }
> >> > > +
> >> > > +  if (i + nelt_v8 <= length)
> >> > > +    gcc_assert (mode == V16QImode);
> >> >
> >> > Why not drop the if and write:
> >> >
> >> >      gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
> >> Fixed.
> >>
> >> >
> >> > > +
> >> > > +  /* Handle (8, 16) bytes leftover.  */
> >> > > +  if (i + nelt_v8 < length)
> >> >
> >> > Your assertion above checked <=, but here you use <.  Is that correct?
> >> Yes, it is.  For the "==" case, it means we have nelt_v8 bytes
> >> leftover, which will be handled by the last branch of the if
> >> statement.
> >>
> >> >
> >> > > +    {
> >> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
> >> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
> >> > > +      if ((length & 1) != 0 && align >= 2)
> >> > > +        set_mem_align (mem, BITS_PER_UNIT);
> >> > > +
> >> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> >> > > +    }
> >> > > +  /* Handle (0, 8] bytes leftover.  */
> >> > > +  else if (i < length && i + nelt_v8 >= length)
> >> > > +    {
> >> > > +      if (mode == V16QImode)
> >> > > +        {
> >> > > +          reg = gen_lowpart (V8QImode, reg);
> >> > > +          mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
> >> > > +        }
> >> > > +      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
> >> > > +                                              + (nelt_mode - nelt_v8))));
> >> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
> >> > > +      if ((length & 1) != 0 && align >= 2)
> >> > > +        set_mem_align (mem, BITS_PER_UNIT);
> >> > > +
> >> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> >> > > +    }
> >> > > +
> >> > > +  return true;
> >> > > +}
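The "shifting bytes back" trick for the leftovers is easier to see in
plain C (a sketch; memcpy of a prepared vector stands in for vst1, the
function name is invented, and LENGTH >= WIDTH is assumed):

    #include <string.h>

    /* Sketch: after I full vector stores of WIDTH bytes each, finish a
       LENGTH-byte fill by backing up so one final full-width store ends
       exactly at LENGTH, overlapping bytes that were already written.  */
    static void
    store_tail (unsigned char *dst, unsigned int i, unsigned int length,
                const unsigned char *vec, unsigned int width)
    {
      if (i < length)
        memcpy (dst + length - width, vec, width);
    }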
> >> > > +
> >> > > +/* Set a block of memory using vectorization instructions for the
> >> > > +   aligned case.  We fill the first LENGTH bytes of the memory area
> >> > > +   starting from DSTBASE with byte constant VALUE.  ALIGN is the
> >> > > +   alignment requirement of memory.  */
> >> >
> >> > See all the comments above for the unaligned case.
> >> Fixed accordingly.
> >>
> >> >
> >> > > +static bool
> >> > > +arm_block_set_aligned_vect (rtx dstbase,
> >> > > +                            unsigned HOST_WIDE_INT length,
> >> > > +                            unsigned HOST_WIDE_INT value,
> >> > > +                            unsigned HOST_WIDE_INT align)
> >> > > +{
> >> > > +  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
> >> > > +  rtx dst, addr, mem;
> >> > > +  rtx val_elt, val_vec, reg;
> >> > > +  rtx rval[MAX_VECT_LEN];
> >> > > +  enum machine_mode mode;
> >> > > +  unsigned HOST_WIDE_INT v = value;
> >> > > +
> >> > > +  gcc_assert ((align & 0x3) == 0);
> >> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
> >> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);
> >> > > +  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
> >> > > +    mode = V16QImode;
> >> > > +  else
> >> > > +    mode = V8QImode;
> >> > > +
> >> > > +  nelt_mode = GET_MODE_NUNITS (mode);
> >> > > +  gcc_assert (length >= nelt_mode);
> >> > > +  /* Skip if it isn't profitable.  */
> >> > > +  if (!arm_block_set_vect_profit_p (length, align, false, mode))
> >> > > +    return false;
> >> > > +
> >> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> >> > > +
> >> > > +  v = sext_hwi (v, BITS_PER_WORD);
> >> > > +  val_elt = GEN_INT (v);
> >> > > +  for (; j < nelt_mode; j++)
> >> > > +    rval[j] = val_elt;
> >> > > +
> >> > > +  reg = gen_reg_rtx (mode);
> >> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
> >> > > +  /* Emit instruction loading the constant value.  */
> >> > > +  emit_move_insn (reg, val_vec);
> >> > > +
> >> > > +  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
> >> > > +  if (mode == V16QImode)
> >> > > +    {
> >> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> >> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
> >> > > +      i += nelt_mode;
> >> > > +      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
> >> > > +      if (i + nelt_v8 < length && i + nelt_v16 > length)
> >> > > +        {
> >> > > +          emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> >> > > +          mem = adjust_automodify_address (dstbase, mode, dst, 0);
> >> > > +          /* We are shifting bytes back, set the alignment accordingly.  */
> >> > > +          if ((length & 0x3) == 0)
> >> > > +            set_mem_align (mem, BITS_PER_UNIT * 4);
> >> > > +          else if ((length & 0x1) == 0)
> >> > > +            set_mem_align (mem, BITS_PER_UNIT * 2);
> >> > > +          else
> >> > > +            set_mem_align (mem, BITS_PER_UNIT);
> >> > > +
> >> > > +          emit_insn (gen_movmisalignv16qi (mem, reg));
> >> > > +          return true;
> >> > > +        }
> >> > > +      /* Fall through for bytes leftover.  */
> >> > > +      mode = V8QImode;
> >> > > +      nelt_mode = GET_MODE_NUNITS (mode);
> >> > > +      reg = gen_lowpart (V8QImode, reg);
> >> > > +    }
> >> > > +
> >> > > +  /* Handle 8 bytes in a vector.  */
> >> > > +  for (; (i + nelt_mode <= length); i += nelt_mode)
> >> > > +    {
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> >> > > +      emit_move_insn (mem, reg);
> >> > > +    }
> >> > > +
> >> > > +  /* Handle single word leftover by shifting 4 bytes back.  We can
> >> > > +     use aligned access for this case.  */
> >> > > +  if (i + UNITS_PER_WORD == length)
> >> > > +    {
> >> > > +      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
> >> > > +      mem = adjust_automodify_address (dstbase, mode,
> >> > > +                                       addr, i - UNITS_PER_WORD);
> >> > > +      /* We are shifting 4 bytes back, set the alignment accordingly.  */
> >> > > +      if (align > UNITS_PER_WORD)
> >> > > +        set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
> >> > > +
> >> > > +      emit_move_insn (mem, reg);
> >> > > +    }
> >> > > +  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
> >> > > +     We have to use unaligned access for this case.  */
> >> > > +  else if (i < length)
> >> > > +    {
> >> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
> >> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
> >> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
> >> > > +      if ((length & 1) == 0)
> >> > > +        set_mem_align (mem, BITS_PER_UNIT * 2);
> >> > > +      else
> >> > > +        set_mem_align (mem, BITS_PER_UNIT);
> >> > > +
> >> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
> >> > > +    }
> >> > > +
> >> > > +  return true;
> >> > > +}
> >> > > +
> >> > > +/* Set a block of memory using plain strh/strb instructions, only
> >> > > +   using instructions allowed by ALIGN on processor.  We fill the
> >> > > +   first LENGTH bytes of the memory area starting from DSTBASE
> >> > > +   with byte constant VALUE.  ALIGN is the alignment requirement
> >> > > +   of memory.  */
> >> > > +static bool
> >> > > +arm_block_set_unaligned_straight (rtx dstbase,
> >> > > +                                  unsigned HOST_WIDE_INT length,
> >> > > +                                  unsigned HOST_WIDE_INT value,
> >> > > +                                  unsigned HOST_WIDE_INT align)
> >> > > +{
> >> > > +  unsigned int i;
> >> > > +  rtx dst, addr, mem;
> >> > > +  rtx val_exp, val_reg, reg;
> >> > > +  enum machine_mode mode;
> >> > > +  HOST_WIDE_INT v = value;
> >> > > +
> >> > > +  gcc_assert (align == 1 || align == 2);
> >> > > +
> >> > > +  if (align == 2)
> >> > > +    v |= (value << BITS_PER_UNIT);
> >> > > +
> >> > > +  v = sext_hwi (v, BITS_PER_WORD);
> >> > > +  val_exp = GEN_INT (v);
> >> > > +  /* Skip if it isn't profitable.  */
> >> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
> >> > > +                                        align, true, false))
> >> > > +    return false;
> >> > > +
> >> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> >> > > +  mode = (align == 2 ? HImode : QImode);
> >> > > +  val_reg = force_reg (SImode, val_exp);
> >> > > +  reg = gen_lowpart (mode, val_reg);
> >> > > +
> >> > > +  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE (mode))
> >> > > +    {
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
> >> > > +      emit_move_insn (mem, reg);
> >> > > +    }
> >> > > +
> >> > > +  /* Handle single byte leftover.  */
> >> > > +  if (i + 1 == length)
> >> > > +    {
> >> > > +      reg = gen_lowpart (QImode, val_reg);
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> >> > > +      emit_move_insn (mem, reg);
> >> > > +      i++;
> >> > > +    }
> >> > > +
> >> > > +  gcc_assert (i == length);
> >> > > +  return true;
> >> > > +}
> >> > > +
> >> > > +/* Set a block of memory using plain strd/str/strh/strb instructions,
> >> > > +   to permit unaligned copies on processors which support unaligned
> >> > > +   semantics for those instructions.  We fill the first LENGTH bytes
> >> > > +   of the memory area starting from DSTBASE with byte constant VALUE.
> >> > > +   ALIGN is the alignment requirement of memory.  */
> >> > > +static bool
> >> > > +arm_block_set_aligned_straight (rtx dstbase,
> >> > > +                                unsigned HOST_WIDE_INT length,
> >> > > +                                unsigned HOST_WIDE_INT value,
> >> > > +                                unsigned HOST_WIDE_INT align)
> >> > > +{
> >> > > +  unsigned int i = 0;
> >> > > +  rtx dst, addr, mem;
> >> > > +  rtx val_exp, val_reg, reg;
> >> > > +  unsigned HOST_WIDE_INT v;
> >> > > +  bool use_strd_p;
> >> > > +
> >> > > +  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
> >> > > +                && TARGET_LDRD && current_tune->prefer_ldrd_strd);
> >> > > +
> >> > > +  v = (value | (value << 8) | (value << 16) | (value << 24));
> >> > > +  if (length < UNITS_PER_WORD)
> >> > > +    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
> >> > > +
> >> > > +  if (use_strd_p)
> >> > > +    v |= (v << BITS_PER_WORD);
> >> > > +  else
> >> > > +    v = sext_hwi (v, BITS_PER_WORD);
> >> > > +
> >> > > +  val_exp = GEN_INT (v);
> >> > > +  /* Skip if it isn't profitable.  */
> >> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
> >> > > +                                        align, false, use_strd_p))
> >> > > +    {
> >> > > +      /* Try without strd.  */
> >> > > +      v = (v >> BITS_PER_WORD);
> >> > > +      v = sext_hwi (v, BITS_PER_WORD);
> >> > > +      val_exp = GEN_INT (v);
> >> > > +      use_strd_p = false;
> >> > > +      if (!arm_block_set_straight_profit_p (val_exp, length,
> >> > > +                                            align, false, use_strd_p))
> >> > > +        return false;
> >> > > +    }
> >> > > +
> >> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
> >> > > +  /* Handle double words using strd if possible.  */
> >> > > +  if (use_strd_p)
> >> > > +    {
> >> > > +      val_reg = force_reg (DImode, val_exp);
> >> > > +      reg = val_reg;
> >> > > +      for (; (i + 8 <= length); i += 8)
> >> > > +        {
> >> > > +          addr = plus_constant (Pmode, dst, i);
> >> > > +          mem = adjust_automodify_address (dstbase, DImode, addr, i);
> >> > > +          emit_move_insn (mem, reg);
> >> > > +        }
> >> > > +    }
> >> > > +  else
> >> > > +    val_reg = force_reg (SImode, val_exp);
> >> > > +
> >> > > +  /* Handle words.  */
> >> > > +  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
> >> > > +  for (; (i + 4 <= length); i += 4)
> >> > > +    {
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i);
> >> > > +      if ((align & 3) == 0)
> >> > > +        emit_move_insn (mem, reg);
> >> > > +      else
> >> > > +        emit_insn (gen_unaligned_storesi (mem, reg));
> >> > > +    }
> >> > > +
> >> > > +  /* Merge last pair of STRH and STRB into a STR if possible.  */
> >> > > +  if (unaligned_access && i > 0 && (i + 3) == length)
> >> > > +    {
> >> > > +      addr = plus_constant (Pmode, dst, i - 1);
> >> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
> >> > > +      /* We are shifting one byte back, set the alignment accordingly.  */
> >> > > +      if ((align & 1) == 0)
> >> > > +        set_mem_align (mem, BITS_PER_UNIT);
> >> > > +
> >> > > +      /* Most likely this is an unaligned access, and we can't tell at
> >> > > +         compilation time.  */
> >> > > +      emit_insn (gen_unaligned_storesi (mem, reg));
> >> > > +      return true;
> >> > > +    }
> >> > > +
> >> > > +  /* Handle half word leftover.  */
> >> > > +  if (i + 2 <= length)
> >> > > +    {
> >> > > +      reg = gen_lowpart (HImode, val_reg);
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, HImode, addr, i);
> >> > > +      if ((align & 1) == 0)
> >> > > +        emit_move_insn (mem, reg);
> >> > > +      else
> >> > > +        emit_insn (gen_unaligned_storehi (mem, reg));
> >> > > +
> >> > > +      i += 2;
> >> > > +    }
> >> > > +
> >> > > +  /* Handle single byte leftover.  */
> >> > > +  if (i + 1 == length)
> >> > > +    {
> >> > > +      reg = gen_lowpart (QImode, val_reg);
> >> > > +      addr = plus_constant (Pmode, dst, i);
> >> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
> >> > > +      emit_move_insn (mem, reg);
> >> > > +    }
> >> > > +
> >> > > +  return true;
> >> > > +}
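The value replication at the top of this function is plain byte
broadcasting; as a standalone sketch (the helper name is invented):

    /* Sketch: broadcast a byte into a 32-bit word, and into both words
       of a 64-bit value when the STRD path is taken.  */
    static unsigned long long
    broadcast_byte (unsigned char value, int use_strd)
    {
      unsigned long long v = value;
      v |= v << 8;
      v |= v << 16;             /* 0xVV in all four bytes.  */
      if (use_strd)
        v |= v << 32;           /* Duplicate into the high word.  */
      return v;
    }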
> >> > > +
> >> > > +/* Set a block of memory using vectorization instructions for both
> >> > > +   aligned and unaligned cases.  We fill the first LENGTH bytes of
> >> > > +   the memory area starting from DSTBASE with byte constant VALUE.
> >> > > +   ALIGN is the alignment requirement of memory.  */
> >> > > +static bool
> >> > > +arm_block_set_vect (rtx dstbase,
> >> > > +                    unsigned HOST_WIDE_INT length,
> >> > > +                    unsigned HOST_WIDE_INT value,
> >> > > +                    unsigned HOST_WIDE_INT align)
> >> > > +{
> >> > > +  /* Check whether we need to use unaligned store instruction.  */
> >> > > +  if (((align & 3) != 0 || (length & 3) != 0)
> >> > > +      /* Check whether unaligned store instruction is available.  */
> >> > > +      && (!unaligned_access || BYTES_BIG_ENDIAN))
> >> > > +    return false;
> >> >
> >> > Huh!  vst1.8 can work for unaligned accesses even when hw alignment
> >> > checking is strict.
> >> Hmm, all movmisalign patterns are guarded by "!BYTES_BIG_ENDIAN &&
> >> unaligned_access", so vst1.8 instructions can't be recognized this
> >> way at the moment.  I agree that it's too strict, but that's another
> >> problem, I think.
> 
> That was introduced to "fix up" the issue with another test IIRC.
> That's probably not related to this particular patch. It's the other thread that's
> been ongoing with Maciej, so let's continue it there.
> 
> >>
> >> >
> >> > > +
> >> > > +  if ((align & 3) == 0)
> >> > > +    return arm_block_set_aligned_vect (dstbase, length, value, align);
> >> > > +  else
> >> > > +    return arm_block_set_unaligned_vect (dstbase, length, value, align);
> >> > > +}
> >> > > +
> >> > > +/* Expand string store operation.  Firstly we try to do that by using
> >> > > +   vectorization instructions, then try with ARM unaligned access and
> >> > > +   double-word store if profitable.  OPERANDS[0] is the destination,
> >> > > +   OPERANDS[1] is the number of bytes, OPERANDS[2] is the value to
> >> > > +   initialize the memory, OPERANDS[3] is the known alignment of the
> >> > > +   destination.  */
> >> > > +bool
> >> > > +arm_gen_setmem (rtx *operands)
> >> > > +{
> >> > > +  rtx dstbase = operands[0];
> >> > > +  unsigned HOST_WIDE_INT length;
> >> > > +  unsigned HOST_WIDE_INT value;
> >> > > +  unsigned HOST_WIDE_INT align;
> >> > > +
> >> > > +  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
> >> > > +    return false;
> >> > > +
> >> > > +  length = UINTVAL (operands[1]);
> >> > > +  if (length > 64)
> >> > > +    return false;
> >> > > +
> >> > > +  value = (UINTVAL (operands[2]) & 0xFF);
> >> > > +  align = UINTVAL (operands[3]);
> >> > > +  if (TARGET_NEON && length >= 8
> >> > > +      && current_tune->string_ops_prefer_neon
> >> > > +      && arm_block_set_vect (dstbase, length, value, align))
> >> > > +    return true;
> >> > > +
> >> > > +  if (!unaligned_access && (align & 3) != 0)
> >> > > +    return arm_block_set_unaligned_straight (dstbase, length,
> >> > > +                                             value, align);
> >> > > +
> >> > > +  return arm_block_set_aligned_straight (dstbase, length, value, align);
> >> > > +}
> >> > > +
> >> > >  /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
> >> > >
> >> > >  static unsigned HOST_WIDE_INT
> >> > > Index: gcc/config/arm/arm-protos.h
> >> > >
> >> > > ===================================================================
> >> > > --- gcc/config/arm/arm-protos.h   (revision 209852)
> >> > > +++ gcc/config/arm/arm-protos.h   (working copy)
> >> > > @@ -277,6 +277,8 @@ struct tune_params
> >> > >    /* Prefer 32-bit encoding instead of 16-bit encoding where subset
> >> > >       of flags would be set.  */
> >> > >    bool disparage_partial_flag_setting_t16_encodings;
> >> > > +  /* Prefer to inline string operations like memset by using Neon.  */
> >> > > +  bool string_ops_prefer_neon;
> >> > >  };
> >> > >
> >> > >  extern const struct tune_params *current_tune;
> >> > > @@ -289,6 +291,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
> >> > >  extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
> >> > >  #endif /* RTX_CODE */
> >> > >
> >> > > +extern bool arm_gen_setmem (rtx *);
> >> > >  extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
> >> > >  extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1,
> >> > >                                         rtx sel);
> >> > >
> >> > > Index: gcc/config/arm/arm.md
> >> > >
> >> > > ===================================================================
> >> > > --- gcc/config/arm/arm.md (revision 209852)
> >> > > +++ gcc/config/arm/arm.md (working copy)
> >> > > @@ -7555,6 +7555,20 @@
> >> > >  })
> >> > >
> >> > >
> >> > > +(define_expand "setmemsi"
> >> > > +  [(match_operand:BLK 0 "general_operand" "")
> >> > > +   (match_operand:SI 1 "const_int_operand" "")
> >> > > +   (match_operand:SI 2 "const_int_operand" "")
> >> > > +   (match_operand:SI 3 "const_int_operand" "")]
> >> > > +  "TARGET_32BIT"
> >> > > +{
> >> > > +  if (arm_gen_setmem (operands))
> >> > > +    DONE;
> >> > > +
> >> > > +  FAIL;
> >> > > +})
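To illustrate, assuming the expander as posted, a short constant-length
call can be inlined while a long or variable-length one makes the
pattern FAIL and keeps the library call (the example values and
function name are invented):

    #include <string.h>

    void
    f (char *p, size_t n)
    {
      memset (p, 0, 48);   /* Constant length <= 64: arm_gen_setmem can
                              inline this.  */
      memset (p, 0, n);    /* Variable length: the expander FAILs and
                              the memset call is emitted.  */
    }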
> >> > > +
> >> > > +
> >> > >  ;; Move a block of memory if it is word aligned and MORE than 2 words
> >> > >  ;; long.  We could let this apply for blocks of less than this, but it
> >> > >  ;; clobbers so many registers that there is then probably a better way.
> >> > > Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
> >> > >
> >> > > ===================================================================
> >> > > --- gcc/testsuite/gcc.target/arm/memset-inline-6.c        (revision 0)
> >> > > +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c        (revision 0)
> >> >
> >> > Have you tested these when the compiler was configured with
> >> > "--with- cpu=cortex-a9"?
> >> Here is the tricky part.
> >> For a compiler configured with "--with-tune=cortex-a9", the
> >> NEON-related cases (4/5/6/8/9) would fail because we have no way to
> >> determine here that we are compiling with cortex-a9 tuning.
> 
> There is no easy way of fixing this.
> 
> >> For a compiler configured with "--with-cpu=cortex-a9", the test cases
> >> would pass, but I think this is a mistake.  It reveals an issue: GCC
> >> won't pass "-mcpu=cortex-a9" to cc1, resulting in the cortex-a8 tuning
> >> being selected.  It just makes no sense.
> >> With these issues, I didn't change the tests for now.
> 
> I think we'll just have to take the hit on noise in other configurations.
> 
> > Precisely, I configured gcc with the options "--with-arch=armv7-a
> > --with-cpu|--with-tune=cortex-a9".
> > I read the gcc documents and realized that "-mcpu" is ignored when
> > "-march" is specified.  I don't know why gcc acts in this manner, but
> > it leads to inconsistent configuration/command-line behavior.
> > If we configure GCC with "--with-arch=armv7-a --with-cpu=cortex-a9",
> > then only "-march=armv7-a" is passed to cc1.
> 
> That kind of configuration is warned about today, but no one pays
> attention to that.  See James's proposal to promote this to an error.
> 
> > If we compile with "-march=armv7-a -mcpu=cortex-a9", then gcc works
> > fine and passes "-march=armv7-a -mcpu=cortex-a9" to cc1.
> 
> That should be fine.
> 
> >
> > Even more weirdly, cc1 warns that "switch -mcpu=cortex-m4 conflicts
> > with -march=armv7-m switch".
> 
> 
> This is OK unless there are objections in the next 24 hours.
> 
> Please watch out for any fallout.  Certainly rebase and retest before
> applying, and please post the rebased version for archival purposes.
> 

Hi Ramana,
This is the rebased patch; there is no conflict against the latest trunk.  I am still running some tests.  Is it OK to commit if the tests are fine?
Also, it depends on the patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; I will update that patch too.

Thanks,
bin

Comments

Bin.Cheng July 8, 2014, 8:32 a.m. UTC | #1
On Fri, Jul 4, 2014 at 1:17 PM, Bin Cheng <bin.cheng@arm.com> wrote:
>
>

>
> Hi Ramana,
> This is the rebased patch; there is no conflict against the latest trunk.  I am still running some tests.  Is it OK to commit if the tests are fine?
> Also, it depends on the patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; I will update that patch too.

Hi Ramana,

Bootstrap and testing for this patch are done.  Is it OK for me to commit?

Thanks,
bin
Ramana Radhakrishnan July 8, 2014, 8:56 a.m. UTC | #2
>
> Hi Ramana,
> This is the rebased patch; there is no conflict against the latest trunk.  I am still running some tests.  Is it OK to commit if the tests are fine?
> Also, it depends on the patch at https://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html; I will update that patch too.
>
> Thanks,
> bin

> Index: gcc/config/arm/arm.c
> ===================================================================
> --- gcc/config/arm/arm.c	(revision 212295)
> +++ gcc/config/arm/arm.c	(working copy)
> @@ -1588,34 +1588,38 @@ const struct tune_params arm_slowmul_tune =
>  {
>    arm_slowmul_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  3,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  3,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */

Please make sure the alignment of the comments is maintained as it is 
today.  I'm not sure why I see the following diffs in your patch, since 
you really shouldn't be touching those lines; that applies to all the 
cost tables.  I haven't called out in detail all the places where you 
appear to have unrelated formatting changes, but I have done so in one 
cost table.

Please re-create a patch that doesn't have these hunks.

>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */

Likewise.

>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */

Likewise.

> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_fastmul_tune =
>  {
>    arm_fastmul_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  /* StrongARM has early execution of branches, so a sequence that is worth
> @@ -1625,17 +1629,19 @@ const struct tune_params arm_strongarm_tune =
>  {
>    arm_fastmul_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  3,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  3,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_xscale_tune =
> @@ -1643,50 +1649,56 @@ const struct tune_params arm_xscale_tune =
>    arm_xscale_rtx_costs,
>    NULL,
>    xscale_sched_adjust_cost,
> -  2,						/* Constant limit.  */
> -  3,						/* Max cond insns.  */
> +  2,					/* Constant limit.  */
> +  3,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_9e_tune =
>  {
>    arm_9e_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_v6t2_tune =
>  {
>    arm_9e_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
> @@ -1694,34 +1706,38 @@ const struct tune_params arm_cortex_tune =
>  {
>    arm_9e_rtx_costs,
>    &generic_extra_costs,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a8_tune =
>  {
>    arm_9e_rtx_costs,
>    &cortexa8_extra_costs,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  true,					/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a7_tune =
> @@ -1729,67 +1745,75 @@ const struct tune_params arm_cortex_a7_tune =
>    arm_9e_rtx_costs,
>    &cortexa7_extra_costs,
>    NULL,
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,			/* Vectorizer costs.  */
> -  false,					/* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  true,					/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a15_tune =
>  {
>    arm_9e_rtx_costs,
>    &cortexa15_extra_costs,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  2,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  2,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  true,						/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  true, true                                    /* Prefer 32-bit encodings.  */
> +  true,					/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  true, true,				/* Prefer 32-bit encodings.  */
> +  true,					/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a53_tune =
>  {
>    arm_9e_rtx_costs,
>    &cortexa53_extra_costs,
> -  NULL,						/* Scheduler cost adjustment.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Scheduler cost adjustment.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,			/* Vectorizer costs.  */
> -  false,					/* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a57_tune =
>  {
>    arm_9e_rtx_costs,
>    &cortexa57_extra_costs,
> -  NULL,                                         /* Scheduler cost adjustment.  */
> -  1,                                           /* Constant limit.  */
> -  2,                                           /* Max cond insns.  */
> +  NULL,					/* Scheduler cost adjustment.  */
> +  1,					/* Constant limit.  */
> +  2,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,                                       /* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  true,                                       /* Prefer LDRD/STRD.  */
> -  {true, true},                                /* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                       /* Vectorizer costs.  */
> -  false,                                       /* Prefer Neon for 64-bits bitops.  */
> -  true, true                                   /* Prefer 32-bit encodings.  */
> +  true,					/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  true, true,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  /* Branches can be dual-issued on Cortex-A5, so conditional execution is
> @@ -1799,17 +1823,19 @@ const struct tune_params arm_cortex_a5_tune =
>  {
>    arm_9e_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  1,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  1,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_cortex_a5_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {false, false},				/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {false, false},			/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  true,					/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a9_tune =
> @@ -1817,16 +1843,18 @@ const struct tune_params arm_cortex_a9_tune =
>    arm_9e_rtx_costs,
>    &cortexa9_extra_costs,
>    cortex_a9_sched_adjust_cost,
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_BENEFICIAL(4,32,32),
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_cortex_a12_tune =
> @@ -1834,16 +1862,18 @@ const struct tune_params arm_cortex_a12_tune =
>    arm_9e_rtx_costs,
>    &cortexa12_extra_costs,
>    NULL,
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_BENEFICIAL(4,32,32),
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  true,						/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  true,					/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  true,					/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
> @@ -1857,17 +1887,19 @@ const struct tune_params arm_v7m_tune =
>  {
>    arm_9e_rtx_costs,
>    &v7m_extra_costs,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  2,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  2,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_cortex_m_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {false, false},				/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {false, false},			/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
> @@ -1876,17 +1908,19 @@ const struct tune_params arm_v6m_tune =
>  {
>    arm_9e_rtx_costs,
>    NULL,
> -  NULL,						/* Sched adj cost.  */
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  NULL,					/* Sched adj cost.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  false,					/* Prefer constant pool.  */
> +  false,				/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {false, false},				/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {false, false},			/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
>  const struct tune_params arm_fa726te_tune =
> @@ -1894,16 +1928,18 @@ const struct tune_params arm_fa726te_tune =
>    arm_9e_rtx_costs,
>    NULL,
>    fa726te_sched_adjust_cost,
> -  1,						/* Constant limit.  */
> -  5,						/* Max cond insns.  */
> +  1,					/* Constant limit.  */
> +  5,					/* Max cond insns.  */
>    ARM_PREFETCH_NOT_BENEFICIAL,
> -  true,						/* Prefer constant pool.  */
> +  true,					/* Prefer constant pool.  */
>    arm_default_branch_cost,
> -  false,					/* Prefer LDRD/STRD.  */
> -  {true, true},					/* Prefer non short circuit.  */
> -  &arm_default_vec_cost,                        /* Vectorizer costs.  */
> -  false,                                        /* Prefer Neon for 64-bits bitops.  */
> -  false, false                                  /* Prefer 32-bit encodings.  */
> +  false,				/* Prefer LDRD/STRD.  */
> +  {true, true},				/* Prefer non short circuit.  */
> +  &arm_default_vec_cost,		/* Vectorizer costs.  */
> +  false,				/* Prefer Neon for 64-bits bitops.  */
> +  false, false,				/* Prefer 32-bit encodings.  */
> +  false,				/* Prefer Neon for stringops.  */
> +  8					/* Maximum insns to inline memset.  */
>  };
>
> Thanks,
> bin
>

Patch

Index: gcc/config/arm/arm-protos.h
===================================================================
--- gcc/config/arm/arm-protos.h	(revision 212295)
+++ gcc/config/arm/arm-protos.h	(working copy)
@@ -277,6 +277,10 @@  struct tune_params
   /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags
      would be set.  */
   bool disparage_partial_flag_setting_t16_encodings;
+  /* Prefer to inline string operations like memset by using Neon.  */
+  bool string_ops_prefer_neon;
+  /* Maximum number of instructions to inline calls to memset.  */
+  int max_insns_inline_memset;
 };
 
 extern const struct tune_params *current_tune;
@@ -289,6 +293,7 @@  extern void arm_emit_coreregs_64bit_shift (enum rt
 extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
 #endif /* RTX_CODE */
 
+extern bool arm_gen_setmem (rtx *);
 extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
 extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
 
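The two new fields sit at the tail of struct tune_params, so every
per-core initializer in arm.c below grows a trailing pair of entries.
For illustration only, here is a reduced standalone model (plain C, not
GCC source; tune_params_tail, example_tune and block_set_max_insns are
made-up names) of the data and of the budget query that consumes it:

#include <stdbool.h>
#include <stdio.h>

/* Reduced model of the new tune_params tail; the field names mirror
   the patch.  */
struct tune_params_tail
{
  bool string_ops_prefer_neon;	/* Inline memset with Neon stores?  */
  int max_insns_inline_memset;	/* Insn budget when optimizing for speed.  */
};

static const struct tune_params_tail example_tune = { true, 8 };

/* Mirrors arm_block_set_max_insns in arm.c below: a fixed budget of 4
   at -Os, otherwise the per-core budget.  */
static int
block_set_max_insns (bool optimize_size)
{
  return optimize_size ? 4 : example_tune.max_insns_inline_memset;
}

int
main (void)
{
  printf ("budget at -O2: %d, at -Os: %d\n",
          block_set_max_insns (false), block_set_max_insns (true));
  return 0;
}
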
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 212295)
+++ gcc/config/arm/arm.c	(working copy)
@@ -1588,34 +1588,38 @@  const struct tune_params arm_slowmul_tune =
 {
   arm_slowmul_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  3,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  3,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_fastmul_tune =
 {
   arm_fastmul_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* StrongARM has early execution of branches, so a sequence that is worth
@@ -1625,17 +1629,19 @@  const struct tune_params arm_strongarm_tune =
 {
   arm_fastmul_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  3,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  3,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_xscale_tune =
@@ -1643,50 +1649,56 @@  const struct tune_params arm_xscale_tune =
   arm_xscale_rtx_costs,
   NULL,
   xscale_sched_adjust_cost,
-  2,						/* Constant limit.  */
-  3,						/* Max cond insns.  */
+  2,					/* Constant limit.  */
+  3,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_9e_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_v6t2_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* Generic Cortex tuning.  Use more specific tunings if appropriate.  */
@@ -1694,34 +1706,38 @@  const struct tune_params arm_cortex_tune =
 {
   arm_9e_rtx_costs,
   &generic_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a8_tune =
 {
   arm_9e_rtx_costs,
   &cortexa8_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a7_tune =
@@ -1729,67 +1745,75 @@  const struct tune_params arm_cortex_a7_tune =
   arm_9e_rtx_costs,
   &cortexa7_extra_costs,
   NULL,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a15_tune =
 {
   arm_9e_rtx_costs,
   &cortexa15_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  2,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  2,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  true,						/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  true, true                                    /* Prefer 32-bit encodings.  */
+  true,					/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  true, true,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a53_tune =
 {
   arm_9e_rtx_costs,
   &cortexa53_extra_costs,
-  NULL,						/* Scheduler cost adjustment.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Scheduler cost adjustment.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,			/* Vectorizer costs.  */
-  false,					/* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a57_tune =
 {
   arm_9e_rtx_costs,
   &cortexa57_extra_costs,
-  NULL,                                         /* Scheduler cost adjustment.  */
-  1,                                           /* Constant limit.  */
-  2,                                           /* Max cond insns.  */
+  NULL,					/* Scheduler cost adjustment.  */
+  1,					/* Constant limit.  */
+  2,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,                                       /* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  true,                                       /* Prefer LDRD/STRD.  */
-  {true, true},                                /* Prefer non short circuit.  */
-  &arm_default_vec_cost,                       /* Vectorizer costs.  */
-  false,                                       /* Prefer Neon for 64-bits bitops.  */
-  true, true                                   /* Prefer 32-bit encodings.  */
+  true,					/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  true, true,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* Branches can be dual-issued on Cortex-A5, so conditional execution is
@@ -1799,17 +1823,19 @@  const struct tune_params arm_cortex_a5_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  1,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  1,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_cortex_a5_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a9_tune =
@@ -1817,16 +1843,18 @@  const struct tune_params arm_cortex_a9_tune =
   arm_9e_rtx_costs,
   &cortexa9_extra_costs,
   cortex_a9_sched_adjust_cost,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_BENEFICIAL(4,32,32),
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_cortex_a12_tune =
@@ -1834,16 +1862,18 @@  const struct tune_params arm_cortex_a12_tune =
   arm_9e_rtx_costs,
   &cortexa12_extra_costs,
   NULL,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_BENEFICIAL(4,32,32),
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  true,						/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  true,					/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  true,					/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* armv7m tuning.  On Cortex-M4 cores for example, MOVW/MOVT take a single
@@ -1857,17 +1887,19 @@  const struct tune_params arm_v7m_tune =
 {
   arm_9e_rtx_costs,
   &v7m_extra_costs,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  2,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  2,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_cortex_m_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 /* The arm_v6m_tune is duplicated from arm_cortex_tune, rather than
@@ -1876,17 +1908,19 @@  const struct tune_params arm_v6m_tune =
 {
   arm_9e_rtx_costs,
   NULL,
-  NULL,						/* Sched adj cost.  */
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  NULL,					/* Sched adj cost.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  false,					/* Prefer constant pool.  */
+  false,				/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {false, false},				/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {false, false},			/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 const struct tune_params arm_fa726te_tune =
@@ -1894,16 +1928,18 @@  const struct tune_params arm_fa726te_tune =
   arm_9e_rtx_costs,
   NULL,
   fa726te_sched_adjust_cost,
-  1,						/* Constant limit.  */
-  5,						/* Max cond insns.  */
+  1,					/* Constant limit.  */
+  5,					/* Max cond insns.  */
   ARM_PREFETCH_NOT_BENEFICIAL,
-  true,						/* Prefer constant pool.  */
+  true,					/* Prefer constant pool.  */
   arm_default_branch_cost,
-  false,					/* Prefer LDRD/STRD.  */
-  {true, true},					/* Prefer non short circuit.  */
-  &arm_default_vec_cost,                        /* Vectorizer costs.  */
-  false,                                        /* Prefer Neon for 64-bits bitops.  */
-  false, false                                  /* Prefer 32-bit encodings.  */
+  false,				/* Prefer LDRD/STRD.  */
+  {true, true},				/* Prefer non short circuit.  */
+  &arm_default_vec_cost,		/* Vectorizer costs.  */
+  false,				/* Prefer Neon for 64-bits bitops.  */
+  false, false,				/* Prefer 32-bit encodings.  */
+  false,				/* Prefer Neon for stringops.  */
+  8					/* Maximum insns to inline memset.  */
 };
 
 
@@ -16802,6 +16838,14 @@  arm_const_double_inline_cost (rtx val)
 			      NULL_RTX, NULL_RTX, 0, 0));
 }
 
+/* Cost in insns of loading the SImode constant VAL with operation CODE.  */
+static inline int
+arm_const_inline_cost (enum rtx_code code, rtx val)
+{
+  return arm_gen_constant (code, SImode, NULL_RTX, INTVAL (val),
+                           NULL_RTX, NULL_RTX, 1, 0);
+}
+
 /* Return true if it is worthwhile to split a 64-bit constant into two
    32-bit operations.  This is the case if optimizing for size, or
    if we have load delay slots, or if one 32-bit part can be done with
@@ -31417,6 +31468,519 @@  arm_validize_comparison (rtx *comparison, rtx * op
 
 }
 
+/* Maximum number of instructions to use when setting a block of memory.  */
+static int
+arm_block_set_max_insns (void)
+{
+  if (optimize_function_for_size_p (cfun))
+    return 4;
+  else
+    return current_tune->max_insns_inline_memset;
+}
+
+/* Return TRUE if it is profitable to set a block of memory for
+   the non-vectorized case.  VAL is the value to set the memory
+   with.  LENGTH is the number of bytes to set.  ALIGN is the
+   alignment of the destination memory in bytes.  UNALIGNED_P
+   is TRUE if we can only set the memory with instructions
+   meeting the alignment requirements.  USE_STRD_P is TRUE if
+   we can use strd to set the memory.  */
+static bool
+arm_block_set_non_vect_profit_p (rtx val,
+				 unsigned HOST_WIDE_INT length,
+				 unsigned HOST_WIDE_INT align,
+				 bool unaligned_p, bool use_strd_p)
+{
+  int num = 0;
+  /* For a leftover of 0-7 bytes, the table gives the minimum number
+     of strb/strh/str instructions needed to store it.  */
+  const int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
+
+  if (unaligned_p)
+    {
+      num = arm_const_inline_cost (SET, val);
+      num += length / align + length % align;
+    }
+  else if (use_strd_p)
+    {
+      num = arm_const_double_inline_cost (val);
+      num += (length >> 3) + leftover[length & 7];
+    }
+  else
+    {
+      num = arm_const_inline_cost (SET, val);
+      num += (length >> 2) + leftover[length & 3];
+    }
+
+  /* We may be able to combine the last STRH/STRB pair into a single
+     STR by shifting one byte back.  */
+  if (unaligned_access && length > 3 && (length & 3) == 3)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Return TRUE if it is profitable to set a block of memory for
+   the vectorized case.  LENGTH is the number of bytes to set.
+   ALIGN is the alignment of the destination memory in bytes.
+   MODE is the vector mode used to set the memory.  */
+static bool
+arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
+			     unsigned HOST_WIDE_INT align,
+			     enum machine_mode mode)
+{
+  int num;
+  bool unaligned_p = ((align & 3) != 0);
+  unsigned int nelt = GET_MODE_NUNITS (mode);
+
+  /* One instruction to load the constant value.  */
+  num = 1;
+  /* Instructions storing to the memory.  */
+  num += (length + nelt - 1) / nelt;
+  /* Instructions adjusting the address expression.  We only need to
+     adjust the address when the block is 4-byte aligned but the
+     leftover bytes must be stored by a misaligned store instruction.  */
+  if (!unaligned_p && (length & 3) != 0)
+    num++;
+
+  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
+  if (!unaligned_p && mode == V16QImode)
+    num--;
+
+  return (num <= arm_block_set_max_insns ());
+}
+
+/* Set a block of memory using vectorization instructions for the
+   unaligned case.  We fill the first LENGTH bytes of the memory area
+   starting from DSTBASE with the byte constant VALUE.  ALIGN is the
+   alignment requirement of the memory.  Return TRUE on success.  */
+static bool
+arm_block_set_unaligned_vect (rtx dstbase,
+			      unsigned HOST_WIDE_INT length,
+			      unsigned HOST_WIDE_INT value,
+			      unsigned HOST_WIDE_INT align)
+{
+  unsigned int i, j, nelt_v16, nelt_v8, nelt_mode;
+  rtx dst, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  rtx (*gen_func) (rtx, rtx);
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) != 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16)
+    {
+      mode = V16QImode;
+      gen_func = gen_movmisalignv16qi;
+    }
+  else
+    {
+      mode = V8QImode;
+      gen_func = gen_movmisalignv8qi;
+    }
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (j = 0; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  /* Handle nelt_mode bytes in a vector.  */
+  for (i = 0; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      emit_insn ((*gen_func) (mem, reg));
+      if (i + 2 * nelt_mode <= length)
+	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
+    }
+
+  /* If no fewer than nelt_v8 bytes are left over, we must be in
+     V16QImode.  */
+  gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
+
+  /* Handle (8, 16) bytes leftover.  */
+  if (i + nelt_v8 < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+    }
+  /* Handle (0, 8] bytes leftover.  */
+  else if (i < length && i + nelt_v8 >= length)
+    {
+      if (mode == V16QImode)
+	{
+	  reg = gen_lowpart (V8QImode, reg);
+	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
+	}
+      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
+					      + (nelt_mode - nelt_v8))));
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) != 0 && align >= 2)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for the
+   aligned case.  We fill the first LENGTH bytes of the memory area
+   starting from DSTBASE with the byte constant VALUE.  ALIGN is the
+   alignment requirement of the memory.  Return TRUE on success.  */
+static bool
+arm_block_set_aligned_vect (rtx dstbase,
+			    unsigned HOST_WIDE_INT length,
+			    unsigned HOST_WIDE_INT value,
+			    unsigned HOST_WIDE_INT align)
+{
+  unsigned int i, j, nelt_v8, nelt_v16, nelt_mode;
+  rtx dst, addr, mem;
+  rtx val_elt, val_vec, reg;
+  rtx rval[MAX_VECT_LEN];
+  enum machine_mode mode;
+  unsigned HOST_WIDE_INT v = value;
+
+  gcc_assert ((align & 0x3) == 0);
+  nelt_v8 = GET_MODE_NUNITS (V8QImode);
+  nelt_v16 = GET_MODE_NUNITS (V16QImode);
+  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
+    mode = V16QImode;
+  else
+    mode = V8QImode;
+
+  nelt_mode = GET_MODE_NUNITS (mode);
+  gcc_assert (length >= nelt_mode);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_vect_profit_p (length, align, mode))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_elt = GEN_INT (v);
+  for (j = 0; j < nelt_mode; j++)
+    rval[j] = val_elt;
+
+  reg = gen_reg_rtx (mode);
+  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
+  /* Emit instruction loading the constant value.  */
+  emit_move_insn (reg, val_vec);
+
+  i = 0;
+  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
+  if (mode == V16QImode)
+    {
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      emit_insn (gen_movmisalignv16qi (mem, reg));
+      i += nelt_mode;
+      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
+      if (i + nelt_v8 < length && i + nelt_v16 > length)
+	{
+	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
+	  /* We are shifting bytes back, set the alignment accordingly.  */
+	  if ((length & 0x3) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 4);
+	  else if ((length & 0x1) == 0)
+	    set_mem_align (mem, BITS_PER_UNIT * 2);
+	  else
+	    set_mem_align (mem, BITS_PER_UNIT);
+
+	  emit_insn (gen_movmisalignv16qi (mem, reg));
+	  return true;
+	}
+      /* Fall through for bytes leftover.  */
+      mode = V8QImode;
+      nelt_mode = GET_MODE_NUNITS (mode);
+      reg = gen_lowpart (V8QImode, reg);
+    }
+
+  /* Handle 8 bytes in a vector.  */
+  for (; (i + nelt_mode <= length); i += nelt_mode)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single word leftover by shifting 4 bytes back.  We can
+     use aligned access for this case.  */
+  if (i + UNITS_PER_WORD == length)
+    {
+      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
+      mem = adjust_automodify_address (dstbase, mode,
+				       addr, i - UNITS_PER_WORD);
+      /* We are shifting 4 bytes back, set the alignment accordingly.  */
+      if (align > UNITS_PER_WORD)
+	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
+
+      emit_move_insn (mem, reg);
+    }
+  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
+     We have to use unaligned access for this case.  */
+  else if (i < length)
+    {
+      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
+      mem = adjust_automodify_address (dstbase, mode, dst, 0);
+      /* We are shifting bytes back, set the alignment accordingly.  */
+      if ((length & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT * 2);
+      else
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      emit_insn (gen_movmisalignv8qi (mem, reg));
+    }
+
+  return true;
+}
+
+/* Set a block of memory using plain strh/strb instructions, using
+   only instructions allowed by ALIGN on the processor.  We fill
+   the first LENGTH bytes of the memory area starting from DSTBASE
+   with the byte constant VALUE.  ALIGN is the alignment
+   requirement of the memory.  */
+static bool
+arm_block_set_unaligned_non_vect (rtx dstbase,
+				  unsigned HOST_WIDE_INT length,
+				  unsigned HOST_WIDE_INT value,
+				  unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  enum machine_mode mode;
+  HOST_WIDE_INT v = value;
+
+  gcc_assert (align == 1 || align == 2);
+
+  if (align == 2)
+    v |= (value << BITS_PER_UNIT);
+
+  v = sext_hwi (v, BITS_PER_WORD);
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					align, true, false))
+    return false;
+
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  mode = (align == 2 ? HImode : QImode);
+  val_reg = force_reg (SImode, val_exp);
+  reg = gen_lowpart (mode, val_reg);
+
+  for (i = 0; (i + GET_MODE_SIZE (mode) <= length); i += GET_MODE_SIZE (mode))
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, mode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+      i++;
+    }
+
+  gcc_assert (i == length);
+  return true;
+}
+
+/* Set a block of memory using plain strd/str/strh/strb instructions,
+   permitting unaligned stores on processors which support unaligned
+   semantics for those instructions.  We fill the first LENGTH bytes
+   of the memory area starting from DSTBASE with the byte constant
+   VALUE.  ALIGN is the alignment requirement of the memory.  */
+static bool
+arm_block_set_aligned_non_vect (rtx dstbase,
+				unsigned HOST_WIDE_INT length,
+				unsigned HOST_WIDE_INT value,
+				unsigned HOST_WIDE_INT align)
+{
+  unsigned int i;
+  rtx dst, addr, mem;
+  rtx val_exp, val_reg, reg;
+  unsigned HOST_WIDE_INT v;
+  bool use_strd_p;
+
+  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
+		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
+
+  v = (value | (value << 8) | (value << 16) | (value << 24));
+  if (length < UNITS_PER_WORD)
+    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
+
+  if (use_strd_p)
+    v |= (v << BITS_PER_WORD);
+  else
+    v = sext_hwi (v, BITS_PER_WORD);
+
+  val_exp = GEN_INT (v);
+  /* Skip if it isn't profitable.  */
+  if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					align, false, use_strd_p))
+    {
+      if (!use_strd_p)
+	return false;
+
+      /* Try without strd.  */
+      v = (v >> BITS_PER_WORD);
+      v = sext_hwi (v, BITS_PER_WORD);
+      val_exp = GEN_INT (v);
+      use_strd_p = false;
+      if (!arm_block_set_non_vect_profit_p (val_exp, length,
+					    align, false, use_strd_p))
+	return false;
+    }
+
+  i = 0;
+  dst = copy_addr_to_reg (XEXP (dstbase, 0));
+  /* Handle double words using strd if possible.  */
+  if (use_strd_p)
+    {
+      val_reg = force_reg (DImode, val_exp);
+      reg = val_reg;
+      for (; (i + 8 <= length); i += 8)
+	{
+	  addr = plus_constant (Pmode, dst, i);
+	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
+	  emit_move_insn (mem, reg);
+	}
+    }
+  else
+    val_reg = force_reg (SImode, val_exp);
+
+  /* Handle words.  */
+  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
+  for (; (i + 4 <= length); i += 4)
+    {
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i);
+      if ((align & 3) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storesi (mem, reg));
+    }
+
+  /* Merge the last STRH/STRB pair into a single STR if possible.  */
+  if (unaligned_access && i > 0 && (i + 3) == length)
+    {
+      addr = plus_constant (Pmode, dst, i - 1);
+      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
+      /* We are shifting one byte back, set the alignment accordingly.  */
+      if ((align & 1) == 0)
+	set_mem_align (mem, BITS_PER_UNIT);
+
+      /* Most likely this is an unaligned access, and we can't tell at
+	 compile time.  */
+      emit_insn (gen_unaligned_storesi (mem, reg));
+      return true;
+    }
+
+  /* Handle half word leftover.  */
+  if (i + 2 <= length)
+    {
+      reg = gen_lowpart (HImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, HImode, addr, i);
+      if ((align & 1) == 0)
+	emit_move_insn (mem, reg);
+      else
+	emit_insn (gen_unaligned_storehi (mem, reg));
+
+      i += 2;
+    }
+
+  /* Handle single byte leftover.  */
+  if (i + 1 == length)
+    {
+      reg = gen_lowpart (QImode, val_reg);
+      addr = plus_constant (Pmode, dst, i);
+      mem = adjust_automodify_address (dstbase, QImode, addr, i);
+      emit_move_insn (mem, reg);
+    }
+
+  return true;
+}
+
+/* Set a block of memory using vectorization instructions for both
+   the aligned and unaligned cases.  We fill the first LENGTH bytes
+   of the memory area starting from DSTBASE with the byte constant
+   VALUE.  ALIGN is the alignment requirement of the memory.  */
+static bool
+arm_block_set_vect (rtx dstbase,
+		    unsigned HOST_WIDE_INT length,
+		    unsigned HOST_WIDE_INT value,
+		    unsigned HOST_WIDE_INT align)
+{
+  /* Check whether we need to use unaligned store instruction.  */
+  if (((align & 3) != 0 || (length & 3) != 0)
+      /* Check whether unaligned store instruction is available.  */
+      && (!unaligned_access || BYTES_BIG_ENDIAN))
+    return false;
+
+  if ((align & 3) == 0)
+    return arm_block_set_aligned_vect (dstbase, length, value, align);
+  else
+    return arm_block_set_unaligned_vect (dstbase, length, value, align);
+}
+
+/* Expand a string store operation.  First try to do it using
+   vectorization instructions, then fall back to ARM unaligned
+   access and double-word stores if profitable.  OPERANDS[0] is the
+   destination, OPERANDS[1] is the number of bytes, OPERANDS[2] is
+   the value to initialize the memory with, OPERANDS[3] is the known
+   alignment of the destination.  */
+bool
+arm_gen_setmem (rtx *operands)
+{
+  rtx dstbase = operands[0];
+  unsigned HOST_WIDE_INT length;
+  unsigned HOST_WIDE_INT value;
+  unsigned HOST_WIDE_INT align;
+
+  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
+    return false;
+
+  length = UINTVAL (operands[1]);
+  if (length > 64)
+    return false;
+
+  value = (UINTVAL (operands[2]) & 0xFF);
+  align = UINTVAL (operands[3]);
+  if (TARGET_NEON && length >= 8
+      && current_tune->string_ops_prefer_neon
+      && arm_block_set_vect (dstbase, length, value, align))
+    return true;
+
+  if (!unaligned_access && (align & 3) != 0)
+    return arm_block_set_unaligned_non_vect (dstbase, length, value, align);
+
+  return arm_block_set_aligned_non_vect (dstbase, length, value, align);
+}
+
 /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
 
 static unsigned HOST_WIDE_INT
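To see the insn budgeting above in action, here is a standalone sketch
(plain C, not GCC code; store_count is a made-up name) of the counting
done by arm_block_set_non_vect_profit_p, with the constant-synthesis
cost stubbed to one insn:

#include <stdio.h>

static int
store_count (unsigned int length, int use_strd, int unaligned_ok)
{
  /* Minimum strb/strh/str instructions for a leftover of 0-7 bytes.  */
  static const int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
  int num = 1;	/* Stubbed constant-load cost; the patch computes it
		   via arm_const_inline_cost or
		   arm_const_double_inline_cost.  */

  if (use_strd)
    num += (length >> 3) + leftover[length & 7];
  else
    num += (length >> 2) + leftover[length & 3];

  /* With unaligned access, the last STRH/STRB pair merges into one
     STR by shifting back one byte.  */
  if (unaligned_ok && length > 3 && (length & 3) == 3)
    num--;

  return num;
}

int
main (void)
{
  /* 14 bytes: three words plus a half-word leftover on the str path.  */
  printf ("14 bytes via str : %d insns\n", store_count (14, 0, 1));
  printf ("14 bytes via strd: %d insns\n", store_count (14, 1, 1));
  return 0;
}

Both variants (5 and 4 insns) come in under the default budget of 8
from max_insns_inline_memset, which is why the 14-byte memset in the
tests below ends up inlined.
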
Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 212295)
+++ gcc/config/arm/arm.md	(working copy)
@@ -6726,6 +6726,20 @@ 
 })
 
 
+(define_expand "setmemsi"
+  [(match_operand:BLK 0 "general_operand" "")
+   (match_operand:SI 1 "const_int_operand" "")
+   (match_operand:SI 2 "const_int_operand" "")
+   (match_operand:SI 3 "const_int_operand" "")]
+  "TARGET_32BIT"
+{
+  if (arm_gen_setmem (operands))
+    DONE;
+
+  FAIL;
+})
+
+
 ;; Move a block of memory if it is word aligned and MORE than 2 words long.
 ;; We could let this apply for blocks of less than this, but it clobbers so
 ;; many registers that there is then probably a better way.
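The expander above only reaches arm_gen_setmem when both the byte count
and the value are compile-time constants, and arm_gen_setmem itself
gives up beyond 64 bytes.  A hedged illustration of which calls qualify
(the profitability checks may still reject an eligible one):

#include <string.h>

static char buf[128];

void
eligible (void)
{
  memset (buf, 0, 32);	/* Constant value, length <= 64: may be inlined.  */
}

void
falls_back (unsigned int n)
{
  memset (buf, 0, n);	/* Non-constant length: the expander FAILs and
			   this typically stays a library call.  */
  memset (buf, 0, 100);	/* Length > 64: arm_gen_setmem returns false.  */
}
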
Index: gcc/testsuite/gcc.target/arm/memset-inline-7.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-7.c	(revision 0)
@@ -0,0 +1,171 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+int b[LEN];
+
+void
+init (signed char *arr, int len)
+{
+  int i;
+  for (i = 0; i < len; i++)
+    arr[i] = 0;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+#define TEST(a,l,v)			\
+	init ((signed char*)(a), sizeof (a));		\
+	memset ((a), (v), (l));				\
+	check ((signed char *)(a), (l), sizeof (a), (v));
+int
+main(void)
+{
+  TEST (a, 1, -1);
+  TEST (a, 2, -1);
+  TEST (a, 3, -1);
+  TEST (a, 4, -1);
+  TEST (a, 5, -1);
+  TEST (a, 6, -1);
+  TEST (a, 7, -1);
+  TEST (a, 8, -1);
+  TEST (a, 9, 1);
+  TEST (a, 10, -1);
+  TEST (a, 11, 1);
+  TEST (a, 12, -1);
+  TEST (a, 13, 1);
+  TEST (a, 14, -1);
+  TEST (a, 15, 1);
+  TEST (a, 16, -1);
+  TEST (a, 17, 1);
+  TEST (a, 18, -1);
+  TEST (a, 19, 1);
+  TEST (a, 20, -1);
+  TEST (a, 21, 1);
+  TEST (a, 22, -1);
+  TEST (a, 23, 1);
+  TEST (a, 24, -1);
+  TEST (a, 25, 1);
+  TEST (a, 26, -1);
+  TEST (a, 27, 1);
+  TEST (a, 28, -1);
+  TEST (a, 29, 1);
+  TEST (a, 30, -1);
+  TEST (a, 31, 1);
+  TEST (a, 32, -1);
+  TEST (a, 33, 1);
+  TEST (a, 34, -1);
+  TEST (a, 35, 1);
+  TEST (a, 36, -1);
+  TEST (a, 37, 1);
+  TEST (a, 38, -1);
+  TEST (a, 39, 1);
+  TEST (a, 40, -1);
+  TEST (a, 41, 1);
+  TEST (a, 42, -1);
+  TEST (a, 43, 1);
+  TEST (a, 44, -1);
+  TEST (a, 45, 1);
+  TEST (a, 46, -1);
+  TEST (a, 47, 1);
+  TEST (a, 48, -1);
+  TEST (a, 49, 1);
+  TEST (a, 50, -1);
+  TEST (a, 51, 1);
+  TEST (a, 52, -1);
+  TEST (a, 53, 1);
+  TEST (a, 54, -1);
+  TEST (a, 55, 1);
+  TEST (a, 56, -1);
+  TEST (a, 57, 1);
+  TEST (a, 58, -1);
+  TEST (a, 59, 1);
+  TEST (a, 60, -1);
+  TEST (a, 61, 1);
+  TEST (a, 62, -1);
+  TEST (a, 63, 1);
+  TEST (a, 64, -1);
+
+  TEST (b, 1, -1);
+  TEST (b, 2, -1);
+  TEST (b, 3, -1);
+  TEST (b, 4, -1);
+  TEST (b, 5, -1);
+  TEST (b, 6, -1);
+  TEST (b, 7, -1);
+  TEST (b, 8, -1);
+  TEST (b, 9, 1);
+  TEST (b, 10, -1);
+  TEST (b, 11, 1);
+  TEST (b, 12, -1);
+  TEST (b, 13, 1);
+  TEST (b, 14, -1);
+  TEST (b, 15, 1);
+  TEST (b, 16, -1);
+  TEST (b, 17, 1);
+  TEST (b, 18, -1);
+  TEST (b, 19, 1);
+  TEST (b, 20, -1);
+  TEST (b, 21, 1);
+  TEST (b, 22, -1);
+  TEST (b, 23, 1);
+  TEST (b, 24, -1);
+  TEST (b, 25, 1);
+  TEST (b, 26, -1);
+  TEST (b, 27, 1);
+  TEST (b, 28, -1);
+  TEST (b, 29, 1);
+  TEST (b, 30, -1);
+  TEST (b, 31, 1);
+  TEST (b, 32, -1);
+  TEST (b, 33, 1);
+  TEST (b, 34, -1);
+  TEST (b, 35, 1);
+  TEST (b, 36, -1);
+  TEST (b, 37, 1);
+  TEST (b, 38, -1);
+  TEST (b, 39, 1);
+  TEST (b, 40, -1);
+  TEST (b, 41, 1);
+  TEST (b, 42, -1);
+  TEST (b, 43, 1);
+  TEST (b, 44, -1);
+  TEST (b, 45, 1);
+  TEST (b, 46, -1);
+  TEST (b, 47, 1);
+  TEST (b, 48, -1);
+  TEST (b, 49, 1);
+  TEST (b, 50, -1);
+  TEST (b, 51, 1);
+  TEST (b, 52, -1);
+  TEST (b, 53, 1);
+  TEST (b, 54, -1);
+  TEST (b, 55, 1);
+  TEST (b, 56, -1);
+  TEST (b, 57, 1);
+  TEST (b, 58, -1);
+  TEST (b, 59, 1);
+  TEST (b, 60, -1);
+  TEST (b, 61, 1);
+  TEST (b, 62, -1);
+  TEST (b, 63, 1);
+  TEST (b, 64, -1);
+
+  return 0;
+}
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-8.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-8.c	(revision 0)
@@ -0,0 +1,44 @@ 
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-1.c	(revision 0)
@@ -0,0 +1,39 @@ 
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline"  } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 14);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-9.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-9.c	(revision 0)
@@ -0,0 +1,42 @@ 
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-2.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-2.c	(revision 0)
@@ -0,0 +1,38 @@ 
+/* { dg-do run } */
+/* { dg-options "-save-temps -Os -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+  memset (a, -1, 14);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 14, sizeof (a), -1);
+
+  return 0;
+}
+/* { dg-final { scan-assembler "bl?\[ \t\]*memset" { target { ! arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-3.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-3.c	(revision 0)
@@ -0,0 +1,40 @@ 
+/* { dg-do run } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+short a[LEN];
+void
+foo (void)
+{
+    memset (a, -1, 7);
+    return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo ();
+  check ((signed char *)a, 7, sizeof (a), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]*memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-not "strh" { target { ! arm_thumb1 } } } } */
+/* { dg-final { scan-assembler-not "strb" { target { ! arm_thumb1 } } } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-4.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-4.c	(revision 0)
@@ -0,0 +1,68 @@ 
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 8);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 12);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, 1, 13);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  int i;
+
+  foo1 ();
+  check ((signed char *)a, 8, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 12, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 13, sizeof (c), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { ! arm_thumb1_ok } } } } */
+/* { dg-final { scan-assembler-times "vst1\.8" 1 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vstr" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
Index: gcc/testsuite/gcc.target/arm/memset-inline-5.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-5.c	(revision 0)
@@ -0,0 +1,78 @@ 
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+int d[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 16);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 25);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 19);
+  return;
+}
+
+void
+foo4 (void)
+{
+  memset (d, 1, 23);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 16, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 25, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 19, sizeof (c), -1);
+
+  foo4 ();
+  check ((signed char *)d, 23, sizeof (d), 1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler "vst1" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-not "vstr"  { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
===================================================================
--- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
@@ -0,0 +1,68 @@ 
+/* { dg-do run } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-options "-save-temps -O2 -fno-inline" } */
+/* { dg-add-options "arm_neon" } */
+
+#include <string.h>
+#include <stdlib.h>
+
+#define LEN (100)
+int a[LEN];
+int b[LEN];
+int c[LEN];
+void
+foo1 (void)
+{
+    memset (a, -1, 20);
+    return;
+}
+
+void
+foo2 (void)
+{
+  memset (b, 1, 24);
+  return;
+}
+
+void
+foo3 (void)
+{
+  memset (c, -1, 32);
+  return;
+}
+
+void
+check (signed char *arr, int idx, int len, int v)
+{
+  int i;
+  for (i = 0; i < idx; i++)
+    if (arr[i] != v)
+      abort ();
+
+  for (i = idx; i < len; i++)
+    if (arr[i] != 0)
+      abort ();
+}
+
+int
+main(void)
+{
+  foo1 ();
+  check ((signed char *)a, 20, sizeof (a), -1);
+
+  foo2 ();
+  check ((signed char *)b, 24, sizeof (b), 1);
+
+  foo3 ();
+  check ((signed char *)c, 32, sizeof (c), -1);
+
+  return 0;
+}
+
+/* { dg-final { scan-assembler-not "bl?\[ \t\]+memset" { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vst1" 3 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { scan-assembler-times "vstr" 4 { target { arm_little_endian && arm_neon } } } } */
+/* { dg-final { cleanup-saved-temps } } */
+
+
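For reference, the usual way to run just these new tests from the build
tree is something like make check-gcc RUNTESTFLAGS="arm.exp=memset-inline-*.c".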