
[RFC] ldist: Recognize rawmemchr loop patterns

Message ID 20210208124716.487709-1-stefansf@linux.ibm.com
State New
Series [RFC] ldist: Recognize rawmemchr loop patterns

Commit Message

Stefan Schulze Frielinghaus Feb. 8, 2021, 12:47 p.m. UTC
This patch adds support for recognizing loops which mimic the behaviour
of the function rawmemchr, and replaces those with an internal function
call in case a target provides an implementation.  In contrast to the
original rawmemchr function, this patch also supports variants where the
memory pointed to and the pattern are interpreted as 8, 16, or 32 bit
wide values.
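
For illustration, the kind of loop the pass is meant to recognize is
sketched below (the function name is mine; the constant matches the
8-bit testcase attached later in this thread):

#include <stdint.h>

uint8_t *
find8 (uint8_t *s)
{
  while (*s != 0xab)
    ++s;
  return s;
}

With this version of the patch the whole loop is replaced by a call to
the corresponding internal function (roughly s = .RAWMEMCHR8 (s, 0xab)
in GIMPLE terms), and all uses of the former loop result are redirected
to the LHS of that call.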

This patch is not final and I'm looking for some feedback:

So far, only loops which mimic the behaviour of the functions memset,
memcpy, and memmove have been detected and replaced by corresponding
function calls.  One characteristic of those loops/partitions is that
they don't have a reduction.  In contrast, loops which mimic the
behaviour of rawmemchr compute a result and therefore have a reduction.
My current attempt is to ensure that the reduction statement is not used
in any other partition, and only in that case to ignore the reduction and
replace the loop by a function call.  We then only need to replace the
reduction variable which contained the loop result with the LHS of the
internal function call.  This should ensure that the transformation is
correct independently of how partitions are fused/distributed in the end.
Any thoughts about this?
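
To make the concern above concrete, here is an illustrative sketch (not
taken from the patch) of how the statements feeding a reduction can end
up shared between partitions: the induction variable i is needed by the
search as well as by the store, and its final value is also the loop
result, so the reduction may only be ignored if no other partition uses
it.

#include <stdint.h>

long
find_and_mark (uint8_t *a, uint8_t *b)
{
  long i = 0;
  while (a[i] != 0xab)
    {
      b[i] = 1;   /* would belong to a separate partition */
      ++i;        /* induction shared by both partitions  */
    }
  return i;       /* reduction: value of i is live after the loop */
}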

Furthermore, I simply added two new members (pattern, fn) to structure
builtin_info, which I consider rather hacky.  For the long run I thought
about splitting up structure builtin_info into a union where each member
is a structure for a particular builtin of a partition, i.e., something
like this:

union builtin_info
{
  struct binfo_memset *memset;
  struct binfo_memcpymove *memcpymove;
  struct binfo_rawmemchr *rawmemchr;
};

That way a structure for one builtin does not get "polluted" by a
different one.  Any thoughts about this?
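
For the record, such a split could look roughly like this (field names
are taken from the current builtin_info members respectively the ones
added by this patch; the exact layout is of course open):

struct binfo_memset
{
  data_reference_p dst_dr;
  tree dst_base, size;
  tree dst_base_base;
  unsigned HOST_WIDE_INT dst_base_offset;
};

struct binfo_memcpymove
{
  data_reference_p dst_dr, src_dr;
  tree dst_base, src_base, size;
};

struct binfo_rawmemchr
{
  data_reference_p src_dr;
  tree src_base, pattern;
  internal_fn fn;
};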

Cheers,
Stefan
---
 gcc/internal-fn.c            |  42 ++++++
 gcc/internal-fn.def          |   3 +
 gcc/target-insns.def         |   3 +
 gcc/tree-loop-distribution.c | 257 ++++++++++++++++++++++++++++++-----
 4 files changed, 272 insertions(+), 33 deletions(-)

Comments

Richard Biener Feb. 9, 2021, 8:57 a.m. UTC | #1
On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>
> This patch adds support for recognizing loops which mimic the behaviour
> of function rawmemchr, and replaces those with an internal function call
> in case a target provides them.  In contrast to the original rawmemchr
> function, this patch also supports different instances where the memory
> pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> respectively.
>
> This patch is not final and I'm looking for some feedback:
>
> Previously, only loops which mimic the behaviours of functions memset,
> memcpy, and memmove have been detected and replaced by corresponding
> function calls.  One characteristic of those loops/partitions is that
> they don't have a reduction.  In contrast, loops which mimic the
> behaviour of rawmemchr compute a result and therefore have a reduction.
> My current attempt is to ensure that the reduction statement is not used
> in any other partition and only in that case ignore the reduction and
> replace the loop by a function call.  We then only need to replace the
> reduction variable of the loop which contained the loop result by the
> variable of the lhs of the internal function call.  This should ensure
> that the transformation is correct independently of how partitions are
> fused/distributed in the end.  Any thoughts about this?

Currently we're forcing reduction partitions last (and force a single one
by fusing all partitions containing a reduction) because code generation
does not properly update SSA form for the reduction results.  ISTR that
might be just because we do not copy the LC PHI nodes or do not adjust
them when copying.  That might not be an issue in case you replace the
partition with a call.  I guess you can try to have a testcase with
two rawmemchr patterns and a regular loop part that has to be scheduled
in between both for correctness.

> Furthermore, I simply added two new members (pattern, fn) to structure
> builtin_info which I consider rather hacky.  For the long run I thought
> about to split up structure builtin_info into a union where each member
> is a structure for a particular builtin of a partition, i.e., something
> like this:
>
> union builtin_info
> {
>   struct binfo_memset *memset;
>   struct binfo_memcpymove *memcpymove;
>   struct binfo_rawmemchr *rawmemchr;
> };
>
> Such that a structure for one builtin does not get "polluted" by a
> different one.  Any thoughts about this?

Probably makes sense if the list of recognized patterns grow further.

I see you use internal functions rather than builtin functions.  I guess
that's OK.  But you use new target hooks for expansion; I think new
optab entries similar to cmpmem, where the distinction between 8, 16 or
32 bits can be encoded in the modes, would be more appropriate.

Richard.

> Cheers,
> Stefan
> ---
>  gcc/internal-fn.c            |  42 ++++++
>  gcc/internal-fn.def          |   3 +
>  gcc/target-insns.def         |   3 +
>  gcc/tree-loop-distribution.c | 257 ++++++++++++++++++++++++++++++-----
>  4 files changed, 272 insertions(+), 33 deletions(-)
>
> diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> index dd7173126fb..9cd62544a1a 100644
> --- a/gcc/internal-fn.c
> +++ b/gcc/internal-fn.c
> @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *)
>    gcc_unreachable ();
>  }
>
> +static void
> +expand_RAWMEMCHR8 (internal_fn, gcall *stmt)
> +{
> +  if (targetm.have_rawmemchr8 ())
> +    {
> +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> +      emit_insn (targetm.gen_rawmemchr8 (result, start, pattern));
> +    }
> +  else
> +    gcc_unreachable();
> +}
> +
> +static void
> +expand_RAWMEMCHR16 (internal_fn, gcall *stmt)
> +{
> +  if (targetm.have_rawmemchr16 ())
> +    {
> +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> +      emit_insn (targetm.gen_rawmemchr16 (result, start, pattern));
> +    }
> +  else
> +    gcc_unreachable();
> +}
> +
> +static void
> +expand_RAWMEMCHR32 (internal_fn, gcall *stmt)
> +{
> +  if (targetm.have_rawmemchr32 ())
> +    {
> +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> +      emit_insn (targetm.gen_rawmemchr32 (result, start, pattern));
> +    }
> +  else
> +    gcc_unreachable();
> +}
> +
>  /* Expand the IFN_UNIQUE function according to its first argument.  */
>
>  static void
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index daeace7a34e..34247859704 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -348,6 +348,9 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>  DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
>  DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
>  DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> +DEF_INTERNAL_FN (RAWMEMCHR8, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> +DEF_INTERNAL_FN (RAWMEMCHR16, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> +DEF_INTERNAL_FN (RAWMEMCHR32, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
>
>  /* An unduplicable, uncombinable function.  Generally used to preserve
>     a CFG property in the face of jump threading, tail merging or
> diff --git a/gcc/target-insns.def b/gcc/target-insns.def
> index 672c35698d7..9248554cbf3 100644
> --- a/gcc/target-insns.def
> +++ b/gcc/target-insns.def
> @@ -106,3 +106,6 @@ DEF_TARGET_INSN (trap, (void))
>  DEF_TARGET_INSN (unique, (void))
>  DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2))
>  DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1))
> +DEF_TARGET_INSN (rawmemchr8, (rtx x0, rtx x1, rtx x2))
> +DEF_TARGET_INSN (rawmemchr16, (rtx x0, rtx x1, rtx x2))
> +DEF_TARGET_INSN (rawmemchr32, (rtx x0, rtx x1, rtx x2))
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index 7ee19fc8677..f5b24bf53bc 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -218,7 +218,7 @@ enum partition_kind {
>         be unnecessary and removed once distributed memset can be understood
>         and analyzed in data reference analysis.  See PR82604 for more.  */
>      PKIND_PARTIAL_MEMSET,
> -    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
> +    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR
>  };
>
>  /* Type of distributed loop.  */
> @@ -244,6 +244,8 @@ struct builtin_info
>       is only used in memset builtin distribution for now.  */
>    tree dst_base_base;
>    unsigned HOST_WIDE_INT dst_base_offset;
> +  tree pattern;
> +  internal_fn fn;
>  };
>
>  /* Partition for loop distribution.  */
> @@ -588,7 +590,8 @@ class loop_distribution
>    bool
>    classify_partition (loop_p loop,
>                       struct graph *rdg, partition *partition,
> -                     bitmap stmt_in_all_partitions);
> +                     bitmap stmt_in_all_partitions,
> +                     vec<struct partition *> *partitions);
>
>
>    /* Returns true when PARTITION1 and PARTITION2 access the same memory
> @@ -1232,6 +1235,67 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
>      }
>  }
>
> +/* Generate a call to rawmemchr{8,16,32} for PARTITION in LOOP.  */
> +
> +static void
> +generate_rawmemchr_builtin (class loop *loop, partition *partition)
> +{
> +  gimple_stmt_iterator gsi;
> +  tree mem, pattern;
> +  struct builtin_info *builtin = partition->builtin;
> +  gimple *fn_call;
> +
> +  data_reference_p dr = builtin->src_dr;
> +  tree base = builtin->src_base;
> +
> +  tree result_old = TREE_OPERAND (DR_REF (dr), 0);
> +  tree result_new = copy_ssa_name (result_old);
> +
> +  /* The new statements will be placed before LOOP.  */
> +  gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> +
> +  mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false, GSI_CONTINUE_LINKING);
> +  pattern = builtin->pattern;
> +  if (TREE_CODE (pattern) == INTEGER_CST)
> +    pattern = fold_convert (integer_type_node, pattern);
> +  fn_call = gimple_build_call_internal (builtin->fn, 2, mem, pattern);
> +  gimple_call_set_lhs (fn_call, result_new);
> +  gimple_set_location (fn_call, partition->loc);
> +  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
> +
> +  imm_use_iterator iter;
> +  gimple *stmt;
> +  use_operand_p use_p;
> +  FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
> +    {
> +      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> +       SET_USE (use_p, result_new);
> +
> +      update_stmt (stmt);
> +    }
> +
> +  fold_stmt (&gsi);
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    switch (builtin->fn)
> +      {
> +      case IFN_RAWMEMCHR8:
> +       fprintf (dump_file, "generated rawmemchr8\n");
> +       break;
> +
> +      case IFN_RAWMEMCHR16:
> +       fprintf (dump_file, "generated rawmemchr16\n");
> +       break;
> +
> +      case IFN_RAWMEMCHR32:
> +       fprintf (dump_file, "generated rawmemchr32\n");
> +       break;
> +
> +      default:
> +       gcc_unreachable ();
> +      }
> +}
> +
>  /* Remove and destroy the loop LOOP.  */
>
>  static void
> @@ -1334,6 +1398,10 @@ generate_code_for_partition (class loop *loop,
>        generate_memcpy_builtin (loop, partition);
>        break;
>
> +    case PKIND_RAWMEMCHR:
> +      generate_rawmemchr_builtin (loop, partition);
> +      break;
> +
>      default:
>        gcc_unreachable ();
>      }
> @@ -1525,44 +1593,53 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
>         }
>      }
>
> -  if (!single_st)
> +  if (!single_ld && !single_st)
>      return false;
>
> -  /* Bail out if this is a bitfield memory reference.  */
> -  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> -      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> -    return false;
> -
> -  /* Data reference must be executed exactly once per iteration of each
> -     loop in the loop nest.  We only need to check dominance information
> -     against the outermost one in a perfect loop nest because a bb can't
> -     dominate outermost loop's latch without dominating inner loop's.  */
> -  basic_block bb_st = gimple_bb (DR_STMT (single_st));
> -  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> -    return false;
> +  basic_block bb_ld = NULL;
> +  basic_block bb_st = NULL;
>
>    if (single_ld)
>      {
> -      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> -      /* Direct aggregate copy or via an SSA name temporary.  */
> -      if (load != store
> -         && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> -       return false;
> -
>        /* Bail out if this is a bitfield memory reference.  */
>        if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
>           && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
>         return false;
>
> -      /* Load and store must be in the same loop nest.  */
> -      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
> -      if (bb_st->loop_father != bb_ld->loop_father)
> +      /* Data reference must be executed exactly once per iteration of each
> +        loop in the loop nest.  We only need to check dominance information
> +        against the outermost one in a perfect loop nest because a bb can't
> +        dominate outermost loop's latch without dominating inner loop's.  */
> +      bb_ld = gimple_bb (DR_STMT (single_ld));
> +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> +       return false;
> +    }
> +
> +  if (single_st)
> +    {
> +      /* Bail out if this is a bitfield memory reference.  */
> +      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> +         && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
>         return false;
>
>        /* Data reference must be executed exactly once per iteration.
> -        Same as single_st, we only need to check against the outermost
> +        Same as single_ld, we only need to check against the outermost
>          loop.  */
> -      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> +      bb_st = gimple_bb (DR_STMT (single_st));
> +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> +       return false;
> +    }
> +
> +  if (single_ld && single_st)
> +    {
> +      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> +      /* Direct aggregate copy or via an SSA name temporary.  */
> +      if (load != store
> +         && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> +       return false;
> +
> +      /* Load and store must be in the same loop nest.  */
> +      if (bb_st->loop_father != bb_ld->loop_father)
>         return false;
>
>        edge e = single_exit (bb_st->loop_father);
> @@ -1681,6 +1758,84 @@ alloc_builtin (data_reference_p dst_dr, data_reference_p src_dr,
>    return builtin;
>  }
>
> +/* Given data reference DR in loop nest LOOP, classify if it forms builtin
> +   rawmemchr{8,16,32} call.  */
> +
> +static bool
> +classify_builtin_rawmemchr (loop_p loop, partition *partition, data_reference_p dr, tree loop_result)
> +{
> +  tree dr_ref = DR_REF (dr);
> +  tree dr_access_base = build_fold_addr_expr (dr_ref);
> +  tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref));
> +  gimple *dr_stmt = DR_STMT (dr);
> +  tree rhs1 = gimple_assign_rhs1 (dr_stmt);
> +  affine_iv iv;
> +  tree pattern;
> +
> +  if (TREE_OPERAND (rhs1, 0) != loop_result)
> +    return false;
> +
> +  /* A limitation of the current implementation is that we only support
> +     constant patterns.  */
> +  gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
> +  pattern = gimple_cond_rhs (cond_stmt);
> +  if (gimple_cond_code (cond_stmt) != NE_EXPR
> +      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt)
> +      || TREE_CODE (pattern) != INTEGER_CST)
> +    return false;
> +
> +  /* Bail out if no affine induction variable with constant step can be
> +     determined.  */
> +  if (!simple_iv (loop, loop, dr_access_base, &iv, false))
> +    return false;
> +
> +  /* Bail out if memory accesses are not consecutive.  */
> +  if (!operand_equal_p (iv.step, dr_access_size, 0))
> +    return false;
> +
> +  /* Bail out if direction of memory accesses is not growing.  */
> +  if (get_range_pos_neg (iv.step) != 1)
> +    return false;
> +
> +  internal_fn fn;
> +  switch (TREE_INT_CST_LOW (iv.step))
> +    {
> +    case 1:
> +      if (!targetm.have_rawmemchr8 ())
> +       return false;
> +      fn = IFN_RAWMEMCHR8;
> +      break;
> +
> +    case 2:
> +      if (!targetm.have_rawmemchr16 ())
> +       return false;
> +      fn = IFN_RAWMEMCHR16;
> +      break;
> +
> +    case 4:
> +      if (!targetm.have_rawmemchr32 ())
> +       return false;
> +      fn = IFN_RAWMEMCHR32;
> +      break;
> +
> +    default:
> +      return false;
> +    }
> +
> +  struct builtin_info *builtin;
> +  builtin = alloc_builtin (NULL, NULL, NULL_TREE, NULL_TREE, NULL_TREE);
> +  builtin->src_dr = dr;
> +  builtin->src_base = iv.base;
> +  builtin->pattern = pattern;
> +  builtin->fn = fn;
> +
> +  partition->loc = gimple_location (dr_stmt);
> +  partition->builtin = builtin;
> +  partition->kind = PKIND_RAWMEMCHR;
> +
> +  return true;
> +}
> +
>  /* Given data reference DR in loop nest LOOP, classify if it forms builtin
>     memset call.  */
>
> @@ -1792,12 +1947,16 @@ loop_distribution::classify_builtin_ldst (loop_p loop, struct graph *rdg,
>  bool
>  loop_distribution::classify_partition (loop_p loop,
>                                        struct graph *rdg, partition *partition,
> -                                      bitmap stmt_in_all_partitions)
> +                                      bitmap stmt_in_all_partitions,
> +                                      vec<struct partition *> *partitions)
>  {
>    bitmap_iterator bi;
>    unsigned i;
>    data_reference_p single_ld = NULL, single_st = NULL;
>    bool volatiles_p = false, has_reduction = false;
> +  unsigned nreductions = 0;
> +  gimple *reduction_stmt = NULL;
> +  bool has_interpar_reduction = false;
>
>    EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
>      {
> @@ -1821,6 +1980,19 @@ loop_distribution::classify_partition (loop_p loop,
>             partition->reduction_p = true;
>           else
>             has_reduction = true;
> +
> +         /* Determine whether the reduction statement occurs in other
> +            partitions than the current one.  */
> +         struct partition *piter;
> +         for (unsigned j = 0; partitions->iterate (j, &piter); ++j)
> +           {
> +             if (piter == partition)
> +               continue;
> +             if (bitmap_bit_p (piter->stmts, i))
> +               has_interpar_reduction = true;
> +           }
> +         reduction_stmt = stmt;
> +         ++nreductions;
>         }
>      }
>
> @@ -1840,6 +2012,30 @@ loop_distribution::classify_partition (loop_p loop,
>    if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
>      return has_reduction;
>
> +  /* If we determined a single load and a single reduction statement which does
> +     not occur in any other partition, then try to classify this partition as a
> +     rawmemchr builtin.  */
> +  if (single_ld != NULL
> +      && single_st == NULL
> +      && nreductions == 1
> +      && !has_interpar_reduction
> +      && is_gimple_assign (reduction_stmt))
> +    {
> +      /* If we classified the partition as a builtin, then ignoring the single
> +        reduction is safe, since the reduction variable is not used in other
> +        partitions.  */
> +      tree reduction_var = gimple_assign_lhs (reduction_stmt);
> +      return !classify_builtin_rawmemchr (loop, partition, single_ld, reduction_var);
> +    }
> +
> +  if (single_st == NULL)
> +    return has_reduction;
> +
> +  /* Don't distribute loop if niters is unknown.  */
> +  tree niters = number_of_latch_executions (loop);
> +  if (niters == NULL_TREE || niters == chrec_dont_know)
> +    return has_reduction;
> +
>    partition->loc = gimple_location (DR_STMT (single_st));
>
>    /* Classify the builtin kind.  */
> @@ -2979,7 +3175,7 @@ loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
>    FOR_EACH_VEC_ELT (partitions, i, partition)
>      {
>        reduction_in_all
> -       |= classify_partition (loop, rdg, partition, stmt_in_all_partitions);
> +       |= classify_partition (loop, rdg, partition, stmt_in_all_partitions, &partitions);
>        any_builtin |= partition_builtin_p (partition);
>      }
>
> @@ -3290,11 +3486,6 @@ loop_distribution::execute (function *fun)
>               && !optimize_loop_for_speed_p (loop)))
>         continue;
>
> -      /* Don't distribute loop if niters is unknown.  */
> -      tree niters = number_of_latch_executions (loop);
> -      if (niters == NULL_TREE || niters == chrec_dont_know)
> -       continue;
> -
>        /* Get the perfect loop nest for distribution.  */
>        loop = prepare_perfect_loop_nest (loop);
>        for (; loop; loop = loop->inner)
> --
> 2.23.0
>
Stefan Schulze Frielinghaus Feb. 14, 2021, 10:27 a.m. UTC | #2
On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> >
> > This patch adds support for recognizing loops which mimic the behaviour
> > of function rawmemchr, and replaces those with an internal function call
> > in case a target provides them.  In contrast to the original rawmemchr
> > function, this patch also supports different instances where the memory
> > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > respectively.
> >
> > This patch is not final and I'm looking for some feedback:
> >
> > Previously, only loops which mimic the behaviours of functions memset,
> > memcpy, and memmove have been detected and replaced by corresponding
> > function calls.  One characteristic of those loops/partitions is that
> > they don't have a reduction.  In contrast, loops which mimic the
> > behaviour of rawmemchr compute a result and therefore have a reduction.
> > My current attempt is to ensure that the reduction statement is not used
> > in any other partition and only in that case ignore the reduction and
> > replace the loop by a function call.  We then only need to replace the
> > reduction variable of the loop which contained the loop result by the
> > variable of the lhs of the internal function call.  This should ensure
> > that the transformation is correct independently of how partitions are
> > fused/distributed in the end.  Any thoughts about this?
> 
> Currently we're forcing reduction partitions last (and force to have a single
> one by fusing all partitions containing a reduction) because code-generation
> does not properly update SSA form for the reduction results.  ISTR that
> might be just because we do not copy the LC PHI nodes or do not adjust
> them when copying.  That might not be an issue in case you replace the
> partition with a call.  I guess you can try to have a testcase with
> two rawmemchr patterns and a regular loop part that has to be scheduled
> inbetween both for correctness.

Ah ok, in that case I updated my patch by removing the constraint that
the reduction statement must be in precisely one partition.  Please find
attached the testcases I came up with so far.  Since transforming a loop
into a rawmemchr function call is backend dependent, I planned to include
those only in my backend patch.  I wasn't able to come up with any
testcase where a loop is distributed into multiple partitions and where
one is classified as a rawmemchr builtin.  Such a partition boils down to
a for loop with an empty body, in which case I suspect that loop
distribution shouldn't be done anyway.
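
The degenerate shape meant here would be something like the following
(illustrative only): apart from the exit test and the pointer increment
the body is empty, so there is nothing left over for a second partition.

#include <stdint.h>

uint32_t *
skip (uint32_t *s)
{
  uint32_t *p;
  for (p = s; *p != 0xabcdef15; ++p)
    ;  /* empty body */
  return p;
}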

> > Furthermore, I simply added two new members (pattern, fn) to structure
> > builtin_info which I consider rather hacky.  For the long run I thought
> > about to split up structure builtin_info into a union where each member
> > is a structure for a particular builtin of a partition, i.e., something
> > like this:
> >
> > union builtin_info
> > {
> >   struct binfo_memset *memset;
> >   struct binfo_memcpymove *memcpymove;
> >   struct binfo_rawmemchr *rawmemchr;
> > };
> >
> > Such that a structure for one builtin does not get "polluted" by a
> > different one.  Any thoughts about this?
> 
> Probably makes sense if the list of recognized patterns grow further.
> 
> I see you use internal functions rather than builtin functions.  I guess
> that's OK.  But you use new target hooks for expansion where I think
> new optab entries similar to cmpmem would be more appropriate
> where the distinction between 8, 16 or 32 bits can be encoded in
> the modes.

The optab implementation is really nice: it allows me to use iterators
in the backend, which in the end saves me some boilerplate code compared
to the previous implementation :)

While using optabs now, I only require one additional member (pattern)
in the builtin_info struct.  Thus I didn't want to overcomplicate things
and kept the single-struct approach as is.

For the long run, should I resubmit this patch once stage 1 opens, or
how would you propose to proceed?

Thanks for your review so far!

Cheers,
Stefan

> 
> Richard.
> 
> > Cheers,
> > Stefan
> > ---
> >  gcc/internal-fn.c            |  42 ++++++
> >  gcc/internal-fn.def          |   3 +
> >  gcc/target-insns.def         |   3 +
> >  gcc/tree-loop-distribution.c | 257 ++++++++++++++++++++++++++++++-----
> >  4 files changed, 272 insertions(+), 33 deletions(-)
> >
> > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> > index dd7173126fb..9cd62544a1a 100644
> > --- a/gcc/internal-fn.c
> > +++ b/gcc/internal-fn.c
> > @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *)
> >    gcc_unreachable ();
> >  }
> >
> > +static void
> > +expand_RAWMEMCHR8 (internal_fn, gcall *stmt)
> > +{
> > +  if (targetm.have_rawmemchr8 ())
> > +    {
> > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > +      emit_insn (targetm.gen_rawmemchr8 (result, start, pattern));
> > +    }
> > +  else
> > +    gcc_unreachable();
> > +}
> > +
> > +static void
> > +expand_RAWMEMCHR16 (internal_fn, gcall *stmt)
> > +{
> > +  if (targetm.have_rawmemchr16 ())
> > +    {
> > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > +      emit_insn (targetm.gen_rawmemchr16 (result, start, pattern));
> > +    }
> > +  else
> > +    gcc_unreachable();
> > +}
> > +
> > +static void
> > +expand_RAWMEMCHR32 (internal_fn, gcall *stmt)
> > +{
> > +  if (targetm.have_rawmemchr32 ())
> > +    {
> > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > +      emit_insn (targetm.gen_rawmemchr32 (result, start, pattern));
> > +    }
> > +  else
> > +    gcc_unreachable();
> > +}
> > +
> >  /* Expand the IFN_UNIQUE function according to its first argument.  */
> >
> >  static void
> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > index daeace7a34e..34247859704 100644
> > --- a/gcc/internal-fn.def
> > +++ b/gcc/internal-fn.def
> > @@ -348,6 +348,9 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> >  DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
> >  DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
> >  DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> > +DEF_INTERNAL_FN (RAWMEMCHR8, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > +DEF_INTERNAL_FN (RAWMEMCHR16, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > +DEF_INTERNAL_FN (RAWMEMCHR32, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> >
> >  /* An unduplicable, uncombinable function.  Generally used to preserve
> >     a CFG property in the face of jump threading, tail merging or
> > diff --git a/gcc/target-insns.def b/gcc/target-insns.def
> > index 672c35698d7..9248554cbf3 100644
> > --- a/gcc/target-insns.def
> > +++ b/gcc/target-insns.def
> > @@ -106,3 +106,6 @@ DEF_TARGET_INSN (trap, (void))
> >  DEF_TARGET_INSN (unique, (void))
> >  DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2))
> >  DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1))
> > +DEF_TARGET_INSN (rawmemchr8, (rtx x0, rtx x1, rtx x2))
> > +DEF_TARGET_INSN (rawmemchr16, (rtx x0, rtx x1, rtx x2))
> > +DEF_TARGET_INSN (rawmemchr32, (rtx x0, rtx x1, rtx x2))
> > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> > index 7ee19fc8677..f5b24bf53bc 100644
> > --- a/gcc/tree-loop-distribution.c
> > +++ b/gcc/tree-loop-distribution.c
> > @@ -218,7 +218,7 @@ enum partition_kind {
> >         be unnecessary and removed once distributed memset can be understood
> >         and analyzed in data reference analysis.  See PR82604 for more.  */
> >      PKIND_PARTIAL_MEMSET,
> > -    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
> > +    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR
> >  };
> >
> >  /* Type of distributed loop.  */
> > @@ -244,6 +244,8 @@ struct builtin_info
> >       is only used in memset builtin distribution for now.  */
> >    tree dst_base_base;
> >    unsigned HOST_WIDE_INT dst_base_offset;
> > +  tree pattern;
> > +  internal_fn fn;
> >  };
> >
> >  /* Partition for loop distribution.  */
> > @@ -588,7 +590,8 @@ class loop_distribution
> >    bool
> >    classify_partition (loop_p loop,
> >                       struct graph *rdg, partition *partition,
> > -                     bitmap stmt_in_all_partitions);
> > +                     bitmap stmt_in_all_partitions,
> > +                     vec<struct partition *> *partitions);
> >
> >
> >    /* Returns true when PARTITION1 and PARTITION2 access the same memory
> > @@ -1232,6 +1235,67 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
> >      }
> >  }
> >
> > +/* Generate a call to rawmemchr{8,16,32} for PARTITION in LOOP.  */
> > +
> > +static void
> > +generate_rawmemchr_builtin (class loop *loop, partition *partition)
> > +{
> > +  gimple_stmt_iterator gsi;
> > +  tree mem, pattern;
> > +  struct builtin_info *builtin = partition->builtin;
> > +  gimple *fn_call;
> > +
> > +  data_reference_p dr = builtin->src_dr;
> > +  tree base = builtin->src_base;
> > +
> > +  tree result_old = TREE_OPERAND (DR_REF (dr), 0);
> > +  tree result_new = copy_ssa_name (result_old);
> > +
> > +  /* The new statements will be placed before LOOP.  */
> > +  gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> > +
> > +  mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false, GSI_CONTINUE_LINKING);
> > +  pattern = builtin->pattern;
> > +  if (TREE_CODE (pattern) == INTEGER_CST)
> > +    pattern = fold_convert (integer_type_node, pattern);
> > +  fn_call = gimple_build_call_internal (builtin->fn, 2, mem, pattern);
> > +  gimple_call_set_lhs (fn_call, result_new);
> > +  gimple_set_location (fn_call, partition->loc);
> > +  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
> > +
> > +  imm_use_iterator iter;
> > +  gimple *stmt;
> > +  use_operand_p use_p;
> > +  FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
> > +    {
> > +      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> > +       SET_USE (use_p, result_new);
> > +
> > +      update_stmt (stmt);
> > +    }
> > +
> > +  fold_stmt (&gsi);
> > +
> > +  if (dump_file && (dump_flags & TDF_DETAILS))
> > +    switch (builtin->fn)
> > +      {
> > +      case IFN_RAWMEMCHR8:
> > +       fprintf (dump_file, "generated rawmemchr8\n");
> > +       break;
> > +
> > +      case IFN_RAWMEMCHR16:
> > +       fprintf (dump_file, "generated rawmemchr16\n");
> > +       break;
> > +
> > +      case IFN_RAWMEMCHR32:
> > +       fprintf (dump_file, "generated rawmemchr32\n");
> > +       break;
> > +
> > +      default:
> > +       gcc_unreachable ();
> > +      }
> > +}
> > +
> >  /* Remove and destroy the loop LOOP.  */
> >
> >  static void
> > @@ -1334,6 +1398,10 @@ generate_code_for_partition (class loop *loop,
> >        generate_memcpy_builtin (loop, partition);
> >        break;
> >
> > +    case PKIND_RAWMEMCHR:
> > +      generate_rawmemchr_builtin (loop, partition);
> > +      break;
> > +
> >      default:
> >        gcc_unreachable ();
> >      }
> > @@ -1525,44 +1593,53 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
> >         }
> >      }
> >
> > -  if (!single_st)
> > +  if (!single_ld && !single_st)
> >      return false;
> >
> > -  /* Bail out if this is a bitfield memory reference.  */
> > -  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> > -      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> > -    return false;
> > -
> > -  /* Data reference must be executed exactly once per iteration of each
> > -     loop in the loop nest.  We only need to check dominance information
> > -     against the outermost one in a perfect loop nest because a bb can't
> > -     dominate outermost loop's latch without dominating inner loop's.  */
> > -  basic_block bb_st = gimple_bb (DR_STMT (single_st));
> > -  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> > -    return false;
> > +  basic_block bb_ld = NULL;
> > +  basic_block bb_st = NULL;
> >
> >    if (single_ld)
> >      {
> > -      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> > -      /* Direct aggregate copy or via an SSA name temporary.  */
> > -      if (load != store
> > -         && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> > -       return false;
> > -
> >        /* Bail out if this is a bitfield memory reference.  */
> >        if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
> >           && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
> >         return false;
> >
> > -      /* Load and store must be in the same loop nest.  */
> > -      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
> > -      if (bb_st->loop_father != bb_ld->loop_father)
> > +      /* Data reference must be executed exactly once per iteration of each
> > +        loop in the loop nest.  We only need to check dominance information
> > +        against the outermost one in a perfect loop nest because a bb can't
> > +        dominate outermost loop's latch without dominating inner loop's.  */
> > +      bb_ld = gimple_bb (DR_STMT (single_ld));
> > +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> > +       return false;
> > +    }
> > +
> > +  if (single_st)
> > +    {
> > +      /* Bail out if this is a bitfield memory reference.  */
> > +      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> > +         && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> >         return false;
> >
> >        /* Data reference must be executed exactly once per iteration.
> > -        Same as single_st, we only need to check against the outermost
> > +        Same as single_ld, we only need to check against the outermost
> >          loop.  */
> > -      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> > +      bb_st = gimple_bb (DR_STMT (single_st));
> > +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> > +       return false;
> > +    }
> > +
> > +  if (single_ld && single_st)
> > +    {
> > +      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> > +      /* Direct aggregate copy or via an SSA name temporary.  */
> > +      if (load != store
> > +         && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> > +       return false;
> > +
> > +      /* Load and store must be in the same loop nest.  */
> > +      if (bb_st->loop_father != bb_ld->loop_father)
> >         return false;
> >
> >        edge e = single_exit (bb_st->loop_father);
> > @@ -1681,6 +1758,84 @@ alloc_builtin (data_reference_p dst_dr, data_reference_p src_dr,
> >    return builtin;
> >  }
> >
> > +/* Given data reference DR in loop nest LOOP, classify if it forms builtin
> > +   rawmemchr{8,16,32} call.  */
> > +
> > +static bool
> > +classify_builtin_rawmemchr (loop_p loop, partition *partition, data_reference_p dr, tree loop_result)
> > +{
> > +  tree dr_ref = DR_REF (dr);
> > +  tree dr_access_base = build_fold_addr_expr (dr_ref);
> > +  tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref));
> > +  gimple *dr_stmt = DR_STMT (dr);
> > +  tree rhs1 = gimple_assign_rhs1 (dr_stmt);
> > +  affine_iv iv;
> > +  tree pattern;
> > +
> > +  if (TREE_OPERAND (rhs1, 0) != loop_result)
> > +    return false;
> > +
> > +  /* A limitation of the current implementation is that we only support
> > +     constant patterns.  */
> > +  gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
> > +  pattern = gimple_cond_rhs (cond_stmt);
> > +  if (gimple_cond_code (cond_stmt) != NE_EXPR
> > +      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt)
> > +      || TREE_CODE (pattern) != INTEGER_CST)
> > +    return false;
> > +
> > +  /* Bail out if no affine induction variable with constant step can be
> > +     determined.  */
> > +  if (!simple_iv (loop, loop, dr_access_base, &iv, false))
> > +    return false;
> > +
> > +  /* Bail out if memory accesses are not consecutive.  */
> > +  if (!operand_equal_p (iv.step, dr_access_size, 0))
> > +    return false;
> > +
> > +  /* Bail out if direction of memory accesses is not growing.  */
> > +  if (get_range_pos_neg (iv.step) != 1)
> > +    return false;
> > +
> > +  internal_fn fn;
> > +  switch (TREE_INT_CST_LOW (iv.step))
> > +    {
> > +    case 1:
> > +      if (!targetm.have_rawmemchr8 ())
> > +       return false;
> > +      fn = IFN_RAWMEMCHR8;
> > +      break;
> > +
> > +    case 2:
> > +      if (!targetm.have_rawmemchr16 ())
> > +       return false;
> > +      fn = IFN_RAWMEMCHR16;
> > +      break;
> > +
> > +    case 4:
> > +      if (!targetm.have_rawmemchr32 ())
> > +       return false;
> > +      fn = IFN_RAWMEMCHR32;
> > +      break;
> > +
> > +    default:
> > +      return false;
> > +    }
> > +
> > +  struct builtin_info *builtin;
> > +  builtin = alloc_builtin (NULL, NULL, NULL_TREE, NULL_TREE, NULL_TREE);
> > +  builtin->src_dr = dr;
> > +  builtin->src_base = iv.base;
> > +  builtin->pattern = pattern;
> > +  builtin->fn = fn;
> > +
> > +  partition->loc = gimple_location (dr_stmt);
> > +  partition->builtin = builtin;
> > +  partition->kind = PKIND_RAWMEMCHR;
> > +
> > +  return true;
> > +}
> > +
> >  /* Given data reference DR in loop nest LOOP, classify if it forms builtin
> >     memset call.  */
> >
> > @@ -1792,12 +1947,16 @@ loop_distribution::classify_builtin_ldst (loop_p loop, struct graph *rdg,
> >  bool
> >  loop_distribution::classify_partition (loop_p loop,
> >                                        struct graph *rdg, partition *partition,
> > -                                      bitmap stmt_in_all_partitions)
> > +                                      bitmap stmt_in_all_partitions,
> > +                                      vec<struct partition *> *partitions)
> >  {
> >    bitmap_iterator bi;
> >    unsigned i;
> >    data_reference_p single_ld = NULL, single_st = NULL;
> >    bool volatiles_p = false, has_reduction = false;
> > +  unsigned nreductions = 0;
> > +  gimple *reduction_stmt = NULL;
> > +  bool has_interpar_reduction = false;
> >
> >    EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
> >      {
> > @@ -1821,6 +1980,19 @@ loop_distribution::classify_partition (loop_p loop,
> >             partition->reduction_p = true;
> >           else
> >             has_reduction = true;
> > +
> > +         /* Determine whether the reduction statement occurs in other
> > +            partitions than the current one.  */
> > +         struct partition *piter;
> > +         for (unsigned j = 0; partitions->iterate (j, &piter); ++j)
> > +           {
> > +             if (piter == partition)
> > +               continue;
> > +             if (bitmap_bit_p (piter->stmts, i))
> > +               has_interpar_reduction = true;
> > +           }
> > +         reduction_stmt = stmt;
> > +         ++nreductions;
> >         }
> >      }
> >
> > @@ -1840,6 +2012,30 @@ loop_distribution::classify_partition (loop_p loop,
> >    if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
> >      return has_reduction;
> >
> > +  /* If we determined a single load and a single reduction statement which does
> > +     not occur in any other partition, then try to classify this partition as a
> > +     rawmemchr builtin.  */
> > +  if (single_ld != NULL
> > +      && single_st == NULL
> > +      && nreductions == 1
> > +      && !has_interpar_reduction
> > +      && is_gimple_assign (reduction_stmt))
> > +    {
> > +      /* If we classified the partition as a builtin, then ignoring the single
> > +        reduction is safe, since the reduction variable is not used in other
> > +        partitions.  */
> > +      tree reduction_var = gimple_assign_lhs (reduction_stmt);
> > +      return !classify_builtin_rawmemchr (loop, partition, single_ld, reduction_var);
> > +    }
> > +
> > +  if (single_st == NULL)
> > +    return has_reduction;
> > +
> > +  /* Don't distribute loop if niters is unknown.  */
> > +  tree niters = number_of_latch_executions (loop);
> > +  if (niters == NULL_TREE || niters == chrec_dont_know)
> > +    return has_reduction;
> > +
> >    partition->loc = gimple_location (DR_STMT (single_st));
> >
> >    /* Classify the builtin kind.  */
> > @@ -2979,7 +3175,7 @@ loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
> >    FOR_EACH_VEC_ELT (partitions, i, partition)
> >      {
> >        reduction_in_all
> > -       |= classify_partition (loop, rdg, partition, stmt_in_all_partitions);
> > +       |= classify_partition (loop, rdg, partition, stmt_in_all_partitions, &partitions);
> >        any_builtin |= partition_builtin_p (partition);
> >      }
> >
> > @@ -3290,11 +3486,6 @@ loop_distribution::execute (function *fun)
> >               && !optimize_loop_for_speed_p (loop)))
> >         continue;
> >
> > -      /* Don't distribute loop if niters is unknown.  */
> > -      tree niters = number_of_latch_executions (loop);
> > -      if (niters == NULL_TREE || niters == chrec_dont_know)
> > -       continue;
> > -
> >        /* Get the perfect loop nest for distribution.  */
> >        loop = prepare_perfect_loop_nest (loop);
> >        for (; loop; loop = loop->inner)
> > --
> > 2.23.0
> >
commit bf792239150
Author: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
Date:   Mon Feb 8 10:35:30 2021 +0100

    ldist: Recognize rawmemchr loop patterns
    
    This patch adds support for recognizing loops which mimic the behaviour
    of function rawmemchr, and replaces those with an internal function call
    in case a target provides them.  In contrast to the original rawmemchr
    function, this patch also supports different instances where the memory
    pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
    respectively.

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index dd7173126fb..18e12b863c6 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2917,6 +2917,31 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);
+  tree lhs_type = TREE_TYPE (lhs);
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
+
+  for (unsigned int i = 0; i < 2; ++i)
+    {
+      tree rhs = gimple_call_arg (stmt, i);
+      tree rhs_type = TREE_TYPE (rhs);
+      rtx rhs_rtx = expand_normal (rhs);
+      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
+    }
+
+  insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
+
+  expand_insn (icode, 3, ops);
+  if (!rtx_equal_p (lhs_rtx, ops[0].value))
+    emit_move_insn (lhs_rtx, ops[0].value);
+}
+
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index daeace7a34e..95c76795648 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b192a9d070b..f7c69f914ce 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
 OPTAB_D (movmem_optab, "movmem$a")
 OPTAB_D (setmem_optab, "setmem$a")
 OPTAB_D (strlen_optab, "strlen$a")
+OPTAB_D (rawmemchr_optab, "rawmemchr$I$a")
 
 OPTAB_DC(fma_optab, "fma$a4", FMA)
 OPTAB_D (fms_optab, "fms$a4")
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 7ee19fc8677..09f200da61f 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -115,6 +115,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "tree-eh.h"
 #include "gimple-fold.h"
+#include "rtl.h"
+#include "memmodel.h"
+#include "insn-codes.h"
+#include "optabs.h"
 
 
 #define MAX_DATAREFS_NUM \
@@ -218,7 +222,7 @@ enum partition_kind {
        be unnecessary and removed once distributed memset can be understood
        and analyzed in data reference analysis.  See PR82604 for more.  */
     PKIND_PARTIAL_MEMSET,
-    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
+    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR
 };
 
 /* Type of distributed loop.  */
@@ -244,6 +248,8 @@ struct builtin_info
      is only used in memset builtin distribution for now.  */
   tree dst_base_base;
   unsigned HOST_WIDE_INT dst_base_offset;
+  /* Pattern is used only in rawmemchr builtin distribution for now.  */
+  tree pattern;
 };
 
 /* Partition for loop distribution.  */
@@ -1232,6 +1238,66 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
     }
 }
 
+/* Generate a call to rawmemchr for PARTITION in LOOP.  */
+
+static void
+generate_rawmemchr_builtin (class loop *loop, partition *partition)
+{
+  gimple_stmt_iterator gsi;
+  tree mem, pattern;
+  struct builtin_info *builtin = partition->builtin;
+  gimple *fn_call;
+
+  data_reference_p dr = builtin->src_dr;
+  tree base = builtin->src_base;
+
+  tree result_old = build_fold_addr_expr (DR_REF (dr));
+  tree result_new = copy_ssa_name (result_old);
+
+  /* The new statements will be placed before LOOP.  */
+  gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+
+  mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
+				  GSI_CONTINUE_LINKING);
+  pattern = builtin->pattern;
+  fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
+  gimple_call_set_lhs (fn_call, result_new);
+  gimple_set_location (fn_call, partition->loc);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, result_new);
+
+      update_stmt (stmt);
+    }
+
+  fold_stmt (&gsi);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    switch (TYPE_MODE (TREE_TYPE (pattern)))
+      {
+      case QImode:
+	fprintf (dump_file, "generated rawmemchrqi\n");
+	break;
+
+      case HImode:
+	fprintf (dump_file, "generated rawmemchrhi\n");
+	break;
+
+      case SImode:
+	fprintf (dump_file, "generated rawmemchrsi\n");
+	break;
+
+      default:
+	gcc_unreachable ();
+      }
+}
+
 /* Remove and destroy the loop LOOP.  */
 
 static void
@@ -1334,6 +1400,10 @@ generate_code_for_partition (class loop *loop,
       generate_memcpy_builtin (loop, partition);
       break;
 
+    case PKIND_RAWMEMCHR:
+      generate_rawmemchr_builtin (loop, partition);
+      break;
+
     default:
       gcc_unreachable ();
     }
@@ -1525,44 +1595,53 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
 	}
     }
 
-  if (!single_st)
+  if (!single_ld && !single_st)
     return false;
 
-  /* Bail out if this is a bitfield memory reference.  */
-  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
-      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
-    return false;
-
-  /* Data reference must be executed exactly once per iteration of each
-     loop in the loop nest.  We only need to check dominance information
-     against the outermost one in a perfect loop nest because a bb can't
-     dominate outermost loop's latch without dominating inner loop's.  */
-  basic_block bb_st = gimple_bb (DR_STMT (single_st));
-  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
-    return false;
+  basic_block bb_ld = NULL;
+  basic_block bb_st = NULL;
 
   if (single_ld)
     {
-      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
-      /* Direct aggregate copy or via an SSA name temporary.  */
-      if (load != store
-	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
-	return false;
-
       /* Bail out if this is a bitfield memory reference.  */
       if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
 	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
 	return false;
 
-      /* Load and store must be in the same loop nest.  */
-      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
-      if (bb_st->loop_father != bb_ld->loop_father)
+      /* Data reference must be executed exactly once per iteration of each
+	 loop in the loop nest.  We only need to check dominance information
+	 against the outermost one in a perfect loop nest because a bb can't
+	 dominate outermost loop's latch without dominating inner loop's.  */
+      bb_ld = gimple_bb (DR_STMT (single_ld));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+	return false;
+    }
+
+  if (single_st)
+    {
+      /* Bail out if this is a bitfield memory reference.  */
+      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
+	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
 	return false;
 
       /* Data reference must be executed exactly once per iteration.
-	 Same as single_st, we only need to check against the outermost
+	 Same as single_ld, we only need to check against the outermost
 	 loop.  */
-      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+      bb_st = gimple_bb (DR_STMT (single_st));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
+	return false;
+    }
+
+  if (single_ld && single_st)
+    {
+      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
+      /* Direct aggregate copy or via an SSA name temporary.  */
+      if (load != store
+	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
+	return false;
+
+      /* Load and store must be in the same loop nest.  */
+      if (bb_st->loop_father != bb_ld->loop_father)
 	return false;
 
       edge e = single_exit (bb_st->loop_father);
@@ -1681,6 +1760,68 @@ alloc_builtin (data_reference_p dst_dr, data_reference_p src_dr,
   return builtin;
 }
 
+/* Given data reference DR in loop nest LOOP, classify if it forms builtin
+   rawmemchr call.  */
+
+static bool
+classify_builtin_rawmemchr (loop_p loop, partition *partition,
+			    data_reference_p dr, tree loop_result)
+{
+  tree dr_ref = DR_REF (dr);
+  tree dr_access_base = build_fold_addr_expr (dr_ref);
+  tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref));
+  gimple *dr_stmt = DR_STMT (dr);
+  affine_iv iv;
+  tree pattern;
+
+  if (dr_access_base != loop_result)
+    return false;
+
+  /* A limitation of the current implementation is that we only support
+     constant patterns.  */
+  gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
+  pattern = gimple_cond_rhs (cond_stmt);
+  if (gimple_cond_code (cond_stmt) != NE_EXPR
+      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt)
+      || TREE_CODE (pattern) != INTEGER_CST)
+    return false;
+
+  /* Bail out if no affine induction variable with constant step can be
+     determined.  */
+  if (!simple_iv (loop, loop, dr_access_base, &iv, false))
+    return false;
+
+  /* Bail out if memory accesses are not consecutive.  */
+  if (!operand_equal_p (iv.step, dr_access_size, 0))
+    return false;
+
+  /* Bail out if direction of memory accesses is not growing.  */
+  if (get_range_pos_neg (iv.step) != 1)
+    return false;
+
+  /* Bail out if target does not provide rawmemchr for a certain mode.  */
+  machine_mode mode;
+  switch (TREE_INT_CST_LOW (iv.step))
+    {
+    case 1: mode = QImode; break;
+    case 2: mode = HImode; break;
+    case 4: mode = SImode; break;
+    default: return false;
+    }
+  if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
+    return false;
+
+  struct builtin_info *builtin;
+  builtin = alloc_builtin (NULL, dr, NULL_TREE, iv.base, NULL_TREE);
+  builtin->pattern = pattern;
+
+  partition->loc = gimple_location (dr_stmt);
+  partition->builtin = builtin;
+  partition->kind = PKIND_RAWMEMCHR;
+
+  return true;
+}
+
 /* Given data reference DR in loop nest LOOP, classify if it forms builtin
    memset call.  */
 
@@ -1798,6 +1939,8 @@ loop_distribution::classify_partition (loop_p loop,
   unsigned i;
   data_reference_p single_ld = NULL, single_st = NULL;
   bool volatiles_p = false, has_reduction = false;
+  unsigned nreductions = 0;
+  gimple *reduction_stmt = NULL;
 
   EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
     {
@@ -1821,6 +1964,10 @@ loop_distribution::classify_partition (loop_p loop,
 	    partition->reduction_p = true;
 	  else
 	    has_reduction = true;
+
+	  /* Determine whether STMT is the only reduction statement or not.  */
+	  reduction_stmt = stmt;
+	  ++nreductions;
 	}
     }
 
@@ -1840,6 +1987,27 @@ loop_distribution::classify_partition (loop_p loop,
   if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
     return has_reduction;
 
+  /* If we determined a single load and a single reduction statement, then try
+     to classify this partition as a rawmemchr builtin.  */
+  if (single_ld != NULL
+      && single_st == NULL
+      && nreductions == 1
+      && is_gimple_assign (reduction_stmt))
+    {
+      /* If we classified the partition as a builtin, then ignoring the single
+	 reduction is safe, since the whole partition is replaced by a call.  */
+      tree reduction_var = gimple_assign_lhs (reduction_stmt);
+      return !classify_builtin_rawmemchr (loop, partition, single_ld, reduction_var);
+    }
+
+  if (single_st == NULL)
+    return has_reduction;
+
+  /* Don't distribute loop if niters is unknown.  */
+  tree niters = number_of_latch_executions (loop);
+  if (niters == NULL_TREE || niters == chrec_dont_know)
+    return has_reduction;
+
   partition->loc = gimple_location (DR_STMT (single_st));
 
   /* Classify the builtin kind.  */
@@ -3290,11 +3458,6 @@ loop_distribution::execute (function *fun)
 	      && !optimize_loop_for_speed_p (loop)))
 	continue;
 
-      /* Don't distribute loop if niters is unknown.  */
-      tree niters = number_of_latch_executions (loop);
-      if (niters == NULL_TREE || niters == chrec_dont_know)
-	continue;
-
       /* Get the perfect loop nest for distribution.  */
       loop = prepare_perfect_loop_nest (loop);
      for (; loop; loop = loop->inner)

/* { dg-do compile } */
/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */

#include <cstdint>

template<typename T, T pattern>
T *rawmemchr(T *s)
{
  while (*s != pattern)
    ++s;
  return s;
}

template uint8_t *rawmemchr<uint8_t, 0xab>(uint8_t *);
template uint16_t *rawmemchr<uint16_t, 0xabcd>(uint16_t *);
template uint32_t *rawmemchr<uint32_t, 0xabcdef15>(uint32_t *);

template int8_t *rawmemchr<int8_t, (int8_t)0xab>(int8_t *);
template int16_t *rawmemchr<int16_t, (int16_t)0xabcd>(int16_t *);
template int32_t *rawmemchr<int32_t, (int32_t)0xabcdef15>(int32_t *);

/* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" } } */
/* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" } } */
/* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" } } */

/* { dg-do run } */
/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details -mzarch -march=z13" } */

#include <cstring>
#include <cassert>
#include <cstdint>
#include <iostream>

template <typename T, T pattern>
__attribute__((noinline,noclone))
T* rawmemchr (T *s)
{
  while (*s != pattern)
    ++s;
  return s;
}

template <typename T, T pattern>
void doit()
{
  T *buf = new T[4096 * 2];
  assert (buf != NULL);
  memset (buf, 0xa, sizeof (T) * 4096 * 2);
  // ensure q is 4096-byte aligned
  T *q = buf + (4096 - ((uintptr_t)buf & 4095));
  T *p;

  // unaligned + block boundary + 1st load
  p = (T *) ((uintptr_t)q - 8);
  p[2] = pattern;
  assert ((rawmemchr<T, pattern> (&p[0]) == &p[2]));
  p[2] = (T) 0xaaaaaaaa;

  // unaligned + block boundary + 2nd load
  p = (T *) ((uintptr_t)q - 8);
  p[6] = pattern;
  assert ((rawmemchr<T, pattern> (&p[0]) == &p[6]));
  p[6] = (T) 0xaaaaaaaa;



  // unaligned + 1st load
  q[5] = pattern;
  assert ((rawmemchr<T, pattern>(&q[2]) == &q[5]));
  q[5] = (T) 0xaaaaaaaa;

  // unaligned + 2nd load
  q[14] = pattern;
  assert ((rawmemchr<T, pattern>(&q[2]) == &q[14]));
  q[14] = (T) 0xaaaaaaaa;

  // unaligned + 3rd load
  q[19] = pattern;
  assert ((rawmemchr<T, pattern>(&q[2]) == &q[19]));
  q[19] = (T) 0xaaaaaaaa;

  // unaligned + 4th load
  q[25] = pattern;
  assert ((rawmemchr<T, pattern>(&q[2]) == &q[25]));
  q[25] = (T) 0xaaaaaaaa;



  // aligned + 1st load
  q[5] = pattern;
  assert ((rawmemchr<T, pattern>(&q[0]) == &q[5]));
  q[5] = (T) 0xaaaaaaaa;

  // aligned + 2nd load
  q[14] = pattern;
  assert ((rawmemchr<T, pattern>(&q[0]) == &q[14]));
  q[14] = (T) 0xaaaaaaaa;

  // aligned + 3rd load
  q[19] = pattern;
  assert ((rawmemchr<T, pattern>(&q[0]) == &q[19]));
  q[19] = (T) 0xaaaaaaaa;

  // aligned + 4th load
  q[25] = pattern;
  assert ((rawmemchr<T, pattern>(&q[0]) == &q[25]));
  q[25] = (T) 0xaaaaaaaa;

  delete[] buf;
}

int main(void)
{
  doit<uint8_t, 0xde> ();
  doit<int8_t, (int8_t)0xde> ();
  doit<uint16_t, 0xdead> ();
  doit<int16_t, (int16_t)0xdead> ();
  doit<uint32_t, 0xdeadbeef> ();
  doit<int32_t, (int32_t)0xdeadbeef> ();
  return 0;
}

/* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" } } */
/* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" } } */
/* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" } } */
Stefan Schulze Frielinghaus Feb. 25, 2021, 5:01 p.m. UTC | #3
Ping

On Sun, Feb 14, 2021 at 11:27:40AM +0100, Stefan Schulze Frielinghaus wrote:
> On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > This patch adds support for recognizing loops which mimic the behaviour
> > > of function rawmemchr, and replaces those with an internal function call
> > > in case a target provides them.  In contrast to the original rawmemchr
> > > function, this patch also supports different instances where the memory
> > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > > respectively.
> > >
> > > This patch is not final and I'm looking for some feedback:
> > >
> > > Previously, only loops which mimic the behaviours of functions memset,
> > > memcpy, and memmove have been detected and replaced by corresponding
> > > function calls.  One characteristic of those loops/partitions is that
> > > they don't have a reduction.  In contrast, loops which mimic the
> > > behaviour of rawmemchr compute a result and therefore have a reduction.
> > > My current attempt is to ensure that the reduction statement is not used
> > > in any other partition and only in that case ignore the reduction and
> > > replace the loop by a function call.  We then only need to replace the
> > > reduction variable of the loop which contained the loop result by the
> > > variable of the lhs of the internal function call.  This should ensure
> > > that the transformation is correct independently of how partitions are
> > > fused/distributed in the end.  Any thoughts about this?
> > 
> > Currently we're forcing reduction partitions last (and force a single
> > one by fusing all partitions containing a reduction) because code-generation
> > does not properly update SSA form for the reduction results.  ISTR that
> > might be just because we do not copy the LC PHI nodes or do not adjust
> > them when copying.  That might not be an issue in case you replace the
> > partition with a call.  I guess you can try to have a testcase with
> > two rawmemchr patterns and a regular loop part that has to be scheduled
> > in between both for correctness.
> 
> Ah ok, in that case I updated my patch by removing the constraint that
> the reduction statement must be in precisely one partition.  Please find
> attached the testcases I have come up with so far.  Since transforming a loop into
> a rawmemchr function call is backend dependent, I planned to include
> those only in my backend patch.  I wasn't able to come up with any
> testcase where a loop is distributed into multiple partitions and where
> one is classified as a rawmemchr builtin.  The latter boils down to a
> for loop with an empty body only, in which case I suspect that loop
> distribution shouldn't be done anyway.
> 
> > > Furthermore, I simply added two new members (pattern, fn) to structure
> > > builtin_info which I consider rather hacky.  For the long run I thought
> > > about to split up structure builtin_info into a union where each member
> > > is a structure for a particular builtin of a partition, i.e., something
> > > like this:
> > >
> > > union builtin_info
> > > {
> > >   struct binfo_memset *memset;
> > >   struct binfo_memcpymove *memcpymove;
> > >   struct binfo_rawmemchr *rawmemchr;
> > > };
> > >
> > > Such that a structure for one builtin does not get "polluted" by a
> > > different one.  Any thoughts about this?
> > 
> > Probably makes sense if the list of recognized patterns grows further.
> > 
> > I see you use internal functions rather than builtin functions.  I guess
> > that's OK.  But you use new target hooks for expansion where I think
> > new optab entries similar to cmpmem would be more appropriate
> > where the distinction between 8, 16 or 32 bits can be encoded in
> > the modes.
> 
> The optab implementation is really nice: it allows me to use iterators
> in the backend, which in the end saves me some boilerplate code compared
> to the previous implementation :)
> 
> While using optabs now, I only require one additional member (pattern)
> in the builtin_info struct.  Thus I didn't want to overcomplicate things
> and kept the single struct approach as is.
> 
> For the long run, should I resubmit this patch once stage 1 opens or how
> would you propose to proceed?
> 
> Thanks for your review so far!
> 
> Cheers,
> Stefan
> 
> > 
> > Richard.
> > 
> > > Cheers,
> > > Stefan
> > > ---
> > >  gcc/internal-fn.c            |  42 ++++++
> > >  gcc/internal-fn.def          |   3 +
> > >  gcc/target-insns.def         |   3 +
> > >  gcc/tree-loop-distribution.c | 257 ++++++++++++++++++++++++++++++-----
> > >  4 files changed, 272 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> > > index dd7173126fb..9cd62544a1a 100644
> > > --- a/gcc/internal-fn.c
> > > +++ b/gcc/internal-fn.c
> > > @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *)
> > >    gcc_unreachable ();
> > >  }
> > >
> > > +static void
> > > +expand_RAWMEMCHR8 (internal_fn, gcall *stmt)
> > > +{
> > > +  if (targetm.have_rawmemchr8 ())
> > > +    {
> > > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > +      emit_insn (targetm.gen_rawmemchr8 (result, start, pattern));
> > > +    }
> > > +  else
> > > +    gcc_unreachable();
> > > +}
> > > +
> > > +static void
> > > +expand_RAWMEMCHR16 (internal_fn, gcall *stmt)
> > > +{
> > > +  if (targetm.have_rawmemchr16 ())
> > > +    {
> > > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > +      emit_insn (targetm.gen_rawmemchr16 (result, start, pattern));
> > > +    }
> > > +  else
> > > +    gcc_unreachable();
> > > +}
> > > +
> > > +static void
> > > +expand_RAWMEMCHR32 (internal_fn, gcall *stmt)
> > > +{
> > > +  if (targetm.have_rawmemchr32 ())
> > > +    {
> > > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > +      emit_insn (targetm.gen_rawmemchr32 (result, start, pattern));
> > > +    }
> > > +  else
> > > +    gcc_unreachable();
> > > +}
> > > +
> > >  /* Expand the IFN_UNIQUE function according to its first argument.  */
> > >
> > >  static void
> > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > index daeace7a34e..34247859704 100644
> > > --- a/gcc/internal-fn.def
> > > +++ b/gcc/internal-fn.def
> > > @@ -348,6 +348,9 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> > >  DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
> > >  DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
> > >  DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> > > +DEF_INTERNAL_FN (RAWMEMCHR8, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > > +DEF_INTERNAL_FN (RAWMEMCHR16, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > > +DEF_INTERNAL_FN (RAWMEMCHR32, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > >
> > >  /* An unduplicable, uncombinable function.  Generally used to preserve
> > >     a CFG property in the face of jump threading, tail merging or
> > > diff --git a/gcc/target-insns.def b/gcc/target-insns.def
> > > index 672c35698d7..9248554cbf3 100644
> > > --- a/gcc/target-insns.def
> > > +++ b/gcc/target-insns.def
> > > @@ -106,3 +106,6 @@ DEF_TARGET_INSN (trap, (void))
> > >  DEF_TARGET_INSN (unique, (void))
> > >  DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2))
> > >  DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1))
> > > +DEF_TARGET_INSN (rawmemchr8, (rtx x0, rtx x1, rtx x2))
> > > +DEF_TARGET_INSN (rawmemchr16, (rtx x0, rtx x1, rtx x2))
> > > +DEF_TARGET_INSN (rawmemchr32, (rtx x0, rtx x1, rtx x2))
> > > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> > > index 7ee19fc8677..f5b24bf53bc 100644
> > > --- a/gcc/tree-loop-distribution.c
> > > +++ b/gcc/tree-loop-distribution.c
> > > @@ -218,7 +218,7 @@ enum partition_kind {
> > >         be unnecessary and removed once distributed memset can be understood
> > >         and analyzed in data reference analysis.  See PR82604 for more.  */
> > >      PKIND_PARTIAL_MEMSET,
> > > -    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
> > > +    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR
> > >  };
> > >
> > >  /* Type of distributed loop.  */
> > > @@ -244,6 +244,8 @@ struct builtin_info
> > >       is only used in memset builtin distribution for now.  */
> > >    tree dst_base_base;
> > >    unsigned HOST_WIDE_INT dst_base_offset;
> > > +  tree pattern;
> > > +  internal_fn fn;
> > >  };
> > >
> > >  /* Partition for loop distribution.  */
> > > @@ -588,7 +590,8 @@ class loop_distribution
> > >    bool
> > >    classify_partition (loop_p loop,
> > >                       struct graph *rdg, partition *partition,
> > > -                     bitmap stmt_in_all_partitions);
> > > +                     bitmap stmt_in_all_partitions,
> > > +                     vec<struct partition *> *partitions);
> > >
> > >
> > >    /* Returns true when PARTITION1 and PARTITION2 access the same memory
> > > @@ -1232,6 +1235,67 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
> > >      }
> > >  }
> > >
> > > +/* Generate a call to rawmemchr{8,16,32} for PARTITION in LOOP.  */
> > > +
> > > +static void
> > > +generate_rawmemchr_builtin (class loop *loop, partition *partition)
> > > +{
> > > +  gimple_stmt_iterator gsi;
> > > +  tree mem, pattern;
> > > +  struct builtin_info *builtin = partition->builtin;
> > > +  gimple *fn_call;
> > > +
> > > +  data_reference_p dr = builtin->src_dr;
> > > +  tree base = builtin->src_base;
> > > +
> > > +  tree result_old = TREE_OPERAND (DR_REF (dr), 0);
> > > +  tree result_new = copy_ssa_name (result_old);
> > > +
> > > +  /* The new statements will be placed before LOOP.  */
> > > +  gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> > > +
> > > +  mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false, GSI_CONTINUE_LINKING);
> > > +  pattern = builtin->pattern;
> > > +  if (TREE_CODE (pattern) == INTEGER_CST)
> > > +    pattern = fold_convert (integer_type_node, pattern);
> > > +  fn_call = gimple_build_call_internal (builtin->fn, 2, mem, pattern);
> > > +  gimple_call_set_lhs (fn_call, result_new);
> > > +  gimple_set_location (fn_call, partition->loc);
> > > +  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
> > > +
> > > +  imm_use_iterator iter;
> > > +  gimple *stmt;
> > > +  use_operand_p use_p;
> > > +  FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
> > > +    {
> > > +      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> > > +       SET_USE (use_p, result_new);
> > > +
> > > +      update_stmt (stmt);
> > > +    }
> > > +
> > > +  fold_stmt (&gsi);
> > > +
> > > +  if (dump_file && (dump_flags & TDF_DETAILS))
> > > +    switch (builtin->fn)
> > > +      {
> > > +      case IFN_RAWMEMCHR8:
> > > +       fprintf (dump_file, "generated rawmemchr8\n");
> > > +       break;
> > > +
> > > +      case IFN_RAWMEMCHR16:
> > > +       fprintf (dump_file, "generated rawmemchr16\n");
> > > +       break;
> > > +
> > > +      case IFN_RAWMEMCHR32:
> > > +       fprintf (dump_file, "generated rawmemchr32\n");
> > > +       break;
> > > +
> > > +      default:
> > > +       gcc_unreachable ();
> > > +      }
> > > +}
> > > +
> > >  /* Remove and destroy the loop LOOP.  */
> > >
> > >  static void
> > > @@ -1334,6 +1398,10 @@ generate_code_for_partition (class loop *loop,
> > >        generate_memcpy_builtin (loop, partition);
> > >        break;
> > >
> > > +    case PKIND_RAWMEMCHR:
> > > +      generate_rawmemchr_builtin (loop, partition);
> > > +      break;
> > > +
> > >      default:
> > >        gcc_unreachable ();
> > >      }
> > > @@ -1525,44 +1593,53 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
> > >         }
> > >      }
> > >
> > > -  if (!single_st)
> > > +  if (!single_ld && !single_st)
> > >      return false;
> > >
> > > -  /* Bail out if this is a bitfield memory reference.  */
> > > -  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> > > -      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> > > -    return false;
> > > -
> > > -  /* Data reference must be executed exactly once per iteration of each
> > > -     loop in the loop nest.  We only need to check dominance information
> > > -     against the outermost one in a perfect loop nest because a bb can't
> > > -     dominate outermost loop's latch without dominating inner loop's.  */
> > > -  basic_block bb_st = gimple_bb (DR_STMT (single_st));
> > > -  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> > > -    return false;
> > > +  basic_block bb_ld = NULL;
> > > +  basic_block bb_st = NULL;
> > >
> > >    if (single_ld)
> > >      {
> > > -      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> > > -      /* Direct aggregate copy or via an SSA name temporary.  */
> > > -      if (load != store
> > > -         && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> > > -       return false;
> > > -
> > >        /* Bail out if this is a bitfield memory reference.  */
> > >        if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
> > >           && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
> > >         return false;
> > >
> > > -      /* Load and store must be in the same loop nest.  */
> > > -      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
> > > -      if (bb_st->loop_father != bb_ld->loop_father)
> > > +      /* Data reference must be executed exactly once per iteration of each
> > > +        loop in the loop nest.  We only need to check dominance information
> > > +        against the outermost one in a perfect loop nest because a bb can't
> > > +        dominate outermost loop's latch without dominating inner loop's.  */
> > > +      bb_ld = gimple_bb (DR_STMT (single_ld));
> > > +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> > > +       return false;
> > > +    }
> > > +
> > > +  if (single_st)
> > > +    {
> > > +      /* Bail out if this is a bitfield memory reference.  */
> > > +      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> > > +         && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> > >         return false;
> > >
> > >        /* Data reference must be executed exactly once per iteration.
> > > -        Same as single_st, we only need to check against the outermost
> > > +        Same as single_ld, we only need to check against the outermost
> > >          loop.  */
> > > -      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> > > +      bb_st = gimple_bb (DR_STMT (single_st));
> > > +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> > > +       return false;
> > > +    }
> > > +
> > > +  if (single_ld && single_st)
> > > +    {
> > > +      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> > > +      /* Direct aggregate copy or via an SSA name temporary.  */
> > > +      if (load != store
> > > +         && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> > > +       return false;
> > > +
> > > +      /* Load and store must be in the same loop nest.  */
> > > +      if (bb_st->loop_father != bb_ld->loop_father)
> > >         return false;
> > >
> > >        edge e = single_exit (bb_st->loop_father);
> > > @@ -1681,6 +1758,84 @@ alloc_builtin (data_reference_p dst_dr, data_reference_p src_dr,
> > >    return builtin;
> > >  }
> > >
> > > +/* Given data reference DR in loop nest LOOP, classify if it forms builtin
> > > +   rawmemchr{8,16,32} call.  */
> > > +
> > > +static bool
> > > +classify_builtin_rawmemchr (loop_p loop, partition *partition, data_reference_p dr, tree loop_result)
> > > +{
> > > +  tree dr_ref = DR_REF (dr);
> > > +  tree dr_access_base = build_fold_addr_expr (dr_ref);
> > > +  tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref));
> > > +  gimple *dr_stmt = DR_STMT (dr);
> > > +  tree rhs1 = gimple_assign_rhs1 (dr_stmt);
> > > +  affine_iv iv;
> > > +  tree pattern;
> > > +
> > > +  if (TREE_OPERAND (rhs1, 0) != loop_result)
> > > +    return false;
> > > +
> > > +  /* A limitation of the current implementation is that we only support
> > > +     constant patterns.  */
> > > +  gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
> > > +  pattern = gimple_cond_rhs (cond_stmt);
> > > +  if (gimple_cond_code (cond_stmt) != NE_EXPR
> > > +      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt)
> > > +      || TREE_CODE (pattern) != INTEGER_CST)
> > > +    return false;
> > > +
> > > +  /* Bail out if no affine induction variable with constant step can be
> > > +     determined.  */
> > > +  if (!simple_iv (loop, loop, dr_access_base, &iv, false))
> > > +    return false;
> > > +
> > > +  /* Bail out if memory accesses are not consecutive.  */
> > > +  if (!operand_equal_p (iv.step, dr_access_size, 0))
> > > +    return false;
> > > +
> > > +  /* Bail out if direction of memory accesses is not growing.  */
> > > +  if (get_range_pos_neg (iv.step) != 1)
> > > +    return false;
> > > +
> > > +  internal_fn fn;
> > > +  switch (TREE_INT_CST_LOW (iv.step))
> > > +    {
> > > +    case 1:
> > > +      if (!targetm.have_rawmemchr8 ())
> > > +       return false;
> > > +      fn = IFN_RAWMEMCHR8;
> > > +      break;
> > > +
> > > +    case 2:
> > > +      if (!targetm.have_rawmemchr16 ())
> > > +       return false;
> > > +      fn = IFN_RAWMEMCHR16;
> > > +      break;
> > > +
> > > +    case 4:
> > > +      if (!targetm.have_rawmemchr32 ())
> > > +       return false;
> > > +      fn = IFN_RAWMEMCHR32;
> > > +      break;
> > > +
> > > +    default:
> > > +      return false;
> > > +    }
> > > +
> > > +  struct builtin_info *builtin;
> > > +  builtin = alloc_builtin (NULL, NULL, NULL_TREE, NULL_TREE, NULL_TREE);
> > > +  builtin->src_dr = dr;
> > > +  builtin->src_base = iv.base;
> > > +  builtin->pattern = pattern;
> > > +  builtin->fn = fn;
> > > +
> > > +  partition->loc = gimple_location (dr_stmt);
> > > +  partition->builtin = builtin;
> > > +  partition->kind = PKIND_RAWMEMCHR;
> > > +
> > > +  return true;
> > > +}
> > > +
> > >  /* Given data reference DR in loop nest LOOP, classify if it forms builtin
> > >     memset call.  */
> > >
> > > @@ -1792,12 +1947,16 @@ loop_distribution::classify_builtin_ldst (loop_p loop, struct graph *rdg,
> > >  bool
> > >  loop_distribution::classify_partition (loop_p loop,
> > >                                        struct graph *rdg, partition *partition,
> > > -                                      bitmap stmt_in_all_partitions)
> > > +                                      bitmap stmt_in_all_partitions,
> > > +                                      vec<struct partition *> *partitions)
> > >  {
> > >    bitmap_iterator bi;
> > >    unsigned i;
> > >    data_reference_p single_ld = NULL, single_st = NULL;
> > >    bool volatiles_p = false, has_reduction = false;
> > > +  unsigned nreductions = 0;
> > > +  gimple *reduction_stmt = NULL;
> > > +  bool has_interpar_reduction = false;
> > >
> > >    EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
> > >      {
> > > @@ -1821,6 +1980,19 @@ loop_distribution::classify_partition (loop_p loop,
> > >             partition->reduction_p = true;
> > >           else
> > >             has_reduction = true;
> > > +
> > > +         /* Determine whether the reduction statement occurs in other
> > > +            partitions than the current one.  */
> > > +         struct partition *piter;
> > > +         for (unsigned j = 0; partitions->iterate (j, &piter); ++j)
> > > +           {
> > > +             if (piter == partition)
> > > +               continue;
> > > +             if (bitmap_bit_p (piter->stmts, i))
> > > +               has_interpar_reduction = true;
> > > +           }
> > > +         reduction_stmt = stmt;
> > > +         ++nreductions;
> > >         }
> > >      }
> > >
> > > @@ -1840,6 +2012,30 @@ loop_distribution::classify_partition (loop_p loop,
> > >    if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
> > >      return has_reduction;
> > >
> > > +  /* If we determined a single load and a single reduction statement which does
> > > +     not occur in any other partition, then try to classify this partition as a
> > > +     rawmemchr builtin.  */
> > > +  if (single_ld != NULL
> > > +      && single_st == NULL
> > > +      && nreductions == 1
> > > +      && !has_interpar_reduction
> > > +      && is_gimple_assign (reduction_stmt))
> > > +    {
> > > +      /* If we classified the partition as a builtin, then ignoring the single
> > > +        reduction is safe, since the reduction variable is not used in other
> > > +        partitions.  */
> > > +      tree reduction_var = gimple_assign_lhs (reduction_stmt);
> > > +      return !classify_builtin_rawmemchr (loop, partition, single_ld, reduction_var);
> > > +    }
> > > +
> > > +  if (single_st == NULL)
> > > +    return has_reduction;
> > > +
> > > +  /* Don't distribute loop if niters is unknown.  */
> > > +  tree niters = number_of_latch_executions (loop);
> > > +  if (niters == NULL_TREE || niters == chrec_dont_know)
> > > +    return has_reduction;
> > > +
> > >    partition->loc = gimple_location (DR_STMT (single_st));
> > >
> > >    /* Classify the builtin kind.  */
> > > @@ -2979,7 +3175,7 @@ loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
> > >    FOR_EACH_VEC_ELT (partitions, i, partition)
> > >      {
> > >        reduction_in_all
> > > -       |= classify_partition (loop, rdg, partition, stmt_in_all_partitions);
> > > +       |= classify_partition (loop, rdg, partition, stmt_in_all_partitions, &partitions);
> > >        any_builtin |= partition_builtin_p (partition);
> > >      }
> > >
> > > @@ -3290,11 +3486,6 @@ loop_distribution::execute (function *fun)
> > >               && !optimize_loop_for_speed_p (loop)))
> > >         continue;
> > >
> > > -      /* Don't distribute loop if niters is unknown.  */
> > > -      tree niters = number_of_latch_executions (loop);
> > > -      if (niters == NULL_TREE || niters == chrec_dont_know)
> > > -       continue;
> > > -
> > >        /* Get the perfect loop nest for distribution.  */
> > >        loop = prepare_perfect_loop_nest (loop);
> > >        for (; loop; loop = loop->inner)
> > > --
> > > 2.23.0
> > >

> commit bf792239150
> Author: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
> Date:   Mon Feb 8 10:35:30 2021 +0100
> 
>     ldist: Recognize rawmemchr loop patterns
>     
>     This patch adds support for recognizing loops which mimic the behaviour
>     of function rawmemchr, and replaces those with an internal function call
>     in case a target provides them.  In contrast to the original rawmemchr
>     function, this patch also supports different instances where the memory
>     pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
>     respectively.
> 
> diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> index dd7173126fb..18e12b863c6 100644
> --- a/gcc/internal-fn.c
> +++ b/gcc/internal-fn.c
> @@ -2917,6 +2917,31 @@ expand_VEC_CONVERT (internal_fn, gcall *)
>    gcc_unreachable ();
>  }
>  
> +void
> +expand_RAWMEMCHR (internal_fn, gcall *stmt)
> +{
> +  expand_operand ops[3];
> +
> +  tree lhs = gimple_call_lhs (stmt);
> +  tree lhs_type = TREE_TYPE (lhs);
> +  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
> +
> +  for (unsigned int i = 0; i < 2; ++i)
> +    {
> +      tree rhs = gimple_call_arg (stmt, i);
> +      tree rhs_type = TREE_TYPE (rhs);
> +      rtx rhs_rtx = expand_normal (rhs);
> +      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
> +    }
> +
> +  insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
> +
> +  expand_insn (icode, 3, ops);
> +  if (!rtx_equal_p (lhs_rtx, ops[0].value))
> +    emit_move_insn (lhs_rtx, ops[0].value);
> +}
> +
>  /* Expand the IFN_UNIQUE function according to its first argument.  */
>  
>  static void
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index daeace7a34e..95c76795648 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>  DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
>  DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
>  DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> +DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
>  
>  /* An unduplicable, uncombinable function.  Generally used to preserve
>     a CFG property in the face of jump threading, tail merging or
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index b192a9d070b..f7c69f914ce 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
>  OPTAB_D (movmem_optab, "movmem$a")
>  OPTAB_D (setmem_optab, "setmem$a")
>  OPTAB_D (strlen_optab, "strlen$a")
> +OPTAB_D (rawmemchr_optab, "rawmemchr$I$a")
>  
>  OPTAB_DC(fma_optab, "fma$a4", FMA)
>  OPTAB_D (fms_optab, "fms$a4")
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index 7ee19fc8677..09f200da61f 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -115,6 +115,10 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-vectorizer.h"
>  #include "tree-eh.h"
>  #include "gimple-fold.h"
> +#include "rtl.h"
> +#include "memmodel.h"
> +#include "insn-codes.h"
> +#include "optabs.h"
>  
>  
>  #define MAX_DATAREFS_NUM \
> @@ -218,7 +222,7 @@ enum partition_kind {
>         be unnecessary and removed once distributed memset can be understood
>         and analyzed in data reference analysis.  See PR82604 for more.  */
>      PKIND_PARTIAL_MEMSET,
> -    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
> +    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR
>  };
>  
>  /* Type of distributed loop.  */
> @@ -244,6 +248,8 @@ struct builtin_info
>       is only used in memset builtin distribution for now.  */
>    tree dst_base_base;
>    unsigned HOST_WIDE_INT dst_base_offset;
> +  /* Pattern is used only in rawmemchr builtin distribution for now.  */
> +  tree pattern;
>  };
>  
>  /* Partition for loop distribution.  */
> @@ -1232,6 +1238,66 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
>      }
>  }
>  
> +/* Generate a call to rawmemchr for PARTITION in LOOP.  */
> +
> +static void
> +generate_rawmemchr_builtin (class loop *loop, partition *partition)
> +{
> +  gimple_stmt_iterator gsi;
> +  tree mem, pattern;
> +  struct builtin_info *builtin = partition->builtin;
> +  gimple *fn_call;
> +
> +  data_reference_p dr = builtin->src_dr;
> +  tree base = builtin->src_base;
> +
> +  tree result_old = build_fold_addr_expr (DR_REF (dr));
> +  tree result_new = copy_ssa_name (result_old);
> +
> +  /* The new statements will be placed before LOOP.  */
> +  gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> +
> +  mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
> +				  GSI_CONTINUE_LINKING);
> +  pattern = builtin->pattern;
> +  fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
> +  gimple_call_set_lhs (fn_call, result_new);
> +  gimple_set_location (fn_call, partition->loc);
> +  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
> +
> +  imm_use_iterator iter;
> +  gimple *stmt;
> +  use_operand_p use_p;
> +  FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
> +    {
> +      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> +	SET_USE (use_p, result_new);
> +
> +      update_stmt (stmt);
> +    }
> +
> +  fold_stmt (&gsi);
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    switch (TYPE_MODE (TREE_TYPE (pattern)))
> +      {
> +      case QImode:
> +	fprintf (dump_file, "generated rawmemchrqi\n");
> +	break;
> +
> +      case HImode:
> +	fprintf (dump_file, "generated rawmemchrhi\n");
> +	break;
> +
> +      case SImode:
> +	fprintf (dump_file, "generated rawmemchrsi\n");
> +	break;
> +
> +      default:
> +	gcc_unreachable ();
> +      }
> +}
> +
>  /* Remove and destroy the loop LOOP.  */
>  
>  static void
> @@ -1334,6 +1400,10 @@ generate_code_for_partition (class loop *loop,
>        generate_memcpy_builtin (loop, partition);
>        break;
>  
> +    case PKIND_RAWMEMCHR:
> +      generate_rawmemchr_builtin (loop, partition);
> +      break;
> +
>      default:
>        gcc_unreachable ();
>      }
> @@ -1525,44 +1595,53 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
>  	}
>      }
>  
> -  if (!single_st)
> +  if (!single_ld && !single_st)
>      return false;
>  
> -  /* Bail out if this is a bitfield memory reference.  */
> -  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> -      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> -    return false;
> -
> -  /* Data reference must be executed exactly once per iteration of each
> -     loop in the loop nest.  We only need to check dominance information
> -     against the outermost one in a perfect loop nest because a bb can't
> -     dominate outermost loop's latch without dominating inner loop's.  */
> -  basic_block bb_st = gimple_bb (DR_STMT (single_st));
> -  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> -    return false;
> +  basic_block bb_ld = NULL;
> +  basic_block bb_st = NULL;
>  
>    if (single_ld)
>      {
> -      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> -      /* Direct aggregate copy or via an SSA name temporary.  */
> -      if (load != store
> -	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> -	return false;
> -
>        /* Bail out if this is a bitfield memory reference.  */
>        if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
>  	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
>  	return false;
>  
> -      /* Load and store must be in the same loop nest.  */
> -      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
> -      if (bb_st->loop_father != bb_ld->loop_father)
> +      /* Data reference must be executed exactly once per iteration of each
> +	 loop in the loop nest.  We only need to check dominance information
> +	 against the outermost one in a perfect loop nest because a bb can't
> +	 dominate outermost loop's latch without dominating inner loop's.  */
> +      bb_ld = gimple_bb (DR_STMT (single_ld));
> +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> +	return false;
> +    }
> +
> +  if (single_st)
> +    {
> +      /* Bail out if this is a bitfield memory reference.  */
> +      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> +	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
>  	return false;
>  
>        /* Data reference must be executed exactly once per iteration.
> -	 Same as single_st, we only need to check against the outermost
> +	 Same as single_ld, we only need to check against the outermost
>  	 loop.  */
> -      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> +      bb_st = gimple_bb (DR_STMT (single_st));
> +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> +	return false;
> +    }
> +
> +  if (single_ld && single_st)
> +    {
> +      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> +      /* Direct aggregate copy or via an SSA name temporary.  */
> +      if (load != store
> +	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> +	return false;
> +
> +      /* Load and store must be in the same loop nest.  */
> +      if (bb_st->loop_father != bb_ld->loop_father)
>  	return false;
>  
>        edge e = single_exit (bb_st->loop_father);
> @@ -1681,6 +1760,68 @@ alloc_builtin (data_reference_p dst_dr, data_reference_p src_dr,
>    return builtin;
>  }
>  
> +/* Given data reference DR in loop nest LOOP, classify if it forms builtin
> +   rawmemchr call.  */
> +
> +static bool
> +classify_builtin_rawmemchr (loop_p loop, partition *partition,
> +			    data_reference_p dr, tree loop_result)
> +{
> +  tree dr_ref = DR_REF (dr);
> +  tree dr_access_base = build_fold_addr_expr (dr_ref);
> +  tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref));
> +  gimple *dr_stmt = DR_STMT (dr);
> +  affine_iv iv;
> +  tree pattern;
> +
> +  if (dr_access_base != loop_result)
> +    return false;
> +
> +  /* A limitation of the current implementation is that we only support
> +     constant patterns.  */
> +  gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
> +  pattern = gimple_cond_rhs (cond_stmt);
> +  if (gimple_cond_code (cond_stmt) != NE_EXPR
> +      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt)
> +      || TREE_CODE (pattern) != INTEGER_CST)
> +    return false;
> +
> +  /* Bail out if no affine induction variable with constant step can be
> +     determined.  */
> +  if (!simple_iv (loop, loop, dr_access_base, &iv, false))
> +    return false;
> +
> +  /* Bail out if memory accesses are not consecutive.  */
> +  if (!operand_equal_p (iv.step, dr_access_size, 0))
> +    return false;
> +
> +  /* Bail out if direction of memory accesses is not growing.  */
> +  if (get_range_pos_neg (iv.step) != 1)
> +    return false;
> +
> +  /* Bail out if target does not provide rawmemchr for a certain mode.  */
> +  machine_mode mode;
> +  switch (TREE_INT_CST_LOW (iv.step))
> +    {
> +    case 1: mode = QImode; break;
> +    case 2: mode = HImode; break;
> +    case 4: mode = SImode; break;
> +    default: return false;
> +    }
> +  if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
> +    return false;
> +
> +  struct builtin_info *builtin;
> +  builtin = alloc_builtin (NULL, dr, NULL_TREE, iv.base, NULL_TREE);
> +  builtin->pattern = pattern;
> +
> +  partition->loc = gimple_location (dr_stmt);
> +  partition->builtin = builtin;
> +  partition->kind = PKIND_RAWMEMCHR;
> +
> +  return true;
> +}
> +
>  /* Given data reference DR in loop nest LOOP, classify if it forms builtin
>     memset call.  */
>  
> @@ -1798,6 +1939,8 @@ loop_distribution::classify_partition (loop_p loop,
>    unsigned i;
>    data_reference_p single_ld = NULL, single_st = NULL;
>    bool volatiles_p = false, has_reduction = false;
> +  unsigned nreductions = 0;
> +  gimple *reduction_stmt = NULL;
>  
>    EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
>      {
> @@ -1821,6 +1964,10 @@ loop_distribution::classify_partition (loop_p loop,
>  	    partition->reduction_p = true;
>  	  else
>  	    has_reduction = true;
> +
> +	  /* Determine whether STMT is the only reduction statement or not.  */
> +	  reduction_stmt = stmt;
> +	  ++nreductions;
>  	}
>      }
>  
> @@ -1840,6 +1987,27 @@ loop_distribution::classify_partition (loop_p loop,
>    if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
>      return has_reduction;
>  
> +  /* If we determined a single load and a single reduction statement, then try
> +     to classify this partition as a rawmemchr builtin.  */
> +  if (single_ld != NULL
> +      && single_st == NULL
> +      && nreductions == 1
> +      && is_gimple_assign (reduction_stmt))
> +    {
> +      /* If we classified the partition as a builtin, then ignoring the single
> +	 reduction is safe, since the whole partition is replaced by a call.  */
> +      tree reduction_var = gimple_assign_lhs (reduction_stmt);
> +      return !classify_builtin_rawmemchr (loop, partition, single_ld, reduction_var);
> +    }
> +
> +  if (single_st == NULL)
> +    return has_reduction;
> +
> +  /* Don't distribute loop if niters is unknown.  */
> +  tree niters = number_of_latch_executions (loop);
> +  if (niters == NULL_TREE || niters == chrec_dont_know)
> +    return has_reduction;
> +
>    partition->loc = gimple_location (DR_STMT (single_st));
>  
>    /* Classify the builtin kind.  */
> @@ -3290,11 +3458,6 @@ loop_distribution::execute (function *fun)
>  	      && !optimize_loop_for_speed_p (loop)))
>  	continue;
>  
> -      /* Don't distribute loop if niters is unknown.  */
> -      tree niters = number_of_latch_executions (loop);
> -      if (niters == NULL_TREE || niters == chrec_dont_know)
> -	continue;
> -
>        /* Get the perfect loop nest for distribution.  */
>        loop = prepare_perfect_loop_nest (loop);
>        for (; loop; loop = loop->inner)

> /* { dg-do compile } */
> /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
> 
> #include <cstdint>
> 
> template<typename T, T pattern>
> T *rawmemchr(T *s)
> {
>   while (*s != pattern)
>     ++s;
>   return s;
> }
> 
> template uint8_t *rawmemchr<uint8_t, 0xab>(uint8_t *);
> template uint16_t *rawmemchr<uint16_t, 0xabcd>(uint16_t *);
> template uint32_t *rawmemchr<uint32_t, 0xabcdef15>(uint32_t *);
> 
> template int8_t *rawmemchr<int8_t, (int8_t)0xab>(int8_t *);
> template int16_t *rawmemchr<int16_t, (int16_t)0xabcd>(int16_t *);
> template int32_t *rawmemchr<int32_t, (int32_t)0xabcdef15>(int32_t *);
> 
> /* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" } } */

> /* { dg-do run } */
> /* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details -mzarch -march=z13" } */
> 
> #include <cstring>
> #include <cassert>
> #include <cstdint>
> #include <iostream>
> 
> template <typename T, T pattern>
> __attribute__((noinline,noclone))
> T* rawmemchr (T *s)
> {
>   while (*s != pattern)
>     ++s;
>   return s;
> }
> 
> template <typename T, T pattern>
> void doit()
> {
>   T *buf = new T[4096 * 2];
>   assert (buf != NULL);
>   memset (buf, 0xa, sizeof (T) * 4096 * 2);
>   // ensure q is 4096-byte aligned
>   T *q = buf + (4096 - ((uintptr_t)buf & 4095));
>   T *p;
> 
>   // unaligned + block boundary + 1st load
>   p = (T *) ((uintptr_t)q - 8);
>   p[2] = pattern;
>   assert ((rawmemchr<T, pattern> (&p[0]) == &p[2]));
>   p[2] = (T) 0xaaaaaaaa;
> 
>   // unaligned + block boundary + 2nd load
>   p = (T *) ((uintptr_t)q - 8);
>   p[6] = pattern;
>   assert ((rawmemchr<T, pattern> (&p[0]) == &p[6]));
>   p[6] = (T) 0xaaaaaaaa;
> 
> 
> 
>   // unaligned + 1st load
>   q[5] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[2]) == &q[5]));
>   q[5] = (T) 0xaaaaaaaa;
> 
>   // unaligned + 2nd load
>   q[14] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[2]) == &q[14]));
>   q[14] = (T) 0xaaaaaaaa;
> 
>   // unaligned + 3rd load
>   q[19] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[2]) == &q[19]));
>   q[19] = (T) 0xaaaaaaaa;
> 
>   // unaligned + 4th load
>   q[25] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[2]) == &q[25]));
>   q[25] = (T) 0xaaaaaaaa;
> 
> 
> 
>   // aligned + 1st load
>   q[5] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[0]) == &q[5]));
>   q[5] = (T) 0xaaaaaaaa;
> 
>   // aligned + 2nd load
>   q[14] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[0]) == &q[14]));
>   q[14] = (T) 0xaaaaaaaa;
> 
>   // aligned + 3rd load
>   q[19] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[0]) == &q[19]));
>   q[19] = (T) 0xaaaaaaaa;
> 
>   // aligned + 4th load
>   q[25] = pattern;
>   assert ((rawmemchr<T, pattern>(&q[0]) == &q[25]));
>   q[25] = (T) 0xaaaaaaaa;
> 
>   delete[] buf;
> }
> 
> int main(void)
> {
>   doit<uint8_t, 0xde> ();
>   doit<int8_t, (int8_t)0xde> ();
>   doit<uint16_t, 0xdead> ();
>   doit<int16_t, (int16_t)0xdead> ();
>   doit<uint32_t, 0xdeadbeef> ();
>   doit<int32_t, (int32_t)0xdeadbeef> ();
>   return 0;
> }
> 
> /* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" } } */
> /* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" } } */
Jeff Law Feb. 25, 2021, 11:49 p.m. UTC | #4
On 2/25/21 10:01 AM, Stefan Schulze Frielinghaus via Gcc-patches wrote:
> Ping
Given this is not a regression, it needs to wait for gcc-12.
jeff
Richard Biener March 2, 2021, 12:29 p.m. UTC | #5
On Sun, Feb 14, 2021 at 11:27 AM Stefan Schulze Frielinghaus
<stefansf@linux.ibm.com> wrote:
>
> On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > This patch adds support for recognizing loops which mimic the behaviour
> > > of function rawmemchr, and replaces those with an internal function call
> > > in case a target provides them.  In contrast to the original rawmemchr
> > > function, this patch also supports different instances where the memory
> > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > > respectively.
> > >
> > > This patch is not final and I'm looking for some feedback:
> > >
> > > Previously, only loops which mimic the behaviours of functions memset,
> > > memcpy, and memmove have been detected and replaced by corresponding
> > > function calls.  One characteristic of those loops/partitions is that
> > > they don't have a reduction.  In contrast, loops which mimic the
> > > behaviour of rawmemchr compute a result and therefore have a reduction.
> > > My current attempt is to ensure that the reduction statement is not used
> > > in any other partition and only in that case ignore the reduction and
> > > replace the loop by a function call.  We then only need to replace the
> > > reduction variable of the loop which contained the loop result by the
> > > variable of the lhs of the internal function call.  This should ensure
> > > that the transformation is correct independently of how partitions are
> > > fused/distributed in the end.  Any thoughts about this?
> >
> > Currently we're forcing reduction partitions last (and force a single
> > one by fusing all partitions containing a reduction) because code-generation
> > does not properly update SSA form for the reduction results.  ISTR that
> > might be just because we do not copy the LC PHI nodes or do not adjust
> > them when copying.  That might not be an issue in case you replace the
> > partition with a call.  I guess you can try to have a testcase with
> > two rawmemchr patterns and a regular loop part that has to be scheduled
> > in between both for correctness.
>
> Ah ok, in that case I updated my patch by removing the constraint that
> the reduction statement must be in precisely one partition.  Please find
> attached the testcases I have come up with so far.  Since transforming a loop into
> a rawmemchr function call is backend dependent, I planned to include
> those only in my backend patch.  I wasn't able to come up with any
> testcase where a loop is distributed into multiple partitions and where
> one is classified as a rawmemchr builtin.  The latter boils down to a
> for loop with an empty body only, in which case I suspect that loop
> distribution shouldn't be done anyway.
>
> > > Furthermore, I simply added two new members (pattern, fn) to structure
> > > builtin_info which I consider rather hacky.  For the long run I thought
> > > about to split up structure builtin_info into a union where each member
> > > is a structure for a particular builtin of a partition, i.e., something
> > > like this:
> > >
> > > union builtin_info
> > > {
> > >   struct binfo_memset *memset;
> > >   struct binfo_memcpymove *memcpymove;
> > >   struct binfo_rawmemchr *rawmemchr;
> > > };
> > >
> > > Such that a structure for one builtin does not get "polluted" by a
> > > different one.  Any thoughts about this?
> >
> > Probably makes sense if the list of recognized patterns grows further.
> >
> > I see you use internal functions rather than builtin functions.  I guess
> > that's OK.  But you use new target hooks for expansion where I think
> > new optab entries similar to cmpmem would be more appropriate
> > where the distinction between 8, 16 or 32 bits can be encoded in
> > the modes.
>
> The optab implementation is really nice: it allows me to use iterators
> in the backend, which in the end saves me some boilerplate code compared
> to the previous implementation :)
>
> While using optabs now, I only require one additional member (pattern)
> in the builtin_info struct.  Thus I didn't want to overcomplicate things
> and kept the single struct approach as is.
>
> For the long run, should I resubmit this patch once stage 1 opens or how
> would you propose to proceed?

Yes, and sorry for the delay.  A few comments on the patch from a
quick look:

+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);

I think that, given we have people testing with -fno-tree-dce, you
should try to handle a NULL LHS gracefully.  I suppose by
simply doing nothing.
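
E.g. (just a sketch of what I mean, not something the patch already
has):

  tree lhs = gimple_call_lhs (stmt);
  /* The result may be unused, for example with -fno-tree-dce;
     in that case there is nothing to expand.  */
  if (!lhs)
    return;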

+  tree result_old = build_fold_addr_expr (DR_REF (dr));
+  tree result_new = copy_ssa_name (result_old);

I think you simply want

   tree result = make_ssa_name (ptr_type_node);

most definitely

+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+       SET_USE (use_p, result_new);
+
+      update_stmt (stmt);
+    }

isn't going to work.  It might work for the very specific
case that the IV used for the read is a pointer IV and
thus build_fold_addr_expr (...) results in the pointer.
But you can't really rely on any such thing.
(Try a loop operating on a decl - the patch is missing
testcases for that, at least.)
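
For example (purely illustrative, not one of the patch's testcases):

  extern unsigned short s[];

  int foo (void)
  {
    int i = 0;
    while (s[i] != 0xabcd)
      ++i;
    return i;
  }

Here DR_REF (dr) is the ARRAY_REF s[i], so build_fold_addr_expr
gives &s[i] rather than a pointer SSA name; the
dr_access_base != loop_result check already gives up on such loops,
and the imm-use replacement would have nothing to work with anyway.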

Instead you have to record the reduction result
during matching.  I see you're doing the same
there; I think you should try to do something more general
to handle cases like

extern char s[];

int foo ()
{
  int i;
  for (i = 1; s[i-1]; ++i)
    ;
  return i;
}

which compute the number of chars including
the terminating '\0', because we might want to
eventually recognize this as strlen().  Correlating
the reduction IV evolution with the rawmemchr
result using SCEV shouldn't be difficult and
a pointer subtraction would yield the reduction
result.
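
Roughly (the names below are placeholders, not code from the patch):

  /* For the example above: i = (result - s) + 1; fine here since the
     element size is 1, for wider elements the difference would have
     to be divided by the element size.  */
  tree off = fold_build2 (POINTER_DIFF_EXPR, ptrdiff_type_node,
                          rawmemchr_result, start_addr);
  tree ival = fold_build2 (PLUS_EXPR, ptrdiff_type_node, off,
                           build_int_cst (ptrdiff_type_node, 1));
  ival = fold_convert (TREE_TYPE (reduction_var), ival);

with rawmemchr_result being the lhs of the generated call, start_addr
the access base and reduction_var the original reduction variable.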

If you think that's too complicated for a start,
then please reformulate the

+  tree dr_access_base = build_fold_addr_expr (dr_ref);

as

 if (TREE_CODE (dr_ref) != MEM_REF
     || !integer_zerop (TREE_OPERAND (dr_ref, 1)))
   return false;
 dr_access_base = TREE_OPERAND (dr_ref, 0);

+  /* A limitation of the current implementation is that we only support
+     constant patterns.  */
+  gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));

the last stmt in the loop header might not be a gcond.  I think
you want something like

  edge e = single_exit (loop);
  if (!e)
    return false;
  gcond *cond_stmt = safe_dyn_cast <gcond *> (last_stmt (e->src));
  if (!cond_stmt)
    return false;

here.

+  /* Bail out if direction of memory accesses is not growing.  */
+  if (get_range_pos_neg (iv.step) != 1)
+    return false;

that looks redundant with the access size check.  But I'd add
a check that the access is of integral, mode-precision type.

  if (!INTEGRAL_TYPE_P (TREE_TYPE (DR_REF (dr)))
      || !type_has_mode_precision_p (TREE_TYPE (DR_REF (dr))))
   return false;

+  /* Bail out if target does not provide rawmemchr for a certain mode.  */
+  machine_mode mode;
+  switch (TREE_INT_CST_LOW (iv.step))
+    {
+    case 1: mode = QImode; break;
+    case 2: mode = HImode; break;
+    case 4: mode = SImode; break;
+    default: return false;

then this simply becomes

  mode = TYPE_MODE (TREE_TYPE (DR_REF (dr)));
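
So together with the optab check from the patch this would be roughly
(sketch only):

  if (!INTEGRAL_TYPE_P (TREE_TYPE (DR_REF (dr)))
      || !type_has_mode_precision_p (TREE_TYPE (DR_REF (dr))))
    return false;

  /* Bail out if the target does not provide rawmemchr in this mode.  */
  machine_mode mode = TYPE_MODE (TREE_TYPE (DR_REF (dr)));
  if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
    return false;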

For testcases, try something like

extern char s[], r[];

char *foo ()
{
  int i = 0;
  char *p = r;
  while (*p)
    {
      ++p;
      s[i++] = '\0';
    }
  return p;
}

to end up with multiple partitions (and slightly off IVs because
we apply header copying).

> Thanks for your review so far!
>
> Cheers,
> Stefan
>
> >
> > Richard.
> >
> > > Cheers,
> > > Stefan
> > > ---
> > >  gcc/internal-fn.c            |  42 ++++++
> > >  gcc/internal-fn.def          |   3 +
> > >  gcc/target-insns.def         |   3 +
> > >  gcc/tree-loop-distribution.c | 257 ++++++++++++++++++++++++++++++-----
> > >  4 files changed, 272 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> > > index dd7173126fb..9cd62544a1a 100644
> > > --- a/gcc/internal-fn.c
> > > +++ b/gcc/internal-fn.c
> > > @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *)
> > >    gcc_unreachable ();
> > >  }
> > >
> > > +static void
> > > +expand_RAWMEMCHR8 (internal_fn, gcall *stmt)
> > > +{
> > > +  if (targetm.have_rawmemchr8 ())
> > > +    {
> > > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > +      emit_insn (targetm.gen_rawmemchr8 (result, start, pattern));
> > > +    }
> > > +  else
> > > +    gcc_unreachable();
> > > +}
> > > +
> > > +static void
> > > +expand_RAWMEMCHR16 (internal_fn, gcall *stmt)
> > > +{
> > > +  if (targetm.have_rawmemchr16 ())
> > > +    {
> > > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > +      emit_insn (targetm.gen_rawmemchr16 (result, start, pattern));
> > > +    }
> > > +  else
> > > +    gcc_unreachable();
> > > +}
> > > +
> > > +static void
> > > +expand_RAWMEMCHR32 (internal_fn, gcall *stmt)
> > > +{
> > > +  if (targetm.have_rawmemchr32 ())
> > > +    {
> > > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > +      emit_insn (targetm.gen_rawmemchr32 (result, start, pattern));
> > > +    }
> > > +  else
> > > +    gcc_unreachable();
> > > +}
> > > +
> > >  /* Expand the IFN_UNIQUE function according to its first argument.  */
> > >
> > >  static void
> > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > index daeace7a34e..34247859704 100644
> > > --- a/gcc/internal-fn.def
> > > +++ b/gcc/internal-fn.def
> > > @@ -348,6 +348,9 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> > >  DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
> > >  DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
> > >  DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> > > +DEF_INTERNAL_FN (RAWMEMCHR8, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > > +DEF_INTERNAL_FN (RAWMEMCHR16, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > > +DEF_INTERNAL_FN (RAWMEMCHR32, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > >
> > >  /* An unduplicable, uncombinable function.  Generally used to preserve
> > >     a CFG property in the face of jump threading, tail merging or
> > > diff --git a/gcc/target-insns.def b/gcc/target-insns.def
> > > index 672c35698d7..9248554cbf3 100644
> > > --- a/gcc/target-insns.def
> > > +++ b/gcc/target-insns.def
> > > @@ -106,3 +106,6 @@ DEF_TARGET_INSN (trap, (void))
> > >  DEF_TARGET_INSN (unique, (void))
> > >  DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2))
> > >  DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1))
> > > +DEF_TARGET_INSN (rawmemchr8, (rtx x0, rtx x1, rtx x2))
> > > +DEF_TARGET_INSN (rawmemchr16, (rtx x0, rtx x1, rtx x2))
> > > +DEF_TARGET_INSN (rawmemchr32, (rtx x0, rtx x1, rtx x2))
> > > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> > > index 7ee19fc8677..f5b24bf53bc 100644
> > > --- a/gcc/tree-loop-distribution.c
> > > +++ b/gcc/tree-loop-distribution.c
> > > @@ -218,7 +218,7 @@ enum partition_kind {
> > >         be unnecessary and removed once distributed memset can be understood
> > >         and analyzed in data reference analysis.  See PR82604 for more.  */
> > >      PKIND_PARTIAL_MEMSET,
> > > -    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
> > > +    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR
> > >  };
> > >
> > >  /* Type of distributed loop.  */
> > > @@ -244,6 +244,8 @@ struct builtin_info
> > >       is only used in memset builtin distribution for now.  */
> > >    tree dst_base_base;
> > >    unsigned HOST_WIDE_INT dst_base_offset;
> > > +  tree pattern;
> > > +  internal_fn fn;
> > >  };
> > >
> > >  /* Partition for loop distribution.  */
> > > @@ -588,7 +590,8 @@ class loop_distribution
> > >    bool
> > >    classify_partition (loop_p loop,
> > >                       struct graph *rdg, partition *partition,
> > > -                     bitmap stmt_in_all_partitions);
> > > +                     bitmap stmt_in_all_partitions,
> > > +                     vec<struct partition *> *partitions);
> > >
> > >
> > >    /* Returns true when PARTITION1 and PARTITION2 access the same memory
> > > @@ -1232,6 +1235,67 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
> > >      }
> > >  }
> > >
> > > +/* Generate a call to rawmemchr{8,16,32} for PARTITION in LOOP.  */
> > > +
> > > +static void
> > > +generate_rawmemchr_builtin (class loop *loop, partition *partition)
> > > +{
> > > +  gimple_stmt_iterator gsi;
> > > +  tree mem, pattern;
> > > +  struct builtin_info *builtin = partition->builtin;
> > > +  gimple *fn_call;
> > > +
> > > +  data_reference_p dr = builtin->src_dr;
> > > +  tree base = builtin->src_base;
> > > +
> > > +  tree result_old = TREE_OPERAND (DR_REF (dr), 0);
> > > +  tree result_new = copy_ssa_name (result_old);
> > > +
> > > +  /* The new statements will be placed before LOOP.  */
> > > +  gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> > > +
> > > +  mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false, GSI_CONTINUE_LINKING);
> > > +  pattern = builtin->pattern;
> > > +  if (TREE_CODE (pattern) == INTEGER_CST)
> > > +    pattern = fold_convert (integer_type_node, pattern);
> > > +  fn_call = gimple_build_call_internal (builtin->fn, 2, mem, pattern);
> > > +  gimple_call_set_lhs (fn_call, result_new);
> > > +  gimple_set_location (fn_call, partition->loc);
> > > +  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
> > > +
> > > +  imm_use_iterator iter;
> > > +  gimple *stmt;
> > > +  use_operand_p use_p;
> > > +  FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
> > > +    {
> > > +      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> > > +       SET_USE (use_p, result_new);
> > > +
> > > +      update_stmt (stmt);
> > > +    }
> > > +
> > > +  fold_stmt (&gsi);
> > > +
> > > +  if (dump_file && (dump_flags & TDF_DETAILS))
> > > +    switch (builtin->fn)
> > > +      {
> > > +      case IFN_RAWMEMCHR8:
> > > +       fprintf (dump_file, "generated rawmemchr8\n");
> > > +       break;
> > > +
> > > +      case IFN_RAWMEMCHR16:
> > > +       fprintf (dump_file, "generated rawmemchr16\n");
> > > +       break;
> > > +
> > > +      case IFN_RAWMEMCHR32:
> > > +       fprintf (dump_file, "generated rawmemchr32\n");
> > > +       break;
> > > +
> > > +      default:
> > > +       gcc_unreachable ();
> > > +      }
> > > +}
> > > +
> > >  /* Remove and destroy the loop LOOP.  */
> > >
> > >  static void
> > > @@ -1334,6 +1398,10 @@ generate_code_for_partition (class loop *loop,
> > >        generate_memcpy_builtin (loop, partition);
> > >        break;
> > >
> > > +    case PKIND_RAWMEMCHR:
> > > +      generate_rawmemchr_builtin (loop, partition);
> > > +      break;
> > > +
> > >      default:
> > >        gcc_unreachable ();
> > >      }
> > > @@ -1525,44 +1593,53 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
> > >         }
> > >      }
> > >
> > > -  if (!single_st)
> > > +  if (!single_ld && !single_st)
> > >      return false;
> > >
> > > -  /* Bail out if this is a bitfield memory reference.  */
> > > -  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> > > -      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> > > -    return false;
> > > -
> > > -  /* Data reference must be executed exactly once per iteration of each
> > > -     loop in the loop nest.  We only need to check dominance information
> > > -     against the outermost one in a perfect loop nest because a bb can't
> > > -     dominate outermost loop's latch without dominating inner loop's.  */
> > > -  basic_block bb_st = gimple_bb (DR_STMT (single_st));
> > > -  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> > > -    return false;
> > > +  basic_block bb_ld = NULL;
> > > +  basic_block bb_st = NULL;
> > >
> > >    if (single_ld)
> > >      {
> > > -      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> > > -      /* Direct aggregate copy or via an SSA name temporary.  */
> > > -      if (load != store
> > > -         && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> > > -       return false;
> > > -
> > >        /* Bail out if this is a bitfield memory reference.  */
> > >        if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
> > >           && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
> > >         return false;
> > >
> > > -      /* Load and store must be in the same loop nest.  */
> > > -      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
> > > -      if (bb_st->loop_father != bb_ld->loop_father)
> > > +      /* Data reference must be executed exactly once per iteration of each
> > > +        loop in the loop nest.  We only need to check dominance information
> > > +        against the outermost one in a perfect loop nest because a bb can't
> > > +        dominate outermost loop's latch without dominating inner loop's.  */
> > > +      bb_ld = gimple_bb (DR_STMT (single_ld));
> > > +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> > > +       return false;
> > > +    }
> > > +
> > > +  if (single_st)
> > > +    {
> > > +      /* Bail out if this is a bitfield memory reference.  */
> > > +      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> > > +         && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> > >         return false;
> > >
> > >        /* Data reference must be executed exactly once per iteration.
> > > -        Same as single_st, we only need to check against the outermost
> > > +        Same as single_ld, we only need to check against the outermost
> > >          loop.  */
> > > -      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> > > +      bb_st = gimple_bb (DR_STMT (single_st));
> > > +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> > > +       return false;
> > > +    }
> > > +
> > > +  if (single_ld && single_st)
> > > +    {
> > > +      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> > > +      /* Direct aggregate copy or via an SSA name temporary.  */
> > > +      if (load != store
> > > +         && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> > > +       return false;
> > > +
> > > +      /* Load and store must be in the same loop nest.  */
> > > +      if (bb_st->loop_father != bb_ld->loop_father)
> > >         return false;
> > >
> > >        edge e = single_exit (bb_st->loop_father);
> > > @@ -1681,6 +1758,84 @@ alloc_builtin (data_reference_p dst_dr, data_reference_p src_dr,
> > >    return builtin;
> > >  }
> > >
> > > +/* Given data reference DR in loop nest LOOP, classify if it forms builtin
> > > +   rawmemchr{8,16,32} call.  */
> > > +
> > > +static bool
> > > +classify_builtin_rawmemchr (loop_p loop, partition *partition, data_reference_p dr, tree loop_result)
> > > +{
> > > +  tree dr_ref = DR_REF (dr);
> > > +  tree dr_access_base = build_fold_addr_expr (dr_ref);
> > > +  tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref));
> > > +  gimple *dr_stmt = DR_STMT (dr);
> > > +  tree rhs1 = gimple_assign_rhs1 (dr_stmt);
> > > +  affine_iv iv;
> > > +  tree pattern;
> > > +
> > > +  if (TREE_OPERAND (rhs1, 0) != loop_result)
> > > +    return false;
> > > +
> > > +  /* A limitation of the current implementation is that we only support
> > > +     constant patterns.  */
> > > +  gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
> > > +  pattern = gimple_cond_rhs (cond_stmt);
> > > +  if (gimple_cond_code (cond_stmt) != NE_EXPR
> > > +      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt)
> > > +      || TREE_CODE (pattern) != INTEGER_CST)
> > > +    return false;
> > > +
> > > +  /* Bail out if no affine induction variable with constant step can be
> > > +     determined.  */
> > > +  if (!simple_iv (loop, loop, dr_access_base, &iv, false))
> > > +    return false;
> > > +
> > > +  /* Bail out if memory accesses are not consecutive.  */
> > > +  if (!operand_equal_p (iv.step, dr_access_size, 0))
> > > +    return false;
> > > +
> > > +  /* Bail out if direction of memory accesses is not growing.  */
> > > +  if (get_range_pos_neg (iv.step) != 1)
> > > +    return false;
> > > +
> > > +  internal_fn fn;
> > > +  switch (TREE_INT_CST_LOW (iv.step))
> > > +    {
> > > +    case 1:
> > > +      if (!targetm.have_rawmemchr8 ())
> > > +       return false;
> > > +      fn = IFN_RAWMEMCHR8;
> > > +      break;
> > > +
> > > +    case 2:
> > > +      if (!targetm.have_rawmemchr16 ())
> > > +       return false;
> > > +      fn = IFN_RAWMEMCHR16;
> > > +      break;
> > > +
> > > +    case 4:
> > > +      if (!targetm.have_rawmemchr32 ())
> > > +       return false;
> > > +      fn = IFN_RAWMEMCHR32;
> > > +      break;
> > > +
> > > +    default:
> > > +      return false;
> > > +    }
> > > +
> > > +  struct builtin_info *builtin;
> > > +  builtin = alloc_builtin (NULL, NULL, NULL_TREE, NULL_TREE, NULL_TREE);
> > > +  builtin->src_dr = dr;
> > > +  builtin->src_base = iv.base;
> > > +  builtin->pattern = pattern;
> > > +  builtin->fn = fn;
> > > +
> > > +  partition->loc = gimple_location (dr_stmt);
> > > +  partition->builtin = builtin;
> > > +  partition->kind = PKIND_RAWMEMCHR;
> > > +
> > > +  return true;
> > > +}
> > > +
> > >  /* Given data reference DR in loop nest LOOP, classify if it forms builtin
> > >     memset call.  */
> > >
> > > @@ -1792,12 +1947,16 @@ loop_distribution::classify_builtin_ldst (loop_p loop, struct graph *rdg,
> > >  bool
> > >  loop_distribution::classify_partition (loop_p loop,
> > >                                        struct graph *rdg, partition *partition,
> > > -                                      bitmap stmt_in_all_partitions)
> > > +                                      bitmap stmt_in_all_partitions,
> > > +                                      vec<struct partition *> *partitions)
> > >  {
> > >    bitmap_iterator bi;
> > >    unsigned i;
> > >    data_reference_p single_ld = NULL, single_st = NULL;
> > >    bool volatiles_p = false, has_reduction = false;
> > > +  unsigned nreductions = 0;
> > > +  gimple *reduction_stmt = NULL;
> > > +  bool has_interpar_reduction = false;
> > >
> > >    EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
> > >      {
> > > @@ -1821,6 +1980,19 @@ loop_distribution::classify_partition (loop_p loop,
> > >             partition->reduction_p = true;
> > >           else
> > >             has_reduction = true;
> > > +
> > > +         /* Determine whether the reduction statement occurs in other
> > > +            partitions than the current one.  */
> > > +         struct partition *piter;
> > > +         for (unsigned j = 0; partitions->iterate (j, &piter); ++j)
> > > +           {
> > > +             if (piter == partition)
> > > +               continue;
> > > +             if (bitmap_bit_p (piter->stmts, i))
> > > +               has_interpar_reduction = true;
> > > +           }
> > > +         reduction_stmt = stmt;
> > > +         ++nreductions;
> > >         }
> > >      }
> > >
> > > @@ -1840,6 +2012,30 @@ loop_distribution::classify_partition (loop_p loop,
> > >    if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
> > >      return has_reduction;
> > >
> > > +  /* If we determined a single load and a single reduction statement which does
> > > +     not occur in any other partition, then try to classify this partition as a
> > > +     rawmemchr builtin.  */
> > > +  if (single_ld != NULL
> > > +      && single_st == NULL
> > > +      && nreductions == 1
> > > +      && !has_interpar_reduction
> > > +      && is_gimple_assign (reduction_stmt))
> > > +    {
> > > +      /* If we classified the partition as a builtin, then ignoring the single
> > > +        reduction is safe, since the reduction variable is not used in other
> > > +        partitions.  */
> > > +      tree reduction_var = gimple_assign_lhs (reduction_stmt);
> > > +      return !classify_builtin_rawmemchr (loop, partition, single_ld, reduction_var);
> > > +    }
> > > +
> > > +  if (single_st == NULL)
> > > +    return has_reduction;
> > > +
> > > +  /* Don't distribute loop if niters is unknown.  */
> > > +  tree niters = number_of_latch_executions (loop);
> > > +  if (niters == NULL_TREE || niters == chrec_dont_know)
> > > +    return has_reduction;
> > > +
> > >    partition->loc = gimple_location (DR_STMT (single_st));
> > >
> > >    /* Classify the builtin kind.  */
> > > @@ -2979,7 +3175,7 @@ loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
> > >    FOR_EACH_VEC_ELT (partitions, i, partition)
> > >      {
> > >        reduction_in_all
> > > -       |= classify_partition (loop, rdg, partition, stmt_in_all_partitions);
> > > +       |= classify_partition (loop, rdg, partition, stmt_in_all_partitions, &partitions);
> > >        any_builtin |= partition_builtin_p (partition);
> > >      }
> > >
> > > @@ -3290,11 +3486,6 @@ loop_distribution::execute (function *fun)
> > >               && !optimize_loop_for_speed_p (loop)))
> > >         continue;
> > >
> > > -      /* Don't distribute loop if niters is unknown.  */
> > > -      tree niters = number_of_latch_executions (loop);
> > > -      if (niters == NULL_TREE || niters == chrec_dont_know)
> > > -       continue;
> > > -
> > >        /* Get the perfect loop nest for distribution.  */
> > >        loop = prepare_perfect_loop_nest (loop);
> > >        for (; loop; loop = loop->inner)
> > > --
> > > 2.23.0
> > >
Stefan Schulze Frielinghaus March 3, 2021, 5:17 p.m. UTC | #6
On Tue, Mar 02, 2021 at 01:29:59PM +0100, Richard Biener wrote:
> On Sun, Feb 14, 2021 at 11:27 AM Stefan Schulze Frielinghaus
> <stefansf@linux.ibm.com> wrote:
> >
> > On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> > > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> > > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > This patch adds support for recognizing loops which mimic the behaviour
> > > > of function rawmemchr, and replaces those with an internal function call
> > > > in case a target provides them.  In contrast to the original rawmemchr
> > > > function, this patch also supports different instances where the memory
> > > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > > > respectively.
> > > >
> > > > This patch is not final and I'm looking for some feedback:
> > > >
> > > > Previously, only loops which mimic the behaviours of functions memset,
> > > > memcpy, and memmove have been detected and replaced by corresponding
> > > > function calls.  One characteristic of those loops/partitions is that
> > > > they don't have a reduction.  In contrast, loops which mimic the
> > > > behaviour of rawmemchr compute a result and therefore have a reduction.
> > > > My current attempt is to ensure that the reduction statement is not used
> > > > in any other partition and only in that case ignore the reduction and
> > > > replace the loop by a function call.  We then only need to replace the
> > > > reduction variable of the loop which contained the loop result by the
> > > > variable of the lhs of the internal function call.  This should ensure
> > > > that the transformation is correct independently of how partitions are
> > > > fused/distributed in the end.  Any thoughts about this?
> > >
> > > Currently we're forcing reduction partitions last (and force to have a single
> > > one by fusing all partitions containing a reduction) because code-generation
> > > does not properly update SSA form for the reduction results.  ISTR that
> > > might be just because we do not copy the LC PHI nodes or do not adjust
> > > them when copying.  That might not be an issue in case you replace the
> > > partition with a call.  I guess you can try to have a testcase with
> > > two rawmemchr patterns and a regular loop part that has to be scheduled
> > > inbetween both for correctness.
> >
> > Ah ok, in that case I updated my patch by removing the constraint that
> > the reduction statement must be in precisely one partition.  Please find
> > attached the testcases I came up so far.  Since transforming a loop into
> > a rawmemchr function call is backend dependend, I planned to include
> > those only in my backend patch.  I wasn't able to come up with any
> > testcase where a loop is distributed into multiple partitions and where
> > one is classified as a rawmemchr builtin.  The latter boils down to a
> > for loop with an empty body only in which case I suspect that loop
> > distribution shouldn't be done anyway.
> >
> > > > Furthermore, I simply added two new members (pattern, fn) to structure
> > > > builtin_info which I consider rather hacky.  For the long run I thought
> > > > about to split up structure builtin_info into a union where each member
> > > > is a structure for a particular builtin of a partition, i.e., something
> > > > like this:
> > > >
> > > > union builtin_info
> > > > {
> > > >   struct binfo_memset *memset;
> > > >   struct binfo_memcpymove *memcpymove;
> > > >   struct binfo_rawmemchr *rawmemchr;
> > > > };
> > > >
> > > > Such that a structure for one builtin does not get "polluted" by a
> > > > different one.  Any thoughts about this?
> > >
> > > Probably makes sense if the list of recognized patterns grow further.
> > >
> > > I see you use internal functions rather than builtin functions.  I guess
> > > that's OK.  But you use new target hooks for expansion where I think
> > > new optab entries similar to cmpmem would be more appropriate
> > > where the distinction between 8, 16 or 32 bits can be encoded in
> > > the modes.
> >
> > The optab implementation is really nice which allows me to use iterators
> > in the backend which in the end saves me some boiler plate code compared
> > to the previous implementation :)
> >
> > While using optabs now, I only require one additional member (pattern)
> > in the builtin_info struct.  Thus I didn't want to overcomplicate things
> > and kept the single struct approach as is.
> >
> > For the long run, should I resubmit this patch once stage 1 opens or how
> > would you propose to proceed?
> 
> Yes, and sorry for the delay.  Few comments on the patch given I had a
> quick look:
> 
> +void
> +expand_RAWMEMCHR (internal_fn, gcall *stmt)
> +{
> +  expand_operand ops[3];
> +
> +  tree lhs = gimple_call_lhs (stmt);
> 
> I think that give we have people testing with -fno-tree-dce you
> should try to handle a NULL LHS gracefully.  I suppose by
> simply doing nothing.
> 
> +  tree result_old = build_fold_addr_expr (DR_REF (dr));
> +  tree result_new = copy_ssa_name (result_old);
> 
> I think you simply want
> 
>    tree result = make_ssa_name (ptr_type_node);
> 
> most definitely
> 
> +  imm_use_iterator iter;
> +  gimple *stmt;
> +  use_operand_p use_p;
> +  FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
> +    {
> +      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> +       SET_USE (use_p, result_new);
> +
> +      update_stmt (stmt);
> +    }
> 
> isn't going to work.  It might work for the very specific
> case that the IV used for the read is a pointer IV and
> thus built_fold_addr_expr (...) results in the pointer.
> But you can't really rely on any such thing.
> (try a loop operating on a decl - the patch misses
> testcases at least)
> 
> Instead you have to record the reduction result
> during matching.  I see you're doing the same
> there, I think you should try to do sth more general
> to handle cases like
> 
> extern char s[];
> 
> int foo ()
> {
>   int i;
>   for (i = 1; s[i-1]; ++i)
>     ;
>   return i;
> }
> 
> which compute the number of chars including
> the terminating '\0', because we might want to
> eventually recognize this as strlen().  Correlating
> the reduction IV evolution with the rawmemchr
> result using SCEV shouldn't be difficult and
> a pointer subtraction would yield the reduction
> result.

So far I had only loops of the form

T *test(T *p)
{
  while (*p != constant)
    ++p;
  return p;
}

for any integer type T in mind.  I haven't thought about generalizing
this yet.  If the reduction is of integer type, I could try to detect a
strlen builtin, and if it is of pointer type, a rawmemchr builtin.  I'm
not sure whether there are other patterns which could be fruitful
(apart from obvious derivations like memchr, strnlen, etc.).
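
For illustration, an integer-reduction counterpart of the loop above
would be something along these lines (just a sketch, not taken from the
patch or its testcases):

/* Integer reduction: a strlen-style candidate (sketch).  */
unsigned long test_len (const char *s)
{
  unsigned long i = 0;
  while (s[i] != 0)
    ++i;
  return i;
}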

Examples like

extern char *p;
char *test()
{
  while (*p)
    ++p;
  return p;
}

are not recognized so far because they are split into two partitions:
one computes only the return value and is classified as a rawmemchr
builtin, while the other computes the side effect on p and is not
classified as a rawmemchr builtin, since my current implementation
rejects the store.  In the end both partitions are fused because of the
SCC partition check and the loop is not altered.  This raises a
concern: should a loop which is split into multiple partitions, where
at least one partition is classified as a rawmemchr/strlen builtin, be
distributed at all?  Distributing it would mean that every partition
examines the very same memory region, which is probably more expensive
than the operations performed in the loop bodies, if any.  At least I
cannot come up with an example where loop distribution is beneficial
w.r.t. rawmemchr/strlen builtins.  That being said, I think my initial
attempt to implement this inside function distribute_loop is probably
wrong; it might be better done after the call to it.

Regarding your other findings, I will incorporate them.

Thanks for your feedback!
Stefan

> 
> If you think that's too complicated for the start
> then please re-formulate the
> 
> +  tree dr_access_base = build_fold_addr_expr (dr_ref);
> 
> as
> 
>  if (TREE_CODE (dr_ref) != MEM_REF
>      || !integer_zerop (TREE_OPERAND (dr_ref, 0)))
>    return false;
>  dr_access_base = TREE_OPERAND (dr_ref, 0);
> +  /* A limitation of the current implementation is that we only support
> +     constant patterns.  */
> +  gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
> 
> the last stmt in the loop header might not be a gcond.  I think
> you want sth like
> 
>   edge e = single_exit (loop);
>   if (!e)
>     return false;
>   gcond *cond_stmt = safe_dyn_cast <gcond *> (last_stmt (e->src));
>   if (!cond_stmt)
>     return false;
> 
> here.
> 
> +  /* Bail out if direction of memory accesses is not growing.  */
> +  if (get_range_pos_neg (iv.step) != 1)
> +    return false;
> 
> that looks redundant with the access size check.  But I'd add
> a check that the access is of integral, mode-precision type.
> 
>   if (!INTEGRAL_TYPE_P (TREE_TYPE (DR_REF (dr)))
>       || !type_has_mode_precision_p (TREE_TYPE (DR_REF (dr))))
>    return false;
> 
> +  /* Bail out if target does not provide rawmemchr for a certain mode.  */
> +  machine_mode mode;
> +  switch (TREE_INT_CST_LOW (iv.step))
> +    {
> +    case 1: mode = QImode; break;
> +    case 2: mode = HImode; break;
> +    case 4: mode = SImode; break;
> +    default: return false;
> 
> then this simply becomes
> 
>   mode = TYPE_MODE (TREE_TYPE (DR_REF (dr)));
> 
> For testcases try sth like
> 
> extern char s[], r[];
> 
> char *foo ()
> {
>   int i = 0;
>   char *p = r;
>   while (*p)
>     {
>       ++p;
>       s[i++] = '\0';
>     }
>   return p;
> }
> 
> to end up with multiple partitions (and slightly off IVs because
> we apply header copying).
> 
> > Thanks for your review so far!
> >
> > Cheers,
> > Stefan
> >
> > >
> > > Richard.
> > >
> > > > Cheers,
> > > > Stefan
> > > > ---
> > > >  gcc/internal-fn.c            |  42 ++++++
> > > >  gcc/internal-fn.def          |   3 +
> > > >  gcc/target-insns.def         |   3 +
> > > >  gcc/tree-loop-distribution.c | 257 ++++++++++++++++++++++++++++++-----
> > > >  4 files changed, 272 insertions(+), 33 deletions(-)
> > > >
> > > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> > > > index dd7173126fb..9cd62544a1a 100644
> > > > --- a/gcc/internal-fn.c
> > > > +++ b/gcc/internal-fn.c
> > > > @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *)
> > > >    gcc_unreachable ();
> > > >  }
> > > >
> > > > +static void
> > > > +expand_RAWMEMCHR8 (internal_fn, gcall *stmt)
> > > > +{
> > > > +  if (targetm.have_rawmemchr8 ())
> > > > +    {
> > > > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > > +      emit_insn (targetm.gen_rawmemchr8 (result, start, pattern));
> > > > +    }
> > > > +  else
> > > > +    gcc_unreachable();
> > > > +}
> > > > +
> > > > +static void
> > > > +expand_RAWMEMCHR16 (internal_fn, gcall *stmt)
> > > > +{
> > > > +  if (targetm.have_rawmemchr16 ())
> > > > +    {
> > > > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > > +      emit_insn (targetm.gen_rawmemchr16 (result, start, pattern));
> > > > +    }
> > > > +  else
> > > > +    gcc_unreachable();
> > > > +}
> > > > +
> > > > +static void
> > > > +expand_RAWMEMCHR32 (internal_fn, gcall *stmt)
> > > > +{
> > > > +  if (targetm.have_rawmemchr32 ())
> > > > +    {
> > > > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > > > +      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
> > > > +      emit_insn (targetm.gen_rawmemchr32 (result, start, pattern));
> > > > +    }
> > > > +  else
> > > > +    gcc_unreachable();
> > > > +}
> > > > +
> > > >  /* Expand the IFN_UNIQUE function according to its first argument.  */
> > > >
> > > >  static void
> > > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > > index daeace7a34e..34247859704 100644
> > > > --- a/gcc/internal-fn.def
> > > > +++ b/gcc/internal-fn.def
> > > > @@ -348,6 +348,9 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> > > >  DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
> > > >  DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
> > > >  DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> > > > +DEF_INTERNAL_FN (RAWMEMCHR8, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > > > +DEF_INTERNAL_FN (RAWMEMCHR16, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > > > +DEF_INTERNAL_FN (RAWMEMCHR32, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> > > >
> > > >  /* An unduplicable, uncombinable function.  Generally used to preserve
> > > >     a CFG property in the face of jump threading, tail merging or
> > > > diff --git a/gcc/target-insns.def b/gcc/target-insns.def
> > > > index 672c35698d7..9248554cbf3 100644
> > > > --- a/gcc/target-insns.def
> > > > +++ b/gcc/target-insns.def
> > > > @@ -106,3 +106,6 @@ DEF_TARGET_INSN (trap, (void))
> > > >  DEF_TARGET_INSN (unique, (void))
> > > >  DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2))
> > > >  DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1))
> > > > +DEF_TARGET_INSN (rawmemchr8, (rtx x0, rtx x1, rtx x2))
> > > > +DEF_TARGET_INSN (rawmemchr16, (rtx x0, rtx x1, rtx x2))
> > > > +DEF_TARGET_INSN (rawmemchr32, (rtx x0, rtx x1, rtx x2))
> > > > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> > > > index 7ee19fc8677..f5b24bf53bc 100644
> > > > --- a/gcc/tree-loop-distribution.c
> > > > +++ b/gcc/tree-loop-distribution.c
> > > > @@ -218,7 +218,7 @@ enum partition_kind {
> > > >         be unnecessary and removed once distributed memset can be understood
> > > >         and analyzed in data reference analysis.  See PR82604 for more.  */
> > > >      PKIND_PARTIAL_MEMSET,
> > > > -    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
> > > > +    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR
> > > >  };
> > > >
> > > >  /* Type of distributed loop.  */
> > > > @@ -244,6 +244,8 @@ struct builtin_info
> > > >       is only used in memset builtin distribution for now.  */
> > > >    tree dst_base_base;
> > > >    unsigned HOST_WIDE_INT dst_base_offset;
> > > > +  tree pattern;
> > > > +  internal_fn fn;
> > > >  };
> > > >
> > > >  /* Partition for loop distribution.  */
> > > > @@ -588,7 +590,8 @@ class loop_distribution
> > > >    bool
> > > >    classify_partition (loop_p loop,
> > > >                       struct graph *rdg, partition *partition,
> > > > -                     bitmap stmt_in_all_partitions);
> > > > +                     bitmap stmt_in_all_partitions,
> > > > +                     vec<struct partition *> *partitions);
> > > >
> > > >
> > > >    /* Returns true when PARTITION1 and PARTITION2 access the same memory
> > > > @@ -1232,6 +1235,67 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
> > > >      }
> > > >  }
> > > >
> > > > +/* Generate a call to rawmemchr{8,16,32} for PARTITION in LOOP.  */
> > > > +
> > > > +static void
> > > > +generate_rawmemchr_builtin (class loop *loop, partition *partition)
> > > > +{
> > > > +  gimple_stmt_iterator gsi;
> > > > +  tree mem, pattern;
> > > > +  struct builtin_info *builtin = partition->builtin;
> > > > +  gimple *fn_call;
> > > > +
> > > > +  data_reference_p dr = builtin->src_dr;
> > > > +  tree base = builtin->src_base;
> > > > +
> > > > +  tree result_old = TREE_OPERAND (DR_REF (dr), 0);
> > > > +  tree result_new = copy_ssa_name (result_old);
> > > > +
> > > > +  /* The new statements will be placed before LOOP.  */
> > > > +  gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> > > > +
> > > > +  mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false, GSI_CONTINUE_LINKING);
> > > > +  pattern = builtin->pattern;
> > > > +  if (TREE_CODE (pattern) == INTEGER_CST)
> > > > +    pattern = fold_convert (integer_type_node, pattern);
> > > > +  fn_call = gimple_build_call_internal (builtin->fn, 2, mem, pattern);
> > > > +  gimple_call_set_lhs (fn_call, result_new);
> > > > +  gimple_set_location (fn_call, partition->loc);
> > > > +  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
> > > > +
> > > > +  imm_use_iterator iter;
> > > > +  gimple *stmt;
> > > > +  use_operand_p use_p;
> > > > +  FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
> > > > +    {
> > > > +      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> > > > +       SET_USE (use_p, result_new);
> > > > +
> > > > +      update_stmt (stmt);
> > > > +    }
> > > > +
> > > > +  fold_stmt (&gsi);
> > > > +
> > > > +  if (dump_file && (dump_flags & TDF_DETAILS))
> > > > +    switch (builtin->fn)
> > > > +      {
> > > > +      case IFN_RAWMEMCHR8:
> > > > +       fprintf (dump_file, "generated rawmemchr8\n");
> > > > +       break;
> > > > +
> > > > +      case IFN_RAWMEMCHR16:
> > > > +       fprintf (dump_file, "generated rawmemchr16\n");
> > > > +       break;
> > > > +
> > > > +      case IFN_RAWMEMCHR32:
> > > > +       fprintf (dump_file, "generated rawmemchr32\n");
> > > > +       break;
> > > > +
> > > > +      default:
> > > > +       gcc_unreachable ();
> > > > +      }
> > > > +}
> > > > +
> > > >  /* Remove and destroy the loop LOOP.  */
> > > >
> > > >  static void
> > > > @@ -1334,6 +1398,10 @@ generate_code_for_partition (class loop *loop,
> > > >        generate_memcpy_builtin (loop, partition);
> > > >        break;
> > > >
> > > > +    case PKIND_RAWMEMCHR:
> > > > +      generate_rawmemchr_builtin (loop, partition);
> > > > +      break;
> > > > +
> > > >      default:
> > > >        gcc_unreachable ();
> > > >      }
> > > > @@ -1525,44 +1593,53 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
> > > >         }
> > > >      }
> > > >
> > > > -  if (!single_st)
> > > > +  if (!single_ld && !single_st)
> > > >      return false;
> > > >
> > > > -  /* Bail out if this is a bitfield memory reference.  */
> > > > -  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> > > > -      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> > > > -    return false;
> > > > -
> > > > -  /* Data reference must be executed exactly once per iteration of each
> > > > -     loop in the loop nest.  We only need to check dominance information
> > > > -     against the outermost one in a perfect loop nest because a bb can't
> > > > -     dominate outermost loop's latch without dominating inner loop's.  */
> > > > -  basic_block bb_st = gimple_bb (DR_STMT (single_st));
> > > > -  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> > > > -    return false;
> > > > +  basic_block bb_ld = NULL;
> > > > +  basic_block bb_st = NULL;
> > > >
> > > >    if (single_ld)
> > > >      {
> > > > -      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> > > > -      /* Direct aggregate copy or via an SSA name temporary.  */
> > > > -      if (load != store
> > > > -         && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> > > > -       return false;
> > > > -
> > > >        /* Bail out if this is a bitfield memory reference.  */
> > > >        if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
> > > >           && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
> > > >         return false;
> > > >
> > > > -      /* Load and store must be in the same loop nest.  */
> > > > -      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
> > > > -      if (bb_st->loop_father != bb_ld->loop_father)
> > > > +      /* Data reference must be executed exactly once per iteration of each
> > > > +        loop in the loop nest.  We only need to check dominance information
> > > > +        against the outermost one in a perfect loop nest because a bb can't
> > > > +        dominate outermost loop's latch without dominating inner loop's.  */
> > > > +      bb_ld = gimple_bb (DR_STMT (single_ld));
> > > > +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> > > > +       return false;
> > > > +    }
> > > > +
> > > > +  if (single_st)
> > > > +    {
> > > > +      /* Bail out if this is a bitfield memory reference.  */
> > > > +      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> > > > +         && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> > > >         return false;
> > > >
> > > >        /* Data reference must be executed exactly once per iteration.
> > > > -        Same as single_st, we only need to check against the outermost
> > > > +        Same as single_ld, we only need to check against the outermost
> > > >          loop.  */
> > > > -      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> > > > +      bb_st = gimple_bb (DR_STMT (single_st));
> > > > +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> > > > +       return false;
> > > > +    }
> > > > +
> > > > +  if (single_ld && single_st)
> > > > +    {
> > > > +      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> > > > +      /* Direct aggregate copy or via an SSA name temporary.  */
> > > > +      if (load != store
> > > > +         && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> > > > +       return false;
> > > > +
> > > > +      /* Load and store must be in the same loop nest.  */
> > > > +      if (bb_st->loop_father != bb_ld->loop_father)
> > > >         return false;
> > > >
> > > >        edge e = single_exit (bb_st->loop_father);
> > > > @@ -1681,6 +1758,84 @@ alloc_builtin (data_reference_p dst_dr, data_reference_p src_dr,
> > > >    return builtin;
> > > >  }
> > > >
> > > > +/* Given data reference DR in loop nest LOOP, classify if it forms builtin
> > > > +   rawmemchr{8,16,32} call.  */
> > > > +
> > > > +static bool
> > > > +classify_builtin_rawmemchr (loop_p loop, partition *partition, data_reference_p dr, tree loop_result)
> > > > +{
> > > > +  tree dr_ref = DR_REF (dr);
> > > > +  tree dr_access_base = build_fold_addr_expr (dr_ref);
> > > > +  tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref));
> > > > +  gimple *dr_stmt = DR_STMT (dr);
> > > > +  tree rhs1 = gimple_assign_rhs1 (dr_stmt);
> > > > +  affine_iv iv;
> > > > +  tree pattern;
> > > > +
> > > > +  if (TREE_OPERAND (rhs1, 0) != loop_result)
> > > > +    return false;
> > > > +
> > > > +  /* A limitation of the current implementation is that we only support
> > > > +     constant patterns.  */
> > > > +  gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
> > > > +  pattern = gimple_cond_rhs (cond_stmt);
> > > > +  if (gimple_cond_code (cond_stmt) != NE_EXPR
> > > > +      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt)
> > > > +      || TREE_CODE (pattern) != INTEGER_CST)
> > > > +    return false;
> > > > +
> > > > +  /* Bail out if no affine induction variable with constant step can be
> > > > +     determined.  */
> > > > +  if (!simple_iv (loop, loop, dr_access_base, &iv, false))
> > > > +    return false;
> > > > +
> > > > +  /* Bail out if memory accesses are not consecutive.  */
> > > > +  if (!operand_equal_p (iv.step, dr_access_size, 0))
> > > > +    return false;
> > > > +
> > > > +  /* Bail out if direction of memory accesses is not growing.  */
> > > > +  if (get_range_pos_neg (iv.step) != 1)
> > > > +    return false;
> > > > +
> > > > +  internal_fn fn;
> > > > +  switch (TREE_INT_CST_LOW (iv.step))
> > > > +    {
> > > > +    case 1:
> > > > +      if (!targetm.have_rawmemchr8 ())
> > > > +       return false;
> > > > +      fn = IFN_RAWMEMCHR8;
> > > > +      break;
> > > > +
> > > > +    case 2:
> > > > +      if (!targetm.have_rawmemchr16 ())
> > > > +       return false;
> > > > +      fn = IFN_RAWMEMCHR16;
> > > > +      break;
> > > > +
> > > > +    case 4:
> > > > +      if (!targetm.have_rawmemchr32 ())
> > > > +       return false;
> > > > +      fn = IFN_RAWMEMCHR32;
> > > > +      break;
> > > > +
> > > > +    default:
> > > > +      return false;
> > > > +    }
> > > > +
> > > > +  struct builtin_info *builtin;
> > > > +  builtin = alloc_builtin (NULL, NULL, NULL_TREE, NULL_TREE, NULL_TREE);
> > > > +  builtin->src_dr = dr;
> > > > +  builtin->src_base = iv.base;
> > > > +  builtin->pattern = pattern;
> > > > +  builtin->fn = fn;
> > > > +
> > > > +  partition->loc = gimple_location (dr_stmt);
> > > > +  partition->builtin = builtin;
> > > > +  partition->kind = PKIND_RAWMEMCHR;
> > > > +
> > > > +  return true;
> > > > +}
> > > > +
> > > >  /* Given data reference DR in loop nest LOOP, classify if it forms builtin
> > > >     memset call.  */
> > > >
> > > > @@ -1792,12 +1947,16 @@ loop_distribution::classify_builtin_ldst (loop_p loop, struct graph *rdg,
> > > >  bool
> > > >  loop_distribution::classify_partition (loop_p loop,
> > > >                                        struct graph *rdg, partition *partition,
> > > > -                                      bitmap stmt_in_all_partitions)
> > > > +                                      bitmap stmt_in_all_partitions,
> > > > +                                      vec<struct partition *> *partitions)
> > > >  {
> > > >    bitmap_iterator bi;
> > > >    unsigned i;
> > > >    data_reference_p single_ld = NULL, single_st = NULL;
> > > >    bool volatiles_p = false, has_reduction = false;
> > > > +  unsigned nreductions = 0;
> > > > +  gimple *reduction_stmt = NULL;
> > > > +  bool has_interpar_reduction = false;
> > > >
> > > >    EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
> > > >      {
> > > > @@ -1821,6 +1980,19 @@ loop_distribution::classify_partition (loop_p loop,
> > > >             partition->reduction_p = true;
> > > >           else
> > > >             has_reduction = true;
> > > > +
> > > > +         /* Determine whether the reduction statement occurs in other
> > > > +            partitions than the current one.  */
> > > > +         struct partition *piter;
> > > > +         for (unsigned j = 0; partitions->iterate (j, &piter); ++j)
> > > > +           {
> > > > +             if (piter == partition)
> > > > +               continue;
> > > > +             if (bitmap_bit_p (piter->stmts, i))
> > > > +               has_interpar_reduction = true;
> > > > +           }
> > > > +         reduction_stmt = stmt;
> > > > +         ++nreductions;
> > > >         }
> > > >      }
> > > >
> > > > @@ -1840,6 +2012,30 @@ loop_distribution::classify_partition (loop_p loop,
> > > >    if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
> > > >      return has_reduction;
> > > >
> > > > +  /* If we determined a single load and a single reduction statement which does
> > > > +     not occur in any other partition, then try to classify this partition as a
> > > > +     rawmemchr builtin.  */
> > > > +  if (single_ld != NULL
> > > > +      && single_st == NULL
> > > > +      && nreductions == 1
> > > > +      && !has_interpar_reduction
> > > > +      && is_gimple_assign (reduction_stmt))
> > > > +    {
> > > > +      /* If we classified the partition as a builtin, then ignoring the single
> > > > +        reduction is safe, since the reduction variable is not used in other
> > > > +        partitions.  */
> > > > +      tree reduction_var = gimple_assign_lhs (reduction_stmt);
> > > > +      return !classify_builtin_rawmemchr (loop, partition, single_ld, reduction_var);
> > > > +    }
> > > > +
> > > > +  if (single_st == NULL)
> > > > +    return has_reduction;
> > > > +
> > > > +  /* Don't distribute loop if niters is unknown.  */
> > > > +  tree niters = number_of_latch_executions (loop);
> > > > +  if (niters == NULL_TREE || niters == chrec_dont_know)
> > > > +    return has_reduction;
> > > > +
> > > >    partition->loc = gimple_location (DR_STMT (single_st));
> > > >
> > > >    /* Classify the builtin kind.  */
> > > > @@ -2979,7 +3175,7 @@ loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
> > > >    FOR_EACH_VEC_ELT (partitions, i, partition)
> > > >      {
> > > >        reduction_in_all
> > > > -       |= classify_partition (loop, rdg, partition, stmt_in_all_partitions);
> > > > +       |= classify_partition (loop, rdg, partition, stmt_in_all_partitions, &partitions);
> > > >        any_builtin |= partition_builtin_p (partition);
> > > >      }
> > > >
> > > > @@ -3290,11 +3486,6 @@ loop_distribution::execute (function *fun)
> > > >               && !optimize_loop_for_speed_p (loop)))
> > > >         continue;
> > > >
> > > > -      /* Don't distribute loop if niters is unknown.  */
> > > > -      tree niters = number_of_latch_executions (loop);
> > > > -      if (niters == NULL_TREE || niters == chrec_dont_know)
> > > > -       continue;
> > > > -
> > > >        /* Get the perfect loop nest for distribution.  */
> > > >        loop = prepare_perfect_loop_nest (loop);
> > > >        for (; loop; loop = loop->inner)
> > > > --
> > > > 2.23.0
> > > >
Stefan Schulze Frielinghaus March 16, 2021, 5:13 p.m. UTC | #7
[snip]

Please find attached a new version of the patch.  A major change compared to
the previous patch is that the detection now lives in a separate pass, which
hopefully also makes reviewing easier since it is almost self-contained.  I
moved it out of loop distribution after realizing that detecting loops which
mimic the behavior of rawmemchr/strlen functions does not really fit into
that topic.  This also allowed me to play around a bit and schedule the pass
at different points.  Currently it is scheduled right before loop
distribution, where loop header copying has already taken place, which leads
to the following effect.  Running this setup over

char *t (char *p)
{
  for (; *p; ++p);
  return p;
}

the new pass transforms

char * t (char * p)
{
  char _1;
  char _7;

  <bb 2> [local count: 118111600]:
  _7 = *p_3(D);
  if (_7 != 0)
    goto <bb 5>; [89.00%]
  else
    goto <bb 7>; [11.00%]

  <bb 5> [local count: 105119324]:

  <bb 3> [local count: 955630225]:
  # p_8 = PHI <p_6(6), p_3(D)(5)>
  p_6 = p_8 + 1;
  _1 = *p_6;
  if (_1 != 0)
    goto <bb 6>; [89.00%]
  else
    goto <bb 8>; [11.00%]

  <bb 8> [local count: 105119324]:
  # p_2 = PHI <p_6(3)>
  goto <bb 4>; [100.00%]

  <bb 6> [local count: 850510901]:
  goto <bb 3>; [100.00%]

  <bb 7> [local count: 12992276]:

  <bb 4> [local count: 118111600]:
  # p_9 = PHI <p_2(8), p_3(D)(7)>
  return p_9;

}

into

char * t (char * p)
{
  char * _5;
  char _7;

  <bb 2> [local count: 118111600]:
  _7 = *p_3(D);
  if (_7 != 0)
    goto <bb 5>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 5> [local count: 105119324]:
  _5 = p_3(D) + 1;
  p_10 = .RAWMEMCHR (_5, 0);

  <bb 4> [local count: 118111600]:
  # p_9 = PHI <p_10(5), p_3(D)(2)>
  return p_9;

}

which is fine so far.  However, I haven't made up my mind yet whether it is
worthwhile to spend more time in order to also eliminate the "first unrolling"
of the loop.  I gave it a shot by scheduling the pass prior to the loop header
copying pass and ended up with:

char * t (char * p)
{
  <bb 2> [local count: 118111600]:
  p_5 = .RAWMEMCHR (p_3(D), 0);
  return p_5;

}

which seems optimal to me.  The downside of this is that I have to initialize
scalar evolution analysis, which might be undesired that early.

All this brings me to the question of where you see this piece of code running.
If in a separate pass, when would you schedule it?  If in an existing pass,
which one would you choose?
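
Roughly, that experiment corresponds to hoisting the new pass before the
header-copying pass in passes.def, along the lines of the following sketch
(surrounding passes elided; pass_lpat is the name used by this patch):

	  /* Hypothetical placement: run the pattern pass before loop header
	     copying so the peeled first iteration never appears, at the cost
	     of initializing SCEV this early.  */
	  NEXT_PASS (pass_lpat);
	  NEXT_PASS (pass_ch);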

Another topic which came up is whether there exists a more elegant solution
than my current implementation in order to deal with stores (I'm speaking of
the `if (store_dr)` statement inside function transform_loop_1).  For example,

extern char *p;
char *t ()
{
  for (; *p; ++p);
  return p;
}

ends up as

char * t ()
{
  char * _1;
  char * _2;
  char _3;
  char * p.1_8;
  char _9;
  char * p.1_10;
  char * p.1_11;

  <bb 2> [local count: 118111600]:
  p.1_8 = p;
  _9 = *p.1_8;
  if (_9 != 0)
    goto <bb 5>; [89.00%]
  else
    goto <bb 7>; [11.00%]

  <bb 5> [local count: 105119324]:

  <bb 3> [local count: 955630225]:
  # p.1_10 = PHI <_1(6), p.1_8(5)>
  _1 = p.1_10 + 1;
  p = _1;
  _3 = *_1;
  if (_3 != 0)
    goto <bb 6>; [89.00%]
  else
    goto <bb 8>; [11.00%]

  <bb 8> [local count: 105119324]:
  # _2 = PHI <_1(3)>
  goto <bb 4>; [100.00%]

  <bb 6> [local count: 850510901]:
  goto <bb 3>; [100.00%]

  <bb 7> [local count: 12992276]:

  <bb 4> [local count: 118111600]:
  # p.1_11 = PHI <_2(8), p.1_8(7)>
  return p.1_11;

}

where a load and a store occur inside the loop.  For a rawmemchr-like loop I
have to show that we never load from a memory location to which we write.
Currently I solve this by hard-coding those facts, which is not generic at
all.  I gave compute_data_dependences_for_loop a try, but it failed to
determine that stores only happen to p[0] and loads from p[i] where i>0.
Maybe there is a more generic way to express this than my current approach?
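
To make the aliasing hazard concrete (a made-up example, not taken from the
patch): if the scanned bytes overlap the storage of p itself, the store
performed by ++p changes bytes that later iterations load, which is exactly
what has to be ruled out before emitting .RAWMEMCHR:

/* Hypothetical setup: char accesses may alias anything, so *p and the
   store to the global p can refer to the same bytes.  After this, each
   "++p" store rewrites part of the memory that subsequent "*p" loads
   read, so the loop must not be summarized as a single scan of the
   original memory contents.  */
extern char *p;

void make_self_overlapping (void)
{
  p = (char *) &p;
}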

Thanks again for your input so far.  Really appreciated.

Cheers,
Stefan
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 8a5fb3fd99c..7b2d7405277 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1608,6 +1608,7 @@ OBJS = \
 	tree-into-ssa.o \
 	tree-iterator.o \
 	tree-loop-distribution.o \
+	tree-loop-pattern.o \
 	tree-nested.o \
 	tree-nrv.o \
 	tree-object-size.o \
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index dd7173126fb..957e96a46a4 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2917,6 +2917,33 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+  tree lhs_type = TREE_TYPE (lhs);
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
+
+  for (unsigned int i = 0; i < 2; ++i)
+    {
+      tree rhs = gimple_call_arg (stmt, i);
+      tree rhs_type = TREE_TYPE (rhs);
+      rtx rhs_rtx = expand_normal (rhs);
+      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
+    }
+
+  insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
+
+  expand_insn (icode, 3, ops);
+  if (!rtx_equal_p (lhs_rtx, ops[0].value))
+    emit_move_insn (lhs_rtx, ops[0].value);
+}
+
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index daeace7a34e..95c76795648 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b192a9d070b..f7c69f914ce 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
 OPTAB_D (movmem_optab, "movmem$a")
 OPTAB_D (setmem_optab, "setmem$a")
 OPTAB_D (strlen_optab, "strlen$a")
+OPTAB_D (rawmemchr_optab, "rawmemchr$I$a")
 
 OPTAB_DC(fma_optab, "fma$a4", FMA)
 OPTAB_D (fms_optab, "fms$a4")
diff --git a/gcc/passes.def b/gcc/passes.def
index e9ed3c7bc57..280e8fc0cde 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -274,6 +274,7 @@ along with GCC; see the file COPYING3.  If not see
 	     empty loops.  Remove them now.  */
 	  NEXT_PASS (pass_cd_dce, false /* update_address_taken_p */);
 	  NEXT_PASS (pass_iv_canon);
+	  NEXT_PASS (pass_lpat);
 	  NEXT_PASS (pass_loop_distribution);
 	  NEXT_PASS (pass_linterchange);
 	  NEXT_PASS (pass_copy_prop);
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-1.c b/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-1.c
new file mode 100644
index 00000000000..b4133510fca
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-1.c
@@ -0,0 +1,72 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -fdump-tree-lpat-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "lpat" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "lpat" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "lpat" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt but no store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, pattern)   \
+__attribute__((noinline))  \
+T *test_##T (T *p)         \
+{                          \
+  while (*p != (T)pattern) \
+    ++p;                   \
+  return p;                \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  0xab)
+test (int16_t, 0xabcd)
+test (int32_t, 0xabcdef15)
+
+#define run(T, pattern, i)      \
+{                               \
+T *q = p;                       \
+q[i] = (T)pattern;              \
+assert (test_##T (p) == &q[i]); \
+q[i] = 0;                       \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0, 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, 0xab, 0);
+  run (int8_t, 0xab, 1);
+  run (int8_t, 0xab, 13);
+
+  run (int16_t, 0xabcd, 0);
+  run (int16_t, 0xabcd, 1);
+  run (int16_t, 0xabcd, 13);
+
+  run (int32_t, 0xabcdef15, 0);
+  run (int32_t, 0xabcdef15, 1);
+  run (int32_t, 0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-2.c b/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-2.c
new file mode 100644
index 00000000000..9bebec11db0
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-2.c
@@ -0,0 +1,83 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -fdump-tree-lpat-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "lpat" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "lpat" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "lpat" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt and store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+uint8_t *p_uint8_t;
+uint16_t *p_uint16_t;
+uint32_t *p_uint32_t;
+
+int8_t *p_int8_t;
+int16_t *p_int16_t;
+int32_t *p_int32_t;
+
+#define test(T, pattern)    \
+__attribute__((noinline))   \
+T *test_##T (void)          \
+{                           \
+  while (*p_##T != pattern) \
+    ++p_##T;                \
+  return p_##T;             \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  (int8_t)0xab)
+test (int16_t, (int16_t)0xabcd)
+test (int32_t, (int32_t)0xabcdef15)
+
+#define run(T, pattern, i) \
+{                          \
+T *q = p;                  \
+q[i] = pattern;            \
+p_##T = p;                 \
+T *r = test_##T ();        \
+assert (r == p_##T);       \
+assert (r == &q[i]);       \
+q[i] = 0;                  \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, '\0', 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, (int8_t)0xab, 0);
+  run (int8_t, (int8_t)0xab, 1);
+  run (int8_t, (int8_t)0xab, 13);
+
+  run (int16_t, (int16_t)0xabcd, 0);
+  run (int16_t, (int16_t)0xabcd, 1);
+  run (int16_t, (int16_t)0xabcd, 13);
+
+  run (int32_t, (int32_t)0xabcdef15, 0);
+  run (int32_t, (int32_t)0xabcdef15, 1);
+  run (int32_t, (int32_t)0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-1.c b/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-1.c
new file mode 100644
index 00000000000..b02509c2c8c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-1.c
@@ -0,0 +1,100 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-lpat-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlen\n" 4 "lpat" } } */
+/* { dg-final { scan-tree-dump-times "generated strlen using rawmemchrhi\n" 4 "lpat" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated strlen using rawmemchrsi\n" 4 "lpat" { target s390x-*-* } } } */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, U)        \
+__attribute__((noinline)) \
+U test_##T##U (T *s)      \
+{                         \
+  U i;                    \
+  for (i=0; s[i]; ++i);   \
+  return i;               \
+}
+
+test (uint8_t,  size_t)
+test (uint16_t, size_t)
+test (uint32_t, size_t)
+test (uint8_t,  int)
+test (uint16_t, int)
+test (uint32_t, int)
+
+test (int8_t,  size_t)
+test (int16_t, size_t)
+test (int32_t, size_t)
+test (int8_t,  int)
+test (int16_t, int)
+test (int32_t, int)
+
+#define run(T, U, i)             \
+{                                \
+T *q = p;                        \
+q[i] = 0;                        \
+assert (test_##T##U (p) == i);   \
+memset (&q[i], 0xf, sizeof (T)); \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+
+  run (uint8_t, size_t, 0);
+  run (uint8_t, size_t, 1);
+  run (uint8_t, size_t, 13);
+
+  run (int8_t, size_t, 0);
+  run (int8_t, size_t, 1);
+  run (int8_t, size_t, 13);
+
+  run (uint8_t, int, 0);
+  run (uint8_t, int, 1);
+  run (uint8_t, int, 13);
+
+  run (int8_t, int, 0);
+  run (int8_t, int, 1);
+  run (int8_t, int, 13);
+
+  run (uint16_t, size_t, 0);
+  run (uint16_t, size_t, 1);
+  run (uint16_t, size_t, 13);
+
+  run (int16_t, size_t, 0);
+  run (int16_t, size_t, 1);
+  run (int16_t, size_t, 13);
+
+  run (uint16_t, int, 0);
+  run (uint16_t, int, 1);
+  run (uint16_t, int, 13);
+
+  run (int16_t, int, 0);
+  run (int16_t, int, 1);
+  run (int16_t, int, 13);
+
+  run (uint32_t, size_t, 0);
+  run (uint32_t, size_t, 1);
+  run (uint32_t, size_t, 13);
+
+  run (int32_t, size_t, 0);
+  run (int32_t, size_t, 1);
+  run (int32_t, size_t, 13);
+
+  run (uint32_t, int, 0);
+  run (uint32_t, int, 1);
+  run (uint32_t, int, 13);
+
+  run (int32_t, int, 0);
+  run (int32_t, int, 1);
+  run (int32_t, int, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-2.c b/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-2.c
new file mode 100644
index 00000000000..e71dad8ed2e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-2.c
@@ -0,0 +1,58 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-lpat-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlen\n" 3 "lpat" } } */
+
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+__attribute__((noinline))
+int test_pos (char *s)
+{
+  int i;
+  for (i=42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_neg (char *s)
+{
+  int i;
+  for (i=-42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_including_null_char (char *s)
+{
+  int i;
+  for (i=1; s[i-1]; ++i);
+  return i;
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+  char *s = (char *)p + 100;
+
+  s[42+13] = 0;
+  assert (test_pos (s) == 42+13);
+  s[42+13] = 0xf;
+
+  s[13] = 0;
+  assert (test_neg (s) == 13);
+  s[13] = 0xf;
+
+  s[-13] = 0;
+  assert (test_neg (s) == -13);
+  s[-13] = 0xf;
+
+  s[13] = 0;
+  assert (test_including_null_char (s) == 13+1);
+
+  return 0;
+}
diff --git a/gcc/timevar.def b/gcc/timevar.def
index 63c0b3306de..bdefc85fbb4 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -307,6 +307,7 @@ DEFTIMEVAR (TV_TREE_UBSAN            , "tree ubsan")
 DEFTIMEVAR (TV_INITIALIZE_RTL        , "initialize rtl")
 DEFTIMEVAR (TV_GIMPLE_LADDRESS       , "address lowering")
 DEFTIMEVAR (TV_TREE_LOOP_IFCVT       , "tree loop if-conversion")
+DEFTIMEVAR (TV_LPAT                  , "tree loop pattern")
 
 /* Everything else in rest_of_compilation not included above.  */
 DEFTIMEVAR (TV_EARLY_LOCAL	     , "early local passes")
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 7ee19fc8677..f7aafd0d0dc 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -890,7 +890,7 @@ loop_distribution::partition_merge_into (struct graph *rdg,
 /* Returns true when DEF is an SSA_NAME defined in LOOP and used after
    the LOOP.  */
 
-static bool
+bool
 ssa_name_has_uses_outside_loop_p (tree def, loop_p loop)
 {
   imm_use_iterator imm_iter;
@@ -912,7 +912,7 @@ ssa_name_has_uses_outside_loop_p (tree def, loop_p loop)
 /* Returns true when STMT defines a scalar variable used after the
    loop LOOP.  */
 
-static bool
+bool
 stmt_has_scalar_dependences_outside_loop (loop_p loop, gimple *stmt)
 {
   def_operand_p def_p;
@@ -1234,7 +1234,7 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
 
 /* Remove and destroy the loop LOOP.  */
 
-static void
+void
 destroy_loop (class loop *loop)
 {
   unsigned nbbs = loop->num_nodes;
diff --git a/gcc/tree-loop-pattern.c b/gcc/tree-loop-pattern.c
new file mode 100644
index 00000000000..a9c984d5e53
--- /dev/null
+++ b/gcc/tree-loop-pattern.c
@@ -0,0 +1,588 @@
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "intl.h"
+#include "tree.h"
+#include "gimple.h"
+#include "cfghooks.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "fold-const.h"
+#include "gimple-iterator.h"
+#include "gimplify-me.h"
+#include "tree-cfg.h"
+#include "tree-ssa.h"
+#include "tree-ssanames.h"
+#include "tree-ssa-loop.h"
+#include "tree-ssa-loop-manip.h"
+#include "tree-into-ssa.h"
+#include "cfgloop.h"
+#include "tree-scalar-evolution.h"
+#include "tree-vectorizer.h"
+#include "tree-eh.h"
+#include "gimple-fold.h"
+#include "rtl.h"
+#include "memmodel.h"
+#include "insn-codes.h"
+#include "optabs.h"
+
+/* This pass detects loops which mimic the effects of builtins and replaces them
+   accordingly.  For example, a loop of the form
+
+     for (; *p != 42; ++p);
+
+   is replaced by
+
+     p = rawmemchr (p, 42);
+
+   under the assumption that rawmemchr is available for a particular mode.
+   Another example is
+
+     int i;
+     for (i = 42; s[i]; ++i);
+
+   which is replaced by
+
+     i = (int)strlen (&s[42]) + 42;
+
+   for some character array S.  In case array S is not of a character array
+   type, we end up with
+
+     i = (int)(rawmemchr (&s[42], 0) - &s[42]) + 42;
+
+   assuming that rawmemchr is available for a particular mode.  Note, detecting
+   strlen like loops also depends on whether the type for the resulting length
+   is compatible with size type or overflow is undefined.  */
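+
+/* As a further illustration (mirroring the included testcases), a loop
+   iterating over 16-bit elements such as
+
+     uint16_t *q = p;
+     while (*q != 0xabcd)
+       ++q;
+
+   is likewise replaced by q = .RAWMEMCHR (q, 0xabcd), provided the target
+   implements the rawmemchr optab for HImode.  */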
+
+/* TODO Quick and dirty imports from tree-loop-distribution pass.  */
+void destroy_loop (class loop *loop);
+bool stmt_has_scalar_dependences_outside_loop (loop_p loop, gimple *stmt);
+bool ssa_name_has_uses_outside_loop_p (tree def, loop_p loop);
+
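+/* Generate a rawmemchr builtin for LOOP, i.e. emit in the loop preheader
+     REDUCTION_VAR = .RAWMEMCHR (BASE, PATTERN);
+   store the result to the location described by STORE_DR (if present), and
+   redirect all uses of the former reduction variable to the new result.  */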
+static void
+generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
+			    data_reference_p store_dr, tree base, tree pattern,
+			    location_t loc)
+{
+  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
+		       && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern));
+  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
+
+  /* The new statements will be placed before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+
+  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
+				       GSI_CONTINUE_LINKING);
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
+  tree reduction_var_new = copy_ssa_name (reduction_var);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+
+  if (store_dr)
+    {
+      gassign *g = gimple_build_assign (DR_REF (store_dr), reduction_var_new);
+      gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
+    }
+
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  fold_stmt (&gsi);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    switch (TYPE_MODE (TREE_TYPE (pattern)))
+      {
+      case QImode:
+	fprintf (dump_file, "generated rawmemchrqi\n");
+	break;
+
+      case HImode:
+	fprintf (dump_file, "generated rawmemchrhi\n");
+	break;
+
+      case SImode:
+	fprintf (dump_file, "generated rawmemchrsi\n");
+	break;
+
+      default:
+	gcc_unreachable ();
+      }
+}
+
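+/* Generate a strlen builtin for LOOP, i.e. emit in the loop preheader
+     REDUCTION_VAR = (typeof (REDUCTION_VAR)) strlen (BASE) + START_LEN;
+   and redirect all uses of the former reduction variable to the new
+   result.  */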
+static void
+generate_strlen_builtin (loop_p loop, tree reduction_var, tree base,
+			 tree start_len, location_t loc)
+{
+  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)));
+  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (start_len));
+
+  /* The new statements will be placed before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+
+  tree reduction_var_new = make_ssa_name (size_type_node);
+
+  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
+				       GSI_CONTINUE_LINKING);
+  tree fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_STRLEN));
+  gimple *fn_call = gimple_build_call (fn, 1, mem);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+
+  /* In case reduction type is not compatible with size type, then
+     conversion is sound even in case an overflow occurs since we previously
+     ensured that for reduction type an overflow is undefined.  */
+  tree convert = fold_convert (TREE_TYPE (reduction_var), reduction_var_new);
+  reduction_var_new = force_gimple_operand_gsi (&gsi, convert, true, NULL_TREE,
+						false, GSI_CONTINUE_LINKING);
+
+  /* Loops of the form `for (i=42; s[i]; ++i);` start at a non-zero index
+     which has to be added to the resulting length.  */
+  if (!integer_zerop (start_len))
+    {
+      tree fn_result = reduction_var_new;
+      reduction_var_new = make_ssa_name (TREE_TYPE (reduction_var));
+      gimple *add_stmt = gimple_build_assign (reduction_var_new, PLUS_EXPR,
+					      fn_result, start_len);
+      gsi_insert_after (&gsi, add_stmt, GSI_CONTINUE_LINKING);
+    }
+
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  fold_stmt (&gsi);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "generated strlen\n");
+}
+
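+/* Generate a strlen-like computation for LOOP by means of a rawmemchr
+   builtin, used when the load type is not a character type.  Emit in the
+   loop preheader, in essence,
+     END = .RAWMEMCHR (BASE, 0);
+     REDUCTION_VAR = (END - BASE) / sizeof (*BASE) + START_LEN;
+   and redirect all uses of the former reduction variable to the new
+   result.  */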
+static void
+generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var,
+					 tree base, tree start_len,
+					 location_t loc)
+{
+  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)));
+  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (start_len));
+
+  /* The new statements will be placed before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+
+  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
+				       GSI_CONTINUE_LINKING);
+  tree zero = build_zero_cst (TREE_TYPE (TREE_TYPE (mem)));
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, zero);
+  tree end = make_ssa_name (TREE_TYPE (base));
+  gimple_call_set_lhs (fn_call, end);
+  gimple_set_location (fn_call, loc);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+
+  tree diff = make_ssa_name (ptrdiff_type_node);
+  gimple *diff_stmt = gimple_build_assign (diff, POINTER_DIFF_EXPR, end, mem);
+  gsi_insert_after (&gsi, diff_stmt, GSI_CONTINUE_LINKING);
+
+  tree convert = fold_convert (ptrdiff_type_node,
+			       TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (mem))));
+  tree size = force_gimple_operand_gsi (&gsi, convert, true, NULL_TREE, false,
+					GSI_CONTINUE_LINKING);
+
+  tree count = make_ssa_name (ptrdiff_type_node);
+  gimple *count_stmt = gimple_build_assign (count, TRUNC_DIV_EXPR, diff, size);
+  gsi_insert_after (&gsi, count_stmt, GSI_CONTINUE_LINKING);
+
+  convert = fold_convert (TREE_TYPE (reduction_var), count);
+  tree reduction_var_new = force_gimple_operand_gsi (&gsi, convert, true,
+						     NULL_TREE, false,
+						     GSI_CONTINUE_LINKING);
+
+  /* Loops of the form `for (i=42; s[i]; ++i);` start at a non-zero index
+     which has to be added to the resulting length.  */
+  if (!integer_zerop (start_len))
+    {
+      tree lhs = make_ssa_name (TREE_TYPE (reduction_var_new));
+      gimple *g = gimple_build_assign (lhs, PLUS_EXPR, reduction_var_new,
+				       start_len);
+      gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
+      reduction_var_new = lhs;
+    }
+
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  fold_stmt (&gsi);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    switch (TYPE_MODE (TREE_TYPE (zero)))
+      {
+      case HImode:
+	fprintf (dump_file, "generated strlen using rawmemchrhi\n");
+	break;
+
+      case SImode:
+	fprintf (dump_file, "generated strlen using rawmemchrsi\n");
+	break;
+
+      default:
+	gcc_unreachable ();
+      }
+}
+
+static bool
+transform_loop_1 (loop_p loop,
+		  data_reference_p load_dr,
+		  data_reference_p store_dr,
+		  tree reduction_var)
+{
+  tree load_ref = DR_REF (load_dr);
+  tree load_type = TREE_TYPE (load_ref);
+  tree load_access_base = build_fold_addr_expr (load_ref);
+  tree load_access_size = TYPE_SIZE_UNIT (load_type);
+  affine_iv load_iv, reduction_iv;
+  tree pattern;
+
+  /* A limitation of the current implementation is that we only support
+     constant patterns.  */
+  edge e = single_exit (loop);
+  gcond *cond_stmt = safe_dyn_cast <gcond *> (last_stmt (e->src));
+  if (!cond_stmt)
+    return false;
+  pattern = gimple_cond_rhs (cond_stmt);
+  if (gimple_cond_code (cond_stmt) != NE_EXPR
+      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (DR_STMT (load_dr))
+      || TREE_CODE (pattern) != INTEGER_CST)
+    return false;
+
+  /* Bail out if no affine induction variable with constant step can be
+     determined.  */
+  if (!simple_iv (loop, loop, load_access_base, &load_iv, false))
+    return false;
+
+  /* Bail out if memory accesses are not consecutive or not growing.  */
+  if (!operand_equal_p (load_iv.step, load_access_size, 0))
+    return false;
+
+  if (!INTEGRAL_TYPE_P (load_type)
+      || !type_has_mode_precision_p (load_type))
+    return false;
+
+  if (!simple_iv (loop, loop, reduction_var, &reduction_iv, false))
+    return false;
+
+  /* Handle rawmemchr like loops.  */
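+  /* These are loops where the load address and the reduction variable
+     describe the same induction, i.e. loops of the form
+     `while (*p != pattern) ++p;' where the final value of P is used after
+     the loop.  */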
+  if (operand_equal_p (load_iv.base, reduction_iv.base)
+      && operand_equal_p (load_iv.step, reduction_iv.step))
+    {
+      if (store_dr)
+	{
+	  /* Ensure that we store to X and load from X+I where I>0.  */
+	  if (TREE_CODE (load_iv.base) != POINTER_PLUS_EXPR
+	      || !integer_onep (TREE_OPERAND (load_iv.base, 1)))
+	    return false;
+	  tree ptr_base = TREE_OPERAND (load_iv.base, 0);
+	  if (TREE_CODE (ptr_base) != SSA_NAME)
+	    return false;
+	  gimple *def = SSA_NAME_DEF_STMT (ptr_base);
+	  if (!gimple_assign_single_p (def)
+	      || gimple_assign_rhs1 (def) != DR_REF (store_dr))
+	    return false;
+	  /* Ensure that the reduction value is stored.  */
+	  if (gimple_assign_rhs1 (DR_STMT (store_dr)) != reduction_var)
+	    return false;
+	}
+      /* Bail out if target does not provide rawmemchr for a certain mode.  */
+      machine_mode mode = TYPE_MODE (load_type);
+      if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
+	return false;
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      generate_rawmemchr_builtin (loop, reduction_var, store_dr, load_iv.base,
+				  pattern, loc);
+      return true;
+    }
+
+  /* Handle strlen like loops.  */
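+  /* These are loops of the form `for (i = START; s[i]; ++i);' where the
+     reduction variable I is an integer induction with constant start and
+     step one, and where the loop is left as soon as a zero element is
+     loaded.  */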
+  if (store_dr == NULL
+      && integer_zerop (pattern)
+      && TREE_CODE (reduction_iv.base) == INTEGER_CST
+      && TREE_CODE (reduction_iv.step) == INTEGER_CST
+      && integer_onep (reduction_iv.step)
+      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
+	  || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
+    {
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      if (TYPE_MODE (load_type) == TYPE_MODE (char_type_node)
+	  && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node))
+	generate_strlen_builtin (loop, reduction_var, load_iv.base,
+				 reduction_iv.base, loc);
+      else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type))
+	       != CODE_FOR_nothing)
+	generate_strlen_builtin_using_rawmemchr (loop, reduction_var,
+						 load_iv.base,
+						 reduction_iv.base, loc);
+      else
+	return false;
+      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
+	{
+	  const char *msg = G_("assuming signed overflow does not occur "
+			       "when optimizing strlen like loop");
+	  fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
+	}
+      return true;
+    }
+
+  return false;
+}
+
+static bool
+transform_loop (loop_p loop)
+{
+  gimple *reduction_stmt = NULL;
+  data_reference_p load_dr = NULL, store_dr = NULL;
+
+  basic_block *bbs = get_loop_body (loop);
+
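+  /* Walk the loop body and collect the single load, an optional single
+     store, and the single statement whose result is used after the loop
+     (the reduction statement).  Bail out as soon as the loop looks more
+     complicated than that.  */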
+  for (unsigned i = 0, ninsns = 0; i < loop->num_nodes; ++i)
+    {
+      basic_block bb = bbs[i];
+
+      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
+	   gsi_next (&bsi), ++ninsns)
+	{
+	  /* Bail out early for loops which are unlikely to match.  */
+	  if (ninsns > 16)
+	    return false;
+	  gphi *phi = bsi.phi ();
+	  if (gimple_has_volatile_ops (phi))
+	    return false;
+	  if (gimple_clobber_p (phi))
+	    continue;
+	  if (virtual_operand_p (gimple_phi_result (phi)))
+	    continue;
+	  if (stmt_has_scalar_dependences_outside_loop (loop, phi))
+	    {
+	      if (reduction_stmt)
+		return false;
+	      reduction_stmt = phi;
+	    }
+	}
+
+      for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
+	   gsi_next (&bsi), ++ninsns)
+	{
+	  /* Bail out early for loops which are unlikely to match.  */
+	  if (ninsns > 16)
+	    return false;
+
+	  gimple *stmt = gsi_stmt (bsi);
+
+	  if (gimple_clobber_p (stmt))
+	    continue;
+
+	  if (gimple_code (stmt) == GIMPLE_LABEL || is_gimple_debug (stmt))
+	    continue;
+
+	  if (gimple_has_volatile_ops (stmt))
+	    return false;
+
+	  if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
+	    {
+	      if (reduction_stmt)
+		return false;
+	      reduction_stmt = stmt;
+	    }
+
+	  /* Any scalar stmts are ok.  */
+	  if (!gimple_vuse (stmt))
+	    continue;
+
+	  /* Otherwise just regular loads/stores.  */
+	  if (!gimple_assign_single_p (stmt))
+	    return false;
+
+	  auto_vec<data_reference_p, 2> dr_vec;
+	  if (!find_data_references_in_stmt (loop, stmt, &dr_vec))
+	    return false;
+	  data_reference_p dr;
+	  unsigned j;
+	  FOR_EACH_VEC_ELT (dr_vec, j, dr)
+	    {
+	      tree type = TREE_TYPE (DR_REF (dr));
+	      if (!ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (type)))
+		return false;
+	      if (DR_IS_READ (dr))
+		{
+		  if (load_dr != NULL)
+		    return false;
+		  load_dr = dr;
+		}
+	      else
+		{
+		  if (store_dr != NULL)
+		    return false;
+		  store_dr = dr;
+		}
+	    }
+	}
+    }
+
+  /* A limitation of the current implementation is that we require a reduction
+     statement which does not occur in cases like
+     extern int *p;
+     void foo (void) { for (; *p; ++p); } */
+  if (load_dr == NULL || reduction_stmt == NULL)
+    return false;
+
+  /* Note, reduction variables are guaranteed to be SSA names.  */
+  tree reduction_var;
+  switch (gimple_code (reduction_stmt))
+    {
+    case GIMPLE_PHI:
+      reduction_var = gimple_phi_result (reduction_stmt);
+      break;
+    case GIMPLE_ASSIGN:
+      reduction_var = gimple_assign_lhs (reduction_stmt);
+      break;
+    default:
+      /* Bail out e.g. for GIMPLE_CALL.  */
+      return false;
+    }
+  if (reduction_var == NULL)
+    return false;
+
+  /* Bail out if this is a bitfield memory reference.  */
+  if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF
+      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1)))
+    return false;
+
+  /* Data reference must be executed exactly once per iteration of each
+     loop in the loop nest.  We only need to check dominance information
+     against the outermost one in a perfect loop nest because a bb can't
+     dominate outermost loop's latch without dominating inner loop's.  */
+  basic_block load_bb = gimple_bb (DR_STMT (load_dr));
+  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, load_bb))
+    return false;
+
+  if (store_dr)
+    {
+      /* Bail out if this is a bitfield memory reference.  */
+      if (TREE_CODE (DR_REF (store_dr)) == COMPONENT_REF
+	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (store_dr), 1)))
+	return false;
+
+      /* Data reference must be executed exactly once per iteration of each
+	 loop in the loop nest.  We only need to check dominance information
+	 against the outermost one in a perfect loop nest because a bb can't
+	 dominate outermost loop's latch without dominating inner loop's.  */
+      basic_block store_bb = gimple_bb (DR_STMT (store_dr));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, store_bb))
+	return false;
+
+      /* Load and store must be in the same loop nest.  */
+      if (store_bb->loop_father != load_bb->loop_father)
+	return false;
+
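+      /* Load and store must either both dominate the block through which
+	 the loop is left or neither of them does, i.e. they are executed
+	 under the same condition.  */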
+      edge e = single_exit (store_bb->loop_father);
+      if (!e)
+	return false;
+      bool load_dom = dominated_by_p (CDI_DOMINATORS, e->src, load_bb);
+      bool store_dom = dominated_by_p (CDI_DOMINATORS, e->src, store_bb);
+      if (load_dom != store_dom)
+	return false;
+    }
+
+  return transform_loop_1 (loop, load_dr, store_dr, reduction_var);
+}
+
+namespace {
+
+const pass_data pass_data_lpat =
+{
+  GIMPLE_PASS, /* type */
+  "lpat", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_LPAT, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_lpat : public gimple_opt_pass
+{
+public:
+  pass_lpat (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_lpat, ctxt)
+  {}
+
+  bool
+  gate (function *) OVERRIDE
+  {
+    return optimize != 0;
+  }
+
+  unsigned int
+  execute (function *f) OVERRIDE
+  {
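+    /* Replace innermost loops which match one of the supported patterns by
+       builtin calls; successfully replaced loops are destroyed afterwards
+       and SSA form is updated for the whole function.  */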
+    loop_p loop;
+    auto_vec<loop_p> loops_to_be_destroyed;
+
+    FOR_EACH_LOOP_FN (f, loop, LI_ONLY_INNERMOST)
+      {
+	if (!single_exit (loop)
+	    || (!flag_tree_loop_distribute_patterns // TODO
+		&& !optimize_loop_for_speed_p (loop)))
+	  continue;
+
+	if (transform_loop (loop))
+	  loops_to_be_destroyed.safe_push (loop);
+      }
+
+    if (loops_to_be_destroyed.length () > 0)
+      {
+	unsigned i;
+	FOR_EACH_VEC_ELT (loops_to_be_destroyed, i, loop)
+	  destroy_loop (loop);
+
+	scev_reset_htab ();
+	mark_virtual_operands_for_renaming (f);
+	rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
+
+	return TODO_cleanup_cfg;
+      }
+    else
+      return 0;
+  }
+}; // class pass_lpat
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_lpat (gcc::context *ctxt)
+{
+  return new pass_lpat (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 15693fee150..2d71a12039e 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -380,6 +380,7 @@ extern gimple_opt_pass *make_pass_graphite (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_graphite_transforms (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_if_conversion (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_if_to_switch (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_lpat (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_distribution (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_simduid_cleanup (gcc::context *ctxt);
Stefan Schulze Frielinghaus April 8, 2021, 8:23 a.m. UTC | #8
ping

On Tue, Mar 16, 2021 at 06:13:21PM +0100, Stefan Schulze Frielinghaus wrote:
> [snip]
> 
> Please find attached a new version of the patch.  A major change compared to
> the previous patch is that the detection now lives in a separate pass, which
> hopefully also makes reviewing easier since it is almost self-contained.  I
> went for a separate pass after realizing that detecting loops which mimic the
> behavior of the rawmemchr/strlen functions does not really fit into the topic
> of loop distribution.  Due to this I was also able to play around a bit and
> schedule the pass at different times.  Currently it is scheduled right before
> loop distribution, where loop header copying has already taken place, which
> leads to the following effect.  Running this setup over
> 
> char *t (char *p)
> {
>   for (; *p; ++p);
>   return p;
> }
> 
> the new pass transforms
> 
> char * t (char * p)
> {
>   char _1;
>   char _7;
> 
>   <bb 2> [local count: 118111600]:
>   _7 = *p_3(D);
>   if (_7 != 0)
>     goto <bb 5>; [89.00%]
>   else
>     goto <bb 7>; [11.00%]
> 
>   <bb 5> [local count: 105119324]:
> 
>   <bb 3> [local count: 955630225]:
>   # p_8 = PHI <p_6(6), p_3(D)(5)>
>   p_6 = p_8 + 1;
>   _1 = *p_6;
>   if (_1 != 0)
>     goto <bb 6>; [89.00%]
>   else
>     goto <bb 8>; [11.00%]
> 
>   <bb 8> [local count: 105119324]:
>   # p_2 = PHI <p_6(3)>
>   goto <bb 4>; [100.00%]
> 
>   <bb 6> [local count: 850510901]:
>   goto <bb 3>; [100.00%]
> 
>   <bb 7> [local count: 12992276]:
> 
>   <bb 4> [local count: 118111600]:
>   # p_9 = PHI <p_2(8), p_3(D)(7)>
>   return p_9;
> 
> }
> 
> into
> 
> char * t (char * p)
> {
>   char * _5;
>   char _7;
> 
>   <bb 2> [local count: 118111600]:
>   _7 = *p_3(D);
>   if (_7 != 0)
>     goto <bb 5>; [89.00%]
>   else
>     goto <bb 4>; [11.00%]
> 
>   <bb 5> [local count: 105119324]:
>   _5 = p_3(D) + 1;
>   p_10 = .RAWMEMCHR (_5, 0);
> 
>   <bb 4> [local count: 118111600]:
>   # p_9 = PHI <p_10(5), p_3(D)(2)>
>   return p_9;
> 
> }
> 
> which is fine so far.  However, I haven't made up my mind yet whether it is
> worthwhile to spend more time in order to also eliminate the "first unrolling"
> of the loop.  I gave it a shot by scheduling the pass prior to the copy-header
> pass and ended up with:
> 
> char * t (char * p)
> {
>   <bb 2> [local count: 118111600]:
>   p_5 = .RAWMEMCHR (p_3(D), 0);
>   return p_5;
> 
> }
> 
> which seems optimal to me.  The downside of this is that I have to initialize
> scalar evolution analysis, which might be undesirable that early.
> 
> All this brings me to the question: where do you see this piece of code
> running?  If in a separate pass, when would you schedule it?  If in an
> existing pass, which one would you choose?
> 
> Another topic which came up is whether there exists a more elegant solution to
> my current implementation in order to deal with stores (I'm speaking of the `if
> (store_dr)` statement inside of function transform_loop_1).  For example,
> 
> extern char *p;
> char *t ()
> {
>   for (; *p; ++p);
>   return p;
> }
> 
> ends up as
> 
> char * t ()
> {
>   char * _1;
>   char * _2;
>   char _3;
>   char * p.1_8;
>   char _9;
>   char * p.1_10;
>   char * p.1_11;
> 
>   <bb 2> [local count: 118111600]:
>   p.1_8 = p;
>   _9 = *p.1_8;
>   if (_9 != 0)
>     goto <bb 5>; [89.00%]
>   else
>     goto <bb 7>; [11.00%]
> 
>   <bb 5> [local count: 105119324]:
> 
>   <bb 3> [local count: 955630225]:
>   # p.1_10 = PHI <_1(6), p.1_8(5)>
>   _1 = p.1_10 + 1;
>   p = _1;
>   _3 = *_1;
>   if (_3 != 0)
>     goto <bb 6>; [89.00%]
>   else
>     goto <bb 8>; [11.00%]
> 
>   <bb 8> [local count: 105119324]:
>   # _2 = PHI <_1(3)>
>   goto <bb 4>; [100.00%]
> 
>   <bb 6> [local count: 850510901]:
>   goto <bb 3>; [100.00%]
> 
>   <bb 7> [local count: 12992276]:
> 
>   <bb 4> [local count: 118111600]:
>   # p.1_11 = PHI <_2(8), p.1_8(7)>
>   return p.1_11;
> 
> }
> 
> where a load and a store occur inside the loop.  For a rawmemchr like loop I
> have to show that we never load from a memory location to which we write.
> Currently I solve this by hard coding those facts, which is not generic at all.
> I gave compute_data_dependences_for_loop a try, but it failed to determine
> that stores only happen to p[0] while loads happen from p[i] where i>0.  Maybe
> there is a more generic way to express this than my current approach?
> 
> Thanks again for your input so far.  Really appreciated.
> 
> Cheers,
> Stefan

Stefan Schulze Frielinghaus May 4, 2021, 5:25 p.m. UTC | #9
ping

On Thu, Apr 08, 2021 at 10:23:31AM +0200, Stefan Schulze Frielinghaus wrote:
> ping
> 
> On Tue, Mar 16, 2021 at 06:13:21PM +0100, Stefan Schulze Frielinghaus wrote:
> > [snip]
> > 
> > Please find attached a new version of the patch.  A major change compared to
> > the previous patch is that I created a separate pass which hopefully makes
> > reviewing also easier since it is almost self-contained.  After realizing that
> > detecting loops which mimic the behavior of rawmemchr/strlen functions does not
> > really fit into the topic of loop distribution, I created a separate pass.  Due
> > to this I was also able to play around a bit and schedule the pass at different
> > times.  Currently it is scheduled right before loop distribution where loop
> > header copying already took place which leads to the following effect.  Running
> > this setup over
> > 
> > char *t (char *p)
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> > 
> > the new pass transforms
> > 
> > char * t (char * p)
> > {
> >   char _1;
> >   char _7;
> > 
> >   <bb 2> [local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 7>; [11.00%]
> > 
> >   <bb 5> [local count: 105119324]:
> > 
> >   <bb 3> [local count: 955630225]:
> >   # p_8 = PHI <p_6(6), p_3(D)(5)>
> >   p_6 = p_8 + 1;
> >   _1 = *p_6;
> >   if (_1 != 0)
> >     goto <bb 6>; [89.00%]
> >   else
> >     goto <bb 8>; [11.00%]
> > 
> >   <bb 8> [local count: 105119324]:
> >   # p_2 = PHI <p_6(3)>
> >   goto <bb 4>; [100.00%]
> > 
> >   <bb 6> [local count: 850510901]:
> >   goto <bb 3>; [100.00%]
> > 
> >   <bb 7> [local count: 12992276]:
> > 
> >   <bb 4> [local count: 118111600]:
> >   # p_9 = PHI <p_2(8), p_3(D)(7)>
> >   return p_9;
> > 
> > }
> > 
> > into
> > 
> > char * t (char * p)
> > {
> >   char * _5;
> >   char _7;
> > 
> >   <bb 2> [local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 4>; [11.00%]
> > 
> >   <bb 5> [local count: 105119324]:
> >   _5 = p_3(D) + 1;
> >   p_10 = .RAWMEMCHR (_5, 0);
> > 
> >   <bb 4> [local count: 118111600]:
> >   # p_9 = PHI <p_10(5), p_3(D)(2)>
> >   return p_9;
> > 
> > }
> > 
> > which is fine so far.  However, I haven't made up my mind so far whether it is
> > worthwhile to spend more time in order to also eliminate the "first unrolling"
> > of the loop.  I gave it a shot by scheduling the pass prior to pass copy header
> > and ended up with:
> > 
> > char * t (char * p)
> > {
> >   <bb 2> [local count: 118111600]:
> >   p_5 = .RAWMEMCHR (p_3(D), 0);
> >   return p_5;
> > 
> > }
> > 
> > which seems optimal to me.  The downside of this is that I have to initialize
> > scalar evolution analysis which might be undesired that early.
> > 
> > All this brings me to the question where do you see this piece of code running?
> > If in a separate pass when would you schedule it?  If in an existing pass,
> > which one would you choose?
> > 
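> > (Side note: the .RAWMEMCHR call above is assumed to have the usual
> > rawmemchr semantics, generalized to the access mode; roughly the
> > following C sketch for the 8-bit case, with a made-up helper name:
> > 
> >   void *rawmemchr_qi (const unsigned char *s, unsigned char c)
> >   {
> >     /* No length bound; a matching element is assumed to exist.  */
> >     while (*s != c)
> >       ++s;
> >     return (void *) s;
> >   }
> > 
> > The HI/SI variants just widen the element type accordingly.)
> > 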
> > Another topic which came up is whether there exists a more elegant solution to
> > my current implementation in order to deal with stores (I'm speaking of the `if
> > (store_dr)` statement inside of function transform_loop_1).  For example,
> > 
> > extern char *p;
> > char *t ()
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> > 
> > ends up as
> > 
> > char * t ()
> > {
> >   char * _1;
> >   char * _2;
> >   char _3;
> >   char * p.1_8;
> >   char _9;
> >   char * p.1_10;
> >   char * p.1_11;
> > 
> >   <bb 2> [local count: 118111600]:
> >   p.1_8 = p;
> >   _9 = *p.1_8;
> >   if (_9 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 7>; [11.00%]
> > 
> >   <bb 5> [local count: 105119324]:
> > 
> >   <bb 3> [local count: 955630225]:
> >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> >   _1 = p.1_10 + 1;
> >   p = _1;
> >   _3 = *_1;
> >   if (_3 != 0)
> >     goto <bb 6>; [89.00%]
> >   else
> >     goto <bb 8>; [11.00%]
> > 
> >   <bb 8> [local count: 105119324]:
> >   # _2 = PHI <_1(3)>
> >   goto <bb 4>; [100.00%]
> > 
> >   <bb 6> [local count: 850510901]:
> >   goto <bb 3>; [100.00%]
> > 
> >   <bb 7> [local count: 12992276]:
> > 
> >   <bb 4> [local count: 118111600]:
> >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> >   return p.1_11;
> > 
> > }
> > 
> > where inside the loop a load and store occurs.  For a rawmemchr like loop I
> > have to show that we never load from a memory location to which we write.
> > Currently I solve this by hard coding those facts which are not generic at all.
> > I gave compute_data_dependences_for_loop a try which failed to determine the
> > fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> > there are more generic solutions to express this in contrast to my current one?
> > 
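> > For reference, the intended end result for this global-pointer case
> > corresponds roughly to the following C sketch (rawmemchr_like is a
> > made-up stand-in for the .RAWMEMCHR call):
> > 
> >   extern char *p;
> >   extern char *rawmemchr_like (char *, int);  /* made-up stand-in */
> > 
> >   char *t (void)
> >   {
> >     char *q = p;
> >     if (*q != 0)
> >       {
> >         q = rawmemchr_like (q + 1, 0);  /* becomes the .RAWMEMCHR call */
> >         p = q;                          /* store_dr: the global is     */
> >       }                                 /* written back exactly once   */
> >     return q;
> >   }
> > 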
> > Thanks again for your input so far.  Really appreciated.
> > 
> > Cheers,
> > Stefan
> 
> > diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> > index 8a5fb3fd99c..7b2d7405277 100644
> > --- a/gcc/Makefile.in
> > +++ b/gcc/Makefile.in
> > @@ -1608,6 +1608,7 @@ OBJS = \
> >  	tree-into-ssa.o \
> >  	tree-iterator.o \
> >  	tree-loop-distribution.o \
> > +	tree-loop-pattern.o \
> >  	tree-nested.o \
> >  	tree-nrv.o \
> >  	tree-object-size.o \
> > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> > index dd7173126fb..957e96a46a4 100644
> > --- a/gcc/internal-fn.c
> > +++ b/gcc/internal-fn.c
> > @@ -2917,6 +2917,33 @@ expand_VEC_CONVERT (internal_fn, gcall *)
> >    gcc_unreachable ();
> >  }
> >  
> > +void
> > +expand_RAWMEMCHR (internal_fn, gcall *stmt)
> > +{
> > +  expand_operand ops[3];
> > +
> > +  tree lhs = gimple_call_lhs (stmt);
> > +  if (!lhs)
> > +    return;
> > +  tree lhs_type = TREE_TYPE (lhs);
> > +  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> > +  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
> > +
> > +  for (unsigned int i = 0; i < 2; ++i)
> > +    {
> > +      tree rhs = gimple_call_arg (stmt, i);
> > +      tree rhs_type = TREE_TYPE (rhs);
> > +      rtx rhs_rtx = expand_normal (rhs);
> > +      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
> > +    }
> > +
> > +  insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
> > +
> > +  expand_insn (icode, 3, ops);
> > +  if (!rtx_equal_p (lhs_rtx, ops[0].value))
> > +    emit_move_insn (lhs_rtx, ops[0].value);
> > +}
> > +
> >  /* Expand the IFN_UNIQUE function according to its first argument.  */
> >  
> >  static void
> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > index daeace7a34e..95c76795648 100644
> > --- a/gcc/internal-fn.def
> > +++ b/gcc/internal-fn.def
> > @@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> >  DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
> >  DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
> >  DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> > +DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
> >  
> >  /* An unduplicable, uncombinable function.  Generally used to preserve
> >     a CFG property in the face of jump threading, tail merging or
> > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > index b192a9d070b..f7c69f914ce 100644
> > --- a/gcc/optabs.def
> > +++ b/gcc/optabs.def
> > @@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
> >  OPTAB_D (movmem_optab, "movmem$a")
> >  OPTAB_D (setmem_optab, "setmem$a")
> >  OPTAB_D (strlen_optab, "strlen$a")
> > +OPTAB_D (rawmemchr_optab, "rawmemchr$I$a")
> >  
> >  OPTAB_DC(fma_optab, "fma$a4", FMA)
> >  OPTAB_D (fms_optab, "fms$a4")
> > diff --git a/gcc/passes.def b/gcc/passes.def
> > index e9ed3c7bc57..280e8fc0cde 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -274,6 +274,7 @@ along with GCC; see the file COPYING3.  If not see
> >  	     empty loops.  Remove them now.  */
> >  	  NEXT_PASS (pass_cd_dce, false /* update_address_taken_p */);
> >  	  NEXT_PASS (pass_iv_canon);
> > +	  NEXT_PASS (pass_lpat);
> >  	  NEXT_PASS (pass_loop_distribution);
> >  	  NEXT_PASS (pass_linterchange);
> >  	  NEXT_PASS (pass_copy_prop);
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-1.c b/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-1.c
> > new file mode 100644
> > index 00000000000..b4133510fca
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-1.c
> > @@ -0,0 +1,72 @@
> > +/* { dg-do run { target s390x-*-* } } */
> > +/* { dg-options "-O2 -fdump-tree-lpat-details" } */
> > +/* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "lpat" { target s390x-*-* } } } */
> > +/* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "lpat" { target s390x-*-* } } } */
> > +/* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "lpat" { target s390x-*-* } } } */
> > +
> > +/* Rawmemchr pattern: reduction stmt but no store */
> > +
> > +#include <stdint.h>
> > +#include <assert.h>
> > +
> > +typedef __SIZE_TYPE__ size_t;
> > +extern void* malloc (size_t);
> > +extern void* memset (void*, int, size_t);
> > +
> > +#define test(T, pattern)   \
> > +__attribute__((noinline))  \
> > +T *test_##T (T *p)         \
> > +{                          \
> > +  while (*p != (T)pattern) \
> > +    ++p;                   \
> > +  return p;                \
> > +}
> > +
> > +test (uint8_t,  0xab)
> > +test (uint16_t, 0xabcd)
> > +test (uint32_t, 0xabcdef15)
> > +
> > +test (int8_t,  0xab)
> > +test (int16_t, 0xabcd)
> > +test (int32_t, 0xabcdef15)
> > +
> > +#define run(T, pattern, i)      \
> > +{                               \
> > +T *q = p;                       \
> > +q[i] = (T)pattern;              \
> > +assert (test_##T (p) == &q[i]); \
> > +q[i] = 0;                       \
> > +}
> > +
> > +int main(void)
> > +{
> > +  void *p = malloc (1024);
> > +  assert (p);
> > +  memset (p, 0, 1024);
> > +
> > +  run (uint8_t, 0xab, 0);
> > +  run (uint8_t, 0xab, 1);
> > +  run (uint8_t, 0xab, 13);
> > +
> > +  run (uint16_t, 0xabcd, 0);
> > +  run (uint16_t, 0xabcd, 1);
> > +  run (uint16_t, 0xabcd, 13);
> > +
> > +  run (uint32_t, 0xabcdef15, 0);
> > +  run (uint32_t, 0xabcdef15, 1);
> > +  run (uint32_t, 0xabcdef15, 13);
> > +
> > +  run (int8_t, 0xab, 0);
> > +  run (int8_t, 0xab, 1);
> > +  run (int8_t, 0xab, 13);
> > +
> > +  run (int16_t, 0xabcd, 0);
> > +  run (int16_t, 0xabcd, 1);
> > +  run (int16_t, 0xabcd, 13);
> > +
> > +  run (int32_t, 0xabcdef15, 0);
> > +  run (int32_t, 0xabcdef15, 1);
> > +  run (int32_t, 0xabcdef15, 13);
> > +
> > +  return 0;
> > +}
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-2.c b/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-2.c
> > new file mode 100644
> > index 00000000000..9bebec11db0
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/lpat-rawmemchr-2.c
> > @@ -0,0 +1,83 @@
> > +/* { dg-do run { target s390x-*-* } } */
> > +/* { dg-options "-O2 -fdump-tree-lpat-details" } */
> > +/* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "lpat" { target s390x-*-* } } } */
> > +/* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "lpat" { target s390x-*-* } } } */
> > +/* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "lpat" { target s390x-*-* } } } */
> > +
> > +/* Rawmemchr pattern: reduction stmt and store */
> > +
> > +#include <stdint.h>
> > +#include <assert.h>
> > +
> > +typedef __SIZE_TYPE__ size_t;
> > +extern void* malloc (size_t);
> > +extern void* memset (void*, int, size_t);
> > +
> > +uint8_t *p_uint8_t;
> > +uint16_t *p_uint16_t;
> > +uint32_t *p_uint32_t;
> > +
> > +int8_t *p_int8_t;
> > +int16_t *p_int16_t;
> > +int32_t *p_int32_t;
> > +
> > +#define test(T, pattern)    \
> > +__attribute__((noinline))   \
> > +T *test_##T (void)          \
> > +{                           \
> > +  while (*p_##T != pattern) \
> > +    ++p_##T;                \
> > +  return p_##T;             \
> > +}
> > +
> > +test (uint8_t,  0xab)
> > +test (uint16_t, 0xabcd)
> > +test (uint32_t, 0xabcdef15)
> > +
> > +test (int8_t,  (int8_t)0xab)
> > +test (int16_t, (int16_t)0xabcd)
> > +test (int32_t, (int32_t)0xabcdef15)
> > +
> > +#define run(T, pattern, i) \
> > +{                          \
> > +T *q = p;                  \
> > +q[i] = pattern;            \
> > +p_##T = p;                 \
> > +T *r = test_##T ();        \
> > +assert (r == p_##T);       \
> > +assert (r == &q[i]);       \
> > +q[i] = 0;                  \
> > +}
> > +
> > +int main(void)
> > +{
> > +  void *p = malloc (1024);
> > +  assert (p);
> > +  memset (p, '\0', 1024);
> > +
> > +  run (uint8_t, 0xab, 0);
> > +  run (uint8_t, 0xab, 1);
> > +  run (uint8_t, 0xab, 13);
> > +
> > +  run (uint16_t, 0xabcd, 0);
> > +  run (uint16_t, 0xabcd, 1);
> > +  run (uint16_t, 0xabcd, 13);
> > +
> > +  run (uint32_t, 0xabcdef15, 0);
> > +  run (uint32_t, 0xabcdef15, 1);
> > +  run (uint32_t, 0xabcdef15, 13);
> > +
> > +  run (int8_t, (int8_t)0xab, 0);
> > +  run (int8_t, (int8_t)0xab, 1);
> > +  run (int8_t, (int8_t)0xab, 13);
> > +
> > +  run (int16_t, (int16_t)0xabcd, 0);
> > +  run (int16_t, (int16_t)0xabcd, 1);
> > +  run (int16_t, (int16_t)0xabcd, 13);
> > +
> > +  run (int32_t, (int32_t)0xabcdef15, 0);
> > +  run (int32_t, (int32_t)0xabcdef15, 1);
> > +  run (int32_t, (int32_t)0xabcdef15, 13);
> > +
> > +  return 0;
> > +}
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-1.c b/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-1.c
> > new file mode 100644
> > index 00000000000..b02509c2c8c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-1.c
> > @@ -0,0 +1,100 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O2 -fdump-tree-lpat-details" } */
> > +/* { dg-final { scan-tree-dump-times "generated strlen\n" 4 "lpat" } } */
> > +/* { dg-final { scan-tree-dump-times "generated strlen using rawmemchrhi\n" 4 "lpat" { target s390x-*-* } } } */
> > +/* { dg-final { scan-tree-dump-times "generated strlen using rawmemchrsi\n" 4 "lpat" { target s390x-*-* } } } */
> > +
> > +#include <stdint.h>
> > +#include <assert.h>
> > +
> > +typedef __SIZE_TYPE__ size_t;
> > +extern void* malloc (size_t);
> > +extern void* memset (void*, int, size_t);
> > +
> > +#define test(T, U)        \
> > +__attribute__((noinline)) \
> > +U test_##T##U (T *s)      \
> > +{                         \
> > +  U i;                    \
> > +  for (i=0; s[i]; ++i);   \
> > +  return i;               \
> > +}
> > +
> > +test (uint8_t,  size_t)
> > +test (uint16_t, size_t)
> > +test (uint32_t, size_t)
> > +test (uint8_t,  int)
> > +test (uint16_t, int)
> > +test (uint32_t, int)
> > +
> > +test (int8_t,  size_t)
> > +test (int16_t, size_t)
> > +test (int32_t, size_t)
> > +test (int8_t,  int)
> > +test (int16_t, int)
> > +test (int32_t, int)
> > +
> > +#define run(T, U, i)             \
> > +{                                \
> > +T *q = p;                        \
> > +q[i] = 0;                        \
> > +assert (test_##T##U (p) == i);   \
> > +memset (&q[i], 0xf, sizeof (T)); \
> > +}
> > +
> > +int main(void)
> > +{
> > +  void *p = malloc (1024);
> > +  assert (p);
> > +  memset (p, 0xf, 1024);
> > +
> > +  run (uint8_t, size_t, 0);
> > +  run (uint8_t, size_t, 1);
> > +  run (uint8_t, size_t, 13);
> > +
> > +  run (int8_t, size_t, 0);
> > +  run (int8_t, size_t, 1);
> > +  run (int8_t, size_t, 13);
> > +
> > +  run (uint8_t, int, 0);
> > +  run (uint8_t, int, 1);
> > +  run (uint8_t, int, 13);
> > +
> > +  run (int8_t, int, 0);
> > +  run (int8_t, int, 1);
> > +  run (int8_t, int, 13);
> > +
> > +  run (uint16_t, size_t, 0);
> > +  run (uint16_t, size_t, 1);
> > +  run (uint16_t, size_t, 13);
> > +
> > +  run (int16_t, size_t, 0);
> > +  run (int16_t, size_t, 1);
> > +  run (int16_t, size_t, 13);
> > +
> > +  run (uint16_t, int, 0);
> > +  run (uint16_t, int, 1);
> > +  run (uint16_t, int, 13);
> > +
> > +  run (int16_t, int, 0);
> > +  run (int16_t, int, 1);
> > +  run (int16_t, int, 13);
> > +
> > +  run (uint32_t, size_t, 0);
> > +  run (uint32_t, size_t, 1);
> > +  run (uint32_t, size_t, 13);
> > +
> > +  run (int32_t, size_t, 0);
> > +  run (int32_t, size_t, 1);
> > +  run (int32_t, size_t, 13);
> > +
> > +  run (uint32_t, int, 0);
> > +  run (uint32_t, int, 1);
> > +  run (uint32_t, int, 13);
> > +
> > +  run (int32_t, int, 0);
> > +  run (int32_t, int, 1);
> > +  run (int32_t, int, 13);
> > +
> > +  return 0;
> > +}
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-2.c b/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-2.c
> > new file mode 100644
> > index 00000000000..e71dad8ed2e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/lpat-strlen-2.c
> > @@ -0,0 +1,58 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O2 -fdump-tree-lpat-details" } */
> > +/* { dg-final { scan-tree-dump-times "generated strlen\n" 3 "lpat" } } */
> > +
> > +#include <assert.h>
> > +
> > +typedef __SIZE_TYPE__ size_t;
> > +extern void* malloc (size_t);
> > +extern void* memset (void*, int, size_t);
> > +
> > +__attribute__((noinline))
> > +int test_pos (char *s)
> > +{
> > +  int i;
> > +  for (i=42; s[i]; ++i);
> > +  return i;
> > +}
> > +
> > +__attribute__((noinline))
> > +int test_neg (char *s)
> > +{
> > +  int i;
> > +  for (i=-42; s[i]; ++i);
> > +  return i;
> > +}
> > +
> > +__attribute__((noinline))
> > +int test_including_null_char (char *s)
> > +{
> > +  int i;
> > +  for (i=1; s[i-1]; ++i);
> > +  return i;
> > +}
> > +
> > +int main(void)
> > +{
> > +  void *p = malloc (1024);
> > +  assert (p);
> > +  memset (p, 0xf, 1024);
> > +  char *s = (char *)p + 100;
> > +
> > +  s[42+13] = 0;
> > +  assert (test_pos (s) == 42+13);
> > +  s[42+13] = 0xf;
> > +
> > +  s[13] = 0;
> > +  assert (test_neg (s) == 13);
> > +  s[13] = 0xf;
> > +
> > +  s[-13] = 0;
> > +  assert (test_neg (s) == -13);
> > +  s[-13] = 0xf;
> > +
> > +  s[13] = 0;
> > +  assert (test_including_null_char (s) == 13+1);
> > +
> > +  return 0;
> > +}
> > diff --git a/gcc/timevar.def b/gcc/timevar.def
> > index 63c0b3306de..bdefc85fbb4 100644
> > --- a/gcc/timevar.def
> > +++ b/gcc/timevar.def
> > @@ -307,6 +307,7 @@ DEFTIMEVAR (TV_TREE_UBSAN            , "tree ubsan")
> >  DEFTIMEVAR (TV_INITIALIZE_RTL        , "initialize rtl")
> >  DEFTIMEVAR (TV_GIMPLE_LADDRESS       , "address lowering")
> >  DEFTIMEVAR (TV_TREE_LOOP_IFCVT       , "tree loop if-conversion")
> > +DEFTIMEVAR (TV_LPAT                  , "tree loop pattern")
> >  
> >  /* Everything else in rest_of_compilation not included above.  */
> >  DEFTIMEVAR (TV_EARLY_LOCAL	     , "early local passes")
> > diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> > index 7ee19fc8677..f7aafd0d0dc 100644
> > --- a/gcc/tree-loop-distribution.c
> > +++ b/gcc/tree-loop-distribution.c
> > @@ -890,7 +890,7 @@ loop_distribution::partition_merge_into (struct graph *rdg,
> >  /* Returns true when DEF is an SSA_NAME defined in LOOP and used after
> >     the LOOP.  */
> >  
> > -static bool
> > +bool
> >  ssa_name_has_uses_outside_loop_p (tree def, loop_p loop)
> >  {
> >    imm_use_iterator imm_iter;
> > @@ -912,7 +912,7 @@ ssa_name_has_uses_outside_loop_p (tree def, loop_p loop)
> >  /* Returns true when STMT defines a scalar variable used after the
> >     loop LOOP.  */
> >  
> > -static bool
> > +bool
> >  stmt_has_scalar_dependences_outside_loop (loop_p loop, gimple *stmt)
> >  {
> >    def_operand_p def_p;
> > @@ -1234,7 +1234,7 @@ generate_memcpy_builtin (class loop *loop, partition *partition)
> >  
> >  /* Remove and destroy the loop LOOP.  */
> >  
> > -static void
> > +void
> >  destroy_loop (class loop *loop)
> >  {
> >    unsigned nbbs = loop->num_nodes;
> > diff --git a/gcc/tree-loop-pattern.c b/gcc/tree-loop-pattern.c
> > new file mode 100644
> > index 00000000000..a9c984d5e53
> > --- /dev/null
> > +++ b/gcc/tree-loop-pattern.c
> > @@ -0,0 +1,588 @@
> > +#include "config.h"
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "intl.h"
> > +#include "tree.h"
> > +#include "gimple.h"
> > +#include "cfghooks.h"
> > +#include "tree-pass.h"
> > +#include "ssa.h"
> > +#include "fold-const.h"
> > +#include "gimple-iterator.h"
> > +#include "gimplify-me.h"
> > +#include "tree-cfg.h"
> > +#include "tree-ssa.h"
> > +#include "tree-ssanames.h"
> > +#include "tree-ssa-loop.h"
> > +#include "tree-ssa-loop-manip.h"
> > +#include "tree-into-ssa.h"
> > +#include "cfgloop.h"
> > +#include "tree-scalar-evolution.h"
> > +#include "tree-vectorizer.h"
> > +#include "tree-eh.h"
> > +#include "gimple-fold.h"
> > +#include "rtl.h"
> > +#include "memmodel.h"
> > +#include "insn-codes.h"
> > +#include "optabs.h"
> > +
> > +/* This pass detects loops which mimic the effects of builtins and replaces them
> > +   accordingly.  For example, a loop of the form
> > +
> > +     for (; *p != 42; ++p);
> > +
> > +   is replaced by
> > +
> > +     p = rawmemchr (p, 42);
> > +
> > +   under the assumption that rawmemchr is available for a particular mode.
> > +   Another example is
> > +
> > +     int i;
> > +     for (i = 42; s[i]; ++i);
> > +
> > +   which is replaced by
> > +
> > +     i = (int)strlen (&s[42]) + 42;
> > +
> > +   for some character array S.  In case array S is not of a character array
> > +   type, we end up with
> > +
> > +     i = (int)(rawmemchr (&s[42], 0) - &s[42]) + 42;
> > +
> > +   assuming that rawmemchr is available for a particular mode.  Note, detecting
> > +   strlen like loops also depends on whether the type for the resulting length
> > +   is compatible with size type or overflow is undefined.  */
> > +
> > +/* TODO Quick and dirty imports from tree-loop-distribution pass.  */
> > +void destroy_loop (class loop *loop);
> > +bool stmt_has_scalar_dependences_outside_loop (loop_p loop, gimple *stmt);
> > +bool ssa_name_has_uses_outside_loop_p (tree def, loop_p loop);
> > +
> > +static void
> > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
> > +			    data_reference_p store_dr, tree base, tree pattern,
> > +			    location_t loc)
> > +{
> > +  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
> > +		       && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern));
> > +  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
> > +
> > +  /* The new statements will be placed before LOOP.  */
> > +  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> > +
> > +  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
> > +				       GSI_CONTINUE_LINKING);
> > +  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
> > +  tree reduction_var_new = copy_ssa_name (reduction_var);
> > +  gimple_call_set_lhs (fn_call, reduction_var_new);
> > +  gimple_set_location (fn_call, loc);
> > +  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
> > +
> > +  if (store_dr)
> > +    {
> > +      gassign *g = gimple_build_assign (DR_REF (store_dr), reduction_var_new);
> > +      gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
> > +    }
> > +
> > +  imm_use_iterator iter;
> > +  gimple *stmt;
> > +  use_operand_p use_p;
> > +  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
> > +    {
> > +      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> > +	SET_USE (use_p, reduction_var_new);
> > +
> > +      update_stmt (stmt);
> > +    }
> > +
> > +  fold_stmt (&gsi);
> > +
> > +  if (dump_file && (dump_flags & TDF_DETAILS))
> > +    switch (TYPE_MODE (TREE_TYPE (pattern)))
> > +      {
> > +      case QImode:
> > +	fprintf (dump_file, "generated rawmemchrqi\n");
> > +	break;
> > +
> > +      case HImode:
> > +	fprintf (dump_file, "generated rawmemchrhi\n");
> > +	break;
> > +
> > +      case SImode:
> > +	fprintf (dump_file, "generated rawmemchrsi\n");
> > +	break;
> > +
> > +      default:
> > +	gcc_unreachable ();
> > +      }
> > +}
> > +
> > +static void
> > +generate_strlen_builtin (loop_p loop, tree reduction_var, tree base,
> > +			 tree start_len, location_t loc)
> > +{
> > +  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)));
> > +  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (start_len));
> > +
> > +  /* The new statements will be placed before LOOP.  */
> > +  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> > +
> > +  tree reduction_var_new = make_ssa_name (size_type_node);
> > +
> > +  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
> > +				       GSI_CONTINUE_LINKING);
> > +  tree fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_STRLEN));
> > +  gimple *fn_call = gimple_build_call (fn, 1, mem);
> > +  gimple_call_set_lhs (fn_call, reduction_var_new);
> > +  gimple_set_location (fn_call, loc);
> > +  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
> > +
> > +  /* In case reduction type is not compatible with size type, then
> > +     conversion is sound even in case an overflow occurs since we previously
> > +     ensured that for reduction type an overflow is undefined.  */
> > +  tree convert = fold_convert (TREE_TYPE (reduction_var), reduction_var_new);
> > +  reduction_var_new = force_gimple_operand_gsi (&gsi, convert, true, NULL_TREE,
> > +						false, GSI_CONTINUE_LINKING);
> > +
> > +  /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start
> > +     length.  */
> > +  if (!integer_zerop (start_len))
> > +    {
> > +      tree fn_result = reduction_var_new;
> > +      reduction_var_new = make_ssa_name (TREE_TYPE (reduction_var));
> > +      gimple *add_stmt = gimple_build_assign (reduction_var_new, PLUS_EXPR,
> > +					      fn_result, start_len);
> > +      gsi_insert_after (&gsi, add_stmt, GSI_CONTINUE_LINKING);
> > +    }
> > +
> > +  imm_use_iterator iter;
> > +  gimple *stmt;
> > +  use_operand_p use_p;
> > +  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
> > +    {
> > +      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> > +	SET_USE (use_p, reduction_var_new);
> > +
> > +      update_stmt (stmt);
> > +    }
> > +
> > +  fold_stmt (&gsi);
> > +
> > +  if (dump_file && (dump_flags & TDF_DETAILS))
> > +    fprintf (dump_file, "generated strlen\n");
> > +}
> > +
> > +static void
> > +generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var,
> > +					 tree base, tree start_len,
> > +					 location_t loc)
> > +{
> > +  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)));
> > +  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (start_len));
> > +
> > +  /* The new statements will be placed before LOOP.  */
> > +  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> > +
> > +  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
> > +				       GSI_CONTINUE_LINKING);
> > +  tree zero = build_zero_cst (TREE_TYPE (TREE_TYPE (mem)));
> > +  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, zero);
> > +  tree end = make_ssa_name (TREE_TYPE (base));
> > +  gimple_call_set_lhs (fn_call, end);
> > +  gimple_set_location (fn_call, loc);
> > +  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
> > +
> > +  tree diff = make_ssa_name (ptrdiff_type_node);
> > +  gimple *diff_stmt = gimple_build_assign (diff, POINTER_DIFF_EXPR, end, mem);
> > +  gsi_insert_after (&gsi, diff_stmt, GSI_CONTINUE_LINKING);
> > +
> > +  tree convert = fold_convert (ptrdiff_type_node,
> > +			       TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (mem))));
> > +  tree size = force_gimple_operand_gsi (&gsi, convert, true, NULL_TREE, false,
> > +					GSI_CONTINUE_LINKING);
> > +
> > +  tree count = make_ssa_name (ptrdiff_type_node);
> > +  gimple *count_stmt = gimple_build_assign (count, TRUNC_DIV_EXPR, diff, size);
> > +  gsi_insert_after (&gsi, count_stmt, GSI_CONTINUE_LINKING);
> > +
> > +  convert = fold_convert (TREE_TYPE (reduction_var), count);
> > +  tree reduction_var_new = force_gimple_operand_gsi (&gsi, convert, true,
> > +						     NULL_TREE, false,
> > +						     GSI_CONTINUE_LINKING);
> > +
> > +  /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start
> > +     length.  */
> > +  if (!integer_zerop (start_len))
> > +    {
> > +      tree lhs = make_ssa_name (TREE_TYPE (reduction_var_new));
> > +      gimple *g = gimple_build_assign (lhs, PLUS_EXPR, reduction_var_new,
> > +				       start_len);
> > +      gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
> > +      reduction_var_new = lhs;
> > +    }
> > +
> > +  imm_use_iterator iter;
> > +  gimple *stmt;
> > +  use_operand_p use_p;
> > +  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
> > +    {
> > +      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> > +	SET_USE (use_p, reduction_var_new);
> > +
> > +      update_stmt (stmt);
> > +    }
> > +
> > +  fold_stmt (&gsi);
> > +
> > +  if (dump_file && (dump_flags & TDF_DETAILS))
> > +    switch (TYPE_MODE (TREE_TYPE (zero)))
> > +      {
> > +      case HImode:
> > +	fprintf (dump_file, "generated strlen using rawmemchrhi\n");
> > +	break;
> > +
> > +      case SImode:
> > +	fprintf (dump_file, "generated strlen using rawmemchrsi\n");
> > +	break;
> > +
> > +      default:
> > +	gcc_unreachable ();
> > +      }
> > +}
> > +
> > +static bool
> > +transform_loop_1 (loop_p loop,
> > +		  data_reference_p load_dr,
> > +		  data_reference_p store_dr,
> > +		  tree reduction_var)
> > +{
> > +  tree load_ref = DR_REF (load_dr);
> > +  tree load_type = TREE_TYPE (load_ref);
> > +  tree load_access_base = build_fold_addr_expr (load_ref);
> > +  tree load_access_size = TYPE_SIZE_UNIT (load_type);
> > +  affine_iv load_iv, reduction_iv;
> > +  tree pattern;
> > +
> > +  /* A limitation of the current implementation is that we only support
> > +     constant patterns.  */
> > +  edge e = single_exit (loop);
> > +  gcond *cond_stmt = safe_dyn_cast <gcond *> (last_stmt (e->src));
> > +  if (!cond_stmt)
> > +    return false;
> > +  pattern = gimple_cond_rhs (cond_stmt);
> > +  if (gimple_cond_code (cond_stmt) != NE_EXPR
> > +      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (DR_STMT (load_dr))
> > +      || TREE_CODE (pattern) != INTEGER_CST)
> > +    return false;
> > +
> > +  /* Bail out if no affine induction variable with constant step can be
> > +     determined.  */
> > +  if (!simple_iv (loop, loop, load_access_base, &load_iv, false))
> > +    return false;
> > +
> > +  /* Bail out if memory accesses are not consecutive or not growing.  */
> > +  if (!operand_equal_p (load_iv.step, load_access_size, 0))
> > +    return false;
> > +
> > +  if (!INTEGRAL_TYPE_P (load_type)
> > +      || !type_has_mode_precision_p (load_type))
> > +    return false;
> > +
> > +  if (!simple_iv (loop, loop, reduction_var, &reduction_iv, false))
> > +    return false;
> > +
> > +  /* Handle rawmemchr like loops.  */
> > +  if (operand_equal_p (load_iv.base, reduction_iv.base)
> > +      && operand_equal_p (load_iv.step, reduction_iv.step))
> > +    {
> > +      if (store_dr)
> > +	{
> > +	  /* Ensure that we store to X and load from X+I where I>0.  */
> > +	  if (TREE_CODE (load_iv.base) != POINTER_PLUS_EXPR
> > +	      || !integer_onep (TREE_OPERAND (load_iv.base, 1)))
> > +	    return false;
> > +	  tree ptr_base = TREE_OPERAND (load_iv.base, 0);
> > +	  if (TREE_CODE (ptr_base) != SSA_NAME)
> > +	    return false;
> > +	  gimple *def = SSA_NAME_DEF_STMT (ptr_base);
> > +	  if (!gimple_assign_single_p (def)
> > +	      || gimple_assign_rhs1 (def) != DR_REF (store_dr))
> > +	    return false;
> > +	  /* Ensure that the reduction value is stored.  */
> > +	  if (gimple_assign_rhs1 (DR_STMT (store_dr)) != reduction_var)
> > +	    return false;
> > +	}
> > +      /* Bail out if target does not provide rawmemchr for a certain mode.  */
> > +      machine_mode mode = TYPE_MODE (load_type);
> > +      if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
> > +	return false;
> > +      location_t loc = gimple_location (DR_STMT (load_dr));
> > +      generate_rawmemchr_builtin (loop, reduction_var, store_dr, load_iv.base,
> > +				  pattern, loc);
> > +      return true;
> > +    }
> > +
> > +  /* Handle strlen like loops.  */
> > +  if (store_dr == NULL
> > +      && integer_zerop (pattern)
> > +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > +      && integer_onep (reduction_iv.step)
> > +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > +	  || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> > +    {
> > +      location_t loc = gimple_location (DR_STMT (load_dr));
> > +      if (TYPE_MODE (load_type) == TYPE_MODE (char_type_node)
> > +	  && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node))
> > +	generate_strlen_builtin (loop, reduction_var, load_iv.base,
> > +				 reduction_iv.base, loc);
> > +      else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type))
> > +	       != CODE_FOR_nothing)
> > +	generate_strlen_builtin_using_rawmemchr (loop, reduction_var,
> > +						 load_iv.base,
> > +						 reduction_iv.base, loc);
> > +      else
> > +	return false;
> > +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> > +	{
> > +	  const char *msg = G_("assuming signed overflow does not occur "
> > +			       "when optimizing strlen like loop");
> > +	  fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> > +	}
> > +      return true;
> > +    }
> > +
> > +  return false;
> > +}
> > +
> > +static bool
> > +transform_loop (loop_p loop)
> > +{
> > +  gimple *reduction_stmt = NULL;
> > +  data_reference_p load_dr = NULL, store_dr = NULL;
> > +
> > +  basic_block *bbs = get_loop_body (loop);
> > +
> > +  for (unsigned i = 0, ninsns = 0; i < loop->num_nodes; ++i)
> > +    {
> > +      basic_block bb = bbs[i];
> > +
> > +      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
> > +	   gsi_next (&bsi), ++ninsns)
> > +	{
> > +	  /* Bail out early for loops which are unlikely to match.  */
> > +	  if (ninsns > 16)
> > +	    return false;
> > +	  gphi *phi = bsi.phi ();
> > +	  if (gimple_has_volatile_ops (phi))
> > +	    return false;
> > +	  if (gimple_clobber_p (phi))
> > +	    continue;
> > +	  if (virtual_operand_p (gimple_phi_result (phi)))
> > +	    continue;
> > +	  if (stmt_has_scalar_dependences_outside_loop (loop, phi))
> > +	    {
> > +	      if (reduction_stmt)
> > +		return false;
> > +	      reduction_stmt = phi;
> > +	    }
> > +	}
> > +
> > +      for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
> > +	   gsi_next (&bsi), ++ninsns)
> > +	{
> > +	  /* Bail out early for loops which are unlikely to match.  */
> > +	  if (ninsns > 16)
> > +	    return false;
> > +
> > +	  gimple *stmt = gsi_stmt (bsi);
> > +
> > +	  if (gimple_clobber_p (stmt))
> > +	    continue;
> > +
> > +	  if (gimple_code (stmt) == GIMPLE_LABEL || is_gimple_debug (stmt))
> > +	    continue;
> > +
> > +	  if (gimple_has_volatile_ops (stmt))
> > +	    return false;
> > +
> > +	  if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
> > +	    {
> > +	      if (reduction_stmt)
> > +		return false;
> > +	      reduction_stmt = stmt;
> > +	    }
> > +
> > +	  /* Any scalar stmts are ok.  */
> > +	  if (!gimple_vuse (stmt))
> > +	    continue;
> > +
> > +	  /* Otherwise just regular loads/stores.  */
> > +	  if (!gimple_assign_single_p (stmt))
> > +	    return false;
> > +
> > +	  auto_vec<data_reference_p, 2> dr_vec;
> > +	  if (!find_data_references_in_stmt (loop, stmt, &dr_vec))
> > +	    return false;
> > +	  data_reference_p dr;
> > +	  unsigned j;
> > +	  FOR_EACH_VEC_ELT (dr_vec, j, dr)
> > +	    {
> > +	      tree type = TREE_TYPE (DR_REF (dr));
> > +	      if (!ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (type)))
> > +		return false;
> > +	      if (DR_IS_READ (dr))
> > +		{
> > +		  if (load_dr != NULL)
> > +		    return false;
> > +		  load_dr = dr;
> > +		}
> > +	      else
> > +		{
> > +		  if (store_dr != NULL)
> > +		    return false;
> > +		  store_dr = dr;
> > +		}
> > +	    }
> > +	}
> > +    }
> > +
> > +  /* A limitation of the current implementation is that we require a reduction
> > +     statement which does not occur in cases like
> > +     extern int *p;
> > +     void foo (void) { for (; *p; ++p); } */
> > +  if (load_dr == NULL || reduction_stmt == NULL)
> > +    return false;
> > +
> > +  /* Note, reduction variables are guaranteed to be SSA names.  */
> > +  tree reduction_var;
> > +  switch (gimple_code (reduction_stmt))
> > +    {
> > +    case GIMPLE_PHI:
> > +      reduction_var = gimple_phi_result (reduction_stmt);
> > +      break;
> > +    case GIMPLE_ASSIGN:
> > +      reduction_var = gimple_assign_lhs (reduction_stmt);
> > +      break;
> > +    default:
> > +      /* Bail out e.g. for GIMPLE_CALL.  */
> > +      return false;
> > +    }
> > +  if (reduction_var == NULL)
> > +    return false;
> > +
> > +  /* Bail out if this is a bitfield memory reference.  */
> > +  if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF
> > +      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1)))
> > +    return false;
> > +
> > +  /* Data reference must be executed exactly once per iteration of each
> > +     loop in the loop nest.  We only need to check dominance information
> > +     against the outermost one in a perfect loop nest because a bb can't
> > +     dominate outermost loop's latch without dominating inner loop's.  */
> > +  basic_block load_bb = gimple_bb (DR_STMT (load_dr));
> > +  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, load_bb))
> > +    return false;
> > +
> > +  if (store_dr)
> > +    {
> > +      /* Bail out if this is a bitfield memory reference.  */
> > +      if (TREE_CODE (DR_REF (store_dr)) == COMPONENT_REF
> > +	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (store_dr), 1)))
> > +	return false;
> > +
> > +      /* Data reference must be executed exactly once per iteration of each
> > +	 loop in the loop nest.  We only need to check dominance information
> > +	 against the outermost one in a perfect loop nest because a bb can't
> > +	 dominate outermost loop's latch without dominating inner loop's.  */
> > +      basic_block store_bb = gimple_bb (DR_STMT (store_dr));
> > +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, store_bb))
> > +	return false;
> > +
> > +      /* Load and store must be in the same loop nest.  */
> > +      if (store_bb->loop_father != load_bb->loop_father)
> > +	return false;
> > +
> > +      edge e = single_exit (store_bb->loop_father);
> > +      if (!e)
> > +	return false;
> > +      bool load_dom = dominated_by_p (CDI_DOMINATORS, e->src, load_bb);
> > +      bool store_dom = dominated_by_p (CDI_DOMINATORS, e->src, store_bb);
> > +      if (load_dom != store_dom)
> > +	return false;
> > +    }
> > +
> > +  return transform_loop_1 (loop, load_dr, store_dr, reduction_var);
> > +}
> > +
> > +namespace {
> > +
> > +const pass_data pass_data_lpat =
> > +{
> > +  GIMPLE_PASS, /* type */
> > +  "lpat", /* name */
> > +  OPTGROUP_LOOP, /* optinfo_flags */
> > +  TV_LPAT, /* tv_id */
> > +  ( PROP_cfg | PROP_ssa ), /* properties_required */
> > +  0, /* properties_provided */
> > +  0, /* properties_destroyed */
> > +  0, /* todo_flags_start */
> > +  0, /* todo_flags_finish */
> > +};
> > +
> > +class pass_lpat : public gimple_opt_pass
> > +{
> > +public:
> > +  pass_lpat (gcc::context *ctxt)
> > +    : gimple_opt_pass (pass_data_lpat, ctxt)
> > +  {}
> > +
> > +  bool
> > +  gate (function *) OVERRIDE
> > +  {
> > +    return optimize != 0;
> > +  }
> > +
> > +  unsigned int
> > +  execute (function *f) OVERRIDE
> > +  {
> > +    loop_p loop;
> > +    auto_vec<loop_p> loops_to_be_destroyed;
> > +
> > +    FOR_EACH_LOOP_FN (f, loop, LI_ONLY_INNERMOST)
> > +      {
> > +	if (!single_exit (loop)
> > +	    || (!flag_tree_loop_distribute_patterns // TODO
> > +		&& !optimize_loop_for_speed_p (loop)))
> > +	continue;
> > +
> > +	if (transform_loop (loop))
> > +	  loops_to_be_destroyed.safe_push (loop);
> > +      }
> > +
> > +    if (loops_to_be_destroyed.length () > 0)
> > +      {
> > +	unsigned i;
> > +	FOR_EACH_VEC_ELT (loops_to_be_destroyed, i, loop)
> > +	  destroy_loop (loop);
> > +
> > +	scev_reset_htab ();
> > +	mark_virtual_operands_for_renaming (f);
> > +	rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
> > +
> > +	return TODO_cleanup_cfg;
> > +      }
> > +    else
> > +      return 0;
> > +  }
> > +}; // class pass_lpat
> > +
> > +} // anon namespace
> > +
> > +gimple_opt_pass *
> > +make_pass_lpat (gcc::context *ctxt)
> > +{
> > +  return new pass_lpat (ctxt);
> > +}
> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> > index 15693fee150..2d71a12039e 100644
> > --- a/gcc/tree-pass.h
> > +++ b/gcc/tree-pass.h
> > @@ -380,6 +380,7 @@ extern gimple_opt_pass *make_pass_graphite (gcc::context *ctxt);
> >  extern gimple_opt_pass *make_pass_graphite_transforms (gcc::context *ctxt);
> >  extern gimple_opt_pass *make_pass_if_conversion (gcc::context *ctxt);
> >  extern gimple_opt_pass *make_pass_if_to_switch (gcc::context *ctxt);
> > +extern gimple_opt_pass *make_pass_lpat (gcc::context *ctxt);
> >  extern gimple_opt_pass *make_pass_loop_distribution (gcc::context *ctxt);
> >  extern gimple_opt_pass *make_pass_vectorize (gcc::context *ctxt);
> >  extern gimple_opt_pass *make_pass_simduid_cleanup (gcc::context *ctxt);
>
Richard Biener May 5, 2021, 9:36 a.m. UTC | #10
On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
<stefansf@linux.ibm.com> wrote:
>
> [snip]
>
> Please find attached a new version of the patch.  A major change compared to
> the previous patch is that I created a separate pass which hopefully makes
> reviewing also easier since it is almost self-contained.  After realizing that
> detecting loops which mimic the behavior of rawmemchr/strlen functions does not
> really fit into the topic of loop distribution, I created a separate pass.

It's true that these reduction-like patterns are more difficult than
the existing
memcpy/memset cases.
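
For contrast, a minimal sketch of the two shapes (illustrative only; all
names and types are made up):

  /* memset-like: no scalar result is live after the loop.  */
  void zero_it (unsigned char *buf, unsigned long n)
  {
    for (unsigned long i = 0; i < n; ++i)
      buf[i] = 0;
  }

  /* rawmemchr-like: the final pointer is the loop's result, i.e. a
     reduction that is used after the loop.  */
  unsigned char *scan (unsigned char *p, unsigned char c)
  {
    while (*p != c)
      ++p;
    return p;
  }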

>  Due
> to this I was also able to play around a bit and schedule the pass at different
> times.  Currently it is scheduled right before loop distribution where loop
> header copying already took place which leads to the following effect.

In fact I'd schedule it after loop distribution so there's the chance that loop
distribution can expose a loop that fits the new pattern.

>  Running
> this setup over
>
> char *t (char *p)
> {
>   for (; *p; ++p);
>   return p;
> }
>
> the new pass transforms
>
> char * t (char * p)
> {
>   char _1;
>   char _7;
>
>   <bb 2> [local count: 118111600]:
>   _7 = *p_3(D);
>   if (_7 != 0)
>     goto <bb 5>; [89.00%]
>   else
>     goto <bb 7>; [11.00%]
>
>   <bb 5> [local count: 105119324]:
>
>   <bb 3> [local count: 955630225]:
>   # p_8 = PHI <p_6(6), p_3(D)(5)>
>   p_6 = p_8 + 1;
>   _1 = *p_6;
>   if (_1 != 0)
>     goto <bb 6>; [89.00%]
>   else
>     goto <bb 8>; [11.00%]
>
>   <bb 8> [local count: 105119324]:
>   # p_2 = PHI <p_6(3)>
>   goto <bb 4>; [100.00%]
>
>   <bb 6> [local count: 850510901]:
>   goto <bb 3>; [100.00%]
>
>   <bb 7> [local count: 12992276]:
>
>   <bb 4> [local count: 118111600]:
>   # p_9 = PHI <p_2(8), p_3(D)(7)>
>   return p_9;
>
> }
>
> into
>
> char * t (char * p)
> {
>   char * _5;
>   char _7;
>
>   <bb 2> [local count: 118111600]:
>   _7 = *p_3(D);
>   if (_7 != 0)
>     goto <bb 5>; [89.00%]
>   else
>     goto <bb 4>; [11.00%]
>
>   <bb 5> [local count: 105119324]:
>   _5 = p_3(D) + 1;
>   p_10 = .RAWMEMCHR (_5, 0);
>
>   <bb 4> [local count: 118111600]:
>   # p_9 = PHI <p_10(5), p_3(D)(2)>
>   return p_9;
>
> }
>
> which is fine so far.  However, I haven't made up my mind so far whether it is
> worthwhile to spend more time in order to also eliminate the "first unrolling"
> of the loop.

Might be a phiopt transform ;)  Might apply to quite some set of
builtins.  I wonder what the strlen case looks like though.
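(For the strlen shape from the pass comment -- for (i = 42; s[i]; ++i); --
the fully collapsed form would presumably be just
i = (int) strlen (&s[42]) + 42, since that expression also covers the
s[42] == 0 case.)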

> I gave it a shot by scheduling the pass prior to pass copy header
> and ended up with:
>
> char * t (char * p)
> {
>   <bb 2> [local count: 118111600]:
>   p_5 = .RAWMEMCHR (p_3(D), 0);
>   return p_5;
>
> }
>
> which seems optimal to me.  The downside of this is that I have to initialize
> scalar evolution analysis which might be undesired that early.
>
> All this brings me to the question where do you see this piece of code running?
> If in a separate pass when would you schedule it?  If in an existing pass,
> which one would you choose?

I think it still fits loop distribution.  If you manage to detect it
with your pass
standalone then you should be able to detect it in loop distribution.  Can you
explain what part is "easier" as a standalone pass?

> Another topic which came up is whether there exists a more elegant solution to
> my current implementation in order to deal with stores (I'm speaking of the `if
> (store_dr)` statement inside of function transform_loop_1).  For example,
>
> extern char *p;
> char *t ()
> {
>   for (; *p; ++p);
>   return p;
> }
>
> ends up as
>
> char * t ()
> {
>   char * _1;
>   char * _2;
>   char _3;
>   char * p.1_8;
>   char _9;
>   char * p.1_10;
>   char * p.1_11;
>
>   <bb 2> [local count: 118111600]:
>   p.1_8 = p;
>   _9 = *p.1_8;
>   if (_9 != 0)
>     goto <bb 5>; [89.00%]
>   else
>     goto <bb 7>; [11.00%]
>
>   <bb 5> [local count: 105119324]:
>
>   <bb 3> [local count: 955630225]:
>   # p.1_10 = PHI <_1(6), p.1_8(5)>
>   _1 = p.1_10 + 1;
>   p = _1;
>   _3 = *_1;
>   if (_3 != 0)
>     goto <bb 6>; [89.00%]
>   else
>     goto <bb 8>; [11.00%]
>
>   <bb 8> [local count: 105119324]:
>   # _2 = PHI <_1(3)>
>   goto <bb 4>; [100.00%]
>
>   <bb 6> [local count: 850510901]:
>   goto <bb 3>; [100.00%]
>
>   <bb 7> [local count: 12992276]:
>
>   <bb 4> [local count: 118111600]:
>   # p.1_11 = PHI <_2(8), p.1_8(7)>
>   return p.1_11;
>
> }
>
> where inside the loop a load and store occurs.  For a rawmemchr like loop I
> have to show that we never load from a memory location to which we write.
> Currently I solve this by hard coding those facts which are not generic at all.
> I gave compute_data_dependences_for_loop a try which failed to determine the
> fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> there are more generic solutions to express this in contrast to my current one?

So the example loop is not valid to be transformed to rawmemchr since it's
valid to call it with p = &p; - but sure, once you pass the first *p != 0 check
things become fishy but GCC isn't able to turn that into a non-dependence.
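
To illustrate the concern with a hedged sketch (not from the patch; names
made up): nothing prevents a caller from doing

  extern char *p;
  void evil (void) { p = (char *) &p; }

after which, during the first iterations, the store 'p = _1' modifies the
very bytes that the following load '*_1' reads.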

Why is the case of stores inside the loop important?  In fact, that you
think it is makes a case for integrating this with loop distribution,
since loop distribution
would be able to prove (if possible) that the stores can be separated into
a different loop.

And sorry for the delay in answering ...

Thanks,
Richard.

>
> Thanks again for your input so far.  Really appreciated.
>
> Cheers,
> Stefan
Richard Biener May 5, 2021, 10:03 a.m. UTC | #11
On Wed, May 5, 2021 at 11:36 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> <stefansf@linux.ibm.com> wrote:
> >
> > [snip]
> >
> > Please find attached a new version of the patch.  A major change compared to
> > the previous patch is that I created a separate pass which hopefully makes
> > reviewing also easier since it is almost self-contained.  After realizing that
> > detecting loops which mimic the behavior of rawmemchr/strlen functions does not
> > really fit into the topic of loop distribution, I created a separate pass.
>
> It's true that these reduction-like patterns are more difficult than
> the existing
> memcpy/memset cases.
>
> >  Due
> > to this I was also able to play around a bit and schedule the pass at different
> > times.  Currently it is scheduled right before loop distribution where loop
> > header copying already took place which leads to the following effect.
>
> In fact I'd schedule it after loop distribution so there's the chance that loop
> distribution can expose a loop that fits the new pattern.
>
> >  Running
> > this setup over
> >
> > char *t (char *p)
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > the new pass transforms
> >
> > char * t (char * p)
> > {
> >   char _1;
> >   char _7;
> >
> >   <bb 2> [local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 7>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >
> >   <bb 3> [local count: 955630225]:
> >   # p_8 = PHI <p_6(6), p_3(D)(5)>
> >   p_6 = p_8 + 1;
> >   _1 = *p_6;
> >   if (_1 != 0)
> >     goto <bb 6>; [89.00%]
> >   else
> >     goto <bb 8>; [11.00%]
> >
> >   <bb 8> [local count: 105119324]:
> >   # p_2 = PHI <p_6(3)>
> >   goto <bb 4>; [100.00%]
> >
> >   <bb 6> [local count: 850510901]:
> >   goto <bb 3>; [100.00%]
> >
> >   <bb 7> [local count: 12992276]:
> >
> >   <bb 4> [local count: 118111600]:
> >   # p_9 = PHI <p_2(8), p_3(D)(7)>
> >   return p_9;
> >
> > }
> >
> > into
> >
> > char * t (char * p)
> > {
> >   char * _5;
> >   char _7;
> >
> >   <bb 2> [local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 4>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >   _5 = p_3(D) + 1;
> >   p_10 = .RAWMEMCHR (_5, 0);
> >
> >   <bb 4> [local count: 118111600]:
> >   # p_9 = PHI <p_10(5), p_3(D)(2)>
> >   return p_9;
> >
> > }
> >
> > which is fine so far.  However, I haven't made up my mind so far whether it is
> > worthwhile to spend more time in order to also eliminate the "first unrolling"
> > of the loop.
>
> Might be a phiopt transform ;)  Might apply to quite some set of
> builtins.  I wonder what the strlen case looks like though.
>
> > I gave it a shot by scheduling the pass prior to pass copy header
> > and ended up with:
> >
> > char * t (char * p)
> > {
> >   <bb 2> [local count: 118111600]:
> >   p_5 = .RAWMEMCHR (p_3(D), 0);
> >   return p_5;
> >
> > }
> >
> > which seems optimal to me.  The downside of this is that I have to initialize
> > scalar evolution analysis which might be undesired that early.
> >
> > All this brings me to the question where do you see this piece of code running?
> > If in a separate pass when would you schedule it?  If in an existing pass,
> > which one would you choose?
>
> I think it still fits loop distribution.  If you manage to detect it
> with your pass
> standalone then you should be able to detect it in loop distribution.  Can you
> explain what part is "easier" as a standalone pass?

Btw, another "fitting" pass would be final value replacement (pass_scev_cprop)
since what these patterns provide is a builtin call to compute the value of one
of the loop PHIs on exit.  Note this pass leaves removal of in-loop computations
to followup DCE which means that in some cases it does unprofitable transforms.
There's a bug somewhere where I worked on doing final value replacement
on-demand when DCE figures out the loop is otherwise dead but I never finished
this (loop distribution could also use such a mechanism to get rid of
unwanted PHIs).
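
In C terms the rawmemchr case above would then amount to something like the
following sketch (rawmemchr_like and use are made-up stand-ins for the
.RAWMEMCHR call and for an arbitrary consumer of the exit value):

  extern char *rawmemchr_like (char *, int);  /* made up */
  extern void use (char *);                   /* made up */

  /* before: the exit value of q is only available by running the loop */
  char *q = p;
  while (*q != 0)
    ++q;
  use (q);

  /* after final value replacement: the use after the loop is rewritten
     to the closed form; the now-dead loop is left to a followup DCE  */
  char *q = p;
  while (*q != 0)
    ++q;
  use (rawmemchr_like (p, 0));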

> > Another topic which came up is whether there exists a more elegant solution to
> > my current implementation in order to deal with stores (I'm speaking of the `if
> > (store_dr)` statement inside of function transform_loop_1).  For example,
> >
> > extern char *p;
> > char *t ()
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > ends up as
> >
> > char * t ()
> > {
> >   char * _1;
> >   char * _2;
> >   char _3;
> >   char * p.1_8;
> >   char _9;
> >   char * p.1_10;
> >   char * p.1_11;
> >
> >   <bb 2> [local count: 118111600]:
> >   p.1_8 = p;
> >   _9 = *p.1_8;
> >   if (_9 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 7>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >
> >   <bb 3> [local count: 955630225]:
> >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> >   _1 = p.1_10 + 1;
> >   p = _1;
> >   _3 = *_1;
> >   if (_3 != 0)
> >     goto <bb 6>; [89.00%]
> >   else
> >     goto <bb 8>; [11.00%]
> >
> >   <bb 8> [local count: 105119324]:
> >   # _2 = PHI <_1(3)>
> >   goto <bb 4>; [100.00%]
> >
> >   <bb 6> [local count: 850510901]:
> >   goto <bb 3>; [100.00%]
> >
> >   <bb 7> [local count: 12992276]:
> >
> >   <bb 4> [local count: 118111600]:
> >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> >   return p.1_11;
> >
> > }
> >
> > where inside the loop a load and store occurs.  For a rawmemchr like loop I
> > have to show that we never load from a memory location to which we write.
> > Currently I solve this by hard coding those facts which are not generic at all.
> > I gave compute_data_dependences_for_loop a try which failed to determine the
> > fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> > there are more generic solutions to express this in contrast to my current one?
>
> So the example loop is not valid to be transformed to rawmemchr since it's
> valid to call it with p = &p; - but sure, once you pass the first *p != 0 check
> things become fishy but GCC isn't able to turn that into a non-dependence.
>
> Why is the case of stores inside the loop important?  In fact, that you
> think it is makes a case for integrating this with loop distribution,
> since loop distribution
> would be able to prove (if possible) that the stores can be separated into
> a different loop.
>
> And sorry for the delay in answering ...
>
> Thanks,
> Richard.
>
> >
> > Thanks again for your input so far.  Really appreciated.
> >
> > Cheers,
> > Stefan
Stefan Schulze Frielinghaus May 7, 2021, 12:32 p.m. UTC | #12
On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> <stefansf@linux.ibm.com> wrote:
> >
> > [snip]
> >
> > Please find attached a new version of the patch.  A major change compared to
> > the previous patch is that I created a separate pass which hopefully makes
> > reviewing also easier since it is almost self-contained.  After realizing that
> > detecting loops which mimic the behavior of rawmemchr/strlen functions does not
> > really fit into the topic of loop distribution, I created a separate pass.
> 
> It's true that these reduction-like patterns are more difficult than
> the existing
> memcpy/memset cases.
> 
> >  Due
> > to this I was also able to play around a bit and schedule the pass at different
> > times.  Currently it is scheduled right before loop distribution where loop
> > header copying already took place which leads to the following effect.
> 
> In fact I'd schedule it after loop distribution so there's the chance that loop
> distribution can expose a loop that fits the new pattern.
> 
> >  Running
> > this setup over
> >
> > char *t (char *p)
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > the new pass transforms
> >
> > char * t (char * p)
> > {
> >   char _1;
> >   char _7;
> >
> >   <bb 2> [local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 7>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >
> >   <bb 3> [local count: 955630225]:
> >   # p_8 = PHI <p_6(6), p_3(D)(5)>
> >   p_6 = p_8 + 1;
> >   _1 = *p_6;
> >   if (_1 != 0)
> >     goto <bb 6>; [89.00%]
> >   else
> >     goto <bb 8>; [11.00%]
> >
> >   <bb 8> [local count: 105119324]:
> >   # p_2 = PHI <p_6(3)>
> >   goto <bb 4>; [100.00%]
> >
> >   <bb 6> [local count: 850510901]:
> >   goto <bb 3>; [100.00%]
> >
> >   <bb 7> [local count: 12992276]:
> >
> >   <bb 4> [local count: 118111600]:
> >   # p_9 = PHI <p_2(8), p_3(D)(7)>
> >   return p_9;
> >
> > }
> >
> > into
> >
> > char * t (char * p)
> > {
> >   char * _5;
> >   char _7;
> >
> >   <bb 2> [local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 4>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >   _5 = p_3(D) + 1;
> >   p_10 = .RAWMEMCHR (_5, 0);
> >
> >   <bb 4> [local count: 118111600]:
> >   # p_9 = PHI <p_10(5), p_3(D)(2)>
> >   return p_9;
> >
> > }
> >
> > which is fine so far.  However, I haven't made up my mind yet whether it is
> > worthwhile to spend more time in order to also eliminate the "first unrolling"
> > of the loop.
> 
> Might be a phiopt transform ;)  Might apply to quite some set of
> builtins.  I wonder what the strlen case looks like though.
> 
> > I gave it a shot by scheduling the pass prior to the copy header pass
> > and ended up with:
> >
> > char * t (char * p)
> > {
> >   <bb 2> [local count: 118111600]:
> >   p_5 = .RAWMEMCHR (p_3(D), 0);
> >   return p_5;
> >
> > }
> >
> > which seems optimal to me.  The downside of this is that I have to initialize
> > scalar evolution analysis which might be undesired that early.
> >
> > All this brings me to the question: where do you see this piece of code running?
> > If in a separate pass, when would you schedule it?  If in an existing pass,
> > which one would you choose?
> 
> I think it still fits loop distribution.  If you manage to detect it
> with your pass
> standalone then you should be able to detect it in loop distribution.

If a loop is distributed only because one of the partitions matches a
rawmemchr/strlen-like loop pattern, then we have at least two partitions
which walk over the same memory region.  Since a rawmemchr/strlen-like
loop has no body (neglecting expression-3 of a for-loop where just an
increment happens) it is governed by the memory accesses in the loop
condition.  Therefore, in such a case loop distribution would result in
performance degradation.  This is why I think that it does not fit
conceptually into ldist pass.  However, since I make use of a couple of
helper functions from ldist pass, it may still fit technically.
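
As a hypothetical example (not taken from the patch or its testcases),
distributing

unsigned f (const char *s, unsigned *len)
{
  unsigned n = 0, i = 0;
  for (; s[i]; ++i)
    n += s[i] == ',';
  *len = i;
  return n;
}

only in order to expose the strlen-like computation of i would leave a
second partition which counts the commas and therefore traverses exactly
the same string again.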

Since currently all ldist optimizations operate over loops where niters
is known and for rawmemchr/strlen-like loops this is not the case, it is
not possible that those optimizations expose a loop which is suitable
for rawmemchr/strlen optimization.  Therefore, what do you think about
scheduling rawmemchr/strlen optimization right between those
if-statements of function loop_distribution::execute?

   if (nb_generated_loops + nb_generated_calls > 0)
     {
       changed = true;
       if (dump_enabled_p ())
         dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
                          loc, "Loop%s %d distributed: split to %d loops "
                          "and %d library calls.\n", str, loop->num,
                          nb_generated_loops, nb_generated_calls);

       break;
     }

   // rawmemchr/strlen like loops

   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num);

> Can you
> explain what part is "easier" as a standalone pass?

Yea that term is rather misleading.  It was probably easier for me to
understand the underlying problem and to play around with my code.
There are no technical reasons for a standalone pass.

Cheers,
Stefan

> 
> > Another topic which came up is whether there exists a more elegant solution to
> > my current implementation in order to deal with stores (I'm speaking of the `if
> > (store_dr)` statement inside of function transform_loop_1).  For example,
> >
> > extern char *p;
> > char *t ()
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > ends up as
> >
> > char * t ()
> > {
> >   char * _1;
> >   char * _2;
> >   char _3;
> >   char * p.1_8;
> >   char _9;
> >   char * p.1_10;
> >   char * p.1_11;
> >
> >   <bb 2> [local count: 118111600]:
> >   p.1_8 = p;
> >   _9 = *p.1_8;
> >   if (_9 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 7>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >
> >   <bb 3> [local count: 955630225]:
> >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> >   _1 = p.1_10 + 1;
> >   p = _1;
> >   _3 = *_1;
> >   if (_3 != 0)
> >     goto <bb 6>; [89.00%]
> >   else
> >     goto <bb 8>; [11.00%]
> >
> >   <bb 8> [local count: 105119324]:
> >   # _2 = PHI <_1(3)>
> >   goto <bb 4>; [100.00%]
> >
> >   <bb 6> [local count: 850510901]:
> >   goto <bb 3>; [100.00%]
> >
> >   <bb 7> [local count: 12992276]:
> >
> >   <bb 4> [local count: 118111600]:
> >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> >   return p.1_11;
> >
> > }
> >
> > where inside the loop a load and store occurs.  For a rawmemchr like loop I
> > have to show that we never load from a memory location to which we write.
> > Currently I solve this by hard coding those facts which are not generic at all.
> > I gave compute_data_dependences_for_loop a try which failed to determine the
> > fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> > there are more generic solutions to express this in contrast to my current one?
> 
> So the example loop is not valid to be transformed to rawmemchr since it's
> valid to call it with p = &p; - but sure, once you pass the first *p != 0 check
> things become fishy but GCC isn't able to turn that into a non-dependence.
> 
> Why is the case of stores inside the loop important?  The fact that you
> think it is
> makes a case for integrating this with loop distribution since loop distribution
> would be able to prove (if possible) that the stores can be separated into
> a different loop.
> 
> And sorry for the delay in answering ...
> 
> Thanks,
> Richard.
> 
> >
> > Thanks again for your input so far.  Really appreciated.
> >
> > Cheers,
> > Stefan
Richard Biener May 20, 2021, 9:24 a.m. UTC | #13
On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus
<stefansf@linux.ibm.com> wrote:
>
> On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> > <stefansf@linux.ibm.com> wrote:
> > >
> > > [snip]
> > >
> > > Please find attached a new version of the patch.  A major change compared to
> > > the previous patch is that I created a separate pass which hopefully makes
> > > reviewing also easier since it is almost self-contained.  After realizing that
> > > detecting loops which mimic the behavior of rawmemchr/strlen functions does not
> > > really fit into the topic of loop distribution, I created a separate pass.
> >
> > It's true that these reduction-like patterns are more difficult than
> > the existing
> > memcpy/memset cases.
> >
> > >  Due
> > > to this I was also able to play around a bit and schedule the pass at different
> > > times.  Currently it is scheduled right before loop distribution where loop
> > > header copying already took place which leads to the following effect.
> >
> > In fact I'd schedule it after loop distribution so there's the chance that loop
> > distribution can expose a loop that fits the new pattern.
> >
> > >  Running
> > > this setup over
> > >
> > > char *t (char *p)
> > > {
> > >   for (; *p; ++p);
> > >   return p;
> > > }
> > >
> > > the new pass transforms
> > >
> > > char * t (char * p)
> > > {
> > >   char _1;
> > >   char _7;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   _7 = *p_3(D);
> > >   if (_7 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 7>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >
> > >   <bb 3> [local count: 955630225]:
> > >   # p_8 = PHI <p_6(6), p_3(D)(5)>
> > >   p_6 = p_8 + 1;
> > >   _1 = *p_6;
> > >   if (_1 != 0)
> > >     goto <bb 6>; [89.00%]
> > >   else
> > >     goto <bb 8>; [11.00%]
> > >
> > >   <bb 8> [local count: 105119324]:
> > >   # p_2 = PHI <p_6(3)>
> > >   goto <bb 4>; [100.00%]
> > >
> > >   <bb 6> [local count: 850510901]:
> > >   goto <bb 3>; [100.00%]
> > >
> > >   <bb 7> [local count: 12992276]:
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p_9 = PHI <p_2(8), p_3(D)(7)>
> > >   return p_9;
> > >
> > > }
> > >
> > > into
> > >
> > > char * t (char * p)
> > > {
> > >   char * _5;
> > >   char _7;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   _7 = *p_3(D);
> > >   if (_7 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 4>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >   _5 = p_3(D) + 1;
> > >   p_10 = .RAWMEMCHR (_5, 0);
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p_9 = PHI <p_10(5), p_3(D)(2)>
> > >   return p_9;
> > >
> > > }
> > >
> > > which is fine so far.  However, I haven't made up my mind yet whether it is
> > > worthwhile to spend more time in order to also eliminate the "first unrolling"
> > > of the loop.
> >
> > Might be a phiopt transform ;)  Might apply to quite some set of
> > builtins.  I wonder what the strlen case looks like though.
> >
> > > I gave it a shot by scheduling the pass prior to the copy header pass
> > > and ended up with:
> > >
> > > char * t (char * p)
> > > {
> > >   <bb 2> [local count: 118111600]:
> > >   p_5 = .RAWMEMCHR (p_3(D), 0);
> > >   return p_5;
> > >
> > > }
> > >
> > > which seems optimal to me.  The downside of this is that I have to initialize
> > > scalar evolution analysis which might be undesired that early.
> > >
> > > All this brings me to the question: where do you see this piece of code running?
> > > If in a separate pass, when would you schedule it?  If in an existing pass,
> > > which one would you choose?
> >
> > I think it still fits loop distribution.  If you manage to detect it
> > with your pass
> > standalone then you should be able to detect it in loop distribution.
>
> If a loop is distributed only because one of the partitions matches a
> rawmemchr/strlen-like loop pattern, then we have at least two partitions
> which walk over the same memory region.  Since a rawmemchr/strlen-like
> loop has no body (neglecting expression-3 of a for-loop where just an
> increment happens) it is governed by the memory accesses in the loop
> condition.  Therefore, in such a case loop distribution would result in
> performance degradation.  This is why I think that it does not fit
> conceptually into ldist pass.  However, since I make use of a couple of
> helper functions from ldist pass, it may still fit technically.
>
> Since currently all ldist optimizations operate over loops where niters
> is known and for rawmemchr/strlen-like loops this is not the case, it is
> not possible that those optimizations expose a loop which is suitable
> for rawmemchr/strlen optimization.

True - though that seems to be an unnecessary restriction.

>  Therefore, what do you think about
> scheduling rawmemchr/strlen optimization right between those
> if-statements of function loop_distribution::execute?
>
>    if (nb_generated_loops + nb_generated_calls > 0)
>      {
>        changed = true;
>        if (dump_enabled_p ())
>          dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>                           loc, "Loop%s %d distributed: split to %d loops "
>                           "and %d library calls.\n", str, loop->num,
>                           nb_generated_loops, nb_generated_calls);
>
>        break;
>      }
>
>    // rawmemchr/strlen like loops
>
>    if (dump_file && (dump_flags & TDF_DETAILS))
>      fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num);

but we won't ever arrive here because of the niters condition.  But
yes, doing the pattern matching in the innermost loop processing code
looks good to me - for the specific case it would be

      /* Don't distribute loop if niters is unknown.  */
      tree niters = number_of_latch_executions (loop);
      if (niters == NULL_TREE || niters == chrec_dont_know)
---> here?
        continue;

> > Can you
> > explain what part is "easier" as a standalone pass?
>
> Yea that term is rather misleading.  It was probably easier for me to
> understand the underlying problem and to play around with my code.
> There are no technical reasons for a standalone pass.

And sorry for the late response...

Richard.

> Cheers,
> Stefan
>
> >
> > > Another topic which came up is whether there exists a more elegant solution to
> > > my current implementation in order to deal with stores (I'm speaking of the `if
> > > (store_dr)` statement inside of function transform_loop_1).  For example,
> > >
> > > extern char *p;
> > > char *t ()
> > > {
> > >   for (; *p; ++p);
> > >   return p;
> > > }
> > >
> > > ends up as
> > >
> > > char * t ()
> > > {
> > >   char * _1;
> > >   char * _2;
> > >   char _3;
> > >   char * p.1_8;
> > >   char _9;
> > >   char * p.1_10;
> > >   char * p.1_11;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   p.1_8 = p;
> > >   _9 = *p.1_8;
> > >   if (_9 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 7>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >
> > >   <bb 3> [local count: 955630225]:
> > >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> > >   _1 = p.1_10 + 1;
> > >   p = _1;
> > >   _3 = *_1;
> > >   if (_3 != 0)
> > >     goto <bb 6>; [89.00%]
> > >   else
> > >     goto <bb 8>; [11.00%]
> > >
> > >   <bb 8> [local count: 105119324]:
> > >   # _2 = PHI <_1(3)>
> > >   goto <bb 4>; [100.00%]
> > >
> > >   <bb 6> [local count: 850510901]:
> > >   goto <bb 3>; [100.00%]
> > >
> > >   <bb 7> [local count: 12992276]:
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> > >   return p.1_11;
> > >
> > > }
> > >
> > > where inside the loop a load and store occurs.  For a rawmemchr like loop I
> > > have to show that we never load from a memory location to which we write.
> > > Currently I solve this by hard coding those facts which are not generic at all.
> > > I gave compute_data_dependences_for_loop a try which failed to determine the
> > > fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> > > there are more generic solutions to express this in contrast to my current one?
> >
> > So the example loop is not valid to be transformed to rawmemchr since it's
> > valid to call it with p = &p; - but sure, once you pass the first *p != 0 check
> > things become fishy but GCC isn't able to turn that into a non-dependence.
> >
> > Why is the case of stores inside the loop important?  The fact that you
> > think it is
> > makes a case for integrating this with loop distribution since loop distribution
> > would be able to prove (if possible) that the stores can be separated into
> > a different loop.
> >
> > And sorry for the delay in answering ...
> >
> > Thanks,
> > Richard.
> >
> > >
> > > Thanks again for your input so far.  Really appreciated.
> > >
> > > Cheers,
> > > Stefan
Stefan Schulze Frielinghaus May 20, 2021, 6:37 p.m. UTC | #14
On Thu, May 20, 2021 at 11:24:57AM +0200, Richard Biener wrote:
> On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus
> <stefansf@linux.ibm.com> wrote:
> >
> > On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> > > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> > > <stefansf@linux.ibm.com> wrote:
> > > >
> > > > [snip]
> > > >
> > > > Please find attached a new version of the patch.  A major change compared to
> > > > the previous patch is that I created a separate pass which hopefully makes
> > > > reviewing also easier since it is almost self-contained.  After realizing that
> > > > detecting loops which mimic the behavior of rawmemchr/strlen functions does not
> > > > really fit into the topic of loop distribution, I created a separate pass.
> > >
> > > It's true that these reduction-like patterns are more difficult than
> > > the existing
> > > memcpy/memset cases.
> > >
> > > >  Due
> > > > to this I was also able to play around a bit and schedule the pass at different
> > > > times.  Currently it is scheduled right before loop distribution where loop
> > > > header copying already took place which leads to the following effect.
> > >
> > > In fact I'd schedule it after loop distribution so there's the chance that loop
> > > distribution can expose a loop that fits the new pattern.
> > >
> > > >  Running
> > > > this setup over
> > > >
> > > > char *t (char *p)
> > > > {
> > > >   for (; *p; ++p);
> > > >   return p;
> > > > }
> > > >
> > > > the new pass transforms
> > > >
> > > > char * t (char * p)
> > > > {
> > > >   char _1;
> > > >   char _7;
> > > >
> > > >   <bb 2> [local count: 118111600]:
> > > >   _7 = *p_3(D);
> > > >   if (_7 != 0)
> > > >     goto <bb 5>; [89.00%]
> > > >   else
> > > >     goto <bb 7>; [11.00%]
> > > >
> > > >   <bb 5> [local count: 105119324]:
> > > >
> > > >   <bb 3> [local count: 955630225]:
> > > >   # p_8 = PHI <p_6(6), p_3(D)(5)>
> > > >   p_6 = p_8 + 1;
> > > >   _1 = *p_6;
> > > >   if (_1 != 0)
> > > >     goto <bb 6>; [89.00%]
> > > >   else
> > > >     goto <bb 8>; [11.00%]
> > > >
> > > >   <bb 8> [local count: 105119324]:
> > > >   # p_2 = PHI <p_6(3)>
> > > >   goto <bb 4>; [100.00%]
> > > >
> > > >   <bb 6> [local count: 850510901]:
> > > >   goto <bb 3>; [100.00%]
> > > >
> > > >   <bb 7> [local count: 12992276]:
> > > >
> > > >   <bb 4> [local count: 118111600]:
> > > >   # p_9 = PHI <p_2(8), p_3(D)(7)>
> > > >   return p_9;
> > > >
> > > > }
> > > >
> > > > into
> > > >
> > > > char * t (char * p)
> > > > {
> > > >   char * _5;
> > > >   char _7;
> > > >
> > > >   <bb 2> [local count: 118111600]:
> > > >   _7 = *p_3(D);
> > > >   if (_7 != 0)
> > > >     goto <bb 5>; [89.00%]
> > > >   else
> > > >     goto <bb 4>; [11.00%]
> > > >
> > > >   <bb 5> [local count: 105119324]:
> > > >   _5 = p_3(D) + 1;
> > > >   p_10 = .RAWMEMCHR (_5, 0);
> > > >
> > > >   <bb 4> [local count: 118111600]:
> > > >   # p_9 = PHI <p_10(5), p_3(D)(2)>
> > > >   return p_9;
> > > >
> > > > }
> > > >
> > > > which is fine so far.  However, I haven't made up my mind yet whether it is
> > > > worthwhile to spend more time in order to also eliminate the "first unrolling"
> > > > of the loop.
> > >
> > > Might be a phiopt transform ;)  Might apply to quite some set of
> > > builtins.  I wonder what the strlen case looks like though.
> > >
> > > > I gave it a shot by scheduling the pass prior to the copy header pass
> > > > and ended up with:
> > > >
> > > > char * t (char * p)
> > > > {
> > > >   <bb 2> [local count: 118111600]:
> > > >   p_5 = .RAWMEMCHR (p_3(D), 0);
> > > >   return p_5;
> > > >
> > > > }
> > > >
> > > > which seems optimal to me.  The downside of this is that I have to initialize
> > > > scalar evolution analysis which might be undesired that early.
> > > >
> > > > All this brings me to the question: where do you see this piece of code running?
> > > > If in a separate pass, when would you schedule it?  If in an existing pass,
> > > > which one would you choose?
> > >
> > > I think it still fits loop distribution.  If you manage to detect it
> > > with your pass
> > > standalone then you should be able to detect it in loop distribution.
> >
> > If a loop is distributed only because one of the partitions matches a
> > rawmemchr/strlen-like loop pattern, then we have at least two partitions
> > which walk over the same memory region.  Since a rawmemchr/strlen-like
> > loop has no body (neglecting expression-3 of a for-loop where just an
> > increment happens) it is governed by the memory accesses in the loop
> > condition.  Therefore, in such a case loop distribution would result in
> > performance degradation.  This is why I think that it does not fit
> > conceptually into ldist pass.  However, since I make use of a couple of
> > helper functions from ldist pass, it may still fit technically.
> >
> > Since currently all ldist optimizations operate over loops where niters
> > is known and for rawmemchr/strlen-like loops this is not the case, it is
> > not possible that those optimizations expose a loop which is suitable
> > for rawmemchr/strlen optimization.
> 
> True - though that seems to be an unnecessary restriction.
> 
> >  Therefore, what do you think about
> > scheduling rawmemchr/strlen optimization right between those
> > if-statements of function loop_distribution::execute?
> >
> >    if (nb_generated_loops + nb_generated_calls > 0)
> >      {
> >        changed = true;
> >        if (dump_enabled_p ())
> >          dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >                           loc, "Loop%s %d distributed: split to %d loops "
> >                           "and %d library calls.\n", str, loop->num,
> >                           nb_generated_loops, nb_generated_calls);
> >
> >        break;
> >      }
> >
> >    // rawmemchr/strlen like loops
> >
> >    if (dump_file && (dump_flags & TDF_DETAILS))
> >      fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num);
> 
> but we won't ever arrive here because of the niters condition.  But
> yes, doing the pattern matching in the innermost loop processing code
> looks good to me - for the specific case it would be
> 
>       /* Don't distribute loop if niters is unknown.  */
>       tree niters = number_of_latch_executions (loop);
>       if (niters == NULL_TREE || niters == chrec_dont_know)
> ---> here?
>         continue;

Right, please find attached a new version of the patch where everything
is included in the loop distribution pass.  I will do a bootstrap and
regtest on IBM Z overnight.  If you give me the green light I will also do
the same on x86_64.

> 
> > > Can you
> > > explain what part is "easier" as a standalone pass?
> >
> > Yea that term is rather misleading.  It was probably easier for me to
> > understand the underlying problem and to play around with my code.
> > There are no technical reasons for a standalone pass.
> 
> And sorry for the late response...
> 
> Richard.
> 
> > Cheers,
> > Stefan
> >
> > >
> > > > Another topic which came up is whether there exists a more elegant solution to
> > > > my current implementation in order to deal with stores (I'm speaking of the `if
> > > > (store_dr)` statement inside of function transform_loop_1).  For example,
> > > >
> > > > extern char *p;
> > > > char *t ()
> > > > {
> > > >   for (; *p; ++p);
> > > >   return p;
> > > > }
> > > >
> > > > ends up as
> > > >
> > > > char * t ()
> > > > {
> > > >   char * _1;
> > > >   char * _2;
> > > >   char _3;
> > > >   char * p.1_8;
> > > >   char _9;
> > > >   char * p.1_10;
> > > >   char * p.1_11;
> > > >
> > > >   <bb 2> [local count: 118111600]:
> > > >   p.1_8 = p;
> > > >   _9 = *p.1_8;
> > > >   if (_9 != 0)
> > > >     goto <bb 5>; [89.00%]
> > > >   else
> > > >     goto <bb 7>; [11.00%]
> > > >
> > > >   <bb 5> [local count: 105119324]:
> > > >
> > > >   <bb 3> [local count: 955630225]:
> > > >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> > > >   _1 = p.1_10 + 1;
> > > >   p = _1;
> > > >   _3 = *_1;
> > > >   if (_3 != 0)
> > > >     goto <bb 6>; [89.00%]
> > > >   else
> > > >     goto <bb 8>; [11.00%]
> > > >
> > > >   <bb 8> [local count: 105119324]:
> > > >   # _2 = PHI <_1(3)>
> > > >   goto <bb 4>; [100.00%]
> > > >
> > > >   <bb 6> [local count: 850510901]:
> > > >   goto <bb 3>; [100.00%]
> > > >
> > > >   <bb 7> [local count: 12992276]:
> > > >
> > > >   <bb 4> [local count: 118111600]:
> > > >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> > > >   return p.1_11;
> > > >
> > > > }
> > > >
> > > > where inside the loop a load and store occurs.  For a rawmemchr like loop I
> > > > have to show that we never load from a memory location to which we write.
> > > > Currently I solve this by hard coding those facts which are not generic at all.
> > > > I gave compute_data_dependences_for_loop a try which failed to determine the
> > > > fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> > > > there are more generic solutions to express this in contrast to my current one?
> > >
> > > So the example loop is not valid to be transformed to rawmemchr since it's
> > > valid to call it with p = &p; - but sure, once you pass the first *p != 0 check
> > > things become fishy but GCC isn't able to turn that into a non-dependence.
> > >
> > > Why is the case of stores inside the loop important?  The fact that you
> > > think it is
> > > makes a case for integrating this with loop distribution since loop distribution
> > > would be able to prove (if possible) that the stores can be separated into
> > > a different loop.
> > >
> > > And sorry for the delay in answering ...
> > >
> > > Thanks,
> > > Richard.
> > >
> > > >
> > > > Thanks again for your input so far.  Really appreciated.
> > > >
> > > > Cheers,
> > > > Stefan
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index d209a52f823..b39a0172f01 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2929,6 +2929,33 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+  tree lhs_type = TREE_TYPE (lhs);
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
+
+  for (unsigned int i = 0; i < 2; ++i)
+    {
+      tree rhs = gimple_call_arg (stmt, i);
+      tree rhs_type = TREE_TYPE (rhs);
+      rtx rhs_rtx = expand_normal (rhs);
+      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
+    }
+
+  insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
+
+  expand_insn (icode, 3, ops);
+  if (!rtx_equal_p (lhs_rtx, ops[0].value))
+    emit_move_insn (lhs_rtx, ops[0].value);
+}
+
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index daeace7a34e..95c76795648 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b192a9d070b..f7c69f914ce 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
 OPTAB_D (movmem_optab, "movmem$a")
 OPTAB_D (setmem_optab, "setmem$a")
 OPTAB_D (strlen_optab, "strlen$a")
+OPTAB_D (rawmemchr_optab, "rawmemchr$I$a")
 
 OPTAB_DC(fma_optab, "fma$a4", FMA)
 OPTAB_D (fms_optab, "fms$a4")
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
new file mode 100644
index 00000000000..e998dd16b29
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
@@ -0,0 +1,72 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt but no store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, pattern)   \
+__attribute__((noinline))  \
+T *test_##T (T *p)         \
+{                          \
+  while (*p != (T)pattern) \
+    ++p;                   \
+  return p;                \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  0xab)
+test (int16_t, 0xabcd)
+test (int32_t, 0xabcdef15)
+
+#define run(T, pattern, i)      \
+{                               \
+T *q = p;                       \
+q[i] = (T)pattern;              \
+assert (test_##T (p) == &q[i]); \
+q[i] = 0;                       \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0, 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, 0xab, 0);
+  run (int8_t, 0xab, 1);
+  run (int8_t, 0xab, 13);
+
+  run (int16_t, 0xabcd, 0);
+  run (int16_t, 0xabcd, 1);
+  run (int16_t, 0xabcd, 13);
+
+  run (int32_t, 0xabcdef15, 0);
+  run (int32_t, 0xabcdef15, 1);
+  run (int32_t, 0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
new file mode 100644
index 00000000000..046450ea7e8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
@@ -0,0 +1,83 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrqi" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrhi" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrsi" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt and store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+uint8_t *p_uint8_t;
+uint16_t *p_uint16_t;
+uint32_t *p_uint32_t;
+
+int8_t *p_int8_t;
+int16_t *p_int16_t;
+int32_t *p_int32_t;
+
+#define test(T, pattern)    \
+__attribute__((noinline))   \
+T *test_##T (void)          \
+{                           \
+  while (*p_##T != pattern) \
+    ++p_##T;                \
+  return p_##T;             \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  (int8_t)0xab)
+test (int16_t, (int16_t)0xabcd)
+test (int32_t, (int32_t)0xabcdef15)
+
+#define run(T, pattern, i) \
+{                          \
+T *q = p;                  \
+q[i] = pattern;            \
+p_##T = p;                 \
+T *r = test_##T ();        \
+assert (r == p_##T);       \
+assert (r == &q[i]);       \
+q[i] = 0;                  \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, '\0', 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, (int8_t)0xab, 0);
+  run (int8_t, (int8_t)0xab, 1);
+  run (int8_t, (int8_t)0xab, 13);
+
+  run (int16_t, (int16_t)0xabcd, 0);
+  run (int16_t, (int16_t)0xabcd, 1);
+  run (int16_t, (int16_t)0xabcd, 13);
+
+  run (int32_t, (int32_t)0xabcdef15, 0);
+  run (int32_t, (int32_t)0xabcdef15, 1);
+  run (int32_t, (int32_t)0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
new file mode 100644
index 00000000000..c88d1db0a93
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
@@ -0,0 +1,100 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlen\n" 4 "ldist" } } */
+/* { dg-final { scan-tree-dump-times "generated strlen using rawmemchrhi\n" 4 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated strlen using rawmemchrsi\n" 4 "ldist" { target s390x-*-* } } } */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, U)        \
+__attribute__((noinline)) \
+U test_##T##U (T *s)      \
+{                         \
+  U i;                    \
+  for (i=0; s[i]; ++i);   \
+  return i;               \
+}
+
+test (uint8_t,  size_t)
+test (uint16_t, size_t)
+test (uint32_t, size_t)
+test (uint8_t,  int)
+test (uint16_t, int)
+test (uint32_t, int)
+
+test (int8_t,  size_t)
+test (int16_t, size_t)
+test (int32_t, size_t)
+test (int8_t,  int)
+test (int16_t, int)
+test (int32_t, int)
+
+#define run(T, U, i)             \
+{                                \
+T *q = p;                        \
+q[i] = 0;                        \
+assert (test_##T##U (p) == i);   \
+memset (&q[i], 0xf, sizeof (T)); \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+
+  run (uint8_t, size_t, 0);
+  run (uint8_t, size_t, 1);
+  run (uint8_t, size_t, 13);
+
+  run (int8_t, size_t, 0);
+  run (int8_t, size_t, 1);
+  run (int8_t, size_t, 13);
+
+  run (uint8_t, int, 0);
+  run (uint8_t, int, 1);
+  run (uint8_t, int, 13);
+
+  run (int8_t, int, 0);
+  run (int8_t, int, 1);
+  run (int8_t, int, 13);
+
+  run (uint16_t, size_t, 0);
+  run (uint16_t, size_t, 1);
+  run (uint16_t, size_t, 13);
+
+  run (int16_t, size_t, 0);
+  run (int16_t, size_t, 1);
+  run (int16_t, size_t, 13);
+
+  run (uint16_t, int, 0);
+  run (uint16_t, int, 1);
+  run (uint16_t, int, 13);
+
+  run (int16_t, int, 0);
+  run (int16_t, int, 1);
+  run (int16_t, int, 13);
+
+  run (uint32_t, size_t, 0);
+  run (uint32_t, size_t, 1);
+  run (uint32_t, size_t, 13);
+
+  run (int32_t, size_t, 0);
+  run (int32_t, size_t, 1);
+  run (int32_t, size_t, 13);
+
+  run (uint32_t, int, 0);
+  run (uint32_t, int, 1);
+  run (uint32_t, int, 13);
+
+  run (int32_t, int, 0);
+  run (int32_t, int, 1);
+  run (int32_t, int, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
new file mode 100644
index 00000000000..cd06e4a27cb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
@@ -0,0 +1,58 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlen\n" 3 "ldist" } } */
+
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+__attribute__((noinline))
+int test_pos (char *s)
+{
+  int i;
+  for (i=42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_neg (char *s)
+{
+  int i;
+  for (i=-42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_including_null_char (char *s)
+{
+  int i;
+  for (i=1; s[i-1]; ++i);
+  return i;
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+  char *s = (char *)p + 100;
+
+  s[42+13] = 0;
+  assert (test_pos (s) == 42+13);
+  s[42+13] = 0xf;
+
+  s[13] = 0;
+  assert (test_neg (s) == 13);
+  s[13] = 0xf;
+
+  s[-13] = 0;
+  assert (test_neg (s) == -13);
+  s[-13] = 0xf;
+
+  s[13] = 0;
+  assert (test_including_null_char (s) == 13+1);
+
+  return 0;
+}
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 65aa1df4aba..8c0963b1836 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -116,6 +116,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-eh.h"
 #include "gimple-fold.h"
 #include "tree-affine.h"
+#include "intl.h"
+#include "rtl.h"
+#include "memmodel.h"
+#include "optabs.h"
 
 
 #define MAX_DATAREFS_NUM \
@@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop *loop, vec<gimple *> *work_list)
   return work_list->length () > 0;
 }
 
+static void
+generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
+			    data_reference_p store_dr, tree base, tree pattern,
+			    location_t loc)
+{
+  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
+		       && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern));
+  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
+
+  /* The new statements will be placed before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+
+  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
+				       GSI_CONTINUE_LINKING);
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
+  tree reduction_var_new = copy_ssa_name (reduction_var);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+
+  if (store_dr)
+    {
+      gassign *g = gimple_build_assign (DR_REF (store_dr), reduction_var_new);
+      gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
+    }
+
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  fold_stmt (&gsi);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    switch (TYPE_MODE (TREE_TYPE (pattern)))
+      {
+      case QImode:
+	fprintf (dump_file, "generated rawmemchrqi\n");
+	break;
+
+      case HImode:
+	fprintf (dump_file, "generated rawmemchrhi\n");
+	break;
+
+      case SImode:
+	fprintf (dump_file, "generated rawmemchrsi\n");
+	break;
+
+      default:
+	gcc_unreachable ();
+      }
+}
+
+static void
+generate_strlen_builtin (loop_p loop, tree reduction_var, tree base,
+			 tree start_len, location_t loc)
+{
+  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)));
+  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (start_len));
+
+  /* The new statements will be placed before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+
+  tree reduction_var_new = make_ssa_name (size_type_node);
+
+  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
+				       GSI_CONTINUE_LINKING);
+  tree fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_STRLEN));
+  gimple *fn_call = gimple_build_call (fn, 1, mem);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+
+  /* In case reduction type is not compatible with size type, then
+     conversion is sound even in case an overflow occurs since we previously
+     ensured that for reduction type an overflow is undefined.  */
+  tree convert = fold_convert (TREE_TYPE (reduction_var), reduction_var_new);
+  reduction_var_new = force_gimple_operand_gsi (&gsi, convert, true, NULL_TREE,
+						false, GSI_CONTINUE_LINKING);
+
+  /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start
+     length.  */
+  if (!integer_zerop (start_len))
+    {
+      tree fn_result = reduction_var_new;
+      reduction_var_new = make_ssa_name (TREE_TYPE (reduction_var));
+      gimple *add_stmt = gimple_build_assign (reduction_var_new, PLUS_EXPR,
+					      fn_result, start_len);
+      gsi_insert_after (&gsi, add_stmt, GSI_CONTINUE_LINKING);
+    }
+
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  fold_stmt (&gsi);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "generated strlen\n");
+}
+
+static void
+generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var,
+					 tree base, tree start_len,
+					 location_t loc)
+{
+  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base)));
+  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (start_len));
+
+  /* The new statements will be placed before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+
+  tree mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false,
+				       GSI_CONTINUE_LINKING);
+  tree zero = build_zero_cst (TREE_TYPE (TREE_TYPE (mem)));
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, zero);
+  tree end = make_ssa_name (TREE_TYPE (base));
+  gimple_call_set_lhs (fn_call, end);
+  gimple_set_location (fn_call, loc);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+
+  tree diff = make_ssa_name (ptrdiff_type_node);
+  gimple *diff_stmt = gimple_build_assign (diff, POINTER_DIFF_EXPR, end, mem);
+  gsi_insert_after (&gsi, diff_stmt, GSI_CONTINUE_LINKING);
+
+  tree convert = fold_convert (ptrdiff_type_node,
+			       TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (mem))));
+  tree size = force_gimple_operand_gsi (&gsi, convert, true, NULL_TREE, false,
+					GSI_CONTINUE_LINKING);
+
+  tree count = make_ssa_name (ptrdiff_type_node);
+  gimple *count_stmt = gimple_build_assign (count, TRUNC_DIV_EXPR, diff, size);
+  gsi_insert_after (&gsi, count_stmt, GSI_CONTINUE_LINKING);
+
+  convert = fold_convert (TREE_TYPE (reduction_var), count);
+  tree reduction_var_new = force_gimple_operand_gsi (&gsi, convert, true,
+						     NULL_TREE, false,
+						     GSI_CONTINUE_LINKING);
+
+  /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start
+     length.  */
+  if (!integer_zerop (start_len))
+    {
+      tree lhs = make_ssa_name (TREE_TYPE (reduction_var_new));
+      gimple *g = gimple_build_assign (lhs, PLUS_EXPR, reduction_var_new,
+				       start_len);
+      gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
+      reduction_var_new = lhs;
+    }
+
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  fold_stmt (&gsi);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    switch (TYPE_MODE (TREE_TYPE (zero)))
+      {
+      case HImode:
+	fprintf (dump_file, "generated strlen using rawmemchrhi\n");
+	break;
+
+      case SImode:
+	fprintf (dump_file, "generated strlen using rawmemchrsi\n");
+	break;
+
+      default:
+	gcc_unreachable ();
+      }
+}
+
+static bool
+transform_reduction_loop_1 (loop_p loop,
+			    data_reference_p load_dr,
+			    data_reference_p store_dr,
+			    tree reduction_var)
+{
+  tree load_ref = DR_REF (load_dr);
+  tree load_type = TREE_TYPE (load_ref);
+  tree load_access_base = build_fold_addr_expr (load_ref);
+  tree load_access_size = TYPE_SIZE_UNIT (load_type);
+  affine_iv load_iv, reduction_iv;
+  tree pattern;
+
+  /* A limitation of the current implementation is that we only support
+     constant patterns.  */
+  edge e = single_exit (loop);
+  gcond *cond_stmt = safe_dyn_cast <gcond *> (last_stmt (e->src));
+  if (!cond_stmt)
+    return false;
+  pattern = gimple_cond_rhs (cond_stmt);
+  if (gimple_cond_code (cond_stmt) != NE_EXPR
+      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (DR_STMT (load_dr))
+      || TREE_CODE (pattern) != INTEGER_CST)
+    return false;
+
+  /* Bail out if no affine induction variable with constant step can be
+     determined.  */
+  if (!simple_iv (loop, loop, load_access_base, &load_iv, false))
+    return false;
+
+  /* Bail out if memory accesses are not consecutive or not growing.  */
+  if (!operand_equal_p (load_iv.step, load_access_size, 0))
+    return false;
+
+  if (!INTEGRAL_TYPE_P (load_type)
+      || !type_has_mode_precision_p (load_type))
+    return false;
+
+  if (!simple_iv (loop, loop, reduction_var, &reduction_iv, false))
+    return false;
+
+  /* Handle rawmemchr like loops.  */
+  if (operand_equal_p (load_iv.base, reduction_iv.base)
+      && operand_equal_p (load_iv.step, reduction_iv.step))
+    {
+      if (store_dr)
+	{
+	  /* Ensure that we store to X and load from X+I where I>0.  */
+	  if (TREE_CODE (load_iv.base) != POINTER_PLUS_EXPR
+	      || !integer_onep (TREE_OPERAND (load_iv.base, 1)))
+	    return false;
+	  tree ptr_base = TREE_OPERAND (load_iv.base, 0);
+	  if (TREE_CODE (ptr_base) != SSA_NAME)
+	    return false;
+	  gimple *def = SSA_NAME_DEF_STMT (ptr_base);
+	  if (!gimple_assign_single_p (def)
+	      || gimple_assign_rhs1 (def) != DR_REF (store_dr))
+	    return false;
+	  /* Ensure that the reduction value is stored.  */
+	  if (gimple_assign_rhs1 (DR_STMT (store_dr)) != reduction_var)
+	    return false;
+	}
+      /* Bail out if target does not provide rawmemchr for a certain mode.  */
+      machine_mode mode = TYPE_MODE (load_type);
+      if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
+	return false;
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      generate_rawmemchr_builtin (loop, reduction_var, store_dr, load_iv.base,
+				  pattern, loc);
+      return true;
+    }
+
+  /* Handle strlen like loops.  */
+  if (store_dr == NULL
+      && integer_zerop (pattern)
+      && TREE_CODE (reduction_iv.base) == INTEGER_CST
+      && TREE_CODE (reduction_iv.step) == INTEGER_CST
+      && integer_onep (reduction_iv.step)
+      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
+	  || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
+    {
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      if (TYPE_MODE (load_type) == TYPE_MODE (char_type_node)
+	  && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node))
+	generate_strlen_builtin (loop, reduction_var, load_iv.base,
+				 reduction_iv.base, loc);
+      else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type))
+	       != CODE_FOR_nothing)
+	generate_strlen_builtin_using_rawmemchr (loop, reduction_var,
+						 load_iv.base,
+						 reduction_iv.base, loc);
+      else
+	return false;
+      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
+	{
+	  const char *msg = G_("assuming signed overflow does not occur "
+			       "when optimizing strlen like loop");
+	  fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
+	}
+      return true;
+    }
+
+  return false;
+}
+
+static bool
+transform_reduction_loop (loop_p loop)
+{
+  gimple *reduction_stmt = NULL;
+  data_reference_p load_dr = NULL, store_dr = NULL;
+
+  basic_block *bbs = get_loop_body (loop);
+
+  for (unsigned i = 0, ninsns = 0; i < loop->num_nodes; ++i)
+    {
+      basic_block bb = bbs[i];
+
+      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
+	   gsi_next (&bsi), ++ninsns)
+	{
+	  /* Bail out early for loops which are unlikely to match.  */
+	  if (ninsns > 16)
+	    return false;
+	  gphi *phi = bsi.phi ();
+	  if (gimple_has_volatile_ops (phi))
+	    return false;
+	  if (gimple_clobber_p (phi))
+	    continue;
+	  if (virtual_operand_p (gimple_phi_result (phi)))
+	    continue;
+	  if (stmt_has_scalar_dependences_outside_loop (loop, phi))
+	    {
+	      if (reduction_stmt)
+		return false;
+	      reduction_stmt = phi;
+	    }
+	}
+
+      for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
+	   gsi_next (&bsi), ++ninsns)
+	{
+	  /* Bail out early for loops which are unlikely to match.  */
+	  if (ninsns > 16)
+	    return false;
+
+	  gimple *stmt = gsi_stmt (bsi);
+
+	  if (gimple_clobber_p (stmt))
+	    continue;
+
+	  if (gimple_code (stmt) == GIMPLE_LABEL || is_gimple_debug (stmt))
+	    continue;
+
+	  if (gimple_has_volatile_ops (stmt))
+	    return false;
+
+	  if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
+	    {
+	      if (reduction_stmt)
+		return false;
+	      reduction_stmt = stmt;
+	    }
+
+	  /* Any scalar stmts are ok.  */
+	  if (!gimple_vuse (stmt))
+	    continue;
+
+	  /* Otherwise just regular loads/stores.  */
+	  if (!gimple_assign_single_p (stmt))
+	    return false;
+
+	  auto_vec<data_reference_p, 2> dr_vec;
+	  if (!find_data_references_in_stmt (loop, stmt, &dr_vec))
+	    return false;
+	  data_reference_p dr;
+	  unsigned j;
+	  FOR_EACH_VEC_ELT (dr_vec, j, dr)
+	    {
+	      tree type = TREE_TYPE (DR_REF (dr));
+	      if (!ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (type)))
+		return false;
+	      if (DR_IS_READ (dr))
+		{
+		  if (load_dr != NULL)
+		    return false;
+		  load_dr = dr;
+		}
+	      else
+		{
+		  if (store_dr != NULL)
+		    return false;
+		  store_dr = dr;
+		}
+	    }
+	}
+    }
+
+  /* A limitation of the current implementation is that we require a reduction
+     statement.  Therefore, loops without a reduction statement as in the
+     following are not recognized:
+     int *p;
+     void foo (void) { for (; *p; ++p); } */
+  if (load_dr == NULL || reduction_stmt == NULL)
+    return false;
+
+  /* Reduction variables are guaranteed to be SSA names.  */
+  tree reduction_var;
+  switch (gimple_code (reduction_stmt))
+    {
+    case GIMPLE_PHI:
+      reduction_var = gimple_phi_result (reduction_stmt);
+      break;
+    case GIMPLE_ASSIGN:
+      reduction_var = gimple_assign_lhs (reduction_stmt);
+      break;
+    default:
+      /* Bail out e.g. for GIMPLE_CALL.  */
+      return false;
+    }
+  if (reduction_var == NULL)
+    return false;
+
+  /* Bail out if this is a bitfield memory reference.  */
+  if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF
+      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1)))
+    return false;
+
+  /* Data reference must be executed exactly once per iteration of each
+     loop in the loop nest.  We only need to check dominance information
+     against the outermost one in a perfect loop nest because a bb can't
+     dominate outermost loop's latch without dominating inner loop's.  */
+  basic_block load_bb = gimple_bb (DR_STMT (load_dr));
+  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, load_bb))
+    return false;
+
+  if (store_dr)
+    {
+      /* Bail out if this is a bitfield memory reference.  */
+      if (TREE_CODE (DR_REF (store_dr)) == COMPONENT_REF
+	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (store_dr), 1)))
+	return false;
+
+      /* Data reference must be executed exactly once per iteration of each
+	 loop in the loop nest.  We only need to check dominance information
+	 against the outermost one in a perfect loop nest because a bb can't
+	 dominate outermost loop's latch without dominating inner loop's.  */
+      basic_block store_bb = gimple_bb (DR_STMT (store_dr));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, store_bb))
+	return false;
+
+      /* Load and store must be in the same loop nest.  */
+      if (store_bb->loop_father != load_bb->loop_father)
+	return false;
+
+      edge e = single_exit (store_bb->loop_father);
+      if (!e)
+	return false;
+      bool load_dom = dominated_by_p (CDI_DOMINATORS, e->src, load_bb);
+      bool store_dom = dominated_by_p (CDI_DOMINATORS, e->src, store_bb);
+      if (load_dom != store_dom)
+	return false;
+    }
+
+  return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var);
+}
+
 /* Given innermost LOOP, return the outermost enclosing loop that forms a
    perfect loop nest.  */
 
@@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
 	      && !optimize_loop_for_speed_p (loop)))
 	continue;
 
-      /* Don't distribute loop if niters is unknown.  */
+      /* If niters is unknown don't distribute loop but rather try to transform
+	 it to a call to a builtin.  */
       tree niters = number_of_latch_executions (loop);
       if (niters == NULL_TREE || niters == chrec_dont_know)
-	continue;
+	{
+	  if (transform_reduction_loop (loop))
+	    {
+	      changed = true;
+	      loops_to_be_destroyed.safe_push (loop);
+	      if (dump_file)
+		fprintf (dump_file, "Loop %d transformed into a builtin.\n", loop->num);
+	    }
+	  continue;
+	}
 
       /* Get the perfect loop nest for distribution.  */
       loop = prepare_perfect_loop_nest (loop);
Stefan Schulze Frielinghaus June 14, 2021, 5:26 p.m. UTC | #15
On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote:
[...]
> > but we won't ever arrive here because of the niters condition.  But
> > yes, doing the pattern matching in the innermost loop processing code
> > looks good to me - for the specific case it would be
> > 
> >       /* Don't distribute loop if niters is unknown.  */
> >       tree niters = number_of_latch_executions (loop);
> >       if (niters == NULL_TREE || niters == chrec_dont_know)
> > ---> here?
> >         continue;
> 
> Right, please find attached a new version of the patch where everything
> is included in the loop distribution pass.  I will do a bootstrap and
> regtest on IBM Z overnight.  If you give me the green light I will also do
> the same on x86_64.

Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
least the ldist-strlen testcase).  If you are Ok with the patch, then I
would rebase and run the testsuites again and post a patch series
including the rawmemchr implementation for IBM Z.

Cheers,
Stefan
Richard Biener June 16, 2021, 2:22 p.m. UTC | #16
On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus
<stefansf@linux.ibm.com> wrote:
>
> On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote:
> [...]
> > > but we won't ever arrive here because of the niters condition.  But
> > > yes, doing the pattern matching in the innermost loop processing code
> > > looks good to me - for the specific case it would be
> > >
> > >       /* Don't distribute loop if niters is unknown.  */
> > >       tree niters = number_of_latch_executions (loop);
> > >       if (niters == NULL_TREE || niters == chrec_dont_know)
> > > ---> here?
> > >         continue;
> >
> > Right, please find attached a new version of the patch where everything
> > is included in the loop distribution pass.  I will do a bootstrap and
> > regtest on IBM Z overnight.  If you give me the green light I will also do
> > the same on x86_64.
>
> Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
> least the ldist-strlen testcase).  If you are Ok with the patch, then I
> would rebase and run the testsuites again and post a patch series
> including the rawmemchr implementation for IBM Z.

@@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop *loop, vec<gimple *> *work_list)
   return work_list->length () > 0;
 }

+static void
+generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
+                           data_reference_p store_dr, tree base, tree pattern,
+                           location_t loc)
+{

this new function needs a comment.  Applies to all of the new ones, btw.
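
Something along these lines would do (wording is just a sketch):

/* Generate a call to the internal function IFN_RAWMEMCHR for LOOP: search
   the memory starting at BASE for the constant PATTERN, replace all uses
   of REDUCTION_VAR with the result of the call, and, if STORE_DR is
   non-NULL, also store the result to its memory reference.  LOC is the
   location used for the new statements.  */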

+  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
+                      && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern));

this looks fragile and is probably unnecessary as well.

+  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));

in general you want types_compatible_p () checks which for pointers means
all pointers are compatible ...

(skipping stuff)

@@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
              && !optimize_loop_for_speed_p (loop)))
        continue;

-      /* Don't distribute loop if niters is unknown.  */
+      /* If niters is unknown don't distribute loop but rather try to transform
+        it to a call to a builtin.  */
       tree niters = number_of_latch_executions (loop);
       if (niters == NULL_TREE || niters == chrec_dont_know)
-       continue;
+       {
+         if (transform_reduction_loop (loop))
+           {
+             changed = true;
+             loops_to_be_destroyed.safe_push (loop);
+             if (dump_file)
+               fprintf (dump_file, "Loop %d transformed into a
builtin.\n", loop->num);
+           }
+         continue;
+       }

please look at

          if (nb_generated_loops + nb_generated_calls > 0)
            {
              changed = true;
              if (dump_enabled_p ())
                dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
                                 loc, "Loop%s %d distributed: split to
%d loops "
                                 "and %d library calls.\n", str, loop->num,
                                 nb_generated_loops, nb_generated_calls);

and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the
transforms are reported with -fopt-info-loop

+
+  return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var);
+}

what's the point in tail-calling here and visually splitting the
function in half?

(sorry for picking random pieces now ;))

+      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
+          gsi_next (&bsi), ++ninsns)
+       {

this counts debug insns, I guess you want gsi_next_nondebug at least.
not sure why you are counting PHIs at all btw - for the loops you match
you are expecting at most two, one IV and eventually one for the virtual
operand of the store?

+         if (gimple_has_volatile_ops (phi))
+           return false;

PHIs never have volatile ops.

+         if (gimple_clobber_p (phi))
+           continue;

or are clobbers.

Btw, can you factor out a helper from find_single_drs working on a
stmt to reduce code duplication?

+  tree reduction_var;
+  switch (gimple_code (reduction_stmt))
+    {
+    case GIMPLE_PHI:
+      reduction_var = gimple_phi_result (reduction_stmt);
+      break;
+    case GIMPLE_ASSIGN:
+      reduction_var = gimple_assign_lhs (reduction_stmt);
+      break;
+    default:
+      /* Bail out e.g. for GIMPLE_CALL.  */
+      return false;

gimple_get_lhs (reduction_stmt); would work for both PHIs
and assigns.

+  if (reduction_var == NULL)
+    return false;

it can never be NULL here.

+  /* Bail out if this is a bitfield memory reference.  */
+  if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF
+      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1)))
+    return false;
...

I see this is again quite some code copied from find_single_drs, please
see how to avoid this much duplication by splitting out helpers.

+static bool
+transform_reduction_loop_1 (loop_p loop,
+                           data_reference_p load_dr,
+                           data_reference_p store_dr,
+                           tree reduction_var)
+{
+  tree load_ref = DR_REF (load_dr);
+  tree load_type = TREE_TYPE (load_ref);
+  tree load_access_base = build_fold_addr_expr (load_ref);
+  tree load_access_size = TYPE_SIZE_UNIT (load_type);
+  affine_iv load_iv, reduction_iv;
+  tree pattern;
+
+  /* A limitation of the current implementation is that we only support
+     constant patterns.  */
+  edge e = single_exit (loop);
+  gcond *cond_stmt = safe_dyn_cast <gcond *> (last_stmt (e->src));
+  if (!cond_stmt)
+    return false;

that looks like checks to be done at the start of
transform_reduction_loop, not this late.

+  if (gimple_cond_code (cond_stmt) != NE_EXPR
+      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (DR_STMT (load_dr))
+      || TREE_CODE (pattern) != INTEGER_CST)
+    return false;

half of this as well.  Btw, there's no canonicalization for
the tests so you have to verify the false edge actually exits
the loop and allow for EQ_EXPR in case the false edge does.

+  /* Handle strlen like loops.  */
+  if (store_dr == NULL
+      && integer_zerop (pattern)
+      && TREE_CODE (reduction_iv.base) == INTEGER_CST
+      && TREE_CODE (reduction_iv.step) == INTEGER_CST
+      && integer_onep (reduction_iv.step)
+      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
+         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
+    {

I wonder what goes wrong with a larger or smaller wrapping IV type?
The iteration
only stops when you load a NUL and the increments just wrap along (you're
using the pointer IVs to compute the strlen result).  Can't you simply truncate?
For larger than size_type_node (actually larger than ptr_type_node would matter
I guess), the argument is that since pointer wrapping would be undefined anyway
the IV cannot wrap either.  Now, the correct check here would IMHO be

      TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
(ptr_type_node)
       || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))

?

+      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
+       {
+         const char *msg = G_("assuming signed overflow does not occur "
+                              "when optimizing strlen like loop");
+         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
+       }

no, please don't add any new strict-overflow warnings ;)

The generate_*_builtin routines need some factoring - if you code-generate
into a gimple_seq you could use gimple_build () which would do the fold_stmt
(not sure why you do that - you should see to fold the call, not necessarily
the rest).  The replacement of reduction_var and the dumping could be shared.
There's also GET_MODE_NAME for the printing.

I think overall the approach is sound now but the details still need work.

Thanks,
Richard.



> Cheers,
> Stefan
Stefan Schulze Frielinghaus June 25, 2021, 10:23 a.m. UTC | #17
On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote:
> On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus
> <stefansf@linux.ibm.com> wrote:
> >
> > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote:
> > [...]
> > > > but we won't ever arrive here because of the niters condition.  But
> > > > yes, doing the pattern matching in the innermost loop processing code
> > > > looks good to me - for the specific case it would be
> > > >
> > > >       /* Don't distribute loop if niters is unknown.  */
> > > >       tree niters = number_of_latch_executions (loop);
> > > >       if (niters == NULL_TREE || niters == chrec_dont_know)
> > > > ---> here?
> > > >         continue;
> > >
> > > Right, please find attached a new version of the patch where everything
> > > is included in the loop distribution pass.  I will do a bootstrap and
> > > regtest on IBM Z over night.  If you give me green light I will also do
> > > the same on x86_64.
> >
> > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
> > least the ldist-strlen testcase).  If you are Ok with the patch, then I
> > would rebase and run the testsuites again and post a patch series
> > including the rawmemchr implementation for IBM Z.
> 
> @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop
> *loop, vec<gimple *> *work_list)
>    return work_list->length () > 0;
>  }
> 
> +static void
> +generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
> +                           data_reference_p store_dr, tree base, tree pattern,
> +                           location_t loc)
> +{
> 
> this new function needs a comment.  Applies to all of the new ones, btw.

Done.

> +  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
> +                      && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern));
> 
> this looks fragile and is probably unnecessary as well.
> 
> +  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
> 
> in general you want types_compatible_p () checks which for pointers means
> all pointers are compatible ...

True, I removed both asserts.

> (skipping stuff)
> 
> @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
>               && !optimize_loop_for_speed_p (loop)))
>         continue;
> 
> -      /* Don't distribute loop if niters is unknown.  */
> +      /* If niters is unknown don't distribute loop but rather try to transform
> +        it to a call to a builtin.  */
>        tree niters = number_of_latch_executions (loop);
>        if (niters == NULL_TREE || niters == chrec_dont_know)
> -       continue;
> +       {
> +         if (transform_reduction_loop (loop))
> +           {
> +             changed = true;
> +             loops_to_be_destroyed.safe_push (loop);
> +             if (dump_file)
> +               fprintf (dump_file, "Loop %d transformed into a
> builtin.\n", loop->num);
> +           }
> +         continue;
> +       }
> 
> please look at
> 
>           if (nb_generated_loops + nb_generated_calls > 0)
>             {
>               changed = true;
>               if (dump_enabled_p ())
>                 dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>                                  loc, "Loop%s %d distributed: split to
> %d loops "
>                                  "and %d library calls.\n", str, loop->num,
>                                  nb_generated_loops, nb_generated_calls);
> 
> and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the
> transforms are reported with -fopt-info-loop

Done.

> +
> +  return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var);
> +}
> 
> what's the point in tail-calling here and visually splitting the
> function in half?

At first I thought that this was more pleasant since in
transform_reduction_loop_1 it is settled that we have a single load,
store, and reduction variable.  After refactoring this isn't true
anymore, so I inlined the function and made this clear via a comment.

> 
> (sorry for picking random pieces now ;))
> 
> +      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
> +          gsi_next (&bsi), ++ninsns)
> +       {
> 
> this counts debug insns, I guess you want gsi_next_nondebug at least.
> not sure why you are counting PHIs at all btw - for the loops you match
> you are expecting at most two, one IV and eventually one for the virtual
> operand of the store?

Yes, I removed the counting for the phi loop and changed to
gsi_next_nondebug for both loops.

> 
> +         if (gimple_has_volatile_ops (phi))
> +           return false;
> 
> PHIs never have volatile ops.
> 
> +         if (gimple_clobber_p (phi))
> +           continue;
> 
> or are clobbers.

Removed both.

> Btw, can you factor out a helper from find_single_drs working on a
> stmt to reduce code duplication?

Ahh sorry for that.  I've already done this in one of my first patches
but didn't copy that over.  Although my changes do not require a RDG the
whole pass is based upon this data structure.  Therefore, in order to
share more code I decided to temporarily build the RDG so that I can
call into find_single_drs.  Since the graph is rather small I guess the
overhead is acceptable w.r.t. code sharing.

struct graph *rdg = build_rdg (loop, NULL);
if (rdg == NULL)
  {
    if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file,
     	 "Loop %d not transformed: failed to build the RDG.\n",
     	 loop->num);

    return false;
  }
auto_bitmap partition_stmts;
bitmap_set_range (partition_stmts, 0, rdg->n_vertices);
find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr);
free_rdg (rdg);

As a side effect of this, I now also have to allocate/deallocate the class
member datarefs_vec before/after calling into transform_reduction_loop:

/* If niters is unknown don't distribute loop but rather try to transform
   it to a call to a builtin.  */
tree niters = number_of_latch_executions (loop);
if (niters == NULL_TREE || niters == chrec_dont_know)
  {
    datarefs_vec.create (20);
    if (transform_reduction_loop (loop))
      {
        changed = true;
        loops_to_be_destroyed.safe_push (loop);
        if (dump_enabled_p ())
          {
            dump_user_location_t loc = find_loop_location (loop);
            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
                             loc, "Loop %d transformed into a builtin.\n",
                             loop->num);
          }
      }
    free_data_refs (datarefs_vec);
    continue;
  }

> 
> +  tree reduction_var;
> +  switch (gimple_code (reduction_stmt))
> +    {
> +    case GIMPLE_PHI:
> +      reduction_var = gimple_phi_result (reduction_stmt);
> +      break;
> +    case GIMPLE_ASSIGN:
> +      reduction_var = gimple_assign_lhs (reduction_stmt);
> +      break;
> +    default:
> +      /* Bail out e.g. for GIMPLE_CALL.  */
> +      return false;
> 
> gimple_get_lhs (reduction_stmt); would work for both PHIs
> and assigns.

Done.

> 
> +  if (reduction_var == NULL)
> +    return false;
> 
> it can never be NULL here.

True, otherwise the reduction statement wouldn't have a dependence outside
the loop. => Removed.

> 
> +  /* Bail out if this is a bitfield memory reference.  */
> +  if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF
> +      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1)))
> +    return false;
> ...
> 
> I see this is again quite some code copied from find_single_drs, please
> see how to avoid this much duplication by splitting out helpers.

Sorry again.  Hope the solution above is more appropriate.

> 
> +static bool
> +transform_reduction_loop_1 (loop_p loop,
> +                           data_reference_p load_dr,
> +                           data_reference_p store_dr,
> +                           tree reduction_var)
> +{
> +  tree load_ref = DR_REF (load_dr);
> +  tree load_type = TREE_TYPE (load_ref);
> +  tree load_access_base = build_fold_addr_expr (load_ref);
> +  tree load_access_size = TYPE_SIZE_UNIT (load_type);
> +  affine_iv load_iv, reduction_iv;
> +  tree pattern;
> +
> +  /* A limitation of the current implementation is that we only support
> +     constant patterns.  */
> +  edge e = single_exit (loop);
> +  gcond *cond_stmt = safe_dyn_cast <gcond *> (last_stmt (e->src));
> +  if (!cond_stmt)
> +    return false;
> 
> that looks like checks to be done at the start of
> transform_reduction_loop, not this late.

Pulled this to the very beginning of transform_reduction_loop.

> 
> +  if (gimple_cond_code (cond_stmt) != NE_EXPR
> +      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (DR_STMT (load_dr))
> +      || TREE_CODE (pattern) != INTEGER_CST)
> +    return false;
> 
> half of this as well.  Btw, there's no canonicalization for
> the tests so you have to verify the false edge actually exits
> the loop and allow for EQ_EXPR in case the false edge does.

Uh good point.  I added checks for that and pulled most of it to the
beginning of transform_reduction_loop.
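
Concretely, the check now reads as follows at the very beginning of
transform_reduction_loop (copied from the attached patch):

  /* Ensure loop condition is an (in)equality test and loop is exited either if
     the inequality test fails or the equality test succeeds.  */
  if (!(e->flags & EDGE_FALSE_VALUE && gimple_cond_code (cond) == NE_EXPR)
      && !(e->flags & EDGE_TRUE_VALUE && gimple_cond_code (cond) == EQ_EXPR))
    return false;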

> 
> +  /* Handle strlen like loops.  */
> +  if (store_dr == NULL
> +      && integer_zerop (pattern)
> +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> +      && integer_onep (reduction_iv.step)
> +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> +    {
> 
> I wonder what goes wrong with a larger or smaller wrapping IV type?
> The iteration
> only stops when you load a NUL and the increments just wrap along (you're
> using the pointer IVs to compute the strlen result).  Can't you simply truncate?

I think truncation is enough as long as no overflow occurs in strlen or
strlen_using_rawmemchr.

> For larger than size_type_node (actually larger than ptr_type_node would matter
> I guess), the argument is that since pointer wrapping would be undefined anyway
> the IV cannot wrap either.  Now, the correct check here would IMHO be
> 
>       TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> (ptr_type_node)
>        || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> 
> ?

Regarding the implementation which makes use of rawmemchr:

We can count at most PTRDIFF_MAX bytes without an overflow.  Thus, the
maximal length we can determine for a string whose characters have size
S is PTRDIFF_MAX / S without an overflow.  Since overflow of the ptrdiff
type is undefined, we have to make sure that if the pointer difference
overflows, then the reduction variable has overflowed, too, and that its
overflow is undefined as well.  However, I'm not sure anymore whether we
want to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
this would mean that a single string consumes more than half of the
virtual address space.  At least for architectures where
TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
to neglect the case where computing the pointer difference may overflow.
Otherwise we would be talking about strings with lengths of multiple
pebibytes.  For other architectures we might have to be more precise and
make sure that the reduction variable overflows first and that its
overflow is undefined.

Thus a conservative condition would be (I assumed that the size of any
integral type is a power of two, which I'm not sure really holds; IIRC
the C standard only requires that the alignment is a power of two but
not necessarily the size, so I might need to change this):

/* Compute precision (reduction_var) < precision (ptrdiff_type) - 1 - log2 (sizeof (load_type)),
   or in other words return true if the reduction variable overflows first
   and false otherwise.  */

static bool
reduction_var_overflows_first (tree reduction_var, tree load_type)
{
  unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
  unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
  unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
  return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
}

TYPE_PRECISION (ptrdiff_type_node) == 64
|| (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
    && reduction_var_overflows_first (reduction_var, load_type))
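
As a quick sanity check of the helper above (my own arithmetic, not part
of the patch): for a 64-bit ptrdiff_type_node and a 32-bit load type
(size 4, hence log2 = 2) the helper requires the precision of the
reduction variable to be less than 64 - 1 - 2 = 61, so e.g. a 32-bit int
reduction variable qualifies; for a 32-bit ptrdiff_type_node the bound
drops to 29 and the same reduction variable is rejected.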

Regarding the implementation which makes use of strlen:

I'm not sure what it means if strlen is called for a string with a
length greater than SIZE_MAX.  Therefore, similar to the implementation
using rawmemchr where we neglect the case of an overflow for 64-bit
architectures, a conservative condition would be:

TYPE_PRECISION (size_type_node) == 64
|| (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
    && TYPE_PRECISION (TREE_TYPE (reduction_var)) <= TYPE_PRECISION (size_type_node))

I still included the undefined-overflow check for the reduction variable
in order to rule out situations where the reduction variable is unsigned
and wraps around so many times that strlen (or strlen_using_rawmemchr)
overflows, too.  Maybe this is all theoretical nonsense but I'm afraid of
uncommon architectures.  Anyhow, while writing this down it becomes clear
that this deserves a comment, which I will add once it becomes clear
which way to go.
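
For example (again only my own sanity check, not taken from the patch):
on a 32-bit target where size_t has 32 bits and the reduction variable
is an unsigned int, neither alternative of the condition above holds
(the precision is not 64 and unsigned overflow is defined), so the
strlen transform would simply be skipped for such a loop.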

> 
> +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> +       {
> +         const char *msg = G_("assuming signed overflow does not occur "
> +                              "when optimizing strlen like loop");
> +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> +       }
> 
> no, please don't add any new strict-overflow warnings ;)

I just stumbled over code which produces such a warning and thought this
was a hard requirement :D  The new patch doesn't contain it anymore.

> 
> The generate_*_builtin routines need some factoring - if you code-generate
> into a gimple_seq you could use gimple_build () which would do the fold_stmt
> (not sure why you do that - you should see to fold the call, not necessarily
> the rest).  The replacement of reduction_var and the dumping could be shared.
> There's also GET_MODE_NAME for the printing.

I wasn't really sure which way to go: use a gsi, as is done by the
existing generate_* functions, or make use of a gimple_seq.  Since the
latter internally also uses a gsi, I initially thought it better to
stick with a gsi.  Now, after changing to gimple_seq, I see the beauty
of it :)

I created two helper functions generate_strlen_builtin_1 and
generate_reduction_builtin_1 in order to reduce code duplication.

In function generate_strlen_builtin I changed from using
builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
(BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
sure whether my intuition about the difference between implicit and
explicit builtins is correct.  In builtins.def there is a small example
given which I would paraphrase as "use builtin_decl_explicit if the
semantics of the builtin is defined by the C standard; otherwise use
builtin_decl_implicit" but probably my intuition is wrong?
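
For the record, if one stayed with the implicit decl, my understanding
(just a rough sketch of what I mean, not part of the attached patch) is
that the transform would have to bail out if the decl is unavailable,
along these lines:

  tree fn_decl = builtin_decl_implicit (BUILT_IN_STRLEN);
  if (!fn_decl)
    /* strlen not usable implicitly, e.g. under -fno-builtin-strlen.  */
    return false;
  gimple *fn_call = gimple_build_call (fn_decl, 1, mem);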

Besides that, I'm not sure whether I really have to call
build_fold_addr_expr, which looks superfluous to me since
gimple_build_call can deal with an ADDR_EXPR as well as a FUNCTION_DECL:

tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
gimple *fn_call = gimple_build_call (fn, 1, mem);

However, since it is also used that way in the context of
generate_memset_builtin I didn't remove it so far.
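
In other words, assuming gimple_build_call really accepts the bare decl
(that is just my reading of the API and untested here), the snippet
above could presumably be shortened to:

  gimple *fn_call
    = gimple_build_call (builtin_decl_explicit (BUILT_IN_STRLEN), 1, mem);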

> I think overall the approach is sound now but the details still need work.

Once again thank you very much for your review.  Really appreciated!

Cheers,
Stefan
commit d24639c895ce3c0f9539570bab7b6510e98b1ffa
Author: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
Date:   Wed Mar 17 09:00:06 2021 +0100

    foo

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index d209a52f823..633f41d9e6d 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2929,6 +2929,35 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+/* Expand IFN_RAWMEMCHR internal function.  */
+
+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+  tree lhs_type = TREE_TYPE (lhs);
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
+
+  for (unsigned int i = 0; i < 2; ++i)
+    {
+      tree rhs = gimple_call_arg (stmt, i);
+      tree rhs_type = TREE_TYPE (rhs);
+      rtx rhs_rtx = expand_normal (rhs);
+      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
+    }
+
+  insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
+
+  expand_insn (icode, 3, ops);
+  if (!rtx_equal_p (lhs_rtx, ops[0].value))
+    emit_move_insn (lhs_rtx, ops[0].value);
+}
+
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index daeace7a34e..95c76795648 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
diff --git a/gcc/optabs.def b/gcc/optabs.def
index b192a9d070b..f7c69f914ce 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
 OPTAB_D (movmem_optab, "movmem$a")
 OPTAB_D (setmem_optab, "setmem$a")
 OPTAB_D (strlen_optab, "strlen$a")
+OPTAB_D (rawmemchr_optab, "rawmemchr$I$a")
 
 OPTAB_DC(fma_optab, "fma$a4", FMA)
 OPTAB_D (fms_optab, "fms$a4")
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
new file mode 100644
index 00000000000..6db62d7644d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
@@ -0,0 +1,72 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt but no store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, pattern)   \
+__attribute__((noinline))  \
+T *test_##T (T *p)         \
+{                          \
+  while (*p != (T)pattern) \
+    ++p;                   \
+  return p;                \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  0xab)
+test (int16_t, 0xabcd)
+test (int32_t, 0xabcdef15)
+
+#define run(T, pattern, i)      \
+{                               \
+T *q = p;                       \
+q[i] = (T)pattern;              \
+assert (test_##T (p) == &q[i]); \
+q[i] = 0;                       \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0, 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, 0xab, 0);
+  run (int8_t, 0xab, 1);
+  run (int8_t, 0xab, 13);
+
+  run (int16_t, 0xabcd, 0);
+  run (int16_t, 0xabcd, 1);
+  run (int16_t, 0xabcd, 13);
+
+  run (int32_t, 0xabcdef15, 0);
+  run (int32_t, 0xabcdef15, 1);
+  run (int32_t, 0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
new file mode 100644
index 00000000000..00d6ea0f8e9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
@@ -0,0 +1,83 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt and store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+uint8_t *p_uint8_t;
+uint16_t *p_uint16_t;
+uint32_t *p_uint32_t;
+
+int8_t *p_int8_t;
+int16_t *p_int16_t;
+int32_t *p_int32_t;
+
+#define test(T, pattern)    \
+__attribute__((noinline))   \
+T *test_##T (void)          \
+{                           \
+  while (*p_##T != pattern) \
+    ++p_##T;                \
+  return p_##T;             \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  (int8_t)0xab)
+test (int16_t, (int16_t)0xabcd)
+test (int32_t, (int32_t)0xabcdef15)
+
+#define run(T, pattern, i) \
+{                          \
+T *q = p;                  \
+q[i] = pattern;            \
+p_##T = p;                 \
+T *r = test_##T ();        \
+assert (r == p_##T);       \
+assert (r == &q[i]);       \
+q[i] = 0;                  \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, '\0', 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, (int8_t)0xab, 0);
+  run (int8_t, (int8_t)0xab, 1);
+  run (int8_t, (int8_t)0xab, 13);
+
+  run (int16_t, (int16_t)0xabcd, 0);
+  run (int16_t, (int16_t)0xabcd, 1);
+  run (int16_t, (int16_t)0xabcd, 13);
+
+  run (int32_t, (int32_t)0xabcdef15, 0);
+  run (int32_t, (int32_t)0xabcdef15, 1);
+  run (int32_t, (int32_t)0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
new file mode 100644
index 00000000000..918b60099e4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
@@ -0,0 +1,100 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 4 "ldist" } } */
+/* { dg-final { scan-tree-dump-times "generated strlenHI\n" 4 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated strlenSI\n" 4 "ldist" { target s390x-*-* } } } */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, U)        \
+__attribute__((noinline)) \
+U test_##T##U (T *s)      \
+{                         \
+  U i;                    \
+  for (i=0; s[i]; ++i);   \
+  return i;               \
+}
+
+test (uint8_t,  size_t)
+test (uint16_t, size_t)
+test (uint32_t, size_t)
+test (uint8_t,  int)
+test (uint16_t, int)
+test (uint32_t, int)
+
+test (int8_t,  size_t)
+test (int16_t, size_t)
+test (int32_t, size_t)
+test (int8_t,  int)
+test (int16_t, int)
+test (int32_t, int)
+
+#define run(T, U, i)             \
+{                                \
+T *q = p;                        \
+q[i] = 0;                        \
+assert (test_##T##U (p) == i);   \
+memset (&q[i], 0xf, sizeof (T)); \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+
+  run (uint8_t, size_t, 0);
+  run (uint8_t, size_t, 1);
+  run (uint8_t, size_t, 13);
+
+  run (int8_t, size_t, 0);
+  run (int8_t, size_t, 1);
+  run (int8_t, size_t, 13);
+
+  run (uint8_t, int, 0);
+  run (uint8_t, int, 1);
+  run (uint8_t, int, 13);
+
+  run (int8_t, int, 0);
+  run (int8_t, int, 1);
+  run (int8_t, int, 13);
+
+  run (uint16_t, size_t, 0);
+  run (uint16_t, size_t, 1);
+  run (uint16_t, size_t, 13);
+
+  run (int16_t, size_t, 0);
+  run (int16_t, size_t, 1);
+  run (int16_t, size_t, 13);
+
+  run (uint16_t, int, 0);
+  run (uint16_t, int, 1);
+  run (uint16_t, int, 13);
+
+  run (int16_t, int, 0);
+  run (int16_t, int, 1);
+  run (int16_t, int, 13);
+
+  run (uint32_t, size_t, 0);
+  run (uint32_t, size_t, 1);
+  run (uint32_t, size_t, 13);
+
+  run (int32_t, size_t, 0);
+  run (int32_t, size_t, 1);
+  run (int32_t, size_t, 13);
+
+  run (uint32_t, int, 0);
+  run (uint32_t, int, 1);
+  run (uint32_t, int, 13);
+
+  run (int32_t, int, 0);
+  run (int32_t, int, 1);
+  run (int32_t, int, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
new file mode 100644
index 00000000000..e25d6ea5b56
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
@@ -0,0 +1,58 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 3 "ldist" } } */
+
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+__attribute__((noinline))
+int test_pos (char *s)
+{
+  int i;
+  for (i=42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_neg (char *s)
+{
+  int i;
+  for (i=-42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_including_null_char (char *s)
+{
+  int i;
+  for (i=1; s[i-1]; ++i);
+  return i;
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+  char *s = (char *)p + 100;
+
+  s[42+13] = 0;
+  assert (test_pos (s) == 42+13);
+  s[42+13] = 0xf;
+
+  s[13] = 0;
+  assert (test_neg (s) == 13);
+  s[13] = 0xf;
+
+  s[-13] = 0;
+  assert (test_neg (s) == -13);
+  s[-13] = 0xf;
+
+  s[13] = 0;
+  assert (test_including_null_char (s) == 13+1);
+
+  return 0;
+}
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 65aa1df4aba..6bd4bc8588a 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -116,6 +116,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-eh.h"
 #include "gimple-fold.h"
 #include "tree-affine.h"
+#include "intl.h"
+#include "rtl.h"
+#include "memmodel.h"
+#include "optabs.h"
 
 
 #define MAX_DATAREFS_NUM \
@@ -650,6 +654,10 @@ class loop_distribution
 		       control_dependences *cd, int *nb_calls, bool *destroy_p,
 		       bool only_patterns_p);
 
+  /* Transform loops which mimic the effects of builtins rawmemchr or strlen and
+     replace them accordingly.  */
+  bool transform_reduction_loop (loop_p loop);
+
   /* Compute topological order for basic blocks.  Topological order is
      needed because data dependence is computed for data references in
      lexicographical order.  */
@@ -1490,14 +1498,14 @@ loop_distribution::build_rdg_partition_for_vertex (struct graph *rdg, int v)
    data references.  */
 
 static bool
-find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
+find_single_drs (class loop *loop, struct graph *rdg, const bitmap &partition_stmts,
 		 data_reference_p *dst_dr, data_reference_p *src_dr)
 {
   unsigned i;
   data_reference_p single_ld = NULL, single_st = NULL;
   bitmap_iterator bi;
 
-  EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
+  EXECUTE_IF_SET_IN_BITMAP (partition_stmts, 0, i, bi)
     {
       gimple *stmt = RDG_STMT (rdg, i);
       data_reference_p dr;
@@ -1538,44 +1546,47 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
 	}
     }
 
-  if (!single_st)
-    return false;
-
-  /* Bail out if this is a bitfield memory reference.  */
-  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
-      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
+  if (!single_ld && !single_st)
     return false;
 
-  /* Data reference must be executed exactly once per iteration of each
-     loop in the loop nest.  We only need to check dominance information
-     against the outermost one in a perfect loop nest because a bb can't
-     dominate outermost loop's latch without dominating inner loop's.  */
-  basic_block bb_st = gimple_bb (DR_STMT (single_st));
-  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
-    return false;
+  basic_block bb_ld = NULL;
+  basic_block bb_st = NULL;
 
   if (single_ld)
     {
-      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
-      /* Direct aggregate copy or via an SSA name temporary.  */
-      if (load != store
-	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
-	return false;
-
       /* Bail out if this is a bitfield memory reference.  */
       if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
 	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
 	return false;
 
-      /* Load and store must be in the same loop nest.  */
-      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
-      if (bb_st->loop_father != bb_ld->loop_father)
+      /* Data reference must be executed exactly once per iteration of each
+	 loop in the loop nest.  We only need to check dominance information
+	 against the outermost one in a perfect loop nest because a bb can't
+	 dominate outermost loop's latch without dominating inner loop's.  */
+      bb_ld = gimple_bb (DR_STMT (single_ld));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+	return false;
+    }
+
+  if (single_st)
+    {
+      /* Bail out if this is a bitfield memory reference.  */
+      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
+	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
 	return false;
 
       /* Data reference must be executed exactly once per iteration.
-	 Same as single_st, we only need to check against the outermost
+	 Same as single_ld, we only need to check against the outermost
 	 loop.  */
-      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+      bb_st = gimple_bb (DR_STMT (single_st));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
+	return false;
+    }
+
+  if (single_ld && single_st)
+    {
+      /* Load and store must be in the same loop nest.  */
+      if (bb_st->loop_father != bb_ld->loop_father)
 	return false;
 
       edge e = single_exit (bb_st->loop_father);
@@ -1850,9 +1861,19 @@ loop_distribution::classify_partition (loop_p loop,
     return has_reduction;
 
   /* Find single load/store data references for builtin partition.  */
-  if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
+  if (!find_single_drs (loop, rdg, partition->stmts, &single_st, &single_ld)
+      || !single_st)
     return has_reduction;
 
+  if (single_ld && single_st)
+    {
+      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
+      /* Direct aggregate copy or via an SSA name temporary.  */
+      if (load != store
+	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
+	return has_reduction;
+    }
+
   partition->loc = gimple_location (DR_STMT (single_st));
 
   /* Classify the builtin kind.  */
@@ -3257,6 +3278,379 @@ find_seed_stmts_for_distribution (class loop *loop, vec<gimple *> *work_list)
   return work_list->length () > 0;
 }
 
+/* A helper function for generate_{rawmemchr,strlen}_builtin functions in order
+   to place new statements SEQ before LOOP and replace the old reduction
+   variable with the new one.  */
+
+static void
+generate_reduction_builtin_1 (loop_p loop, gimple_seq &seq,
+			      tree reduction_var_old, tree reduction_var_new,
+			      const char *info, machine_mode load_mode)
+{
+  /* Place new statements before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+  gsi_insert_seq_after (&gsi, seq, GSI_CONTINUE_LINKING);
+
+  /* Replace old reduction variable with new one.  */
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var_old)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, info, GET_MODE_NAME (load_mode));
+}
+
+/* Generate a call to rawmemchr and place it before LOOP.  REDUCTION_VAR is
+   replaced with a fresh SSA name representing the result of the call.  */
+
+static void
+generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
+			    data_reference_p store_dr, tree base, tree pattern,
+			    location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree mem = force_gimple_operand (base, &seq, true, NULL_TREE);
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
+  tree reduction_var_new = copy_ssa_name (reduction_var);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  if (store_dr)
+    {
+      gassign *g = gimple_build_assign (DR_REF (store_dr), reduction_var_new);
+      gimple_seq_add_stmt (&seq, g);
+    }
+
+  generate_reduction_builtin_1 (loop, seq, reduction_var, reduction_var_new,
+				"generated rawmemchr%s\n",
+				TYPE_MODE (TREE_TYPE (TREE_TYPE (base))));
+}
+
+/* Helper function for generate_strlen_builtin(,_using_rawmemchr)  */
+
+static void
+generate_strlen_builtin_1 (loop_p loop, gimple_seq &seq,
+			   tree reduction_var_old, tree reduction_var_new,
+			   machine_mode mode, tree start_len)
+{
+  /* REDUCTION_VAR_NEW has either size type or ptrdiff type and must be
+     converted if types of old and new reduction variable are not compatible. */
+  reduction_var_new = gimple_convert (&seq, TREE_TYPE (reduction_var_old),
+				      reduction_var_new);
+
+  /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start
+     length.  */
+  if (!integer_zerop (start_len))
+    {
+      tree lhs = make_ssa_name (TREE_TYPE (reduction_var_new));
+      gimple *g = gimple_build_assign (lhs, PLUS_EXPR, reduction_var_new,
+				       start_len);
+      gimple_seq_add_stmt (&seq, g);
+      reduction_var_new = lhs;
+    }
+
+  generate_reduction_builtin_1 (loop, seq, reduction_var_old, reduction_var_new,
+				"generated strlen%s\n", mode);
+}
+
+/* Generate a call to strlen and place it before LOOP.  REDUCTION_VAR is
+   replaced with a fresh SSA name representing the result of the call.  */
+
+static void
+generate_strlen_builtin (loop_p loop, tree reduction_var, tree base,
+			 tree start_len, location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree reduction_var_new = make_ssa_name (size_type_node);
+
+  tree mem = force_gimple_operand (base, &seq, true, NULL_TREE);
+  tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
+  gimple *fn_call = gimple_build_call (fn, 1, mem);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  generate_strlen_builtin_1 (loop, seq, reduction_var, reduction_var_new,
+			     QImode, start_len);
+}
+
+/* Generate code in order to mimic the behaviour of strlen but this time over
+   an array of elements with mode different than QI.  REDUCTION_VAR is replaced
+   with a fresh SSA name representing the result, i.e., the length.  */
+
+static void
+generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var,
+					 tree base, tree start_len,
+					 location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree start = force_gimple_operand (base, &seq, true, NULL_TREE);
+  tree zero = build_zero_cst (TREE_TYPE (TREE_TYPE (start)));
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, start, zero);
+  tree end = make_ssa_name (TREE_TYPE (base));
+  gimple_call_set_lhs (fn_call, end);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  /* Determine the number of elements between START and END by
+     evaluating (END - START) / sizeof (*START).  */
+  tree diff = make_ssa_name (ptrdiff_type_node);
+  gimple *diff_stmt = gimple_build_assign (diff, POINTER_DIFF_EXPR, end, start);
+  gimple_seq_add_stmt (&seq, diff_stmt);
+  /* Let SIZE be the size of the pointed-to type of START.  */
+  tree size = gimple_convert (&seq, ptrdiff_type_node,
+			      TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (start))));
+  tree count = make_ssa_name (ptrdiff_type_node);
+  gimple *count_stmt = gimple_build_assign (count, TRUNC_DIV_EXPR, diff, size);
+  gimple_seq_add_stmt (&seq, count_stmt);
+
+  generate_strlen_builtin_1 (loop, seq, reduction_var, count,
+			     TYPE_MODE (TREE_TYPE (TREE_TYPE (base))),
+			     start_len);
+}
+
+static bool
+reduction_var_overflows_first (tree reduction_var, tree load_type)
+{
+  unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
+  unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
+  unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
+  return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
+}
+
+/* Transform loops which mimic the effects of builtins rawmemchr or strlen and
+   replace them accordingly.  For example, a loop of the form
+
+     for (; *p != 42; ++p);
+
+   is replaced by
+
+     p = rawmemchr<MODE> (p, 42);
+
+   under the assumption that rawmemchr is available for a particular MODE.
+   Another example is
+
+     int i;
+     for (i = 42; s[i]; ++i);
+
+   which is replaced by
+
+     i = (int)strlen (&s[42]) + 42;
+
+   for some character array S.  In case array S is not of type character array
+   we end up with
+
+     i = (int)(rawmemchr<MODE> (&s[42], 0) - &s[42]) + 42;
+
+   assuming that rawmemchr is available for a particular MODE.  */
+
+bool
+loop_distribution::transform_reduction_loop (loop_p loop)
+{
+  gimple *reduction_stmt = NULL;
+  data_reference_p load_dr = NULL, store_dr = NULL;
+
+  edge e = single_exit (loop);
+  gcond *cond = safe_dyn_cast <gcond *> (last_stmt (e->src));
+  if (!cond)
+    return false;
+  /* Ensure loop condition is an (in)equality test and loop is exited either if
+     the inequality test fails or the equality test succeeds.  */
+  if (!(e->flags & EDGE_FALSE_VALUE && gimple_cond_code (cond) == NE_EXPR)
+      && !(e->flags & EDGE_TRUE_VALUE && gimple_cond_code (cond) == EQ_EXPR))
+    return false;
+  /* A limitation of the current implementation is that we only support
+     constant patterns in (in)equality tests.  */
+  tree pattern = gimple_cond_rhs (cond);
+  if (TREE_CODE (pattern) != INTEGER_CST)
+    return false;
+
+  basic_block *bbs = get_loop_body (loop);
+
+  for (unsigned i = 0, ninsns = 0; i < loop->num_nodes; ++i)
+    {
+      basic_block bb = bbs[i];
+
+      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
+	   gsi_next_nondebug (&bsi))
+	{
+	  gphi *phi = bsi.phi ();
+	  if (virtual_operand_p (gimple_phi_result (phi)))
+	    continue;
+	  if (stmt_has_scalar_dependences_outside_loop (loop, phi))
+	    {
+	      if (reduction_stmt)
+		return false;
+	      reduction_stmt = phi;
+	    }
+	}
+
+      for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
+	   gsi_next_nondebug (&bsi), ++ninsns)
+	{
+	  /* Bail out early for loops which are unlikely to match.  */
+	  if (ninsns > 16)
+	    return false;
+	  gimple *stmt = gsi_stmt (bsi);
+	  if (gimple_clobber_p (stmt))
+	    continue;
+	  if (gimple_code (stmt) == GIMPLE_LABEL)
+	    continue;
+	  if (gimple_has_volatile_ops (stmt))
+	    return false;
+	  if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
+	    {
+	      if (reduction_stmt)
+		return false;
+	      reduction_stmt = stmt;
+	    }
+	}
+    }
+
+  /* A limitation of the current implementation is that we require a reduction
+     statement.  Therefore, loops without a reduction statement as in the
+     following are not recognized:
+     int *p;
+     void foo (void) { for (; *p; ++p); } */
+  if (reduction_stmt == NULL)
+    return false;
+
+  /* Reduction variables are guaranteed to be SSA names.  */
+  tree reduction_var;
+  switch (gimple_code (reduction_stmt))
+    {
+    case GIMPLE_ASSIGN:
+    case GIMPLE_PHI:
+      reduction_var = gimple_get_lhs (reduction_stmt);
+      break;
+    default:
+      /* Bail out e.g. for GIMPLE_CALL.  */
+      return false;
+    }
+
+  struct graph *rdg = build_rdg (loop, NULL);
+  if (rdg == NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "Loop %d not transformed: failed to build the RDG.\n",
+		 loop->num);
+
+      return false;
+    }
+  auto_bitmap partition_stmts;
+  bitmap_set_range (partition_stmts, 0, rdg->n_vertices);
+  find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr);
+  free_rdg (rdg);
+
+  /* Bail out if there is no single load.  */
+  if (load_dr == NULL)
+    return false;
+
+  /* Reaching this point we have a loop with a single reduction variable,
+     a single load, and an optional single store.  */
+
+  tree load_ref = DR_REF (load_dr);
+  tree load_type = TREE_TYPE (load_ref);
+  tree load_access_base = build_fold_addr_expr (load_ref);
+  tree load_access_size = TYPE_SIZE_UNIT (load_type);
+  affine_iv load_iv, reduction_iv;
+
+  if (!INTEGRAL_TYPE_P (load_type)
+      || !type_has_mode_precision_p (load_type))
+    return false;
+
+  /* We already ensured that the loop condition tests for (in)equality where the
+     rhs is a constant pattern. Now ensure that the lhs is the result of the
+     load.  */
+  if (gimple_cond_lhs (cond) != gimple_assign_lhs (DR_STMT (load_dr)))
+    return false;
+
+  /* Bail out if no affine induction variable with constant step can be
+     determined.  */
+  if (!simple_iv (loop, loop, load_access_base, &load_iv, false))
+    return false;
+
+  /* Bail out if memory accesses are not consecutive or not growing.  */
+  if (!operand_equal_p (load_iv.step, load_access_size, 0))
+    return false;
+
+  if (!simple_iv (loop, loop, reduction_var, &reduction_iv, false))
+    return false;
+
+  /* Handle rawmemchr like loops.  */
+  if (operand_equal_p (load_iv.base, reduction_iv.base)
+      && operand_equal_p (load_iv.step, reduction_iv.step))
+    {
+      if (store_dr)
+	{
+	  /* Ensure that we store to X and load from X+I where I>0.  */
+	  if (TREE_CODE (load_iv.base) != POINTER_PLUS_EXPR
+	      || !integer_onep (TREE_OPERAND (load_iv.base, 1)))
+	    return false;
+	  tree ptr_base = TREE_OPERAND (load_iv.base, 0);
+	  if (TREE_CODE (ptr_base) != SSA_NAME)
+	    return false;
+	  gimple *def = SSA_NAME_DEF_STMT (ptr_base);
+	  if (!gimple_assign_single_p (def)
+	      || gimple_assign_rhs1 (def) != DR_REF (store_dr))
+	    return false;
+	  /* Ensure that the reduction value is stored.  */
+	  if (gimple_assign_rhs1 (DR_STMT (store_dr)) != reduction_var)
+	    return false;
+	}
+      /* Bail out if target does not provide rawmemchr for a certain mode.  */
+      machine_mode mode = TYPE_MODE (load_type);
+      if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
+	return false;
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      generate_rawmemchr_builtin (loop, reduction_var, store_dr, load_iv.base,
+				  pattern, loc);
+      return true;
+    }
+
+  /* Handle strlen like loops.  */
+  if (store_dr == NULL
+      && integer_zerop (pattern)
+      && TREE_CODE (reduction_iv.base) == INTEGER_CST
+      && TREE_CODE (reduction_iv.step) == INTEGER_CST
+      && integer_onep (reduction_iv.step))
+    {
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      if (TYPE_MODE (load_type) == TYPE_MODE (char_type_node)
+	  && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node)
+	  && (TYPE_PRECISION (size_type_node) == 64
+	      || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
+		  && TYPE_PRECISION (TREE_TYPE (reduction_var)) <= TYPE_PRECISION (size_type_node))))
+	generate_strlen_builtin (loop, reduction_var, load_iv.base,
+				 reduction_iv.base, loc);
+      else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type))
+	       != CODE_FOR_nothing
+	       && (TYPE_PRECISION (ptrdiff_type_node) == 64
+		   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
+		       && reduction_var_overflows_first (reduction_var, load_type))))
+	generate_strlen_builtin_using_rawmemchr (loop, reduction_var,
+						 load_iv.base,
+						 reduction_iv.base, loc);
+      else
+	return false;
+      return true;
+    }
+
+  return false;
+}
+
 /* Given innermost LOOP, return the outermost enclosing loop that forms a
    perfect loop nest.  */
 
@@ -3321,10 +3715,27 @@ loop_distribution::execute (function *fun)
 	      && !optimize_loop_for_speed_p (loop)))
 	continue;
 
-      /* Don't distribute loop if niters is unknown.  */
+      /* If niters is unknown don't distribute loop but rather try to transform
+	 it to a call to a builtin.  */
       tree niters = number_of_latch_executions (loop);
       if (niters == NULL_TREE || niters == chrec_dont_know)
-	continue;
+	{
+	  datarefs_vec.create (20);
+	  if (transform_reduction_loop (loop))
+	    {
+	      changed = true;
+	      loops_to_be_destroyed.safe_push (loop);
+	      if (dump_enabled_p ())
+		{
+		  dump_user_location_t loc = find_loop_location (loop);
+		  dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
+				   loc, "Loop %d transformed into a builtin.\n",
+				   loop->num);
+		}
+	    }
+	  free_data_refs (datarefs_vec);
+	  continue;
+	}
 
       /* Get the perfect loop nest for distribution.  */
       loop = prepare_perfect_loop_nest (loop);
Stefan Schulze Frielinghaus Aug. 6, 2021, 2:02 p.m. UTC | #18
ping

On Fri, Jun 25, 2021 at 12:23:32PM +0200, Stefan Schulze Frielinghaus wrote:
> On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote:
> > On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus
> > <stefansf@linux.ibm.com> wrote:
> > >
> > > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote:
> > > [...]
> > > > > but we won't ever arrive here because of the niters condition.  But
> > > > > yes, doing the pattern matching in the innermost loop processing code
> > > > > looks good to me - for the specific case it would be
> > > > >
> > > > >       /* Don't distribute loop if niters is unknown.  */
> > > > >       tree niters = number_of_latch_executions (loop);
> > > > >       if (niters == NULL_TREE || niters == chrec_dont_know)
> > > > > ---> here?
> > > > >         continue;
> > > >
> > > > Right, please find attached a new version of the patch where everything
> > > > is included in the loop distribution pass.  I will do a bootstrap and
> > > > regtest on IBM Z over night.  If you give me green light I will also do
> > > > the same on x86_64.
> > >
> > > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
> > > least the ldist-strlen testcase).  If you are Ok with the patch, then I
> > > would rebase and run the testsuites again and post a patch series
> > > including the rawmemchr implementation for IBM Z.
> > 
> > @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop
> > *loop, vec<gimple *> *work_list)
> >    return work_list->length () > 0;
> >  }
> > 
> > +static void
> > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
> > +                           data_reference_p store_dr, tree base, tree pattern,
> > +                           location_t loc)
> > +{
> > 
> > this new function needs a comment.  Applies to all of the new ones, btw.
> 
> Done.
> 
> > +  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
> > +                      && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern));
> > 
> > this looks fragile and is probably unnecessary as well.
> > 
> > +  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
> > 
> > in general you want types_compatible_p () checks which for pointers means
> > all pointers are compatible ...
> 
> True, I removed both asserts.
> 
> > (skipping stuff)
> > 
> > @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
> >               && !optimize_loop_for_speed_p (loop)))
> >         continue;
> > 
> > -      /* Don't distribute loop if niters is unknown.  */
> > +      /* If niters is unknown don't distribute loop but rather try to transform
> > +        it to a call to a builtin.  */
> >        tree niters = number_of_latch_executions (loop);
> >        if (niters == NULL_TREE || niters == chrec_dont_know)
> > -       continue;
> > +       {
> > +         if (transform_reduction_loop (loop))
> > +           {
> > +             changed = true;
> > +             loops_to_be_destroyed.safe_push (loop);
> > +             if (dump_file)
> > +               fprintf (dump_file, "Loop %d transformed into a
> > builtin.\n", loop->num);
> > +           }
> > +         continue;
> > +       }
> > 
> > please look at
> > 
> >           if (nb_generated_loops + nb_generated_calls > 0)
> >             {
> >               changed = true;
> >               if (dump_enabled_p ())
> >                 dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >                                  loc, "Loop%s %d distributed: split to
> > %d loops "
> >                                  "and %d library calls.\n", str, loop->num,
> >                                  nb_generated_loops, nb_generated_calls);
> > 
> > and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the
> > transforms are reported with -fopt-info-loop
> 
> Done.
> 
> > +
> > +  return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var);
> > +}
> > 
> > what's the point in tail-calling here and visually splitting the
> > function in half?
> 
> At first I thought that this was more pleasant since in
> transform_reduction_loop_1 it is settled that we have a single load,
> store, and reduction variable.  After refactoring this isn't true
> anymore, so I inlined the function and made this clear via a comment.
> 
> > 
> > (sorry for picking random pieces now ;))
> > 
> > +      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
> > +          gsi_next (&bsi), ++ninsns)
> > +       {
> > 
> > this counts debug insns, I guess you want gsi_next_nondebug at least.
> > not sure why you are counting PHIs at all btw - for the loops you match
> > you are expecting at most two, one IV and eventually one for the virtual
> > operand of the store?
> 
> Yes, I removed the counting for the phi loop and changed to
> gsi_next_nondebug for both loops.
> 
> > 
> > +         if (gimple_has_volatile_ops (phi))
> > +           return false;
> > 
> > PHIs never have volatile ops.
> > 
> > +         if (gimple_clobber_p (phi))
> > +           continue;
> > 
> > or are clobbers.
> 
> Removed both.
> 
> > Btw, can you factor out a helper from find_single_drs working on a
> > stmt to reduce code duplication?
> 
> Ahh sorry for that.  I've already done this in one of my first patches
> but didn't copy that over.  Although my changes do not require a RDG the
> whole pass is based upon this data structure.  Therefore, in order to
> share more code I decided to temporarily build the RDG so that I can
> call into find_single_drs.  Since the graph is rather small I guess the
> overhead is acceptable w.r.t. code sharing.
> 
> struct graph *rdg = build_rdg (loop, NULL);
> if (rdg == NULL)
>   {
>     if (dump_file && (dump_flags & TDF_DETAILS))
>      fprintf (dump_file,
>      	 "Loop %d not transformed: failed to build the RDG.\n",
>      	 loop->num);
> 
>     return false;
>   }
> auto_bitmap partition_stmts;
> bitmap_set_range (partition_stmts, 0, rdg->n_vertices);
> find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr);
> free_rdg (rdg);
> 
> As a side effect of this, I now also have to (de)allocate the class
> member datarefs_vec before/after calling into transform_reduction_loop:
> 
> /* If niters is unknown don't distribute loop but rather try to transform
>    it to a call to a builtin.  */
> tree niters = number_of_latch_executions (loop);
> if (niters == NULL_TREE || niters == chrec_dont_know)
>   {
>     datarefs_vec.create (20);
>     if (transform_reduction_loop (loop))
>       {
>         changed = true;
>         loops_to_be_destroyed.safe_push (loop);
>         if (dump_enabled_p ())
>           {
>             dump_user_location_t loc = find_loop_location (loop);
>             dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>                              loc, "Loop %d transformed into a builtin.\n",
>                              loop->num);
>           }
>       }
>     free_data_refs (datarefs_vec);
>     continue;
>   }
> 
> > 
> > +  tree reduction_var;
> > +  switch (gimple_code (reduction_stmt))
> > +    {
> > +    case GIMPLE_PHI:
> > +      reduction_var = gimple_phi_result (reduction_stmt);
> > +      break;
> > +    case GIMPLE_ASSIGN:
> > +      reduction_var = gimple_assign_lhs (reduction_stmt);
> > +      break;
> > +    default:
> > +      /* Bail out e.g. for GIMPLE_CALL.  */
> > +      return false;
> > 
> > gimple_get_lhs (reduction_stmt); would work for both PHIs
> > and assigns.
> 
> Done.
> 
> > 
> > +  if (reduction_var == NULL)
> > +    return false;
> > 
> > it can never be NULL here.
> 
> True, otherwise the reduction statement wouldn't have a dependence outside
> the loop. => Removed.
> 
> > 
> > +  /* Bail out if this is a bitfield memory reference.  */
> > +  if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF
> > +      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1)))
> > +    return false;
> > ...
> > 
> > I see this is again quite some code copied from find_single_drs, please
> > see how to avoid this much duplication by splitting out helpers.
> 
> Sorry again.  Hope the solution above is more appropriate.
> 
> > 
> > +static bool
> > +transform_reduction_loop_1 (loop_p loop,
> > +                           data_reference_p load_dr,
> > +                           data_reference_p store_dr,
> > +                           tree reduction_var)
> > +{
> > +  tree load_ref = DR_REF (load_dr);
> > +  tree load_type = TREE_TYPE (load_ref);
> > +  tree load_access_base = build_fold_addr_expr (load_ref);
> > +  tree load_access_size = TYPE_SIZE_UNIT (load_type);
> > +  affine_iv load_iv, reduction_iv;
> > +  tree pattern;
> > +
> > +  /* A limitation of the current implementation is that we only support
> > +     constant patterns.  */
> > +  edge e = single_exit (loop);
> > +  gcond *cond_stmt = safe_dyn_cast <gcond *> (last_stmt (e->src));
> > +  if (!cond_stmt)
> > +    return false;
> > 
> > that looks like checks to be done at the start of
> > transform_reduction_loop, not this late.
> 
> Pulled this to the very beginning of transform_reduction_loop.
> 
> > 
> > +  if (gimple_cond_code (cond_stmt) != NE_EXPR
> > +      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (DR_STMT (load_dr))
> > +      || TREE_CODE (pattern) != INTEGER_CST)
> > +    return false;
> > 
> > half of this as well.  Btw, there's no canonicalization for
> > the tests so you have to verify the false edge actually exits
> > the loop and allow for EQ_EXPR in case the false edge does.
> 
> Uh good point.  I added checks for that and pulled most of it to the
> beginning of transform_reduction_loop.
> 
> > 
> > +  /* Handle strlen like loops.  */
> > +  if (store_dr == NULL
> > +      && integer_zerop (pattern)
> > +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > +      && integer_onep (reduction_iv.step)
> > +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> > +    {
> > 
> > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > The iteration
> > only stops when you load a NUL and the increments just wrap along (you're
> > using the pointer IVs to compute the strlen result).  Can't you simply truncate?
> 
> I think truncation is enough as long as no overflow occurs in strlen or
> strlen_using_rawmemchr.
> 
> > For larger than size_type_node (actually larger than ptr_type_node would matter
> > I guess), the argument is that since pointer wrapping would be undefined anyway
> > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > 
> >       TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > (ptr_type_node)
> >        || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > 
> > ?
> 
> Regarding the implementation which makes use of rawmemchr:
> 
> We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> the maximal length we can determine of a string where each character has
> size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> ptrdiff type is undefined, we have to make sure that if an overflow
> occurs, then the reduction variable overflows as well, and that this
> overflow is likewise undefined.  However, I'm not sure anymore whether we want
> to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> this would mean that a single string consumes more than half of the
> virtual addressable memory.  At least for architectures where
> TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> to neglect the case where computing pointer difference may overflow.
> Otherwise we are talking about strings with lengths of multiple
> pebibytes.  For other architectures we might have to be more precise
> and make sure that reduction variable overflows first and that this is
> undefined.
> 
> Thus a conservative condition would be (I assumed that the size of any
> integral type is a power of two, which I'm not sure really holds;
> IIRC the C standard only requires the alignment to be a power of two,
> but not necessarily the size, so I might need to change this):
> 
> /* Compute precision (reduction_var) < precision (ptrdiff_type) - 1 - log2 (sizeof (load_type)),
> or in other words return true if the reduction variable overflows first
> and false otherwise.  */
> 
> static bool
> reduction_var_overflows_first (tree reduction_var, tree load_type)
> {
>   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
>   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
>   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
>   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> }
> 
> TYPE_PRECISION (ptrdiff_type_node) == 64
> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
>     && reduction_var_overflows_first (reduction_var, load_type)
> 
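For reference, a freestanding sketch of the semantics the rawmemchr-based strlen relies on, with uint16_t chosen as an arbitrary example element type; the actual transform emits IFN_RAWMEMCHR plus the pointer arithmetic in GIMPLE rather than calling helpers like these:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Reference semantics of rawmemchr for 16-bit elements: return the address
   of the first element equal to PATTERN, which is assumed to occur.  */
static uint16_t *
rawmemchr_u16 (uint16_t *s, uint16_t pattern)
{
  while (*s != pattern)
    ++s;
  return s;
}

/* Length of a zero-terminated uint16_t array, i.e. the distance in elements
   between the terminator and the start; the generated GIMPLE computes the
   byte difference and divides by the element size, which C's pointer
   subtraction does implicitly.  */
static ptrdiff_t
strlen_u16 (uint16_t *s)
{
  return rawmemchr_u16 (s, 0) - s;
}

int
main (void)
{
  uint16_t buf[] = { 1, 2, 3, 4, 0 };
  printf ("%td\n", strlen_u16 (buf));  /* prints 4 */
  return 0;
}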
> Regarding the implementation which makes use of strlen:
> 
> I'm not sure what it means if strlen is called for a string with a
> length greater than SIZE_MAX.  Therefore, similar to the implementation
> using rawmemchr where we neglect the case of an overflow for 64bit
> architectures, a conservative condition would be:
> 
> TYPE_PRECISION (size_type_node) == 64
> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
>     && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> 
> I still included the overflow-undefined check for the reduction variable in
> order to rule out situations where the reduction variable is unsigned and
> wraps around several times before strlen (or strlen_using_rawmemchr)
> overflows, too.  Maybe this is all theoretical nonsense but I'm afraid of
> uncommon architectures.  Anyhow, while writing this down it becomes clear
> that this deserves a comment, which I will add once it is clear which way
> to go.
> 
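To make the two conservative conditions above concrete, here is a small standalone sketch with plain unsigned integers standing in for TYPE_PRECISION and a bool for TYPE_OVERFLOW_UNDEFINED; the precisions and element sizes used in main are hypothetical examples, not taken from any particular target:

#include <stdbool.h>
#include <stdio.h>

/* Rawmemchr-based variant: either ptrdiff is 64 bits wide, or the reduction
   variable has undefined overflow and overflows before the pointer
   difference can, i.e. its precision is less than
   precision (ptrdiff_type) - 1 - log2 (sizeof (load_type)).  */
static bool
rawmemchr_variant_ok (unsigned prec_ptrdiff, bool overflow_undefined,
                      unsigned prec_reduction_var, unsigned size_exponent)
{
  return prec_ptrdiff == 64
         || (overflow_undefined
             && prec_reduction_var < prec_ptrdiff - 1 - size_exponent);
}

/* Strlen-based variant.  */
static bool
strlen_variant_ok (unsigned prec_size_type, bool overflow_undefined,
                   unsigned prec_reduction_var)
{
  return prec_size_type == 64
         || (overflow_undefined && prec_reduction_var <= prec_size_type);
}

int
main (void)
{
  /* 32-bit ptrdiff_t, uint16_t elements (size 2, so exponent 1), signed
     16-bit reduction variable: 16 < 32 - 1 - 1, accepted.  */
  printf ("%d\n", rawmemchr_variant_ok (32, true, 16, 1));
  /* Same target, but with a signed 32-bit reduction variable: rejected.  */
  printf ("%d\n", rawmemchr_variant_ok (32, true, 32, 1));
  /* 32-bit size_t: a signed 32-bit counter is accepted, an unsigned
     (wrapping) one is not.  */
  printf ("%d %d\n", strlen_variant_ok (32, true, 32),
          strlen_variant_ok (32, false, 32));
  return 0;
}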
> > 
> > +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> > +       {
> > +         const char *msg = G_("assuming signed overflow does not occur "
> > +                              "when optimizing strlen like loop");
> > +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> > +       }
> > 
> > no, please don't add any new strict-overflow warnings ;)
> 
> I just stumbled over code which produces such a warning and thought this
> was a hard requirement :D The new patch doesn't contain it anymore.
> 
> > 
> > The generate_*_builtin routines need some factoring - if you code-generate
> > into a gimple_seq you could use gimple_build () which would do the fold_stmt
> > (not sure why you do that - you should see to fold the call, not necessarily
> > the rest).  The replacement of reduction_var and the dumping could be shared.
> > There's also GET_MODE_NAME for the printing.
> 
> I wasn't really sure which way to go.  Use a gsi, as it is done by
> existing generate_* functions, or make use of gimple_seq.  Since the
> latter internally also uses a gsi, I thought it was better to stick to gsi
> in the first place.  Now, after changing to gimple_seq I see the beauty
> of it :)
> 
> I created two helper functions generate_strlen_builtin_1 and
> generate_reduction_builtin_1 in order to reduce code duplication.
> 
> In function generate_strlen_builtin I changed from using
> builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
> (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
> sure whether my intuition about the difference between implicit and
> explicit builtins is correct.  In builtins.def there is a small example
> given which I would paraphrase as "use builtin_decl_explicit if the
> semantics of the builtin is defined by the C standard; otherwise use
> builtin_decl_implicit" but probably my intuition is wrong?
> 
> Besides that, I'm not sure whether I really have to call
> build_fold_addr_expr, which looks superfluous to me since
> gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:
> 
> tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> gimple *fn_call = gimple_build_call (fn, 1, mem);
> 
> However, since it is also used that way in the context of
> generate_memset_builtin I didn't remove it so far.
> 
> > I think overall the approach is sound now but the details still need work.
> 
> Once again thank you very much for your review.  Really appreciated!
> 
> Cheers,
> Stefan

> commit d24639c895ce3c0f9539570bab7b6510e98b1ffa
> Author: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
> Date:   Wed Mar 17 09:00:06 2021 +0100
> 
>     foo
> 
> diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> index d209a52f823..633f41d9e6d 100644
> --- a/gcc/internal-fn.c
> +++ b/gcc/internal-fn.c
> @@ -2929,6 +2929,35 @@ expand_VEC_CONVERT (internal_fn, gcall *)
>    gcc_unreachable ();
>  }
>  
> +/* Expand IFN_RAWMEMCHR internal function.  */
> +
> +void
> +expand_RAWMEMCHR (internal_fn, gcall *stmt)
> +{
> +  expand_operand ops[3];
> +
> +  tree lhs = gimple_call_lhs (stmt);
> +  if (!lhs)
> +    return;
> +  tree lhs_type = TREE_TYPE (lhs);
> +  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
> +
> +  for (unsigned int i = 0; i < 2; ++i)
> +    {
> +      tree rhs = gimple_call_arg (stmt, i);
> +      tree rhs_type = TREE_TYPE (rhs);
> +      rtx rhs_rtx = expand_normal (rhs);
> +      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
> +    }
> +
> +  insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
> +
> +  expand_insn (icode, 3, ops);
> +  if (!rtx_equal_p (lhs_rtx, ops[0].value))
> +    emit_move_insn (lhs_rtx, ops[0].value);
> +}
> +
>  /* Expand the IFN_UNIQUE function according to its first argument.  */
>  
>  static void
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index daeace7a34e..95c76795648 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -348,6 +348,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>  DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
>  DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
>  DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> +DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
>  
>  /* An unduplicable, uncombinable function.  Generally used to preserve
>     a CFG property in the face of jump threading, tail merging or
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index b192a9d070b..f7c69f914ce 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
>  OPTAB_D (movmem_optab, "movmem$a")
>  OPTAB_D (setmem_optab, "setmem$a")
>  OPTAB_D (strlen_optab, "strlen$a")
> +OPTAB_D (rawmemchr_optab, "rawmemchr$I$a")
>  
>  OPTAB_DC(fma_optab, "fma$a4", FMA)
>  OPTAB_D (fms_optab, "fms$a4")
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
> new file mode 100644
> index 00000000000..6db62d7644d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
> @@ -0,0 +1,72 @@
> +/* { dg-do run { target s390x-*-* } } */
> +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
> +/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */
> +
> +/* Rawmemchr pattern: reduction stmt but no store */
> +
> +#include <stdint.h>
> +#include <assert.h>
> +
> +typedef __SIZE_TYPE__ size_t;
> +extern void* malloc (size_t);
> +extern void* memset (void*, int, size_t);
> +
> +#define test(T, pattern)   \
> +__attribute__((noinline))  \
> +T *test_##T (T *p)         \
> +{                          \
> +  while (*p != (T)pattern) \
> +    ++p;                   \
> +  return p;                \
> +}
> +
> +test (uint8_t,  0xab)
> +test (uint16_t, 0xabcd)
> +test (uint32_t, 0xabcdef15)
> +
> +test (int8_t,  0xab)
> +test (int16_t, 0xabcd)
> +test (int32_t, 0xabcdef15)
> +
> +#define run(T, pattern, i)      \
> +{                               \
> +T *q = p;                       \
> +q[i] = (T)pattern;              \
> +assert (test_##T (p) == &q[i]); \
> +q[i] = 0;                       \
> +}
> +
> +int main(void)
> +{
> +  void *p = malloc (1024);
> +  assert (p);
> +  memset (p, 0, 1024);
> +
> +  run (uint8_t, 0xab, 0);
> +  run (uint8_t, 0xab, 1);
> +  run (uint8_t, 0xab, 13);
> +
> +  run (uint16_t, 0xabcd, 0);
> +  run (uint16_t, 0xabcd, 1);
> +  run (uint16_t, 0xabcd, 13);
> +
> +  run (uint32_t, 0xabcdef15, 0);
> +  run (uint32_t, 0xabcdef15, 1);
> +  run (uint32_t, 0xabcdef15, 13);
> +
> +  run (int8_t, 0xab, 0);
> +  run (int8_t, 0xab, 1);
> +  run (int8_t, 0xab, 13);
> +
> +  run (int16_t, 0xabcd, 0);
> +  run (int16_t, 0xabcd, 1);
> +  run (int16_t, 0xabcd, 13);
> +
> +  run (int32_t, 0xabcdef15, 0);
> +  run (int32_t, 0xabcdef15, 1);
> +  run (int32_t, 0xabcdef15, 13);
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
> new file mode 100644
> index 00000000000..00d6ea0f8e9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
> @@ -0,0 +1,83 @@
> +/* { dg-do run { target s390x-*-* } } */
> +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
> +/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */
> +
> +/* Rawmemchr pattern: reduction stmt and store */
> +
> +#include <stdint.h>
> +#include <assert.h>
> +
> +typedef __SIZE_TYPE__ size_t;
> +extern void* malloc (size_t);
> +extern void* memset (void*, int, size_t);
> +
> +uint8_t *p_uint8_t;
> +uint16_t *p_uint16_t;
> +uint32_t *p_uint32_t;
> +
> +int8_t *p_int8_t;
> +int16_t *p_int16_t;
> +int32_t *p_int32_t;
> +
> +#define test(T, pattern)    \
> +__attribute__((noinline))   \
> +T *test_##T (void)          \
> +{                           \
> +  while (*p_##T != pattern) \
> +    ++p_##T;                \
> +  return p_##T;             \
> +}
> +
> +test (uint8_t,  0xab)
> +test (uint16_t, 0xabcd)
> +test (uint32_t, 0xabcdef15)
> +
> +test (int8_t,  (int8_t)0xab)
> +test (int16_t, (int16_t)0xabcd)
> +test (int32_t, (int32_t)0xabcdef15)
> +
> +#define run(T, pattern, i) \
> +{                          \
> +T *q = p;                  \
> +q[i] = pattern;            \
> +p_##T = p;                 \
> +T *r = test_##T ();        \
> +assert (r == p_##T);       \
> +assert (r == &q[i]);       \
> +q[i] = 0;                  \
> +}
> +
> +int main(void)
> +{
> +  void *p = malloc (1024);
> +  assert (p);
> +  memset (p, '\0', 1024);
> +
> +  run (uint8_t, 0xab, 0);
> +  run (uint8_t, 0xab, 1);
> +  run (uint8_t, 0xab, 13);
> +
> +  run (uint16_t, 0xabcd, 0);
> +  run (uint16_t, 0xabcd, 1);
> +  run (uint16_t, 0xabcd, 13);
> +
> +  run (uint32_t, 0xabcdef15, 0);
> +  run (uint32_t, 0xabcdef15, 1);
> +  run (uint32_t, 0xabcdef15, 13);
> +
> +  run (int8_t, (int8_t)0xab, 0);
> +  run (int8_t, (int8_t)0xab, 1);
> +  run (int8_t, (int8_t)0xab, 13);
> +
> +  run (int16_t, (int16_t)0xabcd, 0);
> +  run (int16_t, (int16_t)0xabcd, 1);
> +  run (int16_t, (int16_t)0xabcd, 13);
> +
> +  run (int32_t, (int32_t)0xabcdef15, 0);
> +  run (int32_t, (int32_t)0xabcdef15, 1);
> +  run (int32_t, (int32_t)0xabcdef15, 13);
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
> new file mode 100644
> index 00000000000..918b60099e4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
> @@ -0,0 +1,100 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
> +/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 4 "ldist" } } */
> +/* { dg-final { scan-tree-dump-times "generated strlenHI\n" 4 "ldist" { target s390x-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "generated strlenSI\n" 4 "ldist" { target s390x-*-* } } } */
> +
> +#include <stdint.h>
> +#include <assert.h>
> +
> +typedef __SIZE_TYPE__ size_t;
> +extern void* malloc (size_t);
> +extern void* memset (void*, int, size_t);
> +
> +#define test(T, U)        \
> +__attribute__((noinline)) \
> +U test_##T##U (T *s)      \
> +{                         \
> +  U i;                    \
> +  for (i=0; s[i]; ++i);   \
> +  return i;               \
> +}
> +
> +test (uint8_t,  size_t)
> +test (uint16_t, size_t)
> +test (uint32_t, size_t)
> +test (uint8_t,  int)
> +test (uint16_t, int)
> +test (uint32_t, int)
> +
> +test (int8_t,  size_t)
> +test (int16_t, size_t)
> +test (int32_t, size_t)
> +test (int8_t,  int)
> +test (int16_t, int)
> +test (int32_t, int)
> +
> +#define run(T, U, i)             \
> +{                                \
> +T *q = p;                        \
> +q[i] = 0;                        \
> +assert (test_##T##U (p) == i);   \
> +memset (&q[i], 0xf, sizeof (T)); \
> +}
> +
> +int main(void)
> +{
> +  void *p = malloc (1024);
> +  assert (p);
> +  memset (p, 0xf, 1024);
> +
> +  run (uint8_t, size_t, 0);
> +  run (uint8_t, size_t, 1);
> +  run (uint8_t, size_t, 13);
> +
> +  run (int8_t, size_t, 0);
> +  run (int8_t, size_t, 1);
> +  run (int8_t, size_t, 13);
> +
> +  run (uint8_t, int, 0);
> +  run (uint8_t, int, 1);
> +  run (uint8_t, int, 13);
> +
> +  run (int8_t, int, 0);
> +  run (int8_t, int, 1);
> +  run (int8_t, int, 13);
> +
> +  run (uint16_t, size_t, 0);
> +  run (uint16_t, size_t, 1);
> +  run (uint16_t, size_t, 13);
> +
> +  run (int16_t, size_t, 0);
> +  run (int16_t, size_t, 1);
> +  run (int16_t, size_t, 13);
> +
> +  run (uint16_t, int, 0);
> +  run (uint16_t, int, 1);
> +  run (uint16_t, int, 13);
> +
> +  run (int16_t, int, 0);
> +  run (int16_t, int, 1);
> +  run (int16_t, int, 13);
> +
> +  run (uint32_t, size_t, 0);
> +  run (uint32_t, size_t, 1);
> +  run (uint32_t, size_t, 13);
> +
> +  run (int32_t, size_t, 0);
> +  run (int32_t, size_t, 1);
> +  run (int32_t, size_t, 13);
> +
> +  run (uint32_t, int, 0);
> +  run (uint32_t, int, 1);
> +  run (uint32_t, int, 13);
> +
> +  run (int32_t, int, 0);
> +  run (int32_t, int, 1);
> +  run (int32_t, int, 13);
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
> new file mode 100644
> index 00000000000..e25d6ea5b56
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
> @@ -0,0 +1,58 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
> +/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 3 "ldist" } } */
> +
> +#include <assert.h>
> +
> +typedef __SIZE_TYPE__ size_t;
> +extern void* malloc (size_t);
> +extern void* memset (void*, int, size_t);
> +
> +__attribute__((noinline))
> +int test_pos (char *s)
> +{
> +  int i;
> +  for (i=42; s[i]; ++i);
> +  return i;
> +}
> +
> +__attribute__((noinline))
> +int test_neg (char *s)
> +{
> +  int i;
> +  for (i=-42; s[i]; ++i);
> +  return i;
> +}
> +
> +__attribute__((noinline))
> +int test_including_null_char (char *s)
> +{
> +  int i;
> +  for (i=1; s[i-1]; ++i);
> +  return i;
> +}
> +
> +int main(void)
> +{
> +  void *p = malloc (1024);
> +  assert (p);
> +  memset (p, 0xf, 1024);
> +  char *s = (char *)p + 100;
> +
> +  s[42+13] = 0;
> +  assert (test_pos (s) == 42+13);
> +  s[42+13] = 0xf;
> +
> +  s[13] = 0;
> +  assert (test_neg (s) == 13);
> +  s[13] = 0xf;
> +
> +  s[-13] = 0;
> +  assert (test_neg (s) == -13);
> +  s[-13] = 0xf;
> +
> +  s[13] = 0;
> +  assert (test_including_null_char (s) == 13+1);
> +
> +  return 0;
> +}
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index 65aa1df4aba..6bd4bc8588a 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -116,6 +116,10 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-eh.h"
>  #include "gimple-fold.h"
>  #include "tree-affine.h"
> +#include "intl.h"
> +#include "rtl.h"
> +#include "memmodel.h"
> +#include "optabs.h"
>  
>  
>  #define MAX_DATAREFS_NUM \
> @@ -650,6 +654,10 @@ class loop_distribution
>  		       control_dependences *cd, int *nb_calls, bool *destroy_p,
>  		       bool only_patterns_p);
>  
> +  /* Transform loops which mimic the effects of builtins rawmemchr or strlen and
> +     replace them accordingly.  */
> +  bool transform_reduction_loop (loop_p loop);
> +
>    /* Compute topological order for basic blocks.  Topological order is
>       needed because data dependence is computed for data references in
>       lexicographical order.  */
> @@ -1490,14 +1498,14 @@ loop_distribution::build_rdg_partition_for_vertex (struct graph *rdg, int v)
>     data references.  */
>  
>  static bool
> -find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
> +find_single_drs (class loop *loop, struct graph *rdg, const bitmap &partition_stmts,
>  		 data_reference_p *dst_dr, data_reference_p *src_dr)
>  {
>    unsigned i;
>    data_reference_p single_ld = NULL, single_st = NULL;
>    bitmap_iterator bi;
>  
> -  EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
> +  EXECUTE_IF_SET_IN_BITMAP (partition_stmts, 0, i, bi)
>      {
>        gimple *stmt = RDG_STMT (rdg, i);
>        data_reference_p dr;
> @@ -1538,44 +1546,47 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
>  	}
>      }
>  
> -  if (!single_st)
> -    return false;
> -
> -  /* Bail out if this is a bitfield memory reference.  */
> -  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> -      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
> +  if (!single_ld && !single_st)
>      return false;
>  
> -  /* Data reference must be executed exactly once per iteration of each
> -     loop in the loop nest.  We only need to check dominance information
> -     against the outermost one in a perfect loop nest because a bb can't
> -     dominate outermost loop's latch without dominating inner loop's.  */
> -  basic_block bb_st = gimple_bb (DR_STMT (single_st));
> -  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> -    return false;
> +  basic_block bb_ld = NULL;
> +  basic_block bb_st = NULL;
>  
>    if (single_ld)
>      {
> -      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> -      /* Direct aggregate copy or via an SSA name temporary.  */
> -      if (load != store
> -	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> -	return false;
> -
>        /* Bail out if this is a bitfield memory reference.  */
>        if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
>  	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
>  	return false;
>  
> -      /* Load and store must be in the same loop nest.  */
> -      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
> -      if (bb_st->loop_father != bb_ld->loop_father)
> +      /* Data reference must be executed exactly once per iteration of each
> +	 loop in the loop nest.  We only need to check dominance information
> +	 against the outermost one in a perfect loop nest because a bb can't
> +	 dominate outermost loop's latch without dominating inner loop's.  */
> +      bb_ld = gimple_bb (DR_STMT (single_ld));
> +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> +	return false;
> +    }
> +
> +  if (single_st)
> +    {
> +      /* Bail out if this is a bitfield memory reference.  */
> +      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
> +	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
>  	return false;
>  
>        /* Data reference must be executed exactly once per iteration.
> -	 Same as single_st, we only need to check against the outermost
> +	 Same as single_ld, we only need to check against the outermost
>  	 loop.  */
> -      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
> +      bb_st = gimple_bb (DR_STMT (single_st));
> +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
> +	return false;
> +    }
> +
> +  if (single_ld && single_st)
> +    {
> +      /* Load and store must be in the same loop nest.  */
> +      if (bb_st->loop_father != bb_ld->loop_father)
>  	return false;
>  
>        edge e = single_exit (bb_st->loop_father);
> @@ -1850,9 +1861,19 @@ loop_distribution::classify_partition (loop_p loop,
>      return has_reduction;
>  
>    /* Find single load/store data references for builtin partition.  */
> -  if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
> +  if (!find_single_drs (loop, rdg, partition->stmts, &single_st, &single_ld)
> +      || !single_st)
>      return has_reduction;
>  
> +  if (single_ld && single_st)
> +    {
> +      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
> +      /* Direct aggregate copy or via an SSA name temporary.  */
> +      if (load != store
> +	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
> +	return has_reduction;
> +    }
> +
>    partition->loc = gimple_location (DR_STMT (single_st));
>  
>    /* Classify the builtin kind.  */
> @@ -3257,6 +3278,379 @@ find_seed_stmts_for_distribution (class loop *loop, vec<gimple *> *work_list)
>    return work_list->length () > 0;
>  }
>  
> +/* A helper function for generate_{rawmemchr,strlen}_builtin functions in order
> +   to place new statements SEQ before LOOP and replace the old reduction
> +   variable with the new one.  */
> +
> +static void
> +generate_reduction_builtin_1 (loop_p loop, gimple_seq &seq,
> +			      tree reduction_var_old, tree reduction_var_new,
> +			      const char *info, machine_mode load_mode)
> +{
> +  /* Place new statements before LOOP.  */
> +  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
> +  gsi_insert_seq_after (&gsi, seq, GSI_CONTINUE_LINKING);
> +
> +  /* Replace old reduction variable with new one.  */
> +  imm_use_iterator iter;
> +  gimple *stmt;
> +  use_operand_p use_p;
> +  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var_old)
> +    {
> +      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> +	SET_USE (use_p, reduction_var_new);
> +
> +      update_stmt (stmt);
> +    }
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, info, GET_MODE_NAME (load_mode));
> +}
> +
> +/* Generate a call to rawmemchr and place it before LOOP.  REDUCTION_VAR is
> +   replaced with a fresh SSA name representing the result of the call.  */
> +
> +static void
> +generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
> +			    data_reference_p store_dr, tree base, tree pattern,
> +			    location_t loc)
> +{
> +  gimple_seq seq = NULL;
> +
> +  tree mem = force_gimple_operand (base, &seq, true, NULL_TREE);
> +  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
> +  tree reduction_var_new = copy_ssa_name (reduction_var);
> +  gimple_call_set_lhs (fn_call, reduction_var_new);
> +  gimple_set_location (fn_call, loc);
> +  gimple_seq_add_stmt (&seq, fn_call);
> +
> +  if (store_dr)
> +    {
> +      gassign *g = gimple_build_assign (DR_REF (store_dr), reduction_var_new);
> +      gimple_seq_add_stmt (&seq, g);
> +    }
> +
> +  generate_reduction_builtin_1 (loop, seq, reduction_var, reduction_var_new,
> +				"generated rawmemchr%s\n",
> +				TYPE_MODE (TREE_TYPE (TREE_TYPE (base))));
> +}
> +
> +/* Helper function for generate_strlen_builtin(,_using_rawmemchr)  */
> +
> +static void
> +generate_strlen_builtin_1 (loop_p loop, gimple_seq &seq,
> +			   tree reduction_var_old, tree reduction_var_new,
> +			   machine_mode mode, tree start_len)
> +{
> +  /* REDUCTION_VAR_NEW has either size type or ptrdiff type and must be
> +     converted if types of old and new reduction variable are not compatible. */
> +  reduction_var_new = gimple_convert (&seq, TREE_TYPE (reduction_var_old),
> +				      reduction_var_new);
> +
> +  /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start
> +     length.  */
> +  if (!integer_zerop (start_len))
> +    {
> +      tree lhs = make_ssa_name (TREE_TYPE (reduction_var_new));
> +      gimple *g = gimple_build_assign (lhs, PLUS_EXPR, reduction_var_new,
> +				       start_len);
> +      gimple_seq_add_stmt (&seq, g);
> +      reduction_var_new = lhs;
> +    }
> +
> +  generate_reduction_builtin_1 (loop, seq, reduction_var_old, reduction_var_new,
> +				"generated strlen%s\n", mode);
> +}
> +
> +/* Generate a call to strlen and place it before LOOP.  REDUCTION_VAR is
> +   replaced with a fresh SSA name representing the result of the call.  */
> +
> +static void
> +generate_strlen_builtin (loop_p loop, tree reduction_var, tree base,
> +			 tree start_len, location_t loc)
> +{
> +  gimple_seq seq = NULL;
> +
> +  tree reduction_var_new = make_ssa_name (size_type_node);
> +
> +  tree mem = force_gimple_operand (base, &seq, true, NULL_TREE);
> +  tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> +  gimple *fn_call = gimple_build_call (fn, 1, mem);
> +  gimple_call_set_lhs (fn_call, reduction_var_new);
> +  gimple_set_location (fn_call, loc);
> +  gimple_seq_add_stmt (&seq, fn_call);
> +
> +  generate_strlen_builtin_1 (loop, seq, reduction_var, reduction_var_new,
> +			     QImode, start_len);
> +}
> +
> +/* Generate code in order to mimic the behaviour of strlen but this time over
> +   an array of elements with mode different than QI.  REDUCTION_VAR is replaced
> +   with a fresh SSA name representing the result, i.e., the length.  */
> +
> +static void
> +generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var,
> +					 tree base, tree start_len,
> +					 location_t loc)
> +{
> +  gimple_seq seq = NULL;
> +
> +  tree start = force_gimple_operand (base, &seq, true, NULL_TREE);
> +  tree zero = build_zero_cst (TREE_TYPE (TREE_TYPE (start)));
> +  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, start, zero);
> +  tree end = make_ssa_name (TREE_TYPE (base));
> +  gimple_call_set_lhs (fn_call, end);
> +  gimple_set_location (fn_call, loc);
> +  gimple_seq_add_stmt (&seq, fn_call);
> +
> +  /* Determine the number of elements between START and END by
> +     evaluating (END - START) / sizeof (*START).  */
> +  tree diff = make_ssa_name (ptrdiff_type_node);
> +  gimple *diff_stmt = gimple_build_assign (diff, POINTER_DIFF_EXPR, end, start);
> +  gimple_seq_add_stmt (&seq, diff_stmt);
> +  /* Let SIZE be the size of the pointed-to type of START.  */
> +  tree size = gimple_convert (&seq, ptrdiff_type_node,
> +			      TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (start))));
> +  tree count = make_ssa_name (ptrdiff_type_node);
> +  gimple *count_stmt = gimple_build_assign (count, TRUNC_DIV_EXPR, diff, size);
> +  gimple_seq_add_stmt (&seq, count_stmt);
> +
> +  generate_strlen_builtin_1 (loop, seq, reduction_var, count,
> +			     TYPE_MODE (TREE_TYPE (TREE_TYPE (base))),
> +			     start_len);
> +}
> +
> +static bool
> +reduction_var_overflows_first (tree reduction_var, tree load_type)
> +{
> +  unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> +  unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
> +  unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
> +  return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> +}
> +
> +/* Transform loops which mimic the effects of builtins rawmemchr or strlen and
> +   replace them accordingly.  For example, a loop of the form
> +
> +     for (; *p != 42; ++p);
> +
> +   is replaced by
> +
> +     p = rawmemchr<MODE> (p, 42);
> +
> +   under the assumption that rawmemchr is available for a particular MODE.
> +   Another example is
> +
> +     int i;
> +     for (i = 42; s[i]; ++i);
> +
> +   which is replaced by
> +
> +     i = (int)strlen (&s[42]) + 42;
> +
> +   for some character array S.  In case array S is not of type character array
> +   we end up with
> +
> +     i = (int)(rawmemchr<MODE> (&s[42], 0) - &s[42]) + 42;
> +
> +   assuming that rawmemchr is available for a particular MODE.  */
> +
> +bool
> +loop_distribution::transform_reduction_loop (loop_p loop)
> +{
> +  gimple *reduction_stmt = NULL;
> +  data_reference_p load_dr = NULL, store_dr = NULL;
> +
> +  edge e = single_exit (loop);
> +  gcond *cond = safe_dyn_cast <gcond *> (last_stmt (e->src));
> +  if (!cond)
> +    return false;
> +  /* Ensure loop condition is an (in)equality test and loop is exited either if
> +     the inequality test fails or the equality test succeeds.  */
> +  if (!(e->flags & EDGE_FALSE_VALUE && gimple_cond_code (cond) == NE_EXPR)
> +      && !(e->flags & EDGE_TRUE_VALUE && gimple_cond_code (cond) == EQ_EXPR))
> +    return false;
> +  /* A limitation of the current implementation is that we only support
> +     constant patterns in (in)equality tests.  */
> +  tree pattern = gimple_cond_rhs (cond);
> +  if (TREE_CODE (pattern) != INTEGER_CST)
> +    return false;
> +
> +  basic_block *bbs = get_loop_body (loop);
> +
> +  for (unsigned i = 0, ninsns = 0; i < loop->num_nodes; ++i)
> +    {
> +      basic_block bb = bbs[i];
> +
> +      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
> +	   gsi_next_nondebug (&bsi))
> +	{
> +	  gphi *phi = bsi.phi ();
> +	  if (virtual_operand_p (gimple_phi_result (phi)))
> +	    continue;
> +	  if (stmt_has_scalar_dependences_outside_loop (loop, phi))
> +	    {
> +	      if (reduction_stmt)
> +		return false;
> +	      reduction_stmt = phi;
> +	    }
> +	}
> +
> +      for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
> +	   gsi_next_nondebug (&bsi), ++ninsns)
> +	{
> +	  /* Bail out early for loops which are unlikely to match.  */
> +	  if (ninsns > 16)
> +	    return false;
> +	  gimple *stmt = gsi_stmt (bsi);
> +	  if (gimple_clobber_p (stmt))
> +	    continue;
> +	  if (gimple_code (stmt) == GIMPLE_LABEL)
> +	    continue;
> +	  if (gimple_has_volatile_ops (stmt))
> +	    return false;
> +	  if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
> +	    {
> +	      if (reduction_stmt)
> +		return false;
> +	      reduction_stmt = stmt;
> +	    }
> +	}
> +    }
> +
> +  /* A limitation of the current implementation is that we require a reduction
> +     statement.  Therefore, loops without a reduction statement as in the
> +     following are not recognized:
> +     int *p;
> +     void foo (void) { for (; *p; ++p); } */
> +  if (reduction_stmt == NULL)
> +    return false;
> +
> +  /* Reduction variables are guaranteed to be SSA names.  */
> +  tree reduction_var;
> +  switch (gimple_code (reduction_stmt))
> +    {
> +    case GIMPLE_ASSIGN:
> +    case GIMPLE_PHI:
> +      reduction_var = gimple_get_lhs (reduction_stmt);
> +      break;
> +    default:
> +      /* Bail out e.g. for GIMPLE_CALL.  */
> +      return false;
> +    }
> +
> +  struct graph *rdg = build_rdg (loop, NULL);
> +  if (rdg == NULL)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file,
> +		 "Loop %d not transformed: failed to build the RDG.\n",
> +		 loop->num);
> +
> +      return false;
> +    }
> +  auto_bitmap partition_stmts;
> +  bitmap_set_range (partition_stmts, 0, rdg->n_vertices);
> +  find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr);
> +  free_rdg (rdg);
> +
> +  /* Bail out if there is no single load.  */
> +  if (load_dr == NULL)
> +    return false;
> +
> +  /* Reaching this point we have a loop with a single reduction variable,
> +     a single load, and an optional single store.  */
> +
> +  tree load_ref = DR_REF (load_dr);
> +  tree load_type = TREE_TYPE (load_ref);
> +  tree load_access_base = build_fold_addr_expr (load_ref);
> +  tree load_access_size = TYPE_SIZE_UNIT (load_type);
> +  affine_iv load_iv, reduction_iv;
> +
> +  if (!INTEGRAL_TYPE_P (load_type)
> +      || !type_has_mode_precision_p (load_type))
> +    return false;
> +
> +  /* We already ensured that the loop condition tests for (in)equality where the
> +     rhs is a constant pattern. Now ensure that the lhs is the result of the
> +     load.  */
> +  if (gimple_cond_lhs (cond) != gimple_assign_lhs (DR_STMT (load_dr)))
> +    return false;
> +
> +  /* Bail out if no affine induction variable with constant step can be
> +     determined.  */
> +  if (!simple_iv (loop, loop, load_access_base, &load_iv, false))
> +    return false;
> +
> +  /* Bail out if memory accesses are not consecutive or not growing.  */
> +  if (!operand_equal_p (load_iv.step, load_access_size, 0))
> +    return false;
> +
> +  if (!simple_iv (loop, loop, reduction_var, &reduction_iv, false))
> +    return false;
> +
> +  /* Handle rawmemchr like loops.  */
> +  if (operand_equal_p (load_iv.base, reduction_iv.base)
> +      && operand_equal_p (load_iv.step, reduction_iv.step))
> +    {
> +      if (store_dr)
> +	{
> +	  /* Ensure that we store to X and load from X+I where I>0.  */
> +	  if (TREE_CODE (load_iv.base) != POINTER_PLUS_EXPR
> +	      || !integer_onep (TREE_OPERAND (load_iv.base, 1)))
> +	    return false;
> +	  tree ptr_base = TREE_OPERAND (load_iv.base, 0);
> +	  if (TREE_CODE (ptr_base) != SSA_NAME)
> +	    return false;
> +	  gimple *def = SSA_NAME_DEF_STMT (ptr_base);
> +	  if (!gimple_assign_single_p (def)
> +	      || gimple_assign_rhs1 (def) != DR_REF (store_dr))
> +	    return false;
> +	  /* Ensure that the reduction value is stored.  */
> +	  if (gimple_assign_rhs1 (DR_STMT (store_dr)) != reduction_var)
> +	    return false;
> +	}
> +      /* Bail out if target does not provide rawmemchr for a certain mode.  */
> +      machine_mode mode = TYPE_MODE (load_type);
> +      if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
> +	return false;
> +      location_t loc = gimple_location (DR_STMT (load_dr));
> +      generate_rawmemchr_builtin (loop, reduction_var, store_dr, load_iv.base,
> +				  pattern, loc);
> +      return true;
> +    }
> +
> +  /* Handle strlen like loops.  */
> +  if (store_dr == NULL
> +      && integer_zerop (pattern)
> +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> +      && integer_onep (reduction_iv.step))
> +    {
> +      location_t loc = gimple_location (DR_STMT (load_dr));
> +      if (TYPE_MODE (load_type) == TYPE_MODE (char_type_node)
> +	  && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node)
> +	  && (TYPE_PRECISION (size_type_node) == 64
> +	      || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> +		  && TYPE_PRECISION (TREE_TYPE (reduction_var)) <= TYPE_PRECISION (size_type_node))))
> +	generate_strlen_builtin (loop, reduction_var, load_iv.base,
> +				 reduction_iv.base, loc);
> +      else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type))
> +	       != CODE_FOR_nothing
> +	       && (TYPE_PRECISION (ptrdiff_type_node) == 64
> +		   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> +		       && reduction_var_overflows_first (reduction_var, load_type))))
> +	generate_strlen_builtin_using_rawmemchr (loop, reduction_var,
> +						 load_iv.base,
> +						 reduction_iv.base, loc);
> +      else
> +	return false;
> +      return true;
> +    }
> +
> +  return false;
> +}
> +
>  /* Given innermost LOOP, return the outermost enclosing loop that forms a
>     perfect loop nest.  */
>  
> @@ -3321,10 +3715,27 @@ loop_distribution::execute (function *fun)
>  	      && !optimize_loop_for_speed_p (loop)))
>  	continue;
>  
> -      /* Don't distribute loop if niters is unknown.  */
> +      /* If niters is unknown don't distribute loop but rather try to transform
> +	 it to a call to a builtin.  */
>        tree niters = number_of_latch_executions (loop);
>        if (niters == NULL_TREE || niters == chrec_dont_know)
> -	continue;
> +	{
> +	  datarefs_vec.create (20);
> +	  if (transform_reduction_loop (loop))
> +	    {
> +	      changed = true;
> +	      loops_to_be_destroyed.safe_push (loop);
> +	      if (dump_enabled_p ())
> +		{
> +		  dump_user_location_t loc = find_loop_location (loop);
> +		  dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> +				   loc, "Loop %d transformed into a builtin.\n",
> +				   loop->num);
> +		}
> +	    }
> +	  free_data_refs (datarefs_vec);
> +	  continue;
> +	}
>  
>        /* Get the perfect loop nest for distribution.  */
>        loop = prepare_perfect_loop_nest (loop);
Richard Biener Aug. 20, 2021, 10:35 a.m. UTC | #19
On Fri, Jun 25, 2021 at 12:23 PM Stefan Schulze Frielinghaus
<stefansf@linux.ibm.com> wrote:
>
> On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote:
> > On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus
> > <stefansf@linux.ibm.com> wrote:
> > >
> > > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote:
> > > [...]
> > > > > but we won't ever arrive here because of the niters condition.  But
> > > > > yes, doing the pattern matching in the innermost loop processing code
> > > > > looks good to me - for the specific case it would be
> > > > >
> > > > >       /* Don't distribute loop if niters is unknown.  */
> > > > >       tree niters = number_of_latch_executions (loop);
> > > > >       if (niters == NULL_TREE || niters == chrec_dont_know)
> > > > > ---> here?
> > > > >         continue;
> > > >
> > > > Right, please find attached a new version of the patch where everything
> > > > is included in the loop distribution pass.  I will do a bootstrap and
> > > > regtest on IBM Z over night.  If you give me green light I will also do
> > > > the same on x86_64.
> > >
> > > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
> > > least the ldist-strlen testcase).  If you are Ok with the patch, then I
> > > would rebase and run the testsuites again and post a patch series
> > > including the rawmemchr implementation for IBM Z.
> >
> > @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop
> > *loop, vec<gimple *> *work_list)
> >    return work_list->length () > 0;
> >  }
> >
> > +static void
> > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
> > +                           data_reference_p store_dr, tree base, tree pattern,
> > +                           location_t loc)
> > +{
> >
> > this new function needs a comment.  Applies to all of the new ones, btw.
>
> Done.
>
> > +  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
> > +                      && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern));
> >
> > this looks fragile and is probably unnecessary as well.
> >
> > +  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
> >
> > in general you want types_compatible_p () checks which for pointers means
> > all pointers are compatible ...
>
> True, I removed both asserts.
>
> > (skipping stuff)
> >
> > @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
> >               && !optimize_loop_for_speed_p (loop)))
> >         continue;
> >
> > -      /* Don't distribute loop if niters is unknown.  */
> > +      /* If niters is unknown don't distribute loop but rather try to transform
> > +        it to a call to a builtin.  */
> >        tree niters = number_of_latch_executions (loop);
> >        if (niters == NULL_TREE || niters == chrec_dont_know)
> > -       continue;
> > +       {
> > +         if (transform_reduction_loop (loop))
> > +           {
> > +             changed = true;
> > +             loops_to_be_destroyed.safe_push (loop);
> > +             if (dump_file)
> > +               fprintf (dump_file, "Loop %d transformed into a
> > builtin.\n", loop->num);
> > +           }
> > +         continue;
> > +       }
> >
> > please look at
> >
> >           if (nb_generated_loops + nb_generated_calls > 0)
> >             {
> >               changed = true;
> >               if (dump_enabled_p ())
> >                 dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >                                  loc, "Loop%s %d distributed: split to
> > %d loops "
> >                                  "and %d library calls.\n", str, loop->num,
> >                                  nb_generated_loops, nb_generated_calls);
> >
> > and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the
> > transforms are reported with -fopt-info-loop
>
> Done.
>
> > +
> > +  return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var);
> > +}
> >
> > what's the point in tail-calling here and visually splitting the
> > function in half?
>
> At first I thought that this was more pleasant since in
> transform_reduction_loop_1 it is settled that we have a single load,
> store, and reduction variable.  After refactoring this isn't true
> anymore and I inlined the function and made this clear via a comment.
>
> >
> > (sorry for picking random pieces now ;))
> >
> > +      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
> > +          gsi_next (&bsi), ++ninsns)
> > +       {
> >
> > this counts debug insns, I guess you want gsi_next_nondebug at least.
> > not sure why you are counting PHIs at all btw - for the loops you match
> > you are expecting at most two, one IV and eventually one for the virtual
> > operand of the store?
>
> Yes, I removed the counting for the phi loop and changed to
> gsi_next_nondebug for both loops.
>
> >
> > +         if (gimple_has_volatile_ops (phi))
> > +           return false;
> >
> > PHIs never have volatile ops.
> >
> > +         if (gimple_clobber_p (phi))
> > +           continue;
> >
> > or are clobbers.
>
> Removed both.
>
> > Btw, can you factor out a helper from find_single_drs working on a
> > stmt to reduce code duplication?
>
> Ahh sorry for that.  I've already done this in one of my first patches
> but didn't copy that over.  Although my changes do not require a RDG the
> whole pass is based upon this data structure.  Therefore, in order to
> share more code I decided to temporarily build the RDG so that I can
> call into find_single_drs.  Since the graph is rather small I guess the
> overhead is acceptable w.r.t. code sharing.
>
> struct graph *rdg = build_rdg (loop, NULL);
> if (rdg == NULL)
>   {
>     if (dump_file && (dump_flags & TDF_DETAILS))
>      fprintf (dump_file,
>          "Loop %d not transformed: failed to build the RDG.\n",
>          loop->num);
>
>     return false;
>   }
> auto_bitmap partition_stmts;
> bitmap_set_range (partition_stmts, 0, rdg->n_vertices);
> find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr);
> free_rdg (rdg);
>
> As a side effect of this, I now also have to (de)allocate the class
> member datarefs_vec before/after calling into transform_reduction_loop:
>
> /* If niters is unknown don't distribute loop but rather try to transform
>    it to a call to a builtin.  */
> tree niters = number_of_latch_executions (loop);
> if (niters == NULL_TREE || niters == chrec_dont_know)
>   {
>     datarefs_vec.create (20);
>     if (transform_reduction_loop (loop))
>       {
>         changed = true;
>         loops_to_be_destroyed.safe_push (loop);
>         if (dump_enabled_p ())
>           {
>             dump_user_location_t loc = find_loop_location (loop);
>             dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>                              loc, "Loop %d transformed into a builtin.\n",
>                              loop->num);
>           }
>       }
>     free_data_refs (datarefs_vec);
>     continue;
>   }
>
> >
> > +  tree reduction_var;
> > +  switch (gimple_code (reduction_stmt))
> > +    {
> > +    case GIMPLE_PHI:
> > +      reduction_var = gimple_phi_result (reduction_stmt);
> > +      break;
> > +    case GIMPLE_ASSIGN:
> > +      reduction_var = gimple_assign_lhs (reduction_stmt);
> > +      break;
> > +    default:
> > +      /* Bail out e.g. for GIMPLE_CALL.  */
> > +      return false;
> >
> > gimple_get_lhs (reduction_stmt); would work for both PHIs
> > and assigns.
>
> Done.
>
> >
> > +  if (reduction_var == NULL)
> > +    return false;
> >
> > it can never be NULL here.
>
> True, otherwise the reduction statement wouldn't have a dependence outside
> the loop. => Removed.
>
> >
> > +  /* Bail out if this is a bitfield memory reference.  */
> > +  if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF
> > +      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1)))
> > +    return false;
> > ...
> >
> > I see this is again quite some code copied from find_single_drs, please
> > see how to avoid this much duplication by splitting out helpers.
>
> Sorry again.  Hope the solution above is more appropriate.
>
> >
> > +static bool
> > +transform_reduction_loop_1 (loop_p loop,
> > +                           data_reference_p load_dr,
> > +                           data_reference_p store_dr,
> > +                           tree reduction_var)
> > +{
> > +  tree load_ref = DR_REF (load_dr);
> > +  tree load_type = TREE_TYPE (load_ref);
> > +  tree load_access_base = build_fold_addr_expr (load_ref);
> > +  tree load_access_size = TYPE_SIZE_UNIT (load_type);
> > +  affine_iv load_iv, reduction_iv;
> > +  tree pattern;
> > +
> > +  /* A limitation of the current implementation is that we only support
> > +     constant patterns.  */
> > +  edge e = single_exit (loop);
> > +  gcond *cond_stmt = safe_dyn_cast <gcond *> (last_stmt (e->src));
> > +  if (!cond_stmt)
> > +    return false;
> >
> > that looks like checks to be done at the start of
> > transform_reduction_loop, not this late.
>
> Pulled this to the very beginning of transform_reduction_loop.
>
> >
> > +  if (gimple_cond_code (cond_stmt) != NE_EXPR
> > +      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (DR_STMT (load_dr))
> > +      || TREE_CODE (pattern) != INTEGER_CST)
> > +    return false;
> >
> > half of this as well.  Btw, there's no canonicalization for
> > the tests so you have to verify the false edge actually exits
> > the loop and allow for EQ_EXPR in case the false edge does.
>
> Uh good point.  I added checks for that and pulled most of it to the
> beginning of transform_reduction_loop.
>
> >
> > +  /* Handle strlen like loops.  */
> > +  if (store_dr == NULL
> > +      && integer_zerop (pattern)
> > +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > +      && integer_onep (reduction_iv.step)
> > +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> > +    {
> >
> > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > The iteration
> > only stops when you load a NUL and the increments just wrap along (you're
> > using the pointer IVs to compute the strlen result).  Can't you simply truncate?
>
> I think truncation is enough as long as no overflow occurs in strlen or
> strlen_using_rawmemchr.
>
> > For larger than size_type_node (actually larger than ptr_type_node would matter
> > I guess), the argument is that since pointer wrapping would be undefined anyway
> > the IV cannot wrap either.  Now, the correct check here would IMHO be
> >
> >       TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > (ptr_type_node)
> >        || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> >
> > ?
>
> Regarding the implementation which makes use of rawmemchr:
>
> We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> the maximal length we can determine of a string where each character has
> size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> ptrdiff type is undefined, we have to make sure that if an overflow
> occurs, then the reduction variable overflows as well, and that this
> overflow is likewise undefined.  However, I'm not sure anymore whether we want
> to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> this would mean that a single string consumes more than half of the
> virtual addressable memory.  At least for architectures where
> TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> to neglect the case where computing pointer difference may overflow.
> Otherwise we are talking about strings with lengths of multiple
> pebibytes.  For other architectures we might have to be more precise
> and make sure that reduction variable overflows first and that this is
> undefined.
>
> Thus a conservative condition would be (I assumed that the size of any
> integral type is a power of two, which I'm not sure really holds;
> IIRC the C standard only requires the alignment to be a power of two,
> but not necessarily the size, so I might need to change this):
>
> /* Compute precision (reduction_var) < precision (ptrdiff_type) - 1 - log2 (sizeof (load_type)),
> or in other words return true if the reduction variable overflows first
> and false otherwise.  */
>
> static bool
> reduction_var_overflows_first (tree reduction_var, tree load_type)
> {
>   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
>   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
>   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
>   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> }
>
> TYPE_PRECISION (ptrdiff_type_node) == 64
> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
>     && reduction_var_overflows_first (reduction_var, load_type)
>
> Regarding the implementation which makes use of strlen:
>
> I'm not sure what it means if strlen is called for a string with a
> length greater than SIZE_MAX.  Therefore, similar to the implementation
> using rawmemchr where we neglect the case of an overflow for 64bit
> architectures, a conservative condition would be:
>
> TYPE_PRECISION (size_type_node) == 64
> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
>     && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
>
> I still included the overflow undefined check for reduction variable in
> order to rule out situations where the reduction variable is unsigned
> and overflows as many times until strlen(,_using_rawmemchr) overflows,
> too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> architectures.  Anyhow, while writing this down it becomes clear that
> this deserves a comment which I will add once it becomes clear which way
> to go.

I think all the arguments about objects bigger than half of the address
space are also valid for 32-bit targets and thus a 32-bit size_type_node
(or 32-bit pointer size).
I'm not actually sure what the canonical type to check against is, whether
it's size_type_node (C's size_t), ptr_type_node (C's void *) or sizetype
(the middle-end "offset" type used for all address computations).  For
weird reasons I'd lean towards 'sizetype' (for example some embedded
targets have 24-bit pointers but 16-bit 'sizetype').

> >
> > +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> > +       {
> > +         const char *msg = G_("assuming signed overflow does not occur "
> > +                              "when optimizing strlen like loop");
> > +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> > +       }
> >
> > no, please don't add any new strict-overflow warnings ;)
>
> I just stumbled over code which produces such a warning and thought this
> is a hard requirement :D The new patch doesn't contain it anymore.
>
> >
> > The generate_*_builtin routines need some factoring - if you code-generate
> > into a gimple_seq you could use gimple_build () which would do the fold_stmt
> > (not sure why you do that - you should see to fold the call, not necessarily
> > the rest).  The replacement of reduction_var and the dumping could be shared.
> > There's also GET_MODE_NAME for the printing.
>
> I wasn't really sure which way to go.  Use a gsi, as it is done by
> existing generate_* functions, or make use of gimple_seq.  Since the
> latter uses internally also gsi I thought it is better to stick to gsi
> in the first place.  Now, after changing to gimple_seq I see the beauty
> of it :)
>
> I created two helper functions generate_strlen_builtin_1 and
> generate_reduction_builtin_1 in order to reduce code duplication.
>
> In function generate_strlen_builtin I changed from using
> builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
> (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
> sure whether my intuition about the difference between implicit and
> explicit builtins is correct.  In builtins.def there is a small example
> given which I would paraphrase as "use builtin_decl_explicit if the
> semantics of the builtin is defined by the C standard; otherwise use
> builtin_decl_implicit" but probably my intuition is wrong?
>
> Beside that I'm not sure whether I really have to call
> build_fold_addr_expr which looks superfluous to me since
> gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:
>
> tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> gimple *fn_call = gimple_build_call (fn, 1, mem);
>
> However, since it is also used that way in the context of
> generate_memset_builtin I didn't remove it so far.
>
> > I think overall the approach is sound now but the details still need work.
>
> Once again thank you very much for your review.  Really appreciated!

The patch lacks a changelog entry / description.  It's nice if patches sent
out for review are basically the revision that git format-patch produces.

The rawmemchr optab needs documenting in md.texi

+}
+
+static bool
+reduction_var_overflows_first (tree reduction_var, tree load_type)
+{
+  unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);

this function needs a comment.

+         if (stmt_has_scalar_dependences_outside_loop (loop, phi))
+           {
+             if (reduction_stmt)
+               return false;

you leak bbs here and elsewhere where you exit the function early.
In fact you fail to free it at all.

Otherwise the patch looks good - thanks for all the improvements.

What I do wonder is

+  tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
+  gimple *fn_call = gimple_build_call (fn, 1, mem);

using builtin_decl_explicit means that in a TU where strlen is neither
declared nor used we can end up emitting calls to it.  For memcpy/memmove
that's usually OK since we require those to be present even in a
freestanding environment.  But I'm not sure about strlen here so I'd
lean towards using builtin_decl_implicit and checking that for NULL which
IIRC should prevent emitting strlen when it's not declared and maybe even
if it's declared but not used.  All other uses that generate STRLEN
use that at least.
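
I.e. something like this (just a sketch of the NULL check, reusing the
names from the snippet above; the exact place to bail out is up to you):

tree fn = builtin_decl_implicit (BUILT_IN_STRLEN);
if (!fn)
  /* strlen is not available in this TU -- keep the loop as is.  */
  return false;
gimple *fn_call = gimple_build_call (fn, 1, mem);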

Thanks,
Richard.

> Cheers,
> Stefan
Stefan Schulze Frielinghaus Sept. 3, 2021, 8 a.m. UTC | #20
On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
[...]
> > >
> > > +  /* Handle strlen like loops.  */
> > > +  if (store_dr == NULL
> > > +      && integer_zerop (pattern)
> > > +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > +      && integer_onep (reduction_iv.step)
> > > +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > > +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> > > +    {
> > >
> > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > The iteration
> > > only stops when you load a NUL and the increments just wrap along (you're
> > > using the pointer IVs to compute the strlen result).  Can't you simply truncate?
> >
> > I think truncation is enough as long as no overflow occurs in strlen or
> > strlen_using_rawmemchr.
> >
> > > For larger than size_type_node (actually larger than ptr_type_node would matter
> > > I guess), the argument is that since pointer wrapping would be undefined anyway
> > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > >
> > >       TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > (ptr_type_node)
> > >        || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > >
> > > ?
> >
> > Regarding the implementation which makes use of rawmemchr:
> >
> > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > the maximal length we can determine of a string where each character has
> > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > ptrdiff type is undefined we have to make sure that if an overflow
> > occurs, then an overflow occurs for reduction variable, too, and that
> > this is undefined, too.  However, I'm not sure anymore whether we want
> > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > this would mean that a single string consumes more than half of the
> > virtual addressable memory.  At least for architectures where
> > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > to neglect the case where computing pointer difference may overflow.
> > Otherwise we are talking about strings with lenghts of multiple
> > pebibytes.  For other architectures we might have to be more precise
> > and make sure that reduction variable overflows first and that this is
> > undefined.
> >
> > Thus a conservative condition would be (I assumed that the size of any
> > integral type is a power of two which I'm not sure if this really holds;
> > IIRC the C standard requires only that the alignment is a power of two
> > but not necessarily the size so I might need to change this):
> >
> > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 (sizeof (load_type))
> >    or in other words return true if reduction variable overflows first
> >    and false otherwise.  */
> >
> > static bool
> > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > {
> >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
> >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
> >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> > }
> >
> > TYPE_PRECISION (ptrdiff_type_node) == 64
> > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >     && reduction_var_overflows_first (reduction_var, load_type)
> >
> > Regarding the implementation which makes use of strlen:
> >
> > I'm not sure what it means if strlen is called for a string with a
> > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > using rawmemchr where we neglect the case of an overflow for 64bit
> > architectures, a conservative condition would be:
> >
> > TYPE_PRECISION (size_type_node) == 64
> > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >     && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> >
> > I still included the overflow undefined check for reduction variable in
> > order to rule out situations where the reduction variable is unsigned
> > and overflows as many times until strlen(,_using_rawmemchr) overflows,
> > too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> > architectures.  Anyhow, while writing this down it becomes clear that
> > this deserves a comment which I will add once it becomes clear which way
> > to go.
> 
> I think all the arguments about objects bigger than half of the address-space
> also are valid for 32bit targets and thus 32bit size_type_node (or
> 32bit pointer size).
> I'm not actually sure what's the canonical type to check against, whether
> it's size_type_node (Cs size_t), ptr_type_node (Cs void *) or sizetype (the
> middle-end "offset" type used for all address computations).  For weird reasons
> I'd lean towards 'sizetype' (for example some embedded targets have 24bit
> pointers but 16bit 'sizetype').

Ok, for the strlen implementation I switched from size_type_node to
sizetype and now assume that no overflow occurs for string objects bigger
than half of the address space, at least for 32-bit targets and up:

  (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
   && TYPE_PRECISION (ptr_type_node) >= 32)
  || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
      && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (sizetype))

and similarly for the rawmemchr implementation:

  (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
   && TYPE_PRECISION (ptrdiff_type_node) >= 32)
  || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
      && reduction_var_overflows_first (reduction_var, load_type))
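
To make the bound concrete, here is a small stand-alone example (not part
of the patch; it just mirrors the arithmetic behind
reduction_var_overflows_first for an assumed 64-bit ptrdiff type and a
32-bit load type):

#include <stdio.h>

int main (void)
{
  const unsigned prec_ptrdiff = 64;       /* TYPE_PRECISION (ptrdiff_type_node) */
  const unsigned size_exponent = 2;       /* log2 (sizeof (uint32_t)) */
  const unsigned prec_reduction_var = 32; /* e.g. an 'int' reduction variable */

  /* 2^n < 2^(m-1) / s  is equivalent to  n < m - 1 - log2 (s), i.e. the
     reduction variable overflows strictly before the pointer difference.  */
  if (prec_reduction_var < prec_ptrdiff - 1 - size_exponent)
    puts ("reduction variable overflows first");
  return 0;
}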

> 
> > >
> > > +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> > > +       {
> > > +         const char *msg = G_("assuming signed overflow does not occur "
> > > +                              "when optimizing strlen like loop");
> > > +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> > > +       }
> > >
> > > no, please don't add any new strict-overflow warnings ;)
> >
> > I just stumbled over code which produces such a warning and thought this
> > is a hard requirement :D The new patch doesn't contain it anymore.
> >
> > >
> > > The generate_*_builtin routines need some factoring - if you code-generate
> > > into a gimple_seq you could use gimple_build () which would do the fold_stmt
> > > (not sure why you do that - you should see to fold the call, not necessarily
> > > the rest).  The replacement of reduction_var and the dumping could be shared.
> > > There's also GET_MODE_NAME for the printing.
> >
> > I wasn't really sure which way to go.  Use a gsi, as it is done by
> > existing generate_* functions, or make use of gimple_seq.  Since the
> > latter uses internally also gsi I thought it is better to stick to gsi
> > in the first place.  Now, after changing to gimple_seq I see the beauty
> > of it :)
> >
> > I created two helper functions generate_strlen_builtin_1 and
> > generate_reduction_builtin_1 in order to reduce code duplication.
> >
> > In function generate_strlen_builtin I changed from using
> > builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
> > (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
> > sure whether my intuition about the difference between implicit and
> > explicit builtins is correct.  In builtins.def there is a small example
> > given which I would paraphrase as "use builtin_decl_explicit if the
> > semantics of the builtin is defined by the C standard; otherwise use
> > builtin_decl_implicit" but probably my intuition is wrong?
> >
> > Beside that I'm not sure whether I really have to call
> > build_fold_addr_expr which looks superfluous to me since
> > gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:
> >
> > tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > gimple *fn_call = gimple_build_call (fn, 1, mem);
> >
> > However, since it is also used that way in the context of
> > generate_memset_builtin I didn't remove it so far.
> >
> > > I think overall the approach is sound now but the details still need work.
> >
> > Once again thank you very much for your review.  Really appreciated!
> 
> The patch lacks a changelog entry / description.  It's nice if patches sent
> out for review are basically the rev as git format-patch produces.
> 
> The rawmemchr optab needs documenting in md.texi

While writing the documentation in md.texi I realised that other
instructions expect an address to be a memory operand, which is currently
not the case for rawmemchr.  At the moment the address is either an
SSA_NAME or an ADDR_EXPR with a pointer type in expand_RAWMEMCHR.  As a
consequence, the backend define_expand rawmemchr<mode> expects a register
operand and not a memory operand.  Would it make sense to build a MEM_REF
out of the SSA_NAME/ADDR_EXPR in expand_RAWMEMCHR?  I'm not sure whether
MEM_REF is supposed to be the canonical form here.
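
For illustration, the two shapes come from loops like the following (just a
sketch, not from the testsuite; as far as I can see the pointer case yields
an SSA_NAME and the global-array case an invariant ADDR_EXPR):

#include <stdint.h>

extern uint32_t s[];

/* The address argument of the rawmemchr call is the pointer SSA_NAME p.  */
uint32_t *scan (uint32_t *p)
{
  while (*p != 42)
    ++p;
  return p;
}

/* Here the address argument ends up as the invariant &s[0], an ADDR_EXPR.  */
int len (void)
{
  int i = 0;
  while (s[i])
    ++i;
  return i;
}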

> 
> +}
> +
> +static bool
> +reduction_var_overflows_first (tree reduction_var, tree load_type)
> +{
> +  unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> 
> this function needs a comment.

Done.

> 
> +         if (stmt_has_scalar_dependences_outside_loop (loop, phi))
> +           {
> +             if (reduction_stmt)
> +               return false;
> 
> you leak bbs here and elsewhere where you early exit the function.
> In fact you fail to free it at all.

Whoopsy.  I factored the whole loop out into a static function,
determine_reduction_stmt, in order to deal with all early exits.

> 
> Otherwise the patch looks good - thanks for all the improvements.
> 
> What I do wonder is
> 
> +  tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> +  gimple *fn_call = gimple_build_call (fn, 1, mem);
> 
> using builtin_decl_explicit means that in a TU where strlen is neither
> declared nor used we can end up emitting calls to it.  For memcpy/memmove
> that's usually OK since we require those to be present even in a
> freestanding environment.  But I'm not sure about strlen here so I'd
> lean towards using builtin_decl_implicit and checking that for NULL which
> IIRC should prevent emitting strlen when it's not declared and maybe even
> if it's declared but not used.  All other uses that generate STRLEN
> use that at least.

Thanks for the clarification.  I changed it back to builtin_decl_implicit
and check its result for a null pointer.

Thanks,
Stefan
From d36f1bddea11b48f3c72ad5ab33c0a405c964f40 Mon Sep 17 00:00:00 2001
From: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
Date: Wed, 17 Mar 2021 09:00:06 +0100
Subject: [PATCH] ldist: Recognize strlen and rawmemchr like loops

This patch adds support for recognizing loops which mimic the behaviour
of functions strlen and rawmemchr, and replaces those with internal
function calls in case a target provides them.  In contrast to the
standard strlen and rawmemchr functions, this patch also supports
different instances where the memory pointed to is interpreted as 8, 16,
and 32-bit sized, respectively.

gcc/ChangeLog:

	* doc/md.texi (rawmemchr<mode>): Document.
	* internal-fn.c (expand_RAWMEMCHR): Define.
	* internal-fn.def (RAWMEMCHR): Add.
	* optabs.def (rawmemchr_optab): Add.
	* tree-loop-distribution.c (find_single_drs): Change return code
	behaviour by also returning true if no single store was found
	but a single load.
	(loop_distribution::classify_partition): Respect the new return
	code behaviour of function find_single_drs.
	(loop_distribution::execute): Call new function
	transform_reduction_loop in order to replace rawmemchr or strlen
	like loops by calls into builtins.
	(generate_reduction_builtin_1): New function.
	(generate_rawmemchr_builtin): New function.
	(generate_strlen_builtin_1): New function.
	(generate_strlen_builtin): New function.
	(generate_strlen_builtin_using_rawmemchr): New function.
	(reduction_var_overflows_first): New function.
	(determine_reduction_stmt_1): New function.
	(determine_reduction_stmt): New function.
	(loop_distribution::transform_reduction_loop): New function.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ldist-rawmemchr-1.c: New test.
	* gcc.dg/tree-ssa/ldist-rawmemchr-2.c: New test.
	* gcc.dg/tree-ssa/ldist-strlen-1.c: New test.
	* gcc.dg/tree-ssa/ldist-strlen-2.c: New test.
	* gcc.dg/tree-ssa/ldist-strlen-3.c: New test.
---
 gcc/doc/md.texi                               |   7 +
 gcc/internal-fn.c                             |  29 +
 gcc/internal-fn.def                           |   1 +
 gcc/optabs.def                                |   1 +
 .../gcc.dg/tree-ssa/ldist-rawmemchr-1.c       |  72 +++
 .../gcc.dg/tree-ssa/ldist-rawmemchr-2.c       |  83 +++
 .../gcc.dg/tree-ssa/ldist-strlen-1.c          | 100 ++++
 .../gcc.dg/tree-ssa/ldist-strlen-2.c          |  58 ++
 .../gcc.dg/tree-ssa/ldist-strlen-3.c          |  12 +
 gcc/tree-loop-distribution.c                  | 517 +++++++++++++++++-
 10 files changed, 851 insertions(+), 29 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2b41cb7fb7b..eff7d14b9b8 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -6695,6 +6695,13 @@ operand 2 is the character to search for (normally zero),
 and operand 3 is a constant describing the known alignment
 of the beginning of the string.
 
+@cindex @code{rawmemchr@var{m}} instruction pattern
+@item @samp{rawmemchr@var{m}}
+Scan memory referred to by operand 1 for the first occurrence of the object
+given by operand 2 which is a @code{const_int} of mode @var{m}.  Operand 0 is
+the result, i.e., a pointer to the first occurrence of operand 2 in the memory
+block given by operand 1.
+
 @cindex @code{float@var{m}@var{n}2} instruction pattern
 @item @samp{float@var{m}@var{n}2}
 Convert signed integer operand 1 (valid for fixed point mode @var{m}) to
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 1360a00f0b9..6797b7f1f42 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2931,6 +2931,35 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+/* Expand IFN_RAWMEMCHR internal function.  */
+
+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+  tree lhs_type = TREE_TYPE (lhs);
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
+
+  for (unsigned int i = 0; i < 2; ++i)
+    {
+      tree rhs = gimple_call_arg (stmt, i);
+      tree rhs_type = TREE_TYPE (rhs);
+      rtx rhs_rtx = expand_normal (rhs);
+      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
+    }
+
+  insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
+
+  expand_insn (icode, 3, ops);
+  if (!rtx_equal_p (lhs_rtx, ops[0].value))
+    emit_move_insn (lhs_rtx, ops[0].value);
+}
+
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 3ac9ae68b2a..96f51455677 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -352,6 +352,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 201b8aae1c0..9411097c9b5 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
 OPTAB_D (movmem_optab, "movmem$a")
 OPTAB_D (setmem_optab, "setmem$a")
 OPTAB_D (strlen_optab, "strlen$a")
+OPTAB_D (rawmemchr_optab, "rawmemchr$I$a")
 
 OPTAB_DC(fma_optab, "fma$a4", FMA)
 OPTAB_D (fms_optab, "fms$a4")
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
new file mode 100644
index 00000000000..6abfd278351
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
@@ -0,0 +1,72 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt and no store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, pattern)   \
+__attribute__((noinline))  \
+T *test_##T (T *p)         \
+{                          \
+  while (*p != (T)pattern) \
+    ++p;                   \
+  return p;                \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  0xab)
+test (int16_t, 0xabcd)
+test (int32_t, 0xabcdef15)
+
+#define run(T, pattern, i)      \
+{                               \
+T *q = p;                       \
+q[i] = (T)pattern;              \
+assert (test_##T (p) == &q[i]); \
+q[i] = 0;                       \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0, 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, 0xab, 0);
+  run (int8_t, 0xab, 1);
+  run (int8_t, 0xab, 13);
+
+  run (int16_t, 0xabcd, 0);
+  run (int16_t, 0xabcd, 1);
+  run (int16_t, 0xabcd, 13);
+
+  run (int32_t, 0xabcdef15, 0);
+  run (int32_t, 0xabcdef15, 1);
+  run (int32_t, 0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
new file mode 100644
index 00000000000..00d6ea0f8e9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
@@ -0,0 +1,83 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt and store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+uint8_t *p_uint8_t;
+uint16_t *p_uint16_t;
+uint32_t *p_uint32_t;
+
+int8_t *p_int8_t;
+int16_t *p_int16_t;
+int32_t *p_int32_t;
+
+#define test(T, pattern)    \
+__attribute__((noinline))   \
+T *test_##T (void)          \
+{                           \
+  while (*p_##T != pattern) \
+    ++p_##T;                \
+  return p_##T;             \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  (int8_t)0xab)
+test (int16_t, (int16_t)0xabcd)
+test (int32_t, (int32_t)0xabcdef15)
+
+#define run(T, pattern, i) \
+{                          \
+T *q = p;                  \
+q[i] = pattern;            \
+p_##T = p;                 \
+T *r = test_##T ();        \
+assert (r == p_##T);       \
+assert (r == &q[i]);       \
+q[i] = 0;                  \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, '\0', 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, (int8_t)0xab, 0);
+  run (int8_t, (int8_t)0xab, 1);
+  run (int8_t, (int8_t)0xab, 13);
+
+  run (int16_t, (int16_t)0xabcd, 0);
+  run (int16_t, (int16_t)0xabcd, 1);
+  run (int16_t, (int16_t)0xabcd, 13);
+
+  run (int32_t, (int32_t)0xabcdef15, 0);
+  run (int32_t, (int32_t)0xabcdef15, 1);
+  run (int32_t, (int32_t)0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
new file mode 100644
index 00000000000..918b60099e4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
@@ -0,0 +1,100 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 4 "ldist" } } */
+/* { dg-final { scan-tree-dump-times "generated strlenHI\n" 4 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated strlenSI\n" 4 "ldist" { target s390x-*-* } } } */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, U)        \
+__attribute__((noinline)) \
+U test_##T##U (T *s)      \
+{                         \
+  U i;                    \
+  for (i=0; s[i]; ++i);   \
+  return i;               \
+}
+
+test (uint8_t,  size_t)
+test (uint16_t, size_t)
+test (uint32_t, size_t)
+test (uint8_t,  int)
+test (uint16_t, int)
+test (uint32_t, int)
+
+test (int8_t,  size_t)
+test (int16_t, size_t)
+test (int32_t, size_t)
+test (int8_t,  int)
+test (int16_t, int)
+test (int32_t, int)
+
+#define run(T, U, i)             \
+{                                \
+T *q = p;                        \
+q[i] = 0;                        \
+assert (test_##T##U (p) == i);   \
+memset (&q[i], 0xf, sizeof (T)); \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+
+  run (uint8_t, size_t, 0);
+  run (uint8_t, size_t, 1);
+  run (uint8_t, size_t, 13);
+
+  run (int8_t, size_t, 0);
+  run (int8_t, size_t, 1);
+  run (int8_t, size_t, 13);
+
+  run (uint8_t, int, 0);
+  run (uint8_t, int, 1);
+  run (uint8_t, int, 13);
+
+  run (int8_t, int, 0);
+  run (int8_t, int, 1);
+  run (int8_t, int, 13);
+
+  run (uint16_t, size_t, 0);
+  run (uint16_t, size_t, 1);
+  run (uint16_t, size_t, 13);
+
+  run (int16_t, size_t, 0);
+  run (int16_t, size_t, 1);
+  run (int16_t, size_t, 13);
+
+  run (uint16_t, int, 0);
+  run (uint16_t, int, 1);
+  run (uint16_t, int, 13);
+
+  run (int16_t, int, 0);
+  run (int16_t, int, 1);
+  run (int16_t, int, 13);
+
+  run (uint32_t, size_t, 0);
+  run (uint32_t, size_t, 1);
+  run (uint32_t, size_t, 13);
+
+  run (int32_t, size_t, 0);
+  run (int32_t, size_t, 1);
+  run (int32_t, size_t, 13);
+
+  run (uint32_t, int, 0);
+  run (uint32_t, int, 1);
+  run (uint32_t, int, 13);
+
+  run (int32_t, int, 0);
+  run (int32_t, int, 1);
+  run (int32_t, int, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
new file mode 100644
index 00000000000..e25d6ea5b56
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
@@ -0,0 +1,58 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 3 "ldist" } } */
+
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+__attribute__((noinline))
+int test_pos (char *s)
+{
+  int i;
+  for (i=42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_neg (char *s)
+{
+  int i;
+  for (i=-42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_including_null_char (char *s)
+{
+  int i;
+  for (i=1; s[i-1]; ++i);
+  return i;
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+  char *s = (char *)p + 100;
+
+  s[42+13] = 0;
+  assert (test_pos (s) == 42+13);
+  s[42+13] = 0xf;
+
+  s[13] = 0;
+  assert (test_neg (s) == 13);
+  s[13] = 0xf;
+
+  s[-13] = 0;
+  assert (test_neg (s) == -13);
+  s[-13] = 0xf;
+
+  s[13] = 0;
+  assert (test_including_null_char (s) == 13+1);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c
new file mode 100644
index 00000000000..370fd5eb088
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlenSI\n" 1 "ldist" { target s390x-*-* } } } */
+
+extern int s[];
+
+int test ()
+{
+  int i = 0;
+  for (; s[i]; ++i);
+  return i;
+}
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 2df762c8aa8..9abf5a6352f 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -116,6 +116,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-eh.h"
 #include "gimple-fold.h"
 #include "tree-affine.h"
+#include "intl.h"
+#include "rtl.h"
+#include "memmodel.h"
+#include "optabs.h"
 
 
 #define MAX_DATAREFS_NUM \
@@ -651,6 +655,10 @@ class loop_distribution
 		       control_dependences *cd, int *nb_calls, bool *destroy_p,
 		       bool only_patterns_p);
 
+  /* Transform loops which mimic the effects of builtins rawmemchr or strlen and
+     replace them accordingly.  */
+  bool transform_reduction_loop (loop_p loop);
+
   /* Compute topological order for basic blocks.  Topological order is
      needed because data dependence is computed for data references in
      lexicographical order.  */
@@ -1492,14 +1500,14 @@ loop_distribution::build_rdg_partition_for_vertex (struct graph *rdg, int v)
    data references.  */
 
 static bool
-find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
+find_single_drs (class loop *loop, struct graph *rdg, const bitmap &partition_stmts,
 		 data_reference_p *dst_dr, data_reference_p *src_dr)
 {
   unsigned i;
   data_reference_p single_ld = NULL, single_st = NULL;
   bitmap_iterator bi;
 
-  EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
+  EXECUTE_IF_SET_IN_BITMAP (partition_stmts, 0, i, bi)
     {
       gimple *stmt = RDG_STMT (rdg, i);
       data_reference_p dr;
@@ -1540,44 +1548,47 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
 	}
     }
 
-  if (!single_st)
-    return false;
-
-  /* Bail out if this is a bitfield memory reference.  */
-  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
-      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
+  if (!single_ld && !single_st)
     return false;
 
-  /* Data reference must be executed exactly once per iteration of each
-     loop in the loop nest.  We only need to check dominance information
-     against the outermost one in a perfect loop nest because a bb can't
-     dominate outermost loop's latch without dominating inner loop's.  */
-  basic_block bb_st = gimple_bb (DR_STMT (single_st));
-  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
-    return false;
+  basic_block bb_ld = NULL;
+  basic_block bb_st = NULL;
 
   if (single_ld)
     {
-      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
-      /* Direct aggregate copy or via an SSA name temporary.  */
-      if (load != store
-	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
-	return false;
-
       /* Bail out if this is a bitfield memory reference.  */
       if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
 	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
 	return false;
 
-      /* Load and store must be in the same loop nest.  */
-      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
-      if (bb_st->loop_father != bb_ld->loop_father)
+      /* Data reference must be executed exactly once per iteration of each
+	 loop in the loop nest.  We only need to check dominance information
+	 against the outermost one in a perfect loop nest because a bb can't
+	 dominate outermost loop's latch without dominating inner loop's.  */
+      bb_ld = gimple_bb (DR_STMT (single_ld));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+	return false;
+    }
+
+  if (single_st)
+    {
+      /* Bail out if this is a bitfield memory reference.  */
+      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
+	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
 	return false;
 
       /* Data reference must be executed exactly once per iteration.
-	 Same as single_st, we only need to check against the outermost
+	 Same as single_ld, we only need to check against the outermost
 	 loop.  */
-      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+      bb_st = gimple_bb (DR_STMT (single_st));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
+	return false;
+    }
+
+  if (single_ld && single_st)
+    {
+      /* Load and store must be in the same loop nest.  */
+      if (bb_st->loop_father != bb_ld->loop_father)
 	return false;
 
       edge e = single_exit (bb_st->loop_father);
@@ -1852,9 +1863,19 @@ loop_distribution::classify_partition (loop_p loop,
     return has_reduction;
 
   /* Find single load/store data references for builtin partition.  */
-  if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
+  if (!find_single_drs (loop, rdg, partition->stmts, &single_st, &single_ld)
+      || !single_st)
     return has_reduction;
 
+  if (single_ld && single_st)
+    {
+      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
+      /* Direct aggregate copy or via an SSA name temporary.  */
+      if (load != store
+	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
+	return has_reduction;
+    }
+
   partition->loc = gimple_location (DR_STMT (single_st));
 
   /* Classify the builtin kind.  */
@@ -3260,6 +3281,427 @@ find_seed_stmts_for_distribution (class loop *loop, vec<gimple *> *work_list)
   return work_list->length () > 0;
 }
 
+/* A helper function for generate_{rawmemchr,strlen}_builtin functions in order
+   to place new statements SEQ before LOOP and replace the old reduction
+   variable with the new one.  */
+
+static void
+generate_reduction_builtin_1 (loop_p loop, gimple_seq &seq,
+			      tree reduction_var_old, tree reduction_var_new,
+			      const char *info, machine_mode load_mode)
+{
+  /* Place new statements before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+  gsi_insert_seq_after (&gsi, seq, GSI_CONTINUE_LINKING);
+
+  /* Replace old reduction variable with new one.  */
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var_old)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, info, GET_MODE_NAME (load_mode));
+}
+
+/* Generate a call to rawmemchr and place it before LOOP.  REDUCTION_VAR is
+   replaced with a fresh SSA name representing the result of the call.  */
+
+static void
+generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
+			    data_reference_p store_dr, tree base, tree pattern,
+			    location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree mem = force_gimple_operand (base, &seq, true, NULL_TREE);
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
+  tree reduction_var_new = copy_ssa_name (reduction_var);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  if (store_dr)
+    {
+      gassign *g = gimple_build_assign (DR_REF (store_dr), reduction_var_new);
+      gimple_seq_add_stmt (&seq, g);
+    }
+
+  generate_reduction_builtin_1 (loop, seq, reduction_var, reduction_var_new,
+				"generated rawmemchr%s\n",
+				TYPE_MODE (TREE_TYPE (TREE_TYPE (base))));
+}
+
+/* Helper function for generate_strlen_builtin(,_using_rawmemchr)  */
+
+static void
+generate_strlen_builtin_1 (loop_p loop, gimple_seq &seq,
+			   tree reduction_var_old, tree reduction_var_new,
+			   machine_mode mode, tree start_len)
+{
+  /* REDUCTION_VAR_NEW has either size type or ptrdiff type and must be
+     converted if types of old and new reduction variable are not compatible. */
+  reduction_var_new = gimple_convert (&seq, TREE_TYPE (reduction_var_old),
+				      reduction_var_new);
+
+  /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start
+     length.  */
+  if (!integer_zerop (start_len))
+    {
+      tree lhs = make_ssa_name (TREE_TYPE (reduction_var_new));
+      gimple *g = gimple_build_assign (lhs, PLUS_EXPR, reduction_var_new,
+				       start_len);
+      gimple_seq_add_stmt (&seq, g);
+      reduction_var_new = lhs;
+    }
+
+  generate_reduction_builtin_1 (loop, seq, reduction_var_old, reduction_var_new,
+				"generated strlen%s\n", mode);
+}
+
+/* Generate a call to strlen and place it before LOOP.  REDUCTION_VAR is
+   replaced with a fresh SSA name representing the result of the call.  */
+
+static void
+generate_strlen_builtin (loop_p loop, tree reduction_var, tree base,
+			 tree start_len, location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree reduction_var_new = make_ssa_name (size_type_node);
+
+  tree mem = force_gimple_operand (base, &seq, true, NULL_TREE);
+  tree fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_STRLEN));
+  gimple *fn_call = gimple_build_call (fn, 1, mem);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  generate_strlen_builtin_1 (loop, seq, reduction_var, reduction_var_new,
+			     QImode, start_len);
+}
+
+/* Generate code in order to mimic the behaviour of strlen but this time over
+   an array of elements with mode different than QI.  REDUCTION_VAR is replaced
+   with a fresh SSA name representing the result, i.e., the length.  */
+
+static void
+generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var,
+					 tree base, tree start_len,
+					 location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree start = force_gimple_operand (base, &seq, true, NULL_TREE);
+  tree zero = build_zero_cst (TREE_TYPE (TREE_TYPE (start)));
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, start, zero);
+  tree end = make_ssa_name (TREE_TYPE (base));
+  gimple_call_set_lhs (fn_call, end);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  /* Determine the number of elements between START and END by
+     evaluating (END - START) / sizeof (*START).  */
+  tree diff = make_ssa_name (ptrdiff_type_node);
+  gimple *diff_stmt = gimple_build_assign (diff, POINTER_DIFF_EXPR, end, start);
+  gimple_seq_add_stmt (&seq, diff_stmt);
+  /* Let SIZE be the size of the pointed-to type of START.  */
+  tree size = gimple_convert (&seq, ptrdiff_type_node,
+			      TYPE_SIZE_UNIT (TREE_TYPE (TREE_TYPE (start))));
+  tree count = make_ssa_name (ptrdiff_type_node);
+  gimple *count_stmt = gimple_build_assign (count, TRUNC_DIV_EXPR, diff, size);
+  gimple_seq_add_stmt (&seq, count_stmt);
+
+  generate_strlen_builtin_1 (loop, seq, reduction_var, count,
+			     TYPE_MODE (TREE_TYPE (TREE_TYPE (base))),
+			     start_len);
+}
+
+/* Return true if we can count at least as many characters by taking pointer
+   difference as we can count via reduction_var without an overflow.  Thus
+   compute 2^n < (2^(m-1) / s) where n = TYPE_PRECISION (reduction_var),
+   m = TYPE_PRECISION (ptrdiff_type_node), and s = size of each character.  */
+static bool
+reduction_var_overflows_first (tree reduction_var, tree load_type)
+{
+  widest_int n2 = wi::lshift (1, TYPE_PRECISION (TREE_TYPE (reduction_var)));
+  widest_int m2 = wi::lshift (1, TYPE_PRECISION (ptrdiff_type_node) - 1);
+  widest_int s = wi::to_widest (TYPE_SIZE_UNIT (load_type));
+  return wi::ltu_p (n2, wi::udiv_trunc (m2, s));
+}
+
+static gimple *
+determine_reduction_stmt_1 (const loop_p loop, const basic_block *bbs)
+{
+  gimple *reduction_stmt = NULL;
+
+  for (unsigned i = 0, ninsns = 0; i < loop->num_nodes; ++i)
+    {
+      basic_block bb = bbs[i];
+
+      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
+	   gsi_next_nondebug (&bsi))
+	{
+	  gphi *phi = bsi.phi ();
+	  if (virtual_operand_p (gimple_phi_result (phi)))
+	    continue;
+	  if (stmt_has_scalar_dependences_outside_loop (loop, phi))
+	    {
+	      if (reduction_stmt)
+		return NULL;
+	      reduction_stmt = phi;
+	    }
+	}
+
+      for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
+	   gsi_next_nondebug (&bsi), ++ninsns)
+	{
+	  /* Bail out early for loops which are unlikely to match.  */
+	  if (ninsns > 16)
+	    return NULL;
+	  gimple *stmt = gsi_stmt (bsi);
+	  if (gimple_clobber_p (stmt))
+	    continue;
+	  if (gimple_code (stmt) == GIMPLE_LABEL)
+	    continue;
+	  if (gimple_has_volatile_ops (stmt))
+	    return NULL;
+	  if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
+	    {
+	      if (reduction_stmt)
+		return NULL;
+	      reduction_stmt = stmt;
+	    }
+	}
+    }
+
+  return reduction_stmt;
+}
+
+/* If LOOP has a single non-volatile reduction statement, then return a pointer
+   to this.  Otherwise return NULL.  */
+static gimple *
+determine_reduction_stmt (const loop_p loop)
+{
+  basic_block *bbs = get_loop_body (loop);
+  gimple *reduction_stmt = determine_reduction_stmt_1 (loop, bbs);
+  XDELETEVEC (bbs);
+  return reduction_stmt;
+}
+
+/* Transform loops which mimic the effects of builtins rawmemchr or strlen and
+   replace them accordingly.  For example, a loop of the form
+
+     for (; *p != 42; ++p);
+
+   is replaced by
+
+     p = rawmemchr<MODE> (p, 42);
+
+   under the assumption that rawmemchr is available for a particular MODE.
+   Another example is
+
+     int i;
+     for (i = 42; s[i]; ++i);
+
+   which is replaced by
+
+     i = (int)strlen (&s[42]) + 42;
+
+   for some character array S.  In case array S is not of type character array
+   we end up with
+
+     i = (int)(rawmemchr<MODE> (&s[42], 0) - &s[42]) + 42;
+
+   assuming that rawmemchr is available for a particular MODE.  */
+
+bool
+loop_distribution::transform_reduction_loop (loop_p loop)
+{
+  gimple *reduction_stmt;
+  data_reference_p load_dr = NULL, store_dr = NULL;
+
+  edge e = single_exit (loop);
+  gcond *cond = safe_dyn_cast <gcond *> (last_stmt (e->src));
+  if (!cond)
+    return false;
+  /* Ensure loop condition is an (in)equality test and loop is exited either if
+     the inequality test fails or the equality test succeeds.  */
+  if (!(e->flags & EDGE_FALSE_VALUE && gimple_cond_code (cond) == NE_EXPR)
+      && !(e->flags & EDGE_TRUE_VALUE && gimple_cond_code (cond) == EQ_EXPR))
+    return false;
+  /* A limitation of the current implementation is that we only support
+     constant patterns in (in)equality tests.  */
+  tree pattern = gimple_cond_rhs (cond);
+  if (TREE_CODE (pattern) != INTEGER_CST)
+    return false;
+
+  reduction_stmt = determine_reduction_stmt (loop);
+
+  /* A limitation of the current implementation is that we require a reduction
+     statement.  Therefore, loops without a reduction statement as in the
+     following are not recognized:
+     int *p;
+     void foo (void) { for (; *p; ++p); } */
+  if (reduction_stmt == NULL)
+    return false;
+
+  /* Reduction variables are guaranteed to be SSA names.  */
+  tree reduction_var;
+  switch (gimple_code (reduction_stmt))
+    {
+    case GIMPLE_ASSIGN:
+    case GIMPLE_PHI:
+      reduction_var = gimple_get_lhs (reduction_stmt);
+      break;
+    default:
+      /* Bail out e.g. for GIMPLE_CALL.  */
+      return false;
+    }
+
+  struct graph *rdg = build_rdg (loop, NULL);
+  if (rdg == NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "Loop %d not transformed: failed to build the RDG.\n",
+		 loop->num);
+
+      return false;
+    }
+  auto_bitmap partition_stmts;
+  bitmap_set_range (partition_stmts, 0, rdg->n_vertices);
+  find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr);
+  free_rdg (rdg);
+
+  /* Bail out if there is no single load.  */
+  if (load_dr == NULL)
+    return false;
+
+  /* Reaching this point we have a loop with a single reduction variable,
+     a single load, and an optional single store.  */
+
+  tree load_ref = DR_REF (load_dr);
+  tree load_type = TREE_TYPE (load_ref);
+  tree load_access_base = build_fold_addr_expr (load_ref);
+  tree load_access_size = TYPE_SIZE_UNIT (load_type);
+  affine_iv load_iv, reduction_iv;
+
+  if (!INTEGRAL_TYPE_P (load_type)
+      || !type_has_mode_precision_p (load_type))
+    return false;
+
+  /* We already ensured that the loop condition tests for (in)equality where the
+     rhs is a constant pattern. Now ensure that the lhs is the result of the
+     load.  */
+  if (gimple_cond_lhs (cond) != gimple_assign_lhs (DR_STMT (load_dr)))
+    return false;
+
+  /* Bail out if no affine induction variable with constant step can be
+     determined.  */
+  if (!simple_iv (loop, loop, load_access_base, &load_iv, false))
+    return false;
+
+  /* Bail out if memory accesses are not consecutive or not growing.  */
+  if (!operand_equal_p (load_iv.step, load_access_size, 0))
+    return false;
+
+  if (!simple_iv (loop, loop, reduction_var, &reduction_iv, false))
+    return false;
+
+  /* Handle rawmemchr like loops.  */
+  if (operand_equal_p (load_iv.base, reduction_iv.base)
+      && operand_equal_p (load_iv.step, reduction_iv.step))
+    {
+      if (store_dr)
+	{
+	  /* Ensure that we store to X and load from X+I where I>0.  */
+	  if (TREE_CODE (load_iv.base) != POINTER_PLUS_EXPR
+	      || !integer_onep (TREE_OPERAND (load_iv.base, 1)))
+	    return false;
+	  tree ptr_base = TREE_OPERAND (load_iv.base, 0);
+	  if (TREE_CODE (ptr_base) != SSA_NAME)
+	    return false;
+	  gimple *def = SSA_NAME_DEF_STMT (ptr_base);
+	  if (!gimple_assign_single_p (def)
+	      || gimple_assign_rhs1 (def) != DR_REF (store_dr))
+	    return false;
+	  /* Ensure that the reduction value is stored.  */
+	  if (gimple_assign_rhs1 (DR_STMT (store_dr)) != reduction_var)
+	    return false;
+	}
+      /* Bail out if target does not provide rawmemchr for a certain mode.  */
+      machine_mode mode = TYPE_MODE (load_type);
+      if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
+	return false;
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      generate_rawmemchr_builtin (loop, reduction_var, store_dr, load_iv.base,
+				  pattern, loc);
+      return true;
+    }
+
+  /* Handle strlen like loops.  */
+  if (store_dr == NULL
+      && integer_zerop (pattern)
+      && TREE_CODE (reduction_iv.base) == INTEGER_CST
+      && TREE_CODE (reduction_iv.step) == INTEGER_CST
+      && integer_onep (reduction_iv.step))
+    {
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      /* While determining the length of a string an overflow might occur.
+	 If an overflow only occurs in the loop implementation and not in the
+	 strlen implementation, then either the overflow is undefined or the
+	 truncated result of strlen equals the one of the loop.  Otherwise if
+	 an overflow may also occur in the strlen implementation, then
+	 replacing a loop by a call to strlen is sound whenever we ensure that
+	 if an overflow occurs in the strlen implementation, then also an
+	 overflow occurs in the loop implementation which is undefined.  It
+	 seems reasonable to relax this and assume that the strlen
+	 implementation cannot overflow in case sizetype is big enough in the
+	 sense that an overflow can only happen for string objects which are
+	 bigger than half of the address space; at least for 32-bit targets and
+	 up.
+
+	 For strlen which makes use of rawmemchr the maximal length of a string
+	 which can be determined without an overflow is PTRDIFF_MAX / S where
+	 each character has size S.  Since an overflow for ptrdiff type is
+	 undefined we have to make sure that if an overflow occurs, then an
+	 overflow occurs in the loop implementation, too, and this is
+	 undefined, too.  Similar as before we relax this and assume that no
+	 string object is larger than half of the address space; at least for
+	 32-bit targets and up.  */
+      if (TYPE_MODE (load_type) == TYPE_MODE (char_type_node)
+	  && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node)
+	  && ((TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
+	       && TYPE_PRECISION (ptr_type_node) >= 32)
+	      || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
+		  && TYPE_PRECISION (TREE_TYPE (reduction_var)) <= TYPE_PRECISION (sizetype)))
+	  && builtin_decl_implicit (BUILT_IN_STRLEN))
+	generate_strlen_builtin (loop, reduction_var, load_iv.base,
+				 reduction_iv.base, loc);
+      else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type))
+	       != CODE_FOR_nothing
+	       && ((TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
+		    && TYPE_PRECISION (ptrdiff_type_node) >= 32)
+		   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
+		       && reduction_var_overflows_first (reduction_var, load_type))))
+	generate_strlen_builtin_using_rawmemchr (loop, reduction_var,
+						 load_iv.base,
+						 reduction_iv.base, loc);
+      else
+	return false;
+      return true;
+    }
+
+  return false;
+}
+
 /* Given innermost LOOP, return the outermost enclosing loop that forms a
    perfect loop nest.  */
 
@@ -3324,10 +3766,27 @@ loop_distribution::execute (function *fun)
 	      && !optimize_loop_for_speed_p (loop)))
 	continue;
 
-      /* Don't distribute loop if niters is unknown.  */
+      /* If niters is unknown don't distribute loop but rather try to transform
+	 it to a call to a builtin.  */
       tree niters = number_of_latch_executions (loop);
       if (niters == NULL_TREE || niters == chrec_dont_know)
-	continue;
+	{
+	  datarefs_vec.create (20);
+	  if (transform_reduction_loop (loop))
+	    {
+	      changed = true;
+	      loops_to_be_destroyed.safe_push (loop);
+	      if (dump_enabled_p ())
+		{
+		  dump_user_location_t loc = find_loop_location (loop);
+		  dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
+				   loc, "Loop %d transformed into a builtin.\n",
+				   loop->num);
+		}
+	    }
+	  free_data_refs (datarefs_vec);
+	  continue;
+	}
 
       /* Get the perfect loop nest for distribution.  */
       loop = prepare_perfect_loop_nest (loop);
Richard Biener Sept. 6, 2021, 9:56 a.m. UTC | #21
On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
<stefansf@linux.ibm.com> wrote:
>
> On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> [...]
> > > >
> > > > +  /* Handle strlen like loops.  */
> > > > +  if (store_dr == NULL
> > > > +      && integer_zerop (pattern)
> > > > +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > +      && integer_onep (reduction_iv.step)
> > > > +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > > > +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> > > > +    {
> > > >
> > > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > > The iteration
> > > > only stops when you load a NUL and the increments just wrap along (you're
> > > > using the pointer IVs to compute the strlen result).  Can't you simply truncate?
> > >
> > > I think truncation is enough as long as no overflow occurs in strlen or
> > > strlen_using_rawmemchr.
> > >
> > > > For larger than size_type_node (actually larger than ptr_type_node would matter
> > > > I guess), the argument is that since pointer wrapping would be undefined anyway
> > > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > > >
> > > >       TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > > (ptr_type_node)
> > > >        || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > >
> > > > ?
> > >
> > > Regarding the implementation which makes use of rawmemchr:
> > >
> > > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > > the maximal length we can determine of a string where each character has
> > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > > ptrdiff type is undefined we have to make sure that if an overflow
> > > occurs, then an overflow occurs for reduction variable, too, and that
> > > this is undefined, too.  However, I'm not sure anymore whether we want
> > > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > > this would mean that a single string consumes more than half of the
> > > virtual addressable memory.  At least for architectures where
> > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > > to neglect the case where computing pointer difference may overflow.
> > > Otherwise we are talking about strings with lenghts of multiple
> > > pebibytes.  For other architectures we might have to be more precise
> > > and make sure that reduction variable overflows first and that this is
> > > undefined.
> > >
> > > Thus a conservative condition would be (I assumed that the size of any
> > > integral type is a power of two which I'm not sure if this really holds;
> > > IIRC the C standard requires only that the alignment is a power of two
> > > but not necessarily the size so I might need to change this):
> > >
> > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 (sizeof (load_type))
> > >    or in other words return true if reduction variable overflows first
> > >    and false otherwise.  */
> > >
> > > static bool
> > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > {
> > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
> > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
> > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> > > }
> > >
> > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > >     && reduction_var_overflows_first (reduction_var, load_type)
> > >
> > > Regarding the implementation which makes use of strlen:
> > >
> > > I'm not sure what it means if strlen is called for a string with a
> > > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > > using rawmemchr where we neglect the case of an overflow for 64bit
> > > architectures, a conservative condition would be:
> > >
> > > TYPE_PRECISION (size_type_node) == 64
> > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > >     && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> > >
> > > I still included the overflow undefined check for reduction variable in
> > > order to rule out situations where the reduction variable is unsigned
> > > and overflows as many times until strlen(,_using_rawmemchr) overflows,
> > > too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> > > architectures.  Anyhow, while writing this down it becomes clear that
> > > this deserves a comment which I will add once it becomes clear which way
> > > to go.
> >
> > I think all the arguments about objects bigger than half of the address-space
> > also are valid for 32bit targets and thus 32bit size_type_node (or
> > 32bit pointer size).
> > I'm not actually sure what's the canonical type to check against, whether
> > it's size_type_node (Cs size_t), ptr_type_node (Cs void *) or sizetype (the
> > middle-end "offset" type used for all address computations).  For weird reasons
> > I'd lean towards 'sizetype' (for example some embedded targets have 24bit
> > pointers but 16bit 'sizetype').
>
> Ok, for the strlen implementation I changed from size_type_node to
> sizetype and assume that no overflow occurs for string objects bigger
> than half of the address space for 32-bit targets and up:
>
>   (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
>    && TYPE_PRECISION (ptr_type_node) >= 32)
>   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
>       && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (sizetype))
>
> and similarly for the rawmemchr implementation:
>
>   (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
>    && TYPE_PRECISION (ptrdiff_type_node) >= 32)
>   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
>       && reduction_var_overflows_first (reduction_var, load_type))
>
> >
> > > >
> > > > +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> > > > +       {
> > > > +         const char *msg = G_("assuming signed overflow does not occur "
> > > > +                              "when optimizing strlen like loop");
> > > > +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> > > > +       }
> > > >
> > > > no, please don't add any new strict-overflow warnings ;)
> > >
> > > I just stumbled over code which produces such a warning and thought this
> > > is a hard requirement :D The new patch doesn't contain it anymore.
> > >
> > > >
> > > > The generate_*_builtin routines need some factoring - if you code-generate
> > > > into a gimple_seq you could use gimple_build () which would do the fold_stmt
> > > > (not sure why you do that - you should see to fold the call, not necessarily
> > > > the rest).  The replacement of reduction_var and the dumping could be shared.
> > > > There's also GET_MODE_NAME for the printing.
> > >
> > > I wasn't really sure which way to go.  Use a gsi, as it is done by
> > > existing generate_* functions, or make use of gimple_seq.  Since the
> > > latter uses internally also gsi I thought it is better to stick to gsi
> > > in the first place.  Now, after changing to gimple_seq I see the beauty
> > > of it :)
> > >
> > > I created two helper functions generate_strlen_builtin_1 and
> > > generate_reduction_builtin_1 in order to reduce code duplication.
> > >
> > > In function generate_strlen_builtin I changed from using
> > > builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
> > > (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
> > > sure whether my intuition about the difference between implicit and
> > > explicit builtins is correct.  In builtins.def there is a small example
> > > given which I would paraphrase as "use builtin_decl_explicit if the
> > > semantics of the builtin is defined by the C standard; otherwise use
> > > builtin_decl_implicit" but probably my intuition is wrong?
> > >
> > > Beside that I'm not sure whether I really have to call
> > > build_fold_addr_expr which looks superfluous to me since
> > > gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:
> > >
> > > tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > > gimple *fn_call = gimple_build_call (fn, 1, mem);
> > >
> > > However, since it is also used that way in the context of
> > > generate_memset_builtin I didn't remove it so far.
> > >
> > > > I think overall the approach is sound now but the details still need work.
> > >
> > > Once again thank you very much for your review.  Really appreciated!
> >
> > The patch lacks a changelog entry / description.  It's nice if patches sent
> > out for review are basically the rev as git format-patch produces.
> >
> > The rawmemchr optab needs documenting in md.texi
>
> While writing the documentation in md.texi I realised that other
> instructions expect an address to be a memory operand which is not the
> case for rawmemchr currently. At the moment the address is either an
> SSA_NAME or ADDR_EXPR with a tree pointer type in expand_RAWMEMCHR. As a
> consequence in the backend define_expand rawmemchr<mode> expects a
> register operand and not a memory operand. Would it make sense to build
> a MEM_REF out of SSA_NAME/ADDR_EXPR in expand_RAWMEMCHR? Not sure if
> MEM_REF is supposed to be the canonical form here.

I suppose the expander could use code similar to what
expand_builtin_memset_args does, using get_memory_rtx.  I suppose that
we're using MEM operands because those can convey things like alias
info or alignment info, something which REG operands cannot (easily).
I wouldn't build a MEM_REF and try to expand that.
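
In outline, an expander along those lines could look roughly like the
following (a minimal sketch only, assuming an internal function with a
memory and a pattern argument plus a rawmemchr optab; not separately
verified code):

void
expand_RAWMEMCHR (internal_fn, gcall *stmt)
{
  tree lhs = gimple_call_lhs (stmt);
  if (!lhs)
    return;
  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);

  tree mem = gimple_call_arg (stmt, 0);
  tree pattern = gimple_call_arg (stmt, 1);
  machine_mode mode = TYPE_MODE (TREE_TYPE (pattern));

  expand_operand ops[3];
  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (TREE_TYPE (lhs)));
  /* get_memory_rtx yields a MEM that carries alias and alignment info.  */
  create_fixed_operand (&ops[1], get_memory_rtx (mem, NULL));
  create_input_operand (&ops[2], expand_normal (pattern), mode);

  expand_insn (direct_optab_handler (rawmemchr_optab, mode), 3, ops);
  if (!rtx_equal_p (lhs_rtx, ops[0].value))
    emit_move_insn (lhs_rtx, ops[0].value);
}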

> >
> > +}
> > +
> > +static bool
> > +reduction_var_overflows_first (tree reduction_var, tree load_type)
> > +{
> > +  unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> >
> > this function needs a comment.
>
> Done.
>
> >
> > +         if (stmt_has_scalar_dependences_outside_loop (loop, phi))
> > +           {
> > +             if (reduction_stmt)
> > +               return false;
> >
> > you leak bbs here and elsewhere where you early exit the function.
> > In fact you fail to free it at all.
>
> Whoopsy. I factored the whole loop out into static function
> determine_reduction_stmt in order to deal with all early exits.
>
> >
> > Otherwise the patch looks good - thanks for all the improvements.
> >
> > What I do wonder is
> >
> > +  tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > +  gimple *fn_call = gimple_build_call (fn, 1, mem);
> >
> > using builtin_decl_explicit means that in a TU where strlen is neither
> > declared nor used we can end up emitting calls to it.  For memcpy/memmove
> > that's usually OK since we require those to be present even in a
> > freestanding environment.  But I'm not sure about strlen here so I'd
> > lean towards using builtin_decl_implicit and checking that for NULL which
> > IIRC should prevent emitting strlen when it's not declared and maybe even
> > if it's declared but not used.  All other uses that generate STRLEN
> > use that at least.
>
> Thanks for clarification.  I changed it back to builtin_decl_implicit
> and check for null pointers.

Thanks,
Richard.

> Thanks,
> Stefan
Stefan Schulze Frielinghaus Sept. 13, 2021, 2:53 p.m. UTC | #22
On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
> <stefansf@linux.ibm.com> wrote:
> >
> > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> > [...]
> > > > >
> > > > > +  /* Handle strlen like loops.  */
> > > > > +  if (store_dr == NULL
> > > > > +      && integer_zerop (pattern)
> > > > > +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > > +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > > +      && integer_onep (reduction_iv.step)
> > > > > +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > > > > +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> > > > > +    {
> > > > >
> > > > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > > > The iteration
> > > > > only stops when you load a NUL and the increments just wrap along (you're
> > > > > using the pointer IVs to compute the strlen result).  Can't you simply truncate?
> > > >
> > > > I think truncation is enough as long as no overflow occurs in strlen or
> > > > strlen_using_rawmemchr.
> > > >
> > > > > For larger than size_type_node (actually larger than ptr_type_node would matter
> > > > > I guess), the argument is that since pointer wrapping would be undefined anyway
> > > > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > > > >
> > > > >       TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > > > (ptr_type_node)
> > > > >        || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > > >
> > > > > ?
> > > >
> > > > Regarding the implementation which makes use of rawmemchr:
> > > >
> > > > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > > > the maximal length we can determine of a string where each character has
> > > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > > > ptrdiff type is undefined we have to make sure that if an overflow
> > > > occurs, then an overflow occurs for reduction variable, too, and that
> > > > this is undefined, too.  However, I'm not sure anymore whether we want
> > > > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > > > this would mean that a single string consumes more than half of the
> > > > virtual addressable memory.  At least for architectures where
> > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > > > to neglect the case where computing pointer difference may overflow.
> > > > Otherwise we are talking about strings with lengths of multiple
> > > > pebibytes.  For other architectures we might have to be more precise
> > > > and make sure that reduction variable overflows first and that this is
> > > > undefined.
> > > >
> > > > Thus a conservative condition would be (I assumed that the size of any
> > > > integral type is a power of two which I'm not sure if this really holds;
> > > > IIRC the C standard requires only that the alignment is a power of two
> > > > but not necessarily the size so I might need to change this):
> > > >
> > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 (sizeof (load_type))
> > > >    or in other words return true if reduction variable overflows first
> > > >    and false otherwise.  */
> > > >
> > > > static bool
> > > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > {
> > > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
> > > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
> > > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> > > > }
> > > >
> > > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > >     && reduction_var_overflows_first (reduction_var, load_type)
> > > >
> > > > Regarding the implementation which makes use of strlen:
> > > >
> > > > I'm not sure what it means if strlen is called for a string with a
> > > > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > > > using rawmemchr where we neglect the case of an overflow for 64bit
> > > > architectures, a conservative condition would be:
> > > >
> > > > TYPE_PRECISION (size_type_node) == 64
> > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > >     && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> > > >
> > > > I still included the overflow undefined check for reduction variable in
> > > > order to rule out situations where the reduction variable is unsigned
> > > > and overflows as many times until strlen(,_using_rawmemchr) overflows,
> > > > too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> > > > architectures.  Anyhow, while writing this down it becomes clear that
> > > > this deserves a comment which I will add once it becomes clear which way
> > > > to go.
> > >
> > > I think all the arguments about objects bigger than half of the address-space
> > > also are valid for 32bit targets and thus 32bit size_type_node (or
> > > 32bit pointer size).
> > > I'm not actually sure what's the canonical type to check against, whether
> > > it's size_type_node (Cs size_t), ptr_type_node (Cs void *) or sizetype (the
> > > middle-end "offset" type used for all address computations).  For weird reasons
> > > I'd lean towards 'sizetype' (for example some embedded targets have 24bit
> > > pointers but 16bit 'sizetype').
> >
> > Ok, for the strlen implementation I changed from size_type_node to
> > sizetype and assume that no overflow occurs for string objects bigger
> > than half of the address space for 32-bit targets and up:
> >
> >   (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
> >    && TYPE_PRECISION (ptr_type_node) >= 32)
> >   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >       && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (sizetype))
> >
> > and similarly for the rawmemchr implementation:
> >
> >   (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
> >    && TYPE_PRECISION (ptrdiff_type_node) >= 32)
> >   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >       && reduction_var_overflows_first (reduction_var, load_type))
> >
> > >
> > > > >
> > > > > +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> > > > > +       {
> > > > > +         const char *msg = G_("assuming signed overflow does not occur "
> > > > > +                              "when optimizing strlen like loop");
> > > > > +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> > > > > +       }
> > > > >
> > > > > no, please don't add any new strict-overflow warnings ;)
> > > >
> > > > I just stumbled over code which produces such a warning and thought this
> > > > is a hard requirement :D The new patch doesn't contain it anymore.
> > > >
> > > > >
> > > > > The generate_*_builtin routines need some factoring - if you code-generate
> > > > > into a gimple_seq you could use gimple_build () which would do the fold_stmt
> > > > > (not sure why you do that - you should see to fold the call, not necessarily
> > > > > the rest).  The replacement of reduction_var and the dumping could be shared.
> > > > > There's also GET_MODE_NAME for the printing.
> > > >
> > > > I wasn't really sure which way to go.  Use a gsi, as it is done by
> > > > existing generate_* functions, or make use of gimple_seq.  Since the
> > > > latter uses internally also gsi I thought it is better to stick to gsi
> > > > in the first place.  Now, after changing to gimple_seq I see the beauty
> > > > of it :)
> > > >
> > > > I created two helper functions generate_strlen_builtin_1 and
> > > > generate_reduction_builtin_1 in order to reduce code duplication.
> > > >
> > > > In function generate_strlen_builtin I changed from using
> > > > builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
> > > > (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
> > > > sure whether my intuition about the difference between implicit and
> > > > explicit builtins is correct.  In builtins.def there is a small example
> > > > given which I would paraphrase as "use builtin_decl_explicit if the
> > > > semantics of the builtin is defined by the C standard; otherwise use
> > > > builtin_decl_implicit" but probably my intuition is wrong?
> > > >
> > > > Beside that I'm not sure whether I really have to call
> > > > build_fold_addr_expr which looks superfluous to me since
> > > > gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:
> > > >
> > > > tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > > > gimple *fn_call = gimple_build_call (fn, 1, mem);
> > > >
> > > > However, since it is also used that way in the context of
> > > > generate_memset_builtin I didn't remove it so far.
> > > >
> > > > > I think overall the approach is sound now but the details still need work.
> > > >
> > > > Once again thank you very much for your review.  Really appreciated!
> > >
> > > The patch lacks a changelog entry / description.  It's nice if patches sent
> > > out for review are basically the rev as git format-patch produces.
> > >
> > > The rawmemchr optab needs documenting in md.texi
> >
> > While writing the documentation in md.texi I realised that other
> > instructions expect an address to be a memory operand which is not the
> > case for rawmemchr currently. At the moment the address is either an
> > SSA_NAME or ADDR_EXPR with a tree pointer type in expand_RAWMEMCHR. As a
> > consequence in the backend define_expand rawmemchr<mode> expects a
> > register operand and not a memory operand. Would it make sense to build
> > a MEM_REF out of SSA_NAME/ADDR_EXPR in expand_RAWMEMCHR? Not sure if
> > MEM_REF is supposed to be the canonical form here.
> 
> I suppose the expander could use code similar to what
> expand_builtin_memset_args does, using get_memory_rtx.  I suppose that
> we're using MEM operands because those can convey things like alias
> info or alignment info, something which REG operands cannot (easily).
> I wouldn't build a MEM_REF and try to expand that.

The new patch contains the following changes:

- In expand_RAWMEMCHR I'm using get_memory_rtx now.  This means I had to
  change linkage of get_memory_rtx to extern.

- In function generate_strlen_builtin_using_rawmemchr I'm no longer
  reconstructing the load type from the base pointer but rather passing
  it as a parameter from function transform_reduction_loop, where we
  also ensured that it is of integral type.  Reconstructing the load
  type was error prone since e.g. I didn't distinguish between
  pointer_plus_expr and addr_expr.  Thus passing the load type should
  be more robust (a small illustration of the non-QImode strlen case
  follows after this list).
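
As an illustration of the non-QImode strlen case (a hand-written model,
not code from the patch or the testsuite; the function names are made
up), the pass recognizes a counting loop like len32 below and replaces
it by a rawmemchr call followed by a pointer difference, which
len32_model spells out in plain C:

#include <stdint.h>
#include <stddef.h>

/* Loop shape recognized by the pass: count 32-bit elements up to the
   first zero element.  */
size_t len32 (const uint32_t *s)
{
  size_t i;
  for (i = 0; s[i]; ++i)
    ;
  return i;
}

/* Plain-C model of the replacement: scan for the zero element (what the
   rawmemchrSI call does) and take the pointer difference in elements;
   at the GIMPLE level the patch computes the byte difference with
   POINTER_DIFF_EXPR and divides by the element size.  */
size_t len32_model (const uint32_t *s)
{
  const uint32_t *p = s;
  while (*p != 0)
    ++p;
  return (size_t) (p - s);
}

int main (void)
{
  uint32_t buf[] = { 7, 7, 7, 0 };
  return !(len32 (buf) == 3 && len32_model (buf) == 3);
}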

Regtested on IBM Z and x86.  Ok for mainline?

Thanks,
Stefan

> 
> > >
> > > +}
> > > +
> > > +static bool
> > > +reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > +{
> > > +  unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > >
> > > this function needs a comment.
> >
> > Done.
> >
> > >
> > > +         if (stmt_has_scalar_dependences_outside_loop (loop, phi))
> > > +           {
> > > +             if (reduction_stmt)
> > > +               return false;
> > >
> > > you leak bbs here and elsewhere where you early exit the function.
> > > In fact you fail to free it at all.
> >
> > Whoopsy. I factored the whole loop out into static function
> > determine_reduction_stmt in order to deal with all early exits.
> >
> > >
> > > Otherwise the patch looks good - thanks for all the improvements.
> > >
> > > What I do wonder is
> > >
> > > +  tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > > +  gimple *fn_call = gimple_build_call (fn, 1, mem);
> > >
> > > using builtin_decl_explicit means that in a TU where strlen is neither
> > > declared nor used we can end up emitting calls to it.  For memcpy/memmove
> > > that's usually OK since we require those to be present even in a
> > > freestanding environment.  But I'm not sure about strlen here so I'd
> > > lean towards using builtin_decl_implicit and checking that for NULL which
> > > IIRC should prevent emitting strlen when it's not declared and maybe even
> > > if it's declared but not used.  All other uses that generate STRLEN
> > > use that at least.
> >
> > Thanks for clarification.  I changed it back to builtin_decl_implicit
> > and check for null pointers.
> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Stefan
From ad7cc80f57394178b11a4033f6a693a199732106 Mon Sep 17 00:00:00 2001
From: Stefan Schulze Frielinghaus <stefansf@linux.ibm.com>
Date: Wed, 17 Mar 2021 09:00:06 +0100
Subject: [PATCH 1/2] ldist: Recognize strlen and rawmemchr like loops

This patch adds support for recognizing loops which mimic the behaviour
of functions strlen and rawmemchr, and replaces those with internal
function calls in case a target provides them.  In contrast to the
standard strlen and rawmemchr functions, this patch also supports
different instances where the memory pointed to is interpreted as 8, 16,
and 32-bit sized, respectively.

gcc/ChangeLog:

	* builtins.c (get_memory_rtx): Change to external linkage.
	* builtins.h (get_memory_rtx): Add function prototype.
	* doc/md.texi (rawmemchr<mode>): Document.
	* internal-fn.c (expand_RAWMEMCHR): Define.
	* internal-fn.def (RAWMEMCHR): Add.
	* optabs.def (rawmemchr_optab): Add.
	* tree-loop-distribution.c (find_single_drs): Change return code
	behaviour by also returning true if no single store was found
	but a single load.
	(loop_distribution::classify_partition): Respect the new return
	code behaviour of function find_single_drs.
	(loop_distribution::execute): Call new function
	transform_reduction_loop in order to replace rawmemchr or strlen
	like loops by calls into builtins.
	(generate_reduction_builtin_1): New function.
	(generate_rawmemchr_builtin): New function.
	(generate_strlen_builtin_1): New function.
	(generate_strlen_builtin): New function.
	(generate_strlen_builtin_using_rawmemchr): New function.
	(reduction_var_overflows_first): New function.
	(determine_reduction_stmt_1): New function.
	(determine_reduction_stmt): New function.
	(loop_distribution::transform_reduction_loop): New function.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ldist-rawmemchr-1.c: New test.
	* gcc.dg/tree-ssa/ldist-rawmemchr-2.c: New test.
	* gcc.dg/tree-ssa/ldist-strlen-1.c: New test.
	* gcc.dg/tree-ssa/ldist-strlen-2.c: New test.
	* gcc.dg/tree-ssa/ldist-strlen-3.c: New test.
---
 gcc/builtins.c                                |   3 +-
 gcc/builtins.h                                |   1 +
 gcc/doc/md.texi                               |   7 +
 gcc/internal-fn.c                             |  30 +
 gcc/internal-fn.def                           |   1 +
 gcc/optabs.def                                |   1 +
 .../gcc.dg/tree-ssa/ldist-rawmemchr-1.c       |  72 +++
 .../gcc.dg/tree-ssa/ldist-rawmemchr-2.c       |  83 +++
 .../gcc.dg/tree-ssa/ldist-strlen-1.c          | 100 ++++
 .../gcc.dg/tree-ssa/ldist-strlen-2.c          |  58 ++
 .../gcc.dg/tree-ssa/ldist-strlen-3.c          |  12 +
 gcc/tree-loop-distribution.c                  | 518 +++++++++++++++++-
 12 files changed, 855 insertions(+), 31 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 3e57eb03af0..f989a869a9f 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -105,7 +105,6 @@ builtin_info_type builtin_info[(int)END_BUILTINS];
 bool force_folding_builtin_constant_p;
 
 static int target_char_cast (tree, char *);
-static rtx get_memory_rtx (tree, tree);
 static int apply_args_size (void);
 static int apply_result_size (void);
 static rtx result_vector (int, rtx);
@@ -1355,7 +1354,7 @@ expand_builtin_prefetch (tree exp)
    the maximum length of the block of memory that might be accessed or
    NULL if unknown.  */
 
-static rtx
+rtx
 get_memory_rtx (tree exp, tree len)
 {
   tree orig_exp = exp;
diff --git a/gcc/builtins.h b/gcc/builtins.h
index d330b78e591..5e4d86e9c37 100644
--- a/gcc/builtins.h
+++ b/gcc/builtins.h
@@ -146,6 +146,7 @@ extern char target_percent_s[3];
 extern char target_percent_c[3];
 extern char target_percent_s_newline[4];
 extern bool target_char_cst_p (tree t, char *p);
+extern rtx get_memory_rtx (tree exp, tree len);
 
 extern internal_fn associated_internal_fn (tree);
 extern internal_fn replacement_internal_fn (gcall *);
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 2b41cb7fb7b..10faacdca6c 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -6695,6 +6695,13 @@ operand 2 is the character to search for (normally zero),
 and operand 3 is a constant describing the known alignment
 of the beginning of the string.
 
+@cindex @code{rawmemchr@var{m}} instruction pattern
+@item @samp{rawmemchr@var{m}}
+Scan memory referred to by operand 1 for the first occurrence of operand 2.
+Operand 1 is a @code{mem} and operand 2 a @code{const_int} of mode @var{m}.
+Operand 0 is the result, i.e., a pointer to the first occurrence of operand 2
+in the memory block given by operand 1.
+
 @cindex @code{float@var{m}@var{n}2} instruction pattern
 @item @samp{float@var{m}@var{n}2}
 Convert signed integer operand 1 (valid for fixed point mode @var{m}) to
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index ada2a820ff1..5bde0864a8f 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2934,6 +2934,36 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+/* Expand IFN_RAWMEMCHR internal function.  */
+
+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+  machine_mode lhs_mode = TYPE_MODE (TREE_TYPE (lhs));
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, lhs_mode);
+
+  tree mem = gimple_call_arg (stmt, 0);
+  rtx mem_rtx = get_memory_rtx (mem, NULL);
+  create_fixed_operand (&ops[1], mem_rtx);
+
+  tree pattern = gimple_call_arg (stmt, 1);
+  machine_mode mode = TYPE_MODE (TREE_TYPE (pattern));
+  rtx pattern_rtx = expand_normal (pattern);
+  create_input_operand (&ops[2], pattern_rtx, mode);
+
+  insn_code icode = direct_optab_handler (rawmemchr_optab, mode);
+
+  expand_insn (icode, 3, ops);
+  if (!rtx_equal_p (lhs_rtx, ops[0].value))
+    emit_move_insn (lhs_rtx, ops[0].value);
+}
+
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 88169ef4656..01f60a6cf26 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -352,6 +352,7 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 201b8aae1c0..f02c7b729a5 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -267,6 +267,7 @@ OPTAB_D (cpymem_optab, "cpymem$a")
 OPTAB_D (movmem_optab, "movmem$a")
 OPTAB_D (setmem_optab, "setmem$a")
 OPTAB_D (strlen_optab, "strlen$a")
+OPTAB_D (rawmemchr_optab, "rawmemchr$a")
 
 OPTAB_DC(fma_optab, "fma$a4", FMA)
 OPTAB_D (fms_optab, "fms$a4")
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
new file mode 100644
index 00000000000..6abfd278351
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-1.c
@@ -0,0 +1,72 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt and no store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, pattern)   \
+__attribute__((noinline))  \
+T *test_##T (T *p)         \
+{                          \
+  while (*p != (T)pattern) \
+    ++p;                   \
+  return p;                \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  0xab)
+test (int16_t, 0xabcd)
+test (int32_t, 0xabcdef15)
+
+#define run(T, pattern, i)      \
+{                               \
+T *q = p;                       \
+q[i] = (T)pattern;              \
+assert (test_##T (p) == &q[i]); \
+q[i] = 0;                       \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0, 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, 0xab, 0);
+  run (int8_t, 0xab, 1);
+  run (int8_t, 0xab, 13);
+
+  run (int16_t, 0xabcd, 0);
+  run (int16_t, 0xabcd, 1);
+  run (int16_t, 0xabcd, 13);
+
+  run (int32_t, 0xabcdef15, 0);
+  run (int32_t, 0xabcdef15, 1);
+  run (int32_t, 0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
new file mode 100644
index 00000000000..00d6ea0f8e9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-rawmemchr-2.c
@@ -0,0 +1,83 @@
+/* { dg-do run { target s390x-*-* } } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrQI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrHI" 2 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated rawmemchrSI" 2 "ldist" { target s390x-*-* } } } */
+
+/* Rawmemchr pattern: reduction stmt and store */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+uint8_t *p_uint8_t;
+uint16_t *p_uint16_t;
+uint32_t *p_uint32_t;
+
+int8_t *p_int8_t;
+int16_t *p_int16_t;
+int32_t *p_int32_t;
+
+#define test(T, pattern)    \
+__attribute__((noinline))   \
+T *test_##T (void)          \
+{                           \
+  while (*p_##T != pattern) \
+    ++p_##T;                \
+  return p_##T;             \
+}
+
+test (uint8_t,  0xab)
+test (uint16_t, 0xabcd)
+test (uint32_t, 0xabcdef15)
+
+test (int8_t,  (int8_t)0xab)
+test (int16_t, (int16_t)0xabcd)
+test (int32_t, (int32_t)0xabcdef15)
+
+#define run(T, pattern, i) \
+{                          \
+T *q = p;                  \
+q[i] = pattern;            \
+p_##T = p;                 \
+T *r = test_##T ();        \
+assert (r == p_##T);       \
+assert (r == &q[i]);       \
+q[i] = 0;                  \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, '\0', 1024);
+
+  run (uint8_t, 0xab, 0);
+  run (uint8_t, 0xab, 1);
+  run (uint8_t, 0xab, 13);
+
+  run (uint16_t, 0xabcd, 0);
+  run (uint16_t, 0xabcd, 1);
+  run (uint16_t, 0xabcd, 13);
+
+  run (uint32_t, 0xabcdef15, 0);
+  run (uint32_t, 0xabcdef15, 1);
+  run (uint32_t, 0xabcdef15, 13);
+
+  run (int8_t, (int8_t)0xab, 0);
+  run (int8_t, (int8_t)0xab, 1);
+  run (int8_t, (int8_t)0xab, 13);
+
+  run (int16_t, (int16_t)0xabcd, 0);
+  run (int16_t, (int16_t)0xabcd, 1);
+  run (int16_t, (int16_t)0xabcd, 13);
+
+  run (int32_t, (int32_t)0xabcdef15, 0);
+  run (int32_t, (int32_t)0xabcdef15, 1);
+  run (int32_t, (int32_t)0xabcdef15, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
new file mode 100644
index 00000000000..918b60099e4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-1.c
@@ -0,0 +1,100 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 4 "ldist" } } */
+/* { dg-final { scan-tree-dump-times "generated strlenHI\n" 4 "ldist" { target s390x-*-* } } } */
+/* { dg-final { scan-tree-dump-times "generated strlenSI\n" 4 "ldist" { target s390x-*-* } } } */
+
+#include <stdint.h>
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+#define test(T, U)        \
+__attribute__((noinline)) \
+U test_##T##U (T *s)      \
+{                         \
+  U i;                    \
+  for (i=0; s[i]; ++i);   \
+  return i;               \
+}
+
+test (uint8_t,  size_t)
+test (uint16_t, size_t)
+test (uint32_t, size_t)
+test (uint8_t,  int)
+test (uint16_t, int)
+test (uint32_t, int)
+
+test (int8_t,  size_t)
+test (int16_t, size_t)
+test (int32_t, size_t)
+test (int8_t,  int)
+test (int16_t, int)
+test (int32_t, int)
+
+#define run(T, U, i)             \
+{                                \
+T *q = p;                        \
+q[i] = 0;                        \
+assert (test_##T##U (p) == i);   \
+memset (&q[i], 0xf, sizeof (T)); \
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+
+  run (uint8_t, size_t, 0);
+  run (uint8_t, size_t, 1);
+  run (uint8_t, size_t, 13);
+
+  run (int8_t, size_t, 0);
+  run (int8_t, size_t, 1);
+  run (int8_t, size_t, 13);
+
+  run (uint8_t, int, 0);
+  run (uint8_t, int, 1);
+  run (uint8_t, int, 13);
+
+  run (int8_t, int, 0);
+  run (int8_t, int, 1);
+  run (int8_t, int, 13);
+
+  run (uint16_t, size_t, 0);
+  run (uint16_t, size_t, 1);
+  run (uint16_t, size_t, 13);
+
+  run (int16_t, size_t, 0);
+  run (int16_t, size_t, 1);
+  run (int16_t, size_t, 13);
+
+  run (uint16_t, int, 0);
+  run (uint16_t, int, 1);
+  run (uint16_t, int, 13);
+
+  run (int16_t, int, 0);
+  run (int16_t, int, 1);
+  run (int16_t, int, 13);
+
+  run (uint32_t, size_t, 0);
+  run (uint32_t, size_t, 1);
+  run (uint32_t, size_t, 13);
+
+  run (int32_t, size_t, 0);
+  run (int32_t, size_t, 1);
+  run (int32_t, size_t, 13);
+
+  run (uint32_t, int, 0);
+  run (uint32_t, int, 1);
+  run (uint32_t, int, 13);
+
+  run (int32_t, int, 0);
+  run (int32_t, int, 1);
+  run (int32_t, int, 13);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
new file mode 100644
index 00000000000..e25d6ea5b56
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-2.c
@@ -0,0 +1,58 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlenQI\n" 3 "ldist" } } */
+
+#include <assert.h>
+
+typedef __SIZE_TYPE__ size_t;
+extern void* malloc (size_t);
+extern void* memset (void*, int, size_t);
+
+__attribute__((noinline))
+int test_pos (char *s)
+{
+  int i;
+  for (i=42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_neg (char *s)
+{
+  int i;
+  for (i=-42; s[i]; ++i);
+  return i;
+}
+
+__attribute__((noinline))
+int test_including_null_char (char *s)
+{
+  int i;
+  for (i=1; s[i-1]; ++i);
+  return i;
+}
+
+int main(void)
+{
+  void *p = malloc (1024);
+  assert (p);
+  memset (p, 0xf, 1024);
+  char *s = (char *)p + 100;
+
+  s[42+13] = 0;
+  assert (test_pos (s) == 42+13);
+  s[42+13] = 0xf;
+
+  s[13] = 0;
+  assert (test_neg (s) == 13);
+  s[13] = 0xf;
+
+  s[-13] = 0;
+  assert (test_neg (s) == -13);
+  s[-13] = 0xf;
+
+  s[13] = 0;
+  assert (test_including_null_char (s) == 13+1);
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c
new file mode 100644
index 00000000000..370fd5eb088
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-strlen-3.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -fdump-tree-ldist-details" } */
+/* { dg-final { scan-tree-dump-times "generated strlenSI\n" 1 "ldist" { target s390x-*-* } } } */
+
+extern int s[];
+
+int test ()
+{
+  int i = 0;
+  for (; s[i]; ++i);
+  return i;
+}
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 2df762c8aa8..fb9250031b5 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -116,6 +116,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-eh.h"
 #include "gimple-fold.h"
 #include "tree-affine.h"
+#include "intl.h"
+#include "rtl.h"
+#include "memmodel.h"
+#include "optabs.h"
 
 
 #define MAX_DATAREFS_NUM \
@@ -651,6 +655,10 @@ class loop_distribution
 		       control_dependences *cd, int *nb_calls, bool *destroy_p,
 		       bool only_patterns_p);
 
+  /* Transform loops which mimic the effects of builtins rawmemchr or strlen and
+     replace them accordingly.  */
+  bool transform_reduction_loop (loop_p loop);
+
   /* Compute topological order for basic blocks.  Topological order is
      needed because data dependence is computed for data references in
      lexicographical order.  */
@@ -1492,14 +1500,14 @@ loop_distribution::build_rdg_partition_for_vertex (struct graph *rdg, int v)
    data references.  */
 
 static bool
-find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
+find_single_drs (class loop *loop, struct graph *rdg, const bitmap &partition_stmts,
 		 data_reference_p *dst_dr, data_reference_p *src_dr)
 {
   unsigned i;
   data_reference_p single_ld = NULL, single_st = NULL;
   bitmap_iterator bi;
 
-  EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
+  EXECUTE_IF_SET_IN_BITMAP (partition_stmts, 0, i, bi)
     {
       gimple *stmt = RDG_STMT (rdg, i);
       data_reference_p dr;
@@ -1540,44 +1548,47 @@ find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
 	}
     }
 
-  if (!single_st)
-    return false;
-
-  /* Bail out if this is a bitfield memory reference.  */
-  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
-      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
+  if (!single_ld && !single_st)
     return false;
 
-  /* Data reference must be executed exactly once per iteration of each
-     loop in the loop nest.  We only need to check dominance information
-     against the outermost one in a perfect loop nest because a bb can't
-     dominate outermost loop's latch without dominating inner loop's.  */
-  basic_block bb_st = gimple_bb (DR_STMT (single_st));
-  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
-    return false;
+  basic_block bb_ld = NULL;
+  basic_block bb_st = NULL;
 
   if (single_ld)
     {
-      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
-      /* Direct aggregate copy or via an SSA name temporary.  */
-      if (load != store
-	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
-	return false;
-
       /* Bail out if this is a bitfield memory reference.  */
       if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
 	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
 	return false;
 
-      /* Load and store must be in the same loop nest.  */
-      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
-      if (bb_st->loop_father != bb_ld->loop_father)
+      /* Data reference must be executed exactly once per iteration of each
+	 loop in the loop nest.  We only need to check dominance information
+	 against the outermost one in a perfect loop nest because a bb can't
+	 dominate outermost loop's latch without dominating inner loop's.  */
+      bb_ld = gimple_bb (DR_STMT (single_ld));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+	return false;
+    }
+
+  if (single_st)
+    {
+      /* Bail out if this is a bitfield memory reference.  */
+      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
+	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
 	return false;
 
       /* Data reference must be executed exactly once per iteration.
-	 Same as single_st, we only need to check against the outermost
+	 Same as single_ld, we only need to check against the outermost
 	 loop.  */
-      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+      bb_st = gimple_bb (DR_STMT (single_st));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
+	return false;
+    }
+
+  if (single_ld && single_st)
+    {
+      /* Load and store must be in the same loop nest.  */
+      if (bb_st->loop_father != bb_ld->loop_father)
 	return false;
 
       edge e = single_exit (bb_st->loop_father);
@@ -1852,9 +1863,19 @@ loop_distribution::classify_partition (loop_p loop,
     return has_reduction;
 
   /* Find single load/store data references for builtin partition.  */
-  if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
+  if (!find_single_drs (loop, rdg, partition->stmts, &single_st, &single_ld)
+      || !single_st)
     return has_reduction;
 
+  if (single_ld && single_st)
+    {
+      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
+      /* Direct aggregate copy or via an SSA name temporary.  */
+      if (load != store
+	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
+	return has_reduction;
+    }
+
   partition->loc = gimple_location (DR_STMT (single_st));
 
   /* Classify the builtin kind.  */
@@ -3260,6 +3281,428 @@ find_seed_stmts_for_distribution (class loop *loop, vec<gimple *> *work_list)
   return work_list->length () > 0;
 }
 
+/* A helper function for generate_{rawmemchr,strlen}_builtin functions in order
+   to place new statements SEQ before LOOP and replace the old reduction
+   variable with the new one.  */
+
+static void
+generate_reduction_builtin_1 (loop_p loop, gimple_seq &seq,
+			      tree reduction_var_old, tree reduction_var_new,
+			      const char *info, machine_mode load_mode)
+{
+  /* Place new statements before LOOP.  */
+  gimple_stmt_iterator gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+  gsi_insert_seq_after (&gsi, seq, GSI_CONTINUE_LINKING);
+
+  /* Replace old reduction variable with new one.  */
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, reduction_var_old)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, reduction_var_new);
+
+      update_stmt (stmt);
+    }
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, info, GET_MODE_NAME (load_mode));
+}
+
+/* Generate a call to rawmemchr and place it before LOOP.  REDUCTION_VAR is
+   replaced with a fresh SSA name representing the result of the call.  */
+
+static void
+generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
+			    data_reference_p store_dr, tree base, tree pattern,
+			    location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree mem = force_gimple_operand (base, &seq, true, NULL_TREE);
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, mem, pattern);
+  tree reduction_var_new = copy_ssa_name (reduction_var);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  if (store_dr)
+    {
+      gassign *g = gimple_build_assign (DR_REF (store_dr), reduction_var_new);
+      gimple_seq_add_stmt (&seq, g);
+    }
+
+  generate_reduction_builtin_1 (loop, seq, reduction_var, reduction_var_new,
+				"generated rawmemchr%s\n",
+				TYPE_MODE (TREE_TYPE (TREE_TYPE (base))));
+}
+
+/* Helper function for generate_strlen_builtin(,_using_rawmemchr)  */
+
+static void
+generate_strlen_builtin_1 (loop_p loop, gimple_seq &seq,
+			   tree reduction_var_old, tree reduction_var_new,
+			   machine_mode mode, tree start_len)
+{
+  /* REDUCTION_VAR_NEW has either size type or ptrdiff type and must be
+     converted if types of old and new reduction variable are not compatible. */
+  reduction_var_new = gimple_convert (&seq, TREE_TYPE (reduction_var_old),
+				      reduction_var_new);
+
+  /* Loops of the form `for (i=42; s[i]; ++i);` have an additional start
+     length.  */
+  if (!integer_zerop (start_len))
+    {
+      tree lhs = make_ssa_name (TREE_TYPE (reduction_var_new));
+      gimple *g = gimple_build_assign (lhs, PLUS_EXPR, reduction_var_new,
+				       start_len);
+      gimple_seq_add_stmt (&seq, g);
+      reduction_var_new = lhs;
+    }
+
+  generate_reduction_builtin_1 (loop, seq, reduction_var_old, reduction_var_new,
+				"generated strlen%s\n", mode);
+}
+
+/* Generate a call to strlen and place it before LOOP.  REDUCTION_VAR is
+   replaced with a fresh SSA name representing the result of the call.  */
+
+static void
+generate_strlen_builtin (loop_p loop, tree reduction_var, tree base,
+			 tree start_len, location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree reduction_var_new = make_ssa_name (size_type_node);
+
+  tree mem = force_gimple_operand (base, &seq, true, NULL_TREE);
+  tree fn = build_fold_addr_expr (builtin_decl_implicit (BUILT_IN_STRLEN));
+  gimple *fn_call = gimple_build_call (fn, 1, mem);
+  gimple_call_set_lhs (fn_call, reduction_var_new);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  generate_strlen_builtin_1 (loop, seq, reduction_var, reduction_var_new,
+			     QImode, start_len);
+}
+
+/* Generate code in order to mimic the behaviour of strlen but this time over
+   an array of elements with mode different than QI.  REDUCTION_VAR is replaced
+   with a fresh SSA name representing the result, i.e., the length.  */
+
+static void
+generate_strlen_builtin_using_rawmemchr (loop_p loop, tree reduction_var,
+					 tree base, tree load_type,
+					 tree start_len, location_t loc)
+{
+  gimple_seq seq = NULL;
+
+  tree start = force_gimple_operand (base, &seq, true, NULL_TREE);
+  tree zero = build_zero_cst (load_type);
+  gimple *fn_call = gimple_build_call_internal (IFN_RAWMEMCHR, 2, start, zero);
+  tree end = make_ssa_name (TREE_TYPE (base));
+  gimple_call_set_lhs (fn_call, end);
+  gimple_set_location (fn_call, loc);
+  gimple_seq_add_stmt (&seq, fn_call);
+
+  /* Determine the number of elements between START and END by
+     evaluating (END - START) / sizeof (*START).  */
+  tree diff = make_ssa_name (ptrdiff_type_node);
+  gimple *diff_stmt = gimple_build_assign (diff, POINTER_DIFF_EXPR, end, start);
+  gimple_seq_add_stmt (&seq, diff_stmt);
+  /* Let SIZE be the size of each character.  */
+  tree size = gimple_convert (&seq, ptrdiff_type_node,
+			      TYPE_SIZE_UNIT (load_type));
+  tree count = make_ssa_name (ptrdiff_type_node);
+  gimple *count_stmt = gimple_build_assign (count, TRUNC_DIV_EXPR, diff, size);
+  gimple_seq_add_stmt (&seq, count_stmt);
+
+  generate_strlen_builtin_1 (loop, seq, reduction_var, count,
+			     TYPE_MODE (load_type),
+			     start_len);
+}
+
+/* Return true if we can count at least as many characters by taking pointer
+   difference as we can count via reduction_var without an overflow.  Thus
+   compute 2^n < (2^(m-1) / s) where n = TYPE_PRECISION (reduction_var),
+   m = TYPE_PRECISION (ptrdiff_type_node), and s = size of each character.  */
+static bool
+reduction_var_overflows_first (tree reduction_var, tree load_type)
+{
+  widest_int n2 = wi::lshift (1, TYPE_PRECISION (TREE_TYPE (reduction_var)));
+  widest_int m2 = wi::lshift (1, TYPE_PRECISION (ptrdiff_type_node) - 1);
+  widest_int s = wi::to_widest (TYPE_SIZE_UNIT (load_type));
+  return wi::ltu_p (n2, wi::udiv_trunc (m2, s));
+}
+
+static gimple *
+determine_reduction_stmt_1 (const loop_p loop, const basic_block *bbs)
+{
+  gimple *reduction_stmt = NULL;
+
+  for (unsigned i = 0, ninsns = 0; i < loop->num_nodes; ++i)
+    {
+      basic_block bb = bbs[i];
+
+      for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
+	   gsi_next_nondebug (&bsi))
+	{
+	  gphi *phi = bsi.phi ();
+	  if (virtual_operand_p (gimple_phi_result (phi)))
+	    continue;
+	  if (stmt_has_scalar_dependences_outside_loop (loop, phi))
+	    {
+	      if (reduction_stmt)
+		return NULL;
+	      reduction_stmt = phi;
+	    }
+	}
+
+      for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
+	   gsi_next_nondebug (&bsi), ++ninsns)
+	{
+	  /* Bail out early for loops which are unlikely to match.  */
+	  if (ninsns > 16)
+	    return NULL;
+	  gimple *stmt = gsi_stmt (bsi);
+	  if (gimple_clobber_p (stmt))
+	    continue;
+	  if (gimple_code (stmt) == GIMPLE_LABEL)
+	    continue;
+	  if (gimple_has_volatile_ops (stmt))
+	    return NULL;
+	  if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
+	    {
+	      if (reduction_stmt)
+		return NULL;
+	      reduction_stmt = stmt;
+	    }
+	}
+    }
+
+  return reduction_stmt;
+}
+
+/* If LOOP has a single non-volatile reduction statement, then return a pointer
+   to it.  Otherwise return NULL.  */
+static gimple *
+determine_reduction_stmt (const loop_p loop)
+{
+  basic_block *bbs = get_loop_body (loop);
+  gimple *reduction_stmt = determine_reduction_stmt_1 (loop, bbs);
+  XDELETEVEC (bbs);
+  return reduction_stmt;
+}
+
+/* Transform loops which mimic the effects of builtins rawmemchr or strlen and
+   replace them accordingly.  For example, a loop of the form
+
+     for (; *p != 42; ++p);
+
+   is replaced by
+
+     p = rawmemchr<MODE> (p, 42);
+
+   under the assumption that rawmemchr is available for a particular MODE.
+   Another example is
+
+     int i;
+     for (i = 42; s[i]; ++i);
+
+   which is replaced by
+
+     i = (int)strlen (&s[42]) + 42;
+
+   for some character array S.  In case array S is not of type character array
+   we end up with
+
+     i = (int)(rawmemchr<MODE> (&s[42], 0) - &s[42]) + 42;
+
+   assuming that rawmemchr is available for a particular MODE.  */
+
+bool
+loop_distribution::transform_reduction_loop (loop_p loop)
+{
+  gimple *reduction_stmt;
+  data_reference_p load_dr = NULL, store_dr = NULL;
+
+  edge e = single_exit (loop);
+  gcond *cond = safe_dyn_cast <gcond *> (last_stmt (e->src));
+  if (!cond)
+    return false;
+  /* Ensure loop condition is an (in)equality test and loop is exited either if
+     the inequality test fails or the equality test succeeds.  */
+  if (!(e->flags & EDGE_FALSE_VALUE && gimple_cond_code (cond) == NE_EXPR)
+      && !(e->flags & EDGE_TRUE_VALUE && gimple_cond_code (cond) == EQ_EXPR))
+    return false;
+  /* A limitation of the current implementation is that we only support
+     constant patterns in (in)equality tests.  */
+  tree pattern = gimple_cond_rhs (cond);
+  if (TREE_CODE (pattern) != INTEGER_CST)
+    return false;
+
+  reduction_stmt = determine_reduction_stmt (loop);
+
+  /* A limitation of the current implementation is that we require a reduction
+     statement.  Therefore, loops without a reduction statement as in the
+     following are not recognized:
+     int *p;
+     void foo (void) { for (; *p; ++p); } */
+  if (reduction_stmt == NULL)
+    return false;
+
+  /* Reduction variables are guaranteed to be SSA names.  */
+  tree reduction_var;
+  switch (gimple_code (reduction_stmt))
+    {
+    case GIMPLE_ASSIGN:
+    case GIMPLE_PHI:
+      reduction_var = gimple_get_lhs (reduction_stmt);
+      break;
+    default:
+      /* Bail out e.g. for GIMPLE_CALL.  */
+      return false;
+    }
+
+  struct graph *rdg = build_rdg (loop, NULL);
+  if (rdg == NULL)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "Loop %d not transformed: failed to build the RDG.\n",
+		 loop->num);
+
+      return false;
+    }
+  auto_bitmap partition_stmts;
+  bitmap_set_range (partition_stmts, 0, rdg->n_vertices);
+  find_single_drs (loop, rdg, partition_stmts, &store_dr, &load_dr);
+  free_rdg (rdg);
+
+  /* Bail out if there is no single load.  */
+  if (load_dr == NULL)
+    return false;
+
+  /* Reaching this point we have a loop with a single reduction variable,
+     a single load, and an optional single store.  */
+
+  tree load_ref = DR_REF (load_dr);
+  tree load_type = TREE_TYPE (load_ref);
+  tree load_access_base = build_fold_addr_expr (load_ref);
+  tree load_access_size = TYPE_SIZE_UNIT (load_type);
+  affine_iv load_iv, reduction_iv;
+
+  if (!INTEGRAL_TYPE_P (load_type)
+      || !type_has_mode_precision_p (load_type))
+    return false;
+
+  /* We already ensured that the loop condition tests for (in)equality where the
+     rhs is a constant pattern. Now ensure that the lhs is the result of the
+     load.  */
+  if (gimple_cond_lhs (cond) != gimple_assign_lhs (DR_STMT (load_dr)))
+    return false;
+
+  /* Bail out if no affine induction variable with constant step can be
+     determined.  */
+  if (!simple_iv (loop, loop, load_access_base, &load_iv, false))
+    return false;
+
+  /* Bail out if memory accesses are not consecutive or not growing.  */
+  if (!operand_equal_p (load_iv.step, load_access_size, 0))
+    return false;
+
+  if (!simple_iv (loop, loop, reduction_var, &reduction_iv, false))
+    return false;
+
+  /* Handle rawmemchr like loops.  */
+  if (operand_equal_p (load_iv.base, reduction_iv.base)
+      && operand_equal_p (load_iv.step, reduction_iv.step))
+    {
+      if (store_dr)
+	{
+	  /* Ensure that we store to X and load from X+I where I>0.  */
+	  if (TREE_CODE (load_iv.base) != POINTER_PLUS_EXPR
+	      || !integer_onep (TREE_OPERAND (load_iv.base, 1)))
+	    return false;
+	  tree ptr_base = TREE_OPERAND (load_iv.base, 0);
+	  if (TREE_CODE (ptr_base) != SSA_NAME)
+	    return false;
+	  gimple *def = SSA_NAME_DEF_STMT (ptr_base);
+	  if (!gimple_assign_single_p (def)
+	      || gimple_assign_rhs1 (def) != DR_REF (store_dr))
+	    return false;
+	  /* Ensure that the reduction value is stored.  */
+	  if (gimple_assign_rhs1 (DR_STMT (store_dr)) != reduction_var)
+	    return false;
+	}
+      /* Bail out if target does not provide rawmemchr for a certain mode.  */
+      machine_mode mode = TYPE_MODE (load_type);
+      if (direct_optab_handler (rawmemchr_optab, mode) == CODE_FOR_nothing)
+	return false;
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      generate_rawmemchr_builtin (loop, reduction_var, store_dr, load_iv.base,
+				  pattern, loc);
+      return true;
+    }
+
+  /* Handle strlen like loops.  */
+  if (store_dr == NULL
+      && integer_zerop (pattern)
+      && TREE_CODE (reduction_iv.base) == INTEGER_CST
+      && TREE_CODE (reduction_iv.step) == INTEGER_CST
+      && integer_onep (reduction_iv.step))
+    {
+      location_t loc = gimple_location (DR_STMT (load_dr));
+      /* While determining the length of a string an overflow might occur.
+	 If an overflow only occurs in the loop implementation and not in the
+	 strlen implementation, then either the overflow is undefined or the
+	 truncated result of strlen equals the one of the loop.  Otherwise if
+	 an overflow may also occur in the strlen implementation, then
+	 replacing a loop by a call to strlen is sound whenever we ensure that
+	 if an overflow occurs in the strlen implementation, then also an
+	 overflow occurs in the loop implementation which is undefined.  It
+	 seems reasonable to relax this and assume that the strlen
+	 implementation cannot overflow in case sizetype is big enough in the
+	 sense that an overflow can only happen for string objects which are
+	 bigger than half of the address space; at least for 32-bit targets and
+	 up.
+
+	 For strlen which makes use of rawmemchr the maximal length of a string
+	 which can be determined without an overflow is PTRDIFF_MAX / S where
+	 each character has size S.  Since an overflow for ptrdiff type is
+	 undefined we have to make sure that if an overflow occurs, then an
+	 overflow occurs in the loop implementation, too, and this is
+	 undefined, too.  Similar as before we relax this and assume that no
+	 string object is larger than half of the address space; at least for
+	 32-bit targets and up.  */
+      if (TYPE_MODE (load_type) == TYPE_MODE (char_type_node)
+	  && TYPE_PRECISION (load_type) == TYPE_PRECISION (char_type_node)
+	  && ((TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
+	       && TYPE_PRECISION (ptr_type_node) >= 32)
+	      || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
+		  && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (sizetype)))
+	  && builtin_decl_implicit (BUILT_IN_STRLEN))
+	generate_strlen_builtin (loop, reduction_var, load_iv.base,
+				 reduction_iv.base, loc);
+      else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type))
+	       != CODE_FOR_nothing
+	       && ((TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
+		    && TYPE_PRECISION (ptrdiff_type_node) >= 32)
+		   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
+		       && reduction_var_overflows_first (reduction_var, load_type))))
+	generate_strlen_builtin_using_rawmemchr (loop, reduction_var,
+						 load_iv.base,
+						 load_type,
+						 reduction_iv.base, loc);
+      else
+	return false;
+      return true;
+    }
+
+  return false;
+}
+
 /* Given innermost LOOP, return the outermost enclosing loop that forms a
    perfect loop nest.  */
 
@@ -3324,10 +3767,27 @@ loop_distribution::execute (function *fun)
 	      && !optimize_loop_for_speed_p (loop)))
 	continue;
 
-      /* Don't distribute loop if niters is unknown.  */
+      /* If niters is unknown don't distribute loop but rather try to transform
+	 it to a call to a builtin.  */
       tree niters = number_of_latch_executions (loop);
       if (niters == NULL_TREE || niters == chrec_dont_know)
-	continue;
+	{
+	  datarefs_vec.create (20);
+	  if (transform_reduction_loop (loop))
+	    {
+	      changed = true;
+	      loops_to_be_destroyed.safe_push (loop);
+	      if (dump_enabled_p ())
+		{
+		  dump_user_location_t loc = find_loop_location (loop);
+		  dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
+				   loc, "Loop %d transformed into a builtin.\n",
+				   loop->num);
+		}
+	    }
+	  free_data_refs (datarefs_vec);
+	  continue;
+	}
 
       /* Get the perfect loop nest for distribution.  */
       loop = prepare_perfect_loop_nest (loop);
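
As a minimal sketch (illustrative only, not part of the patch; the function
and variable names are made up), the strlen-like branch above targets plain
counting loops of the following shape, where the counter is the reduction
variable and the loaded character is compared against the constant pattern 0:

#include <stddef.h>

/* Illustrative strlen-like reduction loop: the address IV steps by the
   element size, the counter steps by 1 from a constant base, and there is
   no store, so transform_reduction_loop may rewrite the loop into a call
   to strlen (or into rawmemchr plus a pointer difference).  */
size_t
count_to_nul (const char *s)
{
  size_t n = 0;
  while (s[n] != '\0')
    n++;
  return n;
}
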
Richard Biener Sept. 17, 2021, 8:08 a.m. UTC | #23
On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
<stefansf@linux.ibm.com> wrote:
>
> On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> > On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
> > <stefansf@linux.ibm.com> wrote:
> > >
> > > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> > > [...]
> > > > > >
> > > > > > +  /* Handle strlen like loops.  */
> > > > > > +  if (store_dr == NULL
> > > > > > +      && integer_zerop (pattern)
> > > > > > +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > > > +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > > > +      && integer_onep (reduction_iv.step)
> > > > > > +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > > > > > +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> > > > > > +    {
> > > > > >
> > > > > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > > > > The iteration
> > > > > > only stops when you load a NUL and the increments just wrap along (you're
> > > > > > using the pointer IVs to compute the strlen result).  Can't you simply truncate?
> > > > >
> > > > > I think truncation is enough as long as no overflow occurs in strlen or
> > > > > strlen_using_rawmemchr.
> > > > >
> > > > > > For larger than size_type_node (actually larger than ptr_type_node would matter
> > > > > > I guess), the argument is that since pointer wrapping would be undefined anyway
> > > > > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > > > > >
> > > > > >       TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > > > > (ptr_type_node)
> > > > > >        || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > > > >
> > > > > > ?
> > > > >
> > > > > Regarding the implementation which makes use of rawmemchr:
> > > > >
> > > > > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > > > > the maximal length we can determine of a string where each character has
> > > > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > > > > ptrdiff type is undefined we have to make sure that if an overflow
> > > > > occurs, then an overflow occurs for reduction variable, too, and that
> > > > > this is undefined, too.  However, I'm not sure anymore whether we want
> > > > > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > > > > this would mean that a single string consumes more than half of the
> > > > > virtual addressable memory.  At least for architectures where
> > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > > > > to neglect the case where computing pointer difference may overflow.
> > > > > Otherwise we are talking about strings with lenghts of multiple
> > > > > pebibytes.  For other architectures we might have to be more precise
> > > > > and make sure that reduction variable overflows first and that this is
> > > > > undefined.
> > > > >
> > > > > Thus a conservative condition would be (I assumed that the size of any
> > > > > integral type is a power of two which I'm not sure if this really holds;
> > > > > IIRC the C standard requires only that the alignment is a power of two
> > > > > but not necessarily the size so I might need to change this):
> > > > >
> > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 (sizeof (load_type))
> > > > >    or in other words return true if reduction variable overflows first
> > > > >    and false otherwise.  */
> > > > >
> > > > > static bool
> > > > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > > {
> > > > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
> > > > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
> > > > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> > > > > }
> > > > >
> > > > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > >     && reduction_var_overflows_first (reduction_var, load_type)
> > > > >
> > > > > Regarding the implementation which makes use of strlen:
> > > > >
> > > > > I'm not sure what it means if strlen is called for a string with a
> > > > > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > > > > using rawmemchr where we neglect the case of an overflow for 64bit
> > > > > architectures, a conservative condition would be:
> > > > >
> > > > > TYPE_PRECISION (size_type_node) == 64
> > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > >     && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> > > > >
> > > > > I still included the overflow undefined check for reduction variable in
> > > > > order to rule out situations where the reduction variable is unsigned
> > > > > and overflows as many times until strlen(,_using_rawmemchr) overflows,
> > > > > too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> > > > > architectures.  Anyhow, while writing this down it becomes clear that
> > > > > this deserves a comment which I will add once it becomes clear which way
> > > > > to go.
> > > >
> > > > I think all the arguments about objects bigger than half of the address-space
> > > > also are valid for 32bit targets and thus 32bit size_type_node (or
> > > > 32bit pointer size).
> > > > I'm not actually sure what's the canonical type to check against, whether
> > > > it's size_type_node (Cs size_t), ptr_type_node (Cs void *) or sizetype (the
> > > > middle-end "offset" type used for all address computations).  For weird reasons
> > > > I'd lean towards 'sizetype' (for example some embedded targets have 24bit
> > > > pointers but 16bit 'sizetype').
> > >
> > > Ok, for the strlen implementation I changed from size_type_node to
> > > sizetype and assume that no overflow occurs for string objects bigger
> > > than half of the address space for 32-bit targets and up:
> > >
> > >   (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
> > >    && TYPE_PRECISION (ptr_type_node) >= 32)
> > >   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > >       && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (sizetype))
> > >
> > > and similarly for the rawmemchr implementation:
> > >
> > >   (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
> > >    && TYPE_PRECISION (ptrdiff_type_node) >= 32)
> > >   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > >       && reduction_var_overflows_first (reduction_var, load_type))
> > >
> > > >
> > > > > >
> > > > > > +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> > > > > > +       {
> > > > > > +         const char *msg = G_("assuming signed overflow does not occur "
> > > > > > +                              "when optimizing strlen like loop");
> > > > > > +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> > > > > > +       }
> > > > > >
> > > > > > no, please don't add any new strict-overflow warnings ;)
> > > > >
> > > > > I just stumbled over code which produces such a warning and thought this
> > > > > is a hard requirement :D The new patch doesn't contain it anymore.
> > > > >
> > > > > >
> > > > > > The generate_*_builtin routines need some factoring - if you code-generate
> > > > > > into a gimple_seq you could use gimple_build () which would do the fold_stmt
> > > > > > (not sure why you do that - you should see to fold the call, not necessarily
> > > > > > the rest).  The replacement of reduction_var and the dumping could be shared.
> > > > > > There's also GET_MODE_NAME for the printing.
> > > > >
> > > > > I wasn't really sure which way to go.  Use a gsi, as it is done by
> > > > > existing generate_* functions, or make use of gimple_seq.  Since the
> > > > > latter uses internally also gsi I thought it is better to stick to gsi
> > > > > in the first place.  Now, after changing to gimple_seq I see the beauty
> > > > > of it :)
> > > > >
> > > > > I created two helper functions generate_strlen_builtin_1 and
> > > > > generate_reduction_builtin_1 in order to reduce code duplication.
> > > > >
> > > > > In function generate_strlen_builtin I changed from using
> > > > > builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
> > > > > (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
> > > > > sure whether my intuition about the difference between implicit and
> > > > > explicit builtins is correct.  In builtins.def there is a small example
> > > > > given which I would paraphrase as "use builtin_decl_explicit if the
> > > > > semantics of the builtin is defined by the C standard; otherwise use
> > > > > builtin_decl_implicit" but probably my intuition is wrong?
> > > > >
> > > > > Beside that I'm not sure whether I really have to call
> > > > > build_fold_addr_expr which looks superfluous to me since
> > > > > gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:
> > > > >
> > > > > tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > > > > gimple *fn_call = gimple_build_call (fn, 1, mem);
> > > > >
> > > > > However, since it is also used that way in the context of
> > > > > generate_memset_builtin I didn't remove it so far.
> > > > >
> > > > > > I think overall the approach is sound now but the details still need work.
> > > > >
> > > > > Once again thank you very much for your review.  Really appreciated!
> > > >
> > > > The patch lacks a changelog entry / description.  It's nice if patches sent
> > > > out for review are basically the rev as git format-patch produces.
> > > >
> > > > The rawmemchr optab needs documenting in md.texi
> > >
> > > While writing the documentation in md.texi I realised that other
> > > instructions expect an address to be a memory operand which is not the
> > > case for rawmemchr currently. At the moment the address is either an
> > > SSA_NAME or ADDR_EXPR with a tree pointer type in expand_RAWMEMCHR. As a
> > > consequence in the backend define_expand rawmemchr<mode> expects a
> > > register operand and not a memory operand. Would it make sense to build
> > > a MEM_REF out of SSA_NAME/ADDR_EXPR in expand_RAWMEMCHR? Not sure if
> > > MEM_REF is supposed to be the canonical form here.
> >
> > I suppose the expander could use code similar to what
> > expand_builtin_memset_args does,
> > using get_memory_rtx.  I suppose that we're using MEM operands because those
> > can convey things like alias info or alignment info, something which
> > REG operands cannot
> > (easily).  I wouldn't build a MEM_REF and try to expand that.
>
> The new patch contains the following changes:
>
> - In expand_RAWMEMCHR I'm using get_memory_rtx now.  This means I had to
>   change linkage of get_memory_rtx to extern.
>
> - In function generate_strlen_builtin_using_rawmemchr I'm not
>   reconstructing the load type anymore from the base pointer but rather
>   pass it as a parameter from function transform_reduction_loop where we
>   also ensured that it is of integral type.  Reconstructing the load
>   type was error prone since e.g. I didn't distinct between
>   pointer_plus_expr or addr_expr.  Thus passing the load type should be
>   more solid.
>
> Regtested on IBM Z and x86.  Ok for mainline?

OK, and sorry for all the repeated delays.

Thanks,
Richard.

> Thanks,
> Stefan
>
> >
> > > >
> > > > +}
> > > > +
> > > > +static bool
> > > > +reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > +{
> > > > +  unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > >
> > > > this function needs a comment.
> > >
> > > Done.
> > >
> > > >
> > > > +         if (stmt_has_scalar_dependences_outside_loop (loop, phi))
> > > > +           {
> > > > +             if (reduction_stmt)
> > > > +               return false;
> > > >
> > > > you leak bbs here and elsewhere where you early exit the function.
> > > > In fact you fail to free it at all.
> > >
> > > Whoopsy. I factored the whole loop out into static function
> > > determine_reduction_stmt in order to deal with all early exits.
> > >
> > > >
> > > > Otherwise the patch looks good - thanks for all the improvements.
> > > >
> > > > What I do wonder is
> > > >
> > > > +  tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > > > +  gimple *fn_call = gimple_build_call (fn, 1, mem);
> > > >
> > > > using builtin_decl_explicit means that in a TU where strlen is neither
> > > > declared nor used we can end up emitting calls to it.  For memcpy/memmove
> > > > that's usually OK since we require those to be present even in a
> > > > freestanding environment.  But I'm not sure about strlen here so I'd
> > > > lean towards using builtin_decl_implicit and checking that for NULL which
> > > > IIRC should prevent emitting strlen when it's not declared and maybe even
> > > > if it's declared but not used.  All other uses that generate STRLEN
> > > > use that at least.
> > >
> > > Thanks for clarification.  I changed it back to builtin_decl_implicit
> > > and check for null pointers.
> >
> > Thanks,
> > Richard.
> >
> > > Thanks,
> > > Stefan
Stefan Schulze Frielinghaus Oct. 11, 2021, 4:02 p.m. UTC | #24
On Fri, Sep 17, 2021 at 10:08:27AM +0200, Richard Biener wrote:
> On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
> <stefansf@linux.ibm.com> wrote:
> >
> > On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> > > On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
> > > <stefansf@linux.ibm.com> wrote:
> > > >
> > > > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> > > > [...]
> > > > > > >
> > > > > > > +  /* Handle strlen like loops.  */
> > > > > > > +  if (store_dr == NULL
> > > > > > > +      && integer_zerop (pattern)
> > > > > > > +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > > > > +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > > > > +      && integer_onep (reduction_iv.step)
> > > > > > > +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > > > > > > +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> > > > > > > +    {
> > > > > > >
> > > > > > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > > > > > The iteration
> > > > > > > only stops when you load a NUL and the increments just wrap along (you're
> > > > > > > using the pointer IVs to compute the strlen result).  Can't you simply truncate?
> > > > > >
> > > > > > I think truncation is enough as long as no overflow occurs in strlen or
> > > > > > strlen_using_rawmemchr.
> > > > > >
> > > > > > > For larger than size_type_node (actually larger than ptr_type_node would matter
> > > > > > > I guess), the argument is that since pointer wrapping would be undefined anyway
> > > > > > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > > > > > >
> > > > > > >       TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > > > > > (ptr_type_node)
> > > > > > >        || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > > > > >
> > > > > > > ?
> > > > > >
> > > > > > Regarding the implementation which makes use of rawmemchr:
> > > > > >
> > > > > > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > > > > > the maximal length we can determine of a string where each character has
> > > > > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > > > > > ptrdiff type is undefined we have to make sure that if an overflow
> > > > > > occurs, then an overflow occurs for reduction variable, too, and that
> > > > > > this is undefined, too.  However, I'm not sure anymore whether we want
> > > > > > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > > > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > > > > > this would mean that a single string consumes more than half of the
> > > > > > virtual addressable memory.  At least for architectures where
> > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > > > > > to neglect the case where computing pointer difference may overflow.
> > > > > > Otherwise we are talking about strings with lenghts of multiple
> > > > > > pebibytes.  For other architectures we might have to be more precise
> > > > > > and make sure that reduction variable overflows first and that this is
> > > > > > undefined.
> > > > > >
> > > > > > Thus a conservative condition would be (I assumed that the size of any
> > > > > > integral type is a power of two which I'm not sure if this really holds;
> > > > > > IIRC the C standard requires only that the alignment is a power of two
> > > > > > but not necessarily the size so I might need to change this):
> > > > > >
> > > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 (sizeof (load_type))
> > > > > >    or in other words return true if reduction variable overflows first
> > > > > >    and false otherwise.  */
> > > > > >
> > > > > > static bool
> > > > > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > > > {
> > > > > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > > > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
> > > > > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
> > > > > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> > > > > > }
> > > > > >
> > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > > >     && reduction_var_overflows_first (reduction_var, load_type)
> > > > > >
> > > > > > Regarding the implementation which makes use of strlen:
> > > > > >
> > > > > > I'm not sure what it means if strlen is called for a string with a
> > > > > > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > > > > > using rawmemchr where we neglect the case of an overflow for 64bit
> > > > > > architectures, a conservative condition would be:
> > > > > >
> > > > > > TYPE_PRECISION (size_type_node) == 64
> > > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > > >     && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> > > > > >
> > > > > > I still included the overflow undefined check for reduction variable in
> > > > > > order to rule out situations where the reduction variable is unsigned
> > > > > > and overflows as many times until strlen(,_using_rawmemchr) overflows,
> > > > > > too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> > > > > > architectures.  Anyhow, while writing this down it becomes clear that
> > > > > > this deserves a comment which I will add once it becomes clear which way
> > > > > > to go.
> > > > >
> > > > > I think all the arguments about objects bigger than half of the address-space
> > > > > also are valid for 32bit targets and thus 32bit size_type_node (or
> > > > > 32bit pointer size).
> > > > > I'm not actually sure what's the canonical type to check against, whether
> > > > > it's size_type_node (Cs size_t), ptr_type_node (Cs void *) or sizetype (the
> > > > > middle-end "offset" type used for all address computations).  For weird reasons
> > > > > I'd lean towards 'sizetype' (for example some embedded targets have 24bit
> > > > > pointers but 16bit 'sizetype').
> > > >
> > > > Ok, for the strlen implementation I changed from size_type_node to
> > > > sizetype and assume that no overflow occurs for string objects bigger
> > > > than half of the address space for 32-bit targets and up:
> > > >
> > > >   (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
> > > >    && TYPE_PRECISION (ptr_type_node) >= 32)
> > > >   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > >       && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (sizetype))
> > > >
> > > > and similarly for the rawmemchr implementation:
> > > >
> > > >   (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
> > > >    && TYPE_PRECISION (ptrdiff_type_node) >= 32)
> > > >   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > >       && reduction_var_overflows_first (reduction_var, load_type))
> > > >
> > > > >
> > > > > > >
> > > > > > > +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> > > > > > > +       {
> > > > > > > +         const char *msg = G_("assuming signed overflow does not occur "
> > > > > > > +                              "when optimizing strlen like loop");
> > > > > > > +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> > > > > > > +       }
> > > > > > >
> > > > > > > no, please don't add any new strict-overflow warnings ;)
> > > > > >
> > > > > > I just stumbled over code which produces such a warning and thought this
> > > > > > is a hard requirement :D The new patch doesn't contain it anymore.
> > > > > >
> > > > > > >
> > > > > > > The generate_*_builtin routines need some factoring - if you code-generate
> > > > > > > into a gimple_seq you could use gimple_build () which would do the fold_stmt
> > > > > > > (not sure why you do that - you should see to fold the call, not necessarily
> > > > > > > the rest).  The replacement of reduction_var and the dumping could be shared.
> > > > > > > There's also GET_MODE_NAME for the printing.
> > > > > >
> > > > > > I wasn't really sure which way to go.  Use a gsi, as it is done by
> > > > > > existing generate_* functions, or make use of gimple_seq.  Since the
> > > > > > latter uses internally also gsi I thought it is better to stick to gsi
> > > > > > in the first place.  Now, after changing to gimple_seq I see the beauty
> > > > > > of it :)
> > > > > >
> > > > > > I created two helper functions generate_strlen_builtin_1 and
> > > > > > generate_reduction_builtin_1 in order to reduce code duplication.
> > > > > >
> > > > > > In function generate_strlen_builtin I changed from using
> > > > > > builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
> > > > > > (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
> > > > > > sure whether my intuition about the difference between implicit and
> > > > > > explicit builtins is correct.  In builtins.def there is a small example
> > > > > > given which I would paraphrase as "use builtin_decl_explicit if the
> > > > > > semantics of the builtin is defined by the C standard; otherwise use
> > > > > > builtin_decl_implicit" but probably my intuition is wrong?
> > > > > >
> > > > > > Beside that I'm not sure whether I really have to call
> > > > > > build_fold_addr_expr which looks superfluous to me since
> > > > > > gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:
> > > > > >
> > > > > > tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > > > > > gimple *fn_call = gimple_build_call (fn, 1, mem);
> > > > > >
> > > > > > However, since it is also used that way in the context of
> > > > > > generate_memset_builtin I didn't remove it so far.
> > > > > >
> > > > > > > I think overall the approach is sound now but the details still need work.
> > > > > >
> > > > > > Once again thank you very much for your review.  Really appreciated!
> > > > >
> > > > > The patch lacks a changelog entry / description.  It's nice if patches sent
> > > > > out for review are basically the rev as git format-patch produces.
> > > > >
> > > > > The rawmemchr optab needs documenting in md.texi
> > > >
> > > > While writing the documentation in md.texi I realised that other
> > > > instructions expect an address to be a memory operand which is not the
> > > > case for rawmemchr currently. At the moment the address is either an
> > > > SSA_NAME or ADDR_EXPR with a tree pointer type in expand_RAWMEMCHR. As a
> > > > consequence in the backend define_expand rawmemchr<mode> expects a
> > > > register operand and not a memory operand. Would it make sense to build
> > > > a MEM_REF out of SSA_NAME/ADDR_EXPR in expand_RAWMEMCHR? Not sure if
> > > > MEM_REF is supposed to be the canonical form here.
> > >
> > > I suppose the expander could use code similar to what
> > > expand_builtin_memset_args does,
> > > using get_memory_rtx.  I suppose that we're using MEM operands because those
> > > can convey things like alias info or alignment info, something which
> > > REG operands cannot
> > > (easily).  I wouldn't build a MEM_REF and try to expand that.
> >
> > The new patch contains the following changes:
> >
> > - In expand_RAWMEMCHR I'm using get_memory_rtx now.  This means I had to
> >   change linkage of get_memory_rtx to extern.
> >
> > - In function generate_strlen_builtin_using_rawmemchr I'm not
> >   reconstructing the load type anymore from the base pointer but rather
> >   pass it as a parameter from function transform_reduction_loop where we
> >   also ensured that it is of integral type.  Reconstructing the load
> >   type was error prone since e.g. I didn't distinct between
> >   pointer_plus_expr or addr_expr.  Thus passing the load type should be
> >   more solid.
> >
> > Regtested on IBM Z and x86.  Ok for mainline?
> 
> OK, and sorry for all the repeated delays.

No problem at all.  I'm glad to see how the patch evolved with each
iteration.  That being said: thanks for all your reviews and hints!

The patch implementing the rawmemchr expander for IBM Z was also ack'd and
I pushed both commits today.

For the xalancbmk benchmark we now recognize 1081 rawmemchr-like loops,
at least one of which is on the hot path.  Utilising a specialised rawmemchr
implementation for 16-bit characters gives good results on IBM Z ...
just saying, maybe other archs are interested, too ;-)
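
For reference, a rawmemchr-like loop over 16-bit elements looks roughly like
the following (hypothetical sketch, not taken from xalancbmk); the returned
pointer is the reduction, so no separate counter is involved:

#include <stdint.h>

/* Hypothetical 16-bit rawmemchr-like loop; on targets providing the optab
   for the 16-bit mode it can now be replaced by the internal call.  */
uint16_t *
find16 (uint16_t *p, uint16_t pattern)
{
  while (*p != pattern)
    ++p;
  return p;
}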

Thanks,
Stefan


> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Stefan
> >
> > >
> > > > >
> > > > > +}
> > > > > +
> > > > > +static bool
> > > > > +reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > > +{
> > > > > +  unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > > >
> > > > > this function needs a comment.
> > > >
> > > > Done.
> > > >
> > > > >
> > > > > +         if (stmt_has_scalar_dependences_outside_loop (loop, phi))
> > > > > +           {
> > > > > +             if (reduction_stmt)
> > > > > +               return false;
> > > > >
> > > > > you leak bbs here and elsewhere where you early exit the function.
> > > > > In fact you fail to free it at all.
> > > >
> > > > Whoopsy. I factored the whole loop out into static function
> > > > determine_reduction_stmt in order to deal with all early exits.
> > > >
> > > > >
> > > > > Otherwise the patch looks good - thanks for all the improvements.
> > > > >
> > > > > What I do wonder is
> > > > >
> > > > > +  tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> > > > > +  gimple *fn_call = gimple_build_call (fn, 1, mem);
> > > > >
> > > > > using builtin_decl_explicit means that in a TU where strlen is neither
> > > > > declared nor used we can end up emitting calls to it.  For memcpy/memmove
> > > > > that's usually OK since we require those to be present even in a
> > > > > freestanding environment.  But I'm not sure about strlen here so I'd
> > > > > lean towards using builtin_decl_implicit and checking that for NULL which
> > > > > IIRC should prevent emitting strlen when it's not declared and maybe even
> > > > > if it's declared but not used.  All other uses that generate STRLEN
> > > > > use that at least.
> > > >
> > > > Thanks for clarification.  I changed it back to builtin_decl_implicit
> > > > and check for null pointers.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Thanks,
> > > > Stefan
Tom de Vries Jan. 31, 2022, 1:16 p.m. UTC | #25
On 9/17/21 10:08, Richard Biener via Gcc-patches wrote:
> On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
> <stefansf@linux.ibm.com> wrote:
>>
>> On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
>>> On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
>>> <stefansf@linux.ibm.com> wrote:
>>>>
>>>> On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
>>>> [...]
>>>>>>>
>>>>>>> +  /* Handle strlen like loops.  */
>>>>>>> +  if (store_dr == NULL
>>>>>>> +      && integer_zerop (pattern)
>>>>>>> +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
>>>>>>> +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
>>>>>>> +      && integer_onep (reduction_iv.step)
>>>>>>> +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
>>>>>>> +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
>>>>>>> +    {
>>>>>>>
>>>>>>> I wonder what goes wrong with a larger or smaller wrapping IV type?
>>>>>>> The iteration
>>>>>>> only stops when you load a NUL and the increments just wrap along (you're
>>>>>>> using the pointer IVs to compute the strlen result).  Can't you simply truncate?
>>>>>>
>>>>>> I think truncation is enough as long as no overflow occurs in strlen or
>>>>>> strlen_using_rawmemchr.
>>>>>>
>>>>>>> For larger than size_type_node (actually larger than ptr_type_node would matter
>>>>>>> I guess), the argument is that since pointer wrapping would be undefined anyway
>>>>>>> the IV cannot wrap either.  Now, the correct check here would IMHO be
>>>>>>>
>>>>>>>        TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
>>>>>>> (ptr_type_node)
>>>>>>>         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
>>>>>>>
>>>>>>> ?
>>>>>>
>>>>>> Regarding the implementation which makes use of rawmemchr:
>>>>>>
>>>>>> We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
>>>>>> the maximal length we can determine of a string where each character has
>>>>>> size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
>>>>>> ptrdiff type is undefined we have to make sure that if an overflow
>>>>>> occurs, then an overflow occurs for reduction variable, too, and that
>>>>>> this is undefined, too.  However, I'm not sure anymore whether we want
>>>>>> to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
>>>>>> equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
>>>>>> this would mean that a single string consumes more than half of the
>>>>>> virtual addressable memory.  At least for architectures where
>>>>>> TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
>>>>>> to neglect the case where computing pointer difference may overflow.
>>>>>> Otherwise we are talking about strings with lenghts of multiple
>>>>>> pebibytes.  For other architectures we might have to be more precise
>>>>>> and make sure that reduction variable overflows first and that this is
>>>>>> undefined.
>>>>>>
>>>>>> Thus a conservative condition would be (I assumed that the size of any
>>>>>> integral type is a power of two which I'm not sure if this really holds;
>>>>>> IIRC the C standard requires only that the alignment is a power of two
>>>>>> but not necessarily the size so I might need to change this):
>>>>>>
>>>>>> /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 (sizeof (load_type))
>>>>>>     or in other words return true if reduction variable overflows first
>>>>>>     and false otherwise.  */
>>>>>>
>>>>>> static bool
>>>>>> reduction_var_overflows_first (tree reduction_var, tree load_type)
>>>>>> {
>>>>>>    unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
>>>>>>    unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
>>>>>>    unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
>>>>>>    return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
>>>>>> }
>>>>>>
>>>>>> TYPE_PRECISION (ptrdiff_type_node) == 64
>>>>>> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
>>>>>>      && reduction_var_overflows_first (reduction_var, load_type)
>>>>>>
>>>>>> Regarding the implementation which makes use of strlen:
>>>>>>
>>>>>> I'm not sure what it means if strlen is called for a string with a
>>>>>> length greater than SIZE_MAX.  Therefore, similar to the implementation
>>>>>> using rawmemchr where we neglect the case of an overflow for 64bit
>>>>>> architectures, a conservative condition would be:
>>>>>>
>>>>>> TYPE_PRECISION (size_type_node) == 64
>>>>>> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
>>>>>>      && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
>>>>>>
>>>>>> I still included the overflow undefined check for reduction variable in
>>>>>> order to rule out situations where the reduction variable is unsigned
>>>>>> and overflows as many times until strlen(,_using_rawmemchr) overflows,
>>>>>> too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
>>>>>> architectures.  Anyhow, while writing this down it becomes clear that
>>>>>> this deserves a comment which I will add once it becomes clear which way
>>>>>> to go.
>>>>>
>>>>> I think all the arguments about objects bigger than half of the address-space
>>>>> also are valid for 32bit targets and thus 32bit size_type_node (or
>>>>> 32bit pointer size).
>>>>> I'm not actually sure what's the canonical type to check against, whether
>>>>> it's size_type_node (Cs size_t), ptr_type_node (Cs void *) or sizetype (the
>>>>> middle-end "offset" type used for all address computations).  For weird reasons
>>>>> I'd lean towards 'sizetype' (for example some embedded targets have 24bit
>>>>> pointers but 16bit 'sizetype').
>>>>
>>>> Ok, for the strlen implementation I changed from size_type_node to
>>>> sizetype and assume that no overflow occurs for string objects bigger
>>>> than half of the address space for 32-bit targets and up:
>>>>
>>>>    (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
>>>>     && TYPE_PRECISION (ptr_type_node) >= 32)
>>>>    || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
>>>>        && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (sizetype))
>>>>
>>>> and similarly for the rawmemchr implementation:
>>>>
>>>>    (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
>>>>     && TYPE_PRECISION (ptrdiff_type_node) >= 32)
>>>>    || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
>>>>        && reduction_var_overflows_first (reduction_var, load_type))
>>>>
>>>>>
>>>>>>>
>>>>>>> +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
>>>>>>> +       {
>>>>>>> +         const char *msg = G_("assuming signed overflow does not occur "
>>>>>>> +                              "when optimizing strlen like loop");
>>>>>>> +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
>>>>>>> +       }
>>>>>>>
>>>>>>> no, please don't add any new strict-overflow warnings ;)
>>>>>>
>>>>>> I just stumbled over code which produces such a warning and thought this
>>>>>> is a hard requirement :D The new patch doesn't contain it anymore.
>>>>>>
>>>>>>>
>>>>>>> The generate_*_builtin routines need some factoring - if you code-generate
>>>>>>> into a gimple_seq you could use gimple_build () which would do the fold_stmt
>>>>>>> (not sure why you do that - you should see to fold the call, not necessarily
>>>>>>> the rest).  The replacement of reduction_var and the dumping could be shared.
>>>>>>> There's also GET_MODE_NAME for the printing.
>>>>>>
>>>>>> I wasn't really sure which way to go.  Use a gsi, as it is done by
>>>>>> existing generate_* functions, or make use of gimple_seq.  Since the
>>>>>> latter uses internally also gsi I thought it is better to stick to gsi
>>>>>> in the first place.  Now, after changing to gimple_seq I see the beauty
>>>>>> of it :)
>>>>>>
>>>>>> I created two helper functions generate_strlen_builtin_1 and
>>>>>> generate_reduction_builtin_1 in order to reduce code duplication.
>>>>>>
>>>>>> In function generate_strlen_builtin I changed from using
>>>>>> builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
>>>>>> (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
>>>>>> sure whether my intuition about the difference between implicit and
>>>>>> explicit builtins is correct.  In builtins.def there is a small example
>>>>>> given which I would paraphrase as "use builtin_decl_explicit if the
>>>>>> semantics of the builtin is defined by the C standard; otherwise use
>>>>>> builtin_decl_implicit" but probably my intuition is wrong?
>>>>>>
>>>>>> Beside that I'm not sure whether I really have to call
>>>>>> build_fold_addr_expr which looks superfluous to me since
>>>>>> gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:
>>>>>>
>>>>>> tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
>>>>>> gimple *fn_call = gimple_build_call (fn, 1, mem);
>>>>>>
>>>>>> However, since it is also used that way in the context of
>>>>>> generate_memset_builtin I didn't remove it so far.
>>>>>>
>>>>>>> I think overall the approach is sound now but the details still need work.
>>>>>>
>>>>>> Once again thank you very much for your review.  Really appreciated!
>>>>>
>>>>> The patch lacks a changelog entry / description.  It's nice if patches sent
>>>>> out for review are basically the rev as git format-patch produces.
>>>>>
>>>>> The rawmemchr optab needs documenting in md.texi
>>>>
>>>> While writing the documentation in md.texi I realised that other
>>>> instructions expect an address to be a memory operand which is not the
>>>> case for rawmemchr currently. At the moment the address is either an
>>>> SSA_NAME or ADDR_EXPR with a tree pointer type in expand_RAWMEMCHR. As a
>>>> consequence in the backend define_expand rawmemchr<mode> expects a
>>>> register operand and not a memory operand. Would it make sense to build
>>>> a MEM_REF out of SSA_NAME/ADDR_EXPR in expand_RAWMEMCHR? Not sure if
>>>> MEM_REF is supposed to be the canonical form here.
>>>
>>> I suppose the expander could use code similar to what
>>> expand_builtin_memset_args does,
>>> using get_memory_rtx.  I suppose that we're using MEM operands because those
>>> can convey things like alias info or alignment info, something which
>>> REG operands cannot
>>> (easily).  I wouldn't build a MEM_REF and try to expand that.
>>
>> The new patch contains the following changes:
>>
>> - In expand_RAWMEMCHR I'm using get_memory_rtx now.  This means I had to
>>    change linkage of get_memory_rtx to extern.
>>
>> - In function generate_strlen_builtin_using_rawmemchr I'm not
>>    reconstructing the load type anymore from the base pointer but rather
>>    pass it as a parameter from function transform_reduction_loop where we
>>    also ensured that it is of integral type.  Reconstructing the load
>>    type was error prone since e.g. I didn't distinct between
>>    pointer_plus_expr or addr_expr.  Thus passing the load type should be
>>    more solid.
>>
>> Regtested on IBM Z and x86.  Ok for mainline?
> 
> OK, and sorry for all the repeated delays.
> 

I'm running into PR56888
( https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888 ) on nvptx due to
this, for instance in gcc/testsuite/gcc.c-torture/execute/builtins/strlen.c:
gcc/testsuite/gcc.c-torture/execute/builtins/lib/strlen.c contains a
strlen function with a strlen loop, which pass_loop_distribution
transforms into a __builtin_strlen, which is then expanded into a strlen
call, creating a self-recursive function.  [ And on nvptx that happens to
result in a compilation failure, which is how I found this. ]
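
A minimal sketch of the situation (simplified; the exact testsuite code may
differ):

#include <stddef.h>

/* Simplified sketch of lib/strlen.c: the loop is itself a strlen-like
   reduction loop, so rewriting it into __builtin_strlen and expanding that
   back into a strlen call makes the function call itself.  */
size_t
strlen (const char *s)
{
  size_t n = 0;
  while (s[n] != '\0')
    n++;
  return n;
}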

According to this ( 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888#c21 ) comment:
...
-fno-tree-loop-distribute-patterns is the reliable way to not
transform loops into library calls.
...

Then should we have something along the lines of:
...
$ git diff
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 6fe59cd56855..9a211d30cd7e 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -3683,7 +3683,11 @@ loop_distribution::transform_reduction_loop
                && TYPE_PRECISION (ptr_type_node) >= 32)
               || (TYPE_OVERFLOW_UNDEFINED (reduction_var_type)
                  && TYPE_PRECISION (reduction_var_type) <= TYPE_PRECISION (sizetype)))
-         && builtin_decl_implicit (BUILT_IN_STRLEN))
+         && builtin_decl_implicit (BUILT_IN_STRLEN)
+         && flag_tree_loop_distribute_patterns)
         generate_strlen_builtin (loop, reduction_var, load_iv.base,
                                  reduction_iv.base, loc);
        else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE (load_type))
...
?

Or is the comment no longer valid?

Thanks,
- Tom
Richard Biener Jan. 31, 2022, 3 p.m. UTC | #26
On Mon, Jan 31, 2022 at 2:16 PM Tom de Vries <tdevries@suse.de> wrote:
>
> On 9/17/21 10:08, Richard Biener via Gcc-patches wrote:
> > On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
> > <stefansf@linux.ibm.com> wrote:
> >>
> >> On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> >>> On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
> >>> <stefansf@linux.ibm.com> wrote:
> >>>>
> >>>> On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> >>>> [...]
> >>>>>>>
> >>>>>>> +  /* Handle strlen like loops.  */
> >>>>>>> +  if (store_dr == NULL
> >>>>>>> +      && integer_zerop (pattern)
> >>>>>>> +      && TREE_CODE (reduction_iv.base) == INTEGER_CST
> >>>>>>> +      && TREE_CODE (reduction_iv.step) == INTEGER_CST
> >>>>>>> +      && integer_onep (reduction_iv.step)
> >>>>>>> +      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> >>>>>>> +         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
> >>>>>>> +    {
> >>>>>>>
> >>>>>>> I wonder what goes wrong with a larger or smaller wrapping IV type?
> >>>>>>> The iteration
> >>>>>>> only stops when you load a NUL and the increments just wrap along (you're
> >>>>>>> using the pointer IVs to compute the strlen result).  Can't you simply truncate?
> >>>>>>
> >>>>>> I think truncation is enough as long as no overflow occurs in strlen or
> >>>>>> strlen_using_rawmemchr.
> >>>>>>
> >>>>>>> For larger than size_type_node (actually larger than ptr_type_node would matter
> >>>>>>> I guess), the argument is that since pointer wrapping would be undefined anyway
> >>>>>>> the IV cannot wrap either.  Now, the correct check here would IMHO be
> >>>>>>>
> >>>>>>>        TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> >>>>>>> (ptr_type_node)
> >>>>>>>         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> >>>>>>>
> >>>>>>> ?
> >>>>>>
> >>>>>> Regarding the implementation which makes use of rawmemchr:
> >>>>>>
> >>>>>> We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> >>>>>> the maximal length we can determine of a string where each character has
> >>>>>> size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> >>>>>> ptrdiff type is undefined we have to make sure that if an overflow
> >>>>>> occurs, then an overflow occurs for reduction variable, too, and that
> >>>>>> this is undefined, too.  However, I'm not sure anymore whether we want
> >>>>>> to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> >>>>>> equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> >>>>>> this would mean that a single string consumes more than half of the
> >>>>>> virtual addressable memory.  At least for architectures where
> >>>>>> TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> >>>>>> to neglect the case where computing pointer difference may overflow.
> >>>>>> Otherwise we are talking about strings with lenghts of multiple
> >>>>>> pebibytes.  For other architectures we might have to be more precise
> >>>>>> and make sure that reduction variable overflows first and that this is
> >>>>>> undefined.
> >>>>>>
> >>>>>> Thus a conservative condition would be (I assumed that the size of any
> >>>>>> integral type is a power of two which I'm not sure if this really holds;
> >>>>>> IIRC the C standard requires only that the alignment is a power of two
> >>>>>> but not necessarily the size so I might need to change this):
> >>>>>>
> >>>>>> /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 (sizeof (load_type))
> >>>>>>     or in other words return true if reduction variable overflows first
> >>>>>>     and false otherwise.  */
> >>>>>>
> >>>>>> static bool
> >>>>>> reduction_var_overflows_first (tree reduction_var, tree load_type)
> >>>>>> {
> >>>>>>    unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> >>>>>>    unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE (reduction_var));
> >>>>>>    unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT (load_type)));
> >>>>>>    return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - size_exponent);
> >>>>>> }
> >>>>>>
> >>>>>> TYPE_PRECISION (ptrdiff_type_node) == 64
> >>>>>> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >>>>>>      && reduction_var_overflows_first (reduction_var, load_type)
> >>>>>>
> >>>>>> Regarding the implementation which makes use of strlen:
> >>>>>>
> >>>>>> I'm not sure what it means if strlen is called for a string with a
> >>>>>> length greater than SIZE_MAX.  Therefore, similar to the implementation
> >>>>>> using rawmemchr where we neglect the case of an overflow for 64bit
> >>>>>> architectures, a conservative condition would be:
> >>>>>>
> >>>>>> TYPE_PRECISION (size_type_node) == 64
> >>>>>> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >>>>>>      && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> >>>>>>
> >>>>>> I still included the overflow undefined check for reduction variable in
> >>>>>> order to rule out situations where the reduction variable is unsigned
> >>>>>> and overflows as many times until strlen(,_using_rawmemchr) overflows,
> >>>>>> too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> >>>>>> architectures.  Anyhow, while writing this down it becomes clear that
> >>>>>> this deserves a comment which I will add once it becomes clear which way
> >>>>>> to go.
> >>>>>
> >>>>> I think all the arguments about objects bigger than half of the address-space
> >>>>> also are valid for 32bit targets and thus 32bit size_type_node (or
> >>>>> 32bit pointer size).
> >>>>> I'm not actually sure what's the canonical type to check against, whether
> >>>>> it's size_type_node (Cs size_t), ptr_type_node (Cs void *) or sizetype (the
> >>>>> middle-end "offset" type used for all address computations).  For weird reasons
> >>>>> I'd lean towards 'sizetype' (for example some embedded targets have 24bit
> >>>>> pointers but 16bit 'sizetype').
> >>>>
> >>>> Ok, for the strlen implementation I changed from size_type_node to
> >>>> sizetype and assume that no overflow occurs for string objects bigger
> >>>> than half of the address space for 32-bit targets and up:
> >>>>
> >>>>    (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
> >>>>     && TYPE_PRECISION (ptr_type_node) >= 32)
> >>>>    || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >>>>        && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (sizetype))
> >>>>
> >>>> and similarly for the rawmemchr implementation:
> >>>>
> >>>>    (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
> >>>>     && TYPE_PRECISION (ptrdiff_type_node) >= 32)
> >>>>    || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >>>>        && reduction_var_overflows_first (reduction_var, load_type))
> >>>>
> >>>>>
> >>>>>>>
> >>>>>>> +      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
> >>>>>>> +       {
> >>>>>>> +         const char *msg = G_("assuming signed overflow does not occur "
> >>>>>>> +                              "when optimizing strlen like loop");
> >>>>>>> +         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
> >>>>>>> +       }
> >>>>>>>
> >>>>>>> no, please don't add any new strict-overflow warnings ;)
> >>>>>>
> >>>>>> I just stumbled over code which produces such a warning and thought this
> >>>>>> is a hard requirement :D The new patch doesn't contain it anymore.
> >>>>>>
> >>>>>>>
> >>>>>>> The generate_*_builtin routines need some factoring - if you code-generate
> >>>>>>> into a gimple_seq you could use gimple_build () which would do the fold_stmt
> >>>>>>> (not sure why you do that - you should see to fold the call, not necessarily
> >>>>>>> the rest).  The replacement of reduction_var and the dumping could be shared.
> >>>>>>> There's also GET_MODE_NAME for the printing.
> >>>>>>
> >>>>>> I wasn't really sure which way to go.  Use a gsi, as it is done by
> >>>>>> existing generate_* functions, or make use of gimple_seq.  Since the
> >>>>>> latter uses internally also gsi I thought it is better to stick to gsi
> >>>>>> in the first place.  Now, after changing to gimple_seq I see the beauty
> >>>>>> of it :)
> >>>>>>
> >>>>>> I created two helper functions generate_strlen_builtin_1 and
> >>>>>> generate_reduction_builtin_1 in order to reduce code duplication.
> >>>>>>
> >>>>>> In function generate_strlen_builtin I changed from using
> >>>>>> builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
> >>>>>> (BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
> >>>>>> sure whether my intuition about the difference between implicit and
> >>>>>> explicit builtins is correct.  In builtins.def there is a small example
> >>>>>> given which I would paraphrase as "use builtin_decl_explicit if the
> >>>>>> semantics of the builtin is defined by the C standard; otherwise use
> >>>>>> builtin_decl_implicit" but probably my intuition is wrong?
> >>>>>>
> >>>>>> Beside that I'm not sure whether I really have to call
> >>>>>> build_fold_addr_expr which looks superfluous to me since
> >>>>>> gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:
> >>>>>>
> >>>>>> tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
> >>>>>> gimple *fn_call = gimple_build_call (fn, 1, mem);
> >>>>>>
> >>>>>> However, since it is also used that way in the context of
> >>>>>> generate_memset_builtin I didn't remove it so far.
> >>>>>>
> >>>>>>> I think overall the approach is sound now but the details still need work.
> >>>>>>
> >>>>>> Once again thank you very much for your review.  Really appreciated!
> >>>>>
> >>>>> The patch lacks a changelog entry / description.  It's nice if patches sent
> >>>>> out for review are basically the rev as git format-patch produces.
> >>>>>
> >>>>> The rawmemchr optab needs documenting in md.texi
> >>>>
> >>>> While writing the documentation in md.texi I realised that other
> >>>> instructions expect an address to be a memory operand which is not the
> >>>> case for rawmemchr currently. At the moment the address is either an
> >>>> SSA_NAME or ADDR_EXPR with a tree pointer type in expand_RAWMEMCHR. As a
> >>>> consequence in the backend define_expand rawmemchr<mode> expects a
> >>>> register operand and not a memory operand. Would it make sense to build
> >>>> a MEM_REF out of SSA_NAME/ADDR_EXPR in expand_RAWMEMCHR? Not sure if
> >>>> MEM_REF is supposed to be the canonical form here.
> >>>
> >>> I suppose the expander could use code similar to what
> >>> expand_builtin_memset_args does,
> >>> using get_memory_rtx.  I suppose that we're using MEM operands because those
> >>> can convey things like alias info or alignment info, something which
> >>> REG operands cannot
> >>> (easily).  I wouldn't build a MEM_REF and try to expand that.
> >>
> >> The new patch contains the following changes:
> >>
> >> - In expand_RAWMEMCHR I'm using get_memory_rtx now.  This means I had to
> >>    change linkage of get_memory_rtx to extern.
> >>
> >> - In function generate_strlen_builtin_using_rawmemchr I'm not
> >>    reconstructing the load type anymore from the base pointer but rather
> >>    pass it as a parameter from function transform_reduction_loop where we
> >>    also ensured that it is of integral type.  Reconstructing the load
> >>    type was error prone since e.g. I didn't distinct between
> >>    pointer_plus_expr or addr_expr.  Thus passing the load type should be
> >>    more solid.
> >>
> >> Regtested on IBM Z and x86.  Ok for mainline?
> >
> > OK, and sorry for all the repeated delays.
> >
>
> I'm running into PR56888 (
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888 ) on nvptx due to
> this, f.i. in gcc/testsuite/gcc.c-torture/execute/builtins/strlen.c,
> where gcc/testsuite/gcc.c-torture/execute/builtins/lib/strlen.c contains
> a strlen function, with a strlen loop, which is transformed by
> pass_loop_distribution into a __builtin_strlen, which is then expanded
> into a strlen call, creating a self-recursive function. [ And on nvptx,
> that happens to result in a compilation failure, which is how I found
> this. ]
>
> According to this (
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888#c21 ) comment:
> ...
> -fno-tree-loop-distribute-patterns is the reliable way to not
> transform loops into library calls.
> ...
>
> Then should we have something along the lines of:
> ...
> $ git diff
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index 6fe59cd56855..9a211d30cd7e 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -3683,7 +3683,11 @@ loop_distribution::transform_reduction_loop
>                 && TYPE_PRECISION (ptr_type_node) >= 32)
>                || (TYPE_OVERFLOW_UNDEFINED (reduction_var_type)
>                    && TYPE_PRECISION (reduction_var_type) <=
> TYPE_PRECISION (sizetype)))
> -         && builtin_decl_implicit (BUILT_IN_STRLEN))
> +         && builtin_decl_implicit (BUILT_IN_STRLEN)
> +         && flag_tree_loop_distribute_patterns)
>          generate_strlen_builtin (loop, reduction_var, load_iv.base,
>                                   reduction_iv.base, loc);
>         else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE
> (load_type))
> ...
> ?
>
> Or is the comment no longer valid?

It is still valid - and yes, I think we need to guard it with this flag,
but please do it in the caller of transform_reduction_loop.
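
Schematically, such a guard at the call site could look like the snippet
below; the placement in loop_distribution::execute and the changed flag are
assumptions for illustration, not the committed fix.

  /* Guard the reduction-loop transform (strlen/rawmemchr idioms) so that
     -fno-tree-loop-distribute-patterns reliably disables it, matching the
     behaviour of the other pattern builtins.  */
  if (flag_tree_loop_distribute_patterns
      && transform_reduction_loop (loop))
    changed = true;
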

>
> Thanks,
> - Tom

Patch

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index dd7173126fb..9cd62544a1a 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2917,6 +2917,48 @@  expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+static void
+expand_RAWMEMCHR8 (internal_fn, gcall *stmt)
+{
+  if (targetm.have_rawmemchr8 ())
+    {
+      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
+      rtx start = expand_normal (gimple_call_arg (stmt, 0));
+      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
+      emit_insn (targetm.gen_rawmemchr8 (result, start, pattern));
+    }
+  else
+    gcc_unreachable();
+}
+
+static void
+expand_RAWMEMCHR16 (internal_fn, gcall *stmt)
+{
+  if (targetm.have_rawmemchr16 ())
+    {
+      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
+      rtx start = expand_normal (gimple_call_arg (stmt, 0));
+      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
+      emit_insn (targetm.gen_rawmemchr16 (result, start, pattern));
+    }
+  else
+    gcc_unreachable();
+}
+
+static void
+expand_RAWMEMCHR32 (internal_fn, gcall *stmt)
+{
+  if (targetm.have_rawmemchr32 ())
+    {
+      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode, EXPAND_WRITE);
+      rtx start = expand_normal (gimple_call_arg (stmt, 0));
+      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
+      emit_insn (targetm.gen_rawmemchr32 (result, start, pattern));
+    }
+  else
+    gcc_unreachable();
+}
+
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index daeace7a34e..34247859704 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -348,6 +348,9 @@  DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR8, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR16, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR32, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
diff --git a/gcc/target-insns.def b/gcc/target-insns.def
index 672c35698d7..9248554cbf3 100644
--- a/gcc/target-insns.def
+++ b/gcc/target-insns.def
@@ -106,3 +106,6 @@  DEF_TARGET_INSN (trap, (void))
 DEF_TARGET_INSN (unique, (void))
 DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1))
+DEF_TARGET_INSN (rawmemchr8, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (rawmemchr16, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (rawmemchr32, (rtx x0, rtx x1, rtx x2))
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 7ee19fc8677..f5b24bf53bc 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -218,7 +218,7 @@  enum partition_kind {
        be unnecessary and removed once distributed memset can be understood
        and analyzed in data reference analysis.  See PR82604 for more.  */
     PKIND_PARTIAL_MEMSET,
-    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE
+    PKIND_MEMSET, PKIND_MEMCPY, PKIND_MEMMOVE, PKIND_RAWMEMCHR
 };
 
 /* Type of distributed loop.  */
@@ -244,6 +244,8 @@  struct builtin_info
      is only used in memset builtin distribution for now.  */
   tree dst_base_base;
   unsigned HOST_WIDE_INT dst_base_offset;
+  tree pattern;
+  internal_fn fn;
 };
 
 /* Partition for loop distribution.  */
@@ -588,7 +590,8 @@  class loop_distribution
   bool
   classify_partition (loop_p loop,
 		      struct graph *rdg, partition *partition,
-		      bitmap stmt_in_all_partitions);
+		      bitmap stmt_in_all_partitions,
+		      vec<struct partition *> *partitions);
 
 
   /* Returns true when PARTITION1 and PARTITION2 access the same memory
@@ -1232,6 +1235,67 @@  generate_memcpy_builtin (class loop *loop, partition *partition)
     }
 }
 
+/* Generate a call to rawmemchr{8,16,32} for PARTITION in LOOP.  */
+
+static void
+generate_rawmemchr_builtin (class loop *loop, partition *partition)
+{
+  gimple_stmt_iterator gsi;
+  tree mem, pattern;
+  struct builtin_info *builtin = partition->builtin;
+  gimple *fn_call;
+
+  data_reference_p dr = builtin->src_dr;
+  tree base = builtin->src_base;
+
+  tree result_old = TREE_OPERAND (DR_REF (dr), 0);
+  tree result_new = copy_ssa_name (result_old);
+
+  /* The new statements will be placed before LOOP.  */
+  gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
+
+  mem = force_gimple_operand_gsi (&gsi, base, true, NULL_TREE, false, GSI_CONTINUE_LINKING);
+  pattern = builtin->pattern;
+  if (TREE_CODE (pattern) == INTEGER_CST)
+    pattern = fold_convert (integer_type_node, pattern);
+  fn_call = gimple_build_call_internal (builtin->fn, 2, mem, pattern);
+  gimple_call_set_lhs (fn_call, result_new);
+  gimple_set_location (fn_call, partition->loc);
+  gsi_insert_after (&gsi, fn_call, GSI_CONTINUE_LINKING);
+
+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
+    {
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	SET_USE (use_p, result_new);
+
+      update_stmt (stmt);
+    }
+
+  fold_stmt (&gsi);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    switch (builtin->fn)
+      {
+      case IFN_RAWMEMCHR8:
+	fprintf (dump_file, "generated rawmemchr8\n");
+	break;
+
+      case IFN_RAWMEMCHR16:
+	fprintf (dump_file, "generated rawmemchr16\n");
+	break;
+
+      case IFN_RAWMEMCHR32:
+	fprintf (dump_file, "generated rawmemchr32\n");
+	break;
+
+      default:
+	gcc_unreachable ();
+      }
+}
+
 /* Remove and destroy the loop LOOP.  */
 
 static void
@@ -1334,6 +1398,10 @@  generate_code_for_partition (class loop *loop,
       generate_memcpy_builtin (loop, partition);
       break;
 
+    case PKIND_RAWMEMCHR:
+      generate_rawmemchr_builtin (loop, partition);
+      break;
+
     default:
       gcc_unreachable ();
     }
@@ -1525,44 +1593,53 @@  find_single_drs (class loop *loop, struct graph *rdg, partition *partition,
 	}
     }
 
-  if (!single_st)
+  if (!single_ld && !single_st)
     return false;
 
-  /* Bail out if this is a bitfield memory reference.  */
-  if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
-      && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
-    return false;
-
-  /* Data reference must be executed exactly once per iteration of each
-     loop in the loop nest.  We only need to check dominance information
-     against the outermost one in a perfect loop nest because a bb can't
-     dominate outermost loop's latch without dominating inner loop's.  */
-  basic_block bb_st = gimple_bb (DR_STMT (single_st));
-  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
-    return false;
+  basic_block bb_ld = NULL;
+  basic_block bb_st = NULL;
 
   if (single_ld)
     {
-      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
-      /* Direct aggregate copy or via an SSA name temporary.  */
-      if (load != store
-	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
-	return false;
-
       /* Bail out if this is a bitfield memory reference.  */
       if (TREE_CODE (DR_REF (single_ld)) == COMPONENT_REF
 	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_ld), 1)))
 	return false;
 
-      /* Load and store must be in the same loop nest.  */
-      basic_block bb_ld = gimple_bb (DR_STMT (single_ld));
-      if (bb_st->loop_father != bb_ld->loop_father)
+      /* Data reference must be executed exactly once per iteration of each
+	 loop in the loop nest.  We only need to check dominance information
+	 against the outermost one in a perfect loop nest because a bb can't
+	 dominate outermost loop's latch without dominating inner loop's.  */
+      bb_ld = gimple_bb (DR_STMT (single_ld));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+	return false;
+    }
+
+  if (single_st)
+    {
+      /* Bail out if this is a bitfield memory reference.  */
+      if (TREE_CODE (DR_REF (single_st)) == COMPONENT_REF
+	  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (single_st), 1)))
 	return false;
 
       /* Data reference must be executed exactly once per iteration.
-	 Same as single_st, we only need to check against the outermost
+	 Same as single_ld, we only need to check against the outermost
 	 loop.  */
-      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_ld))
+      bb_st = gimple_bb (DR_STMT (single_st));
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb_st))
+	return false;
+    }
+
+  if (single_ld && single_st)
+    {
+      gimple *store = DR_STMT (single_st), *load = DR_STMT (single_ld);
+      /* Direct aggregate copy or via an SSA name temporary.  */
+      if (load != store
+	  && gimple_assign_lhs (load) != gimple_assign_rhs1 (store))
+	return false;
+
+      /* Load and store must be in the same loop nest.  */
+      if (bb_st->loop_father != bb_ld->loop_father)
 	return false;
 
       edge e = single_exit (bb_st->loop_father);
@@ -1681,6 +1758,84 @@  alloc_builtin (data_reference_p dst_dr, data_reference_p src_dr,
   return builtin;
 }
 
+/* Given data reference DR in loop nest LOOP, classify if it forms builtin
+   rawmemchr{8,16,32} call.  */
+
+static bool
+classify_builtin_rawmemchr (loop_p loop, partition *partition, data_reference_p dr, tree loop_result)
+{
+  tree dr_ref = DR_REF (dr);
+  tree dr_access_base = build_fold_addr_expr (dr_ref);
+  tree dr_access_size = TYPE_SIZE_UNIT (TREE_TYPE (dr_ref));
+  gimple *dr_stmt = DR_STMT (dr);
+  tree rhs1 = gimple_assign_rhs1 (dr_stmt);
+  affine_iv iv;
+  tree pattern;
+
+  if (TREE_OPERAND (rhs1, 0) != loop_result)
+    return false;
+
+  /* A limitation of the current implementation is that we only support
+     constant patterns.  */
+  gcond *cond_stmt = as_a <gcond *> (last_stmt (loop->header));
+  pattern = gimple_cond_rhs (cond_stmt);
+  if (gimple_cond_code (cond_stmt) != NE_EXPR
+      || gimple_cond_lhs (cond_stmt) != gimple_assign_lhs (dr_stmt)
+      || TREE_CODE (pattern) != INTEGER_CST)
+    return false;
+
+  /* Bail out if no affine induction variable with constant step can be
+     determined.  */
+  if (!simple_iv (loop, loop, dr_access_base, &iv, false))
+    return false;
+
+  /* Bail out if memory accesses are not consecutive.  */
+  if (!operand_equal_p (iv.step, dr_access_size, 0))
+    return false;
+
+  /* Bail out if direction of memory accesses is not growing.  */
+  if (get_range_pos_neg (iv.step) != 1)
+    return false;
+
+  internal_fn fn;
+  switch (TREE_INT_CST_LOW (iv.step))
+    {
+    case 1:
+      if (!targetm.have_rawmemchr8 ())
+	return false;
+      fn = IFN_RAWMEMCHR8;
+      break;
+
+    case 2:
+      if (!targetm.have_rawmemchr16 ())
+	return false;
+      fn = IFN_RAWMEMCHR16;
+      break;
+
+    case 4:
+      if (!targetm.have_rawmemchr32 ())
+	return false;
+      fn = IFN_RAWMEMCHR32;
+      break;
+
+    default:
+      return false;
+    }
+
+  struct builtin_info *builtin;
+  builtin = alloc_builtin (NULL, NULL, NULL_TREE, NULL_TREE, NULL_TREE);
+  builtin->src_dr = dr;
+  builtin->src_base = iv.base;
+  builtin->pattern = pattern;
+  builtin->fn = fn;
+
+  partition->loc = gimple_location (dr_stmt);
+  partition->builtin = builtin;
+  partition->kind = PKIND_RAWMEMCHR;
+
+  return true;
+}
+
 /* Given data reference DR in loop nest LOOP, classify if it forms builtin
    memset call.  */
 
@@ -1792,12 +1947,16 @@  loop_distribution::classify_builtin_ldst (loop_p loop, struct graph *rdg,
 bool
 loop_distribution::classify_partition (loop_p loop,
 				       struct graph *rdg, partition *partition,
-				       bitmap stmt_in_all_partitions)
+				       bitmap stmt_in_all_partitions,
+				       vec<struct partition *> *partitions)
 {
   bitmap_iterator bi;
   unsigned i;
   data_reference_p single_ld = NULL, single_st = NULL;
   bool volatiles_p = false, has_reduction = false;
+  unsigned nreductions = 0;
+  gimple *reduction_stmt = NULL;
+  bool has_interpar_reduction = false;
 
   EXECUTE_IF_SET_IN_BITMAP (partition->stmts, 0, i, bi)
     {
@@ -1821,6 +1980,19 @@  loop_distribution::classify_partition (loop_p loop,
 	    partition->reduction_p = true;
 	  else
 	    has_reduction = true;
+
+	  /* Determine whether the reduction statement occurs in other
+	     partitions than the current one.  */
+	  struct partition *piter;
+	  for (unsigned j = 0; partitions->iterate (j, &piter); ++j)
+	    {
+	      if (piter == partition)
+		continue;
+	      if (bitmap_bit_p (piter->stmts, i))
+		has_interpar_reduction = true;
+	    }
+	  reduction_stmt = stmt;
+	  ++nreductions;
 	}
     }
 
@@ -1840,6 +2012,30 @@  loop_distribution::classify_partition (loop_p loop,
   if (!find_single_drs (loop, rdg, partition, &single_st, &single_ld))
     return has_reduction;
 
+  /* If we determined a single load and a single reduction statement which does
+     not occur in any other partition, then try to classify this partition as a
+     rawmemchr builtin.  */
+  if (single_ld != NULL
+      && single_st == NULL
+      && nreductions == 1
+      && !has_interpar_reduction
+      && is_gimple_assign (reduction_stmt))
+    {
+      /* If we classified the partition as a builtin, then ignoring the single
+	 reduction is safe, since the reduction variable is not used in other
+	 partitions.  */
+      tree reduction_var = gimple_assign_lhs (reduction_stmt);
+      return !classify_builtin_rawmemchr (loop, partition, single_ld, reduction_var);
+    }
+
+  if (single_st == NULL)
+    return has_reduction;
+
+  /* Don't distribute loop if niters is unknown.  */
+  tree niters = number_of_latch_executions (loop);
+  if (niters == NULL_TREE || niters == chrec_dont_know)
+    return has_reduction;
+
   partition->loc = gimple_location (DR_STMT (single_st));
 
   /* Classify the builtin kind.  */
@@ -2979,7 +3175,7 @@  loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
   FOR_EACH_VEC_ELT (partitions, i, partition)
     {
       reduction_in_all
-	|= classify_partition (loop, rdg, partition, stmt_in_all_partitions);
+	|= classify_partition (loop, rdg, partition, stmt_in_all_partitions, &partitions);
       any_builtin |= partition_builtin_p (partition);
     }
 
@@ -3290,11 +3486,6 @@  loop_distribution::execute (function *fun)
 	      && !optimize_loop_for_speed_p (loop)))
 	continue;
 
-      /* Don't distribute loop if niters is unknown.  */
-      tree niters = number_of_latch_executions (loop);
-      if (niters == NULL_TREE || niters == chrec_dont_know)
-	continue;
-
       /* Get the perfect loop nest for distribution.  */
       loop = prepare_perfect_loop_nest (loop);
       for (; loop; loop = loop->inner)
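
For reference, an illustrative example (not taken from the patch or its
testcases) of a loop shape the new classification targets:

#include <stdint.h>

/* A single consecutive 16-bit load, a constant comparison pattern, no store,
   and the final pointer as the only reduction: with a rawmemchr expander for
   this mode the whole loop is replaced by one internal-function call.  */
uint16_t *
find_terminator (uint16_t *p)
{
  while (*p != 0xffff)
    ++p;
  return p;
}

The 8- and 32-bit variants map to the rawmemchr8 and rawmemchr32 expanders in
the same way, driven by the constant step of the induction variable.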