
PR90838: Support ctz idioms

Message ID VI1PR0801MB212780CDB6061AD5045E74B283770@VI1PR0801MB2127.eurprd08.prod.outlook.com
State New

Commit Message

Wilco Dijkstra Nov. 12, 2019, 2:35 p.m. UTC
Hi,

Support common idioms for count trailing zeroes using an array lookup.
The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
constant which when multiplied by a power of 2 contains a unique value
in the top 5 or 6 bits.  This is then indexed into a table which maps it
to the number of trailing zeroes.  When the table is valid, we emit a
sequence using the target defined value for ctz (0):

int ctz1 (unsigned x)
{
  static const char table[32] =
    {
      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };

  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}

Is optimized to:

	rbit	w0, w0
	clz	w0, w0
	and	w0, w0, 31
	ret

Bootstrapped on AArch64. OK for commit?

ChangeLog:

2019-11-12  Wilco Dijkstra  <wdijkstr@arm.com>

        PR tree-optimization/90838
        * generic-match-head.c (optimize_count_trailing_zeroes):
        Add stub function.
        * gimple-match-head.c (gimple_simplify): Add support for ARRAY_REF.
        (optimize_count_trailing_zeroes): Add new function.
        * match.pd: Add matching for ctz idioms.
        * testsuite/gcc.target/aarch64/pr90838.c: New test.

--

Comments

Segher Boessenkool Nov. 13, 2019, 11:39 a.m. UTC | #1
Hi!

On Tue, Nov 12, 2019 at 02:35:54PM +0000, Wilco Dijkstra wrote:
> Support common idioms for count trailing zeroes using an array lookup.
> The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> constant which when multiplied by a power of 2 contains a unique value
> in the top 5 or 6 bits.  This is then indexed into a table which maps it
> to the number of trailing zeroes.  When the table is valid, we emit a
> sequence using the target defined value for ctz (0):
> 
> int ctz1 (unsigned x)
> {
>   static const char table[32] =
>     {
>       0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
>       31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
>     };
> 
>   return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
> }

Out of interest, what uses this?  I have never seen it before.


Segher
Wilco Dijkstra Nov. 13, 2019, 12:19 p.m. UTC | #2
Hi Segher,

> Out of interest, what uses this?  I have never seen it before.

It's used in sjeng in SPEC and gives a 2% speedup on Cortex-A57.
Tricks like this used to be very common 20 years ago, since a loop or binary search
is far too slow and few CPUs supported fast clz/ctz instructions. It's one of those
operations you rarely need, but when you do, performance is absolutely critical...

As Jakub mentioned in the PR, https://doc.lagout.org/security/Hackers%20Delight.pdf
is a good resource for these bit tricks.

Cheers,
Wilco
Richard Biener Nov. 13, 2019, 1:58 p.m. UTC | #3
On Tue, Nov 12, 2019 at 3:36 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi,
>
> Support common idioms for count trailing zeroes using an array lookup.
> The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> constant which when multiplied by a power of 2 contains a unique value
> in the top 5 or 6 bits.  This is then indexed into a table which maps it
> to the number of trailing zeroes.  When the table is valid, we emit a
> sequence using the target defined value for ctz (0):
>
> int ctz1 (unsigned x)
> {
>   static const char table[32] =
>     {
>       0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
>       31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
>     };
>
>   return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
> }
>
> Is optimized to:
>
>         rbit    w0, w0
>         clz     w0, w0
>         and     w0, w0, 31
>         ret
>
> Bootstrapped on AArch64. OK for commit?

Uh.  Well.  I think that
the gimple-match-head.c hunk isn't something we want.  Instead,
since this optimizes a memory access, the handling should move
to tree-ssa-forwprop.c where you _may_ use a (match ...)
match.pd pattern to do the (rshift (mult (bit_and (negate @1) @1)
matching.  It might be the first to use that feature, you need to
declare the function to use it from tree-ssa-forwprop.c.  So

(match (clz_table_index @1 @2 @3)
  (rshift (mult (bit_and (negate @1) @1) INTEGER_CST@2)
                       INTEGER_CST@3))

in match.pd and then

extern bool gimple_clz_table_index (tree, tree *, tree (*)(tree));

and use it like

 tree res_ops[3];
 if (gimple_clz_table_index (TREE_OPERAND (array_ref, 1), res_ops, NULL))
   {
      @1 @2 and @3 are now in res_ops
   }

btw, the bit_and probably needs :c

Thanks,
Richard.

> ChangeLog:
>
> 2019-11-12  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         PR tree-optimization/90838
>         * generic-match-head.c (optimize_count_trailing_zeroes):
>         Add stub function.
>         * gimple-match-head.c (gimple_simplify): Add support for ARRAY_REF.
>         (optimize_count_trailing_zeroes): Add new function.
>         * match.pd: Add matching for ctz idioms.
>         * testsuite/gcc.target/aarch64/pr90838.c: New test.
>
> --
>
> diff --git a/gcc/generic-match-head.c b/gcc/generic-match-head.c
> index fdc603977fc5b03a843944f75ce262f5d2256308..5a38bd233585225d60f0159c9042a16d9fdc9d80 100644
> --- a/gcc/generic-match-head.c
> +++ b/gcc/generic-match-head.c
> @@ -88,3 +88,10 @@ optimize_successive_divisions_p (tree, tree)
>  {
>    return false;
>  }
> +
> +static bool
> +optimize_count_trailing_zeroes (tree type, tree array_ref, tree input,
> +                               tree mulc, tree shift, tree &zero_val)
> +{
> +  return false;
> +}
> diff --git a/gcc/gimple-match-head.c b/gcc/gimple-match-head.c
> index 53278168a59f5ac10ce6760f04fd42589a0792e7..2d3b305f8ea54e4ca31c64994af30b34bb7eff09 100644
> --- a/gcc/gimple-match-head.c
> +++ b/gcc/gimple-match-head.c
> @@ -909,6 +909,24 @@ gimple_simplify (gimple *stmt, gimple_match_op *res_op, gimple_seq *seq,
>                 res_op->set_op (TREE_CODE (op0), type, valueized);
>                 return true;
>               }
> +           else if (code == ARRAY_REF)
> +             {
> +               tree rhs1 = gimple_assign_rhs1 (stmt);
> +               tree op1 = TREE_OPERAND (rhs1, 1);
> +               tree op2 = TREE_OPERAND (rhs1, 2);
> +               tree op3 = TREE_OPERAND (rhs1, 3);
> +               tree op0 = TREE_OPERAND (rhs1, 0);
> +               bool valueized = false;
> +
> +               op0 = do_valueize (op0, top_valueize, valueized);
> +               op1 = do_valueize (op1, top_valueize, valueized);
> +
> +               if (op2 && op3)
> +                 res_op->set_op (code, type, op0, op1, op2, op3);
> +               else
> +                 res_op->set_op (code, type, op0, op1);
> +               return gimple_resimplify4 (seq, res_op, valueize) || valueized;
> +             }
>             break;
>           case GIMPLE_UNARY_RHS:
>             {
> @@ -1222,3 +1240,57 @@ optimize_successive_divisions_p (tree divisor, tree inner_div)
>      }
>    return true;
>  }
> +
> +/* Recognize count trailing zeroes idiom.
> +   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> +   constant which when multiplied by a power of 2 contains a unique value
> +   in the top 5 or 6 bits.  This is then indexed into a table which maps it
> +   to the number of trailing zeroes.  Array[0] is returned so the caller can
> +   emit an appropriate sequence depending on whether ctz (0) is defined on
> +   the target.  */
> +static bool
> +optimize_count_trailing_zeroes (tree type, tree array, tree x, tree mulc,
> +                               tree tshift, tree &zero_val)
> +{
> +  gcc_assert (TREE_CODE (mulc) == INTEGER_CST);
> +  gcc_assert (TREE_CODE (tshift) == INTEGER_CST);
> +
> +  tree input_type = TREE_TYPE (x);
> +
> +  if (!direct_internal_fn_supported_p (IFN_CTZ, input_type, OPTIMIZE_FOR_BOTH))
> +    return false;
> +
> +  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
> +  unsigned shiftval = tree_to_uhwi (tshift);
> +  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));
> +
> +  /* Check the array is not wider than integer type and the input is a 32-bit
> +     or 64-bit type.  The shift should extract the top 5..7 bits.  */
> +  if (TYPE_PRECISION (type) > 32)
> +    return false;
> +  if (input_bits != 32 && input_bits != 64)
> +    return false;
> +  if (shiftval < input_bits - 7 || shiftval > input_bits - 5)
> +    return false;
> +
> +  tree t = build4 (ARRAY_REF, type, array, size_int (0), NULL_TREE, NULL_TREE);
> +  t = fold_const_aggregate_ref (t);
> +  if (t == NULL)
> +    return false;
> +
> +  zero_val = build_int_cst (integer_type_node, tree_to_shwi (t));
> +
> +  for (unsigned i = 0; i < input_bits; i++, val <<= 1)
> +    {
> +      if (input_bits == 32)
> +       val &= 0xffffffff;
> +      t = build4 (ARRAY_REF, type, array, size_int ((int)(val >> shiftval)),
> +                 NULL_TREE, NULL_TREE);
> +      t = fold_const_aggregate_ref (t);
> +      if (t == NULL || tree_to_shwi (t) != i)
> +       return false;
> +    }
> +
> +  return true;
> +}
> +
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 6edf54b80012d87dbe7330f5ee638cdba2f9c099..bbe935e1e2af35e8e953a776eb3ecfb83414b047 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -6060,3 +6060,33 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (simplify
>   (vec_perm vec_same_elem_p@0 @0 @1)
>   @0)
> +
> +/* Recognize count trailing zeroes idiom.
> +   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> +   constant which when multiplied by a power of 2 contains a unique value
> +   in the top 5 or 6 bits.  This is then indexed into a table which maps it
> +   to the number of trailing zeroes.  If valid, emit an optimal sequence
> +   depending on the result for zero.
> +*/
> +(simplify
> + (ARRAY_REF @0 (rshift (mult (bit_and (negate @1) @1) INTEGER_CST@2)
> +                       INTEGER_CST@3) @4 @5)
> + (with
> +  { tree zero_val;
> +    HOST_WIDE_INT val;
> +    HOST_WIDE_INT type_size = tree_to_shwi (TYPE_SIZE (TREE_TYPE (@1)));
> +    bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (TREE_TYPE (@1)), val);
> +  }
> +  (if (optimize_count_trailing_zeroes (type, @0, @1, @2, @3, zero_val))
> +   (switch
> +    (if (zero_ok && tree_to_shwi (zero_val) == val)
> +     (convert (BUILT_IN_CTZ:integer_type_node @1)))
> +
> +    /* Emit ctz (x) & 31 if ctz (0) is 32 but we need to return 0.  */
> +    (if (zero_ok && val == type_size && integer_zerop (zero_val))
> +     (convert (bit_and (BUILT_IN_CTZ:integer_type_node @1)
> +       { build_int_cst (integer_type_node, type_size - 1); })))
> +
> +    /* Emit (x ? ctz (x) : zero_val).  */
> +    (if (true)
> +     (convert (cond @1 (BUILT_IN_CTZ:integer_type_node @1) { zero_val; } )))))))
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr90838.c b/gcc/testsuite/gcc.target/aarch64/pr90838.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..bff3144c0d1b3984016e5a404e986eae785c73ed
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr90838.c
> @@ -0,0 +1,64 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +int ctz1 (unsigned x)
> +{
> +  static const char table[32] =
> +    {
> +      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
> +      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
> +    };
> +
> +  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
> +}
> +
> +int ctz2 (unsigned x)
> +{
> +  const int u = 0;
> +  static short table[64] =
> +    {
> +      32, 0, 1,12, 2, 6, u,13, 3, u, 7, u, u, u, u,14,
> +      10, 4, u, u, 8, u, u,25, u, u, u, u, u,21,27,15,
> +      31,11, 5, u, u, u, u, u, 9, u, u,24, u, u,20,26,
> +      30, u, u, u, u,23, u,19,29, u,22,18,28,17,16, u
> +    };
> +
> +  x = (x & -x) * 0x0450FBAF;
> +  return table[x >> 26];
> +}
> +
> +int ctz3 (unsigned x)
> +{
> +  static int table[32] =
> +    {
> +      0, 1, 2,24, 3,19, 6,25, 22, 4,20,10,16, 7,12,26,
> +      31,23,18, 5,21, 9,15,11,30,17, 8,14,29,13,28,27
> +    };
> +
> +  if (x == 0) return 32;
> +  x = (x & -x) * 0x04D7651F;
> +  return table[x >> 27];
> +}
> +
> +static const unsigned long long magic = 0x03f08c5392f756cdULL;
> +
> +static const char table[64] = {
> +     0,  1, 12,  2, 13, 22, 17,  3,
> +    14, 33, 23, 36, 18, 58, 28,  4,
> +    62, 15, 34, 26, 24, 48, 50, 37,
> +    19, 55, 59, 52, 29, 44, 39,  5,
> +    63, 11, 21, 16, 32, 35, 57, 27,
> +    61, 25, 47, 49, 54, 51, 43, 38,
> +    10, 20, 31, 56, 60, 46, 53, 42,
> +     9, 30, 45, 41,  8, 40,  7,  6,
> +};
> +
> +int ctz4 (unsigned long x)
> +{
> +  unsigned long lsb = x & -x;
> +  return table[(lsb * magic) >> 58];
> +}
> +
> +/* { dg-final { scan-assembler-times "clz\t" 4 } } */
> +/* { dg-final { scan-assembler-times "and\t" 2 } } */
> +/* { dg-final { scan-assembler-not "cmp\t.*0" } } */
Wilco Dijkstra Nov. 15, 2019, 3:23 p.m. UTC | #4
Hi Richard,

> Uh.  Well.  I think that the gimple-match-head.c hunk isn't something we want.  Instead,
> since this optimizes a memory access, the handling should move
> to tree-ssa-forwprop.c where you _may_ use a (match ...)
> match.pd pattern to do the (rshift (mult (bit_and (negate @1) @1)
> matching.  It might be the first to use that feature, you need to
> declare the function to use it from tree-ssa-forwprop.c.  So

OK, I've moved it to fwprop, and it works just fine there while still
using match.pd to do the idiom matching. Here is the updated version:

[PATCH v2] PR90838: Support ctz idioms

v2: Use fwprop pass rather than match.pd

Support common idioms for count trailing zeroes using an array lookup.
The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
constant which when multiplied by a power of 2 contains a unique value
in the top 5 or 6 bits.  This is then indexed into a table which maps it
to the number of trailing zeroes.  When the table is valid, we emit a
sequence using the target defined value for ctz (0):

int ctz1 (unsigned x)
{
  static const char table[32] =
    {
      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };

  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}

Is optimized to:

	rbit	w0, w0
	clz	w0, w0
	and	w0, w0, 31
	ret

Bootstrapped on AArch64. OK for commit?

ChangeLog:

2019-11-15  Wilco Dijkstra  <wdijkstr@arm.com>

        PR tree-optimization/90838
        * tree-ssa-forwprop.c (optimize_count_trailing_zeroes):
        Add new function.
        (simplify_count_trailing_zeroes): Add new function.
        (pass_forwprop::execute): Try ctz simplification.
        * match.pd: Add matching for ctz idioms.
        * testsuite/gcc.target/aarch64/pr90838.c: New test.
---

diff --git a/gcc/match.pd b/gcc/match.pd
index 6edf54b80012d87dbe7330f5ee638cdba2f9c099..479e9076f0d4deccda54425e93ee4567b85409aa 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -6060,3 +6060,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 (simplify
  (vec_perm vec_same_elem_p@0 @0 @1)
  @0)
+
+/* Match count trailing zeroes for simplify_count_trailing_zeroes in fwprop.
+   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
+   constant which when multiplied by a power of 2 contains a unique value
+   in the top 5 or 6 bits.  This is then indexed into a table which maps it
+   to the number of trailing zeroes.  */
+(match (ctz_table_index @1 @2 @3)
+  (rshift (mult (bit_and (negate @1) @1) INTEGER_CST@2) INTEGER_CST@3))
diff --git a/gcc/testsuite/gcc.target/aarch64/pr90838.c b/gcc/testsuite/gcc.target/aarch64/pr90838.c
new file mode 100644
index 0000000000000000000000000000000000000000..bff3144c0d1b3984016e5a404e986eae785c73ed
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr90838.c
@@ -0,0 +1,64 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int ctz1 (unsigned x)
+{
+  static const char table[32] =
+    {
+      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
+      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
+    };
+
+  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
+}
+
+int ctz2 (unsigned x)
+{
+  const int u = 0;
+  static short table[64] =
+    {
+      32, 0, 1,12, 2, 6, u,13, 3, u, 7, u, u, u, u,14,
+      10, 4, u, u, 8, u, u,25, u, u, u, u, u,21,27,15,
+      31,11, 5, u, u, u, u, u, 9, u, u,24, u, u,20,26,
+      30, u, u, u, u,23, u,19,29, u,22,18,28,17,16, u
+    };
+
+  x = (x & -x) * 0x0450FBAF;
+  return table[x >> 26];
+}
+
+int ctz3 (unsigned x)
+{
+  static int table[32] =
+    {
+      0, 1, 2,24, 3,19, 6,25, 22, 4,20,10,16, 7,12,26,
+      31,23,18, 5,21, 9,15,11,30,17, 8,14,29,13,28,27
+    };
+
+  if (x == 0) return 32;
+  x = (x & -x) * 0x04D7651F;
+  return table[x >> 27];
+}
+
+static const unsigned long long magic = 0x03f08c5392f756cdULL;
+
+static const char table[64] = {
+     0,  1, 12,  2, 13, 22, 17,  3,
+    14, 33, 23, 36, 18, 58, 28,  4,
+    62, 15, 34, 26, 24, 48, 50, 37,
+    19, 55, 59, 52, 29, 44, 39,  5,
+    63, 11, 21, 16, 32, 35, 57, 27,
+    61, 25, 47, 49, 54, 51, 43, 38,
+    10, 20, 31, 56, 60, 46, 53, 42,
+     9, 30, 45, 41,  8, 40,  7,  6,
+};
+
+int ctz4 (unsigned long x)
+{
+  unsigned long lsb = x & -x;
+  return table[(lsb * magic) >> 58];
+}
+
+/* { dg-final { scan-assembler-times "clz\t" 4 } } */
+/* { dg-final { scan-assembler-times "and\t" 2 } } */
+/* { dg-final { scan-assembler-not "cmp\t.*0" } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index fe55ca958b49b986f79a9a710d92b5d906959105..a632d54712be55f8070c9816e3c3702d4a493182 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "optabs-tree.h"
 #include "tree-vector-builder.h"
 #include "vec-perm-indices.h"
+#include "internal-fn.h"
 
 /* This pass propagates the RHS of assignment statements into use
    sites of the LHS of the assignment.  It's basically a specialized
@@ -1778,6 +1779,126 @@ simplify_rotate (gimple_stmt_iterator *gsi)
   return true;
 }
 
+
+/* Recognize count trailing zeroes idiom.
+   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
+   constant which when multiplied by a power of 2 contains a unique value
+   in the top 5 or 6 bits.  This is then indexed into a table which maps it
+   to the number of trailing zeroes.  Array[0] is returned so the caller can
+   emit an appropriate sequence depending on whether ctz (0) is defined on
+   the target.  */
+static bool
+optimize_count_trailing_zeroes (tree type, tree array, tree x, tree mulc,
+				tree tshift, tree &zero_val)
+{
+  gcc_assert (TREE_CODE (mulc) == INTEGER_CST);
+  gcc_assert (TREE_CODE (tshift) == INTEGER_CST);
+
+  tree input_type = TREE_TYPE (x);
+
+  if (!direct_internal_fn_supported_p (IFN_CTZ, input_type, OPTIMIZE_FOR_BOTH))
+    return false;
+
+  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
+  unsigned shiftval = tree_to_uhwi (tshift);
+  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));
+
+  /* Check the array is not wider than integer type and the input is a 32-bit
+     or 64-bit type.  The shift should extract the top 5..7 bits.  */
+  if (TYPE_PRECISION (type) > 32)
+    return false;
+  if (input_bits != 32 && input_bits != 64)
+    return false;
+  if (shiftval < input_bits - 7 || shiftval > input_bits - 5)
+    return false;
+
+  tree t = build4 (ARRAY_REF, type, array, size_int (0), NULL_TREE, NULL_TREE);
+  t = fold_const_aggregate_ref (t);
+  if (t == NULL)
+    return false;
+
+  zero_val = build_int_cst (integer_type_node, tree_to_shwi (t));
+
+  for (unsigned i = 0; i < input_bits; i++, val <<= 1)
+    {
+      if (input_bits == 32)
+	val &= 0xffffffff;
+      t = build4 (ARRAY_REF, type, array, size_int ((int)(val >> shiftval)),
+		  NULL_TREE, NULL_TREE);
+      t = fold_const_aggregate_ref (t);
+      if (t == NULL || tree_to_shwi (t) != i)
+	return false;
+    }
+
+  return true;
+}
+
+/* Match.pd function to match the ctz expression.  */
+extern bool gimple_ctz_table_index (tree, tree *, tree (*)(tree));
+
+static bool
+simplify_count_trailing_zeroes (gimple_stmt_iterator *gsi)
+{
+  gimple *stmt = gsi_stmt (*gsi);
+  tree array_ref = gimple_assign_rhs1 (stmt);
+  tree res_ops[3];
+  tree zero_val;
+
+  gcc_checking_assert (TREE_CODE (array_ref) == ARRAY_REF);
+
+  if (!gimple_ctz_table_index (TREE_OPERAND (array_ref, 1), &res_ops[0], NULL))
+    return false;
+
+  if (optimize_count_trailing_zeroes (TREE_TYPE (array_ref),
+				      TREE_OPERAND (array_ref, 0), res_ops[0],
+				      res_ops[1], res_ops[2], zero_val))
+    {
+      tree lhs = gimple_assign_lhs (stmt);
+      tree type = TREE_TYPE (res_ops[0]);
+      HOST_WIDE_INT val;
+      HOST_WIDE_INT type_size = tree_to_shwi (TYPE_SIZE (type));
+      bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), val);
+      bool need_convert = !useless_type_conversion_p (TREE_TYPE (lhs),
+						      integer_type_node);
+
+      gcall *call = gimple_build_call_internal (IFN_CTZ, 1, res_ops[0]);
+      gimple_set_location (call, gimple_location (stmt));
+      gimple_set_lhs (call, make_ssa_name (integer_type_node));
+
+      gimple *g = call;
+      tree prev_lhs = gimple_call_lhs (call);
+
+      /* Emit ctz (x) & 31 if ctz (0) is 32 but we need to return 0.  */
+      if (zero_ok && val == type_size && integer_zerop (zero_val))
+	{
+	  gsi_insert_before (gsi, g, GSI_SAME_STMT);
+	  g = gimple_build_assign (make_ssa_name (integer_type_node),
+				   BIT_AND_EXPR, prev_lhs,
+				   build_int_cst (integer_type_node,
+						  type_size - 1));
+	  gimple_set_location (g, gimple_location (stmt));
+	  prev_lhs = gimple_assign_lhs (g);
+	}
+      else if (!zero_ok || tree_to_shwi (zero_val) != val)
+	return false;
+
+      if (need_convert)
+	{
+	  gsi_insert_before (gsi, g, GSI_SAME_STMT);
+	  g = gimple_build_assign (lhs, NOP_EXPR, prev_lhs);
+	}
+      else
+	gimple_set_lhs (g, lhs);
+
+      gsi_replace (gsi, g, false);
+      return true;
+    }
+
+  return false;
+}
+
+
 /* Combine an element access with a shuffle.  Returns true if there were
    any changes made, else it returns false.  */
  
@@ -2759,6 +2880,8 @@ pass_forwprop::execute (function *fun)
 		    else if (code == CONSTRUCTOR
 			     && TREE_CODE (TREE_TYPE (rhs1)) == VECTOR_TYPE)
 		      changed = simplify_vector_constructor (&gsi);
+		    else if (code == ARRAY_REF)
+		      changed = simplify_count_trailing_zeroes (&gsi);
 		    break;
 		  }
Wilco Dijkstra Nov. 28, 2019, 11:27 a.m. UTC | #5
ping

Richard Biener Nov. 29, 2019, 9:15 a.m. UTC | #6
On Fri, Nov 15, 2019 at 4:24 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi Richard,
>
> > Uh.  Well.  I think that the gimple-match-head.c hunk isn't something we want.  Instead,
> > since this optimizes a memory access, the handling should move
> > to tree-ssa-forwprop.c where you _may_ use a (match ...)
> > match.pd pattern to do the (rshift (mult (bit_and (negate @1) @1)
> > matching.  It might be the first to use that feature, you need to
> > declare the function to use it from tree-ssa-forwprop.c.  So
>
> OK, I've moved it to fwprop, and it works just fine there while still
> using match.pd to do the idiom matching. Here is the updated version:
>
> [PATCH v2] PR90838: Support ctz idioms
>
> v2: Use fwprop pass rather than match.pd
>
> Support common idioms for count trailing zeroes using an array lookup.
> The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> constant which when multiplied by a power of 2 contains a unique value
> in the top 5 or 6 bits.  This is then indexed into a table which maps it
> to the number of trailing zeroes.  When the table is valid, we emit a
> sequence using the target defined value for ctz (0):
>
> int ctz1 (unsigned x)
> {
>   static const char table[32] =
>     {
>       0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
>       31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
>     };
>
>   return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
> }
>
> Is optimized to:
>
>         rbit    w0, w0
>         clz     w0, w0
>         and     w0, w0, 31
>         ret
>
> Bootstrapped on AArch64. OK for commit?
>
> ChangeLog:
>
> 2019-11-15  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         PR tree-optimization/90838
>         * tree-ssa-forwprop.c (optimize_count_trailing_zeroes):
>         Add new function.
>         (simplify_count_trailing_zeroes): Add new function.
>         (pass_forwprop::execute): Try ctz simplification.
>         * match.pd: Add matching for ctz idioms.
>         * testsuite/gcc.target/aarch64/pr90838.c: New test.
> ---
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 6edf54b80012d87dbe7330f5ee638cdba2f9c099..479e9076f0d4deccda54425e93ee4567b85409aa 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -6060,3 +6060,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (simplify
>   (vec_perm vec_same_elem_p@0 @0 @1)
>   @0)
> +
> +/* Match count trailing zeroes for simplify_count_trailing_zeroes in fwprop.
> +   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> +   constant which when multiplied by a power of 2 contains a unique value
> +   in the top 5 or 6 bits.  This is then indexed into a table which maps it
> +   to the number of trailing zeroes.  */
> +(match (ctz_table_index @1 @2 @3)
> +  (rshift (mult (bit_and (negate @1) @1) INTEGER_CST@2) INTEGER_CST@3))

You need a :c on the bit_and

> diff --git a/gcc/testsuite/gcc.target/aarch64/pr90838.c b/gcc/testsuite/gcc.target/aarch64/pr90838.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..bff3144c0d1b3984016e5a404e986eae785c73ed
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr90838.c
> @@ -0,0 +1,64 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +int ctz1 (unsigned x)
> +{
> +  static const char table[32] =
> +    {
> +      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
> +      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
> +    };
> +
> +  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
> +}
> +
> +int ctz2 (unsigned x)
> +{
> +  const int u = 0;
> +  static short table[64] =
> +    {
> +      32, 0, 1,12, 2, 6, u,13, 3, u, 7, u, u, u, u,14,
> +      10, 4, u, u, 8, u, u,25, u, u, u, u, u,21,27,15,
> +      31,11, 5, u, u, u, u, u, 9, u, u,24, u, u,20,26,
> +      30, u, u, u, u,23, u,19,29, u,22,18,28,17,16, u
> +    };
> +
> +  x = (x & -x) * 0x0450FBAF;
> +  return table[x >> 26];
> +}
> +
> +int ctz3 (unsigned x)
> +{
> +  static int table[32] =
> +    {
> +      0, 1, 2,24, 3,19, 6,25, 22, 4,20,10,16, 7,12,26,
> +      31,23,18, 5,21, 9,15,11,30,17, 8,14,29,13,28,27
> +    };
> +
> +  if (x == 0) return 32;
> +  x = (x & -x) * 0x04D7651F;
> +  return table[x >> 27];
> +}
> +
> +static const unsigned long long magic = 0x03f08c5392f756cdULL;
> +
> +static const char table[64] = {
> +     0,  1, 12,  2, 13, 22, 17,  3,
> +    14, 33, 23, 36, 18, 58, 28,  4,
> +    62, 15, 34, 26, 24, 48, 50, 37,
> +    19, 55, 59, 52, 29, 44, 39,  5,
> +    63, 11, 21, 16, 32, 35, 57, 27,
> +    61, 25, 47, 49, 54, 51, 43, 38,
> +    10, 20, 31, 56, 60, 46, 53, 42,
> +     9, 30, 45, 41,  8, 40,  7,  6,
> +};
> +
> +int ctz4 (unsigned long x)
> +{
> +  unsigned long lsb = x & -x;
> +  return table[(lsb * magic) >> 58];
> +}
> +
> +/* { dg-final { scan-assembler-times "clz\t" 4 } } */
> +/* { dg-final { scan-assembler-times "and\t" 2 } } */
> +/* { dg-final { scan-assembler-not "cmp\t.*0" } } */
> diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
> index fe55ca958b49b986f79a9a710d92b5d906959105..a632d54712be55f8070c9816e3c3702d4a493182 100644
> --- a/gcc/tree-ssa-forwprop.c
> +++ b/gcc/tree-ssa-forwprop.c
> @@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "optabs-tree.h"
>  #include "tree-vector-builder.h"
>  #include "vec-perm-indices.h"
> +#include "internal-fn.h"
>
>  /* This pass propagates the RHS of assignment statements into use
>     sites of the LHS of the assignment.  It's basically a specialized
> @@ -1778,6 +1779,126 @@ simplify_rotate (gimple_stmt_iterator *gsi)
>    return true;
>  }
>
> +
> +/* Recognize count trailing zeroes idiom.
> +   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> +   constant which when multiplied by a power of 2 contains a unique value
> +   in the top 5 or 6 bits.  This is then indexed into a table which maps it
> +   to the number of trailing zeroes.  Array[0] is returned so the caller can
> +   emit an appropriate sequence depending on whether ctz (0) is defined on
> +   the target.  */
> +static bool
> +optimize_count_trailing_zeroes (tree type, tree array, tree x, tree mulc,
> +                               tree tshift, tree &zero_val)
> +{
> +  gcc_assert (TREE_CODE (mulc) == INTEGER_CST);
> +  gcc_assert (TREE_CODE (tshift) == INTEGER_CST);
> +
> +  tree input_type = TREE_TYPE (x);
> +
> +  if (!direct_internal_fn_supported_p (IFN_CTZ, input_type, OPTIMIZE_FOR_BOTH))
> +    return false;
> +
> +  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
> +  unsigned shiftval = tree_to_uhwi (tshift);
> +  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));

In the event that a __int128_t IFN_CTZ is supported the above might ICE with
too large constants so please either use wide-int ops or verify
tree_fits_{u,s}hwi () above before doing the conversions (the conversion from
TYPE_SIZE should always succeed though).
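
[Editorial aside, not part of the thread: the truncation hazard Richard describes can be seen outside GCC's tree machinery too. The sketch below uses the GCC/Clang `__int128` extension, and `narrow` is a made-up name; it shows why a tree_fits_{u,s}hwi () style guard belongs before the tree_to_uhwi calls.]

```c
#include <assert.h>
#include <stdint.h>

/* Narrowing a 128-bit value to 64 bits silently drops the upper half,
   the same hazard as calling tree_to_uhwi on a constant wider than
   HOST_WIDE_INT without checking tree_fits_uhwi_p first.  */
uint64_t narrow (unsigned __int128 v)
{
  return (uint64_t) v;
}
```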

> +  /* Check the array is not wider than integer type and the input is a 32-bit
> +     or 64-bit type.  The shift should extract the top 5..7 bits.  */
> +  if (TYPE_PRECISION (type) > 32)
> +    return false;
> +  if (input_bits != 32 && input_bits != 64)
> +    return false;
> +  if (shiftval < input_bits - 7 || shiftval > input_bits - 5)
> +    return false;
> +
> +  tree t = build4 (ARRAY_REF, type, array, size_int (0), NULL_TREE, NULL_TREE);
> +  t = fold_const_aggregate_ref (t);
> +  if (t == NULL)
> +    return false;
> +
> +  zero_val = build_int_cst (integer_type_node, tree_to_shwi (t));
> +
> +  for (unsigned i = 0; i < input_bits; i++, val <<= 1)
> +    {
> +      if (input_bits == 32)
> +       val &= 0xffffffff;
> +      t = build4 (ARRAY_REF, type, array, size_int ((int)(val >> shiftval)),
> +                 NULL_TREE, NULL_TREE);
> +      t = fold_const_aggregate_ref (t);
> +      if (t == NULL || tree_to_shwi (t) != i)
> +       return false;
> +    }

Hmm.  So this verifies that for a subset of all possible inputs the table
computes the correct value.

 a) how do we know this verification is exhaustive?
 b) we do this for every array access matching the pattern

I suggest you do

  tree ctor = ctor_for_folding (array);
  if (!ctor || TREE_CODE (ctor) != CONSTRUCTOR)
    return false;

and then perform the verification on the constructor elements directly.
That's a lot cheaper.  Watch out for array_ref_low_bound which you
don't get passed in here - thus pass in the ARRAY_REF, not the array.
I believe it's also wrong in your code above (probably miscompiles
a fortran equivalent of your testcase or fails verification/transform).

When you do the verification on the ctor_for_folding then you
can in theory lookup the varpool node for 'array' and cache
the verification result there.

> +
> +  return true;
> +}
> +
> +/* Match.pd function to match the ctz expression.  */
> +extern bool gimple_ctz_table_index (tree, tree *, tree (*)(tree));
> +
> +static bool
> +simplify_count_trailing_zeroes (gimple_stmt_iterator *gsi)
> +{
> +  gimple *stmt = gsi_stmt (*gsi);
> +  tree array_ref = gimple_assign_rhs1 (stmt);
> +  tree res_ops[3];
> +  tree zero_val;
> +
> +  gcc_checking_assert (TREE_CODE (array_ref) == ARRAY_REF);
> +
> +  if (!gimple_ctz_table_index (TREE_OPERAND (array_ref, 1), &res_ops[0], NULL))
> +    return false;
> +
> +  if (optimize_count_trailing_zeroes (TREE_TYPE (array_ref),
> +                                     TREE_OPERAND (array_ref, 0), res_ops[0],
> +                                     res_ops[1], res_ops[2], zero_val))
> +    {
> +      tree lhs = gimple_assign_lhs (stmt);
> +      tree type = TREE_TYPE (res_ops[0]);
> +      HOST_WIDE_INT val;
> +      HOST_WIDE_INT type_size = tree_to_shwi (TYPE_SIZE (type));
> +      bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), val);

since we're using the optab entry shouldn't you check for == 2 here?

> +      bool need_convert = !useless_type_conversion_p (TREE_TYPE (lhs),
> +                                                     integer_type_node);
> +
> +      gcall *call = gimple_build_call_internal (IFN_CTZ, 1, res_ops[0]);
> +      gimple_set_location (call, gimple_location (stmt));
> +      gimple_set_lhs (call, make_ssa_name (integer_type_node));
> +
> +      gimple *g = call;
> +      tree prev_lhs = gimple_call_lhs (call);
> +
> +      /* Emit ctz (x) & 31 if ctz (0) is 32 but we need to return 0.  */
> +      if (zero_ok && val == type_size && integer_zerop (zero_val))
> +       {
> +         gsi_insert_before (gsi, g, GSI_SAME_STMT);
> +         g = gimple_build_assign (make_ssa_name (integer_type_node),
> +                                  BIT_AND_EXPR, prev_lhs,
> +                                  build_int_cst (integer_type_node,
> +                                                 type_size - 1));
> +         gimple_set_location (g, gimple_location (stmt));
> +         prev_lhs = gimple_assign_lhs (g);
> +       }
> +      else if (!zero_ok || tree_to_shwi (zero_val) != val)
> +       return false;

Please check this before building the call.

> +
> +      if (need_convert)
> +       {
> +         gsi_insert_before (gsi, g, GSI_SAME_STMT);
> +         g = gimple_build_assign (lhs, NOP_EXPR, prev_lhs);
> +       }
> +      else
> +       gimple_set_lhs (g, lhs);

For all of the above using gimple_build () style stmt building and
a final gsi_replace_with_seq would be more straight-forward.

> +      gsi_replace (gsi, g, false);
> +      return true;
> +    }
> +
> +  return false;
> +}
> +
> +
>  /* Combine an element access with a shuffle.  Returns true if there were
>     any changes made, else it returns false.  */
>
> @@ -2759,6 +2880,8 @@ pass_forwprop::execute (function *fun)
>                     else if (code == CONSTRUCTOR
>                              && TREE_CODE (TREE_TYPE (rhs1)) == VECTOR_TYPE)
>                       changed = simplify_vector_constructor (&gsi);
> +                   else if (code == ARRAY_REF)
> +                     changed = simplify_count_trailing_zeroes (&gsi);
>                     break;
>                   }
>
Wilco Dijkstra Dec. 11, 2019, 4:55 p.m. UTC | #7
Hi Richard,

>> +(match (ctz_table_index @1 @2 @3)
>> +  (rshift (mult (bit_and (negate @1) @1) INTEGER_CST@2) INTEGER_CST@3))
>
> You need a :c on the bit_and

Fixed.

> +  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
> +  unsigned shiftval = tree_to_uhwi (tshift);
> +  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));

> In the event that a __int128_t IFN_CTZ is supported the above might ICE with
> too large constants so please either use wide-int ops or verify
> tree_fits_{u,s}hwi () above before doing the conversions (the conversion from
> TYPE_SIZE should always succeed though).

I've moved the initialization of val much later so we have done all the checks and
know for sure that mulc will fit in a HOST_WIDE_INT.

> Hmm.  So this verifies that for a subset of all possible inputs the table
> computes the correct value.
>
> a) how do we know this verification is exhaustive?
> b) we do this for every array access matching the pattern

It checks all the values that matter, which is the number of bits plus the special
handling of ctz(0). An array may contain entries which can never be referenced
(see ctz2() in the testcase), so we don't care what the value is in those cases.
Very few accesses can match the pattern given it is very specific and there are
many checks before it tries to check the contents of the array.
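
[Editorial sketch to make the counting argument concrete, not the patch code: `check_table` is a hypothetical standalone name modelled on the byte-table case for 32-bit inputs. Each of the `bits` possible shift amounts must find a table entry that maps back to its own index; entries that never match (the don't-care slots) are simply ignored.]

```c
#include <assert.h>

/* Verify a candidate ctz table: entry i is accepted when shifting the
   magic constant left by p[i] reproduces index i in the top bits.
   Exactly `bits` entries must match; unused entries never do.  */
int check_table (const unsigned char *p, unsigned len,
                 unsigned mulc, unsigned shift, unsigned bits)
{
  unsigned mask = (1u << (bits - shift)) - 1;
  unsigned matched = 0;

  for (unsigned i = 0; i < len; i++)
    if (p[i] < bits && (((mulc << p[i]) >> shift) & mask) == i)
      matched++;

  return matched == bits;
}
```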

> I suggest you do
>  tree ctor = ctor_for_folding (array);
>  if (!ctor || TREE_CODE (ctor) != CONSTRUCTOR)
>    return false;
>
> and then perform the verification on the constructor elements directly.
> That's a lot cheaper.  Watch out for array_ref_low_bound which you
> don't get passed in here - thus pass in the ARRAY_REF, not the array.
>
> I believe it's also wrong in your code above (probably miscompiles
> a fortran equivalent of your testcase or fails verification/transform).
>
> When you do the verification on the ctor_for_folding then you
> can in theory lookup the varpool node for 'array' and cache
> the verification result there.

I've updated it to use the ctor, but it meant adding another code path to
handle string literals. It's not clear how the array_ref_low_bound affects the
initializer, but I now reject it if it is non-zero.

>> +      tree lhs = gimple_assign_lhs (stmt);
>> +      bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), val);
>
> since we're using the optab entry shouldn't you check for == 2 here?

Yes, that looks more correct (it's not clear what 1 means exactly).

> Please check this before building the call.

I've reordered the checks so it returns before it builds any gimple if it cannot do
the transformation.

> For all of the above using gimple_build () style stmt building and
> a final gsi_replace_with_seq would be more straight-forward.

I've changed that, but it meant always inserting the nop convert, otherwise
it does not make the code easier to follow.

Cheers,
Wilco


[PATCH v3] PR90838: Support ctz idioms

v3: Directly walk over the array initializer and other tweaks based on review.
v2: Use fwprop pass rather than match.pd

Support common idioms for count trailing zeroes using an array lookup.
The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
constant which when multiplied by a power of 2 contains a unique value
in the top 5 or 6 bits.  This is then indexed into a table which maps it
to the number of trailing zeroes.  When the table is valid, we emit a
sequence using the target defined value for ctz (0):

int ctz1 (unsigned x)
{
  static const char table[32] =
    {
      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };

  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}

Is optimized to:

	rbit	w0, w0
	clz	w0, w0
	and	w0, w0, 31
	ret

Bootstrapped on AArch64. OK for commit?

ChangeLog:

2019-12-11  Wilco Dijkstra  <wdijkstr@arm.com>

        PR tree-optimization/90838
        * tree-ssa-forwprop.c (check_ctz_array): Add new function.
        (check_ctz_string): Likewise.
        (optimize_count_trailing_zeroes): Likewise.
        (simplify_count_trailing_zeroes): Likewise.
        (pass_forwprop::execute): Try ctz simplification.
        * match.pd: Add matching for ctz idioms.
        * testsuite/gcc.target/aarch64/pr90838.c: New test.

--
diff --git a/gcc/match.pd b/gcc/match.pd
index 3b7a5ce4e9a4de4f983ccdc696ad406a7932c08c..410cd6eaae0cdc9de7e01d5496de0595b7ea15ba 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -6116,3 +6116,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 (simplify
  (vec_perm vec_same_elem_p@0 @0 @1)
  @0)
+
+/* Match count trailing zeroes for simplify_count_trailing_zeroes in fwprop.
+   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
+   constant which when multiplied by a power of 2 contains a unique value
+   in the top 5 or 6 bits.  This is then indexed into a table which maps it
+   to the number of trailing zeroes.  */
+(match (ctz_table_index @1 @2 @3)
+  (rshift (mult (bit_and:c (negate @1) @1) INTEGER_CST@2) INTEGER_CST@3))
diff --git a/gcc/testsuite/gcc.target/aarch64/pr90838.c b/gcc/testsuite/gcc.target/aarch64/pr90838.c
new file mode 100644
index 0000000000000000000000000000000000000000..bff3144c0d1b3984016e5a404e986eae785c73ed
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr90838.c
@@ -0,0 +1,64 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int ctz1 (unsigned x)
+{
+  static const char table[32] =
+    {
+      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
+      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
+    };
+
+  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
+}
+
+int ctz2 (unsigned x)
+{
+  const int u = 0;
+  static short table[64] =
+    {
+      32, 0, 1,12, 2, 6, u,13, 3, u, 7, u, u, u, u,14,
+      10, 4, u, u, 8, u, u,25, u, u, u, u, u,21,27,15,
+      31,11, 5, u, u, u, u, u, 9, u, u,24, u, u,20,26,
+      30, u, u, u, u,23, u,19,29, u,22,18,28,17,16, u
+    };
+
+  x = (x & -x) * 0x0450FBAF;
+  return table[x >> 26];
+}
+
+int ctz3 (unsigned x)
+{
+  static int table[32] =
+    {
+      0, 1, 2,24, 3,19, 6,25, 22, 4,20,10,16, 7,12,26,
+      31,23,18, 5,21, 9,15,11,30,17, 8,14,29,13,28,27
+    };
+
+  if (x == 0) return 32;
+  x = (x & -x) * 0x04D7651F;
+  return table[x >> 27];
+}
+
+static const unsigned long long magic = 0x03f08c5392f756cdULL;
+
+static const char table[64] = {
+     0,  1, 12,  2, 13, 22, 17,  3,
+    14, 33, 23, 36, 18, 58, 28,  4,
+    62, 15, 34, 26, 24, 48, 50, 37,
+    19, 55, 59, 52, 29, 44, 39,  5,
+    63, 11, 21, 16, 32, 35, 57, 27,
+    61, 25, 47, 49, 54, 51, 43, 38,
+    10, 20, 31, 56, 60, 46, 53, 42,
+     9, 30, 45, 41,  8, 40,  7,  6,
+};
+
+int ctz4 (unsigned long x)
+{
+  unsigned long lsb = x & -x;
+  return table[(lsb * magic) >> 58];
+}
+
+/* { dg-final { scan-assembler-times "clz\t" 4 } } */
+/* { dg-final { scan-assembler-times "and\t" 2 } } */
+/* { dg-final { scan-assembler-not "cmp\t.*0" } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index 4a831242c0e1418d89523aaee0eef900d8fcc3bb..219fa158fa7bc373a63ede3853781c35c36948bc 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -48,6 +48,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "optabs-tree.h"
 #include "tree-vector-builder.h"
 #include "vec-perm-indices.h"
+#include "internal-fn.h"
+#include "cgraph.h"
 
 /* This pass propagates the RHS of assignment statements into use
    sites of the LHS of the assignment.  It's basically a specialized
@@ -1778,6 +1780,184 @@ simplify_rotate (gimple_stmt_iterator *gsi)
   return true;
 }
 
+
+/* Check whether an array contains a valid ctz table.  */
+static bool
+check_ctz_array (tree ctor, unsigned HOST_WIDE_INT mulc,
+		 HOST_WIDE_INT &zero_val, unsigned shift, unsigned bits)
+{
+  tree elt, idx;
+  unsigned HOST_WIDE_INT i;
+  unsigned mask = (1 << (bits - shift)) - 1;
+  unsigned matched = 0;
+
+  zero_val = 0;
+
+  FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (ctor), i, idx, elt)
+    {
+      if (TREE_CODE (idx) != INTEGER_CST || TREE_CODE (elt) != INTEGER_CST)
+	return false;
+      if (i > bits * 2)
+	return false;
+
+      unsigned HOST_WIDE_INT index = tree_to_shwi (idx);
+      HOST_WIDE_INT val = tree_to_shwi (elt);
+
+      if (index == 0)
+	{
+	  zero_val = val;
+	  matched++;
+	}
+
+      if (val >= 0 && val < bits && (((mulc << val) >> shift) & mask) == index)
+	matched++;
+
+      if (matched > bits)
+	return true;
+    }
+
+  return false;
+}
+
+/* Check whether a string contains a valid ctz table.  */
+static bool
+check_ctz_string (tree string, unsigned HOST_WIDE_INT mulc,
+		  HOST_WIDE_INT &zero_val, unsigned shift, unsigned bits)
+{
+  unsigned HOST_WIDE_INT len = TREE_STRING_LENGTH (string);
+  unsigned mask = (1 << (bits - shift)) - 1;
+  unsigned matched = 0;
+  const unsigned char *p = (const unsigned char *) TREE_STRING_POINTER (string);
+
+  if (len < bits || len > bits * 2)
+    return false;
+
+  zero_val = p[0];
+
+  for (unsigned i = 0; i < len; i++)
+    if (p[i] < bits && (((mulc << p[i]) >> shift) & mask) == i)
+      matched++;
+
+  return matched == bits;
+}
+
+/* Recognize count trailing zeroes idiom.
+   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
+   constant which when multiplied by a power of 2 contains a unique value
+   in the top 5 or 6 bits.  This is then indexed into a table which maps it
+   to the number of trailing zeroes.  Array[0] is returned so the caller can
+   emit an appropriate sequence depending on whether ctz (0) is defined on
+   the target.  */
+static bool
+optimize_count_trailing_zeroes (tree array_ref, tree x, tree mulc,
+				tree tshift, HOST_WIDE_INT &zero_val)
+{
+  tree type = TREE_TYPE (array_ref);
+  tree array = TREE_OPERAND (array_ref, 0);
+
+  gcc_assert (TREE_CODE (mulc) == INTEGER_CST);
+  gcc_assert (TREE_CODE (tshift) == INTEGER_CST);
+
+  tree input_type = TREE_TYPE (x);
+  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));
+
+  /* Check the array is not wider than integer type and the input is a 32-bit
+     or 64-bit type.  */
+  if (TYPE_PRECISION (type) > 32)
+    return false;
+  if (input_bits != 32 && input_bits != 64)
+    return false;
+
+  if (!direct_internal_fn_supported_p (IFN_CTZ, input_type, OPTIMIZE_FOR_BOTH))
+    return false;
+
+  /* Check the lower bound of the array is zero.  */
+  tree low = array_ref_low_bound (array_ref);
+  if (!low || !integer_zerop (low))
+    return false;
+
+  unsigned shiftval = tree_to_uhwi (tshift);
+
+  /* Check the shift extracts the top 5..7 bits.  */
+  if (shiftval < input_bits - 7 || shiftval > input_bits - 5)
+    return false;
+
+  tree ctor = ctor_for_folding (array);
+  if (!ctor)
+    return false;
+
+  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
+
+  if (TREE_CODE (ctor) == CONSTRUCTOR)
+    return check_ctz_array (ctor, val, zero_val, shiftval, input_bits);
+  else if (TREE_CODE (ctor) == STRING_CST)
+    return check_ctz_string (ctor, val, zero_val, shiftval, input_bits);
+
+  return false;
+}
+
+/* Match.pd function to match the ctz expression.  */
+extern bool gimple_ctz_table_index (tree, tree *, tree (*)(tree));
+
+static bool
+simplify_count_trailing_zeroes (gimple_stmt_iterator *gsi)
+{
+  gimple *stmt = gsi_stmt (*gsi);
+  tree array_ref = gimple_assign_rhs1 (stmt);
+  tree res_ops[3];
+  HOST_WIDE_INT zero_val;
+
+  gcc_checking_assert (TREE_CODE (array_ref) == ARRAY_REF);
+
+  if (!gimple_ctz_table_index (TREE_OPERAND (array_ref, 1), &res_ops[0], NULL))
+    return false;
+
+  if (optimize_count_trailing_zeroes (array_ref, res_ops[0],
+				      res_ops[1], res_ops[2], zero_val))
+    {
+      tree type = TREE_TYPE (res_ops[0]);
+      HOST_WIDE_INT ctzval;
+      HOST_WIDE_INT type_size = tree_to_shwi (TYPE_SIZE (type));
+      bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval) == 2;
+
+      /* Skip if there is no value defined at zero, or if we can't easily
+	 return the correct value for zero.  */
+      if (!zero_ok)
+	return false;
+      if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size))
+	return false;
+
+      gimple_seq seq = NULL;
+      gimple *g;
+      gcall *call = gimple_build_call_internal (IFN_CTZ, 1, res_ops[0]);
+      gimple_set_location (call, gimple_location (stmt));
+      gimple_set_lhs (call, make_ssa_name (integer_type_node));
+      gimple_seq_add_stmt (&seq, call);
+
+      tree prev_lhs = gimple_call_lhs (call);
+
+      /* Emit ctz (x) & 31 if ctz (0) is 32 but we need to return 0.  */
+      if (zero_val == 0 && ctzval == type_size)
+	{
+	  g = gimple_build_assign (make_ssa_name (integer_type_node),
+				   BIT_AND_EXPR, prev_lhs,
+				   build_int_cst (integer_type_node,
+						  type_size - 1));
+	  gimple_set_location (g, gimple_location (stmt));
+	  gimple_seq_add_stmt (&seq, g);
+	  prev_lhs = gimple_assign_lhs (g);
+	}
+
+      g = gimple_build_assign (gimple_assign_lhs (stmt), NOP_EXPR, prev_lhs);
+      gimple_seq_add_stmt (&seq, g);
+      gsi_replace_with_seq (gsi, seq, true);
+      return true;
+    }
+
+  return false;
+}
+
+
 /* Combine an element access with a shuffle.  Returns true if there were
    any changes made, else it returns false.  */
  
@@ -2867,6 +3047,8 @@ pass_forwprop::execute (function *fun)
 		    else if (code == CONSTRUCTOR
 			     && TREE_CODE (TREE_TYPE (rhs1)) == VECTOR_TYPE)
 		      changed = simplify_vector_constructor (&gsi);
+		    else if (code == ARRAY_REF)
+		      changed = simplify_count_trailing_zeroes (&gsi);
 		    break;
 		  }
Richard Biener Jan. 9, 2020, 1:26 p.m. UTC | #8
On Wed, Dec 11, 2019 at 5:55 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi Richard,
>
> >> +(match (ctz_table_index @1 @2 @3)
> >> +  (rshift (mult (bit_and (negate @1) @1) INTEGER_CST@2) INTEGER_CST@3))
> >
> > You need a :c on the bit_and
>
> Fixed.
>
> > +  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
> > +  unsigned shiftval = tree_to_uhwi (tshift);
> > +  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));
>
> > In the event that a __int128_t IFN_CTZ is supported the above might ICE with
> > too large constants so please either use wide-int ops or verify
> > tree_fits_{u,s}hwi () above before doing the conversions (the conversion from
> > TYPE_SIZE should always succeed though).
>
> I've moved the initialization of val much later so we have done all the checks and
> know for sure that mulc will fit in a HOST_WIDE_INT.
>
> > Hmm.  So this verifies that for a subset of all possible inputs the table
> > computes the correct value.
> >
> > a) how do we know this verification is exhaustive?
> > b) we do this for every array access matching the pattern
>
> It checks all the values that matter, which is the number of bits plus the special
> handling of ctz(0). An array may contain entries which can never be referenced
> (see ctz2() in the testcase), so we don't care what the value is in those cases.
> Very few accesses can match the pattern given it is very specific and there are
> many checks before it tries to check the contents of the array.
>
> > I suggest you do
> >  tree ctor = ctor_for_folding (array);
> >  if (!ctor || TREE_CODE (ctor) != CONSTRUCTOR)
> >    return false;
> >
> > and then perform the verification on the constructor elements directly.
> > That's a lot cheaper.  Watch out for array_ref_low_bound which you
> > don't get passed in here - thus pass in the ARRAY_REF, not the array.
> >
> > I believe it's also wrong in your code above (probably miscompiles
> > a fortran equivalent of your testcase or fails verification/transform).
> >
> > When you do the verification on the ctor_for_folding then you
> > can in theory lookup the varpool node for 'array' and cache
> > the verification result there.
>
> I've updated it to use the ctor, but it meant adding another code path to
> handle string literals. It's not clear how the array_ref_low_bound affects the
> initializer, but I now reject it if it is non-zero.
>
> >> +      tree lhs = gimple_assign_lhs (stmt);
> >> +      bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), val);
> >
> > since we're using the optab entry shouldn't you check for == 2 here?
>
> Yes, that looks more correct (it's not clear what 1 means exactly).
>
> > Please check this before building the call.
>
> I've reordered the checks so it returns before it builds any gimple if it cannot do
> the transformation.
>
> > For all of the above using gimple_build () style stmt building and
> > a final gsi_replace_with_seq would be more straight-forward.
>
> I've changed that, but it meant always inserting the nop convert, otherwise
> it does not make the code easier to follow.

OK.

Thanks,
Richard.

> Cheers,
> Wilco
>
>
> [PATCH v3] PR90838: Support ctz idioms
>
> v3: Directly walk over the array initializer and other tweaks based on review.
> v2: Use fwprop pass rather than match.pd
>
> Support common idioms for count trailing zeroes using an array lookup.
> The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> constant which when multiplied by a power of 2 contains a unique value
> in the top 5 or 6 bits.  This is then indexed into a table which maps it
> to the number of trailing zeroes.  When the table is valid, we emit a
> sequence using the target defined value for ctz (0):
>
> int ctz1 (unsigned x)
> {
>   static const char table[32] =
>     {
>       0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
>       31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
>     };
>
>   return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
> }
>
> Is optimized to:
>
>         rbit    w0, w0
>         clz     w0, w0
>         and     w0, w0, 31
>         ret
>
> Bootstrapped on AArch64. OK for commit?
>
> ChangeLog:
>
> 2019-12-11  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         PR tree-optimization/90838
>         * tree-ssa-forwprop.c (check_ctz_array): Add new function.
>         (check_ctz_string): Likewise.
>         (optimize_count_trailing_zeroes): Likewise.
>         (simplify_count_trailing_zeroes): Likewise.
>         (pass_forwprop::execute): Try ctz simplification.
>         * match.pd: Add matching for ctz idioms.
>         * testsuite/gcc.target/aarch64/pr90838.c: New test.
>
> --
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 3b7a5ce4e9a4de4f983ccdc696ad406a7932c08c..410cd6eaae0cdc9de7e01d5496de0595b7ea15ba 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -6116,3 +6116,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (simplify
>   (vec_perm vec_same_elem_p@0 @0 @1)
>   @0)
> +
> +/* Match count trailing zeroes for simplify_count_trailing_zeroes in fwprop.
> +   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> +   constant which when multiplied by a power of 2 contains a unique value
> +   in the top 5 or 6 bits.  This is then indexed into a table which maps it
> +   to the number of trailing zeroes.  */
> +(match (ctz_table_index @1 @2 @3)
> +  (rshift (mult (bit_and:c (negate @1) @1) INTEGER_CST@2) INTEGER_CST@3))
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr90838.c b/gcc/testsuite/gcc.target/aarch64/pr90838.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..bff3144c0d1b3984016e5a404e986eae785c73ed
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr90838.c
> @@ -0,0 +1,64 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +int ctz1 (unsigned x)
> +{
> +  static const char table[32] =
> +    {
> +      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
> +      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
> +    };
> +
> +  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
> +}
> +
> +int ctz2 (unsigned x)
> +{
> +  const int u = 0;
> +  static short table[64] =
> +    {
> +      32, 0, 1,12, 2, 6, u,13, 3, u, 7, u, u, u, u,14,
> +      10, 4, u, u, 8, u, u,25, u, u, u, u, u,21,27,15,
> +      31,11, 5, u, u, u, u, u, 9, u, u,24, u, u,20,26,
> +      30, u, u, u, u,23, u,19,29, u,22,18,28,17,16, u
> +    };
> +
> +  x = (x & -x) * 0x0450FBAF;
> +  return table[x >> 26];
> +}
> +
> +int ctz3 (unsigned x)
> +{
> +  static int table[32] =
> +    {
> +      0, 1, 2,24, 3,19, 6,25, 22, 4,20,10,16, 7,12,26,
> +      31,23,18, 5,21, 9,15,11,30,17, 8,14,29,13,28,27
> +    };
> +
> +  if (x == 0) return 32;
> +  x = (x & -x) * 0x04D7651F;
> +  return table[x >> 27];
> +}
> +
> +static const unsigned long long magic = 0x03f08c5392f756cdULL;
> +
> +static const char table[64] = {
> +     0,  1, 12,  2, 13, 22, 17,  3,
> +    14, 33, 23, 36, 18, 58, 28,  4,
> +    62, 15, 34, 26, 24, 48, 50, 37,
> +    19, 55, 59, 52, 29, 44, 39,  5,
> +    63, 11, 21, 16, 32, 35, 57, 27,
> +    61, 25, 47, 49, 54, 51, 43, 38,
> +    10, 20, 31, 56, 60, 46, 53, 42,
> +     9, 30, 45, 41,  8, 40,  7,  6,
> +};
> +
> +int ctz4 (unsigned long x)
> +{
> +  unsigned long lsb = x & -x;
> +  return table[(lsb * magic) >> 58];
> +}
> +
> +/* { dg-final { scan-assembler-times "clz\t" 4 } } */
> +/* { dg-final { scan-assembler-times "and\t" 2 } } */
> +/* { dg-final { scan-assembler-not "cmp\t.*0" } } */
> diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
> index 4a831242c0e1418d89523aaee0eef900d8fcc3bb..219fa158fa7bc373a63ede3853781c35c36948bc 100644
> --- a/gcc/tree-ssa-forwprop.c
> +++ b/gcc/tree-ssa-forwprop.c
> @@ -48,6 +48,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "optabs-tree.h"
>  #include "tree-vector-builder.h"
>  #include "vec-perm-indices.h"
> +#include "internal-fn.h"
> +#include "cgraph.h"
>
>  /* This pass propagates the RHS of assignment statements into use
>     sites of the LHS of the assignment.  It's basically a specialized
> @@ -1778,6 +1780,184 @@ simplify_rotate (gimple_stmt_iterator *gsi)
>    return true;
>  }
>
> +
> +/* Check whether an array contains a valid ctz table.  */
> +static bool
> +check_ctz_array (tree ctor, unsigned HOST_WIDE_INT mulc,
> +                HOST_WIDE_INT &zero_val, unsigned shift, unsigned bits)
> +{
> +  tree elt, idx;
> +  unsigned HOST_WIDE_INT i;
> +  unsigned mask = (1 << (bits - shift)) - 1;
> +  unsigned matched = 0;
> +
> +  zero_val = 0;
> +
> +  FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (ctor), i, idx, elt)
> +    {
> +      if (TREE_CODE (idx) != INTEGER_CST || TREE_CODE (elt) != INTEGER_CST)
> +       return false;
> +      if (i > bits * 2)
> +       return false;
> +
> +      unsigned HOST_WIDE_INT index = tree_to_shwi (idx);
> +      HOST_WIDE_INT val = tree_to_shwi (elt);
> +
> +      if (index == 0)
> +       {
> +         zero_val = val;
> +         matched++;
> +       }
> +
> +      if (val >= 0 && val < bits && (((mulc << val) >> shift) & mask) == index)
> +       matched++;
> +
> +      if (matched > bits)
> +       return true;
> +    }
> +
> +  return false;
> +}
> +
> +/* Check whether a string contains a valid ctz table.  */
> +static bool
> +check_ctz_string (tree string, unsigned HOST_WIDE_INT mulc,
> +                 HOST_WIDE_INT &zero_val, unsigned shift, unsigned bits)
> +{
> +  unsigned HOST_WIDE_INT len = TREE_STRING_LENGTH (string);
> +  unsigned mask = (1 << (bits - shift)) - 1;
> +  unsigned matched = 0;
> +  const unsigned char *p = (const unsigned char *) TREE_STRING_POINTER (string);
> +
> +  if (len < bits || len > bits * 2)
> +    return false;
> +
> +  zero_val = p[0];
> +
> +  for (unsigned i = 0; i < len; i++)
> +    if (p[i] < bits && (((mulc << p[i]) >> shift) & mask) == i)
> +      matched++;
> +
> +  return matched == bits;
> +}
> +
> +/* Recognize count trailing zeroes idiom.
> +   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> +   constant which when multiplied by a power of 2 contains a unique value
> +   in the top 5 or 6 bits.  This is then indexed into a table which maps it
> +   to the number of trailing zeroes.  Array[0] is returned so the caller can
> +   emit an appropriate sequence depending on whether ctz (0) is defined on
> +   the target.  */
> +static bool
> +optimize_count_trailing_zeroes (tree array_ref, tree x, tree mulc,
> +                               tree tshift, HOST_WIDE_INT &zero_val)
> +{
> +  tree type = TREE_TYPE (array_ref);
> +  tree array = TREE_OPERAND (array_ref, 0);
> +
> +  gcc_assert (TREE_CODE (mulc) == INTEGER_CST);
> +  gcc_assert (TREE_CODE (tshift) == INTEGER_CST);
> +
> +  tree input_type = TREE_TYPE (x);
> +  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));
> +
> +  /* Check the array is not wider than integer type and the input is a 32-bit
> +     or 64-bit type.  */
> +  if (TYPE_PRECISION (type) > 32)
> +    return false;
> +  if (input_bits != 32 && input_bits != 64)
> +    return false;
> +
> +  if (!direct_internal_fn_supported_p (IFN_CTZ, input_type, OPTIMIZE_FOR_BOTH))
> +    return false;
> +
> +  /* Check the lower bound of the array is zero.  */
> +  tree low = array_ref_low_bound (array_ref);
> +  if (!low || !integer_zerop (low))
> +    return false;
> +
> +  unsigned shiftval = tree_to_uhwi (tshift);
> +
> +  /* Check the shift extracts the top 5..7 bits.  */
> +  if (shiftval < input_bits - 7 || shiftval > input_bits - 5)
> +    return false;
> +
> +  tree ctor = ctor_for_folding (array);
> +  if (!ctor)
> +    return false;
> +
> +  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
> +
> +  if (TREE_CODE (ctor) == CONSTRUCTOR)
> +    return check_ctz_array (ctor, val, zero_val, shiftval, input_bits);
> +  else if (TREE_CODE (ctor) == STRING_CST)
> +    return check_ctz_string (ctor, val, zero_val, shiftval, input_bits);
> +
> +  return false;
> +}
> +
> +/* Match.pd function to match the ctz expression.  */
> +extern bool gimple_ctz_table_index (tree, tree *, tree (*)(tree));
> +
> +static bool
> +simplify_count_trailing_zeroes (gimple_stmt_iterator *gsi)
> +{
> +  gimple *stmt = gsi_stmt (*gsi);
> +  tree array_ref = gimple_assign_rhs1 (stmt);
> +  tree res_ops[3];
> +  HOST_WIDE_INT zero_val;
> +
> +  gcc_checking_assert (TREE_CODE (array_ref) == ARRAY_REF);
> +
> +  if (!gimple_ctz_table_index (TREE_OPERAND (array_ref, 1), &res_ops[0], NULL))
> +    return false;
> +
> +  if (optimize_count_trailing_zeroes (array_ref, res_ops[0],
> +                                     res_ops[1], res_ops[2], zero_val))
> +    {
> +      tree type = TREE_TYPE (res_ops[0]);
> +      HOST_WIDE_INT ctzval;
> +      HOST_WIDE_INT type_size = tree_to_shwi (TYPE_SIZE (type));
> +      bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval) == 2;
> +
> +      /* Skip if there is no value defined at zero, or if we can't easily
> +        return the correct value for zero.  */
> +      if (!zero_ok)
> +       return false;
> +      if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size))
> +       return false;
> +
> +      gimple_seq seq = NULL;
> +      gimple *g;
> +      gcall *call = gimple_build_call_internal (IFN_CTZ, 1, res_ops[0]);
> +      gimple_set_location (call, gimple_location (stmt));
> +      gimple_set_lhs (call, make_ssa_name (integer_type_node));
> +      gimple_seq_add_stmt (&seq, call);
> +
> +      tree prev_lhs = gimple_call_lhs (call);
> +
> +      /* Emit ctz (x) & 31 if ctz (0) is 32 but we need to return 0.  */
> +      if (zero_val == 0 && ctzval == type_size)
> +       {
> +         g = gimple_build_assign (make_ssa_name (integer_type_node),
> +                                  BIT_AND_EXPR, prev_lhs,
> +                                  build_int_cst (integer_type_node,
> +                                                 type_size - 1));
> +         gimple_set_location (g, gimple_location (stmt));
> +         gimple_seq_add_stmt (&seq, g);
> +         prev_lhs = gimple_assign_lhs (g);
> +       }
> +
> +      g = gimple_build_assign (gimple_assign_lhs (stmt), NOP_EXPR, prev_lhs);
> +      gimple_seq_add_stmt (&seq, g);
> +      gsi_replace_with_seq (gsi, seq, true);
> +      return true;
> +    }
> +
> +  return false;
> +}
> +
> +
>  /* Combine an element access with a shuffle.  Returns true if there were
>     any changes made, else it returns false.  */
>
> @@ -2867,6 +3047,8 @@ pass_forwprop::execute (function *fun)
>                     else if (code == CONSTRUCTOR
>                              && TREE_CODE (TREE_TYPE (rhs1)) == VECTOR_TYPE)
>                       changed = simplify_vector_constructor (&gsi);
> +                   else if (code == ARRAY_REF)
> +                     changed = simplify_count_trailing_zeroes (&gsi);
>                     break;
>                   }
>
Jakub Jelinek Jan. 10, 2020, 9:35 p.m. UTC | #9
On Thu, Jan 09, 2020 at 02:26:10PM +0100, Richard Biener wrote:
> > >> +      tree lhs = gimple_assign_lhs (stmt);
> > >> +      bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), val);
> > >
> > > since we're using the optab entry shouldn't you check for == 2 here?
> >
> > Yes, that looks more correct (it's not clear what 1 means exactly).

1 is what e.g. x86 uses with -mbmi, and I think it is what most targets
actually define if they have any defined value there, exception is
aarch64/arm, powerpc, gcn, nvptx and mips.
Given:
int foo (int x) { return __builtin_ctz (x); }
int bar (int x) { return x ? __builtin_ctz (x) : 32; }
on x86 with -mbmi, where CTZ_DEFINED_VALUE_AT_ZERO is 1 and the value is
type_size, we emit the same code.  We'd need to find out if that is something
done for all targets or just a few, and if it holds for all that have
CTZ_DEFINED_VALUE_AT_ZERO 1, we could perhaps just emit branchy code and wait
for RTL to fix that up.

That said, perhaps we should also check if the argument isn't known to be
non-zero from VRP, in that case we wouldn't have to bother with the zero
value stuff at all.

	Jakub
Segher Boessenkool Jan. 11, 2020, 7:20 p.m. UTC | #10
Hi!

On Fri, Jan 10, 2020 at 10:35:04PM +0100, Jakub Jelinek wrote:
> On Thu, Jan 09, 2020 at 02:26:10PM +0100, Richard Biener wrote:
> > > >> +      tree lhs = gimple_assign_lhs (stmt);
> > > >> +      bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), val);
> > > >
> > > > since we're using the optab entry shouldn't you check for == 2 here?
> > >
> > > Yes, that looks more correct (it's not clear what 1 means exactly).
> 
> 1 is what e.g. x86 uses with -mbmi, and I think it is what most targets
> actually define if they have any defined value there, exception is
> aarch64/arm, powerpc, gcn, nvptx and mips.

1 vs. 2 only makes a difference when expanding ffs, it seems.

> Given:
> int foo (int x) { return __builtin_ctz (x); }
> int bar (int x) { return x ? __builtin_ctz (x) : 32; }
> on x86 with -mbmi, where CTZ_DEFINED_VALUE_AT_ZERO is 1 and the value is
> type_size, we emit the same code.  We'd need to find out if that is something
> done for all targets or just a few,

It doesn't do that for rs6000 (not for clz either, which is easier for us,
the expansion of ctz depends on multiple factors).  (Note that "foo" has
UB whenever x == 0, btw. -- it would be nice if something like "bar" would
generate the optimal code as well.  Is there some other code that will just
work?)

> and if for all that have CTZ_DEFINED_VALUE_AT_ZERO 1,
> we could perhaps just emit branchy code and wait for RTL to fix that up.

Where would RTL fix that?  In what pass, I mean.


Segher
Patch

diff --git a/gcc/generic-match-head.c b/gcc/generic-match-head.c
index fdc603977fc5b03a843944f75ce262f5d2256308..5a38bd233585225d60f0159c9042a16d9fdc9d80 100644
--- a/gcc/generic-match-head.c
+++ b/gcc/generic-match-head.c
@@ -88,3 +88,10 @@  optimize_successive_divisions_p (tree, tree)
 {
   return false;
 }
+
+static bool
+optimize_count_trailing_zeroes (tree type, tree array_ref, tree input,
+				tree mulc, tree shift, tree &zero_val)
+{
+  return false;
+}
diff --git a/gcc/gimple-match-head.c b/gcc/gimple-match-head.c
index 53278168a59f5ac10ce6760f04fd42589a0792e7..2d3b305f8ea54e4ca31c64994af30b34bb7eff09 100644
--- a/gcc/gimple-match-head.c
+++ b/gcc/gimple-match-head.c
@@ -909,6 +909,24 @@  gimple_simplify (gimple *stmt, gimple_match_op *res_op, gimple_seq *seq,
 		res_op->set_op (TREE_CODE (op0), type, valueized);
 		return true;
 	      }
+	    else if (code == ARRAY_REF)
+	      {
+		tree rhs1 = gimple_assign_rhs1 (stmt);
+		tree op1 = TREE_OPERAND (rhs1, 1);
+		tree op2 = TREE_OPERAND (rhs1, 2);
+		tree op3 = TREE_OPERAND (rhs1, 3);
+		tree op0 = TREE_OPERAND (rhs1, 0);
+		bool valueized = false;
+
+		op0 = do_valueize (op0, top_valueize, valueized);
+		op1 = do_valueize (op1, top_valueize, valueized);
+
+		if (op2 && op3)
+		  res_op->set_op (code, type, op0, op1, op2, op3);
+		else
+		  res_op->set_op (code, type, op0, op1);
+		return gimple_resimplify4 (seq, res_op, valueize) || valueized;
+	      }
 	    break;
 	  case GIMPLE_UNARY_RHS:
 	    {
@@ -1222,3 +1240,57 @@  optimize_successive_divisions_p (tree divisor, tree inner_div)
     }
   return true;
 }
+
+/* Recognize count trailing zeroes idiom.
+   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
+   constant which when multiplied by a power of 2 contains a unique value
+   in the top 5 or 6 bits.  This is then indexed into a table which maps it
+   to the number of trailing zeroes.  Array[0] is returned so the caller can
+   emit an appropriate sequence depending on whether ctz (0) is defined on
+   the target.  */
+static bool
+optimize_count_trailing_zeroes (tree type, tree array, tree x, tree mulc,
+				tree tshift, tree &zero_val)
+{
+  gcc_assert (TREE_CODE (mulc) == INTEGER_CST);
+  gcc_assert (TREE_CODE (tshift) == INTEGER_CST);
+
+  tree input_type = TREE_TYPE (x);
+
+  if (!direct_internal_fn_supported_p (IFN_CTZ, input_type, OPTIMIZE_FOR_BOTH))
+    return false;
+
+  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
+  unsigned shiftval = tree_to_uhwi (tshift);
+  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));
+
+  /* Check the array is not wider than integer type and the input is a 32-bit
+     or 64-bit type.  The shift should extract the top 5..7 bits.  */
+  if (TYPE_PRECISION (type) > 32)
+    return false;
+  if (input_bits != 32 && input_bits != 64)
+    return false;
+  if (shiftval < input_bits - 7 || shiftval > input_bits - 5)
+    return false;
+
+  tree t = build4 (ARRAY_REF, type, array, size_int (0), NULL_TREE, NULL_TREE);
+  t = fold_const_aggregate_ref (t);
+  if (t == NULL)
+    return false;
+
+  zero_val = build_int_cst (integer_type_node, tree_to_shwi (t));
+
+  for (unsigned i = 0; i < input_bits; i++, val <<= 1)
+    {
+      if (input_bits == 32)
+	val &= 0xffffffff;
+      t = build4 (ARRAY_REF, type, array, size_int ((int)(val >> shiftval)),
+		  NULL_TREE, NULL_TREE);
+      t = fold_const_aggregate_ref (t);
+      if (t == NULL || tree_to_shwi (t) != i)
+	return false;
+    }
+
+  return true;
+}
+
diff --git a/gcc/match.pd b/gcc/match.pd
index 6edf54b80012d87dbe7330f5ee638cdba2f9c099..bbe935e1e2af35e8e953a776eb3ecfb83414b047 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -6060,3 +6060,33 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 (simplify
  (vec_perm vec_same_elem_p@0 @0 @1)
  @0)
+
+/* Recognize count trailing zeroes idiom.
+   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
+   constant which when multiplied by a power of 2 contains a unique value
+   in the top 5 or 6 bits.  This is then indexed into a table which maps it
+   to the number of trailing zeroes.  If valid, emit an optimal sequence
+   depending on the result for zero.
+*/
+(simplify
+ (ARRAY_REF @0 (rshift (mult (bit_and (negate @1) @1) INTEGER_CST@2)
+			INTEGER_CST@3) @4 @5)
+ (with
+  { tree zero_val;
+    HOST_WIDE_INT val;
+    HOST_WIDE_INT type_size = tree_to_shwi (TYPE_SIZE (TREE_TYPE (@1)));
+    bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (TREE_TYPE (@1)), val);
+  }
+  (if (optimize_count_trailing_zeroes (type, @0, @1, @2, @3, zero_val))
+   (switch
+    (if (zero_ok && tree_to_shwi (zero_val) == val)
+     (convert (BUILT_IN_CTZ:integer_type_node @1)))
+
+    /* Emit ctz (x) & 31 if ctz (0) is 32 but we need to return 0.  */
+    (if (zero_ok && val == type_size && integer_zerop (zero_val))
+     (convert (bit_and (BUILT_IN_CTZ:integer_type_node @1)
+	{ build_int_cst (integer_type_node, type_size - 1); })))
+
+    /* Emit (x ? ctz (x) : zero_val).  */
+    (if (true)
+     (convert (cond @1 (BUILT_IN_CTZ:integer_type_node @1) { zero_val; } )))))))
diff --git a/gcc/testsuite/gcc.target/aarch64/pr90838.c b/gcc/testsuite/gcc.target/aarch64/pr90838.c
new file mode 100644
index 0000000000000000000000000000000000000000..bff3144c0d1b3984016e5a404e986eae785c73ed
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr90838.c
@@ -0,0 +1,64 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int ctz1 (unsigned x)
+{
+  static const char table[32] =
+    {
+      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
+      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
+    };
+
+  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
+}
+
+int ctz2 (unsigned x)
+{
+  const int u = 0;
+  static short table[64] =
+    {
+      32, 0, 1,12, 2, 6, u,13, 3, u, 7, u, u, u, u,14,
+      10, 4, u, u, 8, u, u,25, u, u, u, u, u,21,27,15,
+      31,11, 5, u, u, u, u, u, 9, u, u,24, u, u,20,26,
+      30, u, u, u, u,23, u,19,29, u,22,18,28,17,16, u
+    };
+
+  x = (x & -x) * 0x0450FBAF;
+  return table[x >> 26];
+}
+
+int ctz3 (unsigned x)
+{
+  static int table[32] =
+    {
+      0, 1, 2,24, 3,19, 6,25, 22, 4,20,10,16, 7,12,26,
+      31,23,18, 5,21, 9,15,11,30,17, 8,14,29,13,28,27
+    };
+
+  if (x == 0) return 32;
+  x = (x & -x) * 0x04D7651F;
+  return table[x >> 27];
+}
+
+static const unsigned long long magic = 0x03f08c5392f756cdULL;
+
+static const char table[64] = {
+     0,  1, 12,  2, 13, 22, 17,  3,
+    14, 33, 23, 36, 18, 58, 28,  4,
+    62, 15, 34, 26, 24, 48, 50, 37,
+    19, 55, 59, 52, 29, 44, 39,  5,
+    63, 11, 21, 16, 32, 35, 57, 27,
+    61, 25, 47, 49, 54, 51, 43, 38,
+    10, 20, 31, 56, 60, 46, 53, 42,
+     9, 30, 45, 41,  8, 40,  7,  6,
+};
+
+int ctz4 (unsigned long x)
+{
+  unsigned long lsb = x & -x;
+  return table[(lsb * magic) >> 58];
+}
+
+/* { dg-final { scan-assembler-times "clz\t" 4 } } */
+/* { dg-final { scan-assembler-times "and\t" 2 } } */
+/* { dg-final { scan-assembler-not "cmp\t.*0" } } */