[AArch64] Improve clz patterns

Message ID	HE1PR0801MB204484FA92B21DFD4AB7FF4F83030@HE1PR0801MB2044.eurprd08.prod.outlook.com
State	New
Headers	show Return-Path: <gcc-patches-return-518853-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:date:message-id:content-type :content-transfer-encoding:mime-version; q=dns; s=default; b=QO/ j2QzHDTFBTF7QTdsN1Rkbpb/uRpLSuDcbGyrgNGEuOZZ5RF571HDoR2nOR6FFyPB 4Qch28yzMLdecFFth7/EEWUbigtGZzb2L0+5ukI9kF5gExG8h9sqn69r9Ej8sBWH BtzMGiKplJkUWm+L4XrLosNDHKw1gNaMYFxQ+E2s= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org Received-SPF: Pass (protection.outlook.com: domain of arm.com designates 63.35.35.123 as permitted sender) receiver=protection.outlook.com; client-ip=63.35.35.123; helo=64aa7808-outbound-1.mta.getcheckrecipient.com; From: Wilco Dijkstra <Wilco.Dijkstra@arm.com> To: GCC Patches <gcc-patches@gcc.gnu.org> CC: Richard Earnshaw <Richard.Earnshaw@arm.com>, Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>, Richard Sandiford <Richard.Sandiford@arm.com> Subject: [PATCH][AArch64] Improve clz patterns Date: Tue, 4 Feb 2020 10:24:38 +0000 Message-ID: <HE1PR0801MB204484FA92B21DFD4AB7FF4F83030@HE1PR0801MB2044.eurprd08.prod.outlook.com> Authentication-Results-Original: spf=none (sender IP is ) smtp.mailfrom=Wilco.Dijkstra@arm.com; received-spf: None (protection.outlook.com: arm.com does not designate permitted sender hosts) Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Original-Authentication-Results: spf=none (sender IP is ) smtp.mailfrom=Wilco.Dijkstra@arm.com; X-MS-Exchange-Transport-CrossTenantHeadersStripped: DB5EUR03FT061.eop-EUR03.prod.protection.outlook.com X-MS-Office365-Filtering-Correlation-Id-Prvs: 6895ca50-f514-4965-aff5-08d7a95c708e
Series	[AArch64] Improve clz patterns \| expand [AArch64] Improve clz patterns

Wilco Dijkstra Feb. 4, 2020, 10:24 a.m. UTC

Although GCC should understand the limited range of clz/ctz/cls results,
Combine sometimes behaves oddly and duplicates ctz to remove a
sign extension.  Avoid this by adding an explicit AND with 127 in the
patterns. Deepsjeng performance improves by ~0.6%.

Bootstrap OK.

ChangeLog:
2020-02-03  Wilco Dijkstra  <wdijkstr@arm.com>

	* config/aarch64/aarch64.md (clz<mode>2): Mask the clz result.
	(clrsb<mode>2): Likewise.
	(ctz<mode>2): Likewise.
--

Richard Sandiford Feb. 4, 2020, 11:42 a.m. UTC | #1

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Although GCC should understand the limited range of clz/ctz/cls results,
> Combine sometimes behaves oddly and duplicates ctz to remove a
> sign extension.  Avoid this by adding an explicit AND with 127 in the
> patterns. Deepsjeng performance improves by ~0.6%.

Could you go into more detail about what the before and after code
looks like, and what combine is doing?  Like you say, this sounds
like a target-independent thing on face value.

Either way, something like this needs a testcase.

Thanks,
Richard

>
> Bootstrap OK.
>
> ChangeLog:
> 2020-02-03  Wilco Dijkstra  <wdijkstr@arm.com>
>
> * config/aarch64/aarch64.md (clz<mode>2): Mask the clz result.
> (clrsb<mode>2): Likewise.
> (ctz<mode>2): Likewise.
> --
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 5edc76ee14b55b2b4323530e10bd22b3ffca483e..7ff0536aac42957dbb7a15be766d35cc6725ac40 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -4794,7 +4794,8 @@ (define_insn "*and_one_cmpl_<SHIFT:optab><mode>3_compare0_no_reuse"
>
>  (define_insn "clz<mode>2"
>    [(set (match_operand:GPI 0 "register_operand" "=r")
> -(clz:GPI (match_operand:GPI 1 "register_operand" "r")))]
> +(and:GPI (clz:GPI (match_operand:GPI 1 "register_operand" "r"))
> + (const_int 127)))]
>    ""
>    "clz\\t%<w>0, %<w>1"
>    [(set_attr "type" "clz")]
> @@ -4848,7 +4849,8 @@ (define_expand "popcount<mode>2"
>
>  (define_insn "clrsb<mode>2"
>    [(set (match_operand:GPI 0 "register_operand" "=r")
> -        (clrsb:GPI (match_operand:GPI 1 "register_operand" "r")))]
> +(and:GPI (clrsb:GPI (match_operand:GPI 1 "register_operand" "r"))
> + (const_int 127)))]
>    ""
>    "cls\\t%<w>0, %<w>1"
>    [(set_attr "type" "clz")]
> @@ -4869,7 +4871,8 @@ (define_insn "rbit<mode>2"
>
>  (define_insn_and_split "ctz<mode>2"
>   [(set (match_operand:GPI           0 "register_operand" "=r")
> -       (ctz:GPI (match_operand:GPI  1 "register_operand" "r")))]
> +       (and:GPI (ctz:GPI (match_operand:GPI  1 "register_operand" "r"))
> +(const_int 127)))]
>    ""
>    "#"
>    "reload_completed"

Wilco Dijkstra Feb. 4, 2020, 2:28 p.m. UTC | #2

Hi Richard,

> Could you go into more detail about what the before and after code
> looks like, and what combine is doing?  Like you say, this sounds
> like a target-independent thing on face value.

It is indeed, but it seems specific to instructions where we have range
information which allows it to remove a redundant sign-extend.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 for the full details.

> Either way, something like this needs a testcase.

Sure I've added the testcase from pr93565, see below.

Cheers,
Wilco


Although GCC should understand the limited range of clz/ctz/cls results,
Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary
sign extension.  Avoid this by adding an explicit AND with 127 in the
patterns. Deepsjeng performance improves by ~0.6%.

Bootstrap OK.

ChangeLog:
2020-02-04  Wilco Dijkstra  <wdijkstr@arm.com>

	PR rtl-optimization/93565
	* config/aarch64/aarch64.md (clz<mode>2): Mask the clz result.
	(clrsb<mode>2): Likewise.
	(ctz<mode>2): Likewise.

	* gcc.target/aarch64/pr93565.c: New test.
--
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 5edc76ee14b55b2b4323530e10bd22b3ffca483e..7ff0536aac42957dbb7a15be766d35cc6725ac40 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4794,7 +4794,8 @@ (define_insn "*and_one_cmpl_<SHIFT:optab><mode>3_compare0_no_reuse"
 
 (define_insn "clz<mode>2"
   [(set (match_operand:GPI 0 "register_operand" "=r")
-	(clz:GPI (match_operand:GPI 1 "register_operand" "r")))]
+	(and:GPI (clz:GPI (match_operand:GPI 1 "register_operand" "r"))
+		 (const_int 127)))]
   ""
   "clz\\t%<w>0, %<w>1"
   [(set_attr "type" "clz")]
@@ -4848,7 +4849,8 @@ (define_expand "popcount<mode>2"
 
 (define_insn "clrsb<mode>2"
   [(set (match_operand:GPI 0 "register_operand" "=r")
-        (clrsb:GPI (match_operand:GPI 1 "register_operand" "r")))]
+	(and:GPI (clrsb:GPI (match_operand:GPI 1 "register_operand" "r"))
+		 (const_int 127)))]
   ""
   "cls\\t%<w>0, %<w>1"
   [(set_attr "type" "clz")]
@@ -4869,7 +4871,8 @@ (define_insn "rbit<mode>2"
 
 (define_insn_and_split "ctz<mode>2"
  [(set (match_operand:GPI           0 "register_operand" "=r")
-       (ctz:GPI (match_operand:GPI  1 "register_operand" "r")))]
+       (and:GPI (ctz:GPI (match_operand:GPI  1 "register_operand" "r"))
+		(const_int 127)))]
   ""
   "#"
   "reload_completed"
diff --git a/gcc/testsuite/gcc.target/aarch64/pr93565.c b/gcc/testsuite/gcc.target/aarch64/pr93565.c
new file mode 100644
index 0000000000000000000000000000000000000000..7200f80d1bb161f6a058cc6591f61b6b75cf1749
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr93565.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+static const unsigned long long magic = 0x03f08c5392f756cdULL;
+
+static const char table[64] = {
+     0,  1, 12,  2, 13, 22, 17,  3,
+    14, 33, 23, 36, 18, 58, 28,  4,
+    62, 15, 34, 26, 24, 48, 50, 37,
+    19, 55, 59, 52, 29, 44, 39,  5,
+    63, 11, 21, 16, 32, 35, 57, 27,
+    61, 25, 47, 49, 54, 51, 43, 38,
+    10, 20, 31, 56, 60, 46, 53, 42,
+     9, 30, 45, 41,  8, 40,  7,  6,
+};
+
+static inline int ctz1 (unsigned long  b)
+{
+  unsigned long lsb = b & -b;
+  return table[(lsb * magic) >> 58];
+}
+
+void f (unsigned long x, int *p)
+{
+  if (x != 0)
+    {
+      int a = ctz1 (x);
+      *p = a | p[a];
+    }
+}
+
+/* { dg-final { scan-assembler-times "rbit\t" 1 } } */
+/* { dg-final { scan-assembler-times "clz\t" 1 } } */
+

Richard Sandiford Feb. 7, 2020, 6:01 p.m. UTC | #3

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Hi Richard,
>
>> Could you go into more detail about what the before and after code
>> looks like, and what combine is doing?  Like you say, this sounds
>> like a target-independent thing on face value.
>
> It is indeed, but it seems specific to instructions where we have range
> information which allows it to remove a redundant sign-extend.
>
> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 for the full details.
>
>> Either way, something like this needs a testcase.
>
> Sure I've added the testcase from pr93565, see below.
>
> Cheers,
> Wilco
>
>
> Although GCC should understand the limited range of clz/ctz/cls results,
> Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary
> sign extension.  Avoid this by adding an explicit AND with 127 in the
> patterns. Deepsjeng performance improves by ~0.6%.
>
> Bootstrap OK.
>
> ChangeLog:
> 2020-02-04  Wilco Dijkstra  <wdijkstr@arm.com>
>
> PR rtl-optimization/93565
> * config/aarch64/aarch64.md (clz<mode>2): Mask the clz result.
> (clrsb<mode>2): Likewise.
> (ctz<mode>2): Likewise.
>
> * gcc.target/aarch64/pr93565.c: New test.

I think this only works because the (and ...) wrapper stops combine
from matching a plain ctz:

Failed to match this instruction:
(set (reg:DI 100 [ _9+-4 ])
    (ctz:DI (reg/v:DI 97 [ x ])))

I.e. the patch doesn't work by making the range explicit.  It just
works by wrapping the pattern in something that combine will never
generate naturally.  Something like (ashift:DI ... (const_int 0))
would give the same result.

If I add a second pattern that is never generated directly but can be
matched by combine:

(define_insn_and_split "*ctz<mode>2"
 [(set (match_operand:GPI           0 "register_operand" "=r")
       (ctz:GPI (match_operand:GPI  1 "register_operand" "r")))]
  ""
  "#"
  "reload_completed"
  [(const_int 0)]
  "
  emit_insn (gen_rbit<mode>2 (operands[0], operands[1]));
  emit_insn (gen_clz<mode>2 (operands[0], operands[0]));
  DONE;
")

then we get the double ctz again.

So I don't think this is OK even as a workaround.  We should fix it in
target-independent code instead.

Thanks,
Richard

Segher Boessenkool Feb. 8, 2020, 2:30 a.m. UTC | #4

On Fri, Feb 07, 2020 at 06:01:44PM +0000, Richard Sandiford wrote:
> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> > Although GCC should understand the limited range of clz/ctz/cls results,
> > Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary
> > sign extension.  Avoid this by adding an explicit AND with 127 in the
> > patterns. Deepsjeng performance improves by ~0.6%.

> > PR rtl-optimization/93565

> So I don't think this is OK even as a workaround.  We should fix it in
> target-independent code instead.

Yes, but combine shouldn't do CSE; combine does the *opposite* of CSE,
in some regards.  You shouldn't do CSE without a cost model for it, in
any case.


Segher

Wilco Dijkstra Feb. 12, 2020, 2:19 p.m. UTC | #5

Hi Richard,

Right, so this is an alternative approach using costs - Combine won't try to
duplicate instructions if it increases costs, so increasing the ctz cost to 2
instructions (which is the correct cost for ctz anyway) ensures we still get
efficient code for this example:

[AArch64] Set ctz rtx_cost (PR93565)

Although GCC should understand the limited range of clz/ctz/cls results,
Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary
sign extension.  Avoid this by setting the cost for ctz to be higher than
that of a simple ALU instruction.  Deepsjeng performance improves by ~0.6%.

Bootstrap OK.

ChangeLog:
2020-02-12  Wilco Dijkstra  <wdijkstr@arm.com>

	PR rtl-optimization/93565
	* config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs.

	* gcc.target/aarch64/pr93565.c: New test.

--

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index e40750380cce202473da3cf572ebdbc28a4ecc06..7426629d6c973c06640f75d3de53a2815ff40f1b 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11459,6 +11459,13 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer ATTRIBUTE_UNUSED,
 
       return false;
 
+    case CTZ:
+      *cost = COSTS_N_INSNS (2);
+
+      if (speed)
+	*cost += extra_cost->alu.clz + extra_cost->alu.rev;
+      return false;
+
     case COMPARE:
       op0 = XEXP (x, 0);
       op1 = XEXP (x, 1);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr93565.c b/gcc/testsuite/gcc.target/aarch64/pr93565.c
new file mode 100644
index 0000000000000000000000000000000000000000..7200f80d1bb161f6a058cc6591f61b6b75cf1749
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr93565.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+static const unsigned long long magic = 0x03f08c5392f756cdULL;
+
+static const char table[64] = {
+     0,  1, 12,  2, 13, 22, 17,  3,
+    14, 33, 23, 36, 18, 58, 28,  4,
+    62, 15, 34, 26, 24, 48, 50, 37,
+    19, 55, 59, 52, 29, 44, 39,  5,
+    63, 11, 21, 16, 32, 35, 57, 27,
+    61, 25, 47, 49, 54, 51, 43, 38,
+    10, 20, 31, 56, 60, 46, 53, 42,
+     9, 30, 45, 41,  8, 40,  7,  6,
+};
+
+static inline int ctz1 (unsigned long  b)
+{
+  unsigned long lsb = b & -b;
+  return table[(lsb * magic) >> 58];
+}
+
+void f (unsigned long x, int *p)
+{
+  if (x != 0)
+    {
+      int a = ctz1 (x);
+      *p = a | p[a];
+    }
+}
+
+/* { dg-final { scan-assembler-times "rbit\t" 1 } } */
+/* { dg-final { scan-assembler-times "clz\t" 1 } } */
+

Richard Sandiford Feb. 12, 2020, 5:56 p.m. UTC | #6

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Hi Richard,
>
> Right, so this is an alternative approach using costs - Combine won't try to
> duplicate instructions if it increases costs, so increasing the ctz cost to 2
> instructions (which is the correct cost for ctz anyway)

...agreed...

> ensures we still get efficient code for this example:
>
> [AArch64] Set ctz rtx_cost (PR93565)
>
> Although GCC should understand the limited range of clz/ctz/cls results,
> Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary
> sign extension.  Avoid this by setting the cost for ctz to be higher than
> that of a simple ALU instruction.  Deepsjeng performance improves by ~0.6%.
>
> Bootstrap OK.
>
> ChangeLog:
> 2020-02-12  Wilco Dijkstra  <wdijkstr@arm.com>
>
> PR rtl-optimization/93565
> * config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs.
>
> * gcc.target/aarch64/pr93565.c: New test.

OK, thanks.  Could you remove the bit about combine behaving oddly when
you commit though?  I think this was simply a target bug and combine
was being given duff information.

Richard

>
> --
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index e40750380cce202473da3cf572ebdbc28a4ecc06..7426629d6c973c06640f75d3de53a2815ff40f1b 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11459,6 +11459,13 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer ATTRIBUTE_UNUSED,
>
>        return false;
>
> +    case CTZ:
> +      *cost = COSTS_N_INSNS (2);
> +
> +      if (speed)
> +*cost += extra_cost->alu.clz + extra_cost->alu.rev;
> +      return false;
> +
>      case COMPARE:
>        op0 = XEXP (x, 0);
>        op1 = XEXP (x, 1);
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr93565.c b/gcc/testsuite/gcc.target/aarch64/pr93565.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..7200f80d1bb161f6a058cc6591f61b6b75cf1749
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr93565.c
> @@ -0,0 +1,34 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +static const unsigned long long magic = 0x03f08c5392f756cdULL;
> +
> +static const char table[64] = {
> +     0,  1, 12,  2, 13, 22, 17,  3,
> +    14, 33, 23, 36, 18, 58, 28,  4,
> +    62, 15, 34, 26, 24, 48, 50, 37,
> +    19, 55, 59, 52, 29, 44, 39,  5,
> +    63, 11, 21, 16, 32, 35, 57, 27,
> +    61, 25, 47, 49, 54, 51, 43, 38,
> +    10, 20, 31, 56, 60, 46, 53, 42,
> +     9, 30, 45, 41,  8, 40,  7,  6,
> +};
> +
> +static inline int ctz1 (unsigned long  b)
> +{
> +  unsigned long lsb = b & -b;
> +  return table[(lsb * magic) >> 58];
> +}
> +
> +void f (unsigned long x, int *p)
> +{
> +  if (x != 0)
> +    {
> +      int a = ctz1 (x);
> +      *p = a | p[a];
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-times "rbit\t" 1 } } */
> +/* { dg-final { scan-assembler-times "clz\t" 1 } } */
> +

Andrew Pinski Feb. 12, 2020, 6:56 p.m. UTC | #7

On Wed, Feb 12, 2020 at 9:56 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> > Hi Richard,
> >
> > Right, so this is an alternative approach using costs - Combine won't try to
> > duplicate instructions if it increases costs, so increasing the ctz cost to 2
> > instructions (which is the correct cost for ctz anyway)
>
> ...agreed...

Yes I agree a better cost model for CTZ/CLZ is the right solution but
I disagree with 2 ALU instruction as the cost.  It should either be
the same cost as a multiply or have its own cost entry.
For an example on OcteonTX (and ThunderX1), the cost of CLS/CLZ is 4
cycles, the same as the cost as a multiple; on OcteonTX2 it is 5
cycles (again the same cost as a multiple).

Thanks,
Andrew Pinski

>
> > ensures we still get efficient code for this example:
> >
> > [AArch64] Set ctz rtx_cost (PR93565)
> >
> > Although GCC should understand the limited range of clz/ctz/cls results,
> > Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary
> > sign extension.  Avoid this by setting the cost for ctz to be higher than
> > that of a simple ALU instruction.  Deepsjeng performance improves by ~0.6%.
> >
> > Bootstrap OK.
> >
> > ChangeLog:
> > 2020-02-12  Wilco Dijkstra  <wdijkstr@arm.com>
> >
> > PR rtl-optimization/93565
> > * config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs.
> >
> > * gcc.target/aarch64/pr93565.c: New test.
>
> OK, thanks.  Could you remove the bit about combine behaving oddly when
> you commit though?  I think this was simply a target bug and combine
> was being given duff information.
>
> Richard
>
> >
> > --
> >
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index e40750380cce202473da3cf572ebdbc28a4ecc06..7426629d6c973c06640f75d3de53a2815ff40f1b 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -11459,6 +11459,13 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer ATTRIBUTE_UNUSED,
> >
> >        return false;
> >
> > +    case CTZ:
> > +      *cost = COSTS_N_INSNS (2);
> > +
> > +      if (speed)
> > +*cost += extra_cost->alu.clz + extra_cost->alu.rev;
> > +      return false;
> > +
> >      case COMPARE:
> >        op0 = XEXP (x, 0);
> >        op1 = XEXP (x, 1);
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr93565.c b/gcc/testsuite/gcc.target/aarch64/pr93565.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..7200f80d1bb161f6a058cc6591f61b6b75cf1749
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr93565.c
> > @@ -0,0 +1,34 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2" } */
> > +
> > +static const unsigned long long magic = 0x03f08c5392f756cdULL;
> > +
> > +static const char table[64] = {
> > +     0,  1, 12,  2, 13, 22, 17,  3,
> > +    14, 33, 23, 36, 18, 58, 28,  4,
> > +    62, 15, 34, 26, 24, 48, 50, 37,
> > +    19, 55, 59, 52, 29, 44, 39,  5,
> > +    63, 11, 21, 16, 32, 35, 57, 27,
> > +    61, 25, 47, 49, 54, 51, 43, 38,
> > +    10, 20, 31, 56, 60, 46, 53, 42,
> > +     9, 30, 45, 41,  8, 40,  7,  6,
> > +};
> > +
> > +static inline int ctz1 (unsigned long  b)
> > +{
> > +  unsigned long lsb = b & -b;
> > +  return table[(lsb * magic) >> 58];
> > +}
> > +
> > +void f (unsigned long x, int *p)
> > +{
> > +  if (x != 0)
> > +    {
> > +      int a = ctz1 (x);
> > +      *p = a | p[a];
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-assembler-times "rbit\t" 1 } } */
> > +/* { dg-final { scan-assembler-times "clz\t" 1 } } */
> > +

Wilco Dijkstra Feb. 12, 2020, 7:08 p.m. UTC | #8

Hi Richard,

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565#c8 - the problem is
more generic like I suspected and it's easy to create similar examples. So while
this turned out to be an easy worksaround for ctz, there general case is harder
to avoid since you still want to allow beneficial multi-use cases (such as merging
a shift into 2 ALU instructions).

Cheers,
Wilco

Wilco Dijkstra Feb. 12, 2020, 7:13 p.m. UTC | #9

Hi Andrew,

> Yes I agree a better cost model for CTZ/CLZ is the right solution but
> I disagree with 2 ALU instruction as the cost.  It should either be
> the same cost as a multiply or have its own cost entry.
> For an example on OcteonTX (and ThunderX1), the cost of CLS/CLZ is 4
> cycles, the same as the cost as a multiple; on OcteonTX2 it is 5
> cycles (again the same cost as a multiple).

+      if (speed)
+          *cost += extra_cost->alu.clz + extra_cost->alu.rev;
+      return false;

So if the cost of clz and ctz are similar enough, this will use the defined
per-cpu costs.

Cheers,
Wilco

[AArch64] Improve clz patterns

Commit Message

Comments

Patch