
[x86_64] : Improvement to signed division of integer constant.

Message ID 000a01d773d2$cbf50d80$63df2880$@nextmovesoftware.com
State New
Series [x86_64] : Improvement to signed division of integer constant.

Commit Message

Roger Sayle July 8, 2021, 8:25 a.m. UTC
This patch tweaks the way GCC handles 32-bit integer division on
x86_64, when the numerator is constant.  Currently the function

int foo (int x) {
  return 100/x;
}

generates the code:
foo:	movl    $100, %eax
	cltd
	idivl   %edi
	ret

where the sign-extension instruction "cltd" creates a long
dependency chain, as it depends on the "mov" before it, and
is depended upon by "idivl" after it.

With this patch, GCC now matches both icc and LLVM and
uses an xor instead, generating:
foo:	xorl    %edx, %edx
	movl    $100, %eax
	idivl   %edi
	ret

Microbenchmarking confirms that this is faster on Intel
processors (Kaby Lake) and no worse on AMD processors (Zen 2),
which agrees with intuition but oddly disagrees with the
llvm-mca cycle-count prediction on godbolt.org.

The tricky bit is that this sign-extension instruction is only
produced by late (postreload) splitting, and unfortunately none
of the subsequent passes (e.g. cprop_hardreg) is able to
propagate and simplify its constant argument.  The solution
here is to introduce a define_insn_and_split that allows the
constant numerator operand to be captured (by combine) and
then split into an optimal form after reload.

The same microbenchmarking also shows that, for negative
numerators, replacing the sign extension with movl $-1,%edx is
a performance improvement, as done by icc but not by LLVM.
Both the xor and the movl idioms are larger than cltd, so this
transformation is disabled for -Os.
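
For comparison, with a negative numerator such as -100/x (the bar
function in the new testcase), the expected code after this patch
is roughly the following; the exact instruction order may differ
after scheduling:

bar:	movl    $-100, %eax
	movl    $-1, %edx
	idivl   %edi
	ret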


This patch has been tested on x86_64-pc-linux-gnu with a "make
bootstrap" and "make -k check" with no new failures.

Ok for mainline?


2021-07-08  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/i386/i386.md (*divmodsi4_const): Optimize SImode
	divmod of a constant numerator with new define_insn_and_split.

gcc/testsuite/ChangeLog
	* gcc.target/i386/divmod-9.c: New test case.


Roger
--
Roger Sayle
NextMove Software
Cambridge, UK

/* { dg-do compile } */
/* { dg-options "-O2" } */

int foo (int x)
{
  return 100/x;
}

int bar(int x)
{
  return -100/x;
}
/* { dg-final { scan-assembler-not "(cltd|cdq)" } } */

Comments

Richard Biener July 8, 2021, 10:07 a.m. UTC | #1
On Thu, Jul 8, 2021 at 10:25 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch tweaks the way GCC handles 32-bit integer division on
> x86_64, when the numerator is constant.  Currently the function
>
> int foo (int x) {
>   return 100/x;
> }
>
> generates the code:
> foo:    movl    $100, %eax
>         cltd
>         idivl   %edi
>         ret
>
> where the sign-extension instruction "cltd" creates a long
> dependency chain, as it depends on the "mov" before it, and
> is depended upon by "idivl" after it.
>
> With this patch, GCC now matches both icc and LLVM and
> uses an xor instead, generating:
> foo:    xorl    %edx, %edx
>         movl    $100, %eax
>         idivl   %edi
>         ret

You made me lookup idiv and I figured we're not optimally
handling

int foo (long x, int y)
{
  return x / y;
}

by using a 32:32 / 32 bit divide.  combine manages to
see enough to eventually do this though.

Alexander Monakov July 8, 2021, 11:02 a.m. UTC | #2
On Thu, 8 Jul 2021, Richard Biener via Gcc-patches wrote:

> You made me lookup idiv and I figured we're not optimally
> handling
> 
> int foo (long x, int y)
> {
>   return x / y;
> }
> 
> by using a 32:32 / 32 bit divide.  combine manages to
> see enough to eventually do this though.

We cannot do that in general because idiv will cause an exception
if the signed result is not representable in 32 bits, but GCC
defines signed conversions to truncate without trapping.

Alexander
Uros Bizjak July 8, 2021, 12:57 p.m. UTC | #3
On Thu, Jul 8, 2021 at 10:25 AM Roger Sayle <roger@nextmovesoftware.com> wrote:

+  if (INTVAL (operands[2]) < 0)
+    emit_move_insn (operands[1], constm1_rtx);
+  else
+    ix86_expand_clear (operands[1]);

No need to call ix86_expand_clear,

    emit_move_insn (operands[1], const0_rtx);

will result in xor, too.
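
A minimal sketch of the preparation statement with that change
applied (untested, just to illustrate):

  emit_move_insn (operands[0], operands[2]);
  if (INTVAL (operands[2]) < 0)
    emit_move_insn (operands[1], constm1_rtx);
  else
    emit_move_insn (operands[1], const0_rtx);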

OK with the above change.

Thanks,
Uros.


Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 700c158..908ae33 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -8657,6 +8657,33 @@ 
   [(set_attr "type" "idiv")
    (set_attr "mode" "SI")])
 
+;; Avoid sign-extension (using cdq) for constant numerators.
+(define_insn_and_split "*divmodsi4_const"
+  [(set (match_operand:SI 0 "register_operand" "=&a")
+        (div:SI (match_operand:SI 2 "const_int_operand" "n")
+		(match_operand:SI 3 "nonimmediate_operand" "rm")))
+   (set (match_operand:SI 1 "register_operand" "=&d")
+        (mod:SI (match_dup 2) (match_dup 3)))
+   (clobber (reg:CC FLAGS_REG))]
+  "!optimize_function_for_size_p (cfun)"
+  "#"
+  "reload_completed"
+  [(parallel [(set (match_dup 0)
+                   (div:SI (match_dup 0) (match_dup 3)))
+              (set (match_dup 1)
+                   (mod:SI (match_dup 0) (match_dup 3)))
+              (use (match_dup 1))
+              (clobber (reg:CC FLAGS_REG))])]
+{
+  emit_move_insn (operands[0], operands[2]);
+  if (INTVAL (operands[2]) < 0)
+    emit_move_insn (operands[1], constm1_rtx);
+  else
+    ix86_expand_clear (operands[1]);
+}
+  [(set_attr "type" "multi")
+   (set_attr "mode" "SI")])
+
 (define_expand "divmodqi4"
   [(parallel [(set (match_operand:QI 0 "register_operand")
 		   (div:QI