
x86_64: Integer min/max improvements.

Message ID 021801d66663$e93294d0$bb97be70$@nextmovesoftware.com
State New
Series x86_64: Integer min/max improvements.

Commit Message

Roger Sayle July 30, 2020, 11:23 a.m. UTC
This patch tweaks the way that min and max are expanded, so that the
semantics of these operations remain visible to the early RTL optimization
passes until they are split into explicit comparison and conditional move
instructions.  The good news is that i386.md already contains all of
the required logic (many thanks to Richard Biener and Uros Bizjak),
but it is currently used only to enable scalar-to-vector (STV) synthesis
of min/max instructions.  This change enables this functionality for
all TARGET_CMOVE targets for SImode, HImode and DImode.

My first attempt to support "min(reg,reg)" as an instruction revealed
why this functionality isn't already enabled: in PR rtl-optimization 91154
we end up generating a cmp instead of a test in gcc.target/i386/minmax-3.c,
which has poor performance on AMD Opteron.  My solution is to instead
support "min(reg,general_operand)", allowing us to inspect
any immediate constant at the point this operation is split.  This
allows us to use "test" instructions for min/max against 0, 1 and -1.
As an added benefit, it allows us to CSE large immediate constants,
reducing code size.

Previously, "int foo(int x) { return max(x,12345); }" would generate:

foo:    cmpl    $12345, %edi
        movl    $12345, %eax
        cmovge  %edi, %eax
        ret

with this patch we instead generate:

foo:    movl    $12345, %eax
        cmpl    %eax, %edi
        cmovge  %edi, %eax
        ret


I've also included a peephole2 to avoid the "movl $0,%eax" instructions
produced by the register allocator.  Materializing constants as
late as possible reduces register pressure, but for const0_rtx on x86
it's preferable to use "xor", which requires moving this set out from
between a flags-setting operation and its use.  This should also help
instruction macro fusion on some microarchitectures, where test/cmp and
the following instruction can sometimes be combined.

Previously, "int foo(int x) { return max(x,0); }" would generate:

foo:    testl   %edi, %edi
        movl    $0, %eax
        cmovns  %edi, %eax
        ret

with this patch we instead generate:
foo:    xorl    %eax, %eax
        testl   %edi, %edi
        cmovns  %edi, %eax
        ret

The benefits of having min/max explicit at the RTL level can be seen
from compiling the following example with "-O2 -fno-tree-reassoc".


#define max(a,b) (((a) > (b))? (a) : (b))

int foo(int x)
{
  int y = max(x,5);
  return max(y,10);
}

where GCC currently produces:

foo:    cmpl    $5, %edi
        movl    $5, %eax
        movl    $10, %edx
        cmovge  %edi, %eax
        cmpl    $10, %eax
        cmovl   %edx, %eax
        ret

and with this patch it instead now produces:

foo:    movl    $10, %eax
        cmpl    %eax, %edi
        cmovge  %edi, %eax
        ret


The original motivation came from looking at a performance-critical
function in a quantum mechanics code that performed MIN_EXPR and
MAX_EXPR of the same arguments (effectively a two-element sort),
where GCC was performing the comparison twice.  I'd hoped that it
might be possible to fuse these together, perhaps in combine, but
this patch is an intermediate step towards that goal.
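
To make this concrete, here is a minimal sketch of the kind of function
involved (a hypothetical stand-in, not the actual quantum mechanics code):
min and max of the same pair of arguments, i.e. a two-element sort, for
which GCC currently emits the same comparison twice.

#define max(a,b) (((a) > (b))? (a) : (b))
#define min(a,b) (((a) < (b))? (a) : (b))

/* Hypothetical stand-in: a two-element sort, min and max of the
   same arguments.  */
void sort2(int a, int b, int *lo, int *hi)
{
  *lo = min(a,b);   /* one comparison ...  */
  *hi = max(a,b);   /* ... and currently a second, identical one.  */
}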

This patch has been tested on x86_64-pc-linux-gnu with a make
bootstrap followed by make -k check with no new regressions.
Ok for mainline?


2020-07-30  Roger Sayle  <roger@nextmovesoftware.com>

	* config/i386/i386.md (MAXMIN_IMODE): No longer needed.
	(<maxmin><mode>3):  Support SWI248 and general_operand for
	second operand, when TARGET_CMOVE.
	(<maxmin><mode>3_1 splitter): Optimize comparisons against
	0, 1 and -1 to use "test" instead of "cmp".
	(*<maxmin>di3_doubleword): Likewise, allow general_operand
	and enable on TARGET_CMOVE.
	(peephole2): Convert clearing a register after a flag setting
	instruction into an xor followed by the original flag setter.


Many thanks in advance.  Almost all of the credit goes to Uros and
Richard for already implementing this feature; I've just copied the
transformations from optab expansion that allow it to be enabled
without penalty (this late in the compilation).

Roger
--
Roger Sayle
NextMove Software
Cambridge, UK

Comments

Richard Biener July 30, 2020, 12:20 p.m. UTC | #1
On Thu, Jul 30, 2020 at 1:24 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch tweaks the way that min and max are expanded, so that the
> semantics of these operations are visible to the early RTL optimization
> passes, until split into explicit comparison and conditional move
> instructions. The good news is that i386.md already contains all of
> the required logic (many thanks to Richard Biener and Uros Bizjak),
> but this is currently only to enable scalar-to-vector (STV) synthesis
> of min/max instructions.  This change enables this functionality for
> all TARGET_CMOVE architectures for SImode, HImode and DImode.
>
> My first attempt to support "min(reg,reg)" as an instruction revealed
> why this functionality isn't already enabled: In PR rtl-optimization 91154
> we end up generating a cmp instead of a test in gcc.target/i386/minmax-3.c
> which has poor performance on AMD Opteron.  My solution to this is to
> actually support "min(reg,general_operand)" allowing us to inspect
> any immediate constant at the point this operation is split.  This
> allows us to use "test" instructions for min/max against 0, 1 and -1.
> As an added benefit it allows us to CSE large immediate constants,
> reducing code size.
>
> Previously, "int foo(int x) { return max(x,12345); }" would generate:
>
> foo:    cmpl    $12345, %edi
>         movl    $12345, %eax
>         cmovge  %edi, %eax
>         ret
>
> with this patch we instead generate:
>
> foo:    movl    $12345, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> I've also included a peephole2 to avoid the "movl $0,%eax" instructions
> being produced by the register allocator.  Materializing constants as
> late as possible reduces register pressure, but for const0_rtx on x86,
> it's preferable to use "xor" by moving this set from between a flags
> setting operation and its use.  This should also help instruction macro
> fusion on some microarchitectures, where test/cmp and the following
> instruction can sometimes be combined.
>
> Previously, "int foo(int x) { return max(x,0); }" would generate:
>
> foo:    testl   %edi, %edi
>         movl    $0, %eax
>         cmovns  %edi, %eax
>         ret
>
> with this patch we instead generate:
> foo:    xorl    %eax, %eax
>         testl   %edi, %edi
>         cmovns  %edi, %eax
>         ret
>
> The benefits of having min/max explicit at the RTL level can be seen
> from compiling the following example with "-O2 -fno-tree-reassoc".
>
>
> #define max(a,b) (((a) > (b))? (a) : (b))
>
> int foo(int x)
> {
>   int y = max(x,5);
>   return max(y,10);
> }
>
> where GCC currently produces:
>
> foo:    cmpl    $5, %edi
>         movl    $5, %eax
>         movl    $10, %edx
>         cmovge  %edi, %eax
>         cmpl    $10, %eax
>         cmovl   %edx, %eax
>         ret
>
> and with this patch it instead now produces:
>
> foo:    movl    $10, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> The original motivation was from looking at a performance critical
> function in a quantum mechanics code, that performed MIN_EXPR and
> MAX_EXPR of the same arguments (effectively a two-element sort),
> where GCC was performing the comparison twice.  I'd hoped that it
> might be possible to fuse these together, perhaps in combine, but
> this patch is an intermediate step towards that goal.
>
> This patch has been tested on x86_64-pc-linux-gnu with a make
> bootstrap followed by make -k check with no new regressions.
> Ok for mainline?

May I ask for a testcase or two?  You have good examples above.
Btw, I'm sure some variants of those are in bugzilla as well.

Richard.

>
> 2020-07-30  Roger Sayle  <roger@nextmovesoftware.com>
>
>         * config/i386/i386.md (MAXMIN_IMODE): No longer needed.
>         (<maxmin><mode>3):  Support SWI248 and general_operand for
>         second operand, when TARGET_CMOVE.
>         (<maxmin><mode>3_1 splitter): Optimize comparisons against
>         0, 1 and -1 to use "test" instead of "cmp".
>         (*<maxmin>di3_doubleword): Likewise, allow general_operand
>         and enable on TARGET_CMOVE.
>         (peephole2): Convert clearing a register after a flag setting
>         instruction into an xor followed by the original flag setter.
>
>
> Many thanks in advance.  Almost all of the credit goes to Uros and
> Richard for already implementing this feature, I've just copied the
> transformations from optab expansion, that allow it to be enabled
> without penalty (this late in the compilation).
>
> Roger
> --
> Roger Sayle
> NextMove Software
> Cambridge, UK
>
Roger Sayle July 30, 2020, 12:55 p.m. UTC | #2
Hi Richard,

>> This patch tweaks the way that min and max are expanded, so that the 
>> semantics of these operations are visible to the early RTL 
>> optimization passes, until split into explicit comparison and 
>> conditional move instructions.
>
> Btw, I'm sure some variants of those are in bugzilla as well.

Now that you mention it, I'm not sure whether PR rtl-optimization 94543
is a bug at all, but with you and Richard Henderson weighing in, I suspect
that I must be missing something subtle.

The initial text of the bug report complains about an AND of 0xff not being
optimized away, as a zero extension is still present.  I believe that the AND
with 0xff has been completely optimized away, and the remaining movzwl
(not movzbl) is the zero extension of the short result to the integer ABI return
type.  If the result is written as a short to memory, there is no extension 
(or AND).
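
Roughly speaking, the code in question has this shape (paraphrasing from
the description above, not necessarily the exact PR 94543 reproducer):

/* Hypothetical reconstruction: an unsigned short min against 255,
   where the AND with 0xff is redundant, but a zero extension of the
   short result to the return register remains (the movzwl above).  */
unsigned short foo(unsigned short x)
{
  unsigned short y = (x < 255) ? x : 255;
  return y & 0xff;
}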

But perhaps the problem isn't with the min or the AND at all, but with whether
function arguments and return values are guaranteed to be extended on
input/return, so perhaps this is related to SUBREG_PROMOTED_P and friends?

Thoughts?

Cheers,
Roger

> 2020-07-30  Roger Sayle  <roger@nextmovesoftware.com>
>
>         * config/i386/i386.md (MAXMIN_IMODE): No longer needed.
>         (<maxmin><mode>3):  Support SWI248 and general_operand for
>         second operand, when TARGET_CMOVE.
>         (<maxmin><mode>3_1 splitter): Optimize comparisons against
>         0, 1 and -1 to use "test" instead of "cmp".
>         (*<maxmin>di3_doubleword): Likewise, allow general_operand
>         and enable on TARGET_CMOVE.
>         (peephole2): Convert clearing a register after a flag setting
>         instruction into an xor followed by the original flag setter.
>
Jakub Jelinek July 30, 2020, 1:23 p.m. UTC | #3
On Thu, Jul 30, 2020 at 01:55:03PM +0100, Roger Sayle wrote:
> Now that you mention it, I'm not sure whether PR rtl-optimization 94543
> is a bug at all, but with you and Richard Henderson weighing in, I suspect
> that I must be missing something subtle.
> 
> The initial text of the bug report complains about an AND of 0xff not being
> optimized away, as a zero extension is still present.  I believe that the AND
> with 0xff has been completely optimized away, and the remaining movzwl
> (not movzbl) is the zero extension of the short result to the integer ABI return
> type.  If the result is written as a short to memory, there is no extension 
> (or AND).
> 
> But perhaps the problem isn't with min/and at all, but about whether function
> arguments and return values are guaranteed to be extended on input/return,
> so perhaps this is related to SUBREG_PROMOTED_P and friends?
> 
> Thoughts?

Yeah, given that the RTL performs the computation in HImode rather than
SImode, an extension is needed at least from the RTL POV, because there is
no guarantee what the upper bits of the GPR will contain.

(insn 10 6 9 (set (reg:HI 88)
        (const_int 255 [0xff])) "x.c":1:56 -1
     (nil))
...
(insn 11 9 12 (set (reg:HI 87)
        (if_then_else:HI (leu (reg:CC 17 flags)
                (const_int 0 [0]))
            (reg/v:HI 84 [ x ])
            (reg:HI 88))) "x.c":1:56 -1
     (nil))

If these two insns (or just one of them) were emitted as movw $255, %ax
or cmova %ax, %di rather than movl $255, %eax or cmova %eax, %edi, then
the upper 16 bits would contain the previous content rather than being
cleared.  So, if we want to get rid of the useless zero extension, we'd
need some target-specific pass or hook that would query
instruction attributes to figure out what can or can't be trusted
to be zero.  E.g. the conditions under which movhi_internal uses a mode
attribute of SI vs. HI are quite complicated, but I guess hopefully it is
reliable.  I'm afraid for many insns it is not.
And then there is also the on-the-side knowledge that pretty much all insns
for TARGET_64BIT with GPRs zero extend if they are 32-bit insns, but the
question is whether one can trust the mode attribute for those.

Another possibility would be to move the zero extensions earlier in this
case, i.e. use (set (reg:SI 88) (const_int 255)) and a 32-bit conditional
move rather than a 16-bit one, and in those cases, perhaps with the help
of range information during expansion, we could create SUBREGs and use
SUBREG_PROMOTED_* on those.

	Jakub
Roger Sayle Aug. 3, 2020, 7:47 a.m. UTC | #4
Hi Richard,

As requested, here are some additional test cases for this patch, derived
from the examples in my post.  For minmax-8 and minmax-9, the tests are
run at -Os, to avoid limiting code generation options (for speed) on future
architectures; CSE may or may not always be a win for performance, but
it always produces smaller code.  Tested on x86_64-pc-linux-gnu (with the above patch).

2020-08-03  Roger Sayle  <roger@nextmovesoftware.com>

gcc/testsuite/ChangeLog
	* gcc.target/i386/minmax-8.c: New test.
	* gcc.target/i386/minmax-9.c: New test.
	* gcc.target/i386/minmax-10.c: New test.
	* gcc.target/i386/minmax-11.c: New test.


Cheers,
Roger
--

-----Original Message-----
From: Richard Biener <richard.guenther@gmail.com> 
Sent: 30 July 2020 13:21
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Uros Bizjak <ubizjak@gmail.com>
Subject: Re: [PATCH] x86_64: Integer min/max improvements.

On Thu, Jul 30, 2020 at 1:24 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch tweaks the way that min and max are expanded, so that the 
> semantics of these operations are visible to the early RTL 
> optimization passes, until split into explicit comparison and 
> conditional move instructions. The good news is that i386.md already 
> contains all of the required logic (many thanks to Richard Biener and 
> Uros Bizjak), but this is currently only to enable scalar-to-vector 
> (STV) synthesis of min/max instructions.  This change enables this 
> functionality for all TARGET_CMOVE architectures for SImode, HImode and DImode.
>
> My first attempt to support "min(reg,reg)" as an instruction revealed 
> why this functionality isn't already enabled: In PR rtl-optimization 
> 91154 we end up generating a cmp instead of a test in 
> gcc.target/i386/minmax-3.c which has poor performance on AMD Opteron.  
> My solution to this is to actually support "min(reg,general_operand)" 
> allowing us to inspect any immediate constant at the point this 
> operation is split.  This allows us to use "test" instructions for min/max against 0, 1 and -1.
> As an added benefit it allows us to CSE large immediate constants, 
> reducing code size.
>
> Previously, "int foo(int x) { return max(x,12345); }" would generate:
>
> foo:    cmpl    $12345, %edi
>         movl    $12345, %eax
>         cmovge  %edi, %eax
>         ret
>
> with this patch we instead generate:
>
> foo:    movl    $12345, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> I've also included a peephole2 to avoid the "movl $0,%eax" 
> instructions being produced by the register allocator.  Materializing 
> constants as late as possible reduces register pressure, but for 
> const0_rtx on x86, it's preferable to use "xor" by moving this set 
> from between a flags setting operation and its use.  This should also 
> help instruction macro fusion on some microarchitectures, where 
> test/cmp and the following instruction can sometimes be combined.
>
> Previously, "int foo(int x) { return max(x,0); }" would generate:
>
> foo:    testl   %edi, %edi
>         movl    $0, %eax
>         cmovns  %edi, %eax
>         ret
>
> with this patch we instead generate:
> foo:    xorl    %eax, %eax
>         testl   %edi, %edi
>         cmovns  %edi, %eax
>         ret
>
> The benefits of having min/max explicit at the RTL level can be seen 
> from compiling the following example with "-O2 -fno-tree-reassoc".
>
>
> #define max(a,b) (((a) > (b))? (a) : (b))
>
> int foo(int x)
> {
>   int y = max(x,5);
>   return max(y,10);
> }
>
> where GCC currently produces:
>
> foo:    cmpl    $5, %edi
>         movl    $5, %eax
>         movl    $10, %edx
>         cmovge  %edi, %eax
>         cmpl    $10, %eax
>         cmovl   %edx, %eax
>         ret
>
> and with this patch it instead now produces:
>
> foo:    movl    $10, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> The original motivation was from looking at a performance critical 
> function in a quantum mechanics code, that performed MIN_EXPR and 
> MAX_EXPR of the same arguments (effectively a two-element sort), where 
> GCC was performing the comparison twice.  I'd hoped that it might be 
> possible to fuse these together, perhaps in combine, but this patch is 
> an intermediate step towards that goal.
>
> This patch has been tested on x86_64-pc-linux-gnu with a make 
> bootstrap followed by make -k check with no new regressions.
> Ok for mainline?

May I ask for a testcase or two?  You have good examples above.
Btw, I'm sure some variants of those are in bugzilla as well.

Richard.

>
> 2020-07-30  Roger Sayle  <roger@nextmovesoftware.com>
>
>         * config/i386/i386.md (MAXMIN_IMODE): No longer needed.
>         (<maxmin><mode>3):  Support SWI248 and general_operand for
>         second operand, when TARGET_CMOVE.
>         (<maxmin><mode>3_1 splitter): Optimize comparisons against
>         0, 1 and -1 to use "test" instead of "cmp".
>         (*<maxmin>di3_doubleword): Likewise, allow general_operand
>         and enable on TARGET_CMOVE.
>         (peephole2): Convert clearing a register after a flag setting
>         instruction into an xor followed by the original flag setter.
>
>
> Many thanks in advance.  Almost all of the credit goes to Uros and 
> Richard for already implementing this feature, I've just copied the 
> transformations from optab expansion, that allow it to be enabled 
> without penalty (this late in the compilation).
>
> Roger
> --
> Roger Sayle
> NextMove Software
> Cambridge, UK
>
diff --git a/gcc/testsuite/gcc.target/i386/minmax-8.c b/gcc/testsuite/gcc.target/i386/minmax-8.c
new file mode 100644
index 0000000..1f7e466
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-8.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int foo(int x)
+{
+  return max(x,12345);
+}
+
+int bar(int x)
+{
+  return min(x,87654);
+}
+
+/* { dg-final { scan-assembler-times "12345" 1 } } */
+/* { dg-final { scan-assembler-times "87654" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/minmax-9.c b/gcc/testsuite/gcc.target/i386/minmax-9.c
new file mode 100644
index 0000000..3b94023
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-9.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int foo(int x)
+{
+  return max(x,0);
+}
+
+int bar(int x)
+{
+  return min(x,0);
+}
+
+unsigned int baz(unsigned int x)
+{
+  return min(x,1);
+}
+
+/* { dg-final { scan-assembler-times "xor" 3 } } */
+/* { dg-final { scan-assembler-times "test" 3 } } */
diff --git a/gcc/testsuite/gcc.target/i386/minmax-10.c b/gcc/testsuite/gcc.target/i386/minmax-10.c
new file mode 100644
index 0000000..b044462
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-10.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int smax1(int x)
+{
+  return max(x,1);
+}
+
+int smin1(int x)
+{
+  return min(x,1);
+}
+
+int smaxm1(int x)
+{
+  return max(x,-1);
+}
+
+int sminm1(int x)
+{
+  return min(x,-1);
+}
+
+unsigned int umax1(unsigned int x)
+{
+  return max(x,1);
+}
+
+unsigned int umin1(unsigned int x)
+{
+  return min(x,1);
+}
+
+/* { dg-final { scan-assembler-times "test" 6 } } */
+/* { dg-final { scan-assembler-not "cmp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/minmax-11.c b/gcc/testsuite/gcc.target/i386/minmax-11.c
new file mode 100644
index 0000000..a8c2df5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-11.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-tree-reassoc" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+
+int foo(int x)
+{
+  int y = max(x,12345);
+  return max(y,87654);
+}
+
+/* { dg-final { scan-assembler-not "12345" } } */
Uros Bizjak Aug. 3, 2020, 10:29 a.m. UTC | #5
On Thu, Jul 30, 2020 at 1:23 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch tweaks the way that min and max are expanded, so that the
> semantics of these operations are visible to the early RTL optimization
> passes, until split into explicit comparison and conditional move
> instructions. The good news is that i386.md already contains all of
> the required logic (many thanks to Richard Biener and Uros Bizjak),
> but this is currently only to enable scalar-to-vector (STV) synthesis
> of min/max instructions.  This change enables this functionality for
> all TARGET_CMOVE architectures for SImode, HImode and DImode.
>
> My first attempt to support "min(reg,reg)" as an instruction revealed
> why this functionality isn't already enabled: In PR rtl-optimization 91154
> we end up generating a cmp instead of a test in gcc.target/i386/minmax-3.c
> which has poor performance on AMD Opteron.  My solution to this is to
> actually support "min(reg,general_operand)" allowing us to inspect
> any immediate constant at the point this operation is split.  This
> allows us to use "test" instructions for min/max against 0, 1 and -1.
> As an added benefit it allows us to CSE large immediate constants,
> reducing code size.
>
> Previously, "int foo(int x) { return max(x,12345); }" would generate:
>
> foo:    cmpl    $12345, %edi
>         movl    $12345, %eax
>         cmovge  %edi, %eax
>         ret
>
> with this patch we instead generate:
>
> foo:    movl    $12345, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> I've also included a peephole2 to avoid the "movl $0,%eax" instructions
> being produced by the register allocator.  Materializing constants as
> late as possible reduces register pressure, but for const0_rtx on x86,
> it's preferable to use "xor" by moving this set from between a flags
> setting operation and its use.  This should also help instruction macro
> fusion on some microarchitectures, where test/cmp and the following
> instruction can sometimes be combined.
>
> Previously, "int foo(int x) { return max(x,0); }" would generate:
>
> foo:    testl   %edi, %edi
>         movl    $0, %eax
>         cmovns  %edi, %eax
>         ret
>
> with this patch we instead generate:
> foo:    xorl    %eax, %eax
>         testl   %edi, %edi
>         cmovns  %edi, %eax
>         ret
>
> The benefits of having min/max explicit at the RTL level can be seen
> from compiling the following example with "-O2 -fno-tree-reassoc".
>
>
> #define max(a,b) (((a) > (b))? (a) : (b))
>
> int foo(int x)
> {
>   int y = max(x,5);
>   return max(y,10);
> }
>
> where GCC currently produces:
>
> foo:    cmpl    $5, %edi
>         movl    $5, %eax
>         movl    $10, %edx
>         cmovge  %edi, %eax
>         cmpl    $10, %eax
>         cmovl   %edx, %eax
>         ret
>
> and with this patch it instead now produces:
>
> foo:    movl    $10, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> The original motivation was from looking at a performance critical
> function in a quantum mechanics code, that performed MIN_EXPR and
> MAX_EXPR of the same arguments (effectively a two-element sort),
> where GCC was performing the comparison twice.  I'd hoped that it
> might be possible to fuse these together, perhaps in combine, but
> this patch is an intermediate step towards that goal.
>
> This patch has been tested on x86_64-pc-linux-gnu with a make
> bootstrap followed by make -k check with no new regressions.
> Ok for mainline?
>
>
> 2020-07-30  Roger Sayle  <roger@nextmovesoftware.com>
>
>         * config/i386/i386.md (MAXMIN_IMODE): No longer needed.
>         (<maxmin><mode>3):  Support SWI248 and general_operand for
>         second operand, when TARGET_CMOVE.
>         (<maxmin><mode>3_1 splitter): Optimize comparisons against
>         0, 1 and -1 to use "test" instead of "cmp".
>         (*<maxmin>di3_doubleword): Likewise, allow general_operand
>         and enable on TARGET_CMOVE.
>         (peephole2): Convert clearing a register after a flag setting
>         instruction into an xor followed by the original flag setter.
>
>
> Many thanks in advance.  Almost all of the credit goes to Uros and
> Richard for already implementing this feature, I've just copied the
> transformations from optab expansion, that allow it to be enabled
> without penalty (this late in the compilation).

LGTM, modulo the below part:

+;; Avoid clearing a register between a flags setting comparison and its use,
+;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax".
+(define_peephole2
+  [(set (reg FLAGS_REG) (match_operand 0))
+   (set (match_operand:SWI48 1 "register_operand") (const_int 0))]
+  "peep2_regno_dead_p (0, FLAGS_REG)
+   && !reg_overlap_mentioned_p (operands[1], operands[0])"
+  [(parallel [(set (match_dup 1) (const_int 0))
+      (clobber (reg:CC FLAGS_REG))])
+   (set (match_dup 2) (match_dup 0))]
+{
+  operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG);
+})

Please use ix86_expand_clear (operands[1]); in the preparation
statement instead of emitting RTX pattern. This will take
TARGET_USE_MOV0 and size optimizations into account.

Uros.
Roger Sayle Aug. 6, 2020, 8:40 a.m. UTC | #6
Hi Uros,

Many thanks for the review and feedback.  Here's the final version as committed,
with both the test cases requested by Richard Biener and your suggestion/request
to use ix86_expand_clear.  Tested again on x86_64-pc-linux-gnu.

Thank you again for the fantastic ix86_expand_clear pointer, which cleared up one
of my two technical questions and allowed this peephole2 to also apply to
QImode and HImode MOV0s, where my original version was limited to SImode and
DImode.

My two questions were: (i) why isn't a QImode set of 0 with a flags clobber a
recognized instruction?  I'd assume that on some architectures "xorb %dl,%dl"
might be an appropriate sequence to use.  This is mostly answered by the use of
ix86_expand_clear, which intelligently selects the correct form, but the lack of
a *movqi_xor was previously odd.
(ii) My other question was that, despite my best efforts, I couldn't seem to
convince GCC to generate/use a *movsi_or to load the constant -1.  It was just a
curiosity, but it would benefit the smaxm1 and sminm1 examples in the new
i386/minmax-10.c test program.

Many thanks again for your help.
Roger
--

-----Original Message-----
From: Uros Bizjak <ubizjak@gmail.com> 
Sent: 03 August 2020 11:29
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] x86_64: Integer min/max improvements.

On Thu, Jul 30, 2020 at 1:23 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch tweaks the way that min and max are expanded, so that the 
> semantics of these operations are visible to the early RTL 
> optimization passes, until split into explicit comparison and 
> conditional move instructions. The good news is that i386.md already 
> contains all of the required logic (many thanks to Richard Biener and 
> Uros Bizjak), but this is currently only to enable scalar-to-vector 
> (STV) synthesis of min/max instructions.  This change enables this 
> functionality for all TARGET_CMOVE architectures for SImode, HImode and DImode.
>
> My first attempt to support "min(reg,reg)" as an instruction revealed 
> why this functionality isn't already enabled: In PR rtl-optimization 
> 91154 we end up generating a cmp instead of a test in 
> gcc.target/i386/minmax-3.c which has poor performance on AMD Opteron.  
> My solution to this is to actually support "min(reg,general_operand)" 
> allowing us to inspect any immediate constant at the point this 
> operation is split.  This allows us to use "test" instructions for min/max against 0, 1 and -1.
> As an added benefit it allows us to CSE large immediate constants, 
> reducing code size.
>
> Previously, "int foo(int x) { return max(x,12345); }" would generate:
>
> foo:    cmpl    $12345, %edi
>         movl    $12345, %eax
>         cmovge  %edi, %eax
>         ret
>
> with this patch we instead generate:
>
> foo:    movl    $12345, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> I've also included a peephole2 to avoid the "movl $0,%eax" 
> instructions being produced by the register allocator.  Materializing 
> constants as late as possible reduces register pressure, but for 
> const0_rtx on x86, it's preferable to use "xor" by moving this set 
> from between a flags setting operation and its use.  This should also 
> help instruction macro fusion on some microarchitectures, where 
> test/cmp and the following instruction can sometimes be combined.
>
> Previously, "int foo(int x) { return max(x,0); }" would generate:
>
> foo:    testl   %edi, %edi
>         movl    $0, %eax
>         cmovns  %edi, %eax
>         ret
>
> with this patch we instead generate:
> foo:    xorl    %eax, %eax
>         testl   %edi, %edi
>         cmovns  %edi, %eax
>         ret
>
> The benefits of having min/max explicit at the RTL level can be seen 
> from compiling the following example with "-O2 -fno-tree-reassoc".
>
>
> #define max(a,b) (((a) > (b))? (a) : (b))
>
> int foo(int x)
> {
>   int y = max(x,5);
>   return max(y,10);
> }
>
> where GCC currently produces:
>
> foo:    cmpl    $5, %edi
>         movl    $5, %eax
>         movl    $10, %edx
>         cmovge  %edi, %eax
>         cmpl    $10, %eax
>         cmovl   %edx, %eax
>         ret
>
> and with this patch it instead now produces:
>
> foo:    movl    $10, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> The original motivation was from looking at a performance critical 
> function in a quantum mechanics code, that performed MIN_EXPR and 
> MAX_EXPR of the same arguments (effectively a two-element sort), where 
> GCC was performing the comparison twice.  I'd hoped that it might be 
> possible to fuse these together, perhaps in combine, but this patch is 
> an intermediate step towards that goal.
>
> This patch has been tested on x86_64-pc-linux-gnu with a make 
> bootstrap followed by make -k check with no new regressions.
> Ok for mainline?
>
>
> 2020-07-30  Roger Sayle  <roger@nextmovesoftware.com>
>
>         * config/i386/i386.md (MAXMIN_IMODE): No longer needed.
>         (<maxmin><mode>3):  Support SWI248 and general_operand for
>         second operand, when TARGET_CMOVE.
>         (<maxmin><mode>3_1 splitter): Optimize comparisons against
>         0, 1 and -1 to use "test" instead of "cmp".
>         (*<maxmin>di3_doubleword): Likewise, allow general_operand
>         and enable on TARGET_CMOVE.
>         (peephole2): Convert clearing a register after a flag setting
>         instruction into an xor followed by the original flag setter.
>
>
> Many thanks in advance.  Almost all of the credit goes to Uros and 
> Richard for already implementing this feature, I've just copied the 
> transformations from optab expansion, that allow it to be enabled 
> without penalty (this late in the compilation).

LGTM, modulo the below part:

+;; Avoid clearing a register between a flags setting comparison and its 
+use, ;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax".
+(define_peephole2
+  [(set (reg FLAGS_REG) (match_operand 0))
+   (set (match_operand:SWI48 1 "register_operand") (const_int 0))]
+  "peep2_regno_dead_p (0, FLAGS_REG)
+   && !reg_overlap_mentioned_p (operands[1], operands[0])"
+  [(parallel [(set (match_dup 1) (const_int 0))
+      (clobber (reg:CC FLAGS_REG))])
+   (set (match_dup 2) (match_dup 0))]
+{
+  operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG);
+})

Please use ix86_expand_clear (operands[1]); in the preparation statement instead of emitting RTX pattern. This will take
TARGET_USE_MOV0 and size optimizations into account.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index b24a455..4e916bf 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -18809,45 +18809,68 @@
 
 ;; min/max patterns
 
-(define_mode_iterator MAXMIN_IMODE
-  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
 (define_code_attr maxmin_rel
   [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
 
 (define_expand "<code><mode>3"
   [(parallel
-    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
-	  (maxmin:MAXMIN_IMODE
-	    (match_operand:MAXMIN_IMODE 1 "register_operand")
-	    (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+    [(set (match_operand:SWI248 0 "register_operand")
+	  (maxmin:SWI248
+	    (match_operand:SWI248 1 "register_operand")
+	    (match_operand:SWI248 2 "general_operand")))
      (clobber (reg:CC FLAGS_REG))])]
-  "TARGET_STV")
+  "TARGET_CMOVE")
 
 (define_insn_and_split "*<code><mode>3_1"
-  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
-	(maxmin:MAXMIN_IMODE
-	  (match_operand:MAXMIN_IMODE 1 "register_operand")
-	  (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+  [(set (match_operand:SWI248 0 "register_operand")
+	(maxmin:SWI248
+	  (match_operand:SWI248 1 "register_operand")
+	  (match_operand:SWI248 2 "general_operand")))
    (clobber (reg:CC FLAGS_REG))]
-  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
+  "TARGET_CMOVE
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
   [(set (match_dup 0)
-	(if_then_else:MAXMIN_IMODE (match_dup 3)
+	(if_then_else:SWI248 (match_dup 3)
 	  (match_dup 1)
 	  (match_dup 2)))]
 {
   machine_mode mode = <MODE>mode;
+  rtx cmp_op = operands[2];
 
-  if (!register_operand (operands[2], mode))
-    operands[2] = force_reg (mode, operands[2]);
+  if (!register_operand (cmp_op, mode))
+    operands[2] = force_reg (mode, cmp_op);
 
   enum rtx_code code = <maxmin_rel>;
-  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
+
+  if (cmp_op == const1_rtx)
+    {
+      /* Convert smax (x, 1) into (x > 0 ? x : 1).
+	 Convert umax (x, 1) into (x != 0 ? x : 1).
+	 Convert ?min (x, 1) into (x <= 0 ? x : 1).  */
+      cmp_op = const0_rtx;
+      if (code == GE)
+	code = GT;
+      else if (code == GEU)
+	code = NE;
+    }
+  /* Convert smin (x, -1) into (x < 0 ? x : -1).  */
+  else if (cmp_op == constm1_rtx && code == LE)
+    {
+      cmp_op = const0_rtx;
+      code = LT;
+    }
+  /* Convert smax (x, -1) into (x >= 0 ? x : -1).  */
+  else if (cmp_op == constm1_rtx && code == GE)
+    cmp_op = const0_rtx;
+  else if (cmp_op != const0_rtx)
+    cmp_op = operands[2];
+
+  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], cmp_op);
   rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
 
-  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
+  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], cmp_op);
   emit_insn (gen_rtx_SET (flags, tmp));
 
   operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
@@ -18856,9 +18879,9 @@
 (define_insn_and_split "*<code>di3_doubleword"
   [(set (match_operand:DI 0 "register_operand")
 	(maxmin:DI (match_operand:DI 1 "register_operand")
-		   (match_operand:DI 2 "nonimmediate_operand")))
+		   (match_operand:DI 2 "general_operand")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
+  "!TARGET_64BIT && TARGET_CMOVE
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
@@ -18910,6 +18933,19 @@
       gcc_unreachable ();
     }
 })
+
+;; Avoid clearing a register between a flags setting comparison and its use,
+;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax".
+(define_peephole2
+  [(set (reg FLAGS_REG) (match_operand 0))
+   (set (match_operand:SWI 1 "register_operand") (const_int 0))]
+  "peep2_regno_dead_p (0, FLAGS_REG)
+   && !reg_overlap_mentioned_p (operands[1], operands[0])"
+   [(set (match_dup 2) (match_dup 0))]
+{
+  operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG);
+  ix86_expand_clear (operands[1]);
+})
 
 ;; Misc patterns (?)
 
diff --git a/gcc/testsuite/gcc.target/i386/minmax-8.c b/gcc/testsuite/gcc.target/i386/minmax-8.c
new file mode 100644
index 0000000..1f7e466
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-8.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int foo(int x)
+{
+  return max(x,12345);
+}
+
+int bar(int x)
+{
+  return min(x,87654);
+}
+
+/* { dg-final { scan-assembler-times "12345" 1 } } */
+/* { dg-final { scan-assembler-times "87654" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/minmax-9.c b/gcc/testsuite/gcc.target/i386/minmax-9.c
new file mode 100644
index 0000000..3b94023
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-9.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int foo(int x)
+{
+  return max(x,0);
+}
+
+int bar(int x)
+{
+  return min(x,0);
+}
+
+unsigned int baz(unsigned int x)
+{
+  return min(x,1);
+}
+
+/* { dg-final { scan-assembler-times "xor" 3 } } */
+/* { dg-final { scan-assembler-times "test" 3 } } */
diff --git a/gcc/testsuite/gcc.target/i386/minmax-10.c b/gcc/testsuite/gcc.target/i386/minmax-10.c
new file mode 100644
index 0000000..b044462
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-10.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int smax1(int x)
+{
+  return max(x,1);
+}
+
+int smin1(int x)
+{
+  return min(x,1);
+}
+
+int smaxm1(int x)
+{
+  return max(x,-1);
+}
+
+int sminm1(int x)
+{
+  return min(x,-1);
+}
+
+unsigned int umax1(unsigned int x)
+{
+  return max(x,1);
+}
+
+unsigned int umin1(unsigned int x)
+{
+  return min(x,1);
+}
+
+/* { dg-final { scan-assembler-times "test" 6 } } */
+/* { dg-final { scan-assembler-not "cmp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/minmax-11.c b/gcc/testsuite/gcc.target/i386/minmax-11.c
new file mode 100644
index 0000000..a8c2df5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-11.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-tree-reassoc" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+
+int foo(int x)
+{
+  int y = max(x,12345);
+  return max(y,87654);
+}
+
+/* { dg-final { scan-assembler-not "12345" } } */
Jakub Jelinek Aug. 6, 2020, 12:27 p.m. UTC | #7
On Thu, Aug 06, 2020 at 09:40:49AM +0100, Roger Sayle wrote:

This test fails on i686-linux (or x86_64-linux when testing with -m32).
make check-gcc RUNTESTFLAGS='--target_board=unix\{-m32,-m64\} i386.exp=minmax-9.c'
Running /usr/src/gcc/gcc/testsuite/gcc.target/i386/i386.exp ...
FAIL: gcc.target/i386/minmax-9.c scan-assembler-times test 3

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/minmax-9.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Os" } */
> +
> +#define max(a,b) (((a) > (b))? (a) : (b))
> +#define min(a,b) (((a) < (b))? (a) : (b))
> +
> +int foo(int x)
> +{
> +  return max(x,0);
> +}
> +
> +int bar(int x)
> +{
> +  return min(x,0);
> +}
> +
> +unsigned int baz(unsigned int x)
> +{
> +  return min(x,1);
> +}
> +
> +/* { dg-final { scan-assembler-times "xor" 3 } } */
> +/* { dg-final { scan-assembler-times "test" 3 } } */

	Jakub
Roger Sayle Aug. 6, 2020, 2:33 p.m. UTC | #8
Sorry for the inconvenience.  I've just added the obligatory
/* { dg-do compile { target { ! ia32 } } } */
to this new gcc.target/i386 test to resolve this failure.
Please let me know if there's a better fix.
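
For clarity, the top of the adjusted minmax-9.c now reads roughly as
follows (a sketch of just the directives; the functions are unchanged):

/* { dg-do compile { target { ! ia32 } } } */
/* { dg-options "-Os" } */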

2020-08-06  Roger Sayle  <roger@nextmovesoftware.com>

gcc/testsuite/ChangeLog
	* gcc.target/i386/minmax-9.c: Restrict test to !ia32.

My apologies again.
Roger
--

-----Original Message-----
From: Jakub Jelinek <jakub@redhat.com> 
Sent: 06 August 2020 13:28
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: 'Uros Bizjak' <ubizjak@gmail.com>; 'GCC Patches'
<gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] x86_64: Integer min/max improvements.

On Thu, Aug 06, 2020 at 09:40:49AM +0100, Roger Sayle wrote:

This test fails on i686-linux (or x86_64-linux when testing with -m32).
make check-gcc RUNTESTFLAGS='--target_board=unix\{-m32,-m64\}
i386.exp=minmax-9.c'
Running /usr/src/gcc/gcc/testsuite/gcc.target/i386/i386.exp ...
FAIL: gcc.target/i386/minmax-9.c scan-assembler-times test 3

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/minmax-9.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Os" } */
> +
> +#define max(a,b) (((a) > (b))? (a) : (b)) #define min(a,b) (((a) < 
> +(b))? (a) : (b))
> +
> +int foo(int x)
> +{
> +  return max(x,0);
> +}
> +
> +int bar(int x)
> +{
> +  return min(x,0);
> +}
> +
> +unsigned int baz(unsigned int x)
> +{
> +  return min(x,1);
> +}
> +
> +/* { dg-final { scan-assembler-times "xor" 3 } } */
> +/* { dg-final { scan-assembler-times "test" 3 } } */

	Jakub
Uros Bizjak Aug. 7, 2020, 1:21 p.m. UTC | #9
On Thu, Aug 6, 2020 at 10:40 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
>
> Many thanks for the review and feedback.  Here's the final version as committed,
> with both the test cases requested by Richard Biener and your suggestion/request
> to use ix86_expand_clear.  Tested again on x86_64-pc-linux-gnu.
>
> Thank you again for the fantastic ix86_expand_clear pointer, which cleared up one
> of two technical questions I had, and allowed this peephole2 to now also apply to
> QImode and HImode MOV0s, where my original version was limited to SImode and
> DImode.
>
> My two questions were (i) why a QImode set of 0 with a flags clobber isn't a recognized
> instruction?  I'd assume that on some architectures "xorb dl,dl" might be an appropriate
> sequence to use.  This is mostly answered by the use of ix86_expand_clear, which
> intelligently selects the correct form, but the lack of a *movqi_xor was previously odd.

The XOR transformation is used mostly for code size, where we have:

   0:   b0 00                   mov    $0x0,%al
   2:   30 c0                   xor    %al,%al
   4:   bb 00 00 00 00          mov    $0x0,%ebx
   9:   31 db                   xor    %ebx,%ebx

So, as can be seen from the above example, there is no benefit for
QImode, whereas 3 bytes can be saved for SImode.

> (ii) My other question, was that despite my best efforts I couldn't seem to convince GCC
> to generate/use a *movsi_or to load the constant -1.  It was just a curiosity, but this
> would affect/benefit the smaxm1 and sminm1 examples in the new i386/minmax-10.c

This transformation is enabled only for -Os or

DEF_TUNE (X86_TUNE_MOVE_M1_VIA_OR, "move_m1_via_or", m_PENT | m_LAKEMONT)

However, clearing the register with xor reg,reg also prevents a partial
register stall, which or $-1, reg does not.

Uros.

Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index b24a455..717d55f 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -18809,45 +18809,68 @@ 
 
 ;; min/max patterns
 
-(define_mode_iterator MAXMIN_IMODE
-  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
 (define_code_attr maxmin_rel
   [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
 
 (define_expand "<code><mode>3"
   [(parallel
-    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
-	  (maxmin:MAXMIN_IMODE
-	    (match_operand:MAXMIN_IMODE 1 "register_operand")
-	    (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+    [(set (match_operand:SWI248 0 "register_operand")
+	  (maxmin:SWI248
+	    (match_operand:SWI248 1 "register_operand")
+	    (match_operand:SWI248 2 "general_operand")))
      (clobber (reg:CC FLAGS_REG))])]
-  "TARGET_STV")
+  "TARGET_CMOVE")
 
 (define_insn_and_split "*<code><mode>3_1"
-  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
-	(maxmin:MAXMIN_IMODE
-	  (match_operand:MAXMIN_IMODE 1 "register_operand")
-	  (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+  [(set (match_operand:SWI248 0 "register_operand")
+	(maxmin:SWI248
+	  (match_operand:SWI248 1 "register_operand")
+	  (match_operand:SWI248 2 "general_operand")))
    (clobber (reg:CC FLAGS_REG))]
-  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
+  "TARGET_CMOVE
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
   [(set (match_dup 0)
-	(if_then_else:MAXMIN_IMODE (match_dup 3)
+	(if_then_else:SWI248 (match_dup 3)
 	  (match_dup 1)
 	  (match_dup 2)))]
 {
   machine_mode mode = <MODE>mode;
+  rtx cmp_op = operands[2];
 
-  if (!register_operand (operands[2], mode))
-    operands[2] = force_reg (mode, operands[2]);
+  if (!register_operand (cmp_op, mode))
+    operands[2] = force_reg (mode, cmp_op);
 
   enum rtx_code code = <maxmin_rel>;
-  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
+
+  if (cmp_op == const1_rtx)
+    {
+      /* Convert smax (x, 1) into (x > 0 ? x : 1).
+	 Convert umax (x, 1) into (x != 0 ? x : 1).
+	 Convert ?min (x, 1) into (x <= 0 ? x : 1).  */
+      cmp_op = const0_rtx;
+      if (code == GE)
+	code = GT;
+      else if (code == GEU)
+	code = NE;
+    }
+  /* Convert smin (x, -1) into (x < 0 ? x : -1).  */
+  else if (cmp_op == constm1_rtx && code == LE)
+    {
+      cmp_op = const0_rtx;
+      code = LT;
+    }
+  /* Convert smax (x, -1) into (x >= 0 ? x : -1).  */
+  else if (cmp_op == constm1_rtx && code == GE)
+    cmp_op = const0_rtx;
+  else if (cmp_op != const0_rtx)
+    cmp_op = operands[2];
+
+  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], cmp_op);
   rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
 
-  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
+  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], cmp_op);
   emit_insn (gen_rtx_SET (flags, tmp));
 
   operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
@@ -18856,9 +18879,9 @@ 
 (define_insn_and_split "*<code>di3_doubleword"
   [(set (match_operand:DI 0 "register_operand")
 	(maxmin:DI (match_operand:DI 1 "register_operand")
-		   (match_operand:DI 2 "nonimmediate_operand")))
+		   (match_operand:DI 2 "general_operand")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
+  "!TARGET_64BIT && TARGET_CMOVE
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
@@ -18910,6 +18933,20 @@ 
       gcc_unreachable ();
     }
 })
+
+;; Avoid clearing a register between a flags setting comparison and its use,
+;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax".
+(define_peephole2
+  [(set (reg FLAGS_REG) (match_operand 0))
+   (set (match_operand:SWI48 1 "register_operand") (const_int 0))]
+  "peep2_regno_dead_p (0, FLAGS_REG)
+   && !reg_overlap_mentioned_p (operands[1], operands[0])"
+  [(parallel [(set (match_dup 1) (const_int 0))
+	      (clobber (reg:CC FLAGS_REG))])
+   (set (match_dup 2) (match_dup 0))]
+{
+  operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG);
+})
 
 ;; Misc patterns (?)