diff mbox series

[x86] PR target/91681: zero_extendditi2 pattern for more optimizations.

Message ID 0af601d8772f$43ba22f0$cb2e68d0$@nextmovesoftware.com
State New
Headers show
Series [x86] PR target/91681: zero_extendditi2 pattern for more optimizations. | expand

Commit Message

Roger Sayle June 3, 2022, 9:49 a.m. UTC
Technically, PR target/91681 has already been resolved; we now recognize the
highpart multiplication at the tree-level, we no longer use the stack, and
we currently generate the same number of instructions as LLVM.  However, it
is still possible to do better, the current x86_64 code to generate a double
word addition of a zero extended operand, looks like:

        xorl    %r11d, %r11d
        addq    %r10, %rax
        adcq    %r11, %rdx

when it's possible (as LLVM does) to use an immediate constant:

        addq    %r10, %rax
        adcq    $0, %rdx

To do this, the backend required one or two simple changes, that
then themselves required one or two more obscure tweaks.

The simple starting point is to define a zero_extendditi2 pattern,
for zero extension from DImode to TImode on TARGET_64BIT that is
split after reload.  Double word (TImode) addition/subtraction is
split after reload, so that constrains when things should happen.

With zero extension now visible to combine, we add two new
define_insn_and_split that add/subtract a zero extended operand
in double word mode.  These apply to both 32-bit and 64-bit code
generation, to produce adc $0 and sbb $0.

The first strange tweak is that these new patterns interfere with
the optimization that recognizes DW:DI = (HI:SI<<32)+LO:SI as a pair
of register moves, or more accurately the combine splitter no longer
triggers as we're now converting two instructions into two instructions
(not three instructions into two instructions).  This is easily
repaired (and extended to handle TImode) by changing from a pair
of define_split (that handle operand commutativity) to a set of
four define_insn_and_split (again to handle operand commutativity).

The other/final strange tweak that the above splitters now interfere
with AVX512's kunpckdq instruction which is defined as identical RTL,
DW:DI = (HI:SI<<32)|zero_extend(LO:SI).  To distinguish this, and
also avoid AVX512 mask registers being used by reload to perform
SImode scalar shifts, I've added the explicit (unspec UNSPEC_MASKOP)
to the unpack mask operations, which matches what sse.md does for
the other mask specific (logic) operations.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-06-03  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR target/91681
        * config/i386/i386.md (zero_extendditi2): New define_insn_and_split.
        (*add<dwi>3_doubleword_zext): New define_insn_and_split.
        (*sub<dwi>3_doubleword_zext): New define_insn_and_split.
        (*concat<mode><dwi>3_1): New define_insn_and_split replacing
        previous define_split for implementing DST = (HI<<32)|LO as
        pair of move instructions, setting lopart and hipart.
        (*concat<mode><dwi>3_2): Likewise.
        (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended.
        (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended.
        * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
        (kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
        (kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
        (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP
unspec.
        (vec_pack_trunc_<mode>): Likewise.

gcc/testsuite/ChangeLog
        PR target/91681
        * g++.target/i386/pr91681.C: New test case (from the PR).
        * gcc.target/i386/pr91681-1.c: New int128 test case.
        * gcc.target/i386/pr91681-2.c: Likewise.
        * gcc.target/i386/pr91681-3.c: Likewise, but for ia32.

Thanks in advance,
Roger
--

Comments

Uros Bizjak June 3, 2022, 10:08 a.m. UTC | #1
On Fri, Jun 3, 2022 at 11:49 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Technically, PR target/91681 has already been resolved; we now recognize the
> highpart multiplication at the tree-level, we no longer use the stack, and
> we currently generate the same number of instructions as LLVM.  However, it
> is still possible to do better, the current x86_64 code to generate a double
> word addition of a zero extended operand, looks like:
>
>         xorl    %r11d, %r11d
>         addq    %r10, %rax
>         adcq    %r11, %rdx
>
> when it's possible (as LLVM does) to use an immediate constant:
>
>         addq    %r10, %rax
>         adcq    $0, %rdx
>
> To do this, the backend required one or two simple changes, that
> then themselves required one or two more obscure tweaks.
>
> The simple starting point is to define a zero_extendditi2 pattern,
> for zero extension from DImode to TImode on TARGET_64BIT that is
> split after reload.  Double word (TImode) addition/subtraction is
> split after reload, so that constrains when things should happen.
>
> With zero extension now visible to combine, we add two new
> define_insn_and_split that add/subtract a zero extended operand
> in double word mode.  These apply to both 32-bit and 64-bit code
> generation, to produce adc $0 and sbb $0.
>
> The first strange tweak is that these new patterns interfere with
> the optimization that recognizes DW:DI = (HI:SI<<32)+LO:SI as a pair
> of register moves, or more accurately the combine splitter no longer
> triggers as we're now converting two instructions into two instructions
> (not three instructions into two instructions).  This is easily
> repaired (and extended to handle TImode) by changing from a pair
> of define_split (that handle operand commutativity) to a set of
> four define_insn_and_split (again to handle operand commutativity).
>
> The other/final strange tweak that the above splitters now interfere
> with AVX512's kunpckdq instruction which is defined as identical RTL,
> DW:DI = (HI:SI<<32)|zero_extend(LO:SI).  To distinguish this, and
> also avoid AVX512 mask registers being used by reload to perform
> SImode scalar shifts, I've added the explicit (unspec UNSPEC_MASKOP)
> to the unpack mask operations, which matches what sse.md does for
> the other mask specific (logic) operations.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  Ok for mainline?
>
>
> 2022-06-03  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR target/91681
>         * config/i386/i386.md (zero_extendditi2): New define_insn_and_split.
>         (*add<dwi>3_doubleword_zext): New define_insn_and_split.
>         (*sub<dwi>3_doubleword_zext): New define_insn_and_split.
>         (*concat<mode><dwi>3_1): New define_insn_and_split replacing
>         previous define_split for implementing DST = (HI<<32)|LO as
>         pair of move instructions, setting lopart and hipart.
>         (*concat<mode><dwi>3_2): Likewise.
>         (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended.
>         (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended.
>         * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
>         (kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
>         (kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
>         (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP
> unspec.
>         (vec_pack_trunc_<mode>): Likewise.
>
> gcc/testsuite/ChangeLog
>         PR target/91681
>         * g++.target/i386/pr91681.C: New test case (from the PR).
>         * gcc.target/i386/pr91681-1.c: New int128 test case.
>         * gcc.target/i386/pr91681-2.c: Likewise.
>         * gcc.target/i386/pr91681-3.c: Likewise, but for ia32.

+  "MEM_P (operands[0]) ? rtx_equal_p (operands[0], operands[1])
+ && !MEM_P (operands[2])
+       : !MEM_P (operands[1])"

Can we use ix86_binary_operator_ok (UNKNOWN, ...mode..., operands) here instead?

(UNKNOWN RTX code is used to prevent unwanted optimization with
commutative operands).

Uros.
Uros Bizjak June 3, 2022, 10:15 a.m. UTC | #2
On Fri, Jun 3, 2022 at 12:08 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Fri, Jun 3, 2022 at 11:49 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
> >
> >
> > Technically, PR target/91681 has already been resolved; we now recognize the
> > highpart multiplication at the tree-level, we no longer use the stack, and
> > we currently generate the same number of instructions as LLVM.  However, it
> > is still possible to do better, the current x86_64 code to generate a double
> > word addition of a zero extended operand, looks like:
> >
> >         xorl    %r11d, %r11d
> >         addq    %r10, %rax
> >         adcq    %r11, %rdx
> >
> > when it's possible (as LLVM does) to use an immediate constant:
> >
> >         addq    %r10, %rax
> >         adcq    $0, %rdx
> >
> > To do this, the backend required one or two simple changes, that
> > then themselves required one or two more obscure tweaks.
> >
> > The simple starting point is to define a zero_extendditi2 pattern,
> > for zero extension from DImode to TImode on TARGET_64BIT that is
> > split after reload.  Double word (TImode) addition/subtraction is
> > split after reload, so that constrains when things should happen.
> >
> > With zero extension now visible to combine, we add two new
> > define_insn_and_split that add/subtract a zero extended operand
> > in double word mode.  These apply to both 32-bit and 64-bit code
> > generation, to produce adc $0 and sbb $0.
> >
> > The first strange tweak is that these new patterns interfere with
> > the optimization that recognizes DW:DI = (HI:SI<<32)+LO:SI as a pair
> > of register moves, or more accurately the combine splitter no longer
> > triggers as we're now converting two instructions into two instructions
> > (not three instructions into two instructions).  This is easily
> > repaired (and extended to handle TImode) by changing from a pair
> > of define_split (that handle operand commutativity) to a set of
> > four define_insn_and_split (again to handle operand commutativity).
> >
> > The other/final strange tweak that the above splitters now interfere
> > with AVX512's kunpckdq instruction which is defined as identical RTL,
> > DW:DI = (HI:SI<<32)|zero_extend(LO:SI).  To distinguish this, and
> > also avoid AVX512 mask registers being used by reload to perform
> > SImode scalar shifts, I've added the explicit (unspec UNSPEC_MASKOP)
> > to the unpack mask operations, which matches what sse.md does for
> > the other mask specific (logic) operations.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32},
> > with no new failures.  Ok for mainline?
> >
> >
> > 2022-06-03  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         PR target/91681
> >         * config/i386/i386.md (zero_extendditi2): New define_insn_and_split.
> >         (*add<dwi>3_doubleword_zext): New define_insn_and_split.
> >         (*sub<dwi>3_doubleword_zext): New define_insn_and_split.
> >         (*concat<mode><dwi>3_1): New define_insn_and_split replacing
> >         previous define_split for implementing DST = (HI<<32)|LO as
> >         pair of move instructions, setting lopart and hipart.
> >         (*concat<mode><dwi>3_2): Likewise.
> >         (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended.
> >         (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended.
> >         * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
> >         (kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
> >         (kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
> >         (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP
> > unspec.
> >         (vec_pack_trunc_<mode>): Likewise.
> >
> > gcc/testsuite/ChangeLog
> >         PR target/91681
> >         * g++.target/i386/pr91681.C: New test case (from the PR).
> >         * gcc.target/i386/pr91681-1.c: New int128 test case.
> >         * gcc.target/i386/pr91681-2.c: Likewise.
> >         * gcc.target/i386/pr91681-3.c: Likewise, but for ia32.

+(define_insn_and_split "*concat<mode><dwi>3_1"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+ (any_or_plus:<DWI>
+  (ashift:<DWI> (match_operand:<DWI> 1 "register_operand" "r")
+ (match_operand:<DWI> 2 "const_int_operand" "n"))

You can remove "n" when we deal with non-commutative (without %) operands.

+  (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))]
+  "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT
+  && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 3))
+   (set (match_dup 5) (match_dup 6))]
 {
-  operands[3] = gen_highpart (SImode, operands[0]);
-  operands[4] = gen_lowpart (SImode, operands[1]);
-  operands[5] = gen_lowpart (SImode, operands[0]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+  operands[6] = gen_lowpart (<MODE>mode, operands[1]);

Pre-reload splitters need to force_reg their register_operands. You
can have SUBREG here and gen_lowpar/gen_higpart will fail on SUBREG.

Uros.
Roger Sayle June 5, 2022, 11:48 a.m. UTC | #3
Hi Uros,
Many thanks for your speedy review.  This revised patch implements
all three of your recommended improvements; the use of
ix86_binary_operator_ok with code UNKNOWN, the removal of
"n" constraints from const_int_operand predicates, and the use
of force_reg (for input operands, and REG_P for output operands)
to ensure that it's always safe to call gen_lowpart/gen_highpart.

[If we proceed with the recent proposal to split double word 
addition, subtraction and other operations before reload, then
these new add/sub variants should be updated at the same time,
but for now this patch keeps double word patterns consistent].
 
This revised patch has been retested on x86_64-pc-linux-gnu with
make bootstrap and make -k check, both with and without 
--target_board=unix{-m32} with no new failures.  Ok for mainline?


2022-06-05  Roger Sayle  <roger@nextmovesoftware.com>
            Uroš Bizjak  <ubizjak@gmail.com>

gcc/ChangeLog
        PR target/91681
        * config/i386/i386.md (zero_extendditi2): New define_insn_and_split.
        (*add<dwi>3_doubleword_zext): New define_insn_and_split.
        (*sub<dwi>3_doubleword_zext): New define_insn_and_split.
        (*concat<mode><dwi>3_1): New define_insn_and_split replacing
        previous define_split for implementing DST = (HI<<32)|LO as
        pair of move instructions, setting lopart and hipart.
        (*concat<mode><dwi>3_2): Likewise.
        (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended.
        (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended.
        * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
        (kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
        (kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
        (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP unspec.
        (vec_pack_trunc_<mode>): Likewise.

gcc/testsuite/ChangeLog
        PR target/91681
        * g++.target/i386/pr91681.C: New test case (from the PR).
        * gcc.target/i386/pr91681-1.c: New int128 test case.
        * gcc.target/i386/pr91681-2.c: Likewise.
        * gcc.target/i386/pr91681-3.c: Likewise, but for ia32.


Thanks again,
Roger
--

> -----Original Message-----
> From: Uros Bizjak <ubizjak@gmail.com>
> Sent: 03 June 2022 11:08
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [x86 PATCH] PR target/91681: zero_extendditi2 pattern for more
> optimizations.
> 
> On Fri, Jun 3, 2022 at 11:49 AM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> >
> >
> > Technically, PR target/91681 has already been resolved; we now
> > recognize the highpart multiplication at the tree-level, we no longer
> > use the stack, and we currently generate the same number of
> > instructions as LLVM.  However, it is still possible to do better, the
> > current x86_64 code to generate a double word addition of a zero extended
> operand, looks like:
> >
> >         xorl    %r11d, %r11d
> >         addq    %r10, %rax
> >         adcq    %r11, %rdx
> >
> > when it's possible (as LLVM does) to use an immediate constant:
> >
> >         addq    %r10, %rax
> >         adcq    $0, %rdx
> >
> > To do this, the backend required one or two simple changes, that then
> > themselves required one or two more obscure tweaks.
> >
> > The simple starting point is to define a zero_extendditi2 pattern, for
> > zero extension from DImode to TImode on TARGET_64BIT that is split
> > after reload.  Double word (TImode) addition/subtraction is split
> > after reload, so that constrains when things should happen.
> >
> > With zero extension now visible to combine, we add two new
> > define_insn_and_split that add/subtract a zero extended operand in
> > double word mode.  These apply to both 32-bit and 64-bit code
> > generation, to produce adc $0 and sbb $0.
> >
> > The first strange tweak is that these new patterns interfere with the
> > optimization that recognizes DW:DI = (HI:SI<<32)+LO:SI as a pair of
> > register moves, or more accurately the combine splitter no longer
> > triggers as we're now converting two instructions into two
> > instructions (not three instructions into two instructions).  This is
> > easily repaired (and extended to handle TImode) by changing from a
> > pair of define_split (that handle operand commutativity) to a set of
> > four define_insn_and_split (again to handle operand commutativity).
> >
> > The other/final strange tweak that the above splitters now interfere
> > with AVX512's kunpckdq instruction which is defined as identical RTL,
> > DW:DI = (HI:SI<<32)|zero_extend(LO:SI).  To distinguish this, and also
> > avoid AVX512 mask registers being used by reload to perform SImode
> > scalar shifts, I've added the explicit (unspec UNSPEC_MASKOP) to the
> > unpack mask operations, which matches what sse.md does for the other
> > mask specific (logic) operations.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32},
> > with no new failures.  Ok for mainline?
> >
> >
> > 2022-06-03  Roger Sayle  <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         PR target/91681
> >         * config/i386/i386.md (zero_extendditi2): New define_insn_and_split.
> >         (*add<dwi>3_doubleword_zext): New define_insn_and_split.
> >         (*sub<dwi>3_doubleword_zext): New define_insn_and_split.
> >         (*concat<mode><dwi>3_1): New define_insn_and_split replacing
> >         previous define_split for implementing DST = (HI<<32)|LO as
> >         pair of move instructions, setting lopart and hipart.
> >         (*concat<mode><dwi>3_2): Likewise.
> >         (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended.
> >         (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended.
> >         * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
> >         (kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
> >         (kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
> >         (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP
> > unspec.
> >         (vec_pack_trunc_<mode>): Likewise.
> >
> > gcc/testsuite/ChangeLog
> >         PR target/91681
> >         * g++.target/i386/pr91681.C: New test case (from the PR).
> >         * gcc.target/i386/pr91681-1.c: New int128 test case.
> >         * gcc.target/i386/pr91681-2.c: Likewise.
> >         * gcc.target/i386/pr91681-3.c: Likewise, but for ia32.
> 
> +  "MEM_P (operands[0]) ? rtx_equal_p (operands[0], operands[1]) &&
> + !MEM_P (operands[2])
> +       : !MEM_P (operands[1])"
> 
> Can we use ix86_binary_operator_ok (UNKNOWN, ...mode..., operands) here
> instead?
> 
> (UNKNOWN RTX code is used to prevent unwanted optimization with
> commutative operands).
> 
> Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 602dfa7..b3d2c90 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -4325,6 +4325,16 @@
    (set_attr "type" "imovx,mskmov,mskmov")
    (set_attr "mode" "SI,QI,QI")])
 
+(define_insn_and_split "zero_extendditi2"
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=r,o")
+	(zero_extend:TI (match_operand:DI 1 "nonimmediate_operand" "rm,r")))]
+  "TARGET_64BIT"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 3) (match_dup 1))
+   (set (match_dup 4) (const_int 0))]
+  "split_double_mode (TImode, &operands[0], 1, &operands[3], &operands[4]);")
+
 ;; Transform xorl; mov[bw] (set strict_low_part) into movz[bw]l.
 (define_peephole2
   [(parallel [(set (match_operand:SWI48 0 "general_reg_operand")
@@ -6453,6 +6463,31 @@
   [(set_attr "type" "alu")
    (set_attr "mode" "QI")])
 
+(define_insn_and_split "*add<dwi>3_doubleword_zext"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=r,o")
+	(plus:<DWI>
+	  (zero_extend:<DWI>
+	    (match_operand:DWIH 2 "nonimmediate_operand" "rm,r")) 
+	  (match_operand:<DWI> 1 "nonimmediate_operand" "0,0")))
+   (clobber (reg:CC FLAGS_REG))]
+  "ix86_binary_operator_ok (UNKNOWN, <DWI>mode, operands)"
+  "#"
+  "&& reload_completed"
+  [(parallel [(set (reg:CCC FLAGS_REG)
+		   (compare:CCC
+		     (plus:DWIH (match_dup 1) (match_dup 2))
+		     (match_dup 1)))
+	      (set (match_dup 0)
+		   (plus:DWIH (match_dup 1) (match_dup 2)))])
+   (parallel [(set (match_dup 3)
+		   (plus:DWIH
+		     (plus:DWIH
+		       (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0))
+		       (match_dup 4))
+		     (const_int 0)))
+	      (clobber (reg:CC FLAGS_REG))])]
+ "split_double_mode (<DWI>mode, &operands[0], 2, &operands[0], &operands[3]);")
+
 ;; Like DWI, but use POImode instead of OImode.
 (define_mode_attr DPWI [(QI "HI") (HI "SI") (SI "DI") (DI "TI") (TI "POI")])
 
@@ -6903,6 +6938,29 @@
     }
 })
 
+(define_insn_and_split "*sub<dwi>3_doubleword_zext"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=r,o")
+	(minus:<DWI>
+	  (match_operand:<DWI> 1 "nonimmediate_operand" "0,0")
+	  (zero_extend:<DWI>
+	    (match_operand:DWIH 2 "nonimmediate_operand" "rm,r"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "ix86_binary_operator_ok (UNKNOWN, <DWI>mode, operands)"
+  "#"
+  "&& reload_completed"
+  [(parallel [(set (reg:CC FLAGS_REG)
+		   (compare:CC (match_dup 1) (match_dup 2)))
+	      (set (match_dup 0)
+		   (minus:DWIH (match_dup 1) (match_dup 2)))])
+   (parallel [(set (match_dup 3)
+		   (minus:DWIH
+		     (minus:DWIH
+		       (match_dup 4)
+		       (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0)))
+		     (const_int 0)))
+	      (clobber (reg:CC FLAGS_REG))])]
+  "split_double_mode (<DWI>mode, &operands[0], 2, &operands[0], &operands[3]);")
+
 (define_insn "*sub<mode>_1"
   [(set (match_operand:SWI 0 "nonimmediate_operand" "=<r>m,<r>")
 	(minus:SWI
@@ -10956,34 +11014,82 @@
 
 ;; Split DST = (HI<<32)|LO early to minimize register usage.
 (define_code_iterator any_or_plus [plus ior xor])
-(define_split
-  [(set (match_operand:DI 0 "register_operand")
-	(any_or_plus:DI
-	  (ashift:DI (match_operand:DI 1 "register_operand")
-		     (const_int 32))
-	  (zero_extend:DI (match_operand:SI 2 "register_operand"))))]
-  "!TARGET_64BIT"
-  [(set (match_dup 3) (match_dup 4))
-   (set (match_dup 5) (match_dup 2))]
+(define_insn_and_split "*concat<mode><dwi>3_1"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (ashift:<DWI> (match_operand:<DWI> 1 "register_operand" "r")
+			(match_operand:<DWI> 2 "const_int_operand"))
+	  (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))]
+  "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT
+   && REG_P (operands[0])
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 3))
+   (set (match_dup 5) (match_dup 6))]
+{
+  operands[1] = force_reg (<DWI>mode, operands[1]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+  operands[6] = gen_lowpart (<MODE>mode, operands[1]);
+})
+
+(define_insn_and_split "*concat<mode><dwi>3_2"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+	  (ashift:<DWI> (match_operand:<DWI> 2 "register_operand" "r")
+			(match_operand:<DWI> 3 "const_int_operand"))))]
+  "INTVAL (operands[3]) == <MODE_SIZE> * BITS_PER_UNIT
+   && REG_P (operands[0])
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 1))
+   (set (match_dup 5) (match_dup 6))]
+{
+  operands[2] = force_reg (<DWI>mode, operands[2]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+  operands[6] = gen_lowpart (<MODE>mode, operands[2]);
+})
+
+(define_insn_and_split "*concat<mode><dwi>3_3"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (ashift:<DWI>
+	    (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+	    (match_operand:<DWI> 2 "const_int_operand"))
+	  (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))]
+  "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT
+   && REG_P (operands[0])
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 3))
+   (set (match_dup 5) (match_dup 1))]
 {
-  operands[3] = gen_highpart (SImode, operands[0]);
-  operands[4] = gen_lowpart (SImode, operands[1]);
-  operands[5] = gen_lowpart (SImode, operands[0]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
 })
 
-(define_split
-  [(set (match_operand:DI 0 "register_operand")
-	(any_or_plus:DI
-	  (zero_extend:DI (match_operand:SI 1 "register_operand"))
-	  (ashift:DI (match_operand:DI 2 "register_operand")
-		     (const_int 32))))]
-  "!TARGET_64BIT"
-  [(set (match_dup 3) (match_dup 4))
-   (set (match_dup 5) (match_dup 1))]
+(define_insn_and_split "*concat<mode><dwi>3_4"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+	  (ashift:<DWI>
+	    (zero_extend:<DWI> (match_operand:DWIH 2 "register_operand" "r"))
+	    (match_operand:<DWI> 3 "const_int_operand"))))]
+  "INTVAL (operands[3]) == <MODE_SIZE> * BITS_PER_UNIT
+   && REG_P (operands[0])
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 1))
+   (set (match_dup 5) (match_dup 2))]
 {
-  operands[3] = gen_highpart (SImode, operands[0]);
-  operands[4] = gen_lowpart (SImode, operands[2]);
-  operands[5] = gen_lowpart (SImode, operands[0]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
 })
 
 ;; Negation instructions
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8b2602b..0198156 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2070,7 +2070,8 @@
 	  (ashift:HI
 	    (zero_extend:HI (match_operand:QI 1 "register_operand" "k"))
 	    (const_int 8))
-	  (zero_extend:HI (match_operand:QI 2 "register_operand" "k"))))]
+	  (zero_extend:HI (match_operand:QI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512F"
   "kunpckbw\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "HI")
@@ -2083,7 +2084,8 @@
 	  (ashift:SI
 	    (zero_extend:SI (match_operand:HI 1 "register_operand" "k"))
 	    (const_int 16))
-	  (zero_extend:SI (match_operand:HI 2 "register_operand" "k"))))]
+	  (zero_extend:SI (match_operand:HI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512BW"
   "kunpckwd\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "SI")])
@@ -2094,7 +2096,8 @@
 	  (ashift:DI
 	    (zero_extend:DI (match_operand:SI 1 "register_operand" "k"))
 	    (const_int 32))
-	  (zero_extend:DI (match_operand:SI 2 "register_operand" "k"))))]
+	  (zero_extend:DI (match_operand:SI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512BW"
   "kunpckdq\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "DI")])
@@ -17398,21 +17401,26 @@
 })
 
 (define_expand "vec_pack_trunc_qi"
-  [(set (match_operand:HI 0 "register_operand")
-	(ior:HI (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand"))
-                           (const_int 8))
-		(zero_extend:HI (match_operand:QI 1 "register_operand"))))]
+  [(parallel
+    [(set (match_operand:HI 0 "register_operand")
+	(ior:HI
+	   (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand"))
+		      (const_int 8))
+	   (zero_extend:HI (match_operand:QI 1 "register_operand"))))
+     (unspec [(const_int 0)] UNSPEC_MASKOP)])]
   "TARGET_AVX512F")
 
 (define_expand "vec_pack_trunc_<mode>"
-  [(set (match_operand:<DOUBLEMASKMODE> 0 "register_operand")
-	(ior:<DOUBLEMASKMODE>
-	  (ashift:<DOUBLEMASKMODE>
+  [(parallel
+    [(set (match_operand:<DOUBLEMASKMODE> 0 "register_operand")
+	  (ior:<DOUBLEMASKMODE>
+	    (ashift:<DOUBLEMASKMODE>
+	      (zero_extend:<DOUBLEMASKMODE>
+	        (match_operand:SWI24 2 "register_operand"))
+	      (match_dup 3))
 	    (zero_extend:<DOUBLEMASKMODE>
-	      (match_operand:SWI24 2 "register_operand"))
-	    (match_dup 3))
-	  (zero_extend:<DOUBLEMASKMODE>
-	    (match_operand:SWI24 1 "register_operand"))))]
+	      (match_operand:SWI24 1 "register_operand"))))
+     (unspec [(const_int 0)] UNSPEC_MASKOP)])]
   "TARGET_AVX512BW"
 {
   operands[3] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode));
diff --git a/gcc/testsuite/g++.target/i386/pr91681.C b/gcc/testsuite/g++.target/i386/pr91681.C
new file mode 100644
index 0000000..0271e43
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/pr91681.C
@@ -0,0 +1,20 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+
+void multiply128x64x2_3 ( 
+    const unsigned long a, 
+    const unsigned long b, 
+    const unsigned long c, 
+    const unsigned long d, 
+    __uint128_t o[2])
+{
+    __uint128_t B0 = (__uint128_t) b * c;
+    __uint128_t B2 = (__uint128_t) a * c;
+    __uint128_t B1 = (__uint128_t) b * d;
+    __uint128_t B3 = (__uint128_t) a * d;
+
+    o[0] = B2 + (B0 >> 64);
+    o[1] = B3 + (B1 >> 64);
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-1.c b/gcc/testsuite/gcc.target/i386/pr91681-1.c
new file mode 100644
index 0000000..ab83cc4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+unsigned __int128 m;
+
+unsigned __int128 foo(unsigned __int128 x, unsigned long long y)
+{
+    return x + y;
+}
+
+void bar(unsigned __int128 x, unsigned long long y)
+{
+    m = x + y;
+}
+
+void baz(unsigned long long y)
+{
+    m += y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-2.c b/gcc/testsuite/gcc.target/i386/pr91681-2.c
new file mode 100644
index 0000000..ea52c72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-2.c
@@ -0,0 +1,20 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+unsigned __int128 m;
+
+unsigned __int128 foo(unsigned __int128 x, unsigned long long y)
+{
+    return x - y;
+}
+
+void bar(unsigned __int128 x, unsigned long long y)
+{
+    m = x - y;
+}
+
+void baz(unsigned long long y)
+{
+    m -= y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-3.c b/gcc/testsuite/gcc.target/i386/pr91681-3.c
new file mode 100644
index 0000000..22a03c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-3.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target ia32 } } */
+/* { dg-options "-O2" } */
+
+unsigned long long m;
+
+unsigned long long foo(unsigned long long x, unsigned int y)
+{
+    return x - y;
+}
+
+void bar(unsigned long long x, unsigned int y)
+{
+    m = x - y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
Uros Bizjak June 5, 2022, 6:57 p.m. UTC | #4
On Sun, Jun 5, 2022 at 1:48 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Uros,
> Many thanks for your speedy review.  This revised patch implements
> all three of your recommended improvements; the use of
> ix86_binary_operator_ok with code UNKNOWN, the removal of
> "n" constraints from const_int_operand predicates, and the use
> of force_reg (for input operands, and REG_P for output operands)
> to ensure that it's always safe to call gen_lowpart/gen_highpart.
>
> [If we proceed with the recent proposal to split double word
> addition, subtraction and other operations before reload, then
> these new add/sub variants should be updated at the same time,
> but for now this patch keeps double word patterns consistent].
>
> This revised patch has been retested on x86_64-pc-linux-gnu with
> make bootstrap and make -k check, both with and without
> --target_board=unix{-m32} with no new failures.  Ok for mainline?

+(define_insn_and_split "*concat<mode><dwi>3_1"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+ (any_or_plus:<DWI>
+  (ashift:<DWI> (match_operand:<DWI> 1 "register_operand" "r")
+ (match_operand:<DWI> 2 "const_int_operand"))
+  (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))]
+  "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT
+   && REG_P (operands[0])
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 3))
+   (set (match_dup 5) (match_dup 6))]
+{
+  operands[1] = force_reg (<DWI>mode, operands[1]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+  operands[6] = gen_lowpart (<MODE>mode, operands[1]);
+})

Hm, but in this particular case (and other) you can use
split_double_mode on operands[0], instead of manually splitting REG_P
constrained operands, and it will handle everything correctly. Please
note that split_double_mode has:

split_double_mode (machine_mode mode, rtx operands[],
           int num, rtx lo_half[], rtx hi_half[])

so with some care you can use:

"split_double_mode (<DWI>mode, &operands[0],1, &operands[4], &operands[5]);"

followed by:

operands[6] = simplify_gen_subreg (<MODE>mode, op, <DWI>mode, 0);

The above line is partially what split_double_mode does.

This is the approach other pre_reload doubleword splitters take, it
looks the safest (otherwise it would break left and right with
existing patterns ...), and the most effective to me.

Please also get approval for sse.md change from Hongtao, AVX512F stuff
is in a separate playground.

Uros.


>
> 2022-06-05  Roger Sayle  <roger@nextmovesoftware.com>
>             Uroš Bizjak  <ubizjak@gmail.com>
>
> gcc/ChangeLog
>         PR target/91681
>         * config/i386/i386.md (zero_extendditi2): New define_insn_and_split.
>         (*add<dwi>3_doubleword_zext): New define_insn_and_split.
>         (*sub<dwi>3_doubleword_zext): New define_insn_and_split.
>         (*concat<mode><dwi>3_1): New define_insn_and_split replacing
>         previous define_split for implementing DST = (HI<<32)|LO as
>         pair of move instructions, setting lopart and hipart.
>         (*concat<mode><dwi>3_2): Likewise.
>         (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended.
>         (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended.
>         * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
>         (kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
>         (kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
>         (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP unspec.
>         (vec_pack_trunc_<mode>): Likewise.
>
> gcc/testsuite/ChangeLog
>         PR target/91681
>         * g++.target/i386/pr91681.C: New test case (from the PR).
>         * gcc.target/i386/pr91681-1.c: New int128 test case.
>         * gcc.target/i386/pr91681-2.c: Likewise.
>         * gcc.target/i386/pr91681-3.c: Likewise, but for ia32.
>
>
> Thanks again,
> Roger
> --
>
> > -----Original Message-----
> > From: Uros Bizjak <ubizjak@gmail.com>
> > Sent: 03 June 2022 11:08
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> > Subject: Re: [x86 PATCH] PR target/91681: zero_extendditi2 pattern for more
> > optimizations.
> >
> > On Fri, Jun 3, 2022 at 11:49 AM Roger Sayle <roger@nextmovesoftware.com>
> > wrote:
> > >
> > >
> > > Technically, PR target/91681 has already been resolved; we now
> > > recognize the highpart multiplication at the tree-level, we no longer
> > > use the stack, and we currently generate the same number of
> > > instructions as LLVM.  However, it is still possible to do better, the
> > > current x86_64 code to generate a double word addition of a zero extended
> > operand, looks like:
> > >
> > >         xorl    %r11d, %r11d
> > >         addq    %r10, %rax
> > >         adcq    %r11, %rdx
> > >
> > > when it's possible (as LLVM does) to use an immediate constant:
> > >
> > >         addq    %r10, %rax
> > >         adcq    $0, %rdx
> > >
> > > To do this, the backend required one or two simple changes, that then
> > > themselves required one or two more obscure tweaks.
> > >
> > > The simple starting point is to define a zero_extendditi2 pattern, for
> > > zero extension from DImode to TImode on TARGET_64BIT that is split
> > > after reload.  Double word (TImode) addition/subtraction is split
> > > after reload, so that constrains when things should happen.
> > >
> > > With zero extension now visible to combine, we add two new
> > > define_insn_and_split that add/subtract a zero extended operand in
> > > double word mode.  These apply to both 32-bit and 64-bit code
> > > generation, to produce adc $0 and sbb $0.
> > >
> > > The first strange tweak is that these new patterns interfere with the
> > > optimization that recognizes DW:DI = (HI:SI<<32)+LO:SI as a pair of
> > > register moves, or more accurately the combine splitter no longer
> > > triggers as we're now converting two instructions into two
> > > instructions (not three instructions into two instructions).  This is
> > > easily repaired (and extended to handle TImode) by changing from a
> > > pair of define_split (that handle operand commutativity) to a set of
> > > four define_insn_and_split (again to handle operand commutativity).
> > >
> > > The other/final strange tweak that the above splitters now interfere
> > > with AVX512's kunpckdq instruction which is defined as identical RTL,
> > > DW:DI = (HI:SI<<32)|zero_extend(LO:SI).  To distinguish this, and also
> > > avoid AVX512 mask registers being used by reload to perform SImode
> > > scalar shifts, I've added the explicit (unspec UNSPEC_MASKOP) to the
> > > unpack mask operations, which matches what sse.md does for the other
> > > mask specific (logic) operations.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32},
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2022-06-03  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >         PR target/91681
> > >         * config/i386/i386.md (zero_extendditi2): New define_insn_and_split.
> > >         (*add<dwi>3_doubleword_zext): New define_insn_and_split.
> > >         (*sub<dwi>3_doubleword_zext): New define_insn_and_split.
> > >         (*concat<mode><dwi>3_1): New define_insn_and_split replacing
> > >         previous define_split for implementing DST = (HI<<32)|LO as
> > >         pair of move instructions, setting lopart and hipart.
> > >         (*concat<mode><dwi>3_2): Likewise.
> > >         (*concat<mode><dwi>3_3): Likewise, where HI is zero_extended.
> > >         (*concat<mode><dwi>3_4): Likewise, where HI is zero_extended.
> > >         * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
> > >         (kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
> > >         (kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
> > >         (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP
> > > unspec.
> > >         (vec_pack_trunc_<mode>): Likewise.
> > >
> > > gcc/testsuite/ChangeLog
> > >         PR target/91681
> > >         * g++.target/i386/pr91681.C: New test case (from the PR).
> > >         * gcc.target/i386/pr91681-1.c: New int128 test case.
> > >         * gcc.target/i386/pr91681-2.c: Likewise.
> > >         * gcc.target/i386/pr91681-3.c: Likewise, but for ia32.
> >
> > +  "MEM_P (operands[0]) ? rtx_equal_p (operands[0], operands[1]) &&
> > + !MEM_P (operands[2])
> > +       : !MEM_P (operands[1])"
> >
> > Can we use ix86_binary_operator_ok (UNKNOWN, ...mode..., operands) here
> > instead?
> >
> > (UNKNOWN RTX code is used to prevent unwanted optimization with
> > commutative operands).
> >
> > Uros.
diff mbox series

Patch

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 602dfa7..6cce256 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -4325,6 +4325,16 @@ 
    (set_attr "type" "imovx,mskmov,mskmov")
    (set_attr "mode" "SI,QI,QI")])
 
+(define_insn_and_split "zero_extendditi2"
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=r,o")
+	(zero_extend:TI (match_operand:DI 1 "nonimmediate_operand" "rm,r")))]
+  "TARGET_64BIT"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 3) (match_dup 1))
+   (set (match_dup 4) (const_int 0))]
+  "split_double_mode (TImode, &operands[0], 1, &operands[3], &operands[4]);")
+
 ;; Transform xorl; mov[bw] (set strict_low_part) into movz[bw]l.
 (define_peephole2
   [(parallel [(set (match_operand:SWI48 0 "general_reg_operand")
@@ -6453,6 +6463,33 @@ 
   [(set_attr "type" "alu")
    (set_attr "mode" "QI")])
 
+(define_insn_and_split "*add<dwi>3_doubleword_zext"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=r,o")
+	(plus:<DWI>
+	  (zero_extend:<DWI>
+	    (match_operand:DWIH 2 "nonimmediate_operand" "rm,r")) 
+	  (match_operand:<DWI> 1 "nonimmediate_operand" "0,0")))
+   (clobber (reg:CC FLAGS_REG))]
+  "MEM_P (operands[0]) ? rtx_equal_p (operands[0], operands[1])
+			 && !MEM_P (operands[2])
+		       : !MEM_P (operands[1])"
+  "#"
+  "&& reload_completed"
+  [(parallel [(set (reg:CCC FLAGS_REG)
+		   (compare:CCC
+		     (plus:DWIH (match_dup 1) (match_dup 2))
+		     (match_dup 1)))
+	      (set (match_dup 0)
+		   (plus:DWIH (match_dup 1) (match_dup 2)))])
+   (parallel [(set (match_dup 3)
+		   (plus:DWIH
+		     (plus:DWIH
+		       (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0))
+		       (match_dup 4))
+		     (const_int 0)))
+	      (clobber (reg:CC FLAGS_REG))])]
+ "split_double_mode (<DWI>mode, &operands[0], 2, &operands[0], &operands[3]);")
+
 ;; Like DWI, but use POImode instead of OImode.
 (define_mode_attr DPWI [(QI "HI") (HI "SI") (SI "DI") (DI "TI") (TI "POI")])
 
@@ -6903,6 +6940,31 @@ 
     }
 })
 
+(define_insn_and_split "*sub<dwi>3_doubleword_zext"
+  [(set (match_operand:<DWI> 0 "nonimmediate_operand" "=r,o")
+	(minus:<DWI>
+	  (match_operand:<DWI> 1 "nonimmediate_operand" "0,0")
+	  (zero_extend:<DWI>
+	    (match_operand:DWIH 2 "nonimmediate_operand" "rm,r"))))
+   (clobber (reg:CC FLAGS_REG))]
+  "MEM_P (operands[0]) ? rtx_equal_p (operands[0], operands[1])
+			 && !MEM_P (operands[2])
+		       : !MEM_P (operands[1])"
+  "#"
+  "&& reload_completed"
+  [(parallel [(set (reg:CC FLAGS_REG)
+		   (compare:CC (match_dup 1) (match_dup 2)))
+	      (set (match_dup 0)
+		   (minus:DWIH (match_dup 1) (match_dup 2)))])
+   (parallel [(set (match_dup 3)
+		   (minus:DWIH
+		     (minus:DWIH
+		       (match_dup 4)
+		       (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0)))
+		     (const_int 0)))
+	      (clobber (reg:CC FLAGS_REG))])]
+  "split_double_mode (<DWI>mode, &operands[0], 2, &operands[0], &operands[3]);")
+
 (define_insn "*sub<mode>_1"
   [(set (match_operand:SWI 0 "nonimmediate_operand" "=<r>m,<r>")
 	(minus:SWI
@@ -10956,34 +11018,76 @@ 
 
 ;; Split DST = (HI<<32)|LO early to minimize register usage.
 (define_code_iterator any_or_plus [plus ior xor])
-(define_split
-  [(set (match_operand:DI 0 "register_operand")
-	(any_or_plus:DI
-	  (ashift:DI (match_operand:DI 1 "register_operand")
-		     (const_int 32))
-	  (zero_extend:DI (match_operand:SI 2 "register_operand"))))]
-  "!TARGET_64BIT"
-  [(set (match_dup 3) (match_dup 4))
-   (set (match_dup 5) (match_dup 2))]
+(define_insn_and_split "*concat<mode><dwi>3_1"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (ashift:<DWI> (match_operand:<DWI> 1 "register_operand" "r")
+			(match_operand:<DWI> 2 "const_int_operand" "n"))
+	  (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))]
+  "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT
+  && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 3))
+   (set (match_dup 5) (match_dup 6))]
 {
-  operands[3] = gen_highpart (SImode, operands[0]);
-  operands[4] = gen_lowpart (SImode, operands[1]);
-  operands[5] = gen_lowpart (SImode, operands[0]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+  operands[6] = gen_lowpart (<MODE>mode, operands[1]);
 })
 
-(define_split
-  [(set (match_operand:DI 0 "register_operand")
-	(any_or_plus:DI
-	  (zero_extend:DI (match_operand:SI 1 "register_operand"))
-	  (ashift:DI (match_operand:DI 2 "register_operand")
-		     (const_int 32))))]
-  "!TARGET_64BIT"
-  [(set (match_dup 3) (match_dup 4))
+(define_insn_and_split "*concat<mode><dwi>3_2"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+	  (ashift:<DWI> (match_operand:<DWI> 2 "register_operand" "r")
+			(match_operand:<DWI> 3 "const_int_operand" "n"))))]
+  "INTVAL (operands[3]) == <MODE_SIZE> * BITS_PER_UNIT
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 1))
+   (set (match_dup 5) (match_dup 6))]
+{
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+  operands[6] = gen_lowpart (<MODE>mode, operands[2]);
+})
+
+(define_insn_and_split "*concat<mode><dwi>3_3"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (ashift:<DWI>
+	    (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+	    (match_operand:<DWI> 2 "const_int_operand" "n"))
+	  (zero_extend:<DWI> (match_operand:DWIH 3 "register_operand" "r"))))]
+  "INTVAL (operands[2]) == <MODE_SIZE> * BITS_PER_UNIT
+  && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 3))
    (set (match_dup 5) (match_dup 1))]
 {
-  operands[3] = gen_highpart (SImode, operands[0]);
-  operands[4] = gen_lowpart (SImode, operands[2]);
-  operands[5] = gen_lowpart (SImode, operands[0]);
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
+})
+
+(define_insn_and_split "*concat<mode><dwi>3_4"
+  [(set (match_operand:<DWI> 0 "register_operand" "=r")
+	(any_or_plus:<DWI>
+	  (zero_extend:<DWI> (match_operand:DWIH 1 "register_operand" "r"))
+	  (ashift:<DWI>
+	    (zero_extend:<DWI> (match_operand:DWIH 2 "register_operand" "r"))
+	    (match_operand:<DWI> 3 "const_int_operand" "n"))))]
+  "INTVAL (operands[3]) == <MODE_SIZE> * BITS_PER_UNIT
+   && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 4) (match_dup 1))
+   (set (match_dup 5) (match_dup 2))]
+{
+  operands[4] = gen_lowpart (<MODE>mode, operands[0]);
+  operands[5] = gen_highpart (<MODE>mode, operands[0]);
 })
 
 ;; Negation instructions
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8b2602b..0198156 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2070,7 +2070,8 @@ 
 	  (ashift:HI
 	    (zero_extend:HI (match_operand:QI 1 "register_operand" "k"))
 	    (const_int 8))
-	  (zero_extend:HI (match_operand:QI 2 "register_operand" "k"))))]
+	  (zero_extend:HI (match_operand:QI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512F"
   "kunpckbw\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "HI")
@@ -2083,7 +2084,8 @@ 
 	  (ashift:SI
 	    (zero_extend:SI (match_operand:HI 1 "register_operand" "k"))
 	    (const_int 16))
-	  (zero_extend:SI (match_operand:HI 2 "register_operand" "k"))))]
+	  (zero_extend:SI (match_operand:HI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512BW"
   "kunpckwd\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "SI")])
@@ -2094,7 +2096,8 @@ 
 	  (ashift:DI
 	    (zero_extend:DI (match_operand:SI 1 "register_operand" "k"))
 	    (const_int 32))
-	  (zero_extend:DI (match_operand:SI 2 "register_operand" "k"))))]
+	  (zero_extend:DI (match_operand:SI 2 "register_operand" "k"))))
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512BW"
   "kunpckdq\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "DI")])
@@ -17398,21 +17401,26 @@ 
 })
 
 (define_expand "vec_pack_trunc_qi"
-  [(set (match_operand:HI 0 "register_operand")
-	(ior:HI (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand"))
-                           (const_int 8))
-		(zero_extend:HI (match_operand:QI 1 "register_operand"))))]
+  [(parallel
+    [(set (match_operand:HI 0 "register_operand")
+	(ior:HI
+	   (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand"))
+		      (const_int 8))
+	   (zero_extend:HI (match_operand:QI 1 "register_operand"))))
+     (unspec [(const_int 0)] UNSPEC_MASKOP)])]
   "TARGET_AVX512F")
 
 (define_expand "vec_pack_trunc_<mode>"
-  [(set (match_operand:<DOUBLEMASKMODE> 0 "register_operand")
-	(ior:<DOUBLEMASKMODE>
-	  (ashift:<DOUBLEMASKMODE>
+  [(parallel
+    [(set (match_operand:<DOUBLEMASKMODE> 0 "register_operand")
+	  (ior:<DOUBLEMASKMODE>
+	    (ashift:<DOUBLEMASKMODE>
+	      (zero_extend:<DOUBLEMASKMODE>
+	        (match_operand:SWI24 2 "register_operand"))
+	      (match_dup 3))
 	    (zero_extend:<DOUBLEMASKMODE>
-	      (match_operand:SWI24 2 "register_operand"))
-	    (match_dup 3))
-	  (zero_extend:<DOUBLEMASKMODE>
-	    (match_operand:SWI24 1 "register_operand"))))]
+	      (match_operand:SWI24 1 "register_operand"))))
+     (unspec [(const_int 0)] UNSPEC_MASKOP)])]
   "TARGET_AVX512BW"
 {
   operands[3] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode));
diff --git a/gcc/testsuite/g++.target/i386/pr91681.C b/gcc/testsuite/g++.target/i386/pr91681.C
new file mode 100644
index 0000000..0271e43
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/pr91681.C
@@ -0,0 +1,20 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+
+void multiply128x64x2_3 ( 
+    const unsigned long a, 
+    const unsigned long b, 
+    const unsigned long c, 
+    const unsigned long d, 
+    __uint128_t o[2])
+{
+    __uint128_t B0 = (__uint128_t) b * c;
+    __uint128_t B2 = (__uint128_t) a * c;
+    __uint128_t B1 = (__uint128_t) b * d;
+    __uint128_t B3 = (__uint128_t) a * d;
+
+    o[0] = B2 + (B0 >> 64);
+    o[1] = B3 + (B1 >> 64);
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-1.c b/gcc/testsuite/gcc.target/i386/pr91681-1.c
new file mode 100644
index 0000000..ab83cc4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-1.c
@@ -0,0 +1,20 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+unsigned __int128 m;
+
+unsigned __int128 foo(unsigned __int128 x, unsigned long long y)
+{
+    return x + y;
+}
+
+void bar(unsigned __int128 x, unsigned long long y)
+{
+    m = x + y;
+}
+
+void baz(unsigned long long y)
+{
+    m += y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-2.c b/gcc/testsuite/gcc.target/i386/pr91681-2.c
new file mode 100644
index 0000000..ea52c72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-2.c
@@ -0,0 +1,20 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+unsigned __int128 m;
+
+unsigned __int128 foo(unsigned __int128 x, unsigned long long y)
+{
+    return x - y;
+}
+
+void bar(unsigned __int128 x, unsigned long long y)
+{
+    m = x - y;
+}
+
+void baz(unsigned long long y)
+{
+    m -= y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr91681-3.c b/gcc/testsuite/gcc.target/i386/pr91681-3.c
new file mode 100644
index 0000000..22a03c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91681-3.c
@@ -0,0 +1,16 @@ 
+/* { dg-do compile { target ia32 } } */
+/* { dg-options "-O2" } */
+
+unsigned long long m;
+
+unsigned long long foo(unsigned long long x, unsigned int y)
+{
+    return x - y;
+}
+
+void bar(unsigned long long x, unsigned int y)
+{
+    m = x - y;
+}
+
+/* { dg-final { scan-assembler-not "xor" } } */