
middle-end: Recognize idioms for bswap32 and bswap64 in match.pd.

Message ID 000f01d67083$bc737c00$355a7400$@nextmovesoftware.com
State New
Series middle-end: Recognize idioms for bswap32 and bswap64 in match.pd.

Commit Message

Roger Sayle Aug. 12, 2020, 8:36 a.m. UTC
This patch is inspired by a small code fragment in comment #3 of
bugzilla PR rtl-optimization/94804.  That snippet appears almost
unrelated to the topic of the PR, but recognizing __builtin_bswap64
from two __builtin_bswap32 calls seems like a clever/useful trick.
GCC's optabs.c contains the inverse logic to expand bswap64 by
IORing two bswap32 calls, so this transformation/canonicalization
is safe, even on targets without suitable optab support.  But
on x86_64, the swap64 function of the test case becomes a single
instruction.
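
To make the idiom concrete, here is roughly what the new patterns
recognize at the source level (a sketch along the lines of the new
test case, assuming the usual LP64 sizes of 32-bit int and 64-bit
long; whether the fold actually fires depends on the exact type
checks in the patterns):

  #include <stdint.h>

  /* Build a 64-bit byte swap from two 32-bit ones: the byte-reversed
     low half of x becomes the high half of the result, and vice
     versa.  With this patch, the whole function folds to a single
     __builtin_bswap64.  */
  uint64_t swap64 (uint64_t x)
  {
    uint64_t hi = (uint64_t) __builtin_bswap32 ((uint32_t) x) << 32;
    uint32_t lo = __builtin_bswap32 ((uint32_t) (x >> 32));
    return hi | lo;
  }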


This patch has been tested on x86_64-pc-linux-gnu with a "make
bootstrap" and a "make -k check" with no new failures.
Ok for mainline?


2020-08-12  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* match.pd (((T)bswapX(x)<<C)|bswapX(x>>C) -> bswapY(x)):
	New simplifications to recognize __builtin_bswap{32,64}.

gcc/testsuite/ChangeLog
	* gcc.dg/fold-bswap-1.c: New test.


Thanks in advance,
Roger
--
Roger Sayle
NextMove Software
Cambridge, UK

diff --git a/gcc/testsuite/gcc.dg/fold-bswap-1.c b/gcc/testsuite/gcc.dg/fold-bswap-1.c
new file mode 100644
index 0000000..f14f731
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-1.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+unsigned int swap32(unsigned int x)
+{
+    unsigned int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) | a;
+}
+
+unsigned long swap64(unsigned long x)
+{
+    unsigned long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) | a;
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
+

Comments

Marc Glisse Aug. 12, 2020, 9:43 a.m. UTC | #1
On Wed, 12 Aug 2020, Roger Sayle wrote:

> This patch is inspired by a small code fragment in comment #3 of
> bugzilla PR rtl-optimization/94804.  That snippet appears almost
> unrelated to the topic of the PR, but recognizing __builtin_bswap64
> from two __builtin_bswap32 calls seems like a clever/useful trick.
> GCC's optabs.c contains the inverse logic to expand bswap64 by
> IORing two bswap32 calls, so this transformation/canonicalization
> is safe, even on targets without suitable optab support.  But
> on x86_64, the swap64 function of the test case becomes a single instruction.
>
>
> This patch has been tested on x86_64-pc-linux-gnu with a "make
> bootstrap" and a "make -k check" with no new failures.
> Ok for mainline?

Your tests seem to assume that int has 32 bits and long 64.

+  (if (operand_equal_p (@0, @2, 0)

Why not reuse @0 instead of introducing @2 in the pattern? Similarly, it 
may be a bit shorter to reuse @1 instead of a new @3 (I don't think the 
tricks with @@ will be needed here).
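
(Reusing a capture makes genmatch itself require that both
occurrences are equal, so no operand_equal_p guard is needed; as a
generic illustration, unrelated to this patch, a pattern such as

  /* x & ~x -> 0: the repeated @0 must match the same operand.  */
  (simplify
    (bit_and:c @0 (bit_not @0))
    { build_zero_cst (type); })

matches only when the two @0 operands are identical.)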

+       && types_match (TREE_TYPE (@0), uint64_type_node)

that seems very specific. What goes wrong with a signed type for instance?

+(simplify
+  (bit_ior:c
+    (lshift
+      (convert (BUILT_IN_BSWAP16 (convert (bit_and @0
+						   INTEGER_CST@1))))
+      (INTEGER_CST@2))
+    (convert (BUILT_IN_BSWAP16 (convert (rshift @3
+						INTEGER_CST@4)))))

I didn't realize we kept this useless bit_and when casting to a smaller 
type. We probably get a different pattern on 16-bit targets, but a pattern 
they do not match won't hurt them.
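
For context, a minimal example of where that bit_and comes from
(assuming 32-bit int, as on x86_64; the GIMPLE lines are quoted from
the dump in comment #5 below):

  /* Passing a 32-bit value to __builtin_bswap16 narrows it first,
     and that narrowing is exposed in GIMPLE as a mask rather than a
     plain conversion:
       _7 = x_4(D) & 65535;
       _1 = __builtin_bswap16 (_7);
     hence the (bit_and @0 INTEGER_CST@1) with @1 == 65535 in the
     pattern, where a plain (convert @0) might have been expected.  */
  unsigned short low16_swapped (unsigned int x)
  {
    return __builtin_bswap16 (x);
  }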
Roger Sayle Aug. 15, 2020, 10:09 a.m. UTC | #2
Hi Marc,
Here's version #2 of the patch to recognize bswap32 and bswap64,
incorporating your suggestions and feedback.  The test cases now
confirm the transformation is applied when int is 32 bits and long is
64 bits, and should pass otherwise; the patterns now reuse (more)
capturing groups, and the patterns have been made more generic to
allow the ultimate type to be signed or unsigned (hence there are now
two new gcc.dg tests).

Alas my efforts to allow the input argument to be signed, and to use
fold_convert to coerce it to the correct type before calling
__builtin_bswap, failed with the error messages:
>fold-bswap-2.c: In function 'swap64':
>fold-bswap-2.c:22:1: error: invalid argument to gimple call
>(long unsigned int) x_6(D)
>_12 = __builtin_bswap64 ((long unsigned int) x_6(D));
>during GIMPLE pass: forwprop
>fold-bswap-2.c:22:1: internal compiler error: verify_gimple failed
So I require arguments to be the expected type for now.  If anyone's
sufficiently motivated to support these cases, this can be done as a
follow-up patch.

This revised patch has been tested on x86_64-pc-linux-gnu with a
"make bootstrap" and "make -k check" with no new failures.
Ok for mainline?

Thanks in advance,
Roger
--

-----Original Message-----
From: Marc Glisse <marc.glisse@inria.fr> 
Sent: 12 August 2020 10:43
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: 'GCC Patches' <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] middle-end: Recognize idioms for bswap32 and bswap64 in match.pd.

On Wed, 12 Aug 2020, Roger Sayle wrote:

> This patch is inspired by a small code fragment in comment #3 of 
> bugzilla PR rtl-optimization/94804.  That snippet appears almost 
> unrelated to the topic of the PR, but recognizing __builtin_bswap64 
> from two __builtin_bswap32 calls seems like a clever/useful trick.
> GCC's optabs.c contains the inverse logic to expand bswap64 by IORing 
> two bswap32 calls, so this transformation/canonicalization is safe, 
> even on targets without suitable optab support.  But on x86_64, the 
> swap64 function of the test case becomes a single instruction.
>
>
> This patch has been tested on x86_64-pc-linux-gnu with a "make 
> bootstrap" and a "make -k check" with no new failures.
> Ok for mainline?

Your tests seem to assume that int has 32 bits and long 64.

+  (if (operand_equal_p (@0, @2, 0)

Why not reuse @0 instead of introducing @2 in the pattern? Similarly, it may
be a bit shorter to reuse @1 instead of a new @3 (I don't think the tricks
with @@ will be needed here).

+       && types_match (TREE_TYPE (@0), uint64_type_node)

that seems very specific. What goes wrong with a signed type for instance?

+(simplify
+  (bit_ior:c
+    (lshift
+      (convert (BUILT_IN_BSWAP16 (convert (bit_and @0
+						   INTEGER_CST@1))))
+      (INTEGER_CST@2))
+    (convert (BUILT_IN_BSWAP16 (convert (rshift @3
+						INTEGER_CST@4)))))

I didn't realize we kept this useless bit_and when casting to a smaller
type. We probably get a different pattern on 16-bit targets, but a pattern
they do not match won't hurt them.

--
Marc Glisse
diff --git a/gcc/match.pd b/gcc/match.pd
index c3b8816..c682d3d 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3410,6 +3410,33 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    (bswap (bitop:c (bswap @0) @1))
    (bitop @0 (bswap @1)))))
 
+/* Recognize ((T)bswap32(x)<<32)|bswap32(x>>32) as bswap64(x).  */
+(simplify
+  (bit_ior:c
+    (lshift (convert (BUILT_IN_BSWAP32 (convert@0 @1)))
+	    INTEGER_CST@2)
+    (convert (BUILT_IN_BSWAP32 (convert@3 (rshift @1 @2)))))
+  (if (INTEGRAL_TYPE_P (type)
+       && TYPE_PRECISION (type) == 64
+       && types_match (TREE_TYPE (@1), uint64_type_node)
+       && types_match (TREE_TYPE (@0), uint32_type_node)
+       && types_match (TREE_TYPE (@3), uint32_type_node)
+       && wi::to_widest (@2) == 32)
+    (convert (BUILT_IN_BSWAP64 @1))))
+
+/* Recognize ((T)bswap16(x)<<16)|bswap16(x>>16) as bswap32(x).  */
+(simplify
+  (bit_ior:c
+    (lshift
+      (convert (BUILT_IN_BSWAP16 (convert (bit_and @0 INTEGER_CST@1))))
+      (INTEGER_CST@2))
+    (convert (BUILT_IN_BSWAP16 (convert (rshift @0 @2)))))
+  (if (INTEGRAL_TYPE_P (type)
+       && TYPE_PRECISION (type) == 32
+       && types_match (TREE_TYPE (@0), uint32_type_node)
+       && wi::to_widest (@1) == 65535
+       && wi::to_widest (@2) == 16)
+    (convert (BUILT_IN_BSWAP32 @0))))
 
 /* Combine COND_EXPRs and VEC_COND_EXPRs.  */
diff --git a/gcc/testsuite/gcc.dg/fold-bswap-1.c b/gcc/testsuite/gcc.dg/fold-bswap-1.c
new file mode 100644
index 0000000..3abb862
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-1.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+unsigned int swap32(unsigned int x)
+{
+  if (sizeof(unsigned int)==4 && sizeof(unsigned short)==2) {
+    unsigned int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) | a;
+  } else return __builtin_bswap32(x);
+}
+
+unsigned long swap64(unsigned long x)
+{
+  if (sizeof(unsigned long)==8 && sizeof(unsigned int)==4) {
+    unsigned long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) | a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
+
diff --git a/gcc/testsuite/gcc.dg/fold-bswap-2.c b/gcc/testsuite/gcc.dg/fold-bswap-2.c
new file mode 100644
index 0000000..a581fd6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-2.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int swap32(unsigned int x)
+{
+  if (sizeof(int)==4 && sizeof(short)==2) {
+    int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) | a;
+  } else return __builtin_bswap32(x);
+}
+
+long swap64(unsigned long x)
+{
+  if (sizeof(long)==8 && sizeof(int)==4) {
+    long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) | a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
+
Jakub Jelinek Aug. 15, 2020, 1:26 p.m. UTC | #3
On Sat, Aug 15, 2020 at 11:09:17AM +0100, Roger Sayle wrote:
> +/* Recognize ((T)bswap32(x)<<32)|bswap32(x>>32) as bswap64(x).  */
> +(simplify
> +  (bit_ior:c

Any reason for supporting bit_ior only?  Don't plus:c or bit_xor:c
work the same (i.e. use (for op (bit_ior bit_xor plus) ...)?

	Jakub
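
For readers less familiar with match.pd, the (for ...) construct
Jakub suggests instantiates the enclosed simplify once per listed
operator; a generic illustration (not from this patch):

  /* One simplify, stamped out for bit_ior, bit_xor and plus:
     x | 0, x ^ 0 and x + 0 all simplify to x.  */
  (for op (bit_ior bit_xor plus)
    (simplify
      (op @0 integer_zerop)
      @0))

Treating the three operators interchangeably is safe here because the
shifted high part and the low part have no bits in common, so ior,
xor and plus all compute the same value; version #3 below adopts
exactly this.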
Roger Sayle Aug. 17, 2020, 8:06 a.m. UTC | #4
Hi Jakub and Marc,
Here's version #3 of the patch to recognize bswap32 and bswap64,
which now also implements Jakub's suggestion to support addition and
exclusive-or, as well as bitwise ior, when recognizing the union of
highpart and lowpart (with two additional tests to check for these
variants).

This revised patch has been tested on x86_64-pc-linux-gnu with a
"make bootstrap" and "make -k check" with no new failures, and all
four new tests pass.
Ok for mainline?

2020-08-17  Roger Sayle  <roger@nextmovesoftware.com>
	    Marc Glisse  <marc.glisse@inria.fr>
	    Jakub Jelinek  <jakub@redhat.com>

gcc/ChangeLog
	* match.pd (((T)bswapX(x)<<C)|bswapX(x>>C) -> bswapY(x)):
	New simplifications to recognize __builtin_bswap{32,64}.

gcc/testsuite/ChangeLog
	* gcc.dg/fold-bswap-1.c: New test.
	* gcc.dg/fold-bswap-2.c: New test.
	* gcc.dg/fold-bswap-3.c: New test.
	* gcc.dg/fold-bswap-4.c: New test.


Thanks in advance,
Roger
--

-----Original Message-----
From: Jakub Jelinek <jakub@redhat.com> 
Sent: 15 August 2020 14:26
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: 'GCC Patches' <gcc-patches@gcc.gnu.org>; 'Marc Glisse' <marc.glisse@inria.fr>
Subject: Re: [PATCH] middle-end: Recognize idioms for bswap32 and bswap64 in match.pd.

On Sat, Aug 15, 2020 at 11:09:17AM +0100, Roger Sayle wrote:
> +/* Recognize ((T)bswap32(x)<<32)|bswap32(x>>32) as bswap64(x).  */ 
> +(simplify
> +  (bit_ior:c

Any reason for supporting bit_ior only?  Don't plus:c or bit_xor:c work the
same (i.e. use (for op (bit_ior bit_xor plus) ...)?

	Jakub
diff --git a/gcc/match.pd b/gcc/match.pd
index c3b8816..3d7a0db 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3410,6 +3410,35 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    (bswap (bitop:c (bswap @0) @1))
    (bitop @0 (bswap @1)))))
 
+/* Recognize ((T)bswap32(x)<<32)|bswap32(x>>32) as bswap64(x).  */
+(for op (bit_ior bit_xor plus)
+  (simplify
+    (op:c
+      (lshift (convert (BUILT_IN_BSWAP32 (convert@0 @1)))
+	      INTEGER_CST@2)
+      (convert (BUILT_IN_BSWAP32 (convert@3 (rshift @1 @2)))))
+    (if (INTEGRAL_TYPE_P (type)
+	 && TYPE_PRECISION (type) == 64
+	 && types_match (TREE_TYPE (@1), uint64_type_node)
+	 && types_match (TREE_TYPE (@0), uint32_type_node)
+	 && types_match (TREE_TYPE (@3), uint32_type_node)
+	 && wi::to_widest (@2) == 32)
+      (convert (BUILT_IN_BSWAP64 @1)))))
+
+/* Recognize ((T)bswap16(x)<<16)|bswap16(x>>16) as bswap32(x).  */
+(for op (bit_ior bit_xor plus)
+  (simplify
+    (op:c
+      (lshift
+	(convert (BUILT_IN_BSWAP16 (convert (bit_and @0 INTEGER_CST@1))))
+	(INTEGER_CST@2))
+      (convert (BUILT_IN_BSWAP16 (convert (rshift @0 @2)))))
+    (if (INTEGRAL_TYPE_P (type)
+	 && TYPE_PRECISION (type) == 32
+	 && types_match (TREE_TYPE (@0), uint32_type_node)
+	 && wi::to_widest (@1) == 65535
+	 && wi::to_widest (@2) == 16)
+      (convert (BUILT_IN_BSWAP32 @0)))))
 
 /* Combine COND_EXPRs and VEC_COND_EXPRs.  */
diff --git a/gcc/testsuite/gcc.dg/fold-bswap-1.c b/gcc/testsuite/gcc.dg/fold-bswap-1.c
new file mode 100644
index 0000000..3abb862
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-1.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+unsigned int swap32(unsigned int x)
+{
+  if (sizeof(unsigned int)==4 && sizeof(unsigned short)==2) {
+    unsigned int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) | a;
+  } else return __builtin_bswap32(x);
+}
+
+unsigned long swap64(unsigned long x)
+{
+  if (sizeof(unsigned long)==8 && sizeof(unsigned int)==4) {
+    unsigned long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) | a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
+
diff --git a/gcc/testsuite/gcc.dg/fold-bswap-2.c b/gcc/testsuite/gcc.dg/fold-bswap-2.c
new file mode 100644
index 0000000..a581fd6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-2.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int swap32(unsigned int x)
+{
+  if (sizeof(int)==4 && sizeof(short)==2) {
+    int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) | a;
+  } else return __builtin_bswap32(x);
+}
+
+long swap64(unsigned long x)
+{
+  if (sizeof(long)==8 && sizeof(int)==4) {
+    long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) | a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
+
diff --git a/gcc/testsuite/gcc.dg/fold-bswap-3.c b/gcc/testsuite/gcc.dg/fold-bswap-3.c
new file mode 100644
index 0000000..13bb6eb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-3.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+unsigned int swap32(unsigned int x)
+{
+  if (sizeof(unsigned int)==4 && sizeof(unsigned short)==2) {
+    unsigned int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) + a;
+  } else return __builtin_bswap32(x);
+}
+
+unsigned long swap64(unsigned long x)
+{
+  if (sizeof(unsigned long)==8 && sizeof(unsigned int)==4) {
+    unsigned long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) + a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
+
diff --git a/gcc/testsuite/gcc.dg/fold-bswap-4.c b/gcc/testsuite/gcc.dg/fold-bswap-4.c
new file mode 100644
index 0000000..1ae2084
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-4.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+unsigned int swap32(unsigned int x)
+{
+  if (sizeof(unsigned int)==4 && sizeof(unsigned short)==2) {
+    unsigned int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) ^ a;
+  } else return __builtin_bswap32(x);
+}
+
+unsigned long swap64(unsigned long x)
+{
+  if (sizeof(unsigned long)==8 && sizeof(unsigned int)==4) {
+    unsigned long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) ^ a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
+
Marc Glisse Aug. 22, 2020, 2:42 p.m. UTC | #5
On Sat, 15 Aug 2020, Roger Sayle wrote:

> Here's version #2 of the patch to recognize bswap32 and bswap64 
> incorporating your suggestions and feedback.  The test cases now confirm 
> the transformation is applied when int is 32 bits and long is 64 bits, 
> and should pass otherwise; the patterns now reuse (more) capturing 
> groups, and the patterns have been made more generic to allow the 
> ultimate type to be signed or unsigned (hence there are now two new 
> gcc.dg tests).
>
> Alas my efforts to allow the input argument to be signed, and use 
> fold_convert to coerce it to the correct type before calling 
> __builtin_bswap failed with the error messages:

You can't use fold_convert for that (well, maybe if you restricted the 
transformation to GENERIC), but if I understand correctly, you are trying 
to do

(convert (BUILT_IN_BSWAP64 (convert:uint64_type_node @1)))

? (untested)

> From: Marc Glisse <marc.glisse@inria.fr>
>
> +(simplify
> +  (bit_ior:c
> +    (lshift
> +      (convert (BUILT_IN_BSWAP16 (convert (bit_and @0
> +						   INTEGER_CST@1))))
> +      (INTEGER_CST@2))
> +    (convert (BUILT_IN_BSWAP16 (convert (rshift @3
> +						INTEGER_CST@4)))))
>
> I didn't realize we kept this useless bit_and when casting to a smaller
> type.

I was confused when I wrote that and thought we were converting from int 
to uint16_t, but bswap16 actually takes an int on x86_64, probably because 
of the calling convention, so we are converting from unsigned int to int.

Having implementation details like the calling convention appear here in 
the intermediate language complicates things a bit. Can we assume that it 
is fine to build a call to bswap32/bswap64 taking uint32_t/uint64_t and 
that only bswap16 can be affected? Do most targets have a similar-enough 
calling convention that this transformation also works on them? It looks 
like aarch64 / powerpc64le / mips64el would like for bswap16->bswap32 a 
transformation of the same form as the one you wrote for bswap32->bswap64.


I was wondering what would happen if I start from an int instead of an 
unsigned int.

f (int x)
{
   short unsigned int _1;
   short unsigned int _2;
   short unsigned int _3;
   int _5;
   int _7;
   unsigned int _8;
   unsigned int _9;
   int _10;

   <bb 2> [local count: 1073741824]:
   _7 = x_4(D) & 65535;
   _1 = __builtin_bswap16 (_7);
   _8 = (unsigned int) x_4(D);
   _9 = _8 >> 16;
   _10 = (int) _9;
   _2 = __builtin_bswap16 (_10);
   _3 = _1 | _2;
   _5 = (int) _3;
   return _5;
}

Handling this in the same transformation with a pair of convert12? and 
some tests should be doable, but it gets complicated enough that it is 
fine to postpone that.
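
Here "convert12?" refers to match.pd's convert1?/convert2? operators,
which make a conversion optional when matching.  A hypothetical,
untested sketch of the shape such an extended bswap16 pattern could
take (the (if ...) guard would need the patch's checks, relaxed to
also accept the signed variants):

  (simplify
    (bit_ior:c
      (lshift
        (convert (BUILT_IN_BSWAP16 (convert (bit_and @0 INTEGER_CST@1))))
        (INTEGER_CST@2))
      (convert (BUILT_IN_BSWAP16 (convert (rshift (convert1? @0) @2)))))
    (if (INTEGRAL_TYPE_P (type)
         && TYPE_PRECISION (type) == 32
         /* ...the remaining constant and type checks from the patch,
            allowing int as well as unsigned int for @0...  */)
      (convert (BUILT_IN_BSWAP32 (convert:uint32_type_node @0)))))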

Patch

diff --git a/gcc/match.pd b/gcc/match.pd
index 7e5c5a6..d4efbf3 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3410,6 +3410,39 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    (bswap (bitop:c (bswap @0) @1))
    (bitop @0 (bswap @1)))))
 
+/* Recognize ((T)bswap32(x)<<32)|bswap32(x>>32) as bswap64(x).  */
+(simplify
+  (bit_ior:c
+    (lshift
+      (convert (BUILT_IN_BSWAP32 (convert@4 @0)))
+      INTEGER_CST@1)
+    (convert (BUILT_IN_BSWAP32 (convert@5 (rshift @2
+						  INTEGER_CST@3)))))
+  (if (operand_equal_p (@0, @2, 0)
+       && types_match (type, uint64_type_node)
+       && types_match (TREE_TYPE (@0), uint64_type_node)
+       && types_match (TREE_TYPE (@4), uint32_type_node)
+       && types_match (TREE_TYPE (@5), uint32_type_node)
+       && wi::to_widest (@1) == 32
+       && wi::to_widest (@3) == 32)
+    (BUILT_IN_BSWAP64 @0)))
+
+/* Recognize ((T)bswap16(x)<<16)|bswap16(x>>16) as bswap32(x).  */
+(simplify
+  (bit_ior:c
+    (lshift
+      (convert (BUILT_IN_BSWAP16 (convert (bit_and @0
+						   INTEGER_CST@1))))
+      (INTEGER_CST@2))
+    (convert (BUILT_IN_BSWAP16 (convert (rshift @3
+						INTEGER_CST@4)))))
+  (if (operand_equal_p (@0, @3, 0)
+       && types_match (type, uint32_type_node)
+       && types_match (TREE_TYPE (@0), uint32_type_node)
+       && wi::to_widest (@1) == 65535
+       && wi::to_widest (@2) == 16
+       && wi::to_widest (@4) == 16)
+    (BUILT_IN_BSWAP32 @0)))
 
 /* Combine COND_EXPRs and VEC_COND_EXPRs.  */