Message ID: 000f01d67083$bc737c00$355a7400$@nextmovesoftware.com
State: New
Series: middle-end: Recognize idioms for bswap32 and bswap64 in match.pd.
On Wed, 12 Aug 2020, Roger Sayle wrote:

> This patch is inspired by a small code fragment in comment #3 of
> bugzilla PR rtl-optimization/94804.  That snippet appears almost
> unrelated to the topic of the PR, but recognizing __builtin_bswap64
> from two __builtin_bswap32 calls seems like a clever/useful trick.
> GCC's optabs.c contains the inverse logic to expand bswap64 by
> IORing two bswap32 calls, so this transformation/canonicalization
> is safe, even on targets without suitable optab support.  But
> on x86_64, the swap64 of the test case becomes a single instruction.
>
> This patch has been tested on x86_64-pc-linux-gnu with a "make
> bootstrap" and a "make -k check" with no new failures.
> Ok for mainline?

Your tests seem to assume that int has 32 bits and long 64.

+  (if (operand_equal_p (@0, @2, 0)

Why not reuse @0 instead of introducing @2 in the pattern?  Similarly, it
may be a bit shorter to reuse @1 instead of a new @3 (I don't think the
tricks with @@ will be needed here).

+       && types_match (TREE_TYPE (@0), uint64_type_node)

that seems very specific.  What goes wrong with a signed type for instance?

+(simplify
+ (bit_ior:c
+  (lshift
+   (convert (BUILT_IN_BSWAP16 (convert (bit_and @0 INTEGER_CST@1))))
+   (INTEGER_CST@2))
+  (convert (BUILT_IN_BSWAP16 (convert (rshift @3 INTEGER_CST@4)))))

I didn't realize we kept this useless bit_and when casting to a smaller
type.

We probably get a different pattern on 16-bit targets, but a pattern they
do not match won't hurt them.
Hi Marc,

Here's version #2 of the patch to recognize bswap32 and bswap64
incorporating your suggestions and feedback.  The test cases now confirm
the transformation is applied when int is 32 bits and long is 64 bits,
and should pass otherwise; the patterns now reuse (more) capturing
groups, and the patterns have been made more generic to allow the
ultimate type to be signed or unsigned (hence there are now two new
gcc.dg tests).

Alas my efforts to allow the input argument to be signed, and use
fold_convert to coerce it to the correct type before calling
__builtin_bswap, failed with the error messages:

>fold-bswap-2.c: In function 'swap64':
>fold-bswap-2.c:22:1: error: invalid argument to gimple call
>(long unsigned int) x_6(D)
>_12 = __builtin_bswap64 ((long unsigned int) x_6(D));
>during GIMPLE pass: forwprop
>fold-bswap-2.c:22:1: internal compiler error: verify_gimple failed

So I require arguments to be the expected type for now.  If anyone's
sufficiently motivated to support these cases, this can be done as a
follow-up patch.

This revised patch has been tested on x86_64-pc-linux-gnu with a
"make bootstrap" and "make -k check" with no new failures.
Ok for mainline?

Thanks in advance,
Roger
--

-----Original Message-----
From: Marc Glisse <marc.glisse@inria.fr>
Sent: 12 August 2020 10:43
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: 'GCC Patches' <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] middle-end: Recognize idioms for bswap32 and bswap64 in match.pd.

On Wed, 12 Aug 2020, Roger Sayle wrote:

> This patch is inspired by a small code fragment in comment #3 of
> bugzilla PR rtl-optimization/94804.  That snippet appears almost
> unrelated to the topic of the PR, but recognizing __builtin_bswap64
> from two __builtin_bswap32 calls seems like a clever/useful trick.
> GCC's optabs.c contains the inverse logic to expand bswap64 by IORing
> two bswap32 calls, so this transformation/canonicalization is safe,
> even on targets without suitable optab support.  But on x86_64, the
> swap64 of the test case becomes a single instruction.
>
> This patch has been tested on x86_64-pc-linux-gnu with a "make
> bootstrap" and a "make -k check" with no new failures.
> Ok for mainline?

Your tests seem to assume that int has 32 bits and long 64.

+  (if (operand_equal_p (@0, @2, 0)

Why not reuse @0 instead of introducing @2 in the pattern?  Similarly, it
may be a bit shorter to reuse @1 instead of a new @3 (I don't think the
tricks with @@ will be needed here).

+       && types_match (TREE_TYPE (@0), uint64_type_node)

that seems very specific.  What goes wrong with a signed type for instance?

+(simplify
+ (bit_ior:c
+  (lshift
+   (convert (BUILT_IN_BSWAP16 (convert (bit_and @0 INTEGER_CST@1))))
+   (INTEGER_CST@2))
+  (convert (BUILT_IN_BSWAP16 (convert (rshift @3 INTEGER_CST@4)))))

I didn't realize we kept this useless bit_and when casting to a smaller
type.

We probably get a different pattern on 16-bit targets, but a pattern they
do not match won't hurt them.

--
Marc Glisse

diff --git a/gcc/match.pd b/gcc/match.pd
index c3b8816..c682d3d 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3410,6 +3410,33 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    (bswap (bitop:c (bswap @0) @1))
    (bitop @0 (bswap @1)))))

+/* Recognize ((T)bswap32(x)<<32)|bswap32(x>>32) as bswap64(x).  */
+(simplify
+ (bit_ior:c
+  (lshift (convert (BUILT_IN_BSWAP32 (convert@0 @1)))
+          INTEGER_CST@2)
+  (convert (BUILT_IN_BSWAP32 (convert@3 (rshift @1 @2)))))
+ (if (INTEGRAL_TYPE_P (type)
+      && TYPE_PRECISION (type) == 64
+      && types_match (TREE_TYPE (@1), uint64_type_node)
+      && types_match (TREE_TYPE (@0), uint32_type_node)
+      && types_match (TREE_TYPE (@3), uint32_type_node)
+      && wi::to_widest (@2) == 32)
+  (convert (BUILT_IN_BSWAP64 @1))))
+
+/* Recognize ((T)bswap16(x)<<16)|bswap16(x>>16) as bswap32(x).  */
+(simplify
+ (bit_ior:c
+  (lshift
+   (convert (BUILT_IN_BSWAP16 (convert (bit_and @0 INTEGER_CST@1))))
+   (INTEGER_CST@2))
+  (convert (BUILT_IN_BSWAP16 (convert (rshift @0 @2)))))
+ (if (INTEGRAL_TYPE_P (type)
+      && TYPE_PRECISION (type) == 32
+      && types_match (TREE_TYPE (@0), uint32_type_node)
+      && wi::to_widest (@1) == 65535
+      && wi::to_widest (@2) == 16)
+  (convert (BUILT_IN_BSWAP32 @0))))
+
 /* Combine COND_EXPRs and VEC_COND_EXPRs.  */

diff --git a/gcc/testsuite/gcc.dg/fold-bswap-1.c b/gcc/testsuite/gcc.dg/fold-bswap-1.c
new file mode 100644
index 0000000..3abb862
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-1.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+unsigned int swap32(unsigned int x)
+{
+  if (sizeof(unsigned int)==4 && sizeof(unsigned short)==2) {
+    unsigned int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) | a;
+  } else return __builtin_bswap32(x);
+}
+
+unsigned long swap64(unsigned long x)
+{
+  if (sizeof(unsigned long)==8 && sizeof(unsigned int)==4) {
+    unsigned long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) | a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
+
diff --git a/gcc/testsuite/gcc.dg/fold-bswap-2.c b/gcc/testsuite/gcc.dg/fold-bswap-2.c
new file mode 100644
index 0000000..a581fd6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-2.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int swap32(unsigned int x)
+{
+  if (sizeof(int)==4 && sizeof(short)==2) {
+    int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) | a;
+  } else return __builtin_bswap32(x);
+}
+
+long swap64(unsigned long x)
+{
+  if (sizeof(long)==8 && sizeof(int)==4) {
+    long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) | a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
On Sat, Aug 15, 2020 at 11:09:17AM +0100, Roger Sayle wrote:
> +/* Recognize ((T)bswap32(x)<<32)|bswap32(x>>32) as bswap64(x).  */
> +(simplify
> + (bit_ior:c

Any reason for supporting bit_ior only?  Don't plus:c or bit_xor:c work
the same (i.e. use (for op (bit_ior bit_xor plus) ...)?

	Jakub
Hi Jakub and Marc,

Here's version #3 of the patch to recognize bswap32 and bswap64 that now
also implements Jakub's suggestion to support addition and xor in
addition to bitwise ior when recognizing the union of highpart and
lowpart (and two additional tests to check for these variants).

This revised patch has been tested on x86_64-pc-linux-gnu with a
"make bootstrap" and "make -k check" with no new failures, and confirming
all four new tests pass.
Ok for mainline?

2020-08-17  Roger Sayle  <roger@nextmovesoftware.com>
	    Marc Glisse  <marc.glisse@inria.fr>
	    Jakub Jelinek  <jakub@redhat.com>

gcc/ChangeLog
	* match.pd (((T)bswapX(x)<<C)|bswapX(x>>C) -> bswapY(x)): New
	simplifications to recognize __builtin_bswap{32,64}.

gcc/testsuite/ChangeLog
	* gcc.dg/fold-bswap-1.c: New test.
	* gcc.dg/fold-bswap-2.c: New test.
	* gcc.dg/fold-bswap-3.c: New test.
	* gcc.dg/fold-bswap-4.c: New test.

Thanks in advance,
Roger
--

-----Original Message-----
From: Jakub Jelinek <jakub@redhat.com>
Sent: 15 August 2020 14:26
To: Roger Sayle <roger@nextmovesoftware.com>
Cc: 'GCC Patches' <gcc-patches@gcc.gnu.org>; 'Marc Glisse' <marc.glisse@inria.fr>
Subject: Re: [PATCH] middle-end: Recognize idioms for bswap32 and bswap64 in match.pd.

On Sat, Aug 15, 2020 at 11:09:17AM +0100, Roger Sayle wrote:
> +/* Recognize ((T)bswap32(x)<<32)|bswap32(x>>32) as bswap64(x).  */
> +(simplify
> + (bit_ior:c

Any reason for supporting bit_ior only?  Don't plus:c or bit_xor:c work
the same (i.e. use (for op (bit_ior bit_xor plus) ...)?

	Jakub

diff --git a/gcc/match.pd b/gcc/match.pd
index c3b8816..3d7a0db 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3410,6 +3410,35 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    (bswap (bitop:c (bswap @0) @1))
    (bitop @0 (bswap @1)))))

+/* Recognize ((T)bswap32(x)<<32)|bswap32(x>>32) as bswap64(x).  */
+(for op (bit_ior bit_xor plus)
+ (simplify
+  (op:c
+   (lshift (convert (BUILT_IN_BSWAP32 (convert@0 @1)))
+           INTEGER_CST@2)
+   (convert (BUILT_IN_BSWAP32 (convert@3 (rshift @1 @2)))))
+  (if (INTEGRAL_TYPE_P (type)
+       && TYPE_PRECISION (type) == 64
+       && types_match (TREE_TYPE (@1), uint64_type_node)
+       && types_match (TREE_TYPE (@0), uint32_type_node)
+       && types_match (TREE_TYPE (@3), uint32_type_node)
+       && wi::to_widest (@2) == 32)
+   (convert (BUILT_IN_BSWAP64 @1)))))
+
+/* Recognize ((T)bswap16(x)<<16)|bswap16(x>>16) as bswap32(x).  */
+(for op (bit_ior bit_xor plus)
+ (simplify
+  (op:c
+   (lshift
+    (convert (BUILT_IN_BSWAP16 (convert (bit_and @0 INTEGER_CST@1))))
+    (INTEGER_CST@2))
+   (convert (BUILT_IN_BSWAP16 (convert (rshift @0 @2)))))
+  (if (INTEGRAL_TYPE_P (type)
+       && TYPE_PRECISION (type) == 32
+       && types_match (TREE_TYPE (@0), uint32_type_node)
+       && wi::to_widest (@1) == 65535
+       && wi::to_widest (@2) == 16)
+   (convert (BUILT_IN_BSWAP32 @0)))))
+
 /* Combine COND_EXPRs and VEC_COND_EXPRs.  */

diff --git a/gcc/testsuite/gcc.dg/fold-bswap-1.c b/gcc/testsuite/gcc.dg/fold-bswap-1.c
new file mode 100644
index 0000000..3abb862
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-1.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+unsigned int swap32(unsigned int x)
+{
+  if (sizeof(unsigned int)==4 && sizeof(unsigned short)==2) {
+    unsigned int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) | a;
+  } else return __builtin_bswap32(x);
+}
+
+unsigned long swap64(unsigned long x)
+{
+  if (sizeof(unsigned long)==8 && sizeof(unsigned int)==4) {
+    unsigned long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) | a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
+
diff --git a/gcc/testsuite/gcc.dg/fold-bswap-2.c b/gcc/testsuite/gcc.dg/fold-bswap-2.c
new file mode 100644
index 0000000..a581fd6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-2.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int swap32(unsigned int x)
+{
+  if (sizeof(int)==4 && sizeof(short)==2) {
+    int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) | a;
+  } else return __builtin_bswap32(x);
+}
+
+long swap64(unsigned long x)
+{
+  if (sizeof(long)==8 && sizeof(int)==4) {
+    long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) | a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
+
diff --git a/gcc/testsuite/gcc.dg/fold-bswap-3.c b/gcc/testsuite/gcc.dg/fold-bswap-3.c
new file mode 100644
index 0000000..13bb6eb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-3.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+unsigned int swap32(unsigned int x)
+{
+  if (sizeof(unsigned int)==4 && sizeof(unsigned short)==2) {
+    unsigned int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) + a;
+  } else return __builtin_bswap32(x);
+}
+
+unsigned long swap64(unsigned long x)
+{
+  if (sizeof(unsigned long)==8 && sizeof(unsigned int)==4) {
+    unsigned long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) + a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
+
diff --git a/gcc/testsuite/gcc.dg/fold-bswap-4.c b/gcc/testsuite/gcc.dg/fold-bswap-4.c
new file mode 100644
index 0000000..1ae2084
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-bswap-4.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+unsigned int swap32(unsigned int x)
+{
+  if (sizeof(unsigned int)==4 && sizeof(unsigned short)==2) {
+    unsigned int a = __builtin_bswap16(x);
+    x >>= 16;
+    a <<= 16;
+    return __builtin_bswap16(x) ^ a;
+  } else return __builtin_bswap32(x);
+}
+
+unsigned long swap64(unsigned long x)
+{
+  if (sizeof(unsigned long)==8 && sizeof(unsigned int)==4) {
+    unsigned long a = __builtin_bswap32(x);
+    x >>= 32;
+    a <<= 32;
+    return __builtin_bswap32(x) ^ a;
+  } else return __builtin_bswap64(x);
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_bswap32" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_bswap64" 1 "optimized" } } */
On Sat, 15 Aug 2020, Roger Sayle wrote:

> Here's version #2 of the patch to recognize bswap32 and bswap64
> incorporating your suggestions and feedback.  The test cases now confirm
> the transformation is applied when int is 32 bits and long is 64 bits,
> and should pass otherwise; the patterns now reuse (more) capturing
> groups, and the patterns have been made more generic to allow the
> ultimate type to be signed or unsigned (hence there are now two new
> gcc.dg tests).
>
> Alas my efforts to allow the input argument to be signed, and use
> fold_convert to coerce it to the correct type before calling
> __builtin_bswap failed, with the error messages:

You can't use fold_convert for that (well, maybe if you restricted the
transformation to GENERIC), but if I understand correctly, you are trying
to do

(convert (BUILT_IN_BSWAP64 (convert:uint64_type_node @1))))))

? (untested)

> From: Marc Glisse <marc.glisse@inria.fr>
>
> +(simplify
> + (bit_ior:c
> +  (lshift
> +   (convert (BUILT_IN_BSWAP16 (convert (bit_and @0 INTEGER_CST@1))))
> +   (INTEGER_CST@2))
> +  (convert (BUILT_IN_BSWAP16 (convert (rshift @3 INTEGER_CST@4)))))
>
> I didn't realize we kept this useless bit_and when casting to a smaller
> type.

I was confused when I wrote that and thought we were converting from int
to uint16_t, but bswap16 actually takes an int on x86_64, probably
because of the calling convention, so we are converting from unsigned int
to int.  Having implementation details like the calling convention appear
here in the intermediate language complicates things a bit.

Can we assume that it is fine to build a call to bswap32/bswap64 taking
uint32_t/uint64_t and that only bswap16 can be affected?  Do most targets
have a similar-enough calling convention that this transformation also
works on them?  It looks like aarch64 / powerpc64le / mips64el would like
for bswap16->bswap32 a transformation of the same form as the one you
wrote for bswap32->bswap64.
I was wondering what would happen if I start from an int instead of an
unsigned int.

f (int x)
{
  short unsigned int _1;
  short unsigned int _2;
  short unsigned int _3;
  int _5;
  int _7;
  unsigned int _8;
  unsigned int _9;
  int _10;

  <bb 2> [local count: 1073741824]:
  _7 = x_4(D) & 65535;
  _1 = __builtin_bswap16 (_7);
  _8 = (unsigned int) x_4(D);
  _9 = _8 >> 16;
  _10 = (int) _9;
  _2 = __builtin_bswap16 (_10);
  _3 = _1 | _2;
  _5 = (int) _3;
  return _5;
}

Handling this in the same transformation with a pair of convert12? and
some tests should be doable, but it gets complicated enough that it is
fine to postpone that.
diff --git a/gcc/match.pd b/gcc/match.pd
index 7e5c5a6..d4efbf3 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3410,6 +3410,39 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    (bswap (bitop:c (bswap @0) @1))
    (bitop @0 (bswap @1)))))

+/* Recognize ((T)bswap32(x)<<32)|bswap32(x>>32) as bswap64(x).  */
+(simplify
+ (bit_ior:c
+  (lshift
+   (convert (BUILT_IN_BSWAP32 (convert@4 @0)))
+   INTEGER_CST@1)
+  (convert (BUILT_IN_BSWAP32 (convert@5 (rshift @2 INTEGER_CST@3)))))
+ (if (operand_equal_p (@0, @2, 0)
+      && types_match (type, uint64_type_node)
+      && types_match (TREE_TYPE (@0), uint64_type_node)
+      && types_match (TREE_TYPE (@4), uint32_type_node)
+      && types_match (TREE_TYPE (@5), uint32_type_node)
+      && wi::to_widest (@1) == 32
+      && wi::to_widest (@3) == 32)
+  (BUILT_IN_BSWAP64 @0)))
+
+/* Recognize ((T)bswap16(x)<<16)|bswap16(x>>16) as bswap32(x).  */
+(simplify
+ (bit_ior:c
+  (lshift
+   (convert (BUILT_IN_BSWAP16 (convert (bit_and @0 INTEGER_CST@1))))
+   (INTEGER_CST@2))
+  (convert (BUILT_IN_BSWAP16 (convert (rshift @3 INTEGER_CST@4)))))
+ (if (operand_equal_p (@0, @3, 0)
+      && types_match (type, uint32_type_node)
+      && types_match (TREE_TYPE (@0), uint32_type_node)
+      && wi::to_widest (@1) == 65535
+      && wi::to_widest (@2) == 16
+      && wi::to_widest (@4) == 16)
+  (BUILT_IN_BSWAP32 @0)))
+
 /* Combine COND_EXPRs and VEC_COND_EXPRs.  */