Message ID | CAKdteOZ=xdoANMnTJXoCUSxdkDBCfBDRfBd_KiitxB1ZHXBEiA@mail.gmail.com |
---|---|
State | New |
Headers | show |
> Here is a new patch, modifying tree-ssa-math-opts.c as you suggested. > It's indeed simpler :-) > > Validated with qemu-arm on target arm-none-linux-gnueabi. I cannot formally approve, but this looks good to me. However, could you check that this is also an improvement for PowerPC, which is the only other mainstream architecture with a bswaphi pattern AFAIK?
On Fri, Sep 21, 2012 at 1:04 AM, Christophe Lyon <christophe.lyon@linaro.org> wrote: > On 20 September 2012 09:55, Christophe Lyon <christophe.lyon@linaro.org> wrote: >> On 20 September 2012 09:12, Eric Botcazou <ebotcazou@adacore.com> wrote: >>>> The attached patch catches C constructs: >>>> (A << 8) | (A >> 8) >>>> where A is unsigned 16 bits >>>> and maps them to builtin_bswap16(A) which can provide more efficient >>>> implementations on some targets. >>> >>> This belongs in tree-ssa-math-opts.c:execute_optimize_bswap instead. >>> >> OK I'll have a look at that. Actually I modified fold-const.c because >> it's here that the similar 32 bits pattern is turned into a rotate. >> >>> When I implemented __builtin_bswap16, I didn't add this because I thought this >>> would be overkill since the RTL combiner should be able to catch the pattern. >>> Have you investigated on this front? But I don't have a strong opinion. >>> >> No I didn't. As I said above, I looked at where the 32 bits pattern >> was handled and added the 16 bits one. >> >> Christophe. > > Here is a new patch, modifying tree-ssa-math-opts.c as you suggested. > It's indeed simpler :-) > > Validated with qemu-arm on target arm-none-linux-gnueabi. This fixes PR 43550 also. Thanks, Andrew
On 21 September 2012 12:56, Eric Botcazou <ebotcazou@adacore.com> wrote: >> Here is a new patch, modifying tree-ssa-math-opts.c as you suggested. >> It's indeed simpler :-) >> >> Validated with qemu-arm on target arm-none-linux-gnueabi. > > I cannot formally approve, but this looks good to me. > > However, could you check that this is also an improvement for PowerPC, which > is the only other mainstream architecture with a bswaphi pattern AFAIK? > > -- > Eric Botcazou I have not yet been able to build an environment to run the testsuite with qemu-ppc (not sure about the best target & dejagnu board selection). However, when compiling a sample test short myswaps16(short x) { return (x << 8) | (x >> 8); } unsigned short myswapu16(unsigned short x) { return (x << 8) | (x >> 8); } The generated code is now: myswaps16: rlwinm 10,3,8,16,23 rlwinm 9,3,24,24,31 or 9,9,10 extsh 3,9 blr myswapu16: rlwinm 10,3,8,16,23 rlwinm 9,3,24,24,31 or 9,9,10 rlwinm 3,9,0,0xffff blr While it was (without my patch): myswaps16: slwi 9,3,8 srawi 3,3,8 or 3,9,3 extsh 3,3 blr myswapu16: srwi 9,3,8 rlwinm 3,3,8,16,23 or 3,3,9 blr I don't know PowerPC, but I am not sure it's an improvement. Is it? Christophe.
> I have not yet been able to build an environment to run the testsuite > with qemu-ppc (not sure about the best target & dejagnu board > selection). No problem, that wasn't really a requirement. > However, when compiling a sample test > short myswaps16(short x) { > return (x << 8) | (x >> 8); > } > unsigned short myswapu16(unsigned short x) { > return (x << 8) | (x >> 8); > } > > The generated code is now: > myswaps16: > rlwinm 10,3,8,16,23 > rlwinm 9,3,24,24,31 > or 9,9,10 > extsh 3,9 > blr > > myswapu16: > rlwinm 10,3,8,16,23 > rlwinm 9,3,24,24,31 > or 9,9,10 > rlwinm 3,9,0,0xffff > blr > > While it was (without my patch): > > myswaps16: > slwi 9,3,8 > srawi 3,3,8 > or 3,9,3 > extsh 3,3 > blr > > myswapu16: > srwi 9,3,8 > rlwinm 3,3,8,16,23 > or 3,3,9 > blr > > I don't know PowerPC, but I am not sure it's an improvement. Is it? That indeed doesn't look obvious, especially in the unsigned case. The PowerPC back-end has a splitter for bswap:HI which generates no less than 3 instructions, so I presume we're seeing its effects here. I've CCed the other interested parties. David, Michael, Segher, any comments about or insights into the results reported by Christophe for PowerPC?
>> The generated code is now: >> myswaps16: >> rlwinm 10,3,8,16,23 >> rlwinm 9,3,24,24,31 >> or 9,9,10 >> extsh 3,9 >> blr >> >> myswapu16: >> rlwinm 10,3,8,16,23 >> rlwinm 9,3,24,24,31 >> or 9,9,10 >> rlwinm 3,9,0,0xffff >> blr >> >> While it was (without my patch): >> >> myswaps16: >> slwi 9,3,8 >> srawi 3,3,8 >> or 3,9,3 >> extsh 3,3 >> blr >> >> myswapu16: >> srwi 9,3,8 >> rlwinm 3,3,8,16,23 >> or 3,3,9 >> blr >> >> I don't know PowerPC, but I am not sure it's an improvement. Is it? slwi and srwi are just extended mnemonics for the same rlwinm instruction, so that's the same. The last instruction in the new unsigned variant is superfluous, since it is just setting the top bits to zero, and they already are. rlwinm is ever so slightly better than srawi (in the signed version), because srawi sets the carry bit in addition to the GPR, so that is an improvement. But we can do this sequence in just two instructions: rlwimi 3,3,16,0,15 rlwinm 3,3,8,16,31 blr so some more work is needed to make this optimal ;-) Christophe, it looks like the zero-extend in the unsigned case is not needed on any target? Assuming the shifts are at least SImode, of course (I'm too lazy to check, sorry). Segher
On 25 September 2012 07:00, Segher Boessenkool <segher@kernel.crashing.org> wrote: > Christophe, it looks like the zero-extend in the unsigned case is not > needed on any target? Assuming the shifts are at least SImode, of > course (I'm too lazy to check, sorry). > It's also present when compiling: unsigned short swapu16(unsigned short x) { return __builtin_bswap16(x); } so it's not directly caused by my patch I think. We have to look at the __builtin_bswap16 expansion with an unsigned argument. Christophe.
>> Christophe, it looks like the zero-extend in the unsigned case is not >> needed on any target? Assuming the shifts are at least SImode, of >> course (I'm too lazy to check, sorry). > > It's also present when compiling: > unsigned short swapu16(unsigned short x) { > return __builtin_bswap16(x); > } > > so it's not directly caused by my patch I think. The RTL is (set (reg:HI) (bswap:HI (reg:HI))) which then gets extended for the SI (or DI) function return. Nothing to see here, it's a target problem. Your results look good. Segher
On 25 September 2012 13:32, Segher Boessenkool <segher@kernel.crashing.org> wrote: >>> Christophe, it looks like the zero-extend in the unsigned case is not >>> needed on any target? Assuming the shifts are at least SImode, of >>> course (I'm too lazy to check, sorry). >> >> >> It's also present when compiling: >> unsigned short swapu16(unsigned short x) { >> return __builtin_bswap16(x); >> } >> >> so it's not directly caused by my patch I think. > > > The RTL is (set (reg:HI) (bswap:HI (reg:HI))) which then gets > extended for the SI (or DI) function return. Nothing to see here, > it's a target problem. Your results look good. > OK thank for the confirmation. I guess I just have to wait for approval by the right maintainer now?
> I guess I just have to wait for approval by the right maintainer now?
Right, GCC's bureaucracy is no legend. :-)
I've CCed Richard, who approved the __builtin_bswap16 stuff back in April.
On Fri, Sep 21, 2012 at 10:04 AM, Christophe Lyon <christophe.lyon@linaro.org> wrote: > On 20 September 2012 09:55, Christophe Lyon <christophe.lyon@linaro.org> wrote: >> On 20 September 2012 09:12, Eric Botcazou <ebotcazou@adacore.com> wrote: >>>> The attached patch catches C constructs: >>>> (A << 8) | (A >> 8) >>>> where A is unsigned 16 bits >>>> and maps them to builtin_bswap16(A) which can provide more efficient >>>> implementations on some targets. >>> >>> This belongs in tree-ssa-math-opts.c:execute_optimize_bswap instead. >>> >> OK I'll have a look at that. Actually I modified fold-const.c because >> it's here that the similar 32 bits pattern is turned into a rotate. >> >>> When I implemented __builtin_bswap16, I didn't add this because I thought this >>> would be overkill since the RTL combiner should be able to catch the pattern. >>> Have you investigated on this front? But I don't have a strong opinion. >>> >> No I didn't. As I said above, I looked at where the 32 bits pattern >> was handled and added the 16 bits one. >> >> Christophe. > > Here is a new patch, modifying tree-ssa-math-opts.c as you suggested. > It's indeed simpler :-) > > Validated with qemu-arm on target arm-none-linux-gnueabi. > > OK? Assuming this one was the lastest patch ... Ok. Thanks, Richard. > Christophe.
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c index c3392fb..340f0df 100644 --- a/gcc/tree-ssa-math-opts.c +++ b/gcc/tree-ssa-math-opts.c @@ -154,6 +154,9 @@ static struct static struct { + /* Number of hand-written 16-bit bswaps found. */ + int found_16bit; + /* Number of hand-written 32-bit bswaps found. */ int found_32bit; @@ -1792,9 +1795,9 @@ static unsigned int execute_optimize_bswap (void) { basic_block bb; - bool bswap32_p, bswap64_p; + bool bswap16_p, bswap32_p, bswap64_p; bool changed = false; - tree bswap32_type = NULL_TREE, bswap64_type = NULL_TREE; + tree bswap16_type = NULL_TREE, bswap32_type = NULL_TREE, bswap64_type = NULL_TREE; if (BITS_PER_UNIT != 8) return 0; @@ -1802,17 +1805,25 @@ execute_optimize_bswap (void) if (sizeof (HOST_WIDEST_INT) < 8) return 0; + bswap16_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP16) + && optab_handler (bswap_optab, HImode) != CODE_FOR_nothing); bswap32_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP32) && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing); bswap64_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP64) && (optab_handler (bswap_optab, DImode) != CODE_FOR_nothing || (bswap32_p && word_mode == SImode))); - if (!bswap32_p && !bswap64_p) + if (!bswap16_p && !bswap32_p && !bswap64_p) return 0; /* Determine the argument type of the builtins. The code later on assumes that the return and argument type are the same. */ + if (bswap16_p) + { + tree fndecl = builtin_decl_explicit (BUILT_IN_BSWAP16); + bswap16_type = TREE_VALUE (TYPE_ARG_TYPES (TREE_TYPE (fndecl))); + } + if (bswap32_p) { tree fndecl = builtin_decl_explicit (BUILT_IN_BSWAP32); @@ -1852,6 +1863,13 @@ execute_optimize_bswap (void) switch (type_size) { + case 16: + if (bswap16_p) + { + fndecl = builtin_decl_explicit (BUILT_IN_BSWAP16); + bswap_type = bswap16_type; + } + break; case 32: if (bswap32_p) { @@ -1879,7 +1897,9 @@ execute_optimize_bswap (void) continue; changed = true; - if (type_size == 32) + if (type_size == 16) + bswap_stats.found_16bit++; + else if (type_size == 32) bswap_stats.found_32bit++; else bswap_stats.found_64bit++; @@ -1924,6 +1944,8 @@ execute_optimize_bswap (void) } } + statistics_counter_event (cfun, "16-bit bswap implementations found", + bswap_stats.found_16bit); statistics_counter_event (cfun, "32-bit bswap implementations found", bswap_stats.found_32bit); statistics_counter_event (cfun, "64-bit bswap implementations found", diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap16-1.c b/gcc/testsuite/gcc.target/arm/builtin-bswap16-1.c new file mode 100644 index 0000000..6920f00 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/builtin-bswap16-1.c @@ -0,0 +1,15 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ +/* { dg-require-effective-target arm_arch_v6_ok } */ +/* { dg-add-options arm_arch_v6 } */ +/* { dg-final { scan-assembler-not "orr\[ \t\]" } } */ + +unsigned short swapu16_1 (unsigned short x) +{ + return (x << 8) | (x >> 8); +} + +unsigned short swapu16_2 (unsigned short x) +{ + return (x >> 8) | (x << 8); +}