Patchwork catch builtin_bswap16 construct

login
register
mail settings
Submitter Christophe Lyon
Date Sept. 21, 2012, 8:04 a.m.
Message ID <CAKdteOZ=xdoANMnTJXoCUSxdkDBCfBDRfBd_KiitxB1ZHXBEiA@mail.gmail.com>
Download mbox | patch
Permalink /patch/185602/
State New
Headers show

Comments

Christophe Lyon - Sept. 21, 2012, 8:04 a.m.
On 20 September 2012 09:55, Christophe Lyon <christophe.lyon@linaro.org> wrote:
> On 20 September 2012 09:12, Eric Botcazou <ebotcazou@adacore.com> wrote:
>>> The attached patch catches C constructs:
>>> (A << 8) | (A >> 8)
>>> where A is unsigned 16 bits
>>> and maps them to builtin_bswap16(A) which can provide more efficient
>>> implementations on some targets.
>>
>> This belongs in tree-ssa-math-opts.c:execute_optimize_bswap instead.
>>
> OK I'll have a look at that. Actually I modified fold-const.c because
> it's here that the similar 32 bits pattern is turned into a rotate.
>
>> When I implemented __builtin_bswap16, I didn't add this because I thought this
>> would be overkill since the RTL combiner should be able to catch the pattern.
>> Have you investigated on this front?  But I don't have a strong opinion.
>>
> No I didn't. As I said above, I looked at where the 32 bits pattern
> was handled and added the 16 bits one.
>
> Christophe.

Here is a new patch, modifying tree-ssa-math-opts.c as you suggested.
It's indeed simpler :-)

Validated with qemu-arm on target arm-none-linux-gnueabi.

OK?

Christophe.
2012-09-21  Christophe Lyon <christophe.lyon@linaro.org>

	gcc/
	* tree-ssa-math-opts.c (bswap_stats): Add found_16bit field.
	(execute_optimize_bswap): Add support for builtin_bswap16.

	gcc/testsuite/
	* gcc.target/arm/builtin-bswap16-1.c: New testcase.
Eric Botcazou - Sept. 21, 2012, 10:56 a.m.
> Here is a new patch, modifying tree-ssa-math-opts.c as you suggested.
> It's indeed simpler :-)
> 
> Validated with qemu-arm on target arm-none-linux-gnueabi.

I cannot formally approve, but this looks good to me.

However, could you check that this is also an improvement for PowerPC, which 
is the only other mainstream architecture with a bswaphi pattern AFAIK?
Andrew Pinski - Sept. 23, 2012, 10:57 p.m.
On Fri, Sep 21, 2012 at 1:04 AM, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
> On 20 September 2012 09:55, Christophe Lyon <christophe.lyon@linaro.org> wrote:
>> On 20 September 2012 09:12, Eric Botcazou <ebotcazou@adacore.com> wrote:
>>>> The attached patch catches C constructs:
>>>> (A << 8) | (A >> 8)
>>>> where A is unsigned 16 bits
>>>> and maps them to builtin_bswap16(A) which can provide more efficient
>>>> implementations on some targets.
>>>
>>> This belongs in tree-ssa-math-opts.c:execute_optimize_bswap instead.
>>>
>> OK I'll have a look at that. Actually I modified fold-const.c because
>> it's here that the similar 32 bits pattern is turned into a rotate.
>>
>>> When I implemented __builtin_bswap16, I didn't add this because I thought this
>>> would be overkill since the RTL combiner should be able to catch the pattern.
>>> Have you investigated on this front?  But I don't have a strong opinion.
>>>
>> No I didn't. As I said above, I looked at where the 32 bits pattern
>> was handled and added the 16 bits one.
>>
>> Christophe.
>
> Here is a new patch, modifying tree-ssa-math-opts.c as you suggested.
> It's indeed simpler :-)
>
> Validated with qemu-arm on target arm-none-linux-gnueabi.

This fixes PR 43550 also.

Thanks,
Andrew
Christophe Lyon - Sept. 24, 2012, 8:53 p.m.
On 21 September 2012 12:56, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> Here is a new patch, modifying tree-ssa-math-opts.c as you suggested.
>> It's indeed simpler :-)
>>
>> Validated with qemu-arm on target arm-none-linux-gnueabi.
>
> I cannot formally approve, but this looks good to me.
>
> However, could you check that this is also an improvement for PowerPC, which
> is the only other mainstream architecture with a bswaphi pattern AFAIK?
>
> --
> Eric Botcazou

I have not yet been able to build an environment to run the testsuite
with qemu-ppc (not sure about the best target & dejagnu board
selection).

However, when compiling a sample test
short myswaps16(short x) {
   return (x << 8) | (x >> 8);
}
unsigned short myswapu16(unsigned short x) {
  return (x << 8) | (x >> 8);
}

The generated code is now:
myswaps16:
        rlwinm 10,3,8,16,23
        rlwinm 9,3,24,24,31
        or 9,9,10
        extsh 3,9
        blr

myswapu16:
        rlwinm 10,3,8,16,23
        rlwinm 9,3,24,24,31
        or 9,9,10
        rlwinm 3,9,0,0xffff
        blr

While it was (without my patch):

myswaps16:
        slwi 9,3,8
        srawi 3,3,8
        or 3,9,3
        extsh 3,3
        blr

myswapu16:
        srwi 9,3,8
        rlwinm 3,3,8,16,23
        or 3,3,9
        blr

I don't know PowerPC, but I am not sure it's an improvement. Is it?

Christophe.
Eric Botcazou - Sept. 24, 2012, 9:17 p.m.
> I have not yet been able to build an environment to run the testsuite
> with qemu-ppc (not sure about the best target & dejagnu board
> selection).

No problem, that wasn't really a requirement.

> However, when compiling a sample test
> short myswaps16(short x) {
>    return (x << 8) | (x >> 8);
> }
> unsigned short myswapu16(unsigned short x) {
>   return (x << 8) | (x >> 8);
> }
> 
> The generated code is now:
> myswaps16:
>         rlwinm 10,3,8,16,23
>         rlwinm 9,3,24,24,31
>         or 9,9,10
>         extsh 3,9
>         blr
> 
> myswapu16:
>         rlwinm 10,3,8,16,23
>         rlwinm 9,3,24,24,31
>         or 9,9,10
>         rlwinm 3,9,0,0xffff
>         blr
> 
> While it was (without my patch):
> 
> myswaps16:
>         slwi 9,3,8
>         srawi 3,3,8
>         or 3,9,3
>         extsh 3,3
>         blr
> 
> myswapu16:
>         srwi 9,3,8
>         rlwinm 3,3,8,16,23
>         or 3,3,9
>         blr
> 
> I don't know PowerPC, but I am not sure it's an improvement. Is it?

That indeed doesn't look obvious, especially in the unsigned case.  The 
PowerPC back-end has a splitter for bswap:HI which generates no less than 3 
instructions, so I presume we're seeing its effects here.

I've CCed the other interested parties.  David, Michael, Segher, any comments 
about or insights into the results reported by Christophe for PowerPC?
Segher Boessenkool - Sept. 25, 2012, 5 a.m.
>> The generated code is now:
>> myswaps16:
>>         rlwinm 10,3,8,16,23
>>         rlwinm 9,3,24,24,31
>>         or 9,9,10
>>         extsh 3,9
>>         blr
>>
>> myswapu16:
>>         rlwinm 10,3,8,16,23
>>         rlwinm 9,3,24,24,31
>>         or 9,9,10
>>         rlwinm 3,9,0,0xffff
>>         blr
>>
>> While it was (without my patch):
>>
>> myswaps16:
>>         slwi 9,3,8
>>         srawi 3,3,8
>>         or 3,9,3
>>         extsh 3,3
>>         blr
>>
>> myswapu16:
>>         srwi 9,3,8
>>         rlwinm 3,3,8,16,23
>>         or 3,3,9
>>         blr
>>
>> I don't know PowerPC, but I am not sure it's an improvement. Is it?

slwi and srwi are just extended mnemonics for the same rlwinm  
instruction,
so that's the same.  The last instruction in the new unsigned variant is
superfluous, since it is just setting the top bits to zero, and they
already are.  rlwinm is ever so slightly better than srawi (in the  
signed
version), because srawi sets the carry bit in addition to the GPR, so  
that
is an improvement.

But we can do this sequence in just two instructions:

	rlwimi 3,3,16,0,15
	rlwinm 3,3,8,16,31
	blr

so some more work is needed to make this optimal ;-)

Christophe, it looks like the zero-extend in the unsigned case is not
needed on any target?  Assuming the shifts are at least SImode, of
course (I'm too lazy to check, sorry).


Segher
Christophe Lyon - Sept. 25, 2012, 10:28 a.m.
On 25 September 2012 07:00, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
> Christophe, it looks like the zero-extend in the unsigned case is not
> needed on any target?  Assuming the shifts are at least SImode, of
> course (I'm too lazy to check, sorry).
>

It's also present when compiling:
unsigned short swapu16(unsigned short x) {
     return __builtin_bswap16(x);
}

so it's not directly caused by my patch I think.

We have to look at the __builtin_bswap16 expansion with an unsigned argument.

Christophe.
Segher Boessenkool - Sept. 25, 2012, 11:32 a.m.
>> Christophe, it looks like the zero-extend in the unsigned case is not
>> needed on any target?  Assuming the shifts are at least SImode, of
>> course (I'm too lazy to check, sorry).
>
> It's also present when compiling:
> unsigned short swapu16(unsigned short x) {
>      return __builtin_bswap16(x);
> }
>
> so it's not directly caused by my patch I think.

The RTL is  (set (reg:HI) (bswap:HI (reg:HI)))  which then gets
extended for the SI (or DI) function return.  Nothing to see here,
it's a target problem.  Your results look good.


Segher
Christophe Lyon - Sept. 25, 2012, 3:05 p.m.
On 25 September 2012 13:32, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>>> Christophe, it looks like the zero-extend in the unsigned case is not
>>> needed on any target?  Assuming the shifts are at least SImode, of
>>> course (I'm too lazy to check, sorry).
>>
>>
>> It's also present when compiling:
>> unsigned short swapu16(unsigned short x) {
>>      return __builtin_bswap16(x);
>> }
>>
>> so it's not directly caused by my patch I think.
>
>
> The RTL is  (set (reg:HI) (bswap:HI (reg:HI)))  which then gets
> extended for the SI (or DI) function return.  Nothing to see here,
> it's a target problem.  Your results look good.
>

OK thank for the confirmation.

I guess I just have to wait for approval by the right maintainer now?
Eric Botcazou - Sept. 25, 2012, 3:30 p.m.
> I guess I just have to wait for approval by the right maintainer now?

Right, GCC's bureaucracy is no legend. :-)

I've CCed Richard, who approved the __builtin_bswap16 stuff back in April.
Richard Guenther - Sept. 26, 2012, 9:14 a.m.
On Fri, Sep 21, 2012 at 10:04 AM, Christophe Lyon
<christophe.lyon@linaro.org> wrote:
> On 20 September 2012 09:55, Christophe Lyon <christophe.lyon@linaro.org> wrote:
>> On 20 September 2012 09:12, Eric Botcazou <ebotcazou@adacore.com> wrote:
>>>> The attached patch catches C constructs:
>>>> (A << 8) | (A >> 8)
>>>> where A is unsigned 16 bits
>>>> and maps them to builtin_bswap16(A) which can provide more efficient
>>>> implementations on some targets.
>>>
>>> This belongs in tree-ssa-math-opts.c:execute_optimize_bswap instead.
>>>
>> OK I'll have a look at that. Actually I modified fold-const.c because
>> it's here that the similar 32 bits pattern is turned into a rotate.
>>
>>> When I implemented __builtin_bswap16, I didn't add this because I thought this
>>> would be overkill since the RTL combiner should be able to catch the pattern.
>>> Have you investigated on this front?  But I don't have a strong opinion.
>>>
>> No I didn't. As I said above, I looked at where the 32 bits pattern
>> was handled and added the 16 bits one.
>>
>> Christophe.
>
> Here is a new patch, modifying tree-ssa-math-opts.c as you suggested.
> It's indeed simpler :-)
>
> Validated with qemu-arm on target arm-none-linux-gnueabi.
>
> OK?

Assuming this one was the lastest patch ...

Ok.

Thanks,
Richard.

> Christophe.

Patch

diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index c3392fb..340f0df 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -154,6 +154,9 @@  static struct
 
 static struct
 {
+  /* Number of hand-written 16-bit bswaps found.  */
+  int found_16bit;
+
   /* Number of hand-written 32-bit bswaps found.  */
   int found_32bit;
 
@@ -1792,9 +1795,9 @@  static unsigned int
 execute_optimize_bswap (void)
 {
   basic_block bb;
-  bool bswap32_p, bswap64_p;
+  bool bswap16_p, bswap32_p, bswap64_p;
   bool changed = false;
-  tree bswap32_type = NULL_TREE, bswap64_type = NULL_TREE;
+  tree bswap16_type = NULL_TREE, bswap32_type = NULL_TREE, bswap64_type = NULL_TREE;
 
   if (BITS_PER_UNIT != 8)
     return 0;
@@ -1802,17 +1805,25 @@  execute_optimize_bswap (void)
   if (sizeof (HOST_WIDEST_INT) < 8)
     return 0;
 
+  bswap16_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP16)
+	       && optab_handler (bswap_optab, HImode) != CODE_FOR_nothing);
   bswap32_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP32)
 	       && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing);
   bswap64_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP64)
 	       && (optab_handler (bswap_optab, DImode) != CODE_FOR_nothing
 		   || (bswap32_p && word_mode == SImode)));
 
-  if (!bswap32_p && !bswap64_p)
+  if (!bswap16_p && !bswap32_p && !bswap64_p)
     return 0;
 
   /* Determine the argument type of the builtins.  The code later on
      assumes that the return and argument type are the same.  */
+  if (bswap16_p)
+    {
+      tree fndecl = builtin_decl_explicit (BUILT_IN_BSWAP16);
+      bswap16_type = TREE_VALUE (TYPE_ARG_TYPES (TREE_TYPE (fndecl)));
+    }
+
   if (bswap32_p)
     {
       tree fndecl = builtin_decl_explicit (BUILT_IN_BSWAP32);
@@ -1852,6 +1863,13 @@  execute_optimize_bswap (void)
 
 	  switch (type_size)
 	    {
+	    case 16:
+	      if (bswap16_p)
+		{
+		  fndecl = builtin_decl_explicit (BUILT_IN_BSWAP16);
+		  bswap_type = bswap16_type;
+		}
+	      break;
 	    case 32:
 	      if (bswap32_p)
 		{
@@ -1879,7 +1897,9 @@  execute_optimize_bswap (void)
 	    continue;
 
 	  changed = true;
-	  if (type_size == 32)
+	  if (type_size == 16)
+	    bswap_stats.found_16bit++;
+	  else if (type_size == 32)
 	    bswap_stats.found_32bit++;
 	  else
 	    bswap_stats.found_64bit++;
@@ -1924,6 +1944,8 @@  execute_optimize_bswap (void)
 	}
     }
 
+  statistics_counter_event (cfun, "16-bit bswap implementations found",
+			    bswap_stats.found_16bit);
   statistics_counter_event (cfun, "32-bit bswap implementations found",
 			    bswap_stats.found_32bit);
   statistics_counter_event (cfun, "64-bit bswap implementations found",
diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap16-1.c b/gcc/testsuite/gcc.target/arm/builtin-bswap16-1.c
new file mode 100644
index 0000000..6920f00
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/builtin-bswap16-1.c
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-require-effective-target arm_arch_v6_ok } */
+/* { dg-add-options arm_arch_v6 } */
+/* { dg-final { scan-assembler-not "orr\[ \t\]" } } */
+
+unsigned short swapu16_1 (unsigned short x)
+{
+  return (x << 8) | (x >> 8);
+}
+
+unsigned short swapu16_2 (unsigned short x)
+{
+  return (x >> 8) | (x << 8);
+}