
[RFC] Remove a bad use of SLOW_UNALIGNED_ACCESS

Message ID AM5PR0802MB2610405E0020EE80CF9099F383A10@AM5PR0802MB2610.eurprd08.prod.outlook.com
State New

Commit Message

Wilco Dijkstra Nov. 1, 2016, 5:36 p.m. UTC
Looking at PR77308, one of the issues is that the bswap optimization 
phase doesn't work on ARM.  This is due to an odd check that uses
SLOW_UNALIGNED_ACCESS (which is always true on ARM).  Since the testcase
in PR77308 generates much better code with this patch (~13% fewer
instructions), it seems best to remove this odd check.
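
For reference, the kind of byte-access pattern the bswap pass recognizes
looks roughly like the functions in the testcases changed below (a sketch
with illustrative names):

  #include <stdint.h>

  /* A big-endian 32-bit read written as four byte loads.  On a
     little-endian target the bswap pass can replace the four narrow loads
     and shifts with a single (possibly unaligned) 32-bit load followed by
     a byte swap.  */
  uint32_t
  read_be32 (const unsigned char *data)
  {
    return ((uint32_t) data[0] << 24) | ((uint32_t) data[1] << 16)
           | ((uint32_t) data[2] << 8) | (uint32_t) data[3];
  }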

This exposes a problem with SLOW_UNALIGNED_ACCESS - what is it supposed
to mean or do? According to its current definition, it means we should
never emit an unaligned access for a given mode as it would lead to a
trap.  However, that is not what happens: for example, all integer modes on
ARM support really fast unaligned access, and we generate unaligned
instructions without any issues.  Some Thumb-1 targets automatically expand
unaligned accesses if necessary.  So this macro clearly doesn't stop unaligned
accesses from being generated.

So I want to set it to false for most modes on ARM as they are not slow. 
However doing so causes incorrect code generation and unaligned traps.
How can we differentiate between modes that support fast unaligned access
in hardware, modes that get expanded and modes that should never be used in
an unaligned access?
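
For context, the definitions involved look roughly like this (paraphrased
from defaults.h and the ARM backend rather than quoted exactly):

  /* gcc/defaults.h: fallback when a target does not define the macro,
     i.e. "unaligned accesses are slow whenever they are not allowed".  */
  #ifndef SLOW_UNALIGNED_ACCESS
  #define SLOW_UNALIGNED_ACCESS(MODE, ALIGN) STRICT_ALIGNMENT
  #endif

  /* gcc/config/arm/arm.h: ARM reports every unaligned access as slow,
     which is what blocks the bswap optimization.  */
  #define SLOW_UNALIGNED_ACCESS(MODE, ALIGN) 1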

Bootstrap & regress OK.

ChangeLog:
2016-11-01  Wilco Dijkstra  <wdijkstr@arm.com>

    gcc/
	* tree-ssa-math-opts.c (bswap_replace): Remove test
	of SLOW_UNALIGNED_ACCESS.

    testsuite/
	* gcc.dg/optimize-bswapdi-3.c: Remove xfail.
	* gcc.dg/optimize-bswaphi-1.c: Likewise. 	
	* gcc.dg/optimize-bswapsi-2.c: Likewise.

--

Comments

Jeff Law Nov. 1, 2016, 8:58 p.m. UTC | #1
On 11/01/2016 11:36 AM, Wilco Dijkstra wrote:
> Looking at PR77308, one of the issues is that the bswap optimization
> phase doesn't work on ARM.  This is due to an odd check that uses
> SLOW_UNALIGNED_ACCESS (which is always true on ARM).  Since the testcase
> in PR77308 generates much better code with this patch (~13% fewer
> instructions), it seems best to remove this odd check.
>
> This exposes a problem with SLOW_UNALIGNED_ACCESS - what is it supposed
> to mean or do? According to its current definition, it means we should
> never emit an unaligned access for a given mode as it would lead to a
> trap.  However, that is not what happens: for example, all integer modes on
> ARM support really fast unaligned access, and we generate unaligned
> instructions without any issues.  Some Thumb-1 targets automatically expand
> unaligned accesses if necessary.  So this macro clearly doesn't stop unaligned
> accesses from being generated.
>
> So I want to set it to false for most modes on ARM as they are not slow.
> However doing so causes incorrect code generation and unaligned traps.
> How can we differentiate between modes that support fast unaligned access
> in hardware, modes that get expanded and modes that should never be used in
> an unaligned access?
>
> Bootstrap & regress OK.
>
> ChangeLog:
> 2016-11-01  Wilco Dijkstra  <wdijkstr@arm.com>
>
>     gcc/
> 	* tree-ssa-math-opts.c (bswap_replace): Remove test
> 	of SLOW_UNALIGNED_ACCESS.
>
>     testsuite/
> 	* gcc.dg/optimize-bswapdi-3.c: Remove xfail.
> 	* gcc.dg/optimize-bswaphi-1.c: Likewise. 	
> 	* gcc.dg/optimize-bswapsi-2.c: Likewise.
I think you'll need to look at bz61320 before this could go in.

jeff
>
Wilco Dijkstra Nov. 1, 2016, 9:39 p.m. UTC | #2
Jeff Law <law@redhat.com> wrote:

> I think you'll need to look at bz61320 before this could go in.

I had a look, but there is nothing there that is related - eventually
a latent alignment bug was fixed in IVOpt. Note that the bswap phase
currently inserts unaligned accesses irrespective of STRICT_ALIGNMENT
or SLOW_UNALIGNED_ACCESS:

-      if (bswap
-          && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
-          && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
-        return false;

If bswap is false no byte swap is needed, so we found a native endian load
and it will always perform the optimization by inserting an unaligned load.
This apparently works on all targets, and doesn't cause alignment traps or
huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
So I'm at a loss what these macros are supposed to mean and how I can query
whether a backend supports fast unaligned access for a particular mode.

What I actually want to write is something like:

 if (!FAST_UNALIGNED_LOAD (mode, align)) return false;

And know that it only accepts unaligned accesses that are efficient on the target.
Maybe we need a new hook like this and get rid of the old one?
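
Applied to the bswap check above, that would read roughly as follows
(FAST_UNALIGNED_LOAD is hypothetical - it does not exist in GCC today):

  /* Hypothetical: only bail out when the target cannot do the unaligned
     load efficiently, instead of whenever SLOW_UNALIGNED_ACCESS is set.  */
  if (align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
      && !FAST_UNALIGNED_LOAD (TYPE_MODE (load_type), align))
    return false;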

Wilco
Richard Biener Nov. 2, 2016, 12:42 p.m. UTC | #3
On Tue, Nov 1, 2016 at 10:39 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>  Jeff Law <law@redhat.com> wrote:
>
>> I think you'll need to look at bz61320 before this could go in.
>
> I had a look, but there is nothing there that is related - eventually
> a latent alignment bug was fixed in IVOpt. Note that the bswap phase
> currently inserts unaligned accesses irrespective of STRICT_ALIGNMENT
> or SLOW_UNALIGNED_ACCESS:
>
> -      if (bswap
> -          && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
> -          && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
> -        return false;
>
> If bswap is false no byte swap is needed, so we found a native endian load
> and it will always perform the optimization by inserting an unaligned load.

Yes, the general agreement is that the expander can do best and thus we
should canonicalize accesses to larger ones even for SLOW_UNALIGNED_ACCESS.
The expander will generate the canonical best code (hopefully...).

> This apparently works on all targets, and doesn't cause alignment traps or
> huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
> So I'm at a loss what these macros are supposed to mean and how I can query
> whether a backend supports fast unaligned access for a particular mode.
>
> What I actually want to write is something like:
>
>  if (!FAST_UNALIGNED_LOAD (mode, align)) return false;
>
> And know that it only accepts unaligned accesses that are efficient on the target.
> Maybe we need a new hook like this and get rid of the old one?

No, we don't need another hook.

Note there is another similar user in gimple-fold.c when folding small
memcpy/memmove to single load/store pairs (patch posted but not applied by
me -- I've asked for strict-align target maintainer feedback but got none).
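
The folding in question turns a small fixed-size memcpy into a single
load/store pair, roughly like this (a sketch of the effect, not the actual
gimple-fold.c code):

  /* A fixed-size 4-byte copy ...  */
  void
  copy4 (char *dst, const char *src)
  {
    __builtin_memcpy (dst, src, 4);
  }
  /* ... is folded into one 4-byte load plus one 4-byte store, which is an
     unaligned SImode access whenever the pointers are not known to be
     sufficiently aligned - hence the strict-align question.  */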

Now - for bswap I'm only 99% sure that unaligned load + bswap is better
than piecewise loads plus manual swap.

But generally I'm always in favor of removing SLOW_UNALIGNED_ACCESS /
STRICT_ALIGNMENT checks from the GIMPLE side of the compiler.

Richard.

> Wilco
>
Wilco Dijkstra Nov. 2, 2016, 1:43 p.m. UTC | #4
Richard Biener wrote:
> On Tue, Nov 1, 2016 at 10:39 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> > If bswap is false no byte swap is needed, so we found a native endian load
> > and it will always perform the optimization by inserting an unaligned load.
>
> Yes, the general agreement is that the expander can do best and thus we
> should canonicalize accesses to larger ones even for SLOW_UNALIGNED_ACCESS.
> The expander will generate the canonical best code (hopefully...).

Right, but there are cases where you have to choose between unaligned or aligned
accesses and you need to know whether the unaligned access is fast.

A good example is memcpy expansion: if you have fast unaligned accesses then
you should use them to deal with the last few bytes, but if they get expanded,
using several aligned accesses is much faster than a single unaligned access.
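
Concretely, the two strategies for the tail of a copy look something like
this (illustrative C only, assuming at least 4 bytes remain - not the actual
memcpy expansion code):

  #include <string.h>
  #include <stdint.h>

  /* One unaligned 4-byte access: best when the hardware really handles
     unaligned loads and stores.  */
  static void
  copy_tail_word (char *dst, const char *src, size_t n)
  {
    uint32_t tmp;
    memcpy (&tmp, src + n - 4, 4);   /* possibly unaligned load */
    memcpy (dst + n - 4, &tmp, 4);   /* possibly unaligned store */
  }

  /* Four byte accesses: faster when "unaligned" accesses trap or are
     expanded into byte operations anyway.  */
  static void
  copy_tail_bytes (char *dst, const char *src, size_t n)
  {
    for (size_t i = n - 4; i < n; i++)
      dst[i] = src[i];
  }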

> > This apparently works on all targets, and doesn't cause alignment traps or
> > huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
> > So I'm at a loss what these macros are supposed to mean and how I can query
> > whether a backend supports fast unaligned access for a particular mode.
> >
> > What I actually want to write is something like:
> >
> >  if (!FAST_UNALIGNED_LOAD (mode, align)) return false;
> >
> > And know that it only accepts unaligned accesses that are efficient on the target.
> > Maybe we need a new hook like this and get rid of the old one?
>
> No, we don't need another hook.
> 
> Note there is another similar user in gimple-fold.c when folding small
> memcpy/memmove to single load/store pairs (patch posted but not applied by
> me -- I've asked for strict-align target maintainer feedback but got none).

I didn't find it, do you have a link?

> Now - for bswap I'm only 99% sure that unaligned load + bswap is
> better than piecewise loads plus manual swap.

It depends on whether unaligned loads and bswap are expanded or not. Even if we 
assume the expansion is at least as efficient as doing it explicitly (definitely true
for modes larger than the native integer size - as we found out in PR77308!),
if both the unaligned load and bswap are expanded it seems better not to make the
transformation for modes up to the word size. But there is no way to find out, as
SLOW_UNALIGNED_ACCESS must be true whenever STRICT_ALIGNMENT is true.

> But generally I'm always in favor of removing SLOW_UNALIGNED_ACCESS /
> STRICT_ALIGNMENT checks from the GIMPLE side of the compiler.

I sort of agree because the purpose of these macros is unclear - the documentation
is insufficient and out of date. I do believe however we need an accurate way to find out
whether a target supports fast unaligned accesses as that is required to generate good
target code.

Wilco
Richard Biener Nov. 2, 2016, 1:58 p.m. UTC | #5
On Wed, Nov 2, 2016 at 2:43 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Richard Biener wrote:
> On Tue, Nov 1, 2016 at 10:39 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
>> > If bswap is false no byte swap is needed, so we found a native endian load
>> > and it will always perform the optimization by inserting an unaligned load.
>>
>> Yes, the general agreement is that the expander can do best and thus we
>> should canonicalize accesses to larger ones even for SLOW_UNALIGNED_ACCESS.
>> The expander will generate the canonical best code (hopefully...).
>
> Right, but there are cases where you have to choose between unaligned or aligned
> accesses and you need to know whether the unaligned access is fast.
>
> A good example is memcpy expansion: if you have fast unaligned accesses then
> you should use them to deal with the last few bytes, but if they get expanded,
> using several aligned accesses is much faster than a single unaligned access.

Yes.  That's RTL expansion at which point you of course have to look
at SLOW_UNALIGNED_ACCESS.

>> > This apparently works on all targets, and doesn't cause alignment traps or
>> > huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
>> > So I'm at a loss what these macros are supposed to mean and how I can query
>> > whether a backend supports fast unaligned access for a particular mode.
>> >
>> > What I actually want to write is something like:
>> >
>> >  if (!FAST_UNALIGNED_LOAD (mode, align)) return false;
>> >
>> > And know that it only accepts unaligned accesses that are efficient on the target.
>> > Maybe we need a new hook like this and get rid of the old one?
>>
>> No, we don't need another hook.
>>
>> Note there is another similar user in gimple-fold.c when folding small
>> memcpy/memmove to single load/store pairs (patch posted but not applied by
>> me -- I've asked for strict-align target maintainer feedback but got none).
>
> I didn't find it, do you have a link?

https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00598.html

>> Now - for bswap I'm only 99% sure that unaligned load + bswap is
>> better than piecewise loads plus manual swap.
>
> It depends on whether unaligned loads and bswap are expanded or not. Even if we
> assume the expansion is at least as efficient as doing it explicitly (definitely true
> for modes larger than the native integer size - as we found out in PR77308!),
> if both the unaligned load and bswap are expanded it seems better not to make the
> transformation for modes up to the word size. But there is no way to find out, as
> SLOW_UNALIGNED_ACCESS must be true whenever STRICT_ALIGNMENT is true.

The case I was thinking about is the availability of a bswap load operating
only on aligned memory, with the "regular" register bswap being "fake":
provided by first spilling to an aligned stack slot and then loading from that.

Maybe a bit far-fetched.

>> But generally I'm always in favor of removing SLOW_UNALIGNED_ACCESS /
>> STRICT_ALIGNMENT checks from the GIMPLE side of the compiler.
>
> I sort of agree because the purpose of these macros is unclear - the documentation
> is insufficient and out of date. I do believe however we need an accurate way to find out
> whether a target supports fast unaligned accesses as that is required to generate good
> target code.

I believe the target macros are solely for RTL expansion and say that it has
to avoid unaligned ops as those would trap.

Richard.

> Wilco
Jeff Law Nov. 15, 2016, 4:35 p.m. UTC | #6
On 11/01/2016 03:39 PM, Wilco Dijkstra wrote:
>  Jeff Law <law@redhat.com> wrote:
>
>> I think you'll need to look at bz61320 before this could go in.
>
> I had a look, but there is nothing there that is related - eventually
> a latent alignment bug was fixed in IVOpt.
Excellent.  Thanks for digging into what really happened.

> Note that the bswap phase
> currently inserts unaligned accesses irrespective of STRICT_ALIGNMENT
> or SLOW_UNALIGNED_ACCESS:
>
> -      if (bswap
> -          && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
> -          && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
> -        return false;
>
> If bswap is false no byte swap is needed, so we found a native endian load
> and it will always perform the optimization by inserting an unaligned load.
> This apparently works on all targets, and doesn't cause alignment traps or
> huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
> So I'm at a loss what these macros are supposed to mean and how I can query
> whether a backend supports fast unaligned access for a particular mode.
>
> What I actually want to write is something like:
>
>  if (!FAST_UNALIGNED_LOAD (mode, align)) return false;
>
> And know that it only accepts unaligned accesses that are efficient on the target.
> Maybe we need a new hook like this and get rid of the old one?
As Richi indicated later, these decisions are probably made best at 
expansion time -- as long as we have the required information.  So I'd 
only go with a hook if (for example) the alignment information is lost 
by the time we get to expansion and thus we can't DTRT at expansion time.

Patch is OK.

jeff

Patch

diff --git a/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c b/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c
index 273b4bc622cb32564533e1352b5fc8ad52054f8b..6f682014622ab79e541cdf26d13f16a7d87f158d 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c
@@ -61,4 +61,4 @@  uint64_t read_be64_3 (unsigned char *data)
 }
 
 /* { dg-final { scan-tree-dump-times "64 bit load in target endianness found at" 3 "bswap" } } */
-/* { dg-final { scan-tree-dump-times "64 bit bswap implementation found at" 3 "bswap" { xfail alpha*-*-* arm*-*-* } } } */
+/* { dg-final { scan-tree-dump-times "64 bit bswap implementation found at" 3 "bswap" } } */
diff --git a/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c b/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
index c18ca6174d12a786a71252dfe47cfe78ca58750a..852ccfe5c1acd519f2cf340cc55f3ea74b1ec21f 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
@@ -55,5 +55,4 @@  swap16 (HItype in)
 }
 
 /* { dg-final { scan-tree-dump-times "16 bit load in target endianness found at" 3 "bswap" } } */
-/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 1 "bswap" { target alpha*-*-* arm*-*-* } } } */
-/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 4 "bswap" { xfail alpha*-*-* arm*-*-* } } } */
+/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 4 "bswap" } } */
diff --git a/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c b/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c
index a1558af2cc74adde439d42223b00977d9eeb9639..01ae3776ed3f44fbc300d001f8c67ec11625d03b 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c
@@ -45,4 +45,4 @@  uint32_t read_be32_3 (unsigned char *data)
 }
 
 /* { dg-final { scan-tree-dump-times "32 bit load in target endianness found at" 3 "bswap" } } */
-/* { dg-final { scan-tree-dump-times "32 bit bswap implementation found at" 3 "bswap" { xfail alpha*-*-* arm*-*-* } } } */
+/* { dg-final { scan-tree-dump-times "32 bit bswap implementation found at" 3 "bswap" } } */
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index 0cea1a8472d5d9c4f0e4a7bd82930e201948c4ec..cbb2f9367a287ad8cfcfc5740c0e49b2c83bafd0 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -2651,11 +2651,6 @@  bswap_replace (gimple *cur_stmt, gimple *src_stmt, tree fndecl,
 	    }
 	}
 
-      if (bswap
-	  && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
-	  && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
-	return false;
-
       /* Move cur_stmt just before  one of the load of the original
 	 to ensure it has the same VUSE.  See PR61517 for what could
 	 go wrong.  */