
[RFC] Remove a bad use of SLOW_UNALIGNED_ACCESS

Message ID AM5PR0802MB2610405E0020EE80CF9099F383A10@AM5PR0802MB2610.eurprd08.prod.outlook.com
State New

Commit Message

Wilco Dijkstra Nov. 1, 2016, 5:36 p.m. UTC
Looking at PR77308, one of the issues is that the bswap optimization 
phase doesn't work on ARM.  This is due to an odd check that uses
SLOW_UNALIGNED_ACCESS (which is always true on ARM).  Since the testcase
in PR77308 generates much better code with this patch (~13% fewer
instructions), it seems best to remove this odd check.
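
For reference, the kind of byte-access pattern the bswap pass recognizes
looks roughly like the functions in the testcases changed below (a sketch
with illustrative names):

  #include <stdint.h>

  /* A big-endian 32-bit read written as four byte loads.  On a
     little-endian target the bswap pass can replace the four narrow loads
     and shifts with a single (possibly unaligned) 32-bit load followed by
     a byte swap.  */
  uint32_t
  read_be32 (const unsigned char *data)
  {
    return ((uint32_t) data[0] << 24) | ((uint32_t) data[1] << 16)
           | ((uint32_t) data[2] << 8) | (uint32_t) data[3];
  }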

This exposes a problem with SLOW_UNALIGNED_ACCESS - what is it supposed
to mean or do? According to its current definition, it means we should
never emit an unaligned access for a given mode as it would lead to a
trap.  However, that is not what happens: for example, all integer modes on
ARM support really fast unaligned access, and we generate unaligned
instructions without any issues.  Some Thumb-1 targets automatically expand
unaligned accesses if necessary.  So this macro clearly doesn't stop unaligned
accesses from being generated.

So I want to set it to false for most modes on ARM as they are not slow. 
However doing so causes incorrect code generation and unaligned traps.
How can we differentiate between modes that support fast unaligned access
in hardware, modes that get expanded and modes that should never be used in
an unaligned access?
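
For context, the definitions involved look roughly like this (paraphrased
from defaults.h and the ARM backend rather than quoted exactly):

  /* gcc/defaults.h: fallback when a target does not define the macro,
     i.e. "unaligned accesses are slow whenever they are not allowed".  */
  #ifndef SLOW_UNALIGNED_ACCESS
  #define SLOW_UNALIGNED_ACCESS(MODE, ALIGN) STRICT_ALIGNMENT
  #endif

  /* gcc/config/arm/arm.h: ARM reports every unaligned access as slow,
     which is what blocks the bswap optimization.  */
  #define SLOW_UNALIGNED_ACCESS(MODE, ALIGN) 1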

Bootstrap & regress OK.

ChangeLog:
2016-11-01  Wilco Dijkstra  <wdijkstr@arm.com>

    gcc/
	* tree-ssa-math-opts.c (bswap_replace): Remove test
	of SLOW_UNALIGNED_ACCESS.

    testsuite/
	* gcc.dg/optimize-bswapdi-3.c: Remove xfail.
	* gcc.dg/optimize-bswaphi-1.c: Likewise. 	
	* gcc.dg/optimize-bswapsi-2.c: Likewise.

--

Comments

Jeff Law Nov. 1, 2016, 8:58 p.m. UTC | #1
On 11/01/2016 11:36 AM, Wilco Dijkstra wrote:
> Looking at PR77308, one of the issues is that the bswap optimization
> phase doesn't work on ARM.  This is due to an odd check that uses
> SLOW_UNALIGNED_ACCESS (which is always true on ARM).  Since the testcase
> in PR77308 generates much better code with this patch (~13% fewer
> instructions), it seems best to remove this odd check.
>
> This exposes a problem with SLOW_UNALIGNED_ACCESS - what is it supposed
> to mean or do? According to its current definition, it means we should
> never emit an unaligned access for a given mode as it would lead to a
> trap.  However, that is not what happens: for example, all integer modes on
> ARM support really fast unaligned access, and we generate unaligned
> instructions without any issues.  Some Thumb-1 targets automatically expand
> unaligned accesses if necessary.  So this macro clearly doesn't stop unaligned
> accesses from being generated.
>
> So I want to set it to false for most modes on ARM as they are not slow.
> However doing so causes incorrect code generation and unaligned traps.
> How can we differentiate between modes that support fast unaligned access
> in hardware, modes that get expanded and modes that should never be used in
> an unaligned access?
>
> Bootstrap & regress OK.
>
> ChangeLog:
> 2016-11-01  Wilco Dijkstra  <wdijkstr@arm.com>
>
>     gcc/
> 	* tree-ssa-math-opts.c (bswap_replace): Remove test
> 	of SLOW_UNALIGNED_ACCESS.
>
>     testsuite/
> 	* gcc.dg/optimize-bswapdi-3.c: Remove xfail.
> 	* gcc.dg/optimize-bswaphi-1.c: Likewise. 	
> 	* gcc.dg/optimize-bswapsi-2.c: Likewise.
I think you'll need to look at bz61320 before this could go in.

jeff
>
Wilco Dijkstra Nov. 1, 2016, 9:39 p.m. UTC | #2
Jeff Law <law@redhat.com> wrote:

> I think you'll need to look at bz61320 before this could go in.

I had a look, but there is nothing there that is related - eventually
a latent alignment bug was fixed in IVOpt. Note that the bswap phase
currently inserts unaligned accesses irrespective of STRICT_ALIGNMENT
or SLOW_UNALIGNED_ACCESS:

-      if (bswap
-          && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
-          && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
-        return false;

If bswap is false no byte swap is needed, so we found a native endian load
and it will always perform the optimization by inserting an unaligned load.
This apparently works on all targets, and doesn't cause alignment traps or
huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
So I'm at a loss what these macros are supposed to mean and how I can query
whether a backend supports fast unaligned access for a particular mode.

What I actually want to write is something like:

 if (!FAST_UNALIGNED_LOAD (mode, align)) return false;

And know that it only accepts unaligned accesses that are efficient on the target.
Maybe we need a new hook like this and get rid of the old one?
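
Applied to the bswap check above, that would read roughly as follows
(FAST_UNALIGNED_LOAD is hypothetical - it does not exist in GCC today):

  /* Hypothetical: only bail out when the target cannot do the unaligned
     load efficiently, instead of whenever SLOW_UNALIGNED_ACCESS is set.  */
  if (align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
      && !FAST_UNALIGNED_LOAD (TYPE_MODE (load_type), align))
    return false;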

Wilco
Richard Biener Nov. 2, 2016, 12:42 p.m. UTC | #3
On Tue, Nov 1, 2016 at 10:39 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>  Jeff Law <law@redhat.com> wrote:
>
>> I think you'll need to look at bz61320 before this could go in.
>
> I had a look, but there is nothing there that is related - eventually
> a latent alignment bug was fixed in IVOpt. Note that the bswap phase
> currently inserts unaligned accesses irrespective of STRICT_ALIGNMENT
> or SLOW_UNALIGNED_ACCESS:
>
> -      if (bswap
> -          && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
> -          && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
> -        return false;
>
> If bswap is false no byte swap is needed, so we found a native endian load
> and it will always perform the optimization by inserting an unaligned load.

Yes, the general agreement is that the expander can do best and thus we
should canonicalize accesses to larger ones even for SLOW_UNALIGNED_ACCESS.
The expander will generate the canonical best code (hopefully...).

> This apparently works on all targets, and doesn't cause alignment traps or
> huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
> So I'm at a loss what these macros are supposed to mean and how I can query
> whether a backend supports fast unaligned access for a particular mode.
>
> What I actually want to write is something like:
>
>  if (!FAST_UNALIGNED_LOAD (mode, align)) return false;
>
> And know that it only accepts unaligned accesses that are efficient on the target.
> Maybe we need a new hook like this and get rid of the old one?

No, we don't need another hook.

Note there is another similar user in gimple-fold.c when folding small
memcpy/memmove to single load/store pairs (patch posted but not applied by
me -- I've asked for strict-align target maintainer feedback but got none).
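
The folding in question turns a small fixed-size memcpy into a single
load/store pair, roughly like this (a sketch of the effect, not the actual
gimple-fold.c code):

  /* A fixed-size 4-byte copy ...  */
  void
  copy4 (char *dst, const char *src)
  {
    __builtin_memcpy (dst, src, 4);
  }
  /* ... is folded into one 4-byte load plus one 4-byte store, which is an
     unaligned SImode access whenever the pointers are not known to be
     sufficiently aligned - hence the strict-align question.  */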

Now - for bswap I'm only 99% sure that unaligned load + bswap is better
than piecewise loads plus manual swap.

But generally I'm always in favor of removing SLOW_UNALIGNED_ACCESS /
STRICT_ALIGNMENT checks from the GIMPLE side of the compiler.

Richard.

> Wilco
>
Wilco Dijkstra Nov. 2, 2016, 1:43 p.m. UTC | #4
Richard Biener wrote:
> On Tue, Nov 1, 2016 at 10:39 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:

> > If bswap is false no byte swap is needed, so we found a native endian load
> > and it will always perform the optimization by inserting an unaligned load.
>
> Yes, the general agreement is that the expander can do best and thus we
> should canonicalize accesses to larger ones even for SLOW_UNALIGNED_ACCESS.
> The expander will generate the canonical best code (hopefully...).

Right, but there are cases where you have to choose between unaligned or aligned
accesses and you need to know whether the unaligned access is fast.

A good example is memcpy expansion: if you have fast unaligned accesses then
you should use them to deal with the last few bytes, but if they get expanded,
using several aligned accesses is much faster than a single unaligned access.
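
Concretely, the two strategies for the tail of a copy look something like
this (illustrative C only, assuming at least 4 bytes remain - not the actual
memcpy expansion code):

  #include <string.h>
  #include <stdint.h>

  /* One unaligned 4-byte access: best when the hardware really handles
     unaligned loads and stores.  */
  static void
  copy_tail_word (char *dst, const char *src, size_t n)
  {
    uint32_t tmp;
    memcpy (&tmp, src + n - 4, 4);   /* possibly unaligned load */
    memcpy (dst + n - 4, &tmp, 4);   /* possibly unaligned store */
  }

  /* Four byte accesses: faster when "unaligned" accesses trap or are
     expanded into byte operations anyway.  */
  static void
  copy_tail_bytes (char *dst, const char *src, size_t n)
  {
    for (size_t i = n - 4; i < n; i++)
      dst[i] = src[i];
  }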

> > This apparently works on all targets, and doesn't cause alignment traps or
> > huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
> > So I'm at a loss what these macros are supposed to mean and how I can query
> > whether a backend supports fast unaligned access for a particular mode.
> >
> > What I actually want to write is something like:
> >
> >  if (!FAST_UNALIGNED_LOAD (mode, align)) return false;
> >
> > And know that it only accepts unaligned accesses that are efficient on the target.
> > Maybe we need a new hook like this and get rid of the old one?
>
> No, we don't need another hook.
> 
> Note there is another similar user in gimple-fold.c when folding small
> memcpy/memmove to single load/store pairs (patch posted but not applied by
> me -- I've asked for strict-align target maintainer feedback but got none).

I didn't find it, do you have a link?

> Now - for bswap I'm only 99% sure that unaligned load + bswap is
> better than piecewise loads plus manual swap.

It depends on whether unaligned loads and bswap are expanded or not. Even if we 
assume the expansion is at least as efficient as doing it explicitly (definitely true
for modes larger than the native integer size - as we found out in PR77308!),
if both the unaligned load and bswap are expanded it seems better not to make the
transformation for modes up to the word size. But there is no way to find out, as
SLOW_UNALIGNED_ACCESS must be true whenever STRICT_ALIGNMENT is true.

> But generally I'm always in favor of removing SLOW_UNALIGNED_ACCESS /
> STRICT_ALIGNMENT checks from the GIMPLE side of the compiler.

I sort of agree because the purpose of these macros is unclear - the documentation
is insufficient and out of date. I do believe however we need an accurate way to find out
whether a target supports fast unaligned accesses as that is required to generate good
target code.

Wilco
Richard Biener Nov. 2, 2016, 1:58 p.m. UTC | #5
On Wed, Nov 2, 2016 at 2:43 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> Richard Biener wrote:
> On Tue, Nov 1, 2016 at 10:39 PM, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
>> > If bswap is false no byte swap is needed, so we found a native endian load
>> > and it will always perform the optimization by inserting an unaligned load.
>>
>> Yes, the general agreement is that the expander can do best and thus we
>> should canonicalize accesses to larger ones even for SLOW_UNALIGNED_ACCESS.
>> The expander will generate the canonical best code (hopefully...).
>
> Right, but there are cases where you have to choose between unaligned or aligned
> accesses and you need to know whether the unaligned access is fast.
>
> A good example is memcpy expansion: if you have fast unaligned accesses then
> you should use them to deal with the last few bytes, but if they get expanded,
> using several aligned accesses is much faster than a single unaligned access.

Yes.  That's RTL expansion at which point you of course have to look
at SLOW_UNALIGNED_ACCESS.

>> > This apparently works on all targets, and doesn't cause alignment traps or
>> > huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
>> > So I'm at a loss what these macros are supposed to mean and how I can query
>> > whether a backend supports fast unaligned access for a particular mode.
>> >
>> > What I actually want to write is something like:
>> >
>> >  if (!FAST_UNALIGNED_LOAD (mode, align)) return false;
>> >
>> > And know that it only accepts unaligned accesses that are efficient on the target.
>> > Maybe we need a new hook like this and get rid of the old one?
>>
>> No, we don't need another hook.
>>
>> Note there is another similar user in gimple-fold.c when folding small
>> memcpy/memmove to single load/store pairs (patch posted but not applied by
>> me -- I've asked for strict-align target maintainer feedback but got none).
>
> I didn't find it, do you have a link?

https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00598.html

>> Now - for bswap I'm only 99% sure that unaligned load + bswap is
>> better than piecewise loads plus manual swap.
>
> It depends on whether unaligned loads and bswap are expanded or not. Even if we
> assume the expansion is at least as efficient as doing it explicitly (definitely true
> for modes larger than the native integer size - as we found out in PR77308!),
> if both the unaligned load and bswap are expanded it seems better not to make the
> transformation for modes up to the word size. But there is no way to find out, as
> SLOW_UNALIGNED_ACCESS must be true whenever STRICT_ALIGNMENT is true.

The case I was thinking about is the availability of a bswap load operating
only on aligned memory, with the "regular" register bswap being "fake":
provided by first spilling to an aligned stack slot and then loading from that.

Maybe a bit far-fetched.

>> But generally I'm always in favor of removing SLOW_UNALIGNED_ACCESS /
>> STRICT_ALIGNMENT checks from the GIMPLE side of the compiler.
>
> I sort of agree because the purpose of these macros is unclear - the documentation
> is insufficient and out of date. I do believe however we need an accurate way to find out
> whether a target supports fast unaligned accesses as that is required to generate good
> target code.

I believe the target macros are solely for RTL expansion and say that it has
to avoid unaligned ops as those would trap.

Richard.

> Wilco
Jeff Law Nov. 15, 2016, 4:35 p.m. UTC | #6
On 11/01/2016 03:39 PM, Wilco Dijkstra wrote:
>  Jeff Law <law@redhat.com> wrote:
>
>> I think you'll need to look at bz61320 before this could go in.
>
> I had a look, but there is nothing there that is related - eventually
> a latent alignment bug was fixed in IVOpt.
Excellent.  Thanks for digging into what really happened.

> Note that the bswap phase
> currently inserts unaligned accesses irrespective of STRICT_ALIGNMENT
> or SLOW_UNALIGNED_ACCESS:
>
> -      if (bswap
> -          && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
> -          && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
> -        return false;
>
> If bswap is false no byte swap is needed, so we found a native endian load
> and it will always perform the optimization by inserting an unaligned load.
> This apparently works on all targets, and doesn't cause alignment traps or
> huge slowdowns via trap emulation claimed by SLOW_UNALIGNED_ACCESS.
> So I'm at a loss what these macros are supposed to mean and how I can query
> whether a backend supports fast unaligned access for a particular mode.
>
> What I actually want to write is something like:
>
>  if (!FAST_UNALIGNED_LOAD (mode, align)) return false;
>
> And know that it only accepts unaligned accesses that are efficient on the target.
> Maybe we need a new hook like this and get rid of the old one?
As Richi indicated later, these decisions are probably made best at 
expansion time -- as long as we have the required information.  So I'd 
only go with a hook if (for example) the alignment information is lost 
by the time we get to expansion and thus we can't DTRT at expansion time.

Patch is OK.

jeff

Patch

diff --git a/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c b/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c
index 273b4bc622cb32564533e1352b5fc8ad52054f8b..6f682014622ab79e541cdf26d13f16a7d87f158d 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswapdi-3.c
@@ -61,4 +61,4 @@  uint64_t read_be64_3 (unsigned char *data)
 }
 
 /* { dg-final { scan-tree-dump-times "64 bit load in target endianness found at" 3 "bswap" } } */
-/* { dg-final { scan-tree-dump-times "64 bit bswap implementation found at" 3 "bswap" { xfail alpha*-*-* arm*-*-* } } } */
+/* { dg-final { scan-tree-dump-times "64 bit bswap implementation found at" 3 "bswap" } } */
diff --git a/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c b/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
index c18ca6174d12a786a71252dfe47cfe78ca58750a..852ccfe5c1acd519f2cf340cc55f3ea74b1ec21f 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
@@ -55,5 +55,4 @@  swap16 (HItype in)
 }
 
 /* { dg-final { scan-tree-dump-times "16 bit load in target endianness found at" 3 "bswap" } } */
-/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 1 "bswap" { target alpha*-*-* arm*-*-* } } } */
-/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 4 "bswap" { xfail alpha*-*-* arm*-*-* } } } */
+/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 4 "bswap" } } */
diff --git a/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c b/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c
index a1558af2cc74adde439d42223b00977d9eeb9639..01ae3776ed3f44fbc300d001f8c67ec11625d03b 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswapsi-2.c
@@ -45,4 +45,4 @@  uint32_t read_be32_3 (unsigned char *data)
 }
 
 /* { dg-final { scan-tree-dump-times "32 bit load in target endianness found at" 3 "bswap" } } */
-/* { dg-final { scan-tree-dump-times "32 bit bswap implementation found at" 3 "bswap" { xfail alpha*-*-* arm*-*-* } } } */
+/* { dg-final { scan-tree-dump-times "32 bit bswap implementation found at" 3 "bswap" } } */
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index 0cea1a8472d5d9c4f0e4a7bd82930e201948c4ec..cbb2f9367a287ad8cfcfc5740c0e49b2c83bafd0 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -2651,11 +2651,6 @@  bswap_replace (gimple *cur_stmt, gimple *src_stmt, tree fndecl,
 	    }
 	}
 
-      if (bswap
-	  && align < GET_MODE_ALIGNMENT (TYPE_MODE (load_type))
-	  && SLOW_UNALIGNED_ACCESS (TYPE_MODE (load_type), align))
-	return false;
-
       /* Move cur_stmt just before  one of the load of the original
 	 to ensure it has the same VUSE.  See PR61517 for what could
 	 go wrong.  */