diff mbox

[RFC] Pair load store instructions using a generic scheduling fusion pass

Message ID 000001cfdc90$1d95c670$58c15350$@arm.com
State New
Headers show

Commit Message

Bin Cheng Sept. 30, 2014, 9:22 a.m. UTC
Hi,
Last time I posted the patch pairing consecutive load/store instructions on
ARM, the patch got some review comments.  The most important one, as
suggested by Jeff and Mike, is to do the load/store pairing using the
existing scheduling facility.  In the initial investigation, I cleaned up
Mike's patch and fixed some implementation bugs in it; it can now find as
many load/store pairs as my old patch.  Then I decided to take one step
forward to introduce a generic instruction fusion infrastructure in GCC,
because in essence, load/store pairing is no different from other
instruction fusion: all these optimizations want to do is push instructions
together in the instruction stream.
So here comes this patch.  It adds a new sched_fusion pass just before
peephole2.  The methodology is like:
1) The priority in the scheduler is extended into a [fusion_priority,
priority] pair, with fusion_priority as the major key and priority as the
minor key.
2) The back-end assigns a priority pair to each instruction; instructions
that want to be fused together get the same fusion_priority assigned.
3) The haifa scheduler schedules instructions based on fusion priorities;
all other parts are just like the original sched2 pass.  Of course, some
functions can be simplified/skipped in this process.
4) With instructions fused together in the flow, the following peephole2
pass can easily transform the interesting instructions into other forms,
just like ldrd/strd for ARM.

The new infrastructure can handle different kinds of fusion in one pass.
It's also easy to extend for new fusion cases; all it takes is to identify
the instructions that want to be fused and assign new fusion priorities to
them.  Also, as Mike suggested last time, the patch can be very simple by
reusing the existing scheduler facility.

I collected performance data for both cortex-a15 and cortex-a57 (with a
local peephole ldp/stp patch); the benchmarks show clear improvement on
arm/aarch64.  I also collected instrumentation data about how many
load/store pairs are found.  For the four versions of load/store pair
patches:
0) Mike's original patch.
1) My original prototype patch.
2) The cleaned-up version of Mike's pass (with implementation bugs resolved).
3) This new prototype fusion pass.

The numbers of paired opportunities satisfy below relations:
3 * N0 ~ N1 ~ N2 < N3
For example, for one benchmark suite, we have:
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500

As a matter of fact, if we move the sched_fusion and peephole2 passes after
register renaming and enable the register renaming pass, this patch can
find even more load/store pairs (~11000 for the above benchmark suite).
But the renaming pass has its own impact on performance and we need more
benchmark data before making that change.

Of course, this patch is not a perfect solution; it does miss load/store
pairs in some corner cases with complicated instruction dependencies.
Actually it breaks one load/store pair test on armv6 because of such a
corner case, which is why the pass is disabled by default on non-armv7
processors.  I may investigate the failure and try to enable the pass for
all arm targets in the future.

So any comments on this?

2014-09-30  Bin Cheng  <bin.cheng@arm.com>
	    Mike Stump  <mikestump@comcast.net>

	* timevar.def (TV_SCHED_FUSION): New time var.
	* passes.def (pass_sched_fusion): New pass.
	* config/arm/arm.c (TARGET_SCHED_FUSION_PRIORITY): New.
	(extract_base_offset_in_addr, fusion_load_store): New.
	(arm_sched_fusion_priority): New.
	(arm_option_override): Disable scheduling fusion on non-armv7
	processors by default.
	* sched-int.h (struct _haifa_insn_data): New field.
	(INSN_FUSION_PRIORITY, FUSION_MAX_PRIORITY, sched_fusion): New.
	* sched-rgn.c (rest_of_handle_sched_fusion): New.
	(pass_data_sched_fusion, pass_sched_fusion): New.
	(make_pass_sched_fusion): New.
	* haifa-sched.c (sched_fusion): New.
	(insn_cost): Handle sched_fusion.
	(priority): Handle sched_fusion by calling target hook.
	(enum rfs_decision): New enum value.
	(rfs_str): New element for RFS_FUSION.
	(rank_for_schedule): Support sched_fusion.
	(schedule_insn, max_issue, prune_ready_list): Handle sched_fusion.
	(schedule_block, fix_tick_ready): Handle sched_fusion.
	* common.opt (flag_schedule_fusion): New.
	* tree-pass.h (make_pass_sched_fusion): New.
	* target.def (fusion_priority): New.
	* doc/tm.texi.in (TARGET_SCHED_FUSION_PRIORITY): New.
	* doc/tm.texi: Regenerated.
	* doc/invoke.texi (-fschedule-fusion): New.

gcc/testsuite/ChangeLog
2014-09-30  Bin Cheng  <bin.cheng@arm.com>

	* gcc.target/arm/ldrd-strd-pair-1.c: New test.
	* gcc.target/arm/vfp-1.c: Improve scanning string.

Comments

Mike Stump Sept. 30, 2014, 9:06 p.m. UTC | #1
On Sep 30, 2014, at 2:22 AM, Bin Cheng <bin.cheng@arm.com> wrote:
> Then I decided to take one step forward to introduce a generic
> instruction fusion infrastructure in GCC, because in essence, load/store
> pairing is no different from other instruction fusion: all these
> optimizations want to do is push instructions together in the
> instruction stream.

I like the step you took.  I had exactly this in mind when I wrote the original.

> N0 ~= 1300
> N1/N2 ~= 5000
> N3 ~= 7500

Nice.  Would be nice to see metrics for time to ensure that the code isn’t actually worse (CSiBE and/or spec and/or some other).  I didn’t have any large scale benchmark runs with my code and I did worry about extending lifetimes and register pressure.

> I cleaned up Mike's patch and fixed some implementation bugs in it

So, I’m wondering what the bugs or missed opportunities were?  And, if they were of the type of problem that generated incorrect code or if they were of the type that was merely a missed opportunity.
Bin.Cheng Oct. 6, 2014, 9:57 a.m. UTC | #2
On Wed, Oct 1, 2014 at 5:06 AM, Mike Stump <mikestump@comcast.net> wrote:
> On Sep 30, 2014, at 2:22 AM, Bin Cheng <bin.cheng@arm.com> wrote:
>> Then I decided to take one step forward to introduce a generic
>> instruction fusion infrastructure in GCC, because in essence, load/store
>> pairing is no different from other instruction fusion: all these
>> optimizations want to do is push instructions together in the
>> instruction stream.
>
> I like the step you took.  I had exactly this in mind when I wrote the original.
>
>> N0 ~= 1300
>> N1/N2 ~= 5000
>> N3 ~= 7500
>
> Nice.  Would be nice to see metrics for time to ensure that the code isn't actually worse (CSiBE and/or spec and/or some other).  I didn't have any large scale benchmark runs with my code and I did worry about extending lifetimes and register pressure.

Hi Mike,
I did collect spec2k performance after pairing load/store using this
patch on both aarch64 and cortex-a15.  The performance is improved
noticeably, especially on cortex-a57.  There are some (though not many)
benchmarks that regress a little.  There is no register pressure
problem here because this pass is placed between register allocation and
sched2; I guess sched2 should resolve most pipeline hazards introduced
by this pass.

>
>> I cleaned up Mike's patch and fixed some implementation bugs in it
>
> So, I'm wondering what the bugs or missed opportunities were?  And, if they were of the type of problem that generated incorrect code or if they were of the type that was merely a missed opportunity.
Just missed opportunity issues.

Thanks,
bin
Richard Biener Oct. 6, 2014, 11:32 a.m. UTC | #3
On Mon, Oct 6, 2014 at 11:57 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Oct 1, 2014 at 5:06 AM, Mike Stump <mikestump@comcast.net> wrote:
>> On Sep 30, 2014, at 2:22 AM, Bin Cheng <bin.cheng@arm.com> wrote:
>>> Then I decided to take one step forward to introduce a generic
>>> instruction fusion infrastructure in GCC, because in essence, load/store
>>> pairing is no different from other instruction fusion: all these
>>> optimizations want to do is push instructions together in the
>>> instruction stream.
>>
>> I like the step you took.  I had exactly this in mind when I wrote the original.
>>
>>> N0 ~= 1300
>>> N1/N2 ~= 5000
>>> N3 ~= 7500
>>
>> Nice.  Would be nice to see metrics for time to ensure that the code isn't actually worse (CSiBE and/or spec and/or some other).  I didn't have any large scale benchmark runs with my code and I did worry about extending lifetimes and register pressure.
>
> Hi Mike,
> I did collect spec2k performance after pairing load/store using this
> patch on both aarch64 and cortex-a15.  The performance is improved
> noticeably, especially on cortex-a57.  There are some (though not many)
> benchmarks that regress a little.  There is no register pressure
> problem here because this pass is placed between register allocation and
> sched2; I guess sched2 should resolve most pipeline hazards introduced
> by this pass.

How many merging opportunities does sched2 undo again?  ISTR it
has the tendency of pushing stores down and loads up.

Richard.

>>
>>> I cleaned up Mike's patch and fixed some implementation bugs in it
>>
>> So, I'm wondering what the bugs or missed opportunities were?  And, if they were of the type of problem that generated incorrect code or if they were of the type that was merely a missed opportunity.
> Just missed opportunity issues.
>
> Thanks,
> bin
Mike Stump Oct. 6, 2014, 5:20 p.m. UTC | #4
On Oct 6, 2014, at 4:32 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> On Mon, Oct 6, 2014 at 11:57 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> 
> How many merging opportunities does sched2 undo again?  ISTR it
> has the tendency of pushing stores down and loads up.

So, the pass works by merging 2 or more loads into 1 load (at least on my port).  sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work.  The non-real work doesn't matter anyway.  Can sched2 rip apart a single load?
Bin.Cheng Oct. 7, 2014, 1:31 a.m. UTC | #5
On Tue, Oct 7, 2014 at 1:20 AM, Mike Stump <mikestump@comcast.net> wrote:
> On Oct 6, 2014, at 4:32 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>> On Mon, Oct 6, 2014 at 11:57 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>
>> How many merging opportunities does sched2 undo again?  ISTR it
>> has the tendency of pushing stores down and loads up.
>
> So, the pass works by merging 2 or more loads into 1 load (at least on my port).  sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work.  The non-real work doesn't matter anyway.  Can sched2 rip apart a single load?
On ARM and AArch64, the two merged loads/stores are transformed into a
single parallel insn by the following peephole2 pass, so that sched2
will not undo the fusion work.  I thought sched2 works on the basis of
instructions, and it isn't good practice to have sched2 do split work.

Thanks,
bin
Jeff Law Oct. 8, 2014, 5:28 a.m. UTC | #6
On 10/06/14 19:31, Bin.Cheng wrote:
> On Tue, Oct 7, 2014 at 1:20 AM, Mike Stump <mikestump@comcast.net> wrote:
>> On Oct 6, 2014, at 4:32 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>> On Mon, Oct 6, 2014 at 11:57 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>
>>> How many merging opportunities does sched2 undo again?  ISTR it
>>> has the tendency of pushing stores down and loads up.
>>
>> So, the pass works by merging 2 or more loads into 1 load (at least on my port).  sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work.  The non-real work doesn't matter anyway.  Can sched2 rip apart a single load?
> On ARM and AArch64, the two merged loads/stores are transformed into a
> single parallel insn by the following peephole2 pass, so that sched2
> will not undo the fusion work.  I thought sched2 works on the basis of
> instructions, and it isn't good practice to have sched2 do split work.
It's certainly advantageous for sched2 to split insns that generate
multiple instructions.  Running after register allocation, sched2 is
ideal for splitting because we know the alternative for each insn
and thus we can (possibly for the first time) accurately know if a
particular insn will generate multiple assembly instructions.

If the port has a splitter to rip apart a double-word load into
single-word loads, then we'd obviously only want to do that in cases
where the double-word load actually generates > 1 assembly instruction.

Addressing issues in that space seems out of scope for Bin's work to me, 
except perhaps for such issues on aarch64/arm which are Bin's primary 
concerns.

jeff
Bin.Cheng Oct. 8, 2014, 5:52 a.m. UTC | #7
On Wed, Oct 8, 2014 at 1:28 PM, Jeff Law <law@redhat.com> wrote:
> On 10/06/14 19:31, Bin.Cheng wrote:
>>
>> On Tue, Oct 7, 2014 at 1:20 AM, Mike Stump <mikestump@comcast.net> wrote:
>>>
>>> On Oct 6, 2014, at 4:32 AM, Richard Biener <richard.guenther@gmail.com>
>>> wrote:
>>>>
>>>> On Mon, Oct 6, 2014 at 11:57 AM, Bin.Cheng <amker.cheng@gmail.com>
>>>> wrote:
>>>>
>>>> How many merging opportunities does sched2 undo again?  ISTR it
>>>> has the tendency of pushing stores down and loads up.
>>>
>>>
>>> So, the pass works by merging 2 or more loads into 1 load (at least on my
>>> port).  sched2 would need to rip apart 1 load into 2 loads to be able to
>>> undo the real work.  The non-real work doesn't matter anyway.  Can sched2
>>> rip apart a single load?
>>
>> On ARM and AArch64, the two merged loads/stores are transformed into a
>> single parallel insn by the following peephole2 pass, so that sched2
>> will not undo the fusion work.  I thought sched2 works on the basis of
>> instructions, and it isn't good practice to have sched2 do split work.
>
> It's certainly advantageous for sched2 to split insns that generate multiple
> instructions.  Running after register allocation, sched2 is ideal for
> splitting because we know the alternative for each insn and thus we can
> (possibly for the first time) accurately know if a particular insn will
> generate multiple assembly instructions.
>
> If the port has a splitter to rip apart a double-word load into single-word
> loads, then we'd obviously only want to do that in cases where the
> double-word load actually generates > 1 assembly instruction.
>
> Addressing issues in that space seems out of scope for Bin's work to me,
> except perhaps for such issues on aarch64/arm which are Bin's primary
> concerns.


Hi Jeff,

Thanks very much for the explanation.  Very likely I am wrong here,
but it seems what you mentioned fits pass_split_before_sched2 very
well.  Then I guess it would be nice if we could differentiate the
cases in the first place by generating different patterns, rather than
splitting some instructions later.  Though I have no idea whether we
can do that or not.

For arm/aarch64, I guess it's not an issue; otherwise the peephole2
won't work at all.  ARM maintainers should have an answer to this.


>
> jeff
Ramana Radhakrishnan Oct. 8, 2014, 10:27 a.m. UTC | #8
>> If the port has a splitter to rip apart a double-word load into single-word loads, then we'd obviously only want to do that in cases where the double-word load actually generates > 1 assembly instruction.

Or indeed if it is really a performance win.  And I think that should
purely be a per-port / micro-architectural decision.

>
> For arm/aarch64, I guess it's not an issue; otherwise the peephole2
> won't work at all.  ARM maintainers should have an answer to this.

Generating more ldrd's and strd's will be beneficial in the ARM and
the AArch64 port - we save code size and start using more memory
bandwidth available per instruction on most higher end cores that I'm
aware of. Even on the smaller microcontrollers I expect it to be a win
because you've saved code size. There may well be pathological cases
given we've shortened some dependencies or increased lifetimes of
others but overall I'd expect it to be more positive than negative.

I also expect this to be more effective in the T32 (Thumb2) ISA and
AArch64 because ldrd/strd and ldp/stp respectively can work with
any registers, unlike the A32 ISA where the registers loaded or stored
must be consecutive registers.  I'm hoping for some more review on the
generic bits before looking into the backend implementation, in the
expectation that this is the direction folks want to proceed.


regards
Ramana




>
>
>>
>> jeff
Jeff Law Oct. 8, 2014, 10:21 p.m. UTC | #9
On 10/08/14 04:27, Ramana Radhakrishnan wrote:
>>> If the port has a splitter to rip apart a double-word load into single-word loads, then we'd obviously only want to do that in cases where the double-word load actually generates > 1 assembly instruction.
>
> Or indeed if it is really a performance win. And I think that should
> purely be a per-port / micro-architectural decision.
Agreed.

> Generating more ldrd's and strd's will be beneficial in the ARM and
> the AArch64 port - we save code size and start using more memory
> bandwidth available per instruction on most higher end cores that I'm
> aware of. Even on the smaller microcontrollers I expect it to be a win
> because you've saved code size. There may well be pathological cases
> given we've shortened some dependencies or increased lifetimes of
> others but overall I'd expect it to be more positive than negative.
Agreed.  I suspect there are multiple architectures where the results
would be similar -- code size improvements, more effective use of memory
bandwidth, with possibly some pathological case(s) that we really
shouldn't worry too much about.

> I also expect this to be more effective in the T32 (Thumb2) ISA and
> AArch64 because ldrd/ strd and ldp / stp respectively can work with
> any registers unlike the A32 ISA where the registers loaded or stored
> must be consecutive registers. I'm hoping for some more review on the
> generic bits before looking into the backend implementation in the
> expectation that this is the direction folks want to proceed.
I've got some questions that I'm formulating to make sure I understand 
how the facility is to be used.  I may have to simply sit down with the 
code installed on a test build and play with it.

However, to be clear, I really like the direction this work has gone.

Jeff
Mike Stump Oct. 9, 2014, 11:14 a.m. UTC | #10
On Oct 7, 2014, at 10:28 PM, Jeff Law <law@redhat.com> wrote:
> It's certainly advantageous for sched2 to split insns that generate multiple instructions.

So, on my port, I have a load multiple that is just one instruction, and it is a single clock cycle (to enqueue it).
Jeff Law Oct. 10, 2014, 9:13 p.m. UTC | #11
On 09/30/14 03:22, Bin Cheng wrote:
> Hi,
> many load/store pairs as my old patch.  Then I decided to take one step
> forward to introduce a generic instruction fusion infrastructure in GCC,
> because in essence, load/store pairing is no different from other
> instruction fusion: all these optimizations want to do is push
> instructions together in the instruction stream.
Great generalization.  And yes, you're absolutely right, what you're 
doing is building a fairly generic mechanism to mark insns that might 
fuse together.

So, some questions.  Let's assume I've got 3 kinds of insns.  A B & C.

I can fuse AB or AC, but not BC.  In fact, moving B & C together may 
significantly harm performance.

So my question is can a given insn have different fusion priorities 
depending on its scheduling context?

So perhaps an example.  Let's say I have an insn stream with the 
following kinds of instructions, all ready at the same time.

AAAAAAAABBBBCCCC

Can I create 8 distinct fusion priorities such that I ultimately schedule
AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)

I guess another way to ask the question, are fusion priorities static 
based on the insn/alternative, or can they vary?  And if they can vary, 
can they vary each tick of the scheduler?



Now the next issue is I really don't want all those to fire 
back-to-back-to-back.  I'd like some other insns to be inserted between 
each fusion pair if they're in the ready list.  I guess the easiest way 
to get that is to assign the same fusion priority to other insns in the 
ready queue, even though they don't participate in fusion.  So

ABX(1) ABY(2).....

Where X & Y are some other arbitrary insns that don't participate in the 
AB fusion, but will issue in the same cycle as the AB fused insn.

Though I guess if we run fusion + peep2 between sched1 and sched2, that 
problem would just resolve itself as we'd have fused AB together into a 
new insn and we'd schedule normally with the fused insns and X, Y.




> So here comes this patch.  It adds a new sched_fusion pass just before
> peephole2.  The methodology is like:
> 1) The priority in the scheduler is extended into a [fusion_priority,
> priority] pair, with fusion_priority as the major key and priority as the
> minor key.
> 2) The back-end assigns a priority pair to each instruction; instructions
> that want to be fused together get the same fusion_priority assigned.
I think the bulk of my questions above are targeted at this part of the
change.  When are these assignments made, and how much freedom does the
backend have to make/change those assignments?

So another question: given a fused pair, is there any way to guarantee
ordering within the fused pair?  This is useful to cut down on the
number of peep2 patterns.  I guess we could twiddle the priority in
those cases to force a particular ordering of the fused pair, right?

I wonder if we could use this to zap all the hair I added to caller-save 
back in the early 90s to try and widen the save/restore modes.  So 
instead of st; st; call; ld; ld, we'd generate std; call; ldd.  It was a 
huge win for floating point on the sparc processors of that time.  I 
don't expect you to do that investigation.  Just thinking out loud.




>
> I collected performance data for both cortex-a15 and cortex-a57 (with a
> local peephole ldp/stp patch); the benchmarks show clear improvement on
> arm/aarch64.  I also collected instrumentation data about how many
> load/store pairs are found.  For the four versions of load/store pair
> patches:
> 0) Mike's original patch.
> 1) My original prototype patch.
> 2) The cleaned-up version of Mike's pass (with implementation bugs resolved).
> 3) This new prototype fusion pass.
>
> The numbers of paired opportunities satisfy below relations:
> 3 * N0 ~ N1 ~ N2 < N3
> For example, for one benchmark suite, we have:
> N0 ~= 1300
> N1/N2 ~= 5000
> N3 ~= 7500
Nice.  Very nice.

Overall it's a fairly simple change.  I'll look deeper into it next week.

jeff
Mike Stump Oct. 10, 2014, 11:10 p.m. UTC | #12
[ I’ll give the state of the code that I finished with, Bin’s answers should be similar to mine, but, if he improved things, they could be better ]

On Oct 10, 2014, at 2:13 PM, Jeff Law <law@redhat.com> wrote:
> So, some questions.  Let's assume I've got 3 kinds of insns.  A B & C.
> 
> I can fuse AB or AC, but not BC.  In fact, moving B & C together may significantly harm performance.

We can only choose one ordering for fusion consideration, so you have to pick between AB or AC as the ordering.  Once you decide, the other is hidden from consideration.

The priorities are merely used to select an instruction ordering in which to consider pairwise (adjacent) opportunities.

They can chain and stack, so, if you want A B C to fuse into the new ABC instruction, it will.

> So my question is can a given insn have different fusion priorities depending on its scheduling context?

Area for future improvement.  In theory you can feed any other state into the creation of the priority but only the single insn feeds into it today.

> So perhaps an example.  Let's say I have an insn stream with the following kinds of instructions, all ready at the same time.
> 
> AAAAAAAABBBBCCCC
> 
> Can I create 8 distinct fusion priorities such that I ultimately schedule
> AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)
> 
> I guess another way to ask the question, are fusion priorities static based on the insn/alternative, or can they vary?  And if they can vary, can they vary each tick of the scheduler?

So, it wasn't created to solve that problem.  If the scheduler tends to put them next to each other because, structurally, they fit that way, then peephole2 should already have a chance to see them that way.

> Now the next issue is I really don't want all those to fire back-to-back-to-back.  I'd like some other insns to be inserted between each fusion pair if they're in the ready list.

Scheduling figures out how to lay out instructions, as it always does.  Fusion runs before scheduling and is separate; it doesn't replace or change scheduling.

> I guess the easiest way to get that is to assign the same fusion priority to other insns in the ready queue, even though they don't participate in fusion.  So
> 
> ABX(1) ABY(2).....
> 
> Where X & Y are some other arbitrary insns that don't participate in the AB fusion, but will issue in the same cycle as the AB fused insn.

You're conflating fusion with scheduling; they are separate.  Fusion doesn't deal with time or cycles.  It exists only to fuse two separate instructions together.  Those products will be scheduled later, and post scheduling, all the holes will be filled to the extent the scheduler is able to find work to do.

For example, given:

ld 0
inc
ld 1
inc
ld 2
inc
ld 3
inc

and we want to fuse ld <even> with ld <even>+1, we then get:

ld 01
ld 23
inc
inc

then, post scheduling, we get:

ld 01
inc
ld 23
inc

by the scheduler to fill the holes.

> Though I guess if we run fusion + peep2 between sched1 and sched2, that problem would just resolve itself as we'd have fused AB together into a new insn and we'd schedule normally with the fused insns and X, Y.

Yes, in my version, I ran it really early, before sched.  I needed to run before ira and before other passes changed the code so much that they obscured the instructions I wanted to fuse.  One thing I saw was that once we got too far along, the offsets used were loaded into registers, and instead of being r+n it was r1+r2, which defeated the sorting.

>> So here comes this patch.  It adds a new sched_fusion pass just before
>> peephole2.

Hum…  In my version, that was just way too late.  I needed to generate pseudos and count on ira and others to clean up the register moves for me.  I've not tried his patch to see if it works as well as my version on my problem (combining adjacent loads and stores).

>>  The methodology is like:
>> 1) The priority in the scheduler is extended into a [fusion_priority,
>> priority] pair, with fusion_priority as the major key and priority as the
>> minor key.
>> 2) The back-end assigns a priority pair to each instruction; instructions
>> that want to be fused together get the same fusion_priority assigned.
> I think the bulk of my questions above are targeted at this part of the change.  When are these assignments made, and how much freedom does the backend have to make/change those assignments?

It is free to do anything with the numbers it wants.  But the port writer has carefully chosen the priorities to generate the best code, so it is unlikely it would ever be wise to change the numbers.

> So another question: given a fused pair, is there any way to guarantee ordering within the fused pair?

Sure, ensure one instruction has a lower number than the other.  That will order them reliably.

> This is useful to cut down on the number of peep2 patterns.  I guess we could twiddle the priority in those cases to force a particular ordering of the fused pair, right?

Right, assuming that there is an ordering for the instructions that makes sense.  Given:

mul r1,#13
inc r3
mul r5,#13
inc r4

and a machine that could do a V2 mul, given a pair:

mov r8,r1
mov r9,r5
mulv2 r8,#13
inc r3
inc r4
mov r1,r8
mov r5,r9

Here, we can see all the extra moves to ensure a register pair (n, n+1 for the mulv2 instruction) and the temporary created to hold the tuple (r8,r9).  In my peephole patterns I do this to ensure registers have the right register number.  I do this before register allocation, as it can choose the registers best and ensure all the other instructions around it use those registers, leaving the moves in if it can't do better.  The moves tend to be able to fill scheduling holes, so even if they remain, the cost for them usually shouldn't be too bad.

On the other hand, if you have

left
left
right
right

There is no way to sort them to get:

left
right
left
right

and then fuse:

left_right
left_right

This would be impossible.

> I wonder if we could use this to zap all the hair I added to caller-save back in the early 90s to try and widen the save/restore modes.  So instead of st; st; call; ld; ld, we'd generate std; call; ldd.  It was a huge win for floating point on the sparc processors of that time.  I don't expect you to do that investigation.  Just thinking out loud.

Well, curiously enough, that sort of thing is exactly why I wrote the code.  I was thinking of cases where the user loads lots of registers from memory, plays a bit with them, and then puts the data back into memory.  I wanted to put all the loads together and then try to find the loads that are next to each other and combine them into a single, larger load.  On a machine that can load n registers at once, you can then save n-1 instructions.  Likewise for stores.  Conceptually, there is nothing port-specific about the optimization I wanted to do.
Bin.Cheng Oct. 11, 2014, 3:32 a.m. UTC | #13
Mike already gave great answers; here are just some of my thoughts on
the specific questions, embedded below.

On Sat, Oct 11, 2014 at 7:10 AM, Mike Stump <mikestump@comcast.net> wrote:
> [ I'll give the state of the code that I finished with, Bin's answers should be similar to mine, but, if he improved things, they could be better ]
>
> On Oct 10, 2014, at 2:13 PM, Jeff Law <law@redhat.com> wrote:
>> So, some questions.  Let's assume I've got 3 kinds of insns.  A B & C.
>>
>> I can fuse AB or AC, but not BC.  In fact, moving B & C together may significantly harm performance.
>
> We can only choose one ordering for fusion consideration, so you have to pick between AB or AC as the ordering.  Once you decide, the other is hidden from consideration.
>
> The priorities are merely used to select an instruction ordering in which to consider pairwise (adjacent) opportunities.
>
> They can chain and stack, so, if you want A B C to fuse into the new ABC instruction, it will.
>
>> So my question is can a given insn have different fusion priorities depending on its scheduling context?
>
> Area for future improvement.  In theory you can feed any other state into the creation of the priority but only the single insn feeds into it today.
>
>> So perhaps an example.  Let's say I have an insn stream with the following kinds of instructions, all ready at the same time.
>>
>> AAAAAAAABBBBCCCC
>>
>> Can I create 8 distinct fusion priorities such that I ultimately schedule
>> AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)
>>
>> I guess another way to ask the question, are fusion priorities static based on the insn/alternative, or can they vary?  And if they can vary, can they vary each tick of the scheduler?

Though this pass works on predefined fusion types and priorities now,
there are two possible fixes for this specific problem.
1) Introduce another exclusive_pri, so the key becomes "fusion_pri,
priority, exclusive_pri".  The first is assigned to mark instructions
belonging to the same fusion type.  The second is assigned to fuse
each pair of consecutive instructions together.  The last is assigned
to prevent a specific pair of instructions from being fused, like the
"BC" case mentioned above.
2) Extend the idea by using the hook functions
TARGET_SCHED_REORDER/TARGET_SCHED_REORDER2.  We can assign fusion_pri
in the first place, making sure instructions of the same fusion type
are adjacent to each other, then change priority (thus reordering the
ready list) at the back-end's discretion, even on each tick of the
scheduler.
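The two-level [fusion_pri, priority] key amounts to a plain lexicographic sort of the ready list.  A minimal sketch of the idea in ordinary C++ (not the real haifa-sched code; the toy insn representation and the numbers are made up purely for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for an insn: the scheduler sorts the ready list by
// fusion_pri (major key), then priority (minor key), then luid.
struct insn {
  std::string name;
  int fusion_pri;  // same value => candidates for fusion
  int priority;    // orders insns within one fusion group
  int luid;        // tie-breaker, keeps the order deterministic
};

static void sort_ready_list (std::vector<insn> &ready)
{
  std::sort (ready.begin (), ready.end (),
             [] (const insn &a, const insn &b) {
               if (a.fusion_pri != b.fusion_pri)
                 return a.fusion_pri > b.fusion_pri;  // higher issues first
               if (a.priority != b.priority)
                 return a.priority > b.priority;
               return a.luid < b.luid;
             });
}
```

With all fusion candidates sharing one fusion_pri, the minor priority alone decides their relative order, so the back-end controls both the grouping and the in-group ordering.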


>
> So, it wasn't created to solve that problem.  If the scheduler would tend to put them next to each other because, structurally, they fit that way, then peephole should already have a chance to see them that way.
>
>> Now the next issue is I really don't want all those to fire back-to-back-to-back.  I'd like some other insns to be inserted between each fusion pair if they're in the ready list.
>
> Scheduling figures out the way to lay out instructions, and that will be done.  Fusion runs before scheduling and is separate; it doesn't replace or change scheduling.
>
>> I guess the easiest way to get that is to assign the same fusion priority to other insns in the ready queue, even though they don't participate in fusion.  So
>>
>> ABX(1) ABY(2).....
>>
>> Where X & Y are some other arbitrary insns that don't participate in the AB fusion, but will issue in the same cycle as the AB fused insn.
>
> You're conflating fusion with scheduling, they are separate.  Fusion doesn't deal with time or cycles.  It exists only to fuse two separate instructions together.  Those products will be scheduled later, and post scheduling, all the holes will be filled to the extent the scheduler is able to find work to do.
>
> For example, given:
>
> ld 0
> inc
> ld 1
> inc
> ld 2
> inc
> ld 3
> inc
>
> and we want to fuse ld <even> with ld <even>+1, we then get:
>
> ld 01
> ld 23
> inc
> inc
>
> then, post scheduling, we get:
>
> ld 01
> inc
> ld 23
> inc
>
> by the scheduler to fill the holes.
>
>> Though I guess if we run fusion + peep2 between sched1 and sched2, that problem would just resolve itself as we'd have fused AB together into a new insn and we'd schedule normally with the fused insns and X, Y.
>
> Yes, in my version, I ran it really early, before sched.  I needed to run before ira and run before other people

Two reasons why I run it late in the compilation process.
1) IRA is a pass I tend not to disturb, since it changes the code
dramatically.  With fusion after IRA, I can measure a definite
improvement from it; there is less noise here.
2) Spilling generates many load/store pair opportunities on ARM,
which I don't want to miss.

>changed the code so much, that they obscured the instructions I wanted to fuse.  One thing that happened that I
>saw was once we got to far along, the offsets used were loaded into register and instead of being r+n, it was r1+r2
>and this blew the sorting.

I saw this problem too; it's the fwprop pass that is to blame.
Currently it forwards a computation into an address expression even
when there is no benefit.  For example, if we have:

add rx, ry, rz
ldr   r1, [rx]
ldr   r2, [rx+4]
ldr   r3, [rx+8]

It will be transformed into:

add rx, ry, rz
ldr   r1, [ry+rz]
ldr   r2, [rx+4]
ldr   r3, [rx+8]

This corrupts consecutive load pairs, gives no reduction in
computation, and increases register pressure.  The root cause is that
the pass works reg_use by reg_use and doesn't take global information
into consideration, such as register live ranges, whether there is a
real benefit, or whether it hurts other optimizations.

>
>>> So here comes this patch.  It adds a new sched_fusion pass just before
>>> peephole2.
>
> Hum...  In my version, that was just way too late.  I needed to generate pseudos and count on ira and others to clean up the register moves for me.  I've not tried his patch to see if it works as well as my version on my problem (combining adjacent loads and stores).

Putting it after IRA has its own disadvantage.  For example, a false
dependency is introduced, as in the example below:

ldr  r1, [rx + 4]
ldr  r2, [rx + 12]
ldr  rx, [rx + 8]
...
use r1/r2/rx

Here we can't fuse the first/third ldr instructions because rx is
modified in between.  This is why I even tried to push this pass
(along with peephole2) after register renaming, and to enable the
register renaming pass in my experiments.  It does help a lot, at
least on ARM.


>
>>>  The methodology is like:
>>> 1) The priority in scheduler is extended into [fusion_priority, priority]
>>> pair, with fusion_priority as the major key and priority as the minor key.
>>> 2) The back-end assigns priorities pair to each instruction, instructions
>>> want to be fused together get same fusion_priority assigned.
>> I think the bulk of my questions above are targetted at this part of the change.  When are these assignments made and how much freedom does the backend have to make/change those assignments.
>
> It is free to do anything with the numbers it wants.  But, the port writer has carefully chosen the priorities to generate the best code, it is unlikely it would ever be wise to change the numbers.
>
>> So another question, given a fused pair, is there any way to guarantee ordering within the fused pair.
>
> Sure, ensure one instruction has a lower number than the other.  That will order them reliably.
>
>> This is useful to cut down on the number of peep2 patterns.  I guess we could twiddle the priority in those cases to force a particular ordering of the fused pair, right?
>
> Right, assuming that there is an ordering for the instruction that makes sense.  Given:
>
> mul r1,#13
> inc r3
> mul r5,#13
> inc r4
>
> and a machine that could do a V2 mul, given a pair:
>
> mov r8,r1
> mov r9,r5
> mulv2 r8,#13
> inc r3
> inc r4
> mov r1,r8
> mov r5,r9
>
> Here, we can see all the extra moves to ensure a register pair (n, n+1 of the mulv2 instruction) and the temporary created to hold the tuple (r8,r9).  In my peephole patterns I do this to ensure registers have the right register number.  I do this before register allocation, as it can choose the registers the best and ensure all the other instructions around it use those registers, and leave the moves in, if it can't do better.  The moves tend to be able to fill scheduling holes, so, even if they remain, usually the cost for them shouldn't be too bad.
>
> On the other hand, if you have
>
> left
> left
> right
> right
>
> There is no way to sort them to get:
>
> left
> right
> left
> right
>
> and then fuse:
>
> left_right
> left_right
>
> This would be impossible.

I can't understand this very well; this is exactly one case we want to
fuse on ARM for movw/movt.  Given:
movw  r1, const_1
movw   r2, const_2
movw   r3, const_3
movt     r1, const_1
movt     r2, const_2
movt     r3, const_3

I would expect:
movw  r1, const_1
movt     r1, const_1
movw   r2, const_2
movt     r2, const_2
movw   r3, const_3
movt     r3, const_3
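Unlike the left/left/right/right case, this stream does carry enough data to sort on: the destination register.  As a sketch (plain C++ with an invented priority encoding, ignoring real dependencies), keying fusion_pri on the destination register yields exactly the interleaving above:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum op_kind { MOVW, MOVT };

struct insn {
  op_kind kind;
  int dest_reg;    // the rN being built up by a movw/movt pair
  int fusion_pri;
  int priority;
};

// Give both halves of a constant-building pair the same fusion_pri,
// derived from the destination register; the minor key orders movw
// before movt within each group.  The encodings are illustrative only.
static void assign_fusion_pri (std::vector<insn> &insns)
{
  for (insn &i : insns)
    {
      i.fusion_pri = 100 - i.dest_reg;          // same reg => same group
      i.priority = (i.kind == MOVW) ? 2 : 1;    // movw first within group
    }
  std::stable_sort (insns.begin (), insns.end (),
                    [] (const insn &a, const insn &b) {
                      if (a.fusion_pri != b.fusion_pri)
                        return a.fusion_pri > b.fusion_pri;
                      return a.priority > b.priority;
                    });
}
```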

Thanks,
bin

>
>> I wonder if we could use this to zap all the hair I added to caller-save back in the early 90s to try and widen the save/restore modes.  So instead of st; st; call; ld; ld, we'd generate std; call; ldd.  It was a huge win for floating point on the sparc processors of that time.  I don't expect you to do that investigation.  Just thinking out loud.
>
> Well, curiously enough, that sort of thing is exactly why I wrote the code.  I was thinking of cases where the user loads lots of registers from memory, plays a bit with them, and then puts the data back into memory.  I wanted to put all the loads together and then try to find the loads that are next to each other and combine them into a single, larger load.  On a machine that can load n registers at once, you can then save n-1 instructions.  Likewise for stores.  Conceptually, there is nothing port specific about the optimization I wanted to do.
Mike Stump Oct. 11, 2014, 7:30 p.m. UTC | #14
On Oct 10, 2014, at 8:32 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> 
>>> Though I guess if we run fusion + peep2 between sched1 and sched2, that problem would just resolve itself as we'd have fused AB together into a new insn and we'd schedule normally with the fused insns and X, Y.
>> 
>> Yes, in my version, I ran it really early, before sched.  I needed to run before ira and run before other people
> 
> Two reasons why I run it late in compilation process.
> 1) IRA is the pass I tend not to disturb since code is changed
> dramatically.  With it after IRA, I can get certain improvement from
> fusion, there is less noise here.

Since I have a front-end background, I think nothing of creating pseudos when I want to, I just know if I do, I have to do this before allocation.  :-)  For my peepholes, since they create registers, they must run before allocation.

> 2) The spilling generates many load/store pair opportunities on ARM,
> which I don't want to miss.

I happen to have enough registers that spilling wasn’t my primary concern.

> add rx, ry, rz
> ldr   r1, [rx]
> ldr   r2, [rx+4]
> ldr   r3, [rx+8]
> 
> It will be transformed into:
> 
> add rx, ry, rz
> ldr   r1, [ry+rz]
> ldr   r2, [rx+4]
> ldr   r3, [rx+8]

Yeah, that seems to tickle a neuron.

>> On the other hand, if you have
>> 
>> left
>> left
>> right
>> right
>> 
>> There is no way to sort them to get:
>> 
>> left
>> right
>> left
>> right
>> 
>> and then fuse:
>> 
>> left_right
>> left_right
>> 
>> This would be impossible.
> 
> I can't understand this very well, this exactly is one case we want to
> fuse on ARM for movw/movt.  Given
> movw  r1, const_1

This differs from the above by having r1 and const_1; in my example there is no r1 and no const_1, and this matters.  I wanted to list a case where it is impossible to sort.  This happens when there isn’t enough data to sort on, for example, no offset and no register number.
Bin.Cheng Oct. 21, 2014, 5:43 a.m. UTC | #15
On Sat, Oct 11, 2014 at 5:13 AM, Jeff Law <law@redhat.com> wrote:
> On 09/30/14 03:22, Bin Cheng wrote:
>>
>> Hi,
>> many load/store pairs as my old patch.  Then I decided to take one step
>> forward to introduce a generic instruction fusion infrastructure in GCC,
>> because in essence, load/store pair is nothing different with other
>> instruction fusion, all these optimizations want is to push instructions
>> together in instruction flow.
>
> Great generalization.  And yes, you're absolutely right, what you're doing
> is building a fairly generic mechanism to mark insns that might fuse
> together.
>
> So, some questions.  Let's assume I've got 3 kinds of insns.  A B & C.
>
> I can fuse AB or AC, but not BC.  In fact, moving B & C together may
> significantly harm performance.
>
> So my question is can a given insn have different fusion priorities
> depending on its scheduling context?
>
> So perhaps an example.  Let's say I have an insn stream with the following
> kinds of instructions, all ready at the same time.
>
> AAAAAAAABBBBCCCC
>
> Can I create 8 distinct fusion priorities such that I ultimately schedule
> AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)
>
> I guess another way to ask the question, are fusion priorities static based
> on the insn/alternative, or can they vary?  And if they can vary, can they
> vary each tick of the scheduler?
>
>
>
> Now the next issue is I really don't want all those to fire
> back-to-back-to-back.  I'd like some other insns to be inserted between each
> fusion pair if they're in the ready list.  I guess the easiest way to get
> that is to assign the same fusion priority to other insns in the ready
> queue, even though they don't participate in fusion.  So
>
> ABX(1) ABY(2).....
>
> Where X & Y are some other arbitrary insns that don't participate in the AB
> fusion, but will issue in the same cycle as the AB fused insn.
>
> Though I guess if we run fusion + peep2 between sched1 and sched2, that
> problem would just resolve itself as we'd have fused AB together into a new
> insn and we'd schedule normally with the fused insns and X, Y.
>
>
>
>
>> So here comes this patch.  It adds a new sched_fusion pass just before
>> peephole2.  The methodology is like:
>> 1) The priority in scheduler is extended into [fusion_priority, priority]
>> pair, with fusion_priority as the major key and priority as the minor key.
>> 2) The back-end assigns priorities pair to each instruction, instructions
>> want to be fused together get same fusion_priority assigned.
>
> I think the bulk of my questions above are targetted at this part of the
> change.  When are these assignments made and how much freedom does the
> backend have to make/change those assignments.
>
> So another question, given a fused pair, is there any way to guarantee
> ordering within the fused pair.  This is useful to cut down on the number of
> peep2 patterns.  I guess we could twiddle the priority in those cases to
> force a particular ordering of the fused pair, right?
>
> I wonder if we could use this to zap all the hair I added to caller-save
> back in the early 90s to try and widen the save/restore modes.  So instead
> of st; st; call; ld; ld, we'd generate std; call; ldd.  It was a huge win
> for floating point on the sparc processors of that time.  I don't expect you
> to do that investigation.  Just thinking out loud.
>
>
>
>
>>
>> I collected performance data for both cortex-a15 and cortex-a57 (with a
>> local peephole ldp/stp patch), the benchmarks can be obviously improved on
>> arm/aarch64.  I also collected instrument data about how many load/store
>> pairs are found.  For the four versions of load/store pair patches:
>> 0) The original Mike's patch.
>> 1) My original prototype patch.
>> 2) Cleaned up pass of Mike (with implementation bugs resolved).
>> 3) This new prototype fusion pass.
>>
>> The numbers of paired opportunities satisfy below relations:
>> 3 * N0 ~ N1 ~ N2 < N3
>> For example, for one benchmark suite, we have:
>> N0 ~= 1300
>> N1/N2 ~= 5000
>> N3 ~= 7500
>
> Nice.  Very nice.
>
> Overall it's a fairly simple change.  I'll look deeper into it next week.
>
> jeff

Hi Jeff,
Any new comments?

Thanks,
bin
Jeff Law Oct. 30, 2014, 7:43 p.m. UTC | #16
On 10/10/14 21:32, Bin.Cheng wrote:
> Mike already gave great answers, here are just some of my thoughts on
> the specific questions.  See embedded below.
Thanks to both of you for your answers.

Fundamentally, what I see is this scheme requires us to be able to come 
up with a key based solely on information in a particular insn.  To get 
fusion another insn has to have the same or a closely related key.

This implies that the two candidates for fusion are related, even if 
there isn't a data dependency between them.  The canonical example would 
be two loads with reg+d addressing modes.  If they use the same base 
register and the displacements differ by a word, then we don't have a 
data dependency between the insns, but the insns are closely related by 
their address computations and we can compute a key to ensure those two 
related insns end up consecutive.  At any given call to the hook, the 
only context we can directly see is the current insn.
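For the reg+d case, such a key might be sketched as follows (plain C++ with a toy address representation, not real RTL; a real hook would extract the base and displacement from the insn, much as extract_base_offset_in_addr does in the ARM patch):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Toy reg+displacement address; real code would pull these out of
// the insn's MEM address in RTL.
struct load {
  int base_reg;
  int offset;    // displacement in bytes
};

// Major key groups loads by base register; minor key orders them by
// offset, so two loads whose displacements differ by one word end up
// consecutive in the sorted ready list, with no data dependency needed.
static std::pair<int, int> fusion_key (const load &l)
{
  return { l.base_reg, l.offset };
}

static void sort_by_key (std::vector<load> &loads)
{
  std::sort (loads.begin (), loads.end (),
             [] (const load &a, const load &b) {
               return fusion_key (a) < fusion_key (b);
             });
}
```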

I'm pretty sure if I were to tweak the ARM bits ever-so-slightly it 
could easily model the load-load or store-store special case on the 
PA7xxx[LC] processors.  Normally a pair of loads or stores can't dual 
issue.  But if the two loads (or two stores) hit the upper and lower 
half of a double-word object, then the instructions can dual issue.

I'd forgotten about that special case scheduling opportunity until I 
started looking at some unrelated enhancement for prefetching.

Your code would also appear to allow significant cleanup of the old 
caller-save code that had a fair amount of bookkeeping added to issue 
double-word memory loads/stores rather than single word operations. 
This *greatly* improved performance on the old sparc processors which 
had no call-saved FP registers.

However, your new code doesn't handle fusing instructions which are 
totally independent and of different static types.  There just isn't a 
good way to compute a key that I can see.  And this is OK -- that case, 
if we cared to improve it, would be best handled by the SCHED_REORDER hooks.
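A toy model of what such a SCHED_REORDER-style hook could do for the independent-insn case (plain C++; the insn kinds and the pairing rule are invented for illustration, and a real hook would operate on the scheduler's ready array and return an issue count):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum kind { MUL, ADD, SHIFT };

// Hypothetical pairing rule: a MUL can dual issue with an ADD.
static bool fuses_p (kind a, kind b)
{
  return (a == MUL && b == ADD) || (a == ADD && b == MUL);
}

// The ready list is ordered best-first; insns issue from the front.
// If the front insn has an independent fusion partner later in the
// list, rotate the partner up to slot 1 so the pair issues together.
static void reorder_ready (std::vector<kind> &ready)
{
  if (ready.size () < 2)
    return;
  for (size_t i = 1; i < ready.size (); ++i)
    if (fuses_p (ready[0], ready[i]))
      {
        std::rotate (ready.begin () + 1, ready.begin () + i,
                     ready.begin () + i + 1);
        break;
      }
}
```

Because this looks at the whole ready list each tick rather than at a per-insn key, it can pair insns of different static types, which the key-based scheme cannot.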

>>>
>>> I guess another way to ask the question, are fusion priorities static based on the insn/alternative, or can they vary?  And if they can vary, can they vary each tick of the scheduler?
>
> Though this pass works on predefined fusion types and priorities now,
> there might be two possible fixes for this specific problem.
> 1) Introduce another exclusive_pri, now it's like "fusion_pri,
> priority, exclusive_pri".  The first one is assigned to mark
> instructions belonging to same fusion type.  The second is assigned to
> fusion each pair/consecutive instructions together.  The last one is
> assigned to prevent specific pair of instructions from being fused,
> just like "BC" mentioned.
> 2) Extend the idea by using hook function
> TARGET_SCHED_REORDER/TARGET_SCHED_REORDER2.  Now we can assign
> fusion_pri at the first place, making sure instructions in same fusion
> type will be adjacent to each other, then we can change priority (thus
> reorder the ready list) at back-end's wish even per each tick of the
> scheduler.
#2 would be the best solution for the case I was pondering, but I don't 
think solving that case is terribly important given the processors for 
which it was profitable haven't been made for a very long time.

Jeff
Mike Stump Oct. 30, 2014, 11:37 p.m. UTC | #17
On Oct 30, 2014, at 12:43 PM, Jeff Law <law@redhat.com> wrote:
> Fundamentally, what I see is this scheme requires us to be able to come up with a key based solely on information in a particular insn.  To get fusion another insn has to have the same or a closely related key.

Right.
Bin.Cheng Oct. 31, 2014, 5:36 a.m. UTC | #18
Thanks for giving it a try.

On Fri, Oct 31, 2014 at 3:43 AM, Jeff Law <law@redhat.com> wrote:
> On 10/10/14 21:32, Bin.Cheng wrote:
>>
>> Mike already gave great answers, here are just some of my thoughts on
>> the specific questions.  See embedded below.
>
> Thanks to both of you for your answers.
>
> Fundamentally, what I see is this scheme requires us to be able to come up
> with a key based solely on information in a particular insn.  To get fusion
> another insn has to have the same or a closely related key.
>
> This implies that the two candidates for fusion are related, even if
> there isn't a data dependency between them.  The canonical example would be
> two loads with reg+d addressing modes.  If they use the same base register
> and the displacements differ by a word, then we don't have a data dependency
> between the insns, but the insns are closely related by their address
> computations and we can compute a key to ensure those two related insns end
> up consecutive.  At any given call to the hook, the only context we can
> directly see is the current insn.
>
> I'm pretty sure if I were to tweak the ARM bits ever-so-slightly it could
> easily model the load-load or store-store special case on the PA7xxx[LC]
> processors.  Normally a pair of loads or stores can't dual issue.  But if
> the two loads (or two stores) hit the upper and lower half of a double-word
> object, then the instructions can dual issue.
>
> I'd forgotten about that special case scheduling opportunity until I started
> looking at some unrelated enhancement for prefetching.
>
> Your code would also appear to allow significant cleanup of the old
> caller-save code that had a fair amount of bookkeeping added to issue
> double-word memory loads/stores rather than single word operations. This
> *greatly* improved performance on the old sparc processors which had no
> call-saved FP registers.
>
> However, your new code doesn't handle fusing instructions which are totally
> independent and of different static types.  There just isn't a good way to
> compute a key that I can see.  And this is OK -- that case, if we cared to
> improve it, would be best handled by the SCHED_REORDER hooks.
>
>>>>
>>>> I guess another way to ask the question, are fusion priorities static
>>>> based on the insn/alternative, or can they vary?  And if they can vary, can
>>>> they vary each tick of the scheduler?
>>
>>
>> Though this pass works on predefined fusion types and priorities now,
>> there might be two possible fixes for this specific problem.
>> 1) Introduce another exclusive_pri, now it's like "fusion_pri,
>> priority, exclusive_pri".  The first one is assigned to mark
>> instructions belonging to same fusion type.  The second is assigned to
>> fusion each pair/consecutive instructions together.  The last one is
>> assigned to prevent specific pair of instructions from being fused,
>> just like "BC" mentioned.
>> 2) Extend the idea by using hook function
>> TARGET_SCHED_REORDER/TARGET_SCHED_REORDER2.  Now we can assign
>> fusion_pri at the first place, making sure instructions in same fusion
>> type will be adjacent to each other, then we can change priority (thus
>> reorder the ready list) at back-end's wish even per each tick of the
>> scheduler.
>
> #2 would be the best solution for the case I was pondering, but I don't
> think solving that case is terribly important given the processors for which
> it was profitable haven't been made for a very long time.
I am wondering if it's possible to introduce pattern-directed fusion,
something like a define_fusion construct, adapting the haifa scheduler
for it.  I agree there are two kinds of fusion (related and unrelated
instructions), and it's not trivial to support both in one scheme.  Do
you have a specific example that I can try?

This is just a preliminary idea and definitely can't make it into GCC
5.0.  Moreover, so far I haven't seen such a requirement on
ARM/AArch64 or any other target that I know of, so I would like to
continue with this version if that's fine.

Later I will send a patch pairing different kinds of ldp/stp
instructions for aarch64 based on this.

Thanks,
bin

>
> Jeff
Jeff Law Oct. 31, 2014, 7:07 p.m. UTC | #19
On 10/30/14 23:36, Bin.Cheng wrote:
> Thanks for giving it a try.

I'm a very hands-on learner, so sometimes the best way for me to 
understand a blob of code is poke at it myself a bit.

Anyway, now I just need to get back to the patch for a final review :-)

jeff
Jeff Law Oct. 31, 2014, 7:20 p.m. UTC | #20
On 10/30/14 23:36, Bin.Cheng wrote:
>> #2 would be the best solution for the case I was pondering, but I don't
>> think solving that case is terribly important given the processors for which
>> it was profitable haven't been made for a very long time.
> I am thinking if it's possible to introduce a pattern-directed fusion.
> Something like define_fusion, and adapting haifa-scheduler for it.  I
> agree there are two kinds (relevant and irrelevant) fusion types, and
> it's not trivial to support both in one scheme.  Do you have a
> specific example that I can have a try?
I kicked around using reorg to do stuff like this in the past 
(combining unrelated insns).  But ultimately I think the way to go 
is to have it happen when insns are on the ready list in the scheduler.

For fusion of related insns like the load/store pairing, I think your 
approach should work pretty well.


As to specific examples of independent insn fusion, the ones I'm most 
familiar with are from the older PA chips.  I wouldn't recommend 
building something for those processors simply because they're so dated 
that I don't believe anyone uses them anymore.

However, if you have cases (arm shift insns?), building for those is 
fine.  If you just want examples, the ones we tried to exploit on the PA 
were fmpyadd/fmpysub, movb,tr and addb,tr

fmpyadd/fmpysub combined independent floating point multiply with an FP 
add or sub insn.  There's many conditions, but if you want a simple 
example to play with, the attached file with -O2 -mschedule=7100LC ought 
to generate one of these insns via pa_reorg.

addb,tr can combine an unconditional branch with a reg+reg or reg+imm5 
addition operation.  movb,tr combines an unconditional branch with a 
reg-reg copy or load of a 5 bit immediate value into a general register. 
  I don't happen to have examples handy, but compiling integer code with 
-O2 -mschedule=7100LC ought to trigger some.

The code in pa_reorg is O(n^2) or worse.  It predates the hooks to allow 
the target to reorder the ready queue.  It would probably be relatively 
easy to have that code run via those hooks and just look at the ready 
queue.  So it'd still be O(n^2), but the N would be *much* smaller.  But 
again, I don't think anyone has used PA7xxx processors for over a
decade, so it hasn't seemed worth the effort to change.

Cheers,
Jeff
Jeff Law Oct. 31, 2014, 8:29 p.m. UTC | #21
On 09/30/14 03:22, Bin Cheng wrote:

>
> 2014-09-30  Bin Cheng<bin.cheng@arm.com>
> 	    Mike Stump<mikestump@comcast.net>
>
> 	* timevar.def (TV_SCHED_FUSION): New time var.
> 	* passes.def (pass_sched_fusion): New pass.
> 	* config/arm/arm.c (TARGET_SCHED_FUSION_PRIORITY): New.
> 	(extract_base_offset_in_addr, fusion_load_store): New.
> 	(arm_sched_fusion_priority): New.
> 	(arm_option_override): Disable scheduling fusion on non-armv7
> 	processors by default.
> 	* sched-int.h (struct _haifa_insn_data): New field.
> 	(INSN_FUSION_PRIORITY, FUSION_MAX_PRIORITY, sched_fusion): New.
> 	* sched-rgn.c (rest_of_handle_sched_fusion): New.
> 	(pass_data_sched_fusion, pass_sched_fusion): New.
> 	(make_pass_sched_fusion): New.
> 	* haifa-sched.c (sched_fusion): New.
> 	(insn_cost): Handle sched_fusion.
> 	(priority): Handle sched_fusion by calling target hook.
> 	(enum rfs_decision): New enum value.
> 	(rfs_str): New element for RFS_FUSION.
> 	(rank_for_schedule): Support sched_fusion.
> 	(schedule_insn, max_issue, prune_ready_list): Handle sched_fusion.
> 	(schedule_block, fix_tick_ready): Handle sched_fusion.
> 	* common.opt (flag_schedule_fusion): New.
> 	* tree-pass.h (make_pass_sched_fusion): New.
> 	* target.def (fusion_priority): New.
> 	* doc/tm.texi.in (TARGET_SCHED_FUSION_PRIORITY): New.
> 	* doc/tm.texi: Regenerated.
> 	* doc/invoke.texi (-fschedule-fusion): New.
>
> gcc/testsuite/ChangeLog
> 2014-09-30  Bin Cheng<bin.cheng@arm.com>
>
> 	* gcc.target/arm/ldrd-strd-pair-1.c: New test.
> 	* gcc.target/arm/vfp-1.c: Improve scanning string.
>
>
> sched-fusion-20140929.txt
>
>
> Index: gcc/doc/tm.texi
> ===================================================================
> --- gcc/doc/tm.texi	(revision 215662)
> +++ gcc/doc/tm.texi	(working copy)
> @@ -6677,6 +6677,29 @@ This hook is called by tree reassociator to determ
>   parallelism required in output calculations chain.
>   @end deftypefn
>
> +@deftypefn {Target Hook} void TARGET_SCHED_FUSION_PRIORITY (rtx_insn *@var{insn}, int @var{max_pri}, int *@var{fusion_pri}, int *@var{pri})
> +This hook is called by scheduling fusion pass.  It calculates fusion
> +priorities for each instruction passed in by parameter.  The priorities
> +are returned via pointer parameters.
> +
> +@var{insn} is the instruction whose priorities need to be calculated.
> +@var{max_pri} is the maximum priority can be returned in any cases.
> +@var{fusion_pri} is the pointer parameter through which @var{insn}'s
> +fusion priority should be calculated and returned.
> +@var{pri} is the pointer parameter through which @var{insn}'s priority
> +should be calculated and returned.
> +
> +Same @var{fusion_pri} should be returned for instructions which should
> +be scheduled together.  Different @var{pri} should be returned for
> +instructions with same @var{fusion_pri}.  All instructions will be
> +scheduled according to the two priorities.  @var{fusion_pri} is the major
> +sort key, @var{pri} is the minor sort key.  All priorities calculated
> +should be between 0 (exclusive) and @var{max_pri} (inclusive).  To avoid
> +false dependencies, @var{fusion_pri} of instructions which need to be
> +scheduled together should be smaller than @var{fusion_pri} of irrelevant
> +instructions.
> +@end deftypefn
> +
>   @node Sections
>   @section Dividing the Output into Sections (Texts, Data, @dots{})
>   @c the above section title is WAY too long.  maybe cut the part between
So I think we need to clarify that this hook is useful when fusing two 
related insns which don't have a data dependency.  Somehow we need 
to describe that the insns to be fused should have the same (or +-1) 
priority.  It may be useful to use code from the ARM implementation to 
show how to use this to pair up loads as an example.

It may also be useful to refer to the code which reorders insns in the 
ready queue for cases where we want to fuse two truly independent insns.


>
> +  if (sched_fusion)
> +    {
> +      /* The instruction that has the same fusion priority as the last
> +	 instruction is the instruction we picked next.  If that is not
> +	 the case, we sort ready list firstly by fusion priority, then
> +	 by priority, and at last by INSN_LUID.  */
> +      int a = INSN_FUSION_PRIORITY (tmp);
> +      int b = INSN_FUSION_PRIORITY (tmp2);
> +      int last = -1;
> +
> +      if (last_nondebug_scheduled_insn
> +	  && !NOTE_P (last_nondebug_scheduled_insn)
> +	  && BLOCK_FOR_INSN (tmp)
> +	       == BLOCK_FOR_INSN (last_nondebug_scheduled_insn))
> +	last = INSN_FUSION_PRIORITY (last_nondebug_scheduled_insn);
> +
> +      if (a != last && b != last)
> +	{
> +	  if (a == b)
> +	    {
> +	      a = INSN_PRIORITY (tmp);
> +	      b = INSN_PRIORITY (tmp2);
> +	    }
> +	  if (a != b)
> +	    return rfs_result (RFS_FUSION, b - a);
> +	  else
> +	    return rfs_result (RFS_FUSION, INSN_LUID (tmp) - INSN_LUID (tmp2));
rfs_result's signature has changed, I think you need to pass in tmp & 
tmp2.  You'll need to make that trivial update for all the callers.


Can you make those changes and repost so that I can look at the docs (I 
think the implementation is fine and won't need further review).

jeff
diff mbox

Patch

Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	(revision 215662)
+++ gcc/timevar.def	(working copy)
@@ -244,6 +244,7 @@  DEFTIMEVAR (TV_IFCVT2		     , "if-conversion 2")
 DEFTIMEVAR (TV_COMBINE_STACK_ADJUST  , "combine stack adjustments")
 DEFTIMEVAR (TV_PEEPHOLE2             , "peephole 2")
 DEFTIMEVAR (TV_RENAME_REGISTERS      , "rename registers")
+DEFTIMEVAR (TV_SCHED_FUSION          , "scheduling fusion")
 DEFTIMEVAR (TV_CPROP_REGISTERS       , "hard reg cprop")
 DEFTIMEVAR (TV_SCHED2                , "scheduling 2")
 DEFTIMEVAR (TV_MACH_DEP              , "machine dep reorg")
Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	(revision 215662)
+++ gcc/doc/tm.texi	(working copy)
@@ -6677,6 +6677,29 @@  This hook is called by tree reassociator to determ
 parallelism required in output calculations chain.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_SCHED_FUSION_PRIORITY (rtx_insn *@var{insn}, int @var{max_pri}, int *@var{fusion_pri}, int *@var{pri})
+This hook is called by scheduling fusion pass.  It calculates fusion
+priorities for each instruction passed in by parameter.  The priorities
+are returned via pointer parameters.
+
+@var{insn} is the instruction whose priorities need to be calculated.
+@var{max_pri} is the maximum priority can be returned in any cases.
+@var{fusion_pri} is the pointer parameter through which @var{insn}'s
+fusion priority should be calculated and returned.
+@var{pri} is the pointer parameter through which @var{insn}'s priority
+should be calculated and returned.
+
+Same @var{fusion_pri} should be returned for instructions which should
+be scheduled together.  Different @var{pri} should be returned for
+instructions with same @var{fusion_pri}.  All instructions will be
+scheduled according to the two priorities.  @var{fusion_pri} is the major
+sort key, @var{pri} is the minor sort key.  All priorities calculated
+should be between 0 (exclusive) and @var{max_pri} (inclusive).  To avoid
+false dependencies, @var{fusion_pri} of instructions which need to be
+scheduled together should be smaller than @var{fusion_pri} of irrelevant
+instructions.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	(revision 215662)
+++ gcc/doc/tm.texi.in	(working copy)
@@ -4817,6 +4817,8 @@  them: try the first ones in this list first.
 
 @hook TARGET_SCHED_REASSOCIATION_WIDTH
 
+@hook TARGET_SCHED_FUSION_PRIORITY
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 215662)
+++ gcc/doc/invoke.texi	(working copy)
@@ -404,7 +404,7 @@  Objective-C and Objective-C++ Dialects}.
 -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
 -fprofile-generate=@var{path} @gol
 -fprofile-use -fprofile-use=@var{path} -fprofile-values -fprofile-reorder-functions @gol
--freciprocal-math -free -frename-registers -freorder-blocks @gol
+-freciprocal-math -free -frename-registers -fschedule-fusion -freorder-blocks @gol
 -freorder-blocks-and-partition -freorder-functions @gol
 -frerun-cse-after-loop -freschedule-modulo-scheduled-loops @gol
 -frounding-math -fsched2-use-superblocks -fsched-pressure @gol
@@ -9459,6 +9459,14 @@  a ``home register''.
 
 Enabled by default with @option{-funroll-loops} and @option{-fpeel-loops}.
 
+@item -fschedule-fusion
+@opindex fschedule-fusion
+Perform a target-dependent pass over the instruction stream to schedule
+instructions of the same type together, because the target machine can
+execute them more efficiently when they are adjacent to each other.
+
+Enabled at levels @option{-O2}, @option{-O3}, @option{-Os}.
+
 @item -ftracer
 @opindex ftracer
 Perform tail duplication to enlarge superblock size.  This transformation
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 215662)
+++ gcc/common.opt	(working copy)
@@ -1806,6 +1806,10 @@  frename-registers
 Common Report Var(flag_rename_registers) Init(2) Optimization
 Perform a register renaming optimization pass
 
+fschedule-fusion
+Common Report Var(flag_schedule_fusion) Init(2) Optimization
+Perform a target-dependent instruction fusion optimization pass
+
 freorder-blocks
 Common Report Var(flag_reorder_blocks) Optimization
 Reorder basic blocks to improve code placement
Index: gcc/haifa-sched.c
===================================================================
--- gcc/haifa-sched.c	(revision 215662)
+++ gcc/haifa-sched.c	(working copy)
@@ -1370,6 +1370,9 @@  insn_cost (rtx_insn *insn)
 {
   int cost;
 
+  if (sched_fusion)
+    return 0;
+
   if (sel_sched_p ())
     {
       if (recog_memoized (insn) < 0)
@@ -1581,6 +1584,8 @@  dep_list_size (rtx insn, sd_list_types_def list)
 
   return nodbgcount;
 }
+
+bool sched_fusion;
 
 /* Compute the priority number for INSN.  */
 static int
@@ -1596,7 +1601,15 @@  priority (rtx_insn *insn)
     {
       int this_priority = -1;
 
-      if (dep_list_size (insn, SD_LIST_FORW) == 0)
+      if (sched_fusion)
+	{
+	  int this_fusion_priority;
+
+	  targetm.sched.fusion_priority (insn, FUSION_MAX_PRIORITY,
+					 &this_fusion_priority, &this_priority);
+	  INSN_FUSION_PRIORITY (insn) = this_fusion_priority;
+	}
+      else if (dep_list_size (insn, SD_LIST_FORW) == 0)
 	/* ??? We should set INSN_PRIORITY to insn_cost when and insn has
 	   some forward deps but all of them are ignored by
 	   contributes_to_priority hook.  At the moment we set priority of
@@ -2527,7 +2540,7 @@  enum rfs_decision {
   RFS_SCHED_GROUP, RFS_PRESSURE_DELAY, RFS_PRESSURE_TICK,
   RFS_FEEDS_BACKTRACK_INSN, RFS_PRIORITY, RFS_SPECULATION,
   RFS_SCHED_RANK, RFS_LAST_INSN, RFS_PRESSURE_INDEX,
-  RFS_DEP_COUNT, RFS_TIE, RFS_N };
+  RFS_DEP_COUNT, RFS_TIE, RFS_FUSION, RFS_N };
 
 /* Corresponding strings for print outs.  */
 static const char *rfs_str[RFS_N] = {
@@ -2535,7 +2548,7 @@  static const char *rfs_str[RFS_N] = {
   "RFS_SCHED_GROUP", "RFS_PRESSURE_DELAY", "RFS_PRESSURE_TICK",
   "RFS_FEEDS_BACKTRACK_INSN", "RFS_PRIORITY", "RFS_SPECULATION",
   "RFS_SCHED_RANK", "RFS_LAST_INSN", "RFS_PRESSURE_INDEX",
-  "RFS_DEP_COUNT", "RFS_TIE" };
+  "RFS_DEP_COUNT", "RFS_TIE", "RFS_FUSION" };
 
 /* Statistical breakdown of rank_for_schedule decisions.  */
 typedef struct { unsigned stats[RFS_N]; } rank_for_schedule_stats_t;
@@ -2596,6 +2609,53 @@  rank_for_schedule (const void *x, const void *y)
   /* Make sure that priority of TMP and TMP2 are initialized.  */
   gcc_assert (INSN_PRIORITY_KNOWN (tmp) && INSN_PRIORITY_KNOWN (tmp2));
 
+  if (sched_fusion)
+    {
+      /* The instruction that has the same fusion priority as the last
+	 scheduled instruction is the one we pick next.  Otherwise, we
+	 sort the ready list first by fusion priority, then by priority,
+	 and finally by INSN_LUID.  */
+      int a = INSN_FUSION_PRIORITY (tmp);
+      int b = INSN_FUSION_PRIORITY (tmp2);
+      int last = -1;
+
+      if (last_nondebug_scheduled_insn
+	  && !NOTE_P (last_nondebug_scheduled_insn)
+	  && BLOCK_FOR_INSN (tmp)
+	       == BLOCK_FOR_INSN (last_nondebug_scheduled_insn))
+	last = INSN_FUSION_PRIORITY (last_nondebug_scheduled_insn);
+
+      if (a != last && b != last)
+	{
+	  if (a == b)
+	    {
+	      a = INSN_PRIORITY (tmp);
+	      b = INSN_PRIORITY (tmp2);
+	    }
+	  if (a != b)
+	    return rfs_result (RFS_FUSION, b - a);
+	  else
+	    return rfs_result (RFS_FUSION, INSN_LUID (tmp) - INSN_LUID (tmp2));
+	}
+      else if (a == b)
+	{
+	  gcc_assert (last_nondebug_scheduled_insn
+		      && !NOTE_P (last_nondebug_scheduled_insn));
+	  last = INSN_PRIORITY (last_nondebug_scheduled_insn);
+
+	  a = abs (INSN_PRIORITY (tmp) - last);
+	  b = abs (INSN_PRIORITY (tmp2) - last);
+	  if (a != b)
+	    return rfs_result (RFS_FUSION, a - b);
+	  else
+	    return rfs_result (RFS_FUSION, INSN_LUID (tmp) - INSN_LUID (tmp2));
+	}
+      else if (a == last)
+	return rfs_result (RFS_FUSION, -1);
+      else
+	return rfs_result (RFS_FUSION, 1);
+    }
+
   if (sched_pressure != SCHED_PRESSURE_NONE)
     {
       /* Prefer insn whose scheduling results in the smallest register
@@ -3914,8 +3974,8 @@  schedule_insn (rtx_insn *insn)
   gcc_assert (INSN_TICK (insn) >= MIN_TICK);
   if (INSN_TICK (insn) > clock_var)
     /* INSN has been prematurely moved from the queue to the ready list.
-       This is possible only if following flag is set.  */
-    gcc_assert (flag_sched_stalled_insns);
+       This is possible only if following flags are set.  */
+    gcc_assert (flag_sched_stalled_insns || sched_fusion);
 
   /* ??? Probably, if INSN is scheduled prematurely, we should leave
      INSN_TICK untouched.  This is a machine-dependent issue, actually.  */
@@ -5413,6 +5473,9 @@  max_issue (struct ready_list *ready, int privilege
   struct choice_entry *top;
   rtx_insn *insn;
 
+  if (sched_fusion)
+    return 0;
+
   n_ready = ready->n_ready;
   gcc_assert (dfa_lookahead >= 1 && privileged_n >= 0
 	      && privileged_n <= n_ready);
@@ -5768,6 +5831,9 @@  prune_ready_list (state_t temp_state, bool first_c
   bool sched_group_found = false;
   int min_cost_group = 1;
 
+  if (sched_fusion)
+    return;
+
   for (i = 0; i < ready.n_ready; i++)
     {
       rtx_insn *insn = ready_element (&ready, i);
@@ -6375,7 +6441,7 @@  schedule_block (basic_block *target_bb, state_t in
 	    {
 	      memcpy (temp_state, curr_state, dfa_state_size);
 	      cost = state_transition (curr_state, insn);
-	      if (sched_pressure != SCHED_PRESSURE_WEIGHTED)
+	      if (sched_pressure != SCHED_PRESSURE_WEIGHTED && !sched_fusion)
 		gcc_assert (cost < 0);
 	      if (memcmp (temp_state, curr_state, dfa_state_size) != 0)
 		cycle_issued_insns++;
@@ -7195,7 +7261,7 @@  fix_tick_ready (rtx_insn *next)
   INSN_TICK (next) = tick;
 
   delay = tick - clock_var;
-  if (delay <= 0 || sched_pressure != SCHED_PRESSURE_NONE)
+  if (delay <= 0 || sched_pressure != SCHED_PRESSURE_NONE || sched_fusion)
     delay = QUEUE_READY;
 
   change_queue_index (next, delay);
Index: gcc/sched-int.h
===================================================================
--- gcc/sched-int.h	(revision 215662)
+++ gcc/sched-int.h	(working copy)
@@ -806,6 +806,9 @@  struct _haifa_insn_data
   /* A priority for each insn.  */
   int priority;
 
+  /* The fusion priority for each insn.  */
+  int fusion_priority;
+
   /* The minimum clock tick at which the insn becomes ready.  This is
      used to note timing constraints for the insns in the pending list.  */
   int tick;
@@ -901,6 +904,7 @@  extern vec<haifa_insn_data_def> h_i_d;
 /* Accessor macros for h_i_d.  There are more in haifa-sched.c and
    sched-rgn.c.  */
 #define INSN_PRIORITY(INSN) (HID (INSN)->priority)
+#define INSN_FUSION_PRIORITY(INSN) (HID (INSN)->fusion_priority)
 #define INSN_REG_PRESSURE(INSN) (HID (INSN)->reg_pressure)
 #define INSN_MAX_REG_PRESSURE(INSN) (HID (INSN)->max_reg_pressure)
 #define INSN_REG_USE_LIST(INSN) (HID (INSN)->reg_use_list)
@@ -1618,6 +1622,10 @@  extern void sd_copy_back_deps (rtx_insn *, rtx_ins
 extern void sd_delete_dep (sd_iterator_def);
 extern void sd_debug_lists (rtx, sd_list_types_def);
 
+/* Macros and declarations for scheduling fusion.  */
+#define FUSION_MAX_PRIORITY (INT_MAX)
+extern bool sched_fusion;
+
 #endif /* INSN_SCHEDULING */
 
 #endif /* GCC_SCHED_INT_H */
Index: gcc/sched-rgn.c
===================================================================
--- gcc/sched-rgn.c	(revision 215662)
+++ gcc/sched-rgn.c	(working copy)
@@ -3647,6 +3647,17 @@  rest_of_handle_sched2 (void)
   return 0;
 }
 
+static unsigned int
+rest_of_handle_sched_fusion (void)
+{
+#ifdef INSN_SCHEDULING
+  sched_fusion = true;
+  schedule_insns ();
+  sched_fusion = false;
+#endif
+  return 0;
+}
+
 namespace {
 
 const pass_data pass_data_live_range_shrinkage =
@@ -3789,3 +3800,53 @@  make_pass_sched2 (gcc::context *ctxt)
 {
   return new pass_sched2 (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_sched_fusion =
+{
+  RTL_PASS, /* type */
+  "sched_fusion", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_SCHED_FUSION, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_sched_fusion : public rtl_opt_pass
+{
+public:
+  pass_sched_fusion (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_sched_fusion, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *);
+  virtual unsigned int execute (function *)
+    {
+      return rest_of_handle_sched_fusion ();
+    }
+
+}; // class pass_sched_fusion
+
+bool
+pass_sched_fusion::gate (function *)
+{
+#ifdef INSN_SCHEDULING
+  return (optimize > 0 && flag_schedule_fusion
+    && targetm.sched.fusion_priority != NULL);
+#else
+  return 0;
+#endif
+}
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_sched_fusion (gcc::context *ctxt)
+{
+  return new pass_sched_fusion (ctxt);
+}
Index: gcc/target.def
===================================================================
--- gcc/target.def	(revision 215662)
+++ gcc/target.def	(working copy)
@@ -1507,6 +1507,32 @@  parallelism required in output calculations chain.
 int, (unsigned int opc, enum machine_mode mode),
 hook_int_uint_mode_1)
 
+/* The following member value is a function that returns fusion
+   priorities for each instruction via pointer parameters.  */
+DEFHOOK
+(fusion_priority,
"This hook is called by the scheduling fusion pass.  It calculates fusion\n\
priorities for each instruction passed in by parameter.  The priorities\n\
are returned via pointer parameters.\n\
\n\
@var{insn} is the instruction whose priorities need to be calculated.\n\
@var{max_pri} is the maximum priority that can be returned in any case.\n\
@var{fusion_pri} is the pointer parameter through which @var{insn}'s\n\
fusion priority should be calculated and returned.\n\
@var{pri} is the pointer parameter through which @var{insn}'s priority\n\
should be calculated and returned.\n\
\n\
The same @var{fusion_pri} should be returned for instructions which\n\
should be scheduled together.  Different @var{pri} should be returned\n\
for instructions with the same @var{fusion_pri}.  All instructions will\n\
be scheduled according to the two priorities: @var{fusion_pri} is the\n\
major sort key and @var{pri} the minor sort key.  All priorities\n\
calculated should be between 0 (exclusive) and @var{max_pri}\n\
(inclusive).  To avoid false dependencies, the @var{fusion_pri} of\n\
instructions which need to be scheduled together should be smaller than\n\
the @var{fusion_pri} of irrelevant instructions.",
+void, (rtx_insn *insn, int max_pri, int *fusion_pri, int *pri), NULL)
+
 HOOK_VECTOR_END (sched)
 
 /* Functions relating to OpenMP and Cilk Plus SIMD clones.  */
Index: gcc/testsuite/gcc.target/arm/vfp-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/vfp-1.c	(revision 215662)
+++ gcc/testsuite/gcc.target/arm/vfp-1.c	(working copy)
@@ -126,7 +126,7 @@  void test_convert () {
 }
 
 void test_ldst (float f[], double d[]) {
-  /* { dg-final { scan-assembler "vldr.32.+ \\\[r0, #1020\\\]" } } */
+  /* { dg-final { scan-assembler "vldr.32.+ \\\[r0, #-?\[0-9\]+\\\]" } } */
   /* { dg-final { scan-assembler "vldr.32.+ \\\[r\[0-9\], #-1020\\\]" { target { arm32 && { ! arm_thumb2_ok } } } } } */
   /* { dg-final { scan-assembler "add.+ r0, #1024" } } */
   /* { dg-final { scan-assembler "vstr.32.+ \\\[r\[0-9\]\\\]\n" } } */
Index: gcc/testsuite/gcc.target/arm/ldrd-strd-pair-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/ldrd-strd-pair-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/ldrd-strd-pair-1.c	(revision 0)
@@ -0,0 +1,23 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_prefer_ldrd_strd } */
+/* { dg-options "-O2" } */
+
+struct
+{
+  int x;
+  int y;
+  char c;
+  int d;
+}a;
+
+int foo(int x, int y)
+{
+  int c;
+  a.x = x;
+  c = a.x;
+  a.d = c;
+  a.y = y;
+
+  return 0;
+}
+/* { dg-final { scan-assembler "strd\t" } } */
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 215662)
+++ gcc/config/arm/arm.c	(working copy)
@@ -291,6 +291,8 @@  static unsigned arm_add_stmt_cost (void *data, int
 static void arm_canonicalize_comparison (int *code, rtx *op0, rtx *op1,
 					 bool op0_preserve_value);
 static unsigned HOST_WIDE_INT arm_asan_shadow_offset (void);
+
+static void arm_sched_fusion_priority (rtx_insn *, int, int *, int*);
 
 /* Table of machine attributes.  */
 static const struct attribute_spec arm_attribute_table[] =
@@ -688,6 +690,9 @@  static const struct attribute_spec arm_attribute_t
 #undef TARGET_CALL_FUSAGE_CONTAINS_NON_CALLEE_CLOBBERS
 #define TARGET_CALL_FUSAGE_CONTAINS_NON_CALLEE_CLOBBERS true
 
+#undef TARGET_SCHED_FUSION_PRIORITY
+#define TARGET_SCHED_FUSION_PRIORITY arm_sched_fusion_priority
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 /* Obstack for minipool constant handling.  */
@@ -3116,6 +3121,10 @@  arm_option_override (void)
   if (target_slow_flash_data)
     arm_disable_literal_pool = true;
 
+  /* By default, only enable scheduling fusion for ARMv7 processors.  */
+  if (flag_schedule_fusion == 2 && !arm_arch7)
+    flag_schedule_fusion = 0;
+
   /* Register global variables with the garbage collector.  */
   arm_add_gc_roots ();
 }
@@ -32289,4 +32298,124 @@  arm_is_constant_pool_ref (rtx x)
 	  && CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)));
 }
 
+/* If MEM is in the form [base+offset], extract the two parts of the
+   address and store them in BASE and OFFSET; otherwise clear BASE and
+   OFFSET and return false.  */
+
+static bool
+extract_base_offset_in_addr (rtx mem, rtx *base, rtx *offset)
+{
+  rtx addr;
+
+  gcc_assert (MEM_P (mem));
+
+  addr = XEXP (mem, 0);
+
+  /* Strip off const from addresses like (const (addr)).  */
+  if (GET_CODE (addr) == CONST)
+    addr = XEXP (addr, 0);
+
+  if (GET_CODE (addr) == REG)
+    {
+      *base = addr;
+      *offset = const0_rtx;
+      return true;
+    }
+
+  if (GET_CODE (addr) == PLUS
+      && GET_CODE (XEXP (addr, 0)) == REG
+      && CONST_INT_P (XEXP (addr, 1)))
+    {
+      *base = XEXP (addr, 0);
+      *offset = XEXP (addr, 1);
+      return true;
+    }
+
+  *base = NULL_RTX;
+  *offset = NULL_RTX;
+
+  return false;
+}
+
+/* If INSN is a load or store whose address has the form [base+offset],
+   extract the two parts and store them in BASE and OFFSET.  Set IS_LOAD
+   to TRUE if it is a load.  Return TRUE if INSN is such an instruction,
+   otherwise return FALSE.  */
+
+static bool
+fusion_load_store (rtx_insn *insn, rtx *base, rtx *offset, bool *is_load)
+{
+  rtx x, dest, src;
+
+  gcc_assert (INSN_P (insn));
+  x = PATTERN (insn);
+  if (GET_CODE (x) != SET)
+    return false;
+
+  src = SET_SRC (x);
+  dest = SET_DEST (x);
+  if (GET_CODE (src) == REG && GET_CODE (dest) == MEM)
+    {
+      *is_load = false;
+      extract_base_offset_in_addr (dest, base, offset);
+    }
+  else if (GET_CODE (src) == MEM && GET_CODE (dest) == REG)
+    {
+      *is_load = true;
+      extract_base_offset_in_addr (src, base, offset);
+    }
+  else
+    return false;
+
+  return (*base != NULL_RTX && *offset != NULL_RTX);
+}
+
+/* Implement the TARGET_SCHED_FUSION_PRIORITY hook.
+
+   Currently we only support fusing ldr and str instructions, so FUSION_PRI
+   and PRI are only calculated for those.  For other instructions,
+   FUSION_PRI and PRI are simply set to MAX_PRI.  In the future, other
+   kinds of instruction fusion can be supported via different priorities.
+
+   It's important that irrelevant instructions get the largest FUSION_PRI.  */
+
+static void
+arm_sched_fusion_priority (rtx_insn *insn, int max_pri,
+			   int *fusion_pri, int *pri)
+{
+  int tmp, off_val;
+  bool is_load;
+  rtx base, offset;
+
+  gcc_assert (INSN_P (insn));
+
+  tmp = max_pri - 1;
+  if (!fusion_load_store (insn, &base, &offset, &is_load))
+    {
+      *pri = tmp;
+      *fusion_pri = tmp;
+      return;
+    }
+
+  /* Load goes first.  */
+  if (is_load)
+    *fusion_pri = tmp - 1;
+  else
+    *fusion_pri = tmp - 2;
+
+  tmp /= 2;
+
+  /* INSN with smaller base register goes first.  */
+  tmp -= ((REGNO (base) & 0xff) << 20);
+
+  /* INSN with smaller offset goes first.  */
+  off_val = (int)(INTVAL (offset));
+  if (off_val >= 0)
+    tmp -= (off_val & 0xfffff);
+  else
+    tmp += (off_val & 0xfffff);
+
+  *pri = tmp;
+  return;
+}
 #include "gt-arm.h"
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	(revision 215662)
+++ gcc/passes.def	(working copy)
@@ -400,6 +400,7 @@  along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_stack_adjustments);
 	  NEXT_PASS (pass_jump2);
 	  NEXT_PASS (pass_duplicate_computed_gotos);
+	  NEXT_PASS (pass_sched_fusion);
 	  NEXT_PASS (pass_peephole2);
 	  NEXT_PASS (pass_if_after_reload);
 	  NEXT_PASS (pass_regrename);
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	(revision 215662)
+++ gcc/tree-pass.h	(working copy)
@@ -542,6 +542,7 @@  extern rtl_opt_pass *make_pass_branch_target_load_
 extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context
 							     *ctxt);
 extern rtl_opt_pass *make_pass_stack_adjustments (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_sched_fusion (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_peephole2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_if_after_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_regrename (gcc::context *ctxt);