Message ID: 000001cfdc90$1d95c670$58c15350$@arm.com
On Sep 30, 2014, at 2:22 AM, Bin Cheng <bin.cheng@arm.com> wrote:
> Then I decided to take one step forward and introduce a generic
> instruction fusion infrastructure in GCC, because in essence, load/store
> pairing is no different from other instruction fusion: all these optimizations
> want to do is push instructions together in the instruction flow.

I like the step you took.  I had exactly this in mind when I wrote the original.

> N0 ~= 1300
> N1/N2 ~= 5000
> N3 ~= 7500

Nice.  It would be nice to see timing metrics to ensure that the code isn't actually worse (CSiBE and/or SPEC and/or some other benchmark).  I didn't have any large-scale benchmark runs with my code, and I did worry about extending lifetimes and register pressure.

> I cleared up Mike's patch and fixed some implementation bugs in it

So, I'm wondering what the bugs or missed opportunities were?  And whether they were the kind of problem that generated incorrect code, or merely a missed opportunity.
On Wed, Oct 1, 2014 at 5:06 AM, Mike Stump <mikestump@comcast.net> wrote:
> On Sep 30, 2014, at 2:22 AM, Bin Cheng <bin.cheng@arm.com> wrote:
>> Then I decided to take one step forward and introduce a generic
>> instruction fusion infrastructure in GCC, because in essence, load/store
>> pairing is no different from other instruction fusion: all these optimizations
>> want to do is push instructions together in the instruction flow.
>
> I like the step you took.  I had exactly this in mind when I wrote the original.
>
>> N0 ~= 1300
>> N1/N2 ~= 5000
>> N3 ~= 7500
>
> Nice.  It would be nice to see timing metrics to ensure that the code isn't actually worse (CSiBE and/or SPEC and/or some other benchmark).  I didn't have any large-scale benchmark runs with my code, and I did worry about extending lifetimes and register pressure.

Hi Mike,
I did collect SPEC2000 performance after pairing loads/stores using this patch on both aarch64 and cortex-a15.  The performance is clearly improved, especially on cortex-a57.  There are some (though not many) benchmarks that regressed a little.  There is no register pressure problem here, because this pass is placed between register allocation and sched2; I guess sched2 should resolve most pipeline hazards introduced by this pass.

>> I cleared up Mike's patch and fixed some implementation bugs in it
>
> So, I'm wondering what the bugs or missed opportunities were?  And whether they were the kind of problem that generated incorrect code, or merely a missed opportunity.

Just missed-opportunity issues.

Thanks,
bin
On Mon, Oct 6, 2014 at 11:57 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Oct 1, 2014 at 5:06 AM, Mike Stump <mikestump@comcast.net> wrote:
>> On Sep 30, 2014, at 2:22 AM, Bin Cheng <bin.cheng@arm.com> wrote:
>>> Then I decided to take one step forward and introduce a generic
>>> instruction fusion infrastructure in GCC, because in essence, load/store
>>> pairing is no different from other instruction fusion: all these optimizations
>>> want to do is push instructions together in the instruction flow.
>>
>> I like the step you took.  I had exactly this in mind when I wrote the original.
>>
>>> N0 ~= 1300
>>> N1/N2 ~= 5000
>>> N3 ~= 7500
>>
>> Nice.  It would be nice to see timing metrics to ensure that the code isn't actually worse (CSiBE and/or SPEC and/or some other benchmark).  I didn't have any large-scale benchmark runs with my code, and I did worry about extending lifetimes and register pressure.
>
> Hi Mike,
> I did collect SPEC2000 performance after pairing loads/stores using this
> patch on both aarch64 and cortex-a15.  The performance is clearly
> improved, especially on cortex-a57.  There are some (though not many)
> benchmarks that regressed a little.  There is no register pressure
> problem here, because this pass is placed between register allocation and
> sched2; I guess sched2 should resolve most pipeline hazards introduced
> by this pass.

How many merging opportunities does sched2 undo again?  ISTR it has the tendency of pushing stores down and loads up.

Richard.

>>> I cleared up Mike's patch and fixed some implementation bugs in it
>>
>> So, I'm wondering what the bugs or missed opportunities were?  And whether they were the kind of problem that generated incorrect code, or merely a missed opportunity.
>
> Just missed-opportunity issues.
>
> Thanks,
> bin
On Oct 6, 2014, at 4:32 AM, Richard Biener <richard.guenther@gmail.com> wrote:
> On Mon, Oct 6, 2014 at 11:57 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>
> How many merging opportunities does sched2 undo again?  ISTR it
> has the tendency of pushing stores down and loads up.

So, the pass works by merging 2 or more loads into 1 load (at least on my port).  sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work; the non-real work doesn't matter.  Can sched2 rip apart a single load?
On Tue, Oct 7, 2014 at 1:20 AM, Mike Stump <mikestump@comcast.net> wrote:
> On Oct 6, 2014, at 4:32 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>> On Mon, Oct 6, 2014 at 11:57 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>
>> How many merging opportunities does sched2 undo again?  ISTR it
>> has the tendency of pushing stores down and loads up.
>
> So, the pass works by merging 2 or more loads into 1 load (at least on my port).  sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work; the non-real work doesn't matter.  Can sched2 rip apart a single load?

On ARM and AArch64, the two merged loads/stores are transformed into a single parallel insn by the following peephole2 pass, so sched2 would not undo the fusion work.  I thought sched2 works on the basis of whole instructions, and it isn't good practice to have sched2 do split work.

Thanks,
bin
On 10/06/14 19:31, Bin.Cheng wrote:
> On Tue, Oct 7, 2014 at 1:20 AM, Mike Stump <mikestump@comcast.net> wrote:
>> On Oct 6, 2014, at 4:32 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>> On Mon, Oct 6, 2014 at 11:57 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>
>>> How many merging opportunities does sched2 undo again?  ISTR it
>>> has the tendency of pushing stores down and loads up.
>>
>> So, the pass works by merging 2 or more loads into 1 load (at least on my port).  sched2 would need to rip apart 1 load into 2 loads to be able to undo the real work; the non-real work doesn't matter.  Can sched2 rip apart a single load?
>
> On ARM and AArch64, the two merged loads/stores are transformed into a
> single parallel insn by the following peephole2 pass, so sched2
> would not undo the fusion work.  I thought sched2 works on the basis of
> whole instructions, and it isn't good practice to have sched2 do split work.

It's certainly advantageous for sched2 to split insns that generate multiple instructions.  Running after register allocation, sched2 is ideal for splitting, because we know the alternative for each insn and thus we can (possibly for the first time) accurately know whether a particular insn will generate multiple assembly instructions.

If the port has a splitter to rip apart a double-word load into single-word loads, then we'd obviously only want to do that in cases where the double-word load actually generates more than one assembly instruction.

Addressing issues in that space seems out of scope for Bin's work to me, except perhaps for such issues on aarch64/arm, which are Bin's primary concerns.

jeff
On Wed, Oct 8, 2014 at 1:28 PM, Jeff Law <law@redhat.com> wrote:
> On 10/06/14 19:31, Bin.Cheng wrote:
>>
>> On Tue, Oct 7, 2014 at 1:20 AM, Mike Stump <mikestump@comcast.net> wrote:
>>>
>>> On Oct 6, 2014, at 4:32 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>>>>
>>>> On Mon, Oct 6, 2014 at 11:57 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>>>>
>>>> How many merging opportunities does sched2 undo again?  ISTR it
>>>> has the tendency of pushing stores down and loads up.
>>>
>>> So, the pass works by merging 2 or more loads into 1 load (at least on my
>>> port).  sched2 would need to rip apart 1 load into 2 loads to be able to
>>> undo the real work; the non-real work doesn't matter.  Can sched2 rip
>>> apart a single load?
>>
>> On ARM and AArch64, the two merged loads/stores are transformed into a
>> single parallel insn by the following peephole2 pass, so sched2
>> would not undo the fusion work.  I thought sched2 works on the basis of
>> whole instructions, and it isn't good practice to have sched2 do split work.
>
> It's certainly advantageous for sched2 to split insns that generate multiple
> instructions.  Running after register allocation, sched2 is ideal for
> splitting, because we know the alternative for each insn and thus we can
> (possibly for the first time) accurately know whether a particular insn will
> generate multiple assembly instructions.
>
> If the port has a splitter to rip apart a double-word load into single-word
> loads, then we'd obviously only want to do that in cases where the
> double-word load actually generates more than one assembly instruction.
>
> Addressing issues in that space seems out of scope for Bin's work to me,
> except perhaps for such issues on aarch64/arm, which are Bin's primary
> concerns.

Hi Jeff,
Thanks very much for the explanation.  Very likely I am wrong here, but it seems what you mentioned fits pass_split_before_sched2 very well.  Then I guess it would be nice if we could differentiate the cases in the first place by generating different patterns, rather than splitting some of the instructions later, though I have no idea whether we can do that or not.

For arm/aarch64, I guess it's not an issue, otherwise the peephole2 wouldn't work at all.  ARM maintainers should have the answer to this.

> jeff
>> If the port has a splitter to rip apart a double-word load into single-word loads, then we'd obviously only want to do that in cases where the double-word load actually generates more than one assembly instruction.

Or indeed if it is really a performance win, and I think that should purely be a per-port / micro-architectural decision.

> For arm/aarch64, I guess it's not an issue, otherwise the peephole2
> wouldn't work at all.  ARM maintainers should have the answer to this.

Generating more ldrd's and strd's will be beneficial in the ARM and AArch64 ports: we save code size and start using more of the memory bandwidth available per instruction on most higher-end cores that I'm aware of.  Even on the smaller microcontrollers I expect it to be a win, because you've saved code size.  There may well be pathological cases, given that we've shortened some dependencies or increased the lifetimes of others, but overall I'd expect it to be more positive than negative.

I also expect this to be more effective in the T32 (Thumb-2) ISA and AArch64, because ldrd/strd and ldp/stp respectively can work with any registers, unlike the A32 ISA, where the registers loaded or stored must be consecutive.  I'm hoping for some more review on the generic bits before looking into the backend implementation, in the expectation that this is the direction folks want to proceed.

regards
Ramana

> jeff
On 10/08/14 04:27, Ramana Radhakrishnan wrote:
>>> If the port has a splitter to rip apart a double-word load into single-word loads, then we'd obviously only want to do that in cases where the double-word load actually generates more than one assembly instruction.
>
> Or indeed if it is really a performance win, and I think that should
> purely be a per-port / micro-architectural decision.

Agreed.

> Generating more ldrd's and strd's will be beneficial in the ARM and
> AArch64 ports: we save code size and start using more of the memory
> bandwidth available per instruction on most higher-end cores that I'm
> aware of.  Even on the smaller microcontrollers I expect it to be a win,
> because you've saved code size.  There may well be pathological cases,
> given that we've shortened some dependencies or increased the lifetimes of
> others, but overall I'd expect it to be more positive than negative.

Agreed.  I suspect there are multiple architectures where the results would be similar: code size improvements and more effective use of memory bandwidth, with possibly some pathological case(s) that we really shouldn't worry too much about.

> I also expect this to be more effective in the T32 (Thumb-2) ISA and
> AArch64, because ldrd/strd and ldp/stp respectively can work with
> any registers, unlike the A32 ISA, where the registers loaded or stored
> must be consecutive.  I'm hoping for some more review on the
> generic bits before looking into the backend implementation, in the
> expectation that this is the direction folks want to proceed.

I've got some questions that I'm formulating to make sure I understand how the facility is to be used.  I may have to simply sit down with the code installed on a test build and play with it.  However, to be clear, I really like the direction this work has gone.

Jeff
On Oct 7, 2014, at 10:28 PM, Jeff Law <law@redhat.com> wrote:
> It's certainly advantageous for sched2 to split insns that generate multiple instructions.
So, on my port, I have a load-multiple that is just one instruction, and it is a single clock cycle (to enqueue it).
On 09/30/14 03:22, Bin Cheng wrote:
> Hi,
> many load/store pairs as my old patch.  Then I decided to take one step
> forward and introduce a generic instruction fusion infrastructure in GCC,
> because in essence, load/store pairing is no different from other
> instruction fusion: all these optimizations want to do is push instructions
> together in the instruction flow.

Great generalization.  And yes, you're absolutely right: what you're doing is building a fairly generic mechanism to mark insns that might fuse together.

So, some questions.  Let's assume I've got 3 kinds of insns: A, B, and C.  I can fuse AB or AC, but not BC.  In fact, moving B and C together may significantly harm performance.  So my question is: can a given insn have different fusion priorities depending on its scheduling context?

So perhaps an example.  Let's say I have an insn stream with the following kinds of instructions, all ready at the same time:

AAAAAAAABBBBCCCC

Can I create 8 distinct fusion priorities such that I ultimately schedule

AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)

I guess another way to ask the question: are fusion priorities static based on the insn/alternative, or can they vary?  And if they can vary, can they vary each tick of the scheduler?

Now the next issue is I really don't want all those to fire back-to-back-to-back.  I'd like some other insns to be inserted between each fusion pair if they're in the ready list.  I guess the easiest way to get that is to assign the same fusion priority to other insns in the ready queue, even though they don't participate in fusion.  So

ABX(1) ABY(2) .....

where X and Y are some other arbitrary insns that don't participate in the AB fusion, but will issue in the same cycle as the AB fused insn.

Though I guess if we run fusion + peep2 between sched1 and sched2, that problem would just resolve itself, as we'd have fused AB together into a new insn and we'd schedule normally with the fused insns and X, Y.

> So here comes this patch.  It adds a new sched_fusion pass just before
> peephole2.  The methodology is like:
> 1) The priority in the scheduler is extended into a [fusion_priority, priority]
> pair, with fusion_priority as the major key and priority as the minor key.
> 2) The back-end assigns a priority pair to each instruction; instructions
> that want to be fused together get the same fusion_priority assigned.

I think the bulk of my questions above are targeted at this part of the change.  When are these assignments made, and how much freedom does the backend have to make/change those assignments?

So another question: given a fused pair, is there any way to guarantee ordering within the fused pair?  This is useful to cut down on the number of peep2 patterns.  I guess we could twiddle the priority in those cases to force a particular ordering of the fused pair, right?

I wonder if we could use this to zap all the hair I added to caller-save back in the early 90s to try and widen the save/restore modes.  So instead of st; st; call; ld; ld, we'd generate std; call; ldd.  It was a huge win for floating point on the sparc processors of that time.  I don't expect you to do that investigation; just thinking out loud.

> I collected performance data for both cortex-a15 and cortex-a57 (with a
> local peephole ldp/stp patch); the benchmarks are clearly improved on
> arm/aarch64.  I also collected instrumentation data about how many load/store
> pairs are found.  For the four versions of load/store pair patches:
> 0) The original Mike's patch.
> 1) My original prototype patch.
> 2) Cleaned-up pass of Mike (with implementation bugs resolved).
> 3) This new prototype fusion pass.
>
> The numbers of paired opportunities satisfy the relations below:
> 3 * N0 ~ N1 ~ N2 < N3
> For example, for one benchmark suite, we have:
> N0 ~= 1300
> N1/N2 ~= 5000
> N3 ~= 7500

Nice.  Very nice.  Overall it's a fairly simple change.  I'll look deeper into it next week.

jeff
[ I'll give the state of the code that I finished with; Bin's answers should be similar to mine, but if he improved things, they could be better. ]

On Oct 10, 2014, at 2:13 PM, Jeff Law <law@redhat.com> wrote:
> So, some questions.  Let's assume I've got 3 kinds of insns: A, B, and C.
>
> I can fuse AB or AC, but not BC.  In fact, moving B and C together may significantly harm performance.

We can only choose one ordering for fusion consideration, so you have to pick between AB or AC as the ordering.  Once you decide, the other is hidden from consideration.

The priorities are merely used to select an instruction ordering in which to consider pair-wise (adjacent) opportunities.

They can chain and stack, so if you want A B C to fuse into a new ABC instruction, it will.

> So my question is: can a given insn have different fusion priorities depending on its scheduling context?

Area for future improvement.  In theory you can feed any other state into the creation of the priority, but only the single insn feeds into it today.

> So perhaps an example.  Let's say I have an insn stream with the following kinds of instructions, all ready at the same time.
>
> AAAAAAAABBBBCCCC
>
> Can I create 8 distinct fusion priorities such that I ultimately schedule
> AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)
>
> I guess another way to ask the question: are fusion priorities static based on the insn/alternative, or can they vary?  And if they can vary, can they vary each tick of the scheduler?

So, it wasn't created to solve that problem.  If the scheduler would tend to put them next to each other because, structurally, they fit that way, then peephole already has a chance to see them that way.

> Now the next issue is I really don't want all those to fire back-to-back-to-back.  I'd like some other insns to be inserted between each fusion pair if they're in the ready list.

Scheduling figures out the way to lay out instructions and will still be done.  Fusion runs before scheduling and is separate; it doesn't replace or change scheduling.

> I guess the easiest way to get that is to assign the same fusion priority to other insns in the ready queue, even though they don't participate in fusion.  So
>
> ABX(1) ABY(2) .....
>
> where X and Y are some other arbitrary insns that don't participate in the AB fusion, but will issue in the same cycle as the AB fused insn.

You're conflating fusion with scheduling; they are separate.  Fusion doesn't deal with time or cycles.  It exists only to fuse two separate instructions together.  Those products will be scheduled later, and post-scheduling, all the holes will be filled to the extent the scheduler is able to find work to do.

For example, given:

  ld 0
  inc
  ld 1
  inc
  ld 2
  inc
  ld 3
  inc

and we want to fuse ld <even> with ld <even>+1, we then get:

  ld 01
  ld 23
  inc
  inc

then, post scheduling, we get:

  ld 01
  inc
  ld 23
  inc

by the scheduler filling the holes.

> Though I guess if we run fusion + peep2 between sched1 and sched2, that problem would just resolve itself, as we'd have fused AB together into a new insn and we'd schedule normally with the fused insns and X, Y.

Yes, in my version, I ran it really early, before sched.  I needed to run before ira and before other people changed the code so much that they obscured the instructions I wanted to fuse.  One thing I saw was that once we got too far along, the offsets used were loaded into a register, and instead of being r+n it was r1+r2, and this blew the sorting.

>> So here comes this patch.  It adds a new sched_fusion pass just before
>> peephole2.

Hum...  In my version, that was just way too late.  I needed to generate pseudos and count on ira and others to clean up the register moves for me.  I've not tried his patch to see if it works as well as my version on my problem (combining adjacent loads and stores).

>> The methodology is like:
>> 1) The priority in the scheduler is extended into a [fusion_priority, priority]
>> pair, with fusion_priority as the major key and priority as the minor key.
>> 2) The back-end assigns a priority pair to each instruction; instructions
>> that want to be fused together get the same fusion_priority assigned.
> I think the bulk of my questions above are targeted at this part of the change.  When are these assignments made, and how much freedom does the backend have to make/change those assignments?

It is free to do anything with the numbers it wants.  But the port writer has carefully chosen the priorities to generate the best code; it is unlikely it would ever be wise to change the numbers.

> So another question: given a fused pair, is there any way to guarantee ordering within the fused pair?

Sure, ensure one instruction has a lower number than the other.  That will order them reliably.

> This is useful to cut down on the number of peep2 patterns.  I guess we could twiddle the priority in those cases to force a particular ordering of the fused pair, right?

Right, assuming that there is an ordering for the instructions that makes sense.  Given:

  mul r1,#13
  inc r3
  mul r5,#13
  inc r4

and a machine that can do a V2 mul, given a pair:

  mov r8,r1
  mov r9,r5
  mulv2 r8,#13
  inc r3
  inc r4
  mov r1,r8
  mov r5,r9

Here we can see all the extra moves to ensure a register pair (n, n+1 for the mulv2 instruction) and the temporary created to hold the tuple (r8,r9).  In my peephole patterns I do this to ensure registers have the right register numbers.  I do this before register allocation, as it can choose the registers best and ensure all the other instructions around it use those registers, and leave the moves in if it can't do better.  The moves tend to be able to fill scheduling holes, so even if they remain, usually the cost shouldn't be too bad.

On the other hand, if you have

  left
  left
  right
  right

there is no way to sort them to get:

  left
  right
  left
  right

and then fuse:

  left_right
  left_right

This would be impossible.

> I wonder if we could use this to zap all the hair I added to caller-save back in the early 90s to try and widen the save/restore modes.  So instead of st; st; call; ld; ld, we'd generate std; call; ldd.  It was a huge win for floating point on the sparc processors of that time.  I don't expect you to do that investigation.  Just thinking out loud.

Well, curiously enough, that sort of thing is exactly why I wrote the code.  I was thinking of cases where the user loads lots of registers from memory, plays a bit with them, and then puts the data back into memory.  I wanted to put all the loads together and then try to find the loads that are next to each other and combine them into a single, larger load.  On a machine that can load n registers at once, you can then save n-1 instructions.  Likewise for stores.  Conceptually, there is nothing port-specific about the optimization I wanted to do.
Mike already gave great answers; here are just some of my thoughts on the specific questions, embedded below.

On Sat, Oct 11, 2014 at 7:10 AM, Mike Stump <mikestump@comcast.net> wrote:
> [ I'll give the state of the code that I finished with; Bin's answers should be similar to mine, but if he improved things, they could be better. ]
>
> On Oct 10, 2014, at 2:13 PM, Jeff Law <law@redhat.com> wrote:
>> So, some questions.  Let's assume I've got 3 kinds of insns: A, B, and C.
>>
>> I can fuse AB or AC, but not BC.  In fact, moving B and C together may significantly harm performance.
>
> We can only choose one ordering for fusion consideration, so you have to pick between AB or AC as the ordering.  Once you decide, the other is hidden from consideration.
>
> The priorities are merely used to select an instruction ordering in which to consider pair-wise (adjacent) opportunities.
>
> They can chain and stack, so if you want A B C to fuse into a new ABC instruction, it will.
>
>> So my question is: can a given insn have different fusion priorities depending on its scheduling context?
>
> Area for future improvement.  In theory you can feed any other state into the creation of the priority, but only the single insn feeds into it today.
>
>> So perhaps an example.  Let's say I have an insn stream with the following kinds of instructions, all ready at the same time.
>>
>> AAAAAAAABBBBCCCC
>>
>> Can I create 8 distinct fusion priorities such that I ultimately schedule
>> AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)
>>
>> I guess another way to ask the question: are fusion priorities static based on the insn/alternative, or can they vary?  And if they can vary, can they vary each tick of the scheduler?

Though this pass works on predefined fusion types and priorities now, there might be two possible fixes for this specific problem.

1) Introduce another exclusive_pri, so it becomes "fusion_pri, priority, exclusive_pri".  The first is assigned to mark instructions belonging to the same fusion type.  The second is assigned to fuse each pair of consecutive instructions together.  The last is assigned to prevent a specific pair of instructions from being fused, like the "BC" mentioned.

2) Extend the idea by using the hook functions TARGET_SCHED_REORDER/TARGET_SCHED_REORDER2.  We can assign fusion_pri in the first place, making sure instructions of the same fusion type will be adjacent to each other, then change priority (thus reordering the ready list) at the back-end's wish, even per tick of the scheduler.

> So, it wasn't created to solve that problem.  If the scheduler would tend to put them next to each other because, structurally, they fit that way, then peephole already has a chance to see them that way.
>
>> Now the next issue is I really don't want all those to fire back-to-back-to-back.  I'd like some other insns to be inserted between each fusion pair if they're in the ready list.
>
> Scheduling figures out the way to lay out instructions and will still be done.  Fusion runs before scheduling and is separate; it doesn't replace or change scheduling.
>
>> I guess the easiest way to get that is to assign the same fusion priority to other insns in the ready queue, even though they don't participate in fusion.  So
>>
>> ABX(1) ABY(2) .....
>>
>> where X and Y are some other arbitrary insns that don't participate in the AB fusion, but will issue in the same cycle as the AB fused insn.
>
> You're conflating fusion with scheduling; they are separate.  Fusion doesn't deal with time or cycles.  It exists only to fuse two separate instructions together.  Those products will be scheduled later, and post-scheduling, all the holes will be filled to the extent the scheduler is able to find work to do.
> For example, given:
>
>   ld 0
>   inc
>   ld 1
>   inc
>   ld 2
>   inc
>   ld 3
>   inc
>
> and we want to fuse ld <even> with ld <even>+1, we then get:
>
>   ld 01
>   ld 23
>   inc
>   inc
>
> then, post scheduling, we get:
>
>   ld 01
>   inc
>   ld 23
>   inc
>
> by the scheduler filling the holes.
>
>> Though I guess if we run fusion + peep2 between sched1 and sched2, that problem would just resolve itself, as we'd have fused AB together into a new insn and we'd schedule normally with the fused insns and X, Y.
>
> Yes, in my version, I ran it really early, before sched.  I needed to run before ira and before other people

Two reasons why I run it late in the compilation process:
1) IRA is a pass I tend not to disturb, since code is changed dramatically by it.  With fusion after IRA, I can get a certain improvement with less noise.
2) The spilling generates many load/store pair opportunities on ARM, which I don't want to miss.

> changed the code so much that they obscured the instructions I wanted to fuse.  One thing I saw was that once we got too far along, the offsets used were loaded into a register, and instead of being r+n it was r1+r2, and this blew the sorting.

I saw this problem too; it's the fwprop pass that should be blamed.  It just forwards a computation into an address expression even when there is no benefit.  For example, if we have:

  add rx, ry, rz
  ldr r1, [rx]
  ldr r2, [rx+4]
  ldr r3, [rx+8]

it will be transformed into:

  add rx, ry, rz
  ldr r1, [ry+rz]
  ldr r2, [rx+4]
  ldr r3, [rx+8]

This corrupts consecutive load pairs, gives no computation reduction, and increases register pressure.  The root cause is that the pass works reg_use by reg_use; it doesn't take global information into consideration, like register live ranges, whether there is real benefit, or whether it hurts other optimizations.

>>> So here comes this patch.  It adds a new sched_fusion pass just before
>>> peephole2.
>
> Hum...  In my version, that was just way too late.
I needed to generate pseudos and count on ira and others to clean up register the register moves for me. I've not tried his patch to see if it works as well as my version on my problem (combining adjacent loads and stores). Putting it after IRA has its own disadvantage. For example, fake dependency is introduced like below example: ldr r1, [rx + 4] ldr r2, [rx + 12] ldr rx, [rx + 8] ... use r1/r2/rx Here we can't fuse the first/third ldr instructions because rx is modified. This is why I tried to even push this pass (along with peephole2) after register renaming and enable register renaming pass in my experiments. It does help a lot at lease on ARM. > >>> The methodology is like: >>> 1) The priority in scheduler is extended into [fusion_priority, priority] >>> pair, with fusion_priority as the major key and priority as the minor key. >>> 2) The back-end assigns priorities pair to each instruction, instructions >>> want to be fused together get same fusion_priority assigned. >> I think the bulk of my questions above are targetted at this part of the change. When are these assignments made and how much freedom does the backend have to make/change those assignments. > > It is free to do anything with the numbers it wants. But, the port writer has carefully chosen the priorities to generate the best code, it is unlikely it would ever be wise to change the numbers. > >> So another question, given a fused pair, is there any way to guarantee ordering within the fused pair. > > Sure, ensure one instruction has a lower number than the other. That will order them reliably. > >> This is useful to cut down on the number of peep2 patterns. I guess we could twiddle the priority in those cases to force a particular ordering of the fused pair, right? > > Right, assuming that there is an ordering for the instruction that makes sense. 
Given: > > mul r1,#13 > inc r3 > mul r5,#13 > inc r4 > > and a machine that could do a V2 mul, given a pair: > > mov r8,r1 > mov r9,r2 > mulv2 r8,#13 > inc r3 > inc r4 > mov r1,r8 > mov r5,r9 > > Here, we can see all the extra moves to ensure a register pair (n, n+1 of the mulv2 instruction) and the temporary created to hold the tuple (r8,r9). In my peephole patterns I do this to ensure registers have the right register number. I do this before register allocation, as it can choose the registers the best and ensure all the other instructions around it use those registers, and leave the moves in, if it can't do better. The moves tend to be able to fill scheduling holes, so, even if they remain, usually the cost for them shouldn't be too bad. > > On the other hand, if you have > > left > left > right > right > > There is no way to sort them to get: > > left > right > left > right > > and then fuse: > > left_right > left_right > > This would be impossible. I can't understand this very well, this exactly is one case we want to fuse on ARM for movw/movt. Given moww r1, const_1 movw r2, const_2 movw r3, const_3 movt r1, const_1 movt r2, const_2 movt r3, const_3 I would expect: moww r1, const_1 movt r1, const_1 movw r2, const_2 movt r2, const_2 movw r3, const_3 movt r3, const_3 Thanks, bin > >> I wonder if we could use this to zap all the hair I added to caller-save back in the early 90s to try and widen the save/restore modes. So instead of st; st; call; ld; ld, we'd generate std; call; ldd. It was a huge win for floating point on the sparc processors of that time. I don't expect you to do that investigation. Just thinking out loud. > > Well, curiously enough, that sort of thing is exactly why I wrote the code. I was thinking of cases where the user load lots of registers from memory and then plays a bit with them and then put stye data back into memory. 
> I wanted to put all the loads together and then try and find the loads
> that are next to each other and combine them into a single, larger load.
> On a machine that can load n registers at once, you can then save n-1
> instructions.  Likewise for stores.  Conceptually, there is nothing port
> specific about the optimization I wanted to do.
On Oct 10, 2014, at 8:32 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>
>>> Though I guess if we run fusion + peep2 between sched1 and sched2, that
>>> problem would just resolve itself as we'd have fused AB together into a
>>> new insn and we'd schedule normally with the fused insns and X, Y.
>>
>> Yes, in my version, I ran it really early, before sched.  I needed to run
>> before ira and run before other people
>
> Two reasons why I run it late in compilation process.
> 1) IRA is the pass I tend not to disturb since code is changed
> dramatically.  With it after IRA, I can get certain improvement from
> fusion, there is less noise here.

Since I have a front-end background, I think nothing of creating pseudos when I want to; I just know if I do, I have to do it before allocation.  :-)  For my peepholes, since they create registers, they must run before allocation.

> 2) The spilling generates many load/store pair opportunities on ARM,
> which I don't want to miss.

I happen to have enough registers that spilling wasn't my primary concern.

>     add rx, ry, rz
>     ldr r1, [rx]
>     ldr r2, [rx+4]
>     ldr r3, [rx+8]
>
> It will be transformed into:
>
>     add rx, ry, rz
>     ldr r1, [ry+rz]
>     ldr r2, [rx+4]
>     ldr r3, [rx+8]

Yeah, that seems to tickle a neuron.

>> On the other hand, if you have
>>
>>     left
>>     left
>>     right
>>     right
>>
>> There is no way to sort them to get:
>>
>>     left
>>     right
>>     left
>>     right
>>
>> and then fuse:
>>
>>     left_right
>>     left_right
>>
>> This would be impossible.
>
> I can't understand this very well, this exactly is one case we want to
> fuse on ARM for movw/movt.  Given
>     movw r1, const_1

This differs from the above by having r1 and const_1; in my example, there is no r1 and no const_1, and this matters.  I wanted to list a case where it is impossible to sort.  This happens when there isn't enough data to sort on, for example, no offset, no register number.
On Sat, Oct 11, 2014 at 5:13 AM, Jeff Law <law@redhat.com> wrote:
> On 09/30/14 03:22, Bin Cheng wrote:
>>
>> Hi,
>> many load/store pairs as my old patch.  Then I decided to take one step
>> forward to introduce a generic instruction fusion infrastructure in GCC,
>> because in essence, load/store pair is nothing different with other
>> instruction fusion, all these optimizations want is to push instructions
>> together in instruction flow.
>
> Great generalization.  And yes, you're absolutely right, what you're doing
> is building a fairly generic mechanism to mark insns that might fuse
> together.
>
> So, some questions.  Let's assume I've got 3 kinds of insns: A, B & C.
>
> I can fuse AB or AC, but not BC.  In fact, moving B & C together may
> significantly harm performance.
>
> So my question is, can a given insn have different fusion priorities
> depending on its scheduling context?
>
> So perhaps an example.  Let's say I have an insn stream with the following
> kinds of instructions, all ready at the same time.
>
> AAAAAAAABBBBCCCC
>
> Can I create 8 distinct fusion priorities such that I ultimately schedule
> AB(1) AB(2) AB(3) AB(4) AC(5) AC(6) AC(7) AC(8)
>
> I guess another way to ask the question, are fusion priorities static based
> on the insn/alternative, or can they vary?  And if they can vary, can they
> vary each tick of the scheduler?
>
> Now the next issue is I really don't want all those to fire
> back-to-back-to-back.  I'd like some other insns to be inserted between each
> fusion pair if they're in the ready list.  I guess the easiest way to get
> that is to assign the same fusion priority to other insns in the ready
> queue, even though they don't participate in fusion.  So
>
> ABX(1) ABY(2).....
>
> Where X & Y are some other arbitrary insns that don't participate in the AB
> fusion, but will issue in the same cycle as the AB fused insn.
>
> Though I guess if we run fusion + peep2 between sched1 and sched2, that
> problem would just resolve itself as we'd have fused AB together into a new
> insn and we'd schedule normally with the fused insns and X, Y.
>
>> So here comes this patch.  It adds a new sched_fusion pass just before
>> peephole2.  The methodology is like:
>> 1) The priority in scheduler is extended into [fusion_priority, priority]
>> pair, with fusion_priority as the major key and priority as the minor key.
>> 2) The back-end assigns priorities pair to each instruction, instructions
>> want to be fused together get same fusion_priority assigned.
>
> I think the bulk of my questions above are targeted at this part of the
> change.  When are these assignments made and how much freedom does the
> backend have to make/change those assignments.
>
> So another question, given a fused pair, is there any way to guarantee
> ordering within the fused pair.  This is useful to cut down on the number of
> peep2 patterns.  I guess we could twiddle the priority in those cases to
> force a particular ordering of the fused pair, right?
>
> I wonder if we could use this to zap all the hair I added to caller-save
> back in the early 90s to try and widen the save/restore modes.  So instead
> of st; st; call; ld; ld, we'd generate std; call; ldd.  It was a huge win
> for floating point on the sparc processors of that time.  I don't expect you
> to do that investigation.  Just thinking out loud.
>
>> I collected performance data for both cortex-a15 and cortex-a57 (with a
>> local peephole ldp/stp patch), the benchmarks can be obviously improved on
>> arm/aarch64.  I also collected instrument data about how many load/store
>> pairs are found.  For the four versions of load/store pair patches:
>> 0) The original Mike's patch.
>> 1) My original prototype patch.
>> 2) Cleaned up pass of Mike (with implementation bugs resolved).
>> 3) This new prototype fusion pass.
>>
>> The numbers of paired opportunities satisfy below relations:
>> 3 * N0 ~ N1 ~ N2 < N3
>> For example, for one benchmark suite, we have:
>> N0 ~= 1300
>> N1/N2 ~= 5000
>> N3 ~= 7500
>
> Nice.  Very nice.
>
> Overall it's a fairly simple change.  I'll look deeper into it next week.
>
> jeff

Hi Jeff,
Any new comments?

Thanks,
bin
On 10/10/14 21:32, Bin.Cheng wrote:
> Mike already gave great answers, here are just some of my thoughts on
> the specific questions.  See embedded below.
Thanks to both of you for your answers.

Fundamentally, what I see is this scheme requires us to be able to come up with a key based solely on information in a particular insn.  To get fusion, another insn has to have the same or a closely related key.

This implies that the two candidates for fusion are related, even if there isn't a data dependency between them.  The canonical example would be two loads with reg+d addressing modes.  If they use the same base register and the displacements differ by a word, then we don't have a data dependency between the insns, but the insns are closely related by their address computations and we can compute a key to ensure those two related insns end up consecutive.  At any given call to the hook, the only context we can directly see is the current insn.

I'm pretty sure if I were to tweak the ARM bits ever-so-slightly it could easily model the load-load or store-store special case on the PA7xxx[LC] processors.  Normally a pair of loads or stores can't dual issue.  But if the two loads (or two stores) hit the upper and lower halves of a double-word object, then the instructions can dual issue.

I'd forgotten about that special-case scheduling opportunity until I started looking at some unrelated enhancement for prefetching.

Your code would also appear to allow significant cleanup of the old caller-save code that had a fair amount of bookkeeping added to issue double-word memory loads/stores rather than single-word operations.  This *greatly* improved performance on the old sparc processors, which had no call-saved FP registers.

However, your new code doesn't handle fusing instructions which are totally independent and of different static types.  There just isn't a good way to compute a key that I can see.
And this is OK -- that case, if we cared to improve it, would be best handled by the SCHED_REORDER hooks.

>>> I guess another way to ask the question, are fusion priorities static
>>> based on the insn/alternative, or can they vary?  And if they can vary,
>>> can they vary each tick of the scheduler?
>
> Though this pass works on predefined fusion types and priorities now,
> there might be two possible fixes for this specific problem.
> 1) Introduce another exclusive_pri, so it's like "fusion_pri, priority,
> exclusive_pri".  The first one is assigned to mark instructions belonging
> to the same fusion type.  The second is assigned to fuse each
> pair/consecutive instructions together.  The last one is assigned to
> prevent specific pairs of instructions from being fused, just like the
> "BC" mentioned.
> 2) Extend the idea by using the hook functions
> TARGET_SCHED_REORDER/TARGET_SCHED_REORDER2.  Now we can assign fusion_pri
> in the first place, making sure instructions of the same fusion type will
> be adjacent to each other, then we can change priority (thus reorder the
> ready list) at the back-end's wish, even per each tick of the scheduler.

#2 would be the best solution for the case I was pondering, but I don't think solving that case is terribly important given the processors for which it was profitable haven't been made for a very long time.

Jeff
On Oct 30, 2014, at 12:43 PM, Jeff Law <law@redhat.com> wrote:
> Fundamentally, what I see is this scheme requires us to be able to come up with a key based solely on information in a particular insn. To get fusion another insn has to have the same or a closely related key.
Right.
Thanks for giving it a try.

On Fri, Oct 31, 2014 at 3:43 AM, Jeff Law <law@redhat.com> wrote:
> On 10/10/14 21:32, Bin.Cheng wrote:
>>
>> Mike already gave great answers, here are just some of my thoughts on
>> the specific questions.  See embedded below.
>
> Thanks to both of you for your answers.
>
> Fundamentally, what I see is this scheme requires us to be able to come up
> with a key based solely on information in a particular insn.  To get fusion,
> another insn has to have the same or a closely related key.
>
> This implies that the two candidates for fusion are related, even if
> there isn't a data dependency between them.  The canonical example would be
> two loads with reg+d addressing modes.  If they use the same base register
> and the displacements differ by a word, then we don't have a data dependency
> between the insns, but the insns are closely related by their address
> computations and we can compute a key to ensure those two related insns end
> up consecutive.  At any given call to the hook, the only context we can
> directly see is the current insn.
>
> I'm pretty sure if I were to tweak the ARM bits ever-so-slightly it could
> easily model the load-load or store-store special case on the PA7xxx[LC]
> processors.  Normally a pair of loads or stores can't dual issue.  But if
> the two loads (or two stores) hit the upper and lower halves of a
> double-word object, then the instructions can dual issue.
>
> I'd forgotten about that special-case scheduling opportunity until I
> started looking at some unrelated enhancement for prefetching.
>
> Your code would also appear to allow significant cleanup of the old
> caller-save code that had a fair amount of bookkeeping added to issue
> double-word memory loads/stores rather than single-word operations.  This
> *greatly* improved performance on the old sparc processors, which had no
> call-saved FP registers.
>
> However, your new code doesn't handle fusing instructions which are totally
> independent and of different static types.  There just isn't a good way to
> compute a key that I can see.  And this is OK -- that case, if we cared to
> improve it, would be best handled by the SCHED_REORDER hooks.
>
>>>> I guess another way to ask the question, are fusion priorities static
>>>> based on the insn/alternative, or can they vary?  And if they can vary,
>>>> can they vary each tick of the scheduler?
>>
>> Though this pass works on predefined fusion types and priorities now,
>> there might be two possible fixes for this specific problem.
>> 1) Introduce another exclusive_pri, so it's like "fusion_pri, priority,
>> exclusive_pri".  The first one is assigned to mark instructions belonging
>> to the same fusion type.  The second is assigned to fuse each
>> pair/consecutive instructions together.  The last one is assigned to
>> prevent specific pairs of instructions from being fused, just like the
>> "BC" mentioned.
>> 2) Extend the idea by using the hook functions
>> TARGET_SCHED_REORDER/TARGET_SCHED_REORDER2.  Now we can assign fusion_pri
>> in the first place, making sure instructions of the same fusion type will
>> be adjacent to each other, then we can change priority (thus reorder the
>> ready list) at the back-end's wish, even per each tick of the scheduler.
>
> #2 would be the best solution for the case I was pondering, but I don't
> think solving that case is terribly important given the processors for which
> it was profitable haven't been made for a very long time.

I am wondering if it's possible to introduce a pattern-directed fusion,
something like define_fusion, and to adapt the haifa scheduler for it.  I
agree there are two kinds (relevant and irrelevant) of fusion types, and
it's not trivial to support both in one scheme.  Do you have a specific
example that I can have a try with?  This is just a preliminary idea and
definitely can't make it into GCC 5.0.
Moreover, so far I haven't seen such a requirement on ARM/AArch64 or any
other targets that I know of, so I would like to continue with this version
if it's fine.  Later I will send a patch pairing different kinds of ldp/stp
based on this for aarch64.

Thanks,
bin

>
> Jeff
On 10/30/14 23:36, Bin.Cheng wrote:
> Thanks for giving it a try.
I'm a very hands-on learner, so sometimes the best way for me to
understand a blob of code is to poke at it myself a bit.
Anyway, now I just need to get back to the patch for a final review :-)
jeff
On 10/30/14 23:36, Bin.Cheng wrote:
>> #2 would be the best solution for the case I was pondering, but I don't
>> think solving that case is terribly important given the processors for
>> which it was profitable haven't been made for a very long time.
> I am thinking if it's possible to introduce a pattern-directed fusion.
> Something like define_fusion, and adapting haifa-scheduler for it.  I
> agree there are two kinds (relevant and irrelevant) fusion types, and
> it's not trivial to support both in one scheme.  Do you have a
> specific example that I can have a try?
I kicked around using reorg to do stuff like this in the past (combination of unrelated insns).  But ultimately I think the way to go is to have it happen when insns are on the ready list in the scheduler.  For fusion of related insns like the load/store pairing, I think your approach should work pretty well.

As to specific examples of independent insn fusion, the ones I'm most familiar with are from the older PA chips.  I wouldn't recommend building something for those processors simply because they're so dated that I don't believe anyone uses them anymore.  However, if you have cases (arm shift insns?), building for those is fine.

If you just want examples, the ones we tried to exploit on the PA were fmpyadd/fmpysub, movb,tr and addb,tr.

fmpyadd/fmpysub combined an independent floating point multiply with an FP add or sub insn.  There are many conditions, but if you want a simple example to play with, the attached file with -O2 -mschedule=7100LC ought to generate one of these insns via pa_reorg.

addb,tr can combine an unconditional branch with a reg+reg or reg+imm5 addition operation.  movb,tr combines an unconditional branch with a reg-reg copy or a load of a 5-bit immediate value into a general register.  I don't happen to have examples handy, but compiling integer code with -O2 -mschedule=7100LC ought to trigger some.

The code in pa_reorg is O(n^2) or worse.
It predates the hooks to allow the target to reorder the ready queue.  It would probably be relatively easy to have that code run via those hooks and just look at the ready queue.  So it'd still be O(n^2), but the N would be *much* smaller.

But again, I don't think anyone uses PA7xxx processors, and hasn't for over a decade, so it hasn't seemed worth the effort to change.

Cheers,
Jeff
On 09/30/14 03:22, Bin Cheng wrote:
>
> 2014-09-30  Bin Cheng  <bin.cheng@arm.com>
>	     Mike Stump  <mikestump@comcast.net>
>
>	* timevar.def (TV_SCHED_FUSION): New time var.
>	* passes.def (pass_sched_fusion): New pass.
>	* config/arm/arm.c (TARGET_SCHED_FUSION_PRIORITY): New.
>	(extract_base_offset_in_addr, fusion_load_store): New.
>	(arm_sched_fusion_priority): New.
>	(arm_option_override): Disable scheduling fusion on non-armv7
>	processors by default.
>	* sched-int.h (struct _haifa_insn_data): New field.
>	(INSN_FUSION_PRIORITY, FUSION_MAX_PRIORITY, sched_fusion): New.
>	* sched-rgn.c (rest_of_handle_sched_fusion): New.
>	(pass_data_sched_fusion, pass_sched_fusion): New.
>	(make_pass_sched_fusion): New.
>	* haifa-sched.c (sched_fusion): New.
>	(insn_cost): Handle sched_fusion.
>	(priority): Handle sched_fusion by calling target hook.
>	(enum rfs_decision): New enum value.
>	(rfs_str): New element for RFS_FUSION.
>	(rank_for_schedule): Support sched_fusion.
>	(schedule_insn, max_issue, prune_ready_list): Handle sched_fusion.
>	(schedule_block, fix_tick_ready): Handle sched_fusion.
>	* common.opt (flag_schedule_fusion): New.
>	* tree-pass.h (make_pass_sched_fusion): New.
>	* target.def (fusion_priority): New.
>	* doc/tm.texi.in (TARGET_SCHED_FUSION_PRIORITY): New.
>	* doc/tm.texi: Regenerated.
>	* doc/invoke.texi (-fschedule-fusion): New.
>
> gcc/testsuite/ChangeLog
> 2014-09-30  Bin Cheng  <bin.cheng@arm.com>
>
>	* gcc.target/arm/ldrd-strd-pair-1.c: New test.
>	* gcc.target/arm/vfp-1.c: Improve scanning string.
>
>
> sched-fusion-20140929.txt
>
>
> Index: gcc/doc/tm.texi
> ===================================================================
> --- gcc/doc/tm.texi	(revision 215662)
> +++ gcc/doc/tm.texi	(working copy)
> @@ -6677,6 +6677,29 @@ This hook is called by tree reassociator to determ
>  parallelism required in output calculations chain.
>  @end deftypefn
>
> +@deftypefn {Target Hook} void TARGET_SCHED_FUSION_PRIORITY (rtx_insn *@var{insn}, int @var{max_pri}, int *@var{fusion_pri}, int *@var{pri})
> +This hook is called by scheduling fusion pass.  It calculates fusion
> +priorities for each instruction passed in by parameter.  The priorities
> +are returned via pointer parameters.
> +
> +@var{insn} is the instruction whose priorities need to be calculated.
> +@var{max_pri} is the maximum priority can be returned in any cases.
> +@var{fusion_pri} is the pointer parameter through which @var{insn}'s
> +fusion priority should be calculated and returned.
> +@var{pri} is the pointer parameter through which @var{insn}'s priority
> +should be calculated and returned.
> +
> +Same @var{fusion_pri} should be returned for instructions which should
> +be scheduled together.  Different @var{pri} should be returned for
> +instructions with same @var{fusion_pri}.  All instructions will be
> +scheduled according to the two priorities.  @var{fusion_pri} is the major
> +sort key, @var{pri} is the minor sort key.  All priorities calculated
> +should be between 0 (exclusive) and @var{max_pri} (inclusive).  To avoid
> +false dependencies, @var{fusion_pri} of instructions which need to be
> +scheduled together should be smaller than @var{fusion_pri} of irrelevant
> +instructions.
> +@end deftypefn
> +
>  @node Sections
>  @section Dividing the Output into Sections (Texts, Data, @dots{})
>  @c the above section title is WAY too long.  maybe cut the part between

So I think we need to clarify that this hook is useful when fusing two related insns which don't have a data dependency.  Somehow we need to describe that the insns to be fused should have the same (or +-1) priority.  It may be useful to use code from the ARM implementation to show how to use this to pair up loads as an example.  It may also be useful to refer to the code which reorders insns in the ready queue for cases where we want to fuse two truly independent insns.
> +  if (sched_fusion)
> +    {
> +      /* The instruction that has the same fusion priority as the last
> +	 instruction is the instruction we picked next.  If that is not
> +	 the case, we sort ready list firstly by fusion priority, then
> +	 by priority, and at last by INSN_LUID.  */
> +      int a = INSN_FUSION_PRIORITY (tmp);
> +      int b = INSN_FUSION_PRIORITY (tmp2);
> +      int last = -1;
> +
> +      if (last_nondebug_scheduled_insn
> +	  && !NOTE_P (last_nondebug_scheduled_insn)
> +	  && BLOCK_FOR_INSN (tmp)
> +	     == BLOCK_FOR_INSN (last_nondebug_scheduled_insn))
> +	last = INSN_FUSION_PRIORITY (last_nondebug_scheduled_insn);
> +
> +      if (a != last && b != last)
> +	{
> +	  if (a == b)
> +	    {
> +	      a = INSN_PRIORITY (tmp);
> +	      b = INSN_PRIORITY (tmp2);
> +	    }
> +	  if (a != b)
> +	    return rfs_result (RFS_FUSION, b - a);
> +	  else
> +	    return rfs_result (RFS_FUSION, INSN_LUID (tmp) - INSN_LUID (tmp2));

rfs_result's signature has changed, I think you need to pass in tmp & tmp2.  You'll need to make that trivial update for all the callers.

Can you make those changes and repost so that I can look at the docs (I think the implementation is fine and won't need further review).

jeff
Index: gcc/timevar.def =================================================================== --- gcc/timevar.def (revision 215662) +++ gcc/timevar.def (working copy) @@ -244,6 +244,7 @@ DEFTIMEVAR (TV_IFCVT2 , "if-conversion 2") DEFTIMEVAR (TV_COMBINE_STACK_ADJUST , "combine stack adjustments") DEFTIMEVAR (TV_PEEPHOLE2 , "peephole 2") DEFTIMEVAR (TV_RENAME_REGISTERS , "rename registers") +DEFTIMEVAR (TV_SCHED_FUSION , "scheduling fusion") DEFTIMEVAR (TV_CPROP_REGISTERS , "hard reg cprop") DEFTIMEVAR (TV_SCHED2 , "scheduling 2") DEFTIMEVAR (TV_MACH_DEP , "machine dep reorg") Index: gcc/doc/tm.texi =================================================================== --- gcc/doc/tm.texi (revision 215662) +++ gcc/doc/tm.texi (working copy) @@ -6677,6 +6677,29 @@ This hook is called by tree reassociator to determ parallelism required in output calculations chain. @end deftypefn +@deftypefn {Target Hook} void TARGET_SCHED_FUSION_PRIORITY (rtx_insn *@var{insn}, int @var{max_pri}, int *@var{fusion_pri}, int *@var{pri}) +This hook is called by scheduling fusion pass. It calculates fusion +priorities for each instruction passed in by parameter. The priorities +are returned via pointer parameters. + +@var{insn} is the instruction whose priorities need to be calculated. +@var{max_pri} is the maximum priority can be returned in any cases. +@var{fusion_pri} is the pointer parameter through which @var{insn}'s +fusion priority should be calculated and returned. +@var{pri} is the pointer parameter through which @var{insn}'s priority +should be calculated and returned. + +Same @var{fusion_pri} should be returned for instructions which should +be scheduled together. Different @var{pri} should be returned for +instructions with same @var{fusion_pri}. All instructions will be +scheduled according to the two priorities. @var{fusion_pri} is the major +sort key, @var{pri} is the minor sort key. All priorities calculated +should be between 0 (exclusive) and @var{max_pri} (inclusive). 
To avoid +false dependencies, @var{fusion_pri} of instructions which need to be +scheduled together should be smaller than @var{fusion_pri} of irrelevant +instructions. +@end deftypefn + @node Sections @section Dividing the Output into Sections (Texts, Data, @dots{}) @c the above section title is WAY too long. maybe cut the part between Index: gcc/doc/tm.texi.in =================================================================== --- gcc/doc/tm.texi.in (revision 215662) +++ gcc/doc/tm.texi.in (working copy) @@ -4817,6 +4817,8 @@ them: try the first ones in this list first. @hook TARGET_SCHED_REASSOCIATION_WIDTH +@hook TARGET_SCHED_FUSION_PRIORITY + @node Sections @section Dividing the Output into Sections (Texts, Data, @dots{}) @c the above section title is WAY too long. maybe cut the part between Index: gcc/doc/invoke.texi =================================================================== --- gcc/doc/invoke.texi (revision 215662) +++ gcc/doc/invoke.texi (working copy) @@ -404,7 +404,7 @@ Objective-C and Objective-C++ Dialects}. -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol -fprofile-generate=@var{path} @gol -fprofile-use -fprofile-use=@var{path} -fprofile-values -fprofile-reorder-functions @gol --freciprocal-math -free -frename-registers -freorder-blocks @gol +-freciprocal-math -free -frename-registers -fschedule-fusion -freorder-blocks @gol -freorder-blocks-and-partition -freorder-functions @gol -frerun-cse-after-loop -freschedule-modulo-scheduled-loops @gol -frounding-math -fsched2-use-superblocks -fsched-pressure @gol @@ -9459,6 +9459,14 @@ a ``home register''. Enabled by default with @option{-funroll-loops} and @option{-fpeel-loops}. +@item -fschedule-fusion +@opindex fschedule-fusion +Performs a target dependent pass over the instruction stream to schedule +instructions of same type together because target machine can execute them +more efficiently if they are adjacent to each other in the instruction flow. 
+ +Enabled at levels @option{-O2}, @option{-O3}, @option{-Os}. + @item -ftracer @opindex ftracer Perform tail duplication to enlarge superblock size. This transformation Index: gcc/common.opt =================================================================== --- gcc/common.opt (revision 215662) +++ gcc/common.opt (working copy) @@ -1806,6 +1806,10 @@ frename-registers Common Report Var(flag_rename_registers) Init(2) Optimization Perform a register renaming optimization pass +fschedule-fusion +Common Report Var(flag_schedule_fusion) Init(2) Optimization +Perform a target dependent instruction fusion optimization pass + freorder-blocks Common Report Var(flag_reorder_blocks) Optimization Reorder basic blocks to improve code placement Index: gcc/haifa-sched.c =================================================================== --- gcc/haifa-sched.c (revision 215662) +++ gcc/haifa-sched.c (working copy) @@ -1370,6 +1370,9 @@ insn_cost (rtx_insn *insn) { int cost; + if (sched_fusion) + return 0; + if (sel_sched_p ()) { if (recog_memoized (insn) < 0) @@ -1581,6 +1584,8 @@ dep_list_size (rtx insn, sd_list_types_def list) return nodbgcount; } + +bool sched_fusion; /* Compute the priority number for INSN. */ static int @@ -1596,7 +1601,15 @@ priority (rtx_insn *insn) { int this_priority = -1; - if (dep_list_size (insn, SD_LIST_FORW) == 0) + if (sched_fusion) + { + int this_fusion_priority; + + targetm.sched.fusion_priority (insn, FUSION_MAX_PRIORITY, + &this_fusion_priority, &this_priority); + INSN_FUSION_PRIORITY (insn) = this_fusion_priority; + } + else if (dep_list_size (insn, SD_LIST_FORW) == 0) /* ??? We should set INSN_PRIORITY to insn_cost when and insn has some forward deps but all of them are ignored by contributes_to_priority hook. 
At the moment we set priority of @@ -2527,7 +2540,7 @@ enum rfs_decision { RFS_SCHED_GROUP, RFS_PRESSURE_DELAY, RFS_PRESSURE_TICK, RFS_FEEDS_BACKTRACK_INSN, RFS_PRIORITY, RFS_SPECULATION, RFS_SCHED_RANK, RFS_LAST_INSN, RFS_PRESSURE_INDEX, - RFS_DEP_COUNT, RFS_TIE, RFS_N }; + RFS_DEP_COUNT, RFS_TIE, RFS_FUSION, RFS_N }; /* Corresponding strings for print outs. */ static const char *rfs_str[RFS_N] = { @@ -2535,7 +2548,7 @@ static const char *rfs_str[RFS_N] = { "RFS_SCHED_GROUP", "RFS_PRESSURE_DELAY", "RFS_PRESSURE_TICK", "RFS_FEEDS_BACKTRACK_INSN", "RFS_PRIORITY", "RFS_SPECULATION", "RFS_SCHED_RANK", "RFS_LAST_INSN", "RFS_PRESSURE_INDEX", - "RFS_DEP_COUNT", "RFS_TIE" }; + "RFS_DEP_COUNT", "RFS_TIE", "RFS_FUSION" }; /* Statistical breakdown of rank_for_schedule decisions. */ typedef struct { unsigned stats[RFS_N]; } rank_for_schedule_stats_t; @@ -2596,6 +2609,53 @@ rank_for_schedule (const void *x, const void *y) /* Make sure that priority of TMP and TMP2 are initialized. */ gcc_assert (INSN_PRIORITY_KNOWN (tmp) && INSN_PRIORITY_KNOWN (tmp2)); + if (sched_fusion) + { + /* The instruction that has the same fusion priority as the last + instruction is the instruction we picked next. If that is not + the case, we sort ready list firstly by fusion priority, then + by priority, and at last by INSN_LUID. 
*/ + int a = INSN_FUSION_PRIORITY (tmp); + int b = INSN_FUSION_PRIORITY (tmp2); + int last = -1; + + if (last_nondebug_scheduled_insn + && !NOTE_P (last_nondebug_scheduled_insn) + && BLOCK_FOR_INSN (tmp) + == BLOCK_FOR_INSN (last_nondebug_scheduled_insn)) + last = INSN_FUSION_PRIORITY (last_nondebug_scheduled_insn); + + if (a != last && b != last) + { + if (a == b) + { + a = INSN_PRIORITY (tmp); + b = INSN_PRIORITY (tmp2); + } + if (a != b) + return rfs_result (RFS_FUSION, b - a); + else + return rfs_result (RFS_FUSION, INSN_LUID (tmp) - INSN_LUID (tmp2)); + } + else if (a == b) + { + gcc_assert (last_nondebug_scheduled_insn + && !NOTE_P (last_nondebug_scheduled_insn)); + last = INSN_PRIORITY (last_nondebug_scheduled_insn); + + a = abs (INSN_PRIORITY (tmp) - last); + b = abs (INSN_PRIORITY (tmp2) - last); + if (a != b) + return rfs_result (RFS_FUSION, a - b); + else + return rfs_result (RFS_FUSION, INSN_LUID (tmp) - INSN_LUID (tmp2)); + } + else if (a == last) + return rfs_result (RFS_FUSION, -1); + else + return rfs_result (RFS_FUSION, 1); + } + if (sched_pressure != SCHED_PRESSURE_NONE) { /* Prefer insn whose scheduling results in the smallest register @@ -3914,8 +3974,8 @@ schedule_insn (rtx_insn *insn) gcc_assert (INSN_TICK (insn) >= MIN_TICK); if (INSN_TICK (insn) > clock_var) /* INSN has been prematurely moved from the queue to the ready list. - This is possible only if following flag is set. */ - gcc_assert (flag_sched_stalled_insns); + This is possible only if following flags are set. */ + gcc_assert (flag_sched_stalled_insns || sched_fusion); /* ??? Probably, if INSN is scheduled prematurely, we should leave INSN_TICK untouched. This is a machine-dependent issue, actually. 
*/ @@ -5413,6 +5473,9 @@ max_issue (struct ready_list *ready, int privilege struct choice_entry *top; rtx_insn *insn; + if (sched_fusion) + return 0; + n_ready = ready->n_ready; gcc_assert (dfa_lookahead >= 1 && privileged_n >= 0 && privileged_n <= n_ready); @@ -5768,6 +5831,9 @@ prune_ready_list (state_t temp_state, bool first_c bool sched_group_found = false; int min_cost_group = 1; + if (sched_fusion) + return; + for (i = 0; i < ready.n_ready; i++) { rtx_insn *insn = ready_element (&ready, i); @@ -6375,7 +6441,7 @@ schedule_block (basic_block *target_bb, state_t in { memcpy (temp_state, curr_state, dfa_state_size); cost = state_transition (curr_state, insn); - if (sched_pressure != SCHED_PRESSURE_WEIGHTED) + if (sched_pressure != SCHED_PRESSURE_WEIGHTED && !sched_fusion) gcc_assert (cost < 0); if (memcmp (temp_state, curr_state, dfa_state_size) != 0) cycle_issued_insns++; @@ -7195,7 +7261,7 @@ fix_tick_ready (rtx_insn *next) INSN_TICK (next) = tick; delay = tick - clock_var; - if (delay <= 0 || sched_pressure != SCHED_PRESSURE_NONE) + if (delay <= 0 || sched_pressure != SCHED_PRESSURE_NONE || sched_fusion) delay = QUEUE_READY; change_queue_index (next, delay); Index: gcc/sched-int.h =================================================================== --- gcc/sched-int.h (revision 215662) +++ gcc/sched-int.h (working copy) @@ -806,6 +806,9 @@ struct _haifa_insn_data /* A priority for each insn. */ int priority; + /* The fusion priority for each insn. */ + int fusion_priority; + /* The minimum clock tick at which the insn becomes ready. This is used to note timing constraints for the insns in the pending list. */ int tick; @@ -901,6 +904,7 @@ extern vec<haifa_insn_data_def> h_i_d; /* Accessor macros for h_i_d. There are more in haifa-sched.c and sched-rgn.c. 
    */
 #define INSN_PRIORITY(INSN) (HID (INSN)->priority)
+#define INSN_FUSION_PRIORITY(INSN) (HID (INSN)->fusion_priority)
 #define INSN_REG_PRESSURE(INSN) (HID (INSN)->reg_pressure)
 #define INSN_MAX_REG_PRESSURE(INSN) (HID (INSN)->max_reg_pressure)
 #define INSN_REG_USE_LIST(INSN) (HID (INSN)->reg_use_list)
@@ -1618,6 +1622,10 @@ extern void sd_copy_back_deps (rtx_insn *, rtx_ins
 extern void sd_delete_dep (sd_iterator_def);
 extern void sd_debug_lists (rtx, sd_list_types_def);

+/* Macros and declarations for scheduling fusion.  */
+#define FUSION_MAX_PRIORITY (INT_MAX)
+extern bool sched_fusion;
+
 #endif /* INSN_SCHEDULING */

 #endif /* GCC_SCHED_INT_H */
Index: gcc/sched-rgn.c
===================================================================
--- gcc/sched-rgn.c	(revision 215662)
+++ gcc/sched-rgn.c	(working copy)
@@ -3647,6 +3647,17 @@ rest_of_handle_sched2 (void)
   return 0;
 }

+static unsigned int
+rest_of_handle_sched_fusion (void)
+{
+#ifdef INSN_SCHEDULING
+  sched_fusion = true;
+  schedule_insns ();
+  sched_fusion = false;
+#endif
+  return 0;
+}
+
 namespace {

 const pass_data pass_data_live_range_shrinkage =
@@ -3789,3 +3800,53 @@ make_pass_sched2 (gcc::context *ctxt)
 {
   return new pass_sched2 (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_sched_fusion =
+{
+  RTL_PASS, /* type */
+  "sched_fusion", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_SCHED_FUSION, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_sched_fusion : public rtl_opt_pass
+{
+public:
+  pass_sched_fusion (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_sched_fusion, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *);
+  virtual unsigned int execute (function *)
+    {
+      return rest_of_handle_sched_fusion ();
+    }
+
+}; // class pass_sched_fusion
+
+bool
+pass_sched_fusion::gate (function *)
+{
+#ifdef INSN_SCHEDULING
+  return (optimize >
0 && flag_schedule_fusion
+	  && targetm.sched.fusion_priority != NULL);
+#else
+  return 0;
+#endif
+}
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_sched_fusion (gcc::context *ctxt)
+{
+  return new pass_sched_fusion (ctxt);
+}
Index: gcc/target.def
===================================================================
--- gcc/target.def	(revision 215662)
+++ gcc/target.def	(working copy)
@@ -1507,6 +1507,32 @@ parallelism required in output calculations chain.
 int, (unsigned int opc, enum machine_mode mode),
 hook_int_uint_mode_1)

+/* The following member value is a function that returns the fusion
+   priority of each instruction via pointer parameters.  */
+DEFHOOK
+(fusion_priority,
+"This hook is called by the scheduling fusion pass.  It calculates fusion\n\
+priorities for each instruction passed in by parameter, and returns them\n\
+via pointer parameters.\n\
+\n\
+@var{insn} is the instruction whose priorities need to be calculated.\n\
+@var{max_pri} is the maximum priority that can be returned in any case.\n\
+@var{fusion_pri} is the pointer parameter through which @var{insn}'s\n\
+fusion priority should be returned.\n\
+@var{pri} is the pointer parameter through which @var{insn}'s priority\n\
+should be returned.\n\
+\n\
+The same @var{fusion_pri} should be returned for instructions that should\n\
+be scheduled together.  Different @var{pri} values should be returned for\n\
+instructions with the same @var{fusion_pri}.  All instructions will be\n\
+scheduled according to the two priorities.  @var{fusion_pri} is the major\n\
+sort key, @var{pri} is the minor sort key.  All calculated priorities\n\
+should be between 0 (exclusive) and @var{max_pri} (inclusive).
To avoid\n\
+false dependencies, the @var{fusion_pri} of instructions that need to be\n\
+scheduled together should be smaller than the @var{fusion_pri} of\n\
+irrelevant instructions.",
+void, (rtx_insn *insn, int max_pri, int *fusion_pri, int *pri), NULL)
+
 HOOK_VECTOR_END (sched)

 /* Functions relating to OpenMP and Cilk Plus SIMD clones.  */
Index: gcc/testsuite/gcc.target/arm/vfp-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/vfp-1.c	(revision 215662)
+++ gcc/testsuite/gcc.target/arm/vfp-1.c	(working copy)
@@ -126,7 +126,7 @@ void test_convert () {
 }

 void test_ldst (float f[], double d[]) {
-  /* { dg-final { scan-assembler "vldr.32.+ \\\[r0, #1020\\\]" } } */
+  /* { dg-final { scan-assembler "vldr.32.+ \\\[r0, #-?\[0-9\]+\\\]" } } */
   /* { dg-final { scan-assembler "vldr.32.+ \\\[r\[0-9\], #-1020\\\]" { target { arm32 && { ! arm_thumb2_ok } } } } } */
   /* { dg-final { scan-assembler "add.+ r0, #1024" } } */
   /* { dg-final { scan-assembler "vstr.32.+ \\\[r\[0-9\]\\\]\n" } } */
Index: gcc/testsuite/gcc.target/arm/ldrd-strd-pair-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/ldrd-strd-pair-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/ldrd-strd-pair-1.c	(revision 0)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_prefer_ldrd_strd } */
+/* { dg-options "-O2" } */
+
+struct
+{
+  int x;
+  int y;
+  char c;
+  int d;
+} a;
+
+int foo (int x, int y)
+{
+  int c;
+  a.x = x;
+  c = a.x;
+  a.d = c;
+  a.y = y;
+
+  return 0;
+}
+/* { dg-final { scan-assembler "strd\t" } } */
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 215662)
+++ gcc/config/arm/arm.c	(working copy)
@@ -291,6 +291,8 @@ static unsigned arm_add_stmt_cost (void *data, int
 static void arm_canonicalize_comparison (int *code, rtx *op0,
					  rtx *op1, bool op0_preserve_value);
 static unsigned
   HOST_WIDE_INT arm_asan_shadow_offset (void);
+
+static void arm_sched_fusion_priority (rtx_insn *, int, int *, int *);

 /* Table of machine attributes.  */
 static const struct attribute_spec arm_attribute_table[] =
@@ -688,6 +690,9 @@ static const struct attribute_spec arm_attribute_t
 #undef TARGET_CALL_FUSAGE_CONTAINS_NON_CALLEE_CLOBBERS
 #define TARGET_CALL_FUSAGE_CONTAINS_NON_CALLEE_CLOBBERS true

+#undef TARGET_SCHED_FUSION_PRIORITY
+#define TARGET_SCHED_FUSION_PRIORITY arm_sched_fusion_priority
+
 struct gcc_target targetm = TARGET_INITIALIZER;

 /* Obstack for minipool constant handling.  */
@@ -3116,6 +3121,10 @@ arm_option_override (void)
   if (target_slow_flash_data)
     arm_disable_literal_pool = true;

+  /* Only enable scheduling fusion for armv7 processors by default.  */
+  if (flag_schedule_fusion == 2 && !arm_arch7)
+    flag_schedule_fusion = 0;
+
   /* Register global variables with the garbage collector.  */
   arm_add_gc_roots ();
 }
@@ -32289,4 +32298,124 @@ arm_is_constant_pool_ref (rtx x)
	  && CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)));
 }

+/* If MEM is in the form of [base+offset], extract the two parts of the
+   address and store them in BASE and OFFSET; otherwise return false
+   after clearing BASE and OFFSET.  */
+
+static bool
+extract_base_offset_in_addr (rtx mem, rtx *base, rtx *offset)
+{
+  rtx addr;
+
+  gcc_assert (MEM_P (mem));
+
+  addr = XEXP (mem, 0);
+
+  /* Strip off const from addresses like (const (addr)).  */
+  if (GET_CODE (addr) == CONST)
+    addr = XEXP (addr, 0);
+
+  if (GET_CODE (addr) == REG)
+    {
+      *base = addr;
+      *offset = const0_rtx;
+      return true;
+    }
+
+  if (GET_CODE (addr) == PLUS
+      && GET_CODE (XEXP (addr, 0)) == REG
+      && CONST_INT_P (XEXP (addr, 1)))
+    {
+      *base = XEXP (addr, 0);
+      *offset = XEXP (addr, 1);
+      return true;
+    }
+
+  *base = NULL_RTX;
+  *offset = NULL_RTX;

+  return false;
+}
+
+/* If INSN is a load or store of an address in the form of [base+offset],
+   extract the two parts and store them in BASE and OFFSET.  IS_LOAD is set
+   to TRUE if it's a load.
+   Return TRUE if INSN is such an instruction,
+   otherwise return FALSE.  */
+
+static bool
+fusion_load_store (rtx_insn *insn, rtx *base, rtx *offset, bool *is_load)
+{
+  rtx x, dest, src;
+
+  gcc_assert (INSN_P (insn));
+  x = PATTERN (insn);
+  if (GET_CODE (x) != SET)
+    return false;
+
+  src = SET_SRC (x);
+  dest = SET_DEST (x);
+  if (GET_CODE (src) == REG && GET_CODE (dest) == MEM)
+    {
+      *is_load = false;
+      extract_base_offset_in_addr (dest, base, offset);
+    }
+  else if (GET_CODE (src) == MEM && GET_CODE (dest) == REG)
+    {
+      *is_load = true;
+      extract_base_offset_in_addr (src, base, offset);
+    }
+  else
+    return false;
+
+  return (*base != NULL_RTX && *offset != NULL_RTX);
+}
+
+/* Implement the TARGET_SCHED_FUSION_PRIORITY hook.
+
+   Currently we only support fusing ldr and str instructions, so FUSION_PRI
+   and PRI are only calculated for these instructions.  For other
+   instructions, FUSION_PRI and PRI are simply set to MAX_PRI.  In the
+   future, other kinds of instruction fusion can be supported by returning
+   different priorities.
+
+   It's important that irrelevant instructions get the largest FUSION_PRI.  */
+
+static void
+arm_sched_fusion_priority (rtx_insn *insn, int max_pri,
+			   int *fusion_pri, int *pri)
+{
+  int tmp, off_val;
+  bool is_load;
+  rtx base, offset;
+
+  gcc_assert (INSN_P (insn));
+
+  tmp = max_pri - 1;
+  if (!fusion_load_store (insn, &base, &offset, &is_load))
+    {
+      *pri = tmp;
+      *fusion_pri = tmp;
+      return;
+    }
+
+  /* Load goes first.  */
+  if (is_load)
+    *fusion_pri = tmp - 1;
+  else
+    *fusion_pri = tmp - 2;
+
+  tmp /= 2;
+
+  /* INSN with smaller base register goes first.  */
+  tmp -= ((REGNO (base) & 0xff) << 20);
+
+  /* INSN with smaller offset goes first.
+     */
+  off_val = (int)(INTVAL (offset));
+  if (off_val >= 0)
+    tmp -= (off_val & 0xfffff);
+  else
+    tmp += (off_val & 0xfffff);
+
+  *pri = tmp;
+  return;
+}
 #include "gt-arm.h"
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	(revision 215662)
+++ gcc/passes.def	(working copy)
@@ -400,6 +400,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_stack_adjustments);
       NEXT_PASS (pass_jump2);
       NEXT_PASS (pass_duplicate_computed_gotos);
+      NEXT_PASS (pass_sched_fusion);
       NEXT_PASS (pass_peephole2);
       NEXT_PASS (pass_if_after_reload);
       NEXT_PASS (pass_regrename);
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	(revision 215662)
+++ gcc/tree-pass.h	(working copy)
@@ -542,6 +542,7 @@ extern rtl_opt_pass *make_pass_branch_target_load_
 extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_stack_adjustments (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_sched_fusion (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_peephole2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_if_after_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_regrename (gcc::context *ctxt);