From patchwork Fri Dec 23 11:46:39 2011
X-Patchwork-Submitter: Richard Sandiford
X-Patchwork-Id: 132995
From: Richard Sandiford
To: gcc-patches@gcc.gnu.org
Cc: vmakarov@redhat.com
Subject: RFC: An alternative -fsched-pressure implementation
Date: Fri, 23 Dec 2011 11:46:39 +0000

So it looks like two pieces of work related to scheduling and register
pressure are being posted close together.  This one is an RFC for a less
aggressive form of -fsched-pressure.  I think it should complement rather
than compete with Bernd's IRA patch.  It seems like a good idea to take
register pressure into account during the first scheduling pass, where we
can still easily look at things like instruction latencies and pipeline
utilisation.  Better rematerialisation in the register allocator would
obviously be a good thing too, though.

This patch started when we (Linaro) saw a big drop in performance from
vectorising an RGB to YIQ filter on ARM.  The first scheduling pass was
overly aggressive in creating a "wide" schedule, and caused the
newly-vectorised loop to contain lots of spills.  The loop grew so big
that it even had a constant pool in the middle of it.

-fsched-pressure did a very good job on this loop, creating far fewer
spills and consequently avoiding the constant pool.  However, it seemed
to make several other cases significantly worse.  The idea was therefore
to try to come up with a form of -fsched-pressure that could be turned
on for ARM by default.

Current -fsched-pressure works by assigning an "excess (pressure) cost
change" to each instruction; here I'll write that as ECC(X).
-fsched-pressure also changes the way that the main list scheduler
handles stalls due to data dependencies.
If an instruction would stall for N cycles, the scheduler would normally
add it to the "now+N" queue, then add it to the ready queue after N cycles.
With -fsched-pressure, it instead adds the instruction to the ready queue
immediately, while still recording that the instruction would require
N stalls.  I'll write the number of stalls on X as delay(X).

This arrangement allows the scheduler to choose between increasing register
pressure and introducing a deliberate stall.  Instructions are ranked by:

  (a) lowest ECC(X) + delay(X)
  (b) lowest delay(X)
  (c) normal list-scheduler ranking (based on things like INSN_PRIORITY)

Note that since delay(X) is measured in cycles, ECC(X) is effectively
measured in cycles too.

Several things seemed to be causing the degradations we were seeing
with -fsched-pressure:

(1) The -fsched-pressure schedule is based purely on instruction latencies
    and issue rate; it doesn't take the DFA into account.  This means that
    we attempt to "dual issue" things like vector operations, loads and
    stores on Cortex A8 and A9.  In the examples I looked at, these sorts
    of inaccuracy seemed to accumulate, so that the value of delay(X)
    became based on slightly unrealistic cycle times.

    Note that this also affects code that doesn't have any pressure
    problems; it isn't limited to code that does.

    This may simply be historical.  It became much easier to use the DFA
    here after Bernd's introduction of prune_ready_list, but the original
    -fsched-pressure predates that.

(2) We calculate ECC(X) by walking the unscheduled part of the block in
    its original order, then recording the pressure at each instruction.
    This seemed to make ECC(X) quite sensitive to that original order.
    I saw blocks that started out naturally "narrow" (not much ILP,
    e.g. from unrolled loops) and others that started naturally "wide"
    (a lot of ILP, such as in the libav h264 code), and walking the block
    in order meant that the two styles would be handled differently.

(3) When calculating the pressure of the original block (as described
    in (2)), we ignore the deaths of registers that are used by more than
    one unscheduled instruction.  This tended to hurt long(ish) loops in
    things like filters, where the same value is often used as an input
    to two calculations.  The effect was that instructions towards the
    end of the block would appear to have very high pressure.  This in
    turn made the algorithm very conservative; it wouldn't promote
    instructions from later in the block because those instructions
    seemed to have a prohibitively large cost.

    I asked Vlad about this, and he confirmed that it was a deliberate
    decision.  He'd tried honouring REG_DEAD notes instead, but it
    produced worse results on x86.  I'll return to this at the end.

(4) ECC(X) is based on the pressure over and above
    ira_available_class_regs (the number of allocatable registers in a
    given class).  ARM has 14 allocatable GENERAL_REGS: 16 minus the
    stack pointer and program counter.  So if 14 integer variables are
    live across a loop but not referenced within it, we end up scheduling
    that loop in a context of permanent pressure.  Pressure becomes the
    overriding concern, and we don't get much ILP.
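To make the ranking above a bit more concrete, here is a rough sketch of
the comparison order.  The struct and function are made up purely for
illustration; they are not the actual GCC data structures or the code in
the patch:

#include <stdbool.h>

/* Hypothetical per-instruction summary, just for illustration.  */
struct candidate
{
  int ecc;       /* ECC(X): excess pressure cost change, in cycles.  */
  int delay;     /* delay(X): stall cycles if X were issued now.  */
  int priority;  /* normal list-scheduler priority.  */
};

/* Return true if A should be scheduled in preference to B,
   following (a)-(c) above.  */
static bool
better_candidate_p (const struct candidate *a, const struct candidate *b)
{
  /* (a) lowest ECC(X) + delay(X).  */
  if (a->ecc + a->delay != b->ecc + b->delay)
    return a->ecc + a->delay < b->ecc + b->delay;

  /* (b) lowest delay(X).  */
  if (a->delay != b->delay)
    return a->delay < b->delay;

  /* (c) normal list-scheduler ranking: higher priority wins.  */
  return a->priority > b->priority;
}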
I suppose there are at least two ways of viewing problem (4):

(4a) We're giving an equal weight to:

       (Ra) registers that are live across a loop but not used within it
            (and which should therefore require no spill code in the loop
            itself)

       (Rb) registers that hold loop invariants (and so should only
            require loads in the loop itself)

       (Rc) registers that are used and set within the loop (and so would
            require both loads and stores in the loop itself)

     We effectively treat everything as (Rc).

(4b) We always work to ira_available_class_regs, rather than to an
     estimate of what maximum pressure is realistically achievable.
     If a block has something like:

       A: R2 = *R1
       ...
       B: R3 = *(R1 + 4)
       ...
       C: R1 = R1 + 8
       ...
       D: R4 = R2 + R3
       ...

     then it can't help but have three registers live at once.

The first thing I tried was to change just (4a) in isolation.  Taking (Ra)
out of the pressure estimation produced some benefits, but not enough.
I then tried a "staging" of costs, so that:

  (Ra) had the cost of a load and store in the outer loop (if any)
  (Rb) had the cost of a load in the inner loop (or the cost of (Ra)
       if larger)
  (Rc) had the cost of a load and store in the inner loop ( " " )

These costs were still based on MEMORY_MOVE_COST, just like the current
ones are.  However, MEMORY_MOVE_COST is defined to be relative to
REGISTER_MOVE_COST, and REGISTER_MOVE_COST is defined to be 2 for a
"simple" move.  Since ECC(X) is effectively a cycle count, it seemed
better to divide the cost by 2.

This again produced some benefit, but not enough.  I think part of the
problem is that ARM's MEMORY_MOVE_COST is very high: it says that both
loads and stores cost the equivalent of 5 register moves, making the cost
of a full spill equivalent to 10 register moves.  ARM also makes these
costs independent of the mode, so that a single SImode load and store has
the same cost as a 4-quad-vector load and store.  This might be something
that's worth looking at, especially since the same costs influence
spilling decisions in the register allocator.  However, even reducing the
cost of a load to 4 moves and a store to 2 moves didn't bring enough
benefit (although it was better than 5 and 5).

The division of costs above is also a little artificial.  Because we
don't have an integrated register allocator and scheduler, there's not
really any telling how many times the block will need to load or store
a value.  An invariant might be used many times, and in such a way that
several loads are needed, while another variable might be involved in a
single read-modify-write sequence.  The true cost also depends on how
easily the spill instructions fit into the surrounding code.

On a different tack, I tried to tackle (1) by taking the DFA into account.
If delay(X) is zero, but X cannot issue due to resource hazards, I set
delay(X) to 1.  Again this wasn't enough in itself, although it does still
form an important part of the "final" proposed version.

I tried the algorithms from a few papers, but they too tended to be
overly conservative or to suffer from degenerate behaviour in filters.

In the end I tried an ad-hoc approach in an attempt to do something about
(2), (3) and (4b).  The idea was to construct a preliminary "model"
schedule in which the only objective is to keep register pressure to a
minimum.  This schedule ignores pipeline characteristics, latencies, and
the number of available registers.  The maximum pressure seen in this
initial model schedule (MP) is then the benchmark for ECC(X).
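As a rough illustration of the model-schedule idea, here is a toy sketch
with invented names.  It ignores dependence constraints and register
classes, which the real code obviously has to respect; the point is just
that MP comes from a schedule that cares about nothing except pressure:

#include <stdbool.h>

/* Toy representation of one instruction's effect on pressure.  */
struct model_insn
{
  int pressure_delta;  /* net change in live registers if issued.  */
  bool scheduled;
};

/* Greedily order the block so as to keep register pressure as low as
   possible, ignoring latencies and the pipeline, and return the maximum
   pressure (MP) seen along the way.  */
static int
model_max_pressure (struct model_insn *insns, int n)
{
  int pressure = 0, max_pressure = 0;

  for (int count = 0; count < n; count++)
    {
      int best = -1;

      /* Pick the unscheduled instruction with the smallest pressure
         increase (or biggest decrease).  */
      for (int i = 0; i < n; i++)
        if (!insns[i].scheduled
            && (best < 0
                || insns[i].pressure_delta < insns[best].pressure_delta))
          best = i;

      insns[best].scheduled = true;
      pressure += insns[best].pressure_delta;
      if (pressure > max_pressure)
        max_pressure = pressure;
    }
  return max_pressure;
}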
During the main scheduling, an instruction X that occurs at or before the
next point of maximum pressure in the model schedule is measured based on
the current register pressure.  If X doesn't increase the current pressure
beyond the current maximum, its ECC(X) is zero, otherwise ECC(X) is the
cost of going from MP to the new maximum.  The idea is that the final net
pressure of scheduling a given set of instructions is going to be the same
regardless of the order; we simply want to keep the intermediate pressure
under control.  An ECC(X) of zero usually[*] means that scheduling X next
won't send the rest of the sequence beyond the current maximum pressure.

  [*] but not always.  There's more about this in the patch comments.

If an instruction X occurs _after_ the next point of maximum pressure,
X is measured based on that maximum pressure.  If the current maximum
pressure is MP', and X increases pressure by dP, ECC(X) is the cost of
going from MP to MP' + dP.

Of course, this all depends on how good a value MP is, and therefore on
how good the model schedule is.  I tried a few variations before settling
on the one in the patch (which I hope makes conceptual sense).

I initially stayed with the idea above about assigning different costs to
(Ra), (Rb) and (Rc).  This produced some good results, but was still a
little too conservative in general, in that other tests were still worse
with -fsched-pressure than without.  I described some of the problems
with these costs above.  Another is that if an instruction X has a spill
cost of 6, say, then:

    ECC(X) + delay(X)

will only allow X to be scheduled if the next instruction without a spill
cost has a delay of 6 cycles or more.  This is overly harsh, especially
seeing as few ARM instructions have such a high latency.  The benefit of
spilling is often to avoid a series of short (e.g. single-cycle) stalls,
rather than to avoid a single long one.

I then adjusted positive ECC(X) values based on the priority of X relative
to the highest-priority zero-cost instruction.  This was better, but a DES
filter in particular still suffered from the "lots of short stalls"
problem.

Then, as an experiment, I tried ignoring MEMORY_MOVE_COST altogether and
simply treating each extra spill as having a cost of 1.  Somewhat to my
surprise, there was only one test in which some improvement was lost;
that test was now only slightly better than without -fsched-pressure.
But the fixed cost of 1 per spill achieved the goal of being as good as
or better than -fno-sched-pressure in almost all cases, while still being
significantly better in some of the cases we care about.

Assigning a cost of just 1 to each excess spill is clearly going to
interfere with the normal list scheduler less often than assigning each
spill a higher cost would.  Given that this approach also gives the most
important benefits, it felt like a good default.

All in all, the patch is a complicated way of being quite timid, but I
hope it could also be used as a basis for more aggressive versions in
future.  Things I haven't looked at include whether it would make sense
to disable the priority-based adjustment of ECC for -Os.  (That's a
question of whether this patch can improve size still further over
-fno-sched-pressure.)

Being timid ought to make it fit quite well with Bernd's patch, although
I haven't had a chance to try the two together.
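For what it's worth, here is a toy version of the ECC(X) rule described
earlier, assuming the fixed cost of 1 per excess register and leaving out
the priority-based adjustment.  The function and parameter names are
invented and don't correspond to anything in the patch:

/* Toy version of ECC(X) with a cost of 1 per excess register.
   model_max is MP from the model schedule; current_max is the highest
   pressure seen so far in the real schedule; pressure is the pressure
   right now; delta is the increase caused by scheduling X next;
   after_max_point is nonzero if X occurs after the next point of
   maximum pressure in the model schedule.  */
static int
excess_cost_change (int model_max, int current_max, int pressure,
                    int delta, int after_max_point)
{
  int new_max;

  if (after_max_point)
    /* Measure X against the next point of maximum pressure (MP' + dP).  */
    new_max = current_max + delta;
  else
    {
      /* Measure X against the current pressure; if it stays at or below
         the current maximum, ECC(X) is zero.  */
      if (pressure + delta <= current_max)
        return 0;
      new_max = pressure + delta;
    }

  /* ECC(X) is the cost of going from MP to the new maximum; with a cost
     of 1 per excess register, that is just the difference.  */
  return new_max > model_max ? new_max - model_max : 0;
}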
I talked with Vlad about this a couple of months ago (thanks again for
the help, Vlad).  He explained that some of the things I was seeing --
especially (3) -- were deliberate decisions that had improved the results
for x86.  And I think it's probably true that the patch below is only
going to work well on fairly regular RISCy machines.

For that reason, the patch adds an -fsched-pressure-algorithm= option to
choose between the two versions.  For historical reasons I called the
current algorithm "weighted" and the proposed one "model", but I'm sure
there are better names.  I've kept the option undocumented because it's
really an internal thing.  "weighted" remains the default.

Of course, adding an alternative like this would only be acceptable if
-fsched-pressure and -fsched-pressure-algorithm=model became the default
for ARM (or another target).  That's a decision for the ARM maintainers;
I've sent Ramana the results I got, which unfortunately I can't share
more widely.

If anyone does have time to test this on their favourite port, I'd really
appreciate it.  I already know that it doesn't perform well on SPECFP for
rs6000 because of the choice of pressure classes; I'll send a separate
note about that so that it doesn't get lost in this long essay.

Also, print_pseudo_costs segfaults when -fsched-pressure and dumping are
enabled.  I'm planning to send a patch for that at some point; it's a
regression from 4.6, so seems like stage 3 material.  In the meantime,
a simple workaround is to stick "return;" at the beginning of the
function.

Richard


gcc/
        * sched-deps.c (fixup_sched_groups): Rename to...
        (chain_to_prev_insn): ...this.
        (chain_to_prev_insn_p): New function.
        (deps_analyze_insn): Use it instead of SCHED_GROUP_P.

Index: gcc/sched-deps.c
===================================================================
--- gcc/sched-deps.c	2011-12-22 14:15:38.000000000 +0000
+++ gcc/sched-deps.c	2011-12-22 14:22:40.749212237 +0000
@@ -477,7 +477,7 @@ static void add_dependence_list (rtx, rt
 static void add_dependence_list_and_free (struct deps_desc *, rtx,
                                            rtx *, int, enum reg_note);
 static void delete_all_dependences (rtx);
-static void fixup_sched_groups (rtx);
+static void chain_to_prev_insn (rtx);
 
 static void flush_pending_lists (struct deps_desc *, rtx, int, int);
 static void sched_analyze_1 (struct deps_desc *, rtx, rtx);
@@ -1655,7 +1655,7 @@ delete_all_dependences (rtx insn)
    the previous nonnote insn.  */
 
 static void
-fixup_sched_groups (rtx insn)
+chain_to_prev_insn (rtx insn)
 {
   sd_iterator_def sd_it;
   dep_t dep;
@@ -3294,7 +3294,7 @@ sched_analyze_insn (struct deps_desc *de
             instructions that follow seem like they should be part
             of the call group.
 
-            Also, if we did, fixup_sched_groups() would move the
+            Also, if we did, chain_to_prev_insn would move the
             deps of the debug insn to the call insn, modifying
             non-debug post-dependency counts of the debug insn
             dependencies and otherwise messing with the scheduling
@@ -3440,6 +3440,37 @@ call_may_noreturn_p (rtx insn)
   return true;
 }
 
+/* Return true if INSN should be made dependent on the previous instruction
+   group, and if all INSN's dependencies should be moved to the first
+   instruction of that group.  */
+
+static bool
+chain_to_prev_insn_p (rtx insn)
+{
+  rtx prev, x;
+
+  /* INSN forms a group with the previous instruction.  */
+  if (SCHED_GROUP_P (insn))
+    return true;
+
+  /* If the previous instruction clobbers a register R and this one sets
+     part of R, the clobber was added specifically to help us track the
+     liveness of R.  There's no point scheduling the clobber and leaving
+     INSN behind, especially if we move the clobber to another block.  */
+  prev = prev_nonnote_nondebug_insn (insn);
+  if (prev
+      && INSN_P (prev)
+      && BLOCK_FOR_INSN (prev) == BLOCK_FOR_INSN (insn)
+      && GET_CODE (PATTERN (prev)) == CLOBBER)
+    {
+      x = XEXP (PATTERN (prev), 0);
+      if (set_of (x, insn))
+        return true;
+    }
+
+  return false;
+}
+
 /* Analyze INSN with DEPS as a context.  */
 void
 deps_analyze_insn (struct deps_desc *deps, rtx insn)
@@ -3607,8 +3638,9 @@ deps_analyze_insn (struct deps_desc *dep
 
   /* Fixup the dependencies in the sched group.  */
   if ((NONJUMP_INSN_P (insn) || JUMP_P (insn))
-      && SCHED_GROUP_P (insn) && !sel_sched_p ())
-    fixup_sched_groups (insn);
+      && chain_to_prev_insn_p (insn)
+      && !sel_sched_p ())
+    chain_to_prev_insn (insn);
 }
 
 /* Initialize DEPS for the new block beginning with HEAD.  */