From patchwork Fri Nov 19 17:25:16 2010 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: Improving reload inheritance code generation and predictability From: Vladimir Makarov X-Patchwork-Id: 72278 Message-Id: <4CE6B2FC.7080800@redhat.com> To: Jeff Law Cc: gcc-patches Date: Fri, 19 Nov 2010 12:25:16 -0500 On 11/19/2010 10:53 AM, Jeff Law wrote: > On 11/18/10 14:27, Vladimir Makarov wrote: >>> >> >> I like this idea and also thought long ago to try it too. Because of >> better inheritance I think it should show some code size improvement >> and probably some performance improvement too besides better debugging. > There's a definite code size improvement. > >> >> I am afraid only that it will take some compilation time too (which >> will be probably compensated partially by less final insns >> processing) and IMHO that is not because of insn traversing but >> mostly because of usage of DF-infrastructure. > I'm also more concerned about the DF scanning than the BB scan when we > need a reload register. Obviously for something with huge blocks (say > our friend fpppp) scanning the insns in the BB could get expensive and > we could clamp the number of insns scanned on a PARAM value. > > Anyway, I quickly inserted some counters to measure some data and ran > a bootstrap (without java). > > The first thing I note only 56% of the source files we compile even > end up calling allocate_reload_reg. I did not track total number of > function's compiled. 56% is low enough that lazily initializing the > DF data is probably worth it since DF scans the entire insn stream. > If we could lazily initialize DF within a block only, then that'd > probably save even more. > > Within the files that called allocate_reload_reg, we had 207003 calls > to allocate_reload_reg and we scanned 2071962 insns in the loop, or 10 > insns per call. That seemed rather high to me as I was expecting a > scan rate of 5-7 insns per call. > > Two related obvious improvements came to mind. If there is only one > spill reg, then scanning is totally unnecessary and if there is only > one spill reg left to find during a scan, we can stop the scan, in > both cases the remaining reg is the most desirable reg and scanning > insns is totally unnecessary. These two improvements get us down to > 7.5 insns scanned per call to allocate_reload_reg. Still more than I > would have expected. > > libgcc's bid_round results in 918 calls and 60627 insns scanned (* 3 > since libgcc is built 3 times during a bootstrap), which represents > more than 10% of the total insns scanned. If we factored out > bid_round's effects we'd be looking at 6.5 insns scanned per call > which seems about right. > Interesting. >> >> Some time ago I analyzed how many memory is used by DF during an IRA >> snapshot. It was about 25% vs 7% allocated by IRA for its IR (% of >> all heap memory). Touching this huge footprint will worse code >> locality and result in slow code. >> >> Reload does not use DF and even automatic insn rescanning is switched >> off. I believe that if reload were rewritten to use DF, it could >> result in much slower code. This is just some my speculations which >> really hard to confirm or reject. > Note that we still have DF structures lying around because ira doesn't > call df_finish prior to calling reload. So the memory increase should > be minimal (basically just the increase due to insns inserted by > caller-saves and the like). > That is not about memory increase. It is about DF data expelling rtl data from caches. I've just did some measurements of compilation time on your patch on all_cp2k_gfortran.f90 (> 400K lines of fortran) without patch 219.20user with only df calls in reload (see patch below) 221.39user with all your patch 221.17user So 1% of degradation is only because the patch touches DF-data (not scanning insns in finding reload reg as someone can think). Better inheritance might improve compilation time because less insns are generated (although it is hard to say the difference on two last lines is too small). That data was for -O2. I guess that the compilation time degradation will be even more for -O0. > The alternative would be to deep scan each insn in the loop. Index: reload1.c =================================================================== --- reload1.c (revision 166913) +++ reload1.c (working copy) @@ -1045,6 +1045,11 @@ } } + df_finish_pass (true); + df_scan_alloc (NULL); + df_scan_blocks (); + df_analyze (); + /* Use the reload registers where necessary by generating move instructions to move the must-be-register values into or out of the reload registers. */ @@ -1061,6 +1066,8 @@ gcc_assert (verify_initial_elim_offsets ()); } + df_finish_pass (true); + /* If we were able to eliminate the frame pointer, show that it is no longer live at the start of any basic block. If it ls live by virtue of being in a pseudo, that pseudo will be marked live