Patchwork Improving reload inheritance code generation and predictability

login
register
mail settings
Submitter Vladimir Makarov
Date Nov. 19, 2010, 5:25 p.m.
Message ID <4CE6B2FC.7080800@redhat.com>
Download mbox | patch
Permalink /patch/72278/
State New
Headers show

Comments

Vladimir Makarov - Nov. 19, 2010, 5:25 p.m.
On 11/19/2010 10:53 AM, Jeff Law wrote:
> On 11/18/10 14:27, Vladimir Makarov wrote:
>>>
>>
>> I like this idea and also thought long ago to try it too.  Because of 
>> better inheritance I think it should show some code size improvement 
>> and probably some performance improvement too besides better debugging.
> There's a definite code size improvement.
>
>>
>> I am afraid only that it will take some compilation time too (which 
>> will be probably compensated partially by less final insns 
>> processing) and IMHO that is not because of insn traversing but 
>> mostly because of usage of DF-infrastructure.
> I'm also more concerned about the DF scanning than the BB scan when we 
> need a reload register.  Obviously for something with huge blocks (say 
> our friend fpppp) scanning the insns in the BB could get expensive and 
> we could clamp the number of insns scanned on a PARAM value.
>
> Anyway, I quickly inserted some counters to measure some data and ran 
> a bootstrap (without java).
>
> The first thing I note only 56% of the source files we compile even 
> end up calling allocate_reload_reg.  I did not track total number of 
> function's compiled.  56% is low enough that lazily initializing the 
> DF data is probably worth it since DF scans the entire insn stream.   
> If we could lazily initialize DF within a block only, then that'd 
> probably save even more.
>
> Within the files that called allocate_reload_reg, we had 207003 calls 
> to allocate_reload_reg and we scanned 2071962 insns in the loop, or 10 
> insns per call.  That seemed rather high to me as I was expecting a 
> scan rate of 5-7 insns per call.
>
> Two related obvious improvements came to mind.  If there is only one 
> spill reg, then scanning is totally unnecessary and if there is only 
> one spill reg left to find during a scan, we can stop the scan, in 
> both cases the remaining reg is the most desirable reg and scanning 
> insns is totally unnecessary.  These two improvements get us down to 
> 7.5 insns scanned per call to allocate_reload_reg.  Still more than I 
> would have expected.
>
> libgcc's bid_round results in 918 calls and 60627 insns scanned (* 3 
> since libgcc is built 3 times during a bootstrap), which represents 
> more than 10% of the total insns scanned.  If we factored out 
> bid_round's effects we'd be looking at 6.5 insns scanned per call 
> which seems about right.
>

Interesting.

>>
>> Some time ago I analyzed how many memory is used by DF during an IRA 
>> snapshot.  It was about 25% vs 7% allocated by IRA for its IR (% of 
>> all heap memory).  Touching this huge footprint will worse code 
>> locality and result in slow code.
>>
>> Reload does not use DF and even automatic insn rescanning is switched 
>> off.  I believe that if reload were rewritten to use DF, it could 
>> result in much slower code.  This is just some my speculations which 
>> really hard to confirm or reject.
> Note that we still have DF structures lying around because ira doesn't 
> call df_finish prior to calling reload.  So the memory increase should 
> be minimal (basically just the increase due to insns inserted by 
> caller-saves and the like).
>
That is not about memory increase.  It is about DF data expelling rtl 
data from caches.

I've just did some measurements of compilation time on your patch on 
all_cp2k_gfortran.f90 (> 400K lines of fortran)

without patch                                     219.20user
with only df calls in reload (see patch below)    221.39user
with all your patch                               221.17user

So 1% of degradation is only because the patch touches DF-data (not 
scanning insns in finding reload reg as someone can think).  Better 
inheritance might improve compilation time because less insns are 
generated (although it is hard to say the difference on two last lines 
is too small).

That data was for -O2.  I guess that the compilation time degradation 
will be even more for -O0.


> The alternative would be to deep scan each insn in the loop.
Vladimir Makarov - Nov. 19, 2010, 5:34 p.m.
On 11/19/2010 12:25 PM, Vladimir Makarov wrote:
> On 11/19/2010 10:53 AM, Jeff Law wrote: Note that we still have DF 
> structures lying around because ira doesn't call df_finish prior to 
> calling reload.  So the memory increase should be minimal (basically 
> just the increase due to insns inserted by caller-saves and the like).
>>
> That is not about memory increase.  It is about DF data expelling rtl 
> data from caches.
>
> I've just did some measurements of compilation time on your patch on 
> all_cp2k_gfortran.f90 (> 400K lines of fortran)
>
> without patch                                     219.20user
> with only df calls in reload (see patch below)    221.39user
> with all your patch                               221.17user
>
> So 1% of degradation is only because the patch touches DF-data (not 
> scanning insns in finding reload reg as someone can think).  Better 
> inheritance might improve compilation time because less insns are 
> generated (although it is hard to say the difference on two last lines 
> is too small).
>
> That data was for -O2.  I guess that the compilation time degradation 
> will be even more for -O0.
>
Sorry, Jeff.  I forgot to add that the compiler was built in the release 
mode on Core i7.
Jeff Law - Nov. 19, 2010, 6:10 p.m.
On 11/19/10 10:25, Vladimir Makarov wrote:
> On 11/19/2010 10:53 AM, Jeff Law wrote:
>> On 11/18/10 14:27, Vladimir Makarov wrote:
>>>>
>>>
>>> I like this idea and also thought long ago to try it too.  Because 
>>> of better inheritance I think it should show some code size 
>>> improvement and probably some performance improvement too besides 
>>> better debugging.
>> There's a definite code size improvement.
>>
>>>
>>> I am afraid only that it will take some compilation time too (which 
>>> will be probably compensated partially by less final insns 
>>> processing) and IMHO that is not because of insn traversing but 
>>> mostly because of usage of DF-infrastructure.
>> I'm also more concerned about the DF scanning than the BB scan when 
>> we need a reload register.  Obviously for something with huge blocks 
>> (say our friend fpppp) scanning the insns in the BB could get 
>> expensive and we could clamp the number of insns scanned on a PARAM 
>> value.
>>
>> Anyway, I quickly inserted some counters to measure some data and ran 
>> a bootstrap (without java).
>>
>> The first thing I note only 56% of the source files we compile even 
>> end up calling allocate_reload_reg.  I did not track total number of 
>> function's compiled.  56% is low enough that lazily initializing the 
>> DF data is probably worth it since DF scans the entire insn stream.   
>> If we could lazily initialize DF within a block only, then that'd 
>> probably save even more.
>>
>> Within the files that called allocate_reload_reg, we had 207003 calls 
>> to allocate_reload_reg and we scanned 2071962 insns in the loop, or 
>> 10 insns per call.  That seemed rather high to me as I was expecting 
>> a scan rate of 5-7 insns per call.
>>
>> Two related obvious improvements came to mind.  If there is only one 
>> spill reg, then scanning is totally unnecessary and if there is only 
>> one spill reg left to find during a scan, we can stop the scan, in 
>> both cases the remaining reg is the most desirable reg and scanning 
>> insns is totally unnecessary.  These two improvements get us down to 
>> 7.5 insns scanned per call to allocate_reload_reg.  Still more than I 
>> would have expected.
>>
>> libgcc's bid_round results in 918 calls and 60627 insns scanned (* 3 
>> since libgcc is built 3 times during a bootstrap), which represents 
>> more than 10% of the total insns scanned.  If we factored out 
>> bid_round's effects we'd be looking at 6.5 insns scanned per call 
>> which seems about right.
>>
>
> Interesting.
>
>>>
>>> Some time ago I analyzed how many memory is used by DF during an IRA 
>>> snapshot.  It was about 25% vs 7% allocated by IRA for its IR (% of 
>>> all heap memory).  Touching this huge footprint will worse code 
>>> locality and result in slow code.
>>>
>>> Reload does not use DF and even automatic insn rescanning is 
>>> switched off.  I believe that if reload were rewritten to use DF, it 
>>> could result in much slower code.  This is just some my speculations 
>>> which really hard to confirm or reject.
>> Note that we still have DF structures lying around because ira 
>> doesn't call df_finish prior to calling reload.  So the memory 
>> increase should be minimal (basically just the increase due to insns 
>> inserted by caller-saves and the like).
>>
> That is not about memory increase.  It is about DF data expelling rtl 
> data from caches.
I wonder if I could convince the DF machinery to do an incremental 
update to add the caller-save insns and deal with the change from 
pseudo->hard regs external from DF.  That might mitigate this issue.


>
> I've just did some measurements of compilation time on your patch on 
> all_cp2k_gfortran.f90 (> 400K lines of fortran)
>
> without patch                                     219.20user
> with only df calls in reload (see patch below)    221.39user
> with all your patch                               221.17user
>
> So 1% of degradation is only because the patch touches DF-data (not 
> scanning insns in finding reload reg as someone can think).  Better 
> inheritance might improve compilation time because less insns are 
> generated (although it is hard to say the difference on two last lines 
> is too small).
I'd consider a 1% degradation way too much for this patch to be acceptable.

I think for -O0, we could just fall back to the round-robin and avoid 
the scan.

jeff
Jeff Law - Nov. 23, 2010, 4:37 p.m.
On 11/19/10 10:25, Vladimir Makarov wrote:
>
> That is not about memory increase.  It is about DF data expelling rtl 
> data from caches.
>
> I've just did some measurements of compilation time on your patch on 
> all_cp2k_gfortran.f90 (> 400K lines of fortran)
>
> without patch                                     219.20user
> with only df calls in reload (see patch below)    221.39user
> with all your patch                               221.17user
>
> So 1% of degradation is only because the patch touches DF-data (not 
> scanning insns in finding reload reg as someone can think).  Better 
> inheritance might improve compilation time because less insns are 
> generated (although it is hard to say the difference on two last lines 
> is too small).
Switching to df_insn_rescan_all cuts the cost by about 50%.   I may be 
able to only scan the insns created by caller-save and deal with the 
psuedo->hard reg conversion within mark_spill_regs.  I'll be testing 
that shortly.   Then I'll be looking at Bernd's suggestions (which are a 
rather different approach to the problem).

jeff
Vladimir Makarov - Nov. 23, 2010, 4:49 p.m.
On 11/23/2010 11:37 AM, Jeff Law wrote:
> On 11/19/10 10:25, Vladimir Makarov wrote:
>>
>> That is not about memory increase.  It is about DF data expelling rtl 
>> data from caches.
>>
>> I've just did some measurements of compilation time on your patch on 
>> all_cp2k_gfortran.f90 (> 400K lines of fortran)
>>
>> without patch                                     219.20user
>> with only df calls in reload (see patch below)    221.39user
>> with all your patch                               221.17user
>>
>> So 1% of degradation is only because the patch touches DF-data (not 
>> scanning insns in finding reload reg as someone can think).  Better 
>> inheritance might improve compilation time because less insns are 
>> generated (although it is hard to say the difference on two last 
>> lines is too small).
> Switching to df_insn_rescan_all cuts the cost by about 50%.
Great.  Personally for me 0.5% compilation time increase is already ok, 
especially when you improve the code size.
>    I may be able to only scan the insns created by caller-save and 
> deal with the psuedo->hard reg conversion within mark_spill_regs.  
> I'll be testing that shortly.   Then I'll be looking at Bernd's 
> suggestions (which are a rather different approach to the problem).
Ok.  May be it will be interesting for you, but call insns in 
df-infrastructure are particularly memory hungry (about 40 definitions 
for the insn on x86_64).
Jeff Law - Nov. 24, 2010, 4:17 a.m.
On 11/23/10 09:49, Vladimir Makarov wrote:
>  On 11/23/2010 11:37 AM, Jeff Law wrote:
>> On 11/19/10 10:25, Vladimir Makarov wrote:
>>>
>>> That is not about memory increase.  It is about DF data expelling 
>>> rtl data from caches.
>>>
>>> I've just did some measurements of compilation time on your patch on 
>>> all_cp2k_gfortran.f90 (> 400K lines of fortran)
>>>
>>> without patch                                     219.20user
>>> with only df calls in reload (see patch below)    221.39user
>>> with all your patch                               221.17user
>>>
>>> So 1% of degradation is only because the patch touches DF-data (not 
>>> scanning insns in finding reload reg as someone can think).  Better 
>>> inheritance might improve compilation time because less insns are 
>>> generated (although it is hard to say the difference on two last 
>>> lines is too small).
>> Switching to df_insn_rescan_all cuts the cost by about 50%.
> Great.  Personally for me 0.5% compilation time increase is already 
> ok, especially when you improve the code size.
Yea.  It was a nice improvement.

>>    I may be able to only scan the insns created by caller-save and 
>> deal with the psuedo->hard reg conversion within mark_spill_regs.  
>> I'll be testing that shortly.   Then I'll be looking at Bernd's 
>> suggestions (which are a rather different approach to the problem).
> Ok.  May be it will be interesting for you, but call insns in 
> df-infrastructure are particularly memory hungry (about 40 definitions 
> for the insn on x86_64).
I was rather surprised to find that just scanning the caller-save 
created insns and relying upon the old DF data for everything else was 
actually the slowest approach.  I'm at a bit of a loss to explain that 
behavior.  I can see the extra level of indirection for each REG 
appearing in a non-caller-save insn that we peek at in 
allocate_reload_reg, but I'm having trouble seeing how that is costing 
so much more than a df_insn_rescan_all.

Jeff
Jeff Law - Nov. 29, 2010, 11:56 p.m.
On 11/19/10 10:25, Vladimir Makarov wrote:
>
> I've just did some measurements of compilation time on your patch on 
> all_cp2k_gfortran.f90 (> 400K lines of fortran)
>
> without patch                                     219.20user
> with only df calls in reload (see patch below)    221.39user
> with all your patch                               221.17user
>
> So 1% of degradation is only because the patch touches DF-data (not 
> scanning insns in finding reload reg as someone can think).  Better 
> inheritance might improve compilation time because less insns are 
> generated (although it is hard to say the difference on two last lines 
> is too small).
So I've done a little more work on this.   The big surprise is a 
df_finish call can be quite expensive and given that IRA already takes 
care of it, the new call in reload1.c was totally unnecessary.  With 
that fix and removal of an unnecessary mapping from hard reg back to an 
index into SPILL_REGS, this code very close to compile-time neutral.

I had a version which only df-scanned blocks which were going to be 
insn-scanned which helped a little more, I didn't feel it was worth 
leaving DF info in a half-updated half-not-updated state.


Average times over 5 runs of compiling a big blob of .i files I have 
lying around:

Baseline (unpatched):                             705.57
Lazily DF scan the entire function:          706.66

So we're at about .15% compile-time increase -- which is IMHO reasonable.

I've also added a PARAM to clamp the number of insns we scan.  Clamping 
at 20 insns gets the vast majority of cases we care about.

Jeff

Patch

Index: reload1.c
===================================================================
--- reload1.c   (revision 166913)
+++ reload1.c   (working copy)
@@ -1045,6 +1045,11 @@ 
         }
      }

+   df_finish_pass (true);
+   df_scan_alloc (NULL);
+   df_scan_blocks ();
+   df_analyze ();
+
    /* Use the reload registers where necessary
       by generating move instructions to move the must-be-register
       values into or out of the reload registers.  */
@@ -1061,6 +1066,8 @@ 
        gcc_assert (verify_initial_elim_offsets ());
      }

+   df_finish_pass (true);
+
    /* If we were able to eliminate the frame pointer, show that it is no
       longer live at the start of any basic block.  If it ls live by
       virtue of being in a pseudo, that pseudo will be marked live