[lra] Improve initial program point density in lra-lives.c

Submitted by Steven Bosscher on Oct. 4, 2012, 4:56 p.m.

Details

Message ID CABu31nOYosRmrSE-XObe7df-7wh1s6Xq96nxFhR=Ttp_vhHFKQ@mail.gmail.com
State New
Headers show

Commit Message

Steven Bosscher Oct. 4, 2012, 4:56 p.m.
On Thu, Oct 4, 2012 at 6:12 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>>> 0.6% sounds really very different from my timings. How much time does
>>> create_start_finish_chains take for you?
>>>
>> 0.65% (2.78s).
>>
>> Actually, I have a profile but I am not sure now that it is for PR54146.
>> It might be for PR26854.
>>
>> I'll check it again to be sure.
>
> Not it looks about the same.

Well, that's very strange. Maybe we measure these things differently?
I just hi-hack a timevar, so I measure e.g. the time spent in
create_start_finish_chains like so:



so that I get the timings in the -ftime-report like so:

 CPROP                   :  43.14 ( 4%) usr
 integrated RA           : 200.81 (17%) usr
 LRA non-specific        :  62.18 ( 5%) usr
 LRA virtuals elimination:  61.71 ( 5%) usr
 LRA reload inheritance  :   6.41 ( 1%) usr
 LRA create live ranges  :  139.75 (13%) usr
 LRA hard reg assignment : 130.90 (11%) usr
 LRA coalesce pseudo regs:   2.45 ( 0%) usr
 reload                  :   9.09 ( 1%) usr

"Crude, but efficient" (tm) :-)

How do you measure the time spent in that function, and in
remove_some_program_points_and_update_live_ranges?

Ciao!
Steven

Comments

Steven Bosscher Oct. 4, 2012, 5:14 p.m.
On Thu, Oct 4, 2012 at 6:56 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> "Crude, but efficient" (tm) :-)

BTW with a similar approach I also time other bits of process_bb_lives:

timevar_push (TV_HOIST);
  /* See if we'll need an increment at the end of this basic block.
     An increment is needed if the PSEUDOS_LIVE set is not empty,
     to make sure the finish points are set up correctly.  */
  need_curr_point_incr = (sparseset_cardinality (pseudos_live) > 0);
  EXECUTE_IF_SET_IN_SPARSESET (pseudos_live, i)
    mark_pseudo_dead (i, curr_point);
timevar_pop (TV_HOIST);
timevar_push (TV_PRE);
  EXECUTE_IF_SET_IN_SPARSESET (pseudos_live_through_calls, i)
    if (bitmap_bit_p (DF_LR_IN (bb), i))
      check_pseudos_live_through_calls (i);
timevar_pop (TV_PRE);

 PRE                     :  12.20 ( 1%) usr
 code hoisting           :  34.03 ( 3%) usr

IOW that's ~46s out of ~180s *not* spent in walking the insns in
process_bb_lives!

I think this is due to the opening/closing of chains for all pseudos
live through the basic block at the start and end of process_bb_lives.
For this test case the cardinality of pseudos_live is as high as
max_reg_num/2 at its peak.

I tried to keep pseudos_live open for bb->prev_bb if there is a
find_edge(bb->prev_bb,bb), but that's when I ran into a problem with
inconsistent liveness data (that's what this message was about:
http://gcc.gnu.org/ml/gcc/2012-10/msg00035.html).

Ciao!
Steven
Vladimir Makarov Oct. 4, 2012, 5:44 p.m.
On 10/04/2012 12:56 PM, Steven Bosscher wrote:
> On Thu, Oct 4, 2012 at 6:12 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>>>> 0.6% sounds really very different from my timings. How much time does
>>>> create_start_finish_chains take for you?
>>>>
>>> 0.65% (2.78s).
>>>
>>> Actually, I have a profile but I am not sure now that it is for PR54146.
>>> It might be for PR26854.
>>>
>>> I'll check it again to be sure.
>> Not it looks about the same.
> Well, that's very strange. Maybe we measure these things differently?
> I just hi-hack a timevar, so I measure e.g. the time spent in
> create_start_finish_chains like so:
>
> Index: lra-lives.c
> ===================================================================
> --- lra-lives.c (revision 192052)
> +++ lra-lives.c (working copy)
> @@ -770,6 +812,7 @@ create_start_finish_chains (void)
>     int i, max_regno;
>     lra_live_range_t r;
>
> +timevar_push (TV_CPROP);
>     lra_start_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
>     lra_finish_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
>     max_regno = max_reg_num ();
> @@ -783,6 +826,7 @@ create_start_finish_chains (void)
>            lra_finish_point_ranges[r->finish] = r;
>          }
>       }
> +timevar_pop (TV_CPROP);
>   }
>
>   /* Rebuild LRA_START_POINT_RANGES and LRA_FINISH_POINT_RANGES after
>
>
> so that I get the timings in the -ftime-report like so:
>
>   CPROP                   :  43.14 ( 4%) usr
>   integrated RA           : 200.81 (17%) usr
>   LRA non-specific        :  62.18 ( 5%) usr
>   LRA virtuals elimination:  61.71 ( 5%) usr
>   LRA reload inheritance  :   6.41 ( 1%) usr
>   LRA create live ranges  :  139.75 (13%) usr
>   LRA hard reg assignment : 130.90 (11%) usr
>   LRA coalesce pseudo regs:   2.45 ( 0%) usr
>   reload                  :   9.09 ( 1%) usr
>
> "Crude, but efficient" (tm) :-)
>
> How do you measure the time spent in that function, and in
> remove_some_program_points_and_update_live_ranges?
>
You use AMD and I use Intel.  So it may be different with cache point of 
view.

Another thing is that I used gprof (-pg was used for bitmap.o lra*.o and 
ira*.o).  Your measurements are more accurate, I think, because it is 
without instrumentation and bitmap.o takes too much time. Bitmap does 
not work well in this case because they are too big and sparse.
Vladimir Makarov Oct. 4, 2012, 7:19 p.m.
On 10/04/2012 01:44 PM, Vladimir Makarov wrote:
> On 10/04/2012 12:56 PM, Steven Bosscher wrote:
>> On Thu, Oct 4, 2012 at 6:12 PM, Vladimir Makarov 
>> <vmakarov@redhat.com> wrote:
>>
>> so that I get the timings in the -ftime-report like so:
>>
>>   CPROP                   :  43.14 ( 4%) usr
>>   integrated RA           : 200.81 (17%) usr
>>   LRA non-specific        :  62.18 ( 5%) usr
>>   LRA virtuals elimination:  61.71 ( 5%) usr
>>   LRA reload inheritance  :   6.41 ( 1%) usr
>>   LRA create live ranges  :  139.75 (13%) usr
>>   LRA hard reg assignment : 130.90 (11%) usr
>>   LRA coalesce pseudo regs:   2.45 ( 0%) usr
>>   reload                  :   9.09 ( 1%) usr
>>
>> "Crude, but efficient" (tm) :-)
>>
>> How do you measure the time spent in that function, and in
>> remove_some_program_points_and_update_live_ranges?
>>
> You use AMD and I use Intel.  So it may be different with cache point 
> of view.
>
> Another thing is that I used gprof (-pg was used for bitmap.o lra*.o 
> and ira*.o).  Your measurements are more accurate, I think, because it 
> is without instrumentation and bitmap.o takes too much time. Bitmap 
> does not work well in this case because they are too big and sparse.
>
Yes, gcc17 (AMD) behaviour is very different from Intel machines.  I 
think that is why we have so different numbers.  Only 
create_start_and_finish_chains takes 2.4% (28sec) according to gprof on 
slow.cc (before my last patch).  Also on AMD machine find_hard_regno_for 
is on the first place (on Intel machines, several bitmap functions are 
on the 1st place).  That is the function I wanted to look at more later 
(to implement some simpler algorithm for huge functions).

I think, the importance of your patch will be even more important as my 
last patch increases % spent in lra-lives.c.  So thank you very much, 
Steven.

I'd like to play more with your patch and I'll give you an approval to 
commit probably tomorrow.

Patch hide | download patch | download mbox

Index: lra-lives.c
===================================================================
--- lra-lives.c (revision 192052)
+++ lra-lives.c (working copy)
@@ -770,6 +812,7 @@  create_start_finish_chains (void)
   int i, max_regno;
   lra_live_range_t r;

+timevar_push (TV_CPROP);
   lra_start_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
   lra_finish_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
   max_regno = max_reg_num ();
@@ -783,6 +826,7 @@  create_start_finish_chains (void)
          lra_finish_point_ranges[r->finish] = r;
        }
     }
+timevar_pop (TV_CPROP);
 }

 /* Rebuild LRA_START_POINT_RANGES and LRA_FINISH_POINT_RANGES after