Patchwork [lra] Improve initial program point density in lra-lives.c

login
register
mail settings
Submitter Steven Bosscher
Date Oct. 4, 2012, 4:56 p.m.
Message ID <CABu31nOYosRmrSE-XObe7df-7wh1s6Xq96nxFhR=Ttp_vhHFKQ@mail.gmail.com>
Download mbox | patch
Permalink /patch/189207/
State New
Headers show

Comments

Steven Bosscher - Oct. 4, 2012, 4:56 p.m.
On Thu, Oct 4, 2012 at 6:12 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>>> 0.6% sounds really very different from my timings. How much time does
>>> create_start_finish_chains take for you?
>>>
>> 0.65% (2.78s).
>>
>> Actually, I have a profile but I am not sure now that it is for PR54146.
>> It might be for PR26854.
>>
>> I'll check it again to be sure.
>
> Not it looks about the same.

Well, that's very strange. Maybe we measure these things differently?
I just hi-hack a timevar, so I measure e.g. the time spent in
create_start_finish_chains like so:



so that I get the timings in the -ftime-report like so:

 CPROP                   :  43.14 ( 4%) usr
 integrated RA           : 200.81 (17%) usr
 LRA non-specific        :  62.18 ( 5%) usr
 LRA virtuals elimination:  61.71 ( 5%) usr
 LRA reload inheritance  :   6.41 ( 1%) usr
 LRA create live ranges  :  139.75 (13%) usr
 LRA hard reg assignment : 130.90 (11%) usr
 LRA coalesce pseudo regs:   2.45 ( 0%) usr
 reload                  :   9.09 ( 1%) usr

"Crude, but efficient" (tm) :-)

How do you measure the time spent in that function, and in
remove_some_program_points_and_update_live_ranges?

Ciao!
Steven
Steven Bosscher - Oct. 4, 2012, 5:14 p.m.
On Thu, Oct 4, 2012 at 6:56 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> "Crude, but efficient" (tm) :-)

BTW with a similar approach I also time other bits of process_bb_lives:

timevar_push (TV_HOIST);
  /* See if we'll need an increment at the end of this basic block.
     An increment is needed if the PSEUDOS_LIVE set is not empty,
     to make sure the finish points are set up correctly.  */
  need_curr_point_incr = (sparseset_cardinality (pseudos_live) > 0);
  EXECUTE_IF_SET_IN_SPARSESET (pseudos_live, i)
    mark_pseudo_dead (i, curr_point);
timevar_pop (TV_HOIST);
timevar_push (TV_PRE);
  EXECUTE_IF_SET_IN_SPARSESET (pseudos_live_through_calls, i)
    if (bitmap_bit_p (DF_LR_IN (bb), i))
      check_pseudos_live_through_calls (i);
timevar_pop (TV_PRE);

 PRE                     :  12.20 ( 1%) usr
 code hoisting           :  34.03 ( 3%) usr

IOW that's ~46s out of ~180s *not* spent in walking the insns in
process_bb_lives!

I think this is due to the opening/closing of chains for all pseudos
live through the basic block at the start and end of process_bb_lives.
For this test case the cardinality of pseudos_live is as high as
max_reg_num/2 at its peak.

I tried to keep pseudos_live open for bb->prev_bb if there is a
find_edge(bb->prev_bb,bb), but that's when I ran into a problem with
inconsistent liveness data (that's what this message was about:
http://gcc.gnu.org/ml/gcc/2012-10/msg00035.html).

Ciao!
Steven
Vladimir Makarov - Oct. 4, 2012, 5:44 p.m.
On 10/04/2012 12:56 PM, Steven Bosscher wrote:
> On Thu, Oct 4, 2012 at 6:12 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>>>> 0.6% sounds really very different from my timings. How much time does
>>>> create_start_finish_chains take for you?
>>>>
>>> 0.65% (2.78s).
>>>
>>> Actually, I have a profile but I am not sure now that it is for PR54146.
>>> It might be for PR26854.
>>>
>>> I'll check it again to be sure.
>> Not it looks about the same.
> Well, that's very strange. Maybe we measure these things differently?
> I just hi-hack a timevar, so I measure e.g. the time spent in
> create_start_finish_chains like so:
>
> Index: lra-lives.c
> ===================================================================
> --- lra-lives.c (revision 192052)
> +++ lra-lives.c (working copy)
> @@ -770,6 +812,7 @@ create_start_finish_chains (void)
>     int i, max_regno;
>     lra_live_range_t r;
>
> +timevar_push (TV_CPROP);
>     lra_start_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
>     lra_finish_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
>     max_regno = max_reg_num ();
> @@ -783,6 +826,7 @@ create_start_finish_chains (void)
>            lra_finish_point_ranges[r->finish] = r;
>          }
>       }
> +timevar_pop (TV_CPROP);
>   }
>
>   /* Rebuild LRA_START_POINT_RANGES and LRA_FINISH_POINT_RANGES after
>
>
> so that I get the timings in the -ftime-report like so:
>
>   CPROP                   :  43.14 ( 4%) usr
>   integrated RA           : 200.81 (17%) usr
>   LRA non-specific        :  62.18 ( 5%) usr
>   LRA virtuals elimination:  61.71 ( 5%) usr
>   LRA reload inheritance  :   6.41 ( 1%) usr
>   LRA create live ranges  :  139.75 (13%) usr
>   LRA hard reg assignment : 130.90 (11%) usr
>   LRA coalesce pseudo regs:   2.45 ( 0%) usr
>   reload                  :   9.09 ( 1%) usr
>
> "Crude, but efficient" (tm) :-)
>
> How do you measure the time spent in that function, and in
> remove_some_program_points_and_update_live_ranges?
>
You use AMD and I use Intel.  So it may be different with cache point of 
view.

Another thing is that I used gprof (-pg was used for bitmap.o lra*.o and 
ira*.o).  Your measurements are more accurate, I think, because it is 
without instrumentation and bitmap.o takes too much time. Bitmap does 
not work well in this case because they are too big and sparse.
Vladimir Makarov - Oct. 4, 2012, 7:19 p.m.
On 10/04/2012 01:44 PM, Vladimir Makarov wrote:
> On 10/04/2012 12:56 PM, Steven Bosscher wrote:
>> On Thu, Oct 4, 2012 at 6:12 PM, Vladimir Makarov 
>> <vmakarov@redhat.com> wrote:
>>
>> so that I get the timings in the -ftime-report like so:
>>
>>   CPROP                   :  43.14 ( 4%) usr
>>   integrated RA           : 200.81 (17%) usr
>>   LRA non-specific        :  62.18 ( 5%) usr
>>   LRA virtuals elimination:  61.71 ( 5%) usr
>>   LRA reload inheritance  :   6.41 ( 1%) usr
>>   LRA create live ranges  :  139.75 (13%) usr
>>   LRA hard reg assignment : 130.90 (11%) usr
>>   LRA coalesce pseudo regs:   2.45 ( 0%) usr
>>   reload                  :   9.09 ( 1%) usr
>>
>> "Crude, but efficient" (tm) :-)
>>
>> How do you measure the time spent in that function, and in
>> remove_some_program_points_and_update_live_ranges?
>>
> You use AMD and I use Intel.  So it may be different with cache point 
> of view.
>
> Another thing is that I used gprof (-pg was used for bitmap.o lra*.o 
> and ira*.o).  Your measurements are more accurate, I think, because it 
> is without instrumentation and bitmap.o takes too much time. Bitmap 
> does not work well in this case because they are too big and sparse.
>
Yes, gcc17 (AMD) behaviour is very different from Intel machines.  I 
think that is why we have so different numbers.  Only 
create_start_and_finish_chains takes 2.4% (28sec) according to gprof on 
slow.cc (before my last patch).  Also on AMD machine find_hard_regno_for 
is on the first place (on Intel machines, several bitmap functions are 
on the 1st place).  That is the function I wanted to look at more later 
(to implement some simpler algorithm for huge functions).

I think, the importance of your patch will be even more important as my 
last patch increases % spent in lra-lives.c.  So thank you very much, 
Steven.

I'd like to play more with your patch and I'll give you an approval to 
commit probably tomorrow.

Patch

Index: lra-lives.c
===================================================================
--- lra-lives.c (revision 192052)
+++ lra-lives.c (working copy)
@@ -770,6 +812,7 @@  create_start_finish_chains (void)
   int i, max_regno;
   lra_live_range_t r;

+timevar_push (TV_CPROP);
   lra_start_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
   lra_finish_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point);
   max_regno = max_reg_num ();
@@ -783,6 +826,7 @@  create_start_finish_chains (void)
          lra_finish_point_ranges[r->finish] = r;
        }
     }
+timevar_pop (TV_CPROP);
 }

 /* Rebuild LRA_START_POINT_RANGES and LRA_FINISH_POINT_RANGES after