Message ID | CABu31nOYosRmrSE-XObe7df-7wh1s6Xq96nxFhR=Ttp_vhHFKQ@mail.gmail.com |
---|---|
State | New |
Headers | show |
On Thu, Oct 4, 2012 at 6:56 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> "Crude, but efficient" (tm) :-)
BTW with a similar approach I also time other bits of process_bb_lives:
timevar_push (TV_HOIST);
/* See if we'll need an increment at the end of this basic block.
An increment is needed if the PSEUDOS_LIVE set is not empty,
to make sure the finish points are set up correctly. */
need_curr_point_incr = (sparseset_cardinality (pseudos_live) > 0);
EXECUTE_IF_SET_IN_SPARSESET (pseudos_live, i)
mark_pseudo_dead (i, curr_point);
timevar_pop (TV_HOIST);
timevar_push (TV_PRE);
EXECUTE_IF_SET_IN_SPARSESET (pseudos_live_through_calls, i)
if (bitmap_bit_p (DF_LR_IN (bb), i))
check_pseudos_live_through_calls (i);
timevar_pop (TV_PRE);
PRE : 12.20 ( 1%) usr
code hoisting : 34.03 ( 3%) usr
IOW that's ~46s out of ~180s *not* spent in walking the insns in
process_bb_lives!
I think this is due to the opening/closing of chains for all pseudos
live through the basic block at the start and end of process_bb_lives.
For this test case the cardinality of pseudos_live is as high as
max_reg_num/2 at its peak.
I tried to keep pseudos_live open for bb->prev_bb if there is a
find_edge(bb->prev_bb,bb), but that's when I ran into a problem with
inconsistent liveness data (that's what this message was about:
http://gcc.gnu.org/ml/gcc/2012-10/msg00035.html).
Ciao!
Steven
On 10/04/2012 12:56 PM, Steven Bosscher wrote: > On Thu, Oct 4, 2012 at 6:12 PM, Vladimir Makarov <vmakarov@redhat.com> wrote: >>>> 0.6% sounds really very different from my timings. How much time does >>>> create_start_finish_chains take for you? >>>> >>> 0.65% (2.78s). >>> >>> Actually, I have a profile but I am not sure now that it is for PR54146. >>> It might be for PR26854. >>> >>> I'll check it again to be sure. >> Not it looks about the same. > Well, that's very strange. Maybe we measure these things differently? > I just hi-hack a timevar, so I measure e.g. the time spent in > create_start_finish_chains like so: > > Index: lra-lives.c > =================================================================== > --- lra-lives.c (revision 192052) > +++ lra-lives.c (working copy) > @@ -770,6 +812,7 @@ create_start_finish_chains (void) > int i, max_regno; > lra_live_range_t r; > > +timevar_push (TV_CPROP); > lra_start_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point); > lra_finish_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point); > max_regno = max_reg_num (); > @@ -783,6 +826,7 @@ create_start_finish_chains (void) > lra_finish_point_ranges[r->finish] = r; > } > } > +timevar_pop (TV_CPROP); > } > > /* Rebuild LRA_START_POINT_RANGES and LRA_FINISH_POINT_RANGES after > > > so that I get the timings in the -ftime-report like so: > > CPROP : 43.14 ( 4%) usr > integrated RA : 200.81 (17%) usr > LRA non-specific : 62.18 ( 5%) usr > LRA virtuals elimination: 61.71 ( 5%) usr > LRA reload inheritance : 6.41 ( 1%) usr > LRA create live ranges : 139.75 (13%) usr > LRA hard reg assignment : 130.90 (11%) usr > LRA coalesce pseudo regs: 2.45 ( 0%) usr > reload : 9.09 ( 1%) usr > > "Crude, but efficient" (tm) :-) > > How do you measure the time spent in that function, and in > remove_some_program_points_and_update_live_ranges? > You use AMD and I use Intel. So it may be different with cache point of view. Another thing is that I used gprof (-pg was used for bitmap.o lra*.o and ira*.o). Your measurements are more accurate, I think, because it is without instrumentation and bitmap.o takes too much time. Bitmap does not work well in this case because they are too big and sparse.
On 10/04/2012 01:44 PM, Vladimir Makarov wrote: > On 10/04/2012 12:56 PM, Steven Bosscher wrote: >> On Thu, Oct 4, 2012 at 6:12 PM, Vladimir Makarov >> <vmakarov@redhat.com> wrote: >> >> so that I get the timings in the -ftime-report like so: >> >> CPROP : 43.14 ( 4%) usr >> integrated RA : 200.81 (17%) usr >> LRA non-specific : 62.18 ( 5%) usr >> LRA virtuals elimination: 61.71 ( 5%) usr >> LRA reload inheritance : 6.41 ( 1%) usr >> LRA create live ranges : 139.75 (13%) usr >> LRA hard reg assignment : 130.90 (11%) usr >> LRA coalesce pseudo regs: 2.45 ( 0%) usr >> reload : 9.09 ( 1%) usr >> >> "Crude, but efficient" (tm) :-) >> >> How do you measure the time spent in that function, and in >> remove_some_program_points_and_update_live_ranges? >> > You use AMD and I use Intel. So it may be different with cache point > of view. > > Another thing is that I used gprof (-pg was used for bitmap.o lra*.o > and ira*.o). Your measurements are more accurate, I think, because it > is without instrumentation and bitmap.o takes too much time. Bitmap > does not work well in this case because they are too big and sparse. > Yes, gcc17 (AMD) behaviour is very different from Intel machines. I think that is why we have so different numbers. Only create_start_and_finish_chains takes 2.4% (28sec) according to gprof on slow.cc (before my last patch). Also on AMD machine find_hard_regno_for is on the first place (on Intel machines, several bitmap functions are on the 1st place). That is the function I wanted to look at more later (to implement some simpler algorithm for huge functions). I think, the importance of your patch will be even more important as my last patch increases % spent in lra-lives.c. So thank you very much, Steven. I'd like to play more with your patch and I'll give you an approval to commit probably tomorrow.
Index: lra-lives.c =================================================================== --- lra-lives.c (revision 192052) +++ lra-lives.c (working copy) @@ -770,6 +812,7 @@ create_start_finish_chains (void) int i, max_regno; lra_live_range_t r; +timevar_push (TV_CPROP); lra_start_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point); lra_finish_point_ranges = XCNEWVEC (lra_live_range_t, lra_live_max_point); max_regno = max_reg_num (); @@ -783,6 +826,7 @@ create_start_finish_chains (void) lra_finish_point_ranges[r->finish] = r; } } +timevar_pop (TV_CPROP); } /* Rebuild LRA_START_POINT_RANGES and LRA_FINISH_POINT_RANGES after