From patchwork Thu Oct  4 15:37:18 2012
X-Patchwork-Submitter: Vladimir Makarov
X-Patchwork-Id: 189200
Message-ID: <506DAD2E.1010507@redhat.com>
Date: Thu, 04 Oct 2012 11:37:18 -0400
From: Vladimir Makarov
To: gcc-patches
Subject: [lra] patch to solve most scalability problems for LRA
The following patch solves most of LRA's scalability problems by
switching to simpler algorithms in LRA.  First, it switches off
trying to reassign hard registers to spilled pseudos (in such huge
functions these pseudos usually have long live ranges, so the chance
of assigning them anything is very small, but trying to reassign them
hard registers is too expensive); it also switches off the
inheritance, live-range splitting, and memory-coalescing
optimizations.  It seems that rematerialization is too important for
performance, so I don't switch it off.  As splitting is also
necessary for generating caller-saves code, I switch off caller-saves
in IRA and force IRA to do non-regional RA.

Here are the results on the huge tests in question.  The testing was
done on a Core i7-2600 with 16GB of memory, to exclude swapping and
make CPU time close to real time (on 8GB, where swapping occurs, LRA
already worked faster than reload in real time because of better
code/data locality).  In the following table, time means CPU time of
all of GCC, size is the text segment size of the generated code (the
data and bss segments are the same for a given test), and RA% means
the percentage of all compiler CPU time spent in RA (IRA+reload or
IRA+LRA), taken from -ftime-report (it is approximate because it is a
sum of approximate numbers, some of which are reported as 0% although
they are not exactly 0%).  The first row of the table is for the
current IRA and reload, the 2nd row is for the current IRA+LRA on the
branch, and the 3rd row is for the current IRA+LRA with the patch.
For lack of space I give data only for -m32 for the first 2 tests
(-m64 has similar results); the 3rd test is for 64 bits because it
cannot be correctly compiled for 32 bits.
time/size/RA%   PR26854               PR37448               PR54146
reload       565.15s 2102964 15%   293.47s 3122140  3%   349.93s 6556630 19%
lra          624.18s 1707580 22%   311.76s 3221620  9%   469.30s 6277934 40%
patched lra  524.51s 1720620  8%   287.83s 3002372  1%   399.32s 6395351 30%

IRA+patched LRA behaves better than IRA+reload from the compilation
time and code size points of view.  Interestingly enough, for PR37448
the regular algorithms result in bigger code than the simpler ones.  I
guess it is because of live-range splitting: it can be profitable but
has a tendency to generate bigger code.

The only remaining issue is PR54146 compilation time for IRA+LRA,
although it was improved significantly.  I will continue working on
PR54146, but now I am going to focus on the proposals from the
reviews.

The patch was successfully bootstrapped on x86/x86-64.  I did it
twice: once with the simpler algorithms always switched on (by making
the #pseudos * #basic blocks threshold very small) and once with the
simpler algorithms used only for huge functions (I believe there are
no such functions in GCC itself).

Committed as rev. 192086.

2012-10-04  Vladimir Makarov

	* lra.h (lra_simple_p): New external.
	* lra.c (lra_simple_p): New global var.
	(lra): Switch off inheritance and coalescing if lra_simple_p.
	* lra-assigns.c (assign_by_spills): Don't try to reassign spilled
	pseudos if lra_simple_p.
	* ira.c (ira): Set up lra_simple_p and ira_conflicts_p.  Set up
	and restore flag_caller_saves and flag_ira_region.

Index: ira.c
===================================================================
--- ira.c	(revision 192048)
+++ ira.c	(working copy)
@@ -4327,8 +4327,26 @@ ira (FILE *f)
   bool loops_p;
   int max_regno_before_ira, ira_max_point_before_emit;
   int rebuild_p;
+  bool saved_flag_caller_saves = flag_caller_saves;
+  enum ira_region saved_flag_ira_region = flag_ira_region;
+
+  ira_conflicts_p = optimize > 0;
 
   ira_use_lra_p = targetm.lra_p ();
+  /* If there are too many pseudos and/or basic blocks (e.g. 10K
+     pseudos and 10K blocks or 100K pseudos and 1K blocks), we will
+     use simplified and faster algorithms in LRA.  */
+  lra_simple_p
+    = (ira_use_lra_p && max_reg_num () >= (1 << 26) / last_basic_block);
+  if (lra_simple_p)
+    {
+      /* It permits to skip live range splitting in LRA.  */
+      flag_caller_saves = false;
+      /* There is no sense to do regional allocation when we use
+         simplified LRA.  */
+      flag_ira_region = IRA_REGION_ONE;
+      ira_conflicts_p = false;
+    }
 
 #ifndef IRA_NO_OBSTACK
   gcc_obstack_init (&ira_obstack);
@@ -4349,7 +4367,6 @@ ira (FILE *f)
       ira_dump_file = stderr;
     }
 
-  ira_conflicts_p = optimize > 0;
   setup_prohibited_mode_move_regs ();
 
   df_note_add_problem ();
@@ -4530,6 +4547,13 @@ ira (FILE *f)
   /* See comment for find_moveable_pseudos call.  */
   if (ira_conflicts_p)
     move_unallocated_pseudos ();
+
+  /* Restore original values.  */
+  if (lra_simple_p)
+    {
+      flag_caller_saves = saved_flag_caller_saves;
+      flag_ira_region = saved_flag_ira_region;
+    }
 }
 
 static void
Index: lra-assigns.c
===================================================================
--- lra-assigns.c	(revision 192050)
+++ lra-assigns.c	(working copy)
@@ -1186,46 +1186,50 @@ assign_by_spills (void)
   improve_inheritance (&changed_pseudo_bitmap);
   bitmap_clear (&non_reload_pseudos);
   bitmap_clear (&changed_insns);
-  /* We should not assign to original pseudos of inheritance pseudos
-     or split pseudos if any its inheritance pseudo did not get hard
-     register or any its split pseudo was not split because undo
-     inheritance/split pass will extend live range of such inheritance
-     or split pseudos.  */
-  bitmap_initialize (&do_not_assign_nonreload_pseudos, &reg_obstack);
-  EXECUTE_IF_SET_IN_BITMAP (&lra_inheritance_pseudos, 0, u, bi)
-    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
-        && reg_renumber[u] < 0 && bitmap_bit_p (&lra_inheritance_pseudos, u))
-      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
-  EXECUTE_IF_SET_IN_BITMAP (&lra_split_pseudos, 0, u, bi)
-    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
-        && reg_renumber[u] >= 0 && bitmap_bit_p (&lra_split_pseudos, u))
-      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
-  for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
-    if (((i < lra_constraint_new_regno_start
-          && ! bitmap_bit_p (&do_not_assign_nonreload_pseudos, i))
-         || (bitmap_bit_p (&lra_inheritance_pseudos, i)
-             && lra_reg_info[i].restore_regno >= 0)
-         || (bitmap_bit_p (&lra_split_pseudos, i)
-             && lra_reg_info[i].restore_regno >= 0)
-         || bitmap_bit_p (&lra_optional_reload_pseudos, i))
-        && reg_renumber[i] < 0 && lra_reg_info[i].nrefs != 0
-        && regno_allocno_class_array[i] != NO_REGS)
-      sorted_pseudos[n++] = i;
-  bitmap_clear (&do_not_assign_nonreload_pseudos);
-  if (n != 0 && lra_dump_file != NULL)
-    fprintf (lra_dump_file, "  Reassing non-reload pseudos\n");
-  qsort (sorted_pseudos, n, sizeof (int), pseudo_compare_func);
-  for (i = 0; i < n; i++)
+  if (! lra_simple_p)
     {
-      regno = sorted_pseudos[i];
-      hard_regno = find_hard_regno_for (regno, &cost, -1);
-      if (hard_regno >= 0)
+      /* We should not assign to original pseudos of inheritance
+         pseudos or split pseudos if any its inheritance pseudo did
+         not get hard register or any its split pseudo was not split
+         because undo inheritance/split pass will extend live range of
+         such inheritance or split pseudos.  */
+      bitmap_initialize (&do_not_assign_nonreload_pseudos, &reg_obstack);
+      EXECUTE_IF_SET_IN_BITMAP (&lra_inheritance_pseudos, 0, u, bi)
+        if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
+            && reg_renumber[u] < 0
+            && bitmap_bit_p (&lra_inheritance_pseudos, u))
+          bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
+      EXECUTE_IF_SET_IN_BITMAP (&lra_split_pseudos, 0, u, bi)
+        if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
+            && reg_renumber[u] >= 0 && bitmap_bit_p (&lra_split_pseudos, u))
+          bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
+      for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
+        if (((i < lra_constraint_new_regno_start
+              && ! bitmap_bit_p (&do_not_assign_nonreload_pseudos, i))
+             || (bitmap_bit_p (&lra_inheritance_pseudos, i)
+                 && lra_reg_info[i].restore_regno >= 0)
+             || (bitmap_bit_p (&lra_split_pseudos, i)
+                 && lra_reg_info[i].restore_regno >= 0)
+             || bitmap_bit_p (&lra_optional_reload_pseudos, i))
+            && reg_renumber[i] < 0 && lra_reg_info[i].nrefs != 0
+            && regno_allocno_class_array[i] != NO_REGS)
+          sorted_pseudos[n++] = i;
+      bitmap_clear (&do_not_assign_nonreload_pseudos);
+      if (n != 0 && lra_dump_file != NULL)
+        fprintf (lra_dump_file, "  Reassing non-reload pseudos\n");
+      qsort (sorted_pseudos, n, sizeof (int), pseudo_compare_func);
+      for (i = 0; i < n; i++)
        {
-          regno = sorted_pseudos[i];
-          hard_regno = find_hard_regno_for (regno, &cost, -1);
-          if (hard_regno >= 0)
+          regno = sorted_pseudos[i];
+          hard_regno = find_hard_regno_for (regno, &cost, -1);
+          if (hard_regno >= 0)
+            {
+              assign_hard_regno (hard_regno, regno);
               /* We change allocation for non-reload pseudo on this
                  iteration -- mark the pseudo for invalidation of used
                  alternatives of insns containing the pseudo.  */
-          bitmap_set_bit (&changed_pseudo_bitmap, regno);
+              bitmap_set_bit (&changed_pseudo_bitmap, regno);
+            }
        }
     }
   free (update_hard_regno_preference_check);
Index: lra.c
===================================================================
--- lra.c	(revision 192050)
+++ lra.c	(working copy)
@@ -2178,8 +2178,14 @@ setup_reg_spill_flag (void)
   lra_reg_spill_p = false;
 }
 
+/* True if the current function is too big to use regular algorithms
+   in LRA.  In other words, we should use simpler and faster algorithms
+   in LRA.  It also means we should not worry about generation code
+   for caller saves.  The value is set up in IRA.  */
+bool lra_simple_p;
+
 /* Major LRA entry function.  F is a file should be used to dump LRA
-   debug info. */
+   debug info.  */
 void
 lra (FILE *f)
 {
@@ -2266,7 +2272,9 @@ lra (FILE *f)
          RS6000_PIC_OFFSET_TABLE_REGNUM uneliminable if we started to
          use a constant pool.  */
       lra_eliminate (false);
-      lra_inheritance ();
+      /* Do inheritance only for regular algorithms.  */
+      if (! lra_simple_p)
+        lra_inheritance ();
       /* We need live ranges for lra_assign -- so build them.  */
       lra_create_live_ranges (true);
       live_p = true;
@@ -2275,10 +2283,16 @@ lra (FILE *f)
             If inheritance pseudos were spilled, the memory-memory
             moves involving them will be removed by pass undoing
             inheritance.  */
-          if (! lra_assign () && lra_coalesce ())
-            live_p = false;
-          if (lra_undo_inheritance ())
-            live_p = false;
+          if (lra_simple_p)
+            lra_assign ();
+          else
+            {
+              /* Do coalescing only for regular algorithms.  */
+              if (! lra_assign () && lra_coalesce ())
+                live_p = false;
+              if (lra_undo_inheritance ())
+                live_p = false;
+            }
        }
       bitmap_clear (&lra_inheritance_pseudos);
       bitmap_clear (&lra_split_pseudos);
Index: lra.h
===================================================================
--- lra.h	(revision 192048)
+++ lra.h	(working copy)
@@ -20,6 +20,8 @@ You should have received a copy of the G
 along with GCC; see the file COPYING3.  If not see
 <http://www.gnu.org/licenses/>.  */
 
+extern bool lra_simple_p;
+
 /* Return the allocno reg class of REGNO.  If it is a reload pseudo,
    the pseudo should finally get hard register of the allocno class.  */