From patchwork Wed Dec 4 21:52:28 2019
X-Patchwork-Submitter: Jan Hubicka
X-Patchwork-Id: 1204380
Date: Wed, 4 Dec 2019 22:52:28 +0100
From: Jan Hubicka
To: gcc-patches@gcc.gnu.org
Subject: Add -fprofile-partial-training
Message-ID: <20191204215228.fuuywt3ef3uiqswh@kam.mff.cuni.cz>

Hi,
with the recent fixes to profile updating I noticed that we get more
regressions compared to GCC 9 in Firefox testing.  This is because the
Firefox train run does not cover all the benchmarks, and GCC 9, thanks to
profile-updating bugs, sometimes optimized code for speed even if it was
not trained.  While in general one should have a reasonable train run, in
some cases it is not practical to do so.  For example, the skia library has
vector code optimized for different ISAs, and thus Firefox renders quickly
only if it was trained on the same kind of CPU it later runs on.
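To make that scenario concrete, here is a minimal sketch; the file name,
function names and training setup are hypothetical, only the FDO options
(including the new -fprofile-partial-training) are real:

/* render.c - hypothetical example of a program whose train run cannot
   exercise every path: render_avx2 is only reachable on AVX2-capable
   machines, so a train run on an older CPU leaves its counters at zero.

   Possible workflow:
     gcc -O2 -fprofile-generate render.c -o render
     ./render                     # train run on a non-AVX2 machine
     gcc -O2 -fprofile-use -fprofile-partial-training render.c -o render

   Without -fprofile-partial-training the untrained render_avx2 would be
   optimized for size as never-executed code; with the flag it is optimized
   as if no profile feedback was available.  */

#include <stdio.h>

__attribute__ ((target ("avx2")))
static void render_avx2 (const float *src, float *dst, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i] * 0.5f + 1.0f;
}

static void render_generic (const float *src, float *dst, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i] * 0.5f + 1.0f;
}

int
main (void)
{
  static float src[1024], dst[1024];
  if (__builtin_cpu_supports ("avx2"))
    render_avx2 (src, dst, 1024);    /* Never executed during the train run.  */
  else
    render_generic (src, dst, 1024);
  printf ("%f\n", (double) dst[0]);
  return 0;
}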
This patch adds the flag -fprofile-partial-training, which makes GCC
optimize untrained functions as if -fprofile-use had not been given.  This
nullifies the code size improvements of FDO, but it can be used in cases
where full training is not feasible (and it can be enabled for only parts
of the program).  Previously, the only good answer was to disable profiling
for a given function, but that needs to be done quite precisely and is in
general hard to arrange.

The patch works by
 1) not setting PROFILE_READ for functions with entry count 0,
 2) making the inliner and ipa-cp drop the profile to a local one when all
    trained executions are redirected to clones, and
 3) reducing the quality of branch probabilities of branches leading to
    never-executed regions to GUESSED.  This is necessary to prevent GCC
    from propagating things back.

Bootstrapped/regtested x86_64-linux.  I plan to commit it tomorrow if there
are no complaints.  Feedback is welcome!

Honza

	* cgraphclones.c (localize_profile): New function.
	(cgraph_node::create_clone): Use it for partial profiles.
	* common.opt (fprofile-partial-training): New flag.
	* doc/invoke.texi (-fprofile-partial-training): Document.
	* ipa-cp.c (update_profiling_info): For partial profiles do not
	set function profile to zero.
	* profile.c (compute_branch_probabilities): With partial profile
	watch if edge count is zero and turn all probabilities to guessed.
	(compute_branch_probabilities): For partial profiles do not apply
	profile when entry count is zero.
	* tree-profile.c (tree_profiling): Only do
	value_profile_transformations when profile is read.

Index: cgraphclones.c
===================================================================
--- cgraphclones.c	(revision 278944)
+++ cgraphclones.c	(working copy)
@@ -307,6 +307,22 @@ dump_callgraph_transformation (const cgr
     }
 }
 
+/* Turn profile of N to local profile.  */
+
+static void
+localize_profile (cgraph_node *n)
+{
+  n->count = n->count.guessed_local ();
+  for (cgraph_edge *e = n->callees; e; e=e->next_callee)
+    {
+      e->count = e->count.guessed_local ();
+      if (!e->inline_failed)
+        localize_profile (e->callee);
+    }
+  for (cgraph_edge *e = n->indirect_calls; e; e=e->next_callee)
+    e->count = e->count.guessed_local ();
+}
+
 /* Create node representing clone of N executed COUNT times.  Decrease
    the execution counts from original node too.
    The new clone will have decl set to DECL that may or may not be the same
@@ -340,6 +356,7 @@ cgraph_node::create_clone (tree new_decl
   cgraph_edge *e;
   unsigned i;
   profile_count old_count = count;
+  bool nonzero = count.ipa ().nonzero_p ();
 
   if (new_inlined_to)
     dump_callgraph_transformation (this, new_inlined_to, "inlining to");
@@ -426,6 +446,15 @@ cgraph_node::create_clone (tree new_decl
   if (call_duplication_hook)
     symtab->call_cgraph_duplication_hooks (this, new_node);
 
+  /* With partial train run we do not want to assume that original's
+     count is zero whenever we redirect all executed edges to clone.
+     Simply drop profile to local one in this case.  */
+  if (update_original
+      && opt_for_fn (decl, flag_partial_profile_training)
+      && nonzero
+      && count.ipa_p ()
+      && !count.ipa ().nonzero_p ())
+    localize_profile (this);
   if (!new_inlined_to)
     dump_callgraph_transformation (this, new_node, suffix);
 
Index: common.opt
===================================================================
--- common.opt	(revision 278944)
+++ common.opt	(working copy)
@@ -2160,6 +2160,10 @@ fprofile-generate=
 Common Joined RejectNegative
 Enable common options for generating profile info for profile feedback directed optimizations, and set -fprofile-dir=.
 
+fprofile-partial-training
+Common Report Var(flag_partial_profile_training) Optimization
+Do not assume that functions never executed during the train run are cold.
+
 fprofile-use
 Common Var(flag_profile_use)
 Enable common options for performing profile feedback directed optimizations.

Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 278944)
+++ doc/invoke.texi	(working copy)
@@ -453,8 +453,8 @@ Objective-C and Objective-C++ Dialects}.
 -fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
 -fprefetch-loop-arrays @gol
 -fprofile-correction @gol
--fprofile-use -fprofile-use=@var{path} -fprofile-values @gol
--fprofile-reorder-functions @gol
+-fprofile-use -fprofile-use=@var{path} -fprofile-partial-training @gol
+-fprofile-values -fprofile-reorder-functions @gol
 -freciprocal-math -free -frename-registers -freorder-blocks @gol
 -freorder-blocks-algorithm=@var{algorithm} @gol
 -freorder-blocks-and-partition -freorder-functions @gol
@@ -10634,6 +10634,17 @@ default, GCC emits an error message when
 
 This option is enabled by @option{-fauto-profile}.
 
+@item -fprofile-partial-training
+@opindex fprofile-partial-training
+With @code{-fprofile-use} all portions of programs not executed during the
+train run are optimized aggressively for size rather than speed.  In some
+cases it is not practical to train all possible hot paths in the program.
+(For example, a program may contain functions specific to a given hardware
+and the training may not cover all hardware configurations the program is
+run on.)  With @code{-fprofile-partial-training} profile feedback will be
+ignored for all functions not executed during the train run, leading them
+to be optimized as if they were compiled without profile feedback.
+
 @item -fprofile-use
 @itemx -fprofile-use=@var{path}
 @opindex fprofile-use

Index: ipa-cp.c
===================================================================
--- ipa-cp.c	(revision 278944)
+++ ipa-cp.c	(working copy)
@@ -4295,6 +4295,15 @@ update_profiling_info (struct cgraph_nod
     }
 
   remainder = orig_node_count.combine_with_ipa_count (orig_node_count.ipa ()
                                                       - new_sum.ipa ());
+
+  /* With partial train run we do not want to assume that original's
+     count is zero whenever we redirect all executed edges to clone.
+     Simply drop profile to local one in this case.  */
+  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
+      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
+      && flag_partial_profile_training)
+    remainder = remainder.guessed_local ();
+
   new_sum = orig_node_count.combine_with_ipa_count (new_sum);
   new_node->count = new_sum;
   orig_node->count = remainder;

Index: profile.c
===================================================================
--- profile.c	(revision 278944)
+++ profile.c	(working copy)
@@ -635,9 +635,20 @@ compute_branch_probabilities (unsigned c
         }
       if (bb_gcov_count (bb))
         {
+          bool set_to_guessed = false;
           FOR_EACH_EDGE (e, ei, bb->succs)
-            e->probability = profile_probability::probability_in_gcov_type
-                (edge_gcov_count (e), bb_gcov_count (bb));
+            {
+              bool prev_never = e->probability == profile_probability::never ();
+              e->probability = profile_probability::probability_in_gcov_type
+                  (edge_gcov_count (e), bb_gcov_count (bb));
+              if (e->probability == profile_probability::never ()
+                  && !prev_never
+                  && flag_partial_profile_training)
+                set_to_guessed = true;
+            }
+          if (set_to_guessed)
+            FOR_EACH_EDGE (e, ei, bb->succs)
+              e->probability = e->probability.guessed ();
           if (bb->index >= NUM_FIXED_BLOCKS
               && block_ends_with_condjump_p (bb)
               && EDGE_COUNT (bb->succs) >= 2)
@@ -697,17 +708,23 @@ compute_branch_probabilities (unsigned c
         }
     }
 
-  if (exec_counts)
+  if (exec_counts
+      && (bb_gcov_count (ENTRY_BLOCK_PTR_FOR_FN (cfun))
+          || !flag_partial_profile_training))
     profile_status_for_fn (cfun) = PROFILE_READ;
 
   /* If we have real data, use them!  */
   if (bb_gcov_count (ENTRY_BLOCK_PTR_FOR_FN (cfun))
      || !flag_guess_branch_prob)
    FOR_ALL_BB_FN (bb, cfun)
-      bb->count = profile_count::from_gcov_type (bb_gcov_count (bb));
+      if (bb_gcov_count (bb) || !flag_partial_profile_training)
+        bb->count = profile_count::from_gcov_type (bb_gcov_count (bb));
+      else
+        bb->count = profile_count::guessed_zero ();
   /* If function was not trained, preserve local estimates including
      statically determined zero counts.  */
-  else if (profile_status_for_fn (cfun) == PROFILE_READ)
+  else if (profile_status_for_fn (cfun) == PROFILE_READ
+           && !flag_partial_profile_training)
     FOR_ALL_BB_FN (bb, cfun)
       if (!(bb->count == profile_count::zero ()))
         bb->count = bb->count.global0 ();
@@ -1417,7 +1434,7 @@ branch_prob (bool thunk)
   /* At this moment we have precise loop iteration count estimates.
      Record them to loop structure before the profile gets out of date. */
   FOR_EACH_LOOP (loop, 0)
-    if (loop->header->count > 0)
+    if (loop->header->count > 0 && loop->header->count.reliable_p ())
       {
         gcov_type nit = expected_loop_iterations_unbounded (loop);
         widest_int bound = gcov_type_to_wide_int (nit);

Index: tree-profile.c
===================================================================
--- tree-profile.c	(revision 278944)
+++ tree-profile.c	(working copy)
@@ -785,7 +785,8 @@ tree_profiling (void)
       if (flag_branch_probabilities
           && !thunk
           && flag_profile_values
-          && flag_value_profile_transformations)
+          && flag_value_profile_transformations
+          && profile_status_for_fn (cfun) == PROFILE_READ)
         gimple_value_profile_transformations ();
 
       /* The above could hose dominator info.  Currently there is