Message ID | AANLkTi=0tNb1FX7XweRCUQBh6OUjNW7f6+vyO0YuYk=z@mail.gmail.com |
---|---|
State | New |
Headers | show |
On Wed, Dec 29, 2010 at 4:32 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >> I think that due to these reasons, the patch should be committed to >> SVN even in this development stage. Even if the algorithm is not >> optimal, the patch demonstrably produces substantially better code. >> This feature has no impact on generic code without -mvzeroupper / >> -mavx switch, and since there are currently very few AVX users, >> negligible overall impact. >> >>> gcc/ >>> >>> 2010-12-18 H.J. Lu <hongjiu.lu@intel.com> >>> >>> PR target/46519 >>> * config/i386/i386.c (block_info_def): Remove referenced, count >>> and rescanned. >>> (move_or_delete_vzeroupper_2): Updated. >>> (move_or_delete_vzeroupper_1): Rewritten to avoid recursive call. >>> (rescan_move_or_delete_vzeroupper): Removed. >>> (move_or_delete_vzeroupper): Repeat processing all basic blocks >>> until no basic block state is changed to used at exit. >>> >>> gcc/testsuite/ >>> >>> 2010-12-18 H.J. Lu <hongjiu.lu@intel.com> >>> >>> PR target/46519 >>> * gfortran.dg/pr46519-2.f90: New. >>> >> >> The patch is OK, but please allow a day or two for RMs (CC'd) to >> eventually comment. > > We will investigate LCM for 4.7. In the meantime, here is a small patch > on top of the current one. If the upper 128bits are never changed in a basic > block, we can skip it in the later passes. OK for trunk together with the > current patch? > > 2010-12-29 H.J. Lu <hongjiu.lu@intel.com> > > * config/i386/i386.c (upper_128bits_state): Update comments. > (block_info_def): Add unchanged. > (move_or_delete_vzeroupper_2): Short circuit if upper 128bits > are unchanged in the block. > OK, but please remove now redundant coments in @@ -60,14 +60,17 @@ along with GCC; see the file COPYING3. If not see enum upper_128bits_state { unknown = 0, /* Unknown. */ - unused, /* Not used or not referenced. */ - used /* Used or referenced. */ + unused, /* Not used. */ + used /* Used. */ }; Thanks, Uros.
On 12/30/2010 3:19 AM, Uros Bizjak wrote: >> We will investigate LCM for 4.7. In the meantime, here is a small patch >> on top of the current one. If the upper 128bits are never changed in a basic >> block, we can skip it in the later passes. OK for trunk together with the >> current patch? For avoidance of doubt, since Uros explicitly asked for RM comments: I have no objections to the patch, if the x86 maintainers are happy with it. However, this comment: >> I'm not sure that the algorithm is correct (and I don't have enough >> experience in this area), so I'd rather leave the review to someone >> else. AFAICS, there can be 20 passes, and from comments, it is >> questionable if this is enough. concerns me. Do someone have confidence that this algorithm is correct, in the sense that we will not generate wrong code? And are we talking about 20 passes over the complete set of basic blocks? That sounds pretty expensive. Thank you,
On Fri, Dec 31, 2010 at 5:05 PM, Mark Mitchell <mark@codesourcery.com> wrote: >>> I'm not sure that the algorithm is correct (and I don't have enough >>> experience in this area), so I'd rather leave the review to someone >>> else. AFAICS, there can be 20 passes, and from comments, it is >>> questionable if this is enough. > > concerns me. > > Do someone have confidence that this algorithm is correct, in the sense > that we will not generate wrong code? And are we talking about 20 > passes over the complete set of basic blocks? That sounds pretty expensive. > I believe algorithm is correct, but probably not optimal. What we want to know is the precise state of the upper 128bits at exit of basic block. It rescans a basic block only if the exit state is unknown and the upper 128bits may be modified in the basic block.
On 12/31/2010 5:38 PM, H.J. Lu wrote: > I believe algorithm is correct, but probably not optimal. What we want to know > is the precise state of the upper 128bits at exit of basic block. It rescans a > basic block only if the exit state is unknown and the upper 128bits may be > modified in the basic block. What is the limit, then, on the number of iterations required for a function with N basic blocks?
On Fri, Dec 31, 2010 at 5:39 PM, Mark Mitchell <mark@codesourcery.com> wrote: > On 12/31/2010 5:38 PM, H.J. Lu wrote: > >> I believe algorithm is correct, but probably not optimal. What we want to know >> is the precise state of the upper 128bits at exit of basic block. It rescans a >> basic block only if the exit state is unknown and the upper 128bits may be >> modified in the basic block. > > What is the limit, then, on the number of iterations required for a > function with N basic blocks? > The limit depends on how complex CFG is. I don't think it will go into an infinite loop. However, I limit it to 20 just in case. I have built SPEC CPU 2K/2006 using -mavx -O2/-O3 without problems.
On 12/31/2010 6:08 PM, H.J. Lu wrote: > The limit depends on how complex CFG is. I don't think it will go into an > infinite loop. HJ, I think you should think about the algorithm in enough detail that you can be sure whether or not it's a terminating algorithm. Even if it is terminating, a limit is probably a good idea to contain compile-time cost for pathological cases, but you're not providing me with confidence that you've got an algorithm that you really understand if you aren't sure whether or not in terminates. > However, I limit it to 20 just in case. 20 what? 20 passes over the entire BB tree? If so, that seems like an awful lot. Please provide a statement like "the run time is at most 20 * (N + E) where N is the number of basic blocks and E is maximum number of outgoing edges from any basic block", i.e., in the standard form given in computer science texts? Thank you,
On Fri, Dec 31, 2010 at 6:17 PM, Mark Mitchell <mark@codesourcery.com> wrote: > On 12/31/2010 6:08 PM, H.J. Lu wrote: > >> The limit depends on how complex CFG is. I don't think it will go into an >> infinite loop. > > HJ, I think you should think about the algorithm in enough detail that > you can be sure whether or not it's a terminating algorithm. Even if it > is terminating, a limit is probably a good idea to contain compile-time > cost for pathological cases, but you're not providing me with confidence > that you've got an algorithm that you really understand if you aren't > sure whether or not in terminates. We can remove a vzeroupper insn if we can determine the upper 128bits of AVX registers are unused at the location. We track the upper 128bits of AVX registers by examining each insn. To determine the state of the upper 128bits of AVX registers at basic block entry, we need to know the exit states of its incoming edges. We start at the function entry with the known state of the upper 128bits of AVX registers. We repeat all basic blocks, in which the upper 128bits of AVX registers are changed, with the unknown entry state, if the upper 128bits of AVX registers of any basic blocks are changed to used at exit. Since the number of basic blocks is fixed and we don't repeat basic blocks with the known entry state, the algorithm should terminate. >> However, I limit it to 20 just in case. > > 20 what? 20 passes over the entire BB tree? If so, that seems like an > awful lot. Please provide a statement like "the run time is at most 20 > * (N + E) where N is the number of basic blocks and E is maximum number > of outgoing edges from any basic block", i.e., in the standard form > given in computer science texts? > The run-tme is at most 20 * N where N is the number of basic blocks since we only look at the exit state of the incoming edges. We repeat the scan until all incoming edges are stabilized.
On 1/1/2011 8:01 AM, H.J. Lu wrote: > We start at the function entry with > the known state of the upper 128bits of AVX registers. We repeat all basic > blocks, in which the upper 128bits of AVX registers are changed, with the > unknown entry state, if the upper 128bits of AVX registers of any basic blocks > are changed to used at exit. Since the number of basic blocks is fixed and > we don't repeat basic blocks with the known entry state, the algorithm should > terminate. OK, that's a good argument. Isn't this just a standard forward data-flow problem? Are we using the machinery we have for such problems? And, in fact, isn't this essentially similar to getting rid of unnecessary sign extensions, which Tom de Vries has been working on? >>> However, I limit it to 20 just in case. > The run-tme is at most 20 * N where N is the number of basic blocks > since we only look at the exit state of the incoming edges. We repeat > the scan until all incoming edges are stabilized. 20 sounds high to me, but I'll leave that to the x86 maintainers since this is an x86-specific pass. If you solve this as a forward data-flow problem (visiting the blocks in the appropriate order), it seems unlikely to me that you'd need more than two or three iterations for the vast majority of non-pathological cases.
On Mon, Jan 3, 2011 at 4:47 PM, Mark Mitchell <mark@codesourcery.com> wrote: > On 1/1/2011 8:01 AM, H.J. Lu wrote: > >> We start at the function entry with >> the known state of the upper 128bits of AVX registers. We repeat all basic >> blocks, in which the upper 128bits of AVX registers are changed, with the >> unknown entry state, if the upper 128bits of AVX registers of any basic blocks >> are changed to used at exit. Since the number of basic blocks is fixed and >> we don't repeat basic blocks with the known entry state, the algorithm should >> terminate. > > OK, that's a good argument. Isn't this just a standard forward > data-flow problem? Are we using the machinery we have for such > problems? And, in fact, isn't this essentially similar to getting rid > of unnecessary sign extensions, which Tom de Vries has been working on? My problem is if the portion of any vector register is live. My understandings are data-flow is expensive and overkill. And I couldn't find a way to get the answer from data-flow for my question. There are 2 different questions: 1. If sign/zero extension on a given GPR is redundant. 2. If upper 128bit of any vector registers is live, zero or non-zero. There are few overlaps. I would appreciate any suggestions. >>>> However, I limit it to 20 just in case. > >> The run-tme is at most 20 * N where N is the number of basic blocks >> since we only look at the exit state of the incoming edges. We repeat >> the scan until all incoming edges are stabilized. > > 20 sounds high to me, but I'll leave that to the x86 maintainers since > this is an x86-specific pass. If you solve this as a forward data-flow > problem (visiting the blocks in the appropriate order), it seems > unlikely to me that you'd need more than two or three iterations for the > vast majority of non-pathological cases. > I counted 8 iterations in some SPEC CPU 2K/2006 benchmarks. Thanks.
On 1/3/2011 7:33 PM, H.J. Lu wrote: >> OK, that's a good argument. Isn't this just a standard forward >> data-flow problem? Are we using the machinery we have for such >> problems? And, in fact, isn't this essentially similar to getting rid >> of unnecessary sign extensions, which Tom de Vries has been working on? > > My problem is if the portion of any vector register is live. My understandings > are data-flow is expensive and overkill. I don't understand this statement. Data flow problems are ones where the output of a basic block is dependent on its inputs along incoming edges and on the behavior of the block itself. That sounds like exactly what you have here. There are standard work-list algorithms for walking through the basic blocks in the right order and iterating only where necessary. Why is that more expensive than walking over all of the blocks?
On Mon, Jan 3, 2011 at 7:41 PM, Mark Mitchell <mark@codesourcery.com> wrote: > On 1/3/2011 7:33 PM, H.J. Lu wrote: > >>> OK, that's a good argument. Isn't this just a standard forward >>> data-flow problem? Are we using the machinery we have for such >>> problems? And, in fact, isn't this essentially similar to getting rid >>> of unnecessary sign extensions, which Tom de Vries has been working on? >> >> My problem is if the portion of any vector register is live. My understandings >> are data-flow is expensive and overkill. > > I don't understand this statement. Data flow problems are ones where > the output of a basic block is dependent on its inputs along incoming > edges and on the behavior of the block itself. That sounds like exactly > what you have here. There are standard work-list algorithms for walking > through the basic blocks in the right order and iterating only where > necessary. Why is that more expensive than walking over all of the blocks? > I tried to use DF. DF can tell me if a register is live or dead at the basic block entry. But what I want to know is if the upper 128bit of a vector register is zero, dead or live, at the basic block entry. Can DF tell me that? Thanks.
On 1/4/2011 2:09 PM, H.J. Lu wrote: > I tried to use DF. DF can tell me if a register is live or dead at the basic > block entry. But what I want to know is if the upper 128bit of a vector > register is zero, dead or live, at the basic block entry. Can DF tell me > that? DF, in the sense of math, can certainly tell you that. I'm not sure about DF, in the sense of code presently in GCC. But, if not, it could be made to do so; what you have meets the requirements for dataflow abstraction. If our infrastructure isn't powerful enough, we should probably make it more powerful.
On Tue, Jan 4, 2011 at 3:35 PM, Mark Mitchell <mark@codesourcery.com> wrote: > On 1/4/2011 2:09 PM, H.J. Lu wrote: > >> I tried to use DF. DF can tell me if a register is live or dead at the basic >> block entry. But what I want to know is if the upper 128bit of a vector >> register is zero, dead or live, at the basic block entry. Can DF tell me >> that? > > DF, in the sense of math, can certainly tell you that. I'm not sure > about DF, in the sense of code presently in GCC. But, if not, it could > be made to do so; what you have meets the requirements for dataflow > abstraction. If our infrastructure isn't powerful enough, we should > probably make it more powerful. > Missing info in the current DF infrastructure: 1. Used/unused registers at basic block boundaries. 2. The accessing modes of dead/live/used/unused registers at basic block boundaries. 3. Zero/sign extension info of live registers at the basic block entry.
On 1/4/2011 3:50 PM, H.J. Lu wrote: > 1. Used/unused registers at basic block boundaries. > 2. The accessing modes of dead/live/used/unused registers at basic > block boundaries. > 3. Zero/sign extension info of live registers at the basic block entry. HJ, I'm not sure what point you're trying to make. My point is that using standard data-flow techniques to solve this problem seems correct. You seem to be saying that our current infrastructure doesn't have everything you need. Presuming that you agree that using standard data-flow techniques is appropriate, that leaves you with two viable options: (a) enhance the infrastructure, (b) use the algorithms, but not the infrastructure.
On Tue, Jan 4, 2011 at 3:53 PM, Mark Mitchell <mark@codesourcery.com> wrote: > On 1/4/2011 3:50 PM, H.J. Lu wrote: > >> 1. Used/unused registers at basic block boundaries. >> 2. The accessing modes of dead/live/used/unused registers at basic >> block boundaries. >> 3. Zero/sign extension info of live registers at the basic block entry. > > HJ, I'm not sure what point you're trying to make. My point is that > using standard data-flow techniques to solve this problem seems correct. > You seem to be saying that our current infrastructure doesn't have > everything you need. Presuming that you agree that using standard That is correct. > data-flow techniques is appropriate, that leaves you with two viable > options: (a) enhance the infrastructure, (b) use the algorithms, but not > the infrastructure. > Enhance the DF infrastructure is beyond my resources. I will take a look at the DF algorithm. Thanks.
On 1/4/2011 4:06 PM, H.J. Lu wrote: > Enhance the DF infrastructure is beyond my resources. I > will take a look at the DF algorithm. Wikipedia (or any good compiler book) should have a good description of the appropriate work-list based algorithms. The basic idea is that you walk the BB tree in the right order (starting at the entry blocks), adding successor blocks to the worklist whenever you change a block. Thank you,
On Tue, Jan 4, 2011 at 4:09 PM, Mark Mitchell <mark@codesourcery.com> wrote: > On 1/4/2011 4:06 PM, H.J. Lu wrote: > >> Enhance the DF infrastructure is beyond my resources. I >> will take a look at the DF algorithm. > > Wikipedia (or any good compiler book) should have a good description of > the appropriate work-list based algorithms. The basic idea is that you > walk the BB tree in the right order (starting at the entry blocks), > adding successor blocks to the worklist whenever you change a block. > Are there any existing GCC passes which implement their own data-flow analysis? Thanks.
On Wed, Jan 05, 2011 at 08:39:51AM -0800, H.J. Lu wrote: > On Tue, Jan 4, 2011 at 4:09 PM, Mark Mitchell <mark@codesourcery.com> wrote: > > On 1/4/2011 4:06 PM, H.J. Lu wrote: > > > >> Enhance the DF infrastructure is beyond my resources. I > >> will take a look at the DF algorithm. > > > > Wikipedia (or any good compiler book) should have a good description of > > the appropriate work-list based algorithms. The basic idea is that you > > walk the BB tree in the right order (starting at the entry blocks), > > adding successor blocks to the worklist whenever you change a block. > > > > Are there any existing GCC passes which implement their own data-flow > analysis? E.g. var-tracking.c (vt_find_locations). Jakub
On Wed, Jan 5, 2011 at 8:46 AM, Jakub Jelinek <jakub@redhat.com> wrote: > On Wed, Jan 05, 2011 at 08:39:51AM -0800, H.J. Lu wrote: >> On Tue, Jan 4, 2011 at 4:09 PM, Mark Mitchell <mark@codesourcery.com> wrote: >> > On 1/4/2011 4:06 PM, H.J. Lu wrote: >> > >> >> Enhance the DF infrastructure is beyond my resources. I >> >> will take a look at the DF algorithm. >> > >> > Wikipedia (or any good compiler book) should have a good description of >> > the appropriate work-list based algorithms. The basic idea is that you >> > walk the BB tree in the right order (starting at the entry blocks), >> > adding successor blocks to the worklist whenever you change a block. >> > >> >> Are there any existing GCC passes which implement their own data-flow >> analysis? > > E.g. var-tracking.c (vt_find_locations). > Thanks.
On 01/05/2011 08:39 AM, H.J. Lu wrote: > Are there any existing GCC passes which implement their own data-flow > analysis? See also walk_dominator_tree and in general domwalk.c. I don't believe you can iterate with walk_dominator_tree, but examining its implementation will show you (1) how to walk the blocks in an optimal order and (2) how to use a worklist to queue blocks for (re-)processing. Alternately, one pass via walk_dominator_tree might give results that are good enough such that you may drop iteration entirely. r~
On Thu, Jan 13, 2011 at 9:49 AM, Richard Henderson <rth@redhat.com> wrote: > On 01/05/2011 08:39 AM, H.J. Lu wrote: >> Are there any existing GCC passes which implement their own data-flow >> analysis? > > See also walk_dominator_tree and in general domwalk.c. > > I don't believe you can iterate with walk_dominator_tree, but > examining its implementation will show you (1) how to walk the > blocks in an optimal order and (2) how to use a worklist to > queue blocks for (re-)processing. > > Alternately, one pass via walk_dominator_tree might give results > that are good enough such that you may drop iteration entirely. > Hi Richard, I implemented the similar work-list based algorithm based on vt_find_locations in var-tracking.c: http://gcc.gnu.org/ml/gcc-patches/2011-01/msg00846.html Does it it look OK? Thanks.
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 28b26ef..2d06c04 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -60,14 +60,17 @@ along with GCC; see the file COPYING3. If not see enum upper_128bits_state { unknown = 0, /* Unknown. */ - unused, /* Not used or not referenced. */ - used /* Used or referenced. */ + unused, /* Not used. */ + used /* Used. */ }; typedef struct block_info_def { - /* State of the upper 128bits of any AVX registers at exit. */ + /* State of the upper 128bits of AVX registers at exit. */ enum upper_128bits_state state; + /* TRUE if state of the upper 128bits of AVX registers is unchanged + in this block. */ + bool unchanged; /* TRUE if block has been processed. */ bool processed; } *block_info; @@ -110,8 +113,7 @@ check_avx256_stores (rtx dest, const_rtx set, void *data) in basic block BB. Delete it if upper 128bit AVX registers are unused. If it isn't deleted, move it to just before a jump insn. - UPPER_128BITS_LIVE is TRUE if the upper 128bits of any AVX registers - are live at entry. */ + STATE is state of the upper 128bits of AVX registers at entry. */ static void move_or_delete_vzeroupper_2 (basic_block bb, @@ -121,11 +123,24 @@ move_or_delete_vzeroupper_2 (basic_block bb, rtx vzeroupper_insn = NULL_RTX; rtx pat; int avx256; + bool unchanged; + + if (BLOCK_INFO (bb)->unchanged) + { + if (dump_file) + fprintf (dump_file, " [bb %i] unchanged: upper 128bits: %d\n", + bb->index, state); + + BLOCK_INFO (bb)->state = state; + return; + } if (dump_file) fprintf (dump_file, " [bb %i] entry: upper 128bits: %d\n", bb->index, state); + unchanged = true; + /* BB_END changes when it is deleted. */ bb_end = BB_END (bb); insn = BB_HEAD (bb); @@ -179,6 +194,7 @@ move_or_delete_vzeroupper_2 (basic_block bb, && XINT (XVECEXP (pat, 0, 0), 1) == UNSPECV_VZEROALL) { state = unused; + unchanged = false; /* Delete pending vzeroupper insertion. */ if (vzeroupper_insn) @@ -189,9 +205,9 @@ move_or_delete_vzeroupper_2 (basic_block bb, } else if (state != used) { - /* No need to call note_stores if the upper 128bits of - AVX registers are never referenced. */ note_stores (pat, check_avx256_stores, &state); + if (state == used) + unchanged = false; } continue; } @@ -205,7 +221,10 @@ move_or_delete_vzeroupper_2 (basic_block bb, 256bit AVX register. We only need to check if callee returns 256bit AVX register. */ if (avx256 == callee_return_avx256) - state = used; + { + state = used; + unchanged = false; + } /* Remove unnecessary vzeroupper since upper 128bits are cleared. */ @@ -236,15 +255,20 @@ move_or_delete_vzeroupper_2 (basic_block bb, delete_insn (insn); } else - vzeroupper_insn = insn; + { + vzeroupper_insn = insn; + unchanged = false; + } } } BLOCK_INFO (bb)->state = state; + BLOCK_INFO (bb)->unchanged = unchanged; if (dump_file) - fprintf (dump_file, " [bb %i] exit: upper 128bits: %d\n", - bb->index, state); + fprintf (dump_file, " [bb %i] exit: %s: upper 128bits: %d\n", + bb->index, unchanged ? "unchanged" : "changed", + state); } /* Helper function for move_or_delete_vzeroupper. Process vzeroupper