diff mbox

PATCH: PR target/46519: Missing vzeroupper

Message ID AANLkTi=0tNb1FX7XweRCUQBh6OUjNW7f6+vyO0YuYk=z@mail.gmail.com
State New
Headers show

Commit Message

H.J. Lu Dec. 29, 2010, 3:32 p.m. UTC
On Wed, Dec 29, 2010 at 1:10 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Sat, Dec 18, 2010 at 7:10 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Sat, Dec 18, 2010 at 9:48 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>> On Fri, Dec 17, 2010 at 8:03 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>>
>>>> This patch fixes another missing vzeroupper.  OK for trunk?
>
>> I'd like to apply this patch instead. It removes escan_move_or_delete_vzeroupper
>> and rewrites move_or_delete_vzeroupper_1 to avoid recursive call. It first scans
>> all basic blocks repeatedly until no basic block changes the upper
>> 128bits of AVX
>> to used at exit.  Then it rescans all basic blocks with unknown upper
>> 128bit state.
>> OK for trunk?
>
> H.J. explained me in a private mail about the importance of this
> patch. I think that the quote below explains it:
>
> <quote>
>> I'm not sure that the algorithm is correct (and I don't have enough
>> experience in this area), so I'd rather leave the review to someone
>> else. AFAICS, there can be 20 passes, and from comments, it is
>> questionable if this is enough.
>
> I tried several benchmarks which failed before my patch.  The most pass
> I saw is 2. I can change it to 2 and re-run SPEC CPU 2K/2006 to find
> out what the smallest pass should be.
>
>> I propose that you commit your previous (simple) patch, since IMO this
>
> My simple patch doesn't work on SPEC CPU 2K/2006. It isn't very
> useful for 4.6.
>
>> one is too invasive for this development stage. However, I still think
>
> The old algorithm is obviously incorrect. The new algorithm removes the
> recursive calls and is simpler/faster than the old one.  vzeroupper optimization
> is a very important new feature for AVX. The current implementation is
> incorrect.  I'd like to fix it before 4.6 is released.
>
>> that LCM infrastructure (see lcm.c) should be used to place
>> vzerouppers at optimum points.
>
> We will investigate LCM for 4.7.
> </qoute>
>
> I think that due to these reasons, the patch should be committed to
> SVN even in this development stage. Even if the algorithm is not
> optimal, the patch demonstrably produces substantially better code.
> This feature has no impact on generic code without -mvzeroupper /
> -mavx switch, and since there are currently very few AVX users,
> negligible overall impact.
>
>> gcc/
>>
>> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>>
>>        PR target/46519
>>        * config/i386/i386.c (block_info_def): Remove referenced, count
>>        and rescanned.
>>        (move_or_delete_vzeroupper_2): Updated.
>>        (move_or_delete_vzeroupper_1): Rewritten to avoid recursive call.
>>        (rescan_move_or_delete_vzeroupper): Removed.
>>        (move_or_delete_vzeroupper): Repeat processing all basic blocks
>>        until no basic block state is changed to used at exit.
>>
>> gcc/testsuite/
>>
>> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>>
>>        PR target/46519
>>        * gfortran.dg/pr46519-2.f90: New.
>>
>
> The patch is OK, but please allow a day or two for RMs (CC'd) to
> eventually comment.

We will investigate LCM for 4.7.  In the meantime, here is  a small patch
on top of the current one. If the upper 128bits are never changed in a basic
block, we can skip it in the later passes.  OK for trunk together with the
current patch?

Thanks.

Comments

Uros Bizjak Dec. 30, 2010, 11:19 a.m. UTC | #1
On Wed, Dec 29, 2010 at 4:32 PM, H.J. Lu <hjl.tools@gmail.com> wrote:

>> I think that due to these reasons, the patch should be committed to
>> SVN even in this development stage. Even if the algorithm is not
>> optimal, the patch demonstrably produces substantially better code.
>> This feature has no impact on generic code without -mvzeroupper /
>> -mavx switch, and since there are currently very few AVX users,
>> negligible overall impact.
>>
>>> gcc/
>>>
>>> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>>>
>>>        PR target/46519
>>>        * config/i386/i386.c (block_info_def): Remove referenced, count
>>>        and rescanned.
>>>        (move_or_delete_vzeroupper_2): Updated.
>>>        (move_or_delete_vzeroupper_1): Rewritten to avoid recursive call.
>>>        (rescan_move_or_delete_vzeroupper): Removed.
>>>        (move_or_delete_vzeroupper): Repeat processing all basic blocks
>>>        until no basic block state is changed to used at exit.
>>>
>>> gcc/testsuite/
>>>
>>> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>>>
>>>        PR target/46519
>>>        * gfortran.dg/pr46519-2.f90: New.
>>>
>>
>> The patch is OK, but please allow a day or two for RMs (CC'd) to
>> eventually comment.
>
> We will investigate LCM for 4.7.  In the meantime, here is  a small patch
> on top of the current one. If the upper 128bits are never changed in a basic
> block, we can skip it in the later passes.  OK for trunk together with the
> current patch?
>
> 2010-12-29  H.J. Lu  <hongjiu.lu@intel.com>
>
>        * config/i386/i386.c (upper_128bits_state): Update comments.
>        (block_info_def): Add unchanged.
>        (move_or_delete_vzeroupper_2): Short circuit if upper 128bits
>        are unchanged in the block.
>

OK, but please remove now redundant coments in

@@ -60,14 +60,17 @@ along with GCC; see the file COPYING3.  If not see
 enum upper_128bits_state
 {
   unknown = 0,		/* Unknown.  */
-  unused,		/* Not used or not referenced.  */
-  used			/* Used or referenced.  */
+  unused,		/* Not used.  */
+  used			/* Used.  */
 };

Thanks,
Uros.
Mark Mitchell Jan. 1, 2011, 1:05 a.m. UTC | #2
On 12/30/2010 3:19 AM, Uros Bizjak wrote:

>> We will investigate LCM for 4.7.  In the meantime, here is  a small patch
>> on top of the current one. If the upper 128bits are never changed in a basic
>> block, we can skip it in the later passes.  OK for trunk together with the
>> current patch?

For avoidance of doubt, since Uros explicitly asked for RM comments: I
have no objections to the patch, if the x86 maintainers are happy with it.

However, this comment:

>> I'm not sure that the algorithm is correct (and I don't have enough
>> experience in this area), so I'd rather leave the review to someone
>> else. AFAICS, there can be 20 passes, and from comments, it is
>> questionable if this is enough.

concerns me.

Do someone have confidence that this algorithm is correct, in the sense
that we will not generate wrong code?  And are we talking about 20
passes over the complete set of basic blocks?  That sounds pretty expensive.

Thank you,
H.J. Lu Jan. 1, 2011, 1:38 a.m. UTC | #3
On Fri, Dec 31, 2010 at 5:05 PM, Mark Mitchell <mark@codesourcery.com> wrote:

>>> I'm not sure that the algorithm is correct (and I don't have enough
>>> experience in this area), so I'd rather leave the review to someone
>>> else. AFAICS, there can be 20 passes, and from comments, it is
>>> questionable if this is enough.
>
> concerns me.
>
> Do someone have confidence that this algorithm is correct, in the sense
> that we will not generate wrong code?  And are we talking about 20
> passes over the complete set of basic blocks?  That sounds pretty expensive.
>

I believe algorithm is correct, but probably not optimal. What we want to know
is the precise state of the upper 128bits at exit of basic block.  It rescans a
basic block only if the exit state is unknown and the upper 128bits may be
modified in the basic block.
Mark Mitchell Jan. 1, 2011, 1:39 a.m. UTC | #4
On 12/31/2010 5:38 PM, H.J. Lu wrote:

> I believe algorithm is correct, but probably not optimal. What we want to know
> is the precise state of the upper 128bits at exit of basic block.  It rescans a
> basic block only if the exit state is unknown and the upper 128bits may be
> modified in the basic block.

What is the limit, then, on the number of iterations required for a
function with N basic blocks?
H.J. Lu Jan. 1, 2011, 2:08 a.m. UTC | #5
On Fri, Dec 31, 2010 at 5:39 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 12/31/2010 5:38 PM, H.J. Lu wrote:
>
>> I believe algorithm is correct, but probably not optimal. What we want to know
>> is the precise state of the upper 128bits at exit of basic block.  It rescans a
>> basic block only if the exit state is unknown and the upper 128bits may be
>> modified in the basic block.
>
> What is the limit, then, on the number of iterations required for a
> function with N basic blocks?
>

The limit depends on how complex CFG is. I don't think it will go into an
infinite loop.  However, I limit it to 20 just in case. I have built SPEC CPU
2K/2006 using -mavx -O2/-O3 without problems.
Mark Mitchell Jan. 1, 2011, 2:17 a.m. UTC | #6
On 12/31/2010 6:08 PM, H.J. Lu wrote:

> The limit depends on how complex CFG is. I don't think it will go into an
> infinite loop.

HJ, I think you should think about the algorithm in enough detail that
you can be sure whether or not it's a terminating algorithm.  Even if it
is terminating, a limit is probably a good idea to contain compile-time
cost for pathological cases, but you're not providing me with confidence
that you've got an algorithm that you really understand if you aren't
sure whether or not in terminates.

> However, I limit it to 20 just in case.

20 what?  20 passes over the entire BB tree?  If so, that seems like an
awful lot.  Please provide a statement like "the run time is at most 20
* (N + E) where N is the number of basic blocks and E is maximum number
of outgoing edges from any basic block", i.e., in the standard form
given in computer science texts?

Thank you,
H.J. Lu Jan. 1, 2011, 4:01 p.m. UTC | #7
On Fri, Dec 31, 2010 at 6:17 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 12/31/2010 6:08 PM, H.J. Lu wrote:
>
>> The limit depends on how complex CFG is. I don't think it will go into an
>> infinite loop.
>
> HJ, I think you should think about the algorithm in enough detail that
> you can be sure whether or not it's a terminating algorithm.  Even if it
> is terminating, a limit is probably a good idea to contain compile-time
> cost for pathological cases, but you're not providing me with confidence
> that you've got an algorithm that you really understand if you aren't
> sure whether or not in terminates.

We can remove a vzeroupper insn if we can determine the upper 128bits
of AVX registers are unused at the location. We track the upper 128bits
of AVX registers by examining each insn. To determine the state of  the
upper 128bits of AVX registers at basic block entry, we need to know the
exit states of its incoming edges. We start at the function entry with
the known state of the upper 128bits of AVX registers.  We repeat all basic
blocks, in which the upper 128bits of AVX registers are changed, with the
unknown entry state, if the upper 128bits of AVX registers of any basic blocks
are changed to used at exit.  Since the number of basic blocks is fixed and
we don't repeat basic blocks with the known entry state, the algorithm should
terminate.

>> However, I limit it to 20 just in case.
>
> 20 what?  20 passes over the entire BB tree?  If so, that seems like an
> awful lot.  Please provide a statement like "the run time is at most 20
> * (N + E) where N is the number of basic blocks and E is maximum number
> of outgoing edges from any basic block", i.e., in the standard form
> given in computer science texts?
>

The run-tme is at most 20 * N where N is the number of basic blocks
since we only look at the exit state of the incoming edges. We repeat
the scan until all incoming edges are stabilized.
Mark Mitchell Jan. 4, 2011, 12:47 a.m. UTC | #8
On 1/1/2011 8:01 AM, H.J. Lu wrote:

> We start at the function entry with
> the known state of the upper 128bits of AVX registers.  We repeat all basic
> blocks, in which the upper 128bits of AVX registers are changed, with the
> unknown entry state, if the upper 128bits of AVX registers of any basic blocks
> are changed to used at exit.  Since the number of basic blocks is fixed and
> we don't repeat basic blocks with the known entry state, the algorithm should
> terminate.

OK, that's a good argument.  Isn't this just a standard forward
data-flow problem?  Are we using the machinery we have for such
problems?  And, in fact, isn't this essentially similar to getting rid
of unnecessary sign extensions, which Tom de Vries has been working on?

>>> However, I limit it to 20 just in case.

> The run-tme is at most 20 * N where N is the number of basic blocks
> since we only look at the exit state of the incoming edges. We repeat
> the scan until all incoming edges are stabilized.

20 sounds high to me, but I'll leave that to the x86 maintainers since
this is an x86-specific pass.  If you solve this as a forward data-flow
problem (visiting the blocks in the appropriate order), it seems
unlikely to me that you'd need more than two or three iterations for the
vast majority of non-pathological cases.
H.J. Lu Jan. 4, 2011, 3:33 a.m. UTC | #9
On Mon, Jan 3, 2011 at 4:47 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 1/1/2011 8:01 AM, H.J. Lu wrote:
>
>> We start at the function entry with
>> the known state of the upper 128bits of AVX registers.  We repeat all basic
>> blocks, in which the upper 128bits of AVX registers are changed, with the
>> unknown entry state, if the upper 128bits of AVX registers of any basic blocks
>> are changed to used at exit.  Since the number of basic blocks is fixed and
>> we don't repeat basic blocks with the known entry state, the algorithm should
>> terminate.
>
> OK, that's a good argument.  Isn't this just a standard forward
> data-flow problem?  Are we using the machinery we have for such
> problems?  And, in fact, isn't this essentially similar to getting rid
> of unnecessary sign extensions, which Tom de Vries has been working on?

My problem is if the portion of any vector register is live. My understandings
are data-flow is expensive and overkill. And I couldn't find a way to get the
answer from data-flow for my question.

There are 2 different questions:

1. If sign/zero extension on a given GPR is redundant.
2. If upper 128bit of any vector registers is live, zero or non-zero.

There are few overlaps.  I would appreciate any suggestions.

>>>> However, I limit it to 20 just in case.
>
>> The run-tme is at most 20 * N where N is the number of basic blocks
>> since we only look at the exit state of the incoming edges. We repeat
>> the scan until all incoming edges are stabilized.
>
> 20 sounds high to me, but I'll leave that to the x86 maintainers since
> this is an x86-specific pass.  If you solve this as a forward data-flow
> problem (visiting the blocks in the appropriate order), it seems
> unlikely to me that you'd need more than two or three iterations for the
> vast majority of non-pathological cases.
>

I counted 8 iterations in some SPEC CPU 2K/2006 benchmarks.

Thanks.
Mark Mitchell Jan. 4, 2011, 3:41 a.m. UTC | #10
On 1/3/2011 7:33 PM, H.J. Lu wrote:

>> OK, that's a good argument.  Isn't this just a standard forward
>> data-flow problem?  Are we using the machinery we have for such
>> problems?  And, in fact, isn't this essentially similar to getting rid
>> of unnecessary sign extensions, which Tom de Vries has been working on?
> 
> My problem is if the portion of any vector register is live. My understandings
> are data-flow is expensive and overkill.

I don't understand this statement.  Data flow problems are ones where
the output of a basic block is dependent on its inputs along incoming
edges and on the behavior of the block itself.  That sounds like exactly
what you have here.  There are standard work-list algorithms for walking
through the basic blocks in the right order and iterating only where
necessary.  Why is that more expensive than walking over all of the blocks?
H.J. Lu Jan. 4, 2011, 10:09 p.m. UTC | #11
On Mon, Jan 3, 2011 at 7:41 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 1/3/2011 7:33 PM, H.J. Lu wrote:
>
>>> OK, that's a good argument.  Isn't this just a standard forward
>>> data-flow problem?  Are we using the machinery we have for such
>>> problems?  And, in fact, isn't this essentially similar to getting rid
>>> of unnecessary sign extensions, which Tom de Vries has been working on?
>>
>> My problem is if the portion of any vector register is live. My understandings
>> are data-flow is expensive and overkill.
>
> I don't understand this statement.  Data flow problems are ones where
> the output of a basic block is dependent on its inputs along incoming
> edges and on the behavior of the block itself.  That sounds like exactly
> what you have here.  There are standard work-list algorithms for walking
> through the basic blocks in the right order and iterating only where
> necessary.  Why is that more expensive than walking over all of the blocks?
>

I tried to use DF.  DF can tell me if a register is live or dead at the basic
block entry.  But what I want to know is if the upper 128bit of a vector
register is zero, dead or live, at the basic block entry.  Can DF tell me
that?

Thanks.
Mark Mitchell Jan. 4, 2011, 11:35 p.m. UTC | #12
On 1/4/2011 2:09 PM, H.J. Lu wrote:

> I tried to use DF.  DF can tell me if a register is live or dead at the basic
> block entry.  But what I want to know is if the upper 128bit of a vector
> register is zero, dead or live, at the basic block entry.  Can DF tell me
> that?

DF, in the sense of math, can certainly tell you that.  I'm not sure
about DF, in the sense of code presently in GCC.  But, if not, it could
be made to do so; what you have meets the requirements for dataflow
abstraction.  If our infrastructure isn't powerful enough, we should
probably make it more powerful.
H.J. Lu Jan. 4, 2011, 11:50 p.m. UTC | #13
On Tue, Jan 4, 2011 at 3:35 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 1/4/2011 2:09 PM, H.J. Lu wrote:
>
>> I tried to use DF.  DF can tell me if a register is live or dead at the basic
>> block entry.  But what I want to know is if the upper 128bit of a vector
>> register is zero, dead or live, at the basic block entry.  Can DF tell me
>> that?
>
> DF, in the sense of math, can certainly tell you that.  I'm not sure
> about DF, in the sense of code presently in GCC.  But, if not, it could
> be made to do so; what you have meets the requirements for dataflow
> abstraction.  If our infrastructure isn't powerful enough, we should
> probably make it more powerful.
>

Missing info in the current DF infrastructure:

1. Used/unused registers at basic block boundaries.
2. The accessing modes of dead/live/used/unused registers at basic
block boundaries.
3. Zero/sign extension info of live registers at the basic block entry.
Mark Mitchell Jan. 4, 2011, 11:53 p.m. UTC | #14
On 1/4/2011 3:50 PM, H.J. Lu wrote:

> 1. Used/unused registers at basic block boundaries.
> 2. The accessing modes of dead/live/used/unused registers at basic
> block boundaries.
> 3. Zero/sign extension info of live registers at the basic block entry.

HJ, I'm not sure what point you're trying to make.  My point is that
using standard data-flow techniques to solve this problem seems correct.
 You seem to be saying that our current infrastructure doesn't have
everything you need.  Presuming that you agree that using standard
data-flow techniques is appropriate, that leaves you with two viable
options: (a) enhance the infrastructure, (b) use the algorithms, but not
the infrastructure.
H.J. Lu Jan. 5, 2011, 12:06 a.m. UTC | #15
On Tue, Jan 4, 2011 at 3:53 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 1/4/2011 3:50 PM, H.J. Lu wrote:
>
>> 1. Used/unused registers at basic block boundaries.
>> 2. The accessing modes of dead/live/used/unused registers at basic
>> block boundaries.
>> 3. Zero/sign extension info of live registers at the basic block entry.
>
> HJ, I'm not sure what point you're trying to make.  My point is that
> using standard data-flow techniques to solve this problem seems correct.
>  You seem to be saying that our current infrastructure doesn't have
> everything you need.  Presuming that you agree that using standard

That is correct.

> data-flow techniques is appropriate, that leaves you with two viable
> options: (a) enhance the infrastructure, (b) use the algorithms, but not
> the infrastructure.
>

Enhance the DF infrastructure is beyond my resources.  I
will take a look at the DF algorithm.

Thanks.
Mark Mitchell Jan. 5, 2011, 12:09 a.m. UTC | #16
On 1/4/2011 4:06 PM, H.J. Lu wrote:

> Enhance the DF infrastructure is beyond my resources.  I
> will take a look at the DF algorithm.

Wikipedia (or any good compiler book) should have a good description of
the appropriate work-list based algorithms.  The basic idea is that you
walk the BB tree in the right order (starting at the entry blocks),
adding successor blocks to the worklist whenever you change a block.

Thank you,
H.J. Lu Jan. 5, 2011, 4:39 p.m. UTC | #17
On Tue, Jan 4, 2011 at 4:09 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> On 1/4/2011 4:06 PM, H.J. Lu wrote:
>
>> Enhance the DF infrastructure is beyond my resources.  I
>> will take a look at the DF algorithm.
>
> Wikipedia (or any good compiler book) should have a good description of
> the appropriate work-list based algorithms.  The basic idea is that you
> walk the BB tree in the right order (starting at the entry blocks),
> adding successor blocks to the worklist whenever you change a block.
>

Are there any existing GCC passes which implement their own data-flow
analysis?

Thanks.
Jakub Jelinek Jan. 5, 2011, 4:46 p.m. UTC | #18
On Wed, Jan 05, 2011 at 08:39:51AM -0800, H.J. Lu wrote:
> On Tue, Jan 4, 2011 at 4:09 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> > On 1/4/2011 4:06 PM, H.J. Lu wrote:
> >
> >> Enhance the DF infrastructure is beyond my resources.  I
> >> will take a look at the DF algorithm.
> >
> > Wikipedia (or any good compiler book) should have a good description of
> > the appropriate work-list based algorithms.  The basic idea is that you
> > walk the BB tree in the right order (starting at the entry blocks),
> > adding successor blocks to the worklist whenever you change a block.
> >
> 
> Are there any existing GCC passes which implement their own data-flow
> analysis?

E.g. var-tracking.c (vt_find_locations).

	Jakub
H.J. Lu Jan. 5, 2011, 10:54 p.m. UTC | #19
On Wed, Jan 5, 2011 at 8:46 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Jan 05, 2011 at 08:39:51AM -0800, H.J. Lu wrote:
>> On Tue, Jan 4, 2011 at 4:09 PM, Mark Mitchell <mark@codesourcery.com> wrote:
>> > On 1/4/2011 4:06 PM, H.J. Lu wrote:
>> >
>> >> Enhance the DF infrastructure is beyond my resources.  I
>> >> will take a look at the DF algorithm.
>> >
>> > Wikipedia (or any good compiler book) should have a good description of
>> > the appropriate work-list based algorithms.  The basic idea is that you
>> > walk the BB tree in the right order (starting at the entry blocks),
>> > adding successor blocks to the worklist whenever you change a block.
>> >
>>
>> Are there any existing GCC passes which implement their own data-flow
>> analysis?
>
> E.g. var-tracking.c (vt_find_locations).
>

Thanks.
Richard Henderson Jan. 13, 2011, 5:49 p.m. UTC | #20
On 01/05/2011 08:39 AM, H.J. Lu wrote:
> Are there any existing GCC passes which implement their own data-flow
> analysis?

See also walk_dominator_tree and in general domwalk.c.

I don't believe you can iterate with walk_dominator_tree, but
examining its implementation will show you (1) how to walk the
blocks in an optimal order and (2) how to use a worklist to
queue blocks for (re-)processing.

Alternately, one pass via walk_dominator_tree might give results
that are good enough such that you may drop iteration entirely.


r~
H.J. Lu Jan. 13, 2011, 6 p.m. UTC | #21
On Thu, Jan 13, 2011 at 9:49 AM, Richard Henderson <rth@redhat.com> wrote:
> On 01/05/2011 08:39 AM, H.J. Lu wrote:
>> Are there any existing GCC passes which implement their own data-flow
>> analysis?
>
> See also walk_dominator_tree and in general domwalk.c.
>
> I don't believe you can iterate with walk_dominator_tree, but
> examining its implementation will show you (1) how to walk the
> blocks in an optimal order and (2) how to use a worklist to
> queue blocks for (re-)processing.
>
> Alternately, one pass via walk_dominator_tree might give results
> that are good enough such that you may drop iteration entirely.
>

Hi Richard,

I implemented the similar work-list based algorithm based on
vt_find_locations in var-tracking.c:

http://gcc.gnu.org/ml/gcc-patches/2011-01/msg00846.html

Does it it look OK?

Thanks.
diff mbox

Patch

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 28b26ef..2d06c04 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -60,14 +60,17 @@  along with GCC; see the file COPYING3.  If not see
 enum upper_128bits_state
 {
   unknown = 0,		/* Unknown.  */
-  unused,		/* Not used or not referenced.  */
-  used			/* Used or referenced.  */
+  unused,		/* Not used.  */
+  used			/* Used.  */
 };
 
 typedef struct block_info_def
 {
-  /* State of the upper 128bits of any AVX registers at exit.  */
+  /* State of the upper 128bits of AVX registers at exit.  */
   enum upper_128bits_state state;
+  /* TRUE if state of the upper 128bits of AVX registers is unchanged
+     in this block.  */
+  bool unchanged;
   /* TRUE if block has been processed.  */
   bool processed;
 } *block_info;
@@ -110,8 +113,7 @@  check_avx256_stores (rtx dest, const_rtx set, void *data)
    in basic block BB.  Delete it if upper 128bit AVX registers are
    unused.  If it isn't deleted, move it to just before a jump insn.
    
-   UPPER_128BITS_LIVE is TRUE if the upper 128bits of any AVX registers
-   are live at entry.  */
+   STATE is state of the upper 128bits of AVX registers at entry.  */
 
 static void
 move_or_delete_vzeroupper_2 (basic_block bb,
@@ -121,11 +123,24 @@  move_or_delete_vzeroupper_2 (basic_block bb,
   rtx vzeroupper_insn = NULL_RTX;
   rtx pat;
   int avx256;
+  bool unchanged;
+
+  if (BLOCK_INFO (bb)->unchanged)
+    {
+      if (dump_file)
+	fprintf (dump_file, " [bb %i] unchanged: upper 128bits: %d\n",
+		 bb->index, state);
+
+      BLOCK_INFO (bb)->state = state;
+      return;
+    }
 
   if (dump_file)
     fprintf (dump_file, " [bb %i] entry: upper 128bits: %d\n",
 	     bb->index, state);
 
+  unchanged = true;
+
   /* BB_END changes when it is deleted.  */
   bb_end = BB_END (bb);
   insn = BB_HEAD (bb);
@@ -179,6 +194,7 @@  move_or_delete_vzeroupper_2 (basic_block bb,
 	      && XINT (XVECEXP (pat, 0, 0), 1) == UNSPECV_VZEROALL)
 	    {
 	      state = unused;
+	      unchanged = false;
 
 	      /* Delete pending vzeroupper insertion.  */
 	      if (vzeroupper_insn)
@@ -189,9 +205,9 @@  move_or_delete_vzeroupper_2 (basic_block bb,
 	    }
 	  else if (state != used)
 	    {
-	      /* No need to call note_stores if the upper 128bits of
-		 AVX registers are never referenced.  */
 	      note_stores (pat, check_avx256_stores, &state);
+	      if (state == used)
+		unchanged = false;
 	    }
 	  continue;
 	}
@@ -205,7 +221,10 @@  move_or_delete_vzeroupper_2 (basic_block bb,
 	     256bit AVX register.  We only need to check if callee
 	     returns 256bit AVX register.  */
 	  if (avx256 == callee_return_avx256)
-	    state = used;
+	    {
+	      state = used;
+	      unchanged = false;
+	    }
 
 	  /* Remove unnecessary vzeroupper since upper 128bits are
 	     cleared.  */
@@ -236,15 +255,20 @@  move_or_delete_vzeroupper_2 (basic_block bb,
 	      delete_insn (insn);
 	    }
 	  else
-	    vzeroupper_insn = insn;
+	    {
+	      vzeroupper_insn = insn;
+	      unchanged = false;
+	    }
 	}
     }
 
   BLOCK_INFO (bb)->state = state;
+  BLOCK_INFO (bb)->unchanged = unchanged;
 
   if (dump_file)
-    fprintf (dump_file, " [bb %i] exit: upper 128bits: %d\n",
-	     bb->index, state);
+    fprintf (dump_file, " [bb %i] exit: %s: upper 128bits: %d\n",
+	     bb->index, unchanged ? "unchanged" : "changed",
+	     state);
 }
 
 /* Helper function for move_or_delete_vzeroupper.  Process vzeroupper