PATCH: PR target/46519: Missing vzeroupper

On Wed, Dec 29, 2010 at 1:10 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Sat, Dec 18, 2010 at 7:10 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Sat, Dec 18, 2010 at 9:48 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>> On Fri, Dec 17, 2010 at 8:03 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>>
>>>> This patch fixes another missing vzeroupper.  OK for trunk?
>
>> I'd like to apply this patch instead. It removes escan_move_or_delete_vzeroupper
>> and rewrites move_or_delete_vzeroupper_1 to avoid recursive call. It first scans
>> all basic blocks repeatedly until no basic block changes the upper
>> 128bits of AVX
>> to used at exit.  Then it rescans all basic blocks with unknown upper
>> 128bit state.
>> OK for trunk?
>
> H.J. explained me in a private mail about the importance of this
> patch. I think that the quote below explains it:
>
> <quote>
>> I'm not sure that the algorithm is correct (and I don't have enough
>> experience in this area), so I'd rather leave the review to someone
>> else. AFAICS, there can be 20 passes, and from comments, it is
>> questionable if this is enough.
>
> I tried several benchmarks which failed before my patch.  The most pass
> I saw is 2. I can change it to 2 and re-run SPEC CPU 2K/2006 to find
> out what the smallest pass should be.
>
>> I propose that you commit your previous (simple) patch, since IMO this
>
> My simple patch doesn't work on SPEC CPU 2K/2006. It isn't very
> useful for 4.6.
>
>> one is too invasive for this development stage. However, I still think
>
> The old algorithm is obviously incorrect. The new algorithm removes the
> recursive calls and is simpler/faster than the old one.  vzeroupper optimization
> is a very important new feature for AVX. The current implementation is
> incorrect.  I'd like to fix it before 4.6 is released.
>
>> that LCM infrastructure (see lcm.c) should be used to place
>> vzerouppers at optimum points.
>
> We will investigate LCM for 4.7.
> </qoute>
>
> I think that due to these reasons, the patch should be committed to
> SVN even in this development stage. Even if the algorithm is not
> optimal, the patch demonstrably produces substantially better code.
> This feature has no impact on generic code without -mvzeroupper /
> -mavx switch, and since there are currently very few AVX users,
> negligible overall impact.
>
>> gcc/
>>
>> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>>
>>        PR target/46519
>>        * config/i386/i386.c (block_info_def): Remove referenced, count
>>        and rescanned.
>>        (move_or_delete_vzeroupper_2): Updated.
>>        (move_or_delete_vzeroupper_1): Rewritten to avoid recursive call.
>>        (rescan_move_or_delete_vzeroupper): Removed.
>>        (move_or_delete_vzeroupper): Repeat processing all basic blocks
>>        until no basic block state is changed to used at exit.
>>
>> gcc/testsuite/
>>
>> 2010-12-18  H.J. Lu  <hongjiu.lu@intel.com>
>>
>>        PR target/46519
>>        * gfortran.dg/pr46519-2.f90: New.
>>
>
> The patch is OK, but please allow a day or two for RMs (CC'd) to
> eventually comment.

We will investigate LCM for 4.7.  In the meantime, here is  a small patch
on top of the current one. If the upper 128bits are never changed in a basic
block, we can skip it in the later passes.  OK for trunk together with the
current patch?

Thanks.

PATCH: PR target/46519: Missing vzeroupper

Commit Message

Comments

Patch