diff mbox

PATCH: PR target/46519: Missing vzeroupper

Message ID AANLkTikJE7EB+O_f_v3akJNnwv+_pwh5NjwAPm55tMXG@mail.gmail.com
State New
Headers show

Commit Message

H.J. Lu Nov. 18, 2010, 6:18 a.m. UTC
On Wed, Nov 17, 2010 at 8:15 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Wed, Nov 17, 2010 at 8:11 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Wed, Nov 17, 2010 at 3:36 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Wed, Nov 17, 2010 at 11:23 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>> On Wed, Nov 17, 2010 at 2:33 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>> On Tue, Nov 16, 2010 at 11:51 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>>>> On Wed, Nov 17, 2010 at 7:44 AM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>>>>>
>>>>>>> insn != BB_END (bb) && NEXT_INSN (insn) == NEXT_INSN (BB_END (bb))
>>>>>>>
>>>>>>> We should check NEXT_INSN (insn) != NEXT_INSN (BB_END (bb)) in
>>>>>>> move_or_delete_vzeroupper_2.  This patch does it.
>>>>>>
>>>>>> Huh? The loop does simple linear scan of all insns in the bb, so it
>>>>>> can't miss BB_END. IIUC, in your case the bb does not have BB_END
>>>>>> (bb), but it has NEXT_INSN (BB_END (bb))?
>>>>>
>>>>> It has BB_END, but it won't be visited by NEXT_INSN starting from
>>>>> BB_HEAD. insn != NEXT_INSN (BB_END (bb)) is used to check the
>>>>> end of the BB everywhere in gcc.
>>>>>
>>>>>> Can you please provide a test case that illustrates this?
>>>>>>
>>>>>
>>>>> I am enclosing a work in progress.  We noticed that we are
>>>>> missing a few vzerouppers at -O3 on SPEC CPU 2K/2006.
>>>>> One isssue is we may have
>>>>>
>>>>> foo:
>>>>>
>>>>>       call bar <<<<< Missing vzeroupper
>>>>>
>>>>>       256bit vectorized insn
>>>>>       goto foo
>>>>>
>>>>> We miss vzeroupper before call bar.  We don't have a small testcase.
>>>>> But this patch fixes this case by inspection. We are checking other
>>>>> cases.
>>>>
>>>> @@ -118,9 +118,12 @@ move_or_delete_vzeroupper_2 (basic_block bb, bool
>>>> upper_128bits_set)
>>>>             bb->index, upper_128bits_set);
>>>>
>>>>   insn = BB_HEAD (bb);
>>>> +  last = NEXT_INSN (BB_END (bb));
>>>>   while (insn != BB_END (bb))
>>>>     {
>>>>       insn = NEXT_INSN (insn);
>>>> +      if (insn == last)
>>>> +       break;
>>>>
>>>>       if (!NONDEBUG_INSN_P (insn))
>>>>        continue;
>>>>
>>>> The change above is not needed. The new check is never triggered - the
>>>> loop terminates when "insn == BB_END (bb)" at "while", so I fail to
>>>> see why additional termination for "NEXT_INSN (insn) == NEXT_INSN
>>>> (BB_END (bb))" is needed.
>>>
>>> Here is the patch for
>>>
>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>
>>> We have 2 blocks pointing to each others. This patch first scans
>>> all blocks without moving vzeroupper so that we can have accurate
>>> information about upper 128bits at block entry.
>>>
>>>> (The BB_HEAD (bb) is either a NOTE or CODE_LABEL so it can be skipped
>>>> with NEXT_INSN.)
>>>
>>> Please try gcc.target/i386/avx-vzeroupper-20.c.  It will
>>> trigger this condition.
>>>
>>>> @@ -10970,7 +10973,7 @@ ix86_expand_epilogue (int style)
>>>>
>>>>   /* Emit vzeroupper if needed.  */
>>>>   if (TARGET_VZEROUPPER
>>>> -      && cfun->machine->use_avx256_p
>>>> +      && (cfun->machine->use_avx256_p || flag_tree_vectorize)
>>>>       && !cfun->machine->caller_return_avx256_p)
>>>>     {
>>>>       cfun->machine->use_vzeroupper_p = 1;
>>>> @@ -21661,7 +21664,8 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
>>>>     }
>>>>
>>>>   /* Add UNSPEC_CALL_NEEDS_VZEROUPPER decoration.  */
>>>> -  if (TARGET_VZEROUPPER && cfun->machine->use_avx256_p)
>>>> +  if (TARGET_VZEROUPPER
>>>> +      && (cfun->machine->use_avx256_p || flag_tree_vectorize))
>>>>
>>>> Decorate *ALL* calls with CALL_NEEDS_VZEROUPPER with
>>>> -ftree-vectorize?! It looks that parts (or state machine) that set
>>>> ...->use_avx256_p flag should be fixed.
>>>
>>> There are:
>>>
>>> foo:
>>>
>>>      call bar <<<<< Missing vzeroupper
>>>
>>>      256bit vectorized insn
>>>      goto foo
>>>
>>> I couldn't find a hook to set use_avx256_p before RTL expansion
>>> starts.
>>>
>>>>     {
>>>>       rtx unspec;
>>>>       int avx256;
>>>> diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
>>>> b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
>>>> new file mode 100644
>>>> index 0000000..3301083
>>>> --- /dev/null
>>>> +++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
>>>> @@ -0,0 +1,13 @@
>>>> +/* { dg-do compile } */
>>>> +/* { dg-options "-O3 -mavx -mtune=generic -dp" } */
>>>> +
>>>> +extern void free (void *);
>>>> +void
>>>> +bar (void *ncstrp)
>>>> +{
>>>> +  if(ncstrp==((void *)0))
>>>> +    return;
>>>> +  free(ncstrp);
>>>> +}
>>>> +
>>>> +/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
>>>>
>>>> Hm, this testcase doesn't go together with the above change. There is
>>>> no vectorization involved, and the scan checks that vzeroupper is NOT
>>>> emitted.
>>>>
>>>
>>> This testcase is for
>>>
>>> insn != BB_END (bb) && NEXT_INSN (insn) == NEXT_INSN (BB_END (bb))
>>>
>
> I sent the patch without comments too soon.
>
> As discussed in PR, setting and checking use_avx256_p isn't reliable.
> This patch removes use_avx256_p.  Any comments?
>
> Thanks.
>
>
> --
> H.J.
> ---
> gcc/
>
> 2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR target/46519
>        * config/i386/i386.c (block_info_def): Add scanned and no_avx256.
>        (move_or_delete_vzeroupper_2): Properly check the end of basic
>        block.  Call note_stores only if no_avx256 is false.
>        (scan_live_upper_128bits_2): New.
>        (scan_live_upper_128bits_1): Likewise.
>        (move_or_delete_vzeroupper): Call scan_live_upper_128bits_1 to
>        scan predecessor blocks of all exit points.
>        (use_avx256_p): Removed.
>        (init_cumulative_args): Don't set use_avx256_p.
>        (ix86_function_arg): Likewise.
>        (ix86_expand_move): Likewise.
>        (ix86_expand_vector_move_misalign): Likewise.
>        (ix86_local_alignment): Likewise.
>        (ix86_minimum_alignment): Likewise.
>        (ix86_expand_epilogue): Don't check use_avx256_p when generating
>        vzeroupper.
>        (ix86_expand_call): Likewise.
>
>        * config/i386/i386.h (machine_function): Remove use_avx256_p.
>
> gcc/testsuite/
>
> 2010-11-17  H.J. Lu  <hongjiu.lu@intel.com>
>
>        PR target/46519
>        * gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
>        * gcc.target/i386/avx-vzeroupper-11.c: Likewise.
>
>        * gcc.target/i386/avx-vzeroupper-20.c: New.
>

Small optimization.  Don't emit vzeroupper if callee doesn't return.
diff mbox

Patch

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7553db0..cb43620 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21798,7 +21798,9 @@  void
 ix86_split_call_vzeroupper (rtx insn, rtx vzeroupper)
 {
   rtx call = XVECEXP (PATTERN (insn), 0, 0);
-  emit_insn (gen_avx_vzeroupper (vzeroupper));
+  /* Don't emit vzeroupper if callee doesn't return.  */
+  if (!find_reg_note (insn, REG_NORETURN, NULL))
+    emit_insn (gen_avx_vzeroupper (vzeroupper));
   emit_call_insn (call);
 }