PATCH: Properly check the end of basic block

Submitted by Uros Bizjak on Nov. 17, 2010, 7:23 p.m.

Details

Message ID AANLkTikKrZ9kODo051O+-wQZ6UOnrK0Civp6Cq_28Ly4@mail.gmail.com
State New
Headers show

Commit Message

Uros Bizjak Nov. 17, 2010, 7:23 p.m.
On Wed, Nov 17, 2010 at 2:33 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Tue, Nov 16, 2010 at 11:51 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> On Wed, Nov 17, 2010 at 7:44 AM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>
>>> insn != BB_END (bb) && NEXT_INSN (insn) == NEXT_INSN (BB_END (bb))
>>>
>>> We should check NEXT_INSN (insn) != NEXT_INSN (BB_END (bb)) in
>>> move_or_delete_vzeroupper_2.  This patch does it.
>>
>> Huh? The loop does simple linear scan of all insns in the bb, so it
>> can't miss BB_END. IIUC, in your case the bb does not have BB_END
>> (bb), but it has NEXT_INSN (BB_END (bb))?
>
> It has BB_END, but it won't be visited by NEXT_INSN starting from
> BB_HEAD. insn != NEXT_INSN (BB_END (bb)) is used to check the
> end of the BB everywhere in gcc.
>
>> Can you please provide a test case that illustrates this?
>>
>
> I am enclosing a work in progress.  We noticed that we are
> missing a few vzerouppers at -O3 on SPEC CPU 2K/2006.
> One isssue is we may have
>
> foo:
>
>       call bar <<<<< Missing vzeroupper
>
>       256bit vectorized insn
>       goto foo
>
> We miss vzeroupper before call bar.  We don't have a small testcase.
> But this patch fixes this case by inspection. We are checking other
> cases.

@@ -118,9 +118,12 @@ move_or_delete_vzeroupper_2 (basic_block bb, bool
upper_128bits_set)
 	     bb->index, upper_128bits_set);

   insn = BB_HEAD (bb);
+  last = NEXT_INSN (BB_END (bb));
   while (insn != BB_END (bb))
     {
       insn = NEXT_INSN (insn);
+      if (insn == last)
+	break;

       if (!NONDEBUG_INSN_P (insn))
 	continue;

The change above is not needed. The new check is never triggered - the
loop terminates when "insn == BB_END (bb)" at "while", so I fail to
see why additional termination for "NEXT_INSN (insn) == NEXT_INSN
(BB_END (bb))" is needed.

(The BB_HEAD (bb) is either a NOTE or CODE_LABEL so it can be skipped
with NEXT_INSN.)

@@ -10970,7 +10973,7 @@ ix86_expand_epilogue (int style)

   /* Emit vzeroupper if needed.  */
   if (TARGET_VZEROUPPER
-      && cfun->machine->use_avx256_p
+      && (cfun->machine->use_avx256_p || flag_tree_vectorize)
       && !cfun->machine->caller_return_avx256_p)
     {
       cfun->machine->use_vzeroupper_p = 1;
@@ -21661,7 +21664,8 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
     }

   /* Add UNSPEC_CALL_NEEDS_VZEROUPPER decoration.  */
-  if (TARGET_VZEROUPPER && cfun->machine->use_avx256_p)
+  if (TARGET_VZEROUPPER
+      && (cfun->machine->use_avx256_p || flag_tree_vectorize))

Decorate *ALL* calls with CALL_NEEDS_VZEROUPPER with
-ftree-vectorize?! It looks that parts (or state machine) that set
...->use_avx256_p flag should be fixed.

     {
       rtx unspec;
       int avx256;

Hm, this testcase doesn't go together with the above change. There is
no vectorization involved, and the scan checks that vzeroupper is NOT
emitted.

Uros.

Comments

H.J. Lu Nov. 17, 2010, 11:36 p.m.
On Wed, Nov 17, 2010 at 11:23 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Wed, Nov 17, 2010 at 2:33 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Tue, Nov 16, 2010 at 11:51 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>> On Wed, Nov 17, 2010 at 7:44 AM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>>>
>>>> insn != BB_END (bb) && NEXT_INSN (insn) == NEXT_INSN (BB_END (bb))
>>>>
>>>> We should check NEXT_INSN (insn) != NEXT_INSN (BB_END (bb)) in
>>>> move_or_delete_vzeroupper_2.  This patch does it.
>>>
>>> Huh? The loop does simple linear scan of all insns in the bb, so it
>>> can't miss BB_END. IIUC, in your case the bb does not have BB_END
>>> (bb), but it has NEXT_INSN (BB_END (bb))?
>>
>> It has BB_END, but it won't be visited by NEXT_INSN starting from
>> BB_HEAD. insn != NEXT_INSN (BB_END (bb)) is used to check the
>> end of the BB everywhere in gcc.
>>
>>> Can you please provide a test case that illustrates this?
>>>
>>
>> I am enclosing a work in progress.  We noticed that we are
>> missing a few vzerouppers at -O3 on SPEC CPU 2K/2006.
>> One isssue is we may have
>>
>> foo:
>>
>>       call bar <<<<< Missing vzeroupper
>>
>>       256bit vectorized insn
>>       goto foo
>>
>> We miss vzeroupper before call bar.  We don't have a small testcase.
>> But this patch fixes this case by inspection. We are checking other
>> cases.
>
> @@ -118,9 +118,12 @@ move_or_delete_vzeroupper_2 (basic_block bb, bool
> upper_128bits_set)
>             bb->index, upper_128bits_set);
>
>   insn = BB_HEAD (bb);
> +  last = NEXT_INSN (BB_END (bb));
>   while (insn != BB_END (bb))
>     {
>       insn = NEXT_INSN (insn);
> +      if (insn == last)
> +       break;
>
>       if (!NONDEBUG_INSN_P (insn))
>        continue;
>
> The change above is not needed. The new check is never triggered - the
> loop terminates when "insn == BB_END (bb)" at "while", so I fail to
> see why additional termination for "NEXT_INSN (insn) == NEXT_INSN
> (BB_END (bb))" is needed.

Here is the patch for

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519

We have 2 blocks pointing to each others. This patch first scans
all blocks without moving vzeroupper so that we can have accurate
information about upper 128bits at block entry.

> (The BB_HEAD (bb) is either a NOTE or CODE_LABEL so it can be skipped
> with NEXT_INSN.)

Please try gcc.target/i386/avx-vzeroupper-20.c.  It will
trigger this condition.

> @@ -10970,7 +10973,7 @@ ix86_expand_epilogue (int style)
>
>   /* Emit vzeroupper if needed.  */
>   if (TARGET_VZEROUPPER
> -      && cfun->machine->use_avx256_p
> +      && (cfun->machine->use_avx256_p || flag_tree_vectorize)
>       && !cfun->machine->caller_return_avx256_p)
>     {
>       cfun->machine->use_vzeroupper_p = 1;
> @@ -21661,7 +21664,8 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
>     }
>
>   /* Add UNSPEC_CALL_NEEDS_VZEROUPPER decoration.  */
> -  if (TARGET_VZEROUPPER && cfun->machine->use_avx256_p)
> +  if (TARGET_VZEROUPPER
> +      && (cfun->machine->use_avx256_p || flag_tree_vectorize))
>
> Decorate *ALL* calls with CALL_NEEDS_VZEROUPPER with
> -ftree-vectorize?! It looks that parts (or state machine) that set
> ...->use_avx256_p flag should be fixed.

There are:

foo:

      call bar <<<<< Missing vzeroupper

      256bit vectorized insn
      goto foo

I couldn't find a hook to set use_avx256_p before RTL expansion
starts.

>     {
>       rtx unspec;
>       int avx256;
> diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
> b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
> new file mode 100644
> index 0000000..3301083
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx -mtune=generic -dp" } */
> +
> +extern void free (void *);
> +void
> +bar (void *ncstrp)
> +{
> +  if(ncstrp==((void *)0))
> +    return;
> +  free(ncstrp);
> +}
> +
> +/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */
>
> Hm, this testcase doesn't go together with the above change. There is
> no vectorization involved, and the scan checks that vzeroupper is NOT
> emitted.
>

This testcase is for

insn != BB_END (bb) && NEXT_INSN (insn) == NEXT_INSN (BB_END (bb))
Uros Bizjak Nov. 18, 2010, 8:12 a.m.
On Thu, Nov 18, 2010 at 12:36 AM, H.J. Lu <hjl.tools@gmail.com> wrote:

>> @@ -118,9 +118,12 @@ move_or_delete_vzeroupper_2 (basic_block bb, bool
>> upper_128bits_set)
>>             bb->index, upper_128bits_set);
>>
>>   insn = BB_HEAD (bb);
>> +  last = NEXT_INSN (BB_END (bb));
>>   while (insn != BB_END (bb))
>>     {
>>       insn = NEXT_INSN (insn);
>> +      if (insn == last)
>> +       break;
>>
>>       if (!NONDEBUG_INSN_P (insn))
>>        continue;
>>
>> The change above is not needed. The new check is never triggered - the
>> loop terminates when "insn == BB_END (bb)" at "while", so I fail to
>> see why additional termination for "NEXT_INSN (insn) == NEXT_INSN
>> (BB_END (bb))" is needed.
>
> Here is the patch for
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>
> We have 2 blocks pointing to each others. This patch first scans
> all blocks without moving vzeroupper so that we can have accurate
> information about upper 128bits at block entry.
>
>> (The BB_HEAD (bb) is either a NOTE or CODE_LABEL so it can be skipped
>> with NEXT_INSN.)
>
> Please try gcc.target/i386/avx-vzeroupper-20.c.  It will
> trigger this condition.

It doesn't even trigger move_or_delete_vzeroupper.

OK, I have hacked gcc a bit to blindly trigger the function in order
to check the loop iself. For your testcase we process following insn
stream:

 2 bb 2  [10000]
 3 bb 3  [7836]
 4 bb 4  [2164]
    1 NOTE_INSN_DELETED
    4 NOTE_INSN_BASIC_BLOCK
   17 NOTE_INSN_PROLOGUE_END
    3 NOTE_INSN_FUNCTION_BEG
    6 flags:CCZ=cmp(di:DI,0)
    7 pc={(flags:CCZ==0)?L14:pc}
      REG_DEAD: flags:CCZ
      REG_BR_PROB: 0x874
    8 NOTE_INSN_BASIC_BLOCK
   20 NOTE_INSN_EPILOGUE_BEG
   10 call <...>
      REG_DEAD: di:DI
      REG_EH_REGION: 0
i  11: barrier
L14:
   15 NOTE_INSN_BASIC_BLOCK
   19 return
i  18: barrier
   16 NOTE_INSN_DELETED

And the loop processed:

Assembling functions:
 bar
======
    4 NOTE_INSN_BASIC_BLOCK
++++++++
    6 flags:CCZ=cmp(di:DI,0)
++++++++
    7 pc={(flags:CCZ==0)?L14:pc}
      REG_DEAD: flags:CCZ
      REG_BR_PROB: 0x874
======
    8 NOTE_INSN_BASIC_BLOCK
++++++++
   10 call <...>
      REG_DEAD: di:DI
      REG_EH_REGION: 0
======
L14:
++++++++
   23 {return;unspec{0;};}

Where "===" precedes instruction, detected as BB_START and "+++"
precedes instructions that passed !NONDEBUG_INSN_P filter.

Please, can you show me the exact insn that was missed in this
particular testcase. I have used "-O3 -mavx -mtune=generic" and I have
to short-circuit the check for uses_vzeroupper_p at the end of
ix86_reorg.

Uros.
Uros Bizjak Nov. 18, 2010, 9:11 a.m.
On Thu, Nov 18, 2010 at 12:36 AM, H.J. Lu <hjl.tools@gmail.com> wrote:

> Here is the patch for
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>
> We have 2 blocks pointing to each others. This patch first scans
> all blocks without moving vzeroupper so that we can have accurate
> information about upper 128bits at block entry.

This introduces another insn scanning pass, almost the same as
existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).

So, if I understand correctly:
- The patch removes the detection if the function ever touches AVX registers.
- Due to this, all call_insn RTXes have to be decorated with
CALL_NEEDS_VZEROUPPER.
- A new pre-pass is required that scans all functions in order to
detect functions with live AVX registers at exit, and at the same time
marks the functions that *do not* use AVX registers.
- Existing pass then re-scans everything to again detect functions
with live AVX registers at exit and handles vzeroupper emission.

I don't think this approach is acceptable. Maybe a LCM infrastructure
can be used to handle this case?

Uros.

Patch hide | download patch | download mbox

diff --git a/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
new file mode 100644
index 0000000..3301083
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx-vzeroupper-20.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx -mtune=generic -dp" } */
+
+extern void free (void *);
+void
+bar (void *ncstrp)
+{
+  if(ncstrp==((void *)0))
+    return;
+  free(ncstrp);
+}
+
+/* { dg-final { scan-assembler-not "avx_vzeroupper" } } */