Patchwork PATCH: PR target/46519: Missing vzeroupper

login
register
mail settings
Submitter H.J. Lu
Date Jan. 13, 2011, 5:13 p.m.
Message ID <AANLkTinm1zdhXvJDmcpYRznDauybRSQ9rQ=tCUtSDP5d@mail.gmail.com>
Download mbox | patch
Permalink /patch/78776/
State New
Headers show

Comments

H.J. Lu - Jan. 13, 2011, 5:13 p.m.
On Wed, Jan 5, 2011 at 8:46 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Jan 05, 2011 at 08:39:51AM -0800, H.J. Lu wrote:
>> On Tue, Jan 4, 2011 at 4:09 PM, Mark Mitchell <mark@codesourcery.com> wrote:
>> > On 1/4/2011 4:06 PM, H.J. Lu wrote:
>> >
>> >> Enhance the DF infrastructure is beyond my resources.  I
>> >> will take a look at the DF algorithm.
>> >
>> > Wikipedia (or any good compiler book) should have a good description of
>> > the appropriate work-list based algorithms.  The basic idea is that you
>> > walk the BB tree in the right order (starting at the entry blocks),
>> > adding successor blocks to the worklist whenever you change a block.
>> >
>>
>> Are there any existing GCC passes which implement their own data-flow
>> analysis?
>
> E.g. var-tracking.c (vt_find_locations).
>

Thanks for Mark's suggestion and Jakub's pointer.  This patch implements
the work-list based algorithm.  I built SPEC CPU 2K/2006 with it.  Only 2
benchmarks are different with and without the patch.  It removes 4 extra
vzeroupper insns from 416.gamess in SPEC CPU 2006 and 1extra vzeroupper
from 177.mesa in SPEC CPU 2K.  The compile time difference is about 0.3%.

There are no AVX-SSE transition penalties in SPEC CPU 2K 32bit/64bit
and SPEC CPU 2006 64bit.  I am running  SPEC CPU 2006 32bit.  I am not
expecting any problems.  OK for trunk?

Thanks.
Mark Mitchell - Jan. 13, 2011, 5:19 p.m.
On 1/13/2011 9:13 AM, H.J. Lu wrote:

> Thanks for Mark's suggestion and Jakub's pointer.  This patch implements
> the work-list based algorithm.  I built SPEC CPU 2K/2006 with it.  Only 2
> benchmarks are different with and without the patch.  It removes 4 extra
> vzeroupper insns from 416.gamess in SPEC CPU 2006 and 1extra vzeroupper
> from 177.mesa in SPEC CPU 2K.  The compile time difference is about 0.3%.

HJ, thank you for implementing the worklist algorithm.  I have no
further comments on this patch; I would defer to the x86 maintainers.

Thank you,
Richard Henderson - Jan. 13, 2011, 6:03 p.m.
On 01/13/2011 09:13 AM, H.J. Lu wrote:
> +      sbitmap_zero (visited);
> +
>        cfun->machine->rescan_vzeroupper_p = 0;
> +
> +      while (!fibheap_empty (worklist))
> +	{
> +	  bb = (basic_block) fibheap_extract_min (worklist);
> +	  RESET_BIT (in_worklist, bb->index);
> +	  gcc_assert (!TEST_BIT (visited, bb->index));
> +	  if (!TEST_BIT (visited, bb->index))
> +	    {
> +	      edge_iterator ei;
> +
> +	      SET_BIT (visited, bb->index);
> +
> +	      move_or_delete_vzeroupper_1 (bb, false);
> +
> +	      FOR_EACH_EDGE (e, ei, bb->succs)

Hum.  Your use of the worklist appears to be totally superficial.

This is a very complicated way to write

  for (i = 0; i < n; ++i)
    move_or_delete_vzeroupper_1 (BASIC_BLOCK (bb_order[i]), false);

Which is still an improvement, mind, since you're now scanning
the blocks in a more intelligent order.

Proper use of a worklist would do something like

  changed = move_or_delete_vzeroupper_1 (bb, false);
  if (changed)
    FOR_EACH_EDGE (...)
      ...

I.e. not queue successors if nothing has changed in the current block.


r~
H.J. Lu - Jan. 13, 2011, 6:20 p.m.
On Thu, Jan 13, 2011 at 10:03 AM, Richard Henderson <rth@redhat.com> wrote:
> On 01/13/2011 09:13 AM, H.J. Lu wrote:
>> +      sbitmap_zero (visited);
>> +
>>        cfun->machine->rescan_vzeroupper_p = 0;
>> +
>> +      while (!fibheap_empty (worklist))
>> +     {
>> +       bb = (basic_block) fibheap_extract_min (worklist);
>> +       RESET_BIT (in_worklist, bb->index);
>> +       gcc_assert (!TEST_BIT (visited, bb->index));
>> +       if (!TEST_BIT (visited, bb->index))
>> +         {
>> +           edge_iterator ei;
>> +
>> +           SET_BIT (visited, bb->index);
>> +
>> +           move_or_delete_vzeroupper_1 (bb, false);
>> +
>> +           FOR_EACH_EDGE (e, ei, bb->succs)
>
> Hum.  Your use of the worklist appears to be totally superficial.
>
> This is a very complicated way to write
>
>  for (i = 0; i < n; ++i)
>    move_or_delete_vzeroupper_1 (BASIC_BLOCK (bb_order[i]), false);
>
> Which is still an improvement, mind, since you're now scanning
> the blocks in a more intelligent order.
>
> Proper use of a worklist would do something like
>
>  changed = move_or_delete_vzeroupper_1 (bb, false);
>  if (changed)
>    FOR_EACH_EDGE (...)
>      ...
>
> I.e. not queue successors if nothing has changed in the current block.
>

We have to scan its successors even if the exit state of the current block
is unchanged since one predecessor exit state is just one factor which
impacts the exit state and vzeroupper optimization. We must repeatedly
propagate the exit states and scan basic blocks until the exit state of
all basic blocks stabilized.

We can skip a basic block only if it has been processed or no insns
inside the block will touch the upper 128bits.
Richard Henderson - Jan. 14, 2011, 3:36 p.m.
On 01/13/2011 10:20 AM, H.J. Lu wrote:
> We have to scan its successors even if the exit state of the current block
> is unchanged since one predecessor exit state is just one factor which
> impacts the exit state and vzeroupper optimization.

Then you're not interpreting the "state" of a block broadly enough.

You certainly can consider the exit state of a block to be a function
of its input state and whatever it does locally.  Indeed, this is
exactly the formulation of the problem that you will need in order to
re-use the LCM infrastructure for 4.7.  But it doesn't hurt to think
about the problem in those terms now.


r~
H.J. Lu - Jan. 14, 2011, 4:02 p.m.
On Fri, Jan 14, 2011 at 7:36 AM, Richard Henderson <rth@redhat.com> wrote:
> On 01/13/2011 10:20 AM, H.J. Lu wrote:
>> We have to scan its successors even if the exit state of the current block
>> is unchanged since one predecessor exit state is just one factor which
>> impacts the exit state and vzeroupper optimization.
>
> Then you're not interpreting the "state" of a block broadly enough.
>
> You certainly can consider the exit state of a block to be a function
> of its input state and whatever it does locally.  Indeed, this is
> exactly the formulation of the problem that you will need in order to
> re-use the LCM infrastructure for 4.7.  But it doesn't hurt to think
> about the problem in those terms now.
>

vzeroupper works this way:

1. We add vzeroupper where there may be AVX->SSE transition
at function call and return RTL expansion.  It won't affect
correctness of the program.
2. We also add a special vzeroupper to function call which returns
or passes AVX registers. This will affect correctness of the program
if it isn't removed.

In vzeroupper optimization pass, we use the special vzeroupper to
propagate the AVX register state and remove it afterward.  We
also remove any redundant vzeroupper, which depends on the
the AVX register state at basic block entry.

Because #2, we have to scan a basic block at least once.
I can add more checks to avoid unnecessary rescans.

Patch

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index a26314b..5afc1ae 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -56,6 +56,8 @@  along with GCC; see the file COPYING3.  If not see
 #include "debug.h"
 #include "dwarf2out.h"
 #include "sched-int.h"
+#include "sbitmap.h"
+#include "fibheap.h"
 
 enum upper_128bits_state
 {
@@ -338,14 +340,18 @@  move_or_delete_vzeroupper (void)
   edge e;
   edge_iterator ei;
   basic_block bb;
-  unsigned int count;
+  fibheap_t worklist, pending, fibheap_swap;
+  sbitmap visited, in_worklist, in_pending, sbitmap_swap;
+  int *bb_order;
+  int *rc_order;
+  int i;
 
   /* Set up block info for each basic block.  */
   alloc_aux_for_blocks (sizeof (struct block_info_def));
 
-  /* Process successor blocks of all entry points.  */
+  /* Process outgoing edges of entry point.  */
   if (dump_file)
-    fprintf (dump_file, "Process all entry points\n");
+    fprintf (dump_file, "Process outgoing edges of entry point\n");
 
   FOR_EACH_EDGE (e, ei, ENTRY_BLOCK_PTR->succs)
     {
@@ -355,25 +361,100 @@  move_or_delete_vzeroupper (void)
       BLOCK_INFO (e->dest)->processed = true;
     }
 
-  /* Process all basic blocks.  */
-  count = 0;
-  do
+  /* Compute reverse completion order of depth first search of the CFG
+     so that the data-flow runs faster.  */
+  rc_order = XNEWVEC (int, n_basic_blocks - NUM_FIXED_BLOCKS);
+  bb_order = XNEWVEC (int, last_basic_block);
+  pre_and_rev_post_order_compute (NULL, rc_order, false);
+  for (i = 0; i < n_basic_blocks - NUM_FIXED_BLOCKS; i++)
+    bb_order[rc_order[i]] = i;
+  free (rc_order);
+
+  worklist = fibheap_new ();
+  pending = fibheap_new ();
+  visited = sbitmap_alloc (last_basic_block);
+  in_worklist = sbitmap_alloc (last_basic_block);
+  in_pending = sbitmap_alloc (last_basic_block);
+  sbitmap_zero (in_worklist);
+
+  /* Don't check outgoing edges of entry point.  */
+  sbitmap_ones (in_pending);
+  FOR_EACH_BB (bb)
+    if (BLOCK_INFO (bb)->processed)
+      RESET_BIT (in_pending, bb->index);
+    else
+      fibheap_insert (pending, bb_order[bb->index], bb);
+
+  if (dump_file)
+    fprintf (dump_file, "Check remaining basic blocks\n");
+
+  while (!fibheap_empty (pending))
     {
-      if (dump_file)
-	fprintf (dump_file, "Process all basic blocks: trip %d\n",
-		 count);
+      fibheap_swap = pending;
+      pending = worklist;
+      worklist = fibheap_swap;
+      sbitmap_swap = in_pending;
+      in_pending = in_worklist;
+      in_worklist = sbitmap_swap;
+
+      sbitmap_zero (visited);
+
       cfun->machine->rescan_vzeroupper_p = 0;
-      FOR_EACH_BB (bb)
-	move_or_delete_vzeroupper_1 (bb, false);
+
+      while (!fibheap_empty (worklist))
+	{
+	  bb = (basic_block) fibheap_extract_min (worklist);
+	  RESET_BIT (in_worklist, bb->index);
+	  gcc_assert (!TEST_BIT (visited, bb->index));
+	  if (!TEST_BIT (visited, bb->index))
+	    {
+	      edge_iterator ei;
+
+	      SET_BIT (visited, bb->index);
+
+	      move_or_delete_vzeroupper_1 (bb, false);
+
+	      FOR_EACH_EDGE (e, ei, bb->succs)
+		{
+		  if (e->dest == EXIT_BLOCK_PTR
+		      || BLOCK_INFO (e->dest)->processed)
+		    continue;
+
+		  if (TEST_BIT (visited, e->dest->index))
+		    {
+		      if (!TEST_BIT (in_pending, e->dest->index))
+			{
+			  /* Send E->DEST to next round.  */
+			  SET_BIT (in_pending, e->dest->index);
+			  fibheap_insert (pending,
+					  bb_order[e->dest->index],
+					  e->dest);
+			}
+		    }
+		  else if (!TEST_BIT (in_worklist, e->dest->index))
+		    {
+		      /* Add E->DEST to current round.  */
+		      SET_BIT (in_worklist, e->dest->index);
+		      fibheap_insert (worklist, bb_order[e->dest->index],
+				      e->dest);
+		    }
+		}
+	    }
+	}
+
+      if (!cfun->machine->rescan_vzeroupper_p)
+	break;
     }
-  while (cfun->machine->rescan_vzeroupper_p && count++ < 20);
 
-  /* FIXME: Is 20 big enough?  */
-  if (count >= 20)
-    gcc_unreachable ();
+  free (bb_order);
+  fibheap_delete (worklist);
+  fibheap_delete (pending);
+  sbitmap_free (visited);
+  sbitmap_free (in_worklist);
+  sbitmap_free (in_pending);
 
   if (dump_file)
-    fprintf (dump_file, "Process all basic blocks\n");
+    fprintf (dump_file, "Process remaining basic blocks\n");
 
   FOR_EACH_BB (bb)
     move_or_delete_vzeroupper_1 (bb, true);
diff --git a/gcc/config/i386/t-i386 b/gcc/config/i386/t-i386
index 6c801a5..1c658a1 100644
--- a/gcc/config/i386/t-i386
+++ b/gcc/config/i386/t-i386
@@ -23,7 +23,7 @@  i386.o: $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
   $(RECOG_H) $(EXPR_H) $(OPTABS_H) toplev.h $(BASIC_BLOCK_H) \
   $(GGC_H) $(TARGET_H) $(TARGET_DEF_H) langhooks.h $(CGRAPH_H) \
   $(TREE_GIMPLE_H) $(DWARF2_H) $(DF_H) tm-constrs.h $(PARAMS_H) \
-  i386-builtin-types.inc debug.h dwarf2out.h
+  i386-builtin-types.inc debug.h dwarf2out.h sbitmap.h $(FIBHEAP_H)
 
 i386-c.o: $(srcdir)/config/i386/i386-c.c \
   $(srcdir)/config/i386/i386-protos.h $(CONFIG_H) $(SYSTEM_H) coretypes.h \