
Avoid store forwarding issue in vectorizing strided SLP loads

Message ID alpine.LSU.2.11.1609281336060.26629@t29.fhfr.qr
State New

Commit Message

Richard Biener Sept. 28, 2016, 11:41 a.m. UTC
Currently strided SLP vectorization creates vector constructors composed
of vector elements.  This is a constructor form that is not handled
specially by the expander, so it gets expanded via piecewise stores
to scratch memory followed by a load of that scratch memory.  This is
obviously bad on any modern CPU that does store forwarding (which in
this case fails on every CPU I know of).  The following patch simply
avoids the issue by making the vectorizer create integer loads, composing
a vector of those integers and then punning that to the desired vector
type.  Thus (V4SF){V2SF, V2SF} becomes (V4SF)(V2DI){DI, DI} and
everybody is happy.  Notably x264 gets a 5-10% improvement
(depending on vector size and x86 sub-architecture).

Handling the vector-vector constructors on the expander side would
require either similar punning or making vec_init parametric on
the element mode plus supporting vector elements in all targets
(which in the end probably will simply pun them similarly).

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

Any comments?

Thanks,
Richard.

2016-09-28  Richard Biener  <rguenther@suse.de>

	* tree-vect-stmts.c (vectorizable_load): Avoid emitting vector
	constructors with vector elements.

Comments

Jeff Law Sept. 28, 2016, 3:48 p.m. UTC | #1
On 09/28/2016 05:41 AM, Richard Biener wrote:
>
> Currently strided SLP vectorization creates vector constructors composed
> of vector elements.  This is a constructor form that is not handled
> specially by the expander, so it gets expanded via piecewise stores
> to scratch memory followed by a load of that scratch memory.
Ugh.  Yup, obviously bad, even without store forwarding.


> fails on every CPU I know of).  The following patch simply
> avoids the issue by making the vectorizer create integer loads, composing
> a vector of those integers and then punning that to the desired vector
> type.  Thus (V4SF){V2SF, V2SF} becomes (V4SF)(V2DI){DI, DI} and
> everybody is happy.  Notably x264 gets a 5-10% improvement
> (depending on vector size and x86 sub-architecture).
Seems reasonable to me -- there's not a lot of difference (conceptually) 
from how we've used SImode constants to construct DFmode constants in 
the past.

>
> Handling the vector-vector constructors on the expander side would
> require either similar punning or making vec_init parametric on
> the element mode plus supporting vector elements in all targets
> (which in the end probably will simply pun them similarly).
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
>
> Any comments?
>
> Thanks,
> Richard.
>
> 2016-09-28  Richard Biener  <rguenther@suse.de>
>
> 	* tree-vect-stmts.c (vectorizable_load): Avoid emitting vector
> 	constructors with vector elements.
Seems quite reasonable.

jeff

Patch

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 240565)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -6862,17 +6925,40 @@  vectorizable_load (gimple *stmt, gimple_
       int nloads = nunits;
       int lnel = 1;
       tree ltype = TREE_TYPE (vectype);
+      tree lvectype = vectype;
       auto_vec<tree> dr_chain;
       if (memory_access_type == VMAT_STRIDED_SLP)
 	{
-	  nloads = nunits / group_size;
 	  if (group_size < nunits)
 	    {
-	      lnel = group_size;
-	      ltype = build_vector_type (TREE_TYPE (vectype), group_size);
+	      /* Avoid emitting a constructor of vector elements by performing
+		 the loads using an integer type of the same size,
+		 constructing a vector of those and then re-interpreting it
+		 as the original vector type.  This works around the fact
+		 that the vec_init optab was only designed for scalar
+		 element modes and thus expansion goes through memory.
+		 This avoids a huge runtime penalty due to the general
+		 inability to perform store forwarding from smaller stores
+		 to a larger load.  */
+	      unsigned lsize
+		= group_size * TYPE_PRECISION (TREE_TYPE (vectype));
+	      enum machine_mode elmode = mode_for_size (lsize, MODE_INT, 0);
+	      enum machine_mode vmode = mode_for_vector (elmode,
+							 nunits / group_size);
+	      /* If we can't construct such a vector fall back to
+		 element loads of the original vector type.  */
+	      if (VECTOR_MODE_P (vmode)
+		  && optab_handler (vec_init_optab, vmode) != CODE_FOR_nothing)
+		{
+		  nloads = nunits / group_size;
+		  lnel = group_size;
+		  ltype = build_nonstandard_integer_type (lsize, 1);
+		  lvectype = build_vector_type (ltype, nloads);
+		}
 	    }
 	  else
 	    {
+	      nloads = 1;
 	      lnel = nunits;
 	      ltype = vectype;
 	    }
@@ -6925,9 +7011,17 @@  vectorizable_load (gimple *stmt, gimple_
 	    }
 	  if (nloads > 1)
 	    {
-	      tree vec_inv = build_constructor (vectype, v);
-	      new_temp = vect_init_vector (stmt, vec_inv, vectype, gsi);
+	      tree vec_inv = build_constructor (lvectype, v);
+	      new_temp = vect_init_vector (stmt, vec_inv, lvectype, gsi);
 	      new_stmt = SSA_NAME_DEF_STMT (new_temp);
+	      if (lvectype != vectype)
+		{
+		  new_stmt = gimple_build_assign (make_ssa_name (vectype),
+						  VIEW_CONVERT_EXPR,
+						  build1 (VIEW_CONVERT_EXPR,
+							  vectype, new_temp));
+		  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+		}
 	    }
 
 	  if (slp)