How to generate AVX512 instructions now (just to look at them).

Message ID 20140103211102.GP892@tucnak.redhat.com
State New

Commit Message

Jakub Jelinek Jan. 3, 2014, 9:11 p.m. UTC
Hi!

On Fri, Jan 03, 2014 at 08:58:30PM +0100, Toon Moene wrote:
> I don't doubt that would work, what I'm interested in, is (cat verintlin.f):

Well, you need gather loads for that and there you hit PR target/59617.

Completely untested patch that lets your testcase be vectorized
using 64-byte vectors.  vectorizable_mask_load_store still punts,
but I guess the steps there are first to teach it about non-gather MASK_LOAD
and MASK_STORE, which aren't handled for the AVX512F modes either
(I think V8DI/V8DF/V16SI/V16SF modes should be possible to handle right now),
and then move on to handle the gathers similarly.
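
To make the (mask_type != src_type) issue behind PR target/59617 concrete,
here is a minimal scalar sketch in C of the two masking conventions.  It
assumes the usual description of the hardware: AVX2 gathers take a vector
mask and test the sign bit of each element, while AVX512F gathers take one
bit per lane in an integer mask.  The function names, LANES and the element
types are illustrative only and are not GCC internals.

```c
#include <stdint.h>

#define LANES 8   /* eight doubles, as in a 64-byte V8DF vector */

/* AVX2-style convention (assumed): the mask is itself a vector with one
   element per lane, and a lane is loaded when the sign bit of its mask
   element is set.  Lanes that are masked off keep their old value.  */
void
gather_vector_mask (double *dst, const double *base, const int32_t *idx,
                    const int64_t *mask)
{
  for (int i = 0; i < LANES; i++)
    if (mask[i] < 0)
      dst[i] = base[idx[i]];
}

/* AVX512F-style convention (assumed): the mask is a scalar integer with
   one bit per lane, so its type differs from the type of the data.  */
void
gather_integer_mask (double *dst, const double *base, const int32_t *idx,
                     uint8_t mask)
{
  for (int i = 0; i < LANES; i++)
    if (mask & (1u << i))
      dst[i] = base[idx[i]];
}
```

With all eight bits set (0xff) the integer-mask version loads every lane,
which corresponds to an unconditional gather; a vector mask needs all-ones
elements to express the same thing.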

2014-01-03  Jakub Jelinek  <jakub@redhat.com>

	PR target/59617
	* config/i386/i386.c (ix86_vectorize_builtin_gather): Uncomment
	AVX512F gather builtins.
	* tree-vect-stmts.c (vectorizable_mask_load_store): For now punt
	on gather decls with INTEGER_TYPE masktype.
	(vectorizable_load): For INTEGER_TYPE masktype, put the INTEGER_CST
	directly into the builtin rather than hoisting it before loop.



	Jakub

Comments

Toon Moene Jan. 5, 2014, 2:52 p.m. UTC | #1
On 01/03/2014 10:11 PM, Jakub Jelinek wrote:

> Hi!
>
> On Fri, Jan 03, 2014 at 08:58:30PM +0100, Toon Moene wrote:
>> I don't doubt that would work, what I'm interested in, is (cat verintlin.f):
>
> Well, you need gather loads for that and there you hit PR target/59617.

I tried your patch, and the effect on the most heavily used loop in the 
full routine (not the part that I quoted before):

      DO JY = KLAT1,KLAT2
      DO JX = KLON1,KLON2
         IDX  = KP(JX,JY)
         IDY  = KQ(JX,JY)
         ILEV = KR(JX,JY)
...
     + + PBETA(JX,JY,4)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY+1,ILEV+1)
     +                  + PALFA(JX,JY,2)*PARG(IDX-1,IDY+1,ILEV+1)
     +                  + PALFA(JX,JY,3)*PARG(IDX  ,IDY+1,ILEV+1)
     +                  + PALFA(JX,JY,4)*PARG(IDX+1,IDY+1,ILEV+1) ) )
      ENDDO
      ENDDO

is (just counting assembler lines, i.e., instructions):

-Ofast -mavx2 -mfma:           627 lines in the .s file.

-Ofast -mavx2 -mfma -mavx512f: 588 lines in the .s file.

However, this routine is clearly memory bound (as the vectorization with
the gather instruction, needed for the indirect addressing via IDX  =
KP(JX,JY), etc., didn't bring any speed improvement).
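
For readers without the Fortran source, the indirect addressing can be
sketched in C roughly as follows.  This is a hypothetical, much reduced
analogue and not the routine itself: kp[] stands in for KP(JX,JY), parg[]
for PARG, and the two-point interpolation and all names are made up for
illustration.

```c
#include <stddef.h>

/* Hypothetical analogue of the indirect accesses in the loop above.  */
void
interp_indirect (double *restrict out, const double *restrict parg,
                 const int *restrict kp, const double *restrict alfa,
                 size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      /* The index itself is read from contiguous memory.  */
      int idx = kp[i];

      /* parg[idx] and parg[idx + 1] are at data-dependent addresses, so a
         vectorized version of this loop needs gather loads for them.  */
      out[i] = alfa[i] * parg[idx] + (1.0 - alfa[i]) * parg[idx + 1];
    }
}
```

Compiling such a file with gcc -S and the options listed above
(-Ofast -mavx2 -mfma, with or without -mavx512f) should give a .s file to
compare; whether the loop is actually vectorized with gathers depends on
the compiler version and on this patch.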

The number of instructions accessing memory:

-Ofast -mavx2 -mfma:           364 lines in the .s file.

-Ofast -mavx2 -mfma -mavx512f: 221 lines in the .s file.

So there might be a clear improvement here ...

Thanks !

Patch

--- gcc/config/i386/i386.c.jj	2014-01-03 13:19:14.000000000 +0100
+++ gcc/config/i386/i386.c	2014-01-03 21:12:23.630145609 +0100
@@ -36527,9 +36527,6 @@  ix86_vectorize_builtin_gather (const_tre
     case V8SImode:
       code = si ? IX86_BUILTIN_GATHERSIV8SI : IX86_BUILTIN_GATHERALTDIV8SI;
       break;
-#if 0
-    /*  FIXME: Commented until vectorizer can work with (mask_type != src_type)
-	PR59617.   */
     case V8DFmode:
       if (TARGET_AVX512F)
 	code = si ? IX86_BUILTIN_GATHER3ALTSIV8DF : IX86_BUILTIN_GATHER3DIV8DF;
@@ -36554,7 +36551,6 @@  ix86_vectorize_builtin_gather (const_tre
       else
 	return NULL_TREE;
       break;
-#endif
     default:
       return NULL_TREE;
     }
--- gcc/tree-vect-stmts.c.jj	2014-01-03 11:41:01.000000000 +0100
+++ gcc/tree-vect-stmts.c	2014-01-03 21:29:47.595911084 +0100
@@ -1813,6 +1813,17 @@  vectorizable_mask_load_store (gimple stm
 			     "gather index use not simple.");
 	  return false;
 	}
+
+      tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gather_decl));
+      tree masktype
+	= TREE_VALUE (TREE_CHAIN (TREE_CHAIN (TREE_CHAIN (arglist))));
+      if (TREE_CODE (masktype) == INTEGER_TYPE)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "masked gather with integer mask not supported.");
+	  return false;
+	}
     }
   else if (tree_int_cst_compare (nested_in_vect_loop
 				 ? STMT_VINFO_DR_STEP (stmt_info)
@@ -5761,6 +5772,7 @@  vectorizable_load (gimple stmt, gimple_s
 	{
 	  mask = build_int_cst (TREE_TYPE (masktype), -1);
 	  mask = build_vector_from_val (masktype, mask);
+	  mask = vect_init_vector (stmt, mask, masktype, NULL);
 	}
       else if (SCALAR_FLOAT_TYPE_P (TREE_TYPE (masktype)))
 	{
@@ -5771,10 +5783,10 @@  vectorizable_load (gimple stmt, gimple_s
 	  real_from_target (&r, tmp, TYPE_MODE (TREE_TYPE (masktype)));
 	  mask = build_real (TREE_TYPE (masktype), r);
 	  mask = build_vector_from_val (masktype, mask);
+	  mask = vect_init_vector (stmt, mask, masktype, NULL);
 	}
       else
 	gcc_unreachable ();
-      mask = vect_init_vector (stmt, mask, masktype, NULL);
 
       scale = build_int_cst (scaletype, gather_scale);