diff mbox

[unified-autovect:,1b/N] Instruction tile and grammar creation.

Message ID 38C8F1E431EDD94A82971C543A11B4FEA53717C8@PUMAIL01.pu.imgtec.org
State New
Headers show

Commit Message

sameera April 6, 2017, 10:41 a.m. UTC
Hi Richard,

This is regarding the implementation of auto-vectorization based on my talk titled "Improving the effectiveness and generality of GCC auto-vectorization" in GNU Tools Cauldron 2015.

I have applied previous patches in branch 'unified-autovect' which generates primitive reorder tree (PRT) for load/store operations, and performs few optimizations to ensure optimal PRT is available for pattern matching. The options to enable unified auto-vectorization are '-ftree-vectorize -fopt-info-vec-all=file.txt -O2 -ftree-loop-vectorize-unified'. 

Now I am working on pattern matcher for PRT using bottom up rewrite parsing system (BURS) as described in "BURS Automata Generation" by Todd Proebsting. The paper describes an algorithm to generate BURS transition tables for cost effective pattern matching of complex target instructions. It takes target instructions represented in context free grammar in normal form as input, and generates transition table at build time of the compiler.

We have divided this problem into 3 parts:
1. Grammar creation for permute instructions of desired target at build time.
2. Transition table generation, and compile time executable code generation at build time.
3. Pattern matching code at compile time.

The patch I have attached addresses 1st part.

As number of states in automata increase exponentially for operators to be supported, I wanted to restrict the operators under consideration for BURS automata to be generated. As scalar to vector translator for arithmetic and logical instructions is already in place in GCC, and those operands do not play any role in permute order creation, the BURS is generated only for permute operators. For all other operators, GCC's current framework is to be used for code generation.

Hence, the only operators that are supported by the automata are
 ILV_2 : Interleave contents of 2 registers of size x and generate register of size 2x.
 EXTR_2,0   : Extract even elements from register of size x and generate register of size x/2
 EXTR_2,1  : Extract odd elements from register of size x and generate register of size x/2

Also, the number of states also depend upon the leaves in grammar. Hence, we are restricting the grammar to have 3 leaves namely REG, MEM and CONST.

We have defined new macro TARGET_VEC_PERM_CONST_ORDER which takes target supported permute orders as input in struct vec_perm_order_spec.

Current patch constructs instruction tiles for the target from this macro and optimizes those tiles using k-arity promotion reduction and redundancy elimination to have optimal PRT per instruction. It then generates context free grammar in normal form which can then be accepted by BURS algorithm to construct transition table.

There are multiple issues that need to be handled to support permute instructions across all architectures:
1. Constraints on input/output operands : We allow permutations on memory as well as registers to allow strided load/store and register permutes.
2. Different vector sizes supported : Each vector size needs to be enlisted in the macro in order to add support for that vector type.
3. Conditions under which the instruction is available : Like the condition field in RTL in md patterns.
4. Arity of permute operation : Multiple input single output model is supported.
5. Default rules to be generated for grammar to be complete : We need to identify default rules that are to be generated for pattern matcher so that the grammar will be complete. We have added
     goal --> reg              
     goal --> mem           
     reg --> mem            
     reg --> const           
     mem --> reg   
     mem --> const
       :
       :
     reg_32 --> REG  
     reg_16 --> REG  
     reg_8 --> REG    
     reg_4 --> REG    
     reg_2 --> REG     
     reg -> reg_2    
     reg -> reg_4            
     reg -> reg_8       
     reg -> reg_16         
     reg -> reg_32          
       :
       :
     mem --> MEM             
     const --> CONST           
     reg --> EXTR_2,0 (reg)    
     reg --> EXTR_2,1 (reg)    
     reg --> ILV (reg, reg)    

As we create 1 instruction tile per permute order in TARGET_VEC_PERM_CONST_ORDER, more detailed the macro,  more accurate the pattern matcher becomes.

Interestingly, even though different entries are created for different vector sizes, as and when possible, single tile is created for all vector sizes.
eg: In generated file insn-vect-inst-tiles.h which I have attached as sample, tile for permute order
permute order -  0  16  2  18  4  20  6  22  8  24  10  26  12  28  14  30 
ILV_2 (
  EXTR_2,0 (PHR:0) , 
  EXTR_2,0 (PHR:1))

is same as that for
permute order -  0  8  2  10  4  12  6  14 
ILV_2 (
  EXTR_2,0 (PHR:0) , 
  EXTR_2,0 (PHR:1))

This can be useful for SVE like extensions where vector size can be dynamic.

Richard, can you please see if this looks correct, or do I need additional information to successfully generate pattern matcher automatically?

Also, can you please comment on usability or scalability of this approach across all the architectures or point me to appropriate people in the group with whom I can discuss target specific vectorization issues?

- Thanks and regards,
  Sameera D.
diff mbox

Patch

Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	(revision 246613)
+++ gcc/Makefile.in	(working copy)
@@ -1067,7 +1067,12 @@ 
 	    build/print-rtl.o build/hash-table.o
 BUILD_MD = build/read-md.o
 BUILD_ERRORS = build/errors.o
+BUILD_UNITED = build/vec.o build/hash-table.o build/errors.o \
+	       build/ggc-none.o \
+	       build/tree-vect-unified-common.o build/tree-vect-unified-opts.o
 
+build/tree-vect-unified-common.o : tree-vect-unified-common.c gtype-desc.h insn-codes.h
+build/tree-vect-unified-opts.o : tree-vect-unified-opts.c gtype-desc.h insn-codes.h
 # Specify the directories to be searched for header files.
 # Both . and srcdir are used, in that order,
 # so that *config.h will be found in the compilation
@@ -2207,7 +2212,7 @@ 
   insn-emit.c insn-recog.c insn-extract.c insn-output.c insn-peep.c \
   insn-attr.h insn-attr-common.h insn-attrtab.c insn-dfatab.c \
   insn-latencytab.c insn-preds.c gimple-match.c generic-match.c \
-  insn-target-def.h
+  insn-target-def.h insn-vect-inst-tiles.h
 
 # Dependencies for the md file.  The first time through, we just assume
 # the md file itself and the generated dependency file (in order to get
@@ -2234,7 +2239,8 @@ 
 			  insn-extract.c insn-output.c \
 			  insn-peep.c insn-recog.c
 
-simple_generated_h	= $(simple_rtl_generated_h) insn-constants.h
+simple_generated_h	= $(simple_rtl_generated_h) insn-constants.h \
+			  insn-vect-inst-tiles.h
 
 simple_generated_c	= $(simple_rtl_generated_c) insn-enums.c
 
@@ -2602,6 +2608,8 @@ 
   $(GENSUPPORT_H)
 build/rtl.o: rtl.c $(BCONFIG_H) coretypes.h $(GTM_H) $(SYSTEM_H)	\
   $(RTL_H) $(GGC_H) errors.h
+build/tree.o: tree.c $(BCONFIG_H) coretypes.h $(GTM_H) $(SYSTEM_H)	\
+  $(RTL_H) $(GGC_H) errors.h
 build/vec.o : vec.c $(BCONFIG_H) $(SYSTEM_H) coretypes.h $(VEC_H)	\
    $(GGC_H) toplev.h $(DIAGNOSTIC_CORE_H)
 build/hash-table.o : hash-table.c $(BCONFIG_H) $(SYSTEM_H) coretypes.h  \
@@ -2655,6 +2663,9 @@ 
   coretypes.h $(GTM_H) $(RTL_BASE_H) errors.h $(READ_MD_H) $(GENSUPPORT_H)	\
   $(HASH_TABLE_H) target-insns.def
 build/gengenrtl.o : gengenrtl.c $(BCONFIG_H) $(SYSTEM_H) rtl.def
+build/genvect-inst-tiles.o : genvect-inst-tiles.c $(RTL_BASE_H) $(BCONFIG_H)    \
+  $(SYSTEM_H) coretypes.h $(GTM_H) errors.h tree-vect-unified.h \
+  tree-vect-unified-opts.o tree-vect-unified-common.o
 
 # The gengtype generator program is special: Two versions are built.
 # One is for the build machine, and one is for the host to allow
@@ -2732,8 +2743,11 @@ 
 genprogerr = $(genprogmd) genrtl modes gtype hooks cfn-macros
 $(genprogerr:%=build/gen%$(build_exeext)): $(BUILD_ERRORS)
 
+genprogunited = vect-inst-tiles
+$(genprogunited:%=build/gen%$(build_exeext)): $(BUILD_UNITED) 
+
 # Remaining build programs.
-genprog = $(genprogerr) check checksum condmd match
+genprog = $(genprogerr) $(genprogunited) check checksum condmd match
 
 # These programs need libs over and above what they get from the above list.
 build/genautomata$(build_exeext) : BUILD_LIBS += -lm
Index: gcc/config/mips/mips.h
===================================================================
--- gcc/config/mips/mips.h	(revision 246613)
+++ gcc/config/mips/mips.h	(working copy)
@@ -3468,4 +3468,37 @@ 
   (TARGET_LOAD_STORE_PAIRS && (TUNE_P5600 || TUNE_I6400) \
    && !TARGET_MICROMIPS && !TARGET_FIX_24K)
 
+#define TARGET_VEC_PERM_CONST_ORDER \
+{ \
+  {2, 2, 2, (int[2]){0,2}, 1, "PCKEV.D", "RRR", NULL, NULL}, \
+  {2, 2, 2, (int[2]){1,3}, 1, "PCKOD.D", "RRR", NULL, NULL}, \
+\
+  {2, 4, 4, (int[4]){0,4,2,6}, 1, "ILVEV.W", "RRR", NULL, NULL}, \
+  {2, 4, 4, (int[4]){1,5,3,7}, 1, "ILVOD.W", "RRR", NULL, NULL}, \
+  {2, 4, 4, (int[4]){0,2,4,6}, 1, "PCKEV.W", "RRR", NULL, NULL}, \
+  {2, 4, 4, (int[4]){1,3,5,7}, 1, "PCKOD.W", "RRR", NULL, NULL}, \
+  {2, 4, 4, (int[4]){2,6,3,7}, 1, "ILVL.W", "RRR", NULL, NULL}, \
+  {2, 4, 4, (int[4]){0,4,1,5}, 1, "ILVR.W", "RRR", NULL, NULL}, \
+\
+  {2, 8, 8, (int[8]){0,8,2,10,4,12,6,14}, 1, "ILVEV.H", "RRR", NULL, NULL}, \
+  {2, 8, 8, (int[8]){1,9,3,11,5,13,7,15}, 1, "ILVOD.H", "RRR", NULL, NULL}, \
+  {2, 8, 8, (int[8]){0,2,4,6,8,10,12,14}, 1, "PCKEV.H", "RRR", NULL, NULL}, \
+  {2, 8, 8, (int[8]){1,3,5,7,9,11,13,15}, 1, "PCKOD.H", "RRR", NULL, NULL}, \
+  {2, 8, 8, (int[8]){0,8,1,9,2,10,3,11}, 1, "ILVR.H", "RRR", NULL, NULL}, \
+  {2, 8, 8, (int[8]){4,12,5,13,6,14,7,15}, 1, "ILVL.H", "RRR", NULL, NULL}, \
+\
+  {2, 16, 16, (int[16]){0,16,2,18,4,20,6,22,8,24,10,26,12,28,14,30}, 1, \
+	 "ILVEV.Q", "RRR", NULL, NULL}, \
+  {2, 16, 16, (int[16]){1,17,3,19,5,21,7,23,9,25,11,27,13,29,15,31}, 1, \
+	 "ILVOD.Q", "RRR", NULL, NULL}, \
+  {2, 16, 16, (int[16]){0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30}, 1, \
+	 "PCKEV.Q", "RRR", NULL, NULL}, \
+  {2, 16, 16, (int[16]){1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31}, 1, \
+	 "PCKOD.Q", "RRR", NULL, NULL}, \
+  {2, 16, 16, (int[16]){8,24,9,25,10,26,11,27,12,28,13,29,14,30,15,31}, 1, \
+	 "ILVL.Q", "RRR", NULL, NULL}, \
+  {2, 16, 16, (int[16]){0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23}, 1, \
+	 "ILVR.Q", "RRR", NULL, NULL}, \
+}
+
 #define MAX_VECTOR_SIZE 16
Index: gcc/coretypes.h
===================================================================
--- gcc/coretypes.h	(revision 246613)
+++ gcc/coretypes.h	(working copy)
@@ -358,6 +358,8 @@ 
 typedef unsigned char uchar;
 #endif
 
+struct vec_perm_order_spec;
+
 /* Most host source files will require the following headers.  */
 #if !defined (GENERATOR_FILE) && !defined (USED_FOR_TARGET)
 #include "machmode.h"
Index: gcc/genvect-inst-tiles.c
===================================================================
--- gcc/genvect-inst-tiles.c	(revision 0)
+++ gcc/genvect-inst-tiles.c	(working copy)
@@ -0,0 +1,713 @@ 
+/* Loop Vectorization using unified representation for permute instructions.
+   Copyright (C) 2003-2015 Free Software Foundation, Inc.
+   Contributed by Sameera Deshpande <sameera.deshpande@imgtec.com>
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#define GENERATOR_FILE 1
+#include "bconfig.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "errors.h"
+#ifdef GENERATOR_FILE
+#include "machmode.h"
+#include "signop.h"
+#include "wide-int.h"
+#include "double-int.h"
+#include "real.h"
+#include "fixed-value.h"
+#include "statistics.h"
+#include "vec.h"
+#include "hash-table.h"
+#include "hash-set.h"
+#include "input.h"
+#include "is-a.h"
+#include "target.h"
+#endif
+
+#include "tree-core.h"
+#include "tree-vect-unified.h"
+//#include "tree-vect-unified-common.c"
+//#include "tree-vect-unified-opts.c"
+
+#define DEBUG 0
+int target_flags;
+
+struct vec_perm_order_spec target_spec[] = TARGET_VEC_PERM_CONST_ORDER;
+vec <struct grammar_rule *> rules;
+
+/* List of non-terminals used in grammar.  The index is used in the grammar rule
+   to point to appropriate non-terminal in the list.  For now, the non-terminal
+   is just list of strings with NT names.  However if needed, it can be updated
+   to hold additional information in the structure.  */
+vec<char *> non_terminals;
+
+/* List of terminals in Grammar.  Currently, we support only 3 categories in
+   terminals -
+   MEM, REG and CONST.  */
+vec<char *>terminals;
+
+/* Function create_placeholder.
+
+*/
+
+struct primop_tree *
+create_placeholder (int idx, char ch, struct primop_tree *parent)
+{
+  struct primop_tree *ptree;
+
+  ptree = populate_prim_node (POP_PH, NULL,
+			      parent, NULL);
+  PT_PH_IDX (ptree) = idx;
+  PT_PH_TYPE (ptree) = ch;
+  return ptree;
+}
+
+/* Function create_perm_order_tree.
+
+   For each element in TARGET_VEC_PERM_CONST_ORDER
+   Do
+     1. Create ILV node with arity out_vec_size.
+     2. For ith element in perm_order
+	Do
+	  1. Create EXTR node with parts = in_vec_size and selector = i % parts
+	  2. Create child of EXTR as PLACEHOLDER_<M|C|R>_<i / parts>, i / parts
+   	     should not exceed num_opd.  For k_arity_promotion_reduction and
+   	     unity_redundancy_elimination, PLACEHOLDER_<M|C|R>_<num> is used for
+   	     matching.  Whereas for grammar definition, only PLACEHOLDER_<M|C|R>
+   	     is used for generating rules.
+	   Done
+   Done
+*/
+
+struct primop_tree *
+create_perm_order_tree (struct vec_perm_order_spec spec)
+{
+  int i, num;
+  struct primop_tree *ilv_node, *expr_node, *placeholder;
+
+  ilv_node = create_primTree_combine (POP_ILV, NULL,
+		spec.out_vec_size, NULL, NULL);
+
+  for (i = 0; i < spec.out_vec_size; i++)
+    {
+      expr_node = create_primTree_partition (POP_EXTR, NULL,
+			spec.in_vec_size, spec.perm_order[i] % spec.in_vec_size,
+			NULL, ilv_node);
+      num = spec.perm_order[i] / spec.in_vec_size;
+      placeholder = create_placeholder (num,
+			 spec.opd_constraint[num], expr_node);
+      add_child_at_index (expr_node, placeholder, 0);
+      add_child_at_index (ilv_node, expr_node, i);
+    }
+
+  return ilv_node;
+}
+
+/* Function print_perm_order.
+
+*/
+
+void print_perm_order (int *perm_order, int num)
+{
+  int i;
+
+  for (i = 0; i < num; i++)
+    printf (" %d ", perm_order[i]);
+}
+
+/* Function print_instruction_tile.
+
+*/
+
+void
+print_instruction_tile (struct primop_tree *ptree, int tab = 0)
+{
+  int i;
+
+  if (PT_NODE_OP (ptree) != POP_PH)
+    {
+      printf ("\n");
+      for (i = 0; i < tab; i++)
+	printf (" ");
+    }
+  switch (PT_NODE_OP (ptree))
+    {
+      case POP_EXTR:
+	printf ("EXTR_%d,%d (", PT_DIVISION (ptree),
+		 PT_OPERAND_SELECTOR (ptree));
+	print_instruction_tile (PT_CHILD (ptree, 0), tab + 2);
+	printf (")");
+	break;
+      case POP_ILV:
+	printf ("ILV_%d (", PT_DIVISION (ptree));
+	for (i = 0; i < PT_DIVISION (ptree) - 1; i++)
+	  {
+	    print_instruction_tile (PT_CHILD (ptree, i), tab + 2);
+	    printf (" , ");
+	  }
+	print_instruction_tile (PT_CHILD (ptree, i), tab + 2);
+	printf (")");
+	break;
+      case POP_PH:
+	printf ("PH%c:%d", PT_PH_TYPE (ptree), PT_PH_IDX (ptree));
+	break;
+      default:
+	gcc_assert (!"\nUndesired case in printing tree.\n");
+	return;
+    }
+}
+
+/* Function print_instruction_tiles.
+
+*/
+
+void
+print_instruction_tiles ()
+{
+  int i;
+  printf ("/*");
+  for (i = 0; i < sizeof (target_spec)/sizeof (struct vec_perm_order_spec); i++)
+    {
+      printf ("\n\npermute order - ");
+      print_perm_order (target_spec[i].perm_order, target_spec[i].out_vec_size);
+      print_instruction_tile (target_spec[i].ptree);
+    }
+  printf ("*/");
+}
+
+/* Function create_instruction_tiles.
+
+   For each permute_order in TARGET_VEC_PERM_CONST_ORDER
+   Do
+     1. Create permute order tree from permute order - the permute order tree
+	so created is of arity out_vec_size.
+     2. Perform k_arity_promotion_reduction on permute order tree to reduce the
+	arity to 2.  As out_vec_size is power of 2, the promotion/reduction is
+	never going to fail.
+     3. Perform unity_redundancy_elimination of kind
+	ILV_m (EXTR_0(S), EXTR_1(S),...EXTR_m-1(S)) => S
+	EXTR_m,x (ILV_M(S1, S2, ... Sm)) => Sx
+	to get optimal permute order tree.
+   Done
+*/
+
+void
+create_instruction_tiles ()
+{
+  int i;
+  struct primop_tree *ptree;
+
+  for (i = 0; i < sizeof (target_spec)/sizeof (struct vec_perm_order_spec); i++)
+    {
+      ptree = create_perm_order_tree (target_spec[i]);
+      ptree = k_arity_promotion_reduction (ptree, 2);
+      ptree = unity_redundancy_elimination (ptree);
+      target_spec[i].ptree = ptree;
+    }
+
+}
+
+/* Function get_index.
+
+   Return index of non-terminal.
+*/
+
+int
+get_index (vec<char *> *worklist, char *str)
+{
+  int i;
+  for (i = 0; i < worklist->length (); i++)
+    {
+      if (!strcmp ((*worklist)[i], str))
+	return i;
+    }
+  return -1;
+}
+
+/* Function create_non_terminal.
+
+*/
+
+int
+create_non_terminal (char *str)
+{
+  int idx;
+  char *buf;
+
+  idx = get_index (&non_terminals, str);
+  if (idx != -1)
+    return idx;
+
+  idx = non_terminals.length ();
+  buf = (char *) xcalloc (strlen (str), sizeof (char));
+  strcpy (buf, str);
+  non_terminals.safe_insert (idx, buf);
+  return idx;
+}
+
+/* Function create_rule_NT_to_NT.
+
+   Creates grammar rule of kind NT --> NT for normalized grammar.
+*/
+
+struct grammar_rule *
+create_rule_NT_to_NT (int lhs_nt, int rhs_nt, int off)
+{
+  struct grammar_rule * rule;
+
+  rule = (struct grammar_rule *) xcalloc (1, sizeof (struct grammar_rule));
+  rule->lhs_nt = lhs_nt;
+  rule->type = NT2NT;
+  rule->u.non_terminal = rhs_nt;
+  rule->spec_idx = off;
+
+  return rule;
+}
+
+/* Function create_rule_NT_to_T.
+
+   Creates grammar rule of kind NT --> T for normalized grammar.
+*/
+
+struct grammar_rule *
+create_rule_NT_to_T (int lhs_nt, int rhs_t, int off)
+{
+  struct grammar_rule * rule;
+
+  rule = (struct grammar_rule *) xcalloc (1, sizeof (struct grammar_rule));
+  rule->lhs_nt = lhs_nt;
+  rule->type = NT2T;
+  rule->u.terminal = rhs_t;
+  rule->spec_idx = off;
+
+  return rule;
+}
+
+/* Function create_rule_NT_to_op_tree.
+
+   Creates grammar rule of kind NT --> OP (NT1, NT2 ...) for normalized grammar.
+*/
+
+struct grammar_rule *
+create_rule_NT_to_op_tree (int lhs_nt, enum primop_code op, int selector,
+ 			   int division, int *rhs_opd, int length, int off)
+{
+  struct grammar_rule * rule;
+  int i;
+
+  rule = (struct grammar_rule *) xcalloc (1, sizeof (struct grammar_rule));
+  rule->lhs_nt = lhs_nt;
+  rule->type = NT2OP;
+  rule->u.rhs_exp.primop.op = op;
+  rule->u.rhs_exp.primop.opd_selector = selector;
+  rule->u.rhs_exp.primop.division = division;
+  rule->u.rhs_exp.primop.var_stride = NULL;
+  rule->spec_idx = off;
+
+  for (i = 0; i < length; i++)
+    {
+      rule->u.rhs_exp.rhs_nt.safe_insert (i, rhs_opd[i]);
+    }
+
+  return rule;
+}
+
+/* Function create_terminals.
+
+*/
+
+void
+create_terminals ()
+{
+  char * buf;
+
+  buf = (char *) xcalloc (4, sizeof (char));
+  strcpy (buf, "MEM");
+  terminals.safe_insert (0, buf);
+
+  buf = (char *) xcalloc (4, sizeof (char));
+  strcpy (buf, "REG");
+  terminals.safe_insert (1, buf);
+
+  buf = (char *) xcalloc (6, sizeof (char));
+  strcpy (buf, "CONST");
+  terminals.safe_insert (2, buf);
+
+  return;
+}
+
+/* Function create_default_rules.
+
+     Default rules		Costs
+     ================================
+     goal --> reg     		(0)
+     goal --> mem     		(0)
+     reg --> mem  		(10)
+     reg --> const 		(5)
+     mem --> reg		(10)
+     mem --> const		(8)
+       :
+       :
+     reg_32 --> REG		(0)
+     reg_16 --> REG		(0)
+     reg_8 --> REG		(0)
+     reg_4 --> REG		(0)
+     reg_2 --> REG		(0)
+     reg -> reg_2		(0)
+     reg -> reg_4		(0)
+     reg -> reg_8		(0)
+     reg -> reg_16		(0)
+     reg -> reg_32		(0)
+       :
+       :
+     mem --> MEM		(0)
+     const --> CONST		(0)
+     reg --> EXTR_2,0 (reg)	(1)
+     reg --> EXTR_2,1 (reg)	(1)
+     reg --> ILV (reg, reg)	(2)
+*/
+
+void
+create_default_rules ()
+{
+  int goal, reg, mem, consti;
+  int i, vec_size, max_vec_size, vector_sizes;
+  int v[2];
+
+  create_terminals ();
+  goal = create_non_terminal ("goal");
+  reg = create_non_terminal ("reg");
+  mem = create_non_terminal ("mem");
+  consti = create_non_terminal ("const");
+  rules.safe_insert (rules.length (), create_rule_NT_to_NT (goal, reg, -1));
+  rules.safe_insert (rules.length (), create_rule_NT_to_NT (goal, mem, -1));
+  rules.safe_insert (rules.length (), create_rule_NT_to_NT (reg, mem, -1));
+  rules.safe_insert (rules.length (), create_rule_NT_to_NT (reg, consti, -1));
+  rules.safe_insert (rules.length (), create_rule_NT_to_NT (mem, reg, -1));
+  rules.safe_insert (rules.length (), create_rule_NT_to_NT (mem, consti, -1));
+  rules.safe_insert (rules.length (), create_rule_NT_to_T (mem, 0, -1));
+  rules.safe_insert (rules.length (), create_rule_NT_to_T (consti, 2, -1));
+
+  v[0] = v[1] = reg;
+
+  /* For each vector type supported, add NT2T rule for
+     reg.  */
+  vector_sizes = MAX_VECTOR_SIZE;
+  max_vec_size = 1 << floor_log2 (vector_sizes);
+  vec_size = max_vec_size;
+
+  for (i = 0; i < floor_log2 (max_vec_size); i++)
+    {
+      char buf[20];
+      int idx;
+
+      sprintf (buf, "reg_%d", vec_size);
+      vec_size = vec_size >> 1;
+      idx = create_non_terminal (buf);
+      rules.safe_insert (rules.length (), create_rule_NT_to_NT (reg, idx, -1));
+      rules.safe_insert (rules.length (), create_rule_NT_to_T (idx, 1, -1));
+    }
+  rules.safe_insert (rules.length (),
+	 create_rule_NT_to_op_tree (reg, POP_EXTR, 0, 2, v, 1, -1));
+  rules.safe_insert (rules.length (),
+	 create_rule_NT_to_op_tree (reg, POP_EXTR, 1, 2, v, 1, -1));
+  rules.safe_insert (rules.length (),
+	 create_rule_NT_to_op_tree (reg, POP_ILV, -1, 2, v, 2, -1));
+  return;
+}
+
+/* Function lookup_NT2T_in_grammar.
+
+   Look-up similar rule in rule-list.
+*/
+
+int
+lookup_NT2T_in_grammar (vec <struct grammar_rule *> *rule,
+			char *nt_substr, int t_idx)
+{
+  int i;
+  for (i = 0; i < rule->length (); i++)
+    {
+      struct grammar_rule *r;
+      rule->iterate (i, &r);
+
+      if (r->u.terminal == t_idx
+	  && (strstr (non_terminals[r->lhs_nt], nt_substr)))
+	return i;
+    }
+  return -1;
+}
+
+/* Function lookup_NT2OP_in_grammar.
+
+   Look-up rule matching operation OP and children in vector nt_list.
+*/
+
+int
+lookup_NT2OP_in_grammar (vec <struct grammar_rule *> *rule,
+			 enum primop_code code, int sel, int div,
+			 int *nt_list, int length)
+{
+  int i, j;
+
+  for (i = 0; i < rule->length (); i++)
+    {
+      struct grammar_rule *r;
+      rule->iterate (i, &r);
+
+      if (r->u.rhs_exp.primop.op == code
+	  && sel == r->u.rhs_exp.primop.opd_selector
+	  && div == r->u.rhs_exp.primop.division)
+	{
+	  for (j = 0; j < length; j++)
+	    {
+	      if (nt_list[j] != r->u.rhs_exp.rhs_nt[j])
+		break;
+	    }
+	  if (j == length)
+	    return i;
+	}
+    }
+
+  return -1;
+}
+
+/* Function create_rule_for_ptree.
+
+   Recursive function to create grammar rules in normal form.
+
+   If the leaf node with placeholder :
+     - check if the rule is already present.
+       If yes, return the previously created non-terminal
+       Otherwise, create new non-terminal for the place-holder, with appropriate
+       vector size, and create rules of the form
+	 <new_nt> --> <REG|MEM|CONST>
+	 <reg_<vecsize>|mem|const> --> <new_nt>
+
+   If non-leaf node,
+     - For each child of the node invoke the function recursively.
+     - Once all children are processed, check if the rule with current primop
+       and children is already present.
+       If yes, return previously created non-terminal for corresponding rule.
+       Otherwise, create non-terminal for out_vecsize and create rules of the
+       form
+	 <new_nt> --> PRIMOP (<list of NTs corresponding to respective child>)
+	 <reg_<vecsize>|mem|const> --> <new_nt>
+*/
+
+int
+create_rule_for_ptree (struct primop_tree *ptree, int spec_idx, int out_vecsize,
+		       int in_vecsize, vec <struct grammar_rule *> *rule)
+{
+  static int name_idx = 0;
+  int chld_nt[30];
+  char buf[20], name[20];
+  int found;
+  int nt;
+  int new_in_vec_size;
+  int i;
+
+  if (PT_NODE_OP (ptree) == POP_PH)
+    {
+      found = lookup_NT2T_in_grammar (rule, "ph",
+			PT_PH_TYPE (ptree) == 'R' ? 1 :
+				PT_PH_TYPE (ptree) == 'M' ? 0 : 2);
+      if (found != -1)
+	{
+	  return (*rule)[found]->lhs_nt;
+	}
+      else
+	{
+	  sprintf (name, "ph%d", name_idx++);
+	  nt = create_non_terminal (name);
+	  rule->safe_insert (rule->length (),
+		create_rule_NT_to_T (nt,
+			PT_PH_TYPE (ptree) == 'R' ? 1 :
+				PT_PH_TYPE (ptree) == 'M' ? 0 : 2, spec_idx));
+
+	  sprintf (buf, "reg_%d", in_vecsize);
+	  rule->safe_insert (rule->length (),
+		create_rule_NT_to_NT (
+			PT_PH_TYPE (ptree) == 'R' ? create_non_terminal (buf) :
+			PT_PH_TYPE (ptree) == 'M'
+			 ? create_non_terminal ("mem") :
+			   create_non_terminal ("const"), nt, spec_idx));
+	  return nt;
+	}
+
+    }
+
+  if (PT_NODE_OP (ptree) == POP_ILV)
+    new_in_vec_size = in_vecsize / PT_DIVISION (ptree);
+  if (PT_NODE_OP (ptree) == POP_EXTR)
+    new_in_vec_size = in_vecsize * PT_DIVISION (ptree);
+
+  for (i = 0; i < ptree->children.length (); i++)
+    {
+      chld_nt[i] = create_rule_for_ptree (PT_CHILD (ptree, i),
+			spec_idx, in_vecsize, new_in_vec_size, rule);
+    }
+
+
+  found = lookup_NT2OP_in_grammar (rule, (enum primop_code) PT_NODE_OP (ptree),
+				   PT_OPERAND_SELECTOR (ptree),
+				   PT_DIVISION (ptree), chld_nt, i);
+
+  if (found != -1)
+    {
+      return (*rule)[found]->lhs_nt;
+    }
+  else
+    {
+      /* Create new NT, and create rule NT_to_OP.  */
+      sprintf (name, "inter%d", name_idx++);
+      nt = create_non_terminal (name);
+      rule->safe_insert (rule->length (),
+    	create_rule_NT_to_op_tree (nt, (enum primop_code) PT_NODE_OP (ptree),
+    		PT_OPERAND_SELECTOR (ptree), PT_DIVISION (ptree),
+    		chld_nt, i, spec_idx));
+
+      sprintf (buf, "reg_%d", out_vecsize);
+      rule->safe_insert (rule->length (),
+    	create_rule_NT_to_NT (
+    		target_spec[spec_idx].opd_constraint[0] == 'R' ?
+    			create_non_terminal (buf) :
+    		target_spec[spec_idx].opd_constraint[0] == 'M' ?
+    			create_non_terminal ("mem") :
+    		    	create_non_terminal ("const"), nt, spec_idx));
+
+      return nt;
+    }
+}
+
+/* Function create_grammar_rules.
+
+   Creates grammar rules for each primop_tree.
+*/
+
+void
+create_grammar_rules ()
+{
+  int i, j, k;
+
+  create_default_rules ();
+  for (i = 0, j = rules.length ();
+       i < sizeof (target_spec)/sizeof (struct vec_perm_order_spec);
+       i++)
+    {
+      int idx = 0;
+      vec <struct grammar_rule *> rule = vNULL;
+
+      create_rule_for_ptree (target_spec[i].ptree, i,
+		 target_spec[i].out_vec_size,
+		 target_spec[i].in_vec_size, &rule);
+
+      for (k = 0; k < rule.length (); k++)
+	rules.safe_insert (j++, rule[k]);
+    }
+}
+
+/* Function print_rule_operands.
+
+*/
+
+void
+print_rule_operands (vec<int> *arr)
+{
+  int i;
+  printf ("%s", non_terminals[(*arr)[0]]);
+  for (i = 1; i < arr->length (); i++)
+    {
+      printf (", %s", non_terminals[(*arr)[i]]);
+    }
+}
+
+/* Function print_grammar_rule.
+
+*/
+
+void
+print_grammar_rule (struct grammar_rule *rule)
+{
+  switch (rule->type)
+    {
+      case NT2T:
+	printf ("%s --> %s\n", non_terminals[rule->lhs_nt],
+		 terminals[rule->u.terminal]);
+	break;
+
+      case NT2NT:
+	printf ("%s --> %s\n", non_terminals[rule->lhs_nt],
+		 non_terminals[rule->u.non_terminal]);
+
+	break;
+
+      case NT2OP:
+	printf ("%s --> ", non_terminals[rule->lhs_nt]);
+	switch (rule->u.rhs_exp.primop.op)
+	  {
+	    case POP_ILV:
+	      printf ("ILV_%d (", rule->u.rhs_exp.primop.division);
+	      break;
+	    case POP_EXTR:
+	      printf ("EXTR_%d,%d (", rule->u.rhs_exp.primop.division,
+		      rule->u.rhs_exp.primop.opd_selector);
+	      break;
+	    default:
+	      gcc_assert (0);
+	  }
+	print_rule_operands (&rule->u.rhs_exp.rhs_nt);
+	printf (")\n");
+
+	break;
+
+      default:
+	gcc_assert (0);
+    }
+}
+
+/* Function print_grammar_rules.
+
+   Emit grammar rules for which automata is constructed.
+
+*/
+
+void
+print_grammar_rules ()
+{
+  int i;
+
+  printf ("/*\n");
+  for (i = 0; i < rules.length (); i++)
+    {
+      print_grammar_rule (rules[i]);
+    }
+  printf ("*/\n");
+}
+
+
+int main (int argc, const char **argv)
+{
+  printf ("/* Generated automatically by the program `genvect-inst-tiles'\n\
+from the macro TARGET_VEC_PERM_CONST_ORDER in target header file.  */\n\n");
+  create_instruction_tiles ();
+  print_instruction_tiles ();
+  create_grammar_rules ();
+  print_grammar_rules ();
+}
Index: gcc/tree-vect-unified-common.c
===================================================================
--- gcc/tree-vect-unified-common.c	(revision 246613)
+++ gcc/tree-vect-unified-common.c	(working copy)
@@ -61,7 +61,6 @@ 
 #include "builtins.h"
 #include "params.h"
 #include "pretty-print.h"
-#include "pretty-print.h"
 #else
 #include "errors.h"
 #include "machmode.h"
@@ -78,9 +77,6 @@ 
 #include "is-a.h"
 #include "target.h"
 #include "tree-core.h"
-#include "tree-vect-unified.h"
-
-
 #endif
 
 #include "tree-vect-unified.h"
Index: gcc/tree-vect-unified-opts.c
===================================================================
--- gcc/tree-vect-unified-opts.c	(revision 246613)
+++ gcc/tree-vect-unified-opts.c	(working copy)
@@ -61,7 +61,6 @@ 
 #include "builtins.h"
 #include "params.h"
 #include "pretty-print.h"
-#include "pretty-print.h"
 #else
 # include "errors.h"
 #include "machmode.h"
@@ -78,8 +77,6 @@ 
 #include "is-a.h"
 #include "target.h"
 #include "tree-core.h"
-#include "tree-vect-unified.h"
-
 #endif
 
 #include "tree-vect-unified.h"
@@ -520,8 +517,8 @@ 
 annotate_tree_nodes (struct primop_tree *ptree, int *end,
 				struct primop_tree **leaves)
 {
-  int key;
-  long long int idx;
+  long long int key;
+  int idx;
   int length = 0;
   int i, j;
   struct primop_tree *temp[150];
@@ -541,8 +538,8 @@ 
 
   for (i = 0; i < ptree->children.length (); i++)
     {
-      key = (key | annotate_tree_nodes (PT_CHILD (ptree, i),
-			&length, temp)) << 12;
+      key = (key << 12)
+	     | annotate_tree_nodes (PT_CHILD (ptree, i), &length, temp);
       for (j = 0; j < length; j++)
 	{
 	  leaves[(*end)++] = temp[j];
@@ -575,7 +572,11 @@ 
   bool changed = false;
   int to_be_matched;
   struct primop_tree *temp_ptree;
-  int i;
+  int i, j;
+  long long int key;
+  int idx, end;
+  struct primop_tree *leaves[150];
+  struct primtree_hash_table *new_hash;
 
   *new_ptree = ptree;
 
@@ -612,6 +613,8 @@ 
 	}
     }
 
+  key = PT_NODE_OP (ptree) << 10;
+  end = 0;
   for (i = 0; i < (*new_ptree)->children.length (); i++)
     {
       changed |= unity_redundancy_elimination_2 (PT_CHILD (*new_ptree, i),
@@ -618,8 +621,45 @@ 
 			 &temp_ptree);
       PT_CHILD (*new_ptree, i) = temp_ptree;
       PT_PARENT (temp_ptree) = *new_ptree;
+      if (PT_NODE_OP (temp_ptree) == POP_MEMREF
+	  || PT_NODE_OP (temp_ptree) == POP_PH
+	  || PT_NODE_OP (temp_ptree) == POP_CONST)
+	{
+	  key = (key | 0xfff) << 12;
+	  leaves[end++] = temp_ptree; 
+	}
+      else
+	{
+      	  key = (key << 12) | PT_AUX(temp_ptree);
+      
+      	  for (j = 0;
+	       j < primop_hash[PT_AUX(temp_ptree)]->leaves.length ();
+	       j++)
+	    {
+      	      leaves[end + j] = primop_hash[PT_AUX(temp_ptree)]->leaves[j];
+	    }
+
+          end = end + j;
+	}
+
     }
 
+  idx = lookup_key_in_table (key, leaves, end);
+
+  if (idx == -1)
+    {
+      // Create new entry.
+      new_hash = (struct primtree_hash_table *) xcalloc (1,
+      sizeof (struct primtree_hash_table));
+      new_hash->key = key;
+      new_hash->leaves = vNULL;
+      for (i = 0; i < end; i++)
+	new_hash->leaves.safe_insert (new_hash->leaves.length (), leaves[i]);
+      idx = primop_hash.length ();
+      primop_hash.safe_insert (idx, new_hash);
+    }
+  PT_AUX (*new_ptree) = idx;
+
     return changed;
 }
 
@@ -640,13 +680,12 @@ 
   bool changed;
 
   annotate_tree_nodes (ptree, &end, dummy);
-
   changed = false;
 
   do {
     changed = unity_redundancy_elimination_2 (ptree, &new_ptree);
-    if (ptree == new_ptree)
-      break;
+    //if (ptree == new_ptree)
+    //  break;
     ptree = new_ptree;
   } while (changed == true);
 
Index: gcc/tree-vect-unified.c
===================================================================
--- gcc/tree-vect-unified.c	(revision 246613)
+++ gcc/tree-vect-unified.c	(working copy)
@@ -2527,6 +2527,26 @@ 
 	      dump_iter_node (tmp_iter_node, alt_dump_file);
 	  }
 
+	worklist = vNULL;
+	worklist = (ITER_NODE_LOOP_BODY (tmp_iter_node)).copy ();
+	for (i = 0;  i < worklist.length (); i++)
+	  {
+	    gcc_assert (worklist.iterate (i, &tmp_tree));
+	    tmp_tree = unity_redundancy_elimination (tmp_tree);
+
+	    ITER_NODE_LOOP_BODY (tmp_iter_node)[i] = tmp_tree;
+	  }
+
+	if (dump_enabled_p ())
+	  {
+	    dump_printf (MSG_NOTE, "\nUnity redundancy elimination applied.\n");
+	    if (dump_file)
+	      dump_iter_node (tmp_iter_node, dump_file);
+	    if (alt_dump_file)
+	      dump_iter_node (tmp_iter_node, alt_dump_file);
+	  }
+
+
 	gimple *loop_vectorized_call = vect_loop_vectorized_call (loop);
 	/* If the loop is vectorized, set uid of stmts within scalar loop to
 	   0.  This change is needed if transform phase uses this loop info.  */
Index: gcc/tree-vect-unified.h
===================================================================
--- gcc/tree-vect-unified.h	(revision 246613)
+++ gcc/tree-vect-unified.h	(working copy)
@@ -285,6 +285,87 @@ 
 #undef DEFTREECODE
 #undef END_OF_BASE_TREE_CODES
 
+/* TARGET_VEC_PERM_CONST_ORDER - If defined, this target hook points to an array
+   of "struct vec_perm_order_spec" specifying various permute orders supported
+   by the target architecture.  */
+
+struct vec_perm_order_spec
+{
+  /* Number of operands in permute order specified.  */
+  int num_opd;
+
+  /* Vector size of input permute order.  */
+  int in_vec_size;
+
+  /* Vector size of resultant permute order.  It can be identified from the size
+     of perm_order, however, if by mistake, this field is not defined properly,
+     can lead to errors.  Hence, taking that as input.  */
+  int out_vec_size;
+
+  /* Permute order of operands.  */
+  int *perm_order;
+
+  /* Cost of permute operation.  */
+  int cost;
+
+  /* Name of permute operation for debugging purpose.  */
+  char *op_name;
+
+  /* The constraints on input and output operands of this instruction.
+     Restricting these to R,M or I for register, memory and integer constant
+     respectively.  This is needed for reduction rules to be generated for BURS
+     tree.  It should have comma separated list - with num_opd + 1 listings.  */
+  char * opd_constraint;
+
+  /* Condition under which the instruction can be emitted.  Thinking of
+     something like condition part in define_insn.  */
+  char * cond;
+
+  /* PRIMOP_TREE constructed after tile construction.  */
+  struct primop_tree *ptree;
+};
+
+enum rule_type {NT2T, NT2NT, NT2OP};
+
+/* Normalized context free grammar of the form
+   NT --> T
+   NT--> NT
+   NT --> OP (<list of NTs>)  */
+struct grammar_rule
+{
+  /* Pointer to vec_perm_order_spec corresponding to grammar rule.  For default
+     rules, this value is NULL.  */
+  struct vec_perm_order_spec *porder;
+
+  /* Non-terminal on LHS.  */
+  int lhs_nt;
+
+  enum rule_type type;
+
+  int spec_idx;
+
+  union {
+    /* Terminal on RHS.  */
+    int terminal;
+
+    /* Non-terminal on RHS.  */
+    int non_terminal;
+
+    /* RHS of the form OP_div,sel (NT1, NT2...NTk) for k_arity operation op.  */
+    struct rhs_expression {
+      struct operation {
+	enum primop_code op;
+	int opd_selector;
+	int division;
+	tree *var_stride;
+      } primop;
+
+      vec<int> rhs_nt;
+    }  rhs_exp;
+  } u;
+};
+
+
 extern unsigned int vectorize_loops_using_uniop (void);
 extern struct primop_tree * analyze_and_create_ptree (struct primop_tree *,
 		 		gimple *, struct ITER_node *);