diff mbox series

[v2] Rearrange SLP nodes with duplicate statements. [PR98138]

Message ID 20240626120610.1862412-1-manolis.tsamis@vrull.eu
State New
Headers show
Series [v2] Rearrange SLP nodes with duplicate statements. [PR98138] | expand

Commit Message

Manolis Tsamis June 26, 2024, 12:06 p.m. UTC
This change checks when a two_operators SLP node has multiple occurrences of
the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the operands
so that there are no duplicates. Two vec_perm expressions are then introduced
to recreate the original ordering. These duplicates can appear due to how
two_operators nodes are handled, and they prevent vectorization in some cases.

This targets the vectorization of the SPEC2017 x264 pixel_satd functions.
In some processors a larger than 10% improvement on x264 has been observed.

See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

gcc/ChangeLog:

	* tree-vect-slp.cc: Avoid duplicates in two_operators nodes.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/vect-slp-two-operator.c: New test.

Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
---

Changes in v2:
        - Do not use predefined patterns; support rearrangement of arbitrary
        node orderings.
        - Only apply for two_operators nodes.
        - Recurse with single SLP operand instead of two duplicated ones.
        - Refactoring of code.

 .../aarch64/vect-slp-two-operator.c           |  36 ++++++
 gcc/tree-vect-slp.cc                          | 114 ++++++++++++++++++
 2 files changed, 150 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c

Comments

Manolis Tsamis June 26, 2024, 12:33 p.m. UTC | #1
This is a reworked implementation that only deduplicates two_operators
nodes and supports arbitrary orderings.

Based on the discussion on the original submission, I have made some
SPEC runs to see if this transformation applies to any other
benchmarks.
Aside from x264 this now triggers on SPEC2017 521.wrf and SPECv8
712.av1aom. In these additional two cases the transformation applies
in a way that could help vectorization.
For example in the relevant av1 code, we see things like:

note: Operand 0:
note: stmt 0 _718 = MEM[(int32_t *)output_1462(D) + 16B];
note: stmt 1 _718 = MEM[(int32_t *)output_1462(D) + 16B];
note: stmt 2 _722 = MEM[(int32_t *)output_1462(D) + 28B];
note: stmt 3 _722 = MEM[(int32_t *)output_1462(D) + 28B];
note: Operand 1:
note: stmt 0 _719 = MEM[(int32_t *)output_1462(D) + 20B];
note: stmt 1 _719 = MEM[(int32_t *)output_1462(D) + 20B];
note: stmt 2 _723 = MEM[(int32_t *)output_1462(D) + 24B];
note: stmt 3 _723 = MEM[(int32_t *)output_1462(D) + 24B];
note: With a single operand:
note: stmt 0 _718 = MEM[(int32_t *)output_1462(D) + 16B];
note: stmt 1 _719 = MEM[(int32_t *)output_1462(D) + 20B];
note: stmt 2 _722 = MEM[(int32_t *)output_1462(D) + 28B];
note: stmt 3 _723 = MEM[(int32_t *)output_1462(D) + 24B];

Whereas gcc master will create load nodes with permutations that have
repeated elements. The issue here that prevents even better
vectorization, and that affects all other cases including x264, is
that there is no way to properly order the deduplicated elements each
time. For example consider x264, in which we have two different
patterns that we're applying the transformation to:

    int t0 = s0 + s1;
    int t1 = s0 - s1;
    int t2 = s2 + s3;
    int t3 = s2 - s3;

and

    d0 = t0 + t2;
    d1 = t1 + t3;
    d2 = t0 - t2;
    d3 = t1 - t3;

The preferred order for the deduplicated operands in the first case is
[s0, s1, s2, s3] and in the second case [t0, t1, t2, t3]. But because
the operands appear in different order one of the two will end up in a
different ordering. With this implementation we get a good first node

x264.c:29:23: note:   Replace two_operators operands:
x264.c:29:23: note:   Operand 0:
x264.c:29:23: note:       stmt 0 a0_110 = (uint32_t) _12;
x264.c:29:23: note:       stmt 1 a2_112 = (uint32_t) _36;
x264.c:29:23: note:       stmt 2 a0_110 = (uint32_t) _12;
x264.c:29:23: note:       stmt 3 a2_112 = (uint32_t) _36;
x264.c:29:23: note:   Operand 1:
x264.c:29:23: note:       stmt 0 a1_111 = (uint32_t) _24;
x264.c:29:23: note:       stmt 1 a3_113 = (uint32_t) _48;
x264.c:29:23: note:       stmt 2 a1_111 = (uint32_t) _24;
x264.c:29:23: note:       stmt 3 a3_113 = (uint32_t) _48;
x264.c:29:23: note:   With a single operand:
x264.c:29:23: note:       stmt 0 a0_110 = (uint32_t) _12;
x264.c:29:23: note:       stmt 1 a1_111 = (uint32_t) _24;
x264.c:29:23: note:       stmt 2 a2_112 = (uint32_t) _36;
x264.c:29:23: note:       stmt 3 a3_113 = (uint32_t) _48;

but mess the order in the second one:

x264.c:29:23: note:   Replace two_operators operands:
x264.c:29:23: note:   Operand 0:
x264.c:29:23: note:       stmt 0 t0_114 = (int) _49;
x264.c:29:23: note:       stmt 1 t1_115 = (int) _50;
x264.c:29:23: note:       stmt 2 t0_114 = (int) _49;
x264.c:29:23: note:       stmt 3 t1_115 = (int) _50;
x264.c:29:23: note:   Operand 1:
x264.c:29:23: note:       stmt 0 t2_116 = (int) _51;
x264.c:29:23: note:       stmt 1 t3_117 = (int) _52;
x264.c:29:23: note:       stmt 2 t2_116 = (int) _51;
x264.c:29:23: note:       stmt 3 t3_117 = (int) _52;
x264.c:29:23: note:   With a single operand:
x264.c:29:23: note:       stmt 0 t0_114 = (int) _49;
x264.c:29:23: note:       stmt 1 t2_116 = (int) _51;
x264.c:29:23: note:       stmt 2 t1_115 = (int) _50;
x264.c:29:23: note:       stmt 3 t3_117 = (int) _52;

and get [_49, _51, _50, _52] instead of the preferred [_49, _50, _51,
_52]. As a result we get an extra layer of 4 permute instructions when
we generate code. In other cases this gets even worse.
Is there any reasonable way to improve the ordering of these nodes? I
though of sorting based on SSA name version but that's a workaround at
best and doesn't work in all cases.

Manolis


On Wed, Jun 26, 2024 at 3:06 PM Manolis Tsamis <manolis.tsamis@vrull.eu> wrote:
>
> This change checks when a two_operators SLP node has multiple occurrences of
> the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the operands
> so that there are no duplicates. Two vec_perm expressions are then introduced
> to recreate the original ordering. These duplicates can appear due to how
> two_operators nodes are handled, and they prevent vectorization in some cases.
>
> This targets the vectorization of the SPEC2017 x264 pixel_satd functions.
> In some processors a larger than 10% improvement on x264 has been observed.
>
> See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138
>
> gcc/ChangeLog:
>
>         * tree-vect-slp.cc: Avoid duplicates in two_operators nodes.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/vect-slp-two-operator.c: New test.
>
> Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
> ---
>
> Changes in v2:
>         - Do not use predefined patterns; support rearrangement of arbitrary
>         node orderings.
>         - Only apply for two_operators nodes.
>         - Recurse with single SLP operand instead of two duplicated ones.
>         - Refactoring of code.
>
>  .../aarch64/vect-slp-two-operator.c           |  36 ++++++
>  gcc/tree-vect-slp.cc                          | 114 ++++++++++++++++++
>  2 files changed, 150 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
> new file mode 100644
> index 00000000000..b6b093ffc34
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
> @@ -0,0 +1,36 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect -fdump-tree-vect-details" } */
> +
> +typedef unsigned char uint8_t;
> +typedef unsigned int uint32_t;
> +
> +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
> +    int t0 = s0 + s1;\
> +    int t1 = s0 - s1;\
> +    int t2 = s2 + s3;\
> +    int t3 = s2 - s3;\
> +    d0 = t0 + t2;\
> +    d1 = t1 + t3;\
> +    d2 = t0 - t2;\
> +    d3 = t1 - t3;\
> +}
> +
> +void sink(uint32_t tmp[4][4]);
> +
> +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
> +{
> +    uint32_t tmp[4][4];
> +    int sum = 0;
> +    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
> +    {
> +        uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
> +        uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
> +        uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
> +        uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
> +        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
> +    }
> +    sink(tmp);
> +}
> +
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index b47b7e8c979..60d0d388dff 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -2420,6 +2420,95 @@ out:
>        }
>    swap = NULL;
>
> +  bool has_two_operators_perm = false;
> +  auto_vec<unsigned> two_op_perm_indices[2];
> +  vec<stmt_vec_info> two_op_scalar_stmts[2] = {vNULL, vNULL};
> +
> +  if (two_operators && oprnds_info.length () == 2 && group_size > 2)
> +    {
> +      unsigned idx = 0;
> +      hash_map<gimple *, unsigned> seen;
> +      vec<slp_oprnd_info> new_oprnds_info
> +       = vect_create_oprnd_info (1, group_size);
> +      bool success = true;
> +
> +      enum tree_code code = ERROR_MARK;
> +      if (oprnds_info[0]->def_stmts[0]
> +         && is_a<gassign *> (oprnds_info[0]->def_stmts[0]->stmt))
> +       code = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt);
> +
> +      for (unsigned j = 0; j < group_size; ++j)
> +       {
> +         FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info)
> +           {
> +             stmt_vec_info stmt_info = oprnd_info->def_stmts[j];
> +             if (!stmt_info || !stmt_info->stmt
> +                 || !is_a<gassign *> (stmt_info->stmt)
> +                 || gimple_assign_rhs_code (stmt_info->stmt) != code
> +                 || skip_args[i])
> +               {
> +                 success = false;
> +                 break;
> +               }
> +
> +             bool exists;
> +             unsigned &stmt_idx
> +               = seen.get_or_insert (stmt_info->stmt, &exists);
> +
> +             if (!exists)
> +               {
> +                 new_oprnds_info[0]->def_stmts.safe_push (stmt_info);
> +                 new_oprnds_info[0]->ops.safe_push (oprnd_info->ops[j]);
> +                 stmt_idx = idx;
> +                 idx++;
> +               }
> +
> +             two_op_perm_indices[i].safe_push (stmt_idx);
> +           }
> +
> +         if (!success)
> +           break;
> +       }
> +
> +      if (success && idx == group_size)
> +       {
> +         if (dump_enabled_p ())
> +           {
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "Replace two_operators operands:\n");
> +
> +             FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info)
> +               {
> +                 dump_printf_loc (MSG_NOTE, vect_location,
> +                                  "Operand %u:\n", i);
> +                 for (unsigned j = 0; j < group_size; j++)
> +                   dump_printf_loc (MSG_NOTE, vect_location, "\tstmt %u %G",
> +                                    j, oprnd_info->def_stmts[j]->stmt);
> +               }
> +
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "With a single operand:\n");
> +             for (unsigned j = 0; j < group_size; j++)
> +               dump_printf_loc (MSG_NOTE, vect_location, "\tstmt %u %G",
> +                                j, new_oprnds_info[0]->def_stmts[j]->stmt);
> +           }
> +
> +         two_op_scalar_stmts[0].safe_splice (oprnds_info[0]->def_stmts);
> +         two_op_scalar_stmts[1].safe_splice (oprnds_info[1]->def_stmts);
> +
> +         new_oprnds_info[0]->first_op_type = oprnds_info[0]->first_op_type;
> +         new_oprnds_info[0]->first_dt = oprnds_info[0]->first_dt;
> +         new_oprnds_info[0]->any_pattern = oprnds_info[0]->any_pattern;
> +         new_oprnds_info[0]->first_gs_p = oprnds_info[0]->first_gs_p;
> +         new_oprnds_info[0]->first_gs_info = oprnds_info[0]->first_gs_info;
> +
> +         vect_free_oprnd_info (oprnds_info);
> +         oprnds_info = new_oprnds_info;
> +         nops = 1;
> +         has_two_operators_perm = true;
> +       }
> +    }
> +
>    auto_vec<slp_tree, 4> children;
>
>    stmt_info = stmts[0];
> @@ -2691,6 +2780,29 @@ fail:
>          the true { a+b, a+b, a+b, a+b } ... but there we don't have
>          explicit stmts to put in so the keying on 'stmts' doesn't
>          work (but we have the same issue with nodes that use 'ops').  */
> +
> +      if (has_two_operators_perm)
> +       {
> +         slp_tree child = children[0];
> +         children.truncate (0);
> +         for (i = 0; i < 2; i++)
> +           {
> +             slp_tree pnode
> +               = vect_create_new_slp_node (two_op_scalar_stmts[i], 2);
> +             SLP_TREE_CODE (pnode) = VEC_PERM_EXPR;
> +             SLP_TREE_VECTYPE (pnode) = vectype;
> +             SLP_TREE_CHILDREN (pnode).quick_push (child);
> +             SLP_TREE_CHILDREN (pnode).quick_push (child);
> +             lane_permutation_t& perm = SLP_TREE_LANE_PERMUTATION (pnode);
> +             children.safe_push (pnode);
> +
> +             for (unsigned j = 0; j < stmts.length (); j++)
> +               perm.safe_push (std::make_pair (0, two_op_perm_indices[i][j]));
> +           }
> +
> +         SLP_TREE_REF_COUNT (child) += 4;
> +       }
> +
>        slp_tree one = new _slp_tree;
>        slp_tree two = new _slp_tree;
>        SLP_TREE_DEF_TYPE (one) = vect_internal_def;
> @@ -2727,12 +2839,14 @@ fail:
>           else
>             SLP_TREE_LANE_PERMUTATION (node).safe_push (std::make_pair (0, i));
>         }
> +
>        SLP_TREE_CODE (one) = code0;
>        SLP_TREE_CODE (two) = ocode;
>        SLP_TREE_LANES (one) = stmts.length ();
>        SLP_TREE_LANES (two) = stmts.length ();
>        SLP_TREE_REPRESENTATIVE (one) = stmts[0];
>        SLP_TREE_REPRESENTATIVE (two) = stmts[j];
> +
>        return node;
>      }
>
> --
> 2.44.0
>
diff mbox series

Patch

diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
new file mode 100644
index 00000000000..b6b093ffc34
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
@@ -0,0 +1,36 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect -fdump-tree-vect-details" } */
+
+typedef unsigned char uint8_t;
+typedef unsigned int uint32_t;
+
+#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
+    int t0 = s0 + s1;\
+    int t1 = s0 - s1;\
+    int t2 = s2 + s3;\
+    int t3 = s2 - s3;\
+    d0 = t0 + t2;\
+    d1 = t1 + t3;\
+    d2 = t0 - t2;\
+    d3 = t1 - t3;\
+}
+
+void sink(uint32_t tmp[4][4]);
+
+int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
+{
+    uint32_t tmp[4][4];
+    int sum = 0;
+    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
+    {
+        uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
+        uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
+        uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
+        uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
+        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
+    }
+    sink(tmp);
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index b47b7e8c979..60d0d388dff 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -2420,6 +2420,95 @@  out:
       }
   swap = NULL;
 
+  bool has_two_operators_perm = false;
+  auto_vec<unsigned> two_op_perm_indices[2];
+  vec<stmt_vec_info> two_op_scalar_stmts[2] = {vNULL, vNULL};
+
+  if (two_operators && oprnds_info.length () == 2 && group_size > 2)
+    {
+      unsigned idx = 0;
+      hash_map<gimple *, unsigned> seen;
+      vec<slp_oprnd_info> new_oprnds_info
+	= vect_create_oprnd_info (1, group_size);
+      bool success = true;
+
+      enum tree_code code = ERROR_MARK;
+      if (oprnds_info[0]->def_stmts[0]
+	  && is_a<gassign *> (oprnds_info[0]->def_stmts[0]->stmt))
+	code = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt);
+
+      for (unsigned j = 0; j < group_size; ++j)
+	{
+	  FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info)
+	    {
+	      stmt_vec_info stmt_info = oprnd_info->def_stmts[j];
+	      if (!stmt_info || !stmt_info->stmt
+		  || !is_a<gassign *> (stmt_info->stmt)
+		  || gimple_assign_rhs_code (stmt_info->stmt) != code
+		  || skip_args[i])
+		{
+		  success = false;
+		  break;
+		}
+
+	      bool exists;
+	      unsigned &stmt_idx
+		= seen.get_or_insert (stmt_info->stmt, &exists);
+
+	      if (!exists)
+		{
+		  new_oprnds_info[0]->def_stmts.safe_push (stmt_info);
+		  new_oprnds_info[0]->ops.safe_push (oprnd_info->ops[j]);
+		  stmt_idx = idx;
+		  idx++;
+		}
+
+	      two_op_perm_indices[i].safe_push (stmt_idx);
+	    }
+
+	  if (!success)
+	    break;
+	}
+
+      if (success && idx == group_size)
+	{
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "Replace two_operators operands:\n");
+
+	      FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info)
+		{
+		  dump_printf_loc (MSG_NOTE, vect_location,
+				   "Operand %u:\n", i);
+		  for (unsigned j = 0; j < group_size; j++)
+		    dump_printf_loc (MSG_NOTE, vect_location, "\tstmt %u %G",
+				     j, oprnd_info->def_stmts[j]->stmt);
+		}
+
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "With a single operand:\n");
+	      for (unsigned j = 0; j < group_size; j++)
+		dump_printf_loc (MSG_NOTE, vect_location, "\tstmt %u %G",
+				 j, new_oprnds_info[0]->def_stmts[j]->stmt);
+	    }
+
+	  two_op_scalar_stmts[0].safe_splice (oprnds_info[0]->def_stmts);
+	  two_op_scalar_stmts[1].safe_splice (oprnds_info[1]->def_stmts);
+
+	  new_oprnds_info[0]->first_op_type = oprnds_info[0]->first_op_type;
+	  new_oprnds_info[0]->first_dt = oprnds_info[0]->first_dt;
+	  new_oprnds_info[0]->any_pattern = oprnds_info[0]->any_pattern;
+	  new_oprnds_info[0]->first_gs_p = oprnds_info[0]->first_gs_p;
+	  new_oprnds_info[0]->first_gs_info = oprnds_info[0]->first_gs_info;
+
+	  vect_free_oprnd_info (oprnds_info);
+	  oprnds_info = new_oprnds_info;
+	  nops = 1;
+	  has_two_operators_perm = true;
+	}
+    }
+
   auto_vec<slp_tree, 4> children;
 
   stmt_info = stmts[0];
@@ -2691,6 +2780,29 @@  fail:
 	 the true { a+b, a+b, a+b, a+b } ... but there we don't have
 	 explicit stmts to put in so the keying on 'stmts' doesn't
 	 work (but we have the same issue with nodes that use 'ops').  */
+
+      if (has_two_operators_perm)
+	{
+	  slp_tree child = children[0];
+	  children.truncate (0);
+	  for (i = 0; i < 2; i++)
+	    {
+	      slp_tree pnode
+		= vect_create_new_slp_node (two_op_scalar_stmts[i], 2);
+	      SLP_TREE_CODE (pnode) = VEC_PERM_EXPR;
+	      SLP_TREE_VECTYPE (pnode) = vectype;
+	      SLP_TREE_CHILDREN (pnode).quick_push (child);
+	      SLP_TREE_CHILDREN (pnode).quick_push (child);
+	      lane_permutation_t& perm = SLP_TREE_LANE_PERMUTATION (pnode);
+	      children.safe_push (pnode);
+
+	      for (unsigned j = 0; j < stmts.length (); j++)
+		perm.safe_push (std::make_pair (0, two_op_perm_indices[i][j]));
+	    }
+
+	  SLP_TREE_REF_COUNT (child) += 4;
+	}
+
       slp_tree one = new _slp_tree;
       slp_tree two = new _slp_tree;
       SLP_TREE_DEF_TYPE (one) = vect_internal_def;
@@ -2727,12 +2839,14 @@  fail:
 	  else
 	    SLP_TREE_LANE_PERMUTATION (node).safe_push (std::make_pair (0, i));
 	}
+
       SLP_TREE_CODE (one) = code0;
       SLP_TREE_CODE (two) = ocode;
       SLP_TREE_LANES (one) = stmts.length ();
       SLP_TREE_LANES (two) = stmts.length ();
       SLP_TREE_REPRESENTATIVE (one) = stmts[0];
       SLP_TREE_REPRESENTATIVE (two) = stmts[j];
+
       return node;
     }