Patchwork [5/9] Main target-independent support for direct interleaving

Submitter Richard Sandiford
Date April 12, 2011, 1:59 p.m.
Message ID <g44o63fu4r.fsf@linaro.org>
Permalink /patch/90804/
State New
Headers show

Comments

Richard Sandiford - April 12, 2011, 1:59 p.m.
This patch adds vec_load_lanes and vec_store_lanes optabs for instructions
like NEON's vldN and vstN.  The optabs are defined this way because the
vectors must be allocated to a block of consecutive registers.

Tested on x86_64-linux-gnu and arm-linux-gnueabi.  OK to install?

Richard


gcc/
	* doc/md.texi (vec_load_lanes, vec_store_lanes): Document.
	* optabs.h (COI_vec_load_lanes, COI_vec_store_lanes): New
	convert_optab_index values.
	(vec_load_lanes_optab, vec_store_lanes_optab): New convert optabs.
	* genopinit.c (optabs): Initialize the new optabs.
	* internal-fn.def (LOAD_LANES, STORE_LANES): New internal functions.
	* internal-fn.c (get_multi_vector_move, expand_LOAD_LANES)
	(expand_STORE_LANES): New functions.
	* tree.h (build_simple_array_type): Declare.
	* tree.c (build_simple_array_type): New function.
	* tree-vectorizer.h (vect_model_store_cost): Add a bool argument.
	(vect_model_load_cost): Likewise.
	(vect_store_lanes_supported, vect_load_lanes_supported)
	(vect_record_strided_load_vectors): Declare.
	* tree-vect-data-refs.c (vect_lanes_optab_supported_p)
	(vect_store_lanes_supported, vect_load_lanes_supported): New functions.
	(vect_transform_strided_load): Split out statement recording into...
	(vect_record_strided_load_vectors): ...this new function.
	* tree-vect-stmts.c (create_vector_array, read_vector_array)
	(write_vector_array, create_array_ref): New functions.
	(vect_model_store_cost): Add store_lanes_p argument.
	(vect_model_load_cost): Add load_lanes_p argument.
	(vectorizable_store): Try to use store-lanes functions for
	interleaved stores.
	(vectorizable_load): Likewise load-lanes and loads.
	* tree-vect-slp.c (vect_get_and_check_slp_defs)
	(vect_build_slp_tree):
Ira Rosen - April 17, 2011, 1:35 p.m.
gcc-patches-owner@gcc.gnu.org wrote on 12/04/2011 04:59:16 PM:

>
> This patch adds vec_load_lanes and vec_store_lanes optabs for instructions
> like NEON's vldN and vstN.  The optabs are defined this way because the
> vectors must be allocated to a block of consecutive registers.
>
> Tested on x86_64-linux-gnu and arm-linux-gnueabi.  OK to install?

The vectorizer part is fine with me except for:


> @@ -685,9 +761,11 @@ vect_model_store_cost (stmt_vec_info stm
>        first_dr = STMT_VINFO_DATA_REF (stmt_info);
>      }
>
> -  /* Is this an access in a group of stores, which provide strided access?
> -     If so, add in the cost of the permutes.  */
> -  if (group_size > 1)
> +  /* We assume that the cost of a single store-lanes instruction is
> +     equivalent to the cost of GROUP_SIZE separate stores.  If a strided
> +     access is instead being provided by a load-and-permute operation,

I think it should be 'permute-and-store' and not 'load-and-permute'.

> +     include the cost of the permutes.  */
> +  if (!store_lanes_p && group_size > 1)
>      {
>        /* Uses a high and low interleave operation for each needed
> permute.  */
>        inside_cost = ncopies * exact_log2(group_size) * group_size


Thanks,
Ira
Richard Guenther - April 18, 2011, 11:09 a.m.
On Tue, Apr 12, 2011 at 3:59 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> This patch adds vec_load_lanes and vec_store_lanes optabs for instructions
> like NEON's vldN and vstN.  The optabs are defined this way because the
> vectors must be allocated to a block of consecutive registers.
>
> Tested on x86_64-linux-gnu and arm-linux-gnueabi.  OK to install?
>
> Richard
>
>
> [ChangeLog trimmed]
>
> Index: gcc/doc/md.texi
> ===================================================================
> --- gcc/doc/md.texi     2011-04-12 12:16:46.000000000 +0100
> +++ gcc/doc/md.texi     2011-04-12 14:48:28.000000000 +0100
> @@ -3846,6 +3846,48 @@ into consecutive memory locations.  Oper
>  consecutive memory locations, operand 1 is the first register, and
>  operand 2 is a constant: the number of consecutive registers.
>
> +@cindex @code{vec_load_lanes@var{m}@var{n}} instruction pattern
> +@item @samp{vec_load_lanes@var{m}@var{n}}
> +Perform an interleaved load of several vectors from memory operand 1
> +into register operand 0.  Both operands have mode @var{m}.  The register
> +operand is viewed as holding consecutive vectors of mode @var{n},
> +while the memory operand is a flat array that contains the same number
> +of elements.  The operation is equivalent to:
> +
> +@smallexample
> +int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
> +for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
> +  for (i = 0; i < c; i++)
> +    operand0[i][j] = operand1[j * c + i];
> +@end smallexample
> +
> +For example, @samp{vec_load_lanestiv4hi} loads 8 16-bit values
> +from memory into a register of mode @samp{TI}@.  The register
> +contains two consecutive vectors of mode @samp{V4HI}@.

So vec_load_lanestiv2qi would load ... ?  c == 8 here.  Intuitively
such operation would have adjacent blocks of siv2qi memory.  But
maybe you want to constrain the mode size to GET_MODE_SIZE (@var{n})
* GET_MODE_NUNITS (@var{n})?  In which case the mode m is
redundant?  You could specify that we load NUNITS adjacent vectors into
an integer mode of appropriate size.

> +This pattern can only be used if:
> +@smallexample
> +TARGET_ARRAY_MODE_SUPPORTED_P (@var{n}, @var{c})
> +@end smallexample
> +is true.  GCC assumes that, if a target supports this kind of
> +instruction for some mode @var{n}, it also supports unaligned
> +loads for vectors of mode @var{n}.
> +
> +@cindex @code{vec_store_lanes@var{m}@var{n}} instruction pattern
> +@item @samp{vec_store_lanes@var{m}@var{n}}
> +Equivalent to @samp{vec_load_lanes@var{m}@var{n}}, with the memory
> +and register operands reversed.  That is, the instruction is
> +equivalent to:
> +
> +@smallexample
> +int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
> +for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
> +  for (i = 0; i < c; i++)
> +    operand0[j * c + i] = operand1[i][j];
> +@end smallexample
> +
> +for a memory operand 0 and register operand 1.
> +
>  @cindex @code{vec_set@var{m}} instruction pattern
>  @item @samp{vec_set@var{m}}
>  Set given field in the vector value.  Operand 0 is the vector to modify,
> Index: gcc/optabs.h
> ===================================================================
> --- gcc/optabs.h        2011-04-12 12:16:46.000000000 +0100
> +++ gcc/optabs.h        2011-04-12 14:48:28.000000000 +0100
> @@ -578,6 +578,9 @@ enum convert_optab_index
>   COI_satfract,
>   COI_satfractuns,
>
> +  COI_vec_load_lanes,
> +  COI_vec_store_lanes,
> +

Um, they are not really conversion optabs.  Any reason they can't
use the direct_optab table and path?  What are the two modes
usually?  I don't see how you specify the kind of permutation that
is performed on the load - so, why not go the targetm.expand_builtin
path instead (well, targetm.expand_internal_fn, of course - or rather
targetm.expand_gimple_call which we need anyway for expanding
directly from gimple calls at some point).

>   COI_MAX
>  };
>
> @@ -598,6 +601,8 @@ #define fract_optab (&convert_optab_tabl
>  #define fractuns_optab (&convert_optab_table[COI_fractuns])
>  #define satfract_optab (&convert_optab_table[COI_satfract])
>  #define satfractuns_optab (&convert_optab_table[COI_satfractuns])
> +#define vec_load_lanes_optab (&convert_optab_table[COI_vec_load_lanes])
> +#define vec_store_lanes_optab (&convert_optab_table[COI_vec_store_lanes])
>
>  /* Contains the optab used for each rtx code.  */
>  extern optab code_to_optab[NUM_RTX_CODE + 1];
> Index: gcc/genopinit.c
> ===================================================================
> --- gcc/genopinit.c     2011-04-12 12:16:46.000000000 +0100
> +++ gcc/genopinit.c     2011-04-12 14:48:28.000000000 +0100
> @@ -74,6 +74,8 @@ static const char * const optabs[] =
>   "set_convert_optab_handler (fractuns_optab, $B, $A, CODE_FOR_$(fractuns$Q$a$I$b2$))",
>   "set_convert_optab_handler (satfract_optab, $B, $A, CODE_FOR_$(satfract$a$Q$b2$))",
>   "set_convert_optab_handler (satfractuns_optab, $B, $A, CODE_FOR_$(satfractuns$I$a$Q$b2$))",
> +  "set_convert_optab_handler (vec_load_lanes_optab, $A, $B, CODE_FOR_$(vec_load_lanes$a$b$))",
> +  "set_convert_optab_handler (vec_store_lanes_optab, $A, $B, CODE_FOR_$(vec_store_lanes$a$b$))",
>   "set_optab_handler (add_optab, $A, CODE_FOR_$(add$P$a3$))",
>   "set_optab_handler (addv_optab, $A, CODE_FOR_$(add$F$a3$)),\n\
>     set_optab_handler (add_optab, $A, CODE_FOR_$(add$F$a3$))",
> Index: gcc/internal-fn.def
> ===================================================================
> --- gcc/internal-fn.def 2011-04-12 14:10:42.000000000 +0100
> +++ gcc/internal-fn.def 2011-04-12 14:48:28.000000000 +0100
> @@ -32,3 +32,6 @@ along with GCC; see the file COPYING3.
>
>    where NAME is the name of the function and FLAGS is a set of
>    ECF_* flags.  */
> +
> +DEF_INTERNAL_FN (LOAD_LANES, ECF_CONST | ECF_LEAF)
> +DEF_INTERNAL_FN (STORE_LANES, ECF_CONST | ECF_LEAF)
> Index: gcc/internal-fn.c
> ===================================================================
> --- gcc/internal-fn.c   2011-04-12 14:10:42.000000000 +0100
> +++ gcc/internal-fn.c   2011-04-12 14:48:28.000000000 +0100
> @@ -41,6 +41,69 @@ #define DEF_INTERNAL_FN(CODE, FLAGS) FLA
>   0
>  };
>
> +/* ARRAY_TYPE is an array of vector modes.  Return the associated insn
> +   for load-lanes-style optab OPTAB.  The insn must exist.  */
> +
> +static enum insn_code
> +get_multi_vector_move (tree array_type, convert_optab optab)
> +{
> +  enum insn_code icode;
> +  enum machine_mode imode;
> +  enum machine_mode vmode;
> +
> +  gcc_assert (TREE_CODE (array_type) == ARRAY_TYPE);
> +  imode = TYPE_MODE (array_type);
> +  vmode = TYPE_MODE (TREE_TYPE (array_type));
> +
> +  icode = convert_optab_handler (optab, imode, vmode);
> +  gcc_assert (icode != CODE_FOR_nothing);
> +  return icode;
> +}
> +
> +/* Expand: LHS = LOAD_LANES (ARGS[0]).  */
> +
> +static void
> +expand_LOAD_LANES (tree lhs, tree *args)
> +{
> +  struct expand_operand ops[2];
> +  tree type;
> +  rtx target, mem;
> +
> +  type = TREE_TYPE (lhs);
> +
> +  target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  mem = expand_normal (args[0]);
> +
> +  gcc_assert (MEM_P (mem));
> +  PUT_MODE (mem, TYPE_MODE (type));
> +
> +  create_output_operand (&ops[0], target, TYPE_MODE (type));
> +  create_fixed_operand (&ops[1], mem);
> +  expand_insn (get_multi_vector_move (type, vec_load_lanes_optab), 2, ops);
> +}
> +
> +/* Expand: LHS = STORE_LANES (ARGS[0]).  */
> +
> +static void
> +expand_STORE_LANES (tree lhs, tree *args)
> +{
> +  struct expand_operand ops[2];
> +  tree type;
> +  rtx target, rhs;
> +
> +  type = TREE_TYPE (args[0]);
> +
> +  target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  rhs = expand_normal (args[0]);
> +
> +  gcc_assert (MEM_P (target));
> +  PUT_MODE (target, TYPE_MODE (type));
> +
> +  create_fixed_operand (&ops[0], target);
> +  create_input_operand (&ops[1], rhs, TYPE_MODE (type));
> +  expand_insn (get_multi_vector_move (type, vec_store_lanes_optab), 2, ops);
> +}
> +
>  /* Routines to expand each internal function, indexed by function number.
>    Each routine has the prototype:
>
> Index: gcc/tree.h
> ===================================================================
> --- gcc/tree.h  2011-04-12 12:16:46.000000000 +0100
> +++ gcc/tree.h  2011-04-12 14:48:28.000000000 +0100
> @@ -4198,6 +4198,7 @@ extern tree build_type_no_quals (tree);
>  extern tree build_index_type (tree);
>  extern tree build_array_type (tree, tree);
>  extern tree build_nonshared_array_type (tree, tree);
> +extern tree build_simple_array_type (tree, unsigned HOST_WIDE_INT);
>  extern tree build_function_type (tree, tree);
>  extern tree build_function_type_list (tree, ...);
>  extern tree build_function_type_skip_args (tree, bitmap);
> Index: gcc/tree.c
> ===================================================================
> --- gcc/tree.c  2011-04-12 12:16:46.000000000 +0100
> +++ gcc/tree.c  2011-04-12 14:48:28.000000000 +0100
> @@ -7385,6 +7385,15 @@ build_nonshared_array_type (tree elt_typ
>   return build_array_type_1 (elt_type, index_type, false);
>  }
>
> +/* Return a representation of ELT_TYPE[NELTS], using indices of type
> +   sizetype.  */
> +
> +tree
> +build_simple_array_type (tree elt_type, unsigned HOST_WIDE_INT nelts)

build_array_type_nelts

The rest looks ok to me.

Richard.
Richard Sandiford - April 18, 2011, 11:24 a.m.
Richard Guenther <richard.guenther@gmail.com> writes:
> On Tue, Apr 12, 2011 at 3:59 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> This patch adds vec_load_lanes and vec_store_lanes optabs for instructions
>> like NEON's vldN and vstN.  The optabs are defined this way because the
>> vectors must be allocated to a block of consecutive registers.
>>
>> Tested on x86_64-linux-gnu and arm-linux-gnueabi.  OK to install?
>>
>> Richard
>>
>>
>> [ChangeLog trimmed]
>>
>> Index: gcc/doc/md.texi
>> ===================================================================
>> --- gcc/doc/md.texi     2011-04-12 12:16:46.000000000 +0100
>> +++ gcc/doc/md.texi     2011-04-12 14:48:28.000000000 +0100
>> @@ -3846,6 +3846,48 @@ into consecutive memory locations.  Oper
>>  consecutive memory locations, operand 1 is the first register, and
>>  operand 2 is a constant: the number of consecutive registers.
>>
>> +@cindex @code{vec_load_lanes@var{m}@var{n}} instruction pattern
>> +@item @samp{vec_load_lanes@var{m}@var{n}}
>> +Perform an interleaved load of several vectors from memory operand 1
>> +into register operand 0.  Both operands have mode @var{m}.  The register
>> +operand is viewed as holding consecutive vectors of mode @var{n},
>> +while the memory operand is a flat array that contains the same number
>> +of elements.  The operation is equivalent to:
>> +
>> +@smallexample
>> +int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
>> +for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
>> +  for (i = 0; i < c; i++)
>> +    operand0[i][j] = operand1[j * c + i];
>> +@end smallexample
>> +
>> +For example, @samp{vec_load_lanestiv4hi} loads 8 16-bit values
>> +from memory into a register of mode @samp{TI}@.  The register
>> +contains two consecutive vectors of mode @samp{V4HI}@.
>
> So vec_load_lanestiv2qi would load ... ?  c == 8 here.  Intuitively
> such operation would have adjacent blocks of siv2qi memory.  But
> maybe you want to constrain the mode size to GET_MODE_SIZE (@var{n})
> * GET_MODE_NUNITS (@var{n})?  In which case the mode m is
> redundant?  You could specify that we load NUNITS adjacent vectors into
> an integer mode of appropriate size.

Like you say, vec_load_lanestiv2qi would load 16 QImode elements into
8 consecutive V2QI registers.  The first element from register vector I
would come from operand1[I] and the second element would come from
operand1[I + 8].  That's meant to be a valid combination.

We specifically want to allow:

  GET_MODE_SIZE (@var{m})
    != GET_MODE_SIZE (@var{n}) * GET_MODE_NUNITS (@var{n})

The vec_load_lanestiv4hi example in the docs is one case of this:

  GET_MODE_SIZE (@var{m}) = 16
  GET_MODE_SIZE (@var{n}) = 8
  GET_MODE_NUNITS (@var{n}) = 4

That example maps directly to ARM's vld2.32.  We also want cases
where @var{m} is three times the size of @var{n} (vld3.WW) and
cases where @var{m} is four times the size of @var{n} (vld4.WW).

>> +/* Return a representation of ELT_TYPE[NELTS], using indices of type
>> +   sizetype.  */
>> +
>> +tree
>> +build_simple_array_type (tree elt_type, unsigned HOST_WIDE_INT nelts)
>
> build_array_type_nelts

OK.

Richard
Richard Guenther - April 18, 2011, 12:08 p.m.
On Mon, Apr 18, 2011 at 1:24 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Guenther <richard.guenther@gmail.com> writes:
>> On Tue, Apr 12, 2011 at 3:59 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> This patch adds vec_load_lanes and vec_store_lanes optabs for instructions
>>> like NEON's vldN and vstN.  The optabs are defined this way because the
>>> vectors must be allocated to a block of consecutive registers.
>>>
>>> Tested on x86_64-linux-gnu and arm-linux-gnueabi.  OK to install?
>>>
>>> Richard
>>>
>>>
>>> [ChangeLog trimmed]
>>>
>>> Index: gcc/doc/md.texi
>>> ===================================================================
>>> --- gcc/doc/md.texi     2011-04-12 12:16:46.000000000 +0100
>>> +++ gcc/doc/md.texi     2011-04-12 14:48:28.000000000 +0100
>>> @@ -3846,6 +3846,48 @@ into consecutive memory locations.  Oper
>>>  consecutive memory locations, operand 1 is the first register, and
>>>  operand 2 is a constant: the number of consecutive registers.
>>>
>>> +@cindex @code{vec_load_lanes@var{m}@var{n}} instruction pattern
>>> +@item @samp{vec_load_lanes@var{m}@var{n}}
>>> +Perform an interleaved load of several vectors from memory operand 1
>>> +into register operand 0.  Both operands have mode @var{m}.  The register
>>> +operand is viewed as holding consecutive vectors of mode @var{n},
>>> +while the memory operand is a flat array that contains the same number
>>> +of elements.  The operation is equivalent to:
>>> +
>>> +@smallexample
>>> +int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
>>> +for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
>>> +  for (i = 0; i < c; i++)
>>> +    operand0[i][j] = operand1[j * c + i];
>>> +@end smallexample
>>> +
>>> +For example, @samp{vec_load_lanestiv4hi} loads 8 16-bit values
>>> +from memory into a register of mode @samp{TI}@.  The register
>>> +contains two consecutive vectors of mode @samp{V4HI}@.
>>
>> So vec_load_lanestiv2qi would load ... ?  c == 8 here.  Intuitively
>> such operation would have adjacent blocks of siv2qi memory.  But
>> maybe you want to constrain the mode size to GET_MODE_SIZE (@var{n})
>> * GET_MODE_NUNITS (@var{n})?  In which case the mode m is
>> redundant?  You could specify that we load NUNITS adjacent vectors into
>> an integer mode of appropriate size.
>
> Like you say, vec_load_lanestiv2qi would load 16 QImode elements into
> 8 consecutive V2QI registers.  The first element from register vector I
> would come from operand1[I] and the second element would come from
> operand1[I + 8].  That's meant to be a valid combination.

Ok, but the C loop from the example doesn't seem to match.  Or I couldn't
wrap my head around it despite looking for 5 minutes and already having
coffee ;)  I would have expected the vectors being in memory as

  v0[0], v1[0], v0[1], v1[1], v2[0], v3[0], v2[1], v3[1], ...

not

  v0[0], v1[0], v2[0], ...

as I would have thought the former is more useful (simple unrolling for
stride 2).  We'd need a separate set of optabs for such an interleaving
scheme?  In which case we might want to come up with a more
specific name than load_lane?

> We specifically want to allow:
>
>  GET_MODE_SIZE (@var{m})
>    != GET_MODE_SIZE (@var{n}) * GET_MODE_NUNITS (@var{n})
>
> The vec_load_lanestiv4hi example in the docs is one case of this:
>
>  GET_MODE_SIZE (@var{m}) = 16
>  GET_MODE_SIZE (@var{n}) = 8
>  GET_MODE_NUNITS (@var{n}) = 4
>
> That example maps directly to ARM's vld2.32.  We also want cases
> where @var{m} is three times the size of @var{n} (vld3.WW) and
> cases where @var{m} is four times the size of @var{n} (vld4.WW).
>
>>> +/* Return a representation of ELT_TYPE[NELTS], using indices of type
>>> +   sizetype.  */
>>> +
>>> +tree
>>> +build_simple_array_type (tree elt_type, unsigned HOST_WIDE_INT nelts)
>>
>> build_array_type_nelts
>
> OK.
>
> Richard
>
Richard Sandiford - April 18, 2011, 12:19 p.m.
Richard Guenther <richard.guenther@gmail.com> writes:
> On Mon, Apr 18, 2011 at 1:24 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Guenther <richard.guenther@gmail.com> writes:
>>> On Tue, Apr 12, 2011 at 3:59 PM, Richard Sandiford
>>> <richard.sandiford@linaro.org> wrote:
>>>> [md.texi hunk trimmed]
>>>
>>> So vec_load_lanestiv2qi would load ... ?  c == 8 here.  Intuitively
>>> such operation would have adjacent blocks of siv2qi memory.  But
>>> maybe you want to constrain the mode size to GET_MODE_SIZE (@var{n})
>>> * GET_MODE_NUNITS (@var{n})?  In which case the mode m is
>>> redundant?  You could specify that we load NUNITS adjacent vectors into
>>> an integer mode of appropriate size.
>>
>> Like you say, vec_load_lanestiv2qi would load 16 QImode elements into
>> 8 consecutive V2QI registers.  The first element from register vector I
>> would come from operand1[I] and the second element would come from
>> operand1[I + 8].  That's meant to be a valid combination.
>
> Ok, but the C loop from the example doesn't seem to match.  Or I couldn't
> wrap my head around it despite looking for 5 minutes and already having
> coffee ;)  I would have expected the vectors being in memory as
>
>   v0[0], v1[0], v0[1], v1[1], v2[0], v3[0], v2[1], v3[1], ...
>
> not
>
>   v0[0], v1[0], v2[0], ...
>
> as I would have thought the former is more useful (simple unrolling for
> stride 2).

The second one's right.  All lane 0 elements, followed by all lane 1
elements, etc.  I think that's what the C loop says.

> We'd need a separate set of optabs for such an interleaving
> scheme?  In which case we might want to come up with a more
> specific name than load_lane?

Yeah, if someone has a single instruction that does your first example,
then it would need a new optab.  The individual vector pairs could be
represented using the current optab though, if each pair needs a
separate instruction.  E.g. with your v2qi example, vec_load_lanessiv2qi
would load:

   v0[0], v1[0], v0[1], v1[1]

and you could repeat for the others.  So load_lanes (as defined here)
could be treated as a primitive, and your first example could be something
like "repeat_load_lanes".

If you don't like the name "load_lanes" though, I'm happy to use
something else.

Richard
Richard Guenther - April 18, 2011, 12:58 p.m.
On Mon, Apr 18, 2011 at 2:19 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Guenther <richard.guenther@gmail.com> writes:
>> On Mon, Apr 18, 2011 at 1:24 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> Richard Guenther <richard.guenther@gmail.com> writes:
>>>> On Tue, Apr 12, 2011 at 3:59 PM, Richard Sandiford
>>>> <richard.sandiford@linaro.org> wrote:
>>>>> [md.texi hunk trimmed]
>>>>
>>>> So vec_load_lanestiv2qi would load ... ?  c == 8 here.  Intuitively
>>>> such operation would have adjacent blocks of siv2qi memory.  But
>>>> maybe you want to constrain the mode size to GET_MODE_SIZE (@var{n})
>>>> * GET_MODE_NUNITS (@var{n})?  In which case the mode m is
>>>> redundant?  You could specify that we load NUNITS adjacent vectors into
>>>> an integer mode of appropriate size.
>>>
>>> Like you say, vec_load_lanestiv2qi would load 16 QImode elements into
>>> 8 consecutive V2QI registers.  The first element from register vector I
>>> would come from operand1[I] and the second element would come from
>>> operand1[I + 8].  That's meant to be a valid combination.
>>
>> Ok, but the C loop from the example doesn't seem to match.  Or I couldn't
>> wrap my head around it despite looking for 5 minutes and already having
>> coffee ;)  I would have expected the vectors being in memory as
>>
>>   v0[0], v1[0], v0[1], v1[1], v2[0], v3[0], v2[1], v3[1], ...
>>
>> not
>>
>>   v0[0], v1[0], v2[0], ...
>>
>> as I would have thought the former is more useful (simple unrolling for
>> stride 2).
>
> The second one's right.  All lane 0 elements, followed by all lane 1
> elements, etc.  I think that's what the C loop says.
>
>> We'd need a separate set of optabs for such an interleaving
>> scheme?  In which case we might want to come up with a more
>> specific name than load_lane?
>
> Yeah, if someone has a single instruction that does your first example,
> then it would need a new optab.  The individual vector pairs could be
> represented using the current optab though, if each pair needs a
> separate instruction.  E.g. with your v2qi example, vec_load_lanessiv2qi
> would load:
>
>   v0[0], v1[0], v0[1], v1[1]
>
> and you could repeat for the others.  So load_lanes (as defined here)
> could be treated as a primitive, and your first example could be something
> like "repeat_load_lanes".
>
> If you don't like the name "load_lanes" though, I'm happy to use
> something else.

Ah, no - repeat_load_lanes sounds like a good name for the new optab if
we need it at any point.

Richard.

> Richard
>

Patch

Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	2011-04-12 12:16:46.000000000 +0100
+++ gcc/doc/md.texi	2011-04-12 14:48:28.000000000 +0100
@@ -3846,6 +3846,48 @@  into consecutive memory locations.  Oper
 consecutive memory locations, operand 1 is the first register, and
 operand 2 is a constant: the number of consecutive registers.
 
+@cindex @code{vec_load_lanes@var{m}@var{n}} instruction pattern
+@item @samp{vec_load_lanes@var{m}@var{n}}
+Perform an interleaved load of several vectors from memory operand 1
+into register operand 0.  Both operands have mode @var{m}.  The register
+operand is viewed as holding consecutive vectors of mode @var{n},
+while the memory operand is a flat array that contains the same number
+of elements.  The operation is equivalent to:
+
+@smallexample
+int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
+for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
+  for (i = 0; i < c; i++)
+    operand0[i][j] = operand1[j * c + i];
+@end smallexample
+
+For example, @samp{vec_load_lanestiv4hi} loads 8 16-bit values
+from memory into a register of mode @samp{TI}@.  The register
+contains two consecutive vectors of mode @samp{V4HI}@.
+
+This pattern can only be used if:
+@smallexample
+TARGET_ARRAY_MODE_SUPPORTED_P (@var{n}, @var{c})
+@end smallexample
+is true.  GCC assumes that, if a target supports this kind of
+instruction for some mode @var{n}, it also supports unaligned
+loads for vectors of mode @var{n}.
+
+@cindex @code{vec_store_lanes@var{m}@var{n}} instruction pattern
+@item @samp{vec_store_lanes@var{m}@var{n}}
+Equivalent to @samp{vec_load_lanes@var{m}@var{n}}, with the memory
+and register operands reversed.  That is, the instruction is
+equivalent to:
+
+@smallexample
+int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
+for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
+  for (i = 0; i < c; i++)
+    operand0[j * c + i] = operand1[i][j];
+@end smallexample
+
+for a memory operand 0 and register operand 1.
+
 @cindex @code{vec_set@var{m}} instruction pattern
 @item @samp{vec_set@var{m}}
 Set given field in the vector value.  Operand 0 is the vector to modify,
Index: gcc/optabs.h
===================================================================
--- gcc/optabs.h	2011-04-12 12:16:46.000000000 +0100
+++ gcc/optabs.h	2011-04-12 14:48:28.000000000 +0100
@@ -578,6 +578,9 @@  enum convert_optab_index
   COI_satfract,
   COI_satfractuns,
 
+  COI_vec_load_lanes,
+  COI_vec_store_lanes,
+
   COI_MAX
 };
 
@@ -598,6 +601,8 @@  #define fract_optab (&convert_optab_tabl
 #define fractuns_optab (&convert_optab_table[COI_fractuns])
 #define satfract_optab (&convert_optab_table[COI_satfract])
 #define satfractuns_optab (&convert_optab_table[COI_satfractuns])
+#define vec_load_lanes_optab (&convert_optab_table[COI_vec_load_lanes])
+#define vec_store_lanes_optab (&convert_optab_table[COI_vec_store_lanes])
 
 /* Contains the optab used for each rtx code.  */
 extern optab code_to_optab[NUM_RTX_CODE + 1];
Index: gcc/genopinit.c
===================================================================
--- gcc/genopinit.c	2011-04-12 12:16:46.000000000 +0100
+++ gcc/genopinit.c	2011-04-12 14:48:28.000000000 +0100
@@ -74,6 +74,8 @@  static const char * const optabs[] =
   "set_convert_optab_handler (fractuns_optab, $B, $A, CODE_FOR_$(fractuns$Q$a$I$b2$))",
   "set_convert_optab_handler (satfract_optab, $B, $A, CODE_FOR_$(satfract$a$Q$b2$))",
   "set_convert_optab_handler (satfractuns_optab, $B, $A, CODE_FOR_$(satfractuns$I$a$Q$b2$))",
+  "set_convert_optab_handler (vec_load_lanes_optab, $A, $B, CODE_FOR_$(vec_load_lanes$a$b$))",
+  "set_convert_optab_handler (vec_store_lanes_optab, $A, $B, CODE_FOR_$(vec_store_lanes$a$b$))",
   "set_optab_handler (add_optab, $A, CODE_FOR_$(add$P$a3$))",
   "set_optab_handler (addv_optab, $A, CODE_FOR_$(add$F$a3$)),\n\
     set_optab_handler (add_optab, $A, CODE_FOR_$(add$F$a3$))",
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	2011-04-12 14:10:42.000000000 +0100
+++ gcc/internal-fn.def	2011-04-12 14:48:28.000000000 +0100
@@ -32,3 +32,6 @@  along with GCC; see the file COPYING3.  
 
    where NAME is the name of the function and FLAGS is a set of
    ECF_* flags.  */
+
+DEF_INTERNAL_FN (LOAD_LANES, ECF_CONST | ECF_LEAF)
+DEF_INTERNAL_FN (STORE_LANES, ECF_CONST | ECF_LEAF)
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	2011-04-12 14:10:42.000000000 +0100
+++ gcc/internal-fn.c	2011-04-12 14:48:28.000000000 +0100
@@ -41,6 +41,69 @@  #define DEF_INTERNAL_FN(CODE, FLAGS) FLA
   0
 };
 
+/* ARRAY_TYPE is an array of vector modes.  Return the associated insn
+   for load-lanes-style optab OPTAB.  The insn must exist.  */
+
+static enum insn_code
+get_multi_vector_move (tree array_type, convert_optab optab)
+{
+  enum insn_code icode;
+  enum machine_mode imode;
+  enum machine_mode vmode;
+
+  gcc_assert (TREE_CODE (array_type) == ARRAY_TYPE);
+  imode = TYPE_MODE (array_type);
+  vmode = TYPE_MODE (TREE_TYPE (array_type));
+
+  icode = convert_optab_handler (optab, imode, vmode);
+  gcc_assert (icode != CODE_FOR_nothing);
+  return icode;
+}
+
+/* Expand: LHS = LOAD_LANES (ARGS[0]).  */
+
+static void
+expand_LOAD_LANES (tree lhs, tree *args)
+{
+  struct expand_operand ops[2];
+  tree type;
+  rtx target, mem;
+
+  type = TREE_TYPE (lhs);
+
+  target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  mem = expand_normal (args[0]);
+
+  gcc_assert (MEM_P (mem));
+  PUT_MODE (mem, TYPE_MODE (type));
+
+  create_output_operand (&ops[0], target, TYPE_MODE (type));
+  create_fixed_operand (&ops[1], mem);
+  expand_insn (get_multi_vector_move (type, vec_load_lanes_optab), 2, ops);
+}
+
+/* Expand: LHS = STORE_LANES (ARGS[0]).  */
+
+static void
+expand_STORE_LANES (tree lhs, tree *args)
+{
+  struct expand_operand ops[2];
+  tree type;
+  rtx target, rhs;
+
+  type = TREE_TYPE (args[0]);
+
+  target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  rhs = expand_normal (args[0]);
+
+  gcc_assert (MEM_P (target));
+  PUT_MODE (target, TYPE_MODE (type));
+
+  create_fixed_operand (&ops[0], target);
+  create_input_operand (&ops[1], rhs, TYPE_MODE (type));
+  expand_insn (get_multi_vector_move (type, vec_store_lanes_optab), 2, ops);
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: gcc/tree.h
===================================================================
--- gcc/tree.h	2011-04-12 12:16:46.000000000 +0100
+++ gcc/tree.h	2011-04-12 14:48:28.000000000 +0100
@@ -4198,6 +4198,7 @@  extern tree build_type_no_quals (tree);
 extern tree build_index_type (tree);
 extern tree build_array_type (tree, tree);
 extern tree build_nonshared_array_type (tree, tree);
+extern tree build_simple_array_type (tree, unsigned HOST_WIDE_INT);
 extern tree build_function_type (tree, tree);
 extern tree build_function_type_list (tree, ...);
 extern tree build_function_type_skip_args (tree, bitmap);
Index: gcc/tree.c
===================================================================
--- gcc/tree.c	2011-04-12 12:16:46.000000000 +0100
+++ gcc/tree.c	2011-04-12 14:48:28.000000000 +0100
@@ -7385,6 +7385,15 @@  build_nonshared_array_type (tree elt_typ
   return build_array_type_1 (elt_type, index_type, false);
 }
 
+/* Return a representation of ELT_TYPE[NELTS], using indices of type
+   sizetype.  */
+
+tree
+build_simple_array_type (tree elt_type, unsigned HOST_WIDE_INT nelts)
+{
+  return build_array_type (elt_type, build_index_type (size_int (nelts - 1)));
+}
+
 /* Recursively examines the array elements of TYPE, until a non-array
    element type is found.  */
 
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2011-04-12 14:48:27.000000000 +0100
+++ gcc/tree-vectorizer.h	2011-04-12 14:48:28.000000000 +0100
@@ -788,9 +788,9 @@  extern void free_stmt_vec_info (gimple s
 extern tree vectorizable_function (gimple, tree, tree);
 extern void vect_model_simple_cost (stmt_vec_info, int, enum vect_def_type *,
                                     slp_tree);
-extern void vect_model_store_cost (stmt_vec_info, int, enum vect_def_type,
-                                   slp_tree);
-extern void vect_model_load_cost (stmt_vec_info, int, slp_tree);
+extern void vect_model_store_cost (stmt_vec_info, int, bool,
+				   enum vect_def_type, slp_tree);
+extern void vect_model_load_cost (stmt_vec_info, int, bool, slp_tree);
 extern void vect_finish_stmt_generation (gimple, gimple,
                                          gimple_stmt_iterator *);
 extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);
@@ -829,7 +829,9 @@  extern tree vect_create_data_ref_ptr (gi
 extern tree bump_vector_ptr (tree, gimple, gimple_stmt_iterator *, gimple, tree);
 extern tree vect_create_destination_var (tree, tree);
 extern bool vect_strided_store_supported (tree, unsigned HOST_WIDE_INT);
+extern bool vect_store_lanes_supported (tree, unsigned HOST_WIDE_INT);
 extern bool vect_strided_load_supported (tree, unsigned HOST_WIDE_INT);
+extern bool vect_load_lanes_supported (tree, unsigned HOST_WIDE_INT);
 extern void vect_permute_store_chain (VEC(tree,heap) *,unsigned int, gimple,
                                     gimple_stmt_iterator *, VEC(tree,heap) **);
 extern tree vect_setup_realignment (gimple, gimple_stmt_iterator *, tree *,
@@ -837,6 +839,7 @@  extern tree vect_setup_realignment (gimp
                                     struct loop **);
 extern void vect_transform_strided_load (gimple, VEC(tree,heap) *, int,
                                          gimple_stmt_iterator *);
+extern void vect_record_strided_load_vectors (gimple, VEC(tree,heap) *);
 extern int vect_get_place_in_interleaving_chain (gimple, gimple);
 extern tree vect_get_new_vect_var (tree, enum vect_var_kind, const char *);
 extern tree vect_create_addr_base_for_vector_ref (gimple, gimple_seq *,
Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c	2011-04-12 14:48:27.000000000 +0100
+++ gcc/tree-vect-data-refs.c	2011-04-12 14:49:18.000000000 +0100
@@ -43,6 +43,45 @@  Software Foundation; either version 3, o
 #include "expr.h"
 #include "optabs.h"
 
+/* Return true if load- or store-lanes optab OPTAB is implemented for
+   COUNT vectors of type VECTYPE.  NAME is the name of OPTAB.  */
+
+static bool
+vect_lanes_optab_supported_p (const char *name, convert_optab optab,
+			      tree vectype, unsigned HOST_WIDE_INT count)
+{
+  enum machine_mode mode, array_mode;
+  bool limit_p;
+
+  mode = TYPE_MODE (vectype);
+  limit_p = !targetm.array_mode_supported_p (mode, count);
+  array_mode = mode_for_size (count * GET_MODE_BITSIZE (mode),
+			      MODE_INT, limit_p);
+
+  if (array_mode == BLKmode)
+    {
+      if (vect_print_dump_info (REPORT_DETAILS))
+	fprintf (vect_dump, "no array mode for %s[" HOST_WIDE_INT_PRINT_DEC "]",
+		 GET_MODE_NAME (mode), count);
+      return false;
+    }
+
+  if (convert_optab_handler (optab, array_mode, mode) == CODE_FOR_nothing)
+    {
+      if (vect_print_dump_info (REPORT_DETAILS))
+	fprintf (vect_dump, "cannot use %s<%s><%s>",
+		 name, GET_MODE_NAME (array_mode), GET_MODE_NAME (mode));
+      return false;
+    }
+
+  if (vect_print_dump_info (REPORT_DETAILS))
+    fprintf (vect_dump, "can use %s<%s><%s>",
+	     name, GET_MODE_NAME (array_mode), GET_MODE_NAME (mode));
+
+  return true;
+}
+
+
 /* Return the smallest scalar part of STMT.
    This is used to determine the vectype of the stmt.  We generally set the
    vectype according to the type of the result (lhs).  For stmts whose
@@ -3376,6 +3415,18 @@  vect_strided_store_supported (tree vecty
 }
 
 
+/* Return TRUE if vec_store_lanes is available for COUNT vectors of
+   type VECTYPE.  */
+
+bool
+vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
+{
+  return vect_lanes_optab_supported_p ("vec_store_lanes",
+				       vec_store_lanes_optab,
+				       vectype, count);
+}
+
+
 /* Function vect_permute_store_chain.
 
    Given a chain of interleaved stores in DR_CHAIN of LENGTH that must be
@@ -3830,6 +3881,16 @@  vect_strided_load_supported (tree vectyp
   return true;
 }
 
+/* Return TRUE if vec_load_lanes is available for COUNT vectors of
+   type VECTYPE.  */
+
+bool
+vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
+{
+  return vect_lanes_optab_supported_p ("vec_load_lanes",
+				       vec_load_lanes_optab,
+				       vectype, count);
+}
 
 /* Function vect_permute_load_chain.
 
@@ -3977,19 +4038,28 @@  vect_permute_load_chain (VEC(tree,heap) 
 vect_transform_strided_load (gimple stmt, VEC(tree,heap) *dr_chain, int size,
 			     gimple_stmt_iterator *gsi)
 {
-  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
-  gimple first_stmt = DR_GROUP_FIRST_DR (stmt_info);
-  gimple next_stmt, new_stmt;
   VEC(tree,heap) *result_chain = NULL;
-  unsigned int i, gap_count;
-  tree tmp_data_ref;
 
   /* DR_CHAIN contains input data-refs that are a part of the interleaving.
      RESULT_CHAIN is the output of vect_permute_load_chain, it contains permuted
      vectors, that are ready for vector computation.  */
   result_chain = VEC_alloc (tree, heap, size);
-  /* Permute.  */
   vect_permute_load_chain (dr_chain, size, stmt, gsi, &result_chain);
+  vect_record_strided_load_vectors (stmt, result_chain);
+  VEC_free (tree, heap, result_chain);
+}
+
+/* RESULT_CHAIN contains the output of a group of strided loads that were
+   generated as part of the vectorization of STMT.  Assign the statement
+   for each vector to the associated scalar statement.  */
+
+void
+vect_record_strided_load_vectors (gimple stmt, VEC(tree,heap) *result_chain)
+{
+  gimple first_stmt = DR_GROUP_FIRST_DR (vinfo_for_stmt (stmt));
+  gimple next_stmt, new_stmt;
+  unsigned int i, gap_count;
+  tree tmp_data_ref;
 
   /* Put a permuted data-ref in the VECTORIZED_STMT field.
      Since we scan the chain starting from it's first node, their order
@@ -4051,8 +4121,6 @@  vect_transform_strided_load (gimple stmt
 	    break;
         }
     }
-
-  VEC_free (tree, heap, result_chain);
 }
 
 /* Function vect_force_dr_alignment_p.
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2011-04-12 14:48:27.000000000 +0100
+++ gcc/tree-vect-stmts.c	2011-04-12 14:52:10.000000000 +0100
@@ -42,6 +42,81 @@  Software Foundation; either version 3, o
 #include "langhooks.h"
 
 
+/* Return a variable of type ELEM_TYPE[NELEMS].  */
+
+static tree
+create_vector_array (tree elem_type, unsigned HOST_WIDE_INT nelems)
+{
+  return create_tmp_var (build_simple_array_type (elem_type, nelems),
+			 "vect_array");
+}
+
+/* ARRAY is an array of vectors created by create_vector_array.
+   Return an SSA_NAME for the vector in index N.  The reference
+   is part of the vectorization of STMT and the vector is associated
+   with scalar destination SCALAR_DEST.  */
+
+static tree
+read_vector_array (gimple stmt, gimple_stmt_iterator *gsi, tree scalar_dest,
+		   tree array, unsigned HOST_WIDE_INT n)
+{
+  tree vect_type, vect, vect_name, array_ref;
+  gimple new_stmt;
+
+  gcc_assert (TREE_CODE (TREE_TYPE (array)) == ARRAY_TYPE);
+  vect_type = TREE_TYPE (TREE_TYPE (array));
+  vect = vect_create_destination_var (scalar_dest, vect_type);
+  array_ref = build4 (ARRAY_REF, vect_type, array,
+		      build_int_cst (size_type_node, n),
+		      NULL_TREE, NULL_TREE);
+
+  new_stmt = gimple_build_assign (vect, array_ref);
+  vect_name = make_ssa_name (vect, new_stmt);
+  gimple_assign_set_lhs (new_stmt, vect_name);
+  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+  mark_symbols_for_renaming (new_stmt);
+
+  return vect_name;
+}
+
+/* ARRAY is an array of vectors created by create_vector_array.
+   Emit code to store SSA_NAME VECT in index N of the array.
+   The store is part of the vectorization of STMT.  */
+
+static void
+write_vector_array (gimple stmt, gimple_stmt_iterator *gsi, tree vect,
+		    tree array, unsigned HOST_WIDE_INT n)
+{
+  tree array_ref;
+  gimple new_stmt;
+
+  array_ref = build4 (ARRAY_REF, TREE_TYPE (vect), array,
+		      build_int_cst (size_type_node, n),
+		      NULL_TREE, NULL_TREE);
+
+  new_stmt = gimple_build_assign (array_ref, vect);
+  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+  mark_symbols_for_renaming (new_stmt);
+}
+
+/* PTR is a pointer to array type TYPE.  Return a representation of *PTR.
+   The memory reference replaces those in FIRST_DR (and its group).  */
+
+static tree
+create_array_ref (tree type, tree ptr, struct data_reference *first_dr)
+{
+  struct ptr_info_def *pi;
+  tree mem_ref, alias_ptr_type;
+
+  alias_ptr_type = reference_alias_ptr_type (DR_REF (first_dr));
+  mem_ref = build2 (MEM_REF, type, ptr, build_int_cst (alias_ptr_type, 0));
+  /* Arrays have the same alignment as their type.  */
+  pi = get_ptr_info (ptr);
+  pi->align = TYPE_ALIGN_UNIT (type);
+  pi->misalign = 0;
+  return mem_ref;
+}
+
 /* Utility functions used by vect_mark_stmts_to_be_vectorized.  */
 
 /* Function vect_mark_relevant.
@@ -648,7 +723,8 @@  vect_cost_strided_group_size (stmt_vec_i
 
 void
 vect_model_store_cost (stmt_vec_info stmt_info, int ncopies,
-		       enum vect_def_type dt, slp_tree slp_node)
+		       bool store_lanes_p, enum vect_def_type dt,
+		       slp_tree slp_node)
 {
   int group_size;
   unsigned int inside_cost = 0, outside_cost = 0;
@@ -685,9 +761,11 @@  vect_model_store_cost (stmt_vec_info stm
       first_dr = STMT_VINFO_DATA_REF (stmt_info);
     }
 
-  /* Is this an access in a group of stores, which provide strided access?
-     If so, add in the cost of the permutes.  */
-  if (group_size > 1)
+  /* We assume that the cost of a single store-lanes instruction is
+     equivalent to the cost of GROUP_SIZE separate stores.  If a strided
+     access is instead being provided by a load-and-permute operation,
+     include the cost of the permutes.  */
+  if (!store_lanes_p && group_size > 1)
     {
       /* Uses a high and low interleave operation for each needed permute.  */
       inside_cost = ncopies * exact_log2(group_size) * group_size
@@ -763,8 +841,8 @@  vect_get_store_cost (struct data_referen
    access scheme chosen.  */
 
 void
-vect_model_load_cost (stmt_vec_info stmt_info, int ncopies, slp_tree slp_node)
-
+vect_model_load_cost (stmt_vec_info stmt_info, int ncopies, bool load_lanes_p,
+		      slp_tree slp_node)
 {
   int group_size;
   gimple first_stmt;
@@ -789,9 +867,11 @@  vect_model_load_cost (stmt_vec_info stmt
       first_dr = dr;
     }
 
-  /* Is this an access in a group of loads providing strided access?
-     If so, add in the cost of the permutes.  */
-  if (group_size > 1)
+  /* We assume that the cost of a single load-lanes instruction is
+     equivalent to the cost of GROUP_SIZE separate loads.  If a strided
+     access is instead being provided by a load-and-permute operation,
+     include the cost of the permutes.  */
+  if (!load_lanes_p && group_size > 1)
     {
       /* Uses an even and odd extract operations for each needed permute.  */
       inside_cost = ncopies * exact_log2(group_size) * group_size
@@ -3324,6 +3404,7 @@  vectorizable_store (gimple stmt, gimple_
   int j;
   gimple next_stmt, first_stmt = NULL;
   bool strided_store = false;
+  bool store_lanes_p = false;
   unsigned int group_size, i;
   VEC(tree,heap) *dr_chain = NULL, *oprnds = NULL, *result_chain = NULL;
   bool inv_p;
@@ -3331,6 +3412,7 @@  vectorizable_store (gimple stmt, gimple_
   bool slp = (slp_node != NULL);
   unsigned int vec_num;
   bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_info);
+  tree aggr_type;
 
   if (loop_vinfo)
     loop = LOOP_VINFO_LOOP (loop_vinfo);
@@ -3415,7 +3497,9 @@  vectorizable_store (gimple stmt, gimple_
       if (!slp && !PURE_SLP_STMT (stmt_info))
 	{
 	  group_size = DR_GROUP_SIZE (vinfo_for_stmt (first_stmt));
-	  if (!vect_strided_store_supported (vectype, group_size))
+	  if (vect_store_lanes_supported (vectype, group_size))
+	    store_lanes_p = true;
+	  else if (!vect_strided_store_supported (vectype, group_size))
 	    return false;
 	}
 
@@ -3443,7 +3527,7 @@  vectorizable_store (gimple stmt, gimple_
   if (!vec_stmt) /* transformation not required.  */
     {
       STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
-      vect_model_store_cost (stmt_info, ncopies, dt, NULL);
+      vect_model_store_cost (stmt_info, ncopies, store_lanes_p, dt, NULL);
       return true;
     }
 
@@ -3498,6 +3582,16 @@  vectorizable_store (gimple stmt, gimple_
 
   alignment_support_scheme = vect_supportable_dr_alignment (first_dr, false);
   gcc_assert (alignment_support_scheme);
+  /* Targets with store-lane instructions must not require explicit
+     realignment.  */
+  gcc_assert (!store_lanes_p
+	      || alignment_support_scheme == dr_aligned
+	      || alignment_support_scheme == dr_unaligned_supported);
+
+  if (store_lanes_p)
+    aggr_type = build_simple_array_type (elem_type, vec_num * nunits);
+  else
+    aggr_type = vectype;
 
   /* In case the vectorization factor (VF) is bigger than the number
      of elements that we can fit in a vectype (nunits), we have to generate
@@ -3586,7 +3680,7 @@  vectorizable_store (gimple stmt, gimple_
 	  /* We should have catched mismatched types earlier.  */
 	  gcc_assert (useless_type_conversion_p (vectype,
 						 TREE_TYPE (vec_oprnd)));
-	  dataref_ptr = vect_create_data_ref_ptr (first_stmt, vectype, NULL,
+	  dataref_ptr = vect_create_data_ref_ptr (first_stmt, aggr_type, NULL,
 						  NULL_TREE, &dummy, gsi,
 						  &ptr_incr, false, &inv_p);
 	  gcc_assert (bb_vinfo || !inv_p);
@@ -3609,11 +3703,31 @@  vectorizable_store (gimple stmt, gimple_
 	      VEC_replace(tree, dr_chain, i, vec_oprnd);
 	      VEC_replace(tree, oprnds, i, vec_oprnd);
 	    }
-	  dataref_ptr =
-		bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt, NULL_TREE);
+	  dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt,
+					 TYPE_SIZE_UNIT (aggr_type));
 	}
 
-      if (1)
+      if (store_lanes_p)
+	{
+	  tree vec_array;
+
+	  /* Combine all the vectors into an array.  */
+	  vec_array = create_vector_array (vectype, vec_num);
+	  for (i = 0; i < vec_num; i++)
+	    {
+	      vec_oprnd = VEC_index (tree, dr_chain, i);
+	      write_vector_array (stmt, gsi, vec_oprnd, vec_array, i);
+	    }
+
+	  /* Emit:
+	       MEM_REF[...all elements...] = STORE_LANES (VEC_ARRAY).  */
+	  data_ref = create_array_ref (aggr_type, dataref_ptr, first_dr);
+	  new_stmt = gimple_build_call_internal (IFN_STORE_LANES, 1, vec_array);
+	  gimple_call_set_lhs (new_stmt, data_ref);
+	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	  mark_symbols_for_renaming (new_stmt);
+	}
+      else
 	{
 	  new_stmt = NULL;
 	  if (strided_store)
@@ -3811,6 +3925,7 @@  vectorizable_load (gimple stmt, gimple_s
   gimple phi = NULL;
   VEC(tree,heap) *dr_chain = NULL;
   bool strided_load = false;
+  bool load_lanes_p = false;
   gimple first_stmt;
   tree scalar_type;
   bool inv_p;
@@ -3823,6 +3938,7 @@  vectorizable_load (gimple stmt, gimple_s
   enum tree_code code;
   bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_info);
   int vf;
+  tree aggr_type;
 
   if (loop_vinfo)
     {
@@ -3918,7 +4034,9 @@  vectorizable_load (gimple stmt, gimple_s
       if (!slp && !PURE_SLP_STMT (stmt_info))
 	{
 	  group_size = DR_GROUP_SIZE (vinfo_for_stmt (first_stmt));
-	  if (!vect_strided_load_supported (vectype, group_size))
+	  if (vect_load_lanes_supported (vectype, group_size))
+	    load_lanes_p = true;
+	  else if (!vect_strided_load_supported (vectype, group_size))
 	    return false;
 	}
     }
@@ -3945,7 +4063,7 @@  vectorizable_load (gimple stmt, gimple_s
   if (!vec_stmt) /* transformation not required.  */
     {
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
-      vect_model_load_cost (stmt_info, ncopies, NULL);
+      vect_model_load_cost (stmt_info, ncopies, load_lanes_p, NULL);
       return true;
     }
 
@@ -3986,6 +4104,11 @@  vectorizable_load (gimple stmt, gimple_s
 
   alignment_support_scheme = vect_supportable_dr_alignment (first_dr, false);
   gcc_assert (alignment_support_scheme);
+  /* Targets with load-lane instructions must not require explicit
+     realignment.  */
+  gcc_assert (!load_lanes_p
+	      || alignment_support_scheme == dr_aligned
+	      || alignment_support_scheme == dr_unaligned_supported);
 
   /* In case the vectorization factor (VF) is bigger than the number
      of elements that we can fit in a vectype (nunits), we have to generate
@@ -4117,22 +4240,52 @@  vectorizable_load (gimple stmt, gimple_s
   if (negative)
     offset = size_int (-TYPE_VECTOR_SUBPARTS (vectype) + 1);
 
+  if (load_lanes_p)
+    aggr_type = build_simple_array_type (elem_type, vec_num * nunits);
+  else
+    aggr_type = vectype;
+
   prev_stmt_info = NULL;
   for (j = 0; j < ncopies; j++)
     {
-      /* 1. Create the vector pointer update chain.  */
+      /* 1. Create the vector or array pointer update chain.  */
       if (j == 0)
-        dataref_ptr = vect_create_data_ref_ptr (first_stmt, vectype, at_loop,
+        dataref_ptr = vect_create_data_ref_ptr (first_stmt, aggr_type, at_loop,
 						offset, &dummy, gsi,
 						&ptr_incr, false, &inv_p);
       else
-        dataref_ptr =
-		bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt, NULL_TREE);
+        dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt,
+				       TYPE_SIZE_UNIT (aggr_type));
 
       if (strided_load || slp_perm)
 	dr_chain = VEC_alloc (tree, heap, vec_num);
 
-      if (1)
+      if (load_lanes_p)
+	{
+	  tree vec_array;
+
+	  vec_array = create_vector_array (vectype, vec_num);
+
+	  /* Emit:
+	       VEC_ARRAY = LOAD_LANES (MEM_REF[...all elements...]).  */
+	  data_ref = create_array_ref (aggr_type, dataref_ptr, first_dr);
+	  new_stmt = gimple_build_call_internal (IFN_LOAD_LANES, 1, data_ref);
+	  gimple_call_set_lhs (new_stmt, vec_array);
+	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	  mark_symbols_for_renaming (new_stmt);
+
+	  /* Extract each vector into an SSA_NAME.  */
+	  for (i = 0; i < vec_num; i++)
+	    {
+	      new_temp = read_vector_array (stmt, gsi, scalar_dest,
+					    vec_array, i);
+	      VEC_quick_push (tree, dr_chain, new_temp);
+	    }
+
+	  /* Record the mapping between SSA_NAMEs and statements.  */
+	  vect_record_strided_load_vectors (stmt, dr_chain);
+	}
+      else
 	{
 	  for (i = 0; i < vec_num; i++)
 	    {
@@ -4349,7 +4502,8 @@  vectorizable_load (gimple stmt, gimple_s
         {
           if (strided_load)
   	    {
-	      vect_transform_strided_load (stmt, dr_chain, group_size, gsi);
+	      if (!load_lanes_p)
+		vect_transform_strided_load (stmt, dr_chain, group_size, gsi);
 	      *vec_stmt = STMT_VINFO_VEC_STMT (stmt_info);
 	    }
           else
Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	2011-04-12 12:16:46.000000000 +0100
+++ gcc/tree-vect-slp.c	2011-04-12 14:48:28.000000000 +0100
@@ -215,7 +215,8 @@  vect_get_and_check_slp_defs (loop_vec_in
 	    vect_model_simple_cost (stmt_info, ncopies_for_cost, dt, slp_node);
 	  else
 	    /* Store.  */
-	    vect_model_store_cost (stmt_info, ncopies_for_cost, dt[0], slp_node);
+	    vect_model_store_cost (stmt_info, ncopies_for_cost, false,
+				   dt[0], slp_node);
 	}
 
       else
@@ -579,7 +580,7 @@  vect_build_slp_tree (loop_vec_info loop_
 
                   /* Analyze costs (for the first stmt in the group).  */
                   vect_model_load_cost (vinfo_for_stmt (stmt),
-                                        ncopies_for_cost, *node);
+                                        ncopies_for_cost, false, *node);
                 }
 
               /* Store the place of this load in the interleaving chain.  In