
Masked load/store vectorization (take 6)

Message ID: 20131128230906.GX892@tucnak.redhat.com
State: New

Commit Message

Jakub Jelinek Nov. 28, 2013, 11:09 p.m. UTC
On Wed, Nov 27, 2013 at 04:10:16PM +0100, Richard Biener wrote:
> As you pinged this ... can you re-post a patch with changelog that
> includes the followups as we decided?

Ok, here is the updated patch against latest trunk with the follow-ups
incorporated.  Bootstrapped/regtested on x86_64-linux and i686-linux,
ok for trunk?

2013-11-28  Jakub Jelinek  <jakub@redhat.com>

	* tree-vectorizer.h (struct _loop_vec_info): Add scalar_loop field.
	(LOOP_VINFO_SCALAR_LOOP): Define.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Add scalar_loop argument.
	* config/i386/sse.md (maskload<mode>, maskstore<mode>): New expanders.
	* tree-data-ref.c (struct data_ref_loc_d): Replace pos field with ref.
	(get_references_in_stmt): Don't record operand addresses, but
	operands themselves.  Handle MASK_LOAD and MASK_STORE.
	(find_data_references_in_stmt, graphite_find_data_references_in_stmt):
	Adjust for the pos -> ref change.
	* internal-fn.def (LOOP_VECTORIZED, MASK_LOAD, MASK_STORE): New
	internal fns.
	* tree-if-conv.c: Include target.h, expr.h, optabs.h and
	tree-ssa-address.h.
	(release_bb_predicate): New function.
	(free_bb_predicate): Use it.
	(reset_bb_predicate): Likewise.  Don't free bb->aux
	just to immediately allocate it again.
	(if_convertible_phi_p): Add any_mask_load_store argument, if true,
	handle it like flag_tree_loop_if_convert_stores.
	(insert_gimplified_predicates): Likewise.  If bb dominates
	loop->latch, call reset_bb_predicate.
	(ifcvt_can_use_mask_load_store): New function.
	(if_convertible_gimple_assign_stmt_p): Add any_mask_load_store
	argument, check if some conditional loads or stores can't be
	converted into MASK_LOAD or MASK_STORE.
	(if_convertible_stmt_p): Add any_mask_load_store argument,
	pass it down to if_convertible_gimple_assign_stmt_p.
	(predicate_bbs): Don't return bool, only check if the last stmt
	of a basic block is GIMPLE_COND and handle that.  For basic blocks
	that dominate loop->latch assume they don't need to be predicated.
	(if_convertible_loop_p_1): Only call predicate_bbs if
	flag_tree_loop_if_convert_stores and free_bb_predicate in that case
	afterwards, check gimple_code of stmts here.  Replace is_predicated
	check with dominance check.  Add any_mask_load_store argument,
	pass it down to if_convertible_stmt_p and if_convertible_phi_p,
	call if_convertible_phi_p only after all if_convertible_stmt_p
	calls.
	(if_convertible_loop_p): Add any_mask_load_store argument,
	pass it down to if_convertible_loop_p_1.
	(predicate_mem_writes): Emit MASK_LOAD and/or MASK_STORE calls.
	(combine_blocks): Add any_mask_load_store argument, pass
	it down to insert_gimplified_predicates and call predicate_mem_writes
	if it is set.  Call predicate_bbs.
	(version_loop_for_if_conversion): New function.
	(tree_if_conversion): Adjust if_convertible_loop_p and combine_blocks
	calls.  Return todo flags instead of bool, call
	version_loop_for_if_conversion if if-conversion should be just
	for the vectorized loops and nothing else.
	(main_tree_if_conversion): Adjust caller.  Don't call
	tree_if_conversion for dont_vectorize loops if if-conversion
	isn't explicitly enabled.
	* tree-vect-data-refs.c (vect_check_gather): Handle
	MASK_LOAD/MASK_STORE.
	(vect_analyze_data_refs, vect_supportable_dr_alignment): Likewise.
	* gimple.h (gimple_expr_type): Handle MASK_STORE.
	* internal-fn.c (expand_LOOP_VECTORIZED, expand_MASK_LOAD,
	expand_MASK_STORE): New functions.
	* tree-vectorizer.c: Include tree-cfg.h and gimple-fold.h.
	(vect_loop_vectorized_call, vect_loop_select): New functions.
	(vectorize_loops): Don't try to vectorize loops with
	loop->dont_vectorize set.  Set LOOP_VINFO_SCALAR_LOOP for if-converted
	loops, fold LOOP_VECTORIZED internal call depending on whether
	the loop has been vectorized or not.  Use vect_loop_select to
	attempt to vectorize an if-converted loop before its
	non-if-converted counterpart.  If outer loop vectorization is
	successful in that case, ensure the loop in the soon-to-be-dead
	non-if-converted loop is not vectorized.
	* tree-vect-loop-manip.c (slpeel_duplicate_current_defs_from_edges):
	New function.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Add scalar_loop argument.
	If non-NULL, copy basic blocks from scalar_loop instead of loop, but
	still to loop's entry or exit edge.
	(slpeel_tree_peel_loop_to_edge): Add scalar_loop argument, pass it
	down to slpeel_tree_duplicate_loop_to_edge_cfg.
	(vect_do_peeling_for_loop_bound, vect_do_peeling_for_loop_alignment):
	Adjust callers.
	(vect_loop_versioning): If LOOP_VINFO_SCALAR_LOOP, perform loop
	versioning from that loop instead of LOOP_VINFO_LOOP, move it to the
	right place in the CFG afterwards.
	* tree-vect-loop.c (vect_determine_vectorization_factor): Handle
	MASK_STORE.
	* cfgloop.h (struct loop): Add dont_vectorize field.
	* tree-loop-distribution.c (copy_loop_before): Adjust
	slpeel_tree_duplicate_loop_to_edge_cfg caller.
	* optabs.def (maskload_optab, maskstore_optab): New optabs.
	* passes.def: Add a note that pass_vectorize must immediately follow
	pass_if_conversion.
	* tree-predcom.c (split_data_refs_to_components): Give up if
	DR_STMT is a call.
	* tree-vect-stmts.c (vect_mark_relevant): Don't crash if lhs
	is NULL.
	(exist_non_indexing_operands_for_use_p): Handle MASK_LOAD
	and MASK_STORE.
	(vectorizable_mask_load_store): New function.
	(vectorizable_call): Call it for MASK_LOAD or MASK_STORE.
	(vect_transform_stmt): Handle MASK_STORE.
	* tree-ssa-phiopt.c (cond_if_else_store_replacement): Ignore
	DR_STMT where lhs is NULL.

	* gcc.dg/vect/vect-cond-11.c: New test.
	* gcc.target/i386/vect-cond-1.c: New test.
	* gcc.target/i386/avx2-gather-5.c: New test.
	* gcc.target/i386/avx2-gather-6.c: New test.
	* gcc.dg/vect/vect-mask-loadstore-1.c: New test.
	* gcc.dg/vect/vect-mask-load-1.c: New test.



	Jakub

Comments

Jeff Law Dec. 3, 2013, 8:13 p.m. UTC | #1
On 11/28/13 16:09, Jakub Jelinek wrote:
> On Wed, Nov 27, 2013 at 04:10:16PM +0100, Richard Biener wrote:
>> As you pinged this ... can you re-post a patch with changelog that
>> includes the followups as we decided?
>
> Ok, here is the updated patch against latest trunk with the follow-ups
> incorporated.  Bootstrapped/regtested on x86_64-linux and i686-linux,
> ok for trunk?
>
> 2013-11-28  Jakub Jelinek  <jakub@redhat.com>
>
> 	* tree-vectorizer.h (struct _loop_vec_info): Add scalar_loop field.
> 	(LOOP_VINFO_SCALAR_LOOP): Define.
> 	(slpeel_tree_duplicate_loop_to_edge_cfg): Add scalar_loop argument.
> 	* config/i386/sse.md (maskload<mode>, maskstore<mode>): New expanders.
> 	* tree-data-ref.c (struct data_ref_loc_d): Replace pos field with ref.
> 	(get_references_in_stmt): Don't record operand addresses, but
> 	operands themselves.  Handle MASK_LOAD and MASK_STORE.
> 	(find_data_references_in_stmt, graphite_find_data_references_in_stmt):
> 	Adjust for the pos -> ref change.
> 	* internal-fn.def (LOOP_VECTORIZED, MASK_LOAD, MASK_STORE): New
> 	internal fns.
> 	* tree-if-conv.c: Include target.h, expr.h, optabs.h and
> 	tree-ssa-address.h.
> 	(release_bb_predicate): New function.
> 	(free_bb_predicate): Use it.
> 	(reset_bb_predicate): Likewise.  Don't free bb->aux
> 	just to immediately allocate it again.
> 	(if_convertible_phi_p): Add any_mask_load_store argument, if true,
> 	handle it like flag_tree_loop_if_convert_stores.
> 	(insert_gimplified_predicates): Likewise.  If bb dominates
> 	loop->latch, call reset_bb_predicate.
> 	(ifcvt_can_use_mask_load_store): New function.
> 	(if_convertible_gimple_assign_stmt_p): Add any_mask_load_store
> 	argument, check if some conditional loads or stores can't be
> 	converted into MASK_LOAD or MASK_STORE.
> 	(if_convertible_stmt_p): Add any_mask_load_store argument,
> 	pass it down to if_convertible_gimple_assign_stmt_p.
> 	(predicate_bbs): Don't return bool, only check if the last stmt
> 	of a basic block is GIMPLE_COND and handle that.  For basic blocks
> 	that dominate loop->latch assume they don't need to be predicated.
> 	(if_convertible_loop_p_1): Only call predicate_bbs if
> 	flag_tree_loop_if_convert_stores and free_bb_predicate in that case
> 	afterwards, check gimple_code of stmts here.  Replace is_predicated
> 	check with dominance check.  Add any_mask_load_store argument,
> 	pass it down to if_convertible_stmt_p and if_convertible_phi_p,
> 	call if_convertible_phi_p only after all if_convertible_stmt_p
> 	calls.
> 	(if_convertible_loop_p): Add any_mask_load_store argument,
> 	pass it down to if_convertible_loop_p_1.
> 	(predicate_mem_writes): Emit MASK_LOAD and/or MASK_STORE calls.
> 	(combine_blocks): Add any_mask_load_store argument, pass
> 	it down to insert_gimplified_predicates and call predicate_mem_writes
> 	if it is set.  Call predicate_bbs.
> 	(version_loop_for_if_conversion): New function.
> 	(tree_if_conversion): Adjust if_convertible_loop_p and combine_blocks
> 	calls.  Return todo flags instead of bool, call
> 	version_loop_for_if_conversion if if-conversion should be just
> 	for the vectorized loops and nothing else.
> 	(main_tree_if_conversion): Adjust caller.  Don't call
> 	tree_if_conversion for dont_vectorize loops if if-conversion
> 	isn't explicitly enabled.
> 	* tree-vect-data-refs.c (vect_check_gather): Handle
> 	MASK_LOAD/MASK_STORE.
> 	(vect_analyze_data_refs, vect_supportable_dr_alignment): Likewise.
> 	* gimple.h (gimple_expr_type): Handle MASK_STORE.
> 	* internal-fn.c (expand_LOOP_VECTORIZED, expand_MASK_LOAD,
> 	expand_MASK_STORE): New functions.
> 	* tree-vectorizer.c: Include tree-cfg.h and gimple-fold.h.
> 	(vect_loop_vectorized_call, vect_loop_select): New functions.
> 	(vectorize_loops): Don't try to vectorize loops with
> 	loop->dont_vectorize set.  Set LOOP_VINFO_SCALAR_LOOP for if-converted
> 	loops, fold LOOP_VECTORIZED internal call depending on whether
> 	the loop has been vectorized or not.  Use vect_loop_select to
> 	attempt to vectorize an if-converted loop before its
> 	non-if-converted counterpart.  If outer loop vectorization is
> 	successful in that case, ensure the loop in the soon-to-be-dead
> 	non-if-converted loop is not vectorized.
> 	* tree-vect-loop-manip.c (slpeel_duplicate_current_defs_from_edges):
> 	New function.
> 	(slpeel_tree_duplicate_loop_to_edge_cfg): Add scalar_loop argument.
> 	If non-NULL, copy basic blocks from scalar_loop instead of loop, but
> 	still to loop's entry or exit edge.
> 	(slpeel_tree_peel_loop_to_edge): Add scalar_loop argument, pass it
> 	down to slpeel_tree_duplicate_loop_to_edge_cfg.
> 	(vect_do_peeling_for_loop_bound, vect_do_peeling_for_loop_alignment):
> 	Adjust callers.
> 	(vect_loop_versioning): If LOOP_VINFO_SCALAR_LOOP, perform loop
> 	versioning from that loop instead of LOOP_VINFO_LOOP, move it to the
> 	right place in the CFG afterwards.
> 	* tree-vect-loop.c (vect_determine_vectorization_factor): Handle
> 	MASK_STORE.
> 	* cfgloop.h (struct loop): Add dont_vectorize field.
> 	* tree-loop-distribution.c (copy_loop_before): Adjust
> 	slpeel_tree_duplicate_loop_to_edge_cfg caller.
> 	* optabs.def (maskload_optab, maskstore_optab): New optabs.
> 	* passes.def: Add a note that pass_vectorize must immediately follow
> 	pass_if_conversion.
> 	* tree-predcom.c (split_data_refs_to_components): Give up if
> 	DR_STMT is a call.
> 	* tree-vect-stmts.c (vect_mark_relevant): Don't crash if lhs
> 	is NULL.
> 	(exist_non_indexing_operands_for_use_p): Handle MASK_LOAD
> 	and MASK_STORE.
> 	(vectorizable_mask_load_store): New function.
> 	(vectorizable_call): Call it for MASK_LOAD or MASK_STORE.
> 	(vect_transform_stmt): Handle MASK_STORE.
> 	* tree-ssa-phiopt.c (cond_if_else_store_replacement): Ignore
> 	DR_STMT where lhs is NULL.
>
> 	* gcc.dg/vect/vect-cond-11.c: New test.
> 	* gcc.target/i386/vect-cond-1.c: New test.
> 	* gcc.target/i386/avx2-gather-5.c: New test.
> 	* gcc.target/i386/avx2-gather-6.c: New test.
> 	* gcc.dg/vect/vect-mask-loadstore-1.c: New test.
> 	* gcc.dg/vect/vect-mask-load-1.c: New test.
I believe Richi has significant state on this.  So I'm explicitly 
leaving it for him.

jeff
Richard Biener Dec. 6, 2013, 12:49 p.m. UTC | #2
On Fri, 29 Nov 2013, Jakub Jelinek wrote:

> On Wed, Nov 27, 2013 at 04:10:16PM +0100, Richard Biener wrote:
> > As you pinged this ... can you re-post a patch with changelog that
> > includes the followups as we decided?
> 
> Ok, here is the updated patch against latest trunk with the follow-ups
> incorporated.  Bootstrapped/regtested on x86_64-linux and i686-linux,
> ok for trunk?

Comments inline (scary large, this patch, for this stage ...)

> 2013-11-28  Jakub Jelinek  <jakub@redhat.com>
> 
> 	* tree-vectorizer.h (struct _loop_vec_info): Add scalar_loop field.
> 	(LOOP_VINFO_SCALAR_LOOP): Define.
> 	(slpeel_tree_duplicate_loop_to_edge_cfg): Add scalar_loop argument.
> 	* config/i386/sse.md (maskload<mode>, maskstore<mode>): New expanders.
> 	* tree-data-ref.c (struct data_ref_loc_d): Replace pos field with ref.
> 	(get_references_in_stmt): Don't record operand addresses, but
> 	operands themselves.  Handle MASK_LOAD and MASK_STORE.
> 	(find_data_references_in_stmt, graphite_find_data_references_in_stmt):
> 	Adjust for the pos -> ref change.
> 	* internal-fn.def (LOOP_VECTORIZED, MASK_LOAD, MASK_STORE): New
> 	internal fns.
> 	* tree-if-conv.c: Include target.h, expr.h, optabs.h and
> 	tree-ssa-address.h.
> 	(release_bb_predicate): New function.
> 	(free_bb_predicate): Use it.
> 	(reset_bb_predicate): Likewise.  Don't free bb->aux
> 	just to immediately allocate it again.
> 	(if_convertible_phi_p): Add any_mask_load_store argument, if true,
> 	handle it like flag_tree_loop_if_convert_stores.
> 	(insert_gimplified_predicates): Likewise.  If bb dominates
> 	loop->latch, call reset_bb_predicate.
> 	(ifcvt_can_use_mask_load_store): New function.
> 	(if_convertible_gimple_assign_stmt_p): Add any_mask_load_store
> 	argument, check if some conditional loads or stores can't be
> 	converted into MASK_LOAD or MASK_STORE.
> 	(if_convertible_stmt_p): Add any_mask_load_store argument,
> 	pass it down to if_convertible_gimple_assign_stmt_p.
> 	(predicate_bbs): Don't return bool, only check if the last stmt
> 	of a basic block is GIMPLE_COND and handle that.  For basic blocks
> 	that dominate loop->latch assume they don't need to be predicated.
> 	(if_convertible_loop_p_1): Only call predicate_bbs if
> 	flag_tree_loop_if_convert_stores and free_bb_predicate in that case
> 	afterwards, check gimple_code of stmts here.  Replace is_predicated
> 	check with dominance check.  Add any_mask_load_store argument,
> 	pass it down to if_convertible_stmt_p and if_convertible_phi_p,
> 	call if_convertible_phi_p only after all if_convertible_stmt_p
> 	calls.
> 	(if_convertible_loop_p): Add any_mask_load_store argument,
> 	pass it down to if_convertible_loop_p_1.
> 	(predicate_mem_writes): Emit MASK_LOAD and/or MASK_STORE calls.
> 	(combine_blocks): Add any_mask_load_store argument, pass
> 	it down to insert_gimplified_predicates and call predicate_mem_writes
> 	if it is set.  Call predicate_bbs.
> 	(version_loop_for_if_conversion): New function.
> 	(tree_if_conversion): Adjust if_convertible_loop_p and combine_blocks
> 	calls.  Return todo flags instead of bool, call
> 	version_loop_for_if_conversion if if-conversion should be just
> 	for the vectorized loops and nothing else.
> 	(main_tree_if_conversion): Adjust caller.  Don't call
> 	tree_if_conversion for dont_vectorize loops if if-conversion
> 	isn't explicitly enabled.
> 	* tree-vect-data-refs.c (vect_check_gather): Handle
> 	MASK_LOAD/MASK_STORE.
> 	(vect_analyze_data_refs, vect_supportable_dr_alignment): Likewise.
> 	* gimple.h (gimple_expr_type): Handle MASK_STORE.
> 	* internal-fn.c (expand_LOOP_VECTORIZED, expand_MASK_LOAD,
> 	expand_MASK_STORE): New functions.
> 	* tree-vectorizer.c: Include tree-cfg.h and gimple-fold.h.
> 	(vect_loop_vectorized_call, vect_loop_select): New functions.
> 	(vectorize_loops): Don't try to vectorize loops with
> 	loop->dont_vectorize set.  Set LOOP_VINFO_SCALAR_LOOP for if-converted
> 	loops, fold LOOP_VECTORIZED internal call depending on whether
> 	the loop has been vectorized or not.  Use vect_loop_select to
> 	attempt to vectorize an if-converted loop before its
> 	non-if-converted counterpart.  If outer loop vectorization is
> 	successful in that case, ensure the loop in the soon-to-be-dead
> 	non-if-converted loop is not vectorized.
> 	* tree-vect-loop-manip.c (slpeel_duplicate_current_defs_from_edges):
> 	New function.
> 	(slpeel_tree_duplicate_loop_to_edge_cfg): Add scalar_loop argument.
> 	If non-NULL, copy basic blocks from scalar_loop instead of loop, but
> 	still to loop's entry or exit edge.
> 	(slpeel_tree_peel_loop_to_edge): Add scalar_loop argument, pass it
> 	down to slpeel_tree_duplicate_loop_to_edge_cfg.
> 	(vect_do_peeling_for_loop_bound, vect_do_peeling_for_loop_alignment):
> 	Adjust callers.
> 	(vect_loop_versioning): If LOOP_VINFO_SCALAR_LOOP, perform loop
> 	versioning from that loop instead of LOOP_VINFO_LOOP, move it to the
> 	right place in the CFG afterwards.
> 	* tree-vect-loop.c (vect_determine_vectorization_factor): Handle
> 	MASK_STORE.
> 	* cfgloop.h (struct loop): Add dont_vectorize field.
> 	* tree-loop-distribution.c (copy_loop_before): Adjust
> 	slpeel_tree_duplicate_loop_to_edge_cfg caller.
> 	* optabs.def (maskload_optab, maskstore_optab): New optabs.
> 	* passes.def: Add a note that pass_vectorize must immediately follow
> 	pass_if_conversion.
> 	* tree-predcom.c (split_data_refs_to_components): Give up if
> 	DR_STMT is a call.
> 	* tree-vect-stmts.c (vect_mark_relevant): Don't crash if lhs
> 	is NULL.
> 	(exist_non_indexing_operands_for_use_p): Handle MASK_LOAD
> 	and MASK_STORE.
> 	(vectorizable_mask_load_store): New function.
> 	(vectorizable_call): Call it for MASK_LOAD or MASK_STORE.
> 	(vect_transform_stmt): Handle MASK_STORE.
> 	* tree-ssa-phiopt.c (cond_if_else_store_replacement): Ignore
> 	DR_STMT where lhs is NULL.
> 
> 	* gcc.dg/vect/vect-cond-11.c: New test.
> 	* gcc.target/i386/vect-cond-1.c: New test.
> 	* gcc.target/i386/avx2-gather-5.c: New test.
> 	* gcc.target/i386/avx2-gather-6.c: New test.
> 	* gcc.dg/vect/vect-mask-loadstore-1.c: New test.
> 	* gcc.dg/vect/vect-mask-load-1.c: New test.
> 
> --- gcc/tree-vectorizer.h.jj	2013-11-28 09:18:11.771774932 +0100
> +++ gcc/tree-vectorizer.h	2013-11-28 14:14:35.827362293 +0100
> @@ -344,6 +344,10 @@ typedef struct _loop_vec_info {
>       fix it up.  */
>    bool operands_swapped;
>  
> +  /* If if-conversion versioned this loop before conversion, this is the
> +     loop version without if-conversion.  */
> +  struct loop *scalar_loop;
> +
>  } *loop_vec_info;
>  
>  /* Access Functions.  */
> @@ -376,6 +380,7 @@ typedef struct _loop_vec_info {
>  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
>  #define LOOP_VINFO_OPERANDS_SWAPPED(L)     (L)->operands_swapped
>  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
> +#define LOOP_VINFO_SCALAR_LOOP(L)	   (L)->scalar_loop
>  
>  #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \
>    (L)->may_misalign_stmts.length () > 0
> @@ -934,7 +939,8 @@ extern source_location vect_location;
>     in tree-vect-loop-manip.c.  */
>  extern void slpeel_make_loop_iterate_ntimes (struct loop *, tree);
>  extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge);
> -struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *, edge);
> +struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *,
> +						     struct loop *, edge);
>  extern void vect_loop_versioning (loop_vec_info, unsigned int, bool);
>  extern void vect_do_peeling_for_loop_bound (loop_vec_info, tree, tree,
>  					    unsigned int, bool);
> --- gcc/config/i386/sse.md.jj	2013-11-23 15:20:47.452606456 +0100
> +++ gcc/config/i386/sse.md	2013-11-28 14:13:57.562572366 +0100
> @@ -14218,6 +14218,23 @@ (define_insn "<avx_avx2>_maskstore<ssemo
>     (set_attr "btver2_decode" "vector") 
>     (set_attr "mode" "<sseinsnmode>")])
>  
> +(define_expand "maskload<mode>"
> +  [(set (match_operand:V48_AVX2 0 "register_operand")
> +	(unspec:V48_AVX2
> +	  [(match_operand:<sseintvecmode> 2 "register_operand")
> +	   (match_operand:V48_AVX2 1 "memory_operand")]
> +	  UNSPEC_MASKMOV))]
> +  "TARGET_AVX")
> +
> +(define_expand "maskstore<mode>"
> +  [(set (match_operand:V48_AVX2 0 "memory_operand")
> +	(unspec:V48_AVX2
> +	  [(match_operand:<sseintvecmode> 2 "register_operand")
> +	   (match_operand:V48_AVX2 1 "register_operand")
> +	   (match_dup 0)]
> +	  UNSPEC_MASKMOV))]
> +  "TARGET_AVX")
> +
>  (define_insn_and_split "avx_<castmode><avxsizesuffix>_<castmode>"
>    [(set (match_operand:AVX256MODE2P 0 "nonimmediate_operand" "=x,m")
>  	(unspec:AVX256MODE2P

x86 maintainers should comment here (ick - unspecs)

> --- gcc/tree-data-ref.c.jj	2013-11-27 18:02:48.050814182 +0100
> +++ gcc/tree-data-ref.c	2013-11-28 14:13:57.592572476 +0100
> @@ -4320,8 +4320,8 @@ compute_all_dependences (vec<data_refere
>  
>  typedef struct data_ref_loc_d
>  {
> -  /* Position of the memory reference.  */
> -  tree *pos;
> +  /* The memory reference.  */
> +  tree ref;
>  
>    /* True if the memory reference is read.  */
>    bool is_read;
> @@ -4336,7 +4336,7 @@ get_references_in_stmt (gimple stmt, vec
>  {
>    bool clobbers_memory = false;
>    data_ref_loc ref;
> -  tree *op0, *op1;
> +  tree op0, op1;
>    enum gimple_code stmt_code = gimple_code (stmt);
>  
>    /* ASM_EXPR and CALL_EXPR may embed arbitrary side effects.
> @@ -4346,16 +4346,26 @@ get_references_in_stmt (gimple stmt, vec
>        && !(gimple_call_flags (stmt) & ECF_CONST))
>      {
>        /* Allow IFN_GOMP_SIMD_LANE in their own loops.  */
> -      if (gimple_call_internal_p (stmt)
> -	  && gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
> -	{
> -	  struct loop *loop = gimple_bb (stmt)->loop_father;
> -	  tree uid = gimple_call_arg (stmt, 0);
> -	  gcc_assert (TREE_CODE (uid) == SSA_NAME);
> -	  if (loop == NULL
> -	      || loop->simduid != SSA_NAME_VAR (uid))
> +      if (gimple_call_internal_p (stmt))
> +	switch (gimple_call_internal_fn (stmt))
> +	  {
> +	  case IFN_GOMP_SIMD_LANE:
> +	    {
> +	      struct loop *loop = gimple_bb (stmt)->loop_father;
> +	      tree uid = gimple_call_arg (stmt, 0);
> +	      gcc_assert (TREE_CODE (uid) == SSA_NAME);
> +	      if (loop == NULL
> +		  || loop->simduid != SSA_NAME_VAR (uid))
> +		clobbers_memory = true;
> +	      break;
> +	    }
> +	  case IFN_MASK_LOAD:
> +	  case IFN_MASK_STORE:
> +	    break;
> +	  default:
>  	    clobbers_memory = true;
> -	}
> +	    break;
> +	  }
>        else
>  	clobbers_memory = true;
>      }
> @@ -4369,15 +4379,15 @@ get_references_in_stmt (gimple stmt, vec
>    if (stmt_code == GIMPLE_ASSIGN)
>      {
>        tree base;
> -      op0 = gimple_assign_lhs_ptr (stmt);
> -      op1 = gimple_assign_rhs1_ptr (stmt);
> +      op0 = gimple_assign_lhs (stmt);
> +      op1 = gimple_assign_rhs1 (stmt);
>  
> -      if (DECL_P (*op1)
> -	  || (REFERENCE_CLASS_P (*op1)
> -	      && (base = get_base_address (*op1))
> +      if (DECL_P (op1)
> +	  || (REFERENCE_CLASS_P (op1)
> +	      && (base = get_base_address (op1))
>  	      && TREE_CODE (base) != SSA_NAME))
>  	{
> -	  ref.pos = op1;
> +	  ref.ref = op1;
>  	  ref.is_read = true;
>  	  references->safe_push (ref);
>  	}
> @@ -4386,16 +4396,35 @@ get_references_in_stmt (gimple stmt, vec
>      {
>        unsigned i, n;
>  
> -      op0 = gimple_call_lhs_ptr (stmt);
> +      ref.is_read = false;
> +      if (gimple_call_internal_p (stmt))
> +	switch (gimple_call_internal_fn (stmt))
> +	  {
> +	  case IFN_MASK_LOAD:
> +	    ref.is_read = true;
> +	  case IFN_MASK_STORE:
> +	    ref.ref = build2 (MEM_REF,
> +			      ref.is_read
> +			      ? TREE_TYPE (gimple_call_lhs (stmt))
> +			      : TREE_TYPE (gimple_call_arg (stmt, 3)),
> +			      gimple_call_arg (stmt, 0),
> +			      gimple_call_arg (stmt, 1));
> +	    references->safe_push (ref);

This may not be a canonical MEM_REF AFAIK, so you should
use fold_build2 here (if the address is &a.b the .b needs folding
into the offset).  I assume the 2nd arg is always constant and
thus doesn't change pointer-type during propagations?
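
For illustration, a minimal sketch of the fold_build2 variant suggested above
(same arguments as in the hunk above; a sketch only, not the committed
change):

	    ref.ref = fold_build2 (MEM_REF,
				   ref.is_read
				   ? TREE_TYPE (gimple_call_lhs (stmt))
				   : TREE_TYPE (gimple_call_arg (stmt, 3)),
				   gimple_call_arg (stmt, 0),
				   gimple_call_arg (stmt, 1));

fold_build2 lets fold combine an &a.b address with the constant offset
operand, which plain build2 never does.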

> +	    return false;
> +	  default:
> +	    break;
> +	  }
> +
> +      op0 = gimple_call_lhs (stmt);
>        n = gimple_call_num_args (stmt);
>        for (i = 0; i < n; i++)
>  	{
> -	  op1 = gimple_call_arg_ptr (stmt, i);
> +	  op1 = gimple_call_arg (stmt, i);
>  
> -	  if (DECL_P (*op1)
> -	      || (REFERENCE_CLASS_P (*op1) && get_base_address (*op1)))
> +	  if (DECL_P (op1)
> +	      || (REFERENCE_CLASS_P (op1) && get_base_address (op1)))
>  	    {
> -	      ref.pos = op1;
> +	      ref.ref = op1;
>  	      ref.is_read = true;
>  	      references->safe_push (ref);
>  	    }
> @@ -4404,11 +4433,11 @@ get_references_in_stmt (gimple stmt, vec
>    else
>      return clobbers_memory;
>  
> -  if (*op0
> -      && (DECL_P (*op0)
> -	  || (REFERENCE_CLASS_P (*op0) && get_base_address (*op0))))
> +  if (op0
> +      && (DECL_P (op0)
> +	  || (REFERENCE_CLASS_P (op0) && get_base_address (op0))))
>      {
> -      ref.pos = op0;
> +      ref.ref = op0;
>        ref.is_read = false;
>        references->safe_push (ref);
>      }
> @@ -4435,7 +4464,7 @@ find_data_references_in_stmt (struct loo
>    FOR_EACH_VEC_ELT (references, i, ref)
>      {
>        dr = create_data_ref (nest, loop_containing_stmt (stmt),
> -			    *ref->pos, stmt, ref->is_read);
> +			    ref->ref, stmt, ref->is_read);
>        gcc_assert (dr != NULL);
>        datarefs->safe_push (dr);
>      }
> @@ -4464,7 +4493,7 @@ graphite_find_data_references_in_stmt (l
>  
>    FOR_EACH_VEC_ELT (references, i, ref)
>      {
> -      dr = create_data_ref (nest, loop, *ref->pos, stmt, ref->is_read);
> +      dr = create_data_ref (nest, loop, ref->ref, stmt, ref->is_read);
>        gcc_assert (dr != NULL);
>        datarefs->safe_push (dr);
>      }

Interesting that you succeeded in removing the indirection
on ref.pos ... I remember trying that twice at least and
failing ;)

You can install that as cleanup now if you split it out (so hopefully
no users creep back that make removing it impossible).

> --- gcc/internal-fn.def.jj	2013-11-26 21:36:14.018329932 +0100
> +++ gcc/internal-fn.def	2013-11-28 14:13:57.517569949 +0100
> @@ -43,5 +43,8 @@ DEF_INTERNAL_FN (STORE_LANES, ECF_CONST
>  DEF_INTERNAL_FN (GOMP_SIMD_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW)
>  DEF_INTERNAL_FN (GOMP_SIMD_VF, ECF_CONST | ECF_LEAF | ECF_NOTHROW)
>  DEF_INTERNAL_FN (GOMP_SIMD_LAST_LANE, ECF_CONST | ECF_LEAF | ECF_NOTHROW)
> +DEF_INTERNAL_FN (LOOP_VECTORIZED, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW)
> +DEF_INTERNAL_FN (MASK_LOAD, ECF_PURE | ECF_LEAF)
> +DEF_INTERNAL_FN (MASK_STORE, ECF_LEAF)
>  DEF_INTERNAL_FN (ANNOTATE,  ECF_CONST | ECF_LEAF | ECF_NOTHROW)
>  DEF_INTERNAL_FN (UBSAN_NULL, ECF_LEAF | ECF_NOTHROW)
> --- gcc/tree-if-conv.c.jj	2013-11-22 21:03:14.527852266 +0100
> +++ gcc/tree-if-conv.c	2013-11-28 14:13:57.668572084 +0100
> @@ -110,8 +110,12 @@ along with GCC; see the file COPYING3.
>  #include "tree-chrec.h"
>  #include "tree-data-ref.h"
>  #include "tree-scalar-evolution.h"
> +#include "tree-ssa-address.h"
>  #include "tree-pass.h"
>  #include "dbgcnt.h"
> +#include "target.h"
> +#include "expr.h"
> +#include "optabs.h"

Bah.

>  /* List of basic blocks in if-conversion-suitable order.  */
>  static basic_block *ifc_bbs;
> @@ -194,39 +198,48 @@ init_bb_predicate (basic_block bb)
>    set_bb_predicate (bb, boolean_true_node);
>  }
>  
> -/* Free the predicate of basic block BB.  */
> +/* Release the SSA_NAMEs associated with the predicate of basic block BB,
> +   but don't actually free it.  */
>  
>  static inline void
> -free_bb_predicate (basic_block bb)
> +release_bb_predicate (basic_block bb)
>  {
> -  gimple_seq stmts;
> -
> -  if (!bb_has_predicate (bb))
> -    return;
> -
> -  /* Release the SSA_NAMEs created for the gimplification of the
> -     predicate.  */
> -  stmts = bb_predicate_gimplified_stmts (bb);
> +  gimple_seq stmts = bb_predicate_gimplified_stmts (bb);
>    if (stmts)
>      {
>        gimple_stmt_iterator i;
>  
>        for (i = gsi_start (stmts); !gsi_end_p (i); gsi_next (&i))
>  	free_stmt_operands (gsi_stmt (i));
> +      set_bb_predicate_gimplified_stmts (bb, NULL);
>      }
> +}
>  
> +/* Free the predicate of basic block BB.  */
> +
> +static inline void
> +free_bb_predicate (basic_block bb)
> +{
> +  if (!bb_has_predicate (bb))
> +    return;
> +
> +  release_bb_predicate (bb);
>    free (bb->aux);
>    bb->aux = NULL;
>  }
>  
> -/* Free the predicate of BB and reinitialize it with the true
> -   predicate.  */
> +/* Reinitialize predicate of BB with the true predicate.  */
>  
>  static inline void
>  reset_bb_predicate (basic_block bb)
>  {
> -  free_bb_predicate (bb);
> -  init_bb_predicate (bb);
> +  if (!bb_has_predicate (bb))
> +    init_bb_predicate (bb);
> +  else
> +    {
> +      release_bb_predicate (bb);
> +      set_bb_predicate (bb, boolean_true_node);
> +    }
>  }
>  
>  /* Returns a new SSA_NAME of type TYPE that is assigned the value of
> @@ -464,7 +477,8 @@ bb_with_exit_edge_p (struct loop *loop,
>     - there is a virtual PHI in a BB other than the loop->header.  */
>  
>  static bool
> -if_convertible_phi_p (struct loop *loop, basic_block bb, gimple phi)
> +if_convertible_phi_p (struct loop *loop, basic_block bb, gimple phi,
> +		      bool any_mask_load_store)
>  {
>    if (dump_file && (dump_flags & TDF_DETAILS))
>      {
> @@ -479,7 +493,7 @@ if_convertible_phi_p (struct loop *loop,
>        return false;
>      }
>  
> -  if (flag_tree_loop_if_convert_stores)
> +  if (flag_tree_loop_if_convert_stores || any_mask_load_store)
>      return true;
>  
>    /* When the flag_tree_loop_if_convert_stores is not set, check
> @@ -695,6 +709,78 @@ ifcvt_could_trap_p (gimple stmt, vec<dat
>    return gimple_could_trap_p (stmt);
>  }
>  
> +/* Return true if STMT could be converted into a masked load or store
> +   (conditional load or store based on a mask computed from bb predicate).  */
> +
> +static bool
> +ifcvt_can_use_mask_load_store (gimple stmt)
> +{
> +  tree lhs, ref;
> +  enum machine_mode mode, vmode;
> +  optab op;
> +  basic_block bb = gimple_bb (stmt);
> +  unsigned int vector_sizes;
> +
> +  if (!(flag_tree_loop_vectorize || bb->loop_father->force_vect)
> +      || bb->loop_father->dont_vectorize
> +      || !gimple_assign_single_p (stmt)
> +      || gimple_has_volatile_ops (stmt))
> +    return false;
> +
> +  /* Check whether this is a load or store.  */
> +  lhs = gimple_assign_lhs (stmt);
> +  if (TREE_CODE (lhs) != SSA_NAME)
> +    {

gimple_store_p ()?

> +      if (!is_gimple_val (gimple_assign_rhs1 (stmt)))
> +	return false;
> +      op = maskstore_optab;
> +      ref = lhs;
> +    }
> +  else if (gimple_assign_load_p (stmt))
> +    {
> +      op = maskload_optab;
> +      ref = gimple_assign_rhs1 (stmt);
> +    }
> +  else
> +    return false;
> +
> +  /* And whether REF isn't a MEM_REF with non-addressable decl.  */
> +  if (TREE_CODE (ref) == MEM_REF
> +      && TREE_CODE (TREE_OPERAND (ref, 0)) == ADDR_EXPR
> +      && DECL_P (TREE_OPERAND (TREE_OPERAND (ref, 0), 0))
> +      && !TREE_ADDRESSABLE (TREE_OPERAND (TREE_OPERAND (ref, 0), 0)))
> +    return false;

I think that's both overly conservative and not conservative enough.  Just
use may_be_nonaddressable_p () (even though the implementation can
need some TLC) and make sure to set TREE_ADDRESSABLE when you
end up taking its address.
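
A rough sketch of that direction (assuming may_be_nonaddressable_p is
usable from this file; illustration only):

      /* Reject references whose address may not be takeable at all.  */
      if (may_be_nonaddressable_p (ref))
	return false;

and, at the point where the address is actually built (predicate_mem_writes),
mark the base addressable before taking it:

      tree base = get_base_address (ref);
      if (base && DECL_P (base))
	TREE_ADDRESSABLE (base) = 1;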

> +  /* Mask should be integer mode of the same size as the load/store
> +     mode.  */
> +  mode = TYPE_MODE (TREE_TYPE (lhs));
> +  if (int_mode_for_mode (mode) == BLKmode)
> +    return false;
> +
> +  /* See if there is any chance the mask load or store might be
> +     vectorized.  If not, punt.  */
> +  vmode = targetm.vectorize.preferred_simd_mode (mode);
> +  if (!VECTOR_MODE_P (vmode))
> +    return false;
> +
> +  if (optab_handler (op, vmode) != CODE_FOR_nothing)
> +    return true;
> +
> +  vector_sizes = targetm.vectorize.autovectorize_vector_sizes ();
> +  while (vector_sizes != 0)
> +    {
> +      unsigned int cur = 1 << floor_log2 (vector_sizes);
> +      vector_sizes &= ~cur;
> +      if (cur <= GET_MODE_SIZE (mode))
> +	continue;
> +      vmode = mode_for_vector (mode, cur / GET_MODE_SIZE (mode));
> +      if (VECTOR_MODE_P (vmode)
> +	  && optab_handler (op, vmode) != CODE_FOR_nothing)
> +	return true;
> +    }
> +  return false;

Please factor out the target bits into a predicate in optabs.c
so you can reduce the amount of includes here.  You can eventually
re-use that from the vectorization parts.
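
One possible shape for such an optabs.c predicate, lifted almost verbatim
from the loop above (the function name is only a placeholder, not an
existing interface):

/* Return true if a masked load (IS_LOAD) or masked store of MODE could
   plausibly be vectorized on the target, i.e. some supported vector mode
   with MODE as element mode has a maskload/maskstore optab handler.  */

bool
can_vectorize_mask_load_store_p (enum machine_mode mode, bool is_load)
{
  optab op = is_load ? maskload_optab : maskstore_optab;
  enum machine_mode vmode;
  unsigned int vector_sizes;

  /* The mask is an integer mode of the same size as the access.  */
  if (int_mode_for_mode (mode) == BLKmode)
    return false;

  /* First try the preferred SIMD mode.  */
  vmode = targetm.vectorize.preferred_simd_mode (mode);
  if (VECTOR_MODE_P (vmode)
      && optab_handler (op, vmode) != CODE_FOR_nothing)
    return true;

  /* Then any other vector size the target supports.  */
  vector_sizes = targetm.vectorize.autovectorize_vector_sizes ();
  while (vector_sizes != 0)
    {
      unsigned int cur = 1 << floor_log2 (vector_sizes);
      vector_sizes &= ~cur;
      if (cur <= GET_MODE_SIZE (mode))
	continue;
      vmode = mode_for_vector (mode, cur / GET_MODE_SIZE (mode));
      if (VECTOR_MODE_P (vmode)
	  && optab_handler (op, vmode) != CODE_FOR_nothing)
	return true;
    }
  return false;
}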

> +}
> +
>  /* Return true when STMT is if-convertible.
>  
>     GIMPLE_ASSIGN statement is not if-convertible if,
> @@ -704,7 +790,8 @@ ifcvt_could_trap_p (gimple stmt, vec<dat
>  
>  static bool
>  if_convertible_gimple_assign_stmt_p (gimple stmt,
> -				     vec<data_reference_p> refs)
> +				     vec<data_reference_p> refs,
> +				     bool *any_mask_load_store)
>  {
>    tree lhs = gimple_assign_lhs (stmt);
>    basic_block bb;
> @@ -730,10 +817,21 @@ if_convertible_gimple_assign_stmt_p (gim
>        return false;
>      }
>  
> +  /* tree-into-ssa.c uses GF_PLF_1, so avoid it, because
> +     in between if_convertible_loop_p and combine_blocks
> +     we can perform loop versioning.  */
> +  gimple_set_plf (stmt, GF_PLF_2, false);
> +
>    if (flag_tree_loop_if_convert_stores)
>      {
>        if (ifcvt_could_trap_p (stmt, refs))
>  	{
> +	  if (ifcvt_can_use_mask_load_store (stmt))
> +	    {
> +	      gimple_set_plf (stmt, GF_PLF_2, true);
> +	      *any_mask_load_store = true;
> +	      return true;
> +	    }
>  	  if (dump_file && (dump_flags & TDF_DETAILS))
>  	    fprintf (dump_file, "tree could trap...\n");
>  	  return false;
> @@ -743,6 +841,12 @@ if_convertible_gimple_assign_stmt_p (gim
>  
>    if (gimple_assign_rhs_could_trap_p (stmt))
>      {
> +      if (ifcvt_can_use_mask_load_store (stmt))
> +	{
> +	  gimple_set_plf (stmt, GF_PLF_2, true);
> +	  *any_mask_load_store = true;
> +	  return true;
> +	}
>        if (dump_file && (dump_flags & TDF_DETAILS))
>  	fprintf (dump_file, "tree could trap...\n");
>        return false;
> @@ -754,6 +858,12 @@ if_convertible_gimple_assign_stmt_p (gim
>        && bb != bb->loop_father->header
>        && !bb_with_exit_edge_p (bb->loop_father, bb))
>      {
> +      if (ifcvt_can_use_mask_load_store (stmt))
> +	{
> +	  gimple_set_plf (stmt, GF_PLF_2, true);
> +	  *any_mask_load_store = true;
> +	  return true;
> +	}
>        if (dump_file && (dump_flags & TDF_DETAILS))
>  	{
>  	  fprintf (dump_file, "LHS is not var\n");
> @@ -772,7 +882,8 @@ if_convertible_gimple_assign_stmt_p (gim
>     - it is a GIMPLE_LABEL or a GIMPLE_COND.  */
>  
>  static bool
> -if_convertible_stmt_p (gimple stmt, vec<data_reference_p> refs)
> +if_convertible_stmt_p (gimple stmt, vec<data_reference_p> refs,
> +		       bool *any_mask_load_store)
>  {
>    switch (gimple_code (stmt))
>      {
> @@ -782,7 +893,8 @@ if_convertible_stmt_p (gimple stmt, vec<
>        return true;
>  
>      case GIMPLE_ASSIGN:
> -      return if_convertible_gimple_assign_stmt_p (stmt, refs);
> +      return if_convertible_gimple_assign_stmt_p (stmt, refs,
> +						  any_mask_load_store);
>  
>      case GIMPLE_CALL:
>        {
> @@ -984,7 +1096,7 @@ get_loop_body_in_if_conv_order (const st
>     S1 will be predicated with "x", and
>     S2 will be predicated with "!x".  */
>  
> -static bool
> +static void
>  predicate_bbs (loop_p loop)
>  {
>    unsigned int i;
> @@ -996,7 +1108,7 @@ predicate_bbs (loop_p loop)
>      {
>        basic_block bb = ifc_bbs[i];
>        tree cond;
> -      gimple_stmt_iterator itr;
> +      gimple stmt;
>  
>        /* The loop latch is always executed and has no extra conditions
>  	 to be processed: skip it.  */
> @@ -1006,53 +1118,38 @@ predicate_bbs (loop_p loop)
>  	  continue;
>  	}
>  
> +      /* If dominance tells us this basic block is always executed, force
> +	 the condition to be true, this might help simplify other
> +	 conditions.  */
> +      if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> +	reset_bb_predicate (bb);
>        cond = bb_predicate (bb);
> -
> -      for (itr = gsi_start_bb (bb); !gsi_end_p (itr); gsi_next (&itr))
> +      stmt = last_stmt (bb);
> +      if (stmt && gimple_code (stmt) == GIMPLE_COND)
>  	{
> -	  gimple stmt = gsi_stmt (itr);
> -
> -	  switch (gimple_code (stmt))
> -	    {
> -	    case GIMPLE_LABEL:
> -	    case GIMPLE_ASSIGN:
> -	    case GIMPLE_CALL:
> -	    case GIMPLE_DEBUG:
> -	      break;
> -
> -	    case GIMPLE_COND:
> -	      {
> -		tree c2;
> -		edge true_edge, false_edge;
> -		location_t loc = gimple_location (stmt);
> -		tree c = fold_build2_loc (loc, gimple_cond_code (stmt),
> -					  boolean_type_node,
> -					  gimple_cond_lhs (stmt),
> -					  gimple_cond_rhs (stmt));
> -
> -		/* Add new condition into destination's predicate list.  */
> -		extract_true_false_edges_from_block (gimple_bb (stmt),
> -						     &true_edge, &false_edge);
> -
> -		/* If C is true, then TRUE_EDGE is taken.  */
> -		add_to_dst_predicate_list (loop, true_edge,
> -					   unshare_expr (cond),
> -					   unshare_expr (c));
> -
> -		/* If C is false, then FALSE_EDGE is taken.  */
> -		c2 = build1_loc (loc, TRUTH_NOT_EXPR,
> -				 boolean_type_node, unshare_expr (c));
> -		add_to_dst_predicate_list (loop, false_edge,
> -					   unshare_expr (cond), c2);
> -
> -		cond = NULL_TREE;
> -		break;
> -	      }
> +	  tree c2;
> +	  edge true_edge, false_edge;
> +	  location_t loc = gimple_location (stmt);
> +	  tree c = fold_build2_loc (loc, gimple_cond_code (stmt),
> +				    boolean_type_node,
> +				    gimple_cond_lhs (stmt),
> +				    gimple_cond_rhs (stmt));
> +
> +	  /* Add new condition into destination's predicate list.  */
> +	  extract_true_false_edges_from_block (gimple_bb (stmt),
> +					       &true_edge, &false_edge);
> +
> +	  /* If C is true, then TRUE_EDGE is taken.  */
> +	  add_to_dst_predicate_list (loop, true_edge, unshare_expr (cond),
> +				     unshare_expr (c));
> +
> +	  /* If C is false, then FALSE_EDGE is taken.  */
> +	  c2 = build1_loc (loc, TRUTH_NOT_EXPR, boolean_type_node,
> +			   unshare_expr (c));
> +	  add_to_dst_predicate_list (loop, false_edge,
> +				     unshare_expr (cond), c2);
>  
> -	    default:
> -	      /* Not handled yet in if-conversion.  */
> -	      return false;
> -	    }
> +	  cond = NULL_TREE;
>  	}
>  
>        /* If current bb has only one successor, then consider it as an
> @@ -1075,8 +1172,6 @@ predicate_bbs (loop_p loop)
>    reset_bb_predicate (loop->header);
>    gcc_assert (bb_predicate_gimplified_stmts (loop->header) == NULL
>  	      && bb_predicate_gimplified_stmts (loop->latch) == NULL);
> -
> -  return true;
>  }

Looks like cleanup applicable anyway (ok to install).

>  /* Return true when LOOP is if-convertible.  This is a helper function
> @@ -1087,7 +1182,7 @@ static bool
>  if_convertible_loop_p_1 (struct loop *loop,
>  			 vec<loop_p> *loop_nest,
>  			 vec<data_reference_p> *refs,
> -			 vec<ddr_p> *ddrs)
> +			 vec<ddr_p> *ddrs, bool *any_mask_load_store)
>  {
>    bool res;
>    unsigned int i;
> @@ -1121,9 +1216,24 @@ if_convertible_loop_p_1 (struct loop *lo
>  	exit_bb = bb;
>      }
>  
> -  res = predicate_bbs (loop);
> -  if (!res)
> -    return false;
> +  for (i = 0; i < loop->num_nodes; i++)
> +    {
> +      basic_block bb = ifc_bbs[i];
> +      gimple_stmt_iterator gsi;
> +
> +      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +	switch (gimple_code (gsi_stmt (gsi)))
> +	  {
> +	  case GIMPLE_LABEL:
> +	  case GIMPLE_ASSIGN:
> +	  case GIMPLE_CALL:
> +	  case GIMPLE_DEBUG:
> +	  case GIMPLE_COND:
> +	    break;
> +	  default:
> +	    return false;
> +	  }
> +    }
>  
>    if (flag_tree_loop_if_convert_stores)
>      {

Together with this, of course.

> @@ -1135,6 +1245,7 @@ if_convertible_loop_p_1 (struct loop *lo
>  	  DR_WRITTEN_AT_LEAST_ONCE (dr) = -1;
>  	  DR_RW_UNCONDITIONALLY (dr) = -1;
>  	}
> +      predicate_bbs (loop);
>      }
>  
>    for (i = 0; i < loop->num_nodes; i++)
> @@ -1142,17 +1253,31 @@ if_convertible_loop_p_1 (struct loop *lo
>        basic_block bb = ifc_bbs[i];
>        gimple_stmt_iterator itr;
>  
> -      for (itr = gsi_start_phis (bb); !gsi_end_p (itr); gsi_next (&itr))
> -	if (!if_convertible_phi_p (loop, bb, gsi_stmt (itr)))
> -	  return false;
> -
>        /* Check the if-convertibility of statements in predicated BBs.  */
> -      if (is_predicated (bb))
> +      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>  	for (itr = gsi_start_bb (bb); !gsi_end_p (itr); gsi_next (&itr))
> -	  if (!if_convertible_stmt_p (gsi_stmt (itr), *refs))
> +	  if (!if_convertible_stmt_p (gsi_stmt (itr), *refs,
> +				      any_mask_load_store))
>  	    return false;
>      }
>  
> +  if (flag_tree_loop_if_convert_stores)
> +    for (i = 0; i < loop->num_nodes; i++)
> +      free_bb_predicate (ifc_bbs[i]);
> +
> +  /* Checking PHIs needs to be done after stmts, as the fact whether there
> +     are any masked loads or stores affects the tests.  */
> +  for (i = 0; i < loop->num_nodes; i++)
> +    {
> +      basic_block bb = ifc_bbs[i];
> +      gimple_stmt_iterator itr;
> +
> +      for (itr = gsi_start_phis (bb); !gsi_end_p (itr); gsi_next (&itr))
> +	if (!if_convertible_phi_p (loop, bb, gsi_stmt (itr),
> +				   *any_mask_load_store))
> +	  return false;
> +    }
> +
>    if (dump_file)
>      fprintf (dump_file, "Applying if-conversion\n");
>  
> @@ -1168,7 +1293,7 @@ if_convertible_loop_p_1 (struct loop *lo
>     - if its basic blocks and phi nodes are if convertible.  */
>  
>  static bool
> -if_convertible_loop_p (struct loop *loop)
> +if_convertible_loop_p (struct loop *loop, bool *any_mask_load_store)
>  {
>    edge e;
>    edge_iterator ei;
> @@ -1209,7 +1334,8 @@ if_convertible_loop_p (struct loop *loop
>    refs.create (5);
>    ddrs.create (25);
>    stack_vec<loop_p, 3> loop_nest;
> -  res = if_convertible_loop_p_1 (loop, &loop_nest, &refs, &ddrs);
> +  res = if_convertible_loop_p_1 (loop, &loop_nest, &refs, &ddrs,
> +				 any_mask_load_store);
>  
>    if (flag_tree_loop_if_convert_stores)
>      {
> @@ -1395,7 +1521,7 @@ predicate_all_scalar_phis (struct loop *
>     gimplification of the predicates.  */
>  
>  static void
> -insert_gimplified_predicates (loop_p loop)
> +insert_gimplified_predicates (loop_p loop, bool any_mask_load_store)
>  {
>    unsigned int i;
>  
> @@ -1404,7 +1530,8 @@ insert_gimplified_predicates (loop_p loo
>        basic_block bb = ifc_bbs[i];
>        gimple_seq stmts;
>  
> -      if (!is_predicated (bb))
> +      if (!is_predicated (bb)
> +	  || dominated_by_p (CDI_DOMINATORS, loop->latch, bb))

isn't that redundant now?

>  	{
>  	  /* Do not insert statements for a basic block that is not
>  	     predicated.  Also make sure that the predicate of the
> @@ -1416,7 +1543,8 @@ insert_gimplified_predicates (loop_p loo
>        stmts = bb_predicate_gimplified_stmts (bb);
>        if (stmts)
>  	{
> -	  if (flag_tree_loop_if_convert_stores)
> +	  if (flag_tree_loop_if_convert_stores
> +	      || any_mask_load_store)
>  	    {
>  	      /* Insert the predicate of the BB just after the label,
>  		 as the if-conversion of memory writes will use this
> @@ -1575,9 +1703,49 @@ predicate_mem_writes (loop_p loop)
>  	}
>  
>        for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> -	if ((stmt = gsi_stmt (gsi))
> -	    && gimple_assign_single_p (stmt)
> -	    && gimple_vdef (stmt))
> +	if ((stmt = gsi_stmt (gsi)) == NULL

I don't think gsi_stmt can be NULL

> +	    || !gimple_assign_single_p (stmt))
> +	  continue;
> +	else if (gimple_plf (stmt, GF_PLF_2))
> +	  {
> +	    tree lhs = gimple_assign_lhs (stmt);
> +	    tree rhs = gimple_assign_rhs1 (stmt);
> +	    tree ref, addr, ptr, masktype, mask_op0, mask_op1, mask;
> +	    gimple new_stmt;
> +	    int bitsize = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (lhs)));
> +
> +	    masktype = build_nonstandard_integer_type (bitsize, 1);
> +	    mask_op0 = build_int_cst (masktype, swap ? 0 : -1);
> +	    mask_op1 = build_int_cst (masktype, swap ? -1 : 0);
> +	    ref = TREE_CODE (lhs) == SSA_NAME ? rhs : lhs;
> +	    addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (ref),
> +					     true, NULL_TREE, true,
> +					     GSI_SAME_STMT);
> +	    cond = force_gimple_operand_gsi_1 (&gsi, unshare_expr (cond),
> +					       is_gimple_condexpr, NULL_TREE,
> +					       true, GSI_SAME_STMT);
> +	    mask = fold_build_cond_expr (masktype, unshare_expr (cond),
> +					 mask_op0, mask_op1);
> +	    mask = ifc_temp_var (masktype, mask, &gsi);
> +	    ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
> +	    /* Copy points-to info if possible.  */
> +	    if (TREE_CODE (addr) == SSA_NAME && !SSA_NAME_PTR_INFO (addr))
> +	      copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr),
> +			     ref);

Eh - can you split out a copy_ref_info_to_addr so you can avoid
creating the MEM_REF?
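
A sketch of such a helper (the name comes from the comment above; it is not
an existing GCC function, and unlike copy_ref_info it ignores the decl-base
and alignment handling):

/* Copy points-to information to ADDR, the address of reference REF,
   without building a temporary MEM_REF just to call copy_ref_info.  */

static void
copy_ref_info_to_addr (tree addr, tree ref)
{
  if (TREE_CODE (addr) != SSA_NAME || SSA_NAME_PTR_INFO (addr))
    return;
  tree base = get_base_address (ref);
  if (base
      && (TREE_CODE (base) == MEM_REF || TREE_CODE (base) == TARGET_MEM_REF)
      && TREE_CODE (TREE_OPERAND (base, 0)) == SSA_NAME
      && SSA_NAME_PTR_INFO (TREE_OPERAND (base, 0)))
    duplicate_ssa_name_ptr_info (addr,
				 SSA_NAME_PTR_INFO (TREE_OPERAND (base, 0)));
}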

> +	    if (TREE_CODE (lhs) == SSA_NAME)
> +	      {
> +		new_stmt
> +		  = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr,
> +						ptr, mask);
> +		gimple_call_set_lhs (new_stmt, lhs);
> +	      }
> +	    else
> +	      new_stmt
> +		= gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
> +					      mask, rhs);
> +	    gsi_replace (&gsi, new_stmt, false);
> +	  }
> +	else if (gimple_vdef (stmt))
>  	  {
>  	    tree lhs = gimple_assign_lhs (stmt);
>  	    tree rhs = gimple_assign_rhs1 (stmt);
> @@ -1647,7 +1815,7 @@ remove_conditions_and_labels (loop_p loo
>     blocks.  Replace PHI nodes with conditional modify expressions.  */
>  
>  static void
> -combine_blocks (struct loop *loop)
> +combine_blocks (struct loop *loop, bool any_mask_load_store)
>  {
>    basic_block bb, exit_bb, merge_target_bb;
>    unsigned int orig_loop_num_nodes = loop->num_nodes;
> @@ -1655,11 +1823,12 @@ combine_blocks (struct loop *loop)
>    edge e;
>    edge_iterator ei;
>  
> +  predicate_bbs (loop);
>    remove_conditions_and_labels (loop);
> -  insert_gimplified_predicates (loop);
> +  insert_gimplified_predicates (loop, any_mask_load_store);
>    predicate_all_scalar_phis (loop);
>  
> -  if (flag_tree_loop_if_convert_stores)
> +  if (flag_tree_loop_if_convert_stores || any_mask_load_store)
>      predicate_mem_writes (loop);
>  
>    /* Merge basic blocks: first remove all the edges in the loop,
> @@ -1749,28 +1918,146 @@ combine_blocks (struct loop *loop)
>    ifc_bbs = NULL;
>  }
>  
> -/* If-convert LOOP when it is legal.  For the moment this pass has no
> -   profitability analysis.  Returns true when something changed.  */
> +/* Version LOOP before if-converting it, the original loop
> +   will be then if-converted, the new copy of the loop will not,
> +   and the LOOP_VECTORIZED internal call will be guarding which
> +   loop to execute.  The vectorizer pass will fold this
> +   internal call into either true or false.  */
>  
>  static bool
> +version_loop_for_if_conversion (struct loop *loop, bool *do_outer)
> +{

What's the do_outer parameter?

> +  struct loop *outer = loop_outer (loop);
> +  basic_block cond_bb;
> +  tree cond = make_ssa_name (boolean_type_node, NULL);
> +  struct loop *new_loop;
> +  gimple g;
> +  gimple_stmt_iterator gsi;
> +
> +  if (do_outer)
> +    {
> +      *do_outer = false;
> +      if (loop->inner == NULL
> +	  && outer->inner == loop
> +	  && loop->next == NULL
> +	  && loop_outer (outer)
> +	  && outer->num_nodes == 3 + loop->num_nodes
> +	  && loop_preheader_edge (loop)->src == outer->header
> +	  && single_exit (loop)
> +	  && outer->latch
> +	  && single_exit (loop)->dest == EDGE_PRED (outer->latch, 0)->src)
> +	*do_outer = true;
> +    }

Please add a comment before this.  Seems you match what outer loop
vectorization handles?  Thus, best factor out a predicate in
tree-vect-loop.c that you can use in both places?
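
A sketch of such a shared predicate (name and final location are
placeholders; the condition is copied from the hunk above):

/* Return true if LOOP, with its immediate parent OUTER, has the loop nest
   shape that outer loop vectorization currently handles.  */

static bool
loop_nest_has_outer_vect_shape_p (struct loop *loop, struct loop *outer)
{
  return (loop->inner == NULL
	  && outer->inner == loop
	  && loop->next == NULL
	  && loop_outer (outer)
	  && outer->num_nodes == 3 + loop->num_nodes
	  && loop_preheader_edge (loop)->src == outer->header
	  && single_exit (loop)
	  && outer->latch
	  && single_exit (loop)->dest == EDGE_PRED (outer->latch, 0)->src);
}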

> +  g = gimple_build_call_internal (IFN_LOOP_VECTORIZED, 2,
> +				  build_int_cst (integer_type_node, loop->num),
> +				  integer_zero_node);
> +  gimple_call_set_lhs (g, cond);
> +
> +  initialize_original_copy_tables ();
> +  new_loop = loop_version (loop, cond, &cond_bb,
> +			   REG_BR_PROB_BASE, REG_BR_PROB_BASE,
> +			   REG_BR_PROB_BASE, true);
> +  free_original_copy_tables ();
> +  if (new_loop == NULL)
> +    return false;
> +  new_loop->dont_vectorize = true;
> +  new_loop->force_vect = false;
> +  gsi = gsi_last_bb (cond_bb);
> +  gimple_call_set_arg (g, 1, build_int_cst (integer_type_node, new_loop->num));
> +  gsi_insert_before (&gsi, g, GSI_SAME_STMT);
> +  update_ssa (TODO_update_ssa);
> +  if (do_outer == NULL)
> +    {
> +      gcc_assert (single_succ_p (loop->header));
> +      gsi = gsi_last_bb (single_succ (loop->header));
> +      gimple cond_stmt = gsi_stmt (gsi);
> +      gsi_prev (&gsi);
> +      g = gsi_stmt (gsi);
> +      gcc_assert (gimple_code (cond_stmt) == GIMPLE_COND
> +		  && is_gimple_call (g)
> +		  && gimple_call_internal_p (g)
> +		  && gimple_call_internal_fn (g) == IFN_LOOP_VECTORIZED
> +		  && gimple_cond_lhs (cond_stmt) == gimple_call_lhs (g));
> +      gimple_cond_set_lhs (cond_stmt, boolean_true_node);
> +      update_stmt (cond_stmt);
> +      gcc_assert (has_zero_uses (gimple_call_lhs (g)));
> +      gsi_remove (&gsi, false);
> +      gcc_assert (single_succ_p (new_loop->header));
> +      gsi = gsi_last_bb (single_succ (new_loop->header));
> +      cond_stmt = gsi_stmt (gsi);
> +      gsi_prev (&gsi);
> +      g = gsi_stmt (gsi);
> +      gcc_assert (gimple_code (cond_stmt) == GIMPLE_COND
> +		  && is_gimple_call (g)
> +		  && gimple_call_internal_p (g)
> +		  && gimple_call_internal_fn (g) == IFN_LOOP_VECTORIZED
> +		  && gimple_cond_lhs (cond_stmt) == gimple_call_lhs (g)
> +		  && new_loop->inner
> +		  && new_loop->inner->next
> +		  && new_loop->inner->next->next == NULL);
> +      struct loop *inner = new_loop->inner;
> +      basic_block empty_bb = loop_preheader_edge (inner)->src;
> +      gcc_assert (empty_block_p (empty_bb)
> +		  && single_pred_p (empty_bb)
> +		  && single_succ_p (empty_bb)
> +		  && single_pred (empty_bb) == single_succ (new_loop->header));
> +      if (single_pred_edge (empty_bb)->flags & EDGE_TRUE_VALUE)
> +	{
> +	  gimple_call_set_arg (g, 0, build_int_cst (integer_type_node,
> +						    inner->num));
> +	  gimple_call_set_arg (g, 0, build_int_cst (integer_type_node,
> +						    inner->next->num));
> +	  inner->next->dont_vectorize = true;
> +	}
> +      else
> +	{
> +	  gimple_call_set_arg (g, 0, build_int_cst (integer_type_node,
> +						    inner->next->num));
> +	  gimple_call_set_arg (g, 0, build_int_cst (integer_type_node,
> +						    inner->num));
> +	  inner->dont_vectorize = true;
> +	}
> +    }

This needs a comment explaining what code you create.
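
One hedged sketch of such a comment, describing the CFG the function builds
(wording is only a suggestion, based on the code above):

  /* We have just created

       cond = LOOP_VECTORIZED (orig_loop_num, copy_loop_num);
       if (cond)
	 <LOOP - to be if-converted and offered to the vectorizer>
       else
	 <copy of LOOP - kept scalar, dont_vectorize set>

     and the vectorizer later folds the LOOP_VECTORIZED call to true or
     false depending on whether it managed to vectorize LOOP.  When
     do_outer == NULL, LOOP is the outer loop of a previously versioned
     inner loop: the inner LOOP_VECTORIZED guard in the copy meant for
     vectorization is forced to true, while the guard in the scalar copy
     is retargeted to the two inner loop copies, one of which is marked
     dont_vectorize.  */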

> +  return true;
> +}
> +
> +/* If-convert LOOP when it is legal.  For the moment this pass has no
> +   profitability analysis.  Returns non-zero todo flags when something
> +   changed.  */
> +
> +static unsigned int
>  tree_if_conversion (struct loop *loop)
>  {
> -  bool changed = false;
> +  unsigned int todo = 0;
> +  bool version_outer_loop = false;
>    ifc_bbs = NULL;
> +  bool any_mask_load_store = false;
>  
> -  if (!if_convertible_loop_p (loop)
> +  if (!if_convertible_loop_p (loop, &any_mask_load_store)
>        || !dbg_cnt (if_conversion_tree))
>      goto cleanup;
>  
> +  if (any_mask_load_store
> +      && ((!flag_tree_loop_vectorize && !loop->force_vect)
> +	  || loop->dont_vectorize))
> +    goto cleanup;
> +
> +  if (any_mask_load_store
> +      && !version_loop_for_if_conversion (loop, &version_outer_loop))
> +    goto cleanup;
> +
>    /* Now all statements are if-convertible.  Combine all the basic
>       blocks into one huge basic block doing the if-conversion
>       on-the-fly.  */
> -  combine_blocks (loop);
> -
> -  if (flag_tree_loop_if_convert_stores)
> -    mark_virtual_operands_for_renaming (cfun);
> +  combine_blocks (loop, any_mask_load_store);
>  
> -  changed = true;
> +  todo |= TODO_cleanup_cfg;
> +  if (flag_tree_loop_if_convert_stores || any_mask_load_store)
> +    {
> +      mark_virtual_operands_for_renaming (cfun);
> +      todo |= TODO_update_ssa_only_virtuals;
> +    }
>  
>   cleanup:
>    if (ifc_bbs)
> @@ -1784,7 +2071,16 @@ tree_if_conversion (struct loop *loop)
>        ifc_bbs = NULL;
>      }
>  
> -  return changed;
> +  if (todo && version_outer_loop)
> +    {
> +      if (todo & TODO_update_ssa_only_virtuals)
> +	{
> +	  update_ssa (TODO_update_ssa_only_virtuals);
> +	  todo &= ~TODO_update_ssa_only_virtuals;
> +	}

Btw I hate that we do update_ssa multiple times per pass per
function.  That makes us possibly worse than O(N^2) as update_ssa computes
the IDF of the whole function.

This is something your patch introduces (it's only rewriting the
virtuals, not the incremental SSA update by BB copying).

> +      version_loop_for_if_conversion (loop_outer (loop), NULL);
> +    }
> +  return todo;
>  }
>  
>  /* Tree if-conversion pass management.  */
> @@ -1793,7 +2089,6 @@ static unsigned int
>  main_tree_if_conversion (void)
>  {
>    struct loop *loop;
> -  bool changed = false;
>    unsigned todo = 0;
>  
>    if (number_of_loops (cfun) <= 1)
> @@ -1802,15 +2097,9 @@ main_tree_if_conversion (void)
>    FOR_EACH_LOOP (loop, 0)
>      if (flag_tree_loop_if_convert == 1
>  	|| flag_tree_loop_if_convert_stores == 1
> -	|| flag_tree_loop_vectorize
> -	|| loop->force_vect)
> -    changed |= tree_if_conversion (loop);
> -
> -  if (changed)
> -    todo |= TODO_cleanup_cfg;
> -
> -  if (changed && flag_tree_loop_if_convert_stores)
> -    todo |= TODO_update_ssa_only_virtuals;
> +	|| ((flag_tree_loop_vectorize || loop->force_vect)
> +	    && !loop->dont_vectorize))
> +      todo |= tree_if_conversion (loop);
>  
>  #ifdef ENABLE_CHECKING
>    {

Otherwise the if-conv changes look ok.

> --- gcc/tree-vect-data-refs.c.jj	2013-11-28 09:18:11.784774865 +0100
> +++ gcc/tree-vect-data-refs.c	2013-11-28 14:13:57.617572349 +0100
> @@ -2959,6 +2959,24 @@ vect_check_gather (gimple stmt, loop_vec
>    enum machine_mode pmode;
>    int punsignedp, pvolatilep;
>  
> +  base = DR_REF (dr);
> +  /* For masked loads/stores, DR_REF (dr) is an artificial MEM_REF,
> +     see if we can use the def stmt of the address.  */
> +  if (is_gimple_call (stmt)
> +      && gimple_call_internal_p (stmt)
> +      && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
> +	  || gimple_call_internal_fn (stmt) == IFN_MASK_STORE)
> +      && TREE_CODE (base) == MEM_REF
> +      && TREE_CODE (TREE_OPERAND (base, 0)) == SSA_NAME
> +      && integer_zerop (TREE_OPERAND (base, 1))
> +      && !expr_invariant_in_loop_p (loop, TREE_OPERAND (base, 0)))
> +    {
> +      gimple def_stmt = SSA_NAME_DEF_STMT (TREE_OPERAND (base, 0));
> +      if (is_gimple_assign (def_stmt)
> +	  && gimple_assign_rhs_code (def_stmt) == ADDR_EXPR)
> +	base = TREE_OPERAND (gimple_assign_rhs1 (def_stmt), 0);
> +    }
> +
>    /* The gather builtins need address of the form
>       loop_invariant + vector * {1, 2, 4, 8}
>       or
> @@ -2971,7 +2989,7 @@ vect_check_gather (gimple stmt, loop_vec
>       vectorized.  The following code attempts to find such a preexistng
>       SSA_NAME OFF and put the loop invariants into a tree BASE
>       that can be gimplified before the loop.  */
> -  base = get_inner_reference (DR_REF (dr), &pbitsize, &pbitpos, &off,
> +  base = get_inner_reference (base, &pbitsize, &pbitpos, &off,
>  			      &pmode, &punsignedp, &pvolatilep, false);
>    gcc_assert (base != NULL_TREE && (pbitpos % BITS_PER_UNIT) == 0);
>  
> @@ -3468,7 +3486,10 @@ again:
>        offset = unshare_expr (DR_OFFSET (dr));
>        init = unshare_expr (DR_INIT (dr));
>  
> -      if (is_gimple_call (stmt))
> +      if (is_gimple_call (stmt)
> +	  && (!gimple_call_internal_p (stmt)
> +	      || (gimple_call_internal_fn (stmt) != IFN_MASK_LOAD
> +		  && gimple_call_internal_fn (stmt) != IFN_MASK_STORE)))
>  	{
>  	  if (dump_enabled_p ())
>  	    {
> @@ -5119,6 +5140,14 @@ vect_supportable_dr_alignment (struct da
>    if (aligned_access_p (dr) && !check_aligned_accesses)
>      return dr_aligned;
>  
> +  /* For now assume all conditional loads/stores support unaligned
> +     access without any special code.  */
> +  if (is_gimple_call (stmt)
> +      && gimple_call_internal_p (stmt)
> +      && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
> +	  || gimple_call_internal_fn (stmt) == IFN_MASK_STORE))
> +    return dr_unaligned_supported;
> +
>    if (loop_vinfo)
>      {
>        vect_loop = LOOP_VINFO_LOOP (loop_vinfo);
> --- gcc/gimple.h.jj	2013-11-27 12:10:46.932896086 +0100
> +++ gcc/gimple.h	2013-11-28 14:13:57.603572422 +0100
> @@ -5670,7 +5670,13 @@ gimple_expr_type (const_gimple stmt)
>  	 useless conversion involved.  That means returning the
>  	 original RHS type as far as we can reconstruct it.  */
>        if (code == GIMPLE_CALL)
> -	type = gimple_call_return_type (stmt);
> +	{
> +	  if (gimple_call_internal_p (stmt)
> +	      && gimple_call_internal_fn (stmt) == IFN_MASK_STORE)
> +	    type = TREE_TYPE (gimple_call_arg (stmt, 3));
> +	  else
> +	    type = gimple_call_return_type (stmt);
> +	}
>        else
>  	switch (gimple_assign_rhs_code (stmt))
>  	  {

Eek :/  (and so many uses of this!  and it can be simplified)

Most callers don't want to use this function (in fact I don't
see a good reason to use it at all).

Well.  Ok for now ...

> --- gcc/internal-fn.c.jj	2013-11-26 21:36:14.218328913 +0100
> +++ gcc/internal-fn.c	2013-11-28 14:13:57.661572121 +0100
> @@ -153,6 +153,60 @@ expand_UBSAN_NULL (gimple stmt ATTRIBUTE
>    gcc_unreachable ();
>  }
>  
> +/* This should get folded in tree-vectorizer.c.  */
> +
> +static void
> +expand_LOOP_VECTORIZED (gimple stmt ATTRIBUTE_UNUSED)
> +{
> +  gcc_unreachable ();
> +}
> +
> +static void
> +expand_MASK_LOAD (gimple stmt)
> +{
> +  struct expand_operand ops[3];
> +  tree type, lhs, rhs, maskt;
> +  rtx mem, target, mask;
> +
> +  maskt = gimple_call_arg (stmt, 2);
> +  lhs = gimple_call_lhs (stmt);
> +  type = TREE_TYPE (lhs);
> +  rhs = build2 (MEM_REF, type, gimple_call_arg (stmt, 0),
> +		gimple_call_arg (stmt, 1));

That's possibly not canonical again, use fold_build2.
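
I.e. something along the lines of (untested sketch, reusing the names from
expand_MASK_LOAD above):

  /* fold_build2 folds an &a.b base into the MEM_REF offset, so the
     reference stays canonical.  */
  rhs = fold_build2 (MEM_REF, type, gimple_call_arg (stmt, 0),
		     gimple_call_arg (stmt, 1));

and the analogous fold_build2 for the MEM_REF lhs in expand_MASK_STORE below.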

> +  mem = expand_expr (rhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  gcc_assert (MEM_P (mem));
> +  mask = expand_normal (maskt);
> +  target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  create_output_operand (&ops[0], target, TYPE_MODE (type));
> +  create_fixed_operand (&ops[1], mem);
> +  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
> +  expand_insn (optab_handler (maskload_optab, TYPE_MODE (type)), 3, ops);
> +}
> +
> +static void
> +expand_MASK_STORE (gimple stmt)
> +{
> +  struct expand_operand ops[3];
> +  tree type, lhs, rhs, maskt;
> +  rtx mem, reg, mask;
> +
> +  maskt = gimple_call_arg (stmt, 2);
> +  rhs = gimple_call_arg (stmt, 3);
> +  type = TREE_TYPE (rhs);
> +  lhs = build2 (MEM_REF, type, gimple_call_arg (stmt, 0),
> +		gimple_call_arg (stmt, 1));

Likewise.

> +  mem = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  gcc_assert (MEM_P (mem));
> +  mask = expand_normal (maskt);
> +  reg = expand_normal (rhs);
> +  create_fixed_operand (&ops[0], mem);
> +  create_input_operand (&ops[1], reg, TYPE_MODE (type));
> +  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
> +  expand_insn (optab_handler (maskstore_optab, TYPE_MODE (type)), 3, ops);
> +}
> +
>  /* Routines to expand each internal function, indexed by function number.
>     Each routine has the prototype:
>  
> --- gcc/tree-vectorizer.c.jj	2013-11-22 21:03:14.525852274 +0100
> +++ gcc/tree-vectorizer.c	2013-11-28 15:10:33.364872892 +0100
> @@ -75,11 +75,13 @@ along with GCC; see the file COPYING3.
>  #include "tree-phinodes.h"
>  #include "ssa-iterators.h"
>  #include "tree-ssa-loop-manip.h"
> +#include "tree-cfg.h"
>  #include "cfgloop.h"
>  #include "tree-vectorizer.h"
>  #include "tree-pass.h"
>  #include "tree-ssa-propagate.h"
>  #include "dbgcnt.h"
> +#include "gimple-fold.h"
>  
>  /* Loop or bb location.  */
>  source_location vect_location;
> @@ -317,6 +319,68 @@ vect_destroy_datarefs (loop_vec_info loo
>  }
>  
>  
> +/* If LOOP has been versioned during ifcvt, return the internal call
> +   guarding it.  */
> +
> +static gimple
> +vect_loop_vectorized_call (struct loop *loop)
> +{
> +  basic_block bb = loop_preheader_edge (loop)->src;
> +  gimple g;
> +  do
> +    {
> +      g = last_stmt (bb);
> +      if (g)
> +	break;
> +      if (!single_pred_p (bb))
> +	break;
> +      bb = single_pred (bb);
> +    }
> +  while (1);
> +  if (g && gimple_code (g) == GIMPLE_COND)
> +    {
> +      gimple_stmt_iterator gsi = gsi_for_stmt (g);
> +      gsi_prev (&gsi);
> +      if (!gsi_end_p (gsi))
> +	{
> +	  g = gsi_stmt (gsi);
> +	  if (is_gimple_call (g)
> +	      && gimple_call_internal_p (g)
> +	      && gimple_call_internal_fn (g) == IFN_LOOP_VECTORIZED
> +	      && (tree_to_shwi (gimple_call_arg (g, 0)) == loop->num
> +		  || tree_to_shwi (gimple_call_arg (g, 1)) == loop->num))
> +	    return g;
> +	}
> +    }
> +  return NULL;
> +}
> +
> +/* Helper function of vectorize_loops.  If LOOP is non-if-converted
> +   loop that has if-converted counterpart, return the if-converted
> +   counterpart, so that we try vectorizing if-converted loops before
> +   inner loops of non-if-converted loops.  */
> +
> +static struct loop *
> +vect_loop_select (struct loop *loop)
> +{
> +  if (!loop->dont_vectorize)
> +    return loop;
> +
> +  gimple g = vect_loop_vectorized_call (loop);
> +  if (g == NULL)
> +    return loop;
> +
> +  if (tree_to_shwi (gimple_call_arg (g, 1)) != loop->num)
> +    return loop;
> +
> +  struct loop *ifcvt_loop
> +    = get_loop (cfun, tree_to_shwi (gimple_call_arg (g, 0)));
> +  if (ifcvt_loop && !ifcvt_loop->dont_vectorize)
> +    return ifcvt_loop;
> +  return loop;
> +}
> +
> +
>  /* Function vectorize_loops.
>  
>     Entry point to loop vectorization phase.  */
> @@ -327,9 +391,11 @@ vectorize_loops (void)
>    unsigned int i;
>    unsigned int num_vectorized_loops = 0;
>    unsigned int vect_loops_num;
> -  struct loop *loop;
> +  struct loop *loop, *iloop;
>    hash_table <simduid_to_vf> simduid_to_vf_htab;
>    hash_table <simd_array_to_simduid> simd_array_to_simduid_htab;
> +  bool any_ifcvt_loops = false;
> +  unsigned ret = 0;
>  
>    vect_loops_num = number_of_loops (cfun);
>  
> @@ -351,9 +417,12 @@ vectorize_loops (void)
>    /* If some loop was duplicated, it gets bigger number
>       than all previously defined loops.  This fact allows us to run
>       only over initial loops skipping newly generated ones.  */
> -  FOR_EACH_LOOP (loop, 0)
> -    if ((flag_tree_loop_vectorize && optimize_loop_nest_for_speed_p (loop))
> -	|| loop->force_vect)
> +  FOR_EACH_LOOP (iloop, 0)
> +    if ((loop = vect_loop_select (iloop))->dont_vectorize)
> +      any_ifcvt_loops = true;
> +    else if ((flag_tree_loop_vectorize
> +	      && optimize_loop_nest_for_speed_p (loop))
> +	     || loop->force_vect)
>        {
>  	loop_vec_info loop_vinfo;
>  	vect_location = find_loop_location (loop);
> @@ -363,6 +432,10 @@ vectorize_loops (void)
>                         LOCATION_FILE (vect_location),
>  		       LOCATION_LINE (vect_location));
>  
> +	/* Make sure we don't try to vectorize this loop
> +	   more than once.  */
> +	loop->dont_vectorize = true;
> +
>  	loop_vinfo = vect_analyze_loop (loop);
>  	loop->aux = loop_vinfo;
>  
> @@ -372,6 +445,45 @@ vectorize_loops (void)
>          if (!dbg_cnt (vect_loop))
>  	  break;
>  
> +	gimple loop_vectorized_call = vect_loop_vectorized_call (loop);
> +	if (loop_vectorized_call)
> +	  {
> +	    tree arg = gimple_call_arg (loop_vectorized_call, 1);
> +	    basic_block *bbs;
> +	    unsigned int i;
> +	    struct loop *scalar_loop = get_loop (cfun, tree_to_shwi (arg));
> +	    struct loop *inner;
> +
> +	    LOOP_VINFO_SCALAR_LOOP (loop_vinfo) = scalar_loop;
> +	    gcc_checking_assert (vect_loop_vectorized_call
> +					(LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
> +				 == loop_vectorized_call);
> +	    bbs = get_loop_body (scalar_loop);
> +	    for (i = 0; i < scalar_loop->num_nodes; i++)
> +	      {
> +		basic_block bb = bbs[i];
> +		gimple_stmt_iterator gsi;
> +		for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
> +		     gsi_next (&gsi))
> +		  {
> +		    gimple phi = gsi_stmt (gsi);
> +		    gimple_set_uid (phi, 0);
> +		  }
> +		for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
> +		     gsi_next (&gsi))
> +		  {
> +		    gimple stmt = gsi_stmt (gsi);
> +		    gimple_set_uid (stmt, 0);
> +		  }
> +	      }
> +	    free (bbs);
> +	    /* If we have successfully vectorized an if-converted outer
> +	       loop, don't attempt to vectorize the if-converted inner
> +	       loop of the alternate loop.  */
> +	    for (inner = scalar_loop->inner; inner; inner = inner->next)
> +	      inner->dont_vectorize = true;
> +	  }
> +
>          if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOCATION
>  	    && dump_enabled_p ())
>            dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
> @@ -392,7 +504,29 @@ vectorize_loops (void)
>  	    *simduid_to_vf_htab.find_slot (simduid_to_vf_data, INSERT)
>  	      = simduid_to_vf_data;
>  	  }
> +
> +	if (loop_vectorized_call)
> +	  {
> +	    gimple g = loop_vectorized_call;
> +	    tree lhs = gimple_call_lhs (g);
> +	    gimple_stmt_iterator gsi = gsi_for_stmt (g);
> +	    gimplify_and_update_call_from_tree (&gsi, boolean_true_node);

plain update_call_from_tree should also work here, boolean_true_node
is already gimple.
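
I.e. just (untested):

  update_call_from_tree (&gsi, boolean_true_node);

boolean_true_node needs no re-gimplification, so the extra work done by
gimplify_and_update_call_from_tree is not needed here; presumably no new
include is needed either, since tree-vectorizer.c already pulls in
tree-ssa-propagate.h.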

> +	    gsi_next (&gsi);
> +	    if (!gsi_end_p (gsi))
> +	      {
> +		g = gsi_stmt (gsi);
> +		if (gimple_code (g) == GIMPLE_COND
> +		    && gimple_cond_lhs (g) == lhs)
> +		  {
> +		    gimple_cond_set_lhs (g, boolean_true_node);

or simply replace all immediate uses of 'lhs' by boolean_true_node
and remove the loop_vectorized call?
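
Something like this, perhaps (untested sketch, using the immediate use
iterators on the lhs and the gsi from the patch):

  imm_use_iterator imm_iter;
  use_operand_p use_p;
  gimple use_stmt;

  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
    {
      FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
        SET_USE (use_p, boolean_true_node);
      update_stmt (use_stmt);
    }
  gsi_remove (&gsi, true);
  release_defs (g);

(boolean_false_node in the non-vectorized case below); TODO_cleanup_cfg
would still need to be requested so the now-constant conditions get folded
away.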

> +		    update_stmt (g);
> +		    ret |= TODO_cleanup_cfg;
> +		  }
> +	      }
> +	  }
>        }
> +    else
> +      loop->dont_vectorize = true;
>  
>    vect_location = UNKNOWN_LOCATION;
>  
> @@ -405,6 +539,34 @@ vectorize_loops (void)
>  
>    /*  ----------- Finalize. -----------  */
>  
> +  if (any_ifcvt_loops)
> +    for (i = 1; i < vect_loops_num; i++)
> +      {
> +	loop = get_loop (cfun, i);
> +	if (loop && loop->dont_vectorize)
> +	  {
> +	    gimple g = vect_loop_vectorized_call (loop);
> +	    if (g)
> +	      {
> +		tree lhs = gimple_call_lhs (g);
> +		gimple_stmt_iterator gsi = gsi_for_stmt (g);
> +		gimplify_and_update_call_from_tree (&gsi, boolean_false_node);
> +		gsi_next (&gsi);
> +		if (!gsi_end_p (gsi))
> +		  {
> +		    g = gsi_stmt (gsi);
> +		    if (gimple_code (g) == GIMPLE_COND
> +			&& gimple_cond_lhs (g) == lhs)
> +		      {
> +			gimple_cond_set_lhs (g, boolean_false_node);
> +			update_stmt (g);
> +			ret |= TODO_cleanup_cfg;
> +		      }
> +		  }

See above.  And factor this out into a function.  Also move this
to the cleanup loop below.
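
The factored-out helper could look roughly like this (untested sketch, the
name is just illustrative):

  /* Replace the result of the IFN_LOOP_VECTORIZED call G with VALUE and
     adjust the GIMPLE_COND immediately following it, if any.  */

  static void
  fold_loop_vectorized_call (gimple g, tree value)
  {
    tree lhs = gimple_call_lhs (g);
    gimple_stmt_iterator gsi = gsi_for_stmt (g);

    update_call_from_tree (&gsi, value);
    gsi_next (&gsi);
    if (!gsi_end_p (gsi))
      {
        gimple cond_stmt = gsi_stmt (gsi);
        if (gimple_code (cond_stmt) == GIMPLE_COND
            && gimple_cond_lhs (cond_stmt) == lhs)
          {
            gimple_cond_set_lhs (cond_stmt, value);
            update_stmt (cond_stmt);
          }
      }
  }

called with boolean_true_node resp. boolean_false_node from the two places;
whether to set TODO_cleanup_cfg can either stay in the callers or be
returned from the helper.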

> +	      }
> +	  }
> +      }
> +
>    for (i = 1; i < vect_loops_num; i++)
>      {
>        loop_vec_info loop_vinfo;
> @@ -462,7 +624,7 @@ vectorize_loops (void)
>        return TODO_cleanup_cfg;
>      }
>  
> -  return 0;
> +  return ret;
>  }
>  
>  
> --- gcc/tree-vect-loop-manip.c.jj	2013-11-22 21:03:08.418882641 +0100
> +++ gcc/tree-vect-loop-manip.c	2013-11-28 14:54:01.621096704 +0100
> @@ -703,12 +703,42 @@ slpeel_make_loop_iterate_ntimes (struct
>    loop->nb_iterations = niters;
>  }
>  
> +/* Helper routine of slpeel_tree_duplicate_loop_to_edge_cfg.
> +   For all PHI arguments in FROM->dest and TO->dest coming from those
> +   edges, ensure that the current_def of each TO->dest PHI argument is
> +   set to the current_def of the corresponding FROM->dest argument.  */
> +
> +static void
> +slpeel_duplicate_current_defs_from_edges (edge from, edge to)
> +{
> +  gimple_stmt_iterator gsi_from, gsi_to;
> +
> +  for (gsi_from = gsi_start_phis (from->dest),
> +       gsi_to = gsi_start_phis (to->dest);
> +       !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> +       gsi_next (&gsi_from), gsi_next (&gsi_to))
> +    {
> +      gimple from_phi = gsi_stmt (gsi_from);
> +      gimple to_phi = gsi_stmt (gsi_to);
> +      tree from_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, from);
> +      tree to_arg = PHI_ARG_DEF_FROM_EDGE (to_phi, to);
> +      if (TREE_CODE (from_arg) == SSA_NAME
> +	  && TREE_CODE (to_arg) == SSA_NAME
> +	  && get_current_def (to_arg) == NULL_TREE)
> +	set_current_def (to_arg, get_current_def (from_arg));
> +    }
> +}
> +
>  
>  /* Given LOOP this function generates a new copy of it and puts it
> -   on E which is either the entry or exit of LOOP.  */
> +   on E which is either the entry or exit of LOOP.  If SCALAR_LOOP is
> +   non-NULL, assume LOOP and SCALAR_LOOP are equivalent and copy the
> +   basic blocks from SCALAR_LOOP instead of LOOP, but to either the
> +   entry or exit of LOOP.  */
>  
>  struct loop *
> -slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *loop, edge e)
> +slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *loop,
> +					struct loop *scalar_loop, edge e)
>  {
>    struct loop *new_loop;
>    basic_block *new_bbs, *bbs;
> @@ -722,19 +752,22 @@ slpeel_tree_duplicate_loop_to_edge_cfg (
>    if (!at_exit && e != loop_preheader_edge (loop))
>      return NULL;
>  
> -  bbs = XNEWVEC (basic_block, loop->num_nodes + 1);
> -  get_loop_body_with_size (loop, bbs, loop->num_nodes);
> +  if (scalar_loop == NULL)
> +    scalar_loop = loop;
> +
> +  bbs = XNEWVEC (basic_block, scalar_loop->num_nodes + 1);
> +  get_loop_body_with_size (scalar_loop, bbs, scalar_loop->num_nodes);
>  
>    /* Check whether duplication is possible.  */
> -  if (!can_copy_bbs_p (bbs, loop->num_nodes))
> +  if (!can_copy_bbs_p (bbs, scalar_loop->num_nodes))
>      {
>        free (bbs);
>        return NULL;
>      }
>  
>    /* Generate new loop structure.  */
> -  new_loop = duplicate_loop (loop, loop_outer (loop));
> -  duplicate_subloops (loop, new_loop);
> +  new_loop = duplicate_loop (scalar_loop, loop_outer (scalar_loop));
> +  duplicate_subloops (scalar_loop, new_loop);
>  
>    exit_dest = exit->dest;
>    was_imm_dom = (get_immediate_dominator (CDI_DOMINATORS,
> @@ -744,35 +777,80 @@ slpeel_tree_duplicate_loop_to_edge_cfg (
>    /* Also copy the pre-header, this avoids jumping through hoops to
>       duplicate the loop entry PHI arguments.  Create an empty
>       pre-header unconditionally for this.  */
> -  basic_block preheader = split_edge (loop_preheader_edge (loop));
> +  basic_block preheader = split_edge (loop_preheader_edge (scalar_loop));
>    edge entry_e = single_pred_edge (preheader);
> -  bbs[loop->num_nodes] = preheader;
> -  new_bbs = XNEWVEC (basic_block, loop->num_nodes + 1);
> +  bbs[scalar_loop->num_nodes] = preheader;
> +  new_bbs = XNEWVEC (basic_block, scalar_loop->num_nodes + 1);
>  
> -  copy_bbs (bbs, loop->num_nodes + 1, new_bbs,
> +  exit = single_exit (scalar_loop);
> +  copy_bbs (bbs, scalar_loop->num_nodes + 1, new_bbs,
>  	    &exit, 1, &new_exit, NULL,
>  	    e->src, true);
> -  basic_block new_preheader = new_bbs[loop->num_nodes];
> +  exit = single_exit (loop);
> +  basic_block new_preheader = new_bbs[scalar_loop->num_nodes];
>  
> -  add_phi_args_after_copy (new_bbs, loop->num_nodes + 1, NULL);
> +  add_phi_args_after_copy (new_bbs, scalar_loop->num_nodes + 1, NULL);
> +
> +  if (scalar_loop != loop)
> +    {
> +      /* If we copied from SCALAR_LOOP rather than LOOP, SSA_NAMEs from
> +	 SCALAR_LOOP will have current_def set to SSA_NAMEs in the new_loop,
> +	 but LOOP will not.  slpeel_update_phi_nodes_for_guard{1,2} expects
> +	 the LOOP SSA_NAMEs (on the exit edge and edge from latch to
> +	 header) to have current_def set, so copy them over.  */
> +      slpeel_duplicate_current_defs_from_edges (single_exit (scalar_loop),
> +						exit);
> +      slpeel_duplicate_current_defs_from_edges (EDGE_SUCC (scalar_loop->latch,
> +							   0),
> +						EDGE_SUCC (loop->latch, 0));
> +    }
>  
>    if (at_exit) /* Add the loop copy at exit.  */
>      {
> +      if (scalar_loop != loop)
> +	{
> +	  gimple_stmt_iterator gsi;
> +	  new_exit = redirect_edge_and_branch (new_exit, exit_dest);
> +
> +	  for (gsi = gsi_start_phis (exit_dest); !gsi_end_p (gsi);
> +	       gsi_next (&gsi))
> +	    {
> +	      gimple phi = gsi_stmt (gsi);
> +	      tree orig_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
> +	      location_t orig_locus
> +		= gimple_phi_arg_location_from_edge (phi, e);
> +
> +	      add_phi_arg (phi, orig_arg, new_exit, orig_locus);
> +	    }
> +	}
>        redirect_edge_and_branch_force (e, new_preheader);
>        flush_pending_stmts (e);
>        set_immediate_dominator (CDI_DOMINATORS, new_preheader, e->src);
>        if (was_imm_dom)
> -	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_loop->header);
> +	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit->src);
>  
>        /* And remove the non-necessary forwarder again.  Keep the other
>           one so we have a proper pre-header for the loop at the exit edge.  */
> -      redirect_edge_pred (single_succ_edge (preheader), single_pred (preheader));
> +      redirect_edge_pred (single_succ_edge (preheader),
> +			  single_pred (preheader));
>        delete_basic_block (preheader);
> -      set_immediate_dominator (CDI_DOMINATORS, loop->header,
> -			       loop_preheader_edge (loop)->src);
> +      set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
> +			       loop_preheader_edge (scalar_loop)->src);
>      }
>    else /* Add the copy at entry.  */
>      {
> +      if (scalar_loop != loop)
> +	{
> +	  /* Remove the non-necessary forwarder of scalar_loop again.  */
> +	  redirect_edge_pred (single_succ_edge (preheader),
> +			      single_pred (preheader));
> +	  delete_basic_block (preheader);
> +	  set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
> +				   loop_preheader_edge (scalar_loop)->src);
> +	  preheader = split_edge (loop_preheader_edge (loop));
> +	  entry_e = single_pred_edge (preheader);
> +	}
> +
>        redirect_edge_and_branch_force (entry_e, new_preheader);
>        flush_pending_stmts (entry_e);
>        set_immediate_dominator (CDI_DOMINATORS, new_preheader, entry_e->src);
> @@ -783,15 +861,39 @@ slpeel_tree_duplicate_loop_to_edge_cfg (
>  
>        /* And remove the non-necessary forwarder again.  Keep the other
>           one so we have a proper pre-header for the loop at the exit edge.  */
> -      redirect_edge_pred (single_succ_edge (new_preheader), single_pred (new_preheader));
> +      redirect_edge_pred (single_succ_edge (new_preheader),
> +			  single_pred (new_preheader));
>        delete_basic_block (new_preheader);
>        set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
>  			       loop_preheader_edge (new_loop)->src);
>      }
>  
> -  for (unsigned i = 0; i < loop->num_nodes+1; i++)
> +  for (unsigned i = 0; i < scalar_loop->num_nodes + 1; i++)
>      rename_variables_in_bb (new_bbs[i]);
>  
> +  if (scalar_loop != loop)
> +    {
> +      /* Update new_loop->header PHIs, so that on the preheader
> +	 edge they are the ones from loop rather than scalar_loop.  */
> +      gimple_stmt_iterator gsi_orig, gsi_new;
> +      edge orig_e = loop_preheader_edge (loop);
> +      edge new_e = loop_preheader_edge (new_loop);
> +
> +      for (gsi_orig = gsi_start_phis (loop->header),
> +	   gsi_new = gsi_start_phis (new_loop->header);
> +	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> +	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
> +	{
> +	  gimple orig_phi = gsi_stmt (gsi_orig);
> +	  gimple new_phi = gsi_stmt (gsi_new);
> +	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> +	  location_t orig_locus
> +	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> +
> +	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> +	}
> +    }
> +
>    free (new_bbs);
>    free (bbs);
>  
> @@ -1002,6 +1104,8 @@ set_prologue_iterations (basic_block bb_
>  
>     Input:
>     - LOOP: the loop to be peeled.
> +   - SCALAR_LOOP: if non-NULL, the alternate loop from which basic blocks
> +	should be copied.
>     - E: the exit or entry edge of LOOP.
>          If it is the entry edge, we peel the first iterations of LOOP. In this
>          case first-loop is LOOP, and second-loop is the newly created loop.
> @@ -1043,8 +1147,8 @@ set_prologue_iterations (basic_block bb_
>     FORNOW the resulting code will not be in loop-closed-ssa form.
>  */
>  
> -static struct loop*
> -slpeel_tree_peel_loop_to_edge (struct loop *loop,
> +static struct loop *
> +slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
>  			       edge e, tree *first_niters,
>  			       tree niters, bool update_first_loop_count,
>  			       unsigned int th, bool check_profitability,
> @@ -1129,7 +1233,8 @@ slpeel_tree_peel_loop_to_edge (struct lo
>          orig_exit_bb:
>     */
>  
> -  if (!(new_loop = slpeel_tree_duplicate_loop_to_edge_cfg (loop, e)))
> +  if (!(new_loop = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
> +							   e)))
>      {
>        loop_loc = find_loop_location (loop);
>        dump_printf_loc (MSG_MISSED_OPTIMIZATION, loop_loc,
> @@ -1625,6 +1730,7 @@ vect_do_peeling_for_loop_bound (loop_vec
>  				unsigned int th, bool check_profitability)
>  {
>    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  struct loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
>    struct loop *new_loop;
>    edge update_e;
>    basic_block preheader;
> @@ -1641,11 +1747,12 @@ vect_do_peeling_for_loop_bound (loop_vec
>  
>    loop_num  = loop->num;
>  
> -  new_loop = slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
> -                                            &ratio_mult_vf_name, ni_name, false,
> -                                            th, check_profitability,
> -					    cond_expr, cond_expr_stmt_list,
> -					    0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
> +  new_loop
> +    = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop),
> +				     &ratio_mult_vf_name, ni_name, false,
> +				     th, check_profitability,
> +				     cond_expr, cond_expr_stmt_list,
> +				     0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
>    gcc_assert (new_loop);
>    gcc_assert (loop_num == loop->num);
>  #ifdef ENABLE_CHECKING
> @@ -1878,6 +1985,7 @@ vect_do_peeling_for_alignment (loop_vec_
>  			       unsigned int th, bool check_profitability)
>  {
>    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  struct loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
>    tree niters_of_prolog_loop;
>    tree wide_prolog_niters;
>    struct loop *new_loop;
> @@ -1899,11 +2007,11 @@ vect_do_peeling_for_alignment (loop_vec_
>  
>    /* Peel the prolog loop and iterate it niters_of_prolog_loop.  */
>    new_loop =
> -    slpeel_tree_peel_loop_to_edge (loop, loop_preheader_edge (loop),
> +    slpeel_tree_peel_loop_to_edge (loop, scalar_loop,
> +				   loop_preheader_edge (loop),
>  				   &niters_of_prolog_loop, ni_name, true,
>  				   th, check_profitability, NULL_TREE, NULL,
> -				   bound,
> -				   0);
> +				   bound, 0);
>  
>    gcc_assert (new_loop);
>  #ifdef ENABLE_CHECKING
> @@ -2187,6 +2295,7 @@ vect_loop_versioning (loop_vec_info loop
>  		      unsigned int th, bool check_profitability)
>  {
>    struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  struct loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
>    basic_block condition_bb;
>    gimple_stmt_iterator gsi, cond_exp_gsi;
>    basic_block merge_bb;
> @@ -2222,8 +2331,43 @@ vect_loop_versioning (loop_vec_info loop
>    gimple_seq_add_seq (&cond_expr_stmt_list, gimplify_stmt_list);
>  
>    initialize_original_copy_tables ();
> -  loop_version (loop, cond_expr, &condition_bb,
> -		prob, prob, REG_BR_PROB_BASE - prob, true);
> +  if (scalar_loop)
> +    {
> +      edge scalar_e;
> +      basic_block preheader, scalar_preheader;
> +
> +      /* We don't want to scale SCALAR_LOOP's frequencies, we need to
> +	 scale LOOP's frequencies instead.  */
> +      loop_version (scalar_loop, cond_expr, &condition_bb,
> +		    prob, REG_BR_PROB_BASE, REG_BR_PROB_BASE - prob, true);
> +      scale_loop_frequencies (loop, prob, REG_BR_PROB_BASE);
> +      /* CONDITION_BB was created above SCALAR_LOOP's preheader,
> +	 while we need to move it above LOOP's preheader.  */
> +      e = loop_preheader_edge (loop);
> +      scalar_e = loop_preheader_edge (scalar_loop);
> +      gcc_assert (empty_block_p (e->src)
> +		  && single_pred_p (e->src));
> +      gcc_assert (empty_block_p (scalar_e->src)
> +		  && single_pred_p (scalar_e->src));
> +      gcc_assert (single_pred_p (condition_bb));
> +      preheader = e->src;
> +      scalar_preheader = scalar_e->src;
> +      scalar_e = find_edge (condition_bb, scalar_preheader);
> +      e = single_pred_edge (preheader);
> +      redirect_edge_and_branch_force (single_pred_edge (condition_bb),
> +				      scalar_preheader);
> +      redirect_edge_and_branch_force (scalar_e, preheader);
> +      redirect_edge_and_branch_force (e, condition_bb);
> +      set_immediate_dominator (CDI_DOMINATORS, condition_bb,
> +			       single_pred (condition_bb));
> +      set_immediate_dominator (CDI_DOMINATORS, scalar_preheader,
> +			       single_pred (scalar_preheader));
> +      set_immediate_dominator (CDI_DOMINATORS, preheader,
> +			       condition_bb);
> +    }
> +  else
> +    loop_version (loop, cond_expr, &condition_bb,
> +		  prob, prob, REG_BR_PROB_BASE - prob, true);
>  
>    if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOCATION
>        && dump_enabled_p ())
> @@ -2246,24 +2390,29 @@ vect_loop_versioning (loop_vec_info loop
>       basic block (i.e. it has two predecessors). Just in order to simplify
>       following transformations in the vectorizer, we fix this situation
>       here by adding a new (empty) block on the exit-edge of the loop,
> -     with the proper loop-exit phis to maintain loop-closed-form.  */
> +     with the proper loop-exit phis to maintain loop-closed-form.
> +     If loop versioning wasn't done from loop, but from scalar_loop instead,
> +     merge_bb will already have just a single predecessor.  */
>  
>    merge_bb = single_exit (loop)->dest;
> -  gcc_assert (EDGE_COUNT (merge_bb->preds) == 2);
> -  new_exit_bb = split_edge (single_exit (loop));
> -  new_exit_e = single_exit (loop);
> -  e = EDGE_SUCC (new_exit_bb, 0);
> -
> -  for (gsi = gsi_start_phis (merge_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> -    {
> -      tree new_res;
> -      orig_phi = gsi_stmt (gsi);
> -      new_res = copy_ssa_name (PHI_RESULT (orig_phi), NULL);
> -      new_phi = create_phi_node (new_res, new_exit_bb);
> -      arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, e);
> -      add_phi_arg (new_phi, arg, new_exit_e,
> -		   gimple_phi_arg_location_from_edge (orig_phi, e));
> -      adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
> +  if (scalar_loop == NULL || EDGE_COUNT (merge_bb->preds) >= 2)
> +    {
> +      gcc_assert (EDGE_COUNT (merge_bb->preds) >= 2);
> +      new_exit_bb = split_edge (single_exit (loop));
> +      new_exit_e = single_exit (loop);
> +      e = EDGE_SUCC (new_exit_bb, 0);
> +
> +      for (gsi = gsi_start_phis (merge_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +	{
> +	  tree new_res;
> +	  orig_phi = gsi_stmt (gsi);
> +	  new_res = copy_ssa_name (PHI_RESULT (orig_phi), NULL);
> +	  new_phi = create_phi_node (new_res, new_exit_bb);
> +	  arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, e);
> +	  add_phi_arg (new_phi, arg, new_exit_e,
> +		       gimple_phi_arg_location_from_edge (orig_phi, e));
> +	  adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
> +	}
>      }
>  
>  
> --- gcc/tree-vect-loop.c.jj	2013-11-28 09:18:11.772774927 +0100
> +++ gcc/tree-vect-loop.c	2013-11-28 14:13:57.643572214 +0100
> @@ -374,7 +374,11 @@ vect_determine_vectorization_factor (loo
>  		analyze_pattern_stmt = false;
>  	    }
>  
> -	  if (gimple_get_lhs (stmt) == NULL_TREE)
> +	  if (gimple_get_lhs (stmt) == NULL_TREE
> +	      /* MASK_STORE has no lhs, but is ok.  */
> +	      && (!is_gimple_call (stmt)
> +		  || !gimple_call_internal_p (stmt)
> +		  || gimple_call_internal_fn (stmt) != IFN_MASK_STORE))
>  	    {
>  	      if (is_gimple_call (stmt))
>  		{
> @@ -426,7 +430,12 @@ vect_determine_vectorization_factor (loo
>  	  else
>  	    {
>  	      gcc_assert (!STMT_VINFO_DATA_REF (stmt_info));
> -	      scalar_type = TREE_TYPE (gimple_get_lhs (stmt));
> +	      if (is_gimple_call (stmt)
> +		  && gimple_call_internal_p (stmt)
> +		  && gimple_call_internal_fn (stmt) == IFN_MASK_STORE)
> +		scalar_type = TREE_TYPE (gimple_call_arg (stmt, 3));
> +	      else
> +		scalar_type = TREE_TYPE (gimple_get_lhs (stmt));
>  	      if (dump_enabled_p ())
>  		{
>  		  dump_printf_loc (MSG_NOTE, vect_location,
> --- gcc/cfgloop.h.jj	2013-11-19 21:56:40.389335752 +0100
> +++ gcc/cfgloop.h	2013-11-28 14:13:57.602572427 +0100
> @@ -176,6 +176,9 @@ struct GTY ((chain_next ("%h.next"))) lo
>    /* True if we should try harder to vectorize this loop.  */
>    bool force_vect;
>  
> +  /* True if this loop should never be vectorized.  */
> +  bool dont_vectorize;
> +
>    /* For SIMD loops, this is a unique identifier of the loop, referenced
>       by IFN_GOMP_SIMD_VF, IFN_GOMP_SIMD_LANE and IFN_GOMP_SIMD_LAST_LANE
>       builtins.  */
> --- gcc/tree-loop-distribution.c.jj	2013-11-22 21:03:05.696896177 +0100
> +++ gcc/tree-loop-distribution.c	2013-11-28 14:13:57.632572271 +0100
> @@ -588,7 +588,7 @@ copy_loop_before (struct loop *loop)
>    edge preheader = loop_preheader_edge (loop);
>  
>    initialize_original_copy_tables ();
> -  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, preheader);
> +  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
>    gcc_assert (res != NULL);
>    free_original_copy_tables ();
>    delete_update_ssa ();
> --- gcc/optabs.def.jj	2013-11-26 21:36:14.066329682 +0100
> +++ gcc/optabs.def	2013-11-28 14:13:57.624572312 +0100
> @@ -248,6 +248,8 @@ OPTAB_D (sdot_prod_optab, "sdot_prod$I$a
>  OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
>  OPTAB_D (udot_prod_optab, "udot_prod$I$a")
>  OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
> +OPTAB_D (maskload_optab, "maskload$a")
> +OPTAB_D (maskstore_optab, "maskstore$a")
>  OPTAB_D (vec_extract_optab, "vec_extract$a")
>  OPTAB_D (vec_init_optab, "vec_init$a")
>  OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
> --- gcc/testsuite/gcc.target/i386/avx2-gather-6.c.jj	2013-11-28 14:13:57.633572267 +0100
> +++ gcc/testsuite/gcc.target/i386/avx2-gather-6.c	2013-11-28 14:13:57.633572267 +0100
> @@ -0,0 +1,7 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx2 -fno-common -fdump-tree-vect-details" } */
> +
> +#include "avx2-gather-5.c"
> +
> +/* { dg-final { scan-tree-dump-times "note: vectorized 1 loops in function" 1 "vect" } } */
> +/* { dg-final { cleanup-tree-dump "vect" } } */
> --- gcc/testsuite/gcc.target/i386/vect-cond-1.c.jj	2013-11-28 14:57:58.182864189 +0100
> +++ gcc/testsuite/gcc.target/i386/vect-cond-1.c	2013-11-28 14:57:58.182864189 +0100
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -mavx2" { target avx2 } } */
> +
> +int a[1024];
> +
> +int
> +foo (int *p)
> +{
> +  int i;
> +  for (i = 0; i < 1024; i++)
> +    {
> +      int t;
> +      if (a[i] < 30)
> +	t = *p;
> +      else
> +	t = a[i] + 12;
> +      a[i] = t;
> +    }
> +}
> +
> +/* { dg-final { cleanup-tree-dump "vect" } } */
> --- gcc/testsuite/gcc.target/i386/avx2-gather-5.c.jj	2013-11-28 14:13:57.633572267 +0100
> +++ gcc/testsuite/gcc.target/i386/avx2-gather-5.c	2013-11-28 14:13:57.633572267 +0100
> @@ -0,0 +1,47 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target avx2 } */
> +/* { dg-options "-O3 -mavx2 -fno-common" } */
> +
> +#include "avx2-check.h"
> +
> +#define N 1024
> +float vf1[N+16], vf2[N], vf3[N];
> +int k[N];
> +
> +__attribute__((noinline, noclone)) void
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i < N; i++)
> +    {
> +      float f;
> +      if (vf3[i] < 0.0f)
> +	f = vf1[k[i]];
> +      else
> +	f = 7.0f;
> +      vf2[i] = f;
> +    }
> +}
> +
> +static void
> +avx2_test (void)
> +{
> +  int i;
> +  for (i = 0; i < N + 16; i++)
> +    {
> +      vf1[i] = 5.5f * i;
> +      if (i >= N)
> +	continue;
> +      vf2[i] = 2.0f;
> +      vf3[i] = (i & 1) ? i : -i - 1;
> +      k[i] = (i & 1) ? ((i & 2) ? -i : N / 2 + i) : (i * 7) % N;
> +      asm ("");
> +    }
> +  foo ();
> +  for (i = 0; i < N; i++)
> +    if (vf1[i] != 5.5 * i
> +	|| vf2[i] != ((i & 1) ? 7.0f : 5.5f * ((i * 7) % N))
> +	|| vf3[i] != ((i & 1) ? i : -i - 1)
> +	|| k[i] != ((i & 1) ? ((i & 2) ? -i : N / 2 + i) : ((i * 7) % N)))
> +      abort ();
> +}
> --- gcc/testsuite/gcc.dg/vect/vect-cond-11.c.jj	2013-11-28 14:13:57.634572262 +0100
> +++ gcc/testsuite/gcc.dg/vect/vect-cond-11.c	2013-11-28 14:13:57.634572262 +0100
> @@ -0,0 +1,116 @@
> +#include "tree-vect.h"
> +
> +#define N 1024
> +typedef int V __attribute__((vector_size (4)));
> +unsigned int a[N * 2] __attribute__((aligned));
> +unsigned int b[N * 2] __attribute__((aligned));
> +V c[N];
> +
> +__attribute__((noinline, noclone)) unsigned int
> +foo (unsigned int *a, unsigned int *b)
> +{
> +  int i;
> +  unsigned int r = 0;
> +  for (i = 0; i < N; i++)
> +    {
> +      unsigned int x = a[i], y = b[i];
> +      if (x < 32)
> +	{
> +	  x = x + 127;
> +	  y = y * 2;
> +	}
> +      else
> +	{
> +	  x = x - 16;
> +	  y = y + 1;
> +	}
> +      a[i] = x;
> +      b[i] = y;
> +      r += x;
> +    }
> +  return r;
> +}
> +
> +__attribute__((noinline, noclone)) unsigned int
> +bar (unsigned int *a, unsigned int *b)
> +{
> +  int i;
> +  unsigned int r = 0;
> +  for (i = 0; i < N; i++)
> +    {
> +      unsigned int x = a[i], y = b[i];
> +      if (x < 32)
> +	{
> +	  x = x + 127;
> +	  y = y * 2;
> +	}
> +      else
> +	{
> +	  x = x - 16;
> +	  y = y + 1;
> +	}
> +      a[i] = x;
> +      b[i] = y;
> +      c[i] = c[i] + 1;
> +      r += x;
> +    }
> +  return r;
> +}
> +
> +void
> +baz (unsigned int *a, unsigned int *b,
> +     unsigned int (*fn) (unsigned int *, unsigned int *))
> +{
> +  int i;
> +  for (i = -64; i < 0; i++)
> +    {
> +      a[i] = 19;
> +      b[i] = 17;
> +    }
> +  for (; i < N; i++)
> +    {
> +      a[i] = i - 512;
> +      b[i] = i;
> +    }
> +  for (; i < N + 64; i++)
> +    {
> +      a[i] = 27;
> +      b[i] = 19;
> +    }
> +  if (fn (a, b) != -512U - (N - 32) * 16U + 32 * 127U)
> +    __builtin_abort ();
> +  for (i = -64; i < 0; i++)
> +    if (a[i] != 19 || b[i] != 17)
> +      __builtin_abort ();
> +  for (; i < N; i++)
> +    if (a[i] != (i - 512U < 32U ? i - 512U + 127 : i - 512U - 16)
> +	|| b[i] != (i - 512U < 32U ? i * 2U : i + 1U))
> +      __builtin_abort ();
> +  for (; i < N + 64; i++)
> +    if (a[i] != 27 || b[i] != 19)
> +      __builtin_abort ();
> +}
> +
> +int
> +main ()
> +{
> +  int i;
> +  check_vect ();
> +  baz (a + 512, b + 512, foo);
> +  baz (a + 512, b + 512, bar);
> +  baz (a + 512 + 1, b + 512 + 1, foo);
> +  baz (a + 512 + 1, b + 512 + 1, bar);
> +  baz (a + 512 + 31, b + 512 + 31, foo);
> +  baz (a + 512 + 31, b + 512 + 31, bar);
> +  baz (a + 512 + 1, b + 512, foo);
> +  baz (a + 512 + 1, b + 512, bar);
> +  baz (a + 512 + 31, b + 512, foo);
> +  baz (a + 512 + 31, b + 512, bar);
> +  baz (a + 512, b + 512 + 1, foo);
> +  baz (a + 512, b + 512 + 1, bar);
> +  baz (a + 512, b + 512 + 31, foo);
> +  baz (a + 512, b + 512 + 31, bar);
> +  return 0;
> +}
> +
> +/* { dg-final { cleanup-tree-dump "vect" } } */
> --- gcc/testsuite/gcc.dg/vect/vect-mask-load-1.c.jj	2013-11-28 14:13:57.633572267 +0100
> +++ gcc/testsuite/gcc.dg/vect/vect-mask-load-1.c	2013-11-28 14:13:57.633572267 +0100
> @@ -0,0 +1,52 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-Ofast -fno-common" } */
> +/* { dg-additional-options "-Ofast -fno-common -mavx" { target avx_runtime } } */
> +
> +#include <stdlib.h>
> +#include "tree-vect.h"
> +
> +__attribute__((noinline, noclone)) void
> +foo (double *x, double *y)
> +{
> +  double *p = __builtin_assume_aligned (x, 16);
> +  double *q = __builtin_assume_aligned (y, 16);
> +  double z, h;
> +  int i;
> +  for (i = 0; i < 1024; i++)
> +    {
> +      if (p[i] < 0.0)
> +	z = q[i], h = q[i] * 7.0 + 3.0;
> +      else
> +	z = p[i] + 6.0, h = p[1024 + i];
> +      p[i] = z + 2.0 * h;
> +    }
> +}
> +
> +double a[2048] __attribute__((aligned (16)));
> +double b[1024] __attribute__((aligned (16)));
> +
> +int
> +main ()
> +{
> +  int i;
> +  check_vect ();
> +  for (i = 0; i < 1024; i++)
> +    {
> +      a[i] = (i & 1) ? -i : 2 * i;
> +      a[i + 1024] = i;
> +      b[i] = 7 * i;
> +      asm ("");
> +    }
> +  foo (a, b);
> +  for (i = 0; i < 1024; i++)
> +    if (a[i] != ((i & 1)
> +		 ? 7 * i + 2.0 * (7 * i * 7.0 + 3.0)
> +		 : 2 * i + 6.0 + 2.0 * i)
> +	|| b[i] != 7 * i
> +	|| a[i + 1024] != i)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "note: vectorized 1 loops" 1 "vect" { target avx_runtime } } } */
> +/* { dg-final { cleanup-tree-dump "vect" } } */
> --- gcc/testsuite/gcc.dg/vect/vect-mask-loadstore-1.c.jj	2013-11-28 14:13:57.634572262 +0100
> +++ gcc/testsuite/gcc.dg/vect/vect-mask-loadstore-1.c	2013-11-28 14:13:57.634572262 +0100
> @@ -0,0 +1,50 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-Ofast -fno-common" } */
> +/* { dg-additional-options "-Ofast -fno-common -mavx" { target avx_runtime } } */
> +
> +#include <stdlib.h>
> +#include "tree-vect.h"
> +
> +__attribute__((noinline, noclone)) void
> +foo (float *__restrict x, float *__restrict y, float *__restrict z)
> +{
> +  float *__restrict p = __builtin_assume_aligned (x, 32);
> +  float *__restrict q = __builtin_assume_aligned (y, 32);
> +  float *__restrict r = __builtin_assume_aligned (z, 32);
> +  int i;
> +  for (i = 0; i < 1024; i++)
> +    {
> +      if (p[i] < 0.0f)
> +	q[i] = p[i] + 2.0f;
> +      else
> +	p[i] = r[i] + 3.0f;
> +    }
> +}
> +
> +float a[1024] __attribute__((aligned (32)));
> +float b[1024] __attribute__((aligned (32)));
> +float c[1024] __attribute__((aligned (32)));
> +
> +int
> +main ()
> +{
> +  int i;
> +  check_vect ();
> +  for (i = 0; i < 1024; i++)
> +    {
> +      a[i] = (i & 1) ? -i : i;
> +      b[i] = 7 * i;
> +      c[i] = a[i] - 3.0f;
> +      asm ("");
> +    }
> +  foo (a, b, c);
> +  for (i = 0; i < 1024; i++)
> +    if (a[i] != ((i & 1) ? -i : i)
> +	|| b[i] != ((i & 1) ? a[i] + 2.0f : 7 * i)
> +	|| c[i] != a[i] - 3.0f)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "note: vectorized 1 loops" 1 "vect" { target avx_runtime } } } */
> +/* { dg-final { cleanup-tree-dump "vect" } } */
> --- gcc/passes.def.jj	2013-11-27 12:15:13.999517045 +0100
> +++ gcc/passes.def	2013-11-28 14:13:57.602572427 +0100
> @@ -217,6 +217,8 @@ along with GCC; see the file COPYING3.
>  	  NEXT_PASS (pass_iv_canon);
>  	  NEXT_PASS (pass_parallelize_loops);
>  	  NEXT_PASS (pass_if_conversion);
> +	  /* pass_vectorize must immediately follow pass_if_conversion.
> +	     Please do not add any other passes in between.  */
>  	  NEXT_PASS (pass_vectorize);
>            PUSH_INSERT_PASSES_WITHIN (pass_vectorize)
>  	      NEXT_PASS (pass_dce_loop);
> --- gcc/tree-predcom.c.jj	2013-11-22 21:03:14.589851957 +0100
> +++ gcc/tree-predcom.c	2013-11-28 14:59:15.529464377 +0100
> @@ -732,6 +732,9 @@ split_data_refs_to_components (struct lo
>  	     just fail.  */
>  	  goto end;
>  	}
> +      /* predcom pass isn't prepared to handle calls with data references.  */
> +      if (is_gimple_call (DR_STMT (dr)))
> +	goto end;
>        dr->aux = (void *) (size_t) i;
>        comp_father[i] = i;
>        comp_size[i] = 1;
> --- gcc/tree-vect-stmts.c.jj	2013-11-27 12:15:14.038516844 +0100
> +++ gcc/tree-vect-stmts.c	2013-11-28 14:57:58.182864189 +0100
> @@ -235,7 +235,7 @@ vect_mark_relevant (vec<gimple> *worklis
>            /* This use is out of pattern use, if LHS has other uses that are
>               pattern uses, we should mark the stmt itself, and not the pattern
>               stmt.  */
> -	  if (TREE_CODE (lhs) == SSA_NAME)
> +	  if (lhs && TREE_CODE (lhs) == SSA_NAME)
>  	    FOR_EACH_IMM_USE_FAST (use_p, imm_iter, lhs)
>  	      {
>  		if (is_gimple_debug (USE_STMT (use_p)))
> @@ -393,7 +393,27 @@ exist_non_indexing_operands_for_use_p (t
>       first case, and whether var corresponds to USE.  */
>  
>    if (!gimple_assign_copy_p (stmt))
> -    return false;
> +    {
> +      if (is_gimple_call (stmt)
> +	  && gimple_call_internal_p (stmt))
> +	switch (gimple_call_internal_fn (stmt))
> +	  {
> +	  case IFN_MASK_STORE:
> +	    operand = gimple_call_arg (stmt, 3);
> +	    if (operand == use)
> +	      return true;
> +	    /* FALLTHRU */
> +	  case IFN_MASK_LOAD:
> +	    operand = gimple_call_arg (stmt, 2);
> +	    if (operand == use)
> +	      return true;
> +	    break;
> +	  default:
> +	    break;
> +	  }
> +      return false;
> +    }
> +
>    if (TREE_CODE (gimple_assign_lhs (stmt)) == SSA_NAME)
>      return false;
>    operand = gimple_assign_rhs1 (stmt);
> @@ -1696,6 +1716,413 @@ vectorizable_function (gimple call, tree
>  						        vectype_in);
>  }
>  
> +
> +static tree permute_vec_elements (tree, tree, tree, gimple,
> +				  gimple_stmt_iterator *);
> +
> +
> +/* Function vectorizable_mask_load_store.
> +
> +   Check if STMT performs a conditional load or store that can be vectorized.
> +   If VEC_STMT is also passed, vectorize the STMT: create a vectorized
> +   stmt to replace it, put it in VEC_STMT, and insert it at GSI.
> +   Return FALSE if not a vectorizable STMT, TRUE otherwise.  */
> +
> +static bool
> +vectorizable_mask_load_store (gimple stmt, gimple_stmt_iterator *gsi,
> +			      gimple *vec_stmt, slp_tree slp_node)
> +{
> +  tree vec_dest = NULL;
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  stmt_vec_info prev_stmt_info;
> +  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  bool nested_in_vect_loop = nested_in_vect_loop_p (loop, stmt);
> +  struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
> +  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +  tree elem_type;
> +  gimple new_stmt;
> +  tree dummy;
> +  tree dataref_ptr = NULL_TREE;
> +  gimple ptr_incr;
> +  int nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +  int ncopies;
> +  int i, j;
> +  bool inv_p;
> +  tree gather_base = NULL_TREE, gather_off = NULL_TREE;
> +  tree gather_off_vectype = NULL_TREE, gather_decl = NULL_TREE;
> +  int gather_scale = 1;
> +  enum vect_def_type gather_dt = vect_unknown_def_type;
> +  bool is_store;
> +  tree mask;
> +  gimple def_stmt;
> +  tree def;
> +  enum vect_def_type dt;
> +
> +  if (slp_node != NULL)
> +    return false;
> +
> +  ncopies = LOOP_VINFO_VECT_FACTOR (loop_vinfo) / nunits;
> +  gcc_assert (ncopies >= 1);
> +
> +  is_store = gimple_call_internal_fn (stmt) == IFN_MASK_STORE;
> +  mask = gimple_call_arg (stmt, 2);
> +  if (TYPE_PRECISION (TREE_TYPE (mask))
> +      != GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (vectype))))
> +    return false;
> +
> +  /* FORNOW. This restriction should be relaxed.  */
> +  if (nested_in_vect_loop && ncopies > 1)
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			 "multiple types in nested loop.");
> +      return false;
> +    }
> +
> +  if (!STMT_VINFO_RELEVANT_P (stmt_info))
> +    return false;
> +
> +  if (STMT_VINFO_DEF_TYPE (stmt_info) != vect_internal_def)
> +    return false;
> +
> +  if (!STMT_VINFO_DATA_REF (stmt_info))
> +    return false;
> +
> +  elem_type = TREE_TYPE (vectype);
> +
> +  if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
> +    return false;
> +
> +  if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
> +    return false;
> +
> +  if (STMT_VINFO_GATHER_P (stmt_info))
> +    {
> +      gimple def_stmt;
> +      tree def;
> +      gather_decl = vect_check_gather (stmt, loop_vinfo, &gather_base,
> +				       &gather_off, &gather_scale);
> +      gcc_assert (gather_decl);
> +      if (!vect_is_simple_use_1 (gather_off, NULL, loop_vinfo, NULL,
> +				 &def_stmt, &def, &gather_dt,
> +				 &gather_off_vectype))
> +	{
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			     "gather index use not simple.");
> +	  return false;
> +	}
> +    }
> +  else if (tree_int_cst_compare (nested_in_vect_loop
> +				 ? STMT_VINFO_DR_STEP (stmt_info)
> +				 : DR_STEP (dr), size_zero_node) <= 0)
> +    return false;
> +  else if (optab_handler (is_store ? maskstore_optab : maskload_optab,
> +			  TYPE_MODE (vectype)) == CODE_FOR_nothing)
> +    return false;
> +
> +  if (TREE_CODE (mask) != SSA_NAME)
> +    return false;
> +
> +  if (!vect_is_simple_use (mask, stmt, loop_vinfo, NULL,
> +			   &def_stmt, &def, &dt))
> +    return false;
> +
> +  if (is_store)
> +    {
> +      tree rhs = gimple_call_arg (stmt, 3);
> +      if (!vect_is_simple_use (rhs, stmt, loop_vinfo, NULL,
> +			       &def_stmt, &def, &dt))
> +	return false;
> +    }
> +
> +  if (!vec_stmt) /* transformation not required.  */
> +    {
> +      STMT_VINFO_TYPE (stmt_info) = call_vec_info_type;
> +      if (is_store)
> +	vect_model_store_cost (stmt_info, ncopies, false, dt,
> +			       NULL, NULL, NULL);
> +      else
> +	vect_model_load_cost (stmt_info, ncopies, false, NULL, NULL, NULL);
> +      return true;
> +    }
> +
> +  /** Transform.  **/
> +
> +  if (STMT_VINFO_GATHER_P (stmt_info))
> +    {
> +      tree vec_oprnd0 = NULL_TREE, op;
> +      tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gather_decl));
> +      tree rettype, srctype, ptrtype, idxtype, masktype, scaletype;
> +      tree ptr, vec_mask = NULL_TREE, mask_op, var, scale;
> +      tree perm_mask = NULL_TREE, prev_res = NULL_TREE;
> +      edge pe = loop_preheader_edge (loop);
> +      gimple_seq seq;
> +      basic_block new_bb;
> +      enum { NARROW, NONE, WIDEN } modifier;
> +      int gather_off_nunits = TYPE_VECTOR_SUBPARTS (gather_off_vectype);
> +
> +      if (nunits == gather_off_nunits)
> +	modifier = NONE;
> +      else if (nunits == gather_off_nunits / 2)
> +	{
> +	  unsigned char *sel = XALLOCAVEC (unsigned char, gather_off_nunits);
> +	  modifier = WIDEN;
> +
> +	  for (i = 0; i < gather_off_nunits; ++i)
> +	    sel[i] = i | nunits;
> +
> +	  perm_mask = vect_gen_perm_mask (gather_off_vectype, sel);
> +	  gcc_assert (perm_mask != NULL_TREE);
> +	}
> +      else if (nunits == gather_off_nunits * 2)
> +	{
> +	  unsigned char *sel = XALLOCAVEC (unsigned char, nunits);
> +	  modifier = NARROW;
> +
> +	  for (i = 0; i < nunits; ++i)
> +	    sel[i] = i < gather_off_nunits
> +		     ? i : i + nunits - gather_off_nunits;
> +
> +	  perm_mask = vect_gen_perm_mask (vectype, sel);
> +	  gcc_assert (perm_mask != NULL_TREE);
> +	  ncopies *= 2;
> +	}
> +      else
> +	gcc_unreachable ();
> +
> +      rettype = TREE_TYPE (TREE_TYPE (gather_decl));
> +      srctype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
> +      ptrtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
> +      idxtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
> +      masktype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
> +      scaletype = TREE_VALUE (arglist);
> +      gcc_checking_assert (types_compatible_p (srctype, rettype)
> +			   && types_compatible_p (srctype, masktype));
> +
> +      vec_dest = vect_create_destination_var (gimple_call_lhs (stmt), vectype);
> +
> +      ptr = fold_convert (ptrtype, gather_base);
> +      if (!is_gimple_min_invariant (ptr))
> +	{
> +	  ptr = force_gimple_operand (ptr, &seq, true, NULL_TREE);
> +	  new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
> +	  gcc_assert (!new_bb);
> +	}
> +
> +      scale = build_int_cst (scaletype, gather_scale);
> +
> +      prev_stmt_info = NULL;
> +      for (j = 0; j < ncopies; ++j)
> +	{
> +	  if (modifier == WIDEN && (j & 1))
> +	    op = permute_vec_elements (vec_oprnd0, vec_oprnd0,
> +				       perm_mask, stmt, gsi);
> +	  else if (j == 0)
> +	    op = vec_oprnd0
> +	      = vect_get_vec_def_for_operand (gather_off, stmt, NULL);
> +	  else
> +	    op = vec_oprnd0
> +	      = vect_get_vec_def_for_stmt_copy (gather_dt, vec_oprnd0);
> +
> +	  if (!useless_type_conversion_p (idxtype, TREE_TYPE (op)))
> +	    {
> +	      gcc_assert (TYPE_VECTOR_SUBPARTS (TREE_TYPE (op))
> +			  == TYPE_VECTOR_SUBPARTS (idxtype));
> +	      var = vect_get_new_vect_var (idxtype, vect_simple_var, NULL);
> +	      var = make_ssa_name (var, NULL);
> +	      op = build1 (VIEW_CONVERT_EXPR, idxtype, op);
> +	      new_stmt
> +		= gimple_build_assign_with_ops (VIEW_CONVERT_EXPR, var,
> +						op, NULL_TREE);
> +	      vect_finish_stmt_generation (stmt, new_stmt, gsi);
> +	      op = var;
> +	    }
> +
> +	  if (j == 0)
> +	    vec_mask = vect_get_vec_def_for_operand (mask, stmt, NULL);
> +	  else
> +	    {
> +	      vect_is_simple_use (vec_mask, NULL, loop_vinfo, NULL, &def_stmt,
> +				  &def, &dt);
> +	      vec_mask = vect_get_vec_def_for_stmt_copy (dt, vec_mask);
> +	    }
> +
> +	  mask_op = vec_mask;
> +	  if (!useless_type_conversion_p (masktype, TREE_TYPE (vec_mask)))
> +	    {
> +	      gcc_assert (TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask_op))
> +			  == TYPE_VECTOR_SUBPARTS (masktype));
> +	      var = vect_get_new_vect_var (masktype, vect_simple_var, NULL);
> +	      var = make_ssa_name (var, NULL);
> +	      mask_op = build1 (VIEW_CONVERT_EXPR, masktype, mask_op);
> +	      new_stmt
> +		= gimple_build_assign_with_ops (VIEW_CONVERT_EXPR, var,
> +						mask_op, NULL_TREE);
> +	      vect_finish_stmt_generation (stmt, new_stmt, gsi);
> +	      mask_op = var;
> +	    }
> +
> +	  new_stmt
> +	    = gimple_build_call (gather_decl, 5, mask_op, ptr, op, mask_op,
> +				 scale);
> +
> +	  if (!useless_type_conversion_p (vectype, rettype))
> +	    {
> +	      gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
> +			  == TYPE_VECTOR_SUBPARTS (rettype));
> +	      var = vect_get_new_vect_var (rettype, vect_simple_var, NULL);
> +	      op = make_ssa_name (var, new_stmt);
> +	      gimple_call_set_lhs (new_stmt, op);
> +	      vect_finish_stmt_generation (stmt, new_stmt, gsi);
> +	      var = make_ssa_name (vec_dest, NULL);
> +	      op = build1 (VIEW_CONVERT_EXPR, vectype, op);
> +	      new_stmt
> +		= gimple_build_assign_with_ops (VIEW_CONVERT_EXPR, var, op,
> +						NULL_TREE);
> +	    }
> +	  else
> +	    {
> +	      var = make_ssa_name (vec_dest, new_stmt);
> +	      gimple_call_set_lhs (new_stmt, var);
> +	    }
> +
> +	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
> +
> +	  if (modifier == NARROW)
> +	    {
> +	      if ((j & 1) == 0)
> +		{
> +		  prev_res = var;
> +		  continue;
> +		}
> +	      var = permute_vec_elements (prev_res, var,
> +					  perm_mask, stmt, gsi);
> +	      new_stmt = SSA_NAME_DEF_STMT (var);
> +	    }
> +
> +	  if (prev_stmt_info == NULL)
> +	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
> +	  else
> +	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
> +	  prev_stmt_info = vinfo_for_stmt (new_stmt);
> +	}
> +      return true;
> +    }
> +  else if (is_store)
> +    {
> +      tree vec_rhs = NULL_TREE, vec_mask = NULL_TREE;
> +      prev_stmt_info = NULL;
> +      for (i = 0; i < ncopies; i++)
> +	{
> +	  unsigned align, misalign;
> +
> +	  if (i == 0)
> +	    {
> +	      tree rhs = gimple_call_arg (stmt, 3);
> +	      vec_rhs = vect_get_vec_def_for_operand (rhs, stmt, NULL);
> +	      vec_mask = vect_get_vec_def_for_operand (mask, stmt, NULL);
> +	      /* We should have caught mismatched types earlier.  */
> +	      gcc_assert (useless_type_conversion_p (vectype,
> +						     TREE_TYPE (vec_rhs)));
> +	      dataref_ptr = vect_create_data_ref_ptr (stmt, vectype, NULL,
> +						      NULL_TREE, &dummy, gsi,
> +						      &ptr_incr, false, &inv_p);
> +	      gcc_assert (!inv_p);
> +	    }
> +	  else
> +	    {
> +	      vect_is_simple_use (vec_rhs, NULL, loop_vinfo, NULL, &def_stmt,
> +				  &def, &dt);
> +	      vec_rhs = vect_get_vec_def_for_stmt_copy (dt, vec_rhs);
> +	      vect_is_simple_use (vec_mask, NULL, loop_vinfo, NULL, &def_stmt,
> +				  &def, &dt);
> +	      vec_mask = vect_get_vec_def_for_stmt_copy (dt, vec_mask);
> +	      dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt,
> +					     TYPE_SIZE_UNIT (vectype));
> +	    }
> +
> +	  align = TYPE_ALIGN_UNIT (vectype);
> +	  if (aligned_access_p (dr))
> +	    misalign = 0;
> +	  else if (DR_MISALIGNMENT (dr) == -1)
> +	    {
> +	      align = TYPE_ALIGN_UNIT (elem_type);
> +	      misalign = 0;
> +	    }
> +	  else
> +	    misalign = DR_MISALIGNMENT (dr);
> +	  set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
> +				  misalign);
> +	  new_stmt
> +	    = gimple_build_call_internal (IFN_MASK_STORE, 4, dataref_ptr,
> +					  gimple_call_arg (stmt, 1),
> +					  vec_mask, vec_rhs);
> +	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
> +	  if (i == 0)
> +	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
> +	  else
> +	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
> +	  prev_stmt_info = vinfo_for_stmt (new_stmt);
> +	}
> +    }
> +  else
> +    {
> +      tree vec_mask = NULL_TREE;
> +      prev_stmt_info = NULL;
> +      vec_dest = vect_create_destination_var (gimple_call_lhs (stmt), vectype);
> +      for (i = 0; i < ncopies; i++)
> +	{
> +	  unsigned align, misalign;
> +
> +	  if (i == 0)
> +	    {
> +	      vec_mask = vect_get_vec_def_for_operand (mask, stmt, NULL);
> +	      dataref_ptr = vect_create_data_ref_ptr (stmt, vectype, NULL,
> +						      NULL_TREE, &dummy, gsi,
> +						      &ptr_incr, false, &inv_p);
> +	      gcc_assert (!inv_p);
> +	    }
> +	  else
> +	    {
> +	      vect_is_simple_use (vec_mask, NULL, loop_vinfo, NULL, &def_stmt,
> +				  &def, &dt);
> +	      vec_mask = vect_get_vec_def_for_stmt_copy (dt, vec_mask);
> +	      dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt,
> +					     TYPE_SIZE_UNIT (vectype));
> +	    }
> +
> +	  align = TYPE_ALIGN_UNIT (vectype);
> +	  if (aligned_access_p (dr))
> +	    misalign = 0;
> +	  else if (DR_MISALIGNMENT (dr) == -1)
> +	    {
> +	      align = TYPE_ALIGN_UNIT (elem_type);
> +	      misalign = 0;
> +	    }
> +	  else
> +	    misalign = DR_MISALIGNMENT (dr);
> +	  set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
> +				  misalign);
> +	  new_stmt
> +	    = gimple_build_call_internal (IFN_MASK_LOAD, 3, dataref_ptr,
> +					  gimple_call_arg (stmt, 1),
> +					  vec_mask);
> +	  gimple_call_set_lhs (new_stmt, make_ssa_name (vec_dest, NULL));
> +	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
> +	  if (i == 0)
> +	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
> +	  else
> +	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
> +	  prev_stmt_info = vinfo_for_stmt (new_stmt);
> +	}
> +    }
> +
> +  return true;
> +}
> +
> +
>  /* Function vectorizable_call.
>  
>     Check if STMT performs a function call that can be vectorized.
> @@ -1738,6 +2165,12 @@ vectorizable_call (gimple stmt, gimple_s
>    if (!is_gimple_call (stmt))
>      return false;
>  
> +  if (gimple_call_internal_p (stmt)
> +      && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
> +	  || gimple_call_internal_fn (stmt) == IFN_MASK_STORE))
> +    return vectorizable_mask_load_store (stmt, gsi, vec_stmt,
> +					 slp_node);
> +
>    if (gimple_call_lhs (stmt) == NULL_TREE
>        || TREE_CODE (gimple_call_lhs (stmt)) != SSA_NAME)
>      return false;
> @@ -4051,10 +4484,6 @@ vectorizable_shift (gimple stmt, gimple_
>  }
>  
>  
> -static tree permute_vec_elements (tree, tree, tree, gimple,
> -				  gimple_stmt_iterator *);
> -
> -
>  /* Function vectorizable_operation.
>  
>     Check if STMT performs a binary, unary or ternary operation that can
> @@ -6567,6 +6996,10 @@ vect_transform_stmt (gimple stmt, gimple
>      case call_vec_info_type:
>        done = vectorizable_call (stmt, gsi, &vec_stmt, slp_node);
>        stmt = gsi_stmt (*gsi);
> +      if (is_gimple_call (stmt)
> +	  && gimple_call_internal_p (stmt)
> +	  && gimple_call_internal_fn (stmt) == IFN_MASK_STORE)
> +	is_store = true;
>        break;
>  
>      case call_simd_clone_vec_info_type:
> --- gcc/tree-ssa-phiopt.c.jj	2013-11-22 21:03:14.569852057 +0100
> +++ gcc/tree-ssa-phiopt.c	2013-11-28 15:01:39.825688128 +0100
> @@ -1706,7 +1706,7 @@ cond_if_else_store_replacement (basic_bl
>          == chrec_dont_know)
>        || !then_datarefs.length ()
>        || (find_data_references_in_bb (NULL, else_bb, &else_datarefs)
> -        == chrec_dont_know)
> +	  == chrec_dont_know)
>        || !else_datarefs.length ())
>      {
>        free_data_refs (then_datarefs);
> @@ -1723,6 +1723,8 @@ cond_if_else_store_replacement (basic_bl
>  
>        then_store = DR_STMT (then_dr);
>        then_lhs = gimple_get_lhs (then_store);
> +      if (then_lhs == NULL_TREE)
> +	continue;
>        found = false;
>  
>        FOR_EACH_VEC_ELT (else_datarefs, j, else_dr)
> @@ -1732,6 +1734,8 @@ cond_if_else_store_replacement (basic_bl
>  
>            else_store = DR_STMT (else_dr);
>            else_lhs = gimple_get_lhs (else_store);
> +	  if (else_lhs == NULL_TREE)
> +	    continue;
>  
>            if (operand_equal_p (then_lhs, else_lhs, 0))
>              {
> 

The rest looks ok to me.

Thanks,
Richard.
Jakub Jelinek Dec. 6, 2013, 1:27 p.m. UTC | #3
On Fri, Dec 06, 2013 at 01:49:50PM +0100, Richard Biener wrote:
> Comments inline (scary large this patch for this stage ...)

Thanks.

> > +(define_expand "maskload<mode>"
> > +  [(set (match_operand:V48_AVX2 0 "register_operand")
> > +	(unspec:V48_AVX2
> > +	  [(match_operand:<sseintvecmode> 2 "register_operand")
> > +	   (match_operand:V48_AVX2 1 "memory_operand")]
> > +	  UNSPEC_MASKMOV))]
> > +  "TARGET_AVX")
> > +
> > +(define_expand "maskstore<mode>"
> > +  [(set (match_operand:V48_AVX2 0 "memory_operand")
> > +	(unspec:V48_AVX2
> > +	  [(match_operand:<sseintvecmode> 2 "register_operand")
> > +	   (match_operand:V48_AVX2 1 "register_operand")
> > +	   (match_dup 0)]
> > +	  UNSPEC_MASKMOV))]
> > +  "TARGET_AVX")
> > +
> >  (define_insn_and_split "avx_<castmode><avxsizesuffix>_<castmode>"
> >    [(set (match_operand:AVX256MODE2P 0 "nonimmediate_operand" "=x,m")
> >  	(unspec:AVX256MODE2P
> 
> x86 maintainers should comment here (ick - unspecs)

Well, the unspecs are preexisting (right now used by intrinsics only), I'm
just adding expanders that will expand to those instructions.
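
For illustration (this example is not part of the patch), the kind of loop
all of this is aimed at is a conditional memory access such as:

  void
  foo (int *a, const int *b, const int *c, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      if (c[i] > 0)
        a[i] = b[i] + 1;
  }

The store to a[i] must not happen when the condition is false, so it can't
simply be made unconditional; with the expanders above the vectorizer can
emit masked vector loads/stores for it instead.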

> > @@ -4386,16 +4396,35 @@ get_references_in_stmt (gimple stmt, vec
> >      {
> >        unsigned i, n;
> >  
> > -      op0 = gimple_call_lhs_ptr (stmt);
> > +      ref.is_read = false;
> > +      if (gimple_call_internal_p (stmt))
> > +	switch (gimple_call_internal_fn (stmt))
> > +	  {
> > +	  case IFN_MASK_LOAD:
> > +	    ref.is_read = true;
> > +	  case IFN_MASK_STORE:
> > +	    ref.ref = build2 (MEM_REF,
> > +			      ref.is_read
> > +			      ? TREE_TYPE (gimple_call_lhs (stmt))
> > +			      : TREE_TYPE (gimple_call_arg (stmt, 3)),
> > +			      gimple_call_arg (stmt, 0),
> > +			      gimple_call_arg (stmt, 1));
> > +	    references->safe_push (ref);
> 
> This may not be a canonical MEM_REF AFAIK, so you should
> use fold_build2 here (if the address is &a.b the .b needs folding

Ok, will try that.

> into the offset).  I assume the 2nd arg is always constant and
> thus doesn't change pointer-type during propagations?

Yes, it is, always created by
  ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
(and for vectorized IFN_MASK_* copied over from the non-vectorized
IFN_MASK_* call).
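
(For reference, the call shapes the patch creates in predicate_mem_writes
are roughly:

  lhs = IFN_MASK_LOAD (addr, alias_ptr, mask);
  IFN_MASK_STORE (addr, alias_ptr, mask, value);

so the MEM_REF above is rebuilt from arguments 0 and 1, and for the store
the value being stored is argument 3.)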

> > @@ -4464,7 +4493,7 @@ graphite_find_data_references_in_stmt (l
> >  
> >    FOR_EACH_VEC_ELT (references, i, ref)
> >      {
> > -      dr = create_data_ref (nest, loop, *ref->pos, stmt, ref->is_read);
> > +      dr = create_data_ref (nest, loop, ref->ref, stmt, ref->is_read);
> >        gcc_assert (dr != NULL);
> >        datarefs->safe_push (dr);
> >      }
> 
> Interesting that you succeeded in removing the indirection
> on ref.pos ... I remember trying that twice at least and
> failing ;)
> 
> You can install that as cleanup now if you split it out (so hopefully
> no users creep back that make removing it impossible).

Ok, will do.

> > +  /* Check whether this is a load or store.  */
> > +  lhs = gimple_assign_lhs (stmt);
> > +  if (TREE_CODE (lhs) != SSA_NAME)
> > +    {
> 
> gimple_store_p ()?

Likely.

> > +      if (!is_gimple_val (gimple_assign_rhs1 (stmt)))
> > +	return false;
> > +      op = maskstore_optab;
> > +      ref = lhs;
> > +    }
> > +  else if (gimple_assign_load_p (stmt))
> > +    {
> > +      op = maskload_optab;
> > +      ref = gimple_assign_rhs1 (stmt);
> > +    }
> > +  else
> > +    return false;
> > +
> > +  /* And whether REF isn't a MEM_REF with non-addressable decl.  */
> > +  if (TREE_CODE (ref) == MEM_REF
> > +      && TREE_CODE (TREE_OPERAND (ref, 0)) == ADDR_EXPR
> > +      && DECL_P (TREE_OPERAND (TREE_OPERAND (ref, 0), 0))
> > +      && !TREE_ADDRESSABLE (TREE_OPERAND (TREE_OPERAND (ref, 0), 0)))
> > +    return false;
> 
> I think that's overly conservative and not conservative enough.  Just
> use may_be_nonaddressable_p () (even though the implementation can
> need some TLC) and make sure to set TREE_ADDRESSABLE when you
> end up taking its address.

Will try.

> Please factor out the target bits into a predicate in optabs.c
> so you can reduce the amount of includes here.  You can eventually
> re-use that from the vectorization parts.

Okay.
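
A rough sketch of such a predicate (the name is made up, and it ignores the
autovectorize_vector_sizes fallback that ifcvt_can_use_mask_load_store also
tries) could look like:

  /* Return true if the target has a masked load (IS_LOAD) or masked
     store optab for the preferred vector mode corresponding to MODE.  */
  bool
  can_use_mask_load_store_p (enum machine_mode mode, bool is_load)
  {
    optab op = is_load ? maskload_optab : maskstore_optab;
    enum machine_mode vmode = targetm.vectorize.preferred_simd_mode (mode);
    return (VECTOR_MODE_P (vmode)
            && optab_handler (op, vmode) != CODE_FOR_nothing);
  }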

> > @@ -1404,7 +1530,8 @@ insert_gimplified_predicates (loop_p loo
> >        basic_block bb = ifc_bbs[i];
> >        gimple_seq stmts;
> >  
> > -      if (!is_predicated (bb))
> > +      if (!is_predicated (bb)
> > +	  || dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> 
> isn't that redundant now?

Will try (and read the corresponding threads and IRC about that).

> >        for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > -	if ((stmt = gsi_stmt (gsi))
> > -	    && gimple_assign_single_p (stmt)
> > -	    && gimple_vdef (stmt))
> > +	if ((stmt = gsi_stmt (gsi)) == NULL
> 
> I don't think gsi_stmt can be NULL

It can be if gsi_end_p, but that check is already in the loop condition.
It was preexisting code anyway, but I will change it.

> > +   will be then if-converted, the new copy of the loop will not,
> > +   and the LOOP_VECTORIZED internal call will be guarding which
> > +   loop to execute.  The vectorizer pass will fold this
> > +   internal call into either true or false.  */
> >  
> >  static bool
> > +version_loop_for_if_conversion (struct loop *loop, bool *do_outer)
> > +{
> 
> What's the do_outer parameter?

That is whether it should also version the outer loop.
Though, if we use the loop versioning + IFN_LOOP_VECTORIZED solely
for MASK_{LOAD,STORE} and perhaps MASK_CALL, we could avoid that and simply
live with not vectorizing those in nested loops; perhaps that occurs rarely
enough that it will not be a vectorization regression.  The reason I added
it is that when the versioning was used for all of if-conversion, we
regressed quite a few tests without it.  Perhaps the outer loop versioning
is too costly for too unlikely a case.  So, do you prefer to drop that, or
keep it?
> 
> This needs a comment with explaining what code you create.

Ok.
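
Roughly, the structure created is:

  if (LOOP_VECTORIZED (<num of original loop>, <num of scalar copy>))
    <original loop, which is then if-converted and hopefully vectorized>
  else
    <unmodified scalar copy, marked dont_vectorize>

and the vectorizer later folds the LOOP_VECTORIZED call to true if it
managed to vectorize the if-converted loop and to false otherwise, so that
cfgcleanup can remove whichever version is dead.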

> >  
> > -  return changed;
> > +  if (todo && version_outer_loop)
> > +    {
> > +      if (todo & TODO_update_ssa_only_virtuals)
> > +	{
> > +	  update_ssa (TODO_update_ssa_only_virtuals);
> > +	  todo &= ~TODO_update_ssa_only_virtuals;
> > +	}
> 
> Btw I hate that we do update_ssa multiple times per pass per
> function.  That makes us possibly worse than O(N^2) as update_ssa computes
> the IDF of the whole function.
> 
> This is something your patch introduces (it's only rewriting the
> virtuals, not the incremental SSA update by BB copying).

See above.  If you don't like the 2x loop versioning, it can be dropped.
That said, every loop versioning has one update_ssa anyway, so this
isn't making the complexity any worse, other than constant factor.

> > +  maskt = gimple_call_arg (stmt, 2);
> > +  lhs = gimple_call_lhs (stmt);
> > +  type = TREE_TYPE (lhs);
> > +  rhs = build2 (MEM_REF, type, gimple_call_arg (stmt, 0),
> > +		gimple_call_arg (stmt, 1));
> 
> That's possibly not canonical again, use fold_build2.

Ok, will try.

> > +	    gimple g = loop_vectorized_call;
> > +	    tree lhs = gimple_call_lhs (g);
> > +	    gimple_stmt_iterator gsi = gsi_for_stmt (g);
> > +	    gimplify_and_update_call_from_tree (&gsi, boolean_true_node);
> 
> plain update_call_from_tree should also work here, boolean_true_node
> is already gimple.

Will test.
> 
> > +	    gsi_next (&gsi);
> > +	    if (!gsi_end_p (gsi))
> > +	      {
> > +		g = gsi_stmt (gsi);
> > +		if (gimple_code (g) == GIMPLE_COND
> > +		    && gimple_cond_lhs (g) == lhs)
> > +		  {
> > +		    gimple_cond_set_lhs (g, boolean_true_node);
> 
> or simply replace all immediate uses of 'lhs' by boolean_true_node
> and remove the loop_vectorized call?

That would work, though removing the call, for example, is something
any DCE probably handles well, and the vectorizer relies on plenty of DCE
anyway.

> See above.  And factor this out into a function.  Also move this
> to the cleanup loop below.

Ok.

	Jakub
Richard Biener Dec. 6, 2013, 1:32 p.m. UTC | #4
On Fri, 6 Dec 2013, Jakub Jelinek wrote:

> On Fri, Dec 06, 2013 at 01:49:50PM +0100, Richard Biener wrote:
> > Comments inline (scary large this patch for this stage ...)
> 
> Thanks.
> 
> > > +(define_expand "maskload<mode>"
> > > +  [(set (match_operand:V48_AVX2 0 "register_operand")
> > > +	(unspec:V48_AVX2
> > > +	  [(match_operand:<sseintvecmode> 2 "register_operand")
> > > +	   (match_operand:V48_AVX2 1 "memory_operand")]
> > > +	  UNSPEC_MASKMOV))]
> > > +  "TARGET_AVX")
> > > +
> > > +(define_expand "maskstore<mode>"
> > > +  [(set (match_operand:V48_AVX2 0 "memory_operand")
> > > +	(unspec:V48_AVX2
> > > +	  [(match_operand:<sseintvecmode> 2 "register_operand")
> > > +	   (match_operand:V48_AVX2 1 "register_operand")
> > > +	   (match_dup 0)]
> > > +	  UNSPEC_MASKMOV))]
> > > +  "TARGET_AVX")
> > > +
> > >  (define_insn_and_split "avx_<castmode><avxsizesuffix>_<castmode>"
> > >    [(set (match_operand:AVX256MODE2P 0 "nonimmediate_operand" "=x,m")
> > >  	(unspec:AVX256MODE2P
> > 
> > x86 maintainers should comment here (ick - unspecs)
> 
> Well, the unspecs are preexisting (right now used by intrinsics only), I'm
> just adding expanders that will expand to those instructions.
> 
> > > @@ -4386,16 +4396,35 @@ get_references_in_stmt (gimple stmt, vec
> > >      {
> > >        unsigned i, n;
> > >  
> > > -      op0 = gimple_call_lhs_ptr (stmt);
> > > +      ref.is_read = false;
> > > +      if (gimple_call_internal_p (stmt))
> > > +	switch (gimple_call_internal_fn (stmt))
> > > +	  {
> > > +	  case IFN_MASK_LOAD:
> > > +	    ref.is_read = true;
> > > +	  case IFN_MASK_STORE:
> > > +	    ref.ref = build2 (MEM_REF,
> > > +			      ref.is_read
> > > +			      ? TREE_TYPE (gimple_call_lhs (stmt))
> > > +			      : TREE_TYPE (gimple_call_arg (stmt, 3)),
> > > +			      gimple_call_arg (stmt, 0),
> > > +			      gimple_call_arg (stmt, 1));
> > > +	    references->safe_push (ref);
> > 
> > This may not be a canonical MEM_REF AFAIK, so you should
> > use fold_build2 here (if the address is &a.b the .b needs folding
> 
> Ok, will try that.
> 
> > into the offset).  I assume the 2nd arg is always constant and
> > thus doesn't change pointer-type during propagations?
> 
> Yes, it is, always created by
>   ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
> (and for vectorized IFN_MASK_* copied over from the non-vectorized
> IFN_MASK_* call).
> 
> > > @@ -4464,7 +4493,7 @@ graphite_find_data_references_in_stmt (l
> > >  
> > >    FOR_EACH_VEC_ELT (references, i, ref)
> > >      {
> > > -      dr = create_data_ref (nest, loop, *ref->pos, stmt, ref->is_read);
> > > +      dr = create_data_ref (nest, loop, ref->ref, stmt, ref->is_read);
> > >        gcc_assert (dr != NULL);
> > >        datarefs->safe_push (dr);
> > >      }
> > 
> > Interesting that you succeeded in removing the indirection
> > on ref.pos ... I remember trying that twice at least and
> > failing ;)
> > 
> > You can install that as cleanup now if you split it out (so hopefully
> > no users creep back that make removing it impossible).
> 
> Ok, will do.
> 
> > > +  /* Check whether this is a load or store.  */
> > > +  lhs = gimple_assign_lhs (stmt);
> > > +  if (TREE_CODE (lhs) != SSA_NAME)
> > > +    {
> > 
> > gimple_store_p ()?
> 
> Likely.
> 
> > > +      if (!is_gimple_val (gimple_assign_rhs1 (stmt)))
> > > +	return false;
> > > +      op = maskstore_optab;
> > > +      ref = lhs;
> > > +    }
> > > +  else if (gimple_assign_load_p (stmt))
> > > +    {
> > > +      op = maskload_optab;
> > > +      ref = gimple_assign_rhs1 (stmt);
> > > +    }
> > > +  else
> > > +    return false;
> > > +
> > > +  /* And whether REF isn't a MEM_REF with non-addressable decl.  */
> > > +  if (TREE_CODE (ref) == MEM_REF
> > > +      && TREE_CODE (TREE_OPERAND (ref, 0)) == ADDR_EXPR
> > > +      && DECL_P (TREE_OPERAND (TREE_OPERAND (ref, 0), 0))
> > > +      && !TREE_ADDRESSABLE (TREE_OPERAND (TREE_OPERAND (ref, 0), 0)))
> > > +    return false;
> > 
> > I think that's overly conservative and not conservative enough.  Just
> > use may_be_nonaddressable_p () (even though the implementation can
> > need some TLC) and make sure to set TREE_ADDRESSABLE when you
> > end up taking its address.
> 
> Will try.
> 
> > Please factor out the target bits into a predicate in optabs.c
> > so you can reduce the amount of includes here.  You can eventually
> > re-use that from the vectorization parts.
> 
> Okay.
> 
> > > @@ -1404,7 +1530,8 @@ insert_gimplified_predicates (loop_p loo
> > >        basic_block bb = ifc_bbs[i];
> > >        gimple_seq stmts;
> > >  
> > > -      if (!is_predicated (bb))
> > > +      if (!is_predicated (bb)
> > > +	  || dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> > 
> > isn't that redundant now?
> 
> Will try (and read the corresponding threads and IRC about that).
> 
> > >        for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > -	if ((stmt = gsi_stmt (gsi))
> > > -	    && gimple_assign_single_p (stmt)
> > > -	    && gimple_vdef (stmt))
> > > +	if ((stmt = gsi_stmt (gsi)) == NULL
> > 
> > I don't think gsi_stmt can be NULL
> 
> It can be if gsi_end_p, but that check is already in the loop condition.
> It was preexisting code anyway, but I will change it.
> 
> > > +   will be then if-converted, the new copy of the loop will not,
> > > +   and the LOOP_VECTORIZED internal call will be guarding which
> > > +   loop to execute.  The vectorizer pass will fold this
> > > +   internal call into either true or false.  */
> > >  
> > >  static bool
> > > +version_loop_for_if_conversion (struct loop *loop, bool *do_outer)
> > > +{
> > 
> > What's the do_outer parameter?
> 
> That is whether it should also version the outer loop.
> Though, if we use the loop versioning + IFN_LOOP_VECTORIZED solely
> for MASK_{LOAD,STORE} and perhaps MASK_CALL, we could avoid that and simply
> live with not vectorizing those in nested loops; perhaps that occurs rarely
> enough that it will not be a vectorization regression.  The reason I added
> it is that when the versioning was used for all of if-conversion, we
> regressed quite a few tests without it.  Perhaps the outer loop versioning
> is too costly for too unlikely a case.  So, do you prefer to drop that, or
> keep it?

If it's easy to rip out, it looks like it can simplify the patch at this
stage, which is good.

> > 
> > This needs a comment with explaining what code you create.
> 
> Ok.
> 
> > >  
> > > -  return changed;
> > > +  if (todo && version_outer_loop)
> > > +    {
> > > +      if (todo & TODO_update_ssa_only_virtuals)
> > > +	{
> > > +	  update_ssa (TODO_update_ssa_only_virtuals);
> > > +	  todo &= ~TODO_update_ssa_only_virtuals;
> > > +	}
> > 
> > Btw I hate that we do update_ssa multiple times per pass per
> > function.  That makes us possibly worse than O(N^2) as update_ssa computes
> > the IDF of the whole function.
> > 
> > This is something your patch introduces (it's only rewriting the
> > virtuals, not the incremental SSA update by BB copying).
> 
> See above.  If you don't like the 2x loop versioning, it can be dropped.
> That said, every loop versioning has one update_ssa anyway, so this
> isn't making the complexity any worse, other than constant factor.

Yeah.  On my TODO list is still cheaper loop versioning (including
getting rid of the slpeel_* stuff and having only a single SESE
copying machinery).

> > > +  maskt = gimple_call_arg (stmt, 2);
> > > +  lhs = gimple_call_lhs (stmt);
> > > +  type = TREE_TYPE (lhs);
> > > +  rhs = build2 (MEM_REF, type, gimple_call_arg (stmt, 0),
> > > +		gimple_call_arg (stmt, 1));
> > 
> > That's possibly not canonical again, use fold_build2.
> 
> Ok, will try.
> 
> > > +	    gimple g = loop_vectorized_call;
> > > +	    tree lhs = gimple_call_lhs (g);
> > > +	    gimple_stmt_iterator gsi = gsi_for_stmt (g);
> > > +	    gimplify_and_update_call_from_tree (&gsi, boolean_true_node);
> > 
> > plain update_call_from_tree should also work here, boolean_true_node
> > is already gimple.
> 
> Will test.
> > 
> > > +	    gsi_next (&gsi);
> > > +	    if (!gsi_end_p (gsi))
> > > +	      {
> > > +		g = gsi_stmt (gsi);
> > > +		if (gimple_code (g) == GIMPLE_COND
> > > +		    && gimple_cond_lhs (g) == lhs)
> > > +		  {
> > > +		    gimple_cond_set_lhs (g, boolean_true_node);
> > 
> > or simply replace all immediate uses of 'lhs' by boolean_true_node
> > and remove the loop_vectorized call?
> 
> That would work, though removing the call, for example, is something
> any DCE probably handles well, and the vectorizer relies on plenty of DCE
> anyway.

Well, but you handle forwarding into the COND_EXPR here anyway
(probably so that cfgcleanup removes it), so simply forward it
to all uses.

> > See above.  And factor this out into a function.  Also move this
> > to the cleanup loop below.
> 
> Ok.

Thanks.

Richard.
diff mbox

Patch

--- gcc/tree-vectorizer.h.jj	2013-11-28 09:18:11.771774932 +0100
+++ gcc/tree-vectorizer.h	2013-11-28 14:14:35.827362293 +0100
@@ -344,6 +344,10 @@  typedef struct _loop_vec_info {
      fix it up.  */
   bool operands_swapped;
 
+  /* If if-conversion versioned this loop before conversion, this is the
+     loop version without if-conversion.  */
+  struct loop *scalar_loop;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -376,6 +380,7 @@  typedef struct _loop_vec_info {
 #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
 #define LOOP_VINFO_OPERANDS_SWAPPED(L)     (L)->operands_swapped
 #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
+#define LOOP_VINFO_SCALAR_LOOP(L)	   (L)->scalar_loop
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \
   (L)->may_misalign_stmts.length () > 0
@@ -934,7 +939,8 @@  extern source_location vect_location;
    in tree-vect-loop-manip.c.  */
 extern void slpeel_make_loop_iterate_ntimes (struct loop *, tree);
 extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge);
-struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *, edge);
+struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *,
+						     struct loop *, edge);
 extern void vect_loop_versioning (loop_vec_info, unsigned int, bool);
 extern void vect_do_peeling_for_loop_bound (loop_vec_info, tree, tree,
 					    unsigned int, bool);
--- gcc/config/i386/sse.md.jj	2013-11-23 15:20:47.452606456 +0100
+++ gcc/config/i386/sse.md	2013-11-28 14:13:57.562572366 +0100
@@ -14218,6 +14218,23 @@  (define_insn "<avx_avx2>_maskstore<ssemo
    (set_attr "btver2_decode" "vector") 
    (set_attr "mode" "<sseinsnmode>")])
 
+(define_expand "maskload<mode>"
+  [(set (match_operand:V48_AVX2 0 "register_operand")
+	(unspec:V48_AVX2
+	  [(match_operand:<sseintvecmode> 2 "register_operand")
+	   (match_operand:V48_AVX2 1 "memory_operand")]
+	  UNSPEC_MASKMOV))]
+  "TARGET_AVX")
+
+(define_expand "maskstore<mode>"
+  [(set (match_operand:V48_AVX2 0 "memory_operand")
+	(unspec:V48_AVX2
+	  [(match_operand:<sseintvecmode> 2 "register_operand")
+	   (match_operand:V48_AVX2 1 "register_operand")
+	   (match_dup 0)]
+	  UNSPEC_MASKMOV))]
+  "TARGET_AVX")
+
 (define_insn_and_split "avx_<castmode><avxsizesuffix>_<castmode>"
   [(set (match_operand:AVX256MODE2P 0 "nonimmediate_operand" "=x,m")
 	(unspec:AVX256MODE2P
--- gcc/tree-data-ref.c.jj	2013-11-27 18:02:48.050814182 +0100
+++ gcc/tree-data-ref.c	2013-11-28 14:13:57.592572476 +0100
@@ -4320,8 +4320,8 @@  compute_all_dependences (vec<data_refere
 
 typedef struct data_ref_loc_d
 {
-  /* Position of the memory reference.  */
-  tree *pos;
+  /* The memory reference.  */
+  tree ref;
 
   /* True if the memory reference is read.  */
   bool is_read;
@@ -4336,7 +4336,7 @@  get_references_in_stmt (gimple stmt, vec
 {
   bool clobbers_memory = false;
   data_ref_loc ref;
-  tree *op0, *op1;
+  tree op0, op1;
   enum gimple_code stmt_code = gimple_code (stmt);
 
   /* ASM_EXPR and CALL_EXPR may embed arbitrary side effects.
@@ -4346,16 +4346,26 @@  get_references_in_stmt (gimple stmt, vec
       && !(gimple_call_flags (stmt) & ECF_CONST))
     {
       /* Allow IFN_GOMP_SIMD_LANE in their own loops.  */
-      if (gimple_call_internal_p (stmt)
-	  && gimple_call_internal_fn (stmt) == IFN_GOMP_SIMD_LANE)
-	{
-	  struct loop *loop = gimple_bb (stmt)->loop_father;
-	  tree uid = gimple_call_arg (stmt, 0);
-	  gcc_assert (TREE_CODE (uid) == SSA_NAME);
-	  if (loop == NULL
-	      || loop->simduid != SSA_NAME_VAR (uid))
+      if (gimple_call_internal_p (stmt))
+	switch (gimple_call_internal_fn (stmt))
+	  {
+	  case IFN_GOMP_SIMD_LANE:
+	    {
+	      struct loop *loop = gimple_bb (stmt)->loop_father;
+	      tree uid = gimple_call_arg (stmt, 0);
+	      gcc_assert (TREE_CODE (uid) == SSA_NAME);
+	      if (loop == NULL
+		  || loop->simduid != SSA_NAME_VAR (uid))
+		clobbers_memory = true;
+	      break;
+	    }
+	  case IFN_MASK_LOAD:
+	  case IFN_MASK_STORE:
+	    break;
+	  default:
 	    clobbers_memory = true;
-	}
+	    break;
+	  }
       else
 	clobbers_memory = true;
     }
@@ -4369,15 +4379,15 @@  get_references_in_stmt (gimple stmt, vec
   if (stmt_code == GIMPLE_ASSIGN)
     {
       tree base;
-      op0 = gimple_assign_lhs_ptr (stmt);
-      op1 = gimple_assign_rhs1_ptr (stmt);
+      op0 = gimple_assign_lhs (stmt);
+      op1 = gimple_assign_rhs1 (stmt);
 
-      if (DECL_P (*op1)
-	  || (REFERENCE_CLASS_P (*op1)
-	      && (base = get_base_address (*op1))
+      if (DECL_P (op1)
+	  || (REFERENCE_CLASS_P (op1)
+	      && (base = get_base_address (op1))
 	      && TREE_CODE (base) != SSA_NAME))
 	{
-	  ref.pos = op1;
+	  ref.ref = op1;
 	  ref.is_read = true;
 	  references->safe_push (ref);
 	}
@@ -4386,16 +4396,35 @@  get_references_in_stmt (gimple stmt, vec
     {
       unsigned i, n;
 
-      op0 = gimple_call_lhs_ptr (stmt);
+      ref.is_read = false;
+      if (gimple_call_internal_p (stmt))
+	switch (gimple_call_internal_fn (stmt))
+	  {
+	  case IFN_MASK_LOAD:
+	    ref.is_read = true;
+	  case IFN_MASK_STORE:
+	    ref.ref = build2 (MEM_REF,
+			      ref.is_read
+			      ? TREE_TYPE (gimple_call_lhs (stmt))
+			      : TREE_TYPE (gimple_call_arg (stmt, 3)),
+			      gimple_call_arg (stmt, 0),
+			      gimple_call_arg (stmt, 1));
+	    references->safe_push (ref);
+	    return false;
+	  default:
+	    break;
+	  }
+
+      op0 = gimple_call_lhs (stmt);
       n = gimple_call_num_args (stmt);
       for (i = 0; i < n; i++)
 	{
-	  op1 = gimple_call_arg_ptr (stmt, i);
+	  op1 = gimple_call_arg (stmt, i);
 
-	  if (DECL_P (*op1)
-	      || (REFERENCE_CLASS_P (*op1) && get_base_address (*op1)))
+	  if (DECL_P (op1)
+	      || (REFERENCE_CLASS_P (op1) && get_base_address (op1)))
 	    {
-	      ref.pos = op1;
+	      ref.ref = op1;
 	      ref.is_read = true;
 	      references->safe_push (ref);
 	    }
@@ -4404,11 +4433,11 @@  get_references_in_stmt (gimple stmt, vec
   else
     return clobbers_memory;
 
-  if (*op0
-      && (DECL_P (*op0)
-	  || (REFERENCE_CLASS_P (*op0) && get_base_address (*op0))))
+  if (op0
+      && (DECL_P (op0)
+	  || (REFERENCE_CLASS_P (op0) && get_base_address (op0))))
     {
-      ref.pos = op0;
+      ref.ref = op0;
       ref.is_read = false;
       references->safe_push (ref);
     }
@@ -4435,7 +4464,7 @@  find_data_references_in_stmt (struct loo
   FOR_EACH_VEC_ELT (references, i, ref)
     {
       dr = create_data_ref (nest, loop_containing_stmt (stmt),
-			    *ref->pos, stmt, ref->is_read);
+			    ref->ref, stmt, ref->is_read);
       gcc_assert (dr != NULL);
       datarefs->safe_push (dr);
     }
@@ -4464,7 +4493,7 @@  graphite_find_data_references_in_stmt (l
 
   FOR_EACH_VEC_ELT (references, i, ref)
     {
-      dr = create_data_ref (nest, loop, *ref->pos, stmt, ref->is_read);
+      dr = create_data_ref (nest, loop, ref->ref, stmt, ref->is_read);
       gcc_assert (dr != NULL);
       datarefs->safe_push (dr);
     }
--- gcc/internal-fn.def.jj	2013-11-26 21:36:14.018329932 +0100
+++ gcc/internal-fn.def	2013-11-28 14:13:57.517569949 +0100
@@ -43,5 +43,8 @@  DEF_INTERNAL_FN (STORE_LANES, ECF_CONST
 DEF_INTERNAL_FN (GOMP_SIMD_LANE, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW)
 DEF_INTERNAL_FN (GOMP_SIMD_VF, ECF_CONST | ECF_LEAF | ECF_NOTHROW)
 DEF_INTERNAL_FN (GOMP_SIMD_LAST_LANE, ECF_CONST | ECF_LEAF | ECF_NOTHROW)
+DEF_INTERNAL_FN (LOOP_VECTORIZED, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW)
+DEF_INTERNAL_FN (MASK_LOAD, ECF_PURE | ECF_LEAF)
+DEF_INTERNAL_FN (MASK_STORE, ECF_LEAF)
 DEF_INTERNAL_FN (ANNOTATE,  ECF_CONST | ECF_LEAF | ECF_NOTHROW)
 DEF_INTERNAL_FN (UBSAN_NULL, ECF_LEAF | ECF_NOTHROW)
--- gcc/tree-if-conv.c.jj	2013-11-22 21:03:14.527852266 +0100
+++ gcc/tree-if-conv.c	2013-11-28 14:13:57.668572084 +0100
@@ -110,8 +110,12 @@  along with GCC; see the file COPYING3.
 #include "tree-chrec.h"
 #include "tree-data-ref.h"
 #include "tree-scalar-evolution.h"
+#include "tree-ssa-address.h"
 #include "tree-pass.h"
 #include "dbgcnt.h"
+#include "target.h"
+#include "expr.h"
+#include "optabs.h"
 
 /* List of basic blocks in if-conversion-suitable order.  */
 static basic_block *ifc_bbs;
@@ -194,39 +198,48 @@  init_bb_predicate (basic_block bb)
   set_bb_predicate (bb, boolean_true_node);
 }
 
-/* Free the predicate of basic block BB.  */
+/* Release the SSA_NAMEs associated with the predicate of basic block BB,
+   but don't actually free it.  */
 
 static inline void
-free_bb_predicate (basic_block bb)
+release_bb_predicate (basic_block bb)
 {
-  gimple_seq stmts;
-
-  if (!bb_has_predicate (bb))
-    return;
-
-  /* Release the SSA_NAMEs created for the gimplification of the
-     predicate.  */
-  stmts = bb_predicate_gimplified_stmts (bb);
+  gimple_seq stmts = bb_predicate_gimplified_stmts (bb);
   if (stmts)
     {
       gimple_stmt_iterator i;
 
       for (i = gsi_start (stmts); !gsi_end_p (i); gsi_next (&i))
 	free_stmt_operands (gsi_stmt (i));
+      set_bb_predicate_gimplified_stmts (bb, NULL);
     }
+}
 
+/* Free the predicate of basic block BB.  */
+
+static inline void
+free_bb_predicate (basic_block bb)
+{
+  if (!bb_has_predicate (bb))
+    return;
+
+  release_bb_predicate (bb);
   free (bb->aux);
   bb->aux = NULL;
 }
 
-/* Free the predicate of BB and reinitialize it with the true
-   predicate.  */
+/* Reinitialize predicate of BB with the true predicate.  */
 
 static inline void
 reset_bb_predicate (basic_block bb)
 {
-  free_bb_predicate (bb);
-  init_bb_predicate (bb);
+  if (!bb_has_predicate (bb))
+    init_bb_predicate (bb);
+  else
+    {
+      release_bb_predicate (bb);
+      set_bb_predicate (bb, boolean_true_node);
+    }
 }
 
 /* Returns a new SSA_NAME of type TYPE that is assigned the value of
@@ -464,7 +477,8 @@  bb_with_exit_edge_p (struct loop *loop,
    - there is a virtual PHI in a BB other than the loop->header.  */
 
 static bool
-if_convertible_phi_p (struct loop *loop, basic_block bb, gimple phi)
+if_convertible_phi_p (struct loop *loop, basic_block bb, gimple phi,
+		      bool any_mask_load_store)
 {
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -479,7 +493,7 @@  if_convertible_phi_p (struct loop *loop,
       return false;
     }
 
-  if (flag_tree_loop_if_convert_stores)
+  if (flag_tree_loop_if_convert_stores || any_mask_load_store)
     return true;
 
   /* When the flag_tree_loop_if_convert_stores is not set, check
@@ -695,6 +709,78 @@  ifcvt_could_trap_p (gimple stmt, vec<dat
   return gimple_could_trap_p (stmt);
 }
 
+/* Return true if STMT could be converted into a masked load or store
+   (conditional load or store based on a mask computed from bb predicate).  */
+
+static bool
+ifcvt_can_use_mask_load_store (gimple stmt)
+{
+  tree lhs, ref;
+  enum machine_mode mode, vmode;
+  optab op;
+  basic_block bb = gimple_bb (stmt);
+  unsigned int vector_sizes;
+
+  if (!(flag_tree_loop_vectorize || bb->loop_father->force_vect)
+      || bb->loop_father->dont_vectorize
+      || !gimple_assign_single_p (stmt)
+      || gimple_has_volatile_ops (stmt))
+    return false;
+
+  /* Check whether this is a load or store.  */
+  lhs = gimple_assign_lhs (stmt);
+  if (TREE_CODE (lhs) != SSA_NAME)
+    {
+      if (!is_gimple_val (gimple_assign_rhs1 (stmt)))
+	return false;
+      op = maskstore_optab;
+      ref = lhs;
+    }
+  else if (gimple_assign_load_p (stmt))
+    {
+      op = maskload_optab;
+      ref = gimple_assign_rhs1 (stmt);
+    }
+  else
+    return false;
+
+  /* And whether REF isn't a MEM_REF with non-addressable decl.  */
+  if (TREE_CODE (ref) == MEM_REF
+      && TREE_CODE (TREE_OPERAND (ref, 0)) == ADDR_EXPR
+      && DECL_P (TREE_OPERAND (TREE_OPERAND (ref, 0), 0))
+      && !TREE_ADDRESSABLE (TREE_OPERAND (TREE_OPERAND (ref, 0), 0)))
+    return false;
+
+  /* Mask should be integer mode of the same size as the load/store
+     mode.  */
+  mode = TYPE_MODE (TREE_TYPE (lhs));
+  if (int_mode_for_mode (mode) == BLKmode)
+    return false;
+
+  /* See if there is any chance the mask load or store might be
+     vectorized.  If not, punt.  */
+  vmode = targetm.vectorize.preferred_simd_mode (mode);
+  if (!VECTOR_MODE_P (vmode))
+    return false;
+
+  if (optab_handler (op, vmode) != CODE_FOR_nothing)
+    return true;
+
+  vector_sizes = targetm.vectorize.autovectorize_vector_sizes ();
+  while (vector_sizes != 0)
+    {
+      unsigned int cur = 1 << floor_log2 (vector_sizes);
+      vector_sizes &= ~cur;
+      if (cur <= GET_MODE_SIZE (mode))
+	continue;
+      vmode = mode_for_vector (mode, cur / GET_MODE_SIZE (mode));
+      if (VECTOR_MODE_P (vmode)
+	  && optab_handler (op, vmode) != CODE_FOR_nothing)
+	return true;
+    }
+  return false;
+}
+
 /* Return true when STMT is if-convertible.
 
    GIMPLE_ASSIGN statement is not if-convertible if,
@@ -704,7 +790,8 @@  ifcvt_could_trap_p (gimple stmt, vec<dat
 
 static bool
 if_convertible_gimple_assign_stmt_p (gimple stmt,
-				     vec<data_reference_p> refs)
+				     vec<data_reference_p> refs,
+				     bool *any_mask_load_store)
 {
   tree lhs = gimple_assign_lhs (stmt);
   basic_block bb;
@@ -730,10 +817,21 @@  if_convertible_gimple_assign_stmt_p (gim
       return false;
     }
 
+  /* tree-into-ssa.c uses GF_PLF_1, so avoid it, because
+     in between if_convertible_loop_p and combine_blocks
+     we can perform loop versioning.  */
+  gimple_set_plf (stmt, GF_PLF_2, false);
+
   if (flag_tree_loop_if_convert_stores)
     {
       if (ifcvt_could_trap_p (stmt, refs))
 	{
+	  if (ifcvt_can_use_mask_load_store (stmt))
+	    {
+	      gimple_set_plf (stmt, GF_PLF_2, true);
+	      *any_mask_load_store = true;
+	      return true;
+	    }
 	  if (dump_file && (dump_flags & TDF_DETAILS))
 	    fprintf (dump_file, "tree could trap...\n");
 	  return false;
@@ -743,6 +841,12 @@  if_convertible_gimple_assign_stmt_p (gim
 
   if (gimple_assign_rhs_could_trap_p (stmt))
     {
+      if (ifcvt_can_use_mask_load_store (stmt))
+	{
+	  gimple_set_plf (stmt, GF_PLF_2, true);
+	  *any_mask_load_store = true;
+	  return true;
+	}
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file, "tree could trap...\n");
       return false;
@@ -754,6 +858,12 @@  if_convertible_gimple_assign_stmt_p (gim
       && bb != bb->loop_father->header
       && !bb_with_exit_edge_p (bb->loop_father, bb))
     {
+      if (ifcvt_can_use_mask_load_store (stmt))
+	{
+	  gimple_set_plf (stmt, GF_PLF_2, true);
+	  *any_mask_load_store = true;
+	  return true;
+	}
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
 	  fprintf (dump_file, "LHS is not var\n");
@@ -772,7 +882,8 @@  if_convertible_gimple_assign_stmt_p (gim
    - it is a GIMPLE_LABEL or a GIMPLE_COND.  */
 
 static bool
-if_convertible_stmt_p (gimple stmt, vec<data_reference_p> refs)
+if_convertible_stmt_p (gimple stmt, vec<data_reference_p> refs,
+		       bool *any_mask_load_store)
 {
   switch (gimple_code (stmt))
     {
@@ -782,7 +893,8 @@  if_convertible_stmt_p (gimple stmt, vec<
       return true;
 
     case GIMPLE_ASSIGN:
-      return if_convertible_gimple_assign_stmt_p (stmt, refs);
+      return if_convertible_gimple_assign_stmt_p (stmt, refs,
+						  any_mask_load_store);
 
     case GIMPLE_CALL:
       {
@@ -984,7 +1096,7 @@  get_loop_body_in_if_conv_order (const st
    S1 will be predicated with "x", and
    S2 will be predicated with "!x".  */
 
-static bool
+static void
 predicate_bbs (loop_p loop)
 {
   unsigned int i;
@@ -996,7 +1108,7 @@  predicate_bbs (loop_p loop)
     {
       basic_block bb = ifc_bbs[i];
       tree cond;
-      gimple_stmt_iterator itr;
+      gimple stmt;
 
       /* The loop latch is always executed and has no extra conditions
 	 to be processed: skip it.  */
@@ -1006,53 +1118,38 @@  predicate_bbs (loop_p loop)
 	  continue;
 	}
 
+      /* If dominance tells us this basic block is always executed, force
+	 the condition to be true, this might help simplify other
+	 conditions.  */
+      if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
+	reset_bb_predicate (bb);
       cond = bb_predicate (bb);
-
-      for (itr = gsi_start_bb (bb); !gsi_end_p (itr); gsi_next (&itr))
+      stmt = last_stmt (bb);
+      if (stmt && gimple_code (stmt) == GIMPLE_COND)
 	{
-	  gimple stmt = gsi_stmt (itr);
-
-	  switch (gimple_code (stmt))
-	    {
-	    case GIMPLE_LABEL:
-	    case GIMPLE_ASSIGN:
-	    case GIMPLE_CALL:
-	    case GIMPLE_DEBUG:
-	      break;
-
-	    case GIMPLE_COND:
-	      {
-		tree c2;
-		edge true_edge, false_edge;
-		location_t loc = gimple_location (stmt);
-		tree c = fold_build2_loc (loc, gimple_cond_code (stmt),
-					  boolean_type_node,
-					  gimple_cond_lhs (stmt),
-					  gimple_cond_rhs (stmt));
-
-		/* Add new condition into destination's predicate list.  */
-		extract_true_false_edges_from_block (gimple_bb (stmt),
-						     &true_edge, &false_edge);
-
-		/* If C is true, then TRUE_EDGE is taken.  */
-		add_to_dst_predicate_list (loop, true_edge,
-					   unshare_expr (cond),
-					   unshare_expr (c));
-
-		/* If C is false, then FALSE_EDGE is taken.  */
-		c2 = build1_loc (loc, TRUTH_NOT_EXPR,
-				 boolean_type_node, unshare_expr (c));
-		add_to_dst_predicate_list (loop, false_edge,
-					   unshare_expr (cond), c2);
-
-		cond = NULL_TREE;
-		break;
-	      }
+	  tree c2;
+	  edge true_edge, false_edge;
+	  location_t loc = gimple_location (stmt);
+	  tree c = fold_build2_loc (loc, gimple_cond_code (stmt),
+				    boolean_type_node,
+				    gimple_cond_lhs (stmt),
+				    gimple_cond_rhs (stmt));
+
+	  /* Add new condition into destination's predicate list.  */
+	  extract_true_false_edges_from_block (gimple_bb (stmt),
+					       &true_edge, &false_edge);
+
+	  /* If C is true, then TRUE_EDGE is taken.  */
+	  add_to_dst_predicate_list (loop, true_edge, unshare_expr (cond),
+				     unshare_expr (c));
+
+	  /* If C is false, then FALSE_EDGE is taken.  */
+	  c2 = build1_loc (loc, TRUTH_NOT_EXPR, boolean_type_node,
+			   unshare_expr (c));
+	  add_to_dst_predicate_list (loop, false_edge,
+				     unshare_expr (cond), c2);
 
-	    default:
-	      /* Not handled yet in if-conversion.  */
-	      return false;
-	    }
+	  cond = NULL_TREE;
 	}
 
       /* If current bb has only one successor, then consider it as an
@@ -1075,8 +1172,6 @@  predicate_bbs (loop_p loop)
   reset_bb_predicate (loop->header);
   gcc_assert (bb_predicate_gimplified_stmts (loop->header) == NULL
 	      && bb_predicate_gimplified_stmts (loop->latch) == NULL);
-
-  return true;
 }
 
 /* Return true when LOOP is if-convertible.  This is a helper function
@@ -1087,7 +1182,7 @@  static bool
 if_convertible_loop_p_1 (struct loop *loop,
 			 vec<loop_p> *loop_nest,
 			 vec<data_reference_p> *refs,
-			 vec<ddr_p> *ddrs)
+			 vec<ddr_p> *ddrs, bool *any_mask_load_store)
 {
   bool res;
   unsigned int i;
@@ -1121,9 +1216,24 @@  if_convertible_loop_p_1 (struct loop *lo
 	exit_bb = bb;
     }
 
-  res = predicate_bbs (loop);
-  if (!res)
-    return false;
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = ifc_bbs[i];
+      gimple_stmt_iterator gsi;
+
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	switch (gimple_code (gsi_stmt (gsi)))
+	  {
+	  case GIMPLE_LABEL:
+	  case GIMPLE_ASSIGN:
+	  case GIMPLE_CALL:
+	  case GIMPLE_DEBUG:
+	  case GIMPLE_COND:
+	    break;
+	  default:
+	    return false;
+	  }
+    }
 
   if (flag_tree_loop_if_convert_stores)
     {
@@ -1135,6 +1245,7 @@  if_convertible_loop_p_1 (struct loop *lo
 	  DR_WRITTEN_AT_LEAST_ONCE (dr) = -1;
 	  DR_RW_UNCONDITIONALLY (dr) = -1;
 	}
+      predicate_bbs (loop);
     }
 
   for (i = 0; i < loop->num_nodes; i++)
@@ -1142,17 +1253,31 @@  if_convertible_loop_p_1 (struct loop *lo
       basic_block bb = ifc_bbs[i];
       gimple_stmt_iterator itr;
 
-      for (itr = gsi_start_phis (bb); !gsi_end_p (itr); gsi_next (&itr))
-	if (!if_convertible_phi_p (loop, bb, gsi_stmt (itr)))
-	  return false;
-
       /* Check the if-convertibility of statements in predicated BBs.  */
-      if (is_predicated (bb))
+      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
 	for (itr = gsi_start_bb (bb); !gsi_end_p (itr); gsi_next (&itr))
-	  if (!if_convertible_stmt_p (gsi_stmt (itr), *refs))
+	  if (!if_convertible_stmt_p (gsi_stmt (itr), *refs,
+				      any_mask_load_store))
 	    return false;
     }
 
+  if (flag_tree_loop_if_convert_stores)
+    for (i = 0; i < loop->num_nodes; i++)
+      free_bb_predicate (ifc_bbs[i]);
+
+  /* Checking PHIs needs to be done after stmts, as the fact whether there
+     are any masked loads or stores affects the tests.  */
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = ifc_bbs[i];
+      gimple_stmt_iterator itr;
+
+      for (itr = gsi_start_phis (bb); !gsi_end_p (itr); gsi_next (&itr))
+	if (!if_convertible_phi_p (loop, bb, gsi_stmt (itr),
+				   *any_mask_load_store))
+	  return false;
+    }
+
   if (dump_file)
     fprintf (dump_file, "Applying if-conversion\n");
 
@@ -1168,7 +1293,7 @@  if_convertible_loop_p_1 (struct loop *lo
    - if its basic blocks and phi nodes are if convertible.  */
 
 static bool
-if_convertible_loop_p (struct loop *loop)
+if_convertible_loop_p (struct loop *loop, bool *any_mask_load_store)
 {
   edge e;
   edge_iterator ei;
@@ -1209,7 +1334,8 @@  if_convertible_loop_p (struct loop *loop
   refs.create (5);
   ddrs.create (25);
   stack_vec<loop_p, 3> loop_nest;
-  res = if_convertible_loop_p_1 (loop, &loop_nest, &refs, &ddrs);
+  res = if_convertible_loop_p_1 (loop, &loop_nest, &refs, &ddrs,
+				 any_mask_load_store);
 
   if (flag_tree_loop_if_convert_stores)
     {
@@ -1395,7 +1521,7 @@  predicate_all_scalar_phis (struct loop *
    gimplification of the predicates.  */
 
 static void
-insert_gimplified_predicates (loop_p loop)
+insert_gimplified_predicates (loop_p loop, bool any_mask_load_store)
 {
   unsigned int i;
 
@@ -1404,7 +1530,8 @@  insert_gimplified_predicates (loop_p loo
       basic_block bb = ifc_bbs[i];
       gimple_seq stmts;
 
-      if (!is_predicated (bb))
+      if (!is_predicated (bb)
+	  || dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
 	{
 	  /* Do not insert statements for a basic block that is not
 	     predicated.  Also make sure that the predicate of the
@@ -1416,7 +1543,8 @@  insert_gimplified_predicates (loop_p loo
       stmts = bb_predicate_gimplified_stmts (bb);
       if (stmts)
 	{
-	  if (flag_tree_loop_if_convert_stores)
+	  if (flag_tree_loop_if_convert_stores
+	      || any_mask_load_store)
 	    {
 	      /* Insert the predicate of the BB just after the label,
 		 as the if-conversion of memory writes will use this
@@ -1575,9 +1703,49 @@  predicate_mem_writes (loop_p loop)
 	}
 
       for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
-	if ((stmt = gsi_stmt (gsi))
-	    && gimple_assign_single_p (stmt)
-	    && gimple_vdef (stmt))
+	if ((stmt = gsi_stmt (gsi)) == NULL
+	    || !gimple_assign_single_p (stmt))
+	  continue;
+	else if (gimple_plf (stmt, GF_PLF_2))
+	  {
+	    tree lhs = gimple_assign_lhs (stmt);
+	    tree rhs = gimple_assign_rhs1 (stmt);
+	    tree ref, addr, ptr, masktype, mask_op0, mask_op1, mask;
+	    gimple new_stmt;
+	    int bitsize = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (lhs)));
+
+	    masktype = build_nonstandard_integer_type (bitsize, 1);
+	    mask_op0 = build_int_cst (masktype, swap ? 0 : -1);
+	    mask_op1 = build_int_cst (masktype, swap ? -1 : 0);
+	    ref = TREE_CODE (lhs) == SSA_NAME ? rhs : lhs;
+	    addr = force_gimple_operand_gsi (&gsi, build_fold_addr_expr (ref),
+					     true, NULL_TREE, true,
+					     GSI_SAME_STMT);
+	    cond = force_gimple_operand_gsi_1 (&gsi, unshare_expr (cond),
+					       is_gimple_condexpr, NULL_TREE,
+					       true, GSI_SAME_STMT);
+	    mask = fold_build_cond_expr (masktype, unshare_expr (cond),
+					 mask_op0, mask_op1);
+	    mask = ifc_temp_var (masktype, mask, &gsi);
+	    ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
+	    /* Copy points-to info if possible.  */
+	    if (TREE_CODE (addr) == SSA_NAME && !SSA_NAME_PTR_INFO (addr))
+	      copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr),
+			     ref);
+	    if (TREE_CODE (lhs) == SSA_NAME)
+	      {
+		new_stmt
+		  = gimple_build_call_internal (IFN_MASK_LOAD, 3, addr,
+						ptr, mask);
+		gimple_call_set_lhs (new_stmt, lhs);
+	      }
+	    else
+	      new_stmt
+		= gimple_build_call_internal (IFN_MASK_STORE, 4, addr, ptr,
+					      mask, rhs);
+	    gsi_replace (&gsi, new_stmt, false);
+	  }
+	else if (gimple_vdef (stmt))
 	  {
 	    tree lhs = gimple_assign_lhs (stmt);
 	    tree rhs = gimple_assign_rhs1 (stmt);
@@ -1647,7 +1815,7 @@  remove_conditions_and_labels (loop_p loo
    blocks.  Replace PHI nodes with conditional modify expressions.  */
 
 static void
-combine_blocks (struct loop *loop)
+combine_blocks (struct loop *loop, bool any_mask_load_store)
 {
   basic_block bb, exit_bb, merge_target_bb;
   unsigned int orig_loop_num_nodes = loop->num_nodes;
@@ -1655,11 +1823,12 @@  combine_blocks (struct loop *loop)
   edge e;
   edge_iterator ei;
 
+  predicate_bbs (loop);
   remove_conditions_and_labels (loop);
-  insert_gimplified_predicates (loop);
+  insert_gimplified_predicates (loop, any_mask_load_store);
   predicate_all_scalar_phis (loop);
 
-  if (flag_tree_loop_if_convert_stores)
+  if (flag_tree_loop_if_convert_stores || any_mask_load_store)
     predicate_mem_writes (loop);
 
   /* Merge basic blocks: first remove all the edges in the loop,
@@ -1749,28 +1918,146 @@  combine_blocks (struct loop *loop)
   ifc_bbs = NULL;
 }
 
-/* If-convert LOOP when it is legal.  For the moment this pass has no
-   profitability analysis.  Returns true when something changed.  */
+/* Version LOOP before if-converting it, the original loop
+   will be then if-converted, the new copy of the loop will not,
+   and the LOOP_VECTORIZED internal call will be guarding which
+   loop to execute.  The vectorizer pass will fold this
+   internal call into either true or false.  */
 
 static bool
+version_loop_for_if_conversion (struct loop *loop, bool *do_outer)
+{
+  struct loop *outer = loop_outer (loop);
+  basic_block cond_bb;
+  tree cond = make_ssa_name (boolean_type_node, NULL);
+  struct loop *new_loop;
+  gimple g;
+  gimple_stmt_iterator gsi;
+
+  if (do_outer)
+    {
+      *do_outer = false;
+      if (loop->inner == NULL
+	  && outer->inner == loop
+	  && loop->next == NULL
+	  && loop_outer (outer)
+	  && outer->num_nodes == 3 + loop->num_nodes
+	  && loop_preheader_edge (loop)->src == outer->header
+	  && single_exit (loop)
+	  && outer->latch
+	  && single_exit (loop)->dest == EDGE_PRED (outer->latch, 0)->src)
+	*do_outer = true;
+    }
+
+  g = gimple_build_call_internal (IFN_LOOP_VECTORIZED, 2,
+				  build_int_cst (integer_type_node, loop->num),
+				  integer_zero_node);
+  gimple_call_set_lhs (g, cond);
+
+  initialize_original_copy_tables ();
+  new_loop = loop_version (loop, cond, &cond_bb,
+			   REG_BR_PROB_BASE, REG_BR_PROB_BASE,
+			   REG_BR_PROB_BASE, true);
+  free_original_copy_tables ();
+  if (new_loop == NULL)
+    return false;
+  new_loop->dont_vectorize = true;
+  new_loop->force_vect = false;
+  gsi = gsi_last_bb (cond_bb);
+  gimple_call_set_arg (g, 1, build_int_cst (integer_type_node, new_loop->num));
+  gsi_insert_before (&gsi, g, GSI_SAME_STMT);
+  update_ssa (TODO_update_ssa);
+  if (do_outer == NULL)
+    {
+      gcc_assert (single_succ_p (loop->header));
+      gsi = gsi_last_bb (single_succ (loop->header));
+      gimple cond_stmt = gsi_stmt (gsi);
+      gsi_prev (&gsi);
+      g = gsi_stmt (gsi);
+      gcc_assert (gimple_code (cond_stmt) == GIMPLE_COND
+		  && is_gimple_call (g)
+		  && gimple_call_internal_p (g)
+		  && gimple_call_internal_fn (g) == IFN_LOOP_VECTORIZED
+		  && gimple_cond_lhs (cond_stmt) == gimple_call_lhs (g));
+      gimple_cond_set_lhs (cond_stmt, boolean_true_node);
+      update_stmt (cond_stmt);
+      gcc_assert (has_zero_uses (gimple_call_lhs (g)));
+      gsi_remove (&gsi, false);
+      gcc_assert (single_succ_p (new_loop->header));
+      gsi = gsi_last_bb (single_succ (new_loop->header));
+      cond_stmt = gsi_stmt (gsi);
+      gsi_prev (&gsi);
+      g = gsi_stmt (gsi);
+      gcc_assert (gimple_code (cond_stmt) == GIMPLE_COND
+		  && is_gimple_call (g)
+		  && gimple_call_internal_p (g)
+		  && gimple_call_internal_fn (g) == IFN_LOOP_VECTORIZED
+		  && gimple_cond_lhs (cond_stmt) == gimple_call_lhs (g)
+		  && new_loop->inner
+		  && new_loop->inner->next
+		  && new_loop->inner->next->next == NULL);
+      struct loop *inner = new_loop->inner;
+      basic_block empty_bb = loop_preheader_edge (inner)->src;
+      gcc_assert (empty_block_p (empty_bb)
+		  && single_pred_p (empty_bb)
+		  && single_succ_p (empty_bb)
+		  && single_pred (empty_bb) == single_succ (new_loop->header));
+      if (single_pred_edge (empty_bb)->flags & EDGE_TRUE_VALUE)
+	{
+	  gimple_call_set_arg (g, 0, build_int_cst (integer_type_node,
+						    inner->num));
+	  gimple_call_set_arg (g, 0, build_int_cst (integer_type_node,
+						    inner->next->num));
+	  inner->next->dont_vectorize = true;
+	}
+      else
+	{
+	  gimple_call_set_arg (g, 0, build_int_cst (integer_type_node,
+						    inner->next->num));
+	  gimple_call_set_arg (g, 0, build_int_cst (integer_type_node,
+						    inner->num));
+	  inner->dont_vectorize = true;
+	}
+    }
+  return true;
+}
+
+/* If-convert LOOP when it is legal.  For the moment this pass has no
+   profitability analysis.  Returns non-zero todo flags when something
+   changed.  */
+
+static unsigned int
 tree_if_conversion (struct loop *loop)
 {
-  bool changed = false;
+  unsigned int todo = 0;
+  bool version_outer_loop = false;
   ifc_bbs = NULL;
+  bool any_mask_load_store = false;
 
-  if (!if_convertible_loop_p (loop)
+  if (!if_convertible_loop_p (loop, &any_mask_load_store)
       || !dbg_cnt (if_conversion_tree))
     goto cleanup;
 
+  if (any_mask_load_store
+      && ((!flag_tree_loop_vectorize && !loop->force_vect)
+	  || loop->dont_vectorize))
+    goto cleanup;
+
+  if (any_mask_load_store
+      && !version_loop_for_if_conversion (loop, &version_outer_loop))
+    goto cleanup;
+
   /* Now all statements are if-convertible.  Combine all the basic
      blocks into one huge basic block doing the if-conversion
      on-the-fly.  */
-  combine_blocks (loop);
-
-  if (flag_tree_loop_if_convert_stores)
-    mark_virtual_operands_for_renaming (cfun);
+  combine_blocks (loop, any_mask_load_store);
 
-  changed = true;
+  todo |= TODO_cleanup_cfg;
+  if (flag_tree_loop_if_convert_stores || any_mask_load_store)
+    {
+      mark_virtual_operands_for_renaming (cfun);
+      todo |= TODO_update_ssa_only_virtuals;
+    }
 
  cleanup:
   if (ifc_bbs)
@@ -1784,7 +2071,16 @@  tree_if_conversion (struct loop *loop)
       ifc_bbs = NULL;
     }
 
-  return changed;
+  if (todo && version_outer_loop)
+    {
+      if (todo & TODO_update_ssa_only_virtuals)
+	{
+	  update_ssa (TODO_update_ssa_only_virtuals);
+	  todo &= ~TODO_update_ssa_only_virtuals;
+	}
+      version_loop_for_if_conversion (loop_outer (loop), NULL);
+    }
+  return todo;
 }
 
 /* Tree if-conversion pass management.  */
@@ -1793,7 +2089,6 @@  static unsigned int
 main_tree_if_conversion (void)
 {
   struct loop *loop;
-  bool changed = false;
   unsigned todo = 0;
 
   if (number_of_loops (cfun) <= 1)
@@ -1802,15 +2097,9 @@  main_tree_if_conversion (void)
   FOR_EACH_LOOP (loop, 0)
     if (flag_tree_loop_if_convert == 1
 	|| flag_tree_loop_if_convert_stores == 1
-	|| flag_tree_loop_vectorize
-	|| loop->force_vect)
-    changed |= tree_if_conversion (loop);
-
-  if (changed)
-    todo |= TODO_cleanup_cfg;
-
-  if (changed && flag_tree_loop_if_convert_stores)
-    todo |= TODO_update_ssa_only_virtuals;
+	|| ((flag_tree_loop_vectorize || loop->force_vect)
+	    && !loop->dont_vectorize))
+      todo |= tree_if_conversion (loop);
 
 #ifdef ENABLE_CHECKING
   {
--- gcc/tree-vect-data-refs.c.jj	2013-11-28 09:18:11.784774865 +0100
+++ gcc/tree-vect-data-refs.c	2013-11-28 14:13:57.617572349 +0100
@@ -2959,6 +2959,24 @@  vect_check_gather (gimple stmt, loop_vec
   enum machine_mode pmode;
   int punsignedp, pvolatilep;
 
+  base = DR_REF (dr);
+  /* For masked loads/stores, DR_REF (dr) is an artificial MEM_REF,
+     see if we can use the def stmt of the address.  */
+  if (is_gimple_call (stmt)
+      && gimple_call_internal_p (stmt)
+      && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
+	  || gimple_call_internal_fn (stmt) == IFN_MASK_STORE)
+      && TREE_CODE (base) == MEM_REF
+      && TREE_CODE (TREE_OPERAND (base, 0)) == SSA_NAME
+      && integer_zerop (TREE_OPERAND (base, 1))
+      && !expr_invariant_in_loop_p (loop, TREE_OPERAND (base, 0)))
+    {
+      gimple def_stmt = SSA_NAME_DEF_STMT (TREE_OPERAND (base, 0));
+      if (is_gimple_assign (def_stmt)
+	  && gimple_assign_rhs_code (def_stmt) == ADDR_EXPR)
+	base = TREE_OPERAND (gimple_assign_rhs1 (def_stmt), 0);
+    }
+
   /* The gather builtins need address of the form
      loop_invariant + vector * {1, 2, 4, 8}
      or
@@ -2971,7 +2989,7 @@  vect_check_gather (gimple stmt, loop_vec
      vectorized.  The following code attempts to find such a preexistng
      SSA_NAME OFF and put the loop invariants into a tree BASE
      that can be gimplified before the loop.  */
-  base = get_inner_reference (DR_REF (dr), &pbitsize, &pbitpos, &off,
+  base = get_inner_reference (base, &pbitsize, &pbitpos, &off,
 			      &pmode, &punsignedp, &pvolatilep, false);
   gcc_assert (base != NULL_TREE && (pbitpos % BITS_PER_UNIT) == 0);
 
@@ -3468,7 +3486,10 @@  again:
       offset = unshare_expr (DR_OFFSET (dr));
       init = unshare_expr (DR_INIT (dr));
 
-      if (is_gimple_call (stmt))
+      if (is_gimple_call (stmt)
+	  && (!gimple_call_internal_p (stmt)
+	      || (gimple_call_internal_fn (stmt) != IFN_MASK_LOAD
+		  && gimple_call_internal_fn (stmt) != IFN_MASK_STORE)))
 	{
 	  if (dump_enabled_p ())
 	    {
@@ -5119,6 +5140,14 @@  vect_supportable_dr_alignment (struct da
   if (aligned_access_p (dr) && !check_aligned_accesses)
     return dr_aligned;
 
+  /* For now assume all conditional loads/stores support unaligned
+     access without any special code.  */
+  if (is_gimple_call (stmt)
+      && gimple_call_internal_p (stmt)
+      && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
+	  || gimple_call_internal_fn (stmt) == IFN_MASK_STORE))
+    return dr_unaligned_supported;
+
   if (loop_vinfo)
     {
       vect_loop = LOOP_VINFO_LOOP (loop_vinfo);
--- gcc/gimple.h.jj	2013-11-27 12:10:46.932896086 +0100
+++ gcc/gimple.h	2013-11-28 14:13:57.603572422 +0100
@@ -5670,7 +5670,13 @@  gimple_expr_type (const_gimple stmt)
 	 useless conversion involved.  That means returning the
 	 original RHS type as far as we can reconstruct it.  */
       if (code == GIMPLE_CALL)
-	type = gimple_call_return_type (stmt);
+	{
+	  if (gimple_call_internal_p (stmt)
+	      && gimple_call_internal_fn (stmt) == IFN_MASK_STORE)
+	    type = TREE_TYPE (gimple_call_arg (stmt, 3));
+	  else
+	    type = gimple_call_return_type (stmt);
+	}
       else
 	switch (gimple_assign_rhs_code (stmt))
 	  {
--- gcc/internal-fn.c.jj	2013-11-26 21:36:14.218328913 +0100
+++ gcc/internal-fn.c	2013-11-28 14:13:57.661572121 +0100
@@ -153,6 +153,60 @@  expand_UBSAN_NULL (gimple stmt ATTRIBUTE
   gcc_unreachable ();
 }
 
+/* This should get folded in tree-vectorizer.c.  */
+
+static void
+expand_LOOP_VECTORIZED (gimple stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
+static void
+expand_MASK_LOAD (gimple stmt)
+{
+  struct expand_operand ops[3];
+  tree type, lhs, rhs, maskt;
+  rtx mem, target, mask;
+
+  maskt = gimple_call_arg (stmt, 2);
+  lhs = gimple_call_lhs (stmt);
+  type = TREE_TYPE (lhs);
+  rhs = build2 (MEM_REF, type, gimple_call_arg (stmt, 0),
+		gimple_call_arg (stmt, 1));
+
+  mem = expand_expr (rhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  gcc_assert (MEM_P (mem));
+  mask = expand_normal (maskt);
+  target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], target, TYPE_MODE (type));
+  create_fixed_operand (&ops[1], mem);
+  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  expand_insn (optab_handler (maskload_optab, TYPE_MODE (type)), 3, ops);
+}
+
+static void
+expand_MASK_STORE (gimple stmt)
+{
+  struct expand_operand ops[3];
+  tree type, lhs, rhs, maskt;
+  rtx mem, reg, mask;
+
+  maskt = gimple_call_arg (stmt, 2);
+  rhs = gimple_call_arg (stmt, 3);
+  type = TREE_TYPE (rhs);
+  lhs = build2 (MEM_REF, type, gimple_call_arg (stmt, 0),
+		gimple_call_arg (stmt, 1));
+
+  mem = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  gcc_assert (MEM_P (mem));
+  mask = expand_normal (maskt);
+  reg = expand_normal (rhs);
+  create_fixed_operand (&ops[0], mem);
+  create_input_operand (&ops[1], reg, TYPE_MODE (type));
+  create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
+  expand_insn (optab_handler (maskstore_optab, TYPE_MODE (type)), 3, ops);
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
--- gcc/tree-vectorizer.c.jj	2013-11-22 21:03:14.525852274 +0100
+++ gcc/tree-vectorizer.c	2013-11-28 15:10:33.364872892 +0100
@@ -75,11 +75,13 @@  along with GCC; see the file COPYING3.
 #include "tree-phinodes.h"
 #include "ssa-iterators.h"
 #include "tree-ssa-loop-manip.h"
+#include "tree-cfg.h"
 #include "cfgloop.h"
 #include "tree-vectorizer.h"
 #include "tree-pass.h"
 #include "tree-ssa-propagate.h"
 #include "dbgcnt.h"
+#include "gimple-fold.h"
 
 /* Loop or bb location.  */
 source_location vect_location;
@@ -317,6 +319,68 @@  vect_destroy_datarefs (loop_vec_info loo
 }
 
 
+/* If LOOP has been versioned during ifcvt, return the internal call
+   guarding it.  */
+
+static gimple
+vect_loop_vectorized_call (struct loop *loop)
+{
+  basic_block bb = loop_preheader_edge (loop)->src;
+  gimple g;
+  do
+    {
+      g = last_stmt (bb);
+      if (g)
+	break;
+      if (!single_pred_p (bb))
+	break;
+      bb = single_pred (bb);
+    }
+  while (1);
+  if (g && gimple_code (g) == GIMPLE_COND)
+    {
+      gimple_stmt_iterator gsi = gsi_for_stmt (g);
+      gsi_prev (&gsi);
+      if (!gsi_end_p (gsi))
+	{
+	  g = gsi_stmt (gsi);
+	  if (is_gimple_call (g)
+	      && gimple_call_internal_p (g)
+	      && gimple_call_internal_fn (g) == IFN_LOOP_VECTORIZED
+	      && (tree_to_shwi (gimple_call_arg (g, 0)) == loop->num
+		  || tree_to_shwi (gimple_call_arg (g, 1)) == loop->num))
+	    return g;
+	}
+    }
+  return NULL;
+}
+
+/* Helper function of vectorize_loops.  If LOOP is non-if-converted
+   loop that has if-converted counterpart, return the if-converted
+   counterpart, so that we try vectorizing if-converted loops before
+   inner loops of non-if-converted loops.  */
+
+static struct loop *
+vect_loop_select (struct loop *loop)
+{
+  if (!loop->dont_vectorize)
+    return loop;
+
+  gimple g = vect_loop_vectorized_call (loop);
+  if (g == NULL)
+    return loop;
+
+  if (tree_to_shwi (gimple_call_arg (g, 1)) != loop->num)
+    return loop;
+
+  struct loop *ifcvt_loop
+    = get_loop (cfun, tree_to_shwi (gimple_call_arg (g, 0)));
+  if (ifcvt_loop && !ifcvt_loop->dont_vectorize)
+    return ifcvt_loop;
+  return loop;
+}
+
+
 /* Function vectorize_loops.
 
    Entry point to loop vectorization phase.  */
@@ -327,9 +391,11 @@  vectorize_loops (void)
   unsigned int i;
   unsigned int num_vectorized_loops = 0;
   unsigned int vect_loops_num;
-  struct loop *loop;
+  struct loop *loop, *iloop;
   hash_table <simduid_to_vf> simduid_to_vf_htab;
   hash_table <simd_array_to_simduid> simd_array_to_simduid_htab;
+  bool any_ifcvt_loops = false;
+  unsigned ret = 0;
 
   vect_loops_num = number_of_loops (cfun);
 
@@ -351,9 +417,12 @@  vectorize_loops (void)
   /* If some loop was duplicated, it gets bigger number
      than all previously defined loops.  This fact allows us to run
      only over initial loops skipping newly generated ones.  */
-  FOR_EACH_LOOP (loop, 0)
-    if ((flag_tree_loop_vectorize && optimize_loop_nest_for_speed_p (loop))
-	|| loop->force_vect)
+  FOR_EACH_LOOP (iloop, 0)
+    if ((loop = vect_loop_select (iloop))->dont_vectorize)
+      any_ifcvt_loops = true;
+    else if ((flag_tree_loop_vectorize
+	      && optimize_loop_nest_for_speed_p (loop))
+	     || loop->force_vect)
       {
 	loop_vec_info loop_vinfo;
 	vect_location = find_loop_location (loop);
@@ -363,6 +432,10 @@  vectorize_loops (void)
                        LOCATION_FILE (vect_location),
 		       LOCATION_LINE (vect_location));
 
+	/* Make sure we don't try to vectorize this loop
+	   more than once.  */
+	loop->dont_vectorize = true;
+
 	loop_vinfo = vect_analyze_loop (loop);
 	loop->aux = loop_vinfo;
 
@@ -372,6 +445,45 @@  vectorize_loops (void)
         if (!dbg_cnt (vect_loop))
 	  break;
 
+	gimple loop_vectorized_call = vect_loop_vectorized_call (loop);
+	if (loop_vectorized_call)
+	  {
+	    tree arg = gimple_call_arg (loop_vectorized_call, 1);
+	    basic_block *bbs;
+	    unsigned int i;
+	    struct loop *scalar_loop = get_loop (cfun, tree_to_shwi (arg));
+	    struct loop *inner;
+
+	    LOOP_VINFO_SCALAR_LOOP (loop_vinfo) = scalar_loop;
+	    gcc_checking_assert (vect_loop_vectorized_call
+					(LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
+				 == loop_vectorized_call);
+	    bbs = get_loop_body (scalar_loop);
+	    for (i = 0; i < scalar_loop->num_nodes; i++)
+	      {
+		basic_block bb = bbs[i];
+		gimple_stmt_iterator gsi;
+		for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
+		     gsi_next (&gsi))
+		  {
+		    gimple phi = gsi_stmt (gsi);
+		    gimple_set_uid (phi, 0);
+		  }
+		for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+		     gsi_next (&gsi))
+		  {
+		    gimple stmt = gsi_stmt (gsi);
+		    gimple_set_uid (stmt, 0);
+		  }
+	      }
+	    free (bbs);
+	    /* If we have successfully vectorized an if-converted outer
+	       loop, don't attempt to vectorize the if-converted inner
+	       loop of the alternate loop.  */
+	    for (inner = scalar_loop->inner; inner; inner = inner->next)
+	      inner->dont_vectorize = true;
+	  }
+
         if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOCATION
 	    && dump_enabled_p ())
           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS, vect_location,
@@ -392,7 +504,29 @@  vectorize_loops (void)
 	    *simduid_to_vf_htab.find_slot (simduid_to_vf_data, INSERT)
 	      = simduid_to_vf_data;
 	  }
+
+	if (loop_vectorized_call)
+	  {
+	    gimple g = loop_vectorized_call;
+	    tree lhs = gimple_call_lhs (g);
+	    gimple_stmt_iterator gsi = gsi_for_stmt (g);
+	    gimplify_and_update_call_from_tree (&gsi, boolean_true_node);
+	    gsi_next (&gsi);
+	    if (!gsi_end_p (gsi))
+	      {
+		g = gsi_stmt (gsi);
+		if (gimple_code (g) == GIMPLE_COND
+		    && gimple_cond_lhs (g) == lhs)
+		  {
+		    gimple_cond_set_lhs (g, boolean_true_node);
+		    update_stmt (g);
+		    ret |= TODO_cleanup_cfg;
+		  }
+	      }
+	  }
       }
+    else
+      loop->dont_vectorize = true;
 
   vect_location = UNKNOWN_LOCATION;
 
@@ -405,6 +539,34 @@  vectorize_loops (void)
 
   /*  ----------- Finalize. -----------  */
 
+  if (any_ifcvt_loops)
+    for (i = 1; i < vect_loops_num; i++)
+      {
+	loop = get_loop (cfun, i);
+	if (loop && loop->dont_vectorize)
+	  {
+	    gimple g = vect_loop_vectorized_call (loop);
+	    if (g)
+	      {
+		tree lhs = gimple_call_lhs (g);
+		gimple_stmt_iterator gsi = gsi_for_stmt (g);
+		gimplify_and_update_call_from_tree (&gsi, boolean_false_node);
+		gsi_next (&gsi);
+		if (!gsi_end_p (gsi))
+		  {
+		    g = gsi_stmt (gsi);
+		    if (gimple_code (g) == GIMPLE_COND
+			&& gimple_cond_lhs (g) == lhs)
+		      {
+			gimple_cond_set_lhs (g, boolean_false_node);
+			update_stmt (g);
+			ret |= TODO_cleanup_cfg;
+		      }
+		  }
+	      }
+	  }
+      }
+
   for (i = 1; i < vect_loops_num; i++)
     {
       loop_vec_info loop_vinfo;
@@ -462,7 +624,7 @@  vectorize_loops (void)
       return TODO_cleanup_cfg;
     }
 
-  return 0;
+  return ret;
 }
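
To make the new control flow concrete, here is a rough sketch (purely
illustrative; the loop numbers and identifiers are placeholders) of what
the vectorizer now expects to find.  If-conversion versions a loop and
guards the if-converted copy with an IFN_LOOP_VECTORIZED call whose two
arguments are the loop numbers of the if-converted and the scalar copy;
vect_loop_vectorized_call searches upward from the preheader for that
call, and vectorize_loops then folds it (and the GIMPLE_COND testing its
lhs) to true if the if-converted copy was vectorized, or to false
otherwise, requesting TODO_cleanup_cfg so the dead copy is removed:

  /* Source loop (illustrative only).  */
  void
  f (float *a, const float *b, int n)
  {
    for (int i = 0; i < n; i++)
      if (b[i] < 0.0f)
	a[i] = b[i] * 2.0f;
  }

  /* Conceptual shape after if-conversion with versioning:

       _t = LOOP_VECTORIZED (1, 2);
       if (_t != 0)
	 loop 1:  if-converted copy; the conditional store has been
		  turned into a MASK_STORE internal call;
       else
	 loop 2:  untouched scalar copy, marked dont_vectorize.

     After the vectorizer has run, the call is replaced by
     boolean_true_node or boolean_false_node and cfg cleanup removes
     the then-unreachable copy.  */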
 
 
--- gcc/tree-vect-loop-manip.c.jj	2013-11-22 21:03:08.418882641 +0100
+++ gcc/tree-vect-loop-manip.c	2013-11-28 14:54:01.621096704 +0100
@@ -703,12 +703,42 @@  slpeel_make_loop_iterate_ntimes (struct
   loop->nb_iterations = niters;
 }
 
+/* Helper routine of slpeel_tree_duplicate_loop_to_edge_cfg.
+   For the PHI arguments in FROM->dest and TO->dest coming in over
+   those edges, ensure that each TO->dest PHI argument has its
+   current_def set to that of the corresponding FROM->dest argument.  */
+
+static void
+slpeel_duplicate_current_defs_from_edges (edge from, edge to)
+{
+  gimple_stmt_iterator gsi_from, gsi_to;
+
+  for (gsi_from = gsi_start_phis (from->dest),
+       gsi_to = gsi_start_phis (to->dest);
+       !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
+       gsi_next (&gsi_from), gsi_next (&gsi_to))
+    {
+      gimple from_phi = gsi_stmt (gsi_from);
+      gimple to_phi = gsi_stmt (gsi_to);
+      tree from_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, from);
+      tree to_arg = PHI_ARG_DEF_FROM_EDGE (to_phi, to);
+      if (TREE_CODE (from_arg) == SSA_NAME
+	  && TREE_CODE (to_arg) == SSA_NAME
+	  && get_current_def (to_arg) == NULL_TREE)
+	set_current_def (to_arg, get_current_def (from_arg));
+    }
+}
+
 
 /* Given LOOP this function generates a new copy of it and puts it
-   on E which is either the entry or exit of LOOP.  */
+   on E which is either the entry or exit of LOOP.  If SCALAR_LOOP is
+   non-NULL, assume LOOP and SCALAR_LOOP are equivalent and copy the
+   basic blocks from SCALAR_LOOP instead of LOOP, but to either the
+   entry or exit of LOOP.  */
 
 struct loop *
-slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *loop, edge e)
+slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *loop,
+					struct loop *scalar_loop, edge e)
 {
   struct loop *new_loop;
   basic_block *new_bbs, *bbs;
@@ -722,19 +752,22 @@  slpeel_tree_duplicate_loop_to_edge_cfg (
   if (!at_exit && e != loop_preheader_edge (loop))
     return NULL;
 
-  bbs = XNEWVEC (basic_block, loop->num_nodes + 1);
-  get_loop_body_with_size (loop, bbs, loop->num_nodes);
+  if (scalar_loop == NULL)
+    scalar_loop = loop;
+
+  bbs = XNEWVEC (basic_block, scalar_loop->num_nodes + 1);
+  get_loop_body_with_size (scalar_loop, bbs, scalar_loop->num_nodes);
 
   /* Check whether duplication is possible.  */
-  if (!can_copy_bbs_p (bbs, loop->num_nodes))
+  if (!can_copy_bbs_p (bbs, scalar_loop->num_nodes))
     {
       free (bbs);
       return NULL;
     }
 
   /* Generate new loop structure.  */
-  new_loop = duplicate_loop (loop, loop_outer (loop));
-  duplicate_subloops (loop, new_loop);
+  new_loop = duplicate_loop (scalar_loop, loop_outer (scalar_loop));
+  duplicate_subloops (scalar_loop, new_loop);
 
   exit_dest = exit->dest;
   was_imm_dom = (get_immediate_dominator (CDI_DOMINATORS,
@@ -744,35 +777,80 @@  slpeel_tree_duplicate_loop_to_edge_cfg (
   /* Also copy the pre-header, this avoids jumping through hoops to
      duplicate the loop entry PHI arguments.  Create an empty
      pre-header unconditionally for this.  */
-  basic_block preheader = split_edge (loop_preheader_edge (loop));
+  basic_block preheader = split_edge (loop_preheader_edge (scalar_loop));
   edge entry_e = single_pred_edge (preheader);
-  bbs[loop->num_nodes] = preheader;
-  new_bbs = XNEWVEC (basic_block, loop->num_nodes + 1);
+  bbs[scalar_loop->num_nodes] = preheader;
+  new_bbs = XNEWVEC (basic_block, scalar_loop->num_nodes + 1);
 
-  copy_bbs (bbs, loop->num_nodes + 1, new_bbs,
+  exit = single_exit (scalar_loop);
+  copy_bbs (bbs, scalar_loop->num_nodes + 1, new_bbs,
 	    &exit, 1, &new_exit, NULL,
 	    e->src, true);
-  basic_block new_preheader = new_bbs[loop->num_nodes];
+  exit = single_exit (loop);
+  basic_block new_preheader = new_bbs[scalar_loop->num_nodes];
 
-  add_phi_args_after_copy (new_bbs, loop->num_nodes + 1, NULL);
+  add_phi_args_after_copy (new_bbs, scalar_loop->num_nodes + 1, NULL);
+
+  if (scalar_loop != loop)
+    {
+      /* If we copied from SCALAR_LOOP rather than LOOP, SSA_NAMEs from
+	 SCALAR_LOOP will have current_def set to SSA_NAMEs in the new_loop,
+	 but LOOP will not.  slpeel_update_phi_nodes_for_guard{1,2} expects
+	 the LOOP SSA_NAMEs (on the exit edge and edge from latch to
+	 header) to have current_def set, so copy them over.  */
+      slpeel_duplicate_current_defs_from_edges (single_exit (scalar_loop),
+						exit);
+      slpeel_duplicate_current_defs_from_edges (EDGE_SUCC (scalar_loop->latch,
+							   0),
+						EDGE_SUCC (loop->latch, 0));
+    }
 
   if (at_exit) /* Add the loop copy at exit.  */
     {
+      if (scalar_loop != loop)
+	{
+	  gimple_stmt_iterator gsi;
+	  new_exit = redirect_edge_and_branch (new_exit, exit_dest);
+
+	  for (gsi = gsi_start_phis (exit_dest); !gsi_end_p (gsi);
+	       gsi_next (&gsi))
+	    {
+	      gimple phi = gsi_stmt (gsi);
+	      tree orig_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+	      location_t orig_locus
+		= gimple_phi_arg_location_from_edge (phi, e);
+
+	      add_phi_arg (phi, orig_arg, new_exit, orig_locus);
+	    }
+	}
       redirect_edge_and_branch_force (e, new_preheader);
       flush_pending_stmts (e);
       set_immediate_dominator (CDI_DOMINATORS, new_preheader, e->src);
       if (was_imm_dom)
-	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_loop->header);
+	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit->src);
 
       /* And remove the non-necessary forwarder again.  Keep the other
          one so we have a proper pre-header for the loop at the exit edge.  */
-      redirect_edge_pred (single_succ_edge (preheader), single_pred (preheader));
+      redirect_edge_pred (single_succ_edge (preheader),
+			  single_pred (preheader));
       delete_basic_block (preheader);
-      set_immediate_dominator (CDI_DOMINATORS, loop->header,
-			       loop_preheader_edge (loop)->src);
+      set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
+			       loop_preheader_edge (scalar_loop)->src);
     }
   else /* Add the copy at entry.  */
     {
+      if (scalar_loop != loop)
+	{
+	  /* Remove the non-necessary forwarder of scalar_loop again.  */
+	  redirect_edge_pred (single_succ_edge (preheader),
+			      single_pred (preheader));
+	  delete_basic_block (preheader);
+	  set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
+				   loop_preheader_edge (scalar_loop)->src);
+	  preheader = split_edge (loop_preheader_edge (loop));
+	  entry_e = single_pred_edge (preheader);
+	}
+
       redirect_edge_and_branch_force (entry_e, new_preheader);
       flush_pending_stmts (entry_e);
       set_immediate_dominator (CDI_DOMINATORS, new_preheader, entry_e->src);
@@ -783,15 +861,39 @@  slpeel_tree_duplicate_loop_to_edge_cfg (
 
       /* And remove the non-necessary forwarder again.  Keep the other
          one so we have a proper pre-header for the loop at the exit edge.  */
-      redirect_edge_pred (single_succ_edge (new_preheader), single_pred (new_preheader));
+      redirect_edge_pred (single_succ_edge (new_preheader),
+			  single_pred (new_preheader));
       delete_basic_block (new_preheader);
       set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
 			       loop_preheader_edge (new_loop)->src);
     }
 
-  for (unsigned i = 0; i < loop->num_nodes+1; i++)
+  for (unsigned i = 0; i < scalar_loop->num_nodes + 1; i++)
     rename_variables_in_bb (new_bbs[i]);
 
+  if (scalar_loop != loop)
+    {
+      /* Update new_loop->header PHIs, so that on the preheader
+	 edge they are the ones from loop rather than scalar_loop.  */
+      gimple_stmt_iterator gsi_orig, gsi_new;
+      edge orig_e = loop_preheader_edge (loop);
+      edge new_e = loop_preheader_edge (new_loop);
+
+      for (gsi_orig = gsi_start_phis (loop->header),
+	   gsi_new = gsi_start_phis (new_loop->header);
+	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
+	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
+	{
+	  gimple orig_phi = gsi_stmt (gsi_orig);
+	  gimple new_phi = gsi_stmt (gsi_new);
+	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
+	  location_t orig_locus
+	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
+
+	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
+	}
+    }
+
   free (new_bbs);
   free (bbs);
 
@@ -1002,6 +1104,8 @@  set_prologue_iterations (basic_block bb_
 
    Input:
    - LOOP: the loop to be peeled.
+   - SCALAR_LOOP: if non-NULL, the alternate loop from which basic blocks
+	should be copied.
    - E: the exit or entry edge of LOOP.
         If it is the entry edge, we peel the first iterations of LOOP. In this
         case first-loop is LOOP, and second-loop is the newly created loop.
@@ -1043,8 +1147,8 @@  set_prologue_iterations (basic_block bb_
    FORNOW the resulting code will not be in loop-closed-ssa form.
 */
 
-static struct loop*
-slpeel_tree_peel_loop_to_edge (struct loop *loop,
+static struct loop *
+slpeel_tree_peel_loop_to_edge (struct loop *loop, struct loop *scalar_loop,
 			       edge e, tree *first_niters,
 			       tree niters, bool update_first_loop_count,
 			       unsigned int th, bool check_profitability,
@@ -1129,7 +1233,8 @@  slpeel_tree_peel_loop_to_edge (struct lo
         orig_exit_bb:
    */
 
-  if (!(new_loop = slpeel_tree_duplicate_loop_to_edge_cfg (loop, e)))
+  if (!(new_loop = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
+							   e)))
     {
       loop_loc = find_loop_location (loop);
       dump_printf_loc (MSG_MISSED_OPTIMIZATION, loop_loc,
@@ -1625,6 +1730,7 @@  vect_do_peeling_for_loop_bound (loop_vec
 				unsigned int th, bool check_profitability)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
   struct loop *new_loop;
   edge update_e;
   basic_block preheader;
@@ -1641,11 +1747,12 @@  vect_do_peeling_for_loop_bound (loop_vec
 
   loop_num  = loop->num;
 
-  new_loop = slpeel_tree_peel_loop_to_edge (loop, single_exit (loop),
-                                            &ratio_mult_vf_name, ni_name, false,
-                                            th, check_profitability,
-					    cond_expr, cond_expr_stmt_list,
-					    0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+  new_loop
+    = slpeel_tree_peel_loop_to_edge (loop, scalar_loop, single_exit (loop),
+				     &ratio_mult_vf_name, ni_name, false,
+				     th, check_profitability,
+				     cond_expr, cond_expr_stmt_list,
+				     0, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
   gcc_assert (new_loop);
   gcc_assert (loop_num == loop->num);
 #ifdef ENABLE_CHECKING
@@ -1878,6 +1985,7 @@  vect_do_peeling_for_alignment (loop_vec_
 			       unsigned int th, bool check_profitability)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
   tree niters_of_prolog_loop;
   tree wide_prolog_niters;
   struct loop *new_loop;
@@ -1899,11 +2007,11 @@  vect_do_peeling_for_alignment (loop_vec_
 
   /* Peel the prolog loop and iterate it niters_of_prolog_loop.  */
   new_loop =
-    slpeel_tree_peel_loop_to_edge (loop, loop_preheader_edge (loop),
+    slpeel_tree_peel_loop_to_edge (loop, scalar_loop,
+				   loop_preheader_edge (loop),
 				   &niters_of_prolog_loop, ni_name, true,
 				   th, check_profitability, NULL_TREE, NULL,
-				   bound,
-				   0);
+				   bound, 0);
 
   gcc_assert (new_loop);
 #ifdef ENABLE_CHECKING
@@ -2187,6 +2295,7 @@  vect_loop_versioning (loop_vec_info loop
 		      unsigned int th, bool check_profitability)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  struct loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
   basic_block condition_bb;
   gimple_stmt_iterator gsi, cond_exp_gsi;
   basic_block merge_bb;
@@ -2222,8 +2331,43 @@  vect_loop_versioning (loop_vec_info loop
   gimple_seq_add_seq (&cond_expr_stmt_list, gimplify_stmt_list);
 
   initialize_original_copy_tables ();
-  loop_version (loop, cond_expr, &condition_bb,
-		prob, prob, REG_BR_PROB_BASE - prob, true);
+  if (scalar_loop)
+    {
+      edge scalar_e;
+      basic_block preheader, scalar_preheader;
+
+      /* We don't want to scale SCALAR_LOOP's frequencies; we need to
+	 scale LOOP's frequencies instead.  */
+      loop_version (scalar_loop, cond_expr, &condition_bb,
+		    prob, REG_BR_PROB_BASE, REG_BR_PROB_BASE - prob, true);
+      scale_loop_frequencies (loop, prob, REG_BR_PROB_BASE);
+      /* CONDITION_BB was created above SCALAR_LOOP's preheader,
+	 while we need to move it above LOOP's preheader.  */
+      e = loop_preheader_edge (loop);
+      scalar_e = loop_preheader_edge (scalar_loop);
+      gcc_assert (empty_block_p (e->src)
+		  && single_pred_p (e->src));
+      gcc_assert (empty_block_p (scalar_e->src)
+		  && single_pred_p (scalar_e->src));
+      gcc_assert (single_pred_p (condition_bb));
+      preheader = e->src;
+      scalar_preheader = scalar_e->src;
+      scalar_e = find_edge (condition_bb, scalar_preheader);
+      e = single_pred_edge (preheader);
+      redirect_edge_and_branch_force (single_pred_edge (condition_bb),
+				      scalar_preheader);
+      redirect_edge_and_branch_force (scalar_e, preheader);
+      redirect_edge_and_branch_force (e, condition_bb);
+      set_immediate_dominator (CDI_DOMINATORS, condition_bb,
+			       single_pred (condition_bb));
+      set_immediate_dominator (CDI_DOMINATORS, scalar_preheader,
+			       single_pred (scalar_preheader));
+      set_immediate_dominator (CDI_DOMINATORS, preheader,
+			       condition_bb);
+    }
+  else
+    loop_version (loop, cond_expr, &condition_bb,
+		  prob, prob, REG_BR_PROB_BASE - prob, true);
 
   if (LOCATION_LOCUS (vect_location) != UNKNOWN_LOCATION
       && dump_enabled_p ())
@@ -2246,24 +2390,29 @@  vect_loop_versioning (loop_vec_info loop
      basic block (i.e. it has two predecessors). Just in order to simplify
      following transformations in the vectorizer, we fix this situation
      here by adding a new (empty) block on the exit-edge of the loop,
-     with the proper loop-exit phis to maintain loop-closed-form.  */
+     with the proper loop-exit phis to maintain loop-closed-form.
+     If loop versioning was done on SCALAR_LOOP rather than on LOOP,
+     merge_bb will already have just a single predecessor.  */
 
   merge_bb = single_exit (loop)->dest;
-  gcc_assert (EDGE_COUNT (merge_bb->preds) == 2);
-  new_exit_bb = split_edge (single_exit (loop));
-  new_exit_e = single_exit (loop);
-  e = EDGE_SUCC (new_exit_bb, 0);
-
-  for (gsi = gsi_start_phis (merge_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      tree new_res;
-      orig_phi = gsi_stmt (gsi);
-      new_res = copy_ssa_name (PHI_RESULT (orig_phi), NULL);
-      new_phi = create_phi_node (new_res, new_exit_bb);
-      arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, e);
-      add_phi_arg (new_phi, arg, new_exit_e,
-		   gimple_phi_arg_location_from_edge (orig_phi, e));
-      adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
+  if (scalar_loop == NULL || EDGE_COUNT (merge_bb->preds) >= 2)
+    {
+      gcc_assert (EDGE_COUNT (merge_bb->preds) >= 2);
+      new_exit_bb = split_edge (single_exit (loop));
+      new_exit_e = single_exit (loop);
+      e = EDGE_SUCC (new_exit_bb, 0);
+
+      for (gsi = gsi_start_phis (merge_bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  tree new_res;
+	  orig_phi = gsi_stmt (gsi);
+	  new_res = copy_ssa_name (PHI_RESULT (orig_phi), NULL);
+	  new_phi = create_phi_node (new_res, new_exit_bb);
+	  arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, e);
+	  add_phi_arg (new_phi, arg, new_exit_e,
+		       gimple_phi_arg_location_from_edge (orig_phi, e));
+	  adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
+	}
     }
 
 
--- gcc/tree-vect-loop.c.jj	2013-11-28 09:18:11.772774927 +0100
+++ gcc/tree-vect-loop.c	2013-11-28 14:13:57.643572214 +0100
@@ -374,7 +374,11 @@  vect_determine_vectorization_factor (loo
 		analyze_pattern_stmt = false;
 	    }
 
-	  if (gimple_get_lhs (stmt) == NULL_TREE)
+	  if (gimple_get_lhs (stmt) == NULL_TREE
+	      /* MASK_STORE has no lhs, but is ok.  */
+	      && (!is_gimple_call (stmt)
+		  || !gimple_call_internal_p (stmt)
+		  || gimple_call_internal_fn (stmt) != IFN_MASK_STORE))
 	    {
 	      if (is_gimple_call (stmt))
 		{
@@ -426,7 +430,12 @@  vect_determine_vectorization_factor (loo
 	  else
 	    {
 	      gcc_assert (!STMT_VINFO_DATA_REF (stmt_info));
-	      scalar_type = TREE_TYPE (gimple_get_lhs (stmt));
+	      if (is_gimple_call (stmt)
+		  && gimple_call_internal_p (stmt)
+		  && gimple_call_internal_fn (stmt) == IFN_MASK_STORE)
+		scalar_type = TREE_TYPE (gimple_call_arg (stmt, 3));
+	      else
+		scalar_type = TREE_TYPE (gimple_get_lhs (stmt));
 	      if (dump_enabled_p ())
 		{
 		  dump_printf_loc (MSG_NOTE, vect_location,
--- gcc/cfgloop.h.jj	2013-11-19 21:56:40.389335752 +0100
+++ gcc/cfgloop.h	2013-11-28 14:13:57.602572427 +0100
@@ -176,6 +176,9 @@  struct GTY ((chain_next ("%h.next"))) lo
   /* True if we should try harder to vectorize this loop.  */
   bool force_vect;
 
+  /* True if this loop should never be vectorized.  */
+  bool dont_vectorize;
+
   /* For SIMD loops, this is a unique identifier of the loop, referenced
      by IFN_GOMP_SIMD_VF, IFN_GOMP_SIMD_LANE and IFN_GOMP_SIMD_LAST_LANE
      builtins.  */
--- gcc/tree-loop-distribution.c.jj	2013-11-22 21:03:05.696896177 +0100
+++ gcc/tree-loop-distribution.c	2013-11-28 14:13:57.632572271 +0100
@@ -588,7 +588,7 @@  copy_loop_before (struct loop *loop)
   edge preheader = loop_preheader_edge (loop);
 
   initialize_original_copy_tables ();
-  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, preheader);
+  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
   gcc_assert (res != NULL);
   free_original_copy_tables ();
   delete_update_ssa ();
--- gcc/optabs.def.jj	2013-11-26 21:36:14.066329682 +0100
+++ gcc/optabs.def	2013-11-28 14:13:57.624572312 +0100
@@ -248,6 +248,8 @@  OPTAB_D (sdot_prod_optab, "sdot_prod$I$a
 OPTAB_D (ssum_widen_optab, "widen_ssum$I$a3")
 OPTAB_D (udot_prod_optab, "udot_prod$I$a")
 OPTAB_D (usum_widen_optab, "widen_usum$I$a3")
+OPTAB_D (maskload_optab, "maskload$a")
+OPTAB_D (maskstore_optab, "maskstore$a")
 OPTAB_D (vec_extract_optab, "vec_extract$a")
 OPTAB_D (vec_init_optab, "vec_init$a")
 OPTAB_D (vec_pack_sfix_trunc_optab, "vec_pack_sfix_trunc_$a")
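
As background on the two new optabs: a masked load reads only the lanes
selected by the mask (the remaining lanes of the result read as zero),
and a masked store writes only the selected lanes, leaving the rest of
the destination untouched.  On AVX the mask is interpreted through the
sign bit of each element; a short illustration using the AVX intrinsics
(the wrapper function is hypothetical, for illustration only):

  #include <immintrin.h>

  /* Copy only the lanes selected by MASK (sign bit of each 32-bit
     element); the other lanes of *DST are left untouched and the other
     lanes of the load read as 0.0f.  */
  static void
  masked_copy_8f (float *dst, const float *src, __m256i mask)
  {
    __m256 v = _mm256_maskload_ps (src, mask);
    _mm256_maskstore_ps (dst, mask, v);
  }
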
--- gcc/testsuite/gcc.target/i386/avx2-gather-6.c.jj	2013-11-28 14:13:57.633572267 +0100
+++ gcc/testsuite/gcc.target/i386/avx2-gather-6.c	2013-11-28 14:13:57.633572267 +0100
@@ -0,0 +1,7 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -mavx2 -fno-common -fdump-tree-vect-details" } */
+
+#include "avx2-gather-5.c"
+
+/* { dg-final { scan-tree-dump-times "note: vectorized 1 loops in function" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
--- gcc/testsuite/gcc.target/i386/vect-cond-1.c.jj	2013-11-28 14:57:58.182864189 +0100
+++ gcc/testsuite/gcc.target/i386/vect-cond-1.c	2013-11-28 14:57:58.182864189 +0100
@@ -0,0 +1,21 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -mavx2" { target avx2 } } */
+
+int a[1024];
+
+int
+foo (int *p)
+{
+  int i;
+  for (i = 0; i < 1024; i++)
+    {
+      int t;
+      if (a[i] < 30)
+	t = *p;
+      else
+	t = a[i] + 12;
+      a[i] = t;
+    }
+}
+
+/* { dg-final { cleanup-tree-dump "vect" } } */
--- gcc/testsuite/gcc.target/i386/avx2-gather-5.c.jj	2013-11-28 14:13:57.633572267 +0100
+++ gcc/testsuite/gcc.target/i386/avx2-gather-5.c	2013-11-28 14:13:57.633572267 +0100
@@ -0,0 +1,47 @@ 
+/* { dg-do run } */
+/* { dg-require-effective-target avx2 } */
+/* { dg-options "-O3 -mavx2 -fno-common" } */
+
+#include "avx2-check.h"
+
+#define N 1024
+float vf1[N+16], vf2[N], vf3[N];
+int k[N];
+
+__attribute__((noinline, noclone)) void
+foo (void)
+{
+  int i;
+  for (i = 0; i < N; i++)
+    {
+      float f;
+      if (vf3[i] < 0.0f)
+	f = vf1[k[i]];
+      else
+	f = 7.0f;
+      vf2[i] = f;
+    }
+}
+
+static void
+avx2_test (void)
+{
+  int i;
+  for (i = 0; i < N + 16; i++)
+    {
+      vf1[i] = 5.5f * i;
+      if (i >= N)
+	continue;
+      vf2[i] = 2.0f;
+      vf3[i] = (i & 1) ? i : -i - 1;
+      k[i] = (i & 1) ? ((i & 2) ? -i : N / 2 + i) : (i * 7) % N;
+      asm ("");
+    }
+  foo ();
+  for (i = 0; i < N; i++)
+    if (vf1[i] != 5.5 * i
+	|| vf2[i] != ((i & 1) ? 7.0f : 5.5f * ((i * 7) % N))
+	|| vf3[i] != ((i & 1) ? i : -i - 1)
+	|| k[i] != ((i & 1) ? ((i & 2) ? -i : N / 2 + i) : ((i * 7) % N)))
+      abort ();
+}
--- gcc/testsuite/gcc.dg/vect/vect-cond-11.c.jj	2013-11-28 14:13:57.634572262 +0100
+++ gcc/testsuite/gcc.dg/vect/vect-cond-11.c	2013-11-28 14:13:57.634572262 +0100
@@ -0,0 +1,116 @@ 
+#include "tree-vect.h"
+
+#define N 1024
+typedef int V __attribute__((vector_size (4)));
+unsigned int a[N * 2] __attribute__((aligned));
+unsigned int b[N * 2] __attribute__((aligned));
+V c[N];
+
+__attribute__((noinline, noclone)) unsigned int
+foo (unsigned int *a, unsigned int *b)
+{
+  int i;
+  unsigned int r = 0;
+  for (i = 0; i < N; i++)
+    {
+      unsigned int x = a[i], y = b[i];
+      if (x < 32)
+	{
+	  x = x + 127;
+	  y = y * 2;
+	}
+      else
+	{
+	  x = x - 16;
+	  y = y + 1;
+	}
+      a[i] = x;
+      b[i] = y;
+      r += x;
+    }
+  return r;
+}
+
+__attribute__((noinline, noclone)) unsigned int
+bar (unsigned int *a, unsigned int *b)
+{
+  int i;
+  unsigned int r = 0;
+  for (i = 0; i < N; i++)
+    {
+      unsigned int x = a[i], y = b[i];
+      if (x < 32)
+	{
+	  x = x + 127;
+	  y = y * 2;
+	}
+      else
+	{
+	  x = x - 16;
+	  y = y + 1;
+	}
+      a[i] = x;
+      b[i] = y;
+      c[i] = c[i] + 1;
+      r += x;
+    }
+  return r;
+}
+
+void
+baz (unsigned int *a, unsigned int *b,
+     unsigned int (*fn) (unsigned int *, unsigned int *))
+{
+  int i;
+  for (i = -64; i < 0; i++)
+    {
+      a[i] = 19;
+      b[i] = 17;
+    }
+  for (; i < N; i++)
+    {
+      a[i] = i - 512;
+      b[i] = i;
+    }
+  for (; i < N + 64; i++)
+    {
+      a[i] = 27;
+      b[i] = 19;
+    }
+  if (fn (a, b) != -512U - (N - 32) * 16U + 32 * 127U)
+    __builtin_abort ();
+  for (i = -64; i < 0; i++)
+    if (a[i] != 19 || b[i] != 17)
+      __builtin_abort ();
+  for (; i < N; i++)
+    if (a[i] != (i - 512U < 32U ? i - 512U + 127 : i - 512U - 16)
+	|| b[i] != (i - 512U < 32U ? i * 2U : i + 1U))
+      __builtin_abort ();
+  for (; i < N + 64; i++)
+    if (a[i] != 27 || b[i] != 19)
+      __builtin_abort ();
+}
+
+int
+main ()
+{
+  int i;
+  check_vect ();
+  baz (a + 512, b + 512, foo);
+  baz (a + 512, b + 512, bar);
+  baz (a + 512 + 1, b + 512 + 1, foo);
+  baz (a + 512 + 1, b + 512 + 1, bar);
+  baz (a + 512 + 31, b + 512 + 31, foo);
+  baz (a + 512 + 31, b + 512 + 31, bar);
+  baz (a + 512 + 1, b + 512, foo);
+  baz (a + 512 + 1, b + 512, bar);
+  baz (a + 512 + 31, b + 512, foo);
+  baz (a + 512 + 31, b + 512, bar);
+  baz (a + 512, b + 512 + 1, foo);
+  baz (a + 512, b + 512 + 1, bar);
+  baz (a + 512, b + 512 + 31, foo);
+  baz (a + 512, b + 512 + 31, bar);
+  return 0;
+}
+
+/* { dg-final { cleanup-tree-dump "vect" } } */
--- gcc/testsuite/gcc.dg/vect/vect-mask-load-1.c.jj	2013-11-28 14:13:57.633572267 +0100
+++ gcc/testsuite/gcc.dg/vect/vect-mask-load-1.c	2013-11-28 14:13:57.633572267 +0100
@@ -0,0 +1,52 @@ 
+/* { dg-do run } */
+/* { dg-additional-options "-Ofast -fno-common" } */
+/* { dg-additional-options "-Ofast -fno-common -mavx" { target avx_runtime } } */
+
+#include <stdlib.h>
+#include "tree-vect.h"
+
+__attribute__((noinline, noclone)) void
+foo (double *x, double *y)
+{
+  double *p = __builtin_assume_aligned (x, 16);
+  double *q = __builtin_assume_aligned (y, 16);
+  double z, h;
+  int i;
+  for (i = 0; i < 1024; i++)
+    {
+      if (p[i] < 0.0)
+	z = q[i], h = q[i] * 7.0 + 3.0;
+      else
+	z = p[i] + 6.0, h = p[1024 + i];
+      p[i] = z + 2.0 * h;
+    }
+}
+
+double a[2048] __attribute__((aligned (16)));
+double b[1024] __attribute__((aligned (16)));
+
+int
+main ()
+{
+  int i;
+  check_vect ();
+  for (i = 0; i < 1024; i++)
+    {
+      a[i] = (i & 1) ? -i : 2 * i;
+      a[i + 1024] = i;
+      b[i] = 7 * i;
+      asm ("");
+    }
+  foo (a, b);
+  for (i = 0; i < 1024; i++)
+    if (a[i] != ((i & 1)
+		 ? 7 * i + 2.0 * (7 * i * 7.0 + 3.0)
+		 : 2 * i + 6.0 + 2.0 * i)
+	|| b[i] != 7 * i
+	|| a[i + 1024] != i)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "note: vectorized 1 loops" 1 "vect" { target avx_runtime } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
--- gcc/testsuite/gcc.dg/vect/vect-mask-loadstore-1.c.jj	2013-11-28 14:13:57.634572262 +0100
+++ gcc/testsuite/gcc.dg/vect/vect-mask-loadstore-1.c	2013-11-28 14:13:57.634572262 +0100
@@ -0,0 +1,50 @@ 
+/* { dg-do run } */
+/* { dg-additional-options "-Ofast -fno-common" } */
+/* { dg-additional-options "-Ofast -fno-common -mavx" { target avx_runtime } } */
+
+#include <stdlib.h>
+#include "tree-vect.h"
+
+__attribute__((noinline, noclone)) void
+foo (float *__restrict x, float *__restrict y, float *__restrict z)
+{
+  float *__restrict p = __builtin_assume_aligned (x, 32);
+  float *__restrict q = __builtin_assume_aligned (y, 32);
+  float *__restrict r = __builtin_assume_aligned (z, 32);
+  int i;
+  for (i = 0; i < 1024; i++)
+    {
+      if (p[i] < 0.0f)
+	q[i] = p[i] + 2.0f;
+      else
+	p[i] = r[i] + 3.0f;
+    }
+}
+
+float a[1024] __attribute__((aligned (32)));
+float b[1024] __attribute__((aligned (32)));
+float c[1024] __attribute__((aligned (32)));
+
+int
+main ()
+{
+  int i;
+  check_vect ();
+  for (i = 0; i < 1024; i++)
+    {
+      a[i] = (i & 1) ? -i : i;
+      b[i] = 7 * i;
+      c[i] = a[i] - 3.0f;
+      asm ("");
+    }
+  foo (a, b, c);
+  for (i = 0; i < 1024; i++)
+    if (a[i] != ((i & 1) ? -i : i)
+	|| b[i] != ((i & 1) ? a[i] + 2.0f : 7 * i)
+	|| c[i] != a[i] - 3.0f)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "note: vectorized 1 loops" 1 "vect" { target avx_runtime } } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
--- gcc/passes.def.jj	2013-11-27 12:15:13.999517045 +0100
+++ gcc/passes.def	2013-11-28 14:13:57.602572427 +0100
@@ -217,6 +217,8 @@  along with GCC; see the file COPYING3.
 	  NEXT_PASS (pass_iv_canon);
 	  NEXT_PASS (pass_parallelize_loops);
 	  NEXT_PASS (pass_if_conversion);
+	  /* pass_vectorize must immediately follow pass_if_conversion.
+	     Please do not add any other passes in between.  */
 	  NEXT_PASS (pass_vectorize);
           PUSH_INSERT_PASSES_WITHIN (pass_vectorize)
 	      NEXT_PASS (pass_dce_loop);
--- gcc/tree-predcom.c.jj	2013-11-22 21:03:14.589851957 +0100
+++ gcc/tree-predcom.c	2013-11-28 14:59:15.529464377 +0100
@@ -732,6 +732,9 @@  split_data_refs_to_components (struct lo
 	     just fail.  */
 	  goto end;
 	}
+      /* predcom pass isn't prepared to handle calls with data references.  */
+      if (is_gimple_call (DR_STMT (dr)))
+	goto end;
       dr->aux = (void *) (size_t) i;
       comp_father[i] = i;
       comp_size[i] = 1;
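
For orientation before the tree-vect-stmts.c changes below: after
if-conversion the scalar IL contains internal calls of the form
MASK_LOAD (ptr, offset, mask), with the loaded value as the call lhs,
and MASK_STORE (ptr, offset, mask, value); the second argument is the
pointer-typed constant that becomes the MEM_REF offset operand when the
call is expanded.  A hedged sketch (identifiers are placeholders, not an
actual dump) of what vectorizable_mask_load_store makes of them:

  /* Scalar, if-converted form (sketch).  */
  _5 = MASK_LOAD (p_2, 0B, _mask_4);

  /* Vectorized form: the same internal call on a vector data-ref
     pointer and a vector mask, emitted ncopies times with the pointer
     bumped by TYPE_SIZE_UNIT (vectype) between copies (sketch).  */
  vect__5 = MASK_LOAD (vectp_2, 0B, vect_mask_4);
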
--- gcc/tree-vect-stmts.c.jj	2013-11-27 12:15:14.038516844 +0100
+++ gcc/tree-vect-stmts.c	2013-11-28 14:57:58.182864189 +0100
@@ -235,7 +235,7 @@  vect_mark_relevant (vec<gimple> *worklis
           /* This use is out of pattern use, if LHS has other uses that are
              pattern uses, we should mark the stmt itself, and not the pattern
              stmt.  */
-	  if (TREE_CODE (lhs) == SSA_NAME)
+	  if (lhs && TREE_CODE (lhs) == SSA_NAME)
 	    FOR_EACH_IMM_USE_FAST (use_p, imm_iter, lhs)
 	      {
 		if (is_gimple_debug (USE_STMT (use_p)))
@@ -393,7 +393,27 @@  exist_non_indexing_operands_for_use_p (t
      first case, and whether var corresponds to USE.  */
 
   if (!gimple_assign_copy_p (stmt))
-    return false;
+    {
+      if (is_gimple_call (stmt)
+	  && gimple_call_internal_p (stmt))
+	switch (gimple_call_internal_fn (stmt))
+	  {
+	  case IFN_MASK_STORE:
+	    operand = gimple_call_arg (stmt, 3);
+	    if (operand == use)
+	      return true;
+	    /* FALLTHRU */
+	  case IFN_MASK_LOAD:
+	    operand = gimple_call_arg (stmt, 2);
+	    if (operand == use)
+	      return true;
+	    break;
+	  default:
+	    break;
+	  }
+      return false;
+    }
+
   if (TREE_CODE (gimple_assign_lhs (stmt)) == SSA_NAME)
     return false;
   operand = gimple_assign_rhs1 (stmt);
@@ -1696,6 +1716,413 @@  vectorizable_function (gimple call, tree
 						        vectype_in);
 }
 
+
+static tree permute_vec_elements (tree, tree, tree, gimple,
+				  gimple_stmt_iterator *);
+
+
+/* Function vectorizable_mask_load_store.
+
+   Check if STMT performs a conditional load or store that can be vectorized.
+   If VEC_STMT is also passed, vectorize the STMT: create a vectorized
+   stmt to replace it, put it in VEC_STMT, and insert it at GSI.
+   Return FALSE if not a vectorizable STMT, TRUE otherwise.  */
+
+static bool
+vectorizable_mask_load_store (gimple stmt, gimple_stmt_iterator *gsi,
+			      gimple *vec_stmt, slp_tree slp_node)
+{
+  tree vec_dest = NULL;
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  stmt_vec_info prev_stmt_info;
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  bool nested_in_vect_loop = nested_in_vect_loop_p (loop, stmt);
+  struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
+  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+  tree elem_type;
+  gimple new_stmt;
+  tree dummy;
+  tree dataref_ptr = NULL_TREE;
+  gimple ptr_incr;
+  int nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  int ncopies;
+  int i, j;
+  bool inv_p;
+  tree gather_base = NULL_TREE, gather_off = NULL_TREE;
+  tree gather_off_vectype = NULL_TREE, gather_decl = NULL_TREE;
+  int gather_scale = 1;
+  enum vect_def_type gather_dt = vect_unknown_def_type;
+  bool is_store;
+  tree mask;
+  gimple def_stmt;
+  tree def;
+  enum vect_def_type dt;
+
+  if (slp_node != NULL)
+    return false;
+
+  ncopies = LOOP_VINFO_VECT_FACTOR (loop_vinfo) / nunits;
+  gcc_assert (ncopies >= 1);
+
+  is_store = gimple_call_internal_fn (stmt) == IFN_MASK_STORE;
+  mask = gimple_call_arg (stmt, 2);
+  if (TYPE_PRECISION (TREE_TYPE (mask))
+      != GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (vectype))))
+    return false;
+
+  /* FORNOW. This restriction should be relaxed.  */
+  if (nested_in_vect_loop && ncopies > 1)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "multiple types in nested loop.");
+      return false;
+    }
+
+  if (!STMT_VINFO_RELEVANT_P (stmt_info))
+    return false;
+
+  if (STMT_VINFO_DEF_TYPE (stmt_info) != vect_internal_def)
+    return false;
+
+  if (!STMT_VINFO_DATA_REF (stmt_info))
+    return false;
+
+  elem_type = TREE_TYPE (vectype);
+
+  if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+    return false;
+
+  if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
+    return false;
+
+  if (STMT_VINFO_GATHER_P (stmt_info))
+    {
+      gimple def_stmt;
+      tree def;
+      gather_decl = vect_check_gather (stmt, loop_vinfo, &gather_base,
+				       &gather_off, &gather_scale);
+      gcc_assert (gather_decl);
+      if (!vect_is_simple_use_1 (gather_off, NULL, loop_vinfo, NULL,
+				 &def_stmt, &def, &gather_dt,
+				 &gather_off_vectype))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "gather index use not simple.");
+	  return false;
+	}
+    }
+  else if (tree_int_cst_compare (nested_in_vect_loop
+				 ? STMT_VINFO_DR_STEP (stmt_info)
+				 : DR_STEP (dr), size_zero_node) <= 0)
+    return false;
+  else if (optab_handler (is_store ? maskstore_optab : maskload_optab,
+			  TYPE_MODE (vectype)) == CODE_FOR_nothing)
+    return false;
+
+  if (TREE_CODE (mask) != SSA_NAME)
+    return false;
+
+  if (!vect_is_simple_use (mask, stmt, loop_vinfo, NULL,
+			   &def_stmt, &def, &dt))
+    return false;
+
+  if (is_store)
+    {
+      tree rhs = gimple_call_arg (stmt, 3);
+      if (!vect_is_simple_use (rhs, stmt, loop_vinfo, NULL,
+			       &def_stmt, &def, &dt))
+	return false;
+    }
+
+  if (!vec_stmt) /* transformation not required.  */
+    {
+      STMT_VINFO_TYPE (stmt_info) = call_vec_info_type;
+      if (is_store)
+	vect_model_store_cost (stmt_info, ncopies, false, dt,
+			       NULL, NULL, NULL);
+      else
+	vect_model_load_cost (stmt_info, ncopies, false, NULL, NULL, NULL);
+      return true;
+    }
+
+  /** Transform.  **/
+
+  if (STMT_VINFO_GATHER_P (stmt_info))
+    {
+      tree vec_oprnd0 = NULL_TREE, op;
+      tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gather_decl));
+      tree rettype, srctype, ptrtype, idxtype, masktype, scaletype;
+      tree ptr, vec_mask = NULL_TREE, mask_op, var, scale;
+      tree perm_mask = NULL_TREE, prev_res = NULL_TREE;
+      edge pe = loop_preheader_edge (loop);
+      gimple_seq seq;
+      basic_block new_bb;
+      enum { NARROW, NONE, WIDEN } modifier;
+      int gather_off_nunits = TYPE_VECTOR_SUBPARTS (gather_off_vectype);
+
+      if (nunits == gather_off_nunits)
+	modifier = NONE;
+      else if (nunits == gather_off_nunits / 2)
+	{
+	  unsigned char *sel = XALLOCAVEC (unsigned char, gather_off_nunits);
+	  modifier = WIDEN;
+
+	  for (i = 0; i < gather_off_nunits; ++i)
+	    sel[i] = i | nunits;
+
+	  perm_mask = vect_gen_perm_mask (gather_off_vectype, sel);
+	  gcc_assert (perm_mask != NULL_TREE);
+	}
+      else if (nunits == gather_off_nunits * 2)
+	{
+	  unsigned char *sel = XALLOCAVEC (unsigned char, nunits);
+	  modifier = NARROW;
+
+	  for (i = 0; i < nunits; ++i)
+	    sel[i] = i < gather_off_nunits
+		     ? i : i + nunits - gather_off_nunits;
+
+	  perm_mask = vect_gen_perm_mask (vectype, sel);
+	  gcc_assert (perm_mask != NULL_TREE);
+	  ncopies *= 2;
+	}
+      else
+	gcc_unreachable ();
+
+      rettype = TREE_TYPE (TREE_TYPE (gather_decl));
+      srctype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+      ptrtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+      idxtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+      masktype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+      scaletype = TREE_VALUE (arglist);
+      gcc_checking_assert (types_compatible_p (srctype, rettype)
+			   && types_compatible_p (srctype, masktype));
+
+      vec_dest = vect_create_destination_var (gimple_call_lhs (stmt), vectype);
+
+      ptr = fold_convert (ptrtype, gather_base);
+      if (!is_gimple_min_invariant (ptr))
+	{
+	  ptr = force_gimple_operand (ptr, &seq, true, NULL_TREE);
+	  new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
+	  gcc_assert (!new_bb);
+	}
+
+      scale = build_int_cst (scaletype, gather_scale);
+
+      prev_stmt_info = NULL;
+      for (j = 0; j < ncopies; ++j)
+	{
+	  if (modifier == WIDEN && (j & 1))
+	    op = permute_vec_elements (vec_oprnd0, vec_oprnd0,
+				       perm_mask, stmt, gsi);
+	  else if (j == 0)
+	    op = vec_oprnd0
+	      = vect_get_vec_def_for_operand (gather_off, stmt, NULL);
+	  else
+	    op = vec_oprnd0
+	      = vect_get_vec_def_for_stmt_copy (gather_dt, vec_oprnd0);
+
+	  if (!useless_type_conversion_p (idxtype, TREE_TYPE (op)))
+	    {
+	      gcc_assert (TYPE_VECTOR_SUBPARTS (TREE_TYPE (op))
+			  == TYPE_VECTOR_SUBPARTS (idxtype));
+	      var = vect_get_new_vect_var (idxtype, vect_simple_var, NULL);
+	      var = make_ssa_name (var, NULL);
+	      op = build1 (VIEW_CONVERT_EXPR, idxtype, op);
+	      new_stmt
+		= gimple_build_assign_with_ops (VIEW_CONVERT_EXPR, var,
+						op, NULL_TREE);
+	      vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	      op = var;
+	    }
+
+	  if (j == 0)
+	    vec_mask = vect_get_vec_def_for_operand (mask, stmt, NULL);
+	  else
+	    {
+	      vect_is_simple_use (vec_mask, NULL, loop_vinfo, NULL, &def_stmt,
+				  &def, &dt);
+	      vec_mask = vect_get_vec_def_for_stmt_copy (dt, vec_mask);
+	    }
+
+	  mask_op = vec_mask;
+	  if (!useless_type_conversion_p (masktype, TREE_TYPE (vec_mask)))
+	    {
+	      gcc_assert (TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask_op))
+			  == TYPE_VECTOR_SUBPARTS (masktype));
+	      var = vect_get_new_vect_var (masktype, vect_simple_var, NULL);
+	      var = make_ssa_name (var, NULL);
+	      mask_op = build1 (VIEW_CONVERT_EXPR, masktype, mask_op);
+	      new_stmt
+		= gimple_build_assign_with_ops (VIEW_CONVERT_EXPR, var,
+						mask_op, NULL_TREE);
+	      vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	      mask_op = var;
+	    }
+
+	  new_stmt
+	    = gimple_build_call (gather_decl, 5, mask_op, ptr, op, mask_op,
+				 scale);
+
+	  if (!useless_type_conversion_p (vectype, rettype))
+	    {
+	      gcc_assert (TYPE_VECTOR_SUBPARTS (vectype)
+			  == TYPE_VECTOR_SUBPARTS (rettype));
+	      var = vect_get_new_vect_var (rettype, vect_simple_var, NULL);
+	      op = make_ssa_name (var, new_stmt);
+	      gimple_call_set_lhs (new_stmt, op);
+	      vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	      var = make_ssa_name (vec_dest, NULL);
+	      op = build1 (VIEW_CONVERT_EXPR, vectype, op);
+	      new_stmt
+		= gimple_build_assign_with_ops (VIEW_CONVERT_EXPR, var, op,
+						NULL_TREE);
+	    }
+	  else
+	    {
+	      var = make_ssa_name (vec_dest, new_stmt);
+	      gimple_call_set_lhs (new_stmt, var);
+	    }
+
+	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+
+	  if (modifier == NARROW)
+	    {
+	      if ((j & 1) == 0)
+		{
+		  prev_res = var;
+		  continue;
+		}
+	      var = permute_vec_elements (prev_res, var,
+					  perm_mask, stmt, gsi);
+	      new_stmt = SSA_NAME_DEF_STMT (var);
+	    }
+
+	  if (prev_stmt_info == NULL)
+	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+	  else
+	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
+	  prev_stmt_info = vinfo_for_stmt (new_stmt);
+	}
+      return true;
+    }
+  else if (is_store)
+    {
+      tree vec_rhs = NULL_TREE, vec_mask = NULL_TREE;
+      prev_stmt_info = NULL;
+      for (i = 0; i < ncopies; i++)
+	{
+	  unsigned align, misalign;
+
+	  if (i == 0)
+	    {
+	      tree rhs = gimple_call_arg (stmt, 3);
+	      vec_rhs = vect_get_vec_def_for_operand (rhs, stmt, NULL);
+	      vec_mask = vect_get_vec_def_for_operand (mask, stmt, NULL);
+	      /* We should have caught mismatched types earlier.  */
+	      gcc_assert (useless_type_conversion_p (vectype,
+						     TREE_TYPE (vec_rhs)));
+	      dataref_ptr = vect_create_data_ref_ptr (stmt, vectype, NULL,
+						      NULL_TREE, &dummy, gsi,
+						      &ptr_incr, false, &inv_p);
+	      gcc_assert (!inv_p);
+	    }
+	  else
+	    {
+	      vect_is_simple_use (vec_rhs, NULL, loop_vinfo, NULL, &def_stmt,
+				  &def, &dt);
+	      vec_rhs = vect_get_vec_def_for_stmt_copy (dt, vec_rhs);
+	      vect_is_simple_use (vec_mask, NULL, loop_vinfo, NULL, &def_stmt,
+				  &def, &dt);
+	      vec_mask = vect_get_vec_def_for_stmt_copy (dt, vec_mask);
+	      dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt,
+					     TYPE_SIZE_UNIT (vectype));
+	    }
+
+	  align = TYPE_ALIGN_UNIT (vectype);
+	  if (aligned_access_p (dr))
+	    misalign = 0;
+	  else if (DR_MISALIGNMENT (dr) == -1)
+	    {
+	      align = TYPE_ALIGN_UNIT (elem_type);
+	      misalign = 0;
+	    }
+	  else
+	    misalign = DR_MISALIGNMENT (dr);
+	  set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
+				  misalign);
+	  new_stmt
+	    = gimple_build_call_internal (IFN_MASK_STORE, 4, dataref_ptr,
+					  gimple_call_arg (stmt, 1),
+					  vec_mask, vec_rhs);
+	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	  if (i == 0)
+	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+	  else
+	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
+	  prev_stmt_info = vinfo_for_stmt (new_stmt);
+	}
+    }
+  else
+    {
+      tree vec_mask = NULL_TREE;
+      prev_stmt_info = NULL;
+      vec_dest = vect_create_destination_var (gimple_call_lhs (stmt), vectype);
+      for (i = 0; i < ncopies; i++)
+	{
+	  unsigned align, misalign;
+
+	  if (i == 0)
+	    {
+	      vec_mask = vect_get_vec_def_for_operand (mask, stmt, NULL);
+	      dataref_ptr = vect_create_data_ref_ptr (stmt, vectype, NULL,
+						      NULL_TREE, &dummy, gsi,
+						      &ptr_incr, false, &inv_p);
+	      gcc_assert (!inv_p);
+	    }
+	  else
+	    {
+	      vect_is_simple_use (vec_mask, NULL, loop_vinfo, NULL, &def_stmt,
+				  &def, &dt);
+	      vec_mask = vect_get_vec_def_for_stmt_copy (dt, vec_mask);
+	      dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt,
+					     TYPE_SIZE_UNIT (vectype));
+	    }
+
+	  align = TYPE_ALIGN_UNIT (vectype);
+	  if (aligned_access_p (dr))
+	    misalign = 0;
+	  else if (DR_MISALIGNMENT (dr) == -1)
+	    {
+	      align = TYPE_ALIGN_UNIT (elem_type);
+	      misalign = 0;
+	    }
+	  else
+	    misalign = DR_MISALIGNMENT (dr);
+	  set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
+				  misalign);
+	  new_stmt
+	    = gimple_build_call_internal (IFN_MASK_LOAD, 3, dataref_ptr,
+					  gimple_call_arg (stmt, 1),
+					  vec_mask);
+	  gimple_call_set_lhs (new_stmt, make_ssa_name (vec_dest, NULL));
+	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	  if (i == 0)
+	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
+	  else
+	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = new_stmt;
+	  prev_stmt_info = vinfo_for_stmt (new_stmt);
+	}
+    }
+
+  return true;
+}
+
+
 /* Function vectorizable_call.
 
    Check if STMT performs a function call that can be vectorized.
@@ -1738,6 +2165,12 @@  vectorizable_call (gimple stmt, gimple_s
   if (!is_gimple_call (stmt))
     return false;
 
+  if (gimple_call_internal_p (stmt)
+      && (gimple_call_internal_fn (stmt) == IFN_MASK_LOAD
+	  || gimple_call_internal_fn (stmt) == IFN_MASK_STORE))
+    return vectorizable_mask_load_store (stmt, gsi, vec_stmt,
+					 slp_node);
+
   if (gimple_call_lhs (stmt) == NULL_TREE
       || TREE_CODE (gimple_call_lhs (stmt)) != SSA_NAME)
     return false;
@@ -4051,10 +4484,6 @@  vectorizable_shift (gimple stmt, gimple_
 }
 
 
-static tree permute_vec_elements (tree, tree, tree, gimple,
-				  gimple_stmt_iterator *);
-
-
 /* Function vectorizable_operation.
 
    Check if STMT performs a binary, unary or ternary operation that can
@@ -6567,6 +6996,10 @@  vect_transform_stmt (gimple stmt, gimple
     case call_vec_info_type:
       done = vectorizable_call (stmt, gsi, &vec_stmt, slp_node);
       stmt = gsi_stmt (*gsi);
+      if (is_gimple_call (stmt)
+	  && gimple_call_internal_p (stmt)
+	  && gimple_call_internal_fn (stmt) == IFN_MASK_STORE)
+	is_store = true;
       break;
 
     case call_simd_clone_vec_info_type:
--- gcc/tree-ssa-phiopt.c.jj	2013-11-22 21:03:14.569852057 +0100
+++ gcc/tree-ssa-phiopt.c	2013-11-28 15:01:39.825688128 +0100
@@ -1706,7 +1706,7 @@  cond_if_else_store_replacement (basic_bl
         == chrec_dont_know)
       || !then_datarefs.length ()
       || (find_data_references_in_bb (NULL, else_bb, &else_datarefs)
-        == chrec_dont_know)
+	  == chrec_dont_know)
       || !else_datarefs.length ())
     {
       free_data_refs (then_datarefs);
@@ -1723,6 +1723,8 @@  cond_if_else_store_replacement (basic_bl
 
       then_store = DR_STMT (then_dr);
       then_lhs = gimple_get_lhs (then_store);
+      if (then_lhs == NULL_TREE)
+	continue;
       found = false;
 
       FOR_EACH_VEC_ELT (else_datarefs, j, else_dr)
@@ -1732,6 +1734,8 @@  cond_if_else_store_replacement (basic_bl
 
           else_store = DR_STMT (else_dr);
           else_lhs = gimple_get_lhs (else_store);
+	  if (else_lhs == NULL_TREE)
+	    continue;
 
           if (operand_equal_p (then_lhs, else_lhs, 0))
             {