
Add support for fully-predicated loops

Message ID 87po8hymvt.fsf@linaro.org
State New
Series Add support for fully-predicated loops

Commit Message

Richard Sandiford Nov. 17, 2017, 2:56 p.m. UTC
This patch adds support for using a single fully-predicated loop instead
of a vector loop and a scalar tail.  An SVE WHILELO instruction generates
the predicate for each iteration of the loop, given the current scalar
iv value and the loop bound.  This operation is wrapped up in a new internal
function called WHILE_ULT.  E.g.:

   WHILE_ULT (0, 3, { 0, 0, 0, 0 }) -> { 1, 1, 1, 0 }
   WHILE_ULT (UINT_MAX - 1, UINT_MAX, { 0, 0, 0, 0 }) -> { 1, 0, 0, 0 }

The third WHILE_ULT argument is needed to make the operation
unambiguous: without it, WHILE_ULT (0, 3) for one vector type would
seem equivalent to WHILE_ULT (0, 3) for another, even if the types have
different numbers of elements.
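
For illustration only (this C sketch is not part of the patch, and the
helper names while_ult_model and add_one_model are invented here), the
semantics above and the way a fully-masked loop uses them can be
modelled roughly as:

  /* Illustrative model of the while_ult semantics documented in the
     md.texi hunk below: lane I of MASK is active while BASE + I is
     below BOUND, and every lane after the first failing test stays
     inactive.  */
  static void
  while_ult_model (unsigned int base, unsigned int bound,
		   unsigned char *mask, unsigned int nunits)
  {
    mask[0] = (base < bound);
    for (unsigned int i = 1; i < nunits; ++i)
      mask[i] = mask[i - 1] && (base + i < bound);
  }

  /* A fully-masked loop then needs no scalar tail: the last vector
     iteration simply runs with its trailing lanes inactive.  (The real
     implementation also guards against the IV wrapping; this toy loop
     ignores that.)  */
  static void
  add_one_model (float *a, unsigned int n)
  {
    unsigned char mask[4];
    for (unsigned int i = 0; i < n; i += 4)
      {
	while_ult_model (i, n, mask, 4);
	for (unsigned int j = 0; j < 4; ++j)
	  if (mask[j])
	    a[i + j] += 1.0f;
      }
  }

With n == 3, the single vector iteration computes the { 1, 1, 1, 0 }
mask shown above, so only the first three elements are updated and no
scalar tail loop is needed.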

Note that the patch uses "mask" and "fully-masked" instead of
"predicate" and "fully-predicated", to follow existing GCC terminology.

This patch just handles the simple cases, punting for things like
reductions and live-out values.  Later patches remove most of these
restrictions.

Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Richard


2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* optabs.def (while_ult_optab): New optab.
	* doc/md.texi (while_ult@var{m}@var{n}): Document.
	* internal-fn.def (WHILE_ULT): New internal function.
	* internal-fn.h (direct_internal_fn_supported_p): New override
	that takes two types as arguments.
	* internal-fn.c (while_direct): New macro.
	(expand_while_optab_fn): New function.
	(convert_optab_supported_p): Likewise.
	(direct_while_optab_supported_p): New macro.
	* wide-int.h (wi::udiv_ceil): New function.
	* tree-vectorizer.h (rgroup_masks): New structure.
	(vec_loop_masks): New typedef.
	(_loop_vec_info): Add masks, mask_compare_type, can_fully_mask_p
	and fully_masked_p.
	(LOOP_VINFO_CAN_FULLY_MASK_P, LOOP_VINFO_FULLY_MASKED_P)
	(LOOP_VINFO_MASKS, LOOP_VINFO_MASK_COMPARE_TYPE): New macros.
	(vect_max_vf): New function.
	(slpeel_make_loop_iterate_ntimes): Delete.
	(vect_set_loop_condition, vect_get_loop_mask_type, vect_gen_while)
	(vect_halve_mask_nunits, vect_double_mask_nunits): Declare.
	(vect_record_loop_mask, vect_get_loop_mask): Likewise.
	* tree-vect-loop-manip.c: Include tree-ssa-loop-niter.h,
	internal-fn.h, stor-layout.h and optabs-query.h.
	(vect_set_loop_mask): New function.
	(add_preheader_seq): Likewise.
	(add_header_seq): Likewise.
	(vect_maybe_permute_loop_masks): Likewise.
	(vect_set_loop_masks_directly): Likewise.
	(vect_set_loop_condition_masked): Likewise.
	(vect_set_loop_condition_unmasked): New function, split out from
	slpeel_make_loop_iterate_ntimes.
	(slpeel_make_loop_iterate_ntimes): Rename to...
	(vect_set_loop_condition): ...this.  Use vect_set_loop_condition_masked
	for fully-masked loops and vect_set_loop_condition_unmasked otherwise.
	(vect_do_peeling): Update call accordingly.
	(vect_gen_vector_loop_niters): Use VF as the step for fully-masked
	loops.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
	mask_compare_type, can_fully_mask_p and fully_masked_p.
	(release_vec_loop_masks): New function.
	(_loop_vec_info): Use it to free the loop masks.
	(can_produce_all_loop_masks_p): New function.
	(vect_get_max_nscalars_per_iter): Likewise.
	(vect_verify_full_masking): Likewise.
	(vect_analyze_loop_2): Save LOOP_VINFO_CAN_FULLY_MASK_P around
	retries, and free the mask rgroups before retrying.  Check loop-wide
	reasons for disallowing fully-masked loops.  Make the final decision
	about whether to use a fully-masked loop or not.
	(vect_estimate_min_profitable_iters): Do not assume that peeling
	for the number of iterations will be needed for fully-masked loops.
	(vectorizable_reduction): Disable fully-masked loops.
	(vectorizable_live_operation): Likewise.
	(vect_halve_mask_nunits): New function.
	(vect_double_mask_nunits): Likewise.
	(vect_record_loop_mask): Likewise.
	(vect_get_loop_mask): Likewise.
	(vect_transform_loop): Handle the case in which the final loop
	iteration might handle a partial vector.  Call vect_set_loop_condition
	instead of slpeel_make_loop_iterate_ntimes.
	* tree-vect-stmts.c: Include tree-ssa-loop-niter.h and gimple-fold.h.
	(check_load_store_masking): New function.
	(prepare_load_store_mask): Likewise.
	(vectorizable_store): Handle fully-masked loops.
	(vectorizable_load): Likewise.
	(supportable_widening_operation): Use vect_halve_mask_nunits for
	booleans.
	(supportable_narrowing_operation): Likewise, using vect_double_mask_nunits.
	(vect_gen_while): New function.
	* config/aarch64/aarch64.md (umax<mode>3): New expander.
	(aarch64_uqdec<mode>): New insn.
	* config/aarch64/aarch64-sve.md (<perm_optab>_<mode>)
	(*aarch64_sve_<perm_insn><perm_hilo><mode>): New predicate patterns.

gcc/testsuite/
	* gcc.dg/tree-ssa/cunroll-10.c: Disable vectorization.
	* gcc.dg/tree-ssa/peel1.c: Likewise.
	* gcc.dg/vect/vect-load-lanes-peeling-1.c: Remove XFAIL for
	variable-length vectors.
	* gcc.target/aarch64/sve_vcond_6.c: XFAIL test for AND.
	* gcc.target/aarch64/sve_vec_bool_cmp_1.c: Expect BIC instead of NOT.
	* gcc.target/aarch64/sve_slp_1.c: Check for a fully-masked loop.
	* gcc.target/aarch64/sve_slp_2.c: Likewise.
	* gcc.target/aarch64/sve_slp_3.c: Likewise.
	* gcc.target/aarch64/sve_slp_4.c: Likewise.
	* gcc.target/aarch64/sve_slp_6.c: Likewise.
	* gcc.target/aarch64/sve_slp_8.c: New test.
	* gcc.target/aarch64/sve_slp_8_run.c: Likewise.
	* gcc.target/aarch64/sve_slp_9.c: Likewise.
	* gcc.target/aarch64/sve_slp_9_run.c: Likewise.
	* gcc.target/aarch64/sve_slp_10.c: Likewise.
	* gcc.target/aarch64/sve_slp_10_run.c: Likewise.
	* gcc.target/aarch64/sve_slp_11.c: Likewise.
	* gcc.target/aarch64/sve_slp_11_run.c: Likewise.
	* gcc.target/aarch64/sve_slp_12.c: Likewise.
	* gcc.target/aarch64/sve_slp_12_run.c: Likewise.
	* gcc.target/aarch64/sve_ld1r_2.c: Likewise.
	* gcc.target/aarch64/sve_ld1r_2_run.c: Likewise.
	* gcc.target/aarch64/sve_while_1.c: Likewise.
	* gcc.target/aarch64/sve_while_2.c: Likewise.
	* gcc.target/aarch64/sve_while_3.c: Likewise.
	* gcc.target/aarch64/sve_while_4.c: Likewise.

Comments

Jeff Law Dec. 18, 2017, 7:40 p.m. UTC | #1
On 11/17/2017 07:56 AM, Richard Sandiford wrote:
> [... quoted patch description and ChangeLog snipped ...]
Like other SVE related patches, I haven't looked at the aarch64 specific
bits, just the generic bits.

Sadly, I'm totally lost on this one....   I understand at a 30000ft
level what you're trying to do and many of the low level primitives made
sense.  But I wasn't able to go from those primitives to the higher
level implementation details, even though the higher level
implementation details didn't seem all that large.

I trust your judgment on this stuff.

OK for the trunk.


Jeff
James Greenhalgh Jan. 7, 2018, 5:08 p.m. UTC | #2
On Mon, Dec 18, 2017 at 07:40:00PM +0000, Jeff Law wrote:
> On 11/17/2017 07:56 AM, Richard Sandiford wrote:
> > [... quoted patch description and ChangeLog snipped ...]
> Like other SVE related patches, I haven't looked at the aarch64 specific
> bits, just the generic bits.
> 
> Sadly, I'm totally lost on this one....   I understand at a 30000ft
> level what you're trying to do and many of the low level primitives made
> sense.  But I wasn't able to go from those primitives to the higher
> level implementation details, even though the higher level
> implementation details didn't seem all that large.
> 
> I trust your judgment on this stuff.
> 
> OK for the trunk.

The AArch64 bits are OK.

Thanks,
James
Christophe Lyon Jan. 15, 2018, 9:57 a.m. UTC | #3
Hi Richard,


On 7 January 2018 at 18:08, James Greenhalgh <james.greenhalgh@arm.com> wrote:
> On Mon, Dec 18, 2017 at 07:40:00PM +0000, Jeff Law wrote:
>> On 11/17/2017 07:56 AM, Richard Sandiford wrote:
>> > [... quoted patch description and ChangeLog snipped ...]
>> Like other SVE related patches, I haven't looked at the aarch64 specific
>> bits, just the generic bits.
>>
>> Sadly, I'm totally lost on this one....   I understand at a 30000ft
>> level what you're trying to do and many of the low level primitives made
>> sense.  But I wasn't able to go from those primitives to the higher
>> level implementation details, even though the higher level
>> implementation details didn't seem all that large.
>>
>> I trust your judgment on this stuff.
>>
>> OK for the trunk.
>
> The AArch64 bits are OK.
>

As I reported in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83846,
I've noticed that one of the new tests (aarch64/sve/while_4.c)
fails when using -mabi=ilp32

Christophe

> Thanks,
> James
>

Patch

Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def	2017-11-17 14:54:05.623937515 +0000
+++ gcc/optabs.def	2017-11-17 14:54:06.032587493 +0000
@@ -94,6 +94,8 @@  OPTAB_CD(maskstore_optab, "maskstore$a$b
 OPTAB_CD(vec_extract_optab, "vec_extract$a$b")
 OPTAB_CD(vec_init_optab, "vec_init$a$b")
 
+OPTAB_CD (while_ult_optab, "while_ult$a$b")
+
 OPTAB_NL(add_optab, "add$P$a3", PLUS, "add", '3', gen_int_fp_fixed_libfunc)
 OPTAB_NX(add_optab, "add$F$a3")
 OPTAB_NX(add_optab, "add$Q$a3")
Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	2017-11-17 14:54:05.623937515 +0000
+++ gcc/doc/md.texi	2017-11-17 14:54:06.032587493 +0000
@@ -4951,6 +4951,19 @@  rounding behavior for @var{i} > 1.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{while_ult@var{m}@var{n}} instruction pattern
+@item @code{while_ult@var{m}@var{n}}
+Set operand 0 to a mask that is true while incrementing operand 1
+gives a value that is less than operand 2.  Operand 0 has mode @var{n}
+and operands 1 and 2 are scalar integers of mode @var{m}.
+The operation is equivalent to:
+
+@smallexample
+operand0[0] = operand1 < operand2;
+for (i = 1; i < GET_MODE_NUNITS (@var{n}); i++)
+  operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
+@end smallexample
+
 @cindex @code{vec_cmp@var{m}@var{n}} instruction pattern
 @item @samp{vec_cmp@var{m}@var{n}}
 Output a vector comparison.  Operand 0 of mode @var{n} is the destination for
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	2017-11-17 14:54:05.623937515 +0000
+++ gcc/internal-fn.def	2017-11-17 14:54:06.032587493 +0000
@@ -102,6 +102,8 @@  DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_
 DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
 		       vec_mask_store_lanes, mask_store_lanes)
 
+DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
+
 DEF_INTERNAL_OPTAB_FN (VEC_INTERLEAVE_LO, ECF_CONST | ECF_NOTHROW,
 		       vec_interleave_lo, binary)
 DEF_INTERNAL_OPTAB_FN (VEC_INTERLEAVE_HI, ECF_CONST | ECF_NOTHROW,
Index: gcc/internal-fn.h
===================================================================
--- gcc/internal-fn.h	2017-11-17 14:54:05.623937515 +0000
+++ gcc/internal-fn.h	2017-11-17 14:54:06.032587493 +0000
@@ -174,6 +174,20 @@  extern bool direct_internal_fn_supported
 					    optimization_type);
 extern bool direct_internal_fn_supported_p (internal_fn, tree,
 					    optimization_type);
+
+/* Return true if FN is supported for types TYPE0 and TYPE1 when the
+   optimization type is OPT_TYPE.  The types are those associated with
+   the "type0" and "type1" fields of FN's direct_internal_fn_info
+   structure.  */
+
+inline bool
+direct_internal_fn_supported_p (internal_fn fn, tree type0, tree type1,
+				optimization_type opt_type)
+{
+  return direct_internal_fn_supported_p (fn, tree_pair (type0, type1),
+					 opt_type);
+}
+
 extern bool set_edom_supported_p (void);
 
 extern void expand_internal_call (gcall *);
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/internal-fn.c	2017-11-17 14:54:06.032587493 +0000
@@ -88,6 +88,7 @@  #define store_lanes_direct { 0, 0, false
 #define mask_store_lanes_direct { 0, 0, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
+#define while_direct { 0, 2, false }
 
 const direct_internal_fn_info direct_internal_fn_array[IFN_LAST + 1] = {
 #define DEF_INTERNAL_FN(CODE, FLAGS, FNSPEC) not_direct,
@@ -2764,6 +2765,35 @@  expand_direct_optab_fn (internal_fn fn,
     }
 }
 
+/* Expand WHILE_ULT call STMT using optab OPTAB.  */
+
+static void
+expand_while_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
+{
+  expand_operand ops[3];
+  tree rhs_type[2];
+
+  tree lhs = gimple_call_lhs (stmt);
+  tree lhs_type = TREE_TYPE (lhs);
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
+
+  for (unsigned int i = 0; i < 2; ++i)
+    {
+      tree rhs = gimple_call_arg (stmt, i);
+      rhs_type[i] = TREE_TYPE (rhs);
+      rtx rhs_rtx = expand_normal (rhs);
+      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type[i]));
+    }
+
+  insn_code icode = convert_optab_handler (optab, TYPE_MODE (rhs_type[0]),
+					   TYPE_MODE (lhs_type));
+
+  expand_insn (icode, 3, ops);
+  if (!rtx_equal_p (lhs_rtx, ops[0].value))
+    emit_move_insn (lhs_rtx, ops[0].value);
+}
+
 /* Expanders for optabs that can use expand_direct_optab_fn.  */
 
 #define expand_unary_optab_fn(FN, STMT, OPTAB) \
@@ -2816,6 +2846,19 @@  direct_optab_supported_p (direct_optab o
   return direct_optab_handler (optab, mode, opt_type) != CODE_FOR_nothing;
 }
 
+/* Return true if OPTAB is supported for TYPES, where the first type
+   is the destination and the second type is the source.  Used for
+   convert optabs.  */
+
+static bool
+convert_optab_supported_p (convert_optab optab, tree_pair types,
+			   optimization_type opt_type)
+{
+  return (convert_optab_handler (optab, TYPE_MODE (types.first),
+				 TYPE_MODE (types.second), opt_type)
+	  != CODE_FOR_nothing);
+}
+
 /* Return true if load/store lanes optab OPTAB is supported for
    array type TYPES.first when the optimization type is OPT_TYPE.  */
 
@@ -2838,6 +2881,7 @@  #define direct_mask_load_lanes_optab_sup
 #define direct_mask_store_optab_supported_p direct_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
+#define direct_while_optab_supported_p convert_optab_supported_p
 
 /* Return true if FN is supported for the types in TYPES when the
    optimization type is OPT_TYPE.  The types are those associated with
Index: gcc/wide-int.h
===================================================================
--- gcc/wide-int.h	2017-11-17 14:54:05.623937515 +0000
+++ gcc/wide-int.h	2017-11-17 14:54:06.038930176 +0000
@@ -557,6 +557,7 @@  #define SHIFT_FUNCTION \
   BINARY_FUNCTION udiv_floor (const T1 &, const T2 &);
   BINARY_FUNCTION sdiv_floor (const T1 &, const T2 &);
   BINARY_FUNCTION div_ceil (const T1 &, const T2 &, signop, bool * = 0);
+  BINARY_FUNCTION udiv_ceil (const T1 &, const T2 &);
   BINARY_FUNCTION div_round (const T1 &, const T2 &, signop, bool * = 0);
   BINARY_FUNCTION divmod_trunc (const T1 &, const T2 &, signop,
 				WI_BINARY_RESULT (T1, T2) *);
@@ -2677,6 +2678,14 @@  wi::div_ceil (const T1 &x, const T2 &y,
   return quotient;
 }
 
+/* Return X / Y, rounding towards +inf.  Treat X and Y as unsigned values.  */
+template <typename T1, typename T2>
+inline WI_BINARY_RESULT (T1, T2)
+wi::udiv_ceil (const T1 &x, const T2 &y)
+{
+  return div_ceil (x, y, UNSIGNED);
+}
+
 /* Return X / Y, rouding towards nearest with ties away from zero.
    Treat X and Y as having the signedness given by SGN.  Indicate
    in *OVERFLOW if the result overflows.  */
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2017-11-17 14:54:05.623937515 +0000
+++ gcc/tree-vectorizer.h	2017-11-17 14:54:06.038930176 +0000
@@ -211,6 +211,102 @@  is_a_helper <_bb_vec_info *>::test (vec_
 }
 
 
+/* In general, we can divide the vector statements in a vectorized loop
+   into related groups ("rgroups") and say that for each rgroup there is
+   some nS such that the rgroup operates on nS values from one scalar
+   iteration followed by nS values from the next.  That is, if VF is the
+   vectorization factor of the loop, the rgroup operates on a sequence:
+
+     (1,1) (1,2) ... (1,nS) (2,1) ... (2,nS) ... (VF,1) ... (VF,nS)
+
+   where (i,j) represents a scalar value with index j in a scalar
+   iteration with index i.
+
+   [ We use the term "rgroup" to emphasise that this grouping isn't
+     necessarily the same as the grouping of statements used elsewhere.
+     For example, if we implement a group of scalar loads using gather
+     loads, we'll use a separate gather load for each scalar load, and
+     thus each gather load will belong to its own rgroup. ]
+
+   In general this sequence will occupy nV vectors concatenated
+   together.  If these vectors have nL lanes each, the total number
+   of scalar values N is given by:
+
+       N = nS * VF = nV * nL
+
+   None of nS, VF, nV and nL are required to be a power of 2.  nS and nV
+   are compile-time constants but VF and nL can be variable (if the target
+   supports variable-length vectors).
+
+   In classical vectorization, each iteration of the vector loop would
+   handle exactly VF iterations of the original scalar loop.  However,
+   in a fully-masked loop, a particular iteration of the vector loop
+   might handle fewer than VF iterations of the scalar loop.  The vector
+   lanes that correspond to iterations of the scalar loop are said to be
+   "active" and the other lanes are said to be "inactive".
+
+   In a fully-masked loop, many rgroups need to be masked to ensure that
+   they have no effect for the inactive lanes.  Each such rgroup needs a
+   sequence of booleans in the same order as above, but with each (i,j)
+   replaced by a boolean that indicates whether iteration i is active.
+   This sequence occupies nV vector masks that again have nL lanes each.
+   Thus the mask sequence as a whole consists of VF independent booleans
+   that are each repeated nS times.
+
+   We make the simplifying assumption that if a sequence of nV masks is
+   suitable for one (nS,nL) pair, we can reuse it for (nS/2,nL/2) by
+   VIEW_CONVERTing it.  This holds for all current targets that support
+   fully-masked loops.  For example, suppose the scalar loop is:
+
+     float *f;
+     double *d;
+     for (int i = 0; i < n; ++i)
+       {
+         f[i * 2 + 0] += 1.0f;
+         f[i * 2 + 1] += 2.0f;
+         d[i] += 3.0;
+       }
+
+   and suppose that vectors have 256 bits.  The vectorized f accesses
+   will belong to one rgroup and the vectorized d access to another:
+
+     f rgroup: nS = 2, nV = 1, nL = 8
+     d rgroup: nS = 1, nV = 1, nL = 4
+               VF = 4
+
+     [ In this simple example the rgroups do correspond to the normal
+       SLP grouping scheme. ]
+
+   If only the first three lanes are active, the masks we need are:
+
+     f rgroup: 1 1 | 1 1 | 1 1 | 0 0
+     d rgroup:  1  |  1  |  1  |  0
+
+   Here we can use a mask calculated for f's rgroup for d's, but not
+   vice versa.
+
+   Thus for each value of nV, it is enough to provide nV masks, with the
+   mask being calculated based on the highest nL (or, equivalently, based
+   on the highest nS) required by any rgroup with that nV.  We therefore
+   represent the entire collection of masks as a two-level table, with the
+   first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
+   the second being indexed by the mask index 0 <= i < nV.  */
+
+/* The masks needed by rgroups with nV vectors, according to the
+   description above.  */
+struct rgroup_masks {
+  /* The largest nS for all rgroups that use these masks.  */
+  unsigned int max_nscalars_per_iter;
+
+  /* The type of mask to use, based on the highest nS recorded above.  */
+  tree mask_type;
+
+  /* A vector of nV masks, in iteration order.  */
+  vec<tree> masks;
+};
+
+typedef auto_vec<rgroup_masks> vec_loop_masks;
+
 /*-----------------------------------------------------------------*/
 /* Info on vectorized loops.                                       */
 /*-----------------------------------------------------------------*/
@@ -251,6 +347,14 @@  typedef struct _loop_vec_info : public v
      if there is no particular limit.  */
   unsigned HOST_WIDE_INT max_vectorization_factor;
 
+  /* The masks that a fully-masked loop should use to avoid operating
+     on inactive scalars.  */
+  vec_loop_masks masks;
+
+  /* Type of the variables to use in the WHILE_ULT call for fully-masked
+     loops.  */
+  tree mask_compare_type;
+
   /* Unknown DRs according to which loop was peeled.  */
   struct data_reference *unaligned_dr;
 
@@ -305,6 +409,12 @@  typedef struct _loop_vec_info : public v
   /* Is the loop vectorizable? */
   bool vectorizable;
 
+  /* Records whether we still have the option of using a fully-masked loop.  */
+  bool can_fully_mask_p;
+
+  /* True if have decided to use a fully-masked loop.  */
+  bool fully_masked_p;
+
   /* When we have grouped data accesses with gaps, we may introduce invalid
      memory accesses.  We peel the last iteration of the loop to prevent
      this.  */
@@ -365,8 +475,12 @@  #define LOOP_VINFO_NITERS_ASSUMPTIONS(L)
 #define LOOP_VINFO_COST_MODEL_THRESHOLD(L) (L)->th
 #define LOOP_VINFO_VERSIONING_THRESHOLD(L) (L)->versioning_threshold
 #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
+#define LOOP_VINFO_CAN_FULLY_MASK_P(L)     (L)->can_fully_mask_p
+#define LOOP_VINFO_FULLY_MASKED_P(L)       (L)->fully_masked_p
 #define LOOP_VINFO_VECT_FACTOR(L)          (L)->vectorization_factor
 #define LOOP_VINFO_MAX_VECT_FACTOR(L)      (L)->max_vectorization_factor
+#define LOOP_VINFO_MASKS(L)                (L)->masks
+#define LOOP_VINFO_MASK_COMPARE_TYPE(L)    (L)->mask_compare_type
 #define LOOP_VINFO_PTR_MASK(L)             (L)->ptr_mask
 #define LOOP_VINFO_LOOP_NEST(L)            (L)->loop_nest
 #define LOOP_VINFO_DATAREFS(L)             (L)->datarefs
@@ -1172,6 +1286,17 @@  vect_nunits_for_cost (tree vec_type)
   return estimated_poly_value (TYPE_VECTOR_SUBPARTS (vec_type));
 }
 
+/* Return the maximum possible vectorization factor for LOOP_VINFO.  */
+
+static inline unsigned HOST_WIDE_INT
+vect_max_vf (loop_vec_info loop_vinfo)
+{
+  unsigned HOST_WIDE_INT vf;
+  if (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&vf))
+    return vf;
+  return MAX_VECTORIZATION_FACTOR;
+}
+
 /* Return the size of the value accessed by unvectorized data reference DR.
    This is only valid once STMT_VINFO_VECTYPE has been calculated for the
    associated gimple statement, since that guarantees that DR accesses
@@ -1194,8 +1319,8 @@  vect_get_scalar_dr_size (struct data_ref
 
 /* Simple loop peeling and versioning utilities for vectorizer's purposes -
    in tree-vect-loop-manip.c.  */
-extern void slpeel_make_loop_iterate_ntimes (struct loop *, tree, tree,
-					     tree, bool);
+extern void vect_set_loop_condition (struct loop *, loop_vec_info,
+				     tree, tree, tree, bool);
 extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge);
 struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *,
 						     struct loop *, edge);
@@ -1211,6 +1336,7 @@  extern bool vect_can_advance_ivs_p (loop
 extern tree get_vectype_for_scalar_type (tree);
 extern tree get_mask_type_for_scalar_type (tree);
 extern tree get_same_sized_vectype (tree, tree);
+extern bool vect_get_loop_mask_type (loop_vec_info);
 extern bool vect_is_simple_use (tree, vec_info *, gimple **,
                                 enum vect_def_type *);
 extern bool vect_is_simple_use (tree, vec_info *, gimple **,
@@ -1265,6 +1391,7 @@  extern bool vect_supportable_shift (enum
 extern tree vect_gen_perm_mask_any (tree, vec_perm_indices);
 extern tree vect_gen_perm_mask_checked (tree, vec_perm_indices);
 extern void optimize_mask_stores (struct loop*);
+extern gcall *vect_gen_while (tree, tree, tree);
 
 /* In tree-vect-data-refs.c.  */
 extern bool vect_can_force_dr_alignment_p (const_tree, unsigned int);
@@ -1318,6 +1445,13 @@  extern loop_vec_info vect_analyze_loop (
 extern tree vect_build_loop_niters (loop_vec_info, bool * = NULL);
 extern void vect_gen_vector_loop_niters (loop_vec_info, tree, tree *,
 					 tree *, bool);
+extern tree vect_halve_mask_nunits (tree);
+extern tree vect_double_mask_nunits (tree);
+extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *,
+				   unsigned int, tree);
+extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
+				unsigned int, tree, unsigned int);
+
 /* Drive for loop transformation stage.  */
 extern struct loop *vect_transform_loop (loop_vec_info);
 extern loop_vec_info vect_analyze_loop_form (struct loop *);
Index: gcc/tree-vect-loop-manip.c
===================================================================
--- gcc/tree-vect-loop-manip.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/tree-vect-loop-manip.c	2017-11-17 14:54:06.035305786 +0000
@@ -42,6 +42,10 @@  Software Foundation; either version 3, o
 #include "tree-vectorizer.h"
 #include "tree-ssa-loop-ivopts.h"
 #include "gimple-fold.h"
+#include "tree-ssa-loop-niter.h"
+#include "internal-fn.h"
+#include "stor-layout.h"
+#include "optabs-query.h"
 
 /*************************************************************************
   Simple Loop Peeling Utilities
@@ -248,33 +252,420 @@  adjust_phi_and_debug_stmts (gimple *upda
 			gimple_bb (update_phi));
 }
 
-/* Make LOOP iterate N == (NITERS - STEP) / STEP + 1 times,
-   where NITERS is known to be outside the range [1, STEP - 1].
-   This is equivalent to making the loop execute NITERS / STEP
-   times when NITERS is nonzero and (1 << M) / STEP times otherwise,
-   where M is the precision of NITERS.
+/* Define one loop mask MASK from loop LOOP.  INIT_MASK is the value that
+   the mask should have during the first iteration and NEXT_MASK is the
+   value that it should have on subsequent iterations.  */
 
-   NITERS_MAYBE_ZERO is true if NITERS can be zero, false it is known
-   to be >= STEP.  In the latter case N is always NITERS / STEP.
+static void
+vect_set_loop_mask (struct loop *loop, tree mask, tree init_mask,
+		    tree next_mask)
+{
+  gphi *phi = create_phi_node (mask, loop->header);
+  add_phi_arg (phi, init_mask, loop_preheader_edge (loop), UNKNOWN_LOCATION);
+  add_phi_arg (phi, next_mask, loop_latch_edge (loop), UNKNOWN_LOCATION);
+}
 
-   If FINAL_IV is nonnull, it is an SSA name that should be set to
-   N * STEP on exit from the loop.
+/* Add SEQ to the end of LOOP's preheader block.  */
 
-   Assumption: the exit-condition of LOOP is the last stmt in the loop.  */
+static void
+add_preheader_seq (struct loop *loop, gimple_seq seq)
+{
+  if (seq)
+    {
+      edge pe = loop_preheader_edge (loop);
+      basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
+      gcc_assert (!new_bb);
+    }
+}
 
-void
-slpeel_make_loop_iterate_ntimes (struct loop *loop, tree niters, tree step,
-				 tree final_iv, bool niters_maybe_zero)
+/* Add SEQ to the beginning of LOOP's header block.  */
+
+static void
+add_header_seq (struct loop *loop, gimple_seq seq)
+{
+  if (seq)
+    {
+      gimple_stmt_iterator gsi = gsi_after_labels (loop->header);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+}
+
+/* Try to use permutes to define the masks in DEST_RGM using the masks
+   in SRC_RGM, given that the former has twice as many masks as the
+   latter.  Return true on success, adding any new statements to SEQ.  */
+
+static bool
+vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm,
+			       rgroup_masks *src_rgm)
+{
+  tree src_masktype = src_rgm->mask_type;
+  tree dest_masktype = dest_rgm->mask_type;
+  machine_mode src_mode = TYPE_MODE (src_masktype);
+  if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
+      && optab_handler (vec_unpacku_hi_optab, src_mode) != CODE_FOR_nothing
+      && optab_handler (vec_unpacku_lo_optab, src_mode) != CODE_FOR_nothing)
+    {
+      /* Unpacking the source masks gives at least as many mask bits as
+	 we need.  We can then VIEW_CONVERT any excess bits away.  */
+      tree unpack_masktype = vect_halve_mask_nunits (src_masktype);
+      for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i)
+	{
+	  tree src = src_rgm->masks[i / 2];
+	  tree dest = dest_rgm->masks[i];
+	  tree_code code = (i & 1 ? VEC_UNPACK_HI_EXPR
+			    : VEC_UNPACK_LO_EXPR);
+	  gassign *stmt;
+	  if (dest_masktype == unpack_masktype)
+	    stmt = gimple_build_assign (dest, code, src);
+	  else
+	    {
+	      tree temp = make_ssa_name (unpack_masktype);
+	      stmt = gimple_build_assign (temp, code, src);
+	      gimple_seq_add_stmt (seq, stmt);
+	      stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
+					  build1 (VIEW_CONVERT_EXPR,
+						  dest_masktype, temp));
+	    }
+	  gimple_seq_add_stmt (seq, stmt);
+	}
+      return true;
+    }
+  if (dest_masktype == src_masktype
+      && direct_internal_fn_supported_p (IFN_VEC_INTERLEAVE_LO, src_masktype,
+					 OPTIMIZE_FOR_SPEED)
+      && direct_internal_fn_supported_p (IFN_VEC_INTERLEAVE_HI, src_masktype,
+					 OPTIMIZE_FOR_SPEED))
+    {
+      /* The destination requires twice as many mask bits as the source, so
+	 we can use interleaving permutes to double up the number of bits.  */
+      for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i)
+	{
+	  tree src = src_rgm->masks[i / 2];
+	  tree dest = dest_rgm->masks[i];
+	  internal_fn ifn = (i & 1 ? IFN_VEC_INTERLEAVE_HI
+			    : IFN_VEC_INTERLEAVE_LO);
+	  gcall *stmt = gimple_build_call_internal (ifn, 2, src, src);
+	  gimple_call_set_lhs (stmt, dest);
+	  gimple_seq_add_stmt (seq, stmt);
+	}
+      return true;
+    }
+  return false;
+}
+
+/* Helper for vect_set_loop_condition_masked.  Generate definitions for
+   all the masks in RGM and return a mask that is nonzero when the loop
+   needs to iterate.  Add any new preheader statements to PREHEADER_SEQ.
+   Use LOOP_COND_GSI to insert code before the exit gcond.
+
+   RGM belongs to loop LOOP.  The loop originally iterated NITERS
+   times and has been vectorized according to LOOP_VINFO.  Each iteration
+   of the vectorized loop handles VF iterations of the scalar loop.
+
+   It is known that:
+
+     NITERS * RGM->max_nscalars_per_iter
+
+   does not overflow.  However, MIGHT_WRAP_P says whether an induction
+   variable that starts at 0 and has step:
+
+     VF * RGM->max_nscalars_per_iter
+
+   might overflow before hitting a value above:
+
+     NITERS * RGM->max_nscalars_per_iter
+
+   This means that we cannot guarantee that such an induction variable
+   would ever hit a value that produces a set of all-false masks for RGM.  */
+
+static tree
+vect_set_loop_masks_directly (struct loop *loop, loop_vec_info loop_vinfo,
+			      gimple_seq *preheader_seq,
+			      gimple_stmt_iterator loop_cond_gsi,
+			      rgroup_masks *rgm, tree vf,
+			      tree niters, bool might_wrap_p)
+{
+  tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
+  tree mask_type = rgm->mask_type;
+  unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
+  poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
+
+  /* Calculate the maximum number of scalar values that the rgroup
+     handles in total and the number that it handles for each iteration
+     of the vector loop.  */
+  tree nscalars_total = niters;
+  tree nscalars_step = vf;
+  if (nscalars_per_iter != 1)
+    {
+      /* We checked before choosing to use a fully-masked loop that these
+	 multiplications don't overflow.  */
+      tree factor = build_int_cst (compare_type, nscalars_per_iter);
+      nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
+				     nscalars_total, factor);
+      nscalars_step = gimple_build (preheader_seq, MULT_EXPR, compare_type,
+				    nscalars_step, factor);
+    }
+
+  /* Create an induction variable that counts the number of scalars
+     processed.  */
+  tree index_before_incr, index_after_incr;
+  gimple_stmt_iterator incr_gsi;
+  bool insert_after;
+  tree zero_index = build_int_cst (compare_type, 0);
+  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+  create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
+	     insert_after, &index_before_incr, &index_after_incr);
+
+  tree test_index, test_limit;
+  gimple_stmt_iterator *test_gsi;
+  if (might_wrap_p)
+    {
+      /* In principle the loop should stop iterating once the incremented
+	 IV reaches a value greater than or equal to NSCALARS_TOTAL.
+	 However, there's no guarantee that the IV hits a value above
+	 this value before wrapping around.  We therefore adjust the
+	 limit down by one IV step:
+
+	   NSCALARS_TOTAL -[infinite-prec] NSCALARS_STEP
+
+	 and compare the IV against this limit _before_ incrementing it.
+	 Since the comparison type is unsigned, we actually want the
+	 subtraction to saturate at zero:
+
+	   NSCALARS_TOTAL -[sat] NSCALARS_STEP.  */
+      test_index = index_before_incr;
+      test_limit = gimple_build (preheader_seq, MAX_EXPR, compare_type,
+				 nscalars_total, nscalars_step);
+      test_limit = gimple_build (preheader_seq, MINUS_EXPR, compare_type,
+				 test_limit, nscalars_step);
+      test_gsi = &incr_gsi;
+    }
+  else
+    {
+      /* Test the incremented IV, which will always hit a value above
+	 the bound before wrapping.  */
+      test_index = index_after_incr;
+      test_limit = nscalars_total;
+      test_gsi = &loop_cond_gsi;
+    }
+
+  /* Provide a definition of each mask in the group.  */
+  tree next_mask = NULL_TREE;
+  tree mask;
+  unsigned int i;
+  FOR_EACH_VEC_ELT_REVERSE (rgm->masks, i, mask)
+    {
+      /* Previous masks will cover BIAS scalars.  This mask covers the
+	 next batch.  */
+      poly_uint64 bias = nscalars_per_mask * i;
+      tree bias_tree = build_int_cst (compare_type, bias);
+      gimple *tmp_stmt;
+
+      /* See whether the first iteration of the vector loop is known
+	 to have a full mask.  */
+      poly_uint64 const_limit;
+      bool first_iteration_full
+	= (poly_int_tree_p (nscalars_total, &const_limit)
+	   && must_ge (const_limit, (i + 1) * nscalars_per_mask));
+
+      /* Rather than have a new IV that starts at BIAS and goes up to
+	 TEST_LIMIT, prefer to use the same 0-based IV for each mask
+	 and adjust the bound down by BIAS.  */
+      tree this_test_limit = test_limit;
+      if (i != 0)
+	{
+	  this_test_limit = gimple_build (preheader_seq, MAX_EXPR,
+					  compare_type, this_test_limit,
+					  bias_tree);
+	  this_test_limit = gimple_build (preheader_seq, MINUS_EXPR,
+					  compare_type, this_test_limit,
+					  bias_tree);
+	}
+
+      /* Create the initial mask.  */
+      tree init_mask = NULL_TREE;
+      if (!first_iteration_full)
+	{
+	  tree start, end;
+	  if (nscalars_total == test_limit)
+	    {
+	      /* Use a natural test between zero (the initial IV value)
+		 and the loop limit.  The "else" block would be valid too,
+		 but this choice can avoid the need to load BIAS_TREE into
+		 a register.  */
+	      start = zero_index;
+	      end = this_test_limit;
+	    }
+	  else
+	    {
+	      start = bias_tree;
+	      end = nscalars_total;
+	    }
+
+	  init_mask = make_temp_ssa_name (mask_type, NULL, "max_mask");
+	  tmp_stmt = vect_gen_while (init_mask, start, end);
+	  gimple_seq_add_stmt (preheader_seq, tmp_stmt);
+	}
+
+      if (!init_mask)
+	/* First iteration is full.  */
+	init_mask = build_minus_one_cst (mask_type);
+
+      /* Get the mask value for the next iteration of the loop.  */
+      next_mask = make_temp_ssa_name (mask_type, NULL, "next_mask");
+      gcall *call = vect_gen_while (next_mask, test_index, this_test_limit);
+      gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
+
+      vect_set_loop_mask (loop, mask, init_mask, next_mask);
+    }
+  return next_mask;
+}
+
+/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.
+   LOOP_VINFO describes the vectorization of LOOP.  NITERS is the
+   number of iterations of the original scalar loop.  NITERS_MAYBE_ZERO
+   and FINAL_IV are as for vect_set_loop_condition.
+
+   Insert the branch-back condition before LOOP_COND_GSI and return the
+   final gcond.  */
+
+static gcond *
+vect_set_loop_condition_masked (struct loop *loop, loop_vec_info loop_vinfo,
+				tree niters, tree final_iv,
+				bool niters_maybe_zero,
+				gimple_stmt_iterator loop_cond_gsi)
+{
+  gimple_seq preheader_seq = NULL;
+  gimple_seq header_seq = NULL;
+
+  tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
+  unsigned int compare_precision = TYPE_PRECISION (compare_type);
+  unsigned HOST_WIDE_INT max_vf = vect_max_vf (loop_vinfo);
+  tree orig_niters = niters;
+
+  /* Type of the initial value of NITERS.  */
+  tree ni_actual_type = TREE_TYPE (niters);
+  unsigned int ni_actual_precision = TYPE_PRECISION (ni_actual_type);
+
+  /* Convert NITERS to the same size as the compare.  */
+  if (compare_precision > ni_actual_precision
+      && niters_maybe_zero)
+    {
+      /* We know that there is always at least one iteration, so if the
+	 count is zero then it must have wrapped.  Cope with this by
+	 subtracting 1 before the conversion and adding 1 to the result.  */
+      gcc_assert (TYPE_UNSIGNED (ni_actual_type));
+      niters = gimple_build (&preheader_seq, PLUS_EXPR, ni_actual_type,
+			     niters, build_minus_one_cst (ni_actual_type));
+      niters = gimple_convert (&preheader_seq, compare_type, niters);
+      niters = gimple_build (&preheader_seq, PLUS_EXPR, compare_type,
+			     niters, build_one_cst (compare_type));
+    }
+  else
+    niters = gimple_convert (&preheader_seq, compare_type, niters);
+
+  /* Now calculate the value that the induction variable must be able
+     to hit in order to ensure that we end the loop with an all-false mask.
+     This involves adding the maximum number of inactive trailing scalar
+     iterations.  */
+  widest_int iv_limit;
+  bool known_max_iters = max_loop_iterations (loop, &iv_limit);
+  if (known_max_iters)
+    {
+      /* IV_LIMIT is the maximum number of latch iterations, which is also
+	 the maximum in-range IV value.  Round this value down to the previous
+	 vector alignment boundary and then add an extra full iteration.  */
+      poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+      iv_limit = (iv_limit & -(int) known_alignment (vf)) + max_vf;
+    }
+
+  /* Get the vectorization factor in tree form.  */
+  tree vf = build_int_cst (compare_type,
+			   LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+
+  /* Iterate over all the rgroups and fill in their masks.  We could use
+     the first mask from any rgroup for the loop condition; here we
+     arbitrarily pick the last.  */
+  tree test_mask = NULL_TREE;
+  rgroup_masks *rgm;
+  unsigned int i;
+  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+  FOR_EACH_VEC_ELT (*masks, i, rgm)
+    if (!rgm->masks.is_empty ())
+      {
+	/* First try using permutes.  This adds a single vector
+	   instruction to the loop for each mask, but needs no extra
+	   loop invariants or IVs.  */
+	unsigned int nmasks = i + 1;
+	if ((nmasks & 1) == 0)
+	  {
+	    rgroup_masks *half_rgm = &(*masks)[nmasks / 2 - 1];
+	    if (!half_rgm->masks.is_empty ()
+		&& vect_maybe_permute_loop_masks (&header_seq, rgm, half_rgm))
+	      continue;
+	  }
+
+	/* See whether a zero-based IV would ever generate all-false masks
+	   before wrapping around.  */
+	bool might_wrap_p
+	  = (!known_max_iters
+	     || (wi::min_precision (iv_limit * rgm->max_nscalars_per_iter,
+				    UNSIGNED)
+		 > compare_precision));
+
+	/* Set up all masks for this group.  */
+	test_mask = vect_set_loop_masks_directly (loop, loop_vinfo,
+						  &preheader_seq,
+						  loop_cond_gsi, rgm, vf,
+						  niters, might_wrap_p);
+      }
+
+  /* Emit all accumulated statements.  */
+  add_preheader_seq (loop, preheader_seq);
+  add_header_seq (loop, header_seq);
+
+  /* Get a boolean result that tells us whether to iterate.  */
+  edge exit_edge = single_exit (loop);
+  tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR;
+  tree zero_mask = build_zero_cst (TREE_TYPE (test_mask));
+  gcond *cond_stmt = gimple_build_cond (code, test_mask, zero_mask,
+					NULL_TREE, NULL_TREE);
+  gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
+
+  /* The loop iterates (NITERS - 1) / VF + 1 times.
+     Subtract one from this to get the latch count.  */
+  tree step = build_int_cst (compare_type,
+			     LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+  tree niters_minus_one = fold_build2 (PLUS_EXPR, compare_type, niters,
+				       build_minus_one_cst (compare_type));
+  loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, compare_type,
+				     niters_minus_one, step);
+
+  if (final_iv)
+    {
+      gassign *assign = gimple_build_assign (final_iv, orig_niters);
+      gsi_insert_on_edge_immediate (single_exit (loop), assign);
+    }
+
+  return cond_stmt;
+}
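
   [A worked instance of the widening adjustment above, with hypothetical
   widths (an 8-bit NITERS, a 32-bit comparison type, NITERS_MAYBE_ZERO
   set): a scalar count of 256 wraps to an 8-bit NITERS of 0, so a plain
   conversion would lose the count.  Instead:

      0 + (-1)  ->  255  (still 8 bits)
      widen     ->  255  (32 bits)
      255 + 1   ->  256

   which recovers the intended iteration count.]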
+
+/* Like vect_set_loop_condition, but handle the case in which there
+   are no loop masks.  */
+
+static gcond *
+vect_set_loop_condition_unmasked (struct loop *loop, tree niters,
+				  tree step, tree final_iv,
+				  bool niters_maybe_zero,
+				  gimple_stmt_iterator loop_cond_gsi)
 {
   tree indx_before_incr, indx_after_incr;
   gcond *cond_stmt;
   gcond *orig_cond;
   edge pe = loop_preheader_edge (loop);
   edge exit_edge = single_exit (loop);
-  gimple_stmt_iterator loop_cond_gsi;
   gimple_stmt_iterator incr_gsi;
   bool insert_after;
-  source_location loop_loc;
   enum tree_code code;
   tree niters_type = TREE_TYPE (niters);
 
@@ -360,7 +751,6 @@  slpeel_make_loop_iterate_ntimes (struct
   standard_iv_increment_position (loop, &incr_gsi, &insert_after);
   create_iv (init, step, NULL_TREE, loop,
              &incr_gsi, insert_after, &indx_before_incr, &indx_after_incr);
-
   indx_after_incr = force_gimple_operand_gsi (&loop_cond_gsi, indx_after_incr,
 					      true, NULL_TREE, true,
 					      GSI_SAME_STMT);
@@ -372,19 +762,6 @@  slpeel_make_loop_iterate_ntimes (struct
 
   gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
 
-  /* Remove old loop exit test:  */
-  gsi_remove (&loop_cond_gsi, true);
-  free_stmt_vec_info (orig_cond);
-
-  loop_loc = find_loop_location (loop);
-  if (dump_enabled_p ())
-    {
-      if (LOCATION_LOCUS (loop_loc) != UNKNOWN_LOCATION)
-	dump_printf (MSG_NOTE, "\nloop at %s:%d: ", LOCATION_FILE (loop_loc),
-		     LOCATION_LINE (loop_loc));
-      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, cond_stmt, 0);
-    }
-
   /* Record the number of latch iterations.  */
   if (limit == niters)
     /* Case A: the loop iterates NITERS times.  Subtract one to get the
@@ -403,6 +780,59 @@  slpeel_make_loop_iterate_ntimes (struct
 					     indx_after_incr, init);
       gsi_insert_on_edge_immediate (single_exit (loop), assign);
     }
+
+  return cond_stmt;
+}
+
+/* If we're using fully-masked loops, make LOOP iterate:
+
+      N == (NITERS - 1) / STEP + 1
+
+   times.  When NITERS is zero, this is equivalent to making the loop
+   execute (1 << M) / STEP times, where M is the precision of NITERS.
+   NITERS_MAYBE_ZERO is true if this last case might occur.
+
+   If we're not using fully-masked loops, make LOOP iterate:
+
+      N == (NITERS - STEP) / STEP + 1
+
+   times, where NITERS is known to be outside the range [1, STEP - 1].
+   This is equivalent to making the loop execute NITERS / STEP times
+   when NITERS is nonzero and (1 << M) / STEP times otherwise.
+   NITERS_MAYBE_ZERO again indicates whether this last case might occur.
+
+   If FINAL_IV is nonnull, it is an SSA name that should be set to
+   N * STEP on exit from the loop.
+
+   Assumption: the exit-condition of LOOP is the last stmt in the loop.  */
+
+void
+vect_set_loop_condition (struct loop *loop, loop_vec_info loop_vinfo,
+			 tree niters, tree step, tree final_iv,
+			 bool niters_maybe_zero)
+{
+  gcond *cond_stmt;
+  gcond *orig_cond = get_loop_exit_condition (loop);
+  gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond);
+
+  if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    cond_stmt = vect_set_loop_condition_masked (loop, loop_vinfo, niters,
+						final_iv, niters_maybe_zero,
+						loop_cond_gsi);
+  else
+    cond_stmt = vect_set_loop_condition_unmasked (loop, niters, step,
+						  final_iv, niters_maybe_zero,
+						  loop_cond_gsi);
+
+  /* Remove old loop exit test.  */
+  gsi_remove (&loop_cond_gsi, true);
+  free_stmt_vec_info (orig_cond);
+
+  if (dump_enabled_p ())
+    {
+      dump_printf_loc (MSG_NOTE, vect_location, "New loop exit condition: ");
+      dump_gimple_stmt (MSG_NOTE, TDF_SLIM, cond_stmt, 0);
+    }
 }
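
   [Instantiating the two formulas above with illustrative values
   NITERS = 10 and STEP = 4 (10 is outside [1, STEP - 1]):

      fully-masked:      (10 - 1) / 4 + 1 = 3  (i.e. ceil (10 / 4))
      not fully-masked:  (10 - 4) / 4 + 1 = 2  (i.e. 10 / 4 rounded down)

   matching the NITERS / STEP behaviour described for nonzero NITERS.]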
 
 /* Helper routine of slpeel_tree_duplicate_loop_to_edge_cfg.
@@ -1317,7 +1747,8 @@  vect_gen_vector_loop_niters (loop_vec_in
     ni_minus_gap = niters;
 
   unsigned HOST_WIDE_INT const_vf;
-  if (vf.is_constant (&const_vf))
+  if (vf.is_constant (&const_vf)
+      && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
     {
       /* Create: niters >> log2(vf) */
       /* If it's known that niters == number of latch executions + 1 doesn't
@@ -1724,8 +2155,7 @@  slpeel_update_phi_nodes_for_lcssa (struc
 			      CHECK_PROFITABILITY is true.
    Output:
    - *NITERS_VECTOR and *STEP_VECTOR describe how the main loop should
-     iterate after vectorization; see slpeel_make_loop_iterate_ntimes
-     for details.
+     iterate after vectorization; see vect_set_loop_condition for details.
    - *NITERS_VECTOR_MULT_VF_VAR is either null or an SSA name that
      should be set to the number of scalar iterations handled by the
      vector loop.  The SSA name is only used on exit from the loop.
@@ -1890,8 +2320,8 @@  vect_do_peeling (loop_vec_info loop_vinf
       niters_prolog = vect_gen_prolog_loop_niters (loop_vinfo, anchor,
 						   &bound_prolog);
       tree step_prolog = build_one_cst (TREE_TYPE (niters_prolog));
-      slpeel_make_loop_iterate_ntimes (prolog, niters_prolog, step_prolog,
-				       NULL_TREE, false);
+      vect_set_loop_condition (prolog, NULL, niters_prolog,
+			       step_prolog, NULL_TREE, false);
 
       /* Skip the prolog loop.  */
       if (skip_prolog)
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/tree-vect-loop.c	2017-11-17 14:54:06.037117981 +0000
@@ -1119,12 +1119,15 @@  _loop_vec_info::_loop_vec_info (struct l
     versioning_threshold (0),
     vectorization_factor (0),
     max_vectorization_factor (0),
+    mask_compare_type (NULL_TREE),
     unaligned_dr (NULL),
     peeling_for_alignment (0),
     ptr_mask (0),
     slp_unrolling_factor (1),
     single_scalar_iteration_cost (0),
     vectorizable (false),
+    can_fully_mask_p (true),
+    fully_masked_p (false),
     peeling_for_gaps (false),
     peeling_for_niter (false),
     operands_swapped (false),
@@ -1166,6 +1169,17 @@  _loop_vec_info::_loop_vec_info (struct l
   gcc_assert (nbbs == loop->num_nodes);
 }
 
+/* Free all levels of MASKS.  */
+
+void
+release_vec_loop_masks (vec_loop_masks *masks)
+{
+  rgroup_masks *rgm;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (*masks, i, rgm)
+    rgm->masks.release ();
+  masks->release ();
+}
 
 /* Free all memory used by the _loop_vec_info, as well as all the
    stmt_vec_info structs of all the stmts in the loop.  */
@@ -1231,9 +1245,97 @@  _loop_vec_info::~_loop_vec_info ()
 
   free (bbs);
 
+  release_vec_loop_masks (&masks);
+
   loop->aux = NULL;
 }
 
+/* Return true if we can use CMP_TYPE as the comparison type to produce
+   all masks required to mask LOOP_VINFO.  */
+
+static bool
+can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
+{
+  rgroup_masks *rgm;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
+    if (rgm->mask_type != NULL_TREE
+	&& !direct_internal_fn_supported_p (IFN_WHILE_ULT,
+					    cmp_type, rgm->mask_type,
+					    OPTIMIZE_FOR_SPEED))
+      return false;
+  return true;
+}
+
+/* Calculate the maximum number of scalars per iteration for every
+   rgroup in LOOP_VINFO.  */
+
+static unsigned int
+vect_get_max_nscalars_per_iter (loop_vec_info loop_vinfo)
+{
+  unsigned int res = 1;
+  unsigned int i;
+  rgroup_masks *rgm;
+  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
+    res = MAX (res, rgm->max_nscalars_per_iter);
+  return res;
+}
+
+/* Each statement in LOOP_VINFO can be masked where necessary.  Check
+   whether we can actually generate the masks required.  Return true if so,
+   storing the type of the scalar IV in LOOP_VINFO_MASK_COMPARE_TYPE.  */
+
+static bool
+vect_verify_full_masking (loop_vec_info loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  unsigned int min_ni_width;
+
+  /* Get the maximum number of iterations that is representable
+     in the counter type.  */
+  tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo));
+  widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1;
+
+  /* Get a more refined estimate for the number of iterations.  */
+  widest_int max_back_edges;
+  if (max_loop_iterations (loop, &max_back_edges))
+    max_ni = wi::smin (max_ni, max_back_edges + 1);
+
+  /* Account for rgroup masks, in which each bit is replicated N times.  */
+  max_ni *= vect_get_max_nscalars_per_iter (loop_vinfo);
+
+  /* Work out how many bits we need to represent the limit.  */
+  min_ni_width = wi::min_precision (max_ni, UNSIGNED);
+
+  /* Find a scalar mode for which WHILE_ULT is supported.  */
+  opt_scalar_int_mode cmp_mode_iter;
+  tree cmp_type = NULL_TREE;
+  FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
+    {
+      unsigned int cmp_bits = GET_MODE_BITSIZE (cmp_mode_iter.require ());
+      if (cmp_bits >= min_ni_width)
+	{
+	  tree this_type = build_nonstandard_integer_type (cmp_bits, true);
+	  if (this_type
+	      && can_produce_all_loop_masks_p (loop_vinfo, this_type))
+	    {
+	      /* Although we could stop as soon as we find a valid mode,
+		 it's often better to continue until we hit Pmode, since the
+		 operands to the WHILE are more likely to be reusable in
+		 address calculations.  */
+	      cmp_type = this_type;
+	      if (cmp_bits >= GET_MODE_BITSIZE (Pmode))
+		break;
+	    }
+	}
+    }
+
+  if (!cmp_type)
+    return false;
+
+  LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo) = cmp_type;
+  return true;
+}
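
   [A hypothetical example of the width calculation: if the loop is known
   to run at most 1000 times and the largest rgroup has 2 scalars per
   iteration, MAX_NI becomes 1000 * 2 = 2000, which needs 11 bits.
   QImode is too narrow, so HImode is the first candidate; the scan then
   keeps the widest WHILE_ULT-capable type it finds, stopping once it
   reaches the width of Pmode.]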
 
 /* Calculate the cost of one scalar iteration of the loop.  */
 static void
@@ -1978,6 +2080,12 @@  vect_analyze_loop_2 (loop_vec_info loop_
       vect_update_vf_for_slp (loop_vinfo);
     }
 
+  bool saved_can_fully_mask_p = LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo);
+
+  /* We don't expect to have to roll back to anything other than an empty
+     set of rgroups.  */
+  gcc_assert (LOOP_VINFO_MASKS (loop_vinfo).is_empty ());
+
   /* This is the point where we can re-start analysis with SLP forced off.  */
 start_over:
 
@@ -2066,11 +2174,47 @@  vect_analyze_loop_2 (loop_vec_info loop_
       return false;
     }
 
+  if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
+      && LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+    {
+      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use a fully-masked loop because peeling for"
+			 " gaps is required.\n");
+    }
+
+  if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
+      && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo))
+    {
+      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use a fully-masked loop because peeling for"
+			 " alignment is required.\n");
+    }
+
+  /* Decide whether to use a fully-masked loop for this vectorization
+     factor.  */
+  LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+    = (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)
+       && vect_verify_full_masking (loop_vinfo));
+  if (dump_enabled_p ())
+    {
+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "using a fully-masked loop.\n");
+      else
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "not using a fully-masked loop.\n");
+    }
+
   /* If epilog loop is required because of data accesses with gaps,
      one additional iteration needs to be peeled.  Check if there is
      enough iterations for vectorization.  */
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-      && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+      && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
     {
       poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
       tree scalar_niters = LOOP_VINFO_NITERSM1 (loop_vinfo);
@@ -2151,8 +2295,11 @@  vect_analyze_loop_2 (loop_vec_info loop_
   th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
   unsigned HOST_WIDE_INT const_vf;
-  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
-      && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > 0)
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    /* The main loop handles all iterations.  */
+    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
+  else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	   && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > 0)
     {
       if (!multiple_p (LOOP_VINFO_INT_NITERS (loop_vinfo)
 		       - LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo),
@@ -2210,7 +2357,8 @@  vect_analyze_loop_2 (loop_vec_info loop_
 	niters_th = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
 
       /* Niters for at least one iteration of vectorized loop.  */
-      niters_th += LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+      if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	niters_th += LOOP_VINFO_VECT_FACTOR (loop_vinfo);
       /* One additional iteration because of peeling for gap.  */
       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
 	niters_th += 1;
@@ -2313,11 +2461,14 @@  vect_analyze_loop_2 (loop_vec_info loop_
   destroy_cost_data (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo));
   LOOP_VINFO_TARGET_COST_DATA (loop_vinfo)
     = init_cost (LOOP_VINFO_LOOP (loop_vinfo));
+  /* Reset accumulated rgroup information.  */
+  release_vec_loop_masks (&LOOP_VINFO_MASKS (loop_vinfo));
   /* Reset assorted flags.  */
   LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false;
   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false;
   LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0;
   LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = 0;
+  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = saved_can_fully_mask_p;
 
   goto start_over;
 }
@@ -3512,7 +3663,7 @@  vect_estimate_min_profitable_iters (loop
     = LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST (loop_vinfo);
 
   /* Add additional cost for the peeled instructions in prologue and epilogue
-     loop.
+     loop.  (For fully-masked loops there will be no peeling.)
 
      FORNOW: If we don't know the value of peel_iters for prologue or epilogue
      at compile-time - we assume it's vf/2 (the worst would be vf-1).
@@ -3520,7 +3671,12 @@  vect_estimate_min_profitable_iters (loop
      TODO: Build an expression that represents peel_iters for prologue and
      epilogue to be used in a run-time test.  */
 
-  if (npeel  < 0)
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
+      peel_iters_prologue = 0;
+      peel_iters_epilogue = 0;
+    }
+  else if (npeel < 0)
     {
       peel_iters_prologue = assumed_vf / 2;
       dump_printf (MSG_NOTE, "cost model: "
@@ -3751,8 +3907,9 @@  vect_estimate_min_profitable_iters (loop
 	       "  Calculated minimum iters for profitability: %d\n",
 	       min_profitable_iters);
 
-  /* We want the vectorized loop to execute at least once.  */
-  if (min_profitable_iters < (assumed_vf + peel_iters_prologue))
+  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+      && min_profitable_iters < (assumed_vf + peel_iters_prologue))
+    /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
 
   if (dump_enabled_p ())
@@ -6569,6 +6726,15 @@  vectorizable_reduction (gimple *stmt, gi
 
   if (!vec_stmt) /* transformation not required.  */
     {
+      if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "can't use a fully-masked loop due to "
+			     "reduction operation.\n");
+	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	}
+
       if (first_p)
 	vect_model_reduction_cost (stmt_info, epilog_reduc_code, ncopies);
       STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
@@ -7389,8 +7555,19 @@  vectorizable_live_operation (gimple *stm
     }
 
   if (!vec_stmt)
-    /* No transformation required.  */
-    return true;
+    {
+      if (LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "can't use a fully-masked loop because "
+			     "a value is live outside the loop.\n");
+	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	}
+
+      /* No transformation required.  */
+      return true;
+    }
 
   /* If stmt has a related stmt, then use that for getting the lhs.  */
   if (is_pattern_stmt_p (stmt_info))
@@ -7405,6 +7582,8 @@  vectorizable_live_operation (gimple *stm
 	     : TYPE_SIZE (TREE_TYPE (vectype)));
   vec_bitsize = TYPE_SIZE (vectype);
 
+  gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
+
   /* Get the vectorized lhs of STMT and the lane to use (counted in bits).  */
   tree vec_lhs, bitstart;
   if (slp_node)
@@ -7538,6 +7717,97 @@  loop_niters_no_overflow (loop_vec_info l
   return false;
 }
 
+/* Return a mask type with half the number of elements as TYPE.  */
+
+tree
+vect_halve_mask_nunits (tree type)
+{
+  poly_uint64 nunits = exact_div (TYPE_VECTOR_SUBPARTS (type), 2);
+  return build_truth_vector_type (nunits, current_vector_size);
+}
+
+/* Return a mask type with twice as many elements as TYPE.  */
+
+tree
+vect_double_mask_nunits (tree type)
+{
+  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (type) * 2;
+  return build_truth_vector_type (nunits, current_vector_size);
+}
+
+/* Record that a fully-masked version of LOOP_VINFO would need MASKS to
+   contain a sequence of NVECTORS masks that each control a vector of type
+   VECTYPE.  */
+
+void
+vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
+		       unsigned int nvectors, tree vectype)
+{
+  gcc_assert (nvectors != 0);
+  if (masks->length () < nvectors)
+    masks->safe_grow_cleared (nvectors);
+  rgroup_masks *rgm = &(*masks)[nvectors - 1];
+  /* The number of scalars per iteration and the number of vectors are
+     both compile-time constants.  */
+  unsigned int nscalars_per_iter
+    = exact_div (nvectors * TYPE_VECTOR_SUBPARTS (vectype),
+		 LOOP_VINFO_VECT_FACTOR (loop_vinfo)).to_constant ();
+  if (rgm->max_nscalars_per_iter < nscalars_per_iter)
+    {
+      rgm->max_nscalars_per_iter = nscalars_per_iter;
+      rgm->mask_type = build_same_sized_truth_vector_type (vectype);
+    }
+}
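
   [For example, with hypothetical numbers: an rgroup that needs
   NVECTORS == 2 masks for 8-element vectors in a loop with VF == 8 gives

      nscalars_per_iter = (2 * 8) / 8 = 2

   i.e. the rgroup controls two vector elements for every scalar
   iteration, and it is recorded in MASKS[1].]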
+
+/* Given a complete set of masks MASKS, extract mask number INDEX
+   for an rgroup that operates on NVECTORS vectors of type VECTYPE,
+   where 0 <= INDEX < NVECTORS.  Insert any set-up statements before GSI.
+
+   See the comment above vec_loop_masks for more details about the mask
+   arrangement.  */
+
+tree
+vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
+		    unsigned int nvectors, tree vectype, unsigned int index)
+{
+  rgroup_masks *rgm = &(*masks)[nvectors - 1];
+  tree mask_type = rgm->mask_type;
+
+  /* Populate the rgroup's mask array, if this is the first time we've
+     used it.  */
+  if (rgm->masks.is_empty ())
+    {
+      rgm->masks.safe_grow_cleared (nvectors);
+      for (unsigned int i = 0; i < nvectors; ++i)
+	{
+	  tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask");
+	  /* Provide a dummy definition until the real one is available.  */
+	  SSA_NAME_DEF_STMT (mask) = gimple_build_nop ();
+	  rgm->masks[i] = mask;
+	}
+    }
+
+  tree mask = rgm->masks[index];
+  if (may_ne (TYPE_VECTOR_SUBPARTS (mask_type),
+	      TYPE_VECTOR_SUBPARTS (vectype)))
+    {
+      /* A loop mask for data type X can be reused for data type Y
+	 if X has N times more elements than Y and if Y's elements
+	 are N times bigger than X's.  In this case each sequence
+	 of N elements in the loop mask will be all-zero or all-one.
+	 We can then view-convert the mask so that each sequence of
+	 N elements is replaced by a single element.  */
+      gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (mask_type),
+			      TYPE_VECTOR_SUBPARTS (vectype)));
+      gimple_seq seq = NULL;
+      mask_type = build_same_sized_truth_vector_type (vectype);
+      mask = gimple_build (&seq, VIEW_CONVERT_EXPR, mask_type, mask);
+      if (seq)
+	gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
+    }
+  return mask;
+}
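
   [A hypothetical instance of the reuse case, with N == 2: a 16-element
   loop mask computed for byte vectors can serve an 8-element vector of
   halfwords, because each pair of mask elements is known to be all-zero
   or all-one:

      { 1, 1, 1, 1, 1, 1, 0, 0, ... }  -- VIEW_CONVERT -->  { 1, 1, 1, 0, ... }]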
+
 /* Scale profiling counters by estimation for LOOP which is vectorized
    by factor VF.  */
 
@@ -7672,9 +7942,12 @@  vect_transform_loop (loop_vec_info loop_
   epilogue = vect_do_peeling (loop_vinfo, niters, nitersm1, &niters_vector,
 			      &step_vector, &niters_vector_mult_vf, th,
 			      check_profitability, niters_no_overflow);
+
   if (niters_vector == NULL_TREE)
     {
-      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && must_eq (lowest_vf, vf))
+      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	  && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+	  && must_eq (lowest_vf, vf))
 	{
 	  niters_vector
 	    = build_int_cst (TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo)),
@@ -7956,13 +8229,15 @@  vect_transform_loop (loop_vec_info loop_
      a zero NITERS becomes a nonzero NITERS_VECTOR.  */
   if (integer_onep (step_vector))
     niters_no_overflow = true;
-  slpeel_make_loop_iterate_ntimes (loop, niters_vector, step_vector,
-				   niters_vector_mult_vf,
-				   !niters_no_overflow);
+  vect_set_loop_condition (loop, loop_vinfo, niters_vector, step_vector,
+			   niters_vector_mult_vf, !niters_no_overflow);
 
   unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);
   scale_profile_for_vect_loop (loop, assumed_vf);
 
+  /* True if the final iteration might not handle a full vector's
+     worth of scalar iterations.  */
+  bool final_iter_may_be_partial = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
   /* The minimum number of iterations performed by the epilogue.  This
      is 1 when peeling for gaps because we always need a final scalar
      iteration.  */
@@ -7975,16 +8250,25 @@  vect_transform_loop (loop_vec_info loop_
      back to latch counts.  */
   if (loop->any_upper_bound)
     loop->nb_iterations_upper_bound
-      = wi::udiv_floor (loop->nb_iterations_upper_bound + bias,
-			lowest_vf) - 1;
+      = (final_iter_may_be_partial
+	 ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias,
+			  lowest_vf) - 1
+	 : wi::udiv_floor (loop->nb_iterations_upper_bound + bias,
+			   lowest_vf) - 1);
   if (loop->any_likely_upper_bound)
     loop->nb_iterations_likely_upper_bound
-      = wi::udiv_floor (loop->nb_iterations_likely_upper_bound + bias,
-			lowest_vf) - 1;
+      = (final_iter_may_be_partial
+	 ? wi::udiv_ceil (loop->nb_iterations_likely_upper_bound + bias,
+			  lowest_vf) - 1
+	 : wi::udiv_floor (loop->nb_iterations_likely_upper_bound + bias,
+			   lowest_vf) - 1);
   if (loop->any_estimate)
     loop->nb_iterations_estimate
-      = wi::udiv_floor (loop->nb_iterations_estimate + bias,
-			assumed_vf) - 1;
+      = (final_iter_may_be_partial
+	 ? wi::udiv_ceil (loop->nb_iterations_estimate + bias,
+			  assumed_vf) - 1
+	 : wi::udiv_floor (loop->nb_iterations_estimate + bias,
+			   assumed_vf) - 1);
 
   if (dump_enabled_p ())
     {
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/tree-vect-stmts.c	2017-11-17 14:54:06.038024078 +0000
@@ -48,6 +48,8 @@  Software Foundation; either version 3, o
 #include "tree-vectorizer.h"
 #include "builtins.h"
 #include "internal-fn.h"
+#include "tree-ssa-loop-niter.h"
+#include "gimple-fold.h"
 
 /* For lang_hooks.types.type_for_mode.  */
 #include "langhooks.h"
@@ -1692,6 +1694,113 @@  vectorizable_internal_function (combined
 static tree permute_vec_elements (tree, tree, tree, gimple *,
 				  gimple_stmt_iterator *);
 
+/* Check whether a load or store statement in the loop described by
+   LOOP_VINFO is possible in a fully-masked loop.  This is testing
+   whether the vectorizer pass has the appropriate support, as well as
+   whether the target does.
+
+   VLS_TYPE says whether the statement is a load or store and VECTYPE
+   is the type of the vector being loaded or stored.  MEMORY_ACCESS_TYPE
+   says how the load or store is going to be implemented and GROUP_SIZE
+   is the number of load or store statements in the containing group.
+
+   Clear LOOP_VINFO_CAN_FULLY_MASK_P if a fully-masked loop is not
+   supported, otherwise record the required mask types.  */
+
+static void
+check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
+			  vec_load_store_type vls_type, int group_size,
+			  vect_memory_access_type memory_access_type)
+{
+  /* Invariant loads need no special support.  */
+  if (memory_access_type == VMAT_INVARIANT)
+    return;
+
+  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+  machine_mode vecmode = TYPE_MODE (vectype);
+  bool is_load = (vls_type == VLS_LOAD);
+  if (memory_access_type == VMAT_LOAD_STORE_LANES)
+    {
+      if (is_load
+	  ? !vect_load_lanes_supported (vectype, group_size, true)
+	  : !vect_store_lanes_supported (vectype, group_size, true))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "can't use a fully-masked loop because the"
+			     " target doesn't have an appropriate masked"
+			     " load/store-lanes instruction.\n");
+	  LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+	  return;
+	}
+      unsigned int ncopies = vect_get_num_copies (loop_vinfo, vectype);
+      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);
+      return;
+    }
+
+  if (memory_access_type != VMAT_CONTIGUOUS
+      && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
+    {
+      /* Element X of the data must come from iteration i * VF + X of the
+	 scalar loop.  We need more work to support other mappings.  */
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use a fully-masked loop because an access"
+			 " isn't contiguous.\n");
+      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+      return;
+    }
+
+  machine_mode mask_mode;
+  if (!(targetm.vectorize.get_mask_mode
+	(GET_MODE_NUNITS (vecmode),
+	 GET_MODE_SIZE (vecmode)).exists (&mask_mode))
+      || !can_vec_mask_load_store_p (vecmode, mask_mode, is_load))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "can't use a fully-masked loop because the target"
+			 " doesn't have the appropriate masked load or"
+			 " store.\n");
+      LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = false;
+      return;
+    }
+  /* We might load more scalars than we need for permuting SLP loads.
+     We checked in get_group_load_store_type that the extra elements
+     don't leak into a new vector.  */
+  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  unsigned int nvectors;
+  if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors))
+    vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype);
+  else
+    gcc_unreachable ();
+}
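
   [A small worked example for the contiguous case, with illustrative
   values: a group of GROUP_SIZE == 2 contiguous 32-bit accesses in a loop
   where VF == NUNITS needs GROUP_SIZE * VF / NUNITS == 2 data vectors per
   vector iteration, so vect_record_loop_mask is asked for an rgroup of
   2 masks of VECTYPE's mask type.]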
+
+/* Return the mask input to a masked load or store.  VEC_MASK is the vectorized
+   form of the scalar mask condition and LOOP_MASK, if nonnull, is the mask
+   that needs to be applied to all loads and stores in a vectorized loop.
+   Return VEC_MASK if LOOP_MASK is null, otherwise return VEC_MASK & LOOP_MASK.
+
+   MASK_TYPE is the type of both masks.  If new statements are needed,
+   insert them before GSI.  */
+
+static tree
+prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
+			 gimple_stmt_iterator *gsi)
+{
+  gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
+  if (!loop_mask)
+    return vec_mask;
+
+  gcc_assert (TREE_TYPE (loop_mask) == mask_type);
+  tree and_res = make_temp_ssa_name (mask_type, NULL, "vec_mask_and");
+  gimple *and_stmt = gimple_build_assign (and_res, BIT_AND_EXPR,
+					  vec_mask, loop_mask);
+  gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
+  return and_res;
+}
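
   [In the combined case the emitted statement is simply a bitwise AND of
   the two masks, e.g. with hypothetical SSA names:

      vec_mask_and_4 = vec_mask_2 & loop_mask_3;

   and the returned name is what the caller passes to the masked load or
   store.]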
+
 /* STMT is a non-strided load or store, meaning that it accesses
    elements with a known constant step.  Return -1 if that step
    is negative, 0 if it is zero, and 1 if it is greater than zero.  */
@@ -5806,9 +5915,29 @@  vectorizable_store (gimple *stmt, gimple
 	return false;
     }
 
+  grouped_store = STMT_VINFO_GROUPED_ACCESS (stmt_info);
+  if (grouped_store)
+    {
+      first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
+      first_dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (first_stmt));
+      group_size = GROUP_SIZE (vinfo_for_stmt (first_stmt));
+    }
+  else
+    {
+      first_stmt = stmt;
+      first_dr = dr;
+      group_size = vec_num = 1;
+    }
+
   if (!vec_stmt) /* transformation not required.  */
     {
       STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
+
+      if (loop_vinfo
+	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
+	check_load_store_masking (loop_vinfo, vectype, vls_type, group_size,
+				  memory_access_type);
+
       STMT_VINFO_TYPE (stmt_info) = store_vec_info_type;
       /* The SLP costs are calculated during SLP analysis.  */
       if (!PURE_SLP_STMT (stmt_info))
@@ -5969,13 +6098,8 @@  vectorizable_store (gimple *stmt, gimple
       return true;
     }
 
-  grouped_store = STMT_VINFO_GROUPED_ACCESS (stmt_info);
   if (grouped_store)
     {
-      first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
-      first_dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (first_stmt));
-      group_size = GROUP_SIZE (vinfo_for_stmt (first_stmt));
-
       GROUP_STORE_COUNT (vinfo_for_stmt (first_stmt))++;
 
       /* FORNOW */
@@ -6010,12 +6134,7 @@  vectorizable_store (gimple *stmt, gimple
       ref_type = get_group_alias_ptr_type (first_stmt);
     }
   else
-    {
-      first_stmt = stmt;
-      first_dr = dr;
-      group_size = vec_num = 1;
-      ref_type = reference_alias_ptr_type (DR_REF (first_dr));
-    }
+    ref_type = reference_alias_ptr_type (DR_REF (first_dr));
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -6037,6 +6156,7 @@  vectorizable_store (gimple *stmt, gimple
       /* Checked by get_load_store_type.  */
       unsigned int const_nunits = nunits.to_constant ();
 
+      gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop_p (loop, stmt));
 
       stride_base
@@ -6267,10 +6387,13 @@  vectorizable_store (gimple *stmt, gimple
 
   alignment_support_scheme = vect_supportable_dr_alignment (first_dr, false);
   gcc_assert (alignment_support_scheme);
+  bool masked_loop_p = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
   /* Targets with store-lane instructions must not require explicit
      realignment.  vect_supportable_dr_alignment always returns either
      dr_aligned or dr_unaligned_supported for masked operations.  */
-  gcc_assert ((memory_access_type != VMAT_LOAD_STORE_LANES && !mask)
+  gcc_assert ((memory_access_type != VMAT_LOAD_STORE_LANES
+	       && !mask
+	       && !masked_loop_p)
 	      || alignment_support_scheme == dr_aligned
 	      || alignment_support_scheme == dr_unaligned_supported);
 
@@ -6327,6 +6450,7 @@  vectorizable_store (gimple *stmt, gimple
 
   prev_stmt_info = NULL;
   tree vec_mask = NULL_TREE;
+  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
   for (j = 0; j < ncopies; j++)
     {
 
@@ -6436,8 +6560,15 @@  vectorizable_store (gimple *stmt, gimple
 	      write_vector_array (stmt, gsi, vec_oprnd, vec_array, i);
 	    }
 
+	  tree final_mask = NULL;
+	  if (masked_loop_p)
+	    final_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
+	  if (vec_mask)
+	    final_mask = prepare_load_store_mask (mask_vectype, final_mask,
+						  vec_mask, gsi);
+
 	  gcall *call;
-	  if (mask)
+	  if (final_mask)
 	    {
 	      /* Emit:
 		   MASK_STORE_LANES (DATAREF_PTR, ALIAS_PTR, VEC_MASK,
@@ -6446,7 +6577,7 @@  vectorizable_store (gimple *stmt, gimple
 	      tree alias_ptr = build_int_cst (ref_type, align);
 	      call = gimple_build_call_internal (IFN_MASK_STORE_LANES, 4,
 						 dataref_ptr, alias_ptr,
-						 vec_mask, vec_array);
+						 final_mask, vec_array);
 	    }
 	  else
 	    {
@@ -6478,6 +6609,14 @@  vectorizable_store (gimple *stmt, gimple
 	    {
 	      unsigned align, misalign;
 
+	      tree final_mask = NULL_TREE;
+	      if (masked_loop_p)
+		final_mask = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
+						 vectype, vec_num * j + i);
+	      if (vec_mask)
+		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
+						      vec_mask, gsi);
+
 	      if (i > 0)
 		/* Bump the vector pointer.  */
 		dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi,
@@ -6514,14 +6653,14 @@  vectorizable_store (gimple *stmt, gimple
 		}
 
 	      /* Arguments are ready.  Create the new vector stmt.  */
-	      if (mask)
+	      if (final_mask)
 		{
 		  align = least_bit_hwi (misalign | align);
 		  tree ptr = build_int_cst (ref_type, align);
 		  gcall *call
 		    = gimple_build_call_internal (IFN_MASK_STORE, 4,
 						  dataref_ptr, ptr,
-						  vec_mask, vec_oprnd);
+						  final_mask, vec_oprnd);
 		  gimple_call_set_nothrow (call, true);
 		  new_stmt = call;
 		}
@@ -6892,6 +7031,8 @@  vectorizable_load (gimple *stmt, gimple_
 	  return false;
 	}
     }
+  else
+    group_size = 1;
 
   vect_memory_access_type memory_access_type;
   if (!get_load_store_type (stmt, vectype, slp, mask, VLS_LOAD, ncopies,
@@ -6935,6 +7076,12 @@  vectorizable_load (gimple *stmt, gimple_
     {
       if (!slp)
 	STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) = memory_access_type;
+
+      if (loop_vinfo
+	  && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
+	check_load_store_masking (loop_vinfo, vectype, VLS_LOAD, group_size,
+				  memory_access_type);
+
       STMT_VINFO_TYPE (stmt_info) = load_vec_info_type;
       /* The SLP costs are calculated during SLP analysis.  */
       if (!PURE_SLP_STMT (stmt_info))
@@ -6976,6 +7123,7 @@  vectorizable_load (gimple *stmt, gimple_
       /* Checked by get_load_store_type.  */
       unsigned int const_nunits = nunits.to_constant ();
 
+      gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
       gcc_assert (!nested_in_vect_loop);
 
       if (slp && grouped_load)
@@ -7252,9 +7400,13 @@  vectorizable_load (gimple *stmt, gimple_
 
   alignment_support_scheme = vect_supportable_dr_alignment (first_dr, false);
   gcc_assert (alignment_support_scheme);
-  /* Targets with load-lane instructions must not require explicit
-     realignment.  */
-  gcc_assert (memory_access_type != VMAT_LOAD_STORE_LANES
+  bool masked_loop_p = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
+  /* Targets with load-lane instructions must not require explicit
+     realignment.  vect_supportable_dr_alignment always returns either
+     dr_aligned or dr_unaligned_supported for masked operations.  */
+  gcc_assert ((memory_access_type != VMAT_LOAD_STORE_LANES
+	       && !mask
+	       && !masked_loop_p)
 	      || alignment_support_scheme == dr_aligned
 	      || alignment_support_scheme == dr_unaligned_supported);
 
@@ -7397,6 +7549,7 @@  vectorizable_load (gimple *stmt, gimple_
   tree vec_mask = NULL_TREE;
   prev_stmt_info = NULL;
   poly_uint64 group_elt = 0;
+  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
   for (j = 0; j < ncopies; j++)
     {
       /* 1. Create the vector or array pointer update chain.  */
@@ -7472,8 +7625,15 @@  vectorizable_load (gimple *stmt, gimple_
 
 	  vec_array = create_vector_array (vectype, vec_num);
 
+	  tree final_mask = NULL_TREE;
+	  if (masked_loop_p)
+	    final_mask = vect_get_loop_mask (gsi, masks, ncopies, vectype, j);
+	  if (vec_mask)
+	    final_mask = prepare_load_store_mask (mask_vectype, final_mask,
+						  vec_mask, gsi);
+
 	  gcall *call;
-	  if (mask)
+	  if (final_mask)
 	    {
 	      /* Emit:
 		   VEC_ARRAY = MASK_LOAD_LANES (DATAREF_PTR, ALIAS_PTR,
@@ -7482,7 +7642,7 @@  vectorizable_load (gimple *stmt, gimple_
 	      tree alias_ptr = build_int_cst (ref_type, align);
 	      call = gimple_build_call_internal (IFN_MASK_LOAD_LANES, 3,
 						 dataref_ptr, alias_ptr,
-						 vec_mask);
+						 final_mask);
 	    }
 	  else
 	    {
@@ -7511,6 +7671,15 @@  vectorizable_load (gimple *stmt, gimple_
 	{
 	  for (i = 0; i < vec_num; i++)
 	    {
+	      tree final_mask = NULL_TREE;
+	      if (masked_loop_p
+		  && memory_access_type != VMAT_INVARIANT)
+		final_mask = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
+						 vectype, vec_num * j + i);
+	      if (vec_mask)
+		final_mask = prepare_load_store_mask (mask_vectype, final_mask,
+						      vec_mask, gsi);
+
 	      if (i > 0)
 		dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi,
 					       stmt, NULL_TREE);
@@ -7541,14 +7710,14 @@  vectorizable_load (gimple *stmt, gimple_
 		      set_ptr_info_alignment (get_ptr_info (dataref_ptr),
 					      align, misalign);
 
-		    if (mask)
+		    if (final_mask)
 		      {
 			align = least_bit_hwi (misalign | align);
 			tree ptr = build_int_cst (ref_type, align);
 			gcall *call
 			  = gimple_build_call_internal (IFN_MASK_LOAD, 3,
 							dataref_ptr, ptr,
-							vec_mask);
+							final_mask);
 			gimple_call_set_nothrow (call, true);
 			new_stmt = call;
 			data_ref = NULL_TREE;
@@ -9594,11 +9763,7 @@  supportable_widening_operation (enum tre
       intermediate_mode = insn_data[icode1].operand[0].mode;
       if (VECTOR_BOOLEAN_TYPE_P (prev_type))
 	{
-	  poly_uint64 intermediate_nelts
-	    = exact_div (TYPE_VECTOR_SUBPARTS (prev_type), 2);
-	  intermediate_type
-	    = build_truth_vector_type (intermediate_nelts,
-				       current_vector_size);
+	  intermediate_type = vect_halve_mask_nunits (prev_type);
 	  if (intermediate_mode != TYPE_MODE (intermediate_type))
 	    return false;
 	}
@@ -9759,11 +9924,9 @@  supportable_narrowing_operation (enum tr
       intermediate_mode = insn_data[icode1].operand[0].mode;
       if (VECTOR_BOOLEAN_TYPE_P (prev_type))
 	{
-	  intermediate_type
-	    = build_truth_vector_type (TYPE_VECTOR_SUBPARTS (prev_type) * 2,
-				       current_vector_size);
+	  intermediate_type = vect_double_mask_nunits (prev_type);
 	  if (intermediate_mode != TYPE_MODE (intermediate_type))
-	      return false;
+	    return false;
 	}
       else
 	intermediate_type
@@ -9794,3 +9957,21 @@  supportable_narrowing_operation (enum tr
   interm_types->release ();
   return false;
 }
+
+/* Generate and return a statement that sets vector mask MASK such that
+   MASK[I] is true iff J + START_INDEX < END_INDEX for all J <= I.  */
+
+gcall *
+vect_gen_while (tree mask, tree start_index, tree end_index)
+{
+  tree cmp_type = TREE_TYPE (start_index);
+  tree mask_type = TREE_TYPE (mask);
+  gcc_checking_assert (direct_internal_fn_supported_p (IFN_WHILE_ULT,
+						       cmp_type, mask_type,
+						       OPTIMIZE_FOR_SPEED));
+  gcall *call = gimple_build_call_internal (IFN_WHILE_ULT, 3,
+					    start_index, end_index,
+					    build_zero_cst (mask_type));
+  gimple_call_set_lhs (call, mask);
+  return call;
+}
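
   [For example, with a 4-element mask and illustrative inputs 6 and 8,
   the generated statement has the form

      mask = WHILE_ULT (6, 8, { 0, 0, 0, 0 })   ->   { 1, 1, 0, 0 }

   since only the first two lanes satisfy 6 + J < 8.]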
Index: gcc/config/aarch64/aarch64.md
===================================================================
--- gcc/config/aarch64/aarch64.md	2017-11-17 14:54:05.623937515 +0000
+++ gcc/config/aarch64/aarch64.md	2017-11-17 14:54:06.031681396 +0000
@@ -3485,6 +3485,63 @@  (define_insn "csneg3<mode>_insn"
   [(set_attr "type" "csel")]
 )
 
+;; If X can be loaded by a single CNT[BHWD] instruction,
+;;
+;;    A = UMAX (B, X)
+;;
+;; is equivalent to:
+;;
+;;    TMP = UQDEC[BHWD] (B, X)
+;;    A = TMP + X
+;;
+;; Defining the pattern this way means that:
+;;
+;;    A = UMAX (B, X) - X
+;;
+;; becomes:
+;;
+;;    TMP1 = UQDEC[BHWD] (B, X)
+;;    TMP2 = TMP1 + X
+;;    A = TMP2 - X
+;;
+;; which combine can optimize to:
+;;
+;;    A = UQDEC[BHWD] (B, X)
+;;
+;; We don't use match_operand predicates because the order of the operands
+;; can vary: the CNT[BHWD] constant will come first if the other operand is
+;; a simpler constant (such as a CONST_INT), otherwise it will come second.
+(define_expand "umax<mode>3"
+  [(set (match_operand:GPI 0 "register_operand")
+	(umax:GPI (match_operand:GPI 1 "")
+		  (match_operand:GPI 2 "")))]
+  "TARGET_SVE"
+  {
+    if (aarch64_sve_cnt_immediate (operands[1], <MODE>mode))
+      std::swap (operands[1], operands[2]);
+    else if (!aarch64_sve_cnt_immediate (operands[2], <MODE>mode))
+      FAIL;
+    rtx temp = gen_reg_rtx (<MODE>mode);
+    operands[1] = force_reg (<MODE>mode, operands[1]);
+    emit_insn (gen_aarch64_uqdec<mode> (temp, operands[1], operands[2]));
+    emit_insn (gen_add<mode>3 (operands[0], temp, operands[2]));
+    DONE;
+  }
+)
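
   ;; [Checking the equivalence with concrete numbers, assuming a 256-bit
   ;; vector length so that CNTB gives X = 32:
   ;;
   ;;    B = 7:    UQDECB -> max (7 - 32, 0) = 0,    0 + 32 = 32  = UMAX (7, 32)
   ;;    B = 100:  UQDECB -> 100 - 32 = 68,         68 + 32 = 100 = UMAX (100, 32)
   ;;
   ;; so A = UMAX (B, X) - X collapses to a single UQDECB once combine
   ;; folds the add and subtract.]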
+
+;; Saturating unsigned subtraction of a CNT[BHWD] immediate.
+(define_insn "aarch64_uqdec<mode>"
+  [(set (match_operand:GPI 0 "register_operand" "=r")
+	(minus:GPI
+	 (umax:GPI (match_operand:GPI 1 "register_operand" "0")
+		   (match_operand:GPI 2 "aarch64_sve_cnt_immediate" "Dv"))
+	 (match_dup 2)))]
+  "TARGET_SVE"
+  {
+    return aarch64_output_sve_cnt_immediate ("uqdec", "%<w>0", operands[2]);
+  }
+)
+
 ;; -------------------------------------------------------------------
 ;; Logical operations
 ;; -------------------------------------------------------------------
Index: gcc/config/aarch64/aarch64-sve.md
===================================================================
--- gcc/config/aarch64/aarch64-sve.md	2017-11-17 14:54:05.623937515 +0000
+++ gcc/config/aarch64/aarch64-sve.md	2017-11-17 14:54:06.030775298 +0000
@@ -631,6 +631,13 @@  (define_expand "vec_perm<mode>"
 )
 
 (define_expand "<perm_optab>_<mode>"
+  [(set (match_operand:PRED_ALL 0 "register_operand")
+	(unspec:PRED_ALL [(match_operand:PRED_ALL 1 "register_operand")
+			  (match_operand:PRED_ALL 2 "register_operand")]
+			 OPTAB_PERMUTE))]
+  "TARGET_SVE")
+
+(define_expand "<perm_optab>_<mode>"
   [(set (match_operand:SVE_ALL 0 "register_operand")
 	(unspec:SVE_ALL [(match_operand:SVE_ALL 1 "register_operand")
 			 (match_operand:SVE_ALL 2 "register_operand")]
@@ -654,6 +661,15 @@  (define_insn "*aarch64_sve_tbl<mode>"
 )
 
 (define_insn "*aarch64_sve_<perm_insn><perm_hilo><mode>"
+  [(set (match_operand:PRED_ALL 0 "register_operand" "=Upa")
+	(unspec:PRED_ALL [(match_operand:PRED_ALL 1 "register_operand" "Upa")
+			  (match_operand:PRED_ALL 2 "register_operand" "Upa")]
+			 PERMUTE))]
+  "TARGET_SVE"
+  "<perm_insn><perm_hilo>\t%0.<Vetype>, %1.<Vetype>, %2.<Vetype>"
+)
+
+(define_insn "*aarch64_sve_<perm_insn><perm_hilo><mode>"
   [(set (match_operand:SVE_ALL 0 "register_operand" "=w")
 	(unspec:SVE_ALL [(match_operand:SVE_ALL 1 "register_operand" "w")
 			 (match_operand:SVE_ALL 2 "register_operand" "w")]
Index: gcc/testsuite/gcc.dg/tree-ssa/cunroll-10.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/cunroll-10.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/testsuite/gcc.dg/tree-ssa/cunroll-10.c	2017-11-17 14:54:06.032587493 +0000
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O3 -Warray-bounds -fdump-tree-cunroll-details" } */
+/* { dg-options "-O3 -Warray-bounds -fno-tree-vectorize -fdump-tree-cunroll-details" } */
 int a[3];
 int b[4];
 int
Index: gcc/testsuite/gcc.dg/tree-ssa/peel1.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/peel1.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/testsuite/gcc.dg/tree-ssa/peel1.c	2017-11-17 14:54:06.032587493 +0000
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
+/* { dg-options "-O3 -fno-tree-vectorize -fdump-tree-cunroll-details" } */
 struct foo {int b; int a[3];} foo;
 void add(struct foo *a,int l)
 {
Index: gcc/testsuite/gcc.dg/vect/vect-load-lanes-peeling-1.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-load-lanes-peeling-1.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-load-lanes-peeling-1.c	2017-11-17 14:54:06.033493591 +0000
@@ -10,4 +10,4 @@  f (int *__restrict a, int *__restrict b)
 }
 
 /* { dg-final { scan-tree-dump-not "Data access with gaps" "vect" } } */
-/* { dg-final { scan-tree-dump-not "epilog loop required" "vect" { xfail vect_variable_length } } } */
+/* { dg-final { scan-tree-dump-not "epilog loop required" "vect" } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_vcond_6.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_vcond_6.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_vcond_6.c	2017-11-17 14:54:06.034399688 +0000
@@ -40,7 +40,8 @@  #define TEST_ALL(T) \
 
 TEST_ALL (LOOP)
 
-/* { dg-final { scan-assembler-times {\tand\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b, p[0-9]+\.b} 3 } } */
+/* Currently we don't manage to remove ANDs from the other loops.  */
+/* { dg-final { scan-assembler-times {\tand\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b, p[0-9]+\.b} 3 { xfail *-*-* } } } */
 /* { dg-final { scan-assembler {\tand\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b, p[0-9]+\.b} } } */
 /* { dg-final { scan-assembler-times {\torr\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b, p[0-9]+\.b} 3 } } */
 /* { dg-final { scan-assembler-times {\teor\tp[0-9]+\.b, p[0-9]+/z, p[0-9]+\.b, p[0-9]+\.b} 3 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_vec_bool_cmp_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_vec_bool_cmp_1.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_vec_bool_cmp_1.c	2017-11-17 14:54:06.035305786 +0000
@@ -36,5 +36,6 @@  TEST_ALL (VEC_BOOL)
 
 /* Both cmpne and cmpeq loops will contain an exclusive predicate or.  */
 /* { dg-final { scan-assembler-times {\teors?\tp[0-9]*\.b, p[0-7]/z, p[0-9]*\.b, p[0-9]*\.b\n} 12 } } */
-/* cmpeq will also contain a predicate not operation.  */
-/* { dg-final { scan-assembler-times {\tnot\tp[0-9]*\.b, p[0-7]/z, p[0-9]*\.b\n} 6 } } */
+/* cmpeq will also contain a masked predicate not operation, which gets
+   folded to BIC.  */
+/* { dg-final { scan-assembler-times {\tbic\tp[0-9]+\.b, p[0-7]/z, p[0-9]+\.b, p[0-9]+\.b\n} 6 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_slp_1.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_1.c	2017-11-17 14:54:06.033493591 +0000
@@ -38,3 +38,22 @@  TEST_ALL (VEC_PERM)
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } } */
 /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 /* { dg-final { scan-assembler-not {\tzip2\t} } } */
+
+/* The loop should be fully-masked.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tst1h\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tst1w\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tst1d\t} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
+/* { dg-final { scan-assembler-not {\tldr} } } */
+/* { dg-final { scan-assembler-times {\tstr} 2 } } */
+/* { dg-final { scan-assembler-times {\tstr\th[0-9]+} 2 } } */
+
+/* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_2.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_slp_2.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_2.c	2017-11-17 14:54:06.034399688 +0000
@@ -36,3 +36,21 @@  TEST_ALL (VEC_PERM)
 /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #17\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 /* { dg-final { scan-assembler-not {\tzip2\t} } } */
+
+/* The loop should be fully-masked.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tst1h\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tst1w\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tst1d\t} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
+/* { dg-final { scan-assembler-not {\tldr} } } */
+/* { dg-final { scan-assembler-not {\tstr} } } */
+
+/* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_3.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_slp_3.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_3.c	2017-11-17 14:54:06.034399688 +0000
@@ -45,3 +45,23 @@  TEST_ALL (VEC_PERM)
       ZIP1 ZIP2.  */
 /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 12 } } */
 /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
+
+/* The loop should be fully-masked.  The 64-bit types need two loads
+   and stores each.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tst1h\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tst1w\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 6 } } */
+/* { dg-final { scan-assembler-times {\tst1d\t} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 12 } } */
+/* { dg-final { scan-assembler-not {\tldr} } } */
+/* { dg-final { scan-assembler-not {\tstr} } } */
+
+/* { dg-final { scan-assembler-not {\tuqdec[bhw]\t} } } */
+/* { dg-final { scan-assembler-times {\tuqdecd\t} 3 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_4.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_slp_4.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_4.c	2017-11-17 14:54:06.034399688 +0000
@@ -58,3 +58,25 @@  TEST_ALL (VEC_PERM)
       ZIP1 ZIP2 ZIP1 ZIP2.  */
 /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 36 } } */
 /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
+
+/* The loop should be fully-masked.  The 32-bit types need two loads
+   and stores each and the 64-bit types need four.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tst1h\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 6 } } */
+/* { dg-final { scan-assembler-times {\tst1w\t} 6 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 12 } } */
+/* { dg-final { scan-assembler-times {\tst1d\t} 12 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 12 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 24 } } */
+/* { dg-final { scan-assembler-not {\tldr} } } */
+/* { dg-final { scan-assembler-not {\tstr} } } */
+
+/* { dg-final { scan-assembler-not {\tuqdec[bh]\t} } } */
+/* We use UQDECW instead of UQDECD ..., MUL #2.  */
+/* { dg-final { scan-assembler-times {\tuqdecw\t} 6 } } */
+/* { dg-final { scan-assembler-times {\tuqdecd\t} 6 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_6.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve_slp_6.c	2017-11-17 14:54:05.623937515 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_6.c	2017-11-17 14:54:06.034399688 +0000
@@ -45,3 +45,5 @@  TEST_ALL (VEC_PERM)
 /* { dg-final { scan-assembler {\tld3h\t} } } */
 /* { dg-final { scan-assembler {\tld3w\t} } } */
 /* { dg-final { scan-assembler {\tld3d\t} } } */
+
+/* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_8.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_8.c	2017-11-17 14:54:06.034399688 +0000
@@ -0,0 +1,63 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)						\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
+{								\
+  for (int i = 0; i < n; ++i)					\
+    {								\
+      a[i * 2] += 1;						\
+      a[i * 2 + 1] += 2;					\
+      b[i * 4] += 3;						\
+      b[i * 4 + 1] += 4;					\
+      b[i * 4 + 2] += 5;					\
+      b[i * 4 + 3] += 6;					\
+    }								\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* The loop should be fully-masked.  The load counts are XFAILed for
+   fixed-length SVE because the constant pool adds extra loads.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 6 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
+/* { dg-final { scan-assembler-times {\tst1b\t} 6 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 6 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
+/* { dg-final { scan-assembler-times {\tst1h\t} 6 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 9 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
+/* { dg-final { scan-assembler-times {\tst1w\t} 9 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 9 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
+/* { dg-final { scan-assembler-times {\tst1d\t} 9 } } */
+/* { dg-final { scan-assembler-not {\tldr} } } */
+/* { dg-final { scan-assembler-not {\tstr} } } */
+
+/* We should use WHILEs for the accesses to "a" and ZIPs for the accesses
+   to "b".  */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tp[0-7]\.b} 2 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tp[0-7]\.h} 2 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tp[0-7]\.s} 3 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tp[0-7]\.d} 3 } } */
+/* { dg-final { scan-assembler-times {\tzip2\tp[0-7]\.b} 2 } } */
+/* { dg-final { scan-assembler-times {\tzip2\tp[0-7]\.h} 2 } } */
+/* { dg-final { scan-assembler-times {\tzip2\tp[0-7]\.s} 3 } } */
+/* { dg-final { scan-assembler-times {\tzip2\tp[0-7]\.d} 3 } } */
+
+/* { dg-final { scan-assembler-not {\tuqdec} } } */
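
As a conceptual aside on the mask requirements checked above (a sketch
only, not how the compiler actually builds the predicates): each scalar
iteration covers two "a" elements and four "b" elements, so the only
difference between the two predicates is how many data lanes map onto one
scalar iteration.  In scalar form, with a hypothetical lane index:

/* Illustration only: 1 means the lane is active, 0 means it is masked.  */
static int
a_lane_active (long iter_base, long n, int lane)
{
  return iter_base + lane / 2 < n;	/* two "a" elements per iteration */
}

static int
b_lane_active (long iter_base, long n, int lane)
{
  return iter_base + lane / 4 < n;	/* four "b" elements per iteration */
}
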
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_8_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_8_run.c	2017-11-17 14:54:06.034399688 +0000
@@ -0,0 +1,44 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_slp_8.c"
+
+#define N1 (103 * 2)
+#define N2 (111 * 2)
+
+#define HARNESS(TYPE)						\
+  {								\
+    TYPE a[N2], b[N2 * 2];					\
+    for (unsigned int i = 0; i < N2; ++i)			\
+      {								\
+	a[i] = i * 2 + i % 5;					\
+	b[i * 2] = i * 3 + i % 7;				\
+	b[i * 2 + 1] = i * 5 + i % 9;				\
+      }								\
+    vec_slp_##TYPE (a, b, N1 / 2);				\
+    for (unsigned int i = 0; i < N2; ++i)			\
+      {								\
+	TYPE orig_a = i * 2 + i % 5;				\
+	TYPE orig_b1 = i * 3 + i % 7;				\
+	TYPE orig_b2 = i * 5 + i % 9;				\
+	TYPE expected_a = orig_a;				\
+	TYPE expected_b1 = orig_b1;				\
+	TYPE expected_b2 = orig_b2;				\
+	if (i < N1)						\
+	  {							\
+	    expected_a += i & 1 ? 2 : 1;			\
+	    expected_b1 += i & 1 ? 5 : 3;			\
+	    expected_b2 += i & 1 ? 6 : 4;			\
+	  }							\
+	if (a[i] != expected_a					\
+	    || b[i * 2] != expected_b1				\
+	    || b[i * 2 + 1] != expected_b2)			\
+	  __builtin_abort ();					\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_9.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_9.c	2017-11-17 14:54:06.034399688 +0000
@@ -0,0 +1,53 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE1, TYPE2)					\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE1##_##TYPE2 (TYPE1 *restrict a,			\
+			   TYPE2 *restrict b, int n)		\
+{								\
+  for (int i = 0; i < n; ++i)					\
+    {								\
+      a[i * 2] += 1;						\
+      a[i * 2 + 1] += 2;					\
+      b[i * 2] += 3;						\
+      b[i * 2 + 1] += 4;					\
+    }								\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t, uint16_t)				\
+  T (uint8_t, int16_t)				\
+  T (int16_t, uint32_t)				\
+  T (uint16_t, int32_t)				\
+  T (int32_t, double)				\
+  T (uint32_t, int64_t)				\
+  T (float, uint64_t)
+
+TEST_ALL (VEC_PERM)
+
+/* The loop should be fully-masked.  The load counts are XFAILed for
+   fixed-length SVE because the constant pool adds extra loads.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
+/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 6 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
+/* { dg-final { scan-assembler-times {\tst1h\t} 6 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 7 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
+/* { dg-final { scan-assembler-times {\tst1w\t} 7 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 6 { xfail { aarch64_sve && { ! vect_variable_length } } } } } */
+/* { dg-final { scan-assembler-times {\tst1d\t} 6 } } */
+/* { dg-final { scan-assembler-not {\tldr} } } */
+/* { dg-final { scan-assembler-not {\tstr} } } */
+
+/* We should use WHILEs for the accesses to "a" and unpacks for the accesses
+   to "b".  */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
+/* { dg-final { scan-assembler-not {\twhilelo\tp[0-7]\.d} } } */
+/* { dg-final { scan-assembler-times {\tpunpklo\tp[0-7]\.h} 7 } } */
+/* { dg-final { scan-assembler-times {\tpunpkhi\tp[0-7]\.h} 7 } } */
+
+/* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_9_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_9_run.c	2017-11-17 14:54:06.034399688 +0000
@@ -0,0 +1,39 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_slp_9.c"
+
+#define N1 (103 * 2)
+#define N2 (111 * 2)
+
+#define HARNESS(TYPE1, TYPE2)					\
+  {								\
+    TYPE1 a[N2];						\
+    TYPE2 b[N2];						\
+    for (unsigned int i = 0; i < N2; ++i)			\
+      {								\
+	a[i] = i * 2 + i % 5;					\
+	b[i] = i * 3 + i % 7;					\
+      }								\
+    vec_slp_##TYPE1##_##TYPE2 (a, b, N1 / 2);			\
+    for (unsigned int i = 0; i < N2; ++i)			\
+      {								\
+	TYPE1 orig_a = i * 2 + i % 5;				\
+	TYPE2 orig_b = i * 3 + i % 7;				\
+	TYPE1 expected_a = orig_a;				\
+	TYPE2 expected_b = orig_b;				\
+	if (i < N1)						\
+	  {							\
+	    expected_a += i & 1 ? 2 : 1;			\
+	    expected_b += i & 1 ? 4 : 3;			\
+	  }							\
+	if (a[i] != expected_a || b[i] != expected_b)		\
+	  __builtin_abort ();					\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_10.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_10.c	2017-11-17 14:54:06.033493591 +0000
@@ -0,0 +1,58 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)						\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
+{								\
+  for (int i = 0; i < n; ++i)					\
+    {								\
+      a[i] += 1;						\
+      b[i * 4] += 2;						\
+      b[i * 4 + 1] += 3;					\
+      b[i * 4 + 2] += 4;					\
+      b[i * 4 + 3] += 5;					\
+    }								\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* The loop should be fully-masked.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 10 } } */
+/* { dg-final { scan-assembler-times {\tst1b\t} 10 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 10 } } */
+/* { dg-final { scan-assembler-times {\tst1h\t} 10 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 15 } } */
+/* { dg-final { scan-assembler-times {\tst1w\t} 15 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 15 } } */
+/* { dg-final { scan-assembler-times {\tst1d\t} 15 } } */
+/* { dg-final { scan-assembler-not {\tldr} } } */
+/* { dg-final { scan-assembler-not {\tstr} } } */
+
+/* We should use WHILEs for all accesses.  */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 20 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 20 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 30 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 30 } } */
+
+/* 6 for the 8-bit types and 2 for the 16-bit types.  */
+/* { dg-final { scan-assembler-times {\tuqdecb\t} 8 } } */
+/* 4 for the 16-bit types and 3 for the 32-bit types.  */
+/* { dg-final { scan-assembler-times {\tuqdech\t} 7 } } */
+/* 6 for the 32-bit types and 3 for the 64-bit types.  */
+/* { dg-final { scan-assembler-times {\tuqdecw\t} 9 } } */
+/* { dg-final { scan-assembler-times {\tuqdecd\t} 6 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_10_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_10_run.c	2017-11-17 14:54:06.033493591 +0000
@@ -0,0 +1,54 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_slp_10.c"
+
+#define N1 (103 * 2)
+#define N2 (111 * 2)
+
+#define HARNESS(TYPE)						\
+  {								\
+    TYPE a[N2], b[N2 * 4];					\
+    for (unsigned int i = 0; i < N2; ++i)			\
+      {								\
+	a[i] = i * 2 + i % 5;					\
+	b[i * 4] = i * 3 + i % 7;				\
+	b[i * 4 + 1] = i * 5 + i % 9;				\
+	b[i * 4 + 2] = i * 7 + i % 11;				\
+	b[i * 4 + 3] = i * 9 + i % 13;				\
+      }								\
+    vec_slp_##TYPE (a, b, N1);					\
+    for (unsigned int i = 0; i < N2; ++i)			\
+      {								\
+	TYPE orig_a = i * 2 + i % 5;				\
+	TYPE orig_b1 = i * 3 + i % 7;				\
+	TYPE orig_b2 = i * 5 + i % 9;				\
+	TYPE orig_b3 = i * 7 + i % 11;				\
+	TYPE orig_b4 = i * 9 + i % 13;				\
+	TYPE expected_a = orig_a;				\
+	TYPE expected_b1 = orig_b1;				\
+	TYPE expected_b2 = orig_b2;				\
+	TYPE expected_b3 = orig_b3;				\
+	TYPE expected_b4 = orig_b4;				\
+	if (i < N1)						\
+	  {							\
+	    expected_a += 1;					\
+	    expected_b1 += 2;					\
+	    expected_b2 += 3;					\
+	    expected_b3 += 4;					\
+	    expected_b4 += 5;					\
+	  }							\
+	if (a[i] != expected_a					\
+	    || b[i * 4] != expected_b1				\
+	    || b[i * 4 + 1] != expected_b2			\
+	    || b[i * 4 + 2] != expected_b3			\
+	    || b[i * 4 + 3] != expected_b4)			\
+	  __builtin_abort ();					\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_11.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_11.c	2017-11-17 14:54:06.033493591 +0000
@@ -0,0 +1,52 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE1, TYPE2)					\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE1##_##TYPE2 (TYPE1 *restrict a,			\
+			   TYPE2 *restrict b, int n)		\
+{								\
+  for (int i = 0; i < n; ++i)					\
+    {								\
+      a[i * 2] += 1;						\
+      a[i * 2 + 1] += 2;					\
+      b[i * 4] += 3;						\
+      b[i * 4 + 1] += 4;					\
+      b[i * 4 + 2] += 5;					\
+      b[i * 4 + 3] += 6;					\
+    }								\
+}
+
+#define TEST_ALL(T)				\
+  T (int16_t, uint8_t)				\
+  T (uint16_t, int8_t)				\
+  T (int32_t, uint16_t)				\
+  T (uint32_t, int16_t)				\
+  T (float, uint16_t)				\
+  T (int64_t, float)				\
+  T (uint64_t, int32_t)				\
+  T (double, uint32_t)
+
+TEST_ALL (VEC_PERM)
+
+/* The loop should be fully-masked.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tst1b\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 5 } } */
+/* { dg-final { scan-assembler-times {\tst1h\t} 5 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 6 } } */
+/* { dg-final { scan-assembler-times {\tst1w\t} 6 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tst1d\t} 3 } } */
+/* { dg-final { scan-assembler-not {\tldr} } } */
+/* { dg-final { scan-assembler-not {\tstr} } } */
+
+/* We should use the same WHILEs for both accesses.  */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
+/* { dg-final { scan-assembler-not {\twhilelo\tp[0-7]\.d} } } */
+
+/* { dg-final { scan-assembler-not {\tuqdec} } } */
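
One hedged reading of the "same WHILEs" expectation above: the type pairs
are chosen so that two TYPE1 elements occupy exactly as many bytes as four
TYPE2 elements, so both access groups advance through memory at the same
rate per scalar iteration.  A couple of compile-time checks of that
property, for illustration only:

#include <stdint.h>

/* Illustration only: the first and last pairs from TEST_ALL; the same
   identity holds for every pair in the list.  */
_Static_assert (2 * sizeof (int16_t) == 4 * sizeof (uint8_t),
		"two int16_t == four uint8_t");
_Static_assert (2 * sizeof (double) == 4 * sizeof (uint32_t),
		"two double == four uint32_t");
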
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_11_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_11_run.c	2017-11-17 14:54:06.033493591 +0000
@@ -0,0 +1,45 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_slp_11.c"
+
+#define N1 (103 * 2)
+#define N2 (111 * 2)
+
+#define HARNESS(TYPE1, TYPE2)					\
+  {								\
+    TYPE1 a[N2];						\
+    TYPE2 b[N2 * 2];						\
+    for (unsigned int i = 0; i < N2; ++i)			\
+      {								\
+	a[i] = i * 2 + i % 5;					\
+	b[i * 2] = i * 3 + i % 7;				\
+	b[i * 2 + 1] = i * 5 + i % 9;				\
+      }								\
+    vec_slp_##TYPE1##_##TYPE2 (a, b, N1 / 2);			\
+    for (unsigned int i = 0; i < N2; ++i)			\
+      {								\
+	TYPE1 orig_a = i * 2 + i % 5;				\
+	TYPE2 orig_b1 = i * 3 + i % 7;				\
+	TYPE2 orig_b2 = i * 5 + i % 9;				\
+	TYPE1 expected_a = orig_a;				\
+	TYPE2 expected_b1 = orig_b1;				\
+	TYPE2 expected_b2 = orig_b2;				\
+	if (i < N1)						\
+	  {							\
+	    expected_a += i & 1 ? 2 : 1;			\
+	    expected_b1 += i & 1 ? 5 : 3;			\
+	    expected_b2 += i & 1 ? 6 : 4;			\
+	  }							\
+	if (a[i] != expected_a					\
+	    || b[i * 2] != expected_b1				\
+	    || b[i * 2 + 1] != expected_b2)			\
+	  __builtin_abort ();					\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_12.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_12.c	2017-11-17 14:54:06.033493591 +0000
@@ -0,0 +1,60 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+#define N1 (19 * 2)
+
+#define VEC_PERM(TYPE)						\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b)		\
+{								\
+  for (int i = 0; i < N1; ++i)					\
+    {								\
+      a[i] += 1;						\
+      b[i * 4] += 2;						\
+      b[i * 4 + 1] += 3;					\
+      b[i * 4 + 2] += 4;					\
+      b[i * 4 + 3] += 5;					\
+    }								\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* The loop should be fully-masked.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 10 } } */
+/* { dg-final { scan-assembler-times {\tst1b\t} 10 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 10 } } */
+/* { dg-final { scan-assembler-times {\tst1h\t} 10 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 15 } } */
+/* { dg-final { scan-assembler-times {\tst1w\t} 15 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 15 } } */
+/* { dg-final { scan-assembler-times {\tst1d\t} 15 } } */
+/* { dg-final { scan-assembler-not {\tldr} } } */
+/* { dg-final { scan-assembler-not {\tstr} } } */
+
+/* We should use WHILEs for all accesses.  */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 20 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 20 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 30 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 30 } } */
+
+/* 6 for the 8-bit types and 2 for the 16-bit types.  */
+/* { dg-final { scan-assembler-times {\tuqdecb\t} 8 } } */
+/* 4 for the 16-bit types and 3 for the 32-bit types.  */
+/* { dg-final { scan-assembler-times {\tuqdech\t} 7 } } */
+/* 6 for the 32-bit types and 3 for the 64-bit types.  */
+/* { dg-final { scan-assembler-times {\tuqdecw\t} 9 } } */
+/* { dg-final { scan-assembler-times {\tuqdecd\t} 6 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_12_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_12_run.c	2017-11-17 14:54:06.034399688 +0000
@@ -0,0 +1,53 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include "sve_slp_12.c"
+
+#define N2 (31 * 2)
+
+#define HARNESS(TYPE)						\
+  {								\
+    TYPE a[N2], b[N2 * 4];					\
+    for (unsigned int i = 0; i < N2; ++i)			\
+      {								\
+	a[i] = i * 2 + i % 5;					\
+	b[i * 4] = i * 3 + i % 7;				\
+	b[i * 4 + 1] = i * 5 + i % 9;				\
+	b[i * 4 + 2] = i * 7 + i % 11;				\
+	b[i * 4 + 3] = i * 9 + i % 13;				\
+      }								\
+    vec_slp_##TYPE (a, b);					\
+    for (unsigned int i = 0; i < N2; ++i)			\
+      {								\
+	TYPE orig_a = i * 2 + i % 5;				\
+	TYPE orig_b1 = i * 3 + i % 7;				\
+	TYPE orig_b2 = i * 5 + i % 9;				\
+	TYPE orig_b3 = i * 7 + i % 11;				\
+	TYPE orig_b4 = i * 9 + i % 13;				\
+	TYPE expected_a = orig_a;				\
+	TYPE expected_b1 = orig_b1;				\
+	TYPE expected_b2 = orig_b2;				\
+	TYPE expected_b3 = orig_b3;				\
+	TYPE expected_b4 = orig_b4;				\
+	if (i < N1)						\
+	  {							\
+	    expected_a += 1;					\
+	    expected_b1 += 2;					\
+	    expected_b2 += 3;					\
+	    expected_b3 += 4;					\
+	    expected_b4 += 5;					\
+	  }							\
+	if (a[i] != expected_a					\
+	    || b[i * 4] != expected_b1				\
+	    || b[i * 4 + 1] != expected_b2			\
+	    || b[i * 4 + 2] != expected_b3			\
+	    || b[i * 4 + 3] != expected_b4)			\
+	  __builtin_abort ();					\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_ld1r_2.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_ld1r_2.c	2017-11-17 14:54:06.033493591 +0000
@@ -0,0 +1,61 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=armv8-a+sve -fno-tree-loop-distribute-patterns" } */
+
+#include <stdint.h>
+
+#define NUM_ELEMS(TYPE) (1024 / sizeof (TYPE))
+
+#define DEF_LOAD_BROADCAST(TYPE)			\
+  void __attribute__ ((noinline, noclone))		\
+  set_##TYPE (TYPE *restrict a, TYPE *restrict b)	\
+  {							\
+    for (int i = 0; i < NUM_ELEMS (TYPE); i++)		\
+      a[i] = *b;					\
+  }
+
+#define DEF_LOAD_BROADCAST_IMM(TYPE, IMM, SUFFIX)	\
+  void __attribute__ ((noinline, noclone))		\
+  set_##TYPE##_##SUFFIX (TYPE *a)			\
+  {							\
+    for (int i = 0; i < NUM_ELEMS (TYPE); i++)		\
+      a[i] = IMM;					\
+  }
+
+#define FOR_EACH_LOAD_BROADCAST(T)		\
+  T (int8_t)					\
+  T (int16_t)					\
+  T (int32_t)					\
+  T (int64_t)
+
+#define FOR_EACH_LOAD_BROADCAST_IMM(T)					\
+  T (int16_t, 129, imm_129)						\
+  T (int32_t, 129, imm_129)						\
+  T (int64_t, 129, imm_129)						\
+									\
+  T (int16_t, -130, imm_m130)						\
+  T (int32_t, -130, imm_m130)						\
+  T (int64_t, -130, imm_m130)						\
+									\
+  T (int16_t, 0x1234, imm_0x1234)					\
+  T (int32_t, 0x1234, imm_0x1234)					\
+  T (int64_t, 0x1234, imm_0x1234)					\
+									\
+  T (int16_t, 0xFEDC, imm_0xFEDC)					\
+  T (int32_t, 0xFEDC, imm_0xFEDC)					\
+  T (int64_t, 0xFEDC, imm_0xFEDC)					\
+									\
+  T (int32_t, 0x12345678, imm_0x12345678)				\
+  T (int64_t, 0x12345678, imm_0x12345678)				\
+									\
+  T (int32_t, 0xF2345678, imm_0xF2345678)				\
+  T (int64_t, 0xF2345678, imm_0xF2345678)				\
+									\
+  T (int64_t, (int64_t) 0xFEBA716B12371765, imm_FEBA716B12371765)
+
+FOR_EACH_LOAD_BROADCAST (DEF_LOAD_BROADCAST)
+FOR_EACH_LOAD_BROADCAST_IMM (DEF_LOAD_BROADCAST_IMM)
+
+/* { dg-final { scan-assembler-times {\tld1rb\tz[0-9]+\.b, p[0-7]/z, } 1 } } */
+/* { dg-final { scan-assembler-times {\tld1rh\tz[0-9]+\.h, p[0-7]/z, } 5 } } */
+/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, p[0-7]/z, } 7 } } */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, p[0-7]/z, } 8 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_ld1r_2_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_ld1r_2_run.c	2017-11-17 14:54:06.033493591 +0000
@@ -0,0 +1,38 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O3 -march=armv8-a+sve -fno-tree-loop-distribute-patterns" } */
+
+#include "sve_ld1r_2.c"
+
+#define TEST_LOAD_BROADCAST(TYPE)		\
+  {						\
+    TYPE v[NUM_ELEMS (TYPE)];			\
+    TYPE val = 99;				\
+    set_##TYPE (v, &val);			\
+    for (int i = 0; i < NUM_ELEMS (TYPE); i++)	\
+      {						\
+	if (v[i] != (TYPE) 99)			\
+	  __builtin_abort ();			\
+	asm volatile ("" ::: "memory");		\
+      }						\
+  }
+
+#define TEST_LOAD_BROADCAST_IMM(TYPE, IMM, SUFFIX)	\
+  {							\
+    TYPE v[NUM_ELEMS (TYPE)];				\
+    set_##TYPE##_##SUFFIX (v);				\
+    for (int i = 0; i < NUM_ELEMS (TYPE); i++)		\
+      {							\
+	if (v[i] != (TYPE) IMM)				\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (int argc, char **argv)
+{
+  FOR_EACH_LOAD_BROADCAST (TEST_LOAD_BROADCAST)
+  FOR_EACH_LOAD_BROADCAST_IMM (TEST_LOAD_BROADCAST_IMM)
+
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_while_1.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_while_1.c	2017-11-17 14:54:06.035305786 +0000
@@ -0,0 +1,36 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+#define ADD_LOOP(TYPE)				\
+  void __attribute__ ((noinline, noclone))	\
+  vec_while_##TYPE (TYPE *restrict a, int n)	\
+  {						\
+    for (int i = 0; i < n; ++i)			\
+      a[i] += 1;				\
+  }
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (ADD_LOOP)
+
+/* { dg-final { scan-assembler-not {\tuqdec} } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, xzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, x[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, xzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, x[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, xzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, x[0-9]+,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, xzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, x[0-9]+,} 3 } } */
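
For reference, a scalar model of the fully-masked execution that these
sve_while_* tests expect, as an illustration only: every lane of a vector
iteration is guarded by a per-lane bound check (the job WHILELO does in the
generated code), so no scalar tail loop is needed.  VL here is a
hypothetical lane count, not the real runtime vector length:

#include <stdint.h>

#define VL 4	/* hypothetical number of lanes, illustration only */

static void
masked_add_ref (int32_t *a, int n)
{
  for (int base = 0; base < n; base += VL)	/* one "vector" iteration */
    for (int lane = 0; lane < VL; ++lane)	/* each lane is predicated */
      if (base + lane < n)			/* per-lane bound check */
	a[base + lane] += 1;
}
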
Index: gcc/testsuite/gcc.target/aarch64/sve_while_2.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_while_2.c	2017-11-17 14:54:06.035305786 +0000
@@ -0,0 +1,36 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+#define ADD_LOOP(TYPE)					\
+  void __attribute__ ((noinline, noclone))		\
+  vec_while_##TYPE (TYPE *restrict a, unsigned int n)	\
+  {							\
+    for (unsigned int i = 0; i < n; ++i)		\
+      a[i] += 1;					\
+  }
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (ADD_LOOP)
+
+/* { dg-final { scan-assembler-not {\tuqdec} } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, xzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, x[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, xzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, x[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, xzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, x[0-9]+,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, xzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, x[0-9]+,} 3 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_while_3.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_while_3.c	2017-11-17 14:54:06.035305786 +0000
@@ -0,0 +1,36 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
+
+#include <stdint.h>
+
+#define ADD_LOOP(TYPE)					\
+  void __attribute__ ((noinline, noclone))		\
+  vec_while_##TYPE (TYPE *restrict a, int64_t n)	\
+  {							\
+    for (int64_t i = 0; i < n; ++i)			\
+      a[i] += 1;					\
+  }
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (ADD_LOOP)
+
+/* { dg-final { scan-assembler-not {\tuqdec} } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, xzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, x[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, xzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, x[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, xzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, x[0-9]+,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, xzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, x[0-9]+,} 3 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_while_4.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_while_4.c	2017-11-17 14:54:06.035305786 +0000
@@ -0,0 +1,37 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
+
+#include <stdint.h>
+
+#define ADD_LOOP(TYPE)					\
+  void __attribute__ ((noinline, noclone))		\
+  vec_while_##TYPE (TYPE *restrict a, uint64_t n)	\
+  {							\
+    for (uint64_t i = 0; i < n; ++i)			\
+      a[i] += 1;					\
+  }
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (ADD_LOOP)
+
+/* { dg-final { scan-assembler-times {\tuqdec} 2 } } */
+/* { dg-final { scan-assembler-times {\tuqdecb\tx[0-9]+} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, xzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, x[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, xzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, x[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, xzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, x[0-9]+,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, xzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, x[0-9]+,} 3 } } */