From patchwork Mon Nov 11 18:51:58 2019
From: Richard Sandiford
To: gcc-patches@gcc.gnu.org
Subject: [8/8] Optimise WAR and WAW alias checks
Date: Mon, 11 Nov 2019 18:51:58 +0000

For:

  void
  f1 (int *x, int *y)
  {
    for (int i = 0; i < 32; ++i)
      x[i] += y[i];
  }

we checked at runtime whether one vector at x would overlap one vector
at y.  But in cases like this, the vector code would handle x <= y
just fine, since any write to address A still happens after any read
from address A.  The only problem is if x is ahead of y by less than
a vector.
The same is true for two writes:

  void
  f2 (int *x, int *y)
  {
    for (int i = 0; i < 32; ++i)
      {
	x[i] = i;
	y[i] = 2;
      }
  }

if y <= x then a vector write at y after a vector write at x would
have the same net effect as the original scalar writes.

This patch optimises the alias checks for these two cases.  E.g.,
before the patch, f1 used:

	add	x2, x0, 15
	sub	x2, x2, x1
	cmp	x2, 30
	bls	.L2

whereas after the patch it uses:

	add	x2, x1, 4
	sub	x2, x0, x2
	cmp	x2, 8
	bls	.L2

Read-after-write cases like:

  int
  f3 (int *x, int *y)
  {
    int res = 0;
    for (int i = 0; i < 32; ++i)
      {
	x[i] = i;
	res += y[i];
      }
    return res;
  }

can cope with x == y, but otherwise don't allow overlap in either
direction.  Since checking for x == y at runtime would require extra
code, we're probably better off sticking with the current overlap test.

An overlap test is also needed if the scalar or vector accesses covered
by the alias check are mixed together, rather than all statements for
the second access following all statements for the first access.

The new code for gcc.target/aarch64/sve/var_stride_[135].c is slightly
better than before.

2019-11-11  Richard Sandiford

gcc/
	* tree-data-ref.c (create_intersect_range_checks_index): If the
	alias pair describes simple WAW and WAR dependencies, just check
	whether the first B access overlaps later A accesses.
	(create_waw_or_war_checks): New function that performs the same
	optimization on addresses.
	(create_intersect_range_checks): Call it.

gcc/testsuite/
	* gcc.dg/vect/vect-alias-check-8.c: Expect WAR/WAW checks to be used.
	* gcc.dg/vect/vect-alias-check-14.c: Likewise.
	* gcc.dg/vect/vect-alias-check-15.c: Likewise.
	* gcc.dg/vect/vect-alias-check-18.c: Likewise.
	* gcc.dg/vect/vect-alias-check-19.c: Likewise.
	* gcc.target/aarch64/sve/var_stride_1.c: Update expected sequence.
	* gcc.target/aarch64/sve/var_stride_2.c: Likewise.
	* gcc.target/aarch64/sve/var_stride_3.c: Likewise.
	* gcc.target/aarch64/sve/var_stride_5.c: Likewise.
Index: gcc/tree-data-ref.c
===================================================================
--- gcc/tree-data-ref.c	2019-11-11 18:32:12.000000000 +0000
+++ gcc/tree-data-ref.c	2019-11-11 18:32:13.186616541 +0000
@@ -1806,6 +1806,8 @@ create_intersect_range_checks_index (cla
 				    abs_step, &niter_access2))
     return false;
 
+  bool waw_or_war_p = (alias_pair.flags & ~(DR_ALIAS_WAR | DR_ALIAS_WAW)) == 0;
+
   unsigned int i;
   for (i = 0; i < DR_NUM_DIMENSIONS (dr_a.dr); i++)
     {
@@ -1907,16 +1909,57 @@ create_intersect_range_checks_index (cla
 
      Combining the tests requires limit to be computable in an unsigned
      form of the index type; if it isn't, we fall back to the usual
-     pointer-based checks.  */
-  poly_offset_int limit = (idx_len1 + idx_access1 - 1
-			   + idx_len2 + idx_access2 - 1);
+     pointer-based checks.
+
+     We can do better if DR_B is a write and if DR_A and DR_B are
+     well-ordered in both the original and the new code (see the
+     comment above the DR_ALIAS_* flags for details).  In this case
+     we know that for each i in [0, n-1], the write performed by
+     access i of DR_B occurs after access numbers j<=i of DR_A in
+     both the original and the new code.  Any write or anti
+     dependencies wrt those DR_A accesses are therefore maintained.
+
+     We just need to make sure that each individual write in DR_B does not
+     overlap any higher-indexed access in DR_A; such DR_A accesses happen
+     after the DR_B access in the original code but happen before it in
+     the new code.
+
+     We know the steps for both accesses are equal, so by induction, we
+     just need to test whether the first write of DR_B overlaps a later
+     access of DR_A.  In other words, we need to move min1 along by
+     one iteration:
+
+       min1' = min1 + idx_step
+
+     and use the ranges:
+
+       [min1' + low_offset1', min1' + high_offset1' + idx_access1 - 1]
+
+     and:
+
+       [min2, min2 + idx_access2 - 1]
+
+     where:
+
+       low_offset1' = +ve step ? 0 : -(idx_len1 - |idx_step|)
+       high_offset1' = +ve_step ? idx_len1 - |idx_step| : 0.  */
+  if (waw_or_war_p)
+    idx_len1 -= abs_idx_step;
+
+  poly_offset_int limit = idx_len1 + idx_access1 - 1 + idx_access2 - 1;
+  if (!waw_or_war_p)
+    limit += idx_len2;
+
   tree utype = unsigned_type_for (TREE_TYPE (min1));
   if (!wi::fits_to_tree_p (limit, utype))
     return false;
 
   poly_offset_int low_offset1 = neg_step ? -idx_len1 : 0;
-  poly_offset_int high_offset2 = neg_step ? 0 : idx_len2;
+  poly_offset_int high_offset2 = neg_step || waw_or_war_p ? 0 : idx_len2;
   poly_offset_int bias = high_offset2 + idx_access2 - 1 - low_offset1;
+  /* Equivalent to adding IDX_STEP to MIN1.  */
+  if (waw_or_war_p)
+    bias -= wi::to_offset (idx_step);
 
   tree subject = fold_build2 (MINUS_EXPR, utype,
			       fold_convert (utype, min2),
@@ -1932,7 +1975,169 @@ create_intersect_range_checks_index (cla
 	*cond_expr = part_cond_expr;
     }
   if (dump_enabled_p ())
-    dump_printf (MSG_NOTE, "using an index-based overlap test\n");
+    {
+      if (waw_or_war_p)
+	dump_printf (MSG_NOTE, "using an index-based WAR/WAW test\n");
+      else
+	dump_printf (MSG_NOTE, "using an index-based overlap test\n");
+    }
   return true;
 }
 
+/* A subroutine of create_intersect_range_checks, with a subset of the
+   same arguments.  Try to optimize cases in which the second access
+   is a write and in which some overlap is valid.  */
+
+static bool
+create_waw_or_war_checks (tree *cond_expr,
+			  const dr_with_seg_len_pair_t &alias_pair)
+{
+  const dr_with_seg_len& dr_a = alias_pair.first;
+  const dr_with_seg_len& dr_b = alias_pair.second;
+
+  /* Check for cases in which:
+
+     (a) DR_B is always a write;
+     (b) the accesses are well-ordered in both the original and new code
+	 (see the comment above the DR_ALIAS_* flags for details); and
+     (c) the DR_STEPs describe all access pairs covered by ALIAS_PAIR.  */
+  if (alias_pair.flags & ~(DR_ALIAS_WAR | DR_ALIAS_WAW))
+    return false;
+
+  /* Check for equal (but possibly variable) steps.  */
+  tree step = DR_STEP (dr_a.dr);
+  if (!operand_equal_p (step, DR_STEP (dr_b.dr)))
+    return false;
+
+  /* Make sure that we can operate on sizetype without loss of precision.  */
+  tree addr_type = TREE_TYPE (DR_BASE_ADDRESS (dr_a.dr));
+  if (TYPE_PRECISION (addr_type) != TYPE_PRECISION (sizetype))
+    return false;
+
+  /* All addresses involved are known to have a common alignment ALIGN.
+     We can therefore subtract ALIGN from an exclusive endpoint to get
+     an inclusive endpoint.  In the best (and common) case, ALIGN is the
+     same as the access sizes of both DRs, and so subtracting ALIGN
+     cancels out the addition of an access size.  */
+  unsigned int align = MIN (dr_a.align, dr_b.align);
+  poly_uint64 last_chunk_a = dr_a.access_size - align;
+  poly_uint64 last_chunk_b = dr_b.access_size - align;
+
+  /* Get a boolean expression that is true when the step is negative.  */
+  tree indicator = dr_direction_indicator (dr_a.dr);
+  tree neg_step = fold_build2 (LT_EXPR, boolean_type_node,
+			       fold_convert (ssizetype, indicator),
+			       ssize_int (0));
+
+  /* Get lengths in sizetype.  */
+  tree seg_len_a
+    = fold_convert (sizetype, rewrite_to_non_trapping_overflow (dr_a.seg_len));
+  step = fold_convert (sizetype, rewrite_to_non_trapping_overflow (step));
+
+  /* Each access has the following pattern:
+
+	  <- |seg_len| ->
+	  <--- A: -ve step --->
+	  +-----+-------+-----+-------+-----+
+	  | n-1 | ..... |  0  | ..... | n-1 |
+	  +-----+-------+-----+-------+-----+
+			<--- B: +ve step --->
+			<- |seg_len| ->
+			|
+		   base address
+
+     where "n" is the number of scalar iterations covered by the segment.
+
+     A is the range of bytes accessed when the step is negative,
+     B is the range when the step is positive.
+
+     We know that DR_B is a write.  We also know (from checking that
+     DR_A and DR_B are well-ordered) that for each i in [0, n-1],
+     the write performed by access i of DR_B occurs after access numbers
+     j<=i of DR_A in both the original and the new code.  Any write or
+     anti dependencies wrt those DR_A accesses are therefore maintained.
+
+     We just need to make sure that each individual write in DR_B does not
+     overlap any higher-indexed access in DR_A; such DR_A accesses happen
+     after the DR_B access in the original code but happen before it in
+     the new code.
+
+     We know the steps for both accesses are equal, so by induction, we
+     just need to test whether the first write of DR_B overlaps a later
+     access of DR_A.  In other words, we need to move addr_a along by
+     one iteration:
+
+       addr_a' = addr_a + step
+
+     and check whether:
+
+       [addr_b, addr_b + last_chunk_b]
+
+     overlaps:
+
+       [addr_a' + low_offset_a, addr_a' + high_offset_a + last_chunk_a]
+
+     where [low_offset_a, high_offset_a] spans accesses [1, n-1].  I.e.:
+
+       low_offset_a = +ve step ? 0 : seg_len_a - step
+       high_offset_a = +ve step ? seg_len_a - step : 0
+
+     This is equivalent to testing whether:
+
+       addr_a' + low_offset_a <= addr_b + last_chunk_b
+       && addr_b <= addr_a' + high_offset_a + last_chunk_a
+
+     Converting this into a single test, there is an overlap if:
+
+       0 <= addr_b + last_chunk_b - addr_a' - low_offset_a <= limit
+
+     where limit = high_offset_a - low_offset_a + last_chunk_a + last_chunk_b
+
+     If DR_A is performed, limit + |step| - last_chunk_b is known to be
+     less than the size of the object underlying DR_A.  We also know
+     that last_chunk_b <= |step|; this is checked elsewhere if it isn't
+     guaranteed at compile time.  There can therefore be no overflow if
+     "limit" is calculated in an unsigned type with pointer precision.  */
+  tree addr_a = fold_build_pointer_plus (DR_BASE_ADDRESS (dr_a.dr),
+					 DR_OFFSET (dr_a.dr));
+  addr_a = fold_build_pointer_plus (addr_a, DR_INIT (dr_a.dr));
+
+  tree addr_b = fold_build_pointer_plus (DR_BASE_ADDRESS (dr_b.dr),
+					 DR_OFFSET (dr_b.dr));
+  addr_b = fold_build_pointer_plus (addr_b, DR_INIT (dr_b.dr));
+
+  /* Advance ADDR_A by one iteration and adjust the length to compensate.  */
+  addr_a = fold_build_pointer_plus (addr_a, step);
+  tree seg_len_a_minus_step = fold_build2 (MINUS_EXPR, sizetype,
+					   seg_len_a, step);
+  if (!CONSTANT_CLASS_P (seg_len_a_minus_step))
+    seg_len_a_minus_step = build1 (SAVE_EXPR, sizetype, seg_len_a_minus_step);
+
+  tree low_offset_a = fold_build3 (COND_EXPR, sizetype, neg_step,
+				   seg_len_a_minus_step, size_zero_node);
+  if (!CONSTANT_CLASS_P (low_offset_a))
+    low_offset_a = build1 (SAVE_EXPR, sizetype, low_offset_a);
+
+  /* We could use COND_EXPR <neg_step, size_zero_node, seg_len_a_minus_step>,
+     but it's usually more efficient to reuse the LOW_OFFSET_A result.  */
+  tree high_offset_a = fold_build2 (MINUS_EXPR, sizetype, seg_len_a_minus_step,
+				    low_offset_a);
+
+  /* The amount added to addr_b - addr_a'.  */
+  tree bias = fold_build2 (MINUS_EXPR, sizetype,
+			   size_int (last_chunk_b), low_offset_a);
+
+  tree limit = fold_build2 (MINUS_EXPR, sizetype, high_offset_a, low_offset_a);
+  limit = fold_build2 (PLUS_EXPR, sizetype, limit,
+		       size_int (last_chunk_a + last_chunk_b));
+
+  tree subject = fold_build2 (POINTER_DIFF_EXPR, ssizetype, addr_b, addr_a);
+  subject = fold_build2 (PLUS_EXPR, sizetype,
+			 fold_convert (sizetype, subject), bias);
+
+  *cond_expr = fold_build2 (GT_EXPR, boolean_type_node, subject, limit);
+  if (dump_enabled_p ())
+    dump_printf (MSG_NOTE, "using an address-based WAR/WAW test\n");
+  return true;
+}
+
@@ -2036,6 +2241,9 @@ create_intersect_range_checks (class loo
   if (create_intersect_range_checks_index (loop, cond_expr, alias_pair))
     return;
 
+  if (create_waw_or_war_checks (cond_expr, alias_pair))
+    return;
+
   unsigned HOST_WIDE_INT min_align;
   tree_code cmp_code;
   /* We don't have to check DR_ALIAS_MIXED_STEPS here, since both versions
Index: gcc/testsuite/gcc.dg/vect/vect-alias-check-8.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-alias-check-8.c	2019-11-11 18:32:12.000000000 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-alias-check-8.c	2019-11-11 18:32:13.186616541 +0000
@@ -60,5 +60,5 @@ main (void)
 }
 
 /* { dg-final { scan-tree-dump {flags: *WAR\n} "vect" { target vect_int } } } */
-/* { dg-final { scan-tree-dump "using an index-based overlap test" "vect" } } */
+/* { dg-final { scan-tree-dump "using an index-based WAR/WAW test" "vect" } } */
 /* { dg-final { scan-tree-dump-not "using an address-based" "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-alias-check-14.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-alias-check-14.c	2019-11-11 18:32:12.000000000 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-alias-check-14.c	2019-11-11 18:32:13.186616541 +0000
@@ -60,5 +60,5 @@ main (void)
 
 /* { dg-final { scan-tree-dump {flags: *WAR\n} "vect" { target vect_int } } } */
 /* { dg-final { scan-tree-dump-not {flags: [^\n]*ARBITRARY\n} "vect" } } */
-/* { dg-final { scan-tree-dump "using an address-based overlap test" "vect" } } */
+/* { dg-final { scan-tree-dump "using an address-based WAR/WAW test" "vect" } } */
 /* { dg-final { scan-tree-dump-not "using an index-based" "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-alias-check-15.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-alias-check-15.c	2019-11-11 18:32:12.000000000 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-alias-check-15.c	2019-11-11 18:32:13.186616541 +0000
@@ -57,5 +57,5 @@ main (void)
 }
 
 /* { dg-final { scan-tree-dump {flags: *WAW\n} "vect" { target vect_int } } } */
-/* { dg-final { scan-tree-dump "using an address-based overlap test" "vect" } } */
+/* { dg-final { scan-tree-dump "using an address-based WAR/WAW test" "vect" } } */
 /* { dg-final { scan-tree-dump-not "using an index-based" "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-alias-check-18.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-alias-check-18.c	2019-11-11 18:32:12.000000000 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-alias-check-18.c	2019-11-11 18:32:13.186616541 +0000
@@ -60,5 +60,5 @@ main (void)
 }
 
 /* { dg-final { scan-tree-dump {flags: *WAR\n} "vect" { target vect_int } } } */
-/* { dg-final { scan-tree-dump "using an index-based overlap test" "vect" } } */
+/* { dg-final { scan-tree-dump "using an index-based WAR/WAW test" "vect" } } */
 /* { dg-final { scan-tree-dump-not "using an address-based" "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/vect-alias-check-19.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/vect-alias-check-19.c	2019-11-11 18:32:12.000000000 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-alias-check-19.c	2019-11-11 18:32:13.186616541 +0000
@@ -58,5 +58,5 @@ main (void)
 }
 
 /* { dg-final { scan-tree-dump {flags: *WAW\n} "vect" { target vect_int } } } */
-/* { dg-final { scan-tree-dump "using an index-based overlap test" "vect" } } */
+/* { dg-final { scan-tree-dump "using an index-based WAR/WAW test" "vect" } } */
 /* { dg-final { scan-tree-dump-not "using an address-based" "vect" } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/var_stride_1.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/var_stride_1.c	2019-11-11 18:32:12.000000000 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/var_stride_1.c	2019-11-11 18:32:13.186616541 +0000
@@ -15,13 +15,9 @@ f (TYPE *x, TYPE *y, unsigned short n, l
 /* { dg-final { scan-assembler {\tst1w\tz[0-9]+} } } */
 /* { dg-final { scan-assembler {\tldr\tw[0-9]+} } } */
 /* { dg-final { scan-assembler {\tstr\tw[0-9]+} } } */
-/* Should multiply by (VF-1)*4 rather than (257-1)*4.  */
-/* { dg-final { scan-assembler-not {, 1024} } } */
-/* { dg-final { scan-assembler-not {\t.bfiz\t} } } */
-/* { dg-final { scan-assembler-not {lsl[^\n]*[, ]10} } } */
-/* { dg-final { scan-assembler-not {\tcmp\tx[0-9]+, 0} } } */
-/* { dg-final { scan-assembler-not {\tcmp\tw[0-9]+, 0} } } */
-/* { dg-final { scan-assembler-not {\tcsel\tx[0-9]+} } } */
-/* Two range checks and a check for n being zero.  */
-/* { dg-final { scan-assembler-times {\tcmp\t} 1 } } */
-/* { dg-final { scan-assembler-times {\tccmp\t} 2 } } */
+/* Should use a WAR check that multiplies by (VF-2)*4 rather than
+   an overlap check that multiplies by (257-1)*4.  */
+/* { dg-final { scan-assembler {\tcntb\t(x[0-9]+)\n.*\tsub\tx[0-9]+, \1, #8\n.*\tmul\tx[0-9]+,[^\n]*\1} } } */
+/* One range check and a check for n being zero.  */
+/* { dg-final { scan-assembler-times {\t(?:cmp|tst)\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tccmp\t} 1 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c	2019-11-11 18:32:12.000000000 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/var_stride_2.c	2019-11-11 18:32:13.186616541 +0000
@@ -15,7 +15,7 @@ f (TYPE *x, TYPE *y, unsigned short n, u
 /* { dg-final { scan-assembler {\tst1w\tz[0-9]+} } } */
 /* { dg-final { scan-assembler {\tldr\tw[0-9]+} } } */
 /* { dg-final { scan-assembler {\tstr\tw[0-9]+} } } */
-/* Should multiply by (257-1)*4 rather than (VF-1)*4.  */
+/* Should multiply by (257-1)*4 rather than (VF-1)*4 or (VF-2)*4.  */
 /* { dg-final { scan-assembler-times {\tubfiz\tx[0-9]+, x2, 10, 16\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tubfiz\tx[0-9]+, x3, 10, 16\n} 1 } } */
 /* { dg-final { scan-assembler-not {\tcmp\tx[0-9]+, 0} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/var_stride_3.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/var_stride_3.c	2019-11-11 18:32:12.000000000 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/var_stride_3.c	2019-11-11 18:32:13.186616541 +0000
@@ -15,13 +15,10 @@ f (TYPE *x, TYPE *y, int n, long m __att
 /* { dg-final { scan-assembler {\tst1w\tz[0-9]+} } } */
 /* { dg-final { scan-assembler {\tldr\tw[0-9]+} } } */
 /* { dg-final { scan-assembler {\tstr\tw[0-9]+} } } */
-/* Should multiply by (VF-1)*4 rather than (257-1)*4.  */
-/* { dg-final { scan-assembler-not {, 1024} } } */
-/* { dg-final { scan-assembler-not {\t.bfiz\t} } } */
-/* { dg-final { scan-assembler-not {lsl[^\n]*[, ]10} } } */
-/* { dg-final { scan-assembler-not {\tcmp\tx[0-9]+, 0} } } */
-/* { dg-final { scan-assembler {\tcmp\tw2, 0} } } */
-/* { dg-final { scan-assembler-times {\tcsel\tx[0-9]+} 2 } } */
-/* Two range checks and a check for n being zero.  */
-/* { dg-final { scan-assembler {\tcmp\t} } } */
-/* { dg-final { scan-assembler-times {\tccmp\t} 2 } } */
+/* Should use a WAR check that multiplies by (VF-2)*4 rather than
+   an overlap check that multiplies by (257-1)*4.  */
+/* { dg-final { scan-assembler {\tcntb\t(x[0-9]+)\n.*\tsub\tx[0-9]+, \1, #8\n.*\tmul\tx[0-9]+,[^\n]*\1} } } */
+/* { dg-final { scan-assembler-times {\tcsel\tx[0-9]+[^\n]*xzr} 1 } } */
+/* One range check and a check for n being zero.  */
+/* { dg-final { scan-assembler-times {\tcmp\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tccmp\t} 1 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/var_stride_5.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/var_stride_5.c	2019-11-11 18:32:12.000000000 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/var_stride_5.c	2019-11-11 18:32:13.186616541 +0000
@@ -15,13 +15,10 @@ f (TYPE *x, TYPE *y, long n, long m __at
 /* { dg-final { scan-assembler {\tst1d\tz[0-9]+} } } */
 /* { dg-final { scan-assembler {\tldr\td[0-9]+} } } */
 /* { dg-final { scan-assembler {\tstr\td[0-9]+} } } */
-/* Should multiply by (VF-1)*8 rather than (257-1)*8.  */
-/* { dg-final { scan-assembler-not {, 2048} } } */
-/* { dg-final { scan-assembler-not {\t.bfiz\t} } } */
-/* { dg-final { scan-assembler-not {lsl[^\n]*[, ]11} } } */
-/* { dg-final { scan-assembler {\tcmp\tx[0-9]+, 0} } } */
-/* { dg-final { scan-assembler-not {\tcmp\tw[0-9]+, 0} } } */
-/* { dg-final { scan-assembler-times {\tcsel\tx[0-9]+} 2 } } */
-/* Two range checks and a check for n being zero.  */
-/* { dg-final { scan-assembler {\tcmp\t} } } */
-/* { dg-final { scan-assembler-times {\tccmp\t} 2 } } */
+/* Should use a WAR check that multiplies by (VF-2)*8 rather than
+   an overlap check that multiplies by (257-1)*8.  */
+/* { dg-final { scan-assembler {\tcntb\t(x[0-9]+)\n.*\tsub\tx[0-9]+, \1, #16\n.*\tmul\tx[0-9]+,[^\n]*\1} } } */
+/* { dg-final { scan-assembler-times {\tcsel\tx[0-9]+[^\n]*xzr} 1 } } */
+/* One range check and a check for n being zero.  */
+/* { dg-final { scan-assembler-times {\tcmp\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tccmp\t} 1 } } */