
lra: Canonicalize mult to shift in address reloads

Message ID 20200825101855.clw3dgzfxsgudpzn@arm.com
State New

Commit Message

Alex Coplan Aug. 25, 2020, 10:18 a.m. UTC
Hello,

Inside a (mem) RTX, it is canonical to write multiplications by powers
of two using a (mult) [0]. For example, given the following C function:

long f(long *p, long x)
{
    return p[x];
}

AArch64 GCC generates the following RTL insn (in final):

(set (reg/i:DI 0 x0)
     (mem:DI (plus:DI (mult:DI (reg:DI 1 x1 [101])
                 (const_int 8 [0x8]))
             (reg:DI 0 x0 [100])) [1 *_3+0 S8 A64]))

However, outside of a (mem), the canonical way to write multiplications
by powers of two is using (ashift). E.g. for the following C function:

long *g(long *p, long x)
{
    return p + x;
}

AArch64 GCC generates:

(set (reg/i:DI 0 x0)
     (plus:DI (ashift:DI (reg:DI 1 x1 [100])
             (const_int 3 [0x3]))
         (reg:DI 0 x0 [99])))

Now I observed that LRA does not quite respect this RTL canonicalization
rule. When compiling gcc/testsuite/gcc.dg/torture/pr34330.c with -Os
-ftree-vectorize, the RTL in the dump "281r.ira" has the insn:

(set (reg:SI 111 [ MEM[base: b_25(D), index: ivtmp.9_35, step: 4, offset: 0B] ])
     (mem:SI (plus:DI (mult:DI (reg:DI 101 [ ivtmp.9 ])
                 (const_int 4 [0x4]))
             (reg/v/f:DI 105 [ b ])) [2 MEM[base: b_25(D), index: ivtmp.9_35, step: 4, offset: 0B]+0 S4 A32]))

but LRA then proceeds to generate a reload, and we get the following
non-canonical insn in "282r.reload":

(set (reg:DI 7 x7 [121])
     (plus:DI (mult:DI (reg:DI 5 x5 [orig:101 ivtmp.9 ] [101])
             (const_int 4 [0x4]))
         (reg/v/f:DI 1 x1 [orig:105 b ] [105])))

This patch fixes LRA to ensure that we generate canonical RTL in this
case. After the patch, we get the following insn in "282r.reload":

(set (reg:DI 7 x7 [121])
        (plus:DI (ashift:DI (reg:DI 5 x5 [orig:101 ivtmp.9 ] [101])
                (const_int 2 [0x2]))
            (reg/v/f:DI 1 x1 [orig:105 b ] [105])))

The motivation here is to be able to remove several redundant patterns
in the AArch64 backend. See the previous thread [1] for context.

Testing:
 * Bootstrapped and regtested on aarch64-none-linux-gnu,
   x86_64-pc-linux-gnu.
 * New unit test which checks that we're using the shift pattern (rather
   than the problematic mult pattern) on AArch64.

OK for master?

Thanks,
Alex

[0] : https://gcc.gnu.org/onlinedocs/gccint/Insn-Canonicalizations.html
[1] : https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552066.html

---

gcc/ChangeLog:

	* lra-constraints.c (canonicalize_reload_addr): New.
	(curr_insn_transform): Use canonicalize_reload_addr to ensure we
	generate canonical RTL for an address reload.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/mem-shift-canonical.c: New test.

Comments

Vladimir Makarov Aug. 25, 2020, 8:45 p.m. UTC | #1
On 2020-08-25 6:18 a.m., Alex Coplan wrote:
> The motivation here is to be able to remove several redundant patterns
> in the AArch64 backend. See the previous thread [1] for context.
>
> Testing:
>   * Bootstrapped and regtested on aarch64-none-linux-gnu,
>     x86_64-pc-linux-gnu.
>   * New unit test which checks that we're using the shift pattern (rather
>     than the problematic mult pattern) on AArch64.
>
> OK for master?
Yes. Thank you for working on this issue and the patch.
> Thanks,
> Alex
>
> [0] : https://gcc.gnu.org/onlinedocs/gccint/Insn-Canonicalizations.html
> [1] : https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552066.html
>
> ---
>
> gcc/ChangeLog:
>
> 	* lra-constraints.c (canonicalize_reload_addr): New.
> 	(curr_insn_transform): Use canonicalize_reload_addr to ensure we
> 	generate canonical RTL for an address reload.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/mem-shift-canonical.c: New test.
>
Richard Sandiford Aug. 26, 2020, 9:06 a.m. UTC | #2
Alex Coplan <alex.coplan@arm.com> writes:
> Hello,
>
> Inside a (mem) RTX, it is canonical to write multiplications by powers
> of two using a (mult) [0]. For example, given the following C function:
>
> long f(long *p, long x)
> {
>     return p[x];
> }
>
> AArch64 GCC generates the following RTL insn (in final):
>
> (set (reg/i:DI 0 x0)
>      (mem:DI (plus:DI (mult:DI (reg:DI 1 x1 [101])
>                  (const_int 8 [0x8]))
>              (reg:DI 0 x0 [100])) [1 *_3+0 S8 A64]))
>
> However, outside of a (mem), the canonical way to write multiplications
> by powers of two is using (ashift). E.g. for the following C function:
>
> long *g(long *p, long x)
> {
>     return p + x;
> }
>
> AArch64 GCC generates:
>
> (set (reg/i:DI 0 x0)
>      (plus:DI (ashift:DI (reg:DI 1 x1 [100])
>              (const_int 3 [0x3]))
>          (reg:DI 0 x0 [99])))
>
> Now I observed that LRA does not quite respect this RTL canonicalization
> rule. When compiling gcc/testsuite/gcc.dg/torture/pr34330.c with -Os
> -ftree-vectorize, the RTL in the dump "281r.ira" has the insn:
>
> (set (reg:SI 111 [ MEM[base: b_25(D), index: ivtmp.9_35, step: 4, offset: 0B] ])
>      (mem:SI (plus:DI (mult:DI (reg:DI 101 [ ivtmp.9 ])
>                  (const_int 4 [0x4]))
>              (reg/v/f:DI 105 [ b ])) [2 MEM[base: b_25(D), index: ivtmp.9_35, step: 4, offset: 0B]+0 S4 A32]))
>
> but LRA then proceeds to generate a reload, and we get the following
> non-canonical insn in "282r.reload":
>
> (set (reg:DI 7 x7 [121])
>      (plus:DI (mult:DI (reg:DI 5 x5 [orig:101 ivtmp.9 ] [101])
>              (const_int 4 [0x4]))
>          (reg/v/f:DI 1 x1 [orig:105 b ] [105])))
>
> This patch fixes LRA to ensure that we generate canonical RTL in this
> case. After the patch, we get the following insn in "282r.reload":
>
> (set (reg:DI 7 x7 [121])
>         (plus:DI (ashift:DI (reg:DI 5 x5 [orig:101 ivtmp.9 ] [101])
>                 (const_int 2 [0x2]))
>             (reg/v/f:DI 1 x1 [orig:105 b ] [105])))
>
> The motivation here is to be able to remove several redundant patterns
> in the AArch64 backend. See the previous thread [1] for context.
>
> Testing:
>  * Bootstrapped and regtested on aarch64-none-linux-gnu,
>    x86_64-pc-linux-gnu.
>  * New unit test which checks that we're using the shift pattern (rather
>    than the problematic mult pattern) on AArch64.
>
> OK for master?
>
> Thanks,
> Alex
>
> [0] : https://gcc.gnu.org/onlinedocs/gccint/Insn-Canonicalizations.html
> [1] : https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552066.html

Thanks for doing this.

>
> ---
>
> gcc/ChangeLog:
>
> 	* lra-constraints.c (canonicalize_reload_addr): New.
> 	(curr_insn_transform): Use canonicalize_reload_addr to ensure we
> 	generate canonical RTL for an address reload.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/mem-shift-canonical.c: New test.
>
> diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
> index 421c453997b..630a4f167cd 100644
> --- a/gcc/lra-constraints.c
> +++ b/gcc/lra-constraints.c
> @@ -570,6 +570,34 @@ init_curr_insn_input_reloads (void)
>    curr_insn_input_reloads_num = 0;
>  }
>  
> +/* The canonical form of an rtx inside a MEM is not necessarily the same as the
> +   canonical form of the rtx outside the MEM.  Fix this up in the case that
> +   we're reloading an address (and therefore pulling it outside a MEM).  */
> +static rtx canonicalize_reload_addr (rtx addr)

Minor nit, should be formatted as:

static rtx
canonicalize_reload_addr (rtx addr)

> +{
> +  const enum rtx_code code = GET_CODE (addr);
> +
> +  if (code == PLUS)
> +    {
> +      rtx inner = XEXP (addr, 0);
> +      if (GET_CODE (inner) == MULT && CONST_INT_P (XEXP (inner, 1)))
> +	{
> +	  const HOST_WIDE_INT ci = INTVAL (XEXP (inner, 1));
> +	  const int pwr2 = exact_log2 (ci);
> +	  if (pwr2 > 0)
> +	    {
> +	      /* Rewrite this to use a shift instead, which is canonical
> +		 when outside of a MEM.  */
> +	      XEXP (addr, 0) = gen_rtx_ASHIFT (GET_MODE (inner),
> +					       XEXP (inner, 0),
> +					       GEN_INT (pwr2));
> +	    }
> +	}
> +    }

I don't think we should restrict this to (plus (mult X Y) Z),
since addresses can be more complicated than that.  One way to
search for all MULTs is:

  subrtx_iterator::array_type array;
  FOR_EACH_SUBRTX (iter, array, x, NONCONST)
    {
      rtx x = *iter;
      if (GET_CODE (x) == MULT && CONST_INT_P (XEXP (x, 1)))
        ...
    }

(Needs rtl-iter.h)

Thanks,
Richard
Vladimir Makarov Aug. 26, 2020, 1:19 p.m. UTC | #3
On 2020-08-26 5:06 a.m., Richard Sandiford wrote:
> Alex Coplan <alex.coplan@arm.com> writes:
>
> Minor nit, should be formatted as:
>
> static rtx
> canonicalize_reload_addr (rtx addr)
Sorry for missing this.  Alex, it should be fixed anyway.
>
> I don't think we should restrict this to (plus (mult X Y) Z),
> since addresses can be more complicated than that.  One way to
> search for all MULTs is:
>
>    subrtx_iterator::array_type array;
>    FOR_EACH_SUBRTX (iter, array, x, NONCONST)
>      {
>        rtx x = *iter;
>        if (GET_CODE (x) == MULT && CONST_INT_P (XEXP (x, 1)))
>          ...
>      }
>
> (Needs rtl-iter.h)

I agree it would be nice to process a general case.  Alex, you can do 
this as a separate patch if you want.

Richard, thank you for looking at this patch too.
Alex Coplan Aug. 26, 2020, 3:15 p.m. UTC | #4
Thanks for the review, both.

On 26/08/2020 09:19, Vladimir Makarov wrote:
> 
> On 2020-08-26 5:06 a.m., Richard Sandiford wrote:
> > Alex Coplan <alex.coplan@arm.com> writes:
> > 
> > Minor nit, should be formatted as:
> > 
> > static rtx
> > canonicalize_reload_addr (rtx addr)
> Sorry for missing this.  Alex, it should be fixed anyway.
> > 
> > I don't think we should restrict this to (plus (mult X Y) Z),
> > since addresses can be more complicated than that.  One way to
> > search for all MULTs is:
> > 
> >    subrtx_iterator::array_type array;
> >    FOR_EACH_SUBRTX (iter, array, x, NONCONST)
> >      {
> >        rtx x = *iter;
> >        if (GET_CODE (x) == MULT && CONST_INT_P (XEXP (x, 1)))
> >          ...
> >      }
> > 
> > (Needs rtl-iter.h)
> 
> I agree it would be nice to process a general case.  Alex, you can do
> this as a separate patch if you want.
> 
> Richard, thank you for looking at this patch too.
> 
> 

Please find a reworked version of the patch attached incorporating
Richard's feedback.

Testing:
 * Bootstrap and regtest on aarch64-none-linux-gnu,
   arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu: no regressions.

Is the updated patch OK for master?

Thanks,
Alex
diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
index 421c453997b..580da9c3ed6 100644
--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -131,6 +131,7 @@
 #include "lra-int.h"
 #include "print-rtl.h"
 #include "function-abi.h"
+#include "rtl-iter.h"
 
 /* Value of LRA_CURR_RELOAD_NUM at the beginning of BB of the current
    insn.  Remember that LRA_CURR_RELOAD_NUM is the number of emitted
@@ -570,6 +571,33 @@ init_curr_insn_input_reloads (void)
   curr_insn_input_reloads_num = 0;
 }
 
+/* The canonical form of an rtx inside a MEM is not necessarily the same as the
+   canonical form of the rtx outside the MEM.  Fix this up in the case that
+   we're reloading an address (and therefore pulling it outside a MEM).  */
+static rtx
+canonicalize_reload_addr (rtx addr)
+{
+  subrtx_var_iterator::array_type array;
+  FOR_EACH_SUBRTX_VAR (iter, array, addr, NONCONST)
+    {
+      rtx x = *iter;
+      if (GET_CODE (x) == MULT && CONST_INT_P (XEXP (x, 1)))
+	{
+	  const HOST_WIDE_INT ci = INTVAL (XEXP (x, 1));
+	  const int pwr2 = exact_log2 (ci);
+	  if (pwr2 > 0)
+	    {
+	      /* Rewrite this to use a shift instead, which is canonical when
+		 outside of a MEM.  */
+	      PUT_CODE (x, ASHIFT);
+	      XEXP (x, 1) = GEN_INT (pwr2);
+	    }
+	}
+    }
+
+  return addr;
+}
+
 /* Create a new pseudo using MODE, RCLASS, ORIGINAL or reuse already
    created input reload pseudo (only if TYPE is not OP_OUT).  Don't
    reuse pseudo if IN_SUBREG_P is true and the reused pseudo should be
@@ -4362,12 +4390,19 @@ curr_insn_transform (bool check_only_p)
 	    {
 	      rtx addr = *loc;
 	      enum rtx_code code = GET_CODE (addr);
-	      
+	      bool align_p = false;
+
 	      if (code == AND && CONST_INT_P (XEXP (addr, 1)))
-		/* (and ... (const_int -X)) is used to align to X bytes.  */
-		addr = XEXP (*loc, 0);
+		{
+		  /* (and ... (const_int -X)) is used to align to X bytes.  */
+		  align_p = true;
+		  addr = XEXP (*loc, 0);
+		}
+	      else
+		addr = canonicalize_reload_addr (addr);
+
 	      lra_emit_move (new_reg, addr);
-	      if (addr != *loc)
+	      if (align_p)
 		emit_move_insn (new_reg, gen_rtx_AND (GET_MODE (new_reg), new_reg, XEXP (*loc, 1)));
 	    }
 	  before = get_insns ();
diff --git a/gcc/testsuite/gcc.target/aarch64/mem-shift-canonical.c b/gcc/testsuite/gcc.target/aarch64/mem-shift-canonical.c
new file mode 100644
index 00000000000..36beed497a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mem-shift-canonical.c
@@ -0,0 +1,27 @@
+/* This test is a copy of gcc.dg/torture/pr34330.c: here we are looking for
+   specific patterns being matched in the AArch64 backend.  */
+
+/* { dg-do compile } */
+/* { dg-options "-Os -ftree-vectorize -dp" } */
+
+
+struct T
+{
+  int t;
+  struct { short s1, s2, s3, s4; } *s;
+};
+
+void
+foo (int *a, int *b, int *c, int *d, struct T *e)
+{
+  int i;
+  for (i = 0; i < e->t; i++)
+    {
+      e->s[i].s1 = a[i];
+      e->s[i].s2 = b[i];
+      e->s[i].s3 = c[i];
+      e->s[i].s4 = d[i];
+    }
+}
+
+/* { dg-final { scan-assembler-times "add_lsl_di" 3 } } */
Vladimir Makarov Aug. 27, 2020, 2:26 a.m. UTC | #5
On 2020-08-26 11:15 a.m., Alex Coplan wrote:
> Thanks for the review, both.
>
>
> Please find a reworked version of the patch attached incorporating
> Richard's feedback.
>
> Testing:
>   * Bootstrap and regtest on aarch64-none-linux-gnu,
>     arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu: no regressions.
>
> Is the updated patch OK for master?
>
Yes.  Thank you, Alex.
Christophe Lyon Aug. 28, 2020, 8:16 a.m. UTC | #6
Hi Alex,


On Wed, 26 Aug 2020 at 17:15, Alex Coplan <alex.coplan@arm.com> wrote:
>
> Thanks for the review, both.
>
> On 26/08/2020 09:19, Vladimir Makarov wrote:
> >
> > On 2020-08-26 5:06 a.m., Richard Sandiford wrote:
> > > Alex Coplan <alex.coplan@arm.com> writes:
> > >
> > > Minor nit, should be formatted as:
> > >
> > > static rtx
> > > canonicalize_reload_addr (rtx addr)
> > Sorry for missing this.  Alex, it should be fixed anyway.
> > >
> > > I don't think we should restrict this to (plus (mult X Y) Z),
> > > since addresses can be more complicated than that.  One way to
> > > search for all MULTs is:
> > >
> > >    subrtx_iterator::array_type array;
> > >    FOR_EACH_SUBRTX (iter, array, x, NONCONST)
> > >      {
> > >        rtx x = *iter;
> > >        if (GET_CODE (x) == MULT && CONST_INT_P (XEXP (x, 1)))
> > >          ...
> > >      }
> > >
> > > (Needs rtl-iter.h)
> >
> > I agree it would be nice to process a general case.  Alex, you can do
> > this as a separate patch if you want.
> >
> > Richard, thank you for looking at this patch too.
> >
> >
>
> Please find a reworked version of the patch attached incorporating
> Richard's feedback.
>
> Testing:
>  * Bootstrap and regtest on aarch64-none-linux-gnu,
>    arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu: no regressions.
>

Unfortunately, the new test fails on aarch64 with -mabi=ilp32:
FAIL: gcc.target/aarch64/mem-shift-canonical.c scan-assembler-times add_lsl_di 3

Christophe

> Is the updated patch OK for master?
>
> Thanks,
> Alex
Alex Coplan Aug. 28, 2020, 9:25 a.m. UTC | #7
Re: [PATCH] lra: Canonicalize mult to shift in address reloads

Hi Christophe,

On 28/08/2020 10:16, Christophe Lyon wrote:
> Hi Alex,
> 
> 
> On Wed, 26 Aug 2020 at 17:15, Alex Coplan <alex.coplan@arm.com> wrote:
> >
> > Thanks for the review, both.
> >
> > On 26/08/2020 09:19, Vladimir Makarov wrote:
> > >
> > > On 2020-08-26 5:06 a.m., Richard Sandiford wrote:
> > > > Alex Coplan <alex.coplan@arm.com> writes:
> > > >
> > > > Minor nit, should be formatted as:
> > > >
> > > > static rtx
> > > > canonicalize_reload_addr (rtx addr)
> > > Sorry for missing this.  Alex, it should be fixed anyway.
> > > >
> > > > I don't think we should restrict this to (plus (mult X Y) Z),
> > > > since addresses can be more complicated than that.  One way to
> > > > search for all MULTs is:
> > > >
> > > >    subrtx_iterator::array_type array;
> > > >    FOR_EACH_SUBRTX (iter, array, x, NONCONST)
> > > >      {
> > > >        rtx x = *iter;
> > > >        if (GET_CODE (x) == MULT && CONST_INT_P (XEXP (x, 1)))
> > > >          ...
> > > >      }
> > > >
> > > > (Needs rtl-iter.h)
> > >
> > > I agree it would be nice to process a general case.  Alex, you can do
> > > this as a separate patch if you want.
> > >
> > > Richard, thank you for looking at this patch too.
> > >
> > >
> >
> > Please find a reworked version of the patch attached incorporating
> > Richard's feedback.
> >
> > Testing:
> >  * Bootstrap and regtest on aarch64-none-linux-gnu,
> >    arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu: no regressions.
> >
> 
> Unfortunately, the new test fails on aarch64 with -mabi=ilp32:
> FAIL: gcc.target/aarch64/mem-shift-canonical.c scan-assembler-times add_lsl_di 3

Thanks for catching this. It fails since we're looking for a pattern
that can only be hit on LP64. I'm disabling the test on ILP32, since the
problematic mult pattern was never hit there, so there's nothing to
test.

Committing as obvious.

Thanks,
Alex

> 
> Christophe
> 

---

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/mem-shift-canonical.c: Skip on ILP32.

---

diff --git a/gcc/testsuite/gcc.target/aarch64/mem-shift-canonical.c b/gcc/testsuite/gcc.target/aarch64/mem-shift-canonical.c
index 36beed497a0..47479ffff29 100644
--- a/gcc/testsuite/gcc.target/aarch64/mem-shift-canonical.c
+++ b/gcc/testsuite/gcc.target/aarch64/mem-shift-canonical.c
@@ -3,6 +3,7 @@
 
 /* { dg-do compile } */
 /* { dg-options "-Os -ftree-vectorize -dp" } */
+/* { dg-require-effective-target lp64 } */
 
 
 struct T
Maciej W. Rozycki March 24, 2021, 6:13 p.m. UTC | #8
On Wed, 26 Aug 2020, Vladimir Makarov via Gcc-patches wrote:

> On 2020-08-26 5:06 a.m., Richard Sandiford wrote:
> > 
> > I don't think we should restrict this to (plus (mult X Y) Z),
> > since addresses can be more complicated than that.  One way to
> > search for all MULTs is:
> > 
> >    subrtx_iterator::array_type array;
> >    FOR_EACH_SUBRTX (iter, array, x, NONCONST)
> >      {
> >        rtx x = *iter;
> >        if (GET_CODE (x) == MULT && CONST_INT_P (XEXP (x, 1)))
> >          ...
> >      }
> > 
> > (Needs rtl-iter.h)
> 
> I agree it would be nice to process a general case.  Alex, you can do this
> as a separate patch if you want.
> 
> Richard, thank you for looking at this patch too.

[From <https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552586.html>; 
also commit 6b3034eaba83.]

 Guys, this triggers a backend's functional regression and an ICE in the 
test suite with the LRA conversion I'm currently working on for the VAX 
backend.  Before I go ahead and paper it over in the backend I'd like to 
understand why this change was considered correct in the first place.

 Contrary to what the change description suggests the ASHIFT form is not 
documented to be the canonical form for constant multiplication involving 
a power of two for addresses used outside `mem'.  What our rules only say 
is that for addresses inside `mem' the MULT form is:

   * Within address computations (i.e., inside 'mem'), a left shift is
     converted into the appropriate multiplication by a power of two.

 This change does the reverse of the conversion described above and makes
TARGET_LEGITIMATE_ADDRESS_P and possibly other backend code be presented 
with either form for indexed addresses, which complicates handling.  The 
ICE mentioned above specifically is caused by:

(plus:SI (plus:SI (mult:SI (reg:SI 30 [ _10 ])
            (const_int 4 [0x4]))
        (reg/f:SI 26 [ _6 ]))
    (const_int 12 [0xc]))

coming from:

(insn 58 57 59 10 (set (reg:SI 33 [ _13 ])
        (zero_extract:SI (mem:SI (plus:SI (plus:SI (mult:SI (reg:SI 30 [ _10 ])
                            (const_int 4 [0x4]))
                        (reg/f:SI 26 [ _6 ]))
                    (const_int 12 [0xc])) [4 _6->bits[_10]+0 S4 A32])
            (reg:QI 56)
            (reg:SI 53))) 
".../gcc/testsuite/gcc.c-torture/execute/20090113-2.c":64:12 490 {*extzv_non_const}
     (expr_list:REG_DEAD (reg:QI 56)
        (expr_list:REG_DEAD (reg:SI 53)
            (expr_list:REG_DEAD (reg:SI 30 [ _10 ])
                (expr_list:REG_DEAD (reg/f:SI 26 [ _6 ])
                    (nil))))))

being converted into:

(plus:SI (plus:SI (ashift:SI (reg:SI 30 [ _10 ])
            (const_int 2 [0x2]))
        (reg/f:SI 26 [ _6 ]))
    (const_int 12 [0xc]))

which the backend currently does not recognise as a valid machine address 
and clearly all the fallback handling fails in this case.  It also causes 
indexed addressing for non-byte data (scaling is implicit in the VAX ISA) 
to cease being used where no ICE actually triggers, which causes a serious 
code quality regression from extraneous manual address calculations.

 Given that the VAX backend currently does not expect ASHIFT in addresses 
and it works, this single piece in LRA must be the only one across the 
middle end that uses this form and all the other code must have stuck to 
the MULT form.  So it does not appear to me that the ASHIFT form was 
indeed considered canonical before this problematic change.

 I have looked into what other backends do that support scaled indexed 
addressing and x86 escaped a regression here owing to an unrelated much 
earlier fix: <https://gcc.gnu.org/ml/gcc-patches/2010-04/msg01170.html> 
for PR target/43766 (commit 90f775a9c7af) that added ASHIFT support to its 
TARGET_LEGITIMATE_ADDRESS_P handler, and Aarch64 presumably has always had 
it.

 I have therefore made an experimental change for the VAX backend to 
accept ASHIFT in its TARGET_LEGITIMATE_ADDRESS_P handler and just like 
reverting this change it makes the ICE go away and indexed addressing to 
be used again.  However there are numerous other places across the VAX 
backend that expect addresses to be in the MULT form, including in 
particular expression cost calculation, and it is not clear to me if they 
all have to be adjusted for the possibility created by this change for 
addresses to come in either form.
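
 For illustration, the kind of check this boils down to in the backend is 
roughly the following; this is only a minimal sketch under my reading of 
the situation (`index_term_p' is a hypothetical name rather than the 
actual VAX code, and a real TARGET_LEGITIMATE_ADDRESS_P handler would 
also have to check modes, strictness and register classes):

  /* Hypothetical sketch: accept an index term written either as
     (mult REG 2^N), the form used inside a MEM, or as (ashift REG N),
     the form reloaded addresses now use.  */
  static bool
  index_term_p (rtx x)
  {
    if (GET_CODE (x) == MULT
        && REG_P (XEXP (x, 0))
        && CONST_INT_P (XEXP (x, 1))
        && exact_log2 (INTVAL (XEXP (x, 1))) > 0)
      return true;
    return (GET_CODE (x) == ASHIFT
            && REG_P (XEXP (x, 0))
            && CONST_INT_P (XEXP (x, 1))
            && INTVAL (XEXP (x, 1)) > 0);
  }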

 So why do we want to use a different canonical form for addresses 
depending on the context they are used with?

 It does complicate handling in the backend and my understanding has been 
that canonicalisation is meant to simplify handling throughout instead.  
And sadly the change description does not explain why it is correct to 
have addresses use the ASHIFT form in certain contexts and the MULT form 
in the remaining ones.

  Maciej
Richard Sandiford March 25, 2021, 6:28 p.m. UTC | #9
"Maciej W. Rozycki" <macro@orcam.me.uk> writes:
> On Wed, 26 Aug 2020, Vladimir Makarov via Gcc-patches wrote:
>
>> On 2020-08-26 5:06 a.m., Richard Sandiford wrote:
>> > 
>> > I don't think we should restrict this to (plus (mult X Y) Z),
>> > since addresses can be more complicated than that.  One way to
>> > search for all MULTs is:
>> > 
>> >    subrtx_iterator::array_type array;
>> >    FOR_EACH_SUBRTX (iter, array, x, NONCONST)
>> >      {
>> >        rtx x = *iter;
>> >        if (GET_CODE (x) == MULT && CONST_INT_P (XEXP (x, 1)))
>> >          ...
>> >      }
>> > 
>> > (Needs rtl-iter.h)
>> 
>> I agree it would be nice to process a general case.  Alex, you can do this
>> as a separate patch if you want.
>> 
>> Richard, thank you for looking at this patch too.
>
> [From <https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552586.html>; 
> also commit 6b3034eaba83.]
>
>  Guys, this triggers a backend's functional regression and an ICE in the 
> test suite with the LRA conversion I'm currently working on for the VAX 
> backend.  Before I go ahead and paper it over in the backend I'd like to 
> understand why this change was considered correct in the first place.
>
>  Contrary to what the change description suggests the ASHIFT form is not 
> documented to be the canonical form for constant multiplication involving 
> a power of two for addresses used outside `mem'.

One thing to note here is that, outside of a mem, there's no distinction
between an address calculation and normal integer arithmetic.  In other
words, “addresses used outside of a ‘mem’” aren't a distinct category of
rtx that can be separated from other things outside of a “mem“.  So…

> What our rules only say 
> is that for addresses inside `mem' the MULT form is:
>
>    * Within address computations (i.e., inside 'mem'), a left shift is
>      converted into the appropriate multiplication by a power of two.
>
>  This change does the reverse of the conversion described above and makes
> TARGET_LEGITIMATE_ADDRESS_P and possibly other backend code be presented 
> with either form for indexed addresses, which complicates handling.  The 
> ICE mentioned above specifically is caused by:
>
> (plus:SI (plus:SI (mult:SI (reg:SI 30 [ _10 ])
>             (const_int 4 [0x4]))
>         (reg/f:SI 26 [ _6 ]))
>     (const_int 12 [0xc]))

…if you write:

-----------------------------------------------------------
long *foo ();
long *bar (long *ptr, long x) { return &foo ()[x + 3]; }
-----------------------------------------------------------

then the rtl you get is:

-----------------------------------------------------------
…
(insn 10 9 11 2 (set (reg:SI 32)
        (plus:SI (reg/v:SI 29 [ x ])
            (const_int 3 [0x3]))) "/tmp/foo.c":2:47 -1
     (nil))
(insn 11 10 12 2 (set (reg:SI 33)
        (ashift:SI (reg:SI 32)
            (const_int 2 [0x2]))) "/tmp/foo.c":2:47 -1
     (nil))
(insn 12 11 13 2 (set (reg:SI 31)
        (plus:SI (reg/f:SI 23 [ _1 ])
            (reg:SI 33))) "/tmp/foo.c":2:40 -1
     (nil))
…
-----------------------------------------------------------

where the address uses “ashift” rather than “mult”.  Then combine
tries to generate the same kind of address as the one you quote above,
but with “ashift” rather than “mult”:

-----------------------------------------------------------
Trying 10, 11 -> 12:
   10: r32:SI=r29:SI+0x3
      REG_DEAD r29:SI
   11: r33:SI=r32:SI<<0x2
      REG_DEAD r32:SI
   12: r31:SI=r34:SI+r33:SI
      REG_DEAD r34:SI
      REG_DEAD r33:SI
Failed to match this instruction:
(set (reg:SI 31)
    (plus:SI (plus:SI (ashift:SI (reg/v:SI 29 [ x ])
                (const_int 2 [0x2]))
            (reg:SI 34))
        (const_int 12 [0xc])))
-----------------------------------------------------------

So I don't see your VAX change as papering over the issue.  The above
“ashift” form is what address calculations normally look like outside
of a “mem”.  The point of the rtl canonicalisation rules is to make sure
that targets don't need to support two different ways of writing the
same thing, which in this case means not having to support
“mult”-by-a-power-of-2 as well as “ashift” for the LEA above.

> coming from:
>
> (insn 58 57 59 10 (set (reg:SI 33 [ _13 ])
>         (zero_extract:SI (mem:SI (plus:SI (plus:SI (mult:SI (reg:SI 30 [ _10 ])
>                             (const_int 4 [0x4]))
>                         (reg/f:SI 26 [ _6 ]))
>                     (const_int 12 [0xc])) [4 _6->bits[_10]+0 S4 A32])
>             (reg:QI 56)
>             (reg:SI 53))) 
> ".../gcc/testsuite/gcc.c-torture/execute/20090113-2.c":64:12 490 {*extzv_non_const}
>      (expr_list:REG_DEAD (reg:QI 56)
>         (expr_list:REG_DEAD (reg:SI 53)
>             (expr_list:REG_DEAD (reg:SI 30 [ _10 ])
>                 (expr_list:REG_DEAD (reg/f:SI 26 [ _6 ])
>                     (nil))))))
>
> being converted into:
>
> (plus:SI (plus:SI (ashift:SI (reg:SI 30 [ _10 ])
>             (const_int 2 [0x2]))
>         (reg/f:SI 26 [ _6 ]))
>     (const_int 12 [0xc]))
>
> which the backend currently does not recognise as a valid machine address 
> and clearly all the fallback handling fails in this case.  It also causes 
> indexed addressing for non-byte data (scaling is implicit in the VAX ISA) 
> to cease being used where no ICE actually triggers, which causes a serious 
> code quality regression from extraneous manual address calculations.
>
>  Given that the VAX backend currently does not expect ASHIFT in addresses 
> and it works, this single piece in LRA must be the only one across the 
> middle end that uses this form and all the other code must have stuck to 
> > the MULT form.  So it does not appear to me that the ASHIFT form was 
> > indeed considered canonical before this problematic change.

Hopefully the above combine example answers this.

>  I have looked into what other backends do that support scaled indexed 
> addressing and x86 escaped a regression here owing to an unrelated much 
> earlier fix: <https://gcc.gnu.org/ml/gcc-patches/2010-04/msg01170.html> 
> for PR target/43766 (commit 90f775a9c7af) that added ASHIFT support to its 
> TARGET_LEGITIMATE_ADDRESS_P handler, and Aarch64 presumably has always had 
> it.
>
>  I have therefore made an experimental change for the VAX backend to 
> accept ASHIFT in its TARGET_LEGITIMATE_ADDRESS_P handler and just like 
> reverting this change it makes the ICE go away and indexed addressing to 
> be used again.  However there are numerous other places across the VAX 
> > backend that expect addresses to be in the MULT form, including in 
> particular expression cost calculation, and it is not clear to me if they 
> all have to be adjusted for the possibility created by this change for 
> addresses to come in either form.

Does the patch also help to optimise my example above?  If so,
it sounds like a good thing for that reason alone.

>  So why do we want to use a different canonical form for addresses 
> depending on the context they are used with?
>
>  It does complicate handling in the backend and my understanding has been 
> that canonicalisation is meant to simplify handling throughout instead.  
> And sadly the change description does not explain why it is correct to 
> have addresses use the ASHIFT form in certain contexts and the MULT form 
> in the remaining ones.

Yeah, canonicalisation is supposed to simplify handling.  I think the
thing that thwarts that in this particular context is the fact that
we have different rules for “mult”s inside addresses.  IMO that was
a mistake, and “mult” by a power of 2 should be an “ashift”
wherever it occurs.  But fixing that for all backends would probably
be a huge task.

While the special “mem” case exists, we're trading complication in
one direction for complication in another.  If code matches addresses
that are used for both address constraints and memory constraints,
it will need to support both the “ashift” and “mult” forms (and for
good code quality it should have done that even before the patch).
On the other side, rtl patterns that match address-like expressions
can rely on “ashift” being used and do not need to support “mult”.

Thanks,
Richard
Maciej W. Rozycki April 8, 2021, 9:18 p.m. UTC | #10
On Thu, 25 Mar 2021, Richard Sandiford wrote:

> "Maciej W. Rozycki" <macro@orcam.me.uk> writes:
> > On Wed, 26 Aug 2020, Vladimir Makarov via Gcc-patches wrote:
> >
> >> On 2020-08-26 5:06 a.m., Richard Sandiford wrote:
> >> > 
> >> > I don't think we should restrict this to (plus (mult X Y) Z),
> >> > since addresses can be more complicated than that.  One way to
> >> > search for all MULTs is:
> >> > 
> >> >    subrtx_iterator::array_type array;
> >> >    FOR_EACH_SUBRTX (iter, array, x, NONCONST)
> >> >      {
> >> >        rtx x = *iter;
> >> >        if (GET_CODE (x) == MULT && CONST_INT_P (XEXP (x, 1)))
> >> >          ...
> >> >      }
> >> > 
> >> > (Needs rtl-iter.h)
> >> 
> >> I agree it would be nice to process a general case.  Alex, you can do this
> >> as a separate patch if you want.
> >> 
> >> Richard, thank you for looking at this patch too.
> >
> > [From <https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552586.html>; 
> > also commit 6b3034eaba83.]
> >
> >  Guys, this triggers a backend's functional regression and an ICE in the 
> > test suite with the LRA conversion I'm currently working on for the VAX 
> > backend.  Before I go ahead and paper it over in the backend I'd like to 
> > understand why this change was considered correct in the first place.
> >
> >  Contrary to what the change description suggests the ASHIFT form is not 
> > documented to be the canonical form for constant multiplication involving 
> > a power of two for addresses used outside `mem'.
> 
> One thing to note here is that, outside of a mem, there's no distinction
> between an address calculation and normal integer arithmetic.  In other
> words, “addresses used outside of a ‘mem’” aren't a distinct category of
> rtx that can be separated from other things outside of a “mem“.  So…

 In principle you're right, however my interpretation would be that 
whenever TARGET_LEGITIMATE_ADDRESS_P is called the middle end does 
consider the rtx passed an address calculation rather than normal integer 
arithmetic, even if only potentially.  So I would expect the middle end 
to convert the calculation to the canonical address form before such an 
invocation, even if it then converts it back (reverts to the original 
unconverted form) right after the return from TARGET_LEGITIMATE_ADDRESS_P.

 And it seems to me like this change has broken the contract for rtxes 
passed to TARGET_LEGITIMATE_ADDRESS_P that have been produced by the 
middle end rather than by individual backends (which I guess may not have 
kept to the contract, as indicated by the PR target/43766 fix for x86).

 Have I got my interpretation wrong here?

> > What our rules only say 
> > is that for addresses inside `mem' the MULT form is:
> >
> >    * Within address computations (i.e., inside 'mem'), a left shift is
> >      converted into the appropriate multiplication by a power of two.
> >
> >  This change does the reverse of the conversion described above and makes
> > TARGET_LEGITIMATE_ADDRESS_P and possibly other backend code be presented 
> > with either form for indexed addresses, which complicates handling.  The 
> > ICE mentioned above specifically is caused by:
> >
> > (plus:SI (plus:SI (mult:SI (reg:SI 30 [ _10 ])
> >             (const_int 4 [0x4]))
> >         (reg/f:SI 26 [ _6 ]))
> >     (const_int 12 [0xc]))
> 
> …if you write:
> 
> -----------------------------------------------------------
> long *foo ();
> long *bar (long *ptr, long x) { return &foo ()[x + 3]; }
> -----------------------------------------------------------
> 
> then the rtl you get is:
> 
> -----------------------------------------------------------
> …
> (insn 10 9 11 2 (set (reg:SI 32)
>         (plus:SI (reg/v:SI 29 [ x ])
>             (const_int 3 [0x3]))) "/tmp/foo.c":2:47 -1
>      (nil))
> (insn 11 10 12 2 (set (reg:SI 33)
>         (ashift:SI (reg:SI 32)
>             (const_int 2 [0x2]))) "/tmp/foo.c":2:47 -1
>      (nil))
> (insn 12 11 13 2 (set (reg:SI 31)
>         (plus:SI (reg/f:SI 23 [ _1 ])
>             (reg:SI 33))) "/tmp/foo.c":2:40 -1
>      (nil))
> …
> -----------------------------------------------------------
> 
> where the address uses “ashift” rather than “mult”.  Then combine
> tries to generate the same kind of address as the one you quote above,
> but with “ashift” rather than “mult”:
> 
> -----------------------------------------------------------
> Trying 10, 11 -> 12:
>    10: r32:SI=r29:SI+0x3
>       REG_DEAD r29:SI
>    11: r33:SI=r32:SI<<0x2
>       REG_DEAD r32:SI
>    12: r31:SI=r34:SI+r33:SI
>       REG_DEAD r34:SI
>       REG_DEAD r33:SI
> Failed to match this instruction:
> (set (reg:SI 31)
>     (plus:SI (plus:SI (ashift:SI (reg/v:SI 29 [ x ])
>                 (const_int 2 [0x2]))
>             (reg:SI 34))
>         (const_int 12 [0xc])))
> -----------------------------------------------------------
> 
> So I don't see your VAX change as papering over the issue.  The above
> “ashift” form is what address calculations normally look like outside
> of a “mem”.  The point of the rtl canonicalisation rules is to make sure
> that targets don't need to support two different ways of writing the
> same thing, which in this case means not having to support
> “mult”-by-a-power-of-2 as well as “ashift” for the LEA above.

 The insn numbering is off by one here, so I have:

(insn 9 8 10 2 (set (reg:SI 31)
        (plus:SI (reg/v:SI 29 [ x ])
            (const_int 3 [0x3]))) "test.c":2:47 -1
     (nil))
(insn 10 9 11 2 (set (reg:SI 32)
        (ashift:SI (reg:SI 31)
            (const_int 2 [0x2]))) "test.c":2:47 -1
     (nil))
(insn 11 10 12 2 (set (reg:SI 30)
        (plus:SI (reg/f:SI 23 [ _1 ])
            (reg:SI 32))) "test.c":2:40 -1
     (nil))

and indeed combine fails to match this:

Trying 9, 10 -> 11:
    9: r31:SI=r29:SI+0x3
      REG_DEAD r29:SI
   10: r32:SI=r31:SI<<0x2
      REG_DEAD r31:SI
   11: r30:SI=r33:SI+r32:SI
      REG_DEAD r33:SI
      REG_DEAD r32:SI
Failed to match this instruction:
(set (reg:SI 30)
    (plus:SI (plus:SI (ashift:SI (reg/v:SI 29 [ x ])
                (const_int 2 [0x2]))
            (reg:SI 33))
        (const_int 12 [0xc])))
Failed to match this instruction:
(set (reg:SI 32)
    (plus:SI (ashift:SI (reg/v:SI 29 [ x ])
            (const_int 2 [0x2]))
        (reg:SI 33)))

and then with ASHIFT allowed in TARGET_LEGITIMATE_ADDRESS_P earlier 
transformations match:

Trying 9 -> 10:
    9: r31:SI=r29:SI+0x3
      REG_DEAD r29:SI
   10: r32:SI=r31:SI<<0x2
      REG_DEAD r31:SI
Successfully matched this instruction:
(set (reg:SI 32)
    (plus:SI (ashift:SI (reg/v:SI 29 [ x ])
            (const_int 2 [0x2]))
        (const_int 12 [0xc])))
allowing combination of insns 9 and 10
original costs 32 + 40 = 72
replacement cost 36
deferring deletion of insn with uid = 9.
modifying insn i3    10: r32:SI=r29:SI<<0x2+0xc
      REG_DEAD r29:SI
deferring rescan insn with uid = 10.

and:

Trying 10 -> 11:
   10: r32:SI=r29:SI<<0x2+0xc
      REG_DEAD r29:SI
   11: r30:SI=r33:SI+r32:SI
      REG_DEAD r33:SI
      REG_DEAD r32:SI
Successfully matched this instruction:
(set (reg:SI 30)
    (plus:SI (plus:SI (ashift:SI (reg/v:SI 29 [ x ])
                (const_int 2 [0x2]))
            (reg:SI 33))
        (const_int 12 [0xc])))
allowing combination of insns 10 and 11
original costs 36 + 32 = 68
replacement cost 36
deferring deletion of insn with uid = 10.
modifying insn i3    11: r30:SI=r29:SI<<0x2+r33:SI+0xc
      REG_DEAD r29:SI
      REG_DEAD r33:SI
deferring rescan insn with uid = 11.

producing this:

(note 9 8 10 2 NOTE_INSN_DELETED)
(note 10 9 11 2 NOTE_INSN_DELETED)
(note 11 10 16 2 NOTE_INSN_DELETED)
(insn 16 11 17 2 (set (reg/i:SI 0 %r0)
        (plus:SI (plus:SI (ashift:SI (reg/v:SI 29 [ x ])
                    (const_int 2 [0x2]))
                (reg:SI 33))
            (const_int 12 [0xc]))) "test.c":2:56 613 {movaddrsi}
     (expr_list:REG_DEAD (reg:SI 33)
        (expr_list:REG_DEAD (reg/v:SI 29 [ x ])
            (nil))))

but then reload does a poor job handling this insn:

(insn 20 11 22 2 (set (reg/v:SI 1 %r1 [orig:29 x ] [29])
        (mem/c:SI (plus:SI (reg/f:SI 12 %ap)
                (const_int 8 [0x8])) [2 x+0 S4 A32])) "test.c":2:56 47 {movsi_2}
     (expr_list:REG_EQUIV (mem/c:SI (plus:SI (reg/f:SI 12 %ap)
                (const_int 8 [0x8])) [2 x+0 S4 A32])
        (nil)))
(insn 22 20 23 2 (set (reg:SI 0 %r0 [35])
        (plus:SI (reg:SI 0 %r0 [33])
            (const_int 12 [0xc]))) "test.c":2:56 201 {addsi3}
     (nil))
(insn 23 22 24 2 (set (reg:SI 1 %r1 [36])
        (ashift:SI (reg/v:SI 1 %r1 [orig:29 x ] [29])
            (const_int 2 [0x2]))) "test.c":2:56 432 {ashlsi3}
     (nil))
(insn 24 23 17 2 (set (reg:SI 0 %r0 [37])
        (plus:SI (reg:SI 0 %r0 [35])
            (reg:SI 1 %r1 [36]))) "test.c":2:56 201 {addsi3}
     (nil))

(this could be reduced to just two machine instructions, as shown below, 
but from the original insn rather than this expanded sequence, by making 
an observation that the complicated index expression with a stack operand 
can be precalculated with a single machine instruction, and once we have 
it the resulting operands fit another single machine instruction).

> > coming from:
> >
> > (insn 58 57 59 10 (set (reg:SI 33 [ _13 ])
> >         (zero_extract:SI (mem:SI (plus:SI (plus:SI (mult:SI (reg:SI 30 [ _10 ])
> >                             (const_int 4 [0x4]))
> >                         (reg/f:SI 26 [ _6 ]))
> >                     (const_int 12 [0xc])) [4 _6->bits[_10]+0 S4 A32])
> >             (reg:QI 56)
> >             (reg:SI 53))) 
> > ".../gcc/testsuite/gcc.c-torture/execute/20090113-2.c":64:12 490 {*extzv_non_const}
> >      (expr_list:REG_DEAD (reg:QI 56)
> >         (expr_list:REG_DEAD (reg:SI 53)
> >             (expr_list:REG_DEAD (reg:SI 30 [ _10 ])
> >                 (expr_list:REG_DEAD (reg/f:SI 26 [ _6 ])
> >                     (nil))))))
> >
> > being converted into:
> >
> > (plus:SI (plus:SI (ashift:SI (reg:SI 30 [ _10 ])
> >             (const_int 2 [0x2]))
> >         (reg/f:SI 26 [ _6 ]))
> >     (const_int 12 [0xc]))
> >
> > which the backend currently does not recognise as a valid machine address 
> > and clearly all the fallback handling fails in this case.  It also causes 
> > indexed addressing for non-byte data (scaling is implicit in the VAX ISA) 
> > to cease being used where no ICE actually triggers, which causes a serious 
> > code quality regression from extraneous manual address calculations.
> >
> >  Given that the VAX backend currently does not expect ASHIFT in addresses 
> > and it works, this single piece in LRA must be the only one across the 
> > middle end that uses this form and all the other code must have stuck to 
> > the MULT form.  So it does not appear to me that the ASHIFT form was 
> > indeed considered canonical before this problematic change.
> 
> Hopefully the above combine example answers this.

 It is enlightening, thanks, however it did not address my concern I'm 
afraid.  It did help me formulate my question perhaps in a better way 
though, as quoted above: if TARGET_LEGITIMATE_ADDRESS_P is called, then 
why doesn't the caller observe that it considers the rtx given to the 
callee an address?

> >  I have looked into what other backends do that support scaled indexed 
> > addressing and x86 escaped a regression here owing to an unrelated much 
> > earlier fix: <https://gcc.gnu.org/ml/gcc-patches/2010-04/msg01170.html> 
> > for PR target/43766 (commit 90f775a9c7af) that added ASHIFT support to its 
> > TARGET_LEGITIMATE_ADDRESS_P handler, and Aarch64 presumably has always had 
> > it.
> >
> >  I have therefore made an experimental change for the VAX backend to 
> > accept ASHIFT in its TARGET_LEGITIMATE_ADDRESS_P handler and just like 
> > reverting this change it makes the ICE go away and indexed addressing to 
> > be used again.  However there are numerous other places across the VAX 
> > backend that expect addresses to be in the MULT form, including in 
> > particular expression cost calculation, and it is not clear to me if they 
> > all have to be adjusted for the possibility created by this change for 
> > addresses to come in either form.
> 
> Does the patch also help to optimise my example above?  If so,
> it sounds like a good thing for that reason alone.

 Nope, it actually regresses code produced, causing an extra instruction 
to be used where it doesn't have to:

--- test-ira/test.s
+++ test-lra-legitimate-address-ashift/test.s
@@ -8,7 +8,8 @@
 	.word 0
 	subl2 $4,%sp
 	calls $0,foo
-	addl3 8(%ap),$3,%r1
+	movl 8(%ap),%r1
+	addl2 $12,%r0
 	moval 0[%r1],%r1
 	addl2 %r1,%r0
 	ret

(old code is either IRA, or LRA without my change to accept ASHIFT in 
TARGET_LEGITIMATE_ADDRESS_P).  Please observe that the original ADDL3 
instruction completes the index calculation (as desired, as noted above), 
and then the rest boils down to combining it with the base component as 
handed over by `foo'.

 NB (%reg) is a base register and [%reg] is an index register, implicitly 
scaled by the data width determined by the relevant suffix of the mnemonic 
(`l' here stands for "longword", so 32-bit).  And MOVA is "move address" 
like x86's LEA.

 Though frankly what we really want anyway is:

	calls $0,foo
	addl3 8(%ap),$3,%r1
	moval (%r0)[%r1],%r0
	ret

because what we need to do here to complete the calculation is to combine 
the base in %r0 with the index in %r1 -- there's no performance penalty 
for the use of a base register (an index costs an extra cycle) and the 
encoding is much shorter:

  11:	de 41 9f 00 	moval *0x0[r1],r1
  15:	00 00 00 51 
  19:	c0 51 50 	addl2 r1,r0

vs:

  11:	de 41 60 50 	moval (r0)[r1],r0

(because in the absence of a base register the absolute address mode has 
to be used, which always uses full 32-bit extension, unlike displacements 
optionally used with a base register).  But while we're not there, we're 
moving away from rather than towards the desired solution.  This seems to 
be an unfortunate consequence of how reload works though, so I gather it 
needs to be fixed in reload.

> >  So why do we want to use a different canonical form for addresses 
> > depending on the context they are used with?
> >
> >  It does complicate handling in the backend and my understanding has been 
> > that canonicalisation is meant to simplify handling throughout instead.  
> > And sadly the change description does not explain why it is correct to 
> > have addresses use the ASHIFT form in certain contexts and the MULT form 
> > in the remaining ones.
> 
> Yeah, canonicalisation is supposed to simplify handling.  I think the
> thing that thwarts that in this particular context is the fact that
> we have different rules for “mult”s inside addresses.  IMO that was
> a mistake, and “mult” by a power of 2 should be an “ashift”
> wherever it occurs.  But fixing that for all backends would probably
> be a huge task.

 Worth doing anyway perhaps?  How many targets are there that we support 
that have scaled indexed addressing?

> While the special “mem” case exists, we're trading complication in
> one direction for complication in another.  If code matches addresses
> that are used for both address constraints and memory constraints,
> it will need to support both the “ashift” and “mult” forms (and for
> good code quality it should have done that even before the patch).
> On the other side, rtl patterns that match address-like expressions
> can rely on “ashift” being used and do not need to support “mult”.

 So what's the difference between "address constraints" and "memory 
constraints"?  Does it boil down to targets I cannot name that do not 
support the same address calculations regardless of whether the address is 
at the same time dereferenced or not?

 Thank you for your input.

  Maciej
Maciej W. Rozycki April 8, 2021, 11:19 p.m. UTC | #11
On Thu, 8 Apr 2021, Maciej W. Rozycki wrote:

> > Does the patch also help to optimise my example above?  If so,
> > it sounds like a good thing for that reason alone.
> 
>  Nope, it actually regresses code produced, causing an extra instruction 
> to be used where it doesn't have to:
> 
> --- test-ira/test.s
> +++ test-lra-legitimate-address-ashift/test.s
> @@ -8,7 +8,8 @@
>  	.word 0
>  	subl2 $4,%sp
>  	calls $0,foo
> -	addl3 8(%ap),$3,%r1
> +	movl 8(%ap),%r1
> +	addl2 $12,%r0
>  	moval 0[%r1],%r1
>  	addl2 %r1,%r0
>  	ret
> 
> (old code is either IRA, or LRA without my change to accept ASHIFT in 
> TARGET_LEGITIMATE_ADDRESS_P).  Please observe that the original ADDL3 
> instruction completes the index calculation (as desired, as noted above), 
> and then the rest boils down to combining it with the base component as 
> handed over by `foo'.
> 
>  NB (%reg) is a base register and [%reg] is an index register, implicitly 
> scaled by the data width determined by the relevant suffix of the mnemonic 
> (`l' here stands for "longword", so 32-bit).  And MOVA is "move address" 
> like x86's LEA.
> 
>  Though frankly what we really want anyway is:
> 
> 	calls $0,foo
> 	addl3 8(%ap),$3,%r1
> 	moval (%r0)[%r1],%r0
> 	ret
> 
> because what we need to do here to complete the calculation is to combine 
> the base in %r0 with the index in %r1 -- there's no performance penalty 
> for the use of a base register (an index costs an extra cycle) and the 
> encoding is much shorter:
> 
>   11:	de 41 9f 00 	moval *0x0[r1],r1
>   15:	00 00 00 51 

 FAOD, mind that the MOVAL produced here does not come from indexed 
addressing but rather from the ASHIFT pattern, as also shown by the 
original insn:

(insn 23 22 24 2 (set (reg:SI 1 %r1 [36])
        (ashift:SI (reg/v:SI 1 %r1 [orig:29 x ] [29])
            (const_int 2 [0x2]))) "test.c":2:56 432 {ashlsi3}
     (nil))

which I quoted in my last message, or in the post-reload form:

(insn 27 26 28 (parallel [
            (set (reg:SI 1 %r1 [36])
                (ashift:SI (reg/v:SI 1 %r1 [orig:29 x ] [29])
                    (const_int 2 [0x2])))
            (clobber (reg:CC 16 %psl))
        ]) "test.c":2:56 433 {*ashlsi3}
     (expr_list:REG_UNUSED (reg:CC 16 %psl)
        (nil)))

>   19:	c0 51 50 	addl2 r1,r0
> 
> vs:
> 
>   11:	de 41 60 50 	moval (r0)[r1],r0
> 
> (because in the absence of a base register the absolute address mode has 
> to be used, which always uses full 32-bit extension, unlike displacements 
> optionally used with a base register).  But while we're not there, we're 
> moving away from rather than towards the desired solution.  This seems to 
> be an unfortunate consequence of how reload works though, so I gather it 
> needs to be fixed in reload.

 So I thought of another experiment and I applied my ASHIFT patch to the 
old reload configuration.  In its unmodified form it caused an ICE with 
your example:

during RTL pass: final
test.c: In function 'bar':
test.c:2:56: internal compiler error: in print_operand_address, at config/vax/vax.c:419
    2 | long *bar (long *ptr, long x) { return &foo ()[x + 3]; }
      |                                                        ^
0x1188489f print_operand_address(_IO_FILE*, rtx_def*)
	.../gcc/config/vax/vax.c:419
0x111acbbb default_print_operand_address(_IO_FILE*, machine_mode, rtx_def*)
	.../gcc/targhooks.c:367
0x10b3115f output_address(machine_mode, rtx_def*)
	.../gcc/final.c:4085
0x10b3088b output_asm_insn(char const*, rtx_def**)
	.../gcc/final.c:3942
0x10b2ea63 final_scan_insn_1
	.../gcc/final.c:3125
0x10b2eda7 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*)
	.../gcc/final.c:3171
0x10b2bd1f final_1
	.../gcc/final.c:2022
0x10b327e3 rest_of_handle_final
	.../gcc/final.c:4676
0x10b32d23 execute
	.../gcc/final.c:4754
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

coming from:

(plus:SI (ashift:SI (reg/v:SI 1 %r1 [orig:29 x ] [29])
        (const_int 2 [0x2]))
    (reg:SI 0 %r0 [33]))

and serving as a proof that backends used not to expect ASHIFT in 
addresses.  Anyway I went ahead and converted `print_operand_address' too, 
yielding this code:

	calls $0,foo
	movl 8(%ap),%r1
	moval 12(%r0)[%r1],%r0
	ret

or:

   c:	d0 ac 08 51 	movl 0x8(ap),r1
  10:	de 41 a0 0c 	moval 0xc(r0)[r1],r0
  14:	50 
  15:	04 		ret

which does use indexed addressing:

(insn 24 23 17 (parallel [
            (set (reg/i:SI 0 %r0)
                (plus:SI (plus:SI (ashift:SI (reg/v:SI 1 %r1 [orig:29 x ] [29])
                            (const_int 2 [0x2]))
                        (reg:SI 0 %r0 [33]))
                    (const_int 12 [0xc])))
            (clobber (reg:CC 16 %psl))
        ]) "test.c":2:56 623 {*movaddrsi}
     (expr_list:REG_DEAD (reg/v:SI 1 %r1 [orig:29 x ] [29])
        (expr_list:REG_UNUSED (reg:CC 16 %psl)
            (nil))))

and given that the left shift does not cause the displacement to require a 
wider encoding here, it is just as good as the desired sequence I came up 
with above (shown here with the ADDL3/RET I omitted above):

   c:	c1 ac 08 03 	addl3 0x8(ap),$0x3,r1
  10:	51 
  11:	de 41 60 50 	moval (r0)[r1],r0
  15:	04 		ret

so we're clearly doing something considerably worse with LRA than we used 
to with old reload.

> > Yeah, canonicalisation is supposed to simplify handling.  I think the
> > thing that thwarts that in this particular context is the fact that
> > we have different rules for “mult”s inside addresses.  IMO that was
> > a mistake, and “mult” by a power of 2 should be an “ashift”
> > wherever it occurs.  But fixing that for all backends would probably
> > be a huge task.
> 
>  Worth doing anyway perhaps?  How many targets are there that we support 
> that have scaled indexed addressing?

 So with the updates to `print_operand_address' and `vax_address_cost_1' 
as well I have the VAX backend prepared for handling either representation 
throughout.  I can't imagine it being *that* horribly complicated for the 
other backends, and once all are handled we can convert the middle end.
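
 For the record, the kind of adjustment needed in places such as the cost 
calculation and the address printer is of the same flavour; roughly the 
following, again only a sketch with a hypothetical `index_scale' helper 
and an assumed quadword upper bound on the scaling, not the actual VAX 
change:

  /* Hypothetical sketch: return the scale factor applied by index term X,
     whether written as (mult REG 2^N) or as (ashift REG N), or 0 if X is
     not a scaled index term.  Scaling beyond quadwords is assumed not to
     matter here.  */
  static HOST_WIDE_INT
  index_scale (rtx x)
  {
    if (GET_CODE (x) == MULT
        && CONST_INT_P (XEXP (x, 1))
        && exact_log2 (INTVAL (XEXP (x, 1))) > 0)
      return INTVAL (XEXP (x, 1));
    if (GET_CODE (x) == ASHIFT
        && CONST_INT_P (XEXP (x, 1))
        && IN_RANGE (INTVAL (XEXP (x, 1)), 1, 3))
      return HOST_WIDE_INT_1 << INTVAL (XEXP (x, 1));
    return 0;
  }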

  Maciej

Patch

diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
index 421c453997b..630a4f167cd 100644
--- a/gcc/lra-constraints.c
+++ b/gcc/lra-constraints.c
@@ -570,6 +570,34 @@  init_curr_insn_input_reloads (void)
   curr_insn_input_reloads_num = 0;
 }
 
+/* The canonical form of an rtx inside a MEM is not necessarily the same as the
+   canonical form of the rtx outside the MEM.  Fix this up in the case that
+   we're reloading an address (and therefore pulling it outside a MEM).  */
+static rtx canonicalize_reload_addr (rtx addr)
+{
+  const enum rtx_code code = GET_CODE (addr);
+
+  if (code == PLUS)
+    {
+      rtx inner = XEXP (addr, 0);
+      if (GET_CODE (inner) == MULT && CONST_INT_P (XEXP (inner, 1)))
+	{
+	  const HOST_WIDE_INT ci = INTVAL (XEXP (inner, 1));
+	  const int pwr2 = exact_log2 (ci);
+	  if (pwr2 > 0)
+	    {
+	      /* Rewrite this to use a shift instead, which is canonical
+		 when outside of a MEM.  */
+	      XEXP (addr, 0) = gen_rtx_ASHIFT (GET_MODE (inner),
+					       XEXP (inner, 0),
+					       GEN_INT (pwr2));
+	    }
+	}
+    }
+
+  return addr;
+}
+
 /* Create a new pseudo using MODE, RCLASS, ORIGINAL or reuse already
    created input reload pseudo (only if TYPE is not OP_OUT).  Don't
    reuse pseudo if IN_SUBREG_P is true and the reused pseudo should be
@@ -4362,12 +4390,19 @@  curr_insn_transform (bool check_only_p)
 	    {
 	      rtx addr = *loc;
 	      enum rtx_code code = GET_CODE (addr);
-	      
+	      bool align_p = false;
+
 	      if (code == AND && CONST_INT_P (XEXP (addr, 1)))
-		/* (and ... (const_int -X)) is used to align to X bytes.  */
-		addr = XEXP (*loc, 0);
+		{
+		  /* (and ... (const_int -X)) is used to align to X bytes.  */
+		  align_p = true;
+		  addr = XEXP (*loc, 0);
+		}
+	      else
+		addr = canonicalize_reload_addr (addr);
+
 	      lra_emit_move (new_reg, addr);
-	      if (addr != *loc)
+	      if (align_p)
 		emit_move_insn (new_reg, gen_rtx_AND (GET_MODE (new_reg), new_reg, XEXP (*loc, 1)));
 	    }
 	  before = get_insns ();
diff --git a/gcc/testsuite/gcc.target/aarch64/mem-shift-canonical.c b/gcc/testsuite/gcc.target/aarch64/mem-shift-canonical.c
new file mode 100644
index 00000000000..36beed497a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mem-shift-canonical.c
@@ -0,0 +1,27 @@ 
+/* This test is a copy of gcc.dg/torture/pr34330.c: here we are looking for
+   specific patterns being matched in the AArch64 backend.  */
+
+/* { dg-do compile } */
+/* { dg-options "-Os -ftree-vectorize -dp" } */
+
+
+struct T
+{
+  int t;
+  struct { short s1, s2, s3, s4; } *s;
+};
+
+void
+foo (int *a, int *b, int *c, int *d, struct T *e)
+{
+  int i;
+  for (i = 0; i < e->t; i++)
+    {
+      e->s[i].s1 = a[i];
+      e->s[i].s2 = b[i];
+      e->s[i].s3 = c[i];
+      e->s[i].s4 = d[i];
+    }
+}
+
+/* { dg-final { scan-assembler-times "add_lsl_di" 3 } } */