
Try placing RTL folded constants in constant pool

Message ID 00b401d7b862$b1a511d0$14ef3570$@nextmovesoftware.com
State New
Series: Try placing RTL folded constants in constant pool

Commit Message

Roger Sayle Oct. 3, 2021, 2:26 p.m. UTC
My recent attempts to come up with a testcase for my patch to evaluate
ss_plus in simplify-rtx.c identified a missed optimization opportunity
(and potentially a long-standing regression): the RTL optimizers no
longer place constants in the constant pool.

The motivating x86_64 example is the simple program:

typedef char v8qi __attribute__ ((vector_size (8)));

v8qi foo()
{
  v8qi tx = { 1, 0, 0, 0, 0, 0, 0, 0 };
  v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
  v8qi t = __builtin_ia32_paddsb(tx, ty);
  return t;
}

which (with my previous patch) currently results in:
foo:    movq    .LC0(%rip), %xmm0
        movq    .LC1(%rip), %xmm1
        paddsb  %xmm1, %xmm0
        ret

even though the RTL contains the result in a REG_EQUAL note:

(insn 7 6 12 2 (set (reg:V8QI 83)
        (ss_plus:V8QI (reg:V8QI 84)
            (reg:V8QI 85))) "ssaddqi3.c":7:12 1419 {*mmx_ssaddv8qi3}
     (expr_list:REG_DEAD (reg:V8QI 85)
        (expr_list:REG_DEAD (reg:V8QI 84)
            (expr_list:REG_EQUAL (const_vector:V8QI [
                        (const_int 3 [0x3])
                        (const_int 0 [0]) repeated x7
                    ])
                (nil)))))

Together with the patch below, GCC will now generate the much
more sensible:
foo:    movq    .LC2(%rip), %xmm0
        ret

My first approach was to look in cse.c (where the REG_EQUAL note gets
added) and notice that the constant pool handling functionality has been
unreachable for a while.  A quick search for constant_pool_entries_cost
shows that it's initialized to zero, but never set to a non-zero value,
meaning that force_const_mem is never called.  This functionality used
to work way back in 2003, but has been lost over time:
https://gcc.gnu.org/pipermail/gcc-patches/2003-October/116435.html
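
For reference, the guard in cse_insn that this cost is supposed to
enable (quoted from the pre-patch code, visible in the diff below) can
therefore never be entered, because its first operand is always zero:

	  else if (constant_pool_entries_cost
		   && CONSTANT_P (trial)
		   && (src_folded == 0
		       || (!MEM_P (src_folded)
			   && ! src_folded_force_flag))
		   && GET_MODE_CLASS (mode) != MODE_CC
		   && mode != VOIDmode)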

The changes to cse.c below restore this functionality (placing suitable
constants in the constant pool) with two significant refinements:
(i) it only attempts to do this if the function already uses a constant
pool (thanks to the availability of crtl->uses_const_pool since 2003);
(ii) it allows different constants (i.e. modes) to have different costs,
so that floating point "doubles" and 64-bit, 128-bit, 256-bit and 512-bit
vectors don't all have to share the same cost.  Back in 2003, the
assumption was that everything in a constant pool had the same
cost, hence the global variable constant_pool_entries_cost.

Although this is a useful CSE fix, it turns out that it doesn't cure my
motivating problem above.  CSE only considers a single instruction, and
so determines that it's cheaper to perform the ss_plus (COSTS_N_INSNS(1))
than to read the result from the constant pool (COSTS_N_INSNS(2)).  It's
only when the other reads from the constant pool are also eliminated
that this transformation is a win.  Hence a better place to perform
this transformation is in combine, where, after failing to "recog" the
load of a suitable constant, it can retry after calling force_const_mem.
This achieves the desired transformation and allows the backend insn_cost
callback to control whether or not using the constant pool is preferable.
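
(For scale, GCC's rtl.h defines the cost units as

  #define COSTS_N_INSNS(N) ((N) * 4)

so CSE's comparison above is 4 units for the ss_plus against 8 units
for the constant pool load.)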

Alas, it's rare to change code generation without affecting something in
GCC's testsuite.  On x86_64-pc-linux-gnu there were two families of new
failures (and I'd predict similar benign fallout on other platforms).
One failure was gcc.target/i386/387-12.c (aka PR target/26915), where
the test is missing an explicit -m32 flag.  On i686, it's very reasonable
to materialize -1.0 using "fld1; fchs", but on x86_64-pc-linux-gnu we
currently generate the awkward:
testm1: fld1
        fchs
        fstpl   -8(%rsp)
        movsd   -8(%rsp), %xmm0
        ret

which combine now very reasonably simplifies to just:
testm1: movsd   .LC3(%rip), %xmm0
        ret

The other class of x86_64-pc-linux-gnu failure was from the
materialization of vector constants using vpbroadcast
(e.g. gcc.target/i386/pr90773-17.c), where the decision is finely
balanced: the load of an integer register with an immediate constant,
followed by a vpbroadcast, is deemed to be COSTS_N_INSNS(2), whereas a
load from the constant pool is also reported as COSTS_N_INSNS(2).  My
solution is to tweak i386.c's ix86_rtx_costs so that, all other things
being equal, an instruction (sequence) that accesses memory is
fractionally more expensive than one that doesn't.
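
With that one-unit tweak, the tie above is broken in the right
direction: the immediate load followed by a vpbroadcast stays at
COSTS_N_INSNS(2), i.e. 8 units, while the constant pool load becomes
COSTS_N_INSNS(2) + 1, i.e. 9 units, so the existing vpbroadcast code
generation is preserved.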


Hopefully, this all makes sense.  If someone could benchmark this for
me, that would be much appreciated.  This patch has been tested on
x86_64-pc-linux-gnu with "make bootstrap" and "make -k check" with no
new failures.  Ok for mainline?


2021-10-03  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* combine.c (recog_for_combine): For an unrecognized move/set of
	a constant, try force_const_mem to place it in the constant pool.
	* cse.c (constant_pool_entries_cost, constant_pool_entries_regcost):
	Delete global variables (that are no longer assigned a cost value).
	(cse_insn): Simplify logic for deciding whether to place a folded
	constant in the constant pool using force_const_mem.
	(cse_main): Remove zero initialization of constant_pool_entries_cost
	and constant_pool_entries_regcost.

	* config/i386/i386.c (ix86_rtx_costs): Make memory accesses
	fractionally more expensive, when optimizing for speed.

gcc/testsuite/ChangeLog
	* gcc.target/i386/387-12.c: Add explicit -m32 option.

Roger
--

Comments

H.J. Lu Oct. 3, 2021, 2:36 p.m. UTC | #1
On Sun, Oct 3, 2021 at 7:27 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
> [patch description and gcc/ChangeLog quoted in full; snipped]
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/387-12.c: Add explicit -m32 option.

Should it use

/* { dg-do compile { target ia32 } } */

instead?
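
For concreteness, the test's header would then read (a sketch combining
the existing dg-options from the patch with the ia32 target selector,
in place of an explicit -m32):

/* PR target/26915 */
/* { dg-do compile { target ia32 } } */
/* { dg-options "-O -mfpmath=387 -mfancy-math-387" } */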

Jeff Law Oct. 17, 2021, 10:08 p.m. UTC | #2
On 10/3/2021 8:26 AM, Roger Sayle wrote:
> [patch description and ChangeLogs quoted in full; snipped]
OK.  I'll have to keep this one in mind as we continue evaluating 
performance of our internal port :-)  We're mostly focused on gcc-11 
right now, but will likely start moving focus to the gcc-12 bits in the 
reasonably near future.

Jeff

Patch

diff --git a/gcc/combine.c b/gcc/combine.c
index 892c834..03e9a78 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -11567,7 +11567,27 @@  recog_for_combine (rtx *pnewpat, rtx_insn *insn, rtx *pnotes)
   bool changed = false;
 
   if (GET_CODE (pat) == SET)
-    changed = change_zero_ext (pat);
+    {
+      /* For an unrecognized single set of a constant, try placing it in
+	 the constant pool, if this function already uses one.  */
+      rtx src = SET_SRC (pat);
+      if (CONSTANT_P (src)
+	  && !CONST_INT_P (src)
+	  && crtl->uses_const_pool)
+	{
+	  machine_mode mode = GET_MODE (src);
+	  if (mode == VOIDmode)
+	    mode = GET_MODE (SET_DEST (pat));
+	  src = force_const_mem (mode, src);
+	  if (src)
+	    {
+	      SUBST (SET_SRC (pat), src);
+	      changed = true;
+	    }
+	}
+      else
+	changed = change_zero_ext (pat);
+    }
   else if (GET_CODE (pat) == PARALLEL)
     {
       int i;
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ba89e11..cd0e38d 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -20799,6 +20799,13 @@  ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
 	*total = cost->sse_op;
       return true;
 
+    case MEM:
+      /* An insn that accesses memory is slightly more expensive
+         than one that does not.  */
+      if (speed)
+        *total += 1;
+      return false;
+
     default:
       return false;
     }
diff --git a/gcc/cse.c b/gcc/cse.c
index 330c1e9..4c3988e 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -491,14 +491,6 @@  static struct table_elt *table[HASH_SIZE];
 
 static struct table_elt *free_element_chain;
 
-/* Set to the cost of a constant pool reference if one was found for a
-   symbolic constant.  If this was found, it means we should try to
-   convert constants into constant pool entries if they don't fit in
-   the insn.  */
-
-static int constant_pool_entries_cost;
-static int constant_pool_entries_regcost;
-
 /* Trace a patch through the CFG.  */
 
 struct branch_path
@@ -4609,9 +4601,6 @@  cse_insn (rtx_insn *insn)
       int src_folded_regcost = MAX_COST;
       int src_related_regcost = MAX_COST;
       int src_elt_regcost = MAX_COST;
-      /* Set nonzero if we need to call force_const_mem on with the
-	 contents of src_folded before using it.  */
-      int src_folded_force_flag = 0;
       scalar_int_mode int_mode;
 
       dest = SET_DEST (sets[i].rtl);
@@ -5166,15 +5155,7 @@  cse_insn (rtx_insn *insn)
 			     src_related_cost, src_related_regcost) <= 0
 	      && preferable (src_folded_cost, src_folded_regcost,
 			     src_elt_cost, src_elt_regcost) <= 0)
-	    {
-	      trial = src_folded, src_folded_cost = MAX_COST;
-	      if (src_folded_force_flag)
-		{
-		  rtx forced = force_const_mem (mode, trial);
-		  if (forced)
-		    trial = forced;
-		}
-	    }
+	    trial = src_folded, src_folded_cost = MAX_COST;
 	  else if (src
 		   && preferable (src_cost, src_regcost,
 				  src_eqv_cost, src_eqv_regcost) <= 0
@@ -5361,23 +5342,24 @@  cse_insn (rtx_insn *insn)
 	      break;
 	    }
 
-	  /* If we previously found constant pool entries for
-	     constants and this is a constant, try making a
-	     pool entry.  Put it in src_folded unless we already have done
-	     this since that is where it likely came from.  */
+	  /* If the current function uses a constant pool and this is a
+	     constant, try making a pool entry. Put it in src_folded
+	     unless we already have done this since that is where it
+	     likely came from.  */
 
-	  else if (constant_pool_entries_cost
+	  else if (crtl->uses_const_pool
 		   && CONSTANT_P (trial)
-		   && (src_folded == 0
-		       || (!MEM_P (src_folded)
-			   && ! src_folded_force_flag))
+		   && !CONST_INT_P (trial)
+		   && (src_folded == 0 || !MEM_P (src_folded))
 		   && GET_MODE_CLASS (mode) != MODE_CC
 		   && mode != VOIDmode)
 	    {
-	      src_folded_force_flag = 1;
-	      src_folded = trial;
-	      src_folded_cost = constant_pool_entries_cost;
-	      src_folded_regcost = constant_pool_entries_regcost;
+	      src_folded = force_const_mem (mode, trial);
+	      if (src_folded)
+		{
+		  src_folded_cost = COST (src_folded, mode);
+		  src_folded_regcost = approx_reg_cost (src_folded);
+		}
 	    }
 	}
 
@@ -6630,8 +6612,6 @@  cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
   cse_cfg_altered = false;
   cse_jumps_altered = false;
   recorded_label_ref = false;
-  constant_pool_entries_cost = 0;
-  constant_pool_entries_regcost = 0;
   ebb_data.path_size = 0;
   ebb_data.nsets = 0;
   rtl_hooks = cse_rtl_hooks;
diff --git a/gcc/testsuite/gcc.target/i386/387-12.c b/gcc/testsuite/gcc.target/i386/387-12.c
index 62c1d48..7fe50a2 100644
--- a/gcc/testsuite/gcc.target/i386/387-12.c
+++ b/gcc/testsuite/gcc.target/i386/387-12.c
@@ -1,6 +1,6 @@ 
 /* PR target/26915 */
 /* { dg-do compile } */
-/* { dg-options "-O -mfpmath=387 -mfancy-math-387" } */
+/* { dg-options "-O -m32 -mfpmath=387 -mfancy-math-387" } */
 
 double testm0(void)
 {