| Field | Value |
|---|---|
| Message ID | 00b401d7b862$b1a511d0$14ef3570$@nextmovesoftware.com |
| State | New |
| Series | Try placing RTL folded constants in constant pool |
On Sun, Oct 3, 2021 at 7:27 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> My recent attempts to come up with a testcase for my patch to evaluate
> ss_plus in simplify-rtx.c identified a missed optimization opportunity
> (that's potentially a long-standing regression): the RTL optimizers no
> longer place constants in the constant pool.
>
> The motivating x86_64 example is the simple program:
>
> typedef char v8qi __attribute__ ((vector_size (8)));
>
> v8qi foo()
> {
>   v8qi tx = { 1, 0, 0, 0, 0, 0, 0, 0 };
>   v8qi ty = { 2, 0, 0, 0, 0, 0, 0, 0 };
>   v8qi t = __builtin_ia32_paddsb(tx, ty);
>   return t;
> }
>
> which (with my previous patch) currently results in:
>
> foo:	movq	.LC0(%rip), %xmm0
> 	movq	.LC1(%rip), %xmm1
> 	paddsb	%xmm1, %xmm0
> 	ret
>
> even though the RTL contains the result in a REG_EQUAL note:
>
> (insn 7 6 12 2 (set (reg:V8QI 83)
>         (ss_plus:V8QI (reg:V8QI 84)
>             (reg:V8QI 85))) "ssaddqi3.c":7:12 1419 {*mmx_ssaddv8qi3}
>      (expr_list:REG_DEAD (reg:V8QI 85)
>         (expr_list:REG_DEAD (reg:V8QI 84)
>             (expr_list:REG_EQUAL (const_vector:V8QI [
>                         (const_int 3 [0x3])
>                         (const_int 0 [0]) repeated x7
>                     ])
>                 (nil)))))
>
> Together with the patch below, GCC will now generate the much
> more sensible:
>
> foo:	movq	.LC2(%rip), %xmm0
> 	ret
>
> My first approach was to look in cse.c (where the REG_EQUAL note gets
> added) and notice that the constant pool handling functionality has been
> unreachable for a while.  A quick search for constant_pool_entries_cost
> shows that it's initialized to zero, but never set to a non-zero value,
> meaning that force_const_mem is never called.
> This functionality used to work way back in 2003, but has been lost
> over time:
> https://gcc.gnu.org/pipermail/gcc-patches/2003-October/116435.html
>
> The changes to cse.c below restore this functionality (placing suitable
> constants in the constant pool) with two significant refinements:
> (i) it only attempts to do this if the function already uses a constant
> pool (thanks to the availability of crtl->uses_const_pool since 2003);
> (ii) it allows different constants (i.e. modes) to have different costs,
> so that floating point "doubles" and 64-bit, 128-bit, 256-bit and 512-bit
> vectors don't all share the same cost.  Back in 2003, the assumption was
> that everything in a constant pool had the same cost, hence the global
> variable constant_pool_entries_cost.
>
> Although this is a useful CSE fix, it turns out that it doesn't cure my
> motivating problem above.  CSE only considers a single instruction, so
> it determines that it's cheaper to perform the ss_plus (COSTS_N_INSNS(1))
> than to read the result from the constant pool (COSTS_N_INSNS(2)).  It's
> only when the other reads from the constant pool are also eliminated
> that this transformation is a win.  Hence a better place to perform
> this transformation is in combine, where after failing to "recog" the
> load of a suitable constant, it can retry after calling force_const_mem.
> This achieves the desired transformation and allows the backend insn_cost
> callback to control whether or not using the constant pool is preferable.
>
> Alas, it's rare to change code generation without affecting something in
> GCC's testsuite.  On x86_64-pc-linux-gnu there were two families of new
> failures (and I'd predict similar benign fallout on other platforms).
> One failure was gcc.target/i386/387-12.c (aka PR target/26915), where
> the test is missing an explicit -m32 flag.
> On i686, it's very reasonable to materialize -1.0 using "fld1; fchs",
> but on x86_64-pc-linux-gnu we currently generate the awkward:
>
> testm1:	fld1
> 	fchs
> 	fstpl	-8(%rsp)
> 	movsd	-8(%rsp), %xmm0
> 	ret
>
> which combine now very reasonably simplifies to just:
>
> testm1:	movsd	.LC3(%rip), %xmm0
> 	ret
>
> The other class of x86_64-pc-linux-gnu failure was from materialization
> of vector constants using vpbroadcast (e.g. gcc.target/i386/pr90773-17.c)
> where the decision is finely balanced; the load of an integer register
> with an immediate constant, followed by a vpbroadcast, is deemed to be
> COSTS_N_INSNS(2), whereas a load from the constant pool is also reported
> as COSTS_N_INSNS(2).  My solution is to tweak i386.c's rtx_costs so
> that, all other things being equal, an instruction (sequence) that
> accesses memory is fractionally more expensive than one that doesn't.
>
> Hopefully, this all makes sense.  If someone could benchmark this for
> me, that would be much appreciated.  This patch has been tested on
> x86_64-pc-linux-gnu with "make bootstrap" and "make -k check" with no
> new failures.  Ok for mainline?
>
>
> 2021-10-03  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
> 	* combine.c (recog_for_combine): For an unrecognized move/set of
> 	a constant, try force_const_mem to place it in the constant pool.
> 	* cse.c (constant_pool_entries_cost, constant_pool_entries_regcost):
> 	Delete global variables (that are no longer assigned a cost value).
> 	(cse_insn): Simplify logic for deciding whether to place a folded
> 	constant in the constant pool using force_const_mem.
> 	(cse_main): Remove zero initialization of constant_pool_entries_cost
> 	and constant_pool_entries_regcost.
>
> 	* config/i386/i386.c (ix86_rtx_costs): Make memory accesses
> 	fractionally more expensive, when optimizing for speed.
>
> gcc/testsuite/ChangeLog
> 	* gcc.target/i386/387-12.c: Add explicit -m32 option.

Should it use

  /* { dg-do compile { target ia32 } } */

instead?

> Roger
> --
On 10/3/2021 8:26 AM, Roger Sayle wrote:
> My recent attempts to come up with a testcase for my patch to evaluate
> ss_plus in simplify-rtx.c identified a missed optimization opportunity
> (that's potentially a long-standing regression): the RTL optimizers no
> longer place constants in the constant pool.
>
> [the rest of the patch description and ChangeLog, quoted in full in the
> previous message, snipped]

OK.

I'll have to keep this one in mind as we continue evaluating performance
of our internal port :-)  We're mostly focused on gcc-11 right now, but
will likely start moving focus to the gcc-12 bits in the reasonably near
future.

Jeff
diff --git a/gcc/combine.c b/gcc/combine.c
index 892c834..03e9a78 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -11567,7 +11567,27 @@ recog_for_combine (rtx *pnewpat, rtx_insn *insn, rtx *pnotes)
   bool changed = false;
 
   if (GET_CODE (pat) == SET)
-    changed = change_zero_ext (pat);
+    {
+      /* For an unrecognized single set of a constant, try placing it in
+	 the constant pool, if this function already uses one.  */
+      rtx src = SET_SRC (pat);
+      if (CONSTANT_P (src)
+	  && !CONST_INT_P (src)
+	  && crtl->uses_const_pool)
+	{
+	  machine_mode mode = GET_MODE (src);
+	  if (mode == VOIDmode)
+	    mode = GET_MODE (SET_DEST (pat));
+	  src = force_const_mem (mode, src);
+	  if (src)
+	    {
+	      SUBST (SET_SRC (pat), src);
+	      changed = true;
+	    }
+	}
+      else
+	changed = change_zero_ext (pat);
+    }
   else if (GET_CODE (pat) == PARALLEL)
     {
       int i;
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ba89e11..cd0e38d 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -20799,6 +20799,13 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
       *total = cost->sse_op;
       return true;
 
+    case MEM:
+      /* An insn that accesses memory is slightly more expensive
+	 than one that does not.  */
+      if (speed)
+	*total += 1;
+      return false;
+
     default:
       return false;
     }
diff --git a/gcc/cse.c b/gcc/cse.c
index 330c1e9..4c3988e 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -491,14 +491,6 @@ static struct table_elt *table[HASH_SIZE];
 
 static struct table_elt *free_element_chain;
 
-/* Set to the cost of a constant pool reference if one was found for a
-   symbolic constant.  If this was found, it means we should try to
-   convert constants into constant pool entries if they don't fit in
-   the insn.  */
-
-static int constant_pool_entries_cost;
-static int constant_pool_entries_regcost;
-
 /* Trace a patch through the CFG.  */
 
 struct branch_path
@@ -4609,9 +4601,6 @@ cse_insn (rtx_insn *insn)
   int src_folded_regcost = MAX_COST;
   int src_related_regcost = MAX_COST;
   int src_elt_regcost = MAX_COST;
-  /* Set nonzero if we need to call force_const_mem on with the
-     contents of src_folded before using it.  */
-  int src_folded_force_flag = 0;
   scalar_int_mode int_mode;
 
   dest = SET_DEST (sets[i].rtl);
@@ -5166,15 +5155,7 @@ cse_insn (rtx_insn *insn)
 		      src_related_cost, src_related_regcost) <= 0
 	  && preferable (src_folded_cost, src_folded_regcost,
 			 src_elt_cost, src_elt_regcost) <= 0)
-	{
-	  trial = src_folded, src_folded_cost = MAX_COST;
-	  if (src_folded_force_flag)
-	    {
-	      rtx forced = force_const_mem (mode, trial);
-	      if (forced)
-		trial = forced;
-	    }
-	}
+	trial = src_folded, src_folded_cost = MAX_COST;
       else if (src
 	       && preferable (src_cost, src_regcost,
 			      src_eqv_cost, src_eqv_regcost) <= 0
@@ -5361,23 +5342,24 @@ cse_insn (rtx_insn *insn)
 	      break;
 	    }
 
-	  /* If we previously found constant pool entries for
-	     constants and this is a constant, try making a
-	     pool entry.  Put it in src_folded unless we already have done
-	     this since that is where it likely came from.  */
+	  /* If the current function uses a constant pool and this is a
+	     constant, try making a pool entry.  Put it in src_folded
+	     unless we already have done this since that is where it
+	     likely came from.  */
 
-	  else if (constant_pool_entries_cost
+	  else if (crtl->uses_const_pool
 		   && CONSTANT_P (trial)
-		   && (src_folded == 0
-		       || (!MEM_P (src_folded)
-			   && ! src_folded_force_flag))
+		   && !CONST_INT_P (trial)
+		   && (src_folded == 0 || !MEM_P (src_folded))
 		   && GET_MODE_CLASS (mode) != MODE_CC
 		   && mode != VOIDmode)
 	    {
-	      src_folded_force_flag = 1;
-	      src_folded = trial;
-	      src_folded_cost = constant_pool_entries_cost;
-	      src_folded_regcost = constant_pool_entries_regcost;
+	      src_folded = force_const_mem (mode, trial);
+	      if (src_folded)
+		{
+		  src_folded_cost = COST (src_folded, mode);
+		  src_folded_regcost = approx_reg_cost (src_folded);
+		}
 	    }
 	}
@@ -6630,8 +6612,6 @@ cse_main (rtx_insn *f ATTRIBUTE_UNUSED, int nregs)
   cse_cfg_altered = false;
   cse_jumps_altered = false;
   recorded_label_ref = false;
-  constant_pool_entries_cost = 0;
-  constant_pool_entries_regcost = 0;
   ebb_data.path_size = 0;
   ebb_data.nsets = 0;
   rtl_hooks = cse_rtl_hooks;
diff --git a/gcc/testsuite/gcc.target/i386/387-12.c b/gcc/testsuite/gcc.target/i386/387-12.c
index 62c1d48..7fe50a2 100644
--- a/gcc/testsuite/gcc.target/i386/387-12.c
+++ b/gcc/testsuite/gcc.target/i386/387-12.c
@@ -1,6 +1,6 @@
 /* PR target/26915 */
 /* { dg-do compile } */
-/* { dg-options "-O -mfpmath=387 -mfancy-math-387" } */
+/* { dg-options "-O -m32 -mfpmath=387 -mfancy-math-387" } */
 
 double testm0(void)
 {