Message ID | alpine.DEB.2.20.2311171501110.5892@tpp.orcam.me.uk |
---|---|
State | New |
Headers | show |
Series | RISC-V: Various if-conversion fixes and improvements | expand |
On 11/18/23 22:36, Maciej W. Rozycki wrote: > The generic branch costing model for if-conversion assumes a fixed cost > of COSTS_N_INSNS (2) for a conditional branch, and that one half of that > cost comes from a preceding condition-set instruction, such as with > MODE_CC targets, and then the other half of that cost is for the actual > branch instruction. This is hardcoded for `if_info.original_cost' in > `noce_find_if_block' and regardless of the cost set for branches via > BRANCH_COST. > > Then `default_max_noce_ifcvt_seq_cost' instructs if-conversion to prefer > a branchless sequence as costly as high as triple the BRANCH_COST value > set. This is apparently to make up for the inability to accurately > guess the branch penalty. > > Consequently for the BRANCH_COST of 3 we commonly set for tuning, > if-conversion will consider branchless sequences costing 3 * 3 - 2 = 7 > instruction units more than a corresponding branch sequence. For the > BRANCH_COST of 4 such as with `sifive-7-series' tuning this is even > worse, at 3 * 4 - 2 = 10. Effectively it means a branchless sequence > will always be chosen if available, even a very inefficient one. > > Rework the branch costing model to better match our architecture, > observing in particular that we have no preparatory instructions for > branches so that the cost of a branch is naked BRANCH_COST plus any > extra overhead the processing of a branch's source RTX might incur. > > Provide TARGET_INSN_COST and TARGET_MAX_NOCE_IFCVT_SEQ_COST handlers > than that return suitable cost based on BRANCH_COST. The latter hook > usually returns a value that is lower than the cost of the corresponding > branched sequence. This is because we don't really want to produce a > branchless sequence that is more expensive than the original branched > sequence. If this turns out too conservative for some corner case, then > this choice might be revisited. > > Then we don't want to fiddle with `noce_find_if_block' without a lot of > cross-target verification, so add TARGET_NOCE_CONVERSION_PROFITABLE_P > defined such that it subtracts the fixed COSTS_N_INSNS (2) cost from the > cost of the original branched sequence supplied and instead adds actual > branch cost calculated from the conditional branch instruction used. It > is then further tweaked according to simple analysis of the replacement > branchless sequence produced so as to cancel the cost of an extraneous > zero extend operation produced by `noce_try_store_flag_mask' as observed > with gcc/testsuite/gcc.target/riscv/pr105314.c. > > Tweak the testsuite accordingly and set `-mbranch-cost=' explicitly for > the relevant cases so that the expected if-conversion transformation is > made regardless of the default BRANCH_COST value of tuning in effect. > Some of these settings will be lowered later on as deficiencies in > branchless sequence generation have been fixed that lower their cost > calculated by if-conversion. As I suspect you know a big part of the problem here is that BRANCH_COST and rtx_cost don't have any common scale and thus trying to compare BRANCH_COST to RTX_COST doesn't have well defined meaning. That hasn't kept us from trying to do precisely that and the result has always been less than satisfactory. You're introducing more, but I don't think there's a reasonable way out of this mess at this point. > > gcc/ > * config/riscv/riscv.cc (riscv_insn_cost): New function. > (riscv_max_noce_ifcvt_seq_cost): Likewise. > (riscv_noce_conversion_profitable_p): Likewise. > (TARGET_INSN_COST): New macro. > (TARGET_MAX_NOCE_IFCVT_SEQ_COST): New macro. > (TARGET_NOCE_CONVERSION_PROFITABLE_P: New macro. > > gcc/testsuite/ > * gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_imm_imm.c: > Explicitly set the branch cost. > * gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_imm_reg.c: > Likewise. > * gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_reg_reg.c: > Likewise. > * gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_imm_imm.c: > Likewise. > * gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_imm_reg.c: > Likewise. > * gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_reg_reg.c: > Likewise. > --- > FWIW I don't understand why the test cases absolutely HAD to have such > overlong names guaranteed to exceed our 80 column limit in any context. > It's such a pain to handle. I dislike the long names as well. I nearly changed them myself as part of the eswin submission, but that seemed a bit gratituous to me so I left them as-is. If you wanted to rename them, be my guest, consider it pre-approved ;-) WRT the extraneous zero-extension. Isn't that arguably a bug in the scc expander for risc-v? Fixing that isn't a prerequisite here, but it probably worth a bit of someone's time. OK for the trunk. jeff
On Sun, 19 Nov 2023, Jeff Law wrote: > As I suspect you know a big part of the problem here is that BRANCH_COST and > rtx_cost don't have any common scale and thus trying to compare BRANCH_COST to > RTX_COST doesn't have well defined meaning. We do have preexisting places using COSTS_N_INSNS (BRANCH_COST ()) though, as documented in ifcvt.cc: ??? Actually, instead of the branch instruction costs we might want to use COSTS_N_INSNS (BRANCH_COST ()) as in other places. */ so it seems the right direction, and given that we expose this measure to the user (and at the very least GCC developers implementing new tuning microarchitectures) I think it's the only sane way to do branch costing: define the measure in terms of how many ordinary ALU instructions a branch is statistically equivalent to. We may have to consider whether we want/need higher resolution here, effectively branch costs including a fractional part. > That hasn't kept us from trying to do precisely that and the result has always > been less than satisfactory. You're introducing more, but I don't think > there's a reasonable way out of this mess at this point. Ack. > > FWIW I don't understand why the test cases absolutely HAD to have such > > overlong names guaranteed to exceed our 80 column limit in any context. > > It's such a pain to handle. > I dislike the long names as well. I nearly changed them myself as part of the > eswin submission, but that seemed a bit gratituous to me so I left them as-is. > > If you wanted to rename them, be my guest, consider it pre-approved ;-) Ack. > WRT the extraneous zero-extension. Isn't that arguably a bug in the scc > expander for risc-v? Fixing that isn't a prerequisite here, but it probably > worth a bit of someone's time. I've looked at it already and it's the middle end that ends up with the zero-extension, specifically `convert_move' invoked from `emit_cstore' down the call to `noce_try_store_flag_mask', to widen the output from `cstoredi4', so I don't think we can do anything in the backend to prevent it from happening. And neither I think we can do anything useful about `cstoredi4' having a SImode output, as it's a pattern matched by name rather than RTX, so we can't provide variants having a SImode and a DImode output each both at a time, as that would cause a name clash. What we could do is adding new named patterns with the output mode spelt out explicitly, such as `cstoresidi4', `cstoredidi4', etc., but that would be quite a big rewrite. Maybe worth doing, maybe not, I don't know. For the time being the hack I have made does the trick. Maciej
On 11/23/23 11:34, Maciej W. Rozycki wrote: > On Sun, 19 Nov 2023, Jeff Law wrote: > >> As I suspect you know a big part of the problem here is that BRANCH_COST and >> rtx_cost don't have any common scale and thus trying to compare BRANCH_COST to >> RTX_COST doesn't have well defined meaning. > > We do have preexisting places using COSTS_N_INSNS (BRANCH_COST ()) > though, as documented in ifcvt.cc: > > ??? Actually, instead of the branch instruction costs we might want > to use COSTS_N_INSNS (BRANCH_COST ()) as in other places. */ > > so it seems the right direction, and given that we expose this measure to > the user (and at the very least GCC developers implementing new tuning > microarchitectures) I think it's the only sane way to do branch costing: > define the measure in terms of how many ordinary ALU instructions a branch > is statistically equivalent to. Oh, it probably is the only sane way at this point. To do anything more sane we'd have to disassociate the BRANCH_COST uses during tree/gimple from those in RTL-land. FWIW, I was looking at a regression with our internal tests after your changes. It was quite nice to see how well twiddling -mbranch-cost correlated to how many instructions we would allow in a conditional move sequence. The downside is it highlighted the gimple vs RTL use issue. I'm confident that we would like to see a higher branch cost in the RTL phases for our uarch, but I'm much less comfortable with how that's going to change the decisions made in trees/gimple. We'll have to investigate that at some depth. > >> WRT the extraneous zero-extension. Isn't that arguably a bug in the scc >> expander for risc-v? Fixing that isn't a prerequisite here, but it probably >> worth a bit of someone's time. > > I've looked at it already and it's the middle end that ends up with the > zero-extension, specifically `convert_move' invoked from `emit_cstore' > down the call to `noce_try_store_flag_mask', to widen the output from > `cstoredi4', so I don't think we can do anything in the backend to prevent > it from happening. And neither I think we can do anything useful about > `cstoredi4' having a SImode output, as it's a pattern matched by name > rather than RTX, so we can't provide variants having a SImode and a DImode > output each both at a time, as that would cause a name clash. We're actually tracking some of these extraneous extensions. Do you happen to know if the zero-extended object happens to be (subreg:SI (reg:DI)) kind of construct? That's the kind of thing we're chasing down right now from various points. Vineet has already fixed one class of them. Jivan and I are looking at others. jeff
On Tue, 28 Nov 2023, Jeff Law wrote: > FWIW, I was looking at a regression with our internal tests after your > changes. It was quite nice to see how well twiddling -mbranch-cost > correlated to how many instructions we would allow in a conditional move > sequence. I'm a bit concerned though that our interpretation of `-mbranch-cost=0' is different from the middle end's, such as in `emit_store_flag': /* If we reached here, we can't do this with a scc insn, however there are some comparisons that can be done in other ways. Don't do any of these cases if branches are very cheap. */ if (BRANCH_COST (optimize_insn_for_speed_p (), false) == 0) return 0; > The downside is it highlighted the gimple vs RTL use issue. I'm confident > that we would like to see a higher branch cost in the RTL phases for our > uarch, but I'm much less comfortable with how that's going to change the > decisions made in trees/gimple. We'll have to investigate that at some depth. Ack. > > I've looked at it already and it's the middle end that ends up with the > > zero-extension, specifically `convert_move' invoked from `emit_cstore' > > down the call to `noce_try_store_flag_mask', to widen the output from > > `cstoredi4', so I don't think we can do anything in the backend to prevent > > it from happening. And neither I think we can do anything useful about > > `cstoredi4' having a SImode output, as it's a pattern matched by name > > rather than RTX, so we can't provide variants having a SImode and a DImode > > output each both at a time, as that would cause a name clash. > We're actually tracking some of these extraneous extensions. Do you happen to > know if the zero-extended object happens to be (subreg:SI (reg:DI)) kind of > construct? That's the kind of thing we're chasing down right now from various > points. Vineet has already fixed one class of them. Jivan and I are looking > at others. Under GDB it's a plain move from (reg:SI 140) to (reg:DI 139), as in the FROM and TO arguments to `convert_move' respectively. This makes it call `convert_mode_scalar', which then chooses between `zext_optab' and `sext_optab' as appropriate, under: /* If the target has a converter from FROM_MODE to TO_MODE, use it. */ to produce: (set (reg:DI 139) (zero_extend:DI (reg:SI 140))) ending up with this complete sequence: (insn 27 0 28 (set (reg:SI 140) (eq:SI (reg/v:DI 137 [ c ]) (const_int 0 [0]))) -1 (nil)) (insn 28 27 29 (set (reg:DI 139) (zero_extend:DI (reg:SI 140))) -1 (nil)) (insn 29 28 30 (set (reg:DI 141) (neg:DI (reg:DI 139))) -1 (nil)) (insn 30 29 0 (set (reg/v:DI 134 [ <retval> ]) (and:DI (reg/v:DI 135 [ a ]) (reg:DI 141))) -1 (nil)) passed to `targetm.noce_conversion_profitable_p' right away. Maybe you can teach `emit_cstore' or `convert_move' to use a subreg when it is known for the particular target that the value produced by the conditional-set machine instruction emitted by `cstoreMODE4' is valid unchanged in both modes. You can fiddle with it by trying: $ gcc -march=rv64gc -mbranch-cost=3 -O2 -S gcc/testsuite/gcc.target/riscv/pr105314.c Set a breakpoint at `noce_try_store_flag_mask' and then single-step to see how things proceed. Maciej
Index: gcc/gcc/config/riscv/riscv.cc =================================================================== --- gcc.orig/gcc/config/riscv/riscv.cc +++ gcc/gcc/config/riscv/riscv.cc @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3. #include "calls.h" #include "function.h" #include "explow.h" +#include "ifcvt.h" #include "memmodel.h" #include "emit-rtl.h" #include "reload.h" @@ -3295,6 +3296,118 @@ riscv_address_cost (rtx addr, machine_mo return riscv_address_insns (addr, mode, false); } +/* Implement TARGET_INSN_COST. We factor in the branch cost in the cost + calculation for conditional branches: one unit is considered the cost + of microarchitecture-dependent actual branch execution and therefore + multiplied by BRANCH_COST and any remaining units are considered fixed + branch overhead. */ + +static int +riscv_insn_cost (rtx_insn *insn, bool speed) +{ + rtx x = PATTERN (insn); + int cost = pattern_cost (x, speed); + + if (JUMP_P (insn) + && GET_CODE (x) == SET + && GET_CODE (SET_DEST (x)) == PC + && GET_CODE (SET_SRC (x)) == IF_THEN_ELSE) + cost += COSTS_N_INSNS (BRANCH_COST (speed, false) - 1); + return cost; +} + +/* Implement TARGET_MAX_NOCE_IFCVT_SEQ_COST. Like the default implementation, + but we consider cost units of branch instructions equal to cost units of + other instructions. */ + +static unsigned int +riscv_max_noce_ifcvt_seq_cost (edge e) +{ + bool predictable_p = predictable_edge_p (e); + + if (predictable_p) + { + if (OPTION_SET_P (param_max_rtl_if_conversion_predictable_cost)) + return param_max_rtl_if_conversion_predictable_cost; + } + else + { + if (OPTION_SET_P (param_max_rtl_if_conversion_unpredictable_cost)) + return param_max_rtl_if_conversion_unpredictable_cost; + } + + return COSTS_N_INSNS (BRANCH_COST (true, predictable_p)); +} + +/* Implement TARGET_NOCE_CONVERSION_PROFITABLE_P. We replace the cost of a + conditional branch assumed by `noce_find_if_block' at `COSTS_N_INSNS (2)' + by our actual conditional branch cost, observing that our branches test + conditions directly, so there is no preparatory extra condition-set + instruction. */ + +static bool +riscv_noce_conversion_profitable_p (rtx_insn *seq, + struct noce_if_info *if_info) +{ + struct noce_if_info riscv_if_info = *if_info; + + riscv_if_info.original_cost -= COSTS_N_INSNS (2); + riscv_if_info.original_cost += insn_cost (if_info->jump, if_info->speed_p); + + /* Hack alert! When `noce_try_store_flag_mask' uses `cstore<mode>4' + to emit a conditional set operation on DImode output it comes up + with a sequence such as: + + (insn 26 0 27 (set (reg:SI 140) + (eq:SI (reg/v:DI 137 [ c ]) + (const_int 0 [0]))) 302 {*seq_zero_disi} + (nil)) + (insn 27 26 28 (set (reg:DI 139) + (zero_extend:DI (reg:SI 140))) 116 {*zero_extendsidi2_internal} + (nil)) + + because our `cstore<mode>4' pattern expands to an insn that gives + a SImode output. The output of conditional set is 0 or 1 boolean, + so it is valid for input in any scalar integer mode and therefore + combine later folds the zero extend operation into an equivalent + conditional set operation that produces a DImode output, however + this redundant zero extend operation counts towards the cost of + the replacement sequence. Compensate for that by incrementing the + cost of the original sequence as well as the maximum sequence cost + accordingly. */ + rtx last_dest = NULL_RTX; + for (rtx_insn *insn = seq; insn; insn = NEXT_INSN (insn)) + { + if (!NONDEBUG_INSN_P (insn)) + continue; + + rtx x = PATTERN (insn); + if (NONJUMP_INSN_P (insn) + && GET_CODE (x) == SET) + { + rtx src = SET_SRC (x); + if (last_dest != NULL_RTX + && GET_CODE (src) == ZERO_EXTEND + && REG_P (XEXP (src, 0)) + && REGNO (XEXP (src, 0)) == REGNO (last_dest)) + { + riscv_if_info.original_cost += COSTS_N_INSNS (1); + riscv_if_info.max_seq_cost += COSTS_N_INSNS (1); + } + last_dest = NULL_RTX; + rtx dest = SET_DEST (x); + if (COMPARISON_P (src) + && REG_P (dest) + && GET_MODE (dest) == SImode) + last_dest = dest; + } + else + last_dest = NULL_RTX; + } + + return default_noce_conversion_profitable_p (seq, &riscv_if_info); +} + /* Return one word of double-word value OP. HIGH_P is true to select the high part or false to select the low part. */ @@ -9823,6 +9936,13 @@ riscv_preferred_else_value (unsigned ifn #define TARGET_RTX_COSTS riscv_rtx_costs #undef TARGET_ADDRESS_COST #define TARGET_ADDRESS_COST riscv_address_cost +#undef TARGET_INSN_COST +#define TARGET_INSN_COST riscv_insn_cost + +#undef TARGET_MAX_NOCE_IFCVT_SEQ_COST +#define TARGET_MAX_NOCE_IFCVT_SEQ_COST riscv_max_noce_ifcvt_seq_cost +#undef TARGET_NOCE_CONVERSION_PROFITABLE_P +#define TARGET_NOCE_CONVERSION_PROFITABLE_P riscv_noce_conversion_profitable_p #undef TARGET_ASM_FILE_START #define TARGET_ASM_FILE_START riscv_file_start Index: gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_imm_imm.c =================================================================== --- gcc.orig/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_imm_imm.c +++ gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_imm_imm.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ -/* { dg-options "-march=rv64gc_zicond -mabi=lp64d" { target { rv64 } } } */ -/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f" { target { rv32 } } } */ +/* { dg-options "-march=rv64gc_zicond -mabi=lp64d -mbranch-cost=4" { target { rv64 } } } */ +/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f -mbranch-cost=4" { target { rv32 } } } */ /* { dg-skip-if "" { *-*-* } {"-O0" "-Og" "-Os" "-Oz"} } */ long primitiveSemantics_compare_imm_return_imm_imm_00(long a, long b) { Index: gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_imm_reg.c =================================================================== --- gcc.orig/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_imm_reg.c +++ gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_imm_reg.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ -/* { dg-options "-march=rv64gc_zicond -mabi=lp64d" { target { rv64 } } } */ -/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f" { target { rv32 } } } */ +/* { dg-options "-march=rv64gc_zicond -mabi=lp64d -mbranch-cost=4" { target { rv64 } } } */ +/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f -mbranch-cost=4" { target { rv32 } } } */ /* { dg-skip-if "" { *-*-* } {"-O0" "-Og" "-Os" "-Oz"} } */ long primitiveSemantics_compare_imm_return_imm_reg_00(long a, long b) { Index: gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_reg_reg.c =================================================================== --- gcc.orig/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_reg_reg.c +++ gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_imm_return_reg_reg.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ -/* { dg-options "-march=rv64gc_zicond -mabi=lp64d" { target { rv64 } } } */ -/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f" { target { rv32 } } } */ +/* { dg-options "-march=rv64gc_zicond -mabi=lp64d -mbranch-cost=5" { target { rv64 } } } */ +/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f -mbranch-cost=5" { target { rv32 } } } */ /* { dg-skip-if "" { *-*-* } {"-O0" "-Og" "-Os" "-Oz"} } */ long primitiveSemantics_compare_imm_return_reg_reg_00(long a, long b, long c) { Index: gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_imm_imm.c =================================================================== --- gcc.orig/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_imm_imm.c +++ gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_imm_imm.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ -/* { dg-options "-march=rv64gc_zicond -mabi=lp64d" { target { rv64 } } } */ -/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f" { target { rv32 } } } */ +/* { dg-options "-march=rv64gc_zicond -mabi=lp64d -mbranch-cost=4" { target { rv64 } } } */ +/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f -mbranch-cost=4" { target { rv32 } } } */ /* { dg-skip-if "" { *-*-* } {"-O0" "-Og" "-Os" "-Oz"} } */ long primitiveSemantics_compare_reg_return_imm_imm_00(long a, long b, long c) { Index: gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_imm_reg.c =================================================================== --- gcc.orig/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_imm_reg.c +++ gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_imm_reg.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ -/* { dg-options "-march=rv64gc_zicond -mabi=lp64d" { target { rv64 } } } */ -/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f" { target { rv32 } } } */ +/* { dg-options "-march=rv64gc_zicond -mabi=lp64d -mbranch-cost=4" { target { rv64 } } } */ +/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f -mbranch-cost=4" { target { rv32 } } } */ /* { dg-skip-if "" { *-*-* } {"-O0" "-Og" "-Os" "-Oz"} } */ long primitiveSemantics_compare_reg_return_imm_reg_00(long a, long b, long c) { Index: gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_reg_reg.c =================================================================== --- gcc.orig/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_reg_reg.c +++ gcc/gcc/testsuite/gcc.target/riscv/zicond-primitiveSemantics_compare_reg_return_reg_reg.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ -/* { dg-options "-march=rv64gc_zicond -mabi=lp64d" { target { rv64 } } } */ -/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f" { target { rv32 } } } */ +/* { dg-options "-march=rv64gc_zicond -mabi=lp64d -mbranch-cost=5" { target { rv64 } } } */ +/* { dg-options "-march=rv32gc_zicond -mabi=ilp32f -mbranch-cost=5" { target { rv32 } } } */ /* { dg-skip-if "" { *-*-* } {"-O0" "-Og" "-Os" "-Oz"} } */ long primitiveSemantics_compare_reg_return_reg_reg_00(long a, long b, long c,