Message ID | 6f3a6c31-5a02-86bb-a659-13f541350616@linux.ibm.com
---|---
State | New
Series | [RFC] SHIFT_COUNT_TRUNCATED and shift_truncation_mask
On 5/9/19 5:52 AM, Robin Dapp wrote:
> Hi,
>
> while trying to improve s390 code generation for rotate and shift I
> noticed superfluous subregs for shift count operands. In our backend we
> already have quite cumbersome patterns that would need to be duplicated
> (or complicated further by more subst patterns) in order to get rid of
> the subregs.
>
> I had already finished all the patterns when I realized that
> SHIFT_COUNT_TRUNCATED and the target hook shift_truncation_mask already
> exist and could do what is needed without extra patterns. Just defining
> shift_truncation_mask was not enough though as most of the additional
> insns get introduced by combine.
>
> Even SHIFT_COUNT_TRUNCATED is no perfect match to what our hardware
> does because we always only consider the last 6 bits of a shift operand.
>
> Despite all the warnings in the other backends, most notably
> SHIFT_COUNT_TRUNCATED being "discouraged" as mentioned in riscv.h, I
> wrote the attached tentative patch. It's a little ad-hoc, uses the
> SHIFT_COUNT_TRUNCATED paths only if shift_truncation_mask () != 0 and,
> instead of truncating via & (GET_MODE_BITSIZE (mode) - 1), it applies
> the mask returned by shift_truncation_mask. Doing so, usage of both
> "methods" actually reduces to a single way.

The main reason it's discouraged is because some targets have insns
where the count would be truncated and others where it would not. So,
for example, normal shifts might truncate but vector shifts might not
(mips), or shifts might truncate but bit tests do not (x86).

I don't know enough about the s390 architecture to know if there are any
corner cases. You'd have to look at every pattern in your machine
description with a shift and verify that it's going to DTRT if the count
hasn't been truncated.

It would really help if you could provide testcases which show the
suboptimal code and any analysis you've done.

Jeff
[ Please Cc: me on combine patches. Thanks! ]

On Thu, May 09, 2019 at 01:52:58PM +0200, Robin Dapp wrote:
> I had already finished all the patterns when I realized that
> SHIFT_COUNT_TRUNCATED and the target hook shift_truncation_mask already
> exist and could do what is needed without extra patterns. Just defining
> shift_truncation_mask was not enough though as most of the additional
> insns get introduced by combine.

SHIFT_COUNT_TRUNCATED says that *all* of your (variable) shifts truncate
the shift count in a particular way. targetm.shift_truncation_mask says
that *all* of your variable shifts in a particular mode are done in a
particular way. Neither is true for some (many?) targets.

For example, on Power, if done in the integer registers, simple shifts
use one bit more than the datum rotated (this works for rotates as well,
of course). In the vector registers, you get a mask of just the size of
the datum, however. If it is not a simple shift but a shift-and-mask,
*nothing* works: for example, a DImode shift-and-mask with certain
arguments is actually implemented with a 32-bit instruction (but many
use a 64-bit instruction).

It really depends on the machine instruction what the behaviour is and
what the mask has to be; for that you need to know at a minimum the
icode, but many patterns have different machine insns in their
alternatives, too.

> While the attached patch might probably work for s390, it will probably
> not for other targets. In addition to what my patch does, would it be
> useful to unify both truncation methods in a target hook that takes the
> operation (shift, rotate, zero_extract, ...) as well as the mode as
> arguments? Thus, we would let the target decide what to do with the
> specific combination of both. Maybe this would also allow to
> distinguish bit test operations from the rest.

Will this be useful for many targets? That's the question for all new
hooks that aren't needed, that are just a nicety. It's a cost for
everyone, not just your target, which is a bad tradeoff if it doesn't
help enough targets, and help them enough.

> @@ -12295,7 +12293,7 @@ simplify_comparison (enum rtx_code code, rtx *pop0, rtx *pop1)
>         between the position and the location of the single bit.  */
>      /* Except we can't if SHIFT_COUNT_TRUNCATED is set, since we might
>         have already reduced the shift count modulo the word size.  */
> -  if (!SHIFT_COUNT_TRUNCATED
> +  if ((!SHIFT_COUNT_TRUNCATED || !targetm.shift_truncation_mask (mode))

Please make the (default) hook implementation use the macro, instead?

> diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
> index ad8eacdf4dc..1d723f29e1e 100644
> --- a/gcc/config/s390/s390.c
> +++ b/gcc/config/s390/s390.c
> @@ -2320,6 +2320,7 @@ s390_single_part (rtx op,
>              part = i;
>          }
>      }
> +

No spurious whitespace changes please.

There is a lot that can be improved or should be fixed, but let's not do
that one by one, and not in random patches.

Segher
On Thu, May 09, 2019 at 09:59:55AM -0600, Jeff Law wrote:
> The main reason it's discouraged is because some targets have insns
> where the count would be truncated and others where it would not. So,
> for example, normal shifts might truncate but vector shifts might not
> (mips), or shifts might truncate but bit tests do not (x86).

And don't forget about zero_extract, on targets that are unfortunate
enough to have that.

> I don't know enough about the s390 architecture to know if there are any
> corner cases. You'd have to look at every pattern in your machine
> description with a shift and verify that it's going to DTRT if the count
> hasn't been truncated.

Yeah.

> It would really help if you could provide testcases which show the
> suboptimal code and any analysis you've done.

I'd love to see this, too!

Segher
On Thu, May 9, 2019 at 6:00 PM Jeff Law <law@redhat.com> wrote:
>
> On 5/9/19 5:52 AM, Robin Dapp wrote:
> > Hi,
> >
> > while trying to improve s390 code generation for rotate and shift I
> > noticed superfluous subregs for shift count operands. In our backend we
> > already have quite cumbersome patterns that would need to be duplicated
> > (or complicated further by more subst patterns) in order to get rid of
> > the subregs.
> >
> > I had already finished all the patterns when I realized that
> > SHIFT_COUNT_TRUNCATED and the target hook shift_truncation_mask already
> > exist and could do what is needed without extra patterns. Just defining
> > shift_truncation_mask was not enough though as most of the additional
> > insns get introduced by combine.
> >
> > Even SHIFT_COUNT_TRUNCATED is no perfect match to what our hardware
> > does because we always only consider the last 6 bits of a shift operand.
> >
> > Despite all the warnings in the other backends, most notably
> > SHIFT_COUNT_TRUNCATED being "discouraged" as mentioned in riscv.h, I
> > wrote the attached tentative patch. It's a little ad-hoc, uses the
> > SHIFT_COUNT_TRUNCATED paths only if shift_truncation_mask () != 0 and,
> > instead of truncating via & (GET_MODE_BITSIZE (mode) - 1), it applies
> > the mask returned by shift_truncation_mask. Doing so, usage of both
> > "methods" actually reduces to a single way.
>
> The main reason it's discouraged is because some targets have insns
> where the count would be truncated and others where it would not. So,
> for example, normal shifts might truncate but vector shifts might not
> (mips), or shifts might truncate but bit tests do not (x86).
>
> I don't know enough about the s390 architecture to know if there are any
> corner cases. You'd have to look at every pattern in your machine
> description with a shift and verify that it's going to DTRT if the count
> hasn't been truncated.
>
> It would really help if you could provide testcases which show the
> suboptimal code and any analysis you've done.

The main reason I dislike SHIFT_COUNT_TRUNCATED is that it changes the
meaning of the IL. We generally want these kinds of things to be
explicit.

Richard.

> Jeff
On 5/10/19 1:08 AM, Richard Biener wrote:
> On Thu, May 9, 2019 at 6:00 PM Jeff Law <law@redhat.com> wrote:
>>
>> On 5/9/19 5:52 AM, Robin Dapp wrote:
>>> Hi,
>>>
>>> while trying to improve s390 code generation for rotate and shift I
>>> noticed superfluous subregs for shift count operands. In our backend we
>>> already have quite cumbersome patterns that would need to be duplicated
>>> (or complicated further by more subst patterns) in order to get rid of
>>> the subregs.
>>>
>>> I had already finished all the patterns when I realized that
>>> SHIFT_COUNT_TRUNCATED and the target hook shift_truncation_mask already
>>> exist and could do what is needed without extra patterns. Just defining
>>> shift_truncation_mask was not enough though as most of the additional
>>> insns get introduced by combine.
>>>
>>> Even SHIFT_COUNT_TRUNCATED is no perfect match to what our hardware
>>> does because we always only consider the last 6 bits of a shift operand.
>>>
>>> Despite all the warnings in the other backends, most notably
>>> SHIFT_COUNT_TRUNCATED being "discouraged" as mentioned in riscv.h, I
>>> wrote the attached tentative patch. It's a little ad-hoc, uses the
>>> SHIFT_COUNT_TRUNCATED paths only if shift_truncation_mask () != 0 and,
>>> instead of truncating via & (GET_MODE_BITSIZE (mode) - 1), it applies
>>> the mask returned by shift_truncation_mask. Doing so, usage of both
>>> "methods" actually reduces to a single way.
>>
>> The main reason it's discouraged is because some targets have insns
>> where the count would be truncated and others where it would not. So,
>> for example, normal shifts might truncate but vector shifts might not
>> (mips), or shifts might truncate but bit tests do not (x86).
>>
>> I don't know enough about the s390 architecture to know if there are any
>> corner cases. You'd have to look at every pattern in your machine
>> description with a shift and verify that it's going to DTRT if the count
>> hasn't been truncated.
>>
>> It would really help if you could provide testcases which show the
>> suboptimal code and any analysis you've done.
>
> The main reason I dislike SHIFT_COUNT_TRUNCATED is that it changes the
> meaning of the IL. We generally want these kinds of things to be
> explicit.

Understood and agreed. So I think this argues that Robin should look at
a different approach, and we should start pushing ports away from
SHIFT_COUNT_TRUNCATED a bit more aggressively.

Jeff
> It would really help if you could provide testcases which show the
> suboptimal code and any analysis you've done.

I tried introducing a define_subst pattern that substitutes something
one of two other subst patterns already changed.

The first subst pattern helps remove a superfluous "and" on the shift
count operand by accepting both variants, with and without an "and", for
the insn pattern.

(define_subst "masked_op_subst"
  [(set (match_operand:DSI 0 "" "")
        (shift:DSI (match_operand:DSI 1 "" "")
                   (match_operand:SI 2 "" "")))]
  ""
  [(set (match_dup 0)
        (shift:DSI (match_dup 1)
                   (and:SI (match_dup 2)
                           (match_operand:SI 3 "const_int_6bitset_operand" "jm6"))))])

The second subst helps encode a shift count addition like $r1 + 1 as an
address-style operand 1($r1) that is directly supported by the shift
instruction.

(define_subst "addr_style_op_subst"
  [(set (match_operand:DSI_VI 0 "" "")
        (shift:DSI_VI (match_operand:DSI_VI 1 "" "")
                      (match_operand:SI 2 "" "")))]
  ""
  [(set (match_dup 0)
        (shift:DSI_VI (match_dup 1)
                      (plus:SI (match_operand:SI 2 "register_operand" "a")
                               (match_operand 3 "const_int_operand" "n"))))])

Both of these are also used in combination.

Now, in order to get rid of the subregs in the patterns combine creates,
I would need to be able to do something like

(define_subst "subreg_subst"
  [(set (match_operand:DI 0 "" "")
        (shift:DI (match_operand:DI 1 "" "")
                  (subreg:SI (match_dup:DI 2))))]

where the (match_dup:DI 2) would capture both (and:SI ...) [with the
first argument being either a register or an already substituted
(plus:SI ...)] as well as a simple (plus:SI ...).

As far as I can tell, match_dup:mode can be used to change the mode of
the top-level operation, but the operands will remain the same. For
this, a match_dup_deep or whatever would be useful. I'm pretty sure we
don't want to open this can of worms, though :)

To get rid of this, I explicitly duplicated all three subst combinations,
resulting in a lot of additional code. This is not necessary when the
subregs are eliminated by the middle end via SHIFT_COUNT_TRUNCATED.

Maybe there is a much more obvious way that I missed?

Regards
 Robin
On 5/15/19 8:30 AM, Robin Dapp wrote:
>> It would really help if you could provide testcases which show the
>> suboptimal code and any analysis you've done.
>
> I tried introducing a define_subst pattern that substitutes something
> one of two other subst patterns already changed.
>
> The first subst pattern helps remove a superfluous and on the shift
> count operand by accepting both variants, with and without an and for
> the insn pattern.
>
> (define_subst "masked_op_subst"
>   [(set (match_operand:DSI 0 "" "")
>         (shift:DSI (match_operand:DSI 1 "" "")
>                    (match_operand:SI 2 "" "")))]
>   ""
>   [(set (match_dup 0)
>         (shift:DSI (match_dup 1)
>                    (and:SI (match_dup 2)
>                            (match_operand:SI 3 "const_int_6bitset_operand" "jm6"))))])
>
> The second subst helps encode a shift count addition like $r1 + 1 as
> address style operand 1($r1) that is directly supported by the shift
> instruction.
>
> (define_subst "addr_style_op_subst"
>   [(set (match_operand:DSI_VI 0 "" "")
>         (shift:DSI_VI (match_operand:DSI_VI 1 "" "")
>                       (match_operand:SI 2 "" "")))]
>   ""
>   [(set (match_dup 0)
>         (shift:DSI_VI (match_dup 1)
>                       (plus:SI (match_operand:SI 2 "register_operand" "a")
>                                (match_operand 3 "const_int_operand" "n"))))])
>
> Both of these are also used in combination.
>
> Now, in order to get rid of the subregs in the pattern combine creates,
> I would need to be able to do something like
>
> (define_subst "subreg_subst"
>   [(set (match_operand:DI 0 "" "")
>         (shift:DI (match_operand:DI 1 "" "")
>                   (subreg:SI (match_dup:DI 2))))]
>
> where the (match_dup:DI 2) would capture both (and:SI ...) [with the
> first argument being either a register or an already substituted
> (plus:SI ...)] as well as a simple (plus:SI ...).
>
> As far as I can tell match_dup:mode can be used to change the mode of
> the top-level operation but the operands will remain the same. For
> this, a match_dup_deep or whatever would be useful. I'm pretty sure we
> don't want to open this can of worms, though :)
>
> To get rid of this, I explicitly duplicated all three subst combinations
> resulting in a lot of additional code. This is not necessary when the
> subregs are eliminated by the middle end via SHIFT_COUNT_TRUNCATED.
> Maybe there is a much more obvious way that I missed?

Painful. I doubt exposing the masking during the RTL expansion phase
and hoping the standard optimizers will eliminate it would work better
-- though perhaps if the expanders queried the global range information
and elided the masking when the range of the shift was known to be in
range.

jeff
>> Now, in order to get rid of the subregs in the pattern combine creates,
>> I would need to be able to do something like
>>
>> (define_subst "subreg_subst"
>>   [(set (match_operand:DI 0 "" "")
>>         (shift:DI (match_operand:DI 1 "" "")
>>                   (subreg:SI (match_dup:DI 2))))]
>>
>> where the (match_dup:DI 2) would capture both (and:SI ...) [with the
>> first argument being either a register or an already substituted
>> (plus:SI ...)] as well as a simple (plus:SI ...).
>>
>> As far as I can tell match_dup:mode can be used to change the mode of
>> the top-level operation but the operands will remain the same. For
>> this, a match_dup_deep or whatever would be useful. I'm pretty sure we
>> don't want to open this can of worms, though :)

[..]

> Painful. I doubt exposing the masking during the RTL expansion phase
> and hoping the standard optimizers will eliminate it would work better
> -- though perhaps if the expanders queried the global range information
> and elided the masking when the range of the shift was known to be in
> range.

I went for another approach - creating a dedicated big predicate and a
constraint that capture the full (subreg (and (plus ...))) block. This,
however, led to reload problems: when needing to spill the shift-count
register, it would not be able to reload the full operation including
the and/plus etc. and would just bail out/ICE. I guess that's expected
behavior since the predicate is too generic to handle now.

Andreas suggested a secondary reload for that, but we're not sure
whether we really want that.

Thinking back to the mode-changing "match_dup_deep", I still cannot
imagine how to define this in a sane way, or rather where to stop with
mode changing (e.g. what about mems). What I would need it to do is

  (and:SI (reg:SI) const_int)
    --> (and:DI (reg:DI) const_int)
  (plus:SI (reg:SI) const_int)
    --> (plus:DI (reg:DI) const_int)
  (and:SI (plus:SI (reg:SI) const_int) const_int)
    --> (and:DI (plus:DI (reg:DI) const_int) const_int)

Any other ideas how to achieve that without tons of boilerplate code?

Regards
 Robin
diff --git a/gcc/combine.c b/gcc/combine.c
index 91e32c88c88..d2a659f929b 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -6445,14 +6445,12 @@ combine_simplify_rtx (rtx x, machine_mode op0_mode, int in_dest,
 	return simplify_shift_const (x, code, mode, XEXP (x, 0),
 				     INTVAL (XEXP (x, 1)));
 
-      else if (SHIFT_COUNT_TRUNCATED && !REG_P (XEXP (x, 1)))
+      else if (SHIFT_COUNT_TRUNCATED
+	       && targetm.shift_truncation_mask (mode)
+	       && !REG_P (XEXP (x, 1)))
 	SUBST (XEXP (x, 1),
 	       force_to_mode (XEXP (x, 1), GET_MODE (XEXP (x, 1)),
-			      (HOST_WIDE_INT_1U
-			       << exact_log2 (GET_MODE_UNIT_BITSIZE
-					      (GET_MODE (x))))
-			      - 1,
-			      0));
+			      targetm.shift_truncation_mask (mode), 0));
       break;
 
     default:
@@ -10594,8 +10592,8 @@ simplify_shift_const_1 (enum rtx_code code, machine_mode result_mode,
   /* Make sure and truncate the "natural" shift on the way in.  We don't
      want to do this inside the loop as it makes it more difficult to
      combine shifts.  */
-  if (SHIFT_COUNT_TRUNCATED)
-    orig_count &= GET_MODE_UNIT_BITSIZE (mode) - 1;
+  if (SHIFT_COUNT_TRUNCATED && targetm.shift_truncation_mask (mode))
+    orig_count &= targetm.shift_truncation_mask (mode);
 
   /* If we were given an invalid count, don't do anything except exactly
      what was requested.  */
@@ -12295,7 +12293,7 @@ simplify_comparison (enum rtx_code code, rtx *pop0, rtx *pop1)
 	     between the position and the location of the single bit.  */
 	  /* Except we can't if SHIFT_COUNT_TRUNCATED is set, since we might
 	     have already reduced the shift count modulo the word size.  */
-	  if (!SHIFT_COUNT_TRUNCATED
+	  if ((!SHIFT_COUNT_TRUNCATED || !targetm.shift_truncation_mask (mode))
 	      && CONST_INT_P (XEXP (op0, 0))
 	      && XEXP (op0, 1) == const1_rtx
 	      && equality_comparison_p && const_op == 0
diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
index ad8eacdf4dc..1d723f29e1e 100644
--- a/gcc/config/s390/s390.c
+++ b/gcc/config/s390/s390.c
@@ -2320,6 +2320,7 @@ s390_single_part (rtx op,
 	    part = i;
 	}
     }
+
   return part == -1 ? -1 : n_parts - 1 - part;
 }
 
@@ -2702,6 +2703,7 @@ s390_logical_operator_ok_p (rtx *operands)
   return true;
 }
 
+
 /* Narrow logical operation CODE of memory operand MEMOP with immediate
    operand IMMOP to switch from SS to SI type instructions.  */
@@ -16294,6 +16296,13 @@ s390_case_values_threshold (void)
   return default_case_values_threshold ();
 }
 
+static unsigned HOST_WIDE_INT
+s390_shift_truncation_mask (machine_mode mode)
+{
+  return (mode == DImode || mode == SImode
+	  || mode == HImode || mode == QImode) ? 63 : 0;
+}
+
 /* Initialize GCC target structure.  */
 
 #undef TARGET_ASM_ALIGNED_HI_OP
@@ -16585,6 +16594,9 @@ s390_case_values_threshold (void)
 #undef TARGET_CASE_VALUES_THRESHOLD
 #define TARGET_CASE_VALUES_THRESHOLD s390_case_values_threshold
 
+#undef TARGET_SHIFT_TRUNCATION_MASK
+#define TARGET_SHIFT_TRUNCATION_MASK s390_shift_truncation_mask
+
 /* Use only short displacement, since long displacement is not available for
    the floating point instructions.  */
 #undef TARGET_MAX_ANCHOR_OFFSET
diff --git a/gcc/config/s390/s390.h b/gcc/config/s390/s390.h
index 969f58a2ba0..d85bfcc5e04 100644
--- a/gcc/config/s390/s390.h
+++ b/gcc/config/s390/s390.h
@@ -1188,5 +1188,6 @@ struct GTY(()) machine_function
 
 #define TARGET_INDIRECT_BRANCH_TABLE s390_indirect_branch_table
 
+#define SHIFT_COUNT_TRUNCATED 1
 
 #endif /* S390_H */
diff --git a/gcc/cse.c b/gcc/cse.c
index 6c9cda16a98..f7bf287e9ef 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -3649,10 +3649,11 @@ fold_rtx (rtx x, rtx_insn *insn)
 	      && (INTVAL (const_arg1) >= GET_MODE_UNIT_PRECISION (mode)
 		  || INTVAL (const_arg1) < 0))
 	    {
-	      if (SHIFT_COUNT_TRUNCATED)
+	      if (SHIFT_COUNT_TRUNCATED
+		  && targetm.shift_truncation_mask (mode))
 		canon_const_arg1 = gen_int_shift_amount
 		  (mode, (INTVAL (const_arg1)
-			  & (GET_MODE_UNIT_BITSIZE (mode) - 1)));
+			  & targetm.shift_truncation_mask (mode)));
 	      else
 		break;
 	    }
@@ -3698,10 +3699,11 @@ fold_rtx (rtx x, rtx_insn *insn)
 	      && (INTVAL (inner_const) >= GET_MODE_UNIT_PRECISION (mode)
 		  || INTVAL (inner_const) < 0))
 	    {
-	      if (SHIFT_COUNT_TRUNCATED)
+	      if (SHIFT_COUNT_TRUNCATED
+		  && targetm.shift_truncation_mask (mode))
 		inner_const = gen_int_shift_amount
 		  (mode, (INTVAL (inner_const)
-			  & (GET_MODE_UNIT_BITSIZE (mode) - 1)));
+			  & targetm.shift_truncation_mask (mode)));
 	      else
 		break;
 	    }
diff --git a/gcc/expmed.c b/gcc/expmed.c
index d7f8e9a5d76..d8eebac5f08 100644
--- a/gcc/expmed.c
+++ b/gcc/expmed.c
@@ -2476,11 +2476,14 @@ expand_shift_1 (enum tree_code code, machine_mode mode, rtx shifted,
 			     (unsigned HOST_WIDE_INT) GET_MODE_BITSIZE (scalar_mode)))
 	op1 = gen_int_shift_amount (mode,
 				    (unsigned HOST_WIDE_INT) INTVAL (op1)
-				    % GET_MODE_BITSIZE (scalar_mode));
+				    % targetm.shift_truncation_mask (scalar_mode) + 1);
       else if (GET_CODE (op1) == SUBREG
 	       && subreg_lowpart_p (op1)
 	       && SCALAR_INT_MODE_P (GET_MODE (SUBREG_REG (op1)))
-	       && SCALAR_INT_MODE_P (GET_MODE (op1)))
+	       && SCALAR_INT_MODE_P (GET_MODE (op1))
+	       && targetm.shift_truncation_mask (mode)
+		  == (unsigned HOST_WIDE_INT) GET_MODE_BITSIZE (scalar_mode) - 1
+	       )
 	op1 = SUBREG_REG (op1);
     }
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 89a46a933fa..88064e4ac57 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -3525,9 +3525,10 @@ simplify_binary_operation_1 (enum rtx_code code, machine_mode mode,
 	  return lowpart_subreg (int_mode, tmp, inner_mode);
 	}
 
-      if (SHIFT_COUNT_TRUNCATED && CONST_INT_P (op1))
+      if (SHIFT_COUNT_TRUNCATED && targetm.shift_truncation_mask (mode)
+	  && CONST_INT_P (op1))
 	{
-	  val = INTVAL (op1) & (GET_MODE_UNIT_PRECISION (mode) - 1);
+	  val = INTVAL (op1) & targetm.shift_truncation_mask (mode);
 	  if (val != INTVAL (op1))
 	    return simplify_gen_binary (code, mode, op0,
 					gen_int_shift_amount (mode, val));
@@ -4347,8 +4348,9 @@ simplify_const_binary_operation (enum rtx_code code, machine_mode mode,
 	case ASHIFT:
 	  {
 	    wide_int wop1 = pop1;
-	    if (SHIFT_COUNT_TRUNCATED)
-	      wop1 = wi::umod_trunc (wop1, GET_MODE_PRECISION (int_mode));
+	    if (SHIFT_COUNT_TRUNCATED && targetm.shift_truncation_mask (int_mode))
+	      wop1 = wi::umod_trunc (wop1,
+				     targetm.shift_truncation_mask (int_mode) + 1);
 	    else if (wi::geu_p (wop1, GET_MODE_PRECISION (int_mode)))
 	      return NULL_RTX;
@@ -4426,8 +4428,9 @@ simplify_const_binary_operation (enum rtx_code code, machine_mode mode,
       if (CONST_SCALAR_INT_P (op1))
 	{
 	  wide_int shift = rtx_mode_t (op1, mode);
-	  if (SHIFT_COUNT_TRUNCATED)
-	    shift = wi::umod_trunc (shift, GET_MODE_PRECISION (int_mode));
+	  if (SHIFT_COUNT_TRUNCATED && targetm.shift_truncation_mask (int_mode))
+	    shift = wi::umod_trunc (shift,
+				    targetm.shift_truncation_mask (int_mode) + 1);
 	  else if (wi::geu_p (shift, GET_MODE_PRECISION (int_mode)))
 	    return NULL_RTX;
 	  result = wi::to_poly_wide (op0, mode) << shift;