Message ID: 4F74CFB3.5090308@naturalbridge.com
State: New
On 29 March 2012 22:10, Kenneth Zadeck <zadeck@naturalbridge.com> wrote:
> This patch takes a different approach to fixing PR52543 than does the
> patch in
>
> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>
> This patch transforms the lower-subreg pass(es) from unconditionally
> splitting wide moves, zero extensions, and shifts, so that it now
> takes into account the target specific costs and only does the
> transformations if it is profitable.
>
> Unconditional splitting is a problem that not only occurs on the AVR
> but is also a problem on the ARM NEON and my private port.
> Furthermore, it is a problem that is likely to occur on most modern
> larger machines since these machines are more likely to have fast
> instructions for moving things that are larger than word mode.

Nice - this means that at least one pending patch for subreg-style
operations for NEON intrinsics can go in after appropriate tweaking of
costs.  It probably requires some tweaking and benchmarking on ARM, but
the case where we saw such spills to the stack with subreg-style
operations is now much improved, indicating that the existing costs
infrastructure manages to get this right at least for this case.

Richard(S) - if you remember your PR48941 patch: after applying the
lower-subreg patch I now see far better code than what one gets out of
-fno-split-wide-types, and a lot of that gratuitous spilling has gone
away.  There are still too many moves between NEON registers, but there
are far fewer moves to the integer side and the gratuitous spilling is
now gone.

old on left - new on right (i.e. Kenneth's patch + Richard's PR48941
patch http://patchwork.ozlabs.org/patch/130429/)

regards
Ramana

cross:                                          cross:
    @ args = 0, pretend = 0, frame = 16       |     @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 1, uses_anonymous_args = 0 |   @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.                @ link register save eliminated.
    str     fp, [sp, #-4]!                    |     vldmia  r0, {d26-d27}
    add     fp, sp, #0                        |     vldmia  r1, {d24-d25}
    sub     sp, sp, #20                       |     vmov    q10, q13  @ v4sf
    vldmia  r0, {d16-d17}                     |     vmov    q11, q13  @ v4sf
    vmov    q10, q8  @ v4sf                   |     vmov    q8, q12  @ v4sf
    sub     sp, sp, #48                       |     vmov    q9, q12  @ v4sf
                                              >     vzip.32 q10, q11
                                              >     vzip.32 q8, q9
                                              >     vmov    q14, q10  @ v4sf
    vmov    q12, q8  @ v4sf                         vmov    q12, q8  @ v4sf
    add     r3, sp, #15                       |     vmov    d21, d22  @ v2sf
    bic     r3, r3, #15                       |     vmul.f32  d16, d29, d18
    vzip.32 q10, q12                          |     vmul.f32  d17, d21, d24
    vstmia  r3, {d20-d21}                     |     vmov    d19, d18  @ v2sf
    vstr    d24, [r3, #16]                    |     vmul.f32  d18, d28, d25
    vstr    d25, [r3, #24]                    |     vmls.f32  d16, d21, d25
    vldmia  r1, {d16-d17}                     |     vmls.f32  d17, d28, d19
    vmov    q9, q8  @ v4sf                    |     vmls.f32  d18, d29, d24
    vmov    q11, q8  @ v4sf                   |     vmov    d26, d16  @ v2sf
    vzip.32 q9, q11                           |     vmov    d27, d17  @ v2sf
    vstmia  r3, {d18-d19}                     |     vmov    d17, d18  @ v2sf
    vstr    d22, [r3, #16]                    |     vuzp.32 d26, d27
    vstr    d23, [r3, #24]                    |     vmov    d16, d26  @ v2sf
    vmov    d25, d18  @ v2sf                  |     vmov    r0, r1, d16  @ v4sf
    vmul.f32  d17, d21, d22                   |     vmov    r2, r3, d17
    vmul.f32  d18, d24, d18                   <
    vmov    d16, d19  @ v2sf                  <
    vmul.f32  d19, d20, d19                   <
    vmls.f32  d17, d24, d16                   <
    vmls.f32  d18, d20, d22                   <
    vmls.f32  d19, d21, d25                   <
    vuzp.32 d17, d18                          <
    vmov    d20, d17  @ v2sf                  <
    vmov    d21, d19  @ v2sf                  <
    vmov    r0, r1, d20  @ v4sf               <
    vmov    r2, r3, d21                       <
    add     sp, fp, #0                        <
    ldmfd   sp!, {fp}                         <
    bx      lr                                      bx      lr
Kenneth Zadeck <zadeck@naturalbridge.com> writes:
> This patch takes a different approach to fixing PR52543 than does the
> patch in
>
> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>
> This patch transforms the lower-subreg pass(es) from unconditionally
> splitting wide moves, zero extensions, and shifts, so that it now takes
> into account the target specific costs and only does the transformations
> if it is profitable.

Thanks for doing this.

> +   This pass only splits moves with modes are wider than word_mode and

...modes that are wider...

> +   ashifts, lshiftrt and zero_extensions with integer modes that are

Nitlet: lshiftrts to be consistent.  Or "ASHIFTs, LSHIFTRTs and
ZERO_EXTENDs".

> +   There are two useful preprocessor defines for use by maintainers:
> +
> +   #define LOG_COSTS
> +
> +   if you wish to see the actual cost estimates that are being used
> +   for each mode wider than word mode and the cost estimates for zero
> +   extension and the shifts.  This can be useful when port maintainers
> +   are tuning insn rtx costs.
> +
> +   #define FORCE_LOWERING
> +
> +   if you wish to test the pass with all the transformation forced on.
> +   This can be useful for finding bugs in the transformations.

Must admit I'm not keen on these kinds of macro, but it's Ian's call.

Idea for the future (i.e. not this patch) is to have a dump file for
target initialisation.

> +/* This pass can transform 4 different operations: move, ashift,
> +   lshiftrt, and zero_extend.  There is a boolean vector for move
> +   splitting that is indexed by mode and is true for each mode that is
> +   to have its copies split.  The other three operations are only done
> +   for one mode so they are only controlled by a single boolean .*/

As mentioned privately, whether this is profitable for shifts depends
to some extent on the shift amount.  GCC already supports targets where
this transformation would be OK for some shift amounts but not others.
So for shifts, I think this should be an array of HOST_BITS_PER_WIDE_INT
booleans rather than just one.

More comments below about how this filters through your other changes.

> +/* This pass can transform 4 different operations: move, ashift,
> +   lshiftrt, and zero_extend.  There is a boolean vector for move
> +   splitting that is indexed by mode and is true for each mode that is
> +   to have its copies split.  The other three operations are only done
> +   for one mode so they are only controlled by a single boolean .*/
> +static bool move_modes_to_split[MAX_MACHINE_MODE];
> +
> +/* Other splitting operations to be done on the this platform.  */
> +static bool splitting_ashift;
> +static bool splitting_lshiftrt;
> +static bool splitting_zext;
> +/* The or of the previous three values.  */
> +static bool splitting_some_shifts;
> +
> +/* The integer mode that is twice the width of word_mode.  */
> +static enum machine_mode twice_word_mode;
> +
>  /* Bit N in the bitmap in element M of this array is set if there is a
>     copy from reg M to reg N.  */
>  static VEC(bitmap,heap) *reg_copy_graph;
>
> -/* Return whether X is a simple object which we can take a word_mode
> -   subreg of.  */
> +static bool something_to_do;
> +
> +/* Precomputed cost values used to determine if lowering shift
> +   operations is profitable.  */
> +static int word_mode_move_cost;
> +static int move_zero_cost;
> +

All this new data should be wrapped in a target structure.  See
expmed.[hc] for an example.

> +  return wide_cost > word_mode_move_cost + move_zero_cost;

Should be >=.  The purpose of this transformation isn't AIUI to improve
shifts in themselves.  It's just to stop shifts from being an artificial
barrier to lower-subreg's main "optimisation" -- which in some ways is
more of an IR simplification -- i.e., replacing subregs with regs.
We want to do it if the costs of the two sequences are equal.
More justification at the end.
> +  else
> +    {
> +      int narrow_cost;
> +
> +      trg = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
> +      src = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);

Probably best to use FIRST_PSEUDO_REGISTER + 1 for the second (here and
elsewhere) so that we get the cost of a general 3-operand shift rather
than a 2-operand one.

> +      narrow_cost = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg,
> +	gen_rtx_fmt_ee (code, word_mode, src,
> +			GEN_INT (shift_amt - word_mode_size))),

Watch the long lines.  Everything should be less than 80 chars.

> +  return wide_cost > narrow_cost + move_zero_cost;

">=" here too.

> +/* Initialize pass once per execution.  This envolves determining

Once per target.  And since it can be called more than once, the
function needs to zero out the new data rather than rely on load-time
zero-initialisation.

> +      t = compute_move_cost ((enum machine_mode) i);
> +
> +#ifdef LOG_COSTS
> +      fprintf (stderr, "mode %s, move cost %d, factor %d\n",
> +	       mode_name[i], t, factor);
> +#endif
> +      if (t > word_mode_move_cost * factor)

">=" here too.

> +  /* The only case here to check to see if moving the upper part with a
> +     zero is cheaper than doing the zext itself.  There is some theory
> +     that any well implemented port would just have the pattern.  */

Don't really get this comment.  These days, I don't think ports are
really expected to define double-word patterns that simply decompose to
the usual (target-independent) word-mode sequence.  Ports used to do
that because having a unified operation helped the earlier rtl
optimisers, but those cases are now handled at the gimple level.  These
days it seems better to let optabs lower the thing immediately and allow
the individual word ops (which weren't exposed at the gimple level) to
get maximum optimisation.

Maybe easiest to remove the second sentence rather than argue about it.
> +      if (insn_rtx_cost (pat, true) > word_mode_move_cost + move_zero_cost)

">=" :-)

> @@ -173,7 +364,7 @@ find_pseudo_copy (rtx set)
>    if (HARD_REGISTER_NUM_P (rd) || HARD_REGISTER_NUM_P (rs))
>      return false;
>
> -  if (GET_MODE_SIZE (GET_MODE (dest)) <= UNITS_PER_WORD)
> +  if (!move_modes_to_split[(int)GET_MODE (dest)])
>      return false;

Space after "(int)".

> @@ -335,6 +526,11 @@ find_decomposable_subregs (rtx *px, void
>  	      bitmap_set_bit (decomposable_context, regno);
>  	    break;
>  	  case SIMPLE_MOVE:
> +	    /* If this is marked as a simple move with a mode this
> +	       large, it is because find_pseudo_copy returned false
> +	       because this was not a mode we wanted to split.  */
> +	    if (!move_modes_to_split[GET_MODE (x)])

I think you need the (int) here too.  (Should have been caught by
bootstrap if so.)

> @@ -668,7 +864,7 @@ resolve_simple_move (rtx set, rtx insn)
>    orig_mode = GET_MODE (dest);
>
>    words = (GET_MODE_SIZE (orig_mode) + UNITS_PER_WORD - 1) / UNITS_PER_WORD;
> -  if (words <= 1)
> +  if (!move_modes_to_split[(int)GET_MODE (dest)])
>      return insn;
>
>    start_sequence ();

Space after "(int)".
>    op = SET_SRC (set);
> -  if (GET_CODE (op) != ASHIFT
> -      && GET_CODE (op) != LSHIFTRT
> -      && GET_CODE (op) != ZERO_EXTEND)
> +  mode = GET_MODE (op);
> +
> +  if (mode != twice_word_mode)
> +    return 0;
> +
> +  switch (GET_CODE (op)) {
> +    case ASHIFT:
> +      if (splitting_ashift)
> +	break;
> +      return 0;
> +
> +    case LSHIFTRT:
> +      if (splitting_lshiftrt)
> +	break;
> +      return 0;
> +
> +    case ZERO_EXTEND:
> +      if (splitting_zext)
> +	break;
> +      return 0;

With the boolean array change suggested above, we'd not bother with this
bit and check the cached cost information:

> @@ -961,23 +1178,26 @@ find_decomposable_shift_zext (rtx insn)
>
>    if (GET_CODE (op) == ZERO_EXTEND)
>      {
> -      if (GET_MODE (op_operand) != word_mode
> -	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
> +      if (GET_MODE (op_operand) != word_mode)
>  	return 0;
>      }

...here and:

>    else /* left or right shift */
>      {
> +      int shift_val;
> +
>        if (!CONST_INT_P (XEXP (op, 1))
> -	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD
> -	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
> +	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD)
> +	return 0;
> +
> +      shift_val = INTVAL (XEXP (op, 1));
> +      if (!profitable_shift_p (GET_CODE (op), shift_val))
>  	return 0;
> +
> +      bitmap_set_bit (decomposable_context, REGNO (op_operand));
>      }

...here (using the boolean array instead of profitable_shift_p).

> @@ -1094,24 +1342,41 @@ decompose_multiword_subregs (void)
>       all the insns.  */
>    {
>      unsigned int i;
> +    bool useful_modes_seen = false;
>
>      for (i = FIRST_PSEUDO_REGISTER; i < max; ++i)
>        {
> -	if (regno_reg_rtx[i] != NULL
> -	    && GET_MODE_SIZE (GET_MODE (regno_reg_rtx[i])) > UNITS_PER_WORD)
> -	  break;
> +	if (regno_reg_rtx[i] != NULL)
> +	  {
> +	    enum machine_mode mode = GET_MODE (regno_reg_rtx[i]);
> +	    if (regno_reg_rtx[i] != NULL
> +		&& (move_modes_to_split [mode]
> +		    || (splitting_some_shifts && mode == twice_word_mode)))

Here I think we should just check move_modes_to_split.
We should not split a mode for which moves get more expensive,
regardless of whether shifts are cheaper.  lower-subreg is designed to
help later passes produce better code.  Splitting a shift when the
associated move is more expensive is likely to encourage those passes
to generate more expensive move sequences instead of cheaper ones.
(Includes RA and reload.)

Besides, I claim that it makes no sense to define a double-word shift
pattern that is inherently more expensive than what optabs would
generate without the double-word shift pattern.  So I think the ">="s
in the (modified) shift cost calculations should act as "=="s in
practice (but should still be coded as ">="s for consistency).  That
is, we'll allow the shift to be split if the cost of doing so is the
same as the original shift, but not otherwise.

Or the short version: if splitting double-word moves is expensive, skip
all the zext and shift stuff too.

Looks good to me otherwise, but it's Ian's pass.

Richard
>> +   There are two useful preprocessor defines for use by maintainers:
>> +
>> +   #define LOG_COSTS
>> +
>> +   if you wish to see the actual cost estimates that are being used
>> +   for each mode wider than word mode and the cost estimates for zero
>> +   extension and the shifts.  This can be useful when port maintainers
>> +   are tuning insn rtx costs.
>> +
>> +   #define FORCE_LOWERING
>> +
>> +   if you wish to test the pass with all the transformation forced on.
>> +   This can be useful for finding bugs in the transformations.
> Must admit I'm not keen on these kinds of macro, but it's Ian's call.
>
> Idea for the future (i.e. not this patch) is to have a dump file for
> target initialisation.

Imagine my horror when i did all of this as you had privately suggested
and discovered that there was no way to log what i was doing.  This is
good enough until someone wants to fix the general problem.

>> +/* This pass can transform 4 different operations: move, ashift,
>> +   lshiftrt, and zero_extend.  There is a boolean vector for move
>> +   splitting that is indexed by mode and is true for each mode that is
>> +   to have its copies split.  The other three operations are only done
>> +   for one mode so they are only controlled by a single boolean .*/
> As mentioned privately, whether this is profitable for shifts depends
> to some extent on the shift amount.  GCC already supports targets where
> this transformation would be OK for some shift amounts but not others.
> So for shifts, I think this should be an array of HOST_BITS_PER_WIDE_INT
> booleans rather than just one.
>
> More comments below about how this filters through your other changes.

I think that you actually are missing what i am doing with this.  I
look at 3 representative values that "should" discover any
non-uniformities.  If any of them are profitable, i set this bit.  Then
at the point where i really have to pull the trigger on a real
instance, i check the shift amount used at that spot to see if the
individual shift is profitable.

I did this for two reasons.  One of them was that i was a little
concerned that HOST_BITS_PER_WIDE_INT on the smallest host was not as
big as the bitsize of word_word mode on the largest target (it could be
but this knowledge is above my pay grade).  The other reason was that i
did not see this as a common operation and checking it on demand seemed
like the winner.

I will do everything else you mention and resubmit after i fix Ramana's
ICE.
Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>>> +   There are two useful preprocessor defines for use by maintainers:
>>> +
>>> +   #define LOG_COSTS
>>> +
>>> +   if you wish to see the actual cost estimates that are being used
>>> +   for each mode wider than word mode and the cost estimates for zero
>>> +   extension and the shifts.  This can be useful when port maintainers
>>> +   are tuning insn rtx costs.
>>> +
>>> +   #define FORCE_LOWERING
>>> +
>>> +   if you wish to test the pass with all the transformation forced on.
>>> +   This can be useful for finding bugs in the transformations.
>> Must admit I'm not keen on these kinds of macro, but it's Ian's call.
>>
>> Idea for the future (i.e. not this patch) is to have a dump file for
>> target initialisation.
> Imagine my horror when i did all of this as you had privately suggested
> and discovered that there was no way to log what i was doing.  This is
> good enough until someone wants to fix the general problem.
>
>>> +/* This pass can transform 4 different operations: move, ashift,
>>> +   lshiftrt, and zero_extend.  There is a boolean vector for move
>>> +   splitting that is indexed by mode and is true for each mode that is
>>> +   to have its copies split.  The other three operations are only done
>>> +   for one mode so they are only controlled by a single boolean .*/
>> As mentioned privately, whether this is profitable for shifts depends
>> to some extent on the shift amount.  GCC already supports targets where
>> this transformation would be OK for some shift amounts but not others.
>> So for shifts, I think this should be an array of HOST_BITS_PER_WIDE_INT
>> booleans rather than just one.
>>
>> More comments below about how this filters through your other changes.
> I think that you actually are missing what i am doing with this.  I
> look at 3 representative values that "should" discover any non
> uniformities.  If any of them are profitable, i set this bit.  Then at
> the point where i really have to pull the trigger on a real instance, i
> check the shift amount used at that spot to see if the individual shift
> is profitable.

No, I got that.  I just think it's an unnecessary complication.

> I did this for two reasons.  One of them was that i was a little
> concerned that HOST_BITS_PER_WIDE_INT on the smallest host was not as
> big as the bitsize of word_word mode on the largest target (it could be
> but this knowledge is above my pay grade).

Ah, yes, sorry, I meant an array of BITS_PER_WORD booleans.  I had
HOST_WIDE_INT on the brain after Mike's patch.

> The other reason was did not see this as a common operation and
> checking it on demand seemed like the winner.

But (at least after the other changes I mentioned) these overall
booleans cut out only a very small portion of
find_decomposable_shift_zext.  I.e.:

  op = SET_SRC (set);
  if (GET_CODE (op) != ASHIFT
      && GET_CODE (op) != LSHIFTRT
      && GET_CODE (op) != ZERO_EXTEND)	<-- unified booleans checked here
    return 0;

  op_operand = XEXP (op, 0);
  if (!REG_P (SET_DEST (set))
      || !REG_P (op_operand)
      || HARD_REGISTER_NUM_P (REGNO (SET_DEST (set)))
      || HARD_REGISTER_NUM_P (REGNO (op_operand))
      || !SCALAR_INT_MODE_P (GET_MODE (op)))
    return 0;

  if (GET_CODE (op) == ZERO_EXTEND)
    {
      if (GET_MODE (op_operand) != word_mode
	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
	return 0;
    }
  else /* left or right shift */
    {					<--- specific booleans checked here
      if (!CONST_INT_P (XEXP (op, 1))
	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD
	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
	return 0;
    }

It seems better (and simpler) not to prejudge which shift amounts are
interesting and instead cache the "win or no win" flag for each value.

As I say, this is all in the context of this pass not being interesting
for modes where the split move is strictly more expensive than the
unified move, regardless of shift & zext costs.

Richard
ramana

i get the same failure on the trunk without my patch.

kenny

On 03/30/2012 07:36 AM, Ramana Radhakrishnan wrote:
> Hi
>
>> I have tested this on an x86_64 with both the force lowering on and
>> off and neither cause any regressions as well as extensive testing
>> on my port.
>
> So, just out of curiosity, I decided to run this through a
> cross-build and noticed the following ICE with eglibc.  I haven't had
> the time to debug this further but it does appear as though it could
> do with some more testing on some more ports and this probably needs
> some tuning as you say.
>
> $> /work/cross-build/fsf/arm-none-linux-gnueabi/tools-lowersubregchanges-patched/bin/arm-none-linux-gnueabi-gcc
> -c -O2 ./besttry.c -mfloat-abi=soft -march=armv5te
> ./besttry.c: In function ‘_IO_new_file_write’:
> ./besttry.c:36:1: internal compiler error: in get_loop_body, at cfgloop.c:831
>
> $> cat besttry.c
> __extension__ typedef int __ssize_t;
> extern __thread int __libc_errno __attribute__ ((tls_model ("initial-exec")));
> struct _IO_FILE {
>   int _fileno;
>   int _flags2;
> };
> typedef struct _IO_FILE _IO_FILE;
> _IO_new_file_write (f,
>                     data,
>                     n)
>      _IO_FILE *f;
> {
>   __ssize_t to_do = n;
>   while (to_do > 0)
>     {
>       __ssize_t count =
>         (__builtin_expect (f->_flags2 & 2, 0) ?
>          ({ unsigned int _sys_result = ({ register int _a1 asm ("r0"), _nr asm ("r7");
>             int _a3tmp = (int) ((to_do));
>             int _a2tmp = (int) ((data));
>             register int _a2 asm ("a2") = _a2tmp;
>             register int _a3 asm ("a3") = _a3tmp; _nr = ((0 + 4));
>             asm volatile ("swi 0x0 @ syscall " "SYS_ify(write)" : "=r"
>               (_a1) : "r" (_nr) , "r" (_a1), "r" (_a2), "r" (_a3) : "memory"); _a1;
>             });
>             if (__builtin_expect (((unsigned int) (_sys_result) >= 0xfffff001u), 0))
>               { (__libc_errno = ((-(_sys_result))));
>                 _sys_result = (unsigned int) -1; }
>             (int) _sys_result; })
>          : __write (f->_fileno, data, to_do));
>       if (count < 0)
>         {
>           break;
>         }
>       to_do -= count;
>     }
> }
>
> Ramana
>
> On 29 March 2012 22:10, Kenneth Zadeck <zadeck@naturalbridge.com> wrote:
>> This patch takes a different approach to fixing PR52543 than does the
>> patch in
>>
>> http://gcc.gnu.org/ml/gcc-patches/2012-03/msg00641.html
>>
>> This patch transforms the lower-subreg pass(es) from unconditionally
>> splitting wide moves, zero extensions, and shifts, so that it now
>> takes into account the target specific costs and only does the
>> transformations if it is profitable.
>>
>> Unconditional splitting is a problem that not only occurs on the AVR
>> but is also a problem on the ARM NEON and my private port.
>> Furthermore, it is a problem that is likely to occur on most modern
>> larger machines since these machines are more likely to have fast
>> instructions for moving things that are larger than word mode.
>>
>> At compiler initialization time, each mode that is larger that a word
>> mode is examined to determine if the cost of moving a value of that
>> mode is less expensive that inserting the proper number of word sided
>> moves.  If it is cheaper to split it up, a bit is set to allow moves
>> of that mode to be lowered.
>>
>> A similar analysis is made for the zero extensions and shifts except
>> that lower subreg had been (and is still limited to only breaking up
>> these operations if the target size was twice the size of word mode.)
>> Also, if the analysis determines that there are no profitable
>> transformations, the pass exits quickly without doing any analysis.
>>
>> It is quite likely that most ports will have to be adjusted after
>> this patch is accepted.  For instance, the analysis discovers that
>> there are no profitable transformations to be performed on the
>> x86-64.  Since this is not my platform, I have no idea if these are
>> the correct settings.  But the pass uses the standard insn_rtx_cost
>> interface and it is the port maintainers responsibility to not lie to
>> the optimization passes so this extra work in stage one should be
>> acceptable.
>>
>> I do know from a private conversation with Richard Sandiford, that
>> mips patches are likely forthcoming.
>>
>> There is preprocessor controlled code that prints out the cost
>> analysis.  Only a summary of this can go in the subregs dump file
>> because the analysis is called from backend_init_target and so the
>> dump file is not available.  But it is very useful to define
>> LOG_COSTS when adjusting your port.
>>
>> There is also preprocessor code that forces all of the lowering
>> operations to marked as profitable.  This is useful in debugging the
>> new logic.
>>
>> Both of these preprocessor symbols are documented at the top of the
>> pass.
>>
>> I have tested this on an x86_64 with both the force lowering on and
>> off and neither cause any regressions as well as extensive testing on
>> my port.
>>
>> Ok to commit?
>>
>> Kenny
>>
>> 2012-03-29  Kenneth Zadeck  <zadeck@naturalbridge.com>
>>
>>     * toplev.c (backend_init_target): Call initializer for
>>     lower-subreg pass.
>>     * lower-subreg.c (move_modes_to_split, splitting_ashift,
>>     splitting_lshiftrt, splitting_zext, splitting_some_shifts,
>>     twice_word_mode, something_to_do, word_mode_move_cost,
>>     move_zero_cost): New static vars.
>>     (compute_move_cost, profitable_shift_p, init_lower_subreg): New
>>     functions.
>>     (find_pseudo_copy, resolve_simple_move): Added code to only split
>>     based on costs.
>>     (find_decomposable_subregs): Added code to mark as decomposable
>>     moves that are not profitable.
>>     (find_decomposable_shift_zext): Added code to only decompose
>>     shifts and zext if profitable.
>>     (resolve_shift_zext): Added comment.
>>     (decompose_multiword_subregs): Dump list of profitable
>>     transformations.  Add code to skip non profitable transformations.
>>     * rtl.h (init_lower_subreg): Added declaration.
On 30 March 2012 20:29, Kenneth Zadeck <zadeck@naturalbridge.com> wrote:
> ramana
>
> i get the same failure on the trunk without my patch.

In which case I apologise and will file a bug report separately.  I
should really have checked :( .

Ramana
On 30/03/2012 12:08, Richard Sandiford wrote:
>> +   There are two useful preprocessor defines for use by maintainers:
>> +
>> +   #define LOG_COSTS
>> +
>> +   if you wish to see the actual cost estimates that are being used
>> +   for each mode wider than word mode and the cost estimates for zero
>> +   extension and the shifts.  This can be useful when port maintainers
>> +   are tuning insn rtx costs.
>> +
>> +   #define FORCE_LOWERING
>> +
>> +   if you wish to test the pass with all the transformation forced on.
>> +   This can be useful for finding bugs in the transformations.
> Must admit I'm not keen on these kinds of macro, but it's Ian's call.

Indeed, LOG_COSTS should be (dump_flags & TDF_DETAILS) != 0, and
perhaps FORCE_LOWERING should be a -f flag (like -flower-all-subregs)
or a --param.

Paolo
On 10/05/2012 08:45, Paolo Bonzini wrote:
> On 30/03/2012 12:08, Richard Sandiford wrote:
>>>> +   There are two useful preprocessor defines for use by maintainers:
>>>> +
>>>> +   #define LOG_COSTS
>>>> +
>>>> +   if you wish to see the actual cost estimates that are being used
>>>> +   for each mode wider than word mode and the cost estimates for zero
>>>> +   extension and the shifts.  This can be useful when port maintainers
>>>> +   are tuning insn rtx costs.
>>>> +
>>>> +   #define FORCE_LOWERING
>>>> +
>>>> +   if you wish to test the pass with all the transformations forced on.
>>>> +   This can be useful for finding bugs in the transformations.
>> Must admit I'm not keen on these kinds of macro, but it's Ian's call.
>
> Indeed, LOG_COSTS should be (dump_flags & TDF_DETAILS) != 0, and perhaps
> FORCE_LOWERING should be a -f flag (like -flower-all-subregs) or a --param.

Not sure how this got sent a month after I wrote it (and decided not to
send it). :)

Paolo
Index: toplev.c
===================================================================
--- toplev.c	(revision 185969)
+++ toplev.c	(working copy)
@@ -1660,6 +1660,7 @@ backend_init_target (void)
   /* rtx_cost is mode-dependent, so cached values need to be recomputed
      on a mode change.  */
   init_expmed ();
+  init_lower_subreg ();
 
   /* We may need to recompute regno_save_code[] and regno_restore_code[]
      after a mode change as well.  */
Index: lower-subreg.c
===================================================================
--- lower-subreg.c	(revision 185969)
+++ lower-subreg.c	(working copy)
@@ -52,10 +52,34 @@ DEF_VEC_P (bitmap);
 DEF_VEC_ALLOC_P (bitmap,heap);
 
 /* Decompose multi-word pseudo-registers into individual
-   pseudo-registers when possible.  This is possible when all the uses
-   of a multi-word register are via SUBREG, or are copies of the
-   register to another location.  Breaking apart the register permits
-   more CSE and permits better register allocation.  */
+   pseudo-registers when possible and profitable.  This is possible
+   when all the uses of a multi-word register are via SUBREG, or are
+   copies of the register to another location.  Breaking apart the
+   register permits more CSE and permits better register allocation.
+   This is profitable if the machine does not have move instructions
+   to do this.
+
+   This pass only splits moves with modes that are wider than
+   word_mode, and ashifts, lshiftrts and zero_extensions with integer
+   modes that are twice the width of word_mode.  The latter could be
+   generalized if there was a need to do this, but the trend in
+   architectures is to not need this.
+
+   There are two useful preprocessor defines for use by maintainers:
+
+   #define LOG_COSTS
+
+   if you wish to see the actual cost estimates that are being used
+   for each mode wider than word mode and the cost estimates for zero
+   extension and the shifts.  This can be useful when port maintainers
+   are tuning insn rtx costs.
+
+   #define FORCE_LOWERING
+
+   if you wish to test the pass with all the transformations forced on.
+   This can be useful for finding bugs in the transformations.
+
+   Both of these should not be enabled at the same time.  */
 
 /* Bit N in this bitmap is set if regno N is used in a context in
    which we can decompose it.  */
@@ -71,12 +95,179 @@ static bitmap non_decomposable_context;
    avoid generating accesses to its subwords in integer modes.  */
 static bitmap subreg_context;
 
+/* This pass can transform 4 different operations: move, ashift,
+   lshiftrt, and zero_extend.  There is a boolean vector for move
+   splitting that is indexed by mode and is true for each mode that is
+   to have its copies split.  The other three operations are only done
+   for one mode, so they are only controlled by a single boolean.  */
+static bool move_modes_to_split[MAX_MACHINE_MODE];
+
+/* Other splitting operations to be done on this platform.  */
+static bool splitting_ashift;
+static bool splitting_lshiftrt;
+static bool splitting_zext;
+/* The or of the previous three values.  */
+static bool splitting_some_shifts;
+
+/* The integer mode that is twice the width of word_mode.  */
+static enum machine_mode twice_word_mode;
+
 /* Bit N in the bitmap in element M of this array is set if there is a
    copy from reg M to reg N.  */
 static VEC(bitmap,heap) *reg_copy_graph;
 
-/* Return whether X is a simple object which we can take a word_mode
-   subreg of.  */
+static bool something_to_do;
+
+/* Precomputed cost values used to determine if lowering shift
+   operations is profitable.  */
+static int word_mode_move_cost;
+static int move_zero_cost;
+
+/* See what the move cost is.  */
+static int
+compute_move_cost (enum machine_mode mode)
+{
+  rtx src = gen_rtx_REG (mode, FIRST_PSEUDO_REGISTER);
+  rtx trg = gen_rtx_REG (mode, FIRST_PSEUDO_REGISTER);
+  rtx pat = gen_rtx_SET (VOIDmode, trg, src);
+
+  return insn_rtx_cost (pat, true);
+}
+
+
+/* Return true if it is profitable to lower a shift by SHIFT_AMT.
+   CODE can be either ASHIFT or LSHIFTRT.  */
+static bool
+profitable_shift_p (enum rtx_code code, int shift_amt)
+{
+  rtx trg = gen_rtx_REG (twice_word_mode, FIRST_PSEUDO_REGISTER);
+  rtx src = gen_rtx_REG (twice_word_mode, FIRST_PSEUDO_REGISTER);
+  int word_mode_size = GET_MODE_BITSIZE (word_mode);
+  int wide_cost
+    = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg,
+				  gen_rtx_fmt_ee (code, twice_word_mode,
+						  src, GEN_INT (shift_amt))),
+		     true);
+#ifdef FORCE_LOWERING
+  return true;
+#endif
+
+#ifdef LOG_COSTS
+  fprintf (stderr, "shift code %d, shift amt %d, wide_cost %d\n",
+	   code, shift_amt, wide_cost);
+#endif
+  if (shift_amt == word_mode_size)
+    return wide_cost > word_mode_move_cost + move_zero_cost;
+  else
+    {
+      int narrow_cost;
+
+      trg = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
+      src = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
+      narrow_cost
+	= insn_rtx_cost (gen_rtx_SET (VOIDmode, trg,
+				      gen_rtx_fmt_ee (code, word_mode, src,
						      GEN_INT (shift_amt
							       - word_mode_size))),
+			 true);
+
+#ifdef LOG_COSTS
+      fprintf (stderr, "narrow_cost %d\n", narrow_cost);
+#endif
+      return wide_cost > narrow_cost + move_zero_cost;
+    }
+}
+
+
+/* Initialize pass once per execution.  This involves determining
+   which operations on the machine are profitable.  If none are found,
+   then the pass just returns when called.
+   */
+
+void
+init_lower_subreg (void)
+{
+  int word_mode_size = GET_MODE_BITSIZE (word_mode);
+  rtx trg = gen_rtx_REG (word_mode, FIRST_PSEUDO_REGISTER);
+  rtx pat;
+  unsigned int i;
+  int factor;
+
+  word_mode_move_cost = compute_move_cost (word_mode);
+  move_zero_cost = insn_rtx_cost (gen_rtx_SET (VOIDmode, trg,
+					       CONST0_RTX (word_mode)), true);
+
+#ifdef LOG_COSTS
+  fprintf (stderr, "word mode move cost %d\n", word_mode_move_cost);
+  fprintf (stderr, "move 0 cost %d\n", move_zero_cost);
+#endif
+  for (i = 0; i < MAX_MACHINE_MODE; i++)
+    {
+      int t;
+      factor = GET_MODE_SIZE (i) / UNITS_PER_WORD;
+
+      if (factor <= 1)
+	continue;
+
+      t = compute_move_cost ((enum machine_mode) i);
+
+#ifdef LOG_COSTS
+      fprintf (stderr, "mode %s, move cost %d, factor %d\n",
+	       mode_name[i], t, factor);
+#endif
+      if (t > word_mode_move_cost * factor)
+	{
+	  move_modes_to_split[i] = true;
+	  something_to_do = true;
+	}
+#ifdef FORCE_LOWERING
+      move_modes_to_split[i] = true;
+      something_to_do = true;
+#endif
+    }
+
+  /* For the moves and shifts, the only case that is checked is one
+     where the mode of the target is an integer mode twice the width
+     of word_mode.  */
+
+  twice_word_mode = GET_MODE_WIDER_MODE (word_mode);
+
+  /* The only check here is to see if moving the upper part with a
+     zero is cheaper than doing the zext itself.  There is some theory
+     that any well-implemented port would just have the pattern.  */
+  pat = gen_rtx_SET (VOIDmode, trg,
+		     gen_rtx_ZERO_EXTEND (twice_word_mode,
+					  gen_reg_rtx (word_mode)));
+
+  if (insn_rtx_cost (pat, true) > word_mode_move_cost + move_zero_cost)
+    {
+      splitting_zext = true;
+      splitting_some_shifts = true;
+      something_to_do = true;
+    }
+
+#ifdef FORCE_LOWERING
+  splitting_zext = true;
+  splitting_some_shifts = true;
+  something_to_do = true;
+#endif
+  /* For the shifts there are three shift amounts that need to be
+     checked: word_mode, word_mode + 1 and word_mode * 1.5.  The first
+     of these can be done without a shift.
+     The second and third cases
+     are the same on large machines but may be different on small
+     machines that special-case shifts by 1 and 2.  If any of these are
+     found to be useful, then we set the flags to consider those cases,
+     and when the actual shift amount is known, redo the cost
+     calculation.  */
+
+  splitting_ashift
+    = profitable_shift_p (ASHIFT, word_mode_size)
+      || profitable_shift_p (ASHIFT, word_mode_size + 1)
+      || profitable_shift_p (ASHIFT, word_mode_size + (word_mode_size >> 1));
+
+  splitting_some_shifts |= splitting_ashift;
+  something_to_do |= splitting_ashift;
+
+  splitting_lshiftrt
+    = profitable_shift_p (LSHIFTRT, word_mode_size)
+      || profitable_shift_p (LSHIFTRT, word_mode_size + 1)
+      || profitable_shift_p (LSHIFTRT, word_mode_size + (word_mode_size >> 1));
+
+  splitting_some_shifts |= splitting_lshiftrt;
+  something_to_do |= splitting_lshiftrt;
+}
+
 static bool
 simple_move_operand (rtx x)
@@ -173,7 +364,7 @@ find_pseudo_copy (rtx set)
   if (HARD_REGISTER_NUM_P (rd) || HARD_REGISTER_NUM_P (rs))
     return false;
 
-  if (GET_MODE_SIZE (GET_MODE (dest)) <= UNITS_PER_WORD)
+  if (!move_modes_to_split[(int) GET_MODE (dest)])
     return false;
 
   b = VEC_index (bitmap, reg_copy_graph, rs);
@@ -335,6 +526,11 @@ find_decomposable_subregs (rtx *px, void
 	  bitmap_set_bit (decomposable_context, regno);
 	  break;
 	case SIMPLE_MOVE:
+	  /* If this is marked as a simple move with a mode this
+	     large, it is because find_pseudo_copy returned false
+	     because this was not a mode we wanted to split.
+	     */
+	  if (!move_modes_to_split[GET_MODE (x)])
+	    bitmap_set_bit (non_decomposable_context, regno);
 	  break;
 	default:
 	  gcc_unreachable ();
@@ -668,7 +864,7 @@ resolve_simple_move (rtx set, rtx insn)
   orig_mode = GET_MODE (dest);
 
   words = (GET_MODE_SIZE (orig_mode) + UNITS_PER_WORD - 1) / UNITS_PER_WORD;
-  if (words <= 1)
+  if (!move_modes_to_split[(int) GET_MODE (dest)])
     return insn;
 
   start_sequence ();
@@ -941,16 +1137,37 @@ find_decomposable_shift_zext (rtx insn)
   rtx set;
   rtx op;
   rtx op_operand;
+  enum machine_mode mode;
 
   set = single_set (insn);
   if (!set)
     return 0;
 
   op = SET_SRC (set);
-  if (GET_CODE (op) != ASHIFT
-      && GET_CODE (op) != LSHIFTRT
-      && GET_CODE (op) != ZERO_EXTEND)
+  mode = GET_MODE (op);
+
+  if (mode != twice_word_mode)
+    return 0;
+
+  switch (GET_CODE (op))
     {
+    case ASHIFT:
+      if (splitting_ashift)
+	break;
+      return 0;
+
+    case LSHIFTRT:
+      if (splitting_lshiftrt)
+	break;
+      return 0;
+
+    case ZERO_EXTEND:
+      if (splitting_zext)
+	break;
+      return 0;
+
+    default:
       return 0;
    }
 
   op_operand = XEXP (op, 0);
   if (!REG_P (SET_DEST (set)) || !REG_P (op_operand)
@@ -961,23 +1178,26 @@ find_decomposable_shift_zext (rtx insn)
 
   if (GET_CODE (op) == ZERO_EXTEND)
     {
-      if (GET_MODE (op_operand) != word_mode
-	  || GET_MODE_BITSIZE (GET_MODE (op)) != 2 * BITS_PER_WORD)
+      if (GET_MODE (op_operand) != word_mode)
 	return 0;
     }
   else /* left or right shift */
     {
+      int shift_val;
+
       if (!CONST_INT_P (XEXP (op, 1))
-	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD
-	  || GET_MODE_BITSIZE (GET_MODE (op_operand)) != 2 * BITS_PER_WORD)
+	  || INTVAL (XEXP (op, 1)) < BITS_PER_WORD)
+	return 0;
+
+      shift_val = INTVAL (XEXP (op, 1));
+      if (!profitable_shift_p (GET_CODE (op), shift_val))
 	return 0;
+
+      bitmap_set_bit (decomposable_context, REGNO (op_operand));
     }
 
   bitmap_set_bit (decomposable_context, REGNO (SET_DEST (set)));
 
-  if (GET_CODE (op) != ZERO_EXTEND)
-    bitmap_set_bit (decomposable_context, REGNO (op_operand));
-
   return 1;
 }
 
@@ -1008,6 +1228,8 @@ resolve_shift_zext (rtx insn)
 
   op_operand = XEXP (op, 0);
+  /* We can tear this operation apart only if the regs were already
+     torn apart.  */
   if (!resolve_reg_p (SET_DEST (set)) && !resolve_reg_p (op_operand))
     return NULL_RTX;
 
@@ -1083,8 +1305,34 @@ decompose_multiword_subregs (void)
   unsigned int max;
   basic_block bb;
 
-  if (df)
-    df_set_flags (DF_DEFER_INSN_RESCAN);
+  if (dump_file)
+    {
+      unsigned int i;
+
+      for (i = 0; i < MAX_MACHINE_MODE; i++)
+	{
+	  if (GET_MODE_SIZE (i) > UNITS_PER_WORD)
+	    fprintf (dump_file, "%s mode %s for copy lowering.\n",
+		     move_modes_to_split[i] ? "Splitting" : "Skipping",
+		     mode_name[i]);
+	}
+
+      fprintf (dump_file, "%s mode %s for zero_extend lowering.\n",
+	       splitting_zext ? "Splitting" : "Skipping",
+	       mode_name[twice_word_mode]);
+
+      fprintf (dump_file, "%s mode %s for ashift lowering.\n",
+	       splitting_ashift ? "Splitting" : "Skipping",
+	       mode_name[twice_word_mode]);
+
+      fprintf (dump_file, "%s mode %s for lshiftrt lowering.\n",
+	       splitting_lshiftrt ? "Splitting" : "Skipping",
+	       mode_name[twice_word_mode]);
+
+      if (!something_to_do)
+	fprintf (dump_file, "\nNothing to lower for port.\n\n");
+    }
+
+
+  /* Check if this target even has any modes to consider lowering.  */
+  if (!something_to_do)
+    return;
 
   max = max_reg_num ();
 
@@ -1094,24 +1342,41 @@ decompose_multiword_subregs (void)
     all the insns.
     */
   {
     unsigned int i;
+    bool useful_modes_seen = false;
 
     for (i = FIRST_PSEUDO_REGISTER; i < max; ++i)
       {
-	if (regno_reg_rtx[i] != NULL
-	    && GET_MODE_SIZE (GET_MODE (regno_reg_rtx[i])) > UNITS_PER_WORD)
-	  break;
+	if (regno_reg_rtx[i] != NULL)
+	  {
+	    enum machine_mode mode = GET_MODE (regno_reg_rtx[i]);
+	    if (regno_reg_rtx[i] != NULL
+		&& (move_modes_to_split[mode]
+		    || (splitting_some_shifts && mode == twice_word_mode)))
+	      {
+		useful_modes_seen = true;
+		break;
+	      }
+	  }
+      }
+
+    if (!useful_modes_seen)
+      {
+	if (dump_file)
+	  fprintf (dump_file, "Nothing to lower in this function.\n");
+	return;
       }
-
-    if (i == max)
-      return;
   }
 
-  if (df)
-    run_word_dce ();
+  if (df)
+    {
+      df_set_flags (DF_DEFER_INSN_RESCAN);
+      run_word_dce ();
+    }
 
-  /* FIXME: When the dataflow branch is merged, we can change this
-     code to look for each multi-word pseudo-register and to find each
-     insn which sets or uses that register.  That should be faster
-     than scanning all the insns.  */
+  /* FIXME: It may be possible to change this code to look for each
+     multi-word pseudo-register and to find each insn which sets or
+     uses that register.  That should be faster than scanning all the
+     insns.  */
 
   decomposable_context = BITMAP_ALLOC (NULL);
   non_decomposable_context = BITMAP_ALLOC (NULL);
Index: rtl.h
===================================================================
--- rtl.h	(revision 185969)
+++ rtl.h	(working copy)
@@ -2523,6 +2523,9 @@ extern void init_expmed (void);
 extern void expand_inc (rtx, rtx);
 extern void expand_dec (rtx, rtx);
 
+/* In lower-subreg.c */
+extern void init_lower_subreg (void);
+
 /* In gcse.c */
 extern bool can_copy_p (enum machine_mode);
 extern bool can_assign_to_reg_without_clobbers_p (rtx);