Message ID: 20121028131312.poycwzqkg0kc4gc8-nzlynne@webmail.spamcop.net
State: New
On Sun, Oct 28, 2012 at 6:13 PM, Joern Rennecke <joern.rennecke@embecosm.com> wrote: > Quoting Richard Biener <richard.guenther@gmail.com>: > >>> Thus, you can allow the length to vary downwards as well as upwards >>> across iterations with suitable definitions of the @code{length} >>> attribute >>> and/or @code{ADJUST_INSN_LENGTH}. Care has to be taken that this does >>> not >>> lead to infinite loops. >> >> >> I don't see that you can shrink length with just suitable lock_length and >> length attributes. > > > I disagree there (for certain values of 'just'), but we can just agree > to disagree on this point because... > >> What seems to be the crucial difference is that you >> apply ADJUST_INSN_LENGTH after combining lock-length-max and length. >> But then you _decrease_ length with ADJUST_INSN_LENGTH ... >> >> Maybe what you really want is ADJUST_INSN_LENGTH_AFTER which is >> applied afterwards? Thus, > > > Well, actually, I found a number of problems with ADJUST_INSN_LENGTH: > - it is not applied to inner insns in a delay slot sequence. You can > sort of work around this by stitching recursive calls together, but > then you are faced with another problem: > - You don't know what the length prior to ADJUST_INSN_LENGTH was. That > was even worse with the non-unique uids, where you didn't know squat > about the instructions in the delay slot. > - As you said, I want the adjustment to happen after the maximum calculation. > Well, usually. There are actually special cases where it would be useful > to have some sort of maximum calculation in place, to break up > alignment-driven cycles, but only applicable with a lesser priority than > for the range-driven branch offset calculations. > > But adding yet another macro neither solves all these problems, nor > would it align with our goal to move away from target macros.
> > I have now found an alternate way to make the ARC port terminate building its > insn-attrtab.c - it involves using match_test("get_attr_length (insn) == 2") > instead of eq_attr("[lock_]length" (const_int 2)) - where I really want to > know if the instruction was considered short in the previous iteration. > > So, I have made a patch to replace the ADJUST_INSN_LENGTH macro in final.c > with an adjust_insn_length hook, for which a default implementation > using ADJUST_INSN_LENGTH is provided in targhooks.c to provide for an > easy transition. > I've looked at the existing ports that use ADJUST_INSN_LENGTH, and some > seem to prefer to return an adjustment to be added to the length, while > others prefer to return the new length. The latter seemed to be slightly > more numerous, so I went with that. > > The hook has two new parameters: > - a flag that tells it if the insn in question is part of a delay sequence. > The default hook implementation skips the invocation of ADJUST_INSN_LENGTH > in this case for the sake of compatibility. > - a pointer to int to set the number of iterations until the length > locking feature is supposed to apply to this instruction length when > using the increasing size calculations. The pointed-to value is > initialized > to zero, which means that length locking is always applied (assuming > final.c > uses the increasing length algorithm). Setting this to a higher number > effectively gives the new instruction length a lower priority to be put > into uid_lock_length. > >> Note that >> >>> Care has to be taken that this does not >>> lead to infinite loops. >> >> >> doesn't convince me that it is properly designed ;) > > > With the hook mechanism, it is much harder to create an infinite loop > inside shorten_branches. (It would involve something like setting > iteration_threshold to MAX_INT and making the length decrease when niter > is at MAX_INT, then letting integer overflow of niter take its course.
> Making it impossible for a port maintainer to get things wrong is not a > meaningful goal for GCC, but making it straightforward to get it right is.) > > This patch builds on: > http://gcc.gnu.org/ml/gcc-patches/2012-10/msg02527.html > > Bootstrapped in revision 192879 on i686-pc-linux-gnu. > Tested with config-list.mk on x86_64-unknown-linux-gnu. Apart from the iteration_threshold the hookization would be straightforward. Now I cannot decipher from the patch what functional change it introduces ;) There are also only 10 users of ADJUST_INSN_LENGTH, not enough to warrant introducing a hook without removing the macro IMHO. Thanks, Richard. > 2012-10-28 Joern Rennecke <joern.rennecke@embecosm.com> > > * doc/tm.texi.in (@hook TARGET_ADJUST_INSN_LENGTH): Add. > * doc/tm.texi: Regenerate. > * final.c (get_attr_length_1): Use targetm.adjust_insn_length > instead of ADJUST_INSN_LENGTH. > (adjust_length): New function. > (shorten_branches): Use adjust_length instead of ADJUST_INSN_LENGTH. > * target.def (adjust_insn_length): New hook. > * targhooks.c (default_adjust_insn_length): New function. > * targhooks.h (default_adjust_insn_length): Declare. > > diff -drup gcc-192879-haveattr/doc/tm.texi gcc/doc/tm.texi > --- gcc-192879-haveattr/doc/tm.texi 2012-10-28 01:07:38.463469923 +0000 > +++ gcc/doc/tm.texi 2012-10-28 01:31:15.522227927 +0000 > @@ -11341,3 +11341,7 @@ This value should be set if the result w > @deftypevrx {Target Hook} {bool} TARGET_HAVE_ATTR_LENGTH > These flags are automatically generated; you should not override them in > tm.c: > @end deftypevr > + > +@deftypefn {Target Hook} int TARGET_ADJUST_INSN_LENGTH (rtx @var{insn}, int > @var{length}, bool @var{in_delay_sequence}, int *@var{iteration_threshold}) > +Return an adjusted length for @var{insn}. @var{length} is the value that > has been calculated using the @code{length} instruction attribute. > @var{in_delay_sequence} is true if @var{insn} forms part of a delay sequence.
> *@var{iteration_threshold} specifies the number of branch shortening > iterations before length decreases are inhibited. The default > implementation uses @code{ADJUST_INSN_LENGTH}, if defined. > +@end deftypefn > diff -drup gcc-192879-haveattr/doc/tm.texi.in gcc/doc/tm.texi.in > --- gcc-192879-haveattr/doc/tm.texi.in 2012-10-28 01:07:38.489470252 +0000 > +++ gcc/doc/tm.texi.in 2012-10-28 01:31:55.578745701 +0000 > @@ -11175,3 +11175,5 @@ memory model bits are allowed. > @hook TARGET_ATOMIC_TEST_AND_SET_TRUEVAL > > @hook TARGET_HAVE_CC0 > + > +@hook TARGET_ADJUST_INSN_LENGTH > diff -drup gcc-192879-haveattr/final.c gcc/final.c > --- gcc-192879-haveattr/final.c 2012-10-28 01:07:38.492470288 +0000 > +++ gcc/final.c 2012-10-28 15:24:49.731169660 +0000 > @@ -424,10 +424,8 @@ get_attr_length_1 (rtx insn, int (*fallb > break; > } > > -#ifdef ADJUST_INSN_LENGTH > - ADJUST_INSN_LENGTH (insn, length); > -#endif > - return length; > + int dummy = -1; > + return targetm.adjust_insn_length (insn, length, false, &dummy); > } > > /* Obtain the current length of an insn. If branch shortening has been > done, > @@ -827,6 +825,56 @@ struct rtl_opt_pass pass_compute_alignme > }; > > > +/* Helper function for shorten_branches, to apply the adjust_insn_length > + target hook, and lock in length increases in order to avoid infinite > + iteration cycles. > + INSN is the instruction being processed, and new_length is the length > + that has been calculated for it from its length attribute. UID is > + its UID. SEQ_P indicates if it is inside a delay slot sequence. > + INSN_LENGTHS and UID_LOCK_LENGTH are the arrays holding the current > lengths > + and locked-in maximum lengths from the previous iterations, and are > + updated for UID when appropriate. > + *NITER contains the number of iterations since we last made a change > + to uid_lock_length. > + *SOMETHING_CHANGED will be set if we make a change to INSN_LENGTHS.
> + The return value is the new length taking the result of the > + adjust_insn_length and uid_lock_length into account. > + Note that when *NITER is negative, no length locking is effective, and > + the new lengths are just assigned. This is used in the initial pass, > + and when only making a single pass to update instruction length > downward. */ > +static int > +adjust_length (rtx insn, int new_length, int uid, int *insn_lengths, > + int *uid_lock_length, bool seq_p, int *niter, > + bool *something_changed) > + > +{ > + int iter_threshold = 0; > + new_length > + = targetm.adjust_insn_length (insn, new_length, seq_p, > &iter_threshold); > + if (new_length < 0) > + fatal_insn ("negative insn length", insn); > + if (iter_threshold < 0) > + fatal_insn ("negative shorten_branches iteration threshold", insn); > + if (new_length != insn_lengths[uid]) > + { > + if (new_length < uid_lock_length[uid]) > + new_length = uid_lock_length[uid]; > + if (new_length == insn_lengths[uid]) > + ; /* done here. */ > + else if (*niter < iter_threshold > + || new_length > insn_lengths[uid]) > + { > + if (*niter >= iter_threshold) > + uid_lock_length[uid] = new_length, *niter = 0; > + insn_lengths[uid] = new_length; > + *something_changed = true; > + } > + else > + new_length = insn_lengths[uid]; > + } > + return new_length; > +} > + > /* Make a pass over all insns and compute their actual lengths by > shortening > any branches of variable length if possible. */ > > @@ -848,7 +896,7 @@ shorten_branches (rtx first) > int max_skip; > #define MAX_CODE_ALIGN 16 > rtx seq; > - int something_changed = 1; > + bool something_changed = true; > char *varying_length; > rtx body; > int uid; > @@ -964,7 +1012,8 @@ shorten_branches (rtx first) > return; > > /* Allocate the rest of the arrays. 
*/ > - insn_lengths = XNEWVEC (int, max_uid); > + insn_lengths = XCNEWVEC (int, max_uid); > + int *uid_lock_length = XCNEWVEC (int, max_uid); > insn_lengths_max_uid = max_uid; > /* Syntax errors can lead to labels being outside of the main insn > stream. > Initialize insn_addresses, so that we get reproducible results. */ > @@ -1065,6 +1114,7 @@ shorten_branches (rtx first) > > /* Compute initial lengths, addresses, and varying flags for each insn. > */ > int (*length_fun) (rtx) = increasing ? insn_min_length : > insn_default_length; > + int niter = increasing ? 0 : -1; > > for (insn_current_address = 0, insn = first; > insn != 0; > @@ -1072,8 +1122,6 @@ shorten_branches (rtx first) > { > uid = INSN_UID (insn); > > - insn_lengths[uid] = 0; > - > if (LABEL_P (insn)) > { > int log = LABEL_TO_ALIGNMENT (insn); > @@ -1093,6 +1141,8 @@ shorten_branches (rtx first) > if (INSN_DELETED_P (insn)) > continue; > > + int length = 0; > + > body = PATTERN (insn); > if (GET_CODE (body) == ADDR_VEC || GET_CODE (body) == ADDR_DIFF_VEC) > { > @@ -1100,13 +1150,12 @@ shorten_branches (rtx first) > section. */ > if (JUMP_TABLES_IN_TEXT_SECTION > || readonly_data_section == text_section) > - insn_lengths[uid] = (XVECLEN (body, > - GET_CODE (body) == ADDR_DIFF_VEC) > - * GET_MODE_SIZE (GET_MODE (body))); > + length = (XVECLEN (body, GET_CODE (body) == ADDR_DIFF_VEC) > + * GET_MODE_SIZE (GET_MODE (body))); > /* Alignment is handled by ADDR_VEC_ALIGN. 
*/ > } > else if (GET_CODE (body) == ASM_INPUT || asm_noperands (body) >= 0) > - insn_lengths[uid] = asm_insn_count (body) * insn_default_length > (insn); > + length = asm_insn_count (body) * insn_default_length (insn); > else if (GET_CODE (body) == SEQUENCE) > { > int i; > @@ -1134,7 +1183,10 @@ shorten_branches (rtx first) > else > inner_length = inner_length_fun (inner_insn); > > - insn_lengths[inner_uid] = inner_length; > + inner_length > + = adjust_length (insn, inner_length, inner_uid, > insn_lengths, > + uid_lock_length, true, &niter, > + &something_changed); > if (const_delay_slots) > { > if ((varying_length[inner_uid] > @@ -1145,21 +1197,18 @@ shorten_branches (rtx first) > } > else > varying_length[inner_uid] = 0; > - insn_lengths[uid] += inner_length; > + length += inner_length; > } > } > else if (GET_CODE (body) != USE && GET_CODE (body) != CLOBBER) > { > - insn_lengths[uid] = length_fun (insn); > + length = length_fun (insn); > varying_length[uid] = insn_variable_length_p (insn); > } > - > + > /* If needed, do any adjustment. */ > -#ifdef ADJUST_INSN_LENGTH > - ADJUST_INSN_LENGTH (insn, insn_lengths[uid]); > - if (insn_lengths[uid] < 0) > - fatal_insn ("negative insn length", insn); > -#endif > + adjust_length (insn, length, uid, insn_lengths, uid_lock_length, > + false, &niter, &something_changed); > } > > /* Now loop over all the insns finding varying length insns. For each, > @@ -1168,16 +1217,16 @@ shorten_branches (rtx first) > > while (something_changed) > { > - something_changed = 0; > + if (increasing) > + niter++; > + > + something_changed = false; > insn_current_align = MAX_CODE_ALIGN - 1; > for (insn_current_address = 0, insn = first; > insn != 0; > insn = NEXT_INSN (insn)) > { > int new_length; > -#ifdef ADJUST_INSN_LENGTH > - int tmp_length; > -#endif > int length_align; > > uid = INSN_UID (insn); > @@ -1363,17 +1412,13 @@ shorten_branches (rtx first) > if (! 
varying_length[inner_uid]) > inner_length = insn_lengths[inner_uid]; > else > - inner_length = insn_current_length (inner_insn); > - > - if (inner_length != insn_lengths[inner_uid]) > { > - if (!increasing || inner_length > > insn_lengths[inner_uid]) > - { > - insn_lengths[inner_uid] = inner_length; > - something_changed = 1; > - } > - else > - inner_length = insn_lengths[inner_uid]; > + inner_length = insn_current_length (inner_insn); > + > + inner_length > + = adjust_length (insn, inner_length, inner_uid, > + insn_lengths, uid_lock_length, > true, > + &niter, &something_changed); > } > insn_current_address += inner_length; > new_length += inner_length; > @@ -1385,21 +1430,13 @@ shorten_branches (rtx first) > insn_current_address += new_length; > } > > -#ifdef ADJUST_INSN_LENGTH > /* If needed, do any adjustment. */ > - tmp_length = new_length; > - ADJUST_INSN_LENGTH (insn, new_length); > + int tmp_length = new_length; > + new_length > + = adjust_length (insn, new_length, uid, insn_lengths, > + uid_lock_length, false, &niter, > + &something_changed); > insn_current_address += (new_length - tmp_length); > -#endif > - > - if (new_length != insn_lengths[uid] > - && (!increasing || new_length > insn_lengths[uid])) > - { > - insn_lengths[uid] = new_length; > - something_changed = 1; > - } > - else > - insn_current_address += insn_lengths[uid] - new_length; > } > /* For a non-optimizing compile, do only a single pass. */ > if (!increasing) > diff -drup gcc-192879-haveattr/target.def gcc/target.def > --- gcc-192879-haveattr/target.def 2012-10-28 01:07:38.517470606 +0000 > +++ gcc/target.def 2012-10-28 01:31:15.525227979 +0000 > @@ -2818,6 +2818,17 @@ DEFHOOK > enum unwind_info_type, (void), > default_debug_unwind_info) > > +DEFHOOK > +(adjust_insn_length, > + "Return an adjusted length for @var{insn}. @var{length} is the value > that\ > + has been calculated using the @code{length} instruction attribute. 
\ > + @var{in_delay_sequence} is true if @var{insn} forms part of a delay sequence. \ > + *@var{iteration_threshold} specifies the number of branch shortening\ > + iterations before length decreases are inhibited. The default\ > + implementation uses @code{ADJUST_INSN_LENGTH}, if defined.", > + int, (rtx insn, int length, bool in_delay_sequence, int > *iteration_threshold), > + default_adjust_insn_length) > + > DEFHOOKPOD > (atomic_test_and_set_trueval, > "This value should be set if the result written by\ > diff -drup gcc-192879-haveattr/targhooks.c gcc/targhooks.c > --- gcc-192879-haveattr/targhooks.c 2012-10-25 03:33:26.000000000 +0100 > +++ gcc/targhooks.c 2012-10-28 01:41:35.082931024 +0000 > @@ -1540,4 +1540,20 @@ default_member_type_forces_blk (const_tr > return false; > } > > +/* Default version of adjust_insn_length. */ > + > +int > +default_adjust_insn_length (rtx insn ATTRIBUTE_UNUSED, int length, > + bool in_delay_sequence, int *iter_threshold) > +{ > + if (in_delay_sequence) > + *iter_threshold = INT_MAX; > +#ifdef ADJUST_INSN_LENGTH > + else > + ADJUST_INSN_LENGTH (insn, length); > +#endif > + return length; > +} > + > + > #include "gt-targhooks.h" > diff -drup gcc-192879-haveattr/targhooks.h gcc/targhooks.h > --- gcc-192879-haveattr/targhooks.h 2012-10-25 03:33:26.000000000 +0100 > +++ gcc/targhooks.h 2012-10-28 01:31:15.479227682 +0000 > @@ -193,3 +193,4 @@ extern const char *default_pch_valid_p ( > extern void default_asm_output_ident_directive (const char*); > > extern bool default_member_type_forces_blk (const_tree, enum machine_mode); > +extern int default_adjust_insn_length (rtx, int, bool, int *); >
Quoting Richard Biener <richard.guenther@gmail.com>: > Apart from the iteration_threshold the hookization would be straight-forward. > Now I cannot decipher from the patch what functional change it introduces ;) The only change occurs if we reach an iteration count of MAX_INT - which should already be indicative of a problem. At the MAX_INTth iteration, we will apply the length locking logic to instructions inside a delay slot sequence as well. If we disregard this exceptional case, there should be no functional changes unless someone defines TARGET_ADJUST_INSN_LENGTH. uid_lock_length gets allocated with XCNEWVEC. So, in particular, the elements corresponding to instructions inside delay slot sequences are initialized to zero. As the default hook sets *iter_threshold to MAX_INT when inside a delay sequence, and doesn't change length, the max operation with uid_lock_length is a no-op, and *niter < iter_threshold is true, hence a length change results in updating the length immediately, without changing uid_lock_length. In the case that we are not inside a delay slot sequence, the default hook leaves iter_threshold as 0, and applies ADJUST_INSN_LENGTH. Thus, when niter is 0 or larger, as is the case for the ordinary looping operation, we always find *niter >= iter_threshold, and thus apply the length locking mechanism. If we are in the preliminary pass, or doing the single !increasing iteration, niter is set to -1, so *niter < iter_threshold is always true, so again we update the length immediately. > There are also ony 10 users of ADJUST_INSN_LENGTH, not enough to > warrant introducing a hook without removing the macro IMHO. Ok, I can prepare a patch for that, even though it makes it a bit less obvious that there's no functional change. What about the preparatory patch http://gcc.gnu.org/ml/gcc-patches/2012-10/msg02527.html ? Can I check that in now?
On Tue, Oct 30, 2012 at 5:14 PM, Joern Rennecke <joern.rennecke@embecosm.com> wrote: > Quoting Richard Biener <richard.guenther@gmail.com>: > >> Apart from the iteration_threshold the hookization would be >> straight-forward. >> Now I cannot decipher from the patch what functional change it introduces >> ;) > > > The only change occurs if we reach an iteration count of MAX_INT iterations > - > which should already be indicative of a problem. > > At the MAX_INTth iteration, we will apply the length locking logic to > instructions inside a delay slot sequence as well. > > If we disregard this exceptional case, there should be no functional changes > unless someone defines TARGET_ADJUST_INSN_LENGTH. > > uid_lock_length gets initialized to allocated with XCNEWVEC. So, in > particular, the elements corresponding to instructions inside delay slot > sequences are initialized to zero. > > As default hook sets *iter_threshold to MAX_INT when inside a delay > sequence, > and doesn't change length, the max operation with uid_lock_length is a > no-op, > and *niter < iter_threshold is true, hence a length change results in > updating the length immediately, without changing uid_lock_length. > > In the case that we are not inside a delay slot sequence, the default hook > leaves iter_threshold as 0, and applies ADJUST_INSN_LENGTH. Thus, when > niter > is 0 or larger, as is the case for the ordinary looping operation, we > always find *niter >= iter_threshold, and thus apply the length locking > mechanism. > > If we are in the preliminary pass, or doing the single !increasing > iteration, > niter is set to -1, so *inter < iter_threshold is always true, so again we > update the length immediately. Ok, I see. The patch is ok then. >> There are also ony 10 users of ADJUST_INSN_LENGTH, not enough to >> warrant introducing a hook without removing the macro IMHO. 
> > > Ok, I can prepare a patch for that, even though it makes it even a bit > less obvious that there's no functional change. That would be nice (you can do that as followup). > What about the preparatory patch > http://gcc.gnu.org/ml/gcc-patches/2012-10/msg02527.html ? > Can I check that in now? Richard S. had some comments there so I defer to him. Richard.
Richard Biener <richard.guenther@gmail.com> writes: > On Tue, Oct 30, 2012 at 5:14 PM, Joern Rennecke > <joern.rennecke@embecosm.com> wrote: >> Quoting Richard Biener <richard.guenther@gmail.com>: >> >>> Apart from the iteration_threshold the hookization would be >>> straight-forward. >>> Now I cannot decipher from the patch what functional change it introduces >>> ;) >> >> >> The only change occurs if we reach an iteration count of MAX_INT iterations >> - >> which should already be indicative of a problem. >> >> At the MAX_INTth iteration, we will apply the length locking logic to >> instructions inside a delay slot sequence as well. >> >> If we disregard this exceptional case, there should be no functional changes >> unless someone defines TARGET_ADJUST_INSN_LENGTH. >> >> uid_lock_length gets initialized to allocated with XCNEWVEC. So, in >> particular, the elements corresponding to instructions inside delay slot >> sequences are initialized to zero. >> >> As default hook sets *iter_threshold to MAX_INT when inside a delay >> sequence, >> and doesn't change length, the max operation with uid_lock_length is a >> no-op, >> and *niter < iter_threshold is true, hence a length change results in >> updating the length immediately, without changing uid_lock_length. >> >> In the case that we are not inside a delay slot sequence, the default hook >> leaves iter_threshold as 0, and applies ADJUST_INSN_LENGTH. Thus, when >> niter >> is 0 or larger, as is the case for the ordinary looping operation, we >> always find *niter >= iter_threshold, and thus apply the length locking >> mechanism. >> >> If we are in the preliminary pass, or doing the single !increasing >> iteration, >> niter is set to -1, so *inter < iter_threshold is always true, so again we >> update the length immediately. > > Ok, I see. > > The patch is ok then. I should probably have piped up earlier, but I'm really not sure about it. ADJUST_INSN_LENGTH as defined now is telling us a property of the target. 
The patch instead ties the new hook directly into the shorten_branches algorithm, which I think is a bad idea. IMO, the length attributes and ADJUST_INSN_LENGTH should be treated as a fused process: every value returned by a length attribute (whether min, default or current) should be filtered through ADJUST_INSN_LENGTH before being used. Whether ADJUST_INSN_LENGTH decreases or increases the original length doesn't matter, because only the output of ADJUST_INSN_LENGTH should be used. Joern's patch makes us use ADJUST_INSN_LENGTH for delay slot insns, which I agree is a good thing in itself. It's the all-important iteration parameter that feels wrong. TBH, I still don't understand what property of the target we're trying to describe here. I gather it has something to do with alignment. But I'd really like a description of that target property, independent of the shorten_branches implementation. Richard
On Wed, Oct 31, 2012 at 11:22 AM, Richard Sandiford <rdsandiford@googlemail.com> wrote: > Richard Biener <richard.guenther@gmail.com> writes: >> On Tue, Oct 30, 2012 at 5:14 PM, Joern Rennecke >> <joern.rennecke@embecosm.com> wrote: >>> Quoting Richard Biener <richard.guenther@gmail.com>: >>> >>>> Apart from the iteration_threshold the hookization would be >>>> straight-forward. >>>> Now I cannot decipher from the patch what functional change it introduces >>>> ;) >>> >>> >>> The only change occurs if we reach an iteration count of MAX_INT iterations >>> - >>> which should already be indicative of a problem. >>> >>> At the MAX_INTth iteration, we will apply the length locking logic to >>> instructions inside a delay slot sequence as well. >>> >>> If we disregard this exceptional case, there should be no functional changes >>> unless someone defines TARGET_ADJUST_INSN_LENGTH. >>> >>> uid_lock_length gets initialized to allocated with XCNEWVEC. So, in >>> particular, the elements corresponding to instructions inside delay slot >>> sequences are initialized to zero. >>> >>> As default hook sets *iter_threshold to MAX_INT when inside a delay >>> sequence, >>> and doesn't change length, the max operation with uid_lock_length is a >>> no-op, >>> and *niter < iter_threshold is true, hence a length change results in >>> updating the length immediately, without changing uid_lock_length. >>> >>> In the case that we are not inside a delay slot sequence, the default hook >>> leaves iter_threshold as 0, and applies ADJUST_INSN_LENGTH. Thus, when >>> niter >>> is 0 or larger, as is the case for the ordinary looping operation, we >>> always find *niter >= iter_threshold, and thus apply the length locking >>> mechanism. >>> >>> If we are in the preliminary pass, or doing the single !increasing >>> iteration, >>> niter is set to -1, so *inter < iter_threshold is always true, so again we >>> update the length immediately. >> >> Ok, I see. >> >> The patch is ok then. 
> > I should probably have piped up earlier, but I'm really not sure about it. > ADJUST_INSN_LENGTH as defined now is telling us a property of the target. > The patch instead ties the new hook directly into the shorten_branches > algorithm, which I think is a bad idea. > > IMO, the length attributes and ADJUST_INSN_LENGTH should be treated as > a fused process: every value returned by a length attribute (whether min, > default or current) should be filtered through ADJUST_INSN_LENGTH before > being used. Whether ADJUST_INSN_LENGTH decreases or increases the original > length doesn't matter, because only the output of ADJUST_INSN_LENGTH should > be used. > > Joern's patch makes us use ADJUST_INSN_LENGTH for delay slot insns, > which I agree is a good thing in itself. It's the all-important > iteration parameter that feels wrong. > > TBH, I still don't understand what property of the target we're trying > to describe here. I gather it has something to do with alignment. > But I'd really like a description of that target property, independent > of the shorten_branches implementation. Yeah, the iteration parameter doesn't feel right, I agree. But ADJUST_INSN_LENGTH is only used in this code. Which is why I originally said that if the patch did not add the iteration parameter, it would be obvious ... So, consider my approval revoked. Richard. > Richard
Quoting Richard Sandiford <rdsandiford@googlemail.com>: > I should probably have piped up earlier, but I'm really not sure about it. > ADJUST_INSN_LENGTH as defined now is telling us a property of the target. > The patch instead ties the new hook directly into the shorten_branches > algorithm, which I think is a bad idea. > > IMO, the length attributes and ADJUST_INSN_LENGTH should be treated as > a fused process: every value returned by a length attribute (whether min, > default or current) should be filtered through ADJUST_INSN_LENGTH before > being used. Whether ADJUST_INSN_LENGTH decreases or increases the original > length doesn't matter, because only the output of ADJUST_INSN_LENGTH should > be used. > > Joern's patch makes us use ADJUST_INSN_LENGTH for delay slot insns, > which I agree is a good thing in itself. It's the all-important > iteration parameter that feels wrong. > > TBH, I still don't understand what property of the target we're trying > to describe here. I gather it has something to do with alignment. > But I'd really like a description of that target property, independent > of the shorten_branches implementation. It's about describing complex interactions of length adjustments that derive from branch shortening and length added for (un)alignment for scheduling purposes. Expressed naively, these can lead to cycles. I could manage with the combination of length / lock_length attribute, but there was a fine line between tuning for optimal scheduling and risking cycles. So, as Richard Biener has pointed out, that design was not optimal. What is really needed is a way to express priorities between instructions when to inhibit length shrinkage from one iteration to another. Ports that have no issues with negative feedback in the length calculations don't need to worry about priorities; the length locking mechanism will be a no-op anyways, so the question when it is being applied is moot. 
In the case you have negative feedback, the default, to apply length locking immediately to all instructions, gives the fastest convergence, although it will likely give a sub-optimal shortening solution. The length variations for the ARC are not all alike: there are branches that are subject to branch shortening in the usual way, but they might shrink when other things shrink. When we are iterating starting at minimal length and increasing lengths, these branches should be stopped from shrinking, as they likely will grow back again in the next iteration. OTOH there are potentially short instructions that are lengthened to achieve scheduling-dictated alignment or misalignment. These should be allowed to change freely back and forth. Unless we have a rare cycle that only depends on alignments. So, to get this right without having the port writer worry too much about when a length adjustment causes a cycle, we want all varying length instructions subject to length locking to break cycles, but not with the same priority. Specifying this as an iteration count is an easy way to implement this, without having to make shorten_branches identify every cycle and the participating instructions that are oscillating in length. The latter would actually run up against computability problems in distinguishing a single complex cycle and multiple independent cycles. If you like, we can describe the iter_threshold parameter in a different way, just saying that it indicates a priority, 0 being the first to get length locking applied, and the larger the number, the later.
Joern Rennecke <joern.rennecke@embecosm.com> writes: > Quoting Richard Sandiford <rdsandiford@googlemail.com>: >> I should probably have piped up earlier, but I'm really not sure about it. >> ADJUST_INSN_LENGTH as defined now is telling us a property of the target. >> The patch instead ties the new hook directly into the shorten_branches >> algorithm, which I think is a bad idea. >> >> IMO, the length attributes and ADJUST_INSN_LENGTH should be treated as >> a fused process: every value returned by a length attribute (whether min, >> default or current) should be filtered through ADJUST_INSN_LENGTH before >> being used. Whether ADJUST_INSN_LENGTH decreases or increases the original >> length doesn't matter, because only the output of ADJUST_INSN_LENGTH should >> be used. >> >> Joern's patch makes us use ADJUST_INSN_LENGTH for delay slot insns, >> which I agree is a good thing in itself. It's the all-important >> iteration parameter that feels wrong. >> >> TBH, I still don't understand what property of the target we're trying >> to describe here. I gather it has something to do with alignment. >> But I'd really like a description of that target property, independent >> of the shorten_branches implementation. > > It's about describing complex interactions of length adjustments that > derive from branch shortening and length added for (un)alignment for > scheduling purposes. Expressed naively, these can lead to cycles. But shorten_branches should be written to avoid cycles, and I thought your patch did that by making sure that the address of each instruction only increases. > I could manage with the combination of length / lock_length attribute, > but there was a fine line between tuning for optimal scheduling and > risking cycles. > So, as Richard Biener has pointed out, that design was not optimal. > > What is really needed is a way to express priorities between instructions > when to inhibit length shrinkage from one iteration to another. 
> > Ports that have no issues with negative feedback in the length calculations > don't need to worry about priorities; the length locking mechanism > will be a no-op anyways, so the question when it is being applied is moot. > > In the case you have negative feedback, the default, to apply length > locking immediatly to all instructions, gives the fastest convergence, > although it will likely give a sub-optimal shortening solution. > > The length variation for the ARC are not alike: there are branches that > are subject to branch shortening in the usual way, but they might > shrink when other things shrink. When we are iterating starting at > minimal length and increasing lengths, these branches should be stopped > from shrinking, as they likly will grow back again in the next iteration. > OTOH there are potentially short instructions that are lengthened to > archive scheduling-dictated alignment or misalignment. These should > be allowed to change freely back and forth. Unless we have a rare > cycle that only depends on alignments. Hmm, but this is still talking in terms of shorten_branches, rather than the target property that we're trying to model. It sounds like the property is something along the lines of: some instructions have different encodings (or work-alikes with different encodings), and we want to make all the encodings available to the optimisers. Is that right? If so, that sounds like a useful thing to model, but I think it should be modelled directly, rather than as a modification to the shorten_branches algorithm. Is the alignment/misalignment explicitly modelled in the rtl? If so, how? Using labels, or something else? Richard
Quoting Richard Sandiford <rdsandiford@googlemail.com>: >> It's about describing complex interactions of length adjustments that >> derive from branch shortening and length added for (un)alignment for >> scheduling purposes. Expressed naively, these can lead to cycles. > > But shorten_branches should be written to avoid cycles, and I thought > your patch did that by making sure that the address of each instruction > only increases. This actually gives a very poor branch shortening result for ARC. What I did before with the lock_length attribute was only locking in increases for the length of branches, and preserving a complex set of invariants to avoid cycles from the (mis)alignment padding. Saying that for some instructions, the length locking comes in effect only after a few iterations of not permanently increasing the size of other instructions is still cycle-safe. >> The length variation for the ARC are not alike: there are branches that >> are subject to branch shortening in the usual way, but they might >> shrink when other things shrink. When we are iterating starting at >> minimal length and increasing lengths, these branches should be stopped >> from shrinking, as they likly will grow back again in the next iteration. >> OTOH there are potentially short instructions that are lengthened to >> archive scheduling-dictated alignment or misalignment. These should >> be allowed to change freely back and forth. Unless we have a rare >> cycle that only depends on alignments. > > Hmm, but this is still talking in terms of shorten_branches, rather than > the target property that we're trying to model. It sounds like the property > is something along the lines of: some instructions have different encodings > (or work-alikes with different encodings), and we want to make all the > encodings available to the optimisers. Is that right? No, not really. There are lots of redundant encodings, but we have only one favourite encoding for each size. 
We usually want the shortest version, unless we want to tweak the alignment of a following instruction, or in the case of unaligned conditional branches with unfilled delay slot, to hide the branch penalty (don't ask me why it has that effect...). > If so, that sounds > like a useful thing to model, but I think it should be modelled directly, > rather than as a modification to the shorten_branches algorithm. The shorten_branches algorithm can't give the correct result without taking these things into account. Also, instruction lengths are needed to know the requirements. So, do you want me to run a scheduling pass in each iteration of shorten_branches? > Is the alignment/misalignment explicitly modelled in the rtl? If so, how? > Using labels, or something else? No, it works with instruction addresses during branch shortening, and extra bits in cfun->machine during instruction output to keep track of the current alignment.
On Wed, Oct 31, 2012 at 1:33 PM, Joern Rennecke <joern.rennecke@embecosm.com> wrote: > Quoting Richard Sandiford <rdsandiford@googlemail.com>: > >>> It's about describing complex interactions of length adjustments that >>> derive from branch shortening and length added for (un)alignment for >>> scheduling purposes. Expressed naively, these can lead to cycles. >> >> >> But shorten_branches should be written to avoid cycles, and I thought >> your patch did that by making sure that the address of each instruction >> only increases. > > > This actually gives a very poor branch shortening result for ARC. > What I did before with the lock_length attribute was only locking in > increases > for the length of branches, and preserving a complex set of invariants to > avoid cycles from the (mis)alignment padding. > > Saying that for some instructions, the length locking comes in effect > only after a few iterations of not permanently increasing the size of > other instructions is still cycle-safe. > > >>> The length variation for the ARC are not alike: there are branches that >>> are subject to branch shortening in the usual way, but they might >>> shrink when other things shrink. When we are iterating starting at >>> minimal length and increasing lengths, these branches should be stopped >>> from shrinking, as they likly will grow back again in the next iteration. >>> OTOH there are potentially short instructions that are lengthened to >>> archive scheduling-dictated alignment or misalignment. These should >>> be allowed to change freely back and forth. Unless we have a rare >>> cycle that only depends on alignments. >> >> >> Hmm, but this is still talking in terms of shorten_branches, rather than >> the target property that we're trying to model. It sounds like the >> property >> is something along the lines of: some instructions have different >> encodings >> (or work-alikes with different encodings), and we want to make all the >> encodings available to the optimisers. 
Is that right? > > > No, not really. There are lots of redundant encodings, but we have only > one favourite encoding for each size. We usually want the shortest version, > unless we want to tweak the alignment of a following instruction, or in > the case of unaligned conditional branches with unfilled delay slot, to > hide the branch penalty (don't ask me why it has that effect...). Maybe we should split the thing then into a adjust_insn_length attribute without the iteration parameter and a hook, similar to the various scheduler hooks, that is special for branch shortening. Richard. > >> If so, that sounds >> like a useful thing to model, but I think it should be modelled directly, >> rather than as a modification to the shorten_branches algorithm. > > > The shorten_branches algorithm can't give the correct result without > taking these things into account. Also, instruction lengths are needed > to know the requirements. So, do you want me to run a scheduling > pass in each iteration of shorten_branches? > > >> Is the alignment/misalignment explicitly modelled in the rtl? If so, how? >> Using labels, or something else? > > > No, it works with instruction addresses during branch shortening, and > extra bits in cfun->machine during instruction output to keep track of > the current alignment.
Quoting Richard Biener <richard.guenther@gmail.com>: > Maybe we should split the thing then into a adjust_insn_length attribute > without the iteration parameter Attributes don't get any parameter but the instruction, and don't apply to delay slot SEQUENCEs.
On Wed, Oct 31, 2012 at 2:06 PM, Joern Rennecke <joern.rennecke@embecosm.com> wrote: > Quoting Richard Biener <richard.guenther@gmail.com>: > >> Maybe we should split the thing then into a adjust_insn_length attribute >> without the iteration parameter > > > Attributes don't get any parameter but the instruction, and don't apply > to delay slot SEQUENCEs. hook I meant, not attribute. Richard.
Quoting Richard Biener <richard.guenther@gmail.com>: > On Wed, Oct 31, 2012 at 2:06 PM, Joern Rennecke > <joern.rennecke@embecosm.com> wrote: >> Quoting Richard Biener <richard.guenther@gmail.com>: >> >>> Maybe we should split the thing then into a adjust_insn_length attribute >>> without the iteration parameter >> >> >> Attributes don't get any parameter but the instruction, and don't apply >> to delay slot SEQUENCEs. > > hook I meant, not attribute. All right. So I gather the first one is again called adjust_insn_length. What do we call the one for the iteration count (or priority, if you want to put it that way) for the lock_length mechanism to become effective?

int lock_length_iter_count (rtx_insn, bool in_delay_slot) ?
int lock_length_priority (rtx_insn, bool iter_count) ?
Joern Rennecke <joern.rennecke@embecosm.com> writes: > Quoting Richard Sandiford <rdsandiford@googlemail.com>: >>> The length variation for the ARC are not alike: there are branches that >>> are subject to branch shortening in the usual way, but they might >>> shrink when other things shrink. When we are iterating starting at >>> minimal length and increasing lengths, these branches should be stopped >>> from shrinking, as they likly will grow back again in the next iteration. >>> OTOH there are potentially short instructions that are lengthened to >>> archive scheduling-dictated alignment or misalignment. These should >>> be allowed to change freely back and forth. Unless we have a rare >>> cycle that only depends on alignments. >> >> Hmm, but this is still talking in terms of shorten_branches, rather than >> the target property that we're trying to model. It sounds like the property >> is something along the lines of: some instructions have different encodings >> (or work-alikes with different encodings), and we want to make all the >> encodings available to the optimisers. Is that right? > > No, not really. There are lots of redundant encodings, but we have only > one favourite encoding for each size. We usually want the shortest version, > unless we want to tweak the alignment of a following instruction, or in > the case of unaligned conditional branches with unfilled delay slot, to > hide the branch penalty (don't ask me why it has that effect...). But what I'm trying to get at is: why can't the backend tell shorten_branches about the amount of alignment/misalignment that the target wants, and where? Via an attribute or a hook, I don't mind which. But it should be declarative, rather than a part of the shorten_branches algorithm. Also... >> If so, that sounds >> like a useful thing to model, but I think it should be modelled directly, >> rather than as a modification to the shorten_branches algorithm. 
> > The shorten_branches algorithm can't give the correct result without > taking these things into account. Also, instruction lengths are needed > to know the requirements. So, do you want me to run a scheduling > pass in each iteration of shorten_branches? ...it looks like arc_reorg already runs shorten_branches in a loop. Is that right? So in a sense we already have iteration involving shorten_branches and the backend. I realise we want to cut down the number of iterations where possible, but at the same time, it seems like you're trying to do conditional execution conversion on the fly during shorten_branches, such that the rtl doesn't actually represent the instructions we output. And the performance problem you're seeing is that those behind-the-scenes changes are too confusing for the current shorten_branches code to handle well. E.g. sometimes you decide to "delete" branches that stay in the rtl stream as branches, and that got a length in a previous iteration. At some point, as you say, you have to commit to using a particular form (conditionally executed or not) in order to avoid cycles. But that seems like a conditional execution decision rather than a shorten_branches decision. (It'd also be good to represent it in the rtl if possible.) Richard
Quoting Richard Sandiford <rdsandiford@googlemail.com>: > But what I'm trying to get at is: why can't the backend tell > shorten_branches about the amount of alignment/misalignment > that the target wants, and where? Via an attribute or a hook, > I don't mind which. But it should be declarative, rather than > a part of the shorten_branches algorithm. So, if we want to go with a declarative approach, each instruction can have a plurality of variants, each described by a size, an instruction fetch alignment base, an instruction fetch alignment offset set, a cache line base, a cache line offset set, a cost as a branch destination - cached, a cost as a branch destination - uncached, a fall-through predicted cost, a fall-through mispredicted cost, a branch-taken-predicted cost, a branch-taken-mispredicted cost. Or, if we want to get into more detail, we should actually have scheduler instruction reservation descriptions instead of singular costs. A variant's availability and costs will depend on branch offsets, the presence, kind and size of delay slot instructions and further length-sensitive context (like the set of paths from the function entry, with aggregate instruction fetch sizes and probabilities), so you have to call a target hook each iteration to get the currently available set of variants. Then you have to see this in the context of relative execution frequencies (for cached/uncached branch destinations), and branch probabilities and the reliability of the probability information, and to-be-added annotations about branch predictability, to arrive at a near-optimal solution in reasonable time. Do you have any suggestions about what algorithm to use? 
FWIW, the current ARC description with the adjust_insn_length hook patch uses instruction fetch offsets 0 and 2 to base 4, branch-taken/non-taken costs with implicit machine-specific heuristics about predictability (which are not as good as I'd like them to be, but better than nothing), length-sensitive context information, branch probabilities and their reliability, and delay slot instruction presence and size. There are some canned rules about how different requirements interact with each other to get at a converging solution quickly and with minimal compromises. All of the other mentioned features would be in principle interesting, but not enough so far to justify implementing them in the previous length/lock_length attribute framework (which was prone to cycles if you were not very careful while adding new rules). I believe other ports would have different priorities, possibly within the set I enumerated, maybe going beyond that. > ...it looks like arc_reorg already runs shorten_branches in a loop. > Is that right? So in a sense we already have iteration involving > shorten_branches and the backend. Actually, that is mostly for instruction selection/splitting/combination for compare-and-branch/test-bit-and-branch instructions, which have very stringent offset requirements; these decisions need to be made (mostly; there is code to handle rare out-of-range offsets later, but it makes code quality worse) before delayed branch scheduling and final conditional execution decisions. But the fine-tuning of instruction sizes and alignments happens later in the original branch shortening pass. > I realise we want to cut down the number of iterations where possible, > but at the same time, it seems like you're trying to do conditional > execution conversion on the fly during shorten_branches, such that the > rtl doesn't actually represent the instructions we output. 
And the > performance problem you're seeing is that those behind-the-scenes > changes are too confusing for the current shorten_branches code to > handle well. E.g. sometimes you decide to "delete" branches that stay > in the rtl stream as branches, and that got a length in a previous > iteration. No, not really. Within each branch shortening pass, these decisions are static; they may change when the branches are modified after early branch shortening. They may also change after delay slot scheduling. If we want to get really precise with rtl, I have to alter it in every branch shortening iteration when the decision changes if a compare-and-branch/ test-bit-and branch is within range or not. If it's a single instruction, it is transparent to previous flag settings; if it isn't, we might be able to use the flag setting result for a successive branch or conditional execution. So then we'd need a way to describe this rtl modification depending on branch offset ranges. But that still won't get rid of the ARC requirement of running an extra branch shortening pass, unless you want to roll the delay slot scheduling into branch shortening as well. Others claim that delayed branch scheduling should be integrated with the normal scheduler, while there are also good arguments to integrate the scheduler with the register allocator. Just roll it all into one big pile of hair. Going back to pragmatic issues... The problem I had with the original GCC 4.4 framework was that branch shortening would not terminate for some functions when there is non-linear positive and negative feedback. The problem I have with the framework in the current trunk is that it applies length locking even to those instructions that it should not apply to - or at least, not so quickly, i.e. the short insns that are upsized for alignment purposes. When the input alignment changes, all previous decisions are wrong. 
> At some point, as you say, you have to commit to using a particular > form (conditionally executed or not) in order to avoid cycles. But that > seems like a conditional execution decision rather than a shorten_branches > decision. (It'd also be good to represent it in the rtl if possible.) Well, I could do this with additional machine-dependent rtl passes, but this is really a (costly, in terms of development time and maintenance) solution looking for a problem. The ARC ccfsm machinery was originally only run at instruction output time, and I modified it so I can run it earlier to tell what changes it would make to the instructions and their lengths to get better quality length predictions (most importantly, when optimizing for speed, it can at times increase code size slightly, which can be catastrophic when we committed to using a short branch at the edge of its offset range). Running the same code to make or predict decisions avoids issues with code getting out of sync that I would have with using a modified, rtl-modifying copy. By not actually modifying rtl, I don't have to allocate new uids during branch shortening (which would require branch shortening to restart from scratch or reallocate arrays and somehow map different instructions and uids), nor do I have to convert rtl back as the input changes, with the hazard of increasing the rtl each iteration.
Joern Rennecke <joern.rennecke@embecosm.com> writes: > Quoting Richard Sandiford <rdsandiford@googlemail.com>: > >> But what I'm trying to get at is: why can't the backend tell >> shorten_branches about the amount of alignment/misalignment >> that the target wants, and where? Via an attribute or a hook, >> I don't mind which. But it should be declarative, rather than >> a part of the shorten_branches algorithm. > > So, if we want to go with a declarative approach, each instruction can have > a plurality of variants, each described by a size, an instruction fetch > alignment base, an instruction fetch alignment offset set, a cache line base, > a cache line offset set, a cost as a branch destination - cached, > a cost as a branch destination - uncached, a fall-through predicted cost, > a fall-through mispredicted cost, a branch-taken-predicted cost, > a branch-taken-mispredicted cost. > Or, if we want to get into more detail, we should actually have a scheduler > instruction reservation descriptions instead of singular costs. > > A variants availability and costs will depend on branch offsets, > the presence, kind and size of delay slot instructions and further > length-sensitive context (like the set of paths from the function entry, > with aggregate instruction fetch sizes and probabilities), so you have > to call a target hook each iteration to get the currently available set > of variants. > > Then you have to see this in the context of relative execution frequencies > (for cached/uncached branch destinations), and branch probabilities and > the reliability of the probability information, and to-be-added annotations > about branch predictability. to arrive at a near-optimal solution in > reasonable time. Hey, I wasn't suggesting anything like that :-) I was just asking about alignment. 
It seemed to me that part of the problem is that ARC wants to increase the length of an instruction in order to align a following instruction (which I agree is a useful thing to support). But as things stand, shorten_branches has no way of distinguishing that from an instruction that was forced to grow because of offsets being out of range. If instead shorten_branches knows that an instruction prefers a particular alignment/misalignment, it can cooperate with the backend to extend earlier instructions to make that possible. If we do it that way, shorten_branches knows what's going on. You quote lots of complications above; that's also my concern. Directly hooking into ADJUST_INSN_LENGTH restricts us to a purely peephole optimisation with no view wider than the current instruction. E.g. for:

    insn1
    insn2
    insn3   -- wants to be misaligned by 2 bytes

hooking directly into ADJUST_INSN_LENGTH doesn't make it easy to consider using insn1 rather than insn2 to provide the misalignment. >> ...it looks like arc_reorg already runs shorten_branches in a loop. >> Is that right? So in a sense we already have iteration involving >> shorten_branches and the backend. > > Actually, that is mostly for instruction selection/splitting/combination > for compare-and-branch/test-bit-and-branch instructions, which have > very stringent offset requirements; these decisions need to be made > (mostly; there is code to handle rare out-of-range offsets later, but it > makes code quality worse) > before delayed branch scheduling and final conditional execution decisions. > > But the fine-tuning of instruction sizes and alignments happens later in > the original branch shortening pass. OK, but the same kind of iteration seems acceptable after delayed-branch scheduling too. 
>> I realise we want to cut down the number of iterations where possible, >> but at the same time, it seems like you're trying to do conditional >> execution conversion on the fly during shorten_branches, such that the >> rtl doesn't actually represent the instructions we output. And the >> performance problem you're seeing is that those behind-the-scenes >> changes are too confusing for the current shorten_branches code to >> handle well. E.g. sometimes you decide to "delete" branches that stay >> in the rtl stream as branches, and that got a length in a previous >> iteration. > > No, not really. Within each branch shortening pass, these decisions > are static; they may change when the branches are modified after > early branch shortening. They may also change after delay slot scheduling. > > If we want to get really precise with rtl, I have to alter it in every > branch shortening iteration when the decision changes if a > compare-and-branch/ test-bit-and branch is within range or not. I was suggesting that once you've committed to something (which I think in your proposal corresponds to the iteration count becoming too high) that commitment should be reflected in the rtl. That also stops the decision from being accidentally reversed. >> At some point, as you say, you have have to commit to using a particular >> form (conditionally executed or not) in order to avoid cycles. But that >> seems like a conditional execution decision rather than a shorten_branches >> decision. (It'd also be good to represent it in the rtl if possible.) > > Well, I could do this with additional machine-dependent rtl passes, but > this is really a (costly, in terms of development time and maintenance) > solution looking for a problem. 
> > The ARC ccfsm machinery was originally only run at instruction output > time, and I modified it so I can run it earlier to tell what changes > it would make to the instructions and their lengths to get better quality > length predictions (most importantly, when optimizing for speed, it can > at times increase code size slightly, which can be catastrophic when we > committed to using a short branch at the edge of its offset range. Running > the same code to make or predict decisions avoids issues with code getting > out of sync I would have with using a modified rtl-modifying copy. > By not actually modifying rtl, I don't have to allocate new uids during branch > shortening (which would require branch shortening to restart from scratch or > reallocate arrays and somehow map different instructions and uids), nor do > I have to convert rtl back as the input changes, with the hazard of > increasing the rtl each iteration. But I think extending the FSM from just being an on-the-fly transformation during final (as for ARM) to something that happens earlier is a pretty major step in terms of the interface between the backend and the rest of GCC. There are then undocumented restrictions on what hooks can be called when, which if broken could lead to the FSM producing different results during final than it did earlier. E.g., personally, I don't think it's acceptable for ADJUST_INSN_LENGTH to hard-code an assumption that all instructions are processed in assembly order, which the ARC one seems to (irrespective of this patch). That is, whenever ARC's ADJUST_INSN_LENGTH is called, it seems to "simulate" the instructions between the current instruction and the one previously passed to ADJUST_INSN_LENGTH, so it will break if we ever try to take lengths in a different order. I talked earlier about doing predication on the fly, but the ARC hook also does alignment optimisations on the fly. 
AIUI, the alignment optimisations can increase offsets, and increasing offsets interferes with the predication. As mentioned above, my main problem with this is that you seem to be saying "there are lots of complications, so we have to do everything at once", but at the same time you're suggesting something that forces both the target-independent and ARC code to process things in assembly order and look at each instruction in isolation. It makes it very difficult to do any of the three optimisations (branch shortening, predication, and alignment) in a less peephole way in future, such as by taking execution frequency into account. OTOH, I realise I'm probably not being helpful here. I've said my piece, so I'd better bow out. Richard
Quoting Richard Sandiford <rdsandiford@googlemail.com>: > But I think extending the FSM from just being an on-the-fly > transformatino during final (as for ARM) to something that happens > earlier is a pretty major step in terms of the interface between the > backend and the rest of GCC. There are then undocumented restrictions > on what hooks can be called when, Indeed. I found I have to call extract_constrain_insn_cached to restore recog_data (the struct, even though gdb will never believe this and insist on that being a type) if I have recognized other insns (by evaluating an attribute) from a splitter predicate. > which if broken could lead to the FSM > producing different results during final than it did earlier. Well, actually, I did find that this happens, and added code to either compensate for changes or inhibit them. For the most part, these changes are wanted, we just need to manage them to avoid out-of-range branches or ADDR_DIFF_VECs. You could think of this as simulated annealing for ice: some provisions need to be made to avoid rupturing the vessel, either by anticipating size increases, containing pressure to limit size increase, or a combination of both. > E.g., personally, I don't think it's acceptable for ADJUST_INSN_LENGTH > to hard-code an assumption that all instructions are processed in > assembly order, which the ARC one seems to (irrespective of this patch). It doesn't, at least not for correctness; you might argue that it depends on this for compilation time performance. It does a scan in round-robin fashion from the last insn it processed till it reaches the one that is to be processed. But I did develop this stuff when passes were much more rigid than they are today; I'll look into the feasibility of running the ccfsm in a machine-specific pass to finalize in rtl whatever COND_EXEC opportunities are visible at this stage, and calling it after preliminary compare-and-branch determination, and after delay branch scheduling. 
Maybe at some other places too. One complication is that sometimes a compare-and-branch decision has to be reversed, so the exact ccfsm description before instruction output is that it is unlikely to clobber the flags. Thus, a conditionalizing that depends on the flags surviving through a compare-and-branch cannot be finalized till the branch is safe. Still, the conditionalizing is likely and we should take length changes into account, provided that the compare-and-branch stays in range. I suppose I could add a note REG_DEP_TRUE (unspec [<list of brcc insn for required to be short for conditionalizing>] UNSPEC_CCFSM) for any instructions subject to such uncertain conditionalizations. Or probably rather three UNSPEC kinds: one to indicate ccfsm-predicated deletion (for a branch), one for general conditionalizing (for arithmetic, branches and calls), and one for equality conditionalizing (for branches, wider offset ranges are available for short variants, and for register zeroing, which has a not-equal conditionalized short form).
Quoting Richard Sandiford <rdsandiford@googlemail.com>: > Joern Rennecke <joern.rennecke@embecosm.com> writes: >> So, if we want to go with a declarative approach, each instruction can have >> a plurality of variants, each described by a size, an instruction fetch >> alignment base, an instruction fetch alignment offset set, a cache >> line base, >> a cache line offset set, a cost as a branch destination - cached, >> a cost as a branch destination - uncached, a fall-through predicted cost, >> a fall-through mispredicted cost, a branch-taken-predicted cost, >> a branch-taken-mispredicted cost. >> Or, if we want to get into more detail, we should actually have a scheduler >> instruction reservation descriptions instead of singular costs. >> >> A variants availability and costs will depend on branch offsets, >> the presence, kind and size of delay slot instructions and further >> length-sensitive context (like the set of paths from the function entry, >> with aggregate instruction fetch sizes and probabilities), so you have >> to call a target hook each iteration to get the currently available set >> of variants. >> >> Then you have to see this in the context of relative execution frequencies >> (for cached/uncached branch destinations), and branch probabilities and >> the reliability of the probability information, and to-be-added annotations >> about branch predictability. to arrive at a near-optimal solution in >> reasonable time. > > Hey, I wasn't suggesting anything like that :-) I was just asking > about alignment. It seemed to me that part of the problem is that > ARC wants to increase the length of an instruction in order to align > a following instruction (which I agree is a useful thing to support). > But as things stand, shorten_branches has no way of distinguing that from > an instruction that was forced to grow because of offsets being out of range. 
It takes a lot of effort to change the interface of an existing hook that requires in-depth target knowledge to understand the ramifications of the target's settings.  OTOH it is impossible to anticipate all the things that various targets - or even your own, as it evolves over time - will want from the interface in the future.  Therefore, I think we want an interface that naturally supports extensibility while preserving backward compatibility.

My current plan is to have one hook to be called at the start of shorten_branches.  This hook gets passed a pointer to a zeroed-out data structure.  The target may fill in various fields, some of them, or leave them all blank.  Some fields might indicate that other fields are valid, and thus have to be populated with valid data.  I'll refer to this hook for now as TARGET_INSN_LENGTH_PARAMETERS, and to the type that the passed pointer points to as insn_length_parameters_t.  The return value is a pointer to a function, to be used for adjusting instruction length in the place of ADJUST_INSN_LENGTH, or zero if no such function is required.  (I first thought of having separate target hooks for everything, but then I realized that'd just be a waste of space in struct gcc_target.)

I can make a simple ADJUST_INSN_LENGTH backward-compatible targhooks.c implementation, or a full conversion - but please don't make me waste another day implementing and testing the latter only to throw it away again.

One setting that I want to support for ARC is an alignment unit size and left shift, and a function pointer to be called on each instruction to provide instruction variants with various alignment offsets (0 or 1 times the unit size), sizes, fall-through costs, branch-taken costs, and branch-target costs.
The instruction alternatives would have to be allocated zeroed (or provided zeroed - I haven't worked this out yet), and flags set beforehand by TARGET_INSN_LENGTH_PARAMETERS might indicate additional things that the port is cognizant of and fills in data for in the instruction alternatives.  Thus, ports that provide additional information could be catered for with refined algorithms, while ports that don't provide this information get sensible defaults for the missing data.
Quoting Joern Rennecke <joern.rennecke@embecosm.com>:
> The return value is a pointer to a function, to be used for adjusting
> instruction length in the place of ADJUST_INSN_LENGTH, or zero if no
> such function is required.
> (I first thought of having separate target hooks for everything, but then
> I realized that'd just be a waste of space in struct gcc_target.)

Hmm, that bit doesn't work too well, because we have earlier optimizers that want to know insn_min_length, and getting a pointer to the adjustment function initialized by then would be messy.  So it's two hooks after all, with a separate adjust_insn_length hook.  Although I'm not sure it makes sense to apply that during branch shortening on instructions with multiple variants; it's probably better to let the target incorporate it in the variants generation.
diff -drup gcc-192879-haveattr/doc/tm.texi gcc/doc/tm.texi
--- gcc-192879-haveattr/doc/tm.texi	2012-10-28 01:07:38.463469923 +0000
+++ gcc/doc/tm.texi	2012-10-28 01:31:15.522227927 +0000
@@ -11341,3 +11341,7 @@ This value should be set if the result w
 @deftypevrx {Target Hook} {bool} TARGET_HAVE_ATTR_LENGTH
 These flags are automatically generated; you should not override them in tm.c:
 @end deftypevr
+
+@deftypefn {Target Hook} int TARGET_ADJUST_INSN_LENGTH (rtx @var{insn}, int @var{length}, bool @var{in_delay_sequence}, int *@var{iteration_threshold})
+Return an adjusted length for @var{insn}.  @var{length} is the value that has been calculated using the @code{length} instruction attribute.  @var{in_delay_sequence} indicates whether @var{insn} forms part of a delay sequence.  *@var{iteration_threshold} specifies the number of branch shortening iterations before length decreases are inhibited.  The default implementation uses @code{ADJUST_INSN_LENGTH}, if defined.
+@end deftypefn
diff -drup gcc-192879-haveattr/doc/tm.texi.in gcc/doc/tm.texi.in
--- gcc-192879-haveattr/doc/tm.texi.in	2012-10-28 01:07:38.489470252 +0000
+++ gcc/doc/tm.texi.in	2012-10-28 01:31:55.578745701 +0000
@@ -11175,3 +11175,5 @@ memory model bits are allowed.
 @hook TARGET_ATOMIC_TEST_AND_SET_TRUEVAL
 
 @hook TARGET_HAVE_CC0
+
+@hook TARGET_ADJUST_INSN_LENGTH
diff -drup gcc-192879-haveattr/final.c gcc/final.c
--- gcc-192879-haveattr/final.c	2012-10-28 01:07:38.492470288 +0000
+++ gcc/final.c	2012-10-28 15:24:49.731169660 +0000
@@ -424,10 +424,8 @@ get_attr_length_1 (rtx insn, int (*fallb
       break;
     }
 
-#ifdef ADJUST_INSN_LENGTH
-  ADJUST_INSN_LENGTH (insn, length);
-#endif
-  return length;
+  int dummy = -1;
+  return targetm.adjust_insn_length (insn, length, false, &dummy);
 }
 
 /* Obtain the current length of an insn.  If branch shortening has been done,
@@ -827,6 +825,56 @@ struct rtl_opt_pass pass_compute_alignme
 };
 
 
+/* Helper function for shorten_branches, to apply the adjust_insn_length
+   target hook, and lock in length increases in order to avoid infinite
+   iteration cycles.
+   INSN is the instruction being processed, and NEW_LENGTH is the length
+   that has been calculated for it from its length attribute.  UID is
+   its UID.  SEQ_P indicates if it is inside a delay slot sequence.
+   INSN_LENGTHS and UID_LOCK_LENGTH are the arrays holding the current lengths
+   and locked-in maximum lengths from the previous iterations, and are
+   updated for UID when appropriate.
+   *NITER contains the number of iterations since we last made a change
+   to uid_lock_length.
+   *SOMETHING_CHANGED will be set if we make a change to INSN_LENGTHS.
+   The return value is the new length taking the result of the
+   adjust_insn_length and uid_lock_length into account.
+   Note that when *NITER is negative, no length locking is effective, and
+   the new lengths are just assigned.  This is used in the initial pass,
+   and when only making a single pass to update instruction lengths
+   downward.  */
+static int
+adjust_length (rtx insn, int new_length, int uid, int *insn_lengths,
+	       int *uid_lock_length, bool seq_p, int *niter,
+	       bool *something_changed)
+
+{
+  int iter_threshold = 0;
+  new_length
+    = targetm.adjust_insn_length (insn, new_length, seq_p, &iter_threshold);
+  if (new_length < 0)
+    fatal_insn ("negative insn length", insn);
+  if (iter_threshold < 0)
+    fatal_insn ("negative shorten_branches iteration threshold", insn);
+  if (new_length != insn_lengths[uid])
+    {
+      if (new_length < uid_lock_length[uid])
+	new_length = uid_lock_length[uid];
+      if (new_length == insn_lengths[uid])
+	; /* done here.  */
+      else if (*niter < iter_threshold
+	       || new_length > insn_lengths[uid])
+	{
+	  if (*niter >= iter_threshold)
+	    uid_lock_length[uid] = new_length, *niter = 0;
+	  insn_lengths[uid] = new_length;
+	  *something_changed = true;
+	}
+      else
+	new_length = insn_lengths[uid];
+    }
+  return new_length;
+}
+
 /* Make a pass over all insns and compute their actual lengths by shortening
    any branches of variable length if possible.  */
 
@@ -848,7 +896,7 @@ shorten_branches (rtx first)
   int max_skip;
 #define MAX_CODE_ALIGN 16
   rtx seq;
-  int something_changed = 1;
+  bool something_changed = true;
   char *varying_length;
   rtx body;
   int uid;
@@ -964,7 +1012,8 @@ shorten_branches (rtx first)
     return;
 
   /* Allocate the rest of the arrays.  */
-  insn_lengths = XNEWVEC (int, max_uid);
+  insn_lengths = XCNEWVEC (int, max_uid);
+  int *uid_lock_length = XCNEWVEC (int, max_uid);
   insn_lengths_max_uid = max_uid;
   /* Syntax errors can lead to labels being outside of the main insn stream.
      Initialize insn_addresses, so that we get reproducible results.  */
@@ -1065,6 +1114,7 @@ shorten_branches (rtx first)
   /* Compute initial lengths, addresses, and varying flags for each insn.  */
   int (*length_fun) (rtx) = increasing ? insn_min_length : insn_default_length;
+  int niter = increasing ? 0 : -1;
 
   for (insn_current_address = 0, insn = first;
        insn != 0;
@@ -1072,8 +1122,6 @@ shorten_branches (rtx first)
     {
       uid = INSN_UID (insn);
 
-      insn_lengths[uid] = 0;
-
       if (LABEL_P (insn))
 	{
 	  int log = LABEL_TO_ALIGNMENT (insn);
@@ -1093,6 +1141,8 @@ shorten_branches (rtx first)
       if (INSN_DELETED_P (insn))
 	continue;
 
+      int length = 0;
+
       body = PATTERN (insn);
       if (GET_CODE (body) == ADDR_VEC || GET_CODE (body) == ADDR_DIFF_VEC)
 	{
@@ -1100,13 +1150,12 @@ shorten_branches (rtx first)
 	     section.  */
 	  if (JUMP_TABLES_IN_TEXT_SECTION
 	      || readonly_data_section == text_section)
-	    insn_lengths[uid] = (XVECLEN (body,
-					  GET_CODE (body) == ADDR_DIFF_VEC)
-				 * GET_MODE_SIZE (GET_MODE (body)));
+	    length = (XVECLEN (body, GET_CODE (body) == ADDR_DIFF_VEC)
+		      * GET_MODE_SIZE (GET_MODE (body)));
 	  /* Alignment is handled by ADDR_VEC_ALIGN.  */
 	}
       else if (GET_CODE (body) == ASM_INPUT || asm_noperands (body) >= 0)
-	insn_lengths[uid] = asm_insn_count (body) * insn_default_length (insn);
+	length = asm_insn_count (body) * insn_default_length (insn);
       else if (GET_CODE (body) == SEQUENCE)
 	{
 	  int i;
@@ -1134,7 +1183,10 @@ shorten_branches (rtx first)
 	      else
 		inner_length = inner_length_fun (inner_insn);
 
-	      insn_lengths[inner_uid] = inner_length;
+	      inner_length
+		= adjust_length (insn, inner_length, inner_uid, insn_lengths,
+				 uid_lock_length, true, &niter,
+				 &something_changed);
 	      if (const_delay_slots)
 		{
 		  if ((varying_length[inner_uid]
@@ -1145,21 +1197,18 @@ shorten_branches (rtx first)
 		}
 	      else
 		varying_length[inner_uid] = 0;
-	      insn_lengths[uid] += inner_length;
+	      length += inner_length;
 	    }
 	}
       else if (GET_CODE (body) != USE && GET_CODE (body) != CLOBBER)
 	{
-	  insn_lengths[uid] = length_fun (insn);
+	  length = length_fun (insn);
 	  varying_length[uid] = insn_variable_length_p (insn);
 	}
-
+
       /* If needed, do any adjustment.  */
-#ifdef ADJUST_INSN_LENGTH
-      ADJUST_INSN_LENGTH (insn, insn_lengths[uid]);
-      if (insn_lengths[uid] < 0)
-	fatal_insn ("negative insn length", insn);
-#endif
+      adjust_length (insn, length, uid, insn_lengths, uid_lock_length,
+		     false, &niter, &something_changed);
     }
 
   /* Now loop over all the insns finding varying length insns.  For each,
@@ -1168,16 +1217,16 @@ shorten_branches (rtx first)
 
   while (something_changed)
     {
-      something_changed = 0;
+      if (increasing)
+	niter++;
+
+      something_changed = false;
       insn_current_align = MAX_CODE_ALIGN - 1;
       for (insn_current_address = 0, insn = first;
 	   insn != 0;
 	   insn = NEXT_INSN (insn))
 	{
 	  int new_length;
-#ifdef ADJUST_INSN_LENGTH
-	  int tmp_length;
-#endif
 	  int length_align;
 
 	  uid = INSN_UID (insn);
@@ -1363,17 +1412,13 @@ shorten_branches (rtx first)
 	      if (! varying_length[inner_uid])
 		inner_length = insn_lengths[inner_uid];
 	      else
-		inner_length = insn_current_length (inner_insn);
-
-	      if (inner_length != insn_lengths[inner_uid])
 		{
-		  if (!increasing || inner_length > insn_lengths[inner_uid])
-		    {
-		      insn_lengths[inner_uid] = inner_length;
-		      something_changed = 1;
-		    }
-		  else
-		    inner_length = insn_lengths[inner_uid];
+		  inner_length = insn_current_length (inner_insn);
+
+		  inner_length
+		    = adjust_length (insn, inner_length, inner_uid,
+				     insn_lengths, uid_lock_length, true,
+				     &niter, &something_changed);
 		}
 	      insn_current_address += inner_length;
 	      new_length += inner_length;
@@ -1385,21 +1430,13 @@ shorten_branches (rtx first)
 	      insn_current_address += new_length;
 	    }
 
-#ifdef ADJUST_INSN_LENGTH
 	  /* If needed, do any adjustment.  */
-	  tmp_length = new_length;
-	  ADJUST_INSN_LENGTH (insn, new_length);
+	  int tmp_length = new_length;
+	  new_length
+	    = adjust_length (insn, new_length, uid, insn_lengths,
+			     uid_lock_length, false, &niter,
+			     &something_changed);
 	  insn_current_address += (new_length - tmp_length);
-#endif
-
-	  if (new_length != insn_lengths[uid]
-	      && (!increasing || new_length > insn_lengths[uid]))
-	    {
-	      insn_lengths[uid] = new_length;
-	      something_changed = 1;
-	    }
-	  else
-	    insn_current_address += insn_lengths[uid] - new_length;
 	}
 
       /* For a non-optimizing compile, do only a single pass.  */
       if (!increasing)
diff -drup gcc-192879-haveattr/target.def gcc/target.def
--- gcc-192879-haveattr/target.def	2012-10-28 01:07:38.517470606 +0000
+++ gcc/target.def	2012-10-28 01:31:15.525227979 +0000
@@ -2818,6 +2818,17 @@ DEFHOOK
  enum unwind_info_type, (void),
  default_debug_unwind_info)
 
+DEFHOOK
+(adjust_insn_length,
+ "Return an adjusted length for @var{insn}.  @var{length} is the value that\
+ has been calculated using the @code{length} instruction attribute. \
+ @var{in_delay_sequence} indicates whether @var{insn} forms part of a delay\
+ sequence. \
+ *@var{iteration_threshold} specifies the number of branch shortening\
+ iterations before length decreases are inhibited.  The default\
+ implementation uses @code{ADJUST_INSN_LENGTH}, if defined.",
+ int, (rtx insn, int length, bool in_delay_sequence, int *iteration_threshold),
+ default_adjust_insn_length)
+
 DEFHOOKPOD
 (atomic_test_and_set_trueval,
  "This value should be set if the result written by\
diff -drup gcc-192879-haveattr/targhooks.c gcc/targhooks.c
--- gcc-192879-haveattr/targhooks.c	2012-10-25 03:33:26.000000000 +0100
+++ gcc/targhooks.c	2012-10-28 01:41:35.082931024 +0000
@@ -1540,4 +1540,20 @@ default_member_type_forces_blk (const_tr
   return false;
 }
 
+/* Default version of adjust_insn_length.  */
+
+int
+default_adjust_insn_length (rtx insn ATTRIBUTE_UNUSED, int length,
+			    bool in_delay_sequence, int *iter_threshold)
+{
+  if (in_delay_sequence)
+    *iter_threshold = INT_MAX;
+#ifdef ADJUST_INSN_LENGTH
+  else
+    ADJUST_INSN_LENGTH (insn, length);
+#endif
+  return length;
+}
+
+
 #include "gt-targhooks.h"
diff -drup gcc-192879-haveattr/targhooks.h gcc/targhooks.h
--- gcc-192879-haveattr/targhooks.h	2012-10-25 03:33:26.000000000 +0100
+++ gcc/targhooks.h	2012-10-28 01:31:15.479227682 +0000
@@ -193,3 +193,4 @@ extern const char *default_pch_valid_p (
 extern void default_asm_output_ident_directive (const char*);
 extern bool default_member_type_forces_blk (const_tree, enum machine_mode);
+extern int default_adjust_insn_length (rtx, int, bool, int *);