Patchwork Ping: RFA: add lock_length attribute to break branch-shortening cycles

Submitter Joern Rennecke
Date Oct. 16, 2012, 7:35 p.m.
Message ID <20121016153515.0nmsft8o2soo4s40-nzlynne@webmail.spamcop.net>
Download mbox | patch
Permalink /patch/191869/
State New
Headers show

Comments

Joern Rennecke - Oct. 16, 2012, 7:35 p.m.
Quoting Richard Sandiford <rdsandiford@googlemail.com>:

> Joern Rennecke <joern.rennecke@embecosm.com> writes:
>> 2012-10-04  Joern Rennecke  <joern.rennecke@embecosm.com>
>>
>>          * final.c (get_attr_length_1): Use direct recursion rather than
>>          calling get_attr_length.
>>          (get_attr_lock_length): New function.
>>          (INSN_VARIABLE_LENGTH_P): Define.
>>          (shorten_branches): Take HAVE_ATTR_lock_length into account.
>>          Don't overwrite non-delay slot insn lengths with the lengths of
>>          delay slot insns with same uid.
>>          * genattrtab.c (lock_length_str): New variable.
>>          (make_length_attrs): New parameter base.
>>          (main): Initialize lock_length_str.
>>          Generate lock_lengths attributes.
>>          * genattr.c (gen_attr): Emit declarations for lock_length attribute
>> 	related functions.
>> 	* doc/md.texi (node Insn Lengths): Document lock_length attribute.
>>
>> http://gcc.gnu.org/ml/gcc-patches/2012-10/msg00383.html
>
> Sorry, this is really just repeating Richard B's comments, but I still
> don't understand why we need two attributes.  Why can't shorten_branches
> work with the existing length and simply make sure that the length doesn't
> decrease from one iteration to the next?  That seems to be how you implement
> CASE_VECTOR_SHORTEN_MODE.  It also means that we can continue to use the
> pessimistic algorithm for -O0.

That does not really work for ARCompact: non-branch instructions are
lengthened to align or un-align branch instructions.  Failing to
correct the size of such an instruction downwards when indicated not
only messes up the directly affected branch, but also gets the alignment
wrong downstream, and can even cause branches to go out of range
(blindsiding branch shortening) when there are interactions with
explicit alignments for things like loops.
Unless you think it's OK to poke the contents of the insn_length array from
ADJUST_INSN_LENGTH.  That should work, but it'd require an extra parameter
when you want to hookize this macro.

> You said in your reply that one of the reasons was to avoid
> "interesting" interactions with ADJUST_INSN_LENGTH.  But the
> previous minimum length would be applied after ADJUST_INSN_LENGTH,
> so I'm not sure why it's a factor.

Well, for starters, the ARCompact ADJUST_INSN_LENGTH uses
get_attr_lock_length when processing delayed-branch SEQUENCEs.
And then there's the short/long instruction opcode distinction (function /
attribute verify_short), which depends on alignment, and influences
instruction length and conditional execution calculation.
Without exact alignment information, branch scheduling becomes a crapshoot.

You might think I could subsume the length effect of the upsized instructions
in the affected branch, but that doesn't work because the exact length is
needed both for ADJUST_INSN_LENGTH, and in order to output the correct
branch / jump output template.


> If lock_length is just an optimisation on top of that, then maybe
> it would help to split it out.

Well, to the extent that branch shortening and scheduling are just
optimizations, having the lock_length attribute is just an optimization.

Well, we could split it anyway, and give ports without the need for
multiple length attributes the benefit of the optimistic algorithm.

I have attached a patch that implements this.

I first meant to test it for sh-elf; alas, PR54938 currently makes this
impossible.
So I used arm-eabi instead.  Regression tests look OK, but libgcc.a and
libc.a (newlib) are unchanged in size, while libstdc++.a increases by twelve
bytes; more specifically, the size increase happens for the object file
debug.o, which goes from 11028 to 11040 bytes.
Likewise, cris-elf and mipsel-elf show a small increase in size of debug.o,
with the rest of libstdc++.a unchanged in size.

Generating and comparing the arm intermediate files debug.s shows a suspicious
incidence of pathnames.  Using a longer build directory pathname makes no
difference, though.
OTOH, using a longer source directory pathname does.  So that leaves...
the patch itself makes no difference for building these libraries.
It should stop endless looping in shorten_branches, though, albeit that
is only relevant for ports that do have some negative shorten_branches
feedback.
2012-10-16  Joern Rennecke  <joern.rennecke@arc.com>

	* final.c (shorten_branches): When optimizing, start with small
	length and increase from there, and don't decrease lengths.
	Don't overwrite non-delay slot insn lengths with the lengths of
	delay slot insns with same uid.
Richard Guenther - Oct. 17, 2012, 8:50 a.m.
On Tue, Oct 16, 2012 at 9:35 PM, Joern Rennecke
<joern.rennecke@embecosm.com> wrote:
> Quoting Richard Sandiford <rdsandiford@googlemail.com>:
>
>> Joern Rennecke <joern.rennecke@embecosm.com> writes:
>>>
>>> 2012-10-04  Joern Rennecke  <joern.rennecke@embecosm.com>
>>>
>>>          * final.c (get_attr_length_1): Use direct recursion rather than
>>>          calling get_attr_length.
>>>          (get_attr_lock_length): New function.
>>>          (INSN_VARIABLE_LENGTH_P): Define.
>>>          (shorten_branches): Take HAVE_ATTR_lock_length into account.
>>>          Don't overwrite non-delay slot insn lengths with the lengths of
>>>          delay slot insns with same uid.
>>>          * genattrtab.c (lock_length_str): New variable.
>>>          (make_length_attrs): New parameter base.
>>>          (main): Initialize lock_length_str.
>>>          Generate lock_lengths attributes.
>>>          * genattr.c (gen_attr): Emit declarations for lock_length
>>> attribute
>>>         related functions.
>>>         * doc/md.texi (node Insn Lengths): Document lock_length
>>> attribute.
>>>
>>> http://gcc.gnu.org/ml/gcc-patches/2012-10/msg00383.html
>>
>>
>> Sorry, this is really just repeating Richard B's comments, but I still
>> don't understand why we need two attributes.  Why can't shorten_branches
>> work with the existing length and simply make sure that the length doesn't
>> decrease from one iteration to the next?  That seems to be how you
>> implement
>> CASE_VECTOR_SHORTEN_MODE.  It also means that we can continue to use the
>> pessimistic algorithm for -O0.
>
>
> That does not really work for ARCompact, non-branch instructions are
> lengthened to align or un-align branch instructions.  Failure to
> correct the size of such an instruction downwards when indicated not
> only messes up the directly affected branch, but also gets the alignment
> wrong downstream, and can even cause branches to go out of range (blindsiding
> branch shortening) when there are interactions with explicit alignments
> for things like loops.
> Unless you think it's OK to poke the contents of the insn_length array from
> ADJUST_INSN_LENGTH.  That should work, but it'd require an extra parameter
> when you want to hookize this macro.
>
>
>> You said in your reply that one of the reasons was to avoid
>> "interesting" interactions with ADJUST_INSN_LENGTH.  But the
>> previous minimum length would be applied after ADJUST_INSN_LENGTH,
>> so I'm not sure why it's a factor.
>
>
> Well, for starters, the ARCompact ADJUST_INSN_LENGTH uses
> get_attr_lock_length when processing delayed-branch SEQUENCEs.
> And then there's the short/long instruction opcode distinction (function /
> attribute verify_short), which depends on alignment, and influences
> instruction length and conditional execution calculation.
> Without exact alignment information, branch scheduling becomes a crapshoot.
>
> You might think I could subsume the length effect of the upsized
> instructions
> in the affected branch, but that doesn't work because the exact length is
> needed both for ADJUST_INSN_LENGTH, and in order to output the correct
> branch / jump output template.
>
>
>
>> If lock_length is just an optimisation on top of that, then maybe
>> it would help to split it out.
>
>
> Well, to the extent that branch shortening and scheduling are just
> optimizations, having the lock_length attribute is just an optimization.
>
> Well, we could split it anyway, and give ports without the need for
> multiple length attributes the benefit of the optimistic algorithm.
>
> I have attached a patch that implements this.

Looks reasonable to me, though I'm not familiar enough with the code
to approve it.

I'd strongly suggest to try harder to make things work for you without
the new attribute even though I wasn't really able to follow your reasoning
on why that wouldn't work.  It may be easier to motivate this change
once the port is in without that attribute so one can actually look at
the machine description and port details.

Richard.

> I first meant to test it for sh-elf, alas, PR54938 currently makes this
> impossible.
> So I used arm-eabi instead.  regression tests look OK, but libgcc.a and
> libc.a (newlib) are unchanged in size, while libstdc++.a increases by twelve
> bytes; more specifically, the size increase happens for the object file
> debug.o, which goes from 11028 to 11040 bytes.
> Likewise, cris-elf and mipsel-elf show a small increase in size of debug.o,
> with the rest of libstdc++.a unchanged in size.
>
> Generating and comparing the arm intermediate files debug.s shows a
> suspicious
> incidence of pathnames.  Using a longer build directory pathname makes no
> difference, though.
> OTOH, using a longer source directory pathname does.  So that leaves...
> makes no difference for building these libraries.
> It should stop endless looping in shorten_branches, though.  Albeit that
> is only relevant for ports that do have some negative shorten_branches
> feedback.
Richard Sandiford - Oct. 18, 2012, 7:32 p.m.
Joern Rennecke <joern.rennecke@embecosm.com> writes:
> @@ -1360,12 +1369,20 @@ shorten_branches (rtx first ATTRIBUTE_UN
>  		  else
>  		    inner_length = insn_current_length (inner_insn);
>  
> -		  if (inner_length != insn_lengths[inner_uid])
> +		  /* We can't record lengths of delay slot insns, because
> +		     they might just be a copy of the insn at the branch
> +		     target, sharing its uid.  */
> +		  if (i == 0 && inner_length != insn_lengths[inner_uid])
>  		    {
> -		      insn_lengths[inner_uid] = inner_length;
> -		      something_changed = 1;
> +		      if (inner_length > insn_lengths[inner_uid] || first_pass)
> +			{
> +			  insn_lengths[inner_uid] = inner_length;
> +			  something_changed = 1;
> +			}
> +		      else
> +			inner_length = insn_lengths[inner_uid];
>  		    }
> -		  insn_current_address += insn_lengths[inner_uid];
> +		  insn_current_address += inner_length;
>  		  new_length += inner_length;
>  		}
>  	    }

The fact that we even have shared unique ids is pretty bad -- and surely
a contradiction in terms -- but I think both ways of handling them rely
on the length being the same for all copies.  If we don't record a length
(your version), we won't set something_changed if the length of this copy
did in fact change, which could lead to branches remaining too short.
If we do record a length (current version), we could end up changing
the length of previous copies that have already been processed in
this iteration.  Both would be wrong, but at least the current version
is simpler and means we get the natural behaviour if anyone ever does
"fix" dbr_schedule.

I think first_pass should be !optimize.  The set-up loop chooses minimum
lengths when optimising, so it should be correct to keep the maximum on
all iterations of this loop.  I.e.

		  if (optimize
		      ? inner_length > insn_lengths[inner_uid]
		      : inner_length < insn_lengths[inner_uid])
		    {
		      insn_lengths[inner_uid] = inner_length;
		      something_changed = 1;
		    }
		  else
		    inner_length = insn_lengths[inner_uid];

> @@ -1382,7 +1399,8 @@ shorten_branches (rtx first ATTRIBUTE_UN
>  	  insn_current_address += (new_length - tmp_length);
>  #endif
>  
> -	  if (new_length != insn_lengths[uid])
> +	  if (new_length != insn_lengths[uid]
> +	      && (new_length > insn_lengths[uid] || first_pass))
>  	    {
>  	      insn_lengths[uid] = new_length;
>  	      something_changed = 1;

Same first_pass comment here.  I.e.:

	  if (optimize
	      ? new_length > insn_lengths[uid]
	      : new_length < insn_lengths[uid])

I think you need an "else" that does:

    insn_current_address += insn_lengths[uid] - new_length;

Richard
Joern Rennecke - Oct. 18, 2012, 8:29 p.m.
Quoting Richard Sandiford <rdsandiford@googlemail.com>:

> The fact that we even have shared unique ids is pretty bad -- and surely
> a contradiction in terms -- but I think both ways of handling them rely
> on the length being the same for all copies.  If we don't record a length
> (your version), we won't set something_changed if the length of this copy
> did in fact change, which could lead to branches remaining too short.

That is not the case, as the current length of the inner insn is added to
new_length (of the outer insn).

> If we do record a length (current version),

In the current version, the length of the inner insn is calculated anew
each iteration, so only the out-of-sequence copy suffers from the bogosity.

> we could end up changing
> the length of previous copies that have already been processed in
> this iteration.

Both that, and also using the length of the previous insn for the inner
length calculation, and ratcheting them up together.
In fact, this can make delay slot instructions ineligible for the delay slot
they are in.  Also, the unwanted length cross-interference can violate
alignment invariants, and this leads to invalid code on ARCompact.

   Both would be wrong, but at least the current version
> is simpler and means we get the natural behaviour if anyone ever does
> "fix" dbr_schedule.
>
> I think first_pass should be !optimize.

That would not work for CASE_VECTOR_SHORTEN_MODE, where it controls setting
the mode of BODY.

And as we have to use the first_pass variable there, I thought it's simpler
to also use it in these other places; if we ever want to switch the
decision on when to use increasing and when to use decreasing lengths,
the algorithm continues to work this way, without the need to replace the
optimize test in all places with the new test.

>> @@ -1382,7 +1399,8 @@ shorten_branches (rtx first ATTRIBUTE_UN
>>  	  insn_current_address += (new_length - tmp_length);
>>  #endif
>>
>> -	  if (new_length != insn_lengths[uid])
>> +	  if (new_length != insn_lengths[uid]
>> +	      && (new_length > insn_lengths[uid] || first_pass))
>>  	    {
>>  	      insn_lengths[uid] = new_length;
>>  	      something_changed = 1;
>
> Same first_pass comment here.  I.e.:
>
> 	  if (optimize
> 	      ? new_length > insn_lengths[uid]
> 	      : new_length < insn_lengths[uid])

It's about testing first a condition that can eliminate further tests.
If the length is unchanged, which would be expected to be the case
for most variable-length instructions after the first iteration,
we don't need to test the other conditions.
And optimize, being a global variable, is relatively expensive on
some host platforms.
It could get even more expensive if/when we ever go multithreaded.

> I think you need an "else" that does:
>
>     insn_current_address += insn_lengths[uid] - new_length;

When the condition is not fulfilled, we want to keep the length from the
previous iteration.  We could put
"new_length = insn_lengths[uid];" in the else clause, but that'd be pointless,
as new_length is dead at that point.
Richard Sandiford - Oct. 18, 2012, 9:03 p.m.
Joern Rennecke <joern.rennecke@embecosm.com> writes:
> Quoting Richard Sandiford <rdsandiford@googlemail.com>:
>> The fact that we even have shared unique ids is pretty bad -- and surely
>> a contradiction in terms -- but I think both ways of handling them rely
>> on the length being the same for all copies.  If we don't record a length
>> (your version), we won't set something_changed if the length of this copy
>> did in fact change, which could lead to branches remaining too short.
>
> That is not the case, as the current length of the inner insn is added to
> new_length (of the outer insn).
>
>> If we do record a length (current version),
>
> In the current version, the length of the inner insn is calculated anew
> each iteration, so only the out-of-sequence copy suffers from the bogosity.
>
>> we could end up changing
>> the length of previous copies that have already been processed in
>> this iteration.
>
> Both that, and also using the length or the previous insn for the inner
> length calculation, and ratcheting them up together.
> In fact, this can make delay slot instructions ineligible for the delay slot
> they are in.  Also, the unwanted length cross-interference can violate
> alignment invariants, and this leads to invalid code on ARCompact.

But if you're saying that ARCompact wants different copies of the same insn
(with the same uid) to have different lengths, then please fix dbr_schedule
so that it uses different uids.  The "i == 0" thing is just too hackish IMO.

>> is simpler and means we get the natural behaviour if anyone ever does
>> "fix" dbr_schedule.
>>
>> I think first_pass should be !optimize.
>
> That would not work for CASE_VECTOR_SHORTEN_MODE, where it controls setting
> the mode of BODY.
>
> And as we have to use the first_pass variable there, I thought it's simpler
> to also use it in these other places; if we ever want to switch the
> decision on when to use increasing and when to use decreasing lengths,
> the algorithm continues to work this way, without the need to replace the
> optimize test in all places with the new test.

I don't mind having a local boolean to control whether we're applying
increasing or decreasing lengths, if you prefer that to plain "optimize".
I just don't like the condition changing from the second approximation
to the third.

>>> @@ -1382,7 +1399,8 @@ shorten_branches (rtx first ATTRIBUTE_UN
>>>  	  insn_current_address += (new_length - tmp_length);
>>>  #endif
>>>
>>> -	  if (new_length != insn_lengths[uid])
>>> +	  if (new_length != insn_lengths[uid]
>>> +	      && (new_length > insn_lengths[uid] || first_pass))
>>>  	    {
>>>  	      insn_lengths[uid] = new_length;
>>>  	      something_changed = 1;
>>
>> Same first_pass comment here.  I.e.:
>>
>> 	  if (optimize
>> 	      ? new_length > insn_lengths[uid]
>> 	      : new_length < insn_lengths[uid])
>
> It's about testing a condition first that can eliminate further tests.
> if the length is unchanged - which would be expected to be the case
> for most variable length instructions after the first iteration -
> we don't need to test the other conditions.
> And optimize, being a global variable, is relatively expensive on
> some host platforms.
> It could get even more expensive if/when we ever go multithreaded.
>
>> I think you need an "else" that does:
>>
>>     insn_current_address += insn_lengths[uid] - new_length;
>
> When the condition is not fulfilled, we want to keep the length from the
> previous iteration.

Right, that's what I mean.  So we need to make sure that the difference
between the address of the current instruction and the address of the
next instruction also doesn't change.  The code as posted was:

	  else
	    {
	      new_length = insn_current_length (insn);
	      insn_current_address += new_length;
	    }

#ifdef ADJUST_INSN_LENGTH
	  /* If needed, do any adjustment.  */
	  tmp_length = new_length;
	  ADJUST_INSN_LENGTH (insn, new_length);
	  insn_current_address += (new_length - tmp_length);
#endif

	  if (new_length != insn_lengths[uid]
	      && (new_length > insn_lengths[uid] || first_pass))
	    {
	      insn_lengths[uid] = new_length;
	      something_changed = 1;
	    }

which always leaves insn_current_address based off new_length,
even if new_length is out of kilter with insn_lengths[uid].
It looked like it should be:

	  else
	    {
	      new_length = insn_current_length (insn);
	      insn_current_address += new_length;
	    }

#ifdef ADJUST_INSN_LENGTH
	  /* If needed, do any adjustment.  */
	  tmp_length = new_length;
	  ADJUST_INSN_LENGTH (insn, new_length);
	  insn_current_address += (new_length - tmp_length);
#endif

	  if (...honour new_length...)
	    {
	      insn_lengths[uid] = new_length;
	      something_changed = 1;
	    }
	  else
	    insn_current_address += insn_lengths[uid] - new_length;

so that insn_current_address continues to match the final value
of insn_lengths[uid].

Richard

Patch

Index: final.c
===================================================================
--- final.c	(revision 192491)
+++ final.c	(working copy)
@@ -1067,6 +1067,8 @@  shorten_branches (rtx first ATTRIBUTE_UN
 #endif /* CASE_VECTOR_SHORTEN_MODE */
 
   /* Compute initial lengths, addresses, and varying flags for each insn.  */
+  int (*length_fun) (rtx) = optimize ? insn_min_length : insn_default_length;
+
   for (insn_current_address = 0, insn = first;
        insn != 0;
        insn_current_address += insn_lengths[uid], insn = NEXT_INSN (insn))
@@ -1117,6 +1119,8 @@  shorten_branches (rtx first ATTRIBUTE_UN
 #else
 	  const_delay_slots = 0;
 #endif
+	  int (*inner_length_fun) (rtx)
+	    = const_delay_slots ? length_fun : insn_default_length;
 	  /* Inside a delay slot sequence, we do not do any branch shortening
 	     if the shortening could change the number of delay slots
 	     of the branch.  */
@@ -1131,7 +1135,7 @@  shorten_branches (rtx first ATTRIBUTE_UN
 		inner_length = (asm_insn_count (PATTERN (inner_insn))
 				* insn_default_length (inner_insn));
 	      else
-		inner_length = insn_default_length (inner_insn);
+		inner_length = inner_length_fun (inner_insn);
 
 	      insn_lengths[inner_uid] = inner_length;
 	      if (const_delay_slots)
@@ -1149,7 +1153,7 @@  shorten_branches (rtx first ATTRIBUTE_UN
 	}
       else if (GET_CODE (body) != USE && GET_CODE (body) != CLOBBER)
 	{
-	  insn_lengths[uid] = insn_default_length (insn);
+	  insn_lengths[uid] = length_fun (insn);
 	  varying_length[uid] = insn_variable_length_p (insn);
 	}
 
@@ -1165,6 +1169,7 @@  shorten_branches (rtx first ATTRIBUTE_UN
      get the current insn length.  If it has changed, reflect the change.
      When nothing changes for a full pass, we are done.  */
 
+  bool first_pass = true;
   while (something_changed)
     {
       something_changed = 0;
@@ -1220,6 +1225,7 @@  shorten_branches (rtx first ATTRIBUTE_UN
 	      rtx prev;
 	      int rel_align = 0;
 	      addr_diff_vec_flags flags;
+	      enum machine_mode vec_mode;
 
 	      /* Avoid automatic aggregate initialization.  */
 	      flags = ADDR_DIFF_VEC_FLAGS (body);
@@ -1298,9 +1304,12 @@  shorten_branches (rtx first ATTRIBUTE_UN
 		  else
 		    max_addr += align_fuzz (max_lab, rel_lab, 0, 0);
 		}
-	      PUT_MODE (body, CASE_VECTOR_SHORTEN_MODE (min_addr - rel_addr,
-							max_addr - rel_addr,
-							body));
+	      vec_mode = CASE_VECTOR_SHORTEN_MODE (min_addr - rel_addr,
+						   max_addr - rel_addr, body);
+	      if (first_pass
+		  || (GET_MODE_SIZE (vec_mode)
+		      >= GET_MODE_SIZE (GET_MODE (body))))
+		PUT_MODE (body, vec_mode);
 	      if (JUMP_TABLES_IN_TEXT_SECTION
 		  || readonly_data_section == text_section)
 		{
@@ -1360,12 +1369,20 @@  shorten_branches (rtx first ATTRIBUTE_UN
 		  else
 		    inner_length = insn_current_length (inner_insn);
 
-		  if (inner_length != insn_lengths[inner_uid])
+		  /* We can't record lengths of delay slot insns, because
+		     they might just be a copy of the insn at the branch
+		     target, sharing its uid.  */
+		  if (i == 0 && inner_length != insn_lengths[inner_uid])
 		    {
-		      insn_lengths[inner_uid] = inner_length;
-		      something_changed = 1;
+		      if (inner_length > insn_lengths[inner_uid] || first_pass)
+			{
+			  insn_lengths[inner_uid] = inner_length;
+			  something_changed = 1;
+			}
+		      else
+			inner_length = insn_lengths[inner_uid];
 		    }
-		  insn_current_address += insn_lengths[inner_uid];
+		  insn_current_address += inner_length;
 		  new_length += inner_length;
 		}
 	    }
@@ -1382,7 +1399,8 @@  shorten_branches (rtx first ATTRIBUTE_UN
 	  insn_current_address += (new_length - tmp_length);
 #endif
 
-	  if (new_length != insn_lengths[uid])
+	  if (new_length != insn_lengths[uid]
+	      && (new_length > insn_lengths[uid] || first_pass))
 	    {
 	      insn_lengths[uid] = new_length;
 	      something_changed = 1;
@@ -1391,6 +1409,7 @@  shorten_branches (rtx first ATTRIBUTE_UN
       /* For a non-optimizing compile, do only a single pass.  */
       if (!optimize)
 	break;
+      first_pass = false;
     }
 
   free (varying_length);