RFC: LRA for x86/x86-64 [7/9] -- continuation

Message ID 5064DA43.90803@redhat.com
State New

Commit Message

Vladimir Makarov Sept. 27, 2012, 10:59 p.m. UTC
It is the 2nd part of the previous patch.

Comments

Richard Sandiford Oct. 3, 2012, 11:11 a.m. UTC | #1
Hi Vlad,

Some comments on lra-spills.c and lra-coalesce.c.

> +   The pass creates necessary stack slots and assign spilled pseudos
> +   to the stack slots in following way:

s/assign/assigns/

> +   (or insn memory constraints) might be not satisfied any more.

s/might be not/might not be/

> +   For some targets, the pass can spill some pseudos into hard
> +   registers of different class (usually into vector registers)
> +   instead of spilling them into memory if it is possible and
> +   profitable.	Spilling GENERAL_REGS pseudo into SSE registers for
> +   modern Intel x86/x86-64 processors is an example of such
> +   optimization.  And this is actually recommended by Intel
> +   optimization guide.

Maybe mention core i7 specifically?  "Modern" is a bit dangerous
in code that'll live a long time.

> +/* The structure describes a stack slot which can be used for several
> +   spilled pseudos.  */
> +struct slot
> +{

Looks like this describes "a register or stack slot" given the hard_regno case.

> +/* Array containing info about the stack slots.	 The array element is
> +   indexed by the stack slot number in the range [0..slost_num).  */

Typo: slots_num

> +  /* Each pseudo has an inherent size which comes from its own mode,
> +     and a total size which provides room for paradoxical subregs
> +     which refer to the pseudo reg in wider modes.
> +     
> +     We can use a slot already allocated if it provides both enough
> +     inherent space and enough total space.  Otherwise, we allocate a
> +     new slot, making sure that it has no less inherent space, and no
> +     less total space, then the previous slot.	*/

The second part of the comment seems a bit misplaced, since the following
code doesn't reuse stack slots.  This is done elsewhere instead.
Maybe the first part would be better above the inherent_size assignment.

> +  /* If we have any adjustment to make, or if the stack slot is the
> +     wrong mode, make a new stack slot.	 */
> +  x = adjust_address_nv (x, GET_MODE (regno_reg_rtx[i]), adjust);

We don't make a new slot here.

> +/* Sort pseudos according their slot numbers putting ones with smaller
> +   numbers first, or last when the frame pointer is not needed.	 So
> +   pseudos with the first slot will be finally addressed with smaller
> +   address displacement.  */
> +static int
> +pseudo_reg_slot_compare (const void *v1p, const void *v2p)
> +{
> +  const int regno1 = *(const int *) v1p;
> +  const int regno2 = *(const int *) v2p;
> +  int diff, slot_num1, slot_num2;
> +  int total_size1, total_size2;
> +
> +  slot_num1 = pseudo_slots[regno1].slot_num;
> +  slot_num2 = pseudo_slots[regno2].slot_num;
> +  if ((diff = slot_num1 - slot_num2) != 0)
> +    return (frame_pointer_needed
> +	    || !FRAME_GROWS_DOWNWARD == STACK_GROWS_DOWNWARD ? diff : -diff);

The comment doesn't quite describe the condition.  Maybe:

/* Sort pseudos according to their slots, putting the slots in the order
   that they should be allocated.  Slots with lower numbers have the highest
   priority and should get the smallest displacement from the stack or
   frame pointer (whichever is being used).

   The first allocated slot is always closest to the frame pointer,
   so prefer lower slot numbers when frame_pointer_needed.  If the stack
   and frame grow in the same direction, then the first allocated slot is
   always closest to the initial stack pointer and furthest away from the
   final stack pointer, so allocate higher numbers first when using the
   stack pointer in that case.  The reverse is true if the stack and
   frame grow in opposite directions.  */

> +  total_size1 = MAX (PSEUDO_REGNO_BYTES (regno1),
> +		     GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode));
> +  total_size2 = MAX (PSEUDO_REGNO_BYTES (regno2),
> +		     GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode));
> +  if ((diff = total_size2 - total_size1) != 0)
> +    return diff;

I think this could do with a bit more commentary.  When is biggest_mode
ever smaller than PSEUDO_REGNO_BYTES?  Is that for pseudos that are only
ever referenced as lowpart subregs?  If so, why does PSEUDO_REGNO_BYTES
matter for those registers here but not when calculating biggest_mode?

> +/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS.  Put the
> +   pseudos which did not get a spill hard register at the beginning of
> +   array PSEUDO_REGNOS.	 Return the number of such pseudos.  */

It'd be worth saying that PSEUDO_REGNOS is sorted in order of highest
frequency first.

> +  bitmap set_jump_crosses = regstat_get_setjmp_crosses ();

I notice you use "set_jump" here and "setjump" in parts of 7a.patch.
Probably better to use setjmp across the board.

> +  /* Hard registers which can not be used for any purpose at given
> +     program point because they are unallocatable or already allocated
> +     for other pseudos.	 */ 
> +  HARD_REG_SET *reserved_hard_regs;
> +
> +  if (! lra_reg_spill_p)
> +    return n;
> +  /* Set up reserved hard regs for every program point.	 */
> +  reserved_hard_regs = (HARD_REG_SET *) xmalloc (sizeof (HARD_REG_SET)
> +						 * lra_live_max_point);
> +  for (p = 0; p < lra_live_max_point; p++)
> +    COPY_HARD_REG_SET (reserved_hard_regs[p], lra_no_alloc_regs);
> +  for (i = FIRST_PSEUDO_REGISTER; i < regs_num; i++)
> +    if (lra_reg_info[i].nrefs != 0
> +	&& (hard_regno = lra_get_regno_hard_regno (i)) >= 0)
> +      for (r = lra_reg_info[i].live_ranges; r != NULL; r = r->next)
> +	for (p = r->start; p <= r->finish; p++)
> +	  lra_add_hard_reg_set (hard_regno, lra_reg_info[i].biggest_mode,
> +				&reserved_hard_regs[p]);

Since compilation time seems to be all the rage, I wonder if it would be
quicker to have one live range list per hard register.  Then:

> +      for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
> +	for (p = r->start; p <= r->finish; p++)
> +	  IOR_HARD_REG_SET (conflict_hard_regs, reserved_hard_regs[p]);

would just be checking for live range intersection and:

> +      /* Update reserved_hard_regs.  */
> +      for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
> +	for (p = r->start; p <= r->finish; p++)
> +	  lra_add_hard_reg_set (hard_regno, lra_reg_info[regno].biggest_mode,
> +				&reserved_hard_regs[p]);

would again be a merge.

Just an idea, not a merge requirement.  If you've already tried this and
found it to be worse, that might be worth a comment.
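
For the record, the kind of thing I had in mind (completely untested,
and assuming the live-range helpers from lra-lives.c, i.e.
lra_intersected_live_ranges_p plus something along the lines of
lra_copy_live_range_list and lra_merge_live_ranges):

  /* Live ranges already occupied by each hard register, merged over
     all pseudos assigned to it.  */
  static lra_live_range_t spill_hard_reg_ranges[FIRST_PSEUDO_REGISTER];

  /* Setup, replacing the per-point reserved_hard_regs array: */
  for (i = FIRST_PSEUDO_REGISTER; i < regs_num; i++)
    if (lra_reg_info[i].nrefs != 0
	&& (hard_regno = lra_get_regno_hard_regno (i)) >= 0)
      {
	int j, nregs;

	nregs = hard_regno_nregs[hard_regno][lra_reg_info[i].biggest_mode];
	for (j = 0; j < nregs; j++)
	  spill_hard_reg_ranges[hard_regno + j]
	    = lra_merge_live_ranges
	      (spill_hard_reg_ranges[hard_regno + j],
	       lra_copy_live_range_list (lra_reg_info[i].live_ranges));
      }

  /* The conflict calculation then becomes an intersection test: */
  for (hard_regno = 0; hard_regno < FIRST_PSEUDO_REGISTER; hard_regno++)
    if (lra_intersected_live_ranges_p
	(spill_hard_reg_ranges[hard_regno], lra_reg_info[regno].live_ranges))
      SET_HARD_REG_BIT (conflict_hard_regs, hard_regno);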

> +      first = pseudo_slots[regno].first = &pseudo_slots[slots[slot_num].regno];
> +      pseudo_slots[regno].next = pseudo_slots[slots[slot_num].regno].next;
> +      first->next = &pseudo_slots[regno];

Very minor nit, but I think this would be easier to read if the middle
line also used "first->next".
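
I.e.:

      first = pseudo_slots[regno].first = &pseudo_slots[slots[slot_num].regno];
      pseudo_slots[regno].next = first->next;
      first->next = &pseudo_slots[regno];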

> +/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS.  Put the
> +   pseudos which did not get a spill hard register at the beginning of
> +   array PSEUDO_REGNOS.	 Return the number of such pseudos.  */

Here too I think it's worth mentioning that PSEUDO_REGNOS is sorted
with highest frequency first.

> +/* Recursively process LOC in INSN and change spilled pseudos to the
> +   corresponding memory or spilled hard reg.  Ignore spilled pseudos
> +   created from the scratches.	*/
> +static bool
> +remove_pseudos (rtx *loc, rtx insn)

The return value is now ignored -- we know in advance which insns need
changing -- so this could be simplified.

> +/* Change spilled pseudos into memory or spill hard regs.  The
> +   function put changed insns on the constraint stack (these insns
> +   will be considered on the next constraint pass).  The changed insns
> +   are all insns in which pseudos were changed.	 */

s/The function put/Put/

> +/* Set up REMOVED_PSEUDOS_BITMAP and USED_PSEUDOS_BITMAP, and update
> +   LR_BITMAP (a BB live info bitmap).  */
> +static void
> +update_live_info (bitmap lr_bitmap)
> +{
> +  unsigned int j;
> +  bitmap_iterator bi;
> +
> +  bitmap_clear (&removed_pseudos_bitmap);
> +  bitmap_clear (&used_pseudos_bitmap);
> +  EXECUTE_IF_AND_IN_BITMAP (&coalesced_pseudos_bitmap, lr_bitmap,
> +			    FIRST_PSEUDO_REGISTER, j, bi)
> +    {
> +      bitmap_set_bit (&removed_pseudos_bitmap, j);
> +      bitmap_set_bit (&used_pseudos_bitmap, first_coalesced_pseudo[j]);
> +    }
> +  if (! bitmap_empty_p (&removed_pseudos_bitmap))
> +    {
> +      bitmap_and_compl_into (lr_bitmap, &removed_pseudos_bitmap);
> +      bitmap_ior_into (lr_bitmap, &used_pseudos_bitmap);
> +    }
> +}

Might be wrong, but it looks like nothing really uses removed_pseudos_bitmap
outside this function.  I think this could simply be:

/* Set up REMOVED_PSEUDOS_BITMAP and USED_PSEUDOS_BITMAP, and update
   LR_BITMAP (a BB live info bitmap).  */
static void
update_live_info (bitmap lr_bitmap)
{
  unsigned int j;
  bitmap_iterator bi;

  bitmap_clear (&used_pseudos_bitmap);
  EXECUTE_IF_AND_IN_BITMAP (&coalesced_pseudos_bitmap, lr_bitmap,
			    FIRST_PSEUDO_REGISTER, j, bi)
    bitmap_set_bit (&used_pseudos_bitmap, first_coalesced_pseudo[j]);
  if (! bitmap_empty_p (&used_pseudos_bitmap))
    {
      bitmap_and_compl_into (lr_bitmap, &coalesced_pseudos_bitmap);
      bitmap_ior_into (lr_bitmap, &used_pseudos_bitmap);
    }
}

> +	    && mem_move_p (sregno, dregno)
> +	    /* Don't coalesce inheritance pseudos because spilled
> +	       inheritance pseudos will be removed in subsequent 'undo
> +	       inheritance' pass.  */
> +	    && lra_reg_info[sregno].restore_regno < 0
> +	    && lra_reg_info[dregno].restore_regno < 0
> +	    /* We undo splits for spilled pseudos whose live ranges
> +	       were split.  So don't coalesce them, it is not
> +	       necessary and the undo transformations would be
> +	       wrong.  */
> +	    && ! bitmap_bit_p (&split_origin_bitmap, sregno)
> +	    && ! bitmap_bit_p (&split_origin_bitmap, dregno)
> +	    && ! side_effects_p (set)
> +	    /* Don't coalesces bound pseudos.  Bound pseudos has own
> +	       rules for finding live ranges.  It is hard to maintain
> +	       this info with coalescing and it is not worth to do
> +	       it.  */
> +	    && ! bitmap_bit_p (&lra_bound_pseudos, sregno)
> +	    && ! bitmap_bit_p (&lra_bound_pseudos, dregno)
> +	    /* We don't want to coalesce regnos with equivalences,
> +	       at least without updating this info.  */
> +	    && ira_reg_equiv[sregno].constant == NULL_RTX
> +	    && ira_reg_equiv[sregno].memory == NULL_RTX
> +	    && ira_reg_equiv[sregno].invariant == NULL_RTX
> +	    && ira_reg_equiv[dregno].constant == NULL_RTX
> +	    && ira_reg_equiv[dregno].memory == NULL_RTX
> +	    && ira_reg_equiv[dregno].invariant == NULL_RTX

Probably personal preference, but I think this would be easier
to read as:

	    && coalescable_reg_p (sregno)
	    && coalescable_reg_p (dregno)
	    && !side_effects_p (set)

with coalescable_reg_p checking reg_renumber (from mem_move_p)
and the open-coded stuff in the quote above.

> +  for (; mv_num != 0;)
> +    {
> +      for (i = 0; i < mv_num; i++)
> +	{
> +	  mv = sorted_moves[i];
> +	  set = single_set (mv);
> +	  lra_assert (set != NULL && REG_P (SET_SRC (set))
> +		      && REG_P (SET_DEST (set)));
> +	  sregno = REGNO (SET_SRC (set));
> +	  dregno = REGNO (SET_DEST (set));
> +	  if (! lra_intersected_live_ranges_p
> +		(lra_reg_info[first_coalesced_pseudo[sregno]].live_ranges,
> +		 lra_reg_info[first_coalesced_pseudo[dregno]].live_ranges))
> +	    {
> +	      coalesced_moves++;
> +	      if (lra_dump_file != NULL)
> +		fprintf
> +		  (lra_dump_file,
> +		   "	  Coalescing move %i:r%d(%d)-r%d(%d) (freq=%d)\n",
> +		   INSN_UID (mv), sregno, ORIGINAL_REGNO (SET_SRC (set)),
> +		   dregno, ORIGINAL_REGNO (SET_DEST (set)),
> +		   BLOCK_FOR_INSN (mv)->frequency);
> +	      bitmap_ior_into (&involved_insns_bitmap,
> +			       &lra_reg_info[sregno].insn_bitmap);
> +	      bitmap_ior_into (&involved_insns_bitmap,
> +			       &lra_reg_info[dregno].insn_bitmap);
> +	      merge_pseudos (sregno, dregno);
> +	      i++;
> +	      break;
> +	    }
> +	}
> +      /* Collect the rest of copies.  */
> +      for (n = 0; i < mv_num; i++)
> +	{
> +	  mv = sorted_moves[i];
> +	  set = single_set (mv);
> +	  lra_assert (set != NULL && REG_P (SET_SRC (set))
> +		      && REG_P (SET_DEST (set)));
> +	  sregno = REGNO (SET_SRC (set));
> +	  dregno = REGNO (SET_DEST (set));
> +	  if (first_coalesced_pseudo[sregno] != first_coalesced_pseudo[dregno])
> +	    sorted_moves[n++] = mv;
> +	  else if (lra_dump_file != NULL)
> +	    {
> +	      coalesced_moves++;
> +	      fprintf
> +		(lra_dump_file, "      Coalescing move %i:r%d-r%d (freq=%d)\n",
> +		 INSN_UID (mv), sregno, dregno,
> +		 BLOCK_FOR_INSN (mv)->frequency);
> +	    }
> +	}
> +      mv_num = n;

I'm probably being dense here, sorry, but why the nested loops?
Why can't we have one loop along the lines of:

      for (i = 0; i < mv_num; i++)
	{
	  mv = sorted_moves[i];
	  set = single_set (mv);
	  lra_assert (set != NULL && REG_P (SET_SRC (set))
		      && REG_P (SET_DEST (set)));
	  sregno = REGNO (SET_SRC (set));
	  dregno = REGNO (SET_DEST (set));
	  if (first_coalesced_pseudo[sregno] == first_coalesced_pseudo[dregno])
	    {
	      coalesced_moves++;
	      if (lra_dump_file != NULL)
		fprintf
		  (lra_dump_file, "      Coalescing move %i:r%d-r%d (freq=%d)\n",
		   INSN_UID (mv), sregno, dregno,
		   BLOCK_FOR_INSN (mv)->frequency);
	      /* We updated involved_insns_bitmap when doing the merge.  */
	    }
	  else if (!(lra_intersected_live_ranges_p
		     (lra_reg_info[first_coalesced_pseudo[sregno]].live_ranges,
		      lra_reg_info[first_coalesced_pseudo[dregno]].live_ranges)))
	    {
	      coalesced_moves++;
	      if (lra_dump_file != NULL)
		fprintf
		  (lra_dump_file,
		   "	  Coalescing move %i:r%d(%d)-r%d(%d) (freq=%d)\n",
		   INSN_UID (mv), sregno, ORIGINAL_REGNO (SET_SRC (set)),
		   dregno, ORIGINAL_REGNO (SET_DEST (set)),
		   BLOCK_FOR_INSN (mv)->frequency);
	      bitmap_ior_into (&involved_insns_bitmap,
			       &lra_reg_info[sregno].insn_bitmap);
	      bitmap_ior_into (&involved_insns_bitmap,
			       &lra_reg_info[dregno].insn_bitmap);
	      merge_pseudos (sregno, dregno);
	    }
	}

(completely untested)

> +	    if ((set = single_set (insn)) != NULL_RTX
> +		&& REG_P (SET_DEST (set)) && REG_P (SET_SRC (set))
> +		&& REGNO (SET_SRC (set)) == REGNO (SET_DEST (set))
> +		&& ! side_effects_p (set))

Maybe use set_noop_p here?
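
I.e. something like (untested):

	    if ((set = single_set (insn)) != NULL_RTX
		&& set_noop_p (set))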

Richard
Richard Sandiford Oct. 4, 2012, 3:50 p.m. UTC | #2
Hi Vlad,

This message is for lra-assigns.c.  Sorry for the piecemeal reviews,
never sure when I'll get time...

> +/* This file contains a pass mostly assigning hard registers to reload
> +   pseudos.  There is no any RTL code transformation on this pass.

Maybe:

/* This file's main objective is to assign hard registers to reload pseudos.
   It also tries to allocate hard registers to other pseudos, but at a lower
   priority than the reload pseudos.  The pass does not transform the RTL.

if that's accurate.

> +   Reload pseudos get what they need (usually) hard registers in
> +   anyway possibly by spilling non-reload pseudos and by assignment
> +   reload pseudos with smallest number of available hard registers
> +   first.
> +
> +   If reload pseudos can get hard registers only through spilling
> +   other pseudos, we choose what pseudos to spill taking into account
> +   how given reload pseudo benefits and also how other reload pseudos
> +   not assigned yet benefit too (see function spill_for).

Maybe:

   We must allocate a hard register to every reload pseudo.  We try to
   increase the chances of finding a viable allocation by assigning the
   pseudos in order of fewest available hard registers first.  If we
   still fail to find a hard register, we spill other (non-reload)
   pseudos in order to make room.

   find_hard_regno_for allocates registers without spilling.
   spill_for does the same with spilling.  Both functions use
   a cost model to determine the most profitable choice of
   hard and spill registers.

> +   Non-reload pseudos can get hard registers too if it is possible and
> +   improves the code.  It might be possible because of spilling
> +   non-reload pseudos on given pass.

Maybe:

   Once we have finished allocating reload pseudos, we also try to
   assign registers to other (non-reload) pseudos.  This is useful
   if hard registers were freed up by the spilling just described.

> +   We try to assign hard registers processing pseudos by threads.  The
> +   thread contains reload and inheritance pseudos connected by copies
> +   (move insns).  It improves the chance to get the same hard register
> +   to pseudos in the thread and, as the result, to remove some move
> +   insns.

Maybe:

   We try to assign hard registers by collecting pseudos into threads.
   These threads contain reload and inheritance pseudos that are connected
   by copies (move insns).  Doing this improves the chances of pseudos
   in the thread getting the same hard register and, as a result,
   of allowing some move insns to be deleted.

> +   When we assign hard register to a pseudo, we decrease the cost of
> +   the hard registers for corresponding pseudos connected by copies.

Maybe:

   When we assign a hard register to a pseudo, we decrease the cost of
   using the same hard register for pseudos that are connected by copies.

> +   If two hard registers are equally good for assigning the pseudo
> +   with hard register cost point of view, we prefer a hard register in
> +   smaller register bank.  By default, there is only one register
> +   bank.  A target can define register banks by hook
> +   register_bank. For example, x86-64 has a few register banks: hard
> +   regs with and without REX prefixes are in different banks.  It
> +   permits to generate smaller code as insns without REX prefix are
> +   shorter.

Maybe:

   If two hard registers have the same frequency-derived cost,
   we prefer hard registers in lower register banks.  The mapping
   of registers to banks is controlled by the register_bank target hook.
   For example, x86-64 has a few register banks: hard registers with and
   without REX prefixes are in different banks.  This permits us
   to generate smaller code as insns without REX prefixes are shorter.

although this might change if the name of the hook changes.

> +/* Info about pseudo used during the assignment pass.  Thread is a set
> +   of connected reload and inheritance pseudos with the same set of
> +   available hard reg set.  Thread is a pseudo itself for other
> +   cases.  */
> +struct regno_assign_info

Maybe:

/* Information about the thread to which a pseudo belongs.  Threads are
   a set of connected reload and inheritance pseudos with the same set of
   available hard registers.  Lone registers belong to their own threads.  */

Although the condition seems to be:

> +	&& (ira_class_hard_regs_num[regno_allocno_class_array[regno1]]
> +	    == ira_class_hard_regs_num[regno_allocno_class_array[regno2]]))

i.e. the same _number_ of available hard regs, but not necessarily the
same set.

"thread" might be more mnemonic than "regno_assign" in this file,
but that's bikeshed stuff.

> +  for (i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
> +    {
> +      regno_assign_info[i].first = i;
> +      regno_assign_info[i].next = -1;
> +      regno_assign_info[i].freq = lra_reg_info[i].freq;
> +    }

Minor speedup, but it's probably worth caching max_reg_num () rather than
calling it in each loop iteration.  Several other loops with the same thing.
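
E.g.:

  int max_regno = max_reg_num ();

  for (i = FIRST_PSEUDO_REGISTER; i < max_regno; i++)
    {
      regno_assign_info[i].first = i;
      regno_assign_info[i].next = -1;
      regno_assign_info[i].freq = lra_reg_info[i].freq;
    }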

> +/* Process a pseudo copy with execution frequency COPY_FREQ connecting
> +   REGNO1 and REGNO2 to form threads.  */
> +static void
> +process_copy_to_form_thread (int regno1, int regno2, int copy_freq)
> +{
> +  int last, regno1_first, regno2_first;
> +
> +  lra_assert (regno1 >= lra_constraint_new_regno_start
> +	      && regno2 >= lra_constraint_new_regno_start);
> +  regno1_first = regno_assign_info[regno1].first;
> +  regno2_first = regno_assign_info[regno2].first;
> +  if (regno1_first != regno2_first)
> +    {
> +      for (last = regno2_first;
> +	   regno_assign_info[last].next >= 0;
> +	   last = regno_assign_info[last].next)
> +	regno_assign_info[last].first = regno1_first;
> +      regno_assign_info[last].next = regno_assign_info[regno1_first].next;
> +      regno_assign_info[regno1_first].first = regno2_first;
> +      regno_assign_info[regno1_first].freq
> +	+= regno_assign_info[regno2_first].freq;

Couple of things I don't understand here:

- Why don't we set regno_assign_info[last].first (for final "last")
  to regno1_first?  I.e. the loop stops while "last" is still valid,
  but only assigns to that element's "next" field, leaving "first"
  as before.

- I might be wrong, but should:

    regno_assign_info[regno1_first].first = regno2_first;

  be:

    regno_assign_info[regno1_first].next = regno2_first;

  so that the list becomes:

    regno1_first regno2_first ... last ...

  The current version seems to create a cycle:

   regno_assign_info[regno1_first].first == regno2_first
   regno_assign_info[regno2_first].first == regno1_first
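
Taking both points together, I'd have expected something like
(untested):

      for (last = regno2_first;
	   regno_assign_info[last].next >= 0;
	   last = regno_assign_info[last].next)
	regno_assign_info[last].first = regno1_first;
      /* The final "last" needs its "first" field updated too.  */
      regno_assign_info[last].first = regno1_first;
      /* Splice regno2's thread in directly after regno1_first.  */
      regno_assign_info[last].next = regno_assign_info[regno1_first].next;
      regno_assign_info[regno1_first].next = regno2_first;
      regno_assign_info[regno1_first].freq
	+= regno_assign_info[regno2_first].freq;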

> +/* Update LIVE_HARD_REG_PSEUDOS and LIVE_PSEUDOS_REG_RENUMBER by
> +   pseudo REGNO assignment or by the pseudo spilling if FREE_P.	 */

Maybe:

/* Update the LIVE_HARD_REG_PSEUDOS and LIVE_PSEUDOS_REG_RENUMBER
   entries for pseudo REGNO.  Assume that the register has been
   spilled if FREE_P, otherwise assume that it has been assigned
   reg_renumber[REGNO] (if >= 0).  */

> +/* Find and return best (or TRY_ONLY_HARD_REGNO) free hard register
> +   for pseudo REGNO.  In the failure case, return a negative number.
> +   Return through *COST the cost of usage of the hard register for the
> +   pseudo.  Best free hard register has smallest cost of usage for
> +   REGNO or smallest register bank if the cost is the same.  */

Maybe:

/* Try to find a free hard register for pseudo REGNO.  Return the
   hard register on success and set *COST to the cost of using
   that register.  (If several registers have equal cost, the one with
   the lowest register bank wins.)  Return -1 on failure.

   If TRY_ONLY_HARD_REGNO >= 0, consider only that hard register,
   otherwise consider all hard registers in REGNO's class.  */

> +      if (hard_regno_costs_check[hard_regno] != curr_hard_regno_costs_check)
> +	hard_regno_costs[hard_regno] = 0;
> +      hard_regno_costs_check[hard_regno] = curr_hard_regno_costs_check;
> +      hard_regno_costs[hard_regno]
> +	-= lra_reg_info[regno].preferred_hard_regno_profit1;

This pattern occurs several times.  I think it'd be clearer to have
an inline helper function (adjust_hard_regno_cost, or whatever).
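
E.g. (untested):

static inline void
adjust_hard_regno_cost (int hard_regno, int incr)
{
  if (hard_regno_costs_check[hard_regno] != curr_hard_regno_costs_check)
    hard_regno_costs[hard_regno] = 0;
  hard_regno_costs_check[hard_regno] = curr_hard_regno_costs_check;
  hard_regno_costs[hard_regno] += incr;
}

with the quoted pattern then becoming:

      adjust_hard_regno_cost
	(hard_regno, -lra_reg_info[regno].preferred_hard_regno_profit1);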

> +  /* That is important for allocation of multi-word pseudos.  */
> +  IOR_COMPL_HARD_REG_SET (conflict_set, reg_class_contents[rclass]);

Maybe:

  /* Make sure that all registers in a multi-word pseudo belong to the
     required class.  */

> +	  /* We can not use prohibited_class_mode_regs because it is
> +	     defined not for all classes.  */

s/defined not/not defined/

> +	  && ! TEST_HARD_REG_BIT (impossible_start_hard_regs, hard_regno)
> +	  && (nregs_diff == 0
> +#ifdef WORDS_BIG_ENDIAN
> +	      || (hard_regno - nregs_diff >= 0
> +		  && TEST_HARD_REG_BIT (reg_class_contents[rclass],
> +					hard_regno - nregs_diff))
> +#else
> +	      || TEST_HARD_REG_BIT (reg_class_contents[rclass],
> +				    hard_regno + nregs_diff)
> +#endif
> +	      ))

impossible_start_hard_regs is set up as:

> +	conflict_hr = live_pseudos_reg_renumber[conflict_regno];
> +	nregs = (hard_regno_nregs[conflict_hr]
> +		 [lra_reg_info[conflict_regno].biggest_mode]);
> +	/* Remember about multi-register pseudos.  For example, 2 hard
> +	   register pseudos can start on the same hard register but can
> +	   not start on HR and HR+1/HR-1.  */ 
> +	for (hr = conflict_hr + 1;
> +	     hr < FIRST_PSEUDO_REGISTER && hr < conflict_hr + nregs;
> +	     hr++)
> +	  SET_HARD_REG_BIT (impossible_start_hard_regs, hr);
> +	for (hr = conflict_hr - 1;
> +	     hr >= 0 && hr + hard_regno_nregs[hr][biggest_mode] > conflict_hr;
> +	     hr--)
> +	  SET_HARD_REG_BIT (impossible_start_hard_regs, hr);

which I don't think copes with big-endian cases like:

  other hard reg in widest mode:    ........XXXX...
  impossible_start_regs:            .....XXX.XXX...
  this hard reg in pseudo's mode:   ............XX.
  this hard reg in widest mode:     ..........XXXX.

which AIUI is an invalid choice.

There are other corner cases too.  If the other hard reg is narrower than
its widest mode, and that widest mode is wider than the current regno's
widest mode, then on big-endian targets we could have:

  other hard reg in its own mode:   ........XX....
  other hard reg in widest mode:    ......XXXX.....
  impossible_start_regs:            .......X.XXX... (*)
  this hard reg in pseudo's mode:   .....XX........
  this hard reg in widest mode:     .....XX........

(*) note no big-endian adjustment for the other hard reg's widest mode here.

Maybe it would be easier to track impossible end regs for
big-endian targets?

> +/* Update HARD_REGNO preference for pseudos connected (directly or
> +   indirectly) to a pseudo with REGNO.	Use divisor DIV to the
> +   corresponding copy frequency for the hard regno cost preference
> +   calculation.	 The more indirectly a pseudo connected, the less the
> +   cost preference.  It is achieved by increasing the divisor for each
> +   next recursive level move.  */

"cost preference" seems a bit contradictory.  Maybe:

/* Update the preference for using HARD_REGNO for pseudos that are
   connected directly or indirectly with REGNO.  Apply divisor DIV
   to any preference adjustments.

   The more indirectly a pseudo is connected, the smaller its effect
   should be.  We therefore increase DIV on each "hop".  */

> +static void
> +update_hard_regno_preference (int regno, int hard_regno, int div)
> +{
> +  int another_regno, cost;
> +  lra_copy_t cp, next_cp;
> +
> +  /* Search depth 5 seems to be enough.	 */
> +  if (div > (1 << 5))
> +    return;
> +  for (cp = lra_reg_info[regno].copies; cp != NULL; cp = next_cp)
> +    {
> +      if (cp->regno1 == regno)
> +	{
> +	  next_cp = cp->regno1_next;
> +	  another_regno = cp->regno2;
> +	}
> +      else if (cp->regno2 == regno)
> +	{
> +	  next_cp = cp->regno2_next;
> +	  another_regno = cp->regno1;
> +	}
> +      else
> +	gcc_unreachable ();
> +      if (reg_renumber[another_regno] < 0
> +	  && (update_hard_regno_preference_check[another_regno]
> +	      != curr_update_hard_regno_preference_check))
> +	{
> +	  update_hard_regno_preference_check[another_regno]
> +	    = curr_update_hard_regno_preference_check;
> +	  cost = cp->freq < div ? 1 : cp->freq / div;
> +	  lra_setup_reload_pseudo_preferenced_hard_reg
> +	    (another_regno, hard_regno, cost);
> +	  update_hard_regno_preference (another_regno, hard_regno, div * 2);
> +	}
> +    }
> +}

Using a depth-first search for this seems a bit dangerous, because we
could end up processing a connected pseudo via a very indirect path
first, even though it is more tightly connected via a more direct path.
(Could be a well-known problem, sorry.)
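
E.g. if pseudo A is connected to B both directly and via C and D, and
the copy lists happen to lead the walk down the indirect path first,
B's preference is recorded with the larger divisor, and the stronger
direct update is then skipped because of
update_hard_regno_preference_check.  A breadth-first worklist would
reach each pseudo at its smallest divisor first.  Completely untested
sketch:

static void
update_hard_regno_preference (int regno, int hard_regno, int div)
{
  int another_regno, cost, head, tail;
  lra_copy_t cp, next_cp;
  /* Worklist of pseudos with the divisor that applies at their level.
     Each pseudo is queued at most once, so max_reg_num () bounds the
     queue size.  */
  int *queue = XNEWVEC (int, max_reg_num ());
  int *divs = XNEWVEC (int, max_reg_num ());

  head = tail = 0;
  queue[tail] = regno;
  divs[tail++] = div;
  while (head < tail)
    {
      regno = queue[head];
      div = divs[head++];
      /* Search depth 5 seems to be enough.  */
      if (div > (1 << 5))
	continue;
      for (cp = lra_reg_info[regno].copies; cp != NULL; cp = next_cp)
	{
	  if (cp->regno1 == regno)
	    {
	      next_cp = cp->regno1_next;
	      another_regno = cp->regno2;
	    }
	  else if (cp->regno2 == regno)
	    {
	      next_cp = cp->regno2_next;
	      another_regno = cp->regno1;
	    }
	  else
	    gcc_unreachable ();
	  if (reg_renumber[another_regno] < 0
	      && (update_hard_regno_preference_check[another_regno]
		  != curr_update_hard_regno_preference_check))
	    {
	      update_hard_regno_preference_check[another_regno]
		= curr_update_hard_regno_preference_check;
	      cost = cp->freq < div ? 1 : cp->freq / div;
	      lra_setup_reload_pseudo_preferenced_hard_reg
		(another_regno, hard_regno, cost);
	      queue[tail] = another_regno;
	      divs[tail++] = div * 2;
	    }
	}
    }
  free (queue);
  free (divs);
}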

> +/* Update REG_RENUMBER and other pseudo preferences by assignment of
> +   HARD_REGNO to pseudo REGNO and print about it if PRINT_P.  */
> +void
> +lra_setup_reg_renumber (int regno, int hard_regno, bool print_p)
> +{
> +  int i, hr;
> +
> +  if ((hr = hard_regno) < 0)
> +    hr = reg_renumber[regno];
> +  reg_renumber[regno] = hard_regno;
> +  lra_assert (hr >= 0);
> +  for (i = 0; i < hard_regno_nregs[hr][PSEUDO_REGNO_MODE (regno)]; i++)
> +    if (hard_regno < 0)
> +      lra_hard_reg_usage[hr + i] -= lra_reg_info[regno].freq;
> +    else
> +      lra_hard_reg_usage[hr + i] += lra_reg_info[regno].freq;

Is it possible for this function to reallocate a register,
i.e. for reg_regnumber to be >= 0 both before and after the call?
If so, I think we'd need two loops.  If not, an assert would be good.

> +      mode = PSEUDO_REGNO_MODE (spill_regno);
> +      if (lra_hard_reg_set_intersection_p
> +	  (live_pseudos_reg_renumber[spill_regno],
> +	   mode, reg_class_contents[rclass]))
> +	{
> +	  hard_regno = live_pseudos_reg_renumber[spill_regno];

Very minor, sorry, but I think this would be more readable with the
hard_regno assignment before the condition and hard_regno used in it.
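
I.e.:

	  hard_regno = live_pseudos_reg_renumber[spill_regno];
	  mode = PSEUDO_REGNO_MODE (spill_regno);
	  if (lra_hard_reg_set_intersection_p (hard_regno, mode,
					       reg_class_contents[rclass]))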

> +/* Spill some pseudos for a reload pseudo REGNO and return hard
> +   register which should be used for pseudo after spilling.  The
> +   function adds spilled pseudos to SPILLED_PSEUDO_BITMAP.  When we
> +   choose hard register (and pseudos occupying the hard registers and
> +   to be spilled), we take into account not only how REGNO will
> +   benefit from the spills but also how other reload pseudos not
> +   assigned to hard registers yet benefit from the spills too.	*/

"...not yet assigned to hard registers benefit..."

> +  curr_pseudo_check++; /* Invalidate try_hard_reg_pseudos elements.  */

Comment on its own line.
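
I.e.:

  /* Invalidate try_hard_reg_pseudos elements.  */
  curr_pseudo_check++;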

> +  bitmap_clear (&ignore_pseudos_bitmap);
> +  bitmap_clear (&best_spill_pseudos_bitmap);
> +  EXECUTE_IF_SET_IN_BITMAP (&lra_reg_info[regno].insn_bitmap, 0, uid, bi)
> +    {
> +      struct lra_insn_reg *ir;
> +      
> +      for (ir = lra_get_insn_regs (uid); ir != NULL; ir = ir->next)
> +	if (ir->regno >= FIRST_PSEUDO_REGISTER)
> +	  bitmap_set_bit (&ignore_pseudos_bitmap, ir->regno);
> +    }

The name "ignore_pseudos_bitmap" doesn't seem to describe how the set is
actually used.  We still allow the pseudos to be spilled, but the number
of such spills is the first-order cost.  Maybe "insn_conflict_pseudos"
or something like that?

> +      /* Spill pseudos.	 */
> +      CLEAR_HARD_REG_SET (spilled_hard_regs);
> +      EXECUTE_IF_SET_IN_BITMAP (&spill_pseudos_bitmap, 0, spill_regno, bi)
> +	if ((int) spill_regno >= lra_constraint_new_regno_start
> +	    /* ??? */
> +	    && ! bitmap_bit_p (&lra_inheritance_pseudos, spill_regno)
> +	    && ! bitmap_bit_p (&lra_split_pseudos, spill_regno)
> +	    && ! bitmap_bit_p (&lra_optional_reload_pseudos, spill_regno))
> +	  goto fail;

Leftover ??? (or lacks enough info if it's supposed to be kept)

> +	      EXECUTE_IF_SET_IN_BITMAP (&live_hard_reg_pseudos[r->start],
> +					0, k, bi2)
> +		sparseset_set_bit (live_range_hard_reg_pseudos, k);

live_range_hard_reg_pseudos and &live_hard_reg_pseudos[r->start]
seem like similar quantities.  Was there a reason for using
sparsesets for one and bitmaps for the other?

> +	      for (p = r->start + 1; p <= r->finish; p++)
> +		{
> +		  lra_live_range_t r2;
> +		  
> +		  for (r2 = lra_start_point_ranges[p];
> +		       r2 != NULL;
> +		       r2 = r2->start_next)
> +		    if (r2->regno >= lra_constraint_new_regno_start)
> +		      sparseset_set_bit (live_range_reload_pseudos, r2->regno);
> +		}

This is probably just showing my ignorance, but -- taking the above two
quotes together -- why do we calculate these two live sets in different ways?
Also, does live_range_reload_pseudos really just contain "reload" pseudos,
or inheritance pseudos as well?

> +      /* We are trying to spill a reload pseudo.  That is wrong we
> +	 should assign all reload pseudos, otherwise we cannot reuse
> +	 the selected alternatives.  */
> +      hard_regno = find_hard_regno_for (regno, &cost, -1);
> +      if (hard_regno >= 0)
> +	{

Don't really understand this comment, sorry.

Also, why are we passing -1 to find_hard_regno_for, rather than hard_regno?
The loop body up till this point has been specifically freeing up registers
to make hard_regno allocatable.  I realise that, by spilling everything
that overlaps this range, we might have freed up other registers too,
and so made others besides hard_regno allocatable.  But wouldn't we want
to test those other hard registers in "their" iteration of the loop
instead of this one?  The spills in those iterations ought to be more
directed (i.e. there should be less incidental spilling).

As things stand, doing an rclass_size * rclass_size scan seems
unnecessarily expensive, although probably off the radar.

> +	  assign_temporarily (regno, hard_regno);
> +	  n = 0;
> +	  EXECUTE_IF_SET_IN_SPARSESET (live_range_reload_pseudos, reload_regno)
> +	    if (live_pseudos_reg_renumber[reload_regno] < 0
> +		&& (hard_reg_set_intersect_p
> +		    (reg_class_contents
> +		     [regno_allocno_class_array[reload_regno]],
> +		     spilled_hard_regs)))
> +	      sorted_reload_pseudos[n++] = reload_regno;
> +	  qsort (sorted_reload_pseudos, n, sizeof (int),
> +		 reload_pseudo_compare_func);
> +	  for (j = 0; j < n; j++)
> +	    {
> +	      reload_regno = sorted_reload_pseudos[j];
> +	      if (live_pseudos_reg_renumber[reload_regno] < 0

Just trying to make sure I understand, but: isn't the final condition in
this quote redundant?  I thought that was a requirement for the register
being in sorted_reload_pseudos to begin with.

> +		  && (reload_hard_regno
> +		      = find_hard_regno_for (reload_regno,
> +					     &reload_cost, -1)) >= 0
> +		  && (lra_hard_reg_set_intersection_p
> +		      (reload_hard_regno, PSEUDO_REGNO_MODE (reload_regno),
> +		       spilled_hard_regs)))
> +		{
> +		  if (lra_dump_file != NULL)
> +		    fprintf (lra_dump_file, " assign %d(cost=%d)",
> +			     reload_regno, reload_cost);
> +		  assign_temporarily (reload_regno, reload_hard_regno);
> +		  cost += reload_cost;

It looks like registers that can be reallocated make hard_regno more
expensive (specifically by reload_cost), but registers that can't be
reallocated contribute no cost.  Is that right?  Seemed a little odd,
so maybe worth a comment.

Also, AIUI find_hard_regno_for is trying to allocate the register for
reload_regno on the basis that reload_regno has the same conflicts as
the current regno, and so it's only an approximation.  Is that right?
Might be worth a comment if so (not least to explain why we don't commit
to this allocation if we end up choosing hard_regno).

> +	  if (best_insn_pseudos_num > insn_pseudos_num
> +	      || (best_insn_pseudos_num == insn_pseudos_num
> +		  && best_cost > cost))

Should we check the register bank and levelling here too,
for consistency?

> +      /* Restore the live hard reg pseudo info for spilled pseudos.  */
> +      EXECUTE_IF_SET_IN_BITMAP (&spill_pseudos_bitmap, 0, spill_regno, bi)
> +	update_lives (spill_regno, false);

I couldn't tell why this was outside the "hard_regno >= 0" condition.
Do we really change these registers even if find_hard_regno_for fails?

> +  /* Spill: */
> +  EXECUTE_IF_SET_IN_BITMAP (&best_spill_pseudos_bitmap, 0, spill_regno, bi)

Very minor, but I think it'd be worth asserting that best_hard_regno >= 0
before this loop.

> +/* Constraint transformation can use equivalences and they can
> +   contains pseudos assigned to hard registers.	 Such equivalence
> +   usage might create new conflicts of pseudos with hard registers
> +   (like ones used for parameter passing or call clobbered ones) or
> +   other pseudos assigned to the same hard registers.  Another very
> +   rare risky transformation is restoring whole multi-register pseudo
> +   when only one subreg lives and unused hard register is used already
> +   for something else.

In a way, I found this comment almost too detailed. :-)  Maybe:

/* The constraints pass is allowed to create equivalences between
   pseudos that make the current allocation "incorrect" (in the sense
   that pseudos are assigned to hard registers from their own conflict sets).
   The global variable lra_risky_transformations_p says whether this might
   have happened.

if that's accurate.  The detail about when this occurs probably
belongs above lra_risky_transformations_p, although it's mostly
there already.  (Haven't got to the ira-conflicts.c stuff yet,
so no comments about that here.)

> +   Process pseudos assigned to hard registers (most frequently used
> +   first), spill if a conflict is found, and mark the spilled pseudos
> +   in SPILLED_PSEUDO_BITMAP.  Set up LIVE_HARD_REG_PSEUDOS from
> +   pseudos, assigned to hard registers.	 */

Why do we spill the most frequently used registers first?  Probably worth
a comment.

> +  for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
> +    if (reg_renumber[i] >= 0 && lra_reg_info[i].nrefs > 0)
> +      {
> +	if (lra_risky_transformations_p)
> +	  sorted_pseudos[n++] = i;
> +	else
> +	  update_lives (i, false);
> +      }
> +  if (! lra_risky_transformations_p)
> +    return;

Seems like this would be more logically split into two (the
lra_risky_transformations_p case and the !lra_risky_transformations_p case).
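
Something like (untested):

  int i, n, max_regno = max_reg_num ();

  if (! lra_risky_transformations_p)
    {
      /* The allocation is still known to be correct: just update the
	 live info for pseudos assigned to hard registers.  */
      for (i = FIRST_PSEUDO_REGISTER; i < max_regno; i++)
	if (reg_renumber[i] >= 0 && lra_reg_info[i].nrefs > 0)
	  update_lives (i, false);
      return;
    }
  /* Otherwise collect the pseudos for the conflict processing below.  */
  for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_regno; i++)
    if (reg_renumber[i] >= 0 && lra_reg_info[i].nrefs > 0)
      sorted_pseudos[n++] = i;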

> +	    /* If it is multi-register pseudos they should start on
> +	       the same hard register.	*/
> +	    || hard_regno != reg_renumber[conflict_regno])

This seems different from the find_hard_regno_for case, which took
biggest_mode into account.

> +	  /* Don't change reload pseudo allocation.  It might have
> +	     this allocation for a purpose (e.g. bound to another
> +	     pseudo) and changing it can result in LRA cycling.	 */
> +	  if (another_regno < lra_constraint_new_regno_start
> +	      && (another_hard_regno = reg_renumber[another_regno]) >= 0
> +	      && another_hard_regno != hard_regno)

Seems like this excludes split pseudos as well as reload pseudos,
or are they never included in these copies?  Might be worth mentioning
them either way.

The only general comment I have so far is that it's sometimes
difficult to follow which types of pseudos are being included
or excluded by a comparison with lra_constraint_new_regno_start.
Sometimes the comments talk about "reload pseudos", but other
similar checks imply that the registers could be inheritance
pseudos or split pseudos as well.  Some thin inline wrappers
might help here.
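
E.g. (names invented):

static inline bool
new_pseudo_p (int regno)
{
  return regno >= lra_constraint_new_regno_start;
}

static inline bool
inheritance_pseudo_p (int regno)
{
  return bitmap_bit_p (&lra_inheritance_pseudos, regno);
}

static inline bool
split_pseudo_p (int regno)
{
  return bitmap_bit_p (&lra_split_pseudos, regno);
}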

> +	      /* Remember that reload pseudos can be spilled on the
> +		 1st pass.  */
> +	      bitmap_clear_bit (&all_spilled_pseudos, regno);
> +	      assign_hard_regno (hard_regno, regno);

Maybe:

  /* This register might have been spilled by the previous pass.
     Indicate that it is no longer spilled.  */

> +		/* We can use inheritance pseudos in original insns
> +		   (not reload ones).  */
> +		if (regno < lra_constraint_new_regno_start
> +		    || bitmap_bit_p (&lra_inheritance_pseudos, regno)
> +		    || reg_renumber[regno] < 0)
> +		  continue;
> +		sorted_pseudos[nfails++] = regno;
> +		if (lra_dump_file != NULL)
> +		  fprintf (lra_dump_file,
> +			   "	  Spill reload r%d(hr=%d, freq=%d)\n",
> +			   regno, reg_renumber[regno],
> +			   lra_reg_info[regno].freq);

Same comment about types of pseudo as above.  (I.e. the code checks for
inheritance pseudos, but not split pseudos.)

> +  bitmap_initialize (&do_not_assign_nonreload_pseudos, &reg_obstack);
> +  EXECUTE_IF_SET_IN_BITMAP (&lra_inheritance_pseudos, 0, u, bi)
> +    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
> +	&& reg_renumber[u] < 0 && bitmap_bit_p (&lra_inheritance_pseudos, u))
> +      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
> +  EXECUTE_IF_SET_IN_BITMAP (&lra_split_pseudos, 0, u, bi)
> +    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
> +	&& reg_renumber[u] >= 0 && bitmap_bit_p (&lra_split_pseudos, u))
> +      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);

The bitmap_bit_p tests look redundant.  Also, the following code is:

> +  for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
> +    if (((i < lra_constraint_new_regno_start
> +	  && ! bitmap_bit_p (&do_not_assign_nonreload_pseudos, i))
> +	 || (bitmap_bit_p (&lra_inheritance_pseudos, i)
> +	     && lra_reg_info[i].restore_regno >= 0)
> +	 || (bitmap_bit_p (&lra_split_pseudos, i)
> +	     && lra_reg_info[i].restore_regno >= 0)
> +	 || bitmap_bit_p (&lra_optional_reload_pseudos, i))
> +	&& reg_renumber[i] < 0 && lra_reg_info[i].nrefs != 0
> +	&& regno_allocno_class_array[i] != NO_REGS)
> +      sorted_pseudos[n++] = i;
> +  bitmap_clear (&do_not_assign_nonreload_pseudos);

where we test very similar things inline, and then clear
do_not_assign_nonreload_pseudos.  Do we need d_n_a_n_p at all?

> +  if (n != 0 && lra_dump_file != NULL)
> +    fprintf (lra_dump_file, "  Reassing non-reload pseudos\n");

"Reassigning"

Richard
Richard Sandiford Oct. 10, 2012, 3:31 p.m. UTC | #3
Hi Vlad,

Here's a review of the first part of lra-constraints.c.  It's a big file,
and this is a long reply, so I'll send further comments another day in
another message.

Vladimir Makarov <vmakarov@redhat.com> writes:
> +/* This file contains code for 3 passes: constraint pass,
> +   inheritance/split pass, and pass for undoing failed inheritance and
> +   split.
> +
> +   The major goal of constraint pass is to transform RTL to satisfy
> +   insn and address constraints by:
> +     o choosing insn alternatives;
> +     o generating *reload insns* (or reloads in brief) and *reload
> +       pseudos* which will got necessary hard registers later;

s/got/get/

> +     o substituting pseudo equivalences (if it is done once, is done
> +       everywhere) and removes insns initializing used equivalent
> +       substitution.

Suggest:

     o substituting pseudos with equivalent values and removing the
       instructions that initialized those pseudos.

> +   To speed the pass up we process only necessary insns (first time
> +   all insns) and reuse of already chosen alternatives in some
> +   cases.

Suggest:

   On the first iteration of the pass we process every instruction and
   choose an alternative for each one.  On subsequent iterations we try
   to avoid reprocessing instructions if we can be sure that the old
   choice is still valid.

> +   The inheritance/spilt pass is to transform code to achieve
> +   ineheritance and live range splitting.  It is done on backward
> +   traverse of EBBs.

Typo: inheritance.  "backward traversal".

> +   The inheritance optimization goal is to reuse values in hard
> +   registers. There is analogous optimization in old reload pass.  The
> +   inheritance is achieved by following transformation:
> +
> +       reload_p1 <- p	     reload_p1 <- p
> +       ...		     new_p <- reload_p1
> +       ...		=>   ...
> +       reload_p2 <- p	     reload_p2 <- new_p
> +
> +   where p is spilled and not changed between the insns.  Reload_p1 is
> +   also called *original pseudo* and new_p is called *inheritance
> +   pseudo*.
> +
> +   The subsequent assignment pass will try to assign the same (or
> +   another if it is not possible) hard register to new_p as to
> +   reload_p1 or reload_p2.
> +
> +   If it fails to assign a hard register, the opposite transformation
> +   will restore the original code on (the pass called undoing
> +   inheritance) because with spilled new_p the code would be much
> +   worse. [...]

Maybe:

   If the assignment pass fails to assign a hard register to new_p,
   this file will undo the inheritance and restore the original code.
   This is because implementing the above sequence with a spilled
   new_p would make the code much worse.

> +   Splitting (transformation) is also done in EBB scope on the same
> +   pass as the inheritance:
> +
> +       r <- ... or ... <- r		 r <- ... or ... <- r
> +       ...				 s <- r (new insn -- save)
> +       ...			  => 
> +       ...				 r <- s (new insn -- restore)
> +       ... <- r				 ... <- r
> +
> +    The *split pseudo* s is assigned to the hard register of the
> +    original pseudo or hard register r.
> +
> +    Splitting is done:
> +      o In EBBs with high register pressure for global pseudos (living
> +	in at least 2 BBs) and assigned to hard registers when there
> +	are more one reloads needing the hard registers;
> +      o for pseudos needing save/restore code around calls.
> +
> +    If the split pseudo still has the same hard register as the
> +    original pseudo after the subsequent assignment pass, the opposite
> +    transformation is done on the same pass for undoing inheritance.  */

AIUI spill_for can spill split pseudos.  I think the comment should say
what happens then.  If I understand the code correctly, we keep the
split if "r" is a hard register or was assigned a hard register.
We undo it if "r" was not assigned a hard register.  Is that right?

> +/* Array whose element is (MEM:MODE BASE_REG) corresponding to the
> +   mode (index) and where BASE_REG is a base hard register for given
> +   memory mode.	 */
> +static rtx indirect_mem[MAX_MACHINE_MODE];

Maybe:

/* Index M is an rtx of the form (mem:M BASE_REG), where BASE_REG
   is a sample hard register that is a valid address for mode M.
   The memory refers to the generic address space.  */

> +/* Return class of hard regno of REGNO or if it is was not assigned to
> +   a hard register, return its allocno class but only for reload
> +   pseudos created on the current constraint pass.  Otherwise, return
> +   NO_REGS.  */
> +static enum reg_class
> +get_reg_class (int regno)

Maybe:

/* If REGNO is a hard register or has been allocated a hard register,
   return the class of that register.  If REGNO is a pseudo created
   by the current constraints pass, assume that it will be allocated
   a hard register and return the class that that register will have.
   (This assumption is optimistic when REGNO is an inheritance or
   split pseudo.)  Return NO_REGS otherwise.  */

if that's accurate.  I dropped the term "reload pseudo" because of
the general comment in my earlier reply about the use of "reload pseudo"
when the code seems to include inheritance and split pseudos too.

> +/* Return true if REGNO in REG_MODE satisfies reg class constraint CL.
> +   For new reload pseudos we should make more accurate class
> +   *NEW_CLASS (we set up it if it is not NULL) to satisfy the
> +   constraints.  Otherwise, set up NEW_CLASS to NO_REGS.  */
> +static bool
> +in_class_p (int regno, enum machine_mode reg_mode,
> +	    enum reg_class cl, enum reg_class *new_class)

Same comment here, since it uses get_reg_class.  I.e. for registers >=
new_regno_start, we're really testing whether the first allocatable
register in REGNO's allocno class satisfies CL.

Also, the only caller that doesn't directly pass REGNO and REG_MODE
from an rtx is process_addr_reg.  in_class_p uses:

> +  if (new_class != NULL)
> +    *new_class = NO_REGS;
> +  if (regno < FIRST_PSEUDO_REGISTER)
> +    return TEST_HARD_REG_BIT (reg_class_contents[cl], regno);
> +  rclass = get_reg_class (regno);

whereas process_addr_reg uses:

> +  final_regno = regno = REGNO (reg);
> +  if (regno < FIRST_PSEUDO_REGISTER)
> +    {
> +      rtx final_reg = reg;
> +      rtx *final_loc = &final_reg;
> +
> +      lra_eliminate_reg_if_possible (final_loc);
> +      final_regno = REGNO (*final_loc);
> +    }

I.e. process_addr_reg applies eliminations before testing whereas
in_class_p doesn't.  I couldn't really tell why the two were different.
Since the idea is that we use elimination source registers to represent
their targets, shouldn't in_class_p eliminate too?

With that difference removed, in_class_p could take the rtx instead
of a (REGNO, MODE) pair.  It could then pass that rtx directly to
lra_eliminate_reg_if_possible.  I think this would lead to a cleaner
interface and make things more regular.

Then the comment for in_class_p could be:

/* Return true if X satisfies (or will satisfy) reg class constraint CL.
   If X is a pseudo created by this constraints pass, assume that it will
   be allocated a hard register from its allocno class, but allow that
   class to be narrowed to CL if it is currently a superset of CL.

   If NEW_CLASS is nonnull, set *NEW_CLASS to the new allocno class
   of REGNO (X), or NO_REGS if no change in its class was needed.  */

That's a change in the meaning of NEW_CLASS, but seems easier for
callers to handle.  I think all it requires is changing:

> +      common_class = ira_reg_class_subset[rclass][cl];
> +      if (new_class != NULL)
> +	*new_class = common_class;

to:

      common_class = ira_reg_class_subset[rclass][cl];
      if (new_class != NULL && rclass != common_class)
	*new_class = common_class;

> +  if (regno < new_regno_start
> +      /* Do not make more accurate class from reloads generated.  They
> +	 are mostly moves with a lot of constraints.  Making more
> +	 accurate class may results in very narrow class and
> +	 impossibility of find registers for several reloads of one
> +	 insn.	*/

Maybe:

      /* Do not allow the constraints for reload instructions to
	 influence the classes of new pseudos.  These reloads are
	 typically moves that have many alternatives, and restricting
	 reload pseudos for one alternative may lead to situations
	 where other reload pseudos are no longer allocatable.  */

> +      || INSN_UID (curr_insn) >= new_insn_uid_start)
> +    return ((regno >= new_regno_start && rclass == ALL_REGS)
> +	    || (rclass != NO_REGS && ira_class_subset_p[rclass][cl]
> +		&& ! hard_reg_set_subset_p (reg_class_contents[cl],
> +					    lra_no_alloc_regs)));

Why the ALL_REGS special case?  I think it deserves a comment.

> +/* Return the defined and profitable equiv substitution of reg X, return
> +   X otherwise.	 */

Maybe:

/* If we have decided to substitute X with another value, return that value,
   otherwise return X.  */

> +/* Change class of pseudo REGNO to NEW_CLASS.  Print info about it
> +   using TITLE.	 Output a new line if NL_P.  */
> +static void
> +change_class (int regno, enum reg_class new_class,
> +	      const char *title, bool nl_p)
> +{
> +  if (lra_dump_file != NULL)
> +    fprintf (lra_dump_file, "%s to class %s for r%d",
> +	     title, reg_class_names[new_class], regno);
> +  setup_reg_classes (regno, new_class, NO_REGS, new_class);
> +  if (lra_dump_file != NULL && nl_p)
> +    fprintf (lra_dump_file, "\n");
> +}

I think either this or setup_reg_classes should have an assert
that REGNO is >= FIRST_PSEUDO_REGISTER.   This matters more now
because a lot of LRA deals with hard and pseudo registers
side-by-side.
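
E.g. at the start of change_class:

  lra_assert (regno >= FIRST_PSEUDO_REGISTER);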

> +/* Create a new pseudo using MODE, RCLASS, ORIGINAL, TITLE or reuse
> +   already created input reload pseudo (only if TYPE is not OP_OUT).
> +   The result pseudo is returned through RESULT_REG.  Return TRUE if
> +   we created a new pseudo, FALSE if we reused the already created
> +   input reload pseudo.	 */

Maybe:

/* Store in *RESULT_REG a register for reloading ORIGINAL, which has
   mode MODE.  TYPE specifies the direction of the reload -- either OP_IN
   or OP_OUT -- and RCLASS specifies the class of hard register required.

   Try to reuse existing input reloads where possible.  Return true if
   *RESULT_REG is a new register, false if it is an existing one.
   Use TITLE to describe new registers for debug purposes.  */

although I admit that's a bit convoluted...

> +  for (i = 0; i < curr_insn_input_reloads_num; i++)
> +    if (rtx_equal_p (curr_insn_input_reloads[i].input, original))
> +      break;
> +  if (i >= curr_insn_input_reloads_num
> +      || ! in_class_p (REGNO (curr_insn_input_reloads[i].reg),
> +		       GET_MODE (curr_insn_input_reloads[i].reg),
> +		       rclass, &new_class))
> +    {
> +      res_p = true;
> +      *result_reg = lra_create_new_reg (mode, original, rclass, title);
> +    }
> +  else
> +    {
> +      lra_assert (! side_effects_p (original));
> +      res_p = false;
> +      *result_reg = curr_insn_input_reloads[i].reg;
> +      regno = REGNO (*result_reg);
> +      if (lra_dump_file != NULL)
> +	 {
> +	   fprintf (lra_dump_file, "	 Reuse r%d for reload ", regno);
> +	   print_value_slim (lra_dump_file, original, 1);
> +	 }
> +      if (rclass != new_class)
> +	 change_class (regno, new_class, ", change", false);
> +      if (lra_dump_file != NULL)
> +	 fprintf (lra_dump_file, "\n");
> +    }
> +  lra_assert (curr_insn_input_reloads_num < LRA_MAX_INSN_RELOADS);
> +  curr_insn_input_reloads[curr_insn_input_reloads_num].input = original;
> +  curr_insn_input_reloads[curr_insn_input_reloads_num++].reg = *result_reg;
> +  return res_p;

It probably doesn't matter in practice, but I think this would
be better as:

  for (i = 0; i < curr_insn_input_reloads_num; i++)
    if (rtx_equal_p (curr_insn_input_reloads[i].input, original)
        && in_class_p (curr_insn_input_reloads[i].reg, rclass, &new_class))
      {
        ...reuse case..
        return false;
      }
  ...new case...
  return true;

which also copes with the unlikely case that the same input is used
three times, and that the third use requires the same class as the
second.

> +/* The page contains code to extract memory address parts.  */
> +
> +/* Info about base and index regs of an address.  In some rare cases,
> +   base/index register can be actually memory.	In this case we will
> +   reload it.  */
> +struct address
> +{
> +  rtx *base_reg_loc;  /* NULL if there is no a base register.  */
> +  rtx *base_reg_loc2; /* Second location of {post/pre}_modify, NULL
> +			 otherwise.  */
> +  rtx *index_reg_loc; /* NULL if there is no an index register.	 */
> +  rtx *index_loc; /* location of index reg * scale or index_reg_loc
> +		      otherwise.  */
> +  rtx *disp_loc; /* NULL if there is no a displacement.	 */
> +  /* Defined if base_reg_loc is not NULL.  */
> +  enum rtx_code base_outer_code, index_code;
> +  /* True if the base register is modified in the address, for
> +     example, in PRE_INC.  */
> +  bool base_modify_p;
> +};

Comments should be consistently above the fields rather than to the right.

> +/* Process address part in space AS (or all address if TOP_P) with
> +   location *LOC to extract address characteristics.
> +
> +   If CONTEXT_P is false, we are looking at the base part of an
> +   address, otherwise we are looking at the index part.
> +
> +   MODE is the mode of the memory reference; OUTER_CODE and INDEX_CODE
> +   give the context that the rtx appears in; MODIFY_P if *LOC is
> +   modified.  */
> +static void
> +extract_loc_address_regs (bool top_p, enum machine_mode mode, addr_space_t as,
> +			  rtx *loc, bool context_p, enum rtx_code outer_code,
> +			  enum rtx_code index_code,
> +			  bool modify_p, struct address *ad)
> +{
> +  rtx x = *loc;
> +  enum rtx_code code = GET_CODE (x);
> +  bool base_ok_p;
> +
> +  switch (code)
> +    {
> +    case CONST_INT:
> +    case CONST:
> +    case SYMBOL_REF:
> +    case LABEL_REF:
> +      if (! context_p)
> +	ad->disp_loc = loc;

This looks a bit odd.  I assume it's trying to avoid treating MULT
scale factors as displacements, but I thought whether something was
a displacement or not depended on whether it is involved (possibly
indirectly) in a sum with the base.  Seems like it'd be better
to check for that directly.

> +	/* If this machine only allows one register per address, it
> +	   must be in the first operand.  */
> +	if (MAX_REGS_PER_ADDRESS == 1 || code == LO_SUM)
> +	  {
> +	    extract_loc_address_regs (false, mode, as, arg0_loc, false, code,
> +				      code1, modify_p, ad);
> +	    ad->disp_loc = arg1_loc;
> +	  }
> +	/* If index and base registers are the same on this machine,
> +	   just record registers in any non-constant operands.	We
> +	   assume here, as well as in the tests below, that all
> +	   addresses are in canonical form.  */
> +	else if (INDEX_REG_CLASS
> +		 == base_reg_class (VOIDmode, as, PLUS, SCRATCH)
> +		 && code0 != PLUS && code0 != MULT)
> +	  {
> +	    extract_loc_address_regs (false, mode, as, arg0_loc, false, PLUS,
> +				      code1, modify_p, ad);
> +	    if (! CONSTANT_P (arg1))
> +	      extract_loc_address_regs (false, mode, as, arg1_loc, true, PLUS,
> +					code0, modify_p, ad);
> +	    else
> +	      ad->disp_loc = arg1_loc;
> +	  }
> +
> +	/* If the second operand is a constant integer, it doesn't
> +	   change what class the first operand must be.	 */
> +	else if (code1 == CONST_INT || code1 == CONST_DOUBLE)
> +	  {
> +	    ad->disp_loc = arg1_loc;
> +	    extract_loc_address_regs (false, mode, as, arg0_loc, context_p,
> +				      PLUS, code1, modify_p, ad);
> +	  }
> +	/* If the second operand is a symbolic constant, the first
> +	   operand must be an index register but only if this part is
> +	   all the address.  */
> +	else if (code1 == SYMBOL_REF || code1 == CONST || code1 == LABEL_REF)
> +	  {
> +	    ad->disp_loc = arg1_loc;
> +	    extract_loc_address_regs (false, mode, as, arg0_loc,
> +				      top_p ? true : context_p, PLUS, code1,
> +				      modify_p, ad);
> +	  }

What's the reason for the distinction between the last two, which AIUI
doesn't exist in reload?  I'm not sure the:

    top_p ? true : context_p

condition is safe: some targets use aligning addresses like
(and X (const_int -ALIGN)), but that shouldn't really affect whether
a register in X is treated as a base or an index.

> +	/* If both operands are registers but one is already a hard
> +	   register of index or reg-base class, give the other the
> +	   class that the hard register is not.	 */
> +	else if (code0 == REG && code1 == REG
> +		 && REGNO (arg0) < FIRST_PSEUDO_REGISTER
> +		 && ((base_ok_p
> +		      = ok_for_base_p_nonstrict (arg0, mode, as, PLUS, REG))
> +		     || ok_for_index_p_nonstrict (arg0)))
> +	  {
> +	    extract_loc_address_regs (false, mode, as, arg0_loc, ! base_ok_p,
> +				      PLUS, REG, modify_p, ad);
> +	    extract_loc_address_regs (false, mode, as, arg1_loc, base_ok_p,
> +				      PLUS, REG, modify_p, ad);
> +	  }
> +	else if (code0 == REG && code1 == REG
> +		 && REGNO (arg1) < FIRST_PSEUDO_REGISTER
> +		 && ((base_ok_p
> +		      = ok_for_base_p_nonstrict (arg1, mode, as, PLUS, REG))
> +		     || ok_for_index_p_nonstrict (arg1)))
> +	  {
> +	    extract_loc_address_regs (false, mode, as, arg0_loc, base_ok_p,
> +				      PLUS, REG, modify_p, ad);
> +	    extract_loc_address_regs (false, mode, as, arg1_loc, ! base_ok_p,
> +				      PLUS, REG, modify_p, ad);
> +	  }
> +	/* If one operand is known to be a pointer, it must be the
> +	   base with the other operand the index.  Likewise if the
> +	   other operand is a MULT.  */
> +	else if ((code0 == REG && REG_POINTER (arg0)) || code1 == MULT)
> +	  {
> +	    extract_loc_address_regs (false, mode, as, arg0_loc, false, PLUS,
> +				      code1, modify_p, ad);
> +	    if (code1 == MULT)
> +	      ad->index_loc = arg1_loc;
> +	    extract_loc_address_regs (false, mode, as, arg1_loc, true, PLUS,
> +				      code0, modify_p, ad);
> +	  }
> +	else if ((code1 == REG && REG_POINTER (arg1)) || code0 == MULT)
> +	  {
> +	    extract_loc_address_regs (false, mode, as, arg0_loc, true, PLUS,
> +				      code1, modify_p, ad);
> +	    if (code0 == MULT)
> +	      ad->index_loc = arg0_loc;
> +	    extract_loc_address_regs (false, mode, as, arg1_loc, false, PLUS,
> +				      code0, modify_p, ad);
> +	  }

Some targets care about the choice between index and base for
correctness reasons (PA IIRC) or for performance (some ppc targets IIRC),
so I'm not sure whether it's safe to give REG_POINTER such a low priority.

> +    default:
> +      {
> +	const char *fmt = GET_RTX_FORMAT (code);
> +	int i;
> +
> +	if (GET_RTX_LENGTH (code) != 1
> +	    || fmt[0] != 'e' || GET_CODE (XEXP (x, 0)) != UNSPEC)
> +	  {
> +	    for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
> +	      if (fmt[i] == 'e')
> +		extract_loc_address_regs (false, mode, as, &XEXP (x, i),
> +					  context_p, code, SCRATCH,
> +					  modify_p, ad);
> +	    break;
> +	  }
> +	/* fall through for case UNARY_OP (UNSPEC ...)	*/
> +      }
> +
> +    case UNSPEC:
> +      if (ad->disp_loc == NULL)
> +	ad->disp_loc = loc;
> +      else if (ad->base_reg_loc == NULL)
> +	{
> +	  ad->base_reg_loc = loc;
> +	  ad->base_outer_code = outer_code;
> +	  ad->index_code = index_code;
> +	  ad->base_modify_p = modify_p;
> +	}
> +      else
> +	{
> +	  lra_assert (ad->index_reg_loc == NULL);
> +	  ad->index_reg_loc = loc;
> +	}
> +      break;
> +
> +    }

Which targets use a bare UNSPEC as a displacement?  I thought a
displacement had to be a link-time constant, in which case it should
satisfy CONSTANT_P.  For UNSPECs, that means wrapping it in a CONST.

I'm just a bit worried that the UNSPEC handling is sensitive to the
order that subrtxes are processed (unlike PLUS, which goes to some
trouble to work out what's what).  It could be especially confusing
because the default case processes operands in reverse order while
PLUS processes them in forward order.

Also, which cases require the special UNARY_OP (UNSPEC ...) fallthrough?
Probably deserves a comment.

AIUI the base_reg_loc, index_reg_loc and disp_loc fields aren't just
recording where reloads of a particular class need to go (obviously
in the case of disp_loc, which isn't reloaded at all).  The fields
have semantic value too.  I.e. we use them to work out the value
of at least part of the address.

In that case it seems dangerous to look through general rtxes
in the way that the default case above does.  Maybe just making
sure that DISP_LOC is involved in a sum with the base would be
enough, but another idea was:

----------------------------------------------------------------
I know of three ways of "mutating" (for want of a better word)
an address:

  1. (and X (const_int -ALIGN)), to align
  2. a subreg
  3. a unary operator (such as truncation or extension)

So maybe we could:

  a. remove outer mutations (using a helper function)
  b. handle LO_SUM, PRE_*, POST_*: as now
  c. otherwise treat the address as the sum of one, two or three pieces.
     c1. Peel mutations of all pieces.
     c2. Classify the pieces into base, index and displacement.
         This would be similar to the jousting code above, but hopefully
         easier because all three rtxes are to hand.  E.g. we could
         do the base vs. index thing in a similar way to
         commutative_operand_precedence.
     c3. Record which pieces were mutated (e.g. using something like the
         index_loc vs. index_reg_loc distinction in the current code)

That should be general enough for current targets, but if it isn't,
we could generalise it further when we know what generalisation is needed.

That's still going to be a fair amount of code, but hopefully not more,
and we might have more confidence at each stage what each value is.
And it avoids the risk of treating "mutated" addresses as "unmutated" ones.
----------------------------------------------------------------
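
Step (a) might be a little helper along these lines (a sketch only;
the name is invented and it just peels the three mutations listed
above):

static rtx *
strip_address_mutations (rtx *loc)
{
  for (;;)
    {
      enum rtx_code code = GET_CODE (*loc);
      if (GET_RTX_CLASS (code) == RTX_UNARY)
        /* 3: a unary operator such as truncation or extension.  */
        loc = &XEXP (*loc, 0);
      else if (code == AND && CONST_INT_P (XEXP (*loc, 1)))
        /* 1: an aligning AND.  */
        loc = &XEXP (*loc, 0);
      else if (code == SUBREG)
        /* 2: a subreg.  */
        loc = &SUBREG_REG (*loc);
      else
        return loc;
    }
}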

Just an idea though.  Probably not for 4.8, although I might try it
if I find time.

It would be nice to sort out the disp_loc thing for 4.8 though.

> +/* Extract address characteristics in address with location *LOC in
> +   space AS.  Return them in AD.  Parameter OUTER_CODE for MEM should
> +   be MEM.  Parameter OUTER_CODE for 'p' constraint should be ADDRESS
> +   and MEM_MODE should be VOIDmode.  */

Maybe:

/* Describe address *LOC in AD.  There are two cases:

   - *LOC is the address in a (mem ...).  In this case OUTER_CODE is MEM
     and AS is the mem's address space.

   - *LOC is matched to an address constraint such as 'p'.  In this case
     OUTER_CODE is ADDRESS and AS is ADDR_SPACE_GENERIC.  */

> +/* Return start register offset of hard register REGNO in MODE.	 */
> +int
> +lra_constraint_offset (int regno, enum machine_mode mode)
> +{
> +  lra_assert (regno < FIRST_PSEUDO_REGISTER);
> +  /* On a WORDS_BIG_ENDIAN machine, point to the last register of a
> +     multiple hard register group of scalar integer registers, so that
> +     for example (reg:DI 0) and (reg:SI 1) will be considered the same
> +     register.	*/
> +  if (WORDS_BIG_ENDIAN && GET_MODE_SIZE (mode) > UNITS_PER_WORD
> +      && SCALAR_INT_MODE_P (mode))
> +    return hard_regno_nregs[regno][mode] - 1;
> +  return 0;
> +}

Maybe the head comment could be:

/* Return the offset from REGNO of the least significant register
   in (reg:MODE REGNO).

   This function is used to tell whether two registers satisfy
   a matching constraint.  (reg:MODE1 REGNO1) matches (reg:MODE2 REGNO2) if:

         REGNO1 + lra_constraint_offset (REGNO1, MODE1)
      == REGNO2 + lra_constraint_offset (REGNO2, MODE2)  */

(and remove the inner comment).

> +/* Like rtx_equal_p except that it allows a REG and a SUBREG to match
> +   if they are the same hard reg, and has special hacks for
> +   auto-increment and auto-decrement.  This is specifically intended for
> +   process_alt_operands to use in determining whether two operands
> +   match.  X is the operand whose number is the lower of the two.
> +
> +   It is supposed that X is the output operand and Y is the input
> +   operand.  */
> +static bool
> +operands_match_p (rtx x, rtx y, int y_hard_regno)

Need to say what Y_HARD_REGNO is.

> +  switch (code)
> +    {
> +    case CONST_INT:
> +    case CONST_DOUBLE:
> +    case CONST_FIXED:

After a recent change this should be:

   CASE_CONST_UNIQUE:

> +	      val = operands_match_p (XVECEXP (x, i, j), XVECEXP (y, i, j),
> +				      y_hard_regno);
> +	      if (val == 0)
> +		return false;

Why do we pass the old y_hard_regno even though Y has changed?
Some of the earlier code assumes that GET_MODE (y) is the mode
of y_hard_regno.

> +/* Reload pseudos created for matched input and output reloads whose
> +   mode are different.	Such pseudos has a modified rules for finding
> +   their living ranges, e.g. assigning to subreg of such pseudo means
> +   changing all pseudo value.  */
> +bitmap_head lra_bound_pseudos;

Maybe:

/* Reload pseudos created for matched input and output reloads whose
   modes are different.  Such pseudos have different live ranges from
   other pseudos; e.g. any assignment to a subreg of these pseudos
   changes the whole pseudo's value.  */

Although that said, couldn't emit_move_insn_1 (called by gen_move_insn)
split a multiword pseudo move into two word moves?  Using the traditional
clobber technique sounds better than having special liveness rules.
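
By the clobber technique I mean something like this sketch (the
context is entirely hypothetical; word0/word1 stand for the
word-mode sources):

  /* Clobber the whole multiword pseudo first, so that liveness sees
     a full definition and the word-sized stores below need no
     special rules.  */
  emit_clobber (new_in_reg);
  emit_move_insn (operand_subword (new_in_reg, 0, 1, mode), word0);
  emit_move_insn (operand_subword (new_in_reg, 1, 1, mode), word1);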

> +/* True if C is a non-empty register class that has too few registers
> +   to be safely used as a reload target class.	*/
> +#define SMALL_REGISTER_CLASS_P(C)					\
> +  (reg_class_size [(C)] == 1						\
> +   || (reg_class_size [(C)] >= 1 && targetm.class_likely_spilled_p (C)))

Feels like ira_class_hard_regs_num might be better, but since the
current definition is traditional, that shouldn't be a merge requirement.

> +/* Return mode of WHAT inside of WHERE whose mode of the context is
> +   OUTER_MODE.	If WHERE does not contain WHAT, return VOIDmode.  */
> +static enum machine_mode
> +find_mode (rtx *where, enum machine_mode outer_mode, rtx *what)
> +{
> +  int i, j;
> +  enum machine_mode mode;
> +  rtx x;
> +  const char *fmt;
> +  enum rtx_code code;
> +
> +  if (where == what)
> +    return outer_mode;
> +  if (*where == NULL_RTX)
> +    return VOIDmode;
> +  x = *where;
> +  code = GET_CODE (x);
> +  outer_mode = GET_MODE (x);
> +  fmt = GET_RTX_FORMAT (code);
> +  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
> +    {
> +      if (fmt[i] == 'e')
> +	{
> +	  if ((mode = find_mode (&XEXP (x, i), outer_mode, what)) != VOIDmode)
> +	    return mode;
> +	}
> +      else if (fmt[i] == 'E')
> +	{
> +	  for (j = XVECLEN (x, i) - 1; j >= 0; j--)
> +	  if ((mode = find_mode (&XVECEXP (x, i, j), outer_mode, what))
> +	      != VOIDmode)
> +	    return mode;
> +	}
> +    }
> +  return VOIDmode;
> +}
> +
> +/* Return mode for operand NOP of the current insn.  */
> +static inline enum machine_mode
> +get_op_mode (int nop)
> +{
> +  rtx *loc;
> +  enum machine_mode mode;
> +  bool md_first_p = asm_noperands (PATTERN (curr_insn)) < 0;
> +
> +  /* Take mode from the machine description first.  */
> +  if (md_first_p && (mode = curr_static_id->operand[nop].mode) != VOIDmode)
> +    return mode;
> +  loc = curr_id->operand_loc[nop];
> +  /* Take mode from the operand second.	 */
> +  mode = GET_MODE (*loc);
> +  if (mode != VOIDmode)
> +    return mode;
> +  if (! md_first_p && (mode = curr_static_id->operand[nop].mode) != VOIDmode)
> +    return mode;
> +  /* Here is a very rare case.	Take mode from the context.  */
> +  return find_mode (&PATTERN (curr_insn), VOIDmode, loc);
> +}

This looks a lot more complicated than the reload version.  Why is
it needed?  In reload the conditions for address operands were:

	  /* Address operands are reloaded in their existing mode,
	     no matter what is specified in the machine description.  */
	  operand_mode[i] = GET_MODE (recog_data.operand[i]);

	  /* If the address is a single CONST_INT pick address mode
	     instead otherwise we will later not know in which mode
	     the reload should be performed.  */
	  if (operand_mode[i] == VOIDmode)
	    operand_mode[i] = Pmode;

which for LRA might look like:

	  /* The mode specified in the .md file for address operands
	     is the mode of the addressed value, not the address itself.
	     We therefore need to get the mode from the operand rtx.
	     If the operand has no mode, assume it was Pmode.  */

For other operands, recog_data.operand_mode ought to be correct.
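
So the whole function might reduce to something like this sketch
(assuming lra_operand_data has an is_address flag for 'p'-style
operands):

static inline enum machine_mode
get_op_mode (int nop)
{
  enum machine_mode mode = curr_static_id->operand[nop].mode;

  if (curr_static_id->operand[nop].is_address)
    {
      /* The mode specified in the .md file for address operands
         is the mode of the addressed value, not the address itself.
         We therefore need to get the mode from the operand rtx.
         If the operand has no mode, assume it was Pmode.  */
      mode = GET_MODE (*curr_id->operand_loc[nop]);
      if (mode == VOIDmode)
        mode = Pmode;
    }
  return mode;
}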

find_mode assumes that the mode of an operand is the same as the mode of
the outer rtx, which isn't true when the outer rtx is a subreg, mem,
or one of several unary operators.

This is one that I think would be best decided for 4.8.

> +/* If REG is a reload pseudo, try to make its class satisfying CL.  */
> +static void
> +narrow_reload_pseudo_class (rtx reg, enum reg_class cl)
> +{
> +  int regno;
> +  enum reg_class rclass;
> +
> +  /* Do not make more accurate class from reloads generated.  They are
> +     mostly moves with a lot of constraints.  Making more accurate
> +     class may results in very narrow class and impossibility of find
> +     registers for several reloads of one insn.	 */
> +  if (INSN_UID (curr_insn) >= new_insn_uid_start)
> +    return;
> +  if (GET_CODE (reg) == SUBREG)
> +    reg = SUBREG_REG (reg);
> +  if (! REG_P (reg) || (regno = REGNO (reg)) < new_regno_start)
> +    return;
> +  rclass = get_reg_class (regno);
> +  rclass = ira_reg_class_subset[rclass][cl];
> +  if (rclass == NO_REGS)
> +    return;
> +  change_class (regno, rclass, "      Change", true);
> +}

There seems to be an overlap in functionality with in_class_p here.
Maybe:

{
  enum reg_class rclass;

  if (in_class_p (reg, cl, &rclass) && rclass != NO_REGS)
    change_class (REGNO (reg), rclass, "      Change", true);
}

(assuming the change in in_class_p interface suggested above).
This avoids duplicating subtleties like the handling of reloads.

> +      /* We create pseudo for out rtx because we always should keep
> +	 registers with the same original regno have synchronized
> +	 value (it is not true for out register but it will be
> +	 corrected by the next insn).

I don't understand this comment, sorry.

> +	 Do not reuse register because of the following situation: a <-
> +	 a op b, and b should be the same as a.	 */

This part is very convincing though :-)  Maybe:

	 We cannot reuse the current output register because we might
	 have a situation like "a <- a op b", where the constraints force
	 the second input operand ("b") to match the output operand ("a").
	 "b" must then be copied into a new register so that it doesn't
	 clobber the current value of "a".  */

We should probably keep the other reason too, of course.

> +      /* Don't generate inheritance for the new register because we
> +	 can not use the same hard register for the corresponding
> +	 inheritance pseudo for input reload.  */
> +      bitmap_set_bit (&lra_matched_pseudos, REGNO (new_in_reg));

Suggest dropping this comment, since we don't do any inheritance here.
The comment above lra_matched_pseudos already says the same thing.

> +  /* In and out operand can be got from transformations before
> +     processing constraints.  So the pseudos might have inaccurate
> +     class and we should make their classes more accurate.  */
> +  narrow_reload_pseudo_class (in_rtx, goal_class);
> +  narrow_reload_pseudo_class (out_rtx, goal_class);

I don't understand this, sorry.  Does "transformations" mean inheritance
and reload splitting?  So the registers we're changing here are inheritance
and split pseudos rather than reload pseudos created for this instruction?
If so, it sounds on face value like it conflicts with the comment quoted
above about not allowing reload instructions to the narrow the class
of pseudos.  Might be worth saying why that's OK here but not there.

Also, I'm not sure I understand why it helps.  Is it just trying
to encourage the pseudos to form a chain in lra-assigns.c?

E.g. MIPS16 has several instructions that require matched MIPS16 registers.
However, moves between MIPS16 registers and general registers are as cheap
as moves between two MIPS16 registers, so narrowing the reloaded values
from GENERAL_REGS to M16_REGS (if that ever happens) wouldn't necessarily
be a good thing.

Not saying this is wrong, just that it might need more commentary
to justify it.

> +  for (i = 0; (in = ins[i]) >= 0; i++)
> +    *curr_id->operand_loc[in] = new_in_reg;

The code assumes that all input operands have the same mode.
Probably worth asserting that here (or maybe further up; I don't mind),
just to make the assumption explicit.
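
Something like this sketch, say:

  for (i = 0; (in = ins[i]) >= 0; i++)
    {
      /* All inputs that match the same output are assumed to share
         a mode; make that explicit.  */
      lra_assert (GET_MODE (*curr_id->operand_loc[in])
                  == GET_MODE (*curr_id->operand_loc[ins[0]]));
      *curr_id->operand_loc[in] = new_in_reg;
    }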

> +/* Return final hard regno (plus offset) which will be after
> +   elimination.	 We do this for matching constraints because the final
> +   hard regno could have a different class.  */
> +static int
> +get_final_hard_regno (int hard_regno, int offset)
> +{
> +  if (hard_regno < 0)
> +    return hard_regno;
> +  hard_regno += offset;
> +  return lra_get_elimation_hard_regno (hard_regno);

Why apply the offset before rather than after elimination?
AIUI, AVR's eliminable registers span more than one hard register,
and the elimination is based off the first.

Also, all uses but one of lra_get_hard_regno_and_offset follow
the pattern:

      lra_get_hard_regno_and_offset (x, &x_hard_regno, &offset);
      /* The real hard regno of the operand after the allocation.  */
      x_hard_regno = get_final_hard_regno (x_hard_regno, offset);

so couldn't lra_get_hard_regno_and_offset just return the final
hard register, including elimination?  Then it could apply the
elimination on the original rtx.

FWIW, the exception I mentioned was operands_match_p:

      lra_get_hard_regno_and_offset (x, &i, &offset);
      if (i < 0)
	goto slow;
      i += offset;

but I'm not sure why this is the only caller that would want
to ignore elimination.

> +/* Return register class of OP.	 That is a class of the hard register
> +   itself (if OP is a hard register), or class of assigned hard
> +   register to the pseudo (if OP is pseudo), or allocno class of
> +   unassigned pseudo (if OP is reload pseudo).	Return NO_REGS
> +   otherwise.  */
> +static enum reg_class
> +get_op_class (rtx op)
> +{
> +  int regno, hard_regno, offset;
> +
> +  if (! REG_P (op))
> +    return NO_REGS;
> +  lra_get_hard_regno_and_offset (op, &hard_regno, &offset);
> +  if (hard_regno >= 0)
> +    {
> +      hard_regno = get_final_hard_regno (hard_regno, offset);
> +      return REGNO_REG_CLASS (hard_regno);
> +    }
> +  /* Reload pseudo will get a hard register in any case.  */
> +  if ((regno = REGNO (op)) >= new_regno_start)
> +    return lra_get_allocno_class (regno);
> +  return NO_REGS;
> +}

This looks like it ought to be the same as:

  return REG_P (x) ? get_reg_class (REGNO (x)) : NO_REGS;

If not, I think there should be a comment explaining the difference.
If so, the comment might be:

/* If OP is a register, return the class of the register as per
   get_reg_class, otherwise return NO_REGS.  */

> +/* Return generated insn mem_pseudo:=val if TO_P or val:=mem_pseudo
> +   otherwise.  If modes of MEM_PSEUDO and VAL are different, use
> +   SUBREG for VAL to make them equal.  Assign CODE to the insn if it
> +   is not recognized.
> +
> +   We can not use emit_move_insn in some cases because of bad used
> +   practice in some machine descriptions.  For example, power can use
> +   only base+index addressing for altivec move insns and it is checked
> +   by insn predicates.	On the other hand, the same move insn
> +   constraints permit to use offsetable memory for moving vector mode
> +   values from/to general registers to/from memory.  emit_move_insn
> +   will transform offsetable address to one with base+index addressing
> +   which is rejected by the constraint.	 So sometimes we need to
> +   generate move insn without modifications and assign the code
> +   explicitly because the generated move can be unrecognizable because
> +   of the predicates.  */

Ick :-)  Can't we just say that fixing this is part of the process
of porting a target to LRA?  It'd be nice not to carry hacks like
this around in shiny new code.

As it stands:

> +static rtx
> +emit_spill_move (bool to_p, rtx mem_pseudo, rtx val, int code)
> +{
> +  rtx insn, after;
> +
> +  start_sequence ();
> +  if (GET_MODE (mem_pseudo) != GET_MODE (val))
> +    val = gen_rtx_SUBREG (GET_MODE (mem_pseudo),
> +			  GET_CODE (val) == SUBREG ? SUBREG_REG (val) : val,
> +			  0);
> +  if (to_p)
> +    insn = gen_move_insn (mem_pseudo, val);
> +  else
> +    insn = gen_move_insn (val, mem_pseudo);
> +  if (recog_memoized (insn) < 0)
> +    INSN_CODE (insn) = code;
> +  emit_insn (insn);
> +  after = get_insns ();
> +  end_sequence ();
> +  return after;
> +}

this recog_memoized code effectively assumes that INSN is just one
instruction, whereas emit_move_insn_1 or the backend move expanders
could split moves into several instructions.

Since the code-forcing stuff is for rs6000, I think we could drop it
from 4.8 whatever happens.

The sequence stuff above looks redundant; we should just return
INSN directly.
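
I.e. something like (with the code-forcing parameter also dropped,
as suggested):

static rtx
emit_spill_move (bool to_p, rtx mem_pseudo, rtx val)
{
  if (GET_MODE (mem_pseudo) != GET_MODE (val))
    val = gen_rtx_SUBREG (GET_MODE (mem_pseudo),
                          GET_CODE (val) == SUBREG ? SUBREG_REG (val) : val,
                          0);
  return (to_p
          ? gen_move_insn (mem_pseudo, val)
          : gen_move_insn (val, mem_pseudo));
}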

> +  /* Quick check on the right move insn which does not need
> +     reloads.  */
> +  if ((dclass = get_op_class (dest)) != NO_REGS
> +      && (sclass = get_op_class (src)) != NO_REGS
> +      && targetm.register_move_cost (GET_MODE (src), dclass, sclass) == 2)
> +    return true;

Suggest:

  /* The backend guarantees that register moves of cost 2 never need
     reloads.  */

> +  if (GET_CODE (dest) == SUBREG)
> +    dreg = SUBREG_REG (dest);
> +  if (GET_CODE (src) == SUBREG)
> +    sreg = SUBREG_REG (src);
> +  if (! REG_P (dreg) || ! REG_P (sreg))
> +    return false;
> +  sclass = dclass = NO_REGS;
> +  dr = get_equiv_substitution (dreg);
> +  if (dr != dreg)
> +    dreg = copy_rtx (dr);

I think this copy is too early, because there are quite a few
conditions under which we never emit anything with DREG in it.

> +  if (REG_P (dreg))
> +    dclass = get_reg_class (REGNO (dreg));
> +  if (dclass == ALL_REGS)
> +    /* We don't know what class we will use -- let it be figured out
> +       by curr_insn_transform function.	 Remember some targets does not
> +       work with such classes through their implementation of
> +       machine-dependent hooks like secondary_memory_needed.  */
> +    return false;

Don't really understand this comment, sorry.

> +  sreg_mode = GET_MODE (sreg);
> +  sr = get_equiv_substitution (sreg);
> +  if (sr != sreg)
> +    sreg = copy_rtx (sr);

This copy also seems too early.

> +  sri.prev_sri = NULL;
> +  sri.icode = CODE_FOR_nothing;
> +  sri.extra_cost = 0;
> +  secondary_class = NO_REGS;
> +  /* Set up hard register for a reload pseudo for hook
> +     secondary_reload because some targets just ignore unassigned
> +     pseudos in the hook.  */
> +  if (dclass != NO_REGS
> +      && REG_P (dreg) && (dregno = REGNO (dreg)) >= new_regno_start
> +      && lra_get_regno_hard_regno (dregno) < 0)
> +    reg_renumber[dregno] = ira_class_hard_regs[dclass][0];
> +  else
> +    dregno = -1;
> +  if (sclass != NO_REGS
> +      && REG_P (sreg) && (sregno = REGNO (sreg)) >= new_regno_start
> +      && lra_get_regno_hard_regno (sregno) < 0)
> +    reg_renumber[sregno] = ira_class_hard_regs[sclass][0];
> +  else
> +    sregno = -1;

I think this would be correct without the:

     && REG_P (dreg) && (dregno = REGNO (dreg)) >= new_regno_start

condition (and similarly for the src case).  IMO it would be clearer too:
the decision about when to return a register class for unallocated pseudos
is then localised to get_reg_class rather than copied both here and there.

> +  if (sclass != NO_REGS)
> +    secondary_class
> +      = (enum reg_class) targetm.secondary_reload (false, dest,
> +						   (reg_class_t) sclass,
> +						   GET_MODE (src), &sri);
> +  if (sclass == NO_REGS
> +      || ((secondary_class != NO_REGS || sri.icode != CODE_FOR_nothing)
> +	  && dclass != NO_REGS))
> +    secondary_class
> +      = (enum reg_class) targetm.secondary_reload (true, sreg,
> +						   (reg_class_t) dclass,
> +						   sreg_mode, &sri);

Hmm, so for register<-register moves, if the target says that the output
reload needs a secondary reload, we try again with an input reload and
hope for a different answer?

If the target is giving different answers in that case, I think that's
a bug in the target, and we should assert instead.  The problem is that
if we allow the answers to be different, and both answers involve
secondary reloads, we have no way of knowing whether the second answer
is easier to implement or "more correct" than the first.  An assert
avoids that, and puts the onus on the target to sort itself out.

Again, as long as x86 is free of this bug for 4.8, I don't think the merge
needs to cater for broken targets.

> +  *change_p = true;

I think this is the point at which substituted values should be copied.

> +  new_reg = NULL_RTX;
> +  if (secondary_class != NO_REGS)
> +    new_reg = lra_create_new_reg_with_unique_value (sreg_mode, NULL_RTX,
> +						    secondary_class,
> +						    "secondary");
> +  start_sequence ();
> +  if (sri.icode == CODE_FOR_nothing)
> +    lra_emit_move (new_reg, sreg);
> +  else
> +    {
> +      enum reg_class scratch_class;
> +
> +      scratch_class = (reg_class_from_constraints
> +		       (insn_data[sri.icode].operand[2].constraint));
> +      scratch_reg = (lra_create_new_reg_with_unique_value
> +		     (insn_data[sri.icode].operand[2].mode, NULL_RTX,
> +		      scratch_class, "scratch"));
> +      emit_insn (GEN_FCN (sri.icode) (new_reg != NULL_RTX ? new_reg : dest,
> +				      sreg, scratch_reg));
> +    }
> +  before = get_insns ();
> +  end_sequence ();
> +  lra_process_new_insns (curr_insn, before, NULL_RTX, "Inserting the move");

AIUI, the constraints pass will look at these instructions and generate
what are now known as tertiary reloads where needed (by calling this
function again).  Is that right?  Very nice if so: that's far more
natural than the current reload handling.

> +/* The chosen reg classes which should be used for the corresponding
> +   operands.  */
> +static enum reg_class goal_alt[MAX_RECOG_OPERANDS];
> +/* True if the operand should be the same as another operand and the
> +   another operand does not need a reload.  */

s/and the another/and that other/

> +/* Make reloads for addr register in LOC which should be of class CL,
> +   add reloads to list BEFORE.	If AFTER is not null emit insns to set
> +   the register up after the insn (it is case of inc/dec, modify).  */

Maybe:

/* Arrange for address element *LOC to be a register of class CL.
   Add any input reloads to list BEFORE.  AFTER is nonnull if *LOC is an
   automodified value; handle that case by adding the required output
   reloads to list AFTER.  Return true if the RTL was changed.  */

> +static bool
> +process_addr_reg (rtx *loc, rtx *before, rtx *after, enum reg_class cl)
> +{
> +  int regno, final_regno;
> +  enum reg_class rclass, new_class;
> +  rtx reg = *loc;
> +  rtx new_reg;
> +  enum machine_mode mode;
> +  bool change_p = false;
> +
> +  mode = GET_MODE (reg);
> +  if (! REG_P (reg))
> +    {
> +      /* Always reload memory in an address even if the target
> +	 supports such addresses.  */
> +      new_reg
> +	= lra_create_new_reg_with_unique_value (mode, reg, cl, "address");
> +      push_to_sequence (*before);
> +      lra_emit_move (new_reg, reg);
> +      *before = get_insns ();
> +      end_sequence ();
> +      *loc = new_reg;
> +      if (after != NULL)
> +	{
> +	  start_sequence ();
> +	  lra_emit_move (reg, new_reg);
> +	  emit_insn (*after);
> +	  *after = get_insns ();
> +	  end_sequence ();
> +	}
> +      return true;

Why does this need to be a special case, rather than reusing the
code later in the function?  Specifically:

> +    }
> +  lra_assert (REG_P (reg));
> +  final_regno = regno = REGNO (reg);
> +  if (regno < FIRST_PSEUDO_REGISTER)
> +    {
> +      rtx final_reg = reg;
> +      rtx *final_loc = &final_reg;
> +
> +      lra_eliminate_reg_if_possible (final_loc);
> +      final_regno = REGNO (*final_loc);
> +    }
> +  /* Use class of hard register after elimination because some targets
> +     do not recognize virtual hard registers as valid address
> +     registers.	 */
> +  rclass = get_reg_class (final_regno);
> +  if ((*loc = get_equiv_substitution (reg)) != reg)
> +    {
> +      if (lra_dump_file != NULL)
> +	{
> +	  fprintf (lra_dump_file,
> +		   "Changing pseudo %d in address of insn %u on equiv ",
> +		   REGNO (reg), INSN_UID (curr_insn));
> +	  print_value_slim (lra_dump_file, *loc, 1);
> +	  fprintf (lra_dump_file, "\n");
> +	}
> +      *loc = copy_rtx (*loc);
> +      change_p = true;
> +    }
> +  if (*loc != reg || ! in_class_p (final_regno, GET_MODE (reg), cl, &new_class))
> +    {
> +      reg = *loc;
> +      if (get_reload_reg (OP_IN, mode, reg, cl, "address", &new_reg))
> +	{
> +	  push_to_sequence (*before);
> +	  lra_emit_move (new_reg, reg);
> +	  *before = get_insns ();
> +	  end_sequence ();
> +	}
> +      *loc = new_reg;
> +      if (after != NULL)
> +	{
> +	  start_sequence ();
> +	  lra_emit_move (reg, new_reg);
> +	  emit_insn (*after);
> +	  *after = get_insns ();
> +	  end_sequence ();
> +	}
> +      change_p = true;
> +    }
> +  else if (new_class != NO_REGS && rclass != new_class)
> +    change_class (regno, new_class, "	   Change", true);
> +  return change_p;
> +}

E.g.:

  if ((*loc = get_equiv_substitution (reg)) != reg)
    ...as above...
  if (*loc != reg || !in_class_p (reg, cl, &new_class))
    ...as above...
  else if (new_class != NO_REGS && rclass != new_class)
    change_class (regno, new_class, "	   Change", true);
  return change_p;

(assuming change to in_class_p suggested earlier) seems like it
covers the same cases.

Also, should OP_IN be OP_INOUT for after != NULL, so that we don't try
to reuse existing reload pseudos?  That would mean changing get_reload_reg
(both commentary and code) to handle OP_INOUT like OP_OUT.

Or maybe just pass OP_OUT instead of OP_INOUT, if that's more consistent.
I don't mind which.

> +  /* Force reload if this is a constant or PLUS or if there may be a
> +     problem accessing OPERAND in the outer mode.  */

Suggest:

  /* Force a reload of the SUBREG_REG if this ...

> +      /* Constant mode ???? */
> +      enum op_type type = curr_static_id->operand[nop].type;

Not sure what the comment means, but REG is still the original SUBREG_REG,
so there shouldn't be any risk of a VOIDmode constant.  (subreg (const_int))
is invalid rtl.

> +/* Return TRUE if *LOC refers for a hard register from SET.  */
> +static bool
> +uses_hard_regs_p (rtx *loc, HARD_REG_SET set)
> +{

Nothing seems to care about the address, so we could pass the rtx
rather than a pointer to it.

> +  int i, j, x_hard_regno, offset;
> +  enum machine_mode mode;
> +  rtx x;
> +  const char *fmt;
> +  enum rtx_code code;
> +
> +  if (*loc == NULL_RTX)
> +    return false;
> +  x = *loc;
> +  code = GET_CODE (x);
> +  mode = GET_MODE (x);
> +  if (code == SUBREG)
> +    {
> +      loc = &SUBREG_REG (x);
> +      x = SUBREG_REG (x);
> +      code = GET_CODE (x);
> +      if (GET_MODE_SIZE (GET_MODE (x)) > GET_MODE_SIZE (mode))
> +	mode = GET_MODE (x);
> +    }
> +  
> +  if (REG_P (x))
> +    {
> +      lra_get_hard_regno_and_offset (x, &x_hard_regno, &offset);
> +      /* The real hard regno of the operand after the allocation.  */
> +      x_hard_regno = get_final_hard_regno (x_hard_regno, offset);
> +      return (x_hard_regno >= 0
> +	      && lra_hard_reg_set_intersection_p (x_hard_regno, mode, set));

With the subreg mode handling above, this looks little-endian specific.
The MEM case:

> +  if (MEM_P (x))
> +    {
> +      struct address ad;
> +      enum machine_mode mode = GET_MODE (x);
> +      rtx *addr_loc = &XEXP (x, 0);
> +
> +      extract_address_regs (mode, MEM_ADDR_SPACE (x), addr_loc, MEM, &ad);
> +      if (ad.base_reg_loc != NULL)
> +	{
> +	  if (uses_hard_regs_p (ad.base_reg_loc, set))
> +	    return true;
> +	}
> +      if (ad.index_reg_loc != NULL)
> +	{
> +	  if (uses_hard_regs_p (ad.index_reg_loc, set))
> +	    return true;
> +	}
> +    }

is independent of the subreg handling, so perhaps the paradoxical subreg
case should be handled separately, using simplify_subreg_regno.
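
E.g. something along these lines (a sketch, reusing the names from
the quoted code):

  if (GET_CODE (x) == SUBREG && REG_P (SUBREG_REG (x)))
    {
      rtx inner = SUBREG_REG (x);

      lra_get_hard_regno_and_offset (inner, &x_hard_regno, &offset);
      x_hard_regno = get_final_hard_regno (x_hard_regno, offset);
      /* simplify_subreg_regno works out which hard register the
         subreg actually starts at, getting the endianness right.  */
      if (x_hard_regno >= 0)
        x_hard_regno = simplify_subreg_regno (x_hard_regno,
                                              GET_MODE (inner),
                                              SUBREG_BYTE (x),
                                              GET_MODE (x));
      return (x_hard_regno >= 0
              && lra_hard_reg_set_intersection_p (x_hard_regno,
                                                  GET_MODE (x), set));
    }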

> +/* Major function to choose the current insn alternative and what
> +   operands should be reloaded and how.	 If ONLY_ALTERNATIVE is not
> +   negative we should consider only this alternative.  Return false if
> +   we can not choose the alternative or find how to reload the
> +   operands.  */
> +static bool
> +process_alt_operands (int only_alternative)
> +{
> +  bool ok_p = false;
> +  int nop, small_class_operands_num, overall, nalt, offset;
> +  int n_alternatives = curr_static_id->n_alternatives;
> +  int n_operands = curr_static_id->n_operands;
> +  /* LOSERS counts those that don't fit this alternative and would
> +     require loading.  */
> +  int losers;

s/those/the operands/

> +  /* Calculate some data common for all alternatives to speed up the
> +     function.	*/
> +  for (nop = 0; nop < n_operands; nop++)
> +    {
> +      op = no_subreg_operand[nop] = *curr_id->operand_loc[nop];
> +      lra_get_hard_regno_and_offset (op, &hard_regno[nop], &offset);
> +      /* The real hard regno of the operand after the allocation.  */
> +      hard_regno[nop] = get_final_hard_regno (hard_regno[nop], offset);
> +      
> +      operand_reg[nop] = op;
> +      biggest_mode[nop] = GET_MODE (operand_reg[nop]);
> +      if (GET_CODE (operand_reg[nop]) == SUBREG)
> +	{
> +	  operand_reg[nop] = SUBREG_REG (operand_reg[nop]);
> +	  if (GET_MODE_SIZE (biggest_mode[nop])
> +	      < GET_MODE_SIZE (GET_MODE (operand_reg[nop])))
> +	    biggest_mode[nop] = GET_MODE (operand_reg[nop]);
> +	}
> +      if (REG_P (operand_reg[nop]))
> +	no_subreg_operand[nop] = operand_reg[nop];
> +      else
> +	operand_reg[nop] = NULL_RTX;

This looks odd: no_subreg_operand ends up being a subreg if the
SUBREG_REG wasn't a REG.  Some more commentary might help.

> +  /* The constraints are made of several alternatives.	Each operand's
> +     constraint looks like foo,bar,... with commas separating the
> +     alternatives.  The first alternatives for all operands go
> +     together, the second alternatives go together, etc.
> +
> +     First loop over alternatives.  */
> +  for (nalt = 0; nalt < n_alternatives; nalt++)
> +    {
> +      /* Loop over operands for one constraint alternative.  */
> +      if (
> +#ifdef HAVE_ATTR_enabled
> +	  (curr_id->alternative_enabled_p != NULL
> +	   && ! curr_id->alternative_enabled_p[nalt])
> +	  ||
> +#endif
> +	  (only_alternative >= 0 && nalt != only_alternative))
> +	continue;

Probably more natural if split into two "if (...) continue;"s.  E.g.:

#ifdef HAVE_ATTR_enabled
      if (curr_id->alternative_enabled_p != NULL
	  && !curr_id->alternative_enabled_p[nalt])
	continue;
#endif
      if (only_alternative >= 0 && nalt != only_alternative)
	continue;

> +      for (nop = 0; nop < n_operands; nop++)
> +	{
> +	  const char *p;
> +	  char *end;
> +	  int len, c, m, i, opalt_num, this_alternative_matches;
> +	  bool win, did_match, offmemok, early_clobber_p;
> +	  /* false => this operand can be reloaded somehow for this
> +	     alternative.  */
> +	  bool badop;
> +	  /* false => this operand can be reloaded if the alternative
> +	     allows regs.  */
> +	  bool winreg;
> +	  /* False if a constant forced into memory would be OK for
> +	     this operand.  */
> +	  bool constmemok;
> +	  enum reg_class this_alternative, this_costly_alternative;
> +	  HARD_REG_SET this_alternative_set, this_costly_alternative_set;
> +	  bool this_alternative_match_win, this_alternative_win;
> +	  bool this_alternative_offmemok;
> +	  int invalidate_m;
> +	  enum machine_mode mode;
> +
> +	  opalt_num = nalt * n_operands + nop;
> +	  if (curr_static_id->operand_alternative[opalt_num].anything_ok)
> +	    {
> +	      /* Fast track for no constraints at all.	*/
> +	      curr_alt[nop] = NO_REGS;
> +	      CLEAR_HARD_REG_SET (curr_alt_set[nop]);
> +	      curr_alt_win[nop] = true;
> +	      curr_alt_match_win[nop] = false;
> +	      curr_alt_offmemok[nop] = false;
> +	      curr_alt_matches[nop] = -1;
> +	      continue;
> +	    }

Given that this code is pretty complex, it might be clearer to remove
the intermediate "this_*" variables and assign directly to curr_alt_*.  I.e.:

	  curr_alt[nop] = NO_REGS;
	  CLEAR_HARD_REG_SET (curr_alt_set[nop]);
	  curr_alt_win[nop] = false;
	  curr_alt_match_win[nop] = false;
	  curr_alt_offmemok[nop] = false;
	  curr_alt_matches[nop] = -1;

	  opalt_num = nalt * n_operands + nop;
	  if (curr_static_id->operand_alternative[opalt_num].anything_ok)
	    {
	      /* Fast track for no constraints at all.	*/
	      curr_alt_win[nop] = true;
	      continue;
	    }

Obviously keep this nice comment:

> +	  /* We update set of possible hard regs besides its class
> +	     because reg class might be inaccurate.  For example,
> +	     union of LO_REGS (l), HI_REGS(h), and STACK_REG(k) in ARM
> +	     is translated in HI_REGS because classes are merged by
> +	     pairs and there is no accurate intermediate class.	 */

somewhere though, either here or above the declaration of curr_alt_set.

> +		    /* We are supposed to match a previous operand.
> +		       If we do, we win if that one did.  If we do
> +		       not, count both of the operands as losers.
> +		       (This is too conservative, since most of the
> +		       time only a single reload insn will be needed
> +		       to make the two operands win.  As a result,
> +		       this alternative may be rejected when it is
> +		       actually desirable.)  */
> +		    /* If it conflicts with others.  */

Last line looks incomplete/misplaced.

> +		    match_p = false;
> +		    if (operands_match_p (*curr_id->operand_loc[nop],
> +					  *curr_id->operand_loc[m], m_hregno))
> +		      {
> +			int i;
> +			
> +			for (i = 0; i < early_clobbered_regs_num; i++)
> +			  if (early_clobbered_nops[i] == m)
> +			    break;
> +			/* We should reject matching of an early
> +			   clobber operand if the matching operand is
> +			   not dying in the insn.  */
> +			if (i >= early_clobbered_regs_num

Why not simply use operand m's early_clobber field?

> +			    || operand_reg[nop] == NULL_RTX
> +			    || (find_regno_note (curr_insn, REG_DEAD,
> +						 REGNO (operand_reg[nop]))
> +				!= NULL_RTX))
> +			  match_p = true;

...although I don't really understand this condition.  If the two
operands are the same value X, then X must die here whatever the
notes say.  So I assume this is coping with a case where the operands
are different but still match.  If so, could you give an example?

Matched earlyclobbers explicitly guarantee that the earlyclobber doesn't
apply to the matched input operand; the earlyclobber only applies to
other input operands.  So I'd have expected it was those operands
that might need reloading rather than this one.

E.g. if X occurs three times, twice in a matched earlyclobber pair
and once as an independent operand, it's the latter operand that would
need reloading.

> +			/* Operands don't match.  */
> +			/* Retroactively mark the operand we had to
> +			   match as a loser, if it wasn't already and
> +			   it wasn't matched to a register constraint
> +			   (e.g it might be matched by memory).	 */
> +			if (curr_alt_win[m]
> +			    && (operand_reg[m] == NULL_RTX
> +				|| hard_regno[m] < 0))
> +			  {
> +			    losers++;
> +			    if (curr_alt[m] != NO_REGS)
> +			      reload_nregs
> +				+= (ira_reg_class_max_nregs[curr_alt[m]]
> +				    [GET_MODE (*curr_id->operand_loc[m])]);
> +			}
> +			invalidate_m = m;
> +			if (curr_alt[m] == NO_REGS)
> +			  continue;

I found this a bit confusing.  If the operands don't match and operand m
allows no registers, don't we have to reject this constraint outright?
E.g. something like:

			/* Operands don't match.  Both operands must
			   allow a reload register, otherwise we cannot
			   make them match.  */
			if (curr_alt[m] == NO_REGS)
			  break;
			/* Retroactively mark the operand we had to
			   match as a loser, if it wasn't already and
			   it wasn't matched to a register constraint
			   (e.g it might be matched by memory).	 */
			if (curr_alt_win[m]
			    && (operand_reg[m] == NULL_RTX
				|| hard_regno[m] < 0))
			  {
			    losers++;
			    reload_nregs
			      += (ira_reg_class_max_nregs[curr_alt[m]]
				  [GET_MODE (*curr_id->operand_loc[m])]);
			  }

> +		    /* This can be fixed with reloads if the operand
> +		       we are supposed to match can be fixed with
> +		       reloads.	 */
> +		    badop = false;
> +		    this_alternative = curr_alt[m];
> +		    COPY_HARD_REG_SET (this_alternative_set, curr_alt_set[m]);
> +		    
> +		    /* If we have to reload this operand and some
> +		       previous operand also had to match the same
> +		       thing as this operand, we don't know how to do
> +		       that.  So reject this alternative.  */
> +		    if (! did_match)
> +		      for (i = 0; i < nop; i++)
> +			if (curr_alt_matches[i] == this_alternative_matches)
> +			  badop = true;

OK, so this is another case of cruft from reload that I'd like to remove,
but do you know of any reason why this shouldn't be:

		    /* If we have to reload this operand and some previous
		       operand also had to match the same thing as this
		       operand, we don't know how to do that.  */
		    if (!match_p || !curr_alt_win[m])
		      {
			for (i = 0; i < nop; i++)
			  if (curr_alt_matches[i] == m)
			    break;
			if (i < nop)
			  break;
		      }
		    else
--->		      did_match = true;

		    /* This can be fixed with reloads if the operand
		       we are supposed to match can be fixed with
		       reloads.	 */
--->		    this_alternative_matches = m;
--->		    invalidate_m = m;
		    badop = false;
		    this_alternative = curr_alt[m];
		    COPY_HARD_REG_SET (this_alternative_set, curr_alt_set[m]);

(although a helper function might be better than the awkward breaking)?
Note that the ---> lines have moved from further up.

This is the only time in the switch statement where one constraint
in a constraint string uses "badop = true" to reject the operand.
I.e. for "<something else>0" we should normally not reject the
alternative based solely on the "0", since the "<something else>"
might have been satisfied instead.  And we should only record matching
information if we've decided the match can be implemented by reloads
(the last block).

> +			/* We prefer no matching alternatives because
> +			   it gives more freedom in RA.	 */
> +			if (operand_reg[nop] == NULL_RTX
> +			    || (find_regno_note (curr_insn, REG_DEAD,
> +						 REGNO (operand_reg[nop]))
> +				 == NULL_RTX))
> +			  reject += 2;

Looks like a new reject rule.  I agree it makes conceptual sense though,
so I'm all for it.

> +		      || (REG_P (op)
> +			  && REGNO (op) >= FIRST_PSEUDO_REGISTER
> +			  && in_mem_p (REGNO (op))))

This pattern occurs several times.  I think a helper function like
spilled_reg_p (op) would help.  See 'g' below.
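
I.e. something like:

/* Return true if OP is a pseudo that is currently spilled to memory
   (name just a suggestion).  */
static bool
spilled_reg_p (rtx op)
{
  return (REG_P (op)
          && REGNO (op) >= FIRST_PSEUDO_REGISTER
          && in_mem_p (REGNO (op)));
}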

> +		case 's':
> +		  if (CONST_INT_P (op)
> +		      || (GET_CODE (op) == CONST_DOUBLE && mode == VOIDmode))
...
> +		case 'n':
> +		  if (CONST_INT_P (op)
> +		      || (GET_CODE (op) == CONST_DOUBLE && mode == VOIDmode))

After recent changes these should be CONST_SCALAR_INT_P (op)

> +		case 'g':
> +		  if (/* A PLUS is never a valid operand, but LRA can
> +			 make it from a register when eliminating
> +			 registers.  */
> +		      GET_CODE (op) != PLUS
> +		      && (! CONSTANT_P (op) || ! flag_pic
> +			  || LEGITIMATE_PIC_OPERAND_P (op))
> +		      && (! REG_P (op)
> +			  || (REGNO (op) >= FIRST_PSEUDO_REGISTER
> +			      && in_mem_p (REGNO (op)))))

Rather than the special case for PLUS, I think this would be better as:

	  if (MEM_P (op)
	      || spilled_reg_p (op)
	      || general_constant_p (op))
	    win = true;

where general_constant_p abstracts away:

	  (CONSTANT_P (op)
	   && (! flag_pic || LEGITIMATE_PIC_OPERAND_P (op)))

general_constant_p probably ought to go in a common file because several
places need this condition (including other parts of this switch statement).
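
Spelled out, the helper would just be:

/* Return true if OP is a constant that the target accepts as a
   general operand (taking PIC into account).  */
static bool
general_constant_p (rtx op)
{
  return CONSTANT_P (op) && (! flag_pic || LEGITIMATE_PIC_OPERAND_P (op));
}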

> +#ifdef EXTRA_CONSTRAINT_STR
> +		      if (EXTRA_MEMORY_CONSTRAINT (c, p))
> +			{
> +			  if (EXTRA_CONSTRAINT_STR (op, c, p))
> +			    win = true;
> +			  /* For regno_equiv_mem_loc we have to
> +			     check.  */
> +			  else if (REG_P (op)
> +				   && REGNO (op) >= FIRST_PSEUDO_REGISTER
> +				   && in_mem_p (REGNO (op)))

Looks like an old comment from an earlier iteration.  There doesn't
seem to be a function called regno_equiv_mem_loc in the current patch.
But...

> +			    {
> +			      /* We could transform spilled memory
> +				 finally to indirect memory.  */
> +			      if (EXTRA_CONSTRAINT_STR
> +				  (get_indirect_mem (mode), c, p))
> +				win = true;
> +			    }

...is this check really needed?  It's a documented requirement that memory
constraints accept plain base registers.  Also, the following code is:

> +			  /* If we didn't already win, we can reload
> +			     constants via force_const_mem, and other
> +			     MEMs by reloading the address like for
> +			     'o'.  */
> +			  if (CONST_POOL_OK_P (mode, op) || MEM_P (op))
> +			    badop = false;

It seems a bit inconsistent to treat a spilled pseudo whose address
might well need reloading as a win, while not treating existing MEMs
whose addresses need reloading as a win.

> +		      if (EXTRA_CONSTRAINT_STR (op, c, p))
> +			win = true;
> +		      else if (REG_P (op)
> +			       && REGNO (op) >= FIRST_PSEUDO_REGISTER
> +			       && in_mem_p (REGNO (op)))
> +			{
> +			  /* We could transform spilled memory finally
> +			     to indirect memory.  */
> +			  if (EXTRA_CONSTRAINT_STR (get_indirect_mem (mode),
> +						    c, p))
> +			    win = true;
> +			}

I don't understand why there's two copies of this.  I think we have
to trust the target's classification of constraints, so if the target
says that something isn't a memory constraint, we shouldn't check the
(mem (base)) case.

> +	      if (c != ' ' && c != '\t')
> +		costly_p = c == '*';

I think there needs to be a comment somewhere saying how we handle this.
Being costly seems to contribute one reject point (i.e. a sixth of a '?')
compared to the normal case, which is very different from the current
reload behaviour.  We should probably update the "*" documentation
in md.texi too.

Some targets use "*" constraints to make sure that floating-point
registers don't get used as spill space in purely integer code
(so that task switches don't pay the FPU save/restore penalty).
Would that still "work" with this definition?  (FWIW, I think
using "*" is a bad way to achieve this feature, just asking.)

> +		  /* We simulate the behaviour of old reload here.
> +		     Although scratches need hard registers and it
> +		     might result in spilling other pseudos, no reload
> +		     insns are generated for the scratches.  So it
> +		     might cost something but probably less than old
> +		     reload pass believes.  */
> +		  if (lra_former_scratch_p (REGNO (operand_reg[nop])))
> +		    reject += LOSER_COST_FACTOR;

Yeah, this caused me no end of trouble when tweaking the MIPS
multiply-accumulate patterns.  However, unlike the other bits of
cruft I've been complaining about, this is one where I can't think
of any alternative that makes more inherent sense (to me).  So I agree
that leaving it as-is is the best approach for now.

> +	      /* If the operand is dying, has a matching constraint,
> +		 and satisfies constraints of the matched operand
> +		 which failed to satisfy the own constraints, we do
> +		 not need to generate a reload insn for this
> +		 operand.  */
> +	      if (this_alternative_matches < 0
> +		  || curr_alt_win[this_alternative_matches]
> +		  || ! REG_P (op)
> +		  || find_regno_note (curr_insn, REG_DEAD,
> +				      REGNO (op)) == NULL_RTX
> +		  || ((hard_regno[nop] < 0
> +		       || ! in_hard_reg_set_p (this_alternative_set,
> +					       mode, hard_regno[nop]))
> +		      && (hard_regno[nop] >= 0
> +			  || ! in_class_p (REGNO (op), GET_MODE (op),
> +					   this_alternative, NULL))))
> +		losers++;

I think this might be clearer as:

	      if (!(this_alternative_matches >= 0
		    && !curr_alt_win[this_alternative_matches]
		    && REG_P (op)
		    && find_regno_note (curr_insn, REG_DEAD, REGNO (op))
		    && (hard_regno[nop] >= 0
			? in_hard_reg_set_p (this_alternative_set,
					     mode, hard_regno[nop])
			: in_class_p (op, this_alternative, NULL))))
		losers++;

> +	      if (operand_reg[nop] != NULL_RTX)
> +		{
> +		  int last_reload = (lra_reg_info[ORIGINAL_REGNO
> +						  (operand_reg[nop])]
> +				     .last_reload);
> +
> +		  if (last_reload > bb_reload_num)
> +		    reload_sum += last_reload;
> +		  else
> +		    reload_sum += bb_reload_num;

The comment for reload_sum says:

> +/* Overall number reflecting distances of previous reloading the same
> +   value.  It is used to improve inheritance chances.  */
> +static int best_reload_sum;

which made me think of distance from the current instruction.  I see
it's actually something else, effectively a sum of instruction numbers.

I assumed the idea was to prefer registers that were reloaded more
recently (closer to the current instruction).  In that case I thought that,
for a distance-based best_reload_sum, smaller would be better,
while for an instruction-number-based best_reload_sum, larger would
be better.  It looks like we use instruction-number based best_reload_sums
but prefer smaller sums:

> +			      && (reload_nregs < best_reload_nregs
> +				  || (reload_nregs == best_reload_nregs
> +				      && best_reload_sum < reload_sum))))))))

Is that intentional?

Also, is this value meaningful for output reloads, which aren't really
going to be able to inherit a value as such?  We seem to apply the cost
regardless of whether it's an input or an output, so probably deserves
a comment.

Same for matched input operands, which as you say elsewhere aren't
inherited.

> +	      if (badop
> +		  /* Alternative loses if it has no regs for a reg
> +		     operand.  */
> +		  || (REG_P (op) && no_regs_p
> +		      && this_alternative_matches < 0))
> +		goto fail;

More reload cruft, but I don't understand why we have both this and, later:

> +	      if (this_alternative_matches < 0
> +		  && no_regs_p && ! this_alternative_offmemok && ! constmemok)
> +		goto fail;

We also had the earlier:

> +	  /* If this operand could be handled with a reg, and some reg
> +	     is allowed, then this operand can be handled.  */
> +	  if (winreg && this_alternative != NO_REGS)
> +	    badop = false;

which I think belongs in the same else statement.  At least after the
matching changes suggested above, I think all three can be replaced by:

	  /* If this operand accepts a register, and if the register class
	     has at least one allocatable register, then this operand
	     can be reloaded.  */
	  if (winreg && !no_regs_p)
	    badop = false;

	  if (badop)
	    goto fail;

which IMO belongs after the "no_regs_p" assignment.  badop should never
be false if we have no way of reloading the value.

> +	      if (! no_regs_p)
> +		reload_nregs
> +		  += ira_reg_class_max_nregs[this_alternative][mode];

I wasn't sure why we counted this even in the "const_to_mem && constmemok"
and "MEM_P (op) && offmemok" cases from:

	      /* We prefer to reload pseudos over reloading other
		 things, since such reloads may be able to be
		 eliminated later.  So bump REJECT in other cases.
		 Don't do this in the case where we are forcing a
		 constant into memory and it will then win since we
		 don't want to have a different alternative match
		 then.	*/
	      if (! (REG_P (op)
		     && REGNO (op) >= FIRST_PSEUDO_REGISTER)
		  && ! (const_to_mem && constmemok)
		  /* We can reload the address instead of memory (so
		     do not punish it).	 It is preferable to do to
		     avoid cycling in some cases.  */
		  && ! (MEM_P (op) && offmemok))
		reject += 2;

> +	  if (early_clobber_p)
> +	    reject++;
> +	  /* ??? Should we update the cost because early clobber
> +	     register reloads or it is a rare thing to be worth to do
> +	     it.  */
> +	  overall = losers * LOSER_COST_FACTOR + reject;

Could you expand on the comment a bit?

> +	  if ((best_losers == 0 || losers != 0) && best_overall < overall)
> +	    goto fail;
> +
> +	  curr_alt[nop] = this_alternative;
> +	  COPY_HARD_REG_SET (curr_alt_set[nop], this_alternative_set);
> +	  curr_alt_win[nop] = this_alternative_win;
> +	  curr_alt_match_win[nop] = this_alternative_match_win;
> +	  curr_alt_offmemok[nop] = this_alternative_offmemok;
> +	  curr_alt_matches[nop] = this_alternative_matches;
> +  
> +	  if (invalidate_m >= 0 && ! this_alternative_win)
> +	    curr_alt_win[invalidate_m] = false;

BTW, after the matching changes above, I don't think we need both
"invalidate_m" and "this_alternative_matches".

> +	  for (j = hard_regno_nregs[clobbered_hard_regno][biggest_mode[i]] - 1;
> +	       j >= 0;
> +	       j--)
> +	    SET_HARD_REG_BIT (temp_set, clobbered_hard_regno + j);

add_to_hard_reg_set.
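
I.e.:

	    add_to_hard_reg_set (&temp_set, biggest_mode[i],
				 clobbered_hard_regno);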

> +	    else if (curr_alt_matches[j] == i && curr_alt_match_win[j])
> +	      {
> +		/* This is a trick.  Such operands don't conflict and
> +		   don't need a reload.	 But it is hard to transfer
> +		   this information to the assignment pass which
> +		   spills one operand without this info.  We avoid the
> +		   conflict by forcing to use the same pseudo for the
> +		   operands hoping that the pseudo gets the same hard
> +		   regno as the operands and the reloads are gone.  */
> +		if (*curr_id->operand_loc[i] != *curr_id->operand_loc[j])
...
> +		/* See the comment for the previous case.  */
> +		if (*curr_id->operand_loc[i] != *curr_id->operand_loc[j])

What are these last two if statements for?  I wasn't sure how two operands
could have the same address.

Not saying they're wrong, but I think a comment would be good.

> +      small_class_operands_num = 0;
> +      for (nop = 0; nop < n_operands; nop++)
> +	/* If this alternative can be made to work by reloading, and
> +	   it needs less reloading than the others checked so far,
> +	   record it as the chosen goal for reloading.	*/
> +	small_class_operands_num
> +	  += SMALL_REGISTER_CLASS_P (curr_alt[nop]) ? 1 : 0;

Misplaced comment; I think it belongs after this line.

Richard
Richard Sandiford Oct. 10, 2012, 7:50 p.m. UTC | #4
Sorry, reading back in different surroundings made me notice a couple
of silly errors:

Richard Sandiford <rdsandiford@googlemail.com> writes:
> E.g.:
>
>   if ((*loc = get_equiv_substitution (reg)) != reg)
>     ...as above...
>   if (*loc != reg || !in_class_p (reg, cl, &new_class))
>     ...as above...
>   else if (new_class != NO_REGS && rclass != new_class)
>     change_class (regno, new_class, "	   Change", true);
>   return change_p;
>
> (assuming change to in_class_p suggested earlier) seems like it
> covers the same cases.

...but that same in_class_p change means that the "rclass != new_class"
condition isn't needed.  I.e.

  if ((*loc = get_equiv_substitution (reg)) != reg)
    ...as above...
  if (*loc != reg || !in_class_p (reg, cl, &new_class))
    ...as above...
  else if (new_class != NO_REGS)
    change_class (regno, new_class, "	   Change", true);
  return change_p;

>> +	      if (operand_reg[nop] != NULL_RTX)
>> +		{
>> +		  int last_reload = (lra_reg_info[ORIGINAL_REGNO
>> +						  (operand_reg[nop])]
>> +				     .last_reload);
>> +
>> +		  if (last_reload > bb_reload_num)
>> +		    reload_sum += last_reload;
>> +		  else
>> +		    reload_sum += bb_reload_num;
>
> The comment for reload_sum says:
>
>> +/* Overall number reflecting distances of previous reloading the same
>> +   value.  It is used to improve inheritance chances.  */
>> +static int best_reload_sum;
>
> which made me think of distance from the current instruction.  I see
> it's actually something else, effectively a sum of instruction numbers.
>
> I assumed the idea was to prefer registers that were reloaded more
> recently (closer to the current instruction).  In that case I thought that,
> for a distance-based best_reload_sum, smaller would be better,
> while for an instruction-number-based best_reload_sum, larger would
> be better.  It looks like we use instruction-number based best_reload_sums
> but prefer smaller sums:
>
>> +			      && (reload_nregs < best_reload_nregs
>> +				  || (reload_nregs == best_reload_nregs
>> +				      && best_reload_sum < reload_sum))))))))
>
> Is that intentional?

Clearly I can't read.  The code _does_ prefer higher numbers.  I still
think "distance" is a bit misleading though. :-)

Just for the record, the rest of my question:

> Also, is this value meaningful for output reloads, which aren't really
> going to be able to inherit a value as such?  We seem to apply the cost
> regardless of whether it's an input or an output, so probably deserves
> a comment.
>
> Same for matched input operands, which as you say elsewhere aren't
> inherited.

still applies.

Richard
Vladimir Makarov Oct. 11, 2012, 12:41 a.m. UTC | #5
On 12-10-03 7:11 AM, Richard Sandiford wrote:
> Hi Vlad,
>
> Some comments on lra-spills.c and lra-coalesce.c.
>
>> +   The pass creates necessary stack slots and assign spilled pseudos
>> +   to the stack slots in following way:
> s/assign/assigns/
Fixed.
>> +   (or insn memory constraints) might be not satisfied any more.
> s/might be not/might not be/
Fixed.
>> +   For some targets, the pass can spill some pseudos into hard
>> +   registers of different class (usually into vector registers)
>> +   instead of spilling them into memory if it is possible and
>> +   profitable.	Spilling GENERAL_REGS pseudo into SSE registers for
>> +   modern Intel x86/x86-64 processors is an example of such
>> +   optimization.  And this is actually recommended by Intel
>> +   optimization guide.
> Maybe mention core i7 specifically?  "Modern" is a bit dangerous
> in code that'll live a long time.
Yes, right.  Fixed.  Using the word "new" would be another bad choice.
>> +/* The structure describes a stack slot which can be used for several
>> +   spilled pseudos.  */
>> +struct slot
>> +{
> Looks like this describes "a register or stack slot" given the hard_regno case.
Fixed.
>> +/* Array containing info about the stack slots.	 The array element is
>> +   indexed by the stack slot number in the range [0..slost_num).  */
> Typo: slots_num
Fixed.
>> +  /* Each pseudo has an inherent size which comes from its own mode,
>> +     and a total size which provides room for paradoxical subregs
>> +     which refer to the pseudo reg in wider modes.
>> +
>> +     We can use a slot already allocated if it provides both enough
>> +     inherent space and enough total space.  Otherwise, we allocate a
>> +     new slot, making sure that it has no less inherent space, and no
>> +     less total space, then the previous slot.	*/
> The second part of the comment seems a bit misplaced, since the following
> code doesn't reuse stack slots.  This is done elsewhere instead.
> Maybe the first part would be better above the inherent_size assignment.
Right.  I've changed the comment to reflect the current state of the code.
>> +  /* If we have any adjustment to make, or if the stack slot is the
>> +     wrong mode, make a new stack slot.	 */
>> +  x = adjust_address_nv (x, GET_MODE (regno_reg_rtx[i]), adjust);
> We don't make a new slot here.
I removed the comment.  The same comment is present in reload1.c and 
should probably be removed there too.
>> +/* Sort pseudos according their slot numbers putting ones with smaller
>> +   numbers first, or last when the frame pointer is not needed.	 So
>> +   pseudos with the first slot will be finally addressed with smaller
>> +   address displacement.  */
>> +static int
>> +pseudo_reg_slot_compare (const void *v1p, const void *v2p)
>> +{
>> +  const int regno1 = *(const int *) v1p;
>> +  const int regno2 = *(const int *) v2p;
>> +  int diff, slot_num1, slot_num2;
>> +  int total_size1, total_size2;
>> +
>> +  slot_num1 = pseudo_slots[regno1].slot_num;
>> +  slot_num2 = pseudo_slots[regno2].slot_num;
>> +  if ((diff = slot_num1 - slot_num2) != 0)
>> +    return (frame_pointer_needed
>> +	    || !FRAME_GROWS_DOWNWARD == STACK_GROWS_DOWNWARD ? diff : -diff);
> The comment doesn't quite describe the condition.  Maybe:
>
> /* Sort pseudos according to their slots, putting the slots in the order
>     that they should be allocated.  Slots with lower numbers have the highest
>     priority and should get the smallest displacement from the stack or
>     frame pointer (whichever is being used).
>
>     The first allocated slot is always closest to the frame pointer,
>     so prefer lower slot numbers when frame_pointer_needed.  If the stack
>     and frame grow in the same direction, then the first allocated slot is
>     always closest to the initial stack pointer and furthest away from the
>     final stack pointer, so allocate higher numbers first when using the
>     stack pointer in that case.  The reverse is true if the stack and
>     frame grow in opposite directions.  */
I used your comment.  Thanks.
>> +  total_size1 = MAX (PSEUDO_REGNO_BYTES (regno1),
>> +		     GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode));
>> +  total_size2 = MAX (PSEUDO_REGNO_BYTES (regno2),
>> +		     GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode));
>> +  if ((diff = total_size2 - total_size1) != 0)
>> +    return diff;
> I think this could do with a bit more commentary.  When is biggest_mode
> ever smaller than PSEUDO_REGNO_BYTES?  Is that for pseudos that are only
> ever referenced as lowpart subregs?  If so, why does PSEUDO_REGNO_BYTES
> matter for those registers here but not when calculating biggest_mode?
The MAX code makes no sense to me either (it was probably wrongly 
adapted from somewhere), so I removed the MAX.
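So the comparison is now simply (sketch):

  total_size1 = GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode);
  total_size2 = GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode);
  if ((diff = total_size2 - total_size1) != 0)
    return diff;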
>> +/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS.  Put the
>> +   pseudos which did not get a spill hard register at the beginning of
>> +   array PSEUDO_REGNOS.	 Return the number of such pseudos.  */
> It'd be worth saying that PSEUDO_REGNOS is sorted in order of highest
> frequency first.
Fixed.
>> +  bitmap set_jump_crosses = regstat_get_setjmp_crosses ();
> I notice you use "set_jump" here and "setjump" in parts of 7a.patch.
> Probably better to use setjmp across the board.
Fixed.  setjump is also used in other parts of GCC.
>> +  /* Hard registers which can not be used for any purpose at given
>> +     program point because they are unallocatable or already allocated
>> +     for other pseudos.	 */
>> +  HARD_REG_SET *reserved_hard_regs;
>> +
>> +  if (! lra_reg_spill_p)
>> +    return n;
>> +  /* Set up reserved hard regs for every program point.	 */
>> +  reserved_hard_regs = (HARD_REG_SET *) xmalloc (sizeof (HARD_REG_SET)
>> +						 * lra_live_max_point);
>> +  for (p = 0; p < lra_live_max_point; p++)
>> +    COPY_HARD_REG_SET (reserved_hard_regs[p], lra_no_alloc_regs);
>> +  for (i = FIRST_PSEUDO_REGISTER; i < regs_num; i++)
>> +    if (lra_reg_info[i].nrefs != 0
>> +	&& (hard_regno = lra_get_regno_hard_regno (i)) >= 0)
>> +      for (r = lra_reg_info[i].live_ranges; r != NULL; r = r->next)
>> +	for (p = r->start; p <= r->finish; p++)
>> +	  lra_add_hard_reg_set (hard_regno, lra_reg_info[i].biggest_mode,
>> +				&reserved_hard_regs[p]);
> Since compilation time seems to be all the rage, I wonder if it would be
> quicker to have one live range list per hard register.  Then:
>> +      for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
>> +	for (p = r->start; p <= r->finish; p++)
>> +	  IOR_HARD_REG_SET (conflict_hard_regs, reserved_hard_regs[p]);
> would just be checking for live range intersection and:
>
>> +      /* Update reserved_hard_regs.  */
>> +      for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
>> +	for (p = r->start; p <= r->finish; p++)
>> +	  lra_add_hard_reg_set (hard_regno, lra_reg_info[regno].biggest_mode,
>> +				&reserved_hard_regs[p]);
> would again be a merge.
>
> Just an idea, not a merge requirement.  If you've already tried this and
> found it to be worse, that might be worth a comment.
I checked profiles and coverage for different tests (including huge 
ones) and did not see that this code is critical.  But it is probably 
worth trying.  It might be a bit more complicated for multi-register 
pseudos.  I've just used the same pattern as in IRA fast allocation.  
On the other hand, stack slots are allocated as you propose.  It might 
be good to unify the code.  I'll put it on my (long) todo list.
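For reference, the per-hard-register variant you describe would make 
the conflict check an ordered range-list intersection, roughly like 
the existing lra_intersected_live_ranges_p (a sketch, assuming each 
list is kept sorted by decreasing start point, as the per-pseudo 
lists are):

  static bool
  ranges_intersect_p (lra_live_range_t r1, lra_live_range_t r2)
  {
    /* The lists are ordered and the ranges within a list are
       disjoint, so skip whichever range starts after the other
       one finishes.  */
    while (r1 != NULL && r2 != NULL)
      {
        if (r1->start > r2->finish)
          r1 = r1->next;
        else if (r2->start > r1->finish)
          r2 = r2->next;
        else
          return true;
      }
    return false;
  }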
>> +      first = pseudo_slots[regno].first = &pseudo_slots[slots[slot_num].regno];
>> +      pseudo_slots[regno].next = pseudo_slots[slots[slot_num].regno].next;
>> +      first->next = &pseudo_slots[regno];
> Very minor nit, but I think this would be easier to read if the middle
> line also used "first->next".
Fixed.
>> +/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS.  Put the
>> +   pseudos which did not get a spill hard register at the beginning of
>> +   array PSEUDO_REGNOS.	 Return the number of such pseudos.  */
> Here too I think it's worth mentioning that PSEUDO_REGNOS is sorted
> with highest frequency first.
Fixed.
>> +/* Recursively process LOC in INSN and change spilled pseudos to the
>> +   corresponding memory or spilled hard reg.  Ignore spilled pseudos
>> +   created from the scratches.	*/
>> +static bool
>> +remove_pseudos (rtx *loc, rtx insn)
> The return value is now ignored -- we know in advance which insns need
> changing -- so this could be simplified.
Fixed.
>> +/* Change spilled pseudos into memory or spill hard regs.  The
>> +   function put changed insns on the constraint stack (these insns
>> +   will be considered on the next constraint pass).  The changed insns
>> +   are all insns in which pseudos were changed.	 */
> s/The function put/Put/
Fixed
>> +/* Set up REMOVED_PSEUDOS_BITMAP and USED_PSEUDOS_BITMAP, and update
>> +   LR_BITMAP (a BB live info bitmap).  */
>> +static void
>> +update_live_info (bitmap lr_bitmap)
>> +{
>> +  unsigned int j;
>> +  bitmap_iterator bi;
>> +
>> +  bitmap_clear (&removed_pseudos_bitmap);
>> +  bitmap_clear (&used_pseudos_bitmap);
>> +  EXECUTE_IF_AND_IN_BITMAP (&coalesced_pseudos_bitmap, lr_bitmap,
>> +			    FIRST_PSEUDO_REGISTER, j, bi)
>> +    {
>> +      bitmap_set_bit (&removed_pseudos_bitmap, j);
>> +      bitmap_set_bit (&used_pseudos_bitmap, first_coalesced_pseudo[j]);
>> +    }
>> +  if (! bitmap_empty_p (&removed_pseudos_bitmap))
>> +    {
>> +      bitmap_and_compl_into (lr_bitmap, &removed_pseudos_bitmap);
>> +      bitmap_ior_into (lr_bitmap, &used_pseudos_bitmap);
>> +    }
>> +}
> Might be wrong, but it looks like nothing really uses removed_pseudos_bitmap
> outside this function.  I think this could simply be:
> /* Set up REMOVED_PSEUDOS_BITMAP and USED_PSEUDOS_BITMAP, and update
>     LR_BITMAP (a BB live info bitmap).  */
> static void
> update_live_info (bitmap lr_bitmap)
> {
>    unsigned int j;
>    bitmap_iterator bi;
>
>    bitmap_clear (&used_pseudos_bitmap);
>    EXECUTE_IF_AND_IN_BITMAP (&coalesced_pseudos_bitmap, lr_bitmap,
> 			    FIRST_PSEUDO_REGISTER, j, bi)
>      bitmap_set_bit (&used_pseudos_bitmap, first_coalesced_pseudo[j]);
>    if (! bitmap_empty_p (&used_pseudos_bitmap))
>      {
>        bitmap_and_compl_into (lr_bitmap, &coalesced_pseudos_bitmap);
>        bitmap_ior_into (lr_bitmap, &used_pseudos_bitmap);
>      }
> }
Yes.  Thanks for finding such a nontrivial change.  I fixed it.
>> +	    && mem_move_p (sregno, dregno)
>> +	    /* Don't coalesce inheritance pseudos because spilled
>> +	       inheritance pseudos will be removed in subsequent 'undo
>> +	       inheritance' pass.  */
>> +	    && lra_reg_info[sregno].restore_regno < 0
>> +	    && lra_reg_info[dregno].restore_regno < 0
>> +	    /* We undo splits for spilled pseudos whose live ranges
>> +	       were split.  So don't coalesce them, it is not
>> +	       necessary and the undo transformations would be
>> +	       wrong.  */
>> +	    && ! bitmap_bit_p (&split_origin_bitmap, sregno)
>> +	    && ! bitmap_bit_p (&split_origin_bitmap, dregno)
>> +	    && ! side_effects_p (set)
>> +	    /* Don't coalesces bound pseudos.  Bound pseudos has own
>> +	       rules for finding live ranges.  It is hard to maintain
>> +	       this info with coalescing and it is not worth to do
>> +	       it.  */
>> +	    && ! bitmap_bit_p (&lra_bound_pseudos, sregno)
>> +	    && ! bitmap_bit_p (&lra_bound_pseudos, dregno)
>> +	    /* We don't want to coalesce regnos with equivalences,
>> +	       at least without updating this info.  */
>> +	    && ira_reg_equiv[sregno].constant == NULL_RTX
>> +	    && ira_reg_equiv[sregno].memory == NULL_RTX
>> +	    && ira_reg_equiv[sregno].invariant == NULL_RTX
>> +	    && ira_reg_equiv[dregno].constant == NULL_RTX
>> +	    && ira_reg_equiv[dregno].memory == NULL_RTX
>> +	    && ira_reg_equiv[dregno].invariant == NULL_RTX
> Probably personal preference, but I think this would be easier
> to read as:
>
> 	    && coalescable_reg_p (sregno)
> 	    && coalescable_reg_p (dregno)
> 	    && !side_effects_p (set)
>
> with coalescable_reg_p checking reg_renumber (from mem_move_p)
> and the open-coded stuff in the quote above.
Ok.  Fixed.
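For the record, the helper is roughly as follows (a sketch: mem_move_p 
keeps the reg_renumber check, and the predicates are the ones quoted 
above):

  /* Return true if pseudo REGNO can potentially be coalesced.  */
  static bool
  coalescable_pseudo_p (int regno)
  {
    lra_assert (regno >= FIRST_PSEUDO_REGISTER);
    return (/* Don't coalesce inheritance pseudos or pseudos whose
               live ranges were split.  */
            lra_reg_info[regno].restore_regno < 0
            && ! bitmap_bit_p (&split_origin_bitmap, regno)
            /* Don't coalesce bound pseudos, which have their own
               rules for live ranges.  */
            && ! bitmap_bit_p (&lra_bound_pseudos, regno)
            /* Don't coalesce regnos with equivalences.  */
            && ira_reg_equiv[regno].constant == NULL_RTX
            && ira_reg_equiv[regno].memory == NULL_RTX
            && ira_reg_equiv[regno].invariant == NULL_RTX);
  }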
>> +  for (; mv_num != 0;)
>> +    {
>> +      for (i = 0; i < mv_num; i++)
>> +	{
>> +	  mv = sorted_moves[i];
>> +	  set = single_set (mv);
>> +	  lra_assert (set != NULL && REG_P (SET_SRC (set))
>> +		      && REG_P (SET_DEST (set)));
>> +	  sregno = REGNO (SET_SRC (set));
>> +	  dregno = REGNO (SET_DEST (set));
>> +	  if (! lra_intersected_live_ranges_p
>> +		(lra_reg_info[first_coalesced_pseudo[sregno]].live_ranges,
>> +		 lra_reg_info[first_coalesced_pseudo[dregno]].live_ranges))
>> +	    {
>> +	      coalesced_moves++;
>> +	      if (lra_dump_file != NULL)
>> +		fprintf
>> +		  (lra_dump_file,
>> +		   "	  Coalescing move %i:r%d(%d)-r%d(%d) (freq=%d)\n",
>> +		   INSN_UID (mv), sregno, ORIGINAL_REGNO (SET_SRC (set)),
>> +		   dregno, ORIGINAL_REGNO (SET_DEST (set)),
>> +		   BLOCK_FOR_INSN (mv)->frequency);
>> +	      bitmap_ior_into (&involved_insns_bitmap,
>> +			       &lra_reg_info[sregno].insn_bitmap);
>> +	      bitmap_ior_into (&involved_insns_bitmap,
>> +			       &lra_reg_info[dregno].insn_bitmap);
>> +	      merge_pseudos (sregno, dregno);
>> +	      i++;
>> +	      break;
>> +	    }
>> +	}
>> +      /* Collect the rest of copies.  */
>> +      for (n = 0; i < mv_num; i++)
>> +	{
>> +	  mv = sorted_moves[i];
>> +	  set = single_set (mv);
>> +	  lra_assert (set != NULL && REG_P (SET_SRC (set))
>> +		      && REG_P (SET_DEST (set)));
>> +	  sregno = REGNO (SET_SRC (set));
>> +	  dregno = REGNO (SET_DEST (set));
>> +	  if (first_coalesced_pseudo[sregno] != first_coalesced_pseudo[dregno])
>> +	    sorted_moves[n++] = mv;
>> +	  else if (lra_dump_file != NULL)
>> +	    {
>> +	      coalesced_moves++;
>> +	      fprintf
>> +		(lra_dump_file, "      Coalescing move %i:r%d-r%d (freq=%d)\n",
>> +		 INSN_UID (mv), sregno, dregno,
>> +		 BLOCK_FOR_INSN (mv)->frequency);
>> +	    }
>> +	}
>> +      mv_num = n;
> I'm probably being dense here, sorry, but why the nested loops?
> Why can't we have one loop along the lines of:
>
>        for (i = 0; i < mv_num; i++)
> 	{
> 	  mv = sorted_moves[i];
> 	  set = single_set (mv);
> 	  lra_assert (set != NULL && REG_P (SET_SRC (set))
> 		      && REG_P (SET_DEST (set)));
> 	  sregno = REGNO (SET_SRC (set));
> 	  dregno = REGNO (SET_DEST (set));
> 	  if (first_coalesced_pseudo[sregno] == first_coalesced_pseudo[dregno])
> 	    {
> 	      coalesced_moves++;
> 	      fprintf
> 		(lra_dump_file, "      Coalescing move %i:r%d-r%d (freq=%d)\n",
> 		 INSN_UID (mv), sregno, dregno,
> 		 BLOCK_FOR_INSN (mv)->frequency);
> 	      /* We updated involved_insns_bitmap when doing the mrege */
> 	    }
> 	  else if (!(lra_intersected_live_ranges_p
> 		     (lra_reg_info[first_coalesced_pseudo[sregno]].live_ranges,
> 		      lra_reg_info[first_coalesced_pseudo[dregno]].live_ranges)))
> 	    {
> 	      coalesced_moves++;
> 	      if (lra_dump_file != NULL)
> 		fprintf
> 		  (lra_dump_file,
> 		   "	  Coalescing move %i:r%d(%d)-r%d(%d) (freq=%d)\n",
> 		   INSN_UID (mv), sregno, ORIGINAL_REGNO (SET_SRC (set)),
> 		   dregno, ORIGINAL_REGNO (SET_DEST (set)),
> 		   BLOCK_FOR_INSN (mv)->frequency);
> 	      bitmap_ior_into (&involved_insns_bitmap,
> 			       &lra_reg_info[sregno].insn_bitmap);
> 	      bitmap_ior_into (&involved_insns_bitmap,
> 			       &lra_reg_info[dregno].insn_bitmap);
> 	      merge_pseudos (sregno, dregno);
> 	    }
> 	}
>
> (completely untested)
As I remember, it was a more complicated coalescing algorithm where 
sorting was done on each iteration after coalescing one move.

I changed the code.
>
>> +	    if ((set = single_set (insn)) != NULL_RTX
>> +		&& REG_P (SET_DEST (set)) && REG_P (SET_SRC (set))
>> +		&& REGNO (SET_SRC (set)) == REGNO (SET_DEST (set))
>> +		&& ! side_effects_p (set))
> Maybe use set_noop_p here?
>
Ok, why not.  The code is rarely executed, so more general code can 
be used.  I changed it to set_noop_p.
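I.e. the check is now roughly:

	    if ((set = single_set (insn)) != NULL_RTX
		&& set_noop_p (set))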
Vladimir Makarov Oct. 12, 2012, 3:37 a.m. UTC | #6
On 10/04/2012 11:50 AM, Richard Sandiford wrote:
> Hi Vlad,
>
> This message is for lra-assigns.c.  Sorry for the piecemeal reviews,
> never sure when I'll get time...
>
>> +/* This file contains a pass mostly assigning hard registers to reload
>> +   pseudos.  There is no any RTL code transformation on this pass.
> Maybe:
>
> /* This file's main objective is to assign hard registers to reload pseudos.
>     It also tries to allocate hard registers to other pseudos, but at a lower
>     priority than the reload pseudos.  The pass does not transform the RTL.
>
> if that's accurate.
Yes.  That is better.  I used your comment.
>> +   Reload pseudos get what they need (usually) hard registers in
>> +   anyway possibly by spilling non-reload pseudos and by assignment
>> +   reload pseudos with smallest number of available hard registers
>> +   first.
>> +
>> +   If reload pseudos can get hard registers only through spilling
>> +   other pseudos, we choose what pseudos to spill taking into account
>> +   how given reload pseudo benefits and also how other reload pseudos
>> +   not assigned yet benefit too (see function spill_for).
> Maybe:
>
>     We must allocate a hard register to every reload pseudo.  We try to
>     increase the chances of finding a viable allocation by assigning the
>     pseudos in order of fewest available hard registers first.  If we
>     still fail to find a hard register, we spill other (non-reload)
>     pseudos in order to make room.
>
>     assign_hard_regno_for allocates registers without spilling.
>     spill_for does the same with spilling.  Both functions use
>     a cost model to determine the most profitable choice of
>     hard and spill registers.
Ok.  I just changed two sentences a bit:

  find_hard_regno_for finds hard registers for allocation without  
spilling.  spill_for does the same with spilling.

>> +   Non-reload pseudos can get hard registers too if it is possible and
>> +   improves the code.  It might be possible because of spilling
>> +   non-reload pseudos on given pass.
> Maybe:
>
>     Once we have finished allocating reload pseudos, we also try to
>     assign registers to other (non-reload) pseudos.  This is useful
>     if hard registers were freed up by the spilling just described.
>
Fixed.
>> +   We try to assign hard registers processing pseudos by threads.  The
>> +   thread contains reload and inheritance pseudos connected by copies
>> +   (move insns).  It improves the chance to get the same hard register
>> +   to pseudos in the thread and, as the result, to remove some move
>> +   insns.
> Maybe:
>
>     We try to assign hard registers by collecting pseudos into threads.
>     These threads contain reload and inheritance pseudos that are connected
>     by copies (move insns).  Doing this improves the chances of pseudos
>     in the thread getting the same hard register and, as a result,
>     of allowing some move insns to be deleted.
Fixed.
>> +   When we assign hard register to a pseudo, we decrease the cost of
>> +   the hard registers for corresponding pseudos connected by copies.
> Maybe:
>
>     When we assign a hard register to a pseudo, we decrease the cost of
>     using the same hard register for pseudos that are connected by copies.
Fixed.
>> +   If two hard registers are equally good for assigning the pseudo
>> +   with hard register cost point of view, we prefer a hard register in
>> +   smaller register bank.  By default, there is only one register
>> +   bank.  A target can define register banks by hook
>> +   register_bank. For example, x86-64 has a few register banks: hard
>> +   regs with and without REX prefixes are in different banks.  It
>> +   permits to generate smaller code as insns without REX prefix are
>> +   shorter.
> Maybe:
>
>     If two hard registers have the same frequency-derived cost,
>     we prefer hard registers in lower register banks.  The mapping
>     of registers to banks is controlled by the register_bank target hook.
>     For example, x86-64 has a few register banks: hard registers with and
>     without REX prefixes are in different banks.  This permits us
>     to generate smaller code as insns without REX prefixes are shorter.
>
> although this might change if the name of the hook changes.

With the recent change in the hook name, I modified it to:

    If two hard registers have the same frequency-derived cost, we
    prefer hard registers with bigger priorities.  The mapping of
    registers to priorities is controlled by the register_priority
    target hook.  For example, x86-64 has a few register priorities:
    hard registers with and without REX prefixes have different
    priorities.  This permits us to generate smaller code as insns
    without REX prefixes are shorter.

>> +/* Info about pseudo used during the assignment pass.  Thread is a set
>> +   of connected reload and inheritance pseudos with the same set of
>> +   available hard reg set.  Thread is a pseudo itself for other
>> +   cases.  */
>> +struct regno_assign_info
> Maybe:
>
> /* Information about the thread to which a pseudo belongs.  Threads are
>     a set of connected reload and inheritance pseudos with the same set of
>     available hard registers.  Lone registers belong to their own threads.  */
Fixed.
> Although the condition seems to be:
>> +	&& (ira_class_hard_regs_num[regno_allocno_class_array[regno1]]
>> +	    == ira_class_hard_regs_num[regno_allocno_class_array[regno2]]))
> i.e. the same _number_ of available hard regs, but not necessarily the
> same set.
It should be the same in most cases.  This condition is just a faster 
approximation of having the same available hard reg set.
> "thread" might be more mnemonic than "regno_assign" in this file,
> but that's bikeshed stuff.
>
>> +  for (i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
>> +    {
>> +      regno_assign_info[i].first = i;
>> +      regno_assign_info[i].next = -1;
>> +      regno_assign_info[i].freq = lra_reg_info[i].freq;
>> +    }
> Minor speedup, but it's probably worth caching max_reg_num () rather than
> calling it in each loop iteration.  Several other loops with the same thing.
That is not critical code, and LTO could solve the problem.  But as 
we usually don't use LTO for building GCC, I rewrote it.
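The loop now looks roughly like this, with the call hoisted out of 
the loop condition:

  max_regno = max_reg_num ();
  for (i = FIRST_PSEUDO_REGISTER; i < max_regno; i++)
    {
      regno_assign_info[i].first = i;
      regno_assign_info[i].next = -1;
      regno_assign_info[i].freq = lra_reg_info[i].freq;
    }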
>> +/* Process a pseudo copy with execution frequency COPY_FREQ connecting
>> +   REGNO1 and REGNO2 to form threads.  */
>> +static void
>> +process_copy_to_form_thread (int regno1, int regno2, int copy_freq)
>> +{
>> +  int last, regno1_first, regno2_first;
>> +
>> +  lra_assert (regno1 >= lra_constraint_new_regno_start
>> +	      && regno2 >= lra_constraint_new_regno_start);
>> +  regno1_first = regno_assign_info[regno1].first;
>> +  regno2_first = regno_assign_info[regno2].first;
>> +  if (regno1_first != regno2_first)
>> +    {
>> +      for (last = regno2_first;
>> +	   regno_assign_info[last].next >= 0;
>> +	   last = regno_assign_info[last].next)
>> +	regno_assign_info[last].first = regno1_first;
>> +      regno_assign_info[last].next = regno_assign_info[regno1_first].next;
>> +      regno_assign_info[regno1_first].first = regno2_first;
>> +      regno_assign_info[regno1_first].freq
>> +	+= regno_assign_info[regno2_first].freq;
> Couple of things I don't understand here:
>
> - Why don't we set regno_assign_info[last].first (for final "last")
>    to regno1_first?  I.e. the loop stops while "last" is still valid,
>    but only assigns to that element's "next" field, leaving "first"
>    as before.
>
> - I might be wrong, but should:
>
>      regno_assign_info[regno1_first].first = regno2_first;
>
>    be:
>
>      regno_assign_info[regno1_first].next = regno2_first;
>
>    so that the list becomes:
>
>      regno1_first regno2_first ... last ...
>
>    The current version seems to create a cycle:
>
>     regno_assign_info[regno1_first].first == regno2_first
>     regno_assign_info[regno2_first].first == regno1_first
It is a typo.  Fixed.  Thanks.  There is no danger of looping as we 
don't traverse this list via the 'first' field, but it did result in 
an assignment order different from what I assumed.
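For the record, the corrected splice looks roughly like this (the 
final element of regno2's thread also gets its 'first' field updated, 
and regno2's thread is linked in via 'next'):

  for (last = regno2_first;
       regno_assign_info[last].next >= 0;
       last = regno_assign_info[last].next)
    regno_assign_info[last].first = regno1_first;
  regno_assign_info[last].first = regno1_first;
  regno_assign_info[last].next = regno_assign_info[regno1_first].next;
  regno_assign_info[regno1_first].next = regno2_first;
  regno_assign_info[regno1_first].freq
    += regno_assign_info[regno2_first].freq;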
>> +/* Update LIVE_HARD_REG_PSEUDOS and LIVE_PSEUDOS_REG_RENUMBER by
>> +   pseudo REGNO assignment or by the pseudo spilling if FREE_P.	 */
> Maybe:
>
> /* Update the LIVE_HARD_REG_PSEUDOS and LIVE_PSEUDOS_REG_RENUMBER
>     entries for pseudo REGNO.  Assume that the register has been
>     spilled if FREE_P, otherwise assume that it has been assigned
>     reg_renumber[REGNO] (if >= 0).  */
Fixed.  I also added a comment about the recently added 
insert_in_live_range_start_chain call.
>> +/* Find and return best (or TRY_ONLY_HARD_REGNO) free hard register
>> +   for pseudo REGNO.  In the failure case, return a negative number.
>> +   Return through *COST the cost of usage of the hard register for the
>> +   pseudo.  Best free hard register has smallest cost of usage for
>> +   REGNO or smallest register bank if the cost is the same.  */
> Maybe:
>
> /* Try to find a free hard register for pseudo REGNO.  Return the
>     hard register on success and set *COST to the cost of using
>     that register.  (If several registers have equal cost, the one with
>     the lowest register bank wins.)  Return -1 on failure.
>
>     If TRY_ONLY_HARD_REGNO >= 0, consider only that hard register,
>     otherwise consider all hard registers in REGNO's class.  */
>
Fixed, with "bank" changed to "priority".
>> +      if (hard_regno_costs_check[hard_regno] != curr_hard_regno_costs_check)
>> +	hard_regno_costs[hard_regno] = 0;
>> +      hard_regno_costs_check[hard_regno] = curr_hard_regno_costs_check;
>> +      hard_regno_costs[hard_regno]
>> +	-= lra_reg_info[regno].preferred_hard_regno_profit1;
> This pattern occurs several times.  I think it'd be clearer to have
> an inline helper function (adjust_hard_regno_cost, or whatever).
Done.
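The helper looks roughly like:

  /* Update the cost of HARD_REGNO by INCR, lazily resetting the
     cached value the first time the register is touched on this
     pass.  */
  static inline void
  adjust_hard_regno_cost (int hard_regno, int incr)
  {
    if (hard_regno_costs_check[hard_regno] != curr_hard_regno_costs_check)
      hard_regno_costs[hard_regno] = 0;
    hard_regno_costs_check[hard_regno] = curr_hard_regno_costs_check;
    hard_regno_costs[hard_regno] += incr;
  }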
>> +  /* That is important for allocation of multi-word pseudos.  */
>> +  IOR_COMPL_HARD_REG_SET (conflict_set, reg_class_contents[rclass]);
> Maybe:
>
>    /* Make sure that all registers in a multi-word pseudo belong to the
>       required class.  */
Fixed.
>> +	  /* We can not use prohibited_class_mode_regs because it is
>> +	     defined not for all classes.  */
> s/defined not/not defined/
Fixed.
>> +	  && ! TEST_HARD_REG_BIT (impossible_start_hard_regs, hard_regno)
>> +	  && (nregs_diff == 0
>> +#ifdef WORDS_BIG_ENDIAN
>> +	      || (hard_regno - nregs_diff >= 0
>> +		  && TEST_HARD_REG_BIT (reg_class_contents[rclass],
>> +					hard_regno - nregs_diff))
>> +#else
>> +	      || TEST_HARD_REG_BIT (reg_class_contents[rclass],
>> +				    hard_regno + nregs_diff)
>> +#endif
>> +	      ))
> impossible_start_hard_regs is set up as:
>
>> +	conflict_hr = live_pseudos_reg_renumber[conflict_regno];
>> +	nregs = (hard_regno_nregs[conflict_hr]
>> +		 [lra_reg_info[conflict_regno].biggest_mode]);
>> +	/* Remember about multi-register pseudos.  For example, 2 hard
>> +	   register pseudos can start on the same hard register but can
>> +	   not start on HR and HR+1/HR-1.  */
>> +	for (hr = conflict_hr + 1;
>> +	     hr < FIRST_PSEUDO_REGISTER && hr < conflict_hr + nregs;
>> +	     hr++)
>> +	  SET_HARD_REG_BIT (impossible_start_hard_regs, hr);
>> +	for (hr = conflict_hr - 1;
>> +	     hr >= 0 && hr + hard_regno_nregs[hr][biggest_mode] > conflict_hr;
>> +	     hr--)
>> +	  SET_HARD_REG_BIT (impossible_start_hard_regs, hr);
> which I don't think copes with big-endian cases like:
>
>    other hard reg in widest mode:    ........XXXX...
>    impossible_start_regs:            .....XXX.XXX...
>    this hard reg in pseudo's mode:   ............XX.
>    this hard reg in widest mode:     ..........XXXX.
>
> which AIUI is an invalid choice.
>
> There are other corner cases too.  If the other hard reg is narrower than
> its widest mode, and that widest mode is wider than the current regno's
> widest mode, then on big-endian targets we could have:
>
>    other hard reg in its own mode:   ........XX....
>    other hard reg in widest mode:    ......XXXX.....
>    impossible_start_regs:            .......X.XXX... (*)
>    this hard reg in pseudo's mode:   .....XX........
>    this hard reg in widest mode:     .....XX........
>
> (*) note no big-endian adjustment for the other hard reg's widest mode here.
>
> Maybe it would be easier to track impossible end regs for
> big-endian targets?
I'll look at this tomorrow.
>> +/* Update HARD_REGNO preference for pseudos connected (directly or
>> +   indirectly) to a pseudo with REGNO.	Use divisor DIV to the
>> +   corresponding copy frequency for the hard regno cost preference
>> +   calculation.	 The more indirectly a pseudo connected, the less the
>> +   cost preference.  It is achieved by increasing the divisor for each
>> +   next recursive level move.  */
> "cost preference" seems a bit contradictory.  Maybe:
>
> /* Update the preference for using HARD_REGNO for pseudos that are
>     connected directly or indirectly with REGNO.  Apply divisor DIV
>     to any preference adjustments.
>
>     The more indirectly a pseudo is connected, the smaller its effect
>     should be.  We therefore increase DIV on each "hop".  */
>
Fixed.  By the way, it is your invention from IRA.
>> +static void
>> +update_hard_regno_preference (int regno, int hard_regno, int div)
>> +{
>> +  int another_regno, cost;
>> +  lra_copy_t cp, next_cp;
>> +
>> +  /* Search depth 5 seems to be enough.	 */
>> +  if (div > (1 << 5))
>> +    return;
>> +  for (cp = lra_reg_info[regno].copies; cp != NULL; cp = next_cp)
>> +    {
>> +      if (cp->regno1 == regno)
>> +	{
>> +	  next_cp = cp->regno1_next;
>> +	  another_regno = cp->regno2;
>> +	}
>> +      else if (cp->regno2 == regno)
>> +	{
>> +	  next_cp = cp->regno2_next;
>> +	  another_regno = cp->regno1;
>> +	}
>> +      else
>> +	gcc_unreachable ();
>> +      if (reg_renumber[another_regno] < 0
>> +	  && (update_hard_regno_preference_check[another_regno]
>> +	      != curr_update_hard_regno_preference_check))
>> +	{
>> +	  update_hard_regno_preference_check[another_regno]
>> +	    = curr_update_hard_regno_preference_check;
>> +	  cost = cp->freq < div ? 1 : cp->freq / div;
>> +	  lra_setup_reload_pseudo_preferenced_hard_reg
>> +	    (another_regno, hard_regno, cost);
>> +	  update_hard_regno_preference (another_regno, hard_regno, div * 2);
>> +	}
>> +    }
>> +}
> Using a depth-first search for this seems a bit dangerous, because we
> could end up processing a connected pseudo via a very indirect path
> first, even though it is more tightly connected via a more direct path.
> (Could be a well-known problem, sorry.)
Actually, I did this on purpose.  The situation is a bit different 
from IRA.  The vast majority of copies form a straight line (hence 
the term "threads").  Therefore I also use only two hard register 
preferences (from the two copies with hard registers).  I was looking 
for a balance between code simplicity and speed on the one hand and 
more sophisticated but slower heuristics on the other.
>> +/* Update REG_RENUMBER and other pseudo preferences by assignment of
>> +   HARD_REGNO to pseudo REGNO and print about it if PRINT_P.  */
>> +void
>> +lra_setup_reg_renumber (int regno, int hard_regno, bool print_p)
>> +{
>> +  int i, hr;
>> +
>> +  if ((hr = hard_regno) < 0)
>> +    hr = reg_renumber[regno];
>> +  reg_renumber[regno] = hard_regno;
>> +  lra_assert (hr >= 0);
>> +  for (i = 0; i < hard_regno_nregs[hr][PSEUDO_REGNO_MODE (regno)]; i++)
>> +    if (hard_regno < 0)
>> +      lra_hard_reg_usage[hr + i] -= lra_reg_info[regno].freq;
>> +    else
>> +      lra_hard_reg_usage[hr + i] += lra_reg_info[regno].freq;
> Is it possible for this function to reallocate a register,
> i.e. for reg_regnumber to be >= 0 both before and after the call?
> If so, I think we'd need two loops.  If not, an assert would be good.
No. I added an assert.
>> +      mode = PSEUDO_REGNO_MODE (spill_regno);
>> +      if (lra_hard_reg_set_intersection_p
>> +	  (live_pseudos_reg_renumber[spill_regno],
>> +	   mode, reg_class_contents[rclass]))
>> +	{
>> +	  hard_regno = live_pseudos_reg_renumber[spill_regno];
> Very minor, sorry, but I think this would be more readable with the
> hard_regno assignment before the condition and hard_regno used in it.
Fixed.
>> +/* Spill some pseudos for a reload pseudo REGNO and return hard
>> +   register which should be used for pseudo after spilling.  The
>> +   function adds spilled pseudos to SPILLED_PSEUDO_BITMAP.  When we
>> +   choose hard register (and pseudos occupying the hard registers and
>> +   to be spilled), we take into account not only how REGNO will
>> +   benefit from the spills but also how other reload pseudos not
>> +   assigned to hard registers yet benefit from the spills too.	*/
> "...not yet assigned to hard registers benefit..."
>
Fixed.
>> +  curr_pseudo_check++; /* Invalidate try_hard_reg_pseudos elements.  */
> Comment on its own line.
Fixed.
>> +  bitmap_clear (&ignore_pseudos_bitmap);
>> +  bitmap_clear (&best_spill_pseudos_bitmap);
>> +  EXECUTE_IF_SET_IN_BITMAP (&lra_reg_info[regno].insn_bitmap, 0, uid, bi)
>> +    {
>> +      struct lra_insn_reg *ir;
>> +
>> +      for (ir = lra_get_insn_regs (uid); ir != NULL; ir = ir->next)
>> +	if (ir->regno >= FIRST_PSEUDO_REGISTER)
>> +	  bitmap_set_bit (&ignore_pseudos_bitmap, ir->regno);
>> +    }
> The name "ignore_pseudos_bitmap" doesn't seem to describe how the set is
> actually used.  We still allow the pseudos to be spilled, but the number
> of such spills is the first-order cost.  Maybe "insn_conflict_pseudos"
> or something like that?
Ok. Fixed.
>> +      /* Spill pseudos.	 */
>> +      CLEAR_HARD_REG_SET (spilled_hard_regs);
>> +      EXECUTE_IF_SET_IN_BITMAP (&spill_pseudos_bitmap, 0, spill_regno, bi)
>> +	if ((int) spill_regno >= lra_constraint_new_regno_start
>> +	    /* ??? */
>> +	    && ! bitmap_bit_p (&lra_inheritance_pseudos, spill_regno)
>> +	    && ! bitmap_bit_p (&lra_split_pseudos, spill_regno)
>> +	    && ! bitmap_bit_p (&lra_optional_reload_pseudos, spill_regno))
>> +	  goto fail;
> Leftover ??? (or lacks enough info if it's supposed to be kept)
It is a leftover.  As I remember, it was a note to myself to check 
this code when I worked on inheritance and splitting.

I removed it.
>> +	      EXECUTE_IF_SET_IN_BITMAP (&live_hard_reg_pseudos[r->start],
>> +					0, k, bi2)
>> +		sparseset_set_bit (live_range_hard_reg_pseudos, k);
> live_range_hard_reg_pseudos and &live_hard_reg_pseudos[r->start]
> seem like similar quantities.  Was there a reason for using
> sparsesets for one and bitmaps for the other?
>> +	      for (p = r->start + 1; p <= r->finish; p++)
>> +		{
>> +		  lra_live_range_t r2;
>> +		
>> +		  for (r2 = lra_start_point_ranges[p];
>> +		       r2 != NULL;
>> +		       r2 = r2->start_next)
>> +		    if (r2->regno >= lra_constraint_new_regno_start)
>> +		      sparseset_set_bit (live_range_reload_pseudos, r2->regno);
>> +		}
> This is probably just showing my ignorance, but -- taking the above two
> quotes together -- why do we calculate these two live sets in different ways?
> Also, does live_range_reload_pseudos really just contain "reload" pseudos,
> or inheritance pseudos as well?
Thanks for finding this.  live_range_hard_reg_pseudos is not used in 
this function.
As I remember, I used the same code as in find_hard_regno_for, and 
live_(hard_)reg_pseudos contained all pseudos, including spilled 
ones.  But it was too expensive (especially when the register 
pressure was high), so I started to use less accurate but faster 
heuristics.

So I am removing

+	      EXECUTE_IF_SET_IN_BITMAP (&live_hard_reg_pseudos[r->start],
+					0, k, bi2)
+		sparseset_set_bit (live_range_hard_reg_pseudos, k);

I changed p = r->start + 1 to p = r->start.


All these variants of the code do not result in wrong code generation; they only affect the quality of spilling (which in any case is no worse than reload's).

live_range_reload_pseudos contains inheritance pseudos too (inheritance pseudos are also short live range pseudos).  I renamed it.

>> +      /* We are trying to spill a reload pseudo.  That is wrong we
>> +	 should assign all reload pseudos, otherwise we cannot reuse
>> +	 the selected alternatives.  */
>> +      hard_regno = find_hard_regno_for (regno, &cost, -1);
>> +      if (hard_regno >= 0)
>> +	{
> Don't really understand this comment, sorry.
I removed the comment.  It is from an old solution that tried to 
guarantee assignment of the reload pseudo.
> Also, why are we passing -1 to find_hard_regno_for, rather than hard_regno?
> The loop body up till this point has been specifically freeing up registers
> to make hard_regno allocatable.  I realise that, by spilling everything
> that overlaps this range, we might have freed up other registers too,
> and so made others besides hard_regno allocatable.  But wouldn't we want
> to test those other hard registers in "their" iteration of the loop
> instead of this one?  The spills in those iterations ought to be more
> directed (i.e. there should be less incidental spilling).
>
> As things stand, doing an rclass_size * rclass_size scan seems
> unnecessarily expensive, although probably off the radar.
We cannot just pass hard_regno for a multi-word pseudo when 
hard_regno-1 is already free.
You are right about the possibility of speeding up the code, although 
in the profiles I looked at (including the last huge tests), spill_for 
and the find_hard_regno_for calls made from it take little time.  That 
is probably because spilling is rarely needed.  Freeing one long live 
range pseudo permits finding a hard regno without spilling for many 
short live range pseudos (reload and inheritance ones).
Also, the rclass_size * rclass_size loop is not expensive; it is the 
preparation of data for the loop that is expensive.

I believe there is potential to speed up the spill_for function if we 
inline find_hard_regno_for and remove duplicated code, but that would 
significantly complicate an already complicated function.
>> +	  assign_temporarily (regno, hard_regno);
>> +	  n = 0;
>> +	  EXECUTE_IF_SET_IN_SPARSESET (live_range_reload_pseudos, reload_regno)
>> +	    if (live_pseudos_reg_renumber[reload_regno] < 0
>> +		&& (hard_reg_set_intersect_p
>> +		    (reg_class_contents
>> +		     [regno_allocno_class_array[reload_regno]],
>> +		     spilled_hard_regs)))
>> +	      sorted_reload_pseudos[n++] = reload_regno;
>> +	  qsort (sorted_reload_pseudos, n, sizeof (int),
>> +		 reload_pseudo_compare_func);
>> +	  for (j = 0; j < n; j++)
>> +	    {
>> +	      reload_regno = sorted_reload_pseudos[j];
>> +	      if (live_pseudos_reg_renumber[reload_regno] < 0
> Just trying to make sure I understand, but: isn't the final condition in
> this quote redundant?  I thought that was a requirement for the register
> being in sorted_reload_pseudos to begin with.
Yes, it is redundant.  It is a leftover from some experiments with 
the code.  I converted it to an assert.
>> +		  && (reload_hard_regno
>> +		      = find_hard_regno_for (reload_regno,
>> +					     &reload_cost, -1)) >= 0
>> +		  && (lra_hard_reg_set_intersection_p
>> +		      (reload_hard_regno, PSEUDO_REGNO_MODE (reload_regno),
>> +		       spilled_hard_regs)))
>> +		{
>> +		  if (lra_dump_file != NULL)
>> +		    fprintf (lra_dump_file, " assign %d(cost=%d)",
>> +			     reload_regno, reload_cost);
>> +		  assign_temporarily (reload_regno, reload_hard_regno);
>> +		  cost += reload_cost;
> It looks like registers that can be reallocated make hard_regno more
> expensive (specifically by reload_cost), but registers that can't be
> reallocated contribute no cost.  Is that right?  Seemed a little odd,
> so maybe worth a comment.
Reload cost is a negative cost; its base is the pseudo frequency.
I added a comment.  The better the hard register is, the more 
negative the cost is.
> Also, AIUI find_hard_regno_for is trying to allocate the register for
> reload_regno on the basis that reload_regno has the same conflicts as
> the current regno, and so it's only an approximation.  Is that right?
> Might be worth a comment if so (not least to explain why we don't commit
> to this allocation if we end up choosing hard_regno).
Sorry, I did not understand what you were asking.  We try to find the 
best pseudos to spill, i.e. those that help to assign as many reload 
pseudos as possible (on a cost basis).  Pseudos for which 
find_hard_regno_for finds non-spilled hard regs are ignored, as they 
can be assigned without spilling.
>> +	  if (best_insn_pseudos_num > insn_pseudos_num
>> +	      || (best_insn_pseudos_num == insn_pseudos_num
>> +		  && best_cost > cost))
> Should we check the register bank and levelling here too,
> for consistency?
As I remember, I had a quick check of it on a few tests but did not 
find a difference in the generated code.  I think that is because the 
probability of equal costs is smaller than in the assignment code, as 
we usually spill pseudos with different costs.
   I think we should try it on bigger tests and add such code if it 
is worthwhile, or write a comment about this.  I'll put it on my todo 
list and work on it later on the branch.
>> +      /* Restore the live hard reg pseudo info for spilled pseudos.  */
>> +      EXECUTE_IF_SET_IN_BITMAP (&spill_pseudos_bitmap, 0, spill_regno, bi)
>> +	update_lives (spill_regno, false);
> I couldn't tell why this was outside the "hard_regno >= 0" condition.
> Do we really change these registers even if find_hard_regno_for fails?
First we spill some pseudos and then call find_hard_regno_for, so we 
should restore the state from before the spilling regardless of 
whether find_hard_regno_for succeeds.
>> +  /* Spill: */
>> +  EXECUTE_IF_SET_IN_BITMAP (&best_spill_pseudos_bitmap, 0, spill_regno, bi)
> Very minor, but I think it'd be worth asserting that best_hard_regno >= 0
> before this loop.
Unfortunately, in very rare cases, best_hard_regno can be < 0.  That 
is why we have two iterations for the assignment of reload pseudos 
(see the comment for the 2nd iteration of reload pseudo assignments).

I've added a comment saying that the function can return a negative value.
>> +/* Constraint transformation can use equivalences and they can
>> +   contains pseudos assigned to hard registers.	 Such equivalence
>> +   usage might create new conflicts of pseudos with hard registers
>> +   (like ones used for parameter passing or call clobbered ones) or
>> +   other pseudos assigned to the same hard registers.  Another very
>> +   rare risky transformation is restoring whole multi-register pseudo
>> +   when only one subreg lives and unused hard register is used already
>> +   for something else.
> In a way, I found this comment almost too detailed. :-)  Maybe:
>
> /* The constraints pass is allowed to create equivalences between
>     pseudos that make the current allocation "incorrect" (in the sense
>     that pseudos are assigned to hard registers from their own conflict sets).
>     The global variable lra_risky_transformations_p says whether this might
>     have happened.
>
> if that's accurate.  The detail about when this occurs probably
> belongs above lra_risky_transformations_p, although it's mostly
> there already.  (Haven't got to the ira-conflicts.c stuff yet,
> so no comments about that here.)
Ok.  Fixed.  It looks better to me.  But more important is whether it 
looks better to you, because you have a fresh view.
>> +   Process pseudos assigned to hard registers (most frequently used
>> +   first), spill if a conflict is found, and mark the spilled pseudos
>> +   in SPILLED_PSEUDO_BITMAP.  Set up LIVE_HARD_REG_PSEUDOS from
>> +   pseudos, assigned to hard registers.	 */
> Why do we spill the most frequently used registers first?  Probably worth
> a comment.
>
It should be the less frequently used pseudos, as an intuitive 
heuristic.  It hardly matters, though: even on all_cp2k_fortran.f90 
(a 500K line test) I did not find that the order affects the 
generated code.  I fixed the code and the comment.
>> +  for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
>> +    if (reg_renumber[i] >= 0 && lra_reg_info[i].nrefs > 0)
>> +      {
>> +	if (lra_risky_transformations_p)
>> +	  sorted_pseudos[n++] = i;
>> +	else
>> +	  update_lives (i, false);
>> +      }
>> +  if (! lra_risky_transformations_p)
>> +    return;
> Seems like this would be more logically split into two (the
> lra_risky_transformations_p case and the !lra_risky_transformations_p case).
Fixed.
>> +	    /* If it is multi-register pseudos they should start on
>> +	       the same hard register.	*/
>> +	    || hard_regno != reg_renumber[conflict_regno])
> This seems different from the find_hard_regno_for case, which took
> biggest_mode into account.
I think we should use common code here, so I'll fix it along with the 
big endian problem.
>> +	  /* Don't change reload pseudo allocation.  It might have
>> +	     this allocation for a purpose (e.g. bound to another
>> +	     pseudo) and changing it can result in LRA cycling.	 */
>> +	  if (another_regno < lra_constraint_new_regno_start
>> +	      && (another_hard_regno = reg_renumber[another_regno]) >= 0
>> +	      && another_hard_regno != hard_regno)
> Seems like this excludes split pseudos as well as reload pseudos,
> or are they never included in these copies?  Might be worth mentioning
> them either way.
I fixed it.
> The only general comment I have so far is that it's sometimes
> difficult to follow which types of pseudos are being included
> or excluded by a comparison with lra_constraint_new_regno_start.
> Sometimes the comments talk about "reload pseudos", but other
> similar checks imply that the registers could be inheritance
> pseudos or split pseudos as well.  Some thin inline wrappers
> might help here.
Inheritance, split and reload pseudos created since the last 
constraint pass have regnos >= lra_constraint_new_regno_start.
Inheritance and split pseudos created on any pass are in the 
corresponding bitmaps.
Inheritance and split pseudos from the last constraint pass also have 
restore_regno >= 0 until the split or inheritance transformations are done.

I am putting a comment about this at the top of the file.
>> +	      /* Remember that reload pseudos can be spilled on the
>> +		 1st pass.  */
>> +	      bitmap_clear_bit (&all_spilled_pseudos, regno);
>> +	      assign_hard_regno (hard_regno, regno);
> Maybe:
>
>    /* This register might have been spilled by the previous pass.
>       Indicate that it is no longer spilled.  */
Fixed.
>> +		/* We can use inheritance pseudos in original insns
>> +		   (not reload ones).  */
>> +		if (regno < lra_constraint_new_regno_start
>> +		    || bitmap_bit_p (&lra_inheritance_pseudos, regno)
>> +		    || reg_renumber[regno] < 0)
>> +		  continue;
>> +		sorted_pseudos[nfails++] = regno;
>> +		if (lra_dump_file != NULL)
>> +		  fprintf (lra_dump_file,
>> +			   "	  Spill reload r%d(hr=%d, freq=%d)\n",
>> +			   regno, reg_renumber[regno],
>> +			   lra_reg_info[regno].freq);
> Same comment about types of pseudo as above.  (I.e. the code checks for
> inheritance pseudos, but not split pseudos.)
I modified the comment to

/* A reload pseudo did not get a hard register on the
    first iteration because of the conflict with
    another reload pseudos in the same insn.  So we
    consider only reload pseudos assigned to hard
    registers.  We shall exclude inheritance pseudos as
    they can occur in original insns (not reload ones).
    We can omit the check for split pseudos because
    they occur only in move insns containing non-reload
    pseudos. */

I hope this explains the code.
>> +  bitmap_initialize (&do_not_assign_nonreload_pseudos, &reg_obstack);
>> +  EXECUTE_IF_SET_IN_BITMAP (&lra_inheritance_pseudos, 0, u, bi)
>> +    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
>> +	&& reg_renumber[u] < 0 && bitmap_bit_p (&lra_inheritance_pseudos, u))
>> +      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
>> +  EXECUTE_IF_SET_IN_BITMAP (&lra_split_pseudos, 0, u, bi)
>> +    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
>> +	&& reg_renumber[u] >= 0 && bitmap_bit_p (&lra_split_pseudos, u))
>> +      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
> The bitmap_bit_p tests look redundant.  Also, the following code is:
>
Fixed.
>> +  for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
>> +    if (((i < lra_constraint_new_regno_start
>> +	  && ! bitmap_bit_p (&do_not_assign_nonreload_pseudos, i))
>> +	 || (bitmap_bit_p (&lra_inheritance_pseudos, i)
>> +	     && lra_reg_info[i].restore_regno >= 0)
>> +	 || (bitmap_bit_p (&lra_split_pseudos, i)
>> +	     && lra_reg_info[i].restore_regno >= 0)
>> +	 || bitmap_bit_p (&lra_optional_reload_pseudos, i))
>> +	&& reg_renumber[i] < 0 && lra_reg_info[i].nrefs != 0
>> +	&& regno_allocno_class_array[i] != NO_REGS)
>> +      sorted_pseudos[n++] = i;
>> +  bitmap_clear (&do_not_assign_nonreload_pseudos);
> where we test very similar things inline, and then clear
> do_not_assign_nonreload_pseudos.  Do we need d_n_a_n_p at all?
No, the code is right.  We still need d_n_a_n_p as we cannot easily 
calculate from which pseudo a given pseudo was inherited or split.  
This is different from the loop, where the new inheritance and split 
pseudos themselves are checked (they are distinct from their origins, 
which are marked in d_n_a_n_p).
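For the record, with the redundant bitmap_bit_p tests dropped, the 
two marking loops above are now simply (sketch):

  EXECUTE_IF_SET_IN_BITMAP (&lra_inheritance_pseudos, 0, u, bi)
    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
        && reg_renumber[u] < 0)
      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
  EXECUTE_IF_SET_IN_BITMAP (&lra_split_pseudos, 0, u, bi)
    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
        && reg_renumber[u] >= 0)
      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);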
>> +  if (n != 0 && lra_dump_file != NULL)
>> +    fprintf (lra_dump_file, "  Reassing non-reload pseudos\n");
> "Reassigning"
>
Fixed. :)

Richard, thank you for the review.  This is the most useful review I 
have ever had.  There is a lot of code that can be simplified.  I ran 
numerous experiments with this code (and maybe with the code in 
lra-constraints.c) during my work on LRA, so it can contain some 
artifacts of those experiments.
Richard Sandiford Oct. 12, 2012, 2:29 p.m. UTC | #7
Hi Vlad,

Comments for the rest of lra-constraints.c.

Vladimir Makarov <vmakarov@redhat.com> writes:
> +  saved_base_reg = saved_base_reg2 = saved_index_reg = NULL_RTX;
> +  change_p = equiv_address_substitution (&ad, addr_loc, mode, as, code);
> +  if (ad.base_reg_loc != NULL)
> +    {
> +      if (process_addr_reg
> +	  (ad.base_reg_loc, before,
> +	   (ad.base_modify_p && REG_P (*ad.base_reg_loc)
> +	    && find_regno_note (curr_insn, REG_DEAD,
> +				REGNO (*ad.base_reg_loc)) == NULL
> +	    ? after : NULL),
> +	   base_reg_class (mode, as, ad.base_outer_code, ad.index_code)))
> +	change_p = true;
> +      if (ad.base_reg_loc2 != NULL)
> +	*ad.base_reg_loc2 = *ad.base_reg_loc;
> +      saved_base_reg = *ad.base_reg_loc;
> +      lra_eliminate_reg_if_possible (ad.base_reg_loc);
> +      if (ad.base_reg_loc2 != NULL)
> +	{
> +	  saved_base_reg2 = *ad.base_reg_loc2;
> +	  lra_eliminate_reg_if_possible (ad.base_reg_loc2);
> +	}

We unconditionally make *ad.base_reg_loc2 = *ad.base_reg_loc, so it
might be clearer without saved_base_reg2.  More below...

> +      /* The following addressing is checked by constraints and
> +	 usually target specific legitimate address hooks do not
> +	 consider them valid.  */
> +      || GET_CODE (*addr_loc) == POST_DEC || GET_CODE (*addr_loc) == POST_INC
> +      || GET_CODE (*addr_loc) == PRE_DEC || GET_CODE (*addr_loc) == PRE_DEC

typo: two PRE_DECs, although:

> +      || GET_CODE (*addr_loc) == PRE_MODIFY
> +      || GET_CODE (*addr_loc) == POST_MODIFY

the whole lot could just be replaced by ad.base_modify_p, or perhaps
even removed entirely given:

> +      /* In this case we can not do anything because if it is wrong
> +	 that is because of wrong displacement.	 Remember that any
> +	 address was legitimate in non-strict sense before LRA.	 */
> +      || ad.disp_loc == NULL)

It doesn't seem worth validating the address at all for ad.disp_loc == NULL.
E.g. something like:

  if (ad.base_reg_loc != NULL
      && (process_addr_reg
	  (ad.base_reg_loc, before,
	   (ad.base_modify_p && REG_P (*ad.base_reg_loc)
	    && find_regno_note (curr_insn, REG_DEAD,
				REGNO (*ad.base_reg_loc)) == NULL
	    ? after : NULL),
	   base_reg_class (mode, as, ad.base_outer_code, ad.index_code))))
    {
      change_p = true;
      if (ad.base_reg_loc2 != NULL)
        *ad.base_reg_loc2 = *ad.base_reg_loc;
    }

  if (ad.index_reg_loc != NULL
      && process_addr_reg (ad.index_reg_loc, before, NULL, INDEX_REG_CLASS))
    change_p = true;

  /* The address was valid before LRA.  We only change its form if the
     address has a displacement, so if it has no displacement it must
     still be valid.  */
  if (ad.disp_loc == NULL)
    return change_p;

  /* See whether the address is still valid.  Some ports do not check
     displacements for eliminable registers, so we replace them
     temporarily with the elimination target.  */
  saved_base_reg = saved_index_reg = NULL_RTX;
  ...
  if (ok_p)
    return change_p;

> +#ifdef HAVE_lo_sum
> +	  {
> +	    rtx insn;
> +	    rtx last = get_last_insn ();
> +
> +	    /* disp => lo_sum (new_base, disp)	*/
> +	    insn = emit_insn (gen_rtx_SET
> +			      (VOIDmode, new_reg,
> +			       gen_rtx_HIGH (Pmode, copy_rtx (*ad.disp_loc))));
> +	    code = recog_memoized (insn);
> +	    if (code >= 0)
> +	      {
> +		rtx save = *ad.disp_loc;
> +
> +		*ad.disp_loc = gen_rtx_LO_SUM (Pmode, new_reg, *ad.disp_loc);
> +		if (! valid_address_p (mode, *ad.disp_loc, as))
> +		  {
> +		    *ad.disp_loc = save;
> +		    code = -1;
> +		  }
> +	      }
> +	    if (code < 0)
> +	      delete_insns_since (last);
> +	  }
> +#endif

Nice :-)

Purely for the record, I wondered whether the high part should be
generated with emit_move_insn(_1) instead, with the rhs of the move
being the HIGH rtx.  That would allow targets to cope with cases where
the high part isn't represented directly as a HIGH.  E.g. on MIPS and
Alpha, small-data accesses use the global register as the high part instead.

However, both MIPS and Alpha accept small-data addresses as legitimate
constants and addresses before and during reload and only introduce the
split form after reload.  And I think that's how any other cases that
aren't simple HIGHs should be handled too.  E.g. MIPS also represents
GOT page loads as HIGHs until after reload, and only then lowers the
HIGH to a GOT load.  Allowing the backend to generate anything other
than a plain HIGH set here would be a double-edged sword.

So after all that I agree that the gen_rtx_SET above is better than
calling the move expanders.

> +	  /* index * scale + disp => new base + index * scale  */
> +	  enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
> +
> +	  lra_assert (INDEX_REG_CLASS != NO_REGS);
> +	  new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
> +	  lra_assert (GET_CODE (*addr_loc) == PLUS);
> +	  lra_emit_move (new_reg, *ad.disp_loc);
> +	  if (CONSTANT_P (XEXP (*addr_loc, 1)))
> +	    XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
> +	  XEXP (*addr_loc, 0) = new_reg;

The canonical form is (plus (mult ...) (reg)) rather than
(plus (reg) (mult ...)), but it looks like we create the latter.
I realise you try both forms here:

> +	  /* Some targets like ARM, accept address operands in
> +	     specific order -- try exchange them if necessary.	*/
> +	  if (! valid_address_p (mode, *addr_loc, as))
> +	    {
> +	      exchange_plus_ops (*addr_loc);
> +	      if (! valid_address_p (mode, *addr_loc, as))
> +		exchange_plus_ops (*addr_loc);
> +	    }

but I think we should try the canonical form first.  And I'd prefer it
if we didn't try the other form at all, especially in 4.8.  It isn't
really the backend's job to reject non-canonical rtl.  This might well
be another case where some targets need a (hopefully small) tweak in
order to play by the rules.

Also, I suppose this section of code feeds back to my question on
Wednesday about the distinction that LRA seems to make between the
compile-time constant in:

  (plus (reg X1) (const_int Y1))

and the link-time constant in:

  (plus (reg X2) (symbol_ref Y2))

It looked like extract_address_regs classified X1 as a base register and
X2 as an index register.  The difference between the two constants has
no run-time significance though, and I think we should handle both X1
and X2 as base registers (as I think reload does).

I think the path above would then be specific to scaled indices.
In the original address the "complex" index must come first and the
displacement second.  In the modified address, the index would stay
first and the new base register would be second.  More below.

> +      /* We don't use transformation 'base + disp => base + new index'
> +	 because of bad practice used in some machine descriptions
> +	 (see comments for emit_spill_move).  */
> +      /* base + disp => new base  */

As before when commenting on emit_spill_move, I think we should leave
the "bad machine description" stuff out of 4.8 and treat fixing the
machine descriptions as part of the LRA port.

In this case I think there's another reason not to reload the
displacement into an index though: IIRC postreload should be able
to optimise a sequence of address reloads that have the same base
and different displacements.  LRA itself might try using "anchor"
bases in future -- although obviously not in the initial merge --
since that was one thing that LEGITIMIZE_RELOAD_ADDRESS was used for.

E.g. maybe the justification could be:

      /* base + disp => new base  */
      /* Another option would be to reload the displacement into an
	 index register.  However, postreload has code to optimize
	 address reloads that have the same base and different
	 displacements, so reloading into an index register would
	 not necessarily be a win.  */

> +      /* base + scale * index + disp => new base + scale * index  */
> +      new_reg = base_plus_disp_to_reg (mode, as, &ad);
> +      *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
> +      if (! valid_address_p (mode, *addr_loc, as))
> +	{
> +	  /* Some targets like ARM, accept address operands in
> +	     specific order -- try exchange them if necessary.	*/
> +	  exchange_plus_ops (*addr_loc);
> +	  if (! valid_address_p (mode, *addr_loc, as))
> +	    exchange_plus_ops (*addr_loc);
> +	}

Same comment as above about canonical rtl.  Here we can have two
registers -- in which case the base should come first -- or a more
complex index -- in which case the index should come first.

We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...),
with the operands in either order, and let it take care of the details.
Using simplify_gen_binary would help with the earlier index+disp case too.
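
E.g. (just a sketch, untested, reusing the names from the quoted code):

  *addr_loc = simplify_gen_binary (PLUS, Pmode, new_reg, *ad.index_loc);

simplify_gen_binary puts the more complex operand first for commutative
codes, so the MULT would end up before the register without any
exchange_plus_ops dance.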

> +  /* If this is post-increment, first copy the location to the reload reg.  */
> +  if (post && real_in != result)
> +    emit_insn (gen_move_insn (result, real_in));

Nit, but real_in != result can never be true AIUI, and I was confused how
the code could be correct in that case.  Maybe just remove it, or make
it an assert?

> +  /* We suppose that there are insns to add/sub with the constant
> +     increment permitted in {PRE/POST)_{DEC/INC/MODIFY}.  At least the
> +     old reload worked with this assumption.  If the assumption
> +     becomes wrong, we should use approach in function
> +     base_plus_disp_to_reg.  */
> +  if (in == value)
> +    {
> +      /* See if we can directly increment INCLOC.  */
> +      last = get_last_insn ();
> +      add_insn = emit_insn (plus_p
> +			    ? gen_add2_insn (incloc, inc)
> +			    : gen_sub2_insn (incloc, inc));
> +
> +      code = recog_memoized (add_insn);
> +      /* We should restore recog_data for the current insn.  */

Looks like this comment might be a left-over, maybe from before the
cached insn data?

> +      /* Restore non-modified value for the result.  We prefer this
> +	 way because it does not require an addition hard
> +	 register.  */
> +      if (plus_p)
> +	{
> +	  if (CONST_INT_P (inc))
> +	    emit_insn (gen_add2_insn (result, GEN_INT (-INTVAL (inc))));
> +	  else
> +	    emit_insn (gen_sub2_insn (result, inc));
> +	}
> +      else if (CONST_INT_P (inc))
> +	emit_insn (gen_add2_insn (result, inc));

The last two lines look redundant.  The behaviour is the same as for
the following else:

> +      else
> +	emit_insn (gen_add2_insn (result, inc));

and I don't think there are any cases where !plus_p && CONST_INT_P (inc)
would hold.

> +/* Main entry point of this file: search the body of the current insn

s/this file/the constraints code/, since it's a static function.

> +  if (change_p)
> +    /* Changes in the insn might result in that we can not satisfy
> +       constraints in lately used alternative of the insn.  */
> +    lra_set_used_insn_alternative (curr_insn, -1);

Maybe:

  /* If we've changed the instruction then any alternative that
     we chose previously may no longer be valid.  */

> +      rtx x;
> +
> +      curr_swapped = !curr_swapped;
> +      if (curr_swapped)
> +	{
> +	  x = *curr_id->operand_loc[commutative];
> +	  *curr_id->operand_loc[commutative]
> +	    = *curr_id->operand_loc[commutative + 1];
> +	  *curr_id->operand_loc[commutative + 1] = x;
> +	  /* Swap the duplicates too.  */
> +	  lra_update_dup (curr_id, commutative);
> +	  lra_update_dup (curr_id, commutative + 1);
> +	  goto try_swapped;
> +	}
> +      else
> +	{
> +	  x = *curr_id->operand_loc[commutative];
> +	  *curr_id->operand_loc[commutative]
> +	    = *curr_id->operand_loc[commutative + 1];
> +	  *curr_id->operand_loc[commutative + 1] = x;
> +	  lra_update_dup (curr_id, commutative);
> +	  lra_update_dup (curr_id, commutative + 1);
> +	}

The swap code is the same in both cases, so I think it'd be better to
make it common.  Or possibly a helper function, since the same code
appears again later on.
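
E.g. something like (untested; the name is just a suggestion):

  /* Swap the two commutative operands of the current insn and update
     any duplicates to match.  */
  static void
  swap_operands (int commutative)
  {
    rtx x = *curr_id->operand_loc[commutative];

    *curr_id->operand_loc[commutative]
      = *curr_id->operand_loc[commutative + 1];
    *curr_id->operand_loc[commutative + 1] = x;
    /* Swap the duplicates too.  */
    lra_update_dup (curr_id, commutative);
    lra_update_dup (curr_id, commutative + 1);
  }

Both arms of the if, and the later copy of this code, would then just
call swap_operands (commutative).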

> +	if (GET_CODE (op) == PLUS)
> +	  {
> +	    plus = op;
> +	    op = XEXP (op, 1);
> +	  }

Sorry, I'm complaining about old reload code again, but: does this
actually happen in LRA?  In reload, a register operand could become a
PLUS because of elimination, but I thought LRA did things differently.
Besides, this is only needed for:

> +	if (CONST_POOL_OK_P (mode, op)
> +	    && ((targetm.preferred_reload_class
> +		 (op, (enum reg_class) goal_alt[i]) == NO_REGS)
> +		|| no_input_reloads_p)
> +	    && mode != VOIDmode)
> +	  {
> +	    rtx tem = force_const_mem (mode, op);
> +	    
> +	    change_p = true;
> +	    /* If we stripped a SUBREG or a PLUS above add it back.  */
> +	    if (plus != NULL_RTX)
> +	      tem = gen_rtx_PLUS (mode, XEXP (plus, 0), tem);

and we shouldn't have (plus (constant ...) ...) after elimination
(or at all outside of a CONST).  I don't understand why the code is
needed even in reload.

> +  for (i = 0; i < n_operands; i++)
> +    {
> +      rtx old, new_reg;
> +      rtx op = *curr_id->operand_loc[i];
> +
> +      if (goal_alt_win[i])
> +	{
> +	  if (goal_alt[i] == NO_REGS
> +	      && REG_P (op)
> +	      && lra_former_scratch_operand_p (curr_insn, i))
> +	    change_class (REGNO (op), NO_REGS, "      Change", true);

I think this could do with a comment.  Does setting the class to NO_REGS
indirectly cause the operand to be switched back to a SCRATCH?

> +	  push_to_sequence (before);
> +	  rclass = base_reg_class (GET_MODE (op), MEM_ADDR_SPACE (op),
> +				   MEM, SCRATCH);
> +	  if (code == PRE_DEC || code == POST_DEC
> +	      || code == PRE_INC || code == POST_INC
> +	      || code == PRE_MODIFY || code == POST_MODIFY)

Very minor, but: GET_RTX_CLASS (code) == RTX_AUTOINC

> +	  enum machine_mode mode;
> +	  rtx reg, *loc;
> +	  int hard_regno, byte;
> +	  enum op_type type = curr_static_id->operand[i].type;
> +
> +	  loc = curr_id->operand_loc[i];
> +	  mode = get_op_mode (i);
> +	  if (GET_CODE (*loc) == SUBREG)
> +	    {
> +	      reg = SUBREG_REG (*loc);
> +	      byte = SUBREG_BYTE (*loc);
> +	      if (REG_P (reg)
> +		  /* Strict_low_part requires reload the register not
> +		     the sub-register.	*/
> +		  && (curr_static_id->operand[i].strict_low
> +		      || (GET_MODE_SIZE (mode)
> +			  <= GET_MODE_SIZE (GET_MODE (reg))
> +			  && (hard_regno
> +			      = get_try_hard_regno (REGNO (reg))) >= 0
> +			  && (simplify_subreg_regno
> +			      (hard_regno,
> +			       GET_MODE (reg), byte, mode) < 0)
> +			  && (goal_alt[i] == NO_REGS
> +			      || (simplify_subreg_regno
> +				  (ira_class_hard_regs[goal_alt[i]][0],
> +				   GET_MODE (reg), byte, mode) >= 0)))))
> +		{
> +		  loc = &SUBREG_REG (*loc);
> +		  mode = GET_MODE (*loc);
> +		}
> +	  old = *loc;

I think this needs a bit more justifying commentary (although I'm glad
to see it's much simpler than the reload version :-)).  One thing in
particular I didn't understand was why we don't reload the inner
register of a paradoxical subreg.

> +	  if (get_reload_reg (type, mode, old, goal_alt[i], "", &new_reg)
> +	      && type != OP_OUT)
> +	    {
> +	      push_to_sequence (before);
> +	      lra_emit_move (new_reg, old);
> +	      before = get_insns ();
> +	      end_sequence ();
> +	    }
> +	  *loc = new_reg;
> +	  if (type != OP_IN)
> +	    {
> +	      if (find_reg_note (curr_insn, REG_UNUSED, old) == NULL_RTX)
> +		{
> +		  start_sequence ();
> +		  /* We don't want sharing subregs as the pseudo can
> +		     get a memory and the memory can be processed
> +		     several times for eliminations.  */
> +		  lra_emit_move (GET_CODE (old) == SUBREG && type == OP_INOUT
> +				 ? copy_rtx (old) : old,
> +				 new_reg);

I think this should simply be:

  lra_emit_move (type == OP_INOUT ? copy_rtx (old) : old, new_reg);

leaving copy_rtx to figure out which rtxes can be shared.  No comment
would be needed for that.

> +		  emit_insn (after);
> +		  after = get_insns ();
> +		  end_sequence ();
> +		}
> +	      *loc = new_reg;
> +	    }

Very minor again, but: redundant *loc assignment (so that the two nested
if statements collapse to one).

> +      else
> +	{
> +	  lra_assert (INSN_CODE (curr_insn) < 0);
> +	  error_for_asm (curr_insn,
> +			 "inconsistent operand constraints in an %<asm%>");
> +	  /* Avoid further trouble with this insn.  */
> +	  PATTERN (curr_insn) = gen_rtx_USE (VOIDmode, const0_rtx);
> +	  return false;

Is this code handling a different case from the corresponding error
code in curr_insn_transform?  If so, it probably deserves a comment
explaining the difference.

> +/* Process all regs in debug location *LOC and change them on
> +   equivalent substitution.  Return true if any change was done.  */
> +static bool
> +debug_loc_equivalence_change_p (rtx *loc)

This doesn't keep the rtl in canonical form.  Probably the easiest and
best fix is to use simplify_replace_fn_rtx, which handles all that for you.
(simplify_replace_fn_rtx returns the original rtx if no change was made.)
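
E.g. a sketch (untested; the callback name is made up):

  /* Called through simplify_replace_fn_rtx: return the equivalent value
     of register LOC, or NULL_RTX if there is no change.  */
  static rtx
  loc_equivalence_callback (rtx loc, const_rtx old_rtx ATTRIBUTE_UNUSED,
                            void *data ATTRIBUTE_UNUSED)
  {
    rtx subst;

    if (! REG_P (loc))
      return NULL_RTX;
    subst = get_equiv_substitution (loc);
    return subst != loc ? subst : NULL_RTX;
  }

with the caller doing something like:

  subst = simplify_replace_fn_rtx (*loc, NULL_RTX,
                                   loc_equivalence_callback, NULL);
  if (subst == *loc)
    return false;
  *loc = subst;
  return true;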

> +  for (i = FIRST_PSEUDO_REGISTER; i < new_regno_start; i++)
> +    ira_reg_equiv[i].profitable_p = true;
> +  for (i = FIRST_PSEUDO_REGISTER; i < new_regno_start; i++)
> +    if (lra_reg_info[i].nrefs != 0)
> +      {
> +	if ((hard_regno = lra_get_regno_hard_regno (i)) >= 0)
> +	  {
> +	    int j, nregs = hard_regno_nregs[hard_regno][PSEUDO_REGNO_MODE (i)];
> +	    
> +	    for (j = 0; j < nregs; j++)
> +	      df_set_regs_ever_live (hard_regno + j, true);
> +	  }
> +	else if ((x = get_equiv_substitution (regno_reg_rtx[i])) != NULL_RTX)
> +	  {
> +	    if (! first_p && contains_reg_p (x, false, false))
> +	      /* After RTL transformation, we can not guarantee that
> +		 pseudo in the substitution was not reloaded which
> +		 might make equivalence invalid.  For example, in
> +		 reverse equiv of p0
> +
> +		 p0 <- ...
> +		 ...
> +		 equiv_mem <- p0
> +
> +		 the memory address register was reloaded before the
> +		 2nd insn.  */
> +	      ira_reg_equiv[i].defined_p = false;
> +	    if (contains_reg_p (x, false, true))
> +	      ira_reg_equiv[i].profitable_p = false;
> +	  }
> +      }

Do we need two loops because the second may check for equivalences
of other pseudos besides "i"?  I couldn't see how offhand, but I might
well have missed something.  Might be worth a comment.

> +	      dest_reg = SET_DEST (set);
> +	      /* The equivalence pseudo could be set up as SUBREG in a
> +		 case when it is a call restore insn in a mode
> +		 different from the pseudo mode.  */
> +	      if (GET_CODE (dest_reg) == SUBREG)
> +		dest_reg = SUBREG_REG (dest_reg);
> +	      if ((REG_P (dest_reg)
> +		   && (x = get_equiv_substitution (dest_reg)) != dest_reg
> +		   /* Remove insns which set up a pseudo whose value
> +		      can not be changed.  Such insns might be not in
> +		      init_insns because we don't update equiv data
> +		      during insn transformations.
> +			  
> +		      As an example, let suppose that a pseudo got
> +		      hard register and on the 1st pass was not
> +		      changed to equivalent constant.  We generate an
> +		      additional insn setting up the pseudo because of
> +		      secondary memory movement.  Then the pseudo is
> +		      spilled and we use the equiv constant.  In this
> +		      case we should remove the additional insn and
> +		      this insn is not init_insns list.	 */
> +		   && (! MEM_P (x) || MEM_READONLY_P (x)
> +		       || in_list_p (curr_insn,
> +				     ira_reg_equiv
> +				     [REGNO (dest_reg)].init_insns)))

This is probably a stupid question, sorry, but when do we ever want
to keep an assignment to a substituted pseudo?  I.e. why isn't this just:

	      if ((REG_P (dest_reg)
		   && (x = get_equiv_substitution (dest_reg)) != dest_reg)

> +/* Info about last usage of registers in EBB to do inheritance/split
> +   transformation.  Inheritance transformation is done from a spilled
> +   pseudo and split transformations from a hard register or a pseudo
> +   assigned to a hard register.	 */
> +struct usage_insns
> +{
> +  /* If the value is equal to CURR_USAGE_INSNS_CHECK, then the member
> +     value INSNS is valid.  The insns is chain of optional debug insns
> +     and a finishing non-debug insn using the corresponding reg.  */
> +  int check;
> +  /* Value of global reloads_num at the ???corresponding next insns.  */
> +  int reloads_num;
> +  /* Value of global reloads_num at the ???corresponding next insns.  */
> +  int calls_num;

"???s".  Probably "at the last instruction in INSNS" if that's accurate
(because debug insns in INSNS don't affect these fields).

> +/* Process all regs OLD_REGNO in location *LOC and change them on the
> +   reload pseudo NEW_REG.  Return true if any change was done.	*/
> +static bool
> +substitute_pseudo (rtx *loc, int old_regno, rtx new_reg)

This is another case where I found the term "reload pseudo" a bit confusing,
since AIUI new_reg can be an inheritance or split pseudo rather than a pseudo
created solely for insn reloads.  I'll follow up about that on the original
thread.  Maybe just:

/* Replace all references to register OLD_REGNO in *LOC with pseudo register
   NEW_REG.  Return true if any change was made.  */

> +  code = GET_CODE (x);
> +  if (code == REG && (int) REGNO (x) == old_regno)
> +    {
> +      *loc = new_reg;
> +      return true;
> +    }

Maybe assert that the modes are the same?
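
E.g. something like:

  lra_assert (GET_MODE (x) == GET_MODE (new_reg));

(assuming the substitution never needs a mode change, which seems to be
the intention here).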

> +/* Do inheritance transformation for insn INSN defining (if DEF_P) or
> +   using ORIGINAL_REGNO where the subsequent insn(s) in EBB (remember
> +   we traverse insns in the backward direction) for the original regno
> +   is NEXT_USAGE_INSNS.	 The transformations look like

Maybe:

/* Do inheritance transformations for insn INSN, which defines (if DEF_P)
   or uses ORIGINAL_REGNO.  NEXT_USAGE_INSNS specifies which instruction
   in the EBB next uses ORIGINAL_REGNO; it has the same form as the
   "insns" field of usage_insns.

   The transformations look like:

> +
> +     p <- ...		  i <- ...
> +     ...		  p <- i    (new insn)
> +     ...	     =>
> +     <- ... p ...	  <- ... i ...
> +   or
> +     ...		  i <- p    (new insn)
> +     <- ... p ...	  <- ... i ...
> +     ...	     =>
> +     <- ... p ...	  <- ... i ...
> +   where p is a spilled original pseudo and i is a new inheritance pseudo.
> +   
> +   The inheritance pseudo has the smallest class of two classes CL and
> +   class of ORIGINAL REGNO.  It will have unique value if UNIQ_P.  The
> +   unique value is necessary for correct assignment to inheritance
> +   pseudo for input of an insn which should be the same as output
> +   (bound pseudos).  Return true if we succeed in such
> +   transformation.  */

This comment looks really good, but I still wasn't sure about the
UNIQ_P thing.  AIUI this is for cases like:

                       i <- p            [new insn]
   r <- ... p ...      r <- ... i ...    [input reload]
   r <- ... r ...   => r <- ... r ...    [original insn]
   <- r                <- r              [output reload]
   ....                ......
   <- ... p ...        <- ... i ...      [next ref]

where "r" is used on both sides of the original insn and where the
output reload assigns to something other than "p" (otherwise "next ref"
wouldn't be the next ref).  But why does this affect the way "i" is created?
I think it'd be worth expanding that part a bit.

> +  if (! ira_reg_classes_intersect_p[cl][rclass])
> +    {
> +      if (lra_dump_file != NULL)
> +	{
> +	  fprintf (lra_dump_file,
> +		   "	Rejecting inheritance for %d "
> +		   "because of too different classes %s and %s\n",

Suggest s/too different/disjoint/

> +  if ((ira_class_subset_p[cl][rclass] && cl != rclass)
> +      || ira_class_hard_regs_num[cl] < ira_class_hard_regs_num[rclass])
> +    {
> +      if (lra_dump_file != NULL)
> +	fprintf (lra_dump_file, "    Use smallest class of %s and %s\n",
> +		 reg_class_names[cl], reg_class_names[rclass]);
> +      
> +      rclass = cl;
> +    }

I don't understand the second line of the if statement.  Why do we prefer
classes with fewer allocatable registers?

My guess before reading the code was that we'd use the subunion of CL and
RCLASS, so maybe a comment explaining why we use this choice would help.
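
I.e. something like (if I'm remembering the table name correctly):

  rclass = ira_reg_class_subunion[cl][rclass];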

> +  if (NEXT_INSN (new_insns) != NULL_RTX)
> +    {
> +      if (lra_dump_file != NULL)
> +	{
> +	  fprintf (lra_dump_file,
> +		   "	Rejecting inheritance %d->%d "
> +		   "as it results in 2 or more insns:\n",
> +		   original_regno, REGNO (new_reg));
> +	  debug_rtl_slim (lra_dump_file, new_insns, NULL_RTX, -1, 0);
> +	  fprintf (lra_dump_file,
> +		   "	>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
> +	}
> +      return false;
> +    }

Hmm, I wasn't sure about this at first.  Some targets define patterns for
multiword moves and split them later.  Others expose the split straight away.
The two approaches don't really imply any difference in cost, so I didn't
want us to penalise the latter.

But I suppose on targets that split straight away, lower-subreg would
tend to replace the multiword pseudo with individual word-sized pseudos,
so LRA shouldn't see them.  I suppose this check shouldn't matter in practice.

> +  if (def_p)
> +    lra_process_new_insns (insn, NULL_RTX, new_insns,
> +			   "Add original<-inheritance");
> +  else
> +    lra_process_new_insns (insn, new_insns, NULL_RTX,
> +			   "Add inheritance<-pseudo");

Maybe "original" rather than "pseudo" here too for consistency.

> +/* Return true if we need a split for hard register REGNO or pseudo
> +   REGNO which was assigned to a hard register.
> +   POTENTIAL_RELOAD_HARD_REGS contains hard registers which might be
> +   used for reloads since the EBB end.	It is an approximation of the
> +   used hard registers in the split range.  The exact value would
> +   require expensive calculations.  If we were aggressive with
> +   splitting because of the approximation, the split pseudo will save
> +   the same hard register assignment and will be removed in the undo
> +   pass.  We still need the approximation because too aggressive
> +   splitting would result in too inaccurate cost calculation in the
> +   assignment pass because of too many generated moves which will be
> +   probably removed in the undo pass.  */
> +static inline bool
> +need_for_split_p (HARD_REG_SET potential_reload_hard_regs, int regno)
> +{
> +  int hard_regno = regno < FIRST_PSEUDO_REGISTER ? regno : reg_renumber[regno];
> +
> +  lra_assert (hard_regno >= 0);
> +  return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
> +	   && ! TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno)
> +	   && (usage_insns[regno].reloads_num
> +	       + (regno < FIRST_PSEUDO_REGISTER ? 0 : 2) < reloads_num)
> +	   && ((regno < FIRST_PSEUDO_REGISTER
> +		&& ! bitmap_bit_p (&ebb_global_regs, regno))
> +	       || (regno >= FIRST_PSEUDO_REGISTER
> +		   && lra_reg_info[regno].nrefs > 3
> +		   && bitmap_bit_p (&ebb_global_regs, regno))))
> +	  || (regno >= FIRST_PSEUDO_REGISTER && need_for_call_save_p (regno)));
> +}

Could you add more commentary about the thinking behind this particular
choice of heuristic?  E.g. I wasn't sure what the reloads_num check did,
or why we only split hard registers that are local to the EBB and only
split pseudos that aren't.

The 2 and 3 numbers seemed a bit magic too.  I suppose the 2 has
something to do with "one save and one restore", but I wasn't sure
why we applied it only for pseudos.  (AIUI that arm of the check
deals with "genuine" split pseudos rather than call saves & restores.)

Still, it says a lot for the high quality of LRA that, out of all the
1000s of lines of code I've read so far, this is the only part that
didn't seem to have an intuitive justification.

> +  for (i = 0;
> +       (cl = reg_class_subclasses[allocno_class][i]) != LIM_REG_CLASSES;
> +       i++)
> +    if (! SECONDARY_MEMORY_NEEDED (cl, hard_reg_class, mode)
> +	&& ! SECONDARY_MEMORY_NEEDED (hard_reg_class, cl, mode)
> +	&& TEST_HARD_REG_BIT (reg_class_contents[cl], hard_regno)
> +	&& (best_cl == NO_REGS
> +	    || (hard_reg_set_subset_p (reg_class_contents[best_cl],
> +				       reg_class_contents[cl])
> +		&& ! hard_reg_set_equal_p (reg_class_contents[best_cl],
> +					   reg_class_contents[cl]))))
> +      best_cl = cl;

OK, so this suggestion isn't backed up by any evidence, but what do
you think about this alternative:

	&& (best_cl == NO_REGS
	    || (ira_class_hard_regs_num[best_cl]
		< ira_class_hard_regs_num[cl]))

which should choose the largest class that requires no secondary memory.
It looks like the subset version could get "stuck" on a single-register
class that happens to be early in the list but has no superclass smaller
than allocno_class.

> +/* Do split transformation for insn INSN defining or
> +   using ORIGINAL_REGNO where the subsequent insn(s) in EBB (remember
> +   we traverse insns in the backward direction) for the original regno
> +   is NEXT_USAGE_INSNS.	 The transformations look like

Same suggestion as for the inheritance function above.

> +  if (call_save_p)
> +    save = emit_spill_move (true, new_reg, original_reg, -1);
> +  else
> +    {
> +      start_sequence ();
> +      emit_move_insn (new_reg, original_reg);
> +      save = get_insns ();
> +      end_sequence ();
> +    }
> +  if (NEXT_INSN (save) != NULL_RTX)
> +    {
> +      lra_assert (! call_save_p);

Is emit_spill_move really guaranteed to return only one instruction in
cases where emit_move_insn might not?  Both of them use emit_move_insn_1
internally, so I wouldn't have expected much difference.

In fact I wasn't really sure why:

  save = gen_move_insn (new_reg, original_reg);

wouldn't be correct for both.

Same comments for the restore code.

> +      /* See which defined values die here.  */
> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> +	if (reg->type == OP_OUT && ! reg->early_clobber
> +	    && (! reg->subreg_p
> +		|| bitmap_bit_p (&lra_bound_pseudos, reg->regno)))
> +	  bitmap_clear_bit (&live_regs, reg->regno);
> +      /* Mark each used value as live.	*/
> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> +	if (reg->type == OP_IN
> +	    && bitmap_bit_p (&check_only_regs, reg->regno))
> +	  bitmap_set_bit (&live_regs, reg->regno);
> +      /* Mark early clobber outputs dead.  */
> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> +	if (reg->type == OP_OUT && reg->early_clobber && ! reg->subreg_p)
> +	  bitmap_clear_bit (&live_regs, reg->regno);

I don't think this would be correct for unreloaded insns because an
unreloaded insn can have the same pseudo as an input and an earlyclobber
output.  (Probably not an issue here, since we're called after the
constraints pass.)  There's also the case of matched earlyclobber operands,
where the matched input is specifically not affected by the earlyclobber.

I'd have thought:

      /* See which defined values die here.  */
      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
	if (reg->type == OP_OUT
	    && (! reg->subreg_p
		|| bitmap_bit_p (&lra_bound_pseudos, reg->regno)))
	  bitmap_clear_bit (&live_regs, reg->regno);
      /* Mark each used value as live.	*/
      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
	if (reg->type == OP_IN
	    && bitmap_bit_p (&check_only_regs, reg->regno))
	  bitmap_set_bit (&live_regs, reg->regno);

ought to be correct, but perhaps I'm missing something.

(I'm still uneasy about the special treatment of bound pseudos here.
A clobber really does seem better.)

> +      /* It is quite important to remove dead move insns because it
> +	 means removing dead store, we don't need to process them for
> +	 constraints, and unfortunately some subsequent optimizations
> +	 (like shrink-wrapping) currently based on assumption that
> +	 there are no trivial dead insns.  */

Maybe best to drop the "subsequent optimizations" part.  This comment
is unlikely to be updated after any change to shrink-wrapping & co.,
and the first two justifications seem convincing enough on their own.

> +/* Add inheritance info REGNO and INSNS.  */
> +static void
> +add_to_inherit (int regno, rtx insns)
> +{
> +  int i;
> +
> +  for (i = 0; i < to_inherit_num; i++)
> +    if (to_inherit[i].regno == regno)
> +      return;

Is the existing "insns" field guaranteed to match the "insns" parameter
in this case, or might they be different?  Probably worth an assert or
comment respectively.
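
E.g. if they're guaranteed to match, something like:

  for (i = 0; i < to_inherit_num; i++)
    if (to_inherit[i].regno == regno)
      {
        lra_assert (to_inherit[i].insns == insns);
        return;
      }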

> +/* Return first (if FIRST_P) or last non-debug insn in basic block BB.
> +   Return null if there are no non-debug insns in the block.  */
> +static rtx
> +get_non_debug_insn (bool first_p, basic_block bb)
> +{
> +  rtx insn;
> +
> +  for (insn = first_p ? BB_HEAD (bb) : BB_END (bb);
> +       insn != NULL_RTX && ! NONDEBUG_INSN_P (insn);
> +       insn = first_p ? NEXT_INSN (insn) : PREV_INSN (insn))
> +    ;
> +  if (insn != NULL_RTX && BLOCK_FOR_INSN (insn) != bb)
> +    insn = NULL_RTX;
> +  return insn;
> +}

It probably doesn't matter in practice, but it looks like it'd be better
to limit the walk to the bb, rather than walking until null and then
testing the bb after the walk.

Maybe it would be easier to split into two functions, since first_p is
always constant.  E.g.:

  rtx insn;

  FOR_BB_INSNS (bb, insn)
    if (NONDEBUG_INSN_P (insn))
      return insn;
  return NULL_RTX;

for first_p.  s/FOR_BB_INSNS/FOR_BB_INSNS_REVERSE/ for !first_p.

> +/* Set up RES by registers living on edges FROM except the edge (FROM,
> +   TO) or by registers set up in a jump insn in BB FROM.  */
> +static void
> +get_live_on_other_edges (basic_block from, basic_block to, bitmap res)
> +{
> +  int regno;
> +  rtx last;
> +  struct lra_insn_reg *reg;
> +  edge e;
> +  edge_iterator ei;
> +
> +  lra_assert (to != NULL);
> +  bitmap_clear (res);
> +  FOR_EACH_EDGE (e, ei, from->succs)
> +    if (e->dest != to)
> +      bitmap_ior_into (res, DF_LR_IN (e->dest));
> +  if ((last = get_non_debug_insn (false, from)) == NULL_RTX || ! JUMP_P (last))
> +    return;
> +  curr_id = lra_get_insn_recog_data (last);
> +  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> +    if (reg->type != OP_IN
> +	&& (regno = reg->regno) >= FIRST_PSEUDO_REGISTER)
> +      bitmap_set_bit (res, regno);
> +}

Probably a silly question, sorry, but: why does the JUMP_P part only
include pseudo registers?  The other calculations (here and elsewhere)
seem to handle both hard and pseudo registers.

> +/* Do inheritance/split transformations in EBB starting with HEAD and
> +   finishing on TAIL.  We process EBB insns in the reverse order.
> +   Return true if we did any inheritance/split transformation in the
> +   EBB.
> +
> +   We should avoid excessive splitting which results in worse code
> +   because of inaccurate cost calculations for spilling new split
> +   pseudos in such case.  To achieve this we do splitting only if
> +   register pressure is high in given basic block and there reload

"...and there are reload"

> +   pseudos requiring hard registers.  We could do more register
> +   pressure calculations at any given program point to avoid necessary
> +   splitting even more but it is to expensive and the current approach
> +   is well enough.  */

"works well enough".

> +  change_p = false;
> +  curr_usage_insns_check++;
> +  reloads_num = calls_num = 0;
> +  /* Remember: we can remove the current insn.	*/
> +  bitmap_clear (&check_only_regs);
> +  last_processed_bb = NULL;

I couldn't tell which part of the code the comment is referring to.
Maybe left over?

> +	  after_p = (last_insn != NULL_RTX && ! JUMP_P (last_insn)
> +		     && (! CALL_P (last_insn)
> +			 || (find_reg_note (last_insn,
> +					   REG_NORETURN, NULL) == NULL_RTX
> +			     && ((next_insn
> +				  = next_nonnote_nondebug_insn (last_insn))
> +				 == NULL_RTX
> +				 || GET_CODE (next_insn) != BARRIER))));

Genuine question, but: when are the last four lines needed?  The condition
that they're testing for sounds like a noreturn call.

> +      if (src_regno < lra_constraint_new_regno_start
> +	  && src_regno >= FIRST_PSEUDO_REGISTER
> +	  && reg_renumber[src_regno] < 0
> +	  && dst_regno >= lra_constraint_new_regno_start
> +	  && (cl = lra_get_allocno_class (dst_regno)) != NO_REGS)
> +	{
> +	  /* 'reload_pseudo <- original_pseudo'.  */
> +	  reloads_num++;
> +	  succ_p = false;
> +	  if (usage_insns[src_regno].check == curr_usage_insns_check
> +	      && (next_usage_insns = usage_insns[src_regno].insns) != NULL_RTX)
> +	    succ_p = inherit_reload_reg (false,
> +					 bitmap_bit_p (&lra_matched_pseudos,
> +						       dst_regno),
> +					 src_regno, cl,
> +					 curr_insn, next_usage_insns);
> +	  if (succ_p)
> +	    change_p = true;
> +	  else
> +	    {
> +	      usage_insns[src_regno].check = curr_usage_insns_check;
> +	      usage_insns[src_regno].insns = curr_insn;
> +	      usage_insns[src_regno].reloads_num = reloads_num;
> +	      usage_insns[src_regno].calls_num = calls_num;
> +	      usage_insns[src_regno].after_p = false;
> +	    }

Looks like this and other places could use the add_next_usage_insn
helper function.

> +	  if (cl != NO_REGS
> +	      && hard_reg_set_subset_p (reg_class_contents[cl],
> +					live_hard_regs))
> +	    IOR_HARD_REG_SET (potential_reload_hard_regs,
> +			      reg_class_contents[cl]);

Redundant "cl != NO_REGS" check.  (Was a bit confused by that at first.)

I don't understand the way potential_reload_hard_regs is set up.
Why does it only include reload pseudos involved in moves of the form
"reload_pseudo <- original_pseudo" and "original_pseudo <- reload_pseudo",
but include those reloads regardless of whether inheritance is possible?

I wondered whether it might be deliberately selective in order to speed
up LRA, but we walk all the registers in an insn regardless.

Same for reloads_num.

> +	  if (cl != NO_REGS
> +	      && hard_reg_set_subset_p (reg_class_contents[cl],
> +					live_hard_regs))
> +	    IOR_HARD_REG_SET (potential_reload_hard_regs,
> +			      reg_class_contents[cl]);

Same comment as for the previous block.

> +		if (reg_renumber[dst_regno] < 0
> +		    || (reg->type == OP_OUT && ! reg->subreg_p))
> +		/* Invalidate.	*/
> +		usage_insns[dst_regno].check = 0;

Could you explain this condition a bit more?  Why does reg_renumber
affect things?

> +/* This value affects EBB forming.  If probability of edge from EBB to
> +   a BB is not greater than the following value, we don't add the BB
> +   to EBB.  */ 
> +#define EBB_PROBABILITY_CUTOFF (REG_BR_PROB_BASE / 2)

It looks like schedule_ebbs uses a higher default cutoff for FDO.
Would the same distinction be useful here?

Maybe schedule_ebbs-like params would be good here too.
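
For reference, IIRC the scheduler does something like this (quoting from
memory, so the details may be off):

  if (profile_info && flag_branch_probabilities)
    probability_cutoff
      = PARAM_VALUE (PARAM_TRACER_MIN_BRANCH_PROBABILITY_FEEDBACK);
  else
    probability_cutoff = PARAM_VALUE (PARAM_TRACER_MIN_BRANCH_PROBABILITY);
  probability_cutoff = REG_BR_PROB_BASE / 100 * probability_cutoff;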

> +  bitmap_and (&temp_bitmap_head, removed_pseudos, live);
> +  EXECUTE_IF_SET_IN_BITMAP (&temp_bitmap_head, 0, regno, bi)

This isn't going to have much effect on compile time, but
EXECUTE_IF_AND_IN_BITMAP avoids the need for a temporary bitmap.
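
I.e. something like (untested):

  EXECUTE_IF_AND_IN_BITMAP (removed_pseudos, live, 0, regno, bi)
    {
      /* ...the same body as now, minus the bitmap_and.  */
    }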

> +/* Remove inheritance/split pseudos which are in REMOVE_PSEUDOS and
> +   return true if we did any change.  The undo transformations for
> +   inheritance looks like
> +      i <- i2
> +      p <- i	  =>   p <- i2
> +   or removing
> +      p <- i, i <- p, and i <- i3
> +   where p is original pseudo from which inheritance pseudo i was
> +   created, i and i3 are removed inheritance pseudos, i2 is another
> +   not removed inheritance pseudo.  All split pseudos or other
> +   occurrences of removed inheritance pseudos are changed on the
> +   corresponding original pseudos.  */
> +static bool
> +remove_inheritance_pseudos (bitmap remove_pseudos)
> +{
> +  basic_block bb;
> +  int regno, sregno, prev_sregno, dregno, restore_regno;
> +  rtx set, prev_set, prev_insn;
> +  bool change_p, done_p;
> +
> +  change_p = ! bitmap_empty_p (remove_pseudos);

I wondered from the comment why we couldn't just return straight away
for the empty set, but it looks like the function also schedules a
constraints pass for instructions that keep their inheritance or
split pseudos.  Is that right?  Might be worth mentioning that
in the function comment if so.

> +	      else if (bitmap_bit_p (remove_pseudos, sregno)
> +		       && bitmap_bit_p (&lra_inheritance_pseudos, sregno))
> +		{
> +		  /* Search the following pattern:
> +		       inherit_or_split_pseudo1 <- inherit_or_split_pseudo2
> +		       original_pseudo <- inherit_or_split_pseudo1
> +		    where the 2nd insn is the current insn and
> +		    inherit_or_split_pseudo2 is not removed.  If it is found,
> +		    change the current insn onto:
> +		       original_pseudo1 <- inherit_or_split_pseudo2.  */

s/original_pseudo1/original_pseudo/ I think (we don't change the destination).

> +		  for (prev_insn = PREV_INSN (curr_insn);
> +		       prev_insn != NULL_RTX && ! NONDEBUG_INSN_P (prev_insn);
> +		       prev_insn = PREV_INSN (prev_insn))
> +		    ;
> +		  if (prev_insn != NULL_RTX && BLOCK_FOR_INSN (prev_insn) == bb
> +		      && (prev_set = single_set (prev_insn)) != NULL_RTX
> +		      /* There should be no subregs in insn we are
> +			 searching because only the original reg might
> +			 be in subreg when we changed the mode of
> +			 load/store for splitting.  */
> +		      && REG_P (SET_DEST (prev_set))
> +		      && REG_P (SET_SRC (prev_set))
> +		      && (int) REGNO (SET_DEST (prev_set)) == sregno
> +		      && ((prev_sregno = REGNO (SET_SRC (prev_set)))
> +			  >= FIRST_PSEUDO_REGISTER)
> +		      && (lra_reg_info[sregno].restore_regno
> +			  == lra_reg_info[prev_sregno].restore_regno)
> +		      && ! bitmap_bit_p (remove_pseudos, prev_sregno))

I'm sure the restore_regno comparison near the end is correct,
but could you add a comment to explain it?  The substitution
itself seems OK either way.

> +	      struct lra_insn_reg *reg;
> +	      bool insn_change_p = false;
> +
> +	      curr_id = lra_get_insn_recog_data (curr_insn);
> +	      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> +		if ((regno = reg->regno) >= lra_constraint_new_regno_start
> +		    && lra_reg_info[regno].restore_regno >= 0)

Is the first part of the comparison needed?  Most other places don't check,
so it looked at first glance like there was something special here.

> +		  {
> +		    if (change_p && bitmap_bit_p (remove_pseudos, regno))
> +		      {
> +			restore_regno = lra_reg_info[regno].restore_regno;
> +			substitute_pseudo (&curr_insn, regno,
> +					   regno_reg_rtx[restore_regno]);
> +			insn_change_p = true;
> +		      }
> +		    else if (NONDEBUG_INSN_P (curr_insn))
> +		      {
> +			lra_push_insn_and_update_insn_regno_info (curr_insn);
> +			lra_set_used_insn_alternative_by_uid
> +			  (INSN_UID (curr_insn), -1);
> +		      }
> +		  }
> +	      if (insn_change_p)
> +		{
> +		  lra_update_insn_regno_info (curr_insn);
> +		  if (lra_dump_file != NULL)
> +		    {
> +		      fprintf (lra_dump_file, "	   Restore original insn:\n");
> +		      debug_rtl_slim (lra_dump_file,
> +				      curr_insn, curr_insn, -1, 0);
> +		    }
> +		}

AIUI we could have a partial restore, keeping some registers but
restoring others.  Is that right?  The dump entry made it sound
like a full restore.

Maybe something like:

	      struct lra_insn_reg *reg;
	      bool restored_regs_p = false;
	      bool kept_regs_p = false;

	      curr_id = lra_get_insn_recog_data (curr_insn);
	      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
		{
		  regno = reg->regno;
		  restore_regno = lra_reg_info[regno].restore_regno;
		  if (restore_regno >= 0)
		    {
		      if (change_p && bitmap_bit_p (remove_pseudos, regno))
			{
			  substitute_pseudo (&curr_insn, regno,
					     regno_reg_rtx[restore_regno]);
			  restored_regs_p = true;
			}
		      else
			kept_regs_p = true;
		    }
		}
	      if (NONDEBUG_INSN_P (curr_insn) && kept_regs_p)
		{
		  /* The instruction has changed since the previous
		     constraints pass.  */
		  lra_push_insn_and_update_insn_regno_info (curr_insn);
		  lra_set_used_insn_alternative_by_uid
		    (INSN_UID (curr_insn), -1);
		}
	      else if (restored_regs_p)
		/* The instruction has been restored to the form that
		   it had during the previous constraints pass.  */
		lra_update_insn_regno_info (curr_insn);

	      if (restored_regs_p && lra_dump_file != NULL)
		{
		  fprintf (lra_dump_file,
			   "	   Insn after restoring regs:\n");
		  debug_rtl_slim (lra_dump_file, curr_insn, curr_insn, -1, 0);
		}

(if correct) might make the partial case clearer, but that's personal
preference, so please feel free to ignore, chop or change.

Also, is regno_reg_rtx[restore_regno] always correct when restoring
registers?  I thought restore_regno could be a hard register and that
the hard register might not necessarily be used in the same mode as
the regno_reg_rtx[] entry.

That just leaves lra.h, lra-int.h and lra.c itself.  I'm hoping to have
read through those by the middle of next week, but any comments about them
will probably just be banal stuff (even more than my comments so far :-))
so I deliberately left them to last.

Richard
Richard Sandiford Oct. 12, 2012, 4:13 p.m. UTC | #8
Vladimir Makarov <vmakarov@redhat.com> writes:
>>> +/* Info about pseudo used during the assignment pass.  Thread is a set
>>> +   of connected reload and inheritance pseudos with the same set of
>>> +   available hard reg set.  Thread is a pseudo itself for other
>>> +   cases.  */
>>> +struct regno_assign_info
>> Maybe:
>>
>> /* Information about the thread to which a pseudo belongs.  Threads are
>>     a set of connected reload and inheritance pseudos with the same set of
>>     available hard registers.  Lone registers belong to their own threads.  */
> Fixed.
>> Although the condition seems to be:
>>> +	&& (ira_class_hard_regs_num[regno_allocno_class_array[regno1]]
>>> +	    == ira_class_hard_regs_num[regno_allocno_class_array[regno2]]))
>> i.e. the same _number_ of available hard regs, but not necessarily the
>> same set.
> It should be the same in most cases.  This condition is just faster 
> approximation of the same available hard reg set.

The distinction does seem important though.   It's possible that
a target has two distinct register files of the same allocatable size.
Would something like:

      (ira_class_subset_p[class1][class2]
       && ira_class_subset_p[class2][class1])

work instead?

>> /* Update the preference for using HARD_REGNO for pseudos that are
>>     connected directly or indirectly with REGNO.  Apply divisor DIV
>>     to any preference adjustments.
>>
>>     The more indirectly a pseudo is connected, the smaller its effect
>>     should be.  We therefore increase DIV on each "hop".  */
>>
> Fixed.  By the way, it is your invention from IRA.

Heh, I'd forgotten all about that.

>>> +      /* We are trying to spill a reload pseudo.  That is wrong we
>>> +	 should assign all reload pseudos, otherwise we cannot reuse
>>> +	 the selected alternatives.  */
>>> +      hard_regno = find_hard_regno_for (regno, &cost, -1);
>>> +      if (hard_regno >= 0)
>>> +	{
>> Don't really understand this comment, sorry.
> I removed the comment.  It is from an old solution whose code tried to
> guarantee assignment to the reload pseudo.
>> Also, why are we passing -1 to find_hard_regno_for, rather than hard_regno?
>> The loop body up till this point has been specifically freeing up registers
>> to make hard_regno allocatable.  I realise that, by spilling everything
>> that overlaps this range, we might have freed up other registers too,
>> and so made others besides hard_regno allocatable.  But wouldn't we want
>> to test those other hard registers in "their" iteration of the loop
>> instead of this one?  The spills in those iterations ought to be more
>> directed (i.e. there should be less incidental spilling).
>>
>> As things stand, doing an rclass_size * rclass_size scan seems
>> unnecessarily expensive, although probably off the radar.
> We cannot just pass hard_regno for a multi-word pseudo when hard_regno-1
> is already free.

But this call is in a loop that iterates over all registers in the class:

  for (i = 0; i < rclass_size; i++)
    {
      hard_regno = ira_class_hard_regs[rclass][i];

and we reach the find_hard_regno_for call unless there is some
conflicting register that we cannot spill.  So if "hard_regno - 1"
belongs to the allocation class and is a viable choice, "its" iteration
of the loop would spill specifically for "hard_regno - 1" and get the
most accurate cost for that register.  I couldn't see why any other
iteration of the loop would want to consider "hard_regno - 1".

> You are right about the possibility of speeding up the code, although in
> the profiles I looked at (including the last huge tests) spill_for and
> the find_hard_regno_for calls made from it take little time.  That is
> probably because you don't need to spill frequently.  Freeing one
> long-live-range pseudo permits finding a hard regno without spilling for
> many short-lived pseudos (reload and inheritance ones).
> Also the rclass_size * rclass_size loop is not expensive; the
> preparation of data for the loop is expensive.

OK, in that case maybe the efficiency concern wasn't justified.
FWIW, I still think passing hard_regno would be clearer though,
in terms of meeting expectations.  It just seems odd to spill for
one specific register and then test all of them.  Especially when the
spilling we actually do after choosing register X is based on "X's"
iteration of this loop.

(I realise I could well be missing the point here though, sorry.)

>>> +		  && (reload_hard_regno
>>> +		      = find_hard_regno_for (reload_regno,
>>> +					     &reload_cost, -1)) >= 0
>>> +		  && (lra_hard_reg_set_intersection_p
>>> +		      (reload_hard_regno, PSEUDO_REGNO_MODE (reload_regno),
>>> +		       spilled_hard_regs)))
>>> +		{
>>> +		  if (lra_dump_file != NULL)
>>> +		    fprintf (lra_dump_file, " assign %d(cost=%d)",
>>> +			     reload_regno, reload_cost);
>>> +		  assign_temporarily (reload_regno, reload_hard_regno);
>>> +		  cost += reload_cost;
>> It looks like registers that can be reallocated make hard_regno more
>> expensive (specifically by reload_cost), but registers that can't be
>> reallocated contribute no cost.  Is that right?  Seemed a little odd,
>> so maybe worth a comment.
> Reload cost is a negative cost; its magnitude is based on the pseudo
> frequency.

Ah!  Missed that, sorry.

> I added the comment.  The better the hard register is, the more
> negative the cost is.

Thanks.

>> Also, AIUI find_hard_regno_for is trying to allocate the register for
>> reload_regno on the basis that reload_regno has the same conflicts as
>> the current regno, and so it's only an approximation.  Is that right?
>> Might be worth a comment if so (not least to explain why we don't commit
>> to this allocation if we end up choosing hard_regno).
> Sorry, I did not understand what you were asking.  We try to find the
> best pseudos to spill, to help assign as many reload pseudos as possible
> (on a cost basis).  Pseudos for which find_hard_regno_for finds
> non-spilled hard regs are ignored, as they can be assigned without
> spilling.

Yeah, sorry, ignore this.  I think I'd taken a break here and then
forgotten that find_hard_regno_for works out the conflicts set for
the register itself.  I think I'd assumed when writing it that
find_hard_regno_for used some state that had been precalculated
before the main "for (i = 0; i < rclass_size; i++)" loop.

>>> +      /* Restore the live hard reg pseudo info for spilled pseudos.  */
>>> +      EXECUTE_IF_SET_IN_BITMAP (&spill_pseudos_bitmap, 0, spill_regno, bi)
>>> +	update_lives (spill_regno, false);
>> I couldn't tell why this was outside the "hard_regno >= 0" condition.
>> Do we really change these registers even if find_hard_regno_for fails?
> First we spill some pseudos and then call find_hard_regno_for, so we
> should restore the state from before the spill regardless of whether
> find_hard_regno_for succeeds.

Yeah, ignore this too :-)

>>> +  /* Spill: */
>>> +  EXECUTE_IF_SET_IN_BITMAP (&best_spill_pseudos_bitmap, 0, spill_regno, bi)
>> Very minor, but I think it'd be worth asserting that best_hard_regno >= 0
>> before this loop.
> Unfortunately, in very rare cases, best_hard_regno can be < 0.  That is
> why we have two iterations for the assignment of reload pseudos (see the
> comment for the 2nd iteration of reload pseudo assignments).
>
> I've added a comment for the function saying that it can return a
> negative value.

OK, thanks.

>> The only general comment I have so far is that it's sometimes
>> difficult to follow which types of pseudos are being included
>> or excluded by a comparison with lra_constraint_new_regno_start.
>> Sometimes the comments talk about "reload pseudos", but other
>> similar checks imply that the registers could be inheritance
>> pseudos or split pseudos as well.  Some thin inline wrappers
>> might help here.
> Inheritance, split and reload pseudos created since the last constraint
> pass have regnos >= lra_constraint_new_regno_start.
> Inheritance and split pseudos created on any pass are in the
> corresponding bitmaps.
> Inheritance and split pseudos created since the last constraint pass
> also have restore_regno >= 0 until the split or inheritance
> transformations are done.

OK.  What prompted this was that some comments refer specifically to
"reload pseudo" whereas the accompanying code simply checks against
lra_constraint_new_regno_start.  It then wasn't obvious whether the code
really did just include "reload pseudos" in what I thought was the
strict sense -- e.g. because no other type of LRA-created pseudo could
occur in that context, so there was no point checking anything else --
or whether the code was actually handling inheritance and split pseudos too.

Maybe it would help to have a term to refer all four of:

  - reload pseudos
  - optional reload pseudos
  - inheritance pseudos
  - split pseudos

although I won't suggest one because I'm useless at naming things.

>>> +		/* We can use inheritance pseudos in original insns
>>> +		   (not reload ones).  */
>>> +		if (regno < lra_constraint_new_regno_start
>>> +		    || bitmap_bit_p (&lra_inheritance_pseudos, regno)
>>> +		    || reg_renumber[regno] < 0)
>>> +		  continue;
>>> +		sorted_pseudos[nfails++] = regno;
>>> +		if (lra_dump_file != NULL)
>>> +		  fprintf (lra_dump_file,
>>> +			   "	  Spill reload r%d(hr=%d, freq=%d)\n",
>>> +			   regno, reg_renumber[regno],
>>> +			   lra_reg_info[regno].freq);
>> Same comment about types of pseudo as above.  (I.e. the code checks for
>> inheritance pseudos, but not split pseudos.)
> I modified the comment to
>
> /* A reload pseudo did not get a hard register on the
>     first iteration because of the conflict with
>     other reload pseudos in the same insn.  So we
>     consider only reload pseudos assigned to hard
>     registers.  We shall exclude inheritance pseudos as
>     they can occur in original insns (not reload ones).
>     We can omit the check for split pseudos because
>     they occur only in move insns containing non-reload
>     pseudos. */
>
> I hope it explains the code.

Yes, thanks.

>>> +  for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
>>> +    if (((i < lra_constraint_new_regno_start
>>> +	  && ! bitmap_bit_p (&do_not_assign_nonreload_pseudos, i))
>>> +	 || (bitmap_bit_p (&lra_inheritance_pseudos, i)
>>> +	     && lra_reg_info[i].restore_regno >= 0)
>>> +	 || (bitmap_bit_p (&lra_split_pseudos, i)
>>> +	     && lra_reg_info[i].restore_regno >= 0)
>>> +	 || bitmap_bit_p (&lra_optional_reload_pseudos, i))
>>> +	&& reg_renumber[i] < 0 && lra_reg_info[i].nrefs != 0
>>> +	&& regno_allocno_class_array[i] != NO_REGS)
>>> +      sorted_pseudos[n++] = i;
>>> +  bitmap_clear (&do_not_assign_nonreload_pseudos);
>> where we test very similar things inline, and then clear
>> do_not_assign_nonreload_pseudos.  Do we need d_n_a_n_p at all?
> No, the code is right.  We still need d_n_a_n_p as we cannot easily
> calculate from what pseudo the given pseudo was inherited or split.  It
> is different from the loop where new inheritance and split pseudos are
> checked (they are different from their origins marked in d_n_a_n_p).

Yeah, sorry about that.  I misread the indices.

Richard
Richard Sandiford Oct. 13, 2012, 7:42 a.m. UTC | #9
I'm having to correct my own comments again, sorry.

Richard Sandiford <rdsandiford@googlemail.com> writes:
>> +  /* If this is post-increment, first copy the location to the reload reg.  */
>> +  if (post && real_in != result)
>> +    emit_insn (gen_move_insn (result, real_in));
>
> Nit, but real_in != result can never be true AIUI, and I was confused how
> the code could be correct in that case.  Maybe just remove it, or make
> it an assert?

Probably obvious, but I meant "real_in == result" can never be true.
"real_in != result" could be removed or turned into an assert.

>> +	if (GET_CODE (op) == PLUS)
>> +	  {
>> +	    plus = op;
>> +	    op = XEXP (op, 1);
>> +	  }
>
> Sorry, I'm complaining about old reload code again, but: does this
> actually happen in LRA?  In reload, a register operand could become a
> PLUS because of elimination, but I thought LRA did things differently.
> Besides, this is only needed for:
>
>> +	if (CONST_POOL_OK_P (mode, op)
>> +	    && ((targetm.preferred_reload_class
>> +		 (op, (enum reg_class) goal_alt[i]) == NO_REGS)
>> +		|| no_input_reloads_p)
>> +	    && mode != VOIDmode)
>> +	  {
>> +	    rtx tem = force_const_mem (mode, op);
>> +	    
>> +	    change_p = true;
>> +	    /* If we stripped a SUBREG or a PLUS above add it back.  */
>> +	    if (plus != NULL_RTX)
>> +	      tem = gen_rtx_PLUS (mode, XEXP (plus, 0), tem);
>
> and we shouldn't have (plus (constant ...) ...) after elimination
> (or at all outside of a CONST).  I don't understand why the code is
> needed even in reload.

Scratch the thing about needing it for reload.  It's obviously the
second second operand we're reloading, not the first, and it could
well be that an elimination displacement needs to be reloaded via
the constant pool.

The question for LRA still stands though.

Richard
Vladimir Makarov Oct. 14, 2012, 5:36 p.m. UTC | #10
On 12-10-10 11:31 AM, Richard Sandiford wrote:
> Hi Vlad,
>
> Here's a review of the first part of ira-constraints.c.  It's a big file,
> and this is a long reply, so I'll send further comments another day in
> another message.
> Vladimir Makarov <vmakarov@redhat.com> writes:
>> +/* This file contains code for 3 passes: constraint pass,
>> +   inheritance/split pass, and pass for undoing failed inheritance and
>> +   split.
>> +
>> +   The major goal of constraint pass is to transform RTL to satisfy
>> +   insn and address constraints by:
>> +     o choosing insn alternatives;
>> +     o generating *reload insns* (or reloads in brief) and *reload
>> +       pseudos* which will got necessary hard registers later;
> s/got/get/
>
Fixed.
>> +     o substituting pseudo equivalences (if it is done once, is done
>> +       everywhere) and removes insns initializing used equivalent
>> +       substitution.
> Suggest:
>
>       o substituting pseudos with equivalent values and removing the
>         instructions that initialized those pseudos.
Fixed.
>> +   To speed the pass up we process only necessary insns (first time
>> +   all insns) and reuse of already chosen alternatives in some
>> +   cases.
> Suggest:
>
>     On the first iteration of the pass we process every instruction and
>     choose an alternative for each one.  On subsequent iterations we try
>     to avoid reprocessing instructions if we can be sure that the old
>     choice is still valid.
Fixed.
>> +   The inheritance/spilt pass is to transform code to achieve
>> +   ineheritance and live range splitting.  It is done on backward
>> +   traverse of EBBs.
> Typo: inheritance.  "backward traversal".
Fixed.
>> +   The inheritance optimization goal is to reuse values in hard
>> +   registers. There is analogous optimization in old reload pass.  The
>> +   inheritance is achieved by following transformation:
>> +
>> +       reload_p1 <- p	     reload_p1 <- p
>> +       ...		     new_p <- reload_p1
>> +       ...		=>   ...
>> +       reload_p2 <- p	     reload_p2 <- new_p
>> +
>> +   where p is spilled and not changed between the insns.  Reload_p1 is
>> +   also called *original pseudo* and new_p is called *inheritance
>> +   pseudo*.
>> +
>> +   The subsequent assignment pass will try to assign the same (or
>> +   another if it is not possible) hard register to new_p as to
>> +   reload_p1 or reload_p2.
>> +
>> +   If it fails to assign a hard register, the opposite transformation
>> +   will restore the original code on (the pass called undoing
>> +   inheritance) because with spilled new_p the code would be much
>> +   worse. [...]
> Maybe:
>
>     If the assignment pass fails to assign a hard register to new_p,
>     this file will undo the inheritance and restore the original code.
>     This is because implementing the above sequence with a spilled
>     new_p would make the code much worse.
>
Fixed.
>> +   Splitting (transformation) is also done in EBB scope on the same
>> +   pass as the inheritance:
>> +
>> +       r <- ... or ... <- r		 r <- ... or ... <- r
>> +       ...				 s <- r (new insn -- save)
>> +       ...			  =>
>> +       ...				 r <- s (new insn -- restore)
>> +       ... <- r				 ... <- r
>> +
>> +    The *split pseudo* s is assigned to the hard register of the
>> +    original pseudo or hard register r.
>> +
>> +    Splitting is done:
>> +      o In EBBs with high register pressure for global pseudos (living
>> +	in at least 2 BBs) and assigned to hard registers when there
>> +	are more one reloads needing the hard registers;
>> +      o for pseudos needing save/restore code around calls.
>> +
>> +    If the split pseudo still has the same hard register as the
>> +    original pseudo after the subsequent assignment pass, the opposite
>> +    transformation is done on the same pass for undoing inheritance.  */
> AIUI spill_for can spill split pseudos.  I think the comment should say
> what happens then.  If I understand the code correctly, we keep the
> split if "r" is a hard register or was assigned a hard register.
> We undo it if "r" was not assigned a hard register.  Is that right?
>
Yes.  To be more precise, r and s are assigned the same hard reg
before the assignment pass.  So if r is *spilled* (by the assignment
pass) or r and s have the same hard reg after the assignment pass, we
undo the transformation.

I added a comment.
>> +/* Array whose element is (MEM:MODE BASE_REG) corresponding to the
>> +   mode (index) and where BASE_REG is a base hard register for given
>> +   memory mode.	 */
>> +static rtx indirect_mem[MAX_MACHINE_MODE];
> Maybe:
>
> /* Index M is an rtx of the form (mem:M BASE_REG), where BASE_REG
>     is a sample hard register that is a valid address for mode M.
>     The memory refers to the generic address space.  */
Fixed.
>> +/* Return class of hard regno of REGNO or if it is was not assigned to
>> +   a hard register, return its allocno class but only for reload
>> +   pseudos created on the current constraint pass.  Otherwise, return
>> +   NO_REGS.  */
>> +static enum reg_class
>> +get_reg_class (int regno)
> Maybe:
>
> /* If REGNO is a hard register or has been allocated a hard register,
>     return the class of that register.  If REGNO is a pseudo created
>     by the current constraints pass, assume that it will be allocated
>     a hard register and return the class that that register will have.
>     (This assumption is optimistic when REGNO is an inheritance or
>     split pseudo.)  Return NO_REGS otherwise.  */
I don't like

assume that it will be allocated
    a hard register and return the class that that register will have

For example, I could treat this as assigning ax to a pseudo having class
general_regs and returning class a_reg instead of general_regs.

So I modified your comment a bit.
> if that's accurate.  I dropped the term "reload pseudo" because of
> the general comment in my earlier reply about the use of "reload pseudo"
> when the code seems to include inheritance and split pseudos too.
There are no inheritance or split pseudos yet; they are created after
the constraint pass.  So at this stage a regno >= new_regno_start
means a reload pseudo.
>> +/* Return true if REGNO in REG_MODE satisfies reg class constraint CL.
>> +   For new reload pseudos we should make more accurate class
>> +   *NEW_CLASS (we set up it if it is not NULL) to satisfy the
>> +   constraints.  Otherwise, set up NEW_CLASS to NO_REGS.  */
>> +static bool
>> +in_class_p (int regno, enum machine_mode reg_mode,
>> +	    enum reg_class cl, enum reg_class *new_class)
> Same comment here, since it uses get_reg_class.  I.e. for registers >=
> new_regno_start, we're really testing whether the first allocatable
> register in REGNO's allocno class satisfies CL.
See my comment above.
> Also, the only caller that doesn't directly pass REGNO and REG_MODE
> from an rtx is process_addr_reg.  in_class_p uses:
>
>> +  if (new_class != NULL)
>> +    *new_class = NO_REGS;
>> +  if (regno < FIRST_PSEUDO_REGISTER)
>> +    return TEST_HARD_REG_BIT (reg_class_contents[cl], regno);
>> +  rclass = get_reg_class (regno);
> whereas process_addr_reg uses:
>
>> +  final_regno = regno = REGNO (reg);
>> +  if (regno < FIRST_PSEUDO_REGISTER)
>> +    {
>> +      rtx final_reg = reg;
>> +      rtx *final_loc = &final_reg;
>> +
>> +      lra_eliminate_reg_if_possible (final_loc);
>> +      final_regno = REGNO (*final_loc);
>> +    }
> I.e. process_addr_reg applies eliminations before testing whereas
> in_class_p doesn't.  I couldn't really tell why the two were different.
> Since the idea is that we use elimination source registers to represent
> their targets, shouldn't in_class_p eliminate too?
As I remember, that was a fix for a bug on some target which needed
the eliminated reg for legitimate address recognition.  I did not see
such a problem for the constraints.


> With that difference removed, in_class_p could take the rtx instead
> of a (REGNO, MODE) pair.  It could then pass that rtx directly to
> lra_eliminate_reg_if_possible.  I think this would lead to a cleaner
> interface and make things more regular.
Thanks.  I fixed it.
> Then the comment for in_class_p could be:
>
> /* Return true if X satisfies (or will satisfy) reg class constraint CL.
>     If X is a pseudo created by this constraints pass, assume that it will
>     be allocated a hard register from its allocno class, but allow that
>     class to be narrowed to CL if it is currently a superset of CL.
>
>     If NEW_CLASS is nonnull, set *NEW_CLASS to the new allocno class
>     of REGNO (X), or NO_REGS if no change in its class was needed.  */
Fixed.  I just added 'X is a reload pseudo...'
> That's a change in the meaning of NEW_CLASS, but seems easier for
> callers to handle.  I think all it requires is changing:
>
>> +      common_class = ira_reg_class_subset[rclass][cl];
>> +      if (new_class != NULL)
>> +	*new_class = common_class;
> to:
>
>        common_class = ira_reg_class_subset[rclass][cl];
>        if (new_class != NULL && rclass != common_class)
> 	*new_class = common_class;
This change results in infinite LRA looping when compiling the first
libgcc file.  Unfortunately I have no time to investigate it now.
I'd like to say that most of this code is very sensitive to changes;
I see it a lot.  You change something that looks obvious and a target
is broken.
I am going to investigate it when I have more time.
>> +  if (regno < new_regno_start
>> +      /* Do not make more accurate class from reloads generated.  They
>> +	 are mostly moves with a lot of constraints.  Making more
>> +	 accurate class may results in very narrow class and
>> +	 impossibility of find registers for several reloads of one
>> +	 insn.	*/
> Maybe:
>
>        /* Do not allow the constraints for reload instructions to
> 	 influence the classes of new pseudos.  These reloads are
> 	 typically moves that have many alternatives, and restricting
> 	 reload pseudos for one alternative may lead to situations
> 	 where other reload pseudos are no longer allocatable.  */
Fixed.
>> +      || INSN_UID (curr_insn) >= new_insn_uid_start)
>> +    return ((regno >= new_regno_start && rclass == ALL_REGS)
>> +	    || (rclass != NO_REGS && ira_class_subset_p[rclass][cl]
>> +		&& ! hard_reg_set_subset_p (reg_class_contents[cl],
>> +					    lra_no_alloc_regs)));
> Why the ALL_REGS special case?  I think it deserves a comment.
>
I added a comment.
>> +/* Return the defined and profitable equiv substitution of reg X, return
>> +   X otherwise.	 */
> Maybe:
>
> /* If we have decided to substitute X with another value, return that value,
>     otherwise return X.  */
>
Fixed.
>> +/* Change class of pseudo REGNO to NEW_CLASS.  Print info about it
>> +   using TITLE.	 Output a new line if NL_P.  */
>> +static void
>> +change_class (int regno, enum reg_class new_class,
>> +	      const char *title, bool nl_p)
>> +{
>> +  if (lra_dump_file != NULL)
>> +    fprintf (lra_dump_file, "%s to class %s for r%d",
>> +	     title, reg_class_names[new_class], regno);
>> +  setup_reg_classes (regno, new_class, NO_REGS, new_class);
>> +  if (lra_dump_file != NULL && nl_p)
>> +    fprintf (lra_dump_file, "\n");
>> +}
> I think either this or setup_reg_classes should have an assert
> that REGNO is >= FIRST_PSEUDO_REGISTER.   This matters more now
> because a lot of LRA deals with hard and pseudo registers
> side-by-side.
Ok. I added an assert.
>> +/* Create a new pseudo using MODE, RCLASS, ORIGINAL, TITLE or reuse
>> +   already created input reload pseudo (only if TYPE is not OP_OUT).
>> +   The result pseudo is returned through RESULT_REG.  Return TRUE if
>> +   we created a new pseudo, FALSE if we reused the already created
>> +   input reload pseudo.	 */
> Maybe:
>
> /* Store in *RESULT_REG a register for reloading ORIGINAL, which has
>     mode MODE.  TYPE specifies the direction of the reload -- either OP_IN
>     or OP_OUT -- and RCLASS specifies the class of hard register required.
It can be OP_INOUT too.
>     Try to reuse existing input reloads where possible.  Return true if
>     *RESULT_REG is a new register, false if it is an existing one.
>     Use TITLE to describe new registers for debug purposes.  */
>
> although I admit that's a bit convoluted...
I combined the two comments into a better one.
>> +  for (i = 0; i < curr_insn_input_reloads_num; i++)
>> +    if (rtx_equal_p (curr_insn_input_reloads[i].input, original))
>> +      break;
>> +  if (i >= curr_insn_input_reloads_num
>> +      || ! in_class_p (REGNO (curr_insn_input_reloads[i].reg),
>> +		       GET_MODE (curr_insn_input_reloads[i].reg),
>> +		       rclass, &new_class))
>> +    {
>> +      res_p = true;
>> +      *result_reg = lra_create_new_reg (mode, original, rclass, title);
>> +    }
>> +  else
>> +    {
>> +      lra_assert (! side_effects_p (original));
>> +      res_p = false;
>> +      *result_reg = curr_insn_input_reloads[i].reg;
>> +      regno = REGNO (*result_reg);
>> +      if (lra_dump_file != NULL)
>> +	 {
>> +	   fprintf (lra_dump_file, "	 Reuse r%d for reload ", regno);
>> +	   print_value_slim (lra_dump_file, original, 1);
>> +	 }
>> +      if (rclass != new_class)
>> +	 change_class (regno, new_class, ", change", false);
>> +      if (lra_dump_file != NULL)
>> +	 fprintf (lra_dump_file, "\n");
>> +    }
>> +  lra_assert (curr_insn_input_reloads_num < LRA_MAX_INSN_RELOADS);
>> +  curr_insn_input_reloads[curr_insn_input_reloads_num].input = original;
>> +  curr_insn_input_reloads[curr_insn_input_reloads_num++].reg = *result_reg;
>> +  return res_p;
> It probably doesn't matter in practice, but I think this would
> be better as:
>
>    for (i = 0; i < curr_insn_input_reloads_num; i++)
>      if (rtx_equal_p (curr_insn_input_reloads[i].input, original)
>          && in_class_p (curr_insn_input_reloads[i].reg, rclass, &new_class))
>        {
>          ...reuse case..
>          return false;
>        }
>    ...new case...
>    return true;
>
> which also copes with the unlikely case that the same input is used
> three times, and that the third use requires the same class as the
> second.
Ok. I fixed it.
>> +/* The page contains code to extract memory address parts.  */
>> +
>> +/* Info about base and index regs of an address.  In some rare cases,
>> +   base/index register can be actually memory.	In this case we will
>> +   reload it.  */
>> +struct address
>> +{
>> +  rtx *base_reg_loc;  /* NULL if there is no a base register.  */
>> +  rtx *base_reg_loc2; /* Second location of {post/pre}_modify, NULL
>> +			 otherwise.  */
>> +  rtx *index_reg_loc; /* NULL if there is no an index register.	 */
>> +  rtx *index_loc; /* location of index reg * scale or index_reg_loc
>> +		      otherwise.  */
>> +  rtx *disp_loc; /* NULL if there is no a displacement.	 */
>> +  /* Defined if base_reg_loc is not NULL.  */
>> +  enum rtx_code base_outer_code, index_code;
>> +  /* True if the base register is modified in the address, for
>> +     example, in PRE_INC.  */
>> +  bool base_modify_p;
>> +};
> Comments should be consistently above the fields rather than to the right.
Fixed.
>> +/* Process address part in space AS (or all address if TOP_P) with
>> +   location *LOC to extract address characteristics.
>> +
>> +   If CONTEXT_P is false, we are looking at the base part of an
>> +   address, otherwise we are looking at the index part.
>> +
>> +   MODE is the mode of the memory reference; OUTER_CODE and INDEX_CODE
>> +   give the context that the rtx appears in; MODIFY_P if *LOC is
>> +   modified.  */
>> +static void
>> +extract_loc_address_regs (bool top_p, enum machine_mode mode, addr_space_t as,
>> +			  rtx *loc, bool context_p, enum rtx_code outer_code,
>> +			  enum rtx_code index_code,
>> +			  bool modify_p, struct address *ad)
>> +{
>> +  rtx x = *loc;
>> +  enum rtx_code code = GET_CODE (x);
>> +  bool base_ok_p;
>> +
>> +  switch (code)
>> +    {
>> +    case CONST_INT:
>> +    case CONST:
>> +    case SYMBOL_REF:
>> +    case LABEL_REF:
>> +      if (! context_p)
>> +	ad->disp_loc = loc;
> This looks a bit odd.  I assume it's trying to avoid treating MULT
> scale factors as displacements, but I thought whether something was
> a displacement or not depended on whether it is involved (possibly
> indirectly) in a sum with the base.  Seems like it'd be better
> to check for that directly.
>
>> +	/* If this machine only allows one register per address, it
>> +	   must be in the first operand.  */
>> +	if (MAX_REGS_PER_ADDRESS == 1 || code == LO_SUM)
>> +	  {
>> +	    extract_loc_address_regs (false, mode, as, arg0_loc, false, code,
>> +				      code1, modify_p, ad);
>> +	    ad->disp_loc = arg1_loc;
>> +	  }
>> +	/* If index and base registers are the same on this machine,
>> +	   just record registers in any non-constant operands.	We
>> +	   assume here, as well as in the tests below, that all
>> +	   addresses are in canonical form.  */
>> +	else if (INDEX_REG_CLASS
>> +		 == base_reg_class (VOIDmode, as, PLUS, SCRATCH)
>> +		 && code0 != PLUS && code0 != MULT)
>> +	  {
>> +	    extract_loc_address_regs (false, mode, as, arg0_loc, false, PLUS,
>> +				      code1, modify_p, ad);
>> +	    if (! CONSTANT_P (arg1))
>> +	      extract_loc_address_regs (false, mode, as, arg1_loc, true, PLUS,
>> +					code0, modify_p, ad);
>> +	    else
>> +	      ad->disp_loc = arg1_loc;
>> +	  }
>> +
>> +	/* If the second operand is a constant integer, it doesn't
>> +	   change what class the first operand must be.	 */
>> +	else if (code1 == CONST_INT || code1 == CONST_DOUBLE)
>> +	  {
>> +	    ad->disp_loc = arg1_loc;
>> +	    extract_loc_address_regs (false, mode, as, arg0_loc, context_p,
>> +				      PLUS, code1, modify_p, ad);
>> +	  }
>> +	/* If the second operand is a symbolic constant, the first
>> +	   operand must be an index register but only if this part is
>> +	   all the address.  */
>> +	else if (code1 == SYMBOL_REF || code1 == CONST || code1 == LABEL_REF)
>> +	  {
>> +	    ad->disp_loc = arg1_loc;
>> +	    extract_loc_address_regs (false, mode, as, arg0_loc,
>> +				      top_p ? true : context_p, PLUS, code1,
>> +				      modify_p, ad);
>> +	  }
> What's the reason for the distinction between the last two, which AIUI
> doesn't exist in reload?  I'm not sure the:
>
>      top_p ? true : context_p
>
> condition is safe: some targets use aligning addresses like
> (and X (const_int -ALIGN)), but that shouldn't really affect whether
> a register in X is treated as a base or an index.
The code works for PPC, which uses aligning addresses.
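(By aligning addresses I mean the form you mention, e.g. the Altivec
case

	(and:SI (reg:SI r) (const_int -16))

which masks off the low bits for 16-byte alignment.)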
>> +	/* If both operands are registers but one is already a hard
>> +	   register of index or reg-base class, give the other the
>> +	   class that the hard register is not.	 */
>> +	else if (code0 == REG && code1 == REG
>> +		 && REGNO (arg0) < FIRST_PSEUDO_REGISTER
>> +		 && ((base_ok_p
>> +		      = ok_for_base_p_nonstrict (arg0, mode, as, PLUS, REG))
>> +		     || ok_for_index_p_nonstrict (arg0)))
>> +	  {
>> +	    extract_loc_address_regs (false, mode, as, arg0_loc, ! base_ok_p,
>> +				      PLUS, REG, modify_p, ad);
>> +	    extract_loc_address_regs (false, mode, as, arg1_loc, base_ok_p,
>> +				      PLUS, REG, modify_p, ad);
>> +	  }
>> +	else if (code0 == REG && code1 == REG
>> +		 && REGNO (arg1) < FIRST_PSEUDO_REGISTER
>> +		 && ((base_ok_p
>> +		      = ok_for_base_p_nonstrict (arg1, mode, as, PLUS, REG))
>> +		     || ok_for_index_p_nonstrict (arg1)))
>> +	  {
>> +	    extract_loc_address_regs (false, mode, as, arg0_loc, base_ok_p,
>> +				      PLUS, REG, modify_p, ad);
>> +	    extract_loc_address_regs (false, mode, as, arg1_loc, ! base_ok_p,
>> +				      PLUS, REG, modify_p, ad);
>> +	  }
>> +	/* If one operand is known to be a pointer, it must be the
>> +	   base with the other operand the index.  Likewise if the
>> +	   other operand is a MULT.  */
>> +	else if ((code0 == REG && REG_POINTER (arg0)) || code1 == MULT)
>> +	  {
>> +	    extract_loc_address_regs (false, mode, as, arg0_loc, false, PLUS,
>> +				      code1, modify_p, ad);
>> +	    if (code1 == MULT)
>> +	      ad->index_loc = arg1_loc;
>> +	    extract_loc_address_regs (false, mode, as, arg1_loc, true, PLUS,
>> +				      code0, modify_p, ad);
>> +	  }
>> +	else if ((code1 == REG && REG_POINTER (arg1)) || code0 == MULT)
>> +	  {
>> +	    extract_loc_address_regs (false, mode, as, arg0_loc, true, PLUS,
>> +				      code1, modify_p, ad);
>> +	    if (code0 == MULT)
>> +	      ad->index_loc = arg0_loc;
>> +	    extract_loc_address_regs (false, mode, as, arg1_loc, false, PLUS,
>> +				      code0, modify_p, ad);
>> +	  }
> Some targets care about the choice between index and base for
> correctness reasons (PA IIRC) or for performance (some ppc targets IIRC),
> so I'm not sure whether it's safe to give REG_POINTER such a low priority.
This code works for PPC and PARISC.
>> +    default:
>> +      {
>> +	const char *fmt = GET_RTX_FORMAT (code);
>> +	int i;
>> +
>> +	if (GET_RTX_LENGTH (code) != 1
>> +	    || fmt[0] != 'e' || GET_CODE (XEXP (x, 0)) != UNSPEC)
>> +	  {
>> +	    for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
>> +	      if (fmt[i] == 'e')
>> +		extract_loc_address_regs (false, mode, as, &XEXP (x, i),
>> +					  context_p, code, SCRATCH,
>> +					  modify_p, ad);
>> +	    break;
>> +	  }
>> +	/* fall through for case UNARY_OP (UNSPEC ...)	*/
>> +      }
>> +
>> +    case UNSPEC:
>> +      if (ad->disp_loc == NULL)
>> +	ad->disp_loc = loc;
>> +      else if (ad->base_reg_loc == NULL)
>> +	{
>> +	  ad->base_reg_loc = loc;
>> +	  ad->base_outer_code = outer_code;
>> +	  ad->index_code = index_code;
>> +	  ad->base_modify_p = modify_p;
>> +	}
>> +      else
>> +	{
>> +	  lra_assert (ad->index_reg_loc == NULL);
>> +	  ad->index_reg_loc = loc;
>> +	}
>> +      break;
>> +
>> +    }
> Which targets use a bare UNSPEC as a displacement?  I thought a
> displacement had to be a link-time constant, in which case it should
> satisfy CONSTANT_P.  For UNSPECs, that means wrapping it in a CONST.
I saw it somewhere.  I guess IA64.
> I'm just a bit worried that the UNSPEC handling is sensitive to the
> order that subrtxes are processed (unlike PLUS, which goes to some
> trouble to work out what's what).  It could be especially confusing
> because the default case processes operands in reverse order while
> PLUS processes them in forward order.
>
> Also, which cases require the special UNARY_OP (UNSPEC ...) fallthrough?
> Probably deserves a comment.
I don't remember.  To figure it out, I would have to switch it off
and try all the targets supported by LRA.
> AIUI the base_reg_loc, index_reg_loc and disp_loc fields aren't just
> recording where reloads of a particular class need to go (obviously
> in the case of disp_loc, which isn't reloaded at all).  The fields
> have semantic value too.  I.e. we use them to work out the value
> of at least part of the address.
>
> In that case it seems dangerous to look through general rtxes
> in the way that the default case above does.  Maybe just making
> sure that DISP_LOC is involved in a sum with the base would be
> enough, but another idea was:
>
> ----------------------------------------------------------------
> I know of three ways of "mutating" (for want of a better word)
> an address:
>
>    1. (and X (const_int X)), to align
>    2. a subreg
>    3. a unary operator (such as truncation or extension)
>
> So maybe we could:
>
>    a. remove outer mutations (using a helper function)
>    b. handle LO_SUM, PRE_*, POST_*: as now
>    c. otherwise treat the address of the sum of one, two or three pieces.
>       c1. Peel mutations of all pieces.
>       c2. Classify the pieces into base, index and displacement.
>           This would be similar to the jousting code above, but hopefully
>           easier because all three rtxes are to hand.  E.g. we could
>           do the base vs. index thing in a similar way to
>           commutative_operand_precedence.
>       c3. Record which pieces were mutated (e.g. using something like the
>           index_loc vs. index_reg_loc distinction in the current code)
>
> That should be general enough for current targets, but if it isn't,
> we could generalise it further when we know what generalisation is needed.
>
> That's still going to be a fair amount of code, but hopefully not more,
> and we might have more confidence at each stage what each value is.
> And it avoids the risk of treating "mutated" addresses as "unmutated" ones.
> ----------------------------------------------------------------
>
> Just an idea though.  Probably not for 4.8, although I might try it
> if I find time.
I am not sure that you have listed all the cases.  If you had, we
could make this function clearer.  I tried to enumerate them first
but kept finding new cases; after that I gave up and went for a more
general implementation.

This function has been rewritten and modified many times.  I am
afraid to do it again while the clock is ticking.

It would be great if you re-implemented the function according to
your ideas; we could then try it on the 8 targets to which LRA has
already been ported.  An LRA sub-branch would be a perfect place to
do it.
> It would be nice to sort out the disp_loc thing for 4.8 though.
>
>> +/* Extract address characteristics in address with location *LOC in
>> +   space AS.  Return them in AD.  Parameter OUTER_CODE for MEM should
>> +   be MEM.  Parameter OUTER_CODE for 'p' constraint should be ADDRESS
>> +   and MEM_MODE should be VOIDmode.  */
> Maybe:
>
> /* Describe address *LOC in AD.  There are two cases:
>
>     - *LOC is the address in a (mem ...).  In this case OUTER_CODE is MEM
>       and AS is the mem's address space.
>
>     - *LOC is matched to an address constraint such as 'p'.  In this case
>       OUTER_CODE is ADDRESS and AS is ADDR_SPACE_GENERIC.  */
Fixed.
>> +/* Return start register offset of hard register REGNO in MODE.	 */
>> +int
>> +lra_constraint_offset (int regno, enum machine_mode mode)
>> +{
>> +  lra_assert (regno < FIRST_PSEUDO_REGISTER);
>> +  /* On a WORDS_BIG_ENDIAN machine, point to the last register of a
>> +     multiple hard register group of scalar integer registers, so that
>> +     for example (reg:DI 0) and (reg:SI 1) will be considered the same
>> +     register.	*/
>> +  if (WORDS_BIG_ENDIAN && GET_MODE_SIZE (mode) > UNITS_PER_WORD
>> +      && SCALAR_INT_MODE_P (mode))
>> +    return hard_regno_nregs[regno][mode] - 1;
>> +  return 0;
>> +}
> Maybe the head comment could be:
>
> /* Return the offset from REGNO of the least significant register
>     in (reg:MODE REGNO).
>
>     This function is used to tell whether two registers satisfy
>     a matching constraint.  (reg:MODE1 REGNO1) matches (reg:MODE2 REGNO2) if:
>
>           REGNO1 + lra_constraint_offset (REGNO1, MODE1)
>        == REGNO2 + lra_constraint_offset (REGNO2, MODE2)  */
>
> (and remove the inner comment).
>
Fixed.
>> +/* Like rtx_equal_p except that it allows a REG and a SUBREG to match
>> +   if they are the same hard reg, and has special hacks for
>> +   auto-increment and auto-decrement.  This is specifically intended for
>> +   process_alt_operands to use in determining whether two operands
>> +   match.  X is the operand whose number is the lower of the two.
>> +
>> +   It is supposed that X is the output operand and Y is the input
>> +   operand.  */
>> +static bool
>> +operands_match_p (rtx x, rtx y, int y_hard_regno)
> Need to say what Y_HARD_REGNO is.
>
Fixed.
>> +  switch (code)
>> +    {
>> +    case CONST_INT:
>> +    case CONST_DOUBLE:
>> +    case CONST_FIXED:
> After a recent change this should be:
>
>     CASE_CONST_UNIQUE:
Ok.  It looks like I need another round of merging.  Oh, well.
>> +	      val = operands_match_p (XVECEXP (x, i, j), XVECEXP (y, i, j),
>> +				      y_hard_regno);
>> +	      if (val == 0)
>> +		return false;
> Why do we pass the old y_hard_regno even though Y has changed?
> Some of the earlier code assumes that GET_MODE (y) is the mode
> of y_hard_regno.
It does not matter.  We processed the REG and SUBREG cases above, and
y_hard_regno can be non-negative only for those cases, so we cannot
process it again here.  But I changed it to -1 for clarity.
>> +/* Reload pseudos created for matched input and output reloads whose
>> +   mode are different.	Such pseudos has a modified rules for finding
>> +   their living ranges, e.g. assigning to subreg of such pseudo means
>> +   changing all pseudo value.  */
>> +bitmap_head lra_bound_pseudos;
> Maybe:
>
> /* Reload pseudos created for matched input and output reloads whose
>     modes are different.  Such pseudos have different live ranges from
>     other pseudos; e.g. any assignment to a subreg of these pseudos
>     changes the whole pseudo's value.  */
Fixed.
> Although that said, couldn't emit_move_insn_1 (called by gen_move_insn)
> split a multiword pseudo move into two word moves?  Using the traditional
> clobber technique sounds better than having special liveness rules.
>
It is not only about multi-word pseudos.  It is about representing
this situation with constructions that are semantically incorrect in
other parts of the compiler.  Reload has no such problem as it does
not use RTL.  I don't think the move is split, as I use
emit_move_insn, which calls emit_move_insn_1 too.  I really needed a
special liveness treatment (although I don't remember the details)
and therefore I added it.  I had no detailed design for LRA; the code
was shaped by numerous test failures on different targets.  There is
a lot of code analogous to reload's, and its necessity should
probably be rigorously questioned.  I thought about and modified part
of this code, but unfortunately not all of it.

Also, bound pseudos are rare.  Their bitmap is very small, and
testing them in ira-lives.c (2 lines of code overall) is fast.
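
The check has roughly this shape (a hypothetical sketch, not the
exact ira-lives.c lines):

  /* A store to a subreg of a bound pseudo changes the whole value,
     so treat it as a full definition.  */
  if (GET_CODE (dest) == SUBREG
      && bitmap_bit_p (&lra_bound_pseudos, REGNO (SUBREG_REG (dest))))
    full_def_p = true;  /* hypothetical flag */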

>> +/* True if C is a non-empty register class that has too few registers
>> +   to be safely used as a reload target class.	*/
>> +#define SMALL_REGISTER_CLASS_P(C)					\
>> +  (reg_class_size [(C)] == 1						\
>> +   || (reg_class_size [(C)] >= 1 && targetm.class_likely_spilled_p (C)))
> Feels like ira_class_hard_regs_num might be better, but since the
> current definition is traditional, that shouldn't be a merge requirement.
Ok.
>> +/* Return mode of WHAT inside of WHERE whose mode of the context is
>> +   OUTER_MODE.	If WHERE does not contain WHAT, return VOIDmode.  */
>> +static enum machine_mode
>> +find_mode (rtx *where, enum machine_mode outer_mode, rtx *what)
>> +{
>> +  int i, j;
>> +  enum machine_mode mode;
>> +  rtx x;
>> +  const char *fmt;
>> +  enum rtx_code code;
>> +
>> +  if (where == what)
>> +    return outer_mode;
>> +  if (*where == NULL_RTX)
>> +    return VOIDmode;
>> +  x = *where;
>> +  code = GET_CODE (x);
>> +  outer_mode = GET_MODE (x);
>> +  fmt = GET_RTX_FORMAT (code);
>> +  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
>> +    {
>> +      if (fmt[i] == 'e')
>> +	{
>> +	  if ((mode = find_mode (&XEXP (x, i), outer_mode, what)) != VOIDmode)
>> +	    return mode;
>> +	}
>> +      else if (fmt[i] == 'E')
>> +	{
>> +	  for (j = XVECLEN (x, i) - 1; j >= 0; j--)
>> +	  if ((mode = find_mode (&XVECEXP (x, i, j), outer_mode, what))
>> +	      != VOIDmode)
>> +	    return mode;
>> +	}
>> +    }
>> +  return VOIDmode;
>> +}
>> +
>> +/* Return mode for operand NOP of the current insn.  */
>> +static inline enum machine_mode
>> +get_op_mode (int nop)
>> +{
>> +  rtx *loc;
>> +  enum machine_mode mode;
>> +  bool md_first_p = asm_noperands (PATTERN (curr_insn)) < 0;
>> +
>> +  /* Take mode from the machine description first.  */
>> +  if (md_first_p && (mode = curr_static_id->operand[nop].mode) != VOIDmode)
>> +    return mode;
>> +  loc = curr_id->operand_loc[nop];
>> +  /* Take mode from the operand second.	 */
>> +  mode = GET_MODE (*loc);
>> +  if (mode != VOIDmode)
>> +    return mode;
>> +  if (! md_first_p && (mode = curr_static_id->operand[nop].mode) != VOIDmode)
>> +    return mode;
>> +  /* Here is a very rare case.	Take mode from the context.  */
>> +  return find_mode (&PATTERN (curr_insn), VOIDmode, loc);
>> +}
> This looks a lot more complicated than the reload version.  Why is
> it needed?  In reload the conditions for address operands were:
>
> 	  /* Address operands are reloaded in their existing mode,
> 	     no matter what is specified in the machine description.  */
> 	  operand_mode[i] = GET_MODE (recog_data.operand[i]);
>
> 	  /* If the address is a single CONST_INT pick address mode
> 	     instead otherwise we will later not know in which mode
> 	     the reload should be performed.  */
> 	  if (operand_mode[i] == VOIDmode)
> 	    operand_mode[i] = Pmode;
>
> which for LRA might look like:
>
> 	  /* The mode specified in the .md file for address operands
> 	     is the mode of the addressed value, not the address itself.
> 	     We therefore need to get the mode from the operand rtx.
> 	     If the operand has no mode, assume it was Pmode.  */
>
> For other operands, recog_data.operand_mode ought to be correct.
>
> find_mode assumes that the mode of an operand is the same as the mode of
> the outer rtx, which isn't true when the outer rtx is a subreg, mem,
> or one of several unary operators.
>
> This is one that I think would be best decided for 4.8.
I found the reason for this code:

  http://old.nabble.com/-lra--patch-to-fix-SPEC2000-sixtrack-compiler-crash-p32189310.html

But I tried it again and cannot reproduce the problem, so I reverted
the patch.  If we find a problem in stage 3, we will look for a
solution.
>> +/* If REG is a reload pseudo, try to make its class satisfying CL.  */
>> +static void
>> +narrow_reload_pseudo_class (rtx reg, enum reg_class cl)
>> +{
>> +  int regno;
>> +  enum reg_class rclass;
>> +
>> +  /* Do not make more accurate class from reloads generated.  They are
>> +     mostly moves with a lot of constraints.  Making more accurate
>> +     class may results in very narrow class and impossibility of find
>> +     registers for several reloads of one insn.	 */
>> +  if (INSN_UID (curr_insn) >= new_insn_uid_start)
>> +    return;
>> +  if (GET_CODE (reg) == SUBREG)
>> +    reg = SUBREG_REG (reg);
>> +  if (! REG_P (reg) || (regno = REGNO (reg)) < new_regno_start)
>> +    return;
>> +  rclass = get_reg_class (regno);
>> +  rclass = ira_reg_class_subset[rclass][cl];
>> +  if (rclass == NO_REGS)
>> +    return;
>> +  change_class (regno, rclass, "      Change", true);
>> +}
> There seems to be an overlap in functionality with in_class_p here.
> Maybe:
>
> {
>    enum reg_class rclass;
>
>    if (in_class_p (reg, cl, &rclass) && rclass != NO_REGS)
>      change_class (REGNO (reg), rclass, "      Change", true);
> }
>
> (assuming the change in in_class_p interface suggested above).
> This avoids duplicating subtleties like the handling of reloads.
>
Fixed.
>> +      /* We create pseudo for out rtx because we always should keep
>> +	 registers with the same original regno have synchronized
>> +	 value (it is not true for out register but it will be
>> +	 corrected by the next insn).
> I don't understand this comment, sorry.
>
Pseudos have values -- see the comments for lra_reg_info.  Different
pseudos with the same value do not conflict even if they live in the
same place.  When we create a pseudo, we assign it the value of the
original pseudo (if any) from which it was created.  If we created
the new pseudo from the input pseudo, it would not conflict with the
input pseudo, which is wrong when the input pseudo lives past the
insn, since the new pseudo's value is changed by the insn's output.
Therefore we create the new pseudo from the output.

I hope this is more understandable.  I changed the comment.
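
In other words, the conflict rule is roughly this (a sketch; the
intersection helper is hypothetical, val is the value field described
in the lra_reg_info comments):

  /* Sketch: pseudos carrying the same value never conflict;
     otherwise the usual live-range test decides.  */
  static bool
  pseudos_conflict_p (int regno1, int regno2)
  {
    if (lra_reg_info[regno1].val == lra_reg_info[regno2].val)
      return false;
    return live_ranges_intersect_p (regno1, regno2); /* hypothetical */
  }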
>> +	 Do not reuse register because of the following situation: a <-
>> +	 a op b, and b should be the same as a.	 */
> This part is very convincing though :-)  Maybe:
> 	 We cannot reuse the current output register because we might
> 	 have a situation like "a <- a op b", where the constraints force
> 	 the second input operand ("b") to match the output operand ("a").
> 	 "b" must then be copied into a new register so that it doesn't
> 	 clobber the current value of "a".  */
Ok.  Fixed.
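
So for "a <- a op b", where "b" must match the output, the reloads
end up as something like:

       p <- b		(input reload)
       p <- a op p	(the insn, with the match satisfied)
       a <- p		(output reload)

and the old value of "a" is still available as the first input.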
> We should probably keep the other reason too, of course.
>
>> +      /* Don't generate inheritance for the new register because we
>> +	 can not use the same hard register for the corresponding
>> +	 inheritance pseudo for input reload.  */
>> +      bitmap_set_bit (&lra_matched_pseudos, REGNO (new_in_reg));
> Suggest dropping this comment, since we don't do any inheritance here.
> The comment above lra_matched_pseudos already says the same thing.
>
Fixed.
>> +  /* In and out operand can be got from transformations before
>> +     processing constraints.  So the pseudos might have inaccurate
>> +     class and we should make their classes more accurate.  */
>> +  narrow_reload_pseudo_class (in_rtx, goal_class);
>> +  narrow_reload_pseudo_class (out_rtx, goal_class);
> I don't understand this, sorry.  Does "transformations" mean inheritance
> and reload splitting?  So the registers we're changing here are inheritance
> and split pseudos rather than reload pseudos created for this instruction?
> If so, it sounds on face value like it conflicts with the comment quoted
> above about not allowing reload instructions to the narrow the class
> of pseudos.  Might be worth saying why that's OK here but not there.
Again, inheritance and splitting are done after the constraint pass.

The transformations here are mostly reloads of subregs, which are
done before the reloads for the given insn.  In these transformations
we create new pseudos whose reg class we don't know yet; while the
class is unknown, the pseudo gets ALL_REGS.
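
For example, when we reload a subreg like (subreg:SI (reg:DI p) 0)
before constraint processing, we generate something like

       new <- (subreg:SI (reg:DI p) 0)

and `new' starts with ALL_REGS; only here, when the insn constraints
are processed, do we get a real class to narrow it to.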
> Also, I'm not sure I understand why it helps.  Is it just trying
> to encourage the pseudos to form a chain in lra-assigns.c?
>
> E.g. MIPS16 has several instructions that require matched MIPS16 registers.
> However, moves between MIPS16 registers and general registers are as cheap
> as moves between two MIPS16 registers, so narrowing the reloaded values
> from GENERAL_REGS to M16_REGS (if that ever happens) wouldn't necessarily
> be a good thing.
>
> Not saying this is wrong, just that it might need more commentary
> to justify it.
>
>> +  for (i = 0; (in = ins[i]) >= 0; i++)
>> +    *curr_id->operand_loc[in] = new_in_reg;
> The code assumes that all input operands have the same mode.
> Probably worth asserting that here (or maybe further up; I don't mind),
> just to make the assumption explicit.
I added an assert.
>> +/* Return final hard regno (plus offset) which will be after
>> +   elimination.	 We do this for matching constraints because the final
>> +   hard regno could have a different class.  */
>> +static int
>> +get_final_hard_regno (int hard_regno, int offset)
>> +{
>> +  if (hard_regno < 0)
>> +    return hard_regno;
>> +  hard_regno += offset;
>> +  return lra_get_elimation_hard_regno (hard_regno);
> Why apply the offset before rather than after elimination?
> AIUI, AVR's eliminable registers span more than one hard register,
> and the elimination is based off the first.
I fixed it.
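
For the record, the fixed order is elimination first, offset second;
roughly:

  static int
  get_final_hard_regno (int hard_regno, int offset)
  {
    if (hard_regno < 0)
      return hard_regno;
    /* Eliminate using the first register of the group, then apply
       the offset.  */
    hard_regno = lra_get_elimation_hard_regno (hard_regno);
    return hard_regno + offset;
  }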
> Also, all uses but one of lra_get_hard_regno_and_offset follow
> the pattern:
>
>        lra_get_hard_regno_and_offset (x, &x_hard_regno, &offset);
>        /* The real hard regno of the operand after the allocation.  */
>        x_hard_regno = get_final_hard_regno (x_hard_regno, offset);
>
> so couldn't lra_get_hard_regno_and_offset just return the final
> hard register, including elimination?  Then it could apply the
> elimination on the original rtx.
>
> FWIW, the exception I mentioned was operands_match_p:
>
>        lra_get_hard_regno_and_offset (x, &i, &offset);
>        if (i < 0)
> 	goto slow;
>        i += offset;
>
> but I'm not sure why this is the only caller that would want
> to ignore elimination.
???
>> +/* Return register class of OP.	 That is a class of the hard register
>> +   itself (if OP is a hard register), or class of assigned hard
>> +   register to the pseudo (if OP is pseudo), or allocno class of
>> +   unassigned pseudo (if OP is reload pseudo).	Return NO_REGS
>> +   otherwise.  */
>> +static enum reg_class
>> +get_op_class (rtx op)
>> +{
>> +  int regno, hard_regno, offset;
>> +
>> +  if (! REG_P (op))
>> +    return NO_REGS;
>> +  lra_get_hard_regno_and_offset (op, &hard_regno, &offset);
>> +  if (hard_regno >= 0)
>> +    {
>> +      hard_regno = get_final_hard_regno (hard_regno, offset);
>> +      return REGNO_REG_CLASS (hard_regno);
>> +    }
>> +  /* Reload pseudo will get a hard register in any case.  */
>> +  if ((regno = REGNO (op)) >= new_regno_start)
>> +    return lra_get_allocno_class (regno);
>> +  return NO_REGS;
>> +}
> This looks like it ought to be the same as:
>
>    return REG_P (x) ? get_reg_class (REGNO (x)) : NO_REGS;
>
> If not, I think there should be a comment explaining the difference.
> If so, the comment might be:
>
> /* If OP is a register, return the class of the register as per
>     get_reg_class, otherwise return NO_REGS.  */
The difference is in elimination.  But if I add elimination to
get_reg_class, it will be the same.  I think that is the right thing
to do; it also permits removing the elimination code in
process_addr_reg.  So I modified the code.  It is briefer and more
logical now.
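
The modified get_reg_class has roughly this shape (a sketch built
from the pieces quoted above):

  static enum reg_class
  get_reg_class (int regno)
  {
    int hard_regno = regno;

    if (hard_regno >= FIRST_PSEUDO_REGISTER)
      hard_regno = lra_get_regno_hard_regno (regno);
    if (hard_regno >= 0)
      {
        hard_regno = lra_get_elimation_hard_regno (hard_regno);
        return REGNO_REG_CLASS (hard_regno);
      }
    if (regno >= new_regno_start)
      return lra_get_allocno_class (regno);
    return NO_REGS;
  }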
>> +/* Return generated insn mem_pseudo:=val if TO_P or val:=mem_pseudo
>> +   otherwise.  If modes of MEM_PSEUDO and VAL are different, use
>> +   SUBREG for VAL to make them equal.  Assign CODE to the insn if it
>> +   is not recognized.
>> +
>> +   We can not use emit_move_insn in some cases because of bad used
>> +   practice in some machine descriptions.  For example, power can use
>> +   only base+index addressing for altivec move insns and it is checked
>> +   by insn predicates.	On the other hand, the same move insn
>> +   constraints permit to use offsetable memory for moving vector mode
>> +   values from/to general registers to/from memory.  emit_move_insn
>> +   will transform offsetable address to one with base+index addressing
>> +   which is rejected by the constraint.	 So sometimes we need to
>> +   generate move insn without modifications and assign the code
>> +   explicitly because the generated move can be unrecognizable because
>> +   of the predicates.  */
> Ick :-)  Can't we just say that fixing this is part of the process
> of porting a target to LRA?  It'd be nice not to carry hacks like
> this around in shiny new code.
It would be great, but I don't expect target maintainers to be so
cooperative.  So my goal was to write LRA requiring minimal changes
to target code, or no changes at all.
Even with that goal, I am a bit pessimistic about how much time will
be needed to remove reload.  With a requirement not to use such
hacks, it would take forever.

The target with the biggest number of tricks I saw was PPC.  I spent
a lot of time porting LRA to it, and there is a lot of LRA code in
rs6000.c.
> As it stands:
>
>> +static rtx
>> +emit_spill_move (bool to_p, rtx mem_pseudo, rtx val, int code)
>> +{
>> +  rtx insn, after;
>> +
>> +  start_sequence ();
>> +  if (GET_MODE (mem_pseudo) != GET_MODE (val))
>> +    val = gen_rtx_SUBREG (GET_MODE (mem_pseudo),
>> +			  GET_CODE (val) == SUBREG ? SUBREG_REG (val) : val,
>> +			  0);
>> +  if (to_p)
>> +    insn = gen_move_insn (mem_pseudo, val);
>> +  else
>> +    insn = gen_move_insn (val, mem_pseudo);
>> +  if (recog_memoized (insn) < 0)
>> +    INSN_CODE (insn) = code;
>> +  emit_insn (insn);
>> +  after = get_insns ();
>> +  end_sequence ();
>> +  return after;
>> +}
> this recog_memoized code effectively assumes that INSN is just one
> instruction, whereas emit_move_insn_1 or the backend move expanders
> could split moves into several instructions.
>
> Since the code-forcing stuff is for rs6000, I think we could drop it
> from 4.8 whatever happens.
>
> The sequence stuff above looks redundant; we should just return
> INSN directly.
Ok, I fixed it.  Although it makes my life harder, as some targets
will be broken on the branch after all these changes.
>> +  /* Quick check on the right move insn which does not need
>> +     reloads.  */
>> +  if ((dclass = get_op_class (dest)) != NO_REGS
>> +      && (sclass = get_op_class (src)) != NO_REGS
>> +      && targetm.register_move_cost (GET_MODE (src), dclass, sclass) == 2)
>> +    return true;
> Suggest:
>
>    /* The backend guarantees that register moves of cost 2 never need
>       reloads.  */
Fixed.
>> +  if (GET_CODE (dest) == SUBREG)
>> +    dreg = SUBREG_REG (dest);
>> +  if (GET_CODE (src) == SUBREG)
>> +    sreg = SUBREG_REG (src);
>> +  if (! REG_P (dreg) || ! REG_P (sreg))
>> +    return false;
>> +  sclass = dclass = NO_REGS;
>> +  dr = get_equiv_substitution (dreg);
>> +  if (dr != dreg)
>> +    dreg = copy_rtx (dr);
> I think this copy is too early, because there are quite a few
> conditions under which we never emit anything with DREG in it.
Ok, fixed.
>> +  if (REG_P (dreg))
>> +    dclass = get_reg_class (REGNO (dreg));
>> +  if (dclass == ALL_REGS)
>> +    /* We don't know what class we will use -- let it be figured out
>> +       by curr_insn_transform function.	 Remember some targets does not
>> +       work with such classes through their implementation of
>> +       machine-dependent hooks like secondary_memory_needed.  */
>> +    return false;
> Don't really understand this comment, sorry.
Again, ALL_REGS is used for new pseudos created by transformations
like the reload of a SUBREG_REG.  We don't know their class yet; we
should figure it out by processing the insn constraints, not in this
fast-path function.  Even if ALL_REGS were a right class for the
pseudo, the secondary_... hooks are usually not defined for ALL_REGS.

I fixed the comment.
>> +  sreg_mode = GET_MODE (sreg);
>> +  sr = get_equiv_substitution (sreg);
>> +  if (sr != sreg)
>> +    sreg = copy_rtx (sr);
> This copy also seems too early.
Fixed.
>> +  sri.prev_sri = NULL;
>> +  sri.icode = CODE_FOR_nothing;
>> +  sri.extra_cost = 0;
>> +  secondary_class = NO_REGS;
>> +  /* Set up hard register for a reload pseudo for hook
>> +     secondary_reload because some targets just ignore unassigned
>> +     pseudos in the hook.  */
>> +  if (dclass != NO_REGS
>> +      && REG_P (dreg) && (dregno = REGNO (dreg)) >= new_regno_start
>> +      && lra_get_regno_hard_regno (dregno) < 0)
>> +    reg_renumber[dregno] = ira_class_hard_regs[dclass][0];
>> +  else
>> +    dregno = -1;
>> +  if (sclass != NO_REGS
>> +      && REG_P (sreg) && (sregno = REGNO (sreg)) >= new_regno_start
>> +      && lra_get_regno_hard_regno (sregno) < 0)
>> +    reg_renumber[sregno] = ira_class_hard_regs[sclass][0];
>> +  else
>> +    sregno = -1;
> I think this would be correct without the:
>
>       && REG_P (dreg) && (dregno = REGNO (dreg)) >= new_regno_start
>
> condition (and similarly for the src case).  IMO it would be clearer too:
> the decision about when to return a register class for unallocated pseudos
> is then localised to get_reg_class rather than copied both here and there.
Right, fixed.
>> +  if (sclass != NO_REGS)
>> +    secondary_class
>> +      = (enum reg_class) targetm.secondary_reload (false, dest,
>> +						   (reg_class_t) sclass,
>> +						   GET_MODE (src), &sri);
>> +  if (sclass == NO_REGS
>> +      || ((secondary_class != NO_REGS || sri.icode != CODE_FOR_nothing)
>> +	  && dclass != NO_REGS))
>> +    secondary_class
>> +      = (enum reg_class) targetm.secondary_reload (true, sreg,
>> +						   (reg_class_t) dclass,
>> +						   sreg_mode, &sri);
> Hmm, so for register<-register moves, if the target says that the output
> reload needs a secondary reload, we try again with an input reload and
> hope for a different answer?
>
> If the target is giving different answers in that case, I think that's
> a bug in the target, and we should assert instead.  The problem is that
> if we allow the answers to be different, and both answers involve
> secondary reloads, we have no way of knowing whether the second answer
> is easier to implement or "more correct" than the first.  An assert
> avoids that, and puts the onus on the target to sort itself out.
>
> Again, as long as x86 is free of this bug for 4.8, I don't the merge
> needs to cater for broken targets.
I added an assert.
>> +  *change_p = true;
> I think this is the point at which substituted values should be copied.
Fixed.
>> +  new_reg = NULL_RTX;
>> +  if (secondary_class != NO_REGS)
>> +    new_reg = lra_create_new_reg_with_unique_value (sreg_mode, NULL_RTX,
>> +						    secondary_class,
>> +						    "secondary");
>> +  start_sequence ();
>> +  if (sri.icode == CODE_FOR_nothing)
>> +    lra_emit_move (new_reg, sreg);
>> +  else
>> +    {
>> +      enum reg_class scratch_class;
>> +
>> +      scratch_class = (reg_class_from_constraints
>> +		       (insn_data[sri.icode].operand[2].constraint));
>> +      scratch_reg = (lra_create_new_reg_with_unique_value
>> +		     (insn_data[sri.icode].operand[2].mode, NULL_RTX,
>> +		      scratch_class, "scratch"));
>> +      emit_insn (GEN_FCN (sri.icode) (new_reg != NULL_RTX ? new_reg : dest,
>> +				      sreg, scratch_reg));
>> +    }
>> +  before = get_insns ();
>> +  end_sequence ();
>> +  lra_process_new_insns (curr_insn, before, NULL_RTX, "Inserting the move");
> AIUI, the constraints pass will look at these instructions and generate
> what are now known as tertiary reloads where needed (by calling this
> function again).  Is that right?  Very nice if so: that's far more
> natural than the current reload handling.
Yes.
>> +/* The chosen reg classes which should be used for the corresponding
>> +   operands.  */
>> +static enum reg_class goal_alt[MAX_RECOG_OPERANDS];
>> +/* True if the operand should be the same as another operand and the
>> +   another operand does not need a reload.  */
> s/and the another/and that other/
Fixed.
>> +/* Make reloads for addr register in LOC which should be of class CL,
>> +   add reloads to list BEFORE.	If AFTER is not null emit insns to set
>> +   the register up after the insn (it is case of inc/dec, modify).  */
> Maybe:
>
> /* Arrange for address element *LOC to be a register of class CL.
>     Add any input reloads to list BEFORE.  AFTER is nonnull if *LOC is an
>     automodified value; handle that case by adding the required output
>     reloads to list AFTER.  Return true if the RTL was changed.  */
Fixed.
>> +static bool
>> +process_addr_reg (rtx *loc, rtx *before, rtx *after, enum reg_class cl)
>> +{
>> +  int regno, final_regno;
>> +  enum reg_class rclass, new_class;
>> +  rtx reg = *loc;
>> +  rtx new_reg;
>> +  enum machine_mode mode;
>> +  bool change_p = false;
>> +
>> +  mode = GET_MODE (reg);
>> +  if (! REG_P (reg))
>> +    {
>> +      /* Always reload memory in an address even if the target
>> +	 supports such addresses.  */
>> +      new_reg
>> +	= lra_create_new_reg_with_unique_value (mode, reg, cl, "address");
>> +      push_to_sequence (*before);
>> +      lra_emit_move (new_reg, reg);
>> +      *before = get_insns ();
>> +      end_sequence ();
>> +      *loc = new_reg;
>> +      if (after != NULL)
>> +	{
>> +	  start_sequence ();
>> +	  lra_emit_move (reg, new_reg);
>> +	  emit_insn (*after);
>> +	  *after = get_insns ();
>> +	  end_sequence ();
>> +	}
>> +      return true;
> Why does this need to be a special case, rather than reusing the
> code later in the function?  Specifically:
I simplified the function by factoring out the common code.
>> +    }
>> +  lra_assert (REG_P (reg));
>> +  final_regno = regno = REGNO (reg);
>> +  if (regno < FIRST_PSEUDO_REGISTER)
>> +    {
>> +      rtx final_reg = reg;
>> +      rtx *final_loc = &final_reg;
>> +
>> +      lra_eliminate_reg_if_possible (final_loc);
>> +      final_regno = REGNO (*final_loc);
>> +    }
>> +  /* Use class of hard register after elimination because some targets
>> +     do not recognize virtual hard registers as valid address
>> +     registers.	 */
>> +  rclass = get_reg_class (final_regno);
>> +  if ((*loc = get_equiv_substitution (reg)) != reg)
>> +    {
>> +      if (lra_dump_file != NULL)
>> +	{
>> +	  fprintf (lra_dump_file,
>> +		   "Changing pseudo %d in address of insn %u on equiv ",
>> +		   REGNO (reg), INSN_UID (curr_insn));
>> +	  print_value_slim (lra_dump_file, *loc, 1);
>> +	  fprintf (lra_dump_file, "\n");
>> +	}
>> +      *loc = copy_rtx (*loc);
>> +      change_p = true;
>> +    }
>> +  if (*loc != reg || ! in_class_p (final_regno, GET_MODE (reg), cl, &new_class))
>> +    {
>> +      reg = *loc;
>> +      if (get_reload_reg (OP_IN, mode, reg, cl, "address", &new_reg))
>> +	{
>> +	  push_to_sequence (*before);
>> +	  lra_emit_move (new_reg, reg);
>> +	  *before = get_insns ();
>> +	  end_sequence ();
>> +	}
>> +      *loc = new_reg;
>> +      if (after != NULL)
>> +	{
>> +	  start_sequence ();
>> +	  lra_emit_move (reg, new_reg);
>> +	  emit_insn (*after);
>> +	  *after = get_insns ();
>> +	  end_sequence ();
>> +	}
>> +      change_p = true;
>> +    }
>> +  else if (new_class != NO_REGS && rclass != new_class)
>> +    change_class (regno, new_class, "	   Change", true);
>> +  return change_p;
>> +}
> E.g.:
>
>    if ((*loc = get_equiv_substitution (reg)) != reg)
>      ...as above...
>    if (*loc != reg || !in_class_p (reg, cl, &new_class))
>      ...as above...
>    else if (new_class != NO_REGS && rclass != new_class)
>      change_class (regno, new_class, "	   Change", true);
>    return change_p;
>
> (assuming change to in_class_p suggested earlier) seems like it
> covers the same cases.
>
> Also, should OP_IN be OP_INOUT for after != NULL, so that we don't try
> to reuse existing reload pseudos?  That would mean changing get_reload_reg
> (both commentary and code) to handle OP_INOUT like OP_OUT.
Fixed.
> Or maybe just pass OP_OUT instead of OP_INOUT, if that's more consistent.
> I don't mind which.
>> +  /* Force reload if this is a constant or PLUS or if there may be a
>> +     problem accessing OPERAND in the outer mode.  */
> Suggest:
>
>    /* Force a reload of the SUBREG_REG if this ...
Fixed.
>> +      /* Constant mode ???? */
>> +      enum op_type type = curr_static_id->operand[nop].type;
> Not sure what the comment means, but REG is still the original SUBREG_REG,
> so there shouldn't be any risk of a VOIDmode constant.  (subreg (const_int))
> is invalid rtl.
I removed the comment.  It was probably a question for myself.
>> +/* Return TRUE if *LOC refers for a hard register from SET.  */
>> +static bool
>> +uses_hard_regs_p (rtx *loc, HARD_REG_SET set)
>> +{
> Nothing seems to care about the address, so we could pass the rtx
> rather than a pointer to it.
Fixed.
>> +  int i, j, x_hard_regno, offset;
>> +  enum machine_mode mode;
>> +  rtx x;
>> +  const char *fmt;
>> +  enum rtx_code code;
>> +
>> +  if (*loc == NULL_RTX)
>> +    return false;
>> +  x = *loc;
>> +  code = GET_CODE (x);
>> +  mode = GET_MODE (x);
>> +  if (code == SUBREG)
>> +    {
>> +      loc = &SUBREG_REG (x);
>> +      x = SUBREG_REG (x);
>> +      code = GET_CODE (x);
>> +      if (GET_MODE_SIZE (GET_MODE (x)) > GET_MODE_SIZE (mode))
>> +	mode = GET_MODE (x);
>> +    }
>> +
>> +  if (REG_P (x))
>> +    {
>> +      lra_get_hard_regno_and_offset (x, &x_hard_regno, &offset);
>> +      /* The real hard regno of the operand after the allocation.  */
>> +      x_hard_regno = get_final_hard_regno (x_hard_regno, offset);
>> +      return (x_hard_regno >= 0
>> +	      && lra_hard_reg_set_intersection_p (x_hard_regno, mode, set));
> With the subreg mode handling above, this looks little-endian specific.
> The MEM case:
>> +  if (MEM_P (x))
>> +    {
>> +      struct address ad;
>> +      enum machine_mode mode = GET_MODE (x);
>> +      rtx *addr_loc = &XEXP (x, 0);
>> +
>> +      extract_address_regs (mode, MEM_ADDR_SPACE (x), addr_loc, MEM, &ad);
>> +      if (ad.base_reg_loc != NULL)
>> +	{
>> +	  if (uses_hard_regs_p (ad.base_reg_loc, set))
>> +	    return true;
>> +	}
>> +      if (ad.index_reg_loc != NULL)
>> +	{
>> +	  if (uses_hard_regs_p (ad.index_reg_loc, set))
>> +	    return true;
>> +	}
>> +    }
> is independent of the subreg handling, so perhaps the paradoxical subreg
> case should be handled separately, using simplify_subreg_regno.
>> +/* Major function to choose the current insn alternative and what
>> +   operands should be reloaded and how.	 If ONLY_ALTERNATIVE is not
>> +   negative we should consider only this alternative.  Return false if
>> +   we can not choose the alternative or find how to reload the
>> +   operands.  */
>> +static bool
>> +process_alt_operands (int only_alternative)
>> +{
>> +  bool ok_p = false;
>> +  int nop, small_class_operands_num, overall, nalt, offset;
>> +  int n_alternatives = curr_static_id->n_alternatives;
>> +  int n_operands = curr_static_id->n_operands;
>> +  /* LOSERS counts those that don't fit this alternative and would
>> +     require loading.  */
>> +  int losers;
> s/those/the operands/
Fixed.
>> +  /* Calculate some data common for all alternatives to speed up the
>> +     function.	*/
>> +  for (nop = 0; nop < n_operands; nop++)
>> +    {
>> +      op = no_subreg_operand[nop] = *curr_id->operand_loc[nop];
>> +      lra_get_hard_regno_and_offset (op, &hard_regno[nop], &offset);
>> +      /* The real hard regno of the operand after the allocation.  */
>> +      hard_regno[nop] = get_final_hard_regno (hard_regno[nop], offset);
>> +
>> +      operand_reg[nop] = op;
>> +      biggest_mode[nop] = GET_MODE (operand_reg[nop]);
>> +      if (GET_CODE (operand_reg[nop]) == SUBREG)
>> +	{
>> +	  operand_reg[nop] = SUBREG_REG (operand_reg[nop]);
>> +	  if (GET_MODE_SIZE (biggest_mode[nop])
>> +	      < GET_MODE_SIZE (GET_MODE (operand_reg[nop])))
>> +	    biggest_mode[nop] = GET_MODE (operand_reg[nop]);
>> +	}
>> +      if (REG_P (operand_reg[nop]))
>> +	no_subreg_operand[nop] = operand_reg[nop];
>> +      else
>> +	operand_reg[nop] = NULL_RTX;
> This looks odd: no_subreg_operand ends up being a subreg if the
> SUBREG_REG wasn't a REG.  Some more commentary might help.
Probably a better name would be no_subreg_reg_operand.  I also added
comments for its definition and for the definition of operand_reg.
>> +  /* The constraints are made of several alternatives.	Each operand's
>> +     constraint looks like foo,bar,... with commas separating the
>> +     alternatives.  The first alternatives for all operands go
>> +     together, the second alternatives go together, etc.
>> +
>> +     First loop over alternatives.  */
>> +  for (nalt = 0; nalt < n_alternatives; nalt++)
>> +    {
>> +      /* Loop over operands for one constraint alternative.  */
>> +      if (
>> +#ifdef HAVE_ATTR_enabled
>> +	  (curr_id->alternative_enabled_p != NULL
>> +	   && ! curr_id->alternative_enabled_p[nalt])
>> +	  ||
>> +#endif
>> +	  (only_alternative >= 0 && nalt != only_alternative))
>> +	continue;
> Probably more natural if split into two "if (...) continue;"s.  E.g.:
Fixed.
> #ifdef HAVE_ATTR_enabled
>        if (curr_id->alternative_enabled_p != NULL
> 	  && !curr_id->alternative_enabled_p[nalt])
> 	continue;
> #endif
>        if (only_alternative >= 0 && nalt != only_alternative))
> 	continue;
>
>> +      for (nop = 0; nop < n_operands; nop++)
>> +	{
>> +	  const char *p;
>> +	  char *end;
>> +	  int len, c, m, i, opalt_num, this_alternative_matches;
>> +	  bool win, did_match, offmemok, early_clobber_p;
>> +	  /* false => this operand can be reloaded somehow for this
>> +	     alternative.  */
>> +	  bool badop;
>> +	  /* false => this operand can be reloaded if the alternative
>> +	     allows regs.  */
>> +	  bool winreg;
>> +	  /* False if a constant forced into memory would be OK for
>> +	     this operand.  */
>> +	  bool constmemok;
>> +	  enum reg_class this_alternative, this_costly_alternative;
>> +	  HARD_REG_SET this_alternative_set, this_costly_alternative_set;
>> +	  bool this_alternative_match_win, this_alternative_win;
>> +	  bool this_alternative_offmemok;
>> +	  int invalidate_m;
>> +	  enum machine_mode mode;
>> +
>> +	  opalt_num = nalt * n_operands + nop;
>> +	  if (curr_static_id->operand_alternative[opalt_num].anything_ok)
>> +	    {
>> +	      /* Fast track for no constraints at all.	*/
>> +	      curr_alt[nop] = NO_REGS;
>> +	      CLEAR_HARD_REG_SET (curr_alt_set[nop]);
>> +	      curr_alt_win[nop] = true;
>> +	      curr_alt_match_win[nop] = false;
>> +	      curr_alt_offmemok[nop] = false;
>> +	      curr_alt_matches[nop] = -1;
>> +	      continue;
>> +	    }
> Given that this code is pretty complex, it might be clearer to remove
> the intermediate "this_*" variables and assign directly to curr_alt_*.  I.e.:
I'd rather not do this.  The array element would replace a simple
variable in about 50 places, and I don't think that makes the code
cleaner.
> 	  curr_alt[nop] = NO_REGS;
> 	  CLEAR_HARD_REG_SET (curr_alt_set[nop]);
> 	  curr_alt_win[nop] = false;
> 	  curr_alt_match_win[nop] = false;
> 	  curr_alt_offmemok[nop] = false;
> 	  curr_alt_matches[nop] = -1;
>
> 	  opalt_num = nalt * n_operands + nop;
> 	  if (curr_static_id->operand_alternative[opalt_num].anything_ok)
> 	    {
> 	      /* Fast track for no constraints at all.	*/
> 	      curr_alt_win[nop] = true;
> 	      continue;
> 	    }
>
> Obviously keep this nice comment:
>> +	  /* We update set of possible hard regs besides its class
>> +	     because reg class might be inaccurate.  For example,
>> +	     union of LO_REGS (l), HI_REGS(h), and STACK_REG(k) in ARM
>> +	     is translated in HI_REGS because classes are merged by
>> +	     pairs and there is no accurate intermediate class.	 */
> somewhere though, either here or above the declaration of curr_alt_set.
>
>> +		    /* We are supposed to match a previous operand.
>> +		       If we do, we win if that one did.  If we do
>> +		       not, count both of the operands as losers.
>> +		       (This is too conservative, since most of the
>> +		       time only a single reload insn will be needed
>> +		       to make the two operands win.  As a result,
>> +		       this alternative may be rejected when it is
>> +		       actually desirable.)  */
>> +		    /* If it conflicts with others.  */
> Last line looks incomplete/misplaced.
I have no idea where/what it should be.  I removed it.
>> +		    match_p = false;
>> +		    if (operands_match_p (*curr_id->operand_loc[nop],
>> +					  *curr_id->operand_loc[m], m_hregno))
>> +		      {
>> +			int i;
>> +			
>> +			for (i = 0; i < early_clobbered_regs_num; i++)
>> +			  if (early_clobbered_nops[i] == m)
>> +			    break;
>> +			/* We should reject matching of an early
>> +			   clobber operand if the matching operand is
>> +			   not dying in the insn.  */
>> +			if (i >= early_clobbered_regs_num
> Why not simply use operand m's early_clobber field?
Ok.  Fixed.
>> +			    || operand_reg[nop] == NULL_RTX
>> +			    || (find_regno_note (curr_insn, REG_DEAD,
>> +						 REGNO (operand_reg[nop]))
>> +				!= NULL_RTX))
>> +			  match_p = true;
> ...although I don't really understand this condition.  If the two
> operands are the same value X, then X must die here whatever the
> notes say.  So I assume this is coping with a case where the operands
> are different but still match.  If so, could you give an example?
I remember seeing such an insn, but I don't remember the details.
> Matched earlyclobbers explicitly guarantee that the earlyclobber doesn't
> apply to the matched input operand; the earlyclobber only applies to
> other input operands.  So I'd have expected it was those operands
> that might need reloading rather than this one.
>
> E.g. if X occurs three times, twice in a matched earlyclobber pair
> and once as an independent operand, it's the latter operand that would
> need reloading.
Yes, I know.
>> +			/* Operands don't match.  */
>> +			/* Retroactively mark the operand we had to
>> +			   match as a loser, if it wasn't already and
>> +			   it wasn't matched to a register constraint
>> +			   (e.g it might be matched by memory).	 */
>> +			if (curr_alt_win[m]
>> +			    && (operand_reg[m] == NULL_RTX
>> +				|| hard_regno[m] < 0))
>> +			  {
>> +			    losers++;
>> +			    if (curr_alt[m] != NO_REGS)
>> +			      reload_nregs
>> +				+= (ira_reg_class_max_nregs[curr_alt[m]]
>> +				    [GET_MODE (*curr_id->operand_loc[m])]);
>> +			}
>> +			invalidate_m = m;
>> +			if (curr_alt[m] == NO_REGS)
>> +			  continue;
> I found this a bit confusing.  If the operands don't match and operand m
> allows no registers, don't we have to reject this constraint outright?
> E.g. something like:
Yes, this function, like its reload counterpart, could be investigated for 
a long time.  I cleaned it up a bit.  Maybe it is the right time to clean 
it up more, although time is a scarce resource.
I tried your variant and did not find any serious problems on x86 that 
this code should be concerned with.  So I am using it.
> 			/* Operands don't match.  Both operands must
> 			   allow a reload register, otherwise we cannot
> 			   make them match.  */
> 			if (curr_alt[m] == NO_REGS)
> 			  break;
> 			/* Retroactively mark the operand we had to
> 			   match as a loser, if it wasn't already and
> 			   it wasn't matched to a register constraint
> 			   (e.g it might be matched by memory).	 */
> 			if (curr_alt_win[m]
> 			    && (operand_reg[m] == NULL_RTX
> 				|| hard_regno[m] < 0))
> 			  {
> 			    losers++;
> 			    reload_nregs
> 			      += (ira_reg_class_max_nregs[curr_alt[m]]
> 				  [GET_MODE (*curr_id->operand_loc[m])]);
> 			  }
>
>> +		    /* This can be fixed with reloads if the operand
>> +		       we are supposed to match can be fixed with
>> +		       reloads.	 */
>> +		    badop = false;
>> +		    this_alternative = curr_alt[m];
>> +		    COPY_HARD_REG_SET (this_alternative_set, curr_alt_set[m]);
>> +		
>> +		    /* If we have to reload this operand and some
>> +		       previous operand also had to match the same
>> +		       thing as this operand, we don't know how to do
>> +		       that.  So reject this alternative.  */
>> +		    if (! did_match)
>> +		      for (i = 0; i < nop; i++)
>> +			if (curr_alt_matches[i] == this_alternative_matches)
>> +			  badop = true;
> OK, so this is another case of cruft from reload that I'd like to remove,
> but do you know of any reason why this shouldn't be:
>
> 		    /* If we have to reload this operand and some previous
> 		       operand also had to match the same thing as this
> 		       operand, we don't know how to do that.  */
> 		    if (!match_p || !curr_alt_win[m])
> 		      {
> 			for (i = 0; i < nop; i++)
> 			  if (curr_alt_matches[i] == m)
> 			    break;
> 			if (i < nop)
> 			  break;
> 		      }
> 		    else
> --->		      did_match = true;
>
> 		    /* This can be fixed with reloads if the operand
> 		       we are supposed to match can be fixed with
> 		       reloads.	 */
> --->		    this_alternative_matches = m;
> --->		    invalidate_m = m;
> 		    badop = false;
> 		    this_alternative = curr_alt[m];
> 		    COPY_HARD_REG_SET (this_alternative_set, curr_alt_set[m]);
>
> (although a helper function might be better than the awkward breaking)?
> Note that the ---> lines have moved from further up.
>
> This is the only time in the switch statement where one constraint
> in a constraint string uses "badop = true" to reject the operand.
> I.e. for "<something else>0" we should normally not reject the
> alternative based solely on the "0", since the "<something else>"
> might have been satisfied instead.  And we should only record matching
> information if we've decided the match can be implemented by reloads
> (the last block).
Yes, that is bad, especially since it is not documented but people rely 
on it.

I tried your variant and did not find a problem with it, so I am using it 
(with one modification: setting invalidate_m only for !did_match).
>> +			/* We prefer no matching alternatives because
>> +			   it gives more freedom in RA.	 */
>> +			if (operand_reg[nop] == NULL_RTX
>> +			    || (find_regno_note (curr_insn, REG_DEAD,
>> +						 REGNO (operand_reg[nop]))
>> +				 == NULL_RTX))
>> +			  reject += 2;
> Looks like a new reject rule.  I agree it makes conceptual sense though,
> so I'm all for it.
>> +		      || (REG_P (op)
>> +			  && REGNO (op) >= FIRST_PSEUDO_REGISTER
>> +			  && in_mem_p (REGNO (op))))
> This pattern occurs several times.  I think a helper function like
> spilled_reg_p (op) would help.  See 'g' below.
Ok. Fixed.
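A minimal sketch of such a helper, following the quoted pattern (assuming
in_mem_p as used there):

  static inline bool
  spilled_reg_p (rtx op)
  {
    return (REG_P (op)
	    && REGNO (op) >= FIRST_PSEUDO_REGISTER
	    && in_mem_p (REGNO (op)));
  }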
>> +		case 's':
>> +		  if (CONST_INT_P (op)
>> +		      || (GET_CODE (op) == CONST_DOUBLE && mode == VOIDmode))
> ...
>> +		case 'n':
>> +		  if (CONST_INT_P (op)
>> +		      || (GET_CODE (op) == CONST_DOUBLE && mode == VOIDmode))
> After recent changes these should be CONST_SCALAR_INT_P (op)
OK.
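i.e. presumably both cases become simply:

		  if (CONST_SCALAR_INT_P (op))

assuming CONST_SCALAR_INT_P folds the CONST_INT and VOIDmode CONST_DOUBLE
tests into one macro.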
>> +		case 'g':
>> +		  if (/* A PLUS is never a valid operand, but LRA can
>> +			 make it from a register when eliminating
>> +			 registers.  */
>> +		      GET_CODE (op) != PLUS
>> +		      && (! CONSTANT_P (op) || ! flag_pic
>> +			  || LEGITIMATE_PIC_OPERAND_P (op))
>> +		      && (! REG_P (op)
>> +			  || (REGNO (op) >= FIRST_PSEUDO_REGISTER
>> +			      && in_mem_p (REGNO (op)))))
> Rather than the special case for PLUS, I think this would be better as:
>
> 	  if (MEM_P (op)
> 	      || spilled_reg_p (op)
> 	      || general_constant_p (op))
> 	    win = true;
>
> where general_constant_p abstracts away:
>
> 	  (CONSTANT_P (op)
> 	   && (! flag_pic || LEGITIMATE_PIC_OPERAND_P (op)))
>
> general_constant_p probably ought to go in a common file because several
> places need this condition (including other parts of this switch statement).
OK. Fixed.
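A minimal sketch of that abstraction, assuming it takes the operand rtx:

  static inline bool
  general_constant_p (rtx op)
  {
    return CONSTANT_P (op) && (! flag_pic || LEGITIMATE_PIC_OPERAND_P (op));
  }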
>> +#ifdef EXTRA_CONSTRAINT_STR
>> +		      if (EXTRA_MEMORY_CONSTRAINT (c, p))
>> +			{
>> +			  if (EXTRA_CONSTRAINT_STR (op, c, p))
>> +			    win = true;
>> +			  /* For regno_equiv_mem_loc we have to
>> +			     check.  */
>> +			  else if (REG_P (op)
>> +				   && REGNO (op) >= FIRST_PSEUDO_REGISTER
>> +				   && in_mem_p (REGNO (op)))
> Looks like an old comment from an earlier iteration.  There doesn't
> seem to be a function called regno_equiv_mem_loc in the current patch.
> But...
Yes, it is from an earlier version.  I removed it.
>> +			    {
>> +			      /* We could transform spilled memory
>> +				 finally to indirect memory.  */
>> +			      if (EXTRA_CONSTRAINT_STR
>> +				  (get_indirect_mem (mode), c, p))
>> +				win = true;
>> +			    }
> ...is this check really needed?  It's a documented requirement that memory
> constraints accept plain base registers.  Also, the following code is:
I removed it.  At least it works for x86/x86-64.
>> +			  /* If we didn't already win, we can reload
>> +			     constants via force_const_mem, and other
>> +			     MEMs by reloading the address like for
>> +			     'o'.  */
>> +			  if (CONST_POOL_OK_P (mode, op) || MEM_P (op))
>> +			    badop = false;
> It seems a bit inconsistent to treat a spilled pseudo whose address
> might well need reloading as a win, while not treating existing MEMs
> whose addresses need reloading as a win.
Well, the probability of having to reload the address of a spilled pseudo 
is very small on most targets, but the address reload for an existing MEM 
in this case is real.  So it seems logical to me.
>> +		      if (EXTRA_CONSTRAINT_STR (op, c, p))
>> +			win = true;
>> +		      else if (REG_P (op)
>> +			       && REGNO (op) >= FIRST_PSEUDO_REGISTER
>> +			       && in_mem_p (REGNO (op)))
>> +			{
>> +			  /* We could transform spilled memory finally
>> +			     to indirect memory.  */
>> +			  if (EXTRA_CONSTRAINT_STR (get_indirect_mem (mode),
>> +						    c, p))
>> +			    win = true;
>> +			}
> I don't understand why there's two copies of this.  I think we have
> to trust the target's classification of constraints, so if the target
> says that something isn't a memory constraint, we shouldn't check the
> (mem (base)) case.
I removed it too.
>> +	      if (c != ' ' && c != '\t')
>> +		costly_p = c == '*';
> I think there needs to be a comment somewhere saying how we handle this.
> Being costly seems to contribute one reject point (i.e. a sixth of a '?')
> compared to the normal case, which is very different from the current
> reload behaviour.  We should probably update the "*" documentation
> in md.texi too.
Yes, it is different.  The new heuristic results in better code generation.
> Some targets use "*" constraints to make sure that floating-point
> registers don't get used as spill space in purely integer code
> (so that task switches don't pay the FPU save/restore penalty).
> Would that still "work" with this definition?  (FWIW, I think
> using "*" is a bad way to achieve this feature, just asking.)
There are two places where '*' is processed.  One is in ira-costs.c, for 
choosing classes.  Therefore I believe it will still work; at least I did 
not find any problems on the 8 targets.  I changed md.texi too.
>> +		  /* We simulate the behaviour of old reload here.
>> +		     Although scratches need hard registers and it
>> +		     might result in spilling other pseudos, no reload
>> +		     insns are generated for the scratches.  So it
>> +		     might cost something but probably less than old
>> +		     reload pass believes.  */
>> +		  if (lra_former_scratch_p (REGNO (operand_reg[nop])))
>> +		    reject += LOSER_COST_FACTOR;
> Yeah, this caused me no end of trouble when tweaking the MIPS
> multiply-accumulate patterns.  However, unlike the other bits of
> cruft I've been complaining about, this is one where I can't think
> of any alternative that makes more inherent sense (to me).  So I agree
> that leaving it as-is is the best approach for now.
>> +	      /* If the operand is dying, has a matching constraint,
>> +		 and satisfies constraints of the matched operand
>> +		 which failed to satisfy the own constraints, we do
>> +		 not need to generate a reload insn for this
>> +		 operand.  */
>> +	      if (this_alternative_matches < 0
>> +		  || curr_alt_win[this_alternative_matches]
>> +		  || ! REG_P (op)
>> +		  || find_regno_note (curr_insn, REG_DEAD,
>> +				      REGNO (op)) == NULL_RTX
>> +		  || ((hard_regno[nop] < 0
>> +		       || ! in_hard_reg_set_p (this_alternative_set,
>> +					       mode, hard_regno[nop]))
>> +		      && (hard_regno[nop] >= 0
>> +			  || ! in_class_p (REGNO (op), GET_MODE (op),
>> +					   this_alternative, NULL))))
>> +		losers++;
> I think this might be clearer as:
Fixed.
> 	      if (!(this_alternative_matches >= 0
> 		    && !curr_alt_win[this_alternative_matches]
> 		    && REG_P (op)
> 		    && find_regno_note (curr_insn, REG_DEAD, REGNO (op))
> 		    && (hard_regno[nop] >= 0
> 			? in_hard_reg_set_p (this_alternative_set,
> 					     mode, hard_regno[nop])
> 			: in_class_p (op, this_alternative, NULL))))
> 		losers++;
>
>> +	      if (operand_reg[nop] != NULL_RTX)
>> +		{
>> +		  int last_reload = (lra_reg_info[ORIGINAL_REGNO
>> +						  (operand_reg[nop])]
>> +				     .last_reload);
>> +
>> +		  if (last_reload > bb_reload_num)
>> +		    reload_sum += last_reload;
>> +		  else
>> +		    reload_sum += bb_reload_num;
> The comment for reload_sum says:
>
>> +/* Overall number reflecting distances of previous reloading the same
>> +   value.  It is used to improve inheritance chances.  */
>> +static int best_reload_sum;
That comment is wrong.  It should be:

/* Overall number reflecting distances of previous reloading the same
    value.  The distances are counted from the current BB start.  It is
    used to improve inheritance chances.  */

I fixed it.  I am also decreasing the number by bb_reload_num every time 
I increase reload_sum.
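(Presumably the update then becomes something like:

		  if (last_reload > bb_reload_num)
		    reload_sum += last_reload - bb_reload_num;

so that reload_sum accumulates distances from the BB start rather than 
raw insn numbers.)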
> which made me think of distance from the current instruction.  I see
> it's actually something else, effectively a sum of instruction numbers.
>
> I assumed the idea was to prefer registers that were reloaded more
> recently (closer to the current instruction).  In that case I thought that,
> for a distance-based best_reload_sum, smaller would be better,
> while for an instruction-number-based best_reload_sum, larger would
> be better.  It looks like we use instruction-number based best_reload_sums
> but prefer smaller sums:
>
>> +			      && (reload_nregs < best_reload_nregs
>> +				  || (reload_nregs == best_reload_nregs
>> +				      && best_reload_sum < reload_sum))))))))
> Is that intentional?
Now it makes sense: the bigger the number, the closer the last reload is 
to the current insn.
> Also, is this value meaningful for output reloads, which aren't really
> going to be able to inherit a value as such?  We seem to apply the cost
> regardless of whether it's an input or an output, so probably deserves
> a comment.
>
> Same for matched input operands, which as you say elsewhere aren't
> inherited.
Right; taking that into account could improve the heuristic further.  I 
added the code.
>> +	      if (badop
>> +		  /* Alternative loses if it has no regs for a reg
>> +		     operand.  */
>> +		  || (REG_P (op) && no_regs_p
>> +		      && this_alternative_matches < 0))
>> +		goto fail;
> More reload cruft, but I don't understand why we have both this and, later:
>> +	      if (this_alternative_matches < 0
>> +		  && no_regs_p && ! this_alternative_offmemok && ! constmemok)
>> +		goto fail;
> We also had the earlier:
>
>> +	  /* If this operand could be handled with a reg, and some reg
>> +	     is allowed, then this operand can be handled.  */
>> +	  if (winreg && this_alternative != NO_REGS)
>> +	    badop = false;
> which I think belongs in the same else statement.  At least after the
> matching changes suggested above, I think all three can be replaced by:
>
> 	  /* If this operand accepts a register, and if the register class
> 	     has at least one allocatable register, then this operand
> 	     can be reloaded.  */
> 	  if (winreg && !no_regs_p)
> 	    badop = false;
>
> 	  if (badop)
> 	    goto fail;
>
> which IMO belongs after the "no_regs_p" assignment.  badop should never
> be false if we have no way of reloading the value.
This change is non-trivial without knowing the semantics of all these 
variables, but it looks OK to me.

So I changed the code.
>> +	      if (! no_regs_p)
>> +		reload_nregs
>> +		  += ira_reg_class_max_nregs[this_alternative][mode];
> I wasn't sure why we counted this even in the "const_to_mem && constmemok"
> and "MEM_P (op) && offmemok" cases from:
> 	      /* We prefer to reload pseudos over reloading other
> 		 things, since such reloads may be able to be
> 		 eliminated later.  So bump REJECT in other cases.
> 		 Don't do this in the case where we are forcing a
> 		 constant into memory and it will then win since we
> 		 don't want to have a different alternative match
> 		 then.	*/
> 	      if (! (REG_P (op)
> 		     && REGNO (op) >= FIRST_PSEUDO_REGISTER)
> 		  && ! (const_to_mem && constmemok)
> 		  /* We can reload the address instead of memory (so
> 		     do not punish it).	 It is preferable to do to
> 		     avoid cycling in some cases.  */
> 		  && ! (MEM_P (op) && offmemok))
> 		reject += 2;
I think the constmemok case is obvious.  It is not a reload; it just puts 
the constant in the constant pool.  We should not punish it, as no 
additional insns are generated.

There is a comment for the offmemok case which I think describes it.  
Apparently it was a fix for LRA cycling, but I don't remember the details.  
To restore them, I would need to remove the code and try it on many 
targets; I guess it would take 3-4 days.  But I removed this, as it does 
not affect x86/x86-64.
>> +	  if (early_clobber_p)
>> +	    reject++;
>> +	  /* ??? Should we update the cost because early clobber
>> +	     register reloads or it is a rare thing to be worth to do
>> +	     it.  */
>> +	  overall = losers * LOSER_COST_FACTOR + reject;
> Could you expand on the comment a bit?
Yes, I did it.
>> +	  if ((best_losers == 0 || losers != 0) && best_overall < overall)
>> +	    goto fail;
>> +
>> +	  curr_alt[nop] = this_alternative;
>> +	  COPY_HARD_REG_SET (curr_alt_set[nop], this_alternative_set);
>> +	  curr_alt_win[nop] = this_alternative_win;
>> +	  curr_alt_match_win[nop] = this_alternative_match_win;
>> +	  curr_alt_offmemok[nop] = this_alternative_offmemok;
>> +	  curr_alt_matches[nop] = this_alternative_matches;
>> +
>> +	  if (invalidate_m >= 0 && ! this_alternative_win)
>> +	    curr_alt_win[invalidate_m] = false;
> BTW, after the matching changes above, I don't think we need both
> "invalidate_m" and "this_alternative_matches".
>
Yes, if we use did_match, we could remove invalidate_m.  I removed it.
>> +	  for (j = hard_regno_nregs[clobbered_hard_regno][biggest_mode[i]] - 1;
>> +	       j >= 0;
>> +	       j--)
>> +	    SET_HARD_REG_BIT (temp_set, clobbered_hard_regno + j);
> add_to_hard_reg_set.
>
Fixed.
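So the loop presumably collapses to a single call:

	  add_to_hard_reg_set (&temp_set, biggest_mode[i],
			       clobbered_hard_regno);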
>> +	    else if (curr_alt_matches[j] == i && curr_alt_match_win[j])
>> +	      {
>> +		/* This is a trick.  Such operands don't conflict and
>> +		   don't need a reload.	 But it is hard to transfer
>> +		   this information to the assignment pass which
>> +		   spills one operand without this info.  We avoid the
>> +		   conflict by forcing to use the same pseudo for the
>> +		   operands hoping that the pseudo gets the same hard
>> +		   regno as the operands and the reloads are gone.  */
>> +		if (*curr_id->operand_loc[i] != *curr_id->operand_loc[j])
> ...
>> +		/* See the comment for the previous case.  */
>> +		if (*curr_id->operand_loc[i] != *curr_id->operand_loc[j])
> What are these last two if statements for?  I wasn't sure how two operands
> could have the same address.
>
> Not saying they're wrong, but I think a comment would be good.
I think it is a leftover from an older state of the code.  If lra-assigns.c 
does something wrong in this situation, the next round of lra-constraints.c 
will fix it.
So I am removing the code.
>> +      small_class_operands_num = 0;
>> +      for (nop = 0; nop < n_operands; nop++)
>> +	/* If this alternative can be made to work by reloading, and
>> +	   it needs less reloading than the others checked so far,
>> +	   record it as the chosen goal for reloading.	*/
>> +	small_class_operands_num
>> +	  += SMALL_REGISTER_CLASS_P (curr_alt[nop]) ? 1 : 0;
> Misplaced comment; I think it belongs after this line.
>
Yes. Fixed.
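i.e. with the comment moved after the loop, next to the code it describes:

      small_class_operands_num = 0;
      for (nop = 0; nop < n_operands; nop++)
	small_class_operands_num
	  += SMALL_REGISTER_CLASS_P (curr_alt[nop]) ? 1 : 0;
      /* If this alternative can be made to work by reloading, and it
	 needs less reloading than the others checked so far, record
	 it as the chosen goal for reloading.  */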
That was a rigorous review.  It took me 2 full days, numerous bootstraps, 
and much testing to check your proposals and answer your questions.
But I am glad; I believe the code became clearer.
Richard Sandiford Oct. 15, 2012, 12:06 p.m. UTC | #11
Vladimir Makarov <vmakarov@redhat.com> writes:
>> if that's accurate.  I dropped the term "reload pseudo" because of
>> the general comment in my earlier reply about the use of "reload pseudo"
>> when the code seems to include inheritance and split pseudos too.
> There is no inheritance or splitting yet; they are done after the
> constraint pass.
> So at this stage, regno >= new_regno_start means a reload pseudo.

Ah, OK.

>> That's a change in the meaning of NEW_CLASS, but seems easier for
>> callers to handle.  I think all it requires is changing:
>>
>>> +      common_class = ira_reg_class_subset[rclass][cl];
>>> +      if (new_class != NULL)
>>> +	*new_class = common_class;
>> to:
>>
>>        common_class = ira_reg_class_subset[rclass][cl];
>>        if (new_class != NULL && rclass != common_class)
>> 	*new_class = common_class;
> This change results in infinite LRA looping on the first libgcc file
> compilation.  Unfortunately I have no time to investigate it right now.
> I'd like to say that most of this code is very sensitive to changes; I
> see it a lot.  You change something that looks obvious and a target
> breaks.
> I am going to investigate it when I have more time.

Thanks.

>>> +    default:
>>> +      {
>>> +	const char *fmt = GET_RTX_FORMAT (code);
>>> +	int i;
>>> +
>>> +	if (GET_RTX_LENGTH (code) != 1
>>> +	    || fmt[0] != 'e' || GET_CODE (XEXP (x, 0)) != UNSPEC)
>>> +	  {
>>> +	    for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
>>> +	      if (fmt[i] == 'e')
>>> +		extract_loc_address_regs (false, mode, as, &XEXP (x, i),
>>> +					  context_p, code, SCRATCH,
>>> +					  modify_p, ad);
>>> +	    break;
>>> +	  }
>>> +	/* fall through for case UNARY_OP (UNSPEC ...)	*/
>>> +      }
>>> +
>>> +    case UNSPEC:
>>> +      if (ad->disp_loc == NULL)
>>> +	ad->disp_loc = loc;
>>> +      else if (ad->base_reg_loc == NULL)
>>> +	{
>>> +	  ad->base_reg_loc = loc;
>>> +	  ad->base_outer_code = outer_code;
>>> +	  ad->index_code = index_code;
>>> +	  ad->base_modify_p = modify_p;
>>> +	}
>>> +      else
>>> +	{
>>> +	  lra_assert (ad->index_reg_loc == NULL);
>>> +	  ad->index_reg_loc = loc;
>>> +	}
>>> +      break;
>>> +
>>> +    }
>> Which targets use a bare UNSPEC as a displacement?  I thought a
>> displacement had to be a link-time constant, in which case it should
>> satisfy CONSTANT_P.  For UNSPECs, that means wrapping it in a CONST.
> I saw it somewhere.  I guess IA64.
>> I'm just a bit worried that the UNSPEC handling is sensitive to the
>> order that subrtxes are processed (unlike PLUS, which goes to some
>> trouble to work out what's what).  It could be especially confusing
>> because the default case processes operands in reverse order while
>> PLUS processes them in forward order.
>>
>> Also, which cases require the special UNARY_OP (UNSPEC ...) fallthrough?
>> Probably deserves a comment.
> I don't remember.  To figure it out, I would have to switch it off and
> try all the targets supported by LRA.
>> AIUI the base_reg_loc, index_reg_loc and disp_loc fields aren't just
>> recording where reloads of a particular class need to go (obviously
>> in the case of disp_loc, which isn't reloaded at all).  The fields
>> have semantic value too.  I.e. we use them to work out the value
>> of at least part of the address.
>>
>> In that case it seems dangerous to look through general rtxes
>> in the way that the default case above does.  Maybe just making
>> sure that DISP_LOC is involved in a sum with the base would be
>> enough, but another idea was:
>>
>> ----------------------------------------------------------------
>> I know of three ways of "mutating" (for want of a better word)
>> an address:
>>
>>    1. (and X (const_int X)), to align
>>    2. a subreg
>>    3. a unary operator (such as truncation or extension)
>>
>> So maybe we could:
>>
>>    a. remove outer mutations (using a helper function)
>>    b. handle LO_SUM, PRE_*, POST_*: as now
>>    c. otherwise treat the address of the sum of one, two or three pieces.
>>       c1. Peel mutations of all pieces.
>>       c2. Classify the pieces into base, index and displacement.
>>           This would be similar to the jousting code above, but hopefully
>>           easier because all three rtxes are to hand.  E.g. we could
>>           do the base vs. index thing in a similar way to
>>           commutative_operand_precedence.
>>       c3. Record which pieces were mutated (e.g. using something like the
>>           index_loc vs. index_reg_loc distinction in the current code)
>>
>> That should be general enough for current targets, but if it isn't,
>> we could generalise it further when we know what generalisation is needed.
>>
>> That's still going to be a fair amount of code, but hopefully not more,
>> and we might have more confidence at each stage what each value is.
>> And it avoids the risk of treating "mutated" addresses as "unmutated" ones.
>> ----------------------------------------------------------------
>>
>> Just an idea though.  Probably not for 4.8, although I might try it
>> if I find time.
> I am not sure that you have listed all the cases.  It would be great if
> you could; then we could make this function clearer.
> I tried to do this first but kept finding new cases.  After that I gave
> up and used a more general implementation.
>
> This function has been rewritten and modified many times.  I am afraid
> to do this again while the clock is ticking.
>
> It would be great if you re-implemented the function according to your
> ideas, and we could try it on the 8 targets to which LRA has already
> been ported.  An LRA sub-branch would be a perfect place to do it.

Fair enough.  I still stand by:

>> It would be nice to sort out the disp_loc thing for 4.8 though.

though.  My point is that base_plus_disp_to_reg assumes that
*base_loc and *disp_loc are involved in a sum.  It adds them together,
replaces the base_loc with the new pseudo, and removes the disp_loc.
But extract_address_regs seems to be deliberately written in a way that
doesn't require base_loc and disp_loc to be involved in a sum, and uses
a more indirect approach to working out disp_loc instead.  It just feels
like it leaves open the potential for a silent wrong-code bug.

>>> +/* Reload pseudos created for matched input and output reloads whose
>>> +   mode are different.	Such pseudos has a modified rules for finding
>>> +   their living ranges, e.g. assigning to subreg of such pseudo means
>>> +   changing all pseudo value.  */
>>> +bitmap_head lra_bound_pseudos;
>> Maybe:
>>
>> /* Reload pseudos created for matched input and output reloads whose
>>     modes are different.  Such pseudos have different live ranges from
>>     other pseudos; e.g. any assignment to a subreg of these pseudos
>>     changes the whole pseudo's value.  */
> Fixed.
>> Although that said, couldn't emit_move_insn_1 (called by gen_move_insn)
>> split a multiword pseudo move into two word moves?  Using the traditional
>> clobber technique sounds better than having special liveness rules.
>>
> It is not only about multi-word pseudos.  It is about representing this
> situation with constructions that are semantically incorrect in other
> parts of the compiler.  Reload has no such problem, as it does not use
> RTL.  So I don't think it splits the move, although I use emit_move_insn
> and that calls emit_move_insn_1 too.

But my point is that emit_move_insn_1 _does_ split moves that have no
.md pattern of their own.  E.g. some targets do not define double-word
move patterns because such moves are always equivalent to two individual
word moves.  And if emit_move_insn_1 splits:

   (set (reg:DI X) (reg:DI Y))

into:

   (set (subreg:SI (reg:DI X) 0) (subreg:SI (reg:DI Y) 0))
   (set (subreg:SI (reg:DI X) 4) (subreg:SI (reg:DI Y) 4))

then it would be wrong to say that the subreg in the second instruction
is a complete definition of X.
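(For reference, the "traditional clobber technique" mentioned above marks
the full definition explicitly before the word moves, along the lines of:

   (clobber (reg:DI X))
   (set (subreg:SI (reg:DI X) 0) (subreg:SI (reg:DI Y) 0))
   (set (subreg:SI (reg:DI X) 4) (subreg:SI (reg:DI Y) 4))

so that dataflow sees a complete definition of X without needing special
liveness rules.)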

> I really needed a special liveness treatment (although I don't remember
> the details) and therefore I added it.  I had no detailed design for
> LRA; the code was shaped by numerous test failures on different targets.
> There is a lot of code analogous to reload's, and its necessity should
> probably be rigorously questioned.  I thought about and modified part of
> this code, but unfortunately not all of it.
>
> Also, bound pseudos are rare.  Their bitmap is very small, and testing
> them in ira-lives.c (2 lines of code overall) is fast.

FWIW, It wasn't really speed as much as correctness I was worried about.
In a way, rarity makes having special rules seem even more dangerous.

>>> +      /* We create pseudo for out rtx because we always should keep
>>> +	 registers with the same original regno have synchronized
>>> +	 value (it is not true for out register but it will be
>>> +	 corrected by the next insn).
>> I don't understand this comment, sorry.
>>
> Pseudos have values -- see the comments for lra_reg_info.  Different
> pseudos with the same value do not conflict even if they live in the
> same place.  When we create a pseudo, we assign it the value of the
> original pseudo (if any) from which the new pseudo was created.  If we
> created the pseudo from the input pseudo, the new pseudo would not
> conflict with the input pseudo, which is wrong when the input pseudo
> lives beyond the insn, since the new pseudo's value is changed by the
> insn's output.  Therefore we create the new pseudo from the output.
>
> I hope it is more understandable.  I changed the comment.

Yeah, I think that makes it a lot clearer, thanks.

>>> +  /* In and out operand can be got from transformations before
>>> +     processing constraints.  So the pseudos might have inaccurate
>>> +     class and we should make their classes more accurate.  */
>>> +  narrow_reload_pseudo_class (in_rtx, goal_class);
>>> +  narrow_reload_pseudo_class (out_rtx, goal_class);
>> I don't understand this, sorry.  Does "transformations" mean inheritance
>> and reload splitting?  So the registers we're changing here are inheritance
>> and split pseudos rather than reload pseudos created for this instruction?
>> If so, it sounds on face value like it conflicts with the comment quoted
>> above about not allowing reload instructions to the narrow the class
>> of pseudos.  Might be worth saying why that's OK here but not there.
> Again, inheritance and splitting are done after the constraint pass.
>
> The transformations here are mostly reloads of subregs, which are done
> before the reloads for the given insn.  In these transformations we
> create new pseudos whose reg class we don't know yet.  When we don't
> know a pseudo's reg class yet, we assign ALL_REGS to the pseudo.

OK, thanks.

>> Also, all uses but one of lra_get_hard_regno_and_offset follow
>> the pattern:
>>
>>        lra_get_hard_regno_and_offset (x, &x_hard_regno, &offset);
>>        /* The real hard regno of the operand after the allocation.  */
>>        x_hard_regno = get_final_hard_regno (x_hard_regno, offset);
>>
>> so couldn't lra_get_hard_regno_and_offset just return the final
>> hard register, including elimination?  Then it could apply the
>> elimination on the original rtx.
>>
>> FWIW, the exception I mentioned was operands_match_p:
>>
>>        lra_get_hard_regno_and_offset (x, &i, &offset);
>>        if (i < 0)
>> 	goto slow;
>>        i += offset;
>>
>> but I'm not sure why this is the only caller that would want
>> to ignore elimination.
> ???

Not sure what you meant here :-)  Was that a placeholder,
or something else?  What I was getting at was that it would
be nice to replace all occurrences of:

      lra_get_hard_regno_and_offset (x, &x_hard_regno, &offset);
      /* The real hard regno of the operand after the allocation.  */
      x_hard_regno = get_final_hard_regno (x_hard_regno, offset);

with something like:

      x_hard_regno = lra_get_hard_regno (x);

and this operands_match_p code seemed to be the only place that didn't
apply get_final_hard_regno to the result of lra_get_hard_regno_and_offset.
I wasn't really sure why operands_match_p was different.
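Something like this, for instance (name and shape assumed):

      static inline int
      lra_get_hard_regno (rtx x)
      {
	int hard_regno, offset;

	lra_get_hard_regno_and_offset (x, &hard_regno, &offset);
	/* The real hard regno of the operand after the allocation.  */
	return get_final_hard_regno (hard_regno, offset);
      }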

>>> +  int i, j, x_hard_regno, offset;
>>> +  enum machine_mode mode;
>>> +  rtx x;
>>> +  const char *fmt;
>>> +  enum rtx_code code;
>>> +
>>> +  if (*loc == NULL_RTX)
>>> +    return false;
>>> +  x = *loc;
>>> +  code = GET_CODE (x);
>>> +  mode = GET_MODE (x);
>>> +  if (code == SUBREG)
>>> +    {
>>> +      loc = &SUBREG_REG (x);
>>> +      x = SUBREG_REG (x);
>>> +      code = GET_CODE (x);
>>> +      if (GET_MODE_SIZE (GET_MODE (x)) > GET_MODE_SIZE (mode))
>>> +	mode = GET_MODE (x);
>>> +    }
>>> +
>>> +  if (REG_P (x))
>>> +    {
>>> +      lra_get_hard_regno_and_offset (x, &x_hard_regno, &offset);
>>> +      /* The real hard regno of the operand after the allocation.  */
>>> +      x_hard_regno = get_final_hard_regno (x_hard_regno, offset);
>>> +      return (x_hard_regno >= 0
>>> +	      && lra_hard_reg_set_intersection_p (x_hard_regno, mode, set));
>> With the subreg mode handling above, this looks little-endian specific.
>> The MEM case:
>>> +  if (MEM_P (x))
>>> +    {
>>> +      struct address ad;
>>> +      enum machine_mode mode = GET_MODE (x);
>>> +      rtx *addr_loc = &XEXP (x, 0);
>>> +
>>> +      extract_address_regs (mode, MEM_ADDR_SPACE (x), addr_loc, MEM, &ad);
>>> +      if (ad.base_reg_loc != NULL)
>>> +	{
>>> +	  if (uses_hard_regs_p (ad.base_reg_loc, set))
>>> +	    return true;
>>> +	}
>>> +      if (ad.index_reg_loc != NULL)
>>> +	{
>>> +	  if (uses_hard_regs_p (ad.index_reg_loc, set))
>>> +	    return true;
>>> +	}
>>> +    }
>> is independent of the subreg handling, so perhaps the paradoxical subreg
>> case should be handled separately, using simplify_subreg_regno.

Not sure: did you have any thoughts on this?

>>> +		    match_p = false;
>>> +		    if (operands_match_p (*curr_id->operand_loc[nop],
>>> +					  *curr_id->operand_loc[m], m_hregno))
>>> +		      {
>>> +			int i;
>>> +			
>>> +			for (i = 0; i < early_clobbered_regs_num; i++)
>>> +			  if (early_clobbered_nops[i] == m)
>>> +			    break;
>>> +			/* We should reject matching of an early
>>> +			   clobber operand if the matching operand is
>>> +			   not dying in the insn.  */
>>> +			if (i >= early_clobbered_regs_num
>> Why not simply use operands m's early_clobber field?
> Ok.  Fixed.
>>> +			    || operand_reg[nop] == NULL_RTX
>>> +			    || (find_regno_note (curr_insn, REG_DEAD,
>>> +						 REGNO (operand_reg[nop]))
>>> +				!= NULL_RTX))
>>> +			  match_p = true;
>> ...although I don't really understand this condition.  If the two
>> operands are the same value X, then X must die here whatever the
>> notes say.  So I assume this is coping with a case where the operands
>> are different but still match.  If so, could you give an example?
> I remember seeing such an insn, but I don't remember the details.
>> Matched earlyclobbers explicitly guarantee that the earlyclobber doesn't
>> apply to the matched input operand; the earlyclobber only applies to
>> other input operands.  So I'd have expected it was those operands
>> that might need reloading rather than this one.
>>
>> E.g. if X occurs three times, twice in a matched earlyclobber pair
>> and once as an independent operand, it's the latter operand that would
>> need reloading.
> Yes, I know.

But in that case I don't understand the condition.  If we have:

  (set (reg X) (... (reg X) ...))

(which is the kind of thing operands_match_p is testing for)
then there is no requirement for a REG_DEAD note for X.
But it's still OK for the two Xs to form an earlyclobber pair.

>>> +			  /* If we didn't already win, we can reload
>>> +			     constants via force_const_mem, and other
>>> +			     MEMs by reloading the address like for
>>> +			     'o'.  */
>>> +			  if (CONST_POOL_OK_P (mode, op) || MEM_P (op))
>>> +			    badop = false;
>> It seems a bit inconsistent to treat a spilled pseudo whose address
>> might well need reloading as a win, while not treating existing MEMs
>> whose addresses need reloading as a win.
> Well, the probability of having to reload the address of a spilled
> pseudo is very small on most targets, but the address reload for an
> existing MEM in this case is real.  So it seems logical to me.

OK, that's probably true. :-)

>>> +	      if (! no_regs_p)
>>> +		reload_nregs
>>> +		  += ira_reg_class_max_nregs[this_alternative][mode];
>> I wasn't sure why we counted this even in the "const_to_mem && constmemok"
>> and "MEM_P (op) && offmemok" cases from:
>> 	      /* We prefer to reload pseudos over reloading other
>> 		 things, since such reloads may be able to be
>> 		 eliminated later.  So bump REJECT in other cases.
>> 		 Don't do this in the case where we are forcing a
>> 		 constant into memory and it will then win since we
>> 		 don't want to have a different alternative match
>> 		 then.	*/
>> 	      if (! (REG_P (op)
>> 		     && REGNO (op) >= FIRST_PSEUDO_REGISTER)
>> 		  && ! (const_to_mem && constmemok)
>> 		  /* We can reload the address instead of memory (so
>> 		     do not punish it).	 It is preferable to do to
>> 		     avoid cycling in some cases.  */
>> 		  && ! (MEM_P (op) && offmemok))
>> 		reject += 2;
> I think the constmemok case is obvious.  It is not a reload; it just
> puts the constant in the constant pool.  We should not punish it, as no
> additional insns are generated.
>
> There is a comment for the offmemok case which I think describes it.
> Apparently it was a fix for LRA cycling, but I don't remember the
> details.  To restore them, I would need to remove the code and try it
> on many targets; I guess it would take 3-4 days.  But I removed this,
> as it does not affect x86/x86-64.

Sorry, my comment wasn't as clear as it should have been.  I think:

	      /* We prefer to reload pseudos over reloading other
		 things, since such reloads may be able to be
		 eliminated later.  So bump REJECT in other cases.
		 Don't do this in the case where we are forcing a
		 constant into memory and it will then win since we
		 don't want to have a different alternative match
		 then.	*/
	      if (! (REG_P (op)
		     && REGNO (op) >= FIRST_PSEUDO_REGISTER)
		  && ! (const_to_mem && constmemok)
		  /* We can reload the address instead of memory (so
		     do not punish it).	 It is preferable to do to
		     avoid cycling in some cases.  */
		  && ! (MEM_P (op) && offmemok))
		reject += 2;

is fine.  The point I was trying to make is that this code (correctly IMO)
ignores register reload costs if the operand won't be reloaded into
a register.  The operand won't be reloaded into a register if:

	(const_to_mem && constmemok)
     || (MEM_P (op) && offmemok)

However:

	      if (! no_regs_p)
		reload_nregs
		  += ira_reg_class_max_nregs[this_alternative][mode];

seems to count the register reload cost regardless of whether
the operand would be reloaded into a register.  It felt like
it ought to be something like:

	      if (!((const_to_mem && constmemok)
		    || (MEM_P (op) && offmemok)))
		{
		  /* We prefer to reload pseudos over reloading other
		     things, since such reloads may be able to be
		     eliminated later.  So bump REJECT in other cases.  */
		  if (! (REG_P (op)
			 && REGNO (op) >= FIRST_PSEUDO_REGISTER))
		    reject += 2;

		  if (! no_regs_p)
		    reload_nregs
		      += ira_reg_class_max_nregs[this_alternative][mode];
		}

Maybe other register-specific stuff could go in that if statement too.

Richard
Richard Sandiford Oct. 15, 2012, 4:49 p.m. UTC | #12
Hi Vlad,

Some comments about the rest of LRA.  Nothing major here...

Vladimir Makarov <vmakarov@redhat.com> writes:
> +/* Info about register in an insn.  */
> +struct lra_insn_reg
> +{
> +  /* The biggest mode through which the insn refers to the register
> +     (remember the register can be accessed through a subreg in the
> +     insn).  */
> +  ENUM_BITFIELD(machine_mode) biggest_mode : 16;

AFAICT, this is actually always the mode of a specific reference,
and if there are references to the same register in different modes,
those references get their own lra_insn_regs.  "mode" might be better
than "biggest_mode" if so.

> +/* Static part (common info for insns with the same ICODE) of LRA
> +   internal insn info.	It exists in at most one exemplar for each
> +   non-negative ICODE.	Warning: if the structure definition is
> +   changed, the initializer for debug_insn_static_data in lra.c should
> +   be changed too.  */

Probably worth saying (before the warning) that there is also
one structure for each asm.

> +/* LRA internal info about an insn (LRA internal insn
> +   representation).  */
> +struct lra_insn_recog_data
> +{
> +  int icode; /* The insn code.	*/
> +  rtx insn; /* The insn itself.	 */
> +  /* Common data for insns with the same ICODE.	 */
> +  struct lra_static_insn_data *insn_static_data;

Maybe worth mentioning asms here too.

> +  /* Two arrays of size correspondingly equal to the operand and the
> +     duplication numbers: */
> +  rtx **operand_loc; /* The operand locations, NULL if no operands.  */
> +  rtx **dup_loc; /* The dup locations, NULL if no dups.	 */
> +  /* Number of hard registers implicitly used in given call insn.  The
> +     value can be NULL or points to array of the hard register numbers
> +     ending with a negative value.  */
> +  int *arg_hard_regs;
> +#ifdef HAVE_ATTR_enabled
> +  /* Alternative enabled for the insn.	NULL for debug insns.  */
> +  bool *alternative_enabled_p;
> +#endif
> +  /* The alternative should be used for the insn, -1 if invalid, or we
> +     should try to use any alternative, or the insn is a debug
> +     insn.  */
> +  int used_insn_alternative;
> +  struct lra_insn_reg *regs;  /* Always NULL for a debug insn.	*/

Comments consistently above the field.

> +extern void lra_expand_reg_info (void);

This doesn't exist any more.

> +extern int lra_constraint_new_insn_uid_start;

Just saying in case: this seems to be write-only, with lra-constraints.c
instead using a static variable to track the uid start.

I realise you might want to keep it anyway for consistency with
lra_constraint_new_regno_start, or for debugging.

> +extern rtx lra_secondary_memory[NUM_MACHINE_MODES];

This doesn't exist any more.

> +/* lra-saves.c: */
> +
> +extern bool lra_save_restore (void);

Same for this file & function.

> +/* The function returns TRUE if at least one hard register from ones
> +   starting with HARD_REGNO and containing value of MODE are in set
> +   HARD_REGSET.	 */
> +static inline bool
> +lra_hard_reg_set_intersection_p (int hard_regno, enum machine_mode mode,
> +				 HARD_REG_SET hard_regset)
> +{
> +  int i;
> +
> +  lra_assert (hard_regno >= 0);
> +  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
> +    if (TEST_HARD_REG_BIT (hard_regset, hard_regno + i))
> +      return true;
> +  return false;
> +}

This is the same as overlaps_hard_reg_set_p.
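i.e. uses can presumably just be rewritten as

  overlaps_hard_reg_set_p (hard_regset, mode, hard_regno)

(note that the set comes first in the existing helper).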

> +/* Return hard regno and offset of (sub-)register X through arguments
> +   HARD_REGNO and OFFSET.  If it is not (sub-)register or the hard
> +   register is unknown, then return -1 and 0 correspondingly.  */

The function seems to return -1 for both.

> +/* Add hard registers starting with HARD_REGNO and holding value of
> +   MODE to the set S.  */
> +static inline void
> +lra_add_hard_reg_set (int hard_regno, enum machine_mode mode, HARD_REG_SET *s)
> +{
> +  int i;
> +
> +  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
> +    SET_HARD_REG_BIT (*s, hard_regno + i);
> +}

This is add_to_hard_reg_set.
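i.e. calls can presumably just become

  add_to_hard_reg_set (s, mode, hard_regno)

with the set pointer first.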

> +   Here is block diagram of LRA passes:
> +
> +	  ---------------------				    
> +	 | Undo inheritance    |      ---------------	     --------------- 
> +	 | for spilled pseudos)|     | Memory-memory |	    | New (and old) |
> +	 | and splits (for     |<----| move coalesce |<-----|	 pseudos    |
> +	 | pseudos got the     |      ---------------	    |	assignment  |
> +  Start	 |  same  hard regs)   |			     --------------- 
> +    |	  ---------------------					    ^
> +    V		  |		 ----------------		    |
> + -----------	  V		| Update virtual |		    |
> +|  Remove   |----> ------------>|    register	 |		    |
> +| scratches |	  ^		|  displacements |		    |
> + -----------	  |		 ----------------		    |
> +		  |			 |			    |
> +		  |			 V	   New		    |
> +	 ----------------    No	   ------------	 pseudos   -------------------
> +	| Spilled pseudo | change |Constraints:| or insns | Inheritance/split |
> +	|    to memory	 |<-------|    RTL     |--------->|  transformations  |
> +	|  substitution	 |	  | transfor-  |	  |    in EBB scope   |
> +	 ----------------	  |  mations   |	   -------------------
> +		|		    ------------ 
> +		V
> +    -------------------------
> +   | Hard regs substitution, |
> +   |  devirtalization, and   |------> Finish
> +   | restoring scratches got |
> +   |	     memory	     |
> +    -------------------------

This is a great diagram, thanks.

> +/* Create and return a new reg from register FROM corresponding to
> +   machine description operand of mode MD_MODE.	 Initialize its
> +   register class to RCLASS.  Print message about assigning class
> +   RCLASS containing new register name TITLE unless it is NULL.	 The
> +   created register will have unique held value.  */
> +rtx
> +lra_create_new_reg_with_unique_value (enum machine_mode md_mode, rtx original,
> +				      enum reg_class rclass, const char *title)

Comment says FROM, but parameter is called ORIGINAL.  The code copes
with both null and non-register ORIGINALs, which aren't mentioned in
the comment.

> +/* Target checks operands through operand predicates to recognize an
> +   insn.  We should have a special precaution to generate add insns
> +   which are frequent results of elimination.
> +
> +   Emit insns for x = y + z.  X can be used to store intermediate
> +   values and should be not in Y and Z when we use x to store an
> +   intermediate value.	*/ 

I think this should say what Y and Z are allowed to be, since it's
more than just registers and constants.

> +/* Map INSN_UID -> the operand alternative data (NULL if unknown).  We
> +   assume that this data is valid until register info is changed
> +   because classes in the data can be changed.	*/
> +struct operand_alternative *op_alt_data[LAST_INSN_CODE];

In that case I think it should be in a target_globals structure,
a bit like target_ira.

> +	    for (curr = list; curr != NULL; curr = curr->next)
> +	      if (curr->regno == regno)
> +		break;
> +	    if (curr == NULL || curr->subreg_p != subreg_p
> +		|| curr->biggest_mode != mode)
> +	      {
> +		/* This is a new hard regno or the info can not be
> +		   integrated into the found structure.	 */
> +#ifdef STACK_REGS
> +		early_clobber
> +		  = (early_clobber
> +		     /* This clobber is to inform popping floating
> +			point stack only.  */
> +		     && ! (FIRST_STACK_REG <= regno
> +			   && regno <= LAST_STACK_REG));
> +#endif
> +		list = new_insn_reg (regno, type, mode, subreg_p,
> +				     early_clobber, list);
> +	      }
> +	    else
> +	      {
> +		if (curr->type != type)
> +		  curr->type = OP_INOUT;
> +		if (curr->early_clobber != early_clobber)
> +		  curr->early_clobber = true;
> +	      }

OK, so this is probably only a technicality, but I think this should be:

	    for (curr = list; curr != NULL; curr = curr->next)
	      if (curr->regno == regno
		  && curr->subreg_p == subreg_p
		  && curr->biggest_mode == mode)
		{
		  ..reuse..;
		  return list;
		}
	     ..new entry..;
	     return list;

> +      icode = INSN_CODE (insn);
> +      if (icode < 0)
> +	/* It might be a new simple insn which is not recognized yet.  */
> +	INSN_CODE (insn) = icode = recog (PATTERN (insn), insn, 0);

Any reason not to use recog_memoized here?
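i.e. presumably just:

      icode = INSN_CODE (insn);
      if (icode < 0)
	/* It might be a new simple insn which is not recognized yet.  */
	icode = recog_memoized (insn);

since recog_memoized caches its result in INSN_CODE itself.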

> +      n = insn_static_data->n_operands;
> +      if (n == 0)
> +	locs = NULL;
> +      else
> +	{
> +	  
> +	  locs = (rtx **) xmalloc (n * sizeof (rtx *));
> +	  memcpy (locs, recog_data.operand_loc, n * sizeof (rtx *));
> +	}

Excess blank line after "else" (sorry!)

> +  /* Some output operand can be recognized only from the context not
> +     from the constraints which are empty in this case.	 Call insn may
> +     contain a hard register in set destination with empty constraint
> +     and extract_insn treats them as an input.	*/
> +  for (i = 0; i < insn_static_data->n_operands; i++)
> +    {
> +      int j;
> +      rtx pat, set;
> +      struct lra_operand_data *operand = &insn_static_data->operand[i];
> +
> +      /* ??? Should we treat 'X' the same way.	It looks to me that
> +	 'X' means anything and empty constraint means we do not
> +	 care.	*/

FWIW, I think any X output operand has to be "=X" or "+X"; just "X"
would be as wrong as "r".  genrecog is supposed to complain about that
for insns, and parse_output_constraint for asms.

So I agree the code is correct in just handling empty constraints.

> +/* Update all the insn info about INSN.	 It is usually called when
> +   something in the insn was changed.  Return the udpated info.	 */

Typo: updated.

> +  for (i = 0; i < reg_info_size; i++)
> +    {
> +      bitmap_initialize (&lra_reg_info[i].insn_bitmap, &reg_obstack);
> +#ifdef STACK_REGS
> +      lra_reg_info[i].no_stack_p = false;
> +#endif
> +      CLEAR_HARD_REG_SET (lra_reg_info[i].conflict_hard_regs);
> +      lra_reg_info[i].preferred_hard_regno1 = -1;
> +      lra_reg_info[i].preferred_hard_regno2 = -1;
> +      lra_reg_info[i].preferred_hard_regno_profit1 = 0;
> +      lra_reg_info[i].preferred_hard_regno_profit2 = 0;
> +      lra_reg_info[i].live_ranges = NULL;
> +      lra_reg_info[i].nrefs = lra_reg_info[i].freq = 0;
> +      lra_reg_info[i].last_reload = 0;
> +      lra_reg_info[i].restore_regno = -1;
> +      lra_reg_info[i].val = get_new_reg_value ();
> +      lra_reg_info[i].copies = NULL;
> +    }

The same loop (with a different start index) appears in expand_reg_info.
It'd be nice to factor it out, so that there's only one place to update
if the structure is changed.
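It could look something like this (helper name assumed), using the quoted
body:

static void
initialize_lra_reg_info_element (int i)
{
  bitmap_initialize (&lra_reg_info[i].insn_bitmap, &reg_obstack);
#ifdef STACK_REGS
  lra_reg_info[i].no_stack_p = false;
#endif
  CLEAR_HARD_REG_SET (lra_reg_info[i].conflict_hard_regs);
  lra_reg_info[i].preferred_hard_regno1 = -1;
  lra_reg_info[i].preferred_hard_regno2 = -1;
  lra_reg_info[i].preferred_hard_regno_profit1 = 0;
  lra_reg_info[i].preferred_hard_regno_profit2 = 0;
  lra_reg_info[i].live_ranges = NULL;
  lra_reg_info[i].nrefs = lra_reg_info[i].freq = 0;
  lra_reg_info[i].last_reload = 0;
  lra_reg_info[i].restore_regno = -1;
  lra_reg_info[i].val = get_new_reg_value ();
  lra_reg_info[i].copies = NULL;
}

with both loops then reduced to calls of this function.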

> +	  for (curr = data->regs; curr != NULL; curr = curr->next)
> +	    if (curr->regno == regno)
> +	      break;
> +	  if (curr->subreg_p != subreg_p || curr->biggest_mode != mode)
> +	    /* The info can not be integrated into the found
> +	       structure.  */
> +	    data->regs = new_insn_reg (regno, type, mode, subreg_p,
> +				       early_clobber, data->regs);
> +	  else
> +	    {
> +	      if (curr->type != type)
> +		curr->type = OP_INOUT;
> +	      if (curr->early_clobber != early_clobber)
> +		curr->early_clobber = true;
> +	    }
> +	  lra_assert (curr != NULL);
> +	}

Same loop comment as for collect_non_operand_hard_regs.  Maybe another
factoring opportunity.

> +		/* Some ports don't recognize the following addresses
> +		   as legitimate.  Although they are legitimate if
> +		   they satisfies the constraints and will be checked
> +		   by insn constraints which we ignore here.  */
> +		&& GET_CODE (XEXP (op, 0)) != UNSPEC
> +		&& GET_CODE (XEXP (op, 0)) != PRE_DEC
> +		&& GET_CODE (XEXP (op, 0)) != PRE_INC
> +		&& GET_CODE (XEXP (op, 0)) != POST_DEC
> +		&& GET_CODE (XEXP (op, 0)) != POST_INC
> +		&& GET_CODE (XEXP (op, 0)) != PRE_MODIFY
> +		&& GET_CODE (XEXP (op, 0)) != POST_MODIFY)

GET_RTX_CLASS (GET_CODE (XEXP (op, 0))) == RTX_AUTOINC

> +/* Determine if the current function has an exception receiver block
> +   that reaches the exit block via non-exceptional edges  */
> +static bool
> +has_nonexceptional_receiver (void)
> +{
> +  edge e;
> +  edge_iterator ei;
> +  basic_block *tos, *worklist, bb;
> +
> +  /* If we're not optimizing, then just err on the safe side.  */
> +  if (!optimize)
> +    return true;
> +  
> +  /* First determine which blocks can reach exit via normal paths.  */
> +  tos = worklist = XNEWVEC (basic_block, n_basic_blocks + 1);
> +
> +  FOR_EACH_BB (bb)
> +    bb->flags &= ~BB_REACHABLE;
> +
> +  /* Place the exit block on our worklist.  */
> +  EXIT_BLOCK_PTR->flags |= BB_REACHABLE;
> +  *tos++ = EXIT_BLOCK_PTR;
> +  
> +  /* Iterate: find everything reachable from what we've already seen.  */
> +  while (tos != worklist)
> +    {
> +      bb = *--tos;
> +
> +      FOR_EACH_EDGE (e, ei, bb->preds)
> +	if (!(e->flags & EDGE_ABNORMAL))
> +	  {
> +	    basic_block src = e->src;
> +
> +	    if (!(src->flags & BB_REACHABLE))
> +	      {
> +		src->flags |= BB_REACHABLE;
> +		*tos++ = src;
> +	      }
> +	  }
> +    }
> +  free (worklist);
> +
> +  /* Now see if there's a reachable block with an exceptional incoming
> +     edge.  */
> +  FOR_EACH_BB (bb)
> +    if (bb->flags & BB_REACHABLE)
> +      FOR_EACH_EDGE (e, ei, bb->preds)
> +	if (e->flags & EDGE_ABNORMAL)
> +	  return true;
> +
> +  /* No exceptional block reached exit unexceptionally.	 */
> +  return false;
> +}

Looks like we could just early out on the first loop and get rid
of the second.
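Something like this, for example (a sketch, untested; it keeps the
original behaviour of not checking the exit block's own predecessors for
abnormal edges):

  /* Iterate: find everything reachable from what we've already seen,
     returning as soon as a reachable block has an abnormal incoming
     edge.  */
  while (tos != worklist)
    {
      bb = *--tos;

      FOR_EACH_EDGE (e, ei, bb->preds)
	if (e->flags & EDGE_ABNORMAL)
	  {
	    if (bb != EXIT_BLOCK_PTR)
	      {
		free (worklist);
		return true;
	      }
	  }
	else
	  {
	    basic_block src = e->src;

	    if (!(src->flags & BB_REACHABLE))
	      {
		src->flags |= BB_REACHABLE;
		*tos++ = src;
	      }
	  }
    }
  free (worklist);

  /* No exceptional block reached exit unexceptionally.  */
  return false;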

> +/* Remove all REG_DEAD and REG_UNUSED notes and regenerate REG_INC.
> +   We change pseudos by hard registers without notification of DF and
> +   that can make the notes obsolete.  DF-infrastructure does not deal
> +   with REG_INC notes -- so we should regenerate them here.  */

These days passes are supposed to regenerate REG_DEAD and REG_UNUSED
notes if they need them, so that part might not be necessary.
The REG_INC bit is still needed though...

> +/* Initialize LRA data once per function.  */
> +void
> +lra_init (void)
> +{
> +  init_op_alt_data ();
> +}

I think it's more like:

/* Initialize LRA whenever register-related information is changed.  */


In summary, LRA looks really good to me FWIW.  Thanks for all your hard work.

Getting rid of reload always seemed like a pipe dream, and if the only
known drawback of this replacement is that it takes a while on extreme
testcases, that's an amazing achievement.  (Not to say compile time
isn't important, just that there were so many other hurdles to overcome.)

It looks like opinion has crystallised in favour of merging LRA for 4.8.
I hope that's what happens.  I don't see that anything would be gained
by delaying it to 4.9.  The code's not going to get any more testing on
the branch than it already has; whenever we merge, the stress test is
always going to be trunk.

Richard
Vladimir Makarov Oct. 17, 2012, 12:54 a.m. UTC | #13
On 12-10-12 10:29 AM, Richard Sandiford wrote:
> Hi Vlad,
>
> Comments for the rest of lra-constraints.c.
>
> Vladimir Makarov<vmakarov@redhat.com>  writes:
>> +  saved_base_reg = saved_base_reg2 = saved_index_reg = NULL_RTX;
>> +  change_p = equiv_address_substitution (&ad, addr_loc, mode, as, code);
>> +  if (ad.base_reg_loc != NULL)
>> +    {
>> +      if (process_addr_reg
>> +	  (ad.base_reg_loc, before,
>> +	   (ad.base_modify_p && REG_P (*ad.base_reg_loc)
>> +	    && find_regno_note (curr_insn, REG_DEAD,
>> +				REGNO (*ad.base_reg_loc)) == NULL
>> +	    ? after : NULL),
>> +	   base_reg_class (mode, as, ad.base_outer_code, ad.index_code)))
>> +	change_p = true;
>> +      if (ad.base_reg_loc2 != NULL)
>> +	*ad.base_reg_loc2 = *ad.base_reg_loc;
>> +      saved_base_reg = *ad.base_reg_loc;
>> +      lra_eliminate_reg_if_possible (ad.base_reg_loc);
>> +      if (ad.base_reg_loc2 != NULL)
>> +	{
>> +	  saved_base_reg2 = *ad.base_reg_loc2;
>> +	  lra_eliminate_reg_if_possible (ad.base_reg_loc2);
>> +	}
> We unconditionally make *ad.base_reg_loc2 = *ad.base_reg_loc, so it
> might be clearer without saved_base_reg2.  More below...
>> +      /* The following addressing is checked by constraints and
>> +	 usually target specific legitimate address hooks do not
>> +	 consider them valid.  */
>> +      || GET_CODE (*addr_loc) == POST_DEC || GET_CODE (*addr_loc) == POST_INC
>> +      || GET_CODE (*addr_loc) == PRE_DEC || GET_CODE (*addr_loc) == PRE_DEC
> typo: two PRE_DECs, although:
>> +      || GET_CODE (*addr_loc) == PRE_MODIFY
>> +      || GET_CODE (*addr_loc) == POST_MODIFY
> the whole lot could just be replaced by ad.base_modify_p, or perhaps
> even removed entirely given:
>> +      /* In this case we can not do anything because if it is wrong
>> +	 that is because of wrong displacement.	 Remember that any
>> +	 address was legitimate in non-strict sense before LRA.	 */
>> +      || ad.disp_loc == NULL)
> It doesn't seem worth validating the address at all for ad.disp_loc == NULL.
> E.g. something like:
>
>    if (ad.base_reg_loc != NULL
>        && (process_addr_reg
> 	  (ad.base_reg_loc, before,
> 	   (ad.base_modify_p && REG_P (*ad.base_reg_loc)
> 	    && find_regno_note (curr_insn, REG_DEAD,
> 				REGNO (*ad.base_reg_loc)) == NULL
> 	    ? after : NULL),
> 	   base_reg_class (mode, as, ad.base_outer_code, ad.index_code))))
>      {
>        change_p = true;
>        if (ad.base_reg_loc2 != NULL)
>          *ad.base_reg_loc2 = *ad.base_reg_loc;
>      }
>
>    if (ad.index_reg_loc != NULL
>        && process_addr_reg (ad.index_reg_loc, before, NULL, INDEX_REG_CLASS))
>      change_p = true;
>
>    /* The address was valid before LRA.  We only change its form if the
>       address has a displacement, so if it has no displacement it must
>       still be valid.  */
>    if (ad.disp_loc == NULL)
>      return change_p;
>
>    /* See whether the address is still valid.  Some ports do not check
>       displacements for eliminable registers, so we replace them
>       temporarily with the elimination target.  */
>    saved_base_reg = saved_index_reg = NULL_RTX;
>    ...
>    if (ok_p)
>      return change_p;
Yes, that makes sense.  I changed the code as you propose.
>> +#ifdef HAVE_lo_sum
>> +	  {
>> +	    rtx insn;
>> +	    rtx last = get_last_insn ();
>> +
>> +	    /* disp => lo_sum (new_base, disp)	*/
>> +	    insn = emit_insn (gen_rtx_SET
>> +			      (VOIDmode, new_reg,
>> +			       gen_rtx_HIGH (Pmode, copy_rtx (*ad.disp_loc))));
>> +	    code = recog_memoized (insn);
>> +	    if (code >= 0)
>> +	      {
>> +		rtx save = *ad.disp_loc;
>> +
>> +		*ad.disp_loc = gen_rtx_LO_SUM (Pmode, new_reg, *ad.disp_loc);
>> +		if (! valid_address_p (mode, *ad.disp_loc, as))
>> +		  {
>> +		    *ad.disp_loc = save;
>> +		    code = -1;
>> +		  }
>> +	      }
>> +	    if (code < 0)
>> +	      delete_insns_since (last);
>> +	  }
>> +#endif
> Nice :-)
>
> Purely for the record, I wondered whether the high part should be
> generated with emit_move_insn(_1) instead, with the rhs of the move
> being the HIGH rtx.  That would allow targets to cope with cases where
> the high part isn't represented directly as a HIGH.  E.g. on MIPS and
> Alpha, small-data accesses use the global register as the high part instead.
>
> However, both MIPS and Alpha accept small-data addresses as legitimate
> constants and addresses before and during reload and only introduce the
> split form after reload.  And I think that's how any other cases that
> aren't simple HIGHs should be handled too.  E.g. MIPS also represents
> GOT page loads as HIGHs until after reload, and only then lowers the
> HIGH to a GOT load.  Allowing the backend to generate anything other
> than a plain HIGH set here would be a double-edged sword.
>
> So after all that I agree that the gen_rtx_SET above is better than
> calling the move expanders.
Thanks for sharing your knowledge.
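For reference, the effect of the quoted code on the RTL is roughly the
following (a sketch, using an arbitrary symbol_ref as the displacement):

  (set (reg new) (high (symbol_ref "x")))      ;; emitted before the insn
  ...
  (mem (lo_sum (reg new) (symbol_ref "x")))    ;; replaces (mem (symbol_ref "x"))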
>> +	  /* index * scale + disp => new base + index * scale  */
>> +	  enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
>> +
>> +	  lra_assert (INDEX_REG_CLASS != NO_REGS);
>> +	  new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
>> +	  lra_assert (GET_CODE (*addr_loc) == PLUS);
>> +	  lra_emit_move (new_reg, *ad.disp_loc);
>> +	  if (CONSTANT_P (XEXP (*addr_loc, 1)))
>> +	    XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
>> +	  XEXP (*addr_loc, 0) = new_reg;
> The canonical form is (plus (mult ...) (reg)) rather than
> (plus (reg) (mult ...)), but it looks like we create the latter.
> I realise you try both forms here:
It might happen because of equiv substitution in LRA.
>> +	  /* Some targets like ARM, accept address operands in
>> +	     specific order -- try exchange them if necessary.	*/
>> +	  if (! valid_address_p (mode, *addr_loc, as))
>> +	    {
>> +	      exchange_plus_ops (*addr_loc);
>> +	      if (! valid_address_p (mode, *addr_loc, as))
>> +		exchange_plus_ops (*addr_loc);
>> +	    }
> but I think we should try the canonical form first.  And I'd prefer it
> if we didn't try the other form at all, especially in 4.8.  It isn't
> really the backend's job to reject non-canonical rtl.  This might well
> be another case where some targets need a (hopefully small) tweak in
> order to play by the rules.
>
> Also, I suppose this section of code feeds back to my question on
> Wednesday about the distinction that LRA seems to make between the
> compile-time constant in:
>
>    (plus (reg X1) (const_int Y1))
>
> and the link-time constant in:
>
>    (plus (reg X2) (symbol_ref Y2))
>
> It looked like extract_address_regs classified X1 as a base register and
> X2 as an index register.  The difference between the two constants has
> no run-time significance though, and I think we should handle both X1
> and X2 as base registers (as I think reload does).
>
> I think the path above would then be specific to scaled indices.
> In the original address the "complex" index must come first and the
> displacement second.  In the modified address, the index would stay
> first and the new base register would be second.  More below.
As I wrote above, the problem is also that equiv substitution can
create non-canonical forms.
>> +      /* We don't use transformation 'base + disp => base + new index'
>> +	 because of bad practice used in some machine descriptions
>> +	 (see comments for emit_spill_move).  */
>> +      /* base + disp => new base  */
> As before when commenting on emit_spill_move, I think we should leave
> the "bad machine description" stuff out of 4.8 and treat fixing the
> machine descriptions as part of the LRA port.
>
> In this case I think there's another reason not to reload the
> displacement into an index though: IIRC postreload should be able
> to optimise a sequence of address reloads that have the same base
> and different displacements.  LRA itself might try using "anchor"
> bases in future -- although obviously not in the initial merge --
> since that was one thing that LEGITIMIZE_RELOAD_ADDRESS was used for.
>
> E.g. maybe the justification could be:
>
>        /* base + disp => new base  */
>        /* Another option would be to reload the displacement into an
> 	 index register.  However, postreload has code to optimize
> 	 address reloads that have the same base and different
> 	 displacements, so reloading into an index register would
> 	 not necessarily be a win.  */
Fixed.
>> +      /* base + scale * index + disp => new base + scale * index  */
>> +      new_reg = base_plus_disp_to_reg (mode, as, &ad);
>> +      *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
>> +      if (! valid_address_p (mode, *addr_loc, as))
>> +	{
>> +	  /* Some targets like ARM, accept address operands in
>> +	     specific order -- try exchange them if necessary.	*/
>> +	  exchange_plus_ops (*addr_loc);
>> +	  if (! valid_address_p (mode, *addr_loc, as))
>> +	    exchange_plus_ops (*addr_loc);
>> +	}
> Same comment as above about canonical rtl.  Here we can have two
> registers -- in which case the base should come first -- or a more
> complex index -- in which case the index should come first.
>
> We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...),
> with the operands in either order, and let it take care of the details.
> Using simplify_gen_binary would help with the earlier index+disp case too.
Equiv substitution can create non-canonical forms.  There are 2 approaches:
o have code for dealing with non-canonical forms (equiv substitution,
target stupidity)
o always support canonical forms and require them from targets.

I decided to use the 1st variant but I am reconsidering it.  I'll try to
fix it before inclusion.  But I am not sure I have time for this.  All
these changes make LRA unstable.  In fact, I've just found that the
changes I already made so far broke 2 SPEC2000 tests, although the GCC
testsuite and bootstrap look good.

As I wrote, we also have two different approaches to implementing LRA.
The first one is to make porting LRA to other targets easier (which
still requires some changes on the target side).  The 2nd one is to
require that targets play by strict rules.  I believe that the 2nd
approach will slow down the transition to LRA and the removal of
reload.  But probably it is more correct.

After the recent changes you required, we are now somewhere in between.
A lot of targets on the lra branch are broken by these changes and I
will need some effort on the target-dependent side to make them work.

If it were only target stupidity (as in the ARM case), I would remove
this code right away.  But unfortunately, equiv substitution is also a
problem.  And right now, I have no time to fix it using your approach.
The whole idea of including LRA for x86 in 4.8 was to get real
experience and harsh testing for LRA.  I am sorry, Richard, but I think
solving this problem as you want might result in postponing LRA
inclusion until gcc 4.9.


>> +  /* If this is post-increment, first copy the location to the reload reg.  */
>> +  if (post && real_in != result)
>> +    emit_insn (gen_move_insn (result, real_in));
> Nit, but real_in != result can never be true AIUI, and I was confused how
> the code could be correct in that case.  Maybe just remove it, or make
> it an assert?
No, it might be true:

real_in = in == value ? incloc : in;
...
if (cond)
   result = incloc;
else
   result = ...

if (post && real_in != result)

So it is true if in==value && cond
>> +  /* We suppose that there are insns to add/sub with the constant
>> +     increment permitted in {PRE/POST)_{DEC/INC/MODIFY}.  At least the
>> +     old reload worked with this assumption.  If the assumption
>> +     becomes wrong, we should use approach in function
>> +     base_plus_disp_to_reg.  */
>> +  if (in == value)
>> +    {
>> +      /* See if we can directly increment INCLOC.  */
>> +      last = get_last_insn ();
>> +      add_insn = emit_insn (plus_p
>> +			    ? gen_add2_insn (incloc, inc)
>> +			    : gen_sub2_insn (incloc, inc));
>> +
>> +      code = recog_memoized (add_insn);
>> +      /* We should restore recog_data for the current insn.  */
> Looks like this comment might be a left-over, maybe from before the
> cached insn data?
Yes.  The very first variant used recog_data and was much slower.  I 
removed the comment.
>> +      /* Restore non-modified value for the result.  We prefer this
>> +	 way because it does not require an addition hard
>> +	 register.  */
>> +      if (plus_p)
>> +	{
>> +	  if (CONST_INT_P (inc))
>> +	    emit_insn (gen_add2_insn (result, GEN_INT (-INTVAL (inc))));
>> +	  else
>> +	    emit_insn (gen_sub2_insn (result, inc));
>> +	}
>> +      else if (CONST_INT_P (inc))
>> +	emit_insn (gen_add2_insn (result, inc));
> The last two lines look redundant.  The behaviour is the same as for
> the following else:
>> +      else
>> +	emit_insn (gen_add2_insn (result, inc));
Fixed.
> and I don't think there are any cases where !plus && CONST_INT_P (inc)
> would hold.
It seems I thought about this and therefore I added a placeholder which 
I can remove now.
>> +/* Main entry point of this file: search the body of the current insn
> s/this file/the constraints code/, since it's a static function.
Fixed.
>> +  if (change_p)
>> +    /* Changes in the insn might result in that we can not satisfy
>> +       constraints in lately used alternative of the insn.  */
>> +    lra_set_used_insn_alternative (curr_insn, -1);
> Maybe:
>
>    /* If we've changed the instruction then any alternative that
>       we chose previously may no longer be valid.  */
Fixed.
>> +      rtx x;
>> +
>> +      curr_swapped = !curr_swapped;
>> +      if (curr_swapped)
>> +	{
>> +	  x = *curr_id->operand_loc[commutative];
>> +	  *curr_id->operand_loc[commutative]
>> +	    = *curr_id->operand_loc[commutative + 1];
>> +	  *curr_id->operand_loc[commutative + 1] = x;
>> +	  /* Swap the duplicates too.  */
>> +	  lra_update_dup (curr_id, commutative);
>> +	  lra_update_dup (curr_id, commutative + 1);
>> +	  goto try_swapped;
>> +	}
>> +      else
>> +	{
>> +	  x = *curr_id->operand_loc[commutative];
>> +	  *curr_id->operand_loc[commutative]
>> +	    = *curr_id->operand_loc[commutative + 1];
>> +	  *curr_id->operand_loc[commutative + 1] = x;
>> +	  lra_update_dup (curr_id, commutative);
>> +	  lra_update_dup (curr_id, commutative + 1);
>> +	}
> The swap code is the same in both cases, so I think it'd be better to
> make it common.  Or possibly a helper function, since the same code
> appears again later on.
Fixed.
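For the record, the common helper might look something like this (a
minimal sketch based on the quoted code; the name and exact placement
are my assumption):

static void
swap_operands (int nop)
{
  rtx x = *curr_id->operand_loc[nop];

  *curr_id->operand_loc[nop] = *curr_id->operand_loc[nop + 1];
  *curr_id->operand_loc[nop + 1] = x;
  /* Swap the duplicates too.  */
  lra_update_dup (curr_id, nop);
  lra_update_dup (curr_id, nop + 1);
}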
>> +	if (GET_CODE (op) == PLUS)
>> +	  {
>> +	    plus = op;
>> +	    op = XEXP (op, 1);
>> +	  }
> Sorry, I'm complaining about old reload code again, but: does this
> actually happen in LRA?  In reload, a register operand could become a
> PLUS because of elimination, but I thought LRA did things differently.
> Besides, this is only needed for:
No, I don't think it happens in LRA.  It is a leftover from reload 
code.  I removed it.
>> +	if (CONST_POOL_OK_P (mode, op)
>> +	    && ((targetm.preferred_reload_class
>> +		 (op, (enum reg_class) goal_alt[i]) == NO_REGS)
>> +		|| no_input_reloads_p)
>> +	    && mode != VOIDmode)
>> +	  {
>> +	    rtx tem = force_const_mem (mode, op);
>> +	
>> +	    change_p = true;
>> +	    /* If we stripped a SUBREG or a PLUS above add it back.  */
>> +	    if (plus != NULL_RTX)
>> +	      tem = gen_rtx_PLUS (mode, XEXP (plus, 0), tem);
> and we shouldn't have (plus (constant ...) ...) after elimination
> (or at all outside of a CONST).  I don't understand why the code is
> needed even in reload.
I removed the plus code.
>> +  for (i = 0; i < n_operands; i++)
>> +    {
>> +      rtx old, new_reg;
>> +      rtx op = *curr_id->operand_loc[i];
>> +
>> +      if (goal_alt_win[i])
>> +	{
>> +	  if (goal_alt[i] == NO_REGS
>> +	      && REG_P (op)
>> +	      && lra_former_scratch_operand_p (curr_insn, i))
>> +	    change_class (REGNO (op), NO_REGS, "      Change", true);
> I think this could do with a comment.  Does setting the class to NO_REGS
> indirectly cause the operand to be switched back to a SCRATCH?
When we assign NO_REGS it means that we will not assign a hard register
to the scratch pseudo and the scratch pseudo will be spilled.  Spilled
scratch pseudos are transformed back to scratches at the end of LRA.
I've added a comment.
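The added comment presumably reads along these lines (my wording, a
sketch of the intent):

	  if (goal_alt[i] == NO_REGS
	      && REG_P (op)
	      /* We assigned NO_REGS, so the former scratch pseudo will
		 get no hard register and will be spilled.  Spilled
		 scratch pseudos are transformed back to scratches at
		 the end of LRA.  */
	      && lra_former_scratch_operand_p (curr_insn, i))
	    change_class (REGNO (op), NO_REGS, "      Change", true);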
>> +	  push_to_sequence (before);
>> +	  rclass = base_reg_class (GET_MODE (op), MEM_ADDR_SPACE (op),
>> +				   MEM, SCRATCH);
>> +	  if (code == PRE_DEC || code == POST_DEC
>> +	      || code == PRE_INC || code == POST_INC
>> +	      || code == PRE_MODIFY || code == POST_MODIFY)
> Very minor, but: GET_RTX_CLASS (code) == RTX_AUTOINC
Fixed.
>> +	  enum machine_mode mode;
>> +	  rtx reg, *loc;
>> +	  int hard_regno, byte;
>> +	  enum op_type type = curr_static_id->operand[i].type;
>> +
>> +	  loc = curr_id->operand_loc[i];
>> +	  mode = get_op_mode (i);
>> +	  if (GET_CODE (*loc) == SUBREG)
>> +	    {
>> +	      reg = SUBREG_REG (*loc);
>> +	      byte = SUBREG_BYTE (*loc);
>> +	      if (REG_P (reg)
>> +		  /* Strict_low_part requires reload the register not
>> +		     the sub-register.	*/
>> +		  && (curr_static_id->operand[i].strict_low
>> +		      || (GET_MODE_SIZE (mode)
>> +			  <= GET_MODE_SIZE (GET_MODE (reg))
>> +			  && (hard_regno
>> +			      = get_try_hard_regno (REGNO (reg))) >= 0
>> +			  && (simplify_subreg_regno
>> +			      (hard_regno,
>> +			       GET_MODE (reg), byte, mode) < 0)
>> +			  && (goal_alt[i] == NO_REGS
>> +			      || (simplify_subreg_regno
>> +				  (ira_class_hard_regs[goal_alt[i]][0],
>> +				   GET_MODE (reg), byte, mode) >= 0)))))
>> +		{
>> +		  loc = &SUBREG_REG (*loc);
>> +		  mode = GET_MODE (*loc);
>> +		}
>> +	  old = *loc;
> I think this needs a bit more justifying commentary (although I'm glad
> to see it's much simpler than the reload version :-)).  One thing in
> particular I didn't understand was why we don't reload the inner
> register of a paradoxical subreg.
It does not seem necessary.  As we use the biggest mode, the pseudo gets
all the hard registers or a wide stack slot for the paradoxical subreg.
>> +	  if (get_reload_reg (type, mode, old, goal_alt[i], "", &new_reg)
>> +	      && type != OP_OUT)
>> +	    {
>> +	      push_to_sequence (before);
>> +	      lra_emit_move (new_reg, old);
>> +	      before = get_insns ();
>> +	      end_sequence ();
>> +	    }
>> +	  *loc = new_reg;
>> +	  if (type != OP_IN)
>> +	    {
>> +	      if (find_reg_note (curr_insn, REG_UNUSED, old) == NULL_RTX)
>> +		{
>> +		  start_sequence ();
>> +		  /* We don't want sharing subregs as the pseudo can
>> +		     get a memory and the memory can be processed
>> +		     several times for eliminations.  */
>> +		  lra_emit_move (GET_CODE (old) == SUBREG && type == OP_INOUT
>> +				 ? copy_rtx (old) : old,
>> +				 new_reg);
> I think this should simply be:
>
>    lra_emit_move (type == OP_INOUT ? copy_rtx (old) : old, new_reg);
>
> leaving copy_rtx to figure out which rtxes can be shared.  No comment
> would be needed for that.
Fixed.
>> +		  emit_insn (after);
>> +		  after = get_insns ();
>> +		  end_sequence ();
>> +		}
>> +	      *loc = new_reg;
>> +	    }
> Very minor again, but: redundant *loc assignment (so that the two nested
> if statements collapse to one).
Fixed.  It seems to be a leftover from previous modifications.  The
first variants of the function were more complicated.
>> +      else
>> +	{
>> +	  lra_assert (INSN_CODE (curr_insn) < 0);
>> +	  error_for_asm (curr_insn,
>> +			 "inconsistent operand constraints in an %<asm%>");
>> +	  /* Avoid further trouble with this insn.  */
>> +	  PATTERN (curr_insn) = gen_rtx_USE (VOIDmode, const0_rtx);
>> +	  return false;
> Is this code handling a different case from the corresponding error
> code in curr_insn_transform?  If so, it probably deserves a comment
> explaining the difference.
I have no idea.  It is taken from reload.  I think we could try to
remove it.  If process_alt_operands finds an alternative, we should be
able to generate code anyway.  I'll make it unreachable.
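Making it unreachable could be as simple as the following (a sketch;
gcc_unreachable () would work just as well):

	  else
	    {
	      /* If process_alt_operands succeeded, generating reloads
		 here must succeed too, so the old asm error path
		 should never trigger.  */
	      lra_assert (false);
	    }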
>> +/* Process all regs in debug location *LOC and change them on
>> +   equivalent substitution.  Return true if any change was done.  */
>> +static bool
>> +debug_loc_equivalence_change_p (rtx *loc)
> This doesn't keep the rtl in canonical form.  Probably the easiest and
> best fix is to use simplify_replace_fn_rtx, which handles all that for you.
> (simplify_replace_fn_rtx returns the original rtx if no change was made.)
See comments above about canonical forms.
>> +  for (i = FIRST_PSEUDO_REGISTER; i < new_regno_start; i++)
>> +    ira_reg_equiv[i].profitable_p = true;
>> +  for (i = FIRST_PSEUDO_REGISTER; i < new_regno_start; i++)
>> +    if (lra_reg_info[i].nrefs != 0)
>> +      {
>> +	if ((hard_regno = lra_get_regno_hard_regno (i)) >= 0)
>> +	  {
>> +	    int j, nregs = hard_regno_nregs[hard_regno][PSEUDO_REGNO_MODE (i)];
>> +	
>> +	    for (j = 0; j < nregs; j++)
>> +	      df_set_regs_ever_live (hard_regno + j, true);
>> +	  }
>> +	else if ((x = get_equiv_substitution (regno_reg_rtx[i])) != NULL_RTX)
>> +	  {
>> +	    if (! first_p && contains_reg_p (x, false, false))
>> +	      /* After RTL transformation, we can not guarantee that
>> +		 pseudo in the substitution was not reloaded which
>> +		 might make equivalence invalid.  For example, in
>> +		 reverse equiv of p0
>> +
>> +		 p0 <- ...
>> +		 ...
>> +		 equiv_mem <- p0
>> +
>> +		 the memory address register was reloaded before the
>> +		 2nd insn.  */
>> +	      ira_reg_equiv[i].defined_p = false;
>> +	    if (contains_reg_p (x, false, true))
>> +	      ira_reg_equiv[i].profitable_p = false;
>> +	  }
>> +      }
> Do we need two loops because the second may check for equivalences
> of other pseudos besides "i"?  I couldn't see how offhand, but I might
> well have missed something.  Might be worth a comment.
No, we don't need two loops.  I merged them.
>> +	      dest_reg = SET_DEST (set);
>> +	      /* The equivalence pseudo could be set up as SUBREG in a
>> +		 case when it is a call restore insn in a mode
>> +		 different from the pseudo mode.  */
>> +	      if (GET_CODE (dest_reg) == SUBREG)
>> +		dest_reg = SUBREG_REG (dest_reg);
>> +	      if ((REG_P (dest_reg)
>> +		   && (x = get_equiv_substitution (dest_reg)) != dest_reg
>> +		   /* Remove insns which set up a pseudo whose value
>> +		      can not be changed.  Such insns might be not in
>> +		      init_insns because we don't update equiv data
>> +		      during insn transformations.
>> +			
>> +		      As an example, let suppose that a pseudo got
>> +		      hard register and on the 1st pass was not
>> +		      changed to equivalent constant.  We generate an
>> +		      additional insn setting up the pseudo because of
>> +		      secondary memory movement.  Then the pseudo is
>> +		      spilled and we use the equiv constant.  In this
>> +		      case we should remove the additional insn and
>> +		      this insn is not init_insns list.	 */
>> +		   && (! MEM_P (x) || MEM_READONLY_P (x)
>> +		       || in_list_p (curr_insn,
>> +				     ira_reg_equiv
>> +				     [REGNO (dest_reg)].init_insns)))
> This is probably a stupid question, sorry, but when do we ever want
> to keep an assignment to a substituted pseudo?  I.e. why isn't this just:
>
> 	      if ((REG_P (dest_reg)
> 		   && (x = get_equiv_substitution (dest_reg)) != dest_reg)
An equivalence can be a memory location.  It means you can assign many
different values to the location.  Removing the insns would generate
wrong code.  For example, if you use your simple variant of the code,
you will get more than 100 failures in the GCC testsuite.
>> +/* Info about last usage of registers in EBB to do inheritance/split
>> +   transformation.  Inheritance transformation is done from a spilled
>> +   pseudo and split transformations from a hard register or a pseudo
>> +   assigned to a hard register.	 */
>> +struct usage_insns
>> +{
>> +  /* If the value is equal to CURR_USAGE_INSNS_CHECK, then the member
>> +     value INSNS is valid.  The insns is chain of optional debug insns
>> +     and a finishing non-debug insn using the corresponding reg.  */
>> +  int check;
>> +  /* Value of global reloads_num at the ???corresponding next insns.  */
>> +  int reloads_num;
>> +  /* Value of global reloads_num at the ???corresponding next insns.  */
>> +  int calls_num;
> "???s".  Probably "at the last instruction in INSNS" if that's accurate
> (because debug insns in INSNS don't affect these fields).
Fixed.
>> +/* Process all regs OLD_REGNO in location *LOC and change them on the
>> +   reload pseudo NEW_REG.  Return true if any change was done.	*/
>> +static bool
>> +substitute_pseudo (rtx *loc, int old_regno, rtx new_reg)
> This is another case where I found the term "reload pseudo" a bit confusing,
> since AIUI new_reg can be an inheritance or split pseudo rather than a pseudo
> created solely for insn reloads.  I'll follow up about that on the original
> thread.  Maybe just:
>
> /* Replace all references to register OLD_REGNO in *LOC with pseudo register
>     NEW_REG.  Return true if any change was made.  */
Fixed.
>> +  code = GET_CODE (x);
>> +  if (code == REG && (int) REGNO (x) == old_regno)
>> +    {
>> +      *loc = new_reg;
>> +      return true;
>> +    }
> Maybe assert that the modes are the same?
I've just realized that the modes might be different because of
secondary_memory_needed_mode in split_reg (probably a small chance).  So
I added code dealing with it.  Thanks, Richard.
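A sketch of such code, assuming a lowpart subreg is the right way to
reconcile the modes (the real fix may differ):

  if (code == REG && (int) REGNO (x) == old_regno)
    {
      /* The modes can differ because of secondary_memory_needed_mode
	 used in split_reg, so wrap NEW_REG in a subreg if needed.  */
      if (GET_MODE (x) != GET_MODE (new_reg))
	new_reg = gen_lowpart_SUBREG (GET_MODE (x), new_reg);
      *loc = new_reg;
      return true;
    }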
>> +/* Do inheritance transformation for insn INSN defining (if DEF_P) or
>> +   using ORIGINAL_REGNO where the subsequent insn(s) in EBB (remember
>> +   we traverse insns in the backward direction) for the original regno
>> +   is NEXT_USAGE_INSNS.	 The transformations look like
> Maybe:
>
> /* Do inheritance transformations for insn INSN, which defines (if DEF_P)
>     or uses ORIGINAL_REGNO.  NEXT_USAGE_INSNS specifies which instruction
>     in the EBB next uses ORIGINAL_REGNO; it has the same form as the
>     "insns" field of usage_insns.
>
>     The transformations look like:
Fixed.
>> +
>> +     p <- ...		  i <- ...
>> +     ...		  p <- i    (new insn)
>> +     ...	     =>
>> +     <- ... p ...	  <- ... i ...
>> +   or
>> +     ...		  i <- p    (new insn)
>> +     <- ... p ...	  <- ... i ...
>> +     ...	     =>
>> +     <- ... p ...	  <- ... i ...
>> +   where p is a spilled original pseudo and i is a new inheritance pseudo.
>> +
>> +   The inheritance pseudo has the smallest class of two classes CL and
>> +   class of ORIGINAL REGNO.  It will have unique value if UNIQ_P.  The
>> +   unique value is necessary for correct assignment to inheritance
>> +   pseudo for input of an insn which should be the same as output
>> +   (bound pseudos).  Return true if we succeed in such
>> +   transformation.  */
> This comment looks really good, but I still wasn't sure about the
> UNIQ_P thing.  AIUI this is for cases like:
>
>                         i <- p            [new insn]
>     r <- ... p ...      r <- ... i ...    [input reload]
>     r <- ... r ...   => r <- ... r ...    [original insn]
>     <- r                <- r              [output reload]
>     ....                ......
>     <- ... p ...        <- ... i ...      [next ref]
>
> where "r" is used on both sides of the original insn and where the
> output reload assigns to something other than "p" (otherwise "next ref"
> wouldn't be the next ref).  But why does this affect the way "i" is created?
> I think it'd be worth expanding that part a bit.
I realized that the code is not necessary anymore.  It was necessary
when I generated 2 pseudos for reloading matching operands with
different modes.  Now I use one pseudo for this and use subregs of this
pseudo (which may be illegal in other parts of GCC).  I modified the
code.
>> +  if (! ira_reg_classes_intersect_p[cl][rclass])
>> +    {
>> +      if (lra_dump_file != NULL)
>> +	{
>> +	  fprintf (lra_dump_file,
>> +		   "	Rejecting inheritance for %d "
>> +		   "because of too different classes %s and %s\n",
> Suggest s/too different/disjoint/
Fixed.
>> +  if ((ira_class_subset_p[cl][rclass] && cl != rclass)
>> +      || ira_class_hard_regs_num[cl] < ira_class_hard_regs_num[rclass])
>> +    {
>> +      if (lra_dump_file != NULL)
>> +	fprintf (lra_dump_file, "    Use smallest class of %s and %s\n",
>> +		 reg_class_names[cl], reg_class_names[rclass]);
>> +
>> +      rclass = cl;
>> +    }
> I don't understand the second line of the if statement.  Why do we prefer
> classes with fewer allocatable registers?
>
> My guess before reading the code was that we'd use the subunion of CL and
> RCLASS, so maybe a comment explaining why we use this choice would help.
I added a comment:

      /* We don't use a subset of two classes because it can be
         NO_REGS.  This transformation is still profitable in most
         cases even if the classes are not intersected as a register
         move is probably cheaper than a memory load.  */

>> +  if (NEXT_INSN (new_insns) != NULL_RTX)
>> +    {
>> +      if (lra_dump_file != NULL)
>> +	{
>> +	  fprintf (lra_dump_file,
>> +		   "	Rejecting inheritance %d->%d "
>> +		   "as it results in 2 or more insns:\n",
>> +		   original_regno, REGNO (new_reg));
>> +	  debug_rtl_slim (lra_dump_file, new_insns, NULL_RTX, -1, 0);
>> +	  fprintf (lra_dump_file,
>> +		   "	>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
>> +	}
>> +      return false;
>> +    }
> Hmm, I wasn't sure about this at first.  Some targets define patterns for
> multiword moves and split them later.  Others expose the split straight away.
> The two approaches don't really imply any difference in cost, so I didn't
> want us to penalise the latter.
>
> But I suppose on targets that split straight away, lower-subreg would
> tend to replace the multiword pseudo with individual word-sized pseudos,
> so LRA shouldn't see them.  I suppose this check shouldn't matter in practice.
I think I saw it in practice and therefore I added the code.  I don't
remember the details, though.  I tried to document some specific cases
in LRA in the comments.  But probably I should have made more comments
with examples in the code because it is easy to forget them (too many
cases and details).
>> +  if (def_p)
>> +    lra_process_new_insns (insn, NULL_RTX, new_insns,
>> +			   "Add original<-inheritance");
>> +  else
>> +    lra_process_new_insns (insn, new_insns, NULL_RTX,
>> +			   "Add inheritance<-pseudo");
> Maybe "original" rather than "pseudo" here too for consistency.
Fixed.
>> +/* Return true if we need a split for hard register REGNO or pseudo
>> +   REGNO which was assigned to a hard register.
>> +   POTENTIAL_RELOAD_HARD_REGS contains hard registers which might be
>> +   used for reloads since the EBB end.	It is an approximation of the
>> +   used hard registers in the split range.  The exact value would
>> +   require expensive calculations.  If we were aggressive with
>> +   splitting because of the approximation, the split pseudo will save
>> +   the same hard register assignment and will be removed in the undo
>> +   pass.  We still need the approximation because too aggressive
>> +   splitting would result in too inaccurate cost calculation in the
>> +   assignment pass because of too many generated moves which will be
>> +   probably removed in the undo pass.  */
>> +static inline bool
>> +need_for_split_p (HARD_REG_SET potential_reload_hard_regs, int regno)
>> +{
>> +  int hard_regno = regno < FIRST_PSEUDO_REGISTER ? regno : reg_renumber[regno];
>> +
>> +  lra_assert (hard_regno >= 0);
>> +  return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
>> +	   && ! TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno)
>> +	   && (usage_insns[regno].reloads_num
>> +	       + (regno < FIRST_PSEUDO_REGISTER ? 0 : 2) < reloads_num)
>> +	   && ((regno < FIRST_PSEUDO_REGISTER
>> +		&& ! bitmap_bit_p (&ebb_global_regs, regno))
>> +	       || (regno >= FIRST_PSEUDO_REGISTER
>> +		   && lra_reg_info[regno].nrefs > 3
>> +		   && bitmap_bit_p (&ebb_global_regs, regno))))
>> +	  || (regno >= FIRST_PSEUDO_REGISTER && need_for_call_save_p (regno)));
>> +}
> Could you add more commentary about the thinking behind this particular
> choice of heuristic?  E.g. I wasn't sure what the reloads_num check did,
> or why we only split hard registers that are local to the EBB and only
> split pseudos that aren't.
>
> The 2 and 3 numbers seemed a bit magic too.  I suppose the 2 has
> something to do with "one save and one restore", but I wasn't sure
> why we applied it only for pseudos.  (AIUI that arm of the check
> deals with "genuine" split pseudos rather than call saves & restores.)
>
> Still, it says a lot for the high quality of LRA that, out of all the
> 1000s of lines of code I've read so far, this is the only part that
> didn't seem to have an intuitive justification.
Yes, right.  I checked many parameters and finally picked out the above
ones.  What I found is that aggressive and even moderate splitting
usually results in worse code.  Most splitting is undone, and if we
split aggressively we generate a lot of insns which will be removed,
but their costs are taken into account during the assignment pass.
Inheritance has a much bigger chance to be successful.  One reason why
splitting is undone is that spilling + inheritance of short-living
pseudos is a substitute for splitting in some way.  Therefore I do
splitting only for long-living pseudos, which means non-local pseudos.

I added the following comments.

   return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
            && ! TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno)
            /* We need at least 2 reloads to make pseudo splitting
               profitable.  We should provide hard regno splitting in
               any case to solve the 1st insn scheduling problem when
               moving a hard register definition up might make it
               impossible to find a hard register for a reload pseudo
               of a small register class.  */
            && (usage_insns[regno].reloads_num
                + (regno < FIRST_PSEUDO_REGISTER ? 0 : 2) < reloads_num)
            && (regno < FIRST_PSEUDO_REGISTER
                /* For short-living pseudos, spilling + inheritance can
                   be considered a substitute for splitting.
                   Therefore we do not split local pseudos.  It also
                   decreases the aggressiveness of splitting.  The
                   minimal number of references is chosen taking into
                   account that for 2 references splitting makes no
                   sense as we can just spill the pseudo.  */
                || (regno >= FIRST_PSEUDO_REGISTER
                    && lra_reg_info[regno].nrefs > 3
                    && bitmap_bit_p (&ebb_global_regs, regno))))

I removed the check that hard_regno is local.  It was a parameter to
decrease the aggressiveness of hard register splitting, but I realized
that it contradicts the reason for splitting (solving the 1st insn
scheduling problem), because regional scheduling can move a hard
register definition beyond the BB.
>> +  for (i = 0;
>> +       (cl = reg_class_subclasses[allocno_class][i]) != LIM_REG_CLASSES;
>> +       i++)
>> +    if (! SECONDARY_MEMORY_NEEDED (cl, hard_reg_class, mode)
>> +	&& ! SECONDARY_MEMORY_NEEDED (hard_reg_class, cl, mode)
>> +	&& TEST_HARD_REG_BIT (reg_class_contents[cl], hard_regno)
>> +	&& (best_cl == NO_REGS
>> +	    || (hard_reg_set_subset_p (reg_class_contents[best_cl],
>> +				       reg_class_contents[cl])
>> +		&& ! hard_reg_set_equal_p (reg_class_contents[best_cl],
>> +					   reg_class_contents[cl]))))
>> +      best_cl = cl;
> OK, so this suggestion isn't backed up by any evidence, but what do
> you think about this alternative:
>
> 	&& (best_cl == NO_REGS
> 	    || (ira_class_hard_regs_num[best_cl]
> 		< ira_class_hard_regs_num[cl]))
>
> which should choose the largest class that requires no secondary memory.
> It looks like the subset version could get "stuck" on a single-register
> class that happens to be early in the list but has no superclass smaller
> than allocno_class.
It seems reasonable.  I chose your variant.
>> +/* Do split transformation for insn INSN defining or
>> +   using ORIGINAL_REGNO where the subsequent insn(s) in EBB (remember
>> +   we traverse insns in the backward direction) for the original regno
>> +   is NEXT_USAGE_INSNS.	 The transformations look like
> Same suggestion as for the inheritance function above.
Fixed.
>> +  if (call_save_p)
>> +    save = emit_spill_move (true, new_reg, original_reg, -1);
>> +  else
>> +    {
>> +      start_sequence ();
>> +      emit_move_insn (new_reg, original_reg);
>> +      save = get_insns ();
>> +      end_sequence ();
>> +    }
>> +  if (NEXT_INSN (save) != NULL_RTX)
>> +    {
>> +      lra_assert (! call_save_p);
> Is emit_spill_move really guaranteed to return only one instruction in
> cases where emit_move_insn might not?  Both of them use emit_move_insn_1
> internally, so I wouldn't have expected much difference.
>
> In fact I wasn't really sure why:
>
>    save = gen_move_insn (new, original_reg);
For a caller save, the modes of new and original_reg can be different.
> wouldn't be correct for both.
>
> Same comments for the restore code.
I fixed it using only emit_spill_move.
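I.e. both directions now go through it, roughly like this (a sketch,
assuming the first argument selects the direction of the move):

  save = emit_spill_move (true, new_reg, original_reg, -1);
  ...
  restore = emit_spill_move (false, new_reg, original_reg, -1);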
>> +      /* See which defined values die here.  */
>> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
>> +	if (reg->type == OP_OUT && ! reg->early_clobber
>> +	    && (! reg->subreg_p
>> +		|| bitmap_bit_p (&lra_bound_pseudos, reg->regno)))
>> +	  bitmap_clear_bit (&live_regs, reg->regno);
>> +      /* Mark each used value as live.	*/
>> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
>> +	if (reg->type == OP_IN
>> +	    && bitmap_bit_p (&check_only_regs, reg->regno))
>> +	  bitmap_set_bit (&live_regs, reg->regno);
>> +      /* Mark early clobber outputs dead.  */
>> +      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
>> +	if (reg->type == OP_OUT && reg->early_clobber && ! reg->subreg_p)
>> +	  bitmap_clear_bit (&live_regs, reg->regno);
> I don't think this would be correct for unreloaded insns because an
> unreloaded insn can have the same pseudo as an input and an earlyclobber
> output.  (Probably not an issue here, since we're called after the
> constraints pass.)  There's also the case of matched earlyclobber operands,
> where the matched input is specifically not affected by the earlyclobber.
>
> I'd have thought:
>
>        /* See which defined values die here.  */
>        for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> 	if (reg->type == OP_OUT
> 	    && (! reg->subreg_p
> 		|| bitmap_bit_p (&lra_bound_pseudos, reg->regno)))
> 	  bitmap_clear_bit (&live_regs, reg->regno);
>        /* Mark each used value as live.	*/
>        for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> 	if (reg->type == OP_IN
> 	    && bitmap_bit_p (&check_only_regs, reg->regno))
> 	  bitmap_set_bit (&live_regs, reg->regno);
>
> ought to be correct, but perhaps I'm missing something.
>
> (I'm still uneasy about the special treatment of bound pseudos here.
> A clobber really does seem better.)
That seems like the right variant.  I am going to use it.
>> +      /* It is quite important to remove dead move insns because it
>> +	 means removing dead store, we don't need to process them for
>> +	 constraints, and unfortunately some subsequent optimizations
>> +	 (like shrink-wrapping) currently based on assumption that
>> +	 there are no trivial dead insns.  */
> Maybe best to drop the "subsequent optimizations" part.  This comment
> is unlikely to be updated after any change to shrink-wrapping & co.,
> and the first two justifications seem convincing enough on their own.
Fixed.
>> +/* Add inheritance info REGNO and INSNS.  */
>> +static void
>> +add_to_inherit (int regno, rtx insns)
>> +{
>> +  int i;
>> +
>> +  for (i = 0; i < to_inherit_num; i++)
>> +    if (to_inherit[i].regno == regno)
>> +      return;
> Is the existing "insns" field guaranteed to match the "insns" parameter
> in this case, or might they be different?  Probably worth an assert or
> comment respectively.
I added a comment.
>> +/* Return first (if FIRST_P) or last non-debug insn in basic block BB.
>> +   Return null if there are no non-debug insns in the block.  */
>> +static rtx
>> +get_non_debug_insn (bool first_p, basic_block bb)
>> +{
>> +  rtx insn;
>> +
>> +  for (insn = first_p ? BB_HEAD (bb) : BB_END (bb);
>> +       insn != NULL_RTX && ! NONDEBUG_INSN_P (insn);
>> +       insn = first_p ? NEXT_INSN (insn) : PREV_INSN (insn))
>> +    ;
>> +  if (insn != NULL_RTX && BLOCK_FOR_INSN (insn) != bb)
>> +    insn = NULL_RTX;
>> +  return insn;
>> +}
> It probably doesn't matter in practice, but it looks like it'd be better
> to limit the walk to the bb, rather than walking until null and then
> testing the bb after the walk.
>
> Maybe it would be easier to split into two functions, since first_p is
> always constant.  E.g.:
>
>    rtx insn;
>
>    FOR_BB_INSNS (bb, insn)
>      if (NONDEBUG_INSN_P (insn))
>        return insn;
>    return NULL_RTX;
>
> for first_p.  s/FOR_BB_INSNS/FOR_BB_INSNS_REVERSE/ for !first_p.
That is better.  Fixed.
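So the !first_p counterpart becomes (a sketch, with a hypothetical name):

static rtx
get_last_non_debug_insn (basic_block bb)
{
  rtx insn;

  FOR_BB_INSNS_REVERSE (bb, insn)
    if (NONDEBUG_INSN_P (insn))
      return insn;
  return NULL_RTX;
}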
>> +/* Set up RES by registers living on edges FROM except the edge (FROM,
>> +   TO) or by registers set up in a jump insn in BB FROM.  */
>> +static void
>> +get_live_on_other_edges (basic_block from, basic_block to, bitmap res)
>> +{
>> +  int regno;
>> +  rtx last;
>> +  struct lra_insn_reg *reg;
>> +  edge e;
>> +  edge_iterator ei;
>> +
>> +  lra_assert (to != NULL);
>> +  bitmap_clear (res);
>> +  FOR_EACH_EDGE (e, ei, from->succs)
>> +    if (e->dest != to)
>> +      bitmap_ior_into (res, DF_LR_IN (e->dest));
>> +  if ((last = get_non_debug_insn (false, from)) == NULL_RTX || ! JUMP_P (last))
>> +    return;
>> +  curr_id = lra_get_insn_recog_data (last);
>> +  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
>> +    if (reg->type != OP_IN
>> +	&& (regno = reg->regno) >= FIRST_PSEUDO_REGISTER)
>> +      bitmap_set_bit (res, regno);
>> +}
> Probably a silly question, sorry, but: why does the JUMP_P part only
> include pseudo registers?  The other calculations (here and elsewhere)
> seem to handle both hard and pseudo registers.
No, that is not a silly question.  I think it is a potential bug,
although its probability is very small.  Originally, splitting and
inheritance were done only on pseudos.  After that, splitting of hard
registers was added and that code was not changed.  Thanks, Richard.  I
removed this condition.
>> +/* Do inheritance/split transformations in EBB starting with HEAD and
>> +   finishing on TAIL.  We process EBB insns in the reverse order.
>> +   Return true if we did any inheritance/split transformation in the
>> +   EBB.
>> +
>> +   We should avoid excessive splitting which results in worse code
>> +   because of inaccurate cost calculations for spilling new split
>> +   pseudos in such case.  To achieve this we do splitting only if
>> +   register pressure is high in given basic block and there reload
> "...and there are reload"
Fixed.
>> +   pseudos requiring hard registers.  We could do more register
>> +   pressure calculations at any given program point to avoid necessary
>> +   splitting even more but it is to expensive and the current approach
>> +   is well enough.  */
> "works well enough".
Fixed.
>> +  change_p = false;
>> +  curr_usage_insns_check++;
>> +  reloads_num = calls_num = 0;
>> +  /* Remember: we can remove the current insn.	*/
>> +  bitmap_clear (&check_only_regs);
>> +  last_processed_bb = NULL;
> I couldn't tell which part of the code the comment is referring to.
> Maybe left over?
It is a leftover.  I removed it.
>> +	  after_p = (last_insn != NULL_RTX && ! JUMP_P (last_insn)
>> +		     && (! CALL_P (last_insn)
>> +			 || (find_reg_note (last_insn,
>> +					   REG_NORETURN, NULL) == NULL_RTX
>> +			     && ((next_insn
>> +				  = next_nonnote_nondebug_insn (last_insn))
>> +				 == NULL_RTX
>> +				 || GET_CODE (next_insn) != BARRIER))));
> Genuine question, but: when are the last four lines needed?  The condition
> that they're testing for sounds like a noreturn call.
Yes, you are probably right.  I also can not imagine such a situation.
I am removing them.
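So the condition presumably reduces to (a sketch):

	  after_p = (last_insn != NULL_RTX && ! JUMP_P (last_insn)
		     && (! CALL_P (last_insn)
			 || find_reg_note (last_insn,
					   REG_NORETURN, NULL) == NULL_RTX));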
>> +      if (src_regno < lra_constraint_new_regno_start
>> +	  && src_regno >= FIRST_PSEUDO_REGISTER
>> +	  && reg_renumber[src_regno] < 0
>> +	  && dst_regno >= lra_constraint_new_regno_start
>> +	  && (cl = lra_get_allocno_class (dst_regno)) != NO_REGS)
>> +	{
>> +	  /* 'reload_pseudo <- original_pseudo'.  */
>> +	  reloads_num++;
>> +	  succ_p = false;
>> +	  if (usage_insns[src_regno].check == curr_usage_insns_check
>> +	      && (next_usage_insns = usage_insns[src_regno].insns) != NULL_RTX)
>> +	    succ_p = inherit_reload_reg (false,
>> +					 bitmap_bit_p (&lra_matched_pseudos,
>> +						       dst_regno),
>> +					 src_regno, cl,
>> +					 curr_insn, next_usage_insns);
>> +	  if (succ_p)
>> +	    change_p = true;
>> +	  else
>> +	    {
>> +	      usage_insns[src_regno].check = curr_usage_insns_check;
>> +	      usage_insns[src_regno].insns = curr_insn;
>> +	      usage_insns[src_regno].reloads_num = reloads_num;
>> +	      usage_insns[src_regno].calls_num = calls_num;
>> +	      usage_insns[src_regno].after_p = false;
>> +	    }
> Looks like this and other places could use the add_next_usage_insn
> helper function.
Fixed.
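Based on the quoted assignments, the helper presumably packages the
bookkeeping like this (a minimal sketch; the real helper may also have
to chain optional debug insns):

static void
add_next_usage_insn (int regno, rtx insn, int reloads_num)
{
  usage_insns[regno].check = curr_usage_insns_check;
  usage_insns[regno].insns = insn;
  usage_insns[regno].reloads_num = reloads_num;
  usage_insns[regno].calls_num = calls_num;
  usage_insns[regno].after_p = false;
}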
>> +	  if (cl != NO_REGS
>> +	      && hard_reg_set_subset_p (reg_class_contents[cl],
>> +					live_hard_regs))
>> +	    IOR_HARD_REG_SET (potential_reload_hard_regs,
>> +			      reg_class_contents[cl]);
> Redundant "cl != NO_REGS" check.  (Was a bit confused by that at first.)
Fixed.
> I don't understand the way potential_reload_hard_regs is set up.
> Why does it only include reload pseudos involved in moves of the form
> "reload_pseudo <- original_pseudo" and "original_pseudo <- reload_pseudo",
> but include those reloads regardless of whether inheritance is possible?
>
> I wondered whether it might be deliberately selective in order to speed
> up LRA, but we walk all the registers in an insn regardless.
>
> Same for reloads_num.
These are just simple heuristics.  They can be changed.  Adding new
heuristics (more complicated ones) requires a lot of experiments.  This
work can be done later.  LRA as it is now is not frozen.  Its
development will be continued.
>> +	  if (cl != NO_REGS
>> +	      && hard_reg_set_subset_p (reg_class_contents[cl],
>> +					live_hard_regs))
>> +	    IOR_HARD_REG_SET (potential_reload_hard_regs,
>> +			      reg_class_contents[cl]);
> Same comment as for the previous block.
Fixed.
>> +		if (reg_renumber[dst_regno] < 0
>> +		    || (reg->type == OP_OUT && ! reg->subreg_p))
>> +		/* Invalidate.	*/
>> +		usage_insns[dst_regno].check = 0;
> Could you explain this condition a bit more?  Why does reg_renumber
> affect things?
I added a comment.
>> +/* This value affects EBB forming.  If probability of edge from EBB to
>> +   a BB is not greater than the following value, we don't add the BB
>> +   to EBB.  */
>> +#define EBB_PROBABILITY_CUTOFF (REG_BR_PROB_BASE / 2)
> It looks like schedule_ebbs uses a higher default cutoff for FDO.
> Would the same distinction be useful here?
>
> Maybe schedule_ebbs-like params would be good here too.
I am thinking about adding several parameters for IRA and LRA.  But I'd
like to add them for the next release.
>> +  bitmap_and (&temp_bitmap_head, removed_pseudos, live);
>> +  EXECUTE_IF_SET_IN_BITMAP (&temp_bitmap_head, 0, regno, bi)
> This isn't going to have much effect on compile time, but
> EXECUTE_IF_AND_IN_BITMAP avoids the need for a temporary bitmap.
Fixed.
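I.e. something like this, without the temporary bitmap (a sketch):

  EXECUTE_IF_AND_IN_BITMAP (removed_pseudos, live, 0, regno, bi)
    {
      /* Process each pseudo that is both removed and live.  */
      ...
    }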
>> +/* Remove inheritance/split pseudos which are in REMOVE_PSEUDOS and
>> +   return true if we did any change.  The undo transformations for
>> +   inheritance looks like
>> +      i <- i2
>> +      p <- i	  =>   p <- i2
>> +   or removing
>> +      p <- i, i <- p, and i <- i3
>> +   where p is original pseudo from which inheritance pseudo i was
>> +   created, i and i3 are removed inheritance pseudos, i2 is another
>> +   not removed inheritance pseudo.  All split pseudos or other
>> +   occurrences of removed inheritance pseudos are changed on the
>> +   corresponding original pseudos.  */
>> +static bool
>> +remove_inheritance_pseudos (bitmap remove_pseudos)
>> +{
>> +  basic_block bb;
>> +  int regno, sregno, prev_sregno, dregno, restore_regno;
>> +  rtx set, prev_set, prev_insn;
>> +  bool change_p, done_p;
>> +
>> +  change_p = ! bitmap_empty_p (remove_pseudos);
> I wondered from the comment why we couldn't just return straight away
> for the empty set, but it looks like the function also schedules a
> constraints pass for instructions that keep their inheritance or
> split pseudos.  Is that right?  Might be worth mentioning that
> in the function comment if so.
Yes, that is right.  I've added a comment.
>> +	      else if (bitmap_bit_p (remove_pseudos, sregno)
>> +		       && bitmap_bit_p (&lra_inheritance_pseudos, sregno))
>> +		{
>> +		  /* Search the following pattern:
>> +		       inherit_or_split_pseudo1 <- inherit_or_split_pseudo2
>> +		       original_pseudo <- inherit_or_split_pseudo1
>> +		    where the 2nd insn is the current insn and
>> +		    inherit_or_split_pseudo2 is not removed.  If it is found,
>> +		    change the current insn onto:
>> +		       original_pseudo1 <- inherit_or_split_pseudo2.  */
> s/original_pseudo1/original_pseudo/ I think (we don't change the destination).
Yes, it is a typo.  Fixed.
>> +		  for (prev_insn = PREV_INSN (curr_insn);
>> +		       prev_insn != NULL_RTX && ! NONDEBUG_INSN_P (prev_insn);
>> +		       prev_insn = PREV_INSN (prev_insn))
>> +		    ;
>> +		  if (prev_insn != NULL_RTX && BLOCK_FOR_INSN (prev_insn) == bb
>> +		      && (prev_set = single_set (prev_insn)) != NULL_RTX
>> +		      /* There should be no subregs in insn we are
>> +			 searching because only the original reg might
>> +			 be in subreg when we changed the mode of
>> +			 load/store for splitting.  */
>> +		      && REG_P (SET_DEST (prev_set))
>> +		      && REG_P (SET_SRC (prev_set))
>> +		      && (int) REGNO (SET_DEST (prev_set)) == sregno
>> +		      && ((prev_sregno = REGNO (SET_SRC (prev_set)))
>> +			  >= FIRST_PSEUDO_REGISTER)
>> +		      && (lra_reg_info[sregno].restore_regno
>> +			  == lra_reg_info[prev_sregno].restore_regno)
>> +		      && ! bitmap_bit_p (remove_pseudos, prev_sregno))
> I'm sure the restore_regno comparison near the end is correct,
> but could you add a comment to explain it?  The substitution
> itself seems OK either way.
I added a comment.
>> +	      struct lra_insn_reg *reg;
>> +	      bool insn_change_p = false;
>> +
>> +	      curr_id = lra_get_insn_recog_data (curr_insn);
>> +	      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
>> +		if ((regno = reg->regno) >= lra_constraint_new_regno_start
>> +		    && lra_reg_info[regno].restore_regno >= 0)
> Is the first part of the comparison needed?  Most other places don't check,
> so it looked at first glance like there was something special here.
I think it is a leftover from an older implementation where I handled
restore_reg differently.  I removed it.
>> +		  {
>> +		    if (change_p && bitmap_bit_p (remove_pseudos, regno))
>> +		      {
>> +			restore_regno = lra_reg_info[regno].restore_regno;
>> +			substitute_pseudo (&curr_insn, regno,
>> +					   regno_reg_rtx[restore_regno]);
>> +			insn_change_p = true;
>> +		      }
>> +		    else if (NONDEBUG_INSN_P (curr_insn))
>> +		      {
>> +			lra_push_insn_and_update_insn_regno_info (curr_insn);
>> +			lra_set_used_insn_alternative_by_uid
>> +			  (INSN_UID (curr_insn), -1);
>> +		      }
>> +		  }
>> +	      if (insn_change_p)
>> +		{
>> +		  lra_update_insn_regno_info (curr_insn);
>> +		  if (lra_dump_file != NULL)
>> +		    {
>> +		      fprintf (lra_dump_file, "	   Restore original insn:\n");
>> +		      debug_rtl_slim (lra_dump_file,
>> +				      curr_insn, curr_insn, -1, 0);
>> +		    }
>> +		}
> AIUI we could have a partial restore, keeping some registers but
> restoring others.  Is that right?  The dump entry made it sounds
> like a full restore.
Yes, the dump is a bit misleading.  It might be a partial restore.
> Maybe something like:
>
> 	      struct lra_insn_reg *reg;
> 	      bool restored_regs_p = false;
> 	      bool kept_regs_p = false;
>
> 	      curr_id = lra_get_insn_recog_data (curr_insn);
> 	      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
> 		{
> 		  regno = reg->regno;
> 		  restore_regno = lra_reg_info[regno].restore_regno;
> 		  if (restore_regno >= 0)
> 		    {
> 		      if (change_p && bitmap_bit_p (remove_pseudos, regno))
> 			{
> 			  substitute_pseudo (&curr_insn, regno,
> 					     regno_reg_rtx[restore_regno]);
> 			  restored_regs_p = true;
> 			}
> 		      else
> 			kept_regs_p = true;
> 		    }
> 		}
> 	      if (NONDEBUG_INSN_P (curr_insn) && kept_regs_p)
> 		{
> 		  /* The instruction has changed since the previous
> 		     constraints pass.  */
> 		  lra_push_insn_and_update_insn_regno_info (curr_insn);
> 		  lra_set_used_insn_alternative_by_uid
> 		    (INSN_UID (curr_insn), -1);
> 		}
> 	      else if (restored_regs_p)
> 		/* The instruction has been restored to the form that
> 		   it had during the previous constraints pass.  */
> 		lra_update_insn_regno_info (curr_insn);
>
> 	      if (restored_regs_p && lra_dump_file != NULL)
> 		{
> 		  fprintf (lra_dump_file,
> 			   "	   Insn after restoring regs:\n");
> 		  debug_rtl_slim (lra_dump_file, curr_insn, curr_insn, -1, 0);
> 		}
>
> (if correct) might make the partial case clearer, but that's personal
> preference, so please feel free to ignore, chop or change.
I used your variant.
> Also, is regno_reg_rtx[restore_regno] always correct when restoring
> registers?  I thought restore_regno could be a hard register and that
> the hard register might not necessarily be used in the same mode as
> the regno_reg_rtx[] entry.
Right.  I fixed it.
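A sketch of the fix, assuming the hard register case just needs a REG
in the mode of the pseudo being replaced (restore_rtx is a hypothetical
local):

	  restore_rtx = (restore_regno >= FIRST_PSEUDO_REGISTER
			 ? regno_reg_rtx[restore_regno]
			 : gen_rtx_REG (GET_MODE (regno_reg_rtx[regno]),
					restore_regno));
	  substitute_pseudo (&curr_insn, regno, restore_rtx);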
> That just leaves lra.h, lra-int.h and lra.c itself.  I'm hoping to have
> read through those by the middle of next week, but any comments about them
> will probably just be banal stuff (even more than my comments so far :-))
> so I deliberately left them to last.
>
Richard, thanks again for your invaluable feedback, which has made and
will continue to make LRA better.
Richard Sandiford Oct. 17, 2012, 11:24 a.m. UTC | #14
Thanks for all the updates.

Vladimir Makarov <vmakarov@redhat.com> writes:
>>> +	  /* index * scale + disp => new base + index * scale  */
>>> +	  enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
>>> +
>>> +	  lra_assert (INDEX_REG_CLASS != NO_REGS);
>>> +	  new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
>>> +	  lra_assert (GET_CODE (*addr_loc) == PLUS);
>>> +	  lra_emit_move (new_reg, *ad.disp_loc);
>>> +	  if (CONSTANT_P (XEXP (*addr_loc, 1)))
>>> +	    XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
>>> +	  XEXP (*addr_loc, 0) = new_reg;
>> The canonical form is (plus (mult ...) (reg)) rather than
>> (plus (reg) (mult ...)), but it looks like we create the latter.
>> I realise you try both forms here:
> It might happen because of equiv substitution in LRA.
>>> +	  /* Some targets like ARM, accept address operands in
>>> +	     specific order -- try exchange them if necessary.	*/
>>> +	  if (! valid_address_p (mode, *addr_loc, as))
>>> +	    {
>>> +	      exchange_plus_ops (*addr_loc);
>>> +	      if (! valid_address_p (mode, *addr_loc, as))
>>> +		exchange_plus_ops (*addr_loc);
>>> +	    }
>> but I think we should try the canonical form first.  And I'd prefer it
>> if we didn't try the other form at all, especially in 4.8.  It isn't
>> really the backend's job to reject non-canonical rtl.  This might well
>> be another case where some targets need a (hopefully small) tweak in
>> order to play by the rules.
>>
>> Also, I suppose this section of code feeds back to my question on
>> Wednesday about the distinction that LRA seems to make between the
>> compile-time constant in:
>>
>>    (plus (reg X1) (const_int Y1))
>>
>> and the link-time constant in:
>>
>>    (plus (reg X2) (symbol_ref Y2))
>>
>> It looked like extract_address_regs classified X1 as a base register and
>> X2 as an index register.  The difference between the two constants has
>> no run-time significance though, and I think we should handle both X1
>> and X2 as base registers (as I think reload does).
>>
>> I think the path above would then be specific to scaled indices.
>> In the original address the "complex" index must come first and the
>> displacement second.  In the modified address, the index would stay
>> first and the new base register would be second.  More below.
> As I wrote above, the problem is also that equiv substitution can
> create non-canonical forms.

Right.  Just in case there's a misunderstanding: I'm not complaining
about these routines internally using forms that are noncanonical
(which could happen because of equiv substitution, like you say).
I just think that what we eventually try to validate should be canonical.
In a way it's similar to how the simplify-rtx.c routines work.

If there are targets that only accept noncanonical rtl (which is after
all just a specific type of invalid rtl), they need to be fixed.

>>> +      /* base + scale * index + disp => new base + scale * index  */
>>> +      new_reg = base_plus_disp_to_reg (mode, as, &ad);
>>> +      *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
>>> +      if (! valid_address_p (mode, *addr_loc, as))
>>> +	{
>>> +	  /* Some targets like ARM, accept address operands in
>>> +	     specific order -- try exchange them if necessary.	*/
>>> +	  exchange_plus_ops (*addr_loc);
>>> +	  if (! valid_address_p (mode, *addr_loc, as))
>>> +	    exchange_plus_ops (*addr_loc);
>>> +	}
>> Same comment as above about canonical rtl.  Here we can have two
>> registers -- in which case the base should come first -- or a more
>> complex index -- in which case the index should come first.
>>
>> We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...),
>> with the operands in either order, and let it take care of the details.
>> Using simplify_gen_binary would help with the earlier index+disp case too.
> Equiv substitution can create non-canonical forms.  There are 2 approaches:
> o have code for dealing with non-canonical forms (equiv substitution,
> target stupidity)
> o always support canonical forms and require them from targets.
>
> I decided to use the 1st variant but I am reconsidering it.  I'll try to
> fix it before inclusion.  But I am not sure I have time for this.  All
> these changes make LRA unstable.  In fact, I've just found that the
> changes I already made so far broke 2 SPEC2000 tests, although the GCC
> testsuite and bootstrap look good.

OK.  I'm happy to try fixing the noncanonical thing.

>>> +  /* If this is post-increment, first copy the location to the reload reg.  */
>>> +  if (post && real_in != result)
>>> +    emit_insn (gen_move_insn (result, real_in));
>> Nit, but real_in != result can never be true AIUI, and I was confused how
>> the code could be correct in that case.  Maybe just remove it, or make
>> it an assert?
> No, it might be true:
>
> real_in = in == value ? incloc : in;
> ...
> if (cond)
>    result = incloc;
> else
>    result = ...
>
> if (post && real_in != result)
>
> So it is true if in==value && cond

Sorry, what I meant was that cond is "! post && REG_P (incloc)":

  if (! post && REG_P (incloc))
    result = incloc;
  else
    result = lra_create_new_reg (GET_MODE (value), value, new_rclass,
				 "INC/DEC result");

so it can never be true in the "post" case quoted above.

>>> +	      dest_reg = SET_DEST (set);
>>> +	      /* The equivalence pseudo could be set up as SUBREG in a
>>> +		 case when it is a call restore insn in a mode
>>> +		 different from the pseudo mode.  */
>>> +	      if (GET_CODE (dest_reg) == SUBREG)
>>> +		dest_reg = SUBREG_REG (dest_reg);
>>> +	      if ((REG_P (dest_reg)
>>> +		   && (x = get_equiv_substitution (dest_reg)) != dest_reg
>>> +		   /* Remove insns which set up a pseudo whose value
>>> +		      can not be changed.  Such insns might be not in
>>> +		      init_insns because we don't update equiv data
>>> +		      during insn transformations.
>>> +			
>>> +		      As an example, let suppose that a pseudo got
>>> +		      hard register and on the 1st pass was not
>>> +		      changed to equivalent constant.  We generate an
>>> +		      additional insn setting up the pseudo because of
>>> +		      secondary memory movement.  Then the pseudo is
>>> +		      spilled and we use the equiv constant.  In this
>>> +		      case we should remove the additional insn and
>>> +		      this insn is not init_insns list.	 */
>>> +		   && (! MEM_P (x) || MEM_READONLY_P (x)
>>> +		       || in_list_p (curr_insn,
>>> +				     ira_reg_equiv
>>> +				     [REGNO (dest_reg)].init_insns)))
>> This is probably a stupid question, sorry, but when do we ever want
>> to keep an assignment to a substituted pseudo?  I.e. why isn't this just:
>>
>> 	      if ((REG_P (dest_reg)
>> 		   && (x = get_equiv_substitution (dest_reg)) != dest_reg)
> Equivalence can be a memory location.  It means you can assign many 
> different values to the location.  Removing them would generate wrong 
> code.  For example, if you use your simple variant of the code, you will 
> get > 100 test failures in the GCC testsuite.

OK :-)

>>> +/* Return true if we need a split for hard register REGNO or pseudo
>>> +   REGNO which was assigned to a hard register.
>>> +   POTENTIAL_RELOAD_HARD_REGS contains hard registers which might be
>>> +   used for reloads since the EBB end.	It is an approximation of the
>>> +   used hard registers in the split range.  The exact value would
>>> +   require expensive calculations.  If we were aggressive with
>>> +   splitting because of the approximation, the split pseudo will save
>>> +   the same hard register assignment and will be removed in the undo
>>> +   pass.  We still need the approximation because too aggressive
>>> +   splitting would result in too inaccurate cost calculation in the
>>> +   assignment pass because of too many generated moves which will be
>>> +   probably removed in the undo pass.  */
>>> +static inline bool
>>> +need_for_split_p (HARD_REG_SET potential_reload_hard_regs, int regno)
>>> +{
>>> +  int hard_regno = regno < FIRST_PSEUDO_REGISTER ? regno : reg_renumber[regno];
>>> +
>>> +  lra_assert (hard_regno >= 0);
>>> +  return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
>>> +	   && ! TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno)
>>> +	   && (usage_insns[regno].reloads_num
>>> +	       + (regno < FIRST_PSEUDO_REGISTER ? 0 : 2) < reloads_num)
>>> +	   && ((regno < FIRST_PSEUDO_REGISTER
>>> +		&& ! bitmap_bit_p (&ebb_global_regs, regno))
>>> +	       || (regno >= FIRST_PSEUDO_REGISTER
>>> +		   && lra_reg_info[regno].nrefs > 3
>>> +		   && bitmap_bit_p (&ebb_global_regs, regno))))
>>> +	  || (regno >= FIRST_PSEUDO_REGISTER && need_for_call_save_p (regno)));
>>> +}
>> Could you add more commentary about the thinking behind this particular
>> choice of heuristic?  E.g. I wasn't sure what the reloads_num check did,
>> or why we only split hard registers that are local to the EBB and only
>> split pseudos that aren't.
>>
>> The 2 and 3 numbers seemed a bit magic too.  I suppose the 2 has
>> something to do with "one save and one restore", but I wasn't sure
>> why we applied it only for pseudos.  (AIUI that arm of the check
>> deals with "genuine" split pseudos rather than call saves & restores.)
>>
>> Still, it says a lot for the high quality of LRA that, out of all the
>> 1000s of lines of code I've read so far, this is the only part that
>> didn't seem to have an intuitive justification.
> Yes, right.  I checked many parameters and finally picked out the above 
> ones.  What I found is that aggressive and even moderate splitting 
> usually results in worse code.  Most splitting is undone, and if we 
> split aggressively we generate a lot of insns which will be removed, 
> but their costs are taken into account during the assignment pass.  
> Inheritance has a much better chance of being successful.  One reason why 
> splitting is undone is that spilling + inheritance of short-living 
> pseudos is in some way a substitute for splitting.  Therefore I do 
> splitting only for long-living pseudos, which means non-local pseudos.
>
> I added the following comments.
>
>    return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
>             && ! TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno)
>             /* We need at least 2 reloads to make pseudo splitting
>                profitable.  We should provide hard regno splitting in
>                any case to solve the 1st insn scheduling problem: moving
>                a hard register definition up might make it impossible to
>                find a hard register for a reload pseudo of a small
>                register class.  */
>             && (usage_insns[regno].reloads_num
>                 + (regno < FIRST_PSEUDO_REGISTER ? 0 : 2) < reloads_num)
>             && (regno < FIRST_PSEUDO_REGISTER
>                 /* For short-living pseudos, spilling + inheritance can
>                    be considered a substitute for splitting.
>                    Therefore we do not split local pseudos.  It also
>                    decreases the aggressiveness of splitting.  The
>                    minimal number of references is chosen taking into
>                    account that splitting makes no sense for 2
>                    references, as we can just spill the pseudo.  */
>                 || (regno >= FIRST_PSEUDO_REGISTER
>                     && lra_reg_info[regno].nrefs > 3
>                     && bitmap_bit_p (&ebb_global_regs, regno))))
>
> I removed the check that hard_regno is local.  It was a parameter to 
> decrease the aggressiveness of hard register splitting, but I realized 
> that it contradicts the reason for splitting, namely solving the 1st 
> insn scheduling problem, because regional scheduling can move a hard 
> register definition beyond its BB.

Thanks, this looks much better to me FWIW.

Richard
Vladimir Makarov Oct. 17, 2012, 7:53 p.m. UTC | #15
On 12-10-15 12:49 PM, Richard Sandiford wrote:
> Hi Vlad,
>
> Some comments about the rest of LRA.  Nothing major here...
>
> Vladimir Makarov <vmakarov@redhat.com> writes:
>> +/* Info about register in an insn.  */
>> +struct lra_insn_reg
>> +{
>> +  /* The biggest mode through which the insn refers to the register
>> +     (remember the register can be accessed through a subreg in the
>> +     insn).  */
>> +  ENUM_BITFIELD(machine_mode) biggest_mode : 16;
> AFAICT, this is actually always the mode of a specific reference,
> and if there are references to the same register in different modes,
> those references get their own lra_insn_regs.  "mode" might be better
> than "biggest_mode" if so.
>
It seems mode is also not good.  I've just modified the comment to 
reflect the fact that it is just a reference.
>> +/* Static part (common info for insns with the same ICODE) of LRA
>> +   internal insn info.	It exists in at most one exemplar for each
>> +   non-negative ICODE.	Warning: if the structure definition is
>> +   changed, the initializer for debug_insn_static_data in lra.c should
>> +   be changed too.  */
> Probably worth saying (before the warning) that there is also
> one structure for each asm.
Good point.  I added a comment.
>> +/* LRA internal info about an insn (LRA internal insn
>> +   representation).  */
>> +struct lra_insn_recog_data
>> +{
>> +  int icode; /* The insn code.	*/
>> +  rtx insn; /* The insn itself.	 */
>> +  /* Common data for insns with the same ICODE.	 */
>> +  struct lra_static_insn_data *insn_static_data;
> Maybe worth mentioning asms here too.
Fixed.
>> +  /* Two arrays of size correspondingly equal to the operand and the
>> +     duplication numbers: */
>> +  rtx **operand_loc; /* The operand locations, NULL if no operands.  */
>> +  rtx **dup_loc; /* The dup locations, NULL if no dups.	 */
>> +  /* Number of hard registers implicitly used in given call insn.  The
>> +     value can be NULL or points to array of the hard register numbers
>> +     ending with a negative value.  */
>> +  int *arg_hard_regs;
>> +#ifdef HAVE_ATTR_enabled
>> +  /* Alternative enabled for the insn.	NULL for debug insns.  */
>> +  bool *alternative_enabled_p;
>> +#endif
>> +  /* The alternative should be used for the insn, -1 if invalid, or we
>> +     should try to use any alternative, or the insn is a debug
>> +     insn.  */
>> +  int used_insn_alternative;
>> +  struct lra_insn_reg *regs;  /* Always NULL for a debug insn.	*/
> Comments consistently above the field.
Fixed.
>> +extern void lra_expand_reg_info (void);
> This doesn't exist any more.
Fixed.
>> +extern int lra_constraint_new_insn_uid_start;
> Just saying in case: this seems to be write-only, with lra-constraints.c
> instead using a static variable to track the uid start.
>
> I realise you might want to keep it anyway for consistency with
> lra_constraint_new_regno_start, or for debugging.
>
>> +extern rtx lra_secondary_memory[NUM_MACHINE_MODES];
> This doesn't exist any more.
Removed.  Thanks.
>> +/* lra-saves.c: */
>> +
>> +extern bool lra_save_restore (void);
> Same for this file & function.
Removed.
>> +/* The function returns TRUE if at least one hard register from ones
>> +   starting with HARD_REGNO and containing value of MODE are in set
>> +   HARD_REGSET.	 */
>> +static inline bool
>> +lra_hard_reg_set_intersection_p (int hard_regno, enum machine_mode mode,
>> +				 HARD_REG_SET hard_regset)
>> +{
>> +  int i;
>> +
>> +  lra_assert (hard_regno >= 0);
>> +  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
>> +    if (TEST_HARD_REG_BIT (hard_regset, hard_regno + i))
>> +      return true;
>> +  return false;
>> +}
> This is the same as overlaps_hard_reg_set_p.
>
I removed it and started to use the function overlaps_hard_reg_set_p.
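I.e. the body now reduces to something like

  return overlaps_hard_reg_set_p (hard_regset, mode, hard_regno);

(assuming I have the argument order right).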
>> +/* Return hard regno and offset of (sub-)register X through arguments
>> +   HARD_REGNO and OFFSET.  If it is not (sub-)register or the hard
>> +   register is unknown, then return -1 and 0 correspondingly.  */
> The function seems to return -1 for both.
Fixed.  It does not matter for the rest of the code, as offset is used 
only when hard_regno >= 0.
>> +/* Add hard registers starting with HARD_REGNO and holding value of
>> +   MODE to the set S.  */
>> +static inline void
>> +lra_add_hard_reg_set (int hard_regno, enum machine_mode mode, HARD_REG_SET *s)
>> +{
>> +  int i;
>> +
>> +  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
>> +    SET_HARD_REG_BIT (*s, hard_regno + i);
>> +}
> This is add_to_hard_reg_set.
Removed.
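(So the use sites now presumably call add_to_hard_reg_set (&s, mode,
hard_regno) directly.)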
>> +   Here is block diagram of LRA passes:
>> +
>> +	  ---------------------				
>> +	 | Undo inheritance    |      ---------------	     ---------------
>> +	 |(for spilled pseudos)|     | Memory-memory |	    | New (and old) |
>> +	 | and splits (for     |<----| move coalesce |<-----|	 pseudos    |
>> +	 | pseudos got the     |      ---------------	    |	assignment  |
>> +  Start	 |  same  hard regs)   |			     ---------------
>> +    |	  ---------------------					    ^
>> +    V		  |		 ----------------		    |
>> + -----------	  V		| Update virtual |		    |
>> +|  Remove   |----> ------------>|    register	 |		    |
>> +| scratches |	  ^		|  displacements |		    |
>> + -----------	  |		 ----------------		    |
>> +		  |			 |			    |
>> +		  |			 V	   New		    |
>> +	 ----------------    No	   ------------	 pseudos   -------------------
>> +	| Spilled pseudo | change |Constraints:| or insns | Inheritance/split |
>> +	|    to memory	 |<-------|    RTL     |--------->|  transformations  |
>> +	|  substitution	 |	  | transfor-  |	  |    in EBB scope   |
>> +	 ----------------	  |  mations   |	   -------------------
>> +		|		    ------------
>> +		V
>> +    -------------------------
>> +   | Hard regs substitution, |
>> +   |  devirtualization, and  |------> Finish
>> +   | restoring scratches got |
>> +   |	     memory	     |
>> +    -------------------------
> This is a great diagram, thanks.
>
>> +/* Create and return a new reg from register FROM corresponding to
>> +   machine description operand of mode MD_MODE.	 Initialize its
>> +   register class to RCLASS.  Print message about assigning class
>> +   RCLASS containing new register name TITLE unless it is NULL.	 The
>> +   created register will have unique held value.  */
>> +rtx
>> +lra_create_new_reg_with_unique_value (enum machine_mode md_mode, rtx original,
>> +				      enum reg_class rclass, const char *title)
> Comment says FROM, but parameter is called ORIGINAL.  The code copes
> with both null and non-register ORIGINALs, which aren't mentinoed in
> the comment.
Fixed.
>> +/* Target checks operands through operand predicates to recognize an
>> +   insn.  We should have a special precaution to generate add insns
>> +   which are frequent results of elimination.
>> +
>> +   Emit insns for x = y + z.  X can be used to store intermediate
>> +   values and should be not in Y and Z when we use x to store an
>> +   intermediate value.	*/
> I think this should say what Y and Z are allowed to be, since it's
> more than just registers and constants.
Fixed.
>> +/* Map INSN_UID -> the operand alternative data (NULL if unknown).  We
>> +   assume that this data is valid until register info is changed
>> +   because classes in the data can be changed.	*/
>> +struct operand_alternative *op_alt_data[LAST_INSN_CODE];
> In that case I think it should be in a target_globals structure,
> a bit like target_ira.
Fixed.
>> +	    for (curr = list; curr != NULL; curr = curr->next)
>> +	      if (curr->regno == regno)
>> +		break;
>> +	    if (curr == NULL || curr->subreg_p != subreg_p
>> +		|| curr->biggest_mode != mode)
>> +	      {
>> +		/* This is a new hard regno or the info can not be
>> +		   integrated into the found structure.	 */
>> +#ifdef STACK_REGS
>> +		early_clobber
>> +		  = (early_clobber
>> +		     /* This clobber is to inform popping floating
>> +			point stack only.  */
>> +		     && ! (FIRST_STACK_REG <= regno
>> +			   && regno <= LAST_STACK_REG));
>> +#endif
>> +		list = new_insn_reg (regno, type, mode, subreg_p,
>> +				     early_clobber, list);
>> +	      }
>> +	    else
>> +	      {
>> +		if (curr->type != type)
>> +		  curr->type = OP_INOUT;
>> +		if (curr->early_clobber != early_clobber)
>> +		  curr->early_clobber = true;
>> +	      }
> OK, so this is probably only a technicality, but I think this should be:
>
> 	    for (curr = list; curr != NULL; curr = curr->next)
> 	      if (curr->regno == regno
> 		  && curr->subreg_p == subreg_p
> 		  && curr->biggest_mode == mode)
> 		{
> 		  ..reuse..;
> 		  return list;
> 		}
> 	     ..new entry..;
> 	     return list;
Fixed by using your approach.  It cannot use return because, in the 
original, the return comes after a loop covering this code.
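
For the record, the reuse-or-add loop now looks roughly like this
(a sketch, ignoring the STACK_REGS handling of early_clobber):

	    for (curr = list; curr != NULL; curr = curr->next)
	      if (curr->regno == regno
		  && curr->subreg_p == subreg_p
		  && curr->biggest_mode == mode)
		break;
	    if (curr != NULL)
	      {
		/* Integrate the info into the found structure.  */
		if (curr->type != type)
		  curr->type = OP_INOUT;
		if (curr->early_clobber != early_clobber)
		  curr->early_clobber = true;
	      }
	    else
	      /* This is a new hard regno or the info can not be
		 integrated into an existing structure.  */
	      list = new_insn_reg (regno, type, mode, subreg_p,
				   early_clobber, list);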
>> +      icode = INSN_CODE (insn);
>> +      if (icode < 0)
>> +	/* It might be a new simple insn which is not recognized yet.  */
>> +	INSN_CODE (insn) = icode = recog (PATTERN (insn), insn, 0);
> Any reason not to use recog_memoized here?
It will simply always call recog, as icode < 0.  But as it has a simpler 
interface, I've changed it.
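(I.e. presumably just

      icode = recog_memoized (insn);

which caches the result in INSN_CODE itself.)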
>> +      n = insn_static_data->n_operands;
>> +      if (n == 0)
>> +	locs = NULL;
>> +      else
>> +	{
>> +	
>> +	  locs = (rtx **) xmalloc (n * sizeof (rtx *));
>> +	  memcpy (locs, recog_data.operand_loc, n * sizeof (rtx *));
>> +	}
> Excess blank line after "else" (sorry!)
It looks like it is already fixed.
>> +  /* Some output operand can be recognized only from the context not
>> +     from the constraints which are empty in this case.	 Call insn may
>> +     contain a hard register in set destination with empty constraint
>> +     and extract_insn treats them as an input.	*/
>> +  for (i = 0; i < insn_static_data->n_operands; i++)
>> +    {
>> +      int j;
>> +      rtx pat, set;
>> +      struct lra_operand_data *operand = &insn_static_data->operand[i];
>> +
>> +      /* ??? Should we treat 'X' the same way.	It looks to me that
>> +	 'X' means anything and empty constraint means we do not
>> +	 care.	*/
> FWIW, I think any X output operand has to be "=X" or "+X"; just "X"
> would be as wrong as "r".  genrecog is supposed to complain about that
> for insns, and parse_output_constraint for asms.
>
> So I agree the code is correct in just handling empty constraints.
Ok.
>> +/* Update all the insn info about INSN.	 It is usually called when
>> +   something in the insn was changed.  Return the udpated info.	 */
> Typo: updated.
Fixed.
>> +  for (i = 0; i < reg_info_size; i++)
>> +    {
>> +      bitmap_initialize (&lra_reg_info[i].insn_bitmap, &reg_obstack);
>> +#ifdef STACK_REGS
>> +      lra_reg_info[i].no_stack_p = false;
>> +#endif
>> +      CLEAR_HARD_REG_SET (lra_reg_info[i].conflict_hard_regs);
>> +      lra_reg_info[i].preferred_hard_regno1 = -1;
>> +      lra_reg_info[i].preferred_hard_regno2 = -1;
>> +      lra_reg_info[i].preferred_hard_regno_profit1 = 0;
>> +      lra_reg_info[i].preferred_hard_regno_profit2 = 0;
>> +      lra_reg_info[i].live_ranges = NULL;
>> +      lra_reg_info[i].nrefs = lra_reg_info[i].freq = 0;
>> +      lra_reg_info[i].last_reload = 0;
>> +      lra_reg_info[i].restore_regno = -1;
>> +      lra_reg_info[i].val = get_new_reg_value ();
>> +      lra_reg_info[i].copies = NULL;
>> +    }
> The same loop (with a different start index) appears in expand_reg_info.
> It'd be nice to factor it out, so that there's only one place to update
> if the structure is changed.
Fixed.
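
Both loops now call a common helper, something like this (a sketch; the
helper name is my invention):

static void
initialize_lra_reg_info_element (int i)
{
  bitmap_initialize (&lra_reg_info[i].insn_bitmap, &reg_obstack);
#ifdef STACK_REGS
  lra_reg_info[i].no_stack_p = false;
#endif
  CLEAR_HARD_REG_SET (lra_reg_info[i].conflict_hard_regs);
  lra_reg_info[i].preferred_hard_regno1 = -1;
  lra_reg_info[i].preferred_hard_regno2 = -1;
  lra_reg_info[i].preferred_hard_regno_profit1 = 0;
  lra_reg_info[i].preferred_hard_regno_profit2 = 0;
  lra_reg_info[i].live_ranges = NULL;
  lra_reg_info[i].nrefs = lra_reg_info[i].freq = 0;
  lra_reg_info[i].last_reload = 0;
  lra_reg_info[i].restore_regno = -1;
  lra_reg_info[i].val = get_new_reg_value ();
  lra_reg_info[i].copies = NULL;
}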

>> +	  for (curr = data->regs; curr != NULL; curr = curr->next)
>> +	    if (curr->regno == regno)
>> +	      break;
>> +	  if (curr->subreg_p != subreg_p || curr->biggest_mode != mode)
>> +	    /* The info can not be integrated into the found
>> +	       structure.  */
>> +	    data->regs = new_insn_reg (regno, type, mode, subreg_p,
>> +				       early_clobber, data->regs);
>> +	  else
>> +	    {
>> +	      if (curr->type != type)
>> +		curr->type = OP_INOUT;
>> +	      if (curr->early_clobber != early_clobber)
>> +		curr->early_clobber = true;
>> +	    }
>> +	  lra_assert (curr != NULL);
>> +	}
> Same loop comment as for collect_non_operand_hard_regs.  Maybe another
> factoring opportunity.
Fixed.
>> +		/* Some ports don't recognize the following addresses
>> +		   as legitimate.  Although they are legitimate if
>> +		   they satisfies the constraints and will be checked
>> +		   by insn constraints which we ignore here.  */
>> +		&& GET_CODE (XEXP (op, 0)) != UNSPEC
>> +		&& GET_CODE (XEXP (op, 0)) != PRE_DEC
>> +		&& GET_CODE (XEXP (op, 0)) != PRE_INC
>> +		&& GET_CODE (XEXP (op, 0)) != POST_DEC
>> +		&& GET_CODE (XEXP (op, 0)) != POST_INC
>> +		&& GET_CODE (XEXP (op, 0)) != PRE_MODIFY
>> +		&& GET_CODE (XEXP (op, 0)) != POST_MODIFY)
> GET_RTX_CLASS (GET_CODE (XEXP (op, 0))) == RTX_AUTOINC
Fixed.
>> +/* Determine if the current function has an exception receiver block
>> +   that reaches the exit block via non-exceptional edges  */
>> +static bool
>> +has_nonexceptional_receiver (void)
>> +{
>> +  edge e;
>> +  edge_iterator ei;
>> +  basic_block *tos, *worklist, bb;
>> +
>> +  /* If we're not optimizing, then just err on the safe side.  */
>> +  if (!optimize)
>> +    return true;
>> +
>> +  /* First determine which blocks can reach exit via normal paths.  */
>> +  tos = worklist = XNEWVEC (basic_block, n_basic_blocks + 1);
>> +
>> +  FOR_EACH_BB (bb)
>> +    bb->flags &= ~BB_REACHABLE;
>> +
>> +  /* Place the exit block on our worklist.  */
>> +  EXIT_BLOCK_PTR->flags |= BB_REACHABLE;
>> +  *tos++ = EXIT_BLOCK_PTR;
>> +
>> +  /* Iterate: find everything reachable from what we've already seen.  */
>> +  while (tos != worklist)
>> +    {
>> +      bb = *--tos;
>> +
>> +      FOR_EACH_EDGE (e, ei, bb->preds)
>> +	if (!(e->flags & EDGE_ABNORMAL))
>> +	  {
>> +	    basic_block src = e->src;
>> +
>> +	    if (!(src->flags & BB_REACHABLE))
>> +	      {
>> +		src->flags |= BB_REACHABLE;
>> +		*tos++ = src;
>> +	      }
>> +	  }
>> +    }
>> +  free (worklist);
>> +
>> +  /* Now see if there's a reachable block with an exceptional incoming
>> +     edge.  */
>> +  FOR_EACH_BB (bb)
>> +    if (bb->flags & BB_REACHABLE)
>> +      FOR_EACH_EDGE (e, ei, bb->preds)
>> +	if (e->flags & EDGE_ABNORMAL)
>> +	  return true;
>> +
>> +  /* No exceptional block reached exit unexceptionally.	 */
>> +  return false;
>> +}
> Looks like we could just early out on the first loop and get rid
> of the second.
It seems so.  I fixed it.  This comes from reload; the code in reload 
could be fixed too.
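
The early-out version of the walk is roughly (a sketch):

  /* Iterate: find everything reachable from what we've already seen,
     returning as soon as a reachable block with an abnormal incoming
     edge is found.  */
  while (tos != worklist)
    {
      bb = *--tos;

      FOR_EACH_EDGE (e, ei, bb->preds)
	if (e->flags & EDGE_ABNORMAL)
	  {
	    free (worklist);
	    return true;
	  }
	else
	  {
	    basic_block src = e->src;

	    if (!(src->flags & BB_REACHABLE))
	      {
		src->flags |= BB_REACHABLE;
		*tos++ = src;
	      }
	  }
    }
  free (worklist);
  /* No exceptional block reached exit unexceptionally.  */
  return false;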
>> +/* Remove all REG_DEAD and REG_UNUSED notes and regenerate REG_INC.
>> +   We change pseudos by hard registers without notification of DF and
>> +   that can make the notes obsolete.  DF-infrastructure does not deal
>> +   with REG_INC notes -- so we should regenerate them here.  */
> These days passes are supposed to regenerate REG_DEAD and REG_UNUSED
> notes if they need them, so that part might not be necessary.
> The REG_INC bit is still needed though...
Ok.  I fixed it.
>> +/* Initialize LRA data once per function.  */
>> +void
>> +lra_init (void)
>> +{
>> +  init_op_alt_data ();
>> +}
> I think it's more like:
>
> /* Initialize LRA whenever register-related information is changed.  */
Fixed.
>
> In summary, LRA looks really good to me FWIW.  Thanks for all your hard work.
>
> Getting rid of reload always seemed like a pipe dream, and if the only
> known drawback of this replacement is that it takes a while on extreme
> testcases, that's an amazing achievement.  (Not to say compile time
> isn't important, just that there were so many other hurdles to overcome.)
It is my second attempt.  The first one was the YARA project.  I got a 
lot of experience from that project and knowledge of how not to do this.
LRA will still be a long-lasting project.  I don't think I found all the 
weirdness of reload by trying just 8 targets (fixing one bug on one 
target frequently resulted in new bugs on other targets, so it 
frequently required cardinal changes to the original code).  Only after 
trying the 8 targets did I get the feeling that this approach could work 
well.  There are still many more targets (and subtargets) to port to 
LRA.  And I hope that people will help me.  I am very glad and 
appreciate that you did such a rigorous review, because it means that 
you now understand LRA very well (sometimes I think even better than I do).
> It looks like opinion has crystalised in favour of merging LRA for 4.8.
> I hope that's what happens.  I don't see that anything would be gained
> by delaying it to 4.9.  The code's not going to get any more testing on the
> branch that it already has; whenever we merge, the stress test is always
> going to be trunk.
>
Richard, thanks for your invaluable help.
Steven Bosscher Oct. 17, 2012, 8:12 p.m. UTC | #16
On Wed, Oct 17, 2012 at 9:53 PM, Vladimir Makarov wrote:
> On 12-10-15 12:49 PM, Richard Sandiford wrote:
>> Getting rid of reload always seemed like a pipe dream, and if the only
>> known drawback of this replacement is that it takes a while on extreme
>> testcases, that's an amazing achievement.  (Not to say compile time
>> isn't important, just that there were so many other hurdles to overcome.)

Just to be clear, LRA now does no worse from a compile time POV than,
say, tree-ssa-live. Most of the scalability problems have been
addressed.

> It is my second attempt.  The first one was the YARA project.  I got a lot
> of experience from that project and knowledge of how not to do this.
> LRA will still be a long-lasting project.  I don't think I found all the
> weirdness of reload by trying just 8 targets (fixing one bug on one target
> frequently resulted in new bugs on other targets, so it frequently required
> cardinal changes to the original code).  Only after trying the 8 targets
> did I get the feeling that this approach could work well.

Hats off to you, Vlad, for your years of effort on improving GCC's RA!

Ciao!
Steven
Vladimir Makarov Oct. 19, 2012, 5:14 a.m. UTC | #17
On 12-10-15 8:06 AM, Richard Sandiford wrote:
> Vladimir Makarov <vmakarov@redhat.com> writes:
>>> if that's accurate.  I dropped the term "reload pseudo" because of
>>> the general comment in my earlier reply about the use of "reload pseudo"
>>> when the code seems to include inheritance and split pseudos too.
>> There is no inheritance and splitting yet.  It is done after the
>> constraint pass.
>> So at this stage >= new_regno_start means reload pseudo.
> Ah, OK.
>
>>> That's a change in the meaning of NEW_CLASS, but seems easier for
>>> callers to handle.  I think all it requires is changing:
>>>
>>>> +      common_class = ira_reg_class_subset[rclass][cl];
>>>> +      if (new_class != NULL)
>>>> +	*new_class = common_class;
>>> to:
>>>
>>>         common_class = ira_reg_class_subset[rclass][cl];
>>>         if (new_class != NULL && rclass != common_class)
>>> 	*new_class = common_class;
>> This change results in infinite LRA looping on the first libgcc file
>> compilation.  Unfortunately I have no time to investigate it.
>> I'd like to say that most of this code is very sensitive to
>> changes.  I see it a lot.  You change something that looks obvious and a
>> target breaks.
>> I am going to investigate it when I have more time.
> Thanks.
>
>>>> +    default:
>>>> +      {
>>>> +	const char *fmt = GET_RTX_FORMAT (code);
>>>> +	int i;
>>>> +
>>>> +	if (GET_RTX_LENGTH (code) != 1
>>>> +	    || fmt[0] != 'e' || GET_CODE (XEXP (x, 0)) != UNSPEC)
>>>> +	  {
>>>> +	    for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
>>>> +	      if (fmt[i] == 'e')
>>>> +		extract_loc_address_regs (false, mode, as, &XEXP (x, i),
>>>> +					  context_p, code, SCRATCH,
>>>> +					  modify_p, ad);
>>>> +	    break;
>>>> +	  }
>>>> +	/* fall through for case UNARY_OP (UNSPEC ...)	*/
>>>> +      }
>>>> +
>>>> +    case UNSPEC:
>>>> +      if (ad->disp_loc == NULL)
>>>> +	ad->disp_loc = loc;
>>>> +      else if (ad->base_reg_loc == NULL)
>>>> +	{
>>>> +	  ad->base_reg_loc = loc;
>>>> +	  ad->base_outer_code = outer_code;
>>>> +	  ad->index_code = index_code;
>>>> +	  ad->base_modify_p = modify_p;
>>>> +	}
>>>> +      else
>>>> +	{
>>>> +	  lra_assert (ad->index_reg_loc == NULL);
>>>> +	  ad->index_reg_loc = loc;
>>>> +	}
>>>> +      break;
>>>> +
>>>> +    }
>>> Which targets use a bare UNSPEC as a displacement?  I thought a
>>> displacement had to be a link-time constant, in which case it should
>>> satisfy CONSTANT_P.  For UNSPECs, that means wrapping it in a CONST.
>> I saw it somewhere.  I guess IA64.
>>> I'm just a bit worried that the UNSPEC handling is sensitive to the
>>> order that subrtxes are processed (unlike PLUS, which goes to some
>>> trouble to work out what's what).  It could be especially confusing
>>> because the default case processes operands in reverse order while
>>> PLUS processes them in forward order.
>>>
>>> Also, which cases require the special UNARY_OP (UNSPEC ...) fallthrough?
>>> Probably deserves a comment.
>> I don't remember.  To figure it out, I should switch it off and try all
>> the targets supported by LRA.
>>> AIUI the base_reg_loc, index_reg_loc and disp_loc fields aren't just
>>> recording where reloads of a particular class need to go (obviously
>>> in the case of disp_loc, which isn't reloaded at all).  The fields
>>> have semantic value too.  I.e. we use them to work out the value
>>> of at least part of the address.
>>>
>>> In that case it seems dangerous to look through general rtxes
>>> in the way that the default case above does.  Maybe just making
>>> sure that DISP_LOC is involved in a sum with the base would be
>>> enough, but another idea was:
>>>
>>> ----------------------------------------------------------------
>>> I know of three ways of "mutating" (for want of a better word)
>>> an address:
>>>
>>>     1. (and X (const_int X)), to align
>>>     2. a subreg
>>>     3. a unary operator (such as truncation or extension)
>>>
>>> So maybe we could:
>>>
>>>     a. remove outer mutations (using a helper function)
>>>     b. handle LO_SUM, PRE_*, POST_*: as now
>>>     c. otherwise treat the address of the sum of one, two or three pieces.
>>>        c1. Peel mutations of all pieces.
>>>        c2. Classify the pieces into base, index and displacement.
>>>            This would be similar to the jousting code above, but hopefully
>>>            easier because all three rtxes are to hand.  E.g. we could
>>>            do the base vs. index thing in a similar way to
>>>            commutative_operand_precedence.
>>>        c3. Record which pieces were mutated (e.g. using something like the
>>>            index_loc vs. index_reg_loc distinction in the current code)
>>>
>>> That should be general enough for current targets, but if it isn't,
>>> we could generalise it further when we know what generalisation is needed.
>>>
>>> That's still going to be a fair amount of code, but hopefully not more,
>>> and we might have more confidence at each stage what each value is.
>>> And it avoids the risk of treating "mutated" addresses as "unmutated" ones.
>>> ----------------------------------------------------------------
>>>
>>> Just an idea though.  Probably not for 4.8, although I might try it
>>> if I find time.
>> I am not sure that you listed all the cases.  It would be great if you
>> could; then we could make this function clearer.
>> I tried to do this first but kept finding new cases.  After that I
>> gave up and tried to use a more general implementation.
>>
>> This function was rewritten and modified many times.  I am afraid to do
>> this again while the clock is ticking.
>>
>> It would be great if you re-implemented the function according to your
>> ideas; then we could try it on the 8 targets to which LRA has already
>> been ported.  An LRA sub-branch would be a perfect place to do it.
> Fair enough.  I still stand by:
>
>>> It would be nice to sort out the disp_loc thing for 4.8 though.
> though.  My point is that base_plus_disp_to_reg assumes that
> *base_loc and *disp_loc are involved in a sum.  It adds them together,
> replaces the base_loc with the new pseudo, and removes the disp_loc.
> But extract_address_regs seems to be deliberately written in a way that
> doesn't require base_loc and disp_loc to be involved in a sum, and uses
> a more indirect approach to working out disp_loc instead.  It just feels
> like it leaves open the potential for a silent wrong-code bug.
I added an explicit check for the base + disp case.  I also added a lot 
of asserts checking that we don't assign twice.
>>>> +/* Reload pseudos created for matched input and output reloads whose
>>>> +   mode are different.	Such pseudos has a modified rules for finding
>>>> +   their living ranges, e.g. assigning to subreg of such pseudo means
>>>> +   changing all pseudo value.  */
>>>> +bitmap_head lra_bound_pseudos;
>>> Maybe:
>>>
>>> /* Reload pseudos created for matched input and output reloads whose
>>>      modes are different.  Such pseudos have different live ranges from
>>>      other pseudos; e.g. any assignment to a subreg of these pseudos
>>>      changes the whole pseudo's value.  */
>> Fixed.
>>> Although that said, couldn't emit_move_insn_1 (called by gen_move_insn)
>>> split a multiword pseudo move into two word moves?  Using the traditional
>>> clobber technique sounds better than having special liveness rules.
>>>
>> It is not only about multi-word pseudos.  It is about representing
>> this situation by constructions that are semantically incorrect in
>> other parts of the compiler.  Reload has no such problem as it does not
>> use RTL.  So I don't think it splits, as I use emit_move_insn and that
>> calls emit_move_insn_1 too.
> But my point is that emit_move_insn_1 _does_ split moves that have no
> .md pattern of their own.  E.g. some targets do not define double-word
> move patterns because such moves are always equivalent to two individual
> word moves.  And if emit_move_insn_1 splits:
>
>     (set (reg:DI X) (reg:DI Y))
>
> into:
>
>     (set (subreg:SI (reg:DI X) 0) (subreg:SI (reg:DI Y) 0))
>     (set (subreg:SI (reg:DI X) 4) (subreg:SI (reg:DI Y) 4))
>
> then it would be wrong to say that the subreg in the second instruction
> is a complete definition of X.
>> I really needed a special liveness treatment (although I don't
>> remember the details) and therefore I added it.  I had no detailed
>> design for LRA.  The code was shaped by numerous test failures on
>> different targets.  There is a lot of code analogous to reload's, and
>> its necessity should probably be rigorously questioned.  I thought
>> about and modified part of this code, but unfortunately not all of it.
>>
>> Also, bound pseudos are rare.  Their bitmap is very small, and testing
>> them (2 lines of code overall) in ira-lives.c is fast.
> FWIW, It wasn't really speed as much as correctness I was worried about.
> In a way, rarity makes having special rules seem even more dangerous.
>
Right.  I don't think it is more dangerous than clobbers, but I like 
what you proposed.  After a few tries, I managed with clobbers.  So now 
there are no bound pseudos (or special reload pseudos, as they were more 
recently called).  The code is clear and there is no additional notion 
for LRA.  Thank you.
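
For the record, I assume the clobber technique here amounts to emitting
something like

  (clobber (reg:DI 100))
  (set (subreg:SI (reg:DI 100) 0) ...)
  (set (subreg:SI (reg:DI 100) 4) ...)

(regno 100 is made up), so the clobber itself provides the full
definition of the pseudo and the ordinary liveness rules apply.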
>>>> +      /* We create pseudo for out rtx because we always should keep
>>>> +	 registers with the same original regno have synchronized
>>>> +	 value (it is not true for out register but it will be
>>>> +	 corrected by the next insn).
>>> I don't understand this comment, sorry.
>>>
>> Pseudos have values -- see the comments for lra_reg_info.  Different
>> pseudos with the same value do not conflict even if they live in the
>> same place.  When we create a pseudo, we assign it the value of the
>> original pseudo (if any) from which the new pseudo was created.  If we
>> create the pseudo from the input pseudo, the new pseudo will not
>> conflict with the input pseudo, which is wrong when the input pseudo
>> lives past the insn, since the new pseudo's value is changed by the
>> insn output.  Therefore we create the new pseudo from the output.
>>
>> I hope it is more understandable.  I changed the comment.
> Yeah, I think that makes it a lot clearer, thanks.
>
>>>> +  /* In and out operand can be got from transformations before
>>>> +     processing constraints.  So the pseudos might have inaccurate
>>>> +     class and we should make their classes more accurate.  */
>>>> +  narrow_reload_pseudo_class (in_rtx, goal_class);
>>>> +  narrow_reload_pseudo_class (out_rtx, goal_class);
>>> I don't understand this, sorry.  Does "transformations" mean inheritance
>>> and reload splitting?  So the registers we're changing here are inheritance
>>> and split pseudos rather than reload pseudos created for this instruction?
>>> If so, it sounds on face value like it conflicts with the comment quoted
>>> above about not allowing reload instructions to the narrow the class
>>> of pseudos.  Might be worth saying why that's OK here but not there.
>> Again, inheritance and splitting are done after the constraint pass.
>>
>> The transformations here are mostly reloading of subregs, which is done
>> before the reloads for the given insn.  During this transformation we
>> create new pseudos whose reg class we don't know yet.  When we don't
>> know a pseudo's reg class yet, we assign ALL_REGS to the pseudo.
> OK, thanks.
>
>>> Also, all uses but one of lra_get_hard_regno_and_offset follow
>>> the pattern:
>>>
>>>         lra_get_hard_regno_and_offset (x, &x_hard_regno, &offset);
>>>         /* The real hard regno of the operand after the allocation.  */
>>>         x_hard_regno = get_final_hard_regno (x_hard_regno, offset);
>>>
>>> so couldn't lra_get_hard_regno_and_offset just return the final
>>> hard register, including elimination?  Then it could apply the
>>> elimination on the original rtx.
>>>
>>> FWIW, the exception I mentioned was operands_match_p:
>>>
>>>         lra_get_hard_regno_and_offset (x, &i, &offset);
>>>         if (i < 0)
>>> 	goto slow;
>>>         i += offset;
>>>
>>> but I'm not sure why this is the only caller that would want
>>> to ignore elimination.
>> ???
Sorry, I composed this email over 2 days and that was a placeholder for 
a question of yours I should answer or work on :)  It looks like I did 
not answer the question.
> Not sure what you meant here :-)  Was that a placeholder,
> or something else?  What I was getting at was that it would
> be nice to replace all occurences of:
>
>        lra_get_hard_regno_and_offset (x, &x_hard_regno, &offset);
>        /* The real hard regno of the operand after the allocation.  */
>        x_hard_regno = get_final_hard_regno (x_hard_regno, offset);
>
> with something like:
>
>        x_hard_regno = lra_get_hard_regno (x);
>
> and this operands_match_p code seemed to be the only place that didn't
> apply get_final_hard_regno to the result of lra_get_hard_regno_and_offset.
> I wasn't really sure why operands_match_p was different.
Fixed.
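
The wrapper is essentially (a sketch):

static int
lra_get_hard_regno (rtx x)
{
  int hard_regno, offset;

  lra_get_hard_regno_and_offset (x, &hard_regno, &offset);
  if (hard_regno < 0)
    return hard_regno;
  /* The real hard regno of the operand after the allocation.  */
  return get_final_hard_regno (hard_regno, offset);
}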
>>>> +  int i, j, x_hard_regno, offset;
>>>> +  enum machine_mode mode;
>>>> +  rtx x;
>>>> +  const char *fmt;
>>>> +  enum rtx_code code;
>>>> +
>>>> +  if (*loc == NULL_RTX)
>>>> +    return false;
>>>> +  x = *loc;
>>>> +  code = GET_CODE (x);
>>>> +  mode = GET_MODE (x);
>>>> +  if (code == SUBREG)
>>>> +    {
>>>> +      loc = &SUBREG_REG (x);
>>>> +      x = SUBREG_REG (x);
>>>> +      code = GET_CODE (x);
>>>> +      if (GET_MODE_SIZE (GET_MODE (x)) > GET_MODE_SIZE (mode))
>>>> +	mode = GET_MODE (x);
>>>> +    }
>>>> +
>>>> +  if (REG_P (x))
>>>> +    {
>>>> +      lra_get_hard_regno_and_offset (x, &x_hard_regno, &offset);
>>>> +      /* The real hard regno of the operand after the allocation.  */
>>>> +      x_hard_regno = get_final_hard_regno (x_hard_regno, offset);
>>>> +      return (x_hard_regno >= 0
>>>> +	      && lra_hard_reg_set_intersection_p (x_hard_regno, mode, set));
>>> With the subreg mode handling above, this looks little-endian specific.
>>> The MEM case:
>>>> +  if (MEM_P (x))
>>>> +    {
>>>> +      struct address ad;
>>>> +      enum machine_mode mode = GET_MODE (x);
>>>> +      rtx *addr_loc = &XEXP (x, 0);
>>>> +
>>>> +      extract_address_regs (mode, MEM_ADDR_SPACE (x), addr_loc, MEM, &ad);
>>>> +      if (ad.base_reg_loc != NULL)
>>>> +	{
>>>> +	  if (uses_hard_regs_p (ad.base_reg_loc, set))
>>>> +	    return true;
>>>> +	}
>>>> +      if (ad.index_reg_loc != NULL)
>>>> +	{
>>>> +	  if (uses_hard_regs_p (ad.index_reg_loc, set))
>>>> +	    return true;
>>>> +	}
>>>> +    }
>>> is independent of the subreg handling, so perhaps the paradoxical subreg
>>> case should be handled separately, using simplify_subreg_regno.
> Not sure: did you have any thoughts on this?
>
No.  Sorry.  I'll think about it in the morning.  A bit tired for today.

>>>> +		    match_p = false;
>>>> +		    if (operands_match_p (*curr_id->operand_loc[nop],
>>>> +					  *curr_id->operand_loc[m], m_hregno))
>>>> +		      {
>>>> +			int i;
>>>> +			
>>>> +			for (i = 0; i < early_clobbered_regs_num; i++)
>>>> +			  if (early_clobbered_nops[i] == m)
>>>> +			    break;
>>>> +			/* We should reject matching of an early
>>>> +			   clobber operand if the matching operand is
>>>> +			   not dying in the insn.  */
>>>> +			if (i >= early_clobbered_regs_num
>>> Why not simply use operands m's early_clobber field?
>> Ok.  Fixed.
>>>> +			    || operand_reg[nop] == NULL_RTX
>>>> +			    || (find_regno_note (curr_insn, REG_DEAD,
>>>> +						 REGNO (operand_reg[nop]))
>>>> +				!= NULL_RTX))
>>>> +			  match_p = true;
>>> ...although I don't really understand this condition.  If the two
>>> operands are the same value X, then X must die here whatever the
>>> notes say.  So I assume this is coping with a case where the operands
>>> are different but still match.  If so, could you give an example?
>> I remember I saw such an insn, but I don't remember the details.
>>> Matched earlyclobbers explicitly guarantee that the earlyclobber doesn't
>>> apply to the matched input operand; the earlyclobber only applies to
>>> other input operands.  So I'd have expected it was those operands
>>> that might need reloading rather than this one.
>>>
>>> E.g. if X occurs three times, twice in a matched earlyclobber pair
>>> and once as an independent operand, it's the latter operand that would
>>> need reloading.
>> Yes, I know.
> But in that case I don't understand the condition.  If we have:
>
>    (set (reg X) (... (reg X) ...))
>
> (which is the kind of thing operands_match_p is testing for)
> then there is no requirement for a REG_DEAD note for X.
> But it's still OK for two Xs form an earlyclobber pair.
>
I think I added this for some purpose.  It might be IRA wrongly 
assigning the same hard register in a situation like

(set (reg X1) (... (reg X2) ...)) with a REG_UNUSED note for X1,

where X2 lives below.

I think it is better not to change this code.  I promise I'll remove it 
on the LRA branch to find out why I added it.

>>>> +			  /* If we didn't already win, we can reload
>>>> +			     constants via force_const_mem, and other
>>>> +			     MEMs by reloading the address like for
>>>> +			     'o'.  */
>>>> +			  if (CONST_POOL_OK_P (mode, op) || MEM_P (op))
>>>> +			    badop = false;
>>> It seems a bit inconsistent to treat a spilled pseudo whose address
>>> might well need reloading as a win, while not treating existing MEMs
>>> whose addresses need reloading as a win.
>> Well, the probability of reloading the address of a spilled pseudo is
>> very small on most targets, but reloading for MEM in this case is real.
>> So it seems logical to me.
> OK, that's probably true. :-)
>
>>>> +	      if (! no_regs_p)
>>>> +		reload_nregs
>>>> +		  += ira_reg_class_max_nregs[this_alternative][mode];
>>> I wasn't sure why we counted this even in the "const_to_mem && constmmeok"
>>> and "MEM_P (op) && offmemok" cases from:
>>> 	      /* We prefer to reload pseudos over reloading other
>>> 		 things, since such reloads may be able to be
>>> 		 eliminated later.  So bump REJECT in other cases.
>>> 		 Don't do this in the case where we are forcing a
>>> 		 constant into memory and it will then win since we
>>> 		 don't want to have a different alternative match
>>> 		 then.	*/
>>> 	      if (! (REG_P (op)
>>> 		     && REGNO (op) >= FIRST_PSEUDO_REGISTER)
>>> 		  && ! (const_to_mem && constmemok)
>>> 		  /* We can reload the address instead of memory (so
>>> 		     do not punish it).	 It is preferable to do to
>>> 		     avoid cycling in some cases.  */
>>> 		  && ! (MEM_P (op) && offmemok))
>>> 		reject += 2;
>> I think constmemok is obvious.  It is not a reload, it is just putting
>> a constant in the constant pool.  We should not punish it, as no
>> additional insns are generated.
>>
>> There is a comment for the offmemok case.  I think it describes it.
>> Apparently it was a fix for LRA cycling.  I don't remember the details.
>> To restore them, I would need to remove the code and try it on many
>> targets.  I guess it would take 3-4 days.  But I removed this as it
>> does not affect x86/x86-64.
> Sorry, my comment wasn't as clear as it should have been.  I think:
>
> 	      /* We prefer to reload pseudos over reloading other
> 		 things, since such reloads may be able to be
> 		 eliminated later.  So bump REJECT in other cases.
> 		 Don't do this in the case where we are forcing a
> 		 constant into memory and it will then win since we
> 		 don't want to have a different alternative match
> 		 then.	*/
> 	      if (! (REG_P (op)
> 		     && REGNO (op) >= FIRST_PSEUDO_REGISTER)
> 		  && ! (const_to_mem && constmemok)
> 		  /* We can reload the address instead of memory (so
> 		     do not punish it).	 It is preferable to do to
> 		     avoid cycling in some cases.  */
> 		  && ! (MEM_P (op) && offmemok))
> 		reject += 2;
>
> is fine.  The point I was trying to make is that this code (correctly IMO)
> ignores register reload costs if the operand won't be reloaded into
> a register.  The operand won't be reloaded into a register if:
>
> 	(const_to_mem && constmemok)
>       || (MEM_P (op) && offmemok)
>
> However:
>
> 	      if (! no_regs_p)
> 		reload_nregs
> 		  += ira_reg_class_max_nregs[this_alternative][mode];
>
> seems to count the register reload cost regardless of whether
> the operand would be reloaded into a register.  It felt like
> it ought to be something like:
>
> 	      if (!((const_to_mem && const_mem_ok)
> 		    || (MEM_P (op) && offmemok)))
> 		{
> 		  /* We prefer to reload pseudos over reloading other
> 		     things, since such reloads may be able to be
> 		     eliminated later.  So bump REJECT in other cases.  */
> 		  if (! (REG_P (op)
> 			 && REGNO (op) >= FIRST_PSEUDO_REGISTER)
> 		    reject += 2;
>
> 		  if (! no_regs_p)
> 		    reload_nregs
> 		      += ira_reg_class_max_nregs[this_alternative][mode];
> 		}
>
>
It makes sense.  I fixed it using your variant.
Vladimir Makarov Oct. 19, 2012, 5:14 a.m. UTC | #18
On 12-10-17 7:24 AM, Richard Sandiford wrote:
> Thanks for all the updates.
>
> Vladimir Makarov <vmakarov@redhat.com> writes:
>>>> +	  /* index * scale + disp => new base + index * scale  */
>>>> +	  enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
>>>> +
>>>> +	  lra_assert (INDEX_REG_CLASS != NO_REGS);
>>>> +	  new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
>>>> +	  lra_assert (GET_CODE (*addr_loc) == PLUS);
>>>> +	  lra_emit_move (new_reg, *ad.disp_loc);
>>>> +	  if (CONSTANT_P (XEXP (*addr_loc, 1)))
>>>> +	    XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
>>>> +	  XEXP (*addr_loc, 0) = new_reg;
>>> The canonical form is (plus (mult ...) (reg)) rather than
>>> (plus (reg) (mult ...)), but it looks like we create the latter.
>>> I realise you try both forms here:
>> It might happen because of equiv substitution in LRA.
>>>> +	  /* Some targets like ARM, accept address operands in
>>>> +	     specific order -- try exchange them if necessary.	*/
>>>> +	  if (! valid_address_p (mode, *addr_loc, as))
>>>> +	    {
>>>> +	      exchange_plus_ops (*addr_loc);
>>>> +	      if (! valid_address_p (mode, *addr_loc, as))
>>>> +		exchange_plus_ops (*addr_loc);
>>>> +	    }
>>> but I think we should try the canonical form first.  And I'd prefer it
>>> if we didn't try the other form at all, especially in 4.8.  It isn't
>>> really the backend's job to reject non-canonical rtl.  This might well
>>> be another case where some targets need a (hopefully small) tweak in
>>> order to play by the rules.
>>>
>>> Also, I suppose this section of code feeds back to my question on
>>> Wednesday about the distinction that LRA seems to make between the
>>> compile-time constant in:
>>>
>>>     (plus (reg X1) (const_int Y1))
>>>
>>> and the link-time constant in:
>>>
>>>     (plus (reg X2) (symbol_ref Y2))
>>>
>>> It looked like extract_address_regs classified X1 as a base register and
>>> X2 as an index register.  The difference between the two constants has
>>> no run-time significance though, and I think we should handle both X1
>>> and X2 as base registers (as I think reload does).
>>>
>>> I think the path above would then be specific to scaled indices.
>>> In the original address the "complex" index must come first and the
>>> displacement second.  In the modified address, the index would stay
>>> first and the new base register would be second.  More below.
>> As I wrote above, the problem is also that equiv substitution can
>> create non-canonical forms.
> Right.  Just in case there's a misunderstanding: I'm not complaining
> about these routines internally using forms that are noncanonical
> (which could happen because of equiv substitution, like you say).
> I just think that what we eventually try to validate should be canonical.
> In a way it's similar to how the simplify-rtx.c routines work.
>
> If there are targets that only accept noncanonical rtl (which is after
> all just a specific type of invalid rtl), they need to be fixed.
Agreed.  In order not to forget to fix the targets, I am removing the 
operand exchange.
>>>> +      /* base + scale * index + disp => new base + scale * index  */
>>>> +      new_reg = base_plus_disp_to_reg (mode, as, &ad);
>>>> +      *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
>>>> +      if (! valid_address_p (mode, *addr_loc, as))
>>>> +	{
>>>> +	  /* Some targets like ARM, accept address operands in
>>>> +	     specific order -- try exchange them if necessary.	*/
>>>> +	  exchange_plus_ops (*addr_loc);
>>>> +	  if (! valid_address_p (mode, *addr_loc, as))
>>>> +	    exchange_plus_ops (*addr_loc);
>>>> +	}
>>> Same comment as above about canonical rtl.  Here we can have two
>>> registers -- in which case the base should come first -- or a more
>>> complex index -- in which case the index should come first.
>>>
>>> We should be able to pass both rtxes to simplify_gen_binary (PLUS, ...),
>>> with the operands in either order, and let it take care of the details.
>>> Using simplify_gen_binary would help with the earlier index+disp case too.
>> Equiv substitution can create non-canonical forms.  There are 2 approaches:
>> o have code for dealing with non-canonical forms (equiv substitution,
>> target stupidity)
>> o always support canonical forms and require them from targets.
>>
>> I decided to use the 1st variant but I am reconsidering it.  I'll try to
>> fix it before inclusion.  But I am not sure I have time for this.  All
>> these changes make LRA unstable.  In fact, I've just found that the
>> changes I have already made so far resulted in 2 broken SPEC2000 tests,
>> although the GCC testsuite and bootstrap look good.
> OK.  I'm happy to try fixing the noncanonical thing.
>
>>>> +  /* If this is post-increment, first copy the location to the reload reg.  */
>>>> +  if (post && real_in != result)
>>>> +    emit_insn (gen_move_insn (result, real_in));
>>> Nit, but real_in != result can never be true AIUI, and I was confused how
>>> the code could be correct in that case.  Maybe just remove it, or make
>>> it an assert?
>> No, it might be true:
>>
>> real_in = in == value ? incloc : in;
>> ...
>> if (cond)
>>     result = incloc;
>> else
>>     result = ...
>>
>> if (post && real_in != result)
>>
>> So it is true if in==value && cond
> Sorry, what I meant was that cond is "! post && REG_P (incloc)":
>
>    if (! post && REG_P (incloc))
>      result = incloc;
>    else
>      result = lra_create_new_reg (GET_MODE (value), value, new_rclass,
> 				 "INC/DEC result");
>
> so it can never be true in the "post" case quoted above.
>
>
Fixed.
diff mbox

Patch

Index: lra-spills.c
===================================================================
--- lra-spills.c	(revision 0)
+++ lra-spills.c	(working copy)
@@ -0,0 +1,601 @@ 
+/* Replace pseudos with memory.
+   Copyright (C) 2010, 2011, 2012
+   Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.	If not see
+<http://www.gnu.org/licenses/>.	 */
+
+
+/* This file contains code for a pass to change spilled pseudos into
+   memory.
+
+   The pass creates necessary stack slots and assigns spilled pseudos
+   to the stack slots in the following way:
+
+   for all spilled pseudos P most frequently used first do
+     for all stack slots S do
+       if P doesn't conflict with pseudos assigned to S then
+	 assign S to P and goto to the next pseudo process
+       end
+     end
+     create new stack slot S and assign P to S
+   end
+ 
+   The actual algorithm is a bit more complicated because of different
+   pseudo sizes.
+
+   After that the code replaces spilled pseudos (except ones created
+   from scratches) with the corresponding stack slot memory in RTL.
+
+   If at least one stack slot was created, we need to run more passes
+   because we have new addresses which should be checked and because
+   the old address displacements might change and address constraints
+   (or insn memory constraints) might not be satisfied any more.
+
+   For some targets, the pass can spill some pseudos into hard
+   registers of a different class (usually into vector registers)
+   instead of spilling them into memory if it is possible and
+   profitable.  Spilling GENERAL_REGS pseudos into SSE registers for
+   Intel Core i7 processors is an example of such an optimization.
+   And this is actually recommended by the Intel optimization
+   guide.
+
+   The file also contains code for the final replacement of pseudos
+   with the hard regs assigned to them.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "rtl.h"
+#include "tm_p.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "output.h"
+#include "regs.h"
+#include "hard-reg-set.h"
+#include "flags.h"
+#include "function.h"
+#include "expr.h"
+#include "basic-block.h"
+#include "except.h"
+#include "timevar.h"
+#include "target.h"
+#include "lra-int.h"
+#include "ira.h"
+#include "df.h"
+
+
+/* Max regno at the start of the pass.	*/
+static int regs_num;
+
+/* Map spilled regno -> hard regno used instead of memory for
+   spilling.  */
+static rtx *spill_hard_reg;
+
+/* The structure describes the stack slot of a spilled pseudo.  */
+struct pseudo_slot
+{
+  /* Number (0, 1, ...) of the stack slot to which the given pseudo
+     belongs.  */
+  int slot_num;
+  /* First or next slot with the same slot number.  */
+  struct pseudo_slot *next, *first;
+  /* Memory representing the spilled pseudo.  */
+  rtx mem;
+};
+
+/* The stack slots for each spilled pseudo.  Indexed by regnos.	 */
+static struct pseudo_slot *pseudo_slots;
+
+/* The structure describes a register or stack slot which can be used
+   for several spilled pseudos.  */
+struct slot
+{
+  /* First pseudo with given stack slot.  */
+  int regno;
+  /* Hard reg into which the slot pseudos are spilled.	The value is
+     negative for pseudos spilled into memory.	*/
+  int hard_regno;
+  /* Memory representing the whole stack slot.  It can be different
+     from the memory representing a pseudo belonging to the given stack
+     slot because the pseudo can be placed in a part of the
+     corresponding stack slot.
+     The value is NULL for pseudos spilled into a hard reg.  */
+  rtx mem;
+  /* Combined live ranges of all pseudos belonging to the given slot.
+     It is used to figure out whether a new spilled pseudo can use the
+     given stack slot.  */
+  lra_live_range_t live_ranges;
+};
+
+/* Array containing info about the stack slots.  The array element is
+   indexed by the stack slot number in the range [0..slots_num).  */
+static struct slot *slots;
+/* The number of the stack slots currently existing.  */
+static int slots_num;
+
+/* Set up the memory of spilled pseudo I.  The function can allocate
+   the corresponding stack slot if it has not been done yet.  */
+static void
+assign_mem_slot (int i)
+{
+  rtx x = NULL_RTX;
+  enum machine_mode mode = GET_MODE (regno_reg_rtx[i]);
+  unsigned int inherent_size = PSEUDO_REGNO_BYTES (i);
+  unsigned int inherent_align = GET_MODE_ALIGNMENT (mode);
+  unsigned int max_ref_width = GET_MODE_SIZE (lra_reg_info[i].biggest_mode);
+  unsigned int total_size = MAX (inherent_size, max_ref_width);
+  unsigned int min_align = max_ref_width * BITS_PER_UNIT;
+  int adjust = 0;
+
+  lra_assert (regno_reg_rtx[i] != NULL_RTX && REG_P (regno_reg_rtx[i])
+	      && lra_reg_info[i].nrefs != 0 && reg_renumber[i] < 0);
+  
+  x = slots[pseudo_slots[i].slot_num].mem;
+  
+  if (x)
+    ;
+  /* Each pseudo has an inherent size which comes from its own mode,
+     and a total size which provides room for paradoxical subregs
+     which refer to the pseudo reg in wider modes.
+     
+     We can use a slot already allocated if it provides both enough
+     inherent space and enough total space.  Otherwise, we allocate a
+     new slot, making sure that it has no less inherent space, and no
+     less total space, than the previous slot.  */
+  else
+    {
+      rtx stack_slot;
+
+      /* No known place to spill from => no slot to reuse.  */
+      x = assign_stack_local (mode, total_size,
+			      min_align > inherent_align
+			      || total_size > inherent_size ? -1 : 0);
+      x = lra_eliminate_regs_1 (x, GET_MODE (x), false, false, true);
+      stack_slot = x;
+      /* Cancel the big-endian correction done in assign_stack_local.
+	 Get the address of the beginning of the slot.	This is so we
+	 can do a big-endian correction unconditionally below.	*/
+      if (BYTES_BIG_ENDIAN)
+	{
+	  adjust = inherent_size - total_size;
+	  if (adjust)
+	    stack_slot
+	      = adjust_address_nv (x,
+				   mode_for_size (total_size * BITS_PER_UNIT,
+						  MODE_INT, 1),
+				   adjust);
+	}
+      slots[pseudo_slots[i].slot_num].mem = stack_slot;
+    }
+      
+  /* On a big endian machine, the "address" of the slot is the address
+     of the low part that fits its inherent mode.  */
+  if (BYTES_BIG_ENDIAN && inherent_size < total_size)
+    adjust += (total_size - inherent_size);
+  
+  /* Adjust the slot memory to the mode of the pseudo, applying any
+     big-endian correction computed above.  */
+  x = adjust_address_nv (x, GET_MODE (regno_reg_rtx[i]), adjust);
+  
+  /* Set all of the memory attributes as appropriate for a spill.  */
+  set_mem_attrs_for_spill (x);
+  pseudo_slots[i].mem = x;
+}
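+
+/* For example, a QImode pseudo referenced through a paradoxical
+   (subreg:SI ...) has inherent size 1 but total size 4.  When such a
+   pseudo reuses an already allocated slot on a big-endian target,
+   the adjustment above moves the QImode "address" 3 bytes forward,
+   to the low part at the end of the 4-byte slot.  */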
+
+/* Sort pseudos according to their usage frequencies.  */
+static int
+regno_freq_compare (const void *v1p, const void *v2p)
+{
+  const int regno1 = *(const int *) v1p;
+  const int regno2 = *(const int *) v2p;
+  int diff;
+
+  if ((diff = lra_reg_info[regno2].freq - lra_reg_info[regno1].freq) != 0)
+    return diff;
+  return regno1 - regno2;
+}
+
+/* Redefine STACK_GROWS_DOWNWARD in terms of 0 or 1.  */
+#ifdef STACK_GROWS_DOWNWARD
+# undef STACK_GROWS_DOWNWARD
+# define STACK_GROWS_DOWNWARD 1
+#else
+# define STACK_GROWS_DOWNWARD 0
+#endif
+
+/* Sort pseudos according to their slot numbers, putting pseudos with
+   smaller numbers first, or last when the frame pointer is not needed
+   and the frame and stack grow in the same direction.  The aim is
+   that pseudos in the first slots end up addressed with the smallest
+   displacements from the frame or stack pointer.  Within one slot,
+   put the pseudo with the biggest total size first, so that the slot
+   is allocated with enough room for all of its pseudos.  */
+static int
+pseudo_reg_slot_compare (const void *v1p, const void *v2p)
+{
+  const int regno1 = *(const int *) v1p;
+  const int regno2 = *(const int *) v2p;
+  int diff, slot_num1, slot_num2;
+  int total_size1, total_size2;
+
+  slot_num1 = pseudo_slots[regno1].slot_num;
+  slot_num2 = pseudo_slots[regno2].slot_num;
+  if ((diff = slot_num1 - slot_num2) != 0)
+    return (frame_pointer_needed
+	    || (!FRAME_GROWS_DOWNWARD) == STACK_GROWS_DOWNWARD ? diff : -diff);
+  total_size1 = MAX (PSEUDO_REGNO_BYTES (regno1),
+		     GET_MODE_SIZE (lra_reg_info[regno1].biggest_mode));
+  total_size2 = MAX (PSEUDO_REGNO_BYTES (regno2),
+		     GET_MODE_SIZE (lra_reg_info[regno2].biggest_mode));
+  if ((diff = total_size2 - total_size1) != 0)
+    return diff;
+  return regno1 - regno2;
+}
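+
+/* For example, when the frame pointer is needed, the pseudos of slot
+   0 sort before those of slot 1, and within one slot the pseudo with
+   the biggest total size comes first.  lra_spill below therefore
+   allocates each slot's memory when it reaches the widest pseudo of
+   the slot, and the most frequently used pseudos end up with the
+   smallest displacements.  */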
+
+/* Assign spill hard registers to N pseudos in PSEUDO_REGNOS.  Put the
+   pseudos which did not get a spill hard register at the beginning of
+   array PSEUDO_REGNOS.	 Return the number of such pseudos.  */
+static int
+assign_spill_hard_regs (int *pseudo_regnos, int n)
+{
+  int i, k, p, regno, res, spill_class_size, hard_regno, nr;
+  enum reg_class rclass, spill_class;
+  enum machine_mode mode;
+  lra_live_range_t r;
+  rtx insn, set;
+  basic_block bb;
+  HARD_REG_SET conflict_hard_regs;
+  bitmap_head ok_insn_bitmap;
+  bitmap set_jump_crosses = regstat_get_setjmp_crosses ();
+  /* Hard registers which cannot be used for any purpose at a given
+     program point because they are unallocatable or already allocated
+     for other pseudos.  */
+  HARD_REG_SET *reserved_hard_regs;
+
+  if (! lra_reg_spill_p)
+    return n;
+  /* Set up reserved hard regs for every program point.	 */
+  reserved_hard_regs = (HARD_REG_SET *) xmalloc (sizeof (HARD_REG_SET)
+						 * lra_live_max_point);
+  for (p = 0; p < lra_live_max_point; p++)
+    COPY_HARD_REG_SET (reserved_hard_regs[p], lra_no_alloc_regs);
+  for (i = FIRST_PSEUDO_REGISTER; i < regs_num; i++)
+    if (lra_reg_info[i].nrefs != 0
+	&& (hard_regno = lra_get_regno_hard_regno (i)) >= 0)
+      for (r = lra_reg_info[i].live_ranges; r != NULL; r = r->next)
+	for (p = r->start; p <= r->finish; p++)
+	  lra_add_hard_reg_set (hard_regno, lra_reg_info[i].biggest_mode,
+				&reserved_hard_regs[p]);
+  bitmap_initialize (&ok_insn_bitmap, &reg_obstack);
+  FOR_EACH_BB (bb)
+    FOR_BB_INSNS (bb, insn)
+      if (DEBUG_INSN_P (insn)
+	  || ((set = single_set (insn)) != NULL_RTX
+	      && REG_P (SET_SRC (set)) && REG_P (SET_DEST (set))))
+	bitmap_set_bit (&ok_insn_bitmap, INSN_UID (insn));
+  for (res = i = 0; i < n; i++)
+    {
+      regno = pseudo_regnos[i];
+      rclass = lra_get_allocno_class (regno);
+      if (bitmap_bit_p (set_jump_crosses, regno)
+	  || (spill_class = targetm.spill_class (rclass)) == NO_REGS
+	  || (targetm.spill_class_mode (rclass, spill_class,
+					PSEUDO_REGNO_MODE (regno))
+	      != PSEUDO_REGNO_MODE (regno))
+	  || bitmap_intersect_compl_p (&lra_reg_info[regno].insn_bitmap,
+				       &ok_insn_bitmap))
+	{
+	  pseudo_regnos[res++] = regno;
+	  continue;
+	}
+      COPY_HARD_REG_SET (conflict_hard_regs,
+			 lra_reg_info[regno].conflict_hard_regs);
+      for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
+	for (p = r->start; p <= r->finish; p++)
+	  IOR_HARD_REG_SET (conflict_hard_regs, reserved_hard_regs[p]);
+      spill_class = targetm.spill_class (lra_get_allocno_class (regno));
+      lra_assert (spill_class != NO_REGS);
+      spill_class_size = ira_class_hard_regs_num[spill_class];
+      mode = lra_reg_info[regno].biggest_mode;
+      for (k = 0; k < spill_class_size; k++)
+	{
+	  hard_regno = ira_class_hard_regs[spill_class][k];
+	  if (! lra_hard_reg_set_intersection_p (hard_regno, mode,
+						 conflict_hard_regs))
+	    break;
+	}
+      if (k >= spill_class_size)
+	{
+	   /* There are no available regs -- assign memory later.  */
+	  pseudo_regnos[res++] = regno;
+	  continue;
+	}
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "  Spill r%d into hr%d\n", regno, hard_regno);
+      /* Update reserved_hard_regs.  */
+      for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
+	for (p = r->start; p <= r->finish; p++)
+	  lra_add_hard_reg_set (hard_regno, lra_reg_info[regno].biggest_mode,
+				&reserved_hard_regs[p]);
+      spill_hard_reg[regno]
+	= gen_raw_REG (PSEUDO_REGNO_MODE (regno), hard_regno);
+      for (nr = 0;
+	   nr < hard_regno_nregs[hard_regno][lra_reg_info[regno].biggest_mode];
+	   nr++)
+	df_set_regs_ever_live (hard_regno + nr, true);
+    }
+  bitmap_clear (&ok_insn_bitmap);
+  free (reserved_hard_regs);
+  return res;
+}
+
+/* Add pseudo REGNO to slot SLOT_NUM.  */
+static void
+add_pseudo_to_slot (int regno, int slot_num)
+{
+  struct pseudo_slot *first;
+
+  if (slots[slot_num].regno < 0)
+    {
+      /* It is the first pseudo in the slot.  */
+      slots[slot_num].regno = regno;
+      pseudo_slots[regno].first = &pseudo_slots[regno];
+      pseudo_slots[regno].next = NULL;
+    }
+  else
+    {
+      first = pseudo_slots[regno].first = &pseudo_slots[slots[slot_num].regno];
+      pseudo_slots[regno].next = pseudo_slots[slots[slot_num].regno].next;
+      first->next = &pseudo_slots[regno];
+    }
+  pseudo_slots[regno].mem = NULL_RTX;
+  pseudo_slots[regno].slot_num = slot_num;
+  slots[slot_num].live_ranges
+    = lra_merge_live_ranges (slots[slot_num].live_ranges,
+			     lra_copy_live_range_list
+			     (lra_reg_info[regno].live_ranges));
+}
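+
+/* A minimal sketch, for illustration only (the pass itself does not
+   use this function), of how the lists built by add_pseudo_to_slot
+   tie together: the pseudos sharing a slot form a chain threaded
+   through the NEXT pointers, starting at the pseudo recorded in the
+   slot itself, and every element points back at the chain head
+   through FIRST.  */
+static void
+print_slot_pseudos (FILE *f, int slot_num)
+{
+  int curr_regno;
+
+  if (slots[slot_num].regno < 0)
+    return; /* An empty slot.  */
+  for (curr_regno = slots[slot_num].regno;;
+       /* Recover the regno from the element pointer by pointer
+	  arithmetic, as the dump code in lra_spill does.  */
+       curr_regno = pseudo_slots[curr_regno].next - pseudo_slots)
+    {
+      fprintf (f, " r%d", curr_regno);
+      if (pseudo_slots[curr_regno].next == NULL)
+	break;
+    }
+  fprintf (f, "\n");
+}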
+
+/* Assign stack slot numbers to pseudos in array PSEUDO_REGNOS of
+   length N.  Sort pseudos in PSEUDO_REGNOS for the subsequent
+   assignment of memory stack slots.  */
+static void
+assign_stack_slot_num_and_sort_pseudos (int *pseudo_regnos, int n)
+{
+  int i, j, regno;
+
+  slots_num = 0;
+  /* Assign stack slot numbers to spilled pseudos, using smaller
+     numbers for the most frequently used pseudos.  */
+  for (i = 0; i < n; i++)
+    {
+      regno = pseudo_regnos[i];
+      if (! flag_ira_share_spill_slots)
+	j = slots_num;
+      else
+	{
+	  for (j = 0; j < slots_num; j++)
+	    if (slots[j].hard_regno < 0
+		&& ! (lra_intersected_live_ranges_p
+		      (slots[j].live_ranges,
+		       lra_reg_info[regno].live_ranges)))
+	      break;
+	}
+      if (j >= slots_num)
+	{
+	  /* New slot.	*/
+	  slots[j].live_ranges = NULL;
+	  slots[j].regno = slots[j].hard_regno = -1;
+	  slots[j].mem = NULL_RTX;
+	  slots_num++;
+	}
+      add_pseudo_to_slot (regno, j);
+    }
+  /* Sort regnos according to their slot numbers.  */
+  qsort (pseudo_regnos, n, sizeof (int), pseudo_reg_slot_compare);
+}
+
+/* Recursively process LOC in INSN and change spilled pseudos to the
+   corresponding memory or spill hard reg.  Ignore spilled pseudos
+   created from scratches.  */
+static bool
+remove_pseudos (rtx *loc, rtx insn)
+{
+  int i;
+  bool res;
+  rtx hard_reg;
+  const char *fmt;
+  enum rtx_code code;
+
+  if (*loc == NULL_RTX)
+    return false;
+  code = GET_CODE (*loc);
+  if (code == REG && (i = REGNO (*loc)) >= FIRST_PSEUDO_REGISTER
+      && lra_get_regno_hard_regno (i) < 0
+      /* We do not want to assign memory for former scratches because
+	 it might result in an address reload for some targets.  In
+	 any case, such pseudos which do not get hard registers are
+	 transformed back into scratches.  */
+      && ! lra_former_scratch_p (i))
+    {
+      hard_reg = spill_hard_reg[i];
+      *loc = copy_rtx (hard_reg != NULL_RTX ? hard_reg : pseudo_slots[i].mem);
+      return true;
+    }
+
+  res = false;
+  fmt = GET_RTX_FORMAT (code);
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      if (fmt[i] == 'e')
+	res = remove_pseudos (&XEXP (*loc, i), insn) || res;
+      else if (fmt[i] == 'E')
+	{
+	  int j;
+
+	  for (j = XVECLEN (*loc, i) - 1; j >= 0; j--)
+	    res = remove_pseudos (&XVECEXP (*loc, i, j), insn) || res;
+	}
+    }
+  return res;
+}
+
+/* Convert spilled pseudos into their stack slots or spill hard regs,
+   and put the insns to process on the constraint stack (that is, all
+   insns in which pseudos were changed to memory or spill hard
+   regs).  */
+static void
+spill_pseudos (void)
+{
+  basic_block bb;
+  rtx insn;
+  int i;
+  bitmap_head spilled_pseudos, changed_insns;
+
+  bitmap_initialize (&spilled_pseudos, &reg_obstack);
+  bitmap_initialize (&changed_insns, &reg_obstack);
+  for (i = FIRST_PSEUDO_REGISTER; i < regs_num; i++)
+    {
+      if (lra_reg_info[i].nrefs != 0 && lra_get_regno_hard_regno (i) < 0
+	  && ! lra_former_scratch_p (i))
+	{
+	  bitmap_set_bit (&spilled_pseudos, i);
+	  bitmap_ior_into (&changed_insns, &lra_reg_info[i].insn_bitmap);
+	}
+    }
+  FOR_EACH_BB (bb)
+    {
+      FOR_BB_INSNS (bb, insn)
+	if (bitmap_bit_p (&changed_insns, INSN_UID (insn)))
+	  {
+	    remove_pseudos (&PATTERN (insn), insn);
+	    if (lra_dump_file != NULL)
+	      fprintf (lra_dump_file,
+		       "Changing spilled pseudos to memory in insn #%u\n",
+		       INSN_UID (insn));
+	    lra_push_insn (insn);
+	    if (lra_reg_spill_p || targetm.different_addr_displacement_p ())
+	      lra_set_used_insn_alternative (insn, -1);
+	  }
+      bitmap_and_compl_into (DF_LR_IN (bb), &spilled_pseudos);
+      bitmap_and_compl_into (DF_LR_OUT (bb), &spilled_pseudos);
+    }
+  bitmap_clear (&spilled_pseudos);
+  bitmap_clear (&changed_insns);
+}
+
+/* Return true if we need to change some pseudos into memory.  */
+bool
+lra_need_for_spills_p (void)
+{
+  int i;
+
+  max_regno = max_reg_num ();
+
+  for (i = FIRST_PSEUDO_REGISTER; i < max_regno; i++)
+    if (lra_reg_info[i].nrefs != 0 && lra_get_regno_hard_regno (i) < 0
+	&& ! lra_former_scratch_p (i))
+      return true;
+  return false;
+}
+
+/* Change spilled pseudos into memory or spill hard regs.  The
+   function puts changed insns on the constraint stack (these insns
+   will be considered on the next constraint pass).  The changed insns
+   are all insns in which pseudos were changed.  */
+void
+lra_spill (void)
+{
+  int i, n, curr_regno;
+  int *pseudo_regnos;
+
+  regs_num = max_reg_num ();
+  spill_hard_reg = (rtx *) xmalloc (sizeof (rtx) * regs_num);
+  pseudo_regnos = (int *) xmalloc (sizeof (int) * regs_num);
+  for (n = 0, i = FIRST_PSEUDO_REGISTER; i < regs_num; i++)
+    if (lra_reg_info[i].nrefs != 0 && lra_get_regno_hard_regno (i) < 0
+	/* We do not want to assign memory for former scratches.  */
+	&& ! lra_former_scratch_p (i))
+      {
+	spill_hard_reg[i] = NULL_RTX;
+	pseudo_regnos[n++] = i;
+      }
+  lra_assert (n > 0);
+  pseudo_slots = (struct pseudo_slot *) xmalloc (sizeof (struct pseudo_slot)
+						 * regs_num);
+  slots = (struct slot *) xmalloc (sizeof (struct slot) * regs_num);
+  /* Sort regnos according to their usage frequencies.  */
+  qsort (pseudo_regnos, n, sizeof (int), regno_freq_compare);
+  n = assign_spill_hard_regs (pseudo_regnos, n);
+  assign_stack_slot_num_and_sort_pseudos (pseudo_regnos, n);
+  for (i = 0; i < n; i++)
+    if (pseudo_slots[pseudo_regnos[i]].mem == NULL_RTX)
+      assign_mem_slot (pseudo_regnos[i]);
+  if (lra_dump_file != NULL)
+    {
+      for (i = 0; i < slots_num; i++)
+	{
+	  fprintf (lra_dump_file, "  Slot %d regnos (width = %d):", i,
+		   GET_MODE_SIZE (GET_MODE (slots[i].mem)));
+	  for (curr_regno = slots[i].regno;;
+	       curr_regno = pseudo_slots[curr_regno].next - pseudo_slots)
+	    {
+	      fprintf (lra_dump_file, "	 %d", curr_regno);
+	      if (pseudo_slots[curr_regno].next == NULL)
+		break;
+	    }
+	  fprintf (lra_dump_file, "\n");
+	}
+    }
+  spill_pseudos ();
+  free (slots);
+  free (pseudo_slots);
+  free (pseudo_regnos);
+}
+
+/* Finally change pseudos which got hard registers into the
+   corresponding hard registers.  */
+void
+lra_hard_reg_substitution (void)
+{
+  int i, hard_regno;
+  basic_block bb;
+  rtx insn;
+  int max_regno = max_reg_num ();
+
+  for (i = FIRST_PSEUDO_REGISTER; i < max_regno; i++)
+    if (lra_reg_info[i].nrefs != 0
+	&& (hard_regno = lra_get_regno_hard_regno (i)) >= 0)
+      SET_REGNO (regno_reg_rtx[i], hard_regno);
+  FOR_EACH_BB (bb)
+    FOR_BB_INSNS (bb, insn)
+      if (INSN_P (insn))
+	{
+	  lra_insn_recog_data_t id;
+	  bool insn_change_p = false;
+
+	  id = lra_get_insn_recog_data (insn);
+	  for (i = id->insn_static_data->n_operands - 1; i >= 0; i--)
+	    {
+	      rtx op = *id->operand_loc[i];
+
+	      if (GET_CODE (op) == SUBREG && REG_P (SUBREG_REG (op)))
+		{
+		  lra_assert (REGNO (SUBREG_REG (op)) < FIRST_PSEUDO_REGISTER);
+		  alter_subreg (id->operand_loc[i], ! DEBUG_INSN_P (insn));
+		  lra_update_dup (id, i);
+		  insn_change_p = true;
+		}
+	    }
+	  if (insn_change_p)
+	    lra_update_operator_dups (id);
+	}
+}
Index: lra-coalesce.c
===================================================================
--- lra-coalesce.c	(revision 0)
+++ lra-coalesce.c	(working copy)
@@ -0,0 +1,375 @@ 
+/* Coalesce spilled pseudos.
+   Copyright (C) 2010, 2011, 2012
+   Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.	If not see
+<http://www.gnu.org/licenses/>.	 */
+
+
+/* This file contains a pass making some simple RTL code
+   transformations by coalescing pseudos to remove some move insns.
+
+   Spilling pseudos in LRA can create memory-memory moves.  We should
+   remove potential memory-memory moves before the next constraint
+   pass because the constraint pass will generate additional insns for
+   such moves and all these insns will be hard to remove afterwards.
+
+   Here we coalesce only spilled pseudos.  Coalescing non-spilled
+   pseudos (with different hard regs) might result in spilling
+   additional pseudos because of possible conflicts with other
+   non-spilled pseudos and, as a consequence, in more constraint
+   passes and even LRA infinite cycling.  Trivial moves whose source
+   and destination are the same hard register will be removed by
+   subsequent compiler passes.
+
+   We don't coalesce bound pseudos.  It complicates the LRA code a
+   lot without any visible improvement in the generated code.
+
+   The pseudo live-ranges are used to find conflicting pseudos during
+   coalescing.
+
+   The most frequently executed moves are tried for coalescing
+   first.  */
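+
+/* For instance, when r100 and r200 are both spilled, the insn
+   (set (reg:SI r200) (reg:SI r100)) would become a memory-memory
+   move.  Coalescing substitutes r100 for r200 everywhere, turning
+   the insn into a self-move, which lra_coalesce simply deletes.  */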
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "rtl.h"
+#include "tm_p.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "output.h"
+#include "regs.h"
+#include "hard-reg-set.h"
+#include "flags.h"
+#include "function.h"
+#include "expr.h"
+#include "basic-block.h"
+#include "except.h"
+#include "timevar.h"
+#include "ira.h"
+#include "lra-int.h"
+#include "df.h"
+
+/* Arrays whose elements represent the first and the next pseudo
+   (regno) in the coalesced pseudos group to which the given pseudo
+   (its regno is the index) belongs.  The next of the last pseudo in
+   the group refers to the first pseudo in the group; in other words,
+   the group is represented by a cyclic list.  */
+static int *first_coalesced_pseudo, *next_coalesced_pseudo;
+
+/* The function is used to sort moves according to their execution
+   frequencies.	 */
+static int
+move_freq_compare_func (const void *v1p, const void *v2p)
+{
+  rtx mv1 = *(const rtx *) v1p;
+  rtx mv2 = *(const rtx *) v2p;
+  int pri1, pri2;
+  
+  pri1 = BLOCK_FOR_INSN (mv1)->frequency;
+  pri2 = BLOCK_FOR_INSN (mv2)->frequency;
+  if (pri2 - pri1)
+    return pri2 - pri1;
+
+  /* If frequencies are equal, sort by move uids, so that the results
+     of qsort leave nothing to chance.  */
+  return (int) INSN_UID (mv1) - (int) INSN_UID (mv2);
+}
+
+/* Pseudos which go away after coalescing.  */
+static bitmap_head coalesced_pseudos_bitmap;
+
+/* Merge two sets of coalesced pseudos given correspondingly by
+   pseudos REGNO1 and REGNO2 (more accurately, merging the group of
+   REGNO2 into the group of REGNO1).  Set up
+   COALESCED_PSEUDOS_BITMAP.  */
+static void
+merge_pseudos (int regno1, int regno2)
+{
+  int regno, first, first2, last, next;
+
+  first = first_coalesced_pseudo[regno1];
+  if ((first2 = first_coalesced_pseudo[regno2]) == first)
+    return;
+  for (last = regno2, regno = next_coalesced_pseudo[regno2];;
+       regno = next_coalesced_pseudo[regno])
+    {
+      first_coalesced_pseudo[regno] = first;
+      bitmap_set_bit (&coalesced_pseudos_bitmap, regno);
+      if (regno == regno2)
+	break;
+      last = regno;
+    }
+  next = next_coalesced_pseudo[first];
+  next_coalesced_pseudo[first] = regno2;
+  next_coalesced_pseudo[last] = next;
+  lra_reg_info[first].live_ranges
+    = (lra_merge_live_ranges
+       (lra_reg_info[first].live_ranges,
+	lra_copy_live_range_list (lra_reg_info[first2].live_ranges)));
+  if (GET_MODE_SIZE (lra_reg_info[first].biggest_mode)
+      < GET_MODE_SIZE (lra_reg_info[first2].biggest_mode))
+    lra_reg_info[first].biggest_mode = lra_reg_info[first2].biggest_mode;
+}
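+
+/* For example, merging the group 5 -> 7 -> 5 (first pseudo 5) into
+   the group 3 -> 4 -> 3 (first pseudo 3) marks pseudos 5 and 7 in
+   COALESCED_PSEUDOS_BITMAP, sets their first pseudo to 3, and
+   splices the two cyclic lists into 3 -> 5 -> 7 -> 4 -> 3.  */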
+
+/* Change pseudos in *LOC to their coalescing group
+   representatives.  */
+static bool
+substitute (rtx *loc)
+{
+  int i, regno;
+  const char *fmt;
+  enum rtx_code code;
+  bool res;
+
+  if (*loc == NULL_RTX)
+    return false;
+  code = GET_CODE (*loc);
+  if (code == REG)
+    {
+      regno = REGNO (*loc);
+      if (regno < FIRST_PSEUDO_REGISTER
+	  || first_coalesced_pseudo[regno] == regno)
+	return false;
+      *loc = regno_reg_rtx[first_coalesced_pseudo[regno]];
+      return true;
+    }
+
+  res = false;
+  fmt = GET_RTX_FORMAT (code);
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      if (fmt[i] == 'e')
+	{
+	  if (substitute (&XEXP (*loc, i)))
+	    res = true;
+	}
+      else if (fmt[i] == 'E')
+	{
+	  int j;
+
+	  for (j = XVECLEN (*loc, i) - 1; j >= 0; j--)
+	    if (substitute (&XVECEXP (*loc, i, j)))
+	      res = true;
+	}
+    }
+  return res;
+}
+
+/* The current iteration (1, 2, ...) of the coalescing pass.  */
+int lra_coalesce_iter;
+
+/* Return true if the move involving REGNO1 and REGNO2 is a potential
+   memory-memory move.	*/
+static bool
+mem_move_p (int regno1, int regno2)
+{
+  return reg_renumber[regno1] < 0 && reg_renumber[regno2] < 0;
+}
+
+
+/* Pseudos which go away after coalescing and pseudos used instead of
+   the removed pseudos.	 */
+static bitmap_head removed_pseudos_bitmap, used_pseudos_bitmap;
+
+/* Set up REMOVED_PSEUDOS_BITMAP and USED_PSEUDOS_BITMAP, and update
+   LR_BITMAP (a BB live info bitmap).  */
+static void
+update_live_info (bitmap lr_bitmap)
+{
+  unsigned int j;
+  bitmap_iterator bi;
+
+  bitmap_clear (&removed_pseudos_bitmap);
+  bitmap_clear (&used_pseudos_bitmap);
+  EXECUTE_IF_AND_IN_BITMAP (&coalesced_pseudos_bitmap, lr_bitmap,
+			    FIRST_PSEUDO_REGISTER, j, bi)
+    {
+      bitmap_set_bit (&removed_pseudos_bitmap, j);
+      bitmap_set_bit (&used_pseudos_bitmap, first_coalesced_pseudo[j]);
+    }
+  if (! bitmap_empty_p (&removed_pseudos_bitmap))
+    {
+      bitmap_and_compl_into (lr_bitmap, &removed_pseudos_bitmap);
+      bitmap_ior_into (lr_bitmap, &used_pseudos_bitmap);
+    }
+}
+
+/* The major function for aggressive coalescing of moves for which
+   both pseudos involved were spilled and are not bound.  */
+bool
+lra_coalesce (void)
+{
+  basic_block bb;
+  rtx mv, set, insn, next, *sorted_moves;
+  int i, n, mv_num, sregno, dregno, restore_regno;
+  unsigned int regno;
+  int coalesced_moves;
+  int max_regno = max_reg_num ();
+  bitmap_head involved_insns_bitmap, split_origin_bitmap;
+  bitmap_iterator bi;
+
+  if (lra_dump_file != NULL)
+    fprintf (lra_dump_file,
+	     "\n********** Pseudos coalescing #%d: **********\n\n",
+	     ++lra_coalesce_iter);
+  first_coalesced_pseudo = (int *) xmalloc (sizeof (int) * max_regno);
+  next_coalesced_pseudo = (int *) xmalloc (sizeof (int) * max_regno);
+  for (i = 0; i < max_regno; i++)
+    first_coalesced_pseudo[i] = next_coalesced_pseudo[i] = i;
+  sorted_moves = (rtx *) xmalloc (get_max_uid () * sizeof (rtx));
+  mv_num = 0;
+  /* Collect pseudos whose live ranges were split.  */
+  bitmap_initialize (&split_origin_bitmap, &reg_obstack);
+  EXECUTE_IF_SET_IN_BITMAP (&lra_split_pseudos, 0, regno, bi)
+    if ((restore_regno = lra_reg_info[regno].restore_regno) >= 0)
+      bitmap_set_bit (&split_origin_bitmap, restore_regno);
+  /* Collect moves.  */
+  coalesced_moves = 0;
+  FOR_EACH_BB (bb)
+    {
+      FOR_BB_INSNS_SAFE (bb, insn, next)
+	if (INSN_P (insn)
+	    && (set = single_set (insn)) != NULL_RTX
+	    && REG_P (SET_DEST (set)) && REG_P (SET_SRC (set))
+	    && (sregno = REGNO (SET_SRC (set))) >= FIRST_PSEUDO_REGISTER
+	    && (dregno = REGNO (SET_DEST (set))) >= FIRST_PSEUDO_REGISTER
+	    && mem_move_p (sregno, dregno)
+	    /* Don't coalesce inheritance pseudos because spilled
+	       inheritance pseudos will be removed in the subsequent
+	       'undo inheritance' pass.  */
+	    && lra_reg_info[sregno].restore_regno < 0
+	    && lra_reg_info[dregno].restore_regno < 0
+	    /* We undo splits for spilled pseudos whose live ranges
+	       were split.  So don't coalesce them: it is not
+	       necessary, and the undo transformations would otherwise
+	       be wrong.  */
+	    && ! bitmap_bit_p (&split_origin_bitmap, sregno)
+	    && ! bitmap_bit_p (&split_origin_bitmap, dregno)
+	    && ! side_effects_p (set)
+	    /* Don't coalesce bound pseudos.  Bound pseudos have their
+	       own rules for finding live ranges.  It is hard to
+	       maintain this info with coalescing, and it is not worth
+	       doing.  */
+	    && ! bitmap_bit_p (&lra_bound_pseudos, sregno)
+	    && ! bitmap_bit_p (&lra_bound_pseudos, dregno)
+	    /* We don't want to coalesce regnos with equivalences,
+	       at least without updating this info.  */
+	    && ira_reg_equiv[sregno].constant == NULL_RTX
+	    && ira_reg_equiv[sregno].memory == NULL_RTX
+	    && ira_reg_equiv[sregno].invariant == NULL_RTX
+	    && ira_reg_equiv[dregno].constant == NULL_RTX
+	    && ira_reg_equiv[dregno].memory == NULL_RTX
+	    && ira_reg_equiv[dregno].invariant == NULL_RTX
+	    && !(lra_intersected_live_ranges_p
+		 (lra_reg_info[sregno].live_ranges,
+		  lra_reg_info[dregno].live_ranges)))
+	  sorted_moves[mv_num++] = insn;
+    }
+  bitmap_clear (&removed_pseudos_bitmap);
+  qsort (sorted_moves, mv_num, sizeof (rtx), move_freq_compare_func);
+  /* Coalesce copies, most frequently executed first.	*/
+  bitmap_initialize (&coalesced_pseudos_bitmap, &reg_obstack);
+  bitmap_initialize (&involved_insns_bitmap, &reg_obstack);
+  for (; mv_num != 0;)
+    {
+      for (i = 0; i < mv_num; i++)
+	{
+	  mv = sorted_moves[i];
+	  set = single_set (mv);
+	  lra_assert (set != NULL && REG_P (SET_SRC (set))
+		      && REG_P (SET_DEST (set)));
+	  sregno = REGNO (SET_SRC (set));
+	  dregno = REGNO (SET_DEST (set));
+	  if (! lra_intersected_live_ranges_p
+		(lra_reg_info[first_coalesced_pseudo[sregno]].live_ranges,
+		 lra_reg_info[first_coalesced_pseudo[dregno]].live_ranges))
+	    {
+	      coalesced_moves++;
+	      if (lra_dump_file != NULL)
+		fprintf
+		  (lra_dump_file,
+		   "	  Coalescing move %i:r%d(%d)-r%d(%d) (freq=%d)\n",
+		   INSN_UID (mv), sregno, ORIGINAL_REGNO (SET_SRC (set)),
+		   dregno, ORIGINAL_REGNO (SET_DEST (set)),
+		   BLOCK_FOR_INSN (mv)->frequency);
+	      bitmap_ior_into (&involved_insns_bitmap,
+			       &lra_reg_info[sregno].insn_bitmap);
+	      bitmap_ior_into (&involved_insns_bitmap,
+			       &lra_reg_info[dregno].insn_bitmap);
+	      merge_pseudos (sregno, dregno);
+	      i++;
+	      break;
+	    }
+	}
+      /* Collect the rest of the copies.  */
+      for (n = 0; i < mv_num; i++)
+	{
+	  mv = sorted_moves[i];
+	  set = single_set (mv);
+	  lra_assert (set != NULL && REG_P (SET_SRC (set))
+		      && REG_P (SET_DEST (set)));
+	  sregno = REGNO (SET_SRC (set));
+	  dregno = REGNO (SET_DEST (set));
+	  if (first_coalesced_pseudo[sregno] != first_coalesced_pseudo[dregno])
+	    sorted_moves[n++] = mv;
+	  else if (lra_dump_file != NULL)
+	    {
+	      coalesced_moves++;
+	      fprintf
+		(lra_dump_file, "      Coalescing move %i:r%d-r%d (freq=%d)\n",
+		 INSN_UID (mv), sregno, dregno,
+		 BLOCK_FOR_INSN (mv)->frequency);
+	    }
+	}
+      mv_num = n;
+    }
+  bitmap_initialize (&removed_pseudos_bitmap, &reg_obstack);
+  bitmap_initialize (&used_pseudos_bitmap, &reg_obstack);
+  FOR_EACH_BB (bb)
+    {
+      update_live_info (DF_LR_IN (bb));
+      update_live_info (DF_LR_OUT (bb));
+      FOR_BB_INSNS_SAFE (bb, insn, next)
+	if (INSN_P (insn)
+	    && bitmap_bit_p (&involved_insns_bitmap, INSN_UID (insn)))
+	  {
+	    if (! substitute (&insn))
+	      continue;
+	    lra_update_insn_regno_info (insn);
+	    if ((set = single_set (insn)) != NULL_RTX
+		&& REG_P (SET_DEST (set)) && REG_P (SET_SRC (set))
+		&& REGNO (SET_SRC (set)) == REGNO (SET_DEST (set))
+		&& ! side_effects_p (set))
+	      {
+		/* Coalesced move.  */
+		if (lra_dump_file != NULL)
+		  fprintf (lra_dump_file, "	 Removing move %i (freq=%d)\n",
+			 INSN_UID (insn), BLOCK_FOR_INSN (insn)->frequency);
+		lra_set_insn_deleted (insn);
+	      }
+	  }
+    }
+  bitmap_clear (&removed_pseudos_bitmap);
+  bitmap_clear (&used_pseudos_bitmap);
+  bitmap_clear (&involved_insns_bitmap);
+  bitmap_clear (&coalesced_pseudos_bitmap);
+  if (lra_dump_file != NULL && coalesced_moves != 0)
+    fprintf (lra_dump_file, "Coalesced Moves = %d\n", coalesced_moves);
+  free (sorted_moves);
+  free (next_coalesced_pseudo);
+  free (first_coalesced_pseudo);
+  return coalesced_moves != 0;
+}
Index: lra-assigns.c
===================================================================
--- lra-assigns.c	(revision 0)
+++ lra-assigns.c	(working copy)
@@ -0,0 +1,1316 @@ 
+/* Assign reload pseudos.
+   Copyright (C) 2010, 2011, 2012
+   Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.	If not see
+<http://www.gnu.org/licenses/>.	 */
+
+
+/* This file contains a pass which mostly assigns hard registers to
+   reload pseudos.  The pass makes no RTL code transformations.
+
+   Reload pseudos usually get the hard registers they need, possibly
+   at the cost of spilling non-reload pseudos; reload pseudos with the
+   smallest number of available hard registers are assigned first.
+
+   If reload pseudos can get hard registers only through spilling
+   other pseudos, we choose what pseudos to spill taking into account
+   how given reload pseudo benefits and also how other reload pseudos
+   not assigned yet benefit too (see function spill_for).
+
+   Non-reload pseudos can get hard registers too if it is possible
+   and improves the code.  It might become possible because of the
+   spilling of non-reload pseudos on a given pass.
+
+   Bound pseudos always get the same hard register if any.
+
+   We try to assign hard registers by processing pseudos by threads.
+   A thread contains reload and inheritance pseudos connected by
+   copies (move insns).  Doing this improves the chances of assigning
+   the same hard register to the pseudos of a thread and, as a
+   result, of removing some move insns.
+
+   When we assign a hard register to a pseudo, we decrease the cost
+   of the hard register for the corresponding pseudos connected by
+   copies.
+ 
+   If two hard registers are equally good for assigning the pseudo
+   from the hard register cost point of view, we prefer a hard
+   register in a smaller register bank.  By default, there is only one
+   register bank.  A target can define register banks through the hook
+   register_bank (a sketch of such a hook follows the include
+   directives below).  For example, x86-64 has a few register banks:
+   hard regs with and without REX prefixes are in different banks.
+   This permits the generation of smaller code, as insns without a REX
+   prefix are shorter.
+
+   If several hard registers are still equally good for the
+   assignment, we choose the least used hard register.  This is called
+   leveling and may be profitable for some targets.
+
+   Only insns with changed allocation pseudos are processed on the
+   next constraint pass.
+
+   The pseudo live-ranges are used to find conflicting pseudos.	 */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "hard-reg-set.h"
+#include "rtl.h"
+#include "tm_p.h"
+#include "target.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "output.h"
+#include "regs.h"
+#include "function.h"
+#include "expr.h"
+#include "basic-block.h"
+#include "except.h"
+#include "df.h"
+#include "ira.h"
+#include "sparseset.h"
+#include "lra-int.h"
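+
+/* A minimal sketch of a possible register_bank hook, illustrating the
+   bank mechanism described at the top of the file; the cut-off of 8
+   is purely illustrative (it matches the number of non-REX general
+   registers on x86-64).  Hard registers whose encoding needs a longer
+   instruction prefix go into a higher-numbered bank, so that
+   find_hard_regno_for prefers the shorter encodings when the costs
+   are equal.  */
+static int
+example_register_bank (int hard_regno)
+{
+  return hard_regno < 8 ? 0 : 1;
+}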
+
+/* Array containing corresponding values of function
+   lra_get_allocno_class.  It is used to speed up the code.  */
+static enum reg_class *regno_allocno_class_array;
+
+/* Info about a pseudo used during the assignment pass.  A thread is a
+   set of connected reload and inheritance pseudos with the same set
+   of available hard registers.  In all other cases a pseudo forms a
+   thread by itself.  */
+struct regno_assign_info
+{
+  /* First/next pseudo of the same thread.  */
+  int first, next;
+  /* Frequency of the thread (execution frequency of only reload
+     pseudos in the thread when the thread contains a reload pseudo).
+     Defined only for the first thread pseudo.	*/
+  int freq;
+};
+
+/* Map regno to the corresponding regno assignment info.  */
+static struct regno_assign_info *regno_assign_info;
+
+/* Process a pseudo copy with execution frequency COPY_FREQ connecting
+   REGNO1 and REGNO2 to form threads.  */
+static void
+process_copy_to_form_thread (int regno1, int regno2, int copy_freq)
+{
+  int last, regno1_first, regno2_first;
+
+  lra_assert (regno1 >= lra_constraint_new_regno_start
+	      && regno2 >= lra_constraint_new_regno_start);
+  regno1_first = regno_assign_info[regno1].first;
+  regno2_first = regno_assign_info[regno2].first;
+  if (regno1_first != regno2_first)
+    {
+      for (last = regno2_first;
+	   regno_assign_info[last].next >= 0;
+	   last = regno_assign_info[last].next)
+	regno_assign_info[last].first = regno1_first;
+      /* The loop above does not update the first field of the last
+	 element of the chain, so do it here; then splice REGNO2's
+	 chain in right after the head of REGNO1's chain.  */
+      regno_assign_info[last].first = regno1_first;
+      regno_assign_info[last].next = regno_assign_info[regno1_first].next;
+      regno_assign_info[regno1_first].next = regno2_first;
+      regno_assign_info[regno1_first].freq
+	+= regno_assign_info[regno2_first].freq;
+    }
+  regno_assign_info[regno1_first].freq -= 2 * copy_freq;
+  lra_assert (regno_assign_info[regno1_first].freq >= 0);
+}
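+
+/* For example, let thread A be the chain r100 -> r101 (first pseudo
+   r100, freq 30) and thread B be the single pseudo r102 (freq 12).
+   A copy of frequency 4 connecting r101 and r102 merges B into A:
+   r102 is spliced in right after r100, r100 becomes the first pseudo
+   of all three, and the thread frequency becomes 30 + 12 - 2*4 = 34.  */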
+
+/* Initialize REGNO_ASSIGN_INFO and form threads.  */
+static void
+init_regno_assign_info (void)
+{
+  int i, regno1, regno2;
+  lra_copy_t cp;
+
+  regno_assign_info
+    = (struct regno_assign_info *) xmalloc (sizeof (struct regno_assign_info)
+				       * max_reg_num ());
+  for (i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
+    {
+      regno_assign_info[i].first = i;
+      regno_assign_info[i].next = -1;
+      regno_assign_info[i].freq = lra_reg_info[i].freq;
+    }
+  /* Form the threads.	*/
+  for (i = 0; (cp = lra_get_copy (i)) != NULL; i++)
+    if ((regno1 = cp->regno1) >= lra_constraint_new_regno_start
+	&& (regno2 = cp->regno2) >= lra_constraint_new_regno_start
+	&& reg_renumber[regno1] < 0 && lra_reg_info[regno1].nrefs != 0
+	&& reg_renumber[regno2] < 0 && lra_reg_info[regno2].nrefs != 0
+	&& (ira_class_hard_regs_num[regno_allocno_class_array[regno1]]
+	    == ira_class_hard_regs_num[regno_allocno_class_array[regno2]]))
+      process_copy_to_form_thread (regno1, regno2, cp->freq);
+}
+
+/* Free REGNO_ASSIGN_INFO.  */
+static void
+finish_regno_assign_info (void)
+{
+  free (regno_assign_info);
+}
+
+/* The function is used to sort *reload* and *inheritance* pseudos to
+   try to assign them hard registers.  We always put pseudos from the
+   same thread nearby.  */
+static int
+reload_pseudo_compare_func (const void *v1p, const void *v2p)
+{
+  int r1 = *(const int *) v1p, r2 = *(const int *) v2p;
+  enum reg_class cl1 = regno_allocno_class_array[r1];
+  enum reg_class cl2 = regno_allocno_class_array[r2];
+  int diff;
+  
+  lra_assert (r1 >= lra_constraint_new_regno_start
+	      && r2 >= lra_constraint_new_regno_start);
+  
+  /* Prefer to assign reload registers with smaller classes first to
+     guarantee assignment to all reload registers.  */
+  if ((diff = (ira_class_hard_regs_num[cl1]
+	       - ira_class_hard_regs_num[cl2])) != 0)
+    return diff;
+  if ((diff = (regno_assign_info[regno_assign_info[r2].first].freq
+	       - regno_assign_info[regno_assign_info[r1].first].freq)) != 0)
+    return diff;
+  /* Put pseudos from the same thread nearby.  */
+  if ((diff = regno_assign_info[r1].first - regno_assign_info[r2].first) != 0)
+    return diff;
+  /* If regs are equally good, sort by their numbers, so that the
+     results of qsort leave nothing to chance.	*/
+  return r1 - r2;
+}
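+
+/* For example, on x86 a reload pseudo confined to AREG (a single hard
+   register) sorts before one in GENERAL_REGS: the pseudo with fewer
+   alternatives has to grab its hard register first.  */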
+
+/* The function is used to sort *non-reload* pseudos to try to assign
+   them hard registers.	 The order calculation is simpler than in the
+   previous function and is based on pseudo usage frequency.  */
+static int
+pseudo_compare_func (const void *v1p, const void *v2p)
+{
+  int r1 = *(const int *) v1p, r2 = *(const int *) v2p;
+  int diff;
+
+  /* Prefer to assign more frequently used registers first.  */
+  if ((diff = lra_reg_info[r2].freq - lra_reg_info[r1].freq) != 0)
+    return diff;
+  
+  /* If regs are equally good, sort by their numbers, so that the
+     results of qsort leave nothing to chance.	*/
+  return r1 - r2;
+}
+
+/* Map: program point -> bitmap of all pseudos living at the point and
+   assigned to hard registers.	*/
+static bitmap_head *live_hard_reg_pseudos;
+
+/* reg_renumber corresponding to pseudos marked in
+   live_hard_reg_pseudos.  reg_renumber might not match
+   live_hard_reg_pseudos, but live_pseudos_reg_renumber always
+   reflects live_hard_reg_pseudos.  */
+static int *live_pseudos_reg_renumber;
+
+/* Sparseset used to calculate living hard reg pseudos for some program
+   point range.	 */
+static sparseset live_range_hard_reg_pseudos;
+
+/* Sparseset used to calculate living reload pseudos for some program
+   point range.	 */
+static sparseset live_range_reload_pseudos;
+
+/* Allocate and initialize the data about living pseudos at program
+   points.  */
+static void
+init_lives (void)
+{
+  int i;
+
+  live_range_hard_reg_pseudos = sparseset_alloc (max_reg_num ());
+  live_range_reload_pseudos = sparseset_alloc (max_reg_num ());
+  live_hard_reg_pseudos = (bitmap_head *) xmalloc (sizeof (bitmap_head)
+						   * lra_live_max_point);
+  for (i = 0; i < lra_live_max_point; i++)
+    bitmap_initialize (&live_hard_reg_pseudos[i], &reg_obstack);
+  live_pseudos_reg_renumber
+    = (int *) xmalloc (sizeof (int) * max_reg_num ());
+  for (i = 0; i < max_reg_num (); i++)
+    live_pseudos_reg_renumber[i] = -1;
+}
+
+/* Free the data about living pseudos at program points.  */
+static void
+finish_lives (void)
+{
+  int i;
+
+  sparseset_free (live_range_hard_reg_pseudos);
+  sparseset_free (live_range_reload_pseudos);
+  for (i = 0; i < lra_live_max_point; i++)
+    bitmap_clear (&live_hard_reg_pseudos[i]);
+  free (live_hard_reg_pseudos);
+  free (live_pseudos_reg_renumber);
+}
+
+/* Update LIVE_HARD_REG_PSEUDOS and LIVE_PSEUDOS_REG_RENUMBER by the
+   assignment of pseudo REGNO, or by spilling it if FREE_P.  */
+static void
+update_lives (int regno, bool free_p)
+{
+  int p;
+  lra_live_range_t r;
+
+  if (reg_renumber[regno] < 0)
+    return;
+  live_pseudos_reg_renumber[regno] = free_p ? -1 : reg_renumber[regno];
+  for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
+    {
+      for (p = r->start; p <= r->finish; p++)
+	if (free_p)
+	  bitmap_clear_bit (&live_hard_reg_pseudos[p], regno);
+	else
+	  bitmap_set_bit (&live_hard_reg_pseudos[p], regno);
+    }
+}
+
+/* Sparseset used to calculate reload pseudos conflicting with a given
+   pseudo when we are trying to find a hard register for the given
+   pseudo.  */
+static sparseset conflict_reload_and_inheritance_pseudos;
+
+/* Map: program point -> bitmap of all reload and inheritance pseudos
+   living at the point.	 */
+static bitmap_head *live_reload_and_inheritance_pseudos;
+
+/* Allocate and initialize data about living reload pseudos at any
+   given program point.	 */
+static void
+init_live_reload_and_inheritance_pseudos (void)
+{
+  int i, p;
+  lra_live_range_t r;
+  
+  conflict_reload_and_inheritance_pseudos = sparseset_alloc (max_reg_num ());
+  live_reload_and_inheritance_pseudos
+    = (bitmap_head *) xmalloc (sizeof (bitmap_head) * lra_live_max_point);
+  for (p = 0; p < lra_live_max_point; p++)
+    bitmap_initialize (&live_reload_and_inheritance_pseudos[p], &reg_obstack);
+  for (i = lra_constraint_new_regno_start; i < max_reg_num (); i++)
+    for (r = lra_reg_info[i].live_ranges; r != NULL; r = r->next)
+      for (p = r->start; p <= r->finish; p++)
+	bitmap_set_bit (&live_reload_and_inheritance_pseudos[p], i);
+}
+
+/* Finalize data about living reload pseudos at any given program
+   point.  */
+static void
+finish_live_reload_and_inheritance_pseudos (void)
+{
+  int p;
+
+  sparseset_free (conflict_reload_and_inheritance_pseudos);
+  for (p = 0; p < lra_live_max_point; p++)
+    bitmap_clear (&live_reload_and_inheritance_pseudos[p]);
+  free (live_reload_and_inheritance_pseudos);
+}
+
+/* The value used to check that the cost of a given hard reg is really
+   defined currently.  */
+/* Array used to check that cost of the corresponding hard reg (the
+   array element index) is really defined currently.  */
+static int hard_regno_costs_check[FIRST_PSEUDO_REGISTER];
+/* The current costs of allocation of hard regs.  Defined only if the
+   value of the corresponding element of the previous array is equal to
+   CURR_HARD_REGNO_COSTS_CHECK.	 */
+static int hard_regno_costs[FIRST_PSEUDO_REGISTER];
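+
+/* The check variables above implement a generation-count trick:
+   instead of clearing HARD_REGNO_COSTS before processing every
+   pseudo, we increment CURR_HARD_REGNO_COSTS_CHECK and treat a cost
+   element as undefined until its check element has been brought up to
+   date, as find_hard_regno_for does below.  */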
+
+/* Find and return the best (or TRY_ONLY_HARD_REGNO) free hard
+   register for pseudo REGNO.  In the failure case, return a negative
+   number.  Return through *COST the cost of using the hard register
+   for the pseudo.  The best free hard register has the smallest cost
+   of use for REGNO, or the smallest register bank if the costs are
+   the same.  */
+static int
+find_hard_regno_for (int regno, int *cost, int try_only_hard_regno)
+{
+  HARD_REG_SET conflict_set;
+  int best_cost = INT_MAX, best_bank = INT_MAX, best_usage = INT_MAX;
+  lra_live_range_t r;
+  int p, i, j, rclass_size, best_hard_regno, bank, hard_regno;
+  int hr, conflict_hr, nregs;
+  enum machine_mode biggest_mode;
+  unsigned int k, conflict_regno;
+  int val, biggest_nregs, nregs_diff;
+  enum reg_class rclass;
+  bitmap_iterator bi;
+  bool *rclass_intersect_p;
+  HARD_REG_SET impossible_start_hard_regs;
+
+  COPY_HARD_REG_SET (conflict_set, lra_no_alloc_regs);
+  rclass = regno_allocno_class_array[regno];
+  rclass_intersect_p = ira_reg_classes_intersect_p[rclass];
+  curr_hard_regno_costs_check++;
+  sparseset_clear (conflict_reload_and_inheritance_pseudos);
+  sparseset_clear (live_range_hard_reg_pseudos);
+  IOR_HARD_REG_SET (conflict_set, lra_reg_info[regno].conflict_hard_regs);
+  biggest_mode = lra_reg_info[regno].biggest_mode;
+  for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
+    {
+      EXECUTE_IF_SET_IN_BITMAP (&live_hard_reg_pseudos[r->start], 0, k, bi)
+	if (rclass_intersect_p[regno_allocno_class_array[k]])
+	  sparseset_set_bit (live_range_hard_reg_pseudos, k);
+      EXECUTE_IF_SET_IN_BITMAP (&live_reload_and_inheritance_pseudos[r->start],
+				0, k, bi)
+	if (lra_reg_info[k].preferred_hard_regno1 >= 0
+	    && live_pseudos_reg_renumber[k] < 0
+	    && rclass_intersect_p[regno_allocno_class_array[k]])
+	  sparseset_set_bit (conflict_reload_and_inheritance_pseudos, k);
+      for (p = r->start + 1; p <= r->finish; p++)
+	{
+	  lra_live_range_t r2;
+	  
+	  for (r2 = lra_start_point_ranges[p];
+	       r2 != NULL;
+	       r2 = r2->start_next)
+	    {
+	      if (r2->regno >= lra_constraint_new_regno_start
+		  && lra_reg_info[r2->regno].preferred_hard_regno1 >= 0
+		  && live_pseudos_reg_renumber[r2->regno] < 0
+		  && rclass_intersect_p[regno_allocno_class_array[r2->regno]])
+		sparseset_set_bit (conflict_reload_and_inheritance_pseudos,
+				   r2->regno);
+	      if (live_pseudos_reg_renumber[r2->regno] >= 0
+		  && rclass_intersect_p[regno_allocno_class_array[r2->regno]])
+		sparseset_set_bit (live_range_hard_reg_pseudos, r2->regno);
+	    }
+	}
+    }
+  if ((hard_regno = lra_reg_info[regno].preferred_hard_regno1) >= 0)
+    {
+      if (hard_regno_costs_check[hard_regno] != curr_hard_regno_costs_check)
+	hard_regno_costs[hard_regno] = 0;
+      hard_regno_costs_check[hard_regno] = curr_hard_regno_costs_check;
+      hard_regno_costs[hard_regno]
+	-= lra_reg_info[regno].preferred_hard_regno_profit1;
+      if ((hard_regno = lra_reg_info[regno].preferred_hard_regno2) >= 0)
+	{
+	  if (hard_regno_costs_check[hard_regno]
+	      != curr_hard_regno_costs_check)
+	    hard_regno_costs[hard_regno] = 0;
+	  hard_regno_costs_check[hard_regno] = curr_hard_regno_costs_check;
+	  hard_regno_costs[hard_regno]
+	    -= lra_reg_info[regno].preferred_hard_regno_profit2;
+	}
+    }
+#ifdef STACK_REGS
+  if (lra_reg_info[regno].no_stack_p)
+    for (i = FIRST_STACK_REG; i <= LAST_STACK_REG; i++)
+      SET_HARD_REG_BIT (conflict_set, i);
+#endif
+  sparseset_clear_bit (conflict_reload_and_inheritance_pseudos, regno);
+  val = lra_reg_info[regno].val;
+  CLEAR_HARD_REG_SET (impossible_start_hard_regs);
+  EXECUTE_IF_SET_IN_SPARSESET (live_range_hard_reg_pseudos, conflict_regno)
+    if (val == lra_reg_info[conflict_regno].val)
+      {
+	conflict_hr = live_pseudos_reg_renumber[conflict_regno];
+	nregs = (hard_regno_nregs[conflict_hr]
+		 [lra_reg_info[conflict_regno].biggest_mode]);
+	/* Remember about multi-register pseudos.  For example, two
+	   pseudos occupying 2 hard registers each can start on the
+	   same hard register but cannot start on HR and HR+1/HR-1.  */
+	for (hr = conflict_hr + 1;
+	     hr < FIRST_PSEUDO_REGISTER && hr < conflict_hr + nregs;
+	     hr++)
+	  SET_HARD_REG_BIT (impossible_start_hard_regs, hr);
+	for (hr = conflict_hr - 1;
+	     hr >= 0 && hr + hard_regno_nregs[hr][biggest_mode] > conflict_hr;
+	     hr--)
+	  SET_HARD_REG_BIT (impossible_start_hard_regs, hr);
+      }
+    else
+      {
+	lra_add_hard_reg_set (live_pseudos_reg_renumber[conflict_regno],
+			      lra_reg_info[conflict_regno].biggest_mode,
+			      &conflict_set);
+	if (hard_reg_set_subset_p (reg_class_contents[rclass],
+				   conflict_set))
+	  return -1;
+      }
+  EXECUTE_IF_SET_IN_SPARSESET (conflict_reload_and_inheritance_pseudos,
+			       conflict_regno)
+    if (val != lra_reg_info[conflict_regno].val)
+      {
+	lra_assert (live_pseudos_reg_renumber[conflict_regno] < 0);
+	if ((hard_regno
+	     = lra_reg_info[conflict_regno].preferred_hard_regno1) >= 0)
+	  {
+	    if (hard_regno_costs_check[hard_regno]
+		!= curr_hard_regno_costs_check)
+	      hard_regno_costs[hard_regno] = 0;
+	    hard_regno_costs_check[hard_regno] = curr_hard_regno_costs_check;
+	    hard_regno_costs[hard_regno]
+	      += lra_reg_info[conflict_regno].preferred_hard_regno_profit1;
+	    if ((hard_regno
+		 = lra_reg_info[conflict_regno].preferred_hard_regno2) >= 0)
+	      {
+		if (hard_regno_costs_check[hard_regno]
+		    != curr_hard_regno_costs_check)
+		  hard_regno_costs[hard_regno] = 0;
+		hard_regno_costs_check[hard_regno]
+		  = curr_hard_regno_costs_check;
+		hard_regno_costs[hard_regno]
+		  += lra_reg_info[conflict_regno].preferred_hard_regno_profit2;
+	      }
+	  }
+      }
+  /* That is important for allocation of multi-word pseudos.  */
+  IOR_COMPL_HARD_REG_SET (conflict_set, reg_class_contents[rclass]);
+  lra_assert (rclass != NO_REGS);
+  rclass_size = ira_class_hard_regs_num[rclass];
+  best_hard_regno = -1;
+  hard_regno = ira_class_hard_regs[rclass][0];
+  biggest_nregs = hard_regno_nregs[hard_regno][biggest_mode];
+  nregs_diff = (biggest_nregs
+		- hard_regno_nregs[hard_regno][PSEUDO_REGNO_MODE (regno)]);
+  for (i = 0; i < rclass_size; i++)
+    {
+      if (try_only_hard_regno >= 0)
+	hard_regno = try_only_hard_regno;
+      else
+	hard_regno = ira_class_hard_regs[rclass][i];
+      if (! lra_hard_reg_set_intersection_p (hard_regno,
+					     PSEUDO_REGNO_MODE (regno),
+					     conflict_set)
+	  /* We cannot use prohibited_class_mode_regs because it is
+	     not defined for all classes.  */
+	  && HARD_REGNO_MODE_OK (hard_regno, PSEUDO_REGNO_MODE (regno))
+	  && ! TEST_HARD_REG_BIT (impossible_start_hard_regs, hard_regno)
+	  && (nregs_diff == 0
+#ifdef WORDS_BIG_ENDIAN
+	      || (hard_regno - nregs_diff >= 0
+		  && TEST_HARD_REG_BIT (reg_class_contents[rclass],
+					hard_regno - nregs_diff))
+#else
+	      || TEST_HARD_REG_BIT (reg_class_contents[rclass],
+				    hard_regno + nregs_diff)
+#endif
+	      ))
+	{
+	  if (hard_regno_costs_check[hard_regno]
+	      != curr_hard_regno_costs_check)
+	    {
+	      hard_regno_costs_check[hard_regno] = curr_hard_regno_costs_check;
+	      hard_regno_costs[hard_regno] = 0;
+	    }
+	  for (j = 0;
+	       j < hard_regno_nregs[hard_regno][PSEUDO_REGNO_MODE (regno)];
+	       j++)
+	    if (! TEST_HARD_REG_BIT (call_used_reg_set, hard_regno + j)
+		&& ! df_regs_ever_live_p (hard_regno + j))
+	      /* A call-saved register not yet used needs a prologue
+		 save and an epilogue restore.  */
+	      hard_regno_costs[hard_regno]
+		+= 2 * ENTRY_BLOCK_PTR->next_bb->frequency;
+	  bank = targetm.register_bank (hard_regno);
+	  if (best_hard_regno < 0 || hard_regno_costs[hard_regno] < best_cost
+	      || (hard_regno_costs[hard_regno] == best_cost
+		  && (bank < best_bank
+		      /* Hard register usage leveling actually results
+			 in bigger code for targets with conditional
+			 execution like ARM because it reduces the
+			 chance of if-conversion after LRA.  */
+		      || (! targetm.have_conditional_execution ()
+			  && bank == best_bank
+			  && best_usage > lra_hard_reg_usage[hard_regno]))))
+	    {
+	      best_hard_regno = hard_regno;
+	      best_cost = hard_regno_costs[hard_regno];
+	      best_bank = bank;
+	      best_usage = lra_hard_reg_usage[hard_regno];
+	    }
+	}
+      if (try_only_hard_regno >= 0)
+	break;
+    }
+  if (best_hard_regno >= 0)
+    *cost = best_cost - lra_reg_info[regno].freq;
+  return best_hard_regno;
+}
+
+/* Current value used for checking elements in
+   update_hard_regno_preference_check.	*/
+static int curr_update_hard_regno_preference_check;
+/* If an element value is equal to the above variable value, then the
+   corresponding regno has been processed for preference
+   propagation.	 */
+static int *update_hard_regno_preference_check;
+
+/* Update the HARD_REGNO preference for pseudos connected directly or
+   indirectly to the pseudo with REGNO.  Apply divisor DIV to the
+   corresponding copy frequency for the hard regno cost preference
+   calculation.  The more indirectly a pseudo is connected, the
+   smaller its cost preference.  This is achieved by increasing the
+   divisor for each next recursion level.  */
+static void
+update_hard_regno_preference (int regno, int hard_regno, int div)
+{
+  int another_regno, cost;
+  lra_copy_t cp, next_cp;
+
+  /* Search depth 5 seems to be enough.	 */
+  if (div > (1 << 5))
+    return;
+  for (cp = lra_reg_info[regno].copies; cp != NULL; cp = next_cp)
+    {
+      if (cp->regno1 == regno)
+	{
+	  next_cp = cp->regno1_next;
+	  another_regno = cp->regno2;
+	}
+      else if (cp->regno2 == regno)
+	{
+	  next_cp = cp->regno2_next;
+	  another_regno = cp->regno1;
+	}
+      else
+	gcc_unreachable ();
+      if (reg_renumber[another_regno] < 0
+	  && (update_hard_regno_preference_check[another_regno]
+	      != curr_update_hard_regno_preference_check))
+	{
+	  update_hard_regno_preference_check[another_regno]
+	    = curr_update_hard_regno_preference_check;
+	  cost = cp->freq < div ? 1 : cp->freq / div;
+	  lra_setup_reload_pseudo_preferenced_hard_reg
+	    (another_regno, hard_regno, cost);
+	  update_hard_regno_preference (another_regno, hard_regno, div * 2);
+	}
+    }
+}
+
+/* Update REG_RENUMBER and other pseudo preferences by the assignment
+   of HARD_REGNO to pseudo REGNO, and print about it if PRINT_P.  */
+void
+lra_setup_reg_renumber (int regno, int hard_regno, bool print_p)
+{
+  int i, hr;
+
+  if ((hr = hard_regno) < 0)
+    hr = reg_renumber[regno];
+  reg_renumber[regno] = hard_regno;
+  lra_assert (hr >= 0);
+  for (i = 0; i < hard_regno_nregs[hr][PSEUDO_REGNO_MODE (regno)]; i++)
+    if (hard_regno < 0)
+      lra_hard_reg_usage[hr + i] -= lra_reg_info[regno].freq;
+    else
+      lra_hard_reg_usage[hr + i] += lra_reg_info[regno].freq;
+  if (print_p && lra_dump_file != NULL)
+    fprintf (lra_dump_file, "	   Assign %d to %sr%d (freq=%d)\n",
+	     reg_renumber[regno],
+	     regno < lra_constraint_new_regno_start
+	     ? ""
+	     : bitmap_bit_p (&lra_inheritance_pseudos, regno) ? "inheritance "
+	     : bitmap_bit_p (&lra_split_pseudos, regno) ? "split "
+	     : bitmap_bit_p (&lra_optional_reload_pseudos, regno)
+	     ? "optional reload ": "reload ",
+	     regno, lra_reg_info[regno].freq);
+  if (hard_regno >= 0)
+    {
+      curr_update_hard_regno_preference_check++;
+      update_hard_regno_preference (regno, hard_regno, 1);
+    }
+}
+
+/* Pseudos which should not be spilled for a particular pseudo.  */
+static bitmap_head ignore_pseudos_bitmap;
+
+/* Bitmaps used to contain spill pseudos for given pseudo hard regno
+   and best spill pseudos for given pseudo (and best hard regno).  */
+static bitmap_head spill_pseudos_bitmap, best_spill_pseudos_bitmap;
+
+/* Current pseudo check for validity of elements in
+   TRY_HARD_REG_PSEUDOS.  */
+static int curr_pseudo_check;
+/* Array used for validity of elements in TRY_HARD_REG_PSEUDOS.	 */
+static int try_hard_reg_pseudos_check[FIRST_PSEUDO_REGISTER];
+/* Pseudos which hold a given hard register at the considered points.  */
+static bitmap_head try_hard_reg_pseudos[FIRST_PSEUDO_REGISTER];
+
+/* Set up try_hard_reg_pseudos for given program point P and class
+   RCLASS.  Those are pseudos living at P and assigned to a hard
+   register of RCLASS.	In other words, those are pseudos which can be
+   spilled to assign a hard register of RCLASS to a pseudo living at
+   P.  */
+static void
+setup_try_hard_regno_pseudos (int p, enum reg_class rclass)
+{
+  int i, hard_regno;
+  enum machine_mode mode;
+  unsigned int spill_regno;
+  bitmap_iterator bi;
+
+  /* Find what pseudos could be spilled.  */
+  EXECUTE_IF_SET_IN_BITMAP (&live_hard_reg_pseudos[p], 0, spill_regno, bi)
+    {
+      mode = PSEUDO_REGNO_MODE (spill_regno);
+      if (lra_hard_reg_set_intersection_p
+	  (live_pseudos_reg_renumber[spill_regno],
+	   mode, reg_class_contents[rclass]))
+	{
+	  hard_regno = live_pseudos_reg_renumber[spill_regno];
+	  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
+	    {
+	      if (try_hard_reg_pseudos_check[hard_regno + i]
+		  != curr_pseudo_check)
+		{
+		  try_hard_reg_pseudos_check[hard_regno + i]
+		    = curr_pseudo_check;
+		  bitmap_clear (&try_hard_reg_pseudos[hard_regno + i]);
+		}
+	      bitmap_set_bit (&try_hard_reg_pseudos[hard_regno + i],
+			      spill_regno);
+	    }
+	}
+    }
+}
+
+/* Temporarily assign HARD_REGNO to pseudo REGNO.  A temporary
+   assignment means that we might undo the data change later.  */
+static void
+assign_temporarily (int regno, int hard_regno)
+{
+  int p;
+  lra_live_range_t r;
+
+  for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
+    {
+      for (p = r->start; p <= r->finish; p++)
+	if (hard_regno < 0)
+	  bitmap_clear_bit (&live_hard_reg_pseudos[p], regno);
+	else
+	  bitmap_set_bit (&live_hard_reg_pseudos[p], regno);
+    }
+  live_pseudos_reg_renumber[regno] = hard_regno;
+}
+
+/* Array used for sorting reload pseudos for subsequent allocation
+   after spilling some pseudo.	*/
+static int *sorted_reload_pseudos;
+
+/* Spill some pseudos for a reload pseudo REGNO and return the hard
+   register which should be used for the pseudo after spilling.  The
+   function adds spilled pseudos to SPILLED_PSEUDO_BITMAP.  When we
+   choose a hard register (and the pseudos occupying the hard
+   registers to be spilled), we take into account not only how REGNO
+   will benefit from the spills but also how other reload pseudos not
+   yet assigned to hard registers benefit from the spills too.  */
+static int
+spill_for (int regno, bitmap spilled_pseudo_bitmap)
+{
+  int i, j, n, p, hard_regno, best_hard_regno, cost, best_cost, rclass_size;
+  int reload_hard_regno, reload_cost;
+  enum machine_mode mode, mode2;
+  enum reg_class rclass;
+  HARD_REG_SET spilled_hard_regs;
+  unsigned int k, spill_regno, reload_regno, uid;
+  int insn_pseudos_num, best_insn_pseudos_num;
+  lra_live_range_t r;
+  bitmap_iterator bi, bi2;
+
+  rclass = regno_allocno_class_array[regno];
+  lra_assert (reg_renumber[regno] < 0 && rclass != NO_REGS);
+  bitmap_clear (&ignore_pseudos_bitmap);
+  bitmap_clear (&best_spill_pseudos_bitmap);
+  EXECUTE_IF_SET_IN_BITMAP (&lra_reg_info[regno].insn_bitmap, 0, uid, bi)
+    {
+      struct lra_insn_reg *ir;
+      
+      for (ir = lra_get_insn_regs (uid); ir != NULL; ir = ir->next)
+	if (ir->regno >= FIRST_PSEUDO_REGISTER)
+	  bitmap_set_bit (&ignore_pseudos_bitmap, ir->regno);
+    }
+  best_hard_regno = -1;
+  best_cost = INT_MAX;
+  best_insn_pseudos_num = INT_MAX;
+  rclass_size = ira_class_hard_regs_num[rclass];
+  mode = PSEUDO_REGNO_MODE (regno);
+  curr_pseudo_check++; /* Invalidate try_hard_reg_pseudos elements.  */
+  for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
+    for (p = r->start; p <= r->finish; p++)
+      setup_try_hard_regno_pseudos (p, rclass);
+  for (i = 0; i < rclass_size; i++)
+    {
+      hard_regno = ira_class_hard_regs[rclass][i];
+      bitmap_clear (&spill_pseudos_bitmap);
+      for (j = hard_regno_nregs[hard_regno][mode] - 1; j >= 0; j--)
+	{
+	  if (try_hard_reg_pseudos_check[hard_regno + j] != curr_pseudo_check)
+	    continue;
+	  lra_assert (!bitmap_empty_p (&try_hard_reg_pseudos[hard_regno + j]));
+	  bitmap_ior_into (&spill_pseudos_bitmap,
+			   &try_hard_reg_pseudos[hard_regno + j]);
+	}
+      /* Spill pseudos.	 */
+      CLEAR_HARD_REG_SET (spilled_hard_regs);
+      EXECUTE_IF_SET_IN_BITMAP (&spill_pseudos_bitmap, 0, spill_regno, bi)
+	if ((int) spill_regno >= lra_constraint_new_regno_start
+	    /* ??? */
+	    && ! bitmap_bit_p (&lra_inheritance_pseudos, spill_regno)
+	    && ! bitmap_bit_p (&lra_split_pseudos, spill_regno)
+	    && ! bitmap_bit_p (&lra_optional_reload_pseudos, spill_regno))
+	  goto fail;
+      insn_pseudos_num = 0;
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "	 Trying %d:", hard_regno);
+      sparseset_clear (live_range_reload_pseudos);
+      EXECUTE_IF_SET_IN_BITMAP (&spill_pseudos_bitmap, 0, spill_regno, bi)
+	{
+	  if (bitmap_bit_p (&ignore_pseudos_bitmap, spill_regno))
+	    insn_pseudos_num++;
+	  mode2 = PSEUDO_REGNO_MODE (spill_regno);
+	  update_lives (spill_regno, true);
+	  if (lra_dump_file != NULL)
+	    fprintf (lra_dump_file, " spill %d(freq=%d)",
+		     spill_regno, lra_reg_info[spill_regno].freq);
+	  lra_add_hard_reg_set (reg_renumber[spill_regno], mode2,
+				&spilled_hard_regs);
+	  for (r = lra_reg_info[spill_regno].live_ranges;
+	       r != NULL;
+	       r = r->next)
+	    {
+	      EXECUTE_IF_SET_IN_BITMAP (&live_hard_reg_pseudos[r->start],
+					0, k, bi2)
+		sparseset_set_bit (live_range_hard_reg_pseudos, k);
+	      for (p = r->start + 1; p <= r->finish; p++)
+		{
+		  lra_live_range_t r2;
+		  
+		  for (r2 = lra_start_point_ranges[p];
+		       r2 != NULL;
+		       r2 = r2->start_next)
+		    if (r2->regno >= lra_constraint_new_regno_start)
+		      sparseset_set_bit (live_range_reload_pseudos, r2->regno);
+		}
+	    }
+	}
+      /* We are trying to spill a reload pseudo.  That is wrong: we
+	 should assign all reload pseudos, otherwise we cannot reuse
+	 the selected alternatives.  */
+      hard_regno = find_hard_regno_for (regno, &cost, -1);
+      if (hard_regno >= 0)
+	{
+	  assign_temporarily (regno, hard_regno);
+	  n = 0;
+	  EXECUTE_IF_SET_IN_SPARSESET (live_range_reload_pseudos, reload_regno)
+	    if (live_pseudos_reg_renumber[reload_regno] < 0
+		&& (hard_reg_set_intersect_p
+		    (reg_class_contents
+		     [regno_allocno_class_array[reload_regno]],
+		     spilled_hard_regs)))
+	      sorted_reload_pseudos[n++] = reload_regno;
+	  qsort (sorted_reload_pseudos, n, sizeof (int),
+		 reload_pseudo_compare_func);
+	  for (j = 0; j < n; j++)
+	    {
+	      reload_regno = sorted_reload_pseudos[j];
+	      if (live_pseudos_reg_renumber[reload_regno] < 0
+		  && (reload_hard_regno
+		      = find_hard_regno_for (reload_regno,
+					     &reload_cost, -1)) >= 0
+		  && (lra_hard_reg_set_intersection_p
+		      (reload_hard_regno, PSEUDO_REGNO_MODE (reload_regno),
+		       spilled_hard_regs)))
+		{
+		  if (lra_dump_file != NULL)
+		    fprintf (lra_dump_file, " assign %d(cost=%d)",
+			     reload_regno, reload_cost);
+		  assign_temporarily (reload_regno, reload_hard_regno);
+		  cost += reload_cost;
+		}
+	    }
+	  EXECUTE_IF_SET_IN_BITMAP (&spill_pseudos_bitmap, 0, spill_regno, bi)
+	    {
+	      rtx x;
+	      
+	      cost += lra_reg_info[spill_regno].freq;
+	      if (ira_reg_equiv[spill_regno].memory != NULL
+		  || ira_reg_equiv[spill_regno].constant != NULL)
+		for (x = ira_reg_equiv[spill_regno].init_insns;
+		     x != NULL;
+		     x = XEXP (x, 1))
+		  cost -= REG_FREQ_FROM_BB (BLOCK_FOR_INSN (XEXP (x, 0)));
+	    }
+	  if (best_insn_pseudos_num > insn_pseudos_num
+	      || (best_insn_pseudos_num == insn_pseudos_num
+		  && best_cost > cost))
+	    {
+	      best_insn_pseudos_num = insn_pseudos_num;
+	      best_cost = cost;
+	      best_hard_regno = hard_regno;
+	      bitmap_copy (&best_spill_pseudos_bitmap, &spill_pseudos_bitmap);
+	      if (lra_dump_file != NULL)
+		fprintf (lra_dump_file, "	 Now best %d(cost=%d)\n",
+			 hard_regno, cost);
+	    }
+	  assign_temporarily (regno, -1);
+	  for (j = 0; j < n; j++)
+	    {
+	      reload_regno = sorted_reload_pseudos[j];
+	      if (live_pseudos_reg_renumber[reload_regno] >= 0)
+		assign_temporarily (reload_regno, -1);
+	    }
+	}
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "\n");
+      /* Restore the live hard reg pseudo info for spilled pseudos.  */
+      EXECUTE_IF_SET_IN_BITMAP (&spill_pseudos_bitmap, 0, spill_regno, bi)
+	update_lives (spill_regno, false);
+    fail:
+      ;
+    }
+  /* Spill: */
+  EXECUTE_IF_SET_IN_BITMAP (&best_spill_pseudos_bitmap, 0, spill_regno, bi)
+    {
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "      Spill %sr%d(hr=%d, freq=%d) for r%d\n",
+		 ((int) spill_regno < lra_constraint_new_regno_start
+		  ? ""
+		  : bitmap_bit_p (&lra_inheritance_pseudos, spill_regno)
+		  ? "inheritance "
+		  : bitmap_bit_p (&lra_split_pseudos, spill_regno)
+		  ? "split "
+		  : bitmap_bit_p (&lra_optional_reload_pseudos, spill_regno)
+		  ? "optional reload " : "reload "),
+		 spill_regno, reg_renumber[spill_regno],
+		 lra_reg_info[spill_regno].freq, regno);
+      update_lives (spill_regno, true);
+      lra_setup_reg_renumber (spill_regno, -1, false);
+    }
+  bitmap_ior_into (spilled_pseudo_bitmap, &best_spill_pseudos_bitmap);
+  return best_hard_regno;
+}
+
+/* Assign HARD_REGNO to REGNO.	*/
+static void
+assign_hard_regno (int hard_regno, int regno)
+{
+  int i;
+
+  lra_assert (hard_regno >= 0);
+  lra_setup_reg_renumber (regno, hard_regno, true);
+  update_lives (regno, false);
+  for (i = 0;
+       i < hard_regno_nregs[hard_regno][lra_reg_info[regno].biggest_mode];
+       i++)
+    df_set_regs_ever_live (hard_regno + i, true);
+}
+
+/* Array used for sorting different pseudos.  */
+static int *sorted_pseudos;
+
+/* Constraint transformation can use equivalences, and the
+   equivalences can contain pseudos assigned to hard registers.  Such
+   equivalence usage might create new conflicts of pseudos with hard
+   registers (like ones used for parameter passing or call-clobbered
+   ones) or with other pseudos assigned to the same hard registers.
+   Another very rare risky transformation is restoring a whole
+   multi-register pseudo when only one subreg was live and a hard
+   register of the pseudo which looked unused is already used for
+   something else.
+
+   Process pseudos assigned to hard registers (most frequently used
+   first), spill a pseudo if a conflict is found, and mark the
+   spilled pseudos in SPILLED_PSEUDO_BITMAP.  Set up
+   LIVE_HARD_REG_PSEUDOS from the pseudos assigned to hard
+   registers.  */
+static void
+setup_live_pseudos_and_spill_after_risky_transforms (bitmap
+						     spilled_pseudo_bitmap)
+{
+  int p, i, j, n, regno, hard_regno;
+  unsigned int k, conflict_regno;
+  int val;
+  HARD_REG_SET conflict_set;
+  enum machine_mode mode;
+  lra_live_range_t r;
+  bitmap_iterator bi;
+
+  for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
+    if (reg_renumber[i] >= 0 && lra_reg_info[i].nrefs > 0)
+      {
+	if (lra_risky_transformations_p)
+	  sorted_pseudos[n++] = i;
+	else
+	  update_lives (i, false);
+      }
+  if (! lra_risky_transformations_p)
+    return;
+  qsort (sorted_pseudos, n, sizeof (int), pseudo_compare_func);
+  for (i = 0; i < n; i++)
+    {
+      regno = sorted_pseudos[i];
+      hard_regno = reg_renumber[regno];
+      lra_assert (hard_regno >= 0);
+      mode = lra_reg_info[regno].biggest_mode;
+      sparseset_clear (live_range_hard_reg_pseudos);
+      for (r = lra_reg_info[regno].live_ranges; r != NULL; r = r->next)
+	{
+	  EXECUTE_IF_SET_IN_BITMAP (&live_hard_reg_pseudos[r->start], 0, k, bi)
+	    sparseset_set_bit (live_range_hard_reg_pseudos, k);
+	  for (p = r->start + 1; p <= r->finish; p++)
+	    {
+	      lra_live_range_t r2;
+	      
+	      for (r2 = lra_start_point_ranges[p];
+		   r2 != NULL;
+		   r2 = r2->start_next)
+		if (live_pseudos_reg_renumber[r2->regno] >= 0)
+		  sparseset_set_bit (live_range_hard_reg_pseudos, r2->regno);
+	    }
+	}
+      COPY_HARD_REG_SET (conflict_set, lra_no_alloc_regs);
+      IOR_HARD_REG_SET (conflict_set, lra_reg_info[regno].conflict_hard_regs);
+      val = lra_reg_info[regno].val;
+      EXECUTE_IF_SET_IN_SPARSESET (live_range_hard_reg_pseudos, conflict_regno)
+	if (val != lra_reg_info[conflict_regno].val
+	    /* If they are multi-register pseudos, they should start
+	       on the same hard register.  */
+	    || hard_regno != reg_renumber[conflict_regno])
+	  lra_add_hard_reg_set (reg_renumber[conflict_regno],
+				lra_reg_info[conflict_regno].biggest_mode,
+				&conflict_set);
+      if (! lra_hard_reg_set_intersection_p (hard_regno, mode, conflict_set))
+	{
+	  update_lives (regno, false);
+	  continue;
+	}
+      bitmap_set_bit (spilled_pseudo_bitmap, regno);
+      for (j = 0;
+	   j < hard_regno_nregs[hard_regno][PSEUDO_REGNO_MODE (regno)];
+	   j++)
+	lra_hard_reg_usage[hard_regno + j] -= lra_reg_info[regno].freq;
+      reg_renumber[regno] = -1;
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "    Spill r%d after risky transformations\n",
+		 regno);
+    }
+}
+
+/* Improve allocation by assigning the hard register of an
+   inheritance pseudo to the pseudos connected to it by copies.  We
+   need this because inheritance pseudos are allocated after reload
+   pseudos in the thread, and when we assign a hard register to a
+   reload pseudo we don't know yet that the connected inheritance
+   pseudos can get the same hard register.  Add pseudos with changed
+   allocation to bitmap CHANGED_PSEUDOS.  */
+static void
+improve_inheritance (bitmap changed_pseudos)
+{
+  unsigned int k;
+  int regno, another_regno, hard_regno, another_hard_regno, cost, i, n;
+  lra_copy_t cp, next_cp;
+  bitmap_iterator bi;
+
+  n = 0;
+  EXECUTE_IF_SET_IN_BITMAP (&lra_inheritance_pseudos, 0, k, bi)
+    if (reg_renumber[k] >= 0 && lra_reg_info[k].nrefs != 0)
+      sorted_pseudos[n++] = k;
+  qsort (sorted_pseudos, n, sizeof (int), pseudo_compare_func);
+  for (i = 0; i < n; i++)
+    {
+      regno = sorted_pseudos[i];
+      hard_regno = reg_renumber[regno];
+      lra_assert (hard_regno >= 0);
+      for (cp = lra_reg_info[regno].copies; cp != NULL; cp = next_cp)
+	{
+	  if (cp->regno1 == regno)
+	    {
+	      next_cp = cp->regno1_next;
+	      another_regno = cp->regno2;
+	    }
+	  else if (cp->regno2 == regno)
+	    {
+	      next_cp = cp->regno2_next;
+	      another_regno = cp->regno1;
+	    }
+	  else
+	    gcc_unreachable ();
+	  /* Don't change the allocation of a reload pseudo.  It
+	     might have been allocated this way on purpose (e.g. it is
+	     bound to another pseudo) and changing it can result in
+	     LRA cycling.  */
+	  if (another_regno < lra_constraint_new_regno_start
+	      && (another_hard_regno = reg_renumber[another_regno]) >= 0
+	      && another_hard_regno != hard_regno)
+	    {
+	      if (lra_dump_file != NULL)
+		fprintf
+		  (lra_dump_file,
+		   "	Improving inheritance for %d(%d) and %d(%d)...\n",
+		   regno, hard_regno, another_regno, another_hard_regno);
+	      update_lives (another_regno, true);
+	      lra_setup_reg_renumber (another_regno, -1, false);
+	      if (hard_regno
+		  == find_hard_regno_for (another_regno, &cost, hard_regno))
+		assign_hard_regno (hard_regno, another_regno);
+	      else
+		assign_hard_regno (another_hard_regno, another_regno);
+	      bitmap_set_bit (changed_pseudos, another_regno);
+	    }
+	}
+    }
+}
+
+
+/* Bitmap which finally contains all pseudos spilled on this
+   assignment pass.  */
+static bitmap_head all_spilled_pseudos;
+/* All pseudos whose allocation was changed.  */
+static bitmap_head changed_pseudo_bitmap;
+
+/* Assign hard registers to reload pseudos and other pseudos.  */
+static void
+assign_by_spills (void)
+{
+  int i, n, nfails, iter, regno, hard_regno, cost, restore_regno;
+  rtx insn;
+  basic_block bb;
+  bitmap_head changed_insns, do_not_assign_nonreload_pseudos;
+  bitmap_head non_reload_pseudos;
+  unsigned int u;
+  bitmap_iterator bi;
+
+  for (n = 0, i = lra_constraint_new_regno_start; i < max_reg_num (); i++)
+    if (reg_renumber[i] < 0 && lra_reg_info[i].nrefs != 0
+	&& regno_allocno_class_array[i] != NO_REGS)
+      sorted_pseudos[n++] = i;
+  bitmap_initialize (&ignore_pseudos_bitmap, &reg_obstack);
+  bitmap_initialize (&spill_pseudos_bitmap, &reg_obstack);
+  bitmap_initialize (&best_spill_pseudos_bitmap, &reg_obstack);
+  update_hard_regno_preference_check = (int *) xmalloc (sizeof (int)
+							* max_reg_num ());
+  memset (update_hard_regno_preference_check, 0,
+	  sizeof (int) * max_reg_num ());
+  curr_update_hard_regno_preference_check = 0;
+  memset (try_hard_reg_pseudos_check, 0, sizeof (try_hard_reg_pseudos_check));
+  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+    bitmap_initialize (&try_hard_reg_pseudos[i], &reg_obstack);
+  curr_pseudo_check = 0;
+  bitmap_initialize (&changed_insns, &reg_obstack);
+  bitmap_initialize (&non_reload_pseudos, &reg_obstack);
+  bitmap_ior (&non_reload_pseudos, &lra_inheritance_pseudos, &lra_split_pseudos);
+  bitmap_ior_into (&non_reload_pseudos, &lra_optional_reload_pseudos);
+  for (iter = 0; iter <= 1; iter++)
+    {
+      qsort (sorted_pseudos, n, sizeof (int), reload_pseudo_compare_func);
+      nfails = 0;
+      for (i = 0; i < n; i++)
+	{
+	  regno = sorted_pseudos[i];
+	  if (lra_dump_file != NULL)
+	    fprintf (lra_dump_file, "	 Assigning to %d "
+		     "(cl=%s, orig=%d, freq=%d, tfirst=%d, tfreq=%d)...\n",
+		     regno, reg_class_names[regno_allocno_class_array[regno]],
+		     ORIGINAL_REGNO (regno_reg_rtx[regno]),
+		     lra_reg_info[regno].freq, regno_assign_info[regno].first,
+		     regno_assign_info[regno_assign_info[regno].first].freq);
+	  hard_regno = find_hard_regno_for (regno, &cost, -1);
+	  if (hard_regno < 0
+	      && ! bitmap_bit_p (&non_reload_pseudos, regno))
+	    hard_regno = spill_for (regno, &all_spilled_pseudos);
+	  if (hard_regno < 0)
+	    {
+	      if (! bitmap_bit_p (&non_reload_pseudos, regno))
+		sorted_pseudos[nfails++] = regno;
+	    }
+	  else
+	    {
+	      /* Remember that reload pseudos can be spilled on the
+		 1st pass.  */
+	      bitmap_clear_bit (&all_spilled_pseudos, regno);
+	      assign_hard_regno (hard_regno, regno);
+	    }
+	}
+      if (nfails == 0)
+	break;
+      lra_assert (iter == 0);
+      /* This is a very rare event.  We cannot assign a hard register
+	 to a reload pseudo because the hard register was assigned to
+	 another reload pseudo on a previous assignment pass.  For
+	 example, on x86 on the 1st pass we might assign CX (although
+	 another hard register could be used) to a reload pseudo in an
+	 insn, and on the 2nd pass we need CX (and only CX) for a new
+	 reload pseudo in the same insn.  */
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "  2nd iter for reload pseudo assignments:\n");
+      for (i = 0; i < nfails; i++)
+	{
+	  if (lra_dump_file != NULL)
+	    fprintf (lra_dump_file, "	 Reload r%d assignment failure\n",
+		     sorted_pseudos[i]);
+	  bitmap_ior_into (&changed_insns,
+			   &lra_reg_info[sorted_pseudos[i]].insn_bitmap);
+	}
+      FOR_EACH_BB (bb)
+	FOR_BB_INSNS (bb, insn)
+	if (bitmap_bit_p (&changed_insns, INSN_UID (insn)))
+	  {
+	    lra_insn_recog_data_t data;
+	    struct lra_insn_reg *r;
+	      
+	    data = lra_get_insn_recog_data (insn);
+	    for (r = data->regs; r != NULL; r = r->next)
+	      {
+		regno = r->regno;
+		/* We can use inheritance pseudos in original insns
+		   (not reload ones).  */
+		if (regno < lra_constraint_new_regno_start
+		    || bitmap_bit_p (&lra_inheritance_pseudos, regno)
+		    || reg_renumber[regno] < 0)
+		  continue;
+		sorted_pseudos[nfails++] = regno;
+		if (lra_dump_file != NULL)
+		  fprintf (lra_dump_file,
+			   "	  Spill reload r%d(hr=%d, freq=%d)\n",
+			   regno, reg_renumber[regno],
+			   lra_reg_info[regno].freq);
+		update_lives (regno, true);
+		lra_setup_reg_renumber (regno, -1, false);
+	      }
+	  }
+      n = nfails;
+    }
+  improve_inheritance (&changed_pseudo_bitmap);
+  bitmap_clear (&non_reload_pseudos);
+  bitmap_clear (&changed_insns);
+  /* We should not assign hard registers to the original pseudos of
+     inheritance or split pseudos if any of their inheritance pseudos
+     did not get a hard register or any of their split pseudos was
+     not split, because the undo inheritance/split pass will extend
+     the live ranges of such inheritance or split pseudos.  */
+  bitmap_initialize (&do_not_assign_nonreload_pseudos, &reg_obstack);
+  EXECUTE_IF_SET_IN_BITMAP (&lra_inheritance_pseudos, 0, u, bi)
+    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
+	&& reg_renumber[u] < 0 && bitmap_bit_p (&lra_inheritance_pseudos, u))
+      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
+  EXECUTE_IF_SET_IN_BITMAP (&lra_split_pseudos, 0, u, bi)
+    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
+	&& reg_renumber[u] >= 0 && bitmap_bit_p (&lra_split_pseudos, u))
+      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
+  for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
+    if (((i < lra_constraint_new_regno_start
+	  && ! bitmap_bit_p (&do_not_assign_nonreload_pseudos, i))
+	 || (bitmap_bit_p (&lra_inheritance_pseudos, i)
+	     && lra_reg_info[i].restore_regno >= 0)
+	 || (bitmap_bit_p (&lra_split_pseudos, i)
+	     && lra_reg_info[i].restore_regno >= 0)
+	 || bitmap_bit_p (&lra_optional_reload_pseudos, i))
+	&& reg_renumber[i] < 0 && lra_reg_info[i].nrefs != 0
+	&& regno_allocno_class_array[i] != NO_REGS)
+      sorted_pseudos[n++] = i;
+  bitmap_clear (&do_not_assign_nonreload_pseudos);
+  if (n != 0 && lra_dump_file != NULL)
+    fprintf (lra_dump_file, "  Reassigning non-reload pseudos\n");
+  qsort (sorted_pseudos, n, sizeof (int), pseudo_compare_func);
+  for (i = 0; i < n; i++)
+    {
+      regno = sorted_pseudos[i];
+      hard_regno = find_hard_regno_for (regno, &cost, -1);
+      if (hard_regno >= 0)
+	{
+	  assign_hard_regno (hard_regno, regno);
+	  /* We changed the allocation of a non-reload pseudo on this
+	     iteration -- mark the pseudo for invalidation of the used
+	     alternatives of insns containing the pseudo.  */
+	  bitmap_set_bit (&changed_pseudo_bitmap, regno);
+	}
+    }
+  free (update_hard_regno_preference_check);
+  bitmap_clear (&best_spill_pseudos_bitmap);
+  bitmap_clear (&spill_pseudos_bitmap);
+  bitmap_clear (&ignore_pseudos_bitmap);
+}
+
+
+/* Entry function to assign hard registers to new reload pseudos
+   starting with LRA_CONSTRAINT_NEW_REGNO_START (by possibly spilling
+   old pseudos) and possibly to the old pseudos too.  The function
+   also decides which insns to process on the next constraint pass:
+   all insns containing non-reload and non-inheritance pseudos with
+   changed allocation.
+
+   Return true if we did not spill any non-reload and non-inheritance
+   pseudos.  */
+bool
+lra_assign (void)
+{
+  int i;
+  unsigned int u;
+  bitmap_iterator bi;
+  bitmap_head insns_to_process;
+  bool no_spills_p;
+
+  init_lives ();
+  sorted_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ());
+  sorted_reload_pseudos = (int *) xmalloc (sizeof (int) * max_reg_num ());
+  regno_allocno_class_array
+    = (enum reg_class *) xmalloc (sizeof (enum reg_class) * max_reg_num ());
+  for (i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
+    regno_allocno_class_array[i] = lra_get_allocno_class (i);
+  init_regno_assign_info ();
+  bitmap_initialize (&all_spilled_pseudos, &reg_obstack);
+  setup_live_pseudos_and_spill_after_risky_transforms (&all_spilled_pseudos);
+#ifdef ENABLE_CHECKING
+  for (i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
+    if (lra_reg_info[i].nrefs != 0 && reg_renumber[i] >= 0
+	&& lra_reg_info[i].call_p
+	&& lra_hard_reg_set_intersection_p (reg_renumber[i],
+					    PSEUDO_REGNO_MODE (i),
+					    call_used_reg_set))
+      gcc_unreachable ();
+#endif
+  /* Set up insns to process on the next constraint pass.  */
+  bitmap_initialize (&changed_pseudo_bitmap, &reg_obstack);
+  init_live_reload_and_inheritance_pseudos ();
+  assign_by_spills ();
+  finish_live_reload_and_inheritance_pseudos ();
+  bitmap_ior_into (&changed_pseudo_bitmap, &all_spilled_pseudos);
+  no_spills_p = true;
+  EXECUTE_IF_SET_IN_BITMAP (&all_spilled_pseudos, 0, u, bi)
+    /* We ignore spilled pseudos created on the last inheritance pass
+       because they will be removed.  */
+    if (lra_reg_info[u].restore_regno < 0)
+      {
+	no_spills_p = false;
+	break;
+      }
+  bitmap_clear (&all_spilled_pseudos);
+  bitmap_initialize (&insns_to_process, &reg_obstack);
+  EXECUTE_IF_SET_IN_BITMAP (&changed_pseudo_bitmap, 0, u, bi)
+    bitmap_ior_into (&insns_to_process, &lra_reg_info[u].insn_bitmap);
+  bitmap_clear (&changed_pseudo_bitmap);
+  EXECUTE_IF_SET_IN_BITMAP (&insns_to_process, 0, u, bi)
+    {
+      lra_push_insn_by_uid (u);
+      /* Invalidate the used alternatives of insns which will be
+	 processed again.  */
+      lra_set_used_insn_alternative_by_uid (u, -1);
+    }
+  bitmap_clear (&insns_to_process);
+  finish_regno_assign_info ();
+  free (regno_allocno_class_array);
+  free (sorted_pseudos);
+  free (sorted_reload_pseudos);
+  finish_lives ();
+  return no_spills_p;
+}
Index: lra-int.h
===================================================================
--- lra-int.h	(revision 0)
+++ lra-int.h	(working copy)
@@ -0,0 +1,474 @@ 
+/* Local Register Allocator (LRA) intercommunication header file.
+   Copyright (C) 2010, 2011, 2012
+   Free Software Foundation, Inc.
+   Contributed by Vladimir Makarov <vmakarov@redhat.com>.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.	If not see
+<http://www.gnu.org/licenses/>.	 */
+
+#include "lra.h"
+#include "bitmap.h"
+#include "insn-attr.h"
+
+#ifdef ENABLE_CHECKING
+#define lra_assert(c) gcc_assert (c)
+#else
+/* Always define and textually include C, so that warnings about an
+   empty body in an 'if' statement and about unused variables do not
+   occur.  */
+#define lra_assert(c) ((void)(0 && (c)))
+#endif
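+
+/* For example, with ENABLE_CHECKING undefined,
+
+     int n = get_some_count ();
+     lra_assert (n >= 0);
+
+   expands to a statement which still references N, so the compiler
+   warns neither about an unused variable nor about an empty body
+   when the assertion is the only statement controlled by an 'if'.
+   (get_some_count is a hypothetical function used only for
+   illustration.)  */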
+
+/* The parameter used to prevent infinite reloading for an insn.  Each
+   insn operand might require a reload and, if it is a memory operand,
+   its base and index registers might require a reload too.  */
+#define LRA_MAX_INSN_RELOADS (MAX_RECOG_OPERANDS * 3)
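+
+/* For example, a single memory operand can generate up to 3 reloads:
+   one for the operand itself and one each for its base and index
+   registers; hence the factor of 3 above.  */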
+
+/* Return the hard register to which the given pseudo REGNO is
+   assigned.  A negative value means that the pseudo got memory or we
+   don't know its allocation yet.  */
+static inline int
+lra_get_regno_hard_regno (int regno)
+{
+  resize_reg_info ();
+  return reg_renumber[regno];
+}
+
+typedef struct lra_live_range *lra_live_range_t;
+
+/* The structure describes program points where a given pseudo lives.
+   The live ranges can be used to find conflicts with other pseudos.
+   If the live ranges of two pseudos intersect, the pseudos are in
+   conflict.  */
+struct lra_live_range
+{
+  /* Pseudo regno whose live range is described by given
+     structure.	 */
+  int regno;
+  /* Program point range.  */
+  int start, finish;
+  /* Next structure describing program points where the pseudo
+     lives.  */
+  lra_live_range_t next;
+  /* Pointers to structures with the same start/finish.	 */
+  lra_live_range_t start_next, finish_next;
+};
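+
+/* An illustrative example: a pseudo with ranges [2..5] and [10..12]
+   conflicts with a pseudo whose range is [4..8] (points 4 and 5 are
+   common to both) but not with one whose only range is [6..9].  */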
+
+typedef struct lra_copy *lra_copy_t;
+
+/* Copy between pseudos which affects assigning hard registers.	 */
+struct lra_copy
+{
+  /* True if regno1 is the destination of the copy.  */
+  bool regno1_dest_p;
+  /* Execution frequency of the copy.  */
+  int freq;
+  /* Pseudos connected by the copy.  REGNO1 < REGNO2.  */
+  int regno1, regno2;
+  /* The next copies involving REGNO1 and REGNO2 correspondingly.  */
+  lra_copy_t regno1_next, regno2_next;
+};
+
+/* Common info about a register (pseudo or hard register).  */
+struct lra_reg
+{
+  /* Bitmap of UIDs of insns (including debug insns) referring to the
+     reg.  */
+  bitmap_head insn_bitmap;
+  /* The following fields are defined only for pseudos.	 */
+  /* Hard registers with which the pseudo conflicts.  */
+  HARD_REG_SET conflict_hard_regs;
+  /* We assign hard registers to reload pseudos which can occur in
+     only a few places.  So two hard register preferences are enough
+     for them.  The following fields define the preferred hard
+     registers.  If there are no such hard registers, the first field
+     value is negative.  If there is only one preferred hard register,
+     the 2nd field is negative.  */
+  int preferred_hard_regno1, preferred_hard_regno2;
+  /* Profits of using the corresponding preferred hard registers.  If
+     both hard registers are defined, the first hard register has no
+     less profit than the second one.  */
+  int preferred_hard_regno_profit1, preferred_hard_regno_profit2;
+#ifdef STACK_REGS
+  /* True if the pseudo should not be assigned to a stack register.  */
+  bool no_stack_p;
+#endif
+#ifdef ENABLE_CHECKING
+  /* True if the pseudo crosses a call.  It is set up in lra-lives.c
+     and used to check that a pseudo crossing a call did not get a
+     call-used hard register.  */
+  bool call_p;
+#endif
+  /* Number of references and execution frequencies of the register in
+     *non-debug* insns.	 */
+  int nrefs, freq;
+  int last_reload;
+  /* Regno used to undo the inheritance.  It can be non-zero only
+     between a pair of inheritance and undo-inheritance passes.  */
+  int restore_regno;
+  /* Value held by the register.  If two pseudos have the same value,
+     they do not conflict.  */
+  int val;
+  /* These members are set up in lra-lives.c and updated in
+     lra-coalesce.c.  */
+  /* The biggest mode in which the pseudo reg is referred to anywhere
+     in the function (possibly via subreg).  */
+  enum machine_mode biggest_mode;
+  /* Live ranges of the pseudo.	 */
+  lra_live_range_t live_ranges;
+  /* This member is set up in lra-lives.c for subsequent
+     assignments.  */
+  lra_copy_t copies;
+};
+
+/* References to the common info about each register.  */
+extern struct lra_reg *lra_reg_info;
+
+/* Static info about each insn operand (common for all insns with the
+   same ICODE).	 Warning: if the structure definition is changed, the
+   initializer for debug_operand_data in lra.c should be changed
+   too.	 */
+struct lra_operand_data
+{
+  /* The machine description constraint string of the operand.	*/
+  const char *constraint;
+  /* It is taken only from machine description (which is different
+     from recog_data.operand_mode) and can be of VOIDmode.  */
+  ENUM_BITFIELD(machine_mode) mode : 16;
+  /* The type of the operand (in/out/inout).  */
+  ENUM_BITFIELD (op_type) type : 8;
+  /* True if the operand is accessed through STRICT_LOW_PART.  */
+  unsigned int strict_low : 1;
+  /* True if the operand is an operator.  */
+  unsigned int is_operator : 1;
+  /* True if there is an early clobber alternative for this operand.
+     This field is set up every time the corresponding
+     operand_alternative in lra_static_insn_data is set up.  */
+  unsigned int early_clobber : 1;
+};
+
+/* Info about register in an insn.  */
+struct lra_insn_reg
+{
+  /* The biggest mode through which the insn refers to the register
+     (remember the register can be accessed through a subreg in the
+     insn).  */
+  ENUM_BITFIELD(machine_mode) biggest_mode : 16;
+  /* The type of the corresponding operand which is the register.  */
+  ENUM_BITFIELD (op_type) type : 8;
+  /* True if the reg is accessed through a subreg and the subreg is
+     just a part of the register.  */
+  unsigned int subreg_p : 1;
+  /* True if there is an early clobber alternative for this
+     operand.  */
+  unsigned int early_clobber : 1;
+  /* The corresponding regno of the register.  */
+  int regno;
+  /* Next reg info of the same insn.  */
+  struct lra_insn_reg *next;
+};
+
+/* Static part (common info for insns with the same ICODE) of LRA
+   internal insn info.  There is at most one instance of it for each
+   non-negative ICODE.  Warning: if the structure definition is
+   changed, the initializer for debug_insn_static_data in lra.c should
+   be changed too.  */
+struct lra_static_insn_data
+{
+  /* Static info about each insn operand.  */
+  struct lra_operand_data *operand;
+  /* Each duplication refers to the number of the corresponding
+     operand which is duplicated.  */
+  int *dup_num;
+  /* The number of an operand marked as commutative, -1 otherwise.  */
+  int commutative;
+  /* Number of operands, duplications, and alternatives of the
+     insn.  */
+  char n_operands;
+  char n_dups;
+  char n_alternatives;
+  /* Insns in machine description (or clobbers in asm) may contain
+     explicit hard regs which are not operands.	 The following list
+     describes such hard registers.  */
+  struct lra_insn_reg *hard_regs;
+  /* Array [n_alternatives][n_operands] of static constraint info for
+     the given operand in the given alternative.  This info can be
+     changed if the target reg info is changed.  */
+  struct operand_alternative *operand_alternative;
+};
+
+/* LRA internal info about an insn (LRA internal insn
+   representation).  */
+struct lra_insn_recog_data
+{
+  int icode; /* The insn code.	*/
+  rtx insn; /* The insn itself.	 */
+  /* Common data for insns with the same ICODE.	 */
+  struct lra_static_insn_data *insn_static_data;
+  /* Two arrays whose sizes are correspondingly equal to the numbers
+     of operands and of duplications: */
+  rtx **operand_loc; /* The operand locations, NULL if no operands.  */
+  rtx **dup_loc; /* The dup locations, NULL if no dups.	 */
+  /* The hard registers implicitly used in a given call insn.  The
+     value can be NULL or point to an array of hard register numbers
+     ending with a negative value.  */
+  int *arg_hard_regs;
+#ifdef HAVE_ATTR_enabled
+  /* Array telling which alternatives are enabled for the insn; NULL
+     for debug insns.  */
+  bool *alternative_enabled_p;
+#endif
+  /* The alternative which should be used for the insn; -1 if no
+     alternative is chosen yet, if we may try any alternative, or if
+     the insn is a debug insn.  */
+  int used_insn_alternative;
+  struct lra_insn_reg *regs;  /* Always NULL for a debug insn.	*/
+};
+
+typedef struct lra_insn_recog_data *lra_insn_recog_data_t;
+
+/* lra.c: */
+
+extern FILE *lra_dump_file;
+
+extern bool lra_reg_spill_p;
+
+extern HARD_REG_SET lra_no_alloc_regs;
+
+extern int lra_insn_recog_data_len;
+extern lra_insn_recog_data_t *lra_insn_recog_data;
+
+extern bitmap_head lra_constraint_insn_stack_bitmap;
+extern VEC (rtx, heap) *lra_constraint_insn_stack;
+
+extern int lra_curr_reload_num;
+
+extern void lra_push_insn (rtx);
+extern void lra_push_insn_by_uid (unsigned int);
+extern void lra_push_insn_and_update_insn_regno_info (rtx);
+
+extern rtx lra_create_new_reg_with_unique_value (enum machine_mode, rtx,
+						 enum reg_class, const char *);
+extern void lra_set_regno_unique_value (int);
+extern void lra_invalidate_insn_data (rtx);
+extern void lra_set_insn_deleted (rtx);
+extern void lra_delete_dead_insn (rtx);
+extern void lra_emit_add (rtx, rtx, rtx);
+extern void lra_emit_move (rtx, rtx);
+extern void lra_update_dups (lra_insn_recog_data_t, signed char *);
+
+extern void lra_process_new_insns (rtx, rtx, rtx, const char *);
+
+extern lra_insn_recog_data_t lra_set_insn_recog_data (rtx);
+extern lra_insn_recog_data_t lra_update_insn_recog_data (rtx);
+extern void lra_set_used_insn_alternative (rtx, int);
+extern void lra_set_used_insn_alternative_by_uid (int, int);
+
+extern void lra_invalidate_insn_regno_info (rtx);
+extern void lra_update_insn_regno_info (rtx);
+extern struct lra_insn_reg *lra_get_insn_regs (int);
+
+extern void lra_expand_reg_info (void);
+extern void lra_free_copies (void);
+extern void lra_create_copy (int, int, int);
+extern lra_copy_t lra_get_copy (int);
+extern bool lra_former_scratch_p (int);
+extern bool lra_former_scratch_operand_p (rtx, int);
+
+extern int lra_constraint_new_regno_start;
+extern bitmap_head lra_inheritance_pseudos;
+extern bitmap_head lra_split_pseudos;
+extern bitmap_head lra_optional_reload_pseudos;
+extern int lra_constraint_new_insn_uid_start;
+
+/* lra-constraints.c: */
+
+extern bitmap_head lra_matched_pseudos;
+extern bitmap_head lra_bound_pseudos;
+
+extern rtx lra_secondary_memory[NUM_MACHINE_MODES];
+
+extern int lra_constraint_offset (int, enum machine_mode);
+
+extern int lra_constraint_iter;
+extern int lra_constraint_iter_after_spill;
+extern bool lra_risky_transformations_p;
+extern int lra_inheritance_iter;
+extern int lra_undo_inheritance_iter;
+extern bool lra_constraints (bool);
+extern void lra_contraints_init (void);
+extern void lra_contraints_finish (void);
+extern void lra_inheritance (void);
+extern bool lra_undo_inheritance (void);
+
+/* lra-lives.c: */
+
+extern int lra_live_max_point;
+extern lra_live_range_t *lra_start_point_ranges, *lra_finish_point_ranges;
+extern int *lra_point_freq;
+
+extern int lra_hard_reg_usage[FIRST_PSEUDO_REGISTER];
+
+extern int lra_live_range_iter;
+extern void lra_create_live_ranges (bool);
+extern lra_live_range_t lra_copy_live_range_list (lra_live_range_t);
+extern lra_live_range_t lra_merge_live_ranges (lra_live_range_t,
+					       lra_live_range_t);
+extern bool lra_intersected_live_ranges_p (lra_live_range_t,
+					   lra_live_range_t);
+extern bool lra_live_range_in_p (lra_live_range_t, lra_live_range_t);
+extern void lra_print_live_range_list (FILE *, lra_live_range_t);
+extern void lra_debug_live_range_list (lra_live_range_t);
+extern void lra_debug_pseudo_live_ranges (int);
+extern void lra_debug_live_ranges (void);
+extern void lra_clear_live_ranges (void);
+extern void lra_live_ranges_init (void);
+extern void lra_live_ranges_finish (void);
+extern void lra_setup_reload_pseudo_preferenced_hard_reg (int, int, int);
+
+/* lra-assigns.c: */
+
+extern void lra_setup_reg_renumber (int, int, bool);
+extern bool lra_assign (void);
+
+
+/* lra-coalesce.c: */
+
+extern int lra_coalesce_iter;
+extern bool lra_coalesce (void);
+
+/* lra-saves.c: */
+
+extern bool lra_save_restore (void);
+
+/* lra-spills.c:  */
+
+extern bool lra_need_for_spills_p (void);
+extern void lra_spill (void);
+extern void lra_hard_reg_substitution (void);
+
+
+/* lra-elimination.c: */
+
+extern void lra_debug_elim_table (void);
+extern int lra_get_elimation_hard_regno (int);
+extern rtx lra_eliminate_regs_1 (rtx, enum machine_mode, bool, bool, bool);
+extern void lra_eliminate (bool);
+
+extern void lra_eliminate_reg_if_possible (rtx *);
+
+
+
+/* The function returns TRUE if at least one of the hard registers
+   starting with HARD_REGNO and needed to hold a value of MODE is in
+   the set HARD_REGSET.  */
+static inline bool
+lra_hard_reg_set_intersection_p (int hard_regno, enum machine_mode mode,
+				 HARD_REG_SET hard_regset)
+{
+  int i;
+
+  lra_assert (hard_regno >= 0);
+  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
+    if (TEST_HARD_REG_BIT (hard_regset, hard_regno + i))
+      return true;
+  return false;
+}
+
+/* Return the hard regno and offset of (sub-)register X through the
+   arguments HARD_REGNO and OFFSET.  If X is not a (sub-)register or
+   the hard register is unknown, return -1 through HARD_REGNO.  */
+static inline void
+lra_get_hard_regno_and_offset (rtx x, int *hard_regno, int *offset)
+{
+  rtx reg;
+
+  *hard_regno = *offset = -1;
+  reg = x;
+  if (GET_CODE (x) == SUBREG)
+    reg = SUBREG_REG (x);
+  if (! REG_P (reg))
+    return;
+  if ((*hard_regno = REGNO (reg)) >= FIRST_PSEUDO_REGISTER)
+    *hard_regno = lra_get_regno_hard_regno (*hard_regno);
+  if (*hard_regno < 0)
+    return;
+  *offset = 0;
+  if (GET_CODE (x) == SUBREG)
+    *offset += subreg_regno_offset (*hard_regno, GET_MODE (reg),
+				   SUBREG_BYTE (x),  GET_MODE (x));
+}
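+
+/* For example (illustrative only): if pseudo 100 of mode DImode was
+   assigned hard register 0, then for X = (subreg:SI (reg:DI 100) 4)
+   on a little-endian target where SImode needs one hard register and
+   DImode two, the function returns *HARD_REGNO = 0 and *OFFSET = 1,
+   i.e. the subreg value lives in hard register 1.  */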
+
+/* Add the hard registers starting with HARD_REGNO and needed to hold
+   a value of MODE to the set S.  */
+static inline void
+lra_add_hard_reg_set (int hard_regno, enum machine_mode mode, HARD_REG_SET *s)
+{
+  int i;
+
+  for (i = hard_regno_nregs[hard_regno][mode] - 1; i >= 0; i--)
+    SET_HARD_REG_BIT (*s, hard_regno + i);
+}
+
+/* Update the insn operands which are duplications of operand number
+   NOP.  The insn is represented by its LRA internal representation
+   ID.  */
+static inline void
+lra_update_dup (lra_insn_recog_data_t id, int nop)
+{
+  int i;
+  struct lra_static_insn_data *static_id = id->insn_static_data;
+
+  for (i = 0; i < static_id->n_dups; i++)
+    if (static_id->dup_num[i] == nop)
+      *id->dup_loc[i] = *id->operand_loc[nop];
+}
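+
+/* A usage sketch (NEW_REG here is hypothetical): after an operand is
+   changed, e.g.
+
+     *id->operand_loc[1] = new_reg;
+     lra_update_dup (id, 1);
+
+   all locations duplicating operand 1 contain the same rtx as the
+   operand location again.  */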
+
+/* Process operator duplications in the insn with ID.  We do it after
+   processing the operands.  Generally speaking, we could probably do
+   it simultaneously with operand processing because a common practice
+   is to enumerate operators after their operands.  */
+static inline void
+lra_update_operator_dups (lra_insn_recog_data_t id)
+{
+  int i;
+  struct lra_static_insn_data *static_id = id->insn_static_data;
+
+  for (i = 0; i < static_id->n_dups; i++)
+    {
+      int ndup = static_id->dup_num[i];
+      
+      if (static_id->operand[ndup].is_operator)
+	*id->dup_loc[i] = *id->operand_loc[ndup];
+    }
+}
+
+/* Return info about INSN.  Set up the info if it is not done yet.  */
+static inline lra_insn_recog_data_t
+lra_get_insn_recog_data (rtx insn)
+{
+  lra_insn_recog_data_t data;
+  unsigned int uid = INSN_UID (insn);
+
+  if (lra_insn_recog_data_len > (int) uid
+      && (data = lra_insn_recog_data[uid]) != NULL)
+    {
+      /* Check that we did not change the insn without updating its
+	 insn info.  */
+      lra_assert (data->insn == insn
+		  && (INSN_CODE (insn) < 0
+		      || data->icode == INSN_CODE (insn)));
+      return data;
+    }
+  return lra_set_insn_recog_data (insn);
+}
Index: lra-constraints.c
===================================================================
--- lra-constraints.c	(revision 0)
+++ lra-constraints.c	(working copy)
@@ -0,0 +1,5169 @@ 
+/* Code for RTL transformations to satisfy insn constraints.
+   Copyright (C) 2010, 2011, 2012
+   Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3, or (at your option) any later
+   version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+
+/* This file contains code for 3 passes: constraint pass,
+   inheritance/split pass, and pass for undoing failed inheritance and
+   split.
+
+   The major goal of the constraint pass is to transform RTL to
+   satisfy insn and address constraints by:
+     o choosing insn alternatives;
+     o generating *reload insns* (or reloads in brief) and *reload
+       pseudos* which will get the necessary hard registers later;
+     o substituting pseudo equivalences (if a substitution is done
+       once, it is done everywhere) and removing the insns which
+       initialize the used equivalences.
+
+   The constraint pass contains the biggest and most complicated code
+   in LRA.  There are a lot of important details like:
+     o reuse of input reload pseudos to simplify reload pseudo
+       allocation;
+     o bound reload pseudos (when different modes are needed);
+     o some heuristics for choosing an insn alternative to improve
+       the inheritance;
+     o early clobbers, etc.
+
+   The pass mimics the former reload pass in choosing alternatives
+   because the reload pass is oriented to the current machine
+   description model.  This might be changed if the machine
+   description model is changed.
+
+   There is special code to prevent all of LRA, and this pass in
+   particular, from cycling in case of bugs.
+
+   To speed the pass up, we process only the necessary insns (on the
+   first pass, all insns) and in some cases reuse already chosen
+   alternatives.
+
+   The inheritance/split pass transforms the code to achieve
+   inheritance and live range splitting.  It is done on a backward
+   traversal of EBBs.
+
+   The goal of the inheritance optimization is to reuse values in
+   hard registers.  There is an analogous optimization in the old
+   reload pass.  Inheritance is achieved by the following
+   transformation:
+
+       reload_p1 <- p	     reload_p1 <- p
+       ...		     new_p <- reload_p1
+       ...		=>   ...
+       reload_p2 <- p	     reload_p2 <- new_p
+
+   where p is spilled and not changed between the insns.  Reload_p1 is
+   also called *original pseudo* and new_p is called *inheritance
+   pseudo*.
+
+   The subsequent assignment pass will try to assign the same (or
+   another if it is not possible) hard register to new_p as to
+   reload_p1 or reload_p2.
+
+   If it fails to assign a hard register, the opposite transformation
+   will restore the original code (in the pass called undoing
+   inheritance) because with new_p spilled the code would be much
+   worse.  The inheritance is done in EBB scope.  The above is just a
+   simplified example to give an idea of the inheritance, as the
+   inheritance is also done for non-reload insns.
+
+   Splitting (transformation) is also done in EBB scope on the same
+   pass as the inheritance:
+
+       r <- ... or ... <- r		 r <- ... or ... <- r
+       ...				 s <- r (new insn -- save)
+       ...			  => 
+       ...				 r <- s (new insn -- restore)
+       ... <- r				 ... <- r
+
+    The *split pseudo* s is assigned to the hard register of the
+    original pseudo or hard register r.
+
+    Splitting is done:
+      o in EBBs with high register pressure, for global pseudos
+	(living in at least 2 BBs) assigned to hard registers when
+	there is more than one reload needing the hard registers;
+      o for pseudos needing save/restore code around calls.
+
+    If the split pseudo still has the same hard register as the
+    original pseudo after the subsequent assignment pass, the
+    opposite transformation is done by the same pass which undoes
+    the inheritance.  */
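+
+/* An illustrative sketch of splitting around a call (the pseudo
+   names are hypothetical): if pseudo p lives across a call and is
+   assigned a call-clobbered hard register, the pass generates
+
+       p <- ...			p <- ...
+       ...			s <- p	  (new insn -- save)
+       call		  =>	call
+       ... <- p			p <- s	  (new insn -- restore)
+				... <- p
+
+   The split pseudo s is expected to be spilled to memory or to get a
+   call-saved hard register; if s ends up with the same hard register
+   as p, the transformation is undone as described above.  */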
+
+#undef REG_OK_STRICT
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "hard-reg-set.h"
+#include "rtl.h"
+#include "tm_p.h"
+#include "regs.h"
+#include "insn-config.h"
+#include "insn-codes.h"
+#include "recog.h"
+#include "output.h"
+#include "addresses.h"
+#include "target.h"
+#include "function.h"
+#include "expr.h"
+#include "basic-block.h"
+#include "except.h"
+#include "optabs.h"
+#include "df.h"
+#include "ira.h"
+#include "rtl-error.h"
+#include "lra-int.h"
+
+/* Value of LRA_CURR_RELOAD_NUM at the beginning of the BB of the
+   current insn.  Remember that LRA_CURR_RELOAD_NUM is the number of
+   emitted reload insns.  */
+static int bb_reload_num;
+
+/* Array, indexed by machine mode, whose elements are
+   (MEM:MODE BASE_REG) where BASE_REG is a base hard register for the
+   given memory mode.  */
+static rtx indirect_mem[MAX_MACHINE_MODE];
+
+/* Return indirect memory for MODE.  */
+static rtx
+get_indirect_mem (enum machine_mode mode)
+{
+  if (indirect_mem[mode] == NULL_RTX)
+    {
+      enum reg_class rclass = base_reg_class (mode, ADDR_SPACE_GENERIC,
+					      MEM, SCRATCH);
+
+      indirect_mem[mode]
+	 = gen_rtx_MEM (mode, regno_reg_rtx [ira_class_hard_regs[rclass][0]]);
+    }
+  return indirect_mem[mode];
+}
+
+/* Initiate INDIRECT_MEM.  */
+static void
+init_indirect_mem (void)
+{
+  int i;
+
+  for (i = 0; i < MAX_MACHINE_MODE; i++)
+    indirect_mem[i] = NULL_RTX;
+}
+
+
+
+/* The current insn being processed and its corresponding data (the
+   basic block, the insn data, and the insn static data).  */
+static rtx curr_insn;
+static basic_block curr_bb;
+static lra_insn_recog_data_t curr_id;
+static struct lra_static_insn_data *curr_static_id;
+
+
+
+/* Start numbers for new registers and insns on the current
+   constraint pass.  */
+static int new_regno_start;
+static int new_insn_uid_start;
+
+/* Return the hard regno of REGNO or, if it was not assigned to a
+   hard register, a hard register from its allocno class.  */
+static int
+get_try_hard_regno (int regno)
+{
+  int hard_regno;
+  enum reg_class rclass;
+
+  if ((hard_regno = regno) >= FIRST_PSEUDO_REGISTER)
+    hard_regno = lra_get_regno_hard_regno (regno);
+  if (hard_regno >= 0)
+    return hard_regno;
+  rclass = lra_get_allocno_class (regno);
+  if (rclass == NO_REGS)
+    return -1;
+  return ira_class_hard_regs[rclass][0];
+}
+
+/* Return the class of the hard regno of REGNO or, if it was not
+   assigned to a hard register, its allocno class, but the latter
+   only for reload pseudos created on the current constraint pass.
+   Otherwise, return NO_REGS.  */
+static enum reg_class
+get_reg_class (int regno)
+{
+  int hard_regno;
+
+  if ((hard_regno = regno) >= FIRST_PSEUDO_REGISTER)
+    hard_regno = lra_get_regno_hard_regno (regno);
+  if (hard_regno >= 0)
+    return REGNO_REG_CLASS (hard_regno);
+  if (regno >= new_regno_start)
+    return lra_get_allocno_class (regno);
+  return NO_REGS;
+}
+
+/* Return true if REGNO in REG_MODE satisfies reg class constraint CL.
+   For new reload pseudos we should make a more accurate class
+   *NEW_CLASS (we set it up if NEW_CLASS is not NULL) to satisfy the
+   constraints.  Otherwise, set up NEW_CLASS to NO_REGS.  */
+static bool
+in_class_p (int regno, enum machine_mode reg_mode,
+	    enum reg_class cl, enum reg_class *new_class)
+{
+  enum reg_class rclass, common_class;
+  int class_size, hard_regno, nregs, i, j;
+
+  if (new_class != NULL)
+    *new_class = NO_REGS;
+  if (regno < FIRST_PSEUDO_REGISTER)
+    return TEST_HARD_REG_BIT (reg_class_contents[cl], regno);
+  rclass = get_reg_class (regno);
+  if (regno < new_regno_start
+      /* Do not make a more accurate class from generated reloads.
+	 They are mostly moves with a lot of constraints.  Making a
+	 more accurate class may result in a very narrow class and the
+	 impossibility of finding registers for several reloads of one
+	 insn.  */
+      || INSN_UID (curr_insn) >= new_insn_uid_start)
+    return ((regno >= new_regno_start && rclass == ALL_REGS)
+	    || (rclass != NO_REGS && ira_class_subset_p[rclass][cl]
+		&& ! hard_reg_set_subset_p (reg_class_contents[cl],
+					    lra_no_alloc_regs)));
+  else
+    {
+      common_class = ira_reg_class_subset[rclass][cl];
+      if (new_class != NULL)
+	*new_class = common_class;
+      if (hard_reg_set_subset_p (reg_class_contents[common_class],
+				 lra_no_alloc_regs))
+	return false;
+      /* Check that there are enough allocatable regs.  */
+      class_size = ira_class_hard_regs_num[common_class];
+      for (i = 0; i < class_size; i++)
+	{
+	  hard_regno = ira_class_hard_regs[common_class][i];
+	  nregs = hard_regno_nregs[hard_regno][reg_mode];
+	  if (nregs == 1)
+	    return true;
+	  for (j = 0; j < nregs; j++)
+	    if (TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno + j))
+	      break;
+	  if (j >= nregs)
+	    return true;
+	}
+      return false;
+    }
+}
+
+/* Return true if REGNO satisfies a memory constraint.	*/
+static bool
+in_mem_p (int regno)
+{
+  return get_reg_class (regno) == NO_REGS;
+}
+
+/* Return the defined and profitable equivalent substitution of reg X;
+   return X otherwise.  */
+static rtx
+get_equiv_substitution (rtx x)
+{
+  int regno;
+  rtx res;
+
+  if (! REG_P (x) || (regno = REGNO (x)) < FIRST_PSEUDO_REGISTER
+      || ! ira_reg_equiv[regno].defined_p
+      || ! ira_reg_equiv[regno].profitable_p
+      || lra_get_regno_hard_regno (regno) >= 0)
+    return x;
+  if ((res = ira_reg_equiv[regno].memory) != NULL_RTX)
+    return res;
+  if ((res = ira_reg_equiv[regno].constant) != NULL_RTX)
+    return res;
+  if ((res = ira_reg_equiv[regno].invariant) != NULL_RTX)
+    return res;
+  gcc_unreachable ();
+}
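+
+/* For example (the pseudo number is hypothetical): if pseudo r100
+   did not get a hard register and has the defined and profitable
+   memory equivalence (mem/c:SI (symbol_ref:SI ("x"))), the function
+   returns that memory rtx for r100, so its uses can be replaced by
+   the equivalence instead of being reloaded.  */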
+
+
+
+/* The page contains code to reuse input reloads.  */
+
+/* The structure describes an input reload of the current insn.  */
+struct input_reload
+{
+  /* Reloaded value.  */
+  rtx input;
+  /* Reload pseudo used.  */
+  rtx reg;
+};
+
+/* The number of elements in the following array.  */
+static int curr_insn_input_reloads_num;
+/* Array containing info about input reloads.  It is used to find the
+   same input reload and reuse the reload pseudo in this case.	*/
+static struct input_reload curr_insn_input_reloads[LRA_MAX_INSN_RELOADS];
+
+/* Initiate data concerning reuse of input reloads for the current
+   insn.  */
+static void
+init_curr_insn_input_reloads (void)
+{
+  curr_insn_input_reloads_num = 0;
+}
+
+/* Change class of pseudo REGNO to NEW_CLASS.  Print info about it
+   using TITLE.	 Output a new line if NL_P.  */
+static void
+change_class (int regno, enum reg_class new_class,
+	      const char *title, bool nl_p)
+{
+  if (lra_dump_file != NULL)
+    fprintf (lra_dump_file, "%s to class %s for r%d",
+	     title, reg_class_names[new_class], regno);
+  setup_reg_classes (regno, new_class, NO_REGS, new_class);
+  if (lra_dump_file != NULL && nl_p)
+    fprintf (lra_dump_file, "\n");
+}
+
+/* Create a new pseudo using MODE, RCLASS, ORIGINAL, and TITLE, or
+   reuse an already created input reload pseudo (only if TYPE is not
+   OP_OUT).  The resulting pseudo is returned through RESULT_REG.
+   Return TRUE if we created a new pseudo, FALSE if we reused an
+   already created input reload pseudo.  */
+static bool
+get_reload_reg (enum op_type type, enum machine_mode mode, rtx original,
+		enum reg_class rclass, const char *title, rtx *result_reg)
+{
+  bool res_p;
+  int i, regno;
+  enum reg_class new_class;
+
+  if (type == OP_OUT)
+    {
+      *result_reg
+	= lra_create_new_reg_with_unique_value (mode, original, rclass, title);
+      return true;
+    }
+  for (i = 0; i < curr_insn_input_reloads_num; i++)
+    if (rtx_equal_p (curr_insn_input_reloads[i].input, original))
+      break;
+  if (i >= curr_insn_input_reloads_num
+      || ! in_class_p (REGNO (curr_insn_input_reloads[i].reg),
+		       GET_MODE (curr_insn_input_reloads[i].reg),
+		       rclass, &new_class))
+    {
+      res_p = true;
+      *result_reg = lra_create_new_reg (mode, original, rclass, title);
+    }
+  else
+    {
+      lra_assert (! side_effects_p (original));
+      res_p = false;
+      *result_reg = curr_insn_input_reloads[i].reg;
+      regno = REGNO (*result_reg);
+      if (lra_dump_file != NULL)
+	 {
+	   fprintf (lra_dump_file, "	 Reuse r%d for reload ", regno);
+	   print_value_slim (lra_dump_file, original, 1);
+	 }
+      if (rclass != new_class)
+	 change_class (regno, new_class, ", change", false);
+      if (lra_dump_file != NULL)
+	 fprintf (lra_dump_file, "\n");
+    }
+  lra_assert (curr_insn_input_reloads_num < LRA_MAX_INSN_RELOADS);
+  curr_insn_input_reloads[curr_insn_input_reloads_num].input = original;
+  curr_insn_input_reloads[curr_insn_input_reloads_num++].reg = *result_reg;
+  return res_p;
+}
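+
+/* A sketch of the reuse (the pseudo numbers are hypothetical): if
+   the current insn needs the same value r100 reloaded for two input
+   operands, the first call of get_reload_reg creates reload pseudo
+   r200 and records the pair (r100, r200); the second call finds the
+   pair, returns r200 again (possibly narrowing its class to satisfy
+   both constraints), and so avoids a redundant reload insn.  */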
+
+
+
+/* The page contains code to extract memory address parts.  */
+
+/* Info about the base and index regs of an address.  In some rare
+   cases, a base/index register can actually be memory.  In this case
+   we will reload it.  */
+struct address
+{
+  rtx *base_reg_loc;  /* NULL if there is no base register.  */
+  rtx *base_reg_loc2; /* Second location of {post/pre}_modify, NULL
+			 otherwise.  */
+  rtx *index_reg_loc; /* NULL if there is no index register.  */
+  rtx *index_loc; /* Location of index_reg * scale, or equal to
+		     index_reg_loc otherwise.  */
+  rtx *disp_loc; /* NULL if there is no displacement.  */
+  /* Defined if base_reg_loc is not NULL.  */
+  enum rtx_code base_outer_code, index_code;
+  /* True if the base register is modified in the address, for
+     example, in PRE_INC.  */
+  bool base_modify_p;
+};
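+
+/* For example (illustrative only), for the address
+
+     (plus:SI (plus:SI (mult:SI (reg:SI 101) (const_int 4))
+		       (reg:SI 100))
+	      (const_int 12))
+
+   extract_loc_address_regs below sets base_reg_loc to the location
+   of (reg:SI 100), index_reg_loc to the location of (reg:SI 101),
+   index_loc to the location of the whole MULT, and disp_loc to the
+   location of (const_int 12).  */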
+
+/* Wrapper around REGNO_OK_FOR_INDEX_P, to allow pseudos.  */
+static inline bool
+ok_for_index_p_nonstrict (rtx reg)
+{
+  unsigned regno = REGNO (reg);
+  
+  return regno >= FIRST_PSEUDO_REGISTER || REGNO_OK_FOR_INDEX_P (regno);
+}
+
+/* A version of regno_ok_for_base_p for use here, when all pseudos
+   should count as OK.	Arguments as for regno_ok_for_base_p.  */
+static inline bool
+ok_for_base_p_nonstrict (rtx reg, enum machine_mode mode, addr_space_t as,
+			 enum rtx_code outer_code, enum rtx_code index_code)
+{
+  unsigned regno = REGNO (reg);
+
+  if (regno >= FIRST_PSEUDO_REGISTER)
+    return true;
+  return ok_for_base_p_1 (regno, mode, as, outer_code, index_code);
+}
+
+/* Process the address part (or the whole address if TOP_P) in space
+   AS with location *LOC to extract address characteristics.
+
+   If CONTEXT_P is false, we are looking at the base part of an
+   address, otherwise we are looking at the index part.
+
+   MODE is the mode of the memory reference; OUTER_CODE and INDEX_CODE
+   give the context that the rtx appears in; MODIFY_P is true if *LOC
+   is modified.  */
+static void
+extract_loc_address_regs (bool top_p, enum machine_mode mode, addr_space_t as,
+			  rtx *loc, bool context_p, enum rtx_code outer_code,
+			  enum rtx_code index_code,
+			  bool modify_p, struct address *ad)
+{
+  rtx x = *loc;
+  enum rtx_code code = GET_CODE (x);
+  bool base_ok_p;
+
+  switch (code)
+    {
+    case CONST_INT:
+    case CONST:
+    case SYMBOL_REF:
+    case LABEL_REF:
+      if (! context_p)
+	ad->disp_loc = loc;
+      return;
+
+    case CC0:
+    case PC:
+      return;
+
+    case PLUS:
+    case LO_SUM:
+      /* When we have an address that is a sum, we must determine
+	 whether registers are "base" or "index" regs.	If there is a
+	 sum of two registers, we must choose one to be the
+	 "base".  */
+      {
+	rtx *arg0_loc = &XEXP (x, 0);
+	rtx *arg1_loc = &XEXP (x, 1);
+	rtx arg0 = *arg0_loc;
+	rtx arg1 = *arg1_loc;
+	enum rtx_code code0 = GET_CODE (arg0);
+	enum rtx_code code1 = GET_CODE (arg1);
+
+	/* Look inside subregs.	 */
+	if (code0 == SUBREG)
+	  {
+	    arg0_loc = &SUBREG_REG (arg0);
+	    arg0 = *arg0_loc;
+	    code0 = GET_CODE (arg0);
+	  }
+	if (code1 == SUBREG)
+	  {
+	    arg1_loc = &SUBREG_REG (arg1);
+	    arg1 = *arg1_loc;
+	    code1 = GET_CODE (arg1);
+	  }
+
+	/* If this machine only allows one register per address, it
+	   must be in the first operand.  */
+	if (MAX_REGS_PER_ADDRESS == 1 || code == LO_SUM)
+	  {
+	    extract_loc_address_regs (false, mode, as, arg0_loc, false, code,
+				      code1, modify_p, ad);
+	    ad->disp_loc = arg1_loc;
+	  }
+	/* If index and base registers are the same on this machine,
+	   just record registers in any non-constant operands.	We
+	   assume here, as well as in the tests below, that all
+	   addresses are in canonical form.  */
+	else if (INDEX_REG_CLASS
+		 == base_reg_class (VOIDmode, as, PLUS, SCRATCH)
+		 && code0 != PLUS && code0 != MULT)
+	  {
+	    extract_loc_address_regs (false, mode, as, arg0_loc, false, PLUS,
+				      code1, modify_p, ad);
+	    if (! CONSTANT_P (arg1))
+	      extract_loc_address_regs (false, mode, as, arg1_loc, true, PLUS,
+					code0, modify_p, ad);
+	    else
+	      ad->disp_loc = arg1_loc;
+	  }
+
+	/* If the second operand is a constant integer, it doesn't
+	   change what class the first operand must be.	 */
+	else if (code1 == CONST_INT || code1 == CONST_DOUBLE)
+	  {
+	    ad->disp_loc = arg1_loc;
+	    extract_loc_address_regs (false, mode, as, arg0_loc, context_p,
+				      PLUS, code1, modify_p, ad);
+	  }
+	/* If the second operand is a symbolic constant, the first
+	   operand must be an index register, but only if this part is
+	   the whole address.  */
+	else if (code1 == SYMBOL_REF || code1 == CONST || code1 == LABEL_REF)
+	  {
+	    ad->disp_loc = arg1_loc;
+	    extract_loc_address_regs (false, mode, as, arg0_loc,
+				      top_p ? true : context_p, PLUS, code1,
+				      modify_p, ad);
+	  }
+	/* If both operands are registers but one is already a hard
+	   register of index or reg-base class, give the other the
+	   class that the hard register is not.	 */
+	else if (code0 == REG && code1 == REG
+		 && REGNO (arg0) < FIRST_PSEUDO_REGISTER
+		 && ((base_ok_p
+		      = ok_for_base_p_nonstrict (arg0, mode, as, PLUS, REG))
+		     || ok_for_index_p_nonstrict (arg0)))
+	  {
+	    extract_loc_address_regs (false, mode, as, arg0_loc, ! base_ok_p,
+				      PLUS, REG, modify_p, ad);
+	    extract_loc_address_regs (false, mode, as, arg1_loc, base_ok_p,
+				      PLUS, REG, modify_p, ad);
+	  }
+	else if (code0 == REG && code1 == REG
+		 && REGNO (arg1) < FIRST_PSEUDO_REGISTER
+		 && ((base_ok_p
+		      = ok_for_base_p_nonstrict (arg1, mode, as, PLUS, REG))
+		     || ok_for_index_p_nonstrict (arg1)))
+	  {
+	    extract_loc_address_regs (false, mode, as, arg0_loc, base_ok_p,
+				      PLUS, REG, modify_p, ad);
+	    extract_loc_address_regs (false, mode, as, arg1_loc, ! base_ok_p,
+				      PLUS, REG, modify_p, ad);
+	  }
+	/* If one operand is known to be a pointer, it must be the
+	   base with the other operand the index.  Likewise if the
+	   other operand is a MULT.  */
+	else if ((code0 == REG && REG_POINTER (arg0)) || code1 == MULT)
+	  {
+	    extract_loc_address_regs (false, mode, as, arg0_loc, false, PLUS,
+				      code1, modify_p, ad);
+	    if (code1 == MULT)
+	      ad->index_loc = arg1_loc;
+	    extract_loc_address_regs (false, mode, as, arg1_loc, true, PLUS,
+				      code0, modify_p, ad);
+	  }
+	else if ((code1 == REG && REG_POINTER (arg1)) || code0 == MULT)
+	  {
+	    extract_loc_address_regs (false, mode, as, arg0_loc, true, PLUS,
+				      code1, modify_p, ad);
+	    if (code0 == MULT)
+	      ad->index_loc = arg0_loc;
+	    extract_loc_address_regs (false, mode, as, arg1_loc, false, PLUS,
+				      code0, modify_p, ad);
+	  }
+	/* Otherwise, count equal chances that each might be a base or
+	   index register.  This case should be rare.  */
+	else
+	  {
+	    extract_loc_address_regs (false, mode, as, arg0_loc, false, PLUS,
+				      code1, modify_p, ad);
+	    extract_loc_address_regs (false, mode, as, arg1_loc,
+				      ad->base_reg_loc != NULL, PLUS,
+				      code0, modify_p, ad);
+	  }
+      }
+      break;
+
+    case POST_MODIFY:
+    case PRE_MODIFY:
+      extract_loc_address_regs (false, mode, as, &XEXP (x, 0), false,
+				code, GET_CODE (XEXP (XEXP (x, 1), 1)),
+				true, ad);
+      lra_assert (rtx_equal_p (XEXP (XEXP (x, 1), 0), XEXP (x, 0)));
+      ad->base_reg_loc2 = &XEXP (XEXP (x, 1), 0);
+      if (REG_P (XEXP (XEXP (x, 1), 1)))
+	extract_loc_address_regs (false, mode, as, &XEXP (XEXP (x, 1), 1),
+				  true, code, REG, modify_p, ad);
+      break;
+
+    case POST_INC:
+    case PRE_INC:
+    case POST_DEC:
+    case PRE_DEC:
+      extract_loc_address_regs (false, mode, as, &XEXP (x, 0), false, code,
+				SCRATCH, true, ad);
+      break;
+
+      /* We process memory as a register.  That means we flatten
+	 addresses.  In other words, the final code will never
+	 contain memory in an address even if the target supports
+	 such addresses (they are too rare these days).  Memory can
+	 also occur in an address as a result of some previous
+	 transformations like equivalence substitution.  */
+    case MEM:
+    case REG:
+      if (context_p)
+	ad->index_reg_loc = loc;
+      else
+	{
+	  ad->base_reg_loc = loc;
+	  ad->base_outer_code = outer_code;
+	  ad->index_code = index_code;
+	  ad->base_modify_p = modify_p;
+	}
+      break;
+    default:
+      {
+	const char *fmt = GET_RTX_FORMAT (code);
+	int i;
+
+	if (GET_RTX_LENGTH (code) != 1
+	    || fmt[0] != 'e' || GET_CODE (XEXP (x, 0)) != UNSPEC)
+	  {
+	    for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+	      if (fmt[i] == 'e')
+		extract_loc_address_regs (false, mode, as, &XEXP (x, i),
+					  context_p, code, SCRATCH,
+					  modify_p, ad);
+	    break;
+	  }
+	/* fall through for case UNARY_OP (UNSPEC ...)	*/
+      }
+
+    case UNSPEC:
+      if (ad->disp_loc == NULL)
+	ad->disp_loc = loc;
+      else if (ad->base_reg_loc == NULL)
+	{
+	  ad->base_reg_loc = loc;
+	  ad->base_outer_code = outer_code;
+	  ad->index_code = index_code;
+	  ad->base_modify_p = modify_p;
+	}
+      else
+	{
+	  lra_assert (ad->index_reg_loc == NULL);
+	  ad->index_reg_loc = loc;
+	}
+      break;
+
+    }
+}
+
+
+/* Extract the address characteristics of the address at location
+   *LOC in address space AS and return them in AD.  OUTER_CODE should
+   be MEM if the address comes from a MEM, or ADDRESS for a 'p'
+   constraint, in which case MEM_MODE should be VOIDmode.  */
+static void
+extract_address_regs (enum machine_mode mem_mode, addr_space_t as,
+		      rtx *loc, enum rtx_code outer_code, struct address *ad)
+{
+  ad->base_reg_loc = ad->base_reg_loc2
+    = ad->index_reg_loc = ad->index_loc = ad->disp_loc = NULL;
+  ad->base_outer_code = SCRATCH;
+  ad->index_code = SCRATCH;
+  ad->base_modify_p = false;
+  extract_loc_address_regs (true, mem_mode, as, loc, false, outer_code,
+			    SCRATCH, false, ad);  
+  if (ad->index_loc == NULL)
+    /* SUBREG ??? */
+    ad->index_loc = ad->index_reg_loc;
+}
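+
+/* For illustration: for a canonical address such as
+
+     (plus (plus (mult (reg I) (const_int 4)) (reg B)) (const_int D))
+
+   the extraction should end up with index_reg_loc pointing at the
+   location of (reg I), index_loc at the MULT, base_reg_loc at
+   (reg B), and disp_loc at (const_int D).  */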
+
+
+
+/* The page contains major code to choose the current insn alternative
+   and generate reloads for it.	 */
+
+/* Return the starting register offset of hard register REGNO in
+   MODE.  */
+int
+lra_constraint_offset (int regno, enum machine_mode mode)
+{
+  lra_assert (regno < FIRST_PSEUDO_REGISTER);
+  /* On a WORDS_BIG_ENDIAN machine, point to the last register of a
+     multiple hard register group of scalar integer registers, so that
+     for example (reg:DI 0) and (reg:SI 1) will be considered the same
+     register.	*/
+  if (WORDS_BIG_ENDIAN && GET_MODE_SIZE (mode) > UNITS_PER_WORD
+      && SCALAR_INT_MODE_P (mode))
+    return hard_regno_nregs[regno][mode] - 1;
+  return 0;
+}
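+
+/* For example, on a 32-bit WORDS_BIG_ENDIAN target, (reg:DI 0)
+   occupies hard registers 0 and 1, so the function returns 1 and
+   both (reg:DI 0) and (reg:SI 1) are treated as ending in hard
+   register 1.  */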
+
+/* Like rtx_equal_p except that it allows a REG and a SUBREG to match
+   if they are the same hard reg, and has special hacks for
+   auto-increment and auto-decrement.  This is specifically intended for
+   process_alt_operands to use in determining whether two operands
+   match.  X is the operand whose number is the lower of the two.
+
+   X is assumed to be the output operand and Y the input
+   operand.  */
+static bool
+operands_match_p (rtx x, rtx y, int y_hard_regno)
+{
+  int i, offset;
+  RTX_CODE code = GET_CODE (x);
+  const char *fmt;
+
+  if (x == y)
+    return true;
+  if ((code == REG || (code == SUBREG && REG_P (SUBREG_REG (x))))
+      && (REG_P (y) || (GET_CODE (y) == SUBREG && REG_P (SUBREG_REG (y)))))
+    {
+      int j;
+      
+      lra_get_hard_regno_and_offset (x, &i, &offset);
+      if (i < 0)
+	goto slow;
+      i += offset;
+
+      if ((j = y_hard_regno) < 0)
+	goto slow;
+
+      i += lra_constraint_offset (i, GET_MODE (x));
+      j += lra_constraint_offset (j, GET_MODE (y));
+
+      return i == j;
+    }
+
+  /* If two operands must match, because they are really a single
+     operand of an assembler insn, then two post-increments are invalid
+     because the assembler insn would increment only once.  On the
+     other hand, a post-increment matches ordinary indexing if the
+     post-increment is the output operand.  */
+  if (code == POST_DEC || code == POST_INC || code == POST_MODIFY)
+    return operands_match_p (XEXP (x, 0), y, y_hard_regno);
+
+  /* Two pre-increments are invalid because the assembler insn would
+     increment only once.  On the other hand, a pre-increment matches
+     ordinary indexing if the pre-increment is the input operand.  */
+  if (GET_CODE (y) == PRE_DEC || GET_CODE (y) == PRE_INC
+      || GET_CODE (y) == PRE_MODIFY)
+    return operands_match_p (x, XEXP (y, 0), y_hard_regno);
+  
+ slow:
+
+  if (code == REG && GET_CODE (y) == SUBREG && REG_P (SUBREG_REG (y))
+      && x == SUBREG_REG (y))
+    return true;
+  if (GET_CODE (y) == REG && code == SUBREG && REG_P (SUBREG_REG (x))
+      && SUBREG_REG (x) == y)
+    return true;
+
+  /* Now we have disposed of all the cases in which different rtx
+     codes can match.  */
+  if (code != GET_CODE (y))
+    return false;
+
+  /* (MULT:SI x y) and (MULT:HI x y) are NOT equivalent.  */
+  if (GET_MODE (x) != GET_MODE (y))
+    return false;
+
+  switch (code)
+    {
+    case CONST_INT:
+    case CONST_DOUBLE:
+    case CONST_FIXED:
+      return false;
+
+    case LABEL_REF:
+      return XEXP (x, 0) == XEXP (y, 0);
+    case SYMBOL_REF:
+      return XSTR (x, 0) == XSTR (y, 0);
+
+    default:
+      break;
+    }
+
+  /* Compare the elements.  If any pair of corresponding elements
+     fails to match, return false for the whole thing.  */
+
+  fmt = GET_RTX_FORMAT (code);
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      int val, j;
+      switch (fmt[i])
+	{
+	case 'w':
+	  if (XWINT (x, i) != XWINT (y, i))
+	    return false;
+	  break;
+
+	case 'i':
+	  if (XINT (x, i) != XINT (y, i))
+	    return false;
+	  break;
+
+	case 'e':
+	  val = operands_match_p (XEXP (x, i), XEXP (y, i), y_hard_regno);
+	  if (val == 0)
+	    return false;
+	  break;
+
+	case '0':
+	  break;
+
+	case 'E':
+	  if (XVECLEN (x, i) != XVECLEN (y, i))
+	    return false;
+	  for (j = XVECLEN (x, i) - 1; j >= 0; --j)
+	    {
+	      val = operands_match_p (XVECEXP (x, i, j), XVECEXP (y, i, j),
+				      y_hard_regno);
+	      if (val == 0)
+		return false;
+	    }
+	  break;
+
+	  /* It is believed that rtx's at this level will never
+	     contain anything but integers and other rtx's, except for
+	     within LABEL_REFs and SYMBOL_REFs.	 */
+	default:
+	  gcc_unreachable ();
+	}
+    }
+  return true;
+}
+
+/* Reload pseudos created for input and output (or early clobber)
+   reloads which should have the same hard register.  Such pseudos
+   used as input operands should not be inherited because the output
+   rewrites the value anyway.  */
+bitmap_head lra_matched_pseudos;
+
+/* Reload pseudos created for matched input and output reloads whose
+   modes are different.  Such pseudos have modified rules for finding
+   their live ranges, e.g. assigning to a subreg of such a pseudo
+   means changing the whole pseudo value.  */
+bitmap_head lra_bound_pseudos;
+
+/* True if X is a constant that can be forced into the constant pool.
+   MODE is the mode of the operand, or VOIDmode if not known.  */
+#define CONST_POOL_OK_P(MODE, X)		\
+  ((MODE) != VOIDmode				\
+   && CONSTANT_P (X)				\
+   && GET_CODE (X) != HIGH			\
+   && !targetm.cannot_force_const_mem (MODE, X))
+
+/* True if C is a non-empty register class that has too few registers
+   to be safely used as a reload target class.	*/
+#define SMALL_REGISTER_CLASS_P(C)					\
+  (reg_class_size [(C)] == 1						\
+   || (reg_class_size [(C)] >= 1 && targetm.class_likely_spilled_p (C)))
+
+/* Return the mode of WHAT inside WHERE, where OUTER_MODE is the mode
+   of WHERE's context.  If WHERE does not contain WHAT, return
+   VOIDmode.  */
+static enum machine_mode
+find_mode (rtx *where, enum machine_mode outer_mode, rtx *what)
+{
+  int i, j;
+  enum machine_mode mode;
+  rtx x;
+  const char *fmt;
+  enum rtx_code code;
+
+  if (where == what)
+    return outer_mode;
+  if (*where == NULL_RTX)
+    return VOIDmode;
+  x = *where;
+  code = GET_CODE (x);
+  outer_mode = GET_MODE (x);
+  fmt = GET_RTX_FORMAT (code);
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      if (fmt[i] == 'e')
+	{
+	  if ((mode = find_mode (&XEXP (x, i), outer_mode, what)) != VOIDmode)
+	    return mode;
+	}
+      else if (fmt[i] == 'E')
+	{
+	  for (j = XVECLEN (x, i) - 1; j >= 0; j--)
+	  if ((mode = find_mode (&XVECEXP (x, i, j), outer_mode, what))
+	      != VOIDmode)
+	    return mode;
+	}
+    }
+  return VOIDmode;
+}
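+
+/* For example, if WHERE points at (plus:SI (reg:SI a) (const_int 4))
+   and WHAT points at the location of the (const_int 4), the result
+   is SImode, the mode of the containing PLUS.  */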
+
+/* Return mode for operand NOP of the current insn.  */
+static inline enum machine_mode
+get_op_mode (int nop)
+{
+  rtx *loc;
+  enum machine_mode mode;
+  bool md_first_p = asm_noperands (PATTERN (curr_insn)) < 0;
+
+  /* Take mode from the machine description first.  */
+  if (md_first_p && (mode = curr_static_id->operand[nop].mode) != VOIDmode)
+    return mode;
+  loc = curr_id->operand_loc[nop];
+  /* Take mode from the operand second.	 */
+  mode = GET_MODE (*loc);
+  if (mode != VOIDmode)
+    return mode;
+  if (! md_first_p && (mode = curr_static_id->operand[nop].mode) != VOIDmode)
+    return mode;
+  /* Here is a very rare case.	Take mode from the context.  */
+  return find_mode (&PATTERN (curr_insn), VOIDmode, loc);
+}
+
+/* If REG is a reload pseudo, try to make its class satisfy CL.  */
+static void
+narrow_reload_pseudo_class (rtx reg, enum reg_class cl)
+{
+  int regno;
+  enum reg_class rclass;
+
+  /* Do not make the class more accurate for insns generated by the
+     reloads themselves.  They are mostly moves with a lot of
+     constraints.  Making the class more accurate may result in a
+     very narrow class and the impossibility of finding registers for
+     several reloads of one insn.  */
+  if (INSN_UID (curr_insn) >= new_insn_uid_start)
+    return;
+  if (GET_CODE (reg) == SUBREG)
+    reg = SUBREG_REG (reg);
+  if (! REG_P (reg) || (regno = REGNO (reg)) < new_regno_start)
+    return;
+  rclass = get_reg_class (regno);
+  rclass = ira_reg_class_subset[rclass][cl];
+  if (rclass == NO_REGS)
+    return;
+  change_class (regno, rclass, "      Change", true);
+}
+
+/* Generate reloads for matching OUT and INS (array of input operand
+   numbers with end marker -1) with reg class GOAL_CLASS.  Add input
+   and output reloads correspondingly to the lists *BEFORE and
+   *AFTER.  */
+static void
+match_reload (signed char out, signed char *ins, enum reg_class goal_class,
+	      rtx *before, rtx *after)
+{
+  int i, in;
+  rtx new_in_reg, new_out_reg, reg;
+  enum machine_mode inmode, outmode;
+  rtx in_rtx = *curr_id->operand_loc[ins[0]];
+  rtx out_rtx = *curr_id->operand_loc[out];
+
+  outmode = get_op_mode (out);
+  inmode = get_op_mode (ins[0]);
+  if (inmode != outmode)
+    {
+      if (GET_MODE_SIZE (inmode) > GET_MODE_SIZE (outmode))
+	{
+	  reg = new_in_reg
+	    = lra_create_new_reg_with_unique_value (inmode, in_rtx,
+						    goal_class, "");
+	  if (SCALAR_INT_MODE_P (inmode))
+	    new_out_reg = gen_lowpart_SUBREG (outmode, reg);
+	  else
+	    new_out_reg = gen_rtx_SUBREG (outmode, reg, 0);
+	}
+      else
+	{
+	  reg = new_out_reg
+	    = lra_create_new_reg_with_unique_value (outmode, out_rtx,
+						    goal_class, "");
+	  if (SCALAR_INT_MODE_P (outmode))
+	    new_in_reg = gen_lowpart_SUBREG (inmode, reg);
+	  else
+	    new_in_reg = gen_rtx_SUBREG (inmode, reg, 0);
+	}
+      bitmap_set_bit (&lra_matched_pseudos, REGNO (reg));
+      bitmap_set_bit (&lra_bound_pseudos, REGNO (reg));
+    }
+  else
+    {
+      /* We create a pseudo for the out rtx because we should always
+	 keep registers with the same original regno synchronized in
+	 value (this is not true for the out register, but it will be
+	 corrected by the next insn).
+
+	 Do not reuse the input register, because of the following
+	 situation: a <- a op b, where b should be the same as a.  */
+
+      new_in_reg = new_out_reg
+	= lra_create_new_reg_with_unique_value (outmode, out_rtx,
+						goal_class, "");
+      /* Don't generate inheritance for the new register because we
+	 cannot use the same hard register for the corresponding
+	 inheritance pseudo of the input reload.  */
+      bitmap_set_bit (&lra_matched_pseudos, REGNO (new_in_reg));
+    }
+  /* The in and out operands can come from transformations performed
+     before constraint processing, so the pseudos might have an
+     inaccurate class; make their classes more accurate here.  */
+  narrow_reload_pseudo_class (in_rtx, goal_class);
+  narrow_reload_pseudo_class (out_rtx, goal_class);
+  push_to_sequence (*before);
+  lra_emit_move (copy_rtx (new_in_reg), in_rtx);
+  *before = get_insns ();
+  end_sequence ();
+  for (i = 0; (in = ins[i]) >= 0; i++)
+    *curr_id->operand_loc[in] = new_in_reg;
+  lra_update_dups (curr_id, ins);
+  if (find_reg_note (curr_insn, REG_UNUSED, out_rtx) == NULL_RTX)
+    {
+      start_sequence ();
+      lra_emit_move (out_rtx, copy_rtx (new_out_reg));
+      emit_insn (*after);
+      *after = get_insns ();
+      end_sequence ();
+    }
+  *curr_id->operand_loc[out] = new_out_reg;
+  lra_update_dup (curr_id, out);
+}
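+
+/* For illustration: given an insn "a = a op b" in which input
+   operand 1 must match output operand 0, the function emits
+   "new = a" to *BEFORE, replaces operands 0 and 1 with "new", and
+   emits "a = new" to *AFTER, so the insn effectively becomes
+
+     new = a;  new = new op b;  a = new;  */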
+
+/* Return the register class which is the union of all reg classes
+   in the insn constraint alternative string starting at P.  */
+static enum reg_class
+reg_class_from_constraints (const char *p)
+{
+  int c, len;
+  enum reg_class op_class = NO_REGS;
+
+  do
+    switch ((c = *p, len = CONSTRAINT_LEN (c, p)), c)
+      {
+      case '#':
+      case ',':
+	return op_class;
+
+      case 'p':
+	op_class = (reg_class_subunion
+		    [op_class][base_reg_class (VOIDmode, ADDR_SPACE_GENERIC,
+					       ADDRESS, SCRATCH)]);
+	break;
+	
+      case 'g':
+      case 'r':
+	op_class = reg_class_subunion[op_class][GENERAL_REGS];
+	break;
+	
+      default:
+	if (REG_CLASS_FROM_CONSTRAINT (c, p) == NO_REGS)
+	  {
+#ifdef EXTRA_CONSTRAINT_STR
+	    if (EXTRA_ADDRESS_CONSTRAINT (c, p))
+	      op_class
+		= (reg_class_subunion
+		   [op_class][base_reg_class (VOIDmode, ADDR_SPACE_GENERIC,
+					      ADDRESS, SCRATCH)]);
+#endif
+	    break;
+	  }
+	
+	op_class
+	  = reg_class_subunion[op_class][REG_CLASS_FROM_CONSTRAINT (c, p)];
+	break;
+      }
+  while ((p += len), c);
+  return op_class;
+}
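+
+/* For example, for the single-alternative constraint string "r" the
+   result is GENERAL_REGS, and for "p" it is the base register class;
+   the scan stops at ',' or '#'.  */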
+
+/* Return the final hard regno (HARD_REGNO plus OFFSET) which the
+   register will have after elimination.  We do this for matching
+   constraints because the final hard regno could have a different
+   class.  */
+static int
+get_final_hard_regno (int hard_regno, int offset)
+{
+  if (hard_regno < 0)
+    return hard_regno;
+  hard_regno += offset;
+  return lra_get_elimation_hard_regno (hard_regno);
+}
+
+/* Return the register class of OP.  That is the class of the hard
+   register itself (if OP is a hard register), the class of the hard
+   register assigned to the pseudo (if OP is a pseudo), or the
+   allocno class of an unassigned pseudo (if OP is a reload pseudo).
+   Return NO_REGS otherwise.  */
+static enum reg_class
+get_op_class (rtx op)
+{
+  int regno, hard_regno, offset;
+
+  if (! REG_P (op))
+    return NO_REGS;
+  lra_get_hard_regno_and_offset (op, &hard_regno, &offset);
+  if (hard_regno >= 0)
+    {
+      hard_regno = get_final_hard_regno (hard_regno, offset);
+      return REGNO_REG_CLASS (hard_regno);
+    }
+  /* Reload pseudo will get a hard register in any case.  */
+  if ((regno = REGNO (op)) >= new_regno_start)
+    return lra_get_allocno_class (regno);
+  return NO_REGS;
+}
+
+/* Return a generated insn mem_pseudo:=val if TO_P, or val:=mem_pseudo
+   otherwise.  If the modes of MEM_PSEUDO and VAL differ, use a
+   SUBREG for VAL to make them equal.  Assign CODE to the insn if it
+   is not recognized.
+
+   We cannot use emit_move_insn in some cases because of questionable
+   practice in some machine descriptions.  For example, power can use
+   only base+index addressing for altivec move insns, and this is
+   checked by insn predicates.  On the other hand, the same move insn
+   constraints permit the use of offsettable memory for moving vector
+   mode values from/to general registers to/from memory.
+   emit_move_insn will transform an offsettable address into one with
+   base+index addressing, which is rejected by the constraint.  So
+   sometimes we need to generate the move insn without modifications
+   and assign the code explicitly, because the generated move can be
+   unrecognizable because of the predicates.  */
+static rtx
+emit_spill_move (bool to_p, rtx mem_pseudo, rtx val, int code)
+{
+  rtx insn, after;
+
+  start_sequence ();
+  if (GET_MODE (mem_pseudo) != GET_MODE (val))
+    val = gen_rtx_SUBREG (GET_MODE (mem_pseudo),
+			  GET_CODE (val) == SUBREG ? SUBREG_REG (val) : val,
+			  0);
+  if (to_p)
+    insn = gen_move_insn (mem_pseudo, val);
+  else
+    insn = gen_move_insn (val, mem_pseudo);
+  if (recog_memoized (insn) < 0)
+    INSN_CODE (insn) = code;
+  emit_insn (insn);
+  after = get_insns ();
+  end_sequence ();
+  return after;
+}
+
+/* Process a special case insn (a register move); return true if we
+   don't need to process it any more.  Set *CHANGE_P if the RTL was
+   changed, and set *SEC_MEM_P if macro SECONDARY_MEMORY_NEEDED says
+   to use secondary memory.  */
+static bool
+check_and_process_move (bool *change_p, bool *sec_mem_p)
+{
+  int sregno, dregno;
+  rtx set, dest, src, dreg, sr, dr, sreg, new_reg, before, scratch_reg;
+  enum reg_class dclass, sclass, secondary_class;
+  enum machine_mode sreg_mode;
+  secondary_reload_info sri;
+
+  *sec_mem_p = *change_p = false;
+  if ((set = single_set (curr_insn)) == NULL)
+    return false;
+  dreg = dest = SET_DEST (set);
+  sreg = src = SET_SRC (set);
+  /* Quick check for a move insn which needs no reloads.  */
+  if ((dclass = get_op_class (dest)) != NO_REGS
+      && (sclass = get_op_class (src)) != NO_REGS
+      && targetm.register_move_cost (GET_MODE (src), dclass, sclass) == 2)
+    return true;
+  if (GET_CODE (dest) == SUBREG)
+    dreg = SUBREG_REG (dest);
+  if (GET_CODE (src) == SUBREG)
+    sreg = SUBREG_REG (src);
+  if (! REG_P (dreg) || ! REG_P (sreg))
+    return false;
+  sclass = dclass = NO_REGS;
+  dr = get_equiv_substitution (dreg);
+  if (dr != dreg)
+    dreg = copy_rtx (dr);
+  if (REG_P (dreg))
+    dclass = get_reg_class (REGNO (dreg));
+  if (dclass == ALL_REGS)
+    /* We don't know what class we will use -- let it be figured out
+       by the curr_insn_transform function.  Remember that some
+       targets do not work with such classes through their
+       implementation of machine-dependent hooks like
+       secondary_memory_needed.  */
+    return false;
+  sreg_mode = GET_MODE (sreg);
+  sr = get_equiv_substitution (sreg);
+  if (sr != sreg)
+    sreg = copy_rtx (sr);
+  if (REG_P (sreg))
+    sclass = get_reg_class (REGNO (sreg));
+  if (sclass == ALL_REGS)
+    /* See comments above.  */
+    return false;
+#ifdef SECONDARY_MEMORY_NEEDED
+  if (dclass != NO_REGS && sclass != NO_REGS
+      && SECONDARY_MEMORY_NEEDED (sclass, dclass, GET_MODE (src)))
+    {
+      *sec_mem_p = true;
+      return false;
+    }
+#endif
+  sri.prev_sri = NULL;
+  sri.icode = CODE_FOR_nothing;
+  sri.extra_cost = 0;
+  secondary_class = NO_REGS;
+  /* Set up hard register for a reload pseudo for hook
+     secondary_reload because some targets just ignore unassigned
+     pseudos in the hook.  */
+  if (dclass != NO_REGS
+      && REG_P (dreg) && (dregno = REGNO (dreg)) >= new_regno_start
+      && lra_get_regno_hard_regno (dregno) < 0)
+    reg_renumber[dregno] = ira_class_hard_regs[dclass][0];
+  else
+    dregno = -1;
+  if (sclass != NO_REGS
+      && REG_P (sreg) && (sregno = REGNO (sreg)) >= new_regno_start
+      && lra_get_regno_hard_regno (sregno) < 0)
+    reg_renumber[sregno] = ira_class_hard_regs[sclass][0];
+  else
+    sregno = -1;
+  if (sclass != NO_REGS)
+    secondary_class
+      = (enum reg_class) targetm.secondary_reload (false, dest,
+						   (reg_class_t) sclass,
+						   GET_MODE (src), &sri);
+  if (sclass == NO_REGS
+      || ((secondary_class != NO_REGS || sri.icode != CODE_FOR_nothing)
+	  && dclass != NO_REGS))
+    secondary_class
+      = (enum reg_class) targetm.secondary_reload (true, sreg,
+						   (reg_class_t) dclass,
+						   sreg_mode, &sri);
+  if (sregno >= 0)
+    reg_renumber [sregno] = -1;
+  if (dregno >= 0)
+    reg_renumber [dregno] = -1;
+  if (secondary_class == NO_REGS && sri.icode == CODE_FOR_nothing)
+    return false;
+  *change_p = true;
+  new_reg = NULL_RTX;
+  if (secondary_class != NO_REGS)
+    new_reg = lra_create_new_reg_with_unique_value (sreg_mode, NULL_RTX,
+						    secondary_class,
+						    "secondary");
+  start_sequence ();
+  if (sri.icode == CODE_FOR_nothing)
+    lra_emit_move (new_reg, sreg);
+  else
+    {
+      enum reg_class scratch_class;
+
+      scratch_class = (reg_class_from_constraints
+		       (insn_data[sri.icode].operand[2].constraint));
+      scratch_reg = (lra_create_new_reg_with_unique_value
+		     (insn_data[sri.icode].operand[2].mode, NULL_RTX,
+		      scratch_class, "scratch"));
+      emit_insn (GEN_FCN (sri.icode) (new_reg != NULL_RTX ? new_reg : dest,
+				      sreg, scratch_reg));
+    }
+  before = get_insns ();
+  end_sequence ();
+  lra_process_new_insns (curr_insn, before, NULL_RTX, "Inserting the move");
+  if (new_reg != NULL_RTX)
+    {
+      if (GET_CODE (src) == SUBREG)
+	SUBREG_REG (src) = new_reg;
+      else
+	SET_SRC (set) = new_reg;
+    }
+  else
+    {
+      if (lra_dump_file != NULL)
+	{
+	  fprintf (lra_dump_file, "Deleting move %u\n", INSN_UID (curr_insn));
+	  debug_rtl_slim (lra_dump_file, curr_insn, curr_insn, -1, 0);
+	}
+      lra_set_insn_deleted (curr_insn);
+      return true;
+    }
+  return false;
+}
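+
+/* For illustration: on some x86 variants a move between a general
+   register and an SSE register must go through memory, which
+   SECONDARY_MEMORY_NEEDED reports and *SEC_MEM_P passes on; targets
+   whose secondary_reload hook requests an intermediate register
+   instead get the move rewritten above through a new pseudo of the
+   returned class.  */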
+
+/* The following data describe the result of process_alt_operands.
+   The data are used in curr_insn_transform to generate reloads.  */
+
+/* The chosen reg classes which should be used for the corresponding
+   operands.  */
+static enum reg_class goal_alt[MAX_RECOG_OPERANDS];
+/* True if the operand should be the same as another operand and
+   that other operand does not need a reload.  */
+static bool goal_alt_match_win[MAX_RECOG_OPERANDS];
+/* True if the operand does not need a reload.	*/
+static bool goal_alt_win[MAX_RECOG_OPERANDS];
+/* True if the operand can be offsetable memory.  */
+static bool goal_alt_offmemok[MAX_RECOG_OPERANDS];
+/* The number of the operand to which a given operand can be
+   matched.  */
+static int goal_alt_matches[MAX_RECOG_OPERANDS];
+/* The number of elements in the following array.  */
+static int goal_alt_dont_inherit_ops_num;
+/* Numbers of operands whose reload pseudos should not be inherited.  */
+static int goal_alt_dont_inherit_ops[MAX_RECOG_OPERANDS];
+/* True if the insn commutative operands should be swapped.  */
+static bool goal_alt_swapped;
+/* The chosen insn alternative.	 */
+static int goal_alt_number;
+
+/* The following five variables are used to choose the best insn
+   alternative.	 They reflect final characteristics of the best
+   alternative.	 */
+
+/* The number of necessary reloads and the overall cost reflecting
+   that number and other unpleasantness of the best alternative.  */
+static int best_losers, best_overall;
+/* Number of small register classes used for operands of the best
+   alternative.	 */
+static int best_small_class_operands_num;
+/* Overall number of hard registers used for reloads.  For example,
+   on some targets we need 2 general registers to reload DFmode but
+   only one floating point register.  */
+static int best_reload_nregs;
+/* Overall number reflecting the distances of previous reloads of the
+   same value.  It is used to improve inheritance chances.  */
+static int best_reload_sum;
+
+/* True if the current insn should have no input or output reloads,
+   respectively.  */
+static bool no_input_reloads_p, no_output_reloads_p;
+
+/* True if we swapped the commutative operands in the current
+   insn.  */
+static int curr_swapped;
+
+/* Make reloads for the address register in LOC, which should be of
+   class CL, and add the reloads to list BEFORE.  If AFTER is not
+   null, also emit insns to set the register up after the insn (the
+   inc/dec and modify cases).  */
+static bool
+process_addr_reg (rtx *loc, rtx *before, rtx *after, enum reg_class cl)
+{
+  int regno, final_regno;
+  enum reg_class rclass, new_class;
+  rtx reg = *loc;
+  rtx new_reg;
+  enum machine_mode mode;
+  bool change_p = false;
+
+  mode = GET_MODE (reg);
+  if (! REG_P (reg))
+    {
+      /* Always reload memory in an address even if the target
+	 supports such addresses.  */
+      new_reg
+	= lra_create_new_reg_with_unique_value (mode, reg, cl, "address");
+      push_to_sequence (*before);
+      lra_emit_move (new_reg, reg);
+      *before = get_insns ();
+      end_sequence ();
+      *loc = new_reg;
+      if (after != NULL)
+	{
+	  start_sequence ();
+	  lra_emit_move (reg, new_reg);
+	  emit_insn (*after);
+	  *after = get_insns ();
+	  end_sequence ();
+	}
+      return true;
+    }
+  lra_assert (REG_P (reg));
+  final_regno = regno = REGNO (reg);
+  if (regno < FIRST_PSEUDO_REGISTER)
+    {
+      rtx final_reg = reg;
+      rtx *final_loc = &final_reg;
+
+      lra_eliminate_reg_if_possible (final_loc);
+      final_regno = REGNO (*final_loc);
+    }
+  /* Use the class of the hard register after elimination because
+     some targets do not recognize virtual hard registers as valid
+     address registers.  */
+  rclass = get_reg_class (final_regno);
+  if ((*loc = get_equiv_substitution (reg)) != reg)
+    {
+      if (lra_dump_file != NULL)
+	{
+	  fprintf (lra_dump_file,
+		   "Changing pseudo %d in address of insn %u on equiv ",
+		   REGNO (reg), INSN_UID (curr_insn));
+	  print_value_slim (lra_dump_file, *loc, 1);
+	  fprintf (lra_dump_file, "\n");
+	}
+      *loc = copy_rtx (*loc);
+      change_p = true;
+    }
+  if (*loc != reg || ! in_class_p (final_regno, GET_MODE (reg), cl, &new_class))
+    {
+      reg = *loc;
+      if (get_reload_reg (OP_IN, mode, reg, cl, "address", &new_reg))
+	{
+	  push_to_sequence (*before);
+	  lra_emit_move (new_reg, reg);
+	  *before = get_insns ();
+	  end_sequence ();
+	}
+      *loc = new_reg;
+      if (after != NULL)
+	{
+	  start_sequence ();
+	  lra_emit_move (reg, new_reg);
+	  emit_insn (*after);
+	  *after = get_insns ();
+	  end_sequence ();
+	}
+      change_p = true;
+    }
+  else if (new_class != NO_REGS && rclass != new_class)
+    change_class (regno, new_class, "	   Change", true);
+  return change_p;
+}
+
+#ifndef SLOW_UNALIGNED_ACCESS
+#define SLOW_UNALIGNED_ACCESS(mode, align) 0
+#endif
+
+/* Make reloads for a subreg in operand NOP with internal subreg mode
+   REG_MODE; add new reloads for further processing.  Return true if
+   any reload was generated.  */
+static bool
+simplify_operand_subreg (int nop, enum machine_mode reg_mode)
+{
+  int hard_regno;
+  rtx before, after;
+  enum machine_mode mode;
+  rtx reg, new_reg;
+  rtx operand = *curr_id->operand_loc[nop];
+
+  before = after = NULL_RTX;
+
+  if (GET_CODE (operand) != SUBREG)
+    return false;
+  
+  mode = GET_MODE (operand);
+  reg = SUBREG_REG (operand);
+  /* If we change the address of a paradoxical subreg of memory, the
+     new address might violate the necessary alignment or the access
+     might be slow.  So take this into consideration.  */
+  if ((MEM_P (reg)
+       && ((! STRICT_ALIGNMENT
+	    && ! SLOW_UNALIGNED_ACCESS (mode, MEM_ALIGN (reg)))
+	   || MEM_ALIGN (reg) >= GET_MODE_ALIGNMENT (mode)))
+      || (REG_P (reg) && REGNO (reg) < FIRST_PSEUDO_REGISTER))
+    {
+      alter_subreg (curr_id->operand_loc[nop], false);
+      return true;
+    }
+  /* Force reload if this is a constant or PLUS or if there may be a
+     problem accessing OPERAND in the outer mode.  */
+  if ((REG_P (reg)
+       && REGNO (reg) >= FIRST_PSEUDO_REGISTER
+       && (hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0
+       /* Don't reload paradoxical subregs because we could loop,
+	  repeatedly producing a final regno out of the hard regs
+	  range.  */
+       && (hard_regno_nregs[hard_regno][GET_MODE (reg)]
+	   >= hard_regno_nregs[hard_regno][mode])
+       && simplify_subreg_regno (hard_regno, GET_MODE (reg),
+				 SUBREG_BYTE (operand), mode) < 0)
+      || CONSTANT_P (reg) || GET_CODE (reg) == PLUS || MEM_P (reg))
+    {
+      /* Constant mode ???? */
+      enum op_type type = curr_static_id->operand[nop].type;
+      /* The class will be defined later in curr_insn_transform.  */
+      enum reg_class rclass
+	= (enum reg_class) targetm.preferred_reload_class (reg, ALL_REGS);
+
+      new_reg = lra_create_new_reg_with_unique_value (reg_mode, reg, rclass,
+						      "subreg reg");
+      bitmap_set_bit (&lra_optional_reload_pseudos, REGNO (new_reg));
+      if (type != OP_OUT
+	  || GET_MODE_SIZE (GET_MODE (reg)) > GET_MODE_SIZE (mode))
+	{
+	  push_to_sequence (before);
+	  lra_emit_move (new_reg, reg);
+	  before = get_insns ();
+	  end_sequence ();
+	}
+      if (type != OP_IN)
+	{
+	  start_sequence ();
+	  lra_emit_move (reg, new_reg);
+	  emit_insn (after);
+	  after = get_insns ();
+	  end_sequence ();
+	}
+      SUBREG_REG (operand) = new_reg;
+      lra_process_new_insns (curr_insn, before, after,
+			     "Inserting subreg reload");
+      return true;
+    }
+  return false;
+}
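+
+/* For illustration: given (subreg:DI (mem:SI ...) 0) on a
+   strict-alignment target, the memory access cannot simply be
+   widened, so the code above reloads the MEM into a new SImode
+   pseudo and rewrites the operand to (subreg:DI (reg:SI new) 0),
+   which a register can satisfy.  */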
+
+/* Return TRUE if *LOC refers to a hard register from SET.  */
+static bool
+uses_hard_regs_p (rtx *loc, HARD_REG_SET set)
+{
+  int i, j, x_hard_regno, offset;
+  enum machine_mode mode;
+  rtx x;
+  const char *fmt;
+  enum rtx_code code;
+
+  if (*loc == NULL_RTX)
+    return false;
+  x = *loc;
+  code = GET_CODE (x);
+  mode = GET_MODE (x);
+  if (code == SUBREG)
+    {
+      loc = &SUBREG_REG (x);
+      x = SUBREG_REG (x);
+      code = GET_CODE (x);
+      if (GET_MODE_SIZE (GET_MODE (x)) > GET_MODE_SIZE (mode))
+	mode = GET_MODE (x);
+    }
+  
+  if (REG_P (x))
+    {
+      lra_get_hard_regno_and_offset (x, &x_hard_regno, &offset);
+      /* The real hard regno of the operand after the allocation.  */
+      x_hard_regno = get_final_hard_regno (x_hard_regno, offset);
+      return (x_hard_regno >= 0
+	      && lra_hard_reg_set_intersection_p (x_hard_regno, mode, set));
+    }
+  if (MEM_P (x))
+    {
+      struct address ad;
+      enum machine_mode mode = GET_MODE (x);
+      rtx *addr_loc = &XEXP (x, 0);
+
+      extract_address_regs (mode, MEM_ADDR_SPACE (x), addr_loc, MEM, &ad);
+      if (ad.base_reg_loc != NULL)
+	{
+	  if (uses_hard_regs_p (ad.base_reg_loc, set))
+	    return true;
+	}
+      if (ad.index_reg_loc != NULL)
+	{
+	  if (uses_hard_regs_p (ad.index_reg_loc, set))
+	    return true;
+	}
+    }
+  fmt = GET_RTX_FORMAT (code);
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      if (fmt[i] == 'e')
+	{
+	  if (uses_hard_regs_p (&XEXP (x, i), set))
+	    return true;
+	}
+      else if (fmt[i] == 'E')
+	{
+	  for (j = XVECLEN (*loc, i) - 1; j >= 0; j--)
+	    if (uses_hard_regs_p (&XVECEXP (*loc, i, j), set))
+	      return true;
+	}
+    }
+  return false;
+}
+
+/* Cost factor for each additional reload, and maximal cost bound for
+   insn reloads.  One might ask about such strange numbers.  Their
+   values come historically from the former reload pass.  */
+#define LOSER_COST_FACTOR 6
+#define MAX_OVERALL_COST_BOUND 600
+
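+/* For example, an alternative needing two reloads with accumulated
+   reject value 4 gets overall = 2 * LOSER_COST_FACTOR + 4 = 16,
+   while an alternative whose reject reaches MAX_OVERALL_COST_BOUND
+   is effectively never chosen.  */
+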
+/* Major function to choose the current insn alternative and what
+   operands should be reloaded and how.  If ONLY_ALTERNATIVE is not
+   negative, consider only this alternative.  Return false if we
+   cannot choose the alternative or find out how to reload the
+   operands.  */
+static bool
+process_alt_operands (int only_alternative)
+{
+  bool ok_p = false;
+  int nop, small_class_operands_num, overall, nalt, offset;
+  int n_alternatives = curr_static_id->n_alternatives;
+  int n_operands = curr_static_id->n_operands;
+  /* LOSERS counts those that don't fit this alternative and would
+     require loading.  */
+  int losers;
+  /* REJECT is a count of how undesirable this alternative says it is
+     if any reloading is required.  If the alternative matches exactly
+     then REJECT is ignored, but otherwise it gets this much counted
+     against it in addition to the reloading needed.  */
+  int reject;
+  /* The number of elements in the following array.  */
+  int early_clobbered_regs_num;
+  /* Numbers of operands which are early clobber registers.  */
+  int early_clobbered_nops[MAX_RECOG_OPERANDS];
+  enum reg_class curr_alt[MAX_RECOG_OPERANDS];
+  HARD_REG_SET curr_alt_set[MAX_RECOG_OPERANDS];
+  bool curr_alt_match_win[MAX_RECOG_OPERANDS];
+  bool curr_alt_win[MAX_RECOG_OPERANDS];
+  bool curr_alt_offmemok[MAX_RECOG_OPERANDS];
+  int curr_alt_matches[MAX_RECOG_OPERANDS];
+  /* The number of elements in the following array.  */
+  int curr_alt_dont_inherit_ops_num;
+  /* Numbers of operands whose reload pseudos should not be inherited.	*/
+  int curr_alt_dont_inherit_ops[MAX_RECOG_OPERANDS];
+  rtx op;
+  rtx no_subreg_operand[MAX_RECOG_OPERANDS], operand_reg[MAX_RECOG_OPERANDS];
+  int hard_regno[MAX_RECOG_OPERANDS];
+  enum machine_mode biggest_mode[MAX_RECOG_OPERANDS];
+  int reload_nregs, reload_sum;
+  bool costly_p;
+  enum reg_class cl;
+
+  /* Calculate some data common for all alternatives to speed up the
+     function.	*/
+  for (nop = 0; nop < n_operands; nop++)
+    {
+      op = no_subreg_operand[nop] = *curr_id->operand_loc[nop];
+      lra_get_hard_regno_and_offset (op, &hard_regno[nop], &offset);
+      /* The real hard regno of the operand after the allocation.  */
+      hard_regno[nop] = get_final_hard_regno (hard_regno[nop], offset);
+      
+      operand_reg[nop] = op;
+      biggest_mode[nop] = GET_MODE (operand_reg[nop]);
+      if (GET_CODE (operand_reg[nop]) == SUBREG)
+	{
+	  operand_reg[nop] = SUBREG_REG (operand_reg[nop]);
+	  if (GET_MODE_SIZE (biggest_mode[nop])
+	      < GET_MODE_SIZE (GET_MODE (operand_reg[nop])))
+	    biggest_mode[nop] = GET_MODE (operand_reg[nop]);
+	}
+      if (REG_P (operand_reg[nop]))
+	no_subreg_operand[nop] = operand_reg[nop];
+      else
+	operand_reg[nop] = NULL_RTX;
+    }
+
+  /* The constraints are made of several alternatives.	Each operand's
+     constraint looks like foo,bar,... with commas separating the
+     alternatives.  The first alternatives for all operands go
+     together, the second alternatives go together, etc.
+
+     First loop over alternatives.  */
+  for (nalt = 0; nalt < n_alternatives; nalt++)
+    {
+      /* Loop over operands for one constraint alternative.  */
+      if (
+#ifdef HAVE_ATTR_enabled
+	  (curr_id->alternative_enabled_p != NULL
+	   && ! curr_id->alternative_enabled_p[nalt])
+	  ||
+#endif
+	  (only_alternative >= 0 && nalt != only_alternative))
+	continue;
+
+      overall = losers = reject = reload_nregs = reload_sum = 0;
+      for (nop = 0; nop < n_operands; nop++)
+	reject += (curr_static_id
+		   ->operand_alternative[nalt * n_operands + nop].reject);
+      early_clobbered_regs_num = 0;
+
+      for (nop = 0; nop < n_operands; nop++)
+	{
+	  const char *p;
+	  char *end;
+	  int len, c, m, i, opalt_num, this_alternative_matches;
+	  bool win, did_match, offmemok, early_clobber_p;
+	  /* false => this operand can be reloaded somehow for this
+	     alternative.  */
+	  bool badop;
+	  /* true => this operand can be reloaded if the alternative
+	     allows regs.  */
+	  bool winreg;
+	  /* True if a constant forced into memory would be OK for
+	     this operand.  */
+	  bool constmemok;
+	  enum reg_class this_alternative, this_costly_alternative;
+	  HARD_REG_SET this_alternative_set, this_costly_alternative_set;
+	  bool this_alternative_match_win, this_alternative_win;
+	  bool this_alternative_offmemok;
+	  int invalidate_m;
+	  enum machine_mode mode;
+
+	  opalt_num = nalt * n_operands + nop;
+	  if (curr_static_id->operand_alternative[opalt_num].anything_ok)
+	    {
+	      /* Fast track for no constraints at all.	*/
+	      curr_alt[nop] = NO_REGS;
+	      CLEAR_HARD_REG_SET (curr_alt_set[nop]);
+	      curr_alt_win[nop] = true;
+	      curr_alt_match_win[nop] = false;
+	      curr_alt_offmemok[nop] = false;
+	      curr_alt_matches[nop] = -1;
+	      continue;
+	    }
+      
+	  op = no_subreg_operand[nop];
+	  mode = get_op_mode (nop);
+
+	  win = did_match = winreg = offmemok = constmemok = false;
+	  badop = true;
+      
+	  early_clobber_p = false;
+	  p = curr_static_id->operand_alternative[opalt_num].constraint;
+      
+	  this_costly_alternative = this_alternative = NO_REGS;
+	  /* We update the set of possible hard regs besides its
+	     class because the reg class might be inaccurate.  For
+	     example, the union of LO_REGS (l), HI_REGS (h), and
+	     STACK_REG (k) on ARM is translated into HI_REGS because
+	     classes are merged by pairs and there is no accurate
+	     intermediate class.  */
+	  CLEAR_HARD_REG_SET (this_alternative_set);
+	  CLEAR_HARD_REG_SET (this_costly_alternative_set);
+	  this_alternative_win = false;
+	  this_alternative_match_win = false;
+	  this_alternative_offmemok = false;
+	  this_alternative_matches = -1;
+  
+	  invalidate_m = -1;
+
+	  /* An empty constraint should be excluded by the fast
+	     track.  */
+	  lra_assert (*p != 0 && *p != ',');
+  
+	  /* Scan this alternative's specs for this operand; set WIN
+	     if the operand fits any letter in this alternative.
+	     Otherwise, clear BADOP if this operand could fit some
+	     letter after reloads, or set WINREG if this operand could
+	     fit after reloads provided the constraint allows some
+	     registers.	 */
+	  costly_p = false;
+	  do
+	    {
+	      switch ((c = *p, len = CONSTRAINT_LEN (c, p)), c)
+		{
+		case '\0':
+		  len = 0;
+		  break;
+		case ',':
+		  c = '\0';
+		  break;
+	
+		case '=':  case '+': case '?': case '*': case '!':
+		case ' ': case '\t':
+		  break;
+		  
+		case '%':
+		  /* We only support one commutative marker, the first
+		     one.  We already set commutative above.  */
+		  break;
+		  
+		case '&':
+		  early_clobber_p = true;
+		  break;
+		  
+		case '#':
+		  /* Ignore rest of this alternative.  */
+		  c = '\0';
+		  break;
+		  
+		case '0':  case '1':  case '2':	 case '3':  case '4':
+		case '5':  case '6':  case '7':	 case '8':  case '9':
+		  {
+		    int m_hregno, m_offset;
+		    bool match_p;
+		    
+		    m = strtoul (p, &end, 10);
+		    p = end;
+		    len = 0;
+		    lra_assert (nop > m);
+		    
+		    this_alternative_matches = m;
+		    lra_get_hard_regno_and_offset (*curr_id->operand_loc[m],
+						   &m_hregno, &m_offset);
+		    m_hregno = get_final_hard_regno (m_hregno, m_offset);
+		    /* We are supposed to match a previous operand.
+		       If we do, we win if that one did.  If we do
+		       not, count both of the operands as losers.
+		       (This is too conservative, since most of the
+		       time only a single reload insn will be needed
+		       to make the two operands win.  As a result,
+		       this alternative may be rejected when it is
+		       actually desirable.)  */
+		    /* If it conflicts with others.  */
+		    match_p = false;
+		    if (operands_match_p (*curr_id->operand_loc[nop],
+					  *curr_id->operand_loc[m], m_hregno))
+		      {
+			int i;
+			
+			for (i = 0; i < early_clobbered_regs_num; i++)
+			  if (early_clobbered_nops[i] == m)
+			    break;
+			/* We should reject matching of an early
+			   clobber operand if the matching operand is
+			   not dying in the insn.  */
+			if (i >= early_clobbered_regs_num
+			    || operand_reg[nop] == NULL_RTX
+			    || (find_regno_note (curr_insn, REG_DEAD,
+						 REGNO (operand_reg[nop]))
+				!= NULL_RTX))
+			  match_p = true;
+		      }
+		    if (match_p)
+		      {
+			/* If we are matching a non-offsettable
+			   address where an offsettable address was
+			   expected, then we must reject this
+			   combination, because we can't reload
+			   it.	*/
+			if (curr_alt_offmemok[m]
+			    && MEM_P (*curr_id->operand_loc[m])
+			    && curr_alt[m] == NO_REGS && ! curr_alt_win[m])
+			  continue;
+			
+			did_match = curr_alt_win[m];
+		      }
+		    else
+		      {
+			/* Operands don't match.  */
+			/* Retroactively mark the operand we had to
+			   match as a loser, if it wasn't already and
+			   it wasn't matched to a register constraint
+			   (e.g. it might be matched by memory).  */
+			if (curr_alt_win[m]
+			    && (operand_reg[m] == NULL_RTX
+				|| hard_regno[m] < 0))
+			  {
+			    losers++;
+			    if (curr_alt[m] != NO_REGS)
+			      reload_nregs
+				+= (ira_reg_class_max_nregs[curr_alt[m]]
+				    [GET_MODE (*curr_id->operand_loc[m])]);
+			  }
+			invalidate_m = m;
+			if (curr_alt[m] == NO_REGS)
+			  continue;
+			
+			/* We prefer alternatives with no matching
+			   constraints because they give more freedom
+			   in RA.  */
+			if (operand_reg[nop] == NULL_RTX
+			    || (find_regno_note (curr_insn, REG_DEAD,
+						 REGNO (operand_reg[nop]))
+				 == NULL_RTX))
+			  reject += 2;
+		      }
+		    /* This can be fixed with reloads if the operand
+		       we are supposed to match can be fixed with
+		       reloads.	 */
+		    badop = false;
+		    this_alternative = curr_alt[m];
+		    COPY_HARD_REG_SET (this_alternative_set, curr_alt_set[m]);
+		    
+		    /* If we have to reload this operand and some
+		       previous operand also had to match the same
+		       thing as this operand, we don't know how to do
+		       that.  So reject this alternative.  */
+		    if (! did_match)
+		      for (i = 0; i < nop; i++)
+			if (curr_alt_matches[i] == this_alternative_matches)
+			  badop = true;
+		    
+		    break;
+		  }
+		  
+		case 'p':
+		  cl = base_reg_class (VOIDmode, ADDR_SPACE_GENERIC,
+				       ADDRESS, SCRATCH);
+		  this_alternative = reg_class_subunion[this_alternative][cl];
+		  IOR_HARD_REG_SET (this_alternative_set,
+				    reg_class_contents[cl]);
+		  if (costly_p)
+		    {
+		      this_costly_alternative
+			= reg_class_subunion[this_costly_alternative][cl];
+		      IOR_HARD_REG_SET (this_costly_alternative_set,
+					reg_class_contents[cl]);
+		    }
+		  win = true;
+		  badop = false;
+		  break;
+		  
+		case TARGET_MEM_CONSTRAINT:
+		  if (MEM_P (op)
+		      || (REG_P (op)
+			  && REGNO (op) >= FIRST_PSEUDO_REGISTER
+			  && in_mem_p (REGNO (op))))
+		    win = true;
+		  if (CONST_POOL_OK_P (mode, op))
+		    badop = false;
+		  constmemok = true;
+		  break;
+		  
+		case '<':
+		  if (MEM_P (op)
+		      && (GET_CODE (XEXP (op, 0)) == PRE_DEC
+			  || GET_CODE (XEXP (op, 0)) == POST_DEC))
+		    win = true;
+		  break;
+		  
+		case '>':
+		  if (MEM_P (op)
+		      && (GET_CODE (XEXP (op, 0)) == PRE_INC
+			  || GET_CODE (XEXP (op, 0)) == POST_INC))
+		    win = true;
+		  break;
+		  
+		  /* Memory op whose address is not offsettable.  */
+		case 'V':
+		  if (MEM_P (op)
+		      && ! offsettable_nonstrict_memref_p (op))
+		    win = true;
+		  break;
+		  
+		  /* Memory operand whose address is offsettable.  */
+		case 'o':
+		  if ((MEM_P (op)
+		       && offsettable_nonstrict_memref_p (op))
+		      || (REG_P (op)
+			  && REGNO (op) >= FIRST_PSEUDO_REGISTER
+			  && in_mem_p (REGNO (op))))
+		    win = true;
+		  if (CONST_POOL_OK_P (mode, op) || MEM_P (op))
+		    badop = false;
+		  constmemok = true;
+		  offmemok = true;
+		  break;
+		  
+		case 'E':
+		case 'F':
+		  if (GET_CODE (op) == CONST_DOUBLE
+		      || (GET_CODE (op) == CONST_VECTOR
+			  && (GET_MODE_CLASS (mode) == MODE_VECTOR_FLOAT)))
+		    win = true;
+		  break;
+		  
+		case 'G':
+		case 'H':
+		  if (GET_CODE (op) == CONST_DOUBLE
+		      && CONST_DOUBLE_OK_FOR_CONSTRAINT_P (op, c, p))
+		    win = true;
+		  break;
+		  
+		case 's':
+		  if (CONST_INT_P (op)
+		      || (GET_CODE (op) == CONST_DOUBLE && mode == VOIDmode))
+		    break;
+		case 'i':
+		  if (CONSTANT_P (op)
+		      && (! flag_pic || LEGITIMATE_PIC_OPERAND_P (op)))
+		    win = true;
+		  break;
+		  
+		case 'n':
+		  if (CONST_INT_P (op)
+		      || (GET_CODE (op) == CONST_DOUBLE && mode == VOIDmode))
+		    win = true;
+		  break;
+	
+		case 'I':
+		case 'J':
+		case 'K':
+		case 'L':
+		case 'M':
+		case 'N':
+		case 'O':
+		case 'P':
+		  if (CONST_INT_P (op)
+		      && CONST_OK_FOR_CONSTRAINT_P (INTVAL (op), c, p))
+		    win = true;
+		  break;
+		  
+		case 'X':
+		  /* This constraint should be excluded by the fast
+		     track.  */
+		  gcc_unreachable ();
+		  break;
+		  
+		case 'g':
+		  if (/* A PLUS is never a valid operand, but LRA can
+			 make it from a register when eliminating
+			 registers.  */
+		      GET_CODE (op) != PLUS
+		      && (! CONSTANT_P (op) || ! flag_pic
+			  || LEGITIMATE_PIC_OPERAND_P (op))
+		      && (! REG_P (op)
+			  || (REGNO (op) >= FIRST_PSEUDO_REGISTER
+			      && in_mem_p (REGNO (op)))))
+		    win = true;
+		  /* Drop through into 'r' case.  */
+		  
+		case 'r':
+		  this_alternative
+		    = reg_class_subunion[this_alternative][GENERAL_REGS];
+		  IOR_HARD_REG_SET (this_alternative_set,
+				    reg_class_contents[GENERAL_REGS]);
+		  if (costly_p)
+		    {
+		      this_costly_alternative
+			= (reg_class_subunion
+			   [this_costly_alternative][GENERAL_REGS]);
+		      IOR_HARD_REG_SET (this_costly_alternative_set,
+					reg_class_contents[GENERAL_REGS]);
+		    }
+		  goto reg;
+		  
+		default:
+		  if (REG_CLASS_FROM_CONSTRAINT (c, p) == NO_REGS)
+		    {
+#ifdef EXTRA_CONSTRAINT_STR
+		      if (EXTRA_MEMORY_CONSTRAINT (c, p))
+			{
+			  if (EXTRA_CONSTRAINT_STR (op, c, p))
+			    win = true;
+			  /* For regno_equiv_mem_loc we have to
+			     check as well.  */
+			  else if (REG_P (op)
+				   && REGNO (op) >= FIRST_PSEUDO_REGISTER
+				   && in_mem_p (REGNO (op)))
+			    {
+			      /* We could eventually transform the
+				 spilled memory into indirect
+				 memory.  */
+			      if (EXTRA_CONSTRAINT_STR
+				  (get_indirect_mem (mode), c, p))
+				win = true;
+			    }
+			  
+			  /* If we didn't already win, we can reload
+			     constants via force_const_mem, and other
+			     MEMs by reloading the address like for
+			     'o'.  */
+			  if (CONST_POOL_OK_P (mode, op) || MEM_P (op))
+			    badop = false;
+			  constmemok = true;
+			  offmemok = true;
+			  break;
+			}
+		      if (EXTRA_ADDRESS_CONSTRAINT (c, p))
+			{
+			  if (EXTRA_CONSTRAINT_STR (op, c, p))
+			    win = true;
+			  
+			  /* If we didn't already win, we can reload
+			     the address into a base register.	*/
+			  cl = base_reg_class (VOIDmode, ADDR_SPACE_GENERIC,
+					       ADDRESS, SCRATCH);
+			  this_alternative
+			    = reg_class_subunion[this_alternative][cl];
+			  IOR_HARD_REG_SET (this_alternative_set,
+					    reg_class_contents[cl]);
+			  if (costly_p)
+			    {
+			      this_costly_alternative
+				= (reg_class_subunion
+				   [this_costly_alternative][cl]);
+			      IOR_HARD_REG_SET (this_costly_alternative_set,
+						reg_class_contents[cl]);
+			    }
+			  badop = false;
+			  break;
+			}
+		      
+		      if (EXTRA_CONSTRAINT_STR (op, c, p))
+			win = true;
+		      else if (REG_P (op)
+			       && REGNO (op) >= FIRST_PSEUDO_REGISTER
+			       && in_mem_p (REGNO (op)))
+			{
+			  /* We could eventually transform the spilled
+			     memory into indirect memory.  */
+			  if (EXTRA_CONSTRAINT_STR (get_indirect_mem (mode),
+						    c, p))
+			    win = true;
+			}
+#endif
+		      break;
+		    }
+		  
+		  cl = REG_CLASS_FROM_CONSTRAINT (c, p);
+		  this_alternative = reg_class_subunion[this_alternative][cl];
+		  IOR_HARD_REG_SET (this_alternative_set,
+				    reg_class_contents[cl]);
+		  if (costly_p)
+		    {
+		      this_costly_alternative
+			= reg_class_subunion[this_costly_alternative][cl];
+		      IOR_HARD_REG_SET (this_costly_alternative_set,
+					reg_class_contents[cl]);
+		    }
+		reg:
+		  if (mode == BLKmode)
+		    break;
+		  winreg = true;
+		  if (REG_P (op))
+		    {
+		      if (hard_regno[nop] >= 0
+			  && in_hard_reg_set_p (this_alternative_set,
+						mode, hard_regno[nop]))
+			win = true;
+		      else if (hard_regno[nop] < 0
+			       && in_class_p (REGNO (op), GET_MODE (op),
+					      this_alternative, NULL))
+			win = true;
+		    }
+		  break;
+		}
+	      if (c != ' ' && c != '\t')
+		costly_p = c == '*';
+	    }
+	  while ((p += len), c);
+  
+	  /* If this operand could be handled with a reg, and some reg
+	     is allowed, then this operand can be handled.  */
+	  if (winreg && this_alternative != NO_REGS)
+	    badop = false;
+  
+	  /* Record which operands fit this alternative.  */
+	  if (win)
+	    {
+	      this_alternative_win = true;
+	      if (operand_reg[nop] != NULL_RTX)
+		{
+		  if (hard_regno[nop] >= 0)
+		    {
+		      if (in_hard_reg_set_p (this_costly_alternative_set,
+					     mode, hard_regno[nop]))
+			reject++;
+		    }
+		  else
+		    {
+		      /* Prefer a won reg to a spilled pseudo under
+			 otherwise equal conditions.  */
+		      reject++;
+		      if (in_class_p (REGNO (operand_reg[nop]),
+				      GET_MODE (operand_reg[nop]),
+				      this_costly_alternative, NULL))
+			reject++;
+		    }
+		  /* We simulate the behaviour of old reload here.
+		     Although scratches need hard registers and it
+		     might result in spilling other pseudos, no reload
+		     insns are generated for the scratches.  So it
+		     might cost something but probably less than the
+		     old reload pass believes.  */
+		  if (lra_former_scratch_p (REGNO (operand_reg[nop])))
+		    reject += LOSER_COST_FACTOR;
+		}
+	    }
+	  else if (did_match)
+	    this_alternative_match_win = true;
+	  else
+	    {
+	      int const_to_mem = 0;
+	      bool no_regs_p;
+
+	      no_regs_p
+		= (this_alternative == NO_REGS
+		   || (hard_reg_set_subset_p
+		       (reg_class_contents[this_alternative],
+			lra_no_alloc_regs)));
+	      this_alternative_offmemok = offmemok;
+	      if (this_costly_alternative != NO_REGS)
+		reject++;
+	      /* If the operand is dying, has a matching constraint,
+		 and satisfies the constraints of the matched operand
+		 which failed to satisfy its own constraints, we do
+		 not need to generate a reload insn for this
+		 operand.  */
+	      if (this_alternative_matches < 0
+		  || curr_alt_win[this_alternative_matches]
+		  || ! REG_P (op)
+		  || find_regno_note (curr_insn, REG_DEAD,
+				      REGNO (op)) == NULL_RTX
+		  || ((hard_regno[nop] < 0
+		       || ! in_hard_reg_set_p (this_alternative_set,
+					       mode, hard_regno[nop]))
+		      && (hard_regno[nop] >= 0
+			  || ! in_class_p (REGNO (op), GET_MODE (op),
+					   this_alternative, NULL))))
+		losers++;
+	      if (operand_reg[nop] != NULL_RTX)
+		{
+		  int last_reload = (lra_reg_info[ORIGINAL_REGNO
+						  (operand_reg[nop])]
+				     .last_reload);
+
+		  if (last_reload > bb_reload_num)
+		    reload_sum += last_reload;
+		  else
+		    reload_sum += bb_reload_num;
+		}
+	      if (badop
+		  /* Alternative loses if it has no regs for a reg
+		     operand.  */
+		  || (REG_P (op) && no_regs_p
+		      && this_alternative_matches < 0))
+		goto fail;
+      
+	      /* If this is a constant that is reloaded into the
+		 desired class by copying it to memory first, count
+		 that as another reload.  This is consistent with
+		 other code and is required to avoid choosing another
+		 alternative when the constant is moved into memory.
+		 Note that the test here is precisely the same as in
+		 the code below that calls force_const_mem.  */
+	      if (CONST_POOL_OK_P (mode, op)
+		  && ((targetm.preferred_reload_class
+		       (op, this_alternative) == NO_REGS)
+		      || no_input_reloads_p)
+		  && get_op_mode (nop) != VOIDmode)
+		{
+		  const_to_mem = 1;
+		  if (! no_regs_p)
+		    losers++;
+		}
+      
+	      /* Alternative loses if it requires a type of reload not
+		 permitted for this insn.  We can always reload
+		 objects with a REG_UNUSED note.  */
+	      if ((curr_static_id->operand[nop].type != OP_IN
+		   && no_output_reloads_p
+		   && ! find_reg_note (curr_insn, REG_UNUSED, op))
+		  || (curr_static_id->operand[nop].type != OP_OUT
+		      && no_input_reloads_p && ! const_to_mem))
+		goto fail;
+      
+	      /* If we can't reload this value at all, reject this
+		 alternative.  Note that we could also lose due to
+		 LIMIT_RELOAD_CLASS, but we don't check that here.  */
+	      if (! CONSTANT_P (op) && ! no_regs_p)
+		{
+		  if (targetm.preferred_reload_class
+		      (op, this_alternative) == NO_REGS)
+		    reject = MAX_OVERALL_COST_BOUND;
+	  
+		  if (curr_static_id->operand[nop].type == OP_OUT
+		      && (targetm.preferred_output_reload_class
+			  (op, this_alternative) == NO_REGS))
+		    reject = MAX_OVERALL_COST_BOUND;
+		}
+      
+	      /* We prefer to reload pseudos over reloading other
+		 things, since such reloads may be eliminable later.
+		 So bump REJECT in other cases.  Don't do this in the
+		 case where we are forcing a constant into memory and
+		 it will then win, since we don't want a different
+		 alternative to match then.  */
+	      if (! (REG_P (op)
+		     && REGNO (op) >= FIRST_PSEUDO_REGISTER)
+		  && ! (const_to_mem && constmemok)
+		  /* We can reload the address instead of the memory
+		     (so do not punish it).  It is preferable to do
+		     this to avoid cycling in some cases.  */
+		  && ! (MEM_P (op) && offmemok))
+		reject += 2;
+      
+	      /* Input reloads can be inherited more often than output
+		 reloads can be removed, so penalize output
+		 reloads.  */
+	      if (!REG_P (op) || curr_static_id->operand[nop].type != OP_IN)
+		reject++;
+	      if (this_alternative_matches < 0
+		  && no_regs_p && ! this_alternative_offmemok && ! constmemok)
+		goto fail;
+
+	      if (! no_regs_p)
+		reload_nregs
+		  += ira_reg_class_max_nregs[this_alternative][mode];
+	    }
+  
+	  if (early_clobber_p)
+	    reject++;
+	  /* ??? Should we update the cost because of early clobber
+	     register reloads, or is that too rare to be worth
+	     doing?  */
+	  overall = losers * LOSER_COST_FACTOR + reject;
+	  if ((best_losers == 0 || losers != 0) && best_overall < overall)
+	    goto fail;
+
+	  curr_alt[nop] = this_alternative;
+	  COPY_HARD_REG_SET (curr_alt_set[nop], this_alternative_set);
+	  curr_alt_win[nop] = this_alternative_win;
+	  curr_alt_match_win[nop] = this_alternative_match_win;
+	  curr_alt_offmemok[nop] = this_alternative_offmemok;
+	  curr_alt_matches[nop] = this_alternative_matches;
+  
+	  if (invalidate_m >= 0 && ! this_alternative_win)
+	    curr_alt_win[invalidate_m] = false;
+  
+	  if (early_clobber_p && operand_reg[nop] != NULL_RTX)
+	    early_clobbered_nops[early_clobbered_regs_num++] = nop;
+	}
+      ok_p = true;
+      curr_alt_dont_inherit_ops_num = 0;
+      for (nop = 0; nop < early_clobbered_regs_num; nop++)
+	{
+	  int i, j, clobbered_hard_regno;
+	  HARD_REG_SET temp_set;
+
+	  i = early_clobbered_nops[nop];
+	  if ((! curr_alt_win[i] && ! curr_alt_match_win[i])
+	      || hard_regno[i] < 0)
+	    continue;
+	  clobbered_hard_regno = hard_regno[i];
+	  CLEAR_HARD_REG_SET (temp_set);
+	  for (j = hard_regno_nregs[clobbered_hard_regno][biggest_mode[i]] - 1;
+	       j >= 0;
+	       j--)
+	    SET_HARD_REG_BIT (temp_set, clobbered_hard_regno + j);
+	  for (j = 0; j < n_operands; j++)
+	    if (j == i
+		/* We don't want to process the insides of
+		   match_operator and match_parallel because otherwise
+		   we would process their operands once again,
+		   generating wrong code.  */
+		|| curr_static_id->operand[j].is_operator)
+	      continue;
+	    else if (curr_alt_matches[j] == i && curr_alt_match_win[j])
+	      {
+		/* This is a trick.  Such operands don't conflict and
+		   don't need a reload.	 But it is hard to transfer
+		   this information to the assignment pass, which
+		   would otherwise spill one of the operands.  We
+		   avoid the conflict by forcing use of the same
+		   pseudo for both operands, hoping that the pseudo
+		   gets the same hard regno and the reloads go
+		   away.  */
+		if (*curr_id->operand_loc[i] != *curr_id->operand_loc[j])
+		  {
+		    curr_alt_win[i] = false;
+		    curr_alt_match_win[j] = false;
+		  }
+		continue;
+	      }
+	    else if (curr_alt_matches[i] == j && curr_alt_match_win[i])
+	      {
+		/* See the comment for the previous case.  */
+		if (*curr_id->operand_loc[i] != *curr_id->operand_loc[j])
+		  {
+		    curr_alt_win[j] = false;
+		    curr_alt_match_win[i] = false;
+		  }
+		continue;
+	      }
+	    else if (uses_hard_regs_p (curr_id->operand_loc[j], temp_set))
+	      break;
+	  if (j >= n_operands)
+	    continue;
+	  /* We need to reload the early clobbered register.  */
+	  for (j = 0; j < n_operands; j++)
+	    if (curr_alt_matches[j] == i)
+	      {
+		curr_alt_match_win[j] = false;
+		losers++;
+		overall += LOSER_COST_FACTOR;
+	      }
+	  if (! curr_alt_match_win[i])
+	    curr_alt_dont_inherit_ops[curr_alt_dont_inherit_ops_num++] = i;
+	  else
+	    {
+	      /* Remember that pseudos used for match reloads are
+		 never inherited.  */
+	      lra_assert (curr_alt_matches[i] >= 0);
+	      curr_alt_win[curr_alt_matches[i]] = false;
+	    }
+	  curr_alt_win[i] = curr_alt_match_win[i] = false;
+	  losers++;
+	  overall += LOSER_COST_FACTOR;
+	}
+      small_class_operands_num = 0;
+      for (nop = 0; nop < n_operands; nop++)
+	small_class_operands_num
+	  += SMALL_REGISTER_CLASS_P (curr_alt[nop]) ? 1 : 0;
+
+      /* If this alternative can be made to work by reloading, and it
+	 needs less reloading than the others checked so far, record
+	 it as the chosen goal for reloading.  */
+      if ((best_losers != 0 && losers == 0)
+	  || (((best_losers == 0 && losers == 0)
+	       || (best_losers != 0 && losers != 0))
+	      && (best_overall > overall
+		  || (best_overall == overall
+		      /* If the cost of the reloads is the same,
+			 prefer alternative which requires minimal
+			 number of small register classes for the
+			 operands.  This improves chances of reloads
+			 for insn requiring small register
+			 classes.  */
+		      && (small_class_operands_num
+			  < best_small_class_operands_num
+			  || (small_class_operands_num
+			      == best_small_class_operands_num
+			      && (reload_nregs < best_reload_nregs
+				  || (reload_nregs == best_reload_nregs
+				      && best_reload_sum < reload_sum))))))))
+	{
+	  for (nop = 0; nop < n_operands; nop++)
+	    {
+	      goal_alt_win[nop] = curr_alt_win[nop];
+	      goal_alt_match_win[nop] = curr_alt_match_win[nop];
+	      goal_alt_matches[nop] = curr_alt_matches[nop];
+	      goal_alt[nop] = curr_alt[nop];
+	      goal_alt_offmemok[nop] = curr_alt_offmemok[nop];
+	    }
+	  goal_alt_dont_inherit_ops_num = curr_alt_dont_inherit_ops_num;
+	  for (nop = 0; nop < curr_alt_dont_inherit_ops_num; nop++)
+	    goal_alt_dont_inherit_ops[nop] = curr_alt_dont_inherit_ops[nop];
+	  goal_alt_swapped = curr_swapped;
+	  best_overall = overall;
+	  best_losers = losers;
+	  best_small_class_operands_num = small_class_operands_num;
+	  best_reload_nregs = reload_nregs;
+	  best_reload_sum = reload_sum;
+	  goal_alt_number = nalt;
+	}
+      if (losers == 0)
+	/* Everything is satisfied.  Do not process alternatives
+	   anymore.  */ 
+	break;
+    fail:
+      ;
+    }
+  return ok_p;
+}
+
+/* Return 1 if ADDR is a valid memory address for mode MODE in address
+   space AS, and check that each pseudo has the proper kind of hard
+   reg.	 */
+static int
+valid_address_p (enum machine_mode mode ATTRIBUTE_UNUSED,
+		 rtx addr, addr_space_t as)
+{
+#ifdef GO_IF_LEGITIMATE_ADDRESS
+  lra_assert (ADDR_SPACE_GENERIC_P (as));
+  GO_IF_LEGITIMATE_ADDRESS (mode, addr, win);
+  return 0;
+  
+ win:
+  return 1;
+#else
+  return targetm.addr_space.legitimate_address_p (mode, addr, 0, as);
+#endif
+}
+
+/* Make a reload of the base register plus displacement from address
+   AD, which is used in address space AS to access memory of mode
+   MODE, into a new pseudo.  Return the new pseudo.  */
+static rtx
+base_plus_disp_to_reg (enum machine_mode mode, addr_space_t as,
+		       struct address *ad)
+{
+  enum reg_class cl;
+  rtx new_reg;
+
+  lra_assert (ad->base_reg_loc != NULL && ad->disp_loc != NULL);
+  cl = base_reg_class (mode, as, ad->base_outer_code, ad->index_code);
+  new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "base + disp");
+  lra_emit_add (new_reg, *ad->base_reg_loc, *ad->disp_loc);
+  return new_reg;
+}
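+
+/* For illustration (register numbers are hypothetical): given the
+   address (plus (reg/f:SI fp) (const_int 8)), the function above
+   emits roughly "r100 <- fp + 8" and returns r100, so the caller can
+   rewrite the address simply as (reg:SI 100).	 */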
+
+/* Make equivalence substitutions in address AD, which is in address
+   space AS and located at ADDR_LOC.  Update AD and ADDR_LOC if
+   necessary.  Return true if a substitution was made.	 */
+static bool
+equiv_address_substitution (struct address *ad, rtx *addr_loc,
+			    enum machine_mode mode, addr_space_t as,
+			    enum rtx_code code)
+{
+  rtx base_reg, new_base_reg, index_reg, new_index_reg;
+  HOST_WIDE_INT disp, scale;
+  bool change_p;
+
+  if (ad->base_reg_loc == NULL)
+    base_reg = new_base_reg = NULL_RTX;
+  else
+    {
+      base_reg = *ad->base_reg_loc;
+      new_base_reg = get_equiv_substitution (base_reg);
+    }
+  if (ad->index_reg_loc == NULL)
+    index_reg = new_index_reg = NULL_RTX;
+  else
+    {
+      index_reg = *ad->index_reg_loc;
+      new_index_reg = get_equiv_substitution (index_reg);
+    }
+  if (base_reg == new_base_reg && index_reg == new_index_reg)
+    return false;
+  disp = 0;
+  change_p = false;
+  if (lra_dump_file != NULL)
+    {
+      fprintf (lra_dump_file, "Changing address in insn %d ",
+	       INSN_UID (curr_insn));
+      print_value_slim (lra_dump_file, *addr_loc, 1);
+    }
+  if (base_reg != new_base_reg)
+    {
+      if (REG_P (new_base_reg))
+	{
+	  *ad->base_reg_loc = new_base_reg;
+	  change_p = true;
+	}
+      else if (GET_CODE (new_base_reg) == PLUS
+	       && REG_P (XEXP (new_base_reg, 0))
+	       && CONST_INT_P (XEXP (new_base_reg, 1)))
+	{
+	  disp += INTVAL (XEXP (new_base_reg, 1));
+	  *ad->base_reg_loc = XEXP (new_base_reg, 0);
+	  change_p = true;
+	}
+      if (ad->base_reg_loc2 != NULL)
+	*ad->base_reg_loc2 = *ad->base_reg_loc;
+    }
+  scale = 1;
+  if (ad->index_loc != NULL && GET_CODE (*ad->index_loc) == MULT)
+    {
+      lra_assert (CONST_INT_P (XEXP (*ad->index_loc, 1)));
+      scale = INTVAL (XEXP (*ad->index_loc, 1));
+    }
+  if (index_reg != new_index_reg)
+    {
+      if (REG_P (new_index_reg))
+	{
+	  *ad->index_reg_loc = new_index_reg;
+	  change_p = true;
+	}
+      else if (GET_CODE (new_index_reg) == PLUS
+	       && REG_P (XEXP (new_index_reg, 0))
+	       && CONST_INT_P (XEXP (new_index_reg, 1)))
+	{
+	  disp += INTVAL (XEXP (new_index_reg, 1)) * scale;
+	  *ad->index_reg_loc = XEXP (new_index_reg, 0);
+	  change_p = true;
+	}
+    }
+  if (disp != 0)
+    {
+      if (ad->disp_loc != NULL)
+	*ad->disp_loc = plus_constant (Pmode, *ad->disp_loc, disp);
+      else
+	{
+	  *addr_loc = gen_rtx_PLUS (Pmode, *addr_loc, GEN_INT (disp));
+	  extract_address_regs (mode, as, addr_loc, code, ad);
+	}
+      change_p = true;
+    }
+  if (lra_dump_file != NULL)
+    {
+      if (! change_p)
+	fprintf (lra_dump_file, " -- no change\n");
+      else
+	{
+	  fprintf (lra_dump_file, " on equiv ");
+	  print_value_slim (lra_dump_file, *addr_loc, 1);
+	  fprintf (lra_dump_file, "\n");
+	}
+    }
+  return change_p;
+}
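+
+/* For illustration (a hypothetical case): if base register r100 has
+   the equivalence (plus (reg:SI 200) (const_int 4)), the address
+   (plus (reg:SI 100) (const_int 8)) is rewritten above as
+   (plus (reg:SI 200) (const_int 12)) -- the constant part of the
+   equivalence is folded into the displacement (scaled, for an index
+   register).	*/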
+
+/* Exchange operands of plus X.	 */
+static void
+exchange_plus_ops (rtx x)
+{
+  rtx op0;
+
+  lra_assert (GET_CODE (x) == PLUS);
+  op0 = XEXP (x, 0);
+  XEXP (x, 0) = XEXP (x, 1);
+  XEXP (x, 1) = op0;
+}
+
+/* Major function to make reloads for an address in operand NOP.  Add
+   the reloads to the lists *BEFORE and *AFTER.	 We might need to add
+   reloads to *AFTER because of inc/dec or {pre,post}_modify in the
+   address.  Return true for any RTL change.  */
+static bool
+process_address (int nop, rtx *before, rtx *after)
+{
+  struct address ad;
+  enum machine_mode mode;
+  rtx new_reg, *addr_loc, saved_index_reg, saved_base_reg, saved_base_reg2;
+  bool ok_p;
+  addr_space_t as;
+  rtx op = *curr_id->operand_loc[nop];
+  const char *constraint = curr_static_id->operand[nop].constraint;
+  bool change_p;
+  enum rtx_code code;
+
+  if (constraint[0] == 'p'
+      || EXTRA_ADDRESS_CONSTRAINT (constraint[0], constraint))
+    {
+      mode = VOIDmode;
+      addr_loc = curr_id->operand_loc[nop];
+      as = ADDR_SPACE_GENERIC;
+      code = ADDRESS;
+    }
+  else if (MEM_P (op))
+    {
+      mode = GET_MODE (op);
+      addr_loc = &XEXP (op, 0);
+      as = MEM_ADDR_SPACE (op);
+      code = MEM;
+    }
+  else if (GET_CODE (op) == SUBREG
+	   && MEM_P (SUBREG_REG (op)))
+    {
+      mode = GET_MODE (SUBREG_REG (op));
+      addr_loc = &XEXP (SUBREG_REG (op), 0);
+      as = MEM_ADDR_SPACE (SUBREG_REG (op));
+      code = MEM;
+    }
+  else
+    return false;
+  if (GET_CODE (*addr_loc) == AND)
+    addr_loc = &XEXP (*addr_loc, 0);
+  extract_address_regs (mode, as, addr_loc, code, &ad);
+  saved_base_reg = saved_base_reg2 = saved_index_reg = NULL_RTX;
+  change_p = equiv_address_substitution (&ad, addr_loc, mode, as, code);
+  if (ad.base_reg_loc != NULL)
+    {
+      if (process_addr_reg
+	  (ad.base_reg_loc, before,
+	   (ad.base_modify_p && REG_P (*ad.base_reg_loc)
+	    && find_regno_note (curr_insn, REG_DEAD,
+				REGNO (*ad.base_reg_loc)) == NULL
+	    ? after : NULL),
+	   base_reg_class (mode, as, ad.base_outer_code, ad.index_code)))
+	change_p = true;
+      if (ad.base_reg_loc2 != NULL)
+	*ad.base_reg_loc2 = *ad.base_reg_loc;
+      saved_base_reg = *ad.base_reg_loc;
+      lra_eliminate_reg_if_possible (ad.base_reg_loc);
+      if (ad.base_reg_loc2 != NULL)
+	{
+	  saved_base_reg2 = *ad.base_reg_loc2;
+	  lra_eliminate_reg_if_possible (ad.base_reg_loc2);
+	}
+    }
+  if (ad.index_reg_loc != NULL)
+    {
+      if (process_addr_reg (ad.index_reg_loc, before, NULL, INDEX_REG_CLASS))
+	change_p = true;
+      saved_index_reg = *ad.index_reg_loc;
+      lra_eliminate_reg_if_possible (ad.index_reg_loc);
+    }
+  /* Some ports do not check displacements for virtual registers --
+     so we temporarily substitute real registers for them.  */
+  ok_p = valid_address_p (mode, *addr_loc, as);
+  if (saved_base_reg != NULL_RTX)
+    {
+      *ad.base_reg_loc = saved_base_reg;
+      if (saved_base_reg2 != NULL_RTX)
+	*ad.base_reg_loc2 = saved_base_reg2;
+    }
+  if (saved_index_reg != NULL_RTX)
+    *ad.index_reg_loc = saved_index_reg;
+  if (ok_p
+      /* The following address forms are checked by constraints, and
+	 target specific legitimate address hooks usually do not
+	 consider them valid.  */
+      || GET_CODE (*addr_loc) == POST_DEC || GET_CODE (*addr_loc) == POST_INC
+      || GET_CODE (*addr_loc) == PRE_DEC || GET_CODE (*addr_loc) == PRE_INC
+      || GET_CODE (*addr_loc) == PRE_MODIFY
+      || GET_CODE (*addr_loc) == POST_MODIFY
+      /* In this case we cannot do anything because if the address is
+	 wrong, that is because of a wrong displacement.  Remember
+	 that any address was legitimate in the non-strict sense
+	 before LRA.  */
+      || ad.disp_loc == NULL)
+    return change_p;
+
+  /* Addresses were legitimate before LRA.  So if the address has two
+     registers, then it is permitted to have two of them.  We should
+     also not worry about the scale for the same reason.  */
+  push_to_sequence (*before);
+  if (ad.base_reg_loc == NULL)
+    {
+      if (ad.index_reg_loc == NULL)
+	{
+	  int code = -1;
+	  enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
+	  
+	  new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
+#ifdef HAVE_lo_sum
+	  {
+	    rtx insn;
+	    rtx last = get_last_insn ();
+
+	    /* disp => lo_sum (new_base, disp)	*/
+	    insn = emit_insn (gen_rtx_SET
+			      (VOIDmode, new_reg,
+			       gen_rtx_HIGH (Pmode, copy_rtx (*ad.disp_loc))));
+	    code = recog_memoized (insn);
+	    if (code >= 0)
+	      {
+		rtx save = *ad.disp_loc;
+
+		*ad.disp_loc = gen_rtx_LO_SUM (Pmode, new_reg, *ad.disp_loc);
+		if (! valid_address_p (mode, *ad.disp_loc, as))
+		  {
+		    *ad.disp_loc = save;
+		    code = -1;
+		  }
+	      }
+	    if (code < 0)
+	      delete_insns_since (last);
+	  }
+#endif
+	  if (code < 0)
+	    {
+	      /* disp => new_base  */
+	      lra_emit_move (new_reg, *ad.disp_loc);
+	      *ad.disp_loc = new_reg;
+	    }
+	}
+      else
+	{
+	  /* index * scale + disp => new base + index * scale  */
+	  enum reg_class cl = base_reg_class (mode, as, SCRATCH, SCRATCH);
+
+	  lra_assert (INDEX_REG_CLASS != NO_REGS);
+	  new_reg = lra_create_new_reg (Pmode, NULL_RTX, cl, "disp");
+	  lra_assert (GET_CODE (*addr_loc) == PLUS);
+	  lra_emit_move (new_reg, *ad.disp_loc);
+	  if (CONSTANT_P (XEXP (*addr_loc, 1)))
+	    XEXP (*addr_loc, 1) = XEXP (*addr_loc, 0);
+	  XEXP (*addr_loc, 0) = new_reg;
+	  /* Some targets, like ARM, accept address operands in a
+	     specific order -- try exchanging them if necessary.  */
+	  if (! valid_address_p (mode, *addr_loc, as))
+	    {
+	      exchange_plus_ops (*addr_loc);
+	      if (! valid_address_p (mode, *addr_loc, as))
+		exchange_plus_ops (*addr_loc);
+	    }
+	}
+    }
+  else if (ad.index_reg_loc == NULL)
+    {
+      /* We don't use transformation 'base + disp => base + new index'
+	 because of bad practice used in some machine descriptions
+	 (see comments for emit_spill_move).  */
+      /* base + disp => new base  */
+      new_reg = base_plus_disp_to_reg (mode, as, &ad);
+      *addr_loc = new_reg;
+    }
+  else
+    {
+      /* base + scale * index + disp => new base + scale * index  */
+      new_reg = base_plus_disp_to_reg (mode, as, &ad);
+      *addr_loc = gen_rtx_PLUS (Pmode, new_reg, *ad.index_loc);
+      if (! valid_address_p (mode, *addr_loc, as))
+	{
+	  /* Some targets, like ARM, accept address operands in a
+	     specific order -- try exchanging them if necessary.  */
+	  exchange_plus_ops (*addr_loc);
+	  if (! valid_address_p (mode, *addr_loc, as))
+	    exchange_plus_ops (*addr_loc);
+	}
+    }
+  *before = get_insns ();
+  end_sequence ();
+  return true;
+}
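+
+/* For illustration (a hypothetical case): for a displacement that is
+   out of range on the target, process_address turns
+
+     (mem:SI (plus:SI (reg/f:SI fp) (const_int 100000)))
+
+   into "r200 <- fp + 100000" emitted in *BEFORE plus the operand
+   (mem:SI (reg:SI 200)) -- the "base + disp => new base" case
+   above.  */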
+
+/* Emit insns to reload VALUE into a new register.  VALUE is an
+   auto-increment or auto-decrement RTX whose operand is a register or
+   memory location; so reloading involves incrementing that location.
+   IN is either identical to VALUE, or some cheaper place to reload
+   the incremented/decremented value from.
+
+   INC_AMOUNT is the number to increment or decrement by (always
+   positive and ignored for POST_MODIFY/PRE_MODIFY).
+
+   Return pseudo containing the result.	 */
+static rtx
+emit_inc (enum reg_class new_rclass, rtx in, rtx value, int inc_amount)
+{
+  /* REG or MEM to be copied and incremented.  */
+  rtx incloc = XEXP (value, 0);
+  /* Nonzero if increment after copying.  */
+  int post = (GET_CODE (value) == POST_DEC || GET_CODE (value) == POST_INC
+	      || GET_CODE (value) == POST_MODIFY);
+  rtx last;
+  rtx inc;
+  rtx add_insn;
+  int code;
+  rtx real_in = in == value ? incloc : in;
+  rtx result;
+  bool plus_p = true;
+
+  if (GET_CODE (value) == PRE_MODIFY || GET_CODE (value) == POST_MODIFY)
+    {
+      lra_assert (GET_CODE (XEXP (value, 1)) == PLUS
+		  || GET_CODE (XEXP (value, 1)) == MINUS);
+      lra_assert (rtx_equal_p (XEXP (XEXP (value, 1), 0), XEXP (value, 0)));
+      plus_p = GET_CODE (XEXP (value, 1)) == PLUS;
+      inc = XEXP (XEXP (value, 1), 1);
+    }
+  else
+    {
+      if (GET_CODE (value) == PRE_DEC || GET_CODE (value) == POST_DEC)
+	inc_amount = -inc_amount;
+
+      inc = GEN_INT (inc_amount);
+    }
+
+  if (! post && REG_P (incloc))
+    result = incloc;
+  else
+    result = lra_create_new_reg (GET_MODE (value), value, new_rclass,
+				 "INC/DEC result");
+
+  /* If this is post-increment, first copy the location to the reload reg.  */
+  if (post && real_in != result)
+    emit_insn (gen_move_insn (result, real_in));
+
+  /* We suppose that there are insns to add/sub with the constant
+     increment permitted in {PRE,POST}_{DEC,INC,MODIFY}.  At least the
+     old reload worked with this assumption.  If the assumption
+     becomes wrong, we should use the approach of function
+     base_plus_disp_to_reg.  */
+  if (in == value)
+    {
+      /* See if we can directly increment INCLOC.  */
+      last = get_last_insn ();
+      add_insn = emit_insn (plus_p
+			    ? gen_add2_insn (incloc, inc)
+			    : gen_sub2_insn (incloc, inc));
+
+      code = recog_memoized (add_insn);
+      /* We should restore recog_data for the current insn.  */
+      if (code >= 0)
+	{
+	  if (! post && result != incloc)
+	    emit_insn (gen_move_insn (result, incloc));
+	  return result;
+	}
+      delete_insns_since (last);
+    }
+
+  /* If we couldn't do the increment directly, we must increment in
+     RESULT.  The way we do this depends on whether this is pre- or
+     post-increment.  For pre-increment, copy INCLOC to the reload
+     register, increment it there, then save back.  */
+  if (! post)
+    {
+      if (real_in != result)
+	emit_insn (gen_move_insn (result, real_in));
+      if (plus_p)
+	emit_insn (gen_add2_insn (result, inc));
+      else
+	emit_insn (gen_sub2_insn (result, inc));
+      if (result != incloc)
+	emit_insn (gen_move_insn (incloc, result));
+    }
+  else
+    {
+      /* Post-increment.
+
+	 Because this might be a jump insn or a compare, and because
+	 RESULT may not be available after the insn in an input
+	 reload, we must do the incrementing before the insn being
+	 reloaded for.
+
+	 We have already copied IN to RESULT.  Increment the copy in
+	 RESULT, save that back, then decrement RESULT so it has
+	 the original value.  */
+      if (plus_p)
+	emit_insn (gen_add2_insn (result, inc));
+      else
+	emit_insn (gen_sub2_insn (result, inc));
+      emit_insn (gen_move_insn (incloc, result));
+      /* Restore the non-modified value for the result.  We prefer
+	 this way because it does not require an additional hard
+	 register.  */
+      if (plus_p)
+	{
+	  if (CONST_INT_P (inc))
+	    emit_insn (gen_add2_insn (result, GEN_INT (-INTVAL (inc))));
+	  else
+	    emit_insn (gen_sub2_insn (result, inc));
+	}
+      else
+	emit_insn (gen_add2_insn (result, inc));
+    }
+  return result;
+}
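+
+/* For illustration (a sketch with hypothetical register numbers):
+   when the target cannot add directly into the location, reloading
+   (post_inc:SI (reg:SI 100)) with INC_AMOUNT 4 emits roughly
+
+     r200 <- r100	  (copy the original value)
+     r200 <- r200 + 4
+     r100 <- r200	  (save the incremented value back)
+     r200 <- r200 - 4	  (restore the pre-increment value)
+
+   and r200 is returned as the reload register.	 */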
+
+/* Main entry point of this file: search the body of the current insn
+   to choose the best alternative.  It mimics the insn alternative
+   cost calculation model of the former reload pass, because machine
+   descriptions were written to use that model.	 The model can be
+   changed in the future.  Exchange commutative operands if the
+   chosen alternative requires it.
+
+   Return true if some RTL changes happened during the function
+   call.  */
+static bool
+curr_insn_transform (void)
+{
+  int i, j, k;
+  int n_operands;
+  int n_alternatives;
+  int commutative;
+  signed char goal_alt_matched[MAX_RECOG_OPERANDS][MAX_RECOG_OPERANDS];
+  rtx before, after;
+  bool alt_p = false;
+  /* Flag that the insn has been changed through a transformation.  */
+  bool change_p;
+  bool sec_mem_p;
+#ifdef SECONDARY_MEMORY_NEEDED
+  bool use_sec_mem_p;
+#endif
+  int max_regno_before;
+  int reused_alternative_num;
+
+  no_input_reloads_p = no_output_reloads_p = false;
+  goal_alt_number = -1;
+
+  if (check_and_process_move (&change_p, &sec_mem_p))
+    return change_p;
+
+  /* JUMP_INSNs and CALL_INSNs are not allowed to have any output
+     reloads; neither are insns that SET cc0.  Insns that use CC0 are
+     not allowed to have any input reloads.  */
+  if (JUMP_P (curr_insn) || CALL_P (curr_insn))
+    no_output_reloads_p = true;
+
+#ifdef HAVE_cc0
+  if (reg_referenced_p (cc0_rtx, PATTERN (curr_insn)))
+    no_input_reloads_p = true;
+  if (reg_set_p (cc0_rtx, PATTERN (curr_insn)))
+    no_output_reloads_p = true;
+#endif
+
+  n_operands = curr_static_id->n_operands;
+  n_alternatives = curr_static_id->n_alternatives;
+
+  /* Just return "no reloads" if insn has no operands with
+     constraints.  */
+  if (n_operands == 0 || n_alternatives == 0)
+    return false;
+
+  max_regno_before = max_reg_num ();
+
+  for (i = 0; i < n_operands; i++)
+    {
+      goal_alt_matched[i][0] = -1;
+      goal_alt_matches[i] = -1;
+    }
+
+  commutative = curr_static_id->commutative;
+
+  /* Now see what we need for pseudos that didn't get hard regs or got
+     the wrong kind of hard reg.  For this, we must consider all the
+     operands together against the register constraints.  */
+
+  best_losers = best_overall = MAX_RECOG_OPERANDS * 2 + MAX_OVERALL_COST_BOUND;
+  best_small_class_operands_num = best_reload_sum = 0;
+
+  curr_swapped = false;
+  goal_alt_swapped = false;
+
+  /* Make equivalence substitutions and memory subreg eliminations
+     before address processing, because address legitimacy can depend
+     on the memory mode.  */
+  for (i = 0; i < n_operands; i++)
+    {
+      rtx op = *curr_id->operand_loc[i];
+      rtx subst, old = op;
+      bool op_change_p = false;
+
+      if (GET_CODE (old) == SUBREG)
+	old = SUBREG_REG (old);
+      subst = get_equiv_substitution (old);
+      if (subst != old)
+	{
+	  subst = copy_rtx (subst);
+	  lra_assert (REG_P (old));
+	  if (GET_CODE (op) == SUBREG)
+	    SUBREG_REG (op) = subst;
+	  else
+	    *curr_id->operand_loc[i] = subst;
+	  if (lra_dump_file != NULL)
+	    {
+	      fprintf (lra_dump_file,
+		       "Changing pseudo %d in operand %i of insn %u on equiv ",
+		       REGNO (old), i, INSN_UID (curr_insn));
+	      print_value_slim (lra_dump_file, subst, 1);
+	      fprintf (lra_dump_file, "\n");
+	    }
+	  op_change_p = change_p = true;
+	}
+      if (simplify_operand_subreg (i, GET_MODE (old)) || op_change_p)
+	{
+	  change_p = true;
+	  lra_update_dup (curr_id, i);
+	}
+    }
+
+  /* Reload address registers and displacements.  We do it before
+     finding an alternative because of memory constraints.  */
+  before = after = NULL_RTX;
+  for (i = 0; i < n_operands; i++)
+    if (! curr_static_id->operand[i].is_operator
+	&& process_address (i, &before, &after))
+      {
+	change_p = true;
+	lra_update_dup (curr_id, i);
+      }
+  
+  if (change_p)
+    /* Changes in the insn might mean that we can no longer satisfy
+       the constraints of the insn's previously used alternative.  */
+    lra_set_used_insn_alternative (curr_insn, -1);
+
+ try_swapped:
+
+  reused_alternative_num = curr_id->used_insn_alternative;
+  if (lra_dump_file != NULL && reused_alternative_num >= 0)
+    fprintf (lra_dump_file, "Reusing alternative %d for insn #%u\n",
+	     reused_alternative_num, INSN_UID (curr_insn));
+
+  if (process_alt_operands (reused_alternative_num))
+    alt_p = true;
+
+  /* If insn is commutative (it's safe to exchange a certain pair of
+     operands) then we need to try each alternative twice, the second
+     time matching those two operands as if we had exchanged them.  To
+     do this, really exchange them in operands.
+
+     If we have just tried the alternatives the second time, return
+     operands to normal and drop through.  */
+
+  if (reused_alternative_num < 0 && commutative >= 0)
+    {
+      rtx x;
+
+      curr_swapped = !curr_swapped;
+      if (curr_swapped)
+	{
+	  x = *curr_id->operand_loc[commutative];
+	  *curr_id->operand_loc[commutative]
+	    = *curr_id->operand_loc[commutative + 1];
+	  *curr_id->operand_loc[commutative + 1] = x;
+	  /* Swap the duplicates too.  */
+	  lra_update_dup (curr_id, commutative);
+	  lra_update_dup (curr_id, commutative + 1);
+	  goto try_swapped;
+	}
+      else
+	{
+	  x = *curr_id->operand_loc[commutative];
+	  *curr_id->operand_loc[commutative]
+	    = *curr_id->operand_loc[commutative + 1];
+	  *curr_id->operand_loc[commutative + 1] = x;
+	  lra_update_dup (curr_id, commutative);
+	  lra_update_dup (curr_id, commutative + 1);
+	}
+    }
+
+  /* The operands don't meet the constraints.  goal_alt describes the
+     alternative that we could reach by reloading the fewest operands.
+     Reload so as to fit it.  */
+
+  if (! alt_p && ! sec_mem_p)
+    {
+      /* No alternative works with reloads??  */
+      if (INSN_CODE (curr_insn) >= 0)
+	fatal_insn ("unable to generate reloads for:", curr_insn);
+      error_for_asm (curr_insn,
+		     "inconsistent operand constraints in an %<asm%>");
+      /* Avoid further trouble with this insn.	*/
+      PATTERN (curr_insn) = gen_rtx_USE (VOIDmode, const0_rtx);
+      lra_invalidate_insn_data (curr_insn);
+      return true;
+    }
+
+  /* If the best alternative is with operands 1 and 2 swapped, swap
+     them.  Update the operand numbers of any reloads already
+     pushed.  */
+
+  if (goal_alt_swapped)
+    {
+      rtx tem;
+
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "  Commutative operand exchange in insn %u\n",
+		 INSN_UID (curr_insn));
+
+      tem = *curr_id->operand_loc[commutative];
+      *curr_id->operand_loc[commutative]
+	= *curr_id->operand_loc[commutative + 1];
+      *curr_id->operand_loc[commutative + 1] = tem;
+
+      /* Swap the duplicates too.  */
+      lra_update_dup (curr_id, commutative);
+      lra_update_dup (curr_id, commutative + 1);
+      change_p = true;
+    }
+
+#ifdef SECONDARY_MEMORY_NEEDED
+  /* Some targets' SECONDARY_MEMORY_NEEDED macros (e.g. x86's) are
+     defined too conservatively.  So we use the secondary memory only
+     if there is no alternative without reloads.  */
+  use_sec_mem_p = false;
+  if (! alt_p)
+    use_sec_mem_p = true;
+  else if (sec_mem_p)
+    {
+      for (i = 0; i < n_operands; i++)
+	if (! goal_alt_win[i] && ! goal_alt_match_win[i])
+	  break;
+      use_sec_mem_p = i < n_operands;
+    }
+
+  if (use_sec_mem_p)
+    {
+      rtx new_reg, set, src, dest;
+      enum machine_mode sec_mode;
+
+      lra_assert (sec_mem_p);
+      set = single_set (curr_insn);
+      lra_assert (set != NULL_RTX && ! side_effects_p (set));
+      dest = SET_DEST (set);
+      src = SET_SRC (set);
+#ifdef SECONDARY_MEMORY_NEEDED_MODE
+      sec_mode = SECONDARY_MEMORY_NEEDED_MODE (GET_MODE (src));
+#else
+      sec_mode = GET_MODE (src);
+#endif
+      new_reg = lra_create_new_reg (sec_mode, NULL_RTX,
+				    NO_REGS, "secondary");
+      /* If the mode is changed, it should be wider.  */
+      lra_assert (GET_MODE_SIZE (GET_MODE (new_reg))
+		  >= GET_MODE_SIZE (GET_MODE (src)));
+      after = emit_spill_move (false, new_reg, dest, INSN_CODE (curr_insn));
+      lra_process_new_insns (curr_insn, NULL_RTX, after,
+			     "Inserting the sec. move");
+      before = emit_spill_move (true, new_reg, src, INSN_CODE (curr_insn));
+      lra_process_new_insns (curr_insn, before, NULL_RTX, "Changing on");
+      lra_set_insn_deleted (curr_insn);
+      return true;
+    }
+#endif
+
+  lra_assert (goal_alt_number >= 0);
+  lra_set_used_insn_alternative (curr_insn, goal_alt_number);
+
+  if (lra_dump_file != NULL)
+    {
+      const char *p;
+
+      fprintf (lra_dump_file, "	 Choosing alt %d in insn %u:",
+	       goal_alt_number, INSN_UID (curr_insn));
+      for (i = 0; i < n_operands; i++)
+	{
+	  p = (curr_static_id->operand_alternative
+	       [goal_alt_number * n_operands + i].constraint);
+	  if (*p == '\0')
+	    continue;
+	  fprintf (lra_dump_file, "  (%d) ", i);
+	  for (; *p != '\0' && *p != ',' && *p != '#'; p++)
+	    fputc (*p, lra_dump_file);
+	}
+      fprintf (lra_dump_file, "\n");
+    }
+
+  /* Right now, for any pair of operands I and J that are required to
+     match, with J < I, goal_alt_matches[I] is J.  Add I to
+     goal_alt_matched[J].  */
+  
+  for (i = 0; i < n_operands; i++)
+    if ((j = goal_alt_matches[i]) >= 0)
+      {
+	for (k = 0; goal_alt_matched[j][k] >= 0; k++)
+	  ;
+	/* We allow matching one output operand and several input
+	   operands.  */
+	lra_assert (k == 0
+		    || (curr_static_id->operand[j].type == OP_OUT
+			&& curr_static_id->operand[i].type == OP_IN
+			&& (curr_static_id->operand
+			    [goal_alt_matched[j][0]].type == OP_IN)));
+	goal_alt_matched[j][k] = i;
+	goal_alt_matched[j][k + 1] = -1;
+      }
+  
+  for (i = 0; i < n_operands; i++)
+    goal_alt_win[i] |= goal_alt_match_win[i];
+  
+  /* Any constants that aren't allowed and can't be reloaded into
+     registers are here changed into memory references.	 */
+  for (i = 0; i < n_operands; i++)
+    if (goal_alt_win[i])
+      {
+	int regno;
+	enum reg_class new_class;
+	rtx reg = *curr_id->operand_loc[i];
+
+	if (GET_CODE (reg) == SUBREG)
+	  reg = SUBREG_REG (reg);
+	    
+	if (REG_P (reg) && (regno = REGNO (reg)) >= FIRST_PSEUDO_REGISTER)
+	  {
+	    bool ok_p = in_class_p (regno, GET_MODE (reg),
+				    goal_alt[i], &new_class);
+
+	    if (new_class != NO_REGS && get_reg_class (regno) != new_class)
+	      {
+		lra_assert (ok_p);
+		change_class (regno, new_class, "      Change", true);
+	      }
+	  }
+      }
+    else
+      {
+	const char *constraint;
+	char c;
+	rtx op = *curr_id->operand_loc[i];
+	rtx subreg = NULL_RTX;
+	rtx plus = NULL_RTX;
+	enum machine_mode mode = get_op_mode (i);
+	
+	if (GET_CODE (op) == SUBREG)
+	  {
+	    subreg = op;
+	    op = SUBREG_REG (op);
+	    mode = GET_MODE (op);
+	  }
+	
+	if (GET_CODE (op) == PLUS)
+	  {
+	    plus = op;
+	    op = XEXP (op, 1);
+	  }
+	
+	if (CONST_POOL_OK_P (mode, op)
+	    && ((targetm.preferred_reload_class
+		 (op, (enum reg_class) goal_alt[i]) == NO_REGS)
+		|| no_input_reloads_p)
+	    && mode != VOIDmode)
+	  {
+	    rtx tem = force_const_mem (mode, op);
+	    
+	    change_p = true;
+	    /* If we stripped a SUBREG or a PLUS above, add it back.  */
+	    if (plus != NULL_RTX)
+	      tem = gen_rtx_PLUS (mode, XEXP (plus, 0), tem);
+	    
+	    if (subreg != NULL_RTX)
+	      tem = gen_rtx_SUBREG (mode, tem, SUBREG_BYTE (subreg));
+	    
+	    *curr_id->operand_loc[i] = tem;
+	    lra_update_dup (curr_id, i);
+	    process_address (i, &before, &after);
+	    
+	    /* If the alternative accepts constant pool refs directly
+	       there will be no reload needed at all.  */
+	    if (plus != NULL_RTX || subreg != NULL_RTX)
+	      continue;
+	    /* Skip alternatives before the one requested.  */
+	    constraint = (curr_static_id->operand_alternative
+			  [goal_alt_number * n_operands + i].constraint);
+	    for (;
+		 (c = *constraint) && c != ',' && c != '#';
+		 constraint += CONSTRAINT_LEN (c, constraint))
+	      {
+		if (c == TARGET_MEM_CONSTRAINT || c == 'o')
+		  break;
+#ifdef EXTRA_CONSTRAINT_STR
+		if (EXTRA_MEMORY_CONSTRAINT (c, constraint)
+		    && EXTRA_CONSTRAINT_STR (tem, c, constraint))
+		  break;
+#endif
+	      }
+	    if (c == '\0' || c == ',' || c == '#')
+	      continue;
+	    
+	    goal_alt_win[i] = true;
+	  }
+      }
+  
+  for (i = 0; i < n_operands; i++)
+    {
+      rtx old, new_reg;
+      rtx op = *curr_id->operand_loc[i];
+
+      if (goal_alt_win[i])
+	{
+	  if (goal_alt[i] == NO_REGS
+	      && REG_P (op)
+	      && lra_former_scratch_operand_p (curr_insn, i))
+	    change_class (REGNO (op), NO_REGS, "      Change", true);
+	  continue;
+	}
+      
+      /* Operands that match previous ones have already been handled.  */
+      if (goal_alt_matches[i] >= 0)
+	continue;
+
+      /* We should not have an operand with a non-offsettable address
+	 appearing where an offsettable address will do.  It may also
+	 be a case where the address should be special, in other words
+	 not a general one (e.g. it needs no index reg).  */
+      if (goal_alt_matched[i][0] == -1 && goal_alt_offmemok[i] && MEM_P (op))
+	{
+	  enum reg_class rclass;
+	  rtx *loc = &XEXP (op, 0);
+	  enum rtx_code code = GET_CODE (*loc);
+
+	  push_to_sequence (before);
+	  rclass = base_reg_class (GET_MODE (op), MEM_ADDR_SPACE (op),
+				   MEM, SCRATCH);
+	  if (code == PRE_DEC || code == POST_DEC
+	      || code == PRE_INC || code == POST_INC
+	      || code == PRE_MODIFY || code == POST_MODIFY)
+	    new_reg = emit_inc (rclass, *loc, *loc,
+				/* This value does not matter for MODIFY.  */
+				GET_MODE_SIZE (GET_MODE (op)));
+	  else if (get_reload_reg (OP_IN, Pmode, *loc, rclass,
+				   "offsetable address", &new_reg))
+	    lra_emit_move (new_reg, *loc);
+	  before = get_insns ();
+	  end_sequence ();
+	  *loc = new_reg;
+	  lra_update_dup (curr_id, i);
+	}
+      else if (goal_alt_matched[i][0] == -1)
+	{
+	  enum machine_mode mode;
+	  rtx reg, *loc;
+	  int hard_regno, byte;
+	  enum op_type type = curr_static_id->operand[i].type;
+
+	  loc = curr_id->operand_loc[i];
+	  mode = get_op_mode (i);
+	  if (GET_CODE (*loc) == SUBREG)
+	    {
+	      reg = SUBREG_REG (*loc);
+	      byte = SUBREG_BYTE (*loc);
+	      if (REG_P (reg)
+		  /* Strict_low_part requires reloading the register,
+		     not the sub-register.  */
+		  && (curr_static_id->operand[i].strict_low
+		      || (GET_MODE_SIZE (mode)
+			  <= GET_MODE_SIZE (GET_MODE (reg))
+			  && (hard_regno
+			      = get_try_hard_regno (REGNO (reg))) >= 0
+			  && (simplify_subreg_regno
+			      (hard_regno,
+			       GET_MODE (reg), byte, mode) < 0)
+			  && (goal_alt[i] == NO_REGS
+			      || (simplify_subreg_regno
+				  (ira_class_hard_regs[goal_alt[i]][0],
+				   GET_MODE (reg), byte, mode) >= 0)))))
+		{
+		  loc = &SUBREG_REG (*loc);
+		  mode = GET_MODE (*loc);
+		}
+	    }
+	  old = *loc;
+	  if (get_reload_reg (type, mode, old, goal_alt[i], "", &new_reg)
+	      && type != OP_OUT)
+	    {
+	      push_to_sequence (before);
+	      lra_emit_move (new_reg, old);
+	      before = get_insns ();
+	      end_sequence ();
+	    }
+	  *loc = new_reg;
+	  if (type != OP_IN)
+	    {
+	      if (find_reg_note (curr_insn, REG_UNUSED, old) == NULL_RTX)
+		{
+		  start_sequence ();
+		  /* We don't want to share subregs because the pseudo
+		     can be assigned memory, and that memory can be
+		     processed several times for eliminations.	*/
+		  lra_emit_move (GET_CODE (old) == SUBREG && type == OP_INOUT
+				 ? copy_rtx (old) : old,
+				 new_reg);
+		  emit_insn (after);
+		  after = get_insns ();
+		  end_sequence ();
+		}
+	      *loc = new_reg;
+	    }
+	  for (j = 0; j < goal_alt_dont_inherit_ops_num; j++)
+	    if (goal_alt_dont_inherit_ops[j] == i)
+	      {
+		lra_set_regno_unique_value (REGNO (new_reg));
+		break;
+	      }
+	  lra_update_dup (curr_id, i);
+	}
+      else if (curr_static_id->operand[i].type == OP_IN
+	       && (curr_static_id->operand[goal_alt_matched[i][0]].type
+		   == OP_OUT))
+	{
+	  signed char arr[2];
+
+	  arr[0] = i;
+	  arr[1] = -1;
+	  match_reload (goal_alt_matched[i][0], arr,
+			goal_alt[i], &before, &after);
+	}
+      else if (curr_static_id->operand[i].type == OP_OUT
+	       && (curr_static_id->operand[goal_alt_matched[i][0]].type
+		   == OP_IN))
+	match_reload (i, goal_alt_matched[i], goal_alt[i], &before, &after);
+      else
+	{
+	  lra_assert (INSN_CODE (curr_insn) < 0);
+	  error_for_asm (curr_insn,
+			 "inconsistent operand constraints in an %<asm%>");
+	  /* Avoid further trouble with this insn.  */
+	  PATTERN (curr_insn) = gen_rtx_USE (VOIDmode, const0_rtx);
+	  return false;
+	}
+    }
+  if (before != NULL_RTX || after != NULL_RTX
+      || max_regno_before != max_reg_num ())
+    change_p = true;
+  if (change_p)
+    {
+      lra_update_operator_dups (curr_id);
+      /* Something changed -- process the insn.	 */
+      lra_update_insn_regno_info (curr_insn);
+    }
+  lra_process_new_insns (curr_insn, before, after, "Inserting insn reload");
+  return change_p;
+}
+
+/* Return true if X is in LIST.	 */
+static bool
+in_list_p (rtx x, rtx list)
+{
+  for (; list != NULL_RTX; list = XEXP (list, 1))
+    if (XEXP (list, 0) == x)
+      return true;
+  return false;
+}
+
+/* Return true if X contains an allocatable hard register (if
+   HARD_REG_P) or a (spilled if SPILLED_P) pseudo.  */
+static bool
+contains_reg_p (rtx x, bool hard_reg_p, bool spilled_p)
+{
+  int i, j;
+  const char *fmt;
+  enum rtx_code code;
+
+  code = GET_CODE (x);
+  if (REG_P (x))
+    {
+      int regno = REGNO (x);
+      HARD_REG_SET alloc_regs;
+
+      if (hard_reg_p)
+	{
+	  if (regno >= FIRST_PSEUDO_REGISTER)
+	    regno = lra_get_regno_hard_regno (regno);
+	  if (regno < 0)
+	    return false;
+	  COMPL_HARD_REG_SET (alloc_regs, lra_no_alloc_regs);
+	  return lra_hard_reg_set_intersection_p (regno, GET_MODE (x),
+						  alloc_regs);
+	}
+      else
+	{
+	  if (regno < FIRST_PSEUDO_REGISTER)
+	    return false;
+	  if (! spilled_p)
+	    return true;
+	  return lra_get_regno_hard_regno (regno) < 0;
+	}
+    }
+  fmt = GET_RTX_FORMAT (code);
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      if (fmt[i] == 'e')
+	{
+	  if (contains_reg_p (XEXP (x, i), hard_reg_p, spilled_p))
+	    return true;
+	}
+      else if (fmt[i] == 'E')
+	{
+	  for (j = XVECLEN (x, i) - 1; j >= 0; j--)
+	    if (contains_reg_p (XVECEXP (x, i, j), hard_reg_p, spilled_p))
+	      return true;
+	}
+    }
+  return false;
+}
+
+/* Process all regs in debug location *LOC and change them to their
+   equivalent substitutions.  Return true if any change was made.  */
+static bool
+debug_loc_equivalence_change_p (rtx *loc)
+{
+  rtx subst, reg, x = *loc;
+  bool result = false;
+  enum rtx_code code = GET_CODE (x);
+  const char *fmt;
+  int i, j;
+
+  if (code == SUBREG)
+    {
+      reg = SUBREG_REG (x);
+      if ((subst = get_equiv_substitution (reg)) != reg
+	  && GET_MODE (subst) == VOIDmode)
+	{
+	  /* We cannot reload a debug location.	 Simplify the subreg
+	     here while we know the inner mode.	 */
+	  *loc = simplify_gen_subreg (GET_MODE (x), subst,
+				      GET_MODE (reg), SUBREG_BYTE (x));
+	  return true;
+	}
+    }
+  if (code == REG && (subst = get_equiv_substitution (x)) != x)
+    {
+      *loc = subst;
+      return true;
+    }
+
+  /* Scan all the operand sub-expressions.  */
+  fmt = GET_RTX_FORMAT (code);
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      if (fmt[i] == 'e')
+	result = debug_loc_equivalence_change_p (&XEXP (x, i)) || result;
+      else if (fmt[i] == 'E')
+	for (j = XVECLEN (x, i) - 1; j >= 0; j--)
+	  result
+	    = debug_loc_equivalence_change_p (&XVECEXP (x, i, j)) || result;
+    }
+  return result;
+}
+
+/* Maximum allowed number of constraint pass iterations after the last
+   spill pass.	It is to prevent LRA from cycling in the presence of a
+   bug.	 */
+#define MAX_CONSTRAINT_ITERATION_NUMBER 15
+
+/* Maximum number of generated reload insns per insn.  It is to
+   prevent this pass from cycling in the presence of a bug.  */
+#define MAX_RELOAD_INSNS_NUMBER LRA_MAX_INSN_RELOADS
+
+/* The current iteration number of this LRA pass.  */
+int lra_constraint_iter;
+
+/* The current iteration number of this LRA pass after the last spill
+   pass.  */
+int lra_constraint_iter_after_spill;
+
+/* True if we made a substitution which requires checking the
+   correctness of the register allocation, because the equivalent
+   value contains allocatable hard registers, or because we restored a
+   multi-register pseudo.  */
+bool lra_risky_transformations_p;
+
+/* Entry function of LRA constraint pass.  Return true if the
+   constraint pass did change the code.	 */
+bool
+lra_constraints (bool first_p)
+{
+  bool changed_p;
+  int i, hard_regno, new_insns_num;
+  unsigned int min_len;
+  rtx set, x, dest_reg;
+  basic_block last_bb;
+
+  lra_constraint_iter++;
+  if (lra_dump_file != NULL)
+    fprintf (lra_dump_file, "\n********** Local #%d: **********\n\n",
+	     lra_constraint_iter);
+  lra_constraint_iter_after_spill++;
+  if (lra_constraint_iter_after_spill > MAX_CONSTRAINT_ITERATION_NUMBER)
+    internal_error
+      ("Maximum number of LRA constraint passes is reached (%d)\n",
+       MAX_CONSTRAINT_ITERATION_NUMBER);
+  changed_p = false;
+  lra_risky_transformations_p = false;
+  new_insn_uid_start = get_max_uid ();
+  new_regno_start = first_p ? lra_constraint_new_regno_start : max_reg_num ();
+  for (i = FIRST_PSEUDO_REGISTER; i < new_regno_start; i++)
+    ira_reg_equiv[i].profitable_p = true;
+  for (i = FIRST_PSEUDO_REGISTER; i < new_regno_start; i++)
+    if (lra_reg_info[i].nrefs != 0)
+      {
+	if ((hard_regno = lra_get_regno_hard_regno (i)) >= 0)
+	  {
+	    int j, nregs = hard_regno_nregs[hard_regno][PSEUDO_REGNO_MODE (i)];
+	    
+	    for (j = 0; j < nregs; j++)
+	      df_set_regs_ever_live (hard_regno + j, true);
+	  }
+	else if ((x = get_equiv_substitution (regno_reg_rtx[i])) != NULL_RTX)
+	  {
+	    if (! first_p && contains_reg_p (x, false, false))
+	      /* After RTL transformations, we cannot guarantee that a
+		 pseudo in the substitution was not reloaded, which
+		 might make the equivalence invalid.  For example, in
+		 the reverse equiv of p0
+
+		 p0 <- ...
+		 ...
+		 equiv_mem <- p0
+
+		 the memory address register was reloaded before the
+		 2nd insn.  */
+	      ira_reg_equiv[i].defined_p = false;
+	    if (contains_reg_p (x, false, true))
+	      ira_reg_equiv[i].profitable_p = false;
+	  }
+      }
+  lra_eliminate (false);
+  min_len = VEC_length (rtx, lra_constraint_insn_stack);
+  new_insns_num = 0;
+  last_bb = NULL;
+  changed_p = false;
+  for (;VEC_length (rtx, lra_constraint_insn_stack) != 0;)
+    {
+      curr_insn = VEC_pop (rtx, lra_constraint_insn_stack);
+      bitmap_clear_bit (&lra_constraint_insn_stack_bitmap,
+			INSN_UID (curr_insn));
+      curr_bb = BLOCK_FOR_INSN (curr_insn); 
+      if (curr_bb != last_bb)
+	{
+	  last_bb = curr_bb;
+	  bb_reload_num = lra_curr_reload_num;
+	}
+      if (min_len > VEC_length (rtx, lra_constraint_insn_stack))
+	{
+	  min_len = VEC_length (rtx, lra_constraint_insn_stack);
+	  new_insns_num = 0;
+	}
+      if (new_insns_num > MAX_RELOAD_INSNS_NUMBER)
+	internal_error
+	  ("Max. number of generated reload insns per insn is reached (%d)\n",
+	   MAX_RELOAD_INSNS_NUMBER);
+      new_insns_num++;
+      if (DEBUG_INSN_P (curr_insn))
+	{
+	  /* We need to check equivalence in debug insn and change
+	     pseudo to the equivalent value if necessary.  */
+	  curr_id = lra_get_insn_recog_data (curr_insn);
+	  if (debug_loc_equivalence_change_p (curr_id->operand_loc[0]))
+	    changed_p = true;
+	}
+      else if (INSN_P (curr_insn))
+	{
+	  if ((set = single_set (curr_insn)) != NULL_RTX)
+	    {
+	      dest_reg = SET_DEST (set);
+	      /* The equivalence pseudo could be set up as a SUBREG
+		 when the insn is a call restore insn operating in a
+		 mode different from the pseudo's mode.	 */
+	      if (GET_CODE (dest_reg) == SUBREG)
+		dest_reg = SUBREG_REG (dest_reg);
+	      if ((REG_P (dest_reg)
+		   && (x = get_equiv_substitution (dest_reg)) != dest_reg
+		   /* Remove insns which set up a pseudo whose value
+		      cannot be changed.  Such insns might not be in
+		      init_insns because we don't update equiv data
+		      during insn transformations.
+
+		      As an example, let us suppose that a pseudo got
+		      a hard register and on the 1st pass was not
+		      changed to an equivalent constant.  We generate
+		      an additional insn setting up the pseudo because
+		      of secondary memory movement.  Then the pseudo
+		      is spilled and we use the equiv constant.	 In
+		      this case we should remove the additional insn
+		      and this insn is not in the init_insns list.  */
+		   && (! MEM_P (x) || MEM_READONLY_P (x)
+		       || in_list_p (curr_insn,
+				     ira_reg_equiv
+				     [REGNO (dest_reg)].init_insns)))
+		  || (((x = get_equiv_substitution (SET_SRC (set)))
+		       != SET_SRC (set))
+		      && in_list_p (curr_insn,
+				    ira_reg_equiv
+				    [REGNO (SET_SRC (set))].init_insns)))
+		{
+		  /* This is an equiv init insn of a pseudo which did
+		     not get a hard register -- remove the insn.  */
+		  if (lra_dump_file != NULL)
+		    {
+		      fprintf (lra_dump_file,
+			       "      Removing equiv init insn %i (freq=%d)\n",
+			       INSN_UID (curr_insn),
+			       BLOCK_FOR_INSN (curr_insn)->frequency);
+		      debug_rtl_slim (lra_dump_file,
+				      curr_insn, curr_insn, -1, 0);
+		    }
+		  if (contains_reg_p (x, true, false))
+		    lra_risky_transformations_p = true;
+		  lra_set_insn_deleted (curr_insn);
+		  continue;
+		}
+	    }
+	  curr_id = lra_get_insn_recog_data (curr_insn);
+	  curr_static_id = curr_id->insn_static_data;
+	  init_curr_insn_input_reloads ();
+	  if (curr_insn_transform ())
+	    changed_p = true;
+	}
+    }
+  /* If we used a new hard regno, changed_p should be true because the
+     hard reg is assigned to a new pseudo.  */
+#ifdef ENABLE_CHECKING
+  if (! changed_p)
+    {
+      for (i = FIRST_PSEUDO_REGISTER; i < new_regno_start; i++)
+	if (lra_reg_info[i].nrefs != 0
+	    && (hard_regno = lra_get_regno_hard_regno (i)) >= 0)
+	  {
+	    int j, nregs = hard_regno_nregs[hard_regno][PSEUDO_REGNO_MODE (i)];
+	    
+	    for (j = 0; j < nregs; j++)
+	      lra_assert (df_regs_ever_live_p (hard_regno + j));
+	  }
+    }
+#endif
+  return changed_p;
+}
+
+/* Initiate the LRA constraint pass.  It is done once per
+   function.  */
+void
+lra_contraints_init (void)
+{
+  init_indirect_mem ();
+  bitmap_initialize (&lra_matched_pseudos, &reg_obstack);
+  bitmap_initialize (&lra_bound_pseudos, &reg_obstack);
+}
+
+/* Finalize the LRA constraint pass.  It is done once per
+   function.  */
+void
+lra_contraints_finish (void)
+{
+  bitmap_clear (&lra_bound_pseudos);
+  bitmap_clear (&lra_matched_pseudos);
+}
+
+
+
+/* This page contains code to do inheritance/split
+   transformations.  */
+
+/* Number of reloads passed so far in current EBB.  */
+static int reloads_num;
+
+/* Number of calls passed so far in current EBB.  */
+static int calls_num;
+
+/* Current check number used to validate the elements of
+   USAGE_INSNS.	 */
+static int curr_usage_insns_check;
+
+/* Info about last usage of registers in EBB to do inheritance/split
+   transformation.  Inheritance transformation is done from a spilled
+   pseudo and split transformations from a hard register or a pseudo
+   assigned to a hard register.	 */
+struct usage_insns
+{
+  /* If the value is equal to CURR_USAGE_INSNS_CHECK, then the member
+     value INSNS is valid.  INSNS is a chain of optional debug insns
+     and a finishing non-debug insn using the corresponding reg.  */
+  int check;
+  /* Value of global reloads_num at the last insn in INSNS.  */
+  int reloads_num;
+  /* Value of global calls_num at the last insn in INSNS.  */
+  int calls_num;
+  /* It can be true only for splitting.	 And it means that the restore
+     insn should be put after insn given by the following member.  */
+  bool after_p;
+  /* Next insns in the current EBB which use the original reg; the
+     original reg value is not changed between the current insn and
+     these next insns.	In other words, e.g. for inheritance, if we
+     need to use the original reg value again in the next insns we can
+     try to use the value in a hard register from a reload insn of the
+     current insn.  */
+  rtx insns;
+};
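+
+/* For illustration (a sketch): scanning the EBB backwards, after
+   passing
+
+     insn2:  ... <- r100
+     debug1: var => r100
+     insn1:  (the insn currently being processed)
+
+   usage_insns[100].insns holds debug1 as an INSN_LIST node followed
+   by insn2 as the finishing non-debug insn, which is how the chain is
+   consumed by the loop at the end of inherit_reload_reg below.  */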
+
+/* Map: regno -> corresponding pseudo usage insns.  */
+static struct usage_insns *usage_insns;
+
+/* Process all regs with number OLD_REGNO in location *LOC and change
+   them to the reload pseudo NEW_REG.  Return true if any change was
+   made.  */
+static bool
+substitute_pseudo (rtx *loc, int old_regno, rtx new_reg)
+{
+  rtx x = *loc;
+  bool result = false;
+  enum rtx_code code;
+  const char *fmt;
+  int i, j;
+
+  if (x == NULL_RTX)
+    return false;
+
+  code = GET_CODE (x);
+  if (code == REG && (int) REGNO (x) == old_regno)
+    {
+      *loc = new_reg;
+      return true;
+    }
+
+  /* Scan all the operand sub-expressions.  */
+  fmt = GET_RTX_FORMAT (code);
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      if (fmt[i] == 'e')
+	{
+	  if (substitute_pseudo (&XEXP (x, i), old_regno, new_reg))
+	    result = true;
+	}
+      else if (fmt[i] == 'E')
+	{
+	  for (j = XVECLEN (x, i) - 1; j >= 0; j--)
+	    if (substitute_pseudo (&XVECEXP (x, i, j), old_regno, new_reg))
+	      result = true;
+	}
+    }
+  return result;
+}
+
+/* Registers involved in inheritance/split in the current EBB
+   (inheritance/split pseudos and original registers).	*/
+static bitmap_head check_only_regs;
+
+/* Do the inheritance transformation for insn INSN, which defines (if
+   DEF_P) or uses ORIGINAL_REGNO and whose subsequent insn(s) in the
+   EBB (remember we traverse insns in the backward direction) using
+   the original regno are given by NEXT_USAGE_INSNS.  The
+   transformations look like
+
+     p <- ...		  i <- ...
+     ...		  p <- i    (new insn)
+     ...	     =>
+     <- ... p ...	  <- ... i ...
+   or
+     ...		  i <- p    (new insn)
+     <- ... p ...	  <- ... i ...
+     ...	     =>
+     <- ... p ...	  <- ... i ...
+   where p is a spilled original pseudo and i is a new inheritance pseudo.
+   
+   The inheritance pseudo has the smaller of the two classes CL and
+   the class of ORIGINAL_REGNO.	 It will have a unique value if
+   UNIQ_P.  The unique value is necessary for a correct assignment to
+   an inheritance pseudo used as the input of an insn which should be
+   the same as the output (bound pseudos).  Return true if we succeed
+   in such a transformation.  */
+static bool
+inherit_reload_reg (bool def_p, bool uniq_p, int original_regno,
+		    enum reg_class cl, rtx insn, rtx next_usage_insns)
+{
+  enum reg_class rclass = lra_get_allocno_class (original_regno);
+  rtx original_reg = regno_reg_rtx[original_regno];
+  rtx new_reg, new_insns, usage_insn;
+
+  lra_assert (! usage_insns[original_regno].after_p);
+  if (lra_dump_file != NULL)
+    fprintf (lra_dump_file,
+	     "	  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<\n");
+  if (! ira_reg_classes_intersect_p[cl][rclass])
+    {
+      if (lra_dump_file != NULL)
+	{
+	  fprintf (lra_dump_file,
+		   "	Rejecting inheritance for %d "
+		   "because of too different classes %s and %s\n",
+		   original_regno, reg_class_names[cl],
+		   reg_class_names[rclass]);
+	  fprintf (lra_dump_file,
+		   "	>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
+	}
+      return false;
+    }
+  if ((ira_class_subset_p[cl][rclass] && cl != rclass)
+      || ira_class_hard_regs_num[cl] < ira_class_hard_regs_num[rclass])
+    {
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "    Use smallest class of %s and %s\n",
+		 reg_class_names[cl], reg_class_names[rclass]);
+      
+      rclass = cl;
+    }
+  if (uniq_p)
+    new_reg = lra_create_new_reg_with_unique_value (GET_MODE (original_reg),
+						    original_reg,
+						    rclass, "inheritance");
+  else
+    new_reg = lra_create_new_reg (GET_MODE (original_reg), original_reg,
+				  rclass, "inheritance");
+  start_sequence ();
+  if (def_p)
+    emit_move_insn (original_reg, new_reg);
+  else
+    emit_move_insn (new_reg, original_reg);
+  new_insns = get_insns ();
+  end_sequence ();
+  if (NEXT_INSN (new_insns) != NULL_RTX)
+    {
+      if (lra_dump_file != NULL)
+	{
+	  fprintf (lra_dump_file,
+		   "	Rejecting inheritance %d->%d "
+		   "as it results in 2 or more insns:\n",
+		   original_regno, REGNO (new_reg));
+	  debug_rtl_slim (lra_dump_file, new_insns, NULL_RTX, -1, 0);
+	  fprintf (lra_dump_file,
+		   "	>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
+	}
+      return false;
+    }
+  substitute_pseudo (&insn, original_regno, new_reg);
+  lra_update_insn_regno_info (insn);
+  if (! def_p)
+    {
+      /* We now have a new usage insn for original regno.  */
+      usage_insns[original_regno].check = curr_usage_insns_check;
+      usage_insns[original_regno].insns = new_insns;
+      usage_insns[original_regno].reloads_num = reloads_num;
+      usage_insns[original_regno].calls_num = calls_num;
+      usage_insns[original_regno].after_p = false;
+    }
+  if (lra_dump_file != NULL)
+    fprintf (lra_dump_file, "	 Original reg change %d->%d:\n",
+	     original_regno, REGNO (new_reg));
+  lra_reg_info[REGNO (new_reg)].restore_regno = original_regno;
+  bitmap_set_bit (&check_only_regs, REGNO (new_reg));
+  bitmap_set_bit (&check_only_regs, original_regno);
+  bitmap_set_bit (&lra_inheritance_pseudos, REGNO (new_reg));
+  if (def_p)
+    lra_process_new_insns (insn, NULL_RTX, new_insns,
+			   "Add original<-inheritance");
+  else
+    lra_process_new_insns (insn, new_insns, NULL_RTX,
+			   "Add inheritance<-pseudo");
+  while (next_usage_insns != NULL_RTX)
+    {
+      if (GET_CODE (next_usage_insns) != INSN_LIST)
+	{
+	  usage_insn = next_usage_insns;
+	  lra_assert (NONDEBUG_INSN_P (usage_insn));
+	  next_usage_insns = NULL;
+	}
+      else
+	{
+	  usage_insn = XEXP (next_usage_insns, 0);
+	  lra_assert (DEBUG_INSN_P (usage_insn));
+	  next_usage_insns = XEXP (next_usage_insns, 1);
+	}
+      substitute_pseudo (&usage_insn, original_regno, new_reg);
+      lra_update_insn_regno_info (usage_insn);
+      if (lra_dump_file != NULL)
+	{
+	  fprintf (lra_dump_file, "    Inheritance reuse change %d->%d:\n",
+		   original_regno, REGNO (new_reg));
+	  debug_rtl_slim (lra_dump_file, usage_insn, usage_insn,
+			  -1, 0);
+	}
+    }
+  if (lra_dump_file != NULL)
+    fprintf (lra_dump_file,
+	     "	  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n");
+  return true;
+}
+
+/* Return true if we need a caller save/restore for pseudo REGNO which
+   was assigned to a hard register.  */
+static inline bool
+need_for_call_save_p (int regno)
+{
+  lra_assert (regno >= FIRST_PSEUDO_REGISTER && reg_renumber[regno] >= 0);
+  return (usage_insns[regno].calls_num < calls_num
+	  && (lra_hard_reg_set_intersection_p
+	      (reg_renumber[regno], PSEUDO_REGNO_MODE (regno),
+	       call_used_reg_set)));
+}
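+
+/* For example (a sketch): this is the case when pseudo 100 was
+   assigned a call-clobbered hard register and at least one call has
+   been passed (calls_num grew) between the current insn and the next
+   use recorded in usage_insns -- so reusing the value requires a
+   save/restore around the call.  */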
+
+/* Global registers occurring in the current EBB.  */
+static bitmap_head ebb_global_regs;
+
+/* Return true if we need a split for hard register REGNO or pseudo
+   REGNO which was assigned to a hard register.
+   POTENTIAL_RELOAD_HARD_REGS contains hard registers which might be
+   used for reloads since the EBB end.	It is an approximation of the
+   used hard registers in the split range.  The exact value would
+   require expensive calculations.  If the approximation makes us
+   split too aggressively, the split pseudo will keep the same hard
+   register assignment and the split will be removed in the undo
+   pass.  We still need the approximation because splitting too
+   aggressively would result in a too inaccurate cost calculation in
+   the assignment pass, because of too many generated moves which will
+   probably be removed in the undo pass.  */
+static inline bool
+need_for_split_p (HARD_REG_SET potential_reload_hard_regs, int regno)
+{
+  int hard_regno = regno < FIRST_PSEUDO_REGISTER ? regno : reg_renumber[regno];
+
+  lra_assert (hard_regno >= 0);
+  return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
+	   && ! TEST_HARD_REG_BIT (lra_no_alloc_regs, hard_regno)
+	   && (usage_insns[regno].reloads_num
+	       + (regno < FIRST_PSEUDO_REGISTER ? 0 : 2) < reloads_num)
+	   && ((regno < FIRST_PSEUDO_REGISTER
+		&& ! bitmap_bit_p (&ebb_global_regs, regno))
+	       || (regno >= FIRST_PSEUDO_REGISTER
+		   && lra_reg_info[regno].nrefs > 3
+		   && bitmap_bit_p (&ebb_global_regs, regno))))
+	  || (regno >= FIRST_PSEUDO_REGISTER && need_for_call_save_p (regno)));
+}
+
+/* Return the class for the split pseudo created from an original
+   pseudo with ALLOCNO_CLASS and MODE which got hard register
+   HARD_REGNO.	We choose a subclass of ALLOCNO_CLASS which contains
+   HARD_REGNO and results in no secondary memory movements.  */
+static enum reg_class
+choose_split_class (enum reg_class allocno_class,
+		    int hard_regno ATTRIBUTE_UNUSED,
+		    enum machine_mode mode ATTRIBUTE_UNUSED)
+{
+#ifndef SECONDARY_MEMORY_NEEDED
+  return allocno_class;
+#else
+  int i;
+  enum reg_class cl, best_cl = NO_REGS;
+  enum reg_class hard_reg_class = REGNO_REG_CLASS (hard_regno);
+  
+  if (! SECONDARY_MEMORY_NEEDED (allocno_class, allocno_class, mode)
+      && TEST_HARD_REG_BIT (reg_class_contents[allocno_class], hard_regno))
+    return allocno_class;
+  for (i = 0;
+       (cl = reg_class_subclasses[allocno_class][i]) != LIM_REG_CLASSES;
+       i++)
+    if (! SECONDARY_MEMORY_NEEDED (cl, hard_reg_class, mode)
+	&& ! SECONDARY_MEMORY_NEEDED (hard_reg_class, cl, mode)
+	&& TEST_HARD_REG_BIT (reg_class_contents[cl], hard_regno)
+	&& (best_cl == NO_REGS
+	    || (hard_reg_set_subset_p (reg_class_contents[best_cl],
+				       reg_class_contents[cl])
+		&& ! hard_reg_set_equal_p (reg_class_contents[best_cl],
+					   reg_class_contents[cl]))))
+      best_cl = cl;
+  return best_cl;
+#endif
+}
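+
+/* A rough illustration (editor's note, not from the original patch):
+   on x86-64, suppose ALLOCNO_CLASS contains both general and SSE
+   registers and HARD_REGNO is an SSE register.  Moves within such a
+   union class may need secondary memory (between its general and SSE
+   parts), so the first test above fails and the loop would settle on
+   the largest subclass containing HARD_REGNO with no secondary
+   memory moves, e.g. SSE_REGS, rather than ALLOCNO_CLASS itself.  */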
+
+/* Do a split transformation for insn INSN which defines or uses
+   ORIGINAL_REGNO.  NEXT_USAGE_INSNS specifies the subsequent insn(s)
+   in the EBB which use the original regno (remember we traverse
+   insns in the backward direction).  The transformations look like
+
+     p <- ...		  p <- ...
+     ...		  s <- p    (new insn -- save)
+     ...	     =>
+     ...		  p <- s    (new insn -- restore)
+     <- ... p ...	  <- ... p ...
+   or
+     <- ... p ...	  <- ... p ...
+     ...		  s <- p    (new insn -- save)
+     ...	     =>
+     ...		  p <- s    (new insn -- restore)
+     <- ... p ...	  <- ... p ...
+
+   where p is an original pseudo that got a hard register (or is a
+   hard register itself) and s is a new split pseudo.	The save is
+   put before INSN if BEFORE_P is true.  Return true if we succeed
+   in such a transformation.  */
+static bool
+split_reg (bool before_p, int original_regno, rtx insn, rtx next_usage_insns)
+{
+  enum reg_class rclass;
+  rtx original_reg;
+  int hard_regno;
+  rtx new_reg, save, restore, usage_insn;
+  bool after_p;
+  bool call_save_p;
+
+  if (original_regno < FIRST_PSEUDO_REGISTER)
+    {
+      rclass = ira_allocno_class_translate[REGNO_REG_CLASS (original_regno)];
+      hard_regno = original_regno;
+      call_save_p = false;
+    }
+  else
+    {
+      hard_regno = reg_renumber[original_regno];
+      rclass = lra_get_allocno_class (original_regno);
+      call_save_p = need_for_call_save_p (original_regno);
+    }
+  original_reg = regno_reg_rtx[original_regno];
+  lra_assert (hard_regno >= 0);
+  if (lra_dump_file != NULL)
+    fprintf (lra_dump_file,
+	     "	  ((((((((((((((((((((((((((((((((((((((((((((((((\n");
+  if (call_save_p)
+    {
+      enum machine_mode sec_mode;
+      
+#ifdef SECONDARY_MEMORY_NEEDED_MODE
+      sec_mode = SECONDARY_MEMORY_NEEDED_MODE (GET_MODE (original_reg));
+#else
+      sec_mode = GET_MODE (original_reg);
+#endif
+      new_reg = lra_create_new_reg (sec_mode, NULL_RTX,
+				    NO_REGS, "save");
+    }
+  else
+    {
+      rclass = choose_split_class (rclass, hard_regno,
+				   GET_MODE (original_reg));
+      if (rclass == NO_REGS)
+	{
+	  if (lra_dump_file != NULL)
+	    {
+	      fprintf (lra_dump_file,
+		       "    Rejecting split of %d(%s): "
+		       "no good reg class for %d(%s)\n",
+		       original_regno,
+		       reg_class_names[lra_get_allocno_class (original_regno)],
+		       hard_regno,
+		       reg_class_names[REGNO_REG_CLASS (hard_regno)]);
+	      fprintf
+		(lra_dump_file,
+		 "    ))))))))))))))))))))))))))))))))))))))))))))))))\n");
+	    }
+	  return false;
+	}
+      new_reg = lra_create_new_reg (GET_MODE (original_reg), original_reg,
+				    rclass, "split");
+      reg_renumber[REGNO (new_reg)] = hard_regno;
+    }
+  if (call_save_p)
+    save = emit_spill_move (true, new_reg, original_reg, -1);
+  else
+    {
+      start_sequence ();
+      emit_move_insn (new_reg, original_reg);
+      save = get_insns ();
+      end_sequence ();
+    }
+  if (NEXT_INSN (save) != NULL_RTX)
+    {
+      lra_assert (! call_save_p);
+      if (lra_dump_file != NULL)
+	{
+	  fprintf
+	    (lra_dump_file,
+	     "	  Rejecting split %d->%d resulting in > 2 %s save insns:\n",
+	     original_regno, REGNO (new_reg), call_save_p ? "call" : "");
+	  debug_rtl_slim (lra_dump_file, save, NULL_RTX, -1, 0);
+	  fprintf (lra_dump_file,
+		   "	))))))))))))))))))))))))))))))))))))))))))))))))\n");
+	}
+      return false;
+    }
+  if (call_save_p)
+    restore = emit_spill_move (false, new_reg, original_reg, -1);
+  else
+    {
+      start_sequence ();
+      emit_move_insn (original_reg, new_reg);
+      restore = get_insns ();
+      end_sequence ();
+    }
+  if (NEXT_INSN (restore) != NULL_RTX)
+    {
+      lra_assert (! call_save_p);
+      if (lra_dump_file != NULL)
+	{
+	  fprintf (lra_dump_file,
+		   "	Rejecting split %d->%d "
+		   "resulting in > 2 %s restore insns:\n",
+		   original_regno, REGNO (new_reg), call_save_p ? "call" : "");
+	  debug_rtl_slim (lra_dump_file, restore, NULL_RTX, -1, 0);
+	  fprintf (lra_dump_file,
+		   "	))))))))))))))))))))))))))))))))))))))))))))))))\n");
+	}
+      return false;
+    }
+  after_p = usage_insns[original_regno].after_p;
+  lra_reg_info[REGNO (new_reg)].restore_regno = original_regno;
+  bitmap_set_bit (&check_only_regs, REGNO (new_reg));
+  bitmap_set_bit (&check_only_regs, original_regno);
+  bitmap_set_bit (&lra_split_pseudos, REGNO (new_reg));
+  for (;;)
+    {
+      if (GET_CODE (next_usage_insns) != INSN_LIST)
+	{
+	  usage_insn = next_usage_insns;
+	  break;
+	}
+      usage_insn = XEXP (next_usage_insns, 0);
+      lra_assert (DEBUG_INSN_P (usage_insn));
+      next_usage_insns = XEXP (next_usage_insns, 1);
+      substitute_pseudo (&usage_insn, original_regno, new_reg);
+      lra_update_insn_regno_info (usage_insn);
+      if (lra_dump_file != NULL)
+	{
+	  fprintf (lra_dump_file, "    Split reuse change %d->%d:\n",
+		   original_regno, REGNO (new_reg));
+	  debug_rtl_slim (lra_dump_file, usage_insn, usage_insn,
+			  -1, 0);
+	}
+    }
+  lra_assert (NONDEBUG_INSN_P (usage_insn));
+  lra_assert (usage_insn != insn || (after_p && before_p));
+  lra_process_new_insns (usage_insn, after_p ? NULL_RTX : restore,
+			 after_p ? restore : NULL_RTX,
+			 call_save_p
+			 ?  "Add reg<-save" : "Add reg<-split");
+  lra_process_new_insns (insn, before_p ? save : NULL_RTX,
+			 before_p ? NULL_RTX : save,
+			 call_save_p
+			 ?  "Add save<-reg" : "Add split<-reg");
+  if (lra_dump_file != NULL)
+    fprintf (lra_dump_file,
+	     "	  ))))))))))))))))))))))))))))))))))))))))))))))))\n");
+  return true;
+}
+
+/* The registers from CHECK_ONLY_REGS living at the current program
+   point in the current EBB.  */
+static bitmap_head live_regs;
+
+/* Update live info in EBB given by its HEAD and TAIL insns after
+   inheritance/split transformation.  The function removes dead moves
+   too.	 */
+static void
+update_ebb_live_info (rtx head, rtx tail)
+{
+  unsigned int j;
+  int regno;
+  bool live_p;
+  rtx prev_insn, set;
+  bool remove_p;
+  basic_block last_bb, prev_bb, curr_bb;
+  bitmap_iterator bi;
+  struct lra_insn_reg *reg;
+  edge e;
+  edge_iterator ei;
+
+  last_bb = BLOCK_FOR_INSN (tail);
+  prev_bb = NULL;
+  for (curr_insn = tail;
+       curr_insn != PREV_INSN (head);
+       curr_insn = prev_insn)
+    {
+      prev_insn = PREV_INSN (curr_insn);
+      if (! INSN_P (curr_insn))
+	continue;
+      curr_bb = BLOCK_FOR_INSN (curr_insn);
+      if (curr_bb != prev_bb)
+	{
+	  if (prev_bb != NULL)
+	    {
+	      /* Update DF_LR_IN (prev_bb):  */
+	      EXECUTE_IF_SET_IN_BITMAP (&check_only_regs, 0, j, bi)
+		if (bitmap_bit_p (&live_regs, j))
+		  bitmap_set_bit (DF_LR_IN (prev_bb), j);
+		else
+		  bitmap_clear_bit (DF_LR_IN (prev_bb), j);
+	    }
+	  if (curr_bb != last_bb)
+	    {
+	      /* Update DF_LR_OUT (curr_bb):  */
+	      EXECUTE_IF_SET_IN_BITMAP (&check_only_regs, 0, j, bi)
+		{
+		  live_p = bitmap_bit_p (&live_regs, j);
+		  if (! live_p)
+		    FOR_EACH_EDGE (e, ei, curr_bb->succs)
+		      if (bitmap_bit_p (DF_LR_IN (e->dest), j))
+			{
+			  live_p = true;
+			  break;
+			}
+		  if (live_p)
+		    bitmap_set_bit (DF_LR_OUT (curr_bb), j);
+		  else
+		    bitmap_clear_bit (DF_LR_OUT (curr_bb), j);
+		}
+	    }
+	  prev_bb = curr_bb;
+	  bitmap_and (&live_regs, &check_only_regs, DF_LR_OUT (curr_bb));
+	}
+      if (DEBUG_INSN_P (curr_insn))
+	continue;
+      curr_id = lra_get_insn_recog_data (curr_insn);
+      remove_p = false;
+      if ((set = single_set (curr_insn)) != NULL_RTX && REG_P (SET_DEST (set))
+	  && (regno = REGNO (SET_DEST (set))) >= FIRST_PSEUDO_REGISTER
+	  && bitmap_bit_p (&check_only_regs, regno)
+	  && ! bitmap_bit_p (&live_regs, regno))
+	remove_p = true;
+      /* See which defined values die here.  */
+      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
+	if (reg->type == OP_OUT && ! reg->early_clobber
+	    && (! reg->subreg_p
+		|| bitmap_bit_p (&lra_bound_pseudos, reg->regno)))
+	  bitmap_clear_bit (&live_regs, reg->regno);
+      /* Mark each used value as live.	*/
+      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
+	if (reg->type == OP_IN
+	    && bitmap_bit_p (&check_only_regs, reg->regno))
+	  bitmap_set_bit (&live_regs, reg->regno);
+      /* Mark early clobber outputs dead.  */
+      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
+	if (reg->type == OP_OUT && reg->early_clobber && ! reg->subreg_p)
+	  bitmap_clear_bit (&live_regs, reg->regno);
+      /* It is quite important to remove dead move insns because it
+	 means removing dead stores.  We don't need to process them
+	 for constraints, and unfortunately some subsequent
+	 optimizations (like shrink-wrapping) are currently based on
+	 the assumption that there are no trivial dead insns.  */
+      if (remove_p)
+	{
+	  if (lra_dump_file != NULL)
+	    {
+	      fprintf (lra_dump_file, "	    Removing dead insn:\n ");
+	      debug_rtl_slim (lra_dump_file, curr_insn, curr_insn, -1, 0);
+	    }
+	  lra_set_insn_deleted (curr_insn);
+	}
+    }
+}
+
+/* The structure describes the info needed to do inheritance for the
+   current insn.  We need to collect such info first, before doing
+   the transformations, because the transformations change the insn
+   internal representation.  */
+struct to_inherit
+{
+  /* Original regno.  */
+  int regno;
+  /* Subsequent insns which can inherit original reg value.  */
+  rtx insns;
+};
+
+/* Array containing all info for doing inheritance from the current
+   insn.  */
+static struct to_inherit to_inherit[LRA_MAX_INSN_RELOADS];
+
+/* Number of elements in the previous array.  */
+static int to_inherit_num;
+
+/* Add inheritance info REGNO and INSNS.  */
+static void
+add_to_inherit (int regno, rtx insns)
+{
+  int i;
+
+  for (i = 0; i < to_inherit_num; i++)
+    if (to_inherit[i].regno == regno)
+      return;
+  lra_assert (to_inherit_num < LRA_MAX_INSN_RELOADS);
+  to_inherit[to_inherit_num].regno = regno;
+  to_inherit[to_inherit_num++].insns = insns;
+}
+
+/* Return first (if FIRST_P) or last non-debug insn in basic block BB.
+   Return null if there are no non-debug insns in the block.  */
+static rtx
+get_non_debug_insn (bool first_p, basic_block bb)
+{
+  rtx insn;
+
+  for (insn = first_p ? BB_HEAD (bb) : BB_END (bb);
+       insn != NULL_RTX && ! NONDEBUG_INSN_P (insn);
+       insn = first_p ? NEXT_INSN (insn) : PREV_INSN (insn))
+    ;
+  if (insn != NULL_RTX && BLOCK_FOR_INSN (insn) != bb)
+    insn = NULL_RTX;
+  return insn;
+}
+
+/* Set RES to the registers living on successor edges of FROM other
+   than the edge (FROM, TO), plus the registers set by a jump insn
+   ending BB FROM.  */
+static void
+get_live_on_other_edges (basic_block from, basic_block to, bitmap res)
+{
+  int regno;
+  rtx last;
+  struct lra_insn_reg *reg;
+  edge e;
+  edge_iterator ei;
+
+  lra_assert (to != NULL);
+  bitmap_clear (res);
+  FOR_EACH_EDGE (e, ei, from->succs)
+    if (e->dest != to)
+      bitmap_ior_into (res, DF_LR_IN (e->dest));
+  if ((last = get_non_debug_insn (false, from)) == NULL_RTX || ! JUMP_P (last))
+    return;
+  curr_id = lra_get_insn_recog_data (last);
+  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
+    if (reg->type != OP_IN
+	&& (regno = reg->regno) >= FIRST_PSEUDO_REGISTER)
+      bitmap_set_bit (res, regno);
+}
+	
+/* Used as temporary results of some bitmap calculations.  */
+static bitmap_head temp_bitmap;
+
+/* Form the list of usages of REGNO, which consists of optional debug
+   insns finished by a non-debug insn using REGNO.  RELOADS_NUM is
+   the number of reload insns processed so far.	 */
+static void
+add_next_usage_insn (int regno, rtx insn, int reloads_num)
+{
+  rtx next_usage_insns;
+  
+  if (usage_insns[regno].check == curr_usage_insns_check
+      && (next_usage_insns = usage_insns[regno].insns) != NULL_RTX
+      && DEBUG_INSN_P (insn))
+    {
+      /* Check that we did not add the debug insn yet.	*/
+      if (next_usage_insns != insn
+	  && (GET_CODE (next_usage_insns) != INSN_LIST
+	      || XEXP (next_usage_insns, 0) != insn))
+	usage_insns[regno].insns = gen_rtx_INSN_LIST (VOIDmode, insn,
+						      next_usage_insns);
+    }
+  else if (NONDEBUG_INSN_P (insn))
+    {
+      usage_insns[regno].check = curr_usage_insns_check;
+      usage_insns[regno].insns = insn;
+      usage_insns[regno].reloads_num = reloads_num;
+      usage_insns[regno].calls_num = calls_num;
+      usage_insns[regno].after_p = false;
+    }
+  else
+    usage_insns[regno].check = 0;
+}
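+
+/* To illustrate the list shape (editor's note): a non-debug use i1
+   of pseudo r followed, in the backward walk, by debug uses d2 and
+   d3 yields
+
+     usage_insns[r].insns == INSN_LIST (d3, INSN_LIST (d2, i1))
+
+   i.e. a chain of the optional debug insns terminated by the plain
+   non-debug insn -- exactly the shape that split_reg walks above.  */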
+  
+/* Do inheritance/split transformations in EBB starting with HEAD and
+   finishing on TAIL.  We process EBB insns in the reverse order.
+   Return true if we did any inheritance/split transformation in the
+   EBB.
+
+   We should avoid excessive splitting which results in worse code
+   because of inaccurate cost calculations for spilling new split
+   pseudos in such cases.  To achieve this we do splitting only if
+   the register pressure is high in the given basic block and there
+   are reload pseudos requiring hard registers.  We could do more
+   register pressure calculations at any given program point to
+   avoid unnecessary splitting even more, but it is too expensive
+   and the current approach works well enough.	*/
+static bool
+inherit_in_ebb (rtx head, rtx tail)
+{
+  int i, src_regno, dst_regno;
+  bool change_p, succ_p;
+  rtx prev_insn, next_usage_insns, set, first_insn, last_insn, next_insn;
+  enum reg_class cl;
+  struct lra_insn_reg *reg;
+  basic_block last_processed_bb, curr_bb = NULL;
+  HARD_REG_SET potential_reload_hard_regs, live_hard_regs;
+  bitmap to_process;
+  unsigned int j;
+  bitmap_iterator bi;
+  bool head_p, after_p;
+
+  change_p = false;
+  curr_usage_insns_check++;
+  reloads_num = calls_num = 0;
+  /* Remember: we can remove the current insn.	*/
+  bitmap_clear (&check_only_regs);
+  last_processed_bb = NULL;
+  CLEAR_HARD_REG_SET (potential_reload_hard_regs);
+  CLEAR_HARD_REG_SET (live_hard_regs);
+  /* We don't process new insns generated in the loop.	*/
+  for (curr_insn = tail; curr_insn != PREV_INSN (head); curr_insn = prev_insn)
+    {
+      prev_insn = PREV_INSN (curr_insn);
+      if (BLOCK_FOR_INSN (curr_insn) != NULL)
+	curr_bb = BLOCK_FOR_INSN (curr_insn);
+      if (last_processed_bb != curr_bb)
+	{
+	  /* We are at the end of BB.  Add qualified living
+	     pseudos for potential splitting.  */
+	  to_process = DF_LR_OUT (curr_bb);
+	  if (last_processed_bb != NULL)
+	    {	
+	      /* We are somewhere in the middle of the EBB.  */
+	      get_live_on_other_edges (curr_bb, last_processed_bb,
+				       &temp_bitmap);
+	      to_process = &temp_bitmap;
+	    }
+	  last_processed_bb = curr_bb;
+	  last_insn = get_non_debug_insn (false, curr_bb);
+	  after_p = (last_insn != NULL_RTX && ! JUMP_P (last_insn)
+		     && (! CALL_P (last_insn)
+			 || (find_reg_note (last_insn,
+					   REG_NORETURN, NULL) == NULL_RTX
+			     && ((next_insn
+				  = next_nonnote_nondebug_insn (last_insn))
+				 == NULL_RTX
+				 || GET_CODE (next_insn) != BARRIER))));
+	  REG_SET_TO_HARD_REG_SET (live_hard_regs, DF_LR_OUT (curr_bb));
+	  IOR_HARD_REG_SET (live_hard_regs, eliminable_regset);
+	  IOR_HARD_REG_SET (live_hard_regs, lra_no_alloc_regs);
+	  CLEAR_HARD_REG_SET (potential_reload_hard_regs);
+	  EXECUTE_IF_SET_IN_BITMAP (to_process, 0, j, bi)
+	    {
+	      if ((int) j >= lra_constraint_new_regno_start)
+		break;
+	      if (j < FIRST_PSEUDO_REGISTER || reg_renumber[j] >= 0)
+		{
+		  if (j < FIRST_PSEUDO_REGISTER)
+		    SET_HARD_REG_BIT (live_hard_regs, j);
+		  else
+		    lra_add_hard_reg_set (reg_renumber[j],
+					  PSEUDO_REGNO_MODE (j),
+					  &live_hard_regs);
+		  usage_insns[j].check = curr_usage_insns_check;
+		  usage_insns[j].insns = last_insn;
+		  usage_insns[j].reloads_num = reloads_num;
+		  usage_insns[j].calls_num = calls_num;
+		  usage_insns[j].after_p = after_p;
+		}
+	    }
+	}
+      src_regno = dst_regno = -1;
+      if (NONDEBUG_INSN_P (curr_insn)
+	  && (set = single_set (curr_insn)) != NULL_RTX
+	  && REG_P (SET_DEST (set)) && REG_P (SET_SRC (set)))
+	{
+	  src_regno = REGNO (SET_SRC (set));
+	  dst_regno = REGNO (SET_DEST (set));
+	}
+      if (src_regno < lra_constraint_new_regno_start
+	  && src_regno >= FIRST_PSEUDO_REGISTER
+	  && reg_renumber[src_regno] < 0
+	  && dst_regno >= lra_constraint_new_regno_start
+	  && (cl = lra_get_allocno_class (dst_regno)) != NO_REGS)
+	{
+	  /* 'reload_pseudo <- original_pseudo'.  */
+	  reloads_num++;
+	  succ_p = false;
+	  if (usage_insns[src_regno].check == curr_usage_insns_check
+	      && (next_usage_insns = usage_insns[src_regno].insns) != NULL_RTX)
+	    succ_p = inherit_reload_reg (false,
+					 bitmap_bit_p (&lra_matched_pseudos,
+						       dst_regno),
+					 src_regno, cl,
+					 curr_insn, next_usage_insns);
+	  if (succ_p)
+	    change_p = true;
+	  else
+	    {
+	      usage_insns[src_regno].check = curr_usage_insns_check;
+	      usage_insns[src_regno].insns = curr_insn;
+	      usage_insns[src_regno].reloads_num = reloads_num;
+	      usage_insns[src_regno].calls_num = calls_num;
+	      usage_insns[src_regno].after_p = false;
+	    }
+	  if (cl != NO_REGS
+	      && hard_reg_set_subset_p (reg_class_contents[cl],
+					live_hard_regs))
+	    IOR_HARD_REG_SET (potential_reload_hard_regs,
+			      reg_class_contents[cl]);
+	}
+      else if (src_regno >= lra_constraint_new_regno_start
+	       && dst_regno < lra_constraint_new_regno_start
+	       && dst_regno >= FIRST_PSEUDO_REGISTER
+	       && reg_renumber[dst_regno] < 0
+	       && (cl = lra_get_allocno_class (src_regno)) != NO_REGS
+	       && usage_insns[dst_regno].check == curr_usage_insns_check
+	       && (next_usage_insns
+		   = usage_insns[dst_regno].insns) != NULL_RTX)
+	{
+	  reloads_num++;
+	  /* 'original_pseudo <- reload_pseudo'.  */
+	  if (! JUMP_P (curr_insn)
+	      && inherit_reload_reg (true, false, dst_regno, cl,
+				     curr_insn, next_usage_insns))
+	    change_p = true;
+	  /* Invalidate.  */
+	  usage_insns[dst_regno].check = 0;
+	  if (cl != NO_REGS
+	      && hard_reg_set_subset_p (reg_class_contents[cl],
+					live_hard_regs))
+	    IOR_HARD_REG_SET (potential_reload_hard_regs,
+			      reg_class_contents[cl]);
+	}
+      else if (INSN_P (curr_insn))
+	{
+	  int max_uid = get_max_uid ();
+
+	  curr_id = lra_get_insn_recog_data (curr_insn);
+	  to_inherit_num = 0;
+	  /* Process insn definitions.	*/
+	  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
+	    if (reg->type != OP_IN
+		&& (dst_regno = reg->regno) < lra_constraint_new_regno_start
+		&& usage_insns[dst_regno].check == curr_usage_insns_check
+		&& (next_usage_insns
+		    = usage_insns[dst_regno].insns) != NULL_RTX)
+	      {
+		if (dst_regno >= FIRST_PSEUDO_REGISTER && reg->type == OP_OUT
+		    && reg_renumber[dst_regno] < 0 && ! reg->subreg_p)
+		  {
+		    struct lra_insn_reg *r;
+		    
+		    for (r = curr_id->regs; r != NULL; r = r->next)
+		      if (r->type != OP_OUT && r->regno == dst_regno)
+			break;
+		    /* Don't do inheritance if the pseudo is also
+		       used in the insn.  */
+		    if (r == NULL)
+		      /* We cannot do inheritance right now
+			 because the current insn reg info (the chain
+			 of regs) can change after that.  */
+		      add_to_inherit (dst_regno, next_usage_insns);
+		  }
+		/* We can not process one reg twice here because of
+		   usage_insns invalidation.  */
+		if ((dst_regno < FIRST_PSEUDO_REGISTER
+		     || reg_renumber[dst_regno] >= 0)
+		    && ! reg->subreg_p && reg->type == OP_OUT)
+		  {
+		    HARD_REG_SET s;
+		    
+		    if (need_for_split_p (potential_reload_hard_regs,
+					  dst_regno)
+			&& split_reg (false, dst_regno, curr_insn,
+				      next_usage_insns))
+		      change_p = true;
+		    CLEAR_HARD_REG_SET (s);
+		    if (dst_regno < FIRST_PSEUDO_REGISTER)
+		      SET_HARD_REG_BIT (s, dst_regno);
+		    else
+		      lra_add_hard_reg_set (reg_renumber[dst_regno],
+					    PSEUDO_REGNO_MODE (dst_regno), &s);
+		    AND_COMPL_HARD_REG_SET (live_hard_regs, s);
+		  }
+		if (reg_renumber[dst_regno] < 0
+		    || (reg->type == OP_OUT && ! reg->subreg_p))
+		  /* Invalidate.  */
+		  usage_insns[dst_regno].check = 0;
+	      }
+	  if (! JUMP_P (curr_insn))
+	    for (i = 0; i < to_inherit_num; i++)
+	      if (inherit_reload_reg (true, false, to_inherit[i].regno,
+				      ALL_REGS, curr_insn,
+				      to_inherit[i].insns))
+		change_p = true;
+	  if (CALL_P (curr_insn))
+	    {
+	      rtx cheap, pat, dest, restore;
+	      int regno, hard_regno;
+
+	      calls_num++;
+	      if ((cheap = find_reg_note (curr_insn,
+					  REG_RETURNED, NULL_RTX)) != NULL_RTX
+		  && ((cheap = XEXP (cheap, 0)), true)
+		  && (regno = REGNO (cheap)) >= FIRST_PSEUDO_REGISTER
+		  && (hard_regno = reg_renumber[regno]) >= 0
+		  /* If there are pending saves/restores, the
+		     optimization is not worth doing.  */
+		  && usage_insns[regno].calls_num == calls_num - 1
+		  && TEST_HARD_REG_BIT (call_used_reg_set, hard_regno))
+		{
+		  /* Restore the pseudo from the call result as
+		     REG_RETURNED note says that the pseudo value is
+		     in the call result and the pseudo is an argument
+		     of the call.  */
+		  pat = PATTERN (curr_insn);
+		  if (GET_CODE (pat) == PARALLEL)
+		    pat = XVECEXP (pat, 0, 0);
+		  dest = SET_DEST (pat);
+		  start_sequence ();
+		  emit_move_insn (cheap, copy_rtx (dest));
+		  restore = get_insns ();
+		  end_sequence ();
+		  lra_process_new_insns (curr_insn, NULL, restore,
+					 "Inserting call parameter restore");
+		  /* We don't need to save/restore the pseudo for
+		     this call.	 */
+		  usage_insns[regno].calls_num = calls_num;
+		  bitmap_set_bit (&check_only_regs, regno);
+		}
+	    }
+	  to_inherit_num = 0;
+	  /* Process insn usages.  */
+	  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
+	    if ((reg->type != OP_OUT
+		 || (reg->type == OP_OUT && reg->subreg_p))
+		&& (src_regno = reg->regno) < lra_constraint_new_regno_start)
+	      {
+		if (src_regno >= FIRST_PSEUDO_REGISTER
+		    && reg_renumber[src_regno] < 0 && reg->type == OP_IN)
+		  {
+		    if (usage_insns[src_regno].check == curr_usage_insns_check
+			&& (next_usage_insns
+			    = usage_insns[src_regno].insns) != NULL_RTX
+			&& NONDEBUG_INSN_P (curr_insn))
+		      add_to_inherit (src_regno, next_usage_insns);
+		    else
+		      /* Add usages.  */
+		      add_next_usage_insn (src_regno, curr_insn, reloads_num);
+		  }
+		else if (src_regno < FIRST_PSEUDO_REGISTER
+			 || reg_renumber[src_regno] >= 0)
+		  {
+		    bool before_p;
+		    rtx use_insn = curr_insn;
+
+		    before_p = (JUMP_P (curr_insn)
+				|| (CALL_P (curr_insn) && reg->type == OP_IN));
+		    if (usage_insns[src_regno].check == curr_usage_insns_check
+			&& (next_usage_insns
+			    = usage_insns[src_regno].insns) != NULL_RTX
+			/* To avoid processing the pseudo twice or
+			   more.  */
+			&& ((GET_CODE (next_usage_insns) != INSN_LIST
+			     && INSN_UID (next_usage_insns) < max_uid)
+			    || (GET_CODE (next_usage_insns) == INSN_LIST
+				&& (INSN_UID (XEXP (next_usage_insns, 0))
+				    < max_uid)))
+			&& need_for_split_p (potential_reload_hard_regs,
+					     src_regno)
+			&& NONDEBUG_INSN_P (curr_insn)
+			&& split_reg (before_p, src_regno, curr_insn,
+				      next_usage_insns))
+		      {
+			if (reg->subreg_p)
+			  lra_risky_transformations_p = true;
+			change_p = true;
+			/* Invalidate.	*/
+			usage_insns[src_regno].check = 0;
+			if (before_p)
+			  use_insn = PREV_INSN (curr_insn);
+		      }
+		    if (NONDEBUG_INSN_P (curr_insn))
+		      {
+			if (src_regno < FIRST_PSEUDO_REGISTER)
+			  SET_HARD_REG_BIT (live_hard_regs, src_regno);
+			else
+			  lra_add_hard_reg_set (reg_renumber[src_regno],
+						PSEUDO_REGNO_MODE (src_regno),
+						&live_hard_regs);
+		      }
+		    add_next_usage_insn (src_regno, use_insn, reloads_num);
+		  }
+	      }
+	  for (i = 0; i < to_inherit_num; i++)
+	    {
+	      src_regno = to_inherit[i].regno;
+	      if (inherit_reload_reg (false, false, src_regno, ALL_REGS,
+				      curr_insn, to_inherit[i].insns))
+		change_p = true;
+	      else
+		{
+		  usage_insns[src_regno].check = curr_usage_insns_check;
+		  usage_insns[src_regno].insns = curr_insn;
+		  usage_insns[src_regno].reloads_num = reloads_num;
+		  usage_insns[src_regno].calls_num = calls_num;
+		  usage_insns[src_regno].after_p = false;
+		}
+	    }
+	}
+      /* We reached the start of the current basic block.  */
+      if (prev_insn == NULL_RTX || prev_insn == PREV_INSN (head)
+	  || BLOCK_FOR_INSN (prev_insn) != curr_bb)
+	{
+	  /* We reached the beginning of the current block -- do the
+	     rest of the splitting in the current BB.  */
+	  first_insn = get_non_debug_insn (true, curr_bb);
+	  to_process = DF_LR_IN (curr_bb);
+	  if (BLOCK_FOR_INSN (head) != curr_bb)
+	    {	
+	      /* We are somewhere in the middle of the EBB.  */
+	      get_live_on_other_edges (EDGE_PRED (curr_bb, 0)->src,
+				       curr_bb, &temp_bitmap);
+	      to_process = &temp_bitmap;
+	    }
+	  head_p = true;
+	  EXECUTE_IF_SET_IN_BITMAP (to_process, 0, j, bi)
+	    {
+	      if ((int) j >= lra_constraint_new_regno_start)
+		break;
+	      if (((int) j < FIRST_PSEUDO_REGISTER || reg_renumber[j] >= 0)
+		  && usage_insns[j].check == curr_usage_insns_check
+		  && (next_usage_insns = usage_insns[j].insns) != NULL_RTX)
+		{
+		  if (first_insn != NULL_RTX
+		      && need_for_split_p (potential_reload_hard_regs, j))
+		    {
+		      if (lra_dump_file != NULL && head_p)
+			{
+			  fprintf (lra_dump_file,
+				   "  ----------------------------------\n");
+			  head_p = false;
+			}
+		      if (split_reg (true, j, first_insn, next_usage_insns))
+			change_p = true;
+		    }
+		  usage_insns[j].check = 0;
+		}
+	    }
+	}
+    }
+  return change_p;
+}
+
+/* This value affects EBB forming.  If the probability of an edge
+   from an EBB block to a BB is not greater than the following value,
+   we don't add the BB to the EBB.  */
+#define EBB_PROBABILITY_CUTOFF (REG_BR_PROB_BASE / 2)
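+
+/* Since REG_BR_PROB_BASE is 10000, the cutoff above is 5000: a
+   fall-through edge extends the EBB only when its probability
+   exceeds one half (see lra_inheritance below).  */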
+
+/* Current iteration number of the inheritance/split pass.  */
+int lra_inheritance_iter;
+
+/* Entry function for inheritance/split pass.  */
+void
+lra_inheritance (void)
+{
+  int i;
+  basic_block bb, start_bb;
+  edge e;
+
+  lra_inheritance_iter++;
+  if (lra_dump_file != NULL)
+    fprintf (lra_dump_file, "\n********** Inheritance #%d: **********\n\n",
+	     lra_inheritance_iter);
+  curr_usage_insns_check = 0;
+  usage_insns
+    = (struct usage_insns *) xmalloc (sizeof (struct usage_insns)
+				      * lra_constraint_new_regno_start);
+  for (i = 0; i < lra_constraint_new_regno_start; i++)
+    usage_insns[i].check = 0;
+  bitmap_initialize (&check_only_regs, &reg_obstack);
+  bitmap_initialize (&live_regs, &reg_obstack);
+  bitmap_initialize (&temp_bitmap, &reg_obstack);
+  bitmap_initialize (&ebb_global_regs, &reg_obstack);
+  FOR_EACH_BB (bb)
+    {
+      start_bb = bb;
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "EBB");
+      /* Form an EBB starting with BB.  */
+      bitmap_clear (&ebb_global_regs);
+      bitmap_ior_into (&ebb_global_regs, DF_LR_IN (bb));
+      for (;;)
+	{
+	  if (lra_dump_file != NULL)
+	    fprintf (lra_dump_file, " %d", bb->index);
+	  if (bb->next_bb == EXIT_BLOCK_PTR || LABEL_P (BB_HEAD (bb->next_bb)))
+	    break;
+	  e = find_fallthru_edge (bb->succs);
+	  if (! e)
+	    break;
+	  if (e->probability <= EBB_PROBABILITY_CUTOFF)
+	    break;
+	  bb = bb->next_bb;
+	}
+      bitmap_ior_into (&ebb_global_regs, DF_LR_OUT (bb));
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "\n");
+      if (inherit_in_ebb (BB_HEAD (start_bb), BB_END (bb)))
+	/* Remember that the EBB head and tail can change in
+	   inherit_in_ebb.  */
+	update_ebb_live_info (BB_HEAD (start_bb), BB_END (bb));
+    }
+  bitmap_clear (&ebb_global_regs);
+  bitmap_clear (&temp_bitmap);
+  bitmap_clear (&live_regs);
+  bitmap_clear (&check_only_regs);
+  free (usage_insns);
+}
+
+
+
+/* This page contains code to undo failed inheritance/split
+   transformations.  */
+
+/* Current iteration number of the undo inheritance/split pass.  */
+int lra_undo_inheritance_iter;
+
+/* A temporary bitmap used during calls of FIX_BB_LIVE_INFO.  */
+static bitmap_head temp_bitmap_head;
+
+/* Fix BB live info LIVE after removing pseudos created by the
+   inheritance/split pass; the removed pseudos are in
+   REMOVED_PSEUDOS.  */
+static void
+fix_bb_live_info (bitmap live, bitmap removed_pseudos)
+{
+  unsigned int regno;
+  bitmap_iterator bi;
+
+  bitmap_and (&temp_bitmap_head, removed_pseudos, live);
+  EXECUTE_IF_SET_IN_BITMAP (&temp_bitmap_head, 0, regno, bi)
+    {
+      bitmap_clear_bit (live, regno);
+      bitmap_set_bit (live, lra_reg_info[regno].restore_regno);
+    }
+}
+
+/* Return the regno of register (or subreg of a register) REG.
+   Otherwise, return a negative number.  */
+static int
+get_regno (rtx reg)
+{
+  if (GET_CODE (reg) == SUBREG)
+    reg = SUBREG_REG (reg);
+  if (REG_P (reg))
+    return REGNO (reg);
+  return -1;
+}
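+
+/* For example (editor's note): (subreg:SI (reg:DI 100) 4) yields
+   100, (reg:SI 42) yields 42, and anything else, e.g. a MEM,
+   yields -1.  */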
+
+/* Remove inheritance/split pseudos which are in REMOVE_PSEUDOS and
+   return true if we did any change.  The undo transformations for
+   inheritance look like
+      i <- i2
+      p <- i	  =>   p <- i2
+   or removing
+      p <- i, i <- p, and i <- i3
+   where p is the original pseudo from which inheritance pseudo i was
+   created, i and i3 are removed inheritance pseudos, and i2 is
+   another inheritance pseudo that is not removed.  All split pseudos
+   and other occurrences of removed inheritance pseudos are changed
+   to the corresponding original pseudos.  */
+static bool
+remove_inheritance_pseudos (bitmap remove_pseudos)
+{
+  basic_block bb;
+  int regno, sregno, prev_sregno, dregno, restore_regno;
+  rtx set, prev_set, prev_insn;
+  bool change_p, done_p;
+
+  change_p = ! bitmap_empty_p (remove_pseudos);
+  bitmap_initialize (&temp_bitmap_head, &reg_obstack);
+  FOR_EACH_BB (bb)
+    {
+      fix_bb_live_info (DF_LR_IN (bb), remove_pseudos);
+      fix_bb_live_info (DF_LR_OUT (bb), remove_pseudos);
+      FOR_BB_INSNS_REVERSE (bb, curr_insn)
+	{
+	  if (! INSN_P (curr_insn))
+	    continue;
+	  done_p = false;
+	  sregno = dregno = -1;
+	  if (change_p && NONDEBUG_INSN_P (curr_insn)
+	      && (set = single_set (curr_insn)) != NULL_RTX)
+	    {
+	      dregno = get_regno (SET_DEST (set));
+	      sregno = get_regno (SET_SRC (set));
+	    }
+	  
+	  if (sregno >= 0 && dregno >= 0)
+	    {
+	      if ((bitmap_bit_p (remove_pseudos, sregno)
+		   && (lra_reg_info[sregno].restore_regno == dregno
+		       || (bitmap_bit_p (remove_pseudos, dregno)
+			   && (lra_reg_info[sregno].restore_regno
+			       == lra_reg_info[dregno].restore_regno))))
+		  || (bitmap_bit_p (remove_pseudos, dregno)
+		      && lra_reg_info[dregno].restore_regno == sregno))
+		/* One of the following cases:
+		     original <- removed inheritance pseudo
+		     removed inherit pseudo <- another removed inherit pseudo
+		     removed inherit pseudo <- original pseudo
+		   Or
+		     removed_split_pseudo <- original_reg
+		     original_reg <- removed_split_pseudo */
+		{
+		  if (lra_dump_file != NULL)
+		    {
+		      fprintf (lra_dump_file, "	   Removing %s:\n",
+			       bitmap_bit_p (&lra_split_pseudos, sregno)
+			       || bitmap_bit_p (&lra_split_pseudos, dregno)
+			       ? "split" : "inheritance");
+		      debug_rtl_slim (lra_dump_file,
+				      curr_insn, curr_insn, -1, 0);
+		    }
+		  lra_set_insn_deleted (curr_insn);
+		  done_p = true;
+		}
+	      else if (bitmap_bit_p (remove_pseudos, sregno)
+		       && bitmap_bit_p (&lra_inheritance_pseudos, sregno))
+		{
+		  /* Search the following pattern:
+		       inherit_or_split_pseudo1 <- inherit_or_split_pseudo2
+		       original_pseudo <- inherit_or_split_pseudo1
+		    where the 2nd insn is the current insn and
+		    inherit_or_split_pseudo2 is not removed.  If it is found,
+		    change the current insn to:
+		       original_pseudo <- inherit_or_split_pseudo2.  */
+		  for (prev_insn = PREV_INSN (curr_insn);
+		       prev_insn != NULL_RTX && ! NONDEBUG_INSN_P (prev_insn);
+		       prev_insn = PREV_INSN (prev_insn))
+		    ;
+		  if (prev_insn != NULL_RTX && BLOCK_FOR_INSN (prev_insn) == bb
+		      && (prev_set = single_set (prev_insn)) != NULL_RTX
+		      /* There should be no subregs in the insn we are
+			 searching for, because only the original reg
+			 might be in a subreg when we changed the mode
+			 of a load/store for splitting.	 */
+		      && REG_P (SET_DEST (prev_set))
+		      && REG_P (SET_SRC (prev_set))
+		      && (int) REGNO (SET_DEST (prev_set)) == sregno
+		      && ((prev_sregno = REGNO (SET_SRC (prev_set)))
+			  >= FIRST_PSEUDO_REGISTER)
+		      && (lra_reg_info[sregno].restore_regno
+			  == lra_reg_info[prev_sregno].restore_regno)
+		      && ! bitmap_bit_p (remove_pseudos, prev_sregno))
+		    {
+		      lra_assert (GET_MODE (SET_SRC (prev_set))
+				  == GET_MODE (regno_reg_rtx[sregno]));
+		      if (GET_CODE (SET_SRC (set)) == SUBREG)
+			SUBREG_REG (SET_SRC (set)) = SET_SRC (prev_set);
+		      else
+			SET_SRC (set) = SET_SRC (prev_set);
+		      lra_push_insn_and_update_insn_regno_info (curr_insn);
+		      lra_set_used_insn_alternative_by_uid
+			(INSN_UID (curr_insn), -1);
+		      done_p = true;
+		      if (lra_dump_file != NULL)
+			{
+			  fprintf (lra_dump_file, "    Change reload insn:\n");
+			  debug_rtl_slim (lra_dump_file,
+					  curr_insn, curr_insn, -1, 0);
+			}
+		    }
+		}
+	    }
+	  if (! done_p)
+	    {
+	      struct lra_insn_reg *reg;
+	      bool insn_change_p = false;
+
+	      curr_id = lra_get_insn_recog_data (curr_insn);
+	      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
+		if ((regno = reg->regno) >= lra_constraint_new_regno_start
+		    && lra_reg_info[regno].restore_regno >= 0)
+		  {
+		    if (change_p && bitmap_bit_p (remove_pseudos, regno))
+		      {
+			restore_regno = lra_reg_info[regno].restore_regno;
+			substitute_pseudo (&curr_insn, regno,
+					   regno_reg_rtx[restore_regno]);
+			insn_change_p = true;
+		      }
+		    else if (NONDEBUG_INSN_P (curr_insn))
+		      {
+			lra_push_insn_and_update_insn_regno_info (curr_insn);
+			lra_set_used_insn_alternative_by_uid
+			  (INSN_UID (curr_insn), -1);
+		      }
+		  }
+	      if (insn_change_p)
+		{
+		  lra_update_insn_regno_info (curr_insn);
+		  if (lra_dump_file != NULL)
+		    {
+		      fprintf (lra_dump_file, "	   Restore original insn:\n");
+		      debug_rtl_slim (lra_dump_file,
+				      curr_insn, curr_insn, -1, 0);
+		    }
+		}
+	    }
+	}
+    }
+  bitmap_clear (&temp_bitmap_head);
+  return change_p;
+}
+
+/* Entry function for undoing inheritance/split transformation.	 Return true
+   if we did any RTL change in this pass.  */
+bool
+lra_undo_inheritance (void)
+{
+  unsigned int regno;
+  int restore_regno, hard_regno;
+  int n_all_inherit, n_inherit, n_all_split, n_split;
+  bitmap_head remove_pseudos;
+  bitmap_iterator bi;
+  bool change_p;
+
+  lra_undo_inheritance_iter++;
+  if (lra_dump_file != NULL)
+    fprintf (lra_dump_file,
+	     "\n********** Undoing inheritance #%d: **********\n\n",
+	     lra_undo_inheritance_iter);
+  bitmap_initialize (&remove_pseudos, &reg_obstack);
+  n_inherit = n_all_inherit = 0;
+  EXECUTE_IF_SET_IN_BITMAP (&lra_inheritance_pseudos, 0, regno, bi)
+    if (lra_reg_info[regno].restore_regno >= 0)
+      {
+	n_all_inherit++;
+	if (reg_renumber[regno] < 0)
+	  bitmap_set_bit (&remove_pseudos, regno);
+	else
+	  n_inherit++;
+      }
+  if (lra_dump_file != NULL && n_all_inherit != 0)
+    fprintf (lra_dump_file, "Inherit %d out of %d (%.2f%%)\n",
+	     n_inherit, n_all_inherit,
+	     (double) n_inherit / n_all_inherit * 100);
+  n_split = n_all_split = 0;
+  EXECUTE_IF_SET_IN_BITMAP (&lra_split_pseudos, 0, regno, bi)
+    if ((restore_regno = lra_reg_info[regno].restore_regno) >= 0)
+      {
+	n_all_split++;
+	hard_regno = (restore_regno >= FIRST_PSEUDO_REGISTER
+		      ? reg_renumber[restore_regno] : restore_regno);
+	if (hard_regno < 0 || reg_renumber[regno] == hard_regno)
+	  bitmap_set_bit (&remove_pseudos, regno);
+	else
+	  {
+	    n_split++;
+	    if (lra_dump_file != NULL)
+	      fprintf (lra_dump_file, "	     Keep split r%d (orig=r%d)\n",
+		       regno, restore_regno);
+	  }
+      }
+  if (lra_dump_file != NULL && n_all_split != 0)
+    fprintf (lra_dump_file, "Split %d out of %d (%.2f%%)\n",
+	     n_split, n_all_split,
+	     (double) n_split / n_all_split * 100);
+  change_p = remove_inheritance_pseudos (&remove_pseudos);
+  bitmap_clear (&remove_pseudos);
+  /* Clear restore_regnos.  */
+  EXECUTE_IF_SET_IN_BITMAP (&lra_inheritance_pseudos, 0, regno, bi)
+    lra_reg_info[regno].restore_regno = -1;
+  EXECUTE_IF_SET_IN_BITMAP (&lra_split_pseudos, 0, regno, bi)
+    lra_reg_info[regno].restore_regno = -1;
+  return change_p;
+}
Index: lra.c
===================================================================
--- lra.c	(revision 0)
+++ lra.c	(working copy)
@@ -0,0 +1,2368 @@ 
+/* LRA (local register allocator) driver and LRA utilities.
+   Copyright (C) 2010, 2011, 2012
+   Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.	If not see
+<http://www.gnu.org/licenses/>.	 */
+
+
+/* The Local Register Allocator (LRA) is a replacement of the former
+   reload pass.	 It is focused on simplifying the code solving the
+   reload pass tasks, making the code maintenance easier, and
+   implementing new promising optimizations.
+
+   The major LRA design solutions are:
+     o division into small, manageable, separate sub-tasks
+     o reflection of all transformations and decisions in RTL as much
+       as possible
+     o insn constraints as a primary source of the info (minimizing
+       the number of target-dependent macros/hooks)
+
+   In brief, LRA works by iteratively processing insns with the final
+   goal of satisfying all insn and address constraints:
+     o New reload insns (in brief, reloads) and reload pseudos might
+       be generated;
+     o Some pseudos might be spilled to assign hard registers to
+       new reload pseudos;
+     o Some pseudos are bound.	It means they always get the same
+       memory or hard register;
+     o Spilled pseudos are changed to stack memory or their
+       equivalences;
+     o Allocating stack memory changes address displacements, so a
+       new iteration is needed.
+
+   Here is a block diagram of the LRA passes:
+
+	  ---------------------				    
+	 | Undo inheritance    |      ---------------	     --------------- 
+	 |(for spilled pseudos)|     | Memory-memory |	    | New (and old) |
+	 | and splits (for     |<----| move coalesce |<-----|	 pseudos    |
+	 | pseudos got the     |      ---------------	    |	assignment  |
+  Start	 |  same  hard regs)   |			     --------------- 
+    |	  ---------------------					    ^
+    V		  |		 ----------------		    |
+ -----------	  V		| Update virtual |		    |
+|  Remove   |----> ------------>|    register	 |		    |
+| scratches |	  ^		|  displacements |		    |
+ -----------	  |		 ----------------		    |
+		  |			 |			    |
+		  |			 V	   New		    |
+	 ----------------    No	   ------------	 pseudos   -------------------
+	| Spilled pseudo | change |Constraints:| or insns | Inheritance/split |
+	|    to memory	 |<-------|    RTL     |--------->|  transformations  |
+	|  substitution	 |	  | transfor-  |	  |    in EBB scope   |
+	 ----------------	  |  mations   |	   -------------------
+		|		    ------------ 
+		V
+    -------------------------
+   | Hard regs substitution, |
+   |  devirtualization, and  |------> Finish
+   | restoring scratches got |
+   |	     memory	     |
+    -------------------------
+
+   To speed up the process:
+     o We process only insns affected by changes on previous
+       iterations;
+     o We don't use the DFA infrastructure because it results in much
+       slower compilation than the special IR described below;
+     o We use a special insn representation for quick access to insn
+       info which is always *synchronized* with the current RTL;
+       o Insn IR is minimized in memory.  It is divided into three parts:
+	 o one specific for each insn in RTL (only operand locations);
+	 o one common for all insns in RTL with the same insn code
+	   (different operand attributes from machine descriptions);
+	 o one oriented for maintenance of live info (list of pseudos).
+       o Pseudo data:
+	 o all insns where the pseudo is referenced;
+	 o live info (conflicting hard regs, live ranges, # of
+	   references etc);
+	 o data used for assigning (preferred hard regs, costs etc).
+
+   This file contains LRA driver, LRA utility functions and data, and
+   code for dealing with scratches.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "rtl.h"
+#include "tm_p.h"
+#include "regs.h"
+#include "insn-config.h"
+#include "insn-codes.h"
+#include "recog.h"
+#include "output.h"
+#include "addresses.h"
+#include "hard-reg-set.h"
+#include "flags.h"
+#include "function.h"
+#include "expr.h"
+#include "basic-block.h"
+#include "except.h"
+#include "tree-pass.h"
+#include "timevar.h"
+#include "target.h"
+#include "vec.h"
+#include "ira.h"
+#include "lra-int.h"
+#include "df.h"
+
+/* Hard registers currently not available for allocation.  The set
+   can change after some hard registers become not eliminable.  */
+HARD_REG_SET lra_no_alloc_regs;
+
+static int get_new_reg_value (void);
+static void expand_reg_info (void);
+static void invalidate_insn_recog_data (int);
+static int get_insn_freq (rtx);
+static void invalidate_insn_data_regno_info (lra_insn_recog_data_t, rtx, int);
+
+/* Expand all regno related info needed for LRA.  */
+static void
+expand_reg_data (void)
+{
+  resize_reg_info ();
+  expand_reg_info ();
+  ira_expand_reg_equiv ();
+}
+
+/* Create and return a new reg from register ORIGINAL corresponding
+   to a machine description operand of mode MD_MODE.  Initialize its
+   register class to RCLASS.  Print a message about assigning class
+   RCLASS containing the new register name TITLE unless it is NULL.
+   The created register will hold a unique value.  */
+rtx
+lra_create_new_reg_with_unique_value (enum machine_mode md_mode, rtx original,
+				      enum reg_class rclass, const char *title)
+{
+  enum machine_mode mode;
+  rtx new_reg;
+
+  if (original == NULL_RTX || (mode = GET_MODE (original)) == VOIDmode)
+    mode = md_mode;
+  lra_assert (mode != VOIDmode);
+  new_reg = gen_reg_rtx (mode);
+  if (original == NULL_RTX || ! REG_P (original))
+    {
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "      Creating newreg=%i", REGNO (new_reg));
+    }
+  else
+    {
+      if (ORIGINAL_REGNO (original) >= FIRST_PSEUDO_REGISTER)
+	ORIGINAL_REGNO (new_reg) = ORIGINAL_REGNO (original);
+      REG_USERVAR_P (new_reg) = REG_USERVAR_P (original);
+      REG_POINTER (new_reg) = REG_POINTER (original);
+      REG_ATTRS (new_reg) = REG_ATTRS (original);
+      if (lra_dump_file != NULL)
+	fprintf (lra_dump_file, "      Creating newreg=%i from oldreg=%i",
+		 REGNO (new_reg), REGNO (original));
+    }
+  if (lra_dump_file != NULL)
+    {
+      if (title != NULL)
+	fprintf (lra_dump_file, ", assigning class %s to%s%s r%d",
+		 reg_class_names[rclass], *title == '\0' ? "" : " ",
+		 title, REGNO (new_reg));
+      fprintf (lra_dump_file, "\n");
+    }
+  expand_reg_data ();
+  setup_reg_classes (REGNO (new_reg), rclass, NO_REGS, rclass);
+  return new_reg;
+}
+
+/* Analogous to the previous function but also inherits value of
+   ORIGINAL.  */
+rtx
+lra_create_new_reg (enum machine_mode md_mode, rtx original,
+		    enum reg_class rclass, const char *title)
+{
+  rtx new_reg;
+
+  new_reg
+    = lra_create_new_reg_with_unique_value (md_mode, original, rclass, title);
+  if (original != NULL_RTX && REG_P (original))
+    lra_reg_info[REGNO (new_reg)].val = lra_reg_info[REGNO (original)].val;
+  return new_reg;
+}
+
+/* Set up a unique value to be held by REGNO.	*/
+void
+lra_set_regno_unique_value (int regno)
+{
+  lra_reg_info[regno].val = get_new_reg_value ();
+}
+
+/* Invalidate INSN related info used by LRA.  */
+void
+lra_invalidate_insn_data (rtx insn)
+{
+  lra_invalidate_insn_regno_info (insn);
+  invalidate_insn_recog_data (INSN_UID (insn));
+}
+
+/* Mark INSN deleted and invalidate the insn related info used by
+   LRA.	 */
+void
+lra_set_insn_deleted (rtx insn)
+{
+  lra_invalidate_insn_data (insn);
+  SET_INSN_DELETED (insn);
+}
+
+/* Delete an unneeded INSN and any previous insns whose sole purpose
+   is loading data that is dead in INSN.  */
+void
+lra_delete_dead_insn (rtx insn)
+{
+  rtx prev = prev_real_insn (insn);
+  rtx prev_dest;
+
+  /* If the previous insn sets a register that dies in our insn,
+     delete it too.  */
+  if (prev && GET_CODE (PATTERN (prev)) == SET
+      && (prev_dest = SET_DEST (PATTERN (prev)), REG_P (prev_dest))
+      && reg_mentioned_p (prev_dest, PATTERN (insn))
+      && find_regno_note (insn, REG_DEAD, REGNO (prev_dest))
+      && ! side_effects_p (SET_SRC (PATTERN (prev))))
+    lra_delete_dead_insn (prev);
+
+  lra_set_insn_deleted (insn);
+}
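+
+/* An illustration (editor's sketch, not from the patch): given
+
+     insn1: r100 <- expr		(expr has no side effects)
+     insn2: ... r100 ...		(with a REG_DEAD note for r100)
+
+   lra_delete_dead_insn (insn2) deletes insn2 and then recursively
+   deletes insn1 as well, since its only purpose was to load a value
+   that is dead in insn2.  */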
+
+/* The target checks operands through operand predicates to recognize
+   an insn.  We should take special care when generating add insns,
+   which are a frequent result of elimination.
+
+   Emit insns for x = y + z.  X can be used to store intermediate
+   values and therefore should not appear in Y or Z when we use X to
+   store an intermediate value.	 */
+void
+lra_emit_add (rtx x, rtx y, rtx z)
+{
+  int old;
+  rtx insn, last;
+  rtx a1, a2, base, index, disp, scale, index_scale;
+  bool ok_p;
+
+  insn = gen_add3_insn (x, y, z);
+  old = max_reg_num ();
+  if (insn != NULL_RTX)
+    emit_insn (insn);
+  else
+    {
+      disp = a2 = NULL_RTX;
+      if (GET_CODE (y) == PLUS)
+	{
+	  a1 = XEXP (y, 0);
+	  a2 = XEXP (y, 1);
+	  disp = z;
+	}
+      else
+	{
+	  a1 = y;
+	  if (CONSTANT_P (z))
+	    disp = z;
+	  else
+	    a2 = z;
+	}
+      index_scale = scale = NULL_RTX;
+      if (GET_CODE (a1) == MULT)
+	{
+	  index_scale = a1;
+	  index = XEXP (a1, 0);
+	  scale = XEXP (a1, 1);
+	  base = a2;
+	}
+      else if (a2 != NULL_RTX && GET_CODE (a2) == MULT)
+	{
+	  index_scale = a2;
+	  index = XEXP (a2, 0);
+	  scale = XEXP (a2, 1);
+	  base = a1;
+	}
+      else
+	{
+	  base = a1;
+	  index = a2;
+	}
+      if (! REG_P (base)
+	  || (index != NULL_RTX && ! REG_P (index))
+	  || (disp != NULL_RTX && ! CONSTANT_P (disp))
+	  || (scale != NULL_RTX && ! CONSTANT_P (scale)))
+	{
+	  /* It is not an address generation.  Probably we have no
+	     3-op add.	The last chance is to use a 2-op add insn.  */
+	  lra_assert (x != y && x != z);
+	  emit_move_insn (x, z);
+	  insn = gen_add2_insn (x, y);
+	  emit_insn (insn);
+	}
+      else
+	{
+	  if (index_scale == NULL_RTX)
+	    index_scale = index;
+	  if (disp == NULL_RTX)
+	    {
+	      /* Generate x = index_scale; x = x + base.  */
+	      lra_assert (index_scale != NULL_RTX && base != NULL_RTX);
+	      emit_move_insn (x, index_scale);
+	      insn = gen_add2_insn (x, base);
+	      emit_insn (insn);
+	    }
+	  else if (scale == NULL_RTX)
+	    {
+	      /* Try x = base + disp.  */
+	      lra_assert (base != NULL_RTX);
+	      last = get_last_insn ();
+	      insn = emit_move_insn (x, gen_rtx_PLUS (GET_MODE (base),
+						      base, disp));
+	      if (recog_memoized (insn) < 0)
+		{
+		  delete_insns_since (last);
+		  /* Generate x = disp; x = x + base.  */
+		  emit_move_insn (x, disp);
+		  insn = gen_add2_insn (x, base);
+		  emit_insn (insn);
+		}
+	      /* Generate x = x + index.  */
+	      if (index != NULL_RTX)
+		{
+		  insn = gen_add2_insn (x, index);
+		  emit_insn (insn);
+		}
+	    }
+	  else
+	    {
+	      /* Try x = index_scale; x = x + disp; x = x + base.  */
+	      last = get_last_insn ();
+	      insn = emit_move_insn (x, index_scale);
+	      ok_p = false;
+	      if (recog_memoized (insn) >= 0)
+		{
+		  insn = gen_add2_insn (x, disp);
+		  if (insn != NULL_RTX)
+		    {
+		      emit_insn (insn);
+		      insn = gen_add2_insn (x, base);
+		      if (insn != NULL_RTX)
+			{
+			  emit_insn (insn);
+			  ok_p = true;
+			}
+		    }
+		}
+	      if (! ok_p)
+		{
+		  delete_insns_since (last);
+		  /* Generate x = disp; x = x + base; x = x + index_scale.  */
+		  emit_move_insn (x, disp);
+		  insn = gen_add2_insn (x, base);
+		  emit_insn (insn);
+		  insn = gen_add2_insn (x, index_scale);
+		  emit_insn (insn);
+		}
+	    }
+	}
+    }
+  /* Functions emit_... can create pseudos -- so expand the pseudo
+     data.  */
+  if (old != max_reg_num ())
+    expand_reg_data ();
+}
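+
+/* A worked example (editor's sketch, not part of the patch): for a
+   target without a 3-operand add, a call
+
+     lra_emit_add (x, y, z)
+
+   with y = (plus (mult r1 (const_int 4)) r2) and z = (const_int 8)
+   decomposes the sum as index r1, scale 4, base r2 and displacement
+   8, and the last branch above would try to emit
+
+     x <- r1 * 4
+     x <- x + 8
+     x <- x + r2
+
+   falling back to other orderings when some of the moves or adds are
+   not recognized.  */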
+
+/* The number of emitted reload insns so far.  */
+int lra_curr_reload_num;
+
+/* Emit x := y, processing the special case when y = u + v or y = u +
+   v * scale + w through emit_add (Y can be an address which is base +
+   index reg * scale + displacement in the general case).  X may be
+   used as an intermediate result, therefore it should not be in Y.  */
+void
+lra_emit_move (rtx x, rtx y)
+{
+  int old;
+
+  if (GET_CODE (y) != PLUS)
+    {
+      if (rtx_equal_p (x, y))
+	return;
+      old = max_reg_num ();
+      emit_move_insn (x, y);
+      if (REG_P (x))
+	lra_reg_info[ORIGINAL_REGNO (x)].last_reload = ++lra_curr_reload_num;
+      /* Function emit_move can create pseudos -- so expand the pseudo
+	 data.	*/
+      if (old != max_reg_num ())
+	expand_reg_data ();
+      return;
+    }
+  lra_emit_add (x, XEXP (y, 0), XEXP (y, 1));
+}
+
+/* Update insn operands which are duplicates of operands whose
+   numbers are in the array NOPS (terminated by -1).  The insn is
+   represented by its LRA internal representation ID.  */
+void
+lra_update_dups (lra_insn_recog_data_t id, signed char *nops)
+{
+  int i, j, nop;
+  struct lra_static_insn_data *static_id = id->insn_static_data;
+
+  for (i = 0; i < static_id->n_dups; i++)
+    for (j = 0; (nop = nops[j]) >= 0; j++)
+      if (static_id->dup_num[i] == nop)
+	*id->dup_loc[i] = *id->operand_loc[nop];
+}
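+
+/* Usage note (editor's, not from the patch): if, say, operand 1 of
+   an insn was changed and the pattern contains a match_dup of
+   operand 1, then calling
+
+     signed char nops[] = { 1, -1 };
+     lra_update_dups (id, nops);
+
+   copies the new operand 1 into the duplicate's location so the insn
+   stays self-consistent.  */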
+
+
+
+/* This page contains code dealing with info about registers in the
+   insns.  */
+
+/* Pools for insn reg info.  */
+static alloc_pool insn_reg_pool;
+
+/* Initiate pool for insn reg info.  */
+static void
+init_insn_regs (void)
+{
+  insn_reg_pool
+    = create_alloc_pool ("insn regs", sizeof (struct lra_insn_reg), 100);
+}
+
+/* Create LRA insn related info about referenced REGNO with TYPE
+   (in/out/inout), biggest reference mode MODE, flag that it is
+   reference through subreg (SUBREG_P), flag that is early clobbered
+   in the insn (EARLY_CLOBBER), and reference to the next insn reg
+   info (NEXT).	 */
+static struct lra_insn_reg *
+new_insn_reg (int regno, enum op_type type, enum machine_mode mode,
+	      bool subreg_p, bool early_clobber, struct lra_insn_reg *next)
+{
+  struct lra_insn_reg *ir;
+
+  ir = (struct lra_insn_reg *) pool_alloc (insn_reg_pool);
+  ir->type = type;
+  ir->biggest_mode = mode;
+  ir->subreg_p = subreg_p;
+  ir->early_clobber = early_clobber;
+  ir->regno = regno;
+  ir->next = next;
+  return ir;
+}
+
+/* Free insn reg info IR.  */
+static void
+free_insn_reg (struct lra_insn_reg *ir)
+{
+  pool_free (insn_reg_pool, ir);
+}
+
+/* Free insn reg info list IR.	*/
+static void
+free_insn_regs (struct lra_insn_reg *ir)
+{
+  struct lra_insn_reg *next_ir;
+
+  for (; ir != NULL; ir = next_ir)
+    {
+      next_ir = ir->next;
+      free_insn_reg (ir);
+    }
+}
+
+/* Finish pool for insn reg info.  */
+static void
+finish_insn_regs (void)
+{
+  free_alloc_pool (insn_reg_pool);
+}
+
+
+
+/* This page contains code dealing with LRA insn info (in other
+   words, the LRA internal insn representation).  */
+
+
+/* Map INSN_CODE -> the static insn data.  This info is valid for the
+   whole translation unit.  */
+struct lra_static_insn_data *insn_code_data[LAST_INSN_CODE];
+
+/* Map INSN_UID -> the operand alternative data (NULL if unknown).  We
+   assume that this data is valid until register info is changed
+   because classes in the data can be changed.	*/
+struct operand_alternative *op_alt_data[LAST_INSN_CODE];
+
+/* Debug insns are represented as a special insn with one input
+   operand which is an RTL expression in var_location.  */
+
+/* The following data are used as static insn operand data for all
+   debug insns.	 If structure lra_operand_data is changed, the
+   initializer should be changed too.  */
+static struct lra_operand_data debug_operand_data =
+  {
+    NULL, /* alternative  */
+    VOIDmode, /* We are not interested in the operand mode.  */
+    OP_IN,
+    0, 0, 0 
+  };
+
+/* The following data are used as static insn data for all debug
+   insns.  If structure lra_static_insn_data is changed, the
+   initializer should be changed too.  */
+static struct lra_static_insn_data debug_insn_static_data =
+  {
+    &debug_operand_data,
+    0,	/* Duplication operands #.  */
+    -1, /* Commutative operand #.  */
+    1,	/* Operands #.	There is only one operand, which is the debug
+	   RTL expression.  */
+    0,	/* Duplications #.  */
+    0,	/* Alternatives #.  We are not interested in alternatives
+	   because we don't process debug insns for reloads.  */
+    NULL, /* Hard registers referenced in machine description.	*/
+    NULL  /* Descriptions of operands in alternatives.	*/
+  };
+
+/* Called once per compiler run to initialize some LRA data related
+   to insns.  */
+static void
+init_insn_code_data_once (void)
+{
+  memset (insn_code_data, 0, sizeof (insn_code_data));
+  memset (op_alt_data, 0, sizeof (op_alt_data));
+}
+
+/* Called once per compiler run to finalize some LRA data related to
+   insns.  */
+static void
+finish_insn_code_data_once (void)
+{
+  int i;
+
+  for (i = 0; i < LAST_INSN_CODE; i++)
+    {
+      if (insn_code_data[i] != NULL)
+	free (insn_code_data[i]);
+      if (op_alt_data[i] != NULL)
+	free (op_alt_data[i]);
+    }
+}
+
+/* Initialize LRA info about operands in insn alternatives.  */
+static void
+init_op_alt_data (void)
+{
+  int i;
+
+  for (i = 0; i < LAST_INSN_CODE; i++)
+    if (op_alt_data[i] != NULL)
+      {
+	free (op_alt_data[i]);
+	op_alt_data[i] = NULL;
+      }
+}
+
+/* Return static insn data, allocating and setting it up if
+   necessary.  Although dup_num is static data (it depends only on
+   icode), to set it up we need to extract the insn first.  So
+   recog_data should be valid for a normal insn (ICODE >= 0) before
+   the call.  */
+static struct lra_static_insn_data *
+get_static_insn_data (int icode, int nop, int ndup, int nalt)
+{
+  struct lra_static_insn_data *data;
+
+  lra_assert (icode < LAST_INSN_CODE);
+  if (icode >= 0 && (data = insn_code_data[icode]) != NULL)
+    return data;
+  lra_assert (nop >= 0 && ndup >= 0 && nalt >= 0);
+  data = ((struct lra_static_insn_data *)
+	  xmalloc (sizeof (struct lra_static_insn_data)
+		   + sizeof (struct lra_operand_data) * nop
+		   + sizeof (int) * ndup));
+  data->n_operands = nop;
+  data->n_dups = ndup;
+  data->n_alternatives = nalt;
+  data->operand = ((struct lra_operand_data *)
+		   ((char *) data + sizeof (struct lra_static_insn_data)));
+  data->dup_num = ((int *) ((char *) data->operand
+			    + sizeof (struct lra_operand_data) * nop));
+  if (icode >= 0)
+    {
+      int i;
+
+      insn_code_data[icode] = data;
+      for (i = 0; i < nop; i++)
+	{
+	  data->operand[i].constraint
+	    = insn_data[icode].operand[i].constraint;
+	  data->operand[i].mode = insn_data[icode].operand[i].mode;
+	  data->operand[i].strict_low = insn_data[icode].operand[i].strict_low;
+	  data->operand[i].is_operator
+	    = insn_data[icode].operand[i].is_operator;
+	  data->operand[i].type
+	    = (data->operand[i].constraint[0] == '=' ? OP_OUT
+	       : data->operand[i].constraint[0] == '+' ? OP_INOUT
+	       : OP_IN);
+	}
+      for (i = 0; i < ndup; i++)
+	data->dup_num[i] = recog_data.dup_num[i];
+    }
+  return data;
+}
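+
+/* The single allocation above is laid out as follows (editor's
+   illustration):
+
+     [struct lra_static_insn_data][nop operand structs][ndup ints]
+      ^ data                       ^ data->operand      ^ data->dup_num
+*/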
+
+/* The current length of the following array.  */
+int lra_insn_recog_data_len;
+
+/* Map INSN_UID -> the insn recog data (NULL if unknown).  */
+lra_insn_recog_data_t *lra_insn_recog_data;
+
+/* Initialize LRA data about insns.  */
+static void
+init_insn_recog_data (void)
+{
+  lra_insn_recog_data_len = 0;
+  lra_insn_recog_data = NULL;
+  init_insn_regs ();
+}
+
+/* Expand, if necessary, LRA data about insns.	*/
+static void
+check_and_expand_insn_recog_data (int index)
+{
+  int i, old;
+
+  if (lra_insn_recog_data_len > index)
+    return;
+  old = lra_insn_recog_data_len;
+  lra_insn_recog_data_len = index * 3 / 2 + 1;
+  lra_insn_recog_data
+    = (lra_insn_recog_data_t *) xrealloc (lra_insn_recog_data,
+					  lra_insn_recog_data_len
+					  * sizeof (lra_insn_recog_data_t));
+  for (i = old; i < lra_insn_recog_data_len; i++)
+    lra_insn_recog_data[i] = NULL;
+}
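+
+/* For example, if lra_insn_recog_data_len is 64 and an insn with UID
+   100 appears, the length becomes 100 * 3 / 2 + 1 == 151 and
+   elements 64..150 are set to NULL; growing by a factor of 1.5 keeps
+   the number of reallocations logarithmic in the maximal UID.  */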
+
+/* Finish LRA DATA about an insn.  */
+static void
+free_insn_recog_data (lra_insn_recog_data_t data)
+{
+  if (data->operand_loc != NULL)
+    free (data->operand_loc);
+  if (data->dup_loc != NULL)
+    free (data->dup_loc);
+  if (data->arg_hard_regs != NULL)
+    free (data->arg_hard_regs);
+#ifdef HAVE_ATTR_enabled
+  if (data->alternative_enabled_p != NULL)
+    free (data->alternative_enabled_p);
+#endif
+  if (data->icode < 0 && NONDEBUG_INSN_P (data->insn))
+    {
+      if (data->insn_static_data->operand_alternative != NULL)
+	free (data->insn_static_data->operand_alternative);
+      free_insn_regs (data->insn_static_data->hard_regs);
+      free (data->insn_static_data);
+    }
+  free_insn_regs (data->regs);
+  data->regs = NULL;
+  free (data);
+}
+
+/* Finish LRA data about all insns.  */
+static void
+finish_insn_recog_data (void)
+{
+  int i;
+  lra_insn_recog_data_t data;
+
+  for (i = 0; i < lra_insn_recog_data_len; i++)
+    if ((data = lra_insn_recog_data[i]) != NULL)
+      free_insn_recog_data (data);
+  finish_insn_regs ();
+  free (lra_insn_recog_data);
+}
+
+/* Set up info about the operands in the alternatives of the insn
+   given by LRA DATA.  */
+static void
+setup_operand_alternative (lra_insn_recog_data_t data)
+{
+  int i, nop, nalt;
+  int icode = data->icode;
+  struct lra_static_insn_data *static_data = data->insn_static_data;
+
+  if (icode >= 0
+      && (static_data->operand_alternative = op_alt_data[icode]) != NULL)
+    return;
+  static_data->commutative = -1;
+  nop = static_data->n_operands;
+  if (nop == 0)
+    {
+      static_data->operand_alternative = NULL;
+      return;
+    }
+  nalt = static_data->n_alternatives;
+  static_data->operand_alternative
+    = ((struct operand_alternative *)
+       xmalloc (nalt * nop * sizeof (struct operand_alternative)));
+  memset (static_data->operand_alternative, 0,
+	  nalt * nop * sizeof (struct operand_alternative));
+  if (icode >= 0)
+    op_alt_data[icode] = static_data->operand_alternative;
+  for (i = 0; i < nop; i++)
+    {
+      int j;
+      struct operand_alternative *op_alt_start, *op_alt;
+      const char *p = static_data->operand[i].constraint;
+
+      static_data->operand[i].early_clobber = 0;
+      op_alt_start = &static_data->operand_alternative[i];
+
+      for (j = 0; j < nalt; j++)
+	{
+	  op_alt = op_alt_start + j * nop;
+	  op_alt->cl = NO_REGS;
+	  op_alt->constraint = p;
+	  op_alt->matches = -1;
+	  op_alt->matched = -1;
+
+	  if (*p == '\0' || *p == ',')
+	    {
+	      op_alt->anything_ok = 1;
+	      continue;
+	    }
+
+	  for (;;)
+	    {
+	      char c = *p;
+	      if (c == '#')
+		do
+		  c = *++p;
+		while (c != ',' && c != '\0');
+	      if (c == ',' || c == '\0')
+		{
+		  p++;
+		  break;
+		}
+
+	      switch (c)
+		{
+		case '=': case '+': case '*':
+		case 'E': case 'F': case 'G': case 'H':
+		case 's': case 'i': case 'n':
+		case 'I': case 'J': case 'K': case 'L':
+		case 'M': case 'N': case 'O': case 'P':
+		  /* These don't say anything we care about.  */
+		  break;
+
+		case '%':
+		  /* We currently only support one commutative pair of
+		     operands.	*/
+		  if (static_data->commutative < 0)
+		    static_data->commutative = i;
+		  else
+		    lra_assert (data->icode < 0); /* Asm  */
+
+		  /* The last operand should not be marked
+		     commutative.  */
+		  lra_assert (i != nop - 1);
+		  break;
+
+		case '?':
+		  op_alt->reject += 6;
+		  break;
+		case '!':
+		  op_alt->reject += 600;
+		  break;
+		case '&':
+		  op_alt->earlyclobber = 1;
+		  static_data->operand[i].early_clobber = 1;
+		  break;
+
+		case '0': case '1': case '2': case '3': case '4':
+		case '5': case '6': case '7': case '8': case '9':
+		  {
+		    char *end;
+		    op_alt->matches = strtoul (p, &end, 10);
+		    static_data->operand_alternative
+		      [j * nop + op_alt->matches].matched = i;
+		    p = end;
+		  }
+		  continue;
+
+		case TARGET_MEM_CONSTRAINT:
+		  op_alt->memory_ok = 1;
+		  break;
+		case '<':
+		  op_alt->decmem_ok = 1;
+		  break;
+		case '>':
+		  op_alt->incmem_ok = 1;
+		  break;
+		case 'V':
+		  op_alt->nonoffmem_ok = 1;
+		  break;
+		case 'o':
+		  op_alt->offmem_ok = 1;
+		  break;
+		case 'X':
+		  op_alt->anything_ok = 1;
+		  break;
+
+		case 'p':
+		  op_alt->is_address = 1;
+		  op_alt->cl = (reg_class_subunion[(int) op_alt->cl]
+				[(int) base_reg_class (VOIDmode,
+						       ADDR_SPACE_GENERIC,
+						       ADDRESS, SCRATCH)]);
+		  break;
+
+		case 'g':
+		case 'r':
+		  op_alt->cl =
+		   reg_class_subunion[(int) op_alt->cl][(int) GENERAL_REGS];
+		  break;
+
+		default:
+		  if (EXTRA_MEMORY_CONSTRAINT (c, p))
+		    {
+		      op_alt->memory_ok = 1;
+		      break;
+		    }
+		  if (EXTRA_ADDRESS_CONSTRAINT (c, p))
+		    {
+		      op_alt->is_address = 1;
+		      op_alt->cl
+			= (reg_class_subunion
+			   [(int) op_alt->cl]
+			   [(int) base_reg_class (VOIDmode, ADDR_SPACE_GENERIC,
+						  ADDRESS, SCRATCH)]);
+		      break;
+		    }
+
+		  op_alt->cl
+		    = (reg_class_subunion
+		       [(int) op_alt->cl]
+		       [(int)
+			REG_CLASS_FROM_CONSTRAINT ((unsigned char) c, p)]);
+		  break;
+		}
+	      p += CONSTRAINT_LEN (c, p);
+	    }
+	}
+    }
+}
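+
+/* As an illustration (a made-up constraint string, not one from the
+   machine description), parsing "=r,m" for an operand with the
+   function above yields two operand_alternative entries: in
+   alternative 0 the 'r' sets cl to GENERAL_REGS, and in alternative
+   1 the 'm' (TARGET_MEM_CONSTRAINT) sets memory_ok.  A '&' in either
+   alternative would also set earlyclobber and the operand's
+   early_clobber flag.  */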
+
+/* Recursively process X and collect info about registers in X which
+   are not insn operands, with operand type TYPE (in/out/inout) and
+   flag EARLY_CLOBBER saying whether they are early clobbered in the
+   insn, and add the info to LIST.  X is a part of the insn given by
+   DATA.  Return the resulting list.  */
+static struct lra_insn_reg *
+collect_non_operand_hard_regs (rtx *x, lra_insn_recog_data_t data,
+			       struct lra_insn_reg *list,
+			       enum op_type type, bool early_clobber)
+{
+  int i, j, regno, last;
+  bool subreg_p;
+  enum machine_mode mode;
+  struct lra_insn_reg *curr;
+  rtx op = *x;
+  enum rtx_code code = GET_CODE (op);
+  const char *fmt = GET_RTX_FORMAT (code);
+
+  for (i = 0; i < data->insn_static_data->n_operands; i++)
+    if (x == data->operand_loc[i])
+      /* It is an operand loc. Stop here.  */
+      return list;
+  for (i = 0; i < data->insn_static_data->n_dups; i++)
+    if (x == data->dup_loc[i])
+      /* It is a dup loc. Stop here.  */
+      return list;
+  mode = GET_MODE (op);
+  subreg_p = false;
+  if (code == SUBREG)
+    {
+      op = SUBREG_REG (op);
+      code = GET_CODE (op);
+      if (GET_MODE_SIZE (mode) < GET_MODE_SIZE (GET_MODE (op)))
+	{
+	  mode = GET_MODE (op);
+	  if (GET_MODE_SIZE (mode) > REGMODE_NATURAL_SIZE (mode))
+	    subreg_p = true;
+	}
+    }
+  if (REG_P (op))
+    {
+      if ((regno = REGNO (op)) >= FIRST_PSEUDO_REGISTER)
+	return list;
+      for (last = regno + hard_regno_nregs[regno][mode];
+	   regno < last;
+	   regno++)
+	if (! TEST_HARD_REG_BIT (lra_no_alloc_regs, regno))
+	  {
+	    for (curr = list; curr != NULL; curr = curr->next)
+	      if (curr->regno == regno)
+		break;
+	    if (curr == NULL || curr->subreg_p != subreg_p
+		|| curr->biggest_mode != mode)
+	      {
+		/* This is a new hard regno, or the info cannot be
+		   integrated into the found structure.  */
+#ifdef STACK_REGS
+		early_clobber
+		  = (early_clobber
+		     /* This clobber is to inform popping floating
+			point stack only.  */
+		     && ! (FIRST_STACK_REG <= regno
+			   && regno <= LAST_STACK_REG));
+#endif
+		list = new_insn_reg (regno, type, mode, subreg_p,
+				     early_clobber, list);
+	      }
+	    else
+	      {
+		if (curr->type != type)
+		  curr->type = OP_INOUT;
+		if (curr->early_clobber != early_clobber)
+		  curr->early_clobber = true;
+	      }
+	  }
+      return list;
+    }
+  switch (code)
+    {
+    case SET:
+      list = collect_non_operand_hard_regs (&SET_DEST (op), data,
+					    list, OP_OUT, false);
+      list = collect_non_operand_hard_regs (&SET_SRC (op), data,
+					    list, OP_IN, false);
+      break;
+    case CLOBBER:
+      /* We treat a clobber of non-operand hard registers as an early
+	 clobber (this behavior is expected from asms).  */
+      list = collect_non_operand_hard_regs (&XEXP (op, 0), data,
+					    list, OP_OUT, true);
+      break;
+    case PRE_INC: case PRE_DEC: case POST_INC: case POST_DEC:
+      list = collect_non_operand_hard_regs (&XEXP (op, 0), data,
+					    list, OP_INOUT, false);
+      break;
+    case PRE_MODIFY: case POST_MODIFY:
+      list = collect_non_operand_hard_regs (&XEXP (op, 0), data,
+					    list, OP_INOUT, false);
+      list = collect_non_operand_hard_regs (&XEXP (op, 1), data,
+					    list, OP_IN, false);
+      break;
+    default:
+      fmt = GET_RTX_FORMAT (code);
+      for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+	{
+	  if (fmt[i] == 'e')
+	    list = collect_non_operand_hard_regs (&XEXP (op, i), data,
+						  list, OP_IN, false);
+	  else if (fmt[i] == 'E')
+	    for (j = XVECLEN (op, i) - 1; j >= 0; j--)
+	      list = collect_non_operand_hard_regs (&XVECEXP (op, i, j), data,
+						    list, OP_IN, false);
+	}
+    }
+  return list;
+}
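+
+/* For instance, given a pattern like
+
+     (parallel [(set (reg:SI 100) (neg:SI (reg:SI 101)))
+		(clobber (reg:CC 17 flags))])
+
+   (a hypothetical x86 insn), the function above records hard
+   register 17 as OP_OUT with the early clobber flag set, because the
+   flags register is not an operand; pseudos 100 and 101 are skipped
+   in any case -- their locations are operand locations, and pseudo
+   registers are ignored here anyway.  */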
+
+/* Return info about INSN, setting the info up if it is not set up
+   yet.  */
+lra_insn_recog_data_t
+lra_set_insn_recog_data (rtx insn)
+{
+  lra_insn_recog_data_t data;
+  int i, n, icode;
+  rtx **locs;
+  unsigned int uid = INSN_UID (insn);
+  struct lra_static_insn_data *insn_static_data;
+
+  check_and_expand_insn_recog_data (uid);
+  if (DEBUG_INSN_P (insn))
+    icode = -1;
+  else
+    {
+      icode = INSN_CODE (insn);
+      if (icode < 0)
+	/* It might be a new simple insn which is not recognized yet.  */
+	INSN_CODE (insn) = icode = recog (PATTERN (insn), insn, 0);
+    }
+  data
+    = (lra_insn_recog_data_t) xmalloc (sizeof (struct lra_insn_recog_data));
+  lra_insn_recog_data[uid] = data;
+  data->insn = insn;
+  data->used_insn_alternative = -1;
+  data->icode = icode;
+  data->regs = NULL;
+  if (DEBUG_INSN_P (insn))
+    {
+      data->insn_static_data = &debug_insn_static_data;
+      data->dup_loc = NULL;
+      data->arg_hard_regs = NULL;
+#ifdef HAVE_ATTR_enabled
+      data->alternative_enabled_p = NULL;
+#endif
+      data->operand_loc = (rtx **) xmalloc (sizeof (rtx *));
+      data->operand_loc[0] = &INSN_VAR_LOCATION_LOC (insn);
+      return data;
+    }
+  if (icode < 0)
+    {
+      int nop;
+      enum machine_mode operand_mode[MAX_RECOG_OPERANDS];
+      const char *constraints[MAX_RECOG_OPERANDS];
+
+      nop = asm_noperands (PATTERN (insn));
+      data->operand_loc = data->dup_loc = NULL;
+      if (nop < 0)
+	/* It is a special insn like USE or CLOBBER.  */
+	data->insn_static_data = insn_static_data
+	  = get_static_insn_data (-1, 0, 0, 1);
+      else
+	{
+	  /* expand_asm_operands makes sure there aren't too many
+	     operands.	*/
+	  lra_assert (nop <= MAX_RECOG_OPERANDS);
+	  if (nop != 0)
+	    data->operand_loc = (rtx **) xmalloc (nop * sizeof (rtx *));
+	  /* Now get the operand values and constraints out of the
+	     insn.  */
+	  decode_asm_operands (PATTERN (insn), NULL,
+			       data->operand_loc,
+			       constraints, operand_mode, NULL);
+	  n = 1;
+	  if (nop > 0)
+	    {
+	      const char *p;
+
+	      for (p = constraints[0]; *p; p++)
+		n += *p == ',';
+	    }
+	  data->insn_static_data = insn_static_data
+	    = get_static_insn_data (-1, nop, 0, n);
+	  for (i = 0; i < nop; i++)
+	    {
+	      insn_static_data->operand[i].mode = operand_mode[i];
+	      insn_static_data->operand[i].constraint = constraints[i];
+	      insn_static_data->operand[i].strict_low = false;
+	      insn_static_data->operand[i].is_operator = false;
+	    }
+	}
+      for (i = 0; i < insn_static_data->n_operands; i++)
+	insn_static_data->operand[i].type
+	  = (insn_static_data->operand[i].constraint[0] == '=' ? OP_OUT
+	     : insn_static_data->operand[i].constraint[0] == '+' ? OP_INOUT
+	     : OP_IN);
+#ifdef HAVE_ATTR_enabled
+      data->alternative_enabled_p = NULL;
+#endif
+    }
+  else
+    {
+      insn_extract (insn);
+      data->insn_static_data = insn_static_data
+	= get_static_insn_data (icode, insn_data[icode].n_operands,
+				insn_data[icode].n_dups,
+				insn_data[icode].n_alternatives);
+      n = insn_static_data->n_operands;
+      if (n == 0)
+	locs = NULL;
+      else
+	{
+	  locs = (rtx **) xmalloc (n * sizeof (rtx *));
+	  memcpy (locs, recog_data.operand_loc, n * sizeof (rtx *));
+	}
+      data->operand_loc = locs;
+      n = insn_static_data->n_dups;
+      if (n == 0)
+	locs = NULL;
+      else
+	{
+	  locs = (rtx **) xmalloc (n * sizeof (rtx *));
+	  memcpy (locs, recog_data.dup_loc, n * sizeof (rtx *));
+	}
+      data->dup_loc = locs;
+#ifdef HAVE_ATTR_enabled
+      {
+	bool *bp;
+
+	n = insn_static_data->n_alternatives;
+	lra_assert (n >= 0);
+	data->alternative_enabled_p
+	  = bp = (bool *) xmalloc (n * sizeof (bool));
+	/* Cache the insn because we don't want to call extract_insn
+	   from get_attr_enabled, as extract_insn modifies
+	   which_alternative.  The attribute enabled should not depend
+	   on insn operands, operand modes, operand types, or operand
+	   constraints.  It should depend only on the architecture.
+	   If that is not true, we should rewrite the code in this
+	   file to use extract_insn instead of the less expensive
+	   insn_extract.  */
+	recog_data.insn = insn;
+	for (i = 0; i < n; i++)
+	  {
+	    which_alternative = i;
+	    bp[i] = get_attr_enabled (insn);
+	  }
+      }
+#endif
+    }
+  if (GET_CODE (PATTERN (insn)) == CLOBBER || GET_CODE (PATTERN (insn)) == USE)
+    insn_static_data->hard_regs = NULL;
+  else
+    insn_static_data->hard_regs
+      = collect_non_operand_hard_regs (&PATTERN (insn), data,
+				       NULL, OP_IN, false);
+  setup_operand_alternative (data);
+  data->arg_hard_regs = NULL;
+  if (CALL_P (insn))
+    {
+      rtx link;
+      int n_hard_regs, regno, arg_hard_regs[FIRST_PSEUDO_REGISTER];
+
+      n_hard_regs = 0;
+      /* Find implicit hard register usage.  We believe it will not
+	 be changed whatever transformations are used.	Call insns
+	 are an example.  */
+      for (link = CALL_INSN_FUNCTION_USAGE (insn);
+	   link != NULL_RTX;
+	   link = XEXP (link, 1))
+	if (GET_CODE (XEXP (link, 0)) == USE
+	    && REG_P (XEXP (XEXP (link, 0), 0)))
+	  {
+	    regno = REGNO (XEXP (XEXP (link, 0), 0));
+	    lra_assert (regno < FIRST_PSEUDO_REGISTER);
+	    /* It is an argument register.  */
+	    for (i = (hard_regno_nregs
+		      [regno][GET_MODE (XEXP (XEXP (link, 0), 0))]) - 1;
+		 i >= 0;
+		 i--)
+	      arg_hard_regs[n_hard_regs++] = regno + i;
+	  }
+      if (n_hard_regs != 0)
+	{
+	  arg_hard_regs[n_hard_regs++] = -1;
+	  data->arg_hard_regs = ((int *) xmalloc (n_hard_regs * sizeof (int)));
+	  memcpy (data->arg_hard_regs, arg_hard_regs,
+		  sizeof (int) * n_hard_regs);
+	}
+    }
+  /* Some output operands can be recognized only from the context,
+     not from the constraints, which are empty in this case.  A call
+     insn may contain a hard register in a set destination with an
+     empty constraint, and extract_insn treats it as an input.  */
+  for (i = 0; i < insn_static_data->n_operands; i++)
+    {
+      int j;
+      rtx pat, set;
+      struct lra_operand_data *operand = &insn_static_data->operand[i];
+
+      /* ??? Should we treat 'X' the same way?	It looks to me that
+	 'X' means anything, while an empty constraint means we do
+	 not care.  */
+      if (operand->type != OP_IN || *operand->constraint != '\0'
+	  || operand->is_operator)
+	continue;
+      pat = PATTERN (insn);
+      if (GET_CODE (pat) == SET)
+	{
+	  if (data->operand_loc[i] != &SET_DEST (pat))
+	    continue;
+	}
+      else if (GET_CODE (pat) == PARALLEL)
+	{
+	  for (j = XVECLEN (pat, 0) - 1; j >= 0; j--)
+	    {
+	      set = XVECEXP (PATTERN (insn), 0, j);
+	      if (GET_CODE (set) == SET
+		  && &SET_DEST (set) == data->operand_loc[i])
+		break;
+	    }
+	  if (j < 0)
+	    continue;
+	}
+      else
+	continue;
+      operand->type = OP_OUT;
+    }
+  return data;
+}
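+
+/* As an example of the call handling above: for a call insn whose
+   CALL_INSN_FUNCTION_USAGE contains (use (reg:DI 5 di)) and
+   (use (reg:SI 4 si)) (hypothetical x86-64 argument registers),
+   arg_hard_regs becomes {5, 4, -1} -- the argument hard registers
+   followed by the -1 terminator.  */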
+
+/* Return info about the insn given by UID.  The info should already
+   be set up.  */
+static lra_insn_recog_data_t
+get_insn_recog_data_by_uid (int uid)
+{
+  lra_insn_recog_data_t data;
+
+  data = lra_insn_recog_data[uid];
+  lra_assert (data != NULL);
+  return data;
+}
+
+/* Invalidate all info about insn given by its UID.  */
+static void
+invalidate_insn_recog_data (int uid)
+{
+  lra_insn_recog_data_t data;
+
+  data = lra_insn_recog_data[uid];
+  lra_assert (data != NULL);
+  free_insn_recog_data (data);
+  lra_insn_recog_data[uid] = NULL;
+}
+
+/* Update all the insn info about INSN.  It is usually called when
+   something in the insn was changed.	Return the updated info.  */
+lra_insn_recog_data_t
+lra_update_insn_recog_data (rtx insn)
+{
+  lra_insn_recog_data_t data;
+  int n;
+  unsigned int uid = INSN_UID (insn);
+  struct lra_static_insn_data *insn_static_data;
+  
+  check_and_expand_insn_recog_data (uid);
+  if ((data = lra_insn_recog_data[uid]) != NULL
+      && data->icode != INSN_CODE (insn))
+    {
+      invalidate_insn_data_regno_info (data, insn, get_insn_freq (insn));
+      invalidate_insn_recog_data (uid);
+      data = NULL;
+    }
+  if (data == NULL)
+    return lra_get_insn_recog_data (insn);
+  insn_static_data = data->insn_static_data;
+  data->used_insn_alternative = -1;
+  if (DEBUG_INSN_P (insn))
+    return data;
+  if (data->icode < 0)
+    {
+      int nop;
+      enum machine_mode operand_mode[MAX_RECOG_OPERANDS];
+      const char *constraints[MAX_RECOG_OPERANDS];
+
+      nop = asm_noperands (PATTERN (insn));
+      if (nop >= 0)
+	{
+	  lra_assert (nop == data->insn_static_data->n_operands);
+	  /* Now get the operand values and constraints out of the
+	     insn.  */
+	  decode_asm_operands (PATTERN (insn), NULL,
+			       data->operand_loc,
+			       constraints, operand_mode, NULL);
+#ifdef ENABLE_CHECKING
+	  {
+	    int i;
+
+	    for (i = 0; i < nop; i++)
+	      lra_assert
+		(insn_static_data->operand[i].mode == operand_mode[i]
+		 && insn_static_data->operand[i].constraint == constraints[i]
+		 && ! insn_static_data->operand[i].is_operator);
+	  }
+#endif
+	}
+#ifdef ENABLE_CHECKING
+      {
+	int i;
+
+	for (i = 0; i < insn_static_data->n_operands; i++)
+	  lra_assert
+	    (insn_static_data->operand[i].type
+	     == (insn_static_data->operand[i].constraint[0] == '=' ? OP_OUT
+		 : insn_static_data->operand[i].constraint[0] == '+' ? OP_INOUT
+		 : OP_IN));
+      }
+#endif
+    }
+  else
+    {
+      insn_extract (insn);
+      n = insn_static_data->n_operands;
+      if (n != 0)
+	memcpy (data->operand_loc, recog_data.operand_loc, n * sizeof (rtx *));
+      n = insn_static_data->n_dups;
+      if (n != 0)
+	memcpy (data->dup_loc, recog_data.dup_loc, n * sizeof (rtx *));
+#ifdef HAVE_ATTR_enabled
+#ifdef ENABLE_CHECKING
+      {
+	int i;
+	bool *bp;
+	
+	n = insn_static_data->n_alternatives;
+	bp = data->alternative_enabled_p;
+	lra_assert (n >= 0 && bp != NULL);
+	/* Cache the insn to prevent extract_insn call from
+	   get_attr_enabled.  */
+	recog_data.insn = insn;
+	for (i = 0; i < n; i++)
+	  {
+	    which_alternative = i;
+	    lra_assert (bp[i] == get_attr_enabled (insn));
+	  }
+      }
+#endif
+#endif
+    }
+  return data;
+}
+
+/* Record that INSN is now using alternative ALT.  */
+void
+lra_set_used_insn_alternative (rtx insn, int alt)
+{
+  lra_insn_recog_data_t data;
+
+  data = lra_get_insn_recog_data (insn);
+  data->used_insn_alternative = alt;
+}
+
+/* Record that the insn with UID is now using alternative ALT.  The
+   insn info should already be set up.  */
+void
+lra_set_used_insn_alternative_by_uid (int uid, int alt)
+{
+  lra_insn_recog_data_t data;
+
+  check_and_expand_insn_recog_data (uid);
+  data = lra_insn_recog_data[uid];
+  lra_assert (data != NULL);
+  data->used_insn_alternative = alt;
+}
+
+
+
+/* This page contains code dealing with common register info and
+   pseudo copies.  */
+
+/* The size of the following array.  */
+static int reg_info_size;
+/* Common info about each register.  */
+struct lra_reg *lra_reg_info;
+
+/* Last register value.	 */
+static int last_reg_value;
+
+/* Return new register value.  */
+static int
+get_new_reg_value (void)
+{
+  return ++last_reg_value;
+}
+
+/* Pool for copies.  */
+static alloc_pool copy_pool;
+
+DEF_VEC_P(lra_copy_t);
+DEF_VEC_ALLOC_P(lra_copy_t, heap);
+
+/* Vec referring to pseudo copies.  */
+static VEC(lra_copy_t,heap) *copy_vec;
+
+/* Initialize common reg info and copies.  */
+static void
+init_reg_info (void)
+{
+  int i;
+
+  last_reg_value = 0;
+  reg_info_size = max_reg_num () * 3 / 2 + 1;
+  lra_reg_info
+    = (struct lra_reg *) xmalloc (reg_info_size * sizeof (struct lra_reg));
+  for (i = 0; i < reg_info_size; i++)
+    {
+      bitmap_initialize (&lra_reg_info[i].insn_bitmap, &reg_obstack);
+#ifdef STACK_REGS
+      lra_reg_info[i].no_stack_p = false;
+#endif
+      CLEAR_HARD_REG_SET (lra_reg_info[i].conflict_hard_regs);
+      lra_reg_info[i].preferred_hard_regno1 = -1;
+      lra_reg_info[i].preferred_hard_regno2 = -1;
+      lra_reg_info[i].preferred_hard_regno_profit1 = 0;
+      lra_reg_info[i].preferred_hard_regno_profit2 = 0;
+      lra_reg_info[i].live_ranges = NULL;
+      lra_reg_info[i].nrefs = lra_reg_info[i].freq = 0;
+      lra_reg_info[i].last_reload = 0;
+      lra_reg_info[i].restore_regno = -1;
+      lra_reg_info[i].val = get_new_reg_value ();
+      lra_reg_info[i].copies = NULL;
+    }
+  copy_pool
+    = create_alloc_pool ("lra copies", sizeof (struct lra_copy), 100);
+  copy_vec = VEC_alloc (lra_copy_t, heap, 100);
+}
+
+/* Finish common reg info and copies.  */
+static void
+finish_reg_info (void)
+{
+  int i;
+
+  for (i = 0; i < reg_info_size; i++)
+    bitmap_clear (&lra_reg_info[i].insn_bitmap);
+  free (lra_reg_info);
+  reg_info_size = 0;
+  free_alloc_pool (copy_pool);
+  VEC_free (lra_copy_t, heap, copy_vec);
+}
+
+/* Expand common reg info if it is necessary.  */
+static void
+expand_reg_info (void)
+{
+  int i, old = reg_info_size;
+
+  if (reg_info_size > max_reg_num ())
+    return;
+  reg_info_size = max_reg_num () * 3 / 2 + 1;
+  lra_reg_info
+    = (struct lra_reg *) xrealloc (lra_reg_info,
+				   reg_info_size * sizeof (struct lra_reg));
+  for (i = old; i < reg_info_size; i++)
+    {
+      bitmap_initialize (&lra_reg_info[i].insn_bitmap, &reg_obstack);
+#ifdef STACK_REGS
+      lra_reg_info[i].no_stack_p = false;
+#endif
+      CLEAR_HARD_REG_SET (lra_reg_info[i].conflict_hard_regs);
+      lra_reg_info[i].preferred_hard_regno1 = -1;
+      lra_reg_info[i].preferred_hard_regno2 = -1;
+      lra_reg_info[i].preferred_hard_regno_profit1 = 0;
+      lra_reg_info[i].preferred_hard_regno_profit2 = 0;
+      lra_reg_info[i].live_ranges = NULL;
+      lra_reg_info[i].nrefs = lra_reg_info[i].freq = 0;
+      lra_reg_info[i].last_reload = 0;
+      lra_reg_info[i].restore_regno = -1;
+      lra_reg_info[i].val = get_new_reg_value ();
+      lra_reg_info[i].copies = NULL;
+    }
+}
+
+/* Free all copies.  */
+void
+lra_free_copies (void)
+{
+  lra_copy_t cp;
+
+  while (VEC_length (lra_copy_t, copy_vec) != 0)
+    {
+      cp = VEC_pop (lra_copy_t, copy_vec);
+      lra_reg_info[cp->regno1].copies = lra_reg_info[cp->regno2].copies = NULL;
+      pool_free (copy_pool, cp);
+    }
+}
+
+/* Create a copy of two pseudos REGNO1 and REGNO2.  The copy
+   execution frequency is FREQ.  */
+void
+lra_create_copy (int regno1, int regno2, int freq)
+{
+  bool regno1_dest_p;
+  lra_copy_t cp;
+
+  lra_assert (regno1 != regno2);
+  regno1_dest_p = true;
+  if (regno1 > regno2)
+    {
+      int temp = regno2;
+
+      regno1_dest_p = false;
+      regno2 = regno1;
+      regno1 = temp;
+    }
+  cp = (lra_copy_t) pool_alloc (copy_pool);
+  VEC_safe_push (lra_copy_t, heap, copy_vec, cp);
+  cp->regno1_dest_p = regno1_dest_p;
+  cp->freq = freq;
+  cp->regno1 = regno1;
+  cp->regno2 = regno2;
+  cp->regno1_next = lra_reg_info[regno1].copies;
+  lra_reg_info[regno1].copies = cp;
+  cp->regno2_next = lra_reg_info[regno2].copies;
+  lra_reg_info[regno2].copies = cp;
+  if (lra_dump_file != NULL)
+    fprintf (lra_dump_file, "	   Creating copy r%d%sr%d@%d\n",
+	     regno1, regno1_dest_p ? "<-" : "->", regno2, freq);
+}
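+
+/* For example, lra_create_copy (110, 100, freq) for an insn copying
+   r100 into r110 (the regnos are illustrative) stores the copy with
+   regno1 == 100, regno2 == 110 and regno1_dest_p == false, and the
+   dump line reads "Creating copy r100->r110@freq": copies are kept
+   with regno1 < regno2 and the direction is encoded in
+   regno1_dest_p.  */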
+
+/* Return the N-th (0, 1, ...) copy, or NULL if there is no such
+   copy.  */
+lra_copy_t
+lra_get_copy (int n)
+{
+  if (n >= (int) VEC_length (lra_copy_t, copy_vec))
+    return NULL;
+  return VEC_index (lra_copy_t, copy_vec, n);
+}
+
+
+
+/* This page contains code dealing with info about registers in
+   insns.  */
+
+/* Recursively process X of the insn with UID and add info about the
+   registers in X (operand type TYPE, early clobber flag
+   EARLY_CLOBBER) to the insn DATA.  */
+static void
+add_regs_to_insn_regno_info (lra_insn_recog_data_t data, rtx x, int uid,
+			     enum op_type type, bool early_clobber)
+{
+  int i, j, regno;
+  bool subreg_p;
+  enum machine_mode mode;
+  const char *fmt;
+  enum rtx_code code;
+  struct lra_insn_reg *curr;
+
+  code = GET_CODE (x);
+  mode = GET_MODE (x);
+  subreg_p = false;
+  if (GET_CODE (x) == SUBREG)
+    {
+      x = SUBREG_REG (x);
+      code = GET_CODE (x);
+      if (GET_MODE_SIZE (mode) < GET_MODE_SIZE (GET_MODE (x)))
+	{
+	  mode = GET_MODE (x);
+	  if (GET_MODE_SIZE (mode) > REGMODE_NATURAL_SIZE (mode))
+	    subreg_p = true;
+	}
+    }
+  if (REG_P (x))
+    {
+      regno = REGNO (x);
+      expand_reg_info ();
+      if (bitmap_set_bit (&lra_reg_info[regno].insn_bitmap, uid))
+	data->regs = new_insn_reg (regno, type, mode, subreg_p, early_clobber,
+				   data->regs);
+      else
+	{
+	  for (curr = data->regs; curr != NULL; curr = curr->next)
+	    if (curr->regno == regno)
+	      break;
+	  if (curr->subreg_p != subreg_p || curr->biggest_mode != mode)
+	    /* The info cannot be integrated into the found
+	       structure.  */
+	    data->regs = new_insn_reg (regno, type, mode, subreg_p,
+				       early_clobber, data->regs);
+	  else
+	    {
+	      if (curr->type != type)
+		curr->type = OP_INOUT;
+	      if (curr->early_clobber != early_clobber)
+		curr->early_clobber = true;
+	    }
+	  lra_assert (curr != NULL);
+	}
+      return;
+    }
+
+  switch (code)
+    {
+    case SET:
+      add_regs_to_insn_regno_info (data, SET_DEST (x), uid, OP_OUT, false);
+      add_regs_to_insn_regno_info (data, SET_SRC (x), uid, OP_IN, false);
+      break;
+    case CLOBBER:
+      /* We treat a clobber of non-operand hard registers as an early
+	 clobber (this behavior is expected from asms).  */
+      add_regs_to_insn_regno_info (data, XEXP (x, 0), uid, OP_OUT, true);
+      break;
+    case PRE_INC: case PRE_DEC: case POST_INC: case POST_DEC:
+      add_regs_to_insn_regno_info (data, XEXP (x, 0), uid, OP_INOUT, false);
+      break;
+    case PRE_MODIFY: case POST_MODIFY:
+      add_regs_to_insn_regno_info (data, XEXP (x, 0), uid, OP_INOUT, false);
+      add_regs_to_insn_regno_info (data, XEXP (x, 1), uid, OP_IN, false);
+      break;
+    default:
+      if ((code != PARALLEL && code != EXPR_LIST) || type != OP_OUT)
+	/* Some targets place small structures in registers for the
+	   return values of functions, and those registers are wrapped
+	   in a PARALLEL that we may see as the destination of a SET.
+	   Here is an example:
+
+	   (call_insn 13 12 14 2 (set (parallel:BLK [
+		(expr_list:REG_DEP_TRUE (reg:DI 0 ax)
+		    (const_int 0 [0]))
+		(expr_list:REG_DEP_TRUE (reg:DI 1 dx)
+		    (const_int 8 [0x8]))
+	       ])
+	     (call (mem:QI (symbol_ref:DI (...	*/
+	type = OP_IN;
+      fmt = GET_RTX_FORMAT (code);
+      for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+	{
+	  if (fmt[i] == 'e')
+	    add_regs_to_insn_regno_info (data, XEXP (x, i), uid, type, false);
+	  else if (fmt[i] == 'E')
+	    {
+	      for (j = XVECLEN (x, i) - 1; j >= 0; j--)
+		add_regs_to_insn_regno_info (data, XVECEXP (x, i, j), uid,
+					     type, false);
+	    }
+	}
+    }
+}
+
+/* Return execution frequency of INSN.	*/
+static int
+get_insn_freq (rtx insn)
+{
+  basic_block bb;
+
+  if ((bb = BLOCK_FOR_INSN (insn)) != NULL)
+    return REG_FREQ_FROM_BB (bb);
+  else
+    {
+      lra_assert (lra_insn_recog_data[INSN_UID (insn)]
+		  ->insn_static_data->n_operands == 0);
+      /* We don't care about such an insn, e.g. it might be a jump
+	 with an addr_vec.  */
+      return 1;
+    }
+}
+
+/* Invalidate all reg info of INSN with DATA and execution frequency
+   FREQ.  Update common info about the invalidated registers.  */
+static void
+invalidate_insn_data_regno_info (lra_insn_recog_data_t data, rtx insn,
+				 int freq)
+{
+  int uid;
+  bool debug_p;
+  unsigned int i;
+  struct lra_insn_reg *ir, *next_ir;
+
+  uid = INSN_UID (insn);
+  debug_p = DEBUG_INSN_P (insn);
+  for (ir = data->regs; ir != NULL; ir = next_ir)
+    {
+      i = ir->regno;
+      next_ir = ir->next;
+      free_insn_reg (ir);
+      bitmap_clear_bit (&lra_reg_info[i].insn_bitmap, uid);
+      if (i >= FIRST_PSEUDO_REGISTER && ! debug_p)
+	{
+	  lra_reg_info[i].nrefs--;
+	  lra_reg_info[i].freq -= freq;
+	  lra_assert (lra_reg_info[i].nrefs >= 0 && lra_reg_info[i].freq >= 0);
+	}
+    }
+  data->regs = NULL;
+}
+
+/* Invalidate all reg info of INSN.  Update common info about the
+   invalidated registers.  */
+void
+lra_invalidate_insn_regno_info (rtx insn)
+{
+  invalidate_insn_data_regno_info (lra_get_insn_recog_data (insn), insn,
+				   get_insn_freq (insn));
+}
+
+/* Update common reg info from reg info of insn given by its DATA and
+   execution frequency FREQ.  */
+static void
+setup_insn_reg_info (lra_insn_recog_data_t data, int freq)
+{
+  unsigned int i;
+  struct lra_insn_reg *ir;
+
+  for (ir = data->regs; ir != NULL; ir = ir->next)
+    if ((i = ir->regno) >= FIRST_PSEUDO_REGISTER)
+      {
+	lra_reg_info[i].nrefs++;
+	lra_reg_info[i].freq += freq;
+      }
+}
+
+/* Set up insn reg info of INSN.  Update common reg info from reg info
+   of INSN.  */
+void
+lra_update_insn_regno_info (rtx insn)
+{
+  int i, uid, freq;
+  lra_insn_recog_data_t data;
+  struct lra_static_insn_data *static_data;
+  enum rtx_code code;
+
+  if (! INSN_P (insn))
+    return;
+  data = lra_get_insn_recog_data (insn);
+  static_data = data->insn_static_data;
+  freq = get_insn_freq (insn);
+  invalidate_insn_data_regno_info (data, insn, freq);
+  uid = INSN_UID (insn);
+  for (i = static_data->n_operands - 1; i >= 0; i--)
+    add_regs_to_insn_regno_info (data, *data->operand_loc[i], uid,
+				 static_data->operand[i].type,
+				 static_data->operand[i].early_clobber);
+  if ((code = GET_CODE (PATTERN (insn))) == CLOBBER || code == USE)
+    add_regs_to_insn_regno_info (data, XEXP (PATTERN (insn), 0), uid,
+				 code == USE ? OP_IN : OP_OUT, false);
+  if (NONDEBUG_INSN_P (insn))
+    setup_insn_reg_info (data, freq);
+}
+
+/* Return reg info of the insn given by its UID.  */
+struct lra_insn_reg *
+lra_get_insn_regs (int uid)
+{
+  lra_insn_recog_data_t data;
+
+  data = get_insn_recog_data_by_uid (uid);
+  return data->regs;
+}
+
+
+
+/* This page contains code dealing with the stack of insns which
+   should be processed by the next constraint pass.  */
+
+/* Bitmap used to ensure an insn is put on the stack only once.  */
+bitmap_head lra_constraint_insn_stack_bitmap;
+
+/* The stack itself.  */
+VEC (rtx, heap) *lra_constraint_insn_stack;
+
+/* Put INSN on the stack.  */
+void
+lra_push_insn (rtx insn)
+{
+  if (! bitmap_set_bit (&lra_constraint_insn_stack_bitmap, INSN_UID (insn)))
+    return;
+  lra_update_insn_regno_info (insn);
+  VEC_safe_push (rtx, heap, lra_constraint_insn_stack, insn);
+}
+
+/* Put insn with UID on the stack.  */
+void
+lra_push_insn_by_uid (unsigned int uid)
+{
+  lra_push_insn (lra_insn_recog_data[uid]->insn);
+}
+
+/* Put INSN on the stack and update its reg info.  */
+void
+lra_push_insn_and_update_insn_regno_info (rtx insn)
+{
+  lra_update_insn_regno_info (insn);
+  if (! bitmap_set_bit (&lra_constraint_insn_stack_bitmap, INSN_UID (insn)))
+    return;
+  VEC_safe_push (rtx, heap, lra_constraint_insn_stack, insn);
+}
+
+/* Push insns from FROM back to TO (excluding TO), going in reverse order.  */
+static void
+push_insns (rtx from, rtx to)
+{
+  rtx insn;
+
+  if (from == NULL_RTX)
+    return;
+  for (insn = from; insn != to; insn = PREV_INSN (insn))
+    lra_push_insn (insn);
+}
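+
+/* E.g. if new insns i1, i2, i3 were emitted in that order,
+   push_insns (i3, PREV_INSN (i1)) pushes i3 first and i1 last, so
+   the constraint pass, popping insns from the top of the stack,
+   processes them in the original order i1, i2, i3.  */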
+
+/* Emit insns BEFORE before INSN and insns AFTER after INSN.  Put the
+   insns onto the stack.  Print a dump message about emitting the
+   insns, using TITLE.  */
+void
+lra_process_new_insns (rtx insn, rtx before, rtx after, const char *title)
+{
+  rtx last;
+
+  if (lra_dump_file != NULL && (before != NULL_RTX || after != NULL_RTX))
+    {
+      debug_rtl_slim (lra_dump_file, insn, insn, -1, 0);
+      if (before != NULL_RTX)
+	{
+	  fprintf (lra_dump_file,"    %s before:\n", title);
+	  debug_rtl_slim (lra_dump_file, before, NULL_RTX, -1, 0);
+	}
+      if (after != NULL_RTX)
+	{
+	  fprintf (lra_dump_file, "    %s after:\n", title);
+	  debug_rtl_slim (lra_dump_file, after, NULL_RTX, -1, 0);
+	}
+      fprintf (lra_dump_file, "\n");
+    }
+  if (before != NULL_RTX)
+    {
+      emit_insn_before (before, insn);
+      push_insns (PREV_INSN (insn), PREV_INSN (before));
+    }
+  if (after != NULL_RTX)
+    {
+      for (last = after; NEXT_INSN (last) != NULL_RTX; last = NEXT_INSN (last))
+	;
+      emit_insn_after (after, insn);
+      push_insns (last, insn);
+    }
+}
+
+
+
+/* This page contains code dealing with scratches (changing them into
+   pseudos and restoring them from the pseudos).
+
+   We change scratches into pseudos at the beginning of LRA to
+   simplify dealing with them (conflicts, hard register assignments).
+
+   If the pseudo denoting a scratch did not get a hard register, we
+   do not actually need a hard register for it.  Such pseudos are
+   transformed back to scratches at the end of LRA.  */
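+
+/* For example (illustrative RTL, r90 being a fresh pseudo), an insn
+
+     (insn 10 ... (parallel [(set ...)
+			     (clobber (scratch:SI))]) ...)
+
+   becomes after remove_scratches
+
+     (insn 10 ... (parallel [(set ...)
+			     (clobber (reg:SI 90))]) ...)
+
+   with a REG_UNUSED note for r90 added to the insn;
+   restore_scratches undoes the transformation if r90 does not get a
+   hard register.  */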
+
+/* Description of location of a former scratch operand.	 */
+struct loc
+{
+  rtx insn; /* Insn where the scratch was.  */
+  int nop;  /* Number of the operand which was a scratch.  */
+};
+
+typedef struct loc *loc_t;
+
+DEF_VEC_P(loc_t);
+DEF_VEC_ALLOC_P(loc_t, heap);
+
+/* Locations of the former scratches.  */
+static VEC (loc_t, heap) *scratches;
+
+/* Bitmap of scratch regnos.  */
+static bitmap_head scratch_bitmap;
+
+/* Bitmap of scratch operands.	*/
+static bitmap_head scratch_operand_bitmap;
+
+/* Return true if pseudo REGNO was made from a SCRATCH.  */
+bool
+lra_former_scratch_p (int regno)
+{
+  return bitmap_bit_p (&scratch_bitmap, regno);
+}
+
+/* Return true if the operand NOP of INSN is a former scratch.	*/
+bool
+lra_former_scratch_operand_p (rtx insn, int nop)
+{
+  return bitmap_bit_p (&scratch_operand_bitmap,
+		       INSN_UID (insn) * MAX_RECOG_OPERANDS + nop) != 0;
+}
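+
+/* The bitmap above packs (insn, operand) pairs into single bit
+   numbers: e.g. with MAX_RECOG_OPERANDS == 30, operand 2 of the insn
+   with UID 5 is represented by bit 5 * 30 + 2 == 152.  */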
+
+/* Change scratches into pseudos and save their locations.  */
+static void
+remove_scratches (void)
+{
+  int i;
+  bool insn_changed_p;
+  basic_block bb;
+  rtx insn, reg;
+  loc_t loc;
+  lra_insn_recog_data_t id;
+  struct lra_static_insn_data *static_id;
+
+  scratches = VEC_alloc (loc_t, heap, get_max_uid ());
+  bitmap_initialize (&scratch_bitmap, &reg_obstack);
+  bitmap_initialize (&scratch_operand_bitmap, &reg_obstack);
+  FOR_EACH_BB (bb)
+    FOR_BB_INSNS (bb, insn)
+    if (INSN_P (insn))
+      {
+	id = lra_get_insn_recog_data (insn);
+	static_id = id->insn_static_data;
+	insn_changed_p = false;
+	for (i = 0; i < static_id->n_operands; i++)
+	  if (GET_CODE (*id->operand_loc[i]) == SCRATCH
+	      && GET_MODE (*id->operand_loc[i]) != VOIDmode)
+	    {
+	      insn_changed_p = true;
+	      *id->operand_loc[i] = reg
+		= lra_create_new_reg (static_id->operand[i].mode,
+				      *id->operand_loc[i], ALL_REGS, NULL);
+	      add_reg_note (insn, REG_UNUSED, reg);
+	      lra_update_dup (id, i);
+	      loc = (struct loc *) xmalloc (sizeof (struct loc));
+	      loc->insn = insn;
+	      loc->nop = i;
+	      VEC_safe_push (loc_t, heap, scratches, loc);
+	      bitmap_set_bit (&scratch_bitmap, REGNO (*id->operand_loc[i]));
+	      bitmap_set_bit (&scratch_operand_bitmap,
+			      INSN_UID (insn) * MAX_RECOG_OPERANDS + i);
+	      if (lra_dump_file != NULL)
+		fprintf (lra_dump_file,
+			 "Removing SCRATCH in insn #%u (nop %d)\n",
+			 INSN_UID (insn), i);
+	    }
+	if (insn_changed_p)
+	  /* Because we might use DF right after the caller-saves
+	     sub-pass, we need to keep DF info up to date.  */
+	  df_insn_rescan (insn);
+      }
+}
+
+/* Change pseudos created by remove_scratches back into scratches.  */
+static void
+restore_scratches (void)
+{
+  int i, regno;
+  loc_t loc;
+  rtx last = NULL_RTX;
+  lra_insn_recog_data_t id = NULL;
+
+  for (i = 0; VEC_iterate (loc_t, scratches, i, loc); i++)
+    {
+      if (last != loc->insn)
+	{
+	  last = loc->insn;
+	  id = lra_get_insn_recog_data (last);
+	}
+      if (REG_P (*id->operand_loc[loc->nop])
+	  && ((regno = REGNO (*id->operand_loc[loc->nop]))
+	      >= FIRST_PSEUDO_REGISTER)
+	  && lra_get_regno_hard_regno (regno) < 0)
+	{
+	  /* This should happen only when a scratch register with the
+	     chosen constraint 'X' got neither memory nor a hard
+	     register.  */
+	  lra_assert (lra_former_scratch_p (regno));
+	  *id->operand_loc[loc->nop]
+	    = gen_rtx_SCRATCH (GET_MODE (*id->operand_loc[loc->nop]));
+	  lra_update_dup (id, loc->nop);
+	  if (lra_dump_file != NULL)
+	    fprintf (lra_dump_file, "Restoring SCRATCH in insn #%u(nop %d)\n",
+		     INSN_UID (loc->insn), loc->nop);
+	}
+    }
+  for (i = 0; VEC_iterate (loc_t, scratches, i, loc); i++)
+    free (loc);
+  VEC_free (loc_t, heap, scratches);
+  bitmap_clear (&scratch_bitmap);
+  bitmap_clear (&scratch_operand_bitmap);
+}
+
+
+
+#ifdef ENABLE_CHECKING
+
+/* Check RTL for correctness.  If FINAL_P is true, the check is done
+   at the end of LRA and is more rigorous.  */
+static void
+check_rtl (bool final_p)
+{
+  int i;
+  basic_block bb;
+  rtx insn;
+  lra_insn_recog_data_t id;
+
+  lra_assert (! final_p || reload_completed);
+  FOR_EACH_BB (bb)
+    FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn)
+	&& GET_CODE (PATTERN (insn)) != USE
+	&& GET_CODE (PATTERN (insn)) != CLOBBER
+	&& GET_CODE (PATTERN (insn)) != ADDR_VEC
+	&& GET_CODE (PATTERN (insn)) != ADDR_DIFF_VEC
+	&& GET_CODE (PATTERN (insn)) != ASM_INPUT)
+      {
+	if (final_p)
+	  {
+	    extract_insn (insn);
+	    lra_assert (constrain_operands (1));
+	    continue;
+	  }
+	if (insn_invalid_p (insn, false))
+	  fatal_insn_not_found (insn);
+	if (asm_noperands (PATTERN (insn)) >= 0)
+	  continue;
+	id = lra_get_insn_recog_data (insn);
+	/* The code is based on the assumption that all addresses in
+	   a regular instruction are legitimate before LRA.  The code
+	   in lra-constraints.c is based on the assumption that there
+	   is no subreg of memory as an insn operand.  */
+	for (i = 0; i < id->insn_static_data->n_operands; i++)
+	  {
+	    rtx op = *id->operand_loc[i];
+	      
+	    if (MEM_P (op)
+		&& (GET_MODE (op) != BLKmode
+		    || GET_CODE (XEXP (op, 0)) != SCRATCH)
+		&& ! memory_address_p (GET_MODE (op), XEXP (op, 0))
+		/* Some ports don't recognize the following addresses
+		   as legitimate, although they are legitimate if they
+		   satisfy the constraints; they will be checked by
+		   the insn constraints, which we ignore here.  */
+		&& GET_CODE (XEXP (op, 0)) != UNSPEC
+		&& GET_CODE (XEXP (op, 0)) != PRE_DEC
+		&& GET_CODE (XEXP (op, 0)) != PRE_INC
+		&& GET_CODE (XEXP (op, 0)) != POST_DEC
+		&& GET_CODE (XEXP (op, 0)) != POST_INC
+		&& GET_CODE (XEXP (op, 0)) != PRE_MODIFY
+		&& GET_CODE (XEXP (op, 0)) != POST_MODIFY)
+	      fatal_insn_not_found (insn);
+	  }
+      }
+}
+#endif /* #ifdef ENABLE_CHECKING */
+
+/* Determine if the current function has an exception receiver block
+   that reaches the exit block via non-exceptional edges.  */
+static bool
+has_nonexceptional_receiver (void)
+{
+  edge e;
+  edge_iterator ei;
+  basic_block *tos, *worklist, bb;
+
+  /* If we're not optimizing, then just err on the safe side.  */
+  if (!optimize)
+    return true;
+  
+  /* First determine which blocks can reach exit via normal paths.  */
+  tos = worklist = XNEWVEC (basic_block, n_basic_blocks + 1);
+
+  FOR_EACH_BB (bb)
+    bb->flags &= ~BB_REACHABLE;
+
+  /* Place the exit block on our worklist.  */
+  EXIT_BLOCK_PTR->flags |= BB_REACHABLE;
+  *tos++ = EXIT_BLOCK_PTR;
+  
+  /* Iterate: find everything reachable from what we've already seen.  */
+  while (tos != worklist)
+    {
+      bb = *--tos;
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	if (!(e->flags & EDGE_ABNORMAL))
+	  {
+	    basic_block src = e->src;
+
+	    if (!(src->flags & BB_REACHABLE))
+	      {
+		src->flags |= BB_REACHABLE;
+		*tos++ = src;
+	      }
+	  }
+    }
+  free (worklist);
+
+  /* Now see if there's a reachable block with an exceptional incoming
+     edge.  */
+  FOR_EACH_BB (bb)
+    if (bb->flags & BB_REACHABLE)
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	if (e->flags & EDGE_ABNORMAL)
+	  return true;
+
+  /* No exceptional block reached exit unexceptionally.	 */
+  return false;
+}
+
+#ifdef AUTO_INC_DEC
+
+/* Recursively process X of INSN and add REG_INC notes if necessary.  */
+static void
+add_auto_inc_notes (rtx insn, rtx x)
+{
+  enum rtx_code code = GET_CODE (x);
+  const char *fmt;
+  int i, j;
+
+  if (code == MEM && auto_inc_p (XEXP (x, 0)))
+    {
+      add_reg_note (insn, REG_INC, XEXP (XEXP (x, 0), 0));
+      return;
+    }
+
+  /* Scan all X sub-expressions.  */
+  fmt = GET_RTX_FORMAT (code);
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      if (fmt[i] == 'e')
+	add_auto_inc_notes (insn, XEXP (x, i));
+      else if (fmt[i] == 'E')
+	for (j = XVECLEN (x, i) - 1; j >= 0; j--)
+	  add_auto_inc_notes (insn, XVECEXP (x, i, j));
+    }
+}
+
+#endif
+
+/* Remove all REG_DEAD and REG_UNUSED notes and regenerate REG_INC
+   notes.  We replace pseudos with hard registers without notifying
+   DF, and that can make the notes obsolete.  The DF infrastructure
+   does not deal with REG_INC notes, so we regenerate them here.  */
+static void
+update_reg_notes (void)
+{
+  rtx *pnote;
+  basic_block bb;
+  rtx insn;
+
+  FOR_EACH_BB (bb)
+    FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      {
+	pnote = &REG_NOTES (insn);
+	while (*pnote != 0)
+	  {
+	    if (REG_NOTE_KIND (*pnote) == REG_DEAD
+		|| REG_NOTE_KIND (*pnote) == REG_UNUSED
+		|| REG_NOTE_KIND (*pnote) == REG_INC)
+	      *pnote = XEXP (*pnote, 1);
+	    else
+	      pnote = &XEXP (*pnote, 1);
+	  }
+#ifdef AUTO_INC_DEC
+	add_auto_inc_notes (insn, PATTERN (insn));
+#endif
+      }
+}
+
+/* Set to 1 while in lra.  */
+int lra_in_progress;
+
+/* Start of reload pseudo regnos before the new spill pass.  */ 
+int lra_constraint_new_regno_start;
+
+/* Inheritance pseudo regnos before the new spill pass.	 */ 
+bitmap_head lra_inheritance_pseudos;
+
+/* Split pseudo regnos before the new spill pass.  */ 
+bitmap_head lra_split_pseudos;
+
+/* Reload pseudo regnos before the new assignment pass which can
+   still be spilled after the assignment pass.  */
+bitmap_head lra_optional_reload_pseudos;
+
+/* First UID of insns generated before a new spill pass.  */
+int lra_constraint_new_insn_uid_start;
+
+/* File used for output of LRA debug information.  */
+FILE *lra_dump_file;
+
+/* True if we should try to spill pseudos into registers of different
+   classes instead of memory.  */
+bool lra_reg_spill_p;
+
+/* Set up value LRA_REG_SPILL_P.  */
+static void
+setup_reg_spill_flag (void)
+{
+  int cl;
+
+  if (targetm.spill_class != NULL)
+    for (cl = 0; cl < (int) LIM_REG_CLASSES; cl++)
+      if (targetm.spill_class ((enum reg_class) cl) != NO_REGS)
+	{
+	  lra_reg_spill_p = true;
+	  return;
+	}
+  lra_reg_spill_p = false;
+}
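+
+/* For example, an x86-64 implementation of targetm.spill_class might
+   return SSE_REGS for GENERAL_REGS on processors where moves between
+   general and vector registers are cheap; a single such class is
+   enough to set lra_reg_spill_p and make the spill pass consider
+   registers as spill locations.  */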
+
+/* Major LRA entry function.  F is the file that should be used to
+   dump LRA debug info.  */
+void
+lra (FILE *f)
+{
+  int i;
+  bool live_p, scratch_p, inserted_p;
+
+  lra_dump_file = f;
+
+  init_insn_recog_data ();
+
+#ifdef ENABLE_CHECKING
+  check_rtl (false);
+#endif
+
+  COPY_HARD_REG_SET (lra_no_alloc_regs, ira_no_alloc_regs);
+
+  lra_live_range_iter = lra_coalesce_iter = 0;
+  lra_constraint_iter = lra_constraint_iter_after_spill = 0;
+  lra_inheritance_iter = lra_undo_inheritance_iter = 0;
+
+  setup_reg_spill_flag ();
+
+  /* We cannot set up reload_in_progress because it prevents new
+     pseudo creation.  */
+  lra_in_progress = 1;
+
+  init_reg_info ();
+  expand_reg_info ();
+
+  /* remove_scratches can create new pseudos for clobbers, so set up
+     lra_constraint_new_regno_start before calling it, to permit
+     changing reg classes for pseudos created by this
+     simplification.  */
+  lra_constraint_new_regno_start = max_reg_num ();
+  remove_scratches ();
+  scratch_p = lra_constraint_new_regno_start != max_reg_num ();
+
+  /* A function that has a non-local label that can reach the exit
+     block via non-exceptional paths must save all call-saved
+     registers.	 */
+  if (cfun->has_nonlocal_label && has_nonexceptional_receiver ())
+    crtl->saves_all_registers = 1;
+
+  if (crtl->saves_all_registers)
+    for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+      if (! call_used_regs[i] && ! fixed_regs[i] && ! LOCAL_REGNO (i))
+	df_set_regs_ever_live (i, true);
+
+  /* We don't use DF from now on because it is too expensive when a
+     lot of RTL changes are made.  */
+  df_set_flags (DF_NO_INSN_RESCAN);
+  lra_constraint_insn_stack = VEC_alloc (rtx, heap, get_max_uid ());
+  bitmap_initialize (&lra_constraint_insn_stack_bitmap, &reg_obstack);
+  lra_live_ranges_init ();
+  lra_contraints_init ();
+  lra_curr_reload_num = 0;
+  push_insns (get_last_insn (), NULL_RTX);
+  /* It is needed for the 1st coalescing.  */
+  lra_constraint_new_insn_uid_start = get_max_uid ();
+  bitmap_initialize (&lra_inheritance_pseudos, &reg_obstack);
+  bitmap_initialize (&lra_split_pseudos, &reg_obstack);
+  bitmap_initialize (&lra_optional_reload_pseudos, &reg_obstack);
+  live_p = false;
+  for (;;)
+    {
+      for (;;)
+	{
+	  bitmap_clear (&lra_optional_reload_pseudos);
+	  /* We should try to assign hard registers to scratches even
+	     if there were no RTL transformations in
+	     lra_constraints.  */
+	  if (! lra_constraints (lra_constraint_iter == 0)
+	      && (lra_constraint_iter > 1
+		  || (! scratch_p && ! caller_save_needed)))
+	    break;
+	  /* Constraint transformations may result in eliminable hard
+	     regs becoming uneliminable, in which case pseudos that
+	     use them should be spilled.  It is better to do this
+	     before pseudo assignment.
+
+	     For example, rs6000 can make
+	     RS6000_PIC_OFFSET_TABLE_REGNUM uneliminable if we started
+	     to use a constant pool.  */
+	  lra_eliminate (false);
+	  lra_inheritance ();
+	  /* We need live ranges for lra_assign -- so build them.  */
+	  lra_create_live_ranges (true);
+	  live_p = true;
+	  /* If we don't spill non-reload and non-inheritance pseudos,
+	     there is no sense in running memory-memory move
+	     coalescing.  If inheritance pseudos were spilled, the
+	     memory-memory moves involving them will be removed by the
+	     pass undoing inheritance.  */
+	  if (! lra_assign () && lra_coalesce ())	
+	    live_p = false;
+	  if (lra_undo_inheritance ())
+	    live_p = false;
+	}
+      bitmap_clear (&lra_inheritance_pseudos);
+      bitmap_clear (&lra_split_pseudos);
+      if (! lra_need_for_spills_p ())
+	break;
+      if (! live_p)
+	{
+	  /* We need full live info for spilling pseudos into
+	     registers instead of memory.  */
+	  lra_create_live_ranges (lra_reg_spill_p);
+	  live_p = true;
+	}
+      lra_spill ();
+      /* Assignment of stack slots changes elimination offsets for
+	 some eliminations.  So update the offsets here.  */
+      lra_eliminate (false);
+      lra_constraint_new_regno_start = max_reg_num ();
+      lra_constraint_new_insn_uid_start = get_max_uid ();
+      bitmap_clear (&lra_matched_pseudos);
+      lra_constraint_iter_after_spill = 0;
+    }
+  restore_scratches ();
+  lra_eliminate (true);
+  lra_hard_reg_substitution ();
+  lra_in_progress = 0;
+  lra_clear_live_ranges ();
+  lra_live_ranges_finish ();
+  lra_contraints_finish ();
+  finish_reg_info ();
+  bitmap_clear (&lra_constraint_insn_stack_bitmap);
+  VEC_free (rtx, heap, lra_constraint_insn_stack);
+  finish_insn_recog_data ();
+  regstat_free_n_sets_and_refs ();
+  regstat_free_ri ();
+  reload_completed = 1;
+  update_reg_notes ();
+
+  inserted_p = fixup_abnormal_edges ();
+
+  /* We've possibly turned a single trapping insn into multiple ones.  */
+  if (cfun->can_throw_non_call_exceptions)
+    {
+      sbitmap blocks;
+      blocks = sbitmap_alloc (last_basic_block);
+      sbitmap_ones (blocks);
+      find_many_sub_basic_blocks (blocks);
+      sbitmap_free (blocks);
+    }
+
+  if (inserted_p)
+    commit_edge_insertions ();
+
+  /* Replacing pseudos with their memory equivalents might have
+     created shared rtx.  Subsequent passes would get confused
+     by this, so unshare everything here.  */
+  unshare_all_rtl_again (get_insns ());
+
+#ifdef ENABLE_CHECKING
+  check_rtl (true);
+#endif
+}
+
+/* Called once per compiler run to initialize LRA data.  */
+void
+lra_init_once (void)
+{
+  init_insn_code_data_once ();
+}
+
+/* Initialize LRA data once per function.  */
+void
+lra_init (void)
+{
+  init_op_alt_data ();
+}
+
+/* Called once per compiler run to finish LRA data which are
+   initialized once.  */
+void
+lra_finish_once (void)
+{
+  finish_insn_code_data_once ();
+}
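+
+/* A sketch of the intended call sequence (integration with the pass
+   manager is assumed rather than shown here):
+
+     lra_init_once ();	     (at compiler start)
+     lra_init ();	     (once per function)
+     lra (lra_dump_file);    (the entry point above)
+     lra_finish_once ();     (at compiler exit)  */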
Index: lra.h
===================================================================
--- lra.h	(revision 0)
+++ lra.h	(working copy)
@@ -0,0 +1,40 @@ 
+/* Communication between the Local Register Allocator (LRA) and
+   the rest of the compiler.
+   Copyright (C) 2010, 2011, 2012
+   Free Software Foundation, Inc.
+   Contributed by Vladimir Makarov <vmakarov@redhat.com>.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.	If not see
+<http://www.gnu.org/licenses/>.	 */
+
+/* Return the allocno reg class of REGNO.  If it is a reload pseudo,
+   the pseudo should finally get a hard register of the allocno
+   class.  */
+static inline enum reg_class
+lra_get_allocno_class (int regno)
+{
+  resize_reg_info ();
+  return reg_allocno_class (regno);
+}
+
+extern rtx lra_create_new_reg (enum machine_mode, rtx, enum reg_class,
+			       const char *);
+extern void lra_init_elimination (void);
+extern rtx lra_eliminate_regs (rtx, enum machine_mode, rtx);
+extern void lra (FILE *);
+extern void lra_init_once (void);
+extern void lra_init (void);
+extern void lra_finish_once (void);