diff mbox

[RS6000] Fix PR61098, Poor code setting count register

Message ID 20140508014846.GA5162@bubble.grove.modra.org
State New
Headers show

Commit Message

Alan Modra May 8, 2014, 1:48 a.m. UTC
On powerpc64, to set a large loop count we have code like the
following after split1:

(insn 67 14 68 4 (set (reg:DI 160)
        (const_int 99942400 [0x5f50000])) /home/amodra/unaligned_load.c:14 -1
     (nil))
(insn 68 67 42 4 (set (reg:DI 160)
        (ior:DI (reg:DI 160)
            (const_int 57600 [0xe100]))) /home/amodra/unaligned_load.c:14 -1
     (expr_list:REG_EQUAL (const_int 100000000 [0x5f5e100])
        (nil)))

and then test for loop exit with:

(jump_insn 65 31 45 5 (parallel [
            (set (pc)
                (if_then_else (ne (reg:DI 160)
                        (const_int 1 [0x1]))
                    (label_ref:DI 42)
                    (pc)))
            (set (reg:DI 160)
                (plus:DI (reg:DI 160)
                    (const_int -1 [0xffffffffffffffff])))
            (clobber (scratch:CC))
            (clobber (scratch:DI))
        ]) /home/amodra/unaligned_load.c:15 800 {*ctrdi_internal1}
     (int_list:REG_BR_PROB 9899 (nil))
 -> 42)

The jump_insn of course is meant for use with bdnz, which implies a
strong preference for reg 160 to live in the count register.  Trouble
is, the count register doesn't do arithmetic.

So, use a new psuedo for intermediate results.  On looking at this,
I noticed the !TARGET_POWERPC64 code in rs6000_emit_set_long_const was
broken, apparently expecting c1 and c2 to be the high and low 32 bits
of the constant.  That's no longer true, so I've fixed that as well.
Bootstrapped and regression tested powerpc64-linux.  OK for mainline
and branches?

	PR target/61098
	* config/rs6000/rs6000.c (rs6000_emit_set_const): Remove unneeded
	params and return value.  Simplify.  Update comment.
	(rs6000_emit_set_long_const): Remove unneeded param and return
	value.  Correct !TARGET_POWERPC64 handling of constants > 2G.
	If we can, use a new pseudo for intermediate calculations.

Comments

David Edelsohn May 8, 2014, 1:48 p.m. UTC | #1
On Wed, May 7, 2014 at 9:48 PM, Alan Modra <amodra@gmail.com> wrote:
> On powerpc64, to set a large loop count we have code like the
> following after split1:
>
> (insn 67 14 68 4 (set (reg:DI 160)
>         (const_int 99942400 [0x5f50000])) /home/amodra/unaligned_load.c:14 -1
>      (nil))
> (insn 68 67 42 4 (set (reg:DI 160)
>         (ior:DI (reg:DI 160)
>             (const_int 57600 [0xe100]))) /home/amodra/unaligned_load.c:14 -1
>      (expr_list:REG_EQUAL (const_int 100000000 [0x5f5e100])
>         (nil)))
>
> and then test for loop exit with:
>
> (jump_insn 65 31 45 5 (parallel [
>             (set (pc)
>                 (if_then_else (ne (reg:DI 160)
>                         (const_int 1 [0x1]))
>                     (label_ref:DI 42)
>                     (pc)))
>             (set (reg:DI 160)
>                 (plus:DI (reg:DI 160)
>                     (const_int -1 [0xffffffffffffffff])))
>             (clobber (scratch:CC))
>             (clobber (scratch:DI))
>         ]) /home/amodra/unaligned_load.c:15 800 {*ctrdi_internal1}
>      (int_list:REG_BR_PROB 9899 (nil))
>  -> 42)
>
> The jump_insn of course is meant for use with bdnz, which implies a
> strong preference for reg 160 to live in the count register.  Trouble
> is, the count register doesn't do arithmetic.
>
> So, use a new psuedo for intermediate results.  On looking at this,
> I noticed the !TARGET_POWERPC64 code in rs6000_emit_set_long_const was
> broken, apparently expecting c1 and c2 to be the high and low 32 bits
> of the constant.  That's no longer true, so I've fixed that as well.
> Bootstrapped and regression tested powerpc64-linux.  OK for mainline
> and branches?
>
>         PR target/61098
>         * config/rs6000/rs6000.c (rs6000_emit_set_const): Remove unneeded
>         params and return value.  Simplify.  Update comment.
>         (rs6000_emit_set_long_const): Remove unneeded param and return
>         value.  Correct !TARGET_POWERPC64 handling of constants > 2G.
>         If we can, use a new pseudo for intermediate calculations.

Alan,

The history is 32 bit HWI.

The ChangeLog does not mention the changes to rs6000.md nor rs6000-protos.h.

Please do not remove all of the comments from the two functions. The
comments should provide some documentation about the different
purposes of the two functions other than setting DEST to a CONST.

Why did you remove the test for NULL dest?

-      if (dest == NULL)
-       dest = gen_reg_rtx (mode);

That could occur, at least it used to occur.

I think that the way you rearranged the invocations of copy_rtx() in
rs6000_emit_set_long_const() is okay, but it would be good for someone
else to double check.

Thanks, David
Alan Modra May 9, 2014, 2:40 a.m. UTC | #2
On Thu, May 08, 2014 at 09:48:35AM -0400, David Edelsohn wrote:
> The history is 32 bit HWI.

Right.

> The ChangeLog does not mention the changes to rs6000.md nor rs6000-protos.h.

Oops, added.

	* config/rs6000/rs6000.md (movsi_internal1_single+1): Update
	call to rs6000_emit_set_const in splitter.
	(movdi_internal64+2, +3): Likewise.
	* config/rs6000/rs6000-protos.h (rs6000_emit_set_const): Update
	prototype.

> Please do not remove all of the comments from the two functions. The
> comments should provide some documentation about the different
> purposes of the two functions other than setting DEST to a CONST.

I believe my updated comment covers the complete purpose of the
function nowadays.  The comments I removed are out-dated, and should
have been removed a long time ago..  rs6000_emit_set_const does not
even look at N, it always returns a non-zero result, and the return is
only tested for non-zero.  I removed MODE too, because that is always
the same as GET_MODE (dest).

> Why did you remove the test for NULL dest?
> 
> -      if (dest == NULL)
> -       dest = gen_reg_rtx (mode);
> 
> That could occur, at least it used to occur.

I'm sure we can't get a NULL dest nowadays.  All (three) uses of
rs6000_emit_set_const occur in splitters.  They all must have passed a
gpc_reg_operand constraint on operands[0] before calling
rs6000_emit_set_const, so if NULL were possible we'd segfault in
gpc_reg_operand.

> I think that the way you rearranged the invocations of copy_rtx() in
> rs6000_emit_set_long_const() is okay, but it would be good for someone
> else to double check.

Yeah, that function is a bit messy.  I took the approach of always use
a bare "dest" once in the last instruction emitted, with every other
use getting hit with copy_rtx.  The previous approach was similar,
but used the bare "dest" on the first instruction emitted.  Obviously
you don't need copy_rtx anywhere with the new code when
can_create_pseudo_p is true, but I felt it wasn't worth optimising
that for the added source complication.
David Edelsohn May 11, 2014, 2:24 a.m. UTC | #3
On Thu, May 8, 2014 at 10:40 PM, Alan Modra <amodra@gmail.com> wrote:

>> Please do not remove all of the comments from the two functions. The
>> comments should provide some documentation about the different
>> purposes of the two functions other than setting DEST to a CONST.
>
> I believe my updated comment covers the complete purpose of the
> function nowadays.  The comments I removed are out-dated, and should
> have been removed a long time ago..  rs6000_emit_set_const does not
> even look at N, it always returns a non-zero result, and the return is
> only tested for non-zero.  I removed MODE too, because that is always
> the same as GET_MODE (dest).

It is helpful if the comment expresses more than restating the
information one can glean from the function name. It's useful to note
that rs6000_emit_set_long_const is a standard decomposition with a
bounded number of instructions.

>> I think that the way you rearranged the invocations of copy_rtx() in
>> rs6000_emit_set_long_const() is okay, but it would be good for someone
>> else to double check.
>
> Yeah, that function is a bit messy.  I took the approach of always use
> a bare "dest" once in the last instruction emitted, with every other
> use getting hit with copy_rtx.  The previous approach was similar,
> but used the bare "dest" on the first instruction emitted.  Obviously
> you don't need copy_rtx anywhere with the new code when
> can_create_pseudo_p is true, but I felt it wasn't worth optimising
> that for the added source complication.

Can you help clarify the removal of the code that tests if the
splitter failed?  The splitters in the Alpha port follow mostly the
same rhythm, with a little bit of further cleanup and consolidation
relative to the rs6000 port. alpha_split_const_mov() falls back on
alpha_emit_set_long_const(), but checks that the target is valid and
allows the splitter to fail. Either the Alpha port is doing
unnecessary work or this cleanup patch is too aggressive. Either way,
a comment seems necessary.

Thanks, David
Alan Modra May 11, 2014, 10:53 p.m. UTC | #4
On Sat, May 10, 2014 at 10:24:34PM -0400, David Edelsohn wrote:
> On Thu, May 8, 2014 at 10:40 PM, Alan Modra <amodra@gmail.com> wrote:
> > rs6000_emit_set_const ... always returns a non-zero result ...
> 
> Can you help clarify the removal of the code that tests if the
> splitter failed?

See above.
Alan Modra May 11, 2014, 11:39 p.m. UTC | #5
On Mon, May 12, 2014 at 08:23:16AM +0930, Alan Modra wrote:
> On Sat, May 10, 2014 at 10:24:34PM -0400, David Edelsohn wrote:
> > On Thu, May 8, 2014 at 10:40 PM, Alan Modra <amodra@gmail.com> wrote:
> > > rs6000_emit_set_const ... always returns a non-zero result ...
> > 
> > Can you help clarify the removal of the code that tests if the
> > splitter failed?
> 
> See above.

On thinking some more, let me retract the patch for the time being.
While it's true that I was only removing dead code in the splitters,
the question of why this code has become dead is worth looking into.

I suspect a previous patch to rs6000_emit_set_const was wrong, and
that we should in fact be failing before reload (but never after),
when an expansion would take too many instructions.  "Too many" being
a sequence that is slower than loading a 64-bit constant from the
TOC.

We try to make that sort of tradeoff in rs6000_emit_move (see
num_insn_constant call), but that is really too early.  Some
manipulations of code modify constants.  I've seen cases where
rs6000_emit_move decided to inline a constant, but later changes to
the value meant the expansion was five instructions.
Alan Modra May 14, 2014, 3:04 a.m. UTC | #6
On Sat, May 10, 2014 at 10:24:34PM -0400, David Edelsohn wrote:
> On Thu, May 8, 2014 at 10:40 PM, Alan Modra <amodra@gmail.com> wrote:
> 
> >> Please do not remove all of the comments from the two functions. The
> >> comments should provide some documentation about the different
> >> purposes of the two functions other than setting DEST to a CONST.
> >
> > I believe my updated comment covers the complete purpose of the
> > function nowadays.  The comments I removed are out-dated, and should
> > have been removed a long time ago..  rs6000_emit_set_const does not
> > even look at N, it always returns a non-zero result, and the return is
> > only tested for non-zero.  I removed MODE too, because that is always
> > the same as GET_MODE (dest).
> 
> It is helpful if the comment expresses more than restating the
> information one can glean from the function name. It's useful to note
> that rs6000_emit_set_long_const is a standard decomposition with a
> bounded number of instructions.
> 
> >> I think that the way you rearranged the invocations of copy_rtx() in
> >> rs6000_emit_set_long_const() is okay, but it would be good for someone
> >> else to double check.
> >
> > Yeah, that function is a bit messy.  I took the approach of always use
> > a bare "dest" once in the last instruction emitted, with every other
> > use getting hit with copy_rtx.  The previous approach was similar,
> > but used the bare "dest" on the first instruction emitted.  Obviously
> > you don't need copy_rtx anywhere with the new code when
> > can_create_pseudo_p is true, but I felt it wasn't worth optimising
> > that for the added source complication.
> 
> Can you help clarify the removal of the code that tests if the
> splitter failed?  The splitters in the Alpha port follow mostly the
> same rhythm, with a little bit of further cleanup and consolidation
> relative to the rs6000 port. alpha_split_const_mov() falls back on
> alpha_emit_set_long_const(), but checks that the target is valid and
> allows the splitter to fail. Either the Alpha port is doing
> unnecessary work or this cleanup patch is too aggressive. Either way,
> a comment seems necessary.

OK, I've had a good look at the history of this code.

rs6000_emit_set_const and rs6000_emit_set_long_const were introduced
with revision 44516, a largish patch by Dan Berlin.  As you hint
above, it seems the functions were copied from alpha.  So the
parameters were unnecessary and the comments just plain wrong for the
rs6000 version of code right from the initial commit.  Worse, only
half of necessary infrastructure was copied from alpha..

So let me lay out what I believe should be happening with
(set (reg) (constant))

At expand time, if the above set can't be implemented in a single
instruction, then it should be decomposed to the equivalent set high
part, ori low part, and possibly shift instructions so long as the
resulting sequence is small.  I think we basically do this correctly
in rs6000_emit_move.  See the num_insns_constant call there.
Constants that can't be evaluated inline by two (or three)
instructions will be replaced with a load from the TOC.

The same thing ought to happen in the splitters that use
rs6000_emit_set_const.  rs6000_emit_set_const should refuse to expand
to too many instructions (just like alpha).  We don't do this, but if
we did, this would leave some (set (reg) (constant)) instructions in
the RTL.  Alternatively the splitters could generate loads from the
TOC, but see pr57836, which shows the loads from the TOC we crafted
at expand time being reduced back to (set (reg) (constant)).

Finally, at reload time, any remaining (set (reg) (constant))
(ie. those that result in a long inline sequence) should be forced to
the TOC.  This is the missing part of the infrastructure that wasn't
copied from alpha.  Our legitimate_constant_p needs to reject some
constants..  As it is, reload simply expands to a four or five
inline instruction sequence.

David, I'd like some help with the legitimate_constant_p
implementation.  I have something that seems to work (not yet
regression tested) but there are a number of things that I'm not clear
on (eg. the revision 20229 change) so likely will get it wrong.
David Edelsohn May 14, 2014, 3:46 a.m. UTC | #7
Alan,

Danny may have re-organized the code, but I thought that it originally
came from Tom Rixx, if not earlier.

I seem to remember problems in the past with late creation of TOC
entries for constants causing problems, so it was easier to fall back
to materializing all integer constants inline. I don't remember the
PRs, but I think there were issues with creating a TOC if the late
constant were the only TOC reference, or maybe the issue was buying a
stack frame to materialize the TOC/GOT for a late constant.  And
maximum 5 instruction sequence is not really bad relative to a load
from the TOC (even with medium model / data in TOC). There are a lot
of trade-offs with respect to I$ expansion versus the load hitting in
the L1 D$.

Alpha emit_set_const does limit the number of instructions, but that
is a search for a more efficient sequence than the naive sequence. The
Alpha splitters use alpha_split_const_mov(), which tries
alpha_emit_set_const() for an efficient sequence and then falls back
to alpha_emit_set_long_const() for a naive sequence.  Alpha uses PLUS
instead of IOR because of the way its ISA works.
alpha_emit_set_long_const() always will materialize the constant and
does not check for a maximum number of instructions. This is why it's
comment says "fall back to straight forward decomposition".

However, alpha_emit_set_long_const() and alpha_split_const_mov() can
fail, presumably because emit_move_insn() fails, not because of
reaching a maximum number of instructions.

alpha_legitimate_constant_p() rejecs expensive constants early. Once
the splitter is invoked, it always tries to materialize the constant,
but the splitter apparently can fail for other reasons.

I don't mind exploring the benefits of tighening up
rs6000_legitimate_const(), but I'm not sure it's an obvious win,
especially with the potential failure corner cases.

However, I want to have a better understanding about the part of the
patch that removes the FAIL path from the splitters.

Thanks, David



On Tue, May 13, 2014 at 11:04 PM, Alan Modra <amodra@gmail.com> wrote:
> On Sat, May 10, 2014 at 10:24:34PM -0400, David Edelsohn wrote:
>> On Thu, May 8, 2014 at 10:40 PM, Alan Modra <amodra@gmail.com> wrote:
>>
>> >> Please do not remove all of the comments from the two functions. The
>> >> comments should provide some documentation about the different
>> >> purposes of the two functions other than setting DEST to a CONST.
>> >
>> > I believe my updated comment covers the complete purpose of the
>> > function nowadays.  The comments I removed are out-dated, and should
>> > have been removed a long time ago..  rs6000_emit_set_const does not
>> > even look at N, it always returns a non-zero result, and the return is
>> > only tested for non-zero.  I removed MODE too, because that is always
>> > the same as GET_MODE (dest).
>>
>> It is helpful if the comment expresses more than restating the
>> information one can glean from the function name. It's useful to note
>> that rs6000_emit_set_long_const is a standard decomposition with a
>> bounded number of instructions.
>>
>> >> I think that the way you rearranged the invocations of copy_rtx() in
>> >> rs6000_emit_set_long_const() is okay, but it would be good for someone
>> >> else to double check.
>> >
>> > Yeah, that function is a bit messy.  I took the approach of always use
>> > a bare "dest" once in the last instruction emitted, with every other
>> > use getting hit with copy_rtx.  The previous approach was similar,
>> > but used the bare "dest" on the first instruction emitted.  Obviously
>> > you don't need copy_rtx anywhere with the new code when
>> > can_create_pseudo_p is true, but I felt it wasn't worth optimising
>> > that for the added source complication.
>>
>> Can you help clarify the removal of the code that tests if the
>> splitter failed?  The splitters in the Alpha port follow mostly the
>> same rhythm, with a little bit of further cleanup and consolidation
>> relative to the rs6000 port. alpha_split_const_mov() falls back on
>> alpha_emit_set_long_const(), but checks that the target is valid and
>> allows the splitter to fail. Either the Alpha port is doing
>> unnecessary work or this cleanup patch is too aggressive. Either way,
>> a comment seems necessary.
>
> OK, I've had a good look at the history of this code.
>
> rs6000_emit_set_const and rs6000_emit_set_long_const were introduced
> with revision 44516, a largish patch by Dan Berlin.  As you hint
> above, it seems the functions were copied from alpha.  So the
> parameters were unnecessary and the comments just plain wrong for the
> rs6000 version of code right from the initial commit.  Worse, only
> half of necessary infrastructure was copied from alpha..
>
> So let me lay out what I believe should be happening with
> (set (reg) (constant))
>
> At expand time, if the above set can't be implemented in a single
> instruction, then it should be decomposed to the equivalent set high
> part, ori low part, and possibly shift instructions so long as the
> resulting sequence is small.  I think we basically do this correctly
> in rs6000_emit_move.  See the num_insns_constant call there.
> Constants that can't be evaluated inline by two (or three)
> instructions will be replaced with a load from the TOC.
>
> The same thing ought to happen in the splitters that use
> rs6000_emit_set_const.  rs6000_emit_set_const should refuse to expand
> to too many instructions (just like alpha).  We don't do this, but if
> we did, this would leave some (set (reg) (constant)) instructions in
> the RTL.  Alternatively the splitters could generate loads from the
> TOC, but see pr57836, which shows the loads from the TOC we crafted
> at expand time being reduced back to (set (reg) (constant)).
>
> Finally, at reload time, any remaining (set (reg) (constant))
> (ie. those that result in a long inline sequence) should be forced to
> the TOC.  This is the missing part of the infrastructure that wasn't
> copied from alpha.  Our legitimate_constant_p needs to reject some
> constants..  As it is, reload simply expands to a four or five
> inline instruction sequence.
>
> David, I'd like some help with the legitimate_constant_p
> implementation.  I have something that seems to work (not yet
> regression tested) but there are a number of things that I'm not clear
> on (eg. the revision 20229 change) so likely will get it wrong.
>
> --
> Alan Modra
> Australia Development Lab, IBM
Alan Modra May 14, 2014, 9:56 a.m. UTC | #8
Hi David,
On Tue, May 13, 2014 at 11:46:20PM -0400, David Edelsohn wrote:
> Danny may have re-organized the code, but I thought that it originally
> came from Tom Rixx, if not earlier.

OK, I'm not trying to apportion blame.  My name is on plenty of dodgy
code in the rs6000 backend too.  :)

> I seem to remember problems in the past with late creation of TOC
> entries for constants causing problems, so it was easier to fall back
> to materializing all integer constants inline. I don't remember the
> PRs, but I think there were issues with creating a TOC if the late
> constant were the only TOC reference, or maybe the issue was buying a
> stack frame to materialize the TOC/GOT for a late constant.  And
> maximum 5 instruction sequence is not really bad relative to a load
> from the TOC (even with medium model / data in TOC). There are a lot
> of trade-offs with respect to I$ expansion versus the load hitting in
> the L1 D$.

Sure, but Steve will tell you that the 5 instruction sequence is both
slower due to all the dependent ops, and results in larger code+data
size.  We definitely want to avoid it if possible, and pr67836 shows a
case taken from glibc math library code where there should be no
problem in using the TOC.

> Alpha emit_set_const does limit the number of instructions, but that
> is a search for a more efficient sequence than the naive sequence. The
> Alpha splitters use alpha_split_const_mov(), which tries
> alpha_emit_set_const() for an efficient sequence and then falls back
> to alpha_emit_set_long_const() for a naive sequence.  Alpha uses PLUS
> instead of IOR because of the way its ISA works.
> alpha_emit_set_long_const() always will materialize the constant and
> does not check for a maximum number of instructions. This is why it's
> comment says "fall back to straight forward decomposition".
> 
> However, alpha_emit_set_long_const() and alpha_split_const_mov() can
> fail, presumably because emit_move_insn() fails, not because of
> reaching a maximum number of instructions.
> 
> alpha_legitimate_constant_p() rejecs expensive constants early. Once
> the splitter is invoked, it always tries to materialize the constant,
> but the splitter apparently can fail for other reasons.

No, that is wrong.  alpha_emit_set_const does *not* always try to
materialize the constant inline.  It does so for constants that need
more than three instructions only when TARGET_BUILD_CONSTANTS.

> I don't mind exploring the benefits of tighening up
> rs6000_legitimate_const(), but I'm not sure it's an obvious win,
> especially with the potential failure corner cases.

Yes, those potential corner cases have me worried too..

> However, I want to have a better understanding about the part of the
> patch that removes the FAIL path from the splitters.

That part was really quite simple.  I was removing dead code.
rs6000_emit_set_const has never returned anything but DEST, right from
the initial commit.  It can't be called with DEST == NULL, so
"dest = gen_reg_rtx (mode)" is also dead code.

However, I retracted that patch because I now think
rs6000_emit_set_const should in fact sometimes result in the splitter
failing, exactly as is done in the alpha port.
David Edelsohn May 14, 2014, 9:27 p.m. UTC | #9
On Wed, May 14, 2014 at 5:56 AM, Alan Modra <amodra@gmail.com> wrote:

>> I seem to remember problems in the past with late creation of TOC
>> entries for constants causing problems, so it was easier to fall back
>> to materializing all integer constants inline. I don't remember the
>> PRs, but I think there were issues with creating a TOC if the late
>> constant were the only TOC reference, or maybe the issue was buying a
>> stack frame to materialize the TOC/GOT for a late constant.  And
>> maximum 5 instruction sequence is not really bad relative to a load
>> from the TOC (even with medium model / data in TOC). There are a lot
>> of trade-offs with respect to I$ expansion versus the load hitting in
>> the L1 D$.
>
> Sure, but Steve will tell you that the 5 instruction sequence is both
> slower due to all the dependent ops, and results in larger code+data
> size.  We definitely want to avoid it if possible, and pr67836 shows a
> case taken from glibc math library code where there should be no
> problem in using the TOC.

I don't necessarily believe this is a win overall.  If the constant
reliably is in the L1 D$ (or maybe L2 D$) and accessed with a direct
load (data in TOC or medium model), then yes. If it's farther away in
the memory hierarchy, then it's not a win. I agree about the code
expansion concern, which has its own secondary effects.

If this is a constant in a tight loop, okay, but if it's a unique
constant, it may not occur elsewhere in the code to be shared and may
not be placed in the same cache line as other, recently accessed
constants. This would push the load to L3 or farther.

Also, remember that this same heuristic is used by AIX, which still
defaults to small TOC model. So either the constant is in the TOC
anchor constant pool, which hopefully will pre-load the anchor, or
will be a constant in the TOC, possibly putting more pressure on TOC
size and causing overflow.

I am certain that there are anecdotal examples where it is a win for
PPC64 Linux, but I would want more evidence that it's a general win.

>> alpha_emit_set_long_const() always will materialize the constant and
>> does not check for a maximum number of instructions. This is why it's
>> comment says "fall back to straight forward decomposition".

> No, that is wrong.  alpha_emit_set_const does *not* always try to
> materialize the constant inline.  It does so for constants that need
> more than three instructions only when TARGET_BUILD_CONSTANTS.

I said that alpha_emit_set_long_const() always materializes the
constant, but, as you say, it is not always called.
alpha_emit_set_const() may fail if it requires too many instructions
or the search depth is too deep. You seem to be referring to some of
the logic in alpha_split_const_mov() as well.

Again, this definitely is worth exploring. And I am confident that
there are cases where loading the constant from memory is a win. I
just don't have a good instinct if it is a win most of the time for a
broad range of real-world applications. One optimization opportunity
in GLIBC is not a general heuristic. I don't think that we know a lot
about the context of the use of the constant to apply a finer-grained
policy.

I think the original code tried to put the constant in memory if it
appeared before reload, when everything could be calculated correctly
for prologue and materializing the TOC, but tried to materialze any
constants that appeared during reload using splitters.  That can avoid
some of the problem corner cases. The code needs to handle

PPC32
PPC64
eABI
AIX

Thanks, David
diff mbox

Patch

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 209926)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -1068,7 +1069,7 @@  static tree rs6000_handle_longcall_attribute (tree
 static tree rs6000_handle_altivec_attribute (tree *, tree, tree, int, bool *);
 static tree rs6000_handle_struct_attribute (tree *, tree, tree, int, bool *);
 static tree rs6000_builtin_vectorized_libmass (tree, tree, tree);
-static rtx rs6000_emit_set_long_const (rtx, HOST_WIDE_INT, HOST_WIDE_INT);
+static void rs6000_emit_set_long_const (rtx, HOST_WIDE_INT);
 static int rs6000_memory_move_cost (enum machine_mode, reg_class_t, bool);
 static bool rs6000_debug_rtx_costs (rtx, int, int, int, int *, bool);
 static int rs6000_debug_address_cost (rtx, enum machine_mode, addr_space_t,
@@ -7826,53 +7811,36 @@  rs6000_conditional_register_usage (void)
 }
 
 
-/* Try to output insns to set TARGET equal to the constant C if it can
-   be done in less than N insns.  Do all computations in MODE.
-   Returns the place where the output has been placed if it can be
-   done and the insns have been emitted.  If it would take more than N
-   insns, zero is returned and no insns and emitted.  */
+/* Output insns to set DEST equal to the constant SOURCE.  */
 
-rtx
-rs6000_emit_set_const (rtx dest, enum machine_mode mode,
-		       rtx source, int n ATTRIBUTE_UNUSED)
+void
+rs6000_emit_set_const (rtx dest, rtx source)
 {
-  rtx result, insn, set;
-  HOST_WIDE_INT c0, c1;
+  enum machine_mode mode = GET_MODE (dest);
+  rtx temp, insn, set;
+  HOST_WIDE_INT c;
 
+  gcc_checking_assert (CONST_INT_P (source));
+  c = INTVAL (source);
   switch (mode)
     {
-    case  QImode:
+    case QImode:
     case HImode:
-      if (dest == NULL)
-	dest = gen_reg_rtx (mode);
       emit_insn (gen_rtx_SET (VOIDmode, dest, source));
-      return dest;
+      return;
 
     case SImode:
-      result = !can_create_pseudo_p () ? dest : gen_reg_rtx (SImode);
+      temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (SImode);
 
-      emit_insn (gen_rtx_SET (VOIDmode, copy_rtx (result),
-			      GEN_INT (INTVAL (source)
-				       & (~ (HOST_WIDE_INT) 0xffff))));
+      emit_insn (gen_rtx_SET (VOIDmode, copy_rtx (temp),
+			      GEN_INT (c & (~ (HOST_WIDE_INT) 0xffff))));
       emit_insn (gen_rtx_SET (VOIDmode, dest,
-			      gen_rtx_IOR (SImode, copy_rtx (result),
-					   GEN_INT (INTVAL (source) & 0xffff))));
-      result = dest;
+			      gen_rtx_IOR (SImode, copy_rtx (temp),
+					   GEN_INT (c & 0xffff))));
       break;
 
     case DImode:
-      switch (GET_CODE (source))
-	{
-	case CONST_INT:
-	  c0 = INTVAL (source);
-	  c1 = -(c0 < 0);
-	  break;
-
-	default:
-	  gcc_unreachable ();
-	}
-
-      result = rs6000_emit_set_long_const (dest, c0, c1);
+      rs6000_emit_set_long_const (dest, c);
       break;
 
     default:
@@ -7883,37 +7851,38 @@  rs6000_conditional_register_usage (void)
   set = single_set (insn);
   if (! CONSTANT_P (SET_SRC (set)))
     set_unique_reg_note (insn, REG_EQUAL, source);
-
-  return result;
 }
 
-/* Having failed to find a 3 insn sequence in rs6000_emit_set_const,
-   fall back to a straight forward decomposition.  We do this to avoid
-   exponential run times encountered when looking for longer sequences
-   with rs6000_emit_set_const.  */
-static rtx
-rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c1, HOST_WIDE_INT c2)
+/* Output insns to set a DImode DEST equal to the constant C.  */
+
+static void
+rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c)
 {
+  rtx temp;
+
   if (!TARGET_POWERPC64)
     {
-      rtx operand1, operand2;
+      rtx hi, lo;
 
-      operand1 = operand_subword_force (dest, WORDS_BIG_ENDIAN == 0,
-					DImode);
-      operand2 = operand_subword_force (copy_rtx (dest), WORDS_BIG_ENDIAN != 0,
-					DImode);
-      emit_move_insn (operand1, GEN_INT (c1));
-      emit_move_insn (operand2, GEN_INT (c2));
+      hi = operand_subword_force (copy_rtx (dest), WORDS_BIG_ENDIAN == 0,
+				  DImode);
+      lo = operand_subword_force (dest, WORDS_BIG_ENDIAN != 0,
+				  DImode);
+      emit_move_insn (hi, GEN_INT (c >> 32));
+      c = ((c & 0xffffffff) ^ 0x80000000) - 0x80000000;
+      emit_move_insn (lo, GEN_INT (c));
     }
   else
     {
       HOST_WIDE_INT ud1, ud2, ud3, ud4;
 
-      ud1 = c1 & 0xffff;
-      ud2 = (c1 & 0xffff0000) >> 16;
-      c2 = c1 >> 32;
-      ud3 = c2 & 0xffff;
-      ud4 = (c2 & 0xffff0000) >> 16;
+      ud1 = c & 0xffff;
+      c = c >> 16;
+      ud2 = c & 0xffff;
+      c = c >> 16;
+      ud3 = c & 0xffff;
+      c = c >> 16;
+      ud4 = c & 0xffff;
 
       if ((ud4 == 0xffff && ud3 == 0xffff && ud2 == 0xffff && (ud1 & 0x8000))
 	  || (ud4 == 0 && ud3 == 0 && ud2 == 0 && ! (ud1 & 0x8000)))
@@ -7922,67 +7891,74 @@  rs6000_conditional_register_usage (void)
       else if ((ud4 == 0xffff && ud3 == 0xffff && (ud2 & 0x8000))
 	       || (ud4 == 0 && ud3 == 0 && ! (ud2 & 0x8000)))
 	{
-	  emit_move_insn (dest, GEN_INT (((ud2 << 16) ^ 0x80000000)
-					 - 0x80000000));
+	  temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
+
+	  emit_move_insn (ud1 != 0 ? copy_rtx (temp) : dest,
+			  GEN_INT (((ud2 << 16) ^ 0x80000000) - 0x80000000));
 	  if (ud1 != 0)
-	    emit_move_insn (copy_rtx (dest),
-			    gen_rtx_IOR (DImode, copy_rtx (dest),
+	    emit_move_insn (dest,
+			    gen_rtx_IOR (DImode, copy_rtx (temp),
 					 GEN_INT (ud1)));
 	}
       else if (ud3 == 0 && ud4 == 0)
 	{
+	  temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
+
 	  gcc_assert (ud2 & 0x8000);
-	  emit_move_insn (dest, GEN_INT (((ud2 << 16) ^ 0x80000000)
-					 - 0x80000000));
+	  emit_move_insn (copy_rtx (temp),
+			  GEN_INT (((ud2 << 16) ^ 0x80000000) - 0x80000000));
 	  if (ud1 != 0)
-	    emit_move_insn (copy_rtx (dest),
-			    gen_rtx_IOR (DImode, copy_rtx (dest),
+	    emit_move_insn (copy_rtx (temp),
+			    gen_rtx_IOR (DImode, copy_rtx (temp),
 					 GEN_INT (ud1)));
-	  emit_move_insn (copy_rtx (dest),
+	  emit_move_insn (dest,
 			  gen_rtx_ZERO_EXTEND (DImode,
 					       gen_lowpart (SImode,
-							    copy_rtx (dest))));
+							    copy_rtx (temp))));
 	}
       else if ((ud4 == 0xffff && (ud3 & 0x8000))
 	       || (ud4 == 0 && ! (ud3 & 0x8000)))
 	{
-	  emit_move_insn (dest, GEN_INT (((ud3 << 16) ^ 0x80000000)
-					 - 0x80000000));
+	  temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
+
+	  emit_move_insn (copy_rtx (temp),
+			  GEN_INT (((ud3 << 16) ^ 0x80000000) - 0x80000000));
 	  if (ud2 != 0)
-	    emit_move_insn (copy_rtx (dest),
-			    gen_rtx_IOR (DImode, copy_rtx (dest),
+	    emit_move_insn (copy_rtx (temp),
+			    gen_rtx_IOR (DImode, copy_rtx (temp),
 					 GEN_INT (ud2)));
-	  emit_move_insn (copy_rtx (dest),
-			  gen_rtx_ASHIFT (DImode, copy_rtx (dest),
+	  emit_move_insn (ud1 != 0 ? copy_rtx (temp) : dest,
+			  gen_rtx_ASHIFT (DImode, copy_rtx (temp),
 					  GEN_INT (16)));
 	  if (ud1 != 0)
-	    emit_move_insn (copy_rtx (dest),
-			    gen_rtx_IOR (DImode, copy_rtx (dest),
+	    emit_move_insn (dest,
+			    gen_rtx_IOR (DImode, copy_rtx (temp),
 					 GEN_INT (ud1)));
 	}
       else
 	{
-	  emit_move_insn (dest, GEN_INT (((ud4 << 16) ^ 0x80000000)
-					 - 0x80000000));
+	  temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
+
+	  emit_move_insn (copy_rtx (temp),
+			  GEN_INT (((ud4 << 16) ^ 0x80000000) - 0x80000000));
 	  if (ud3 != 0)
-	    emit_move_insn (copy_rtx (dest),
-			    gen_rtx_IOR (DImode, copy_rtx (dest),
+	    emit_move_insn (copy_rtx (temp),
+			    gen_rtx_IOR (DImode, copy_rtx (temp),
 					 GEN_INT (ud3)));
 
-	  emit_move_insn (copy_rtx (dest),
-			  gen_rtx_ASHIFT (DImode, copy_rtx (dest),
+	  emit_move_insn (ud2 != 0 || ud1 != 0 ? copy_rtx (temp) : dest,
+			  gen_rtx_ASHIFT (DImode, copy_rtx (temp),
 					  GEN_INT (32)));
 	  if (ud2 != 0)
-	    emit_move_insn (copy_rtx (dest),
-			    gen_rtx_IOR (DImode, copy_rtx (dest),
+	    emit_move_insn (ud1 != 0 ? copy_rtx (temp) : dest,
+			    gen_rtx_IOR (DImode, copy_rtx (temp),
 					 GEN_INT (ud2 << 16)));
 	  if (ud1 != 0)
-	    emit_move_insn (copy_rtx (dest),
-			    gen_rtx_IOR (DImode, copy_rtx (dest),
+	    emit_move_insn (dest,
+			    gen_rtx_IOR (DImode, copy_rtx (temp),
 					 GEN_INT (ud1)));
 	}
     }
-  return dest;
 }
 
 /* Helper for the following.  Get rid of [r+r] memory refs
Index: gcc/config/rs6000/rs6000-protos.h
===================================================================
--- gcc/config/rs6000/rs6000-protos.h	(revision 209926)
+++ gcc/config/rs6000/rs6000-protos.h	(working copy)
@@ -114,7 +114,7 @@  extern void rs6000_emit_cbranch (enum machine_mode
 extern char * output_cbranch (rtx, const char *, int, rtx);
 extern char * output_e500_flip_gt_bit (rtx, rtx);
 extern const char * output_probe_stack_range (rtx, rtx);
-extern rtx rs6000_emit_set_const (rtx, enum machine_mode, rtx, int);
+extern void rs6000_emit_set_const (rtx, rtx);
 extern int rs6000_emit_cmove (rtx, rtx, rtx, rtx);
 extern int rs6000_emit_vector_cond_expr (rtx, rtx, rtx, rtx, rtx, rtx);
 extern void rs6000_emit_minmax (rtx, enum rtx_code, rtx, rtx);
Index: gcc/config/rs6000/rs6000.md
===================================================================
--- gcc/config/rs6000/rs6000.md	(revision 209926)
+++ gcc/config/rs6000/rs6000.md	(working copy)
@@ -8978,12 +8978,8 @@ 
 	(ior:SI (match_dup 0)
 		(match_dup 3)))]
   "
-{ rtx tem = rs6000_emit_set_const (operands[0], SImode, operands[1], 2);
-
-  if (tem == operands[0])
-    DONE;
-  else
-    FAIL;
+{ rs6000_emit_set_const (operands[0], operands[1]);
+  DONE;
 }")
 
 (define_insn "*mov<mode>_internal2"
@@ -10326,12 +10322,8 @@ 
   [(set (match_dup 0) (match_dup 2))
    (set (match_dup 0) (plus:DI (match_dup 0) (match_dup 3)))]
   "
-{ rtx tem = rs6000_emit_set_const (operands[0], DImode, operands[1], 5);
-
-  if (tem == operands[0])
-    DONE;
-  else
-    FAIL;
+{ rs6000_emit_set_const (operands[0], operands[1]);
+  DONE;
 }")
 
 (define_split
@@ -10341,12 +10333,8 @@ 
   [(set (match_dup 0) (match_dup 2))
    (set (match_dup 0) (plus:DI (match_dup 0) (match_dup 3)))]
   "
-{ rtx tem = rs6000_emit_set_const (operands[0], DImode, operands[1], 5);
-
-  if (tem == operands[0])
-    DONE;
-  else
-    FAIL;
+{ rs6000_emit_set_const (operands[0], operands[1]);
+  DONE;
 }")
 
 ;; TImode/PTImode is similar, except that we usually want to compute the