Enable EBX for x86 in 32bits PIC code

Message ID 20140822121151.GA60032@msticlxl57.ims.intel.com
State New

Commit Message

Ilya Enkovich Aug. 22, 2014, 12:21 p.m. UTC
Hi,

At Cauldron 2014 we had a couple of talks about relaxing ebx usage in 32-bit PIC mode.  It was decided that the best approach would be to not fix the ebx register, use a pseudo register for the GOT base address and let the allocator do the rest.  This should be similar to how clang and icc handle the GOT base address.  I've been working on such a patch for some time and now want to share my results.

The idea of the patch was very simple and included a few things:
 1.  Set PIC_OFFSET_TABLE_REGNUM to INVALID_REGNUM to specify that we do not have any hard reg fixed for PIC.
 2.  Initialize pic_offset_table_rtx with a new pseudo register at the beginning of function expand.
 3.  Change the ABI so that there is a possible implicit PIC argument for calls; pic_offset_table_rtx is used as the arg value if such an implicit arg exists.

This approach worked well on small tests, but when we tried to run some benchmarks we hit a problem with reload of address constants.  The problem is that when we try to rematerialize an address constant or some constant memory reference, we have to use pic_offset_table_rtx.  It means we insert new usages of a pseudo register and the allocator cannot handle it correctly.  The same problem also applies to float and vector constants.

Rematerialization is not the only case causing new pic_offset_table_rtx usage.  Another case is a split of some instructions that use a constant but do not have proper constraints.  E.g. the pushtf pattern allows a push of a constant, but it has to be replaced with a push of memory in the reload pass, causing an additional usage of pic_offset_table_rtx.

There are two ways to fix it.  The first one is to support modifications of the pseudo register's live range during reload and correctly allocate hard regs for its new usages (currently some hard reg is allocated for a new usage of the pseudo reg, but it may contain the value of some other pseudo reg; thus the problem shows up only at runtime).
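
To make the hazard in the first option concrete, here is a toy model in C.  It is only a sketch and not GCC code: struct live_range, reg_holder_at and new_use_is_safe are hypothetical names, and each pseudo is assumed to have a single contiguous live range with one hard reg assigned for all of it.

```c
#include <stddef.h>

/* Toy model, not GCC code: a pseudo is live over [start, end] and
   keeps one hard register for that whole range.  */
struct live_range {
  int pseudo;    /* pseudo register number */
  int hard_reg;  /* hard register chosen by the allocator */
  int start;     /* first program point where the value is live */
  int end;       /* last program point where the value is live */
};

/* Return the pseudo whose value sits in HARD_REG at POINT, or -1 if
   the register holds no live value there.  */
int
reg_holder_at (const struct live_range *ranges, size_t n,
               int hard_reg, int point)
{
  for (size_t i = 0; i < n; i++)
    if (ranges[i].hard_reg == hard_reg
        && ranges[i].start <= point && point <= ranges[i].end)
      return ranges[i].pseudo;
  return -1;
}

/* A use of PSEUDO inserted at POINT is safe only if the hard register
   assigned to PSEUDO still holds PSEUDO's own value at that point.  */
int
new_use_is_safe (const struct live_range *ranges, size_t n,
                 int pseudo, int point)
{
  for (size_t i = 0; i < n; i++)
    if (ranges[i].pseudo == pseudo)
      return reg_holder_at (ranges, n, ranges[i].hard_reg, point) == pseudo;
  return 0;  /* unknown pseudo: treat as unsafe */
}
```

With pseudo 528 in hard reg 3 over points 0..10 and a made-up pseudo 600 in the same hard reg over points 11..20, a use of 528 inserted at point 15 would silently read the value of 600.  Nothing fails at compile time, which matches the runtime-only symptom.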

The second way is to avoid all cases where new usages of pic_offset_table_rtx appear in reload.  That is the way I chose because it appeared simpler to me and would allow me to get some performance data faster.  Also, rematerializing address and float constants in PIC mode would mean higher register pressure, thus having them on the stack should be even more efficient.  To achieve this I had to cut off reg equivs to all exprs using symbol references and to all constants living in memory.  I also had to avoid instructions whose split in reload causes a load of a constant from memory (*push[txd]f).

The resulting compiler successfully passes make check and compiles the EEMBC and SPEC2000 benchmarks.  I am not confident that I covered all cases; there may still be some templates that cause a split in reload with new pic_offset_table_rtx usages.  I think supporting reload with a pseudo PIC register would be a better and more general solution, but I don't know how difficult it is to implement.  Any ideas on resolving this reload issue?

I collected some performance numbers for the EEMBC and SPEC2000 benchmarks.  Here are the patch results for -Ofast with LTO, collected on an Avoton server:
AUTOmark +1.9%
TELECOMmark +4.0%
DENmark +10.0%
SPEC2000 -0.5%

There are few degradations on the EEMBC benchmarks, but on SPEC2000 the situation is different and we see more performance losses.  Some of them are caused by the disabled rematerialization of address constants.  In some cases relaxed ebx causes more spills/fills in places where the GOT is frequently used.  There are also some minor fixes required in the patch to allow a more efficient function prolog (avoid unnecessary GOT register initialization and allow its initialization without ebx usage).  I suppose some performance problems may be resolved, but a good fix for reload should go first.

Thanks,
Ilya
--

Comments

Hans-Peter Nilsson Aug. 23, 2014, 1:47 a.m. UTC | #1
(Dropping gcc@ and people known to subscribe to gcc-patches
from the CC.)

Sorry for the drive-by review, but...

On Fri, 22 Aug 2014, Ilya Enkovich wrote:
> Hi,
>
> On Cauldron 2014 we had a couple of talks about relaxation of
> ebx usage in 32bit PIC mode.  It was decided that the best
> approach would be to not fix ebx register, use pseudo register
> for GOT base address and let allocator do the rest.  This should
> be similar to how clang and icc work with GOT base address.
> I've been working for some time on such patch and now want to
> share my results.

...did you send the right version of the patch?
This one uses the RTX-returning hook only in boolean tests,
unless I misread.

Using the return value in boolean tests (non/NULL) here:

> diff --git a/gcc/calls.c b/gcc/calls.c
> index 4285ec1..85dae6b 100644
> --- a/gcc/calls.c
> +++ b/gcc/calls.c
> @@ -1122,6 +1122,14 @@ initialize_argument_information (int num_actuals ATTRIBUTE_UNUSED,
>      call_expr_arg_iterator iter;
>      tree arg;
>
> +    if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
...
> +  /* Add implicit PIC arg.  */
> +  if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
> +    num_actuals++;
...
> +  if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))

but:

> +/* Return reg in which implicit PIC base address
> +   arg is passed.  */
> +static rtx
> +ix86_implicit_pic_arg (const_tree fntype_or_decl ATTRIBUTE_UNUSED)
...
> +#undef TARGET_IMPLICIT_PIC_ARG
> +#define TARGET_IMPLICIT_PIC_ARG ix86_implicit_pic_arg
>  #undef TARGET_FUNCTION_ARG_BOUNDARY

and:

> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -3967,6 +3967,12 @@ If @code{TARGET_FUNCTION_INCOMING_ARG} is not defined,
>  @code{TARGET_FUNCTION_ARG} serves both purposes.
>  @end deftypefn
>
> +@deftypefn {Target Hook} rtx TARGET_IMPLICIT_PIC_ARG (const_tree @var{fntype_or_decl})
> +This hook returns register holding PIC base address for functions
> +which do not fix hard register but handle it similar to function arg
> +assigning a virtual reg for it.
> +@end deftypefn

Also, the contains_symbol_ref removal seems like an independent
cleanup-patch.

> index a458380..63d2be5 100644
> --- a/gcc/var-tracking.c
> +++ b/gcc/var-tracking.c
> @@ -661,7 +661,6 @@ static bool variable_different_p (variable, variable);
>  static bool dataflow_set_different (dataflow_set *, dataflow_set *);
>  static void dataflow_set_destroy (dataflow_set *);
>
> -static bool contains_symbol_ref (rtx);

brgds, H-P
Ilya Enkovich Aug. 25, 2014, 9:25 a.m. UTC | #2
2014-08-23 5:47 GMT+04:00 Hans-Peter Nilsson <hp@bitrange.com>:
> (Dropping gcc@ and people known to subscribe to gcc-patches
> from the CC.)
>
> Sorry for the drive-by review, but...
>
> On Fri, 22 Aug 2014, Ilya Enkovich wrote:
>> Hi,
>>
>> On Cauldron 2014 we had a couple of talks about relaxation of
>> ebx usage in 32bit PIC mode.  It was decided that the best
>> approach would be to not fix ebx register, use pseudo register
>> for GOT base address and let allocator do the rest.  This should
>> be similar to how clang and icc work with GOT base address.
>> I've been working for some time on such patch and now want to
>> share my results.
>
> ...did you send the right version of the patch?
> This one uses the RTX-returning hook only in boolean tests,
> unless I misread.
>
> Using the return value in boolean tests (non/NULL) here:

NULL returned by hook means we do not have implicit pic arg to
pass/receive and there are pieces of code which should be executed
only when implicit pic arg exists.  This causes these boolean tests.
There are also non boolean usages. E.g.:

+      rtx old_reg = targetm.calls.implicit_pic_arg (fndecl);
+      rtx new_reg = gen_reg_rtx (GET_MODE (old_reg));
+      emit_move_insn (new_reg, old_reg);
+      pic_offset_table_rtx = new_reg;

>
>> diff --git a/gcc/calls.c b/gcc/calls.c
>> index 4285ec1..85dae6b 100644
>> --- a/gcc/calls.c
>> +++ b/gcc/calls.c
>> @@ -1122,6 +1122,14 @@ initialize_argument_information (int num_actuals ATTRIBUTE_UNUSED,
>>      call_expr_arg_iterator iter;
>>      tree arg;
>>
>> +    if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
> ...
>> +  /* Add implicit PIC arg.  */
>> +  if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
>> +    num_actuals++;
> ...
>> +  if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
>
> but:
>
>> +/* Return reg in which implicit PIC base address
>> +   arg is passed.  */
>> +static rtx
>> +ix86_implicit_pic_arg (const_tree fntype_or_decl ATTRIBUTE_UNUSED)
> ...
>> +#undef TARGET_IMPLICIT_PIC_ARG
>> +#define TARGET_IMPLICIT_PIC_ARG ix86_implicit_pic_arg
>>  #undef TARGET_FUNCTION_ARG_BOUNDARY
>
> and:
>
>> --- a/gcc/doc/tm.texi
>> +++ b/gcc/doc/tm.texi
>> @@ -3967,6 +3967,12 @@ If @code{TARGET_FUNCTION_INCOMING_ARG} is not defined,
>>  @code{TARGET_FUNCTION_ARG} serves both purposes.
>>  @end deftypefn
>>
>> +@deftypefn {Target Hook} rtx TARGET_IMPLICIT_PIC_ARG (const_tree @var{fntype_or_decl})
>> +This hook returns register holding PIC base address for functions
>> +which do not fix hard register but handle it similar to function arg
>> +assigning a virtual reg for it.
>> +@end deftypefn
>
> Also, the contains_symbol_ref removal seems like an independent
> cleanup-patch.

It was not removed, it was just moved into rtlanal.c for shared usage
(I used it in ira.c).

Thanks,
Ilya

Hans-Peter Nilsson Aug. 25, 2014, 11:24 a.m. UTC | #3
On Mon, 25 Aug 2014, Ilya Enkovich wrote:
> 2014-08-23 5:47 GMT+04:00 Hans-Peter Nilsson <hp@bitrange.com>:
> > ...did you send the right version of the patch?
> > This one uses the RTX-returning hook only in boolean tests,
> > unless I misread.

(I did, but not by much.)

> NULL returned by hook means we do not have implicit pic arg to
> pass/receive and there are pieces of code which should be executed
> only when implicit pic arg exists.  This causes these boolean tests.

Well, obviously, but...

> There are also non boolean usages. E.g.:

I think singular ("usage") is more correct?
I saw only one such use. :)

> +      rtx old_reg = targetm.calls.implicit_pic_arg (fndecl);
> +      rtx new_reg = gen_reg_rtx (GET_MODE (old_reg));
> +      emit_move_insn (new_reg, old_reg);
> +      pic_offset_table_rtx = new_reg;

And before that, it's called as a boolean test, throwing away
the result!

I suggest you change the hook to return a boolean, with a
pointer argument to a variable to set, passed as NULL from
callers not interested in the actual value.

I.e. instead of:

> >> +@deftypefn {Target Hook} rtx TARGET_IMPLICIT_PIC_ARG (const_tree @var{fntype_or_decl})

make it a:

@deftypefn {Target Hook} bool TARGET_IMPLICIT_PIC_ARG
 (const_tree @var{fntype_or_decl}, rtx *@var{addr})
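
As a rough sketch of that calling convention, here is a self-contained C toy.  It is not the proposed GCC code: toy_rtx, toy_implicit_pic_arg and toy_count_actuals are made-up names, and a plain bool stands in for the fntype_or_decl argument.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for GCC's rtx, purely illustrative.  */
typedef struct toy_rtx { int regno; } *rtx;

/* Assumption for the toy: the PIC register is hard reg 3 (bx).  */
static struct toy_rtx toy_pic_reg = { 3 };

/* Hypothetical hook shape: return true if the callee takes an implicit
   PIC argument.  If ADDR is non-NULL, also store the register there;
   ADDR is left untouched when the answer is false.  */
bool
toy_implicit_pic_arg (bool callee_uses_pic, rtx *addr)
{
  if (!callee_uses_pic)
    return false;
  if (addr != NULL)
    *addr = &toy_pic_reg;
  return true;
}

/* A caller that only needs the yes/no answer passes NULL.  */
int
toy_count_actuals (int num_actuals, bool callee_uses_pic)
{
  if (toy_implicit_pic_arg (callee_uses_pic, NULL))
    num_actuals++;
  return num_actuals;
}
```

Callers that need the register pass the address of a local rtx and read it only when the hook returns true, so no call has to compute a value just to throw it away.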

brgds, H-P
Ilya Enkovich Aug. 25, 2014, 11:43 a.m. UTC | #4
2014-08-25 15:24 GMT+04:00 Hans-Peter Nilsson <hp@bitrange.com>:
> On Mon, 25 Aug 2014, Ilya Enkovich wrote:
>> 2014-08-23 5:47 GMT+04:00 Hans-Peter Nilsson <hp@bitrange.com>:
>> > ...did you send the right version of the patch?
>> > This one uses the RTX-returning hook only in boolean tests,
>> > unless I misread.
>
> (I did, but not by much.)
>
>> NULL returned by hook means we do not have implicit pic arg to
>> pass/receive and there are pieces of code which should be executed
>> only when implicit pic arg exists.  This causes these boolean tests.
>
> Well, obviously, but...
>
>> There are also non boolean usages. E.g.:
>
> I think singular ("usage") is more correct?
> I saw only one such use. :)

There is another one in i386.c :)

>
>> +      rtx old_reg = targetm.calls.implicit_pic_arg (fndecl);
>> +      rtx new_reg = gen_reg_rtx (GET_MODE (old_reg));
>> +      emit_move_insn (new_reg, old_reg);
>> +      pic_offset_table_rtx = new_reg;
>
> And before that, it's called as a boolean test, throwing away
> the result!
>
> I suggest you change the hook to return a boolean, with a
> pointer argument to a variable to set, passed as NULL from
> callers not interested in the actual value.
>
> I.e. instead of:
>
>> >> +@deftypefn {Target Hook} rtx TARGET_IMPLICIT_PIC_ARG (const_tree @var{fntype_or_decl})
>
> make it a:
>
> @deftypefn {Target Hook} bool TARGET_IMPLICIT_PIC_ARG
>  (const_tree @var{fntype_or_decl}, rtx *@var{addr})

OK.  I'll change this hook if this turns into a production-quality
patch.  The current patch is posted to demonstrate the approach and to
show the problem spots I have to deal with in reload.  There is no
point in cleaning it up until a decision about next steps is made.

Thanks,
Ilya

Vladimir Makarov Aug. 25, 2014, 3:08 p.m. UTC | #5
On 2014-08-22 8:21 AM, Ilya Enkovich wrote:
> Hi,
>
> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in 32bit PIC mode.  It was decided that the best approach would be to not fix ebx register, use pseudo register for GOT base address and let allocator do the rest.  This should be similar to how clang and icc work with GOT base address.  I've been working for some time on such patch and now want to share my results.
>
> The idea of the patch was very simple and included few things;
>   1.  Set PIC_OFFSET_TABLE_REGNUM to INVALID_REGNUM to specify that we do not have any hard reg fixed for PIC.
>   2.  Initialize pic_offset_table_rtx with a new pseudo register in the beginning of a function expand.
>   3.  Change ABI so that there is a possible implicit PIC argument for calls; pic_offset_table_rtx is used as an arg value if such implicit arg exist.
>
> Such approach worked well on small tests but trying to run some benchmarks we faced a problem with reload of address constants.  The problem is that when we try to rematerialize address constant or some constant memory reference, we have to use pic_offset_table_rtx.  It means we insert new usages of a pseudo register and allocator cannot handle it correctly.  Same problem also applies for float and vector constants.
>
> Rematerialization is not the only case causing new pic_offset_table_rtx usage.  Another case is a split of some instructions using constant but not having proper constraints.  E.g. pushtf pattern allows push of constant but it has to be replaced with push of memory in reload pass causing additional usage of pic_offset_table_rtx.
>
> There are two ways to fix it.  The first one is to support modifications of pseudo register live range during reload and correctly allocate hard regs for its new usages (currently we have some hard reg allocated for new usage of pseudo reg but it may contain value of some other pseudo reg; thus we reveal the problem at runtime only).
>

I believe there is already code to deal with this situation.  It is the
code for risky transformations (please check the flag
lra_risky_transformation_p).  If this flag is set, the next LRA assign
subpass runs and checks the correctness of assignments (e.g. it checks
for the situation where two different pseudos have intersecting live
ranges and the same assigned hard reg; if such a dangerous situation
is found, it is fixed).
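
As a rough illustration of that kind of check, here is a toy in C.  It is not LRA's actual implementation: toy_assignment and toy_fix_conflicts are hypothetical names, each pseudo has one contiguous range, and the "fix" is simply to spill the later pseudo of a conflicting pair.

```c
#include <stddef.h>

/* Toy model, not LRA code.  */
struct toy_assignment {
  int pseudo;
  int hard_reg;  /* -1 means spilled to memory */
  int start;     /* live range is [start, end] */
  int end;
};

/* For every pair of pseudos that share a hard register over
   intersecting live ranges, spill the second one.  Return the number
   of pseudos spilled.  */
size_t
toy_fix_conflicts (struct toy_assignment *a, size_t n)
{
  size_t spilled = 0;
  for (size_t i = 0; i < n; i++)
    for (size_t j = i + 1; j < n; j++)
      {
        /* Already spilled pseudos cannot conflict.  */
        if (a[i].hard_reg < 0 || a[j].hard_reg < 0)
          continue;
        if (a[i].hard_reg == a[j].hard_reg
            && a[i].start <= a[j].end && a[j].start <= a[i].end)
          {
            a[j].hard_reg = -1;
            spilled++;
          }
      }
  return spilled;
}
```

A second call on the same array returns 0, since no two remaining pseudos share a hard register over intersecting ranges.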

> The second way is to avoid all cases when new usages of pic_offset_table_rtx appear in reload.  That is a way I chose because it appeared simpler to me and would allow me to get some performance data faster.  Also having rematerialization of address and float constants in PIC mode would mean we have higher register pressure, thus having them on stack should be even more efficient.  To achieve it I had to cut off reg equivs to all exprs using symbol references and all constants living in the memory.  I also had to avoid instructions requiring split in reload causing load of constant from memory (*push[txd]f).
>
> Resulting compiler successfully passes make check, compiles EEMBC and SPEC2000 benchmarks.  There is no confidence I covered all cases and there still may be some templates causing split in reload with new pic_offset_table_rtx usages.  I think support of reload with pseudo PIC would be better and more general solution.  But I don't know how difficult is to implement it though.  Any ideas on resolving this reload issue?
>

Please see what I mentioned above.  Maybe it can fix the degradation.
Rematerialization is important for performance and switching it off
completely is not wise.


> I collected some performance numbers for EEMBC and SPEC2000 benchmarks.  Here are patch results for -Ofast optlevel with LTO collected on Avoton server:
> AUTOmark +1,9%
> TELECOMmark +4,0%
> DENmark +10,0%
> SPEC2000 -0,5%
>
> There are few degradations on EEMBC benchmarks but on SPEC2000 situation is different and we see more performance losses.  Some of them are caused by disabled rematerialization of address constants.  In some cases relaxed ebx causes more spills/fills in places where GOT is frequently used.  There are also some minor fixes required in the patch to allow more efficient function prolog (avoid unnecessary GOT register initialization and allow its initialization without ebx usage).  Suppose some performance problems may be resolved but a good fix for reload should go first.
>
>

Ilya, the optimization you are trying to implement is important in many
cases and should be included in gcc in some way.  If the degradations
can be solved in the way I mentioned above, we could introduce a
machine-dependent flag.
Jeff Law Aug. 25, 2014, 5:30 p.m. UTC | #6
On 08/22/14 06:21, Ilya Enkovich wrote:
>
> Such approach worked well on small tests but trying to run some
> benchmarks we faced a problem with reload of address constants.  The
> problem is that when we try to rematerialize address constant or some
> constant memory reference, we have to use pic_offset_table_rtx.  It
> means we insert new usages of a pseudo register and allocator cannot
> handle it correctly.  Same problem also applies for float and vector
> constants.
Isn't this typically handled with secondary reloads?   It's not an exact 
match, but if you look at the PA port, you can see cases where we need 
to have %r1 available when we rematerialize certain constants.  Several 
ports have secondary reloads that you may be able to refer back to.  LRA 
may handle things differently, so first check LRA's paths.



>
> Rematerialization is not the only case causing new
> pic_offset_table_rtx usage.  Another case is a split of some
> instructions using constant but not having proper constraints.  E.g.
> pushtf pattern allows push of constant but it has to be replaced with
> push of memory in reload pass causing additional usage of
> pic_offset_table_rtx.
Yup.  I think those would be handled the same way.


Jeff
Ilya Enkovich Aug. 26, 2014, 7:49 a.m. UTC | #7
2014-08-25 19:08 GMT+04:00 Vladimir Makarov <vmakarov@redhat.com>:
> On 2014-08-22 8:21 AM, Ilya Enkovich wrote:
>>
>> Hi,
>>
>> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in
>> 32bit PIC mode.  It was decided that the best approach would be to not fix
>> ebx register, use pseudo register for GOT base address and let allocator do
>> the rest.  This should be similar to how clang and icc work with GOT base
>> address.  I've been working for some time on such patch and now want to
>> share my results.
>>
>> The idea of the patch was very simple and included few things;
>>   1.  Set PIC_OFFSET_TABLE_REGNUM to INVALID_REGNUM to specify that we do
>> not have any hard reg fixed for PIC.
>>   2.  Initialize pic_offset_table_rtx with a new pseudo register in the
>> beginning of a function expand.
>>   3.  Change ABI so that there is a possible implicit PIC argument for
>> calls; pic_offset_table_rtx is used as an arg value if such implicit arg
>> exist.
>>
>> Such approach worked well on small tests but trying to run some benchmarks
>> we faced a problem with reload of address constants.  The problem is that
>> when we try to rematerialize address constant or some constant memory
>> reference, we have to use pic_offset_table_rtx.  It means we insert new
>> usages of a pseudo register and allocator cannot handle it correctly.  Same
>> problem also applies for float and vector constants.
>>
>> Rematerialization is not the only case causing new pic_offset_table_rtx
>> usage.  Another case is a split of some instructions using constant but not
>> having proper constraints.  E.g. pushtf pattern allows push of constant but
>> it has to be replaced with push of memory in reload pass causing additional
>> usage of pic_offset_table_rtx.
>>
>> There are two ways to fix it.  The first one is to support modifications
>> of pseudo register live range during reload and correctly allocate hard regs
>> for its new usages (currently we have some hard reg allocated for new usage
>> of pseudo reg but it may contain value of some other pseudo reg; thus we
>> reveal the problem at runtime only).
>>
>
> I believe there is already code to deal with this situation.  It is code for
> risky transformations (please check flag lra_risky_transformation_p).  If
> this flag is set, next lra assign subpass is running and checking
> correctness of assignments (e.g. checking situation when two different
> pseudos have intersected live ranges and the same assigned hard reg.  If
> such dangerous situation is found, it is fixed).

I tried removing my restrictions from setup_reg_equiv and initializing
lra_risky_transformation_p to 'true' in lra_constraints instead.  I
got only a 50% pass rate for SPEC2000 at -Ofast with LTO.  I will look
for the cause of the failures.

Ilya

Ilya Enkovich Aug. 26, 2014, 8:57 a.m. UTC | #8
2014-08-26 11:49 GMT+04:00 Ilya Enkovich <enkovich.gnu@gmail.com>:
> 2014-08-25 19:08 GMT+04:00 Vladimir Makarov <vmakarov@redhat.com>:
>> On 2014-08-22 8:21 AM, Ilya Enkovich wrote:
>>>
>>> Hi,
>>>
>>> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in
>>> 32bit PIC mode.  It was decided that the best approach would be to not fix
>>> ebx register, use pseudo register for GOT base address and let allocator do
>>> the rest.  This should be similar to how clang and icc work with GOT base
>>> address.  I've been working for some time on such patch and now want to
>>> share my results.
>>>
>>> The idea of the patch was very simple and included few things;
>>>   1.  Set PIC_OFFSET_TABLE_REGNUM to INVALID_REGNUM to specify that we do
>>> not have any hard reg fixed for PIC.
>>>   2.  Initialize pic_offset_table_rtx with a new pseudo register in the
>>> beginning of a function expand.
>>>   3.  Change ABI so that there is a possible implicit PIC argument for
>>> calls; pic_offset_table_rtx is used as an arg value if such implicit arg
>>> exist.
>>>
>>> Such approach worked well on small tests but trying to run some benchmarks
>>> we faced a problem with reload of address constants.  The problem is that
>>> when we try to rematerialize address constant or some constant memory
>>> reference, we have to use pic_offset_table_rtx.  It means we insert new
>>> usages of a pseudo register and allocator cannot handle it correctly.  Same
>>> problem also applies for float and vector constants.
>>>
>>> Rematerialization is not the only case causing new pic_offset_table_rtx
>>> usage.  Another case is a split of some instructions using constant but not
>>> having proper constraints.  E.g. pushtf pattern allows push of constant but
>>> it has to be replaced with push of memory in reload pass causing additional
>>> usage of pic_offset_table_rtx.
>>>
>>> There are two ways to fix it.  The first one is to support modifications
>>> of pseudo register live range during reload and correctly allocate hard regs
>>> for its new usages (currently we have some hard reg allocated for new usage
>>> of pseudo reg but it may contain value of some other pseudo reg; thus we
>>> reveal the problem at runtime only).
>>>
>>
>> I believe there is already code to deal with this situation.  It is code for
>> risky transformations (please check flag lra_risky_transformation_p).  If
>> this flag is set, next lra assign subpass is running and checking
>> correctness of assignments (e.g. checking situation when two different
>> pseudos have intersected live ranges and the same assigned hard reg.  If
>> such dangerous situation is found, it is fixed).
>
> I tried to remove my restrictions from setup_reg_equiv and initialize
> lra_risky_transformation_p with 'true' in lra_constraints instead.  I
> got only 50% pass rate for SPEC2000 on Ofast with LTO.  Will search
> for fail reason.

I've looked into one of the failures.  There is still a problem with
allocation in reload.  Here is a piece of code which uses a float
constant:

(insn 1199 1198 1200 96 (set (reg:SI 3 bx)
        (reg:SI 1301 [528])) /usr/include/bits/stdlib-float.h:28 90 {*movsi_internal}
     (nil))
(call_insn 1200 1199 1201 96 (set (reg:DF 8 st)
        (call (mem:QI (symbol_ref:SI ("strtod") [flags 0x41]  <function_decl 0x2b29b8ea8900 strtod>) [0 strtod S1 A8])
            (const_int 8 [0x8]))) /usr/include/bits/stdlib-float.h:28 661 {*call_value}
     (expr_list:REG_DEAD (reg:SI 3 bx)
        (expr_list:REG_CALL_DECL (symbol_ref:SI ("strtod") [flags 0x41]  <function_decl 0x2b29b8ea8900 strtod>)
            (expr_list:REG_EH_REGION (const_int 0 [0])
                (nil))))
    (expr_list (use (reg:SI 3 bx))
        (expr_list:SI (use (reg:SI 3 bx))
            (expr_list:SI (use (mem/f:SI (reg/f:SI 7 sp) [0  S4 A32]))
                (expr_list:SI (use (mem/f:SI (plus:SI (reg/f:SI 7 sp)
                                (const_int 4 [0x4])) [0  S4 A32]))
                    (nil))))))
(insn 1201 1200 1202 96 (set (reg:DF 321 [ D.7817 ])
        (reg:DF 8 st)) /usr/include/bits/stdlib-float.h:28 128 {*movdf_internal}
     (expr_list:REG_DEAD (reg:DF 8 st)
        (nil)))
(insn 1202 1201 1203 96 (set (reg:SF 322 [ D.7804 ])
        (float_truncate:SF (reg:DF 321 [ D.7817 ]))) read_arch.c:700 157 {*truncdfsf_fast_sse}
     (expr_list:REG_DEAD (reg:DF 321 [ D.7817 ])
        (nil)))
(insn 1203 1202 1204 96 (set (mem:SF (reg/f:SI 198 [ D.7812 ]) [4 _130->frequency+0 S4 A32])
        (reg:SF 322 [ D.7804 ])) read_arch.c:700 129 {*movsf_internal}
     (nil))
(insn 1204 1203 1205 96 (set (reg:SF 1209)
        (mem/u/c:SF (plus:SI (reg:SI 1301 [528])
                (const:SI (unspec:SI [
                            (symbol_ref/u:SI ("*.LC12") [flags 0x2])
                        ] UNSPEC_GOTOFF))) [4  S4 A32])) read_arch.c:701 129 {*movsf_internal}
     (expr_list:REG_EQUAL (const_double:SF 0.0 [0x0.0p+0])
        (nil)))
(note 1205 1204 1206 96 NOTE_INSN_DELETED)
(note 1206 1205 1207 96 NOTE_INSN_DELETED)
(insn 1207 1206 1208 96 (set (reg:CCFP 17 flags)
        (compare:CCFP (reg:SF 1209)
            (reg:SF 322 [ D.7804 ]))) read_arch.c:701 53 {*cmpisf_sse}
     (nil))
(jump_insn 1208 1207 3075 96 (set (pc)
        (if_then_else (ge (reg:CCFP 17 flags)
                (const_int 0 [0]))
            (label_ref:SI 3114)
            (pc))) read_arch.c:701 606 {*jcc_1}
     (expr_list:REG_DEAD (reg:CCFP 17 flags)
        (int_list:REG_BR_PROB 2 (nil)))
 -> 3114)
(note 3075 1208 1209 97 [bb 97] NOTE_INSN_BASIC_BLOCK)
(insn 1209 3075 1210 97 (set (reg:SF 1208)
        (mem/u/c:SF (plus:SI (reg:SI 1301 [528])
                (const:SI (unspec:SI [
                            (symbol_ref/u:SI ("*.LC11") [flags 0x2])
                        ] UNSPEC_GOTOFF))) [4  S4 A32])) read_arch.c:701 129 {*movsf_internal}
     (expr_list:REG_EQUIV (const_double:SF 1.0e+0 [0x0.8p+1])
        (nil)))
(note 1210 1209 1211 97 NOTE_INSN_DELETED)
(note 1211 1210 1212 97 NOTE_INSN_DELETED)
(insn 1212 1211 1213 97 (set (reg:CCFP 17 flags)
        (compare:CCFP (reg:SF 322 [ D.7804 ])
            (reg:SF 1208))) read_arch.c:701 53 {*cmpisf_sse}
     (nil))

We have the PIC register r1301 (formerly r528) used for a constant load
(insn 1209).  This register was actually loaded into bx (insn 1199), and
that hard reg could be used by insn 1209.  During reload, insn 1209 is
removed and a new one is created instead:

(insn 3864 1211 1212 104 (set (reg:SI 0 ax [1468])
        (plus:SI (reg:SI 6 bp [528])
            (const:SI (unspec:SI [
                        (symbol_ref/u:SI ("*.LC11") [flags 0x2])
                    ] UNSPEC_GOTOFF)))) read_arch.c:701 213 {*leasi}
     (expr_list:REG_EQUAL (symbol_ref/u:SI ("*.LC11") [flags 0x2])
        (nil)))
(insn 1212 3864 1213 104 (set (reg:CCFP 17 flags)
        (compare:CCFP (reg:SF 21 xmm0 [orig:322 D.7804 ] [322])
            (mem/u/c:SF (reg:SI 0 ax [1468]) [4  S4 A32]))) read_arch.c:701 53 {*cmpisf_sse}
     (nil))

This new instruction uses bp, which is wrong; the required value is
actually in bx.  In the debugger I also checked that bp doesn't hold the
required value.  I believe I enabled the flag correctly because I found
this in the log: "Spill r1301 after risky transformations".  Is it
possible that we are still not allowed to use the original PIC register
(r528) and should use the reg copy created for the particular region
(in this case r1301)?

Ilya

>
> Ilya
>
>>
>>
>>> The second way is to avoid all cases when new usages of
>>> pic_offset_table_rtx appear in reload.  That is a way I chose because it
>>> appeared simplier to me and would allow me to get some performance data
>>> faster.  Also having rematerialization of address anf float constants in PIC
>>> mode would mean we have higher register pressure, thus having them on stack
>>> should be even more efficient.  To achieve it I had to cut off reg equivs to
>>> all exprs using symbol references and all constants living in the memory.  I
>>> also had to avoid instructions requiring split in reload causing load of
>>> constant from memory (*push[txd]f).
>>>
>>> Resulting compiler successfully passes make check, compiles EEMBC and
>>> SPEC2000 benchmarks.  There is no confidence I covered all cases and there
>>> still may be some templates causing split in reload with new
>>> pic_offset_table_rtx usages.  I think support of reload with pseudo PIC
>>> would be better and more general solution.  But I don't know how difficult
>>> is to implement it though.  Any ideas on resolving this reload issue?
>>>
>>
>> Please see what I mentioned above.  Maybe it can fix the degradation.
>> Rematerialization is important for performance and switching it off
>> completely is not wise.
>>
>>
>>
>>> I collected some performance numbers for EEMBC and SPEC2000 benchmarks.
>>> Here are patch results for -Ofast optlevel with LTO collected on Avoton
>>> server:
>>> AUTOmark +1.9%
>>> TELECOMmark +4.0%
>>> DENmark +10.0%
>>> SPEC2000 -0.5%
>>>
>>> There are few degradations on EEMBC benchmarks but on SPEC2000 situation
>>> is different and we see more performance losses.  Some of them are caused by
>>> disabled rematerialization of address constants.  In some cases relaxed ebx
>>> causes more spills/fills in places where GOT is frequently used.  There are
>>> also some minor fixes required in the patch to allow more efficient function
>>> prolog (avoid unnecessary GOT register initialization and allow its
>>> initialization without ebx usage).  Suppose some performance problems may be
>>> resolved but a good fix for reload should go first.
>>>
>>>
>>
>> Ilya, the optimization you are trying to implement is important in many
>> cases and should be in some way included in gcc.  If the degradations can be
>> solved in a way i mentioned above we could introduce a machine-dependent
>> flag.
>>
Vladimir Makarov Aug. 26, 2014, 3:25 p.m. UTC | #9
On 08/26/2014 04:57 AM, Ilya Enkovich wrote:
> 2014-08-26 11:49 GMT+04:00 Ilya Enkovich <enkovich.gnu@gmail.com>:
>> 2014-08-25 19:08 GMT+04:00 Vladimir Makarov <vmakarov@redhat.com>:
>>> On 2014-08-22 8:21 AM, Ilya Enkovich wrote:
>>>> Hi,
>>>>
>>>> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in
>>>> 32bit PIC mode.  It was decided that the best approach would be to not fix
>>>> ebx register, use pseudo register for GOT base address and let allocator do
>>>> the rest.  This should be similar to how clang and icc work with GOT base
>>>> address.  I've been working for some time on such patch and now want to
>>>> share my results.
>>>>
>>>> The idea of the patch was very simple and included few things;
>>>>   1.  Set PIC_OFFSET_TABLE_REGNUM to INVALID_REGNUM to specify that we do
>>>> not have any hard reg fixed for PIC.
>>>>   2.  Initialize pic_offset_table_rtx with a new pseudo register in the
>>>> beginning of a function expand.
>>>>   3.  Change ABI so that there is a possible implicit PIC argument for
>>>> calls; pic_offset_table_rtx is used as an arg value if such implicit arg
>>>> exist.
>>>>
>>>> Such approach worked well on small tests but trying to run some benchmarks
>>>> we faced a problem with reload of address constants.  The problem is that
>>>> when we try to rematerialize address constant or some constant memory
>>>> reference, we have to use pic_offset_table_rtx.  It means we insert new
>>>> usages of a pseudo register and allocator cannot handle it correctly.  Same
>>>> problem also applies for float and vector constants.
>>>>
>>>> Rematerialization is not the only case causing new pic_offset_table_rtx
>>>> usage.  Another case is a split of some instructions using constant but not
>>>> having proper constraints.  E.g. pushtf pattern allows push of constant but
>>>> it has to be replaced with push of memory in reload pass causing additional
>>>> usage of pic_offset_table_rtx.
>>>>
>>>> There are two ways to fix it.  The first one is to support modifications
>>>> of pseudo register live range during reload and correctly allocate hard regs
>>>> for its new usages (currently we have some hard reg allocated for new usage
>>>> of pseudo reg but it may contain value of some other pseudo reg; thus we
>>>> reveal the problem at runtime only).
>>>>
>>> I believe there is already code to deal with this situation.  It is code for
>>> risky transformations (please check flag lra_risky_transformation_p).  If
>>> this flag is set, next lra assign subpass is running and checking
>>> correctness of assignments (e.g. checking situation when two different
>>> pseudos have intersected live ranges and the same assigned hard reg.  If
>>> such dangerous situation is found, it is fixed).
>> I tried to remove my restrictions from setup_reg_equiv and initialize
>> lra_risky_transformation_p with 'true' in lra_constraints instead.  I
>> got only 50% pass rate for SPEC2000 on Ofast with LTO.  Will search
>> for fail reason.
> I've looked into one of fails.  There is still a problem with
> allocation in reload. Here is a piece of code which uses float
> constant:
>
> (insn 1199 1198 1200 96 (set (reg:SI 3 bx)
>         (reg:SI 1301 [528])) /usr/include/bits/stdlib-float.h:28 90
> {*movsi_internal}
>      (nil))
> (call_insn 1200 1199 1201 96 (set (reg:DF 8 st)
>         (call (mem:QI (symbol_ref:SI ("strtod") [flags 0x41]
> <function_decl 0x2b29b8ea8900 strtod>) [0 strtod S1 A8])
>             (const_int 8 [0x8]))) /usr/include/bits/stdlib-float.h:28
> 661 {*call_value}
>      (expr_list:REG_DEAD (reg:SI 3 bx)
>         (expr_list:REG_CALL_DECL (symbol_ref:SI ("strtod") [flags
> 0x41]  <function_decl 0x2b29b8ea8900 strtod>)
>             (expr_list:REG_EH_REGION (const_int 0 [0])
>                 (nil))))
>     (expr_list (use (reg:SI 3 bx))
>         (expr_list:SI (use (reg:SI 3 bx))
>             (expr_list:SI (use (mem/f:SI (reg/f:SI 7 sp) [0  S4 A32]))
>                 (expr_list:SI (use (mem/f:SI (plus:SI (reg/f:SI 7 sp)
>                                 (const_int 4 [0x4])) [0  S4 A32]))
>                     (nil))))))
> (insn 1201 1200 1202 96 (set (reg:DF 321 [ D.7817 ])
>         (reg:DF 8 st)) /usr/include/bits/stdlib-float.h:28 128 {*movdf_internal}
>      (expr_list:REG_DEAD (reg:DF 8 st)
>         (nil)))
> (insn 1202 1201 1203 96 (set (reg:SF 322 [ D.7804 ])
>         (float_truncate:SF (reg:DF 321 [ D.7817 ]))) read_arch.c:700
> 157 {*truncdfsf_fast_sse}
>      (expr_list:REG_DEAD (reg:DF 321 [ D.7817 ])
>         (nil)))
> (insn 1203 1202 1204 96 (set (mem:SF (reg/f:SI 198 [ D.7812 ]) [4
> _130->frequency+0 S4 A32])
>         (reg:SF 322 [ D.7804 ])) read_arch.c:700 129 {*movsf_internal}
>      (nil))
> (insn 1204 1203 1205 96 (set (reg:SF 1209)
>         (mem/u/c:SF (plus:SI (reg:SI 1301 [528])
>                 (const:SI (unspec:SI [
>                             (symbol_ref/u:SI ("*.LC12") [flags 0x2])
>                         ] UNSPEC_GOTOFF))) [4  S4 A32]))
> read_arch.c:701 129 {*movsf_internal}
>      (expr_list:REG_EQUAL (const_double:SF 0.0 [0x0.0p+0])
>         (nil)))
> (note 1205 1204 1206 96 NOTE_INSN_DELETED)
> (note 1206 1205 1207 96 NOTE_INSN_DELETED)
> (insn 1207 1206 1208 96 (set (reg:CCFP 17 flags)
>         (compare:CCFP (reg:SF 1209)
>             (reg:SF 322 [ D.7804 ]))) read_arch.c:701 53 {*cmpisf_sse}
>      (nil))
> (jump_insn 1208 1207 3075 96 (set (pc)
>         (if_then_else (ge (reg:CCFP 17 flags)
>                 (const_int 0 [0]))
>             (label_ref:SI 3114)
>             (pc))) read_arch.c:701 606 {*jcc_1}
>      (expr_list:REG_DEAD (reg:CCFP 17 flags)
>         (int_list:REG_BR_PROB 2 (nil)))
>  -> 3114)
> (note 3075 1208 1209 97 [bb 97] NOTE_INSN_BASIC_BLOCK)
> (insn 1209 3075 1210 97 (set (reg:SF 1208)
>         (mem/u/c:SF (plus:SI (reg:SI 1301 [528])
>                 (const:SI (unspec:SI [
>                             (symbol_ref/u:SI ("*.LC11") [flags 0x2])
>                         ] UNSPEC_GOTOFF))) [4  S4 A32]))
> read_arch.c:701 129 {*movsf_internal}
>      (expr_list:REG_EQUIV (const_double:SF 1.0e+0 [0x0.8p+1])
>         (nil)))
> (note 1210 1209 1211 97 NOTE_INSN_DELETED)
> (note 1211 1210 1212 97 NOTE_INSN_DELETED)
> (insn 1212 1211 1213 97 (set (reg:CCFP 17 flags)
>         (compare:CCFP (reg:SF 322 [ D.7804 ])
>             (reg:SF 1208))) read_arch.c:701 53 {*cmpisf_sse}
>      (nil))
>
> We have PIC register r1301 (former r528) used for constant load (insn
> 1209).  This register was actually loaded to bx (insn 1199) and this
> hard reg may be used by insn 1209.  During reload we have insn 1209
> removed and a new one created instead:
>
> (insn 3864 1211 1212 104 (set (reg:SI 0 ax [1468])
>         (plus:SI (reg:SI 6 bp [528])
>             (const:SI (unspec:SI [
>                         (symbol_ref/u:SI ("*.LC11") [flags 0x2])
>                     ] UNSPEC_GOTOFF)))) read_arch.c:701 213 {*leasi}
>      (expr_list:REG_EQUAL (symbol_ref/u:SI ("*.LC11") [flags 0x2])
>         (nil)))
> (insn 1212 3864 1213 104 (set (reg:CCFP 17 flags)
>         (compare:CCFP (reg:SF 21 xmm0 [orig:322 D.7804 ] [322])
>             (mem/u/c:SF (reg:SI 0 ax [1468]) [4  S4 A32])))
> read_arch.c:701 53 {*cmpisf_sse}
>      (nil))
>
> In this new instruction bp is used which is wrong. We actually have
> required value in bx. In debugger I also checked that bp doesn't have
> required value.  I suppose I enabled flag correctly because found this
> in the log: "Spill r1301 after risky transformations".  Is it possible
> we are still not allowed to use the original PIC register (r528) and
> should use a reg copy created for particular region (in this case
> r1301)?
>
It is hard for me to say without the full patch and the test.  I can
only guess that r1301 gets a wrong class and is therefore assigned to
the wrong hard reg.

Could you send me the patch and the test?  I'll look at this and let
you know what is going on.
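
A minimal sketch of the kind of check described in the quoted text for
risky transformations: look for two pseudos whose live ranges intersect
and which were assigned the same hard reg.  This is not LRA code.  The
struct, the function names and the one-interval-per-pseudo live ranges
are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model: each pseudo has a single live range,
   given as a closed interval of program points, plus the hard register
   it was assigned.  */
struct pseudo_assignment
{
  int regno;      /* pseudo register number, e.g. 1301 */
  int hard_regno; /* assigned hard register, or -1 if spilled */
  int start;      /* first program point where the pseudo is live */
  int finish;     /* last program point where the pseudo is live */
};

/* Return true if the two live ranges share at least one program point.  */
static bool
ranges_intersect_p (const struct pseudo_assignment *a,
                    const struct pseudo_assignment *b)
{
  return a->start <= b->finish && b->start <= a->finish;
}

/* Return the index of the first pseudo that shares a hard register with
   an earlier pseudo while both are live, or -1 if there is no such
   conflict.  Spilled pseudos (hard_regno < 0) never conflict.  */
static int
find_conflicting_assignment (const struct pseudo_assignment *p, size_t n)
{
  assert (p != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < i; j++)
      if (p[i].hard_regno >= 0
          && p[i].hard_regno == p[j].hard_regno
          && ranges_intersect_p (&p[i], &p[j]))
        return (int) i;
  return -1;
}
```

In this model, a new use of the PIC pseudo inserted during reload
extends its live range.  A conflict then means the hard reg may hold
the value of some other pseudo at that point, which is the situation
the failing test appears to hit.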
Uros Bizjak Aug. 28, 2014, 1:01 p.m. UTC | #10
On Fri, Aug 22, 2014 at 2:21 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
> Hi,
>
> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in 32bit PIC mode.  It was decided that the best approach would be to not fix ebx register, use pseudo register for GOT base address and let allocator do the rest.  This should be similar to how clang and icc work with GOT base address.  I've been working for some time on such patch and now want to share my results.

+#define PIC_OFFSET_TABLE_REGNUM                                        \
+  ((TARGET_64BIT && (ix86_cmodel == CM_SMALL_PIC                       \
+                     || TARGET_PECOFF))                                \
+   || !flag_pic ? INVALID_REGNUM                                       \
+   : X86_TUNE_RELAX_PIC_REG ? (pic_offset_table_rtx ? INVALID_REGNUM   \
+                              : REAL_PIC_OFFSET_TABLE_REGNUM)          \
+   : reload_completed ? REGNO (pic_offset_table_rtx)                   \
    : REAL_PIC_OFFSET_TABLE_REGNUM)

I'd like to avoid X86_TUNE_RELAX_PIC_REG and always treat EBX as an
allocatable register. This way, we can avoid all the mess with implicit
xchgs in atomic_compare_and_swap<dwi>_doubleword. Also, having an
allocatable EBX would allow us to introduce a __builtin_cpuid builtin
and clean up cpuid.h.
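
The nested conditional in the quoted macro may be easier to follow when
written as a plain function.  The sketch below only models the decision
order of the proposed PIC_OFFSET_TABLE_REGNUM.  The struct, its field
names and the MODEL_* constants are hypothetical stand-ins for the GCC
globals and macros named in the comments, not real GCC interfaces.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_INVALID_REGNUM (-1) /* stand-in for INVALID_REGNUM */
#define MODEL_BX_REG 3            /* stand-in for REAL_PIC_OFFSET_TABLE_REGNUM */

/* Hypothetical snapshot of the state the macro consults.  */
struct pic_state
{
  bool no_got_reg;       /* (TARGET_64BIT && (CM_SMALL_PIC || TARGET_PECOFF))
                            || !flag_pic */
  bool relax_pic_reg;    /* X86_TUNE_RELAX_PIC_REG */
  bool have_pic_rtx;     /* pic_offset_table_rtx != NULL */
  bool reload_completed; /* reload_completed */
  int pic_rtx_regno;     /* REGNO (pic_offset_table_rtx), if it exists */
};

/* Model of the proposed macro, tested in the same order as the nested
   conditional expression.  */
static int
model_pic_offset_table_regnum (const struct pic_state *s)
{
  /* No GOT register is needed at all.  */
  if (s->no_got_reg)
    return MODEL_INVALID_REGNUM;

  /* Relaxed mode: once the PIC pseudo exists, no hard reg is fixed.  */
  if (s->relax_pic_reg)
    return s->have_pic_rtx ? MODEL_INVALID_REGNUM : MODEL_BX_REG;

  /* Old behaviour: after reload, use whatever register the rtx names.
     The macro dereferences the rtx here, so the model requires it.  */
  if (s->reload_completed)
    {
      assert (s->have_pic_rtx);
      return s->pic_rtx_regno;
    }

  return MODEL_BX_REG;
}
```

Dropping the tune flag, as suggested above, would amount to removing
the relax_pic_reg test and always taking the relaxed branch.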
Ilya Enkovich Aug. 28, 2014, 1:13 p.m. UTC | #11
2014-08-28 17:01 GMT+04:00 Uros Bizjak <ubizjak@gmail.com>:
> On Fri, Aug 22, 2014 at 2:21 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> Hi,
>>
>> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in 32bit PIC mode.  It was decided that the best approach would be to not fix ebx register, use pseudo register for GOT base address and let allocator do the rest.  This should be similar to how clang and icc work with GOT base address.  I've been working for some time on such patch and now want to share my results.
>
> +#define PIC_OFFSET_TABLE_REGNUM                                        \
> +  ((TARGET_64BIT && (ix86_cmodel == CM_SMALL_PIC                       \
> +                     || TARGET_PECOFF))                                \
> +   || !flag_pic ? INVALID_REGNUM                                       \
> +   : X86_TUNE_RELAX_PIC_REG ? (pic_offset_table_rtx ? INVALID_REGNUM   \
> +                              : REAL_PIC_OFFSET_TABLE_REGNUM)          \
> +   : reload_completed ? REGNO (pic_offset_table_rtx)                   \
>     : REAL_PIC_OFFSET_TABLE_REGNUM)
>
> I'd like to avoid X86_TUNE_RELAX_PIC_REG and always treat EBX as an
> allocatable register. This way, we can avoid all mess with implicit
> xchgs in atomic_compare_and_swap<dwi>_doubleword. Also, having
> allocatable EBX would allow us to introduce __builtin_cpuid builtin
> and cleanup cpuid.h.

We need to show good performance before this feature can be enabled by
default.  Currently the patch causes a number of performance losses.  I
have a version of this patch where EBX is relaxed by a compiler flag
rather than a tune flag.

Ilya
Florian Weimer Aug. 28, 2014, 6:30 p.m. UTC | #12
On 08/28/2014 03:01 PM, Uros Bizjak wrote:
> I'd like to avoid X86_TUNE_RELAX_PIC_REG and always treat EBX as an
> allocatable register. This way, we can avoid all mess with implicit
> xchgs in atomic_compare_and_swap<dwi>_doubleword. Also, having
> allocatable EBX would allow us to introduce __builtin_cpuid builtin
> and cleanup cpuid.h.

It also makes writing solid inline assembly which has to use %ebx for 
some reason much easier.  We just fixed a glibc bug related to that.
Uros Bizjak Aug. 28, 2014, 6:58 p.m. UTC | #13
On Fri, Aug 22, 2014 at 2:21 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:

> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in 32bit PIC mode.  It was decided that the best approach would be to not fix ebx register, use pseudo register for GOT base address and let allocator do the rest.  This should be similar to how clang and icc work with GOT base address.  I've been working for some time on such patch and now want to share my results.

>  (define_insn "*pushtf"
>    [(set (match_operand:TF 0 "push_operand" "=<,<")
> -       (match_operand:TF 1 "general_no_elim_operand" "x,*roF"))]
> +       (match_operand:TF 1 "nonimmediate_no_elim_operand" "x,*roF"))]

Can you please explain the reason for this change (and a couple of
similar changes to push patterns) ?

Uros.
Ilya Enkovich Aug. 29, 2014, 6:50 a.m. UTC | #14
2014-08-28 22:58 GMT+04:00 Uros Bizjak <ubizjak@gmail.com>:
> On Fri, Aug 22, 2014 at 2:21 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>
>> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in 32bit PIC mode.  It was decided that the best approach would be to not fix ebx register, use pseudo register for GOT base address and let allocator do the rest.  This should be similar to how clang and icc work with GOT base address.  I've been working for some time on such patch and now want to share my results.
>
>>  (define_insn "*pushtf"
>>    [(set (match_operand:TF 0 "push_operand" "=<,<")
>> -       (match_operand:TF 1 "general_no_elim_operand" "x,*roF"))]
>> +       (match_operand:TF 1 "nonimmediate_no_elim_operand" "x,*roF"))]
>
> Can you please explain the reason for this change (and a couple of
> similar changes to push patterns) ?

This is a workaround for a stability problem with reload.  Immediate
operands cause new usages of the pseudo PIC register in reload, which
leads to wrong register allocation.  These changes won't be required
once the reload issue is resolved.

Ilya

>
> Uros.
Jeff Law Aug. 29, 2014, 6:45 p.m. UTC | #15
On 08/28/14 12:58, Uros Bizjak wrote:
> On Fri, Aug 22, 2014 at 2:21 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>
>> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in 32bit PIC mode.  It was decided that the best approach would be to not fix ebx register, use pseudo register for GOT base address and let allocator do the rest.  This should be similar to how clang and icc work with GOT base address.  I've been working for some time on such patch and now want to share my results.
>
>>   (define_insn "*pushtf"
>>     [(set (match_operand:TF 0 "push_operand" "=<,<")
>> -       (match_operand:TF 1 "general_no_elim_operand" "x,*roF"))]
>> +       (match_operand:TF 1 "nonimmediate_no_elim_operand" "x,*roF"))]
>
> Can you please explain the reason for this change (and a couple of
> similar changes to push patterns) ?
I'd recommend dropping them from the WIP postings.

jeff
Jeff Law Aug. 29, 2014, 6:48 p.m. UTC | #16
On 08/28/14 07:01, Uros Bizjak wrote:
> On Fri, Aug 22, 2014 at 2:21 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote:
>> Hi,
>>
>> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in 32bit PIC mode.  It was decided that the best approach would be to not fix ebx register, use pseudo register for GOT base address and let allocator do the rest.  This should be similar to how clang and icc work with GOT base address.  I've been working for some time on such patch and now want to share my results.
>
> +#define PIC_OFFSET_TABLE_REGNUM                                        \
> +  ((TARGET_64BIT && (ix86_cmodel == CM_SMALL_PIC                       \
> +                     || TARGET_PECOFF))                                \
> +   || !flag_pic ? INVALID_REGNUM                                       \
> +   : X86_TUNE_RELAX_PIC_REG ? (pic_offset_table_rtx ? INVALID_REGNUM   \
> +                              : REAL_PIC_OFFSET_TABLE_REGNUM)          \
> +   : reload_completed ? REGNO (pic_offset_table_rtx)                   \
>      : REAL_PIC_OFFSET_TABLE_REGNUM)
>
> I'd like to avoid X86_TUNE_RELAX_PIC_REG and always treat EBX as an
> allocatable register. This way, we can avoid all mess with implicit
> xchgs in atomic_compare_and_swap<dwi>_doubleword. Also, having
> allocatable EBX would allow us to introduce __builtin_cpuid builtin
> and cleanup cpuid.h.
I think for the initial WIP patch it was fine.  However I think we all 
agree that we want EBX as an allocatable register without any special 
conditions.  So I'd recommend pulling this out of the WIP patches as well.
Jeff

Patch

diff --git a/gcc/calls.c b/gcc/calls.c
index 4285ec1..85dae6b 100644
--- a/gcc/calls.c
+++ b/gcc/calls.c
@@ -1122,6 +1122,14 @@  initialize_argument_information (int num_actuals ATTRIBUTE_UNUSED,
     call_expr_arg_iterator iter;
     tree arg;
 
+    if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
+      {
+	gcc_assert (pic_offset_table_rtx);
+	args[j].tree_value = make_tree (ptr_type_node,
+					pic_offset_table_rtx);
+	j--;
+      }
+
     if (struct_value_addr_value)
       {
 	args[j].tree_value = struct_value_addr_value;
@@ -2520,6 +2528,10 @@  expand_call (tree exp, rtx target, int ignore)
     /* Treat all args as named.  */
     n_named_args = num_actuals;
 
+  /* Add implicit PIC arg.  */
+  if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
+    num_actuals++;
+
   /* Make a vector to hold all the information about each arg.  */
   args = XALLOCAVEC (struct arg_data, num_actuals);
   memset (args, 0, num_actuals * sizeof (struct arg_data));
@@ -3133,6 +3145,8 @@  expand_call (tree exp, rtx target, int ignore)
 	{
 	  int arg_nr = return_flags & ERF_RETURN_ARG_MASK;
 	  arg_nr = num_actuals - arg_nr - 1;
+	  if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
+	    arg_nr--;
 	  if (arg_nr >= 0
 	      && arg_nr < num_actuals
 	      && args[arg_nr].reg
@@ -3700,8 +3714,8 @@  emit_library_call_value_1 (int retval, rtx orgfun, rtx value,
      of the full argument passing conventions to limit complexity here since
      library functions shouldn't have many args.  */
 
-  argvec = XALLOCAVEC (struct arg, nargs + 1);
-  memset (argvec, 0, (nargs + 1) * sizeof (struct arg));
+  argvec = XALLOCAVEC (struct arg, nargs + 2);
+  memset (argvec, 0, (nargs + 2) * sizeof (struct arg));
 
 #ifdef INIT_CUMULATIVE_LIBCALL_ARGS
   INIT_CUMULATIVE_LIBCALL_ARGS (args_so_far_v, outmode, fun);
@@ -3717,6 +3731,23 @@  emit_library_call_value_1 (int retval, rtx orgfun, rtx value,
 
   push_temp_slots ();
 
+  if (targetm.calls.implicit_pic_arg (fndecl ? fndecl : fntype))
+    {
+      gcc_assert (pic_offset_table_rtx);
+
+      argvec[count].value = pic_offset_table_rtx;
+      argvec[count].mode = Pmode;
+      argvec[count].partial = 0;
+
+      argvec[count].reg = targetm.calls.function_arg (args_so_far,
+						      Pmode, NULL_TREE, true);
+
+      targetm.calls.function_arg_advance (args_so_far, Pmode, NULL_TREE, true);
+
+      count++;
+      nargs++;
+    }
+
   /* If there's a structure value address to be passed,
      either pass it in the special place, or pass it as an extra argument.  */
   if (mem_value && struct_value == 0 && ! pcc_struct_value)
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index cc4b0c7..cfafcdd 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -6133,6 +6133,21 @@  ix86_maybe_switch_abi (void)
     reinit_regs ();
 }
 
+/* Return reg in which implicit PIC base address
+   arg is passed.  */
+static rtx
+ix86_implicit_pic_arg (const_tree fntype_or_decl ATTRIBUTE_UNUSED)
+{
+  if ((TARGET_64BIT
+       && (ix86_cmodel == CM_SMALL_PIC
+	   || TARGET_PECOFF))
+      || !flag_pic
+      || !X86_TUNE_RELAX_PIC_REG)
+    return NULL_RTX;
+
+  return gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM);
+}
+
 /* Initialize a variable CUM of type CUMULATIVE_ARGS
    for a call to a function whose data type is FNTYPE.
    For a library call, FNTYPE is 0.  */
@@ -6198,6 +6213,11 @@  init_cumulative_args (CUMULATIVE_ARGS *cum,  /* Argument info to initialize */
 		      ? (!prototype_p (fntype) || stdarg_p (fntype))
 		      : !libname);
 
+  if (caller)
+    cum->implicit_pic_arg = ix86_implicit_pic_arg (fndecl ? fndecl : fntype);
+  else
+    cum->implicit_pic_arg = NULL_RTX;
+
   if (!TARGET_64BIT)
     {
       /* If there are variable arguments, then we won't pass anything
@@ -7291,7 +7311,9 @@  ix86_function_arg_advance (cumulative_args_t cum_v, enum machine_mode mode,
   if (type)
     mode = type_natural_mode (type, NULL, false);
 
-  if (TARGET_64BIT && (cum ? cum->call_abi : ix86_abi) == MS_ABI)
+  if (cum->implicit_pic_arg)
+    cum->implicit_pic_arg = NULL_RTX;
+  else if (TARGET_64BIT && (cum ? cum->call_abi : ix86_abi) == MS_ABI)
     function_arg_advance_ms_64 (cum, bytes, words);
   else if (TARGET_64BIT)
     function_arg_advance_64 (cum, mode, type, words, named);
@@ -7542,7 +7564,9 @@  ix86_function_arg (cumulative_args_t cum_v, enum machine_mode omode,
   if (type && TREE_CODE (type) == VECTOR_TYPE)
     mode = type_natural_mode (type, cum, false);
 
-  if (TARGET_64BIT && (cum ? cum->call_abi : ix86_abi) == MS_ABI)
+  if (cum->implicit_pic_arg)
+    arg = cum->implicit_pic_arg;
+  else if (TARGET_64BIT && (cum ? cum->call_abi : ix86_abi) == MS_ABI)
     arg = function_arg_ms_64 (cum, mode, omode, named, bytes);
   else if (TARGET_64BIT)
     arg = function_arg_64 (cum, mode, omode, type, named);
@@ -9373,6 +9397,9 @@  gen_pop (rtx arg)
 static unsigned int
 ix86_select_alt_pic_regnum (void)
 {
+  if (ix86_implicit_pic_arg (NULL))
+    return INVALID_REGNUM;
+
   if (crtl->is_leaf
       && !crtl->profile
       && !ix86_current_function_calls_tls_descriptor)
@@ -11236,7 +11263,8 @@  ix86_expand_prologue (void)
 	}
       else
 	{
-          insn = emit_insn (gen_set_got (pic_offset_table_rtx));
+	  rtx reg = gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM);
+          insn = emit_insn (gen_set_got (reg));
 	  RTX_FRAME_RELATED_P (insn) = 1;
 	  add_reg_note (insn, REG_CFA_FLUSH_QUEUE, NULL_RTX);
 	}
@@ -11789,7 +11817,8 @@  ix86_expand_epilogue (int style)
 static void
 ix86_output_function_epilogue (FILE *file ATTRIBUTE_UNUSED, HOST_WIDE_INT)
 {
-  if (pic_offset_table_rtx)
+  if (pic_offset_table_rtx
+      && REGNO (pic_offset_table_rtx) < FIRST_PSEUDO_REGISTER)
     SET_REGNO (pic_offset_table_rtx, REAL_PIC_OFFSET_TABLE_REGNUM);
 #if TARGET_MACHO
   /* Mach-O doesn't support labels at the end of objects, so if
@@ -13107,6 +13136,15 @@  ix86_GOT_alias_set (void)
   return set;
 }
 
+/* Set regs_ever_live for PIC base address register
+   to true if required.  */
+static void
+set_pic_reg_ever_alive ()
+{
+  if (reload_in_progress)
+    df_set_regs_ever_live (REGNO (pic_offset_table_rtx), true);
+}
+
 /* Return a legitimate reference for ORIG (an address) using the
    register REG.  If REG is 0, a new pseudo is generated.
 
@@ -13157,8 +13195,7 @@  legitimize_pic_address (rtx orig, rtx reg)
       /* This symbol may be referenced via a displacement from the PIC
 	 base address (@GOTOFF).  */
 
-      if (reload_in_progress)
-	df_set_regs_ever_live (PIC_OFFSET_TABLE_REGNUM, true);
+      set_pic_reg_ever_alive ();
       if (GET_CODE (addr) == CONST)
 	addr = XEXP (addr, 0);
       if (GET_CODE (addr) == PLUS)
@@ -13190,8 +13227,7 @@  legitimize_pic_address (rtx orig, rtx reg)
       /* This symbol may be referenced via a displacement from the PIC
 	 base address (@GOTOFF).  */
 
-      if (reload_in_progress)
-	df_set_regs_ever_live (PIC_OFFSET_TABLE_REGNUM, true);
+      set_pic_reg_ever_alive ();
       if (GET_CODE (addr) == CONST)
 	addr = XEXP (addr, 0);
       if (GET_CODE (addr) == PLUS)
@@ -13252,8 +13288,7 @@  legitimize_pic_address (rtx orig, rtx reg)
 	  /* This symbol must be referenced via a load from the
 	     Global Offset Table (@GOT).  */
 
-	  if (reload_in_progress)
-	    df_set_regs_ever_live (PIC_OFFSET_TABLE_REGNUM, true);
+	  set_pic_reg_ever_alive ();
 	  new_rtx = gen_rtx_UNSPEC (Pmode, gen_rtvec (1, addr), UNSPEC_GOT);
 	  new_rtx = gen_rtx_CONST (Pmode, new_rtx);
 	  if (TARGET_64BIT)
@@ -13305,8 +13340,7 @@  legitimize_pic_address (rtx orig, rtx reg)
 	    {
 	      if (!TARGET_64BIT)
 		{
-		  if (reload_in_progress)
-		    df_set_regs_ever_live (PIC_OFFSET_TABLE_REGNUM, true);
+		  set_pic_reg_ever_alive ();
 		  new_rtx = gen_rtx_UNSPEC (Pmode, gen_rtvec (1, op0),
 					    UNSPEC_GOTOFF);
 		  new_rtx = gen_rtx_PLUS (Pmode, new_rtx, op1);
@@ -13601,8 +13635,7 @@  legitimize_tls_address (rtx x, enum tls_model model, bool for_mov)
 	}
       else if (flag_pic)
 	{
-	  if (reload_in_progress)
-	    df_set_regs_ever_live (PIC_OFFSET_TABLE_REGNUM, true);
+	  set_pic_reg_ever_alive ();
 	  pic = pic_offset_table_rtx;
 	  type = TARGET_ANY_GNU_TLS ? UNSPEC_GOTNTPOFF : UNSPEC_GOTTPOFF;
 	}
@@ -14233,6 +14266,8 @@  ix86_pic_register_p (rtx x)
   if (GET_CODE (x) == VALUE && CSELIB_VAL_PTR (x))
     return (pic_offset_table_rtx
 	    && rtx_equal_for_cselib_p (x, pic_offset_table_rtx));
+  else if (pic_offset_table_rtx)
+    return REG_P (x) && REGNO (x) == REGNO (pic_offset_table_rtx);
   else
     return REG_P (x) && REGNO (x) == PIC_OFFSET_TABLE_REGNUM;
 }
@@ -14408,7 +14443,9 @@  ix86_delegitimize_address (rtx x)
 	 ...
 	 movl foo@GOTOFF(%ecx), %edx
 	 in which case we return (%ecx - %ebx) + foo.  */
-      if (pic_offset_table_rtx)
+      if (pic_offset_table_rtx
+	  && (!reload_completed
+	      || REGNO (pic_offset_table_rtx) < FIRST_PSEUDO_REGISTER))
         result = gen_rtx_PLUS (Pmode, gen_rtx_MINUS (Pmode, copy_rtx (addend),
 						     pic_offset_table_rtx),
 			       result);
@@ -24915,7 +24952,7 @@  ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
 		  && DEFAULT_ABI != MS_ABI))
 	  && GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF
 	  && ! SYMBOL_REF_LOCAL_P (XEXP (fnaddr, 0)))
-	use_reg (&use, pic_offset_table_rtx);
+	use_reg (&use, gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM));
     }
 
   if (TARGET_64BIT && INTVAL (callarg2) >= 0)
@@ -47228,6 +47265,8 @@  ix86_atomic_assign_expand_fenv (tree *hold, tree *clear, tree *update)
 #define TARGET_FUNCTION_ARG_ADVANCE ix86_function_arg_advance
 #undef TARGET_FUNCTION_ARG
 #define TARGET_FUNCTION_ARG ix86_function_arg
+#undef TARGET_IMPLICIT_PIC_ARG
+#define TARGET_IMPLICIT_PIC_ARG ix86_implicit_pic_arg
 #undef TARGET_FUNCTION_ARG_BOUNDARY
 #define TARGET_FUNCTION_ARG_BOUNDARY ix86_function_arg_boundary
 #undef TARGET_PASS_BY_REFERENCE
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 2c64162..d5fa250 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1243,11 +1243,13 @@  extern const char *host_detect_local_cpu (int argc, const char **argv);
 
 #define REAL_PIC_OFFSET_TABLE_REGNUM  BX_REG
 
-#define PIC_OFFSET_TABLE_REGNUM				\
-  ((TARGET_64BIT && (ix86_cmodel == CM_SMALL_PIC	\
-                     || TARGET_PECOFF))		\
-   || !flag_pic ? INVALID_REGNUM			\
-   : reload_completed ? REGNO (pic_offset_table_rtx)	\
+#define PIC_OFFSET_TABLE_REGNUM						\
+  ((TARGET_64BIT && (ix86_cmodel == CM_SMALL_PIC			\
+                     || TARGET_PECOFF))					\
+   || !flag_pic ? INVALID_REGNUM					\
+   : X86_TUNE_RELAX_PIC_REG ? (pic_offset_table_rtx ? INVALID_REGNUM	\
+			       : REAL_PIC_OFFSET_TABLE_REGNUM)		\
+   : reload_completed ? REGNO (pic_offset_table_rtx)			\
    : REAL_PIC_OFFSET_TABLE_REGNUM)
 
 #define GOT_SYMBOL_NAME "_GLOBAL_OFFSET_TABLE_"
@@ -1652,6 +1654,7 @@  typedef struct ix86_args {
   int float_in_sse;		/* Set to 1 or 2 for 32bit targets if
 				   SFmode/DFmode arguments should be passed
 				   in SSE registers.  Otherwise 0.  */
+  rtx implicit_pic_arg;         /* Implicit PIC base address arg if passed.  */
   enum calling_abi call_abi;	/* Set to SYSV_ABI for sysv abi. Otherwise
  				   MS_ABI for ms abi.  */
 } CUMULATIVE_ARGS;
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 8e74eab..27028ba 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -2725,7 +2725,7 @@ 
 
 (define_insn "*pushtf"
   [(set (match_operand:TF 0 "push_operand" "=<,<")
-	(match_operand:TF 1 "general_no_elim_operand" "x,*roF"))]
+	(match_operand:TF 1 "nonimmediate_no_elim_operand" "x,*roF"))]
   "TARGET_64BIT || TARGET_SSE"
 {
   /* This insn should be already split before reg-stack.  */
@@ -2750,7 +2750,7 @@ 
 
 (define_insn "*pushxf"
   [(set (match_operand:XF 0 "push_operand" "=<,<")
-	(match_operand:XF 1 "general_no_elim_operand" "f,Yx*roF"))]
+	(match_operand:XF 1 "nonimmediate_no_elim_operand" "f,Yx*roF"))]
   ""
 {
   /* This insn should be already split before reg-stack.  */
@@ -2781,7 +2781,7 @@ 
 
 (define_insn "*pushdf"
   [(set (match_operand:DF 0 "push_operand" "=<,<,<,<")
-	(match_operand:DF 1 "general_no_elim_operand" "f,Yd*roF,rmF,x"))]
+	(match_operand:DF 1 "nonimmediate_no_elim_operand" "f,Yd*roF,rmF,x"))]
   ""
 {
   /* This insn should be already split before reg-stack.  */
diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index 62970be..56eca24 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -580,6 +580,12 @@ 
     (match_operand 0 "register_no_elim_operand")
     (match_operand 0 "general_operand")))
 
+;; Return false if this is any eliminable register.  Otherwise nonimmediate_operand.
+(define_predicate "nonimmediate_no_elim_operand"
+  (if_then_else (match_code "reg,subreg")
+    (match_operand 0 "register_no_elim_operand")
+    (match_operand 0 "nonimmediate_operand")))
+
 ;; Return false if this is any eliminable register.  Otherwise
 ;; register_operand or a constant.
 (define_predicate "nonmemory_no_elim_operand"
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 215c63c..ffb7a2d 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -537,3 +537,6 @@  DEF_TUNE (X86_TUNE_PROMOTE_QI_REGS, "promote_qi_regs", 0)
    unrolling small loop less important. For, such architectures we adjust
    the unroll factor so that the unrolled loop fits the loop buffer.  */
 DEF_TUNE (X86_TUNE_ADJUST_UNROLL, "adjust_unroll_factor", m_BDVER3 | m_BDVER4)
+
+/* X86_TUNE_RELAX_PIC_REG: Do not fix hard register for GOT base usage.  */
+DEF_TUNE (X86_TUNE_RELAX_PIC_REG, "relax_pic_reg", ~0)
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 9dd8d68..33b36be 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -3967,6 +3967,12 @@  If @code{TARGET_FUNCTION_INCOMING_ARG} is not defined,
 @code{TARGET_FUNCTION_ARG} serves both purposes.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_IMPLICIT_PIC_ARG (const_tree @var{fntype_or_decl})
+This hook returns the register holding the PIC base address for functions
+that do not fix a hard register for it but treat it like a function
+argument, assigning a pseudo register to it.
+@end deftypefn
+
 @deftypefn {Target Hook} int TARGET_ARG_PARTIAL_BYTES (cumulative_args_t @var{cum}, enum machine_mode @var{mode}, tree @var{type}, bool @var{named})
 This target hook returns the number of bytes at the beginning of an
 argument that must be put in registers.  The value must be zero for
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index dd72b98..3e6da2f 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -3413,6 +3413,8 @@  the stack.
 
 @hook TARGET_FUNCTION_INCOMING_ARG
 
+@hook TARGET_IMPLICIT_PIC_ARG
+
 @hook TARGET_ARG_PARTIAL_BYTES
 
 @hook TARGET_PASS_BY_REFERENCE
diff --git a/gcc/function.c b/gcc/function.c
index 8156766..3a85c16 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -3456,6 +3456,15 @@  assign_parms (tree fndecl)
 
   fnargs.release ();
 
+  /* Handle the implicit PIC argument, if any.  */
+  if (targetm.calls.implicit_pic_arg (fndecl))
+    {
+      rtx old_reg = targetm.calls.implicit_pic_arg (fndecl);
+      rtx new_reg = gen_reg_rtx (GET_MODE (old_reg));
+      emit_move_insn (new_reg, old_reg);
+      pic_offset_table_rtx = new_reg;
+    }
+
   /* Output all parameter conversion instructions (possibly including calls)
      now that all parameters have been copied out of hard registers.  */
   emit_insn (all.first_conversion_insn);
diff --git a/gcc/hooks.c b/gcc/hooks.c
index 5c06562..47784e2 100644
--- a/gcc/hooks.c
+++ b/gcc/hooks.c
@@ -352,6 +352,13 @@  hook_rtx_rtx_null (rtx x ATTRIBUTE_UNUSED)
   return NULL;
 }
 
+/* Generic hook that takes a const_tree arg and returns NULL_RTX.  */
+rtx
+hook_rtx_const_tree_null (const_tree a ATTRIBUTE_UNUSED)
+{
+  return NULL;
+}
+
 /* Generic hook that takes a tree and an int and returns NULL_RTX.  */
 rtx
 hook_rtx_tree_int_null (tree a ATTRIBUTE_UNUSED, int b ATTRIBUTE_UNUSED)
diff --git a/gcc/hooks.h b/gcc/hooks.h
index ba42b6c..cf830ef 100644
--- a/gcc/hooks.h
+++ b/gcc/hooks.h
@@ -100,6 +100,7 @@  extern bool default_can_output_mi_thunk_no_vcall (const_tree, HOST_WIDE_INT,
 
 extern rtx hook_rtx_rtx_identity (rtx);
 extern rtx hook_rtx_rtx_null (rtx);
+extern rtx hook_rtx_const_tree_null (const_tree);
 extern rtx hook_rtx_tree_int_null (tree, int);
 
 extern const char *hook_constcharptr_void_null (void);
diff --git a/gcc/ira.c b/gcc/ira.c
index 3f41061..dc2eaed 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -3467,6 +3467,11 @@  update_equiv_regs (void)
 	  if (note && GET_CODE (XEXP (note, 0)) == EXPR_LIST)
 	    note = NULL_RTX;
 
+	  if (pic_offset_table_rtx
+	      && REGNO (pic_offset_table_rtx) >= FIRST_PSEUDO_REGISTER
+	      && contains_symbol_ref (insn))
+	    note = NULL_RTX;
+
 	  if (DF_REG_DEF_COUNT (regno) != 1
 	      && (! note
 		  || rtx_varies_p (XEXP (note, 0), 0)
@@ -3512,6 +3517,10 @@  update_equiv_regs (void)
 	      && MEM_P (SET_SRC (set))
 	      && validate_equiv_mem (insn, dest, SET_SRC (set)))
 	    note = set_unique_reg_note (insn, REG_EQUIV, copy_rtx (SET_SRC (set)));
+	  if (pic_offset_table_rtx
+	      && REGNO (pic_offset_table_rtx) >= FIRST_PSEUDO_REGISTER
+	      && contains_symbol_ref (insn))
+	    note = NULL_RTX;
 
 	  if (note)
 	    {
@@ -3886,11 +3895,19 @@  setup_reg_equiv (void)
 		      /* This is PLUS of frame pointer and a constant,
 			 or fp, or argp.  */
 		      ira_reg_equiv[i].invariant = x;
-		    else if (targetm.legitimate_constant_p (mode, x))
+		    else if (targetm.legitimate_constant_p (mode, x)
+			     && (!pic_offset_table_rtx
+				 || REGNO (pic_offset_table_rtx) < FIRST_PSEUDO_REGISTER
+				 || (GET_CODE (x) != CONST_DOUBLE
+				     && GET_CODE (x) != CONST_VECTOR)))
 		      ira_reg_equiv[i].constant = x;
 		    else
 		      {
 			ira_reg_equiv[i].memory = force_const_mem (mode, x);
+			if (pic_offset_table_rtx
+			    && REGNO (pic_offset_table_rtx) >= FIRST_PSEUDO_REGISTER
+			    && contains_symbol_ref (ira_reg_equiv[i].memory))
+			  ira_reg_equiv[i].memory = NULL_RTX;
 			if (ira_reg_equiv[i].memory == NULL_RTX)
 			  {
 			    ira_reg_equiv[i].defined_p = false;
diff --git a/gcc/rtl.h b/gcc/rtl.h
index b6a21b6..02fcf96 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -2610,6 +2610,7 @@  extern int rtx_referenced_p (rtx, rtx);
 extern bool tablejump_p (const_rtx, rtx *, rtx_jump_table_data **);
 extern int computed_jump_p (const_rtx);
 extern bool tls_referenced_p (rtx);
+extern bool contains_symbol_ref (rtx);
 
 typedef int (*rtx_function) (rtx *, void *);
 extern int for_each_rtx (rtx *, rtx_function, void *);
diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index bc16437..21f2872 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -110,7 +110,8 @@  rtx_unstable_p (const_rtx x)
       /* ??? When call-clobbered, the value is stable modulo the restore
 	 that must happen after a call.  This currently screws up local-alloc
 	 into believing that the restore is not needed.  */
-      if (!PIC_OFFSET_TABLE_REG_CALL_CLOBBERED && x == pic_offset_table_rtx)
+      if (!PIC_OFFSET_TABLE_REG_CALL_CLOBBERED && x == pic_offset_table_rtx
+	  && REGNO (pic_offset_table_rtx) < FIRST_PSEUDO_REGISTER)
 	return 0;
       return 1;
 
@@ -185,7 +186,9 @@  rtx_varies_p (const_rtx x, bool for_alias)
 	     that must happen after a call.  This currently screws up
 	     local-alloc into believing that the restore is not needed, so we
 	     must return 0 only if we are called from alias analysis.  */
-	  && (!PIC_OFFSET_TABLE_REG_CALL_CLOBBERED || for_alias))
+	  && ((!PIC_OFFSET_TABLE_REG_CALL_CLOBBERED
+	       && REGNO (pic_offset_table_rtx) < FIRST_PSEUDO_REGISTER)
+	      || for_alias))
 	return 0;
       return 1;
 
@@ -5978,6 +5981,42 @@  get_index_code (const struct address_info *info)
   return SCRATCH;
 }
 
+/* Return true if RTL X contains a SYMBOL_REF.  */
+
+bool
+contains_symbol_ref (rtx x)
+{
+  const char *fmt;
+  RTX_CODE code;
+  int i;
+
+  if (!x)
+    return false;
+
+  code = GET_CODE (x);
+  if (code == SYMBOL_REF)
+    return true;
+
+  fmt = GET_RTX_FORMAT (code);
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      if (fmt[i] == 'e')
+	{
+	  if (contains_symbol_ref (XEXP (x, i)))
+	    return true;
+	}
+      else if (fmt[i] == 'E')
+	{
+	  int j;
+	  for (j = 0; j < XVECLEN (x, i); j++)
+	    if (contains_symbol_ref (XVECEXP (x, i, j)))
+	      return true;
+	}
+    }
+
+  return false;
+}
+
 /* Return 1 if *X is a thread-local symbol.  */
 
 static int
diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
index 5c34fee..50de8d5 100644
--- a/gcc/shrink-wrap.c
+++ b/gcc/shrink-wrap.c
@@ -448,7 +448,7 @@  try_shrink_wrapping (edge *entry_edge, edge orig_entry_edge,
     {
       HARD_REG_SET prologue_clobbered, prologue_used, live_on_edge;
       struct hard_reg_set_container set_up_by_prologue;
-      rtx p_insn;
+      rtx p_insn, reg;
       vec<basic_block> vec;
       basic_block bb;
       bitmap_head bb_antic_flags;
@@ -494,9 +494,13 @@  try_shrink_wrapping (edge *entry_edge, edge orig_entry_edge,
       if (frame_pointer_needed)
 	add_to_hard_reg_set (&set_up_by_prologue.set, Pmode,
 			     HARD_FRAME_POINTER_REGNUM);
-      if (pic_offset_table_rtx)
+      if (pic_offset_table_rtx
+	  && PIC_OFFSET_TABLE_REGNUM != INVALID_REGNUM)
 	add_to_hard_reg_set (&set_up_by_prologue.set, Pmode,
 			     PIC_OFFSET_TABLE_REGNUM);
+      if ((reg = targetm.calls.implicit_pic_arg (current_function_decl)))
+	add_to_hard_reg_set (&set_up_by_prologue.set,
+			     Pmode, REGNO (reg));
       if (crtl->drap_reg)
 	add_to_hard_reg_set (&set_up_by_prologue.set,
 			     GET_MODE (crtl->drap_reg),
diff --git a/gcc/target.def b/gcc/target.def
index 3a41db1..5c221b6 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -3976,6 +3976,14 @@  If @code{TARGET_FUNCTION_INCOMING_ARG} is not defined,\n\
  default_function_incoming_arg)
 
 DEFHOOK
+(implicit_pic_arg,
+ "This hook returns the register holding the PIC base address for functions\n\
+that do not fix a hard register for it but treat it like a function\n\
+argument, assigning a pseudo register to it.",
+ rtx, (const_tree fntype_or_decl),
+ hook_rtx_const_tree_null)
+
+DEFHOOK
 (function_arg_boundary,
  "This hook returns the alignment boundary, in bits, of an argument\n\
 with the specified mode and type.  The default hook returns\n\
diff --git a/gcc/var-tracking.c b/gcc/var-tracking.c
index a458380..63d2be5 100644
--- a/gcc/var-tracking.c
+++ b/gcc/var-tracking.c
@@ -661,7 +661,6 @@  static bool variable_different_p (variable, variable);
 static bool dataflow_set_different (dataflow_set *, dataflow_set *);
 static void dataflow_set_destroy (dataflow_set *);
 
-static bool contains_symbol_ref (rtx);
 static bool track_expr_p (tree, bool);
 static bool same_variable_part_p (rtx, tree, HOST_WIDE_INT);
 static int add_uses (rtx *, void *);
@@ -5032,42 +5031,6 @@  dataflow_set_destroy (dataflow_set *set)
   set->vars = NULL;
 }
 
-/* Return true if RTL X contains a SYMBOL_REF.  */
-
-static bool
-contains_symbol_ref (rtx x)
-{
-  const char *fmt;
-  RTX_CODE code;
-  int i;
-
-  if (!x)
-    return false;
-
-  code = GET_CODE (x);
-  if (code == SYMBOL_REF)
-    return true;
-
-  fmt = GET_RTX_FORMAT (code);
-  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
-    {
-      if (fmt[i] == 'e')
-	{
-	  if (contains_symbol_ref (XEXP (x, i)))
-	    return true;
-	}
-      else if (fmt[i] == 'E')
-	{
-	  int j;
-	  for (j = 0; j < XVECLEN (x, i); j++)
-	    if (contains_symbol_ref (XVECEXP (x, i, j)))
-	      return true;
-	}
-    }
-
-  return false;
-}
-
 /* Shall EXPR be tracked?  */
 
 static bool