Patchwork Enhance reload_cse_move2add

login
register
mail settings
Submitter Jie Zhang
Date June 30, 2010, 7:45 a.m.
Message ID <4C2AF62F.4030700@codesourcery.com>
Download mbox | patch
Permalink /patch/57381/
State New
Headers show

Comments

Jie Zhang - June 30, 2010, 7:45 a.m.
Currently reload_cse_move2add can transform

    (set (REGX) (CONST_INT A))
    ...
    (set (REGX) (CONST_INT B))

to

    (set (REGX) (CONST_INT A))
    ...
    (set (REGX) (plus (REGX) (CONST_INT B-A)))

This patch enhances it to be able to transform

    (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
    ...
    (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))

to

    (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
    ...
    (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))


Benchmarking using EEMBC on ARM Cortex-A8 shows performance improvement 
on one test:

idctrn01: 6%

No meaningful changes on other tests.

Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% 
regression on CINT2000 and 0.1% improvement on CFP2000.

Bootstrapped and regression tested on x86_64.

Is it OK?

Regards,
--
Jie Zhang
CodeSourcery
Jeff Law - June 30, 2010, 4:47 p.m.
On 06/30/10 01:45, Jie Zhang wrote:
> Currently reload_cse_move2add can transform
>
>    (set (REGX) (CONST_INT A))
>    ...
>    (set (REGX) (CONST_INT B))
>
> to
>
>    (set (REGX) (CONST_INT A))
>    ...
>    (set (REGX) (plus (REGX) (CONST_INT B-A)))
>
> This patch enhances it to be able to transform
>
>    (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>    ...
>    (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>
> to
>
>    (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>    ...
>    (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>
>
> Benchmarking using EEMBC on ARM Cortex-A8 shows performance 
> improvement on one test:
>
> idctrn01: 6%
Was this a size or runtime performance improvement?

>
> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% 
> regression on CINT2000 and 0.1% improvement on CFP2000.
>
> Bootstrapped and regression tested on x86_64.
Any thoughts on why spec2k showed a regression and was it a size or 
runtime regression?

I'm generally OK with the patch, but want to understand what's happening 
benchmark-wise before final approval.

jeff
Jie Zhang - June 30, 2010, 5:13 p.m.
On 07/01/2010 12:47 AM, Jeff Law wrote:
> On 06/30/10 01:45, Jie Zhang wrote:
>> Currently reload_cse_move2add can transform
>>
>> (set (REGX) (CONST_INT A))
>> ...
>> (set (REGX) (CONST_INT B))
>>
>> to
>>
>> (set (REGX) (CONST_INT A))
>> ...
>> (set (REGX) (plus (REGX) (CONST_INT B-A)))
>>
>> This patch enhances it to be able to transform
>>
>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>> ...
>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>>
>> to
>>
>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>> ...
>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>>
>>
>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance
>> improvement on one test:
>>
>> idctrn01: 6%
> Was this a size or runtime performance improvement?
>
This is a runtime performance improvement.

>>
>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4%
>> regression on CINT2000 and 0.1% improvement on CFP2000.
>>
>> Bootstrapped and regression tested on x86_64.
> Any thoughts on why spec2k showed a regression and was it a size or
> runtime regression?
>
I'm not sure what caused the regressions. I'm redoing the benchmarking. 
This time I do it without X and shut down running servers as much as I 
can. Hope this can remove measuring error.

> I'm generally OK with the patch, but want to understand what's happening
> benchmark-wise before final approval.
>
OK.


Thanks,
Jie
Jeff Law - July 1, 2010, 6:10 p.m.
On 06/30/10 11:13, Jie Zhang wrote:
> On 07/01/2010 12:47 AM, Jeff Law wrote:
>> On 06/30/10 01:45, Jie Zhang wrote:
>>> Currently reload_cse_move2add can transform
>>>
>>> (set (REGX) (CONST_INT A))
>>> ...
>>> (set (REGX) (CONST_INT B))
>>>
>>> to
>>>
>>> (set (REGX) (CONST_INT A))
>>> ...
>>> (set (REGX) (plus (REGX) (CONST_INT B-A)))
>>>
>>> This patch enhances it to be able to transform
>>>
>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>> ...
>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>>>
>>> to
>>>
>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>> ...
>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>>>
>>>
>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance
>>> improvement on one test:
>>>
>>> idctrn01: 6%
>> Was this a size or runtime performance improvement?
>>
> This is a runtime performance improvement.
That's quite a surprise.  Just for giggles, does x86 show any change on 
idctrn01?  I realize it's an eembc benchmark, but if it's that sensitive 
to this optimization, we ought to see some change in behaviour for x86 
as well.


>
>>>
>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4%
>>> regression on CINT2000 and 0.1% improvement on CFP2000.
>>>
>>> Bootstrapped and regression tested on x86_64.
>> Any thoughts on why spec2k showed a regression and was it a size or
>> runtime regression?
>>
> I'm not sure what caused the regressions. I'm redoing the 
> benchmarking. This time I do it without X and shut down running 
> servers as much as I can. Hope this can remove measuring error.
Strongly advised (shut down X and as many services as possible, turn off 
speed scaling in the processor, etc).

If you can get size #s, that would be interesting too -- I'd expect this 
to be a small size improvement independent of the processor 
architecture.  Runtime performance I don't have a good feel for -- I can 
easily envision cases where it's going to be better and others where 
it's going to be worse.

Jeff
Jie Zhang - July 1, 2010, 6:26 p.m.
On 07/02/2010 02:10 AM, Jeff Law wrote:
> On 06/30/10 11:13, Jie Zhang wrote:
>> On 07/01/2010 12:47 AM, Jeff Law wrote:
>>> On 06/30/10 01:45, Jie Zhang wrote:
>>>> Currently reload_cse_move2add can transform
>>>>
>>>> (set (REGX) (CONST_INT A))
>>>> ...
>>>> (set (REGX) (CONST_INT B))
>>>>
>>>> to
>>>>
>>>> (set (REGX) (CONST_INT A))
>>>> ...
>>>> (set (REGX) (plus (REGX) (CONST_INT B-A)))
>>>>
>>>> This patch enhances it to be able to transform
>>>>
>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>>> ...
>>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>>>>
>>>> to
>>>>
>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>>> ...
>>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>>>>
>>>>
>>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance
>>>> improvement on one test:
>>>>
>>>> idctrn01: 6%
>>> Was this a size or runtime performance improvement?
>>>
>> This is a runtime performance improvement.
> That's quite a surprise. Just for giggles, does x86 show any change on
> idctrn01? I realize it's an eembc benchmark, but if it's that sensitive
> to this optimization, we ought to see some change in behaviour for x86
> as well.
>
I have not benchmarked it on x86 using EEMBC. We use SPEC2000 for 
benchmarking on x86 here. I need to ask if it's possible to setup EEMBC 
for x86 in our environment.

>
>>
>>>>
>>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4%
>>>> regression on CINT2000 and 0.1% improvement on CFP2000.
>>>>
>>>> Bootstrapped and regression tested on x86_64.
>>> Any thoughts on why spec2k showed a regression and was it a size or
>>> runtime regression?
>>>
>> I'm not sure what caused the regressions. I'm redoing the
>> benchmarking. This time I do it without X and shut down running
>> servers as much as I can. Hope this can remove measuring error.
> Strongly advised (shut down X and as many services as possible, turn off
> speed scaling in the processor, etc).
>
Yes, I did all of these this time.

> If you can get size #s, that would be interesting too -- I'd expect this
> to be a small size improvement independent of the processor
> architecture. Runtime performance I don't have a good feel for -- I can
> easily envision cases where it's going to be better and others where
> it's going to be worse.
>
I use 5 iterations and test -O2 and -O3 this time. It have been running 
for 24 hours. It might need one or two more hours to complete. So I have 
to report the results tomorrow.


Thanks,
Jie Zhang - July 2, 2010, 10:58 a.m.
On 07/02/2010 02:26 AM, Jie Zhang wrote:
> On 07/02/2010 02:10 AM, Jeff Law wrote:
>> On 06/30/10 11:13, Jie Zhang wrote:
>>> On 07/01/2010 12:47 AM, Jeff Law wrote:
>>>> On 06/30/10 01:45, Jie Zhang wrote:
>>>>> Currently reload_cse_move2add can transform
>>>>>
>>>>> (set (REGX) (CONST_INT A))
>>>>> ...
>>>>> (set (REGX) (CONST_INT B))
>>>>>
>>>>> to
>>>>>
>>>>> (set (REGX) (CONST_INT A))
>>>>> ...
>>>>> (set (REGX) (plus (REGX) (CONST_INT B-A)))
>>>>>
>>>>> This patch enhances it to be able to transform
>>>>>
>>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>>>> ...
>>>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>>>>>
>>>>> to
>>>>>
>>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>>>> ...
>>>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>>>>>
>>>>>
>>>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance
>>>>> improvement on one test:
>>>>>
>>>>> idctrn01: 6%
>>>> Was this a size or runtime performance improvement?
>>>>
>>> This is a runtime performance improvement.
>> That's quite a surprise. Just for giggles, does x86 show any change on
>> idctrn01? I realize it's an eembc benchmark, but if it's that sensitive
>> to this optimization, we ought to see some change in behaviour for x86
>> as well.
>>
> I have not benchmarked it on x86 using EEMBC. We use SPEC2000 for
> benchmarking on x86 here. I need to ask if it's possible to setup EEMBC
> for x86 in our environment.
>
It's not easy for me to run EEMBC on x86. We have no dedicated x86 
hardware for running EEMBC.

>>
>>>
>>>>>
>>>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4%
>>>>> regression on CINT2000 and 0.1% improvement on CFP2000.
>>>>>
>>>>> Bootstrapped and regression tested on x86_64.
>>>> Any thoughts on why spec2k showed a regression and was it a size or
>>>> runtime regression?
>>>>
>>> I'm not sure what caused the regressions. I'm redoing the
>>> benchmarking. This time I do it without X and shut down running
>>> servers as much as I can. Hope this can remove measuring error.
>> Strongly advised (shut down X and as many services as possible, turn off
>> speed scaling in the processor, etc).
>>
> Yes, I did all of these this time.
>
>> If you can get size #s, that would be interesting too -- I'd expect this
>> to be a small size improvement independent of the processor
>> architecture. Runtime performance I don't have a good feel for -- I can
>> easily envision cases where it's going to be better and others where
>> it's going to be worse.
>>
> I use 5 iterations and test -O2 and -O3 this time. It have been running
> for 24 hours. It might need one or two more hours to complete. So I have
> to report the results tomorrow.
>
Here are the new results. 'Before' is before applying my patch, 'After' 
is after applying my patch.

First the results for -O3.

---------------------------------------------------
CINT2000  -O3         Before      After      Change
---------------------------------------------------
164.gzip                1016       1015      -0.10%
175.vpr                  813        810      -0.37%
176.gcc                 1048       1048       0.00%
181.mcf                  625        624      -0.16%
186.crafty              1705       1706       0.06%
197.parser               769        773       0.52%
252.eon                 1857       1857       0.00%
253.perlbmk             1264       1267       0.24%
254.gap                 1097       1093      -0.36%
255.vortex              1513       1510      -0.20%
256.bzip2                980        980       0.00%
300.twolf                808        807      -0.12%
---------------------------------------------------
SPECint_base2000        1067       1067       0.00%
---------------------------------------------------

Notes on the above result:

* No significant changes.


---------------------------------------------------
CFP2000   -O3         Before      After      Change
---------------------------------------------------
168.wupwise             1187       1186      -0.08%
171.swim                1351       1364       0.96%
172.mgrid                881        882       0.11%
173.applu               1072       1069      -0.28%
177.mesa                1292       1290      -0.15%
178.galgel              1576       1583       0.44%
179.art                 1013        998      -1.48%
183.equake              1442       1428      -0.97%
187.facerec             1077       1078       0.09%
188.ammp                 873        876       0.34%
189.lucas               1399       1417       1.29%
191.fma3d               1078       1074      -0.37%
200.sixtrack             527        527       0.00%
301.apsi                 963        970       0.73%
---------------------------------------------------
SPECfp_base2000         1088       1088       0.00%
---------------------------------------------------

Notes on the above results:

* 171: The 5 results for "Before" are 1351, 1298, 1325, 1357, 1360.
                 while for "After" are 1364, 1368, 1357, 1358, 1375.

   Seems a real improvement.

* 179: The 5 results for "Before" are 1013, 1023, 1004, 1012, 1016.
                 while for "After" are  993,  998,  998, 1009, 1005.

   Seems a real regression.

* 183: The 5 results for "Before" are 1442, 1442, 1442, 1443, 1441.
                 while for "After" are 1427, 1431, 1427, 1433, 1428.

   Seems a real regression.

* 189: The 5 results for "Before" are 1396, 1399, 1401, 1394, 1399.
                 while for "After" are 1418, 1416, 1417, 1401, 1418.

   Seems a real improvement.



Then the results for -O2:

---------------------------------------------------
CINT2000   -O2       Before      After       Change
---------------------------------------------------
164.gzip               1011       1012        0.10%
175.vpr                 791        810        2.40%
176.gcc                1033       1031       -0.19%
181.mcf                 624        624        0.00%
186.crafty             1699       1698       -0.06%
197.parser              739        736       -0.41%
252.eon                1880       1880        0.00%
253.perlbmk            1341       1342        0.07%
254.gap                1089       1094        0.46%
255.vortex             1449       1448       -0.07%
256.bzip2               970        971        0.10%
300.twolf               805        801       -0.50%
---------------------------------------------------
SPECint_base2000      1060       1062         0.19%
---------------------------------------------------

Notes on the above results:

* 175: The 5 results for "Before" are 789, 791, 789, 793, 801.
                 while for "After" are 810, 811, 811, 808, 808.

   Seems a real improvement.


---------------------------------------------------
CFP2000    -O2       Before      After       Change
---------------------------------------------------
168.wupwise            1260       1232       -2.22%
171.swim               1382       1373       -0.65%
172.mgrid               905        900       -0.55%
173.applu              1049       1051        0.19%
177.mesa               1301       1300       -0.08%
178.galgel             1573       1565       -0.51%
179.art                1003       1002       -0.10%
183.equake             1352       1368        1.18%
187.facerec            1057       1062        0.47%
188.ammp                886        885       -0.11%
189.lucas              1406       1380       -1.85%
191.fma3d              1046       1048        0.19%
200.sixtrack            526        526        0.00%
301.apsi                965        955       -1.04%
---------------------------------------------------
SPECfp_base2000        1087       1083       -0.37%
---------------------------------------------------

Notes on the above results:

* 168: The 5 results for "Before" are 1261, 1231, 1261, 1253, 1260.
                 while for "After" are 1257, 1255, 1229, 1232, 1232.

   Hard to say if it's a real regression, at least not a regression as 
large as -2.22%.

* 183: The 5 results for "Before" are 1352, 1353, 1352, 1353, 1352.
                 while for "After" are 1368, 1368, 1368, 1367, 1367.

   Seems a real improvement.

* 189: The 5 results for "Before" are 1403, 1406, 1424, 1402, 1425.
                 while for "After" are 1380, 1399, 1380, 1380, 1382.

   Seems a real regression.

* 301: The 5 results for "Before" are 963, 967, 965, 964, 966.
                 while for "After" are 955, 954, 955, 954, 955.

   Seems a real regression.


Summary
=======

I list all tests with ABS(change) >= 1% below:

CINT2000         Change(-O3) Change(-O2)
----------------------------------------
175.vpr               -0.37%       2.40%
----------------------------------------
SPECint_base2000       0.00%       0.19%


CFP2000          Change(-O3) Change(-O2)
----------------------------------------
168.wupwise           -0.08%      -2.22%
171.swim               0.96%      -0.65%
179.art               -1.48%      -0.10%
183.equake            -0.97%       1.18%
189.lucas              1.29%      -1.85%
301.apsi               0.73%      -1.04%
----------------------------------------
SPECfp_base2000        0.00%      -0.37%

Some tests have regressions while some other tests have improvements. 
Some tests have regressions at one optimization level but have no minor 
change or even improvements at the other optimization level.

With these test results, is it safe to commit my patch?


Regards,
Paolo Bonzini - July 2, 2010, 11:59 a.m.
On 07/02/2010 12:58 PM, Jie Zhang wrote:
> 168.wupwise            1260       1232       -2.22%
> 189.lucas              1406       1380       -1.85%

Lucas is insanely dependent on register allocation and scheduling.  With 
-fweb and -frename-registers both disabled at -O2, I suspect the reason 
of the regression is that your patch increases register lifetimes and 
makes scheduling a bit harder.  I don't think this should be a reason to 
hold up your patch though, it would be more profitable to add some kind 
of rematerialization to the scheduler.

Paolo
Jeff Law - July 7, 2010, 4:54 p.m.
On 07/02/10 05:59, Paolo Bonzini wrote:
> On 07/02/2010 12:58 PM, Jie Zhang wrote:
>> 168.wupwise            1260       1232       -2.22%
>> 189.lucas              1406       1380       -1.85%
>   I suspect the reason of the regression is that your patch increases 
> register lifetimes and makes scheduling a bit harder.
Which is precisely what I was trying to figure out.  Note the patch 
changes the post-reload code, so while lifetimes are
increased, it really doesn't affect register allocation.  So any 
regressions ought to be scheduling or other micro-architecture issues 
and I was having a hard time believing that we're going to see a 
measurable regression, particularly on any modern x86 implementation.



> I don't think this should be a reason to hold up your patch though, it 
> would be more profitable to add some kind of rematerialization to the 
> scheduler.
One way to do this would be to attach the equivalent (original) form and 
somehow note to the scheduler that the equivalent form can be directly 
substituted for the current form.  When the scheduler has a bubble in 
the pipeline it could search QUEUE for an insn with an equivalent form, 
if found, it would verify dependencies and possibly issue the insn with 
the equivalent (original) form.

There's other instances where such capabilities might be useful, but I'd 
consider them outside the scope of this patch.

I'm still reviewing the data -- a pair of 2% regressions is pretty 
significant.



> Paolo
Jeff Law - July 7, 2010, 5:02 p.m.
> On 07/02/2010 02:26 AM, Jie Zhang wrote:
>>
>> I have not benchmarked it on x86 using EEMBC. We use SPEC2000 for
>> benchmarking on x86 here. I need to ask if it's possible to setup EEMBC
>> for x86 in our environment.
>>
> It's not easy for me to run EEMBC on x86. We have no dedicated x86 
> hardware for running EEMBC.
Even if you just run it a handful of times on whatever your development 
workstation happens to be and it's only one test that's of particular 
interest (idctrn01).  Basically this one test is a clear outlier on ARM, 
so I'd like to see if that behaviour shows up on x86 as well.

>>> If you can get size #s, that would be interesting too -- I'd expect 
>>> this
>>> to be a small size improvement independent of the processor
>>> architecture. Runtime performance I don't have a good feel for -- I can
>>> easily envision cases where it's going to be better and others where
>>> it's going to be worse.
>>>
>> I use 5 iterations and test -O2 and -O3 this time. It have been running
>> for 24 hours. It might need one or two more hours to complete. So I have
>> to report the results tomorrow.
>>
>
> Here are the new results. 'Before' is before applying my patch, 
> 'After' is after applying my patch.
Thanks.   Overall it looks like a wash (with the exception of idctrn01 
on arm).

Did you get the codesize #s?  I'd expect them to show a small, but 
consistent improvement which ought to be enough to push the patch from a 
wash to a clear improvement on both x86 & arm.

jeff
Bernd Schmidt - July 7, 2010, 5:02 p.m.
On 07/02/2010 01:59 PM, Paolo Bonzini wrote:
> On 07/02/2010 12:58 PM, Jie Zhang wrote:
>> 168.wupwise            1260       1232       -2.22%
>> 189.lucas              1406       1380       -1.85%
> 
> Lucas is insanely dependent on register allocation and scheduling.  With
> -fweb and -frename-registers both disabled at -O2

This reminds me.  I committed a number of regrename speedup patches
earlier this year - does anyone think it would be a good idea to add
that to -O2 now?


Bernd
Jie Zhang - July 7, 2010, 5:47 p.m.
On 07/08/2010 12:54 AM, Jeff Law wrote:
> On 07/02/10 05:59, Paolo Bonzini wrote:
>> On 07/02/2010 12:58 PM, Jie Zhang wrote:
>>> 168.wupwise 1260 1232 -2.22%
>>> 189.lucas 1406 1380 -1.85%
>> I suspect the reason of the regression is that your patch increases
>> register lifetimes and makes scheduling a bit harder.
> Which is precisely what I was trying to figure out. Note the patch
> changes the post-reload code, so while lifetimes are
> increased, it really doesn't affect register allocation. So any
> regressions ought to be scheduling or other micro-architecture issues
> and I was having a hard time believing that we're going to see a
> measurable regression, particularly on any modern x86 implementation.
>
I was wondering it will be different if I run benchmark on an Intel 
machine. But currently I have no such spare machine to run on.
>
>> I don't think this should be a reason to hold up your patch though, it
>> would be more profitable to add some kind of rematerialization to the
>> scheduler.
> One way to do this would be to attach the equivalent (original) form and
> somehow note to the scheduler that the equivalent form can be directly
> substituted for the current form. When the scheduler has a bubble in the
> pipeline it could search QUEUE for an insn with an equivalent form, if
> found, it would verify dependencies and possibly issue the insn with the
> equivalent (original) form.
>
This is really a good idea. ARM Cortex-A8 has two ways to load large 
constants, one is using constant pool, the other is using MOVW/MOVT. 
These two methods have different pipeline usage. It will be good if 
scheduler can choose one from them according to the current schedule 
requirement.

> There's other instances where such capabilities might be useful, but I'd
> consider them outside the scope of this patch.
>
> I'm still reviewing the data -- a pair of 2% regressions is pretty
> significant.
>
Yeah. And -O3 does not have them. I would like to take a deep look. But 
tests in SPEC are not like EEMBC...
Bernd Schmidt - July 7, 2010, 5:56 p.m.
On 07/07/2010 06:54 PM, Jeff Law wrote:
> I'm still reviewing the data -- a pair of 2% regressions is pretty
> significant.

Note that Jie's raw data for wupwise suggests that there isn't actually
that big a regression there, it just picked a high value in one run and
a low value in the other, but there were instances of both in either run.


Bernd
Jie Zhang - July 7, 2010, 5:57 p.m.
On 07/08/2010 01:02 AM, Bernd Schmidt wrote:
> On 07/02/2010 01:59 PM, Paolo Bonzini wrote:
>> On 07/02/2010 12:58 PM, Jie Zhang wrote:
>>> 168.wupwise            1260       1232       -2.22%
>>> 189.lucas              1406       1380       -1.85%
>>
>> Lucas is insanely dependent on register allocation and scheduling.  With
>> -fweb and -frename-registers both disabled at -O2
>
> This reminds me.  I committed a number of regrename speedup patches
> earlier this year - does anyone think it would be a good idea to add
> that to -O2 now?
>
I'm curious, why regrename is not in -O2?
Paolo Bonzini - July 8, 2010, 4:11 p.m.
On 07/07/2010 07:02 PM, Bernd Schmidt wrote:
> On 07/02/2010 01:59 PM, Paolo Bonzini wrote:
>> On 07/02/2010 12:58 PM, Jie Zhang wrote:
>>> 168.wupwise            1260       1232       -2.22%
>>> 189.lucas              1406       1380       -1.85%
>>
>> Lucas is insanely dependent on register allocation and scheduling.  With
>> -fweb and -frename-registers both disabled at -O2
>
> This reminds me.  I committed a number of regrename speedup patches
> earlier this year - does anyone think it would be a good idea to add
> that to -O2 now?

Definitely on i686, since it disables pre-regalloc scheduling.  You 
cannot get decent scheduling there without regrename.

Paolo
Paolo Bonzini - July 8, 2010, 4:21 p.m.
On 07/07/2010 06:54 PM, Jeff Law wrote:
> Note the patch changes the post-reload code, so while lifetimes are
> increased, it really doesn't affect register allocation.  So any
> regressions ought to be scheduling or other micro-architecture issues
> and I was having a hard time believing that we're going to see a
> measurable regression, particularly on any modern x86 implementation.

Out-of-order execution and elimination of inter-instruction dependencies 
(especially load-to-use dependencies) are much more important on modern 
x86 cores than Intel would like you to believe.  See PR38824 and PR27827 
for examples.

Also, even if scheduling is only decoding bandwidth, it may require more 
freedom than what IRA leaves.  Bernd's idea of turning regrename on at 
-O2 is a good one.

Paolo
Bernd Schmidt - July 8, 2010, 4:28 p.m.
On 07/08/2010 06:21 PM, Paolo Bonzini wrote:
> On 07/07/2010 06:54 PM, Jeff Law wrote:
>> Note the patch changes the post-reload code, so while lifetimes are
>> increased, it really doesn't affect register allocation.  So any
>> regressions ought to be scheduling or other micro-architecture issues
>> and I was having a hard time believing that we're going to see a
>> measurable regression, particularly on any modern x86 implementation.
> 
> Out-of-order execution and elimination of inter-instruction dependencies
> (especially load-to-use dependencies) are much more important on modern
> x86 cores than Intel would like you to believe.  See PR38824 and PR27827
> for examples.
> 
> Also, even if scheduling is only decoding bandwidth, it may require more
> freedom than what IRA leaves.  Bernd's idea of turning regrename on at
> -O2 is a good one.

Here's a slightly crazier idea then: What we could additionally do in
IRA is to allocate some hard registers speculatively if we think there's
an important loop in the function.  That would give us more room to work
with in regrename, at the expense of saving more regs in the prologue.


Bernd
Jie Zhang - July 9, 2010, 4:58 p.m.
On 07/08/2010 01:02 AM, Jeff Law wrote:
>> On 07/02/2010 02:26 AM, Jie Zhang wrote:
>>>
>>> I have not benchmarked it on x86 using EEMBC. We use SPEC2000 for
>>> benchmarking on x86 here. I need to ask if it's possible to setup EEMBC
>>> for x86 in our environment.
>>>
>> It's not easy for me to run EEMBC on x86. We have no dedicated x86
>> hardware for running EEMBC.
> Even if you just run it a handful of times on whatever your development
> workstation happens to be and it's only one test that's of particular
> interest (idctrn01). Basically this one test is a clear outlier on ARM,
> so I'd like to see if that behaviour shows up on x86 as well.
>
I just tested idctrn01 on my pentium-m laptop. My patch does not change 
the performance of idctrn01 on i386. I did some investigation. It's 
because i386 can directly use "symbol_ref + offset" as the address in 
load instructions. So there is no optimization opportunity that my patch 
can utilize. That's the difference between RISC and CISC.

I also tried to test my patch on PowerPC. It does not make any 
difference either on PowerPC. It's because when reload pass inserts 
instructions to load "symbol_ref + offset" into a register, it uses two 
instructions instead of one:

(set (reg:SI 23 23)
      (high:SI (const:SI (plus:SI (symbol_ref:SI ("*.LANCHOR0"))
                                  (const_int 336 [0x150])))))

(set (reg:SI 23 23)
      (lo_sum:SI (reg:SI 23 23)
                 (const:SI (plus:SI (symbol_ref:SI ("*.LANCHOR0"))
                                    (const_int 336 [0x150])))))

Currently my patch cannot optimize for this case. But it can be enhanced 
to handle this with some effort.

>>>> If you can get size #s, that would be interesting too -- I'd expect
>>>> this
>>>> to be a small size improvement independent of the processor
>>>> architecture. Runtime performance I don't have a good feel for -- I can
>>>> easily envision cases where it's going to be better and others where
>>>> it's going to be worse.
>>>>
>>> I use 5 iterations and test -O2 and -O3 this time. It have been running
>>> for 24 hours. It might need one or two more hours to complete. So I have
>>> to report the results tomorrow.
>>>
>>
>> Here are the new results. 'Before' is before applying my patch,
>> 'After' is after applying my patch.
> Thanks. Overall it looks like a wash (with the exception of idctrn01 on
> arm).
>
> Did you get the codesize #s? I'd expect them to show a small, but
> consistent improvement which ought to be enough to push the patch from a
> wash to a clear improvement on both x86 & arm.
>
I collected the code size data from SPEC2000 on AMD64. My patch does not 
change code size for any test whether -O2 or -O3, except

Code Size
=========
Test  -O2		Before	After	Change
----------------------------------------------
172.mgrid		14962	14976	    14

Code Size
Test  -O3		Before	After	Change
----------------------------------------------
171.swim		19231	19247	    16


For EEMBC on ARM, my patch saves some bytes in the following 4 tests for 
code. No changes on other tests. No changes on data/bss size on all tests.

Code Size
=========
Test  -O3		Before	After	Change
----------------------------------------------
automotive/aifftr01	86444	86396	   -48
automotive/bitmnp01	60129	60112	   -17
automotive/idctrn01	84812	84557	  -255
office/text01		72685	72668	   -17


Regards,
Bernd Schmidt - July 10, 2010, 12:06 a.m.
On 07/09/2010 06:58 PM, Jie Zhang wrote:
> I just tested idctrn01 on my pentium-m laptop. My patch does not change
> the performance of idctrn01 on i386. I did some investigation. It's
> because i386 can directly use "symbol_ref + offset" as the address in
> load instructions. So there is no optimization opportunity that my patch
> can utilize. That's the difference between RISC and CISC.
[...]
> I collected the code size data from SPEC2000 on AMD64. My patch does not
> change code size for any test whether -O2 or -O3, except
> 
> Code Size
> =========
> Test  -O2        Before    After    Change
> ----------------------------------------------
> 172.mgrid        14962    14976        14
> 
> Code Size
> Test  -O3        Before    After    Change
> ----------------------------------------------
> 171.swim        19231    19247        16

So what kinds of changes are there that would explain a 1.48% drop on
179.art?  If you can verify that the code is in fact identical, and the
performance change is noise, the patch is ok.


Bernd
Jie Zhang - July 12, 2010, 7:10 a.m.
On 07/10/2010 08:06 AM, Bernd Schmidt wrote:
> On 07/09/2010 06:58 PM, Jie Zhang wrote:
>> I just tested idctrn01 on my pentium-m laptop. My patch does not change
>> the performance of idctrn01 on i386. I did some investigation. It's
>> because i386 can directly use "symbol_ref + offset" as the address in
>> load instructions. So there is no optimization opportunity that my patch
>> can utilize. That's the difference between RISC and CISC.
> [...]
>> I collected the code size data from SPEC2000 on AMD64. My patch does not
>> change code size for any test whether -O2 or -O3, except
>>
>> Code Size
>> =========
>> Test  -O2        Before    After    Change
>> ----------------------------------------------
>> 172.mgrid        14962    14976        14
>>
>> Code Size
>> Test  -O3        Before    After    Change
>> ----------------------------------------------
>> 171.swim        19231    19247        16
>
> So what kinds of changes are there that would explain a 1.48% drop on
> 179.art?  If you can verify that the code is in fact identical, and the
> performance change is noise, the patch is ok.
>
I checked the disassemblies of all SPEC2000 tests. My patch generates 
different code only for 171.swim, 172.mgrid and 173.applu whether with 
-O2 or -O3. It seems the noise can be as large as 1.48%.
Steven Bosscher - July 12, 2010, 7:17 a.m.
On Mon, Jul 12, 2010 at 9:10 AM, Jie Zhang <jie@codesourcery.com> wrote:
> On 07/10/2010 08:06 AM, Bernd Schmidt wrote:
>>
>> On 07/09/2010 06:58 PM, Jie Zhang wrote:
>>>
>>> I just tested idctrn01 on my pentium-m laptop. My patch does not change
>>> the performance of idctrn01 on i386. I did some investigation. It's
>>> because i386 can directly use "symbol_ref + offset" as the address in
>>> load instructions. So there is no optimization opportunity that my patch
>>> can utilize. That's the difference between RISC and CISC.
>>
>> [...]
>>>
>>> I collected the code size data from SPEC2000 on AMD64. My patch does not
>>> change code size for any test whether -O2 or -O3, except
>>>
>>> Code Size
>>> =========
>>> Test  -O2        Before    After    Change
>>> ----------------------------------------------
>>> 172.mgrid        14962    14976        14
>>>
>>> Code Size
>>> Test  -O3        Before    After    Change
>>> ----------------------------------------------
>>> 171.swim        19231    19247        16
>>
>> So what kinds of changes are there that would explain a 1.48% drop on
>> 179.art?  If you can verify that the code is in fact identical, and the
>> performance change is noise, the patch is ok.
>>
> I checked the disassemblies of all SPEC2000 tests. My patch generates
> different code only for 171.swim, 172.mgrid and 173.applu whether with -O2
> or -O3. It seems the noise can be as large as 1.48%.

It can be much worse, depending on your machine configuration, cache,
and whether the SPEC*CPU* test is actually memory-bound (like art,
mcf, ammp, and a few others)...

Ciao!
Steven
Jie Zhang - July 12, 2010, 8:09 a.m.
On 07/12/2010 03:17 PM, Steven Bosscher wrote:
> On Mon, Jul 12, 2010 at 9:10 AM, Jie Zhang<jie@codesourcery.com>  wrote:
>> On 07/10/2010 08:06 AM, Bernd Schmidt wrote:
>>>
>>> On 07/09/2010 06:58 PM, Jie Zhang wrote:
>>>>
>>>> I just tested idctrn01 on my pentium-m laptop. My patch does not change
>>>> the performance of idctrn01 on i386. I did some investigation. It's
>>>> because i386 can directly use "symbol_ref + offset" as the address in
>>>> load instructions. So there is no optimization opportunity that my patch
>>>> can utilize. That's the difference between RISC and CISC.
>>>
>>> [...]
>>>>
>>>> I collected the code size data from SPEC2000 on AMD64. My patch does not
>>>> change code size for any test whether -O2 or -O3, except
>>>>
>>>> Code Size
>>>> =========
>>>> Test  -O2        Before    After    Change
>>>> ----------------------------------------------
>>>> 172.mgrid        14962    14976        14
>>>>
>>>> Code Size
>>>> Test  -O3        Before    After    Change
>>>> ----------------------------------------------
>>>> 171.swim        19231    19247        16
>>>
>>> So what kinds of changes are there that would explain a 1.48% drop on
>>> 179.art?  If you can verify that the code is in fact identical, and the
>>> performance change is noise, the patch is ok.
>>>
>> I checked the disassemblies of all SPEC2000 tests. My patch generates
>> different code only for 171.swim, 172.mgrid and 173.applu whether with -O2
>> or -O3. It seems the noise can be as large as 1.48%.
>
> It can be much worse, depending on your machine configuration, cache,
> and whether the SPEC*CPU* test is actually memory-bound (like art,
> mcf, ammp, and a few others)...
>
Even I run it 5 iterations, it still can have noise like 179.art I got.

* 179: The 5 results for "Before" are 1013, 1023, 1004, 1012, 1016.
                 while for "After" are  993,  998,  998, 1009, 1005.

I'm lucky this time since the patch does not change the code. But if it 
changed the code, telling if it's noise or real regression would become 
much harder. I wonder how many iterations we have to use if we want to 
make sure the result can be trusted.
Jeff Law - July 12, 2010, 4:37 p.m.
On 06/30/10 01:45, Jie Zhang wrote:
> Currently reload_cse_move2add can transform
>
>    (set (REGX) (CONST_INT A))
>    ...
>    (set (REGX) (CONST_INT B))
>
> to
>
>    (set (REGX) (CONST_INT A))
>    ...
>    (set (REGX) (plus (REGX) (CONST_INT B-A)))
>
> This patch enhances it to be able to transform
>
>    (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>    ...
>    (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>
> to
>
>    (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>    ...
>    (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>
>
> Benchmarking using EEMBC on ARM Cortex-A8 shows performance 
> improvement on one test:
>
> idctrn01: 6%
>
> No meaningful changes on other tests.
>
> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4% 
> regression on CINT2000 and 0.1% improvement on CFP2000.
>
> Bootstrapped and regression tested on x86_64.
>
> Is it OK?
Thanks for your patience, particularly when my requests for additional 
data.    I think we're OK with the benchmarking #s, particularly looking 
at the amount of noise in the runs we're seeing for x86_64.

Appoved,

Jeff
Jie Zhang - July 12, 2010, 5:41 p.m.
On 07/13/2010 12:37 AM, Jeff Law wrote:
> On 06/30/10 01:45, Jie Zhang wrote:
>> Currently reload_cse_move2add can transform
>>
>> (set (REGX) (CONST_INT A))
>> ...
>> (set (REGX) (CONST_INT B))
>>
>> to
>>
>> (set (REGX) (CONST_INT A))
>> ...
>> (set (REGX) (plus (REGX) (CONST_INT B-A)))
>>
>> This patch enhances it to be able to transform
>>
>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>> ...
>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>>
>> to
>>
>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>> ...
>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>>
>>
>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance
>> improvement on one test:
>>
>> idctrn01: 6%
>>
>> No meaningful changes on other tests.
>>
>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4%
>> regression on CINT2000 and 0.1% improvement on CFP2000.
>>
>> Bootstrapped and regression tested on x86_64.
>>
>> Is it OK?
> Thanks for your patience, particularly when my requests for additional
> data. I think we're OK with the benchmarking #s, particularly looking at
> the amount of noise in the runs we're seeing for x86_64.
>
> Appoved,
>
Jeff,

Thanks! I learned a lot during the procedure.

On 07/10/2010 08:06 AM, Bernd Schmidt wrote:
 > So what kinds of changes are there that would explain a 1.48% drop on
 > 179.art?  If you can verify that the code is in fact identical, and the
 > performance change is noise, the patch is ok.
 >
Bernd,

I didn't realize it was a conditional approval until I asked you 
privately. Thanks!


I have committed it on trunk.


Regards,
H.J. Lu - July 12, 2010, 6:36 p.m.
On Mon, Jul 12, 2010 at 10:41 AM, Jie Zhang <jie@codesourcery.com> wrote:
> On 07/13/2010 12:37 AM, Jeff Law wrote:
>>
>> On 06/30/10 01:45, Jie Zhang wrote:
>>>
>>> Currently reload_cse_move2add can transform
>>>
>>> (set (REGX) (CONST_INT A))
>>> ...
>>> (set (REGX) (CONST_INT B))
>>>
>>> to
>>>
>>> (set (REGX) (CONST_INT A))
>>> ...
>>> (set (REGX) (plus (REGX) (CONST_INT B-A)))
>>>
>>> This patch enhances it to be able to transform
>>>
>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>> ...
>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>>>
>>> to
>>>
>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>> ...
>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>>>
>>>
>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance
>>> improvement on one test:
>>>
>>> idctrn01: 6%
>>>
>>> No meaningful changes on other tests.
>>>
>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4%
>>> regression on CINT2000 and 0.1% improvement on CFP2000.
>>>
>>> Bootstrapped and regression tested on x86_64.
>>>
>>> Is it OK?
>>
>> Thanks for your patience, particularly when my requests for additional
>> data. I think we're OK with the benchmarking #s, particularly looking at
>> the amount of noise in the runs we're seeing for x86_64.
>>
>> Appoved,
>>
> Jeff,
>
> Thanks! I learned a lot during the procedure.
>
> On 07/10/2010 08:06 AM, Bernd Schmidt wrote:
>> So what kinds of changes are there that would explain a 1.48% drop on
>> 179.art?  If you can verify that the code is in fact identical, and the
>> performance change is noise, the patch is ok.
>>
> Bernd,
>
> I didn't realize it was a conditional approval until I asked you privately.
> Thanks!
>
>
> I have committed it on trunk.
>

I think it breaks bootstrap on Linux/ia32:

../../src-trunk/gcc/postreload.c: In function 'reload_cse_regs':
../../src-trunk/gcc/postreload.c:1327:57: error: 'min_regno' may be
used uninitialized in this function [-Werror=uninitialized]
../../src-trunk/gcc/postreload.c:1284:7: note: 'min_regno' was declared here
cc1: all warnings being treated as errors
H.J. Lu - July 12, 2010, 6:38 p.m.
On Mon, Jul 12, 2010 at 11:36 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, Jul 12, 2010 at 10:41 AM, Jie Zhang <jie@codesourcery.com> wrote:
>> On 07/13/2010 12:37 AM, Jeff Law wrote:
>>>
>>> On 06/30/10 01:45, Jie Zhang wrote:
>>>>
>>>> Currently reload_cse_move2add can transform
>>>>
>>>> (set (REGX) (CONST_INT A))
>>>> ...
>>>> (set (REGX) (CONST_INT B))
>>>>
>>>> to
>>>>
>>>> (set (REGX) (CONST_INT A))
>>>> ...
>>>> (set (REGX) (plus (REGX) (CONST_INT B-A)))
>>>>
>>>> This patch enhances it to be able to transform
>>>>
>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>>> ...
>>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>>>>
>>>> to
>>>>
>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>>> ...
>>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>>>>
>>>>
>>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance
>>>> improvement on one test:
>>>>
>>>> idctrn01: 6%
>>>>
>>>> No meaningful changes on other tests.
>>>>
>>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4%
>>>> regression on CINT2000 and 0.1% improvement on CFP2000.
>>>>
>>>> Bootstrapped and regression tested on x86_64.
>>>>
>>>> Is it OK?
>>>
>>> Thanks for your patience, particularly when my requests for additional
>>> data. I think we're OK with the benchmarking #s, particularly looking at
>>> the amount of noise in the runs we're seeing for x86_64.
>>>
>>> Appoved,
>>>
>> Jeff,
>>
>> Thanks! I learned a lot during the procedure.
>>
>> On 07/10/2010 08:06 AM, Bernd Schmidt wrote:
>>> So what kinds of changes are there that would explain a 1.48% drop on
>>> 179.art?  If you can verify that the code is in fact identical, and the
>>> performance change is noise, the patch is ok.
>>>
>> Bernd,
>>
>> I didn't realize it was a conditional approval until I asked you privately.
>> Thanks!
>>
>>
>> I have committed it on trunk.
>>
>
> I think it breaks bootstrap on Linux/ia32:
>
> ../../src-trunk/gcc/postreload.c: In function 'reload_cse_regs':
> ../../src-trunk/gcc/postreload.c:1327:57: error: 'min_regno' may be
> used uninitialized in this function [-Werror=uninitialized]
> ../../src-trunk/gcc/postreload.c:1284:7: note: 'min_regno' was declared here
> cc1: all warnings being treated as errors
>

I opened:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44921
Jie Zhang - July 13, 2010, 4:50 a.m.
On 07/13/2010 02:38 AM, H.J. Lu wrote:
> On Mon, Jul 12, 2010 at 11:36 AM, H.J. Lu<hjl.tools@gmail.com>  wrote:
>> On Mon, Jul 12, 2010 at 10:41 AM, Jie Zhang<jie@codesourcery.com>  wrote:
>>> On 07/13/2010 12:37 AM, Jeff Law wrote:
>>>>
>>>> On 06/30/10 01:45, Jie Zhang wrote:
>>>>>
>>>>> Currently reload_cse_move2add can transform
>>>>>
>>>>> (set (REGX) (CONST_INT A))
>>>>> ...
>>>>> (set (REGX) (CONST_INT B))
>>>>>
>>>>> to
>>>>>
>>>>> (set (REGX) (CONST_INT A))
>>>>> ...
>>>>> (set (REGX) (plus (REGX) (CONST_INT B-A)))
>>>>>
>>>>> This patch enhances it to be able to transform
>>>>>
>>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>>>> ...
>>>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>>>>>
>>>>> to
>>>>>
>>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>>>> ...
>>>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>>>>>
>>>>>
>>>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance
>>>>> improvement on one test:
>>>>>
>>>>> idctrn01: 6%
>>>>>
>>>>> No meaningful changes on other tests.
>>>>>
>>>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4%
>>>>> regression on CINT2000 and 0.1% improvement on CFP2000.
>>>>>
>>>>> Bootstrapped and regression tested on x86_64.
>>>>>
>>>>> Is it OK?
>>>>
>>>> Thanks for your patience, particularly when my requests for additional
>>>> data. I think we're OK with the benchmarking #s, particularly looking at
>>>> the amount of noise in the runs we're seeing for x86_64.
>>>>
>>>> Appoved,
>>>>
>>> Jeff,
>>>
>>> Thanks! I learned a lot during the procedure.
>>>
>>> On 07/10/2010 08:06 AM, Bernd Schmidt wrote:
>>>> So what kinds of changes are there that would explain a 1.48% drop on
>>>> 179.art?  If you can verify that the code is in fact identical, and the
>>>> performance change is noise, the patch is ok.
>>>>
>>> Bernd,
>>>
>>> I didn't realize it was a conditional approval until I asked you privately.
>>> Thanks!
>>>
>>>
>>> I have committed it on trunk.
>>>
>>
>> I think it breaks bootstrap on Linux/ia32:
>>
>> ../../src-trunk/gcc/postreload.c: In function 'reload_cse_regs':
>> ../../src-trunk/gcc/postreload.c:1327:57: error: 'min_regno' may be
>> used uninitialized in this function [-Werror=uninitialized]
>> ../../src-trunk/gcc/postreload.c:1284:7: note: 'min_regno' was declared here
>> cc1: all warnings being treated as errors
>>
>
> I opened:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44921
>
Thanks for catching that. I did the bootstraping and regression testing 
when I submitted my patch. But I didn't do them before committing.

I saw your have already committed the fix:

http://gcc.gnu.org/ml/gcc-patches/2010-07/msg01021.html

But as amylaar (sorry I don't know the real name since the bugzilla is 
broken now) pointed out in the bugzilla, this is a false warning. I did 
several testing of my patch with old revisions this morning. I found 
r161569 with my patch bootstrap OK. r161614 with my patch fails 
bootstrap. So there is some revision between them caused the false 
warning and we need to find out which and if it's better to fix that 
revision than adding initialization to min_regno.


Regards,
Jie Zhang - July 13, 2010, 9:31 a.m.
On 07/13/2010 12:50 PM, Jie Zhang wrote:
> On 07/13/2010 02:38 AM, H.J. Lu wrote:
>> On Mon, Jul 12, 2010 at 11:36 AM, H.J. Lu<hjl.tools@gmail.com> wrote:
>>> On Mon, Jul 12, 2010 at 10:41 AM, Jie Zhang<jie@codesourcery.com> wrote:
>>>> On 07/13/2010 12:37 AM, Jeff Law wrote:
>>>>>
>>>>> On 06/30/10 01:45, Jie Zhang wrote:
>>>>>>
>>>>>> Currently reload_cse_move2add can transform
>>>>>>
>>>>>> (set (REGX) (CONST_INT A))
>>>>>> ...
>>>>>> (set (REGX) (CONST_INT B))
>>>>>>
>>>>>> to
>>>>>>
>>>>>> (set (REGX) (CONST_INT A))
>>>>>> ...
>>>>>> (set (REGX) (plus (REGX) (CONST_INT B-A)))
>>>>>>
>>>>>> This patch enhances it to be able to transform
>>>>>>
>>>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>>>>> ...
>>>>>> (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
>>>>>>
>>>>>> to
>>>>>>
>>>>>> (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
>>>>>> ...
>>>>>> (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))
>>>>>>
>>>>>>
>>>>>> Benchmarking using EEMBC on ARM Cortex-A8 shows performance
>>>>>> improvement on one test:
>>>>>>
>>>>>> idctrn01: 6%
>>>>>>
>>>>>> No meaningful changes on other tests.
>>>>>>
>>>>>> Benchmarking using SPEC2000 on AMD Athlon64 X2 3800+ shows 0.4%
>>>>>> regression on CINT2000 and 0.1% improvement on CFP2000.
>>>>>>
>>>>>> Bootstrapped and regression tested on x86_64.
>>>>>>
>>>>>> Is it OK?
>>>>>
>>>>> Thanks for your patience, particularly when my requests for additional
>>>>> data. I think we're OK with the benchmarking #s, particularly
>>>>> looking at
>>>>> the amount of noise in the runs we're seeing for x86_64.
>>>>>
>>>>> Appoved,
>>>>>
>>>> Jeff,
>>>>
>>>> Thanks! I learned a lot during the procedure.
>>>>
>>>> On 07/10/2010 08:06 AM, Bernd Schmidt wrote:
>>>>> So what kinds of changes are there that would explain a 1.48% drop on
>>>>> 179.art? If you can verify that the code is in fact identical, and the
>>>>> performance change is noise, the patch is ok.
>>>>>
>>>> Bernd,
>>>>
>>>> I didn't realize it was a conditional approval until I asked you
>>>> privately.
>>>> Thanks!
>>>>
>>>>
>>>> I have committed it on trunk.
>>>>
>>>
>>> I think it breaks bootstrap on Linux/ia32:
>>>
>>> ../../src-trunk/gcc/postreload.c: In function 'reload_cse_regs':
>>> ../../src-trunk/gcc/postreload.c:1327:57: error: 'min_regno' may be
>>> used uninitialized in this function [-Werror=uninitialized]
>>> ../../src-trunk/gcc/postreload.c:1284:7: note: 'min_regno' was
>>> declared here
>>> cc1: all warnings being treated as errors
>>>
>>
>> I opened:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44921
>>
> Thanks for catching that. I did the bootstraping and regression testing
> when I submitted my patch. But I didn't do them before committing.
>
> I saw your have already committed the fix:
>
> http://gcc.gnu.org/ml/gcc-patches/2010-07/msg01021.html
>
> But as amylaar (sorry I don't know the real name since the bugzilla is
> broken now) pointed out in the bugzilla, this is a false warning. I did
> several testing of my patch with old revisions this morning. I found
> r161569 with my patch bootstrap OK. r161614 with my patch fails
> bootstrap. So there is some revision between them caused the false
> warning and we need to find out which and if it's better to fix that
> revision than adding initialization to min_regno.
>
It's revision 161605. I will add a followup to the corresponding thread.

Patch


	* postreload.c (reg_symbol_ref[]): New.
	(move2add_use_add2_insn): New.
	(move2add_use_add3_insn): New.
	(reload_cse_move2add): Handle SYMBOL + OFFSET case.
	(move2add_note_store): Likewise.

Index: postreload.c
===================================================================
--- postreload.c	(revision 290047)
+++ postreload.c	(working copy)
@@ -1156,17 +1156,19 @@  reload_combine_note_use (rtx *xp, rtx in
    information about register contents we have would be costly, so we
    use move2add_last_label_luid to note where the label is and then
    later disable any optimization that would cross it.
-   reg_offset[n] / reg_base_reg[n] / reg_mode[n] are only valid if
-   reg_set_luid[n] is greater than move2add_last_label_luid.  */
+   reg_offset[n] / reg_base_reg[n] / reg_symbol_ref[n] / reg_mode[n]
+   are only valid if reg_set_luid[n] is greater than
+   move2add_last_label_luid.  */
 static int reg_set_luid[FIRST_PSEUDO_REGISTER];
 
 /* If reg_base_reg[n] is negative, register n has been set to
-   reg_offset[n] in mode reg_mode[n] .
+   reg_offset[n] or reg_symbol_ref[n] + reg_offset[n] in mode reg_mode[n].
    If reg_base_reg[n] is non-negative, register n has been set to the
    sum of reg_offset[n] and the value of register reg_base_reg[n]
    before reg_set_luid[n], calculated in mode reg_mode[n] .  */
 static HOST_WIDE_INT reg_offset[FIRST_PSEUDO_REGISTER];
 static int reg_base_reg[FIRST_PSEUDO_REGISTER];
+static rtx reg_symbol_ref[FIRST_PSEUDO_REGISTER];
 static enum machine_mode reg_mode[FIRST_PSEUDO_REGISTER];
 
 /* move2add_luid is linearly increased while scanning the instructions
@@ -1186,6 +1188,151 @@  static int move2add_last_label_luid;
        && TRULY_NOOP_TRUNCATION (GET_MODE_BITSIZE (OUTMODE), \
 				 GET_MODE_BITSIZE (INMODE))))
 
+/* This function is called with INSN that sets REG to (SYM + OFF),
+   while REG is known to already have value (SYM + offset).
+   This function tries to change INSN into an add instruction
+   (set (REG) (plus (REG) (OFF - offset))) using the known value.
+   It also updates the information about REG's known value.  */
+
+static void
+move2add_use_add2_insn (rtx reg, rtx sym, rtx off, rtx insn)
+{
+  rtx pat = PATTERN (insn);
+  rtx src = SET_SRC (pat);
+  int regno = REGNO (reg);
+  rtx new_src = gen_int_mode (INTVAL (off) - reg_offset[regno],
+			      GET_MODE (reg));
+  bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
+
+  /* (set (reg) (plus (reg) (const_int 0))) is not canonical;
+     use (set (reg) (reg)) instead.
+     We don't delete this insn, nor do we convert it into a
+     note, to avoid losing register notes or the return
+     value flag.  jump2 already knows how to get rid of
+     no-op moves.  */
+  if (new_src == const0_rtx)
+    {
+      /* If the constants are different, this is a
+	 truncation, that, if turned into (set (reg)
+	 (reg)), would be discarded.  Maybe we should
+	 try a truncMN pattern?  */
+      if (INTVAL (off) == reg_offset [regno])
+	validate_change (insn, &SET_SRC (pat), reg, 0);
+    }
+  else if (rtx_cost (new_src, PLUS, speed) < rtx_cost (src, SET, speed)
+	   && have_add2_insn (reg, new_src))
+    {
+      rtx tem = gen_rtx_PLUS (GET_MODE (reg), reg, new_src);
+      validate_change (insn, &SET_SRC (pat), tem, 0);
+    }
+  else if (sym == NULL_RTX && GET_MODE (reg) != BImode)
+    {
+      enum machine_mode narrow_mode;
+      for (narrow_mode = GET_CLASS_NARROWEST_MODE (MODE_INT);
+	   narrow_mode != VOIDmode
+	     && narrow_mode != GET_MODE (reg);
+	   narrow_mode = GET_MODE_WIDER_MODE (narrow_mode))
+	{
+	  if (have_insn_for (STRICT_LOW_PART, narrow_mode)
+	      && ((reg_offset[regno]
+		   & ~GET_MODE_MASK (narrow_mode))
+		  == (INTVAL (off)
+		      & ~GET_MODE_MASK (narrow_mode))))
+	    {
+	      rtx narrow_reg = gen_rtx_REG (narrow_mode,
+					    REGNO (reg));
+	      rtx narrow_src = gen_int_mode (INTVAL (off),
+					     narrow_mode);
+	      rtx new_set =
+		gen_rtx_SET (VOIDmode,
+			     gen_rtx_STRICT_LOW_PART (VOIDmode,
+						      narrow_reg),
+			     narrow_src);
+	      if (validate_change (insn, &PATTERN (insn),
+				   new_set, 0))
+		break;
+	    }
+	}
+    }
+  reg_set_luid[regno] = move2add_luid;
+  reg_base_reg[regno] = -1;
+  reg_mode[regno] = GET_MODE (reg);
+  reg_symbol_ref[regno] = sym;
+  reg_offset[regno] = INTVAL (off);
+}
+
+
+/* This function is called with INSN that sets REG to (SYM + OFF),
+   but REG doesn't have known value (SYM + offset).  This function
+   tries to find another register which is known to already have
+   value (SYM + offset) and change INSN into an add instruction
+   (set (REG) (plus (the found register) (OFF - offset))) if such
+   a register is found.  It also updates the information about
+   REG's known value.  */
+
+static void
+move2add_use_add3_insn (rtx reg, rtx sym, rtx off, rtx insn)
+{
+  rtx pat = PATTERN (insn);
+  rtx src = SET_SRC (pat);
+  int regno = REGNO (reg);
+  int min_cost = INT_MAX;
+  int min_regno;
+  bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
+  int i;
+
+  for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+    if (reg_set_luid[i] > move2add_last_label_luid
+	&& reg_mode[i] == GET_MODE (reg)
+	&& reg_base_reg[i] < 0
+	&& reg_symbol_ref[i] != NULL_RTX
+	&& rtx_equal_p (sym, reg_symbol_ref[i]))
+      {
+	rtx new_src = gen_int_mode (INTVAL (off) - reg_offset[i],
+				    GET_MODE (reg));
+	/* (set (reg) (plus (reg) (const_int 0))) is not canonical;
+	   use (set (reg) (reg)) instead.
+	   We don't delete this insn, nor do we convert it into a
+	   note, to avoid losing register notes or the return
+	   value flag.  jump2 already knows how to get rid of
+	   no-op moves.  */
+	if (new_src == const0_rtx)
+	  {
+	    min_cost = 0;
+	    min_regno = i;
+	    break;
+	  }
+	else
+	  {
+	    int cost = rtx_cost (new_src, PLUS, speed);
+	    if (cost < min_cost)
+	      {
+		min_cost = cost;
+		min_regno = i;
+	      }
+	  }
+      }
+
+  if (min_cost < rtx_cost (src, SET, speed))
+    {
+      rtx tem;
+
+      tem = gen_rtx_REG (GET_MODE (reg), min_regno);
+      if (i != min_regno)
+	{
+	  rtx new_src = gen_int_mode (INTVAL (off) - reg_offset[min_regno],
+				      GET_MODE (reg));
+	  tem = gen_rtx_PLUS (GET_MODE (reg), tem, new_src);
+	}
+      validate_change (insn, &SET_SRC (pat), tem, 0);
+    }
+  reg_set_luid[regno] = move2add_luid;
+  reg_base_reg[regno] = -1;
+  reg_mode[regno] = GET_MODE (reg);
+  reg_symbol_ref[regno] = sym;
+  reg_offset[regno] = INTVAL (off);
+}
+
 static void
 reload_cse_move2add (rtx first)
 {
@@ -1193,7 +1340,13 @@  reload_cse_move2add (rtx first)
   rtx insn;
 
   for (i = FIRST_PSEUDO_REGISTER - 1; i >= 0; i--)
-    reg_set_luid[i] = 0;
+    {
+      reg_set_luid[i] = 0;
+      reg_offset[i] = 0;
+      reg_base_reg[i] = 0;
+      reg_symbol_ref[i] = NULL_RTX;
+      reg_mode[i] = VOIDmode;
+    }
 
   move2add_last_label_luid = 0;
   move2add_luid = 2;
@@ -1241,65 +1394,11 @@  reload_cse_move2add (rtx first)
 				  (set (STRICT_LOW_PART (REGX)) (CONST_INT B))
 	      */
 
-	      if (CONST_INT_P (src) && reg_base_reg[regno] < 0)
+	      if (CONST_INT_P (src)
+		  && reg_base_reg[regno] < 0
+		  && reg_symbol_ref[regno] == NULL_RTX)
 		{
-		  rtx new_src = gen_int_mode (INTVAL (src) - reg_offset[regno],
-					      GET_MODE (reg));
-		  bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
-
-		  /* (set (reg) (plus (reg) (const_int 0))) is not canonical;
-		     use (set (reg) (reg)) instead.
-		     We don't delete this insn, nor do we convert it into a
-		     note, to avoid losing register notes or the return
-		     value flag.  jump2 already knows how to get rid of
-		     no-op moves.  */
-		  if (new_src == const0_rtx)
-		    {
-		      /* If the constants are different, this is a
-			 truncation, that, if turned into (set (reg)
-			 (reg)), would be discarded.  Maybe we should
-			 try a truncMN pattern?  */
-		      if (INTVAL (src) == reg_offset [regno])
-			validate_change (insn, &SET_SRC (pat), reg, 0);
-		    }
-		  else if (rtx_cost (new_src, PLUS, speed) < rtx_cost (src, SET, speed)
-			   && have_add2_insn (reg, new_src))
-		    {
-		      rtx tem = gen_rtx_PLUS (GET_MODE (reg), reg, new_src);
-		      validate_change (insn, &SET_SRC (pat), tem, 0);
-		    }
-		  else if (GET_MODE (reg) != BImode)
-		    {
-		      enum machine_mode narrow_mode;
-		      for (narrow_mode = GET_CLASS_NARROWEST_MODE (MODE_INT);
-			   narrow_mode != VOIDmode
-			   && narrow_mode != GET_MODE (reg);
-			   narrow_mode = GET_MODE_WIDER_MODE (narrow_mode))
-			{
-			  if (have_insn_for (STRICT_LOW_PART, narrow_mode)
-			      && ((reg_offset[regno]
-				   & ~GET_MODE_MASK (narrow_mode))
-				  == (INTVAL (src)
-				      & ~GET_MODE_MASK (narrow_mode))))
-			    {
-			      rtx narrow_reg = gen_rtx_REG (narrow_mode,
-							    REGNO (reg));
-			      rtx narrow_src = gen_int_mode (INTVAL (src),
-							     narrow_mode);
-			      rtx new_set =
-				gen_rtx_SET (VOIDmode,
-					     gen_rtx_STRICT_LOW_PART (VOIDmode,
-								      narrow_reg),
-					     narrow_src);
-			      if (validate_change (insn, &PATTERN (insn),
-						   new_set, 0))
-				break;
-			    }
-			}
-		    }
-		  reg_set_luid[regno] = move2add_luid;
-		  reg_mode[regno] = GET_MODE (reg);
-		  reg_offset[regno] = INTVAL (src);
+		  move2add_use_add2_insn (reg, NULL_RTX, src, insn);
 		  continue;
 		}
 
@@ -1369,6 +1468,51 @@  reload_cse_move2add (rtx first)
 		    }
 		}
 	    }
+
+	  /* Try to transform
+	     (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
+	     ...
+	     (set (REGY) (CONST (PLUS (SYMBOL_REF) (CONST_INT B))))
+	     to
+	     (set (REGX) (CONST (PLUS (SYMBOL_REF) (CONST_INT A))))
+	     ...
+	     (set (REGY) (CONST (PLUS (REGX) (CONST_INT B-A))))  */
+	  if ((GET_CODE (src) == SYMBOL_REF
+	       || (GET_CODE (src) == CONST
+		   && GET_CODE (XEXP (src, 0)) == PLUS
+		   && GET_CODE (XEXP (XEXP (src, 0), 0)) == SYMBOL_REF
+		   && CONST_INT_P (XEXP (XEXP (src, 0), 1))))
+	      && dbg_cnt (cse2_move2add))
+	    {
+	      rtx sym, off;
+
+	      if (GET_CODE (src) == SYMBOL_REF)
+		{
+		  sym = src;
+		  off = const0_rtx;
+		}
+	      else
+		{
+		  sym = XEXP (XEXP (src, 0), 0);
+		  off = XEXP (XEXP (src, 0), 1);
+		}
+
+	      /* If the reg already contains the value which is sum of
+		 sym and some constant value, we can use an add2 insn.  */
+	      if (reg_set_luid[regno] > move2add_last_label_luid
+		  && MODES_OK_FOR_MOVE2ADD (GET_MODE (reg), reg_mode[regno])
+		  && reg_base_reg[regno] < 0
+		  && reg_symbol_ref[regno] != NULL_RTX
+		  && rtx_equal_p (sym, reg_symbol_ref[regno]))
+		move2add_use_add2_insn (reg, sym, off, insn);
+
+	      /* Otherwise, we have to find a register whose value is sum
+		 of sym and some constant value.  */
+	      else
+		move2add_use_add3_insn (reg, sym, off, insn);
+
+	      continue;
+	    }
 	}
 
       for (note = REG_NOTES (insn); note; note = XEXP (note, 1))
@@ -1382,7 +1526,7 @@  reload_cse_move2add (rtx first)
 		reg_set_luid[regno] = 0;
 	    }
 	}
-      note_stores (PATTERN (insn), move2add_note_store, NULL);
+      note_stores (PATTERN (insn), move2add_note_store, insn);
 
       /* If INSN is a conditional branch, we try to extract an
 	 implicit set out of it.  */
@@ -1404,7 +1548,7 @@  reload_cse_move2add (rtx first)
 	    {
 	      rtx implicit_set =
 		gen_rtx_SET (VOIDmode, XEXP (cnd, 0), XEXP (cnd, 1));
-	      move2add_note_store (SET_DEST (implicit_set), implicit_set, 0);
+	      move2add_note_store (SET_DEST (implicit_set), implicit_set, insn);
 	    }
 	}
 
@@ -1422,13 +1566,15 @@  reload_cse_move2add (rtx first)
     }
 }
 
-/* SET is a SET or CLOBBER that sets DST.
+/* SET is a SET or CLOBBER that sets DST.  DATA is the insn which
+   contains SET.
    Update reg_set_luid, reg_offset and reg_base_reg accordingly.
    Called from reload_cse_move2add via note_stores.  */
 
 static void
-move2add_note_store (rtx dst, const_rtx set, void *data ATTRIBUTE_UNUSED)
+move2add_note_store (rtx dst, const_rtx set, void *data)
 {
+  rtx insn = (rtx) data;
   unsigned int regno = 0;
   unsigned int nregs = 0;
   unsigned int i;
@@ -1462,6 +1608,38 @@  move2add_note_store (rtx dst, const_rtx 
     nregs = hard_regno_nregs[regno][mode];
 
   if (SCALAR_INT_MODE_P (GET_MODE (dst))
+      && nregs == 1 && GET_CODE (set) == SET)
+    {
+      rtx note, sym = NULL_RTX;
+      HOST_WIDE_INT off;
+
+      note = find_reg_equal_equiv_note (insn);
+      if (note && GET_CODE (XEXP (note, 0)) == SYMBOL_REF)
+	{
+	  sym = XEXP (note, 0);
+	  off = 0;
+	}
+      else if (note && GET_CODE (XEXP (note, 0)) == CONST
+	       && GET_CODE (XEXP (XEXP (note, 0), 0)) == PLUS
+	       && GET_CODE (XEXP (XEXP (XEXP (note, 0), 0), 0)) == SYMBOL_REF
+	       && CONST_INT_P (XEXP (XEXP (XEXP (note, 0), 0), 1)))
+	{
+	  sym = XEXP (XEXP (XEXP (note, 0), 0), 0);
+	  off = INTVAL (XEXP (XEXP (XEXP (note, 0), 0), 1));
+	}
+
+      if (sym != NULL_RTX)
+	{
+	  reg_base_reg[regno] = -1;
+	  reg_symbol_ref[regno] = sym;
+	  reg_offset[regno] = off;
+	  reg_mode[regno] = mode;
+	  reg_set_luid[regno] = move2add_luid;
+	  return;
+	}
+    }
+
+  if (SCALAR_INT_MODE_P (GET_MODE (dst))
       && nregs == 1 && GET_CODE (set) == SET
       && GET_CODE (SET_DEST (set)) != ZERO_EXTRACT
       && GET_CODE (SET_DEST (set)) != STRICT_LOW_PART)
@@ -1521,6 +1699,7 @@  move2add_note_store (rtx dst, const_rtx 
 	case CONST_INT:
 	  /* Start tracking the register as a constant.  */
 	  reg_base_reg[regno] = -1;
+	  reg_symbol_ref[regno] = NULL_RTX;
 	  reg_offset[regno] = INTVAL (SET_SRC (set));
 	  /* We assign the same luid to all registers set to constants.  */
 	  reg_set_luid[regno] = move2add_last_label_luid + 1;
@@ -1541,6 +1720,7 @@  move2add_note_store (rtx dst, const_rtx 
       if (reg_set_luid[base_regno] <= move2add_last_label_luid)
 	{
 	  reg_base_reg[base_regno] = base_regno;
+	  reg_symbol_ref[base_regno] = NULL_RTX;
 	  reg_offset[base_regno] = 0;
 	  reg_set_luid[base_regno] = move2add_luid;
 	  reg_mode[base_regno] = mode;
@@ -1554,6 +1734,7 @@  move2add_note_store (rtx dst, const_rtx 
       /* Copy base information from our base register.  */
       reg_set_luid[regno] = reg_set_luid[base_regno];
       reg_base_reg[regno] = reg_base_reg[base_regno];
+      reg_symbol_ref[regno] = reg_symbol_ref[base_regno];
 
       /* Compute the sum of the offsets or constants.  */
       reg_offset[regno] = trunc_int_for_mode (offset