[gcc-15,0/3] RISC-V improve stack/array access by constant mat tweak

Message ID 20240316173524.1147760-1-vineetg@rivosinc.com

Message

Vineet Gupta March 16, 2024, 5:35 p.m. UTC
Hi,

This set of patches (for gcc-15) helps improve stack/array accesses
by improving constant materialization. Details are in the respective
patches.

The first patch is the main change; it improves SPEC cactu by 10%.
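
For illustration only (this is not one of the testcases in the series),
the kind of code affected looks roughly like the sketch below: once the
frame grows past the signed 12-bit immediate range, the sp adjustment in
the prologue/epilogue and accesses to far-away slots all need a constant
materialized into a register first.  consume() is just a hypothetical
stand-in to keep the stores alive:

  extern void consume (long *);

  long
  big_frame (long v)
  {
    long buf[1024];      /* ~8 KiB frame: sp-relative offsets exceed the
                            signed 12-bit (+/-2047) immediate range      */
    buf[0] = v;
    buf[1000] = v + 1;   /* offset ~8000: the constant must be
                            materialized (e.g. lui + addi) first         */
    consume (buf);
    return buf[1000];
  }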

There is at least one more pending change along the same theme
(riscv_add_offset/riscv_legitimize_address), but it is tripping
something up and is currently being debugged.

Thx,
-Vineet

Vineet Gupta (3):
  RISC-V: avoid LUI based const materialization ... [part of PR/106265]
  RISC-V: avoid LUI based const mat: keep stack offsets aligned
  RISC-V: avoid LUI based const mat in prologue/epilogue expansion
    [PR/105733]

 gcc/config/riscv/constraints.md               | 12 +++
 gcc/config/riscv/predicates.md                | 12 +++
 gcc/config/riscv/riscv-protos.h               |  3 +
 gcc/config/riscv/riscv.cc                     | 91 +++++++++++++++++--
 gcc/config/riscv/riscv.h                      | 20 ++++
 gcc/config/riscv/riscv.md                     | 65 +++++++++++++
 gcc/testsuite/gcc.target/riscv/pr105733.c     | 15 +++
 .../riscv/rvv/autovec/vls/spill-1.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-2.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-3.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-4.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-5.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-6.c           |  4 +-
 .../riscv/rvv/autovec/vls/spill-7.c           |  4 +-
 .../gcc.target/riscv/sum-of-two-s12-const-1.c | 45 +++++++++
 .../gcc.target/riscv/sum-of-two-s12-const-2.c | 22 +++++
 16 files changed, 292 insertions(+), 21 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/pr105733.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/sum-of-two-s12-const-2.c

Comments

Jeff Law March 19, 2024, 4:41 a.m. UTC | #1
On 3/16/24 11:35 AM, Vineet Gupta wrote:
> Hi,
> 
> This set of patches (for gcc-15) help improve stack/array accesses
> by improving constant materialization. Details are in respective
> patches.
> 
> The first patch is the main change which improves SPEC cactu by 10%.
Just to confirm.  Yup, 10% reduction in icounts and about a 3.5% 
improvement in cycles on our target.  Which is great!

This also makes me wonder if cactu is the benchmark that was sensitive 
to flushing the pending queue in the scheduler.  Jivan's data would tend 
to indicate that is the case as several routines seem to flush the 
pending queue often.  In particular:

ML_BSSN_RHS_Body
ML_BSSN_Advect_Body
ML_BSSN_constraints_Body

All have a high number of dynamic instructions as well as lots of 
flushes of the pending queue.

Vineet, you might want to look and see if cranking up the 
max-pending-list-length parameter helps drive down spilling.  I think 
its default value is 32 insns.  I've seen it cranked up to 128 and 256 
insns without significant ill effects on compile time.

My recollection (it's been like 3 years) of the key loop was that it had 
a few hundred instructions and we'd flush the pending list about 50 
cycles into the loop as there just wasn't enough issue bandwidth to the 
FP units to dispatch all the FP instructions as their inputs became 
ready.  So you'd be looking for flushes in a big loop.
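
Roughly, the loop shape I have in mind looks like the sketch below (this
is not the actual Cactu/ML_BSSN kernel, just an illustration): many
independent FP operations per iteration, more than the FP pipes can
issue, so the pending list fills up and gets flushed mid-loop.

  /* Sketch only: lots of independent FP work per iteration competing
     for limited FP issue bandwidth.  */
  void
  kernel (double *restrict out, const double *restrict a,
          const double *restrict b, long n)
  {
    for (long i = 0; i < n; i++)
      out[i] = a[i] * b[i]
               + a[i] * a[i] - b[i] * b[i]
               + a[i] / b[i] - b[i] / a[i]
               + 0.5 * a[i] + 0.25 * b[i];
  }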


Jeff
Vineet Gupta March 21, 2024, 12:45 a.m. UTC | #2
On 3/18/24 21:41, Jeff Law wrote:
>> The first patch is the main change which improves SPEC cactu by 10%.
> Just to confirm.  Yup, 10% reduction in icounts and about a 3.5% 
> improvement in cycles on our target.  Which is great!

Nice.

> This also makes me wonder if cactu is the benchmark that was sensitive 
> to flushing the pending queue in the scheduler.  Jivan's data would tend 
> to indicate that is the case as several routines seem to flush the 
> pending queue often.  In particular:
>
> ML_BSSN_RHS_Body
> ML_BSSN_Advect_Body
> ML_BSSN_constraints_Body
>
> All have a high number of dynamic instructions as well as lots of 
> flushes of the pending queue.
>
> Vineet, you might want to look and see if cranking up the 
> max-pending-list-length parameter helps drive down spilling.   I think 
> it's default value is 32 insns.  I've seen it cranked up to 128 and 256 
> insns without significant ill effects on compile time.
>
> My recollection (it's been like 3 years) of the key loop was that it had 
> a few hundred instructions and we'd flush the pending list about 50 
> cycles into the loop as there just wasn't enough issue bandwidth to the 
> FP units to dispatch all the FP instructions as their inputs became 
> ready.  So you'd be looking for flushes in a big loop.

Great insight.

Fired off a cactu run with 128, will keep you posted.

Thx,
-Vineet
Vineet Gupta March 21, 2024, 2:36 p.m. UTC | #3
On 3/18/24 21:41, Jeff Law wrote:
>
> On 3/16/24 11:35 AM, Vineet Gupta wrote:
>> Hi,
>>
>> This set of patches (for gcc-15) help improve stack/array accesses
>> by improving constant materialization. Details are in respective
>> patches.
>>
>> The first patch is the main change which improves SPEC cactu by 10%.
> Just to confirm.  Yup, 10% reduction in icounts and about a 3.5% 
> improvement in cycles on our target.  Which is great!
>
> This also makes me wonder if cactu is the benchmark that was sensitive 
> to flushing the pending queue in the scheduler.  Jivan's data would tend 
> to indicate that is the case as several routines seem to flush the 
> pending queue often.  In particular:
>
> ML_BSSN_RHS_Body
> ML_BSSN_Advect_Body
> ML_BSSN_constraints_Body
>
> All have a high number of dynamic instructions as well as lots of 
> flushes of the pending queue.
>
> Vineet, you might want to look and see if cranking up the 
> max-pending-list-length parameter helps drive down spilling.   I think 
> it's default value is 32 insns.  I've seen it cranked up to 128 and 256 
> insns without significant ill effects on compile time.
>
> My recollection (it's been like 3 years) of the key loop was that it had 
> a few hundred instructions and we'd flush the pending list about 50 
> cycles into the loop as there just wasn't enough issue bandwidth to the 
> FP units to dispatch all the FP instructions as their inputs became 
> ready.  So you'd be looking for flushes in a big loop.

Here are the results for Cactu on top of the new splitter changes:

default	: 2,565,319,368,591
128	: 2,509,741,035,068
256	: 2,527,817,813,612

I haven't probed deeper into the generated code itself, but it is
likely helping with spilling.

-Vineet
Jeff Law March 21, 2024, 2:45 p.m. UTC | #4
On 3/21/24 8:36 AM, Vineet Gupta wrote:
> 
> 
> On 3/18/24 21:41, Jeff Law wrote:
>>
>> On 3/16/24 11:35 AM, Vineet Gupta wrote:
>>> Hi,
>>>
>>> This set of patches (for gcc-15) help improve stack/array accesses
>>> by improving constant materialization. Details are in respective
>>> patches.
>>>
>>> The first patch is the main change which improves SPEC cactu by 10%.
>> Just to confirm.  Yup, 10% reduction in icounts and about a 3.5%
>> improvement in cycles on our target.  Which is great!
>>
>> This also makes me wonder if cactu is the benchmark that was sensitive
>> to flushing the pending queue in the scheduler.  Jivan's data would tend
>> to indicate that is the case as several routines seem to flush the
>> pending queue often.  In particular:
>>
>> ML_BSSN_RHS_Body
>> ML_BSSN_Advect_Body
>> ML_BSSN_constraints_Body
>>
>> All have a high number of dynamic instructions as well as lots of
>> flushes of the pending queue.
>>
>> Vineet, you might want to look and see if cranking up the
>> max-pending-list-length parameter helps drive down spilling.   I think
>> it's default value is 32 insns.  I've seen it cranked up to 128 and 256
>> insns without significant ill effects on compile time.
>>
>> My recollection (it's been like 3 years) of the key loop was that it had
>> a few hundred instructions and we'd flush the pending list about 50
>> cycles into the loop as there just wasn't enough issue bandwidth to the
>> FP units to dispatch all the FP instructions as their inputs became
>> ready.  So you'd be looking for flushes in a big loop.
> 
> Here are the results for Cactu on top of the new splitter changes:
> 
> default	: 2,565,319,368,591
> 128	: 2,509,741,035,068
> 256	: 2,527,817,813,612
> 
> I've haven't probed deeper in generated code itself but likely to be
> helping with spilling
Actually, I read that as not important for this issue.  While it is 50b 
instructions, I would be looking for something that had perhaps an order 
of magnitude bigger impact.    Ultimately I think it means we still 
don't have a good handle on what's causing the spilling.  Oh well.

So if we go back to Robin's observation that scheduling dramatically 
increases the instruction count, perhaps we try a run with 
-fno-schedule-insns -fno-schedule-insns2 and see how the instruction 
counts compare.


Jeff
Vineet Gupta March 21, 2024, 5:19 p.m. UTC | #5
On 3/21/24 07:45, Jeff Law wrote:
>>>> The first patch is the main change which improves SPEC cactu by 10%.
>>> Just to confirm.  Yup, 10% reduction in icounts and about a 3.5%
>>> improvement in cycles on our target.  Which is great!
>>>
>>> This also makes me wonder if cactu is the benchmark that was sensitive
>>> to flushing the pending queue in the scheduler.  Jivan's data would tend
>>> to indicate that is the case as several routines seem to flush the
>>> pending queue often.  In particular:
>>>
>>> ML_BSSN_RHS_Body
>>> ML_BSSN_Advect_Body
>>> ML_BSSN_constraints_Body
>>>
>>> All have a high number of dynamic instructions as well as lots of
>>> flushes of the pending queue.
>>>
>>> Vineet, you might want to look and see if cranking up the
>>> max-pending-list-length parameter helps drive down spilling.   I think
>>> it's default value is 32 insns.  I've seen it cranked up to 128 and 256
>>> insns without significant ill effects on compile time.
>>>
>>> My recollection (it's been like 3 years) of the key loop was that it had
>>> a few hundred instructions and we'd flush the pending list about 50
>>> cycles into the loop as there just wasn't enough issue bandwidth to the
>>> FP units to dispatch all the FP instructions as their inputs became
>>> ready.  So you'd be looking for flushes in a big loop.
>> Here are the results for Cactu on top of the new splitter changes:
>>
>> default	: 2,565,319,368,591
>> 128	: 2,509,741,035,068
>> 256	: 2,527,817,813,612
>>
>> I've haven't probed deeper in generated code itself but likely to be
>> helping with spilling
> Actually, I read that as not important for this issue.  While it is 50b 
> instructions, I would be looking for something that had perhaps an order 
> of magnitude bigger impact.    Ultimately I think it means we still 
> don't have a good handle on what's causing the spilling.  Oh well.
>
> So if we go back to Robin's observation that scheduling dramatically 
> increases the instruction count, perhaps we try a run with 
> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction 
> counts compare.

Oh yeah! Robin hinted at this in Tuesday's patchworks meeting too.

default	    : 2,565,319,368,591
128	    : 2,509,741,035,068
256	    : 2,527,817,813,612
no-sched{,2}: 1,295,520,567,376

-Vineet
Jeff Law March 21, 2024, 7:56 p.m. UTC | #6
On 3/21/24 11:19 AM, Vineet Gupta wrote:

>>
>> So if we go back to Robin's observation that scheduling dramatically
>> increases the instruction count, perhaps we try a run with
>> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction
>> counts compare.
> 
> Oh yeah ! Robin hinted to this in Tues patchworks meeting too
> 
> default	    : 2,565,319,368,591
> 128	    : 2,509,741,035,068
> 256	    : 2,527,817,813,612
> no-sched{,2}: 1,295,520,567,376
Now we're getting somewhere.  That's in line with expectations.

I would strongly suspect it's -fno-schedule-insns rather than 
-fno-schedule-insns2.  The former turns off scheduling before register 
allocation, the second turns it off after register allocation.  So if 
our theory about spilling is correct, then it must be the first since 
the second won't affect register allocation.   While I can speculate 
about other potential scheduler impacts, spilling due to sched1's 
actions is by far the most likely.

Given the magnitude here, I would bet we can see this pretty clearly if 
you've got function level or block level count data for those runs.  I'd 
start with that, ideally narrowing things down to a function or hot loop 
within a function which shows a huge delta.

From that we can then look at the IRA and LRA dumps and correlate what 
we see there with the before/after scheduling dumps to see how we've 
lengthened lifetimes in critical locations.

I'd probably start with the IRA dump.  It's going to have annotations in 
its dump output like "Potential Spill" which may guide us.  In simplest 
terms a pseudo is trivially allocatable when it has fewer neighbors in 
the conflict graph than available hard registers.  If it has more 
neighbors in the conflict graph than available hard registers, then it's 
potentially going to be spilled -- we can't know during this phase of 
allocation.

As we pop registers off the coloring stack, some neighbors of the pseudo 
in question may end up allocated into the same hard register.  That can 
sometimes result in a hard register being available.  It might be easier 
to see with a graph

     a--b--c
        |
        d

Where a..d are pseudo registers.  If two pseudos are connected by an 
edge, then they have overlapping lifetimes and can't be allocated to the 
same hard register.  So as we can see b conflicts with a, c & d.  If we 
only have two hard registers, then b is not trivially colorable and will 
be marked as a potential spill.

During coloring we may end up allocating a, c & d to the same hard 
register (they don't conflict, so its safe).  If that happens, then 
there would be a register available for b.

Anyway, that should explain why b would be marked as a potential spill 
and how it might end up getting a hard register anyway.
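
To make that concrete, here is a toy program (not IRA's actual
implementation, just the example above spelled out): b is flagged as a
potential spill because its degree (3) is >= the number of hard regs
(2), yet it still gets a register once a, c and d end up sharing one.

  #include <stdio.h>

  enum { A, B, C, D, N };
  #define HARD_REGS 2

  static const int conflict[N][N] = {
    /*        a  b  c  d */
    /* a */ { 0, 1, 0, 0 },
    /* b */ { 1, 0, 1, 1 },
    /* c */ { 0, 1, 0, 0 },
    /* d */ { 0, 1, 0, 0 },
  };

  static int
  degree (int v)
  {
    int deg = 0;
    for (int u = 0; u < N; u++)
      deg += conflict[v][u];
    return deg;
  }

  int
  main (void)
  {
    int color[N] = { -1, -1, -1, -1 };
    /* Color the trivially colorable pseudos first, then the potential
       spills, each taking the lowest hard reg no neighbor is using.  */
    int order[N] = { A, C, D, B };

    for (int i = 0; i < N; i++)
      {
        int v = order[i];
        if (degree (v) >= HARD_REGS)
          printf ("pseudo %c: potential spill\n", 'a' + v);

        int used[HARD_REGS] = { 0 };
        for (int u = 0; u < N; u++)
          if (conflict[v][u] && color[u] >= 0)
            used[color[u]] = 1;
        for (int r = 0; r < HARD_REGS && color[v] < 0; r++)
          if (!used[r])
            color[v] = r;

        if (color[v] >= 0)
          printf ("pseudo %c -> hard reg %d\n", 'a' + v, color[v]);
        else
          printf ("pseudo %c: spilled\n", 'a' + v);
      }
    return 0;
  }

The real allocator of course works on the residual graph as pseudos are
removed, but the flow -- flag, push, optimistically color -- is the same
idea.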

The hope is we can see the potential spills increasing.  At which point 
we can walk backwards to sched1 and dive into its scheduling decisions.

Jeff
Vineet Gupta March 22, 2024, 12:34 a.m. UTC | #7
On 3/21/24 12:56, Jeff Law wrote:
>
> On 3/21/24 11:19 AM, Vineet Gupta wrote:
>
>>> So if we go back to Robin's observation that scheduling dramatically
>>> increases the instruction count, perhaps we try a run with
>>> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction
>>> counts compare.
>> Oh yeah ! Robin hinted to this in Tues patchworks meeting too
>>
>> default	    : 2,565,319,368,591
>> 128	    : 2,509,741,035,068
>> 256	    : 2,527,817,813,612
>> no-sched{,2}: 1,295,520,567,376
> Now we're getting somewhere.  That's in line with expectations.
>
> I would strongly suspect it's -fno-schedule-insns rather than 
> -fno-schedule-insns2.  The former turns off scheduling before register 
> allocation, the second turns it off after register allocation.  So if 
> our theory about spilling is correct, then it must be the first since 
> the second won't affect register allocation.   While I can speculate 
> about other potential scheduler impacts, spilling due to sched1's 
> actions is by far the most likely.

As always you are absolutely right: just doing -fno-schedule-insns gets
almost the same count as the last row above.

> Given the magnitude here, I would bet we can see this pretty clearly if 
> you've got function level or block level count data for those runs.  I'd 
> start with that, ideally narrowing things down to a function or hot loop 
> within a function which shows a huge delta.

Alright, on it.

Thx,
-Vineet

> From that we can then look at the IRA and LRA dumps and correlate what 
> we see there with the before/after scheduling dumps to see how we've 
> lengthened lifetimes in critical locations.
>
> I'd probably start with the IRA dump.  It's going to have annotations in 
> its dump output like "Potential Spill" which may guide us.  In simplest 
> terms a pseudo is trivially allocatable when it has fewer neighbors in 
> the conflict graph than available hard registers.  If it has more 
> neighbors in the conflict graph than available hard registers, then it's 
> potentially going to be spilled -- we can't know during this phase of 
> allocation.
>
> As we pop registers off the coloring stack, some neighbors of the pseudo 
> in question may end up allocated into the same hard register.  That can 
> sometimes result in a hard register being available.  It might be easier 
> to see with a graph
>
>      a--b--c
>         |
>         d
>
> Where a..d are pseudo registers.  If two pseudos are connected by an 
> edge, then they have overlapping lifetimes and can't be allocated to the 
> same hard register.  So as we can see b conflicts with a, c & d.  If we 
> only have two hard registers, then b is not trivially colorable and will 
> be marked as a potential spill.
>
> During coloring we may end up allocating a, c & d to the same hard 
> register (they don't conflict, so its safe).  If that happens, then 
> there would be a register available for b.
>
> Anyway, that should explain why b would be marked as a potential spill 
> and how it might end up getting a hard register anyway.
>
> The hope is we can see the potential spills increasing.  At which point 
> we can walk backwards to sched1 and dive into its scheduling decisions.
>
> Jeff
Richard Biener March 22, 2024, 8:47 a.m. UTC | #8
On Thu, Mar 21, 2024 at 8:56 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
>
> On 3/21/24 11:19 AM, Vineet Gupta wrote:
>
> >>
> >> So if we go back to Robin's observation that scheduling dramatically
> >> increases the instruction count, perhaps we try a run with
> >> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction
> >> counts compare.
> >
> > Oh yeah ! Robin hinted to this in Tues patchworks meeting too
> >
> > default           : 2,565,319,368,591
> > 128       : 2,509,741,035,068
> > 256       : 2,527,817,813,612
> > no-sched{,2}: 1,295,520,567,376
> Now we're getting somewhere.  That's in line with expectations.
>
> I would strongly suspect it's -fno-schedule-insns rather than
> -fno-schedule-insns2.  The former turns off scheduling before register
> allocation, the second turns it off after register allocation.  So if
> our theory about spilling is correct, then it must be the first since
> the second won't affect register allocation.   While I can speculate
> about other potential scheduler impacts, spilling due to sched1's
> actions is by far the most likely.

Another option is to enable -fsched-pressure which should help with
this issue.

> Given the magnitude here, I would bet we can see this pretty clearly if
> you've got function level or block level count data for those runs.  I'd
> start with that, ideally narrowing things down to a function or hot loop
> within a function which shows a huge delta.
>
>  From that we can then look at the IRA and LRA dumps and correlate what
> we see there with the before/after scheduling dumps to see how we've
> lengthened lifetimes in critical locations.
>
> I'd probably start with the IRA dump.  It's going to have annotations in
> its dump output like "Potential Spill" which may guide us.  In simplest
> terms a pseudo is trivially allocatable when it has fewer neighbors in
> the conflict graph than available hard registers.  If it has more
> neighbors in the conflict graph than available hard registers, then it's
> potentially going to be spilled -- we can't know during this phase of
> allocation.
>
> As we pop registers off the coloring stack, some neighbors of the pseudo
> in question may end up allocated into the same hard register.  That can
> sometimes result in a hard register being available.  It might be easier
> to see with a graph
>
>      a--b--c
>         |
>         d
>
> Where a..d are pseudo registers.  If two pseudos are connected by an
> edge, then they have overlapping lifetimes and can't be allocated to the
> same hard register.  So as we can see b conflicts with a, c & d.  If we
> only have two hard registers, then b is not trivially colorable and will
> be marked as a potential spill.
>
> During coloring we may end up allocating a, c & d to the same hard
> register (they don't conflict, so its safe).  If that happens, then
> there would be a register available for b.
>
> Anyway, that should explain why b would be marked as a potential spill
> and how it might end up getting a hard register anyway.
>
> The hope is we can see the potential spills increasing.  At which point
> we can walk backwards to sched1 and dive into its scheduling decisions.
>
> Jeff
Jeff Law March 22, 2024, 12:29 p.m. UTC | #9
On 3/22/24 2:47 AM, Richard Biener wrote:
> On Thu, Mar 21, 2024 at 8:56 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>>
>>
>>
>> On 3/21/24 11:19 AM, Vineet Gupta wrote:
>>
>>>>
>>>> So if we go back to Robin's observation that scheduling dramatically
>>>> increases the instruction count, perhaps we try a run with
>>>> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction
>>>> counts compare.
>>>
>>> Oh yeah ! Robin hinted to this in Tues patchworks meeting too
>>>
>>> default           : 2,565,319,368,591
>>> 128       : 2,509,741,035,068
>>> 256       : 2,527,817,813,612
>>> no-sched{,2}: 1,295,520,567,376
>> Now we're getting somewhere.  That's in line with expectations.
>>
>> I would strongly suspect it's -fno-schedule-insns rather than
>> -fno-schedule-insns2.  The former turns off scheduling before register
>> allocation, the second turns it off after register allocation.  So if
>> our theory about spilling is correct, then it must be the first since
>> the second won't affect register allocation.   While I can speculate
>> about other potential scheduler impacts, spilling due to sched1's
>> actions is by far the most likely.
> 
> Another option is to enable -fsched-pressure which should help with
> this issue.
In theory we're already using that by default -- it's part of what makes 
me so curious to understand what's going on.

jeff
Vineet Gupta March 22, 2024, 4:56 p.m. UTC | #10
On 3/22/24 05:29, Jeff Law wrote:
>> Another option is to enable -fsched-pressure which should help with
>> this issue.
> In theory we're already using that by default -- it's part of what makes 
> me so curious to understand what's going on.

We are actually using it in practice :-)
It's been the default for the RISC-V port since August of last year.

-Vineet
Jeff Law March 25, 2024, 3:05 a.m. UTC | #11
On 3/21/24 11:19 AM, Vineet Gupta wrote:
> 

> 
> Oh yeah ! Robin hinted to this in Tues patchworks meeting too
> 
> default	    : 2,565,319,368,591
> 128	    : 2,509,741,035,068
> 256	    : 2,527,817,813,612
> no-sched{,2}: 1,295,520,567,376
So one more nugget here.  I happened to be doing some historical data 
mining and I can see the huge instruction jump in our internal runs.  We 
jump from 1.4T instructions to 2.1T.  But more importantly we see the 
cycle counts *improve* across that span, roughly 17%.   Unfortunately 
the data points are way far apart in time, so I don't think they help us 
narrow down the root cause.

Mostly it highlights that while instruction counts are generally 
correlated to cycle counts, they can deviate, and in this case they do so 
wildly.  Whatever fix we end up making, we'll likely need to run it on a 
design to evaluate its actual performance impact.

jeff