Message ID: 20240316173524.1147760-1-vineetg@rivosinc.com
Series: RISC-V improve stack/array access by constant mat tweak
On 3/16/24 11:35 AM, Vineet Gupta wrote: > Hi, > > This set of patches (for gcc-15) help improve stack/array accesses > by improving constant materialization. Details are in respective > patches. > > The first patch is the main change which improves SPEC cactu by 10%. Just to confirm. Yup, 10% reduction in icounts and about a 3.5% improvement in cycles on our target. Which is great! This also makes me wonder if cactu is the benchmark that was sensitive to flushing the pending queue in the scheduler. Jivan's data would tend to indicate that is the case, as several routines seem to flush the pending queue often. In particular: ML_BSSN_RHS_Body ML_BSSN_Advect_Body ML_BSSN_constraints_Body All have a high number of dynamic instructions as well as lots of flushes of the pending queue. Vineet, you might want to look and see if cranking up the max-pending-list-length parameter helps drive down spilling. I think its default value is 32 insns. I've seen it cranked up to 128 and 256 insns without significant ill effects on compile time. My recollection (it's been like 3 years) of the key loop was that it had a few hundred instructions and we'd flush the pending list about 50 cycles into the loop as there just wasn't enough issue bandwidth to the FP units to dispatch all the FP instructions as their inputs became ready. So you'd be looking for flushes in a big loop. Jeff
On 3/18/24 21:41, Jeff Law wrote: >> The first patch is the main change which improves SPEC cactu by 10%. > Just to confirm. Yup, 10% reduction in icounts and about a 3.5% > improvement in cycles on our target. Which is great! Nice. > This also makes me wonder if cactu is the benchmark that was sensitive > to flushing the pending queue in the scheduler. Jivan's data would tend > to indicate that is the case as several routines seem to flush the > pending queue often. In particular: > > ML_BSSN_RHS_Body > ML_BSSN_Advect_Body > ML_BSSN_constraints_Body > > All have a high number of dynamic instructions as well as lots of > flushes of the pending queue. > > Vineet, you might want to look and see if cranking up the > max-pending-list-length parameter helps drive down spilling. I think > it's default value is 32 insns. I've seen it cranked up to 128 and 256 > insns without significant ill effects on compile time. > > My recollection (it's been like 3 years) of the key loop was that it had > a few hundred instructions and we'd flush the pending list about 50 > cycles into the loop as there just wasn't enough issue bandwidth to the > FP units to dispatch all the FP instructions as their inputs became > ready. So you'd be looking for flushes in a big loop. Great insight. Fired off a cactu run with 128, will keep you posted. Thx, -Vineet
On 3/18/24 21:41, Jeff Law wrote: > > On 3/16/24 11:35 AM, Vineet Gupta wrote: >> Hi, >> >> This set of patches (for gcc-15) help improve stack/array accesses >> by improving constant materialization. Details are in respective >> patches. >> >> The first patch is the main change which improves SPEC cactu by 10%. > Just to confirm. Yup, 10% reduction in icounts and about a 3.5% > improvement in cycles on our target. Which is great! > > This also makes me wonder if cactu is the benchmark that was sensitive > to flushing the pending queue in the scheduler. Jivan's data would tend > to indicate that is the case as several routines seem to flush the > pending queue often. In particular: > > ML_BSSN_RHS_Body > ML_BSSN_Advect_Body > ML_BSSN_constraints_Body > > All have a high number of dynamic instructions as well as lots of > flushes of the pending queue. > > Vineet, you might want to look and see if cranking up the > max-pending-list-length parameter helps drive down spilling. I think > it's default value is 32 insns. I've seen it cranked up to 128 and 256 > insns without significant ill effects on compile time. > > My recollection (it's been like 3 years) of the key loop was that it had > a few hundred instructions and we'd flush the pending list about 50 > cycles into the loop as there just wasn't enough issue bandwidth to the > FP units to dispatch all the FP instructions as their inputs became > ready. So you'd be looking for flushes in a big loop. Here are the results for Cactu on top of the new splitter changes: default : 2,565,319,368,591 128 : 2,509,741,035,068 256 : 2,527,817,813,612 I haven't probed deeper into the generated code itself, but it's likely helping with spilling. -Vineet
On 3/21/24 8:36 AM, Vineet Gupta wrote: > > > On 3/18/24 21:41, Jeff Law wrote: >> >> On 3/16/24 11:35 AM, Vineet Gupta wrote: >>> Hi, >>> >>> This set of patches (for gcc-15) help improve stack/array accesses >>> by improving constant materialization. Details are in respective >>> patches. >>> >>> The first patch is the main change which improves SPEC cactu by 10%. >> Just to confirm. Yup, 10% reduction in icounts and about a 3.5% >> improvement in cycles on our target. Which is great! >> >> This also makes me wonder if cactu is the benchmark that was sensitive >> to flushing the pending queue in the scheduler. Jivan's data would tend >> to indicate that is the case as several routines seem to flush the >> pending queue often. In particular: >> >> ML_BSSN_RHS_Body >> ML_BSSN_Advect_Body >> ML_BSSN_constraints_Body >> >> All have a high number of dynamic instructions as well as lots of >> flushes of the pending queue. >> >> Vineet, you might want to look and see if cranking up the >> max-pending-list-length parameter helps drive down spilling. I think >> it's default value is 32 insns. I've seen it cranked up to 128 and 256 >> insns without significant ill effects on compile time. >> >> My recollection (it's been like 3 years) of the key loop was that it had >> a few hundred instructions and we'd flush the pending list about 50 >> cycles into the loop as there just wasn't enough issue bandwidth to the >> FP units to dispatch all the FP instructions as their inputs became >> ready. So you'd be looking for flushes in a big loop. > > Here are the results for Cactu on top of the new splitter changes: > > default : 2,565,319,368,591 > 128 : 2,509,741,035,068 > 256 : 2,527,817,813,612 > > I've haven't probed deeper in generated code itself but likely to be > helping with spilling Actually, I read that as not important for this issue. While it is 50b instructions, I would be looking for something that had perhaps an order of magnitude bigger impact. 
Ultimately I think it means we still don't have a good handle on what's causing the spilling. Oh well. So if we go back to Robin's observation that scheduling dramatically increases the instruction count, perhaps we try a run with -fno-schedule-insns -fno-schedule-insns2 and see how the instruction counts compare. Jeff
On 3/21/24 07:45, Jeff Law wrote: >>>> The first patch is the main change which improves SPEC cactu by 10%. >>> Just to confirm. Yup, 10% reduction in icounts and about a 3.5% >>> improvement in cycles on our target. Which is great! >>> >>> This also makes me wonder if cactu is the benchmark that was sensitive >>> to flushing the pending queue in the scheduler. Jivan's data would tend >>> to indicate that is the case as several routines seem to flush the >>> pending queue often. In particular: >>> >>> ML_BSSN_RHS_Body >>> ML_BSSN_Advect_Body >>> ML_BSSN_constraints_Body >>> >>> All have a high number of dynamic instructions as well as lots of >>> flushes of the pending queue. >>> >>> Vineet, you might want to look and see if cranking up the >>> max-pending-list-length parameter helps drive down spilling. I think >>> it's default value is 32 insns. I've seen it cranked up to 128 and 256 >>> insns without significant ill effects on compile time. >>> >>> My recollection (it's been like 3 years) of the key loop was that it had >>> a few hundred instructions and we'd flush the pending list about 50 >>> cycles into the loop as there just wasn't enough issue bandwidth to the >>> FP units to dispatch all the FP instructions as their inputs became >>> ready. So you'd be looking for flushes in a big loop. >> Here are the results for Cactu on top of the new splitter changes: >> >> default : 2,565,319,368,591 >> 128 : 2,509,741,035,068 >> 256 : 2,527,817,813,612 >> >> I've haven't probed deeper in generated code itself but likely to be >> helping with spilling > Actually, I read that as not important for this issue. While it is 50b > instructions, I would be looking for something that had perhaps an order > of magnitude bigger impact. Ultimately I think it means we still > don't have a good handle on what's causing the spilling. Oh well. 
> > So if we go back to Robin's observation that scheduling dramatically > increases the instruction count, perhaps we try a run with > -fno-schedule-insns -fno-schedule-insns2 and see how the instruction > counts compare. Oh yeah! Robin hinted at this in Tuesday's patchworks meeting too: default : 2,565,319,368,591 128 : 2,509,741,035,068 256 : 2,527,817,813,612 no-sched{,2}: 1,295,520,567,376 -Vineet
On 3/21/24 11:19 AM, Vineet Gupta wrote: >> >> So if we go back to Robin's observation that scheduling dramatically >> increases the instruction count, perhaps we try a run with >> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction >> counts compare. > > Oh yeah ! Robin hinted to this in Tues patchworks meeting too > > default : 2,565,319,368,591 > 128 : 2,509,741,035,068 > 256 : 2,527,817,813,612 > no-sched{,2}: 1,295,520,567,376 Now we're getting somewhere. That's in line with expectations. I would strongly suspect it's -fno-schedule-insns rather than -fno-schedule-insns2. The former turns off scheduling before register allocation, the second turns it off after register allocation. So if our theory about spilling is correct, then it must be the first since the second won't affect register allocation. While I can speculate about other potential scheduler impacts, spilling due to sched1's actions is by far the most likely. Given the magnitude here, I would bet we can see this pretty clearly if you've got function level or block level count data for those runs. I'd start with that, ideally narrowing things down to a function or hot loop within a function which shows a huge delta. From that we can then look at the IRA and LRA dumps and correlate what we see there with the before/after scheduling dumps to see how we've lengthened lifetimes in critical locations. I'd probably start with the IRA dump. It's going to have annotations in its dump output like "Potential Spill" which may guide us. In simplest terms a pseudo is trivially allocatable when it has fewer neighbors in the conflict graph than available hard registers. If it has more neighbors in the conflict graph than available hard registers, then it's potentially going to be spilled -- we can't know during this phase of allocation. As we pop registers off the coloring stack, some neighbors of the pseudo in question may end up allocated into the same hard register. 
That can sometimes result in a hard register being available. It might be easier to see with a graph a--b--c | d Where a..d are pseudo registers. If two pseudos are connected by an edge, then they have overlapping lifetimes and can't be allocated to the same hard register. So as we can see b conflicts with a, c & d. If we only have two hard registers, then b is not trivially colorable and will be marked as a potential spill. During coloring we may end up allocating a, c & d to the same hard register (they don't conflict, so it's safe). If that happens, then there would be a register available for b. Anyway, that should explain why b would be marked as a potential spill and how it might end up getting a hard register anyway. The hope is we can see the potential spills increasing. At which point we can walk backwards to sched1 and dive into its scheduling decisions. Jeff
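Jeff's four-pseudo example can be sketched in a few lines of Python (a toy Chaitin-style colorer, not GCC's actual IRA code; the graph and the two-register limit are from his example):

```python
# Toy sketch of the conflict graph above:
#   a--b--c
#      |
#      d
# With 2 hard registers, b has 3 conflict-graph neighbors, so simplification
# can't prove it colorable: it's a "potential spill", not necessarily an
# actual one.
NUM_HARD_REGS = 2
conflicts = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b"},
    "d": {"b"},
}

# Pseudos with >= NUM_HARD_REGS neighbors are marked as potential spills.
potential_spills = {p for p, nbrs in conflicts.items()
                    if len(nbrs) >= NUM_HARD_REGS}

def color(order):
    """Greedily give each pseudo the lowest hard reg unused by its neighbors."""
    assignment = {}
    for p in order:
        used = {assignment[n] for n in conflicts[p] if n in assignment}
        free = [r for r in range(NUM_HARD_REGS) if r not in used]
        assignment[p] = free[0] if free else None  # None = actual spill
    return assignment

# a, c and d don't conflict with each other, so they can share one register;
# coloring them first leaves a register free for b -- no spill after all.
result = color(["a", "c", "d", "b"])
```

Here `potential_spills` comes out as `{"b"}`, yet `result` assigns every pseudo a hard register, illustrating why "Potential Spill" annotations in the IRA dump are a guide rather than a verdict.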
On 3/21/24 12:56, Jeff Law wrote: > > On 3/21/24 11:19 AM, Vineet Gupta wrote: > >>> So if we go back to Robin's observation that scheduling dramatically >>> increases the instruction count, perhaps we try a run with >>> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction >>> counts compare. >> Oh yeah ! Robin hinted to this in Tues patchworks meeting too >> >> default : 2,565,319,368,591 >> 128 : 2,509,741,035,068 >> 256 : 2,527,817,813,612 >> no-sched{,2}: 1,295,520,567,376 > Now we're getting somewhere. That's in line with expectations. > > I would strongly suspect it's -fno-schedule-insns rather than > -fno-schedule-insns2. The former turns off scheduling before register > allocation, the second turns it off after register allocation. So if > our theory about spilling is correct, then it must be the first since > the second won't affect register allocation. While I can speculate > about other potential scheduler impacts, spilling due to sched1's > actions is by far the most likely. As always you are absolutely right: just doing -fno-schedule-insns alone gets almost the same count as the last row above. > Given the magnitude here, I would bet we can see this pretty clearly if > you've got function level or block level count data for those runs. I'd > start with that, ideally narrowing things down to a function or hot loop > within a function which shows a huge delta. Alright, on it. Thx, -Vineet > From that we can then look at the IRA and LRA dumps and correlate what > we see there with the before/after scheduling dumps to see how we've > lengthened lifetimes in critical locations. > > I'd probably start with the IRA dump. It's going to have annotations in > its dump output like "Potential Spill" which may guide us. In simplest > terms a pseudo is trivially allocatable when it has fewer neighbors in > the conflict graph than available hard registers.
If it has more > neighbors in the conflict graph than available hard registers, then it's > potentially going to be spilled -- we can't know during this phase of > allocation. > > As we pop registers off the coloring stack, some neighbors of the pseudo > in question may end up allocated into the same hard register. That can > sometimes result in a hard register being available. It might be easier > to see with a graph > > a--b--c > | > d > > Where a..d are pseudo registers. If two pseudos are connected by an > edge, then they have overlapping lifetimes and can't be allocated to the > same hard register. So as we can see b conflicts with a, c & d. If we > only have two hard registers, then b is not trivially colorable and will > be marked as a potential spill. > > During coloring we may end up allocating a, c & d to the same hard > register (they don't conflict, so its safe). If that happens, then > there would be a register available for b. > > Anyway, that should explain why b would be marked as a potential spill > and how it might end up getting a hard register anyway. > > The hope is we can see the potential spills increasing. At which point > we can walk backwards to sched1 and dive into its scheduling decisions. > > Jeff
On Thu, Mar 21, 2024 at 8:56 PM Jeff Law <jeffreyalaw@gmail.com> wrote: > > > > On 3/21/24 11:19 AM, Vineet Gupta wrote: > > >> > >> So if we go back to Robin's observation that scheduling dramatically > >> increases the instruction count, perhaps we try a run with > >> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction > >> counts compare. > > > > Oh yeah ! Robin hinted to this in Tues patchworks meeting too > > > > default : 2,565,319,368,591 > > 128 : 2,509,741,035,068 > > 256 : 2,527,817,813,612 > > no-sched{,2}: 1,295,520,567,376 > Now we're getting somewhere. That's in line with expectations. > > I would strongly suspect it's -fno-schedule-insns rather than > -fno-schedule-insns2. The former turns off scheduling before register > allocation, the second turns it off after register allocation. So if > our theory about spilling is correct, then it must be the first since > the second won't affect register allocation. While I can speculate > about other potential scheduler impacts, spilling due to sched1's > actions is by far the most likely. Another option is to enable -fsched-pressure which should help with this issue. > Given the magnitude here, I would bet we can see this pretty clearly if > you've got function level or block level count data for those runs. I'd > start with that, ideally narrowing things down to a function or hot loop > within a function which shows a huge delta. > > From that we can then look at the IRA and LRA dumps and correlate what > we see there with the before/after scheduling dumps to see how we've > lengthened lifetimes in critical locations. > > I'd probably start with the IRA dump. It's going to have annotations in > its dump output like "Potential Spill" which may guide us. In simplest > terms a pseudo is trivially allocatable when it has fewer neighbors in > the conflict graph than available hard registers. 
If it has more > neighbors in the conflict graph than available hard registers, then it's > potentially going to be spilled -- we can't know during this phase of > allocation. > > As we pop registers off the coloring stack, some neighbors of the pseudo > in question may end up allocated into the same hard register. That can > sometimes result in a hard register being available. It might be easier > to see with a graph > > a--b--c > | > d > > Where a..d are pseudo registers. If two pseudos are connected by an > edge, then they have overlapping lifetimes and can't be allocated to the > same hard register. So as we can see b conflicts with a, c & d. If we > only have two hard registers, then b is not trivially colorable and will > be marked as a potential spill. > > During coloring we may end up allocating a, c & d to the same hard > register (they don't conflict, so its safe). If that happens, then > there would be a register available for b. > > Anyway, that should explain why b would be marked as a potential spill > and how it might end up getting a hard register anyway. > > The hope is we can see the potential spills increasing. At which point > we can walk backwards to sched1 and dive into its scheduling decisions. > > Jeff
On 3/22/24 2:47 AM, Richard Biener wrote: > On Thu, Mar 21, 2024 at 8:56 PM Jeff Law <jeffreyalaw@gmail.com> wrote: >> >> >> >> On 3/21/24 11:19 AM, Vineet Gupta wrote: >> >>>> >>>> So if we go back to Robin's observation that scheduling dramatically >>>> increases the instruction count, perhaps we try a run with >>>> -fno-schedule-insns -fno-schedule-insns2 and see how the instruction >>>> counts compare. >>> >>> Oh yeah ! Robin hinted to this in Tues patchworks meeting too >>> >>> default : 2,565,319,368,591 >>> 128 : 2,509,741,035,068 >>> 256 : 2,527,817,813,612 >>> no-sched{,2}: 1,295,520,567,376 >> Now we're getting somewhere. That's in line with expectations. >> >> I would strongly suspect it's -fno-schedule-insns rather than >> -fno-schedule-insns2. The former turns off scheduling before register >> allocation, the second turns it off after register allocation. So if >> our theory about spilling is correct, then it must be the first since >> the second won't affect register allocation. While I can speculate >> about other potential scheduler impacts, spilling due to sched1's >> actions is by far the most likely. > > Another option is to enable -fsched-pressure which should help with > this issue. In theory we're already using that by default -- it's part of what makes me so curious to understand what's going on. jeff
On 3/22/24 05:29, Jeff Law wrote: >> Another option is to enable -fsched-pressure which should help with >> this issue. > In theory we're already using that by default -- it's part of what makes > me so curious to understand what's going on. We are actually using it in practice :-) It's been the default for the RISC-V port since August of last year. -Vineet
On 3/21/24 11:19 AM, Vineet Gupta wrote: > > > Oh yeah ! Robin hinted to this in Tues patchworks meeting too > > default : 2,565,319,368,591 > 128 : 2,509,741,035,068 > 256 : 2,527,817,813,612 > no-sched{,2}: 1,295,520,567,376 So one more nugget here. I happened to be doing some historical data mining and I can see the huge instruction jump in our internal runs. We jump from 1.4T instructions to 2.1T. But more importantly we see the cycle counts *improve* across that span, roughly 17%. Unfortunately the data points are way far apart in time, so I don't think they help us narrow down the root cause. Mostly it highlights that while instruction counts are generally correlated to cycle counts, they can deviate and in this case they do so wildly. Whatever fix we end up making we'll likely need to run it on a design to evaluate its actual performance impact. jeff