Message ID | 98179c8e-bcec-83ed-5b99-6f54791bd7cd@tachyum.com |
---|---|
State | New |
Headers | show |
Series | Aligning stack offsets for spills | expand |
On Mon, Jun 7, 2021 at 9:00 PM Jeff Law <jlaw@tachyum.com> wrote: > > > So, as many of you know I left Red Hat a while ago and joined Tachyum. > We're building a new processor and we've come across an issue where I > think we need upstream discussion. > > I can't divulge many of the details right now, but one of the quirks of > our architecture is that reg+d addressing modes for our vector > loads/stores require the displacement to be aligned. This is an > artifact of how these instructions are encoded. > > Obviously we can emit a load of the address into a register when the > displacement isn't aligned. From a correctness point that works > perfectly. Unfortunately, it's a significant performance hit on some > standard benchmarks (spec) where we have a great number of spills of > vector objects into the stack at unaligned offsets in the hot parts of > the code. > > > We've considered 3 possible approaches to solve this problem. > > 1. When the displacement isn't properly aligned, allocate more space in > assign_stack_local so that we can make the offset aligned. The downside > is this potentially burns a lot of stack space, but in practice the cost > was minimal (16 bytes in a 9k frame) From a performance standpoint this > works perfectly. > > 2. Abuse the register elimination code to create a second pointer into > the stack. Spills would start as <virtual> + offset, then either get > eliminated to sp+offset' when the offset is aligned or gpr+offset'' when > the offset wasn't properly aligned. We started a bit down this path, but > with #1 working so well, we didn't get this approach to proof-of-concept. > > 3. Hack up the post-reload optimizers to fix things up as best as we > can. This may still be advantageous, but again with #1 working so well, > we didn't explore this in any significant way. We may still look at > this at some point in other contexts. > So just as extra info - you're pre-allocating the frame (including for spills) and not using push/pop? > Here's what we're playing with. Obviously we'd need a target hook to > drive this behavior. I was thinking that we'd pass in any slot offset > alignment requirements (from the target hook) to assign_stack_local and > that would bubble down to this point in try_fit_stack_local: > > diff --git a/gcc/function.c b/gcc/function.c > index d616f5f64f4..7f441b87a63 100644 > --- a/gcc/function.c > +++ b/gcc/function.c > @@ -307,6 +307,14 @@ try_fit_stack_local (poly_int64 start, poly_int64 > length, > frame_off = targetm.starting_frame_offset () % frame_alignment; > frame_phase = frame_off ? frame_alignment - frame_off : 0; > > + if (known_eq (size, 64) && alignment < 64) > + alignment = 64; > + I'm not familiar with the spill slot allocation code in GCC (I assume the above is part of it) - do we in any way "sort" the spill slots so the extra required padding is minimal? Does the above guarantee that in the end the offset will be aligned? I assume IRA/LRA can still choose to eliminate the respective frame pointer to sth else that ends up misaligning the offset again? Thus is it a real fix or a heuristic that ends up working most of the time? The actual alignment value should be dependent on the mode and target preference and thus a target hook I suppose (you mention this applies to vector loads/stores only). Don't you have the very same issue with non-stack accesses? Richard. > /* Round the frame offset to the specified alignment. */ > > if (FRAME_GROWS_DOWNWARD) > > Thoughts? > > Jeff
Hello, On Mon, 7 Jun 2021, Jeff Law wrote: > > So, as many of you know I left Red Hat a while ago and joined Tachyum. We're > building a new processor and we've come across an issue where I think we need > upstream discussion. > > I can't divulge many of the details right now, but one of the quirks of our > architecture is that reg+d addressing modes for our vector loads/stores > require the displacement to be aligned. This is an artifact of how these > instructions are encoded. > > Obviously we can emit a load of the address into a register when the > displacement isn't aligned. From a correctness point that works perfectly. > Unfortunately, it's a significant performance hit on some standard benchmarks > (spec) where we have a great number of spills of vector objects into the stack > at unaligned offsets in the hot parts of the code. > > > We've considered 3 possible approaches to solve this problem. > > 1. When the displacement isn't properly aligned, allocate more space in > assign_stack_local so that we can make the offset aligned. The downside is > this potentially burns a lot of stack space, but in practice the cost was > minimal (16 bytes in a 9k frame) From a performance standpoint this works > perfectly. > > 2. Abuse the register elimination code to create a second pointer into the > stack. Spills would start as <virtual> + offset, then either get eliminated > to sp+offset' when the offset is aligned or gpr+offset'' when the offset > wasn't properly aligned. We started a bit down this path, but with #1 working > so well, we didn't get this approach to proof-of-concept. > > 3. Hack up the post-reload optimizers to fix things up as best as we can. > This may still be advantageous, but again with #1 working so well, we didn't > explore this in any significant way. We may still look at this at some point > in other contexts. > > Here's what we're playing with. Obviously we'd need a target hook to > drive this behavior. I was thinking that we'd pass in any slot offset > alignment requirements (from the target hook) to assign_stack_local and > that would bubble down to this point in try_fit_stack_local: Why is the machinery involving STACK_SLOT_ALIGNMENT and spill_slot_alignment() (for spilling) or get_stack_local_alignment() (for backing stack slots) not working for you? If everything is setup correctly the input alignment to try_fit_stack_local ought to be correct already. Ciao, Michael.
On 6/8/2021 8:08 AM, Michael Matz wrote: > Hello, > > On Mon, 7 Jun 2021, Jeff Law wrote: > >> So, as many of you know I left Red Hat a while ago and joined Tachyum. We're >> building a new processor and we've come across an issue where I think we need >> upstream discussion. >> >> I can't divulge many of the details right now, but one of the quirks of our >> architecture is that reg+d addressing modes for our vector loads/stores >> require the displacement to be aligned. This is an artifact of how these >> instructions are encoded. >> >> Obviously we can emit a load of the address into a register when the >> displacement isn't aligned. From a correctness point that works perfectly. >> Unfortunately, it's a significant performance hit on some standard benchmarks >> (spec) where we have a great number of spills of vector objects into the stack >> at unaligned offsets in the hot parts of the code. >> >> >> We've considered 3 possible approaches to solve this problem. >> >> 1. When the displacement isn't properly aligned, allocate more space in >> assign_stack_local so that we can make the offset aligned. The downside is >> this potentially burns a lot of stack space, but in practice the cost was >> minimal (16 bytes in a 9k frame) From a performance standpoint this works >> perfectly. >> >> 2. Abuse the register elimination code to create a second pointer into the >> stack. Spills would start as <virtual> + offset, then either get eliminated >> to sp+offset' when the offset is aligned or gpr+offset'' when the offset >> wasn't properly aligned. We started a bit down this path, but with #1 working >> so well, we didn't get this approach to proof-of-concept. >> >> 3. Hack up the post-reload optimizers to fix things up as best as we can. >> This may still be advantageous, but again with #1 working so well, we didn't >> explore this in any significant way. We may still look at this at some point >> in other contexts. >> >> Here's what we're playing with. Obviously we'd need a target hook to >> drive this behavior. I was thinking that we'd pass in any slot offset >> alignment requirements (from the target hook) to assign_stack_local and >> that would bubble down to this point in try_fit_stack_local: > Why is the machinery involving STACK_SLOT_ALIGNMENT and > spill_slot_alignment() (for spilling) or get_stack_local_alignment() (for > backing stack slots) not working for you? If everything is setup > correctly the input alignment to try_fit_stack_local ought to be correct > already. We don't need the MEM as a whole aligned, just the offset in the address calculation due to how we encode those instructions. If I've read that code correctly, it would arrange for a dynamic realignment of the stack so that it could then align the slot. None of that is necessary for us and we'd like to avoid forcing the dynamic stack realignment. Or did I misread the code? jeff
On Tue, Jun 08, 2021 at 08:47:26AM -0600, Jeff Law wrote: > > Why is the machinery involving STACK_SLOT_ALIGNMENT and > > spill_slot_alignment() (for spilling) or get_stack_local_alignment() (for > > backing stack slots) not working for you? If everything is setup > > correctly the input alignment to try_fit_stack_local ought to be correct > > already. > We don't need the MEM as a whole aligned, just the offset in the address > calculation due to how we encode those instructions. If I've read that code > correctly, it would arrange for a dynamic realignment of the stack so that > it could then align the slot. None of that is necessary for us and we'd like > to avoid forcing the dynamic stack realignment. Or did I misread the code? I think dynamic stack realignment is done only on x86, no other backend has that support, on all the other arches larger alignments are done in expand_stack_vars by effectively performing __builtin_alloca_with_align for the block containing all such variables. So I'd the the functions Michael mentioned shouldn't be doing dynamic stack realignment, though perhaps by pretending the vars have higher alignment might be recorded in MEM_ALIGN and perhaps might result in wrong-code if something will try to e.g. test if least significant bits of certain MEM address are 0. Jakub
On 6/8/2021 12:56 AM, Richard Biener wrote: > On Mon, Jun 7, 2021 at 9:00 PM Jeff Law <jlaw@tachyum.com> wrote: >> >> So, as many of you know I left Red Hat a while ago and joined Tachyum. >> We're building a new processor and we've come across an issue where I >> think we need upstream discussion. >> >> I can't divulge many of the details right now, but one of the quirks of >> our architecture is that reg+d addressing modes for our vector >> loads/stores require the displacement to be aligned. This is an >> artifact of how these instructions are encoded. >> >> Obviously we can emit a load of the address into a register when the >> displacement isn't aligned. From a correctness point that works >> perfectly. Unfortunately, it's a significant performance hit on some >> standard benchmarks (spec) where we have a great number of spills of >> vector objects into the stack at unaligned offsets in the hot parts of >> the code. >> >> >> We've considered 3 possible approaches to solve this problem. >> >> 1. When the displacement isn't properly aligned, allocate more space in >> assign_stack_local so that we can make the offset aligned. The downside >> is this potentially burns a lot of stack space, but in practice the cost >> was minimal (16 bytes in a 9k frame) From a performance standpoint this >> works perfectly. >> >> 2. Abuse the register elimination code to create a second pointer into >> the stack. Spills would start as <virtual> + offset, then either get >> eliminated to sp+offset' when the offset is aligned or gpr+offset'' when >> the offset wasn't properly aligned. We started a bit down this path, but >> with #1 working so well, we didn't get this approach to proof-of-concept. >> >> 3. Hack up the post-reload optimizers to fix things up as best as we >> can. This may still be advantageous, but again with #1 working so well, >> we didn't explore this in any significant way. We may still look at >> this at some point in other contexts. >> > So just as extra info - you're pre-allocating the frame (including for spills) > and not using push/pop? Yes, we're an ACCUMULATE_OUTGOING_ARGS target. > >> Here's what we're playing with. Obviously we'd need a target hook to >> drive this behavior. I was thinking that we'd pass in any slot offset >> alignment requirements (from the target hook) to assign_stack_local and >> that would bubble down to this point in try_fit_stack_local: >> >> diff --git a/gcc/function.c b/gcc/function.c >> index d616f5f64f4..7f441b87a63 100644 >> --- a/gcc/function.c >> +++ b/gcc/function.c >> @@ -307,6 +307,14 @@ try_fit_stack_local (poly_int64 start, poly_int64 >> length, >> frame_off = targetm.starting_frame_offset () % frame_alignment; >> frame_phase = frame_off ? frame_alignment - frame_off : 0; >> >> + if (known_eq (size, 64) && alignment < 64) >> + alignment = 64; >> + > I'm not familiar with the spill slot allocation code in GCC (I assume the above > is part of it) - do we in any way "sort" the spill slots so the extra required > padding is minimal? Does the above guarantee that in the end the > offset will be aligned? I assume IRA/LRA can still choose to eliminate > the respective frame pointer to sth else that ends up misaligning the offset > again? Thus is it a real fix or a heuristic that ends up working most of > the time? LRA does sort the spill slots, but I haven't looked into its sorting algorithm to see if it's anything other than a priority sort. LRA does allow sharing spill slots for non-conflicting pseudos which is what I've assumed has kept the extra padding to a minimum. It's a real fix, not a heuristic. > > The actual alignment value should be dependent on the mode and > target preference and thus a target hook I suppose (you mention > this applies to vector loads/stores only). Absolutely. What I posted was just the initial proof-of-concept. It needs to be a target hook and we need to pass in the data from LRA since by the time we get into assign_stack_local, we don't have a useful mode -- LRA passes in the size and BLKmode. I probably trimmed out too many comments in my attempt to avoid disclosing anything I shouldn't. It's worth noting that adjusting things at that particular point results in getting the offsets aligned without forcing the stack as a whole into a higher alignment or even forcing slots to a higher alignment. > Don't you have the very same issue with non-stack accesses? We do and will continue to handle those by reloading the reg+d address when the displacement isn't suitably aligned. In practice those cases aren't common and aren't on critical paths. Jeff
On Tue, Jun 8, 2021 at 7:56 AM Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > On Tue, Jun 08, 2021 at 08:47:26AM -0600, Jeff Law wrote: > > > Why is the machinery involving STACK_SLOT_ALIGNMENT and > > > spill_slot_alignment() (for spilling) or get_stack_local_alignment() (for > > > backing stack slots) not working for you? If everything is setup > > > correctly the input alignment to try_fit_stack_local ought to be correct > > > already. > > We don't need the MEM as a whole aligned, just the offset in the address > > calculation due to how we encode those instructions. If I've read that code > > correctly, it would arrange for a dynamic realignment of the stack so that > > it could then align the slot. None of that is necessary for us and we'd like > > to avoid forcing the dynamic stack realignment. Or did I misread the code? > > I think dynamic stack realignment is done only on x86, no other backend has I believe that all pieces of infrastructure to realign the stack are in place. You just need to properly align the stack in the backend. > that support, on all the other arches larger alignments are done > in expand_stack_vars by effectively performing __builtin_alloca_with_align > for the block containing all such variables. > So I'd the the functions Michael mentioned shouldn't be doing dynamic stack > realignment, though perhaps by pretending the vars have higher alignment > might be recorded in MEM_ALIGN and perhaps might result in wrong-code if > something will try to e.g. test if least significant bits of certain MEM > address are 0. > > Jakub >
On 6/8/2021 9:06 AM, H.J. Lu wrote: > On Tue, Jun 8, 2021 at 7:56 AM Jakub Jelinek via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: >> On Tue, Jun 08, 2021 at 08:47:26AM -0600, Jeff Law wrote: >>>> Why is the machinery involving STACK_SLOT_ALIGNMENT and >>>> spill_slot_alignment() (for spilling) or get_stack_local_alignment() (for >>>> backing stack slots) not working for you? If everything is setup >>>> correctly the input alignment to try_fit_stack_local ought to be correct >>>> already. >>> We don't need the MEM as a whole aligned, just the offset in the address >>> calculation due to how we encode those instructions. If I've read that code >>> correctly, it would arrange for a dynamic realignment of the stack so that >>> it could then align the slot. None of that is necessary for us and we'd like >>> to avoid forcing the dynamic stack realignment. Or did I misread the code? >> I think dynamic stack realignment is done only on x86, no other backend has > I believe that all pieces of infrastructure to realign the stack are > in place. You > just need to properly align the stack in the backend. As I've stated, we don't need the stack aligned to these higher boundaries. Nor do we need the slot as a whole aligned. That's ultimately just wasteful since we don't need them. We just want to get an aligned offset. Jeff
Hello, On Tue, 8 Jun 2021, Jeff Law wrote: > On 6/8/2021 9:06 AM, H.J. Lu wrote: > > On Tue, Jun 8, 2021 at 7:56 AM Jakub Jelinek via Gcc-patches > > <gcc-patches@gcc.gnu.org> wrote: > >> On Tue, Jun 08, 2021 at 08:47:26AM -0600, Jeff Law wrote: > >>>> Why is the machinery involving STACK_SLOT_ALIGNMENT and > >>>> spill_slot_alignment() (for spilling) or get_stack_local_alignment() (for > >>>> backing stack slots) not working for you? If everything is setup > >>>> correctly the input alignment to try_fit_stack_local ought to be correct > >>>> already. > >>> We don't need the MEM as a whole aligned, just the offset in the address > >>> calculation due to how we encode those instructions. If I've read that > >>> code > >>> correctly, it would arrange for a dynamic realignment of the stack so > >>> that > >>> it could then align the slot. None of that is necessary for us and we'd > >>> like > >>> to avoid forcing the dynamic stack realignment. Or did I misread the > >>> code? > >> I think dynamic stack realignment is done only on x86, no other backend has > > I believe that all pieces of infrastructure to realign the stack are > > in place. You > > just need to properly align the stack in the backend. > > As I've stated, we don't need the stack aligned to these higher boundaries. > Nor do we need the slot as a whole aligned. That's ultimately just wasteful > since we don't need them. We just want to get an aligned offset. Well, but isn't that creating a difference when there is none? You need an aligned offset; when given an aligned stack pointer that then is equivalent to an aligned stack address. You are saying that you don't need the aligned stack pointer, sure, but would it be a problem for you? Apart from that: dynamic stack realignment can be disabled (or probably isn't enabled for your target to start with), then the stack offset alignment machinery should still work in isolation. (Well it might generate alignment claims in MEM RTL which then isn't in fact true, depends on the architecture if that's a problem for you). Either way, I think whatever you need should probably be somehow integrated with the existing stack slot alignment knobs. Or rather the two orthogonal pieces (stack pointer alignment and stack offset alignment) be separated and you then using only the latter. (Btw: are you also trying to improve non-stack addresses? Because ultimately your constraints aren't about stack at all but about all address forms. In a way this all is more like a job for addressing mode selection and massaging, but of course our support for such in GCC is limited beyond avoiding invalid modes by reloading into registers, which is exactly what you don't want :) ) Ciao, Michael.
On 6/7/21 2:00 PM, Jeff Law wrote: > I can't divulge many of the details right now, but one of the quirks of our > architecture is that reg+d addressing modes for our vector loads/stores require > the displacement to be aligned. This is an artifact of how these instructions > are encoded. Given what you're describing, it sounds like POWER has something similar. Our reg+displacement addressing uses 16-bit displacements using D, DS or DQ operand fields. The D field encodes the entire 16-bits, but the DS and DQ fields only encode 14-bits and 12-bits respectively. The DS and DQ operands have the same maximum displacement as D operands, we just force that their bottom 2-bits/4-bits must be zero, so we don't need to include them in the insn encoding. I believe this is all just handled in our legitimate address routines, but maybe Segher and/or Mike can correct me if I'm wrong? Peter
On Thu, Jun 10, 2021 at 02:28:34PM -0500, Peter Bergner via Gcc-patches wrote: > On 6/7/21 2:00 PM, Jeff Law wrote: > > I can't divulge many of the details right now, but one of the quirks of our > > architecture is that reg+d addressing modes for our vector loads/stores require > > the displacement to be aligned. This is an artifact of how these instructions > > are encoded. > > Given what you're describing, it sounds like POWER has something similar. > Our reg+displacement addressing uses 16-bit displacements using D, DS or DQ > operand fields. The D field encodes the entire 16-bits, but the DS and DQ > fields only encode 14-bits and 12-bits respectively. Yes, the low 2 resp. 4 bits of the offset are used for other things in the instruction encoding for DS and DQ. > The DS and DQ operands > have the same maximum displacement as D operands, we just force that their > bottom 2-bits/4-bits must be zero, so we don't need to include them in the > insn encoding. I believe this is all just handled in our legitimate address > routines, but maybe Segher and/or Mike can correct me if I'm wrong? This is all correct. But probably the main thing is that our ABIs require most things to be naturally aligned (up to 16 bytes). The machines do not require that (anymore), but requiring this in the ABI still is a performance win afaics (and a lot easier for (compiler) implementers -- most issues you describe just disappear ;-) ) Something tells me this won't work for you though? Segher
On 6/8/2021 8:55 AM, Jakub Jelinek wrote: > On Tue, Jun 08, 2021 at 08:47:26AM -0600, Jeff Law wrote: >>> Why is the machinery involving STACK_SLOT_ALIGNMENT and >>> spill_slot_alignment() (for spilling) or get_stack_local_alignment() (for >>> backing stack slots) not working for you? If everything is setup >>> correctly the input alignment to try_fit_stack_local ought to be correct >>> already. >> We don't need the MEM as a whole aligned, just the offset in the address >> calculation due to how we encode those instructions. If I've read that code >> correctly, it would arrange for a dynamic realignment of the stack so that >> it could then align the slot. None of that is necessary for us and we'd like >> to avoid forcing the dynamic stack realignment. Or did I misread the code? > I think dynamic stack realignment is done only on x86, no other backend has > that support, on all the other arches larger alignments are done > in expand_stack_vars by effectively performing __builtin_alloca_with_align > for the block containing all such variables. > So I'd the the functions Michael mentioned shouldn't be doing dynamic stack > realignment, though perhaps by pretending the vars have higher alignment > might be recorded in MEM_ALIGN and perhaps might result in wrong-code if > something will try to e.g. test if least significant bits of certain MEM > address are 0. Hmm, I thought we'd do a dynamic realignment, if that's not the case then we've got another approach to consider. Thanks everyone... jeff
diff --git a/gcc/function.c b/gcc/function.c index d616f5f64f4..7f441b87a63 100644 --- a/gcc/function.c +++ b/gcc/function.c @@ -307,6 +307,14 @@ try_fit_stack_local (poly_int64 start, poly_int64 length, frame_off = targetm.starting_frame_offset () % frame_alignment; frame_phase = frame_off ? frame_alignment - frame_off : 0; + if (known_eq (size, 64) && alignment < 64) + alignment = 64; + /* Round the frame offset to the specified alignment. */ if (FRAME_GROWS_DOWNWARD) Thoughts? Jeff