Message ID | 1707878.8rQuAFM7vl@polaris |
---|---|
State | New |
Headers | show |
On 11/11/14 15:28, Eric Botcazou wrote: > In https://gcc.gnu.org/ml/gcc-patches/2012-01/msg00363.html, Tom reported an > ICE in the DWARF CFI pass because a frame-related insn was speculated by the > reorg pass and, therefore, disabled the optimization. It turns out that the > same code (when condition == const_true_rtx) can also duplicate frame-related > insns, leading to the same ICE in the DWARF CFI pass; we have run into that > building newlib for a port we are about to contribute. > > Hence the attached patch, which disables the optimization in this case too and > adds a comment for both. Tested on SPARC/Solaris, applied on the mainline. > > > 2014-11-11 Eric Botcazou <ebotcazou@adacore.com> > > * reorg.c (fill_slots_from_thread): Do not copy frame-related insns. I wonder how many other problems of this nature are lurking in reorg.c. For example steal_delay_list_from_{target,fallthrough} or the code which searches for arithmetic at the branch target, and puts the opposite insn in a delay slot. In fact, I really wonder if we should be allowing anything frame related outside fill_simple_delay_slots. jeff
> I wonder how many other problems of this nature are lurking in reorg.c. > For example steal_delay_list_from_{target,fallthrough} or the code > which searches for arithmetic at the branch target, and puts the > opposite insn in a delay slot. Right, and the latter has already been dealt with by Richard: 2012-02-11 Richard Sandiford <rdsandiford@googlemail.com> PR rtl-optimization/52175 * reorg.c (fill_slots_from_thread): Don't apply add/sub optimization to frame-related instructions. > In fact, I really wonder if we should be allowing anything frame related > outside fill_simple_delay_slots. That could well be the end result after a few more years of tweaking. :-)
On 11/14/14 03:44, Eric Botcazou wrote: >> I wonder how many other problems of this nature are lurking in reorg.c. >> For example steal_delay_list_from_{target,fallthrough} or the code >> which searches for arithmetic at the branch target, and puts the >> opposite insn in a delay slot. > > Right, and the latter has already been dealt with by Richard: > > 2012-02-11 Richard Sandiford <rdsandiford@googlemail.com> > > PR rtl-optimization/52175 > * reorg.c (fill_slots_from_thread): Don't apply add/sub optimization > to frame-related instructions. Ahhh, good. > >> In fact, I really wonder if we should be allowing anything frame related >> outside fill_simple_delay_slots. > > That could well be the end result after a few more years of tweaking. :-) IIRC, fill_eager and its related friends are all speculative in some way and aren't those precisely the ones that are causing us problems. Also note we have backends working around this stuff in fairly blunt ways: ;; For conditional branches. Frame related instructions are not allowed ;; because they confuse the unwind support. (define_attr "in_branch_delay" "false,true" (if_then_else (and (eq_attr "type" "!uncond_branch,branch,cbranch,fbranch,call,sibcall,dyncall,multi,milli,sh_func_adrs,parallel_branch,trap") (eq_attr "length" "4") (not (match_test "RTX_FRAME_RELATED_P (insn)"))) (const_string "true") (const_string "false"))) ;; Disallow instructions which use the FPU since they will tie up the FPU ;; even if the instruction is nullified. (define_attr "in_nullified_branch_delay" "false,true" (if_then_else (and (eq_attr "type" "!uncond_branch,branch,cbranch,fbranch,call,sibcall,dyncall,multi,milli,sh_func_adrs,fpcc,fpalu,fpmulsgl,fpmuldbl,fpdivsgl,fpdivdbl,fpsqrtsgl,fpsqrtdbl,parallel_branch,trap") (eq_attr "length" "4") (not (match_test "RTX_FRAME_RELATED_P (insn)"))) (const_string "true") (const_string "false"))) ;; For calls and millicode calls. (define_attr "in_call_delay" "false,true" (if_then_else (and (eq_attr "type" "!uncond_branch,branch,cbranch,fbranch,call,sibcall,dyncall,multi,milli,sh_func_adrs,parallel_branch,trap") (eq_attr "length" "4") (not (match_test "RTX_FRAME_RELATED_P (insn)"))) (const_string "true") (const_string "false"))) Given architectural difficulties of delay slots on modern processors, would it be that painful to just not allow filling slots with frame insns and let dbr try to find something else or drop in a nop? I wouldn't be all that surprised if there wasn't a measurable performance difference on something like a modern Sparc. Jeff
> IIRC, fill_eager and its related friends are all speculative in some way > and aren't those precisely the ones that are causing us problems. Also > note we have backends working around this stuff in fairly blunt ways: I'd say that the PA back-end went a bit too far here, especially if it marks some insns of the epilogue as frame-related. dwarf2cfi.c has special code to handle delay slots (SEQUENCEs) so it's not an all-or-nothing game. > Given architectural difficulties of delay slots on modern processors, > would it be that painful to just not allow filling slots with frame > insns and let dbr try to find something else or drop in a nop? I > wouldn't be all that surprised if there wasn't a measurable performance > difference on something like a modern Sparc. Yes, modern SPARCs have (short) branches without delay slots. But the other big contender is MIPS here and the story might be different for it.
Eric Botcazou <ebotcazou@adacore.com> writes: > > IIRC, fill_eager and its related friends are all speculative in some > way > > and aren't those precisely the ones that are causing us problems. > Also > > note we have backends working around this stuff in fairly blunt ways: > > I'd say that the PA back-end went a bit too far here, especially if it > marks some insns of the epilogue as frame-related. dwarf2cfi.c has > special code to handle delay slots (SEQUENCEs) so it's not an all-or- > nothing game. > > > Given architectural difficulties of delay slots on modern processors, > > would it be that painful to just not allow filling slots with frame > > insns and let dbr try to find something else or drop in a nop? I > > wouldn't be all that surprised if there wasn't a measurable > > performance difference on something like a modern Sparc. > > Yes, modern SPARCs have (short) branches without delay slots. But the > other big contender is MIPS here and the story might be different for > it. MIPSr6 introduces 'compact' branches which do not have delay slots. So the issues of filling delay slots will be less important from R6 onwards. However, delay slots remain important for now. I haven't thought about the problem much but instinctively I'd be surprised if a blanket restriction on frame-related instructions would lead to lots of NOPs in delay slots. Matthew
On 11/15/14 14:37, Matthew Fortune wrote: > Eric Botcazou <ebotcazou@adacore.com> writes: >>> IIRC, fill_eager and its related friends are all speculative in some >> way >>> and aren't those precisely the ones that are causing us problems. >> Also >>> note we have backends working around this stuff in fairly blunt ways: >> >> I'd say that the PA back-end went a bit too far here, especially if it >> marks some insns of the epilogue as frame-related. dwarf2cfi.c has >> special code to handle delay slots (SEQUENCEs) so it's not an all-or- >> nothing game. >> >>> Given architectural difficulties of delay slots on modern processors, >>> would it be that painful to just not allow filling slots with frame >>> insns and let dbr try to find something else or drop in a nop? I >>> wouldn't be all that surprised if there wasn't a measurable >>> performance difference on something like a modern Sparc. >> >> Yes, modern SPARCs have (short) branches without delay slots. But the >> other big contender is MIPS here and the story might be different for >> it. > > MIPSr6 introduces 'compact' branches which do not have delay slots. > > So the issues of filling delay slots will be less important from R6 > onwards. However, delay slots remain important for now. > > I haven't thought about the problem much but instinctively I'd be surprised > if a blanket restriction on frame-related instructions would lead to lots > of NOPs in delay slots. Possibly. I'd be surprised if frame-related stuff is used that often for filling slots... Combine that with the decrease in importance for filling delay slots when the exist, I wouldn't be terribly surprised if nobody could actually measure the change if we were to make it. The PA port may have gone too far, but it's certainly conservatively correct and on every PA processor that "matters" (for a very liberal definition of matters), I doubt the difference is measurable due to the depth of the reorder buffers and the fact that a nop can retire anytime that's convenient. Jeff
Index: reorg.c =================================================================== --- reorg.c (revision 217259) +++ reorg.c (working copy) @@ -2501,9 +2501,11 @@ fill_slots_from_thread (rtx_insn *insn, /* There are two ways we can win: If TRIAL doesn't set anything needed at the opposite thread and can't trap, or if it can - go into an annulled delay slot. */ + go into an annulled delay slot. But we want neither to copy + nor to speculate frame-related insns. */ if (!must_annul - && (condition == const_true_rtx + && ((condition == const_true_rtx + && (own_thread || !RTX_FRAME_RELATED_P (trial))) || (! insn_sets_resource_p (trial, &opposite_needed, true) && ! may_trap_or_fault_p (pat) && ! RTX_FRAME_RELATED_P (trial))))