Message ID | 20240611083727.2642461-1-guojiufu@linux.ibm.com |
---|---|
State | New |
Headers | show |
Series | [V4,1/2] split complicate 64bit constant to memory | expand |
Hi! On Tue, Jun 11, 2024 at 04:37:25PM +0800, Jiufu Guo wrote: > Sometimes, a complicated constant is built via 3(or more) > instructions. Generally speaking, it would not be as fast > as loading it from the constant pool (as the discussions in > PR63281): > "ld" is one instruction. If consider "address/toc" adjust, > we may count it as 2 instructions. And "pld" may need fewer > cycles. Yeah, three insns to build up a constant always was about as fast/slow as loading a constant from memory. When you need two related constants loading from memory is slower, so we preferred building them up. > As testing(SPEC2017), it could get better/stable runtime > if set the threshold as "> 2" (compare with "> 3"). Interesting! About how much speedup did you see? 0.1%? > As known, because the constant is load from memory by this > patch, so this functionality may affect the cache missing. > Also there may be possible side effects on icach. While, > IMHO, this patch would be still do the right thing. Yeah, but almost every codegen patch has *some* icache effect. Your benchmarks show a net positive effect, that is all that matters in the end. > gcc/ChangeLog: > > * config/rs6000/rs6000.cc (rs6000_emit_set_const): Update to split > complicate constant to memory. (complicated) And don't write "update", every patch is an update :-) There are more conditions here, mention those as well? But, why does base_reg_operand matter? And, why for TARGET_ELF only? That is the only place you tested I'm sure, but make an educated guess for the rest, why would it not be useful for other binary formats? > diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc > index e4dc629ddcc..f448df289a0 100644 > --- a/gcc/config/rs6000/rs6000.cc > +++ b/gcc/config/rs6000/rs6000.cc > @@ -10240,6 +10240,21 @@ rs6000_emit_set_const (rtx dest, rtx source) > c = sext_hwi (c, 32); > emit_move_insn (lo, GEN_INT (c)); > } > + > + else if (base_reg_operand (dest, mode) && TARGET_64BIT > + && TARGET_ELF && num_insns_constant (source, mode) > 2) > + { > + rtx sym = force_const_mem (mode, source); > + if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0)) > + && use_toc_relative_ref (XEXP (sym, 0), mode)) > + { > + rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest)); > + sym = gen_const_mem (mode, toc); > + set_mem_alias_set (sym, get_TOC_alias_set ()); > + } > + > + emit_move_insn (dest, sym); > + } > else > rs6000_emit_set_long_const (dest, c); > break; Is there no utility function to put constants like this in memory? Like, "output_toc" for example? > diff --git a/gcc/testsuite/gcc.target/powerpc/const_anchors.c b/gcc/testsuite/gcc.target/powerpc/const_anchors.c > index 542e2674b12..f33c9a83f5e 100644 > --- a/gcc/testsuite/gcc.target/powerpc/const_anchors.c > +++ b/gcc/testsuite/gcc.target/powerpc/const_anchors.c > @@ -1,5 +1,5 @@ > /* { dg-do compile { target has_arch_ppc64 } } */ > -/* { dg-options "-O2" } */ > +/* { dg-options "-O2 -fdump-rtl-final" } */ > > #define C1 0x2351847027482577ULL > #define C2 0x2351847027482578ULL > @@ -17,4 +17,5 @@ void __attribute__ ((noinline)) foo1 (long long *a, long long b) > *a++ = C2; > } > > -/* { dg-final { scan-assembler-times {\maddi\M} 2 } } */ > +/* Checking "final" instead checking "asm" output to avoid noise. */ > +/* { dg-final { scan-rtl-dump-times {\madddi3\M} 2 "final" } } */ So, in the RTL dump it finds a named pattern adddi3, while in the asm you find random other addi insns? Hardly worth a comment in the testcase itself, but heh. The point is that you check for the RTL pattern name instead of the machine insn. The compiler pass you check in isn't changed even! > --- a/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c > +++ b/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c > @@ -6,11 +6,18 @@ > /* { dg-final { scan-assembler-times {\mori\M} 4 } } */ > /* { dg-final { scan-assembler-times {\mrldimi\M} 2 } } */ > > +/* The below macro helps to avoid loading constant from memory. */ > +#define CONST_AVOID_BASE_REG(DEST, CST) \ > + { \ > + register long long d asm ("r0") = CST; \ > + asm volatile ("std %1, %0" : : "m"(DEST), "r"(d)); \ > + } This needs an actual changelog, not just "update". Something as simple as "New macro to force constant into a reg" is fine, but more than "update" :-) > +/* The below macro helps to avoid loading constant from memory. */ > +#define CONST_AVOID_BASE_REG(DEST, CST) \ > + { \ > + register long long d asm ("r0") = CST; \ > + asm volatile ("std %1, %0" : : "m"(DEST), "r"(d)); \ > + } "Forces the constant into a register", yes. > --- a/gcc/testsuite/gcc.target/powerpc/pr106550_1.c > +++ b/gcc/testsuite/gcc.target/powerpc/pr106550_1.c > @@ -4,17 +4,19 @@ > /* { dg-options "-O2 -mdejagnu-cpu=power10 -fdisable-rtl-split1" } */ > /* force the constant splitter run after RA: -fdisable-rtl-split1. */ > > +/* The below marco helps to avoid using paddi and avoid loading from memory. */ (macro) > --- /dev/null > +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c > @@ -0,0 +1,11 @@ > +/* Check loading constant from memory pool. */ > +/* { dg-options "-O2 -mpowerpc64" } */ > + > +void > +foo (unsigned long long *a) > +{ > + *a++ = 0x2351847027482577ULL; > +} > + > +/* { dg-final { scan-assembler-times {\mp?ld\M} 1 { target { lp64 } } } } */ Why target lp64 only? You have -mpowerpc64 already, does that not make us use ld here always? Segher
Hi Segher, Thanks a lot for your review! Segher Boessenkool <segher@kernel.crashing.org> writes: > Hi! > > On Tue, Jun 11, 2024 at 04:37:25PM +0800, Jiufu Guo wrote: >> Sometimes, a complicated constant is built via 3(or more) >> instructions. Generally speaking, it would not be as fast >> as loading it from the constant pool (as the discussions in >> PR63281): >> "ld" is one instruction. If consider "address/toc" adjust, >> we may count it as 2 instructions. And "pld" may need fewer >> cycles. > > Yeah, three insns to build up a constant always was about as fast/slow Yeah. > as loading a constant from memory. When you need two related constants > loading from memory is slower, so we preferred building them up. You may talking about the functionality "const_anchor" which is enabled in cse1 for two related constants. Like the test case: const_anchors.c, the first one would be loaded from memory if it is complicated, and the second would be computed based on the first one. > >> As testing(SPEC2017), it could get better/stable runtime >> if set the threshold as "> 2" (compare with "> 3"). > > Interesting! About how much speedup did you see? 0.1%? I did not see the proven speedup from SPEC2017. On some benchs(including perlbench), I saw some runs with speedup >1%. But with deeper checking, it would not directly benefit of this functionality, since it was not hit by the hot functions; and the changed constants do not contribute the runtime visibly. Saying "better/stable", because with different bases(interval weeks/months), ">2" introduce less waving; and with different options (-O2, -O3, -Ofast), ">2" seems more stable. > >> As known, because the constant is load from memory by this >> patch, so this functionality may affect the cache missing. >> Also there may be possible side effects on icach. While, >> IMHO, this patch would be still do the right thing. > > Yeah, but almost every codegen patch has *some* icache effect. Your > benchmarks show a net positive effect, that is all that matters in the > end. Yes. :) As using the small bench (like the cases mentioned in comment20 of PR63281), the result shows the prefer choise would be buiding by "2 insns" > "loading from rodata/toc" > "3 insns", and building by "4/5insn" would be the last choise. I mentioned this here, because in one recent run, for one benchmark, I saw 'caching missing' raising, and 'cache missing' was waving with size of the code (not directly related to this patch.) > >> gcc/ChangeLog: >> >> * config/rs6000/rs6000.cc (rs6000_emit_set_const): Update to split >> complicate constant to memory. > > (complicated) > > And don't write "update", every patch is an update :-) > > There are more conditions here, mention those as well? Thanks! This patch only split complicated constants(>2insns) to memory for -m64 under TARGET_ELF. "Split complicate constant to memory for -m64." > > But, why does base_reg_operand matter? And, why for TARGET_ELF only? > That is the only place you tested I'm sure, but make an educated guess > for the rest, why would it not be useful for other binary formats? Thanks for these very insightful questions! "base_reg_operand" is used to avoid to instruction like "r0=[OFF(r0)]" for the case like "r0=0xXXXXX" (I would have a recheck to see if this is really needed before RA.) Yes, I only tested with ELF. I add ELF checking, just because not sure how profitable it is on other targets. In theory, this functionality would be useful for other targets, at least for the 4/5 insns (witout the aspect of parallel execution in the binary). There is another heuristic code in rs6000_emit_move, it is using: ```(TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)``` This does not check ELF. Levearging this code, it would be ok to accept all targets. This heuristic exist a long time ago, I guess it would be before ppc64le (even more history), it may more for -m32. For -m64, as I tested, ">2" would be suitable. > >> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc >> index e4dc629ddcc..f448df289a0 100644 >> --- a/gcc/config/rs6000/rs6000.cc >> +++ b/gcc/config/rs6000/rs6000.cc >> @@ -10240,6 +10240,21 @@ rs6000_emit_set_const (rtx dest, rtx source) >> c = sext_hwi (c, 32); >> emit_move_insn (lo, GEN_INT (c)); >> } >> + >> + else if (base_reg_operand (dest, mode) && TARGET_64BIT >> + && TARGET_ELF && num_insns_constant (source, mode) > 2) >> + { >> + rtx sym = force_const_mem (mode, source); >> + if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0)) >> + && use_toc_relative_ref (XEXP (sym, 0), mode)) >> + { >> + rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest)); >> + sym = gen_const_mem (mode, toc); >> + set_mem_alias_set (sym, get_TOC_alias_set ()); >> + } >> + >> + emit_move_insn (dest, sym); >> + } >> else >> rs6000_emit_set_long_const (dest, c); >> break; > > Is there no utility function to put constants like this in memory? > Like, "output_toc" for example? I guess you mean "force_const_mem" which puts the constant value to constant pool and create "LC" label for it. With "emit_move_insn (dest, force_const_mem (mode, source));" could implement the part of this functionality. While adding code "if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))..." would save one more instructions. So, the code is drafted like this patch. > >> diff --git a/gcc/testsuite/gcc.target/powerpc/const_anchors.c b/gcc/testsuite/gcc.target/powerpc/const_anchors.c >> index 542e2674b12..f33c9a83f5e 100644 >> --- a/gcc/testsuite/gcc.target/powerpc/const_anchors.c >> +++ b/gcc/testsuite/gcc.target/powerpc/const_anchors.c >> @@ -1,5 +1,5 @@ >> /* { dg-do compile { target has_arch_ppc64 } } */ >> -/* { dg-options "-O2" } */ >> +/* { dg-options "-O2 -fdump-rtl-final" } */ >> >> #define C1 0x2351847027482577ULL >> #define C2 0x2351847027482578ULL >> @@ -17,4 +17,5 @@ void __attribute__ ((noinline)) foo1 (long long *a, long long b) >> *a++ = C2; >> } >> >> -/* { dg-final { scan-assembler-times {\maddi\M} 2 } } */ >> +/* Checking "final" instead checking "asm" output to avoid noise. */ >> +/* { dg-final { scan-rtl-dump-times {\madddi3\M} 2 "final" } } */ > > So, in the RTL dump it finds a named pattern adddi3, while in the asm > you find random other addi insns? Hardly worth a comment in the Yes, other "addi"s for memory address occur in asm. Thanks for your kind review. I would remove this no-go comment. > testcase itself, but heh. The point is that you check for the RTL > pattern name instead of the machine insn. The compiler pass you check > in isn't changed even! > >> --- a/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c >> +++ b/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c >> @@ -6,11 +6,18 @@ >> /* { dg-final { scan-assembler-times {\mori\M} 4 } } */ >> /* { dg-final { scan-assembler-times {\mrldimi\M} 2 } } */ >> >> +/* The below macro helps to avoid loading constant from memory. */ >> +#define CONST_AVOID_BASE_REG(DEST, CST) \ >> + { \ >> + register long long d asm ("r0") = CST; \ >> + asm volatile ("std %1, %0" : : "m"(DEST), "r"(d)); \ >> + } > > This needs an actual changelog, not just "update". Something as simple > as "New macro to force constant into a reg" is fine, but more than > "update" :-) Thanks a lot for point this out! > >> +/* The below macro helps to avoid loading constant from memory. */ >> +#define CONST_AVOID_BASE_REG(DEST, CST) \ >> + { \ >> + register long long d asm ("r0") = CST; \ >> + asm volatile ("std %1, %0" : : "m"(DEST), "r"(d)); \ >> + } > > "Forces the constant into a register", yes. OK, thanks. > >> --- a/gcc/testsuite/gcc.target/powerpc/pr106550_1.c >> +++ b/gcc/testsuite/gcc.target/powerpc/pr106550_1.c >> @@ -4,17 +4,19 @@ >> /* { dg-options "-O2 -mdejagnu-cpu=power10 -fdisable-rtl-split1" } */ >> /* force the constant splitter run after RA: -fdisable-rtl-split1. */ >> >> +/* The below marco helps to avoid using paddi and avoid loading from memory. */ > > (macro) OK. > >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c >> @@ -0,0 +1,11 @@ >> +/* Check loading constant from memory pool. */ >> +/* { dg-options "-O2 -mpowerpc64" } */ >> + >> +void >> +foo (unsigned long long *a) >> +{ >> + *a++ = 0x2351847027482577ULL; >> +} >> + >> +/* { dg-final { scan-assembler-times {\mp?ld\M} 1 { target { lp64 } } } } */ > > Why target lp64 only? You have -mpowerpc64 already, does that not make > us use ld here always? For -m32, we need other code to support it. And the patch for "-m32 -mpowerpc" is the second patch of this series. Here 'lp64' is safe guard for "-m64", in case there is concerns for "-m32" patch. Thanks again for your review!! BR, Jeff(Jiufu) Guo. > > > Segher
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc index e4dc629ddcc..f448df289a0 100644 --- a/gcc/config/rs6000/rs6000.cc +++ b/gcc/config/rs6000/rs6000.cc @@ -10240,6 +10240,21 @@ rs6000_emit_set_const (rtx dest, rtx source) c = sext_hwi (c, 32); emit_move_insn (lo, GEN_INT (c)); } + + else if (base_reg_operand (dest, mode) && TARGET_64BIT + && TARGET_ELF && num_insns_constant (source, mode) > 2) + { + rtx sym = force_const_mem (mode, source); + if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0)) + && use_toc_relative_ref (XEXP (sym, 0), mode)) + { + rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest)); + sym = gen_const_mem (mode, toc); + set_mem_alias_set (sym, get_TOC_alias_set ()); + } + + emit_move_insn (dest, sym); + } else rs6000_emit_set_long_const (dest, c); break; diff --git a/gcc/testsuite/gcc.target/powerpc/const_anchors.c b/gcc/testsuite/gcc.target/powerpc/const_anchors.c index 542e2674b12..f33c9a83f5e 100644 --- a/gcc/testsuite/gcc.target/powerpc/const_anchors.c +++ b/gcc/testsuite/gcc.target/powerpc/const_anchors.c @@ -1,5 +1,5 @@ /* { dg-do compile { target has_arch_ppc64 } } */ -/* { dg-options "-O2" } */ +/* { dg-options "-O2 -fdump-rtl-final" } */ #define C1 0x2351847027482577ULL #define C2 0x2351847027482578ULL @@ -17,4 +17,5 @@ void __attribute__ ((noinline)) foo1 (long long *a, long long b) *a++ = C2; } -/* { dg-final { scan-assembler-times {\maddi\M} 2 } } */ +/* Checking "final" instead checking "asm" output to avoid noise. */ +/* { dg-final { scan-rtl-dump-times {\madddi3\M} 2 "final" } } */ diff --git a/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c b/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c index e3a9a7264cf..ff745c730f3 100644 --- a/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c +++ b/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c @@ -6,11 +6,18 @@ /* { dg-final { scan-assembler-times {\mori\M} 4 } } */ /* { dg-final { scan-assembler-times {\mrldimi\M} 2 } } */ +/* The below macro helps to avoid loading constant from memory. */ +#define CONST_AVOID_BASE_REG(DEST, CST) \ + { \ + register long long d asm ("r0") = CST; \ + asm volatile ("std %1, %0" : : "m"(DEST), "r"(d)); \ + } + void __attribute__ ((noinline)) foo (unsigned long long *a) { /* 2 lis + 2 ori + 1 rldimi for each constant. */ - *a++ = 0x800aabcdc167fa16ULL; - *a++ = 0x7543a876867f616ULL; + CONST_AVOID_BASE_REG(*a++, 0x800aabcdc167fa16ULL); + CONST_AVOID_BASE_REG(*a++, 0x7543a876867f616ULL); } long long A[] = {0x800aabcdc167fa16ULL, 0x7543a876867f616ULL}; diff --git a/gcc/testsuite/gcc.target/powerpc/pr106550.c b/gcc/testsuite/gcc.target/powerpc/pr106550.c index 92b76ac8811..e5e38b4ea82 100644 --- a/gcc/testsuite/gcc.target/powerpc/pr106550.c +++ b/gcc/testsuite/gcc.target/powerpc/pr106550.c @@ -2,13 +2,19 @@ /* { dg-options "-O2 -mdejagnu-cpu=power10" } */ /* { dg-require-effective-target has_arch_ppc64 } */ +/* The below macro helps to avoid loading constant from memory. */ +#define CONST_AVOID_BASE_REG(DEST, CST) \ + { \ + register long long d asm ("r0") = CST; \ + asm volatile ("std %1, %0" : : "m"(DEST), "r"(d)); \ + } + void foo (unsigned long long *a) { - *a++ = 0x020805006106003; /* pli+pli+rldimi */ - *a++ = 0x2351847027482577;/* pli+pli+rldimi */ + CONST_AVOID_BASE_REG (*a++, 0x020805006106003ULL); /* pli+pli+rldimi */ + CONST_AVOID_BASE_REG (*a++, 0x2351847027482577ULL); /* pli+pli+rldimi */ } /* { dg-final { scan-assembler-times {\mpli\M} 4 } } */ /* { dg-final { scan-assembler-times {\mrldimi\M} 2 } } */ - diff --git a/gcc/testsuite/gcc.target/powerpc/pr106550_1.c b/gcc/testsuite/gcc.target/powerpc/pr106550_1.c index 5ab40d71a56..66539ee7cf0 100644 --- a/gcc/testsuite/gcc.target/powerpc/pr106550_1.c +++ b/gcc/testsuite/gcc.target/powerpc/pr106550_1.c @@ -4,17 +4,19 @@ /* { dg-options "-O2 -mdejagnu-cpu=power10 -fdisable-rtl-split1" } */ /* force the constant splitter run after RA: -fdisable-rtl-split1. */ +/* The below marco helps to avoid using paddi and avoid loading from memory. */ +#define CONST_AVOID_BASE_REG(DEST, CST) \ + { \ + register long long d asm ("r0") = CST; \ + asm volatile ("std %1, %0" : : "m"(DEST), "r"(d)); \ + } + void foo (unsigned long long *a) { - /* Test oris/ori is used where paddi does not work with 'r0'. */ - register long long d asm("r0") = 0x1245abcef9240dec; /* pli+sldi+oris+ori */ - long long n; - asm("cntlzd %0, %1" : "=r"(n) : "r"(d)); - *a++ = n; - - *a++ = 0x235a8470a7480000ULL; /* pli+sldi+oris */ - *a++ = 0x23a184700000b677ULL; /* pli+sldi+ori */ + CONST_AVOID_BASE_REG (*a++, 0x1245abcef9240dec); /* pli+sldi+oris+ori */ + CONST_AVOID_BASE_REG (*a++, 0x235a8470a7480000ULL); /* pli+sldi+oris */ + CONST_AVOID_BASE_REG (*a++, 0x23a184700000b677ULL); /* pli+sldi+ori */ } /* { dg-final { scan-assembler-times {\mpli\M} 3 } } */ diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c new file mode 100644 index 00000000000..9763a7181fc --- /dev/null +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c @@ -0,0 +1,11 @@ +/* Check loading constant from memory pool. */ +/* { dg-options "-O2 -mpowerpc64" } */ + +void +foo (unsigned long long *a) +{ + *a++ = 0x2351847027482577ULL; +} + +/* { dg-final { scan-assembler-times {\mp?ld\M} 1 { target { lp64 } } } } */ + diff --git a/gcc/testsuite/gcc.target/powerpc/pr87870.c b/gcc/testsuite/gcc.target/powerpc/pr87870.c index d2108ac3386..09b2e8de901 100644 --- a/gcc/testsuite/gcc.target/powerpc/pr87870.c +++ b/gcc/testsuite/gcc.target/powerpc/pr87870.c @@ -25,4 +25,7 @@ test3 (void) return ((__int128)0xdeadbeefcafebabe << 64) | 0xfacefeedbaaaaaad; } -/* { dg-final { scan-assembler-not {\mld\M} } } */ +/* test3 is using "ld" to load the value to r3 and r4. So there are 2 'ld's + test0, test1 and test2 are using "li", then check 6 'li's. */ +/* { dg-final { scan-assembler-times {\mp?ld\M} 2 } } */ +/* { dg-final { scan-assembler-times {\mli\M} 6 } } */ diff --git a/gcc/testsuite/gcc.target/powerpc/pr93012.c b/gcc/testsuite/gcc.target/powerpc/pr93012.c index 4f764d0576f..660fb0dddfa 100644 --- a/gcc/testsuite/gcc.target/powerpc/pr93012.c +++ b/gcc/testsuite/gcc.target/powerpc/pr93012.c @@ -10,4 +10,8 @@ unsigned long long mskh1() { return 0xffff9234ffff9234ULL; } unsigned long long mskl1() { return 0x2bcdffff2bcdffffULL; } unsigned long long mskse() { return 0xffff1234ffff1234ULL; } -/* { dg-final { scan-assembler-times {\mrldimi\M} 7 } } */ +/* { dg-final { scan-assembler-times {\mrldimi\M} 7 { target has_arch_pwr10 } } } */ + +/* 4 complicated constants can be loaded from pool. */ +/* { dg-final { scan-assembler-times {\mrldimi\M} 3 { target { ! has_arch_pwr10 } } } } */ +/* { dg-final { scan-assembler-times {\mld\M} 4 { target { ! has_arch_pwr10 } } } } */