Message ID | 4F9A948A.1040903@gnu.org |
---|---|
State | New |
Headers | show |
On Fri, Apr 27, 2012 at 2:43 PM, Paolo Bonzini <bonzini@gnu.org> wrote: > Il 27/04/2012 13:16, Richard Guenther ha scritto: >> In optabs.c we compare the CTZ_DEFINED_VALUE_AT_ZERO against two, >> is != 0 really what you want here? The docs suggest to me >> that as you are using the optab below you should compare against two, too. > > Interesting, first time I hear about this... > > ... except that no target sets the macros to 2, and all of them could > (as far as I could see). Looks like the code trumps the documentation; > how does this look? Hmm, I'll let target maintainers have a look at this - it's non-obvious for arm (the only case I looked at) at least. My guess is that the difference was to allow __builtin_ctz to be "optimized" and CTZ to be always well-defined (which incidentially it is not ...!?). Hm. Richard? > 2012-04-27 Paolo Bonzini <bonzini@gnu.org> > > * optabs.c (expand_ffs): Check CTZ_DEFINED_VALUE_AT_ZERO > against 1. > * doc/tm.texi (Misc): Invert meaning of 1 and 2 for > CLZ_DEFINED_VALUE_AT_ZERO and CTZ_DEFINED_VALUE_AT_ZERO. > > Index: optabs.c > =================================================================== > --- optabs.c (revisione 186903) > +++ optabs.c (copia locale) > @@ -2764,7 +2764,7 @@ expand_ffs > if (!temp) > goto fail; > > - defined_at_zero = (CTZ_DEFINED_VALUE_AT_ZERO (mode, val) == 2); > + defined_at_zero = (CTZ_DEFINED_VALUE_AT_ZERO (mode, val) == 1); > } > else if (optab_handler (clz_optab, mode) != CODE_FOR_nothing) > { > @@ -2773,7 +2773,7 @@ expand_ffs > if (!temp) > goto fail; > > - if (CLZ_DEFINED_VALUE_AT_ZERO (mode, val) == 2) > + if (CLZ_DEFINED_VALUE_AT_ZERO (mode, val) == 1) > { > defined_at_zero = true; > val = (GET_MODE_PRECISION (mode) - 1) - val; > Index: doc/tm.texi > =================================================================== > --- doc/tm.texi (revisione 186903) > +++ doc/tm.texi (copia locale) > @@ -10640,9 +10640,9 @@ > for @code{clz} or @code{ctz} with a zero operand. > A result of @code{0} indicates the value is undefined. > If the value is defined for only the RTL expression, the macro should > -evaluate to @code{1}; if the value applies also to the corresponding optab > +evaluate to @code{2}; if the value applies also to the corresponding optab > entry (which is normally the case if it expands directly into > -the corresponding RTL), then the macro should evaluate to @code{2}. > +the corresponding RTL), then the macro should evaluate to @code{1}. > In the cases where the value is defined, @var{value} should be set to > this value. > > plus tweaking this patch to reject 2 as well. > >> What about cost considerations? We only seem to have the general >> "branches are expensive" metric - but ctz/clz may be prohibitely expensive >> themselves, no? > > Yeah, that's a general problem with this kind of tricks. In general > however clz/ctz is getting less and less expensive, so I don't think > it is worrisome (at least at the beginning of stage 1). We can add > rtx_costs checks later. > > Among architectures I know, only i386 has an expensive bsf/bsr but > it also has sete/setne which GCC will use instead of this trick. > > Looking at rtx_costs, nothing seems to mark clz/ctz as prohibitively > expensive (Xtensa does, but only in the case when the optab handler > will not exist). I realize though that this is not a particularly > good statistic, since the compiler would not generate them out of > its hat until now. > > Paolo
Il 03/05/2012 11:57, Richard Guenther ha scritto: > Hmm, I'll let target maintainers have a look at this - it's non-obvious > for arm (the only case I looked at) at least. My guess is that the > difference was to allow __builtin_ctz to be "optimized" and CTZ > to be always well-defined (which incidentially it is not ...!?). Hm. On ARM: - clzsi4 is defined to expand to clz:SI, so it is the textbook example of returning 2 from CLZ_DEFINED_VALUE_AT_ZERO - ctzsi4 expands to bitrev + clz:SI, so it will return 32 too. ctz:SI will never appear at all in the RTX stream, so you can define its value at zero to be whatever you want. Hence the correct valid of CTZ_DEFINED_VALUE_AT_ZERO is also 2. Similar reasoning applies basically everywhere. CTZ_DEFINED_VALUE_AT_ZERO cannot be 2 when a define_expand produces different sequences with different values at zero, depending on optimize_size or the phase of the moon (but not depending on TARGET_*. One would be represented as ctz:SI in the RTL, the other not (perhaps something involving an if_then_else or an unspec). In this case, indeed expand must not consider the value at zero to be defined. However, no such define_expand exists for clz/ctz in config/*/*.md. Paolo
> On Fri, Apr 27, 2012 at 2:43 PM, Paolo Bonzini<bonzini@gnu.org> wrote: >> Interesting, first time I hear about this... >> >> ... except that no target sets the macros to 2, and all of them could >> (as far as I could see). Looks like the code trumps the documentation; >> how does this look? "No target sets to 2"? You mean like mips? You forgot to look at the corresponding CLZ_DEFINED_VALUE_AT_ZERO. I'll admit I hadn't been paying attention when sandra added the patch in question though... >> * optabs.c (expand_ffs): Check CTZ_DEFINED_VALUE_AT_ZERO >> against 1. >> * doc/tm.texi (Misc): Invert meaning of 1 and 2 for >> CLZ_DEFINED_VALUE_AT_ZERO and CTZ_DEFINED_VALUE_AT_ZERO. So... no, I don't think this is a good idea. r~
On 05/03/2012 03:23 AM, Paolo Bonzini wrote: > - ctzsi4 expands to bitrev + clz:SI, so it will return 32 too. ctz:SI > will never appear at all in the RTX stream, so you can define its value > at zero to be whatever you want. Hence the correct valid of > CTZ_DEFINED_VALUE_AT_ZERO is also 2. That's for armv6. For armv5 we don't have bitrev, so we use 31-clz, which yields a value at 0 of -1. Here I think it's best to mirror the rs6000 port in that CLZ_DVAZ == 2, but CTZ_DVAZ == 1. > Similar reasoning applies basically everywhere. One really needs to examine these closely, one by one. r~
Il 03/05/2012 17:10, Richard Henderson ha scritto: >>> >>> >>> ... except that no target sets the macros to 2, and all of them could >>> (as far as I could see). Looks like the code trumps the documentation; >>> how does this look? > > "No target sets to 2"? You mean like mips? You forgot to look at the > corresponding CLZ_DEFINED_VALUE_AT_ZERO. Yes, I missed that. > I'll admit I hadn't been paying attention when sandra added the patch > in question though... > >>> * optabs.c (expand_ffs): Check CTZ_DEFINED_VALUE_AT_ZERO >>> against 1. >>> * doc/tm.texi (Misc): Invert meaning of 1 and 2 for >>> CLZ_DEFINED_VALUE_AT_ZERO and CTZ_DEFINED_VALUE_AT_ZERO. > > So... no, I don't think this is a good idea. Yeah, not as I posted... of course config/mips/mips.h needs change, but do you disagree that all targets _do_ have the same value at zero for the optab as they have for RTL? Paolo
On 05/03/2012 08:23 AM, Paolo Bonzini wrote: > Yeah, not as I posted... of course config/mips/mips.h needs change, but > do you disagree that all targets_do_ have the same value at zero for > the optab as they have for RTL? I don't agree without a port-by-port review, no. If, as you suggest, we get proper agreement between the gimple-level (aka optab level) definedness and the rtl-level definedness, then we should drop the tri-state and go back to simple boolean. I'm uncertain if there's any point in settings such as for rs6000, where we do know that ctz(0) = -1. Should we only define the value for clz(0) = wordsize and let the rtl optimizers fold ctz(0) away? r~
On Fri, 27 Apr 2012, Paolo Bonzini wrote: > > What about cost considerations? We only seem to have the general > > "branches are expensive" metric - but ctz/clz may be prohibitely expensive > > themselves, no? > > Yeah, that's a general problem with this kind of tricks. In general > however clz/ctz is getting less and less expensive, so I don't think > it is worrisome (at least at the beginning of stage 1). We can add > rtx_costs checks later. > > Among architectures I know, only i386 has an expensive bsf/bsr but > it also has sete/setne which GCC will use instead of this trick. > > Looking at rtx_costs, nothing seems to mark clz/ctz as prohibitively > expensive (Xtensa does, but only in the case when the optab handler > will not exist). I realize though that this is not a particularly > good statistic, since the compiler would not generate them out of > its hat until now. For the record: MIPS processors that implement CLZ/CLO (for some reason CTZ/CTO haven't been added to the architecture, but these operations can be cheaply transformed into CLZ/CLO) generally have a dedicated unit that causes no pipeline stall for these instructions even in the simplest pipeline designs like the M4K -- IOW they are issued at the usual one instruction per pipeline clock rate. Of course all MIPS processors have SLT too, so perhaps they won't benefit from your change either. Maciej
On Sat, May 5, 2012 at 11:52 PM, Maciej W. Rozycki <macro@linux-mips.org> wrote: > For the record: MIPS processors that implement CLZ/CLO (for some reason > CTZ/CTO haven't been added to the architecture, but these operations can > be cheaply transformed into CLZ/CLO) generally have a dedicated unit that > causes no pipeline stall for these instructions even in the simplest > pipeline designs like the M4K -- IOW they are issued at the usual one > instruction per pipeline clock rate. Even on Octeon this is true. Though Octeon has seq/sneq too so it does not matter in the end. Note I originally was the one who proposed this optimization for PowerPC even before I saw what XLC did. See PR 10588 (which I filed 9 years ago) and it seems we are about to fix it soon. Thanks, Andrew Pinski
On Sun, 6 May 2012, Andrew Pinski wrote: > > For the record: MIPS processors that implement CLZ/CLO (for some reason > > CTZ/CTO haven't been added to the architecture, but these operations can > > be cheaply transformed into CLZ/CLO) generally have a dedicated unit that > > causes no pipeline stall for these instructions even in the simplest > > pipeline designs like the M4K -- IOW they are issued at the usual one > > instruction per pipeline clock rate. > > Even on Octeon this is true. Though Octeon has seq/sneq too so it > does not matter in the end. Does Octeon's pipeline qualify as simple? For some reason I've thought it is a high-performance core. The M4K is one of the smallest/simplest MIPS chips ever built. And actually all MIPS processors (back to 1985's MIPS I ISA) support two-instruction set-if-equal and set-if-not-equal sequences: xor rd, rt, rs sltiu rd, rd, 1 and: xor rd, rt, rs sltu rd, zero, rd respectively, that may still be more beneficial than any possible alternatives, especially ones involving branches. > Note I originally was the one who proposed this optimization for > PowerPC even before I saw what XLC did. See PR 10588 (which I filed 9 > years ago) and it seems we are about to fix it soon. For that -- set-if-zero and set-if-non-zero -- you can use the instructions as above (that are supported by all MIPS processors): sltiu rd, rs, 1 and sltu rd, zero, rs However GCC doesn't seem smart enough to use them well with your example. I'd expect something like: sltiu $4, $4, 1 sltiu $2, $5, 1 jr $31 or $2, $4, $2 however I get: beq $4, $0, .L3 nop jr $31 sltiu $2, $5, 1 .L3: jr $31 li $2, 1 which is never faster and obviously not smaller either. And there is really no need to avoid the second comparison as per logical OR rules here -- it's all in registers. This pessimisation is avoided for MIPS IV and more recent processors that have move-if-non-zero however (and the second comparison is always evaluated): sltiu $5, $5, 1 li $2, 1 jr $31 movn $2, $5, $4 Any chance to get it better with the fix you've mentioned? Maciej
On Sat, May 12, 2012 at 11:36 AM, Maciej W. Rozycki <macro@linux-mips.org> wrote: > On Sun, 6 May 2012, Andrew Pinski wrote: > >> > For the record: MIPS processors that implement CLZ/CLO (for some reason >> > CTZ/CTO haven't been added to the architecture, but these operations can >> > be cheaply transformed into CLZ/CLO) generally have a dedicated unit that >> > causes no pipeline stall for these instructions even in the simplest >> > pipeline designs like the M4K -- IOW they are issued at the usual one >> > instruction per pipeline clock rate. >> >> Even on Octeon this is true. Though Octeon has seq/sneq too so it >> does not matter in the end. > > Does Octeon's pipeline qualify as simple? For some reason I've thought > it is a high-performance core. The M4K is one of the smallest/simplest > MIPS chips ever built. Yes the octeon's pipeline qualifies as simple. It is still an in-order pipeline with few stages. The high-performance of the core is just the clock rate rather than the pipeline. And the number of cores on one chip is the other thing which makes it high performance. > > And actually all MIPS processors (back to 1985's MIPS I ISA) support > two-instruction set-if-equal and set-if-not-equal sequences: > > xor rd, rt, rs > sltiu rd, rd, 1 > > and: > > xor rd, rt, rs > sltu rd, zero, rd > > respectively, that may still be more beneficial than any possible > alternatives, especially ones involving branches. > >> Note I originally was the one who proposed this optimization for >> PowerPC even before I saw what XLC did. See PR 10588 (which I filed 9 >> years ago) and it seems we are about to fix it soon. > > For that -- set-if-zero and set-if-non-zero -- you can use the > instructions as above (that are supported by all MIPS processors): > > sltiu rd, rs, 1 > > and > > sltu rd, zero, rs > > However GCC doesn't seem smart enough to use them well with your example. > I'd expect something like: > > sltiu $4, $4, 1 > sltiu $2, $5, 1 > jr $31 > or $2, $4, $2 > > however I get: > > beq $4, $0, .L3 > nop > jr $31 > sltiu $2, $5, 1 > .L3: > jr $31 > li $2, 1 > > which is never faster and obviously not smaller either. And there is > really no need to avoid the second comparison as per logical OR rules here > -- it's all in registers. I have a few patches already in my queue to submit upstream to improve the above case for MIPS. > > This pessimisation is avoided for MIPS IV and more recent processors that > have move-if-non-zero however (and the second comparison is always > evaluated): > > sltiu $5, $5, 1 > li $2, 1 > jr $31 > movn $2, $5, $4 > > Any chance to get it better with the fix you've mentioned? The above is worse than using the or for at least the octeon as movn is 3 cycles while or is only 1 cycle. As I mentioned, I have a few patches already in my queue which improves the code for MIPS (and other targets too) but I have not got around to submitting them upstream because I have been busy working on more patches. Thanks, Andrew Pinski
Index: optabs.c =================================================================== --- optabs.c (revisione 186903) +++ optabs.c (copia locale) @@ -2764,7 +2764,7 @@ expand_ffs if (!temp) goto fail; - defined_at_zero = (CTZ_DEFINED_VALUE_AT_ZERO (mode, val) == 2); + defined_at_zero = (CTZ_DEFINED_VALUE_AT_ZERO (mode, val) == 1); } else if (optab_handler (clz_optab, mode) != CODE_FOR_nothing) { @@ -2773,7 +2773,7 @@ expand_ffs if (!temp) goto fail; - if (CLZ_DEFINED_VALUE_AT_ZERO (mode, val) == 2) + if (CLZ_DEFINED_VALUE_AT_ZERO (mode, val) == 1) { defined_at_zero = true; val = (GET_MODE_PRECISION (mode) - 1) - val; Index: doc/tm.texi =================================================================== --- doc/tm.texi (revisione 186903) +++ doc/tm.texi (copia locale) @@ -10640,9 +10640,9 @@ for @code{clz} or @code{ctz} with a zero operand. A result of @code{0} indicates the value is undefined. If the value is defined for only the RTL expression, the macro should -evaluate to @code{1}; if the value applies also to the corresponding optab +evaluate to @code{2}; if the value applies also to the corresponding optab entry (which is normally the case if it expands directly into -the corresponding RTL), then the macro should evaluate to @code{2}. +the corresponding RTL), then the macro should evaluate to @code{1}. In the cases where the value is defined, @var{value} should be set to this value.