diff mbox

teach emit_store_flag to use clz/ctz

Message ID 4F9A948A.1040903@gnu.org
State New
Headers show

Commit Message

Paolo Bonzini April 27, 2012, 12:43 p.m. UTC
Il 27/04/2012 13:16, Richard Guenther ha scritto:
> In optabs.c we compare the CTZ_DEFINED_VALUE_AT_ZERO against two,
> is != 0 really what you want here?  The docs suggest to me
> that as you are using the optab below you should compare against two, too.

Interesting, first time I hear about this...

... except that no target sets the macros to 2, and all of them could
(as far as I could see).  Looks like the code trumps the documentation;
how does this look?

2012-04-27  Paolo Bonzini  <bonzini@gnu.org>

	* optabs.c (expand_ffs): Check CTZ_DEFINED_VALUE_AT_ZERO
	against 1.
	* doc/tm.texi (Misc): Invert meaning of 1 and 2 for
	CLZ_DEFINED_VALUE_AT_ZERO and CTZ_DEFINED_VALUE_AT_ZERO.

plus tweaking this patch to reject 2 as well.

> What about cost considerations?  We only seem to have the general
> "branches are expensive" metric - but ctz/clz may be prohibitely expensive
> themselves, no?

Yeah, that's a general problem with this kind of tricks.  In general
however clz/ctz is getting less and less expensive, so I don't think
it is worrisome (at least at the beginning of stage 1).  We can add
rtx_costs checks later.

Among architectures I know, only i386 has an expensive bsf/bsr but
it also has sete/setne which GCC will use instead of this trick.

Looking at rtx_costs, nothing seems to mark clz/ctz as prohibitively
expensive (Xtensa does, but only in the case when the optab handler
will not exist).  I realize though that this is not a particularly
good statistic, since the compiler would not generate them out of
its hat until now.

Paolo

Comments

Richard Biener May 3, 2012, 9:57 a.m. UTC | #1
On Fri, Apr 27, 2012 at 2:43 PM, Paolo Bonzini <bonzini@gnu.org> wrote:
> Il 27/04/2012 13:16, Richard Guenther ha scritto:
>> In optabs.c we compare the CTZ_DEFINED_VALUE_AT_ZERO against two,
>> is != 0 really what you want here?  The docs suggest to me
>> that as you are using the optab below you should compare against two, too.
>
> Interesting, first time I hear about this...
>
> ... except that no target sets the macros to 2, and all of them could
> (as far as I could see).  Looks like the code trumps the documentation;
> how does this look?

Hmm, I'll let target maintainers have a look at this - it's non-obvious
for arm (the only case I looked at) at least.  My guess is that the
difference was to allow __builtin_ctz to be "optimized" and CTZ
to be always well-defined (which incidentially it is not ...!?).  Hm.

Richard?


> 2012-04-27  Paolo Bonzini  <bonzini@gnu.org>
>
>        * optabs.c (expand_ffs): Check CTZ_DEFINED_VALUE_AT_ZERO
>        against 1.
>        * doc/tm.texi (Misc): Invert meaning of 1 and 2 for
>        CLZ_DEFINED_VALUE_AT_ZERO and CTZ_DEFINED_VALUE_AT_ZERO.
>
> Index: optabs.c
> ===================================================================
> --- optabs.c    (revisione 186903)
> +++ optabs.c    (copia locale)
> @@ -2764,7 +2764,7 @@ expand_ffs
>       if (!temp)
>        goto fail;
>
> -      defined_at_zero = (CTZ_DEFINED_VALUE_AT_ZERO (mode, val) == 2);
> +      defined_at_zero = (CTZ_DEFINED_VALUE_AT_ZERO (mode, val) == 1);
>     }
>   else if (optab_handler (clz_optab, mode) != CODE_FOR_nothing)
>     {
> @@ -2773,7 +2773,7 @@ expand_ffs
>       if (!temp)
>        goto fail;
>
> -      if (CLZ_DEFINED_VALUE_AT_ZERO (mode, val) == 2)
> +      if (CLZ_DEFINED_VALUE_AT_ZERO (mode, val) == 1)
>        {
>          defined_at_zero = true;
>          val = (GET_MODE_PRECISION (mode) - 1) - val;
> Index: doc/tm.texi
> ===================================================================
> --- doc/tm.texi (revisione 186903)
> +++ doc/tm.texi (copia locale)
> @@ -10640,9 +10640,9 @@
>  for @code{clz} or @code{ctz} with a zero operand.
>  A result of @code{0} indicates the value is undefined.
>  If the value is defined for only the RTL expression, the macro should
> -evaluate to @code{1}; if the value applies also to the corresponding optab
> +evaluate to @code{2}; if the value applies also to the corresponding optab
>  entry (which is normally the case if it expands directly into
> -the corresponding RTL), then the macro should evaluate to @code{2}.
> +the corresponding RTL), then the macro should evaluate to @code{1}.
>  In the cases where the value is defined, @var{value} should be set to
>  this value.
>
> plus tweaking this patch to reject 2 as well.
>
>> What about cost considerations?  We only seem to have the general
>> "branches are expensive" metric - but ctz/clz may be prohibitely expensive
>> themselves, no?
>
> Yeah, that's a general problem with this kind of tricks.  In general
> however clz/ctz is getting less and less expensive, so I don't think
> it is worrisome (at least at the beginning of stage 1).  We can add
> rtx_costs checks later.
>
> Among architectures I know, only i386 has an expensive bsf/bsr but
> it also has sete/setne which GCC will use instead of this trick.
>
> Looking at rtx_costs, nothing seems to mark clz/ctz as prohibitively
> expensive (Xtensa does, but only in the case when the optab handler
> will not exist).  I realize though that this is not a particularly
> good statistic, since the compiler would not generate them out of
> its hat until now.
>
> Paolo
Paolo Bonzini May 3, 2012, 10:23 a.m. UTC | #2
Il 03/05/2012 11:57, Richard Guenther ha scritto:
> Hmm, I'll let target maintainers have a look at this - it's non-obvious
> for arm (the only case I looked at) at least.  My guess is that the
> difference was to allow __builtin_ctz to be "optimized" and CTZ
> to be always well-defined (which incidentially it is not ...!?).  Hm.

On ARM:

- clzsi4 is defined to expand to clz:SI, so it is the textbook example
of returning 2 from CLZ_DEFINED_VALUE_AT_ZERO

- ctzsi4 expands to bitrev + clz:SI, so it will return 32 too. ctz:SI
will never appear at all in the RTX stream, so you can define its value
at zero to be whatever you want.  Hence the correct valid of
CTZ_DEFINED_VALUE_AT_ZERO is also 2.

Similar reasoning applies basically everywhere.

CTZ_DEFINED_VALUE_AT_ZERO cannot be 2 when a define_expand produces
different sequences with different values at zero, depending on
optimize_size or the phase of the moon (but not depending on TARGET_*.
One would be represented as ctz:SI in the RTL, the other not (perhaps
something involving an if_then_else or an unspec).  In this case, indeed
expand must not consider the value at zero to be defined.  However, no
such define_expand exists for clz/ctz in config/*/*.md.

Paolo
Richard Henderson May 3, 2012, 3:10 p.m. UTC | #3
> On Fri, Apr 27, 2012 at 2:43 PM, Paolo Bonzini<bonzini@gnu.org>  wrote:
>> Interesting, first time I hear about this...
>>
>> ... except that no target sets the macros to 2, and all of them could
>> (as far as I could see).  Looks like the code trumps the documentation;
>> how does this look?

"No target sets to 2"?  You mean like mips?  You forgot to look at the
corresponding CLZ_DEFINED_VALUE_AT_ZERO.

I'll admit I hadn't been paying attention when sandra added the patch
in question though...

>>         * optabs.c (expand_ffs): Check CTZ_DEFINED_VALUE_AT_ZERO
>>         against 1.
>>         * doc/tm.texi (Misc): Invert meaning of 1 and 2 for
>>         CLZ_DEFINED_VALUE_AT_ZERO and CTZ_DEFINED_VALUE_AT_ZERO.

So... no, I don't think this is a good idea.


r~
Richard Henderson May 3, 2012, 3:16 p.m. UTC | #4
On 05/03/2012 03:23 AM, Paolo Bonzini wrote:
> - ctzsi4 expands to bitrev + clz:SI, so it will return 32 too. ctz:SI
> will never appear at all in the RTX stream, so you can define its value
> at zero to be whatever you want.  Hence the correct valid of
> CTZ_DEFINED_VALUE_AT_ZERO is also 2.

That's for armv6.  For armv5 we don't have bitrev, so we use 31-clz,
which yields a value at 0 of -1.  Here I think it's best to mirror
the rs6000 port in that CLZ_DVAZ == 2, but CTZ_DVAZ == 1.

> Similar reasoning applies basically everywhere.

One really needs to examine these closely, one by one.


r~
Paolo Bonzini May 3, 2012, 3:23 p.m. UTC | #5
Il 03/05/2012 17:10, Richard Henderson ha scritto:
>>>
>>>
>>> ... except that no target sets the macros to 2, and all of them could
>>> (as far as I could see).  Looks like the code trumps the documentation;
>>> how does this look?
> 
> "No target sets to 2"?  You mean like mips?  You forgot to look at the
> corresponding CLZ_DEFINED_VALUE_AT_ZERO.

Yes, I missed that.

> I'll admit I hadn't been paying attention when sandra added the patch
> in question though...
> 
>>>         * optabs.c (expand_ffs): Check CTZ_DEFINED_VALUE_AT_ZERO
>>>         against 1.
>>>         * doc/tm.texi (Misc): Invert meaning of 1 and 2 for
>>>         CLZ_DEFINED_VALUE_AT_ZERO and CTZ_DEFINED_VALUE_AT_ZERO.
> 
> So... no, I don't think this is a good idea.

Yeah, not as I posted...  of course config/mips/mips.h needs change, but
do you disagree that all targets _do_ have the same value at zero for
the optab as they have for RTL?

Paolo
Richard Henderson May 3, 2012, 4:41 p.m. UTC | #6
On 05/03/2012 08:23 AM, Paolo Bonzini wrote:
> Yeah, not as I posted...  of course config/mips/mips.h needs change, but
> do you disagree that all targets_do_  have the same value at zero for
> the optab as they have for RTL?

I don't agree without a port-by-port review, no.

If, as you suggest, we get proper agreement between the gimple-level
(aka optab level) definedness and the rtl-level definedness, then we
should drop the tri-state and go back to simple boolean.

I'm uncertain if there's any point in settings such as for rs6000,
where we do know that ctz(0) = -1.  Should we only define the value
for clz(0) = wordsize and let the rtl optimizers fold ctz(0) away?



r~
Maciej W. Rozycki May 6, 2012, 6:52 a.m. UTC | #7
On Fri, 27 Apr 2012, Paolo Bonzini wrote:

> > What about cost considerations?  We only seem to have the general
> > "branches are expensive" metric - but ctz/clz may be prohibitely expensive
> > themselves, no?
> 
> Yeah, that's a general problem with this kind of tricks.  In general
> however clz/ctz is getting less and less expensive, so I don't think
> it is worrisome (at least at the beginning of stage 1).  We can add
> rtx_costs checks later.
> 
> Among architectures I know, only i386 has an expensive bsf/bsr but
> it also has sete/setne which GCC will use instead of this trick.
> 
> Looking at rtx_costs, nothing seems to mark clz/ctz as prohibitively
> expensive (Xtensa does, but only in the case when the optab handler
> will not exist).  I realize though that this is not a particularly
> good statistic, since the compiler would not generate them out of
> its hat until now.

 For the record: MIPS processors that implement CLZ/CLO (for some reason 
CTZ/CTO haven't been added to the architecture, but these operations can 
be cheaply transformed into CLZ/CLO) generally have a dedicated unit that 
causes no pipeline stall for these instructions even in the simplest 
pipeline designs like the M4K -- IOW they are issued at the usual one 
instruction per pipeline clock rate.

 Of course all MIPS processors have SLT too, so perhaps they won't benefit 
from your change either.

  Maciej
Andrew Pinski May 6, 2012, 7:43 a.m. UTC | #8
On Sat, May 5, 2012 at 11:52 PM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
>  For the record: MIPS processors that implement CLZ/CLO (for some reason
> CTZ/CTO haven't been added to the architecture, but these operations can
> be cheaply transformed into CLZ/CLO) generally have a dedicated unit that
> causes no pipeline stall for these instructions even in the simplest
> pipeline designs like the M4K -- IOW they are issued at the usual one
> instruction per pipeline clock rate.

Even on Octeon this is true.  Though Octeon has seq/sneq too so it
does not matter in the end.

Note I originally was the one who proposed this optimization for
PowerPC even before I saw what XLC did.  See PR 10588 (which I filed 9
years ago)  and it seems we are about to fix it soon.

Thanks,
Andrew Pinski
Maciej W. Rozycki May 12, 2012, 6:36 p.m. UTC | #9
On Sun, 6 May 2012, Andrew Pinski wrote:

> >  For the record: MIPS processors that implement CLZ/CLO (for some reason
> > CTZ/CTO haven't been added to the architecture, but these operations can
> > be cheaply transformed into CLZ/CLO) generally have a dedicated unit that
> > causes no pipeline stall for these instructions even in the simplest
> > pipeline designs like the M4K -- IOW they are issued at the usual one
> > instruction per pipeline clock rate.
> 
> Even on Octeon this is true.  Though Octeon has seq/sneq too so it
> does not matter in the end.

 Does Octeon's pipeline qualify as simple?  For some reason I've thought 
it is a high-performance core.  The M4K is one of the smallest/simplest 
MIPS chips ever built.

 And actually all MIPS processors (back to 1985's MIPS I ISA) support 
two-instruction set-if-equal and set-if-not-equal sequences:

	xor	rd, rt, rs
	sltiu	rd, rd, 1

and:

	xor	rd, rt, rs
	sltu	rd, zero, rd

respectively, that may still be more beneficial than any possible 
alternatives, especially ones involving branches.

> Note I originally was the one who proposed this optimization for
> PowerPC even before I saw what XLC did.  See PR 10588 (which I filed 9
> years ago)  and it seems we are about to fix it soon.

 For that -- set-if-zero and set-if-non-zero -- you can use the 
instructions as above (that are supported by all MIPS processors):

	sltiu	rd, rs, 1

and

	sltu	rd, zero, rs

However GCC doesn't seem smart enough to use them well with your example.  
I'd expect something like:

	sltiu	$4, $4, 1
	sltiu	$2, $5, 1
	jr	$31
	 or	$2, $4, $2

however I get:

	beq	$4, $0, .L3
	 nop
	jr	$31
	 sltiu	$2, $5, 1
.L3:
	jr	$31
	 li	$2, 1

which is never faster and obviously not smaller either.  And there is 
really no need to avoid the second comparison as per logical OR rules here 
-- it's all in registers.

 This pessimisation is avoided for MIPS IV and more recent processors that 
have move-if-non-zero however (and the second comparison is always 
evaluated):

	sltiu	$5, $5, 1
	li	$2, 1
	jr	$31
	 movn	$2, $5, $4

Any chance to get it better with the fix you've mentioned?

  Maciej
Andrew Pinski May 12, 2012, 7:25 p.m. UTC | #10
On Sat, May 12, 2012 at 11:36 AM, Maciej W. Rozycki
<macro@linux-mips.org> wrote:
> On Sun, 6 May 2012, Andrew Pinski wrote:
>
>> >  For the record: MIPS processors that implement CLZ/CLO (for some reason
>> > CTZ/CTO haven't been added to the architecture, but these operations can
>> > be cheaply transformed into CLZ/CLO) generally have a dedicated unit that
>> > causes no pipeline stall for these instructions even in the simplest
>> > pipeline designs like the M4K -- IOW they are issued at the usual one
>> > instruction per pipeline clock rate.
>>
>> Even on Octeon this is true.  Though Octeon has seq/sneq too so it
>> does not matter in the end.
>
>  Does Octeon's pipeline qualify as simple?  For some reason I've thought
> it is a high-performance core.  The M4K is one of the smallest/simplest
> MIPS chips ever built.

Yes the octeon's pipeline qualifies as simple.  It is still an
in-order pipeline with few stages.  The high-performance of the core
is just the clock rate rather than the pipeline.  And the number of
cores on one chip is the other thing which makes it high performance.

>
>  And actually all MIPS processors (back to 1985's MIPS I ISA) support
> two-instruction set-if-equal and set-if-not-equal sequences:
>
>        xor     rd, rt, rs
>        sltiu   rd, rd, 1
>
> and:
>
>        xor     rd, rt, rs
>        sltu    rd, zero, rd
>
> respectively, that may still be more beneficial than any possible
> alternatives, especially ones involving branches.
>
>> Note I originally was the one who proposed this optimization for
>> PowerPC even before I saw what XLC did.  See PR 10588 (which I filed 9
>> years ago)  and it seems we are about to fix it soon.
>
>  For that -- set-if-zero and set-if-non-zero -- you can use the
> instructions as above (that are supported by all MIPS processors):
>
>        sltiu   rd, rs, 1
>
> and
>
>        sltu    rd, zero, rs
>
> However GCC doesn't seem smart enough to use them well with your example.
> I'd expect something like:
>
>        sltiu   $4, $4, 1
>        sltiu   $2, $5, 1
>        jr      $31
>         or     $2, $4, $2
>
> however I get:
>
>        beq     $4, $0, .L3
>         nop
>        jr      $31
>         sltiu  $2, $5, 1
> .L3:
>        jr      $31
>         li     $2, 1
>
> which is never faster and obviously not smaller either.  And there is
> really no need to avoid the second comparison as per logical OR rules here
> -- it's all in registers.


I have a few patches already in my queue to submit upstream to improve
the above case for MIPS.


>
>  This pessimisation is avoided for MIPS IV and more recent processors that
> have move-if-non-zero however (and the second comparison is always
> evaluated):
>
>        sltiu   $5, $5, 1
>        li      $2, 1
>        jr      $31
>         movn   $2, $5, $4
>
> Any chance to get it better with the fix you've mentioned?

The above is worse than using the or for at least the octeon as movn
is 3 cycles while or is only 1 cycle.  As I mentioned, I have a few
patches already in my queue which improves the code for MIPS (and
other targets too) but I have not got around to submitting them
upstream because I have been busy working on more patches.

Thanks,
Andrew Pinski
diff mbox

Patch

Index: optabs.c
===================================================================
--- optabs.c	(revisione 186903)
+++ optabs.c	(copia locale)
@@ -2764,7 +2764,7 @@  expand_ffs
       if (!temp)
 	goto fail;
 
-      defined_at_zero = (CTZ_DEFINED_VALUE_AT_ZERO (mode, val) == 2);
+      defined_at_zero = (CTZ_DEFINED_VALUE_AT_ZERO (mode, val) == 1);
     }
   else if (optab_handler (clz_optab, mode) != CODE_FOR_nothing)
     {
@@ -2773,7 +2773,7 @@  expand_ffs
       if (!temp)
 	goto fail;
 
-      if (CLZ_DEFINED_VALUE_AT_ZERO (mode, val) == 2)
+      if (CLZ_DEFINED_VALUE_AT_ZERO (mode, val) == 1)
 	{
 	  defined_at_zero = true;
 	  val = (GET_MODE_PRECISION (mode) - 1) - val;
Index: doc/tm.texi
===================================================================
--- doc/tm.texi	(revisione 186903)
+++ doc/tm.texi	(copia locale)
@@ -10640,9 +10640,9 @@ 
 for @code{clz} or @code{ctz} with a zero operand.
 A result of @code{0} indicates the value is undefined.
 If the value is defined for only the RTL expression, the macro should
-evaluate to @code{1}; if the value applies also to the corresponding optab
+evaluate to @code{2}; if the value applies also to the corresponding optab
 entry (which is normally the case if it expands directly into
-the corresponding RTL), then the macro should evaluate to @code{2}.
+the corresponding RTL), then the macro should evaluate to @code{1}.
 In the cases where the value is defined, @var{value} should be set to
 this value.