diff mbox

[3/3,RTL,ifcvt] PR middle-end/37780: Conditional expression with __builtin_clz() should be optimized out

Message ID CAKdteOabZUzYqGDWCKuN_pU=pYTZa5eDcgOv6PReiEA4SHJHsg@mail.gmail.com
State New
Headers show

Commit Message

Christophe Lyon June 10, 2016, 1:39 p.m. UTC
On 10 June 2016 at 10:38, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>
> On 09/06/16 13:14, Christophe Lyon wrote:
>>
>> On 8 June 2016 at 18:40, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
>> wrote:
>>>
>>> On 07/06/16 20:34, Christophe Lyon wrote:
>>>>
>>>> On 26 May 2016 at 11:53, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> In this PR we want to optimise:
>>>>> int foo (int i)
>>>>> {
>>>>>     return (i == 0) ? N : __builtin_clz (i);
>>>>> }
>>>>>
>>>>> on targets where CLZ is defined at zero to the constant 'N'.
>>>>> This is determined at the RTL level through the
>>>>> CLZ_DEFINED_VALUE_AT_ZERO
>>>>> macro.
>>>>> The obvious place to implement this would be in combine through
>>>>> simplify-rtx
>>>>> where we'd
>>>>> recognise an IF_THEN_ELSE of the form:
>>>>> (set (reg:SI r1)
>>>>>        (if_then_else:SI (ne (reg:SI r2)
>>>>>                             (const_int 0 [0]))
>>>>>          (clz:SI (reg:SI r2))
>>>>>          (const_int 32)))
>>>>>
>>>>> and if CLZ_DEFINED_VALUE_AT_ZERO is defined to 32 for SImode we'd
>>>>> simplify
>>>>> it into
>>>>> just (clz:SI (reg:SI r2)).
>>>>> However, I found this doesn't quite happen for a couple of reasons:
>>>>> 1) This depends on ifcvt or some other pass to have created a
>>>>> conditional
>>>>> move of the
>>>>> two branches that provide the IF_THEN_ELSE to propagate the const_int
>>>>> and
>>>>> clz operation into.
>>>>>
>>>>> 2) Combine will refuse to propagate r2 from the above example into both
>>>>> the
>>>>> condition and the
>>>>> CLZ at the same time, so the most we see is:
>>>>> (set (reg:SI r1)
>>>>>        (if_then_else:SI (ne (reg:CC cc)
>>>>>               (const_int 0))
>>>>>          (clz:SI (reg:SI r2))
>>>>>          (const_int 32)))
>>>>>
>>>>> which is not enough information to perform the simplification.
>>>>>
>>>>> This patch implements the optimisation in ce1 using the noce ifcvt
>>>>> framework.
>>>>> During ifcvt noce_process_if_block can see that we're trying to
>>>>> optimise
>>>>> something
>>>>> of the form (x == 0 ? const_int : CLZ (x)) and so it has visibility of
>>>>> all
>>>>> the information
>>>>> needed to perform the transformation.
>>>>>
>>>>> The transformation is performed by adding a new noce_try* function that
>>>>> tries to put the
>>>>> condition and the 'then' and 'else' arms into an IF_THEN_ELSE rtx and
>>>>> try
>>>>> to
>>>>> simplify that
>>>>> using the simplify-rtx machinery. That way, we can implement the
>>>>> simplification logic in
>>>>> simplify-rtx.c where it belongs.
>>>>>
>>>>> A similar transformation for CTZ is implemented as well.
>>>>> So for code:
>>>>> int foo (int i)
>>>>> {
>>>>>     return (i == 0) ? 32 : __builtin_clz (i);
>>>>> }
>>>>>
>>>>> On aarch64 we now emit:
>>>>> foo:
>>>>>           clz     w0, w0
>>>>>           ret
>>>>>
>>>>> instead of:
>>>>> foo:
>>>>>           mov     w1, 32
>>>>>           clz     w2, w0
>>>>>           cmp     w0, 0
>>>>>           csel    w0, w2, w1, ne
>>>>>           ret
>>>>>
>>>>> and for arm similarly we generate:
>>>>> foo:
>>>>>           clz     r0, r0
>>>>>           bx      lr
>>>>>
>>>>> instead of:
>>>>> foo:
>>>>>           cmp     r0, #0
>>>>>           clzne   r0, r0
>>>>>           moveq   r0, #32
>>>>>           bx      lr
>>>>>
>>>>>
>>>>> and for x86_64 with -O2 -mlzcnt we generate:
>>>>> foo:
>>>>>           xorl    %eax, %eax
>>>>>           lzcntl  %edi, %eax
>>>>>           ret
>>>>>
>>>>> instead of:
>>>>> foo:
>>>>>           xorl    %eax, %eax
>>>>>           movl    $32, %edx
>>>>>           lzcntl  %edi, %eax
>>>>>           testl   %edi, %edi
>>>>>           cmove   %edx, %eax
>>>>>           ret
>>>>>
>>>>>
>>>>> I tried getting this to work on other targets as well, but encountered
>>>>> difficulties.
>>>>> For example on powerpc the two arms of the condition seen during ifcvt
>>>>> are:
>>>>>
>>>>> (insn 4 22 11 4 (set (reg:DI 156 [ <retval> ])
>>>>>           (const_int 32 [0x20])) clz.c:3 434 {*movdi_internal64}
>>>>>        (nil))
>>>>> and
>>>>> (insn 10 9 23 3 (set (subreg/s/u:SI (reg:DI 156 [ <retval> ]) 0)
>>>>>           (clz:SI (subreg/u:SI (reg/v:DI 157 [ i ]) 0))) clz.c:3 132
>>>>> {clzsi2}
>>>>>        (expr_list:REG_DEAD (reg/v:DI 157 [ i ])
>>>>>           (nil)))
>>>>>
>>>>> So the setup code in noce_process_if_block sees that the set
>>>>> destination
>>>>> is
>>>>> not the same
>>>>> ((reg:DI 156 [ <retval> ]) and (subreg/s/u:SI (reg:DI 156 [ <retval> ])
>>>>> 0))
>>>>> so it bails out on the rtx_interchangeable_p (x, SET_DEST (set_b))
>>>>> check.
>>>>> I suppose that's a consequence of how SImode operations are represented
>>>>> in
>>>>> early RTL
>>>>> on powerpc, I don't know what to do there. Perhaps that part of ivcvt
>>>>> can
>>>>> be
>>>>> taught to handle
>>>>> destinations that are subregs of one another, but that would be a
>>>>> separate
>>>>> patch.
>>>>>
>>>>> Anyway, is this patch ok for trunk?
>>>>>
>>>>> Bootstrapped and tested on arm-none-linux-gnueabihf,
>>>>> aarch64-none-linux-gnu,
>>>>> x86_64-pc-linux-gnu.
>>>>>
>>>>> Thanks,
>>>>> Kyrill
>>>>>
>>>>> 2016-05-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>>
>>>>>       PR middle-end/37780
>>>>>       * ifcvt.c (noce_try_ifelse_collapse): New function.
>>>>>       Declare prototype.
>>>>>       (noce_process_if_block): Call noce_try_ifelse_collapse.
>>>>>       * simplify-rtx.c (simplify_cond_clz_ctz): New function.
>>>>>       (simplify_ternary_operation): Use the above to simplify
>>>>>       conditional CLZ/CTZ expressions.
>>>>>
>>>>> 2016-05-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>>
>>>>>       PR middle-end/37780
>>>>>       * gcc.c-torture/execute/pr37780.c: New test.
>>>>>       * gcc.target/aarch64/pr37780_1.c: Likewise.
>>>>>       * gcc.target/arm/pr37780_1.c: Likewise.
>>>>
>>>> Hi Kyrylo,
>>>
>>>
>>> Hi Christophe,
>>>
>>>> I've noticed that gcc.target/arm/pr37780_1.c fails on
>>>> arm if arch < v6.
>>>> I first tried to fix the effective-target guard (IIRC, the doc
>>>> says clz is available starting with v5t), but that isn't sufficient.
>>>>
>>>> When compiling for armv5-t, the scan-assembler directives
>>>> fail. It seems to work with v6t2, so I am wondering whether
>>>> it's just a matter of increasing the effective-target arch version,
>>>> or if you really intended to make the test pass on these old
>>>> architectures?
>>>
>>>
>>> I've dug into it a bit.
>>> I think the problem is that CLZ is available with ARMv5T but only in arm
>>> mode.
>>> In Thumb mode it's only available with ARMv6T2.
>>> So if your armv5t gcc compiles for Thumb by default you'll get the
>>> failure.
>>
>> Actually this gcc is configured with default cpu & mode,
>> target=arm-none-eabi.
>>
>> So I think it's arm7tdmi, arm mode.
>>
>>> I think just bumping the effective to armv6t2 here is appropriate as I
>>> intended
>>> the test to just check the ifcvt transformation when the instruction is
>>> available
>>> rather than test that the CLZ instruction is available at one
>>> architecture
>>> level
>>> or another.
>>>
>>> A patch to bump the effective target check is pre-approved if you want to
>>> do
>>> it.
>>> Otherwise I'll get to it in a few days.
>>>
>> I was a about to commit the arm_arch_v5_ok -> arm_arch_v6t2_ok, but
>> this would not be sufficient without adding
>> dg-add-options arm_arch_v6t2
>> but then if gcc is configured with a higher default architecture, we'll
>> end
>> up testing if works for v6t2 only.
>
>
> I'm okay with adding "dg-add-options arm_arch_v6t2".
> We're not testing the availability of the instruction here, just the ifcvt
> transformation when it *is* available, so I think we should just add
> whatever
> option guarantees that instruction is available.
>
> Kyrill
>

OK, I've committed the attached patch.

2016-06-10  Christophe Lyon  <christophe.lyon@linaro.org>

        * gcc.target/arm/pr37780_1.c: Use arm_arch_v6t2 effective target
        and options.

>
>>> Thanks for spotting this,
>>> Kyrill
>>>>
>>>> Thanks,
>>>>
>>>> Christophe.
>>>
>>>
>

Comments

Kyrill Tkachov June 10, 2016, 1:41 p.m. UTC | #1
On 10/06/16 14:39, Christophe Lyon wrote:
> On 10 June 2016 at 10:38, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>> On 09/06/16 13:14, Christophe Lyon wrote:
>>> On 8 June 2016 at 18:40, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
>>> wrote:
>>>> On 07/06/16 20:34, Christophe Lyon wrote:
>>>>> On 26 May 2016 at 11:53, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
>>>>> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> In this PR we want to optimise:
>>>>>> int foo (int i)
>>>>>> {
>>>>>>      return (i == 0) ? N : __builtin_clz (i);
>>>>>> }
>>>>>>
>>>>>> on targets where CLZ is defined at zero to the constant 'N'.
>>>>>> This is determined at the RTL level through the
>>>>>> CLZ_DEFINED_VALUE_AT_ZERO
>>>>>> macro.
>>>>>> The obvious place to implement this would be in combine through
>>>>>> simplify-rtx
>>>>>> where we'd
>>>>>> recognise an IF_THEN_ELSE of the form:
>>>>>> (set (reg:SI r1)
>>>>>>         (if_then_else:SI (ne (reg:SI r2)
>>>>>>                              (const_int 0 [0]))
>>>>>>           (clz:SI (reg:SI r2))
>>>>>>           (const_int 32)))
>>>>>>
>>>>>> and if CLZ_DEFINED_VALUE_AT_ZERO is defined to 32 for SImode we'd
>>>>>> simplify
>>>>>> it into
>>>>>> just (clz:SI (reg:SI r2)).
>>>>>> However, I found this doesn't quite happen for a couple of reasons:
>>>>>> 1) This depends on ifcvt or some other pass to have created a
>>>>>> conditional
>>>>>> move of the
>>>>>> two branches that provide the IF_THEN_ELSE to propagate the const_int
>>>>>> and
>>>>>> clz operation into.
>>>>>>
>>>>>> 2) Combine will refuse to propagate r2 from the above example into both
>>>>>> the
>>>>>> condition and the
>>>>>> CLZ at the same time, so the most we see is:
>>>>>> (set (reg:SI r1)
>>>>>>         (if_then_else:SI (ne (reg:CC cc)
>>>>>>                (const_int 0))
>>>>>>           (clz:SI (reg:SI r2))
>>>>>>           (const_int 32)))
>>>>>>
>>>>>> which is not enough information to perform the simplification.
>>>>>>
>>>>>> This patch implements the optimisation in ce1 using the noce ifcvt
>>>>>> framework.
>>>>>> During ifcvt noce_process_if_block can see that we're trying to
>>>>>> optimise
>>>>>> something
>>>>>> of the form (x == 0 ? const_int : CLZ (x)) and so it has visibility of
>>>>>> all
>>>>>> the information
>>>>>> needed to perform the transformation.
>>>>>>
>>>>>> The transformation is performed by adding a new noce_try* function that
>>>>>> tries to put the
>>>>>> condition and the 'then' and 'else' arms into an IF_THEN_ELSE rtx and
>>>>>> try
>>>>>> to
>>>>>> simplify that
>>>>>> using the simplify-rtx machinery. That way, we can implement the
>>>>>> simplification logic in
>>>>>> simplify-rtx.c where it belongs.
>>>>>>
>>>>>> A similar transformation for CTZ is implemented as well.
>>>>>> So for code:
>>>>>> int foo (int i)
>>>>>> {
>>>>>>      return (i == 0) ? 32 : __builtin_clz (i);
>>>>>> }
>>>>>>
>>>>>> On aarch64 we now emit:
>>>>>> foo:
>>>>>>            clz     w0, w0
>>>>>>            ret
>>>>>>
>>>>>> instead of:
>>>>>> foo:
>>>>>>            mov     w1, 32
>>>>>>            clz     w2, w0
>>>>>>            cmp     w0, 0
>>>>>>            csel    w0, w2, w1, ne
>>>>>>            ret
>>>>>>
>>>>>> and for arm similarly we generate:
>>>>>> foo:
>>>>>>            clz     r0, r0
>>>>>>            bx      lr
>>>>>>
>>>>>> instead of:
>>>>>> foo:
>>>>>>            cmp     r0, #0
>>>>>>            clzne   r0, r0
>>>>>>            moveq   r0, #32
>>>>>>            bx      lr
>>>>>>
>>>>>>
>>>>>> and for x86_64 with -O2 -mlzcnt we generate:
>>>>>> foo:
>>>>>>            xorl    %eax, %eax
>>>>>>            lzcntl  %edi, %eax
>>>>>>            ret
>>>>>>
>>>>>> instead of:
>>>>>> foo:
>>>>>>            xorl    %eax, %eax
>>>>>>            movl    $32, %edx
>>>>>>            lzcntl  %edi, %eax
>>>>>>            testl   %edi, %edi
>>>>>>            cmove   %edx, %eax
>>>>>>            ret
>>>>>>
>>>>>>
>>>>>> I tried getting this to work on other targets as well, but encountered
>>>>>> difficulties.
>>>>>> For example on powerpc the two arms of the condition seen during ifcvt
>>>>>> are:
>>>>>>
>>>>>> (insn 4 22 11 4 (set (reg:DI 156 [ <retval> ])
>>>>>>            (const_int 32 [0x20])) clz.c:3 434 {*movdi_internal64}
>>>>>>         (nil))
>>>>>> and
>>>>>> (insn 10 9 23 3 (set (subreg/s/u:SI (reg:DI 156 [ <retval> ]) 0)
>>>>>>            (clz:SI (subreg/u:SI (reg/v:DI 157 [ i ]) 0))) clz.c:3 132
>>>>>> {clzsi2}
>>>>>>         (expr_list:REG_DEAD (reg/v:DI 157 [ i ])
>>>>>>            (nil)))
>>>>>>
>>>>>> So the setup code in noce_process_if_block sees that the set
>>>>>> destination
>>>>>> is
>>>>>> not the same
>>>>>> ((reg:DI 156 [ <retval> ]) and (subreg/s/u:SI (reg:DI 156 [ <retval> ])
>>>>>> 0))
>>>>>> so it bails out on the rtx_interchangeable_p (x, SET_DEST (set_b))
>>>>>> check.
>>>>>> I suppose that's a consequence of how SImode operations are represented
>>>>>> in
>>>>>> early RTL
>>>>>> on powerpc, I don't know what to do there. Perhaps that part of ivcvt
>>>>>> can
>>>>>> be
>>>>>> taught to handle
>>>>>> destinations that are subregs of one another, but that would be a
>>>>>> separate
>>>>>> patch.
>>>>>>
>>>>>> Anyway, is this patch ok for trunk?
>>>>>>
>>>>>> Bootstrapped and tested on arm-none-linux-gnueabihf,
>>>>>> aarch64-none-linux-gnu,
>>>>>> x86_64-pc-linux-gnu.
>>>>>>
>>>>>> Thanks,
>>>>>> Kyrill
>>>>>>
>>>>>> 2016-05-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>>>
>>>>>>        PR middle-end/37780
>>>>>>        * ifcvt.c (noce_try_ifelse_collapse): New function.
>>>>>>        Declare prototype.
>>>>>>        (noce_process_if_block): Call noce_try_ifelse_collapse.
>>>>>>        * simplify-rtx.c (simplify_cond_clz_ctz): New function.
>>>>>>        (simplify_ternary_operation): Use the above to simplify
>>>>>>        conditional CLZ/CTZ expressions.
>>>>>>
>>>>>> 2016-05-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>>>
>>>>>>        PR middle-end/37780
>>>>>>        * gcc.c-torture/execute/pr37780.c: New test.
>>>>>>        * gcc.target/aarch64/pr37780_1.c: Likewise.
>>>>>>        * gcc.target/arm/pr37780_1.c: Likewise.
>>>>> Hi Kyrylo,
>>>>
>>>> Hi Christophe,
>>>>
>>>>> I've noticed that gcc.target/arm/pr37780_1.c fails on
>>>>> arm if arch < v6.
>>>>> I first tried to fix the effective-target guard (IIRC, the doc
>>>>> says clz is available starting with v5t), but that isn't sufficient.
>>>>>
>>>>> When compiling for armv5-t, the scan-assembler directives
>>>>> fail. It seems to work with v6t2, so I am wondering whether
>>>>> it's just a matter of increasing the effective-target arch version,
>>>>> or if you really intended to make the test pass on these old
>>>>> architectures?
>>>>
>>>> I've dug into it a bit.
>>>> I think the problem is that CLZ is available with ARMv5T but only in arm
>>>> mode.
>>>> In Thumb mode it's only available with ARMv6T2.
>>>> So if your armv5t gcc compiles for Thumb by default you'll get the
>>>> failure.
>>> Actually this gcc is configured with default cpu & mode,
>>> target=arm-none-eabi.
>>>
>>> So I think it's arm7tdmi, arm mode.
>>>
>>>> I think just bumping the effective to armv6t2 here is appropriate as I
>>>> intended
>>>> the test to just check the ifcvt transformation when the instruction is
>>>> available
>>>> rather than test that the CLZ instruction is available at one
>>>> architecture
>>>> level
>>>> or another.
>>>>
>>>> A patch to bump the effective target check is pre-approved if you want to
>>>> do
>>>> it.
>>>> Otherwise I'll get to it in a few days.
>>>>
>>> I was a about to commit the arm_arch_v5_ok -> arm_arch_v6t2_ok, but
>>> this would not be sufficient without adding
>>> dg-add-options arm_arch_v6t2
>>> but then if gcc is configured with a higher default architecture, we'll
>>> end
>>> up testing if works for v6t2 only.
>>
>> I'm okay with adding "dg-add-options arm_arch_v6t2".
>> We're not testing the availability of the instruction here, just the ifcvt
>> transformation when it *is* available, so I think we should just add
>> whatever
>> option guarantees that instruction is available.
>>
>> Kyrill
>>
> OK, I've committed the attached patch.
>
> 2016-06-10  Christophe Lyon  <christophe.lyon@linaro.org>
>
>          * gcc.target/arm/pr37780_1.c: Use arm_arch_v6t2 effective target
>          and options.
>

Thanks for taking care of this.

Kyrill

>>>> Thanks for spotting this,
>>>> Kyrill
>>>>> Thanks,
>>>>>
>>>>> Christophe.
>>>>
diff mbox

Patch

diff --git a/gcc/testsuite/gcc.target/arm/pr37780_1.c b/gcc/testsuite/gcc.target/arm/pr37780_1.c
index b954fe5..8e06920 100644
--- a/gcc/testsuite/gcc.target/arm/pr37780_1.c
+++ b/gcc/testsuite/gcc.target/arm/pr37780_1.c
@@ -2,8 +2,9 @@ 
    being defined at zero.  */
 
 /* { dg-do compile } */
-/* { dg-require-effective-target arm_arch_v5_ok } */
+/* { dg-require-effective-target arm_arch_v6t2_ok } */
 /* { dg-options "-O2" } */
+/* { dg-add-options arm_arch_v6t2 } */
 
 int
 fooctz (int i)