diff mbox

Simplify SUBREG with operand whose target bits are cleared by AND operation

Message ID CAGbRaL6ah65TTvRsA81w03QJWniUVmThaQYYBGfZ5-h4RHq5Ug@mail.gmail.com
State New
Headers show

Commit Message

Terry Guo April 14, 2014, 10:13 a.m. UTC
On Thu, Apr 3, 2014 at 10:11 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> I find the GCC function simplify_subreg fails to simplify rtx (subreg:SI
>> (and:DI (reg/v:DI 115 [ a ]) (const_int 4294967295 [0xffffffff])) 4) to zero
>> during the fwprop1 pass, considering the fact that the high 32-bit part of
>> (a & 0xFFFFFFFF) is zero. This leads to some unnecessary multiplications
>> for high 32-bit part of the result of AND operation. The attached patch is
>> trying to improve simplify_rtx to handle such case. Other target like x86
>> seems hasn't such issue because it generates different RTX to handle 64bit
>> multiplication on a 32bit machine.
>
> See http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00073.html for another try,
> which led to the simplification in combine.c:combine_simplify_rtx line 5448.
>
> Your variant is both more general, because it isn't restricted to the lowpart,
> and less general, because it is artificially restricted to AND.
>
> Some remarks:
>  - this needs to be restricted to non-paradoxical subregs,
>  - you need to test HWI_COMPUTABLE_MODE_P (innermode),
>  - you need to test !side_effects_p (op).
>
> I think we need to find a common ground between Jakub's patch and yours and
> put a single transformation in simplify_subreg.
>
> --
> Eric Botcazou

Thanks for your review. Even without Jakub's patch, the combine pass
can simplify such subreg to zero. But the live range of destination
register causes combine pass to undo all attempts. Here is detailed
explanation, please note that the AND operation (and:DI (reg/v:DI 115
[ a ]) (const_int 4294967295 [0xffffffff])) from fwprop1 pass is
turned into ZERO_EXTEND operation (zero_extend:DI (reg:SI 0 r0 [ a ]))
in combine pass. The variable a is function arguments and occupies
register r0 and r1. Right before try_combine function we have below
three instructions:

(insn 8 4 10 2 (set (reg:DI 118 [ D.4091 ])
        (zero_extend:DI (reg:SI 0 r0 [ a ]))) 000.c:4 170 {zero_extendsidi2}
     (expr_list:REG_DEAD (reg:SI 0 r0)
        (nil)))

(insn 10 8 13 2 (set (reg:SI 121)
        (mult:SI (reg/v:SI 116 [ b ])
            (subreg:SI (reg:DI 118 [ D.4091 ]) 4))) 000.c:4 41 {*arm_mulsi3_v6}
     (nil))

(insn 13 10 14 2 (set (reg:DI 117 [ D.4091 ])
        (mult:DI (zero_extend:DI (subreg:SI (reg:DI 118 [ D.4091 ]) 0))
            (zero_extend:DI (reg/v:SI 116 [ b ])))) 000.c:4 60 {*umulsidi3_v6}
     (expr_list:REG_DEAD (reg:DI 118 [ D.4091 ])
        (expr_list:REG_DEAD (reg/v:SI 116 [ b ])
            (nil))))

Without any changes, current gcc can simplify the insn 10 to

(insn 10 8 13 2 (set (reg:SI 121)
        (const_int 0 [0])) 000.c:4 41 {*arm_mulsi3_v6}
     (nil))

This is what we wanted. Unfortunately the live range of register
(reg:DI 118) is extended to insn 13. Thus unable to remove insn 8. The
combine pass has to generate PARALLEL rtx to handle such case:

(parallel [
        (set (reg:SI 121)
            (const_int 0 [0]))
        (set (reg:DI 118 [ D.4091 ])
            (zero_extend:DI (reg:SI 0 r0 [ a ])))
    ])

There is no instruction patter to match this parallel rtx. This causes
combine pass to undo all attempts and roll back simplification to
subreg operand. Without such live range extension, everything works
fine. That's why such optimization can only happen in fwprop1 pass.

I made a typo in my previous changelog. My code are indeed for
simplify_subreg function. I updated my patch per your suggestions to
handle more general cases. Please review again. Thanks.

BR,
Terry

Comments

Terry Guo April 24, 2014, 7:48 a.m. UTC | #1
On Mon, Apr 14, 2014 at 6:13 PM, Terry Guo <flameroc@gmail.com> wrote:
> On Thu, Apr 3, 2014 at 10:11 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>>> I find the GCC function simplify_subreg fails to simplify rtx (subreg:SI
>>> (and:DI (reg/v:DI 115 [ a ]) (const_int 4294967295 [0xffffffff])) 4) to zero
>>> during the fwprop1 pass, considering the fact that the high 32-bit part of
>>> (a & 0xFFFFFFFF) is zero. This leads to some unnecessary multiplications
>>> for high 32-bit part of the result of AND operation. The attached patch is
>>> trying to improve simplify_rtx to handle such case. Other target like x86
>>> seems hasn't such issue because it generates different RTX to handle 64bit
>>> multiplication on a 32bit machine.
>>
>> See http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00073.html for another try,
>> which led to the simplification in combine.c:combine_simplify_rtx line 5448.
>>
>> Your variant is both more general, because it isn't restricted to the lowpart,
>> and less general, because it is artificially restricted to AND.
>>
>> Some remarks:
>>  - this needs to be restricted to non-paradoxical subregs,
>>  - you need to test HWI_COMPUTABLE_MODE_P (innermode),
>>  - you need to test !side_effects_p (op).
>>
>> I think we need to find a common ground between Jakub's patch and yours and
>> put a single transformation in simplify_subreg.
>>
>> --
>> Eric Botcazou
>
> Thanks for your review. Even without Jakub's patch, the combine pass
> can simplify such subreg to zero. But the live range of destination
> register causes combine pass to undo all attempts. Here is detailed
> explanation, please note that the AND operation (and:DI (reg/v:DI 115
> [ a ]) (const_int 4294967295 [0xffffffff])) from fwprop1 pass is
> turned into ZERO_EXTEND operation (zero_extend:DI (reg:SI 0 r0 [ a ]))
> in combine pass. The variable a is function arguments and occupies
> register r0 and r1. Right before try_combine function we have below
> three instructions:
>
> (insn 8 4 10 2 (set (reg:DI 118 [ D.4091 ])
>         (zero_extend:DI (reg:SI 0 r0 [ a ]))) 000.c:4 170 {zero_extendsidi2}
>      (expr_list:REG_DEAD (reg:SI 0 r0)
>         (nil)))
>
> (insn 10 8 13 2 (set (reg:SI 121)
>         (mult:SI (reg/v:SI 116 [ b ])
>             (subreg:SI (reg:DI 118 [ D.4091 ]) 4))) 000.c:4 41 {*arm_mulsi3_v6}
>      (nil))
>
> (insn 13 10 14 2 (set (reg:DI 117 [ D.4091 ])
>         (mult:DI (zero_extend:DI (subreg:SI (reg:DI 118 [ D.4091 ]) 0))
>             (zero_extend:DI (reg/v:SI 116 [ b ])))) 000.c:4 60 {*umulsidi3_v6}
>      (expr_list:REG_DEAD (reg:DI 118 [ D.4091 ])
>         (expr_list:REG_DEAD (reg/v:SI 116 [ b ])
>             (nil))))
>
> Without any changes, current gcc can simplify the insn 10 to
>
> (insn 10 8 13 2 (set (reg:SI 121)
>         (const_int 0 [0])) 000.c:4 41 {*arm_mulsi3_v6}
>      (nil))
>
> This is what we wanted. Unfortunately the live range of register
> (reg:DI 118) is extended to insn 13. Thus unable to remove insn 8. The
> combine pass has to generate PARALLEL rtx to handle such case:
>
> (parallel [
>         (set (reg:SI 121)
>             (const_int 0 [0]))
>         (set (reg:DI 118 [ D.4091 ])
>             (zero_extend:DI (reg:SI 0 r0 [ a ])))
>     ])
>
> There is no instruction patter to match this parallel rtx. This causes
> combine pass to undo all attempts and roll back simplification to
> subreg operand. Without such live range extension, everything works
> fine. That's why such optimization can only happen in fwprop1 pass.
>
> I made a typo in my previous changelog. My code are indeed for
> simplify_subreg function. I updated my patch per your suggestions to
> handle more general cases. Please review again. Thanks.
>
> BR,
> Terry

Hi Eric,

How do you think my above investigation? To put it simple, the combine
can optimize it to zero, but lately has to undo it because some
restrictions. Now I generalize my optimization to focus on things like
(subreg:mode (OP) index), then use function nonzero_bits to get bits
status of the result of OP. Once the bits expected by subreg are all
zero, we can optimize this whole subreg to zero. Does this make sense?
Please advise. Thanks.

BR,
Terry
diff mbox

Patch

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 04af01e..1949edc 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -6099,6 +6099,22 @@  simplify_subreg (enum machine_mode outermode, rtx op,
 	return CONST0_RTX (outermode);
     }
 
+  /* If we can make sure the required bits of subreg are zero
+     in subreg operand, we can return a zero.  Here is an example:
+     (subreg:SI (and:DI (reg:DI X) (const_int 0xFFFFFFFF)) 4).  */
+  if (HWI_COMPUTABLE_MODE_P (innermode)
+      && SCALAR_INT_MODE_P (innermode)
+      && GET_MODE_SIZE (innermode) > GET_MODE_SIZE (outermode)
+      && !side_effects_p (op))
+    {
+      unsigned int bitpos = subreg_lsb_1 (outermode, innermode, byte);
+      unsigned HOST_WIDE_INT nzmask = nonzero_bits (op, innermode);
+      unsigned HOST_WIDE_INT smask = GET_MODE_MASK (outermode);
+
+      if (((smask << bitpos) & nzmask) == 0)
+	return CONST0_RTX (outermode);
+    }
+
   if (SCALAR_INT_MODE_P (outermode)
       && SCALAR_INT_MODE_P (innermode)
       && GET_MODE_PRECISION (outermode) < GET_MODE_PRECISION (innermode)