optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803]

Hi!

The following testcase is miscompiled on aarch64.  The problem is that
aarch64 with TARGET_SIMD is !SHIFT_COUNT_TRUNCATED target with
targetm.shift_truncation_mask (DImode) == 0 which has HAVE_conditional_move
true.  If a doubleword shift (in this case TImode) doesn't have its own
expander (as is the case of e.g. x86) and is handled in generic code,
there are 3 possible expansions.  One is when the shift count is constant,
the code computes in that case superword_op1 as op1 - BITS_PER_WORD,
and chooses at compile time which of expand_superword_shift or
expand_subword_shift to call, which ensures that whatever is used
actually has its shift count (superword_op1 or op1) in [0, BITS_PER_WORD - 1]
range.  If !HAVE_conditional_move or that expansion fails, the function
handles non-constant op1 similarly (there are some special cases for
shift_mask >= BITS_PER_WORD - 1 but let's talk here just about
shift_mask < BITS_PER_WORD - 1), except that the selection is done at
runtime, with branches around the stuff.  While superword_op1 can be
[-BITS_PER_WORD, BITS_PER_WORD - 1] and op1 [0, 2 * BITS_PER_WORD - 1],
the runtime selection ensures that the instructions executed at runtime
have their corresponding shift count again in the right range of
[0, BITS_PER_WORD - 1].
The problematic is the HAVE_conditional_move case, which emits both
sequences into the actually executed code, so necessarily one of them
has an out of range shift count and then using 2 conditional moves
picks up a result.
Now, in the testcase because -Og doesn't perform VRP/EVRP the shift
count isn't optimized to constant during GIMPLE passes, but is determined
to be constant during/after expansion into RTL.  The shift count is
actually const0_rtx later, so superword_op1 is (const_int -64) and we end
up with miscompilation during combine because of that.
I'm afraid on targetm.shift_truncation_mask (DImode) == 0 targets we have
to mask the shift counts when the doubleshift count is in range but
one of subword_op1/superword_op1 is not, which is what the following
patch does and what fixes the wrong-code.  Now, targets like x86 or aarch64,
while they are !SHIFT_COUNT_TRUNCATED, have actually patterns to catch
shift with masked counter, so the ANDs can be optimized out.  On the other
side, when we know the result will be masked this way we can use the
simpler ~op1 instead of (BITS_PER_WORD - 1) - op1 in expand_subword_shift.
So, say on
__attribute__((noipa)) __int128
foo (__int128 a, unsigned k)
{
  return a << k;
}
on aarch64 at -Og the difference is:
 foo:
        subs    w5, w2, #64
-       lsl     x6, x0, x5
+       lsl     x4, x0, x2
        lsr     x3, x0, 1
-       mov     w4, 63
-       sub     w4, w4, w2
-       lsr     x3, x3, x4
+       mvn     w6, w2
+       and     w2, w2, 63
+       lsr     x3, x3, x6
        lsl     x1, x1, x2
        orr     x1, x3, x1
        lsl     x0, x0, x2
        csel    x0, xzr, x0, pl
-       csel    x1, x6, x1, pl
+       csel    x1, x4, x1, pl
        ret
We could do even better and optimize the and w2, w2, 63 instruction out,
but it is a matter of costs and so IMHO should be handled incrementally.
For that case consider say:
void corge (int, int, int);

void
qux (int x, int y, int z, int n)
{
  n &= 31;
  corge (x << n, y << n, z >> n);
}
with -O2 -fno-tree-vectorize, on x86_64 one gets
        sarl    %cl, %edx
        sall    %cl, %esi
        sall    %cl, %edi
        jmp     corge
but on aarch64
        and     w3, w3, 31
        lsl     w0, w0, w3
        lsl     w1, w1, w3
        asr     w2, w2, w3
        b       corge
The reason it works on x86 is that its rtx_costs hook recognizes
that the AND in shift+mask patterns is for free.
Trying 9 -> 11:
    9: {r85:SI=r96:SI&0x1f;clobber flags:CC;}
      REG_UNUSED flags:CC
   11: {r91:SI=r87:SI<<r85:SI#0;clobber flags:CC;}
      REG_DEAD r87:SI
      REG_UNUSED flags:CC
Failed to match this instruction:
...
Failed to match this instruction:
...
Successfully matched this instruction:
(set (reg/v:SI 85 [ n ])
    (and:SI (reg:SI 96)
        (const_int 31 [0x1f])))
Successfully matched this instruction:
(set (reg:SI 91)
    (ashift:SI (reg/v:SI 87 [ y ])
        (subreg:QI (and:SI (reg:SI 96)
                (const_int 31 [0x1f])) 0)))
allowing combination of insns 9 and 11
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
Compare that to the aarch64 case:
Trying 9 -> 11:
    9: r95:SI=r106:SI&0x1f
      REG_DEAD r106:SI
   11: r101:SI=r104:SI<<r95:SI#0
      REG_DEAD r104:SI
Failed to match this instruction:
...
Failed to match this instruction:
...
Successfully matched this instruction:
(set (reg/v:SI 95 [ n ])
    (and:SI (reg:SI 106)
        (const_int 31 [0x1f])))
Successfully matched this instruction:
(set (reg:SI 101)
    (ashift:SI (reg:SI 104)
        (subreg:QI (and:SI (reg:SI 106)
                (const_int 31 [0x1f])) 0)))
rejecting combination of insns 9 and 11
original costs 4 + 4 = 8
replacement costs 4 + 8 = 12
Now, if the rtx costs on the *aarch64_ashl_reg_si3_mask1
and similar would be the same as for normal shift without the
masking, this would succeed and we could get rid of the and.

Bootstrapped/regtested on aarch64-linux, ok for trunk?

2023-02-17  Jakub Jelinek  <jakub@redhat.com>

	PR middle-end/108803
	* optabs.cc (expand_subword_shift): Add MASK_COUNT argument,
	if true, use ~op1 & (BITS_PER_WORD - 1) instead of
	(BITS_PER_WORD - 1) - op1 as tmp and op1 & (BITS_PER_WORD - 1)
	instead of op1 on the other two shifts.
	(expand_doubleword_shift_condmove): Use
	superword_op1 & (BITS_PER_WORD - 1) instead of superword_op1.  Pass
	shift_mask < BITS_PER_WORD - 1 as MASK_COUNT to expand_subword_shift.
	(expand_doubleword_shift): Pass false as MASK_COUNT to
	expand_subword_shift.

	* gcc.dg/pr108803.c: New test.

	Jakub

Message ID	Y+9TodOHE09e9Vwq@tucnak
State	New
Headers	show Return-Path: <gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org> DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 359E93858C78 Date: Fri, 17 Feb 2023 11:14:57 +0100 To: Richard Sandiford <richard.sandiford@arm.com>, Richard Biener <rguenther@suse.de> Cc: gcc-patches@gcc.gnu.org Subject: [PATCH] optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803] Message-ID: <Y+9TodOHE09e9Vwq@tucnak> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Precedence: list From: Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: Jakub Jelinek <jakub@redhat.com> Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org>
Series	optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803] \| expand optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803]

optabs: Fix up expand_doubleword_shift_condmove for shift_mask == 0 [PR108803]

Commit Message

Comments

Patch