From patchwork Sat Jun 6 13:46:58 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roger Sayle X-Patchwork-Id: 1304537 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=gcc-patches-bounces@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=nextmovesoftware.com Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=nextmovesoftware.com header.i=@nextmovesoftware.com header.a=rsa-sha256 header.s=default header.b=hGJehrO1; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 49fLRR5ZN5z9sRh for ; Sat, 6 Jun 2020 23:47:13 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 848FF387090B; Sat, 6 Jun 2020 13:47:04 +0000 (GMT) X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from server.nextmovesoftware.com (server.nextmovesoftware.com [162.254.253.69]) by sourceware.org (Postfix) with ESMTPS id AC2A03851C2F for ; Sat, 6 Jun 2020 13:47:01 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org AC2A03851C2F Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=nextmovesoftware.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=roger@nextmovesoftware.com DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=nextmovesoftware.com; s=default; h=Content-Type:MIME-Version:Message-ID: Date:Subject:To:From:Sender:Reply-To:Cc:Content-Transfer-Encoding:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:In-Reply-To:References:List-Id:List-Help:List-Unsubscribe: List-Subscribe:List-Post:List-Owner:List-Archive; bh=TKdiF74gI6D9jDlJgeqVX1FkK8ruH+SKKInwxzsWLw8=; b=hGJehrO1MxQrt+j6lo15KbYGpH nPl2CpKA8oECYq2B4XX2tuC+q99PuWLxyAigm6TQNlQMIzS3ljdZOf2aTE5RpbPJy8yV6V/9yYYrX 79V2U6nE2uyREYu5pG0Dx6Zv24UwvTtZcnBOeCuah5UoDD/B8NMeDIwVdnnHVyDJe8OKNWqQMwJjc rK0geCXl1CD74kkkkDDRUHekQSMp0vhLGALOil052sQKKFgLPfmhd5IZDPp2kk0//REENLWjhZQzn q6WCchuAahbOujtPMnR7Gt335oo0Cgr0gemAcyT4/MUXIpbKqV87rJUjeIhJMuIAOhb5ksfRFsd8K A4UBqZZw==; Received: from host86-137-89-123.range86-137.btcentralplus.com ([86.137.89.123]:56466 helo=Dell) by server.nextmovesoftware.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from ) id 1jhZA4-0005Ws-S0 for gcc-patches@gcc.gnu.org; Sat, 06 Jun 2020 09:47:01 -0400 From: "Roger Sayle" To: Subject: [PATCH] x86: Improve expansion of __builtin_parity Date: Sat, 6 Jun 2020 14:46:58 +0100 Message-ID: <004b01d63c08$f3c70640$db5512c0$@nextmovesoftware.com> MIME-Version: 1.0 X-Mailer: Microsoft Outlook 16.0 Thread-Index: AdY8B6mcl+PXoeHHSn2RoFcCVC56Fg== Content-Language: en-gb X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - server.nextmovesoftware.com X-AntiAbuse: Original Domain - gcc.gnu.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - nextmovesoftware.com X-Get-Message-Sender-Via: server.nextmovesoftware.com: authenticated_id: roger@nextmovesoftware.com X-Authenticated-Sender: server.nextmovesoftware.com: roger@nextmovesoftware.com X-Source: X-Source-Args: X-Source-Dir: X-Spam-Status: No, score=-11.6 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, HTML_MESSAGE, KAM_SHORT, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-Content-Filtered-By: Mailman/MimeDel 2.1.29 X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: gcc-patches-bounces@gcc.gnu.org Sender: "Gcc-patches" This patch changes the way that __builtin_parity is expanded in i386.md. This changes the code generated for the function: int test(unsigned char a) { unsigned char x = a + 1; return __builtin_parity(x); } from this before the patch: test: addl $1, %edi movzbl %dil, %edi movl %edi, %eax shrl $16, %edi xorl %edi, %eax xorb %ah, %al setnp %al movzbl %al, %eax ret to this with the patch: test: leal 1(%rdi), %eax testb %al, %al setnp %al movzbl %al, %eax ret GCC currently hides the shift and xor reduction inside a backend specific UNSPEC PARITY, making it invisible to the RTL optimizers until very late during compilation. It is normally reasonable for the middle-end to maintain wider mode representations for as long as possible and split them later, but this only helps if the semantics are visible at the RTL-level (to combine and other passes), but UNSPECs are black boxes, so in this case splitting early (during RTL expansion) is a better strategy. I'd originally written this patch after investigating whether GCC could generate the x86's "jp" and "jnp" conditional branch on parity flag, and noticed the inefficient reduction. It was only when I was preparing this patch for submission that I discovered that this is actually a regression fix, and that Uros had already implemented/thought about the above optimization back in 2007. Indeed the example above is taken from https://gcc.gnu.org/legacy-ml/gcc-patches/2007-02/msg00972.html Alas there is a mismatch between RTL's definition of PARITY operation which has an integer mode result, and the x86's parity flag. So when Uros addressed PR target/44481 in 2010, by introducing UNSPEC PARITY, we lost some of the optimization opportunities of his original patch. https://gcc.gnu.org/legacy-ml/gcc-patches/2010-06/msg01259.html The early splitting approach in this patch safely restores the 2007-2010 optimization opportunities. It turns out that that popcount instruction on modern x86_64 processors has (almost) made the integer parity flag in the x86 ALU completely obsolete, especially as POPCOUNT's integer semantics are a much better fit to RTL. The one remaining case where these transistors are useful is where __builtin_parity is immediately tested by a conditional branch, and therefore the result is wanted in a flags register rather than as an integer. This case is captured by two peephole2 optimizations in the attached patch. Hence the code: void foo(unsigned char x) { if (__builtin_parity(x)) bar(); } which on modern hardware is currently generated as: foo: movzbl %dil, %edi popcntl %edi, %edi andl $1, %edi jne .L4 ret .L4: jmp bar with this patch will be generated as: foo: testb %dil, %dil jnp .L4 ret .L4: jmp bar [I believe that popcount has a 3-cycle latency, but a test followed by a conditional branch can dispatch in the same cycle, but I'm not an expert]. The new parityhi2 and parityqi2 expanders are infrastructure to support (possible) future middle-end changes. This patch has been tested with a "make bootstrap" and "make -k check" on x86_64-pc-linux-gnu with no regressions. I've also added 7 new test cases; 4 tests gcc.target/i386/parity-[3-6].c check the existing behaviour, and 3 tests gcc.target/i386/parity-[7-9].c check the changes from this patch. Alas I'm very out of practice contributing patches (but my paperwork should still be valid), so if a friendly i386/x86_64 backend maintainer could take care of things from here, that would be very much appreciated. Many thanks in advance, Roger --- 2020-06-05 Roger Sayle * config/i386/i386.md (paritydi2, paritysi2): Expand reduction via shift and xor to an USPEC PARITY matching a parityhi2_cmp. (paritydi2_cmp, paritysi2_cmp): Delete these define_insn_and_split. (parityhi2, parityqi2): New expanders. (parityhi2_cmp): Implement set parity flag with xorb insn. (parityqi2_cmp): Implement set parity flag with testb insn. New peephole2s to use these insns (UNSPEC PARITY) when appropriate. 2020-06-05 Roger Sayle * gcc.target/i386/parity-[3-9].c: New tests. /* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */ /* { dg-final { scan-assembler "setp" } } */ /* { dg-final { scan-assembler "jnp" } } */ /* { dg-final { scan-assembler "jp" } } */ void dummy(void); int foo(unsigned int x) { return !__builtin_parity(x); } void bar(unsigned int x) { if (__builtin_parity(x)) dummy(); } void baz(unsigned int x) { if (!__builtin_parity(x)) dummy(); } /* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */ /* { dg-final { scan-assembler "setp" } } */ /* { dg-final { scan-assembler "jnp" } } */ /* { dg-final { scan-assembler "jp" } } */ void dummy(void); int foo(unsigned long long x) { return !__builtin_parityll(x); } void bar(unsigned long long x) { if (__builtin_parityll(x)) dummy(); } void baz(unsigned long long x) { if (!__builtin_parityll(x)) dummy(); } /* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2" } */ /* { dg-final { scan-assembler "popcnt" } } */ /* { dg-final { scan-assembler "and" } } */ int foo(unsigned int x) { return __builtin_parity(x); } /* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2" } */ /* { dg-final { scan-assembler "popcnt" } } */ /* { dg-final { scan-assembler "and" } } */ int foo(unsigned long long x) { return __builtin_parityll(x); } /* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */ /* { dg-final { scan-assembler-times "test" 2 } } */ /* { dg-final { scan-assembler-not "shr" } } */ int foo(unsigned char x) { return __builtin_parity(x); } int bar(unsigned char x) { return __builtin_parityll(x); } /* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */ /* { dg-final { scan-assembler-not "shr" } } */ int foo(unsigned short x) { return __builtin_parity(x); } int bar(unsigned short x) { return __builtin_parityll(x); } /* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2" } */ /* { dg-final { scan-assembler-not "popcnt" } } */ /* { dg-final { scan-assembler-not "shr" } } */ /* { dg-final { scan-assembler-times "jp" 2 } } */ /* { dg-final { scan-assembler-times "jnp" 2 } } */ void dummy(void); void pos8(unsigned char x) { if (__builtin_parity(x)) dummy(); } void neg8(unsigned char x) { if (!__builtin_parity(x)) dummy(); } void pos16(unsigned short x) { if (__builtin_parity(x)) dummy(); } void neg16(unsigned short x) { if (!__builtin_parity(x)) dummy(); } diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 459cf62..6659657 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -14780,9 +14780,32 @@ "! TARGET_POPCNT" { rtx scratch = gen_reg_rtx (QImode); + rtx hipart1 = gen_reg_rtx (SImode); + rtx lopart1 = gen_reg_rtx (SImode); + rtx xor1 = gen_reg_rtx (SImode); + rtx shift2 = gen_reg_rtx (SImode); + rtx hipart2 = gen_reg_rtx (HImode); + rtx lopart2 = gen_reg_rtx (HImode); + rtx xor2 = gen_reg_rtx (HImode); - emit_insn (gen_paritydi2_cmp (NULL_RTX, NULL_RTX, - NULL_RTX, operands[1])); + if (TARGET_64BIT) + { + rtx shift1 = gen_reg_rtx (DImode); + emit_insn (gen_lshrdi3 (shift1, operands[1], GEN_INT (32))); + emit_move_insn (hipart1, gen_lowpart (SImode, shift1)); + } + else + emit_move_insn (hipart1, gen_highpart (SImode, operands[1])); + + emit_move_insn (lopart1, gen_lowpart (SImode, operands[1])); + emit_insn (gen_xorsi3 (xor1, hipart1, lopart1)); + + emit_insn (gen_lshrsi3 (shift2, xor1, GEN_INT (16))); + emit_move_insn (hipart2, gen_lowpart (HImode, shift2)); + emit_move_insn (lopart2, gen_lowpart (HImode, xor1)); + emit_insn (gen_xorhi3 (xor2, hipart2, lopart2)); + + emit_insn (gen_parityhi2_cmp (xor2)); ix86_expand_setcc (scratch, ORDERED, gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx); @@ -14805,8 +14828,17 @@ "! TARGET_POPCNT" { rtx scratch = gen_reg_rtx (QImode); + rtx shift = gen_reg_rtx (SImode); + rtx hipart = gen_reg_rtx (HImode); + rtx lopart = gen_reg_rtx (HImode); + rtx tmp = gen_reg_rtx (HImode); + + emit_insn (gen_lshrsi3 (shift, operands[1], GEN_INT (16))); + emit_move_insn (hipart, gen_lowpart (HImode, shift)); + emit_move_insn (lopart, gen_lowpart (HImode, operands[1])); + emit_insn (gen_xorhi3 (tmp, hipart, lopart)); - emit_insn (gen_paritysi2_cmp (NULL_RTX, NULL_RTX, operands[1])); + emit_insn (gen_parityhi2_cmp (tmp)); ix86_expand_setcc (scratch, ORDERED, gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx); @@ -14815,70 +14847,128 @@ DONE; }) -(define_insn_and_split "paritydi2_cmp" - [(set (reg:CC FLAGS_REG) - (unspec:CC [(match_operand:DI 3 "register_operand" "0")] - UNSPEC_PARITY)) - (clobber (match_scratch:DI 0 "=r")) - (clobber (match_scratch:SI 1 "=&r")) - (clobber (match_scratch:HI 2 "=Q"))] +(define_expand "parityhi2" + [(set (match_operand:HI 0 "register_operand") + (parity:HI (match_operand:HI 1 "register_operand")))] "! TARGET_POPCNT" - "#" - "&& reload_completed" - [(parallel - [(set (match_dup 1) - (xor:SI (match_dup 1) (match_dup 4))) - (clobber (reg:CC FLAGS_REG))]) - (parallel - [(set (reg:CC FLAGS_REG) - (unspec:CC [(match_dup 1)] UNSPEC_PARITY)) - (clobber (match_dup 1)) - (clobber (match_dup 2))])] { - operands[4] = gen_lowpart (SImode, operands[3]); + rtx scratch = gen_reg_rtx (QImode); - if (TARGET_64BIT) - { - emit_move_insn (operands[1], gen_lowpart (SImode, operands[3])); - emit_insn (gen_lshrdi3 (operands[3], operands[3], GEN_INT (32))); - } - else - operands[1] = gen_highpart (SImode, operands[3]); + emit_insn (gen_parityhi2_cmp (operands[1])); + + ix86_expand_setcc (scratch, ORDERED, + gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx); + + emit_insn (gen_zero_extendqihi2 (operands[0], scratch)); + DONE; }) -(define_insn_and_split "paritysi2_cmp" - [(set (reg:CC FLAGS_REG) - (unspec:CC [(match_operand:SI 2 "register_operand" "0")] - UNSPEC_PARITY)) - (clobber (match_scratch:SI 0 "=r")) - (clobber (match_scratch:HI 1 "=&Q"))] +(define_expand "parityqi2" + [(set (match_operand:QI 0 "register_operand") + (parity:QI (match_operand:QI 1 "register_operand")))] "! TARGET_POPCNT" - "#" - "&& reload_completed" - [(parallel - [(set (match_dup 1) - (xor:HI (match_dup 1) (match_dup 3))) - (clobber (reg:CC FLAGS_REG))]) - (parallel - [(set (reg:CC FLAGS_REG) - (unspec:CC [(match_dup 1)] UNSPEC_PARITY)) - (clobber (match_dup 1))])] { - operands[3] = gen_lowpart (HImode, operands[2]); + emit_insn (gen_parityqi2_cmp (operands[1])); - emit_move_insn (operands[1], gen_lowpart (HImode, operands[2])); - emit_insn (gen_lshrsi3 (operands[2], operands[2], GEN_INT (16))); + ix86_expand_setcc (operands[0], ORDERED, + gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx); + DONE; }) -(define_insn "*parityhi2_cmp" +(define_insn "parityhi2_cmp" [(set (reg:CC FLAGS_REG) - (unspec:CC [(match_operand:HI 1 "register_operand" "0")] + (unspec:CC [(match_operand:HI 0 "register_operand" "+Q")] UNSPEC_PARITY)) - (clobber (match_scratch:HI 0 "=Q"))] - "! TARGET_POPCNT" + (clobber (match_dup 0))] + "" "xor{b}\t{%h0, %b0|%b0, %h0}" [(set_attr "length" "2") - (set_attr "mode" "HI")]) + (set_attr "mode" "QI")]) + +(define_insn "parityqi2_cmp" + [(set (reg:CC FLAGS_REG) + (unspec:CC [(match_operand:QI 0 "register_operand")] + UNSPEC_PARITY))] + "" + "test{b}\t%b0, %b0" + [(set_attr "mode" "QI")]) + +;; Replace zero_extend:HI followed by parityhi2_cmp with parityqi2_cmp +(define_peephole2 + [(set (match_operand:HI 0 "register_operand") + (zero_extend:HI (match_operand:QI 1 "register_operand"))) + (parallel [(set (reg:CC FLAGS_REG) + (unspec:CC [(match_dup 0)] UNSPEC_PARITY)) + (clobber (match_dup 0))])] + "" + [(set (reg:CC FLAGS_REG) + (unspec:CC [(match_dup 1)] UNSPEC_PARITY))]) + +;; Eliminate QImode popcount&1 using parity flag +(define_peephole2 + [(set (match_operand:SI 0 "register_operand") + (zero_extend:SI (match_operand:QI 1 "register_operand"))) + (parallel [(set (match_operand:SI 2 "register_operand") + (popcount:SI (match_dup 0))) + (clobber (reg:CC FLAGS_REG))]) + (set (reg:CCZ FLAGS_REG) + (compare:CCZ (and:QI (match_operand:QI 3 "register_operand") + (const_int 1)) + (const_int 0))) + (set (pc) (if_then_else (match_operator 4 "bt_comparison_operator" + [(reg:CCZ FLAGS_REG) + (const_int 0)]) + (label_ref (match_operand 5)) + (pc)))] + "REGNO (operands[2]) == REGNO (operands[3]) + && peep2_reg_dead_p (3, operands[0]) + && peep2_reg_dead_p (3, operands[2]) + && peep2_regno_dead_p (4, FLAGS_REG)" + [(set (reg:CC FLAGS_REG) + (unspec:CC [(match_dup 1)] UNSPEC_PARITY)) + (set (pc) (if_then_else (match_op_dup 4 [(reg:CC FLAGS_REG) + (const_int 0)]) + (label_ref (match_dup 5)) + (pc)))] +{ + operands[4] = shallow_copy_rtx (operands[4]); + PUT_CODE (operands[4], GET_CODE (operands[4]) == EQ ? UNORDERED : ORDERED); +}) + +;; Eliminate HImode popcount&1 using parity flag +(define_peephole2 + [(match_scratch:HI 0 "Q") + (parallel [(set (match_operand:HI 1 "register_operand") + (popcount:HI + (match_operand:HI 2 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))]) + (set (match_operand 3 "register_operand") + (zero_extend (match_dup 1))) + (set (reg:CCZ FLAGS_REG) + (compare:CCZ (and:QI (match_operand:QI 4 "register_operand") + (const_int 1)) + (const_int 0))) + (set (pc) (if_then_else (match_operator 5 "bt_comparison_operator" + [(reg:CCZ FLAGS_REG) + (const_int 0)]) + (label_ref (match_operand 6)) + (pc)))] + "REGNO (operands[3]) == REGNO (operands[4]) + && peep2_reg_dead_p (3, operands[1]) + && peep2_reg_dead_p (3, operands[3]) + && peep2_regno_dead_p (4, FLAGS_REG)" + [(set (match_dup 0) (match_dup 2)) + (parallel [(set (reg:CC FLAGS_REG) + (unspec:CC [(match_dup 0)] UNSPEC_PARITY)) + (clobber (match_dup 0))]) + (set (pc) (if_then_else (match_op_dup 5 [(reg:CC FLAGS_REG) + (const_int 0)]) + (label_ref (match_dup 6)) + (pc)))] +{ + operands[5] = shallow_copy_rtx (operands[5]); + PUT_CODE (operands[5], GET_CODE (operands[5]) == EQ ? UNORDERED : ORDERED); +}) ;; Thread-local storage patterns for ELF.