From patchwork Tue Sep 10 23:11:53 2019
X-Patchwork-Submitter: Alexandre Oliva
X-Patchwork-Id: 1160574
From: Alexandre Oliva
To: gcc-patches@gcc.gnu.org
Cc: Jan Hubicka, Uros Bizjak
Subject: new x86 cmpmemsi expander, and adjustments for cmpstrn*
Date: Tue, 10 Sep 2019 20:11:53 -0300

This patchset fixes some latent problems in the cmpstrn* patterns for
x86, and introduces cmpmemsi for short fixed-size memcmp.

I've verified that repz cmpsb is not a good choice for memcmp,
performance-wise, so I turned to movbe/bswapped loads and unsigned
compares.  Those have performed better than glibc's CPU-optimized
memcmp for all short counts I threw at them, for aligned and misaligned
inputs, on very old and quite new CPUs, both Intel and AMD.  I suppose
this result won't carry over to other architectures, so I kept it in
x86-specific code.

The fixed-size inlined memcmps come at a significant code-size cost per
compare sequence, so I ended up not even trying to find a per-CPU
break-even point, and instead put in a small cut-off that varies with
the optimization level.  I considered using a param for tuning, but
params wouldn't work in the md file, and I decided that was probably
overkill, so I'd rather ask for feedback before proceeding any further
down that path.

In order to compare memcmp performance, I used the attached test
program, which uses x86 CPU cycle counters, with repetition to seek the
minimum cycle count for an operation, rather than an average, which is
not so reliable given all the self-tuning, caching and whatnot a CPU is
capable of.  The program runs three kinds of tests: comparing identical
buffers with the same alignment, comparing misaligned uncached
identical buffers, and averaging over misaligned buffers that are about
twice as likely to first differ at byte N as at byte N+1.  (A minimal
sketch of the timing approach appears after the first patch's ChangeLog
below.)

The patchset has three patches.  In the first one, I fix latent issues
in the existing cmpstr patterns that became apparent with the
introduction of cmpmem.  In the second one, I introduce a very simple
cmpmemsi pattern that only expands memcmp when the length is a power of
two small enough that that many bytes can be loaded from memory into a
non-vector register in a single instruction, regardless of alignment.
In the third one, I extend it to expand up to a certain number of
compare sequences, depending on the optimization level, so that memcmp
is inlined for several common small block sizes without growing the
code too much at low optimization levels.

The set was regstrapped on x86_64-linux-gnu.  The test program
exhibited better performance in all tests, at both -m64 and -m32, on
various old and new CPUs, when memcmp was inlined than when it wasn't.

Ok to install?  Would it make sense to install the test program
somewhere?


make cmpstrnsi patterns safer

If cmpstrnqi_1 is given a length in a non-immediate operand, which
cmpstrnsi accepts as an input, that operand is modified rather than
preserved.  Turn the clobbered match_dup into a separate output operand
to avoid this undesirable effect.  The problem never came up in
existing uses of cmpstrnsi, but the upcoming uses of cmpmemsi run into
it.

While at it, adjust the cmpstrnqi_1 patterns so that FLAGS_REG is
visibly propagated when the length is zero, rather than being merely
used.

for gcc/ChangeLog

	* config/i386/i386.md (cmpstrnsi): Create separate output count
	to pass to cmpstrnqi_nz_1 and ...
	(cmpstrnqi_1): ... this.  Do not use a match_dup of the count
	input as an output.  Preserve FLAGS_REG when length is zero.
	(*cmpstrnqi_1): Preserve FLAGS_REG when length is zero.
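For illustration, and not to be confused with the attached test
program: below is a minimal sketch of the timing approach described in
the cover letter, taking the minimum of many cycle-counter measurements
so that interrupts, cache misses and other noise inflate individual
runs but not the reported figure.  Buffer contents and repetition
counts are placeholders.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

/* Minimum of REPS cycle-counter measurements of memcmp (A, B, N).  */
static uint64_t
min_cycles (const void *a, const void *b, size_t n, int reps)
{
  uint64_t best = UINT64_MAX;
  for (int i = 0; i < reps; i++)
    {
      uint64_t t0 = __rdtsc ();
      volatile int r = memcmp (a, b, n);  /* operation under test */
      uint64_t t1 = __rdtsc ();
      (void) r;
      if (t1 - t0 < best)
        best = t1 - t0;
    }
  return best;
}

int
main (void)
{
  static char buf1[64], buf2[64];  /* placeholder buffers */
  for (size_t n = 1; n <= 8; n++)
    printf ("%zu bytes: %llu cycles\n", n,
            (unsigned long long) min_cycles (buf1, buf2, n, 100000));
  return 0;
}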
---
 gcc/config/i386/i386.md | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 7ad9788241988..2b7469991d837 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -16974,7 +16974,7 @@
    (use (match_operand 4 "immediate_operand"))]
   ""
 {
-  rtx addr1, addr2, countreg, align, out;
+  rtx addr1, addr2, countreg, countout, align, out;
 
   if (optimize_insn_for_size_p () && !TARGET_INLINE_ALL_STRINGOPS)
     FAIL;
@@ -17006,6 +17006,7 @@
     operands[2] = replace_equiv_address_nv (operands[2], addr2);
 
   countreg = ix86_zero_extend_to_Pmode (operands[3]);
+  countout = gen_reg_rtx (Pmode);
 
   /* %%% Iff we are testing strict equality, we can use known alignment
      to good advantage.  This may be possible with combine, particularly
@@ -17019,14 +17020,14 @@
 	  emit_move_insn (operands[0], const0_rtx);
 	  DONE;
 	}
-      emit_insn (gen_cmpstrnqi_nz_1 (addr1, addr2, countreg, align,
-				     operands[1], operands[2]));
+      emit_insn (gen_cmpstrnqi_nz_1 (addr1, addr2, countout, align,
+				     operands[1], operands[2], countreg));
     }
   else
     {
       emit_insn (gen_cmp_1 (Pmode, countreg, countreg));
-      emit_insn (gen_cmpstrnqi_1 (addr1, addr2, countreg, align,
-				  operands[1], operands[2]));
+      emit_insn (gen_cmpstrnqi_1 (addr1, addr2, countout, align,
+				  operands[1], operands[2], countreg));
     }
 
   out = gen_lowpart (QImode, operands[0]);
@@ -17060,11 +17061,11 @@
   [(parallel [(set (reg:CC FLAGS_REG)
 		   (compare:CC (match_operand 4 "memory_operand")
 			       (match_operand 5 "memory_operand")))
-	      (use (match_operand 2 "register_operand"))
+	      (use (match_operand 6 "register_operand"))
 	      (use (match_operand:SI 3 "immediate_operand"))
 	      (clobber (match_operand 0 "register_operand"))
 	      (clobber (match_operand 1 "register_operand"))
-	      (clobber (match_dup 2))])]
+	      (clobber (match_operand 2 "register_operand"))])]
   ""
 {
   if (TARGET_CLD)
@@ -17096,16 +17097,15 @@
 
 (define_expand "cmpstrnqi_1"
   [(parallel [(set (reg:CC FLAGS_REG)
-		   (if_then_else:CC (ne (match_operand 2 "register_operand")
+		   (if_then_else:CC (ne (match_operand 6 "register_operand")
 					(const_int 0))
 				    (compare:CC (match_operand 4 "memory_operand")
 						(match_operand 5 "memory_operand"))
-				    (const_int 0)))
+				    (reg:CC FLAGS_REG)))
 	      (use (match_operand:SI 3 "immediate_operand"))
-	      (use (reg:CC FLAGS_REG))
 	      (clobber (match_operand 0 "register_operand"))
 	      (clobber (match_operand 1 "register_operand"))
-	      (clobber (match_dup 2))])]
+	      (clobber (match_operand 2 "register_operand"))])]
   ""
 {
   if (TARGET_CLD)
@@ -17118,9 +17118,8 @@
 				(const_int 0))
 			    (compare:CC (mem:BLK (match_operand:P 4 "register_operand" "0"))
 					(mem:BLK (match_operand:P 5 "register_operand" "1")))
-			    (const_int 0)))
+			    (reg:CC FLAGS_REG)))
    (use (match_operand:SI 3 "immediate_operand" "i"))
-   (use (reg:CC FLAGS_REG))
    (clobber (match_operand:P 0 "register_operand" "=S"))
    (clobber (match_operand:P 1 "register_operand" "=D"))
    (clobber (match_operand:P 2 "register_operand" "=c"))]


x86 cmpmemsi pattern - single compare

This patch introduces a cmpmemsi pattern that expands to a single
compare insn sequence, involving one bswapped load from each input mem
block.  It disregards alignment entirely, leaving it to the CPU to deal
with.

for gcc/ChangeLog

	* config/i386/i386.md (cmpmemsi): New pattern.
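For illustration, the sequence this expander emits for a 4-byte block
computes roughly what the following C does (a sketch only; memcmp4 is a
made-up name, not part of the patch).  Byte-swapping makes earlier
bytes more significant, so an unsigned comparison of the swapped words
yields the sign memcmp must return, which is what cmpintqi derives from
the flags.

#include <stdint.h>
#include <string.h>

static int
memcmp4 (const void *p, const void *q)
{
  uint32_t a, b;
  memcpy (&a, p, 4);  /* unaligned loads are cheap on x86 */
  memcpy (&b, q, 4);
  a = __builtin_bswap32 (a);  /* first byte becomes most significant */
  b = __builtin_bswap32 (b);
  return (a > b) - (a < b);   /* negative, zero or positive, like memcmp */
}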
---
 gcc/config/i386/i386.md | 114 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 2b7469991d837..b72b94a1fd51d 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -16966,6 +16966,120 @@
 		      (const_string "*")))
    (set_attr "mode" "QI")])
 
+(define_expand "cmpmemsi"
+  [(set (match_operand:SI 0 "register_operand")
+	(compare:SI (match_operand:BLK 1 "general_operand")
+		    (match_operand:BLK 2 "general_operand")))
+   (use (match_operand 3 "immediate_operand"))
+   (use (match_operand 4 "immediate_operand"))]
+  ""
+{
+  rtx op1, op2, tmp;
+
+  if (!CONST_INT_P (operands[3]))
+    FAIL;
+
+  if (optimize_insn_for_size_p () && !TARGET_INLINE_ALL_STRINGOPS)
+    FAIL;
+
+  switch (INTVAL (operands[3]))
+    {
+    case 0:
+      emit_move_insn (operands[0], const0_rtx);
+      DONE;
+
+    default:
+      FAIL;
+
+    case 8:
+      if (!TARGET_64BIT)
+	FAIL;
+
+      op1 = gen_rtx_MEM (DImode, XEXP (operands[1], 0));
+      MEM_COPY_ATTRIBUTES (op1, operands[1]);
+
+      tmp = gen_reg_rtx (DImode);
+      emit_insn (gen_bswapdi2 (tmp, op1));
+      op1 = tmp;
+
+      op2 = gen_rtx_MEM (DImode, XEXP (operands[2], 0));
+      MEM_COPY_ATTRIBUTES (op2, operands[2]);
+
+      tmp = gen_reg_rtx (DImode);
+      emit_insn (gen_bswapdi2 (tmp, op2));
+      op2 = tmp;
+
+      emit_insn (gen_cmp_1 (DImode, op1, op2));
+
+      tmp = gen_lowpart (QImode, operands[0]);
+      emit_insn (gen_cmpintqi (tmp));
+      emit_move_insn (operands[0], gen_rtx_SIGN_EXTEND (SImode, tmp));
+      DONE;
+
+    case 4:
+      op1 = gen_rtx_MEM (SImode, XEXP (operands[1], 0));
+      MEM_COPY_ATTRIBUTES (op1, operands[1]);
+
+      tmp = gen_reg_rtx (SImode);
+      emit_insn (gen_bswapsi2 (tmp, op1));
+      op1 = tmp;
+
+      op2 = gen_rtx_MEM (SImode, XEXP (operands[2], 0));
+      MEM_COPY_ATTRIBUTES (op2, operands[2]);
+
+      tmp = gen_reg_rtx (SImode);
+      emit_insn (gen_bswapsi2 (tmp, op2));
+      op2 = tmp;
+
+      emit_insn (gen_cmp_1 (SImode, op1, op2));
+
+      tmp = gen_lowpart (QImode, operands[0]);
+      emit_insn (gen_cmpintqi (tmp));
+      emit_move_insn (operands[0], gen_rtx_SIGN_EXTEND (SImode, tmp));
+      DONE;
+
+    case 2:
+      op1 = gen_rtx_MEM (HImode, XEXP (operands[1], 0));
+      MEM_COPY_ATTRIBUTES (op1, operands[1]);
+
+      tmp = gen_reg_rtx (SImode);
+      emit_insn (gen_zero_extendhisi2 (tmp, op1));
+      emit_insn (gen_bswaphi_lowpart (gen_lowpart (HImode, tmp)));
+      op1 = tmp;
+
+      op2 = gen_rtx_MEM (HImode, XEXP (operands[2], 0));
+      MEM_COPY_ATTRIBUTES (op2, operands[2]);
+
+      tmp = gen_reg_rtx (SImode);
+      emit_insn (gen_zero_extendhisi2 (tmp, op2));
+      emit_insn (gen_bswaphi_lowpart (gen_lowpart (HImode, tmp)));
+      op2 = tmp;
+
+      emit_insn (gen_sub3_insn (operands[0], op1, op2));
+      DONE;
+
+    case 1:
+      op1 = gen_rtx_MEM (QImode, XEXP (operands[1], 0));
+      MEM_COPY_ATTRIBUTES (op1, operands[1]);
+
+      tmp = gen_reg_rtx (SImode);
+      emit_insn (gen_zero_extendqisi2 (tmp, op1));
+      op1 = tmp;
+
+      op2 = gen_rtx_MEM (QImode, XEXP (operands[2], 0));
+      MEM_COPY_ATTRIBUTES (op2, operands[2]);
+
+      tmp = gen_reg_rtx (SImode);
+      emit_insn (gen_zero_extendqisi2 (tmp, op2));
+      op2 = tmp;
+
+      emit_insn (gen_sub3_insn (operands[0], op1, op2));
+      DONE;
+    }
+
+  FAIL;
+})
+
 (define_expand "cmpstrnsi"
   [(set (match_operand:SI 0 "register_operand")
 	(compare:SI (match_operand:BLK 1 "general_operand")


extend x86 cmpmemsi to use loops

This patch extends the cmpmemsi expander introduced in the previous
patch to use loops for lengths that extend over multiple words.

for gcc/ChangeLog

	* config/i386/i386.md (cmpmemsi): Expand more than one fragment
	compare sequence depending on optimization level.
	(subcmpsi3): New expand pattern.
---
 gcc/config/i386/i386.md | 204 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 154 insertions(+), 50 deletions(-)

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index b72b94a1fd51d..088a591dd5c17 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -16974,112 +16974,216 @@
    (use (match_operand 4 "immediate_operand"))]
   ""
 {
-  rtx op1, op2, tmp;
-
   if (!CONST_INT_P (operands[3]))
     FAIL;
 
+  unsigned HOST_WIDE_INT todo = UINTVAL (operands[3]);
+
+  /* Balance size expansion with optimization level.  This will inline
+     memcmp of up to 4 bytes or 1 word at -O1, 4 words or 1 word plus
+     3 compares at -O2, and 7 words or 4 words plus 3 compares at -O3.
+     These are not so much related with the combinations that make
+     individual memcmp calls faster, but with the significant extra
+     code cache use that each additional sequence of loads, byte
+     swapping and compare incurs.  */
+
+  HOST_WIDE_INT size = (TARGET_64BIT ? todo / 8 + !!(todo & 4) : todo / 4);
+  if (size)
+    size++;
+  size += !!(todo & 1) + !!(todo & 2);
+  if (size > 1)
+    size++;
+  if (size > optimize * 3)
+    FAIL;
+
   if (optimize_insn_for_size_p () && !TARGET_INLINE_ALL_STRINGOPS)
     FAIL;
 
-  switch (INTVAL (operands[3]))
+  if (!todo)
     {
-    case 0:
       emit_move_insn (operands[0], const0_rtx);
       DONE;
+    }
 
-    default:
-      FAIL;
-
-    case 8:
-      if (!TARGET_64BIT)
-	FAIL;
-
-      op1 = gen_rtx_MEM (DImode, XEXP (operands[1], 0));
-      MEM_COPY_ATTRIBUTES (op1, operands[1]);
-
-      tmp = gen_reg_rtx (DImode);
-      emit_insn (gen_bswapdi2 (tmp, op1));
-      op1 = tmp;
-
-      op2 = gen_rtx_MEM (DImode, XEXP (operands[2], 0));
-      MEM_COPY_ATTRIBUTES (op2, operands[2]);
-
-      tmp = gen_reg_rtx (DImode);
-      emit_insn (gen_bswapdi2 (tmp, op2));
-      op2 = tmp;
+  rtx tmpout = operands[0];
+  if (reg_overlap_mentioned_p (operands[0], XEXP (operands[1], 0))
+      || reg_overlap_mentioned_p (operands[0], XEXP (operands[2], 0)))
+    tmpout = gen_reg_rtx (SImode);
 
-      emit_insn (gen_cmp_1 (DImode, op1, op2));
+  rtx_code_label *labnz = 0, *labfv = 0;
+  unsigned HOST_WIDE_INT done = 0;
+  bool needcmpint = false;
 
-      tmp = gen_lowpart (QImode, operands[0]);
-      emit_insn (gen_cmpintqi (tmp));
-      emit_move_insn (operands[0], gen_rtx_SIGN_EXTEND (SImode, tmp));
-      DONE;
+  if (TARGET_64BIT)
+    while (todo >= 8)
+      {
+	rtx op1 = gen_rtx_MEM (DImode, XEXP (operands[1], 0));
+	MEM_COPY_ATTRIBUTES (op1, operands[1]);
+	if (done)
+	  op1 = offset_address (op1, GEN_INT (done), 8);
+
+	rtx tmp = gen_reg_rtx (DImode);
+	emit_insn (gen_bswapdi2 (tmp, op1));
+	op1 = tmp;
+
+	rtx op2 = gen_rtx_MEM (DImode, XEXP (operands[2], 0));
+	MEM_COPY_ATTRIBUTES (op2, operands[2]);
+	if (done)
+	  op2 = offset_address (op2, GEN_INT (done), 8);
+
+	tmp = gen_reg_rtx (DImode);
+	emit_insn (gen_bswapdi2 (tmp, op2));
+	op2 = tmp;
+
+	emit_insn (gen_cmp_1 (DImode, op1, op2));
+	needcmpint = true;
+
+	done += 8;
+	todo -= 8;
+	if (todo)
+	  {
+	    if (!labnz)
+	      labnz = gen_label_rtx ();
+	    LABEL_NUSES (labnz)++;
+	    ix86_expand_branch (NE, gen_rtx_REG (CCmode, FLAGS_REG),
+				const0_rtx, labnz);
+	  }
+      }
 
-    case 4:
-      op1 = gen_rtx_MEM (SImode, XEXP (operands[1], 0));
+  while (todo >= 4)
+    {
+      rtx op1 = gen_rtx_MEM (SImode, XEXP (operands[1], 0));
       MEM_COPY_ATTRIBUTES (op1, operands[1]);
+      if (done)
+	op1 = offset_address (op1, GEN_INT (done), 4);
 
-      tmp = gen_reg_rtx (SImode);
+      rtx tmp = gen_reg_rtx (SImode);
       emit_insn (gen_bswapsi2 (tmp, op1));
       op1 = tmp;
 
-      op2 = gen_rtx_MEM (SImode, XEXP (operands[2], 0));
+      rtx op2 = gen_rtx_MEM (SImode, XEXP (operands[2], 0));
       MEM_COPY_ATTRIBUTES (op2, operands[2]);
+      if (done)
+	op2 = offset_address (op2, GEN_INT (done), 4);
 
       tmp = gen_reg_rtx (SImode);
       emit_insn (gen_bswapsi2 (tmp, op2));
       op2 = tmp;
 
       emit_insn (gen_cmp_1 (SImode, op1, op2));
+      needcmpint = true;
 
-      tmp = gen_lowpart (QImode, operands[0]);
-      emit_insn (gen_cmpintqi (tmp));
-      emit_move_insn (operands[0], gen_rtx_SIGN_EXTEND (SImode, tmp));
-      DONE;
+      done += 4;
+      todo -= 4;
+      if (todo)
+	{
+	  if (!labnz)
+	    labnz = gen_label_rtx ();
+	  LABEL_NUSES (labnz)++;
+	  ix86_expand_branch (NE, gen_rtx_REG (CCmode, FLAGS_REG),
+			      const0_rtx, labnz);
+	}
+    }
 
-    case 2:
-      op1 = gen_rtx_MEM (HImode, XEXP (operands[1], 0));
+  if (todo >= 2)
+    {
+      rtx op1 = gen_rtx_MEM (HImode, XEXP (operands[1], 0));
       MEM_COPY_ATTRIBUTES (op1, operands[1]);
+      if (done)
+	op1 = offset_address (op1, GEN_INT (done), 4);
 
-      tmp = gen_reg_rtx (SImode);
+      rtx tmp = gen_reg_rtx (SImode);
       emit_insn (gen_zero_extendhisi2 (tmp, op1));
       emit_insn (gen_bswaphi_lowpart (gen_lowpart (HImode, tmp)));
       op1 = tmp;
 
-      op2 = gen_rtx_MEM (HImode, XEXP (operands[2], 0));
+      rtx op2 = gen_rtx_MEM (HImode, XEXP (operands[2], 0));
       MEM_COPY_ATTRIBUTES (op2, operands[2]);
+      if (done)
+	op2 = offset_address (op2, GEN_INT (done), 4);
 
       tmp = gen_reg_rtx (SImode);
       emit_insn (gen_zero_extendhisi2 (tmp, op2));
       emit_insn (gen_bswaphi_lowpart (gen_lowpart (HImode, tmp)));
       op2 = tmp;
 
-      emit_insn (gen_sub3_insn (operands[0], op1, op2));
-      DONE;
+      if (needcmpint)
+	emit_insn (gen_cmp_1 (SImode, op1, op2));
+      else
+	emit_insn (gen_subcmpsi3 (tmpout, op1, op2));
 
-    case 1:
-      op1 = gen_rtx_MEM (QImode, XEXP (operands[1], 0));
+      done += 2;
+      todo -= 2;
+      if (todo)
+	{
+	  rtx_code_label *lab = labnz;
+	  if (!needcmpint)
+	    lab = labfv = gen_label_rtx ();
+	  LABEL_NUSES (lab)++;
+	  ix86_expand_branch (NE, gen_rtx_REG (CCmode, FLAGS_REG),
+			      const0_rtx, lab);
+	}
+    }
+
+  if (todo >= 1)
+    {
+      rtx op1 = gen_rtx_MEM (QImode, XEXP (operands[1], 0));
       MEM_COPY_ATTRIBUTES (op1, operands[1]);
+      if (done)
+	op1 = offset_address (op1, GEN_INT (done), 2);
 
-      tmp = gen_reg_rtx (SImode);
+      rtx tmp = gen_reg_rtx (SImode);
       emit_insn (gen_zero_extendqisi2 (tmp, op1));
       op1 = tmp;
 
-      op2 = gen_rtx_MEM (QImode, XEXP (operands[2], 0));
+      rtx op2 = gen_rtx_MEM (QImode, XEXP (operands[2], 0));
       MEM_COPY_ATTRIBUTES (op2, operands[2]);
+      if (done)
+	op2 = offset_address (op2, GEN_INT (done), 2);
 
       tmp = gen_reg_rtx (SImode);
       emit_insn (gen_zero_extendqisi2 (tmp, op2));
       op2 = tmp;
 
-      emit_insn (gen_sub3_insn (operands[0], op1, op2));
-      DONE;
+      if (needcmpint)
+	emit_insn (gen_cmp_1 (SImode, op1, op2));
+      else
+	emit_insn (gen_subcmpsi3 (tmpout, op1, op2));
+
+      done += 1;
+      todo -= 1;
+    }
+  gcc_assert (!todo);
+
+  if (labnz)
+    emit_label (labnz);
+
+  if (needcmpint)
+    {
+      rtx tmp = gen_lowpart (QImode, tmpout);
+      emit_insn (gen_cmpintqi (tmp));
+      emit_move_insn (tmpout, gen_rtx_SIGN_EXTEND (SImode, tmp));
     }
 
-  FAIL;
+  if (labfv)
+    emit_label (labfv);
+
+  if (tmpout != operands[0])
+    emit_move_insn (operands[0], tmpout);
+
+  DONE;
 })
 
+;; Expand a "*sub_2" pattern with mode=SI.
+(define_expand "subcmpsi3"
+  [(parallel [(set (reg:CC FLAGS_REG)
+		   (compare:CC
+		     (match_operand:SI 1 "register_operand")
+		     (match_operand:SI 2 "register_operand")))
+	      (set (match_operand:SI 0 "register_operand")
+		   (minus:SI (match_dup 1) (match_dup 2)))])]
+  "")
+
 (define_expand "cmpstrnsi"
   [(set (match_operand:SI 0 "register_operand")
 	(compare:SI (match_operand:BLK 1 "general_operand")


DO NOT USE - FTR only - cmpsb-based cmpmemsi pattern for x86

I include this just for the record, in case someone wishes to compare
memcmp performance when implemented as 'repz cmpsb', the same insn
sequence used for strncmp, with the implementations in glibc or in the
patchset proposed above.

---
 gcc/config/i386/i386.md | 56 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 2b7469991d837..cd7b974c7e33f 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -16966,6 +16966,62 @@
 		      (const_string "*")))
    (set_attr "mode" "QI")])
 
+(define_expand "cmpmemsi"
+  [(set (match_operand:SI 0 "register_operand")
+	(compare:SI (match_operand:BLK 1 "general_operand")
+		    (match_operand:BLK 2 "general_operand")))
+   (use (match_operand 3 "general_operand"))
+   (use (match_operand 4 "immediate_operand"))]
+  ""
+{
+  rtx addr1, addr2, countreg, countout, align, out;
+
+  if (optimize_insn_for_size_p () && !TARGET_INLINE_ALL_STRINGOPS)
+    FAIL;
+
+  /* Can't use this if the user has appropriated ecx, esi or edi.  */
+  if (fixed_regs[CX_REG] || fixed_regs[SI_REG] || fixed_regs[DI_REG])
+    FAIL;
+
+  addr1 = copy_addr_to_reg (XEXP (operands[1], 0));
+  addr2 = copy_addr_to_reg (XEXP (operands[2], 0));
+  if (addr1 != XEXP (operands[1], 0))
+    operands[1] = replace_equiv_address_nv (operands[1], addr1);
+  if (addr2 != XEXP (operands[2], 0))
+    operands[2] = replace_equiv_address_nv (operands[2], addr2);
+
+  countreg = ix86_zero_extend_to_Pmode (operands[3]);
+  countout = gen_reg_rtx (Pmode);
+
+  /* %%% Iff we are testing strict equality, we can use known alignment
+     to good advantage.  This may be possible with combine, particularly
+     once cc0 is dead.  */
+  align = operands[4];
+
+  if (CONST_INT_P (operands[3]))
+    {
+      if (operands[3] == const0_rtx)
+	{
+	  emit_move_insn (operands[0], const0_rtx);
+	  DONE;
+	}
+      emit_insn (gen_cmpstrnqi_nz_1 (addr1, addr2, countout, align,
+				     operands[1], operands[2], countreg));
+    }
+  else
+    {
+      emit_insn (gen_cmp_1 (Pmode, countreg, countreg));
+      emit_insn (gen_cmpstrnqi_1 (addr1, addr2, countout, align,
+				  operands[1], operands[2], countreg));
+    }
+
+  out = gen_lowpart (QImode, operands[0]);
+  emit_insn (gen_cmpintqi (out));
+  emit_move_insn (operands[0], gen_rtx_SIGN_EXTEND (SImode, out));
+
+  DONE;
+})
+
 (define_expand "cmpstrnsi"
   [(set (match_operand:SI 0 "register_operand")
 	(compare:SI (match_operand:BLK 1 "general_operand")
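Coming back to the third patch above: the expansion it produces for a
fixed 12-byte memcmp corresponds roughly to the C below (a sketch with
a made-up name, not the generated RTL).  It does an 8-byte bswapped
compare with an early exit when the words differ, then the 4-byte tail,
and finally turns the comparison into memcmp's sign.

#include <stdint.h>
#include <string.h>

static int
memcmp12 (const void *p, const void *q)
{
  uint64_t a8, b8;
  memcpy (&a8, p, 8);
  memcpy (&b8, q, 8);
  a8 = __builtin_bswap64 (a8);
  b8 = __builtin_bswap64 (b8);
  if (a8 != b8)			/* early exit, like the NE branch */
    return (a8 > b8) - (a8 < b8);

  uint32_t a4, b4;
  memcpy (&a4, (const char *) p + 8, 4);
  memcpy (&b4, (const char *) q + 8, 4);
  a4 = __builtin_bswap32 (a4);
  b4 = __builtin_bswap32 (b4);
  return (a4 > b4) - (a4 < b4);
}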