From patchwork Fri Apr 1 06:46:34 2022
X-Patchwork-Submitter: liuhongt
X-Patchwork-Id: 1612064
From: liuhongt
To: gcc-patches@gcc.gnu.org
Subject: [PATCH] Split vector load from parm_decl to elemental loads to avoid STLF stalls.
Date: Fri, 1 Apr 2022 14:46:34 +0800
Message-Id: <20220401064634.16091-1-hongtao.liu@intel.com>

Update in V2:
1. Use get_insns instead of FOR_EACH_BB_CFUN and FOR_BB_INSNS.
2. Return for any_uncondjump_p and ANY_RETURN_P.
3. Add dump info for splitting instruction.
4. Restrict ix86_split_stlf_stall_load under TARGET_SSE2.

Since cfg is freed before machine_reorg, just do a rough calculation
of the window according to the layout.
Also according to an experiment on CLX, set window size to 64.

Currently only handle V2DFmode load since it doesn't need any scratch
registers, and it's sufficient to recover cray performance for -O2
compared to GCC11.

gcc/ChangeLog:

	PR target/101908
	* config/i386/i386.cc (ix86_split_stlf_stall_load): New
	function
	(ix86_reorg): Call ix86_split_stlf_stall_load.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pr101908-1.c: New test.
	* gcc.target/i386/pr101908-2.c: New test.
---
 gcc/config/i386/i386.cc                    | 60 ++++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr101908-1.c | 12 +++++
 gcc/testsuite/gcc.target/i386/pr101908-2.c | 12 +++++
 3 files changed, 84 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101908-2.c

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 5a561966eb4..c88a689f32b 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -21933,6 +21933,64 @@ ix86_seh_fixup_eh_fallthru (void)
       emit_insn_after (gen_nops (const1_rtx), insn);
     }
 }
+/* Split vector load from parm_decl to elemental loads to avoid STLF
+   stalls.  */
+static void
+ix86_split_stlf_stall_load ()
+{
+  rtx_insn* insn, *start = get_insns ();
+  unsigned window = 0;
+
+  for (insn = start; insn; insn = NEXT_INSN (insn))
+    {
+      if (!NONDEBUG_INSN_P (insn))
+        continue;
+      window++;
+      /* Insert 64 vaddps %xmm18, %xmm19, %xmm20(no dependence between each
+         other, just emulate for pipeline) before stalled load, stlf stall
+         case is as fast as no stall cases on CLX.
+         Since CFG is freed before machine_reorg, just do a rough
+         calculation of the window according to the layout.  */
+      if (window > 64)
+        return;
+
+      if (any_uncondjump_p (insn)
+          || ANY_RETURN_P (PATTERN (insn)))
+        return;
+
+      rtx set = single_set (insn);
+      if (!set)
+        continue;
+      rtx src = SET_SRC (set);
+      if (!MEM_P (src)
+          /* Only handle V2DFmode load since it doesn't need any scratch
+             register.  */
+          || GET_MODE (src) != E_V2DFmode
+          || !MEM_EXPR (src)
+          || TREE_CODE (get_base_address (MEM_EXPR (src))) != PARM_DECL)
+        continue;
+
+      rtx zero = CONST0_RTX (V2DFmode);
+      rtx dest = SET_DEST (set);
+      rtx m = adjust_address (src, DFmode, 0);
+      rtx loadlpd = gen_sse2_loadlpd (dest, zero, m);
+      emit_insn_before (loadlpd, insn);
+      m = adjust_address (src, DFmode, 8);
+      rtx loadhpd = gen_sse2_loadhpd (dest, dest, m);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+        {
+          fputs ("Due to potential STLF stall, split instruction:\n",
+                 dump_file);
+          print_rtl_single (dump_file, insn);
+          fputs ("To:\n", dump_file);
+          print_rtl_single (dump_file, loadlpd);
+          print_rtl_single (dump_file, loadhpd);
+        }
+      PATTERN (insn) = loadhpd;
+      INSN_CODE (insn) = -1;
+      gcc_assert (recog_memoized (insn) != -1);
+    }
+}
 
 /* Implement machine specific optimizations.  We implement padding of returns
    for K8 CPUs and pass to avoid 4 jumps in the single 16 byte window.  */
@@ -21948,6 +22006,8 @@ ix86_reorg (void)
 
   if (optimize && optimize_function_for_speed_p (cfun))
     {
+      if (TARGET_SSE2)
+        ix86_split_stlf_stall_load ();
       if (TARGET_PAD_SHORT_FUNCTION)
         ix86_pad_short_function ();
       else if (TARGET_PAD_RETURNS)
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-1.c b/gcc/testsuite/gcc.target/i386/pr101908-1.c
new file mode 100644
index 00000000000..33d9684f0ad
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-1.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+/* { dg-final { scan-assembler-not {(?n)movhpd[ \t]} } } */
+
+struct X { double x[2]; };
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df __attribute__((noipa))
+foo (struct X* x, struct X* y)
+{
+  return (v2df) {x->x[1], x->x[0] } + (v2df) { y->x[1], y->x[0] };
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101908-2.c b/gcc/testsuite/gcc.target/i386/pr101908-2.c
new file mode 100644
index 00000000000..45060b73c06
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101908-2.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+/* { dg-final { scan-assembler-times {(?n)movhpd[ \t]+} "2" } } */
+
+struct X { double x[4]; };
+typedef double v2df __attribute__((vector_size(16)));
+
+v2df __attribute__((noipa))
+foo (struct X x, struct X y)
+{
+  return (v2df) {x.x[1], x.x[0] } + (v2df) { y.x[1], y.x[0] };
+}
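
Note for readers: the window scan in ix86_split_stlf_stall_load can be
modeled outside of GCC. The following is only a rough standalone sketch
and is not part of the patch. The toy_kind enum, STLF_WINDOW and
toy_mark_split_loads are hypothetical names invented for this
illustration, and the model replaces RTL with a flat array of
instruction kinds. It mirrors the control flow above: debug insns are
skipped without counting, the scan stops once more than 64 insns have
been seen or at the first unconditional jump or return, and only
V2DF loads based on a PARM_DECL are marked for splitting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction kinds for this toy model; not GCC RTL.  */
enum toy_kind
{
  TOY_DEBUG,           /* stands in for !NONDEBUG_INSN_P */
  TOY_OTHER,           /* any insn the scan does not care about */
  TOY_V2DF_PARM_LOAD,  /* V2DFmode load whose base is a PARM_DECL */
  TOY_JUMP,            /* stands in for any_uncondjump_p */
  TOY_RETURN           /* stands in for ANY_RETURN_P */
};

/* Window size taken from the patch description (experiment on CLX).  */
#define STLF_WINDOW 64

/* Walk INSNS in layout order and set SPLIT[i] for every load the patch
   would split.  Returns the number of loads marked.  */
static size_t
toy_mark_split_loads (const enum toy_kind *insns, size_t n, bool *split)
{
  size_t window = 0, marked = 0;

  assert (insns && split);
  for (size_t i = 0; i < n; i++)
    split[i] = false;

  for (size_t i = 0; i < n; i++)
    {
      if (insns[i] == TOY_DEBUG)
        continue;                 /* debug insns are not counted */
      if (++window > STLF_WINDOW)
        break;                    /* past the window: stop scanning */
      if (insns[i] == TOY_JUMP || insns[i] == TOY_RETURN)
        break;                    /* layout no longer approximates flow */
      if (insns[i] == TOY_V2DF_PARM_LOAD)
        {
          split[i] = true;        /* would become loadlpd + loadhpd */
          marked++;
        }
    }
  return marked;
}
```

In this model a load sitting at the 64th non-debug position is still
marked, while one at the 65th is not, and nothing after the first jump
or return is considered, which matches the "rough calculation of the
window according to the layout" described in the commit message.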