From patchwork Tue Nov 12 07:39:31 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hongtao Liu
X-Patchwork-Id: 1193358
From: Hongtao Liu
Date: Tue, 12 Nov 2019 15:39:31 +0800
Subject: [PATCH] Set AVX128_OPTIMAL for all avx targets.
To: Richard Biener
Cc: "H. J. Lu", GCC Patches, Uros Bizjak

Hi:
  This patch sets X86_TUNE_AVX128_OPTIMAL by default for all AVX
targets, because we found there is still a performance gap between
128-bit auto-vectorization and 256-bit auto-vectorization, even with
the epilogue vectorized.
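As a rough illustration of the kind of loop where such a gap can show up, here is a minimal sketch. The kernel and the name add_rows are hypothetical and are not taken from SPEC or from this patch. It assumes the inner trip count n is only known at run time and is small in practice (for example 4 ints): one 128-bit vector covers 4 ints, while a 256-bit vector needs 8, so the 256-bit main loop would never run and all the work would fall to the epilogue.

```c
#include <stddef.h>

/* Hypothetical kernel (an assumption for illustration, not SPEC code).
   The inner trip count n is not known at compile time; if it is small
   at run time (e.g. 4), a 128-bit vector of ints covers one full inner
   iteration, while a 256-bit vector (8 ints) is never filled.  */
void
add_rows (int *restrict dst, const int *restrict src,
	  size_t rows, size_t n)
{
  for (size_t r = 0; r < rows; r++)
    for (size_t i = 0; i < n; i++)
      dst[r * n + i] += src[r * n + i];
}
```

One way to compare the generated code is to build this with "-O3 -mavx2 -mprefer-vector-width=128 -S" and again with "-mprefer-vector-width=256", then inspect the assembly; the exact output depends on the compiler version.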
The performance impact of making avx128_optimal the default, measured
on SPEC2017 with options "-march=native -funroll-loops -Ofast -flto"
on CLX, is as below:

INT rate
500.perlbench_r        -0.32%
502.gcc_r              -1.32%
505.mcf_r              -0.12%
520.omnetpp_r          -0.34%
523.xalancbmk_r        -0.65%
525.x264_r              2.23%
531.deepsjeng_r         0.81%
541.leela_r            -0.02%
548.exchange2_r        10.89%  ----------> big improvement
557.xz_r                0.38%
geomean for intrate     1.10%

FP rate
503.bwaves_r            1.41%
507.cactuBSSN_r        -0.14%
508.namd_r              1.54%
510.parest_r           -0.87%
511.povray_r            0.28%
519.lbm_r               0.32%
521.wrf_r              -0.54%
526.blender_r           0.59%
527.cam4_r             -2.70%
538.imagick_r           3.92%
544.nab_r               0.59%
549.fotonik3d_r        -5.44%  -------------> regression
554.roms_r             -2.34%
geomean for fprate     -0.28%

The 10% improvement of 548.exchange2_r comes from a 9-level loop nest
whose innermost loop has a small trip count (enough for 128-bit
vectorization, but not for 256-bit vectorization). Since the trip
count cannot be determined statically, the vectorizer chooses 256-bit
vectorization, and the 256-bit vector loop is then never executed.
Vectorizing the epilogue introduces some extra instructions. Normally
that recovers some performance, but in a 9-level loop nest the cost of
the extra instructions outweighs the gain.

The 5.44% regression of 549.fotonik3d_r is because 256-bit
vectorization is better than 128-bit vectorization for this benchmark.
Generally, enabling 256-bit or 512-bit vectorization reduces the
number of instruction clockticks but also reduces the core frequency.
When the frequency reduction is smaller than the clocktick reduction,
the longer vector width is better than the shorter one; otherwise the
opposite holds. The regression of 549.fotonik3d_r is due to this, and
the same applies to 554.roms_r and 527.cam4_r. For those 3 benchmarks,
512-bit vectorization is best.

Bootstrapped and regression tested on i386.
Ok for trunk?

Changelog
gcc/
	* config/i386/i386-options.c (m_CORE_AVX): New macro.
	* config/i386/x86-tune.def: Enable 128_optimal for avx and
	replace m_SANDYBRIDGE | m_CORE_AVX2 with m_CORE_AVX.
	* testsuite/gcc.target/i386/pr84413-1.c: Adjust testcase.
	* testsuite/gcc.target/i386/pr84413-2.c: Ditto.
	* testsuite/gcc.target/i386/pr84413-3.c: Ditto.
	* testsuite/gcc.target/i386/pr70021.c: Ditto.
	* testsuite/gcc.target/i386/pr90579.c: New test.

From a02d5c896600c4c80765f375d531c5412a778145 Mon Sep 17 00:00:00 2001
From: liuhongt
Date: Wed, 6 Nov 2019 09:36:57 +0800
Subject: [PATCH] Enable 128-bit auto-vectorization for avx

Performance impact test on CLX8280 with best perf option
-Ofast -march=native -funroll-loops -flto -mfpmath=sse.

INT rate
500.perlbench_r        -0.32%
502.gcc_r              -1.32%
505.mcf_r              -0.12%
520.omnetpp_r          -0.34%
523.xalancbmk_r        -0.65%
525.x264_r              2.23%
531.deepsjeng_r         0.81%
541.leela_r            -0.02%
548.exchange2_r        10.89%
557.xz_r                0.38%
geomean for intrate     1.10%

FP rate
503.bwaves_r            1.41%
507.cactuBSSN_r        -0.14%
508.namd_r              1.54%
510.parest_r           -0.87%
511.povray_r            0.28%
519.lbm_r               0.32%
521.wrf_r              -0.54%
526.blender_r           0.59%
527.cam4_r             -2.70%
538.imagick_r           3.92%
544.nab_r               0.59%
549.fotonik3d_r        -5.44%
554.roms_r             -2.34%
geomean for fprate     -0.28%

Changelog
gcc/
	* config/i386/i386-options.c (m_CORE_AVX): New macro.
	* config/i386/x86-tune.def: Enable 128_optimal for avx and
	replace m_SANDYBRIDGE | m_CORE_AVX2 with m_CORE_AVX.
	* testsuite/gcc.target/i386/pr84413-1.c: Adjust testcase.
	* testsuite/gcc.target/i386/pr84413-2.c: Ditto.
	* testsuite/gcc.target/i386/pr84413-3.c: Ditto.
	* testsuite/gcc.target/i386/pr70021.c: Ditto.
	* testsuite/gcc.target/i386/pr90579.c: New test.
---
 gcc/config/i386/i386-options.c            |  1 +
 gcc/config/i386/x86-tune.def              | 24 +++++++++++------------
 gcc/testsuite/gcc.target/i386/pr70021.c   |  2 +-
 gcc/testsuite/gcc.target/i386/pr84413-1.c |  4 ++--
 gcc/testsuite/gcc.target/i386/pr84413-2.c |  4 ++--
 gcc/testsuite/gcc.target/i386/pr84413-3.c |  4 ++--
 gcc/testsuite/gcc.target/i386/pr90579.c   | 20 +++++++++++++++++++
 7 files changed, 40 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90579.c

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index dfc8ae23ba0..7277f74e360 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -127,6 +127,7 @@ along with GCC; see the file COPYING3. If not see
 		       | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_CASCADELAKE \
 		       | m_TIGERLAKE | m_COOPERLAKE)
 #define m_CORE_AVX2 (m_HASWELL | m_SKYLAKE | m_CORE_AVX512)
+#define m_CORE_AVX (m_SANDYBRIDGE | m_CORE_AVX2)
 #define m_CORE_ALL (m_CORE2 | m_NEHALEM | m_SANDYBRIDGE | m_CORE_AVX2)
 #define m_GOLDMONT (HOST_WIDE_INT_1U<