From patchwork Sun Feb 5 00:42:13 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Hubicka X-Patchwork-Id: 1737587 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.a=rsa-sha256 header.s=default header.b=AirbnC/W; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4P8Vwc6bh7z23fc for ; Sun, 5 Feb 2023 11:42:39 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 5DCFD3858004 for ; Sun, 5 Feb 2023 00:42:36 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 5DCFD3858004 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1675557756; bh=T8TU1v9wJY93fPhJxBjh0H3xfIccouyj+1fHzRpi5oM=; h=Date:To:Subject:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=AirbnC/WQ7zBR4MDiQnXQ652MDZ3CwNc8hrwZ2LRF736+0mq4F4WK6u9NsO/KCzc1 Kpzx9ugvZaszxiGXhray3L0b32ZC0IvvBDnAAWgwey9jn1Hywqarn4h+n/IcVd7ej3 62njupLcZwMhcTbV9MRc3KmXEDgcGqA7QsjQNQRA= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from nikam.ms.mff.cuni.cz (nikam.ms.mff.cuni.cz [195.113.20.16]) by sourceware.org (Postfix) with ESMTPS id 3D4153858C2D for ; Sun, 5 
Feb 2023 00:42:16 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 3D4153858C2D
Received: by nikam.ms.mff.cuni.cz (Postfix, from userid 16202) id 94E1C281D76; Sun, 5 Feb 2023 01:42:13 +0100 (CET)
Date: Sun, 5 Feb 2023 01:42:13 +0100
To: gcc-patches@gcc.gnu.org, mjambor@suse.cz
Subject: Enable AVX512 512bit vectors by default on Zen4
Message-ID:
MIME-Version: 1.0
Content-Disposition: inline
X-Spam-Status: No, score=-11.2 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, GIT_PATCH_0, HEADER_FROM_DIFFERENT_DOMAINS, KAM_NUMSUBJECT, KAM_SHORT, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-Patchwork-Original-From: Jan Hubicka via Gcc-patches
From: Jan Hubicka
Reply-To: Jan Hubicka
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org
Sender: "Gcc-patches"

Hi,
this patch enables 512-bit AVX512 vectors by default on Zen4. While 512-bit
registers are internally split into two 256-bit halves, 512-bit vectors reduce
the number of instructions to retire and have a chance to improve parallelism.

There are a few tsvc benchmarks that improve significantly:

                runtime
benchmark   256bit  512bit
s2275        48.57   20.67   -58%
s311         32.29   16.06   -50%
s312         32.30   16.07   -50%
vsumr        32.30   16.07   -50%
s314         10.77    5.42   -50%
s313         21.52   10.85   -50%
vdotr        43.05   21.69   -50%
s316         10.80    5.64   -48%
s235         61.72   33.91   -45%
s161         15.91    9.95   -38%
s3251        32.13   20.31   -36%

No benchmark regresses beyond noise. The basic matrix multiplication loop
improves by 32% (for 1000x1000 matrices).

512-bit vectors are also expected to be more power efficient (I can't measure
that).
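To make the instruction-count argument concrete, here is a minimal sketch in C. The reduction is written in the spirit of a dot-product kernel such as vdotr, but it is not the TSVC source, and the helper split_loop and its struct are hypothetical names introduced only for illustration. The helper counts how many full vector iterations a loop over n floats needs at a given vector width and how many elements are left over for the remainder code.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative dot-product reduction (not the TSVC vdotr source).
   The auto-vectorizer is expected to turn this loop into vector
   multiplies and adds when the flags allow it. */
float dot_product(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hypothetical helper: how a loop over n floats splits into full
   vector iterations and leftover elements for a given vector width. */
struct loop_split {
    size_t vector_iters; /* iterations of the main vector loop */
    size_t leftover;     /* elements handled outside the main loop */
};

struct loop_split split_loop(size_t n, unsigned vector_bits)
{
    size_t lanes = vector_bits / (8 * sizeof(float)); /* floats per vector */
    assert(lanes > 0);
    struct loop_split s = { n / lanes, n % lanes };
    return s;
}
```

For n = 1000, split_loop reports 125 vector iterations at 256 bits and 62 at 512 bits (with 8 elements left over), which is in line with the roughly 50% runtime reductions in the table. To compare both settings on real code, one option is to build with -O3 -march=znver4 once as-is and once with -mprefer-vector-width=256, which should request the previous 256-bit preference; whether the reduction itself is vectorized may also depend on flags that permit floating-point reassociation, such as -Ofast.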

The downside is that loops with low trip counts get slower for iteration
ranges where the epilogue is hit more often. In SPEC this problem happens
with x264 (12% regression) and bwaves (6% regression). This is tracked in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410
and will need more work on the vectorizer to support masked epilogues.

After some additional testing it seems that using 512-bit vectors by default
is now the better choice overall.

Bootstrapped/regtested x86_64-linux. Plan to commit it tomorrow.

	* config/i386/x86-tune.def (X86_TUNE_AVX256_OPTIMAL): Turn off
	for znver4.

diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index c78dad07c88..3054656a12c 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -551,7 +551,7 @@ DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2
 
 /* X86_TUNE_AVX256_OPTIMAL: Use 256-bit AVX instructions instead of 512-bit AVX
    instructions in the auto-vectorizer.  */
-DEF_TUNE (X86_TUNE_AVX256_OPTIMAL, "avx256_optimal", m_CORE_AVX512 | m_ZNVER4)
+DEF_TUNE (X86_TUNE_AVX256_OPTIMAL, "avx256_optimal", m_CORE_AVX512)
 
 /* X86_TUNE_AVX256_SPLIT_REGS: if true, AVX512 ops are split into two AVX256 ops.  */
 DEF_TUNE (X86_TUNE_AVX512_SPLIT_REGS, "avx512_split_regs", m_ZNVER4)
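
For readers unfamiliar with the tuning table, the effect of the one-line change can be modelled as a bitmask test. The following is only a simplified sketch: the bit values, the uppercase M_* names and the function tune_enabled are hypothetical stand-ins, not the definitions used in GCC's i386 backend.

```c
#include <stdbool.h>

/* Hypothetical processor bits; GCC defines the real m_* masks elsewhere. */
enum {
    M_CORE_AVX512 = 1u << 0,
    M_ZNVER4      = 1u << 1
};

/* A tuning flag applies to a CPU when that CPU's bit is set in the
   mask given as the third DEF_TUNE argument. */
bool tune_enabled(unsigned tune_mask, unsigned cpu_bit)
{
    return (tune_mask & cpu_bit) != 0;
}

/* Mask of avx256_optimal before the patch. */
unsigned avx256_optimal_before(void)
{
    return M_CORE_AVX512 | M_ZNVER4;
}

/* Mask of avx256_optimal after the patch: znver4 is no longer listed. */
unsigned avx256_optimal_after(void)
{
    return M_CORE_AVX512;
}
```

In this model, tune_enabled(avx256_optimal_before(), M_ZNVER4) is true and tune_enabled(avx256_optimal_after(), M_ZNVER4) is false, so the auto-vectorizer no longer prefers 256-bit vectors when tuning for znver4, while the entry for the CPUs covered by m_CORE_AVX512 is unchanged.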