From patchwork Thu Mar 16 20:39:11 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "H.J. Lu" X-Patchwork-Id: 739986 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 3vkgLq2YlYz9ryQ for ; Fri, 17 Mar 2017 07:39:31 +1100 (AEDT) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.b="mHt66nla"; dkim-atps=neutral DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:date:from:to:subject:message-id:reply-to :mime-version:content-type; q=dns; s=default; b=SUmQJXDsGdXhEOtx zzVWuKI2LlEY1b5pmfqqTonN8icjispscFZZ/ueuLAiKy0qlyW5ppT6Hh6p7iX6S vX/Tfa8qraPdmQv7QPTq7Cwrui0B9qWkg3otwszbNSwL6InS2DlBAci/Sl66nOEm oRZIQukYDwPDganRPzZghW//LtM= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:date:from:to:subject:message-id:reply-to :mime-version:content-type; s=default; bh=hbc08O4GSDy2GXdL0VikI3 cOKF0=; b=mHt66nlaRNKiLfY1gOrNrN+ry3B30VaKxDwLqVi6yrAHznKJ9RDLmb sNl+GRq+EQu3LJdeciDsTn4DwuQzx+6IbeFUnm0ag94m8luWxeOSdYjMbFO7v4sM BPfQAjH7MKObo9q23W5M8QmEVeOAlJAtGWJM/rRnfBqrktt6yzVY0= Received: (qmail 67882 invoked by alias); 16 Mar 2017 20:39:14 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Delivered-To: mailing list libc-alpha@sourceware.org Received: (qmail 67681 invoked by uid 89); 16 Mar 2017 20:39:14 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-23.5 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_LAZY_DOMAIN_SECURITY, NO_DNS_FOR_FROM, RP_MATCHES_RCVD autolearn=ham version=3.3.2 spammy=1454, 879 X-HELO: mga05.intel.com X-ExtLoop1: 1 Date: Thu, 16 Mar 2017 13:39:11 -0700 From: "H.J. Lu" To: GNU C Library Subject: [PATCH] x86-64: Improve branch predication in _dl_runtime_resolve_avx512_opt [BZ #21258] Message-ID: <20170316203911.GA26261@intel.com> Reply-To: "H.J. Lu" MIME-Version: 1.0 Content-Disposition: inline User-Agent: Mutt/1.7.1 (2016-10-04) On Skylake server, _dl_runtime_resolve_avx512_opt is used to preserve the first 8 vector registers. The code layout is if only %xmm0 - %xmm7 registers are used preserve %xmm0 - %xmm7 registers if only %ymm0 - %ymm7 registers are used preserve %ymm0 - %ymm7 registers preserve %zmm0 - %zmm7 registers Branch predication always executes the fallthrough code path to preserve %zmm0 - %zmm7 registers speculatively, even though only %xmm0 - %xmm7 registers are used. This leads to lower CPU frequency on Skylake server. This patch changes the fallthrough code path to preserve %xmm0 - %xmm7 registers instead: if whole %zmm0 - %zmm7 registers are used preserve %zmm0 - %zmm7 registers if only %ymm0 - %ymm7 registers are used preserve %ymm0 - %ymm7 registers preserve %xmm0 - %xmm7 registers Tested on Skylake server. Any comments? H.J. --- [BZ #21258] * sysdeps/x86_64/dl-trampoline.S (_dl_runtime_resolve_opt): Define only if _dl_runtime_resolve is defined to _dl_runtime_resolve_sse_vex. * sysdeps/x86_64/dl-trampoline.h (_dl_runtime_resolve_opt): Fallthrough to _dl_runtime_resolve_sse_vex. --- sysdeps/x86_64/dl-trampoline.S | 3 +-- sysdeps/x86_64/dl-trampoline.h | 9 +++++---- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/sysdeps/x86_64/dl-trampoline.S b/sysdeps/x86_64/dl-trampoline.S index 33d7fcf..c14c61a 100644 --- a/sysdeps/x86_64/dl-trampoline.S +++ b/sysdeps/x86_64/dl-trampoline.S @@ -87,11 +87,9 @@ #endif #define VEC(i) zmm##i #define _dl_runtime_resolve _dl_runtime_resolve_avx512 -#define _dl_runtime_resolve_opt _dl_runtime_resolve_avx512_opt #define _dl_runtime_profile _dl_runtime_profile_avx512 #include "dl-trampoline.h" #undef _dl_runtime_resolve -#undef _dl_runtime_resolve_opt #undef _dl_runtime_profile #undef VEC #undef VMOV @@ -145,4 +143,5 @@ # define VMOV vmovdqu #endif #define _dl_runtime_resolve _dl_runtime_resolve_sse_vex +#define _dl_runtime_resolve_opt _dl_runtime_resolve_avx512_opt #include "dl-trampoline.h" diff --git a/sysdeps/x86_64/dl-trampoline.h b/sysdeps/x86_64/dl-trampoline.h index b27fa06..8db24c1 100644 --- a/sysdeps/x86_64/dl-trampoline.h +++ b/sysdeps/x86_64/dl-trampoline.h @@ -129,19 +129,20 @@ _dl_runtime_resolve_opt: # YMM state isn't in use. PRESERVE_BND_REGS_PREFIX jz _dl_runtime_resolve_sse_vex -# elif VEC_SIZE == 64 +# elif VEC_SIZE == 16 # For ZMM registers, check if YMM state and ZMM state are in # use. andl $(bit_YMM_state | bit_ZMM0_15_state), %r11d cmpl $bit_YMM_state, %r11d - # Preserve %xmm0 - %xmm7 registers with the zero upper 384 bits if - # neither YMM state nor ZMM state are in use. + # Preserve %zmm0 - %zmm7 registers if ZMM state is in use. PRESERVE_BND_REGS_PREFIX - jl _dl_runtime_resolve_sse_vex + jg _dl_runtime_resolve_avx512 # Preserve %ymm0 - %ymm7 registers with the zero upper 256 bits if # ZMM state isn't in use. PRESERVE_BND_REGS_PREFIX je _dl_runtime_resolve_avx + # Preserve %xmm0 - %xmm7 registers with the zero upper 384 bits if + # neither YMM state nor ZMM state are in use. # else # error Unsupported VEC_SIZE! # endif