From patchwork Tue Jan 6 14:29:39 2015
X-Patchwork-Id: 425694
Date: Tue, 6 Jan 2015 15:29:39 +0100
From: Ondřej Bílka
To: hjl.tools@gmail.com
Cc: libc-alpha@sourceware.org
Subject: [PATCH][BZ #17801] Fix memcpy regression (five times slower on bulldozer.)
Message-ID: <20150106142939.GB5835@domone>

H. J., a performance regression slipped through review in this commit:

commit 05f3633da4f9df870d04dd77336e793746e57ed4
Author: Ling Ma
Date:   Mon Jul 14 00:02:52 2014 -0400

    Improve 64bit memcpy performance for Haswell CPU with AVX instruction

I seem to recall mentioning that "avx" looked like a typo for "avx2", but I did not look into it further. Since I assumed the variant was AVX2-only, I was fine with it and with the Haswell-specific optimizations such as using rep movsq. However, the ifunc selector checks only for AVX, which is bad: we already know that AVX loads/stores are slow on Sandy Bridge. Testing on the affected architectures would also have revealed the problem. It is worst on AMD Bulldozer, where the new code is five times slower on the 2KB-16KB range because movsb is slow there; see
http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/memcpy_profile_avx/results_rand/result.html
On Sandy Bridge it is only a 20% regression on the same range:
http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_avx/results_rand/result.html
The AVX loop for 128-2024 bytes is also slower there, so there is no point in using it.

What about the following change?

	* sysdeps/x86_64/multiarch/memcpy.S: Fix performance regression.
diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
index 992e40d..27f89e4 100644
--- a/sysdeps/x86_64/multiarch/memcpy.S
+++ b/sysdeps/x86_64/multiarch/memcpy.S
@@ -32,10 +32,13 @@ ENTRY(__new_memcpy)
 	cmpl	$0, KIND_OFFSET+__cpu_features(%rip)
 	jne	1f
 	call	__init_cpu_features
+#ifdef HAVE_AVX2_SUPPORT
 1:	leaq	__memcpy_avx_unaligned(%rip), %rax
-	testl	$bit_AVX_Usable, __cpu_features+FEATURE_OFFSET+index_AVX_Usable(%rip)
+	testl	$bit_AVX2_Usable, __cpu_features+FEATURE_OFFSET+index_AVX2_Usable(%rip)
+	jz	1f
 	ret
+#endif
 1:	leaq	__memcpy_sse2(%rip), %rax
 	testl	$bit_Slow_BSF, __cpu_features+FEATURE_OFFSET+index_Slow_BSF(%rip)
 	jnz	2f
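
To spell out what the selector change amounts to, here is a rough standalone C sketch of the condition; it is only an illustration, not glibc code (the real resolver is the assembly above, which tests bit_AVX2_Usable in __cpu_features, and HAVE_AVX2_SUPPORT is the configure-time macro from glibc's config.h, so define it by hand when building this sketch), with GCC's __builtin_cpu_supports standing in for the feature bit:

/* Illustration only -- not glibc's resolver.  Build with e.g.
   gcc -DHAVE_AVX2_SUPPORT sketch.c -o sketch  */
#include <stdio.h>

int
main (void)
{
#ifdef HAVE_AVX2_SUPPORT
  /* Only take the AVX unaligned variant when AVX2 is present,
     i.e. on Haswell and later, where it was actually tuned.  */
  if (__builtin_cpu_supports ("avx2"))
    {
      puts ("would pick __memcpy_avx_unaligned");
      return 0;
    }
#endif
  /* Otherwise keep the SSE2/SSSE3 selection exactly as before.  */
  puts ("would fall back to the SSE2/SSSE3 variants");
  return 0;
}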
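
For anyone who wants to double-check the numbers above on their own machine, a crude timing loop over the affected size range is enough to show the cliff. The buffer sizes and iteration counts below are my own arbitrary choices, and it only times whatever memcpy the installed libc resolves to, so it has to be run against builds with and without the change to compare:

/* Crude timing sketch, not a proper benchmark.
   Build with: gcc -O2 bench.c -o bench  (add -lrt on older glibc).  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main (void)
{
  enum { MAXLEN = 16384, ITERS = 100000 };
  char *src = malloc (MAXLEN);
  char *dst = malloc (MAXLEN);
  volatile char sink = 0;

  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 1, MAXLEN);

  for (size_t n = 2048; n <= MAXLEN; n *= 2)
    {
      struct timespec t0, t1;
      clock_gettime (CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < ITERS; i++)
        {
          memcpy (dst, src, n);
          sink += dst[n - 1];   /* keep the copies observable */
        }
      clock_gettime (CLOCK_MONOTONIC, &t1);
      double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
      printf ("%6zu bytes: %.1f ns per copy\n", n, ns / ITERS);
    }
  return 0;
}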