From patchwork Thu Jun  7 01:57:54 2018
X-Patchwork-Submitter: Simon Guo
X-Patchwork-Id: 926086
From: wei.guo.simon@gmail.com
To: linuxppc-dev@lists.ozlabs.org
Cc: "Naveen N. Rao", Simon Guo, Cyril Bur
Subject: [PATCH v8 4/5] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp()
Date: Thu, 7 Jun 2018 09:57:54 +0800
Message-Id: <1528336675-10879-5-git-send-email-wei.guo.simon@gmail.com>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1528336675-10879-1-git-send-email-wei.guo.simon@gmail.com>
References: <1528336675-10879-1-git-send-email-wei.guo.simon@gmail.com>

From: Simon Guo

This patch is based on the previous VMX memcmp() patch.

Optimizing ppc64 memcmp() with VMX instructions comes with a penalty:
when the kernel uses VMX instructions, it must save/restore the current
thread's VMX registers. PPC has 32 x 128-bit VMX registers, which means
32 x 16 = 512 bytes to load and store.

The major concern for in-kernel memcmp() performance is KSM, which
calls memcmp() frequently to merge identical pages, so it makes sense
to focus on KSM to see whether any improvement can be made there. In
the following mail, Cyril Bur points out that memcmp() calls from KSM
are likely to fail (mismatch) early, within the first few bytes:
https://patchwork.ozlabs.org/patch/817322/#1773629

This patch follows up on that observation.

Testing shows that KSM memcmp() tends to fail within the first 32
bytes. More specifically:
- 76% of cases fail/mismatch before 16 bytes;
- 83% of cases fail/mismatch before 32 bytes;
- 84% of cases fail/mismatch before 64 bytes.

So 32 bytes looks like the best choice for the pre-check size.

Early failure also holds for memcmp() outside the KSM case: with a
non-typical call load, ~73% of cases fail before the first 32 bytes.

This patch adds a 32-byte pre-check before jumping into VMX operations,
avoiding the unnecessary VMX penalty. It is not limited to the KSM
case, and testing shows a ~20% improvement in average memcmp()
execution time with this patch.

Note that the 32B pre-check is only performed when the compare size is
long enough (>= 4K currently) to allow VMX operation.

Detailed data and analysis are at:
https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md

Signed-off-by: Simon Guo
---
 arch/powerpc/lib/memcmp_64.S | 57 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 46 insertions(+), 11 deletions(-)
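For illustration, the pre-check added below is conceptually equivalent
to the following C sketch. This is a sketch only: memcmp_with_precheck()
and memcmp_vmx() are hypothetical names, and memcmp_vmx() here is a
trivial stand-in for ENTER_VMX_OPS plus the vectorized compare loop,
not a kernel function.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the VMX compare path. */
static int memcmp_vmx(const void *a, const void *b, size_t n)
{
	return memcmp(a, b, n);
}

/*
 * Conceptual C equivalent of the 32-byte pre-check.  The surrounding
 * code guarantees len >= VMX_THRESH (4K) before this path is taken.
 */
static int memcmp_with_precheck(const void *s1, const void *s2, size_t len)
{
	const unsigned char *a = s1, *b = s2;
	size_t i;

	/* Compare the first 32 bytes (4 x 8-byte words) with cheap
	 * scalar loads.  Per the statistics above, roughly 80% of
	 * calls return here and never pay the 512-byte VMX
	 * save/restore penalty. */
	for (i = 0; i < 32; i += 8) {
		uint64_t wa, wb;

		memcpy(&wa, a + i, sizeof(wa));	/* like the LD pair */
		memcpy(&wb, b + i, sizeof(wb));
		if (wa != wb)			/* like cmpld/bne */
			return memcmp(a + i, b + i, 8);
	}

	/* All 32 bytes matched: the VMX entry cost is now worth it. */
	return memcmp_vmx(a + 32, b + 32, len - 32);
}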
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index be2f792..844d8e7 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -404,8 +404,27 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 #ifdef CONFIG_ALTIVEC
 .Lsameoffset_vmx_cmp:
 	/* Enter with src/dst addrs has the same offset with 8 bytes
-	 * align boundary
+	 * align boundary.
+	 *
+	 * There is an optimization based on the following fact: memcmp()
+	 * tends to fail early, within the first 32 bytes.
+	 * Before applying VMX instructions, which incur a 32x128-bit
+	 * VMX register load/restore penalty, we compare the first 32
+	 * bytes so that we can catch the ~80% failing cases.
 	 */
+
+	li	r0,4
+	mtctr	r0
+.Lsameoffset_prechk_32B_loop:
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	addi	r3,r3,8
+	addi	r4,r4,8
+	bne	cr0,.LcmpAB_lightweight
+	addi	r5,r5,-8
+	bdnz	.Lsameoffset_prechk_32B_loop
+
 	ENTER_VMX_OPS
 	beq	cr1,.Llong_novmx_cmp

@@ -482,16 +501,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 #endif

 .Ldiffoffset_8bytes_make_align_start:
-#ifdef CONFIG_ALTIVEC
-BEGIN_FTR_SECTION
-	/* only do vmx ops when the size equal or greater than 4K bytes */
-	cmpdi	cr5,r5,VMX_THRESH
-	bge	cr5,.Ldiffoffset_vmx_cmp
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
-
-.Ldiffoffset_novmx_cmp:
-#endif
-
 	/* now try to align s1 with 8 bytes */
 	rlwinm	r6,r3,3,26,28
 	beq	.Ldiffoffset_align_s1_8bytes
@@ -515,6 +524,17 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)

 .Ldiffoffset_align_s1_8bytes:
 	/* now s1 is aligned with 8 bytes. */
+#ifdef CONFIG_ALTIVEC
+BEGIN_FTR_SECTION
+	/* only do vmx ops when the size equal or greater than 4K bytes */
+	cmpdi	cr5,r5,VMX_THRESH
+	bge	cr5,.Ldiffoffset_vmx_cmp
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
+
+.Ldiffoffset_novmx_cmp:
+#endif
+
+
 	cmpdi	cr5,r5,31
 	ble	cr5,.Lcmp_lt32bytes

@@ -526,6 +546,21 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)

 #ifdef CONFIG_ALTIVEC
 .Ldiffoffset_vmx_cmp:
+	/* Perform a 32-byte pre-check before
+	 * enabling VMX operations.
+	 */
+	li	r0,4
+	mtctr	r0
+.Ldiffoffset_prechk_32B_loop:
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	addi	r3,r3,8
+	addi	r4,r4,8
+	bne	cr0,.LcmpAB_lightweight
+	addi	r5,r5,-8
+	bdnz	.Ldiffoffset_prechk_32B_loop
+
 	ENTER_VMX_OPS
 	beq	cr1,.Ldiffoffset_novmx_cmp
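The second and third hunks also move the diffoffset path's VMX_THRESH
test from before to after the code that aligns s1 to 8 bytes, so the
new pre-check loop at .Ldiffoffset_vmx_cmp runs with an aligned s1.
The resulting control flow can be sketched in C as below; all helper
names here are hypothetical, with trivial stand-in bodies, and are not
functions from the kernel source.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VMX_THRESH 4096

/* Trivial stand-ins so the sketch compiles; the real paths are asm. */
static int cpu_has_altivec(void) { return 1; }
static int memcmp_diffoffset_vmx(const unsigned char *a,
				 const unsigned char *b, size_t n)
{
	return memcmp(a, b, n);	/* would do the 32B pre-check + VMX */
}
static int memcmp_scalar(const unsigned char *a,
			 const unsigned char *b, size_t n)
{
	return memcmp(a, b, n);
}

/* Control-flow sketch of the .Ldiffoffset_* path after this patch. */
static int memcmp_diffoffset(const unsigned char *s1,
			     const unsigned char *s2, size_t len)
{
	/* .Ldiffoffset_8bytes_make_align_start: align s1 to 8 bytes,
	 * comparing the leading bytes one at a time. */
	while (((uintptr_t)s1 & 7) && len) {
		if (*s1 != *s2)
			return *s1 < *s2 ? -1 : 1;
		s1++, s2++, len--;
	}

	/* .Ldiffoffset_align_s1_8bytes: only now, with s1 aligned,
	 * test the VMX threshold.  Before this patch the test sat
	 * above the alignment code; after it, the pre-check loop in
	 * .Ldiffoffset_vmx_cmp always sees an 8-byte-aligned s1. */
	if (cpu_has_altivec() && len >= VMX_THRESH)
		return memcmp_diffoffset_vmx(s1, s2, len);

	return memcmp_scalar(s1, s2, len);
}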