From patchwork Thu May 31 06:19:19 2012 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Anton Blanchard X-Patchwork-Id: 162108 X-Patchwork-Delegate: michael@ellerman.id.au Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Received: from ozlabs.org (localhost [IPv6:::1]) by ozlabs.org (Postfix) with ESMTP id 21200B732D for ; Thu, 31 May 2012 16:20:59 +1000 (EST) Received: from kryten (ppp121-45-170-72.lns20.syd6.internode.on.net [121.45.170.72]) (using TLSv1.2 with cipher DHE-RSA-AES128-SHA (128/128 bits)) (Client did not present a certificate) by ozlabs.org (Postfix) with ESMTPSA id 0480BB70C6; Thu, 31 May 2012 16:19:18 +1000 (EST) Date: Thu, 31 May 2012 16:19:19 +1000 From: Anton Blanchard To: benh@kernel.crashing.org, paulus@samba.org, michael@ellerman.id.au, amodra@gmail.com Subject: [PATCH] powerpc: Use enhanced touch instructions in POWER7 copy_to_user/copy_from_user Message-ID: <20120531161919.1e619998@kryten> In-Reply-To: <20120529181432.2038c1a3@kryten> References: <20120529181432.2038c1a3@kryten> X-Mailer: Claws Mail 3.8.0 (GTK+ 2.24.10; x86_64-pc-linux-gnu) Mime-Version: 1.0 Cc: linuxppc-dev@lists.ozlabs.org X-BeenThere: linuxppc-dev@lists.ozlabs.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org Sender: linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org Version 2.06 of the POWER ISA introduced enhanced touch instructions, allowing us to specify a number of attributes including the length of a stream. This patch adds a software stream for both loads and stores in the POWER7 copy_tofrom_user loop. Since the setup is quite complicated and we have to use an eieio to ensure correct ordering of the "GO" command we only do this for copies above 4kB. To quantify any performance improvements we need a working set bigger than the caches so we operate on a 1GB file: # dd if=/dev/zero of=/tmp/foo bs=1M count=1024 And we compare how fast we can read the file: # dd if=/tmp/foo of=/dev/null bs=1M before: 7.7 GB/s after: 9.6 GB/s A 25% improvement. The worst case for this patch will be a completely L1 cache contained copy of just over 4kB. We can test this with the copy_to_user testcase we used to tune copy_tofrom_user originally: http://ozlabs.org/~anton/junkcode/copy_to_user.c # time ./copy_to_user2 -l 4224 -i 10000000 before: 6.807 s after: 6.946 s A 2% slowdown, which seems reasonable considering our data is unlikely to be completely L1 contained. Signed-off-by: Anton Blanchard --- v2: Use cr1 in the comparison so we don't corrupt the compare/branch to select vmx vs non vmx loops. Index: linux-build/arch/powerpc/lib/copyuser_power7.S =================================================================== --- linux-build.orig/arch/powerpc/lib/copyuser_power7.S 2012-05-29 21:22:40.445551834 +1000 +++ linux-build/arch/powerpc/lib/copyuser_power7.S 2012-05-31 15:28:35.336354208 +1000 @@ -298,6 +298,37 @@ err1; stb r0,0(r3) ld r5,STACKFRAMESIZE+64(r1) mtlr r0 + /* + * We prefetch both the source and destination using enhanced touch + * instructions. We use a stream ID of 0 for the load side and + * 1 for the store side. + */ + clrrdi r6,r4,7 + clrrdi r9,r3,7 + ori r9,r9,1 /* stream=1 */ + + srdi r7,r5,7 /* length in cachelines, capped at 0x3FF */ + cmpldi cr1,r7,0x3FF + ble cr1,1f + li r7,0x3FF +1: lis r0,0x0E00 /* depth=7 */ + sldi r7,r7,7 + or r7,r7,r0 + ori r10,r7,1 /* stream=1 */ + + lis r8,0x8000 /* GO=1 */ + clrldi r8,r8,32 + +.machine push +.machine "power4" + dcbt r0,r6,0b01000 + dcbt r0,r7,0b01010 + dcbtst r0,r9,0b01000 + dcbtst r0,r10,0b01010 + eieio + dcbt r0,r8,0b01010 /* GO */ +.machine pop + beq .Lunwind_stack_nonvmx_copy /*