From patchwork Fri Nov 10 23:02:28 2023
X-Patchwork-Submitter: Michael Meissner
X-Patchwork-Id: 1862600
Date: Fri, 10 Nov 2023 18:02:28 -0500
From: Michael Meissner
To: gcc-patches@gcc.gnu.org, Michael Meissner, Segher Boessenkool, "Kewen.Lin", David Edelsohn, Peter Bergner
Subject: [PATCH 0/4] Add vector pair builtins to PowerPC

This set of patches adds support for using the vector pair load instructions (lxvp, plxvp, and lxvpx) and the vector pair store instructions (stxvp, pstxvp, and stxvpx) that were introduced with ISA 3.1 on Power10 systems.
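As a small illustration (this example is mine, not taken from the patches): with -mcpu=power10, GCC already lets you load and store the __vector_pair type through pointers, and a trivial copy routine like the one below is expected to compile to the vector pair load/store instructions.

    /* Hypothetical example: copy 32 bytes through a __vector_pair.
       With -mcpu=power10 this should generate lxvp and stxvp.  */
    void
    copy_pair (double *dst, const double *src)
    {
      __vector_pair p = *(__vector_pair *) src;   /* lxvp */
      *(__vector_pair *) dst = p;                 /* stxvp */
    }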
With GCC 13, the only place vector pairs (and vector quads) were used was to feed values into the MMA subsystem. These patches do not use the MMA subsystem, but they give users a way to write code that is extremely memory bandwidth intensive.

There are two main ways to add vector pair support to the GCC compiler: built-in functions vs. __attribute__((__vector_size__(32))).

The first method is to add a set of built-in functions that operate on the vector pair type (__vector_pair), which lets the user write loops and such directly in terms of that type. Loads are normally done using the load vector pair instructions. The operation is then done as a post-reload split into two independent vector operations on the two 128-bit vectors located in the vector pair. When the value is stored, a store vector pair instruction is normally used. By keeping the value within a vector pair through register allocation, the compiler does not generate extra move instructions that can slow down the loop.

The second method is to add support for the V4DF, V8SF, etc. types. With that support, you can use __attribute__((__vector_size__(32))) to declare variables that are vector pairs, and the GCC compiler will generate the appropriate code. I implemented a limited prototype of this support, but it has some problems that I haven't addressed. One potential problem with using the 32-byte vector size is that it can generate worse code for operations that aren't covered, as the compiler unpacks the values and re-packs them. The compiler would also generate these unpacks and packs if you are generating code for a power9 system. There are a bunch of test cases that fail with my prototype implementation that I haven't addressed yet.

After discussions within our group, it was decided that using built-in functions is the way to go at this time, and these patches implement those functions.

In terms of benchmarks, I wrote two benchmarks:

1) One benchmark is a saxpy type loop: value[i] += (a[i] * b[i]). That is a loop with 3 loads and a store per iteration.

2) The other benchmark produces a scalar sum of an entire vector. This is a loop with just a single load and no store.

For the saxpy type loop, I get the following general numbers for both float and double:

1) The vector pair built-in functions are roughly 10% faster than using normal vector processing.

2) The vector pair built-in functions are roughly 19-20% faster than if I write the loop using vector pair loads via the existing built-ins, manually split the values, do the arithmetic, and use single vector stores.

3) The vector pair built-in functions are roughly 35-40% faster than if I write the loop using the existing built-ins for both vector pair load and vector pair store.

If I apply the patches that Peter Bergner has been writing for PR target/109116, the speed of the existing built-ins for assembling and disassembling vector pairs improves. In that case, the vector pair built-in functions are 20-25% faster, instead of 35-40% faster. This is due to his patch eliminating extra vector moves.

Unfortunately, for floating point, doing the sum of the whole vector is slower with the new vector pair built-in functions in a simple loop (compared to using the existing built-ins for disassembling vector pairs). If I write more complex loops that manually unroll the computation, the floating point vector pair built-in functions behave like the integer vector pair built-in functions. So there is some amount of tuning that will need to be done.
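To make the baseline concrete, here is a rough sketch (my own, not code from the patches) of the kind of simple sum loop that uses a vector pair load plus the existing __builtin_vsx_disassemble_pair built-in, i.e. the style of code the new built-ins are being compared against. It assumes -mcpu=power10 and that n is a multiple of 4; residual iterations are omitted.

    #include <stddef.h>
    #include <altivec.h>

    /* Sum of a double array using vector pair loads and the existing
       disassemble built-in (simplified: n must be a multiple of 4).  */
    double
    vpair_sum (size_t n, const double *a)
    {
      vector double acc0 = vec_splats (0.0);
      vector double acc1 = vec_splats (0.0);

      for (size_t i = 0; i < n; i += 4)
        {
          /* One 32-byte vector pair load (lxvp).  */
          __vector_pair p = *(__vector_pair *) (a + i);

          /* Split the pair into two 128-bit vectors.  */
          vector double v[2];
          __builtin_vsx_disassemble_pair (v, &p);

          acc0 = vec_add (acc0, v[0]);
          acc1 = vec_add (acc1, v[1]);
        }

      acc0 = vec_add (acc0, acc1);
      return acc0[0] + acc0[1];
    }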
There are 4 patches within this group of patches:

1) The first patch adds vector pair support for 32-bit and 64-bit floating point operations. The operations provided are absolute value, addition, fused multiply-add, minimum, maximum, multiplication, negation, and subtraction. I did not add divide or square root because these instructions take long enough to compute that you don't get any advantage from using the vector pair load/store instructions.

2) The second patch adds vector pair support for 8-bit, 16-bit, 32-bit, and 64-bit integer operations. The operations provided include addition, bitwise and, bitwise inclusive or, bitwise exclusive or, bitwise not, both signed and unsigned minimum/maximum, negation, and subtraction. I did not add multiply because the PowerPC architecture does not provide single instructions to do integer vector multiply on the whole vector. I could add shifts and rotates, but I didn't think memory-intensive code used these operations.

3) The third patch adds methods to create vector pair values (zero, splat from a scalar value, and combine two 128-bit vectors), as well as a convenient method to extract one 128-bit vector from a vector pair.

4) The fourth patch adds horizontal addition for 32-bit floating point, 64-bit floating point, and 64-bit integers. I do wonder if there are more horizontal reductions that should be done.

I have built and tested these patches on:

* A little endian power10 server using --with-cpu=power10
* A little endian power9 server using --with-cpu=power9
* A big endian power9 server using --with-cpu=power9

Can I check these patches into the master branch?