From patchwork Fri Nov 10 23:02:28 2023
X-Patchwork-Submitter: Michael Meissner
X-Patchwork-Id: 1862600
Date: Fri, 10 Nov 2023 18:02:28 -0500
From: Michael Meissner
To: gcc-patches@gcc.gnu.org, Michael Meissner, Segher Boessenkool, "Kewen.Lin", David Edelsohn, Peter Bergner
Subject: [PATCH 0/4] Add vector pair builtins to PowerPC

This set of patches adds support for using the vector pair load instructions (lxvp, plxvp, and lxvpx) and the vector pair store instructions (stxvp, pstxvp, and stxvpx) that were introduced with ISA 3.1 on Power10 systems.
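As a small illustration (this example is mine, not taken from the patches): with -mcpu=power10, GCC already lets you load and store the __vector_pair type through pointers, and a trivial copy routine like the one below is expected to compile to the vector pair load/store instructions.

    /* Hypothetical example: copy 32 bytes through a __vector_pair.
       With -mcpu=power10 this should generate lxvp and stxvp.  */
    void
    copy_pair (double *dst, const double *src)
    {
      __vector_pair p = *(__vector_pair *) src;   /* lxvp */
      *(__vector_pair *) dst = p;                 /* stxvp */
    }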
With GCC 13, the only place vector pairs (and vector quads) were used was to feed values into the MMA subsystem. These patches do not use the MMA subsystem, but they give users a way to write code that is extremely memory bandwidth intensive.

There are two main ways to add vector pair support to the GCC compiler: built-in functions vs. __attribute__((__vector_size__(32))).

The first method is to add a set of built-in functions that operate on the vector pair type (__vector_pair), which lets the user write loops and such directly in terms of that type. Loads are normally done using the load vector pair instructions. The operation is then done as a post-reload split into two independent vector operations on the two 128-bit vectors located in the vector pair. When the value is stored, a store vector pair instruction is normally used. By keeping the value within a vector pair through register allocation, the compiler does not generate extra move instructions that can slow down the loop.

The second method is to add support for the V4DF, V8SF, etc. types. With that support, you can use __attribute__((__vector_size__(32))) to declare variables that are vector pairs, and the GCC compiler will generate the appropriate code. I implemented a limited prototype of this support, but it has some problems that I haven't addressed. One potential problem with using the 32-byte vector size is that it can generate worse code for operations that aren't covered, as the compiler unpacks the values and re-packs them. The compiler would also generate these unpacks and packs if you are generating code for a power9 system. There are a bunch of test cases that fail with my prototype implementation that I haven't addressed yet.

After discussions within our group, it was decided that using built-in functions is the way to go at this time, and these patches implement those functions.

In terms of benchmarks, I wrote two benchmarks:

1) One benchmark is a saxpy type loop: value[i] += (a[i] * b[i]). That is a loop with 3 loads and a store per iteration.

2) The other benchmark produces a scalar sum of an entire vector. This is a loop with just a single load and no store.

For the saxpy type loop, I get the following general numbers for both float and double:

1) The vector pair built-in functions are roughly 10% faster than using normal vector processing.

2) The vector pair built-in functions are roughly 19-20% faster than if I write the loop using vector pair loads via the existing built-ins, manually split the values, do the arithmetic, and use single vector stores.

3) The vector pair built-in functions are roughly 35-40% faster than if I write the loop using the existing built-ins for both vector pair load and vector pair store.

If I apply the patches that Peter Bergner has been writing for PR target/109116, the speed of the existing built-ins for assembling and disassembling vector pairs improves. In that case, the vector pair built-in functions are 20-25% faster, instead of 35-40% faster. This is due to his patch eliminating extra vector moves.

Unfortunately, for floating point, doing the sum of the whole vector is slower with the new vector pair built-in functions in a simple loop (compared to using the existing built-ins for disassembling vector pairs). If I write more complex loops that manually unroll the computation, the floating point vector pair built-in functions behave like the integer vector pair built-in functions. So there is some amount of tuning that will need to be done.
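To make the baseline concrete, here is a rough sketch (my own, not code from the patches) of the kind of simple sum loop that uses a vector pair load plus the existing __builtin_vsx_disassemble_pair built-in, i.e. the style of code the new built-ins are being compared against. It assumes -mcpu=power10 and that n is a multiple of 4; residual iterations are omitted.

    #include <stddef.h>
    #include <altivec.h>

    /* Sum of a double array using vector pair loads and the existing
       disassemble built-in (simplified: n must be a multiple of 4).  */
    double
    vpair_sum (size_t n, const double *a)
    {
      vector double acc0 = vec_splats (0.0);
      vector double acc1 = vec_splats (0.0);

      for (size_t i = 0; i < n; i += 4)
        {
          /* One 32-byte vector pair load (lxvp).  */
          __vector_pair p = *(__vector_pair *) (a + i);

          /* Split the pair into two 128-bit vectors.  */
          vector double v[2];
          __builtin_vsx_disassemble_pair (v, &p);

          acc0 = vec_add (acc0, v[0]);
          acc1 = vec_add (acc1, v[1]);
        }

      acc0 = vec_add (acc0, acc1);
      return acc0[0] + acc0[1];
    }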
There are 4 patches within this group of patches:

1) The first patch adds vector pair support for 32-bit and 64-bit floating point operations. The operations provided are absolute value, addition, fused multiply-add, minimum, maximum, multiplication, negation, and subtraction. I did not add divide or square root because these instructions take long enough to compute that you don't get any advantage from using the vector pair load/store instructions.

2) The second patch adds vector pair support for 8-bit, 16-bit, 32-bit, and 64-bit integer operations. The operations provided include addition, bitwise and, bitwise inclusive or, bitwise exclusive or, bitwise not, both signed and unsigned minimum/maximum, negation, and subtraction. I did not add multiply because the PowerPC architecture does not provide single instructions to do integer vector multiply on the whole vector. I could add shifts and rotates, but I didn't think memory-intensive code used these operations.

3) The third patch adds methods to create vector pair values (zero, splat from a scalar value, and combine two 128-bit vectors), as well as a convenient method to extract one 128-bit vector from a vector pair.

4) The fourth patch adds horizontal addition for 32-bit floating point, 64-bit floating point, and 64-bit integers. I do wonder if there are more horizontal reductions that should be done.

I have built and tested these patches on:

* A little endian power10 server using --with-cpu=power10
* A little endian power9 server using --with-cpu=power9
* A big endian power9 server using --with-cpu=power9

Can I check these patches into the master branch?