From patchwork Mon Jun 1 21:40:43 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Patrick McGehearty X-Patchwork-Id: 1301965 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=gcc-patches-bounces@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=gcc.gnu.org Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.a=rsa-sha256 header.s=default header.b=SiQ8Oi73; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 49bTBP1Wdmz9sW3 for ; Tue, 2 Jun 2020 07:40:59 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 43FCF3851C37; Mon, 1 Jun 2020 21:40:55 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 43FCF3851C37 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1591047655; bh=IVLMwJbVaNyTGlucTyUluRqy81H5iunQYT62082NjRg=; h=To:Subject:Date:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=SiQ8Oi73A2ZVjyEzUZ+HzL50ahZvIEbDMxDNql9pLleSSUu6FsSa1ECFxpQNyaTU4 bWk0swRlDqtx2dkJx7raikcleVWyMoqxR+ZNTDhusE7nfOcIMaHMS87vlDPKl2l+uG g4odFeskJvJwAEltJcgI8+id+/7ynYanu7zwJ/Hk= X-Original-To: gcc-patches@gcc.gnu.org Delivered-To: gcc-patches@gcc.gnu.org Received: from userp2130.oracle.com (userp2130.oracle.com [156.151.31.86]) by sourceware.org (Postfix) with ESMTPS id 9765E3851C37 for ; Mon, 1 Jun 2020 21:40:51 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 9765E3851C37 Received: from pps.filterd (userp2130.oracle.com [127.0.0.1]) by userp2130.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 051Lc6Y9081883 for ; Mon, 1 Jun 2020 21:40:50 GMT Received: from userp3030.oracle.com (userp3030.oracle.com [156.151.31.80]) by userp2130.oracle.com with ESMTP id 31bewqs05q-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL) for ; Mon, 01 Jun 2020 21:40:50 +0000 Received: from pps.filterd (userp3030.oracle.com [127.0.0.1]) by userp3030.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 051LdBtZ146233 for ; Mon, 1 Jun 2020 21:40:50 GMT Received: from userv0122.oracle.com (userv0122.oracle.com [156.151.31.75]) by userp3030.oracle.com with ESMTP id 31c1dw4pfu-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK) for ; Mon, 01 Jun 2020 21:40:50 +0000 Received: from abhmp0003.oracle.com (abhmp0003.oracle.com [141.146.116.9]) by userv0122.oracle.com (8.14.4/8.14.4) with ESMTP id 051Len4r018414 for ; Mon, 1 Jun 2020 21:40:49 GMT Received: from localhost.us.oracle.com (/10.211.14.34) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 01 Jun 2020 14:40:49 -0700 To: gcc-patches@gcc.gnu.org Subject: [PATCH] Practical Improvement to Double Precision Complex Divide Date: Mon, 1 Jun 2020 21:40:43 +0000 Message-Id: <1591047643-9019-1-git-send-email-patrick.mcgehearty@oracle.com> X-Mailer: git-send-email 1.8.3.1 X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9639 signatures=668686 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 phishscore=0 mlxlogscore=999 spamscore=0 bulkscore=0 adultscore=0 suspectscore=1 mlxscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2004280000 definitions=main-2006010157 X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9639 signatures=668686 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 priorityscore=1501 bulkscore=0 phishscore=0 suspectscore=1 impostorscore=0 cotscore=-2147483648 lowpriorityscore=0 mlxscore=0 adultscore=0 spamscore=0 mlxlogscore=999 malwarescore=0 clxscore=1011 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2004280000 definitions=main-2006010157 X-Spam-Status: No, score=-8.5 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_ASCII_DIVIDERS, RCVD_IN_MSPIKE_H2, SPF_HELO_PASS, SPF_PASS, TXREP, UNPARSEABLE_RELAY autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Patrick McGehearty via Gcc-patches From: Patrick McGehearty Reply-To: Patrick McGehearty Errors-To: gcc-patches-bounces@gcc.gnu.org Sender: "Gcc-patches" The following patch to libgcc/libgcc2.c __divdc3 provides an opportunity to gain important improvements to the quality of answers for the default double precision complex divide routine when dealing with very large or very small exponents. The current code correctly implements Smith's method (1962) [1] further modified by c99's requirements for dealing with NaN (not a number) results. When working with input values where the exponents are greater than 512 (i.e. 2.0^512) or less than -512 (i.e. 2.0^-512), results are substantially different from the answers provided by quad precision more than 1% of the time. Since the allowed exponent range for double precision numbers is -1076 to +1023, the error rate may be unacceptable for many applications. The proposed method reduces the frequency of "substantially different" answers by more than 99% at a modest cost of performance. Differences between current gcc methods and the new method will be described. Then accuracy and performance differences will be discussed. NOTATION For all of the following, the notation is: Input complex values: a+bi (a= 64bit real part, b= 64bit imaginary part) c+di Output complex value: e+fi = (a+bi)/(c+di) DESCRIPTIONS of different complex divide methods: NAIVE COMPUTATION (-fcx-limited-range): e = (a*c + b*d)/(c*c + d*d) f = (b*c - a*d)/(c*c + d*d) Note that c*c and d*d will overflow or underflow if either c or d is outside the range 2^-538 to 2^512. This method is available in gcc when the switch -fcx-limited-range is used. That switch is also enabled by -ffast-math. Only one who has a clear understanding of the maximum range of intermediate values generated by a computation should consider using this switch. SMITH's METHOD (current libgcc): if(fabs(c) RBIG) || (FABS (a) > RBIG) || (FABS (b) > RBIG) ) { a = a * 0.5; b = b * 0.5; c = c * 0.5; d = d * 0.5; } /* minimize overflow/underflow issues when c and d are small */ else if (FABS (d) < RMIN2) { a = a * RMINSCAL; b = b * RMINSCAL; c = c * RMINSCAL; d = d * RMINSCAL; } r = c/d; denom = (c*r) + d; if( r > RMIN ) { e = (a*r + b) / denom ; f = (b*r - a) / denom } else { e = (c * (a/d) + b) / denom; f = (c * (b/d) - a) / denom; } [ only presenting the fabs(c) < fabs(d) case here, full code in patch. ] Before any computation of the answer, the code checks for near maximum or near minimum inputs and scale the results to move all values away from the extremes. If the complex divide can be computed at all without generating infinities, these scalings will not affect the accuracy since they are by a power of 2. Values that are over RBIG are relatively rare but it is easy to test for them and required to avoid unnecessary overflows. Testing for RMIN2 reveals when both c and d are less than 2^-512. By scaling all values by 2^510, the code avoids many underflows in intermediate computations that otherwise might occur. If scaling a and b by 2^510 causes either to overflow, then the computation will overflow whatever method is used. Next, r (the ratio of c to d) is checked for being near zero. Baudin and Smith checked r for zero. Checking for values less than DBL_MIN covers more cases and improves overall accuracy. If r is near zero, then when it is used in a multiplication, there is a high chance that the result will underflow to zero, losing significant accuracy. That underflow can be avoided if the computation is done in a different order. When r is subnormal, the code replaces a*r (= a*(c/d)) with ((a/d)*c) which is mathematically the same but avoids the unnecessary underflow. TEST Data Two sets of data are presented to test these methods. Both sets contain 10 million pairs of 64bit complex values. The exponents and mantissas are generated using multiple calls to random() and then combining the results. Only values which give results to complex divide that are representable in 64-bits after being computed in quad precision are used. The first data set is labeled "moderate exponents". The exponent range is limited to -512 to +511. The second data set is labeled "full exponents". The exponent range is -1076 to + 1024. ACCURACY Test results: Note: All results are based on use of fused multiply-add. If fused multiply-add is not used, the error rate increases slightly for the 2 ulp and 8 ulp cases. The complex divide methods are evaluated by determining what percentage of values exceed different ulp (units in last place) levels. If a "2 ulp" test results show 1%, that would mean that 1% of 10,000,000 values (100,000) have either a real or imaginary part that had a greater than 2 bit difference from the quad precision result. Results are reported for differences greater than or equal to 2 ulp, 8 ulp, 16 ulp, 24 ulp, and 52 ulp. Even when the patch avoids overflows and underflows, some input values are expected to have errors due to normal limitations of floating point subtraction. For example, when b*c and a*d are nearly equal subtraction has potential loss of accuracy. This patch does not attempt to detect or minimize this type of error, but neither does it increase them. In the following charts, lower values are better. ============================== Errors Moderate Dataset gtr eq current new ====== ======== ======== 2 ulp 0.01762% 0.01762% 8 ulp 0.00026% 0.00026% 16 ulp 0.00000% 0.00000% 24 ulp 0% 0% 52 ulp 0% 0% ============================== Table 1: Errors with Moderate Dataset Note in Table 1 that both the old and new methods give identical error rates for data with moderate exponents. Errors exceeding 16 ulp are exceedingly rare. ============================== Errors Full Dataset gtr eq current new ====== ======== ======== 2 ulp 1.70% 0.03874% 8 ulp 1.61% 0.02403% 16 ulp 1.51% 0.01554% 24 ulp 1.41% 0.00947% 52 ulp 1.13% 0.00001% ============================== Table 2: Errors with Full Dataset Table 2 shows significant differences in error rates. With the current code, 1.13% of test values have results where no bits of the mantissa match the correct answer. That level of error is extremely rare with the new code. Consider the results for errors greater than or equal to 24 ulp. The current code sees those errors at a rate of 1.4%. Most would agree that such answers are "substantially different" from the answers calculated using extended precision. The new code reduces errors at that level by more than 99%. These improvements are important for serious computation of complex numbers. PERFORMANCE Test results In order for a library change to be practical, it is necessary to show the slowdown is tolerable. The slowdowns observed are much less than would be seen by (for example) switching to quad precision, which on many machines may cause a slowdown of 1000x. The actual slowdown depends on the machine architecture. It also depends on the nature of the input data. If underflow/overflow is rare, then implementations that have strong branch prediction will only slowdown by a few cycles. If underflow/overflow is common, then the branch predictors will be less successful and the cost will be higher. Results from two machines are presented as examples of the overhead for the new method. The one labeled x86 is a 5 year old Intel x86 processor and the one labeled aarch64 is a 3 year old arm64 processor. In the following chart, lower values are better. The values are nanoseconds/complex divide (ns) averaged over the full 10 million value data sets. ===================================================== Moderate set full set x86 aarch64 x86 aarch64 ======== ================= ================= current 19.5 ns 34.9 ns 29.4 ns 32.9 ns new 20.0 ns 42.7 ns 37.4 ns 53.8 ns ======== ================= ================= %slowdown 3% 22% 27% 64% ===================================================== Table 3: Performance Comparisons In Table 3, overhead is modest when the new branches are consistently false. The overhead grows when the new branches are taken to get correct answers. Both platforms are somewhat dated, with the x86 having a better branch predictor which reduces the cost of the additional branches in the new code. Of course, the relative slowdown may be greater for some architectures, especially those with limited branch prediction combined with a high cost of misprediction. This cost is claimed to be tolerable on the grounds that: (a) most applications will only spend a small fraction of their time calculating complex divide. (b) it is much less than the cost of extended precision (c) users are not forced to use it (as described below) (d) the cost if worthwhile considering the accuracy improvement shown in Table 2. Those users who find this degree of slowdown unsatisfactory may use the switch -fcx-fortran-rules which does not use the library routine, instead inlining Smith's method without the C99 requirement for dealing with NaN results. The proposed patch for double precision complex divide does not affect the behavior of -fcx-fortran-rules. SUMMARY When input data to complex divide has exponents between -512 and 511, this patch makes no changes in accuracy and has only a modest effect on performance. When input data contains values outside those ranges, the patch eliminates more than 99% of major errors with a tolerable cost in performance. REFERENCES [1] Robert L. Smith. Algorithm 116: Complex division. Commun. ACM, 5(8):435, 1962. [2] Michael Baudin and Robert L. Smith. "A robust complex division in Scilab," October 2012, available at http://arxiv.org/abs/1210.4539. --- libgcc/ChangeLog | 6 ++++++ libgcc/libgcc2.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 66 insertions(+), 4 deletions(-) diff --git a/libgcc/ChangeLog b/libgcc/ChangeLog index 06e7986..5a666d4 100644 --- a/libgcc/ChangeLog +++ b/libgcc/ChangeLog @@ -1,3 +1,9 @@ +2020-05-29 Patrick McGehearty + + * libgcc2.c (__divdc3): Enhance accuracy of double precision + complex divide for input values with very large and very + small exponents. + 2020-05-28 Dong JianQiang PR gcov-profile/95332 diff --git a/libgcc/libgcc2.c b/libgcc/libgcc2.c index e0a9fd7..2d32b7c 100644 --- a/libgcc/libgcc2.c +++ b/libgcc/libgcc2.c @@ -2036,6 +2036,10 @@ CONCAT3(__mul,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d) CTYPE CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d) { +#define RBIG ((DBL_MAX)/2.0) +#define RMIN (DBL_MIN) +#define RMIN2 (0x1.0p-512) +#define RMINSCAL (0x1.0p+510) MTYPE denom, ratio, x, y; CTYPE res; @@ -2043,19 +2047,71 @@ CONCAT3(__div,MODE,3) (MTYPE a, MTYPE b, MTYPE c, MTYPE d) the division. But that would mean starting to link libgcc against libm. We could implement something akin to ldexp/frexp as gcc builtins fairly easily... */ + /* + Scaling by max(c,d) shifts a division and avoids a multiplication + reducing round-off errors by a modest amount (average of a half-bit) + */ if (FABS (c) < FABS (d)) { + /* prevent overflow when arguments are near max representable */ + if ((FABS (d) >= RBIG) || (FABS (a) >= RBIG) || (FABS (b) >= RBIG) ) + { + a = a * 0.5; + b = b * 0.5; + c = c * 0.5; + d = d * 0.5; + } + /* avoid overflow/underflow issues when c and d are small */ + else if (FABS (d) < RMIN2) + { + a = a * RMINSCAL; + b = b * RMINSCAL; + c = c * RMINSCAL; + d = d * RMINSCAL; + } ratio = c / d; denom = (c * ratio) + d; - x = ((a * ratio) + b) / denom; - y = ((b * ratio) - a) / denom; + if ( FABS (ratio) > RMIN ) + { + x = ((a * ratio) + b) / denom; + y = ((b * ratio) - a) / denom; + } + else + { + x = ((c * (a / d)) + b) / denom; + y = ((c * (b / d)) - a) / denom; + } } else { + /* prevent overflow when arguments are near max representable */ + if ((FABS(c) >= RBIG) || (FABS(a) >= RBIG) || (FABS(b) >= RBIG) ) + { + a = a * 0.5; + b = b * 0.5; + c = c * 0.5; + d = d * 0.5; + } + /* avoid overflow issues when c and d are small */ + else if (FABS(c) < RMIN2) + { + a = a * RMINSCAL; + b = b * RMINSCAL; + c = c * RMINSCAL; + d = d * RMINSCAL; + } ratio = d / c; denom = (d * ratio) + c; - x = ((b * ratio) + a) / denom; - y = (b - (a * ratio)) / denom; + if ( FABS (ratio) > RMIN ) + { + x = (a + (b * ratio)) / denom; + y = (b - (a * ratio)) / denom; + } + else + { + x = (a + (d * (b / c))) / denom; + y = (b - (d * (a / c))) / denom; + } } /* Recover infinities and zeros that computed as NaN+iNaN; the only cases