From patchwork Mon Dec  9 19:07:50 2013
X-Patchwork-Submitter: Kenneth Zadeck
X-Patchwork-Id: 299166
Message-ID: <52A61506.5000407@naturalbridge.com>
Date: Mon, 09 Dec 2013 14:07:50 -0500
From: Kenneth Zadeck
To: Richard Sandiford, Richard Biener, Mike Stump, gcc-patches
Subject: wide-int: more performance fixes for wide multiplication

This patch is the last performance patch that I have for wide-int.

It changes wide multiplication from taking precision/hbpwi * precision/hbpwi multiplies to taking #significant_bits1/hbpwi * #significant_bits2/hbpwi multiplies (where hbpwi is the number of bits in a half HOST_WIDE_INT).  That saves a significant number of multiplies on machines whose largest integer mode is large.  With 64-bit HWIs and a 512-bit precision, for example, two operands that each fit in a single HWI used to cost 16 * 16 = 256 half-word multiplies; now they cost 2 * 2 = 4.

This patch was mostly tested before Richard Sandiford's patch that uses the longlong machinery.  That patch catches many of the cases this one would catch, but if either of the operands does not fit into an HWI, things got really expensive without this patch.

This patch works quite differently from the version on the branch.  On the branch, the inputs were assumed to be unsigned numbers and the multiply happened with whatever came in; after the multiply, if a signed multiply had been requested, the product was adjusted.  In this version, the adjustment happens first when we are doing a signed multiply, so the operands that reach the actual multiply loop are always unsigned numbers.  We may need an extra bit for this, but we have it.  The advantage is that negative numbers have a lot of leading 1s, and compensating for them is complex; the easy solution is to convert a negative operand to a positive number and then count how many HWIs are needed to represent it.  Of course, if exactly one of the operands was negative, we negate the product at the end.  (A standalone sketch of this scheme appears after the patch.)

This patch was tested on x86-64 with Richard's longlong shortcut commented out so that this code path was hit many times; measured that way, it makes a large difference in the number of multiplies performed.  Without commenting that out, the number of hits is small if one looks only at the bootstrap and the regression tests.  However, this patch will hit more often on ports that revert to a 32-bit HWI, because it takes control whenever one of the operands is larger than 32 bits.  It also hits whenever one of the operands really does not fit in an HWI.

This patch only works after the tree-vrp patch has been applied:

http://gcc.gnu.org/ml/gcc-patches/2013-12/msg00877.html

kenny

Index: gcc/wide-int.cc
===================================================================
--- gcc/wide-int.cc	(revision 205765)
+++ gcc/wide-int.cc	(working copy)
@@ -1275,23 +1275,31 @@ wi::mul_internal (HOST_WIDE_INT *val, co
   unsigned int j;
   unsigned int blocks_needed = BLOCKS_NEEDED (prec);
   unsigned int half_blocks_needed = blocks_needed * 2;
-  /* The sizes here are scaled to support a 2x largest mode by 2x
-     largest mode yielding a 4x largest mode result.  This is what is
-     needed by vpn.  */
+  /* The sizes here are scaled to support a 2*largest mode + a little
+     by 2*largest mode + a little yielding a 4*largest mode + a
+     little.  This is what is needed by vpn.  */
   unsigned HOST_HALF_WIDE_INT
-    u[4 * MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_HALF_WIDE_INT];
+    u[2 * WIDE_INT_MAX_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
+  unsigned int ulen;
   unsigned HOST_HALF_WIDE_INT
-    v[4 * MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_HALF_WIDE_INT];
-  /* The '2' in 'R' is because we are internally doing a full
+    v[2 * WIDE_INT_MAX_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
+  unsigned int vlen;
+  unsigned int uvlen;
+  unsigned int vallen;
+
+  /* The '4' in 'R' is because we are internally doing a
      wide multiply.  */
   unsigned HOST_HALF_WIDE_INT
-    r[2 * 4 * MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_HALF_WIDE_INT];
-  HOST_WIDE_INT mask = ((HOST_WIDE_INT)1 << HOST_BITS_PER_HALF_WIDE_INT) - 1;
+    r[4 * WIDE_INT_MAX_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
+
+  /* True if the result needs to be negated.  */
+  bool is_neg = false;

   /* If the top level routine did not really pass in an overflow, then
      just make sure that we never attempt to set it.  */
   bool needs_overflow = (overflow != 0);
+  const HOST_WIDE_INT zero = 0;
   if (needs_overflow)
     *overflow = false;

@@ -1382,61 +1392,96 @@ wi::mul_internal (HOST_WIDE_INT *val, co
       return 1;
     }

-  /* We do unsigned mul and then correct it.  */
-  wi_unpack (u, (const unsigned HOST_WIDE_INT *) op1val, op1len,
-	     half_blocks_needed, prec, SIGNED);
-  wi_unpack (v, (const unsigned HOST_WIDE_INT *) op2val, op2len,
-	     half_blocks_needed, prec, SIGNED);
-
-  /* The 2 is for a full mult.  */
-  memset (r, 0, half_blocks_needed * 2
-	  * HOST_BITS_PER_HALF_WIDE_INT / CHAR_BIT);
+  ulen = op1len * 2;
+  vlen = op2len * 2;
+
+  if (sgn == SIGNED)
+    {
+      if (wi::neg_p (op1))
+	{
+	  HOST_WIDE_INT nop1[2 * WIDE_INT_MAX_ELTS];
+
+	  int nop1len
+	    = wi::sub_large (nop1, &zero, 1, op1val, op1len, prec + 1, SIGNED, 0);
+	  is_neg = !is_neg;
+	  ulen = nop1len * 2;
+	  wi_unpack (u, (const unsigned HOST_WIDE_INT*)nop1, nop1len,
+		     ulen, prec, SIGNED);
+	}
+      else
+	wi_unpack (u, (const unsigned HOST_WIDE_INT*)op1val, op1len,
+		   ulen, prec, SIGNED);
+
+      if (wi::neg_p (op2))
+	{
+	  HOST_WIDE_INT nop2[2 * WIDE_INT_MAX_ELTS];
+
+	  int nop2len
+	    = wi::sub_large (nop2, &zero, 1, op2val, op2len, prec + 1, SIGNED, 0);
+	  is_neg = !is_neg;
+	  vlen = nop2len * 2;
+	  wi_unpack (v, (const unsigned HOST_WIDE_INT*)nop2, nop2len,
+		     vlen, prec, SIGNED);
+	}
+      else
+	wi_unpack (v, (const unsigned HOST_WIDE_INT*)op2val, op2len,
+		   vlen, prec, SIGNED);
+    }
+  else
+    {
+      /* UNSIGNED mul.  */
+      /* For large positive integers inside regular wide-ints, we must
+	 explicitly expand out the 1s to full width for the mul to be
+	 correct.  Note that the top bit will never be set for an
+	 unsigned number in offset_int or widest_int.  */
+      if (wi::neg_p (op1))
+	ulen = half_blocks_needed;
+      if (wi::neg_p (op2))
+	vlen = half_blocks_needed;
+
+      wi_unpack (u, (const unsigned HOST_WIDE_INT*)op1val, op1len,
+		 ulen, prec, SIGNED);
+
+      wi_unpack (v, (const unsigned HOST_WIDE_INT*)op2val, op2len,
+		 vlen, prec, SIGNED);
+    }

-  for (j = 0; j < half_blocks_needed; j++)
+  uvlen = ulen + vlen;
+  memset (r, 0, uvlen * HOST_BITS_PER_HALF_WIDE_INT / CHAR_BIT);
+
+  /* Do the actual multiply.  */
+  for (j = 0; j < vlen; j++)
     {
       k = 0;
-      for (i = 0; i < half_blocks_needed; i++)
+      for (i = 0; i < ulen; i++)
	{
	  t = ((unsigned HOST_WIDE_INT)u[i] * (unsigned HOST_WIDE_INT)v[j]
	       + r[i + j] + k);
	  r[i + j] = t & HALF_INT_MASK;
	  k = t >> HOST_BITS_PER_HALF_WIDE_INT;
	}
-      r[j + half_blocks_needed] = k;
+      r[j + ulen] = k;
     }

-  /* We did unsigned math above.  For signed we must adjust the
-     product (assuming we need to see that).  */
-  if (sgn == SIGNED && (high || needs_overflow))
+  /* We did unsigned math above.  For signed we may need to adjust the
+     product if exactly one of the operands was negative.  */
+  if (is_neg)
     {
-      unsigned HOST_WIDE_INT b;
-      if (wi::neg_p (op1))
+      t = (unsigned HOST_WIDE_INT)(~r[0]) + 1;
+      r[0] = t & HALF_INT_MASK;
+      for (i = 1; i < uvlen; i++)
	{
-	  b = 0;
-	  for (i = 0; i < half_blocks_needed; i++)
-	    {
-	      t = (unsigned HOST_WIDE_INT)r[i + half_blocks_needed]
-		- (unsigned HOST_WIDE_INT)v[i] - b;
-	      r[i + half_blocks_needed] = t & HALF_INT_MASK;
-	      b = t >> (HOST_BITS_PER_WIDE_INT - 1);
-	    }
-	}
-      if (wi::neg_p (op2))
-	{
-	  b = 0;
-	  for (i = 0; i < half_blocks_needed; i++)
-	    {
-	      t = (unsigned HOST_WIDE_INT)r[i + half_blocks_needed]
-		- (unsigned HOST_WIDE_INT)u[i] - b;
-	      r[i + half_blocks_needed] = t & HALF_INT_MASK;
-	      b = t >> (HOST_BITS_PER_WIDE_INT - 1);
-	    }
+	  t = (unsigned HOST_WIDE_INT)(~r[i]) + (t >> HOST_BITS_PER_HALF_WIDE_INT);
+	  r[i] = t & HALF_INT_MASK;
	}
     }

-  if (needs_overflow)
+  /* We only need to do the expensive check for overflow if the value
+     really was big enough to overflow.  */
+  if (needs_overflow && (uvlen >= half_blocks_needed))
     {
       HOST_WIDE_INT top;
+      const HOST_WIDE_INT mask = ((HOST_WIDE_INT)1 << HOST_BITS_PER_HALF_WIDE_INT) - 1;

       /* For unsigned, overflow is true if any of the top bits are set.
	  For signed, overflow is true if any of the top bits are not equal
@@ -1450,7 +1495,7 @@ wi::mul_internal (HOST_WIDE_INT *val, co
	  top &= mask;
	}

-      for (i = half_blocks_needed; i < half_blocks_needed * 2; i++)
+      for (i = half_blocks_needed; i < uvlen; i++)
	if (((HOST_WIDE_INT)(r[i] & mask)) != top)
	  *overflow = true;
     }

@@ -1459,15 +1504,42 @@ wi::mul_internal (HOST_WIDE_INT *val, co
     {
       /* compute [prec] <- ([prec] * [prec]) >> [prec] */
       wi_pack ((unsigned HOST_WIDE_INT *) val,
-	       &r[half_blocks_needed], half_blocks_needed);
-      return canonize (val, blocks_needed, prec);
+	       r, uvlen);
+      vallen = canonize (val, (uvlen + 1) >> 1, prec);
+
+      /* Shift is not always safe to write over one of the
+	 operands, so we must copy.  */
+      HOST_WIDE_INT tval[2 * WIDE_INT_MAX_ELTS];
+      memcpy (tval, val, vallen * CHAR_BIT / HOST_BITS_PER_WIDE_INT);
+      vallen = wi::lrshift_large (val, tval, vallen, prec*2, prec, prec);
     }
   else
     {
+      if (uvlen > half_blocks_needed)
+	uvlen = half_blocks_needed;
       /* compute [prec] <- ([prec] * [prec]) && ((1 << [prec]) - 1) */
-      wi_pack ((unsigned HOST_WIDE_INT *) val, r, half_blocks_needed);
-      return canonize (val, blocks_needed, prec);
+      wi_pack ((unsigned HOST_WIDE_INT *) val, r, uvlen);
+      vallen = canonize (val, (uvlen + 1) >> 1, prec);
     }
+
+#if 0
+  static int cnt = 0;
+  printf ("cnt=%d op1=", cnt++);
+  for (i = 0; i < op1len; i++)
+    printf ("0x%lx ", op1[i]);
+
+  printf (" op2=");
+  for (i = 0; i < op2len; i++)
+    printf ("0x%lx ", op2[i]);
+  printf (" result=");
+  for (i = 0; i < vallen; i++)
+    printf ("0x%lx ", val[i]);
+  if (needs_overflow && *overflow)
+    printf (" OVERFLOW");
+  printf ("\n");
+#endif
+
+  return vallen;
 }

 /* Compute the population count of X.  */
Index: gcc/wide-int.h
===================================================================
--- gcc/wide-int.h	(revision 205765)
+++ gcc/wide-int.h	(working copy)
@@ -2406,7 +2408,7 @@ wi::mul (const T1 &x, const T2 &y)
     }
   else
     result.set_len (mul_internal (val, xi.val, xi.len, yi.val, yi.len,
-				  precision, UNSIGNED, 0, false));
+				  precision, SIGNED, 0, false));
   return result;
 }
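
---

For readers who want the shape of the algorithm without the wide-int plumbing, here is a minimal standalone sketch of the scheme described above: normalize signed operands to non-negative values first, do an unsigned schoolbook multiply over half-width digits whose counts are derived from the operands' significant lengths, and negate the product at the end if exactly one input was negative.  This is only an illustration of the technique, not the GCC code: the fixed uint32_t/uint64_t digit types and the mul_digits/negate_digits/smul_sketch helper names are assumptions made to keep the example self-contained.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define HALF_BITS 32          /* stand-in for HOST_BITS_PER_HALF_WIDE_INT */
#define HALF_MASK 0xffffffffu /* stand-in for HALF_INT_MASK */

/* r[0 .. ulen+vlen-1] = u * v, least-significant digit first.
   Exactly ulen * vlen single multiplies are performed, which is why
   sizing ulen/vlen from the significant bits of the inputs pays off.  */
static void
mul_digits (uint32_t *r, const uint32_t *u, unsigned int ulen,
            const uint32_t *v, unsigned int vlen)
{
  memset (r, 0, (ulen + vlen) * sizeof (uint32_t));
  for (unsigned int j = 0; j < vlen; j++)
    {
      uint64_t k = 0;                       /* carry */
      for (unsigned int i = 0; i < ulen; i++)
        {
          uint64_t t = (uint64_t) u[i] * v[j] + r[i + j] + k;
          r[i + j] = t & HALF_MASK;
          k = t >> HALF_BITS;
        }
      r[j + ulen] = (uint32_t) k;
    }
}

/* Two's-complement negate LEN digits in place (~digit plus carry),
   mirroring the patch's is_neg fixup loop.  */
static void
negate_digits (uint32_t *r, unsigned int len)
{
  uint64_t t = (uint64_t) (~r[0] & HALF_MASK) + 1;
  r[0] = t & HALF_MASK;
  for (unsigned int i = 1; i < len; i++)
    {
      t = (uint64_t) (~r[i] & HALF_MASK) + (t >> HALF_BITS);
      r[i] = t & HALF_MASK;
    }
}

/* Signed 64x64 -> 128 bit multiply via the normalize-first scheme.
   The result lands in r[0..3], least-significant digit first.  */
static void
smul_sketch (uint32_t r[4], int64_t a, int64_t b)
{
  bool is_neg = false;
  uint64_t ua = (uint64_t) a, ub = (uint64_t) b;
  if (a < 0) { ua = 0 - ua; is_neg = !is_neg; }  /* make operand positive */
  if (b < 0) { ub = 0 - ub; is_neg = !is_neg; }
  uint32_t u[2] = { (uint32_t) ua, (uint32_t) (ua >> HALF_BITS) };
  uint32_t v[2] = { (uint32_t) ub, (uint32_t) (ub >> HALF_BITS) };
  mul_digits (r, u, 2, v, 2);
  if (is_neg)
    negate_digits (r, 4);    /* exactly one operand was negative */
}

The payoff is visible in mul_digits: it performs exactly ulen * vlen hardware multiplies, so trimming ulen and vlen to the operands' significant digits is what turns the precision-squared cost into a significant-bits-squared cost.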
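A quick sanity check of the sketch (again illustrative; it assumes the definitions above are in the same file):

#include <inttypes.h>
#include <stdio.h>

int
main (void)
{
  uint32_t r[4];
  smul_sketch (r, -3, 5);   /* exactly one negative operand */
  /* Prints 0xfffffffffffffffffffffffffffffff1, i.e. -15 as a
     128-bit two's-complement value.  */
  printf ("0x%08" PRIx32 "%08" PRIx32 "%08" PRIx32 "%08" PRIx32 "\n",
          r[3], r[2], r[1], r[0]);
  return 0;
}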