From patchwork Mon Dec  9 19:07:50 2013
X-Patchwork-Submitter: Kenneth Zadeck
X-Patchwork-Id: 299166
Message-ID: <52A61506.5000407@naturalbridge.com>
Date: Mon, 09 Dec 2013 14:07:50 -0500
From: Kenneth Zadeck
To: Richard Sandiford, Richard Biener, Mike Stump, gcc-patches
Subject: wide-int: more performance fixes for wide multiplication

This patch is the last performance patch that I have for wide-int.

It changes wide multiplication from taking precision/hbpwi * precision/hbpwi multiplies to taking #significant_bits1/hbpwi * #significant_bits2/hbpwi multiplies (where hbpwi is the number of bits in a half HOST_WIDE_INT).  That saves a significant number of multiplies on machines whose largest integer mode is large.  With 64-bit HWIs and a 512-bit precision, for example, two operands that each fit in a single HWI used to cost 16 * 16 = 256 half-word multiplies; now they cost 2 * 2 = 4.

This patch was mostly tested before Richard Sandiford's patch that uses the longlong machinery.  That patch catches many of the cases this one would catch, but if either of the operands does not fit into an HWI, things got really expensive without this patch.

This patch works quite differently from the version on the branch.  On the branch, the inputs were assumed to be unsigned numbers and the multiply happened with whatever came in; after the multiply, if a signed multiply had been requested, the product was adjusted.  In this version, the adjustment happens first when we are doing a signed multiply, so the operands that reach the actual multiply loop are always unsigned numbers.  We may need an extra bit for this, but we have it.  The advantage is that negative numbers have a lot of leading 1s, and compensating for them is complex; the easy solution is to convert a negative operand to a positive number and then count how many HWIs are needed to represent it.  Of course, if exactly one of the operands was negative, we negate the product at the end.  (A standalone sketch of this scheme appears after the patch.)

This patch was tested on x86-64 with Richard's longlong shortcut commented out so that this code path was hit many times; measured that way, it makes a large difference in the number of multiplies performed.  Without commenting that out, the number of hits is small if one looks only at the bootstrap and the regression tests.  However, this patch will hit more often on ports that revert to a 32-bit HWI, because it takes control whenever one of the operands is larger than 32 bits.  It also hits whenever one of the operands really does not fit in an HWI.

This patch only works after the tree-vrp patch has been applied:

http://gcc.gnu.org/ml/gcc-patches/2013-12/msg00877.html

kenny

Index: gcc/wide-int.cc
===================================================================
--- gcc/wide-int.cc	(revision 205765)
+++ gcc/wide-int.cc	(working copy)
@@ -1275,23 +1275,31 @@ wi::mul_internal (HOST_WIDE_INT *val, co
   unsigned int j;
   unsigned int blocks_needed = BLOCKS_NEEDED (prec);
   unsigned int half_blocks_needed = blocks_needed * 2;
-  /* The sizes here are scaled to support a 2x largest mode by 2x
-     largest mode yielding a 4x largest mode result.  This is what is
-     needed by vpn.  */
+  /* The sizes here are scaled to support a 2*largest mode + a little
+     by 2*largest mode + a little yielding a 4*largest mode + a
+     little.  This is what is needed by vpn.  */
   unsigned HOST_HALF_WIDE_INT
-    u[4 * MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_HALF_WIDE_INT];
+    u[2 * WIDE_INT_MAX_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
+  unsigned int ulen;
   unsigned HOST_HALF_WIDE_INT
-    v[4 * MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_HALF_WIDE_INT];
-  /* The '2' in 'R' is because we are internally doing a full
+    v[2 * WIDE_INT_MAX_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
+  unsigned int vlen;
+  unsigned int uvlen;
+  unsigned int vallen;
+
+  /* The '4' in 'R' is because we are internally doing a
      wide multiply.  */
   unsigned HOST_HALF_WIDE_INT
-    r[2 * 4 * MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_HALF_WIDE_INT];
-  HOST_WIDE_INT mask = ((HOST_WIDE_INT)1 << HOST_BITS_PER_HALF_WIDE_INT) - 1;
+    r[4 * WIDE_INT_MAX_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
+
+  /* True if the result needs to be negated.  */
+  bool is_neg = false;

   /* If the top level routine did not really pass in an overflow, then
      just make sure that we never attempt to set it.  */
   bool needs_overflow = (overflow != 0);
+  const HOST_WIDE_INT zero = 0;
   if (needs_overflow)
     *overflow = false;

@@ -1382,61 +1392,96 @@ wi::mul_internal (HOST_WIDE_INT *val, co
       return 1;
     }

-  /* We do unsigned mul and then correct it.  */
-  wi_unpack (u, (const unsigned HOST_WIDE_INT *) op1val, op1len,
-	     half_blocks_needed, prec, SIGNED);
-  wi_unpack (v, (const unsigned HOST_WIDE_INT *) op2val, op2len,
-	     half_blocks_needed, prec, SIGNED);
-
-  /* The 2 is for a full mult.  */
-  memset (r, 0, half_blocks_needed * 2
-	  * HOST_BITS_PER_HALF_WIDE_INT / CHAR_BIT);
+  ulen = op1len * 2;
+  vlen = op2len * 2;
+
+  if (sgn == SIGNED)
+    {
+      if (wi::neg_p (op1))
+	{
+	  HOST_WIDE_INT nop1[2 * WIDE_INT_MAX_ELTS];
+
+	  int nop1len
+	    = wi::sub_large (nop1, &zero, 1, op1val, op1len, prec + 1, SIGNED, 0);
+	  is_neg = !is_neg;
+	  ulen = nop1len * 2;
+	  wi_unpack (u, (const unsigned HOST_WIDE_INT*)nop1, nop1len,
+		     ulen, prec, SIGNED);
+	}
+      else
+	wi_unpack (u, (const unsigned HOST_WIDE_INT*)op1val, op1len,
+		   ulen, prec, SIGNED);
+
+      if (wi::neg_p (op2))
+	{
+	  HOST_WIDE_INT nop2[2 * WIDE_INT_MAX_ELTS];
+
+	  int nop2len
+	    = wi::sub_large (nop2, &zero, 1, op2val, op2len, prec + 1, SIGNED, 0);
+	  is_neg = !is_neg;
+	  vlen = nop2len * 2;
+	  wi_unpack (v, (const unsigned HOST_WIDE_INT*)nop2, nop2len,
+		     vlen, prec, SIGNED);
+	}
+      else
+	wi_unpack (v, (const unsigned HOST_WIDE_INT*)op2val, op2len,
+		   vlen, prec, SIGNED);
+    }
+  else
+    {
+      /* UNSIGNED mul.  */
+      /* For large positive integers inside regular wide-ints, we must
+	 explicitly expand out the 1s to full width for the mul to be
+	 correct.  Note that the top bit will never be set for an
+	 unsigned number in offset_int or widest_int.  */
+      if (wi::neg_p (op1))
+	ulen = half_blocks_needed;
+      if (wi::neg_p (op2))
+	vlen = half_blocks_needed;
+
+      wi_unpack (u, (const unsigned HOST_WIDE_INT*)op1val, op1len,
+		 ulen, prec, SIGNED);
+
+      wi_unpack (v, (const unsigned HOST_WIDE_INT*)op2val, op2len,
+		 vlen, prec, SIGNED);
+    }

-  for (j = 0; j < half_blocks_needed; j++)
+  uvlen = ulen + vlen;
+  memset (r, 0, uvlen * HOST_BITS_PER_HALF_WIDE_INT / CHAR_BIT);
+
+  /* Do the actual multiply.  */
+  for (j = 0; j < vlen; j++)
     {
       k = 0;
-      for (i = 0; i < half_blocks_needed; i++)
+      for (i = 0; i < ulen; i++)
	{
	  t = ((unsigned HOST_WIDE_INT)u[i] * (unsigned HOST_WIDE_INT)v[j]
	       + r[i + j] + k);
	  r[i + j] = t & HALF_INT_MASK;
	  k = t >> HOST_BITS_PER_HALF_WIDE_INT;
	}
-      r[j + half_blocks_needed] = k;
+      r[j + ulen] = k;
     }

-  /* We did unsigned math above.  For signed we must adjust the
-     product (assuming we need to see that).  */
-  if (sgn == SIGNED && (high || needs_overflow))
+  /* We did unsigned math above.  For signed we may need to adjust the
+     product if exactly one of the operands was negative.  */
+  if (is_neg)
     {
-      unsigned HOST_WIDE_INT b;
-      if (wi::neg_p (op1))
+      t = (unsigned HOST_WIDE_INT)(~r[0]) + 1;
+      r[0] = t & HALF_INT_MASK;
+      for (i = 1; i < uvlen; i++)
	{
-	  b = 0;
-	  for (i = 0; i < half_blocks_needed; i++)
-	    {
-	      t = (unsigned HOST_WIDE_INT)r[i + half_blocks_needed]
-		- (unsigned HOST_WIDE_INT)v[i] - b;
-	      r[i + half_blocks_needed] = t & HALF_INT_MASK;
-	      b = t >> (HOST_BITS_PER_WIDE_INT - 1);
-	    }
-	}
-      if (wi::neg_p (op2))
-	{
-	  b = 0;
-	  for (i = 0; i < half_blocks_needed; i++)
-	    {
-	      t = (unsigned HOST_WIDE_INT)r[i + half_blocks_needed]
-		- (unsigned HOST_WIDE_INT)u[i] - b;
-	      r[i + half_blocks_needed] = t & HALF_INT_MASK;
-	      b = t >> (HOST_BITS_PER_WIDE_INT - 1);
-	    }
+	  t = (unsigned HOST_WIDE_INT)(~r[i]) + (t >> HOST_BITS_PER_HALF_WIDE_INT);
+	  r[i] = t & HALF_INT_MASK;
	}
     }

-  if (needs_overflow)
+  /* We only need to do the expensive check for overflow if the value
+     really was big enough to overflow.  */
+  if (needs_overflow && (uvlen >= half_blocks_needed))
     {
       HOST_WIDE_INT top;
+      const HOST_WIDE_INT mask = ((HOST_WIDE_INT)1 << HOST_BITS_PER_HALF_WIDE_INT) - 1;

       /* For unsigned, overflow is true if any of the top bits are set.
	  For signed, overflow is true if any of the top bits are not equal
@@ -1450,7 +1495,7 @@ wi::mul_internal (HOST_WIDE_INT *val, co
	  top &= mask;
	}

-      for (i = half_blocks_needed; i < half_blocks_needed * 2; i++)
+      for (i = half_blocks_needed; i < uvlen; i++)
	if (((HOST_WIDE_INT)(r[i] & mask)) != top)
	  *overflow = true;
     }

@@ -1459,15 +1504,42 @@ wi::mul_internal (HOST_WIDE_INT *val, co
     {
       /* compute [prec] <- ([prec] * [prec]) >> [prec] */
       wi_pack ((unsigned HOST_WIDE_INT *) val,
-	       &r[half_blocks_needed], half_blocks_needed);
-      return canonize (val, blocks_needed, prec);
+	       r, uvlen);
+      vallen = canonize (val, (uvlen + 1) >> 1, prec);
+
+      /* Shift is not always safe to write over one of the
+	 operands, so we must copy.  */
+      HOST_WIDE_INT tval[2 * WIDE_INT_MAX_ELTS];
+      memcpy (tval, val, vallen * CHAR_BIT / HOST_BITS_PER_WIDE_INT);
+      vallen = wi::lrshift_large (val, tval, vallen, prec*2, prec, prec);
     }
   else
     {
+      if (uvlen > half_blocks_needed)
+	uvlen = half_blocks_needed;
       /* compute [prec] <- ([prec] * [prec]) && ((1 << [prec]) - 1) */
-      wi_pack ((unsigned HOST_WIDE_INT *) val, r, half_blocks_needed);
-      return canonize (val, blocks_needed, prec);
+      wi_pack ((unsigned HOST_WIDE_INT *) val, r, uvlen);
+      vallen = canonize (val, (uvlen + 1) >> 1, prec);
     }
+
+#if 0
+  static int cnt = 0;
+  printf ("cnt=%d op1=", cnt++);
+  for (i = 0; i < op1len; i++)
+    printf ("0x%lx ", op1[i]);
+
+  printf (" op2=");
+  for (i = 0; i < op2len; i++)
+    printf ("0x%lx ", op2[i]);
+  printf (" result=");
+  for (i = 0; i < vallen; i++)
+    printf ("0x%lx ", val[i]);
+  if (needs_overflow && *overflow)
+    printf (" OVERFLOW");
+  printf ("\n");
+#endif
+
+  return vallen;
 }

 /* Compute the population count of X.  */
Index: gcc/wide-int.h
===================================================================
--- gcc/wide-int.h	(revision 205765)
+++ gcc/wide-int.h	(working copy)
@@ -2406,7 +2408,7 @@ wi::mul (const T1 &x, const T2 &y)
     }
   else
     result.set_len (mul_internal (val, xi.val, xi.len, yi.val, yi.len,
-				  precision, UNSIGNED, 0, false));
+				  precision, SIGNED, 0, false));
   return result;
 }
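
---

For readers who want the shape of the algorithm without the wide-int plumbing, here is a minimal standalone sketch of the scheme described above: normalize signed operands to non-negative values first, do an unsigned schoolbook multiply over half-width digits whose counts are derived from the operands' significant lengths, and negate the product at the end if exactly one input was negative.  This is only an illustration of the technique, not the GCC code: the fixed uint32_t/uint64_t digit types and the mul_digits/negate_digits/smul_sketch helper names are assumptions made to keep the example self-contained.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define HALF_BITS 32          /* stand-in for HOST_BITS_PER_HALF_WIDE_INT */
#define HALF_MASK 0xffffffffu /* stand-in for HALF_INT_MASK */

/* r[0 .. ulen+vlen-1] = u * v, least-significant digit first.
   Exactly ulen * vlen single multiplies are performed, which is why
   sizing ulen/vlen from the significant bits of the inputs pays off.  */
static void
mul_digits (uint32_t *r, const uint32_t *u, unsigned int ulen,
            const uint32_t *v, unsigned int vlen)
{
  memset (r, 0, (ulen + vlen) * sizeof (uint32_t));
  for (unsigned int j = 0; j < vlen; j++)
    {
      uint64_t k = 0;                       /* carry */
      for (unsigned int i = 0; i < ulen; i++)
        {
          uint64_t t = (uint64_t) u[i] * v[j] + r[i + j] + k;
          r[i + j] = t & HALF_MASK;
          k = t >> HALF_BITS;
        }
      r[j + ulen] = (uint32_t) k;
    }
}

/* Two's-complement negate LEN digits in place (~digit plus carry),
   mirroring the patch's is_neg fixup loop.  */
static void
negate_digits (uint32_t *r, unsigned int len)
{
  uint64_t t = (uint64_t) (~r[0] & HALF_MASK) + 1;
  r[0] = t & HALF_MASK;
  for (unsigned int i = 1; i < len; i++)
    {
      t = (uint64_t) (~r[i] & HALF_MASK) + (t >> HALF_BITS);
      r[i] = t & HALF_MASK;
    }
}

/* Signed 64x64 -> 128 bit multiply via the normalize-first scheme.
   The result lands in r[0..3], least-significant digit first.  */
static void
smul_sketch (uint32_t r[4], int64_t a, int64_t b)
{
  bool is_neg = false;
  uint64_t ua = (uint64_t) a, ub = (uint64_t) b;
  if (a < 0) { ua = 0 - ua; is_neg = !is_neg; }  /* make operand positive */
  if (b < 0) { ub = 0 - ub; is_neg = !is_neg; }
  uint32_t u[2] = { (uint32_t) ua, (uint32_t) (ua >> HALF_BITS) };
  uint32_t v[2] = { (uint32_t) ub, (uint32_t) (ub >> HALF_BITS) };
  mul_digits (r, u, 2, v, 2);
  if (is_neg)
    negate_digits (r, 4);    /* exactly one operand was negative */
}

The payoff is visible in mul_digits: it performs exactly ulen * vlen hardware multiplies, so trimming ulen and vlen to the operands' significant digits is what turns the precision-squared cost into a significant-bits-squared cost.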
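A quick sanity check of the sketch (again illustrative; it assumes the definitions above are in the same file):

#include <inttypes.h>
#include <stdio.h>

int
main (void)
{
  uint32_t r[4];
  smul_sketch (r, -3, 5);   /* exactly one negative operand */
  /* Prints 0xfffffffffffffffffffffffffffffff1, i.e. -15 as a
     128-bit two's-complement value.  */
  printf ("0x%08" PRIx32 "%08" PRIx32 "%08" PRIx32 "%08" PRIx32 "\n",
          r[3], r[2], r[1], r[0]);
  return 0;
}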