Factor out division by squares and remove division around comparisons (2/2)

Message ID d9796ede-3704-d4bf-11c1-00e89eae7a64@foss.arm.com
State New
Headers show

Commit Message

Jackson Woodruff Aug. 10, 2017, 2:10 p.m.
Hi all,

The patch implements the some of the division optimizations discussed in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .

We now reassociate (as discussed in the bug report):

     x / (y * y) -> x  * (1 / y) * (1 / y)

If it is reasonable to do so. This is done with
-funsafe-math-optimizations.

Bootstrapped and regtested with part (1/2). OK for trunk?

Jackson

gcc/

2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>

	PR 71026/tree-optimization
	* tree-ssa-math-opts (is_division_by_square,
	is_square_of, insert_sqaure_reciprocals): New.
	(insert_reciprocals): Change to insert reciprocals
	before a division by a square.
	(execute_cse_reciprocals_1): Change to consider
	division by a square.


gcc/testsuite

2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>

	PR 71026/tree-optimization
	* gcc.dg/associate_division_1.c: New.

Comments

Richard Biener Aug. 15, 2017, 2:07 p.m. | #1
On Thu, Aug 10, 2017 at 4:10 PM, Jackson Woodruff
<jackson.woodruff@foss.arm.com> wrote:
> Hi all,
>
> The patch implements the some of the division optimizations discussed in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .
>
> We now reassociate (as discussed in the bug report):
>
>     x / (y * y) -> x  * (1 / y) * (1 / y)
>
> If it is reasonable to do so. This is done with
> -funsafe-math-optimizations.
>
> Bootstrapped and regtested with part (1/2). OK for trunk?

I believe your enhancement shows the inherent weakness of
CSE of reciprocals in that it works from the defs.  It will
handle x / (y * y) but not x / (y * y * y).

I think a rewrite of this mini-pass is warranted.

Richard.

> Jackson
>
> gcc/
>
> 2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>
>
>         PR 71026/tree-optimization
>         * tree-ssa-math-opts (is_division_by_square,
>         is_square_of, insert_sqaure_reciprocals): New.
>         (insert_reciprocals): Change to insert reciprocals
>         before a division by a square.
>         (execute_cse_reciprocals_1): Change to consider
>         division by a square.
>
>
> gcc/testsuite
>
> 2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>
>
>         PR 71026/tree-optimization
>         * gcc.dg/associate_division_1.c: New.
>
Jackson Woodruff Aug. 30, 2017, 4:32 p.m. | #2
Hi all,

I've attached a new version of the patch in response to a few of Wilco's 
comments in person.

The end product of the pass is still the same, but I have fixed several 
bugs.

Now tested independently of the other patches.

On 08/15/2017 03:07 PM, Richard Biener wrote:
> On Thu, Aug 10, 2017 at 4:10 PM, Jackson Woodruff
> <jackson.woodruff@foss.arm.com> wrote:
>> Hi all,
>>
>> The patch implements the some of the division optimizations discussed in
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .
>>
>> We now reassociate (as discussed in the bug report):
>>
>>      x / (y * y) -> x  * (1 / y) * (1 / y)
>>
>> If it is reasonable to do so. This is done with
>> -funsafe-math-optimizations.
>>
>> Bootstrapped and regtested with part (1/2). OK for trunk?
> 
> I believe your enhancement shows the inherent weakness of
> CSE of reciprocals in that it works from the defs.  It will
> handle x / (y * y) but not x / (y * y * y).
> 
> I think a rewrite of this mini-pass is warranted.

I suspect that there might be more to gain by of handling the case of
x / (y * z) rather than the case of x / (y**n), but I agree that this 
pass could do more.

> 
> Richard.
> 
>> Jackson
>>
>> gcc/
>>
>> 2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>
>>
>>          PR 71026/tree-optimization
>>          * tree-ssa-math-opts (is_division_by_square,
>>          is_square_of, insert_sqaure_reciprocals): New.
>>          (insert_reciprocals): Change to insert reciprocals
>>          before a division by a square.
>>          (execute_cse_reciprocals_1): Change to consider
>>          division by a square.
>>
>>
>> gcc/testsuite
>>
>> 2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>
>>
>>          PR 71026/tree-optimization
>>          * gcc.dg/associate_division_1.c: New.
>>

Thanks,

Jackson.

Updated ChangeLog:

gcc/

2017-08-30  Jackson Woodruff  <jackson.woodruff@arm.com>

	PR 71026/tree-optimization
	* tree-ssa-math-opts (is_division_by_square, is_square_of): New.
	(insert_reciprocals): Change to insert reciprocals
	before a division by a square and to insert the square
	of a reciprocal.
	(execute_cse_reciprocals_1): Change to consider
	division by a square.
	(register_division_in): Add importance parameter.

gcc/testsuite

2017-08-30  Jackson Woodruff  <jackson.woodruff@arm.com>

	PR 71026/tree-optimization
	* gcc.dg/extract_recip_3.c: New.
	* gcc.dg/extract_recip_4.c: New.
	* gfortran.dg/extract_recip_1.f: New.
diff --git a/gcc/testsuite/gcc.dg/extract_recip_3.c b/gcc/testsuite/gcc.dg/extract_recip_3.c
new file mode 100644
index 0000000000000000000000000000000000000000..0ea3fdf5cca06a0806a55185e0000f8b0a4b0507
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/extract_recip_3.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+float
+extract_square (float x, float y, float *a, float *b)
+{
+  *a = 3 / (y * y);
+  *b = 5 / (y * y);
+
+  return x / (y * y);
+}
+
+/* Don't expect the 'powmult' (calculation of y * y)
+   to be deleted until a later pass, so look for one
+   more multiplication than strictly necessary.  */
+float
+extract_recip (float *w, float x, float y, float z)
+{
+  *w = 7 / y;
+
+  return x / (y * y) + z / y;
+}
+
+/* 3 For the pointers to a, b and w, 4 multiplications in 'extract_square',
+   5 multiplications in 'extract_recip' expected.  */
+/* { dg-final { scan-tree-dump-times " \\* " 12 "optimized" } } */
+
+/* 1 division in 'extract_square', 1 division in 'extract_recip'. */
+/* { dg-final { scan-tree-dump-times " / " 2 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/extract_recip_4.c b/gcc/testsuite/gcc.dg/extract_recip_4.c
new file mode 100644
index 0000000000000000000000000000000000000000..5a31d40ed910cdd1914cc1e82358493be428946a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/extract_recip_4.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+/* Don't expect any of these divisions to be extracted.  */
+double f (double x, int p)
+{
+  if (p > 0)
+    {
+      return 1.0/(x * x);
+    }
+
+  if (p > -1)
+    {
+      return x * x * x;
+    }
+  return  1.0 /(x);
+}
+
+/* Expect a reciprocal to be extracted here.  */
+double g (double x, double y)
+{
+  double k = x / (y * y);
+
+  if (y * y == 2.0)
+    return k + 1 / y;
+  else
+    return k - 1 / y;
+}
+
+/* Expect 2 divisions in 'f' and 1 in 'g'.  */
+/* { dg-final { scan-tree-dump-times " / " 3 "optimized" } } */
+/* Expect 3 multiplications in 'f' and 3 in 'g'.  */
+/* { dg-final { scan-tree-dump-times " \\* " 6 "optimized" } } */
diff --git a/gcc/testsuite/gfortran.dg/extract_recip_1.f b/gcc/testsuite/gfortran.dg/extract_recip_1.f
new file mode 100644
index 0000000000000000000000000000000000000000..ecf05189773b6c2f46222857fd88fd010bfdf348
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/extract_recip_1.f
@@ -0,0 +1,19 @@
+! { dg-do compile }
+! { dg-options "-Ofast -fdump-tree-optimized" }
+
+      SUBROUTINE F(N,X,Y,Z,A,B)
+          DIMENSION X(0,4), Y(4), Z(4)
+          REAL, INTENT(INOUT) :: A, B
+
+          A = 1 / Y(N)*Y(N)
+
+          DO I = 1, NV
+          X(I, I) = 1 + X(I, I)
+          ENDDO
+
+          Z(1) =  B / Y(N)
+          Z(2) =  N / Y(N)
+          RETURN
+      END
+
+! { dg-final { scan-tree-dump-times " / " 1 "optimized" } }
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index df0bcd6d459119243a69fc652e761f90d574ca78..f2e308aee970147b05b5e0d8e5aa19499afa41c7 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -127,6 +127,10 @@ struct occurrence {
      inserted in BB.  */
   tree recip_def;
 
+  /* If non-NULL, the SSA_NAME holding the definition for a squared
+     reciprocal inserted in BB.  */
+  tree square_recip_def;
+
   /* If non-NULL, the GIMPLE_ASSIGN for a reciprocal computation that
      was inserted in BB.  */
   gimple *recip_def_stmt;
@@ -282,10 +286,14 @@ insert_bb (struct occurrence *new_occ, basic_block idom,
   *p_head = new_occ;
 }
 
-/* Register that we found a division in BB.  */
+/* Register that we found a division in BB.
+   IMPORTANCE is a measure of how much weighting to give
+   that division.  Use IMPORTANCE = 2 to register a single
+   division.  If the division is going to be found multiple
+   times use 1 (as it is with squares).  */
 
 static inline void
-register_division_in (basic_block bb)
+register_division_in (basic_block bb, int importance)
 {
   struct occurrence *occ;
 
@@ -297,7 +305,7 @@ register_division_in (basic_block bb)
     }
 
   occ->bb_has_division = true;
-  occ->num_divisions++;
+  occ->num_divisions += importance;
 }
 
 
@@ -340,6 +348,39 @@ is_division_by (gimple *use_stmt, tree def)
 	 && gimple_assign_rhs1 (use_stmt) != def;
 }
 
+/* Return whether USE_STMT is DEF * DEF.  */
+static inline bool
+is_square_of (gimple *use_stmt, tree def)
+{
+  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
+      && gimple_assign_rhs_code (use_stmt) == MULT_EXPR)
+    {
+      tree op0 = gimple_assign_rhs1 (use_stmt);
+      tree op1 = gimple_assign_rhs2 (use_stmt);
+
+      return op0 == op1 && op0 == def;
+    }
+  return 0;
+}
+
+/* Return whether USE_STMT is a floating-point division by
+   DEF * DEF.  */
+static inline bool
+is_division_by_square (gimple *use_stmt, tree def)
+{
+  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
+      && gimple_assign_rhs_code (use_stmt) == RDIV_EXPR
+      && gimple_assign_rhs1 (use_stmt) != gimple_assign_rhs2 (use_stmt))
+    {
+      tree denominator = gimple_assign_rhs2 (use_stmt);
+      if (TREE_CODE (denominator) == SSA_NAME)
+	{
+	  return is_square_of (SSA_NAME_DEF_STMT (denominator), def);
+	}
+    }
+  return 0;
+}
+
 /* Walk the subset of the dominator tree rooted at OCC, setting the
    RECIP_DEF field to a definition of 1.0 / DEF that can be used in
    the given basic block.  The field may be left NULL, of course,
@@ -347,20 +388,27 @@ is_division_by (gimple *use_stmt, tree def)
 
    DEF_BSI is an iterator pointing at the statement defining DEF.
    If RECIP_DEF is set, a dominator already has a computation that can
-   be used.  */
+   be used.
+
+   If should_insert_square_recip is set, then this also inserts
+   the square of the reciprocal immediately after the definition
+   of the reciprocal.  */
 
 static void
 insert_reciprocals (gimple_stmt_iterator *def_gsi, struct occurrence *occ,
-		    tree def, tree recip_def, int threshold)
+		    tree def, tree recip_def, tree square_recip_def,
+		    int should_insert_square_recip, int threshold)
 {
   tree type;
-  gassign *new_stmt;
+  gassign *new_stmt, *new_square_stmt;
   gimple_stmt_iterator gsi;
   struct occurrence *occ_child;
 
   if (!recip_def
       && (occ->bb_has_division || !flag_trapping_math)
-      && occ->num_divisions >= threshold)
+      /* Divide by two as all divisions are counted twice in
+	 the costing loop.  */
+      && occ->num_divisions / 2 >= threshold)
     {
       /* Make a variable with the replacement and substitute it.  */
       type = TREE_TYPE (def);
@@ -368,29 +416,44 @@ insert_reciprocals (gimple_stmt_iterator *def_gsi, struct occurrence *occ,
       new_stmt = gimple_build_assign (recip_def, RDIV_EXPR,
 				      build_one_cst (type), def);
 
+      if (should_insert_square_recip)
+	{
+	  square_recip_def = create_tmp_reg (type, "powmult_reciptmp");
+	  new_square_stmt = gimple_build_assign (square_recip_def, MULT_EXPR,
+						 recip_def, recip_def);
+	}
+
       if (occ->bb_has_division)
-        {
-          /* Case 1: insert before an existing division.  */
-          gsi = gsi_after_labels (occ->bb);
-          while (!gsi_end_p (gsi) && !is_division_by (gsi_stmt (gsi), def))
+	{
+	  /* Case 1: insert before an existing division.  */
+	  gsi = gsi_after_labels (occ->bb);
+	  while (!gsi_end_p (gsi)
+		 && (!is_division_by (gsi_stmt (gsi), def))
+		 && (!is_division_by_square (gsi_stmt (gsi), def)))
 	    gsi_next (&gsi);
 
-          gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
-        }
+	  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+	}
       else if (def_gsi && occ->bb == def_gsi->bb)
-        {
-          /* Case 2: insert right after the definition.  Note that this will
+	{
+	  /* Case 2: insert right after the definition.  Note that this will
 	     never happen if the definition statement can throw, because in
 	     that case the sole successor of the statement's basic block will
 	     dominate all the uses as well.  */
-          gsi_insert_after (def_gsi, new_stmt, GSI_NEW_STMT);
-        }
+	  gsi = *def_gsi;
+	  gsi_insert_after (def_gsi, new_stmt, GSI_NEW_STMT);
+	}
       else
-        {
-          /* Case 3: insert in a basic block not containing defs/uses.  */
-          gsi = gsi_after_labels (occ->bb);
-          gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
-        }
+	{
+	  /* Case 3: insert in a basic block not containing defs/uses.  */
+	  gsi = gsi_after_labels (occ->bb);
+	  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+	}
+
+      /* Regardless of which case the reciprocal as inserted in,
+	 we insert the square immediately after the reciprocal.  */
+      if (should_insert_square_recip)
+	gsi_insert_before (&gsi, new_square_stmt, GSI_SAME_STMT);
 
       reciprocal_stats.rdivs_inserted++;
 
@@ -398,8 +461,32 @@ insert_reciprocals (gimple_stmt_iterator *def_gsi, struct occurrence *occ,
     }
 
   occ->recip_def = recip_def;
+  occ->square_recip_def = square_recip_def;
   for (occ_child = occ->children; occ_child; occ_child = occ_child->next)
-    insert_reciprocals (def_gsi, occ_child, def, recip_def, threshold);
+    insert_reciprocals (def_gsi, occ_child, def, recip_def,
+			square_recip_def, should_insert_square_recip,
+			threshold);
+}
+
+/* Replace occurrences of expr / (x * x) with expr * ((1 / x) * (1 / x)).
+   Take as argument the use for (x * x).  */
+static inline void
+replace_reciprocal_squares (use_operand_p use_p)
+{
+  gimple *use_stmt = USE_STMT (use_p);
+  basic_block bb = gimple_bb (use_stmt);
+  struct occurrence *occ = (struct occurrence *) bb->aux;
+
+  if (optimize_bb_for_speed_p (bb) && occ->square_recip_def
+      && occ->recip_def)
+    {
+      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
+      gimple_assign_set_rhs_code (use_stmt, MULT_EXPR);
+      gimple_assign_set_rhs2 (use_stmt, occ->square_recip_def);
+      SET_USE (use_p, occ->square_recip_def);
+      fold_stmt_inplace (&gsi);
+      update_stmt (use_stmt);
+    }
 }
 
 
@@ -460,32 +547,85 @@ free_bb (struct occurrence *occ)
 static void
 execute_cse_reciprocals_1 (gimple_stmt_iterator *def_gsi, tree def)
 {
-  use_operand_p use_p;
-  imm_use_iterator use_iter;
+  use_operand_p use_p, square_use_p;
+  imm_use_iterator use_iter, square_use_iter;
+  tree square_def;
   struct occurrence *occ;
-  int count = 0, threshold;
+  int count = 0;
+  int threshold;
+  int square_recip_count = 0;
+  int sqrt_recip_count = 0;
 
   gcc_assert (FLOAT_TYPE_P (TREE_TYPE (def)) && is_gimple_reg (def));
+  threshold = targetm.min_divisions_for_recip_mul (TYPE_MODE (TREE_TYPE (def)));
+
+  /* If this is a square (x * x), we should check whether there are any
+     enough divisions by x on it's own to warrant waiting for that pass.  */
+  if (TREE_CODE (def) == SSA_NAME)
+    {
+      gimple *def_stmt = SSA_NAME_DEF_STMT (def);
+
+      if (is_gimple_assign (def_stmt)
+	  && gimple_assign_rhs_code (def_stmt) == MULT_EXPR
+	  && gimple_assign_rhs1 (def_stmt) == gimple_assign_rhs2 (def_stmt))
+	{
+	  /* This statement is a square of something.  We should take this
+	     in to account, as it may be more profitable to not extract
+	     the reciprocal here.  */
+	  tree op0 = gimple_assign_rhs1 (def_stmt);
+	  FOR_EACH_IMM_USE_FAST (use_p, use_iter, op0)
+	    {
+	      gimple *use_stmt = USE_STMT (use_p);
+	      if (is_division_by (use_stmt, op0))
+		sqrt_recip_count ++;
+	    }
+	}
+    }
 
   FOR_EACH_IMM_USE_FAST (use_p, use_iter, def)
     {
       gimple *use_stmt = USE_STMT (use_p);
       if (is_division_by (use_stmt, def))
 	{
-	  register_division_in (gimple_bb (use_stmt));
+	  register_division_in (gimple_bb (use_stmt), 2);
 	  count++;
 	}
+
+      if (is_square_of (use_stmt, def))
+	{
+	  square_def = gimple_assign_lhs (use_stmt);
+	  FOR_EACH_IMM_USE_FAST (square_use_p, square_use_iter, square_def)
+	    {
+	      gimple *square_use_stmt = USE_STMT (square_use_p);
+	      if (is_division_by (square_use_stmt, square_def))
+		{
+		  /* Halve the relative importance as this is called twice
+		     for each division by a square.  */
+		  register_division_in (gimple_bb (square_use_stmt), 1);
+		  square_recip_count ++;
+		}
+	    }
+	}
     }
 
+  /* Square reciprocals will have been counted twice.  */
+  square_recip_count /= 2;
+
+  if (sqrt_recip_count > square_recip_count)
+    /* It will be more profitable to extract a 1 / x expression later,
+       so it is not worth attempting to extract 1 / (x * x) now.  */
+    return;
+
   /* Do the expensive part only if we can hope to optimize something.  */
-  threshold = targetm.min_divisions_for_recip_mul (TYPE_MODE (TREE_TYPE (def)));
-  if (count >= threshold)
+  if (count + square_recip_count >= threshold
+      && count >= 1)
     {
       gimple *use_stmt;
       for (occ = occ_head; occ; occ = occ->next)
 	{
 	  compute_merit (occ);
-	  insert_reciprocals (def_gsi, occ, def, NULL, threshold);
+	  insert_reciprocals (def_gsi, occ, def, NULL, NULL,
+			      square_recip_count, threshold);
 	}
 
       FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, def)
@@ -495,6 +635,31 @@ execute_cse_reciprocals_1 (gimple_stmt_iterator *def_gsi, tree def)
 	      FOR_EACH_IMM_USE_ON_STMT (use_p, use_iter)
 		replace_reciprocal (use_p);
 	    }
+	  else if (square_recip_count > 0
+		   && is_square_of (use_stmt, def))
+	    {
+	      FOR_EACH_IMM_USE_ON_STMT (use_p, use_iter)
+		{
+		  /* Find all uses of the square that are divisions and
+		   * replace them by multiplications with the inverse.  */
+		  imm_use_iterator square_iterator;
+		  gimple *powmult_use_stmt = USE_STMT (use_p);
+		  tree powmult_def_name = gimple_assign_lhs (powmult_use_stmt);
+
+		  FOR_EACH_IMM_USE_STMT (powmult_use_stmt,
+					 square_iterator, powmult_def_name)
+		    FOR_EACH_IMM_USE_ON_STMT (square_use_p, square_iterator)
+		      {
+			gimple *powmult_use_stmt = USE_STMT (square_use_p);
+			if (is_gimple_assign (powmult_use_stmt)
+			    && gimple_assign_rhs_code (powmult_use_stmt)
+			    == RDIV_EXPR
+			    && gimple_assign_rhs2 (powmult_use_stmt)
+			    == powmult_def_name)
+			  replace_reciprocal_squares (square_use_p);
+		      }
+		}
+	    }
 	}
     }
Jackson Woodruff Sept. 6, 2017, 9:55 a.m. | #3
Hi all,

A minor improvement came to mind while updating other parts of this patch.

I've updated a testcase to make it more clear and a condition now uses a 
call to is_division_by rather than manually checking those conditions.

Jackson

On 08/30/2017 05:32 PM, Jackson Woodruff wrote:
> Hi all,
> 
> I've attached a new version of the patch in response to a few of Wilco's 
> comments in person.
> 
> The end product of the pass is still the same, but I have fixed several 
> bugs.
> 
> Now tested independently of the other patches.
> 
> On 08/15/2017 03:07 PM, Richard Biener wrote:
>> On Thu, Aug 10, 2017 at 4:10 PM, Jackson Woodruff
>> <jackson.woodruff@foss.arm.com> wrote:
>>> Hi all,
>>>
>>> The patch implements the some of the division optimizations discussed in
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71026 .
>>>
>>> We now reassociate (as discussed in the bug report):
>>>
>>>      x / (y * y) -> x  * (1 / y) * (1 / y)
>>>
>>> If it is reasonable to do so. This is done with
>>> -funsafe-math-optimizations.
>>>
>>> Bootstrapped and regtested with part (1/2). OK for trunk?
>>
>> I believe your enhancement shows the inherent weakness of
>> CSE of reciprocals in that it works from the defs.  It will
>> handle x / (y * y) but not x / (y * y * y).
>>
>> I think a rewrite of this mini-pass is warranted.
> 
> I suspect that there might be more to gain by of handling the case of
> x / (y * z) rather than the case of x / (y**n), but I agree that this 
> pass could do more.
> 
>>
>> Richard.
>>
>>> Jackson
>>>
>>> gcc/
>>>
>>> 2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>
>>>
>>>          PR 71026/tree-optimization
>>>          * tree-ssa-math-opts (is_division_by_square,
>>>          is_square_of, insert_sqaure_reciprocals): New.
>>>          (insert_reciprocals): Change to insert reciprocals
>>>          before a division by a square.
>>>          (execute_cse_reciprocals_1): Change to consider
>>>          division by a square.
>>>
>>>
>>> gcc/testsuite
>>>
>>> 2017-08-03  Jackson Woodruff  <jackson.woodruff@arm.com>
>>>
>>>          PR 71026/tree-optimization
>>>          * gcc.dg/associate_division_1.c: New.
>>>
> 
> Thanks,
> 
> Jackson.
> 
> Updated ChangeLog:
> 
> gcc/
> 
> 2017-08-30  Jackson Woodruff  <jackson.woodruff@arm.com>
> 
>      PR 71026/tree-optimization
>      * tree-ssa-math-opts (is_division_by_square, is_square_of): New.
>      (insert_reciprocals): Change to insert reciprocals
>      before a division by a square and to insert the square
>      of a reciprocal.
>      (execute_cse_reciprocals_1): Change to consider
>      division by a square.
>      (register_division_in): Add importance parameter.
> 
> gcc/testsuite
> 
> 2017-08-30  Jackson Woodruff  <jackson.woodruff@arm.com>
> 
>      PR 71026/tree-optimization
>      * gcc.dg/extract_recip_3.c: New.
>      * gcc.dg/extract_recip_4.c: New.
>      * gfortran.dg/extract_recip_1.f: New.
diff --git a/gcc/testsuite/gcc.dg/extract_recip_3.c b/gcc/testsuite/gcc.dg/extract_recip_3.c
new file mode 100644
index 0000000000000000000000000000000000000000..ad9f2dc36f1e695ceca1f50bc78f4ac4fbb2e787
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/extract_recip_3.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+float
+extract_square (float *a, float *b, float x, float y)
+{
+  *a = 3 / (y * y);
+  *b = 5 / (y * y);
+
+  return x / (y * y);
+}
+
+/* Don't expect the 'powmult' (calculation of y * y)
+   to be deleted until a later pass, so look for one
+   more multiplication than strictly necessary.  */
+float
+extract_recip (float *a, float *b, float x, float y, float z)
+{
+  *a = 7 / y;
+  *b = x / (y * y);
+
+  return z / y;
+}
+
+/* 4 For the pointers to a, b, 4 multiplications in 'extract_square',
+   4 multiplications in 'extract_recip' expected.  */
+/* { dg-final { scan-tree-dump-times " \\* " 12 "optimized" } } */
+
+/* 1 division in 'extract_square', 1 division in 'extract_recip'. */
+/* { dg-final { scan-tree-dump-times " / " 2 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/extract_recip_4.c b/gcc/testsuite/gcc.dg/extract_recip_4.c
new file mode 100644
index 0000000000000000000000000000000000000000..83105c60ced5c2671f3793d76482c35502712a2c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/extract_recip_4.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+/* Don't expect any of these divisions to be extracted.  */
+double f (double x, int p)
+{
+  if (p > 0)
+    {
+      return 1.0/(x * x);
+    }
+
+  if (p > -1)
+    {
+      return x * x * x;
+    }
+  return  1.0 /(x);
+}
+
+/* Expect a reciprocal to be extracted here.  */
+double g (double *a, double x, double y)
+{
+  *a = 3 / y;
+  double k = x / (y * y);
+
+  if (y * y == 2.0)
+    return k + 1 / y;
+  else
+    return k - 1 / y;
+}
+
+/* Expect 2 divisions in 'f' and 1 in 'g'.  */
+/* { dg-final { scan-tree-dump-times " / " 3 "optimized" } } */
+/* Expect 3 multiplications in 'f' and 4 in 'g'.  Also
+   expect one for the point to a.  */
+/* { dg-final { scan-tree-dump-times " \\* " 8 "optimized" } } */
diff --git a/gcc/testsuite/gfortran.dg/extract_recip_1.f b/gcc/testsuite/gfortran.dg/extract_recip_1.f
new file mode 100644
index 0000000000000000000000000000000000000000..ecf05189773b6c2f46222857fd88fd010bfdf348
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/extract_recip_1.f
@@ -0,0 +1,19 @@
+! { dg-do compile }
+! { dg-options "-Ofast -fdump-tree-optimized" }
+
+      SUBROUTINE F(N,X,Y,Z,A,B)
+          DIMENSION X(0,4), Y(4), Z(4)
+          REAL, INTENT(INOUT) :: A, B
+
+          A = 1 / Y(N)*Y(N)
+
+          DO I = 1, NV
+          X(I, I) = 1 + X(I, I)
+          ENDDO
+
+          Z(1) =  B / Y(N)
+          Z(2) =  N / Y(N)
+          RETURN
+      END
+
+! { dg-final { scan-tree-dump-times " / " 1 "optimized" } }
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index df0bcd6d459119243a69fc652e761f90d574ca78..542cda5a026d2adb2c419553d1954a7008573375 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -127,6 +127,10 @@ struct occurrence {
      inserted in BB.  */
   tree recip_def;
 
+  /* If non-NULL, the SSA_NAME holding the definition for a squared
+     reciprocal inserted in BB.  */
+  tree square_recip_def;
+
   /* If non-NULL, the GIMPLE_ASSIGN for a reciprocal computation that
      was inserted in BB.  */
   gimple *recip_def_stmt;
@@ -282,10 +286,14 @@ insert_bb (struct occurrence *new_occ, basic_block idom,
   *p_head = new_occ;
 }
 
-/* Register that we found a division in BB.  */
+/* Register that we found a division in BB.
+   IMPORTANCE is a measure of how much weighting to give
+   that division.  Use IMPORTANCE = 2 to register a single
+   division.  If the division is going to be found multiple
+   times use 1 (as it is with squares).  */
 
 static inline void
-register_division_in (basic_block bb)
+register_division_in (basic_block bb, int importance)
 {
   struct occurrence *occ;
 
@@ -297,7 +305,7 @@ register_division_in (basic_block bb)
     }
 
   occ->bb_has_division = true;
-  occ->num_divisions++;
+  occ->num_divisions += importance;
 }
 
 
@@ -340,6 +348,39 @@ is_division_by (gimple *use_stmt, tree def)
 	 && gimple_assign_rhs1 (use_stmt) != def;
 }
 
+/* Return whether USE_STMT is DEF * DEF.  */
+static inline bool
+is_square_of (gimple *use_stmt, tree def)
+{
+  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
+      && gimple_assign_rhs_code (use_stmt) == MULT_EXPR)
+    {
+      tree op0 = gimple_assign_rhs1 (use_stmt);
+      tree op1 = gimple_assign_rhs2 (use_stmt);
+
+      return op0 == op1 && op0 == def;
+    }
+  return 0;
+}
+
+/* Return whether USE_STMT is a floating-point division by
+   DEF * DEF.  */
+static inline bool
+is_division_by_square (gimple *use_stmt, tree def)
+{
+  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
+      && gimple_assign_rhs_code (use_stmt) == RDIV_EXPR
+      && gimple_assign_rhs1 (use_stmt) != gimple_assign_rhs2 (use_stmt))
+    {
+      tree denominator = gimple_assign_rhs2 (use_stmt);
+      if (TREE_CODE (denominator) == SSA_NAME)
+	{
+	  return is_square_of (SSA_NAME_DEF_STMT (denominator), def);
+	}
+    }
+  return 0;
+}
+
 /* Walk the subset of the dominator tree rooted at OCC, setting the
    RECIP_DEF field to a definition of 1.0 / DEF that can be used in
    the given basic block.  The field may be left NULL, of course,
@@ -347,20 +388,27 @@ is_division_by (gimple *use_stmt, tree def)
 
    DEF_BSI is an iterator pointing at the statement defining DEF.
    If RECIP_DEF is set, a dominator already has a computation that can
-   be used.  */
+   be used.
+
+   If should_insert_square_recip is set, then this also inserts
+   the square of the reciprocal immediately after the definition
+   of the reciprocal.  */
 
 static void
 insert_reciprocals (gimple_stmt_iterator *def_gsi, struct occurrence *occ,
-		    tree def, tree recip_def, int threshold)
+		    tree def, tree recip_def, tree square_recip_def,
+		    int should_insert_square_recip, int threshold)
 {
   tree type;
-  gassign *new_stmt;
+  gassign *new_stmt, *new_square_stmt;
   gimple_stmt_iterator gsi;
   struct occurrence *occ_child;
 
   if (!recip_def
       && (occ->bb_has_division || !flag_trapping_math)
-      && occ->num_divisions >= threshold)
+      /* Divide by two as all divisions are counted twice in
+	 the costing loop.  */
+      && occ->num_divisions / 2 >= threshold)
     {
       /* Make a variable with the replacement and substitute it.  */
       type = TREE_TYPE (def);
@@ -368,29 +416,44 @@ insert_reciprocals (gimple_stmt_iterator *def_gsi, struct occurrence *occ,
       new_stmt = gimple_build_assign (recip_def, RDIV_EXPR,
 				      build_one_cst (type), def);
 
+      if (should_insert_square_recip)
+	{
+	  square_recip_def = create_tmp_reg (type, "powmult_reciptmp");
+	  new_square_stmt = gimple_build_assign (square_recip_def, MULT_EXPR,
+						 recip_def, recip_def);
+	}
+
       if (occ->bb_has_division)
-        {
-          /* Case 1: insert before an existing division.  */
-          gsi = gsi_after_labels (occ->bb);
-          while (!gsi_end_p (gsi) && !is_division_by (gsi_stmt (gsi), def))
+	{
+	  /* Case 1: insert before an existing division.  */
+	  gsi = gsi_after_labels (occ->bb);
+	  while (!gsi_end_p (gsi)
+		 && (!is_division_by (gsi_stmt (gsi), def))
+		 && (!is_division_by_square (gsi_stmt (gsi), def)))
 	    gsi_next (&gsi);
 
-          gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
-        }
+	  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+	}
       else if (def_gsi && occ->bb == def_gsi->bb)
-        {
-          /* Case 2: insert right after the definition.  Note that this will
+	{
+	  /* Case 2: insert right after the definition.  Note that this will
 	     never happen if the definition statement can throw, because in
 	     that case the sole successor of the statement's basic block will
 	     dominate all the uses as well.  */
-          gsi_insert_after (def_gsi, new_stmt, GSI_NEW_STMT);
-        }
+	  gsi = *def_gsi;
+	  gsi_insert_after (def_gsi, new_stmt, GSI_NEW_STMT);
+	}
       else
-        {
-          /* Case 3: insert in a basic block not containing defs/uses.  */
-          gsi = gsi_after_labels (occ->bb);
-          gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
-        }
+	{
+	  /* Case 3: insert in a basic block not containing defs/uses.  */
+	  gsi = gsi_after_labels (occ->bb);
+	  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+	}
+
+      /* Regardless of which case the reciprocal as inserted in,
+	 we insert the square immediately after the reciprocal.  */
+      if (should_insert_square_recip)
+	gsi_insert_before (&gsi, new_square_stmt, GSI_SAME_STMT);
 
       reciprocal_stats.rdivs_inserted++;
 
@@ -398,8 +461,32 @@ insert_reciprocals (gimple_stmt_iterator *def_gsi, struct occurrence *occ,
     }
 
   occ->recip_def = recip_def;
+  occ->square_recip_def = square_recip_def;
   for (occ_child = occ->children; occ_child; occ_child = occ_child->next)
-    insert_reciprocals (def_gsi, occ_child, def, recip_def, threshold);
+    insert_reciprocals (def_gsi, occ_child, def, recip_def,
+			square_recip_def, should_insert_square_recip,
+			threshold);
+}
+
+/* Replace occurrences of expr / (x * x) with expr * ((1 / x) * (1 / x)).
+   Take as argument the use for (x * x).  */
+static inline void
+replace_reciprocal_squares (use_operand_p use_p)
+{
+  gimple *use_stmt = USE_STMT (use_p);
+  basic_block bb = gimple_bb (use_stmt);
+  struct occurrence *occ = (struct occurrence *) bb->aux;
+
+  if (optimize_bb_for_speed_p (bb) && occ->square_recip_def
+      && occ->recip_def)
+    {
+      gimple_stmt_iterator gsi = gsi_for_stmt (use_stmt);
+      gimple_assign_set_rhs_code (use_stmt, MULT_EXPR);
+      gimple_assign_set_rhs2 (use_stmt, occ->square_recip_def);
+      SET_USE (use_p, occ->square_recip_def);
+      fold_stmt_inplace (&gsi);
+      update_stmt (use_stmt);
+    }
 }
 
 
@@ -460,32 +547,85 @@ free_bb (struct occurrence *occ)
 static void
 execute_cse_reciprocals_1 (gimple_stmt_iterator *def_gsi, tree def)
 {
-  use_operand_p use_p;
-  imm_use_iterator use_iter;
+  use_operand_p use_p, square_use_p;
+  imm_use_iterator use_iter, square_use_iter;
+  tree square_def;
   struct occurrence *occ;
-  int count = 0, threshold;
+  int count = 0;
+  int threshold;
+  int square_recip_count = 0;
+  int sqrt_recip_count = 0;
 
   gcc_assert (FLOAT_TYPE_P (TREE_TYPE (def)) && is_gimple_reg (def));
+  threshold = targetm.min_divisions_for_recip_mul (TYPE_MODE (TREE_TYPE (def)));
+
+  /* If this is a square (x * x), we should check whether there are any
+     enough divisions by x on it's own to warrant waiting for that pass.  */
+  if (TREE_CODE (def) == SSA_NAME)
+    {
+      gimple *def_stmt = SSA_NAME_DEF_STMT (def);
+
+      if (is_gimple_assign (def_stmt)
+	  && gimple_assign_rhs_code (def_stmt) == MULT_EXPR
+	  && gimple_assign_rhs1 (def_stmt) == gimple_assign_rhs2 (def_stmt))
+	{
+	  /* This statement is a square of something.  We should take this
+	     in to account, as it may be more profitable to not extract
+	     the reciprocal here.  */
+	  tree op0 = gimple_assign_rhs1 (def_stmt);
+	  FOR_EACH_IMM_USE_FAST (use_p, use_iter, op0)
+	    {
+	      gimple *use_stmt = USE_STMT (use_p);
+	      if (is_division_by (use_stmt, op0))
+		sqrt_recip_count ++;
+	    }
+	}
+    }
 
   FOR_EACH_IMM_USE_FAST (use_p, use_iter, def)
     {
       gimple *use_stmt = USE_STMT (use_p);
       if (is_division_by (use_stmt, def))
 	{
-	  register_division_in (gimple_bb (use_stmt));
+	  register_division_in (gimple_bb (use_stmt), 2);
 	  count++;
 	}
+
+      if (is_square_of (use_stmt, def))
+	{
+	  square_def = gimple_assign_lhs (use_stmt);
+	  FOR_EACH_IMM_USE_FAST (square_use_p, square_use_iter, square_def)
+	    {
+	      gimple *square_use_stmt = USE_STMT (square_use_p);
+	      if (is_division_by (square_use_stmt, square_def))
+		{
+		  /* Halve the relative importance as this is called twice
+		     for each division by a square.  */
+		  register_division_in (gimple_bb (square_use_stmt), 1);
+		  square_recip_count ++;
+		}
+	    }
+	}
     }
 
+  /* Square reciprocals will have been counted twice.  */
+  square_recip_count /= 2;
+
+  if (sqrt_recip_count > square_recip_count)
+    /* It will be more profitable to extract a 1 / x expression later,
+       so it is not worth attempting to extract 1 / (x * x) now.  */
+    return;
+
   /* Do the expensive part only if we can hope to optimize something.  */
-  threshold = targetm.min_divisions_for_recip_mul (TYPE_MODE (TREE_TYPE (def)));
-  if (count >= threshold)
+  if (count + square_recip_count >= threshold
+      && count >= 1)
     {
       gimple *use_stmt;
       for (occ = occ_head; occ; occ = occ->next)
 	{
 	  compute_merit (occ);
-	  insert_reciprocals (def_gsi, occ, def, NULL, threshold);
+	  insert_reciprocals (def_gsi, occ, def, NULL, NULL,
+			      square_recip_count, threshold);
 	}
 
       FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, def)
@@ -495,6 +635,27 @@ execute_cse_reciprocals_1 (gimple_stmt_iterator *def_gsi, tree def)
 	      FOR_EACH_IMM_USE_ON_STMT (use_p, use_iter)
 		replace_reciprocal (use_p);
 	    }
+	  else if (square_recip_count > 0
+		   && is_square_of (use_stmt, def))
+	    {
+	      FOR_EACH_IMM_USE_ON_STMT (use_p, use_iter)
+		{
+		  /* Find all uses of the square that are divisions and
+		   * replace them by multiplications with the inverse.  */
+		  imm_use_iterator square_iterator;
+		  gimple *powmult_use_stmt = USE_STMT (use_p);
+		  tree powmult_def_name = gimple_assign_lhs (powmult_use_stmt);
+
+		  FOR_EACH_IMM_USE_STMT (powmult_use_stmt,
+					 square_iterator, powmult_def_name)
+		    FOR_EACH_IMM_USE_ON_STMT (square_use_p, square_iterator)
+		      {
+			gimple *powmult_use_stmt = USE_STMT (square_use_p);
+			if (is_division_by (powmult_use_stmt, powmult_def_name))
+			  replace_reciprocal_squares (square_use_p);
+		      }
+		}
+	    }
 	}
     }

Patch

diff --git a/gcc/testsuite/gcc.dg/associate_division_1.c b/gcc/testsuite/gcc.dg/associate_division_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..187d3597af8825a6a43f29bfaa68b089d2d5bbfe
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/associate_division_1.c
@@ -0,0 +1,46 @@ 
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-optimized" } */
+
+float
+extract_square (float x, float y, float *a, float *b)
+{
+  *a = 3 / (y * y);
+  *b = 5 / (y * y);
+
+  return x / (y * y);
+}
+
+/* Don't expect the 'powmult' (calculation of y * y)
+   to be deleted until a later pass, so look for one
+   more multiplication than strictly necessary.  */
+float
+extract_recip (float *w, float x, float y, float z)
+{
+  *w = 7 / y;
+
+  return x / (y * y) + z / y;
+}
+
+float
+neg_recip (float x, float y, float z)
+{
+  return (x / y) + (z / -y);
+}
+
+float
+const_divisor (float *w, float x, float y, float z)
+{
+  *w = 5 / y;
+  return x / (y * 3.0f) + z / y;
+}
+
+/* 4 For the pointers to a, b, w and w, 4 multiplications in 'extract_square',
+   5 multiplications in 'extract_recip', 0 multiplications
+   in 'neg_recip', 3 multiplcations in 'const_divisor' expected.  */
+/* { dg-final { scan-tree-dump-times " \\* " 14 "optimized" } } */
+
+/* 1 division in 'extract_square', 1 division in 'extract_recip',
+   1 division in 'neg_recip', 1 division in 'const_divisor'.  */
+/* { dg-final { scan-tree-dump-times " / " 4 "optimized" } } */
+
+
diff --git a/gcc/tree-ssa-math-opts.c b/gcc/tree-ssa-math-opts.c
index 7ac1659fa0670b7080685f3f9513939807073a63..0db160f68001ffb90942c4002703625430128b9f 100644
--- a/gcc/tree-ssa-math-opts.c
+++ b/gcc/tree-ssa-math-opts.c
@@ -340,6 +340,38 @@  is_division_by (gimple *use_stmt, tree def)
 	 && gimple_assign_rhs1 (use_stmt) != def;
 }
 
+/* Return whether USE_STMT is DEF * DEF.  */
+static inline bool
+is_square_of (gimple *use_stmt, tree def)
+{
+  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
+      && gimple_assign_rhs_code (use_stmt) == MULT_EXPR)
+    {
+      tree op0 = gimple_assign_rhs1 (use_stmt);
+      tree op1 = gimple_assign_rhs2 (use_stmt);
+
+      return op0 == op1 && op0 == def;
+    }
+  return 0;
+}
+
+/* Return whether USE_STMT is a floating-point division by
+   DEF * DEF.  */
+static inline bool
+is_division_by_square (gimple *use_stmt, tree def)
+{
+  if (gimple_code (use_stmt) == GIMPLE_ASSIGN
+      && gimple_assign_rhs_code (use_stmt) == RDIV_EXPR)
+    {
+      tree denominator = gimple_assign_rhs2 (use_stmt);
+      if (TREE_CODE (denominator) == SSA_NAME)
+	{
+	  return is_square_of (SSA_NAME_DEF_STMT (denominator), def);
+	}
+    }
+  return 0;
+}
+
 /* Walk the subset of the dominator tree rooted at OCC, setting the
    RECIP_DEF field to a definition of 1.0 / DEF that can be used in
    the given basic block.  The field may be left NULL, of course,
@@ -369,14 +401,16 @@  insert_reciprocals (gimple_stmt_iterator *def_gsi, struct occurrence *occ,
 				      build_one_cst (type), def);
 
       if (occ->bb_has_division)
-        {
-          /* Case 1: insert before an existing division.  */
-          gsi = gsi_after_labels (occ->bb);
-          while (!gsi_end_p (gsi) && !is_division_by (gsi_stmt (gsi), def))
+	{
+	  /* Case 1: insert before an existing division.  */
+	  gsi = gsi_after_labels (occ->bb);
+	  while (!gsi_end_p (gsi)
+		 && (!is_division_by (gsi_stmt (gsi), def))
+		 && (!is_division_by_square (gsi_stmt (gsi), def)))
 	    gsi_next (&gsi);
 
-          gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
-        }
+	  gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+	}
       else if (def_gsi && occ->bb == def_gsi->bb)
         {
           /* Case 2: insert right after the definition.  Note that this will
@@ -403,6 +437,80 @@  insert_reciprocals (gimple_stmt_iterator *def_gsi, struct occurrence *occ,
 }
 
 
+/* Walk the tree until we find the first division by a square.  Insert
+   the definition of the reciprocal there.  This returns that definition,
+   or if there is an alternate definition earlier, then it returns that
+   instead.  */
+
+static tree
+insert_square_reciprocals (struct occurrence *occ, tree def)
+{
+  gimple_stmt_iterator gsi = gsi_after_labels (occ->bb);
+
+  while (!gsi_end_p (gsi)
+	 && !is_division_by (gsi_stmt (gsi), def)
+	 /* Check to see if a calculation statement has already
+	    been inserted.  */
+	 && !is_square_of (gsi_stmt (gsi), occ->recip_def))
+    gsi_next (&gsi);
+
+  if (gsi_end_p (gsi))
+      return NULL;
+  else if (is_square_of (gsi_stmt (gsi), occ->recip_def))
+      return gimple_assign_lhs (gsi_stmt (gsi));
+  else
+    {
+      tree type = TREE_TYPE (def);
+      tree recip_square_def = create_tmp_reg (type, "powmult_reciptmp");
+      gassign *new_stmt = gimple_build_assign (recip_square_def, MULT_EXPR,
+					       occ->recip_def, occ->recip_def);
+      gsi_insert_before (&gsi, new_stmt, GSI_SAME_STMT);
+      return recip_square_def;
+    }
+}
+
+/* Replace occurrences of expr / (x * x) with expr * ((1 / x) * (1 / x)).
+   Unlike replace_reciprocals, we expect use_p to be a square definition
+   (i.e. use_p is _ = x * x).  Iterate though all uses of this square
+   and replace them.  */
+static inline void
+replace_reciprocal_squares (use_operand_p use_p)
+{
+  gimple *use_stmt = USE_STMT (use_p);
+  basic_block bb = gimple_bb (use_stmt);
+  struct occurrence *occ = (struct occurrence *) bb->aux;
+  imm_use_iterator use_iter;
+  tree def_name = gimple_assign_lhs (use_stmt);
+  use_operand_p square_use_p;
+  tree squared_reciprocal = insert_square_reciprocals (occ, def_name);
+
+  if (optimize_bb_for_speed_p (bb) && squared_reciprocal
+      && occ->recip_def && use_stmt != occ->recip_def_stmt)
+    {
+
+      /* Find all occurrences of the use name and replace
+	 them by multiplications of this new inverse.  */
+      FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, def_name)
+	{
+	  FOR_EACH_IMM_USE_ON_STMT (square_use_p, use_iter)
+	    {
+	      gimple *square_use = USE_STMT (square_use_p);
+
+	      if (gimple_assign_rhs_code (square_use) == RDIV_EXPR)
+		{
+		  gimple_assign_set_rhs_code (square_use, MULT_EXPR);
+		  gimple_assign_set_rhs2 (square_use, squared_reciprocal);
+		  SET_USE (square_use_p, squared_reciprocal);
+
+		  reciprocal_stats.rdivs_inserted++;
+		  update_stmt (square_use);
+		}
+	    }
+	}
+    }
+}
+
+
 /* Replace the division at USE_P with a multiplication by the reciprocal, if
    possible.  */
 
@@ -460,10 +568,12 @@  free_bb (struct occurrence *occ)
 static void
 execute_cse_reciprocals_1 (gimple_stmt_iterator *def_gsi, tree def)
 {
-  use_operand_p use_p;
-  imm_use_iterator use_iter;
+  use_operand_p use_p, square_use_p;
+  imm_use_iterator use_iter, square_use_iter;
+  tree square_def;
   struct occurrence *occ;
   int count = 0, threshold;
+  int square_recip_count = 0;
 
   gcc_assert (FLOAT_TYPE_P (TREE_TYPE (def)) && is_gimple_reg (def));
 
@@ -475,11 +585,26 @@  execute_cse_reciprocals_1 (gimple_stmt_iterator *def_gsi, tree def)
 	  register_division_in (gimple_bb (use_stmt));
 	  count++;
 	}
+
+      if (is_square_of (use_stmt, def))
+	{
+	  square_def = gimple_assign_lhs (use_stmt);
+	  FOR_EACH_IMM_USE_FAST (square_use_p, square_use_iter, square_def)
+	    {
+	      gimple *square_use_stmt = USE_STMT (square_use_p);
+	      if (is_division_by (square_use_stmt, square_def))
+		{
+		  register_division_in (gimple_bb (square_use_stmt));
+		  square_recip_count++;
+		}
+	    }
+	}
     }
 
   /* Do the expensive part only if we can hope to optimize something.  */
   threshold = targetm.min_divisions_for_recip_mul (TYPE_MODE (TREE_TYPE (def)));
-  if (count >= threshold)
+  if (count + square_recip_count >= threshold
+      && count >= 1)
     {
       gimple *use_stmt;
       for (occ = occ_head; occ; occ = occ->next)
@@ -495,6 +620,11 @@  execute_cse_reciprocals_1 (gimple_stmt_iterator *def_gsi, tree def)
 	      FOR_EACH_IMM_USE_ON_STMT (use_p, use_iter)
 		replace_reciprocal (use_p);
 	    }
+	  else if (is_square_of (use_stmt, def))
+	    {
+	      FOR_EACH_IMM_USE_ON_STMT (use_p, use_iter)
+		replace_reciprocal_squares (use_p);
+	    }
 	}
     }