
[RFC,Vectorizer,AArch64] Fix PR/61114 by redefining REDUC_xxx_EXPR tree codes to return scalars

Message ID 53DB699B.1010506@arm.com
State New

Commit Message

Alan Lawrence Aug. 1, 2014, 10:19 a.m. UTC
This fixes PR/61114 by redefining the REDUC_{MIN,MAX,PLUS}_EXPR tree codes.

These are presently documented as producing a vector with the result in element 
0, and this is inconsistent with their use in tree-vect-loop.c (which on 
bigendian targets pulls the bits out of the other end of the vector result). 
This leads to bugs on bigendian targets - see 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114 for a small testcase.
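
For illustration, a reduction of roughly this shape is enough to hit the
problem (a sketch only; the PR has the actual testcase):

  /* When vectorized, the reduction epilogue uses REDUC_PLUS_EXPR; on
     big-endian targets it then reads the result back from the wrong lane.  */
  int
  sum_array (const int *a)
  {
    int i, sum = 0;
    for (i = 0; i < 64; i++)
      sum += a[i];
    return sum;
  }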

I discounted "fixing" the vectorizer (to always read from element 0) and then 
making the optab reverse the result on bigendian targets (whose architectural 
insn produces the result in lane N-1), as optimization of vectors in RTL seems 
unlikely to remove such a reverse/permute and so this would lead to a 
performance regression (specifically on PowerPC).

Instead it seems more natural for the tree code to produce a scalar result 
(producing a vector with the result in lane 0 has already caused confusion, e.g. 
https://gcc.gnu.org/ml/gcc-patches/2012-10/msg01100.html).

This patch preserves the meaning of the existing optab (producing a result in 
lane 0 on little-endian architectures or N-1 on bigendian), thus generally 
avoiding the need to change backends. Hence, expr.c extracts an 
endianness-dependent element from the optab result to give the result expected 
for the tree code.
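
As a plain-C model of that extraction (a sketch only, not code from the patch):

  /* V stands for the optab's vector result, NUNITS for its number of lanes,
     BIG_ENDIAN_P for BYTES_BIG_ENDIAN on the target.  */
  static int
  scalar_from_optab_result (const int *v, int nunits, int big_endian_p)
  {
    int lane = big_endian_p ? nunits - 1 : 0;
    return v[lane];
  }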

A significant complication in the AArch64 backend stems from the existence of
builtins for reduction operations, which are gimple_fold'd to the tree code.
Hence, I introduce new define_expands and map the existing
__builtin_aarch64_reduc_s{plus,min,max}_<mode> functions to those, with scalar
result types matching the result of the tree code to which they are still
gimple_fold'd.
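
For example (user-level view of the patched arm_neon.h), vaddv_s32 now simply
returns the builtin's scalar result:

  #include <arm_neon.h>

  int32_t
  sum_pair (int32x2_t v)
  {
    /* With this patch, vaddv_s32 is a plain call to
       __builtin_aarch64_reduc_splus_v2si, which aarch64_gimple_fold_builtin
       turns into a scalar-valued REDUC_PLUS_EXPR.  */
    return vaddv_s32 (v);
  }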

If the proposed solution is acceptable, I'd follow up with a longer patch
series, including some cleanup to tree-vect-loop.c
(vect_create_epilog_for_reduction now has only one case where
extract_scalar_result == true) and a separation of the AArch64 changes.
Further, I'd like to propose a new optab that outputs a scalar directly, as a
migration path away from the existing optab whose meaning is
endianness-dependent, i.e. expand_unop would fall back to the existing optab
only if the new one is not defined.

Patch as it stands has been bootstrapped on x86_64 and regression tested on 
aarch64 and aarch64_be without regressions. On x86_64 there is a regression in 
gcc.target/i386/pr51235.c, where it seems my check in tree-cfg.c is too strict - 
we end up with a reduction from "vector (4) unsigned long int" to "void *". 
(Even if I modify tree-vect-loop.c to build the REDUC_..._EXPR as returning the 
element type of the input vector, its return type is later changed.) It seems I 
can "get away with" a less-strict check in tree-cfg.c, i.e. allowing the case 
where the modes of the expected and actual result types match (rather than 
"useless_type_conversion_p" holding between said types), but if anyone can 
suggest an alternative/better check then it'd be great to hear it...
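
Concretely, the less-strict check would look something like this (a sketch
only, not part of the patch below):

    case REDUC_MAX_EXPR:
    case REDUC_MIN_EXPR:
    case REDUC_PLUS_EXPR:
      if (!VECTOR_TYPE_P (rhs1_type)
          || TYPE_MODE (lhs_type) != TYPE_MODE (TREE_TYPE (rhs1_type)))
        {
          error ("reduction should convert from vector to element type");
          debug_generic_expr (lhs_type);
          debug_generic_expr (rhs1_type);
          return true;
        }
      return false;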

--Alan

Comments

Richard Biener Aug. 1, 2014, 10:39 a.m. UTC | #1
On Fri, Aug 1, 2014 at 12:19 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> This fixes PR/61114 by redefining the REDUC_{MIN,MAX,PLUS}_EXPR tree codes.
>
> These are presently documented as producing a vector with the result in
> element 0, and this is inconsistent with their use in tree-vect-loop.c
> (which on bigendian targets pulls the bits out of the other end of the
> vector result). This leads to bugs on bigendian targets - see
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61114 for a small testcase.
>
> [...]
>
> Patch as it stands has been bootstrapped on x86_64 and regression tested on
> aarch64 and aarch64_be without regressions. On x86_64 there is a regression
> in gcc.target/i386/pr51235.c, where it seems my check in tree-cfg.c is too
> strict - we end up with a reduction from "vector (4) unsigned long int" to
> "void *". (Even if I modify tree-vect-loop.c to build the REDUC_..._EXPR as
> returning the element type of the input vector, its return type is later
> changed.) It seems I can "get away with" a less-strict check in tree-cfg.c,
> i.e. allowing the case where the modes of the expected and actual result
> types match (rather than "useless_type_conversion_p" holding between said
> types), but if anyone can suggest an alternative/better check then it'd be
> great to hear it...

We should fix the vectorizer code-generation instead.

Makes sense to me - non-aarch64 parts of the patch are ok.  The optab
migration strategy is as well.

Thanks,
Richard.

> --Alan

Patch

diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
index fee17ecf637436c8704f565be2eb9ef23891209a..77ed36ecc4cade4c2c6cafd16070198dacb0b869 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -1396,21 +1396,21 @@  aarch64_gimple_fold_builtin (gimple_stmt_iterator *gsi)
 
 	  switch (fcode)
 	    {
-	      BUILTIN_VALL (UNOP, reduc_splus_, 10)
+	      BUILTIN_VALL (UNOP, reduc_splus_, 0)
 		new_stmt = gimple_build_assign_with_ops (
 						REDUC_PLUS_EXPR,
 						gimple_call_lhs (stmt),
 						args[0],
 						NULL_TREE);
 		break;
-	      BUILTIN_VDQIF (UNOP, reduc_smax_, 10)
+	      BUILTIN_VDQIF (UNOP, reduc_smax_, 0)
 		new_stmt = gimple_build_assign_with_ops (
 						REDUC_MAX_EXPR,
 						gimple_call_lhs (stmt),
 						args[0],
 						NULL_TREE);
 		break;
-	      BUILTIN_VDQIF (UNOP, reduc_smin_, 10)
+	      BUILTIN_VDQIF (UNOP, reduc_smin_, 0)
 		new_stmt = gimple_build_assign_with_ops (
 						REDUC_MIN_EXPR,
 						gimple_call_lhs (stmt),
diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
index 268432cc117b7027ee9472fc5a4f9b1ea13bea0f..ef363c676bf5bb05cfdaf0f111324d42c3f3d992 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -251,13 +251,19 @@ 
   BUILTIN_VSDQ_I_DI (BINOP, cmgtu, 0)
   BUILTIN_VSDQ_I_DI (BINOP, cmtst, 0)
 
+  /* Implemented by aarch64_reduc_splus_<mode>.  */
+  BUILTIN_VALL (UNOP, reduc_splus_, 0)
+
   /* Implemented by reduc_<sur>plus_<mode>.  */
-  BUILTIN_VALL (UNOP, reduc_splus_, 10)
   BUILTIN_VDQ (UNOP, reduc_uplus_, 10)
 
+  /* Implemented by aarch64_reduc_smax_<mode>.  */
+  BUILTIN_VDQIF (UNOP, reduc_smax_, 0)
+
+  /* Implemented by aarch64_reduc_smin_<mode>.  */
+  BUILTIN_VDQIF (UNOP, reduc_smin_, 0)
+
   /* Implemented by reduc_<maxmin_uns>_<mode>.  */
-  BUILTIN_VDQIF (UNOP, reduc_smax_, 10)
-  BUILTIN_VDQIF (UNOP, reduc_smin_, 10)
   BUILTIN_VDQ_BHSI (UNOP, reduc_umax_, 10)
   BUILTIN_VDQ_BHSI (UNOP, reduc_umin_, 10)
   BUILTIN_VDQF (UNOP, reduc_smax_nan_, 10)
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 6300b9b6c7ac06384d2e59bbac1a0d5445975bb6..4ade92c7f47bf71dc993d25621ec839ea867e3d5 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1719,6 +1719,19 @@ 
 
 ;; 'across lanes' add.
 
+;; Template for outputting a scalar, so we can create __builtins which can be
+;; gimple_fold'd to the REDUC_PLUS_EXPR tree code.
+(define_expand "aarch64_reduc_splus_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand")
+        (match_operand:VALL 1 "register_operand"))]
+  "TARGET_SIMD"
+  {
+    /* Must be handled by aarch64_gimple_fold_builtin.  */
+    gcc_unreachable ();
+    FAIL;
+  }
+)
+
 (define_insn "reduc_<sur>plus_<mode>"
  [(set (match_operand:VDQV 0 "register_operand" "=w")
        (unspec:VDQV [(match_operand:VDQV 1 "register_operand" "w")]
@@ -1776,6 +1789,31 @@ 
 
 ;; 'across lanes' max and min ops.
 
+;; Template for outputting a scalar, so we can create __builtins which can be
+;; gimple_fold'd to the REDUC_MAX_EXPR tree code.  The V2DI isn't used.
+(define_expand "aarch64_reduc_smax_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand")
+        (match_operand:VALL 1 "register_operand"))]
+  "TARGET_SIMD"
+  {
+    /* Must be handled in aarch64_gimple_fold_builtin.  */
+    gcc_unreachable ();
+    FAIL;
+  }
+)
+
+;; Likewise for REDUC_MIN_EXPR tree code.
+(define_expand "aarch64_reduc_smin_<mode>"
+  [(set (match_operand:<VEL> 0 "register_operand")
+        (match_operand:VALL 1 "register_operand"))]
+  "TARGET_SIMD"
+  {
+    /* Must be handled in aarch64_gimple_fold_builtin.  */
+    gcc_unreachable ();
+    FAIL;
+  }
+)
+
 (define_insn "reduc_<maxmin_uns>_<mode>"
  [(set (match_operand:VDQV_S 0 "register_operand" "=w")
        (unspec:VDQV_S [(match_operand:VDQV_S 1 "register_operand" "w")]
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 83ac5e96d422ceccadcb212ec792665b78c03fae..b4d7e892e8ea2e4df3dedce7980b771da4b922e2 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -13532,19 +13532,19 @@  vaddd_u64 (uint64_t __a, uint64_t __b)
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vaddv_s8 (int8x8_t __a)
 {
-  return vget_lane_s8 (__builtin_aarch64_reduc_splus_v8qi (__a), 0);
+  return __builtin_aarch64_reduc_splus_v8qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vaddv_s16 (int16x4_t __a)
 {
-  return vget_lane_s16 (__builtin_aarch64_reduc_splus_v4hi (__a), 0);
+  return __builtin_aarch64_reduc_splus_v4hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vaddv_s32 (int32x2_t __a)
 {
-  return vget_lane_s32 (__builtin_aarch64_reduc_splus_v2si (__a), 0);
+  return __builtin_aarch64_reduc_splus_v2si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
@@ -13574,26 +13574,25 @@  vaddv_u32 (uint32x2_t __a)
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vaddvq_s8 (int8x16_t __a)
 {
-  return vgetq_lane_s8 (__builtin_aarch64_reduc_splus_v16qi (__a),
-			0);
+  return __builtin_aarch64_reduc_splus_v16qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vaddvq_s16 (int16x8_t __a)
 {
-  return vgetq_lane_s16 (__builtin_aarch64_reduc_splus_v8hi (__a), 0);
+  return __builtin_aarch64_reduc_splus_v8hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vaddvq_s32 (int32x4_t __a)
 {
-  return vgetq_lane_s32 (__builtin_aarch64_reduc_splus_v4si (__a), 0);
+  return __builtin_aarch64_reduc_splus_v4si (__a);
 }
 
 __extension__ static __inline int64_t __attribute__ ((__always_inline__))
 vaddvq_s64 (int64x2_t __a)
 {
-  return vgetq_lane_s64 (__builtin_aarch64_reduc_splus_v2di (__a), 0);
+  return __builtin_aarch64_reduc_splus_v2di (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
@@ -13631,22 +13630,19 @@  vaddvq_u64 (uint64x2_t __a)
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vaddv_f32 (float32x2_t __a)
 {
-  float32x2_t __t = __builtin_aarch64_reduc_splus_v2sf (__a);
-  return vget_lane_f32 (__t, 0);
+  return __builtin_aarch64_reduc_splus_v2sf (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vaddvq_f32 (float32x4_t __a)
 {
-  float32x4_t __t = __builtin_aarch64_reduc_splus_v4sf (__a);
-  return vgetq_lane_f32 (__t, 0);
+  return __builtin_aarch64_reduc_splus_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vaddvq_f64 (float64x2_t __a)
 {
-  float64x2_t __t = __builtin_aarch64_reduc_splus_v2df (__a);
-  return vgetq_lane_f64 (__t, 0);
+  return __builtin_aarch64_reduc_splus_v2df (__a);
 }
 
 /* vbsl  */
@@ -18125,19 +18121,19 @@  vmaxv_f32 (float32x2_t __a)
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vmaxv_s8 (int8x8_t __a)
 {
-  return vget_lane_s8 (__builtin_aarch64_reduc_smax_v8qi (__a), 0);
+  return __builtin_aarch64_reduc_smax_v8qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vmaxv_s16 (int16x4_t __a)
 {
-  return vget_lane_s16 (__builtin_aarch64_reduc_smax_v4hi (__a), 0);
+  return __builtin_aarch64_reduc_smax_v4hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vmaxv_s32 (int32x2_t __a)
 {
-  return vget_lane_s32 (__builtin_aarch64_reduc_smax_v2si (__a), 0);
+  return __builtin_aarch64_reduc_smax_v2si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
@@ -18181,19 +18177,19 @@  vmaxvq_f64 (float64x2_t __a)
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vmaxvq_s8 (int8x16_t __a)
 {
-  return vgetq_lane_s8 (__builtin_aarch64_reduc_smax_v16qi (__a), 0);
+  return __builtin_aarch64_reduc_smax_v16qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vmaxvq_s16 (int16x8_t __a)
 {
-  return vgetq_lane_s16 (__builtin_aarch64_reduc_smax_v8hi (__a), 0);
+  return __builtin_aarch64_reduc_smax_v8hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vmaxvq_s32 (int32x4_t __a)
 {
-  return vgetq_lane_s32 (__builtin_aarch64_reduc_smax_v4si (__a), 0);
+  return __builtin_aarch64_reduc_smax_v4si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
@@ -18225,20 +18221,19 @@  vmaxvq_u32 (uint32x4_t __a)
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vmaxnmv_f32 (float32x2_t __a)
 {
-  return vget_lane_f32 (__builtin_aarch64_reduc_smax_v2sf (__a),
-			0);
+  return __builtin_aarch64_reduc_smax_v2sf (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vmaxnmvq_f32 (float32x4_t __a)
 {
-  return vgetq_lane_f32 (__builtin_aarch64_reduc_smax_v4sf (__a), 0);
+  return __builtin_aarch64_reduc_smax_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vmaxnmvq_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_smax_v2df (__a), 0);
+  return __builtin_aarch64_reduc_smax_v2df (__a);
 }
 
 /* vmin  */
@@ -18371,20 +18366,19 @@  vminv_f32 (float32x2_t __a)
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vminv_s8 (int8x8_t __a)
 {
-  return vget_lane_s8 (__builtin_aarch64_reduc_smin_v8qi (__a),
-		       0);
+  return __builtin_aarch64_reduc_smin_v8qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vminv_s16 (int16x4_t __a)
 {
-  return vget_lane_s16 (__builtin_aarch64_reduc_smin_v4hi (__a), 0);
+  return __builtin_aarch64_reduc_smin_v4hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vminv_s32 (int32x2_t __a)
 {
-  return vget_lane_s32 (__builtin_aarch64_reduc_smin_v2si (__a), 0);
+  return __builtin_aarch64_reduc_smin_v2si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
@@ -18428,19 +18422,19 @@  vminvq_f64 (float64x2_t __a)
 __extension__ static __inline int8_t __attribute__ ((__always_inline__))
 vminvq_s8 (int8x16_t __a)
 {
-  return vgetq_lane_s8 (__builtin_aarch64_reduc_smin_v16qi (__a), 0);
+  return __builtin_aarch64_reduc_smin_v16qi (__a);
 }
 
 __extension__ static __inline int16_t __attribute__ ((__always_inline__))
 vminvq_s16 (int16x8_t __a)
 {
-  return vgetq_lane_s16 (__builtin_aarch64_reduc_smin_v8hi (__a), 0);
+  return __builtin_aarch64_reduc_smin_v8hi (__a);
 }
 
 __extension__ static __inline int32_t __attribute__ ((__always_inline__))
 vminvq_s32 (int32x4_t __a)
 {
-  return vgetq_lane_s32 (__builtin_aarch64_reduc_smin_v4si (__a), 0);
+  return __builtin_aarch64_reduc_smin_v4si (__a);
 }
 
 __extension__ static __inline uint8_t __attribute__ ((__always_inline__))
@@ -18472,19 +18466,19 @@  vminvq_u32 (uint32x4_t __a)
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vminnmv_f32 (float32x2_t __a)
 {
-  return vget_lane_f32 (__builtin_aarch64_reduc_smin_v2sf (__a), 0);
+  return __builtin_aarch64_reduc_smin_v2sf (__a);
 }
 
 __extension__ static __inline float32_t __attribute__ ((__always_inline__))
 vminnmvq_f32 (float32x4_t __a)
 {
-  return vgetq_lane_f32 (__builtin_aarch64_reduc_smin_v4sf (__a), 0);
+  return __builtin_aarch64_reduc_smin_v4sf (__a);
 }
 
 __extension__ static __inline float64_t __attribute__ ((__always_inline__))
 vminnmvq_f64 (float64x2_t __a)
 {
-  return vgetq_lane_f64 (__builtin_aarch64_reduc_smin_v2df (__a), 0);
+  return __builtin_aarch64_reduc_smin_v2df (__a);
 }
 
 /* vmla */
diff --git a/gcc/expr.c b/gcc/expr.c
index 4d2163f721b092e35fad464d71797aebaf0cb6e3..dfec5b1bf12e9fb82f88d35cad109e511f10c8d2 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9019,7 +9019,17 @@  expand_expr_real_2 (sepops ops, rtx target, enum machine_mode tmode,
       {
         op0 = expand_normal (treeop0);
         this_optab = optab_for_tree_code (code, type, optab_default);
-        temp = expand_unop (mode, this_optab, op0, target, unsignedp);
+        enum machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop0));
+        temp = expand_unop (vec_mode, this_optab, op0, NULL_RTX, unsignedp);
+        gcc_assert (temp);
+        /* The tree code produces a scalar result, but (somewhat by convention)
+           the optab produces a vector with the result in element 0 if
+           little-endian, or element N-1 if big-endian.  So pull the scalar
+           result out of that element.  */
+        int index = BYTES_BIG_ENDIAN ? GET_MODE_NUNITS (vec_mode) - 1 : 0;
+        int bitsize = GET_MODE_BITSIZE (GET_MODE_INNER (vec_mode));
+        temp = extract_bit_field (temp, bitsize, bitsize * index, unsignedp,
+				  target, mode, mode);
         gcc_assert (temp);
         return temp;
       }
diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index d22eac15962a7abfb605bb79b6f9b7809228dab3..3597750a4998ed1a714ef05f9484495c50baa029 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -8439,12 +8439,13 @@  fold_unary_loc (location_t loc, enum tree_code code, tree type, tree op0)
     case REDUC_MAX_EXPR:
     case REDUC_PLUS_EXPR:
       {
-	unsigned int nelts = TYPE_VECTOR_SUBPARTS (type), i;
+	unsigned int nelts, i;
 	tree *elts;
 	enum tree_code subcode;
 
 	if (TREE_CODE (op0) != VECTOR_CST)
 	  return NULL_TREE;
+        nelts = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
 
 	elts = XALLOCAVEC (tree, nelts);
 	if (!vec_cst_ctor_to_array (op0, elts))
@@ -8463,10 +8464,9 @@  fold_unary_loc (location_t loc, enum tree_code code, tree type, tree op0)
 	    elts[0] = const_binop (subcode, elts[0], elts[i]);
 	    if (elts[0] == NULL_TREE || !CONSTANT_CLASS_P (elts[0]))
 	      return NULL_TREE;
-	    elts[i] = build_zero_cst (TREE_TYPE (type));
 	  }
 
-	return build_vector (type, elts);
+	return elts[0];
       }
 
     default:
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index abf09d5304d002641634ea45e68c7c8939825a1f..68b57637add0c4e1b610cdb5182e95d849541869 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3531,12 +3531,21 @@  verify_gimple_assign_unary (gimple stmt)
 
         return false;
       }
-
-    case VEC_UNPACK_HI_EXPR:
-    case VEC_UNPACK_LO_EXPR:
     case REDUC_MAX_EXPR:
     case REDUC_MIN_EXPR:
     case REDUC_PLUS_EXPR:
+      if (!VECTOR_TYPE_P (rhs1_type)
+	  || !useless_type_conversion_p (lhs_type, TREE_TYPE (rhs1_type)))
+        {
+	  error ("reduction should convert from vector to element type");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  return true;
+	}
+      return false;
+
+    case VEC_UNPACK_HI_EXPR:
+    case VEC_UNPACK_LO_EXPR:
     case VEC_UNPACK_FLOAT_HI_EXPR:
     case VEC_UNPACK_FLOAT_LO_EXPR:
       /* FIXME.  */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 7e013f3b549a07bd44789bd4d3e3701eec7c51dc..35a8fde5b8f77393765a57f2833d799e529d0d8d 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1892,9 +1892,9 @@  vect_analyze_loop (struct loop *loop)
 
    Output:
    REDUC_CODE - the corresponding tree-code to be used to reduce the
-      vector of partial results into a single scalar result (which
-      will also reside in a vector) or ERROR_MARK if the operation is
-      a supported reduction operation, but does not have such tree-code.
+      vector of partial results into a single scalar result, or ERROR_MARK
+      if the operation is a supported reduction operation, but does not have
+      such tree-code.
 
    Return FALSE if CODE currently cannot be vectorized as reduction.  */
 
@@ -4175,14 +4175,12 @@  vect_create_epilog_for_reduction (vec<tree> vect_defs, gimple stmt,
         dump_printf_loc (MSG_NOTE, vect_location,
 			 "Reduce using direct vector reduction.\n");
 
-      vec_dest = vect_create_destination_var (scalar_dest, vectype);
-      tmp = build1 (reduc_code, vectype, new_phi_result);
-      epilog_stmt = gimple_build_assign (vec_dest, tmp);
-      new_temp = make_ssa_name (vec_dest, epilog_stmt);
+      tmp = build1 (reduc_code, scalar_type, new_phi_result);
+      epilog_stmt = gimple_build_assign (new_scalar_dest, tmp);
+      new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
       gimple_assign_set_lhs (epilog_stmt, new_temp);
       gsi_insert_before (&exit_gsi, epilog_stmt, GSI_SAME_STMT);
-
-      extract_scalar_result = true;
+      scalar_results.safe_push (new_temp);
     }
   else
     {
diff --git a/gcc/tree.def b/gcc/tree.def
index 84ffe93aa6fdc827f18ca81225bca007d50b50f6..e9af52e554babb100d49ea14f47c805cd5024949 100644
--- a/gcc/tree.def
+++ b/gcc/tree.def
@@ -1157,10 +1157,9 @@  DEFTREECODE (TRANSACTION_EXPR, "transaction_expr", tcc_expression, 1)
    result (e.g. summing the elements of the vector, finding the minimum over
    the vector elements, etc).
    Operand 0 is a vector.
-   The expression returns a vector of the same type, with the first
-   element in the vector holding the result of the reduction of all elements
-   of the operand.  The content of the other elements in the returned vector
-   is undefined.  */
+   The expression returns a scalar, with type the same as the elements of the
+   vector, holding the result of the reduction of all elements of the operand.
+   */
 DEFTREECODE (REDUC_MAX_EXPR, "reduc_max_expr", tcc_unary, 1)
 DEFTREECODE (REDUC_MIN_EXPR, "reduc_min_expr", tcc_unary, 1)
 DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)