From patchwork Thu Dec 9 14:05:51 2010
X-Patchwork-Submitter: Julian Brown
X-Patchwork-Id: 74894
Date: Thu, 9 Dec 2010 14:05:51 +0000
From: Julian Brown
To: gcc-patches@gcc.gnu.org
Cc: paul@codesourcery.com, Richard Earnshaw
Subject: [PATCH, ARM] Avoid element-order-dependent operations for quad-word
 vectors in big-endian mode for NEON
Message-ID: <20101209140551.50118914@rex.config>

(Sending to the right mailing list address this time, sorry for duplicates!)

Hi,

This patch fixes a large number of execution failures which occur when
compiling with NEON and auto-vectorization enabled in big-endian mode,
particularly when -mvectorize-with-neon-quad is in use.

The basic issue (as discussed several times previously, e.g.
http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html) is that NEON
quad-word vectors are formed from pairs of double-word vectors, and those
pairs are "the wrong way round" in big-endian mode. This is contrary to the
middle-end's expectations, and several patterns introduced fairly recently
to the GCC NEON support fail to take this "feature" into account.

So, in the interests of un-breaking code in big-endian mode, the attached
patch does the simplest thing possible: it disables the troublesome patterns
in big-endian mode. I experimented with other approaches, i.e. trying to
"hide" the permutations required by NEON in the backend patterns:

http://lists.linaro.org/pipermail/linaro-toolchain/2010-November/000536.html

but eventually came to the conclusion that doing so was impractical. (We're
planning to revisit the "internal" ordering of vectors for big-endian NEON
in the vectorizer at least, anyway.)

I've tested this patch with custom multilib configurations
(big-endian/little-endian), with the following options:

  little-endian, -mfpu=neon -mfloat-abi=softfp
  little-endian, -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
  big-endian, -mfpu=neon -mfloat-abi=softfp
  big-endian, -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad

Unfortunately this was for C only, since building C++ was broken at the time
I started testing. Some tests fail to vectorize post-patch in big-endian mode
(predictably, since fewer things end up vectorizable), but many execution
tests transition from FAIL to PASS.

OK to apply?

Thanks,

Julian

ChangeLog

    gcc/
    * config/arm/neon.md (vec_shr_<mode>, vec_shl_<mode>): Disable in
    big-endian mode.
    (reduc_splus_<mode>, reduc_uplus_<mode>, reduc_smin_<mode>)
    (reduc_smax_<mode>, reduc_umin_<mode>, reduc_umax_<mode>)
    (neon_vec_unpack<US>_lo_<mode>, neon_vec_unpack<US>_hi_<mode>)
    (vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>)
    (neon_vec_<US>mult_lo_<mode>, vec_widen_<US>mult_lo_<mode>)
    (neon_vec_<US>mult_hi_<mode>, vec_widen_<US>mult_hi_<mode>)
    (vec_pack_trunc_<mode>, neon_vec_pack_trunc_<mode>): Disable for Q
    registers in big-endian mode.
Index: gcc/config/arm/neon.md
===================================================================
--- gcc/config/arm/neon.md	(revision 167434)
+++ gcc/config/arm/neon.md	(working copy)
@@ -1028,11 +1028,14 @@
 ;; shift-count granularity.  That's good enough for the middle-end's current
 ;; needs.
 
+;; Note that it's not safe to perform such an operation in big-endian mode,
+;; due to element-ordering issues.
+
 (define_expand "vec_shr_<mode>"
   [(match_operand:VDQ 0 "s_register_operand" "")
    (match_operand:VDQ 1 "s_register_operand" "")
    (match_operand:SI 2 "const_multiple_of_8_operand" "")]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx zero_reg;
   HOST_WIDE_INT num_bits = INTVAL (operands[2]);
@@ -1060,7 +1063,7 @@
   [(match_operand:VDQ 0 "s_register_operand" "")
    (match_operand:VDQ 1 "s_register_operand" "")
    (match_operand:SI 2 "const_multiple_of_8_operand" "")]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx zero_reg;
   HOST_WIDE_INT num_bits = INTVAL (operands[2]);
@@ -1256,7 +1259,8 @@
 (define_expand "reduc_splus_<mode>"
   [(match_operand:VQ 0 "s_register_operand" "")
    (match_operand:VQ 1 "s_register_operand" "")]
-  "TARGET_NEON && (!<Is_float_mode> || flag_unsafe_math_optimizations)"
+  "TARGET_NEON && (!
<Is_float_mode> || flag_unsafe_math_optimizations)
+   && !BYTES_BIG_ENDIAN"
 {
   rtx step1 = gen_reg_rtx (<V_HALF>mode);
   rtx res_d = gen_reg_rtx (<V_HALF>mode);
@@ -1272,7 +1276,7 @@
   [(set (match_operand:V2DI 0 "s_register_operand" "=w")
	(unspec:V2DI [(match_operand:V2DI 1 "s_register_operand" "w")]
		     UNSPEC_VPADD))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
  "vadd.i64\t%e0, %e1, %f1"
  [(set_attr "neon_type" "neon_int_1")]
)
@@ -1282,7 +1286,7 @@
 (define_expand "reduc_uplus_<mode>"
   [(match_operand:VDQI 0 "s_register_operand" "")
    (match_operand:VDQI 1 "s_register_operand" "")]
-  "TARGET_NEON"
+  "TARGET_NEON && (<Is_d_reg> || !BYTES_BIG_ENDIAN)"
 {
   emit_insn (gen_reduc_splus_<mode> (operands[0], operands[1]));
   DONE;
@@ -1301,7 +1305,8 @@
 (define_expand "reduc_smin_<mode>"
   [(match_operand:VQ 0 "s_register_operand" "")
    (match_operand:VQ 1 "s_register_operand" "")]
-  "TARGET_NEON && (!<Is_float_mode> || flag_unsafe_math_optimizations)"
+  "TARGET_NEON && (!<Is_float_mode> || flag_unsafe_math_optimizations)
+   && !BYTES_BIG_ENDIAN"
 {
   rtx step1 = gen_reg_rtx (<V_HALF>mode);
   rtx res_d = gen_reg_rtx (<V_HALF>mode);
@@ -1326,7 +1331,8 @@
 (define_expand "reduc_smax_<mode>"
   [(match_operand:VQ 0 "s_register_operand" "")
    (match_operand:VQ 1 "s_register_operand" "")]
-  "TARGET_NEON && (!<Is_float_mode> || flag_unsafe_math_optimizations)"
+  "TARGET_NEON && (!
<Is_float_mode> || flag_unsafe_math_optimizations)
+   && !BYTES_BIG_ENDIAN"
 {
   rtx step1 = gen_reg_rtx (<V_HALF>mode);
   rtx res_d = gen_reg_rtx (<V_HALF>mode);
@@ -1351,7 +1357,7 @@
 (define_expand "reduc_umin_<mode>"
   [(match_operand:VQI 0 "s_register_operand" "")
    (match_operand:VQI 1 "s_register_operand" "")]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx step1 = gen_reg_rtx (<V_HALF>mode);
   rtx res_d = gen_reg_rtx (<V_HALF>mode);
@@ -1376,7 +1382,7 @@
 (define_expand "reduc_umax_<mode>"
   [(match_operand:VQI 0 "s_register_operand" "")
    (match_operand:VQI 1 "s_register_operand" "")]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
   rtx step1 = gen_reg_rtx (<V_HALF>mode);
   rtx res_d = gen_reg_rtx (<V_HALF>mode);
@@ -5238,7 +5244,7 @@
	(SE:<V_unpack> (vec_select:<V_HALF>
			  (match_operand:VU 1 "register_operand" "w")
			  (match_operand:VU 2 "vect_par_constant_low" ""))))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
  "vmovl.<US><V_sz_elem> %q0, %e1"
  [(set_attr "neon_type" "neon_shift_1")]
)
@@ -5248,7 +5254,7 @@
	(SE:<V_unpack> (vec_select:<V_HALF>
			  (match_operand:VU 1 "register_operand" "w")
			  (match_operand:VU 2 "vect_par_constant_high" ""))))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
  "vmovl.<US><V_sz_elem> %q0, %f1"
  [(set_attr "neon_type" "neon_shift_1")]
)
@@ -5256,7 +5262,7 @@
 (define_expand "vec_unpack<US>_hi_<mode>"
   [(match_operand:<V_unpack> 0 "register_operand" "")
    (SE:<V_unpack> (match_operand:VU 1 "register_operand"))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
  rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
  rtx t1;
@@ -5275,7 +5281,7 @@
 (define_expand "vec_unpack<US>_lo_<mode>"
   [(match_operand:<V_unpack> 0 "register_operand" "")
    (SE:<V_unpack> (match_operand:VU 1 "register_operand" ""))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
  rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
  rtx t1;
@@ -5298,7 +5304,7 @@
	  (SE:<V_unpack> (vec_select:<V_HALF>
			     (match_operand:VU 3 "register_operand" "w")
			     (match_dup 2)))))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
  "vmull.
<US><V_sz_elem> %q0, %e1, %e3"
  [(set_attr "neon_type" "neon_shift_1")]
)
@@ -5307,7 +5313,7 @@
  [(match_operand:<V_unpack> 0 "register_operand" "")
   (SE:<V_unpack> (match_operand:VU 1 "register_operand" ""))
   (SE:<V_unpack> (match_operand:VU 2 "register_operand" ""))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
  rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
  rtx t1;
@@ -5332,7 +5338,7 @@
	  (SE:<V_unpack> (vec_select:<V_HALF>
			     (match_operand:VU 3 "register_operand" "w")
			     (match_dup 2)))))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
  "vmull.<US><V_sz_elem> %q0, %f1, %f3"
  [(set_attr "neon_type" "neon_shift_1")]
)
@@ -5341,7 +5347,7 @@
  [(match_operand:<V_unpack> 0 "register_operand" "")
   (SE:<V_unpack> (match_operand:VU 1 "register_operand" ""))
   (SE:<V_unpack> (match_operand:VU 2 "register_operand" ""))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
  rtvec v = rtvec_alloc (<V_mode_nunits>/2)  ;
  rtx t1;
@@ -5435,6 +5441,10 @@
 }
 )
 
+; FIXME: These instruction patterns can't be used safely in big-endian mode
+; because the ordering of vector elements in Q registers is different from what
+; the semantics of the instructions require.
+
 (define_insn "vec_pack_trunc_<mode>"
  [(set (match_operand:<V_narrow_pack> 0 "register_operand" "=&w")
       (vec_concat:<V_narrow_pack>
@@ -5442,7 +5452,7 @@
		(match_operand:VN 1 "register_operand" "w"))
	(truncate:<V_narrow>
		(match_operand:VN 2 "register_operand" "w"))))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
  "vmovn.i<V_sz_elem>\t%e0, %q1\n\tvmovn.i<V_sz_elem>\t%f0, %q2"
  [(set_attr "neon_type" "neon_shift_1")]
)
@@ -5451,7 +5461,7 @@
 (define_insn "neon_vec_pack_trunc_<mode>"
  [(set (match_operand:<V_narrow> 0 "register_operand" "=w")
       (truncate:<V_narrow> (match_operand:VN 1 "register_operand" "w")))]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
  "vmovn.i<V_sz_elem>\t%P0, %q1"
  [(set_attr "neon_type" "neon_shift_1")]
)
@@ -5460,7 +5470,7 @@
  [(match_operand:<V_narrow_pack> 0 "register_operand" "")
   (match_operand:VSHFT 1 "register_operand" "")
   (match_operand:VSHFT 2 "register_operand")]
-  "TARGET_NEON"
+  "TARGET_NEON && !BYTES_BIG_ENDIAN"
 {
  rtx tempreg = gen_reg_rtx (<V_DOUBLE>mode);