From patchwork Thu Nov 3 17:54:40 2011 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jakub Jelinek X-Patchwork-Id: 123479 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) by ozlabs.org (Postfix) with SMTP id AE566B6F67 for ; Fri, 4 Nov 2011 04:55:15 +1100 (EST) Received: (qmail 9662 invoked by alias); 3 Nov 2011 17:55:12 -0000 Received: (qmail 9566 invoked by uid 22791); 3 Nov 2011 17:55:05 -0000 X-SWARE-Spam-Status: No, hits=-6.8 required=5.0 tests=AWL, BAYES_00, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD, SPF_HELO_PASS, TW_AV, TW_DQ, TW_DV, TW_PX, TW_SR, TW_TD, TW_VC, TW_VD, TW_VM, TW_VP, TW_VT, TW_VX X-Spam-Check-By: sourceware.org Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Thu, 03 Nov 2011 17:54:42 +0000 Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id pA3Hsfnb019695 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Thu, 3 Nov 2011 13:54:42 -0400 Received: from tyan-ft48-01.lab.bos.redhat.com (tyan-ft48-01.lab.bos.redhat.com [10.16.42.4]) by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id pA3Hsfle006494 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 3 Nov 2011 13:54:41 -0400 Received: from tyan-ft48-01.lab.bos.redhat.com (tyan-ft48-01.lab.bos.redhat.com [127.0.0.1]) by tyan-ft48-01.lab.bos.redhat.com (8.14.4/8.14.4) with ESMTP id pA3HseAj011996; Thu, 3 Nov 2011 18:54:40 +0100 Received: (from jakub@localhost) by tyan-ft48-01.lab.bos.redhat.com (8.14.4/8.14.4/Submit) id pA3Hsemd011995; Thu, 3 Nov 2011 18:54:40 +0100 Date: Thu, 3 Nov 2011 18:54:40 +0100 From: Jakub Jelinek To: Richard Henderson , Uros Bizjak Cc: gcc-patches@gcc.gnu.org Subject: [PATCH] Fix up floatunsv{4,8}siv{4,8}sf2 Message-ID: <20111103175440.GC1052@tyan-ft48-01.lab.bos.redhat.com> Reply-To: Jakub Jelinek MIME-Version: 1.0 Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) X-IsSubscribed: yes Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Hi! As mentioned in the last mail, the testcase in that patch shows a bug in the unsigned int -> float vectorization on i?86/x86_64. E.g. 0x8000008fU converted to float by scalar code is 0x1.000002p+31, but by vector code is 0x1p+31, i.e. one ulp smaller. We currently generate: vxorps %xmm2, %xmm2, %xmm2 vmovaps .LC0(%rip), %xmm3 ! { 1333788672, 1333788672, 1333788672, 1333788672 } ... .L2: vcvtdq2ps ui(%rax), %xmm0 vcmpltps %xmm2, %xmm0, %xmm1 vandps %xmm3, %xmm1, %xmm1 vaddps %xmm1, %xmm0, %xmm0 vmovaps %xmm0, f(%rax) For the scalar code we just zero extend it to 64-bit register and convert as signed 64-bit integer to float, but we don't have a vector insn for that. Attached are 3 different fix alternatives: X313a is based on scalar code we generate for ulonglong to double code, i.e. if 0x80000000 or bigger divide by two with oring the shifted off bit into the lowest bit, then signed conversion to float and finally doubling that value. vpxor %xmm4, %xmm4, %xmm4 vmovdqa .LC0(%rip), %xmm5 ! { 1, 1, 1, 1 } ... .L2: vmovdqa ui(%rax), %xmm0 vpand %xmm5, %xmm0, %xmm2 vpsrld $1, %xmm0, %xmm3 vpcmpgtd %xmm0, %xmm4, %xmm1 vpor %xmm2, %xmm3, %xmm2 vpblendvb %xmm1, %xmm2, %xmm0, %xmm0 vcvtdq2ps %xmm0, %xmm0 vandps %xmm1, %xmm0, %xmm1 vaddps %xmm1, %xmm0, %xmm0 vmovaps %xmm0, f(%rax) X313b is based on Richard's sequence, vmovdqa .LC0(%rip), %xmm2 ! { 65535, 65535, 65535, 65535 } ... .L2: vmovdqa ui(%rax), %xmm1 vpandn %xmm1, %xmm2, %xmm0 vpand %xmm2, %xmm1, %xmm1 vpsrld $1, %xmm0, %xmm0 vcvtdq2ps %xmm0, %xmm0 vaddps %xmm0, %xmm0, %xmm0 vcvtdq2ps %xmm1, %xmm1 vaddps %xmm0, %xmm1, %xmm0 vmovaps %xmm0, f(%rax) and finally X313c is based on scalar code we generate for -m32 -O2 -msse2 -mfpmath=sse: vmovdqa .LC0(%rip), %xmm3 ! { 65535, 65535, 65535, 65535 } vmovaps .LC1(%rip), %xmm2 ! { 65536.0f, 65536.0f, 65536.0f, 65536.0f } ... .L2: vmovdqa ui(%rax), %xmm0 vpand %xmm3, %xmm0, %xmm1 vpsrld $16, %xmm0, %xmm0 vcvtdq2ps %xmm0, %xmm0 vmulps %xmm2, %xmm0, %xmm0 vcvtdq2ps %xmm1, %xmm1 vaddps %xmm0, %xmm1, %xmm0 vmovaps %xmm0, f(%rax) I've done some benchmarking of these, X313b has the advantage that it needs just one constant while all other need two, but the last variant seems to win, it is one insn shorter and surprisingly the multiplication isn't any slower than the addition. All this is using the attached testcase which is just a cut down version of the vect-cvt-1.c testcase from my last patch, just with the ui2f routine executed many times. Sandybridge, -O3 -mavx vanilla -fno-tree-vectorize 0m4.136s vanilla 0m1.052s patch X313a 0m1.689s patch X313b 0m1.398s patch X313c 0m1.115s Sandybridge, -O3 -msse2 vanilla -fno-tree-vectorize 0m4.099s vanilla 0m1.054s patch X313a 0m2.527s patch X313b 0m1.394s patch X313c 0m1.267s Barcelona, -O3 -msse2 vanilla -fno-tree-vectorize 0m9.176s vanilla 0m2.706s patch X313a 0m6.515s patch X313b 0m3.825s patch X313c 0m3.310s So, what do you prefer and do you want the expander to be moved into i386.c or kept like this? Do we perhaps want for -ffast-math keep the slightly faster, but imprecise version (I'd prefer not to)? Jakub 2011-11-03 Jakub Jelinek * config/i386/sse.md (floatuns2): Rewritten. * config/i386/i386.c (ix86_expand_sse_movcc): No longer static. * gcc.dg/torture/vec-cvt-1.c: Enable commented out inttoflttestui test. 2011-11-03 Jakub Jelinek * config/i386/sse.md (floatuns2): Rewritten. * gcc.dg/torture/vec-cvt-1.c: Enable commented out inttoflttestui test. --- gcc/config/i386/sse.md.jj 2011-11-03 16:11:05.000000000 +0100 +++ gcc/config/i386/sse.md 2011-11-03 17:12:28.000000000 +0100 @@ -2242,30 +2242,35 @@ (define_insn "float< (set_attr "mode" "")]) (define_expand "floatuns2" - [(set (match_dup 5) - (float:VF1 - (match_operand: 1 "nonimmediate_operand" ""))) - (set (match_dup 6) - (lt:VF1 (match_dup 5) (match_dup 3))) - (set (match_dup 7) - (and:VF1 (match_dup 6) (match_dup 4))) - (set (match_operand:VF1 0 "register_operand" "") - (plus:VF1 (match_dup 5) (match_dup 7)))] - "TARGET_SSE2" -{ - REAL_VALUE_TYPE TWO32r; - rtx x; - int i; - - real_ldexp (&TWO32r, &dconst1, 32); - x = const_double_from_real_value (TWO32r, SFmode); - - operands[3] = force_reg (mode, CONST0_RTX (mode)); - operands[4] = force_reg (mode, - ix86_build_const_vector (mode, 1, x)); - - for (i = 5; i < 8; i++) - operands[i] = gen_reg_rtx (mode); + [(match_operand:VF1 0 "register_operand" "") + (match_operand: 1 "register_operand" "")] + "TARGET_SSE2 && (mode == V4SFmode || TARGET_AVX2)" +{ + rtx tmp[8]; + tmp[0] = force_reg (mode, + ix86_build_const_vector (mode, 1, + GEN_INT (0xffff))); + tmp[1] = gen_reg_rtx (mode); + if (mode == V4SImode) + emit_insn (gen_sse2_andnotv4si3 (tmp[1], tmp[0], operands[1])); + else + emit_insn (gen_avx2_andnotv8si3 (tmp[1], tmp[0], operands[1])); + tmp[2] = expand_simple_binop (mode, AND, + operands[1], tmp[0], NULL_RTX, + 1, OPTAB_DIRECT); + tmp[3] = expand_simple_binop (mode, LSHIFTRT, + tmp[1], const1_rtx, NULL_RTX, + 1, OPTAB_DIRECT); + tmp[4] = gen_reg_rtx (mode); + emit_insn (gen_float2 (tmp[4], tmp[2])); + tmp[5] = gen_reg_rtx (mode); + emit_insn (gen_float2 (tmp[5], tmp[3])); + tmp[6] = expand_simple_binop (mode, PLUS, tmp[5], tmp[5], + NULL_RTX, 1, OPTAB_DIRECT); + tmp[7] = expand_simple_binop (mode, PLUS, tmp[4], tmp[6], + operands[0], 1, OPTAB_DIRECT); + gcc_assert (tmp[7] == operands[0]); + DONE; }) (define_insn "avx_cvtps2dq256" --- gcc/testsuite/gcc.dg/torture/vec-cvt-1.c.jj 2011-11-03 14:07:23.000000000 +0100 +++ gcc/testsuite/gcc.dg/torture/vec-cvt-1.c 2011-11-03 17:43:11.000000000 +0100 @@ -205,7 +205,7 @@ main () inttoflttestsl (); inttoflttestuc (); inttoflttestus (); -// inttoflttestui (); + inttoflttestui (); inttoflttestul (); return 0; } 2011-11-03 Jakub Jelinek * config/i386/sse.md (floatuns2): Rewritten. * gcc.dg/torture/vec-cvt-1.c: Enable commented out inttoflttestui test. --- gcc/config/i386/sse.md.jj 2011-11-03 16:11:05.000000000 +0100 +++ gcc/config/i386/sse.md 2011-11-03 17:25:22.000000000 +0100 @@ -2242,30 +2242,35 @@ (define_insn "float< (set_attr "mode" "")]) (define_expand "floatuns2" - [(set (match_dup 5) - (float:VF1 - (match_operand: 1 "nonimmediate_operand" ""))) - (set (match_dup 6) - (lt:VF1 (match_dup 5) (match_dup 3))) - (set (match_dup 7) - (and:VF1 (match_dup 6) (match_dup 4))) - (set (match_operand:VF1 0 "register_operand" "") - (plus:VF1 (match_dup 5) (match_dup 7)))] - "TARGET_SSE2" -{ - REAL_VALUE_TYPE TWO32r; - rtx x; - int i; - - real_ldexp (&TWO32r, &dconst1, 32); - x = const_double_from_real_value (TWO32r, SFmode); - - operands[3] = force_reg (mode, CONST0_RTX (mode)); - operands[4] = force_reg (mode, - ix86_build_const_vector (mode, 1, x)); - - for (i = 5; i < 8; i++) - operands[i] = gen_reg_rtx (mode); + [(match_operand:VF1 0 "register_operand" "") + (match_operand: 1 "register_operand" "")] + "TARGET_SSE2 && (mode == V4SFmode || TARGET_AVX2)" +{ + rtx tmp[8]; + REAL_VALUE_TYPE TWO16r; + tmp[0] = force_reg (mode, + ix86_build_const_vector (mode, 1, + GEN_INT (0xffff))); + tmp[1] = expand_simple_binop (mode, AND, + operands[1], tmp[0], NULL_RTX, + 1, OPTAB_DIRECT); + tmp[2] = expand_simple_binop (mode, LSHIFTRT, + operands[1], GEN_INT (16), NULL_RTX, + 1, OPTAB_DIRECT); + tmp[3] = gen_reg_rtx (mode); + emit_insn (gen_float2 (tmp[3], tmp[1])); + tmp[4] = gen_reg_rtx (mode); + emit_insn (gen_float2 (tmp[4], tmp[2])); + real_ldexp (&TWO16r, &dconst1, 16); + tmp[5] = const_double_from_real_value (TWO16r, SFmode); + tmp[5] = force_reg (mode, + ix86_build_const_vector (mode, 1, tmp[5])); + tmp[6] = expand_simple_binop (mode, MULT, tmp[4], tmp[5], + NULL_RTX, 1, OPTAB_DIRECT); + tmp[7] = expand_simple_binop (mode, PLUS, tmp[3], tmp[6], + operands[0], 1, OPTAB_DIRECT); + gcc_assert (tmp[7] == operands[0]); + DONE; }) (define_insn "avx_cvtps2dq256" --- gcc/testsuite/gcc.dg/torture/vec-cvt-1.c.jj 2011-11-03 14:07:23.000000000 +0100 +++ gcc/testsuite/gcc.dg/torture/vec-cvt-1.c 2011-11-03 17:43:11.000000000 +0100 @@ -205,7 +205,7 @@ main () inttoflttestsl (); inttoflttestuc (); inttoflttestus (); -// inttoflttestui (); + inttoflttestui (); inttoflttestul (); return 0; } #define N 1024 unsigned int ui[N]; float f[N]; #define FN1(from, to) \ __attribute__((noinline, noclone)) void \ from##2##to (void) \ { \ int i; \ for (i = 0; i < N; i++) \ to[i] = from[i]; \ } #define FN(intt, fltt) FN1 (intt, fltt) FN1 (fltt, intt) FN (ui, f) #define FLTTEST(min, max, intt) \ __attribute__((noinline, noclone)) void \ inttoflttest##intt (void) \ { \ int i; \ volatile float vf; \ volatile double vd; \ for (i = 0; i < N; i++) \ { \ asm (""); \ if (i < N / 4) \ intt[i] = min + i; \ else if (i < 3 * N / 4) \ intt[i] = (max + min) / 2 - N * 8 + 16 * i; \ else \ intt[i] = max - N + 1 + i; \ } \ for (i = 0; i < 5000000; i++) \ intt##2f (); \ } FLTTEST (0, 2U * __INT_MAX__ + 1, ui) int main () { inttoflttestui (); return 0; } --- gcc/config/i386/i386.c.jj 2011-11-03 16:11:12.000000000 +0100 +++ gcc/config/i386/i386.c 2011-11-03 16:34:23.000000000 +0100 @@ -19006,7 +19006,9 @@ ix86_expand_sse_cmp (rtx dest, enum rtx_ /* Expand DEST = CMP ? OP_TRUE : OP_FALSE into a sequence of logical operations. This is used for both scalar and vector conditional moves. */ -static void +extern void ix86_expand_sse_movcc (rtx, rtx, rtx, rtx); + +void ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx op_true, rtx op_false) { enum machine_mode mode = GET_MODE (dest); --- gcc/config/i386/sse.md.jj 2011-11-03 16:11:05.000000000 +0100 +++ gcc/config/i386/sse.md 2011-11-03 16:49:13.000000000 +0100 @@ -2242,30 +2242,41 @@ (define_insn "float< (set_attr "mode" "")]) (define_expand "floatuns2" - [(set (match_dup 5) - (float:VF1 - (match_operand: 1 "nonimmediate_operand" ""))) - (set (match_dup 6) - (lt:VF1 (match_dup 5) (match_dup 3))) - (set (match_dup 7) - (and:VF1 (match_dup 6) (match_dup 4))) - (set (match_operand:VF1 0 "register_operand" "") - (plus:VF1 (match_dup 5) (match_dup 7)))] - "TARGET_SSE2" -{ - REAL_VALUE_TYPE TWO32r; - rtx x; - int i; - - real_ldexp (&TWO32r, &dconst1, 32); - x = const_double_from_real_value (TWO32r, SFmode); - - operands[3] = force_reg (mode, CONST0_RTX (mode)); - operands[4] = force_reg (mode, - ix86_build_const_vector (mode, 1, x)); - - for (i = 5; i < 8; i++) - operands[i] = gen_reg_rtx (mode); + [(match_operand:VF1 0 "register_operand" "") + (match_operand: 1 "register_operand" "")] + "TARGET_SSE2 && (mode == V4SFmode || TARGET_AVX2)" +{ + extern void ix86_expand_sse_movcc (rtx, rtx, rtx, rtx); + + rtx tmp[9]; + tmp[0] = gen_reg_rtx (mode); + tmp[1] = force_reg (mode, CONST0_RTX (mode)); + if (mode == V4SImode) + emit_insn (gen_sse2_gtv4si3 (tmp[0], tmp[1], operands[1])); + else + emit_insn (gen_avx2_gtv8si3 (tmp[0], tmp[1], operands[1])); + tmp[1] = expand_simple_binop (mode, LSHIFTRT, + operands[1], const1_rtx, NULL_RTX, + 1, OPTAB_DIRECT); + tmp[2] = force_reg (mode, + ix86_build_const_vector (mode, + 1, const1_rtx)); + tmp[3] = expand_simple_binop (mode, AND, + operands[1], tmp[2], NULL_RTX, 1, + OPTAB_DIRECT); + tmp[4] = expand_simple_binop (mode, IOR, + tmp[1], tmp[3], NULL_RTX, 1, OPTAB_DIRECT); + tmp[5] = gen_reg_rtx (mode); + ix86_expand_sse_movcc (tmp[5], tmp[0], tmp[4], operands[1]); + tmp[6] = gen_reg_rtx (mode); + emit_insn (gen_float2 (tmp[6], tmp[5])); + tmp[7] = expand_simple_binop (mode, AND, + gen_lowpart (mode, tmp[0]), tmp[6], + NULL_RTX, 1, OPTAB_DIRECT); + tmp[8] = expand_simple_binop (mode, PLUS, tmp[6], tmp[7], operands[0], + 1, OPTAB_DIRECT); + gcc_assert (tmp[8] == operands[0]); + DONE; }) (define_insn "avx_cvtps2dq256" --- gcc/testsuite/gcc.dg/torture/vec-cvt-1.c.jj 2011-11-03 14:07:23.000000000 +0100 +++ gcc/testsuite/gcc.dg/torture/vec-cvt-1.c 2011-11-03 17:43:11.000000000 +0100 @@ -205,7 +205,7 @@ main () inttoflttestsl (); inttoflttestuc (); inttoflttestus (); -// inttoflttestui (); + inttoflttestui (); inttoflttestul (); return 0; }