From patchwork Fri Nov 17 16:33:36 2017
X-Patchwork-Submitter: Jan Hubicka
X-Patchwork-Id: 839062
Date: Fri, 17 Nov 2017 17:33:36 +0100
From: Jan Hubicka
To: gcc-patches@gcc.gnu.org
Subject: Fix x86 vectorization cost wrt unsupported 8bit and 64bit integer ops
Message-ID: <20171117163335.GA96609@kam.mff.cuni.cz>

Hi,
as discussed on IRC, the vectorizer cost model currently ignores the fact
that not all vector operations are supported.  In particular, when
vectorizing byte and 64-bit integer loops we quite often end up producing
a slower vector sequence because we assume we can use vector operations
that do not exist.  For some reason rtx_cost already has logic to increase
the cost of these operations, but that logic is not used by the vectorizer
cost model.
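To make the problem concrete, here is a hypothetical reduction (my
illustration, not part of the patch): x86 has no byte-wide vector multiply,
so vectorizing the loop below requires an emulated V16QImode multiply
sequence of 7-13 insns, yet a cost model that assumes a single cheap vector
multiply exists will happily vectorize it:

#define N 1024
unsigned char a[N], b[N], c[N];

/* Each vector iteration must widen the bytes, perform word-wide
   multiplies and re-pack the results, instead of executing one
   multiply insn.  */
void
mul_bytes (void)
{
  for (int i = 0; i < N; i++)
    a[i] = b[i] * c[i];
}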
This patch simply breaks out the relevant parts of rtx_cost for use there.
The patch gives quite noticeable improvements on the TAMC benchmarks
operating on char or float.  It also causes a regression on cactusADM on
Haswell; I have looked into it and it seems related to fmadd, so I think we
can solve that incrementally.

The patch also disables vectorization of the loop in
i386/xop-imul64-vector.c:

union
{
  __m128i i_align;
  long i64[SIZE];
} a, b, c, d;

void
imul64 (void)
{
  int i;

  for (i = 0; i < SIZE; i++)
    a.i64[i] = b.i64[i] * c.i64[i];
}

64-bit vector multiply is not supported, so for every two iterations we
replace 4 loads, 2 stores and 2 scalar multiplies by 2 loads, 1 store,
2 vector multiplies and extra re-packing overhead.  With the way the costs
are set now this ends up being more expensive.  I tested it on Skylake,
Haswell, Bulldozer and Zen.  The vectorized loop runs significantly slower
on Skylake and Haswell, but faster on Bulldozer and Zen.  I suppose this is
another issue to tweak incrementally once the costs are reasonably
realistic.

Besides micro-benchmarking I tried spec2k and spec2k6 on Haswell and Zen,
and also our C++ benchmarks on Haswell.

Bootstrapped/regtested x86_64-linux.

	* i386.c (ix86_multiplication_cost, ix86_division_cost,
	ix86_shift_rotate_cost): Break out from ...
	(ix86_rtx_costs): ... here.
	(ix86_add_stmt_cost): Use rtx cost machinery to compute cost of
	vector operations.

	* gcc.target/i386/xop-imul64-vector.c: Disable vectorizer cost
	model.

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 254866)
+++ config/i386/i386.c	(working copy)
@@ -39257,6 +39257,168 @@ ix86_vec_cost (machine_mode mode, int co
   return cost;
 }
 
+/* Return cost of multiplication in MODE.  */
+
+static int
+ix86_multiplication_cost (const struct processor_costs *cost,
+			  enum machine_mode mode)
+{
+  machine_mode inner_mode = mode;
+  if (VECTOR_MODE_P (mode))
+    inner_mode = GET_MODE_INNER (mode);
+
+  if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
+    return inner_mode == DFmode ? cost->mulsd : cost->mulss;
+  else if (X87_FLOAT_MODE_P (mode))
+    return cost->fmul;
+  else if (FLOAT_MODE_P (mode))
+    return ix86_vec_cost (mode,
+			  inner_mode == DFmode
+			  ? cost->mulsd : cost->mulss, true);
+  else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+    {
+      /* V*QImode is emulated with 7-13 insns.  */
+      if (mode == V16QImode || mode == V32QImode)
+	{
+	  int extra = 11;
+	  if (TARGET_XOP && mode == V16QImode)
+	    extra = 5;
+	  else if (TARGET_SSSE3)
+	    extra = 6;
+	  return ix86_vec_cost (mode,
+				cost->mulss * 2 + cost->sse_op * extra,
+				true);
+	}
+      /* V*DImode is emulated with 5-8 insns.  */
+      else if (mode == V2DImode || mode == V4DImode)
+	{
+	  if (TARGET_XOP && mode == V2DImode)
+	    return ix86_vec_cost (mode,
+				  cost->mulss * 2 + cost->sse_op * 3,
+				  true);
+	  else
+	    return ix86_vec_cost (mode,
+				  cost->mulss * 3 + cost->sse_op * 5,
+				  true);
+	}
+      /* Without sse4.1, we don't have PMULLD; it's emulated with 7
+	 insns, including two PMULUDQ.  */
+      else if (mode == V4SImode && !(TARGET_SSE4_1 || TARGET_AVX))
+	return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 5,
+			      true);
+      else
+	return ix86_vec_cost (mode, cost->mulss, true);
+    }
+  else
+    return (cost->mult_init[MODE_INDEX (mode)] + cost->mult_bit * 7);
+}
+
+/* Return cost of division in MODE.  */
+
+static int
+ix86_division_cost (const struct processor_costs *cost,
+		    enum machine_mode mode)
+{
+  machine_mode inner_mode = mode;
+  if (VECTOR_MODE_P (mode))
+    inner_mode = GET_MODE_INNER (mode);
+
+  if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
+    return inner_mode == DFmode ? cost->divsd : cost->divss;
+  else if (X87_FLOAT_MODE_P (mode))
+    return cost->fdiv;
+  else if (FLOAT_MODE_P (mode))
+    return ix86_vec_cost (mode,
+			  inner_mode == DFmode ? cost->divsd : cost->divss,
+			  true);
+  else
+    return cost->divide[MODE_INDEX (mode)];
+}
+
+/* Return cost of shift in MODE.
+   If CONSTANT_OP1 is true, the op1 value is known and set in OP1_VAL.
+   AND_IN_OP1 specifies whether op1 is the result of an AND, and
+   SHIFT_AND_TRUNCATE whether op1 is a SUBREG of an AND.
+
+   SKIP_OP0/1 is set to true if cost of OP0/1 should be ignored.  */
+
+static int
+ix86_shift_rotate_cost (const struct processor_costs *cost,
+			enum machine_mode mode, bool constant_op1,
+			HOST_WIDE_INT op1_val,
+			bool speed,
+			bool and_in_op1,
+			bool shift_and_truncate,
+			bool *skip_op0, bool *skip_op1)
+{
+  if (skip_op0)
+    *skip_op0 = *skip_op1 = false;
+  if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+    {
+      /* V*QImode is emulated with 1-11 insns.  */
+      if (mode == V16QImode || mode == V32QImode)
+	{
+	  int count = 11;
+	  if (TARGET_XOP && mode == V16QImode)
+	    {
+	      /* For XOP we use vpshab, which requires a broadcast of the
+		 value to the variable shift insn.  For constants this
+		 means a V16Q const in mem; even when we can perform the
+		 shift with one insn set the cost to prefer paddb.  */
+	      if (constant_op1)
+		{
+		  if (skip_op1)
+		    *skip_op1 = true;
+		  return ix86_vec_cost (mode,
+					cost->sse_op
+					+ (speed
+					   ? 2
+					   : COSTS_N_BYTES
+					       (GET_MODE_UNIT_SIZE (mode))),
+					true);
+		}
+	      count = 3;
+	    }
+	  else if (TARGET_SSSE3)
+	    count = 7;
+	  return ix86_vec_cost (mode, cost->sse_op * count, true);
+	}
+      else
+	return ix86_vec_cost (mode, cost->sse_op, true);
+    }
+  if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
+    {
+      if (constant_op1)
+	{
+	  if (op1_val > 32)
+	    return cost->shift_const + COSTS_N_INSNS (2);
+	  else
+	    return cost->shift_const * 2;
+	}
+      else
+	{
+	  if (and_in_op1)
+	    return cost->shift_var * 2;
+	  else
+	    return cost->shift_var * 6 + COSTS_N_INSNS (2);
+	}
+    }
+  else
+    {
+      if (constant_op1)
+	return cost->shift_const;
+      else if (shift_and_truncate)
+	{
+	  if (skip_op0)
+	    *skip_op0 = *skip_op1 = true;
+	  /* Return the cost after shift-and truncation.  */
+	  return cost->shift_var;
+	}
+      else
+	return cost->shift_var;
+    }
+  return cost->shift_const;
+}
+
 /* Compute a (partial) cost for rtx X.  Return true if the complete
    cost has been computed, and false if subexpressions should be
    scanned.  In either case, *TOTAL contains the cost result.  */
@@ -39270,9 +39432,6 @@ ix86_rtx_costs (rtx x, machine_mode mode
   enum rtx_code outer_code = (enum rtx_code) outer_code_i;
   const struct processor_costs *cost
     = speed ? ix86_cost : &ix86_size_cost;
   int src_cost;
-  machine_mode inner_mode = mode;
-  if (VECTOR_MODE_P (mode))
-    inner_mode = GET_MODE_INNER (mode);
 
   switch (code)
     {
@@ -39400,68 +39559,22 @@ ix86_rtx_costs (rtx x, machine_mode mode
     case ASHIFTRT:
     case LSHIFTRT:
    case ROTATERT:
-      if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
-	{
-	  /* ??? Should be SSE vector operation cost.  */
-	  /* At least for published AMD latencies, this really is the same
-	     as the latency for a simple fpu operation like fabs.  */
-	  /* V*QImode is emulated with 1-11 insns.  */
-	  if (mode == V16QImode || mode == V32QImode)
-	    {
-	      int count = 11;
-	      if (TARGET_XOP && mode == V16QImode)
-		{
-		  /* For XOP we use vpshab, which requires a broadcast of the
-		     value to the variable shift insn.  For constants this
-		     means a V16Q const in mem; even when we can perform the
-		     shift with one insn set the cost to prefer paddb.  */
-		  if (CONSTANT_P (XEXP (x, 1)))
-		    {
-		      *total = ix86_vec_cost (mode,
-				 cost->sse_op
-				 + rtx_cost (XEXP (x, 0), mode, code, 0, speed)
-				 + (speed ? 2 : COSTS_N_BYTES (16)), true);
-		      return true;
-		    }
-		  count = 3;
-		}
-	      else if (TARGET_SSSE3)
-		count = 7;
-	      *total = ix86_vec_cost (mode, cost->sse_op * count, true);
-	    }
-	  else
-	    *total = ix86_vec_cost (mode, cost->sse_op, true);
-	}
-      else if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
-	{
-	  if (CONST_INT_P (XEXP (x, 1)))
-	    {
-	      if (INTVAL (XEXP (x, 1)) > 32)
-		*total = cost->shift_const + COSTS_N_INSNS (2);
-	      else
-		*total = cost->shift_const * 2;
-	    }
-	  else
-	    {
-	      if (GET_CODE (XEXP (x, 1)) == AND)
-		*total = cost->shift_var * 2;
-	      else
-		*total = cost->shift_var * 6 + COSTS_N_INSNS (2);
-	    }
-	}
-      else
-	{
-	  if (CONST_INT_P (XEXP (x, 1)))
-	    *total = cost->shift_const;
-	  else if (SUBREG_P (XEXP (x, 1))
-		   && GET_CODE (XEXP (XEXP (x, 1), 0)) == AND)
-	    {
-	      /* Return the cost after shift-and truncation.  */
-	      *total = cost->shift_var;
-	      return true;
-	    }
-	  else
-	    *total = cost->shift_var;
+      bool skip_op0, skip_op1;
+      *total = ix86_shift_rotate_cost (cost, mode, CONSTANT_P (XEXP (x, 1)),
+				       CONST_INT_P (XEXP (x, 1))
+					 ? INTVAL (XEXP (x, 1)) : -1,
+				       speed,
+				       GET_CODE (XEXP (x, 1)) == AND,
+				       SUBREG_P (XEXP (x, 1))
+				       && GET_CODE (XEXP (XEXP (x, 1), 0))
+					  == AND,
+				       &skip_op0, &skip_op1);
+      if (skip_op0 || skip_op1)
+	{
+	  if (!skip_op0)
+	    *total += rtx_cost (XEXP (x, 0), mode, code, 0, speed);
+	  if (!skip_op1)
+	    *total += rtx_cost (XEXP (x, 1), mode, code, 0, speed);
+	  return true;
 	}
       return false;
@@ -39491,59 +39604,7 @@ ix86_rtx_costs (rtx x, machine_mode mode
 	}
 
     case MULT:
-      if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
-	{
-	  *total = inner_mode == DFmode ? cost->mulsd : cost->mulss;
-	  return false;
-	}
-      else if (X87_FLOAT_MODE_P (mode))
-	{
-	  *total = cost->fmul;
-	  return false;
-	}
-      else if (FLOAT_MODE_P (mode))
-	{
-	  *total = ix86_vec_cost (mode,
-				  inner_mode == DFmode
-				  ? cost->mulsd : cost->mulss, true);
-	  return false;
-	}
-      else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
-	{
-	  /* V*QImode is emulated with 7-13 insns.  */
-	  if (mode == V16QImode || mode == V32QImode)
-	    {
-	      int extra = 11;
-	      if (TARGET_XOP && mode == V16QImode)
-		extra = 5;
-	      else if (TARGET_SSSE3)
-		extra = 6;
-	      *total = ix86_vec_cost (mode,
-				      cost->mulss * 2 + cost->sse_op * extra,
-				      true);
-	    }
-	  /* V*DImode is emulated with 5-8 insns.  */
-	  else if (mode == V2DImode || mode == V4DImode)
-	    {
-	      if (TARGET_XOP && mode == V2DImode)
-		*total = ix86_vec_cost (mode,
-					cost->mulss * 2 + cost->sse_op * 3,
-					true);
-	      else
-		*total = ix86_vec_cost (mode,
-					cost->mulss * 3 + cost->sse_op * 5,
-					true);
-	    }
-	  /* Without sse4.1, we don't have PMULLD; it's emulated with 7
-	     insns, including two PMULUDQ.  */
-	  else if (mode == V4SImode && !(TARGET_SSE4_1 || TARGET_AVX))
-	    *total = ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 5,
-				    true);
-	  else
-	    *total = ix86_vec_cost (mode, cost->mulss, true);
-	  return false;
-	}
-      else
+      if (!FLOAT_MODE_P (mode) && !VECTOR_MODE_P (mode))
 	{
 	  rtx op0 = XEXP (x, 0);
 	  rtx op1 = XEXP (x, 1);
@@ -39588,21 +39649,14 @@ ix86_rtx_costs (rtx x, machine_mode mode
 	  return true;
 	}
+      *total = ix86_multiplication_cost (cost, mode);
+      return false;
 
     case DIV:
     case UDIV:
     case MOD:
    case UMOD:
-      if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
-	*total = inner_mode == DFmode ? cost->divsd : cost->divss;
-      else if (X87_FLOAT_MODE_P (mode))
-	*total = cost->fdiv;
-      else if (FLOAT_MODE_P (mode))
-	*total = ix86_vec_cost (mode,
-				inner_mode == DFmode
-				? cost->divsd : cost->divss,
-				true);
-      else
-	*total = cost->divide[MODE_INDEX (mode)];
+      *total = ix86_division_cost (cost, mode);
       return false;
 
     case PLUS:
@@ -48799,7 +48853,117 @@ ix86_add_stmt_cost (void *data, int coun
   unsigned retval = 0;
   tree vectype = stmt_info ? stmt_vectype (stmt_info) : NULL_TREE;
-  int stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
+  int stmt_cost = -1;
+
+  if ((kind == vector_stmt || kind == scalar_stmt)
+      && stmt_info
+      && stmt_info->stmt && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN)
+    {
+      tree_code subcode = gimple_assign_rhs_code (stmt_info->stmt);
+      bool fp = false;
+      machine_mode mode = TImode;
+
+      if (vectype != NULL)
+	{
+	  fp = FLOAT_TYPE_P (vectype);
+	  mode = TYPE_MODE (vectype);
+	}
+
+      switch (subcode)
+	{
+	case PLUS_EXPR:
+	case POINTER_PLUS_EXPR:
+	case MINUS_EXPR:
+	  if (kind == scalar_stmt)
+	    {
+	      if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
+		stmt_cost = ix86_cost->addss;
+	      else if (X87_FLOAT_MODE_P (mode))
+		stmt_cost = ix86_cost->fadd;
+	      else
+		stmt_cost = ix86_cost->add;
+	    }
+	  else
+	    stmt_cost = ix86_vec_cost (mode,
+				       fp ? ix86_cost->addss
+					  : ix86_cost->sse_op,
+				       true);
+	  break;
+
+	case MULT_EXPR:
+	case WIDEN_MULT_EXPR:
+	case MULT_HIGHPART_EXPR:
+	  stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
+	  break;
+
+	case FMA_EXPR:
+	  stmt_cost = ix86_vec_cost (mode,
+				     mode == SFmode ? ix86_cost->fmass
+						    : ix86_cost->fmasd,
+				     true);
+	  break;
+
+	case NEGATE_EXPR:
+	  if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
+	    stmt_cost = ix86_cost->sse_op;
+	  else if (X87_FLOAT_MODE_P (mode))
+	    stmt_cost = ix86_cost->fchs;
+	  else if (VECTOR_MODE_P (mode))
+	    stmt_cost = ix86_vec_cost (mode, ix86_cost->sse_op, true);
+	  else
+	    stmt_cost = ix86_cost->add;
+	  break;
+
+	case TRUNC_DIV_EXPR:
+	case CEIL_DIV_EXPR:
+	case FLOOR_DIV_EXPR:
+	case ROUND_DIV_EXPR:
+	case TRUNC_MOD_EXPR:
+	case CEIL_MOD_EXPR:
+	case FLOOR_MOD_EXPR:
+	case RDIV_EXPR:
+	case ROUND_MOD_EXPR:
+	case EXACT_DIV_EXPR:
+	  stmt_cost = ix86_division_cost (ix86_cost, mode);
+	  break;
+
+	case RSHIFT_EXPR:
+	case LSHIFT_EXPR:
+	case LROTATE_EXPR:
+	case RROTATE_EXPR:
+	  {
+	    tree op2 = gimple_assign_rhs2 (stmt_info->stmt);
+	    stmt_cost = ix86_shift_rotate_cost
+			  (ix86_cost, mode,
+			   TREE_CODE (op2) == INTEGER_CST,
+			   cst_and_fits_in_hwi (op2)
+			     ? int_cst_value (op2) : -1,
+			   true, false, false, NULL, NULL);
+	  }
+	  break;
+
+	case NOP_EXPR:
+	  stmt_cost = 0;
+	  break;
+
+	case BIT_IOR_EXPR:
+	case ABS_EXPR:
+	case MIN_EXPR:
+	case MAX_EXPR:
+	case BIT_XOR_EXPR:
+	case BIT_AND_EXPR:
+	case BIT_NOT_EXPR:
+	  if (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)
+	    stmt_cost = ix86_cost->sse_op;
+	  else if (VECTOR_MODE_P (mode))
+	    stmt_cost = ix86_vec_cost (mode, ix86_cost->sse_op, true);
+	  else
+	    stmt_cost = ix86_cost->add;
+	  break;
+
+	default:
+	  break;
+	}
+    }
+  if (stmt_cost == -1)
+    stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
 
   /* Penalize DFmode vector operations for Bonnell.  */
   if (TARGET_BONNELL && kind == vector_stmt
Index: testsuite/gcc.target/i386/xop-imul64-vector.c
===================================================================
--- testsuite/gcc.target/i386/xop-imul64-vector.c	(revision 254866)
+++ testsuite/gcc.target/i386/xop-imul64-vector.c	(working copy)
@@ -3,7 +3,7 @@
 /* { dg-do compile } */
 /* { dg-require-effective-target lp64 } */
-/* { dg-options "-O2 -mxop -mno-avx2 -ftree-vectorize" } */
+/* { dg-options "-O2 -mxop -mno-avx2 -ftree-vectorize -fno-vect-cost-model" } */
 
 extern void exit (int);
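For reference, here is a minimal standalone sketch of the integer-vector
multiply arithmetic that ix86_multiplication_cost implements above.  The
cost constants are illustrative stand-ins, not values from GCC's per-CPU
processor_costs tables, and the real code additionally routes the result
through ix86_vec_cost:

#include <stdio.h>

/* Illustrative stand-ins; the real values come from the per-CPU
   processor_costs tables in i386.c.  */
#define COST_MULSS  4
#define COST_SSE_OP 1

/* Mirrors the emulation counts above: a V*QImode multiply costs
   mulss * 2 + sse_op * extra (extra = 5 with XOP, 6 with SSSE3,
   11 otherwise); a V*DImode multiply costs mulss * 2 + sse_op * 3
   with XOP, else mulss * 3 + sse_op * 5.  */
static int
vec_mul_cost (int qimode, int have_xop, int have_ssse3)
{
  if (qimode)
    {
      int extra = have_xop ? 5 : have_ssse3 ? 6 : 11;
      return COST_MULSS * 2 + COST_SSE_OP * extra;
    }
  return have_xop ? COST_MULSS * 2 + COST_SSE_OP * 3
		  : COST_MULSS * 3 + COST_SSE_OP * 5;
}

int
main (void)
{
  printf ("V16QImode multiply, SSSE3: %d\n", vec_mul_cost (1, 0, 1));
  printf ("V2DImode multiply, no XOP: %d\n", vec_mul_cost (0, 0, 0));
  return 0;
}

With these stand-in constants the sketch prints 14 and 17, i.e. both
emulated multiplies are far from the single-insn cost the vectorizer
previously assumed.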