From patchwork Thu Jan 23 21:48:16 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jakub Jelinek X-Patchwork-Id: 1228571 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=gcc-patches-return-518181-incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.a=rsa-sha1 header.s=default header.b=ex4mB6Dy; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=YAvmMoqO; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 483bWK6lH4z9sNF for ; Fri, 24 Jan 2020 08:48:44 +1100 (AEDT) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:cc:subject:message-id:reply-to:mime-version :content-type:content-transfer-encoding; q=dns; s=default; b=Sac cGkCgcvzhhtdGFHYklObhX4fc0ikSpEX1//NQqLNQQdf45orgH9OnMwWGzzAts2E 3Kd9/YR4NPukjgkTOvWr9eXu3uzp7e86VJpCnT96jg31VF1Al1oJxb14yPcFe9Ok CoYDxN3Mcy8NVC2e+zPvSQ7adg6I7KXztJYR4zwA= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:cc:subject:message-id:reply-to:mime-version :content-type:content-transfer-encoding; s=default; bh=3WUPlZzHO QbhWrB/jXa1aog4Q2c=; b=ex4mB6DyBZR4DrSxlCnEpVvfeUANFadG2zI0gkHSy ExDDOaet9rENOOnVmxDQBuKx6f9Kz7PsND2Sj/aAq8840b6oK66pkMcsqPrMBFiX x3xFu5+4ZQJ3paayDyltWUZjsUrhQbBUfMtLWqF8VkrxtwpVtUojpdH6vSjyAecd mI= Received: (qmail 31324 invoked by alias); 23 Jan 2020 21:48:35 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 31257 invoked by uid 89); 23 Jan 2020 21:48:30 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-7.9 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_2, GIT_PATCH_3, RCVD_IN_DNSWL_NONE, SPF_PASS autolearn=ham version=3.3.1 spammy=Fog's, 1-7, Fogs, fog's X-HELO: us-smtp-delivery-1.mimecast.com Received: from us-smtp-2.mimecast.com (HELO us-smtp-delivery-1.mimecast.com) (205.139.110.61) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 23 Jan 2020 21:48:27 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1579816105; h=from:from:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=kaU7Dg3lbwBVVxV4RtB0cKrqnwGrT8cNH4w7GIU1JTs=; b=YAvmMoqOXHwGOw+laBv6SDiC4EPuYhPJEBt9rEEI9EXoJX3F7JBSLPsMMb08OQFNQ9/0Rn u+QXthG6yGnK4ZZMeN3UO7r97i4jgQAxBHDKclEs47CTcsh8G6E05rHQDVzNaDyGh3+cx+ SrdAaTMeKBpHVSwXSuylMgdEc7sVG5U= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-170-On6erCzEMyycJDE8fpGngg-1; Thu, 23 Jan 2020 16:48:21 -0500 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id ECDA61937FC0; Thu, 23 Jan 2020 21:48:20 +0000 (UTC) Received: from tucnak.zalov.cz (ovpn-116-51.ams2.redhat.com [10.36.116.51]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 4E2465C1B2; Thu, 23 Jan 2020 21:48:20 +0000 (UTC) Received: from tucnak.zalov.cz (localhost [127.0.0.1]) by tucnak.zalov.cz (8.15.2/8.15.2) with ESMTP id 00NLmINR015930; Thu, 23 Jan 2020 22:48:18 +0100 Received: (from jakub@localhost) by tucnak.zalov.cz (8.15.2/8.15.2/Submit) id 00NLmHYX015929; Thu, 23 Jan 2020 22:48:17 +0100 Date: Thu, 23 Jan 2020 22:48:16 +0100 From: Jakub Jelinek To: Uros Bizjak Cc: gcc-patches@gcc.gnu.org Subject: [PATCH] i386: prefer vpermilpd over vpermpd [PR93395] Message-ID: <20200123214816.GJ10088@tucnak> Reply-To: Jakub Jelinek MIME-Version: 1.0 User-Agent: Mutt/1.11.3 (2019-02-01) X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Disposition: inline X-IsSubscribed: yes Hi! In Agner Fog's tables, vpermilp[sd] with immediates seem to be much faster than vpermpd with immediate, for a good reason, the former only permute something within the lanes and don't do anything intra-lane, while vpermpd can. So, functionality-wise, vpermilpd is more efficient subset of vpermpd. We use the same RTL for those though (and also for certain broadcast). Now, the problem was that the vpermpd pattern appeared first in sse.md, followed by the broadcast patterns, followed by the vpermilp[sd]. Which means unless -mavx -mno-avx2, we'd emit vpermpd instead of the more efficient alternatives. The following patch reorders them, so that vpermpd comes last, if we can match a broadcast, we do, if we can match a vpermilp[sd] that is not a broadcast, we will, otherwise fall back (of course only if -mavx2) to vpermpd. Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? 2020-01-23 Jakub Jelinek PR target/93395 * config/i386/sse.md (*avx_vperm_broadcast_v4sf, *avx_vperm_broadcast_, _vpermil, *_vpermilp): Move before avx2_perm/avx512f_perm. * gcc.target/i386/pr93395.c: New test. * gcc.target/i386/avx512vl-vpermilpdi-1.c: Remove xfail. Jakub --- gcc/config/i386/sse.md.jj 2020-01-23 19:24:14.851423969 +0100 +++ gcc/config/i386/sse.md 2020-01-23 19:41:58.729091766 +0100 @@ -19875,6 +19875,164 @@ (define_insn "_permvar") (set_attr "mode" "")]) +;; Recognize broadcast as a vec_select as produced by builtin_vec_perm. +;; If it so happens that the input is in memory, use vbroadcast. +;; Otherwise use vpermilp (and in the case of 256-bit modes, vperm2f128). +(define_insn "*avx_vperm_broadcast_v4sf" + [(set (match_operand:V4SF 0 "register_operand" "=v,v,v") + (vec_select:V4SF + (match_operand:V4SF 1 "nonimmediate_operand" "m,o,v") + (match_parallel 2 "avx_vbroadcast_operand" + [(match_operand 3 "const_int_operand" "C,n,n")])))] + "TARGET_AVX" +{ + int elt = INTVAL (operands[3]); + switch (which_alternative) + { + case 0: + case 1: + operands[1] = adjust_address_nv (operands[1], SFmode, elt * 4); + return "vbroadcastss\t{%1, %0|%0, %k1}"; + case 2: + operands[2] = GEN_INT (elt * 0x55); + return "vpermilps\t{%2, %1, %0|%0, %1, %2}"; + default: + gcc_unreachable (); + } +} + [(set_attr "type" "ssemov,ssemov,sselog1") + (set_attr "prefix_extra" "1") + (set_attr "length_immediate" "0,0,1") + (set_attr "prefix" "maybe_evex") + (set_attr "mode" "SF,SF,V4SF")]) + +(define_insn_and_split "*avx_vperm_broadcast_" + [(set (match_operand:VF_256 0 "register_operand" "=v,v,v") + (vec_select:VF_256 + (match_operand:VF_256 1 "nonimmediate_operand" "m,o,?v") + (match_parallel 2 "avx_vbroadcast_operand" + [(match_operand 3 "const_int_operand" "C,n,n")])))] + "TARGET_AVX" + "#" + "&& reload_completed && (mode != V4DFmode || !TARGET_AVX2)" + [(set (match_dup 0) (vec_duplicate:VF_256 (match_dup 1)))] +{ + rtx op0 = operands[0], op1 = operands[1]; + int elt = INTVAL (operands[3]); + + if (REG_P (op1)) + { + int mask; + + if (TARGET_AVX2 && elt == 0) + { + emit_insn (gen_vec_dup (op0, gen_lowpart (mode, + op1))); + DONE; + } + + /* Shuffle element we care about into all elements of the 128-bit lane. + The other lane gets shuffled too, but we don't care. */ + if (mode == V4DFmode) + mask = (elt & 1 ? 15 : 0); + else + mask = (elt & 3) * 0x55; + emit_insn (gen_avx_vpermil (op0, op1, GEN_INT (mask))); + + /* Shuffle the lane we care about into both lanes of the dest. */ + mask = (elt / ( / 2)) * 0x11; + if (EXT_REX_SSE_REG_P (op0)) + { + /* There is no EVEX VPERM2F128, but we can use either VBROADCASTSS + or VSHUFF128. */ + gcc_assert (mode == V8SFmode); + if ((mask & 1) == 0) + emit_insn (gen_avx2_vec_dupv8sf (op0, + gen_lowpart (V4SFmode, op0))); + else + emit_insn (gen_avx512vl_shuf_f32x4_1 (op0, op0, op0, + GEN_INT (4), GEN_INT (5), + GEN_INT (6), GEN_INT (7), + GEN_INT (12), GEN_INT (13), + GEN_INT (14), GEN_INT (15))); + DONE; + } + + emit_insn (gen_avx_vperm2f1283 (op0, op0, op0, GEN_INT (mask))); + DONE; + } + + operands[1] = adjust_address (op1, mode, + elt * GET_MODE_SIZE (mode)); +}) + +(define_expand "_vpermil" + [(set (match_operand:VF2 0 "register_operand") + (vec_select:VF2 + (match_operand:VF2 1 "nonimmediate_operand") + (match_operand:SI 2 "const_0_to_255_operand")))] + "TARGET_AVX && " +{ + int mask = INTVAL (operands[2]); + rtx perm[]; + + int i; + for (i = 0; i < ; i = i + 2) + { + perm[i] = GEN_INT (((mask >> i) & 1) + i); + perm[i + 1] = GEN_INT (((mask >> (i + 1)) & 1) + i); + } + + operands[2] + = gen_rtx_PARALLEL (VOIDmode, gen_rtvec_v (, perm)); +}) + +(define_expand "_vpermil" + [(set (match_operand:VF1 0 "register_operand") + (vec_select:VF1 + (match_operand:VF1 1 "nonimmediate_operand") + (match_operand:SI 2 "const_0_to_255_operand")))] + "TARGET_AVX && " +{ + int mask = INTVAL (operands[2]); + rtx perm[]; + + int i; + for (i = 0; i < ; i = i + 4) + { + perm[i] = GEN_INT (((mask >> 0) & 3) + i); + perm[i + 1] = GEN_INT (((mask >> 2) & 3) + i); + perm[i + 2] = GEN_INT (((mask >> 4) & 3) + i); + perm[i + 3] = GEN_INT (((mask >> 6) & 3) + i); + } + + operands[2] + = gen_rtx_PARALLEL (VOIDmode, gen_rtvec_v (, perm)); +}) + +;; This pattern needs to come before the avx2_perm*/avx512f_perm* +;; patterns, as they have the same RTL representation (vpermilp* +;; being a subset of what vpermp* can do), but vpermilp* has shorter +;; latency as it never crosses lanes. +(define_insn "*_vpermilp" + [(set (match_operand:VF 0 "register_operand" "=v") + (vec_select:VF + (match_operand:VF 1 "nonimmediate_operand" "vm") + (match_parallel 2 "" + [(match_operand 3 "const_int_operand")])))] + "TARGET_AVX && + && avx_vpermilp_parallel (operands[2], mode)" +{ + int mask = avx_vpermilp_parallel (operands[2], mode) - 1; + operands[2] = GEN_INT (mask); + return "vpermil\t{%2, %1, %0|%0, %1, %2}"; +} + [(set_attr "type" "sselog") + (set_attr "prefix_extra" "1") + (set_attr "length_immediate" "1") + (set_attr "prefix" "") + (set_attr "mode" "")]) + (define_expand "avx2_perm" [(match_operand:VI8F_256 0 "register_operand") (match_operand:VI8F_256 1 "nonimmediate_operand") @@ -20376,160 +20534,6 @@ (define_insn "avx512cd_maskw_vec_dup" - [(set (match_operand:VF_256 0 "register_operand" "=v,v,v") - (vec_select:VF_256 - (match_operand:VF_256 1 "nonimmediate_operand" "m,o,?v") - (match_parallel 2 "avx_vbroadcast_operand" - [(match_operand 3 "const_int_operand" "C,n,n")])))] - "TARGET_AVX" - "#" - "&& reload_completed && (mode != V4DFmode || !TARGET_AVX2)" - [(set (match_dup 0) (vec_duplicate:VF_256 (match_dup 1)))] -{ - rtx op0 = operands[0], op1 = operands[1]; - int elt = INTVAL (operands[3]); - - if (REG_P (op1)) - { - int mask; - - if (TARGET_AVX2 && elt == 0) - { - emit_insn (gen_vec_dup (op0, gen_lowpart (mode, - op1))); - DONE; - } - - /* Shuffle element we care about into all elements of the 128-bit lane. - The other lane gets shuffled too, but we don't care. */ - if (mode == V4DFmode) - mask = (elt & 1 ? 15 : 0); - else - mask = (elt & 3) * 0x55; - emit_insn (gen_avx_vpermil (op0, op1, GEN_INT (mask))); - - /* Shuffle the lane we care about into both lanes of the dest. */ - mask = (elt / ( / 2)) * 0x11; - if (EXT_REX_SSE_REG_P (op0)) - { - /* There is no EVEX VPERM2F128, but we can use either VBROADCASTSS - or VSHUFF128. */ - gcc_assert (mode == V8SFmode); - if ((mask & 1) == 0) - emit_insn (gen_avx2_vec_dupv8sf (op0, - gen_lowpart (V4SFmode, op0))); - else - emit_insn (gen_avx512vl_shuf_f32x4_1 (op0, op0, op0, - GEN_INT (4), GEN_INT (5), - GEN_INT (6), GEN_INT (7), - GEN_INT (12), GEN_INT (13), - GEN_INT (14), GEN_INT (15))); - DONE; - } - - emit_insn (gen_avx_vperm2f1283 (op0, op0, op0, GEN_INT (mask))); - DONE; - } - - operands[1] = adjust_address (op1, mode, - elt * GET_MODE_SIZE (mode)); -}) - -(define_expand "_vpermil" - [(set (match_operand:VF2 0 "register_operand") - (vec_select:VF2 - (match_operand:VF2 1 "nonimmediate_operand") - (match_operand:SI 2 "const_0_to_255_operand")))] - "TARGET_AVX && " -{ - int mask = INTVAL (operands[2]); - rtx perm[]; - - int i; - for (i = 0; i < ; i = i + 2) - { - perm[i] = GEN_INT (((mask >> i) & 1) + i); - perm[i + 1] = GEN_INT (((mask >> (i + 1)) & 1) + i); - } - - operands[2] - = gen_rtx_PARALLEL (VOIDmode, gen_rtvec_v (, perm)); -}) - -(define_expand "_vpermil" - [(set (match_operand:VF1 0 "register_operand") - (vec_select:VF1 - (match_operand:VF1 1 "nonimmediate_operand") - (match_operand:SI 2 "const_0_to_255_operand")))] - "TARGET_AVX && " -{ - int mask = INTVAL (operands[2]); - rtx perm[]; - - int i; - for (i = 0; i < ; i = i + 4) - { - perm[i] = GEN_INT (((mask >> 0) & 3) + i); - perm[i + 1] = GEN_INT (((mask >> 2) & 3) + i); - perm[i + 2] = GEN_INT (((mask >> 4) & 3) + i); - perm[i + 3] = GEN_INT (((mask >> 6) & 3) + i); - } - - operands[2] - = gen_rtx_PARALLEL (VOIDmode, gen_rtvec_v (, perm)); -}) - -(define_insn "*_vpermilp" - [(set (match_operand:VF 0 "register_operand" "=v") - (vec_select:VF - (match_operand:VF 1 "nonimmediate_operand" "vm") - (match_parallel 2 "" - [(match_operand 3 "const_int_operand")])))] - "TARGET_AVX && - && avx_vpermilp_parallel (operands[2], mode)" -{ - int mask = avx_vpermilp_parallel (operands[2], mode) - 1; - operands[2] = GEN_INT (mask); - return "vpermil\t{%2, %1, %0|%0, %1, %2}"; -} - [(set_attr "type" "sselog") - (set_attr "prefix_extra" "1") - (set_attr "length_immediate" "1") - (set_attr "prefix" "") - (set_attr "mode" "")]) - (define_insn "_vpermilvar3" [(set (match_operand:VF 0 "register_operand" "=v") (unspec:VF --- gcc/testsuite/gcc.target/i386/pr93395.c.jj 2020-01-23 19:33:06.649854297 +0100 +++ gcc/testsuite/gcc.target/i386/pr93395.c 2020-01-23 19:33:06.648854311 +0100 @@ -0,0 +1,44 @@ +/* PR target/93395 */ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx512f -masm=att" } */ +/* { dg-final { scan-assembler-times "vpermilpd\t.5, %ymm" 3 } } */ +/* { dg-final { scan-assembler-times "vpermilpd\t.85, %zmm" 3 } } */ +/* { dg-final { scan-assembler-not "vpermpd\t" } } */ + +#include + +__m256d +foo1 (__m256d a) +{ + return _mm256_permute4x64_pd (a, 177); +} + +__m256d +foo2 (__m256d a) +{ + return _mm256_permute_pd (a, 5); +} + +__m256d +foo3 (__m256d a) +{ + return __builtin_shuffle (a, (__v4di) { 1, 0, 3, 2 }); +} + +__m512d +foo4 (__m512d a) +{ + return _mm512_permutex_pd (a, 177); +} + +__m512d +foo5 (__m512d a) +{ + return _mm512_permute_pd (a, 85); +} + +__m512d +foo6 (__m512d a) +{ + return __builtin_shuffle (a, (__v8di) { 1, 0, 3, 2, 5, 4, 7, 6 }); +} --- gcc/testsuite/gcc.target/i386/avx512vl-vpermilpdi-1.c.jj 2020-01-12 11:54:37.929390537 +0100 +++ gcc/testsuite/gcc.target/i386/avx512vl-vpermilpdi-1.c 2020-01-23 19:35:46.068553312 +0100 @@ -1,7 +1,7 @@ /* { dg-do compile } */ /* { dg-options "-mavx512vl -O2" } */ -/* { dg-final { scan-assembler-times "vpermilpd\[ \\t\]+\[^\{\n\]*13\[^\n\]*%ymm\[0-9\]+\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)" 1 { xfail *-*-* } } } */ -/* { dg-final { scan-assembler-times "vpermilpd\[ \\t\]+\[^\{\n\]*13\[^\n\]*%ymm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 1 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times "vpermilpd\[ \\t\]+\[^\{\n\]*13\[^\n\]*%ymm\[0-9\]+\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)" 1 } } */ +/* { dg-final { scan-assembler-times "vpermilpd\[ \\t\]+\[^\{\n\]*13\[^\n\]*%ymm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpermilpd\[ \\t\]+\[^\{\n\]*3\[^\n\]*%xmm\[0-9\]+\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)" 1 } } */ /* { dg-final { scan-assembler-times "vpermilpd\[ \\t\]+\[^\{\n\]*3\[^\n\]*%xmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)" 1 } } */