diff mbox series

znver3 tuning part 2

Message ID 20210317214524.GA55027@kam.mff.cuni.cz
State New
Headers show
Series znver3 tuning part 2 | expand

Commit Message

Jan Hubicka March 17, 2021, 9:45 p.m. UTC
Hi,
this patch enables gather on zen3 hardware.  For TSVC it gets used by 6
benchmarks with the following runtime improvements:

s4114: 1.424 -> 1.209  (84.9017%)
s4115: 2.021 -> 1.065  (52.6967%)
s4116: 1.549 -> 0.854  (55.1323%)
s4117: 1.386 -> 1.193  (86.075%)
vag: 2.741 -> 1.940  (70.7771%)

and one regression:

s4112: 1.115 -> 1.184  (106.188%)

In s4112 the internal loop is:

        for (int i = 0; i < LEN_1D; i++) {
            a[i] += b[ip[i]] * s;
        }

(so a standard accumulate and add with indirect addressing)

  40a400:       c5 fe 6f 24 03          vmovdqu (%rbx,%rax,1),%ymm4
  40a405:       c5 fc 28 da             vmovaps %ymm2,%ymm3
  40a409:       48 83 c0 20             add    $0x20,%rax
  40a40d:       c4 e2 65 92 04 a5 00    vgatherdps %ymm3,0x594100(,%ymm4,4),%ymm0
  40a414:       41 59 00 
  40a417:       c4 e2 75 a8 80 e0 34    vfmadd213ps 0x5b34e0(%rax),%ymm1,%ymm0
  40a41e:       5b 00 
  40a420:       c5 fc 29 80 e0 34 5b    vmovaps %ymm0,0x5b34e0(%rax)
  40a427:       00 
  40a428:       48 3d 00 f4 01 00       cmp    $0x1f400,%rax
  40a42e:       75 d0                   jne    40a400 <s4112+0x60>

compared to:

  40a280:       49 63 14 04             movslq (%r12,%rax,1),%rdx
  40a284:       48 83 c0 04             add    $0x4,%rax
  40a288:       c5 fa 10 04 95 00 41    vmovss 0x594100(,%rdx,4),%xmm0
  40a28f:       59 00 
  40a291:       c4 e2 71 a9 80 fc 34    vfmadd213ss 0x5b34fc(%rax),%xmm1,%xmm0
  40a298:       5b 00 
  40a29a:       c5 fa 11 80 fc 34 5b    vmovss %xmm0,0x5b34fc(%rax)
  40a2a1:       00 
  40a2a2:       48 3d 00 f4 01 00       cmp    $0x1f400,%rax
  40a2a8:       75 d6                   jne    40a280 <s4112+0x40>

Looking at instruction latencies

 - fmadd is 4 cycles
 - vgatherdps is 39

So vgather itself takes 4.8 cycles per iteration, and the CPU is probably able
to execute the rest out of order, getting close to 4 cycles per iteration (it
can do 2 loads in parallel, one store, and the rest fits easily into the
execution resources). That would explain the 20% slowdown.

gimple internal loop is:
  _2 = a[i_38];
  _3 = (long unsigned int) i_38;
  _4 = _3 * 4;
  _5 = ip_18 + _4;
  _6 = *_5;
  _7 = b[_6];
  _8 = _7 * s_19;
  _9 = _2 + _8;
  a[i_38] = _9;
  i_28 = i_38 + 1;
  ivtmp_52 = ivtmp_53 - 1;
  if (ivtmp_52 != 0)
    goto <bb 8>; [98.99%]
  else
    goto <bb 4>; [1.01%]

0x25bac30 a[i_38] 1 times scalar_load costs 12 in body
0x25bac30 *_5 1 times scalar_load costs 12 in body
0x25bac30 b[_6] 1 times scalar_load costs 12 in body
0x25bac30 _7 * s_19 1 times scalar_stmt costs 12 in body
0x25bac30 _2 + _8 1 times scalar_stmt costs 12 in body
0x25bac30 _9 1 times scalar_store costs 16 in body

so 19 cycles estimate of scalar load

0x2668630 a[i_38] 1 times vector_load costs 12 in body
0x2668630 *_5 1 times unaligned_load (misalign -1) costs 12 in body
0x2668630 b[_6] 8 times scalar_load costs 96 in body
0x2668630 _7 * s_19 1 times scalar_to_vec costs 4 in prologue
0x2668630 _7 * s_19 1 times vector_stmt costs 12 in body
0x2668630 _2 + _8 1 times vector_stmt costs 12 in body
0x2668630 _9 1 times vector_store costs 16 in body

so 40 cycles per 8x vectorized body

tsvc.c:3450:27: note:  operating only on full vectors.
tsvc.c:3450:27: note:  Cost model analysis:
  Vector inside of loop cost: 160
  Vector prologue cost: 4
  Vector epilogue cost: 0
  Scalar iteration cost: 76
  Scalar outside cost: 0
  Vector outside cost: 4
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1

I think this generally suffers from GIGO principle.
One problem seems to be that we do not know about fmadd yet and compute it as
two instructions (6 cycles instead of 4). A more important problem is that we
do not account for the parallelism at all.  I do not see how to disable the
vectorization here without bumping gather costs noticeably off reality, and thus
we probably can try to experiment with this if more similar problems are found.

Icc is also using gather in s1115 and s128.
For s1115 the vectorization does not seem to help and s128 gets slower.

Neither clang nor aocc uses gathers.

Honza

	* x86-tune-costs.h (znver3_cost): Update costs of gather to match reality.
	* x86-tune.def (X86_TUNE_USE_GATHER): Enable for znver3.

Comments

Richard Biener March 18, 2021, 12:34 p.m. UTC | #1
On Wed, Mar 17, 2021 at 10:46 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hi,
> this patch enables gather on zen3 hardware.  For TSVC it gets used by 6
> benchmarks with the following runtime improvements:
>
> s4114: 1.424 -> 1.209  (84.9017%)
> s4115: 2.021 -> 1.065  (52.6967%)
> s4116: 1.549 -> 0.854  (55.1323%)
> s4117: 1.386 -> 1.193  (86.075%)
> vag: 2.741 -> 1.940  (70.7771%)
>
> and one regression:
>
> s4112: 1.115 -> 1.184  (106.188%)
>
> In s4112 the internal loop is:
>
>         for (int i = 0; i < LEN_1D; i++) {
>             a[i] += b[ip[i]] * s;
>         }
>
> (so a standard accumulate and add with indirect addressing)
>
>   40a400:       c5 fe 6f 24 03          vmovdqu (%rbx,%rax,1),%ymm4
>   40a405:       c5 fc 28 da             vmovaps %ymm2,%ymm3
>   40a409:       48 83 c0 20             add    $0x20,%rax
>   40a40d:       c4 e2 65 92 04 a5 00    vgatherdps %ymm3,0x594100(,%ymm4,4),%ymm0
>   40a414:       41 59 00
>   40a417:       c4 e2 75 a8 80 e0 34    vfmadd213ps 0x5b34e0(%rax),%ymm1,%ymm0
>   40a41e:       5b 00
>   40a420:       c5 fc 29 80 e0 34 5b    vmovaps %ymm0,0x5b34e0(%rax)
>   40a427:       00
>   40a428:       48 3d 00 f4 01 00       cmp    $0x1f400,%rax
>   40a42e:       75 d0                   jne    40a400 <s4112+0x60>
>
> compared to:
>
>   40a280:       49 63 14 04             movslq (%r12,%rax,1),%rdx
>   40a284:       48 83 c0 04             add    $0x4,%rax
>   40a288:       c5 fa 10 04 95 00 41    vmovss 0x594100(,%rdx,4),%xmm0
>   40a28f:       59 00
>   40a291:       c4 e2 71 a9 80 fc 34    vfmadd213ss 0x5b34fc(%rax),%xmm1,%xmm0
>   40a298:       5b 00
>   40a29a:       c5 fa 11 80 fc 34 5b    vmovss %xmm0,0x5b34fc(%rax)
>   40a2a1:       00
>   40a2a2:       48 3d 00 f4 01 00       cmp    $0x1f400,%rax
>   40a2a8:       75 d6                   jne    40a280 <s4112+0x40>
>
> Looking at instruction latencies
>
>  - fmadd is 4 cycles
>  - vgatherdps is 39
>
> So vgather itself takes 4.8 cycles per iteration, and the CPU is probably able
> to execute the rest out of order, getting close to 4 cycles per iteration (it
> can do 2 loads in parallel, one store, and the rest fits easily into the
> execution resources). That would explain the 20% slowdown.
>
> gimple internal loop is:
>   _2 = a[i_38];
>   _3 = (long unsigned int) i_38;
>   _4 = _3 * 4;
>   _5 = ip_18 + _4;
>   _6 = *_5;
>   _7 = b[_6];
>   _8 = _7 * s_19;
>   _9 = _2 + _8;
>   a[i_38] = _9;
>   i_28 = i_38 + 1;
>   ivtmp_52 = ivtmp_53 - 1;
>   if (ivtmp_52 != 0)
>     goto <bb 8>; [98.99%]
>   else
>     goto <bb 4>; [1.01%]
>
> 0x25bac30 a[i_38] 1 times scalar_load costs 12 in body
> 0x25bac30 *_5 1 times scalar_load costs 12 in body
> 0x25bac30 b[_6] 1 times scalar_load costs 12 in body
> 0x25bac30 _7 * s_19 1 times scalar_stmt costs 12 in body
> 0x25bac30 _2 + _8 1 times scalar_stmt costs 12 in body
> 0x25bac30 _9 1 times scalar_store costs 16 in body
>
> so 19 cycles estimate of scalar load
>
> 0x2668630 a[i_38] 1 times vector_load costs 12 in body
> 0x2668630 *_5 1 times unaligned_load (misalign -1) costs 12 in body
> 0x2668630 b[_6] 8 times scalar_load costs 96 in body
> 0x2668630 _7 * s_19 1 times scalar_to_vec costs 4 in prologue
> 0x2668630 _7 * s_19 1 times vector_stmt costs 12 in body
> 0x2668630 _2 + _8 1 times vector_stmt costs 12 in body
> 0x2668630 _9 1 times vector_store costs 16 in body
>
> so 40 cycles per 8x vectorized body
>
> tsvc.c:3450:27: note:  operating only on full vectors.
> tsvc.c:3450:27: note:  Cost model analysis:
>   Vector inside of loop cost: 160
>   Vector prologue cost: 4
>   Vector epilogue cost: 0
>   Scalar iteration cost: 76
>   Scalar outside cost: 0
>   Vector outside cost: 4
>   prologue iterations: 0
>   epilogue iterations: 0
>   Calculated minimum iters for profitability: 1
>
> I think this generally suffers from GIGO principle.
> One problem seems to be that we do not know about fmadd yet and compute it as
> two instructions (6 cycles instead of 4). A more important problem is that we
> do not account for the parallelism at all.  I do not see how to disable the
> vectorization here without bumping gather costs noticeably off reality, and thus
> we probably can try to experiment with this if more similar problems are found.

Yep.  Vectorizer costing is really hard w/o modeling the CPU pipeline more
accurately.  Esp. for the scalar side of the code where modern CPUs often
can effectively do two-lane "vectorization" by executing two lanes in parallel.
At the moment we simply assume a single-issue pipeline.  But doing better
requires tracking dependences but the current vectorizer costing API does
not expose dependencies to the target so even rough estimates are hard to
come by (like assuming an issue width of two).  My current plan is not to
revisit this as long as we have both SLP and non-SLP data structures.

> Icc is also using gather in s1115 and s128.
> For s1115 the vectorization does not seem to help and s128 gets slower.
>
> Neither clang nor aocc uses gathers.
>
> Honza
>
>         * x86-tune-costs.h (znver3_cost): Update costs of gather to match reality.
>         * x86-tune.def (X86_TUNE_USE_GATHER): Enable for znver3.
>
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index e655e668c7a..db03738313e 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1767,11 +1767,11 @@ struct processor_costs znver3_cost = {
>    2, 2, 3,                             /* cost of moving XMM,YMM,ZMM
>                                            register.  */
>    6,                                   /* cost of moving SSE register to integer.  */
> -  /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
> -     throughput 12.  Approx 9 uops do not depend on vector size and every load
> -     is 7 uops.  */
> -  18, 8,                               /* Gather load static, per_elt.  */
> -  18, 10,                              /* Gather store static, per_elt.  */
> +  /* VGATHERDPD is 15 uops and throughput is 4, VGATHERDPS is 23 uops,
> +     throughput 9.  Approx 7 uops do not depend on vector size and every load
> +     is 4 uops.  */
> +  14, 8,                               /* Gather load static, per_elt.  */
> +  14, 10,                              /* Gather store static, per_elt.  */
>    32,                                  /* size of l1 cache.  */
>    512,                                 /* size of l2 cache.  */
>    64,                                  /* size of prefetch block.  */
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 140ccb3d921..caebf76736e 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -436,7 +436,7 @@ DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
>
>  /* X86_TUNE_USE_GATHER: Use gather instructions.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather",
> -         ~(m_ZNVER | m_GENERIC))
> +         ~(m_ZNVER1 | m_ZNVER2 | m_GENERIC))
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
>     smaller FMA chain.  */
diff mbox series

Patch

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index e655e668c7a..db03738313e 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1767,11 +1767,11 @@  struct processor_costs znver3_cost = {
   2, 2, 3,				/* cost of moving XMM,YMM,ZMM
 					   register.  */
   6,					/* cost of moving SSE register to integer.  */
-  /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
-     throughput 12.  Approx 9 uops do not depend on vector size and every load
-     is 7 uops.  */
-  18, 8,				/* Gather load static, per_elt.  */
-  18, 10,				/* Gather store static, per_elt.  */
+  /* VGATHERDPD is 15 uops and throughput is 4, VGATHERDPS is 23 uops,
+     throughput 9.  Approx 7 uops do not depend on vector size and every load
+     is 4 uops.  */
+  14, 8,				/* Gather load static, per_elt.  */
+  14, 10,				/* Gather store static, per_elt.  */
   32,					/* size of l1 cache.  */
   512,					/* size of l2 cache.  */
   64,					/* size of prefetch block.  */
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 140ccb3d921..caebf76736e 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -436,7 +436,7 @@  DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
 
 /* X86_TUNE_USE_GATHER: Use gather instructions.  */
 DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather",
-	  ~(m_ZNVER | m_GENERIC))
+	  ~(m_ZNVER1 | m_ZNVER2 | m_GENERIC))
 
 /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
    smaller FMA chain.  */