From patchwork Sun Dec 17 09:20:09 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Markus Trippelsdorf
X-Patchwork-Id: 849597
Date: Sun, 17 Dec 2017 10:20:09 +0100
From: Markus Trippelsdorf
To: gcc-patches@gcc.gnu.org
Cc: Julia Koval, Uros Bizjak, Jan Hubicka
Subject: [PATCH][i386] Correct imul (r64) latency for modern Intel CPUs
Message-ID: <20171217092009.GA16559@x4>
Content-Disposition: inline

Since Nehalem, the latency of a 64-bit multiplication (imul r64) is
three cycles, not four. Update the costs to reflect that.

Tested on x86_64.

OK for trunk?
Thanks.

	* x86-tune-costs.h (skylake_cost, core_cost): Decrease r64 multiply
	latencies.

	* gcc.target/i386/wmul-3.c: New test.

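For illustration only (not part of the patch): a minimal sketch of why
one cycle of multiply cost can matter. It assumes a toy cost model in
which each shift/add step costs one instruction and the multiply is
preferred whenever it is no more expensive than the replacement
sequence; GCC's real heuristics are more involved. The helpers
mul100_imul, mul100_shift_add and prefer_multiply are hypothetical
names, and COSTS_N_INSNS is redefined locally to mirror GCC's scaling.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mirrors GCC's COSTS_N_INSNS scaling (N insns -> N * 4). */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper: x * 100 as a single multiply. */
static uint64_t mul100_imul(uint64_t x)
{
  return x * 100;
}

/* Hypothetical helper: the same product as three dependent shift/add
   steps, roughly lea (x + x*4), lea (t + t*4), shl 2. */
static uint64_t mul100_shift_add(uint64_t x)
{
  uint64_t t = x + (x << 2);  /* x * 5 */
  t = t + (t << 2);           /* x * 25 */
  return t << 2;              /* x * 100 */
}

/* Toy model, not GCC's algorithm: prefer the multiply when its cost
   does not exceed the cost of the replacement sequence. */
static int prefer_multiply(int mult_cost, int seq_cost)
{
  return mult_cost <= seq_cost;
}

/* Both forms must agree, including on wrap-around. */
static void self_check(void)
{
  assert(mul100_imul(12345) == mul100_shift_add(12345));
  assert(mul100_imul(UINT64_MAX) == mul100_shift_add(UINT64_MAX));
}
```

Under these assumptions, prefer_multiply (COSTS_N_INSNS (4),
COSTS_N_INSNS (3)) rejects the multiply, while the corrected cost of
COSTS_N_INSNS (3) ties with the three-step sequence and lets the
multiply through.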
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 648219338308..ddb47ba44056 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1538,8 +1538,8 @@ struct processor_costs skylake_cost = {
   {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
    COSTS_N_INSNS (4),			/*				 HI */
    COSTS_N_INSNS (3),			/*				 SI */
-   COSTS_N_INSNS (4),			/*				 DI */
-   COSTS_N_INSNS (4)},			/*			      other */
+   COSTS_N_INSNS (3),			/*				 DI */
+   COSTS_N_INSNS (3)},			/*			      other */
   0,					/* cost of multiply per each bit set */
   /* Expanding div/mod currently doesn't consider parallelism. So the cost
      model is not realistic. We compensate by increasing the latencies a bit.  */
@@ -2341,8 +2341,8 @@ struct processor_costs core_cost = {
   {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
    COSTS_N_INSNS (4),			/*				 HI */
    COSTS_N_INSNS (3),			/*				 SI */
-   COSTS_N_INSNS (4),			/*				 DI */
-   COSTS_N_INSNS (4)},			/*			      other */
+   COSTS_N_INSNS (3),			/*				 DI */
+   COSTS_N_INSNS (3)},			/*			      other */
   0,					/* cost of multiply per each bit set */
   /* Expanding div/mod currently doesn't consider parallelism. So the cost
      model is not realistic. We compensate by increasing the latencies a bit.  */
diff --git a/gcc/testsuite/gcc.target/i386/wmul-3.c b/gcc/testsuite/gcc.target/i386/wmul-3.c
new file mode 100644
index 000000000000..66c077c2cc0d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/wmul-3.c
@@ -0,0 +1,66 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=haswell" } */
+
+#include <stdint.h>
+#include <string.h>
+
+static const char b100_tab[200] = {
+  '0', '0', '0', '1', '0', '2', '0', '3', '0', '4',
+  '0', '5', '0', '6', '0', '7', '0', '8', '0', '9',
+  '1', '0', '1', '1', '1', '2', '1', '3', '1', '4',
+  '1', '5', '1', '6', '1', '7', '1', '8', '1', '9',
+  '2', '0', '2', '1', '2', '2', '2', '3', '2', '4',
+  '2', '5', '2', '6', '2', '7', '2', '8', '2', '9',
+  '3', '0', '3', '1', '3', '2', '3', '3', '3', '4',
+  '3', '5', '3', '6', '3', '7', '3', '8', '3', '9',
+  '4', '0', '4', '1', '4', '2', '4', '3', '4', '4',
+  '4', '5', '4', '6', '4', '7', '4', '8', '4', '9',
+  '5', '0', '5', '1', '5', '2', '5', '3', '5', '4',
+  '5', '5', '5', '6', '5', '7', '5', '8', '5', '9',
+  '6', '0', '6', '1', '6', '2', '6', '3', '6', '4',
+  '6', '5', '6', '6', '6', '7', '6', '8', '6', '9',
+  '7', '0', '7', '1', '7', '2', '7', '3', '7', '4',
+  '7', '5', '7', '6', '7', '7', '7', '8', '7', '9',
+  '8', '0', '8', '1', '8', '2', '8', '3', '8', '4',
+  '8', '5', '8', '6', '8', '7', '8', '8', '8', '9',
+  '9', '0', '9', '1', '9', '2', '9', '3', '9', '4',
+  '9', '5', '9', '6', '9', '7', '9', '8', '9', '9',
+};
+
+void uint64_to_ascii_ta7_32_base100(uint64_t val, char *dst) {
+  const int64_t POW10_10 = ((int64_t)10) * 1000 * 1000 * 1000;
+  const uint64_t POW2_57_DIV_POW100_4 =
+      ((int64_t)(1) << 57) / 100 / 100 / 100 / 100 + 1;
+  const uint64_t MASK32 = ((int64_t)(1) << 32) - 1;
+  int64_t hix = val / POW10_10;
+  int64_t lox = val % POW10_10;
+  int64_t lor = lox & (uint64_t)(-2);
+  uint64_t hi = hix * POW2_57_DIV_POW100_4;
+  uint64_t lo = lor * POW2_57_DIV_POW100_4;
+  memcpy(dst + 0 * 10 + 0, &b100_tab[(hi >> 57) * 2], 2);
+  memcpy(dst + 1 * 10 + 0, &b100_tab[(lo >> 57) * 2], 2);
+  hi = (hi >> 25) + 1;
+  lo = (lo >> 25) + 1;
+  hi = (hi & MASK32) * 100;
+  lo = (lo & MASK32) * 100;
+  memcpy(dst + 0 * 10 + 2, &b100_tab[(hi >> 32) * 2], 2);
+  hi = (hi & MASK32) * 100;
+  memcpy(dst + 1 * 10 + 2, &b100_tab[(lo >> 32) * 2], 2);
+  lo = (lo & MASK32) * 100;
+  memcpy(dst + 0 * 10 + 4, &b100_tab[(hi >> 32) * 2], 2);
+  hi = (hi & MASK32) * 100;
+  memcpy(dst + 1 * 10 + 4, &b100_tab[(lo >> 32) * 2], 2);
+  lo = (lo & MASK32) * 100;
+  memcpy(dst + 0 * 10 + 6, &b100_tab[(hi >> 32) * 2], 2);
+  hi = (hi & MASK32) * 100;
+  memcpy(dst + 1 * 10 + 6, &b100_tab[(lo >> 32) * 2], 2);
+  lo = (lo & MASK32) * 100;
+  hi >>= 32;
+  lo >>= 32;
+  lo = (lo & (-2)) | (lox & 1);
+  memcpy(dst + 0 * 10 + 8, &b100_tab[hi * 2], 2);
+  memcpy(dst + 1 * 10 + 8, &b100_tab[lo * 2], 2);
+  dst[2 * 10] = 0;
+}
+
+/* { dg-final { scan-assembler-times "imulq" 11 } } */