From patchwork Thu Nov 9 14:41:52 2017
X-Patchwork-Submitter: Kirill Batuzov
X-Patchwork-Id: 836385
From: Kirill Batuzov
To: qemu-devel@nongnu.org
Cc: Richard Henderson, Alex Bennée, Kirill Batuzov
Date: Thu, 9 Nov 2017 17:41:52 +0300
Message-Id: <20171109144155.17076-1-batuzovk@ispras.ru>
Subject: [Qemu-devel] [PATCH RFC 0/3] TCG: do copy propagation through memory locations

This patch series is based on native-vector-registers-3:

  git://github.com/rth7680/qemu.git native-vector-registers-3

The particular goal of this change is to keep the values of guest vector
registers in host vector registers across guest instructions.

The relation between memory locations and variables is many-to-many.
Variables can be copies of each other, so multiple variables can hold the
same value as the one stored in a memory location. Any variable can also be
stored to multiple memory locations. To represent all this, a data structure
that supports the following operations is needed.

(0) Allocate and deallocate memory locations. The exact number of possible
    memory locations is unknown, but the algorithm should not have to track
    too many of them simultaneously.

(1) Find a memory location with the specified offset, size and type among
    all memory locations. Needed to replace LOADs.

(2) For a memory location, find a variable containing the same value. Also
    needed to replace LOADs.

(3) Remove memory locations overlapping with a specified range of addresses.
    Needed to remove memory locations affected by STOREs.

(4) For a variable, find all memory locations containing the same value.
    When the value of the variable changes, these memory locations must no
    longer reference this variable.

In the proposed implementation all these cases are handled by multiple lists
of memory locations:

- A list of unused memory location descriptors.
- A list of all known memory locations.
- For every variable, a list of memory locations containing the same value.
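Operations (0)-(4) and the three lists above can be sketched roughly as
follows. This is only an illustration under simplifying assumptions, not the
code from the patches: the names MemLoc, MemState and mem_* are hypothetical,
variables are plain integer indices rather than TCGTemp pointers, and the
descriptor pool has a fixed size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOCS 16   /* assumed pool size */
#define MAX_VARS 8    /* assumed number of variables */

typedef struct MemLoc {
    long offset;               /* offset from the base (env) */
    size_t size;               /* size in bytes */
    int type;                  /* value type tag */
    int var;                   /* variable known to hold the same value */
    struct MemLoc *next;       /* link in the unused list or the "all" list */
    struct MemLoc *next_copy;  /* link in the owning variable's list */
} MemLoc;

typedef struct {
    MemLoc pool[MAX_LOCS];
    MemLoc *unused;            /* unused descriptors */
    MemLoc *all;               /* all known memory locations */
    MemLoc *by_var[MAX_VARS];  /* per-variable lists */
} MemState;

static void mem_init(MemState *s)
{
    s->unused = s->all = NULL;
    for (int v = 0; v < MAX_VARS; v++) {
        s->by_var[v] = NULL;
    }
    for (int i = 0; i < MAX_LOCS; i++) {
        s->pool[i].next = s->unused;
        s->unused = &s->pool[i];
    }
}

/* (0) Return a descriptor to the unused list, unlinking it everywhere. */
static void mem_release(MemState *s, MemLoc *m)
{
    MemLoc **p;
    for (p = &s->all; *p != m; p = &(*p)->next) { }
    *p = m->next;
    for (p = &s->by_var[m->var]; *p != m; p = &(*p)->next_copy) { }
    *p = m->next_copy;
    m->next = s->unused;
    s->unused = m;
}

/* (0) Record that `var` holds the value stored at the given location.
 * Returns false if the pool is exhausted; forgetting is always safe. */
static bool mem_add(MemState *s, int var, long off, size_t size, int type)
{
    MemLoc *m = s->unused;
    assert(var >= 0 && var < MAX_VARS);
    if (!m) {
        return false;
    }
    s->unused = m->next;
    m->offset = off; m->size = size; m->type = type; m->var = var;
    m->next = s->all;             s->all = m;
    m->next_copy = s->by_var[var]; s->by_var[var] = m;
    return true;
}

/* (1)+(2) Variable holding the value of an exact-match location, or -1. */
static int mem_find_var(const MemState *s, long off, size_t size, int type)
{
    for (const MemLoc *m = s->all; m; m = m->next) {
        if (m->offset == off && m->size == size && m->type == type) {
            return m->var;
        }
    }
    return -1;
}

/* (3) Forget every location overlapping [off, off + size). */
static void mem_kill_range(MemState *s, long off, size_t size)
{
    for (MemLoc *m = s->all, *next; m; m = next) {
        next = m->next;
        if (m->offset < off + (long)size && off < m->offset + (long)m->size) {
            mem_release(s, m);
        }
    }
}

/* (4) The variable changed: forget every location that referenced it. */
static void mem_kill_var(MemState *s, int var)
{
    while (s->by_var[var]) {
        mem_release(s, s->by_var[var]);
    }
}
```

In this sketch a STORE would call mem_kill_range() followed by mem_add(), a
LOAD would first try mem_find_var() and fall back to mem_add() on a miss, and
any other write to a variable would call mem_kill_var(). Note that
mem_kill_var() is the only user of the per-variable lists.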

The change was tested on the x264 video encoder compiled for ARM32 and run
using qemu-linux-user on an x86_64 host. Some loads were replaced by MOVs,
but no change in performance was observed. Unfortunately, the x264 video
encoder compiled for ARM64 crashed under qemu-linux-user.

On an artificial test case a nearly 3x speedup was observed (8s vs 22s).

IN:
0x00000000004005c0:  4ea18400      add v0.4s, v0.4s, v1.4s
0x00000000004005c4:  4ea18400      add v0.4s, v0.4s, v1.4s
0x00000000004005c8:  4ea18400      add v0.4s, v0.4s, v1.4s
0x00000000004005cc:  4ea18400      add v0.4s, v0.4s, v1.4s

OP:
 ---- 00000000004005c0 0000000000000000 0000000000000000
 ld_vec tmp7,env,$0x8a0,$0x1
 ld_vec tmp8,env,$0x8b0,$0x1
 add32_vec tmp9,tmp7,tmp8,$0x1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005c4 0000000000000000 0000000000000000
 ld_vec tmp7,env,$0x8a0,$0x1
 ld_vec tmp8,env,$0x8b0,$0x1
 add32_vec tmp9,tmp7,tmp8,$0x1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005c8 0000000000000000 0000000000000000
 ld_vec tmp7,env,$0x8a0,$0x1
 ld_vec tmp8,env,$0x8b0,$0x1
 add32_vec tmp9,tmp7,tmp8,$0x1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005cc 0000000000000000 0000000000000000
 ld_vec tmp7,env,$0x8a0,$0x1
 ld_vec tmp8,env,$0x8b0,$0x1
 add32_vec tmp9,tmp7,tmp8,$0x1
 st_vec tmp9,env,$0x8a0,$0x1

OP after optimization and liveness analysis:
 ---- 00000000004005c0 0000000000000000 0000000000000000
 ld_vec tmp7,env,$0x8a0,$0x1
 ld_vec tmp8,env,$0x8b0,$0x1
 add32_vec tmp9,tmp7,tmp8,$0x1                    dead: 1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005c4 0000000000000000 0000000000000000
 mov_vec tmp7,tmp9,$0x1                           dead: 1
 add32_vec tmp9,tmp7,tmp8,$0x1                    dead: 1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005c8 0000000000000000 0000000000000000
 mov_vec tmp7,tmp9,$0x1                           dead: 1
 add32_vec tmp9,tmp7,tmp8,$0x1                    dead: 1
 st_vec tmp9,env,$0x8a0,$0x1

 ---- 00000000004005cc 0000000000000000 0000000000000000
 mov_vec tmp7,tmp9,$0x1                           dead: 1
 add32_vec tmp9,tmp7,tmp8,$0x1                    dead: 1 2
 st_vec tmp9,env,$0x8a0,$0x1                      dead: 0 1

I'm not particularly happy about the current implementation.

- The data structure seems a bit too complicated for the task at hand.
  Maybe I'm doing something wrong?
- The current data structure is tightly coupled to struct tcg_temp_info and
  is part of the optimizer. A very similar data structure will be needed in
  liveness analysis to eliminate redundant STOREs.

Having SSA (or at least single assignment per basic block) would help a lot.
It would remove use case (4) completely, and with it the need for the lists
of memory locations for each variable, leaving only one list. Another result
would be that operations on TCGMemLocation would no longer need to modify
TCGTemp or tcg_temp_info structures, making TCGMemLocation reusable in
liveness analysis or register allocation. But we do not have SSA (yet?).

Any thoughts or comments?

Kirill Batuzov (3):
  tcg: support MOV_VEC and MOVI_VEC opcodes in register allocator
  tcg/optimize: do copy propagation for memory locations
  tcg/optimize: handle vector loads and stores during copy propagation

 tcg/optimize.c | 288 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.c      |   2 +
 2 files changed, 290 insertions(+)