From patchwork Mon Nov 14 14:25:28 2016
X-Patchwork-Submitter: Kyrill Tkachov
X-Patchwork-Id: 694544
Message-ID: <5829C958.2010601@foss.arm.com>
Date: Mon, 14 Nov 2016 14:25:28 +0000
From: Kyrill Tkachov
To: Segher Boessenkool, Andrew Pinski
CC: GCC Patches, Marcus Shawcroft, Richard Earnshaw, James Greenhalgh
Subject: Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
References: <5824836B.5030302@foss.arm.com> <20161110233943.GC17570@gate.crashing.org> <58259AD6.4040203@foss.arm.com> <5825E454.6070302@foss.arm.com>
In-Reply-To: <5825E454.6070302@foss.arm.com>

On 11/11/16 15:31, Kyrill Tkachov wrote:
>
> On 11/11/16 10:17, Kyrill Tkachov wrote:
>>
>> On 10/11/16 23:39, Segher Boessenkool wrote:
>>> On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
>>>> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov wrote:
>>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
>>>>> some interesting swings.
>>>>> 458.sjeng     +1.45%
>>>>> 471.omnetpp   +2.19%
>>>>> 445.gobmk     -2.01%
>>>>>
>>>>> On SPECFP:
>>>>> 453.povray    +7.00%
>>>>
>>>> Wow, this looks really good.  Thank you for implementing this.  If I
>>>> get some time I am going to try it out on other processors than A72
>>>> but I doubt I have time any time soon.
>>>
>>> I'd love to hear what causes the slowdown for gobmk as well, btw.
>>
>> I haven't yet gotten a direct answer for that (through performance analysis tools)
>> but I have noticed that load/store pairs are not generated as aggressively as I hoped.
>> They are being merged by the sched fusion pass and peepholes (which runs after this)
>> but it still misses cases. I've hacked the SWS hooks to generate pairs explicitly and that
>> increases the number of pairs and helps code size to boot. It complicates the logic of
>> the hooks a bit but not too much.
>>
>> I'll make those changes and re-benchmark, hopefully that
>> will help performance.
>>
>
> And here's a version that explicitly emits pairs. I've looked at assembly codegen on SPEC2006
> and it generates quite a few more LDP/STP pairs than the original version.
> I kicked off benchmarks over the weekend to see the effect.
> Andrew, if you want to try it out (more benchmarking and testing always welcome) this is the
> one to try.
>

And I discovered over the weekend that gamess and wrf have validation errors.
This version runs correctly.
SPECINT results were fine though and there is even a small overall gain due to
sjeng and omnetpp. However, gobmk still has the regression.
I'll rerun SPECFP with this patch (it's really just a small bugfix over the
previous version) and get on with analysing gobmk.

Thanks,
Kyrill

2016-11-11  Kyrylo Tkachov

    * config/aarch64/aarch64.h (machine_function): Add
    reg_is_wrapped_separately field.
    * config/aarch64/aarch64.c (emit_set_insn): Change return type to
    rtx_insn *.
    (aarch64_save_callee_saves): Don't save registers that are wrapped
    separately.
    (aarch64_restore_callee_saves): Don't restore registers that are
    wrapped separately.
    (offset_9bit_signed_unscaled_p, offset_12bit_unsigned_scaled_p,
    aarch64_offset_7bit_signed_scaled_p): Move earlier in the file.
    (aarch64_get_separate_components): New function.
    (aarch64_get_next_set_bit): Likewise.
    (aarch64_components_for_bb): Likewise.
    (aarch64_disqualify_components): Likewise.
    (aarch64_emit_prologue_components): Likewise.
    (aarch64_emit_epilogue_components): Likewise.
    (aarch64_set_handled_components): Likewise.
    (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
    TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
    TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
    TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
    TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
    TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.

commit 06ac3c30d8aa38781ee9019e60a5fcf727b85231
Author: Kyrylo Tkachov
Date:   Tue Oct 11 09:25:54 2016 +0100

    [AArch64] Separate shrink wrapping hooks implementation

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 325e725..2d33ef6 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1138,7 +1138,7 @@ aarch64_is_extend_from_extract (machine_mode mode, rtx mult_imm,
 
 /* Emit an insn that's a simple single-set.  Both the operands must be
    known to be valid.  */
-inline static rtx
+inline static rtx_insn *
 emit_set_insn (rtx x, rtx y)
 {
   return emit_insn (gen_rtx_SET (x, y));
@@ -3135,6 +3135,9 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
	      || regno == cfun->machine->frame.wb_candidate2))
	continue;
 
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+	continue;
+
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
@@ -3143,6 +3146,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
	      == cfun->machine->frame.reg_offset[regno2]))
 
@@ -3191,6 +3195,9 @@ aarch64_restore_callee_saves (machine_mode mode,
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
     {
+      if (cfun->machine->reg_is_wrapped_separately[regno])
+	continue;
+
       rtx reg, mem;
 
       if (skip_wb
@@ -3205,6 +3212,7 @@ aarch64_restore_callee_saves (machine_mode mode,
       regno2 = aarch64_next_callee_save (regno + 1, limit);
 
       if (regno2 <= limit
+	  && !cfun->machine->reg_is_wrapped_separately[regno2]
	  && ((cfun->machine->frame.reg_offset[regno] + UNITS_PER_WORD)
	      == cfun->machine->frame.reg_offset[regno2]))
	{
@@ -3224,6 +3232,272 @@ aarch64_restore_callee_saves (machine_mode mode,
     }
 }
 
+static inline bool
+offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
+			       HOST_WIDE_INT offset)
+{
+  return offset >= -256 && offset < 256;
+}
+
+static inline bool
+offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= 0
+	  && offset < 4096 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+bool
+aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
+{
+  return (offset >= -64 * GET_MODE_SIZE (mode)
+	  && offset < 64 * GET_MODE_SIZE (mode)
+	  && offset % GET_MODE_SIZE (mode) == 0);
+}
+
+/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
+
+static sbitmap
+aarch64_get_separate_components (void)
+{
+  aarch64_layout_frame ();
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* The registers we need saved to the frame.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (aarch64_register_saved_on_entry (regno))
+      {
+	HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+	if (!frame_pointer_needed)
+	  offset += cfun->machine->frame.frame_size
+		    - cfun->machine->frame.hard_fp_offset;
+	/* Check that we can access the stack slot of the register with one
+	   direct load with no adjustments needed.  */
+	if (offset_12bit_unsigned_scaled_p (DImode, offset))
+	  bitmap_set_bit (components, regno);
+      }
+
+  /* Don't mess with the hard frame pointer.  */
+  if (frame_pointer_needed)
+    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
+
+  unsigned reg1 = cfun->machine->frame.wb_candidate1;
+  unsigned reg2 = cfun->machine->frame.wb_candidate2;
+  /* If aarch64_layout_frame has chosen registers to store/restore with
+     writeback don't interfere with them to avoid having to output explicit
+     stack adjustment instructions.  */
+  if (reg2 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg2);
+  if (reg1 != INVALID_REGNUM)
+    bitmap_clear_bit (components, reg1);
+
+  bitmap_clear_bit (components, LR_REGNUM);
+  bitmap_clear_bit (components, SP_REGNUM);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+
+static sbitmap
+aarch64_components_for_bb (basic_block bb)
+{
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if ((!call_used_regs[regno])
+	&& (bitmap_bit_p (in, regno)
+	    || bitmap_bit_p (gen, regno)
+	    || bitmap_bit_p (kill, regno)))
+      bitmap_set_bit (components, regno);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
+   Nothing to do for aarch64.  */
+
+static void
+aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
+{
+}
+
+/* Return the next set bit in BMP from START onwards.  Return the total number
+   of bits in BMP if no set bit is found at or after START.  */
+
+static unsigned int
+aarch64_get_next_set_bit (sbitmap bmp, unsigned int start)
+{
+  unsigned int nbits = SBITMAP_SIZE (bmp);
+  if (start == nbits)
+    return start;
+
+  gcc_assert (start < nbits);
+  for (unsigned int i = start; i < nbits; i++)
+    if (bitmap_bit_p (bmp, i))
+      return i;
+
+  return nbits;
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_prologue_components (sbitmap components)
+{
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+
+  unsigned total_bits = SBITMAP_SIZE (components);
+  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
+  rtx_insn *insn = NULL;
+
+  while (regno != total_bits)
+    {
+      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
+      rtx reg = gen_rtx_REG (mode, regno);
+      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+      if (!frame_pointer_needed)
+	offset += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr = plus_constant (Pmode, ptr_reg, offset);
+      rtx mem = gen_frame_mem (mode, addr);
+
+      rtx set = gen_rtx_SET (mem, reg);
+      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
+      /* No more registers to save after REGNO.
+	 Emit a single save and exit.  */
+      if (regno2 == total_bits)
+	{
+	  insn = emit_insn (set);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+	  break;
+	}
+
+      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
+      /* The next register is not of the same class or its offset is not
+	 mergeable with the current one into a pair.  */
+      if (!satisfies_constraint_Ump (mem)
+	  || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
+	  || (offset2 - cfun->machine->frame.reg_offset[regno])
+	     != GET_MODE_SIZE (DImode))
+	{
+	  insn = emit_insn (set);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+
+	  regno = regno2;
+	  continue;
+	}
+
+      /* REGNO2 can be stored in a pair with REGNO.  */
+      rtx reg2 = gen_rtx_REG (mode, regno2);
+      if (!frame_pointer_needed)
+	offset2 += cfun->machine->frame.frame_size
+		   - cfun->machine->frame.hard_fp_offset;
+      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
+      rtx mem2 = gen_frame_mem (mode, addr2);
+      rtx set2 = gen_rtx_SET (mem2, reg2);
+
+      insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_OFFSET, set);
+      add_reg_note (insn, REG_CFA_OFFSET, set2);
+
+      regno = aarch64_get_next_set_bit (components, regno2 + 1);
+    }
+
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_epilogue_components (sbitmap components)
+{
+
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+  unsigned total_bits = SBITMAP_SIZE (components);
+  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
+  rtx_insn *insn = NULL;
+
+  while (regno != total_bits)
+    {
+      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
+      rtx reg = gen_rtx_REG (mode, regno);
+      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+      if (!frame_pointer_needed)
+	offset += cfun->machine->frame.frame_size
+		  - cfun->machine->frame.hard_fp_offset;
+      rtx addr = plus_constant (Pmode, ptr_reg, offset);
+      rtx mem = gen_frame_mem (mode, addr);
+
+      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
+      /* No more registers after REGNO to restore.
+	 Emit a single restore and exit.  */
+      if (regno2 == total_bits)
+	{
+	  insn = emit_move_insn (reg, mem);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_RESTORE, reg);
+	  break;
+	}
+
+      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
+      /* The next register is not of the same class or its offset is not
+	 mergeable with the current one into a pair or the offset doesn't fit
+	 for a load pair.  Emit a single restore and continue from REGNO2.  */
+      if (!satisfies_constraint_Ump (mem)
+	  || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
+	  || (offset2 - cfun->machine->frame.reg_offset[regno])
+	     != GET_MODE_SIZE (DImode))
+	{
+	  insn = emit_move_insn (reg, mem);
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_RESTORE, reg);
+
+	  regno = regno2;
+	  continue;
+	}
+
+      /* REGNO2 can be loaded in a pair with REGNO.  */
+      rtx reg2 = gen_rtx_REG (mode, regno2);
+      if (!frame_pointer_needed)
+	offset2 += cfun->machine->frame.frame_size
+		   - cfun->machine->frame.hard_fp_offset;
+      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
+      rtx mem2 = gen_frame_mem (mode, addr2);
+
+      insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_RESTORE, reg);
+      add_reg_note (insn, REG_CFA_RESTORE, reg2);
+
+      regno = aarch64_get_next_set_bit (components, regno2 + 1);
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
+
+static void
+aarch64_set_handled_components (sbitmap components)
+{
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      cfun->machine->reg_is_wrapped_separately[regno] = true;
+}
+
 /* AArch64 stack frames generated by this compiler look like:
 
	+-------------------------------+
@@ -3944,29 +4218,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
   return false;
 }
 
-bool
-aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= -64 * GET_MODE_SIZE (mode)
-	  && offset < 64 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
-static inline bool
-offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
-			       HOST_WIDE_INT offset)
-{
-  return offset >= -256 && offset < 256;
-}
-
-static inline bool
-offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= 0
-	  && offset < 4096 * GET_MODE_SIZE (mode)
-	  && offset % GET_MODE_SIZE (mode) == 0);
-}
-
 /* Return true if MODE is one of the modes for which we
    support LDP/STP operations.  */
 
@@ -14452,6 +14703,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
   aarch64_first_cycle_multipass_dfa_lookahead_guard
 
+#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
+  aarch64_get_separate_components
+
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
+  aarch64_components_for_bb
+
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
+  aarch64_disqualify_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
+  aarch64_emit_prologue_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
+  aarch64_emit_epilogue_components
+
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
+  aarch64_set_handled_components
+
 #undef TARGET_TRAMPOLINE_INIT
 #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
 
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 584ff5c..fb89e5a 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
 typedef struct GTY (()) machine_function
 {
   struct aarch64_frame frame;
+  /* One entry for each GPR and FP register.  */
+  bool reg_is_wrapped_separately[V31_REGNUM + 1];
 } machine_function;
 #endif