From patchwork Wed Nov 19 13:05:19 2014 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Maciej W. Rozycki" X-Patchwork-Id: 412379 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id EC24E140170 for ; Thu, 20 Nov 2014 00:05:45 +1100 (AEDT) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:cc:subject:message-id:mime-version:content-type; q=dns; s=default; b=IwkZq3qBzSOTJd+4ei5TI/uETsKJoZ3ackb8IrKd9cHIPBMh40 wm95GcrcfbNU3H8Qpoj0wD0tbeqx31D+z3Uuj/U3eXDdAD4NTO39qYIDFB7fJeq7 /xMWvEqjWouwOMwD7P0baP4m4Ru6VXZYJuwbg29VqR+ZKjXwQdERmi3Og= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:cc:subject:message-id:mime-version:content-type; s= default; bh=Midi9fPMtlY6TX76KwIYOhpjfmc=; b=wbceGSKVfYXRQORWOaZk NjkUc6cpk3qqPywE15mqV3O1nyoJ5kW4a+D1g7TlPDXNUnObs7JUsWkN+0+G1BWk QgjbBqHPUkROwevkgZoo2/UGo80YznV15Fiv1fUuOuM9ecYY77Ci4pxCyPS3cFET qIS2x0iwjEcU+s97SsLXazA= Received: (qmail 17999 invoked by alias); 19 Nov 2014 13:05:37 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 17984 invoked by uid 89); 19 Nov 2014 13:05:36 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.0 required=5.0 tests=AWL, BAYES_00, RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.2 X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 19 Nov 2014 13:05:29 +0000 Received: from nat-ies.mentorg.com ([192.94.31.2] helo=SVR-IES-FEM-01.mgc.mentorg.com) by relay1.mentorg.com with esmtp id 1Xr4wn-0001ed-ER from Maciej_Rozycki@mentor.com ; Wed, 19 Nov 2014 05:05:25 -0800 Received: from localhost (137.202.0.76) by SVR-IES-FEM-01.mgc.mentorg.com (137.202.0.104) with Microsoft SMTP Server (TLS) id 14.3.181.6; Wed, 19 Nov 2014 13:05:23 +0000 Date: Wed, 19 Nov 2014 13:05:19 +0000 From: "Maciej W. Rozycki" To: CC: Catherine Moore , Eric Christopher , Matthew Fortune Subject: [PATCH] MIPS16/GCC: Optimise `__call_stub_fp_' call/return stubs Message-ID: User-Agent: Alpine 1.10 (DEB 962 2008-03-14) MIME-Version: 1.0 Hi, It has come to my attention that we create suboptimal code for the `__call_stub_fp_' variant of the MIPS16 call stubs. These stubs are used for outgoing calls made from MIPS16 code to standard MIPS code that return floating-point results and may also pass floating-point arguments. This can be illustrated with this small program: $ cat foo.c double bar (double d); int main (int argc, char **argv) { return bar (argc); } $ mips-linux-gnu-gcc -O2 -mips16 -c foo.c $ mips-linux-gnu-objdump -dr foo.o foo.o: file format elf32-tradbigmips Disassembly of section .mips16.call.fp.bar: 00000000 <__call_stub_fp_bar>: 0: 44856000 mtc1 a1,$f12 4: 44846800 mtc1 a0,$f13 8: 03e09021 move s2,ra c: 0c000000 jal 0 <__call_stub_fp_bar> c: R_MIPS_26 bar 10: 00000000 nop 14: 44030000 mfc1 v1,$f0 18: 02400008 jr s2 1c: 44020800 mfc1 v0,$f1 Disassembly of section .text.startup: 00000000
: 0: f100 64c4 save 32,ra,s2 4: 1800 0000 jal 0
4: R_MIPS16_26 __mips16_floatsidf 8: 6500 nop a: 67a3 move a1,v1 c: 1800 0000 jal 0
c: R_MIPS16_26 bar 10: 6782 move a0,v0 12: 67a3 move a1,v1 14: 1800 0000 jal 0
14: R_MIPS16_26 __mips16_fix_truncdfsi 18: 6782 move a0,v0 1a: f100 6444 restore 32,ra,s2 1e: e8a0 jrc ra -- as you can see in `__call_stub_fp_bar' above the jump delay slot remains empty because the move to $s2 instruction cannot be scheduled there due to a data dependency on $ra. However there is no need to save $ra last and since the MIPS IV ISA there are no coprocessor move delay slots so the last MTC1 instruction could be scheduled there instead. So here is a change to avoid this empty delay slot. With this change in place we get this code instead: $ mips-linux-gnu-objdump -dr foo.o foo.o: file format elf32-tradbigmips Disassembly of section .mips16.call.fp.bar: 00000000 <__call_stub_fp_bar>: 0: 03e09021 move s2,ra 4: 44856000 mtc1 a1,$f12 8: 0c000000 jal 0 <__call_stub_fp_bar> 8: R_MIPS_26 bar c: 44846800 mtc1 a0,$f13 10: 44030000 mfc1 v1,$f0 14: 02400008 jr s2 18: 44020800 mfc1 v0,$f1 Disassembly of section .text.startup: 00000000
: 0: f100 64c4 save 32,ra,s2 4: 1800 0000 jal 0
4: R_MIPS16_26 __mips16_floatsidf 8: 6500 nop a: 67a3 move a1,v1 c: 1800 0000 jal 0
c: R_MIPS16_26 bar 10: 6782 move a0,v0 12: 67a3 move a1,v1 14: 1800 0000 jal 0
14: R_MIPS16_26 __mips16_fix_truncdfsi 18: 6782 move a0,v0 1a: f100 6444 restore 32,ra,s2 1e: e8a0 jrc ra -- as you can see the last MTC1 instruction has now been placed in the delay slot. For ISAs that do have a coprocessor move delay slot there is no gain, but no loss either, both delay slots remain empty due to data dependencies (coprocessor move delay slots): $ mips-linux-gnu-gcc -O2 -mips1 -mips16 -c foo.c $ mips-linux-gnu-objdump -dr foo.o foo.o: file format elf32-tradbigmips Disassembly of section .mips16.call.fp.bar: 00000000 <__call_stub_fp_bar>: 0: 03e09021 move s2,ra 4: 44856000 mtc1 a1,$f12 8: 44846800 mtc1 a0,$f13 c: 0c000000 jal 0 <__call_stub_fp_bar> c: R_MIPS_26 bar 10: 00000000 nop 14: 44030000 mfc1 v1,$f0 18: 44020800 mfc1 v0,$f1 1c: 02400008 jr s2 20: 00000000 nop Disassembly of section .text.startup: 00000000
: 0: 63fc addiu sp,-32 2: 6772 move v1,s2 4: 6207 sw ra,28(sp) 6: 1800 0000 jal 0
6: R_MIPS16_26 __mips16_floatsidf a: d306 sw v1,24(sp) c: 67a3 move a1,v1 e: 1800 0000 jal 0
e: R_MIPS16_26 bar 12: 6782 move a0,v0 14: 67a3 move a1,v1 16: 1800 0000 jal 0
16: R_MIPS16_26 __mips16_fix_truncdfsi 1a: 6782 move a0,v0 1c: 9606 lw a2,24(sp) 1e: 9707 lw a3,28(sp) 20: 6556 move s2,a2 22: ef00 jr a3 24: 6304 addiu sp,32 26: 6500 nop -- though I think this consideration is actually academic as no legacy MIPS16 processors have ever been made that had an FPU I believe (delay slots do not matter with software FPU emulation). The original MIPS16 implementation, the LSI's TinyRISC cores such as the TR4101 CPU did provide a COPx interface including possibly an FPU, but to the best of my knowledge none was actually made. See also the discussion around the previous iteration of these optimisations here: http://gcc.gnu.org/ml/gcc-patches/2013-07/msg01414.html This change has been regression-tested with the mips-linux-gnu target and these multilibs: -EB -EB -mips16 -EB -mmicromips -EB -mabi=n32 -EB -mabi=64 and the -EL variants of same, with no changes in results. FWIW there is no need to make a corresponding change to GDB to interpret the modified stub as the debugger ignores and skips over the move to $s2 in interpreting stub's code, the instruction can be anywhere. I have a second optimisation to make here too, but that triggers a surprising bug in GNU LD where BFD code meant to discard unused stubs appears not to work at all. So that'll have to be fixed first and it also means the other optimisation is unsafe to include in 5.0. I plan to post it shortly anyway for discussion, once I have the linker bug fixed. Meanwhile, OK to apply this change? 2014-11-19 Maciej W. Rozycki gcc/ * config/mips/mips.c (mips16_build_call_stub): Move the save of the return address in $18 ahead of passing arguments to FPRs. Maciej gcc-mips16-call-fp-stub-s2.patch Index: gcc-fsf-trunk-quilt/gcc/config/mips/mips.c =================================================================== --- gcc-fsf-trunk-quilt.orig/gcc/config/mips/mips.c 2014-11-18 13:28:13.508975765 +0000 +++ gcc-fsf-trunk-quilt/gcc/config/mips/mips.c 2014-11-18 23:33:01.418180571 +0000 @@ -6945,6 +6945,17 @@ mips16_build_call_stub (rtx retval, rtx /* "Save" $sp in itself so we don't use the fake CFA. This is: DW_CFA_val_expression r29, { DW_OP_reg29 }. */ fprintf (asm_out_file, "\t.cfi_escape 0x16,29,1,0x6d\n"); + + /* Save the return address in $18. The stub's caller knows + that $18 might be clobbered, even though $18 is usually + a call-saved register. + + Do it early on in case the last move to a floating-point + register can be scheduled into the delay slot of the + call we are about to make. */ + fprintf (asm_out_file, "\tmove\t%s,%s\n", + reg_names[GP_REG_FIRST + 18], + reg_names[RETURN_ADDR_REGNUM]); } else { @@ -6966,11 +6977,7 @@ mips16_build_call_stub (rtx retval, rtx if (fp_ret_p) { - /* Save the return address in $18 and call the non-MIPS16 function. - The stub's caller knows that $18 might be clobbered, even though - $18 is usually a call-saved register. */ - fprintf (asm_out_file, "\tmove\t%s,%s\n", - reg_names[GP_REG_FIRST + 18], reg_names[RETURN_ADDR_REGNUM]); + /* Now call the non-MIPS16 function. */ output_asm_insn (MIPS_CALL ("jal", &fn, 0, -1), &fn); fprintf (asm_out_file, "\t.cfi_register 31,18\n");