From patchwork Thu Oct 27 15:09:09 2011
X-Patchwork-Submitter: Jan Hubicka
X-Patchwork-Id: 122164
Date: Thu, 27 Oct 2011 17:09:09 +0200
From: Jan Hubicka
To: Michael Zolotukhin
Cc: Jakub Jelinek, Jack Howarth, gcc-patches@gcc.gnu.org, Jan Hubicka,
    Richard Guenther, "H.J. Lu", izamyatin@gmail.com,
    areg.melikadamyan@gmail.com
Subject: Re: Use of vector instructions in memmov/memset expanding
Message-ID: <20111027150909.GA29087@kam.mff.cuni.cz>
References: <20110928133105.GA26045@bromo.med.uc.edu>
 <20110928215104.GA29339@bromo.med.uc.edu>
 <20110929112159.GT2687@tyan-ft48-01.lab.bos.redhat.com>

Hi,
sorry for the delay with the review.  This is my first pass through the
backend part; hopefully someone else will do the middle-end bits.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c53423..d7c4330 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},

I am a bit concerned about the explosion of variants, but adding aligned
variants probably makes sense.  I guess we should specify what alignment
needs to be known, i.e. is an alignment of 2 enough, or should the
alignment match the size of the loads/stores produced?
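To make the question concrete, this is roughly the check I would expect
when deciding whether the "known alignment" cost entry applies.  It is a
completely untested sketch and the predicate name is mine, not something
from the patch:

static bool
known_alignment_sufficient_p (unsigned int align_bytes,
                              unsigned int move_bytes)
{
  /* My inclination: an alignment of 2 is not enough for, say, 16-byte
     SSE loads/stores; the known alignment should cover the widest
     load/store the chosen algorithm will emit.  */
  return align_bytes >= move_bytes;
}

If that is the intent, it would be good to document it next to the new
table entries.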
@@ -15266,6 +15362,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+          && !TARGET_64BIT
+          && TARGET_SSE2
+          && MEM_P (op0)
+          && MEM_P (op1)
+          && !push_operand (op0, mode)
+          && can_create_pseudo_p ())
+        {
+          rtx temp = gen_reg_rtx (V2DImode);
+          emit_insn (gen_sse2_loadq (temp, op1));
+          emit_insn (gen_sse_storeq (op0, temp));
+          return;
+        }
+      if (mode == DImode
+          && !TARGET_64BIT
+          && TARGET_SSE
+          && !MEM_P (op1)
+          && GET_MODE (op1) == V2DImode)
+        {
+          emit_insn (gen_sse_storeq (op0, op1));
+          return;
+        }
+      if (mode == TImode
+          && TARGET_AVX2
+          && MEM_P (op0)
+          && !MEM_P (op1)
+          && GET_MODE (op1) == V4DImode)
+        {
+          op0 = convert_to_mode (V2DImode, op0, 1);
+          emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+          return;
+        }

This hunk seems dangerous in that emitting the explicit loadq/storeq
pairs (and similar) will prevent the use of integer registers for
64-bit/128-bit arithmetic.  I guess we could play such tricks for
memory-memory moves & constant stores; with GIMPLE optimizations we
already know pretty well that the moves will stay as they are.  Would
that be enough for you?  If you go this way, please make it a separate
patch so it can be benchmarked.  While the moves are faster, there are
always problems with size mismatches in the load/store buffers.

I think for string operations we should output the SSE/AVX instruction
variants by hand, plus we could try to instruct IRA to actually prefer
SSE/AVX registers when feasible.  This has traditionally been a problem
with the old RA, because it did not see that a pseudo which is
eventually used for arithmetic cannot go into an SSE register.  So it
was not possible to make MMX/SSE/AVX the preferred variants for
64-bit/128-bit manipulations without hurting the performance of code
that does arithmetic on long long and __int128.  Perhaps IRA can solve
this now.  Vladimir, do you have any idea?

+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse iter rtx from another loop and don't generate
+   code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+                                         rtx destptr, rtx srcptr, rtx value,
+                                         rtx count, rtx iter,
+                                         enum machine_mode mode, int unroll,
+                                         int expected_size, bool change_ptrs)

I wrote the original function, but it is not really clear to me what the
function does now, i.e. what the code for updating the addresses is and
what reusing iter means.  I guess reusing iter means that we won't start
the loop from 0.  Could you expand the comment a bit more?  I know I did
not document the parameters originally, but all of them ought to be
explicitly documented in a function comment (see the sketch below, after
the next hunk).

@@ -20913,7 +21065,27 @@ emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST. */
+/* Emit strset instuction. If RHS is constant, and vector mode will be used,
+   then move this consatnt to a vector register before emitting strset. */
+static void
+emit_strset (rtx destmem, rtx value,
+             rtx destptr, enum machine_mode mode, int offset)

This seems to belong more naturally in the gen_strset expander?
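Going back to expand_set_or_movmem_via_loop_with_iter, here is roughly
the kind of function comment I would like to see.  The wording is only my
guess from reading the patch (in particular what ITER and the return
value are for), so please fix it where I guessed wrong:

/* Expand a set or copy loop processing UNROLL * GET_MODE_SIZE (MODE)
   bytes per iteration.  DESTMEM and SRCMEM are the destination and
   source memory, DESTPTR and SRCPTR the corresponding address
   registers, VALUE the value to store (for memset), COUNT the number
   of bytes and EXPECTED_SIZE the expected size from profiling.  If
   ITER is non-NULL, reuse it as the induction variable instead of
   starting from zero; otherwise create a fresh counter.  If
   CHANGE_PTRS is false, do not emit code updating DESTPTR and SRCPTR
   past the processed region.  Return the induction variable so that a
   following loop can continue from it.  */
static rtx
expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
                                         rtx destptr, rtx srcptr, rtx value,
                                         rtx count, rtx iter,
                                         enum machine_mode mode, int unroll,
                                         int expected_size, bool change_ptrs);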
 {
-  if (TARGET_64BIT)
-    {
-      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-      emit_insn (gen_strset (destptr, dest, value));
-    }
-  else
-    {
-      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-      emit_insn (gen_strset (destptr, dest, value));
-      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-      emit_insn (gen_strset (destptr, dest, value));
-    }
-  offset += 8;
+  if (GET_MODE (destmem) != move_mode)
+    destmem = change_address (destmem, move_mode, destptr);

AFAIK change_address is not equivalent to adjust_automodify_address_nv:
it differs in how the memory aliasing attributes are copied, and they
need to be zapped here, since stringops behave funnily WRT aliasing.

   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
         {
-          dest = change_address (destmem, DImode, destptr);
-          emit_insn (gen_strset (destptr, dest, value));
-          emit_insn (gen_strset (destptr, dest, value));
+          destmem = change_address (destmem, DImode, destptr);
+          emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+                                                                value)));
+          emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+                                                                value)));

No use for 128-bit moves here?

         }
       else
         {
-          dest = change_address (destmem, SImode, destptr);
-          emit_insn (gen_strset (destptr, dest, value));
-          emit_insn (gen_strset (destptr, dest, value));
-          emit_insn (gen_strset (destptr, dest, value));
-          emit_insn (gen_strset (destptr, dest, value));
+          destmem = change_address (destmem, SImode, destptr);
+          emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+                                                                value)));
+          emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+                                                                value)));
+          emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+                                                                value)));
+          emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+                                                                value)));

And here?

@@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      srcmem = change_address (srcmem, QImode, srcptr);
-      destmem = change_address (destmem, QImode, destptr);
+      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
+      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);

You always want to use adjust_automodify_address or
adjust_automodify_address_nv; adjust_automodify_address_1 is not
intended for general use.

@@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+        {
+          dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+          src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+          emit_insn (gen_strmov (destreg, dst, srcreg, src));
+        }
+      else
+        {
+          dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+          src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+          emit_insn (gen_strmov (destreg, dst, srcreg, src));
+          emit_insn (gen_strmov (destreg, dst, srcreg, src));

Again, no use for vector moves?

+/* Target hook. Returns rtx of mode MODE with promoted value VAL, that is
+   supposed to represent one byte. MODE could be a vector mode.
+   Example:
+       1) VAL = const_int (0xAB), mode = SImode,
+          the result is const_int (0xABABABAB).

This can be handled in a machine-independent way, right?

+       2) if VAL isn't const, then the result will be the result of MUL-instruction
+          of VAL and const_int (0x01010101) (for SImode).
+ */

This would probably go better as a named expansion pattern, like we do
for other machine description interfaces.

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 90cef1c..4b7d67b 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.

New hooks should go into the machine-independent part of the patch.

Honza
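P.S. Concerning the promoted-value hook: the integer-mode cases at least
look doable in the middle end, roughly as below.  This is an untested
sketch written from memory (the helper name is mine and it only handles
scalar modes up to the width of HOST_WIDE_INT), just to illustrate why I
think generic code or a named expander would do:

static rtx
broadcast_byte_value (rtx val, enum machine_mode mode)
{
  unsigned HOST_WIDE_INT ones = 0;
  unsigned int i;

  /* Build the 0x0101...01 constant of the right width.  */
  for (i = 0; i < GET_MODE_SIZE (mode); i++)
    ones = (ones << 8) | 0x01;

  /* 1) VAL = 0xAB, MODE = SImode -> 0xABABABAB.  */
  if (CONST_INT_P (val))
    return gen_int_mode ((UINTVAL (val) & 0xff) * ones, mode);

  /* 2) Otherwise emit (zero-extended VAL) * 0x0101...01.  */
  return expand_simple_binop (mode, MULT,
                              convert_to_mode (mode, val, 1),
                              gen_int_mode (ones, mode),
                              NULL_RTX, 1, OPTAB_LIB_WIDEN);
}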