diff mbox

Use of vector instructions in memmov/memset expanding

Message ID 20111027150909.GA29087@kam.mff.cuni.cz
State New
Headers show

Commit Message

Jan Hubicka Oct. 27, 2011, 3:09 p.m. UTC
Hi,
sorry for delay with the review. This is my first pass through the backend part, hopefully
someone else will do the middle end bits.

Comments

Michael Zolotukhin Oct. 28, 2011, 12:41 p.m. UTC | #1
Hi Jan!
Thanks for the review, you could find my answers to some of your
remarks below. I'll send a corrected patch soon with answers to the
rest of your remarks.

> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am bit concerned about explossion of variants, but adding aligned variants probably makes
> sense.  I guess we should specify what alignment needs to be known. I.e. is alignment of 2 enough
> or shall the alignment be matching the size of loads/stores produced?

Yes, alignment should match the size of loads/stores as well as offset
from alignment boundary should be known. In other case, strategies for
unknown alignment would be chosen.


> This hunk seems dangerous in a way that by emitting the explicit loadq/storeq pairs (and similar) will
> prevent use of integer registers for 64bit/128bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores. With gimple optimizations
> we already know pretty well that the moves will stay as they are.  That might be enough for you?

Yes, theoretically it could harm 64/128-bit arithmetic, but actually
what could we do if we have DImode, mem-to-mem move and our mode is
32-bit? Ideally, RA should be able to make desicions on how to perform
such moves, but currently it doesn't generate SSE-moves - when it'll
be able to do so, I think we could remove this part and rely on RA.
And, one more point. This is quite a special case - here we want to
perform move via half of vector register. This is the main reason why
these particular cases are handled in special, not common, way.


> I wrote the original function, but it is not really clear for me what the function
> does now. I.e. what is code for updating addresses and what means reusing iter.
> I guess reusing iter means that we won't start the loop from 0.  Could you
> expand comments a bit more?
>
> I know I did not documented them originally, but all the parameters ought to be
> explicitely documented in a function comment.

Yep, you're right - we just don't start the loop from 0. I'll send a
version with the comments soon.


> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
> +/* Emit strset instuction.  If RHS is constant, and vector mode will be used,
> +   then move this consatnt to a vector register before emitting strset.  */
> +static void
> +emit_strset (rtx destmem, rtx value,
> +            rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?

I don't think it matters here, but to make emit_strset look similar to
emit_strmov, most of emit_strset body realy could be moved to
gen_strset.


>   if (max_size > 16)
>     {
>       rtx label = ix86_expand_aligntest (count, 16, true);
>       if (TARGET_64BIT)
>        {
> -         dest = change_address (destmem, DImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, DImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
>
> No use for 128bit moves here?
>        }
>       else
>        {
> -         dest = change_address (destmem, SImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
>
> And here?

For memset prologues/epilogues I avoid using vector moves as it could
require expensive initialization (we need to create a vector filled
with the given value). For other cases, I'll re-check if use of vector
moves is possible.


> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
>       off = 4;
>       emit_insn (gen_strmov (destreg, dst, srcreg, src));
>     }
> +  if (align_bytes & 8)
> +    {
> +      if (TARGET_64BIT || TARGET_SSE)
> +       {
> +         dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +       }
> +      else
> +       {
> +         dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?

Actually, here vector-moves are used if they are available - if 32bit
mode is used (so we can't do the move via GPR), but SSE is available,
then SSE-move would be generated.


> +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
> +   supposed to represent one byte.  MODE could be a vector mode.
> +   Example:
> +   1) VAL = const_int (0xAB), mode = SImode,
> +   the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> +   2) if VAL isn't const, then the result will be the result of MUL-instruction
> +   of VAL and const_int (0x01010101) (for SImode).  */
>
> This would probably go better as named expansion pattern, like we do for other
> machine description interfaces.

I don't think it could be done in machine-independent way - e.g. if
AVX is available, we could use broadcast-instructions, if not - we
need to use multiply-instructions, on other architectures there
probably some other more efficient ways to duplicate byte value across
the entire vector register. So IMO it's a good place to have a hook.


> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
>  The default is zero which means to not iterate over other vector sizes.
>  @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine indpendent part of the patch.
>

These changes present in middle-end part too (of course, in the full
patch it's not duplicated). I left it in this patch too to avoid
possible problems with build.
Richard Henderson Oct. 28, 2011, 3:59 p.m. UTC | #2
On 10/28/2011 05:41 AM, Michael Zolotukhin wrote:
>> > +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
>> > +   supposed to represent one byte.  MODE could be a vector mode.
>> > +   Example:
>> > +   1) VAL = const_int (0xAB), mode = SImode,
>> > +   the result is const_int (0xABABABAB).
>> >
>> > This can be handled in machine independent way, right?
>> >
>> > +   2) if VAL isn't const, then the result will be the result of MUL-instruction
>> > +   of VAL and const_int (0x01010101) (for SImode).  */
>> >
>> > This would probably go better as named expansion pattern, like we do for other
>> > machine description interfaces.
> I don't think it could be done in machine-independent way - e.g. if
> AVX is available, we could use broadcast-instructions, if not - we
> need to use multiply-instructions, on other architectures there
> probably some other more efficient ways to duplicate byte value across
> the entire vector register. So IMO it's a good place to have a hook.
> 
> 

Certainly it can be done machine-independently.
See expand_vector_broadcast in optabs.c for a start.


r~
Michael Zolotukhin Nov. 1, 2011, 5:34 p.m. UTC | #3
Thanks for the answers!

I tried to take into account all the remarks and updated the patch in
accordance with them. Its full version is attached to this letter,
separate middle-end and back-end parts will be in consequent letters.

What about the rest part of the patch? Jan, could you please review it too?


Below there are responses to remarks you made.
> +/* Helper function for expand_set_or_movmem_via_loop.
> +   This function can reuse iter rtx from another loop and don't generate
> +   code for updating the addresses.  */
> +static rtx
> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
> +                                        rtx destptr, rtx srcptr, rtx value,
> +                                        rtx count, rtx iter,
> +                                        enum machine_mode mode, int unroll,
> +                                        int expected_size, bool change_ptrs)
>
> I wrote the original function, but it is not really clear for me what the function
> does now. I.e. what is code for updating addresses and what means reusing iter.
> I guess reusing iter means that we won't start the loop from 0.  Could you
> expand comments a bit more?
I added some comments - please see updated version of the patch.


> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
> +/* Emit strset instuction.  If RHS is constant, and vector mode will be used,
> +   then move this consatnt to a vector register before emitting strset.  */
> +static void
> +emit_strset (rtx destmem, rtx value,
> +            rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?
Corrected.


>        {
> -         if (TARGET_64BIT)
> -           {
> -             dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         else
> -           {
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         offset += 8;
> +         if (GET_MODE (destmem) != move_mode)
> +           destmem = change_address (destmem, move_mode, destptr);
> AFAIK change_address is not equilvalent into adjust_automodify_address_nv in a way
> it copies memory aliasing attributes and it is needed to zap them here since stringops
> behaves funily WRT aliaseing.
Fixed.


>   if (max_size > 16)
>     {
>       rtx label = ix86_expand_aligntest (count, 16, true);
>       if (TARGET_64BIT)
>        {
> -         dest = change_address (destmem, DImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, DImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
>
> No use for 128bit moves here?
>        }
>       else
>        {
> -         dest = change_address (destmem, SImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
>
> And here?
Fixed.


> @@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
>   if (align <= 1 && desired_alignment > 1)
>     {
>       rtx label = ix86_expand_aligntest (destptr, 1, false);
> -      srcmem = change_address (srcmem, QImode, srcptr);
> -      destmem = change_address (destmem, QImode, destptr);
> +      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
> +      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
>
> You want to always use adjust_automodify_address or adjust_automodify_address_nv,
> adjust_automodify_address_1 is not intended for general use.
Fixed.


> +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
> +   supposed to represent one byte.  MODE could be a vector mode.
> +   Example:
> +   1) VAL = const_int (0xAB), mode = SImode,
> +   the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?

> Certainly it can be done machine-independently.
> See expand_vector_broadcast in optabs.c for a start.
Thanks, there is no need in new hook, indeed. Fixed.

Responses to other questions/remarks were in previous letter.

On 28 October 2011 19:59, Richard Henderson <rth@redhat.com> wrote:
> On 10/28/2011 05:41 AM, Michael Zolotukhin wrote:
>>> > +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
>>> > +   supposed to represent one byte.  MODE could be a vector mode.
>>> > +   Example:
>>> > +   1) VAL = const_int (0xAB), mode = SImode,
>>> > +   the result is const_int (0xABABABAB).
>>> >
>>> > This can be handled in machine independent way, right?
>>> >
>>> > +   2) if VAL isn't const, then the result will be the result of MUL-instruction
>>> > +   of VAL and const_int (0x01010101) (for SImode).  */
>>> >
>>> > This would probably go better as named expansion pattern, like we do for other
>>> > machine description interfaces.
>> I don't think it could be done in machine-independent way - e.g. if
>> AVX is available, we could use broadcast-instructions, if not - we
>> need to use multiply-instructions, on other architectures there
>> probably some other more efficient ways to duplicate byte value across
>> the entire vector register. So IMO it's a good place to have a hook.
>>
>>
>
> Certainly it can be done machine-independently.
> See expand_vector_broadcast in optabs.c for a start.
>
>
> r~
>
Michael Zolotukhin Nov. 1, 2011, 5:35 p.m. UTC | #4
Here is the separate middle-end and back-end parts of the updated patch.

On 1 November 2011 21:34, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Thanks for the answers!
>
> I tried to take into account all the remarks and updated the patch in
> accordance with them. Its full version is attached to this letter,
> separate middle-end and back-end parts will be in consequent letters.
>
> What about the rest part of the patch? Jan, could you please review it too?
>
>
> Below there are responses to remarks you made.
>> +/* Helper function for expand_set_or_movmem_via_loop.
>> +   This function can reuse iter rtx from another loop and don't generate
>> +   code for updating the addresses.  */
>> +static rtx
>> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
>> +                                        rtx destptr, rtx srcptr, rtx value,
>> +                                        rtx count, rtx iter,
>> +                                        enum machine_mode mode, int unroll,
>> +                                        int expected_size, bool change_ptrs)
>>
>> I wrote the original function, but it is not really clear for me what the function
>> does now. I.e. what is code for updating addresses and what means reusing iter.
>> I guess reusing iter means that we won't start the loop from 0.  Could you
>> expand comments a bit more?
> I added some comments - please see updated version of the patch.
>
>
>> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
>> +/* Emit strset instuction.  If RHS is constant, and vector mode will be used,
>> +   then move this consatnt to a vector register before emitting strset.  */
>> +static void
>> +emit_strset (rtx destmem, rtx value,
>> +            rtx destptr, enum machine_mode mode, int offset)
>>
>> This seems to more naturally belong into gen_strset expander?
> Corrected.
>
>
>>        {
>> -         if (TARGET_64BIT)
>> -           {
>> -             dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
>> -             emit_insn (gen_strset (destptr, dest, value));
>> -           }
>> -         else
>> -           {
>> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
>> -             emit_insn (gen_strset (destptr, dest, value));
>> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
>> -             emit_insn (gen_strset (destptr, dest, value));
>> -           }
>> -         offset += 8;
>> +         if (GET_MODE (destmem) != move_mode)
>> +           destmem = change_address (destmem, move_mode, destptr);
>> AFAIK change_address is not equilvalent into adjust_automodify_address_nv in a way
>> it copies memory aliasing attributes and it is needed to zap them here since stringops
>> behaves funily WRT aliaseing.
> Fixed.
>
>
>>   if (max_size > 16)
>>     {
>>       rtx label = ix86_expand_aligntest (count, 16, true);
>>       if (TARGET_64BIT)
>>        {
>> -         dest = change_address (destmem, DImode, destptr);
>> -         emit_insn (gen_strset (destptr, dest, value));
>> -         emit_insn (gen_strset (destptr, dest, value));
>> +         destmem = change_address (destmem, DImode, destptr);
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
>> +                                                               value)));
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
>> +                                                               value)));
>>
>> No use for 128bit moves here?
>>        }
>>       else
>>        {
>> -         dest = change_address (destmem, SImode, destptr);
>> -         emit_insn (gen_strset (destptr, dest, value));
>> -         emit_insn (gen_strset (destptr, dest, value));
>> -         emit_insn (gen_strset (destptr, dest, value));
>> -         emit_insn (gen_strset (destptr, dest, value));
>> +         destmem = change_address (destmem, SImode, destptr);
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
>> +                                                               value)));
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
>> +                                                               value)));
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
>> +                                                               value)));
>> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
>> +                                                               value)));
>>
>> And here?
> Fixed.
>
>
>> @@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
>>   if (align <= 1 && desired_alignment > 1)
>>     {
>>       rtx label = ix86_expand_aligntest (destptr, 1, false);
>> -      srcmem = change_address (srcmem, QImode, srcptr);
>> -      destmem = change_address (destmem, QImode, destptr);
>> +      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
>> +      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
>>
>> You want to always use adjust_automodify_address or adjust_automodify_address_nv,
>> adjust_automodify_address_1 is not intended for general use.
> Fixed.
>
>
>> +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
>> +   supposed to represent one byte.  MODE could be a vector mode.
>> +   Example:
>> +   1) VAL = const_int (0xAB), mode = SImode,
>> +   the result is const_int (0xABABABAB).
>>
>> This can be handled in machine independent way, right?
>
>> Certainly it can be done machine-independently.
>> See expand_vector_broadcast in optabs.c for a start.
> Thanks, there is no need in new hook, indeed. Fixed.
>
> Responses to other questions/remarks were in previous letter.
>
> On 28 October 2011 19:59, Richard Henderson <rth@redhat.com> wrote:
>> On 10/28/2011 05:41 AM, Michael Zolotukhin wrote:
>>>> > +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
>>>> > +   supposed to represent one byte.  MODE could be a vector mode.
>>>> > +   Example:
>>>> > +   1) VAL = const_int (0xAB), mode = SImode,
>>>> > +   the result is const_int (0xABABABAB).
>>>> >
>>>> > This can be handled in machine independent way, right?
>>>> >
>>>> > +   2) if VAL isn't const, then the result will be the result of MUL-instruction
>>>> > +   of VAL and const_int (0x01010101) (for SImode).  */
>>>> >
>>>> > This would probably go better as named expansion pattern, like we do for other
>>>> > machine description interfaces.
>>> I don't think it could be done in machine-independent way - e.g. if
>>> AVX is available, we could use broadcast-instructions, if not - we
>>> need to use multiply-instructions, on other architectures there
>>> probably some other more efficient ways to duplicate byte value across
>>> the entire vector register. So IMO it's a good place to have a hook.
>>>
>>>
>>
>> Certainly it can be done machine-independently.
>> See expand_vector_broadcast in optabs.c for a start.
>>
>>
>> r~
>>
>
>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>
Jan Hubicka Nov. 2, 2011, 6:58 p.m. UTC | #5
Hi,
I am going to benchmark the following hunk separately tonight. It is
independent change.

Rth, Vladimir: there are obviously several options how to make GCC use SSE for
64bit loads/stores in 32bit codegen (and 128bit loads/stores in 128bit
codegen). What do you think is best variant here?

(an alternative would be to make move patterns to preffer SSE variant in this
case or change RA order to iterate through SSE first, but at least with pre-IRA
this used to lead to bad decisions making RA to place value in SSE despite the
fact it is used in arithmetic that can't be done with SSE).

Honza

@@ -15266,6 +15363,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
+      if (mode == TImode
+	  && TARGET_AVX2
+	  && MEM_P (op0)
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V4DImode)
+	{
+	  op0 = convert_to_mode (V2DImode, op0, 1);
+	  emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+	  return;
+	}
       if (MEM_P (op0)
 	  && (PUSH_ROUNDING (GET_MODE_SIZE (mode)) != GET_MODE_SIZE (mode)
 	      || !push_operand (op0, mode))
Michael Zolotukhin Nov. 2, 2011, 7:32 p.m. UTC | #6
> I am going to benchmark the following hunk separately tonight. It is
> independent change.

You would probably need some changes from sse.md (for gen_sse2_loadq).

Michael
diff mbox

Patch

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c53423..d7c4330 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@  struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
+   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+   {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},

I am bit concerned about explossion of variants, but adding aligned variants probably makes
sense.  I guess we should specify what alignment needs to be known. I.e. is alignment of 2 enough
or shall the alignment be matching the size of loads/stores produced?

@@ -15266,6 +15362,38 @@  ix86_expand_move (enum machine_mode mode, rtx operands[])
     }
   else
     {
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE2
+	  && MEM_P (op0)
+	  && MEM_P (op1)
+	  && !push_operand (op0, mode)
+	  && can_create_pseudo_p ())
+	{
+	  rtx temp = gen_reg_rtx (V2DImode);
+	  emit_insn (gen_sse2_loadq (temp, op1));
+	  emit_insn (gen_sse_storeq (op0, temp));
+	  return;
+	}
+      if (mode == DImode
+	  && !TARGET_64BIT
+	  && TARGET_SSE
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V2DImode)
+	{
+	  emit_insn (gen_sse_storeq (op0, op1));
+	  return;
+	}
+      if (mode == TImode
+	  && TARGET_AVX2
+	  && MEM_P (op0)
+	  && !MEM_P (op1)
+	  && GET_MODE (op1) == V4DImode)
+	{
+	  op0 = convert_to_mode (V2DImode, op0, 1);
+	  emit_insn (gen_vec_extract_lo_v4di (op0, op1));
+	  return;
+	}

This hunk seems dangerous in a way that by emitting the explicit loadq/storeq pairs (and similar) will
prevent use of integer registers for 64bit/128bit arithmetic.

I guess we could play such tricks for memory-memory moves & constant stores. With gimple optimizations
we already know pretty well that the moves will stay as they are.  That might be enough for you?

If you go this way, please make separate patch so it can be benchmarked. While the moves are faster
there are always problem with size mismatches in load/store buffers.

I think for string operations we should output the SSE/AVX instruction variants
by hand + we could try to instruct IRA to actually preffer use of SSE/AVX when
feasible?  This has been traditionally problem with old RA because it did not
see that because pseudo is eventually used for arithmetics, it can not go into
SSE register. So it was not possible to make MMX/SSE/AVX to be preferred
variants for 64bit/128bit manipulations w/o hurting performance of code that
does arithmetic on long long and __int128.  Perhaps IRA can solve this now.
Vladimir, do you have any ida?

+/* Helper function for expand_set_or_movmem_via_loop.
+   This function can reuse iter rtx from another loop and don't generate
+   code for updating the addresses.  */
+static rtx
+expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
+					 rtx destptr, rtx srcptr, rtx value,
+					 rtx count, rtx iter,
+					 enum machine_mode mode, int unroll,
+					 int expected_size, bool change_ptrs)

I wrote the original function, but it is not really clear for me what the function
does now. I.e. what is code for updating addresses and what means reusing iter.
I guess reusing iter means that we won't start the loop from 0.  Could you
expand comments a bit more?

I know I did not documented them originally, but all the parameters ought to be
explicitely documented in a function comment.
@@ -20913,7 +21065,27 @@  emit_strmov (rtx destmem, rtx srcmem,
   emit_insn (gen_strmov (destptr, dest, srcptr, src));
 }
 
-/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
+/* Emit strset instuction.  If RHS is constant, and vector mode will be used,
+   then move this consatnt to a vector register before emitting strset.  */
+static void
+emit_strset (rtx destmem, rtx value,
+	     rtx destptr, enum machine_mode mode, int offset)

This seems to more naturally belong into gen_strset expander?
 	{
-	  if (TARGET_64BIT)
-	    {
-	      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  else
-	    {
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
-	      emit_insn (gen_strset (destptr, dest, value));
-	      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
-	      emit_insn (gen_strset (destptr, dest, value));
-	    }
-	  offset += 8;
+	  if (GET_MODE (destmem) != move_mode)
+	    destmem = change_address (destmem, move_mode, destptr);
AFAIK change_address is not equilvalent into adjust_automodify_address_nv in a way
it copies memory aliasing attributes and it is needed to zap them here since stringops
behaves funily WRT aliaseing.
   if (max_size > 16)
     {
       rtx label = ix86_expand_aligntest (count, 16, true);
       if (TARGET_64BIT)
 	{
-	  dest = change_address (destmem, DImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, DImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
+								value)));

No use for 128bit moves here?
 	}
       else
 	{
-	  dest = change_address (destmem, SImode, destptr);
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
-	  emit_insn (gen_strset (destptr, dest, value));
+	  destmem = change_address (destmem, SImode, destptr);
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));
+	  emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
+								value)));

And here?
@@ -21204,8 +21426,8 @@  expand_movmem_prologue (rtx destmem, rtx srcmem,
   if (align <= 1 && desired_alignment > 1)
     {
       rtx label = ix86_expand_aligntest (destptr, 1, false);
-      srcmem = change_address (srcmem, QImode, srcptr);
-      destmem = change_address (destmem, QImode, destptr);
+      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
+      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);

You want to always use adjust_automodify_address or adjust_automodify_address_nv,
adjust_automodify_address_1 is not intended for general use.
@@ -21286,6 +21528,37 @@  expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
       off = 4;
       emit_insn (gen_strmov (destreg, dst, srcreg, src));
     }
+  if (align_bytes & 8)
+    {
+      if (TARGET_64BIT || TARGET_SSE)
+	{
+	  dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, DImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	}
+      else
+	{
+	  dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
+	  src = adjust_automodify_address_nv (src, SImode, srcreg, off);
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));
+	  emit_insn (gen_strmov (destreg, dst, srcreg, src));

again, no use for vector moves?
+/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
+   supposed to represent one byte.  MODE could be a vector mode.
+   Example:
+   1) VAL = const_int (0xAB), mode = SImode,
+   the result is const_int (0xABABABAB).

This can be handled in machine independent way, right?

+   2) if VAL isn't const, then the result will be the result of MUL-instruction
+   of VAL and const_int (0x01010101) (for SImode).  */

This would probably go better as named expansion pattern, like we do for other
machine description interfaces.
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 90cef1c..4b7d67b 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5780,6 +5780,32 @@  mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
 The default is zero which means to not iterate over other vector sizes.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
+This hook should return true if memory accesses in mode @var{mode} to data
+aligned by @var{align} bits have a cost many times greater than aligned
+accesses, for example if they are emulated in a trap handler.

New hooks should go to the machine indpendent part of the patch.

Honza