
[RFC] For TARGET_AVX use *mov<mode>_internal for misaligned loads

Message ID 20131030094713.GC27813@tucnak.zalov.cz
State New

Commit Message

Jakub Jelinek Oct. 30, 2013, 9:47 a.m. UTC
Hi!

Yesterday I noticed that for AVX, which allows unaligned operands in
AVX arithmetic instructions, we still don't combine unaligned loads with the
AVX arithmetic instructions.  So, say, for -O2 -mavx -ftree-vectorize
void
f1 (int *__restrict e, int *__restrict f)
{
  int i;
  for (i = 0; i < 1024; i++)
    e[i] = f[i] * 7;
}

void
f2 (int *__restrict e, int *__restrict f)
{
  int i;
  for (i = 0; i < 1024; i++)
    e[i] = f[i];
}
we have:
        vmovdqu (%rsi,%rax), %xmm0
        vpmulld %xmm1, %xmm0, %xmm0
        vmovups %xmm0, (%rdi,%rax)
in the first loop.  Apparently all the MODE_VECTOR_INT and MODE_VECTOR_FLOAT
*mov<mode>_internal patterns (and various others) already use misaligned_operand
to decide whether to emit vmovaps or vmovups (etc.).  So, as Richard suggested
on IRC, for TARGET_AVX it isn't necessary to either allow UNSPEC_LOADU in the
memory operands of all the various non-move AVX instructions, or to add extra
patterns to help combine; instead this patch just uses *mov<mode>_internal in
that case (assuming an initially misaligned_operand doesn't become
!misaligned_operand through RTL optimizations).  Additionally, the patch
attempts to avoid gen_lowpart on a non-MEM lhs of the unaligned loads, which
usually makes combine fail, by doing the load into a temporary pseudo in that
case and then doing a pseudo-to-pseudo move with gen_lowpart on the rhs (which
will be merged soon after into the following instructions).
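
For illustration, here is a minimal sketch of that second change (simplified;
the actual hunks in the patch below handle the individual modes separately):

  if (!MEM_P (op0) && GET_MODE (op0) != V16QImode)
    {
      /* Don't put a same-sized SUBREG on the lhs; load into a fresh pseudo
         and use gen_lowpart on the rhs of a follow-up move instead, so
         combine can still merge the load into the following arithmetic
         insn.  */
      rtx tmp = gen_reg_rtx (V16QImode);
      emit_insn (gen_sse2_loaddquv16qi (tmp, gen_lowpart (V16QImode, op1)));
      emit_move_insn (op0, gen_lowpart (GET_MODE (op0), tmp));
    }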

I'll bootstrap/regtest this on x86_64-linux and i686-linux, unfortunately my
bootstrap/regtest server isn't AVX capable.

2013-10-30  Jakub Jelinek  <jakub@redhat.com>

	* config/i386/i386.c (ix86_avx256_split_vector_move_misalign): If
	op1 is misaligned_operand, just use *mov<mode>_internal insn
	rather than UNSPEC_LOADU load.
	(ix86_expand_vector_move_misalign): Likewise (for TARGET_AVX only).
	Avoid gen_lowpart on op0 if it isn't MEM.


	Jakub

Comments

Ondřej Bílka Oct. 30, 2013, 9:53 a.m. UTC | #1
On Wed, Oct 30, 2013 at 10:47:13AM +0100, Jakub Jelinek wrote:
> Hi!
> 
> Yesterday I've noticed that for AVX which allows unaligned operands in
> AVX arithmetics instructions we still don't combine unaligned loads with the
> AVX arithmetics instructions.  So say for -O2 -mavx -ftree-vectorize
> void
> f1 (int *__restrict e, int *__restrict f)
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     e[i] = f[i] * 7;
> }
> 
> void
> f2 (int *__restrict e, int *__restrict f)
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     e[i] = f[i];
> }
> we have:
>         vmovdqu (%rsi,%rax), %xmm0
>         vpmulld %xmm1, %xmm0, %xmm0
>         vmovups %xmm0, (%rdi,%rax)
> in the first loop.  Apparently all the MODE_VECTOR_INT and MODE_VECTOR_FLOAT
> *mov<mode>_internal patterns (and various others) use misaligned_operand
> to see if they should emit vmovaps or vmovups (etc.), so as suggested by

That is intentional.  On pre-Haswell architectures a split load is faster
than an unaligned 32-byte load.

See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for
details.
Jakub Jelinek Oct. 30, 2013, 10 a.m. UTC | #2
On Wed, Oct 30, 2013 at 10:53:58AM +0100, Ondřej Bílka wrote:
> > Yesterday I've noticed that for AVX which allows unaligned operands in
> > AVX arithmetics instructions we still don't combine unaligned loads with the
> > AVX arithmetics instructions.  So say for -O2 -mavx -ftree-vectorize
> > void
> > f1 (int *__restrict e, int *__restrict f)
> > {
> >   int i;
> >   for (i = 0; i < 1024; i++)
> >     e[i] = f[i] * 7;
> > }
> > 
> > void
> > f2 (int *__restrict e, int *__restrict f)
> > {
> >   int i;
> >   for (i = 0; i < 1024; i++)
> >     e[i] = f[i];
> > }
> > we have:
> >         vmovdqu (%rsi,%rax), %xmm0
> >         vpmulld %xmm1, %xmm0, %xmm0
> >         vmovups %xmm0, (%rdi,%rax)
> > in the first loop.  Apparently all the MODE_VECTOR_INT and MODE_VECTOR_FLOAT
> > *mov<mode>_internal patterns (and various others) use misaligned_operand
> > to see if they should emit vmovaps or vmovups (etc.), so as suggested by
> 
> That is intentional. In pre-haswell architectures splitting load is
> faster than 32 byte load. 

But the above is a 16-byte unaligned load.  Furthermore, GCC supports
-mavx256-split-unaligned-load and can emit 32-byte loads either as a single
unaligned 32-byte load or as a merge of two 16-byte unaligned loads.  The
patch only affects the cases where we were already emitting 16-byte or 32-byte
unaligned loads rather than split loads.

	Jakub
Jakub Jelinek Oct. 30, 2013, 10:05 a.m. UTC | #3
On Wed, Oct 30, 2013 at 11:00:13AM +0100, Jakub Jelinek wrote:
> But the above is 16 byte unaligned load.  Furthermore, GCC supports
> -mavx256-split-unaligned-load and can emit 32 byte loads either as an
> unaligned 32 byte load, or merge of 16 byte unaligned loads.  The patch
> affects only the cases where we were already emitting 16 byte or 32 byte
> unaligned loads rather than split loads.

With my patch, the differences (in all cases only in f1) are as follows.
For -O2 -mavx -ftree-vectorize (16-byte unaligned load, not split):
-	vmovdqu	(%rsi,%rax), %xmm0
-	vpmulld	%xmm1, %xmm0, %xmm0
+	vpmulld	(%rsi,%rax), %xmm1, %xmm0
 	vmovups	%xmm0, (%rdi,%rax)
with -O2 -mavx2 -ftree-vectorize (again, the load wasn't split):
-	vmovdqu	(%rsi,%rax), %ymm0
-	vpmulld	%ymm1, %ymm0, %ymm0
+	vpmulld	(%rsi,%rax), %ymm1, %ymm0
 	vmovups	%ymm0, (%rdi,%rax)
and with -O2 -mavx2 -mavx256-split-unaligned-load:
 	vmovdqu	(%rsi,%rax), %xmm0
 	vinserti128	$0x1, 16(%rsi,%rax), %ymm0, %ymm0
-	vpmulld	%ymm1, %ymm0, %ymm0
+	vpmulld	%ymm0, %ymm1, %ymm0
 	vmovups	%ymm0, (%rdi,%rax)
(the last change is just giving RTL optimizers more freedom by not
doing the SUBREG on the lhs).

	Jakub
Ondřej Bílka Oct. 30, 2013, 10:52 a.m. UTC | #4
On Wed, Oct 30, 2013 at 11:05:58AM +0100, Jakub Jelinek wrote:
> On Wed, Oct 30, 2013 at 11:00:13AM +0100, Jakub Jelinek wrote:
> > But the above is 16 byte unaligned load.  Furthermore, GCC supports
> > -mavx256-split-unaligned-load and can emit 32 byte loads either as an
> > unaligned 32 byte load, or merge of 16 byte unaligned loads.  The patch
> > affects only the cases where we were already emitting 16 byte or 32 byte
> > unaligned loads rather than split loads.
> 
> With my patch, the differences (in all cases only on f1) for
> -O2 -mavx -ftree-vectorize with the patch is (16 byte unaligned load, not split):

My point was that this could mask split loads; thanks for clarifying.
Uros Bizjak Oct. 30, 2013, 10:55 a.m. UTC | #5
On Wed, Oct 30, 2013 at 10:47 AM, Jakub Jelinek <jakub@redhat.com> wrote:

> Yesterday I've noticed that for AVX which allows unaligned operands in
> AVX arithmetics instructions we still don't combine unaligned loads with the
> AVX arithmetics instructions.  So say for -O2 -mavx -ftree-vectorize

This is actually PR 47754, which fell below the radar for some reason...

> we have:
>         vmovdqu (%rsi,%rax), %xmm0
>         vpmulld %xmm1, %xmm0, %xmm0
>         vmovups %xmm0, (%rdi,%rax)
> in the first loop.  Apparently all the MODE_VECTOR_INT and MODE_VECTOR_FLOAT
> *mov<mode>_internal patterns (and various others) use misaligned_operand
> to see if they should emit vmovaps or vmovups (etc.), so as suggested by
> Richard on IRC it isn't necessary to either allow UNSPEC_LOADU in memory
> operands of all the various non-move AVX instructions for TARGET_AVX, or
> add extra patterns to help combine, this patch instead just uses the
> *mov<mode>_internal in that case (assuming initially misaligned_operand
> doesn't become !misaligned_operand through RTL optimizations).  Additionally

No worries here.  We will generate movdqa, and it is definitely a GCC
bug if RTL optimizations change a misaligned operand to an aligned one.

> the patch attempts to avoid gen_lowpart on the non-MEM lhs of the unaligned
> loads, which usually means combine will fail, by doing the load into a
> temporary pseudo in that case and then doing a pseudo to pseudo move with
> gen_lowpart on the rhs (which will be merged soon after into following
> instructions).

Is this similar to PR44141?  There were similar problems with V4SFmode
subregs, where combine was not able to merge the load into the arithmetic insn.

> I'll bootstrap/regtest this on x86_64-linux and i686-linux, unfortunately my
> bootstrap/regtest server isn't AVX capable.

I can bootstrap the patch later today on IvyBridge with
--with-arch=core-avx-i --with-cpu=core-avx-i --with-fpmath=avx.

Uros.
Jakub Jelinek Oct. 30, 2013, 11:11 a.m. UTC | #6
On Wed, Oct 30, 2013 at 11:55:44AM +0100, Uros Bizjak wrote:
> On Wed, Oct 30, 2013 at 10:47 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> 
> > Yesterday I've noticed that for AVX which allows unaligned operands in
> > AVX arithmetics instructions we still don't combine unaligned loads with the
> > AVX arithmetics instructions.  So say for -O2 -mavx -ftree-vectorize
> 
> This is actually PR 47754 that fell below radar for some reason...

Apparently yes.

> > the patch attempts to avoid gen_lowpart on the non-MEM lhs of the unaligned
> > loads, which usually means combine will fail, by doing the load into a
> > temporary pseudo in that case and then doing a pseudo to pseudo move with
> > gen_lowpart on the rhs (which will be merged soon after into following
> > instructions).
> 
> Is this similar to PR44141? There were similar problems with V4SFmode
> subregs, so combine was not able to merge load to the arithemtic insn.

From the work on vectorization last year I remember many cases where
subregs (even equal-sized ones) on the LHS of instructions prevented the
combiner or other RTL optimizations from doing their job.  I believe I've
changed some easy places that did that completely unnecessarily, but I
certainly haven't gone through all the code to look for other places where
this is done.

Perhaps we should hack up a checking pass that walks the whole IL after
expansion and complains about same-sized subregs on the LHS of insns, then
run make check with it for a couple of ISAs (e.g. -msse2, -msse4, -mavx,
-mavx2)?
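
Something along these lines, perhaps (a rough, hypothetical sketch only; the
helper name is made up and it isn't wired up as a real pass):

static void
check_same_size_lhs_subregs (void)
{
  rtx insn;

  for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
    if (INSN_P (insn))
      {
        rtx set = single_set (insn);
        if (set
            && GET_CODE (SET_DEST (set)) == SUBREG
            && (GET_MODE_SIZE (GET_MODE (SET_DEST (set)))
                == GET_MODE_SIZE (GET_MODE (SUBREG_REG (SET_DEST (set))))))
          /* A same-sized SUBREG on the lhs; complain so we can spot
             expanders that should use a fresh pseudo instead.  */
          warning (0, "same-sized subreg on the lhs of insn %d",
                   INSN_UID (insn));
      }
}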

> > I'll bootstrap/regtest this on x86_64-linux and i686-linux, unfortunately my
> > bootstrap/regtest server isn't AVX capable.
> 
> I can bootstrap the patch later today on IvyBridge with
> --with-arch=core-avx-i --with-cpu=core-avx-i --with-fpmath=avx.

That would be greatly appreciated, thanks.

	Jakub
Richard Henderson Oct. 30, 2013, 4:17 p.m. UTC | #7
On 10/30/2013 02:47 AM, Jakub Jelinek wrote:
> 2013-10-30  Jakub Jelinek  <jakub@redhat.com>
> 
> 	* config/i386/i386.c (ix86_avx256_split_vector_move_misalign): If
> 	op1 is misaligned_operand, just use *mov<mode>_internal insn
> 	rather than UNSPEC_LOADU load.
> 	(ix86_expand_vector_move_misalign): Likewise (for TARGET_AVX only).
> 	Avoid gen_lowpart on op0 if it isn't MEM.

Ok.


r~
Uros Bizjak Oct. 30, 2013, 5:42 p.m. UTC | #8
On Wed, Oct 30, 2013 at 12:11 PM, Jakub Jelinek <jakub@redhat.com> wrote:

>> > Yesterday I've noticed that for AVX which allows unaligned operands in
>> > AVX arithmetics instructions we still don't combine unaligned loads with the
>> > AVX arithmetics instructions.  So say for -O2 -mavx -ftree-vectorize
>>
>> This is actually PR 47754 that fell below radar for some reason...
>
> Apparently yes.
>
>> > the patch attempts to avoid gen_lowpart on the non-MEM lhs of the unaligned
>> > loads, which usually means combine will fail, by doing the load into a
>> > temporary pseudo in that case and then doing a pseudo to pseudo move with
>> > gen_lowpart on the rhs (which will be merged soon after into following
>> > instructions).
>>
>> Is this similar to PR44141? There were similar problems with V4SFmode
>> subregs, so combine was not able to merge load to the arithemtic insn.
>
> From the work on the vectorization last year I remember many cases where
> subregs (even equal size) on the LHS of instructions prevented combiner or
> other RTL optimizations from doing it's job.  I believe I've changed some
> easy places that did that completely unnecessarily, but certainly have not
> went through all the code to look for other places where this is done.
>
> Perhaps let's hack up a checking pass that will after expansion walk the
> whole IL and complain about same sized subregs on the LHS of insns, then do make
> check with it for a couple of ISAs (-msse2,-msse4,-mavx,-mavx2 e.g.?
>
>> > I'll bootstrap/regtest this on x86_64-linux and i686-linux, unfortunately my
>> > bootstrap/regtest server isn't AVX capable.
>>
>> I can bootstrap the patch later today on IvyBridge with
>> --with-arch=core-avx-i --with-cpu=core-avx-i --with-fpmath=avx.
>
> That would be greatly appreciated, thanks.

The bootstrap and regression test was OK for x86_64-linux-gnu {,-m32}.

The failures in the attached report are either pre-existing or benign,
due to core-avx-i defaulting to AVX.

Please also mention PR44141 in the ChangeLog entry.

Uros.
Uros Bizjak Oct. 30, 2013, 5:46 p.m. UTC | #9
On Wed, Oct 30, 2013 at 6:42 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Wed, Oct 30, 2013 at 12:11 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>
>>> > Yesterday I've noticed that for AVX which allows unaligned operands in
>>> > AVX arithmetics instructions we still don't combine unaligned loads with the
>>> > AVX arithmetics instructions.  So say for -O2 -mavx -ftree-vectorize
>>>
>>> This is actually PR 47754 that fell below radar for some reason...
>>
>> Apparently yes.
>>
>>> > the patch attempts to avoid gen_lowpart on the non-MEM lhs of the unaligned
>>> > loads, which usually means combine will fail, by doing the load into a
>>> > temporary pseudo in that case and then doing a pseudo to pseudo move with
>>> > gen_lowpart on the rhs (which will be merged soon after into following
>>> > instructions).
>>>
>>> Is this similar to PR44141? There were similar problems with V4SFmode
>>> subregs, so combine was not able to merge load to the arithemtic insn.
>>
>> From the work on the vectorization last year I remember many cases where
>> subregs (even equal size) on the LHS of instructions prevented combiner or
>> other RTL optimizations from doing it's job.  I believe I've changed some
>> easy places that did that completely unnecessarily, but certainly have not
>> went through all the code to look for other places where this is done.
>>
>> Perhaps let's hack up a checking pass that will after expansion walk the
>> whole IL and complain about same sized subregs on the LHS of insns, then do make
>> check with it for a couple of ISAs (-msse2,-msse4,-mavx,-mavx2 e.g.?
>>
>>> > I'll bootstrap/regtest this on x86_64-linux and i686-linux, unfortunately my
>>> > bootstrap/regtest server isn't AVX capable.
>>>
>>> I can bootstrap the patch later today on IvyBridge with
>>> --with-arch=core-avx-i --with-cpu=core-avx-i --with-fpmath=avx.
>>
>> That would be greatly appreciated, thanks.
>
> The bootstrap and regression test was OK for x86_64-linux-gnu {,-m32}.
>
> The failures in the attached report are either pre-existing or benign
> due to core-avx-i default to AVX.

I was referring to the *runtime* failures here.

> Please also mention PR44141 in the ChangeLog entry.

Oops, this should be PR47754.

Thanks,
Uros.

Patch

--- gcc/config/i386/i386.c.jj	2013-10-30 08:15:38.000000000 +0100
+++ gcc/config/i386/i386.c	2013-10-30 10:20:22.684708729 +0100
@@ -16560,6 +16560,12 @@  ix86_avx256_split_vector_move_misalign (
 	  r = gen_rtx_VEC_CONCAT (GET_MODE (op0), r, m);
 	  emit_move_insn (op0, r);
 	}
+      /* Normal *mov<mode>_internal pattern will handle
+	 unaligned loads just fine if misaligned_operand
+	 is true, and without the UNSPEC it can be combined
+	 with arithmetic instructions.  */
+      else if (misaligned_operand (op1, GET_MODE (op1)))
+	emit_insn (gen_rtx_SET (VOIDmode, op0, op1));
       else
 	emit_insn (load_unaligned (op0, op1));
     }
@@ -16634,7 +16640,7 @@  ix86_avx256_split_vector_move_misalign (
 void
 ix86_expand_vector_move_misalign (enum machine_mode mode, rtx operands[])
 {
-  rtx op0, op1, m;
+  rtx op0, op1, orig_op0 = NULL_RTX, m;
   rtx (*load_unaligned) (rtx, rtx);
   rtx (*store_unaligned) (rtx, rtx);
 
@@ -16647,7 +16653,16 @@  ix86_expand_vector_move_misalign (enum m
 	{
 	case MODE_VECTOR_INT:
 	case MODE_INT:
-	  op0 = gen_lowpart (V16SImode, op0);
+	  if (GET_MODE (op0) != V16SImode)
+	    {
+	      if (!MEM_P (op0))
+		{
+		  orig_op0 = op0;
+		  op0 = gen_reg_rtx (V16SImode);
+		}
+	      else
+		op0 = gen_lowpart (V16SImode, op0);
+	    }
 	  op1 = gen_lowpart (V16SImode, op1);
 	  /* FALLTHRU */
 
@@ -16676,6 +16691,8 @@  ix86_expand_vector_move_misalign (enum m
 	    emit_insn (store_unaligned (op0, op1));
 	  else
 	    gcc_unreachable ();
+	  if (orig_op0)
+	    emit_move_insn (orig_op0, gen_lowpart (GET_MODE (orig_op0), op0));
 	  break;
 
 	default:
@@ -16692,12 +16709,23 @@  ix86_expand_vector_move_misalign (enum m
 	{
 	case MODE_VECTOR_INT:
 	case MODE_INT:
-	  op0 = gen_lowpart (V32QImode, op0);
+	  if (GET_MODE (op0) != V32QImode)
+	    {
+	      if (!MEM_P (op0))
+		{
+		  orig_op0 = op0;
+		  op0 = gen_reg_rtx (V32QImode);
+		}
+	      else
+		op0 = gen_lowpart (V32QImode, op0);
+	    }
 	  op1 = gen_lowpart (V32QImode, op1);
 	  /* FALLTHRU */
 
 	case MODE_VECTOR_FLOAT:
 	  ix86_avx256_split_vector_move_misalign (op0, op1);
+	  if (orig_op0)
+	    emit_move_insn (orig_op0, gen_lowpart (GET_MODE (orig_op0), op0));
 	  break;
 
 	default:
@@ -16709,15 +16737,30 @@  ix86_expand_vector_move_misalign (enum m
 
   if (MEM_P (op1))
     {
+      /* Normal *mov<mode>_internal pattern will handle
+	 unaligned loads just fine if misaligned_operand
+	 is true, and without the UNSPEC it can be combined
+	 with arithmetic instructions.  */
+      if (TARGET_AVX
+	  && (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+	      || GET_MODE_CLASS (mode) == MODE_VECTOR_FLOAT)
+	  && misaligned_operand (op1, GET_MODE (op1)))
+	emit_insn (gen_rtx_SET (VOIDmode, op0, op1));
       /* ??? If we have typed data, then it would appear that using
 	 movdqu is the only way to get unaligned data loaded with
 	 integer type.  */
-      if (TARGET_SSE2 && GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+      else if (TARGET_SSE2 && GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
 	{
-	  op0 = gen_lowpart (V16QImode, op0);
+	  if (GET_MODE (op0) != V16QImode)
+	    {
+	      orig_op0 = op0;
+	      op0 = gen_reg_rtx (V16QImode);
+	    }
 	  op1 = gen_lowpart (V16QImode, op1);
 	  /* We will eventually emit movups based on insn attributes.  */
 	  emit_insn (gen_sse2_loaddquv16qi (op0, op1));
+	  if (orig_op0)
+	    emit_move_insn (orig_op0, gen_lowpart (GET_MODE (orig_op0), op0));
 	}
       else if (TARGET_SSE2 && mode == V2DFmode)
         {
@@ -16765,9 +16808,16 @@  ix86_expand_vector_move_misalign (enum m
 	      || TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL
 	      || optimize_insn_for_size_p ())
 	    {
-	      op0 = gen_lowpart (V4SFmode, op0);
+	      if (GET_MODE (op0) != V4SFmode)
+		{
+		  orig_op0 = op0;
+		  op0 = gen_reg_rtx (V4SFmode);
+		}
 	      op1 = gen_lowpart (V4SFmode, op1);
 	      emit_insn (gen_sse_loadups (op0, op1));
+	      if (orig_op0)
+		emit_move_insn (orig_op0,
+				gen_lowpart (GET_MODE (orig_op0), op0));
 	      return;
             }