Message ID: 20131030094713.GC27813@tucnak.zalov.cz
State: New
On Wed, Oct 30, 2013 at 10:47:13AM +0100, Jakub Jelinek wrote:
> Hi!
>
> Yesterday I noticed that for AVX, which allows unaligned operands in
> AVX arithmetic instructions, we still don't combine unaligned loads
> with the AVX arithmetic instructions.  So, for -O2 -mavx -ftree-vectorize:
>
> void
> f1 (int *__restrict e, int *__restrict f)
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     e[i] = f[i] * 7;
> }
>
> void
> f2 (int *__restrict e, int *__restrict f)
> {
>   int i;
>   for (i = 0; i < 1024; i++)
>     e[i] = f[i];
> }
>
> we have:
> 	vmovdqu	(%rsi,%rax), %xmm0
> 	vpmulld	%xmm1, %xmm0, %xmm0
> 	vmovups	%xmm0, (%rdi,%rax)
> in the first loop.  Apparently all the MODE_VECTOR_INT and MODE_VECTOR_FLOAT
> *mov<mode>_internal patterns (and various others) use misaligned_operand
> to see if they should emit vmovaps or vmovups (etc.), so as suggested by

That is intentional.  On pre-Haswell architectures, splitting the load is
faster than a 32-byte load.  See the Intel® 64 and IA-32 Architectures
Optimization Reference Manual for details.
On Wed, Oct 30, 2013 at 10:53:58AM +0100, Ondřej Bílka wrote:
> > Yesterday I noticed that for AVX, which allows unaligned operands in
> > AVX arithmetic instructions, we still don't combine unaligned loads
> > with the AVX arithmetic instructions. [...]
>
> That is intentional.  On pre-Haswell architectures, splitting the load
> is faster than a 32-byte load.

But the above is a 16-byte unaligned load.  Furthermore, GCC supports
-mavx256-split-unaligned-load and can emit 32-byte loads either as an
unaligned 32-byte load, or as a merge of 16-byte unaligned loads.  The
patch affects only the cases where we were already emitting 16-byte or
32-byte unaligned loads rather than split loads.

	Jakub
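For readers unfamiliar with the option, an illustrative hand-written sketch (not compiler output) of the two 32-byte load strategies being contrasted:

```
	# Unsplit: one 32-byte unaligned load, which is the form the
	# patch can fold into an arithmetic instruction's memory operand.
	vmovdqu	(%rsi,%rax), %ymm0

	# -mavx256-split-unaligned-load: two 16-byte halves merged
	# (vinserti128 with -mavx2; vinsertf128 with plain -mavx).
	vmovdqu	(%rsi,%rax), %xmm0
	vinserti128	$0x1, 16(%rsi,%rax), %ymm0, %ymm0
```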
On Wed, Oct 30, 2013 at 11:00:13AM +0100, Jakub Jelinek wrote:
> But the above is a 16-byte unaligned load.  Furthermore, GCC supports
> -mavx256-split-unaligned-load and can emit 32-byte loads either as an
> unaligned 32-byte load, or as a merge of 16-byte unaligned loads.  The
> patch affects only the cases where we were already emitting 16-byte or
> 32-byte unaligned loads rather than split loads.

With my patch, the differences (in all cases only on f1) are as follows.
For -O2 -mavx -ftree-vectorize (16-byte unaligned load, not split):
-	vmovdqu	(%rsi,%rax), %xmm0
-	vpmulld	%xmm1, %xmm0, %xmm0
+	vpmulld	(%rsi,%rax), %xmm1, %xmm0
 	vmovups	%xmm0, (%rdi,%rax)
with -O2 -mavx2 -ftree-vectorize (again, the load wasn't split):
-	vmovdqu	(%rsi,%rax), %ymm0
-	vpmulld	%ymm1, %ymm0, %ymm0
+	vpmulld	(%rsi,%rax), %ymm1, %ymm0
 	vmovups	%ymm0, (%rdi,%rax)
and with -O2 -mavx2 -mavx256-split-unaligned-load:
 	vmovdqu	(%rsi,%rax), %xmm0
 	vinserti128	$0x1, 16(%rsi,%rax), %ymm0, %ymm0
-	vpmulld	%ymm1, %ymm0, %ymm0
+	vpmulld	%ymm0, %ymm1, %ymm0
 	vmovups	%ymm0, (%rdi,%rax)
(the last change just gives the RTL optimizers more freedom by not doing
the SUBREG on the lhs).

	Jakub
On Wed, Oct 30, 2013 at 11:05:58AM +0100, Jakub Jelinek wrote:
> On Wed, Oct 30, 2013 at 11:00:13AM +0100, Jakub Jelinek wrote:
> > But the above is a 16-byte unaligned load. [...]
>
> With my patch, the differences (in all cases only on f1) for
> -O2 -mavx -ftree-vectorize are (16-byte unaligned load, not split):

My point was that this could mask split loads; thanks for clarifying.
On Wed, Oct 30, 2013 at 10:47 AM, Jakub Jelinek <jakub@redhat.com> wrote:

> Yesterday I noticed that for AVX, which allows unaligned operands in
> AVX arithmetic instructions, we still don't combine unaligned loads
> with the AVX arithmetic instructions.  So, for -O2 -mavx -ftree-vectorize:

This is actually PR 47754, which fell below the radar for some reason...

> we have:
> 	vmovdqu	(%rsi,%rax), %xmm0
> 	vpmulld	%xmm1, %xmm0, %xmm0
> 	vmovups	%xmm0, (%rdi,%rax)
> in the first loop.  Apparently all the MODE_VECTOR_INT and MODE_VECTOR_FLOAT
> *mov<mode>_internal patterns (and various others) use misaligned_operand
> to see if they should emit vmovaps or vmovups (etc.), so as suggested by
> Richard on IRC it isn't necessary to either allow UNSPEC_LOADU in memory
> operands of all the various non-move AVX instructions for TARGET_AVX, or
> add extra patterns to help combine; this patch instead just uses
> *mov<mode>_internal in that case (assuming an initially misaligned_operand
> doesn't become !misaligned_operand through RTL optimizations).

No worries here.  We will generate movdqa, and it is definitely a gcc bug
if RTL optimizations change a misaligned operand to an aligned one.

> Additionally, the patch attempts to avoid gen_lowpart on the non-MEM lhs
> of the unaligned loads, which usually means combine will fail, by doing
> the load into a temporary pseudo in that case and then doing a pseudo to
> pseudo move with gen_lowpart on the rhs (which will be merged soon after
> into following instructions).

Is this similar to PR44141?  There were similar problems with V4SFmode
subregs, so combine was not able to merge the load into the arithmetic insn.

> I'll bootstrap/regtest this on x86_64-linux and i686-linux; unfortunately
> my bootstrap/regtest server isn't AVX capable.

I can bootstrap the patch later today on IvyBridge with
--with-arch=core-avx-i --with-cpu=core-avx-i --with-fpmath=avx.

Uros.
On Wed, Oct 30, 2013 at 11:55:44AM +0100, Uros Bizjak wrote:
> On Wed, Oct 30, 2013 at 10:47 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> > Yesterday I noticed that for AVX, which allows unaligned operands in
> > AVX arithmetic instructions, we still don't combine unaligned loads
> > with the AVX arithmetic instructions. [...]
>
> This is actually PR 47754, which fell below the radar for some reason...

Apparently yes.

> > the patch attempts to avoid gen_lowpart on the non-MEM lhs of the
> > unaligned loads [...]
>
> Is this similar to PR44141?  There were similar problems with V4SFmode
> subregs, so combine was not able to merge the load into the arithmetic insn.

From the work on vectorization last year I remember many cases where
subregs (even equal-sized ones) on the LHS of instructions prevented the
combiner or other RTL optimizations from doing their job.  I believe I
changed some easy places that did that completely unnecessarily, but I
certainly have not gone through all the code looking for other places
where this is done.

Perhaps let's hack up a checking pass that, after expansion, walks the
whole IL and complains about same-sized subregs on the LHS of insns, then
do make check with it for a couple of ISAs (-msse2, -msse4, -mavx, -mavx2,
e.g.)?

> > I'll bootstrap/regtest this on x86_64-linux and i686-linux;
> > unfortunately my bootstrap/regtest server isn't AVX capable.
>
> I can bootstrap the patch later today on IvyBridge with
> --with-arch=core-avx-i --with-cpu=core-avx-i --with-fpmath=avx.

That would be greatly appreciated, thanks.

	Jakub
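The checking pass proposed above might look roughly like the following pseudocode sketch (macro names approximate GCC's RTL API from rtl.h; the actual pass plumbing is omitted and the helper is hypothetical):

```
/* After expand, flag same-sized SUBREGs on the LHS of a single SET,
   since those tend to block combine from folding loads.  */
for each basic block bb in cfun:
  for each insn in bb:
    if INSN_P (insn):
      set = single_set (insn)
      if set != NULL
         and GET_CODE (SET_DEST (set)) == SUBREG
         and GET_MODE_SIZE (GET_MODE (SET_DEST (set)))
             == GET_MODE_SIZE (GET_MODE (SUBREG_REG (SET_DEST (set)))):
        complain about same-sized subreg on the LHS
```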
On 10/30/2013 02:47 AM, Jakub Jelinek wrote:
> 2013-10-30  Jakub Jelinek  <jakub@redhat.com>
>
> 	* config/i386/i386.c (ix86_avx256_split_vector_move_misalign): If
> 	op1 is misaligned_operand, just use *mov<mode>_internal insn
> 	rather than UNSPEC_LOADU load.
> 	(ix86_expand_vector_move_misalign): Likewise (for TARGET_AVX only).
> 	Avoid gen_lowpart on op0 if it isn't MEM.

Ok.

r~
On Wed, Oct 30, 2013 at 12:11 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>> This is actually PR 47754, which fell below the radar for some reason...
>
> Apparently yes.
>
> [...]
>
>> I can bootstrap the patch later today on IvyBridge with
>> --with-arch=core-avx-i --with-cpu=core-avx-i --with-fpmath=avx.
>
> That would be greatly appreciated, thanks.

The bootstrap and regression test was OK for x86_64-linux-gnu {,-m32}.
The failures in the attached report are either pre-existing or benign,
due to core-avx-i defaulting to AVX.
Please also mention PR44141 in the ChangeLog entry.

Uros.
On Wed, Oct 30, 2013 at 6:42 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Wed, Oct 30, 2013 at 12:11 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>
> [...]
>
> The bootstrap and regression test was OK for x86_64-linux-gnu {,-m32}.
>
> The failures in the attached report are either pre-existing or benign,
> due to core-avx-i defaulting to AVX.

I was referring to the *runtime* failures here.

> Please also mention PR44141 in the ChangeLog entry.

Oops, this should be PR47754.

Thanks,
Uros.
--- gcc/config/i386/i386.c.jj	2013-10-30 08:15:38.000000000 +0100
+++ gcc/config/i386/i386.c	2013-10-30 10:20:22.684708729 +0100
@@ -16560,6 +16560,12 @@ ix86_avx256_split_vector_move_misalign (
 	  r = gen_rtx_VEC_CONCAT (GET_MODE (op0), r, m);
 	  emit_move_insn (op0, r);
 	}
+      /* Normal *mov<mode>_internal pattern will handle
+	 unaligned loads just fine if misaligned_operand
+	 is true, and without the UNSPEC it can be combined
+	 with arithmetic instructions.  */
+      else if (misaligned_operand (op1, GET_MODE (op1)))
+	emit_insn (gen_rtx_SET (VOIDmode, op0, op1));
       else
 	emit_insn (load_unaligned (op0, op1));
     }
@@ -16634,7 +16640,7 @@ ix86_avx256_split_vector_move_misalign (
 void
 ix86_expand_vector_move_misalign (enum machine_mode mode, rtx operands[])
 {
-  rtx op0, op1, m;
+  rtx op0, op1, orig_op0 = NULL_RTX, m;
   rtx (*load_unaligned) (rtx, rtx);
   rtx (*store_unaligned) (rtx, rtx);
@@ -16647,7 +16653,16 @@ ix86_expand_vector_move_misalign (enum m
     {
     case MODE_VECTOR_INT:
     case MODE_INT:
-      op0 = gen_lowpart (V16SImode, op0);
+      if (GET_MODE (op0) != V16SImode)
+	{
+	  if (!MEM_P (op0))
+	    {
+	      orig_op0 = op0;
+	      op0 = gen_reg_rtx (V16SImode);
+	    }
+	  else
+	    op0 = gen_lowpart (V16SImode, op0);
+	}
       op1 = gen_lowpart (V16SImode, op1);
       /* FALLTHRU */
@@ -16676,6 +16691,8 @@ ix86_expand_vector_move_misalign (enum m
 	emit_insn (store_unaligned (op0, op1));
       else
 	gcc_unreachable ();
+      if (orig_op0)
+	emit_move_insn (orig_op0, gen_lowpart (GET_MODE (orig_op0), op0));
       break;

     default:
@@ -16692,12 +16709,23 @@ ix86_expand_vector_move_misalign (enum m
     {
     case MODE_VECTOR_INT:
     case MODE_INT:
-      op0 = gen_lowpart (V32QImode, op0);
+      if (GET_MODE (op0) != V32QImode)
+	{
+	  if (!MEM_P (op0))
+	    {
+	      orig_op0 = op0;
+	      op0 = gen_reg_rtx (V32QImode);
+	    }
+	  else
+	    op0 = gen_lowpart (V32QImode, op0);
+	}
       op1 = gen_lowpart (V32QImode, op1);
       /* FALLTHRU */

     case MODE_VECTOR_FLOAT:
       ix86_avx256_split_vector_move_misalign (op0, op1);
+      if (orig_op0)
+	emit_move_insn (orig_op0, gen_lowpart (GET_MODE (orig_op0), op0));
       break;

     default:
@@ -16709,15 +16737,30 @@ ix86_expand_vector_move_misalign (enum m

   if (MEM_P (op1))
     {
+      /* Normal *mov<mode>_internal pattern will handle
+	 unaligned loads just fine if misaligned_operand
+	 is true, and without the UNSPEC it can be combined
+	 with arithmetic instructions.  */
+      if (TARGET_AVX
+	  && (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
+	      || GET_MODE_CLASS (mode) == MODE_VECTOR_FLOAT)
+	  && misaligned_operand (op1, GET_MODE (op1)))
+	emit_insn (gen_rtx_SET (VOIDmode, op0, op1));
       /* ??? If we have typed data, then it would appear that using
 	 movdqu is the only way to get unaligned data loaded with
 	 integer type.  */
-      if (TARGET_SSE2 && GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+      else if (TARGET_SSE2 && GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
 	{
-	  op0 = gen_lowpart (V16QImode, op0);
+	  if (GET_MODE (op0) != V16QImode)
+	    {
+	      orig_op0 = op0;
+	      op0 = gen_reg_rtx (V16QImode);
+	    }
 	  op1 = gen_lowpart (V16QImode, op1);
 	  /* We will eventually emit movups based on insn attributes.  */
 	  emit_insn (gen_sse2_loaddquv16qi (op0, op1));
+	  if (orig_op0)
+	    emit_move_insn (orig_op0, gen_lowpart (GET_MODE (orig_op0), op0));
 	}
       else if (TARGET_SSE2 && mode == V2DFmode)
 	{
@@ -16765,9 +16808,16 @@ ix86_expand_vector_move_misalign (enum m
 	  || TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL
 	  || optimize_insn_for_size_p ())
 	{
-	  op0 = gen_lowpart (V4SFmode, op0);
+	  if (GET_MODE (op0) != V4SFmode)
+	    {
+	      orig_op0 = op0;
+	      op0 = gen_reg_rtx (V4SFmode);
+	    }
 	  op1 = gen_lowpart (V4SFmode, op1);
 	  emit_insn (gen_sse_loadups (op0, op1));
+	  if (orig_op0)
+	    emit_move_insn (orig_op0,
+			    gen_lowpart (GET_MODE (orig_op0), op0));
 	  return;
 	}