Message ID | CAMe9rOraZscgHYj37O4kBDozA8-V4A3eR6J8hDg5h5JQ=ZOYSg@mail.gmail.com |
---|---|
State | New |
Headers | show |
On Thu, Aug 11, 2016 at 5:51 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >>>>>>>> Use TImode for piecewise move in 64-bit mode. When vector register >>>>>>>> is used for piecewise move, we don't increase stack_alignment_needed >>>>>>>> since vector register spill isn't required for piecewise move. Since >>>>>>>> stack_realign_needed is set to true by checking stack_alignment_estimated >>>>>>>> set by pseudo vector register usage, we also need to check >>>>>>>> stack_realign_needed to eliminate frame pointer. >>>>>>> >>>>>>> Why only in 64-bit mode? We can use SSE moves also in 32-bit mode. >>>>>> >>>>>> I will extend it to 32-bit mode. >>>>> >>>>> It doesn't work in 32-bit mode due to >>>>> >>>>> #define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (TARGET_64BIT ? TImode : DImode): >>>>> >>>>> /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc >>>>> -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2 >>>>> -fno-asynchronous-unwind-tables -m32 -S -o x.s x.i >>>>> x.i: In function ‘foo’: >>>>> x.i:6:10: internal compiler error: in by_pieces_ninsns, at expr.c:799 >>>>> return __builtin_mempcpy (dst, src, 32); >>>>> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>>> >>>> This happens since by_pieces_ninsns determines widest mode by calling >>>> widest_*INT*_mode_for_size, while moves can also use vector-mode >>>> moves. This is an infrastructure problem, and will bite you on 64bit >>>> targets when MOVE_MAX_PIECES returns OImode or XImode size. >>> >>> I opened: >>> >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74113 >>> >>>> +#define MOVE_MAX_PIECES \ >>>> + ((TARGET_64BIT \ >>>> + && TARGET_SSE2 \ >>>> + && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \ >>>> + && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) ? 16 : UNITS_PER_WORD) >>>> >>>> The above part is OK with an appropriate ??? comment, describing the >>>> infrastructure limitation. Also, please use GET_MODE_SIZE (TImode) >>>> instead of magic constant. >>>> >>>> Can you please submit the realignment patch as a separate follow-up >>>> patch? Let's keep two issues separate. >>>> >>>> Uros. >>> >>> Here is the updated patch. OK for trunk? >> >> OK, but please do not yet introduce: >> >> +/* No need to dynamically realign the stack here. */ >> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */ >> +/* Nor use a frame pointer. */ >> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */ >> >> in the testcases. This should be part of a followup patch. > > This is what I checked in. Playing a bit with a patched gcc, I found no stack realignment insns in the assembly of the provided testcases. However, if -mincoming-stack-boundary=3 is added, then no vector instructions are generated (and also no realignment insns). Uros.
On Thu, Aug 11, 2016 at 6:12 PM, Uros Bizjak <ubizjak@gmail.com> wrote: > On Thu, Aug 11, 2016 at 5:51 PM, H.J. Lu <hjl.tools@gmail.com> wrote: > >>>>>>>>> Use TImode for piecewise move in 64-bit mode. When vector register >>>>>>>>> is used for piecewise move, we don't increase stack_alignment_needed >>>>>>>>> since vector register spill isn't required for piecewise move. Since >>>>>>>>> stack_realign_needed is set to true by checking stack_alignment_estimated >>>>>>>>> set by pseudo vector register usage, we also need to check >>>>>>>>> stack_realign_needed to eliminate frame pointer. >>>>>>>> >>>>>>>> Why only in 64-bit mode? We can use SSE moves also in 32-bit mode. >>>>>>> >>>>>>> I will extend it to 32-bit mode. >>>>>> >>>>>> It doesn't work in 32-bit mode due to >>>>>> >>>>>> #define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (TARGET_64BIT ? TImode : DImode): >>>>>> >>>>>> /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc >>>>>> -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2 >>>>>> -fno-asynchronous-unwind-tables -m32 -S -o x.s x.i >>>>>> x.i: In function ‘foo’: >>>>>> x.i:6:10: internal compiler error: in by_pieces_ninsns, at expr.c:799 >>>>>> return __builtin_mempcpy (dst, src, 32); >>>>>> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>>>> >>>>> This happens since by_pieces_ninsns determines widest mode by calling >>>>> widest_*INT*_mode_for_size, while moves can also use vector-mode >>>>> moves. This is an infrastructure problem, and will bite you on 64bit >>>>> targets when MOVE_MAX_PIECES returns OImode or XImode size. >>>> >>>> I opened: >>>> >>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74113 >>>> >>>>> +#define MOVE_MAX_PIECES \ >>>>> + ((TARGET_64BIT \ >>>>> + && TARGET_SSE2 \ >>>>> + && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \ >>>>> + && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) ? 16 : UNITS_PER_WORD) >>>>> >>>>> The above part is OK with an appropriate ??? comment, describing the >>>>> infrastructure limitation. Also, please use GET_MODE_SIZE (TImode) >>>>> instead of magic constant. >>>>> >>>>> Can you please submit the realignment patch as a separate follow-up >>>>> patch? Let's keep two issues separate. >>>>> >>>>> Uros. >>>> >>>> Here is the updated patch. OK for trunk? >>> >>> OK, but please do not yet introduce: >>> >>> +/* No need to dynamically realign the stack here. */ >>> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */ >>> +/* Nor use a frame pointer. */ >>> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */ >>> >>> in the testcases. This should be part of a followup patch. >> >> This is what I checked in. > > Playing a bit with a patched gcc, I found no stack realignment insns > in the assembly of the provided testcases. However, if > -mincoming-stack-boundary=3 is added, then no vector instructions are > generated (and also no realignment insns). Ah yes, STV pass is disabled for -mincoming-stack-boundary={2,3}. It looks that we don't need extra realignment patch. Uros.
From e52d6c1aed5a688fd28b9f6e005c6d9f3857b9cf Mon Sep 17 00:00:00 2001 From: "H.J. Lu" <hjl.tools@gmail.com> Date: Mon, 8 Aug 2016 09:08:01 -0700 Subject: [PATCH] Use TImode for piecewise move in 64-bit mode Use TImode for piecewise move in 64-bit mode. We should use TImode in 32-bit mode and use OImode or XImode if they are available. But since by_pieces_ninsns determines the widest mode with MAX_FIXED_MODE_SIZE, we can only use TImode in 64-bit mode. gcc/ * config/i386/i386.h (MOVE_MAX_PIECES): Use TImode in 64-bit mode if unaligned SSE load and store are optimal. gcc/testsuite/ * gcc.target/i386/pieces-memcpy-1.c: New test. * gcc.target/i386/pieces-memcpy-2.c: Likewise. * gcc.target/i386/pieces-memcpy-3.c: Likewise. * gcc.target/i386/pieces-memcpy-4.c: Likewise. * gcc.target/i386/pieces-memcpy-5.c: Likewise. * gcc.target/i386/pieces-memcpy-6.c: Likewise. --- gcc/config/i386/i386.h | 14 ++++++++++++-- gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c | 13 +++++++++++++ gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c | 13 +++++++++++++ gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c | 13 +++++++++++++ gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c | 13 +++++++++++++ gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c | 13 +++++++++++++ gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c | 13 +++++++++++++ 7 files changed, 90 insertions(+), 2 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index 9b66264..8751143 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -1950,8 +1950,18 @@ typedef struct ix86_args { /* MOVE_MAX_PIECES is the number of bytes at a time which we can move efficiently, as opposed to MOVE_MAX which is the maximum - number of bytes we can move with a single instruction. */ -#define MOVE_MAX_PIECES UNITS_PER_WORD + number of bytes we can move with a single instruction. + + ??? We should use TImode in 32-bit mode and use OImode or XImode + if they are available. But since by_pieces_ninsns determines the + widest mode with MAX_FIXED_MODE_SIZE, we can only use TImode in + 64-bit mode. */ +#define MOVE_MAX_PIECES \ + ((TARGET_64BIT \ + && TARGET_SSE2 \ + && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \ + && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \ + ? GET_MODE_SIZE (TImode) : UNITS_PER_WORD) /* If a memory-to-memory move would take MOVE_RATIO or more simple move-instruction pairs, we will do a movmem or libcall instead. diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c new file mode 100644 index 0000000..22202c2 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c @@ -0,0 +1,13 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */ + +extern char *dst, *src; + +void +foo (void) +{ + __builtin_memcpy (dst, src, 64); +} + +/* { dg-final { scan-assembler-times "movdqu\[ \\t\]+\[^\n\]*%xmm" 4 } } */ +/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 4 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c new file mode 100644 index 0000000..bc4f05b --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c @@ -0,0 +1,13 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */ + +extern char *dst, *src; + +void +foo (void) +{ + __builtin_memcpy (dst, src, 33); +} + +/* { dg-final { scan-assembler-times "movdqu\[ \\t\]+\[^\n\]*%xmm" 2 } } */ +/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 2 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c new file mode 100644 index 0000000..84d6676 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c @@ -0,0 +1,13 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */ + +extern char *dst, *src; + +void +foo (void) +{ + __builtin_memcpy (dst, src, 17); +} + +/* { dg-final { scan-assembler-times "movdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */ +/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c new file mode 100644 index 0000000..64e8921 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c @@ -0,0 +1,13 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -mno-avx2 -mavx -mtune=generic" } */ + +extern char *dst, *src; + +void +foo (void) +{ + __builtin_memcpy (dst, src, 18); +} + +/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */ +/* { dg-final { scan-assembler-times "vmovups\[ \\t\]+\[^\n\]*%xmm" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c new file mode 100644 index 0000000..3c464c3 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c @@ -0,0 +1,13 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -mavx512f -mtune=generic" } */ + +extern char *dst, *src; + +void +foo (void) +{ + __builtin_memcpy (dst, src, 19); +} + +/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */ +/* { dg-final { scan-assembler-times "vmovups\[ \\t\]+\[^\n\]*%xmm" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c new file mode 100644 index 0000000..cdb00e0 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c @@ -0,0 +1,13 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -mno-avx2 -mavx -mtune=sandybridge" } */ + +extern char *dst, *src; + +void +foo (void) +{ + __builtin_memcpy (dst, src, 33); +} + +/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 2 } } */ +/* { dg-final { scan-assembler-times "vmovups\[ \\t\]+\[^\n\]*%xmm" 2 } } */ -- 2.7.4