diff mbox

Use TImode for piecewise move in 64-bit mode

Message ID CAMe9rOraZscgHYj37O4kBDozA8-V4A3eR6J8hDg5h5JQ=ZOYSg@mail.gmail.com
State New
Headers show

Commit Message

H.J. Lu Aug. 11, 2016, 3:51 p.m. UTC
On Thu, Aug 11, 2016 at 8:41 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Thu, Aug 11, 2016 at 5:26 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Thu, Aug 11, 2016 at 1:16 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>> On Wed, Aug 10, 2016 at 6:44 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>
>>>>>>> Use TImode for piecewise move in 64-bit mode.  When vector register
>>>>>>> is used for piecewise move, we don't increase stack_alignment_needed
>>>>>>> since vector register spill isn't required for piecewise move.  Since
>>>>>>> stack_realign_needed is set to true by checking stack_alignment_estimated
>>>>>>> set by pseudo vector register usage, we also need to check
>>>>>>> stack_realign_needed to eliminate frame pointer.
>>>>>>
>>>>>> Why only in 64-bit mode? We can use SSE moves also in 32-bit mode.
>>>>>
>>>>> I will extend it to 32-bit mode.
>>>>
>>>> It doesn't work in 32-bit mode due to
>>>>
>>>> #define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (TARGET_64BIT ? TImode : DImode):
>>>>
>>>> /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
>>>> -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2
>>>> -fno-asynchronous-unwind-tables -m32 -S -o x.s x.i
>>>> x.i: In function ‘foo’:
>>>> x.i:6:10: internal compiler error: in by_pieces_ninsns, at expr.c:799
>>>>    return __builtin_mempcpy (dst, src, 32);
>>>>           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> This happens since by_pieces_ninsns determines widest mode by calling
>>> widest_*INT*_mode_for_size, while moves can also use vector-mode
>>> moves. This is an infrastructure problem, and will bite you on 64bit
>>> targets when MOVE_MAX_PIECES returns OImode or XImode size.
>>
>> I opened:
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74113
>>
>>> +#define MOVE_MAX_PIECES \
>>> +  ((TARGET_64BIT \
>>> +    && TARGET_SSE2 \
>>> +    && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
>>> +    && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) ? 16 : UNITS_PER_WORD)
>>>
>>> The above part is OK with an appropriate ??? comment, describing the
>>> infrastructure limitation. Also, please use GET_MODE_SIZE (TImode)
>>> instead of magic constant.
>>>
>>> Can you please submit the realignment patch as a separate follow-up
>>> patch? Let's keep two issues separate.
>>>
>>> Uros.
>>
>> Here is the updated patch.  OK for trunk?
>
> OK, but please do not yet introduce:
>
> +/* No need to dynamically realign the stack here.  */
> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
> +/* Nor use a frame pointer.  */
> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
>
> in the testcases. This should be part of a followup patch.
>
> Thanks,
> Uros.

This is what I checked in.

Thanks.

Comments

Uros Bizjak Aug. 11, 2016, 4:12 p.m. UTC | #1
On Thu, Aug 11, 2016 at 5:51 PM, H.J. Lu <hjl.tools@gmail.com> wrote:

>>>>>>>> Use TImode for piecewise move in 64-bit mode.  When vector register
>>>>>>>> is used for piecewise move, we don't increase stack_alignment_needed
>>>>>>>> since vector register spill isn't required for piecewise move.  Since
>>>>>>>> stack_realign_needed is set to true by checking stack_alignment_estimated
>>>>>>>> set by pseudo vector register usage, we also need to check
>>>>>>>> stack_realign_needed to eliminate frame pointer.
>>>>>>>
>>>>>>> Why only in 64-bit mode? We can use SSE moves also in 32-bit mode.
>>>>>>
>>>>>> I will extend it to 32-bit mode.
>>>>>
>>>>> It doesn't work in 32-bit mode due to
>>>>>
>>>>> #define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (TARGET_64BIT ? TImode : DImode):
>>>>>
>>>>> /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
>>>>> -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2
>>>>> -fno-asynchronous-unwind-tables -m32 -S -o x.s x.i
>>>>> x.i: In function ‘foo’:
>>>>> x.i:6:10: internal compiler error: in by_pieces_ninsns, at expr.c:799
>>>>>    return __builtin_mempcpy (dst, src, 32);
>>>>>           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>
>>>> This happens since by_pieces_ninsns determines widest mode by calling
>>>> widest_*INT*_mode_for_size, while moves can also use vector-mode
>>>> moves. This is an infrastructure problem, and will bite you on 64bit
>>>> targets when MOVE_MAX_PIECES returns OImode or XImode size.
>>>
>>> I opened:
>>>
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74113
>>>
>>>> +#define MOVE_MAX_PIECES \
>>>> +  ((TARGET_64BIT \
>>>> +    && TARGET_SSE2 \
>>>> +    && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
>>>> +    && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) ? 16 : UNITS_PER_WORD)
>>>>
>>>> The above part is OK with an appropriate ??? comment, describing the
>>>> infrastructure limitation. Also, please use GET_MODE_SIZE (TImode)
>>>> instead of magic constant.
>>>>
>>>> Can you please submit the realignment patch as a separate follow-up
>>>> patch? Let's keep two issues separate.
>>>>
>>>> Uros.
>>>
>>> Here is the updated patch.  OK for trunk?
>>
>> OK, but please do not yet introduce:
>>
>> +/* No need to dynamically realign the stack here.  */
>> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
>> +/* Nor use a frame pointer.  */
>> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
>>
>> in the testcases. This should be part of a followup patch.
>
> This is what I checked in.

Playing a bit with a patched gcc, I found no stack realignment insns
in the assembly of the provided testcases. However, if
-mincoming-stack-boundary=3 is added, then no vector instructions are
generated (and also no realignment insns).

Uros.
Uros Bizjak Aug. 11, 2016, 4:25 p.m. UTC | #2
On Thu, Aug 11, 2016 at 6:12 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Thu, Aug 11, 2016 at 5:51 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>
>>>>>>>>> Use TImode for piecewise move in 64-bit mode.  When vector register
>>>>>>>>> is used for piecewise move, we don't increase stack_alignment_needed
>>>>>>>>> since vector register spill isn't required for piecewise move.  Since
>>>>>>>>> stack_realign_needed is set to true by checking stack_alignment_estimated
>>>>>>>>> set by pseudo vector register usage, we also need to check
>>>>>>>>> stack_realign_needed to eliminate frame pointer.
>>>>>>>>
>>>>>>>> Why only in 64-bit mode? We can use SSE moves also in 32-bit mode.
>>>>>>>
>>>>>>> I will extend it to 32-bit mode.
>>>>>>
>>>>>> It doesn't work in 32-bit mode due to
>>>>>>
>>>>>> #define MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (TARGET_64BIT ? TImode : DImode):
>>>>>>
>>>>>> /export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
>>>>>> -B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -O2
>>>>>> -fno-asynchronous-unwind-tables -m32 -S -o x.s x.i
>>>>>> x.i: In function ‘foo’:
>>>>>> x.i:6:10: internal compiler error: in by_pieces_ninsns, at expr.c:799
>>>>>>    return __builtin_mempcpy (dst, src, 32);
>>>>>>           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>> This happens since by_pieces_ninsns determines widest mode by calling
>>>>> widest_*INT*_mode_for_size, while moves can also use vector-mode
>>>>> moves. This is an infrastructure problem, and will bite you on 64bit
>>>>> targets when MOVE_MAX_PIECES returns OImode or XImode size.
>>>>
>>>> I opened:
>>>>
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74113
>>>>
>>>>> +#define MOVE_MAX_PIECES \
>>>>> +  ((TARGET_64BIT \
>>>>> +    && TARGET_SSE2 \
>>>>> +    && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
>>>>> +    && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) ? 16 : UNITS_PER_WORD)
>>>>>
>>>>> The above part is OK with an appropriate ??? comment, describing the
>>>>> infrastructure limitation. Also, please use GET_MODE_SIZE (TImode)
>>>>> instead of magic constant.
>>>>>
>>>>> Can you please submit the realignment patch as a separate follow-up
>>>>> patch? Let's keep two issues separate.
>>>>>
>>>>> Uros.
>>>>
>>>> Here is the updated patch.  OK for trunk?
>>>
>>> OK, but please do not yet introduce:
>>>
>>> +/* No need to dynamically realign the stack here.  */
>>> +/* { dg-final { scan-assembler-not "and\[^\n\r]*%\[re\]sp" } } */
>>> +/* Nor use a frame pointer.  */
>>> +/* { dg-final { scan-assembler-not "%\[re\]bp" } } */
>>>
>>> in the testcases. This should be part of a followup patch.
>>
>> This is what I checked in.
>
> Playing a bit with a patched gcc, I found no stack realignment insns
> in the assembly of the provided testcases. However, if
> -mincoming-stack-boundary=3 is added, then no vector instructions are
> generated (and also no realignment insns).

Ah yes, STV pass is disabled for -mincoming-stack-boundary={2,3}.

It looks that we don't need extra realignment patch.

Uros.
diff mbox

Patch

From e52d6c1aed5a688fd28b9f6e005c6d9f3857b9cf Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Mon, 8 Aug 2016 09:08:01 -0700
Subject: [PATCH] Use TImode for piecewise move in 64-bit mode

Use TImode for piecewise move in 64-bit mode.  We should use TImode in
32-bit mode and use OImode or XImode if they are available.  But since
by_pieces_ninsns determines the widest mode with MAX_FIXED_MODE_SIZE,
we can only use TImode in 64-bit mode.

gcc/

	* config/i386/i386.h (MOVE_MAX_PIECES): Use TImode in 64-bit
	mode if unaligned SSE load and store are optimal.

gcc/testsuite/

	* gcc.target/i386/pieces-memcpy-1.c: New test.
	* gcc.target/i386/pieces-memcpy-2.c: Likewise.
	* gcc.target/i386/pieces-memcpy-3.c: Likewise.
	* gcc.target/i386/pieces-memcpy-4.c: Likewise.
	* gcc.target/i386/pieces-memcpy-5.c: Likewise.
	* gcc.target/i386/pieces-memcpy-6.c: Likewise.
---
 gcc/config/i386/i386.h                          | 14 ++++++++++++--
 gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c | 13 +++++++++++++
 gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c | 13 +++++++++++++
 gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c | 13 +++++++++++++
 gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c | 13 +++++++++++++
 gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c | 13 +++++++++++++
 gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c | 13 +++++++++++++
 7 files changed, 90 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 9b66264..8751143 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1950,8 +1950,18 @@  typedef struct ix86_args {
 
 /* MOVE_MAX_PIECES is the number of bytes at a time which we can
    move efficiently, as opposed to  MOVE_MAX which is the maximum
-   number of bytes we can move with a single instruction.  */
-#define MOVE_MAX_PIECES UNITS_PER_WORD
+   number of bytes we can move with a single instruction.
+
+   ??? We should use TImode in 32-bit mode and use OImode or XImode
+   if they are available.  But since by_pieces_ninsns determines the
+   widest mode with MAX_FIXED_MODE_SIZE, we can only use TImode in
+   64-bit mode.  */
+#define MOVE_MAX_PIECES \
+  ((TARGET_64BIT \
+    && TARGET_SSE2 \
+    && TARGET_SSE_UNALIGNED_LOAD_OPTIMAL \
+    && TARGET_SSE_UNALIGNED_STORE_OPTIMAL) \
+   ? GET_MODE_SIZE (TImode) : UNITS_PER_WORD)
 
 /* If a memory-to-memory move would take MOVE_RATIO or more simple
    move-instruction pairs, we will do a movmem or libcall instead.
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c
new file mode 100644
index 0000000..22202c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-1.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 64);
+}
+
+/* { dg-final { scan-assembler-times "movdqu\[ \\t\]+\[^\n\]*%xmm" 4 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 4 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c
new file mode 100644
index 0000000..bc4f05b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-2.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 33);
+}
+
+/* { dg-final { scan-assembler-times "movdqu\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c
new file mode 100644
index 0000000..84d6676
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-3.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mno-avx -msse2 -mtune=generic" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 17);
+}
+
+/* { dg-final { scan-assembler-times "movdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "movups\[ \\t\]+\[^\n\]*%xmm" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c
new file mode 100644
index 0000000..64e8921
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-4.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=generic" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 18);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovups\[ \\t\]+\[^\n\]*%xmm" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c
new file mode 100644
index 0000000..3c464c3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-5.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mavx512f -mtune=generic" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 19);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 1 } } */
+/* { dg-final { scan-assembler-times "vmovups\[ \\t\]+\[^\n\]*%xmm" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c b/gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c
new file mode 100644
index 0000000..cdb00e0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c
@@ -0,0 +1,13 @@ 
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mno-avx2 -mavx -mtune=sandybridge" } */
+
+extern char *dst, *src;
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, 33);
+}
+
+/* { dg-final { scan-assembler-times "vmovdqu\[ \\t\]+\[^\n\]*%xmm" 2 } } */
+/* { dg-final { scan-assembler-times "vmovups\[ \\t\]+\[^\n\]*%xmm" 2 } } */
-- 
2.7.4