Use xchg for -Os (PR target/92549)
diff mbox series

Message ID 20191119090435.GE4650@tucnak
State New
Headers show
Series
  • Use xchg for -Os (PR target/92549)
Related show

Commit Message

Jakub Jelinek Nov. 19, 2019, 9:04 a.m. UTC
Hi!

xchg instruction is smaller, in some cases much smaller than 3 moves,
(e.g. in the testcase 2 bytes vs. 8 bytes), and is not a performance
disaster, but from Agner Fog tables and
https://stackoverflow.com/questions/45766444/why-is-xchg-reg-reg-a-3-micro-op-instruction-on-modern-intel-architectures
it doesn't seem to be something we'd want to use when optimizing for speed,
at least not on Intel.

While we have *swap<mode> patterns, those are very unlikely to be triggered
during combine, usually we have different pseudos in there and the actual
need for swapping is only materialized during RA.

The following patch does it when optimizing the insn for size only.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2019-11-19  Jakub Jelinek  <jakub@redhat.com>

	PR target/92549
	* config/i386/i386.md (peephole2 for *swap<mode>): New peephole2.

	* gcc.target/i386/pr92549.c: New test.


	Jakub

Comments

Richard Biener Nov. 19, 2019, 9:19 a.m. UTC | #1
On Tue, Nov 19, 2019 at 10:04 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> Hi!
>
> xchg instruction is smaller, in some cases much smaller than 3 moves,
> (e.g. in the testcase 2 bytes vs. 8 bytes), and is not a performance
> disaster, but from Agner Fog tables and
> https://stackoverflow.com/questions/45766444/why-is-xchg-reg-reg-a-3-micro-op-instruction-on-modern-intel-architectures
> it doesn't seem to be something we'd want to use when optimizing for speed,
> at least not on Intel.
>
> While we have *swap<mode> patterns, those are very unlikely to be triggered
> during combine, usually we have different pseudos in there and the actual
> need for swapping is only materialized during RA.
>
> The following patch does it when optimizing the insn for size only.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

I wonder if IRA/LRA should be made aware of an xchg instruction
(and it's cost?) so it knows it doesn't actually need a temporary register?

Richard.

> 2019-11-19  Jakub Jelinek  <jakub@redhat.com>
>
>         PR target/92549
>         * config/i386/i386.md (peephole2 for *swap<mode>): New peephole2.
>
>         * gcc.target/i386/pr92549.c: New test.
>
> --- gcc/config/i386/i386.md.jj  2019-10-28 22:16:14.583008121 +0100
> +++ gcc/config/i386/i386.md     2019-11-18 17:06:36.050742545 +0100
> @@ -2787,6 +2787,17 @@
>     (set_attr "amdfam10_decode" "double")
>     (set_attr "bdver1_decode" "double")])
>
> +(define_peephole2
> +  [(set (match_operand:SWI 0 "register_operand")
> +       (match_operand:SWI 1 "register_operand"))
> +   (set (match_dup 1)
> +       (match_operand:SWI 2 "register_operand"))
> +   (set (match_dup 2) (match_dup 0))]
> +  "peep2_reg_dead_p (3, operands[0])
> +   && optimize_insn_for_size_p ()"
> +  [(parallel [(set (match_dup 1) (match_dup 2))
> +             (set (match_dup 2) (match_dup 1))])])
> +
>  (define_expand "movstrict<mode>"
>    [(set (strict_low_part (match_operand:SWI12 0 "register_operand"))
>         (match_operand:SWI12 1 "general_operand"))]
> --- gcc/testsuite/gcc.target/i386/pr92549.c.jj  2019-11-18 17:48:35.533177377 +0100
> +++ gcc/testsuite/gcc.target/i386/pr92549.c     2019-11-18 17:49:31.888336444 +0100
> @@ -0,0 +1,28 @@
> +/* PR target/92549 */
> +/* { dg-do compile } */
> +/* { dg-options "-Os -masm=att" } */
> +/* { dg-final { scan-assembler "xchgl" } } */
> +
> +#ifdef __i386__
> +#define R , regparm (2)
> +#else
> +#define R
> +#endif
> +
> +__attribute__((noipa R)) int
> +bar (int a, int b)
> +{
> +  return b - a + 5;
> +}
> +
> +__attribute__((noipa R)) int
> +foo (int a, int b)
> +{
> +  return 1 + bar (b, a);
> +}
> +
> +int
> +main ()
> +{
> +  return foo (39, 3);
> +}
>
>         Jakub
>
Uros Bizjak Nov. 19, 2019, 9:20 a.m. UTC | #2
On Tue, Nov 19, 2019 at 10:04 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> Hi!
>
> xchg instruction is smaller, in some cases much smaller than 3 moves,
> (e.g. in the testcase 2 bytes vs. 8 bytes), and is not a performance
> disaster, but from Agner Fog tables and
> https://stackoverflow.com/questions/45766444/why-is-xchg-reg-reg-a-3-micro-op-instruction-on-modern-intel-architectures
> it doesn't seem to be something we'd want to use when optimizing for speed,
> at least not on Intel.
>
> While we have *swap<mode> patterns, those are very unlikely to be triggered
> during combine, usually we have different pseudos in there and the actual
> need for swapping is only materialized during RA.
>
> The following patch does it when optimizing the insn for size only.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> 2019-11-19  Jakub Jelinek  <jakub@redhat.com>
>
>         PR target/92549
>         * config/i386/i386.md (peephole2 for *swap<mode>): New peephole2.
>
>         * gcc.target/i386/pr92549.c: New test.

OK with some test adjustments.

Thanks,
Uros.

> --- gcc/config/i386/i386.md.jj  2019-10-28 22:16:14.583008121 +0100
> +++ gcc/config/i386/i386.md     2019-11-18 17:06:36.050742545 +0100
> @@ -2787,6 +2787,17 @@
>     (set_attr "amdfam10_decode" "double")
>     (set_attr "bdver1_decode" "double")])
>
> +(define_peephole2
> +  [(set (match_operand:SWI 0 "register_operand")
> +       (match_operand:SWI 1 "register_operand"))
> +   (set (match_dup 1)
> +       (match_operand:SWI 2 "register_operand"))
> +   (set (match_dup 2) (match_dup 0))]
> +  "peep2_reg_dead_p (3, operands[0])
> +   && optimize_insn_for_size_p ()"
> +  [(parallel [(set (match_dup 1) (match_dup 2))
> +             (set (match_dup 2) (match_dup 1))])])
> +
>  (define_expand "movstrict<mode>"
>    [(set (strict_low_part (match_operand:SWI12 0 "register_operand"))
>         (match_operand:SWI12 1 "general_operand"))]
> --- gcc/testsuite/gcc.target/i386/pr92549.c.jj  2019-11-18 17:48:35.533177377 +0100
> +++ gcc/testsuite/gcc.target/i386/pr92549.c     2019-11-18 17:49:31.888336444 +0100
> @@ -0,0 +1,28 @@
> +/* PR target/92549 */
> +/* { dg-do compile } */
> +/* { dg-options "-Os -masm=att" } */
> +/* { dg-final { scan-assembler "xchgl" } } */
> +
> +#ifdef __i386__
> +#define R , regparm (2)
> +#else
> +#define R
> +#endif

Please use

*/ { dg-additional-options "-mregparm=2" { target ia32 } } */

instead.

> +
> +__attribute__((noipa R)) int
> +bar (int a, int b)
> +{
> +  return b - a + 5;
> +}
> +
> +__attribute__((noipa R)) int
> +foo (int a, int b)
> +{
> +  return 1 + bar (b, a);
> +}
> +
> +int
> +main ()
> +{
> +  return foo (39, 3);
> +}

No need for main in compile-only tests.

>         Jakub
>
Jan Hubicka Nov. 19, 2019, 9:44 a.m. UTC | #3
> Hi!
> 
> xchg instruction is smaller, in some cases much smaller than 3 moves,
> (e.g. in the testcase 2 bytes vs. 8 bytes), and is not a performance
> disaster, but from Agner Fog tables and
> https://stackoverflow.com/questions/45766444/why-is-xchg-reg-reg-a-3-micro-op-instruction-on-modern-intel-architectures
> it doesn't seem to be something we'd want to use when optimizing for speed,
> at least not on Intel.
> 
> While we have *swap<mode> patterns, those are very unlikely to be triggered
> during combine, usually we have different pseudos in there and the actual
> need for swapping is only materialized during RA.
> 
> The following patch does it when optimizing the insn for size only.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> 
> 2019-11-19  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR target/92549
> 	* config/i386/i386.md (peephole2 for *swap<mode>): New peephole2.
> 
> 	* gcc.target/i386/pr92549.c: New test.

Yep, it makes sense to me.
I remember poking around xchg a while ago.  We used to generate it
pre-reload which had various unexpected consequences, but doing it via
peephole seems safe. So patch is OK.

Honza
> 
> --- gcc/config/i386/i386.md.jj	2019-10-28 22:16:14.583008121 +0100
> +++ gcc/config/i386/i386.md	2019-11-18 17:06:36.050742545 +0100
> @@ -2787,6 +2787,17 @@
>     (set_attr "amdfam10_decode" "double")
>     (set_attr "bdver1_decode" "double")])
>  
> +(define_peephole2
> +  [(set (match_operand:SWI 0 "register_operand")
> +	(match_operand:SWI 1 "register_operand"))
> +   (set (match_dup 1)
> +	(match_operand:SWI 2 "register_operand"))
> +   (set (match_dup 2) (match_dup 0))]
> +  "peep2_reg_dead_p (3, operands[0])
> +   && optimize_insn_for_size_p ()"
> +  [(parallel [(set (match_dup 1) (match_dup 2))
> +	      (set (match_dup 2) (match_dup 1))])])
> +
>  (define_expand "movstrict<mode>"
>    [(set (strict_low_part (match_operand:SWI12 0 "register_operand"))
>  	(match_operand:SWI12 1 "general_operand"))]
> --- gcc/testsuite/gcc.target/i386/pr92549.c.jj	2019-11-18 17:48:35.533177377 +0100
> +++ gcc/testsuite/gcc.target/i386/pr92549.c	2019-11-18 17:49:31.888336444 +0100
> @@ -0,0 +1,28 @@
> +/* PR target/92549 */
> +/* { dg-do compile } */
> +/* { dg-options "-Os -masm=att" } */
> +/* { dg-final { scan-assembler "xchgl" } } */
> +
> +#ifdef __i386__
> +#define R , regparm (2)
> +#else
> +#define R
> +#endif
> +
> +__attribute__((noipa R)) int
> +bar (int a, int b)
> +{
> +  return b - a + 5;
> +}
> +
> +__attribute__((noipa R)) int
> +foo (int a, int b)
> +{
> +  return 1 + bar (b, a);
> +}
> +
> +int
> +main ()
> +{
> +  return foo (39, 3);
> +}
> 
> 	Jakub
>
Vladimir Makarov Nov. 19, 2019, 3:53 p.m. UTC | #4
On 11/19/19 4:19 AM, Richard Biener wrote:
> On Tue, Nov 19, 2019 at 10:04 AM Jakub Jelinek <jakub@redhat.com> wrote:
>> Hi!
>>
>> xchg instruction is smaller, in some cases much smaller than 3 moves,
>> (e.g. in the testcase 2 bytes vs. 8 bytes), and is not a performance
>> disaster, but from Agner Fog tables and
>> https://stackoverflow.com/questions/45766444/why-is-xchg-reg-reg-a-3-micro-op-instruction-on-modern-intel-architectures
>> it doesn't seem to be something we'd want to use when optimizing for speed,
>> at least not on Intel.
>>
>> While we have *swap<mode> patterns, those are very unlikely to be triggered
>> during combine, usually we have different pseudos in there and the actual
>> need for swapping is only materialized during RA.
>>
>> The following patch does it when optimizing the insn for size only.
>>
>> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> I wonder if IRA/LRA should be made aware of an xchg instruction
> (and it's cost?) so it knows it doesn't actually need a temporary register?
>
>
There is one place where it could be used.  It is 
ira-emit.c::modify_move_list where register shuffling on region border 
happens.  It is possible that pseudo1:hr1, pseudo2:hr2 inside a region 
should be pseudo1:hr2 and pseudo2:hr1 outside the region.

First I used a trick of 3 xor insns to break cycle w/o new temp 
creation. But such move cycles were so very rare events that I just 
finished with new temp creation.  That is extremely rare case, the goal 
IRA to decrease any move generation on region borders.  But if somebody 
implements xchg usage there, I'll not be against this.
Jan Hubicka Dec. 1, 2019, 2:43 p.m. UTC | #5
> Hi!
> 
> xchg instruction is smaller, in some cases much smaller than 3 moves,
> (e.g. in the testcase 2 bytes vs. 8 bytes), and is not a performance
> disaster, but from Agner Fog tables and
> https://stackoverflow.com/questions/45766444/why-is-xchg-reg-reg-a-3-micro-op-instruction-on-modern-intel-architectures
> it doesn't seem to be something we'd want to use when optimizing for speed,
> at least not on Intel.
> 
> While we have *swap<mode> patterns, those are very unlikely to be triggered
> during combine, usually we have different pseudos in there and the actual
> need for swapping is only materialized during RA.
> 
> The following patch does it when optimizing the insn for size only.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> 
> 2019-11-19  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR target/92549
> 	* config/i386/i386.md (peephole2 for *swap<mode>): New peephole2.
> 
> 	* gcc.target/i386/pr92549.c: New test.

It is very hard to get a testcase, unforutnately, but I got the
following (locally non-reproducible) failure while building firefox with
LTO+FDO:

[task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -   1080 | }
[task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -        |
[task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -  (insn 5555 1389 1390 41 (parallel [
[task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -              (set (reg:SI 24 xmm4 [orig:187 SR.3778 ] [187])
[task 2019-12-01T14:38:04.166Z] 14:38:04     INFO - (reg:SI 23 xmm3 [orig:104 SR.3780 ] [104]))
[task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -              (set (reg:SI 23 xmm3 [orig:104 SR.3780 ] [104])
[task 2019-12-01T14:38:04.166Z] 14:38:04     INFO - (reg:SI 24 xmm4 [orig:187 SR.3778 ] [187]))
[task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -          ])
"/builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/WritingModes.h":1117:0
78 {*swapsi}
[task 2019-12-01T14:38:04.167Z] 14:38:04     INFO -       (nil))
[task 2019-12-01T14:38:04.167Z] 14:38:04     INFO -  during RTL pass: rnreg

I guess the problem is that there is no xchange in SSE instruction set,
so peephle needs to be more restrictive?

Honza
Jakub Jelinek Dec. 1, 2019, 5:55 p.m. UTC | #6
On Sun, Dec 01, 2019 at  03:43:37PM +0100, Jan Hubicka wrote:
> > 	PR target/92549
> > 	* config/i386/i386.md (peephole2 for *swap<mode>): New peephole2.
> > 
> > 	* gcc.target/i386/pr92549.c: New test.
> 
> It is very hard to get a testcase, unforutnately, but I got the
> following (locally non-reproducible) failure while building firefox with
> LTO+FDO:
> 
> [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -   1080 | }
> [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -        |
> [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -  (insn 5555 1389 1390 41 (parallel [
> [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -              (set (reg:SI 24 xmm4 [orig:187 SR.3778 ] [187])
> [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO - (reg:SI 23 xmm3 [orig:104 SR.3780 ] [104]))
> [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -              (set (reg:SI 23 xmm3 [orig:104 SR.3780 ] [104])
> [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO - (reg:SI 24 xmm4 [orig:187 SR.3778 ] [187]))
> [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -          ])
> "/builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/WritingModes.h":1117:0
> 78 {*swapsi}
> [task 2019-12-01T14:38:04.167Z] 14:38:04     INFO -       (nil))
> [task 2019-12-01T14:38:04.167Z] 14:38:04     INFO -  during RTL pass: rnreg
> 
> I guess the problem is that there is no xchange in SSE instruction set,
> so peephle needs to be more restrictive?

I'll have a look tomorrow.

	Jakub
Uros Bizjak Dec. 2, 2019, 7:28 a.m. UTC | #7
On Sun, Dec 1, 2019 at 6:55 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Sun, Dec 01, 2019 at  03:43:37PM +0100, Jan Hubicka wrote:
> > >     PR target/92549
> > >     * config/i386/i386.md (peephole2 for *swap<mode>): New peephole2.
> > >
> > >     * gcc.target/i386/pr92549.c: New test.
> >
> > It is very hard to get a testcase, unforutnately, but I got the
> > following (locally non-reproducible) failure while building firefox with
> > LTO+FDO:
> >
> > [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -   1080 | }
> > [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -        |
> > [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -  (insn 5555 1389 1390 41 (parallel [
> > [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -              (set (reg:SI 24 xmm4 [orig:187 SR.3778 ] [187])
> > [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO - (reg:SI 23 xmm3 [orig:104 SR.3780 ] [104]))
> > [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -              (set (reg:SI 23 xmm3 [orig:104 SR.3780 ] [104])
> > [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO - (reg:SI 24 xmm4 [orig:187 SR.3778 ] [187]))
> > [task 2019-12-01T14:38:04.166Z] 14:38:04     INFO -          ])
> > "/builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/WritingModes.h":1117:0
> > 78 {*swapsi}
> > [task 2019-12-01T14:38:04.167Z] 14:38:04     INFO -       (nil))
> > [task 2019-12-01T14:38:04.167Z] 14:38:04     INFO -  during RTL pass: rnreg
> >
> > I guess the problem is that there is no xchange in SSE instruction set,
> > so peephle needs to be more restrictive?
>
> I'll have a look tomorrow.

general_reg_operand should be used in the pattern.

Uros.

Patch
diff mbox series

--- gcc/config/i386/i386.md.jj	2019-10-28 22:16:14.583008121 +0100
+++ gcc/config/i386/i386.md	2019-11-18 17:06:36.050742545 +0100
@@ -2787,6 +2787,17 @@ 
    (set_attr "amdfam10_decode" "double")
    (set_attr "bdver1_decode" "double")])
 
+(define_peephole2
+  [(set (match_operand:SWI 0 "register_operand")
+	(match_operand:SWI 1 "register_operand"))
+   (set (match_dup 1)
+	(match_operand:SWI 2 "register_operand"))
+   (set (match_dup 2) (match_dup 0))]
+  "peep2_reg_dead_p (3, operands[0])
+   && optimize_insn_for_size_p ()"
+  [(parallel [(set (match_dup 1) (match_dup 2))
+	      (set (match_dup 2) (match_dup 1))])])
+
 (define_expand "movstrict<mode>"
   [(set (strict_low_part (match_operand:SWI12 0 "register_operand"))
 	(match_operand:SWI12 1 "general_operand"))]
--- gcc/testsuite/gcc.target/i386/pr92549.c.jj	2019-11-18 17:48:35.533177377 +0100
+++ gcc/testsuite/gcc.target/i386/pr92549.c	2019-11-18 17:49:31.888336444 +0100
@@ -0,0 +1,28 @@ 
+/* PR target/92549 */
+/* { dg-do compile } */
+/* { dg-options "-Os -masm=att" } */
+/* { dg-final { scan-assembler "xchgl" } } */
+
+#ifdef __i386__
+#define R , regparm (2)
+#else
+#define R
+#endif
+
+__attribute__((noipa R)) int
+bar (int a, int b)
+{
+  return b - a + 5;
+}
+
+__attribute__((noipa R)) int
+foo (int a, int b)
+{
+  return 1 + bar (b, a);
+}
+
+int
+main ()
+{
+  return foo (39, 3);
+}