
[i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]

Message ID CAMZc-bwzYQWvQH7sbDczJg6hWktBqA7Pdcbp-FtAnza1By8TAQ@mail.gmail.com
State New
Series [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper will kill sse registers. [PR target/82735]

Commit Message

Hongtao Liu May 13, 2021, 9:23 a.m. UTC
Hi:
  When __builtin_ia32_vzeroupper is called explicitly, the corresponding
vzeroupper pattern does not carry any CLOBBERs or SETs before LRA,
which leads to incorrect optimization in pass_reload.
To solve this problem, this patch introduces a pre_reload splitter
which adds CLOBBERs to vzeroupper's pattern; that fixes the problem
reported in the PR.

At the same time, so that post-reload CSE can still optimize the low
128 bits, this patch also transforms those CLOBBERs into SETs in
pass_vzeroupper.
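
A rough sketch of that transformation (illustrative only: the real
change lives in ix86_add_reg_usage_to_vzerouppers, which presumably
also checks which registers are live, and vzeroupper_insn below is
just a placeholder name):

  /* Rewrite each (clobber (reg:V2DI N)) in the explicit vzeroupper
     PARALLEL into (set (reg) (reg)), so that post-reload CSE can see
     that the low 128 bits are unchanged.  */
  rtx pat = PATTERN (vzeroupper_insn);
  for (int i = XVECLEN (pat, 0) - 1; i > 0; --i)
    {
      rtx x = XVECEXP (pat, 0, i);
      if (GET_CODE (x) == CLOBBER && REG_P (XEXP (x, 0)))
	XVECEXP (pat, 0, i) = gen_rtx_SET (XEXP (x, 0), XEXP (x, 0));
    }
  df_insn_rescan (vzeroupper_insn);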

It works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
are callee-saved.  There, even if there are no other uses of xmm6-xmm15
in the function, pro_epilog will save and restore those registers
because of vzeroupper's pattern, which is obviously redundant.  To
eliminate this redundancy, a post_reload splitter is introduced which
drops those SETs, and an epilogue_completed splitter adds them back.
This looks safe, since there should be no CSE between post_reload
split2 and epilogue_completed split3 - is that right?  Frame info also
needs to be updated in pro_epilog, which now saves and restores
xmm6-xmm15 only if there is a use other than the explicit vzeroupper
pattern.

  Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
  Ok for trunk?

gcc/ChangeLog:

        PR target/82735
        * config/i386/i386-expand.c (ix86_expand_builtin): Count
        number of __builtin_ia32_vzeroupper.
        * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
        Transform CLOBBERs to SETs for explicit vzeroupper pattern so
        that CSE can optimize lower 128 bits.
        * config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
        New.
        (ix86_save_reg): If there's no use of xmm6~xmm15 other than
        explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
        REGNO.
        (ix86_finalize_stack_frame_flags): Recompute frame layout if
        there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
        * config/i386/i386.h (struct machine_function): Change type of
        has_explicit_vzeroupper from BOOL_BITFIELD to unsigned int.
        * config/i386/sse.md (*avx_vzeroupper_2): New post-reload
        splitter which will drop all SETs for explicit vzeroupper
        patterns.
        (*avx_vzeroupper_1): Generate SET reg to reg instead of
        CLOBBER, and add pre-reload splitter after it.

gcc/testsuite/ChangeLog:

        PR target/82735
        * gcc.target/i386/pr82735-1.c: New test.
        * gcc.target/i386/pr82735-2.c: New test.
        * gcc.target/i386/pr82735-3.c: New test.
        * gcc.target/i386/pr82735-4.c: New test.
        * gcc.target/i386/pr82735-5.c: New test.

Comments

Uros Bizjak May 13, 2021, 9:40 a.m. UTC | #1
On Thu, May 13, 2021 at 11:18 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> Hi:
>   When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
> which leads to incorrect optimization in pass_reload.
> In order to solve this problem, this patch introduces a pre_reload
> splitter which adds CLOBBERS to vzeroupper's pattern, it can solve the
> problem in pr.
>
> At the same time, in order to optimize the low 128 bits in
> post_reload CSE, this patch also transforms those CLOBBERS to SETs in
> pass_vzeroupper.
>
> It works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
> are callee-saved, so even if there're no other uses of xmm6-xmm15 in the
> function, because of vzeroupper's pattern, pro_epilog will save and
> restore those registers, which is obviously redundant. In order to
> eliminate this redundancy, a post_reload splitter is introduced, which
> drops those SETs, until epilogue_completed splitter adds those SETs
> back, it looks to be safe since there's no CSE between post_reload
> split2 and epilogue_completed split3??? Also frame info needs to be
> updated in pro_epilog, which saves and restores xmm6-xmm15 only if
> there's usage other than explicit vzeroupper pattern.
>
>   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
>   Ok for trunk?

Some time ago, support for the CLOBBER_HIGH RTX was added (and later
removed for some reason).  Perhaps we could resurrect that patch for
the purpose of ferrying 128-bit modes via the vzeroupper RTX?

+(define_split
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX && ix86_pre_reload_split ()"
+  [(match_dup 0)]
+{
+  /* When vzeroupper is explicitly used, for LRA purpose, make it clear
+     the instruction kills sse registers.  */
+  gcc_assert (cfun->machine->has_explicit_vzeroupper);
+  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+  rtvec vec = rtvec_alloc (nregs + 1);
+  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
+                        gen_rtvec (1, const1_rtx),
+                        UNSPECV_VZEROUPPER);
+  for (unsigned int i = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      rtx reg = gen_rtx_REG (V2DImode, regno);
+      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+    }
+  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
+})

Wouldn't this also kill lower 128bit values that are not touched by
vzeroupper? A CLOBBER_HIGH would be more appropriate here.
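
For illustration, assuming CLOBBER_HIGH came back with its old (GCC 9)
interface - where (clobber_high (reg:V2DI N)) only clobbers the bits
above the register's low 128 bits - the loop above would become
something like:

  for (unsigned int i = 0; i < nregs; ++i)
    {
      unsigned int regno = GET_SSE_REGNO (i);
      rtx reg = gen_rtx_REG (V2DImode, regno);
      /* Only the upper bits are clobbered; the low 128 bits stay
	 visible to later passes.  */
      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER_HIGH (VOIDmode, reg);
    }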

Uros.


> gcc/ChangeLog:
>
>         PR target/82735
>         * config/i386/i386-expand.c (ix86_expand_builtin): Count
>         number of __builtin_ia32_vzeroupper.
>         * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
>         Transform CLOBBERs to SETs for explicit vzeroupper pattern so
>         that CSE can optimize lower 128 bits.
>         * config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
>         New.
>         (ix86_save_reg): If there's no use of xmm6~xmm15 other than
>         explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
>         REGNO.
>         (ix86_finalize_stack_frame_flags): Recompute frame layout if
>         there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
>         * config/i386/i386.h (struct machine_function): Change type of
>         has_explicit_vzeroupper from BOOL_BITFILED to unsigned int.
>         * config/i386/sse.md (*avx_vzeroupper_2): New post-reload
>         splitter which will drop all SETs for explicit vzeroupper
>         patterns.
>         (*avx_vzeroupper_1): Generate SET reg to reg instead of
>         CLOBBER, and add pre-reload splitter after it.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/82735
>         * gcc.target/i386/pr82735-1.c: New test.
>         * gcc.target/i386/pr82735-2.c: New test.
>         * gcc.target/i386/pr82735-3.c: New test.
>         * gcc.target/i386/pr82735-4.c: New test.
>         * gcc.target/i386/pr82735-5.c: New test.
>
>
> --
> BR,
> Hongtao
Uros Bizjak May 13, 2021, 9:43 a.m. UTC | #2
On Thu, May 13, 2021 at 11:40 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 11:18 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > Hi:
> >   When __builtin_ia32_vzeroupper is called explicitly, the corresponding
> > vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
> > which leads to incorrect optimization in pass_reload.
> > In order to solve this problem, this patch introduces a pre_reload
> > splitter which adds CLOBBERS to vzeroupper's pattern, it can solve the
> > problem in pr.
> >
> > At the same time, in order to optimize the low 128 bits in
> > post_reload CSE, this patch also transforms those CLOBBERS to SETs in
> > pass_vzeroupper.
> >
> > It works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
> > are callee-saved, so even if there're no other uses of xmm6-xmm15 in the
> > function, because of vzeroupper's pattern, pro_epilog will save and
> > restore those registers, which is obviously redundant. In order to
> > eliminate this redundancy, a post_reload splitter is introduced, which
> > drops those SETs, until epilogue_completed splitter adds those SETs
> > back, it looks to be safe since there's no CSE between post_reload
> > split2 and epilogue_completed split3??? Also frame info needs to be
> > updated in pro_epilog, which saves and restores xmm6-xmm15 only if
> > there's usage other than explicit vzeroupper pattern.
> >
> >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> >   Ok for trunk?
>
> Some time ago a support for CLOBBER_HIGH RTX was added (and later
> removed for some reason). Perhaps we could resurrect the patch for the
> purpose of ferrying 128bit modes via vzeroupper RTX?

https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html

Uros.

>
> +(define_split
> +  [(match_parallel 0 "vzeroupper_pattern"
> +     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> +  "TARGET_AVX && ix86_pre_reload_split ()"
> +  [(match_dup 0)]
> +{
> +  /* When vzeroupper is explictly used, for LRA purpose, make it clear
> +     the instruction kills sse registers.  */
> +  gcc_assert (cfun->machine->has_explicit_vzeroupper);
> +  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> +  rtvec vec = rtvec_alloc (nregs + 1);
> +  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
> +                        gen_rtvec (1, const1_rtx),
> +                        UNSPECV_VZEROUPPER);
> +  for (unsigned int i = 0; i < nregs; ++i)
> +    {
> +      unsigned int regno = GET_SSE_REGNO (i);
> +      rtx reg = gen_rtx_REG (V2DImode, regno);
> +      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
> +    }
> +  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
> +})
>
> Wouldn't this also kill lower 128bit values that are not touched by
> vzeroupper? A CLOBBER_HIGH would be more appropriate here.
>
> Uros.
>
>
> > gcc/ChangeLog:
> >
> >         PR target/82735
> >         * config/i386/i386-expand.c (ix86_expand_builtin): Count
> >         number of __builtin_ia32_vzeroupper.
> >         * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
> >         Transform CLOBBERs to SETs for explicit vzeroupper pattern so
> >         that CSE can optimize lower 128 bits.
> >         * config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
> >         New.
> >         (ix86_save_reg): If there's no use of xmm6~xmm15 other than
> >         explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
> >         REGNO.
> >         (ix86_finalize_stack_frame_flags): Recompute frame layout if
> >         there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
> >         * config/i386/i386.h (struct machine_function): Change type of
> >         has_explicit_vzeroupper from BOOL_BITFILED to unsigned int.
> >         * config/i386/sse.md (*avx_vzeroupper_2): New post-reload
> >         splitter which will drop all SETs for explicit vzeroupper
> >         patterns.
> >         (*avx_vzeroupper_1): Generate SET reg to reg instead of
> >         CLOBBER, and add pre-reload splitter after it.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR target/82735
> >         * gcc.target/i386/pr82735-1.c: New test.
> >         * gcc.target/i386/pr82735-2.c: New test.
> >         * gcc.target/i386/pr82735-3.c: New test.
> >         * gcc.target/i386/pr82735-4.c: New test.
> >         * gcc.target/i386/pr82735-5.c: New test.
> >
> >
> > --
> > BR,
> > Hongtao
Jakub Jelinek May 13, 2021, 9:54 a.m. UTC | #3
On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> > >   Ok for trunk?
> >
> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
> > removed for some reason). Perhaps we could resurrect the patch for the
> > purpose of ferrying 128bit modes via vzeroupper RTX?
> 
> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html

https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
is where it got removed, CCing Richard.

> > +(define_split
> > +  [(match_parallel 0 "vzeroupper_pattern"
> > +     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> > +  "TARGET_AVX && ix86_pre_reload_split ()"
> > +  [(match_dup 0)]
> > +{
> > +  /* When vzeroupper is explictly used, for LRA purpose, make it clear
> > +     the instruction kills sse registers.  */
> > +  gcc_assert (cfun->machine->has_explicit_vzeroupper);
> > +  unsigned int nregs = TARGET_64BIT ? 16 : 8;
> > +  rtvec vec = rtvec_alloc (nregs + 1);
> > +  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
> > +                        gen_rtvec (1, const1_rtx),
> > +                        UNSPECV_VZEROUPPER);
> > +  for (unsigned int i = 0; i < nregs; ++i)
> > +    {
> > +      unsigned int regno = GET_SSE_REGNO (i);
> > +      rtx reg = gen_rtx_REG (V2DImode, regno);
> > +      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
> > +    }
> > +  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
> > +})
> >
> > Wouldn't this also kill lower 128bit values that are not touched by
> > vzeroupper? A CLOBBER_HIGH would be more appropriate here.

Yes, it would.  But normally the only xmm* hard regs live across an
explicit user vzeroupper would be local and global register variables,
and I think the 1st scheduler etc. shouldn't extend the lifetime of
xmm hard regs across an UNSPEC_VOLATILE.

	Jakub
Richard Sandiford May 13, 2021, 11:32 a.m. UTC | #4
Jakub Jelinek <jakub@redhat.com> writes:
> On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
>> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
>> > >   Ok for trunk?
>> >
>> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
>> > removed for some reason). Perhaps we could resurrect the patch for the
>> > purpose of ferrying 128bit modes via vzeroupper RTX?
>> 
>> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
>
> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
> is where it got removed, CCing Richard.

Yeah.  Initially clobber_high seemed like the best approach for
handling the tlsdesc thing, but in practice it was too difficult
to shoe-horn the concept in after the fact, when so much rtl
infrastructure wasn't prepared to deal with it.  The old support
didn't handle all cases and passes correctly, and handled others
suboptimally.

I think it would be worth using the same approach as
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
vzeroupper: represent the instructions as call_insns in which the
call has a special vzeroupper ABI.  I think that's likely to lead
to better code than clobber_high would (or at least, it did for tlsdesc).

Thanks,
Richard
Jakub Jelinek May 13, 2021, 11:37 a.m. UTC | #5
On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
> Jakub Jelinek <jakub@redhat.com> writes:
> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> >> > >   Ok for trunk?
> >> >
> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
> >> > removed for some reason). Perhaps we could resurrect the patch for the
> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
> >> 
> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
> >
> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
> > is where it got removed, CCing Richard.
> 
> Yeah.  Initially clobber_high seemed like the best appraoch for
> handling the tlsdesc thing, but in practice it was too difficult
> to shoe-horn the concept in after the fact, when so much rtl
> infrastructure wasn't prepared to deal with it.  The old support
> didn't handle all cases and passes correctly, and handled others
> suboptimally.
> 
> I think it would be worth using the same approach as
> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
> vzeroupper: represent the instructions as call_insns in which the
> call has a special vzeroupper ABI.  I think that's likely to lead
> to better code than clobber_high would (or at least, it did for tlsdesc).

Perhaps a magic call_insn that is split post-reload into a normal insn
with the sets then?

	Jakub
Richard Sandiford May 13, 2021, 11:52 a.m. UTC | #6
Jakub Jelinek <jakub@redhat.com> writes:
> On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
>> Jakub Jelinek <jakub@redhat.com> writes:
>> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
>> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
>> >> > >   Ok for trunk?
>> >> >
>> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
>> >> > removed for some reason). Perhaps we could resurrect the patch for the
>> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
>> >> 
>> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
>> >
>> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
>> > is where it got removed, CCing Richard.
>> 
>> Yeah.  Initially clobber_high seemed like the best appraoch for
>> handling the tlsdesc thing, but in practice it was too difficult
>> to shoe-horn the concept in after the fact, when so much rtl
>> infrastructure wasn't prepared to deal with it.  The old support
>> didn't handle all cases and passes correctly, and handled others
>> suboptimally.
>> 
>> I think it would be worth using the same approach as
>> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
>> vzeroupper: represent the instructions as call_insns in which the
>> call has a special vzeroupper ABI.  I think that's likely to lead
>> to better code than clobber_high would (or at least, it did for tlsdesc).
>
> Perhaps a magic call_insn that is split post-reload into a normal insn
> with the sets then?

I'd be tempted to treat it as a call_insn throughout.  The unspec_volatile
means that we can't move the instruction, so converting a call_insn to an
insn isn't likely to help from that point of view.  The sets are also
likely to be handled suboptimally compared to the more accurate register
information attached to the call: all code that handles calls has to be
prepared to deal with partial clobbers, whereas most code dealing with
sets will assume that the set does useful work, and that the rhs of the
set is live.

Thanks,
Richard
Hongtao Liu May 14, 2021, 2:27 a.m. UTC | #7
On Thu, May 13, 2021 at 7:52 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Jakub Jelinek <jakub@redhat.com> writes:
> > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
> >> Jakub Jelinek <jakub@redhat.com> writes:
> >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
> >> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> >> >> > >   Ok for trunk?
> >> >> >
> >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
> >> >> > removed for some reason). Perhaps we could resurrect the patch for the
> >> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
> >> >>
> >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
> >> >
> >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
> >> > is where it got removed, CCing Richard.
> >>
> >> Yeah.  Initially clobber_high seemed like the best appraoch for
> >> handling the tlsdesc thing, but in practice it was too difficult
> >> to shoe-horn the concept in after the fact, when so much rtl
> >> infrastructure wasn't prepared to deal with it.  The old support
> >> didn't handle all cases and passes correctly, and handled others
> >> suboptimally.
> >>
> >> I think it would be worth using the same approach as
> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
> >> vzeroupper: represent the instructions as call_insns in which the
> >> call has a special vzeroupper ABI.  I think that's likely to lead
> >> to better code than clobber_high would (or at least, it did for tlsdesc).

From an implementation perspective, I guess you mean we should
implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386
backend.
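
Something roughly like the following, perhaps (just a sketch, not a
patch; I386_VZEROUPPER_ABI would be a new ABI id and the checks are
illustrative):

static const predefined_function_abi &
ix86_insn_callee_abi (const rtx_insn *insn)
{
  /* Use a dedicated ABI for the explicit-vzeroupper pseudo-call and
     the default ABI (id 0) for every other call.  */
  unsigned int abi_id = 0;
  if (insn && CALL_P (insn)
      && vzeroupper_pattern (PATTERN (insn), VOIDmode))
    abi_id = I386_VZEROUPPER_ABI;
  return function_abis[abi_id];
}

#undef TARGET_INSN_CALLEE_ABI
#define TARGET_INSN_CALLEE_ABI ix86_insn_callee_abi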

> >
> > Perhaps a magic call_insn that is split post-reload into a normal insn
> > with the sets then?
>
> I'd be tempted to treat it is a call_insn throughout.  The unspec_volatile
> means that we can't move the instruction, so converting a call_insn to an
> insn isn't likely to help from that point of view.  The sets are also
> likely to be handled suboptimally compared to the more accurate register
> information attached to the call: all code that handles calls has to be
> prepared to deal with partial clobbers, whereas most code dealing with
> sets will assume that the set does useful work, and that the rhs of the
> set is live.
>
> Thanks,
> Richard
>
Hongtao Liu May 17, 2021, 8:44 a.m. UTC | #8
On Fri, May 14, 2021 at 10:27 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Thu, May 13, 2021 at 7:52 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Jakub Jelinek <jakub@redhat.com> writes:
> > > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
> > >> Jakub Jelinek <jakub@redhat.com> writes:
> > >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
> > >> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> > >> >> > >   Ok for trunk?
> > >> >> >
> > >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
> > >> >> > removed for some reason). Perhaps we could resurrect the patch for the
> > >> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
> > >> >>
> > >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
> > >> >
> > >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
> > >> > is where it got removed, CCing Richard.
> > >>
> > >> Yeah.  Initially clobber_high seemed like the best appraoch for
> > >> handling the tlsdesc thing, but in practice it was too difficult
> > >> to shoe-horn the concept in after the fact, when so much rtl
> > >> infrastructure wasn't prepared to deal with it.  The old support
> > >> didn't handle all cases and passes correctly, and handled others
> > >> suboptimally.
> > >>
> > >> I think it would be worth using the same approach as
> > >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
> > >> vzeroupper: represent the instructions as call_insns in which the
> > >> call has a special vzeroupper ABI.  I think that's likely to lead
> > >> to better code than clobber_high would (or at least, it did for tlsdesc).
>
> From an implementation perspective, I guess you're meaning we should
> implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386
> backend.
>
When I implemented the vzeroupper pattern as a call_insn and defined
TARGET_INSN_CALLEE_ABI for it, I got several failures.  They are
related to two parts:

1. requires_stack_frame_p returns true for vzeroupper, but it should be false.
2. In subst_stack_regs, vzeroupper shouldn't kill the arguments.

I've tried a rough patch like the one below, and it works for those
failures.  Unfortunately, I don't have an arm machine to test on, so I
want to ask: would the change below break something in the arm backend?

modified   gcc/reg-stack.c
@@ -174,6 +174,7 @@
 #include "reload.h"
 #include "tree-pass.h"
 #include "rtl-iter.h"
+#include "function-abi.h"

 #ifdef STACK_REGS

@@ -2385,7 +2386,7 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack)
   bool control_flow_insn_deleted = false;
   int i;

-  if (CALL_P (insn))
+  if (CALL_P (insn) && insn_callee_abi (insn).id () == 0)
     {
       int top = regstack->top;

modified   gcc/shrink-wrap.c
@@ -58,7 +58,12 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
   unsigned regno;

   if (CALL_P (insn))
-    return !SIBLING_CALL_P (insn);
+    {
+      if (insn_callee_abi (insn).id () != 0)
+	return false;
+      else
+	return !SIBLING_CALL_P (insn);
+    }

   /* We need a frame to get the unique CFA expected by the unwinder.  */
   if (cfun->can_throw_non_call_exceptions && can_throw_internal (insn))
> > >
> > > Perhaps a magic call_insn that is split post-reload into a normal insn
> > > with the sets then?
> >
> > I'd be tempted to treat it is a call_insn throughout.  The unspec_volatile
> > means that we can't move the instruction, so converting a call_insn to an
> > insn isn't likely to help from that point of view.  The sets are also
> > likely to be handled suboptimally compared to the more accurate register
> > information attached to the call: all code that handles calls has to be
> > prepared to deal with partial clobbers, whereas most code dealing with
> > sets will assume that the set does useful work, and that the rhs of the
> > set is live.
> >
> > Thanks,
> > Richard
> >
>
>
> --
> BR,
> Hongtao
Richard Sandiford May 17, 2021, 9:56 a.m. UTC | #9
Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Fri, May 14, 2021 at 10:27 AM Hongtao Liu <crazylht@gmail.com> wrote:
>>
>> On Thu, May 13, 2021 at 7:52 PM Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>> >
>> > Jakub Jelinek <jakub@redhat.com> writes:
>> > > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
>> > >> Jakub Jelinek <jakub@redhat.com> writes:
>> > >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
>> > >> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
>> > >> >> > >   Ok for trunk?
>> > >> >> >
>> > >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
>> > >> >> > removed for some reason). Perhaps we could resurrect the patch for the
>> > >> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
>> > >> >>
>> > >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
>> > >> >
>> > >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
>> > >> > is where it got removed, CCing Richard.
>> > >>
>> > >> Yeah.  Initially clobber_high seemed like the best appraoch for
>> > >> handling the tlsdesc thing, but in practice it was too difficult
>> > >> to shoe-horn the concept in after the fact, when so much rtl
>> > >> infrastructure wasn't prepared to deal with it.  The old support
>> > >> didn't handle all cases and passes correctly, and handled others
>> > >> suboptimally.
>> > >>
>> > >> I think it would be worth using the same approach as
>> > >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
>> > >> vzeroupper: represent the instructions as call_insns in which the
>> > >> call has a special vzeroupper ABI.  I think that's likely to lead
>> > >> to better code than clobber_high would (or at least, it did for tlsdesc).
>>
>> From an implementation perspective, I guess you're meaning we should
>> implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386
>> backend.
>>
> When I implemented the vzeroupper pattern as call_insn and defined
> TARGET_INSN_CALLEE_ABI for it, I got several failures. they're related
> to 2 parts
>
> 1. requires_stack_frame_p return true for vzeroupper which should be false.
> 2. in subst_stack_regs, vzeroupper shouldn't kill arguments
>
> I've tried a rough patch like below, it works for those failures,
> unfortunately, I don't have an arm machine to test, so I want to ask
> would the below change break something in the arm backend?

ABI id 0 just means the default ABI.  Real calls can use other ABIs
besides the default.  That said…

> modified   gcc/reg-stack.c
> @@ -174,6 +174,7 @@
>  #include "reload.h"
>  #include "tree-pass.h"
>  #include "rtl-iter.h"
> +#include "function-abi.h"
>
>  #ifdef STACK_REGS
>
> @@ -2385,7 +2386,7 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack)
>    bool control_flow_insn_deleted = false;
>    int i;
>
> -  if (CALL_P (insn))
> +  if (CALL_P (insn) && insn_callee_abi (insn).id () == 0)
>      {
>        int top = regstack->top;

…reg-stack.c is effectively x86-specific code, so checking id 0 here
wouldn't affect anything else.  It doesn't feel very future-proof
though, since x86 could use ABIs other than 0 for real calls in future.

AIUI the property that matters here isn't the ABI, but that the target
of the call doesn't reference stack registers.  That can be true for
real calls too, with -fipa-ra.

> modified   gcc/shrink-wrap.c
> @@ -58,7 +58,12 @@ requires_stack_frame_p (rtx_insn *insn,
> HARD_REG_SET prologue_used,
>    unsigned regno;
>
>    if (CALL_P (insn))
> -    return !SIBLING_CALL_P (insn);
> +    {
> +      if (insn_callee_abi (insn).id() != 0)
> + return false;
> +      else
> + return !SIBLING_CALL_P (insn);
> +    }

TBH I'm not sure why off-hand this function needs to treat non-sibling
calls specially, rather than rely on normal DF information.  Calls have
a use of the stack pointer, so we should return true for that reason:

	/* The stack ptr is used (honorarily) by a CALL insn.  */
	df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
		       NULL, bb, insn_info, DF_REF_REG_USE,
		       DF_REF_CALL_STACK_USAGE | flags);

I guess this is something we should suppress for fake calls though.

It looks like the rtx “used” flag is unused for INSNs, so we could
use that as a CALL_INSN flag that indicates a fake call.  We would just
need to make:

      /* For all other RTXes clear the used flag on the copy.  */
      RTX_FLAG (copy, used) = 0;

conditional on !INSN_P.

Thanks,
Richard
Hongtao Liu May 18, 2021, 1:12 p.m. UTC | #10
On Mon, May 17, 2021 at 5:56 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Fri, May 14, 2021 at 10:27 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >>
> >> On Thu, May 13, 2021 at 7:52 PM Richard Sandiford
> >> <richard.sandiford@arm.com> wrote:
> >> >
> >> > Jakub Jelinek <jakub@redhat.com> writes:
> >> > > On Thu, May 13, 2021 at 12:32:26PM +0100, Richard Sandiford wrote:
> >> > >> Jakub Jelinek <jakub@redhat.com> writes:
> >> > >> > On Thu, May 13, 2021 at 11:43:19AM +0200, Uros Bizjak wrote:
> >> > >> >> > >   Bootstrapped and regtested on X86_64-linux-gnu{-m32,}
> >> > >> >> > >   Ok for trunk?
> >> > >> >> >
> >> > >> >> > Some time ago a support for CLOBBER_HIGH RTX was added (and later
> >> > >> >> > removed for some reason). Perhaps we could resurrect the patch for the
> >> > >> >> > purpose of ferrying 128bit modes via vzeroupper RTX?
> >> > >> >>
> >> > >> >> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-11/msg01325.html
> >> > >> >
> >> > >> > https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01468.html
> >> > >> > is where it got removed, CCing Richard.
> >> > >>
> >> > >> Yeah.  Initially clobber_high seemed like the best appraoch for
> >> > >> handling the tlsdesc thing, but in practice it was too difficult
> >> > >> to shoe-horn the concept in after the fact, when so much rtl
> >> > >> infrastructure wasn't prepared to deal with it.  The old support
> >> > >> didn't handle all cases and passes correctly, and handled others
> >> > >> suboptimally.
> >> > >>
> >> > >> I think it would be worth using the same approach as
> >> > >> https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01466.html for
> >> > >> vzeroupper: represent the instructions as call_insns in which the
> >> > >> call has a special vzeroupper ABI.  I think that's likely to lead
> >> > >> to better code than clobber_high would (or at least, it did for tlsdesc).
> >>
> >> From an implementation perspective, I guess you're meaning we should
> >> implement TARGET_INSN_CALLEE_ABI and TARGET_FNTYPE_ABI in the i386
> >> backend.
> >>
> > When I implemented the vzeroupper pattern as call_insn and defined
> > TARGET_INSN_CALLEE_ABI for it, I got several failures. they're related
> > to 2 parts
> >
> > 1. requires_stack_frame_p return true for vzeroupper which should be false.
> > 2. in subst_stack_regs, vzeroupper shouldn't kill arguments
> >
> > I've tried a rough patch like below, it works for those failures,
> > unfortunately, I don't have an arm machine to test, so I want to ask
> > would the below change break something in the arm backend?
>
> ABI id 0 just means the default ABI.  Real calls can use other ABIs
> besides the default.  That said…
>
> > modified   gcc/reg-stack.c
> > @@ -174,6 +174,7 @@
> >  #include "reload.h"
> >  #include "tree-pass.h"
> >  #include "rtl-iter.h"
> > +#include "function-abi.h"
> >
> >  #ifdef STACK_REGS
> >
> > @@ -2385,7 +2386,7 @@ subst_stack_regs (rtx_insn *insn, stack_ptr regstack)
> >    bool control_flow_insn_deleted = false;
> >    int i;
> >
> > -  if (CALL_P (insn))
> > +  if (CALL_P (insn) && insn_callee_abi (insn).id () == 0)
> >      {
> >        int top = regstack->top;
>
> …reg-stack.c is effectively x86-specific code, so checking id 0 here
> wouldn't affect anything else.  It doesn't feel very future-proof
> though, since x86 could use ABIs other than 0 for real calls in future.
>
> AIUI the property that matters here isn't the ABI, but that the target
> of the call doesn't reference stack registers.  That can be true for
> real calls too, with -fipa-ra.
>
> > modified   gcc/shrink-wrap.c
> > @@ -58,7 +58,12 @@ requires_stack_frame_p (rtx_insn *insn,
> > HARD_REG_SET prologue_used,
> >    unsigned regno;
> >
> >    if (CALL_P (insn))
> > -    return !SIBLING_CALL_P (insn);
> > +    {
> > +      if (insn_callee_abi (insn).id() != 0)
> > + return false;
> > +      else
> > + return !SIBLING_CALL_P (insn);
> > +    }
>
> TBH I'm not sure why off-hand this function needs to treat non-sibling
> calls specially, rather than rely on normal DF information.  Calls have
> a use of the stack pointer, so we should return true for that reason:
>
>         /* The stack ptr is used (honorarily) by a CALL insn.  */
>         df_ref_record (DF_REF_BASE, collection_rec, regno_reg_rtx[i],
>                        NULL, bb, insn_info, DF_REF_REG_USE,
>                        DF_REF_CALL_STACK_USAGE | flags);
>
> I guess this is something we should suppress for fake calls though.
>
> It looks like the rtx “used” flag is unused for INSNs, so we could
> use that as a CALL_INSN flag that indicates a fake call.  We could just
> need to make:
>
>       /* For all other RTXes clear the used flag on the copy.  */
>       RTX_FLAG (copy, used) = 0;
>
> conditional on !INSN_P.
>
I got another error in

@@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn)
       return true;

     case CALL_INSN:
+      /* CALL_INSN uses the "used" flag to indicate it's a fake call.  */
+      if (RTX_FLAG (insn, used))
+	break;

and a performance issue in

modified   gcc/final.c
@@ -4498,7 +4498,8 @@ leaf_function_p (void)
   for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
     {
       if (CALL_P (insn)
-	  && ! SIBLING_CALL_P (insn))
+	  && ! SIBLING_CALL_P (insn)
+	  && !RTX_FLAG (insn, used))
 	return 0;
       if (NONJUMP_INSN_P (insn)

Also, I grepped for CALL_P and CALL_INSN in the GCC sources, and there
are many places which hold the assumption that CALL_P/CALL_INSN is a
real call.  Considering that vzeroupper is used a lot in the i386
backend, I'm a bit worried that this implementation approach will be a
bottomless pit.

> Thanks,
> Richard
Richard Sandiford May 18, 2021, 3:18 p.m. UTC | #11
Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Mon, May 17, 2021 at 5:56 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>> It looks like the rtx “used” flag is unused for INSNs, so we could
>> use that as a CALL_INSN flag that indicates a fake call.  We could just
>> need to make:
>>
>>       /* For all other RTXes clear the used flag on the copy.  */
>>       RTX_FLAG (copy, used) = 0;
>>
>> conditional on !INSN_P.
>>
> I got another error in
>
> @@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn)
>        return true;
>
>      case CALL_INSN:
> +      /* CALL_INSN use "used" flag to indicate it's a fake call.  */
> +      if (RTX_FLAG (insn, used))
> + break;

I guess this is because of the nonlocal_goto condition?  If so, that
could be fixed by adding a REG_EH_REGION note of INT_MIN.  Even if we
don't do that, I think the fix belongs in nonlocal_goto instead.
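
Concretely, that would be something like the following when the fake
call_insn is emitted (INT_MIN in a REG_EH_REGION note is the existing
"cannot throw and cannot do a non-local goto" marker; I think
make_reg_eh_region_note_nothrow_nononlocal already does exactly this):

  add_reg_note (call_insn, REG_EH_REGION, GEN_INT (INT_MIN));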

> and performance issue in
>
> modified   gcc/final.c
> @@ -4498,7 +4498,8 @@ leaf_function_p (void)
>    for (insn = get_insns (); insn; insn = NEXT_INSN (insn))
>      {
>        if (CALL_P (insn)
> -   && ! SIBLING_CALL_P (insn))
> +   && ! SIBLING_CALL_P (insn)
> +   && !RTX_FLAG (insn, used))
>   return 0;
>        if (NONJUMP_INSN_P (insn)
>
> Also i grep CALL_P or CALL_INSN in GCC source codes, there are many
> places which hold the assumption CALL_P/CALL_INSN is a real call.
> Considering that vzeroupper is used a lot on the i386 backend, I'm a
> bit worried that this implementation solution will be a bottomless
> pit.

Maybe, but I think the same is true for CLOBBER_HIGH.  If we have
a third alternative then we should consider it, but I think the
call approach is still going to be less problematic than CLOBBER_HIGH.

The main advantage of the call approach is that the CALL_P handling
is (mostly) conservatively correct and performance problems are just
a one-line change.  The CLOBBER_HIGH approach instead requires
changes to the way that passes track liveness information for
non-call instructions (so is much more than a one-line change).
Also, treating a CLOBBER_HIGH like a CLOBBER isn't conservatively
correct, because other code might be relying on part of the register
being preserved.

Thanks,
Richard
Hongtao Liu May 25, 2021, 6:04 a.m. UTC | #12
On Tue, May 18, 2021 at 11:18 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Mon, May 17, 2021 at 5:56 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >> It looks like the rtx “used” flag is unused for INSNs, so we could
> >> use that as a CALL_INSN flag that indicates a fake call.  We could just
> >> need to make:
> >>
> >>       /* For all other RTXes clear the used flag on the copy.  */
> >>       RTX_FLAG (copy, used) = 0;
> >>
> >> conditional on !INSN_P.
> >>
> > I got another error in
> >
> > @@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn)
> >        return true;
> >
> >      case CALL_INSN:
> > +      /* CALL_INSN use "used" flag to indicate it's a fake call.  */
> > +      if (RTX_FLAG (insn, used))
> > + break;
>
> I guess this is because of the nonlocal_goto condition?  If so, that
> could be fixed by adding a REG_EH_REGION note of INT_MIN.  Even if we
> don't do that, I think the fix belongs in nonlocal_goto instead.
>
Here is the error info.  IMHO, the fix should be in control_flow_insn_p?

../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:
In function ‘foo’:
../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
error: in basic block 5:
   21 | }
      | ^
../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
error: flow control insn inside a basic block
(call_insn 77 50 86 5 (parallel [
            (call (mem:QI (unspec_volatile [
                            (const_int 0 [0])
                        ] UNSPECV_VZEROUPPER) [0  S1 A8])
                (const_int 0 [0]))
            (unspec [
                    (const_int 1 [0x1])
                ] UNSPEC_CALLEE_ABI)
        ]) -1
     (nil)
    (nil))
during RTL pass: pro_and_epilogue
../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
internal compiler error: in rtl_verify_bb_insns, at cfgrtl.c:2797
0x129a2a3 _fatal_insn(char const*, rtx_def const*, char const*, int,
char const*)
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/rtl-error.c:108
0xcb8834 rtl_verify_bb_insns
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2797
0xcb8b09 rtl_verify_flow_info_1
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2883
0xcb9284 rtl_verify_flow_info
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:3125
0xc9f44d verify_flow_info()
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.c:267
0xcb21b7 checking_verify_flow_info
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.h:212
0xcb6a3c commit_edge_insertions()
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2115
0xebfcb8 thread_prologue_and_epilogue_insns()
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6136
0xec07db rest_of_handle_thread_prologue_and_epilogue
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6510
0xec09b8 execute
/export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6586
>
> Thanks,
> Richard
Hongtao Liu May 25, 2021, 6:30 a.m. UTC | #13
On Tue, May 25, 2021 at 2:04 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Tue, May 18, 2021 at 11:18 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Hongtao Liu via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > On Mon, May 17, 2021 at 5:56 PM Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > >> It looks like the rtx “used” flag is unused for INSNs, so we could
> > >> use that as a CALL_INSN flag that indicates a fake call.  We could just
> > >> need to make:
> > >>
> > >>       /* For all other RTXes clear the used flag on the copy.  */
> > >>       RTX_FLAG (copy, used) = 0;
> > >>
> > >> conditional on !INSN_P.
> > >>
> > > I got another error in
> > >
> > > @@ -83,6 +83,9 @@ control_flow_insn_p (const rtx_insn *insn)
> > >        return true;
> > >
> > >      case CALL_INSN:
> > > +      /* CALL_INSN use "used" flag to indicate it's a fake call.  */
> > > +      if (RTX_FLAG (insn, used))
> > > + break;
> >
> > I guess this is because of the nonlocal_goto condition?  If so, that
Oh, I guess you mean can_nonlocal_goto, which is inside
control_flow_insn_p.  Sorry for disturbing you.
> > could be fixed by adding a REG_EH_REGION note of INT_MIN.  Even if we
> > don't do that, I think the fix belongs in nonlocal_goto instead.
> >
> This is error info, IMHO, the fix should be in control_flow_insn_p?
>
> ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:
> In function ‘foo’:
> ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
> error: in basic block 5:
>    21 | }
>       | ^
> ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
> error: flow control insn inside a basic block
> (call_insn 77 50 86 5 (parallel [
>             (call (mem:QI (unspec_volatile [
>                             (const_int 0 [0])
>                         ] UNSPECV_VZEROUPPER) [0  S1 A8])
>                 (const_int 0 [0]))
>             (unspec [
>                     (const_int 1 [0x1])
>                 ] UNSPEC_CALLEE_ABI)
>         ]) -1
>      (nil)
>     (nil))
> during RTL pass: pro_and_epilogue
> ../../gcc/gnu-toolchain/pr82735/gcc/testsuite/gcc.target/i386/pr64061.c:21:1:
> internal compiler error: in rtl_verify_bb_insns, at cfgrtl.c:2797
> 0x129a2a3 _fatal_insn(char const*, rtx_def const*, char const*, int,
> char const*)
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/rtl-error.c:108
> 0xcb8834 rtl_verify_bb_insns
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2797
> 0xcb8b09 rtl_verify_flow_info_1
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2883
> 0xcb9284 rtl_verify_flow_info
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:3125
> 0xc9f44d verify_flow_info()
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.c:267
> 0xcb21b7 checking_verify_flow_info
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfghooks.h:212
> 0xcb6a3c commit_edge_insertions()
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/cfgrtl.c:2115
> 0xebfcb8 thread_prologue_and_epilogue_insns()
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6136
> 0xec07db rest_of_handle_thread_prologue_and_epilogue
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6510
> 0xec09b8 execute
> /export/users2/liuhongt/gcc/gnu-toolchain/pr82735/gcc/function.c:6586
> >
> > Thanks,
> > Richard
>
>
>
> --
> BR,
> Hongtao
Hongtao Liu May 27, 2021, 5:07 a.m. UTC | #14
Hi:
  This is an updated patch which implements vzeroupper as a call_insn
that has a special vzeroupper ABI.  In this patch I also reverted
r11-7684, r10-6451 and r10-3677, which seem to fix the same issue but
in a different way.
  Bootstrapped and regtested on x86_64-linux-gnu{-m32,} and
x86_64-linux-gnu{-m32 -march=cascadelake,-march=cascadelake}.
  Also tested the patch on SPEC2017 and eembc; no performance impact, as expected.
  Ok for trunk?

gcc/ChangeLog:

        PR target/82735
        * config/i386/i386-expand.c (ix86_expand_builtin): Remove
        assignment of cfun->machine->has_explicit_vzeroupper.
        * config/i386/i386-features.c
        (ix86_add_reg_usage_to_vzerouppers): Delete.
        (ix86_add_reg_usage_to_vzeroupper): Ditto.
        (rest_of_handle_insert_vzeroupper): Remove
        ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
        of the function.
        (gate): Remove cfun->machine->has_explicit_vzeroupper.
        * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
        Declared.
        * config/i386/i386.c (ix86_insn_callee_abi): New function.
        (ix86_initialize_callee_abi): Ditto.
        (ix86_expand_avx_vzeroupper): Ditto.
        (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
        ABI.
        (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
        * config/i386/i386.h (enum i386_insn_callee_abi_index): New.
        (struct GTY(()) machine_function): Delete
        has_explicit_vzeroupper.
        * config/i386/i386.md (enum unspec): New member
        UNSPEC_CALLEE_ABI.
        * config/i386/predicates.md (vzeroupper_pattern): Adjust.
        * config/i386/sse.md (avx_vzeroupper): Call
        ix86_expand_avx_vzeroupper.
        (*avx_vzeroupper): Rename to ..
        (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
        call_insn which has a special vzeroupper ABI.
        (*avx_vzeroupper_1): Deleted.
        * df-scan.c (df_get_call_refs): When call_insn is a fake call,
        it won't use stack pointer reg.
        * final.c (leaf_function_p): When call_insn is a fake call, it
        won't affect caller as a leaf function.
        * reg-stack.c (callee_clobbers_any_stack_reg): New.
        (subst_stack_regs): When call_insn doesn't clobber any stack
        reg, don't clear the arguments.
        * rtl.c (shallow_copy_rtx): Don't clear the used flag when orig
        is an insn.
        * shrink-wrap.c (requires_stack_frame_p): No need for stack
        frame for a fake call.

gcc/testsuite/ChangeLog:

        PR target/82735
        * gcc.target/i386/pr82735-1.c: New test.
        * gcc.target/i386/pr82735-2.c: New test.
        * gcc.target/i386/pr82735-3.c: New test.
        * gcc.target/i386/pr82735-4.c: New test.
        * gcc.target/i386/pr82735-5.c: New test.
Uros Bizjak May 27, 2021, 7:05 a.m. UTC | #15
On Thu, May 27, 2021 at 7:03 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> Hi:
>   This is an updated patch which implements vzeroupper as call_insn
> which has a special vzeroupper ABI, also in this patch i reverted
> r11-7684, r10-6451, r10-3677 which seems to fix the same issue but in
> a different way.
>   Bootstrapped and regtested on x86_64-linux-gnux{-m32,} and
> x86_64-linux-gnux{-m32 \-march=cascadelake,-march=cascadelake}.
>   Also test the patch on SPEC2017 and eembc, no performance impact as expected.
>   Ok for trunk?
>
> gcc/ChangeLog:
>
>         PR target/82735
>         * config/i386/i386-expand.c (ix86_expand_builtin): Remove
>         assignment of cfun->machine->has_explicit_vzeroupper.
>         * config/i386/i386-features.c
>         (ix86_add_reg_usage_to_vzerouppers): Delete.
>         (ix86_add_reg_usage_to_vzeroupper): Ditto.
>         (rest_of_handle_insert_vzeroupper): Remove
>         ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
>         of the function.
>         (gate): Remove cfun->machine->has_explicit_vzeroupper.
>         * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
>         Declared.
>         * config/i386/i386.c (ix86_insn_callee_abi): New function.
>         (ix86_initialize_callee_abi): Ditto.
>         (ix86_expand_avx_vzeroupper): Ditto.
>         (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
>         ABI.
>         (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
>         * config/i386/i386.h (enum i386_insn_callee_abi_index): New.
>         (struct GTY(()) machine_function): Delete
>         has_explicit_vzeroupper.
>         * config/i386/i386.md (enum unspec): New member
>         UNSPEC_CALLEE_ABI.
>         * config/i386/predicates.md (vzeroupper_pattern): Adjust.
>         * config/i386/sse.md (avx_vzeroupper): Call
>         ix86_expand_avx_vzeroupper.
>         (*avx_vzeroupper): Rename to ..
>         (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
>         call_insn which has a special vzeroupper ABI.
>         (*avx_vzeroupper_1): Deleted.
>         * df-scan.c (df_get_call_refs): When call_insn is a fake call,
>         it won't use stack pointer reg.
>         * final.c (leaf_function_p): When call_insn is a fake call, it
>         won't affect caller as a leaf function.
>         * reg-stack.c (callee_clobbers_any_stack_reg): New.
>         (subst_stack_regs): When call_insn doesn't clobber any stack
>         reg, don't clear the arguments.
>         * rtl.c (shallow_copy_rtx): Don't clear flag used when orig is
>         a insn.
>         * shrink-wrap.c (requires_stack_frame_p): No need for stack
>         frame for a fake call.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/82735
>         * gcc.target/i386/pr82735-1.c: New test.
>         * gcc.target/i386/pr82735-2.c: New test.
>         * gcc.target/i386/pr82735-3.c: New test.
>         * gcc.target/i386/pr82735-4.c: New test.
>         * gcc.target/i386/pr82735-5.c: New test.

Please split the patch into a middle-end part and a target part. The
middle-end part should be approved first.

 (define_expand "avx_vzeroupper"
-  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
-  "TARGET_AVX")
+  [(parallel [(call (mem:QI (unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER))
+            (const_int 0))
+         (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])]

The call insn doesn't look like a valid RTX. Why not just:

+  [(parallel [(call (mem:QI (const_int 0))
+            (const_int 0))

for a fake call? Also, UNSPEC_VZEROUPPER can be removed this way since
the const_int 1 of UNSPEC_CALLEE_ABI is now used to detect vzeroupper.

Also, you don't need the avx_vzeroupper pattern to just call
ix86_expand_avx_vzeroupper. Just call the function directly from the
call site:

    case AVX_U128:
      if (mode == AVX_U128_CLEAN)
    emit_insn (gen_avx_vzeroupper ());
      break;

+         (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])]

Can this const_int 1 be somehow more descriptive? Perhaps use
define_constant to define I386_VZEROUPPER ABI and use it in .md as
well as .c files.
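
Something along these lines (sketch only, name and value illustrative):

;; i386.md
(define_constants
  [(I386_VZEROUPPER	1)])	;; callee ABI id of the vzeroupper pseudo-call

;; ... and in the vzeroupper pattern:
;;   (unspec [(const_int I386_VZEROUPPER)] UNSPEC_CALLEE_ABI)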

Uros.
Jakub Jelinek May 27, 2021, 7:20 a.m. UTC | #16
On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote:
> +  /* Flag used for call_insn indicates it's a fake call.  */
> +  RTX_FLAG (insn, used) = 1;

> +      /* CALL_INSN use "used" flag to indicate it's a fake call.  */
> +      if (i == STACK_POINTER_REGNUM
> +	  && !RTX_FLAG (insn_info->insn, used))

> -	  && ! SIBLING_CALL_P (insn))
> +	  && ! SIBLING_CALL_P (insn)
> +	  && !RTX_FLAG (insn, used))

> -      /* For all other RTXes clear the used flag on the copy.  */
> -      RTX_FLAG (copy, used) = 0;
> +      /* For all other RTXes clear the used flag on the copy.
> +	 CALL_INSN use "used" flag to indicate it's a fake call.  */
> +      if (!INSN_P (orig))
> +	RTX_FLAG (copy, used) = 0;
>        break;
>      }
>    return copy;
> @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
>    HARD_REG_SET hardregs;
>    unsigned regno;
>  
> -  if (CALL_P (insn))
> +  /* CALL_INSN use "used" flag to indicate it's a fake call.  */
> +  if (CALL_P (insn) && !RTX_FLAG (insn, used))
>      return !SIBLING_CALL_P (insn);

Please define a macro for this in rtl.h (and also mention it in a
comment above the used; member, see all the other comments in there), like:
/* 1 if RTX is a call_insn for a fake call.  */
#define FAKE_CALL_P(RTX)					\
  (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
Though, I'm also not sure if used can be actually used for this,
because it is used e.g. in emit-rtl.c for verification of RTL sharing.
Though, it seems no other rtl flag is free for CALL_INSN.
Could this fake call flag sit on the CALL rtx instead?

	Jakub
Richard Sandiford May 27, 2021, 10:50 a.m. UTC | #17
Jakub Jelinek <jakub@redhat.com> writes:
> On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote:
>> +  /* Flag used for call_insn indicates it's a fake call.  */
>> +  RTX_FLAG (insn, used) = 1;
>
>> +      /* CALL_INSN use "used" flag to indicate it's a fake call.  */
>> +      if (i == STACK_POINTER_REGNUM
>> +	  && !RTX_FLAG (insn_info->insn, used))
>
>> -	  && ! SIBLING_CALL_P (insn))
>> +	  && ! SIBLING_CALL_P (insn)
>> +	  && !RTX_FLAG (insn, used))
>
>> -      /* For all other RTXes clear the used flag on the copy.  */
>> -      RTX_FLAG (copy, used) = 0;
>> +      /* For all other RTXes clear the used flag on the copy.
>> +	 CALL_INSN use "used" flag to indicate it's a fake call.  */
>> +      if (!INSN_P (orig))
>> +	RTX_FLAG (copy, used) = 0;
>>        break;
>>      }
>>    return copy;
>> @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
>>    HARD_REG_SET hardregs;
>>    unsigned regno;
>>  
>> -  if (CALL_P (insn))
>> +  /* CALL_INSN use "used" flag to indicate it's a fake call.  */
>> +  if (CALL_P (insn) && !RTX_FLAG (insn, used))
>>      return !SIBLING_CALL_P (insn);
>
> Please define a macro for this in rtl.h (and mention it above used;
> member too in a comment, see all the other comments in there), like:
> /* 1 if RTX is a call_insn for a fake call.  */
> #define FAKE_CALL_P(RTX)					\
>   (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
> Though, I'm also not sure if used can be actually used for this,
> because it is used e.g. in emit-rtl.c for verification of RTL sharing.

I thought it should be OK, since:

- copy_rtx_if_shared_1 and mark_used_flags do nothing for insns
- verify_rtx_sharing is only called for parts of an insn, rather than
  an insn itself

I guess an alternative would be to add a new rtx_code for fake call
insns and use CALL_P to test for both.  However, that would lose the
property that the default behaviour is conservatively correct
(even for direct checks of CALL_INSN), so the flag IMO seems better.

Thanks,
Richard

> Though, it seems no other rtl flag is free for CALL_INSN.
> Could this fake call flag sit on the CALL rtx instead?
>
> 	Jakub
Hongtao Liu June 1, 2021, 2:22 a.m. UTC | #18
On Thu, May 27, 2021 at 6:50 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Jakub Jelinek <jakub@redhat.com> writes:
> > On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote:
> >> +  /* Flag used for call_insn indicates it's a fake call.  */
> >> +  RTX_FLAG (insn, used) = 1;
> >
> >> +      /* CALL_INSN use "used" flag to indicate it's a fake call.  */
> >> +      if (i == STACK_POINTER_REGNUM
> >> +      && !RTX_FLAG (insn_info->insn, used))
> >
> >> -      && ! SIBLING_CALL_P (insn))
> >> +      && ! SIBLING_CALL_P (insn)
> >> +      && !RTX_FLAG (insn, used))
> >
> >> -      /* For all other RTXes clear the used flag on the copy.  */
> >> -      RTX_FLAG (copy, used) = 0;
> >> +      /* For all other RTXes clear the used flag on the copy.
> >> +     CALL_INSN use "used" flag to indicate it's a fake call.  */
> >> +      if (!INSN_P (orig))
> >> +    RTX_FLAG (copy, used) = 0;
> >>        break;
> >>      }
> >>    return copy;
> >> @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
> >>    HARD_REG_SET hardregs;
> >>    unsigned regno;
> >>
> >> -  if (CALL_P (insn))
> >> +  /* CALL_INSN use "used" flag to indicate it's a fake call.  */
> >> +  if (CALL_P (insn) && !RTX_FLAG (insn, used))
> >>      return !SIBLING_CALL_P (insn);
> >
> > Please define a macro for this in rtl.h (and mention it above used;
> > member too in a comment, see all the other comments in there), like:
> > /* 1 if RTX is a call_insn for a fake call.  */
> > #define FAKE_CALL_P(RTX)                                      \
> >   (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
Changed.
> > Though, I'm also not sure if used can be actually used for this,
> > because it is used e.g. in emit-rtl.c for verification of RTL sharing.
>
> I thought it should be OK, since:
>
> - copy_rtx_if_shared_1 and mark_used_flags do nothing for insns
> - verify_rtx_sharing is only called for parts of an insn, rather than
>   an insn itself
>
> I guess an alternative would be to add a new rtx_code for fake call
> insns and use CALL_P to test for both.  However, that would lose the
> property that the default behaviour is conservatively correct
> (even for direct checks of CALL_INSN), so the flag IMO seems better.
>
> Thanks,
> Richard
>
> > Though, it seems no other rtl flag is free for CALL_INSN.
> > Could this fake call flag sit on the CALL rtx instead?
> >
> >       Jakub

Updated separate patch for the middle-end part.
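
To make the shape of the middle-end part concrete, here is a minimal
sketch (not the actual hunks) of how the pieces quoted above fit
together: FAKE_CALL_P is the rtl.h macro Jakub suggested, `pat' is a
placeholder for the vzeroupper call pattern, and the test site mirrors
requires_stack_frame_p in shrink-wrap.c.

  /* rtl.h: 1 if RTX is a call_insn for a fake call.  */
  #define FAKE_CALL_P(RTX) \
    (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)

  /* Target side, when the vzeroupper call_insn is emitted
     (`pat' stands for the vzeroupper call pattern):  */
  rtx_insn *insn = emit_call_insn (pat);
  FAKE_CALL_P (insn) = 1;

  /* Middle end: a fake call must not force a stack frame the way
     a real call does.  */
  if (CALL_P (insn) && !FAKE_CALL_P (insn))
    return !SIBLING_CALL_P (insn);
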
Hongtao Liu June 1, 2021, 2:24 a.m. UTC | #19
On Thu, May 27, 2021 at 3:05 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Thu, May 27, 2021 at 7:03 AM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > Hi:
> >   This is an updated patch which implements vzeroupper as call_insn
> > which has a special vzeroupper ABI, also in this patch i reverted
> > r11-7684, r10-6451, r10-3677 which seems to fix the same issue but in
> > a different way.
> >   Bootstrapped and regtested on x86_64-linux-gnux{-m32,} and
> > x86_64-linux-gnux{-m32 \-march=cascadelake,-march=cascadelake}.
> >   Also test the patch on SPEC2017 and eembc, no performance impact as expected.
> >   Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >         PR target/82735
> >         * config/i386/i386-expand.c (ix86_expand_builtin): Remove
> >         assignment of cfun->machine->has_explicit_vzeroupper.
> >         * config/i386/i386-features.c
> >         (ix86_add_reg_usage_to_vzerouppers): Delete.
> >         (ix86_add_reg_usage_to_vzeroupper): Ditto.
> >         (rest_of_handle_insert_vzeroupper): Remove
> >         ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
> >         of the function.
> >         (gate): Remove cfun->machine->has_explicit_vzeroupper.
> >         * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
> >         Declared.
> >         * config/i386/i386.c (ix86_insn_callee_abi): New function.
> >         (ix86_initialize_callee_abi): Ditto.
> >         (ix86_expand_avx_vzeroupper): Ditto.
> >         (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
> >         ABI.
> >         (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
> >         * config/i386/i386.h (enum i386_insn_callee_abi_index): New.
> >         (struct GTY(()) machine_function): Delete
> >         has_explicit_vzeroupper.
> >         * config/i386/i386.md (enum unspec): New member
> >         UNSPEC_CALLEE_ABI.
> >         * config/i386/predicates.md (vzeroupper_pattern): Adjust.
> >         * config/i386/sse.md (avx_vzeroupper): Call
> >         ix86_expand_avx_vzeroupper.
> >         (*avx_vzeroupper): Rename to ..
> >         (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
> >         call_insn which has a special vzeroupper ABI.
> >         (*avx_vzeroupper_1): Deleted.
> >         * df-scan.c (df_get_call_refs): When call_insn is a fake call,
> >         it won't use stack pointer reg.
> >         * final.c (leaf_function_p): When call_insn is a fake call, it
> >         won't affect caller as a leaf function.
> >         * reg-stack.c (callee_clobbers_any_stack_reg): New.
> >         (subst_stack_regs): When call_insn doesn't clobber any stack
> >         reg, don't clear the arguments.
> >         * rtl.c (shallow_copy_rtx): Don't clear flag used when orig is
> >         a insn.
> >         * shrink-wrap.c (requires_stack_frame_p): No need for stack
> >         frame for a fake call.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR target/82735
> >         * gcc.target/i386/pr82735-1.c: New test.
> >         * gcc.target/i386/pr82735-2.c: New test.
> >         * gcc.target/i386/pr82735-3.c: New test.
> >         * gcc.target/i386/pr82735-4.c: New test.
> >         * gcc.target/i386/pr82735-5.c: New test.
>
> Please split the patch to middle-end and target part. The middle-end
> should be approved first.
>
>  (define_expand "avx_vzeroupper"
> -  [(parallel [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
> -  "TARGET_AVX")
> +  [(parallel [(call (mem:QI (unspec_volatile [(const_int 0)]
> UNSPECV_VZEROUPPER))
> +            (const_int 0))
> +         (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])]
>
> The call insn doesn't look like a valid RTX. Why not just:
>
> +  [(parallel [(call (mem:QI (const_int 0)
> +            (const_int 0))
>
> for a fake call? Also, UNSPEC_VZEROUPPER can be removed this way since
> the const_int 1 of UNSPEC_CALLEE_ABI is now used to detect vzeroupper.
>
Changed.
> Also, you don't need the avx_vzeroupper pattern to just call
> ix86_expand_avx_vzeroupper. Just call the function directly from the
> call site:
>
>     case AVX_U128:
>       if (mode == AVX_U128_CLEAN)
>     emit_insn (gen_avx_vzeroupper ());
>       break;
>
Changed.
> +         (unspec [(const_int 1)] UNSPEC_CALLEE_ABI)])]
>
> Can this const_int 1 be somehow more descriptive? Perhaps use
> define_constant to define I386_VZEROUPPER ABI and use it in .md as
> well as .c files.
Changed.
>
> Uros.

Update separate patch for the backend part.

gcc/ChangeLog:

        PR target/82735
        * config/i386/i386-expand.c (ix86_expand_builtin): Remove
        assignment of cfun->machine->has_explicit_vzeroupper.
        * config/i386/i386-features.c
        (ix86_add_reg_usage_to_vzerouppers): Delete.
        (ix86_add_reg_usage_to_vzeroupper): Ditto.
        (rest_of_handle_insert_vzeroupper): Remove
        ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end
        of the function.
        (gate): Remove cfun->machine->has_explicit_vzeroupper.
        * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper):
        Declared.
        * config/i386/i386.c (ix86_insn_callee_abi): New function.
        (ix86_initialize_callee_abi): Ditto.
        (ix86_expand_avx_vzeroupper): Ditto.
        (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper
        ABI.
        (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi.
        (ix86_emit_mode_set): Call ix86_expand_avx_vzeroupper
        directly.
        * config/i386/i386.h (struct GTY(()) machine_function): Delete
        has_explicit_vzeroupper.
        * config/i386/i386.md (enum unspec): New member
        UNSPEC_CALLEE_ABI.
        (I386_DEFAULT,I386_VZEROUPPER,I386_UNKNOWN): New
        define_constants for insn callee abi index.
        * config/i386/predicates.md (vzeroupper_pattern): Adjust.
        * config/i386/sse.md (UNSPECV_VZEROUPPER): Deleted.
        (avx_vzeroupper): Call ix86_expand_avx_vzeroupper.
        (*avx_vzeroupper): Rename to ..
        (avx_vzeroupper_callee_abi): .. this, and adjust pattern as
        call_insn which has a special vzeroupper ABI.
        (*avx_vzeroupper_1): Deleted.

gcc/testsuite/ChangeLog:

        PR target/82735
        * gcc.target/i386/pr82735-1.c: New test.
        * gcc.target/i386/pr82735-2.c: New test.
        * gcc.target/i386/pr82735-3.c: New test.
        * gcc.target/i386/pr82735-4.c: New test.
        * gcc.target/i386/pr82735-5.c: New test.
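
For reference, a rough sketch of the expander side these entries
describe (using only the names listed above; the exact code is in the
attached backend patch): vzeroupper is emitted as a call_insn via the
avx_vzeroupper_callee_abi pattern and immediately flagged as a fake
call.

  void
  ix86_expand_avx_vzeroupper (void)
  {
    /* Emit vzeroupper as a call_insn carrying the special
       I386_VZEROUPPER callee ABI, then mark it as a fake call so
       the middle end does not treat it as a real one.  */
    rtx_insn *insn = emit_call_insn (gen_avx_vzeroupper_callee_abi ());
    FAKE_CALL_P (insn) = 1;
  }
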
Hongtao Liu June 1, 2021, 2:25 a.m. UTC | #20
On Tue, Jun 1, 2021 at 10:22 AM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Thu, May 27, 2021 at 6:50 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Jakub Jelinek <jakub@redhat.com> writes:
> > > On Thu, May 27, 2021 at 01:07:09PM +0800, Hongtao Liu via Gcc-patches wrote:
> > >> +  /* Flag used for call_insn indicates it's a fake call.  */
> > >> +  RTX_FLAG (insn, used) = 1;
> > >
> > >> +      /* CALL_INSN use "used" flag to indicate it's a fake call.  */
> > >> +      if (i == STACK_POINTER_REGNUM
> > >> +      && !RTX_FLAG (insn_info->insn, used))
> > >
> > >> -      && ! SIBLING_CALL_P (insn))
> > >> +      && ! SIBLING_CALL_P (insn)
> > >> +      && !RTX_FLAG (insn, used))
> > >
> > >> -      /* For all other RTXes clear the used flag on the copy.  */
> > >> -      RTX_FLAG (copy, used) = 0;
> > >> +      /* For all other RTXes clear the used flag on the copy.
> > >> +     CALL_INSN use "used" flag to indicate it's a fake call.  */
> > >> +      if (!INSN_P (orig))
> > >> +    RTX_FLAG (copy, used) = 0;
> > >>        break;
> > >>      }
> > >>    return copy;
> > >> @@ -57,7 +57,8 @@ requires_stack_frame_p (rtx_insn *insn, HARD_REG_SET prologue_used,
> > >>    HARD_REG_SET hardregs;
> > >>    unsigned regno;
> > >>
> > >> -  if (CALL_P (insn))
> > >> +  /* CALL_INSN use "used" flag to indicate it's a fake call.  */
> > >> +  if (CALL_P (insn) && !RTX_FLAG (insn, used))
> > >>      return !SIBLING_CALL_P (insn);
> > >
> > > Please define a macro for this in rtl.h (and mention it above used;
> > > member too in a comment, see all the other comments in there), like:
> > > /* 1 if RTX is a call_insn for a fake call.  */
> > > #define FAKE_CALL_P(RTX)                                      \
> > >   (RTL_FLAG_CHECK1 ("FAKE_CALL_P", (RTX), CALL_INSN)->used)
> Changed.
> > > Though, I'm also not sure if used can be actually used for this,
> > > because it is used e.g. in emit-rtl.c for verification of RTL sharing.
> >
> > I thought it should be OK, since:
> >
> > - copy_rtx_if_shared_1 and mark_used_flags do nothing for insns
> > - verify_rtx_sharing is only called for parts of an insn, rather than
> >   an insn itself
> >
> > I guess an alternative would be to add a new rtx_code for fake call
> > insns and use CALL_P to test for both.  However, that would lose the
> > property that the default behaviour is conservatively correct
> > (even for direct checks of CALL_INSN), so the flag IMO seems better.
> >
> > Thanks,
> > Richard
> >
> > > Though, it seems no other rtl flag is free for CALL_INSN.
> > > Could this fake call flag sit on the CALL rtx instead?
> > >
> > >       Jakub
>
> Updated separate patch for the middle-end part.

gcc/ChangeLog

        PR target/82735
        * df-scan.c (df_get_call_refs): When call_insn is a fake call,
        it won't use stack pointer reg.
        * final.c (leaf_function_p): When call_insn is a fake call, it
        won't affect caller as a leaf function.
        * reg-stack.c (callee_clobbers_any_stack_reg): New.
        (subst_stack_regs): When call_insn doesn't clobber any stack
        reg, don't clear the arguments.
        * rtl.c (shallow_copy_rtx): Don't clear flag used when orig is
        an insn.
        * shrink-wrap.c (requires_stack_frame_p): No need for stack
        frame for a fake call.
        * rtl.h (FAKE_CALL_P): New macro.

>
> --
> BR,
> Hongtao
diff mbox series

Patch

From d53b0c6934ea499c9f87df963661b627e7e977bf Mon Sep 17 00:00:00 2001
From: liuhongt <hongtao.liu@intel.com>
Date: Wed, 12 May 2021 14:20:54 +0800
Subject: [PATCH] [i386] Fix _mm256_zeroupper to notify LRA that vzeroupper
 will kill sse registers.

When __builtin_ia32_vzeroupper is called explicitly, the corresponding
vzeroupper pattern does not carry any CLOBBERS or SETs before LRA,
which leads to incorrect optimization in pass_reload.
In order to solve this problem, this patch introduces a pre_reload
splitter which adds CLOBBERS to vzeroupper's pattern, it can solve the
problem in pr.

At the same time, in order to optimize the low 128 bits in
post_reload CSE, this patch also transforms those CLOBBERS to SETs in
pass_vzeroupper.

It works fine except for TARGET_64BIT_MS_ABI, under which xmm6-xmm15
are callee-saved, so even if there're no other uses of xmm6-xmm15 in the
function, because of vzeroupper's pattern, pro_epilog will save and
restore those registers, which is obviously redundant. In order to
eliminate this redundancy, a post_reload splitter is introduced, which
drops those SETs, until epilogue_completed splitter adds those SETs
back, it looks to be safe since there's no CSE between post_reload
split2 and epilogue_completed split3??? Also frame info needs to be
updated in pro_epilog, which saves and restores xmm6-xmm15 only if
there's usage other than explicit vzeroupper pattern.

gcc/ChangeLog:

	PR target/82735
	* config/i386/i386-expand.c (ix86_expand_builtin): Count
	number of __builtin_ia32_vzeroupper.
	* config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers):
	Transform CLOBBERs to SETs for explicit vzeroupper pattern so
	that CSE can optimize lower 128 bits.
	* config/i386/i386.c (ix86_handle_explicit_vzeroupper_in_pro_epilog):
	New.
	(ix86_save_reg): If there's no use of xmm6~xmm15 other than
	explicit vzeroupper under TARGET_64BIT_MS_ABI, no need to save
	REGNO.
	(ix86_finalize_stack_frame_flags): Recompute frame layout if
	there's explicit vzeroupper under TARGET_64BIT_MS_ABI.
	* config/i386/i386.h (struct machine_function): Change type of
	has_explicit_vzeroupper from BOOL_BITFIELD to unsigned int.
	* config/i386/sse.md (*avx_vzeroupper_2): New post-reload
	splitter which will drop all SETs for explicit vzeroupper
	patterns.
	(*avx_vzeroupper_1): Generate SET reg to reg instead of
	CLOBBER, and add pre-reload splitter after it.

gcc/testsuite/ChangeLog:

	PR target/82735
	* gcc.target/i386/pr82735-1.c: New test.
	* gcc.target/i386/pr82735-2.c: New test.
	* gcc.target/i386/pr82735-3.c: New test.
	* gcc.target/i386/pr82735-4.c: New test.
	* gcc.target/i386/pr82735-5.c: New test.
---
 gcc/config/i386/i386-expand.c             |  2 +-
 gcc/config/i386/i386-features.c           | 25 ++++++++++-
 gcc/config/i386/i386.c                    | 23 ++++++++++
 gcc/config/i386/i386.h                    |  8 ++--
 gcc/config/i386/sse.md                    | 48 +++++++++++++++++++-
 gcc/testsuite/gcc.target/i386/pr82735-1.c | 29 ++++++++++++
 gcc/testsuite/gcc.target/i386/pr82735-2.c | 21 +++++++++
 gcc/testsuite/gcc.target/i386/pr82735-3.c |  5 +++
 gcc/testsuite/gcc.target/i386/pr82735-4.c | 48 ++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr82735-5.c | 54 +++++++++++++++++++++++
 10 files changed, 256 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr82735-5.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index fee4d07b7fd..7f3326a12b2 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -13233,7 +13233,7 @@  rdseed_step:
       return 0;
 
     case IX86_BUILTIN_VZEROUPPER:
-      cfun->machine->has_explicit_vzeroupper = true;
+      cfun->machine->has_explicit_vzeroupper++;
       break;
 
     default:
diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index 77783a154b6..6b2179f16cb 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -1827,8 +1827,31 @@  ix86_add_reg_usage_to_vzerouppers (void)
 	{
 	  if (!NONDEBUG_INSN_P (insn))
 	    continue;
+	  /* Transform CLOBBERs to SETs so that lower 128 bits of sse registers
+	     will be able to cross vzeroupper in post-reload CSE.  */
 	  if (vzeroupper_pattern (PATTERN (insn), VOIDmode))
-	    ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
+	    {
+	      if (XVECEXP (XVECEXP (PATTERN (insn), 0, 0), 0, 0) == const1_rtx)
+		{
+		  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+		  rtvec vec = rtvec_alloc (nregs + 1);
+		  RTVEC_ELT (vec, 0) = XVECEXP (PATTERN (insn), 0, 0);
+		  for (unsigned int i = 0; i < nregs; ++i)
+		    {
+		      unsigned int regno = GET_SSE_REGNO (i);
+		      rtx reg = gen_rtx_REG (V2DImode, regno);
+		      RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
+		    }
+		  XVEC (PATTERN (insn), 0) = vec;
+		  INSN_CODE (insn) = -1;
+		  df_insn_rescan (insn);
+		}
+	      else
+		{
+		  gcc_assert (XVECLEN (PATTERN (insn), 0) == 1);
+		  ix86_add_reg_usage_to_vzeroupper (insn, live_regs);
+		}
+	    }
 	  df_simulate_one_insn_backwards (bb, insn, live_regs);
 	}
     }
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 780da108a7c..4d4d7dbbc82 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -6170,6 +6170,17 @@  ix86_hard_regno_scratch_ok (unsigned int regno)
 	      && df_regs_ever_live_p (regno)));
 }
 
+/* Return true if explicit usage of __builtin_ia32_vzeroupper
+   should be specially handled in pro_epilog.  */
+static bool
+ix86_handle_explicit_vzeroupper_in_pro_epilog ()
+{
+  return (cfun->machine->has_explicit_vzeroupper
+	  && TARGET_64BIT_MS_ABI
+	  && !epilogue_completed
+	  && reload_completed);
+}
+
 /* Return TRUE if we need to save REGNO.  */
 
 bool
@@ -6244,6 +6255,16 @@  ix86_save_reg (unsigned int regno, bool maybe_eh_return, bool ignore_outlined)
       && !cfun->machine->no_drap_save_restore)
     return true;
 
+  /* If there's no use other than explicit vzeroupper
+     for xmm6~xmm15 under TARGET_64BIT_MS_ABI,
+     no need to save REGNO.  */
+  if (ix86_handle_explicit_vzeroupper_in_pro_epilog ()
+      && (IN_RANGE (regno, FIRST_SSE_REG + 6, LAST_SSE_REG)
+	  || IN_RANGE (regno, FIRST_REX_SSE_REG, LAST_REX_SSE_REG)))
+    return df_regs_ever_live_p (regno)
+      ? df_hard_reg_used_count (regno) > cfun->machine->has_explicit_vzeroupper
+      : false;
+
   return (df_regs_ever_live_p (regno)
 	  && !call_used_or_fixed_reg_p (regno)
 	  && (regno != HARD_FRAME_POINTER_REGNUM || !frame_pointer_needed));
@@ -8046,6 +8067,8 @@  ix86_finalize_stack_frame_flags (void)
     recompute_frame_layout_p = true;
   crtl->stack_realign_needed = stack_realign;
   crtl->stack_realign_finalized = true;
+  if (ix86_handle_explicit_vzeroupper_in_pro_epilog ())
+    recompute_frame_layout_p = true;
   if (recompute_frame_layout_p)
     ix86_compute_frame_layout ();
 }
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 97d6f3863cb..c0855a936ac 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2654,10 +2654,6 @@  struct GTY(()) machine_function {
   /* True if the function needs a stack frame.  */
   BOOL_BITFIELD stack_frame_required : 1;
 
-  /* True if __builtin_ia32_vzeroupper () has been expanded in current
-     function.  */
-  BOOL_BITFIELD has_explicit_vzeroupper : 1;
-
   /* True if we should act silently, rather than raise an error for
      invalid calls.  */
   BOOL_BITFIELD silent_p : 1;
@@ -2665,6 +2661,10 @@  struct GTY(()) machine_function {
   /* The largest alignment, in bytes, of stack slot actually used.  */
   unsigned int max_used_stack_alignment;
 
+  /* Number of __builtin_ia32_vzeroupper () which has been expanded in
+     current function.  */
+  unsigned int has_explicit_vzeroupper;
+
   /* During prologue/epilogue generation, the current frame state.
      Otherwise, the frame state at the end of the prologue.  */
   struct machine_frame_state fs;
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 897cf3eaea9..489fa02fa20 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -20626,7 +20626,7 @@  (define_insn_and_split "*avx_vzeroupper_1"
       else
 	{
 	  rtx reg = gen_rtx_REG (V2DImode, regno);
-	  RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+	  RTVEC_ELT (vec, i + 1) = gen_rtx_SET (reg, reg);
 	}
     }
   operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
@@ -20638,6 +20638,52 @@  (define_insn_and_split "*avx_vzeroupper_1"
    (set_attr "btver2_decode" "vector")
    (set_attr "mode" "OI")])
 
+(define_split
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 0)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX && ix86_pre_reload_split ()"
+  [(match_dup 0)]
+{
+  /* When vzeroupper is explicitly used, for LRA purpose, make it clear
+     the instruction kills sse registers.  */
+  gcc_assert (cfun->machine->has_explicit_vzeroupper);
+  unsigned int nregs = TARGET_64BIT ? 16 : 8;
+  rtvec vec = rtvec_alloc (nregs + 1);
+  RTVEC_ELT (vec, 0) = gen_rtx_UNSPEC_VOLATILE (VOIDmode,
+						gen_rtvec (1, const1_rtx),
+						UNSPECV_VZEROUPPER);
+  for (unsigned int i = 0; i < nregs; ++i)
+    {
+      unsigned int regno = GET_SSE_REGNO (i);
+      rtx reg = gen_rtx_REG (V2DImode, regno);
+      RTVEC_ELT (vec, i + 1) = gen_rtx_CLOBBER (VOIDmode, reg);
+    }
+  operands[0] = gen_rtx_PARALLEL (VOIDmode, vec);
+})
+
+(define_insn_and_split "*avx_vzeroupper_2"
+  [(match_parallel 0 "vzeroupper_pattern"
+     [(unspec_volatile [(const_int 1)] UNSPECV_VZEROUPPER)])]
+  "TARGET_AVX && XVECLEN (operands[0], 0) == (TARGET_64BIT ? 16 : 8) + 1"
+  "vzeroupper"
+  "&& reload_completed && TARGET_64BIT_MS_ABI"
+  [(const_int 0)]
+{
+  /* To avoid redundant save and restore in pro_and_epilog, drop
+     those SETs/CLOBBERs which are added by pre-reload splitter
+     or pass_vzeroupper, it's safe since there's no CSE optimization
+     between post-reload split2 and epilogue-completed split3???  */
+  gcc_assert (cfun->machine->has_explicit_vzeroupper);
+  emit_insn (gen_avx_vzeroupper ());
+  DONE;
+}
+  [(set_attr "type" "sse")
+   (set_attr "modrm" "0")
+   (set_attr "memory" "none")
+   (set_attr "prefix" "vex")
+   (set_attr "btver2_decode" "vector")
+   (set_attr "mode" "OI")])
+
 (define_mode_attr pbroadcast_evex_isa
   [(V64QI "avx512bw") (V32QI "avx512bw") (V16QI "avx512bw")
    (V32HI "avx512bw") (V16HI "avx512bw") (V8HI "avx512bw")
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-1.c b/gcc/testsuite/gcc.target/i386/pr82735-1.c
new file mode 100644
index 00000000000..1a63b9ae9c9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-1.c
@@ -0,0 +1,29 @@ 
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx" } */
+/* { dg-require-effective-target avx } */
+
+#include "avx-check.h"
+
+void
+__attribute__ ((noipa))
+mtest(char *dest)
+{
+  __m256i ymm1 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm1);
+  _mm256_zeroupper();
+  __m256i ymm2 = _mm256_set1_epi8((char)0x1);
+  _mm256_storeu_si256((__m256i *)dest, ymm2);
+}
+
+void
+avx_test ()
+{
+  char buf[64];
+  for (int i = 0; i != 64; i++)
+    buf[i] = 2;
+  mtest (buf);
+
+  for (int i = 0; i < 32; ++i)
+    if (buf[i] != 1)
+      __builtin_abort ();
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-2.c b/gcc/testsuite/gcc.target/i386/pr82735-2.c
new file mode 100644
index 00000000000..48d0d6e983d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-2.c
@@ -0,0 +1,21 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2" } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  /* xmm1 can be propagated to xmm2 by CSE.  */
+  __m128i xmm1 = _mm_set1_epi8((char)0x1);
+  _mm_storeu_si128((__m128i *)(dest + 32), xmm1);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  __m128i xmm2 = _mm_set1_epi8((char)0x1);
+  _mm_storeu_si128((__m128i *)dest, xmm2);
+}
+
+/* Darwin local constant symbol is "lC0", ELF targets ".LC0" */
+/* { dg-final { scan-assembler-times {(?n)vmovdqa\t\.?[Ll]C0[^,]*, %xmm[0-9]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-3.c b/gcc/testsuite/gcc.target/i386/pr82735-3.c
new file mode 100644
index 00000000000..e3f801e6924
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-3.c
@@ -0,0 +1,5 @@ 
+/* { dg-do compile } */
+/* { dg-options "-mavx -O2 -mabi=ms" } */
+/* { dg-final { scan-assembler-not {(?n)xmm([6-9]|1[0-5])} } } */
+
+#include "pr82735-2.c"
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-4.c b/gcc/testsuite/gcc.target/i386/pr82735-4.c
new file mode 100644
index 00000000000..78c0a6cb2c8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-4.c
@@ -0,0 +1,48 @@ 
+/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr82735-5.c b/gcc/testsuite/gcc.target/i386/pr82735-5.c
new file mode 100644
index 00000000000..2a58cbe52d0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr82735-5.c
@@ -0,0 +1,54 @@ 
+/* { dg-do compile { target { ! ia32 } } }  */
+/* { dg-options "-mavx -O2 -mabi=ms -mno-avx512f -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*%xmm[0-9]+, [0-9]*\(%rsp\)} 10 } } */
+/* { dg-final { scan-assembler-times {(?n)(?:vmovdqa[1-9]*|vmovap[sd])[\t ]*[0-9]*\(%rsp\), %xmm[0-9]+} 10 } } */
+
+#include <immintrin.h>
+
+void test(char *dest)
+{
+  __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15;
+  asm volatile ("vmovdqa\t%%ymm0, %0\n\t"
+		"vmovdqa\t%%ymm0, %1\n\t"
+		"vmovdqa\t%%ymm0, %2\n\t"
+		"vmovdqa\t%%ymm0, %3\n\t"
+		"vmovdqa\t%%ymm0, %4\n\t"
+		"vmovdqa\t%%ymm0, %5\n\t"
+		"vmovdqa\t%%ymm0, %6\n\t"
+		"vmovdqa\t%%ymm0, %7\n\t"
+		"vmovdqa\t%%ymm0, %8\n\t"
+		"vmovdqa\t%%ymm0, %9\n\t"
+		"vmovdqa\t%%ymm0, %10\n\t"
+		"vmovdqa\t%%ymm0, %11\n\t"
+		"vmovdqa\t%%ymm0, %12\n\t"
+		"vmovdqa\t%%ymm0, %13\n\t"
+		"vmovdqa\t%%ymm0, %14\n\t"
+		"vmovdqa\t%%ymm0, %15\n\t"
+		: "=v" (ymm1), "=v" (ymm2), "=v"(ymm3), "=v" (ymm4), "=v" (ymm5),
+		  "=v" (ymm6), "=v" (ymm7), "=v"(ymm8), "=v" (ymm9), "=v" (ymm10),
+		  "=v" (ymm11), "=v" (ymm12), "=v"(ymm13), "=v" (ymm14), "=v" (ymm15),
+		  "=v"(ymm0)
+		::);
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_zeroupper();
+  _mm256_storeu_si256((__m256i *)dest, ymm1);
+  _mm256_storeu_si256((__m256i *)(dest + 32), ymm2);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 2), ymm3);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 3), ymm4);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 4), ymm5);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 5), ymm6);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 6), ymm7);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 7), ymm8);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 8), ymm9);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 9), ymm10);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 10), ymm11);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 11), ymm12);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 12), ymm13);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 13), ymm14);
+  _mm256_storeu_si256((__m256i *)(dest + 32 * 14), ymm15);
+}
-- 
2.18.1