[04/32,x86] Robustify vzeroupper handling across calls
diff mbox series

Message ID mpt1rwmzma7.fsf@arm.com
State New
Headers show
Series
  • Support multiple ABIs in the same translation unit
Related show

Commit Message

Richard Sandiford Sept. 11, 2019, 7:05 p.m. UTC
One of the effects of the function_abi series is to make -fipa-ra
work for partially call-clobbered registers.  E.g. if a call preserves
only the low 32 bits of a register R, we handled the partial clobber
separately from -fipa-ra, and so treated the upper bits of R as
clobbered even if we knew that the target function doesn't touch R.

"Fixing" this caused problems for the vzeroupper handling on x86.
The pass that inserts the vzerouppers assumes that no 256-bit or 512-bit
values are live across a call unless the call takes a 256-bit or 512-bit
argument:

      /* Needed mode is set to AVX_U128_CLEAN if there are
	 no 256bit or 512bit modes used in function arguments. */

This implicitly relies on:

/* Implement TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  The only ABI that
   saves SSE registers across calls is Win64 (thus no need to check the
   current ABI here), and with AVX enabled Win64 only guarantees that
   the low 16 bytes are saved.  */

static bool
ix86_hard_regno_call_part_clobbered (rtx_insn *insn ATTRIBUTE_UNUSED,
				     unsigned int regno, machine_mode mode)
{
  return SSE_REGNO_P (regno) && GET_MODE_SIZE (mode) > 16;
}

The comment suggests that this code is only needed for Win64 and that
not testing for Win64 is just a simplification.  But in practice it was
needed for correctness on GNU/Linux and other targets too, since without
it the RA would be able to keep 256-bit and 512-bit values in SSE
registers across calls that are known not to clobber them.

This patch conservatively treats calls as AVX_U128_ANY if the RA can see
that some SSE registers are not touched by a call.  There are then no
regressions if the ix86_hard_regno_call_part_clobbered check is disabled
for GNU/Linux (not something we should do, was just for testing).

If in fact we want -fipa-ra to pretend that all functions clobber
SSE registers above 128 bits, it'd certainly be possible to arrange
that.  But IMO that would be an optimisation decision, whereas what
the patch is fixing is a correctness decision.  So I think we should
have this check even so.


2019-09-11  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* config/i386/i386.c: Include function-abi.h.
	(ix86_avx_u128_mode_needed): Treat function calls as AVX_U128_ANY
	if they preserve some 256-bit or 512-bit SSE registers.

Comments

Richard Sandiford Sept. 25, 2019, 3:48 p.m. UTC | #1
Ping

Richard Sandiford <richard.sandiford@arm.com> writes:
> One of the effects of the function_abi series is to make -fipa-ra
> work for partially call-clobbered registers.  E.g. if a call preserves
> only the low 32 bits of a register R, we handled the partial clobber
> separately from -fipa-ra, and so treated the upper bits of R as
> clobbered even if we knew that the target function doesn't touch R.
>
> "Fixing" this caused problems for the vzeroupper handling on x86.
> The pass that inserts the vzerouppers assumes that no 256-bit or 512-bit
> values are live across a call unless the call takes a 256-bit or 512-bit
> argument:
>
>       /* Needed mode is set to AVX_U128_CLEAN if there are
> 	 no 256bit or 512bit modes used in function arguments. */
>
> This implicitly relies on:
>
> /* Implement TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  The only ABI that
>    saves SSE registers across calls is Win64 (thus no need to check the
>    current ABI here), and with AVX enabled Win64 only guarantees that
>    the low 16 bytes are saved.  */
>
> static bool
> ix86_hard_regno_call_part_clobbered (rtx_insn *insn ATTRIBUTE_UNUSED,
> 				     unsigned int regno, machine_mode mode)
> {
>   return SSE_REGNO_P (regno) && GET_MODE_SIZE (mode) > 16;
> }
>
> The comment suggests that this code is only needed for Win64 and that
> not testing for Win64 is just a simplification.  But in practice it was
> needed for correctness on GNU/Linux and other targets too, since without
> it the RA would be able to keep 256-bit and 512-bit values in SSE
> registers across calls that are known not to clobber them.
>
> This patch conservatively treats calls as AVX_U128_ANY if the RA can see
> that some SSE registers are not touched by a call.  There are then no
> regressions if the ix86_hard_regno_call_part_clobbered check is disabled
> for GNU/Linux (not something we should do, was just for testing).
>
> If in fact we want -fipa-ra to pretend that all functions clobber
> SSE registers above 128 bits, it'd certainly be possible to arrange
> that.  But IMO that would be an optimisation decision, whereas what
> the patch is fixing is a correctness decision.  So I think we should
> have this check even so.

2019-09-25  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* config/i386/i386.c: Include function-abi.h.
	(ix86_avx_u128_mode_needed): Treat function calls as AVX_U128_ANY
	if they preserve some 256-bit or 512-bit SSE registers.

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	2019-09-25 16:47:48.000000000 +0100
+++ gcc/config/i386/i386.c	2019-09-25 16:47:49.089962608 +0100
@@ -95,6 +95,7 @@ #define IN_TARGET_CODE 1
 #include "i386-builtins.h"
 #include "i386-expand.h"
 #include "i386-features.h"
+#include "function-abi.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -13511,6 +13512,15 @@ ix86_avx_u128_mode_needed (rtx_insn *ins
 	    }
 	}
 
+      /* If the function is known to preserve some SSE registers,
+	 RA and previous passes can legitimately rely on that for
+	 modes wider than 256 bits.  It's only safe to issue a
+	 vzeroupper if all SSE registers are clobbered.  */
+      const function_abi &abi = insn_callee_abi (insn);
+      if (!hard_reg_set_subset_p (reg_class_contents[ALL_SSE_REGS],
+				  abi.mode_clobbers (V4DImode)))
+	return AVX_U128_ANY;
+
       return AVX_U128_CLEAN;
     }
Uros Bizjak Sept. 25, 2019, 6:11 p.m. UTC | #2
On Wed, Sep 25, 2019 at 5:48 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Ping
>
> Richard Sandiford <richard.sandiford@arm.com> writes:
> > One of the effects of the function_abi series is to make -fipa-ra
> > work for partially call-clobbered registers.  E.g. if a call preserves
> > only the low 32 bits of a register R, we handled the partial clobber
> > separately from -fipa-ra, and so treated the upper bits of R as
> > clobbered even if we knew that the target function doesn't touch R.
> >
> > "Fixing" this caused problems for the vzeroupper handling on x86.
> > The pass that inserts the vzerouppers assumes that no 256-bit or 512-bit
> > values are live across a call unless the call takes a 256-bit or 512-bit
> > argument:
> >
> >       /* Needed mode is set to AVX_U128_CLEAN if there are
> >        no 256bit or 512bit modes used in function arguments. */
> >
> > This implicitly relies on:
> >
> > /* Implement TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  The only ABI that
> >    saves SSE registers across calls is Win64 (thus no need to check the
> >    current ABI here), and with AVX enabled Win64 only guarantees that
> >    the low 16 bytes are saved.  */
> >
> > static bool
> > ix86_hard_regno_call_part_clobbered (rtx_insn *insn ATTRIBUTE_UNUSED,
> >                                    unsigned int regno, machine_mode mode)
> > {
> >   return SSE_REGNO_P (regno) && GET_MODE_SIZE (mode) > 16;
> > }
> >
> > The comment suggests that this code is only needed for Win64 and that
> > not testing for Win64 is just a simplification.  But in practice it was
> > needed for correctness on GNU/Linux and other targets too, since without
> > it the RA would be able to keep 256-bit and 512-bit values in SSE
> > registers across calls that are known not to clobber them.
> >
> > This patch conservatively treats calls as AVX_U128_ANY if the RA can see
> > that some SSE registers are not touched by a call.  There are then no
> > regressions if the ix86_hard_regno_call_part_clobbered check is disabled
> > for GNU/Linux (not something we should do, was just for testing).
> >
> > If in fact we want -fipa-ra to pretend that all functions clobber
> > SSE registers above 128 bits, it'd certainly be possible to arrange
> > that.  But IMO that would be an optimisation decision, whereas what
> > the patch is fixing is a correctness decision.  So I think we should
> > have this check even so.
>
> 2019-09-25  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * config/i386/i386.c: Include function-abi.h.
>         (ix86_avx_u128_mode_needed): Treat function calls as AVX_U128_ANY
>         if they preserve some 256-bit or 512-bit SSE registers.

OK.

Thanks,
Uros.

>
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c      2019-09-25 16:47:48.000000000 +0100
> +++ gcc/config/i386/i386.c      2019-09-25 16:47:49.089962608 +0100
> @@ -95,6 +95,7 @@ #define IN_TARGET_CODE 1
>  #include "i386-builtins.h"
>  #include "i386-expand.h"
>  #include "i386-features.h"
> +#include "function-abi.h"
>
>  /* This file should be included last.  */
>  #include "target-def.h"
> @@ -13511,6 +13512,15 @@ ix86_avx_u128_mode_needed (rtx_insn *ins
>             }
>         }
>
> +      /* If the function is known to preserve some SSE registers,
> +        RA and previous passes can legitimately rely on that for
> +        modes wider than 256 bits.  It's only safe to issue a
> +        vzeroupper if all SSE registers are clobbered.  */
> +      const function_abi &abi = insn_callee_abi (insn);
> +      if (!hard_reg_set_subset_p (reg_class_contents[ALL_SSE_REGS],
> +                                 abi.mode_clobbers (V4DImode)))
> +       return AVX_U128_ANY;
> +
>        return AVX_U128_CLEAN;
>      }
>
Uros Bizjak Oct. 1, 2019, 10:14 a.m. UTC | #3
On Wed, Sep 25, 2019 at 5:48 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:

> > The comment suggests that this code is only needed for Win64 and that
> > not testing for Win64 is just a simplification.  But in practice it was
> > needed for correctness on GNU/Linux and other targets too, since without
> > it the RA would be able to keep 256-bit and 512-bit values in SSE
> > registers across calls that are known not to clobber them.
> >
> > This patch conservatively treats calls as AVX_U128_ANY if the RA can see
> > that some SSE registers are not touched by a call.  There are then no
> > regressions if the ix86_hard_regno_call_part_clobbered check is disabled
> > for GNU/Linux (not something we should do, was just for testing).

If RA can sse that some SSE regs are not touched by the call, then we
are sure that the called function is part of the current TU. In this
case, the called function will be compiled using VEX instructions,
where there is no AVX-SSE transition penalty. So, skipping VZEROUPPER
is beneficial here.

Uros.

> > If in fact we want -fipa-ra to pretend that all functions clobber
> > SSE registers above 128 bits, it'd certainly be possible to arrange
> > that.  But IMO that would be an optimisation decision, whereas what
> > the patch is fixing is a correctness decision.  So I think we should
> > have this check even so.
>
> 2019-09-25  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * config/i386/i386.c: Include function-abi.h.
>         (ix86_avx_u128_mode_needed): Treat function calls as AVX_U128_ANY
>         if they preserve some 256-bit or 512-bit SSE registers.
>
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c      2019-09-25 16:47:48.000000000 +0100
> +++ gcc/config/i386/i386.c      2019-09-25 16:47:49.089962608 +0100
> @@ -95,6 +95,7 @@ #define IN_TARGET_CODE 1
>  #include "i386-builtins.h"
>  #include "i386-expand.h"
>  #include "i386-features.h"
> +#include "function-abi.h"
>
>  /* This file should be included last.  */
>  #include "target-def.h"
> @@ -13511,6 +13512,15 @@ ix86_avx_u128_mode_needed (rtx_insn *ins
>             }
>         }
>
> +      /* If the function is known to preserve some SSE registers,
> +        RA and previous passes can legitimately rely on that for
> +        modes wider than 256 bits.  It's only safe to issue a
> +        vzeroupper if all SSE registers are clobbered.  */
> +      const function_abi &abi = insn_callee_abi (insn);
> +      if (!hard_reg_set_subset_p (reg_class_contents[ALL_SSE_REGS],
> +                                 abi.mode_clobbers (V4DImode)))
> +       return AVX_U128_ANY;
> +
>        return AVX_U128_CLEAN;
>      }
>
Uros Bizjak Oct. 8, 2019, 6:17 p.m. UTC | #4
The following patch uses correct SSE register class; vzeroupper
operates only on lower 16 (8 on 32bit target) SSE registers.

2019-10-08  UroŇ° Bizjak  <ubizjak@gmail.com>

    PR target/91994
    * config/i386/i386.c (x86_avx_u128_mode_needed): Use SSE_REG
    instead of ALL_SSE_REG to check if function call preserves some
    256-bit SSE registers.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Committed to mainline SVN.

Uros.

On Tue, Oct 1, 2019 at 12:14 PM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Wed, Sep 25, 2019 at 5:48 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>
> > > The comment suggests that this code is only needed for Win64 and that
> > > not testing for Win64 is just a simplification.  But in practice it was
> > > needed for correctness on GNU/Linux and other targets too, since without
> > > it the RA would be able to keep 256-bit and 512-bit values in SSE
> > > registers across calls that are known not to clobber them.
> > >
> > > This patch conservatively treats calls as AVX_U128_ANY if the RA can see
> > > that some SSE registers are not touched by a call.  There are then no
> > > regressions if the ix86_hard_regno_call_part_clobbered check is disabled
> > > for GNU/Linux (not something we should do, was just for testing).
>
> If RA can sse that some SSE regs are not touched by the call, then we
> are sure that the called function is part of the current TU. In this
> case, the called function will be compiled using VEX instructions,
> where there is no AVX-SSE transition penalty. So, skipping VZEROUPPER
> is beneficial here.
>
> Uros.
>
> > > If in fact we want -fipa-ra to pretend that all functions clobber
> > > SSE registers above 128 bits, it'd certainly be possible to arrange
> > > that.  But IMO that would be an optimisation decision, whereas what
> > > the patch is fixing is a correctness decision.  So I think we should
> > > have this check even so.
> >
> > 2019-09-25  Richard Sandiford  <richard.sandiford@arm.com>
> >
> > gcc/
> >         * config/i386/i386.c: Include function-abi.h.
> >         (ix86_avx_u128_mode_needed): Treat function calls as AVX_U128_ANY
> >         if they preserve some 256-bit or 512-bit SSE registers.
> >
> > Index: gcc/config/i386/i386.c
> > ===================================================================
> > --- gcc/config/i386/i386.c      2019-09-25 16:47:48.000000000 +0100
> > +++ gcc/config/i386/i386.c      2019-09-25 16:47:49.089962608 +0100
> > @@ -95,6 +95,7 @@ #define IN_TARGET_CODE 1
> >  #include "i386-builtins.h"
> >  #include "i386-expand.h"
> >  #include "i386-features.h"
> > +#include "function-abi.h"
> >
> >  /* This file should be included last.  */
> >  #include "target-def.h"
> > @@ -13511,6 +13512,15 @@ ix86_avx_u128_mode_needed (rtx_insn *ins
> >             }
> >         }
> >
> > +      /* If the function is known to preserve some SSE registers,
> > +        RA and previous passes can legitimately rely on that for
> > +        modes wider than 256 bits.  It's only safe to issue a
> > +        vzeroupper if all SSE registers are clobbered.  */
> > +      const function_abi &abi = insn_callee_abi (insn);
> > +      if (!hard_reg_set_subset_p (reg_class_contents[ALL_SSE_REGS],
> > +                                 abi.mode_clobbers (V4DImode)))
> > +       return AVX_U128_ANY;
> > +
> >        return AVX_U128_CLEAN;
> >      }
> >
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 276677)
+++ config/i386/i386.c	(working copy)
@@ -13530,7 +13530,7 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
 	 modes wider than 256 bits.  It's only safe to issue a
 	 vzeroupper if all SSE registers are clobbered.  */
       const function_abi &abi = insn_callee_abi (insn);
-      if (!hard_reg_set_subset_p (reg_class_contents[ALL_SSE_REGS],
+      if (!hard_reg_set_subset_p (reg_class_contents[SSE_REGS],
 				  abi.mode_clobbers (V4DImode)))
 	return AVX_U128_ANY;

Patch
diff mbox series

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	2019-09-10 19:56:55.601105594 +0100
+++ gcc/config/i386/i386.c	2019-09-11 19:47:28.506233865 +0100
@@ -95,6 +95,7 @@  #define IN_TARGET_CODE 1
 #include "i386-builtins.h"
 #include "i386-expand.h"
 #include "i386-features.h"
+#include "function-abi.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -13511,6 +13512,15 @@  ix86_avx_u128_mode_needed (rtx_insn *ins
 	    }
 	}
 
+      /* If the function is known to preserve some SSE registers,
+	 RA and previous passes can legitimately rely on that for
+	 modes wider than 256 bits.  It's only safe to issue a
+	 vzeroupper if all SSE registers are clobbered.  */
+      const function_abi &abi = call_insn_abi (insn);
+      if (!hard_reg_set_subset_p (reg_class_contents[ALL_SSE_REGS],
+				  abi.mode_clobbers (V4DImode)))
+	return AVX_U128_ANY;
+
       return AVX_U128_CLEAN;
     }