| Message ID | 1508758664-24186-2-git-send-email-Dave.Martin@arm.com |
|---|---|
| State | New |
| Series | [RFC] arm64/sve: ABI change: Zero SVE regs on syscall entry |
Dave Martin <Dave.Martin@arm.com> writes: > As currently documented, no guarantee is made about what a user > task sees in the SVE registers after a syscall, except that V0-V31 > (corresponding to Z0-Z31 bits [127:0]) are preserved. > > The actual kernel behaviour currently implemented is that the SVE > registers are zeroed if a context switch or signal delivery occurs > during a syscall. After a fork() or clone(), the SVE registers > of the child task are zeroed also. The SVE registers are otherwise > preserved. Flexibility is retained in the ABI about the exact > criteria for the decision. > > There are some potential problems with this approach. > > Syscall impact > -------------- > > Will, Catalin and Mark have expressed concerns about the risk of > creating de facto ABI here: in scenarios or workloads where a > context switch never occurs or is very unlikely, userspace may > learn to rely on preservation of the SVE registers across certain > syscalls. I think this is a reasonable concern but are there any equivalent cases in the rest of the kernel? Is this new territory for Linux as these super large registers are introduced? > It is difficult to assess the impact of this: the syscall ABI is > not a general-purpose interface, since it is usually hidden behind > libc wrappers: direct invocation of SVC is discouraged. However, > specialised runtimes, statically linked programs and binary blobs > may bake in direct syscalls that make bad assumptions. > > Conversely, the relative cost of zeroing the SVE regs to mitigate > against this also cannot be well characterised until SVE hardware > exists. > > ptrace impact > ------------- > > The current implementation can discard and zero the SVE registers > at any point during a syscall, including before, after or between > ptrace traps inside a single syscall. 
This means that setting the > SVE registers through PTRACE_SETREGSET will often not do what the > user expects: the new register values are only guaranteed to > survive as far as userspace if set from an asynchronous > signal-delivery-stop (e.g., breakpoint, SEGV or asynchronous signal > delivered outside syscall context). > > This is consistent with the currently documented SVE user ABI, but > likely to be surprising for a debugger user, since setting most > registers of a tracee doesn't behave in this way. > > This patch > ---------- > > The common syscall entry path is modified to forcibly discard SVE, > and the discard logic elsewhere is removed. > > This means that there is a mandatory additional trap to the kernel > when a user task tries to use SVE again after a syscall. This can > be expensive for programs that use SVE heavily around syscalls, but > can be optimised later. Won't it impact every restart from syscall? It's a shame you have to trap when I suspect most first accesses after a syscall are likely to be restoring the caller-saved values, which assume the value is trashed anyway. Adding gettimeofday() or write(stdout) while debugging is going to kill performance. -- Alex Bennée
Hi Dave, On Mon, Oct 23, 2017 at 12:37:44PM +0100, Dave P Martin wrote: > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S > index 6718780..68917de 100644 > --- a/arch/arm64/kernel/entry.S > +++ b/arch/arm64/kernel/entry.S > @@ -850,6 +850,25 @@ el0_svc: > adrp stbl, sys_call_table // load syscall table pointer > mov wscno, w8 // syscall number in w8 > mov wsc_nr, #__NR_syscalls > + > +#ifdef CONFIG_ARM64_SVE > +alternative_if_not ARM64_SVE > + b 1f > +alternative_else > + ldr x10, [tsk, #TSK_TI_FLAGS] > +alternative_endif > + bic x10, x10, #_TIF_SVE // discard SVE state on syscall Could we have: tbz x10, #TIF_SVE, 1f > + str x10, [tsk, #TSK_TI_FLAGS] > + > + tbz x10, #TIF_FOREIGN_FPSTATE, 2f > + ASM_BUG() > +2: If we've never hit this problem before, I think we can avoid the TIF_FOREIGN_FPSTATE here. Just double-check the return path that we always invoke do_notify_resume(). > + mrs x9, cpacr_el1 > + bic x9, x9, #CPACR_EL1_ZEN_EL0EN // disable SVE for el0 > + msr cpacr_el1, x9 // synchronised by eret to el0 I haven't checked the other patches but do we context switch cpacr_el1? If not, we may return to user with this enabled. Alternatively, can we disable it on the return path? > +1: > +#endif > + > el0_svc_naked: // compat entry point > stp x0, xscno, [sp, #S_ORIG_X0] // save the original x0 and syscall number > enable_dbg_and_irq I think you could move the TIF_SVE checks above in el0_svc_naked and reuse x16 to avoid loading TSK_TI_FLAGS twice. For compat tasks TIF_SVE would be 0 anyway (as long as we can skip the whole block as per my suggested tbz). > diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c > index ac56a96..9a3c561 100644 > --- a/arch/arm64/kernel/process.c > +++ b/arch/arm64/kernel/process.c > @@ -247,22 +247,22 @@ void arch_release_task_struct(struct task_struct *tsk) > fpsimd_release_task(tsk); > } > > +/* > + * src and dst may temporarily have aliased sve_state after task_struct > + * is copied. 
We cannot fix this properly here, because src may have > + * live SVE state and dst's thread_info may not exist yet, so tweaking > + * either src's or dst's TIF_SVE is not safe. > + * > + * The unaliasing is done in copy_thread() instead. This works because > + * dst is not schedulable or traceable until both of these functions > + * have been called. > + */ > int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src) > { > - /* > - * For SVE, dst and src must not end up with aliases of the same > - * sve_state. Because we are definitely in a syscall here, SVE state > - * can be discarded unconditionally without violating the user ABI. > - * dst's sve_state pointer can then be zapped with no ill effects. > - * > - * src can safely retain its sve_state memory for later use. > - */ > if (current->mm) > - fpsimd_preserve_current_state_discard_sve(); > + fpsimd_preserve_current_state(); > *dst = *src; > > - dst->thread.sve_state = NULL; > - > return 0; > } > > @@ -275,6 +275,13 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start, > > memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context)); > > + /* > + * Unalias p->thread.sve_state (if any) from the parent task > + * and disable discard SVE state for p: > + */ > + clear_tsk_thread_flag(p, TIF_SVE); > + p->thread.sve_state = NULL; > + > if (likely(!(p->flags & PF_KTHREAD))) { > *childregs = *current_pt_regs(); > childregs->regs[0] = 0; This looks fine.
On Mon, Oct 23, 2017 at 06:43:23PM +0100, Catalin Marinas wrote: > Hi Dave, > > On Mon, Oct 23, 2017 at 12:37:44PM +0100, Dave P Martin wrote: > > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S > > index 6718780..68917de 100644 > > --- a/arch/arm64/kernel/entry.S > > +++ b/arch/arm64/kernel/entry.S > > @@ -850,6 +850,25 @@ el0_svc: > > adrp stbl, sys_call_table // load syscall table pointer > > mov wscno, w8 // syscall number in w8 > > mov wsc_nr, #__NR_syscalls > > + > > +#ifdef CONFIG_ARM64_SVE > > +alternative_if_not ARM64_SVE > > + b 1f > > +alternative_else > > + ldr x10, [tsk, #TSK_TI_FLAGS] > > +alternative_endif > > + bic x10, x10, #_TIF_SVE // discard SVE state on syscall > > Could we have: > > tbz x10, #TIF_SVE, 1f This depends on how likely this branch is to be mispredicted, which I don't have a good feel for. I thought the store was likely to be free or very inexpensive: we read/ write the thread flags elsewhere in the svc path anyway, so I presumed this would mostly move cost around rather than adding it. I don't have a strong opinion, though... > > > + str x10, [tsk, #TSK_TI_FLAGS] > > + > > + tbz x10, #TIF_FOREIGN_FPSTATE, 2f > > + ASM_BUG() > > +2: > > If we've never hit this problem before, I think we can avoid the > TIF_FOREIGN_FPSTATE here. Just double-check the return path that we > always invoke do_notify_resume(). Sure, that was me debugging: I'll remove that. I did hit this, but only because I'd written tbnz by mistake... > > > + mrs x9, cpacr_el1 > > + bic x9, x9, #CPACR_EL1_ZEN_EL0EN // disable SVE for el0 > > + msr cpacr_el1, x9 // synchronised by eret to el0 > > I haven't checked the other patches but do we context switch cpacr_el1? > If not, we may return to user with this enabled. Alternatively, can we > disable it on the return path? Currently, cpacr_el1 is only written via ret_to_user when TIF_FOREIGN_FPSTATE is set (i.e., if there is pending context switch work to do anyway). 
We could write this unconditionally in ret_to_user, but a) I was trying to keep the changes minimal for the first spin of the patch, and b) By the time we reach ret_to_user, we no longer know whether we are in a syscall: an in_syscall() check would be needed. > > > +1: > > +#endif > > + > > el0_svc_naked: // compat entry point > > stp x0, xscno, [sp, #S_ORIG_X0] // save the original x0 and syscall number > > enable_dbg_and_irq > > I think you could move the TIF_SVE checks above in el0_svc_naked and > reuse x16 to avoid loading TSK_TI_FLAGS twice. For compat tasks TIF_SVE > would be 0 anyway (as long as we can skip the whole block as per my > suggested tbz). Could do. This code is irrelevant to the compat case, so I was trying to keep it separate. Alternatively, I could el0_svc: ldr x16, [tsk, #TSK_TI_FLAGS] //... b 3f el0_svc_naked: ldr x16, [tsk, #TSK_TI_FLAGS] 3: which does the load only once, but means hoisting the load before enable_dbg_and_irq in the el0_svc_naked case. It's also a bit ugly. Thoughts? OTOH, doing the load twice looks cheap: it should be hot in the cache.
> > + */ > > int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src) > > { > > - /* > > - * For SVE, dst and src must not end up with aliases of the same > > - * sve_state. Because we are definitely in a syscall here, SVE state > > - * can be discarded unconditionally without violating the user ABI. > > - * dst's sve_state pointer can then be zapped with no ill effects. > > - * > > - * src can safely retain its sve_state memory for later use. > > - */ > > if (current->mm) > > - fpsimd_preserve_current_state_discard_sve(); > > + fpsimd_preserve_current_state(); > > *dst = *src; > > > > - dst->thread.sve_state = NULL; > > - > > return 0; > > } > > > > @@ -275,6 +275,13 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start, > > > > memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context)); > > > > + /* > > + * Unalias p->thread.sve_state (if any) from the parent task > > + * and disable discard SVE state for p: > > + */ > > + clear_tsk_thread_flag(p, TIF_SVE); > > + p->thread.sve_state = NULL; > > + > > if (likely(!(p->flags & PF_KTHREAD))) { > > *childregs = *current_pt_regs(); > > childregs->regs[0] = 0; > > This looks fine. OK, thanks ---Dave
[Richard, can you comment below on likely code generation choices in the compiler?] On Mon, Oct 23, 2017 at 06:08:39PM +0100, Alex Bennée wrote: > > Dave Martin <Dave.Martin@arm.com> writes: > > > As currently documented, no guarantee is made about what a user > > task sees in the SVE registers after a syscall, except that V0-V31 > > (corresponding to Z0-Z31 bits [127:0]) are preserved. > > > > The actual kernel behaviour currently implemented is that the SVE > > registers are zeroed if a context switch or signal delivery occurs > > during a syscall. After a fork() or clone(), the SVE registers > > of the child task are zeroed also. The SVE registers are otherwise > > preserved. Flexibility is retained in the ABI about the exact > > criteria for the decision. > > > > There are some potential problems with this approach. > > > > Syscall impact > > -------------- > > > > Will, Catalin and Mark have expressed concerns about the risk of > > creating de facto ABI here: in scenarios or workloads where a > > context switch never occurs or is very unlikely, userspace may > > learn to rely on preservation of the SVE registers across certain > > syscalls. > > I think this is a reasonable concern but are there any equivalent cases > in the rest of the kernel? Is this new territory for Linux as these > super large registers are introduced? Not that I know of. My implementation is influenced by the SVE register set size (which is up to > 4 times the size of any other in mainline that I know of), and the lack of architectural commitment to not growing the size further in the future. > > It is difficult to assess the impact of this: the syscall ABI is > > not a general-purpose interface, since it is usually hidden behind > > libc wrappers: direct invocation of SVC is discouraged. However, > > specialised runtimes, statically linked programs and binary blobs > > may bake in direct syscalls that make bad assumptions. 
> > > > Conversely, the relative cost of zeroing the SVE regs to mitigate > > against this also cannot be well characterised until SVE hardware > > exists. > > > > ptrace impact > > ------------- > > > > The current implementation can discard and zero the SVE registers > > at any point during a syscall, including before, after or between > > ptrace traps inside a single syscall. This means that setting the > > SVE registers through PTRACE_SETREGSET will often not do what the > > user expects: the new register values are only guaranteed to > > survive as far as userspace if set from an asynchronous > > signal-delivery-stop (e.g., breakpoint, SEGV or asynchronous signal > > delivered outside syscall context). > > > > This is consistent with the currently documented SVE user ABI, but > > likely to be surprising for a debugger user, since setting most > > registers of a tracee doesn't behave in this way. > > > > This patch > > ---------- > > > > The common syscall entry path is modified to forcibly discard SVE, > > and the discard logic elsewhere is removed. > > > > This means that there is a mandatory additional trap to the kernel > > when a user task tries to use SVE again after a syscall. This can > > be expensive for programs that use SVE heavily around syscalls, but > > can be optimised later. > > Won't it impact every restart from syscall? It's a shame you have to Only if SVE is used. The average extra cost is going to be proportional to rate of execution of syscall...syscall intervals where SVE is used, so for (;;) { syscall(); crunch_huge_data() } might cause up to 20 extra SVE traps per second if crunch_huge_data() takes 50ms say; whereas for (;;) { syscall(); crunch_trivial_data(); } would cause up to 100000 extra SVE traps per second if crunch_trivial_data() takes 10us. That's much worse. I think the worst realistic case here is use of gcc -mfpu=sve (or whatever the option would be called). 
In that case though, the compiler _should_ consider the caller-saveness of the SVE regs when computing cost tradeoffs for code generation around external function calls -- this may hide much of the penalty in practice. It's easy to write a worst-case microbenchmark though, and people will inevitably do that sooner or later... > trap when I suspect most first accesses after a syscall are likely to be > either restoring the caller-saved values which assume the value is > trashed anyway. So, this was my main argument: in practice the SVE regs will never be live at syscall entry. If they were, the caller would have needed to save them off anyway. > Adding gettimeofday() or write(stdout) while debugging > is going to kill performance. Sure, but adding a syscall inside your core loop is going to kill performance anyway. (gettimeofday() calls the vdso, so that won't hit this issue). Note, we don't really need to take a trap after every syscall: that's my current implementation, but I will optimise later to zero the regs in-place and avoid the extra trap for this situation. This should be much cheaper than the current do_sve_acc() path. Cheers ---Dave
Hi, Dave Martin <Dave.Martin@arm.com> writes: > [Richard, can you comment below on likely code generation choices in the > compiler?] > > On Mon, Oct 23, 2017 at 06:08:39PM +0100, Alex Bennée wrote: >> >> Dave Martin <Dave.Martin@arm.com> writes: >> >> > As currently documented, no guarantee is made about what a user >> > task sees in the SVE registers after a syscall, except that V0-V31 >> > (corresponding to Z0-Z31 bits [127:0]) are preserved. >> > >> > The actual kernel behaviour currently implemented is that the SVE >> > registers are zeroed if a context switch or signal delivery occurs >> > during a syscall. After a fork() or clone(), the SVE registers >> > of the child task are zeroed also. The SVE registers are otherwise >> > preserved. Flexibility is retained in the ABI about the exact >> > criteria for the decision. >> > >> > There are some potential problems with this approach. >> > >> > Syscall impact >> > -------------- >> > >> > Will, Catalin and Mark have expressed concerns about the risk of >> > creating de facto ABI here: in scenarios or workloads where a >> > context switch never occurs or is very unlikely, userspace may >> > learn to rely on preservation of the SVE registers across certain >> > syscalls. >> >> I think this is a reasonable concern but are there any equivalent cases >> in the rest of the kernel? Is this new territory for Linux as these >> super large registers are introduced? > > Not that I know of. > > My implementation is influenced by the SVE register set size (which is > up to > 4 times the size of any other in mainline that I know of), and > the lack of architectural commitment to now growing the size further > in the future. > >> > It is difficult to assess the impact of this: the syscall ABI is >> > not a general-purpose interface, since it is usually hidden behind >> > libc wrappers: direct invocation of SVC is discouraged. 
However, >> > specialised runtimes, statically linked programs and binary blobs >> > may bake in direct syscalls that make bad assumptions. I can see this could be a concern in principle, but do we have a feel for how common these direct uses of SVC are? I think the uses via libc wrappers should be OK, since the SVE PCS says that all SVE state is clobbered by normal function calls. I think we can be relatively confident that the compilers implement this correctly, since it's the natural extension of the base AArch64 PCS (which only preserves the low 64 bits of V8-V15). Perhaps one concern would be LTO, since we then rely on the syscall asm statement having the correct clobber lists. And at the moment there's no syntax for saying that a register R is clobbered above X bits. (Alan's working on a GCC patch that could be reused for this if necessary.) >> > Conversely, the relative cost of zeroing the SVE regs to mitigate >> > against this also cannot be well characterised until SVE hardware >> > exists. >> > >> > ptrace impact >> > ------------- >> > >> > The current implementation can discard and zero the SVE registers >> > at any point during a syscall, including before, after or between >> > ptrace traps inside a single syscall. This means that setting the >> > SVE registers through PTRACE_SETREGSET will often not do what the >> > user expects: the new register values are only guaranteed to >> > survive as far as userspace if set from an asynchronous >> > signal-delivery-stop (e.g., breakpoint, SEGV or asynchronous signal >> > delivered outside syscall context). >> > >> > This is consistent with the currently documented SVE user ABI, but >> > likely to be surprising for a debugger user, since setting most >> > registers of a tracee doesn't behave in this way. >> > >> > This patch >> > ---------- >> > >> > The common syscall entry path is modified to forcibly discard SVE, >> > and the discard logic elsewhere is removed. 
>> > >> > This means that there is a mandatory additional trap to the kernel >> > when a user task tries to use SVE again after a syscall. This can >> > be expensive for programs that use SVE heavily around syscalls, but >> > can be optimised later. >> >> Won't it impact every restart from syscall? It's a shame you have to > > Only if SVE is used. > > The average extra cost is going to be proportional to rate of execution > of syscall...syscall intervals where SVE is used, so > > for (;;) { > syscall(); > crunch_huge_data() > } > > might cause up to 20 extra SVE traps per second if crunch_huge_data() > takes 50ms say; whereas > > for (;;) { > syscall(); > crunch_trivial_data(); > } > > would cause up to 100000 extra SVE traps per second if > crunch_trivial_data() takes 10us. That's much worse. > > > I think the worst realistic case here is use of gcc -mfpu=sve (or > whatever the option would be called). In that case though, the > compiler _should_ consider the caller-saveness of the SVE regs when > computing cost tradeoffs for code generation around external function > calls -- this may hide much of the penalty in practice. I'm not sure I'm really answering the point, sorry, but one of the advantages of SVE is that it can vectorise code with very little extra overhead. We'd therefore tend to vectorise any loop we can, even if it only iterates a few times. (This doesn't necessarily happen as much as it should yet.) The compiler should certainly consider the cost of saving and restoring data around function calls, but in the cases I've seen so far, it's rarely natural for SVE state to be live across a call. Spilling around calls tends to come from the compiler hoisting invariants too far, such as hoisting a PTRUE outside the for(;;) loop in your example. That's an optimisation bug that I've tried to fix for GCC. 
But as I understand it, the cost the compiler would need to consider here isn't the cost of saving and restoring SVE state around a call, but that (at least with the trap implementation) using SVE for a short time between two function calls can be expensive if those functions happen to use syscalls. Probably more expensive than not using SVE in the first place. I think it would be very difficult for the compiler to know when that's a concern. Without LTO, most function calls are just black boxes. Also, using SVE for a leaf function could turn out to be expensive if the caller of that leaf function uses syscalls either side, which is something the compiler wouldn't necessarily be able to see. So there's a danger we could get into the mindset of "don't use SVE for small loops". Of course, none of this matters if the automatic selection between traps and explicit zeroing makes the cost negligible (compared to the overhead of the syscall itself). Not sure that was any help, sorry :-) [...] Thanks, Richard
On Tue, Oct 24, 2017 at 09:30:55PM +0100, Richard Sandiford wrote: > Hi, > > Dave Martin <Dave.Martin@arm.com> writes: > > [Richard, can you comment below on likely code generation choices in the > > compiler?] > > > > On Mon, Oct 23, 2017 at 06:08:39PM +0100, Alex Bennée wrote: > >> > >> Dave Martin <Dave.Martin@arm.com> writes: > >> > >> > As currently documented, no guarantee is made about what a user > >> > task sees in the SVE registers after a syscall, except that V0-V31 > >> > (corresponding to Z0-Z31 bits [127:0]) are preserved. > >> > > >> > The actual kernel behaviour currently implemented is that the SVE > >> > registers are zeroed if a context switch or signal delivery occurs > >> > during a syscall. After a fork() or clone(), the SVE registers > >> > of the child task are zeroed also. The SVE registers are otherwise > >> > preserved. Flexibility is retained in the ABI about the exact > >> > criteria for the decision. > >> > > >> > There are some potential problems with this approach. > >> > > >> > Syscall impact > >> > -------------- > >> > > >> > Will, Catalin and Mark have expressed concerns about the risk of > >> > creating de facto ABI here: in scenarios or workloads where a > >> > context switch never occurs or is very unlikely, userspace may > >> > learn to rely on preservation of the SVE registers across certain > >> > syscalls. > >> > >> I think this is a reasonable concern but are there any equivalent cases > >> in the rest of the kernel? Is this new territory for Linux as these > >> super large registers are introduced? > > > > Not that I know of. > > > > My implementation is influenced by the SVE register set size (which is > > up to > 4 times the size of any other in mainline that I know of), and > > the lack of architectural commitment to now growing the size further > > in the future. 
> > > >> > It is difficult to assess the impact of this: the syscall ABI is > >> > not a general-purpose interface, since it is usually hidden behind > >> > libc wrappers: direct invocation of SVC is discouraged. However, > >> > specialised runtimes, statically linked programs and binary blobs > >> > may bake in direct syscalls that make bad assumptions. > > I can see this could be a concern in principle, but do we have a feel > for how common these direct uses of SVC are? > > I think the uses via libc wrappers should be OK, since the SVE PCS says > that all SVE state is clobbered by normal function calls. I think we > can be relatively confident that the compilers implement this correctly, > since it's the natural extension of the base AArch64 PCS (which only > preserves the low 64 bits of V8-V15). > > Perhaps one concern would be LTO, since we then rely on the syscall asm > statement having the correct clobber lists. And at the moment there's > no syntax for saying that a register R is clobbered above X bits. > (Alan's working on a GCC patch that could be reused for this if necessary.) I wonder whether the lack of a precise clobber will discourage people from writing a correct clobber list for SVCs -- the kernel guarantees to preserve V0-V31, so listing z0-z31 as clobbered would result in unnecessary spilling of V8-V15[63:0] around SVC (as required by the ARMv8 base PCS). If SVC is always in out-of-line asm though, this isn't an issue. I'm not sure what glibc does. > >> > Conversely, the relative cost of zeroing the SVE regs to mitigate > >> > against this also cannot be well characterised until SVE hardware > >> > exists. > >> > > >> > ptrace impact > >> > ------------- > >> > > >> > The current implementation can discard and zero the SVE registers > >> > at any point during a syscall, including before, after or between > >> > ptrace traps inside a single syscall. 
This means that setting the > >> > SVE registers through PTRACE_SETREGSET will often not do what the > >> > user expects: the new register values are only guaranteed to > >> > survive as far as userspace if set from an asynchronous > >> > signal-delivery-stop (e.g., breakpoint, SEGV or asynchronous signal > >> > delivered outside syscall context). > >> > > >> > This is consistent with the currently documented SVE user ABI, but > >> > likely to be surprising for a debugger user, since setting most > >> > registers of a tracee doesn't behave in this way. > >> > > >> > This patch > >> > ---------- > >> > > >> > The common syscall entry path is modified to forcibly discard SVE, > >> > and the discard logic elsewhere is removed. > >> > > >> > This means that there is a mandatory additional trap to the kernel > >> > when a user task tries to use SVE again after a syscall. This can > >> > be expensive for programs that use SVE heavily around syscalls, but > >> > can be optimised later. > >> > >> Won't it impact every restart from syscall? It's a shame you have to > > > > Only if SVE is used. > > > > The average extra cost is going to be proportional to rate of execution > > of syscall...syscall intervals where SVE is used, so > > > > for (;;) { > > syscall(); > > crunch_huge_data() > > } > > > > might cause up to 20 extra SVE traps per second if crunch_huge_data() > > takes 50ms say; whereas > > > > for (;;) { > > syscall(); > > crunch_trivial_data(); > > } > > > > would cause up to 100000 extra SVE traps per second if > > crunch_trivial_data() takes 10us. That's much worse. > > > > > > I think the worst realistic case here is use of gcc -mfpu=sve (or > > whatever the option would be called). In that case though, the > > compiler _should_ consider the caller-saveness of the SVE regs when > > computing cost tradeoffs for code generation around external function > > calls -- this may hide much of the penalty in practice. 
> > I'm not sure I'm really answering the point, sorry, but one of the > advantages of SVE is that it can vectorise code with very little extra > overhead. We'd therefore tend to vectorise any loop we can, even if it > only iterates a few times. (This doesn't necessarily happen as much as > it should yet.) > > The compiler should certainly consider the cost of saving and restoring > data around function calls, but in the cases I've seen so far, it's > rarely natural for SVE state to be live across a call. Spilling around > calls tends to come from the compiler hoisting invariants too far, > such as hoisting a PTRUE outside the for(;;) loop in your example. > That's an optimisation bug that I've tried to fix for GCC. What if the call is to a function tagged as SVE callee-save? Would GCC decide differently, or does it still tend to be unnatural to keep live state in SVE regs across any call? > But as I understand it, the cost the compiler would need to consider > here isn't the cost of saving and restoring SVE state around a call, > but that (at least with the trap implementation) using SVE for a short > time between two function calls can be expensive if those functions > happen to use syscalls. Probably more expensive than not using > SVE in the first place. I guess one takeaway from this is that tagging syscall wrappers as SVE-callee-save is a bad idea, because it commits the function to saving state just in case the caller needs it. The zeroing behaviour doesn't change the situation, because zeroing was possible on any syscall even previously: now it would be guaranteed instead. So perhaps we should tag these explicitly in libc rather than just accepting whatever the PCS libc is built with. > I think it would be very difficult for the compiler to know when > that's a concern. Without LTO, most function calls are just black > boxes. 
Also, using SVE for a leaf function could turn out to be > expensive if the caller of that leaf function uses syscalls either side, > which is something the compiler wouldn't necessarily be able to see. > So there's a danger we could get into the mindset of "don't use SVE > for small loops". > > Of course, none of this matters if the automatic selection between traps > and explicit zeroing makes the cost negligible (compared to the overhead > of the syscall itself). Agreed: I think we can disregard the trap cost: there will be some cost, but for now the implementation is known to be suboptimal, and ultimately the trap will be avoided most of the time, in favour of zeroing the regs in place. > Not sure that was any help, sorry :-) I do feel better informed now -- thanks. It sounds like the impact of this change on userspace performance isn't likely to be significant, other than the increase in syscall overhead, which is probably not that bad. I will continue to hedge my bets by documenting the SVE regs as unspecified after a syscall, but we may tighten it up later. Cheers ---Dave
On 25/10/17 13:57, Dave Martin wrote:
> On Tue, Oct 24, 2017 at 09:30:55PM +0100, Richard Sandiford wrote:
>> I think the uses via libc wrappers should be OK, since the SVE PCS says
>> that all SVE state is clobbered by normal function calls.  I think we
>> can be relatively confident that the compilers implement this correctly,
>> since it's the natural extension of the base AArch64 PCS (which only
>> preserves the low 64 bits of V8-V15).
>>
>> Perhaps one concern would be LTO, since we then rely on the syscall asm
>> statement having the correct clobber lists.  And at the moment there's
>> no syntax for saying that a register R is clobbered above X bits.
>> (Alan's working on a GCC patch that could be reused for this if necessary.)
>
> I wonder whether the lack of a precise clobber will discourage people
> from writing a correct clobber list for SVCs -- the kernel guarantees to
> preserve V0-V31, so listing z0-z31 as clobbered would result in
> unnecessary spilling of V8-V15[63:0] around SVC (as required by the
> ARMv8 base PCS).
>
> If SVC is always in out-of-line asm though, this isn't an issue.  I'm
> not sure what glibc does.

glibc assumes that the libc code is not LTO'd.  It has syscall asm files
as well as inline asm with SVC; the latter does not have a magic SVE
clobber, but it should not matter as long as no SVE is used in glibc.

gcc runtimes (libstdc++, libgomp, libitm, ...) have some raw syscalls
(mainly futex) but on aarch64 it's a call to syscall() in libc, not
inline asm.
diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
index da0e5be..74f3439 100644
--- a/arch/arm64/include/asm/fpsimd.h
+++ b/arch/arm64/include/asm/fpsimd.h
@@ -70,7 +70,6 @@
 extern void fpsimd_flush_thread(void);
 
 extern void fpsimd_signal_preserve_current_state(void);
 extern void fpsimd_preserve_current_state(void);
-extern void fpsimd_preserve_current_state_discard_sve(void);
 extern void fpsimd_restore_current_state(void);
 extern void fpsimd_update_current_state(struct fpsimd_state *state);
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index b7f8442..eb43128 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -109,6 +109,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_FSCHECK		(1 << TIF_FSCHECK)
 #define _TIF_32BIT		(1 << TIF_32BIT)
+#define _TIF_SVE		(1 << TIF_SVE)
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 6718780..68917de 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -850,6 +850,25 @@ el0_svc:
 	adrp	stbl, sys_call_table		// load syscall table pointer
 	mov	wscno, w8			// syscall number in w8
 	mov	wsc_nr, #__NR_syscalls
+
+#ifdef CONFIG_ARM64_SVE
+alternative_if_not ARM64_SVE
+	b	1f
+alternative_else
+	ldr	x10, [tsk, #TSK_TI_FLAGS]
+alternative_endif
+	bic	x10, x10, #_TIF_SVE		// discard SVE state on syscall
+	str	x10, [tsk, #TSK_TI_FLAGS]
+
+	tbz	x10, #TIF_FOREIGN_FPSTATE, 2f
+	ASM_BUG()
+2:
+	mrs	x9, cpacr_el1
+	bic	x9, x9, #CPACR_EL1_ZEN_EL0EN	// disable SVE for el0
+	msr	cpacr_el1, x9			// synchronised by eret to el0
+1:
+#endif
+
 el0_svc_naked:					// compat entry point
 	stp	x0, xscno, [sp, #S_ORIG_X0]	// save the original x0 and syscall number
 	enable_dbg_and_irq
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 1e0a7dc..8f15462 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -265,30 +265,11 @@ static void task_fpsimd_load(void)
  * with respect to the CPU registers.
  *
  * Softirqs (and preemption) must be disabled.
- *
- * As an optimisation, this function may skip the cost of saving the
- * full SVE state if called in syscall context: this is permitted
- * because the syscall ABI does not require the SVE registers to be
- * preserved across system calls except for the subset shared with
- * FPSIMD.  However, don't _assume_ that this will occur.  In the
- * future, discarding of SVE state may be avoided for tasks that
- * appear to use SVE heavily.
- *
- * SVE state may be discarded if in a syscall.
- * To force SVE discard unconditionally, pass force_discard==true.
  */
-static void __task_fpsimd_save(bool force_discard)
+static void task_fpsimd_save(void)
 {
 	WARN_ON(!in_softirq() && !irqs_disabled());
 
-	if (system_supports_sve() && test_thread_flag(TIF_SVE))
-		if (force_discard || in_syscall(current_pt_regs())) {
-			clear_thread_flag(TIF_SVE);
-
-			/* Trap if the task tries to use SVE again: */
-			sve_user_disable();
-		}
-
 	if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
 		if (system_supports_sve() && test_thread_flag(TIF_SVE)) {
 			if (WARN_ON(sve_get_vl() != current->thread.sve_vl)) {
@@ -309,11 +290,6 @@ static void __task_fpsimd_save(bool force_discard)
 	}
 }
 
-static void task_fpsimd_save(void)
-{
-	__task_fpsimd_save(false);
-}
-
 /*
  * Helpers to translate bit indices in sve_vq_map to VQ values (and
  * vice versa).  This allows find_next_bit() to be used to find the
@@ -990,34 +966,17 @@ void fpsimd_flush_thread(void)
 /*
  * Save the userland FPSIMD state of 'current' to memory, but only if the state
  * currently held in the registers does in fact belong to 'current'
- *
- * SVE state may be discarded if in a syscall.
- * To force SVE discard unconditionally, pass force_discard==true.
  */
-static void __fpsimd_preserve_current_state(bool force_discard)
+void fpsimd_preserve_current_state(void)
 {
 	if (!system_supports_fpsimd())
 		return;
 
 	local_bh_disable();
-	__task_fpsimd_save(force_discard);
+	task_fpsimd_save();
 	local_bh_enable();
 }
 
-void fpsimd_preserve_current_state(void)
-{
-	__fpsimd_preserve_current_state(false);
-}
-
-/*
- * Like fpsimd_preserve_current_state(), but explicitly discard SVE
- * state.
- */
-void fpsimd_preserve_current_state_discard_sve(void)
-{
-	__fpsimd_preserve_current_state(true);
-}
-
 /*
  * Like fpsimd_preserve_current_state(), but ensure that
  * current->thread.fpsimd_state is updated so that it can be copied to
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index ac56a96..9a3c561 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -247,22 +247,22 @@ void arch_release_task_struct(struct task_struct *tsk)
 	fpsimd_release_task(tsk);
 }
 
+/*
+ * src and dst may temporarily have aliased sve_state after task_struct
+ * is copied.  We cannot fix this properly here, because src may have
+ * live SVE state and dst's thread_info may not exist yet, so tweaking
+ * either src's or dst's TIF_SVE is not safe.
+ *
+ * The unaliasing is done in copy_thread() instead.  This works because
+ * dst is not schedulable or traceable until both of these functions
+ * have been called.
+ */
 int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 {
-	/*
-	 * For SVE, dst and src must not end up with aliases of the same
-	 * sve_state.  Because we are definitely in a syscall here, SVE state
-	 * can be discarded unconditionally without violating the user ABI.
-	 * dst's sve_state pointer can then be zapped with no ill effects.
-	 *
-	 * src can safely retain its sve_state memory for later use.
-	 */
 	if (current->mm)
-		fpsimd_preserve_current_state_discard_sve();
+		fpsimd_preserve_current_state();
 
 	*dst = *src;
-	dst->thread.sve_state = NULL;
-
 	return 0;
 }
 
@@ -275,6 +275,13 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
 
 	memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
 
+	/*
+	 * Unalias p->thread.sve_state (if any) from the parent task
+	 * and disable SVE for p:
+	 */
+	clear_tsk_thread_flag(p, TIF_SVE);
+	p->thread.sve_state = NULL;
+
 	if (likely(!(p->flags & PF_KTHREAD))) {
 		*childregs = *current_pt_regs();
 		childregs->regs[0] = 0;
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 15bc2ad..260b67f 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -760,12 +760,6 @@ static int setup_rt_frame(int usig, struct ksignal *ksig, sigset_t *set,
 	struct rt_sigframe __user *frame;
 	int err = 0;
 
-	/*
-	 * Ensure FPSIMD/SVE state in task_struct is up-to-date.
-	 * This is needed here in order to complete any pending SVE discard:
-	 * otherwise, discard may occur between deciding on the sigframe
-	 * layout and dumping the register data.
-	 */
 	fpsimd_signal_preserve_current_state();
 
 	if (get_sigframe(&user, ksig, regs))
As currently documented, no guarantee is made about what a user
task sees in the SVE registers after a syscall, except that V0-V31
(corresponding to Z0-Z31 bits [127:0]) are preserved.

The actual kernel behaviour currently implemented is that the SVE
registers are zeroed if a context switch or signal delivery occurs
during a syscall.  After a fork() or clone(), the SVE registers
of the child task are zeroed also.  The SVE registers are otherwise
preserved.  Flexibility is retained in the ABI about the exact
criteria for the decision.

There are some potential problems with this approach.

Syscall impact
--------------

Will, Catalin and Mark have expressed concerns about the risk of
creating de facto ABI here: in scenarios or workloads where a
context switch never occurs or is very unlikely, userspace may
learn to rely on preservation of the SVE registers across certain
syscalls.

It is difficult to assess the impact of this: the syscall ABI is
not a general-purpose interface, since it is usually hidden behind
libc wrappers: direct invocation of SVC is discouraged.  However,
specialised runtimes, statically linked programs and binary blobs
may bake in direct syscalls that make bad assumptions.

Conversely, the relative cost of zeroing the SVE regs to mitigate
against this also cannot be well characterised until SVE hardware
exists.

ptrace impact
-------------

The current implementation can discard and zero the SVE registers
at any point during a syscall, including before, after or between
ptrace traps inside a single syscall.

This means that setting the SVE registers through PTRACE_SETREGSET
will often not do what the user expects: the new register values
are only guaranteed to survive as far as userspace if set from an
asynchronous signal-delivery-stop (e.g., breakpoint, SEGV or
asynchronous signal delivered outside syscall context).
This is consistent with the currently documented SVE user ABI, but
likely to be surprising for a debugger user, since setting most
registers of a tracee doesn't behave in this way.

This patch
----------

The common syscall entry path is modified to forcibly discard SVE,
and the discard logic elsewhere is removed.

This means that there is a mandatory additional trap to the kernel
when a user task tries to use SVE again after a syscall.  This can
be expensive for programs that use SVE heavily around syscalls, but
can be optimised later.

Because all ptrace traps occur after syscall entry, this also means
that no discard will occur after SVE registers are set through
ptrace, until the task next passes through userspace and does
another syscall.  The only exception to this is in fork()/clone(),
where forced discard still occurs: however, this is likely to be
encountered less frequently by debugger users.

This ABI change will commit the kernel to some extra cost: at least
some thread flag manipulation, conditional branching and

	mov	v0.16b, v0.16b
	mov	v1.16b, v1.16b
	// ...
	mov	v31.16b, v31.16b

	pfalse	p0.b
	// ...
	pfalse	p15.b

	wrffr	p0.b

(or equivalent) on the syscall fast path, for SVE tasks.

As it stands, though, this patch requires trap + memset() + reload,
so it is significantly more expensive right now.

Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Cc: Alex Bennée <alex.bennee@linaro.org>
Cc: Ard Biesheuvel <Ard.Biesheuvel@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Alan Hayward <alan.hayward@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Yao Qi <yao.qi@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Okamoto Takayuki <tokamoto@jp.fujitsu.com>

---

Changes since v3:

 * New patch since v3.

 * This _may_ be folded into the Core task context handling patch
   later.
---
 arch/arm64/include/asm/fpsimd.h      |  1 -
 arch/arm64/include/asm/thread_info.h |  1 +
 arch/arm64/kernel/entry.S            | 19 +++++++++++++++
 arch/arm64/kernel/fpsimd.c           | 47 +++---------------------------------
 arch/arm64/kernel/process.c          | 29 +++++++++++++---------
 arch/arm64/kernel/signal.c           |  6 -----
 6 files changed, 41 insertions(+), 62 deletions(-)