
[v2,11/28] arm64/sve: Core task context handling

Message ID 1504198860-12951-12-git-send-email-Dave.Martin@arm.com
State New
Series ARM Scalable Vector Extension (SVE)

Commit Message

Dave Martin Aug. 31, 2017, 5 p.m. UTC
This patch adds the core support for switching and managing the SVE
architectural state of user tasks.

Calls to the existing FPSIMD low-level save/restore functions are
factored out as new functions task_fpsimd_{save,load}(), since
whether SVE needs to be handled at these points is now decided
dynamically, depending on the kernel configuration, hardware features
discovered at boot, and the runtime state of the task.  To make these
decisions as fast as possible, const cpucaps are used where
feasible, via the system_supports_sve() helper.
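
As a rough illustration of the shape of these wrappers (the exact
arguments to the low-level sve_save_state()/fpsimd_save_state() calls
are assumptions here, not quoted from the patch):

static void task_fpsimd_save(void)
{
	if (system_supports_sve() && test_thread_flag(TIF_SVE))
		/* Full SVE state, formatted for thread.sve_vl */
		sve_save_state(current->thread.sve_state,
			       &current->thread.fpsimd_state.fpsr);
	else
		/* Legacy FPSIMD-only path */
		fpsimd_save_state(&current->thread.fpsimd_state);
}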

The SVE registers are only tracked for threads that have explicitly
used SVE, indicated by the new thread flag TIF_SVE.  Otherwise, the
FPSIMD view of the architectural state is stored in
thread.fpsimd_state as usual.

When in use, the SVE registers are not stored directly in
thread_struct due to their potentially large and variable size.
Because the task_struct slab allocator must be configured very
early during kernel boot, it is also tricky to configure it
correctly to match the maximum vector length provided by the
hardware, since this depends on examining secondary CPUs as well as
the primary.  Instead, a pointer sve_state in thread_struct points
to a dynamically allocated buffer containing the SVE register data,
and code is added to allocate, duplicate and free this buffer at
appropriate times.
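
For illustration, the allocation helper might look roughly like this
(the zeroing policy and GFP flags are assumptions rather than a quote
of the patch; sve_state_size() gives the buffer size needed for the
task's vector length):

void sve_alloc(struct task_struct *task)
{
	if (task->thread.sve_state) {
		/* Buffer already exists: just reset its contents */
		memset(task->thread.sve_state, 0, sve_state_size(task));
		return;
	}

	task->thread.sve_state = kzalloc(sve_state_size(task), GFP_KERNEL);
	WARN_ON(!task->thread.sve_state);
}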

TIF_SVE is set when taking an SVE access trap from userspace, if
suitable hardware support has been detected.  This enables SVE for
the thread: a subsequent return to userspace will disable the trap
accordingly.  If such a trap is taken without sufficient hardware
support, SIGILL is sent to the thread instead, as if an undefined
instruction had been executed: this may happen, for example, if
userspace tries to use SVE on a system where not all CPUs support it.

The kernel may clear TIF_SVE and disable SVE for the thread
whenever an explicit syscall is made by userspace, though this is
considered an optimisation opportunity rather than a deterministic
guarantee: the kernel may not do this on every syscall, but it is
permitted to do so.  For backwards compatibility reasons and
conformance with the spirit of the base AArch64 procedure call
standard, the subset of the SVE register state that aliases the
FPSIMD registers is still preserved across a syscall even if this
happens.
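
Illustratively, the discard step amounts to something like the
following (a sketch only: the helper name is hypothetical, and the
CPACR_EL1 update is abstracted behind a comment):

static void sve_discard_on_syscall(struct task_struct *task)
{
	if (!system_supports_sve())
		return;

	if (test_and_clear_tsk_thread_flag(task, TIF_SVE)) {
		/* Keep Vn = Zn[127:0], FPSR and FPCR in thread.fpsimd_state */
		sve_to_fpsimd(task);
		/* ...then re-arm the EL0 SVE trap via CPACR_EL1 */
	}
}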

TIF_SVE is also cleared, and SVE disabled, on exec: this is an
obvious slow path and a hint that we are running a new binary that
may not use SVE.

Code is added to sync data between thread.fpsimd_state and
thread.sve_state whenever enabling/disabling SVE, in a manner
consistent with the SVE architectural programmer's model.
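
A simplified sketch of one direction of that sync, assuming Z0..Z31
are stored contiguously in sve_state, each sve_vl bytes long, with Vn
aliasing Zn bits [127:0] (the real code derives the offsets from the
vector length):

static void fpsimd_to_sve_sketch(struct task_struct *task)
{
	unsigned int vl = task->thread.sve_vl;
	char *z = task->thread.sve_state;
	unsigned int i;

	for (i = 0; i < 32; i++)
		memcpy(z + i * vl, &task->thread.fpsimd_state.vregs[i], 16);

	/*
	 * FPSR/FPCR stay in thread.fpsimd_state; sve_to_fpsimd() copies
	 * the low 128 bits back the other way.
	 */
}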

Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Alex Bennée <alex.bennee@linaro.org>

---

Changes since v1
----------------

Requested by Ard Biesheuvel:

* Fix unbalanced if/else bracing to match kernel coding style.
* Improve abstraction of change_cpacr to be a bit less clunky.
* Rearrange task_fpsimd_{load,save}() and friends to eliminate
forward declarations.

Changes related to Ard Biesheuvel's comments:

* Undo spurious replacement of assignment by memset in
arch_dup_task_struct().

Requested by Alex Bennée:

* Add missing include of <linux/compat.h>.
* Make thread_struct.sve_vl an unsigned int: that's the type used
virtually everywhere else, and the memory saving is too minor to be
interesting.

* Thin out BUG_ON()s:
Redundant BUG_ON()s and ones that just check invariants are removed.
Important sanity-checks are migrated to WARN_ON()s, with some
minimal best-effort patch-up code.  The fpsimd_dup_sve() case is changed
to a WARN_ON, for user threads only, and only if actually clearing
TIF_SVE.

Other:

* [bugfix] Unconditionally drop child task's reference to the parent's
SVE state on fork.  Otherwise we can end up with it aliased, which is
bad.
---
 arch/arm64/include/asm/fpsimd.h      |  19 +++
 arch/arm64/include/asm/processor.h   |   2 +
 arch/arm64/include/asm/thread_info.h |   1 +
 arch/arm64/include/asm/traps.h       |   2 +
 arch/arm64/kernel/entry.S            |  14 ++-
 arch/arm64/kernel/fpsimd.c           | 227 ++++++++++++++++++++++++++++++++++-
 arch/arm64/kernel/process.c          |   4 +
 arch/arm64/kernel/traps.c            |   4 +-
 8 files changed, 265 insertions(+), 8 deletions(-)

Comments

Catalin Marinas Sept. 13, 2017, 2:33 p.m. UTC | #1
On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> +/*
> + * Handle SVE state across fork():
> + *
> + * dst and src must not end up with aliases of the same sve_state.
> + * Because a task cannot fork except in a syscall, we can discard SVE
> + * state for dst here, so long as we take care to retain the FPSIMD
> + * subset of the state if SVE is in use.  Reallocation of the SVE state
> + * will be deferred until dst tries to use SVE.
> + */
> +void fpsimd_dup_sve(struct task_struct *dst, struct task_struct const *src)
> +{
> +	if (test_and_clear_tsk_thread_flag(dst, TIF_SVE)) {
> +		WARN_ON(dst->mm && !in_syscall(task_pt_regs(dst)));
> +		sve_to_fpsimd(dst);
> +	}
> +
> +	dst->thread.sve_state = NULL;
> +}

I first thought the thread flags are not visible in dst yet since
dup_task_struct() calls arch_dup_task_struct() before
setup_thread_stack(). However, at the end of the last year we enabled
CONFIG_THREAD_INFO_IN_TASK_STRUCT. But I don't particularly like relying
on this.

Anyway, IIUC we don't need sve_to_fpsimd() here. The
arch_dup_task_struct() already called fpsimd_preserve_current_state()
for src, so the FPSIMD state (which we care about) is transferred during
the *dst = *src assignment. So you'd only need the last statement,
possibly with a different function name like fpsimd_erase_sve (and maybe
make the function static inline in the header).
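
Something along these lines, purely as an illustration of the
suggestion (the later discussion adds that TIF_SVE must also be
cleared in dst, so both steps are shown here):

static inline void fpsimd_erase_sve(struct task_struct *dst)
{
	clear_tsk_thread_flag(dst, TIF_SVE);
	dst->thread.sve_state = NULL;
}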

[...]
>  int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
> @@ -246,6 +247,9 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
>  	if (current->mm)
>  		fpsimd_preserve_current_state();
>  	*dst = *src;
> +
> +	fpsimd_dup_sve(dst, src);
> +
>  	return 0;
>  }
Catalin Marinas Sept. 13, 2017, 5:26 p.m. UTC | #2
On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> +el0_sve_acc:
> +	/*
> +	 * Scalable Vector Extension access
> +	 */
> +	enable_dbg
> +	ct_user_exit
> +	mov	x0, x25
> +	mov	x1, sp
> +	bl	do_sve_acc
> +	b	ret_to_user

I think do_sve_acc() runs with interrupts disabled. We may have some
high latency for large SVE states.

> +/*
> + * Trapped SVE access
> + */
> +void do_sve_acc(unsigned int esr, struct pt_regs *regs)
> +{
> +	/* Even if we chose not to use SVE, the hardware could still trap: */
> +	if (unlikely(!system_supports_sve()) || WARN_ON(is_compat_task())) {
> +		force_signal_inject(SIGILL, ILL_ILLOPC, regs, 0);
> +		return;
> +	}
> +
> +	task_fpsimd_save();
> +
> +	sve_alloc(current);
> +	fpsimd_to_sve(current);
> +	if (test_and_set_thread_flag(TIF_SVE))
> +		WARN_ON(1); /* SVE access shouldn't have trapped */
> +
> +	task_fpsimd_load();
> +}

When this function is entered, do we expect TIF_SVE to always be
cleared? It's worth adding a comment on the expected conditions. If
that's the case, task_fpsimd_save() would only save the FPSIMD state
which is fine. However, you subsequently transfer the FPSIMD state to
SVE, set TIF_SVE and restore the full SVE state. If we don't care about
the SVE state here, can we call task_fpsimd_load() *before* setting
TIF_SVE?

I may well have confused myself with the state bouncing between
FPSIMD and SVE (more reasons to document the data flow better ;)).
Dave Martin Sept. 13, 2017, 7:17 p.m. UTC | #3
On Wed, Sep 13, 2017 at 10:26:05AM -0700, Catalin Marinas wrote:
> On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> > +el0_sve_acc:
> > +	/*
> > +	 * Scalable Vector Extension access
> > +	 */
> > +	enable_dbg
> > +	ct_user_exit
> > +	mov	x0, x25
> > +	mov	x1, sp
> > +	bl	do_sve_acc
> > +	b	ret_to_user
> 
> I think do_sve_acc() runs with interrupts disabled. We may have some
> high latency for large SVE states.

Historically I wanted to play safe.

I meant to change this to enable_dbg_and_irq now, but it looks like I
forgot to do it.

I'll double-check that do_sve_acc() does nothing that makes it unsafe to
enable irqs, but it should be OK.  It's certainly _intended_ to be
irq-safe.

> > +/*
> > + * Trapped SVE access
> > + */
> > +void do_sve_acc(unsigned int esr, struct pt_regs *regs)
> > +{
> > +	/* Even if we chose not to use SVE, the hardware could still trap: */
> > +	if (unlikely(!system_supports_sve()) || WARN_ON(is_compat_task())) {
> > +		force_signal_inject(SIGILL, ILL_ILLOPC, regs, 0);
> > +		return;
> > +	}
> > +
> > +	task_fpsimd_save();
> > +
> > +	sve_alloc(current);
> > +	fpsimd_to_sve(current);
> > +	if (test_and_set_thread_flag(TIF_SVE))
> > +		WARN_ON(1); /* SVE access shouldn't have trapped */
> > +
> > +	task_fpsimd_load();
> > +}
> 
> When this function is entered, do we expect TIF_SVE to always be
> cleared? It's worth adding a comment on the expected conditions. If

Yes, and this is required for correctness, as you observe.

I had a BUG_ON() here which I removed, but it makes sense to add a
comment to capture the precondition here, and how it is satisfied.

> that's the case, task_fpsimd_save() would only save the FPSIMD state
> which is fine. However, you subsequently transfer the FPSIMD state to
> SVE, set TIF_SVE and restore the full SVE state. If we don't care about
> the SVE state here, can we call task_fpsimd_load() *before* setting
> TIF_SVE?

There should be no way to reach this code with TIF_SVE set, unless
task_fpsimd_load() sets the CPACR trap bit wrongly, or the hardware is
broken -- either of which is a bug.

> I may well have confused myself with the state bouncing between
> FPSIMD and SVE (more reasons to document the data flow better ;)).

Agreed, I think this does need more explanation...  I'm currently
trying to figure out the best way to describe it.

Cheers
---Dave
Catalin Marinas Sept. 13, 2017, 10:21 p.m. UTC | #4
On Wed, Sep 13, 2017 at 08:17:07PM +0100, Dave P Martin wrote:
> On Wed, Sep 13, 2017 at 10:26:05AM -0700, Catalin Marinas wrote:
> > On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> > > +/*
> > > + * Trapped SVE access
> > > + */
> > > +void do_sve_acc(unsigned int esr, struct pt_regs *regs)
> > > +{
> > > +	/* Even if we chose not to use SVE, the hardware could still trap: */
> > > +	if (unlikely(!system_supports_sve()) || WARN_ON(is_compat_task())) {
> > > +		force_signal_inject(SIGILL, ILL_ILLOPC, regs, 0);
> > > +		return;
> > > +	}
> > > +
> > > +	task_fpsimd_save();
> > > +
> > > +	sve_alloc(current);
> > > +	fpsimd_to_sve(current);
> > > +	if (test_and_set_thread_flag(TIF_SVE))
> > > +		WARN_ON(1); /* SVE access shouldn't have trapped */
> > > +
> > > +	task_fpsimd_load();
> > > +}
> > 
> > When this function is entered, do we expect TIF_SVE to always be
> > cleared? It's worth adding a comment on the expected conditions. If
> 
> Yes, and this is required for correctness, as you observe.
> 
> I had a BUG_ON() here which I removed, but it makes sense to add a
> comment to capture the precondition here, and how it is satisfied.
> 
> > that's the case, task_fpsimd_save() would only save the FPSIMD state
> > which is fine. However, you subsequently transfer the FPSIMD state to
> > SVE, set TIF_SVE and restore the full SVE state. If we don't care about
> > the SVE state here, can we call task_fpsimd_load() *before* setting
> > TIF_SVE?
> 
> There should be no way to reach this code with TIF_SVE set, unless
> task_fpsimd_load() sets the CPACR trap bit wrongly, or the hardware is
> broken -- either of which is a bug.

Thanks for confirming my assumptions. What I meant was rewriting the
above function as:

	/* reset the SVE state (other than FPSIMD) */
	task_fpsimd_save();
	task_fpsimd_load();

	sve_alloc(current);
	set_thread_flag(TIF_SVE);
Dave Martin Sept. 14, 2017, 7:40 p.m. UTC | #5
On Wed, Sep 13, 2017 at 03:21:29PM -0700, Catalin Marinas wrote:
> On Wed, Sep 13, 2017 at 08:17:07PM +0100, Dave P Martin wrote:
> > On Wed, Sep 13, 2017 at 10:26:05AM -0700, Catalin Marinas wrote:
> > > On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> > > > +/*
> > > > + * Trapped SVE access
> > > > + */
> > > > +void do_sve_acc(unsigned int esr, struct pt_regs *regs)
> > > > +{
> > > > +	/* Even if we chose not to use SVE, the hardware could still trap: */
> > > > +	if (unlikely(!system_supports_sve()) || WARN_ON(is_compat_task())) {
> > > > +		force_signal_inject(SIGILL, ILL_ILLOPC, regs, 0);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	task_fpsimd_save();
> > > > +
> > > > +	sve_alloc(current);
> > > > +	fpsimd_to_sve(current);
> > > > +	if (test_and_set_thread_flag(TIF_SVE))
> > > > +		WARN_ON(1); /* SVE access shouldn't have trapped */
> > > > +
> > > > +	task_fpsimd_load();
> > > > +}
> > > 
> > > When this function is entered, do we expect TIF_SVE to always be
> > > cleared? It's worth adding a comment on the expected conditions. If
> > 
> > Yes, and this is required for correctness, as you observe.
> > 
> > I had a BUG_ON() here which I removed, but it makes sense to add a
> > comment to capture the precondition here, and how it is satisfied.
> > 
> > > that's the case, task_fpsimd_save() would only save the FPSIMD state
> > > which is fine. However, you subsequently transfer the FPSIMD state to
> > > SVE, set TIF_SVE and restore the full SVE state. If we don't care about
> > > the SVE state here, can we call task_fpsimd_load() *before* setting
> > > TIF_SVE?
> > 
> > There should be no way to reach this code with TIF_SVE set, unless
> > task_fpsimd_load() sets the CPACR trap bit wrongly, or the hardware is
> > broken -- either of which is a bug.
> 
> Thanks for confirming my assumptions. What I meant was rewriting the
> above function as:
> 
> 	/* reset the SVE state (other than FPSIMD) */
> 	task_fpsimd_save();
> 	task_fpsimd_load();

I think this works, but can you explain your rationale?

I think the main effect of your suggestion is that it is cheaper, due
to eliminating some unnecessary load/store operations.

We could go one better, and do

	mov	v0.16b, v0.16b
	mov	v1.16b, v1.16b
	// ...
	mov	v31.16b, v31.16b

which doesn't require any memory access.

But I still prefer to zero p0..p15, ffr for cleanliness, even though
the SVE programmer's model doesn't require this (unlike for the Z-reg
high bits where we do need to zero them in order not to violate the
programmer's model).

Currently sve_alloc()+task_fpsimd_load() ensures that all the non-FPSIMD
regs are zeroed too, in addition to the Z-reg high bits.

So we might want a special-purpose helper -- if so, we can do it all
with no memory access.

	pfalse	p0.b
	// ..
	pfalse	p15.b
	wrffr	p0.b

This would allow the memset-zero in sve_alloc() to be removed, but I
would need to check what other code is relying on it.

I guess I hadn't done this because I viewed it as an optimisation.

Thoughts?

Cheers
---Dave
Dave Martin Sept. 14, 2017, 7:55 p.m. UTC | #6
On Wed, Sep 13, 2017 at 07:33:25AM -0700, Catalin Marinas wrote:
> On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> > +/*
> > + * Handle SVE state across fork():
> > + *
> > + * dst and src must not end up with aliases of the same sve_state.
> > + * Because a task cannot fork except in a syscall, we can discard SVE
> > + * state for dst here, so long as we take care to retain the FPSIMD
> > + * subset of the state if SVE is in use.  Reallocation of the SVE state
> > + * will be deferred until dst tries to use SVE.
> > + */
> > +void fpsimd_dup_sve(struct task_struct *dst, struct task_struct const *src)
> > +{
> > +	if (test_and_clear_tsk_thread_flag(dst, TIF_SVE)) {
> > +		WARN_ON(dst->mm && !in_syscall(task_pt_regs(dst)));
> > +		sve_to_fpsimd(dst);
> > +	}
> > +
> > +	dst->thread.sve_state = NULL;
> > +}
> 
> I first thought the thread flags are not visible in dst yet since
> dup_task_struct() calls arch_dup_task_struct() before
> setup_thread_stack(). However, at the end of the last year we enabled
> CONFIG_THREAD_INFO_IN_TASK_STRUCT. But I don't particularly like relying
> on this.

Hmmm, I see your point, but there are some sequencing issues here.

> Anyway, IIUC we don't need sve_to_fpsimd() here. The
> arch_dup_task_struct() already called fpsimd_preserve_current_state()

I consider SVE discard as an optional side effect of task_fpsimd_save(),
not something that is guaranteed to happen -- the decision about whether
to do so may become more intelligent later on.  So, for src, we may
discard SVE (because syscall), but for dst we must NULL .sve_state (and
therefore clear TIF_SVE) simply to avoid aliasing of src->sve_state and
dst->sve_state.

The latter requires operating on the thread_flags of dst.  I'll need to
check whether there's another suitable hook for updating the thread flags
of dst, if we aren't confident that they will always have been
initialised by the time arch_dup_task_struct() is called.


Either way, there would be an intra-thread ordering requirement between
the task_struct and thread_info updates here, _if_ dst were schedulable:
dst:TIF_SVE must be cleared before dst->sve_state is NULLed.

dst is not schedulable until fork is done though, so maybe this doesn't
really matter...

> for src, so the FPSIMD state (which we care about) is transferred during
> the *dst = *src assignment. So you'd only need the last statement,
> possibly with a different function name like fpsimd_erase_sve (and maybe
> make the function static inline in the header).

Not quite: TIF_SVE must be cleared so that a context switch or
kernel_neon_begin() after dst is scheduled doesn't try to save state in
the (NULL) dst->sve_state.

> 
> [...]
> >  int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
> > @@ -246,6 +247,9 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
> >  	if (current->mm)
> >  		fpsimd_preserve_current_state();
> >  	*dst = *src;
> > +
> > +	fpsimd_dup_sve(dst, src);
> > +
> >  	return 0;
> >  }


Cheers
---Dave
Catalin Marinas Sept. 19, 2017, 5:13 p.m. UTC | #7
On Thu, Sep 14, 2017 at 08:40:41PM +0100, Dave P Martin wrote:
> On Wed, Sep 13, 2017 at 03:21:29PM -0700, Catalin Marinas wrote:
> > On Wed, Sep 13, 2017 at 08:17:07PM +0100, Dave P Martin wrote:
> > > On Wed, Sep 13, 2017 at 10:26:05AM -0700, Catalin Marinas wrote:
> > > > On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> > > > > +/*
> > > > > + * Trapped SVE access
> > > > > + */
> > > > > +void do_sve_acc(unsigned int esr, struct pt_regs *regs)
> > > > > +{
> > > > > +	/* Even if we chose not to use SVE, the hardware could still trap: */
> > > > > +	if (unlikely(!system_supports_sve()) || WARN_ON(is_compat_task())) {
> > > > > +		force_signal_inject(SIGILL, ILL_ILLOPC, regs, 0);
> > > > > +		return;
> > > > > +	}
> > > > > +
> > > > > +	task_fpsimd_save();
> > > > > +
> > > > > +	sve_alloc(current);
> > > > > +	fpsimd_to_sve(current);
> > > > > +	if (test_and_set_thread_flag(TIF_SVE))
> > > > > +		WARN_ON(1); /* SVE access shouldn't have trapped */
> > > > > +
> > > > > +	task_fpsimd_load();
> > > > > +}
> > > > 
> > > > When this function is entered, do we expect TIF_SVE to always be
> > > > cleared? It's worth adding a comment on the expected conditions. If
> > > 
> > > Yes, and this is required for correctness, as you observe.
> > > 
> > > I had a BUG_ON() here which I removed, but it makes sense to add a
> > > comment to capture the precondition here, and how it is satisfied.
> > > 
> > > > that's the case, task_fpsimd_save() would only save the FPSIMD state
> > > > which is fine. However, you subsequently transfer the FPSIMD state to
> > > > SVE, set TIF_SVE and restore the full SVE state. If we don't care about
> > > > the SVE state here, can we call task_fpsimd_load() *before* setting
> > > > TIF_SVE?
> > > 
> > > There should be no way to reach this code with TIF_SVE set, unless
> > > task_fpsimd_load() sets the CPACR trap bit wrongly, or the hardware is
> > > broken -- either of which is a bug.
> > 
> > Thanks for confirming my assumptions. What I meant was rewriting the
> > above function as:
> > 
> > 	/* reset the SVE state (other than FPSIMD) */
> > 	task_fpsimd_save();
> > 	task_fpsimd_load();
> 
> I think this works, but can you explain your rationale?
> 
> I think the main effect of your suggestion is that it is cheaper, due
> to eliminating some unnecessary load/store operations.

My rationale was to avoid copying between the in-memory FPSIMD and SVE
state.

> We could go one better, and do
> 
> 	mov	v0.16b, v0.16b
> 	mov	v1.16b, v1.16b
> 	// ...
> 	mov	v31.16b, v31.16b
> 
> which doesn't require any memory access.

Yes, that's even better.

> But I still prefer to zero p0..p15, ffr for cleanliness, even though
> the SVE programmer's model doesn't require this (unlike for the Z-reg
> high bits where we do need to zero them in order not to violate the
> programmer's model).

I missed the px, ffr aspect. Can you not have a clear_sve_state() (or a
better name) function to zero the predicate regs, ffr and the top bits
of the vectors?

> Currently sve_alloc()+task_fpsimd_load() ensures that all the non-FPSIMD
> regs are zeroed too, in addition to the Z-reg high bits.

Yes, just wondering if this can be implemented with fewer memory accesses
since the SVE state is irrelevant at this stage.

> 
> So we might want a special-purpose helper -- if so, we can do it all
> with no memory access.
> 
> 	pfalse	p0.b
> 	// ..
> 	pfalse	p15.b
> 	wrffr	p0.b
> 
> This would allow the memset-zero an sve_alloc() to be removed, but I
> would need to check what other code is relying on it.
> 
> I guess I hadn't done this because I viewed it as an optimisation.

It looked like some low-hanging optimisation to slightly accelerate the
allocation of the SVE state on access, though I'm also worried I don't
fully understand all the corner cases (like what happens if we allow
interrupts during this function and get preempted).

Anyway, I'm fine to leave this as it is for now and try to optimise it
later with additional patches on top.
Catalin Marinas Sept. 20, 2017, 1:58 p.m. UTC | #8
On Thu, Sep 14, 2017 at 08:55:56PM +0100, Dave P Martin wrote:
> On Wed, Sep 13, 2017 at 07:33:25AM -0700, Catalin Marinas wrote:
> > On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> > > +/*
> > > + * Handle SVE state across fork():
> > > + *
> > > + * dst and src must not end up with aliases of the same sve_state.
> > > + * Because a task cannot fork except in a syscall, we can discard SVE
> > > + * state for dst here, so long as we take care to retain the FPSIMD
> > > + * subset of the state if SVE is in use.  Reallocation of the SVE state
> > > + * will be deferred until dst tries to use SVE.
> > > + */
> > > +void fpsimd_dup_sve(struct task_struct *dst, struct task_struct const *src)
> > > +{
> > > +	if (test_and_clear_tsk_thread_flag(dst, TIF_SVE)) {
> > > +		WARN_ON(dst->mm && !in_syscall(task_pt_regs(dst)));
> > > +		sve_to_fpsimd(dst);
> > > +	}
> > > +
> > > +	dst->thread.sve_state = NULL;
> > > +}
> > 
> > I first thought the thread flags are not visible in dst yet since
> > dup_task_struct() calls arch_dup_task_struct() before
> > setup_thread_stack(). However, at the end of the last year we enabled
> > CONFIG_THREAD_INFO_IN_TASK_STRUCT. But I don't particularly like relying
> > on this.
> 
> Hmmm, I see your point, but there are some sequencing issues here.
> 
> > Anyway, IIUC we don't need sve_to_fpsimd() here. The
> > arch_dup_task_struct() already called fpsimd_preserve_current_state()
> 
> I consider SVE discard as an optional side effect of task_fpsimd_save(),
> not something that is guaranteed to happen -- the decision about whether
> to do so may become more intelligent later on.  So, for src, we may
> discard SVE (because syscall), but for dst we must NULL .sve_state (and
> therefore clear TIF_SVE) simply to avoid aliasing of src->sve_state and
> dst->sve_state.

My point was that the SVE state of src is already preserved at this
point and copied into dst. You don't need the sve_to_fpsimd(dst) again
which basically does the same copying of the src SVE saved state into
the FPSIMD one in dst. This has already been done in
arch_dup_task_struct() by the combination of
fpsimd_preserve_current_state() and *dst = *src (and, of course,
clearing TIF_SVE in dst).

I don't think the TIF_SVE clearing in src is just a side effect of
task_fpsimd_save() here but rather a requirement. When returning from
fork(), both src and dst would need to have the same state. However,
your fpsimd_dup_sve() implementation makes it very clear that the SVE
state is lost in dst. This is only allowed if we also lose it in src (as
a result of a syscall). So making dst->sve_state = NULL requires that
TIF_SVE is also cleared in both src and dst. Alternatively, you'd have
to allocate a new state here and copy the full src SVE state across to
dst, together with setting TIF_SVE (that's not necessary, however, since
we get here as a result of a syscall).

> > for src, so the FPSIMD state (which we care about) is transferred during
> > the *dst = *src assignment. So you'd only need the last statement,
> > possibly with a different function name like fpsimd_erase_sve (and maybe
> > make the function static inline in the header).
> 
> Not quite: TIF_SVE must be cleared so that a context switch or
> kernel_neon_begin() after dst is scheduled doesn't try to save state in
> the (NULL) dst->sve_state.

Yes, TIF_SVE must also be cleared in dst when dst->sve_state = NULL (I
may have forgotten to mention this).
Dave Martin Oct. 3, 2017, 11:11 a.m. UTC | #9
On Wed, Sep 20, 2017 at 02:58:56PM +0100, Catalin Marinas wrote:
> On Thu, Sep 14, 2017 at 08:55:56PM +0100, Dave P Martin wrote:
> > On Wed, Sep 13, 2017 at 07:33:25AM -0700, Catalin Marinas wrote:
> > > On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> > > > +/*
> > > > + * Handle SVE state across fork():
> > > > + *
> > > > + * dst and src must not end up with aliases of the same sve_state.
> > > > + * Because a task cannot fork except in a syscall, we can discard SVE
> > > > + * state for dst here, so long as we take care to retain the FPSIMD
> > > > + * subset of the state if SVE is in use.  Reallocation of the SVE state
> > > > + * will be deferred until dst tries to use SVE.
> > > > + */
> > > > +void fpsimd_dup_sve(struct task_struct *dst, struct task_struct const *src)
> > > > +{
> > > > +	if (test_and_clear_tsk_thread_flag(dst, TIF_SVE)) {
> > > > +		WARN_ON(dst->mm && !in_syscall(task_pt_regs(dst)));
> > > > +		sve_to_fpsimd(dst);
> > > > +	}
> > > > +
> > > > +	dst->thread.sve_state = NULL;
> > > > +}
> > > 
> > > I first thought the thread flags are not visible in dst yet since
> > > dup_task_struct() calls arch_dup_task_struct() before
> > > setup_thread_stack(). However, at the end of the last year we enabled
> > > CONFIG_THREAD_INFO_IN_TASK_STRUCT. But I don't particularly like relying
> > > on this.
> > 
> > Hmmm, I see your point, but there are some sequencing issues here.
> > 
> > > Anyway, IIUC we don't need sve_to_fpsimd() here. The
> > > arch_dup_task_struct() already called fpsimd_preserve_current_state()
> > 
> > I consider SVE discard as an optional side effect of task_fpsimd_save(),
> > not something that is guaranteed to happen -- the decision about whether
> > to do so may become more intelligent later on.  So, for src, we may
> > discard SVE (because syscall), but for dst we must NULL .sve_state (and
> > therefore clear TIF_SVE) simply to avoid aliasing of src->sve_state and
> > dst->sve_state.
> 
> My point was that the SVE state of src is already preserved at this
> point and copied into dst. You don't need the sve_to_fpsimd(dst) again
> which basically does the same copying of the src SVE saved state into
> the FPSIMD one in dst. This has already been done in
> arch_dup_task_struct() by the combination of
> fpsimd_preserve_current_state() and *dst = *src (and, of course,
> clearing TIF_SVE in dst).
> 
> I don't think the TIF_SVE clearing in src is just a side effect of
> task_fpsimd_save() here but rather a requirement. When returning from
> fork(), both src and dst would need to have the same state. However,
> your fpsimd_dup_sve() implementation makes it very clear that the SVE
> state is lost in dst. This is only allowed if we also lose it in src (as
> a result of a syscall). So making dst->sve_state = NULL requires that
> TIF_SVE is also cleared in both src and dst. Alternatively, you'd have
> to allocate a new state here and copy the full src SVE state across to
> dst, together with setting TIF_SVE (that's not necessary, however, since
> we get here as a result of a syscall).

The currently intended ABI is that the SVE bits are unspecified after a
syscall, so it is legitimate (though perhaps surprising) for different
things to happen in dst and src.

This complicates things a lot though, just to avoid the next SVE usage
exception in src after the fork.


It should be simpler to do what you suggest and discard the SVE state of
src unconditionally before the copy: then we really are just cloning the
thread apart from the need to set dst->thread.sve_state to NULL.

fpsimd_preserve_current_state() does not necessarily write back to
current->thread.fpsimd_state though: at the moment, it does do this as a
side effect of task_fpsimd_save() because we happen to be in a syscall
(i.e., fork).  

What we really want is unconditional discarding of the state.  This
wasn't needed anywhere else yet, so there's no explicit helper for it.
But it makes sense to add one.

What about refactoring along these lines:


fpsimd.c:
/* Unconditionally discard the SVE state */
void task_sve_discard(struct task_struct *task)
{
	if (!system_supports_sve())
		return;

	local_bh_disable();
	if (test_and_clear_tsk_thread_flag(task, TIF_SVE))
		sve_to_fpsimd(task);
	local_bh_enable();
}

process.c:
int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
{
	if (current->mm) {
		fpsimd_preserve_current_state();
		task_sve_discard(src);
	}

	*dst = *src;

	dst->thread.sve_state = NULL;

	return 0;
}


This also avoids having to touch dst's thread flags, since now we
are just cloning the task except for assigning NULL to
dst->thread.sve_state.


> > > for src, so the FPSIMD state (which we care about) is transferred during
> > > the *dst = *src assignment. So you'd only need the last statement,
> > > possibly with a different function name like fpsimd_erase_sve (and maybe
> > > make the function static inline in the header).
> > 
> > Not quite: TIF_SVE must be cleared so that a context switch or
> > kernel_neon_begin() after dst is scheduled doesn't try to save state in
> > the (NULL) dst->sve_state.
> 
> Yes, TIF_SVE must also be cleared in dst when dst->sve_state = NULL (I
> may have forgotten to mention this).

With the above factoring, the constraint "TIF_SVE implies sve_state
valid" is then never false, because TIF_SVE is already cleared before
the NULL assignment.

What do you think?

Cheers
---Dave
Dave Martin Oct. 3, 2017, 11:33 a.m. UTC | #10
On Wed, Sep 13, 2017 at 07:33:25AM -0700, Catalin Marinas wrote:
> On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> > +/*
> > + * Handle SVE state across fork():
> > + *
> > + * dst and src must not end up with aliases of the same sve_state.
> > + * Because a task cannot fork except in a syscall, we can discard SVE
> > + * state for dst here, so long as we take care to retain the FPSIMD
> > + * subset of the state if SVE is in use.  Reallocation of the SVE state
> > + * will be deferred until dst tries to use SVE.
> > + */
> > +void fpsimd_dup_sve(struct task_struct *dst, struct task_struct const *src)
> > +{
> > +	if (test_and_clear_tsk_thread_flag(dst, TIF_SVE)) {
> > +		WARN_ON(dst->mm && !in_syscall(task_pt_regs(dst)));
> > +		sve_to_fpsimd(dst);
> > +	}
> > +
> > +	dst->thread.sve_state = NULL;
> > +}
> 
> I first thought the thread flags are not visible in dst yet since
> dup_task_struct() calls arch_dup_task_struct() before
> setup_thread_stack(). However, at the end of the last year we enabled
> CONFIG_THREAD_INFO_IN_TASK_STRUCT. But I don't particularly like relying
> on this.
> 
> Anyway, IIUC we don't need sve_to_fpsimd() here. The
> arch_dup_task_struct() already called fpsimd_preserve_current_state()
> for src, so the FPSIMD state (which we care about) is transferred during
> the *dst = *src assignment. So you'd only need the last statement,
> possibly with a different function name like fpsimd_erase_sve (and maybe
> make the function static inline in the header).

Regarding the intended meanings of the thread flags, does this help?

--8<--

TIF_FOREIGN_FPSTATE's meaning is expanded to cover SVE, but otherwise
unchanged:

 * If a task is running and !TIF_FOREIGN_FPSTATE, then the CPU
   registers of the CPU the task is running on contain the authoritative
   FPSIMD/SVE state of the task.  The backing memory may be stale.

 * Otherwise (i.e., task not running, or task running and
   TIF_FOREIGN_FPSTATE), the task's FPSIMD/SVE backing memory is
   authoritative.  If additionally per_cpu(fpsimd_last_state,
   task->fpsimd_state.cpu) == &task->fpsimd_state, then
   task->fpsimd_state.cpu's registers are also up to date for task, but
   not authoritative: the current FPSIMD/SVE state may be read from
   them, but they must not be written.
 
The FPSIMD/SVE backing memory is selected by TIF_SVE:

 * TIF_SVE set: Zn (incorporating Vn in bits[127:0]), Pn and FFR are
   stored in task->thread.sve_state, formatted appropriately for vector
   length task->thread.sve_vl.  task->thread.sve_state must point to a
   valid buffer at least sve_state_size(task) bytes in size.

 * TIF_SVE clear: Vn are stored in task->fpsimd_state; Zn[max : 128] are
   logically zero[*] but not stored anywhere; Pn, FFR are not stored and
   have unspecified values from userspace's point of view.
   task->thread.sve_state does not need to be non-null, valid or any
   particular size: it must not be dereferenced.

   In practice I don't exploit the "unspecifiedness" much.  The Zn high
   bits, Pn and FFR are all zeroed when setting TIF_SVE again:
   sve_alloc() is the common path for this.

 * FPSR and FPCR are always stored in task->fpsimd_state irrespective of
   whether TIF_SVE is clear or set, since these are not vector length
   dependent.
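
A back-of-the-envelope sketch of the sve_state_size() rule implied by
the layout above (32 Z-registers of sve_vl bytes each, plus 16
predicate registers and FFR of sve_vl/8 bytes each; illustrative only,
not quoted from the patch):

static inline size_t sve_state_size_sketch(struct task_struct const *task)
{
	unsigned int vl = task->thread.sve_vl;

	return 32 * vl + 17 * (vl / 8);
}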


[*] theoretically unspecified, which is what I've written in sve.txt.
However, on any FPSIMD Vn write the upper bits of the corresponding Zn
must become logically zero in order to conform to the SVE programmer's
model.  It's not feasible to track which Vn have been written
individually because that would involve trapping every FPSIMD insn until
all possible Vn have been written.  So the kernel ensures that the Zn
high bits become zero.

Maybe we should just guarantee "zero-or-preserved" behaviour for
userspace.  This may close down some future optimisation opportunities,
but maybe it's better to be simple.

Thoughts?

[...]

Cheers
---Dave
Catalin Marinas Oct. 4, 2017, 5:29 p.m. UTC | #11
On Tue, Oct 03, 2017 at 12:11:01PM +0100, Dave P Martin wrote:
> On Wed, Sep 20, 2017 at 02:58:56PM +0100, Catalin Marinas wrote:
> > On Thu, Sep 14, 2017 at 08:55:56PM +0100, Dave P Martin wrote:
> > > On Wed, Sep 13, 2017 at 07:33:25AM -0700, Catalin Marinas wrote:
> > > > On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> > > > > +/*
> > > > > + * Handle SVE state across fork():
> > > > > + *
> > > > > + * dst and src must not end up with aliases of the same sve_state.
> > > > > + * Because a task cannot fork except in a syscall, we can discard SVE
> > > > > + * state for dst here, so long as we take care to retain the FPSIMD
> > > > > + * subset of the state if SVE is in use.  Reallocation of the SVE state
> > > > > + * will be deferred until dst tries to use SVE.
> > > > > + */
> > > > > +void fpsimd_dup_sve(struct task_struct *dst, struct task_struct const *src)
> > > > > +{
> > > > > +	if (test_and_clear_tsk_thread_flag(dst, TIF_SVE)) {
> > > > > +		WARN_ON(dst->mm && !in_syscall(task_pt_regs(dst)));
> > > > > +		sve_to_fpsimd(dst);
> > > > > +	}
> > > > > +
> > > > > +	dst->thread.sve_state = NULL;
> > > > > +}
> > > > 
> > > > I first thought the thread flags are not visible in dst yet since
> > > > dup_task_struct() calls arch_dup_task_struct() before
> > > > setup_thread_stack(). However, at the end of the last year we enabled
> > > > CONFIG_THREAD_INFO_IN_TASK_STRUCT. But I don't particularly like relying
> > > > on this.
> > > 
> > > Hmmm, I see your point, but there are some sequencing issues here.
> > > 
> > > > Anyway, IIUC we don't need sve_to_fpsimd() here. The
> > > > arch_dup_task_struct() already called fpsimd_preserve_current_state()
> > > 
> > > I consider SVE discard as an optional side effect of task_fpsimd_save(),
> > > not something that is guaranteed to happen -- the decision about whether
> > > to do so may become more intelligent later on.  So, for src, we may
> > > discard SVE (because syscall), but for dst we must NULL .sve_state (and
> > > therefore clear TIF_SVE) simply to avoid aliasing of src->sve_state and
> > > dst->sve_state.
> > 
> > My point was that the SVE state of src is already preserved at this
> > point and copied into dst. You don't need the sve_to_fpsimd(dst) again
> > which basically does the same copying of the src SVE saved state into
> > the FPSIMD one in dst. This has already been done in
> > arch_dup_task_struct() by the combination of
> > fpsimd_preserve_current_state() and *dst = *src (and, of course,
> > clearing TIF_SVE in dst).
> > 
> > I don't think the TIF_SVE clearing in src is just a side effect of
> > task_fpsimd_save() here but rather a requirement. When returning from
> > fork(), both src and dst would need to have the same state. However,
> > your fpsimd_dup_sve() implementation makes it very clear that the SVE
> > state is lost in dst. This is only allowed if we also lose it in src (as
> > a result of a syscall). So making dst->sve_state = NULL requires that
> > TIF_SVE is also cleared in both src and dst. Alternatively, you'd have
> > to allocate a new state here and copy the full src SVE state across to
> > dst, together with setting TIF_SVE (that's not necessary, however, since
> > we get here as a result of a syscall).
> 
> The currently intended ABI is that the SVE bits are unspecified after a
> syscall, so it is legitimate (though perhaps surprising) for different
> things to happen in dst and src.
> 
> This complicates things a lot though, just to avoid the next SVE usage
> exception in src after the fork.
> 
> 
> It should be simpler to do what you suggest and discard the SVE state of
> src unconditionally before the copy: then we really are just cloning the
> thread apart from the need to set dst->thread.sve_state to NULL.
> 
> fpsimd_preserve_current_state() does not necessarily write back to
> current->thread.fpsmid_state though: at the moment, it does do this as a
> side effect of task_fpsimd_save() because we happen to be in a syscall
> (i.e., fork).  
> 
> What we really want is unconditional discarding of the state.  This
> wasn't needed anywhere else yet, so there's no explicit helper for it.
> But it makes sense to add one.
> 
> What about refactoring along these lines:
> 
> 
> fpsimd.c:
> /* Unconditionally discard the SVE state */
> void task_sve_discard(struct task_struct *task)
> {
> 	if (!system_supports_sve())
> 		return;
> 
> 	local_bh_disable();
> 	if (test_and_clear_tsk_thread_flag(task, TIF_SVE))
> 		sve_to_fpsimd(task);
> 	local_bh_enable();
> }
> 
> process.c:
> int arch_dup_task_struct(sturct task_struct *dst, struct task_struct *src)
> {
> 	if (current->mm) {
> 		fpsimd_preserve_current_state();
> 		task_sve_discard(src);
> 	}
> 
> 	*dst = *src;
> 
> 	dst->thread.sve_state = NULL;
> }
> 
> 
> This also avoids having to touch dst's thread flags, since now we
> are just cloning the task except for assigning NULL to
> dst->thread.sve_state.

This looks fine to me; the execution of task_sve_discard() is nearly a
no-op with the current code.

We still have some local_bh_disable/enable() calls, though I don't think
it's worth optimising them now (e.g. having a
fpsimd_preserve_current_state_and_discard_sve() function with a
"sve_discard" argument to task_fpsimd_save() to force this).
Catalin Marinas Oct. 5, 2017, 11:28 a.m. UTC | #12
On Tue, Oct 03, 2017 at 12:33:03PM +0100, Dave P Martin wrote:
> On Wed, Sep 13, 2017 at 07:33:25AM -0700, Catalin Marinas wrote:
> > On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> > > +/*
> > > + * Handle SVE state across fork():
> > > + *
> > > + * dst and src must not end up with aliases of the same sve_state.
> > > + * Because a task cannot fork except in a syscall, we can discard SVE
> > > + * state for dst here, so long as we take care to retain the FPSIMD
> > > + * subset of the state if SVE is in use.  Reallocation of the SVE state
> > > + * will be deferred until dst tries to use SVE.
> > > + */
> > > +void fpsimd_dup_sve(struct task_struct *dst, struct task_struct const *src)
> > > +{
> > > +	if (test_and_clear_tsk_thread_flag(dst, TIF_SVE)) {
> > > +		WARN_ON(dst->mm && !in_syscall(task_pt_regs(dst)));
> > > +		sve_to_fpsimd(dst);
> > > +	}
> > > +
> > > +	dst->thread.sve_state = NULL;
> > > +}
> > 
> > I first thought the thread flags are not visible in dst yet since
> > dup_task_struct() calls arch_dup_task_struct() before
> > setup_thread_stack(). However, at the end of the last year we enabled
> > CONFIG_THREAD_INFO_IN_TASK_STRUCT. But I don't particularly like relying
> > on this.
> > 
> > Anyway, IIUC we don't need sve_to_fpsimd() here. The
> > arch_dup_task_struct() already called fpsimd_preserve_current_state()
> > for src, so the FPSIMD state (which we care about) is transferred during
> > the *dst = *src assignment. So you'd only need the last statement,
> > possibly with a different function name like fpsimd_erase_sve (and maybe
> > make the function static inline in the header).
> 
> Regarding the intended meanings of the thread flags, does this help?
> 
> --8<--
> 
> TIF_FOREIGN_FPSTATE's meaning is expanded to cover SVE, but otherwise
> unchanged:
> 
>  * If a task is running and !TIF_FOREIGN_FPSTATE, then the the CPU
>    registers of the CPU the task is running on contain the authoritative
>    FPSIMD/SVE state of the task.  The backing memory may be stale.
> 
>  * Otherwise (i.e., task not running, or task running and
>    TIF_FOREIGN_FPSTATE), the task's FPSIMD/SVE backing memory is
>    authoritative.  If additionally per_cpu(fpsimd_last_state,
>    task->fpsimd_state.cpu) == &task->fpsimd_state.cpu, then
>    task->fpsimd_state.cpu's registers are also up to date for task, but
>    not authorititive: the current FPSIMD/SVE state may be read from
>    them, but they must not be written.
>  
> The FPSIMD/SVE backing memory is selected by TIF_SVE:
> 
>  * TIF_SVE set: Zn (incorporating Vn in bits[127:0]), Pn and FFR are
>    stored in task->thread.sve_state, formatted appropriately for vector
>    length task->thread.sve_vl.  task->thread.sve_state must point to a
>    valid buffer at least sve_state_size(task) bytes in size.
> 
>  * TIF_SVE clear: Vn are stored in task->fpsimd_state; Zn[max : 128] are
>    logically zero[*] but not stored anywhere; Pn, FFR are not stored and
>    have unspecified values from userspace's point of view.
>    task->thread.sve_state does not need to be non-null, valid or any
>    particular size: it must not be dereferenced.
> 
>    In practice I don't exploit the "unspecifiedness" much.  The Zn high
>    bits, Pn and FFR are all zeroed when setting TIF_SVE again:
>    sve_alloc() is the common path for this.
> 
>  * FPSR and FPCR are always stored in task->fpsimd_state irrespctive of
>    whether TIF_SVE is clear or set, since these are not vector length
>    dependent.

This looks fine. I think we need to make sure (with a warning) that
task_fpsimd_save() (and probably load) is always called in a
non-preemptible context. 

> [*] theoretically unspecified, which is what I've written in sve.txt.
> However, on any FPSIMD Vn write the upper bits of the corresponding Zn
> must become logically zero in order to conform to the SVE programmer's
> model.  It's not feasible to track which Vn have been written
> individually because that would involve trapping every FPSIMD insn until
> all possible Vn have been written.  So the kernel ensures that the Zn
> high bits become zero.
> 
> Maybe we should just guarantee "zero-or-preserved" behaviour for
> userspace.  This may close down some future optimisation opportunities,
> but maybe it's better to be simple.

Does it work if we leave it as "unspecified" in the document but just do
zero-or-preserved in the kernel code?

Just wondering, as an optimisation for do_sve_acc() - instead of
sve_alloc() and fpsimd_to_sve(), can we not force the loading of the
FPSIMD regs on the return to user via TIF_FOREIGN_FPSTATE? This would
ensure the zeroing of the top SVE bits and we only need to allocate the
SVE state on the saving path. This means enabling SVE for user and
setting TIF_SVE without having the backing storage allocated.
Dave Martin Oct. 6, 2017, 1:10 p.m. UTC | #13
On Thu, Oct 05, 2017 at 12:28:35PM +0100, Catalin Marinas wrote:
> On Tue, Oct 03, 2017 at 12:33:03PM +0100, Dave P Martin wrote:
> > On Wed, Sep 13, 2017 at 07:33:25AM -0700, Catalin Marinas wrote:
> > > On Thu, Aug 31, 2017 at 06:00:43PM +0100, Dave P Martin wrote:
> > > > +/*
> > > > + * Handle SVE state across fork():
> > > > + *
> > > > + * dst and src must not end up with aliases of the same sve_state.
> > > > + * Because a task cannot fork except in a syscall, we can discard SVE
> > > > + * state for dst here, so long as we take care to retain the FPSIMD
> > > > + * subset of the state if SVE is in use.  Reallocation of the SVE state
> > > > + * will be deferred until dst tries to use SVE.
> > > > + */
> > > > +void fpsimd_dup_sve(struct task_struct *dst, struct task_struct const *src)
> > > > +{
> > > > +	if (test_and_clear_tsk_thread_flag(dst, TIF_SVE)) {
> > > > +		WARN_ON(dst->mm && !in_syscall(task_pt_regs(dst)));
> > > > +		sve_to_fpsimd(dst);
> > > > +	}
> > > > +
> > > > +	dst->thread.sve_state = NULL;
> > > > +}
> > > 
> > > I first thought the thread flags are not visible in dst yet since
> > > dup_task_struct() calls arch_dup_task_struct() before
> > > setup_thread_stack(). However, at the end of the last year we enabled
> > > CONFIG_THREAD_INFO_IN_TASK_STRUCT. But I don't particularly like relying
> > > on this.
> > > 
> > > Anyway, IIUC we don't need sve_to_fpsimd() here. The
> > > arch_dup_task_struct() already called fpsimd_preserve_current_state()
> > > for src, so the FPSIMD state (which we care about) is transferred during
> > > the *dst = *src assignment. So you'd only need the last statement,
> > > possibly with a different function name like fpsimd_erase_sve (and maybe
> > > make the function static inline in the header).
> > 
> > Regarding the intended meanings of the thread flags, does this help?
> > 
> > --8<--
> > 
> > TIF_FOREIGN_FPSTATE's meaning is expanded to cover SVE, but otherwise
> > unchanged:
> > 
> >  * If a task is running and !TIF_FOREIGN_FPSTATE, then the the CPU
> >    registers of the CPU the task is running on contain the authoritative
> >    FPSIMD/SVE state of the task.  The backing memory may be stale.
> > 
> >  * Otherwise (i.e., task not running, or task running and
> >    TIF_FOREIGN_FPSTATE), the task's FPSIMD/SVE backing memory is
> >    authoritative.  If additionally per_cpu(fpsimd_last_state,
> >    task->fpsimd_state.cpu) == &task->fpsimd_state.cpu, then
> >    task->fpsimd_state.cpu's registers are also up to date for task, but
> >    not authorititive: the current FPSIMD/SVE state may be read from
> >    them, but they must not be written.
> >  
> > The FPSIMD/SVE backing memory is selected by TIF_SVE:
> > 
> >  * TIF_SVE set: Zn (incorporating Vn in bits[127:0]), Pn and FFR are
> >    stored in task->thread.sve_state, formatted appropriately for vector
> >    length task->thread.sve_vl.  task->thread.sve_state must point to a
> >    valid buffer at least sve_state_size(task) bytes in size.
> > 
> >  * TIF_SVE clear: Vn are stored in task->fpsimd_state; Zn[max : 128] are
> >    logically zero[*] but not stored anywhere; Pn, FFR are not stored and
> >    have unspecified values from userspace's point of view.
> >    task->thread.sve_state does not need to be non-null, valid or any
> >    particular size: it must not be dereferenced.
> > 
> >    In practice I don't exploit the "unspecifiedness" much.  The Zn high
> >    bits, Pn and FFR are all zeroed when setting TIF_SVE again:
> >    sve_alloc() is the common path for this.
> > 
> >  * FPSR and FPCR are always stored in task->fpsimd_state irrespctive of
> >    whether TIF_SVE is clear or set, since these are not vector length
> >    dependent.
> 
> This looks fine. I think we need to make sure (with a warning) that
> task_fpsimd_save() (and probably load) is always called in a
> non-preemptible context. 
> 
> > [*] theoretically unspecified, which is what I've written in sve.txt.
> > However, on any FPSIMD Vn write the upper bits of the corresponding Zn
> > must become logically zero in order to conform to the SVE programmer's
> > model.  It's not feasible to track which Vn have been written
> > individually because that would involve trapping every FPSIMD insn until
> > all possible Vn have been written.  So the kernel ensures that the Zn
> > high bits become zero.
> > 
> > Maybe we should just guarantee "zero-or-preserved" behaviour for
> > userspace.  This may close down some future optimisation opportunities,
> > but maybe it's better to be simple.
> 
> Does it work if we leave it as "unspecified" in the document but just do
> zero-or-preserved in the kernel code?

Sure, that's the behaviour today in effect.

I'll leave the documentation unchanged, then we can take advantage of
this flexibility later if is proves to be useful.

> Just wondering, as an optimisation for do_sve_acc() - instead of
> sve_alloc() and fpsimd_to_sve(), can we not force the loading of the
> FPSIMD regs on the return to user via TIF_FOREIGN_FPSTATE? This would
> ensure the zeroing of the top SVE bits and we only need to allocate the
> SVE state on the saving path. This means enabling SVE for user and
> setting TIF_SVE without having the backing storage allocated.

Currently the set of places where the "TIF_SVE implies sve_state valid"
assumption is applied is not very constrained, so while your suggestion
is reasonable I'd rather not mess with it just now, if possible.


But we can do this (which is what my current fixup has):

el0_sve_acc:
	enable_dbg_and_irq
	// ...
	bl	do_sve_acc
	b	ret_to_user

void do_sve_acc(unsigned int esr, struct pt_regs *regs)
{
	/* Even if we chose not to use SVE, the hardware could still trap: */
	if (unlikely(!system_supports_sve()) || WARN_ON(is_compat_task())) {
		force_signal_inject(SIGILL, ILL_ILLOPC, regs, 0);
		return;
	}

	sve_alloc(current);

	local_bh_disable();
	if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
		task_fpsimd_load(); /* flushes high Zn bits as a side-effect */
		sve_flush_pregs();
	} else {
		sve_flush_all(); /* flush all the SVE bits in-place */
	}

	if (test_and_set_thread_flag(TIF_SVE))
		WARN_ON(1); /* SVE access shouldn't have trapped */
	local_bh_enable();
}

where sve_flush_all() zeroes all the high Zn bits via a series of
MOV Vn, Vn instructions, and also zeroes Pn and FFR.  sve_flush_pregs()
just does the latter.

In the common case the sve_alloc() does a memzero, but doesn't
allocate memory because more often than not, it will already have
been allocated.

On this path, the memzero is now pointless -- I'll need to double-check
what else may be impacted by its removal (probably only ptrace, and I'm
not sure if the zeroing is strictly required there).


There is still a bit of room for improvement, but it's still a step up
from v2.

What do you think?

Cheers
---Dave
Catalin Marinas Oct. 6, 2017, 1:36 p.m. UTC | #14
On Fri, Oct 06, 2017 at 02:10:09PM +0100, Dave P Martin wrote:
> On Thu, Oct 05, 2017 at 12:28:35PM +0100, Catalin Marinas wrote:
> > On Tue, Oct 03, 2017 at 12:33:03PM +0100, Dave P Martin wrote:
> > > TIF_FOREIGN_FPSTATE's meaning is expanded to cover SVE, but otherwise
> > > unchanged:
> > > 
> > >  * If a task is running and !TIF_FOREIGN_FPSTATE, then the the CPU
> > >    registers of the CPU the task is running on contain the authoritative
> > >    FPSIMD/SVE state of the task.  The backing memory may be stale.
> > > 
> > >  * Otherwise (i.e., task not running, or task running and
> > >    TIF_FOREIGN_FPSTATE), the task's FPSIMD/SVE backing memory is
> > >    authoritative.  If additionally per_cpu(fpsimd_last_state,
> > >    task->fpsimd_state.cpu) == &task->fpsimd_state.cpu, then
> > >    task->fpsimd_state.cpu's registers are also up to date for task, but
> > >    not authorititive: the current FPSIMD/SVE state may be read from
> > >    them, but they must not be written.
> > >  
> > > The FPSIMD/SVE backing memory is selected by TIF_SVE:
> > > 
> > >  * TIF_SVE set: Zn (incorporating Vn in bits[127:0]), Pn and FFR are
> > >    stored in task->thread.sve_state, formatted appropriately for vector
> > >    length task->thread.sve_vl.  task->thread.sve_state must point to a
> > >    valid buffer at least sve_state_size(task) bytes in size.

"Zn [...] stored in  task->thread.sve_state" - is this still true with
the changes you proposed? I guess even without these changes, you have
situations where the hardware regs are out of sync with sve_state (see
more below).

> > >  * TIF_SVE clear: Vn are stored in task->fpsimd_state; Zn[max : 128] are
> > >    logically zero[*] but not stored anywhere; Pn, FFR are not stored and
> > >    have unspecified values from userspace's point of view.
> > >    task->thread.sve_state does not need to be non-null, valid or any
> > >    particular size: it must not be dereferenced.
> > > 
> > >    In practice I don't exploit the "unspecifiedness" much.  The Zn high
> > >    bits, Pn and FFR are all zeroed when setting TIF_SVE again:
> > >    sve_alloc() is the common path for this.
> > > 
> > >  * FPSR and FPCR are always stored in task->fpsimd_state irrespctive of
> > >    whether TIF_SVE is clear or set, since these are not vector length
> > >    dependent.
[...]
> > Just wondering, as an optimisation for do_sve_acc() - instead of
> > sve_alloc() and fpsimd_to_sve(), can we not force the loading of the
> > FPSIMD regs on the return to user via TIF_FOREIGN_FPSTATE? This would
> > ensure the zeroing of the top SVE bits and we only need to allocate the
> > SVE state on the saving path. This means enabling SVE for user and
> > setting TIF_SVE without having the backing storage allocated.
> 
> Currently the set of places where the "TIF_SVE implies sve_state valid"
> assumption is applied is not very constrained, so while your suggestion
> is reasonable I'd rather not mess with it just now, if possible.
> 
> 
> But we can do this (which is what my current fixup has):
> 
> el0_sve_acc:
> 	enable_dbg_and_irq
> 	// ...
> 	bl	do_sve_acc
> 	b	ret_to_user
> 
> void do_sve_acc(unsigned int esr, struct pt_regs *regs)
> {
> 	/* Even if we chose not to use SVE, the hardware could still trap: */
> 	if (unlikely(!system_supports_sve()) || WARN_ON(is_compat_task())) {
> 		force_signal_inject(SIGILL, ILL_ILLOPC, regs, 0);
> 		return;
> 	}
> 
> 	sve_alloc(current);
> 
> 	local_bh_disable();
> 	if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
> 		task_fpsimd_load(); /* flushes high Zn bits as a side-effect */
> 		sve_flush_pregs();
> 	} else {
> 		sve_flush_all(); /* flush all the SVE bits in-place */
> 	}
> 
> 	if (test_and_set_thread_flag(TIF_SVE))
> 		WARN_ON(1); /* SVE access shouldn't have trapped */
> 	local_bh_enable();
> }
> 
> where sve_flush_all() zeroes all the high Zn bits via a series of
> MOV Vn, Vn instructions, and also zeroes Pn and FFR.  sve_fplush_pregs()
> just does the latter.

This looks fine to me but I added a comment above. IIUC, we can now have
TIF_SVE set while sve_state contains stale data. I don't see an issue
given that every time you enter the kernel from user space you have
TIF_SVE set and the sve_state storage out of sync. Maybe tweak the
TIF_SVE description above slightly.
Dave Martin Oct. 6, 2017, 3:15 p.m. UTC | #15
On Fri, Oct 06, 2017 at 02:36:40PM +0100, Catalin Marinas wrote:
> On Fri, Oct 06, 2017 at 02:10:09PM +0100, Dave P Martin wrote:
> > On Thu, Oct 05, 2017 at 12:28:35PM +0100, Catalin Marinas wrote:
> > > On Tue, Oct 03, 2017 at 12:33:03PM +0100, Dave P Martin wrote:
> > > > TIF_FOREIGN_FPSTATE's meaning is expanded to cover SVE, but otherwise
> > > > unchanged:
> > > > 
> > > >  * If a task is running and !TIF_FOREIGN_FPSTATE, then the CPU
> > > >    registers of the CPU the task is running on contain the authoritative
> > > >    FPSIMD/SVE state of the task.  The backing memory may be stale.
> > > > 
> > > >  * Otherwise (i.e., task not running, or task running and
> > > >    TIF_FOREIGN_FPSTATE), the task's FPSIMD/SVE backing memory is
> > > >    authoritative.  If additionally per_cpu(fpsimd_last_state,
> > > >    task->fpsimd_state.cpu) == &task->fpsimd_state, then
> > > >    task->fpsimd_state.cpu's registers are also up to date for task, but
> > > >    not authoritative: the current FPSIMD/SVE state may be read from
> > > >    them, but they must not be written.
> > > >  
> > > > The FPSIMD/SVE backing memory is selected by TIF_SVE:
> > > > 
> > > >  * TIF_SVE set: Zn (incorporating Vn in bits[127:0]), Pn and FFR are
> > > >    stored in task->thread.sve_state, formatted appropriately for vector
> > > >    length task->thread.sve_vl.  task->thread.sve_state must point to a
> > > >    valid buffer at least sve_state_size(task) bytes in size.
> 
> "Zn [...] stored in  task->thread.sve_state" - is this still true with
> the changes you proposed? I guess even without these changes, you have
> situations where the hardware regs are out of sync with sve_state (see
> more below).

I guess I need to tweak the wording here.

TIF_SVE says where the vector state should be loaded/stored from,
but does not say whether the data is up to date in memory, or when
it should be loaded/stored.

The latter is described by a cocktail of different things including
which bit of kernel code we are executing if any, whether the task
is running/stopped etc., TIF_FOREIGN_FPSTATE,
task->thread.fpsimd_state.cpu and per_cpu(fpsimd_last_state).
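
Expressed as code (purely illustrative, no such helper exists in the patch),
the "registers also up to date" condition above is roughly:

static bool fpsimd_regs_up_to_date(struct task_struct *task)
{
	struct fpsimd_state *st = &task->thread.fpsimd_state;

	return per_cpu(fpsimd_last_state, st->cpu) == st;
}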


Does this make any better sense of my code below?

> 
> > > >  * TIF_SVE clear: Vn are stored in task->fpsimd_state; Zn[max : 128] are
> > > >    logically zero[*] but not stored anywhere; Pn, FFR are not stored and
> > > >    have unspecified values from userspace's point of view.
> > > >    task->thread.sve_state does not need to be non-null, valid or any
> > > >    particular size: it must not be dereferenced.
> > > > 
> > > >    In practice I don't exploit the "unspecifiedness" much.  The Zn high
> > > >    bits, Pn and FFR are all zeroed when setting TIF_SVE again:
> > > >    sve_alloc() is the common path for this.
> > > > 
> > > >  * FPSR and FPCR are always stored in task->fpsimd_state irrespective of
> > > >    whether TIF_SVE is clear or set, since these are not vector length
> > > >    dependent.
> [...]
> > > Just wondering, as an optimisation for do_sve_acc() - instead of
> > > sve_alloc() and fpsimd_to_sve(), can we not force the loading of the
> > > FPSIMD regs on the return to user via TIF_FOREIGN_FPSTATE? This would
> > > ensure the zeroing of the top SVE bits and we only need to allocate the
> > > SVE state on the saving path. This means enabling SVE for user and
> > > setting TIF_SVE without having the backing storage allocated.
> > 
> > Currently the set of places where the "TIF_SVE implies sve_state valid"
> > assumption is applied is not very constrained, so while your suggestion
> > is reasonable I'd rather not mess with it just now, if possible.
> > 
> > 
> > But we can do this (which is what my current fixup has):
> > 
> > el0_sve_acc:
> > 	enable_dbg_and_irq
> > 	// ...
> > 	bl	do_sve_acc
> > 	b	ret_to_user
> > 
> > void do_sve_acc(unsigned int esr, struct pt_regs *regs)
> > {
> > 	/* Even if we chose not to use SVE, the hardware could still trap: */
> > 	if (unlikely(!system_supports_sve()) || WARN_ON(is_compat_task())) {
> > 		force_signal_inject(SIGILL, ILL_ILLOPC, regs, 0);
> > 		return;
> > 	}
> > 
> > 	sve_alloc(current);
> > 
> > 	local_bh_disable();
> > 	if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
> > 		task_fpsimd_load(); /* flushes high Zn bits as a side-effect */
> > 		sve_flush_pregs();
> > 	} else {
> > 		sve_flush_all(); /* flush all the SVE bits in-place */
> > 	}
> > 
> > 	if (test_and_set_thread_flag(TIF_SVE))
> > 		WARN_ON(1); /* SVE access shouldn't have trapped */
> > 	local_bh_enable();
> > }
> > 
> > where sve_flush_all() zeroes all the high Zn bits via a series of
> > MOV Vn, Vn instructions, and also zeroes Pn and FFR.  sve_flush_pregs()
> > just does the latter.
> 
> This looks fine to me but I added a comment above. IIUC, we can now have
> TIF_SVE set while sve_state contains stale data. I don't see an issue
> given that every time you enter the kernel from user space you have
> TIF_SVE set and the sve_state storage out of sync. Maybe tweak the
> TIF_SVE description above slightly.
> 

See my comment above ... any better?

If so, I'll paste some of that explanatory text into fpsimd.c (in lieu
of a better place to put it).

Cheers
---Dave
Catalin Marinas Oct. 6, 2017, 3:33 p.m. UTC | #16
On Fri, Oct 06, 2017 at 04:15:28PM +0100, Dave P Martin wrote:
> On Fri, Oct 06, 2017 at 02:36:40PM +0100, Catalin Marinas wrote:
> > On Fri, Oct 06, 2017 at 02:10:09PM +0100, Dave P Martin wrote:
> > > On Thu, Oct 05, 2017 at 12:28:35PM +0100, Catalin Marinas wrote:
> > > > On Tue, Oct 03, 2017 at 12:33:03PM +0100, Dave P Martin wrote:
> > > > > TIF_FOREIGN_FPSTATE's meaning is expanded to cover SVE, but otherwise
> > > > > unchanged:
> > > > > 
> > > > >  * If a task is running and !TIF_FOREIGN_FPSTATE, then the CPU
> > > > >    registers of the CPU the task is running on contain the authoritative
> > > > >    FPSIMD/SVE state of the task.  The backing memory may be stale.
> > > > > 
> > > > >  * Otherwise (i.e., task not running, or task running and
> > > > >    TIF_FOREIGN_FPSTATE), the task's FPSIMD/SVE backing memory is
> > > > >    authoritative.  If additionally per_cpu(fpsimd_last_state,
> > > > >    task->fpsimd_state.cpu) == &task->fpsimd_state, then
> > > > >    task->fpsimd_state.cpu's registers are also up to date for task, but
> > > > >    not authoritative: the current FPSIMD/SVE state may be read from
> > > > >    them, but they must not be written.
> > > > >  
> > > > > The FPSIMD/SVE backing memory is selected by TIF_SVE:
> > > > > 
> > > > >  * TIF_SVE set: Zn (incorporating Vn in bits[127:0]), Pn and FFR are
> > > > >    stored in task->thread.sve_state, formatted appropriately for vector
> > > > >    length task->thread.sve_vl.  task->thread.sve_state must point to a
> > > > >    valid buffer at least sve_state_size(task) bytes in size.
> > 
> > "Zn [...] stored in  task->thread.sve_state" - is this still true with
> > the changes you proposed? I guess even without these changes, you have
> > situations where the hardware regs are out of sync with sve_state (see
> > more below).
> 
> I guess I need to tweak the wording here.
> 
> TIF_SVE says where the vector state should be loaded/stored from,
> but does not say whether the data is up to date in memory, or when
> it should be loaded/stored.
> 
> The latter is described by a cocktail of different things including
> which bit of kernel code we are executing if any, whether the task
> is running/stopped etc., TIF_FOREIGN_FPSTATE,
> task->thread.fpsimd_state.cpu and per_cpu(fpsimd_last_state).
> 
> Does this make any better sense of my code below?

Yes, it looks fine.
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/fpsimd.h b/arch/arm64/include/asm/fpsimd.h
index 026a7c7..72090a1 100644
--- a/arch/arm64/include/asm/fpsimd.h
+++ b/arch/arm64/include/asm/fpsimd.h
@@ -20,6 +20,8 @@ 
 
 #ifndef __ASSEMBLY__
 
+#include <linux/stddef.h>
+
 /*
  * FP/SIMD storage area has:
  *  - FPSR and FPCR
@@ -72,6 +74,23 @@  extern void sve_load_state(void const *state, u32 const *pfpsr,
 			   unsigned long vq_minus_1);
 extern unsigned int sve_get_vl(void);
 
+#ifdef CONFIG_ARM64_SVE
+
+extern size_t sve_state_size(struct task_struct const *task);
+
+extern void sve_alloc(struct task_struct *task);
+extern void fpsimd_release_thread(struct task_struct *task);
+extern void fpsimd_dup_sve(struct task_struct *dst,
+			   struct task_struct const *src);
+
+#else /* ! CONFIG_ARM64_SVE */
+
+static void __maybe_unused sve_alloc(struct task_struct *task) { }
+static void __maybe_unused fpsimd_release_thread(struct task_struct *task) { }
+static void __maybe_unused fpsimd_dup_sve(struct task_struct *dst,
+					  struct task_struct const *src) { }
+#endif /* ! CONFIG_ARM64_SVE */
+
 /* For use by EFI runtime services calls only */
 extern void __efi_fpsimd_begin(void);
 extern void __efi_fpsimd_end(void);
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 29adab8..4831d28 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -85,6 +85,8 @@  struct thread_struct {
 	unsigned long		tp2_value;
 #endif
 	struct fpsimd_state	fpsimd_state;
+	void			*sve_state;	/* SVE registers, if any */
+	unsigned int		sve_vl;		/* SVE vector length */
 	unsigned long		fault_address;	/* fault info */
 	unsigned long		fault_code;	/* ESR_EL1 value */
 	struct debug_info	debug;		/* debugging */
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 2eca178..f0880fc 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -91,6 +91,7 @@  void arch_setup_new_exec(void);
 #define TIF_RESTORE_SIGMASK	20
 #define TIF_SINGLESTEP		21
 #define TIF_32BIT		22	/* 32bit process */
+#define TIF_SVE			23	/* Scalable Vector Extension in use */
 
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
diff --git a/arch/arm64/include/asm/traps.h b/arch/arm64/include/asm/traps.h
index 4136168..282163f 100644
--- a/arch/arm64/include/asm/traps.h
+++ b/arch/arm64/include/asm/traps.h
@@ -34,6 +34,8 @@  struct undef_hook {
 
 void register_undef_hook(struct undef_hook *hook);
 void unregister_undef_hook(struct undef_hook *hook);
+void force_signal_inject(int signal, int code, struct pt_regs *regs,
+			 unsigned long address);
 
 void arm64_notify_segfault(struct pt_regs *regs, unsigned long addr);
 
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index e1c59d4..e57020f 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -607,6 +607,8 @@  el0_sync:
 	b.eq	el0_ia
 	cmp	x24, #ESR_ELx_EC_FP_ASIMD	// FP/ASIMD access
 	b.eq	el0_fpsimd_acc
+	cmp	x24, #ESR_ELx_EC_SVE		// SVE access
+	b.eq	el0_sve_acc
 	cmp	x24, #ESR_ELx_EC_FP_EXC64	// FP/ASIMD exception
 	b.eq	el0_fpsimd_exc
 	cmp	x24, #ESR_ELx_EC_SYS64		// configurable trap
@@ -705,9 +707,19 @@  el0_fpsimd_acc:
 	mov	x1, sp
 	bl	do_fpsimd_acc
 	b	ret_to_user
+el0_sve_acc:
+	/*
+	 * Scalable Vector Extension access
+	 */
+	enable_dbg
+	ct_user_exit
+	mov	x0, x25
+	mov	x1, sp
+	bl	do_sve_acc
+	b	ret_to_user
 el0_fpsimd_exc:
 	/*
-	 * Floating Point or Advanced SIMD exception
+	 * Floating Point, Advanced SIMD or SVE exception
 	 */
 	enable_dbg
 	ct_user_exit
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 9d762dd..9b1ebd7 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -18,18 +18,25 @@ 
  */
 
 #include <linux/bottom_half.h>
+#include <linux/bug.h>
+#include <linux/compat.h>
 #include <linux/cpu.h>
 #include <linux/cpu_pm.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
 #include <linux/percpu.h>
 #include <linux/preempt.h>
+#include <linux/ptrace.h>
 #include <linux/sched/signal.h>
 #include <linux/signal.h>
+#include <linux/slab.h>
 
 #include <asm/fpsimd.h>
 #include <asm/cputype.h>
 #include <asm/simd.h>
+#include <asm/sigcontext.h>
+#include <asm/sysreg.h>
+#include <asm/traps.h>
 
 #define FPEXC_IOF	(1 << 0)
 #define FPEXC_DZF	(1 << 1)
@@ -99,6 +106,190 @@ 
  */
 static DEFINE_PER_CPU(struct fpsimd_state *, fpsimd_last_state);
 
+static void sve_free(struct task_struct *task)
+{
+	kfree(task->thread.sve_state);
+	task->thread.sve_state = NULL;
+}
+
+/* Offset of FFR in the SVE register dump */
+static size_t sve_ffr_offset(int vl)
+{
+	return SVE_SIG_FFR_OFFSET(sve_vq_from_vl(vl)) - SVE_SIG_REGS_OFFSET;
+}
+
+static void *sve_pffr(struct task_struct *task)
+{
+	return (char *)task->thread.sve_state +
+		sve_ffr_offset(task->thread.sve_vl);
+}
+
+static void change_cpacr(u64 val, u64 mask)
+{
+	u64 cpacr = read_sysreg(CPACR_EL1);
+	u64 new = (cpacr & ~mask) | val;
+
+	if (new != cpacr)
+		write_sysreg(new, CPACR_EL1);
+}
+
+static void sve_user_disable(void)
+{
+	change_cpacr(0, CPACR_EL1_ZEN_EL0EN);
+}
+
+static void sve_user_enable(void)
+{
+	change_cpacr(CPACR_EL1_ZEN_EL0EN, CPACR_EL1_ZEN_EL0EN);
+}
+
+static void task_fpsimd_load(void)
+{
+	if (system_supports_sve() && test_thread_flag(TIF_SVE))
+		sve_load_state(sve_pffr(current),
+			       &current->thread.fpsimd_state.fpsr,
+			       sve_vq_from_vl(current->thread.sve_vl) - 1);
+	else
+		fpsimd_load_state(&current->thread.fpsimd_state);
+
+	if (system_supports_sve()) {
+		/* Toggle SVE trapping for userspace if needed */
+		if (test_thread_flag(TIF_SVE))
+			sve_user_enable();
+		else
+			sve_user_disable();
+
+		/* Serialised by exception return to user */
+	}
+}
+
+static void task_fpsimd_save(void)
+{
+	if (system_supports_sve() &&
+	    in_syscall(current_pt_regs()) &&
+	    test_thread_flag(TIF_SVE)) {
+		clear_thread_flag(TIF_SVE);
+
+		/* Trap if the task tries to use SVE again: */
+		sve_user_disable();
+	}
+
+	if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
+		if (system_supports_sve() && test_thread_flag(TIF_SVE))
+			sve_save_state(sve_pffr(current),
+				       &current->thread.fpsimd_state.fpsr);
+		else
+			fpsimd_save_state(&current->thread.fpsimd_state);
+	}
+}
+
+#define ZREG(sve_state, vq, n) ((char *)(sve_state) +		\
+	(SVE_SIG_ZREG_OFFSET(vq, n) - SVE_SIG_REGS_OFFSET))
+
+static void fpsimd_to_sve(struct task_struct *task)
+{
+	unsigned int vq;
+	void *sst = task->thread.sve_state;
+	struct fpsimd_state const *fst = &task->thread.fpsimd_state;
+	unsigned int i;
+
+	if (!system_supports_sve())
+		return;
+
+	vq = sve_vq_from_vl(task->thread.sve_vl);
+	for (i = 0; i < 32; ++i)
+		memcpy(ZREG(sst, vq, i), &fst->vregs[i],
+		       sizeof(fst->vregs[i]));
+}
+
+#ifdef CONFIG_ARM64_SVE
+
+static void sve_to_fpsimd(struct task_struct *task)
+{
+	unsigned int vq;
+	void const *sst = task->thread.sve_state;
+	struct fpsimd_state *fst = &task->thread.fpsimd_state;
+	unsigned int i;
+
+	if (!system_supports_sve())
+		return;
+
+	vq = sve_vq_from_vl(task->thread.sve_vl);
+	for (i = 0; i < 32; ++i)
+		memcpy(&fst->vregs[i], ZREG(sst, vq, i),
+		       sizeof(fst->vregs[i]));
+}
+
+size_t sve_state_size(struct task_struct const *task)
+{
+	return SVE_SIG_REGS_SIZE(sve_vq_from_vl(task->thread.sve_vl));
+}
+
+void sve_alloc(struct task_struct *task)
+{
+	if (task->thread.sve_state) {
+		memset(task->thread.sve_state, 0, sve_state_size(task));
+		return;
+	}
+
+	/* This is a small allocation (maximum ~8KB) and Should Not Fail. */
+	task->thread.sve_state =
+		kzalloc(sve_state_size(task), GFP_KERNEL);
+
+	/*
+	 * If future SVE revisions can have larger vectors though,
+	 * this may cease to be true:
+	 */
+	BUG_ON(!task->thread.sve_state);
+}
+
+/*
+ * Handle SVE state across fork():
+ *
+ * dst and src must not end up with aliases of the same sve_state.
+ * Because a task cannot fork except in a syscall, we can discard SVE
+ * state for dst here, so long as we take care to retain the FPSIMD
+ * subset of the state if SVE is in use.  Reallocation of the SVE state
+ * will be deferred until dst tries to use SVE.
+ */
+void fpsimd_dup_sve(struct task_struct *dst, struct task_struct const *src)
+{
+	if (test_and_clear_tsk_thread_flag(dst, TIF_SVE)) {
+		WARN_ON(dst->mm && !in_syscall(task_pt_regs(dst)));
+		sve_to_fpsimd(dst);
+	}
+
+	dst->thread.sve_state = NULL;
+}
+
+void fpsimd_release_thread(struct task_struct *dead_task)
+{
+	sve_free(dead_task);
+}
+
+#endif /* CONFIG_ARM64_SVE */
+
+/*
+ * Trapped SVE access
+ */
+void do_sve_acc(unsigned int esr, struct pt_regs *regs)
+{
+	/* Even if we chose not to use SVE, the hardware could still trap: */
+	if (unlikely(!system_supports_sve()) || WARN_ON(is_compat_task())) {
+		force_signal_inject(SIGILL, ILL_ILLOPC, regs, 0);
+		return;
+	}
+
+	task_fpsimd_save();
+
+	sve_alloc(current);
+	fpsimd_to_sve(current);
+	if (test_and_set_thread_flag(TIF_SVE))
+		WARN_ON(1); /* SVE access shouldn't have trapped */
+
+	task_fpsimd_load();
+}
+
 /*
  * Trapped FP/ASIMD access.
  */
@@ -144,8 +335,8 @@  void fpsimd_thread_switch(struct task_struct *next)
 	 * the registers is in fact the most recent userland FPSIMD state of
 	 * 'current'.
 	 */
-	if (current->mm && !test_thread_flag(TIF_FOREIGN_FPSTATE))
-		fpsimd_save_state(&current->thread.fpsimd_state);
+	if (current->mm)
+		task_fpsimd_save();
 
 	if (next->mm) {
 		/*
@@ -167,6 +358,8 @@  void fpsimd_thread_switch(struct task_struct *next)
 
 void fpsimd_flush_thread(void)
 {
+	int vl;
+
 	if (!system_supports_fpsimd())
 		return;
 
@@ -174,6 +367,30 @@  void fpsimd_flush_thread(void)
 
 	memset(&current->thread.fpsimd_state, 0, sizeof(struct fpsimd_state));
 	fpsimd_flush_task_state(current);
+
+	if (system_supports_sve()) {
+		clear_thread_flag(TIF_SVE);
+		sve_free(current);
+
+		/*
+		 * Reset the task vector length as required.
+		 * This is where we ensure that all user tasks have a valid
+		 * vector length configured: no kernel task can become a user
+		 * task without an exec and hence a call to this function.
+		 * If a bug causes this to go wrong, we make some noise and
+		 * try to fudge thread.sve_vl to a safe value here.
+		 */
+		vl = current->thread.sve_vl;
+
+		if (vl == 0)
+			vl = SVE_VL_MIN;
+
+		if (WARN_ON(!sve_vl_valid(vl)))
+			vl = SVE_VL_MIN;
+
+		current->thread.sve_vl = vl;
+	}
+
 	set_thread_flag(TIF_FOREIGN_FPSTATE);
 
 	local_bh_enable();
@@ -211,7 +428,7 @@  void fpsimd_restore_current_state(void)
 	if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
 		struct fpsimd_state *st = &current->thread.fpsimd_state;
 
-		fpsimd_load_state(st);
+		task_fpsimd_load();
 		__this_cpu_write(fpsimd_last_state, st);
 		st->cpu = smp_processor_id();
 	}
@@ -376,8 +593,8 @@  static int fpsimd_cpu_pm_notifier(struct notifier_block *self,
 {
 	switch (cmd) {
 	case CPU_PM_ENTER:
-		if (current->mm && !test_thread_flag(TIF_FOREIGN_FPSTATE))
-			fpsimd_save_state(&current->thread.fpsimd_state);
+		if (current->mm)
+			task_fpsimd_save();
 		this_cpu_write(fpsimd_last_state, NULL);
 		break;
 	case CPU_PM_EXIT:
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index e6bf19c..eb1cae7 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -239,6 +239,7 @@  void flush_thread(void)
 
 void release_thread(struct task_struct *dead_task)
 {
+	fpsimd_release_thread(dead_task);
 }
 
 int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
@@ -246,6 +247,9 @@  int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 	if (current->mm)
 		fpsimd_preserve_current_state();
 	*dst = *src;
+
+	fpsimd_dup_sve(dst, src);
+
 	return 0;
 }
 
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index f202932..7f3b5c6 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -358,8 +358,8 @@  static int call_undef_hook(struct pt_regs *regs)
 	return fn ? fn(regs, instr) : 1;
 }
 
-static void force_signal_inject(int signal, int code, struct pt_regs *regs,
-				unsigned long address)
+void force_signal_inject(int signal, int code, struct pt_regs *regs,
+			 unsigned long address)
 {
 	siginfo_t info;
 	void __user *pc = (void __user *)instruction_pointer(regs);