
[v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

Message ID 1921539696.48534988.1700407082933.JavaMail.zimbra@raptorengineeringinc.com (mailing list archive)
State Accepted
Commit 5e1d824f9a283cbf90f25241b66d1f69adb3835b
Series [v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

Checks

Context Check Description
snowpatch_ozlabs/github-powerpc_ppctests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_selftests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_sparse success Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_clang success Successfully ran 6 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu success Successfully ran 23 jobs.

Commit Message

Timothy Pearson Nov. 19, 2023, 3:18 p.m. UTC
During floating point and vector save to thread data fr0/vs0 are clobbered
by the FPSCR/VSCR store routine.  This leads to userspace register corruption
and application data corruption / crash under the following rare condition:

 * A userspace thread is executing with VSX/FP mode enabled
 * The userspace thread is making active use of fr0 and/or vs0
 * An IPI is taken in kernel mode, forcing the userspace thread to reschedule
 * The userspace thread is interrupted by the IPI before accessing data it
   previously stored in fr0/vs0
 * The thread being switched in by the IPI has a pending signal

If these exact criteria are met, then the following sequence happens:

 * The existing thread FP storage is still valid before the IPI, due to a
   prior call to save_fpu() or store_fp_state().  Note that the current
   fr0/vs0 registers have been clobbered, so the FP/VSX state in registers
   is now invalid pending a call to restore_fp()/restore_altivec().
 * IPI -- FP/VSX register state remains invalid
 * interrupt_exit_user_prepare_main() calls do_notify_resume(),
   due to the pending signal
 * do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which
   merrily reads and saves the invalid FP/VSX state to thread local storage.
 * interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid
   FP/VSX state back to registers.
 * Execution is released to userspace, and the application crashes or corrupts
   data.

Without the pending signal, do_notify_resume() is never called, therefore the
invalid register state doesn't matter as it is overwritten nearly immediately
by interrupt_exit_user_prepare_main() calling restore_math() before return
to userspace.

Restore fr0/vs0 after FPSCR/VSCR store has completed for both the fp and
altivec register save paths.

Tested under QEMU in kvm mode, running on a Talos II workstation with dual
POWER9 DD2.2 CPUs.

Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.JavaMail.zimbra@raptorengineeringinc.com/
Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.JavaMail.zimbra@raptorengineeringinc.com/
Tested-by: Timothy Pearson <tpearson@raptorengineering.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
---
 arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
 arch/powerpc/kernel/vector.S |  2 ++
 2 files changed, 15 insertions(+)

Comments

Michael Ellerman Nov. 20, 2023, 7:10 a.m. UTC | #1
Hi Timothy,

Great work debugging this. I think your fix is good, but I want to understand it
a bit more to make sure I can explain why we haven't seen it outside of io-uring.
If this can be triggered outside of io-uring then I have even more backporting
in my future :}

Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and
also *turns off FP* in the task's MSR, meaning the kernel will reload the FP regs
from the thread struct before letting the task use FP again. So in that case
save_fpu() is free to clobber fr0 because the FP regs no longer hold live values
for the task.

There is another case though, which is the path via:
  copy_process()
    dup_task_struct()
      arch_dup_task_struct()
        flush_all_to_thread()
          save_all()

That path saves the FP regs but leaves them live. That's meant as an
optimisation for a process that's using FP/VSX and then calls fork(), leaving
the regs live means the parent process doesn't have to take a fault after the
fork to get its FP regs back.

That path does clobber fr0, but fr0 is volatile across a syscall, and the only
way to reach copy_process() from userspace is via a syscall. So in normal usage
fr0 being clobbered across a syscall shouldn't cause data corruption.

Even if we handle a signal on the return from the fork() syscall, the worst that
happens is that the task's thread struct holds the clobbered fr0, but the task
doesn't care (because fr0 is volatile across the syscall anyway).
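
As a rough illustration of that ABI point (an illustrative sketch, not
from the mail itself -- fr0/vs0 are caller-saved under the ELFv2 ABI, so
a correctly compiled caller never relies on them surviving any call,
including the syscall that reaches copy_process()):

#include <unistd.h>

/*
 * Illustration only: the compiler may keep "t" in fr0/vs0 transiently,
 * but because those registers are volatile it must not assume they
 * survive the fork() call; it keeps "t" in memory or a non-volatile
 * register instead.  A kernel-side clobber of fr0 during the syscall is
 * therefore invisible to this caller.
 */
double scaled_after_fork(double x)
{
	double t = x * 2.0;

	fork();			/* fr0/vs0 treated as clobbered across this */

	return t * 3.0;
}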

That path is something like:

system_call_vectored_common()
  system_call_exception()
    sys_fork()
      kernel_clone()
        copy_process()
          dup_task_struct()
            arch_dup_task_struct()
              flush_all_to_thread()
                save_all()
                  if (tsk->thread.regs->msr & MSR_FP)
                    save_fpu()
                    # does not clear MSR_FP from regs->msr
  syscall_exit_prepare()
    interrupt_exit_user_prepare_main()
      do_notify_resume()
        get_signal()
        handle_rt_signal64()
          prepare_setup_sigcontext()
            flush_fp_to_thread()
              if (tsk->thread.regs->msr & MSR_FP)
                giveup_fpu()
                  __giveup_fpu
                    save_fpu()
                    # clobbered fr0 is saved, but task considers it volatile
                    # across syscall anyway


But we now have a new path, because io-uring can call copy_process() via
create_io_thread() from the signal handling path. That's OK if the signal is
handled as we return from a syscall, but it's not OK if the signal is handled
due to some other interrupt.

Which is:

interrupt_return_srr_user()
  interrupt_exit_user_prepare()
    interrupt_exit_user_prepare_main()
      do_notify_resume()
        get_signal()
          task_work_run()
            create_worker_cb()
              create_io_worker()
                copy_process()
                  dup_task_struct()
                    arch_dup_task_struct()
                      flush_all_to_thread()
                        save_all()
                          if (tsk->thread.regs->msr & MSR_FP)
                            save_fpu()
                            # fr0 is clobbered and potentially live in userspace


So tldr I think the corruption is only an issue since io-uring started doing
the clone via signal, which I think matches the observed timeline of this bug
appearing.

Gotta run home, will have a closer look at the actual patch later on.

cheers


Timothy Pearson <tpearson@raptorengineering.com> writes:
> During floating point and vector save to thread data fr0/vs0 are clobbered
> by the FPSCR/VSCR store routine.  This leads to userspace register corruption
> and application data corruption / crash under the following rare condition:
>
>  * A userspace thread is executing with VSX/FP mode enabled
>  * The userspace thread is making active use of fr0 and/or vs0
>  * An IPI is taken in kernel mode, forcing the userspace thread to reschedule
>  * The userspace thread is interrupted by the IPI before accessing data it
>    previously stored in fr0/vs0
>  * The thread being switched in by the IPI has a pending signal
>
> If these exact criteria are met, then the following sequence happens:
>
>  * The existing thread FP storage is still valid before the IPI, due to a
>    prior call to save_fpu() or store_fp_state().  Note that the current
>    fr0/vs0 registers have been clobbered, so the FP/VSX state in registers
>    is now invalid pending a call to restore_fp()/restore_altivec().
>  * IPI -- FP/VSX register state remains invalid
>  * interrupt_exit_user_prepare_main() calls do_notify_resume(),
>    due to the pending signal
>  * do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which
>    merrily reads and saves the invalid FP/VSX state to thread local storage.
>  * interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid
>    FP/VSX state back to registers.
>  * Execution is released to userspace, and the application crashes or corrupts
>    data.
>
> Without the pending signal, do_notify_resume() is never called, therefore the
> invalid register state doesn't matter as it is overwritten nearly immediately
> by interrupt_exit_user_prepare_main() calling restore_math() before return
> to userspace.
>
> Restore fr0/vs0 after FPSCR/VSCR store has completed for both the fp and
> altivec register save paths.
>
> Tested under QEMU in kvm mode, running on a Talos II workstation with dual
> POWER9 DD2.2 CPUs.
>
> Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.JavaMail.zimbra@raptorengineeringinc.com/
> Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.JavaMail.zimbra@raptorengineeringinc.com/
> Tested-by: Timothy Pearson <tpearson@raptorengineering.com>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
> ---
>  arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
>  arch/powerpc/kernel/vector.S |  2 ++
>  2 files changed, 15 insertions(+)
>
> diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
> index 6a9acfb690c9..2f8f3f93cbb6 100644
> --- a/arch/powerpc/kernel/fpu.S
> +++ b/arch/powerpc/kernel/fpu.S
> @@ -23,6 +23,15 @@
>  #include <asm/feature-fixups.h>
>  
>  #ifdef CONFIG_VSX
> +#define __REST_1FPVSR(n,c,base)						\
> +BEGIN_FTR_SECTION							\
> +	b	2f;							\
> +END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
> +	REST_FPR(n,base);						\
> +	b	3f;							\
> +2:	REST_VSR(n,c,base);						\
> +3:
> +
>  #define __REST_32FPVSRS(n,c,base)					\
>  BEGIN_FTR_SECTION							\
>  	b	2f;							\
> @@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
>  2:	SAVE_32VSRS(n,c,base);						\
>  3:
>  #else
> +#define __REST_1FPVSR(n,b,base)		REST_FPR(n, base)
>  #define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
>  #define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
>  #endif
> +#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
>  #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
>  #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)
>  
> @@ -67,6 +78,7 @@ _GLOBAL(store_fp_state)
>  	SAVE_32FPVSRS(0, R4, R3)
>  	mffs	fr0
>  	stfd	fr0,FPSTATE_FPSCR(r3)
> +	REST_1FPVSR(0, R4, R3)
>  	blr
>  EXPORT_SYMBOL(store_fp_state)
>  
> @@ -138,4 +150,5 @@ _GLOBAL(save_fpu)
>  2:	SAVE_32FPVSRS(0, R4, R6)
>  	mffs	fr0
>  	stfd	fr0,FPSTATE_FPSCR(r6)
> +	REST_1FPVSR(0, R4, R6)
>  	blr
> diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
> index 4094e4c4c77a..80b3f6e476b6 100644
> --- a/arch/powerpc/kernel/vector.S
> +++ b/arch/powerpc/kernel/vector.S
> @@ -33,6 +33,7 @@ _GLOBAL(store_vr_state)
>  	mfvscr	v0
>  	li	r4, VRSTATE_VSCR
>  	stvx	v0, r4, r3
> +	lvx	v0, 0, r3
>  	blr
>  EXPORT_SYMBOL(store_vr_state)
>  
> @@ -109,6 +110,7 @@ _GLOBAL(save_altivec)
>  	mfvscr	v0
>  	li	r4,VRSTATE_VSCR
>  	stvx	v0,r4,r7
> +	lvx	v0,0,r7
>  	blr
>  
>  #ifdef CONFIG_VSX
> -- 
> 2.39.2
Nicholas Piggin Nov. 20, 2023, 2:32 p.m. UTC | #2
Yeah, awesome.

On Mon Nov 20, 2023 at 5:10 PM AEST, Michael Ellerman wrote:
> Hi Timothy,
>
> Great work debugging this. I think your fix is good, but I want to understand it
> a bit more to make sure I can explain why we haven't seen it outside of io-uring.

Analysis seems right to me.

Probably the best minimal fix. But I wonder if we should just use the
one path for saving/flushing/giving up, just use giveup instead of
save?

KVM looks odd too, and actually gets this wrong, in a way that's not
fixed by Timothy's patch, because it's just not restoring userspace
registers at all. Fortunately QEMU isn't in the habit of using
non-volatile FP/VEC registers over a VCPU ioctl, but there's no reason
it couldn't, since GCC/LLVM can easily use them. KVM really wants to be
using giveup.
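
A hedged sketch of what "using giveup" could look like on the KVM side
(an assumption for illustration, not a patch from this thread):

/*
 * Sketch only, not the real kvmppc_save_user_regs(): give the FP/VEC
 * state up (save to the thread struct and clear MSR_FP/MSR_VEC) before
 * running the guest, as is already done for TM, so guest-side register
 * use cannot leak back into userspace state.
 */
static void kvm_flush_user_fp_vec_sketch(struct task_struct *tsk)
{
	giveup_all(tsk);	/* save + clear MSR_FP/MSR_VEC in tsk->thread.regs->msr */
}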

Thanks,
Nick

> If this can be triggered outside of io-uring then I have even more backporting
> in my future :}
>
> Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and
> also *turns off FP* in the task's MSR, meaning the kernel will reload the FP regs
> from the thread struct before letting the task use FP again. So in that case
> save_fpu() is free to clobber fr0 because the FP regs no longer hold live values
> for the task.
>
> There is another case though, which is the path via:
>   copy_process()
>     dup_task_struct()
>       arch_dup_task_struct()
>         flush_all_to_thread()
>           save_all()
>
> That path saves the FP regs but leaves them live. That's meant as an
> optimisation for a process that's using FP/VSX and then calls fork(), leaving
> the regs live means the parent process doesn't have to take a fault after the
> fork to get its FP regs back.
>
> That path does clobber fr0, but fr0 is volatile across a syscall, and the only
> way to reach copy_process() from userspace is via a syscall. So in normal usage
> fr0 being clobbered across a syscall shouldn't cause data corruption.
>
> Even if we handle a signal on the return from the fork() syscall, the worst that
> happens is that the task's thread struct holds the clobbered fr0, but the task
> doesn't care (because fr0 is volatile across the syscall anyway).
>
> That path is something like:
>
> system_call_vectored_common()
>   system_call_exception()
>     sys_fork()
>       kernel_clone()
>         copy_process()
>           dup_task_struct()
>             arch_dup_task_struct()
>               flush_all_to_thread()
>                 save_all()
>                   if (tsk->thread.regs->msr & MSR_FP)
>                     save_fpu()
>                     # does not clear MSR_FP from regs->msr
>   syscall_exit_prepare()
>     interrupt_exit_user_prepare_main()
>       do_notify_resume()
>         get_signal()
>         handle_rt_signal64()
>           prepare_setup_sigcontext()
>             flush_fp_to_thread()
>               if (tsk->thread.regs->msr & MSR_FP)
>                 giveup_fpu()
>                   __giveup_fpu
>                     save_fpu()
>                     # clobbered fr0 is saved, but task considers it volatile
>                     # across syscall anyway
>
>
> But we now have a new path, because io-uring can call copy_process() via
> create_io_thread() from the signal handling path. That's OK if the signal is
> handled as we return from a syscall, but it's not OK if the signal is handled
> due to some other interrupt.
>
> Which is:
>
> interrupt_return_srr_user()
>   interrupt_exit_user_prepare()
>     interrupt_exit_user_prepare_main()
>       do_notify_resume()
>         get_signal()
>           task_work_run()
>             create_worker_cb()
>               create_io_worker()
>                 copy_process()
>                   dup_task_struct()
>                     arch_dup_task_struct()
>                       flush_all_to_thread()
>                         save_all()
>                           if (tsk->thread.regs->msr & MSR_FP)
>                             save_fpu()
>                             # fr0 is clobbered and potentially live in userspace
>
>
> So tldr I think the corruption is only an issue since io-uring started doing
> the clone via signal, which I think matches the observed timeline of this bug
> appearing.
>
> Gotta run home, will have a closer look at the actual patch later on.
>
> cheers
>
>
> Timothy Pearson <tpearson@raptorengineering.com> writes:
> > During floating point and vector save to thread data fr0/vs0 are clobbered
> > by the FPSCR/VSCR store routine.  This leads to userspace register corruption
> > and application data corruption / crash under the following rare condition:
> >
> >  * A userspace thread is executing with VSX/FP mode enabled
> >  * The userspace thread is making active use of fr0 and/or vs0
> >  * An IPI is taken in kernel mode, forcing the userspace thread to reschedule
> >  * The userspace thread is interrupted by the IPI before accessing data it
> >    previously stored in fr0/vs0
> >  * The thread being switched in by the IPI has a pending signal
> >
> > If these exact criteria are met, then the following sequence happens:
> >
> >  * The existing thread FP storage is still valid before the IPI, due to a
> >    prior call to save_fpu() or store_fp_state().  Note that the current
> >    fr0/vs0 registers have been clobbered, so the FP/VSX state in registers
> >    is now invalid pending a call to restore_fp()/restore_altivec().
> >  * IPI -- FP/VSX register state remains invalid
> >  * interrupt_exit_user_prepare_main() calls do_notify_resume(),
> >    due to the pending signal
> >  * do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which
> >    merrily reads and saves the invalid FP/VSX state to thread local storage.
> >  * interrupt_exit_user_prepare_main() calls restore_math(), writing the invalid
> >    FP/VSX state back to registers.
> >  * Execution is released to userspace, and the application crashes or corrupts
> >    data.
> >
> > Without the pending signal, do_notify_resume() is never called, therefore the
> > invalid register state doesn't matter as it is overwritten nearly immediately
> > by interrupt_exit_user_prepare_main() calling restore_math() before return
> > to userspace.
> >
> > Restore fr0/vs0 after FPSCR/VSCR store has completed for both the fp and
> > altivec register save paths.
> >
> > Tested under QEMU in kvm mode, running on a Talos II workstation with dual
> > POWER9 DD2.2 CPUs.
> >
> > Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.JavaMail.zimbra@raptorengineeringinc.com/
> > Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.JavaMail.zimbra@raptorengineeringinc.com/
> > Tested-by: Timothy Pearson <tpearson@raptorengineering.com>
> > Tested-by: Jens Axboe <axboe@kernel.dk>
> > Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
> > ---
> >  arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
> >  arch/powerpc/kernel/vector.S |  2 ++
> >  2 files changed, 15 insertions(+)
> >
> > diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
> > index 6a9acfb690c9..2f8f3f93cbb6 100644
> > --- a/arch/powerpc/kernel/fpu.S
> > +++ b/arch/powerpc/kernel/fpu.S
> > @@ -23,6 +23,15 @@
> >  #include <asm/feature-fixups.h>
> >  
> >  #ifdef CONFIG_VSX
> > +#define __REST_1FPVSR(n,c,base)						\
> > +BEGIN_FTR_SECTION							\
> > +	b	2f;							\
> > +END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
> > +	REST_FPR(n,base);						\
> > +	b	3f;							\
> > +2:	REST_VSR(n,c,base);						\
> > +3:
> > +
> >  #define __REST_32FPVSRS(n,c,base)					\
> >  BEGIN_FTR_SECTION							\
> >  	b	2f;							\
> > @@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
> >  2:	SAVE_32VSRS(n,c,base);						\
> >  3:
> >  #else
> > +#define __REST_1FPVSR(n,b,base)		REST_FPR(n, base)
> >  #define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
> >  #define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
> >  #endif
> > +#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
> >  #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
> >  #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)
> >  
> > @@ -67,6 +78,7 @@ _GLOBAL(store_fp_state)
> >  	SAVE_32FPVSRS(0, R4, R3)
> >  	mffs	fr0
> >  	stfd	fr0,FPSTATE_FPSCR(r3)
> > +	REST_1FPVSR(0, R4, R3)
> >  	blr
> >  EXPORT_SYMBOL(store_fp_state)
> >  
> > @@ -138,4 +150,5 @@ _GLOBAL(save_fpu)
> >  2:	SAVE_32FPVSRS(0, R4, R6)
> >  	mffs	fr0
> >  	stfd	fr0,FPSTATE_FPSCR(r6)
> > +	REST_1FPVSR(0, R4, R6)
> >  	blr
> > diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
> > index 4094e4c4c77a..80b3f6e476b6 100644
> > --- a/arch/powerpc/kernel/vector.S
> > +++ b/arch/powerpc/kernel/vector.S
> > @@ -33,6 +33,7 @@ _GLOBAL(store_vr_state)
> >  	mfvscr	v0
> >  	li	r4, VRSTATE_VSCR
> >  	stvx	v0, r4, r3
> > +	lvx	v0, 0, r3
> >  	blr
> >  EXPORT_SYMBOL(store_vr_state)
> >  
> > @@ -109,6 +110,7 @@ _GLOBAL(save_altivec)
> >  	mfvscr	v0
> >  	li	r4,VRSTATE_VSCR
> >  	stvx	v0,r4,r7
> > +	lvx	v0,0,r7
> >  	blr
> >  
> >  #ifdef CONFIG_VSX
> > -- 
> > 2.39.2
Timothy Pearson Nov. 20, 2023, 4:45 p.m. UTC | #3
----- Original Message -----
> From: "Michael Ellerman" <mpe@ellerman.id.au>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>, "Jens Axboe" <axboe@kernel.dk>, "regressions"
> <regressions@lists.linux.dev>, "npiggin" <npiggin@gmail.com>, "christophe leroy" <christophe.leroy@csgroup.eu>,
> "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Monday, November 20, 2023 1:10:06 AM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save

> Hi Timothy,
> 
> Great work debugging this. I think your fix is good, but I want to understand it
> a bit more to make sure I can explain why we haven't seen it outside of
> io-uring.
> If this can be triggered outside of io-uring then I have even more backporting
> in my future :}
> 
> Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and
> also *turns off FP* in the task's MSR, meaning the kernel will reload the FP regs
> from the thread struct before letting the task use FP again. So in that case
> save_fpu() is free to clobber fr0 because the FP regs no longer hold live values
> for the task.
> 
> There is another case though, which is the path via:
>  copy_process()
>    dup_task_struct()
>      arch_dup_task_struct()
>        flush_all_to_thread()
>          save_all()
> 
> That path saves the FP regs but leaves them live. That's meant as an
> optimisation for a process that's using FP/VSX and then calls fork(), leaving
> the regs live means the parent process doesn't have to take a fault after the
> fork to get its FP regs back.
> 
> That path does clobber fr0, but fr0 is volatile across a syscall, and the only
> way to reach copy_process() from userspace is via a syscall. So in normal usage
> fr0 being clobbered across a syscall shouldn't cause data corruption.
> 
> Even if we handle a signal on the return from the fork() syscall, the worst that
> happens is that the task's thread struct holds the clobbered fr0, but the task
> doesn't care (because fr0 is volatile across the syscall anyway).
> 
> That path is something like:
> 
> system_call_vectored_common()
>  system_call_exception()
>    sys_fork()
>      kernel_clone()
>        copy_process()
>          dup_task_struct()
>            arch_dup_task_struct()
>              flush_all_to_thread()
>                save_all()
>                  if (tsk->thread.regs->msr & MSR_FP)
>                    save_fpu()
>                    # does not clear MSR_FP from regs->msr
>  syscall_exit_prepare()
>    interrupt_exit_user_prepare_main()
>      do_notify_resume()
>        get_signal()
>        handle_rt_signal64()
>          prepare_setup_sigcontext()
>            flush_fp_to_thread()
>              if (tsk->thread.regs->msr & MSR_FP)
>                giveup_fpu()
>                  __giveup_fpu
>                    save_fpu()
>                    # clobbered fr0 is saved, but task considers it volatile
>                    # across syscall anyway
> 
> 
> But we now have a new path, because io-uring can call copy_process() via
> create_io_thread() from the signal handling path. That's OK if the signal is
> handled as we return from a syscall, but it's not OK if the signal is handled
> due to some other interrupt.
> 
> Which is:
> 
> interrupt_return_srr_user()
>  interrupt_exit_user_prepare()
>    interrupt_exit_user_prepare_main()
>      do_notify_resume()
>        get_signal()
>          task_work_run()
>            create_worker_cb()
>              create_io_worker()
>                copy_process()
>                  dup_task_struct()
>                    arch_dup_task_struct()
>                      flush_all_to_thread()
>                        save_all()
>                          if (tsk->thread.regs->msr & MSR_FP)
>                            save_fpu()
>                            # fr0 is clobbered and potentially live in userspace
> 
> 
> So tldr I think the corruption is only an issue since io-uring started doing
> the clone via signal, which I think matches the observed timeline of this bug
> appearing.

I agree the corruption really only started showing up in earnest on io_uring clone-via-signal, as this was confirmed several times in the course of debugging.  Bear in mind however that we have seen very, very rare crashes over several years in other tasks, and I am starting to think this bug might be the root cause (see below).

Note as well that I may very well have a wrong call order in the commit message, since I was relying on a couple of WARN_ON() macros I inserted to check for a similar (but not identical) condition and didn't spend much time getting new traces after identifying the root cause.

I went back and grabbed some real world system-wide stack traces, since I now know what to trigger on.  A typical example is:

interrupt_return_srr_user()
 interrupt_exit_user_prepare()
  interrupt_exit_user_prepare_main()
   schedule()
    __schedule()
     __switch_to()
      giveup_all()
       # tsk->thread.regs->msr MSR_FP is still set here
       __giveup_fpu()
        save_fpu()
        # fr0 is clobbered and potentially live in userspace

This indicates that the pending signal is not actually required; it simply makes triggering much more likely.  Both the pending signal and the _TIF_NEED_RESCHED paths give up control and end up in the same overall position of triggering down-call routines that assume the FPU state is valid.

perl and bash seem to be affected to some degree, though current builds don't use enough VSX instructions rapidly enough to cause crashes with any significant probability.  That said, over many years of running POWER at datacenter scale I have seen enough random bash/perl crashes in the logs to recognize the pattern; I think this has been a low-grade issue for a long time, but with an infinitesimally small chance of happening it was seen as random noise / hardware issues / other rare bugs in various programs.
Michael Ellerman Nov. 20, 2023, 11:39 p.m. UTC | #4
Timothy Pearson <tpearson@raptorengineering.com> writes:
> ----- Original Message -----
>> From: "Michael Ellerman" <mpe@ellerman.id.au>
...
>> 
>> But we now have a new path, because io-uring can call copy_process() via
>> create_io_thread() from the signal handling path. That's OK if the signal is
>> handled as we return from a syscall, but it's not OK if the signal is handled
>> due to some other interrupt.
>> 
>> Which is:
>> 
>> interrupt_return_srr_user()
>>  interrupt_exit_user_prepare()
>>    interrupt_exit_user_prepare_main()
>>      do_notify_resume()
>>        get_signal()
>>          task_work_run()
>>            create_worker_cb()
>>              create_io_worker()
>>                copy_process()
>>                  dup_task_struct()
>>                    arch_dup_task_struct()
>>                      flush_all_to_thread()
>>                        save_all()
>>                          if (tsk->thread.regs->msr & MSR_FP)
>>                            save_fpu()
>>                            # fr0 is clobbered and potentially live in userspace
>> 
>> 
>> So tldr I think the corruption is only an issue since io-uring started doing
>> the clone via signal, which I think matches the observed timeline of this bug
>> appearing.
>
> I agree the corruption really only started showing up in earnest on
> io_uring clone-via-signal, as this was confirmed several times in the
> course of debugging.

Thanks.

> Note as well that I may very well have a wrong call order in the
> commit message, since I was relying on a couple of WARN_ON() macros I
> inserted to check for a similar (but not identical) condition and
> didn't spend much time getting new traces after identifying the root
> cause.

Yep no worries. I'll reword it to incorporate the full path from my mail.

> I went back and grabbed some real world system-wide stack traces, since I now know what to trigger on.  A typical example is:
>
> interrupt_return_srr_user()
>  interrupt_exit_user_prepare()
>   interrupt_exit_user_prepare_main()
>    schedule()
>     __schedule()
>      __switch_to()
>       giveup_all()
>        # tsk->thread.regs->msr MSR_FP is still set here
>        __giveup_fpu()
>         save_fpu()
>         # fr0 is clobbered and potentially live in userspace

fr0 is not live there.

__giveup_fpu() does roughly:

	msr = tsk->thread.regs->msr;
	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
	msr &= ~MSR_VSX;
	tsk->thread.regs->msr = msr;

ie. it clears the FP etc. bits from the task's MSR. That means the FP
state will be reloaded from the thread struct before the task is run again.

Also on that path we're switching to another task, so we'll be reloading
the other task's FP state before returning to userspace.

So I don't see any bug there.

There's only two places that call save_fpu() and skip the giveup logic,
which is save_all() and kvmppc_save_user_regs().
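
As a simplified sketch of that distinction (illustrative only, not the
literal kernel source):

/*
 * __giveup_fpu() (pseudocode above) saves and then clears MSR_FP, so the
 * fr0 clobbered by save_fpu() is never treated as live user state.  A
 * save_all()-style path saves but leaves MSR_FP set, so whatever
 * save_fpu() left in fr0 (the FPSCR value) is still considered live.
 */
static void save_all_style(struct task_struct *tsk)
{
	if (tsk->thread.regs->msr & MSR_FP)
		save_fpu(tsk);		/* clobbers fr0 via mffs/stfd */
	/* note: MSR_FP deliberately not cleared here */
}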

save_all() is only called via clone() so I think that's unable to
actually cause visible register corruption as I described in my previous
mail.

I thought the KVM case was similar, as it's called via an ioctl, but
I'll have to talk to Nick as his mail indicates otherwise.

cheers
Nicholas Piggin Nov. 21, 2023, 12:18 a.m. UTC | #5
On Tue Nov 21, 2023 at 2:45 AM AEST, Timothy Pearson wrote:
>
>
> ----- Original Message -----
> > From: "Michael Ellerman" <mpe@ellerman.id.au>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>, "Jens Axboe" <axboe@kernel.dk>, "regressions"
> > <regressions@lists.linux.dev>, "npiggin" <npiggin@gmail.com>, "christophe leroy" <christophe.leroy@csgroup.eu>,
> > "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> > Sent: Monday, November 20, 2023 1:10:06 AM
> > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save
>
> > Hi Timothy,
> > 
> > Great work debugging this. I think your fix is good, but I want to understand it
> > a bit more to make sure I can explain why we haven't seen it outside of
> > io-uring.
> > If this can be triggered outside of io-uring then I have even more backporting
> > in my future :}
> > 
> > Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and
> > also *turns off FP* in the task's MSR, meaning the kernel will reload the FP regs
> > from the thread struct before letting the task use FP again. So in that case
> > save_fpu() is free to clobber fr0 because the FP regs no longer hold live values
> > for the task.
> > 
> > There is another case though, which is the path via:
> >  copy_process()
> >    dup_task_struct()
> >      arch_dup_task_struct()
> >        flush_all_to_thread()
> >          save_all()
> > 
> > That path saves the FP regs but leaves them live. That's meant as an
> > optimisation for a process that's using FP/VSX and then calls fork(), leaving
> > the regs live means the parent process doesn't have to take a fault after the
> > fork to get its FP regs back.
> > 
> > That path does clobber fr0, but fr0 is volatile across a syscall, and the only
> > way to reach copy_process() from userspace is via a syscall. So in normal usage
> > fr0 being clobbered across a syscall shouldn't cause data corruption.
> > 
> > Even if we handle a signal on the return from the fork() syscall, the worst that
> > happens is that the task's thread struct holds the clobbered fr0, but the task
> > doesn't care (because fr0 is volatile across the syscall anyway).
> > 
> > That path is something like:
> > 
> > system_call_vectored_common()
> >  system_call_exception()
> >    sys_fork()
> >      kernel_clone()
> >        copy_process()
> >          dup_task_struct()
> >            arch_dup_task_struct()
> >              flush_all_to_thread()
> >                save_all()
> >                  if (tsk->thread.regs->msr & MSR_FP)
> >                    save_fpu()
> >                    # does not clear MSR_FP from regs->msr
> >  syscall_exit_prepare()
> >    interrupt_exit_user_prepare_main()
> >      do_notify_resume()
> >        get_signal()
> >        handle_rt_signal64()
> >          prepare_setup_sigcontext()
> >            flush_fp_to_thread()
> >              if (tsk->thread.regs->msr & MSR_FP)
> >                giveup_fpu()
> >                  __giveup_fpu
> >                    save_fpu()
> >                    # clobbered fr0 is saved, but task considers it volatile
> >                    # across syscall anyway
> > 
> > 
> > But we now have a new path, because io-uring can call copy_process() via
> > create_io_thread() from the signal handling path. That's OK if the signal is
> > handled as we return from a syscall, but it's not OK if the signal is handled
> > due to some other interrupt.
> > 
> > Which is:
> > 
> > interrupt_return_srr_user()
> >  interrupt_exit_user_prepare()
> >    interrupt_exit_user_prepare_main()
> >      do_notify_resume()
> >        get_signal()
> >          task_work_run()
> >            create_worker_cb()
> >              create_io_worker()
> >                copy_process()
> >                  dup_task_struct()
> >                    arch_dup_task_struct()
> >                      flush_all_to_thread()
> >                        save_all()
> >                          if (tsk->thread.regs->msr & MSR_FP)
> >                            save_fpu()
> >                            # fr0 is clobbered and potentially live in userspace
> > 
> > 
> > So tldr I think the corruption is only an issue since io-uring started doing
> > the clone via signal, which I think matches the observed timeline of this bug
> > appearing.
>
> I agree the corruption really only started showing up in earnest on io_uring clone-via-signal, as this was confirmed several times in the course of debugging.  Bear in mind however that we have seen very, very rare crashes over several years in other tasks, and I am starting to think this bug might be the root cause (see below).
>
> Note as well that I may very well have a wrong call order in the commit message, since I was relying on a couple of WARN_ON() macros I inserted to check for a similar (but not identical) condition and didn't spend much time getting new traces after identifying the root cause.
>
> I went back and grabbed some real world system-wide stack traces, since I now know what to trigger on.  A typical example is:
>
> interrupt_return_srr_user()
>  interrupt_exit_user_prepare()
>   interrupt_exit_user_prepare_main()
>    schedule()
>     __schedule()
>      __switch_to()
>       giveup_all()
>        # tsk->thread.regs->msr MSR_FP is still set here
>        __giveup_fpu()
>         save_fpu()
>         # fr0 is clobbered and potentially live in userspace
>
> This indicates that the pending signal is not actually required; it simply makes triggering much more likely.  Both the pending signal and the _TIF_NEED_RESCHED paths give up control and end up in the same overall position of triggering down-call routines that assume the FPU state is valid.

That seems probably true, but that's on the other side of the equation,
I think? Because __giveup_fpu does clear MSR[FP] before any context
switch or return to user is possible.

*After* we have clobbered fr0 without clearing MSR_FP from the user msr,
then yes any context switch or return to user will cause userspace to
next run with a clobbered fr0. But getting to that fr0/msr state
requires the flush_all_to_thread(), and that needs to be called in a
path where user fr0 must not be clobbered. I don't see one other than
io-uring yet.

[ KVM via kvmppc_save_user_regs() (which is basically the same as
flush_all_to_thread()) can do it. Not via the fr0 clobber in save_fpu,
because this is called via a VCPU run ioctl, but KVM will later clobber
all FP/VEC registers via running guest code. So we clobber non-volatile
regs as well. This wouldn't be causing random corruption and crashes
though, only possibly bugs in code that runs KVM guests. ]

Thanks,
Nick

>
> perl and bash seem to be affected to some degree, though current builds don't use enough VSX instructions rapidly enough to cause crashes with any significant probability.  That said, over many years of running POWER at datacenter scale I have seen enough random bash/perl crashes in the logs to recognize the pattern; I think this has been a low-grade issue for a long time, but with an infinitesimally small chance of happening it was seen as random noise / hardware issues / other rare bugs in various programs.
Nicholas Piggin Nov. 21, 2023, 12:27 a.m. UTC | #6
On Tue Nov 21, 2023 at 9:39 AM AEST, Michael Ellerman wrote:
> Timothy Pearson <tpearson@raptorengineering.com> writes:
> > ----- Original Message -----
> >> From: "Michael Ellerman" <mpe@ellerman.id.au>
> ...
> >> 
> >> But we now have a new path, because io-uring can call copy_process() via
> >> create_io_thread() from the signal handling path. That's OK if the signal is
> >> handled as we return from a syscall, but it's not OK if the signal is handled
> >> due to some other interrupt.
> >> 
> >> Which is:
> >> 
> >> interrupt_return_srr_user()
> >>  interrupt_exit_user_prepare()
> >>    interrupt_exit_user_prepare_main()
> >>      do_notify_resume()
> >>        get_signal()
> >>          task_work_run()
> >>            create_worker_cb()
> >>              create_io_worker()
> >>                copy_process()
> >>                  dup_task_struct()
> >>                    arch_dup_task_struct()
> >>                      flush_all_to_thread()
> >>                        save_all()
> >>                          if (tsk->thread.regs->msr & MSR_FP)
> >>                            save_fpu()
> >>                            # fr0 is clobbered and potentially live in userspace
> >> 
> >> 
> >> So tldr I think the corruption is only an issue since io-uring started doing
> >> the clone via signal, which I think matches the observed timeline of this bug
> >> appearing.
> >
> > I agree the corruption really only started showing up in earnest on
> > io_uring clone-via-signal, as this was confirmed several times in the
> > course of debugging.
>
> Thanks.
>
> > Note as well that I may very well have a wrong call order in the
> > commit message, since I was relying on a couple of WARN_ON() macros I
> > inserted to check for a similar (but not identical) condition and
> > didn't spend much time getting new traces after identifying the root
> > cause.
>
> Yep no worries. I'll reword it to incorporate the full path from my mail.
>
> > I went back and grabbed some real world system-wide stack traces, since I now know what to trigger on.  A typical example is:
> >
> > interrupt_return_srr_user()
> >  interrupt_exit_user_prepare()
> >   interrupt_exit_user_prepare_main()
> >    schedule()
> >     __schedule()
> >      __switch_to()
> >       giveup_all()
> >        # tsk->thread.regs->msr MSR_FP is still set here
> >        __giveup_fpu()
> >         save_fpu()
> >         # fr0 is clobbered and potentially live in userspace
>
> fr0 is not live there.
>
> __giveup_fpu() does roughly:
>
> 	msr = tsk->thread.regs->msr;
> 	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
>         msr &= ~MSR_VSX;
> 	tsk->thread.regs->msr = msr;
>
> ie. it clears the FP etc. bits from the task's MSR. That means the FP
> state will be reloaded from the thread struct before the task is run again.
>
> Also on that path we're switching to another task, so we'll be reloading
> the other task's FP state before returning to userspace.
>
> So I don't see any bug there.
>
> There's only two places that call save_fpu() and skip the giveup logic,
> which is save_all() and kvmppc_save_user_regs().
>
> save_all() is only called via clone() so I think that's unable to
> actually cause visible register corruption as I described in my previous
> mail.
>
> I thought the KVM case was similar, as it's called via an ioctl, but
> I'll have to talk to Nick as his mail indicates otherwise.

Yeah it can, because later on it runs the guest and that will clobber
other regs. I reproduced fr14 corruption in the host easily with KVM
selftests.

It should just do a "giveup" on FP/VEC (as it does with TM).

Thanks,
Nick
Timothy Pearson Nov. 21, 2023, 1:23 a.m. UTC | #7
----- Original Message -----
> From: "Michael Ellerman" <mpe@ellerman.id.au>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>, "npiggin" <npiggin@gmail.com>,
> "christophe leroy" <christophe.leroy@csgroup.eu>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Monday, November 20, 2023 5:39:52 PM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save

> Timothy Pearson <tpearson@raptorengineering.com> writes:
>> ----- Original Message -----
>>> From: "Michael Ellerman" <mpe@ellerman.id.au>
> ...
>>> 
>>> But we now have a new path, because io-uring can call copy_process() via
>>> create_io_thread() from the signal handling path. That's OK if the signal is
>>> handled as we return from a syscall, but it's not OK if the signal is handled
>>> due to some other interrupt.
>>> 
>>> Which is:
>>> 
>>> interrupt_return_srr_user()
>>>  interrupt_exit_user_prepare()
>>>    interrupt_exit_user_prepare_main()
>>>      do_notify_resume()
>>>        get_signal()
>>>          task_work_run()
>>>            create_worker_cb()
>>>              create_io_worker()
>>>                copy_process()
>>>                  dup_task_struct()
>>>                    arch_dup_task_struct()
>>>                      flush_all_to_thread()
>>>                        save_all()
>>>                          if (tsk->thread.regs->msr & MSR_FP)
>>>                            save_fpu()
>>>                            # fr0 is clobbered and potentially live in userspace
>>> 
>>> 
>>> So tldr I think the corruption is only an issue since io-uring started doing
>>> the clone via signal, which I think matches the observed timeline of this bug
>>> appearing.
>>
>> I agree the corruption really only started showing up in earnest on
>> io_uring clone-via-signal, as this was confirmed several times in the
>> course of debugging.
> 
> Thanks.
> 
>> Note as well that I may very well have a wrong call order in the
>> commit message, since I was relying on a couple of WARN_ON() macros I
>> inserted to check for a similar (but not identical) condition and
>> didn't spend much time getting new traces after identifying the root
>> cause.
> 
> Yep no worries. I'll reword it to incorporate the full path from my mail.
> 
>> I went back and grabbed some real world system-wide stack traces, since I now
>> know what to trigger on.  A typical example is:
>>
>> interrupt_return_srr_user()
>>  interrupt_exit_user_prepare()
>>   interrupt_exit_user_prepare_main()
>>    schedule()
>>     __schedule()
>>      __switch_to()
>>       giveup_all()
>>        # tsk->thread.regs->msr MSR_FP is still set here
>>        __giveup_fpu()
>>         save_fpu()
>>         # fr0 is clobbered and potentially live in userspace
> 
> fr0 is not live there.
> 
> __giveup_fpu() does roughly:
> 
>	msr = tsk->thread.regs->msr;
>	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
>        msr &= ~MSR_VSX;
>	tsk->thread.regs->msr = msr;
> 
> ie. it clears the FP etc. bits from the task's MSR. That means the FP
> state will be reloaded from the thread struct before the task is run again.
> 
> Also on that path we're switching to another task, so we'll be reloading
> the other task's FP state before returning to userspace.
> 
> So I don't see any bug there.

Yeah, you're right.  I was trying to get traces while doing something else, and didn't think that all the way through, sorry. :)  It's not going to be super easy to get a real trace (I was triggering the WARN_ON() on fr0 getting set to the FPSCR value), so let's just assume it's mainly the path you manually found above and update the commit message accordingly.

> There's only two places that call save_fpu() and skip the giveup logic,
> which is save_all() and kvmppc_save_user_regs().

Now that's interesting as well, since it might explain some issues I've seen for years on a specific QEMU workload.  Once this is backported to stable I'll need to get the kernels updated on those boxes and see if the issues disappear...
Timothy Pearson Nov. 21, 2023, 4:10 a.m. UTC | #8
----- Original Message -----
> From: "Michael Ellerman" <mpe@ellerman.id.au>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>, "npiggin" <npiggin@gmail.com>,
> "christophe leroy" <christophe.leroy@csgroup.eu>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Monday, November 20, 2023 5:39:52 PM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save

> Timothy Pearson <tpearson@raptorengineering.com> writes:
>> ----- Original Message -----
>>> From: "Michael Ellerman" <mpe@ellerman.id.au>
> ...
>>> 
>>> But we now have a new path, because io-uring can call copy_process() via
>>> create_io_thread() from the signal handling path. That's OK if the signal is
>>> handled as we return from a syscall, but it's not OK if the signal is handled
>>> due to some other interrupt.
>>> 
>>> Which is:
>>> 
>>> interrupt_return_srr_user()
>>>  interrupt_exit_user_prepare()
>>>    interrupt_exit_user_prepare_main()
>>>      do_notify_resume()
>>>        get_signal()
>>>          task_work_run()
>>>            create_worker_cb()
>>>              create_io_worker()
>>>                copy_process()
>>>                  dup_task_struct()
>>>                    arch_dup_task_struct()
>>>                      flush_all_to_thread()
>>>                        save_all()
>>>                          if (tsk->thread.regs->msr & MSR_FP)
>>>                            save_fpu()
>>>                            # fr0 is clobbered and potentially live in userspace
>>> 
>>> 
>>> So tldr I think the corruption is only an issue since io-uring started doing
>>> the clone via signal, which I think matches the observed timeline of this bug
>>> appearing.
>>
>> I agree the corruption really only started showing up in earnest on
>> io_uring clone-via-signal, as this was confirmed several times in the
>> course of debugging.
> 
> Thanks.
> 
>> Note as well that I may very well have a wrong call order in the
>> commit message, since I was relying on a couple of WARN_ON() macros I
>> inserted to check for a similar (but not identical) condition and
>> didn't spend much time getting new traces after identifying the root
>> cause.
> 
> Yep no worries. I'll reword it to incorporate the full path from my mail.
> 
>> I went back and grabbed some real world system-wide stack traces, since I now
>> know what to trigger on.  A typical example is:
>>
>> interrupt_return_srr_user()
>>  interrupt_exit_user_prepare()
>>   interrupt_exit_user_prepare_main()
>>    schedule()
>>     __schedule()
>>      __switch_to()
>>       giveup_all()
>>        # tsk->thread.regs->msr MSR_FP is still set here
>>        __giveup_fpu()
>>         save_fpu()
>>         # fr0 is clobbered and potentially live in userspace
> 
> fr0 is not live there.
<snip> 
> ie. it clears the FP etc. bits from the task's MSR. That means the FP
> state will be reloaded from the thread struct before the task is run again.

So a little more detail on this, just to put it to rest properly vs. assuming hand analysis caught every possible pathway. :)

The debugging that generates this stack trace also verifies the following in __giveup_fpu():

1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to calling save_fpu()
2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after calling save_fpu()
3.) MSR_FP is set both in the task struct and in the live MSR.

Only if all three conditions are met will it generate the trace.  This is a generalization of the hack I used to find the problem in the first place.
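
A rough reconstruction of that check (field names approximate; not
Timothy's actual instrumentation):

/*
 * Reconstruction only.  Warn when all three conditions above hold: fr0's
 * save slot did not hold the FPSCR value before save_fpu(), does hold it
 * afterwards, and MSR_FP is set in both the saved and the live MSR.
 */
static void debug_check_fr0(struct task_struct *tsk)
{
	u64 before = tsk->thread.fp_state.fpr[0][0];

	save_fpu(tsk);

	WARN_ON(before != tsk->thread.fp_state.fpscr &&
		tsk->thread.fp_state.fpr[0][0] == tsk->thread.fp_state.fpscr &&
		(tsk->thread.regs->msr & MSR_FP) &&
		(mfmsr() & MSR_FP));
}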

If the state will subsequently be reloaded from the thread struct, that means we're reloading the registers from the thread struct that we just verified was corrupted by the earlier save_fpu() call.  There are only two ways I can see for that to be true -- one is if the registers were already clobbered when giveup_all() was entered, and the other is if save_fpu() went ahead and clobbered them right here inside giveup_all().

To see which scenario we were dealing with, I added a bit more instrumentation to dump the current register state if the MSR_FP bit was already set in the live registers (i.e. not dumping data from the task struct, but using the live FPU registers instead), and sure enough the registers are corrupt on entry, so something else has already called save_fpu() before we even hit giveup_all() in this call chain.

Unless I'm missing something, doesn't this effectively mean that anything interrupting a task can hit this bug?  Or, put another way, I'm seeing several processes hit this exact call chain with the corrupt register going back out to userspace without io_uring even in the mix, so there seems to be another pathway in play.  These traces are from a qemu guest, in case it matters given the kvm path is possibly susceptible.

Just a few things to think about.  The FPU patch itself definitely resolves the problems; I used a sledgehammer approach *specifically* so that there is no place for a rare call sequence we didn't consider to hit it again down the line. :)
Timothy Pearson Nov. 21, 2023, 4:26 a.m. UTC | #9
----- Original Message -----
> From: "Timothy Pearson" <tpearson@raptorengineeringinc.com>
> To: "Michael Ellerman" <mpe@ellerman.id.au>
> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>, "npiggin" <npiggin@gmail.com>,
> "christophe leroy" <christophe.leroy@csgroup.eu>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Monday, November 20, 2023 10:10:32 PM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save

> ----- Original Message -----
>> From: "Michael Ellerman" <mpe@ellerman.id.au>
>> To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>,
>> "npiggin" <npiggin@gmail.com>,
>> "christophe leroy" <christophe.leroy@csgroup.eu>, "linuxppc-dev"
>> <linuxppc-dev@lists.ozlabs.org>
>> Sent: Monday, November 20, 2023 5:39:52 PM
>> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec
>> register  save
> 
>> Timothy Pearson <tpearson@raptorengineering.com> writes:
>>> ----- Original Message -----
>>>> From: "Michael Ellerman" <mpe@ellerman.id.au>
>> ...
>>>> 
>>>> But we now have a new path, because io-uring can call copy_process() via
>>>> create_io_thread() from the signal handling path. That's OK if the signal is
>>>> handled as we return from a syscall, but it's not OK if the signal is handled
>>>> due to some other interrupt.
>>>> 
>>>> Which is:
>>>> 
>>>> interrupt_return_srr_user()
>>>>  interrupt_exit_user_prepare()
>>>>    interrupt_exit_user_prepare_main()
>>>>      do_notify_resume()
>>>>        get_signal()
>>>>          task_work_run()
>>>>            create_worker_cb()
>>>>              create_io_worker()
>>>>                copy_process()
>>>>                  dup_task_struct()
>>>>                    arch_dup_task_struct()
>>>>                      flush_all_to_thread()
>>>>                        save_all()
>>>>                          if (tsk->thread.regs->msr & MSR_FP)
>>>>                            save_fpu()
>>>>                            # fr0 is clobbered and potentially live in userspace
>>>> 
>>>> 
>>>> So tldr I think the corruption is only an issue since io-uring started doing
>>>> the clone via signal, which I think matches the observed timeline of this bug
>>>> appearing.
>>>
>>> I agree the corruption really only started showing up in earnest on
>>> io_uring clone-via-signal, as this was confirmed several times in the
>>> course of debugging.
>> 
>> Thanks.
>> 
>>> Note as well that I may very well have a wrong call order in the
>>> commit message, since I was relying on a couple of WARN_ON() macros I
>>> inserted to check for a similar (but not identical) condition and
>>> didn't spend much time getting new traces after identifying the root
>>> cause.
>> 
>> Yep no worries. I'll reword it to incorporate the full path from my mail.
>> 
>>> I went back and grabbed some real world system-wide stack traces, since I now
>>> know what to trigger on.  A typical example is:
>>>
>>> interrupt_return_srr_user()
>>>  interrupt_exit_user_prepare()
>>>   interrupt_exit_user_prepare_main()
>>>    schedule()
>>>     __schedule()
>>>      __switch_to()
>>>       giveup_all()
>>>        # tsk->thread.regs->msr MSR_FP is still set here
>>>        __giveup_fpu()
>>>         save_fpu()
>>>         # fr0 is clobbered and potentially live in userspace
>> 
>> fr0 is not live there.
> <snip>
>> ie. it clears the FP etc. bits from the task's MSR. That means the FP
>> state will be reloaded from the thread struct before the task is run again.
> 
> So a little more detail on this, just to put it to rest properly vs. assuming
> hand analysis caught every possible pathway. :)
> 
> The debugging that generates this stack trace also verifies the following in
> __giveup_fpu():
> 
> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to calling
> save_fpu()
> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after calling
> save_fpu()
> 3.) MSR_FP is set both in the task struct and in the live MSR.
> 
> Only if all three conditions are met will it generate the trace.  This is a
> generalization of the hack I used to find the problem in the first place.
> 
> If the state will subsequently be reloaded from the thread struct, that means
> we're reloading the registers from the thread struct that we just verified was
> corrupted by the earlier save_fpu() call.  There are only two ways I can see
> for that to be true -- one is if the registers were already clobbered when
> giveup_all() was entered, and the other is if save_fpu() went ahead and
> clobbered them right here inside giveup_all().
> 
> To see which scenario we were dealing with, I added a bit more instrumentation
> to dump the current register state if MSR_FP bit was already set in registers
> (i.e. not dumping data from task struct, but using the live FPU registers
> instead), and sure enough the registers are corrupt on entry, so something else
> has already called save_fpu() before we even hit giveup_all() in this call
> chain.
> 
> Unless I'm missing something, doesn't this effectively mean that anything
> interrupting a task can hit this bug?  Or, put another way, I'm seeing several
> processes hit this exact call chain with the corrupt register going back out to
> userspace without io_uring even in the mix, so there seems to be another
> pathway in play.  These traces are from a qemu guest, in case it matters given
> the kvm path is possibly susceptible.
> 
> Just a few things to think about.  The FPU patch itself definitely resolves the
> problems; I used a sledgehammer approach *specifically* so that there is no
> place for a rare call sequence we didn't consider to hit it again down the
> line. :)

For reference, a couple of traces that are verified to hit the conditions above when I leave the debugging unrestricted system-wide:

From perl:

[  100.735133] NIP [c00000000001b0d8] __giveup_fpu+0xc8/0x280
[  100.735162] LR [c00000000001b084] __giveup_fpu+0x74/0x280
[  100.735190] Call Trace:
[  100.735205] [c000000008ac7710] [c000000008ac7830] 0xc000000008ac7830 (unreliable)
[  100.735251] [c000000008ac7a10] [c00000000001c094] giveup_all+0x84/0x120
[  100.735289] [c000000008ac7a40] [c00000000001cb08] __switch_to+0x128/0x2e0
[  100.735327] [c000000008ac7aa0] [c00000000101e0d0] __schedule+0x1020/0x11c0
[  100.735362] [c000000008ac7b90] [c00000000101e3f8] schedule+0x188/0x1f0
[  100.735397] [c000000008ac7c10] [c00000000063a834] pipe_read+0x3c4/0x5c0
[  100.735437] [c000000008ac7cf0] [c00000000062a9cc] vfs_read+0x18c/0x360
[  100.735505] [c000000008ac7dc0] [c00000000062b9e4] ksys_read+0xf4/0x150
[  100.735540] [c000000008ac7e10] [c00000000002fca4] system_call_exception+0x294/0x2e0
[  100.735581] [c000000008ac7e50] [c00000000000d0dc] system_call_vectored_common+0x15c/0x2ec

From mariadbd:

[  129.374710] NIP [c00000000001b0d8] __giveup_fpu+0xc8/0x280
[  129.374743] LR [c00000000001b084] __giveup_fpu+0x74/0x280
[  129.374774] Call Trace:
[  129.374791] [c000000018dbf680] [0000000000000001] 0x1 (unreliable)
[  129.374833] [c000000018dbf980] [c00000000001c094] giveup_all+0x84/0x120
[  129.374873] [c000000018dbf9b0] [c00000000001cb08] __switch_to+0x128/0x2e0
[  129.374915] [c000000018dbfa10] [c00000000101e0d0] __schedule+0x1020/0x11c0
[  129.374958] [c000000018dbfb00] [c00000000101e3f8] schedule+0x188/0x1f0
[  129.374996] [c000000018dbfb80] [c0000000002ba240] futex_wait_queue+0x80/0xf0
[  129.375051] [c000000018dbfbc0] [c0000000002baf70] __futex_wait+0xc0/0x180
[  129.375102] [c000000018dbfca0] [c0000000002bb0c4] futex_wait+0x94/0x150
[  129.375161] [c000000018dbfd60] [c0000000002b5d5c] do_futex+0x11c/0x320
[  129.375214] [c000000018dbfd90] [c0000000002b6130] sys_futex+0x1d0/0x240
[  129.375261] [c000000018dbfe10] [c00000000002fca4] system_call_exception+0x294/0x2e0
[  129.375307] [c000000018dbfe50] [c00000000000d0dc] system_call_vectored_common+0x15c/0x2ec

Another from mariadbd (this one takes out the server in this particular run, but that's just because it "lost" the race with the io_uring worker spawn):

[  136.342361] NIP [c00000000001b0d8] __giveup_fpu+0xc8/0x280
[  136.342416] LR [c00000000001b084] __giveup_fpu+0x74/0x280
[  136.342467] Call Trace:
[  136.342491] [c000000018dbf8d0] [000000000000002b] 0x2b (unreliable)
[  136.342557] [c000000018dbfbd0] [c00000000001c094] giveup_all+0x84/0x120
[  136.342620] [c000000018dbfc00] [c00000000001cb08] __switch_to+0x128/0x2e0
[  136.342685] [c000000018dbfc60] [c00000000101e0d0] __schedule+0x1020/0x11c0
[  136.342752] [c000000018dbfd50] [c00000000101e3f8] schedule+0x188/0x1f0
[  136.342818] [c000000018dbfdd0] [c00000000002e92c] interrupt_exit_user_prepare_main+0x7c/0x2e0
[  136.342907] [c000000018dbfe20] [c00000000002ee18] interrupt_exit_user_prepare+0x88/0xa0
[  136.342983] [c000000018dbfe50] [c00000000000d954] interrupt_return_srr_user+0x8/0x12c
Nicholas Piggin Nov. 21, 2023, 7:54 a.m. UTC | #10
On Tue Nov 21, 2023 at 2:10 PM AEST, Timothy Pearson wrote:
> ----- Original Message -----
> > From: "Michael Ellerman" <mpe@ellerman.id.au>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>
> > Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>, "npiggin" <npiggin@gmail.com>,
> > "christophe leroy" <christophe.leroy@csgroup.eu>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> > Sent: Monday, November 20, 2023 5:39:52 PM
> > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save
>
> > Timothy Pearson <tpearson@raptorengineering.com> writes:
> >> ----- Original Message -----
> >>> From: "Michael Ellerman" <mpe@ellerman.id.au>
> > ...
> >>> 
> >>> But we now have a new path, because io-uring can call copy_process() via
> >>> create_io_thread() from the signal handling path. That's OK if the signal is
> >>> handled as we return from a syscall, but it's not OK if the signal is handled
> >>> due to some other interrupt.
> >>> 
> >>> Which is:
> >>> 
> >>> interrupt_return_srr_user()
> >>>  interrupt_exit_user_prepare()
> >>>    interrupt_exit_user_prepare_main()
> >>>      do_notify_resume()
> >>>        get_signal()
> >>>          task_work_run()
> >>>            create_worker_cb()
> >>>              create_io_worker()
> >>>                copy_process()
> >>>                  dup_task_struct()
> >>>                    arch_dup_task_struct()
> >>>                      flush_all_to_thread()
> >>>                        save_all()
> >>>                          if (tsk->thread.regs->msr & MSR_FP)
> >>>                            save_fpu()
> >>>                            # fr0 is clobbered and potentially live in userspace
> >>> 
> >>> 
> >>> So tldr I think the corruption is only an issue since io-uring started doing
> >>> the clone via signal, which I think matches the observed timeline of this bug
> >>> appearing.
> >>
> >> I agree the corruption really only started showing up in earnest on
> >> io_uring clone-via-signal, as this was confirmed several times in the
> >> course of debugging.
> > 
> > Thanks.
> > 
> >> Note as well that I may very well have a wrong call order in the
> >> commit message, since I was relying on a couple of WARN_ON() macros I
> >> inserted to check for a similar (but not identical) condition and
> >> didn't spend much time getting new traces after identifying the root
> >> cause.
> > 
> > Yep no worries. I'll reword it to incorporate the full path from my mail.
> > 
> >> I went back and grabbed some real world system-wide stack traces, since I now
> >> know what to trigger on.  A typical example is:
> >>
> >> interrupt_return_srr_user()
> >>  interrupt_exit_user_prepare()
> >>   interrupt_exit_user_prepare_main()
> >>    schedule()
> >>     __schedule()
> >>      __switch_to()
> >>       giveup_all()
> >>        # tsk->thread.regs->msr MSR_FP is still set here
> >>        __giveup_fpu()
> >>         save_fpu()
> >>         # fr0 is clobbered and potentially live in userspace
> > 
> > fr0 is not live there.
> <snip> 
> > ie. it clears the FP etc. bits from the task's MSR. That means the FP
> > state will be reloaded from the thread struct before the task is run again.
>
> So a little more detail on this, just to put it to rest properly vs. assuming hand analysis caught every possible pathway. :)
>
> The debugging that generates this stack trace also verifies the following in __giveup_fpu():
>
> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to calling save_fpu()
> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after calling save_fpu()
> 3.) MSR_FP is set both in the task struct and in the live MSR.
>
> Only if all three conditions are met will it generate the trace.  This is a generalization of the hack I used to find the problem in the first place.
>
> If the state will subsequently be reloaded from the thread struct, that means we're reloading the registers from the thread struct that we just verified was corrupted by the earlier save_fpu() call.  There are only two ways I can see for that to be true -- one is if the registers were already clobbered when giveup_all() was entered, and the other is if save_fpu() went ahead and clobbered them right here inside giveup_all().
>
> To see which scenario we were dealing with, I added a bit more instrumentation to dump the current register state if MSR_FP bit was already set in registers (i.e. not dumping data from task struct, but using the live FPU registers instead), and sure enough the registers are corrupt on entry, so something else has already called save_fpu() before we even hit giveup_all() in this call chain.
>
> Unless I'm missing something, doesn't this effectively mean that anything interrupting a task can hit this bug?  Or, put another way, I'm seeing several processes hit this exact call chain with the corrupt register going back out to userspace without io_uring even in the mix, so there seems to be another pathway in play.  These traces are from a qemu guest, in case it matters given the kvm path is possibly susceptible.
>
> Just a few things to think about.  The FPU patch itself definitely resolves the problems; I used a sledgehammer approach *specifically* so that there is no place for a rare call sequence we didn't consider to hit it again down the line. :)

I don't think interrupts are supposed to use (or save) FP/VEC
registers. So you're allowed to _take_ interrupts while FP/VEC
are being saved or used, just not be preempted, block, or
return to user. Hence all the preempt_disable() around these
things.
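
For reference, the usual in-kernel pattern looks roughly like the
sketch below. enable_kernel_fp()/disable_kernel_fp() are the real
powerpc helpers (asm/switch_to.h); do_kernel_fp_work() is just a
placeholder for whatever actually touches the registers:

	#include <linux/preempt.h>
	#include <asm/switch_to.h>	/* enable_kernel_fp(), disable_kernel_fp() */

	static void fp_user(void)	/* hypothetical in-kernel FP user */
	{
		preempt_disable();	/* no context switch while we own the FP unit */
		enable_kernel_fp();	/* claim FP, flushing any user state first */

		do_kernel_fp_work();	/* placeholder for the actual FP/VSX use */

		disable_kernel_fp();
		preempt_enable();	/* only now may we be switched out again */
	}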

Not that we cover all these with warnings very well in the
enable_kernel_* functions AFAIKS. We could add more checks.
At least a check that interrupts are enabled would be good. Balance
and user-exit checks should somewhat be covered by the preempt count
implicitly.
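
Something like the below is what I have in mind -- just a sketch of
where such checks could go, not a tested patch:

	/* e.g. at the top of enable_kernel_fp() / enable_kernel_altivec() */
	WARN_ON_ONCE(in_interrupt());	/* FP/VEC must not be used from interrupts */
	WARN_ON_ONCE(irqs_disabled());	/* "at least interrupts enabled" */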

Thanks,
Nick
Nicholas Piggin Nov. 21, 2023, 7:56 a.m. UTC | #11
On Tue Nov 21, 2023 at 11:23 AM AEST, Timothy Pearson wrote:
>
>
> ----- Original Message -----
> > From: "Michael Ellerman" <mpe@ellerman.id.au>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>
> > Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>, "npiggin" <npiggin@gmail.com>,
> > "christophe leroy" <christophe.leroy@csgroup.eu>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> > Sent: Monday, November 20, 2023 5:39:52 PM
> > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save
>
> > Timothy Pearson <tpearson@raptorengineering.com> writes:
> >> ----- Original Message -----
> >>> From: "Michael Ellerman" <mpe@ellerman.id.au>
> > ...
> >>> 
> >>> But we now have a new path, because io-uring can call copy_process() via
> >>> create_io_thread() from the signal handling path. That's OK if the signal is
> >>> handled as we return from a syscall, but it's not OK if the signal is handled
> >>> due to some other interrupt.
> >>> 
> >>> Which is:
> >>> 
> >>> interrupt_return_srr_user()
> >>>  interrupt_exit_user_prepare()
> >>>    interrupt_exit_user_prepare_main()
> >>>      do_notify_resume()
> >>>        get_signal()
> >>>          task_work_run()
> >>>            create_worker_cb()
> >>>              create_io_worker()
> >>>                copy_process()
> >>>                  dup_task_struct()
> >>>                    arch_dup_task_struct()
> >>>                      flush_all_to_thread()
> >>>                        save_all()
> >>>                          if (tsk->thread.regs->msr & MSR_FP)
> >>>                            save_fpu()
> >>>                            # fr0 is clobbered and potentially live in userspace
> >>> 
> >>> 
> >>> So tldr I think the corruption is only an issue since io-uring started doing
> >>> the clone via signal, which I think matches the observed timeline of this bug
> >>> appearing.
> >>
> >> I agree the corruption really only started showing up in earnest on
> >> io_uring clone-via-signal, as this was confirmed several times in the
> >> course of debugging.
> > 
> > Thanks.
> > 
> >> Note as well that I may very well have a wrong call order in the
> >> commit message, since I was relying on a couple of WARN_ON() macros I
> >> inserted to check for a similar (but not identical) condition and
> >> didn't spend much time getting new traces after identifying the root
> >> cause.
> > 
> > Yep no worries. I'll reword it to incorporate the full path from my mail.
> > 
> >> I went back and grabbed some real world system-wide stack traces, since I now
> >> know what to trigger on.  A typical example is:
> >>
> >> interrupt_return_srr_user()
> >>  interrupt_exit_user_prepare()
> >>   interrupt_exit_user_prepare_main()
> >>    schedule()
> >>     __schedule()
> >>      __switch_to()
> >>       giveup_all()
> >>        # tsk->thread.regs->msr MSR_FP is still set here
> >>        __giveup_fpu()
> >>         save_fpu()
> >>         # fr0 is clobbered and potentially live in userspace
> > 
> > fr0 is not live there.
> > 
> > __giveup_fpu() does roughly:
> > 
> >	msr = tsk->thread.regs->msr;
> >	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
> >        msr &= ~MSR_VSX;
> >	tsk->thread.regs->msr = msr;
> > 
> > ie. it clears the FP etc. bits from the task's MSR. That means the FP
> > state will be reloaded from the thread struct before the task is run again.
> > 
> > Also on that path we're switching to another task, so we'll be reloading
> > the other task's FP state before returning to userspace.
> > 
> > So I don't see any bug there.
>
> Yeah, you're right.  I was trying to get traces while doing something else, and didn't think that all the way through, sorry. :)  It's not going to be super easy to get a real trace (I was triggering the WARN_ON() from fr0 getting set to the FPSCR), so let's just assume it's mainly the path you manually found above and update the commit message accordingly.
>
> > There's only two places that call save_fpu() and skip the giveup logic,
> > which is save_all() and kvmppc_save_user_regs().
>
> Now that's interesting as well, since it might explain some issues I've seen for years on a specific QEMU workload.  Once this is backported to stable I'll need to get the kernels updated on those boxes and see if the issues disappear...

KVM issue is actually slightly different. You need this (lightly
verified to solve issue so far).

---

From c12fbed0e12207058282a2411da59b43b1f96c49 Mon Sep 17 00:00:00 2001
From: Nicholas Piggin <npiggin@gmail.com>
Date: Tue, 21 Nov 2023 17:03:55 +1000
Subject: [PATCH] KVM: PPC: Book3S HV: Fix KVM_RUN ioctl clobbering FP/VEC
 registers

Before running a guest, the host process's FP/VEC registers are saved
away like a context switch or a kernel use of those regs. The guest
FP/VEC registers can then be loaded as needed. The host process
registers would be restored lazily when it uses FP/VEC again.

KVM HV did not do this correctly. The registers are saved in the
thread struct, but the FP/VEC/VSX bits remain enabled in the user
MSR, leading the kernel to think they are still valid. Even after
they are clobbered by guest registers. This leads to the host
process getting guest FP/VEC register values.

KVM must be invoked via ioctl in this path, and almost certainly that
means a C call through a wrapper function, but that still leaves a real
possibility of non-volatile register corruption causing problems for
the QEMU process.
---
 arch/powerpc/kernel/process.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 392404688cec..9452a54d356c 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1198,11 +1198,11 @@ void kvmppc_save_user_regs(void)
 
 	usermsr = current->thread.regs->msr;
 
+	/* Caller has enabled FP/VEC/VSX/TM in MSR */
 	if (usermsr & MSR_FP)
-		save_fpu(current);
-
+		__giveup_fpu(current);
 	if (usermsr & MSR_VEC)
-		save_altivec(current);
+		__giveup_altivec(current);
 
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 	if (usermsr & MSR_TM) {
Michael Ellerman Nov. 22, 2023, 5:01 a.m. UTC | #12
Timothy Pearson <tpearson@raptorengineering.com> writes:
>
...
>
> So a little more detail on this, just to put it to rest properly vs.
> assuming hand analysis caught every possible pathway. :)
>
> The debugging that generates this stack trace also verifies the following in __giveup_fpu():
>
> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to calling save_fpu()
> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after calling save_fpu()
> 3.) MSR_FP is set both in the task struct and in the live MSR.
>
> Only if all three conditions are met will it generate the trace.  This
> is a generalization of the hack I used to find the problem in the
> first place.
>
> If the state will subsequently be reloaded from the thread struct,
> that means we're reloading the registers from the thread struct that
> we just verified was corrupted by the earlier save_fpu() call.  There
> are only two ways I can see for that to be true -- one is if the
> registers were already clobbered when giveup_all() was entered, and
> the other is if save_fpu() went ahead and clobbered them right here
> inside giveup_all().
>
> To see which scenario we were dealing with, I added a bit more
> instrumentation to dump the current register state if MSR_FP bit was
> already set in registers (i.e. not dumping data from task struct, but
> using the live FPU registers instead), and sure enough the registers
> are corrupt on entry, so something else has already called save_fpu()
> before we even hit giveup_all() in this call chain.

Can you share the debug patch you're using?

cheers
Timothy Pearson Nov. 24, 2023, 12:01 a.m. UTC | #13
----- Original Message -----
> From: "Michael Ellerman" <mpe@ellerman.id.au>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>, "npiggin" <npiggin@gmail.com>,
> "christophe leroy" <christophe.leroy@csgroup.eu>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Tuesday, November 21, 2023 11:01:50 PM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save

> Timothy Pearson <tpearson@raptorengineering.com> writes:
>>
> ...
>>
>> So a little more detail on this, just to put it to rest properly vs.
>> assuming hand analysis caught every possible pathway. :)
>>
>> The debugging that generates this stack trace also verifies the following in
>> __giveup_fpu():
>>
>> 1.) tsk->thread.fp_state.fpr doesn't contain the FPSCR contents prior to calling
>> save_fpu()
>> 2.) tsk->thread.fp_state.fpr contains the FPSCR contents directly after calling
>> save_fpu()
>> 3.) MSR_FP is set both in the task struct and in the live MSR.
>>
>> Only if all three conditions are met will it generate the trace.  This
>> is a generalization of the hack I used to find the problem in the
>> first place.
>>
>> If the state will subsequently be reloaded from the thread struct,
>> that means we're reloading the registers from the thread struct that
>> we just verified was corrupted by the earlier save_fpu() call.  There
>> are only two ways I can see for that to be true -- one is if the
>> registers were already clobbered when giveup_all() was entered, and
>> the other is if save_fpu() went ahead and clobbered them right here
>> inside giveup_all().
>>
>> To see which scenario we were dealing with, I added a bit more
>> instrumentation to dump the current register state if MSR_FP bit was
>> already set in registers (i.e. not dumping data from task struct, but
>> using the live FPU registers instead), and sure enough the registers
>> are corrupt on entry, so something else has already called save_fpu()
>> before we even hit giveup_all() in this call chain.
> 
> Can you share the debug patch you're using?
> 
> cheers

Sure, here you go.  Note that with my FPU patch there is no WARN_ON hit, at least in my testing, so it isn't userspace purposefully loading the fr0/vs0 register with the FPSCR.

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 392404688cec..bde57dc3262a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -154,7 +154,49 @@ static void __giveup_fpu(struct task_struct *tsk)
 {
 	unsigned long msr;
 
+	// DEBUGGING
+	uint64_t prev_fpr0 = *(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+0);
+	uint64_t prev_fpr1 = *(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+1);
+	struct thread_fp_state debug_fp_state;
+	unsigned long currentmsr = mfmsr();
+
+	if (currentmsr & MSR_FP) {
+		store_fp_state(&debug_fp_state);
+		load_fp_state(&debug_fp_state);
+	}
+
 	save_fpu(tsk);
+
+	// DEBUGGING
+	if (tsk->thread.regs->msr & MSR_FP) {
+		if (((*(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+0) == 0x82004000) && (prev_fpr0 != 0x82004000))
+		 || ((*(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+1) == 0x82004000) && (prev_fpr1 != 0x82004000)))
+		{
+			WARN_ON(1);
+
+			printk("[TS %lld] In __giveup_fpu() for process [comm: '%s'  pid %d tid %d], before save current "
+			"fp0: 0x%016llx/%016llx fp1: 0x%016llx/%016llx fp8: 0x%016llx/%016llx fp9: 0x%016llx/%016llx"
+			" msr: 0x%016lx (FP %d VSX %d EE %d) on core %d\n",
+			ktime_get_boottime_ns(), current->comm, current->pid, current->tgid,
+			*(((uint64_t*)(&debug_fp_state.fpr[0]))+0), *(((uint64_t*)(&debug_fp_state.fpr[0]))+1),
+			*(((uint64_t*)(&debug_fp_state.fpr[1]))+0), *(((uint64_t*)(&debug_fp_state.fpr[1]))+1),
+			*(((uint64_t*)(&debug_fp_state.fpr[8]))+0), *(((uint64_t*)(&debug_fp_state.fpr[8]))+1),
+			*(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+0), *(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+1),
+			currentmsr, !!(currentmsr & MSR_FP), !!(currentmsr & MSR_VSX), !!(currentmsr & MSR_EE), raw_smp_processor_id());
+
+			printk("[TS %lld] In __giveup_fpu() for process [comm: '%s'  pid %d tid %d], after save saved "
+			"fp0: 0x%016llx/%016llx fp1: 0x%016llx/%016llx fp8: 0x%016llx/%016llx fp9: 0x%016llx/%016llx"
+			" msr: 0x%016lx (FP %d VSX %d EE %d) on core %d\n",
+			ktime_get_boottime_ns(), current->comm, current->pid, current->tgid,
+			*(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+0), *(((uint64_t*)(&tsk->thread.fp_state.fpr[0]))+1),
+			*(((uint64_t*)(&tsk->thread.fp_state.fpr[1]))+0), *(((uint64_t*)(&tsk->thread.fp_state.fpr[1]))+1),
+			*(((uint64_t*)(&tsk->thread.fp_state.fpr[8]))+0), *(((uint64_t*)(&tsk->thread.fp_state.fpr[8]))+1),
+			*(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+0), *(((uint64_t*)(&tsk->thread.fp_state.fpr[9]))+1),
+			tsk->thread.regs->msr, !!(tsk->thread.regs->msr & MSR_FP), !!(tsk->thread.regs->msr & MSR_VSX), !!(tsk->thread.regs->msr & MSR_EE), raw_smp_processor_id());
+		}
+	}
+
+
 	msr = tsk->thread.regs->msr;
 	msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
 	if (cpu_has_feature(CPU_FTR_VSX))
Timothy Pearson Nov. 27, 2023, 6:39 p.m. UTC | #14
Just wanted to check back and see if this patch was going to be queued up soon?  We're still having to work around / advertise the data destruction issues the underlying bug is causing on e.g. Debian Stable.

Thanks!
Christophe Leroy Nov. 27, 2023, 7:58 p.m. UTC | #15
Hi,

On 27/11/2023 at 19:39, Timothy Pearson wrote:
> Just wanted to check back and see if this patch was going to be queued up soon?  We're still having to work around / advertise the data destruction issues the underlying bug is causing on e.g. Debian Stable.
> 

Has any agreement been reached on the final solution? Seeing the many
discussions on patch v2 I had the feeling that it was not the final solution.

Christophe
Michael Ellerman Nov. 27, 2023, 10:53 p.m. UTC | #16
Timothy Pearson <tpearson@raptorengineering.com> writes:

> Just wanted to check back and see if this patch was going to be queued
> up soon?  We're still having to work around / advertise the data
> destruction issues the underlying bug is causing on e.g. Debian
> Stable.

Yeah I'll apply it this week, so it will be in rc4.

I wanted to be more certain the corruption only happens in practice with
io-uring before applying it.

cheers
Michael Ellerman Nov. 28, 2023, 12:59 a.m. UTC | #17
Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> On 27/11/2023 at 19:39, Timothy Pearson wrote:
>> Just wanted to check back and see if this patch was going to be
>> queued up soon?  We're still having to work around / advertise the
>> data destruction issues the underlying bug is causing on e.g. Debian
>> Stable.
>
> Has any agreement been reached on the final solution? Seeing the many 
> discussions on patch v2 I had the feeling that it was not the final solution.

The actual patch is fine I think.

The discussion was about improving the explanation of exactly what's
happening in the change log, and whether there is a larger bug causing
FP corruption unrelated to io-uring.

I'm now reasonably confident there's no detectable corruption of fr0
happening except via the io-uring -> clone path.

It's still a bad bug for us to corrupt fr0 across sys_clone(), but in
practice it doesn't affect userspace because fr0 is volatile across
function calls.
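
As an illustration (hypothetical user code, not from any real program):
per the ELFv2 ABI f0-f13 are volatile across calls, so the compiler
already assumes a fragment like the one below loses f0 at the call site.

	double t = a * b;	/* result may transiently live in f0 */
	pid = fork();		/* libc wrapper -> sys_clone(); f0-f13 are
				   treated as clobbered by the call anyway */
	consume(t);		/* t was moved to a non-volatile reg or the
				   stack before the call, so a kernel clobber
				   of f0 across the syscall is invisible */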

cheers
Nicholas Piggin Nov. 28, 2023, 1:40 a.m. UTC | #18
On Tue Nov 28, 2023 at 10:59 AM AEST, Michael Ellerman wrote:
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> > On 27/11/2023 at 19:39, Timothy Pearson wrote:
> >> Just wanted to check back and see if this patch was going to be
> >> queued up soon?  We're still having to work around / advertise the
> >> data destruction issues the underlying bug is causing on e.g. Debian
> >> Stable.
> >
> > Has any agreement been reached on the final solution? Seeing the many 
> > discussions on patch v2 I had the feeling that it was not the final solution.
>
> The actual patch is fine I think.
>
> The discussion was about improving the explanation of exactly what's
> happening in the change log, and whether there is a larger bug causing
> FP corruption unrelated to io-uring.

One thing I said is maybe we should remove the "register save" concept
entirely and always giveup. But that would not be suitable for a minimal
fix. I didn't mean it as an alternate fix.
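
To illustrate what that would mean -- purely a sketch, not a proposed
patch -- save_all() would drop the lazy save_fpu()/save_altivec() path
and always give the registers up, so they get reloaded from the thread
struct before the task runs again:

	/* Hypothetical "always giveup" variant of save_all() (SPE/VSX/TM
	 * handling omitted for brevity). */
	static void save_all(struct task_struct *tsk)
	{
		if (!tsk->thread.regs)
			return;
		if (tsk->thread.regs->msr & MSR_FP)
			__giveup_fpu(tsk);	/* saves regs and clears MSR_FP */
		if (tsk->thread.regs->msr & MSR_VEC)
			__giveup_altivec(tsk);	/* saves regs and clears MSR_VEC */
	}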

Even if we did always giveup in future, this patch still gives a nicer
API. There would have to be a noticeable performance advantage to not
restoring fr0/vs0 after saving before we'd think about changing it back
to clobber, since that encourages foot shooting.

Thanks,
Nick
Michael Ellerman Nov. 28, 2023, 12:57 p.m. UTC | #19
Michael Ellerman <mpe@ellerman.id.au> writes:
> Timothy Pearson <tpearson@raptorengineering.com> writes:
>
>> Just wanted to check back and see if this patch was going to be queued
>> up soon?  We're still having to work around / advertise the data
>> destruction issues the underlying bug is causing on e.g. Debian
>> Stable.
>
> Yeah I'll apply it this week, so it will be in rc4.

I reworked the change log to include the exact call path I identified
instead of the more high level description you had. And tweaked a few
other bits of wording and so on, apparently fr0 is a kernelism, the ABI
and binutils call it f0.

I'm not sure how wedded you were to your change log, so if you dislike
my edits let me know and we can come up with a joint one.

The actual patch is unchanged.

cheers


From 5e1d824f9a283cbf90f25241b66d1f69adb3835b Mon Sep 17 00:00:00 2001
From: Timothy Pearson <tpearson@raptorengineering.com>
Date: Sun, 19 Nov 2023 09:18:02 -0600
Subject: [PATCH] powerpc: Don't clobber f0/vs0 during fp|altivec register save

During floating point and vector save to thread data f0/vs0 are
clobbered by the FPSCR/VSCR store routine. This has been observed to
lead to userspace register corruption and application data corruption
with io-uring.

Fix it by restoring f0/vs0 after FPSCR/VSCR store has completed for
all the FP, altivec, VMX register save paths.

Tested under QEMU in kvm mode, running on a Talos II workstation with
dual POWER9 DD2.2 CPUs.

Additional detail (mpe):

Typically save_fpu() is called from __giveup_fpu() which saves the FP
regs and also *turns off FP* in the task's MSR, meaning the kernel will
reload the FP regs from the thread struct before letting the task use FP
again. So in that case save_fpu() is free to clobber f0 because the FP
regs no longer hold live values for the task.

There is another case though, which is the path via:
  sys_clone()
    ...
    copy_process()
      dup_task_struct()
        arch_dup_task_struct()
          flush_all_to_thread()
            save_all()

That path saves the FP regs but leaves them live. That's meant as an
optimisation for a process that's using FP/VSX and then calls fork(),
leaving the regs live means the parent process doesn't have to take a
fault after the fork to get its FP regs back. The optimisation was added
in commit 8792468da5e1 ("powerpc: Add the ability to save FPU without
giving it up").

That path does clobber f0, but f0 is volatile across function calls,
and typically programs reach copy_process() from userspace via a syscall
wrapper function. So in normal usage f0 being clobbered across a
syscall doesn't cause visible data corruption.

But there is now a new path, because io-uring can call copy_process()
via create_io_thread() from the signal handling path. That's OK if the
signal is handled as part of syscall return, but it's not OK if the
signal is handled due to some other interrupt.

That path is:

interrupt_return_srr_user()
  interrupt_exit_user_prepare()
    interrupt_exit_user_prepare_main()
      do_notify_resume()
        get_signal()
          task_work_run()
            create_worker_cb()
              create_io_worker()
                copy_process()
                  dup_task_struct()
                    arch_dup_task_struct()
                      flush_all_to_thread()
                        save_all()
                          if (tsk->thread.regs->msr & MSR_FP)
                            save_fpu()
                            # f0 is clobbered and potentially live in userspace

Note the above discussion applies equally to save_altivec().

Fixes: 8792468da5e1 ("powerpc: Add the ability to save FPU without giving it up")
Cc: stable@vger.kernel.org # v4.6+
Closes: https://lore.kernel.org/all/480932026.45576726.1699374859845.JavaMail.zimbra@raptorengineeringinc.com/
Closes: https://lore.kernel.org/linuxppc-dev/480221078.47953493.1700206777956.JavaMail.zimbra@raptorengineeringinc.com/
Tested-by: Timothy Pearson <tpearson@raptorengineering.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
[mpe: Reword change log to describe exact path of corruption & other minor tweaks]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/1921539696.48534988.1700407082933.JavaMail.zimbra@raptorengineeringinc.com
---
 arch/powerpc/kernel/fpu.S    | 13 +++++++++++++
 arch/powerpc/kernel/vector.S |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index 6a9acfb690c9..2f8f3f93cbb6 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -23,6 +23,15 @@
 #include <asm/feature-fixups.h>
 
 #ifdef CONFIG_VSX
+#define __REST_1FPVSR(n,c,base)						\
+BEGIN_FTR_SECTION							\
+	b	2f;							\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
+	REST_FPR(n,base);						\
+	b	3f;							\
+2:	REST_VSR(n,c,base);						\
+3:
+
 #define __REST_32FPVSRS(n,c,base)					\
 BEGIN_FTR_SECTION							\
 	b	2f;							\
@@ -41,9 +50,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
 2:	SAVE_32VSRS(n,c,base);						\
 3:
 #else
+#define __REST_1FPVSR(n,b,base)		REST_FPR(n, base)
 #define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
 #define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
 #endif
+#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
 #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
 #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)
 
@@ -67,6 +78,7 @@ _GLOBAL(store_fp_state)
 	SAVE_32FPVSRS(0, R4, R3)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r3)
+	REST_1FPVSR(0, R4, R3)
 	blr
 EXPORT_SYMBOL(store_fp_state)
 
@@ -138,4 +150,5 @@ _GLOBAL(save_fpu)
 2:	SAVE_32FPVSRS(0, R4, R6)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r6)
+	REST_1FPVSR(0, R4, R6)
 	blr
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index 4094e4c4c77a..80b3f6e476b6 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -33,6 +33,7 @@ _GLOBAL(store_vr_state)
 	mfvscr	v0
 	li	r4, VRSTATE_VSCR
 	stvx	v0, r4, r3
+	lvx	v0, 0, r3
 	blr
 EXPORT_SYMBOL(store_vr_state)
 
@@ -109,6 +110,7 @@ _GLOBAL(save_altivec)
 	mfvscr	v0
 	li	r4,VRSTATE_VSCR
 	stvx	v0,r4,r7
+	lvx	v0,0,r7
 	blr
 
 #ifdef CONFIG_VSX
Timothy Pearson Nov. 30, 2023, 4:29 p.m. UTC | #20
----- Original Message -----
> From: "Michael Ellerman" <mpe@ellerman.id.au>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "Jens Axboe" <axboe@kernel.dk>, "regressions" <regressions@lists.linux.dev>, "npiggin" <npiggin@gmail.com>,
> "christophe leroy" <christophe.leroy@csgroup.eu>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Tuesday, November 28, 2023 6:57:01 AM
> Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save

> Michael Ellerman <mpe@ellerman.id.au> writes:
>> Timothy Pearson <tpearson@raptorengineering.com> writes:
>>
>>> Just wanted to check back and see if this patch was going to be queued
>>> up soon?  We're still having to work around / advertise the data
>>> destruction issues the underlying bug is causing on e.g. Debian
>>> Stable.
>>
>> Yeah I'll apply it this week, so it will be in rc4.
> 
> I reworked the change log to include the exact call path I identified
> instead of the more high level description you had. And tweaked a few
> other bits of wording and so on, apparently fr0 is a kernelism, the ABI
> and binutils calls it f0.
> 
> I'm not sure how wedded you were to your change log, so if you dislike
> my edits let me know and we can come up with a joint one.
> 
> The actual patch is unchanged.
> 
> cheers

The commit message looks OK to me.  I've also seen application crashes as a result of the register corruption, but that may be a minor detail not worth updating things over at this point -- those come from e.g. glibc using vs0 as part of a path that processes pointer information, typically where the same pointer needs to be replicated into adjacent fields of a data struct.
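
For a feel of the kind of code path I mean, a purely hypothetical sketch
in the style of such code (not actual glibc source; assumes a VSX-capable
compiler and <altivec.h>):

	#include <altivec.h>

	struct list { void *next; void *prev; };	/* hypothetical */

	static void link_to_self(struct list *l)
	{
		/* Splat the pointer into both vector lanes and store it over
		 * next/prev in one 16-byte store; the splatted value can
		 * easily end up in vs0/v0, the register the buggy save path
		 * clobbered. */
		vector unsigned long long v =
			vec_splats((unsigned long long)l);
		vec_xst(v, 0, (unsigned long long *)&l->next);
	}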
Michael Ellerman Dec. 2, 2023, 11 p.m. UTC | #21
On Sun, 19 Nov 2023 09:18:02 -0600, Timothy Pearson wrote:
> During floating point and vector save to thread data fr0/vs0 are clobbered
> by the FPSCR/VSCR store routine.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc: Don't clobber fr0/vs0 during fp|altivec register save
      https://git.kernel.org/powerpc/c/5e1d824f9a283cbf90f25241b66d1f69adb3835b

cheers
diff mbox series

Patch

diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index 6a9acfb690c9..2f8f3f93cbb6 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -23,6 +23,15 @@ 
 #include <asm/feature-fixups.h>
 
 #ifdef CONFIG_VSX
+#define __REST_1FPVSR(n,c,base)						\
+BEGIN_FTR_SECTION							\
+	b	2f;							\
+END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
+	REST_FPR(n,base);						\
+	b	3f;							\
+2:	REST_VSR(n,c,base);						\
+3:
+
 #define __REST_32FPVSRS(n,c,base)					\
 BEGIN_FTR_SECTION							\
 	b	2f;							\
@@ -41,9 +50,11 @@  END_FTR_SECTION_IFSET(CPU_FTR_VSX);					\
 2:	SAVE_32VSRS(n,c,base);						\
 3:
 #else
+#define __REST_1FPVSR(n,b,base)		REST_FPR(n, base)
 #define __REST_32FPVSRS(n,b,base)	REST_32FPRS(n, base)
 #define __SAVE_32FPVSRS(n,b,base)	SAVE_32FPRS(n, base)
 #endif
+#define REST_1FPVSR(n,c,base)   __REST_1FPVSR(n,__REG_##c,__REG_##base)
 #define REST_32FPVSRS(n,c,base) __REST_32FPVSRS(n,__REG_##c,__REG_##base)
 #define SAVE_32FPVSRS(n,c,base) __SAVE_32FPVSRS(n,__REG_##c,__REG_##base)
 
@@ -67,6 +78,7 @@  _GLOBAL(store_fp_state)
 	SAVE_32FPVSRS(0, R4, R3)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r3)
+	REST_1FPVSR(0, R4, R3)
 	blr
 EXPORT_SYMBOL(store_fp_state)
 
@@ -138,4 +150,5 @@  _GLOBAL(save_fpu)
 2:	SAVE_32FPVSRS(0, R4, R6)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r6)
+	REST_1FPVSR(0, R4, R6)
 	blr
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index 4094e4c4c77a..80b3f6e476b6 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -33,6 +33,7 @@  _GLOBAL(store_vr_state)
 	mfvscr	v0
 	li	r4, VRSTATE_VSCR
 	stvx	v0, r4, r3
+	lvx	v0, 0, r3
 	blr
 EXPORT_SYMBOL(store_vr_state)
 
@@ -109,6 +110,7 @@  _GLOBAL(save_altivec)
 	mfvscr	v0
 	li	r4,VRSTATE_VSCR
 	stvx	v0,r4,r7
+	lvx	v0,0,r7
 	blr
 
 #ifdef CONFIG_VSX