mbox series

[SRU,Bionic,0/2] Fix ThunderX boot issues

Message ID 20200115173638.31514-1-juergh@canonical.com
Headers show
Series Fix ThunderX boot issues | expand

Message

Juerg Haefliger Jan. 15, 2020, 5:36 p.m. UTC
Two trivial KPTI patches to fix a nasty ThunderX boot issues.

The 'arm64: add sentinel to kpti_safe_list' patch might not be a
contributor here but it certainly fixes a real issue since on ThunderX the
list is traversed all the way to the end and not stopped due to the
missing sentinel.

The real problem is fixed by 'arm64: Check for errata before evaluating
cpu features'. The kernel has a check in place for a certain Cavium errata.
If that errata is enabled, KPTI is turned off (the code comment mentions
I-cache clobbering if both are enabled at the same time). Some earlier KPTI
commit changed the order in which features and erratas are evaluated and
enabled resulting in a short period of time where both KPTI and that errata
are enabled. The fix reverses that order.

Dirk Mueller (1):
  arm64: Check for errata before evaluating cpu features

Mark Rutland (1):
  arm64: add sentinel to kpti_safe_list

 arch/arm64/kernel/cpufeature.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Andrea Righi Jan. 15, 2020, 5:42 p.m. UTC | #1
On Wed, Jan 15, 2020 at 06:36:36PM +0100, Juerg Haefliger wrote:
> Two trivial KPTI patches to fix a nasty ThunderX boot issues.
> 
> The 'arm64: add sentinel to kpti_safe_list' patch might not be a
> contributor here but it certainly fixes a real issue since on ThunderX the
> list is traversed all the way to the end and not stopped due to the
> missing sentinel.
> 
> The real problem is fixed by 'arm64: Check for errata before evaluating
> cpu features'. The kernel has a check in place for a certain Cavium errata.
> If that errata is enabled, KPTI is turned off (the code comment mentions
> I-cache clobbering if both are enabled at the same time). Some earlier KPTI
> commit changed the order in which features and erratas are evaluated and
> enabled resulting in a short period of time where both KPTI and that errata
> are enabled. The fix reverses that order.
> 
> Dirk Mueller (1):
>   arm64: Check for errata before evaluating cpu features
> 
> Mark Rutland (1):
>   arm64: add sentinel to kpti_safe_list
> 
>  arch/arm64/kernel/cpufeature.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)

The two patches are trivial and they look correct to me.

Acked-by: Andrea Righi <andrea.righi@canonical.com>
Khalid Elmously Jan. 15, 2020, 5:57 p.m. UTC | #2
On 2020-01-15 18:36:36 , Juerg Haefliger wrote:
> Two trivial KPTI patches to fix a nasty ThunderX boot issues.
> 
> The 'arm64: add sentinel to kpti_safe_list' patch might not be a
> contributor here but it certainly fixes a real issue since on ThunderX the
> list is traversed all the way to the end and not stopped due to the
> missing sentinel.
> 
> The real problem is fixed by 'arm64: Check for errata before evaluating
> cpu features'. The kernel has a check in place for a certain Cavium errata.
> If that errata is enabled, KPTI is turned off (the code comment mentions
> I-cache clobbering if both are enabled at the same time). Some earlier KPTI
> commit changed the order in which features and erratas are evaluated and
> enabled resulting in a short period of time where both KPTI and that errata
> are enabled. The fix reverses that order.
> 
> Dirk Mueller (1):
>   arm64: Check for errata before evaluating cpu features
> 
> Mark Rutland (1):
>   arm64: add sentinel to kpti_safe_list
> 
>  arch/arm64/kernel/cpufeature.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> -- 
> 2.20.1
> 
> 

Acked-by: Khalid Elmously <khalid.elmously@canonical.com>
Tyler Hicks Jan. 15, 2020, 5:58 p.m. UTC | #3
On 2020-01-15 18:36:36, Juerg Haefliger wrote:
> Two trivial KPTI patches to fix a nasty ThunderX boot issues.
> 
> The 'arm64: add sentinel to kpti_safe_list' patch might not be a
> contributor here but it certainly fixes a real issue since on ThunderX the
> list is traversed all the way to the end and not stopped due to the
> missing sentinel.
> 
> The real problem is fixed by 'arm64: Check for errata before evaluating
> cpu features'. The kernel has a check in place for a certain Cavium errata.
> If that errata is enabled, KPTI is turned off (the code comment mentions
> I-cache clobbering if both are enabled at the same time). Some earlier KPTI
> commit changed the order in which features and erratas are evaluated and
> enabled resulting in a short period of time where both KPTI and that errata
> are enabled. The fix reverses that order.

Both patches look correct.

 Acked-by: Tyler Hicks <tyhicks@canonical.com>

Tyler

> 
> Dirk Mueller (1):
>   arm64: Check for errata before evaluating cpu features
> 
> Mark Rutland (1):
>   arm64: add sentinel to kpti_safe_list
> 
>  arch/arm64/kernel/cpufeature.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> -- 
> 2.20.1
> 
> 
> -- 
> kernel-team mailing list
> kernel-team@lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
dann frazier Jan. 15, 2020, 8:40 p.m. UTC | #4
On Wed, Jan 15, 2020 at 10:37 AM Juerg Haefliger
<juerg.haefliger@canonical.com> wrote:
>
> Two trivial KPTI patches to fix a nasty ThunderX boot issues.
>
> The 'arm64: add sentinel to kpti_safe_list' patch might not be a
> contributor here but it certainly fixes a real issue since on ThunderX the
> list is traversed all the way to the end and not stopped due to the
> missing sentinel.
>
> The real problem is fixed by 'arm64: Check for errata before evaluating
> cpu features'. The kernel has a check in place for a certain Cavium errata.
> If that errata is enabled, KPTI is turned off (the code comment mentions
> I-cache clobbering if both are enabled at the same time). Some earlier KPTI
> commit changed the order in which features and erratas are evaluated and
> enabled resulting in a short period of time where both KPTI and that errata
> are enabled. The fix reverses that order.
>
> Dirk Mueller (1):
>   arm64: Check for errata before evaluating cpu features
>
> Mark Rutland (1):
>   arm64: add sentinel to kpti_safe_list

Both fixes look correct to me. Indeed, I had noticed that the "kernel
page table isolation" message was printed very early in unaffected
kernels logs:

[    0.000000] CPU features: kernel page table isolation forced OFF by
ARM64_WORKAROUND_CAVIUM_27456

While affected kernels weren't printing it until much later, just
before enabling non-boot CPUs:

[    0.035287] smp: Bringing up secondary CPUs ...
[    0.035637] CPU features: kernel page table isolation forced OFF by
ARM64_WORKAROUND_CAVIUM_27456

With these fixes, these messages are now again emitted at [    0.000000].

When I saw crashes before, they were always pretty early - either
during boot, or immediately after when I started a kernel build. I ran
tests overnight on 3 machines (build kernel & reboot loop) with
4.15.0-74.84 +, these patches and some other candidates. 1 machine
survived the night, another survived for several hours until it was
released by automation, and a 3rd failed - but due to an SEA which
might be an actual hardware failure.

I've restarted the same testing on 4 machines today using latest
bionic/master-next + just these 2 patches.  They've all survived at
least 2 iterations so far. I'll leave it running but, after surviving
>16 passes, I believe either the bug is squashed, or Juerg has scared
it back into hiding :)

Acked-by: dann frazier <dann.frazier@canonical.com>
Juerg Haefliger Jan. 16, 2020, 8:08 a.m. UTC | #5
On Wed, 15 Jan 2020 18:36:36 +0100
Juerg Haefliger <juerg.haefliger@canonical.com> wrote:

> Two trivial KPTI patches to fix a nasty ThunderX boot issues.
> 
> The 'arm64: add sentinel to kpti_safe_list' patch might not be a
> contributor here but it certainly fixes a real issue since on ThunderX the
> list is traversed all the way to the end and not stopped due to the
> missing sentinel.
> 
> The real problem is fixed by 'arm64: Check for errata before evaluating
> cpu features'. The kernel has a check in place for a certain Cavium errata.
> If that errata is enabled, KPTI is turned off (the code comment mentions
> I-cache clobbering if both are enabled at the same time). Some earlier KPTI
> commit changed the order in which features and erratas are evaluated and
> enabled resulting in a short period of time where both KPTI and that errata
> are enabled. The fix reverses that order.

Correction, KPTI and the erratum are enabled persistently and not just for a
brief period of time.

...Juerg

> Dirk Mueller (1):
>   arm64: Check for errata before evaluating cpu features
> 
> Mark Rutland (1):
>   arm64: add sentinel to kpti_safe_list
> 
>  arch/arm64/kernel/cpufeature.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
Marcelo Henrique Cerri Jan. 28, 2020, 10:45 p.m. UTC | #6
That one was already applied last cycle. Sorry for the delay.