mbox series

[v3,0/3] Off-load TLB invalidations to host for !GTSE

Message ID 20200703053608.12884-1-bharata@linux.ibm.com (mailing list archive)
Headers show
Series Off-load TLB invalidations to host for !GTSE | expand

Message

Bharata B Rao July 3, 2020, 5:36 a.m. UTC
Hypervisor may choose not to enable Guest Translation Shootdown Enable
(GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
permitted to use instructions like tblie and tlbsync directly, but is
expected to make hypervisor calls to get the TLB flushed.

This series enables the TLB flush routines in the radix code to
off-load TLB flushing to hypervisor via the newly proposed hcall
H_RPT_INVALIDATE. 

To easily check the availability of GTSE, it is made an MMU feature.
The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
handle GTSE as an optionally available feature and to not assume GTSE
when radix support is available.

The actual hcall implementation for KVM isn't included in this
patchset and will be posted separately.

Changes in v3
=============
- Fixed a bug in the hcall wrapper code where we were missing setting
  H_RPTI_TYPE_NESTED while retrying the failed flush request with
  a full flush for the nested case.
- s/psize_to_h_rpti/psize_to_rpti_pgsize

v2: https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bharata@linux.ibm.com/T/#t

Bharata B Rao (2):
  powerpc/mm: Enable radix GTSE only if supported.
  powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
    enabled

Nicholas Piggin (1):
  powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
    !GTSE

 .../include/asm/book3s/64/tlbflush-radix.h    | 15 ++++
 arch/powerpc/include/asm/hvcall.h             | 34 +++++++-
 arch/powerpc/include/asm/mmu.h                |  4 +
 arch/powerpc/include/asm/plpar_wrappers.h     | 52 ++++++++++++
 arch/powerpc/kernel/dt_cpu_ftrs.c             |  1 +
 arch/powerpc/kernel/prom_init.c               | 13 +--
 arch/powerpc/mm/book3s64/radix_tlb.c          | 82 +++++++++++++++++--
 arch/powerpc/mm/init_64.c                     |  5 +-
 arch/powerpc/platforms/pseries/lpar.c         |  8 +-
 9 files changed, 197 insertions(+), 17 deletions(-)

Comments

Michael Ellerman July 16, 2020, 12:55 p.m. UTC | #1
On Fri, 3 Jul 2020 11:06:05 +0530, Bharata B Rao wrote:
> Hypervisor may choose not to enable Guest Translation Shootdown Enable
> (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
> permitted to use instructions like tblie and tlbsync directly, but is
> expected to make hypervisor calls to get the TLB flushed.
> 
> This series enables the TLB flush routines in the radix code to
> off-load TLB flushing to hypervisor via the newly proposed hcall
> H_RPT_INVALIDATE.
> 
> [...]

Applied to powerpc/next.

[1/3] powerpc/mm: Enable radix GTSE only if supported.
      https://git.kernel.org/powerpc/c/029ab30b4c0a7ec587eece1ec07c3981fdff2bed
[2/3] powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if enabled
      https://git.kernel.org/powerpc/c/b6c84175078ff022b343b7b0737aeb33001ca90c
[3/3] powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when !GTSE
      https://git.kernel.org/powerpc/c/dd3d9aa5589c52efaec12ffeb84f0f5f8208fbc3

cheers
Qian Cai July 16, 2020, 5:27 p.m. UTC | #2
On Fri, Jul 03, 2020 at 11:06:05AM +0530, Bharata B Rao wrote:
> Hypervisor may choose not to enable Guest Translation Shootdown Enable
> (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
> permitted to use instructions like tblie and tlbsync directly, but is
> expected to make hypervisor calls to get the TLB flushed.
> 
> This series enables the TLB flush routines in the radix code to
> off-load TLB flushing to hypervisor via the newly proposed hcall
> H_RPT_INVALIDATE. 
> 
> To easily check the availability of GTSE, it is made an MMU feature.
> The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
> handle GTSE as an optionally available feature and to not assume GTSE
> when radix support is available.
> 
> The actual hcall implementation for KVM isn't included in this
> patchset and will be posted separately.
> 
> Changes in v3
> =============
> - Fixed a bug in the hcall wrapper code where we were missing setting
>   H_RPTI_TYPE_NESTED while retrying the failed flush request with
>   a full flush for the nested case.
> - s/psize_to_h_rpti/psize_to_rpti_pgsize
> 
> v2: https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bharata@linux.ibm.com/T/#t
> 
> Bharata B Rao (2):
>   powerpc/mm: Enable radix GTSE only if supported.
>   powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
>     enabled
> 
> Nicholas Piggin (1):
>   powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
>     !GTSE

Reverting the whole series fixed random memory corruptions during boot on
POWER9 PowerNV systems below.

IBM 8335-GTH (ibm,witherspoon)
POWER9, altivec supported
262144 MB memory, 2000 GB disk space

.config:
https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config

[    9.338996][  T925] BUG: Unable to handle kernel instruction fetch (NULL pointer?)
[    9.339026][  T925] Faulting instruction address: 0x00000000
[    9.339051][  T925] Oops: Kernel access of bad area, sig: 11 [#1]
[    9.339064][  T925] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 NUMA PowerNV
[    9.339098][  T925] Modules linked in: dm_mirror dm_region_hash dm_log dm_mod
[    9.339150][  T925] CPU: 92 PID: 925 Comm: (md-udevd) Not tainted 5.8.0-rc5-next-20200716 #3
[    9.339186][  T925] NIP:  0000000000000000 LR: c00000000021f2cc CTR: 0000000000000000
[    9.339210][  T925] REGS: c000201cb52d79b0 TRAP: 0400   Not tainted  (5.8.0-rc5-next-20200716)
[    9.339244][  T925] MSR:  9000000040009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24222292  XER: 00000000
[    9.339278][  T925] CFAR: c00000000021f2c8 IRQMASK: 0 
[    9.339278][  T925] GPR00: c00000000021f2cc c000201cb52d7c40 c000000005901000 c000201cb52d7ca8 
[    9.339278][  T925] GPR04: c00800000ea60038 0000000000000000 000000007fff0000 000000007fff0000 
[    9.339278][  T925] GPR08: 0000000000000000 0000000000000000 c000201cb50bd500 0000000000000003 
[    9.339278][  T925] GPR12: 0000000000000000 c000201fff694500 00007fffa4a8a940 00007fffa4a8a6c8 
[    9.339278][  T925] GPR16: 00007fffa4a8a8f8 00007fffa4a8a650 00007fffa4a8a488 0000000000000000 
[    9.339278][  T925] GPR20: 0000000000050001 00007fffa4a8a984 000000007fff0000 c00000000a4545cc 
[    9.339278][  T925] GPR24: c000000000affe28 0000000000000000 0000000000000000 0000000000000166 
[    9.339278][  T925] GPR28: c000201cb52d7ca8 c00800000ea60000 c000201cc3b72600 000000007fff0000 
[    9.339493][  T925] NIP [0000000000000000] 0x0
[    9.339516][  T925] LR [c00000000021f2cc] __seccomp_filter+0xec/0x530
bpf_dispatcher_nop_func at include/linux/bpf.h:567
(inlined by) bpf_prog_run_pin_on_cpu at include/linux/filter.h:597
(inlined by) seccomp_run_filters at kernel/seccomp.c:324
(inlined by) __seccomp_filter at kernel/seccomp.c:937
[    9.339538][  T925] Call Trace:
[    9.339548][  T925] [c000201cb52d7c40] [c00000000021f2cc] __seccomp_filter+0xec/0x530 (unreliable)
[    9.339566][  T925] [c000201cb52d7d50] [c000000000025af8] do_syscall_trace_enter+0xb8/0x470
do_seccomp at arch/powerpc/kernel/ptrace/ptrace.c:252
(inlined by) do_syscall_trace_enter at arch/powerpc/kernel/ptrace/ptrace.c:327
[    9.339600][  T925] [c000201cb52d7dc0] [c00000000002c8f8] system_call_exception+0x138/0x180
[    9.339625][  T925] [c000201cb52d7e20] [c00000000000c9e8] system_call_common+0xe8/0x214
[    9.339648][  T925] Instruction dump:
[    9.339667][  T925] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
[    9.339706][  T925] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
[    9.339748][  T925] ---[ end trace d89eb80f9a6bc141 ]---
[  OK  ] Started Journal Service.
[   10.452364][  T925] Kernel panic - not syncing: Fatal exception
[   11.876655][  T925] ---[ end Kernel panic - not syncing: Fatal exception ]---

There could also be lots of random userspace segfault like,

[   16.463545][  T771] rngd[771]: segfault (11) at 0 nip 0 lr 0 code 1 in rngd[106d60000+20000]
[   16.463620][  T771] rngd[771]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 
[   16.463656][  T771] rngd[771]: code: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX

Occasionally, there are many soft-lockups,

[  396.920702][   C99] watchdog: BUG: soft lockup - CPU#99 stuck for 22s! [(spawn):2692]
[  396.920754][   C99] Modules linked in: kvm_hv kvm ip_tables x_tables sd_mod bnx2x tg3 ahci mdio libahci libphy firmware_class libata dm_mirror dm_region_hash dm_log dm_mod
[  396.920843][   C99] irq event stamp: 1731717220
[  396.920860][   C99] hardirqs last  enabled at (1731717219): [<c00000000004d6f4>] do_page_fault+0x324/0xd90
[  396.920889][   C99] hardirqs last disabled at (1731717220): [<c000000000015638>] arch_local_irq_restore+0x48/0xd0
[  396.920919][   C99] softirqs last  enabled at (41260): [<c0000000009abbe8>] __do_softirq+0x648/0x8c8
[  396.920948][   C99] softirqs last disabled at (41125): [<c0000000000d717c>] irq_exit+0x15c/0x1c0
[  396.920976][   C99] CPU: 99 PID: 2692 Comm: (spawn) Tainted: G             L    5.8.0-rc5-next-20200716 #3
[  396.921001][   C99] NIP:  c0000000000152b4 LR: c000000000015640 CTR: 0000000000000000
[  396.921037][   C99] REGS: c000201cbc3d7178 TRAP: 0900   Tainted: G             L     (5.8.0-rc5-next-20200716)
[  396.921074][   C99] MSR:  9000000000001033 <SF,HV,ME,IR,DR,RI,LE>  CR: 28022482  XER: 20040000
[  396.921122][   C99] CFAR: 0000000000000000 IRQMASK: 3 
[  396.921122][   C99] GPR00: c000000000015640 c000201cbc3d7340 c000000005901000 c000201cbc3d7178 
[  396.921122][   C99] GPR04: c0000000057d7280 0000000000000000 000000000002000a 0000000000000003 
[  396.921122][   C99] GPR08: 0000201cc61c0000 0000000000000000 0000000000000001 c00000000593f868 
[  396.921122][   C99] GPR12: 0000000000002000 c000201fff67e700 00007fffdcda3918 0000000139eeba60 
[  396.921122][   C99] GPR16: 0000000139f30130 00007fffdcda39c8 0000000139eea708 0000000000000000 
[  396.921122][   C99] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000008 
[  396.921122][   C99] GPR24: 0000000000000e60 0000000000000900 0000000000000500 0000000000000a00 
[  396.921122][   C99] GPR28: 0000000000000f00 0000000000000002 0000000000000003 c000201ca4212400 
[  396.921468][   C99] NIP [c0000000000152b4] replay_soft_interrupts+0x74/0x3b0
replay_soft_interrupts at arch/powerpc/kernel/irq.c:216
[  396.921504][   C99] LR [c000000000015640] arch_local_irq_restore+0x50/0xd0
arch_local_irq_restore at arch/powerpc/kernel/irq.c:375
[  396.921539][   C99] Call Trace:
[  396.921560][   C99] [c000201cbc3d7340] [c000000000015640] arch_local_irq_restore+0x50/0xd0 (unreliable)
[  396.921602][   C99] [c000201cbc3d7360] [c0000000009a0c68] lock_is_held_type+0xf8/0x180
[  396.921641][   C99] [c000201cbc3d73c0] [c0000000003e8cf0] mem_cgroup_from_task+0xa0/0x130
[  396.921666][   C99] [c000201cbc3d7400] [c000000000337950] handle_mm_fault+0x140/0x1d20
[  396.921703][   C99] [c000201cbc3d7500] [c00000000004d5ac] do_page_fault+0x1dc/0xd90
[  396.921763][   C99] [c000201cbc3d7600] [c00000000000c028] handle_page_fault+0x10/0x2c
[  396.921804][   C99] --- interrupt: 300 at futex_cleanup+0x3c0/0x740
[  396.921804][   C99]     LR = futex_cleanup+0x35c/0x740
[  396.921879][   C99] [c000201cbc3d79c0] [c0000000001df2e8] futex_exec_release+0x28/0x50
[  396.921929][   C99] [c000201cbc3d79f0] [c0000000000c5e54] exec_mm_release+0x24/0x50
[  396.921968][   C99] [c000201cbc3d7a30] [c000000000421e84] begin_new_exec+0x324/0xea0
[  396.922005][   C99] [c000201cbc3d7af0] [c0000000004d8f1c] load_elf_binary+0x7fc/0x1110
[  396.922042][   C99] [c000201cbc3d7bf0] [c000000000420824] exec_binprm+0x1c4/0x7d0
[  396.922079][   C99] [c000201cbc3d7cb0] [c000000000421540] do_execveat_common+0x710/0x960
[  396.922117][   C99] [c000201cbc3d7d90] [c000000000422a44] sys_execve+0x44/0x60
[  396.922156][   C99] [c000201cbc3d7dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180
[  396.922205][   C99] [c000201cbc3d7e20] [c00000000000c9e8] system_call_common+0xe8/0x214
[  396.922253][   C99] Instruction dump:
[  396.922286][   C99] 3b000e60 3b400500 3b600a00 3b800f00 f8010010 f821fe11 38610028 e92d0c70 
[  396.922316][   C99] f9210198 39200000 8aed0989 48037df9 <60000000> 39200003 f9210160 56e90738


[  248.821138][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.821170][  T676] khugepaged      D28416   682      2 0x00000800 
[  248.821212][  T676] Call Trace: 
[  248.821241][  T676] [c000001ff310f4e0] [c000000000caeb80] lru_add_drain_work+0x0/0x48 (unreliable) 
[  248.821275][  T676] [c000001ff310f6c0] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.821308][  T676] [c000001ff310f720] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.821352][  T676] [c000001ff310f7f0] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.821387][  T676] [c000001ff310f820] [c0000000009a9814] schedule_timeout+0x304/0x520 
[  248.821432][  T676] [c000001ff310f960] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0 
[  248.821460][  T676] [c000001ff310f9d0] [c0000000000fd0c8] __flush_work+0x3b8/0x770 
[  248.821491][  T676] [c000001ff310faf0] [c0000000002e0ac4] lru_add_drain_all+0x3e4/0x760 
[  248.821521][  T676] [c000001ff310fbf0] [c0000000003e0f18] khugepaged+0xd8/0x1770 
[  248.821560][  T676] [c000001ff310fdb0] [c0000000001095fc] kthread+0x1bc/0x1d0 
[  248.821611][  T676] [c000001ff310fe20] [c00000000000cbc4] ret_from_kernel_thread+0x5c/0x78 
[  248.821655][  T676] INFO: task kworker/56:1:719 blocked for more than 122 seconds. 
[  248.821689][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.821729][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.821779][  T676] kworker/56:1    D27584   719      2 0x00000800 
[  248.821839][  T676] Workqueue: rcu_gp wait_rcu_exp_gp 
[  248.821862][  T676] Call Trace: 
[  248.821888][  T676] [c000001ff2b57660] [c0000000057c9fb0] rcu_state+0x4fb0/0x5100 (unreliable) 
[  248.821934][  T676] [c000001ff2b57840] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.821977][  T676] [c000001ff2b578a0] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.822021][  T676] [c000001ff2b57970] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.822066][  T676] [c000001ff2b579a0] [c0000000009a970c] schedule_timeout+0x1fc/0x520 
[  248.822110][  T676] [c000001ff2b57ae0] [c0000000001a86d0] rcu_exp_wait_wake+0x1b0/0x950 
[  248.822153][  T676] [c000001ff2b57c30] [c0000000000fb754] process_one_work+0x304/0x900 
[  248.822197][  T676] [c000001ff2b57d20] [c0000000000fbdc8] worker_thread+0x78/0x520 
[  248.822242][  T676] [c000001ff2b57db0] [c0000000001095fc] kthread+0x1bc/0x1d0 
[  248.822279][  T676] [c000001ff2b57e20] [c00000000000cbc4] ret_from_kernel_thread+0x5c/0x78 
[  248.822385][  T676] INFO: task lvm:3123 blocked for more than 122 seconds. 
[  248.822413][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.822462][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.822503][  T676] lvm             D26608  3123      1 0x00040000 
[  248.822552][  T676] Call Trace: 
[  248.822576][  T676] [c000001feee27a00] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.822620][  T676] [c000001feee27a60] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.822648][  T676] [c000001feee27b30] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.822680][  T676] [c000001feee27b60] [c0000000009a9814] schedule_timeout+0x304/0x520 
[  248.822724][  T676] [c000001feee27ca0] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0 
[  248.822768][  T676] [c000001feee27d10] [c0000000004b0e88] sys_io_destroy+0x238/0x2f0 
[  248.822808][  T676] [c000001feee27dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180 
[  248.822840][  T676] [c000001feee27e20] [c00000000000c9e8] system_call_common+0xe8/0x214 
[  248.822873][  T676] INFO: task lvm:3126 blocked for more than 122 seconds. 
[  248.822901][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.822938][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.822987][  T676] lvm             D26608  3126      1 0x00040000 
[  248.823017][  T676] Call Trace: 
[  248.823031][  T676] [c000001fc0b27a00] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.823075][  T676] [c000001fc0b27a60] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.823113][  T676] [c000001fc0b27b30] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.823158][  T676] [c000001fc0b27b60] [c0000000009a9814] schedule_timeout+0x304/0x520 
[  248.823199][  T676] [c000001fc0b27ca0] [c0000000009a3c84] wait_for_completion+0xc4/0x1b0 
[  248.823250][  T676] [c000001fc0b27d10] [c0000000004b0e88] sys_io_destroy+0x238/0x2f0 
[  248.823294][  T676] [c000001fc0b27dc0] [c00000000002c8b8] system_call_exception+0xf8/0x180 
[  248.823332][  T676] [c000001fc0b27e20] [c00000000000c9e8] system_call_common+0xe8/0x214 
[  248.823374][  T676] INFO: task auditd:3163 blocked for more than 122 seconds. 
[  248.823424][  T676]       Tainted: G             L    5.8.0-rc5-next-20200716 #3 
[  248.823471][  T676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[  248.823512][  T676] auditd          D27088  3163      1 0x00042408 
[  248.823551][  T676] Call Trace: 
[  248.823583][  T676] [c000001fc080f760] [0000000200000001] 0x200000001 (unreliable) 
[  248.823648][  T676] [c000001fc080f940] [c00000000001a2d0] __switch_to+0x260/0x380 
[  248.823689][  T676] [c000001fc080f9a0] [c0000000009a18b8] __schedule+0x398/0x9f0 
[  248.823742][  T676] [c000001fc080fa70] [c0000000009a1fa8] schedule+0x98/0x160 
[  248.823784][  T676] [c000001fc080faa0] [c0000000001a9244] synchronize_rcu_expedited+0x394/0x600 
[  248.823837][  T676] [c000001fc080fba0] [c0000000004504c4] namespace_unlock+0xf4/0x230 
[  248.823881][  T676] [c000001fc080fc00] [c000000000456dec] put_mnt_ns+0x5c/0x80 
[  248.823926][  T676] [c000001fc080fc30] [c00000000010ba6c] free_nsproxy+0x2c/0x1e0 
[  248.823966][  T676] [c000001fc080fc60] [c0000000000d5130] do_exit+0x4e0/0xee0 
[  248.823997][  T676] [c000001fc080fd60] [c0000000000d5bec] do_group_exit+0x5c/0xd0 
[  248.824019][  T676] [c000001fc080fda0] [c0000000000d5c7c] sys_exit_group+0x1c/0x20 
[  248.824060][  T676] [c000001fc080fdc0] [c00000000002c8b8] system_call_exception+0xf8/0x180 
[  248.824103][  T676] [c000001fc080fe20] [c00000000000c9e8] system_call_common+0xe8/0x214 
[  248.824192][  T676]  
[  248.824192][  T676] Showing all locks held in the system: 
[  248.824419][  T676] 1 lock held by khungtaskd/676: 
[  248.824455][  T676]  #0: c0000000057c44c0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.29+0x8/0x30 
[  248.824531][  T676] 1 lock held by khugepaged/682: 
[  248.824565][  T676]  #0: c0000000057f42c8 (lock#4){+.+.}-{3:3}, at: lru_add_drain_all+0x68/0x760 
[  248.824679][  T676] 2 locks held by kworker/56:1/719: 
[  248.824742][  T676]  #0: c00000000bcc8938 ((wq_completion)rcu_gp){+.+.}-{0:0}, at: process_one_work+0x21c/0x900 
[  248.824857][  T676]  #1: c000001ff2b57c90 ((work_completion)(&rew.rew_work)){+.+.}-{0:0}, at: process_one_work+0x21c/0x900 
[  248.825026][  T676] 3 locks held by (spawn)/2692: 
[  248.825077][  T676] 1 lock held by auditd/3163: 
[  248.825135][  T676]  #0: c0000000057c9ee8 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x254/0x600 
[  248.825296][  T676] ============================================= 
[  248.825296][  T676]  

> 
>  .../include/asm/book3s/64/tlbflush-radix.h    | 15 ++++
>  arch/powerpc/include/asm/hvcall.h             | 34 +++++++-
>  arch/powerpc/include/asm/mmu.h                |  4 +
>  arch/powerpc/include/asm/plpar_wrappers.h     | 52 ++++++++++++
>  arch/powerpc/kernel/dt_cpu_ftrs.c             |  1 +
>  arch/powerpc/kernel/prom_init.c               | 13 +--
>  arch/powerpc/mm/book3s64/radix_tlb.c          | 82 +++++++++++++++++--
>  arch/powerpc/mm/init_64.c                     |  5 +-
>  arch/powerpc/platforms/pseries/lpar.c         |  8 +-
>  9 files changed, 197 insertions(+), 17 deletions(-)
> 
> -- 
> 2.21.3
>
Stephen Rothwell July 16, 2020, 11:09 p.m. UTC | #3
Hi all,

On Thu, 16 Jul 2020 13:27:14 -0400 Qian Cai <cai@lca.pw> wrote:
>
> Reverting the whole series fixed random memory corruptions during boot on
> POWER9 PowerNV systems below.

I will revert those commits from linux-next today as well (they revert
cleanly).
Nicholas Piggin July 17, 2020, 2:08 a.m. UTC | #4
Excerpts from Qian Cai's message of July 17, 2020 3:27 am:
> On Fri, Jul 03, 2020 at 11:06:05AM +0530, Bharata B Rao wrote:
>> Hypervisor may choose not to enable Guest Translation Shootdown Enable
>> (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
>> permitted to use instructions like tblie and tlbsync directly, but is
>> expected to make hypervisor calls to get the TLB flushed.
>> 
>> This series enables the TLB flush routines in the radix code to
>> off-load TLB flushing to hypervisor via the newly proposed hcall
>> H_RPT_INVALIDATE. 
>> 
>> To easily check the availability of GTSE, it is made an MMU feature.
>> The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
>> handle GTSE as an optionally available feature and to not assume GTSE
>> when radix support is available.
>> 
>> The actual hcall implementation for KVM isn't included in this
>> patchset and will be posted separately.
>> 
>> Changes in v3
>> =============
>> - Fixed a bug in the hcall wrapper code where we were missing setting
>>   H_RPTI_TYPE_NESTED while retrying the failed flush request with
>>   a full flush for the nested case.
>> - s/psize_to_h_rpti/psize_to_rpti_pgsize
>> 
>> v2: https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bharata@linux.ibm.com/T/#t
>> 
>> Bharata B Rao (2):
>>   powerpc/mm: Enable radix GTSE only if supported.
>>   powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
>>     enabled
>> 
>> Nicholas Piggin (1):
>>   powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
>>     !GTSE
> 
> Reverting the whole series fixed random memory corruptions during boot on
> POWER9 PowerNV systems below.

If I s/mmu_has_feature(MMU_FTR_GTSE)/(1)/g in radix_tlb.c, then the .o
disasm is the same as reverting my patch.

Feature bits not being set right? PowerNV should be pretty simple, seems
to do the same as FTR_TYPE_RADIX.

So... test being done before static keys are set up? Shouldn't be. Must
be something obvious I just can't see it.

Thanks,
Nick
Nicholas Piggin July 17, 2020, 2:44 a.m. UTC | #5
Excerpts from Nicholas Piggin's message of July 17, 2020 12:08 pm:
> Excerpts from Qian Cai's message of July 17, 2020 3:27 am:
>> On Fri, Jul 03, 2020 at 11:06:05AM +0530, Bharata B Rao wrote:
>>> Hypervisor may choose not to enable Guest Translation Shootdown Enable
>>> (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
>>> permitted to use instructions like tblie and tlbsync directly, but is
>>> expected to make hypervisor calls to get the TLB flushed.
>>> 
>>> This series enables the TLB flush routines in the radix code to
>>> off-load TLB flushing to hypervisor via the newly proposed hcall
>>> H_RPT_INVALIDATE. 
>>> 
>>> To easily check the availability of GTSE, it is made an MMU feature.
>>> The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
>>> handle GTSE as an optionally available feature and to not assume GTSE
>>> when radix support is available.
>>> 
>>> The actual hcall implementation for KVM isn't included in this
>>> patchset and will be posted separately.
>>> 
>>> Changes in v3
>>> =============
>>> - Fixed a bug in the hcall wrapper code where we were missing setting
>>>   H_RPTI_TYPE_NESTED while retrying the failed flush request with
>>>   a full flush for the nested case.
>>> - s/psize_to_h_rpti/psize_to_rpti_pgsize
>>> 
>>> v2: https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bharata@linux.ibm.com/T/#t
>>> 
>>> Bharata B Rao (2):
>>>   powerpc/mm: Enable radix GTSE only if supported.
>>>   powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
>>>     enabled
>>> 
>>> Nicholas Piggin (1):
>>>   powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
>>>     !GTSE
>> 
>> Reverting the whole series fixed random memory corruptions during boot on
>> POWER9 PowerNV systems below.
> 
> If I s/mmu_has_feature(MMU_FTR_GTSE)/(1)/g in radix_tlb.c, then the .o
> disasm is the same as reverting my patch.
> 
> Feature bits not being set right? PowerNV should be pretty simple, seems
> to do the same as FTR_TYPE_RADIX.

Might need this fix

---

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 9cc49f265c86..54c9bcea9d4e 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -163,7 +163,7 @@ static struct ibm_pa_feature {
 	{ .pabyte = 0,  .pabit = 6, .cpu_features  = CPU_FTR_NOEXECUTE },
 	{ .pabyte = 1,  .pabit = 2, .mmu_features  = MMU_FTR_CI_LARGE_PAGE },
 #ifdef CONFIG_PPC_RADIX_MMU
-	{ .pabyte = 40, .pabit = 0, .mmu_features  = MMU_FTR_TYPE_RADIX },
+	{ .pabyte = 40, .pabit = 0, .mmu_features  = (MMU_FTR_TYPE_RADIX | MMU_FTR_GTSE) },
 #endif
 	{ .pabyte = 1,  .pabit = 1, .invert = 1, .cpu_features = CPU_FTR_NODSISRALIGN },
 	{ .pabyte = 5,  .pabit = 0, .cpu_features  = CPU_FTR_REAL_LE,
Bharata B Rao July 17, 2020, 4:28 a.m. UTC | #6
On Fri, Jul 17, 2020 at 12:44:00PM +1000, Nicholas Piggin wrote:
> Excerpts from Nicholas Piggin's message of July 17, 2020 12:08 pm:
> > Excerpts from Qian Cai's message of July 17, 2020 3:27 am:
> >> On Fri, Jul 03, 2020 at 11:06:05AM +0530, Bharata B Rao wrote:
> >>> Hypervisor may choose not to enable Guest Translation Shootdown Enable
> >>> (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
> >>> permitted to use instructions like tblie and tlbsync directly, but is
> >>> expected to make hypervisor calls to get the TLB flushed.
> >>> 
> >>> This series enables the TLB flush routines in the radix code to
> >>> off-load TLB flushing to hypervisor via the newly proposed hcall
> >>> H_RPT_INVALIDATE. 
> >>> 
> >>> To easily check the availability of GTSE, it is made an MMU feature.
> >>> The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
> >>> handle GTSE as an optionally available feature and to not assume GTSE
> >>> when radix support is available.
> >>> 
> >>> The actual hcall implementation for KVM isn't included in this
> >>> patchset and will be posted separately.
> >>> 
> >>> Changes in v3
> >>> =============
> >>> - Fixed a bug in the hcall wrapper code where we were missing setting
> >>>   H_RPTI_TYPE_NESTED while retrying the failed flush request with
> >>>   a full flush for the nested case.
> >>> - s/psize_to_h_rpti/psize_to_rpti_pgsize
> >>> 
> >>> v2: https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bharata@linux.ibm.com/T/#t
> >>> 
> >>> Bharata B Rao (2):
> >>>   powerpc/mm: Enable radix GTSE only if supported.
> >>>   powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
> >>>     enabled
> >>> 
> >>> Nicholas Piggin (1):
> >>>   powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
> >>>     !GTSE
> >> 
> >> Reverting the whole series fixed random memory corruptions during boot on
> >> POWER9 PowerNV systems below.
> > 
> > If I s/mmu_has_feature(MMU_FTR_GTSE)/(1)/g in radix_tlb.c, then the .o
> > disasm is the same as reverting my patch.
> > 
> > Feature bits not being set right? PowerNV should be pretty simple, seems
> > to do the same as FTR_TYPE_RADIX.
> 
> Might need this fix
> 
> ---
> 
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 9cc49f265c86..54c9bcea9d4e 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -163,7 +163,7 @@ static struct ibm_pa_feature {
>  	{ .pabyte = 0,  .pabit = 6, .cpu_features  = CPU_FTR_NOEXECUTE },
>  	{ .pabyte = 1,  .pabit = 2, .mmu_features  = MMU_FTR_CI_LARGE_PAGE },
>  #ifdef CONFIG_PPC_RADIX_MMU
> -	{ .pabyte = 40, .pabit = 0, .mmu_features  = MMU_FTR_TYPE_RADIX },
> +	{ .pabyte = 40, .pabit = 0, .mmu_features  = (MMU_FTR_TYPE_RADIX | MMU_FTR_GTSE) },
>  #endif
>  	{ .pabyte = 1,  .pabit = 1, .invert = 1, .cpu_features = CPU_FTR_NODSISRALIGN },
>  	{ .pabyte = 5,  .pabit = 0, .cpu_features  = CPU_FTR_REAL_LE,

Michael - Let me know if this should be folded into 1/3 and the complete
series resent.

Regards,
Bharata.
Michael Ellerman July 18, 2020, 12:40 p.m. UTC | #7
Bharata B Rao <bharata@linux.ibm.com> writes:
> On Fri, Jul 17, 2020 at 12:44:00PM +1000, Nicholas Piggin wrote:
>> Excerpts from Nicholas Piggin's message of July 17, 2020 12:08 pm:
>> > Excerpts from Qian Cai's message of July 17, 2020 3:27 am:
>> >> On Fri, Jul 03, 2020 at 11:06:05AM +0530, Bharata B Rao wrote:
>> >>> Hypervisor may choose not to enable Guest Translation Shootdown Enable
>> >>> (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
>> >>> permitted to use instructions like tblie and tlbsync directly, but is
>> >>> expected to make hypervisor calls to get the TLB flushed.
>> >>> 
>> >>> This series enables the TLB flush routines in the radix code to
>> >>> off-load TLB flushing to hypervisor via the newly proposed hcall
>> >>> H_RPT_INVALIDATE. 
>> >>> 
>> >>> To easily check the availability of GTSE, it is made an MMU feature.
>> >>> The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
>> >>> handle GTSE as an optionally available feature and to not assume GTSE
>> >>> when radix support is available.
>> >>> 
>> >>> The actual hcall implementation for KVM isn't included in this
>> >>> patchset and will be posted separately.
>> >>> 
>> >>> Changes in v3
>> >>> =============
>> >>> - Fixed a bug in the hcall wrapper code where we were missing setting
>> >>>   H_RPTI_TYPE_NESTED while retrying the failed flush request with
>> >>>   a full flush for the nested case.
>> >>> - s/psize_to_h_rpti/psize_to_rpti_pgsize
>> >>> 
>> >>> v2: https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bharata@linux.ibm.com/T/#t
>> >>> 
>> >>> Bharata B Rao (2):
>> >>>   powerpc/mm: Enable radix GTSE only if supported.
>> >>>   powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
>> >>>     enabled
>> >>> 
>> >>> Nicholas Piggin (1):
>> >>>   powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
>> >>>     !GTSE
>> >> 
>> >> Reverting the whole series fixed random memory corruptions during boot on
>> >> POWER9 PowerNV systems below.
>> > 
>> > If I s/mmu_has_feature(MMU_FTR_GTSE)/(1)/g in radix_tlb.c, then the .o
>> > disasm is the same as reverting my patch.
>> > 
>> > Feature bits not being set right? PowerNV should be pretty simple, seems
>> > to do the same as FTR_TYPE_RADIX.
>> 
>> Might need this fix
>> 
>> ---
>> 
>> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>> index 9cc49f265c86..54c9bcea9d4e 100644
>> --- a/arch/powerpc/kernel/prom.c
>> +++ b/arch/powerpc/kernel/prom.c
>> @@ -163,7 +163,7 @@ static struct ibm_pa_feature {
>>  	{ .pabyte = 0,  .pabit = 6, .cpu_features  = CPU_FTR_NOEXECUTE },
>>  	{ .pabyte = 1,  .pabit = 2, .mmu_features  = MMU_FTR_CI_LARGE_PAGE },
>>  #ifdef CONFIG_PPC_RADIX_MMU
>> -	{ .pabyte = 40, .pabit = 0, .mmu_features  = MMU_FTR_TYPE_RADIX },
>> +	{ .pabyte = 40, .pabit = 0, .mmu_features  = (MMU_FTR_TYPE_RADIX | MMU_FTR_GTSE) },
>>  #endif
>>  	{ .pabyte = 1,  .pabit = 1, .invert = 1, .cpu_features = CPU_FTR_NODSISRALIGN },
>>  	{ .pabyte = 5,  .pabit = 0, .cpu_features  = CPU_FTR_REAL_LE,
>
> Michael - Let me know if this should be folded into 1/3 and the complete
> series resent.

No it's already merged.

Please send a proper patch with a Fixes: tag and a change log that
explains the details.

cheers