diff mbox series

[v02] REPOST powerpc/migration: Handle unitialized timer reset

Message ID 6a7769a5-c8bc-f857-61cc-a76cdd98b28c@linux.vnet.ibm.com (mailing list archive)
State Accepted
Commit 8604895a34d92f5e186ceb931b0d1b384030ea3d
Headers show
Series [v02] REPOST powerpc/migration: Handle unitialized timer reset | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success next/apply_patch Successfully applied
snowpatch_ozlabs/checkpatch success Test checkpatch on branch next
snowpatch_ozlabs/build-ppc64le success Test build-ppc64le on branch next
snowpatch_ozlabs/build-ppc64be success Test build-ppc64be on branch next
snowpatch_ozlabs/build-ppc64e success Test build-ppc64e on branch next
snowpatch_ozlabs/build-ppc32 success Test build-ppc32 on branch next

Commit Message

Michael Bringmann Sept. 20, 2018, 4:45 p.m. UTC
After migration of a powerpc LPAR, the kernel executes code to
update the system state to reflect new platform characteristics.
Such changes include modifications to device tree properties
provided to the system by PHYP.  Property notifications received
by the powerpc 'migration_store' code are passed along to the
kernel in general through a call to 'of_update_property' which
in turn passes such events back to all modules through entries like
the '.notifier_call' function within the NUMA module.  When the
NUMA module updates its state, it resets its event timer.  If this
occurs after a previous call to 'stop_topology_update' or on a
system without VPHN enabled, the code runs into an unitialized
timer structure and crashes.  This patch adds a safety check
along this path toward the problem code.

Note: This crash was observed on every LPM in the 4.17-rc7 kernel
(and in the 4.18 kernel) of a system with dedicated CPUs enabled.
An example crash log without the patch is as follows.

  [ 2571.437467] ibmvscsi 30000081: Re-enabling adapter!
[ 2571.673850] ------------[ cut here ]------------
[ 2571.673863] kernel BUG at kernel/time/timer.c:958!
[ 2571.673875] Oops: Exception in kernel mode, sig: 5 [#1]
[ 2571.673877] LE SMP NR_CPUS=2048 NUMA pSeries
[ 2571.673886] Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
[ 2571.673969] CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179
[ 2571.673972] NIP:  c000000000198a2c LR: c000000000075990 CTR: 0000000000000000
[ 2571.673975] REGS: c0000003f9407560 TRAP: 0700   Not tainted  (4.17.0+)
[ 2571.673977] MSR:  800000010282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 44482824  XER: 00000000
[ 2571.673990] CFAR: c00000000007598c SOFTE: 0
[ 2571.673990] GPR00: c000000000075990 c0000003f94077e0 c00000000113a900 c0000000012e5968
[ 2571.673990] GPR04: 000000010003776b c0000003ffa05180 0000000000000020 c0000003f9407850
[ 2571.673990] GPR08: 0000000000000000 0000000000000001 ffffffffffffffff 0000000000000220
[ 2571.673990] GPR12: c000000000076560 c00000001ec90200 0000000040000000 0000000000000018
[ 2571.673990] GPR16: 0000000000000000 000000000000000d c0000003e9ea5015 0000000000000010
[ 2571.673990] GPR20: 000000000000000b 0000000000000050 c0000003e9ea4068 0000000000000001
[ 2571.673990] GPR24: 000000000000001c c0000003ffff3ab0 0000000000000000 c0000003f9407990
[ 2571.673990] GPR28: 0000000000000005 c0000003ffff3ab0 000000010003776b c0000000012e5968
[ 2571.674042] NIP [c000000000198a2c] mod_timer+0x4c/0x400
[ 2571.674051] LR [c000000000075990] reset_topology_timer+0x40/0x60
[ 2571.674053] Call Trace:
[ 2571.674056] [c0000003f94077e0] [c0000003f9407830] 0xc0000003f9407830 (unreliable)
[ 2571.674060] [c0000003f9407860] [c000000000075990] reset_topology_timer+0x40/0x60
[ 2571.674063] [c0000003f9407880] [c000000000076660] dt_update_callback+0x100/0x120
[ 2571.674072] [c0000003f94078d0] [c00000000012ada0] notifier_call_chain+0x90/0x100
[ 2571.674077] [c0000003f9407920] [c00000000012b270] __blocking_notifier_call_chain+0x60/0x90
[ 2571.674092] [c0000003f9407970] [c0000000007b9a60] of_property_notify+0x90/0xd0
[ 2571.674096] [c0000003f94079d0] [c0000000007b4644] of_update_property+0x104/0x150
[ 2571.674103] [c0000003f9407a30] [c0000000000c0ddc] update_dt_property+0xdc/0x1f0
[ 2571.674106] [c0000003f9407a90] [c0000000000c11c0] pseries_devicetree_update+0x2d0/0x510
[ 2571.674110] [c0000003f9407bc0] [c0000000000c147c] post_mobility_fixup+0x7c/0xf0
[ 2571.674113] [c0000003f9407c30] [c0000000000c1594] migration_store+0xa4/0xc0
[ 2571.674123] [c0000003f9407c70] [c000000000989940] kobj_attr_store+0x30/0x60
[ 2571.674133] [c0000003f9407c90] [c00000000040b294] sysfs_kf_write+0x64/0xa0
[ 2571.674136] [c0000003f9407cb0] [c00000000040a02c] kernfs_fop_write+0x16c/0x240
[ 2571.674146] [c0000003f9407d00] [c00000000034eeb0] __vfs_write+0x40/0x200
[ 2571.674149] [c0000003f9407d90] [c00000000034f288] vfs_write+0xc8/0x240
[ 2571.674152] [c0000003f9407de0] [c00000000034f5bc] ksys_write+0x5c/0x100
[ 2571.674158] [c0000003f9407e30] [c00000000000b284] system_call+0x58/0x6c
[ 2571.674161] Instruction dump:
[ 2571.674163] fb01ffc0 7c7f1b78 fb21ffc8 fb41ffd0 fb61ffd8 fb81ffe0 fba1ffe8 f8010010
[ 2571.674168] f821ff81 e9230018 7d290074 7929d182 <0b090000> e9230008 2fa90000 419e0278
[ 2571.674176] ---[ end trace 0c7939657d5522df ]---

Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
---
Changes in patch:
  -- Revised patch description.
---
 arch/powerpc/mm/numa.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Michael Ellerman Sept. 26, 2018, 12:13 p.m. UTC | #1
On Thu, 2018-09-20 at 16:45:13 UTC, Michael Bringmann wrote:
> After migration of a powerpc LPAR, the kernel executes code to
> update the system state to reflect new platform characteristics.
> Such changes include modifications to device tree properties
> provided to the system by PHYP.  Property notifications received
> by the powerpc 'migration_store' code are passed along to the
> kernel in general through a call to 'of_update_property' which
> in turn passes such events back to all modules through entries like
> the '.notifier_call' function within the NUMA module.  When the
> NUMA module updates its state, it resets its event timer.  If this
> occurs after a previous call to 'stop_topology_update' or on a
> system without VPHN enabled, the code runs into an unitialized
> timer structure and crashes.  This patch adds a safety check
> along this path toward the problem code.
> 
> Note: This crash was observed on every LPM in the 4.17-rc7 kernel
> (and in the 4.18 kernel) of a system with dedicated CPUs enabled.
> An example crash log without the patch is as follows.
> 
...
> 
> Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/8604895a34d92f5e186ceb931b0d1b

cheers
Michael Bringmann Sept. 26, 2018, 2:44 p.m. UTC | #2
Hello, Michael:
    Please look at v03 of the patch set which was posted
just on 09/25/2018.  We had a last minute brainstorm of how
to simplify the code after the v02 REPOST.
    Also, note that we would like to invalidate the set,

[PATCH v08 0/5] powerpc/hotplug: Update affinity for migrated CPUs
https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-July/176301.html

as we think it is covered now by other code.

    Thanks for your attention and assistance.

Michael

On 09/26/2018 07:13 AM, Michael Ellerman wrote:
> On Thu, 2018-09-20 at 16:45:13 UTC, Michael Bringmann wrote:
>> After migration of a powerpc LPAR, the kernel executes code to
>> update the system state to reflect new platform characteristics.
>> Such changes include modifications to device tree properties
>> provided to the system by PHYP.  Property notifications received
>> by the powerpc 'migration_store' code are passed along to the
>> kernel in general through a call to 'of_update_property' which
>> in turn passes such events back to all modules through entries like
>> the '.notifier_call' function within the NUMA module.  When the
>> NUMA module updates its state, it resets its event timer.  If this
>> occurs after a previous call to 'stop_topology_update' or on a
>> system without VPHN enabled, the code runs into an unitialized
>> timer structure and crashes.  This patch adds a safety check
>> along this path toward the problem code.
>>
>> Note: This crash was observed on every LPM in the 4.17-rc7 kernel
>> (and in the 4.18 kernel) of a system with dedicated CPUs enabled.
>> An example crash log without the patch is as follows.
>>
> ...
>>
>> Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
> 
> Applied to powerpc fixes, thanks.
> 
> https://git.kernel.org/powerpc/c/8604895a34d92f5e186ceb931b0d1b
> 
> cheers
> 
>
Michael Bringmann Sept. 26, 2018, 2:54 p.m. UTC | #3
Hello, Michael:
    I am an idiot this morning.  Not enough coffee.
I confused this posting with another.  Please forget the request.

    We do want to omit 

[PATCH v08 0/5] powerpc/hotplug: Update affinity for migrated CPUs
https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-July/176301.html

though.

Very sorry for the inconvenience.

Michael

On 09/26/2018 09:44 AM, Michael Bringmann wrote:
> Hello, Michael:
>     Please look at v03 of the patch set which was posted
> just on 09/25/2018.  We had a last minute brainstorm of how
> to simplify the code after the v02 REPOST.
>     Also, note that we would like to invalidate the set,
> 
> [PATCH v08 0/5] powerpc/hotplug: Update affinity for migrated CPUs
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-July/176301.html
> 
> as we think it is covered now by other code.
> 
>     Thanks for your attention and assistance.
> 
> Michael
> 
> On 09/26/2018 07:13 AM, Michael Ellerman wrote:
>> On Thu, 2018-09-20 at 16:45:13 UTC, Michael Bringmann wrote:
>>> After migration of a powerpc LPAR, the kernel executes code to
>>> update the system state to reflect new platform characteristics.
>>> Such changes include modifications to device tree properties
>>> provided to the system by PHYP.  Property notifications received
>>> by the powerpc 'migration_store' code are passed along to the
>>> kernel in general through a call to 'of_update_property' which
>>> in turn passes such events back to all modules through entries like
>>> the '.notifier_call' function within the NUMA module.  When the
>>> NUMA module updates its state, it resets its event timer.  If this
>>> occurs after a previous call to 'stop_topology_update' or on a
>>> system without VPHN enabled, the code runs into an unitialized
>>> timer structure and crashes.  This patch adds a safety check
>>> along this path toward the problem code.
>>>
>>> Note: This crash was observed on every LPM in the 4.17-rc7 kernel
>>> (and in the 4.18 kernel) of a system with dedicated CPUs enabled.
>>> An example crash log without the patch is as follows.
>>>
>> ...
>>>
>>> Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
>>
>> Applied to powerpc fixes, thanks.
>>
>> https://git.kernel.org/powerpc/c/8604895a34d92f5e186ceb931b0d1b
>>
>> cheers
>>
>>
>
Michael Ellerman Sept. 27, 2018, 6:57 a.m. UTC | #4
Michael Bringmann <mwb@linux.vnet.ibm.com> writes:

> Hello, Michael:
>     I am an idiot this morning.  Not enough coffee.

Emailing when insufficiently caffeinated, that is dangerous :)

> I confused this posting with another.  Please forget the request.

No worries.

>     We do want to omit 
>
> [PATCH v08 0/5] powerpc/hotplug: Update affinity for migrated CPUs
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-July/176301.html
>
> though.

OK, I've marked it 'rejected'.

cheers
diff mbox series

Patch

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0c7e05d..a789d57 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1457,7 +1457,8 @@  static void topology_timer_fn(struct timer_list *unused)
 
 static void reset_topology_timer(void)
 {
-	mod_timer(&topology_timer, jiffies + topology_timer_secs * HZ);
+	if (vphn_enabled)
+		mod_timer(&topology_timer, jiffies + topology_timer_secs * HZ);
 }
 
 #ifdef CONFIG_SMP