Message ID | 20200330071219.12284-1-ganeshgr@linux.ibm.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | powerpc/mce: Add MCE notification chain | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch powerpc/merge (c6624071c338732402e8c726df6a4074473eaa0e) |
snowpatch_ozlabs/build-ppc64le | success | Build succeeded |
snowpatch_ozlabs/build-ppc64be | success | Build succeeded |
snowpatch_ozlabs/build-ppc64e | success | Build succeeded |
snowpatch_ozlabs/build-pmac32 | success | Build succeeded |
snowpatch_ozlabs/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 35 lines checked |
snowpatch_ozlabs/needsstable | success | Patch has no Fixes tags |
Ganesh Goudar's on March 30, 2020 5:12 pm: > From: Santosh S <santosh@fossix.org> > > Introduce notification chain which lets know about uncorrected memory > errors(UE). This would help prospective users in pmem or nvdimm subsystem > to track bad blocks for better handling of persistent memory allocations. > > Signed-off-by: Santosh S <santosh@fossix.org> > Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Do you have any such users yet? It would be good to refer to an example user and give a brief description of what it does in its notifier. > @@ -263,6 +277,7 @@ static void machine_process_ue_event(struct work_struct *work) > while (__this_cpu_read(mce_ue_count) > 0) { > index = __this_cpu_read(mce_ue_count) - 1; > evt = this_cpu_ptr(&mce_ue_event_queue[index]); > + blocking_notifier_call_chain(&mce_notifier_list, 0, evt); Can we really use a blocking notifier here? I'm not sure that we can. Thanks, Nick
On 4/3/20 7:38 AM, Nicholas Piggin wrote: > Ganesh Goudar's on March 30, 2020 5:12 pm: >> From: Santosh S <santosh@fossix.org> >> >> Introduce notification chain which lets know about uncorrected memory >> errors(UE). This would help prospective users in pmem or nvdimm subsystem >> to track bad blocks for better handling of persistent memory allocations. >> >> Signed-off-by: Santosh S <santosh@fossix.org> >> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> > Do you have any such users yet? It would be good to refer to an example > user and give a brief description of what it does in its notifier. Santosh has sent a patch which uses this notification. https://patchwork.ozlabs.org/patch/1265062/ >> @@ -263,6 +277,7 @@ static void machine_process_ue_event(struct work_struct *work) >> while (__this_cpu_read(mce_ue_count) > 0) { >> index = __this_cpu_read(mce_ue_count) - 1; >> evt = this_cpu_ptr(&mce_ue_event_queue[index]); >> + blocking_notifier_call_chain(&mce_notifier_list, 0, evt); > Can we really use a blocking notifier here? I'm not sure that we can. I think we can, do you see any problem? > > Thanks, > Nick
Ganesh's on April 4, 2020 11:05 pm: > On 4/3/20 7:38 AM, Nicholas Piggin wrote: > >> Ganesh Goudar's on March 30, 2020 5:12 pm: >>> From: Santosh S <santosh@fossix.org> >>> >>> Introduce notification chain which lets know about uncorrected memory >>> errors(UE). This would help prospective users in pmem or nvdimm subsystem >>> to track bad blocks for better handling of persistent memory allocations. >>> >>> Signed-off-by: Santosh S <santosh@fossix.org> >>> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> >> Do you have any such users yet? It would be good to refer to an example >> user and give a brief description of what it does in its notifier. > > Santosh has sent a patch which uses this notification. > https://patchwork.ozlabs.org/patch/1265062/ Okay. So these things are asynchronous after the machine check. I guess that's the design of it and memory offlining does something similar by the looks, but how do you prevent the memory being allocated for something else before the notifiers run? >>> @@ -263,6 +277,7 @@ static void machine_process_ue_event(struct work_struct *work) >>> while (__this_cpu_read(mce_ue_count) > 0) { >>> index = __this_cpu_read(mce_ue_count) - 1; >>> evt = this_cpu_ptr(&mce_ue_event_queue[index]); >>> + blocking_notifier_call_chain(&mce_notifier_list, 0, evt); >> Can we really use a blocking notifier here? I'm not sure that we can. > > I think we can, do you see any problem? No it looks okay after better look, sorry for the noise. Thanks, Nick
On 2020-04-06 12:17:22 Mon, Nicholas Piggin wrote: > Ganesh's on April 4, 2020 11:05 pm: > > On 4/3/20 7:38 AM, Nicholas Piggin wrote: > > > >> Ganesh Goudar's on March 30, 2020 5:12 pm: > >>> From: Santosh S <santosh@fossix.org> > >>> > >>> Introduce notification chain which lets know about uncorrected memory > >>> errors(UE). This would help prospective users in pmem or nvdimm subsystem > >>> to track bad blocks for better handling of persistent memory allocations. > >>> > >>> Signed-off-by: Santosh S <santosh@fossix.org> > >>> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> > >> Do you have any such users yet? It would be good to refer to an example > >> user and give a brief description of what it does in its notifier. > > > > Santosh has sent a patch which uses this notification. > > https://patchwork.ozlabs.org/patch/1265062/ > > Okay. So these things are asynchronous after the machine check. I guess > that's the design of it and memory offlining does something similar by > the looks, but how do you prevent the memory being allocated for > something else before the notifiers run? We can't. This race even exists today when we call memory_failure(). If the same memory is allocated again then we may hit another mce on same address when touched until the subsystem that has resistered for notification has completed handling the notified address. Thanks, -Mahesh. > > >>> @@ -263,6 +277,7 @@ static void machine_process_ue_event(struct work_struct *work) > >>> while (__this_cpu_read(mce_ue_count) > 0) { > >>> index = __this_cpu_read(mce_ue_count) - 1; > >>> evt = this_cpu_ptr(&mce_ue_event_queue[index]); > >>> + blocking_notifier_call_chain(&mce_notifier_list, 0, evt); > >> Can we really use a blocking notifier here? I'm not sure that we can. > > > > I think we can, do you see any problem? > > No it looks okay after better look, sorry for the noise. > > Thanks, > Nick
On 3/30/20 12:42 PM, Ganesh Goudar wrote: > From: Santosh S <santosh@fossix.org> > > Introduce notification chain which lets know about uncorrected memory > errors(UE). This would help prospective users in pmem or nvdimm subsystem > to track bad blocks for better handling of persistent memory allocations. > > Signed-off-by: Santosh S <santosh@fossix.org> > Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> > --- Hi mpe, Do you have any comments on this patch?
diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h index 6a6ddaabdb34..6e222a94a68a 100644 --- a/arch/powerpc/include/asm/mce.h +++ b/arch/powerpc/include/asm/mce.h @@ -218,6 +218,8 @@ extern void machine_check_queue_event(void); extern void machine_check_print_event_info(struct machine_check_event *evt, bool user_mode, bool in_guest); unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr); +int mce_register_notifier(struct notifier_block *nb); +int mce_unregister_notifier(struct notifier_block *nb); #ifdef CONFIG_PPC_BOOK3S_64 void flush_and_reload_slb(void); #endif /* CONFIG_PPC_BOOK3S_64 */ diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c index 34c1001e9e8b..f50d7f56c02c 100644 --- a/arch/powerpc/kernel/mce.c +++ b/arch/powerpc/kernel/mce.c @@ -47,6 +47,20 @@ static struct irq_work mce_ue_event_irq_work = { DECLARE_WORK(mce_ue_event_work, machine_process_ue_event); +static BLOCKING_NOTIFIER_HEAD(mce_notifier_list); + +int mce_register_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(&mce_notifier_list, nb); +} +EXPORT_SYMBOL_GPL(mce_register_notifier); + +int mce_unregister_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_unregister(&mce_notifier_list, nb); +} +EXPORT_SYMBOL_GPL(mce_unregister_notifier); + static void mce_set_error_info(struct machine_check_event *mce, struct mce_error_info *mce_err) { @@ -263,6 +277,7 @@ static void machine_process_ue_event(struct work_struct *work) while (__this_cpu_read(mce_ue_count) > 0) { index = __this_cpu_read(mce_ue_count) - 1; evt = this_cpu_ptr(&mce_ue_event_queue[index]); + blocking_notifier_call_chain(&mce_notifier_list, 0, evt); #ifdef CONFIG_MEMORY_FAILURE /* * This should probably queued elsewhere, but