diff mbox series

[v6,1/2] ACPI / APEI: Add support to notify the vendor specific HW errors

Message ID 20200325164223.650-2-shiju.jose@huawei.com
State New
Headers show
Series ACPI / APEI: Add support to notify the vendor specific HW errors | expand

Commit Message

Shiju Jose March 25, 2020, 4:42 p.m. UTC
Presently APEI does not support reporting the vendor specific
HW errors, received in the vendor defined table entries, to the
vendor drivers for any recovery.

This patch adds the support to register and unregister the
error handling function for the vendor specific HW errors and
notify the registered kernel driver.

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
 drivers/acpi/apei/ghes.c | 35 ++++++++++++++++++++++++++++++++++-
 drivers/ras/ras.c        |  5 +++--
 include/acpi/ghes.h      | 28 ++++++++++++++++++++++++++++
 include/linux/ras.h      |  6 ++++--
 include/ras/ras_event.h  |  7 +++++--
 5 files changed, 74 insertions(+), 7 deletions(-)

Comments

Borislav Petkov March 27, 2020, 6:22 p.m. UTC | #1
On Wed, Mar 25, 2020 at 04:42:22PM +0000, Shiju Jose wrote:
> Presently APEI does not support reporting the vendor specific
> HW errors, received in the vendor defined table entries, to the
> vendor drivers for any recovery.
> 
> This patch adds the support to register and unregister the

Avoid having "This patch" or "This commit" in the commit message. It is
tautologically useless.

Also, do

$ git grep 'This patch' Documentation/process

for more details.

> error handling function for the vendor specific HW errors and
> notify the registered kernel driver.
> 
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> ---
>  drivers/acpi/apei/ghes.c | 35 ++++++++++++++++++++++++++++++++++-
>  drivers/ras/ras.c        |  5 +++--
>  include/acpi/ghes.h      | 28 ++++++++++++++++++++++++++++
>  include/linux/ras.h      |  6 ++++--
>  include/ras/ras_event.h  |  7 +++++--
>  5 files changed, 74 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 24c9642e8fc7..d83f0b1aad0d 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -490,6 +490,32 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>  #endif
>  }
>  
> +static ATOMIC_NOTIFIER_HEAD(ghes_event_notify_list);
> +
> +/**
> + * ghes_register_event_notifier - register an event notifier
> + * for the non-fatal HW errors.
> + * @nb: pointer to the notifier_block structure of the event handler.
> + *
> + * return 0 : SUCCESS, non-zero : FAIL
> + */
> +int ghes_register_event_notifier(struct notifier_block *nb)
> +{
> +	return atomic_notifier_chain_register(&ghes_event_notify_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(ghes_register_event_notifier);
> +
> +/**
> + * ghes_unregister_event_notifier - unregister the previously
> + * registered event notifier.
> + * @nb: pointer to the notifier_block structure of the event handler.
> + */
> +void ghes_unregister_event_notifier(struct notifier_block *nb)
> +{
> +	atomic_notifier_chain_unregister(&ghes_event_notify_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(ghes_unregister_event_notifier);
> +
>  static void ghes_do_proc(struct ghes *ghes,
>  			 const struct acpi_hest_generic_status *estatus)
>  {
> @@ -526,10 +552,17 @@ static void ghes_do_proc(struct ghes *ghes,
>  			log_arm_hw_error(err);
>  		} else {
>  			void *err = acpi_hest_get_payload(gdata);
> +			u8 error_handled = false;
> +			int ret;
> +
> +			ret = atomic_notifier_call_chain(&ghes_event_notify_list, 0, gdata);

Well, this is a notifier with standard name for a non-standard event.
Not optimal.

Why does only this event need a notifier? Because your driver is
interested in only those events?

> +			if (ret & NOTIFY_OK)
> +				error_handled = true;
>  
>  			log_non_standard_event(sec_type, fru_id, fru_text,
>  					       sec_sev, err,
> -					       gdata->error_data_length);
> +					       gdata->error_data_length,
> +					       error_handled);

What's that error_handled thing for? That's just silly.

Your notifier returns NOTIFY_STOP when it has queued the error. If you
don't want to log it, just test == NOTIFY_STOP and do not log it then.

Then your notifier callback is queuing the error into a kfifo for
whatever reason and then scheduling a workqueue to handle it in user
context...

So I'm thinking that it would be better if you:

* make that kfifo generic and part of ghes.c and queue all types of
error records into it in ghes_do_proc() - not just the non-standard
ones.

* then, when you're done queuing, you kick a workqueue.

* that workqueue runs a normal, blocking notifier to which drivers
register.

Your driver can register to that notifier too and do the normal handling
then and not have this ad-hoc, semi-generic, semi-vendor-specific thing.

Thx.
Shiju Jose March 30, 2020, 10:14 a.m. UTC | #2
Hi Borislav,

Thanks for reviewing the patches.

>-----Original Message-----
>From: linux-acpi-owner@vger.kernel.org [mailto:linux-acpi-
>owner@vger.kernel.org] On Behalf Of Borislav Petkov
>Sent: 27 March 2020 18:22
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-acpi@vger.kernel.org; linux-pci@vger.kernel.org; linux-
>kernel@vger.kernel.org; rjw@rjwysocki.net; helgaas@kernel.org;
>lenb@kernel.org; james.morse@arm.com; tony.luck@intel.com;
>gregkh@linuxfoundation.org; zhangliguang@linux.alibaba.com;
>tglx@linutronix.de; Linuxarm <linuxarm@huawei.com>; Jonathan Cameron
><jonathan.cameron@huawei.com>; tanxiaofei <tanxiaofei@huawei.com>;
>yangyicong <yangyicong@huawei.com>
>Subject: Re: [PATCH v6 1/2] ACPI / APEI: Add support to notify the vendor
>specific HW errors
>
>On Wed, Mar 25, 2020 at 04:42:22PM +0000, Shiju Jose wrote:
>> Presently APEI does not support reporting the vendor specific HW
>> errors, received in the vendor defined table entries, to the vendor
>> drivers for any recovery.
>>
>> This patch adds the support to register and unregister the
>
>Avoid having "This patch" or "This commit" in the commit message. It is
>tautologically useless.
>
Sure.

>Also, do
>
>$ git grep 'This patch' Documentation/process
>
>for more details.
Sure.

>
>> error handling function for the vendor specific HW errors and notify
>> the registered kernel driver.
>>
>> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
>> ---
>>  drivers/acpi/apei/ghes.c | 35 ++++++++++++++++++++++++++++++++++-
>>  drivers/ras/ras.c        |  5 +++--
>>  include/acpi/ghes.h      | 28 ++++++++++++++++++++++++++++
>>  include/linux/ras.h      |  6 ++++--
>>  include/ras/ras_event.h  |  7 +++++--
>>  5 files changed, 74 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index
>> 24c9642e8fc7..d83f0b1aad0d 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -490,6 +490,32 @@ static void ghes_handle_aer(struct
>> acpi_hest_generic_data *gdata)  #endif  }
>>
>> +static ATOMIC_NOTIFIER_HEAD(ghes_event_notify_list);
>> +
>> +/**
>> + * ghes_register_event_notifier - register an event notifier
>> + * for the non-fatal HW errors.
>> + * @nb: pointer to the notifier_block structure of the event handler.
>> + *
>> + * return 0 : SUCCESS, non-zero : FAIL  */ int
>> +ghes_register_event_notifier(struct notifier_block *nb) {
>> +	return atomic_notifier_chain_register(&ghes_event_notify_list, nb);
>> +} EXPORT_SYMBOL_GPL(ghes_register_event_notifier);
>> +
>> +/**
>> + * ghes_unregister_event_notifier - unregister the previously
>> + * registered event notifier.
>> + * @nb: pointer to the notifier_block structure of the event handler.
>> + */
>> +void ghes_unregister_event_notifier(struct notifier_block *nb)
>> +{
>> +	atomic_notifier_chain_unregister(&ghes_event_notify_list, nb);
>> +}
>> +EXPORT_SYMBOL_GPL(ghes_unregister_event_notifier);
>> +
>>  static void ghes_do_proc(struct ghes *ghes,
>>  			 const struct acpi_hest_generic_status *estatus)
>>  {
>> @@ -526,10 +552,17 @@ static void ghes_do_proc(struct ghes *ghes,
>>  			log_arm_hw_error(err);
>>  		} else {
>>  			void *err = acpi_hest_get_payload(gdata);
>> +			u8 error_handled = false;
>> +			int ret;
>> +
>> +			ret =
>atomic_notifier_call_chain(&ghes_event_notify_list, 0, gdata);
>
>Well, this is a notifier with standard name for a non-standard event.
>Not optimal.
Ok.

>
>Why does only this event need a notifier? Because your driver is
>interested in only those events?
The error events for the PCIe controller can be reported to the kernel in the vendor defined format
[as per the"N.2.3 Non-standard Section Body" of the UEFI spec]. 
Thus these events require a notifier from APEI to the corresponding kernel driver. 

>
>> +			if (ret & NOTIFY_OK)
>> +				error_handled = true;
>>
>>  			log_non_standard_event(sec_type, fru_id, fru_text,
>>  					       sec_sev, err,
>> -					       gdata->error_data_length);
>> +					       gdata->error_data_length,
>> +					       error_handled);
>
>What's that error_handled thing for? That's just silly.
This field added based on the input from James Morse on v4 patch to enable the user space application(rasdaemon)
do the decoding and logging of the any extra error information shared by the corresponding  kernel driver to the user space.

>
>Your notifier returns NOTIFY_STOP when it has queued the error. If you
>don't want to log it, just test == NOTIFY_STOP and do not log it then.
sure.
   
>
>Then your notifier callback is queuing the error into a kfifo for
>whatever reason and then scheduling a workqueue to handle it in user
>context...
>
>So I'm thinking that it would be better if you:
>
>* make that kfifo generic and part of ghes.c and queue all types of
>error records into it in ghes_do_proc() - not just the non-standard
>ones.
>
>* then, when you're done queuing, you kick a workqueue.
>
>* that workqueue runs a normal, blocking notifier to which drivers
>register.
Sure. I will test this method and update.
Can you please confirm you want all the existing standard errors(memory, ARM, PCIE) in the ghes_do_proc ()
to be reported through the blocking notifier?

>
>Your driver can register to that notifier too and do the normal handling
>then and not have this ad-hoc, semi-generic, semi-vendor-specific thing.
>
>Thx.
>
>--
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Thanks,
Shiju
Borislav Petkov March 30, 2020, 10:33 a.m. UTC | #3
On Mon, Mar 30, 2020 at 10:14:20AM +0000, Shiju Jose wrote:
> This field added based on the input from James Morse on v4 patch to
> enable the user space application(rasdaemon) do the decoding and
> logging of the any extra error information shared by the corresponding
> kernel driver to the user space.

How is your error reporting supposed to work?

Your driver is printing error information in dmesg and, at the same
time, you want to report errors with the rasdaemon.

Currently, the kernel does not report any error info if there's a user
agent like rasdaemon registered so you need to think about what exactly
you're trying to achieve here wrt to error handling. Port resetting,
printing error info, etc. Always ask yourself, what can the user do with
the information you're printing. And so on...

> Can you please confirm you want all the existing standard
> errors(memory, ARM, PCIE) in the ghes_do_proc () to be reported
> through the blocking notifier?

Yes, I would very much prefer to have a generic solution instead of
vendor-specific stuff left and right.

Thx.
Shiju Jose March 30, 2020, 11:55 a.m. UTC | #4
Hi Boris,

>-----Original Message-----
>From: Borislav Petkov [mailto:bp@alien8.de]
>Sent: 30 March 2020 11:34
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-acpi@vger.kernel.org; linux-pci@vger.kernel.org; linux-
>kernel@vger.kernel.org; rjw@rjwysocki.net; helgaas@kernel.org;
>lenb@kernel.org; james.morse@arm.com; tony.luck@intel.com;
>gregkh@linuxfoundation.org; zhangliguang@linux.alibaba.com;
>tglx@linutronix.de; Linuxarm <linuxarm@huawei.com>; Jonathan Cameron
><jonathan.cameron@huawei.com>; tanxiaofei <tanxiaofei@huawei.com>;
>yangyicong <yangyicong@huawei.com>
>Subject: Re: [PATCH v6 1/2] ACPI / APEI: Add support to notify the vendor
>specific HW errors
>
>On Mon, Mar 30, 2020 at 10:14:20AM +0000, Shiju Jose wrote:
>> This field added based on the input from James Morse on v4 patch to
>> enable the user space application(rasdaemon) do the decoding and
>> logging of the any extra error information shared by the corresponding
>> kernel driver to the user space.
>
>How is your error reporting supposed to work?
>
>Your driver is printing error information in dmesg and, at the same time, you
>want to report errors with the rasdaemon.
>
>Currently, the kernel does not report any error info if there's a user agent like
>rasdaemon registered so you need to think about what exactly you're trying
>to achieve here wrt to error handling. Port resetting, printing error info, etc.
>Always ask yourself, what can the user do with the information you're
>printing. And so on...
The error_handled field added on the generic basis for the non-standard errors.
rasdaemon supports adding decoding of the vendor-specific error data, printing and 
storing the decoded vendor error information to the sql database. 
The idea was the  error handled field  will help the decoding part of the rasdaemon to do the
appropriate steps for logging the vendor error information depending on whether a corresponding kernel driver
has handled the error or not.  
However I think the same can be achieved by adding an error handling status field to the vendor-specific data, which
the kernel  driver will set after handling the error and corresponding vendor-specific code in the rasdaemon will use it 
while logging the vendor error data.
>
>> Can you please confirm you want all the existing standard
>> errors(memory, ARM, PCIE) in the ghes_do_proc () to be reported
>> through the blocking notifier?
>
>Yes, I would very much prefer to have a generic solution instead of vendor-
>specific stuff left and right.
Sure.

>
>Thx.
>
>--
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Thanks,
Shiju
Borislav Petkov March 30, 2020, 1:42 p.m. UTC | #5
On Mon, Mar 30, 2020 at 11:55:35AM +0000, Shiju Jose wrote:
> The idea was the error handled field will help the decoding part of
> the rasdaemon to do the appropriate steps for logging the vendor error
> information depending on whether a corresponding kernel driver has
> handled the error or not.

What's the difference for rasdaemon whether the error has been handled
or not?
Shiju Jose March 30, 2020, 3:44 p.m. UTC | #6
Hi Boris,

>-----Original Message-----
>From: Borislav Petkov [mailto:bp@alien8.de]
>Sent: 30 March 2020 14:43
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-acpi@vger.kernel.org; linux-pci@vger.kernel.org; linux-
>kernel@vger.kernel.org; rjw@rjwysocki.net; helgaas@kernel.org;
>lenb@kernel.org; james.morse@arm.com; tony.luck@intel.com;
>gregkh@linuxfoundation.org; zhangliguang@linux.alibaba.com;
>tglx@linutronix.de; Linuxarm <linuxarm@huawei.com>; Jonathan Cameron
><jonathan.cameron@huawei.com>; tanxiaofei <tanxiaofei@huawei.com>;
>yangyicong <yangyicong@huawei.com>
>Subject: Re: [PATCH v6 1/2] ACPI / APEI: Add support to notify the vendor
>specific HW errors
>
>On Mon, Mar 30, 2020 at 11:55:35AM +0000, Shiju Jose wrote:
>> The idea was the error handled field will help the decoding part of
>> the rasdaemon to do the appropriate steps for logging the vendor error
>> information depending on whether a corresponding kernel driver has
>> handled the error or not.
>
>What's the difference for rasdaemon whether the error has been handled or
>not?
Following are some of the examples of the usage of error handled status
in the vendor specific code of the rasdaemon,
1. rasdaemon need not to print the vendor error data reported by the firmware if the 
    kernel driver already print those information. In this case rasdaemon will only need to store
    the decoded vendor error data to the SQL database.  
2. If the vendor kernel driver want to report extra error information through
    the vendor specific data (though presently we do not have any such use case) for the rasdamon to log. 
    I think the error handled status useful to indicate that the kernel driver has filled the extra information and
    rasdaemon to decode and log them after extra data specific validity check.
      
>
>--
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Thanks,
Shiju
Borislav Petkov March 31, 2020, 9:09 a.m. UTC | #7
On Mon, Mar 30, 2020 at 03:44:29PM +0000, Shiju Jose wrote:
> 1. rasdaemon need not to print the vendor error data reported by the firmware if the 
>     kernel driver already print those information. In this case rasdaemon will only need to store
>     the decoded vendor error data to the SQL database.

Well, there's a problem with this:

rasdaemon printing != kernel driver printing

Because printing in dmesg would need people to go grep dmesg.

Printing through rasdaemon or any userspace agent, OTOH, is a lot more
flexible wrt analyzing and collecting those error records. Especially
if you are a data center admin and you want to collect all your error
records: grepping dmesg simply doesn't scale versus all the rasdaemon
agents reporting to a centrallized location.

> 2. If the vendor kernel driver want to report extra error information through
>     the vendor specific data (though presently we do not have any such use case) for the rasdamon to log. 
>     I think the error handled status useful to indicate that the kernel driver has filled the extra information and
>     rasdaemon to decode and log them after extra data specific validity check.

The kernel driver can report that extra information without the kernel
saying that the error was handled.

So I still see no sense for the kernel to tell userspace explicitly that
it handled the error. There might be a valid reason, though, of which I
cannot think of right now.

Thx.
Shiju Jose April 8, 2020, 9:20 a.m. UTC | #8
Hi Boris,

>-----Original Message-----
>From: Borislav Petkov [mailto:bp@alien8.de]
>Sent: 31 March 2020 10:09
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-acpi@vger.kernel.org; linux-pci@vger.kernel.org; linux-
>kernel@vger.kernel.org; rjw@rjwysocki.net; helgaas@kernel.org;
>lenb@kernel.org; james.morse@arm.com; tony.luck@intel.com;
>gregkh@linuxfoundation.org; zhangliguang@linux.alibaba.com;
>tglx@linutronix.de; Linuxarm <linuxarm@huawei.com>; Jonathan Cameron
><jonathan.cameron@huawei.com>; tanxiaofei <tanxiaofei@huawei.com>;
>yangyicong <yangyicong@huawei.com>
>Subject: Re: [PATCH v6 1/2] ACPI / APEI: Add support to notify the vendor
>specific HW errors
>
>On Mon, Mar 30, 2020 at 03:44:29PM +0000, Shiju Jose wrote:
>> 1. rasdaemon need not to print the vendor error data reported by the
>firmware if the
>>     kernel driver already print those information. In this case rasdaemon will
>only need to store
>>     the decoded vendor error data to the SQL database.
>
>Well, there's a problem with this:
>
>rasdaemon printing != kernel driver printing
>
>Because printing in dmesg would need people to go grep dmesg.
>
>Printing through rasdaemon or any userspace agent, OTOH, is a lot more
>flexible wrt analyzing and collecting those error records. Especially if you are a
>data center admin and you want to collect all your error
>records: grepping dmesg simply doesn't scale versus all the rasdaemon
>agents reporting to a centrallized location.
Ok.
I posted V7 of this series.  
"[v7 PATCH 0/6] ACPI / APEI: Add support to notify non-fatal HW errors"

>
>> 2. If the vendor kernel driver want to report extra error information
>through
>>     the vendor specific data (though presently we do not have any such use
>case) for the rasdamon to log.
>>     I think the error handled status useful to indicate that the kernel driver
>has filled the extra information and
>>     rasdaemon to decode and log them after extra data specific validity
>check.
>
>The kernel driver can report that extra information without the kernel saying
>that the error was handled.
>
>So I still see no sense for the kernel to tell userspace explicitly that it handled
>the error. There might be a valid reason, though, of which I cannot think of
>right now.
Ok.

>
>Thx.
>
>--
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Thanks,
Shiju
James Morse April 8, 2020, 10:03 a.m. UTC | #9
Hi Boris, Shiju,

Sorry for not spotting this reply earlier: Its in-reply to v1, so gets buried.

On 27/03/2020 18:22, Borislav Petkov wrote:
> On Wed, Mar 25, 2020 at 04:42:22PM +0000, Shiju Jose wrote:
>> Presently APEI does not support reporting the vendor specific
>> HW errors, received in the vendor defined table entries, to the
>> vendor drivers for any recovery.
>>
>> This patch adds the support to register and unregister the
> 
> Avoid having "This patch" or "This commit" in the commit message. It is
> tautologically useless.
> 
> Also, do
> 
> $ git grep 'This patch' Documentation/process
> 
> for more details.
> 
>> error handling function for the vendor specific HW errors and
>> notify the registered kernel driver.

>> @@ -526,10 +552,17 @@ static void ghes_do_proc(struct ghes *ghes,
>>  			log_arm_hw_error(err);
>>  		} else {
>>  			void *err = acpi_hest_get_payload(gdata);
>> +			u8 error_handled = false;
>> +			int ret;
>> +
>> +			ret = atomic_notifier_call_chain(&ghes_event_notify_list, 0, gdata);
> 
> Well, this is a notifier with standard name for a non-standard event.
> Not optimal.
> 
> Why does only this event need a notifier? Because your driver is
> interested in only those events?

Its the 'else' catch-all for stuff drivers/acpi/apei  doesn't know to handle.

In this case its because its a vendor specific GUID that only the vendor driver knows how
to parse.


>> +			if (ret & NOTIFY_OK)
>> +				error_handled = true;
>>  
>>  			log_non_standard_event(sec_type, fru_id, fru_text,
>>  					       sec_sev, err,
>> -					       gdata->error_data_length);
>> +					       gdata->error_data_length,
>> +					       error_handled);
> 
> What's that error_handled thing for? That's just silly.
> 
> Your notifier returns NOTIFY_STOP when it has queued the error. If you
> don't want to log it, just test == NOTIFY_STOP and do not log it then.

My thinking for this being needed was so user-space consumers of those tracepoints keep
working. Otherwise you upgrade, get this feature, and your user-space counters stop working.

You'd need to know this error source was now managed by an in-kernel driver, which may
report the errors somewhere else...


> Then your notifier callback is queuing the error into a kfifo for
> whatever reason and then scheduling a workqueue to handle it in user
> context...
> 
> So I'm thinking that it would be better if you:
> 
> * make that kfifo generic and part of ghes.c and queue all types of
> error records into it in ghes_do_proc() - not just the non-standard
> ones.

Move the drop to process context into ghes.c? This should result in less code.

I asked for this hooking to only be for the 'catch all' don't-know case so that we don't
get drivers trying to hook and handle memory errors. (if we ever wanted that, it should be
from part of memory_failure() so it catches all the ways of reporting memory-failure)
32bit arm has prior in this area.


> * then, when you're done queuing, you kick a workqueue.
> 
> * that workqueue runs a normal, blocking notifier to which drivers
> register.
> 
> Your driver can register to that notifier too and do the normal handling
> then and not have this ad-hoc, semi-generic, semi-vendor-specific thing.

As long as we don't walk a list of things that might handle a memory-error, and have some
random driver try and NOTIFY_STOP it....

aer_recover_queue() would be replaced by this. memory_failure_queue() has one additional
caller in drivers/ras/cec.c.


Thanks,

James
Shiju Jose April 21, 2020, 1:18 p.m. UTC | #10
Hi James,

>-----Original Message-----
>From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-
>owner@vger.kernel.org] On Behalf Of James Morse
>Sent: 08 April 2020 11:03
>To: Borislav Petkov <bp@alien8.de>; Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-acpi@vger.kernel.org; linux-pci@vger.kernel.org; linux-
>kernel@vger.kernel.org; rjw@rjwysocki.net; helgaas@kernel.org;
>lenb@kernel.org; tony.luck@intel.com; gregkh@linuxfoundation.org;
>zhangliguang@linux.alibaba.com; tglx@linutronix.de; Linuxarm
><linuxarm@huawei.com>; Jonathan Cameron
><jonathan.cameron@huawei.com>; tanxiaofei <tanxiaofei@huawei.com>;
>yangyicong <yangyicong@huawei.com>
>Subject: Re: [PATCH v6 1/2] ACPI / APEI: Add support to notify the vendor
>specific HW errors
>
>Hi Boris, Shiju,
>
>Sorry for not spotting this reply earlier: Its in-reply to v1, so gets buried.
I will resend the v7 patch solving this issue.
I guess the remaining  questions here are for Boris. May be can we discuss
your comments with V7 patch, which I will send?

>
>On 27/03/2020 18:22, Borislav Petkov wrote:
>> On Wed, Mar 25, 2020 at 04:42:22PM +0000, Shiju Jose wrote:
>>> Presently APEI does not support reporting the vendor specific HW
>>> errors, received in the vendor defined table entries, to the vendor
>>> drivers for any recovery.
>>>
>>> This patch adds the support to register and unregister the
>>
>> Avoid having "This patch" or "This commit" in the commit message. It
>> is tautologically useless.
>>
>> Also, do
>>
>> $ git grep 'This patch' Documentation/process
>>
>> for more details.
>>
>>> error handling function for the vendor specific HW errors and notify
>>> the registered kernel driver.
>
>>> @@ -526,10 +552,17 @@ static void ghes_do_proc(struct ghes *ghes,
>>>  			log_arm_hw_error(err);
>>>  		} else {
>>>  			void *err = acpi_hest_get_payload(gdata);
>>> +			u8 error_handled = false;
>>> +			int ret;
>>> +
>>> +			ret =
>atomic_notifier_call_chain(&ghes_event_notify_list, 0,
>>> +gdata);
>>
>> Well, this is a notifier with standard name for a non-standard event.
>> Not optimal.
>>
>> Why does only this event need a notifier? Because your driver is
>> interested in only those events?
>
>Its the 'else' catch-all for stuff drivers/acpi/apei  doesn't know to handle.
>
>In this case its because its a vendor specific GUID that only the vendor driver
>knows how to parse.
>
>
>>> +			if (ret & NOTIFY_OK)
>>> +				error_handled = true;
>>>
>>>  			log_non_standard_event(sec_type, fru_id, fru_text,
>>>  					       sec_sev, err,
>>> -					       gdata->error_data_length);
>>> +					       gdata->error_data_length,
>>> +					       error_handled);
>>
>> What's that error_handled thing for? That's just silly.
>>
>> Your notifier returns NOTIFY_STOP when it has queued the error. If you
>> don't want to log it, just test == NOTIFY_STOP and do not log it then.
>
>My thinking for this being needed was so user-space consumers of those
>tracepoints keep working. Otherwise you upgrade, get this feature, and your
>user-space counters stop working.
>
>You'd need to know this error source was now managed by an in-kernel
>driver, which may report the errors somewhere else...
>
>
>> Then your notifier callback is queuing the error into a kfifo for
>> whatever reason and then scheduling a workqueue to handle it in user
>> context...
>>
>> So I'm thinking that it would be better if you:
>>
>> * make that kfifo generic and part of ghes.c and queue all types of
>> error records into it in ghes_do_proc() - not just the non-standard
>> ones.
>
>Move the drop to process context into ghes.c? This should result in less code.
>
>I asked for this hooking to only be for the 'catch all' don't-know case so that
>we don't get drivers trying to hook and handle memory errors. (if we ever
>wanted that, it should be from part of memory_failure() so it catches all the
>ways of reporting memory-failure) 32bit arm has prior in this area.
>
>
>> * then, when you're done queuing, you kick a workqueue.
>>
>> * that workqueue runs a normal, blocking notifier to which drivers
>> register.
>>
>> Your driver can register to that notifier too and do the normal
>> handling then and not have this ad-hoc, semi-generic, semi-vendor-specific
>thing.
>
>As long as we don't walk a list of things that might handle a memory-error,
>and have some random driver try and NOTIFY_STOP it....
>
>aer_recover_queue() would be replaced by this. memory_failure_queue() has
>one additional caller in drivers/ras/cec.c.
>
>
>Thanks,
>
>James
Thanks,
Shiju
Shiju Jose May 11, 2020, 11:20 a.m. UTC | #11
Hi Boris, Hi James,

>-----Original Message-----
>From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-
>owner@vger.kernel.org] On Behalf Of James Morse
>Sent: 08 April 2020 11:03
>To: Borislav Petkov <bp@alien8.de>; Shiju Jose <shiju.jose@huawei.com>
>Cc: linux-acpi@vger.kernel.org; linux-pci@vger.kernel.org; linux-
>kernel@vger.kernel.org; rjw@rjwysocki.net; helgaas@kernel.org;
>lenb@kernel.org; tony.luck@intel.com; gregkh@linuxfoundation.org;
>zhangliguang@linux.alibaba.com; tglx@linutronix.de; Linuxarm
><linuxarm@huawei.com>; Jonathan Cameron
><jonathan.cameron@huawei.com>; tanxiaofei <tanxiaofei@huawei.com>;
>yangyicong <yangyicong@huawei.com>
>Subject: Re: [PATCH v6 1/2] ACPI / APEI: Add support to notify the vendor
>specific HW errors
>
>Hi Boris, Shiju,
>
>Sorry for not spotting this reply earlier: Its in-reply to v1, so gets buried.
>
>On 27/03/2020 18:22, Borislav Petkov wrote:
>> On Wed, Mar 25, 2020 at 04:42:22PM +0000, Shiju Jose wrote:
>>> Presently APEI does not support reporting the vendor specific HW
>>> errors, received in the vendor defined table entries, to the vendor
>>> drivers for any recovery.
>>>
>>> This patch adds the support to register and unregister the
>>
>> Avoid having "This patch" or "This commit" in the commit message. It
>> is tautologically useless.
>>
>> Also, do
>>
>> $ git grep 'This patch' Documentation/process
>>
>> for more details.
>>
>>> error handling function for the vendor specific HW errors and notify
>>> the registered kernel driver.
>
>>> @@ -526,10 +552,17 @@ static void ghes_do_proc(struct ghes *ghes,
>>>  			log_arm_hw_error(err);
>>>  		} else {
>>>  			void *err = acpi_hest_get_payload(gdata);
>>> +			u8 error_handled = false;
>>> +			int ret;
>>> +
>>> +			ret =
>atomic_notifier_call_chain(&ghes_event_notify_list, 0,
>>> +gdata);
>>
>> Well, this is a notifier with standard name for a non-standard event.
>> Not optimal.
>>
>> Why does only this event need a notifier? Because your driver is
>> interested in only those events?
>
>Its the 'else' catch-all for stuff drivers/acpi/apei  doesn't know to handle.
>
>In this case its because its a vendor specific GUID that only the vendor driver
>knows how to parse.
>
>
>>> +			if (ret & NOTIFY_OK)
>>> +				error_handled = true;
>>>
>>>  			log_non_standard_event(sec_type, fru_id, fru_text,
>>>  					       sec_sev, err,
>>> -					       gdata->error_data_length);
>>> +					       gdata->error_data_length,
>>> +					       error_handled);
>>
>> What's that error_handled thing for? That's just silly.
>>
>> Your notifier returns NOTIFY_STOP when it has queued the error. If you
>> don't want to log it, just test == NOTIFY_STOP and do not log it then.
>
>My thinking for this being needed was so user-space consumers of those
>tracepoints keep working. Otherwise you upgrade, get this feature, and your
>user-space counters stop working.
>
>You'd need to know this error source was now managed by an in-kernel
>driver, which may report the errors somewhere else...
>
>
>> Then your notifier callback is queuing the error into a kfifo for
>> whatever reason and then scheduling a workqueue to handle it in user
>> context...
>>
>> So I'm thinking that it would be better if you:
>>
>> * make that kfifo generic and part of ghes.c and queue all types of
>> error records into it in ghes_do_proc() - not just the non-standard
>> ones.
>
>Move the drop to process context into ghes.c? This should result in less code.
>
>I asked for this hooking to only be for the 'catch all' don't-know case so that
>we don't get drivers trying to hook and handle memory errors. (if we ever
>wanted that, it should be from part of memory_failure() so it catches all the
>ways of reporting memory-failure) 32bit arm has prior in this area.
>
>
>> * then, when you're done queuing, you kick a workqueue.
>>
>> * that workqueue runs a normal, blocking notifier to which drivers
>> register.
>>
>> Your driver can register to that notifier too and do the normal
>> handling then and not have this ad-hoc, semi-generic, semi-vendor-specific
>thing.
>
>As long as we don't walk a list of things that might handle a memory-error,
>and have some random driver try and NOTIFY_STOP it....
>
>aer_recover_queue() would be replaced by this. memory_failure_queue() has
>one additional caller in drivers/ras/cec.c.

Can you suggest whether the standard errors can report through the 
notifier (kfifo + blocking notifier), [which implemented  in V7 patch], or not 
so that we can proceed with the changes to notify the vendor specific errors?

>
>
>Thanks,
>
>James

Thanks,
Shiju
diff mbox series

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 24c9642e8fc7..d83f0b1aad0d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -490,6 +490,32 @@  static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
+static ATOMIC_NOTIFIER_HEAD(ghes_event_notify_list);
+
+/**
+ * ghes_register_event_notifier - register an event notifier
+ * for the non-fatal HW errors.
+ * @nb: pointer to the notifier_block structure of the event handler.
+ *
+ * return 0 : SUCCESS, non-zero : FAIL
+ */
+int ghes_register_event_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&ghes_event_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(ghes_register_event_notifier);
+
+/**
+ * ghes_unregister_event_notifier - unregister the previously
+ * registered event notifier.
+ * @nb: pointer to the notifier_block structure of the event handler.
+ */
+void ghes_unregister_event_notifier(struct notifier_block *nb)
+{
+	atomic_notifier_chain_unregister(&ghes_event_notify_list, nb);
+}
+EXPORT_SYMBOL_GPL(ghes_unregister_event_notifier);
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -526,10 +552,17 @@  static void ghes_do_proc(struct ghes *ghes,
 			log_arm_hw_error(err);
 		} else {
 			void *err = acpi_hest_get_payload(gdata);
+			u8 error_handled = false;
+			int ret;
+
+			ret = atomic_notifier_call_chain(&ghes_event_notify_list, 0, gdata);
+			if (ret & NOTIFY_OK)
+				error_handled = true;
 
 			log_non_standard_event(sec_type, fru_id, fru_text,
 					       sec_sev, err,
-					       gdata->error_data_length);
+					       gdata->error_data_length,
+					       error_handled);
 		}
 	}
 }
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index 95540ea8dd9d..0ed784a8466e 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -16,9 +16,10 @@ 
 
 void log_non_standard_event(const guid_t *sec_type, const guid_t *fru_id,
 			    const char *fru_text, const u8 sev, const u8 *err,
-			    const u32 len)
+			    const u32 len, const u8 error_handled)
 {
-	trace_non_standard_event(sec_type, fru_id, fru_text, sev, err, len);
+	trace_non_standard_event(sec_type, fru_id, fru_text, sev,
+				 err, len, error_handled);
 }
 
 void log_arm_hw_error(struct cper_sec_proc_arm *err)
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index e3f1cddb4ac8..a3dd82069069 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -50,6 +50,34 @@  enum {
 	GHES_SEV_PANIC = 0x3,
 };
 
+
+#ifdef CONFIG_ACPI_APEI_GHES
+/**
+ * ghes_register_event_notifier - register an event notifier
+ * for the non-fatal HW errors.
+ * @nb: pointer to the notifier_block structure of the event notifier.
+ *
+ * Return : 0 - SUCCESS, non-zero - FAIL.
+ */
+int ghes_register_event_notifier(struct notifier_block *nb);
+
+/**
+ * ghes_unregister_event_notifier - unregister the previously
+ * registered event notifier.
+ * @nb: pointer to the notifier_block structure of the event notifier.
+ */
+void ghes_unregister_event_notifier(struct notifier_block *nb);
+#else
+static inline int ghes_register_event_notifier(struct notifier_block *nb)
+{
+	return -ENODEV;
+}
+
+static inline void ghes_unregister_event_notifier(struct notifier_block *nb)
+{
+}
+#endif
+
 int ghes_estatus_pool_init(int num_ghes);
 
 /* From drivers/edac/ghes_edac.c */
diff --git a/include/linux/ras.h b/include/linux/ras.h
index 7c3debb47c87..6ed3c67ab905 100644
--- a/include/linux/ras.h
+++ b/include/linux/ras.h
@@ -28,13 +28,15 @@  static inline int cec_add_elem(u64 pfn)		{ return -ENODEV; }
 #ifdef CONFIG_RAS
 void log_non_standard_event(const guid_t *sec_type,
 			    const guid_t *fru_id, const char *fru_text,
-			    const u8 sev, const u8 *err, const u32 len);
+			    const u8 sev, const u8 *err, const u32 len,
+			    const u8 error_handled);
 void log_arm_hw_error(struct cper_sec_proc_arm *err);
 #else
 static inline void
 log_non_standard_event(const guid_t *sec_type,
 		       const guid_t *fru_id, const char *fru_text,
-		       const u8 sev, const u8 *err, const u32 len)
+		       const u8 sev, const u8 *err, const u32 len,
+		       const u8 error_handled);
 { return; }
 static inline void
 log_arm_hw_error(struct cper_sec_proc_arm *err) { return; }
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 36c5c5e38c1d..38fd05d82d8e 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -223,9 +223,10 @@  TRACE_EVENT(non_standard_event,
 		 const char *fru_text,
 		 const u8 sev,
 		 const u8 *err,
-		 const u32 len),
+		 const u32 len,
+		 const u8 error_handled),
 
-	TP_ARGS(sec_type, fru_id, fru_text, sev, err, len),
+	TP_ARGS(sec_type, fru_id, fru_text, sev, err, len, error_handled),
 
 	TP_STRUCT__entry(
 		__array(char, sec_type, UUID_SIZE)
@@ -234,6 +235,7 @@  TRACE_EVENT(non_standard_event,
 		__field(u8, sev)
 		__field(u32, len)
 		__dynamic_array(u8, buf, len)
+		__field(u8, error_handled)
 	),
 
 	TP_fast_assign(
@@ -243,6 +245,7 @@  TRACE_EVENT(non_standard_event,
 		__entry->sev = sev;
 		__entry->len = len;
 		memcpy(__get_dynamic_array(buf), err, len);
+		__entry->error_handled = error_handled;
 	),
 
 	TP_printk("severity: %d; sec type:%pU; FRU: %pU %s; data len:%d; raw data:%s",