diff mbox series

PCI/AER: correctable error message as KERN_INFO

Message ID 20230301060453.4031503-1-grundler@chromium.org (mailing list archive)
State Handled Elsewhere
Headers show
Series PCI/AER: correctable error message as KERN_INFO | expand

Checks

Context Check Description
snowpatch_ozlabs/github-powerpc_sparse success Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_clang success Successfully ran 6 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu success Successfully ran 24 jobs.

Commit Message

Grant Grundler March 1, 2023, 6:04 a.m. UTC
Since correctable errors have been corrected (and counted), the dmesg output
should not be reported as a warning, but rather as "informational".

Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
station, the dmesg buffer can be spammed with correctable errors, 717 bytes
per instance, potentially many MB per day.

Given the "WARN" priority, these messages have already confused the typical
user that stumbles across them, support staff (triaging feedback reports),
and more than a few linux kernel devs. Changing to INFO will hide these
messages from most audiences.

Signed-off-by: Grant Grundler <grundler@chromium.org>
---
This patch will likely conflict with:
  https://lore.kernel.org/all/20230103165548.570377-1-rajat.khandelwal@linux.intel.com/

which I'd also like to see upstream. Please let me know to resubmit mine if Rajat's patch lands first. Or feel free to fix up this one.

 drivers/pci/pcie/aer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Grant Grundler March 8, 2023, 8 p.m. UTC | #1
Ping? Did I miss an email or other work that this patch collides with?

cheers,
grant

On Tue, Feb 28, 2023 at 10:05 PM Grant Grundler <grundler@chromium.org> wrote:
>
> Since correctable errors have been corrected (and counted), the dmesg output
> should not be reported as a warning, but rather as "informational".
>
> Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
> station, the dmesg buffer can be spammed with correctable errors, 717 bytes
> per instance, potentially many MB per day.
>
> Given the "WARN" priority, these messages have already confused the typical
> user that stumbles across them, support staff (triaging feedback reports),
> and more than a few linux kernel devs. Changing to INFO will hide these
> messages from most audiences.
>
> Signed-off-by: Grant Grundler <grundler@chromium.org>
> ---
> This patch will likely conflict with:
>   https://lore.kernel.org/all/20230103165548.570377-1-rajat.khandelwal@linux.intel.com/
>
> which I'd also like to see upstream. Please let me know to resubmit mine if Rajat's patch lands first. Or feel free to fix up this one.
>
>  drivers/pci/pcie/aer.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index f6c24ded134c..e4cf3ec40d66 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev,
>
>         if (info->severity == AER_CORRECTABLE) {
>                 strings = aer_correctable_error_string;
> -               level = KERN_WARNING;
> +               level = KERN_INFO;
>         } else {
>                 strings = aer_uncorrectable_error_string;
>                 level = KERN_ERR;
> @@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>         layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>         agent = AER_GET_AGENT(info->severity, info->status);
>
> -       level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> +       level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
>
>         pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
>                    aer_error_severity_string[info->severity],
> --
> 2.39.2.722.g9855ee24e9-goog
>
Bjorn Helgaas March 8, 2023, 8:18 p.m. UTC | #2
On Wed, Mar 08, 2023 at 12:00:48PM -0800, Grant Grundler wrote:
> Ping? Did I miss an email or other work that this patch collides with?

Nope, we typically make topic branches based on -rc1, so not much
happens during the merge window.  -rc1 was tagged Sunday, so things
will start appearing in -next soon.

Bjorn

> On Tue, Feb 28, 2023 at 10:05 PM Grant Grundler <grundler@chromium.org> wrote:
> >
> > Since correctable errors have been corrected (and counted), the dmesg output
> > should not be reported as a warning, but rather as "informational".
> >
> > Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
> > station, the dmesg buffer can be spammed with correctable errors, 717 bytes
> > per instance, potentially many MB per day.
> >
> > Given the "WARN" priority, these messages have already confused the typical
> > user that stumbles across them, support staff (triaging feedback reports),
> > and more than a few linux kernel devs. Changing to INFO will hide these
> > messages from most audiences.
> >
> > Signed-off-by: Grant Grundler <grundler@chromium.org>
> > ---
> > This patch will likely conflict with:
> >   https://lore.kernel.org/all/20230103165548.570377-1-rajat.khandelwal@linux.intel.com/
> >
> > which I'd also like to see upstream. Please let me know to resubmit mine if Rajat's patch lands first. Or feel free to fix up this one.
> >
> >  drivers/pci/pcie/aer.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index f6c24ded134c..e4cf3ec40d66 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev,
> >
> >         if (info->severity == AER_CORRECTABLE) {
> >                 strings = aer_correctable_error_string;
> > -               level = KERN_WARNING;
> > +               level = KERN_INFO;
> >         } else {
> >                 strings = aer_uncorrectable_error_string;
> >                 level = KERN_ERR;
> > @@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> >         layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> >         agent = AER_GET_AGENT(info->severity, info->status);
> >
> > -       level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> > +       level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
> >
> >         pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> >                    aer_error_severity_string[info->severity],
> > --
> > 2.39.2.722.g9855ee24e9-goog
> >
Grant Grundler March 8, 2023, 8:23 p.m. UTC | #3
On Wed, Mar 8, 2023 at 12:18 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Mar 08, 2023 at 12:00:48PM -0800, Grant Grundler wrote:
> > Ping? Did I miss an email or other work that this patch collides with?
>
> Nope, we typically make topic branches based on -rc1, so not much
> happens during the merge window.  -rc1 was tagged Sunday, so things
> will start appearing in -next soon.

Ah ok! Thanks for clarifying Bjorn!

cheers,
grant

>
> Bjorn
>
> > On Tue, Feb 28, 2023 at 10:05 PM Grant Grundler <grundler@chromium.org> wrote:
> > >
> > > Since correctable errors have been corrected (and counted), the dmesg output
> > > should not be reported as a warning, but rather as "informational".
> > >
> > > Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
> > > station, the dmesg buffer can be spammed with correctable errors, 717 bytes
> > > per instance, potentially many MB per day.
> > >
> > > Given the "WARN" priority, these messages have already confused the typical
> > > user that stumbles across them, support staff (triaging feedback reports),
> > > and more than a few linux kernel devs. Changing to INFO will hide these
> > > messages from most audiences.
> > >
> > > Signed-off-by: Grant Grundler <grundler@chromium.org>
> > > ---
> > > This patch will likely conflict with:
> > >   https://lore.kernel.org/all/20230103165548.570377-1-rajat.khandelwal@linux.intel.com/
> > >
> > > which I'd also like to see upstream. Please let me know to resubmit mine if Rajat's patch lands first. Or feel free to fix up this one.
> > >
> > >  drivers/pci/pcie/aer.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > > index f6c24ded134c..e4cf3ec40d66 100644
> > > --- a/drivers/pci/pcie/aer.c
> > > +++ b/drivers/pci/pcie/aer.c
> > > @@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev,
> > >
> > >         if (info->severity == AER_CORRECTABLE) {
> > >                 strings = aer_correctable_error_string;
> > > -               level = KERN_WARNING;
> > > +               level = KERN_INFO;
> > >         } else {
> > >                 strings = aer_uncorrectable_error_string;
> > >                 level = KERN_ERR;
> > > @@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> > >         layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> > >         agent = AER_GET_AGENT(info->severity, info->status);
> > >
> > > -       level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> > > +       level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
> > >
> > >         pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> > >                    aer_error_severity_string[info->severity],
> > > --
> > > 2.39.2.722.g9855ee24e9-goog
> > >
Bjorn Helgaas March 14, 2023, 7:38 p.m. UTC | #4
On Tue, Feb 28, 2023 at 10:04:53PM -0800, Grant Grundler wrote:
> Since correctable errors have been corrected (and counted), the dmesg output
> should not be reported as a warning, but rather as "informational".
> 
> Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
> station, the dmesg buffer can be spammed with correctable errors, 717 bytes
> per instance, potentially many MB per day.
> 
> Given the "WARN" priority, these messages have already confused the typical
> user that stumbles across them, support staff (triaging feedback reports),
> and more than a few linux kernel devs. Changing to INFO will hide these
> messages from most audiences.
> 
> Signed-off-by: Grant Grundler <grundler@chromium.org>
> ---
> This patch will likely conflict with:
>   https://lore.kernel.org/all/20230103165548.570377-1-rajat.khandelwal@linux.intel.com/
> 
> which I'd also like to see upstream. Please let me know to resubmit
> mine if Rajat's patch lands first. Or feel free to fix up this one.

Yes.  I think it makes sense to separate this into two patches:

  1) Log correctable errors as KERN_INFO instead of KERN_WARNING, and

  2) Rate-limit correctable error logging.

>  drivers/pci/pcie/aer.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index f6c24ded134c..e4cf3ec40d66 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev,
>  
>  	if (info->severity == AER_CORRECTABLE) {
>  		strings = aer_correctable_error_string;
> -		level = KERN_WARNING;
> +		level = KERN_INFO;
>  	} else {
>  		strings = aer_uncorrectable_error_string;
>  		level = KERN_ERR;
> @@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>  	agent = AER_GET_AGENT(info->severity, info->status);
>  
> -	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> +	level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
>  
>  	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
>  		   aer_error_severity_string[info->severity],

Shouldn't we do the same in the cper_print_aer() path?  That path
currently uses pci_err() and then calls __aer_print_error(), so the
initial message will always be KERN_ERR, and the decoding done by
__aer_print_error() will be KERN_INFO (for correctable) or KERN_ERR.

Seems like a shame to do the same test in three places, but would
require a little more refactoring to avoid that.

Bjorn
Grant Grundler March 15, 2023, 12:24 a.m. UTC | #5
On Tue, Mar 14, 2023 at 12:38 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Tue, Feb 28, 2023 at 10:04:53PM -0800, Grant Grundler wrote:
> > Since correctable errors have been corrected (and counted), the dmesg output
> > should not be reported as a warning, but rather as "informational".
> >
> > Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
> > station, the dmesg buffer can be spammed with correctable errors, 717 bytes
> > per instance, potentially many MB per day.
> >
> > Given the "WARN" priority, these messages have already confused the typical
> > user that stumbles across them, support staff (triaging feedback reports),
> > and more than a few linux kernel devs. Changing to INFO will hide these
> > messages from most audiences.
> >
> > Signed-off-by: Grant Grundler <grundler@chromium.org>
> > ---
> > This patch will likely conflict with:
> >   https://lore.kernel.org/all/20230103165548.570377-1-rajat.khandelwal@linux.intel.com/
> >
> > which I'd also like to see upstream. Please let me know to resubmit
> > mine if Rajat's patch lands first. Or feel free to fix up this one.
>
> Yes.  I think it makes sense to separate this into two patches:
>
>   1) Log correctable errors as KERN_INFO instead of KERN_WARNING, and
>   2) Rate-limit correctable error logging.

I'm going to look into your comment below. I'll port Rajat's patch on
top of mine to follow the order you've listed above.

> >  drivers/pci/pcie/aer.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index f6c24ded134c..e4cf3ec40d66 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev,
> >
> >       if (info->severity == AER_CORRECTABLE) {
> >               strings = aer_correctable_error_string;
> > -             level = KERN_WARNING;
> > +             level = KERN_INFO;
> >       } else {
> >               strings = aer_uncorrectable_error_string;
> >               level = KERN_ERR;
> > @@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> >       layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> >       agent = AER_GET_AGENT(info->severity, info->status);
> >
> > -     level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> > +     level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
> >
> >       pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> >                  aer_error_severity_string[info->severity],
>
> Shouldn't we do the same in the cper_print_aer() path?  That path
> currently uses pci_err() and then calls __aer_print_error(), so the
> initial message will always be KERN_ERR, and the decoding done by
> __aer_print_error() will be KERN_INFO (for correctable) or KERN_ERR.

I was completely unaware of this since it's not causing me any
immediate problems. But I agree the message priority should be
consistent for correctable errors.

> Seems like a shame to do the same test in three places, but would
> require a little more refactoring to avoid that.

I don't mind doing the same test in multiple places. If refactoring
this isn't straight forward, I'll leave the refactoring for someone
more ambitious. :D

cheers,
grant

>
> Bjorn
Grant Grundler March 17, 2023, 5:57 p.m. UTC | #6
On Tue, Mar 14, 2023 at 5:24 PM Grant Grundler <grundler@chromium.org> wrote:
>
> On Tue, Mar 14, 2023 at 12:38 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > On Tue, Feb 28, 2023 at 10:04:53PM -0800, Grant Grundler wrote:
> > > Since correctable errors have been corrected (and counted), the dmesg output
> > > should not be reported as a warning, but rather as "informational".
> > >
> > > Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
> > > station, the dmesg buffer can be spammed with correctable errors, 717 bytes
> > > per instance, potentially many MB per day.
> > >
> > > Given the "WARN" priority, these messages have already confused the typical
> > > user that stumbles across them, support staff (triaging feedback reports),
> > > and more than a few linux kernel devs. Changing to INFO will hide these
> > > messages from most audiences.
> > >
> > > Signed-off-by: Grant Grundler <grundler@chromium.org>
> > > ---
> > > This patch will likely conflict with:
> > >   https://lore.kernel.org/all/20230103165548.570377-1-rajat.khandelwal@linux.intel.com/
> > >
> > > which I'd also like to see upstream. Please let me know to resubmit
> > > mine if Rajat's patch lands first. Or feel free to fix up this one.
> >
> > Yes.  I think it makes sense to separate this into two patches:
> >
> >   1) Log correctable errors as KERN_INFO instead of KERN_WARNING, and
> >   2) Rate-limit correctable error logging.
>
> I'm going to look into your comment below. I'll port Rajat's patch on
> top of mine to follow the order you've listed above.
>
> > >  drivers/pci/pcie/aer.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > > index f6c24ded134c..e4cf3ec40d66 100644
> > > --- a/drivers/pci/pcie/aer.c
> > > +++ b/drivers/pci/pcie/aer.c
> > > @@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev,
> > >
> > >       if (info->severity == AER_CORRECTABLE) {
> > >               strings = aer_correctable_error_string;
> > > -             level = KERN_WARNING;
> > > +             level = KERN_INFO;
> > >       } else {
> > >               strings = aer_uncorrectable_error_string;
> > >               level = KERN_ERR;
> > > @@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> > >       layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> > >       agent = AER_GET_AGENT(info->severity, info->status);
> > >
> > > -     level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> > > +     level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
> > >
> > >       pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> > >                  aer_error_severity_string[info->severity],
> >
> > Shouldn't we do the same in the cper_print_aer() path?  That path
> > currently uses pci_err() and then calls __aer_print_error(), so the
> > initial message will always be KERN_ERR, and the decoding done by
> > __aer_print_error() will be KERN_INFO (for correctable) or KERN_ERR.
>
> I was completely unaware of this since it's not causing me any
> immediate problems. But I agree the message priority should be
> consistent for correctable errors.

I've just posted a V2 which I believe is against "pci-next":

grundler <1607>git remote -v show pci-next
* remote pci-next
  Fetch URL: git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git
  Push  URL: git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git
  HEAD branch: main
  Remote branches:
    aer                   tracked
    controller/dt         tracked
    controller/kirin      tracked
    controller/layerscape tracked
    controller/rcar       tracked
    for-linus             tracked
    main                  tracked
    next                  tracked
  Local branch configured for 'git pull':
    aer_correctable_info merges with remote next

Please let me know if this is the wrong git tree and branch to track.

> > Seems like a shame to do the same test in three places, but would
> > require a little more refactoring to avoid that.
>
> I don't mind doing the same test in multiple places. If refactoring
> this isn't straight forward, I'll leave the refactoring for someone
> more ambitious. :D

I've moved one of the pci_info lines from cper_print_aer()  to
__aer_print_info() since the status/mask are the same for both paths
that invoke __aer_print_info(). But that's as far as I understand what
each of the paths that calls __aer_print_info() do.  If this is not
OK, I can move it back.

cheers,
grant
diff mbox series

Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f6c24ded134c..e4cf3ec40d66 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -692,7 +692,7 @@  static void __aer_print_error(struct pci_dev *dev,
 
 	if (info->severity == AER_CORRECTABLE) {
 		strings = aer_correctable_error_string;
-		level = KERN_WARNING;
+		level = KERN_INFO;
 	} else {
 		strings = aer_uncorrectable_error_string;
 		level = KERN_ERR;
@@ -724,7 +724,7 @@  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
-	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
+	level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
 
 	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
 		   aer_error_severity_string[info->severity],