
PCI: aer_inject: Log actual error causes

Message ID 20160126095205.0e5923bd@endymion.delvare
State Changes Requested

Commit Message

Jean Delvare Jan. 26, 2016, 8:52 a.m. UTC
The aer_inject driver is very quiet. In most cases, it merely returns
an error code to user-space, leaving the user with little clue about
the actual reason for the failure.

So, log error messages for 4 of the most frequent causes of failure:
* Can't find the root port of the specified device.
* Device doesn't support AER.
* Root port doesn't support AER.
* AER device not found.
This gives the user a chance to understand why aer-inject failed.

Based on a preliminary patch by Thomas Renninger.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pcie/aer/aer_inject.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Borislav Petkov Jan. 26, 2016, 10:12 a.m. UTC | #1
On Tue, Jan 26, 2016 at 09:52:05AM +0100, Jean Delvare (by way of Jean Delvare <jdelvare@suse.de>) wrote:
> The aer_inject driver is very quiet. In most cases, it merely returns
> an error code to user-space, leaving the user with little clue about
> the actual reason for the failure.
> 
> So, log error messages for 4 of the most frequent causes of failure:
> * Can't find the root port of the specified device.
> * Device doesn't support AER.
> * Root port doesn't support AER.
> * AER device not found.
> This gives the user a chance to understand why aer-inject failed.
> 
> Based on a preliminary patch by Thomas Renninger.
> 
> Signed-off-by: Jean Delvare <jdelvare@suse.de>
> Cc: Thomas Renninger <trenn@suse.de>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/aer/aer_inject.c |    8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> --- linux-4.5-rc0.orig/drivers/pci/pcie/aer/aer_inject.c	2016-01-20 09:25:54.815852332 +0100
> +++ linux-4.5-rc0/drivers/pci/pcie/aer/aer_inject.c	2016-01-26 09:41:17.361994839 +0100
> @@ -334,12 +334,14 @@ static int aer_inject(struct aer_error_i
>  		return -ENODEV;
>  	rpdev = pcie_find_root_port(dev);
>  	if (!rpdev) {
> +		dev_err(&dev->dev, "aer_inject: Root port not found\n");
>  		ret = -ENODEV;
>  		goto out_put;
>  	}
>  
>  	pos_cap_err = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
>  	if (!pos_cap_err) {
> +		dev_err(&dev->dev, "aer_inject: Device doesn't support AER\n");
>  		ret = -EPERM;

Btw, this -EPERM looks wrong - if we're checking for capabilities, we
shouldn't be returning -EPERM but maybe something like -ENODEV or so.

>  		goto out_put;
>  	}
> @@ -350,6 +352,8 @@ static int aer_inject(struct aer_error_i
>  
>  	rp_pos_cap_err = pci_find_ext_capability(rpdev, PCI_EXT_CAP_ID_ERR);
>  	if (!rp_pos_cap_err) {
> +		dev_err(&rpdev->dev,
> +			"aer_inject: Root port doesn't support AER\n");
>  		ret = -EPERM;

Ditto.

>  		goto out_put;
>  	}
> @@ -462,8 +466,10 @@ static int aer_inject(struct aer_error_i
>  			goto out_put;
>  		}
>  		aer_irq(-1, edev);
> -	} else
> +	} else {
> +		dev_err(&rpdev->dev, "aer_inject: AER device not found\n");

So other error prints in that function do printk(KERN_WARNING. Why
dev_err()?

Why not pr_err() and define pr_fmt to "aer_inject: " and then drop
that prefix from the messages?

Thanks.
Jean Delvare Jan. 26, 2016, 12:27 p.m. UTC | #2
Hi Borislav,

Thanks for the quick review.

On Tuesday 26 January 2016 at 11:12 +0100, Borislav Petkov wrote:
> On Tue, Jan 26, 2016 at 09:52:05AM +0100, Jean Delvare (by way of Jean Delvare <jdelvare@suse.de>) wrote:
> > The aer_inject driver is very quiet. In most cases, it merely returns
> > an error code to user-space, leaving the user with little clue about
> > the actual reason for the failure.
> > 
> > So, log error messages for 4 of the most frequent causes of failure:
> > * Can't find the root port of the specified device.
> > * Device doesn't support AER.
> > * Root port doesn't support AER.
> > * AER device not found.
> > This gives the user a chance to understand why aer-inject failed.
> > 
> > Based on a preliminary patch by Thomas Renninger.
> > 
> > Signed-off-by: Jean Delvare <jdelvare@suse.de>
> > Cc: Thomas Renninger <trenn@suse.de>
> > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > ---
> >  drivers/pci/pcie/aer/aer_inject.c |    8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > --- linux-4.5-rc0.orig/drivers/pci/pcie/aer/aer_inject.c	2016-01-20 09:25:54.815852332 +0100
> > +++ linux-4.5-rc0/drivers/pci/pcie/aer/aer_inject.c	2016-01-26 09:41:17.361994839 +0100
> > @@ -334,12 +334,14 @@ static int aer_inject(struct aer_error_i
> >  		return -ENODEV;
> >  	rpdev = pcie_find_root_port(dev);
> >  	if (!rpdev) {
> > +		dev_err(&dev->dev, "aer_inject: Root port not found\n");
> >  		ret = -ENODEV;
> >  		goto out_put;
> >  	}
> >  
> >  	pos_cap_err = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
> >  	if (!pos_cap_err) {
> > +		dev_err(&dev->dev, "aer_inject: Device doesn't support AER\n");
> >  		ret = -EPERM;
> 
> Btw, this -EPERM looks wrong - if we're checking for capabilities, we
> shouldn't be returning -EPERM but maybe something like -ENODEV or so.

I agree. It was originally -ENOTTY, changed to -EPERM by:

commit e82b14bdd390c534750a191f9936f842bab255d4
Author: Prarit Bhargava <prarit@redhat.com>
Date:   Wed Mar 20 12:04:43 2013 +0000

But I'd say -EPERM is hardly better. The problem with -ENODEV is that it
is already returned by this function for several other error causes.
Also the aer-inject user-space tool will print the error message from
the error code, and I don't think "No such device" is helpful in that
case. What about -ENOTSUPP ("Operation not supported") or
-EPROTONOSUPPORT ("Protocol not supported")?

I can change it if nobody objects. I think the change can be included in
this patch as it is quite related.
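The point about the user-space tool printing a message derived from the error code can be checked with a small user-space sketch. The helper names below (`errno_text`, `print_candidates`) are hypothetical and not part of aer-inject; the sketch only assumes that the tool reports failures through the C library's `strerror()` text, as described above. `ENOTSUPP` is left out because, as far as I know, it is defined only in the kernel's internal errno header and not in user-space `<errno.h>`; `EOPNOTSUPP` is the user-visible code with a similar meaning.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the C library's text for an errno value into buf. A tool that
 * reports failures via strerror() or perror() would show this text.
 * err is the positive errno value (the kernel returns it negated).
 */
const char *errno_text(int err, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "%s", strerror(err));
	return buf;
}

/* Print the candidate codes from the discussion next to their text. */
void print_candidates(void)
{
	static const struct { int err; const char *name; } codes[] = {
		{ ENODEV,          "ENODEV" },
		{ EPERM,           "EPERM" },
		{ ENOTTY,          "ENOTTY" },
		{ EOPNOTSUPP,      "EOPNOTSUPP" },
		{ EPROTONOSUPPORT, "EPROTONOSUPPORT" },
	};
	char buf[128];
	size_t i;

	for (i = 0; i < sizeof(codes) / sizeof(codes[0]); i++)
		printf("%-16s %s\n", codes[i].name,
		       errno_text(codes[i].err, buf, sizeof(buf)));
}
```

Calling `print_candidates()` from a test program lists what each code would look like to the user; the exact wording depends on the C library in use.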

> >  		goto out_put;
> >  	}
> > @@ -350,6 +352,8 @@ static int aer_inject(struct aer_error_i
> >  
> >  	rp_pos_cap_err = pci_find_ext_capability(rpdev, PCI_EXT_CAP_ID_ERR);
> >  	if (!rp_pos_cap_err) {
> > +		dev_err(&rpdev->dev,
> > +			"aer_inject: Root port doesn't support AER\n");
> >  		ret = -EPERM;
> 
> Ditto.
> 
> >  		goto out_put;
> >  	}
> > @@ -462,8 +466,10 @@ static int aer_inject(struct aer_error_i
> >  			goto out_put;
> >  		}
> >  		aer_irq(-1, edev);
> > -	} else
> > +	} else {
> > +		dev_err(&rpdev->dev, "aer_inject: AER device not found\n");
> 
> So other error prints in that function do printk(KERN_WARNING. Why
> dev_err()?

I'd rather ask, why printk? ;-) Using raw printk is considered bad and
should be avoided whenever possible. So says checkpatch.pl. If anything,
all these printks should be converted to at least pr_* and ideally
dev_*. But that would be a separate patch.

> Why not pr_err() and define pr_fmt to "aer_inject: " and then drop
> that prefix from the messages?

Because I believe that including the device name in the error messages
makes them more helpful to understand and diagnose the problem. If the
device where we try to inject the error has a problem, its PCI name
will be included in the error message. If the error is with the root
port, then we include the root port's PCI name. If I used pr_err()
instead then the device information would be missing.
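The difference between the two logging styles can be imitated in user space. This is only a sketch: `mock_pr_err` and `mock_dev_err` are hypothetical helpers, not kernel APIs, and the `"<driver> <device>: "` layout is an assumption about how dev_* messages are usually prefixed in dmesg. The driver and device names in the usage note are made up.

```c
#include <assert.h>
#include <stdio.h>

/* Fixed prefix, playing the role of a pr_fmt() definition. */
#define MOCK_PR_PREFIX "aer_inject: "

/* Imitates pr_err() with a pr_fmt prefix: no device identity in the line. */
int mock_pr_err(char *buf, size_t len, const char *msg)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, MOCK_PR_PREFIX "%s", msg);
}

/* Imitates dev_err(): driver name and device name lead the line. */
int mock_dev_err(char *buf, size_t len, const char *driver,
		 const char *devname, const char *msg)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%s %s: %s", driver, devname, msg);
}
```

With a made-up root port named `0000:00:1c.0` bound to a driver called `pcieport`, `mock_dev_err()` yields a line that starts with `pcieport 0000:00:1c.0: `, whereas `mock_pr_err()` carries only the fixed `aer_inject: ` prefix and leaves the reader to guess which device was involved.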
Borislav Petkov Jan. 26, 2016, 12:49 p.m. UTC | #3
On Tue, Jan 26, 2016 at 01:27:18PM +0100, Jean Delvare wrote:
> But I'd say -EPERM is hardly better. The problem with -ENODEV is that it
> is already returned by this function for several other error causes.
> Also the aer-inject user-space tool will print the error message from
> the error code, and I don't think "No such device" is helpful in that
> case. What about -ENOTSUPP ("Operation not supported") or
> -EPROTONOSUPPORT ("Protocol not supported")?

Makes sense.

> I can change it if nobody objects. I think the change can be included in
> this patch as it is quite related.

I'd do a separate patch but this is only my opinion. I guess that's
Bjorn's call.

> I'd rather ask, why printk? ;-) Using raw printk is considered bad and
> should be avoided whenever possible.

Hmm, interesting. Why?

> So says checkpatch.pl.

Please don't tell me you believe what checkpatch says.

> > Why not pr_err() and define pr_fmt to "aer_inject: " and then drop
> > that prefix from the messages?
> 
> Because I believe that including the device name in the error messages
> makes them more helpful to understand and diagnose the problem. If the
> device where we try to inject the error has a problem, its PCI name
> will be included in the error message. If the error is with the root
> port, then we include the root port's PCI name. If I used pr_err()
> instead then the device information would be missing.

True, that's a good argument.

However, if you're doing aer injection, you already *know* the device
you're injecting to. Unless you want to inject into multiple devices and
then it is helpful.

So sure, dev_* sounds better as it gives more info about which device
fails, but then please convert the whole driver.

Thanks.
Jean Delvare Jan. 26, 2016, 1:05 p.m. UTC | #4
On Tuesday 26 January 2016 at 13:49 +0100, Borislav Petkov wrote:
> On Tue, Jan 26, 2016 at 01:27:18PM +0100, Jean Delvare wrote:
> > But I'd say -EPERM is hardly better. The problem with -ENODEV is that it
> > is already returned by this function for several other error causes.
> > Also the aer-inject user-space tool will print the error message from
> > the error code, and I don't think "No such device" is helpful in that
> > case. What about -ENOTSUPP ("Operation not supported") or
> > -EPROTONOSUPPORT ("Protocol not supported")?
> 
> Makes sense.
> 
> > I can change it if nobody objects. I think the change can be included in
> > this patch as it is quite related.
> 
> I'd do a separate patch but this is only my opinion. I guess that's
> Bjorn's call.

I am almost always advocating for separate patches, but here it seemed
like hairsplitting so I wasn't sure. I'm fine both ways really.

> > I'd rather ask, why printk? ;-) Using raw printk is considered bad and
> > should be avoided whenever possible.
> 
> Hmm, interesting. Why?

I guess the idea is that it makes message formats more consistent and
valuable.

> > So says checkpatch.pl.
> 
> Please don't tell me you believe what checkpatch says.

Of course I believe it, as long as it says what I want to hear. If not
then I just claim it's a piece of crap and ignore it ;-) As everybody
does, it seems.

> > > Why not pr_err() and define pr_fmt to "aer_inject: " and then drop
> > > that prefix from the messages?
> > 
> > Because I believe that including the device name in the error messages
> > makes them more helpful to understand and diagnose the problem. If the
> > > device where we try to inject the error has a problem, its PCI name
> > will be included in the error message. If the error is with the root
> > port, then we include the root port's PCI name. If I used pr_err()
> > instead then the device information would be missing.
> 
> True, that's a good argument.
> 
> However, if you're doing aer injection, you already *know* the device
> you're injecting to. Unless you want to inject into multiple devices and
> then it is helpful.

You know the device, but you don't know its root device, which
apparently matters a lot for AER, and is used for 2 of the messages I
introduced.

Even for the device itself, a confirmation of the PCI device name is
always good to have, to avoid confusion if you made a typo in your
injection data for example.

> So sure, dev_* sounds better as it gives more info about which device
> fails, but then please convert the whole driver.

OK, I'll work on this once the first round of reviews is done. I don't
know if others have more comments, so let's wait a bit.
Bjorn Helgaas Jan. 26, 2016, 10:16 p.m. UTC | #5
On Tue, Jan 26, 2016 at 02:05:54PM +0100, Jean Delvare wrote:
> On Tuesday 26 January 2016 at 13:49 +0100, Borislav Petkov wrote:
> > On Tue, Jan 26, 2016 at 01:27:18PM +0100, Jean Delvare wrote:
> > > But I'd say -EPERM is hardly better. The problem with -ENODEV is that it
> > > is already returned by this function for several other error causes.
> > > Also the aer-inject user-space tool will print the error message from
> > > the error code, and I don't think "No such device" is helpful in that
> > > case. What about -ENOTSUPP ("Operation not supported") or
> > > -EPROTONOSUPPORT ("Protocol not supported")?
> > 
> > Makes sense.
> > 
> > > I can change it if nobody objects. I think the change can be included in
> > > this patch as it is quite related.
> > 
> > I'd do a separate patch but this is only my opinion. I guess that's
> > Bjorn's call.
> 
> I am almost always advocating for separate patches, but here it seemed
> like hairsplitting so I wasn't sure. I'm fine both ways really.

I'd prefer one patch to change the errno (only) and another to add the
printk logging.

I definitely prefer dev_* whenever possible.  The aer_inject user
knows the relevant device at the time, but dev_* makes the dmesg log
more useful later.

In fact, your patch only adds logging to some error paths.  I'd like
to have some indication in dmesg that aer_inject was used at all.
Maybe even a synopsis of the injected error, though maybe that's too
much if aer_inject is used in an automated way.  It would just be nice
to have a dmesg indication that subsequent AER events *might* be
injected rather than real errors.

Bjorn
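One possible shape for the dmesg indication requested above is a single summary line per injection. The following is a user-space sketch under stated assumptions: `struct mock_inject_req` and `format_inject_notice` are hypothetical, the field names are illustrative and not taken from the driver, and the message wording is invented for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the injection request; field names are assumed. */
struct mock_inject_req {
	unsigned int domain, bus, dev, fn;
	unsigned int cor_status;    /* correctable error status bits */
	unsigned int uncor_status;  /* uncorrectable error status bits */
};

/* Build one summary line that makes the injection visible in the log. */
int format_inject_notice(char *buf, size_t len,
			 const struct mock_inject_req *req)
{
	assert(buf != NULL && len > 0 && req != NULL);
	return snprintf(buf, len,
			"aer_inject: injecting errors %08x/%08x into %04x:%02x:%02x.%x",
			req->cor_status, req->uncor_status,
			req->domain, req->bus, req->dev, req->fn);
}
```

A line like this, emitted once per injection at an informational level, would let someone reading the log later tell that subsequent AER events might be injected rather than real, while staying short enough for automated use.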

Patch

--- linux-4.5-rc0.orig/drivers/pci/pcie/aer/aer_inject.c	2016-01-20 09:25:54.815852332 +0100
+++ linux-4.5-rc0/drivers/pci/pcie/aer/aer_inject.c	2016-01-26 09:41:17.361994839 +0100
@@ -334,12 +334,14 @@  static int aer_inject(struct aer_error_i
 		return -ENODEV;
 	rpdev = pcie_find_root_port(dev);
 	if (!rpdev) {
+		dev_err(&dev->dev, "aer_inject: Root port not found\n");
 		ret = -ENODEV;
 		goto out_put;
 	}
 
 	pos_cap_err = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
 	if (!pos_cap_err) {
+		dev_err(&dev->dev, "aer_inject: Device doesn't support AER\n");
 		ret = -EPERM;
 		goto out_put;
 	}
@@ -350,6 +352,8 @@  static int aer_inject(struct aer_error_i
 
 	rp_pos_cap_err = pci_find_ext_capability(rpdev, PCI_EXT_CAP_ID_ERR);
 	if (!rp_pos_cap_err) {
+		dev_err(&rpdev->dev,
+			"aer_inject: Root port doesn't support AER\n");
 		ret = -EPERM;
 		goto out_put;
 	}
@@ -462,8 +466,10 @@  static int aer_inject(struct aer_error_i
 			goto out_put;
 		}
 		aer_irq(-1, edev);
-	} else
+	} else {
+		dev_err(&rpdev->dev, "aer_inject: AER device not found\n");
 		ret = -EINVAL;
+	}
 out_put:
 	kfree(err_alloc);
 	kfree(rperr_alloc);