diff mbox series

[1/1] PCI/RCEC: Fix failure to inject errors to some RCiEP devices

Message ID 20210210020516.95292-1-qiuxu.zhuo@intel.com
State New
Headers show
Series [1/1] PCI/RCEC: Fix failure to inject errors to some RCiEP devices | expand

Commit Message

Zhuo, Qiuxu Feb. 10, 2021, 2:05 a.m. UTC
On a Sapphire Rapids server, it failed to inject correctable errors
to the RCiEP device e8:02.0 which was associated with the RCEC device
e8:00.4. See the following error log before applying the patch:

aer-inject -s e8:02.0 examples/correctable
Error: Failed to write, No such device

This was because rcec_assoc_rciep() mistakenly used "rciep->devfn" as
device number to check whether the corresponding bit was set in
the RCiEPBitmap of the RCEC. So that the RCiEP device e8:02.0 wasn't
linked to the RCEC and resulted in the above error.

Fix it by using PCI_SLOT() to convert rciep->devfn to device number.
Ensure that the RCiEP devices associated with the RCEC are linked to
the RCEC as the RCEC is enumerated. After applying the patch, correctable
errors can be injected to the RCiEP successfully.

Reported-and-tested-by: Wen Jin <wen.jin@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
 drivers/pci/pcie/rcec.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Kelley, Sean V Feb. 10, 2021, 4:33 a.m. UTC | #1
> On Feb 9, 2021, at 6:05 PM, Qiuxu Zhuo <qiuxu.zhuo@intel.com> wrote:
> 
> On a Sapphire Rapids server, it failed to inject correctable errors
> to the RCiEP device e8:02.0 which was associated with the RCEC device
> e8:00.4. See the following error log before applying the patch:
> 
> aer-inject -s e8:02.0 examples/correctable
> Error: Failed to write, No such device
> 
> This was because rcec_assoc_rciep() mistakenly used "rciep->devfn" as
> device number to check whether the corresponding bit was set in
> the RCiEPBitmap of the RCEC. So that the RCiEP device e8:02.0 wasn't
> linked to the RCEC and resulted in the above error.
> 
> Fix it by using PCI_SLOT() to convert rciep->devfn to device number.
> Ensure that the RCiEP devices associated with the RCEC are linked to
> the RCEC as the RCEC is enumerated. After applying the patch, correctable
> errors can be injected to the RCiEP successfully.
> 
> Reported-and-tested-by: Wen Jin <wen.jin@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

Reviewed-by: Sean V Kelley <sean.v.kelley@intel.com>

Thanks,

Sean

> ---
> drivers/pci/pcie/rcec.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/rcec.c b/drivers/pci/pcie/rcec.c
> index 2c5c552994e4..d0bcd141ac9c 100644
> --- a/drivers/pci/pcie/rcec.c
> +++ b/drivers/pci/pcie/rcec.c
> @@ -32,7 +32,7 @@ static bool rcec_assoc_rciep(struct pci_dev *rcec, struct pci_dev *rciep)
> 
> 	/* Same bus, so check bitmap */
> 	for_each_set_bit(devn, &bitmap, 32)
> -		if (devn == rciep->devfn)
> +		if (devn == PCI_SLOT(rciep->devfn))
> 			return true;
> 
> 	return false;
> -- 
> 2.17.1
>
Krzysztof Wilczyński Feb. 10, 2021, 5:12 p.m. UTC | #2
Hi Qiuxu,

Nice catch!  Thank you for sending the fix over!

[...]
> On a Sapphire Rapids server, it failed to inject correctable errors
> to the RCiEP device e8:02.0 which was associated with the RCEC device
> e8:00.4. See the following error log before applying the patch:
> 
> aer-inject -s e8:02.0 examples/correctable
> Error: Failed to write, No such device
> 
> This was because rcec_assoc_rciep() mistakenly used "rciep->devfn" as
> device number to check whether the corresponding bit was set in
> the RCiEPBitmap of the RCEC. So that the RCiEP device e8:02.0 wasn't
> linked to the RCEC and resulted in the above error.
> 
> Fix it by using PCI_SLOT() to convert rciep->devfn to device number.
> Ensure that the RCiEP devices associated with the RCEC are linked to
> the RCEC as the RCEC is enumerated. After applying the patch, correctable
> errors can be injected to the RCiEP successfully.

Would this only affect error injection or would this be also a generic
problem with the driver itself causing issues regardless of whether it
was an error injection or not for this particular device?  I am asking,
as there is a lot going on in the commit message.

I wonder if simplifying this commit message so that it clearly explains
what was broken, why, and how this patch is fixing it, would perhaps be
an option?  The backstory of how you found the issue while doing some
testing and error injection is nice, but not sure if needed.

What do you think?

Krzysztof
Zhuo, Qiuxu Feb. 18, 2021, 3 a.m. UTC | #3
Hi Krzysztof,

Sorry, just back from Chinese New Year holiday.

> From: Krzysztof Wilczyński <kw@linux.com>
> ...
> ...
> Would this only affect error injection or would this be also a generic problem
> with the driver itself causing issues regardless of whether it was an error
> injection or not for this particular device?  I am asking, as there is a lot going on
> in the commit message.

This is also a generic problem.

> I wonder if simplifying this commit message so that it clearly explains what was
> broken, why, and how this patch is fixing it, would perhaps be an option?  The
> backstory of how you found the issue while doing some testing and error
> injection is nice, but not sure if needed.
> 
> What do you think?

Agree to simplify the commit message. How about the following subject and commit message?

Subject:  
Use device number to check RCiEPBitmap of RCEC

Commit message: 
rcec_assoc_rciep() used the combination of device number and function number 'devfn' to check whether the corresponding bit in the RCiEPBimap of RCEC was set. According to [1], it only needs to use the device number to check the corresponding bit in the RCiEPBitmap was set. So fix it by using PCI_SLOT() to convert 'devfn' to device number for rcec_assoc_rciep().
[1] PCIe r5.0, sec "7.9.10.2 Association Bitmap for RCiEPs"


Thanks!
-Qiuxu
Krzysztof Wilczyński Feb. 18, 2021, 10:07 p.m. UTC | #4
[+cc Bjorn as we talked about RCiEP briefly on IRC]

Hello Qiuxu,

[...]
> Sorry, just back from Chinese New Year holiday.

Welcome back!  I hope you had a nice rest, and also Happy New Year!

[...]
> > Would this only affect error injection or would this be also a generic problem
> > with the driver itself causing issues regardless of whether it was an error
> > injection or not for this particular device?  I am asking, as there is a lot going on
> > in the commit message.
> 
> This is also a generic problem.

Good to know.  Bjorn was also wondering if this is potentially a sign of
a larger probed with the RCiEP support.

> > I wonder if simplifying this commit message so that it clearly explains what was
> > broken, why, and how this patch is fixing it, would perhaps be an option?  The
> > backstory of how you found the issue while doing some testing and error
> > injection is nice, but not sure if needed.
> > 
> > What do you think?
> 
> Agree to simplify the commit message. How about the following subject and commit message?
> 
> Subject:  
> Use device number to check RCiEPBitmap of RCEC
> 
> Commit message: 
> rcec_assoc_rciep() used the combination of device number and function
> number 'devfn' to check whether the corresponding bit in the
> RCiEPBimap of RCEC was set. According to [1], it only needs to use the
> device number to check the corresponding bit in the RCiEPBitmap was
> set. So fix it by using PCI_SLOT() to convert 'devfn' to device number
> for rcec_assoc_rciep(). [1] PCIe r5.0, sec "7.9.10.2 Association
> Bitmap for RCiEPs"

I took your suggestion and came up with the following:

  Function rcec_assoc_rciep() incorrectly used "rciep->devfn" (a single
  byte encoding the device and function number) as the device number to
  check whether the corresponding bit was set in the RCiEPBitmap of the
  RCEC (Root Complex Event Collector) while enumerating over each bit of
  the RCiEPBitmap.

  As per the PCI Express Base Specification, Revision 5.0, Version 1.0,
  Section 7.9.10.2, "Association Bitmap for RCiEPs", p. 935, only needs to
  use a device number to check whether the corresponding bit was set in
  the RCiEPBitmap.

  Fix rcec_assoc_rciep() using the PCI_SLOT() macro and convert the value
  of "rciep->devfn" to a device number to ensure that the RCiEP devices
  are associated with the RCEC are linked when the RCEC is enumerated.

Using either of the following as the subject:

  PCI/RCEC: Use device number to check RCiEPBitmap of RCEC
  PCI/RCEC: Fix RCiEP capable devices RCEC association

What do you think?  Also, feel free to change whatever you see fit, of
course, as tis is only a suggestion.

Krzysztof
Krzysztof Wilczyński Feb. 18, 2021, 10:11 p.m. UTC | #5
Hi Qiuxu,

[...]
> > Agree to simplify the commit message. How about the following subject and commit message?
> > 
> > Subject:  
> > Use device number to check RCiEPBitmap of RCEC
> > 
> > Commit message: 
> > rcec_assoc_rciep() used the combination of device number and function
> > number 'devfn' to check whether the corresponding bit in the
> > RCiEPBimap of RCEC was set. According to [1], it only needs to use the
> > device number to check the corresponding bit in the RCiEPBitmap was
> > set. So fix it by using PCI_SLOT() to convert 'devfn' to device number
> > for rcec_assoc_rciep(). [1] PCIe r5.0, sec "7.9.10.2 Association
> > Bitmap for RCiEPs"
> 
> I took your suggestion and came up with the following:
> 
>   Function rcec_assoc_rciep() incorrectly used "rciep->devfn" (a single
>   byte encoding the device and function number) as the device number to
>   check whether the corresponding bit was set in the RCiEPBitmap of the
>   RCEC (Root Complex Event Collector) while enumerating over each bit of
>   the RCiEPBitmap.
> 
>   As per the PCI Express Base Specification, Revision 5.0, Version 1.0,
>   Section 7.9.10.2, "Association Bitmap for RCiEPs", p. 935, only needs to
>   use a device number to check whether the corresponding bit was set in
>   the RCiEPBitmap.
> 
>   Fix rcec_assoc_rciep() using the PCI_SLOT() macro and convert the value
>   of "rciep->devfn" to a device number to ensure that the RCiEP devices
>   are associated with the RCEC are linked when the RCEC is enumerated.
> 
> Using either of the following as the subject:
> 
>   PCI/RCEC: Use device number to check RCiEPBitmap of RCEC
>   PCI/RCEC: Fix RCiEP capable devices RCEC association
> 
> What do you think?  Also, feel free to change whatever you see fit, of
> course, as tis is only a suggestion.

We could probably add the following:

  Fixes: 507b460f8144 ("PCI/ERR: Add pcie_link_rcec() to associate RCiEPs")

Since this would where the issue was originally introduced.  I forgot to
mention this in the previous message, apologies.

Krzysztof
Zhuo, Qiuxu Feb. 19, 2021, 1:51 a.m. UTC | #6
> ...
> 
> I took your suggestion and came up with the following:
> 
>   Function rcec_assoc_rciep() incorrectly used "rciep->devfn" (a single
>   byte encoding the device and function number) as the device number to
>   check whether the corresponding bit was set in the RCiEPBitmap of the
>   RCEC (Root Complex Event Collector) while enumerating over each bit of
>   the RCiEPBitmap.
> 
>   As per the PCI Express Base Specification, Revision 5.0, Version 1.0,
>   Section 7.9.10.2, "Association Bitmap for RCiEPs", p. 935, only needs to
>   use a device number to check whether the corresponding bit was set in
>   the RCiEPBitmap.
> 
>   Fix rcec_assoc_rciep() using the PCI_SLOT() macro and convert the value
>   of "rciep->devfn" to a device number to ensure that the RCiEP devices
>   are associated with the RCEC are linked when the RCEC is enumerated.
>
> Using either of the following as the subject:
> 
>   PCI/RCEC: Use device number to check RCiEPBitmap of RCEC
>   PCI/RCEC: Fix RCiEP capable devices RCEC association
> 
> What do you think?  Also, feel free to change whatever you see fit, of course, as
> tis is only a suggestion.
> 

Hi Krzysztof,

Thanks for improving the commit message. It looks clearer. 😊
Will send out a v2 with this commit message.

Thanks!
-Qiuxu
Zhuo, Qiuxu Feb. 19, 2021, 1:52 a.m. UTC | #7
>...
> 
> We could probably add the following:
> 
>   Fixes: 507b460f8144 ("PCI/ERR: Add pcie_link_rcec() to associate RCiEPs")
> 

OK. Will add this to the v2.

Thanks!
-Qiuxu
diff mbox series

Patch

diff --git a/drivers/pci/pcie/rcec.c b/drivers/pci/pcie/rcec.c
index 2c5c552994e4..d0bcd141ac9c 100644
--- a/drivers/pci/pcie/rcec.c
+++ b/drivers/pci/pcie/rcec.c
@@ -32,7 +32,7 @@  static bool rcec_assoc_rciep(struct pci_dev *rcec, struct pci_dev *rciep)
 
 	/* Same bus, so check bitmap */
 	for_each_set_bit(devn, &bitmap, 32)
-		if (devn == rciep->devfn)
+		if (devn == PCI_SLOT(rciep->devfn))
 			return true;
 
 	return false;