diff mbox series

[v3,1/3] cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers

Message ID 20230803230129.127590-2-Smita.KoralahalliChannabasappa@amd.com
State New
Headers show
Series PCI/AER, CXL: Fix appropriate _OSC check for CXL RAS Cap | expand

Commit Message

Smita Koralahalli Aug. 3, 2023, 11:01 p.m. UTC
According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
of AER should also own CXL Protocol Error Management as there is no
explicit control of CXL Protocol error. And the CXL RAS Cap registers
reported on Protocol errors should check for AER _OSC rather than CXL
Memory Error Reporting Control _OSC.

The CXL Memory Error Reporting Control _OSC specifically highlights
handling Memory Error Logging and Signaling Enhancements. These kinds of
errors are reported through a device's mailbox and can be managed
independently from CXL Protocol Errors.

This change fixes handling and reporting CXL Protocol Errors and RAS
registers natively with native AER and FW-First CXL Memory Error Reporting
Control.

[1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.

Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Robert Richter <rrichter@amd.com>
---
v2:
	Added fixes tag.
	Included what the patch fixes in commit message.
v3:
	Added "Reviewed-by" tag.
---
 drivers/cxl/pci.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Jonathan Cameron Aug. 4, 2023, 12:09 p.m. UTC | #1
On Thu, 3 Aug 2023 23:01:27 +0000
Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote:

> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
> of AER should also own CXL Protocol Error Management as there is no
> explicit control of CXL Protocol error. And the CXL RAS Cap registers
> reported on Protocol errors should check for AER _OSC rather than CXL
> Memory Error Reporting Control _OSC.
> 
> The CXL Memory Error Reporting Control _OSC specifically highlights
> handling Memory Error Logging and Signaling Enhancements. These kinds of
> errors are reported through a device's mailbox and can be managed
> independently from CXL Protocol Errors.
> 
> This change fixes handling and reporting CXL Protocol Errors and RAS
> registers natively with native AER and FW-First CXL Memory Error Reporting
> Control.
> 
> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
> 
> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Reviewed-by: Robert Richter <rrichter@amd.com>

I'd be tempted to add a comment on why this returns 0 rather than an
error.  I think that makes sense but it isn't immediately obvious from
the local context.

Otherwise LGTM

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


> ---
> v2:
> 	Added fixes tag.
> 	Included what the patch fixes in commit message.
> v3:
> 	Added "Reviewed-by" tag.
> ---
>  drivers/cxl/pci.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 1cb1494c28fe..2323169b6e5f 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>  		return 0;
>  	}
>  
> -	/* BIOS has CXL error control */
> -	if (!host_bridge->native_cxl_error)
> -		return -ENXIO;
> +	/* BIOS has PCIe AER error control */
> +	if (!host_bridge->native_aer)
> +		return 0;
>  
>  	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>  	if (rc)
Dave Jiang Aug. 16, 2023, 6:06 p.m. UTC | #2
On 8/4/23 05:09, Jonathan Cameron wrote:
> On Thu, 3 Aug 2023 23:01:27 +0000
> Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote:
> 
>> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
>> of AER should also own CXL Protocol Error Management as there is no
>> explicit control of CXL Protocol error. And the CXL RAS Cap registers
>> reported on Protocol errors should check for AER _OSC rather than CXL
>> Memory Error Reporting Control _OSC.
>>
>> The CXL Memory Error Reporting Control _OSC specifically highlights
>> handling Memory Error Logging and Signaling Enhancements. These kinds of
>> errors are reported through a device's mailbox and can be managed
>> independently from CXL Protocol Errors.
>>
>> This change fixes handling and reporting CXL Protocol Errors and RAS
>> registers natively with native AER and FW-First CXL Memory Error Reporting
>> Control.
>>
>> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
>>
>> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> Reviewed-by: Robert Richter <rrichter@amd.com>
> 
> I'd be tempted to add a comment on why this returns 0 rather than an
> error.  I think that makes sense but it isn't immediately obvious from
> the local context.
> 
> Otherwise LGTM
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Echo Jonathan's comment.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
> 
>> ---
>> v2:
>> 	Added fixes tag.
>> 	Included what the patch fixes in commit message.
>> v3:
>> 	Added "Reviewed-by" tag.
>> ---
>>   drivers/cxl/pci.c | 6 +++---
>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>> index 1cb1494c28fe..2323169b6e5f 100644
>> --- a/drivers/cxl/pci.c
>> +++ b/drivers/cxl/pci.c
>> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>>   		return 0;
>>   	}
>>   
>> -	/* BIOS has CXL error control */
>> -	if (!host_bridge->native_cxl_error)
>> -		return -ENXIO;
>> +	/* BIOS has PCIe AER error control */
>> +	if (!host_bridge->native_aer)
>> +		return 0;
>>   
>>   	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>>   	if (rc)
>
Smita Koralahalli Aug. 16, 2023, 9:33 p.m. UTC | #3
On 8/16/2023 11:06 AM, Dave Jiang wrote:
> 
> 
> On 8/4/23 05:09, Jonathan Cameron wrote:
>> On Thu, 3 Aug 2023 23:01:27 +0000
>> Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote:
>>
>>> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
>>> of AER should also own CXL Protocol Error Management as there is no
>>> explicit control of CXL Protocol error. And the CXL RAS Cap registers
>>> reported on Protocol errors should check for AER _OSC rather than CXL
>>> Memory Error Reporting Control _OSC.
>>>
>>> The CXL Memory Error Reporting Control _OSC specifically highlights
>>> handling Memory Error Logging and Signaling Enhancements. These kinds of
>>> errors are reported through a device's mailbox and can be managed
>>> independently from CXL Protocol Errors.
>>>
>>> This change fixes handling and reporting CXL Protocol Errors and RAS
>>> registers natively with native AER and FW-First CXL Memory Error 
>>> Reporting
>>> Control.
>>>
>>> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
>>>
>>> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
>>> Signed-off-by: Smita Koralahalli 
>>> <Smita.KoralahalliChannabasappa@amd.com>
>>> Reviewed-by: Robert Richter <rrichter@amd.com>
>>
>> I'd be tempted to add a comment on why this returns 0 rather than an
>> error.  I think that makes sense but it isn't immediately obvious from
>> the local context.
>>
>> Otherwise LGTM
>>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Echo Jonathan's comment.
> 
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Yes, and Dan is probably against returning error code.
https://lore.kernel.org/all/64d1b3e78629f_5ea6e2944@dwillia2-xfh.jf.intel.com.notmuch/

But I think returning zero is required as we don't want to interfere 
with cxl device access when operating in native cxl memory error 
reporting. Returning error code will basically fail cxl_pci_probe() and 
thus fail to create a cxl device node.

I was thinking a single line comment as: "Return zero to not block the 
communication with the cxl device when in native memory error reporting 
mode".

Agree? Or anything more that needs to be added?

Thanks,
Smita
> 
>>
>>
>>> ---
>>> v2:
>>>     Added fixes tag.
>>>     Included what the patch fixes in commit message.
>>> v3:
>>>     Added "Reviewed-by" tag.
>>> ---
>>>   drivers/cxl/pci.c | 6 +++---
>>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>>> index 1cb1494c28fe..2323169b6e5f 100644
>>> --- a/drivers/cxl/pci.c
>>> +++ b/drivers/cxl/pci.c
>>> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>>>           return 0;
>>>       }
>>> -    /* BIOS has CXL error control */
>>> -    if (!host_bridge->native_cxl_error)
>>> -        return -ENXIO;
>>> +    /* BIOS has PCIe AER error control */
>>> +    if (!host_bridge->native_aer)
>>> +        return 0;
>>>       rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>>>       if (rc)
>>
Smita Koralahalli Aug. 16, 2023, 9:36 p.m. UTC | #4
On 8/16/2023 2:33 PM, Smita Koralahalli wrote:
> On 8/16/2023 11:06 AM, Dave Jiang wrote:
>>
>>
>> On 8/4/23 05:09, Jonathan Cameron wrote:
>>> On Thu, 3 Aug 2023 23:01:27 +0000
>>> Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote:
>>>
>>>> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
>>>> of AER should also own CXL Protocol Error Management as there is no
>>>> explicit control of CXL Protocol error. And the CXL RAS Cap registers
>>>> reported on Protocol errors should check for AER _OSC rather than CXL
>>>> Memory Error Reporting Control _OSC.
>>>>
>>>> The CXL Memory Error Reporting Control _OSC specifically highlights
>>>> handling Memory Error Logging and Signaling Enhancements. These 
>>>> kinds of
>>>> errors are reported through a device's mailbox and can be managed
>>>> independently from CXL Protocol Errors.
>>>>
>>>> This change fixes handling and reporting CXL Protocol Errors and RAS
>>>> registers natively with native AER and FW-First CXL Memory Error 
>>>> Reporting
>>>> Control.
>>>>
>>>> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
>>>>
>>>> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
>>>> Signed-off-by: Smita Koralahalli 
>>>> <Smita.KoralahalliChannabasappa@amd.com>
>>>> Reviewed-by: Robert Richter <rrichter@amd.com>
>>>
>>> I'd be tempted to add a comment on why this returns 0 rather than an
>>> error.  I think that makes sense but it isn't immediately obvious from
>>> the local context.
>>>
>>> Otherwise LGTM
>>>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>
>> Echo Jonathan's comment.
>>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> Yes, and Dan is probably against returning error code.

Against returning zero. My bad sorry!
> https://lore.kernel.org/all/64d1b3e78629f_5ea6e2944@dwillia2-xfh.jf.intel.com.notmuch/ 
> 
> 
> But I think returning zero is required as we don't want to interfere 
> with cxl device access when operating in native cxl memory error 
> reporting. Returning error code will basically fail cxl_pci_probe() and 
> thus fail to create a cxl device node.
> 
> I was thinking a single line comment as: "Return zero to not block the 
> communication with the cxl device when in native memory error reporting 
> mode".
> 
> Agree? Or anything more that needs to be added?
> 
> Thanks,
> Smita
>>
>>>
>>>
>>>> ---
>>>> v2:
>>>>     Added fixes tag.
>>>>     Included what the patch fixes in commit message.
>>>> v3:
>>>>     Added "Reviewed-by" tag.
>>>> ---
>>>>   drivers/cxl/pci.c | 6 +++---
>>>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>>>> index 1cb1494c28fe..2323169b6e5f 100644
>>>> --- a/drivers/cxl/pci.c
>>>> +++ b/drivers/cxl/pci.c
>>>> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>>>>           return 0;
>>>>       }
>>>> -    /* BIOS has CXL error control */
>>>> -    if (!host_bridge->native_cxl_error)
>>>> -        return -ENXIO;
>>>> +    /* BIOS has PCIe AER error control */
>>>> +    if (!host_bridge->native_aer)
>>>> +        return 0;
>>>>       rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>>>>       if (rc)
>>>
>
Dave Jiang Aug. 16, 2023, 9:39 p.m. UTC | #5
On 8/16/23 14:33, Smita Koralahalli wrote:
> On 8/16/2023 11:06 AM, Dave Jiang wrote:
>>
>>
>> On 8/4/23 05:09, Jonathan Cameron wrote:
>>> On Thu, 3 Aug 2023 23:01:27 +0000
>>> Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com> wrote:
>>>
>>>> According to Section 9.17.2, Table 9-26 of CXL Specification [1], owner
>>>> of AER should also own CXL Protocol Error Management as there is no
>>>> explicit control of CXL Protocol error. And the CXL RAS Cap registers
>>>> reported on Protocol errors should check for AER _OSC rather than CXL
>>>> Memory Error Reporting Control _OSC.
>>>>
>>>> The CXL Memory Error Reporting Control _OSC specifically highlights
>>>> handling Memory Error Logging and Signaling Enhancements. These 
>>>> kinds of
>>>> errors are reported through a device's mailbox and can be managed
>>>> independently from CXL Protocol Errors.
>>>>
>>>> This change fixes handling and reporting CXL Protocol Errors and RAS
>>>> registers natively with native AER and FW-First CXL Memory Error 
>>>> Reporting
>>>> Control.
>>>>
>>>> [1] Compute Express Link (CXL) Specification, Revision 3.1, Aug 1 2022.
>>>>
>>>> Fixes: 248529edc86f ("cxl: add RAS status unmasking for CXL")
>>>> Signed-off-by: Smita Koralahalli 
>>>> <Smita.KoralahalliChannabasappa@amd.com>
>>>> Reviewed-by: Robert Richter <rrichter@amd.com>
>>>
>>> I'd be tempted to add a comment on why this returns 0 rather than an
>>> error.  I think that makes sense but it isn't immediately obvious from
>>> the local context.
>>>
>>> Otherwise LGTM
>>>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>
>> Echo Jonathan's comment.
>>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> Yes, and Dan is probably against returning error code.
> https://lore.kernel.org/all/64d1b3e78629f_5ea6e2944@dwillia2-xfh.jf.intel.com.notmuch/
> 
> But I think returning zero is required as we don't want to interfere 
> with cxl device access when operating in native cxl memory error 
> reporting. Returning error code will basically fail cxl_pci_probe() and 
> thus fail to create a cxl device node.
> 
> I was thinking a single line comment as: "Return zero to not block the 
> communication with the cxl device when in native memory error reporting 
> mode".

Looks reasonable to me. Unless Dan feels it needs to fail the probe.

> 
> Agree? Or anything more that needs to be added?
> 
> Thanks,
> Smita
>>
>>>
>>>
>>>> ---
>>>> v2:
>>>>     Added fixes tag.
>>>>     Included what the patch fixes in commit message.
>>>> v3:
>>>>     Added "Reviewed-by" tag.
>>>> ---
>>>>   drivers/cxl/pci.c | 6 +++---
>>>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>>>> index 1cb1494c28fe..2323169b6e5f 100644
>>>> --- a/drivers/cxl/pci.c
>>>> +++ b/drivers/cxl/pci.c
>>>> @@ -541,9 +541,9 @@ static int cxl_pci_ras_unmask(struct pci_dev *pdev)
>>>>           return 0;
>>>>       }
>>>> -    /* BIOS has CXL error control */
>>>> -    if (!host_bridge->native_cxl_error)
>>>> -        return -ENXIO;
>>>> +    /* BIOS has PCIe AER error control */
>>>> +    if (!host_bridge->native_aer)
>>>> +        return 0;
>>>>       rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
>>>>       if (rc)
>>>
>
diff mbox series

Patch

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 1cb1494c28fe..2323169b6e5f 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -541,9 +541,9 @@  static int cxl_pci_ras_unmask(struct pci_dev *pdev)
 		return 0;
 	}
 
-	/* BIOS has CXL error control */
-	if (!host_bridge->native_cxl_error)
-		return -ENXIO;
+	/* BIOS has PCIe AER error control */
+	if (!host_bridge->native_aer)
+		return 0;
 
 	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &cap);
 	if (rc)