[pci-next] pci/edr: Ignore Surprise Down error on hot removal

Message ID 20240304090819.3812465-1-haifeng.zhao@linux.intel.com
State New
Series [pci-next] pci/edr: Ignore Surprise Down error on hot removal

Commit Message

Ethan Zhao March 4, 2024, 9:08 a.m. UTC
Per PCI firmware spec r3.3 sec 4.6.12, in the firmware first mode DPC
handling path, FW should clear the UC errors logged by the port and bring
the link out of DPC. But because of ambiguous wording in the spec, some
BIOSes don't clear the Surprise Down error and the error bits in PCI status,
and still notify the OS to handle it. Thus the following workaround is needed
in EDR when double reporting (hot removal interrupt && DPC notification) is hit.

https://patchwork.kernel.org/project/linux-pci/patch/20240207181854.121335-1-Smita.KoralahalliChannabasappa@amd.com/

Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
---
 drivers/pci/pci.h      | 1 +
 drivers/pci/pcie/dpc.c | 9 +++++----
 drivers/pci/pcie/edr.c | 3 +++
 3 files changed, 9 insertions(+), 4 deletions(-)


base-commit: a66f2b4a4d365dc4bac35576f3a9d4f5982f1d63

Comments

Lukas Wunner March 4, 2024, 11:58 a.m. UTC | #1
On Mon, Mar 04, 2024 at 04:08:19AM -0500, Ethan Zhao wrote:
> Per PCI firmware spec r3.3 sec 4.6.12, for firmware first mode DPC
> handling path, FW should clear UC errors logged by port and bring link
> out of DPC, but because of ambiguity of wording in the spec, some BIOSes
> doesn't clear the surprise down error and the error bits in pci status,
> still notify OS to handle it. thus following trick is needed in EDR when
> double reporting (hot removal interrupt && dpc notification) is hit.

Please provide more detailed information about the hardware and BIOS
affected by this.


> -static void dpc_handle_surprise_removal(struct pci_dev *pdev)
> +bool  dpc_handle_surprise_removal(struct pci_dev *pdev)
>  {
> +	if (!dpc_is_surprise_removal(pdev))
> +		return false;

This change of moving dpc_is_surprise_removal() into
dpc_handle_surprise_removal() seems unrelated to the problem at hand.

Please drop it if it's unnecessary to fix the issue.


> --- a/drivers/pci/pcie/edr.c
> +++ b/drivers/pci/pcie/edr.c
> @@ -184,6 +184,9 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>  		goto send_ost;
>  	}
>  
> +	if (dpc_handle_surprise_removal(edev))
> +		goto send_ost;
> +
>  	dpc_process_error(edev);
>  	pci_aer_raw_clear_status(edev);

This seems to be the only necessary change.  Please reduce the
patch to contain only it and no other refactoring.

Please capitalize the "PCI/EDR: " prefix in the subject and add
a Fixes tag.

Thanks,

Lukas
Smita Koralahalli March 4, 2024, 7:33 p.m. UTC | #2
Hi Ethan,

On 3/4/2024 3:58 AM, Lukas Wunner wrote:
> On Mon, Mar 04, 2024 at 04:08:19AM -0500, Ethan Zhao wrote:
>> Per PCI firmware spec r3.3 sec 4.6.12, for firmware first mode DPC
>> handling path, FW should clear UC errors logged by port and bring link
>> out of DPC, but because of ambiguity of wording in the spec, some BIOSes
>> doesn't clear the surprise down error and the error bits in pci status,
>> still notify OS to handle it. thus following trick is needed in EDR when
>> double reporting (hot removal interrupt && dpc notification) is hit.

Please correct me if I'm wrong.

When there is double reporting (hot removal interrupt && dpc 
notification), won't the DPC handler be called always which takes care 
of clearing the surprise down errors? Do we need it again from EDR handler?

Thanks
Smita

> 
> Please provide more detailed information about the hardware and BIOS
> affected by this.
> 
> 
>> -static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>> +bool  dpc_handle_surprise_removal(struct pci_dev *pdev)
>>   {
>> +	if (!dpc_is_surprise_removal(pdev))
>> +		return false;
> 
> This change of moving dpc_is_surprise_removal() into
> dpc_handle_surprise_removal() seems unrelated to the problem at hand.
> 
> Please drop it if it's unnecessary to fix the issue.
> 
> 
>> --- a/drivers/pci/pcie/edr.c
>> +++ b/drivers/pci/pcie/edr.c
>> @@ -184,6 +184,9 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>>   		goto send_ost;
>>   	}
>>   
>> +	if (dpc_handle_surprise_removal(edev))
>> +		goto send_ost;
>> +
>>   	dpc_process_error(edev);
>>   	pci_aer_raw_clear_status(edev);
> 
> This seems to be the only necessary change.  Please reduce the
> patch to contain only it and no other refactoring.
> 
> Please capitalize the "PCI/EDR: " prefix in the subject and add
> a Fixes tag.
> 
> Thanks,
> 
> Lukas
>
Kuppuswamy Sathyanarayanan March 4, 2024, 8:10 p.m. UTC | #3
On 3/4/24 1:08 AM, Ethan Zhao wrote:
> Per PCI firmware spec r3.3 sec 4.6.12, for firmware first mode DPC
> handling path, FW should clear UC errors logged by port and bring link
> out of DPC, but because of ambiguity of wording in the spec, some BIOSes
> doesn't clear the surprise down error and the error bits in pci status,

As Lukas mentioned, please include the hardware and BIOS version
where you see this issue.

> still notify OS to handle it. thus following trick is needed in EDR when
> double reporting (hot removal interrupt && dpc notification) is hit.

EDR notification is generally used when a firmware wants OS to invalidate
or recover the error state of child devices when handling a containment event.
Since this DPC event is a side effect of async removal, there is no recovery
involved. So there is no value in firmware notifying the OS via an ACPI notification
and then OS ignoring it.

If you check the PCIe firmware spec, sec 4.6.12, IMPLEMENTATION NOTE, it
recommends firmware to ignore the DPC due to hotplug surprise.

>
> https://patchwork.kernel.org/project/linux-pci/patch/20240207181854.
> 121335-1-Smita.KoralahalliChannabasappa@amd.com/
>
> Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
> ---
>  drivers/pci/pci.h      | 1 +
>  drivers/pci/pcie/dpc.c | 9 +++++----
>  drivers/pci/pcie/edr.c | 3 +++
>  3 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 50134b5e3235..3787bb32e724 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -443,6 +443,7 @@ void pci_save_dpc_state(struct pci_dev *dev);
>  void pci_restore_dpc_state(struct pci_dev *dev);
>  void pci_dpc_init(struct pci_dev *pdev);
>  void dpc_process_error(struct pci_dev *pdev);
> +bool dpc_handle_surprise_removal(struct pci_dev *pdev);
>  pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
>  bool pci_dpc_recovered(struct pci_dev *pdev);
>  #else
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 98b42e425bb9..be79f205e04c 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -319,8 +319,10 @@ static void pci_clear_surpdn_errors(struct pci_dev *pdev)
>  	pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, PCI_EXP_DEVSTA_FED);
>  }
>  
> -static void dpc_handle_surprise_removal(struct pci_dev *pdev)
> +bool  dpc_handle_surprise_removal(struct pci_dev *pdev)
>  {
> +	if (!dpc_is_surprise_removal(pdev))
> +		return false;
>  	if (!pcie_wait_for_link(pdev, false)) {
>  		pci_info(pdev, "Data Link Layer Link Active not cleared in 1000 msec\n");
>  		goto out;
> @@ -338,6 +340,7 @@ static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>  out:
>  	clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags);
>  	wake_up_all(&dpc_completed_waitqueue);
> +	return true;
>  }
>  
>  static bool dpc_is_surprise_removal(struct pci_dev *pdev)
> @@ -362,10 +365,8 @@ static irqreturn_t dpc_handler(int irq, void *context)
>  	 * According to PCIe r6.0 sec 6.7.6, errors are an expected side effect
>  	 * of async removal and should be ignored by software.
>  	 */
> -	if (dpc_is_surprise_removal(pdev)) {
> -		dpc_handle_surprise_removal(pdev);
> +	if (dpc_handle_surprise_removal(pdev))
>  		return IRQ_HANDLED;
> -	}
>  
>  	dpc_process_error(pdev);
>  
> diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
> index 5f4914d313a1..556edfb2696a 100644
> --- a/drivers/pci/pcie/edr.c
> +++ b/drivers/pci/pcie/edr.c
> @@ -184,6 +184,9 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>  		goto send_ost;
>  	}
>  
> +	if (dpc_handle_surprise_removal(edev))
> +		goto send_ost;
> +
>  	dpc_process_error(edev);
>  	pci_aer_raw_clear_status(edev);
>  
>
> base-commit: a66f2b4a4d365dc4bac35576f3a9d4f5982f1d63
Ethan Zhao March 5, 2024, 2:09 a.m. UTC | #4
On 3/4/2024 7:58 PM, Lukas Wunner wrote:
> On Mon, Mar 04, 2024 at 04:08:19AM -0500, Ethan Zhao wrote:
>> Per PCI firmware spec r3.3 sec 4.6.12, for firmware first mode DPC
>> handling path, FW should clear UC errors logged by port and bring link
>> out of DPC, but because of ambiguity of wording in the spec, some BIOSes
>> doesn't clear the surprise down error and the error bits in pci status,
>> still notify OS to handle it. thus following trick is needed in EDR when
>> double reporting (hot removal interrupt && dpc notification) is hit.
> Please provide more detailed information about the hardware and BIOS
> affected by this.
>
You know, disclosing the detailed hardware and BIOS info list might involve
a very complex internal legal approval process.

To put it simply, at least one platform, such as SPR, and one customer's BIOS
are affected.

If FFM (firmware first mode) is enabled and a hotplug operation is performed,
the side effect can be observed on an affected system: UC errors are reported
along with the pciehp log.

>> -static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>> +bool  dpc_handle_surprise_removal(struct pci_dev *pdev)
>>   {
>> +	if (!dpc_is_surprise_removal(pdev))
>> +		return false;
> This change of moving dpc_is_surprise_removal() into
> dpc_handle_surprise_removal() seems unrelated to the problem at hand.
>
> Please drop it if it's unnecessary to fix the issue.

That was done to export only one function, dpc_is_surprise_removal()...
otherwise I have to export them both.
Should I keep them intact, or refactor them in a separate patch?
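
For comparison, the two shapes under discussion can be sketched in a small
userspace model. This is not kernel code: struct removal_model and every
function name below are hypothetical stand-ins, and the model only assumes
that the check and the cleanup are side-effect free apart from the cleanup
itself.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a port; not struct pci_dev. */
struct removal_model {
	bool surprise_removal;	/* what the "is surprise removal" check would detect */
	int cleanups;		/* how many times the removal cleanup ran */
};

static bool model_is_surprise_removal(const struct removal_model *p)
{
	return p->surprise_removal;
}

static void model_cleanup(struct removal_model *p)
{
	p->cleanups++;
}

/* Shape A: check folded into the handler, so one symbol is exported. */
static bool handle_combined(struct removal_model *p)
{
	if (!model_is_surprise_removal(p))
		return false;
	model_cleanup(p);
	return true;
}

/* Shape B: helpers left as they are, the caller does the check itself. */
static bool caller_checks(struct removal_model *p)
{
	if (model_is_surprise_removal(p)) {
		model_cleanup(p);
		return true;
	}
	return false;
}
```

In this model both shapes run the cleanup exactly once for a surprise
removal and not at all otherwise, so the choice is about how many symbols
are exported and how large the patch is, not about behavior.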

>
>
>> --- a/drivers/pci/pcie/edr.c
>> +++ b/drivers/pci/pcie/edr.c
>> @@ -184,6 +184,9 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>>   		goto send_ost;
>>   	}
>>   
>> +	if (dpc_handle_surprise_removal(edev))
>> +		goto send_ost;
>> +
>>   	dpc_process_error(edev);
>>   	pci_aer_raw_clear_status(edev);
> This seems to be the only necessary change.  Please reduce the
> patch to contain only it and no other refactoring.
>
> Please capitalize the "PCI/EDR: " prefix in the subject and add
> a Fixes tag.

Sure !

Thanks,
Ethan

> Thanks,
>
> Lukas
Ethan Zhao March 5, 2024, 2:19 a.m. UTC | #5
On 3/5/2024 3:33 AM, Smita Koralahalli wrote:
> Hi Ethan,
>
> On 3/4/2024 3:58 AM, Lukas Wunner wrote:
>> On Mon, Mar 04, 2024 at 04:08:19AM -0500, Ethan Zhao wrote:
>>> Per PCI firmware spec r3.3 sec 4.6.12, for firmware first mode DPC
>>> handling path, FW should clear UC errors logged by port and bring link
>>> out of DPC, but because of ambiguity of wording in the spec, some 
>>> BIOSes
>>> doesn't clear the surprise down error and the error bits in pci status,
>>> still notify OS to handle it. thus following trick is needed in EDR 
>>> when
>>> double reporting (hot removal interrupt && dpc notification) is hit.
>
> Please correct me if I'm wrong.
>
> When there is double reporting (hot removal interrupt && dpc 
> notification), won't the DPC handler be called always which takes care 
> of clearing the surprise down errors? Do we need it again from EDR 
> handler?

My understanding is that if firmware first mode is enabled, the DPC driver
isn't enabled and EDR is notified instead. Some of the common functions are
still used in EDR, e.g. dpc_process_error() is called in edr_handle_event(),
but dpc_handler() isn't called, and so neither is dpc_handle_surprise_removal().
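
A rough userspace model of the two paths, as an illustration only. None of
this is kernel code: the enum, the struct and the function names are all
hypothetical, and the "patched" flag simply stands for the check this patch
adds to the EDR path.

```c
#include <stdbool.h>

/* Hypothetical model types; not the kernel's. */
enum dpc_owner { DPC_OWNER_OS, DPC_OWNER_FIRMWARE };

enum dpc_action {
	ACT_IGNORE_SURPRISE_REMOVAL,	/* clear Surprise Down, report nothing */
	ACT_REPORT_UNCORRECTABLE,	/* fall through to error processing */
};

struct port_model {
	enum dpc_owner owner;		/* who handles DPC events */
	bool hotplug_bridge;
	bool surprise_down_logged;
};

static bool is_surprise_removal(const struct port_model *p)
{
	return p->hotplug_bridge && p->surprise_down_logged;
}

/* Models the native interrupt handler: runs only when the OS owns DPC. */
static enum dpc_action native_dpc_path(const struct port_model *p)
{
	if (is_surprise_removal(p))
		return ACT_IGNORE_SURPRISE_REMOVAL;
	return ACT_REPORT_UNCORRECTABLE;
}

/* Models the EDR notification handler: runs when firmware owns DPC. */
static enum dpc_action edr_path(const struct port_model *p, bool patched)
{
	if (patched && is_surprise_removal(p))
		return ACT_IGNORE_SURPRISE_REMOVAL;
	return ACT_REPORT_UNCORRECTABLE;
}

static enum dpc_action handle_containment(const struct port_model *p,
					  bool patched)
{
	return p->owner == DPC_OWNER_OS ? native_dpc_path(p)
					: edr_path(p, patched);
}
```

With owner set to DPC_OWNER_FIRMWARE and patched false, a surprise removal
ends up in the reporting branch, which is the behavior seen on the affected
platform.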

Thanks,
Ethan

>
> Thanks
> Smita
>
>>
>> Please provide more detailed information about the hardware and BIOS
>> affected by this.
>>
>>
>>> -static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>>> +bool  dpc_handle_surprise_removal(struct pci_dev *pdev)
>>>   {
>>> +    if (!dpc_is_surprise_removal(pdev))
>>> +        return false;
>>
>> This change of moving dpc_is_surprise_removal() into
>> dpc_handle_surprise_removal() seems unrelated to the problem at hand.
>>
>> Please drop it if it's unnecessary to fix the issue.
>>
>>
>>> --- a/drivers/pci/pcie/edr.c
>>> +++ b/drivers/pci/pcie/edr.c
>>> @@ -184,6 +184,9 @@ static void edr_handle_event(acpi_handle handle, 
>>> u32 event, void *data)
>>>           goto send_ost;
>>>       }
>>>   +    if (dpc_handle_surprise_removal(edev))
>>> +        goto send_ost;
>>> +
>>>       dpc_process_error(edev);
>>>       pci_aer_raw_clear_status(edev);
>>
>> This seems to be the only necessary change.  Please reduce the
>> patch to contain only it and no other refactoring.
>>
>> Please capitalize the "PCI/EDR: " prefix in the subject and add
>> a Fixes tag.
>>
>> Thanks,
>>
>> Lukas
>>
Ethan Zhao March 5, 2024, 2:29 a.m. UTC | #6
On 3/5/2024 4:10 AM, Kuppuswamy Sathyanarayanan wrote:
> On 3/4/24 1:08 AM, Ethan Zhao wrote:
>> Per PCI firmware spec r3.3 sec 4.6.12, for firmware first mode DPC
>> handling path, FW should clear UC errors logged by port and bring link
>> out of DPC, but because of ambiguity of wording in the spec, some BIOSes
>> doesn't clear the surprise down error and the error bits in pci status,
> As Lukas mentioned, please include the hardware and BIOS version
> where you see this issue.

Reproduced on "Hardware name: Intel Corporation ArcherCity/ArcherCity,
  BIOS EGSDCRB1.86B.0107.D20.2310211929 10/21/2023"

>
>> still notify OS to handle it. thus following trick is needed in EDR when
>> double reporting (hot removal interrupt && dpc notification) is hit.
> EDR notification is generally used when a firmware wants OS to invalidate
> or recover the error state of child devices when handling a containment event.
> Since this DPC event is a side effect of async removal, there is no recovery
> involved. So there is no value in firmware notifying the OS via an ACPI notification
> and then OS ignoring it.
>
> If you check the PCIe firmware spec, sec 4.6.12, IMPLEMENTATION NOTE, it
> recommends firmware to ignore the DPC due to hotplug surprise.

My understanding is the same: firmware should ignore the errors and bring
the link out of DPC.

But due to wording like:
"FW should not issue Notify(0xF) to avoid double reporting. FW should clear
*other* UC errors logged by port (if any) and bring link out of DPC if it has
entered DPC."

some BIOS writers have a different understanding and don't clear the Surprise
Down error.

Thanks,
Ethan

>
>> https://patchwork.kernel.org/project/linux-pci/patch/20240207181854.
>> 121335-1-Smita.KoralahalliChannabasappa@amd.com/
>>
>> Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
>> ---
>>   drivers/pci/pci.h      | 1 +
>>   drivers/pci/pcie/dpc.c | 9 +++++----
>>   drivers/pci/pcie/edr.c | 3 +++
>>   3 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 50134b5e3235..3787bb32e724 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -443,6 +443,7 @@ void pci_save_dpc_state(struct pci_dev *dev);
>>   void pci_restore_dpc_state(struct pci_dev *dev);
>>   void pci_dpc_init(struct pci_dev *pdev);
>>   void dpc_process_error(struct pci_dev *pdev);
>> +bool dpc_handle_surprise_removal(struct pci_dev *pdev);
>>   pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
>>   bool pci_dpc_recovered(struct pci_dev *pdev);
>>   #else
>> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
>> index 98b42e425bb9..be79f205e04c 100644
>> --- a/drivers/pci/pcie/dpc.c
>> +++ b/drivers/pci/pcie/dpc.c
>> @@ -319,8 +319,10 @@ static void pci_clear_surpdn_errors(struct pci_dev *pdev)
>>   	pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, PCI_EXP_DEVSTA_FED);
>>   }
>>   
>> -static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>> +bool  dpc_handle_surprise_removal(struct pci_dev *pdev)
>>   {
>> +	if (!dpc_is_surprise_removal(pdev))
>> +		return false;
>>   	if (!pcie_wait_for_link(pdev, false)) {
>>   		pci_info(pdev, "Data Link Layer Link Active not cleared in 1000 msec\n");
>>   		goto out;
>> @@ -338,6 +340,7 @@ static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>>   out:
>>   	clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags);
>>   	wake_up_all(&dpc_completed_waitqueue);
>> +	return true;
>>   }
>>   
>>   static bool dpc_is_surprise_removal(struct pci_dev *pdev)
>> @@ -362,10 +365,8 @@ static irqreturn_t dpc_handler(int irq, void *context)
>>   	 * According to PCIe r6.0 sec 6.7.6, errors are an expected side effect
>>   	 * of async removal and should be ignored by software.
>>   	 */
>> -	if (dpc_is_surprise_removal(pdev)) {
>> -		dpc_handle_surprise_removal(pdev);
>> +	if (dpc_handle_surprise_removal(pdev))
>>   		return IRQ_HANDLED;
>> -	}
>>   
>>   	dpc_process_error(pdev);
>>   
>> diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
>> index 5f4914d313a1..556edfb2696a 100644
>> --- a/drivers/pci/pcie/edr.c
>> +++ b/drivers/pci/pcie/edr.c
>> @@ -184,6 +184,9 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>>   		goto send_ost;
>>   	}
>>   
>> +	if (dpc_handle_surprise_removal(edev))
>> +		goto send_ost;
>> +
>>   	dpc_process_error(edev);
>>   	pci_aer_raw_clear_status(edev);
>>   
>>
>> base-commit: a66f2b4a4d365dc4bac35576f3a9d4f5982f1d63
Kuppuswamy Sathyanarayanan March 5, 2024, 4:04 a.m. UTC | #7
On 3/4/24 6:29 PM, Ethan Zhao wrote:
> On 3/5/2024 4:10 AM, Kuppuswamy Sathyanarayanan wrote:
>> On 3/4/24 1:08 AM, Ethan Zhao wrote:
>>> Per PCI firmware spec r3.3 sec 4.6.12, for firmware first mode DPC
>>> handling path, FW should clear UC errors logged by port and bring link
>>> out of DPC, but because of ambiguity of wording in the spec, some BIOSes
>>> doesn't clear the surprise down error and the error bits in pci status,
>> As Lukas mentioned, please include the hardware and BIOS version
>> where you see this issue.
>
> Reproduced on "Hardware name: Intel Corporation ArcherCity/ArcherCity,
>  BIOS EGSDCRB1.86B.0107.D20.2310211929 10/21/2023"
>
>>
>>> still notify OS to handle it. thus following trick is needed in EDR when
>>> double reporting (hot removal interrupt && dpc notification) is hit.
>> EDR notification is generally used when a firmware wants OS to invalidate
>> or recover the error state of child devices when handling a containment event.
>> Since this DPC event is a side effect of async removal, there is no recovery
>> involved. So there is no value in firmware notifying the OS via an ACPI notification
>> and then OS ignoring it.
>>
>> If you check the PCIe firmware spec, sec 4.6.12, IMPLEMENTATION NOTE, it
>> recommends firmware to ignore the DPC due to hotplug surprise.
>
> My understanding is the same, let firmware to ignore the errors and bring
> it out of DPC.
>
> But due to the wording like:
> "FW should not issue Notify(0xF) to avoid doule reporting. FW should clear
> *other* UC errors logged by port(if any) and bring link out of DPC if it has
> entered DPC."
>

"Since surprise hot remove event is signaled to OS via hot plug interrupt, FW should
not issue Notify(0xF) to avoid double reporting. FW should clear other UC errors logged
by port (if any) and bring link out of DPC if it has entered DPC."


Above statement is a note about how to treat DPC triggered due to async
removal. Since OS already gets notification via DLLSC change (hotplug interrupt),
there is no need for reporting again using EDR notification. I think the flow
chart is very clear about handling the hotplug related error case. Also the
note "Clear other UC errors" means it also includes "Surprise Down" error.
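
The firmware-side behavior described above can be sketched as a small
userspace model. It is only one reading of the note quoted in this thread,
not a transcription of the spec flow chart, and struct fw_port_state,
struct fw_outcome and fw_handle_dpc() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical model of what firmware sees on a DPC event. */
struct fw_port_state {
	bool dpc_triggered;
	bool surprise_down;	/* Surprise Down logged in the UC error status */
	bool hotplug_surprise;	/* slot supports async (surprise) removal */
};

/* Hypothetical model of what firmware then does. */
struct fw_outcome {
	bool notify_os;		/* would firmware issue Notify(0xF)? */
	bool uc_status_cleared;	/* UC errors cleared, Surprise Down included */
	bool link_out_of_dpc;
};

static struct fw_outcome fw_handle_dpc(const struct fw_port_state *s)
{
	struct fw_outcome out = { false, false, false };

	if (!s->dpc_triggered)
		return out;

	if (s->hotplug_surprise && s->surprise_down) {
		/*
		 * Async removal: the OS already learns about it through the
		 * hotplug interrupt, so no Notify(0xF). Firmware clears the
		 * logged UC errors and releases the link from DPC.
		 */
		out.uc_status_cleared = true;
		out.link_out_of_dpc = true;
		return out;
	}

	/* Any other containment event: hand it to the OS through EDR. */
	out.notify_os = true;
	return out;
}
```

In this model the OS never sees an EDR notification for an async removal,
so nothing is left for it to ignore.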

> Some BIOS writers have different understanding, wouldn't clear the surprise
> down error.
>
> Thanks,
> Ethan
>
>>
>>> https://patchwork.kernel.org/project/linux-pci/patch/20240207181854.
>>> 121335-1-Smita.KoralahalliChannabasappa@amd.com/
>>>
>>> Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
>>> ---
>>>   drivers/pci/pci.h      | 1 +
>>>   drivers/pci/pcie/dpc.c | 9 +++++----
>>>   drivers/pci/pcie/edr.c | 3 +++
>>>   3 files changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>>> index 50134b5e3235..3787bb32e724 100644
>>> --- a/drivers/pci/pci.h
>>> +++ b/drivers/pci/pci.h
>>> @@ -443,6 +443,7 @@ void pci_save_dpc_state(struct pci_dev *dev);
>>>   void pci_restore_dpc_state(struct pci_dev *dev);
>>>   void pci_dpc_init(struct pci_dev *pdev);
>>>   void dpc_process_error(struct pci_dev *pdev);
>>> +bool dpc_handle_surprise_removal(struct pci_dev *pdev);
>>>   pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
>>>   bool pci_dpc_recovered(struct pci_dev *pdev);
>>>   #else
>>> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
>>> index 98b42e425bb9..be79f205e04c 100644
>>> --- a/drivers/pci/pcie/dpc.c
>>> +++ b/drivers/pci/pcie/dpc.c
>>> @@ -319,8 +319,10 @@ static void pci_clear_surpdn_errors(struct pci_dev *pdev)
>>>       pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, PCI_EXP_DEVSTA_FED);
>>>   }
>>>   -static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>>> +bool  dpc_handle_surprise_removal(struct pci_dev *pdev)
>>>   {
>>> +    if (!dpc_is_surprise_removal(pdev))
>>> +        return false;
>>>       if (!pcie_wait_for_link(pdev, false)) {
>>>           pci_info(pdev, "Data Link Layer Link Active not cleared in 1000 msec\n");
>>>           goto out;
>>> @@ -338,6 +340,7 @@ static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>>>   out:
>>>       clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags);
>>>       wake_up_all(&dpc_completed_waitqueue);
>>> +    return true;
>>>   }
>>>     static bool dpc_is_surprise_removal(struct pci_dev *pdev)
>>> @@ -362,10 +365,8 @@ static irqreturn_t dpc_handler(int irq, void *context)
>>>        * According to PCIe r6.0 sec 6.7.6, errors are an expected side effect
>>>        * of async removal and should be ignored by software.
>>>        */
>>> -    if (dpc_is_surprise_removal(pdev)) {
>>> -        dpc_handle_surprise_removal(pdev);
>>> +    if (dpc_handle_surprise_removal(pdev))
>>>           return IRQ_HANDLED;
>>> -    }
>>>         dpc_process_error(pdev);
>>>   diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
>>> index 5f4914d313a1..556edfb2696a 100644
>>> --- a/drivers/pci/pcie/edr.c
>>> +++ b/drivers/pci/pcie/edr.c
>>> @@ -184,6 +184,9 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>>>           goto send_ost;
>>>       }
>>>   +    if (dpc_handle_surprise_removal(edev))
>>> +        goto send_ost;
>>> +
>>>       dpc_process_error(edev);
>>>       pci_aer_raw_clear_status(edev);
>>>  
>>> base-commit: a66f2b4a4d365dc4bac35576f3a9d4f5982f1d63
>
Ethan Zhao March 5, 2024, 5:49 a.m. UTC | #8
On 3/5/2024 12:04 PM, Kuppuswamy Sathyanarayanan wrote:
> On 3/4/24 6:29 PM, Ethan Zhao wrote:
>> On 3/5/2024 4:10 AM, Kuppuswamy Sathyanarayanan wrote:
>>> On 3/4/24 1:08 AM, Ethan Zhao wrote:
>>>> Per PCI firmware spec r3.3 sec 4.6.12, for firmware first mode DPC
>>>> handling path, FW should clear UC errors logged by port and bring link
>>>> out of DPC, but because of ambiguity of wording in the spec, some BIOSes
>>>> doesn't clear the surprise down error and the error bits in pci status,
>>> As Lukas mentioned, please include the hardware and BIOS version
>>> where you see this issue.
>> Reproduced on "Hardware name: Intel Corporation ArcherCity/ArcherCity,
>>   BIOS EGSDCRB1.86B.0107.D20.2310211929 10/21/2023"
>>
>>>> still notify OS to handle it. thus following trick is needed in EDR when
>>>> double reporting (hot removal interrupt && dpc notification) is hit.
>>> EDR notification is generally used when a firmware wants OS to invalidate
>>> or recover the error state of child devices when handling a containment event.
>>> Since this DPC event is a side effect of async removal, there is no recovery
>>> involved. So there is no value in firmware notifying the OS via an ACPI notification
>>> and then OS ignoring it.
>>>
>>> If you check the PCIe firmware spec, sec 4.6.12, IMPLEMENTATION NOTE, it
>>> recommends firmware to ignore the DPC due to hotplug surprise.
>> My understanding is the same, let firmware to ignore the errors and bring
>> it out of DPC.
>>
>> But due to the wording like:
>> "FW should not issue Notify(0xF) to avoid doule reporting. FW should clear
>> *other* UC errors logged by port(if any) and bring link out of DPC if it has
>> entered DPC."
>>
> "Since surprise hot remove event is signaled to OS via hot plug interrupt, FW should
> not issue Notify(0xF) to avoid double reporting. FW should clear other UC errors logged
> by port (if any) and bring link out of DPC if it has entered DPC."
>
>
> Above statement is a note about how to treat DPC triggered due to async
> removal. Since OS already gets notification via DLLSC change (hotplug interrupt),
> there is no need for reporting again using EDR notification. I think the flow
> chart is very clear about handling the hotplug related error case. Also the
> note "Clear other UC errors" means it also includes "Surprise Down" error.

Agreed. Some BIOS writers might misunderstand that as leaving the "Surprise
Down" error to the OS for special treatment.

Thanks,
Ethan

>
>> Some BIOS writers have different understanding, wouldn't clear the surprise
>> down error.
>>
>> Thanks,
>> Ethan
>>
>>>> https://patchwork.kernel.org/project/linux-pci/patch/20240207181854.
>>>> 121335-1-Smita.KoralahalliChannabasappa@amd.com/
>>>>
>>>> Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
>>>> ---
>>>>    drivers/pci/pci.h      | 1 +
>>>>    drivers/pci/pcie/dpc.c | 9 +++++----
>>>>    drivers/pci/pcie/edr.c | 3 +++
>>>>    3 files changed, 9 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>>>> index 50134b5e3235..3787bb32e724 100644
>>>> --- a/drivers/pci/pci.h
>>>> +++ b/drivers/pci/pci.h
>>>> @@ -443,6 +443,7 @@ void pci_save_dpc_state(struct pci_dev *dev);
>>>>    void pci_restore_dpc_state(struct pci_dev *dev);
>>>>    void pci_dpc_init(struct pci_dev *pdev);
>>>>    void dpc_process_error(struct pci_dev *pdev);
>>>> +bool dpc_handle_surprise_removal(struct pci_dev *pdev);
>>>>    pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
>>>>    bool pci_dpc_recovered(struct pci_dev *pdev);
>>>>    #else
>>>> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
>>>> index 98b42e425bb9..be79f205e04c 100644
>>>> --- a/drivers/pci/pcie/dpc.c
>>>> +++ b/drivers/pci/pcie/dpc.c
>>>> @@ -319,8 +319,10 @@ static void pci_clear_surpdn_errors(struct pci_dev *pdev)
>>>>        pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, PCI_EXP_DEVSTA_FED);
>>>>    }
>>>>    -static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>>>> +bool  dpc_handle_surprise_removal(struct pci_dev *pdev)
>>>>    {
>>>> +    if (!dpc_is_surprise_removal(pdev))
>>>> +        return false;
>>>>        if (!pcie_wait_for_link(pdev, false)) {
>>>>            pci_info(pdev, "Data Link Layer Link Active not cleared in 1000 msec\n");
>>>>            goto out;
>>>> @@ -338,6 +340,7 @@ static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>>>>    out:
>>>>        clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags);
>>>>        wake_up_all(&dpc_completed_waitqueue);
>>>> +    return true;
>>>>    }
>>>>      static bool dpc_is_surprise_removal(struct pci_dev *pdev)
>>>> @@ -362,10 +365,8 @@ static irqreturn_t dpc_handler(int irq, void *context)
>>>>         * According to PCIe r6.0 sec 6.7.6, errors are an expected side effect
>>>>         * of async removal and should be ignored by software.
>>>>         */
>>>> -    if (dpc_is_surprise_removal(pdev)) {
>>>> -        dpc_handle_surprise_removal(pdev);
>>>> +    if (dpc_handle_surprise_removal(pdev))
>>>>            return IRQ_HANDLED;
>>>> -    }
>>>>          dpc_process_error(pdev);
>>>>    diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
>>>> index 5f4914d313a1..556edfb2696a 100644
>>>> --- a/drivers/pci/pcie/edr.c
>>>> +++ b/drivers/pci/pcie/edr.c
>>>> @@ -184,6 +184,9 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>>>>            goto send_ost;
>>>>        }
>>>>    +    if (dpc_handle_surprise_removal(edev))
>>>> +        goto send_ost;
>>>> +
>>>>        dpc_process_error(edev);
>>>>        pci_aer_raw_clear_status(edev);
>>>>   
>>>> base-commit: a66f2b4a4d365dc4bac35576f3a9d4f5982f1d63

Patch

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 50134b5e3235..3787bb32e724 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -443,6 +443,7 @@  void pci_save_dpc_state(struct pci_dev *dev);
 void pci_restore_dpc_state(struct pci_dev *dev);
 void pci_dpc_init(struct pci_dev *pdev);
 void dpc_process_error(struct pci_dev *pdev);
+bool dpc_handle_surprise_removal(struct pci_dev *pdev);
 pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
 bool pci_dpc_recovered(struct pci_dev *pdev);
 #else
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 98b42e425bb9..be79f205e04c 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -319,8 +319,10 @@  static void pci_clear_surpdn_errors(struct pci_dev *pdev)
 	pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, PCI_EXP_DEVSTA_FED);
 }
 
-static void dpc_handle_surprise_removal(struct pci_dev *pdev)
+bool  dpc_handle_surprise_removal(struct pci_dev *pdev)
 {
+	if (!dpc_is_surprise_removal(pdev))
+		return false;
 	if (!pcie_wait_for_link(pdev, false)) {
 		pci_info(pdev, "Data Link Layer Link Active not cleared in 1000 msec\n");
 		goto out;
@@ -338,6 +340,7 @@  static void dpc_handle_surprise_removal(struct pci_dev *pdev)
 out:
 	clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags);
 	wake_up_all(&dpc_completed_waitqueue);
+	return true;
 }
 
 static bool dpc_is_surprise_removal(struct pci_dev *pdev)
@@ -362,10 +365,8 @@  static irqreturn_t dpc_handler(int irq, void *context)
 	 * According to PCIe r6.0 sec 6.7.6, errors are an expected side effect
 	 * of async removal and should be ignored by software.
 	 */
-	if (dpc_is_surprise_removal(pdev)) {
-		dpc_handle_surprise_removal(pdev);
+	if (dpc_handle_surprise_removal(pdev))
 		return IRQ_HANDLED;
-	}
 
 	dpc_process_error(pdev);
 
diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
index 5f4914d313a1..556edfb2696a 100644
--- a/drivers/pci/pcie/edr.c
+++ b/drivers/pci/pcie/edr.c
@@ -184,6 +184,9 @@  static void edr_handle_event(acpi_handle handle, u32 event, void *data)
 		goto send_ost;
 	}
 
+	if (dpc_handle_surprise_removal(edev))
+		goto send_ost;
+
 	dpc_process_error(edev);
 	pci_aer_raw_clear_status(edev);