diff mbox series

phb4: Reset pfir and nfir if new errors reported during ETU reset

Message ID 20180830125751.25579-1-vaibhav@linux.ibm.com
State Superseded
Headers show
Series phb4: Reset pfir and nfir if new errors reported during ETU reset | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success master/apply_patch Successfully applied
snowpatch_ozlabs/make_check success Test make_check on branch master

Commit Message

Vaibhav Jain Aug. 30, 2018, 12:57 p.m. UTC
During fast-reboot a PCI device can continue sending requests even
after ETU-Reset is asserted. This will cause new errors to be reported
in ETU fir-registers and will result in values of variables nfir_cache
and pfir_cache to be out of sync.

Presently during step-2 of CRESET nfir_cache and pfir_cache values are
used to bring the PHB out of reset state. However if these variables
are out of date the nfir/pfir registers are never reset completely and
ETU still remains frozen.

Hence this patch updates step-2 of phb4_creset to re-read the values of
nfir/pfir registers to check if any new errors were reported after
ETU-reset was asserted, report these new errors and reset the
nfir/pfir registers. This should bring the ETU out of reset
successfully.

Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 hw/phb4.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Comments

Vasant Hegde Aug. 31, 2018, 5:12 a.m. UTC | #1
On 08/30/2018 06:27 PM, Vaibhav Jain wrote:
> During fast-reboot a PCI device can continue sending requests even
> after ETU-Reset is asserted. This will cause new errors to be reported
> in ETU fir-registers and will result in values of variables nfir_cache
> and pfir_cache to be out of sync.
> 
> Presently during step-2 of CRESET nfir_cache and pfir_cache values are
> used to bring the PHB out of reset state. However if these variables
> are out of date the nfir/pfir registers are never reset completely and
> ETU still remains frozen.
> 
> Hence this patch updates step-2 of phb4_creset to re-read the values of
> nfir/pfir registers to check if any new errors were reported after
> ETU-reset was asserted, report these new errors and reset the
> nfir/pfir registers. This should bring the ETU out of reset
> successfully.
> 
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>

Tested-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>

-Vasant
Oliver O'Halloran Aug. 31, 2018, 6 a.m. UTC | #2
On Thu, 2018-08-30 at 18:27 +0530, Vaibhav Jain wrote:
> During fast-reboot a PCI device can continue sending requests even
> after ETU-Reset is asserted. This will cause new errors to be
> reported
> in ETU fir-registers and will result in values of variables
> nfir_cache
> and pfir_cache to be out of sync.
> 
> Presently during step-2 of CRESET nfir_cache and pfir_cache values
> are
> used to bring the PHB out of reset state. However if these variables
> are out of date the nfir/pfir registers are never reset completely
> and
> ETU still remains frozen.
> 
> Hence this patch updates step-2 of phb4_creset to re-read the values
> of
> nfir/pfir registers to check if any new errors were reported after
> ETU-reset was asserted, report these new errors and reset the
> nfir/pfir registers. This should bring the ETU out of reset
> successfully.

Fun bug.

We could avoid this by setting the link disable bit before we assert
the ETU reset in stage 1. That way we shouldn't get any more PCIe
traffic while the reset is in progress. We set LD later on when we
call into phb4_freset() so it shouldn't hurt.

Can you or Vasant give this a try with your repro case?

diff --git a/hw/phb4.c b/hw/phb4.c
index 7cb8b89e5c20..6578aab09e2e 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -3124,6 +3124,12 @@ static int64_t phb4_creset(struct pci_slot
*slot)
 		PHBDBG(p, "CRESET: Starts\n");
 
 		phb4_prepare_link_change(slot, false);
+
+		phb4_pcicfg_read16(&p->phb, 0, p->ecap +
PCICAP_EXP_LCTL,
+				   &reg16);
+		reg16 |= PCICAP_EXP_LCTL_LINK_DIS;
+		phb4_pcicfg_write16(&p->phb, 0, p->ecap +
PCICAP_EXP_LCTL,
+				    reg16);
 		/* Clear error inject register, preventing recursive
errors */
 		xscom_write(p->chip_id, p->pe_xscom + 0x2, 0x0);
 

> 
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  hw/phb4.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/hw/phb4.c b/hw/phb4.c
> index d1245dce..9c4b54b5 100644
> --- a/hw/phb4.c
> +++ b/hw/phb4.c
> @@ -3148,6 +3148,33 @@ static int64_t phb4_creset(struct pci_slot
> *slot)
>  			xscom_write(p->chip_id, p->pe_stk_xscom +
> 0x1,
>  				    ~p->nfir_cache);
>  
> +			/* Re-read errors in PFIR and NFIR and reset
> any new
> +			 * error reported. This may happen as after
> fundamental
> +			 * reset was asserted in previous step the
> device may
> +			 * still be sending TLPs causing fence to be
> raised.
> +			 */
> +			xscom_read(p->chip_id, p->pci_stk_xscom +
> +				   XPEC_PCI_STK_PCI_FIR, &p-
> >pfir_cache);
> +			xscom_read(p->chip_id, p->pe_stk_xscom +
> +				   XPEC_NEST_STK_PCI_NFIR, &p-
> >nfir_cache);
> +
> +			if (p->pfir_cache || p->nfir_cache) {
> +				PHBERR(p, "CRESET: PHB still fenced
> !!\n");
> +				PHBERR(p, "PCI FIR=0x%016llx\n",
> +				       p->pfir_cache);
> +				PHBERR(p, "NEST FIR=0x%016llx\n",
> +				       p->nfir_cache);
> +
> +				/* Dump other error registers */
> +				phb4_eeh_dump_regs(p);
> +
> +				/* Reset the PHB errors */
> +				xscom_write(p->chip_id, p-
> >pci_stk_xscom +
> +					    XPEC_PCI_STK_PCI_FIR,
> 0);
> +				xscom_write(p->chip_id, p-
> >pe_stk_xscom +
> +					    XPEC_NEST_STK_PCI_NFIR,
> 0);

Do we also need to call phb4_err_clear() here? If a TLP is causing an
error during the reset I'd expect some of the lower-level error
registers to be set, but those might be cleared during the reset.

> +			}
> +
>  			/* Clear PHB from reset */
>  			xscom_write(p->chip_id,
>  				    p->pci_stk_xscom +
> XPEC_PCI_STK_ETU_RESET, 0x0);
Vasant Hegde Aug. 31, 2018, 11:21 a.m. UTC | #3
On 08/31/2018 11:30 AM, Oliver O'Halloran wrote:
> On Thu, 2018-08-30 at 18:27 +0530, Vaibhav Jain wrote:
>> During fast-reboot a PCI device can continue sending requests even
>> after ETU-Reset is asserted. This will cause new errors to be
>> reported
>> in ETU fir-registers and will result in values of variables
>> nfir_cache
>> and pfir_cache to be out of sync.
>>
>> Presently during step-2 of CRESET nfir_cache and pfir_cache values
>> are
>> used to bring the PHB out of reset state. However if these variables
>> are out of date the nfir/pfir registers are never reset completely
>> and
>> ETU still remains frozen.
>>
>> Hence this patch updates step-2 of phb4_creset to re-read the values
>> of
>> nfir/pfir registers to check if any new errors were reported after
>> ETU-reset was asserted, report these new errors and reset the
>> nfir/pfir registers. This should bring the ETU out of reset
>> successfully.
> 
> Fun bug.
> 
> We could avoid this by setting the link disable bit before we assert
> the ETU reset in stage 1. That way we shouldn't get any more PCIe
> traffic while the reset is in progress. We set LD later on when we
> call into phb4_freset() so it shouldn't hurt.
> 
> Can you or Vasant give this a try with your repro case?

Yeah. Below change didn't fix the issue. I still see PHB errors and disks are 
not getting detected.

/ # grep PHB /sys/firmware/opal/msglog |grep Error
[  286.016664627,7] PHB#0033:00:00.0 Error -6 resetting
[  286.016682002,7] PHB#0002:00:00.0 Error -6 resetting
[  286.016690554,7] PHB#0001:00:00.0 Error -6 resetting
[  286.016725126,7] PHB#0004:00:00.0 Error -6 resetting
[  286.016736261,7] PHB#0003:00:00.0 Error -6 resetting
[  286.016758274,7] PHB#0031:00:00.0 Error -6 resetting
[  286.016763042,7] PHB#0030:00:00.0 Error -6 resetting
[  286.017111640,7] PHB#0000:00:00.0 Error -6 resetting
[  286.021380651,3] PHB#0002[0:2]: phb4_get_link_info: Error -6 getting link state
[  286.021380646,3] PHB#0001[0:1]: phb4_get_link_info: Error -6 getting link state
[  286.021384703,3] PHB#0033[8:3]: phb4_get_link_info: Error -6 getting link state
[  286.021397387,3] PHB#0033:00:00.0 Error -6 querying link status
[  286.021381928,3] PHB#0004[0:4]: phb4_get_link_info: Error -6 getting link state
[  286.021381356,3] PHB#0003[0:3]: phb4_get_link_info: Error -6 getting link state
[  286.021412009,3] PHB#0004:00:00.0 Error -6 querying link status
[  286.021418175,3] PHB#0003:00:00.0 Error -6 querying link status
[  286.021383000,3] PHB#0030[8:0]: phb4_get_link_info: Error -6 getting link state
[  286.021383992,3] PHB#0002:00:00.0 Error -6 querying link status
[  286.021388334,3] PHB#0001:00:00.0 Error -6 querying link status
[  286.021440040,3] PHB#0030:00:00.0 Error -6 querying link status
[  286.021383563,3] PHB#0031[8:1]: phb4_get_link_info: Error -6 getting link state
[  286.021468511,3] PHB#0031:00:00.0 Error -6 querying link status
[  286.021384405,3] PHB#0000[0:0]: phb4_get_link_info: Error -6 getting link state
[  286.021523256,3] PHB#0000:00:00.0 Error -6 querying link status


-Vasant

> 
> diff --git a/hw/phb4.c b/hw/phb4.c
> index 7cb8b89e5c20..6578aab09e2e 100644
> --- a/hw/phb4.c
> +++ b/hw/phb4.c
> @@ -3124,6 +3124,12 @@ static int64_t phb4_creset(struct pci_slot
> *slot)
>   		PHBDBG(p, "CRESET: Starts\n");
>   
>   		phb4_prepare_link_change(slot, false);
> +
> +		phb4_pcicfg_read16(&p->phb, 0, p->ecap +
> PCICAP_EXP_LCTL,
> +				   &reg16);
> +		reg16 |= PCICAP_EXP_LCTL_LINK_DIS;
> +		phb4_pcicfg_write16(&p->phb, 0, p->ecap +
> PCICAP_EXP_LCTL,
> +				    reg16);
>   		/* Clear error inject register, preventing recursive
> errors */
>   		xscom_write(p->chip_id, p->pe_xscom + 0x2, 0x0);
>   
> 
>>
>> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
>> ---
>>   hw/phb4.c | 27 +++++++++++++++++++++++++++
>>   1 file changed, 27 insertions(+)
>>
>> diff --git a/hw/phb4.c b/hw/phb4.c
>> index d1245dce..9c4b54b5 100644
>> --- a/hw/phb4.c
>> +++ b/hw/phb4.c
>> @@ -3148,6 +3148,33 @@ static int64_t phb4_creset(struct pci_slot
>> *slot)
>>   			xscom_write(p->chip_id, p->pe_stk_xscom +
>> 0x1,
>>   				    ~p->nfir_cache);
>>   
>> +			/* Re-read errors in PFIR and NFIR and reset
>> any new
>> +			 * error reported. This may happen as after
>> fundamental
>> +			 * reset was asserted in previous step the
>> device may
>> +			 * still be sending TLPs causing fence to be
>> raised.
>> +			 */
>> +			xscom_read(p->chip_id, p->pci_stk_xscom +
>> +				   XPEC_PCI_STK_PCI_FIR, &p-
>>> pfir_cache);
>> +			xscom_read(p->chip_id, p->pe_stk_xscom +
>> +				   XPEC_NEST_STK_PCI_NFIR, &p-
>>> nfir_cache);
>> +
>> +			if (p->pfir_cache || p->nfir_cache) {
>> +				PHBERR(p, "CRESET: PHB still fenced
>> !!\n");
>> +				PHBERR(p, "PCI FIR=0x%016llx\n",
>> +				       p->pfir_cache);
>> +				PHBERR(p, "NEST FIR=0x%016llx\n",
>> +				       p->nfir_cache);
>> +
>> +				/* Dump other error registers */
>> +				phb4_eeh_dump_regs(p);
>> +
>> +				/* Reset the PHB errors */
>> +				xscom_write(p->chip_id, p-
>>> pci_stk_xscom +
>> +					    XPEC_PCI_STK_PCI_FIR,
>> 0);
>> +				xscom_write(p->chip_id, p-
>>> pe_stk_xscom +
>> +					    XPEC_NEST_STK_PCI_NFIR,
>> 0);
> 
> Do we also need to call phb4_err_clear() here? If a TLP is causing an
> error during the reset I'd expect some of the lower-level error
> registers to be set, but those might be cleared during the reset.
> 
>> +			}
>> +
>>   			/* Clear PHB from reset */
>>   			xscom_write(p->chip_id,
>>   				    p->pci_stk_xscom +
>> XPEC_PCI_STK_ETU_RESET, 0x0);
>
Oliver O'Halloran Sept. 17, 2018, 3:22 a.m. UTC | #4
We should probably just merge this and if it turns out there's other
problems we can fix them later.

On Thu, Aug 30, 2018 at 10:57 PM, Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
> During fast-reboot a PCI device can continue sending requests even
> after ETU-Reset is asserted. This will cause new errors to be reported
> in ETU fir-registers and will result in values of variables nfir_cache
> and pfir_cache to be out of sync.
>
> Presently during step-2 of CRESET nfir_cache and pfir_cache values are
> used to bring the PHB out of reset state. However if these variables
> are out of date the nfir/pfir registers are never reset completely and
> ETU still remains frozen.
>
> Hence this patch updates step-2 of phb4_creset to re-read the values of
> nfir/pfir registers to check if any new errors were reported after
> ETU-reset was asserted, report these new errors and reset the
> nfir/pfir registers. This should bring the ETU out of reset
> successfully.
>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  hw/phb4.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
>
> diff --git a/hw/phb4.c b/hw/phb4.c
> index d1245dce..9c4b54b5 100644
> --- a/hw/phb4.c
> +++ b/hw/phb4.c
> @@ -3148,6 +3148,33 @@ static int64_t phb4_creset(struct pci_slot *slot)
>                         xscom_write(p->chip_id, p->pe_stk_xscom + 0x1,
>                                     ~p->nfir_cache);
>
> +                       /* Re-read errors in PFIR and NFIR and reset any new
> +                        * error reported. This may happen as after fundamental
> +                        * reset was asserted in previous step the device may
> +                        * still be sending TLPs causing fence to be raised.
> +                        */
> +                       xscom_read(p->chip_id, p->pci_stk_xscom +
> +                                  XPEC_PCI_STK_PCI_FIR, &p->pfir_cache);
> +                       xscom_read(p->chip_id, p->pe_stk_xscom +
> +                                  XPEC_NEST_STK_PCI_NFIR, &p->nfir_cache);
> +
> +                       if (p->pfir_cache || p->nfir_cache) {
> +                               PHBERR(p, "CRESET: PHB still fenced !!\n");
> +                               PHBERR(p, "PCI FIR=0x%016llx\n",
> +                                      p->pfir_cache);
> +                               PHBERR(p, "NEST FIR=0x%016llx\n",
> +                                      p->nfir_cache);

The AIB error log register also useful here, can you also dump that?
Something like this does the trick:

diff --git a/hw/phb4.c b/hw/phb4.c
index f463d20d15be..0f6d8bc5eb0b 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -3200,9 +3200,16 @@ static int64_t phb4_creset(struct pci_slot *slot)
     XPEC_NEST_STK_PCI_NFIR, &p->nfir_cache);

  if (p->pfir_cache || p->nfir_cache) {
+ uint64_t err_aib;
+
+ xscom_read(p->chip_id, p->pci_stk_xscom +
+ XPEC_PCI_STK_PBAIB_ERR_REPORT,
+ &err_aib);
+
  PHBERR(p, "CRESET: PHB still fenced !!\n");
  PHBERR(p, "PCI FIR=0x%016llx\n",
         p->pfir_cache);
  PHBERR(p, "NEST FIR=0x%016llx\n",
+ PHBERR(p, "AIB ERR=0x%016llx\n", err_aib);
         p->nfir_cache);

> +                               /* Dump other error registers */
> +                               phb4_eeh_dump_regs(p);

At this point the ETU is still in reset and trying to dump the error
registers floods the msglog with XSCOM errors. I think this is because
the HV indirect register access xscoms don't work while reset is
asserted so this should probably be removed.

Otherwise,  Reviewed-by: Oliver O'Halloran <oohall@gmail.com>

> +
> +                               /* Reset the PHB errors */
> +                               xscom_write(p->chip_id, p->pci_stk_xscom +
> +                                           XPEC_PCI_STK_PCI_FIR, 0);
> +                               xscom_write(p->chip_id, p->pe_stk_xscom +
> +                                           XPEC_NEST_STK_PCI_NFIR, 0);
> +                       }
> +
>                         /* Clear PHB from reset */
>                         xscom_write(p->chip_id,
>                                     p->pci_stk_xscom + XPEC_PCI_STK_ETU_RESET, 0x0);
> --
> 2.17.1
>
> _______________________________________________
> Skiboot mailing list
> Skiboot@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/skiboot
diff mbox series

Patch

diff --git a/hw/phb4.c b/hw/phb4.c
index d1245dce..9c4b54b5 100644
--- a/hw/phb4.c
+++ b/hw/phb4.c
@@ -3148,6 +3148,33 @@  static int64_t phb4_creset(struct pci_slot *slot)
 			xscom_write(p->chip_id, p->pe_stk_xscom + 0x1,
 				    ~p->nfir_cache);
 
+			/* Re-read errors in PFIR and NFIR and reset any new
+			 * error reported. This may happen as after fundamental
+			 * reset was asserted in previous step the device may
+			 * still be sending TLPs causing fence to be raised.
+			 */
+			xscom_read(p->chip_id, p->pci_stk_xscom +
+				   XPEC_PCI_STK_PCI_FIR, &p->pfir_cache);
+			xscom_read(p->chip_id, p->pe_stk_xscom +
+				   XPEC_NEST_STK_PCI_NFIR, &p->nfir_cache);
+
+			if (p->pfir_cache || p->nfir_cache) {
+				PHBERR(p, "CRESET: PHB still fenced !!\n");
+				PHBERR(p, "PCI FIR=0x%016llx\n",
+				       p->pfir_cache);
+				PHBERR(p, "NEST FIR=0x%016llx\n",
+				       p->nfir_cache);
+
+				/* Dump other error registers */
+				phb4_eeh_dump_regs(p);
+
+				/* Reset the PHB errors */
+				xscom_write(p->chip_id, p->pci_stk_xscom +
+					    XPEC_PCI_STK_PCI_FIR, 0);
+				xscom_write(p->chip_id, p->pe_stk_xscom +
+					    XPEC_NEST_STK_PCI_NFIR, 0);
+			}
+
 			/* Clear PHB from reset */
 			xscom_write(p->chip_id,
 				    p->pci_stk_xscom + XPEC_PCI_STK_ETU_RESET, 0x0);