
PCI: layerscape: Change back to the default error response behavior

Message ID 20200929131328.13779-1-Zhiqiang.Hou@nxp.com
State New
Series PCI: layerscape: Change back to the default error response behavior

Commit Message

Zhiqiang Hou Sept. 29, 2020, 1:13 p.m. UTC
From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>

In the current error response behavior, it will send a SLVERR response
to device's internal AXI slave system interface when the PCIe controller
experiences an erroneous completion (UR, CA and CT) from an external
completer for its outbound non-posted request, which will result in
SError and crash the kernel directly.
This patch change back it to the default behavior to increase the
robustness of the kernel. In the default behavior, it always sends an
OKAY response to the internal AXI slave interface when the controller
gets these erroneous completions. And the AER driver will report and
try to recover these errors.

Signed-off-by: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
---
 drivers/pci/controller/dwc/pci-layerscape.c | 11 -----------
 1 file changed, 11 deletions(-)

Comments

Bjorn Helgaas Sept. 29, 2020, 3:02 p.m. UTC | #1
On Tue, Sep 29, 2020 at 09:13:28PM +0800, Zhiqiang Hou wrote:
> From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> 
> In the current error response behavior, it will send a SLVERR response
> to device's internal AXI slave system interface when the PCIe controller
> experiences an erroneous completion (UR, CA and CT) from an external
> completer for its outbound non-posted request, which will result in
> SError and crash the kernel directly.

Possible wording:

  As currently configured, when the PCIe controller receives a
  Completion with UR or CA status, or a Completion Timeout occurs, it
  sends a SLVERR response to the internal AXI slave system interface,
  which results in SError and a kernel crash.

Please add a blank line between paragraphs, and
s/This patch change back it/Change it/ below.

> This patch change back it to the default behavior to increase the
> robustness of the kernel. In the default behavior, it always sends an
> OKAY response to the internal AXI slave interface when the controller
> gets these erroneous completions. And the AER driver will report and
> try to recover these errors.

This reverts 84d897d69938 ("PCI: layerscape: Change default error
response behavior"), so please mention that in the commit log,
probably as:

Fixes: 84d897d69938 ("PCI: layerscape: Change default error response behavior")

Maybe it also needs a stable tag, e.g., v4.15+?

Since this is a pure revert, whatever problem 84d897d69938 fixed must
now be fixed in some other way.  Otherwise, this revert would just be
reintroducing the problem fixed by 84d897d69938.

This commit log should mention what that other fix is.

AER is only a reporting mechanism, it is asynchronous to the
instruction stream, and it's optional (may not be implemented in the
hardware, and may not be supported by the kernel), so I'm not super
convinced that it can be the answer to this problem.

> Signed-off-by: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> ---
>  drivers/pci/controller/dwc/pci-layerscape.c | 11 -----------
>  1 file changed, 11 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pci-layerscape.c b/drivers/pci/controller/dwc/pci-layerscape.c
> index f24f79a70d9a..e92ab8a77046 100644
> --- a/drivers/pci/controller/dwc/pci-layerscape.c
> +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> @@ -30,8 +30,6 @@
>  
>  /* PEX Internal Configuration Registers */
>  #define PCIE_STRFMR1		0x71c /* Symbol Timer & Filter Mask Register1 */
> -#define PCIE_ABSERR		0x8d0 /* Bridge Slave Error Response Register */
> -#define PCIE_ABSERR_SETTING	0x9401 /* Forward error of non-posted request */
>  
>  #define PCIE_IATU_NUM		6
>  
> @@ -123,14 +121,6 @@ static int ls_pcie_link_up(struct dw_pcie *pci)
>  	return 1;
>  }
>  
> -/* Forward error response of outbound non-posted requests */
> -static void ls_pcie_fix_error_response(struct ls_pcie *pcie)
> -{
> -	struct dw_pcie *pci = pcie->pci;
> -
> -	iowrite32(PCIE_ABSERR_SETTING, pci->dbi_base + PCIE_ABSERR);
> -}
> -
>  static int ls_pcie_host_init(struct pcie_port *pp)
>  {
>  	struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> @@ -142,7 +132,6 @@ static int ls_pcie_host_init(struct pcie_port *pp)
>  	 * dw_pcie_setup_rc() will reconfigure the outbound windows.
>  	 */
>  	ls_pcie_disable_outbound_atus(pcie);
> -	ls_pcie_fix_error_response(pcie);
>  
>  	dw_pcie_dbi_ro_wr_en(pci);
>  	ls_pcie_clear_multifunction(pcie);
> -- 
> 2.17.1
>

Zhiqiang Hou Sept. 30, 2020, 5:37 a.m. UTC | #2
Hi Bjorn,

Thanks a lot for your comments!

> -----Original Message-----
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: September 29, 2020 23:03
> To: Z.q. Hou <zhiqiang.hou@nxp.com>
> Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; lorenzo.pieralisi@arm.com;
> robh@kernel.org; bhelgaas@google.com; M.h. Lian
> <minghuan.lian@nxp.com>; Roy Zang <roy.zang@nxp.com>; Mingkai Hu
> <mingkai.hu@nxp.com>; Leo Li <leoyang.li@nxp.com>
> Subject: Re: [PATCH] PCI: layerscape: Change back to the default error
> response behavior
> 
> On Tue, Sep 29, 2020 at 09:13:28PM +0800, Zhiqiang Hou wrote:
> > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> >
> > In the current error response behavior, it will send a SLVERR response
> > to device's internal AXI slave system interface when the PCIe
> > controller experiences an erroneous completion (UR, CA and CT) from an
> > external completer for its outbound non-posted request, which will
> > result in SError and crash the kernel directly.
> 
> Possible wording:
> 
>   As currently configured, when the PCIe controller receives a
>   Completion with UR or CA status, or a Completion Timeout occurs, it
>   sends a SLVERR response to the internal AXI slave system interface,
>   which results in SError and a kernel crash.
> 
> Please add a blank line between paragraphs, and s/This patch change back
> it/Change it/ below.
> 
> > This patch change back it to the default behavior to increase the
> > robustness of the kernel. In the default behavior, it always sends an
> > OKAY response to the internal AXI slave interface when the controller
> > gets these erroneous completions. And the AER driver will report and
> > try to recover these errors.
> 
> This reverts 84d897d69938 ("PCI: layerscape: Change default error response
> behavior"), so please mention that in the commit log, probably as:
> 
> Fixes: 84d897d69938 ("PCI: layerscape: Change default error response behavior")
> 
> Maybe it also needs a stable tag, e.g., v4.15+?

Thanks for your good suggestions! Will fix in v2.

> 
> Since this is a pure revert, whatever problem 84d897d69938 fixed must now
> be fixed in some other way.  Otherwise, this revert would just be
> reintroducing the problem fixed by 84d897d69938.
> 
> This commit log should mention what that other fix is.
> 
> AER is only a reporting mechanism, it is asynchronous to the instruction
> stream, and it's optional (may not be implemented in the hardware, and may
> not be supported by the kernel), so I'm not super convinced that it can be the
> answer to this problem.
>

Commit 84d897d69938 ("PCI: layerscape: Change default error response behavior") doesn't fix any issue; it just enables a DesignWare PCIe IP feature that allows error responses to the AXI slave interface, which other platforms with DWC IP do not enable. As described in that commit, the setting affects erroneous completions of all non-posted transactions, including CFG and MEM_rd transactions. However, upstream does not support platforms that abort on CFG accesses, so we have to change back to the default error response behavior and accept that MEM_rd errors are not forwarded, just as on other DWC IP platforms.

As I recall, the SError interrupt is also an asynchronous abort and likewise only a reporting mechanism; unlike AER, however, it crashes the kernel. So neither mechanism can guarantee data integrity. Upper-layer data transfer protocols generally have their own mechanisms to ensure data integrity, so this is not an issue for most users. Anyone who really wants a kernel crash on a MEM_rd error can enable this setting in their local code.

Thanks,
Zhiqiang
 
> > Signed-off-by: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> > ---
> >  drivers/pci/controller/dwc/pci-layerscape.c | 11 -----------
> >  1 file changed, 11 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c b/drivers/pci/controller/dwc/pci-layerscape.c
> > index f24f79a70d9a..e92ab8a77046 100644
> > --- a/drivers/pci/controller/dwc/pci-layerscape.c
> > +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> > @@ -30,8 +30,6 @@
> >
> >  /* PEX Internal Configuration Registers */
> >  #define PCIE_STRFMR1		0x71c /* Symbol Timer & Filter Mask Register1 */
> > -#define PCIE_ABSERR		0x8d0 /* Bridge Slave Error Response Register */
> > -#define PCIE_ABSERR_SETTING	0x9401 /* Forward error of non-posted request */
> >
> >  #define PCIE_IATU_NUM		6
> >
> > @@ -123,14 +121,6 @@ static int ls_pcie_link_up(struct dw_pcie *pci)
> >  	return 1;
> >  }
> >
> > -/* Forward error response of outbound non-posted requests */
> > -static void ls_pcie_fix_error_response(struct ls_pcie *pcie)
> > -{
> > -	struct dw_pcie *pci = pcie->pci;
> > -
> > -	iowrite32(PCIE_ABSERR_SETTING, pci->dbi_base + PCIE_ABSERR);
> > -}
> > -
> >  static int ls_pcie_host_init(struct pcie_port *pp)
> >  {
> >  	struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> > @@ -142,7 +132,6 @@ static int ls_pcie_host_init(struct pcie_port *pp)
> >  	 * dw_pcie_setup_rc() will reconfigure the outbound windows.
> >  	 */
> >  	ls_pcie_disable_outbound_atus(pcie);
> > -	ls_pcie_fix_error_response(pcie);
> >
> >  	dw_pcie_dbi_ro_wr_en(pci);
> >  	ls_pcie_clear_multifunction(pcie);
> > --
> > 2.17.1
> >

Kishon Vijay Abraham I Sept. 30, 2020, 1:29 p.m. UTC | #3
Hi Hou,

On 29/09/20 6:43 pm, Zhiqiang Hou wrote:
> From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> 
> In the current error response behavior, it will send a SLVERR response
> to device's internal AXI slave system interface when the PCIe controller
> experiences an erroneous completion (UR, CA and CT) from an external
> completer for its outbound non-posted request, which will result in
> SError and crash the kernel directly.
> This patch change back it to the default behavior to increase the
> robustness of the kernel. In the default behavior, it always sends an
> OKAY response to the internal AXI slave interface when the controller
> gets these erroneous completions. And the AER driver will report and
> try to recover these errors.

I don't think not forwarding any error interrupts is a good idea. Maybe
you could disable it while reading configuration space registers
(vendorID and deviceID) and then enable error forwarding back?

Thanks
Kishon
> 
> Signed-off-by: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> ---
>  drivers/pci/controller/dwc/pci-layerscape.c | 11 -----------
>  1 file changed, 11 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pci-layerscape.c b/drivers/pci/controller/dwc/pci-layerscape.c
> index f24f79a70d9a..e92ab8a77046 100644
> --- a/drivers/pci/controller/dwc/pci-layerscape.c
> +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> @@ -30,8 +30,6 @@
>  
>  /* PEX Internal Configuration Registers */
>  #define PCIE_STRFMR1		0x71c /* Symbol Timer & Filter Mask Register1 */
> -#define PCIE_ABSERR		0x8d0 /* Bridge Slave Error Response Register */
> -#define PCIE_ABSERR_SETTING	0x9401 /* Forward error of non-posted request */
>  
>  #define PCIE_IATU_NUM		6
>  
> @@ -123,14 +121,6 @@ static int ls_pcie_link_up(struct dw_pcie *pci)
>  	return 1;
>  }
>  
> -/* Forward error response of outbound non-posted requests */
> -static void ls_pcie_fix_error_response(struct ls_pcie *pcie)
> -{
> -	struct dw_pcie *pci = pcie->pci;
> -
> -	iowrite32(PCIE_ABSERR_SETTING, pci->dbi_base + PCIE_ABSERR);
> -}
> -
>  static int ls_pcie_host_init(struct pcie_port *pp)
>  {
>  	struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> @@ -142,7 +132,6 @@ static int ls_pcie_host_init(struct pcie_port *pp)
>  	 * dw_pcie_setup_rc() will reconfigure the outbound windows.
>  	 */
>  	ls_pcie_disable_outbound_atus(pcie);
> -	ls_pcie_fix_error_response(pcie);
>  
>  	dw_pcie_dbi_ro_wr_en(pci);
>  	ls_pcie_clear_multifunction(pcie);
>

Rob Herring Sept. 30, 2020, 3:07 p.m. UTC | #4
On Wed, Sep 30, 2020 at 8:29 AM Kishon Vijay Abraham I <kishon@ti.com> wrote:
>
> Hi Hou,
>
> On 29/09/20 6:43 pm, Zhiqiang Hou wrote:
> > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> >
> > In the current error response behavior, it will send a SLVERR response
> > to device's internal AXI slave system interface when the PCIe controller
> > experiences an erroneous completion (UR, CA and CT) from an external
> > completer for its outbound non-posted request, which will result in
> > SError and crash the kernel directly.
> > This patch change back it to the default behavior to increase the
> > robustness of the kernel. In the default behavior, it always sends an
> > OKAY response to the internal AXI slave interface when the controller
> > gets these erroneous completions. And the AER driver will report and
> > try to recover these errors.
>
> I don't think not forwarding any error interrupts is a good idea.

Interrupts would be fine. Abort/SError is not. I think it is pretty
clear what the correct behavior is for config accesses.

> Maybe
> you could disable it while reading configuration space registers
> (vendorID and deviceID) and then enable error forwarding back?

To add to the locking (or lack of) problems in config accesses?

Rob

Kishon Vijay Abraham I Sept. 30, 2020, 3:42 p.m. UTC | #5
Hi,

On 30/09/20 8:37 pm, Rob Herring wrote:
> On Wed, Sep 30, 2020 at 8:29 AM Kishon Vijay Abraham I <kishon@ti.com> wrote:
>>
>> Hi Hou,
>>
>> On 29/09/20 6:43 pm, Zhiqiang Hou wrote:
>>> From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
>>>
>>> In the current error response behavior, it will send a SLVERR response
>>> to device's internal AXI slave system interface when the PCIe controller
>>> experiences an erroneous completion (UR, CA and CT) from an external
>>> completer for its outbound non-posted request, which will result in
>>> SError and crash the kernel directly.
>>> This patch change back it to the default behavior to increase the
>>> robustness of the kernel. In the default behavior, it always sends an
>>> OKAY response to the internal AXI slave interface when the controller
>>> gets these erroneous completions. And the AER driver will report and
>>> try to recover these errors.
>>
>> I don't think not forwarding any error interrupts is a good idea.
> 
> Interrupts would be fine. Abort/SError is not. I think it is pretty
> clear what the correct behavior is for config accesses.

IIUC, $patch prevents SError in all cases. Don't UR, CA and CT all
send SLVERR, which results in an abort, and isn't that what is being
prevented here? Maybe I'm wrong; Hou can confirm.

Thanks
Kishon

Zhiqiang Hou Oct. 12, 2020, 4:33 a.m. UTC | #6
Hi Rob and Kishon,

> -----Original Message-----
> From: Rob Herring <robh@kernel.org>
> Sent: September 30, 2020 23:08
> To: Kishon Vijay Abraham I <kishon@ti.com>
> Cc: Z.q. Hou <zhiqiang.hou@nxp.com>; PCI <linux-pci@vger.kernel.org>;
> linux-kernel@vger.kernel.org; linux-arm-kernel
> <linux-arm-kernel@lists.infradead.org>; Lorenzo Pieralisi
> <lorenzo.pieralisi@arm.com>; Bjorn Helgaas <bhelgaas@google.com>; M.h.
> Lian <minghuan.lian@nxp.com>; Roy Zang <roy.zang@nxp.com>; Mingkai
> Hu <mingkai.hu@nxp.com>; Leo Li <leoyang.li@nxp.com>
> Subject: Re: [PATCH] PCI: layerscape: Change back to the default error
> response behavior
> 
> On Wed, Sep 30, 2020 at 8:29 AM Kishon Vijay Abraham I <kishon@ti.com>
> wrote:
> >
> > Hi Hou,
> >
> > On 29/09/20 6:43 pm, Zhiqiang Hou wrote:
> > > From: Hou Zhiqiang <Zhiqiang.Hou@nxp.com>
> > >
> > > In the current error response behavior, it will send a SLVERR
> > > response to device's internal AXI slave system interface when the
> > > PCIe controller experiences an erroneous completion (UR, CA and CT)
> > > from an external completer for its outbound non-posted request,
> > > which will result in SError and crash the kernel directly.
> > > This patch change back it to the default behavior to increase the
> > > robustness of the kernel. In the default behavior, it always sends
> > > an OKAY response to the internal AXI slave interface when the
> > > controller gets these erroneous completions. And the AER driver will
> > > report and try to recover these errors.
> >
> > I don't think not forwarding any error interrupts is a good idea.
> 
> Interrupts would be fine. Abort/SError is not. I think it is pretty clear what the
> correct behavior is for config accesses.

I agree with Rob.

> 
> > Maybe
> > you could disable it while reading configuration space registers
> > (vendorID and deviceID) and then enable error forwarding back?
> 
> To add to the locking (or lack of) problems in config accesses?

With this approach, MEM_rd errors would also go unforwarded during the whole window of a CFG access, so it would not be a reliable mechanism for users.

Thanks,
Zhiqiang

> 
> Rob

Patch

diff --git a/drivers/pci/controller/dwc/pci-layerscape.c b/drivers/pci/controller/dwc/pci-layerscape.c
index f24f79a70d9a..e92ab8a77046 100644
--- a/drivers/pci/controller/dwc/pci-layerscape.c
+++ b/drivers/pci/controller/dwc/pci-layerscape.c
@@ -30,8 +30,6 @@ 
 
 /* PEX Internal Configuration Registers */
 #define PCIE_STRFMR1		0x71c /* Symbol Timer & Filter Mask Register1 */
-#define PCIE_ABSERR		0x8d0 /* Bridge Slave Error Response Register */
-#define PCIE_ABSERR_SETTING	0x9401 /* Forward error of non-posted request */
 
 #define PCIE_IATU_NUM		6
 
@@ -123,14 +121,6 @@ static int ls_pcie_link_up(struct dw_pcie *pci)
 	return 1;
 }
 
-/* Forward error response of outbound non-posted requests */
-static void ls_pcie_fix_error_response(struct ls_pcie *pcie)
-{
-	struct dw_pcie *pci = pcie->pci;
-
-	iowrite32(PCIE_ABSERR_SETTING, pci->dbi_base + PCIE_ABSERR);
-}
-
 static int ls_pcie_host_init(struct pcie_port *pp)
 {
 	struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
@@ -142,7 +132,6 @@ static int ls_pcie_host_init(struct pcie_port *pp)
 	 * dw_pcie_setup_rc() will reconfigure the outbound windows.
 	 */
 	ls_pcie_disable_outbound_atus(pcie);
-	ls_pcie_fix_error_response(pcie);
 
 	dw_pcie_dbi_ro_wr_en(pci);
 	ls_pcie_clear_multifunction(pcie);