mbox series

[v13,0/6] Address error and recovery for AER and DPC

Message ID 1523284914-2037-1-git-send-email-poza@codeaurora.org
Headers show
Series Address error and recovery for AER and DPC | expand

Message

Oza Pawandeep April 9, 2018, 2:41 p.m. UTC
This patch set brings in error handling support for DPC

The current implementation of AER and error message broadcasting to the
EP driver is tightly coupled and limited to AER service driver.
It is important to factor out broadcasting and other link handling
callbacks. So that not only when AER gets triggered, but also when DPC get
triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.

DPC should behave identical to AER as far as error handling is concerned.
DPC should remove the devices and not to do recovery for hotplug enabled system.

Changes since v13:
    Bjorn's and Keith's Comments addressed.
    > Made DPC and AER error handling identical <aligned err.c>
    > hanldled cases for hotplug enabled system differently.
Changes since v11:
    Bjorn's comments addressed.
    > rename pcie-err.c to err.c
    > removed EXPORT_SYMBOL
    > made generic find_serivce function in port driver.
    > removed mutex patch as no need to have mutex in pcie_do_recovery
    > brough in DPC_FATAL in aer.h
    > so now all the error codes (AER and DPC) are unified in aer.h
Changes since v10:
    Christoph Hellwig's, David Laight's and Randy Dunlap's
    comments addressed.
        > renamed pci_do_recovery to pcie_do_recovery
        > removed inner braces in conditional statements.
        > restrctured the code in pci_wait_for_link
        > EXPORT_SYMBOL_GPL
Changes since v9:
    Sinan's comments addressed.
        > bool active = true; unnecessary variable removed.
Changes since v8:
    Fixed Kbuild errors.
Changes since v7:
    Rebased the code on pci master
        > https://kernel.googlesource.com/pub/scm/linux/kernel/git/helgaas/pci
Changes since v6:
    Sinan's and Stefan's comments implemented.
        > reordered patch 6 and 7
        > cleaned up
Changes since v5:
    Sinan's and Keith's comments incorporated.
        > made separate patch for mutex
        > unified error repotting codes into driver/pci/pci.h
        > got rid of wait link active/inactive and
          made generic function in driver/pci/pci.c
Changes since v4:
    Bjorn's comments incorporated.
        > Renamed only do_recovery.
        > moved the things more locally to drivers/pci/pci.h
Changes since v3:
    Bjorn's comments incorporated.
        > Made separate patch renaming generic pci_err.c
        > Introduce pci_err.h to contain all the error types and recovery
        > removed all the dependencies on pci.h
Changes since v2:
    Based on feedback from Keith:
    "
    When DPC is triggered due to receipt of an uncorrectable error Message,
    the Requester ID from the Message is recorded in the DPC Error
    Source ID register and that Message is discarded and not forwarded Upstream.
    "
    Removed the patch where AER checks if DPC service is active
Changes since v1:
    Kbuild errors fixed:
        > pci_find_dpc_dev made static
        > ras_event.h updated
        > pci_find_aer_service call with CONFIG check
        > pci_find_dpc_service call with CONFIG check

Oza Pawandeep (6):
  PCI/AER: Rename error recovery to generic PCI naming
  PCI/AER: Factor out error reporting from AER
  PCI/PORTDRV: Implement generic find service
  PCI/DPC: Unify and plumb error handling into DPC
  PCI: Unify wait for link active into generic PCI
  PCI/DPC: Do not do recovery for hotplug enabled system

 drivers/pci/hotplug/pciehp_hpc.c   |  20 +--
 drivers/pci/pci.c                  |  29 ++++
 drivers/pci/pci.h                  |   5 +
 drivers/pci/pcie/Makefile          |   2 +-
 drivers/pci/pcie/aer/aerdrv.h      |  30 ----
 drivers/pci/pcie/aer/aerdrv_core.c | 317 +---------------------------------
 drivers/pci/pcie/err.c             | 340 +++++++++++++++++++++++++++++++++++++
 drivers/pci/pcie/pcie-dpc.c        | 107 +++++++++---
 drivers/pci/pcie/portdrv.h         |   2 +
 drivers/pci/pcie/portdrv_core.c    |  45 +++++
 include/linux/aer.h                |   2 +
 11 files changed, 509 insertions(+), 390 deletions(-)
 create mode 100644 drivers/pci/pcie/err.c

Comments

Bjorn Helgaas April 16, 2018, 3:16 a.m. UTC | #1
On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
> This patch set brings in error handling support for DPC
> 
> The current implementation of AER and error message broadcasting to the
> EP driver is tightly coupled and limited to AER service driver.
> It is important to factor out broadcasting and other link handling
> callbacks. So that not only when AER gets triggered, but also when DPC get
> triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
> 
> DPC should behave identical to AER as far as error handling is concerned.
> DPC should remove the devices and not to do recovery for hotplug enabled system.

Is there a specific bug that's fixed by these patches?  I didn't see
one mentioned in the changelogs.
Sinan Kaya April 16, 2018, 3:53 a.m. UTC | #2
On 4/15/2018 11:16 PM, Bjorn Helgaas wrote:
> On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
>> This patch set brings in error handling support for DPC
>>
>> The current implementation of AER and error message broadcasting to the
>> EP driver is tightly coupled and limited to AER service driver.
>> It is important to factor out broadcasting and other link handling
>> callbacks. So that not only when AER gets triggered, but also when DPC get
>> triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
>>
>> DPC should behave identical to AER as far as error handling is concerned.
>> DPC should remove the devices and not to do recovery for hotplug enabled system.
> 
> Is there a specific bug that's fixed by these patches?  I didn't see
> one mentioned in the changelogs.
> 

There is no actual bug. 

We realized that DPC and hotplug is heavily integrated today. We have use
cases for systems without hotplug support but still support DPC. That's the
problem we are trying to solve with this patchset.
Oza Pawandeep April 16, 2018, 6:03 a.m. UTC | #3
On 2018-04-16 09:23, Sinan Kaya wrote:
> On 4/15/2018 11:16 PM, Bjorn Helgaas wrote:
>> On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
>>> This patch set brings in error handling support for DPC
>>> 
>>> The current implementation of AER and error message broadcasting to 
>>> the
>>> EP driver is tightly coupled and limited to AER service driver.
>>> It is important to factor out broadcasting and other link handling
>>> callbacks. So that not only when AER gets triggered, but also when 
>>> DPC get
>>> triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
>>> 
>>> DPC should behave identical to AER as far as error handling is 
>>> concerned.
>>> DPC should remove the devices and not to do recovery for hotplug 
>>> enabled system.
>> 
>> Is there a specific bug that's fixed by these patches?  I didn't see
>> one mentioned in the changelogs.
>> 
> 
> There is no actual bug.
> 
> We realized that DPC and hotplug is heavily integrated today. We have 
> use
> cases for systems without hotplug support but still support DPC. That's 
> the
> problem we are trying to solve with this patchset.

Adding to what Sinan said;

DPC should handle the error handling and recovery similar to AER, 
because finally both
are attempting recovery in some or the other way,
and for that error handling and recovery framework has to be loosely 
coupled.
It achieves uniformity and transparency to the error handling agents 
such as AER, DPC, with respect to recovery and error handling.

So, this patch-set tries to unify lot of things between error agents and 
make them behave in a well defined way. (be it error (FATAL, NON_FATAL) 
handling or recovery).

Regards,
Oza.
Bjorn Helgaas April 16, 2018, 1:27 p.m. UTC | #4
On Mon, Apr 16, 2018 at 11:33:13AM +0530, poza@codeaurora.org wrote:
> On 2018-04-16 09:23, Sinan Kaya wrote:
> > On 4/15/2018 11:16 PM, Bjorn Helgaas wrote:
> > > On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
> > > > This patch set brings in error handling support for DPC
> > > > 
> > > > The current implementation of AER and error message broadcasting
> > > > to the
> > > > EP driver is tightly coupled and limited to AER service driver.
> > > > It is important to factor out broadcasting and other link handling
> > > > callbacks. So that not only when AER gets triggered, but also
> > > > when DPC get
> > > > triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
> > > > 
> > > > DPC should behave identical to AER as far as error handling is
> > > > concerned.
> > > > DPC should remove the devices and not to do recovery for hotplug
> > > > enabled system.
> > > 
> > > Is there a specific bug that's fixed by these patches?  I didn't see
> > > one mentioned in the changelogs.
> > > 
> > 
> > There is no actual bug.
> > 
> > We realized that DPC and hotplug is heavily integrated today. We
> > have use cases for systems without hotplug support but still
> > support DPC. That's the problem we are trying to solve with this
> > patchset.

Apparently there's a problem with systems that have DPC but not
hotplug.  It will be extremely helpful if you can articulate what that
problem is and include it in the appropriate changelog.

> Adding to what Sinan said;
> 
> DPC should handle the error handling and recovery similar to AER,
> because finally both are attempting recovery in some or the other
> way, and for that error handling and recovery framework has to be
> loosely coupled.  It achieves uniformity and transparency to the
> error handling agents such as AER, DPC, with respect to recovery and
> error handling.
> 
> So, this patch-set tries to unify lot of things between error agents
> and make them behave in a well defined way. (be it error (FATAL,
> NON_FATAL) handling or recovery).

I totally support this objective.

Bjorn
Oza Pawandeep April 16, 2018, 2:12 p.m. UTC | #5
On 2018-04-16 18:57, Bjorn Helgaas wrote:
> On Mon, Apr 16, 2018 at 11:33:13AM +0530, poza@codeaurora.org wrote:
>> On 2018-04-16 09:23, Sinan Kaya wrote:
>> > On 4/15/2018 11:16 PM, Bjorn Helgaas wrote:
>> > > On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
>> > > > This patch set brings in error handling support for DPC
>> > > >
>> > > > The current implementation of AER and error message broadcasting
>> > > > to the
>> > > > EP driver is tightly coupled and limited to AER service driver.
>> > > > It is important to factor out broadcasting and other link handling
>> > > > callbacks. So that not only when AER gets triggered, but also
>> > > > when DPC get
>> > > > triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
>> > > >
>> > > > DPC should behave identical to AER as far as error handling is
>> > > > concerned.
>> > > > DPC should remove the devices and not to do recovery for hotplug
>> > > > enabled system.
>> > >
>> > > Is there a specific bug that's fixed by these patches?  I didn't see
>> > > one mentioned in the changelogs.
>> > >
>> >
>> > There is no actual bug.
>> >
>> > We realized that DPC and hotplug is heavily integrated today. We
>> > have use cases for systems without hotplug support but still
>> > support DPC. That's the problem we are trying to solve with this
>> > patchset.
> 
> Apparently there's a problem with systems that have DPC but not
> hotplug.  It will be extremely helpful if you can articulate what that
> problem is and include it in the appropriate changelog.
> 
>> Adding to what Sinan said;
>> 
>> DPC should handle the error handling and recovery similar to AER,
>> because finally both are attempting recovery in some or the other
>> way, and for that error handling and recovery framework has to be
>> loosely coupled.  It achieves uniformity and transparency to the
>> error handling agents such as AER, DPC, with respect to recovery and
>> error handling.
>> 
>> So, this patch-set tries to unify lot of things between error agents
>> and make them behave in a well defined way. (be it error (FATAL,
>> NON_FATAL) handling or recovery).
> 
> I totally support this objective.

Thanks Bjorn, I will include this objective in Changelog along with 
Sinan's text.
I am not clear on one last thing Bjorn; which is;
do we need last patch ? patch-6 which handles hotplug case.
Also I think we could take this patch-set as basic changes/attempt to 
unify the code which it does.

And, in the next follow-up patches we can improve upon the things such 
as,
whether to do different actions for FATAL cases and NON_FATAL cases. And 
then I can make needed changes to AER and DPC
Please let me know how this sounds.

> 
> Bjorn
Sinan Kaya April 16, 2018, 2:30 p.m. UTC | #6
On 4/16/2018 9:27 AM, Bjorn Helgaas wrote:
>>> We realized that DPC and hotplug is heavily integrated today. We
>>> have use cases for systems without hotplug support but still
>>> support DPC. That's the problem we are trying to solve with this
>>> patchset.
> Apparently there's a problem with systems that have DPC but not
> hotplug.  It will be extremely helpful if you can articulate what that
> problem is and include it in the appropriate changelog.
> 

At a higher level, the DPC driver performs the stop operation regardless of
hotplug. However, DPC driver relies on hotplug driver observing link up to
re-enumerate. 

Of course, when the system didn't support hotplug; there was nobody to
restore functionality.

Our initial attempt was to also do a re-enumeration in the DPC driver
regardless of hotplug driver in the system or not. 

If hotplug driver is present, it would observe two enumerations. It still
worked as long as these were protected by a mutex.

Then, we got your input that you want DPC and AER to behave the same. We
started converging towards the AER path.