Message ID | 1523284914-2037-1-git-send-email-poza@codeaurora.org |
---|---|
Series | Address error and recovery for AER and DPC |
On Mon, Apr 09, 2018 at 10:41:48AM -0400, Oza Pawandeep wrote:
> This patch set brings in error handling support for DPC
>
> The current implementation of AER and error message broadcasting to the
> EP driver is tightly coupled and limited to AER service driver.
> It is important to factor out broadcasting and other link handling
> callbacks. So that not only when AER gets triggered, but also when DPC get
> triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
>
> DPC should behave identical to AER as far as error handling is concerned.
> DPC should remove the devices and not to do recovery for hotplug enabled system.

Is there a specific bug that's fixed by these patches? I didn't see
one mentioned in the changelogs.
On 4/15/2018 11:16 PM, Bjorn Helgaas wrote:
> Is there a specific bug that's fixed by these patches? I didn't see
> one mentioned in the changelogs.

There is no actual bug.

We realized that DPC and hotplug are heavily integrated today. We have
use cases for systems without hotplug support that still support DPC.
That's the problem we are trying to solve with this patchset.
On 2018-04-16 09:23, Sinan Kaya wrote:
> There is no actual bug.
>
> We realized that DPC and hotplug are heavily integrated today. We have
> use cases for systems without hotplug support that still support DPC.
> That's the problem we are trying to solve with this patchset.

Adding to what Sinan said:

DPC should handle errors and recovery similarly to AER, because
ultimately both attempt recovery in one way or another, and for that
the error handling and recovery framework has to be loosely coupled.
This gives the error handling agents, such as AER and DPC, uniformity
and transparency with respect to recovery and error handling.

So, this patch set tries to unify a lot of things between the error
agents and make them behave in a well-defined way (be it error (FATAL,
NON_FATAL) handling or recovery).

Regards,
Oza.
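To illustrate the loosely coupled framework described above, here is a minimal userspace sketch in C. It is not the code from the patch set: `recover_common`, `struct ep_callbacks`, `link_reset_fn`, `aer_reset_link`, `dpc_reset_link` and the counting driver are all hypothetical names invented for this example. The only idea it models is that the broadcast to EP drivers lives in one shared routine, and each error agent contributes just its own link handling hook.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum severity { NON_FATAL, FATAL };

/* Callbacks an endpoint (EP) driver registers with the shared core. */
struct ep_callbacks {
    void (*error_detected)(void *ctx, enum severity sev);
    void (*resume)(void *ctx);
};

struct endpoint {
    const struct ep_callbacks *cb;
    void *ctx;
};

/* Agent-specific link handling hook: returns 0 on success. */
typedef int (*link_reset_fn)(void);

/*
 * Shared recovery path: notify every EP driver, run the agent's link
 * hook (if any), then tell the drivers to resume.
 * Returns 0 on success, -1 if the link hook failed.
 */
int recover_common(struct endpoint *eps, size_t n, enum severity sev,
                   link_reset_fn reset_link)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (eps[i].cb && eps[i].cb->error_detected)
            eps[i].cb->error_detected(eps[i].ctx, sev);

    if (reset_link && reset_link() != 0)
        return -1;

    for (i = 0; i < n; i++)
        if (eps[i].cb && eps[i].cb->resume)
            eps[i].cb->resume(eps[i].ctx);

    return 0;
}

/* Stand-ins for the two agents' link handling (assumed to succeed). */
int aer_reset_link(void) { return 0; }
int dpc_reset_link(void) { return 0; }

/* A trivial EP driver that only counts the callbacks it receives. */
struct counters { int detected; int resumed; };

void count_detected(void *ctx, enum severity sev)
{
    (void)sev;
    ((struct counters *)ctx)->detected++;
}

void count_resume(void *ctx)
{
    ((struct counters *)ctx)->resumed++;
}

const struct ep_callbacks counting_cb = { count_detected, count_resume };
```

In this model, an AER-triggered and a DPC-triggered error both go through `recover_common`, so the EP driver sees the same callback sequence either way; only the `reset_link` argument differs. Whether FATAL and NON_FATAL should take different actions is left open here, as it is in the thread.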
On Mon, Apr 16, 2018 at 11:33:13AM +0530, poza@codeaurora.org wrote:
> On 2018-04-16 09:23, Sinan Kaya wrote:
> > There is no actual bug.
> >
> > We realized that DPC and hotplug are heavily integrated today. We
> > have use cases for systems without hotplug support that still
> > support DPC. That's the problem we are trying to solve with this
> > patchset.

Apparently there's a problem with systems that have DPC but not
hotplug. It will be extremely helpful if you can articulate what that
problem is and include it in the appropriate changelog.

> Adding to what Sinan said:
>
> DPC should handle errors and recovery similarly to AER, because
> ultimately both attempt recovery in one way or another, and for that
> the error handling and recovery framework has to be loosely coupled.
> This gives the error handling agents, such as AER and DPC, uniformity
> and transparency with respect to recovery and error handling.
>
> So, this patch set tries to unify a lot of things between the error
> agents and make them behave in a well-defined way (be it error (FATAL,
> NON_FATAL) handling or recovery).

I totally support this objective.

Bjorn
On 2018-04-16 18:57, Bjorn Helgaas wrote:
> Apparently there's a problem with systems that have DPC but not
> hotplug. It will be extremely helpful if you can articulate what that
> problem is and include it in the appropriate changelog.
>
>> So, this patch set tries to unify a lot of things between the error
>> agents and make them behave in a well-defined way (be it error (FATAL,
>> NON_FATAL) handling or recovery).
>
> I totally support this objective.

Thanks Bjorn, I will include this objective in the changelog along with
Sinan's text.

I am not clear on one last thing, Bjorn: do we need the last patch,
patch 6, which handles the hotplug case?

Also, I think we could take this patch set as the basic attempt to
unify the code, which it does. In follow-up patches we can then improve
on things such as whether to take different actions for FATAL and
NON_FATAL cases, and I can make the needed changes to AER and DPC.

Please let me know how this sounds.
On 4/16/2018 9:27 AM, Bjorn Helgaas wrote:
>>> We realized that DPC and hotplug are heavily integrated today. We
>>> have use cases for systems without hotplug support that still
>>> support DPC. That's the problem we are trying to solve with this
>>> patchset.
> Apparently there's a problem with systems that have DPC but not
> hotplug. It will be extremely helpful if you can articulate what that
> problem is and include it in the appropriate changelog.

At a higher level, the DPC driver performs the stop operation
regardless of hotplug. However, the DPC driver relies on the hotplug
driver observing link up to re-enumerate. Of course, when the system
doesn't support hotplug, there is nobody to restore functionality.

Our initial attempt was to also do a re-enumeration in the DPC driver,
regardless of whether a hotplug driver is present in the system. If the
hotplug driver is present, it would observe two enumerations. It still
worked as long as these were protected by a mutex.

Then we got your input that you want DPC and AER to behave the same,
so we started converging towards the AER path.
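The "two enumerations protected by a mutex" point can be sketched as a small userspace model in C. This is only an illustration under assumptions: `struct model_bus`, `MODEL_BUS_INIT`, `model_dpc_stop` and `model_rescan` are hypothetical names, a pthread mutex stands in for whatever lock the kernel code used, and a single counter stands in for the devices below the port. The model shows why a second rescan is harmless when both run under the same lock and the rescan only adds devices that are missing.

```c
#include <pthread.h>

/* Hypothetical model of the bus below a DPC-capable port. */
struct model_bus {
    pthread_mutex_t rescan_lock; /* serializes stop and rescan */
    int devices;                 /* devices currently enumerated */
    int rescans;                 /* number of rescan requests seen */
};

/* Starts with one device enumerated and no rescans requested. */
#define MODEL_BUS_INIT { PTHREAD_MUTEX_INITIALIZER, 1, 0 }

/* DPC triggered: stop and remove whatever sits below the port. */
void model_dpc_stop(struct model_bus *bus)
{
    pthread_mutex_lock(&bus->rescan_lock);
    bus->devices = 0;
    pthread_mutex_unlock(&bus->rescan_lock);
}

/*
 * Re-enumerate under the lock. Returns 1 if this call added the
 * device, 0 if there was nothing to add (already enumerated, or no
 * device present on the link).
 */
int model_rescan(struct model_bus *bus, int device_present)
{
    int added = 0;

    pthread_mutex_lock(&bus->rescan_lock);
    bus->rescans++;
    if (device_present && bus->devices == 0) {
        bus->devices = 1;
        added = 1;
    }
    pthread_mutex_unlock(&bus->rescan_lock);

    return added;
}
```

In this model, calling `model_rescan` once from the DPC path and once from the hotplug path after a `model_dpc_stop` leaves exactly one device enumerated: the first call adds it and the second finds nothing to do. On a system without hotplug, only the DPC-side call exists, which is the case the thread is concerned with.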