[RFC,0/3] Asynchronous EEH recovery

Message ID 20230613014337.286222-1-ganeshgr@linux.ibm.com

Message

Ganesh Goudar June 13, 2023, 1:43 a.m. UTC
Hi,

EEH recovery is currently serialized, and these patches shorten
the time it takes by allowing recovery to run in parallel. The
original author of these patches is Sam Bobroff; I have rebased
and tested them.
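
To give a rough idea of the shape of the change, here is a minimal
sketch, assuming a workqueue-based design; the actual patches are
more involved and deal with locking, PE hierarchies and failure
paths. The idea: instead of a single kthread draining EEH events
strictly one at a time, each event is queued as its own work item so
that recoveries of unrelated PEs can proceed concurrently. The names
eeh_event and eeh_handle_normal_event follow the existing kernel
sources; everything else below is illustrative only.

/*
 * Minimal sketch only, NOT the actual patch code: queue each EEH
 * event as its own work item on an unbound workqueue so recoveries
 * of unrelated PEs can run concurrently, instead of one kthread
 * handling events strictly one at a time. Locking and most error
 * handling are omitted.
 */
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <asm/eeh.h>
#include <asm/eeh_event.h>
#include <asm/ppc-pci.h>

static struct workqueue_struct *eeh_recovery_wq;

struct eeh_recovery_work {
	struct work_struct work;
	struct eeh_event *event;
};

static void eeh_recovery_fn(struct work_struct *work)
{
	struct eeh_recovery_work *w =
		container_of(work, struct eeh_recovery_work, work);

	/* Recover this PE; other PEs may be in recovery at the same time. */
	eeh_handle_normal_event(w->event->pe);
	kfree(w->event);
	kfree(w);
}

/* Called instead of handing the event to the single recovery thread. */
static int eeh_queue_recovery(struct eeh_event *event)
{
	/* GFP_ATOMIC because events may originate from exception context. */
	struct eeh_recovery_work *w = kmalloc(sizeof(*w), GFP_ATOMIC);

	if (!w)
		return -ENOMEM;
	w->event = event;
	INIT_WORK(&w->work, eeh_recovery_fn);
	queue_work(eeh_recovery_wq, &w->work);
	return 0;
}

static int __init eeh_recovery_wq_init(void)
{
	/* WQ_UNBOUND: recovery work is not tied to the submitting CPU. */
	eeh_recovery_wq = alloc_workqueue("eeh_recovery", WQ_UNBOUND, 0);
	return eeh_recovery_wq ? 0 : -ENOMEM;
}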

On powervm, with 64 VFs from the same PHB, I see approximately a
48% reduction in the time taken for EEH recovery.

On powernv, with 9 network cards (two cards installed on one PHB
and one card on each of the remaining PHBs, giving 20 PFs in
total), I see approximately a 33% reduction in the time taken for
EEH recovery.

These patches were originally posted as separate RFCs by Sam, and
I rebased and posted them almost a year ago. I stopped pursuing
them because I was unable to test on powernv due to issues in the
drivers of the cards I was testing with, which are now resolved.
Since I am re-posting this after a long time, I am posting it as a
fresh RFC. Please comment.

Thanks.

Ganesh Goudar (3):
  powerpc/eeh: Synchronization for safety
  powerpc/eeh: Provide a unique ID for each EEH recovery
  powerpc/eeh: Asynchronous recovery

 arch/powerpc/include/asm/eeh.h               |   7 +-
 arch/powerpc/include/asm/eeh_event.h         |  10 +-
 arch/powerpc/include/asm/pci-bridge.h        |   3 +
 arch/powerpc/include/asm/ppc-pci.h           |   2 +-
 arch/powerpc/kernel/eeh.c                    | 154 +++--
 arch/powerpc/kernel/eeh_driver.c             | 580 +++++++++++++++----
 arch/powerpc/kernel/eeh_event.c              |  71 ++-
 arch/powerpc/kernel/eeh_pe.c                 |  33 +-
 arch/powerpc/platforms/powernv/eeh-powernv.c |  12 +-
 arch/powerpc/platforms/pseries/eeh_pseries.c |   5 +-
 arch/powerpc/platforms/pseries/pci_dlpar.c   |   5 +-
 drivers/pci/hotplug/pnv_php.c                |   5 +-
 drivers/pci/hotplug/rpadlpar_core.c          |   2 +
 drivers/vfio/vfio_iommu_spapr_tce.c          |  10 +-
 include/linux/mmzone.h                       |   2 +-
 15 files changed, 687 insertions(+), 214 deletions(-)
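
As a hint of what the second patch is about, here is another
illustrative sketch, not the patch itself: once recoveries can
overlap, their log messages interleave, so each recovery is tagged
with a unique ID that every related log line carries. A minimal
version of that idea, assuming a simple global counter:

/*
 * Illustrative sketch, NOT the actual patch: tag each recovery with
 * a unique ID so that log lines from concurrent recoveries, which
 * would otherwise interleave indistinguishably, can be correlated.
 */
#include <linux/atomic.h>
#include <linux/printk.h>
#include <asm/eeh.h>

static atomic_t eeh_recovery_id = ATOMIC_INIT(0);

static void eeh_start_recovery(struct eeh_pe *pe)
{
	unsigned int id = atomic_inc_return(&eeh_recovery_id);

	pr_info("EEH: Recovery %u: begin for PHB#%x-PE#%x\n",
		id, pe->phb->global_number, pe->addr);
	/* ... each subsequent recovery step logs the same id ... */
}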

Comments

Oliver O'Halloran June 13, 2023, 2:36 a.m. UTC | #1
On Tue, Jun 13, 2023 at 11:44 AM Ganesh Goudar <ganeshgr@linux.ibm.com> wrote:
>
> Hi,
>
> EEH recovery is currently serialized, and these patches shorten
> the time it takes by allowing recovery to run in parallel. The
> original author of these patches is Sam Bobroff; I have rebased
> and tested them.
>
> On powervm, with 64 VFs from the same PHB, I see approximately a
> 48% reduction in the time taken for EEH recovery.
>
> On powernv, with 9 network cards (two cards installed on one PHB
> and one card on each of the remaining PHBs, giving 20 PFs in
> total), I see approximately a 33% reduction in the time taken for
> EEH recovery.
>
> These patches were originally posted as separate RFCs by Sam, and
> I rebased and posted them almost a year ago. I stopped pursuing
> them because I was unable to test on powernv due to issues in the
> drivers of the cards I was testing with, which are now resolved.
> Since I am re-posting this after a long time, I am posting it as a
> fresh RFC. Please comment.

What changes have you made since the last time you posted this series?
If the patches are the same then the comments I posted last time still
apply.
Ganesh Goudar July 17, 2023, 8:10 a.m. UTC | #2
On 6/13/23 8:06 AM, Oliver O'Halloran wrote:

> What changes have you made since the last time you posted this series?
> If the patches are the same then the comments I posted last time still
> apply.

Hi Oliver, you asked how we are testing this on powervm, expressed
concerns about having this on powernv, suggested enabling this
feature just for powervm for now, and also expressed concerns about
having two locks.

On powervm, using a two-port card, we instantiate 64 VFs for an LPAR
and inject errors on the bus from PHYP to observe the behavior.
I was able to test this on powernv with 16 PFs from 8 cards installed
on separate PHBs, where I saw a considerable performance improvement.
Regarding the two-locks idea, I may not have tested it in all
scenarios, but so far I have not faced any issues. Are you suggesting
a different approach?

Thanks