[SRU,J:linux-bluefield,v1,0/1] UBUNTU: SAUCE: mlxbf-gige: Vitesse PHY stuck in a bad state during reboot test

Message ID 20240429200115.29252-1-asmaa@nvidia.com
Series UBUNTU: SAUCE: mlxbf-gige: Vitesse PHY stuck in a bad state during reboot test

Message

Asmaa Mnebhi April 29, 2024, 8:01 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/2062384

SRU Justification:

[Impact]

During the QA reboot test, the BF3 Vitesse PHY gets stuck in a bad state, resulting in no IP provisioning. The only way to recover is a power cycle.
We might have found a software workaround (WA) that avoids getting into this state in the first place: suspend the PHY during graceful shutdown. Suspending the PHY means powering it down, that is, setting bit 11 of PHY register 0 to 1. This WA passed 1800 reboots on QA's setup.
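
To make the register operation concrete, here is a minimal userspace sketch of "set bit 11 of register 0". It is not the driver code: the `fake_phy` structure and the `fake_*` helpers are hypothetical stand-ins for a real MDIO bus. Only the register number and bit value are standard (they match MII_BMCR and BMCR_PDOWN in the Linux <linux/mii.h> header).

```c
#include <stdint.h>

/* Clause 22 basic mode control register and its power-down bit.
 * Same values as MII_BMCR and BMCR_PDOWN in <linux/mii.h>. */
#define PHY_REG_BMCR   0x00u
#define PHY_BMCR_PDOWN 0x0800u /* bit 11 */

/* Hypothetical stand-in for the 32 Clause 22 registers of one PHY. */
struct fake_phy {
    uint16_t regs[32];
};

/* Hypothetical MDIO accessors; a real driver would go through the MDIO bus. */
static uint16_t fake_mdio_read(const struct fake_phy *phy, unsigned int reg)
{
    return phy->regs[reg & 0x1fu];
}

static void fake_mdio_write(struct fake_phy *phy, unsigned int reg, uint16_t val)
{
    phy->regs[reg & 0x1fu] = val;
}

/* Read-modify-write: set the power-down bit, leave all other bits untouched. */
static void fake_phy_power_down(struct fake_phy *phy)
{
    uint16_t bmcr = fake_mdio_read(phy, PHY_REG_BMCR);

    fake_mdio_write(phy, PHY_REG_BMCR, (uint16_t)(bmcr | PHY_BMCR_PDOWN));
}

/* Returns 1 when the power-down bit is set, 0 otherwise. */
static int fake_phy_is_powered_down(const struct fake_phy *phy)
{
    return (fake_mdio_read(phy, PHY_REG_BMCR) & PHY_BMCR_PDOWN) != 0;
}
```

The read-modify-write keeps the other control bits (autonegotiation, speed, duplex) as they were, so only the power state changes.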

[Fix]

* During reboot, mlxbf_gige_shutdown() calls phy_stop(), which in turn calls phy_suspend().
* Certain Linux PHY drivers, including the Vitesse PHY driver, do not implement suspend(), so the PHY is not powered down during shutdown.
* Our hardware also does not toggle the PHY's hard reset signal during reboot.
* Hence, once the PHY is in a bad state, it stays there until a power cycle.
* We have found a way to prevent the PHY from entering this bad state: suspend the PHY on reboot.
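
The flow described above can be modelled with a small userspace sketch. The patch itself is not included in this cover letter, so this is an illustration under stated assumptions and not the actual change: every `sketch_*` / `SKETCH_*` name is hypothetical, and the only detail taken from the thread is that the shutdown path is gated on a BlueField-3 hardware version check.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_BMCR_PDOWN 0x0800u /* bit 11 of PHY register 0 */

/* Hypothetical hardware version tags. */
enum sketch_hw_version { SKETCH_BLUEFIELD2, SKETCH_BLUEFIELD3 };

/* Hypothetical PHY model: one control register and an optional suspend
 * callback, which is NULL when the PHY driver does not implement one. */
struct sketch_phy {
    uint16_t bmcr;
    void (*suspend)(struct sketch_phy *phy);
};

struct sketch_priv {
    enum sketch_hw_version hw_version;
    struct sketch_phy phy;
};

/* Power the PHY down by setting the power-down bit. */
static void sketch_power_down(struct sketch_phy *phy)
{
    phy->bmcr |= SKETCH_BMCR_PDOWN;
}

/* Models the stop path: the PHY is suspended only if its driver
 * provides a suspend callback. */
static void sketch_phy_stop(struct sketch_phy *phy)
{
    if (phy->suspend != NULL)
        phy->suspend(phy);
}

/* Models the workaround: on BlueField-3, power the PHY down explicitly
 * at shutdown instead of relying on the driver callback. */
static void sketch_shutdown(struct sketch_priv *priv)
{
    sketch_phy_stop(&priv->phy);

    if (priv->hw_version == SKETCH_BLUEFIELD3)
        sketch_power_down(&priv->phy);
}
```

With `suspend` left NULL, `sketch_phy_stop()` alone leaves the PHY powered, which corresponds to the second bullet; the explicit power-down in `sketch_shutdown()` corresponds to the last one.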

[Test Case]

* Run the reboot test (at least 2000 reboots): run 'reboot' from Linux.
* Check that the oob_net0 interface is up and has an IP address assigned.
* Please note that if the OOB doesn't get an IP, try reloading the driver (rmmod/modprobe). If that solves the issue, it is a different bug. In the bug at stake, nothing recovers the OOB IP except a power cycle.

[Regression Potential]

* Make sure the Redfish DHCP is still working during the reboot test.
* Make sure the OOB gets an IP.

[Other]

These changes were made in both the mlxbf-gige driver and UEFI.

Comments

Bartlomiej Zolnierkiewicz April 30, 2024, 10:12 a.m. UTC | #1
Hi Asmaa,

This patch fails to build:

/build/jammy/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c: In function 'mlxbf_gige_shutdown':
/build/jammy/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c:569:33: error: 'MLXBF_GIGE_BLUEFIELD3' undeclared (first use in this function)
  569 |         if (priv->hw_version == MLXBF_GIGE_BLUEFIELD3)
      |                                 ^~~~~~~~~~~~~~~~~~~~~

--
Best regards,
Bartlomiej

Asmaa Mnebhi April 30, 2024, 6:38 p.m. UTC | #2
Thanks Bart. I am actually abandoning this patch since it turns out the workaround doesn't work.

> -----Original Message-----
> From: Bartlomiej Zolnierkiewicz <bartlomiej.zolnierkiewicz@canonical.com>
> Sent: Tuesday, April 30, 2024 6:12 AM
> To: Asmaa Mnebhi <asmaa@nvidia.com>
> Cc: Ubuntu Kernel Team <kernel-team@lists.ubuntu.com>
> Subject: NAK/Cmnt: [SRU][J:linux-bluefield][PATCH v1 0/1] UBUNTU: SAUCE:
> mlxbf-gige: Vitesse PHY stuck in a bad state during reboot test