Message ID | 20171025232124.14120-5-f.fainelli@gmail.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
Series | [RFC,net-next,1/4] net: phy: Export phy_stop_machine() | expand |
Hi Florian, On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote: > Marc reported that he was not getting the PHY library adjust_link() > callback function to run when calling phy_stop() + phy_disconnect() > which does not indeed happen because we set the state machine to > PHY_HALTED but we don't get to run it to process this state past that > point. > > Fix this with a synchronous call to phy_state_machine() in order to have > the state machine actually act on PHY_HALTED, set the PHY device's link > down, turn the network device's carrier off and finally call the > adjust_link() function. > > At the end of phy_state_machine() though, if we are going to be moving > from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this > is pointless. > > Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> > Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work") > Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> > Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Thanks for your patch! Unfortunately, after applying this one, the last in your series, both sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system suspend/resume path, due to register accesses while the device is already suspended: PM: suspend entry (deep) PM: Syncing filesystems ... done. Freezing user space processes ... (elapsed 0.000 seconds) done. OOM killer disabled. Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done. PM: suspend devices took 0.100 seconds Unhandled fault: imprecise external abort (0x1406) at 0x0005b950 Disabling non-boot CPUs ... pgd = c0004000 [0005b950] *pgd=00000000 Internal error: : 1406 [#1] SMP ARM Modules linked in: CPU: 1 PID: 17 Comm: kworker/1:0 Not tainted 4.14.0-rc5-kzm9g-00454-g13bf0e7c8794d746 #1002 Hardware name: Generic SH73A0 (Flattened Device Tree) Workqueue: events linkwatch_event task: df490480 task.stack: df492000 PC is at __smsc911x_reg_read+0x1c/0x60 LR is at smsc911x_tx_get_txstatus+0x64/0x7c pc : [<c03cc77c>] lr : [<c03cd000>] psr: 20030093 sp : df493d98 ip : 00000000 fp : c09313a0 r10: c0931330 r9 : c0909a40 r8 : 00000000 r7 : 00030013 r6 : df705e08 r5 : df705dc0 r4 : 001c0000 r3 : e0903000 r2 : 00000001 r1 : e0903048 r0 : 00000000 Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: 5ed1804a DAC: 00000051 Process kworker/1:0 (pid: 17, stack limit = 0xdf492210) Stack: (0xdf493d98 to 0xdf494000) 3d80: c03cd2dc df705800 3da0: df705dc0 c063b2b0 df493e18 c03cd248 c03cd2dc df705800 df705800 c03cd2e8 3dc0: c03cd2dc def4b0e4 df705800 c063b2b0 df493e18 c048a13c def4b0e0 dedbb300 3de0: df705800 c04ab388 dedbb300 df705800 def4b000 c04a5e10 0000002a 00000000 3e00: c04a5a8c c01ffcfc c04aabd0 dedbb300 014000c0 df493e47 00000000 00000000 3e20: 00000000 00000000 00000000 00000000 00000050 00000000 df493e47 014000c0 3e40: 00000000 dedbb300 df705800 00000010 00000000 00000000 00000000 c0931330 3e60: c09313a0 c04aac00 00000000 00000000 00000000 00000000 00000000 00000000 3e80: 014000c0 df705800 df705800 c09313a0 df493ee0 c04aac9c 014000c0 00000000 3ea0: 00000000 df705800 c0931330 c04aace0 014000c0 00000000 00000000 c048ef18 3ec0: df705800 df705800 00000000 00000000 df705800 c04abee0 df705ab0 c04ac150 3ee0: df493ee0 df493ee0 c109a9f8 df436d00 c0931330 dfbd7bc0 df493f30 dfbdae00 3f00: 00000000 00000000 00000001 c04ac1bc c04ac198 c013a6f8 00000001 00000000 3f20: c013a680 00000000 00000000 00000000 c0931330 00000000 00000000 c0758db6 3f40: c0905900 df436d00 dfbd7bc0 dfbd7bc0 df492000 dfbd7bf4 c0905900 df436d18 3f60: 00000008 c013aecc df490480 df436400 df42c900 00000000 df443e6c df436d00 3f80: c013ac14 df436438 00000000 c0140200 df42c900 c01400dc 00000000 00000000 3fa0: 00000000 00000000 00000000 c01071c8 00000000 00000000 00000000 00000000 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 7fdfffff fff7fdff [<c03cc77c>] (__smsc911x_reg_read) from [<c03cd000>] (smsc911x_tx_get_txstatus+0x64/0x7c) [<c03cd000>] (smsc911x_tx_get_txstatus) from [<c03cd248>] (smsc911x_tx_update_txcounters+0x14/0xa8) [<c03cd248>] (smsc911x_tx_update_txcounters) from [<c03cd2e8>] (smsc911x_get_stats+0xc/0x58) [<c03cd2e8>] (smsc911x_get_stats) from [<c048a13c>] (dev_get_stats+0x54/0xa4) [<c048a13c>] (dev_get_stats) from [<c04ab388>] (rtnl_fill_stats+0x38/0x118) [<c04ab388>] (rtnl_fill_stats) from [<c04a5e10>] (rtnl_fill_ifinfo+0x5f0/0xf74) [<c04a5e10>] (rtnl_fill_ifinfo) from [<c04aac00>] (rtmsg_ifinfo_build_skb+0x6c/0xc0) [<c04aac00>] (rtmsg_ifinfo_build_skb) from [<c04aac9c>] (rtmsg_ifinfo_event.part.5+0x1c/0x40) [<c04aac9c>] (rtmsg_ifinfo_event.part.5) from [<c04aace0>] (rtmsg_ifinfo+0x20/0x28) [<c04aace0>] (rtmsg_ifinfo) from [<c048ef18>] (netdev_state_change+0x48/0x54) [<c048ef18>] (netdev_state_change) from [<c04abee0>] (linkwatch_do_dev+0x50/0x74) [<c04abee0>] (linkwatch_do_dev) from [<c04ac150>] (__linkwatch_run_queue+0x124/0x16c) [<c04ac150>] (__linkwatch_run_queue) from [<c04ac1bc>] (linkwatch_event+0x24/0x34) [<c04ac1bc>] (linkwatch_event) from [<c013a6f8>] (process_one_work+0x240/0x3fc) [<c013a6f8>] (process_one_work) from [<c013aecc>] (worker_thread+0x2b8/0x3f4) [<c013aecc>] (worker_thread) from [<c0140200>] (kthread+0x124/0x144) [<c0140200>] (kthread) from [<c01071c8>] (ret_from_fork+0x14/0x2c) Code: e5903000 e0831001 e5910000 f57ff04f (e12fff1e) ---[ end trace e71c0b5246c61082 ]--- Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote: > Hi Florian, > > On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote: >> Marc reported that he was not getting the PHY library adjust_link() >> callback function to run when calling phy_stop() + phy_disconnect() >> which does not indeed happen because we set the state machine to >> PHY_HALTED but we don't get to run it to process this state past that >> point. >> >> Fix this with a synchronous call to phy_state_machine() in order to have >> the state machine actually act on PHY_HALTED, set the PHY device's link >> down, turn the network device's carrier off and finally call the >> adjust_link() function. >> >> At the end of phy_state_machine() though, if we are going to be moving >> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this >> is pointless. >> >> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work") >> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> > > Thanks for your patch! > > Unfortunately, after applying this one, the last in your series, both > sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system > suspend/resume path, due to register accesses while the device is already > suspended: OK, seems like there is another path, uncovered by this patch that we can be hitting, does the following patch below help? diff --git a/drivers/net/ethernet/smsc/smsc911x.c b/drivers/net/ethernet/smsc/smsc911x.c index 82f9a175073f..51498699b18c 100644 --- a/drivers/net/ethernet/smsc/smsc911x.c +++ b/drivers/net/ethernet/smsc/smsc911x.c @@ -1136,6 +1136,9 @@ static void smsc911x_tx_update_txcounters(struct net_device *dev) struct smsc911x_data *pdata = netdev_priv(dev); unsigned int tx_stat; + if (!netif_running(dev)) + return; + while ((tx_stat = smsc911x_tx_get_txstatus(pdata)) != 0) { if (unlikely(tx_stat & 0x80000000)) { /* In this driver the packet tag is used as the packet > > PM: suspend entry (deep) > PM: Syncing filesystems ... done. > Freezing user space processes ... (elapsed 0.000 seconds) done. > OOM killer disabled. > Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done. > PM: suspend devices took 0.100 seconds > Unhandled fault: imprecise external abort (0x1406) at 0x0005b950 > Disabling non-boot CPUs ... > pgd = c0004000 > [0005b950] *pgd=00000000 > Internal error: : 1406 [#1] SMP ARM > Modules linked in: > CPU: 1 PID: 17 Comm: kworker/1:0 Not tainted > 4.14.0-rc5-kzm9g-00454-g13bf0e7c8794d746 #1002 > Hardware name: Generic SH73A0 (Flattened Device Tree) > Workqueue: events linkwatch_event > task: df490480 task.stack: df492000 > PC is at __smsc911x_reg_read+0x1c/0x60 > LR is at smsc911x_tx_get_txstatus+0x64/0x7c > pc : [<c03cc77c>] lr : [<c03cd000>] psr: 20030093 > sp : df493d98 ip : 00000000 fp : c09313a0 > r10: c0931330 r9 : c0909a40 r8 : 00000000 > r7 : 00030013 r6 : df705e08 r5 : df705dc0 r4 : 001c0000 > r3 : e0903000 r2 : 00000001 r1 : e0903048 r0 : 00000000 > Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none > Control: 10c5387d Table: 5ed1804a DAC: 00000051 > Process kworker/1:0 (pid: 17, stack limit = 0xdf492210) > Stack: (0xdf493d98 to 0xdf494000) > 3d80: c03cd2dc df705800 > 3da0: df705dc0 c063b2b0 df493e18 c03cd248 c03cd2dc df705800 df705800 c03cd2e8 > 3dc0: c03cd2dc def4b0e4 df705800 c063b2b0 df493e18 c048a13c def4b0e0 dedbb300 > 3de0: df705800 c04ab388 dedbb300 df705800 def4b000 c04a5e10 0000002a 00000000 > 3e00: c04a5a8c c01ffcfc c04aabd0 dedbb300 014000c0 df493e47 00000000 00000000 > 3e20: 00000000 00000000 00000000 00000000 00000050 00000000 df493e47 014000c0 > 3e40: 00000000 dedbb300 df705800 00000010 00000000 00000000 00000000 c0931330 > 3e60: c09313a0 c04aac00 00000000 00000000 00000000 00000000 00000000 00000000 > 3e80: 014000c0 df705800 df705800 c09313a0 df493ee0 c04aac9c 014000c0 00000000 > 3ea0: 00000000 df705800 c0931330 c04aace0 014000c0 00000000 00000000 c048ef18 > 3ec0: df705800 df705800 00000000 00000000 df705800 c04abee0 df705ab0 c04ac150 > 3ee0: df493ee0 df493ee0 c109a9f8 df436d00 c0931330 dfbd7bc0 df493f30 dfbdae00 > 3f00: 00000000 00000000 00000001 c04ac1bc c04ac198 c013a6f8 00000001 00000000 > 3f20: c013a680 00000000 00000000 00000000 c0931330 00000000 00000000 c0758db6 > 3f40: c0905900 df436d00 dfbd7bc0 dfbd7bc0 df492000 dfbd7bf4 c0905900 df436d18 > 3f60: 00000008 c013aecc df490480 df436400 df42c900 00000000 df443e6c df436d00 > 3f80: c013ac14 df436438 00000000 c0140200 df42c900 c01400dc 00000000 00000000 > 3fa0: 00000000 00000000 00000000 c01071c8 00000000 00000000 00000000 00000000 > 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 > 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 7fdfffff fff7fdff > [<c03cc77c>] (__smsc911x_reg_read) from [<c03cd000>] > (smsc911x_tx_get_txstatus+0x64/0x7c) > [<c03cd000>] (smsc911x_tx_get_txstatus) from [<c03cd248>] > (smsc911x_tx_update_txcounters+0x14/0xa8) > [<c03cd248>] (smsc911x_tx_update_txcounters) from [<c03cd2e8>] > (smsc911x_get_stats+0xc/0x58) > [<c03cd2e8>] (smsc911x_get_stats) from [<c048a13c>] (dev_get_stats+0x54/0xa4) > [<c048a13c>] (dev_get_stats) from [<c04ab388>] (rtnl_fill_stats+0x38/0x118) > [<c04ab388>] (rtnl_fill_stats) from [<c04a5e10>] (rtnl_fill_ifinfo+0x5f0/0xf74) > [<c04a5e10>] (rtnl_fill_ifinfo) from [<c04aac00>] > (rtmsg_ifinfo_build_skb+0x6c/0xc0) > [<c04aac00>] (rtmsg_ifinfo_build_skb) from [<c04aac9c>] > (rtmsg_ifinfo_event.part.5+0x1c/0x40) > [<c04aac9c>] (rtmsg_ifinfo_event.part.5) from [<c04aace0>] > (rtmsg_ifinfo+0x20/0x28) > [<c04aace0>] (rtmsg_ifinfo) from [<c048ef18>] (netdev_state_change+0x48/0x54) > [<c048ef18>] (netdev_state_change) from [<c04abee0>] > (linkwatch_do_dev+0x50/0x74) > [<c04abee0>] (linkwatch_do_dev) from [<c04ac150>] > (__linkwatch_run_queue+0x124/0x16c) > [<c04ac150>] (__linkwatch_run_queue) from [<c04ac1bc>] > (linkwatch_event+0x24/0x34) > [<c04ac1bc>] (linkwatch_event) from [<c013a6f8>] (process_one_work+0x240/0x3fc) > [<c013a6f8>] (process_one_work) from [<c013aecc>] (worker_thread+0x2b8/0x3f4) > [<c013aecc>] (worker_thread) from [<c0140200>] (kthread+0x124/0x144) > [<c0140200>] (kthread) from [<c01071c8>] (ret_from_fork+0x14/0x2c) > Code: e5903000 e0831001 e5910000 f57ff04f (e12fff1e) > ---[ end trace e71c0b5246c61082 ]--- > > Gr{oetje,eeting}s, > > Geert > > -- > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org > > In personal conversations with technical people, I call myself a hacker. But > when I'm talking to journalists I just say "programmer" or something like that. > -- Linus Torvalds >
Hi Florian, On Mon, Oct 30, 2017 at 5:09 PM, Florian Fainelli <f.fainelli@gmail.com> wrote: > On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote: >> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote: >>> Marc reported that he was not getting the PHY library adjust_link() >>> callback function to run when calling phy_stop() + phy_disconnect() >>> which does not indeed happen because we set the state machine to >>> PHY_HALTED but we don't get to run it to process this state past that >>> point. >>> >>> Fix this with a synchronous call to phy_state_machine() in order to have >>> the state machine actually act on PHY_HALTED, set the PHY device's link >>> down, turn the network device's carrier off and finally call the >>> adjust_link() function. >>> >>> At the end of phy_state_machine() though, if we are going to be moving >>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this >>> is pointless. >>> >>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work") >>> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> >> >> Thanks for your patch! >> >> Unfortunately, after applying this one, the last in your series, both >> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system >> suspend/resume path, due to register accesses while the device is already >> suspended: > > OK, seems like there is another path, uncovered by this patch that we > can be hitting, does the following patch below help? Unfortunately it doesn't help. >> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950 Note that this is an imprecise external abort, i.e. it's reporting may be delayed, and the backtrace may be inaccurate. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
On 10/31/2017 08:26 AM, Geert Uytterhoeven wrote: > Hi Florian, > > On Mon, Oct 30, 2017 at 5:09 PM, Florian Fainelli <f.fainelli@gmail.com> wrote: >> On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote: >>> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote: >>>> Marc reported that he was not getting the PHY library adjust_link() >>>> callback function to run when calling phy_stop() + phy_disconnect() >>>> which does not indeed happen because we set the state machine to >>>> PHY_HALTED but we don't get to run it to process this state past that >>>> point. >>>> >>>> Fix this with a synchronous call to phy_state_machine() in order to have >>>> the state machine actually act on PHY_HALTED, set the PHY device's link >>>> down, turn the network device's carrier off and finally call the >>>> adjust_link() function. >>>> >>>> At the end of phy_state_machine() though, if we are going to be moving >>>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this >>>> is pointless. >>>> >>>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work") >>>> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> >>> >>> Thanks for your patch! >>> >>> Unfortunately, after applying this one, the last in your series, both >>> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system >>> suspend/resume path, due to register accesses while the device is already >>> suspended: >> >> OK, seems like there is another path, uncovered by this patch that we >> can be hitting, does the following patch below help? > > Unfortunately it doesn't help. OK :/ > >>> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950 > > Note that this is an imprecise external abort, i.e. it's reporting may > be delayed, > and the backtrace may be inaccurate. True, can you help narrow it down with me? Can you confirm that adjust_link() (assuming that is the problem) does not get called past phy_stop_machine() as it should?
Hi Florian, On Tue, Oct 31, 2017 at 5:33 PM, Florian Fainelli <f.fainelli@gmail.com> wrote: > On 10/31/2017 08:26 AM, Geert Uytterhoeven wrote: >> On Mon, Oct 30, 2017 at 5:09 PM, Florian Fainelli <f.fainelli@gmail.com> wrote: >>> On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote: >>>> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote: >>>>> Marc reported that he was not getting the PHY library adjust_link() >>>>> callback function to run when calling phy_stop() + phy_disconnect() >>>>> which does not indeed happen because we set the state machine to >>>>> PHY_HALTED but we don't get to run it to process this state past that >>>>> point. >>>>> >>>>> Fix this with a synchronous call to phy_state_machine() in order to have >>>>> the state machine actually act on PHY_HALTED, set the PHY device's link >>>>> down, turn the network device's carrier off and finally call the >>>>> adjust_link() function. >>>>> >>>>> At the end of phy_state_machine() though, if we are going to be moving >>>>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this >>>>> is pointless. >>>>> >>>>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >>>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work") >>>>> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >>>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> >>>> >>>> Thanks for your patch! >>>> >>>> Unfortunately, after applying this one, the last in your series, both >>>> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system >>>> suspend/resume path, due to register accesses while the device is already >>>> suspended: >>> >>> OK, seems like there is another path, uncovered by this patch that we >>> can be hitting, does the following patch below help? >> >> Unfortunately it doesn't help. > > OK :/ > >> >>>> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950 >> >> Note that this is an imprecise external abort, i.e. it's reporting may >> be delayed, >> and the backtrace may be inaccurate. > > True, can you help narrow it down with me? Can you confirm that > adjust_link() (assuming that is the problem) does not get called past > phy_stop_machine() as it should? I've added some additional debug checks (keep track of both phy and smsc state, and refuse the access registers if smsc is disabled). Apparently phy_stop_machine() is called twice: - Once from mdio_bus_phy_suspend(), cfr. the first backtrace, - A second time from smsc911x_suspend(), cfr. the second backtrace. The second call causes a call to smsc911x_phy_adjust_link() while the smsc is already disabled, cfr. the third backtrace. This would trigger the imprecise external abort if I let it access the registers. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:597 phy_stop_machine+0x44/0xcc phy_stop_machine: phy running, good CPU: 0 PID: 1083 Comm: bash Not tainted 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637 Hardware name: Generic R8A73A4 (Flattened Device Tree) [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14) [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc) [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc) [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44) [<c002ac8c>] (warn_slowpath_fmt) from [<c0314680>] (phy_stop_machine+0x44/0xcc) [<c0314680>] (phy_stop_machine) from [<c03162ec>] (mdio_bus_phy_suspend+0x24/0x40) [<c03162ec>] (mdio_bus_phy_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec) [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0) [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568) [<c030a974>] (dpm_suspend) from [<c00880bc>] (suspend_devices_and_enter+0x78/0xe98) [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>] (pm_suspend+0xa40/0xbec) [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc) [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0) [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c) [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144) [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80) [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28) ---[ end trace 8fc4c71351438007 ]--- libphy: phy_stop_machine: Kicking state machine synchronously libphy: phy_stop_machine: Kicking state machine done ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:598 phy_stop_machine+0x64/0xcc phy_stop_machine: phy already stopped CPU: 0 PID: 1083 Comm: bash Tainted: G W 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637 Hardware name: Generic R8A73A4 (Flattened Device Tree) [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14) [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc) [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc) [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44) [<c002ac8c>] (warn_slowpath_fmt) from [<c03146a0>] (phy_stop_machine+0x64/0xcc) [<c03146a0>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4) [<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec) [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0) [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568) [<c030a974>] (dpm_suspend) from [<c00880bc>] (suspend_devices_and_enter+0x78/0xe98) [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>] (pm_suspend+0xa40/0xbec) [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc) [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0) [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c) [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144) [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80) [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28) ---[ end trace 8fc4c71351438008 ]--- libphy: phy_stop_machine: Kicking state machine synchronously ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1083 at drivers/net/ethernet/smsc/smsc911x.c:988 smsc911x_phy_adjust_link+0x2c/0x2e0 PHY stopped CPU: 0 PID: 1083 Comm: bash Tainted: G W 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637 Hardware name: Generic R8A73A4 (Flattened Device Tree) [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14) [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc) [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc) [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44) [<c002ac8c>] (warn_slowpath_fmt) from [<c031b1e8>] (smsc911x_phy_adjust_link+0x2c/0x2e0) [<c031b1e8>] (smsc911x_phy_adjust_link) from [<c031378c>] (phy_link_down+0x18/0x24) [<c031378c>] (phy_link_down) from [<c0314518>] (phy_state_machine+0x2d0/0x3f4) [<c0314518>] (phy_state_machine) from [<c03146d8>] (phy_stop_machine+0x9c/0xcc) [<c03146d8>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4) [<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec) [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0) [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568) [<c030a974>] (dpm_suspend) from [<c00880bc>] (suspend_devices_and_enter+0x78/0xe98) [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>] (pm_suspend+0xa40/0xbec) [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc) [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0) [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c) [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144) [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80) [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28) ---[ end trace 8fc4c71351438009 ]--- libphy: phy_stop_machine: Kicking state machine done If I revert your "net: smsc911x: Properly manage PHY during suspend/resume", phy_stop_machine() is no longer called twice, and system suspend works. However, during resume, smsc911x_mii_read() is called before the smsc is enabled, cfr. the fourth backtrace: WARNING: CPU: 1 PID: 17 at drivers/net/ethernet/smsc/smsc911x.c:165 __smsc911x_reg_read+0x1c/0x88 Modules linked in: CPU: 1 PID: 17 Comm: kworker/1:0 Tainted: G W 4.14.0-rc7-kzm9g-00443-gcdfc0e18a47e0bb3-dirty #1013 Hardware name: Generic SH73A0 (Flattened Device Tree) smsc hangs of [fec10000.bus, which is started only here ---> Workqueue: events_power_efficient phy_state_machine [<c010f304>] (unwind_backtrace) from [<c010b7ec>] (show_stack+0x10/0x14) [<c010b7ec>] (show_stack) from [<c0551ab0>] (dump_stack+0xa4/0xdc) [<c0551ab0>] (dump_stack) from [<c0120480>] (__warn+0xcc/0xfc) [<c0120480>] (__warn) from [<c0120554>] (warn_slowpath_null+0x1c/0x24) [<c0120554>] (warn_slowpath_null) from [<c03ce264>] (__smsc911x_reg_read+0x1c/0x88) [<c03ce264>] (__smsc911x_reg_read) from [<c03cefc8>] (smsc911x_mac_read+0x4c/0x118) [<c03cefc8>] (smsc911x_mac_read) from [<c03cf320>] (smsc911x_mii_read+0x2c/0xa4) [<c03cf320>] (smsc911x_mii_read) from [<c03cd490>] (mdiobus_read+0x58/0x70) [<c03cd490>] (mdiobus_read) from [<c03cc4a0>] (genphy_update_link+0x18/0x50) [<c03cc4a0>] (genphy_update_link) from [<c03cc4e4>] (genphy_read_status+0xc/0x1cc) [<c03cc4e4>] (genphy_read_status) from [<c03ca3e4>] (phy_state_machine+0xa8/0x3f4) [<c03ca3e4>] (phy_state_machine) from [<c013a740>] (process_one_work+0x240/0x3fc) [<c013a740>] (process_one_work) from [<c013af28>] (worker_thread+0x2cc/0x40c) [<c013af28>] (worker_thread) from [<c014024c>] (kthread+0x124/0x144) [<c014024c>] (kthread) from [<c01071c8>] (ret_from_fork+0x14/0x2c) ---[ end trace 21b7024e273f9f21 ]--- ------------[ cut here ]------------ (yes, the fourth backtrace is from another machine, but I can trigger all of this on both r8a73a4/ape6evm and sh73a0/kzm9g anyway). Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Hi Geert, On 11/06/2017 07:50 AM, Geert Uytterhoeven wrote: > Hi Florian, > > On Tue, Oct 31, 2017 at 5:33 PM, Florian Fainelli <f.fainelli@gmail.com> wrote: >> On 10/31/2017 08:26 AM, Geert Uytterhoeven wrote: >>> On Mon, Oct 30, 2017 at 5:09 PM, Florian Fainelli <f.fainelli@gmail.com> wrote: >>>> On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote: >>>>> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote: >>>>>> Marc reported that he was not getting the PHY library adjust_link() >>>>>> callback function to run when calling phy_stop() + phy_disconnect() >>>>>> which does not indeed happen because we set the state machine to >>>>>> PHY_HALTED but we don't get to run it to process this state past that >>>>>> point. >>>>>> >>>>>> Fix this with a synchronous call to phy_state_machine() in order to have >>>>>> the state machine actually act on PHY_HALTED, set the PHY device's link >>>>>> down, turn the network device's carrier off and finally call the >>>>>> adjust_link() function. >>>>>> >>>>>> At the end of phy_state_machine() though, if we are going to be moving >>>>>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this >>>>>> is pointless. >>>>>> >>>>>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >>>>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work") >>>>>> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >>>>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> >>>>> >>>>> Thanks for your patch! >>>>> >>>>> Unfortunately, after applying this one, the last in your series, both >>>>> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system >>>>> suspend/resume path, due to register accesses while the device is already >>>>> suspended: >>>> >>>> OK, seems like there is another path, uncovered by this patch that we >>>> can be hitting, does the following patch below help? >>> >>> Unfortunately it doesn't help. >> >> OK :/ >> >>> >>>>> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950 >>> >>> Note that this is an imprecise external abort, i.e. it's reporting may >>> be delayed, >>> and the backtrace may be inaccurate. >> >> True, can you help narrow it down with me? Can you confirm that >> adjust_link() (assuming that is the problem) does not get called past >> phy_stop_machine() as it should? > > I've added some additional debug checks (keep track of both phy and > smsc state, and refuse the access registers if smsc is disabled). Thanks for doing that, and sorry for responding that late. > > Apparently phy_stop_machine() is called twice: > - Once from mdio_bus_phy_suspend(), cfr. the first backtrace, > - A second time from smsc911x_suspend(), cfr. the second backtrace. > > The second call causes a call to smsc911x_phy_adjust_link() while the smsc is > already disabled, cfr. the third backtrace. This would trigger the imprecise > external abort if I let it access the registers. > > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:597 > phy_stop_machine+0x44/0xcc > phy_stop_machine: phy running, good > CPU: 0 PID: 1083 Comm: bash Not tainted > 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637 > Hardware name: Generic R8A73A4 (Flattened Device Tree) > [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14) > [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc) > [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc) > [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44) > [<c002ac8c>] (warn_slowpath_fmt) from [<c0314680>] (phy_stop_machine+0x44/0xcc) > [<c0314680>] (phy_stop_machine) from [<c03162ec>] > (mdio_bus_phy_suspend+0x24/0x40) > [<c03162ec>] (mdio_bus_phy_suspend) from [<c0307090>] > (dpm_run_callback+0x17c/0x3ec) > [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0) > [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568) > [<c030a974>] (dpm_suspend) from [<c00880bc>] > (suspend_devices_and_enter+0x78/0xe98) > [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>] > (pm_suspend+0xa40/0xbec) > [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc) > [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0) > [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c) > [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144) > [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80) > [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28) > ---[ end trace 8fc4c71351438007 ]--- > libphy: phy_stop_machine: Kicking state machine synchronously > libphy: phy_stop_machine: Kicking state machine done > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:598 > phy_stop_machine+0x64/0xcc > phy_stop_machine: phy already stopped > CPU: 0 PID: 1083 Comm: bash Tainted: G W > 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637 > Hardware name: Generic R8A73A4 (Flattened Device Tree) > [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14) > [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc) > [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc) > [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44) > [<c002ac8c>] (warn_slowpath_fmt) from [<c03146a0>] (phy_stop_machine+0x64/0xcc) > [<c03146a0>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4) > [<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec) > [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0) > [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568) > [<c030a974>] (dpm_suspend) from [<c00880bc>] > (suspend_devices_and_enter+0x78/0xe98) > [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>] > (pm_suspend+0xa40/0xbec) > [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc) > [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0) > [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c) > [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144) > [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80) > [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28) > ---[ end trace 8fc4c71351438008 ]--- > libphy: phy_stop_machine: Kicking state machine synchronously > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 1083 at drivers/net/ethernet/smsc/smsc911x.c:988 > smsc911x_phy_adjust_link+0x2c/0x2e0 > PHY stopped > CPU: 0 PID: 1083 Comm: bash Tainted: G W > 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637 > Hardware name: Generic R8A73A4 (Flattened Device Tree) > [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14) > [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc) > [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc) > [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44) > [<c002ac8c>] (warn_slowpath_fmt) from [<c031b1e8>] > (smsc911x_phy_adjust_link+0x2c/0x2e0) > [<c031b1e8>] (smsc911x_phy_adjust_link) from [<c031378c>] > (phy_link_down+0x18/0x24) > [<c031378c>] (phy_link_down) from [<c0314518>] (phy_state_machine+0x2d0/0x3f4) > [<c0314518>] (phy_state_machine) from [<c03146d8>] (phy_stop_machine+0x9c/0xcc) > [<c03146d8>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4) > [<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec) > [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0) > [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568) > [<c030a974>] (dpm_suspend) from [<c00880bc>] > (suspend_devices_and_enter+0x78/0xe98) > [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>] > (pm_suspend+0xa40/0xbec) > [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc) > [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0) > [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c) > [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144) > [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80) > [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28) > ---[ end trace 8fc4c71351438009 ]--- > libphy: phy_stop_machine: Kicking state machine done > > If I revert your "net: smsc911x: Properly manage PHY during suspend/resume", > phy_stop_machine() is no longer called twice, and system suspend works. OK, there does appear to be a problem in how the network device vs. mdio bus are suspended/resumed in this particular driver, and I have no idea why, see below. > > However, during resume, smsc911x_mii_read() is called before the > smsc is enabled, cfr. the fourth backtrace: Humm, how is that possible? smsc911x_mii_bus properly sets its parent device to be the platform device of the network device (which is correct) so by virtue of child -> parent relationships, we should have the network device resume first, and then have the MDIO bus resume second (unless I am completely off here). > > WARNING: CPU: 1 PID: 17 at > drivers/net/ethernet/smsc/smsc911x.c:165 __smsc911x_reg_read+0x1c/0x88 > Modules linked in: > CPU: 1 PID: 17 Comm: kworker/1:0 Tainted: G W > 4.14.0-rc7-kzm9g-00443-gcdfc0e18a47e0bb3-dirty #1013 > Hardware name: Generic SH73A0 (Flattened Device Tree) > > smsc hangs of [fec10000.bus, which is started only here ---> > > Workqueue: events_power_efficient phy_state_machine > [<c010f304>] (unwind_backtrace) from [<c010b7ec>] (show_stack+0x10/0x14) > [<c010b7ec>] (show_stack) from [<c0551ab0>] (dump_stack+0xa4/0xdc) > [<c0551ab0>] (dump_stack) from [<c0120480>] (__warn+0xcc/0xfc) > [<c0120480>] (__warn) from [<c0120554>] (warn_slowpath_null+0x1c/0x24) > [<c0120554>] (warn_slowpath_null) from [<c03ce264>] > (__smsc911x_reg_read+0x1c/0x88) > [<c03ce264>] (__smsc911x_reg_read) from [<c03cefc8>] > (smsc911x_mac_read+0x4c/0x118) > [<c03cefc8>] (smsc911x_mac_read) from [<c03cf320>] > (smsc911x_mii_read+0x2c/0xa4) > [<c03cf320>] (smsc911x_mii_read) from [<c03cd490>] (mdiobus_read+0x58/0x70) > [<c03cd490>] (mdiobus_read) from [<c03cc4a0>] (genphy_update_link+0x18/0x50) > [<c03cc4a0>] (genphy_update_link) from [<c03cc4e4>] > (genphy_read_status+0xc/0x1cc) > [<c03cc4e4>] (genphy_read_status) from [<c03ca3e4>] > (phy_state_machine+0xa8/0x3f4) > [<c03ca3e4>] (phy_state_machine) from [<c013a740>] > (process_one_work+0x240/0x3fc) > [<c013a740>] (process_one_work) from [<c013af28>] > (worker_thread+0x2cc/0x40c) > [<c013af28>] (worker_thread) from [<c014024c>] (kthread+0x124/0x144) > [<c014024c>] (kthread) from [<c01071c8>] (ret_from_fork+0x14/0x2c) > ---[ end trace 21b7024e273f9f21 ]--- > ------------[ cut here ]------------ > > (yes, the fourth backtrace is from another machine, but I can trigger > all of this > on both r8a73a4/ape6evm and sh73a0/kzm9g anyway). > > Gr{oetje,eeting}s, > > Geert > > -- > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org > > In personal conversations with technical people, I call myself a hacker. But > when I'm talking to journalists I just say "programmer" or something like that. > -- Linus Torvalds >
Hi Florian, On Mon, Nov 27, 2017 at 5:05 AM, Florian Fainelli <f.fainelli@gmail.com> wrote: > On 11/06/2017 07:50 AM, Geert Uytterhoeven wrote: >> On Tue, Oct 31, 2017 at 5:33 PM, Florian Fainelli <f.fainelli@gmail.com> wrote: >>> On 10/31/2017 08:26 AM, Geert Uytterhoeven wrote: >>>> On Mon, Oct 30, 2017 at 5:09 PM, Florian Fainelli <f.fainelli@gmail.com> wrote: >>>>> On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote: >>>>>> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote: >>>>>>> Marc reported that he was not getting the PHY library adjust_link() >>>>>>> callback function to run when calling phy_stop() + phy_disconnect() >>>>>>> which does not indeed happen because we set the state machine to >>>>>>> PHY_HALTED but we don't get to run it to process this state past that >>>>>>> point. >>>>>>> >>>>>>> Fix this with a synchronous call to phy_state_machine() in order to have >>>>>>> the state machine actually act on PHY_HALTED, set the PHY device's link >>>>>>> down, turn the network device's carrier off and finally call the >>>>>>> adjust_link() function. >>>>>>> >>>>>>> At the end of phy_state_machine() though, if we are going to be moving >>>>>>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this >>>>>>> is pointless. >>>>>>> >>>>>>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >>>>>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work") >>>>>>> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> >>>>>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> >>>>>> >>>>>> Thanks for your patch! >>>>>> >>>>>> Unfortunately, after applying this one, the last in your series, both >>>>>> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system >>>>>> suspend/resume path, due to register accesses while the device is already >>>>>> suspended: >>>>> >>>>> OK, seems like there is another path, uncovered by this patch that we >>>>> can be hitting, does the following patch below help? >>>> >>>> Unfortunately it doesn't help. >>> >>> OK :/ >>> >>>> >>>>>> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950 >>>> >>>> Note that this is an imprecise external abort, i.e. it's reporting may >>>> be delayed, >>>> and the backtrace may be inaccurate. >>> >>> True, can you help narrow it down with me? Can you confirm that >>> adjust_link() (assuming that is the problem) does not get called past >>> phy_stop_machine() as it should? >> >> I've added some additional debug checks (keep track of both phy and >> smsc state, and refuse the access registers if smsc is disabled). > > Thanks for doing that, and sorry for responding that late. > >> >> Apparently phy_stop_machine() is called twice: >> - Once from mdio_bus_phy_suspend(), cfr. the first backtrace, >> - A second time from smsc911x_suspend(), cfr. the second backtrace. >> >> The second call causes a call to smsc911x_phy_adjust_link() while the smsc is >> already disabled, cfr. the third backtrace. This would trigger the imprecise >> external abort if I let it access the registers. >> >> ------------[ cut here ]------------ >> WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:597 >> phy_stop_machine+0x44/0xcc >> phy_stop_machine: phy running, good >> CPU: 0 PID: 1083 Comm: bash Not tainted >> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637 >> Hardware name: Generic R8A73A4 (Flattened Device Tree) >> [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14) >> [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc) >> [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc) >> [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44) >> [<c002ac8c>] (warn_slowpath_fmt) from [<c0314680>] (phy_stop_machine+0x44/0xcc) >> [<c0314680>] (phy_stop_machine) from [<c03162ec>] >> (mdio_bus_phy_suspend+0x24/0x40) >> [<c03162ec>] (mdio_bus_phy_suspend) from [<c0307090>] >> (dpm_run_callback+0x17c/0x3ec) >> [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0) >> [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568) >> [<c030a974>] (dpm_suspend) from [<c00880bc>] >> (suspend_devices_and_enter+0x78/0xe98) >> [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>] >> (pm_suspend+0xa40/0xbec) >> [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc) >> [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0) >> [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c) >> [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144) >> [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80) >> [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28) >> ---[ end trace 8fc4c71351438007 ]--- >> libphy: phy_stop_machine: Kicking state machine synchronously >> libphy: phy_stop_machine: Kicking state machine done >> ------------[ cut here ]------------ >> WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:598 >> phy_stop_machine+0x64/0xcc >> phy_stop_machine: phy already stopped >> CPU: 0 PID: 1083 Comm: bash Tainted: G W >> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637 >> Hardware name: Generic R8A73A4 (Flattened Device Tree) >> [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14) >> [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc) >> [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc) >> [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44) >> [<c002ac8c>] (warn_slowpath_fmt) from [<c03146a0>] (phy_stop_machine+0x64/0xcc) >> [<c03146a0>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4) >> [<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec) >> [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0) >> [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568) >> [<c030a974>] (dpm_suspend) from [<c00880bc>] >> (suspend_devices_and_enter+0x78/0xe98) >> [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>] >> (pm_suspend+0xa40/0xbec) >> [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc) >> [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0) >> [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c) >> [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144) >> [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80) >> [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28) >> ---[ end trace 8fc4c71351438008 ]--- >> libphy: phy_stop_machine: Kicking state machine synchronously >> ------------[ cut here ]------------ >> WARNING: CPU: 0 PID: 1083 at drivers/net/ethernet/smsc/smsc911x.c:988 >> smsc911x_phy_adjust_link+0x2c/0x2e0 >> PHY stopped >> CPU: 0 PID: 1083 Comm: bash Tainted: G W >> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637 >> Hardware name: Generic R8A73A4 (Flattened Device Tree) >> [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14) >> [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc) >> [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc) >> [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44) >> [<c002ac8c>] (warn_slowpath_fmt) from [<c031b1e8>] >> (smsc911x_phy_adjust_link+0x2c/0x2e0) >> [<c031b1e8>] (smsc911x_phy_adjust_link) from [<c031378c>] >> (phy_link_down+0x18/0x24) >> [<c031378c>] (phy_link_down) from [<c0314518>] (phy_state_machine+0x2d0/0x3f4) >> [<c0314518>] (phy_state_machine) from [<c03146d8>] (phy_stop_machine+0x9c/0xcc) >> [<c03146d8>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4) >> [<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec) >> [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0) >> [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568) >> [<c030a974>] (dpm_suspend) from [<c00880bc>] >> (suspend_devices_and_enter+0x78/0xe98) >> [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>] >> (pm_suspend+0xa40/0xbec) >> [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc) >> [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0) >> [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c) >> [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144) >> [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80) >> [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28) >> ---[ end trace 8fc4c71351438009 ]--- >> libphy: phy_stop_machine: Kicking state machine done >> >> If I revert your "net: smsc911x: Properly manage PHY during suspend/resume", >> phy_stop_machine() is no longer called twice, and system suspend works. > > OK, there does appear to be a problem in how the network device vs. mdio > bus are suspended/resumed in this particular driver, and I have no idea > why, see below. > >> >> However, during resume, smsc911x_mii_read() is called before the >> smsc is enabled, cfr. the fourth backtrace: > > Humm, how is that possible? smsc911x_mii_bus properly sets its parent > device to be the platform device of the network device (which is > correct) so by virtue of child -> parent relationships, we should have > the network device resume first, and then have the MDIO bus resume > second (unless I am completely off here). The MDIO bus callback is not called from the network device ... >> WARNING: CPU: 1 PID: 17 at >> drivers/net/ethernet/smsc/smsc911x.c:165 __smsc911x_reg_read+0x1c/0x88 >> Modules linked in: >> CPU: 1 PID: 17 Comm: kworker/1:0 Tainted: G W >> 4.14.0-rc7-kzm9g-00443-gcdfc0e18a47e0bb3-dirty #1013 >> Hardware name: Generic SH73A0 (Flattened Device Tree) >> >> smsc hangs of [fec10000.bus, which is started only here ---> >> >> Workqueue: events_power_efficient phy_state_machine >> [<c010f304>] (unwind_backtrace) from [<c010b7ec>] (show_stack+0x10/0x14) >> [<c010b7ec>] (show_stack) from [<c0551ab0>] (dump_stack+0xa4/0xdc) >> [<c0551ab0>] (dump_stack) from [<c0120480>] (__warn+0xcc/0xfc) >> [<c0120480>] (__warn) from [<c0120554>] (warn_slowpath_null+0x1c/0x24) >> [<c0120554>] (warn_slowpath_null) from [<c03ce264>] >> (__smsc911x_reg_read+0x1c/0x88) >> [<c03ce264>] (__smsc911x_reg_read) from [<c03cefc8>] >> (smsc911x_mac_read+0x4c/0x118) >> [<c03cefc8>] (smsc911x_mac_read) from [<c03cf320>] >> (smsc911x_mii_read+0x2c/0xa4) >> [<c03cf320>] (smsc911x_mii_read) from [<c03cd490>] (mdiobus_read+0x58/0x70) >> [<c03cd490>] (mdiobus_read) from [<c03cc4a0>] (genphy_update_link+0x18/0x50) >> [<c03cc4a0>] (genphy_update_link) from [<c03cc4e4>] >> (genphy_read_status+0xc/0x1cc) >> [<c03cc4e4>] (genphy_read_status) from [<c03ca3e4>] >> (phy_state_machine+0xa8/0x3f4) >> [<c03ca3e4>] (phy_state_machine) from [<c013a740>] >> (process_one_work+0x240/0x3fc) >> [<c013a740>] (process_one_work) from [<c013af28>] >> (worker_thread+0x2cc/0x40c) >> [<c013af28>] (worker_thread) from [<c014024c>] (kthread+0x124/0x144) ... but from the worker thread, which is unaware of PM states, unless it is a freezable workqueue. Cfr. my patch "[1/2] net: phy: Freeze PHY polling before suspending devices" (https://patchwork.kernel.org/patch/9915901/), which made it a freezable workqueue, like was done on PCI to solve similar issues with PME scanning. >> [<c014024c>] (kthread) from [<c01071c8>] (ret_from_fork+0x14/0x2c) >> ---[ end trace 21b7024e273f9f21 ]--- >> ------------[ cut here ]------------ >> >> (yes, the fourth backtrace is from another machine, but I can trigger >> all of this >> on both r8a73a4/ape6evm and sh73a0/kzm9g anyway). Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
On 27/11/2017 08:48, Geert Uytterhoeven wrote: > The MDIO bus callback is not called from the network device ... > >>> smsc hangs of [fec10000.bus, which is started only here ---> >>> >>> Workqueue: events_power_efficient phy_state_machine >>> [<c010f304>] (unwind_backtrace) from [<c010b7ec>] (show_stack+0x10/0x14) >>> [<c010b7ec>] (show_stack) from [<c0551ab0>] (dump_stack+0xa4/0xdc) >>> [<c0551ab0>] (dump_stack) from [<c0120480>] (__warn+0xcc/0xfc) >>> [<c0120480>] (__warn) from [<c0120554>] (warn_slowpath_null+0x1c/0x24) >>> [<c0120554>] (warn_slowpath_null) from [<c03ce264>] (__smsc911x_reg_read+0x1c/0x88) >>> [<c03ce264>] (__smsc911x_reg_read) from [<c03cefc8>] (smsc911x_mac_read+0x4c/0x118) >>> [<c03cefc8>] (smsc911x_mac_read) from [<c03cf320>] (smsc911x_mii_read+0x2c/0xa4) >>> [<c03cf320>] (smsc911x_mii_read) from [<c03cd490>] (mdiobus_read+0x58/0x70) >>> [<c03cd490>] (mdiobus_read) from [<c03cc4a0>] (genphy_update_link+0x18/0x50) >>> [<c03cc4a0>] (genphy_update_link) from [<c03cc4e4>] (genphy_read_status+0xc/0x1cc) >>> [<c03cc4e4>] (genphy_read_status) from [<c03ca3e4>] (phy_state_machine+0xa8/0x3f4) >>> [<c03ca3e4>] (phy_state_machine) from [<c013a740>] (process_one_work+0x240/0x3fc) >>> [<c013a740>] (process_one_work) from [<c013af28>] (worker_thread+0x2cc/0x40c) >>> [<c013af28>] (worker_thread) from [<c014024c>] (kthread+0x124/0x144) > > ... but from the worker thread, which is unaware of PM states, unless it is > a freezable workqueue. > > Cfr. my patch "[1/2] net: phy: Freeze PHY polling before suspending devices" > (https://patchwork.kernel.org/patch/9915901/), which made it a freezable > workqueue, like was done on PCI to solve similar issues with PME scanning. Hello Geert, Florian, Is there something I can do to help this patch move forward? Should we look for a different solution than a synchronous call to phy_state_machine() ? (Or should I call it from the device driver, instead of expecting the phy framework to do it?) Regards.
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index 0ddeb97217ce..20b84ea013c0 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -594,6 +594,9 @@ void phy_stop_machine(struct phy_device *phydev) if (phydev->state > PHY_UP && phydev->state != PHY_HALTED) phydev->state = PHY_UP; mutex_unlock(&phydev->lock); + + /* Now we can run the state machine synchronously */ + phy_state_machine(&phydev->state_queue.work); } EXPORT_SYMBOL_GPL(phy_stop_machine); @@ -1077,9 +1080,14 @@ void phy_state_machine(struct work_struct *work) /* Only re-schedule a PHY state machine change if we are polling the * PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving - * between states from phy_mac_interrupt() + * between states from phy_mac_interrupt(). + * + * If do_suspend is set to true from PHY_HALTED, in that case, do not + * reschedule the state machine since that would be pointless and + * possibly error prone when called from phy_disconnect() + * synchronously. */ - if (phydev->irq == PHY_POLL) + if (phydev->irq == PHY_POLL && !do_suspend) queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, PHY_STATE_TIME * HZ); }