diff mbox series

[RFC,net-next,4/4] net: phy: Correctly process PHY_HALTED in phy_stop_machine()

Message ID 20171025232124.14120-5-f.fainelli@gmail.com
State RFC, archived
Delegated to: David Miller
Headers show
Series [RFC,net-next,1/4] net: phy: Export phy_stop_machine() | expand

Commit Message

Florian Fainelli Oct. 25, 2017, 11:21 p.m. UTC
Marc reported that he was not getting the PHY library adjust_link()
callback function to run when calling phy_stop() + phy_disconnect()
which does not indeed happen because we set the state machine to
PHY_HALTED but we don't get to run it to process this state past that
point.

Fix this with a synchronous call to phy_state_machine() in order to have
the state machine actually act on PHY_HALTED, set the PHY device's link
down, turn the network device's carrier off and finally call the
adjust_link() function.

At the end of phy_state_machine() though, if we are going to be moving
from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this
is pointless.

Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/phy/phy.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

Comments

Geert Uytterhoeven Oct. 30, 2017, 1:56 p.m. UTC | #1
Hi Florian,

On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> Marc reported that he was not getting the PHY library adjust_link()
> callback function to run when calling phy_stop() + phy_disconnect()
> which does not indeed happen because we set the state machine to
> PHY_HALTED but we don't get to run it to process this state past that
> point.
>
> Fix this with a synchronous call to phy_state_machine() in order to have
> the state machine actually act on PHY_HALTED, set the PHY device's link
> down, turn the network device's carrier off and finally call the
> adjust_link() function.
>
> At the end of phy_state_machine() though, if we are going to be moving
> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this
> is pointless.
>
> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Thanks for your patch!

Unfortunately, after applying this one, the last in your series, both
sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system
suspend/resume path, due to register accesses while the device is already
suspended:

PM: suspend entry (deep)
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.000 seconds) done.
OOM killer disabled.
Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
PM: suspend devices took 0.100 seconds
Unhandled fault: imprecise external abort (0x1406) at 0x0005b950
Disabling non-boot CPUs ...
pgd = c0004000
[0005b950] *pgd=00000000
Internal error: : 1406 [#1] SMP ARM
Modules linked in:
CPU: 1 PID: 17 Comm: kworker/1:0 Not tainted
4.14.0-rc5-kzm9g-00454-g13bf0e7c8794d746 #1002
Hardware name: Generic SH73A0 (Flattened Device Tree)
Workqueue: events linkwatch_event
task: df490480 task.stack: df492000
PC is at __smsc911x_reg_read+0x1c/0x60
LR is at smsc911x_tx_get_txstatus+0x64/0x7c
pc : [<c03cc77c>]    lr : [<c03cd000>]    psr: 20030093
sp : df493d98  ip : 00000000  fp : c09313a0
r10: c0931330  r9 : c0909a40  r8 : 00000000
r7 : 00030013  r6 : df705e08  r5 : df705dc0  r4 : 001c0000
r3 : e0903000  r2 : 00000001  r1 : e0903048  r0 : 00000000
Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 5ed1804a  DAC: 00000051
Process kworker/1:0 (pid: 17, stack limit = 0xdf492210)
Stack: (0xdf493d98 to 0xdf494000)
3d80:                                                       c03cd2dc df705800
3da0: df705dc0 c063b2b0 df493e18 c03cd248 c03cd2dc df705800 df705800 c03cd2e8
3dc0: c03cd2dc def4b0e4 df705800 c063b2b0 df493e18 c048a13c def4b0e0 dedbb300
3de0: df705800 c04ab388 dedbb300 df705800 def4b000 c04a5e10 0000002a 00000000
3e00: c04a5a8c c01ffcfc c04aabd0 dedbb300 014000c0 df493e47 00000000 00000000
3e20: 00000000 00000000 00000000 00000000 00000050 00000000 df493e47 014000c0
3e40: 00000000 dedbb300 df705800 00000010 00000000 00000000 00000000 c0931330
3e60: c09313a0 c04aac00 00000000 00000000 00000000 00000000 00000000 00000000
3e80: 014000c0 df705800 df705800 c09313a0 df493ee0 c04aac9c 014000c0 00000000
3ea0: 00000000 df705800 c0931330 c04aace0 014000c0 00000000 00000000 c048ef18
3ec0: df705800 df705800 00000000 00000000 df705800 c04abee0 df705ab0 c04ac150
3ee0: df493ee0 df493ee0 c109a9f8 df436d00 c0931330 dfbd7bc0 df493f30 dfbdae00
3f00: 00000000 00000000 00000001 c04ac1bc c04ac198 c013a6f8 00000001 00000000
3f20: c013a680 00000000 00000000 00000000 c0931330 00000000 00000000 c0758db6
3f40: c0905900 df436d00 dfbd7bc0 dfbd7bc0 df492000 dfbd7bf4 c0905900 df436d18
3f60: 00000008 c013aecc df490480 df436400 df42c900 00000000 df443e6c df436d00
3f80: c013ac14 df436438 00000000 c0140200 df42c900 c01400dc 00000000 00000000
3fa0: 00000000 00000000 00000000 c01071c8 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 7fdfffff fff7fdff
[<c03cc77c>] (__smsc911x_reg_read) from [<c03cd000>]
(smsc911x_tx_get_txstatus+0x64/0x7c)
[<c03cd000>] (smsc911x_tx_get_txstatus) from [<c03cd248>]
(smsc911x_tx_update_txcounters+0x14/0xa8)
[<c03cd248>] (smsc911x_tx_update_txcounters) from [<c03cd2e8>]
(smsc911x_get_stats+0xc/0x58)
[<c03cd2e8>] (smsc911x_get_stats) from [<c048a13c>] (dev_get_stats+0x54/0xa4)
[<c048a13c>] (dev_get_stats) from [<c04ab388>] (rtnl_fill_stats+0x38/0x118)
[<c04ab388>] (rtnl_fill_stats) from [<c04a5e10>] (rtnl_fill_ifinfo+0x5f0/0xf74)
[<c04a5e10>] (rtnl_fill_ifinfo) from [<c04aac00>]
(rtmsg_ifinfo_build_skb+0x6c/0xc0)
[<c04aac00>] (rtmsg_ifinfo_build_skb) from [<c04aac9c>]
(rtmsg_ifinfo_event.part.5+0x1c/0x40)
[<c04aac9c>] (rtmsg_ifinfo_event.part.5) from [<c04aace0>]
(rtmsg_ifinfo+0x20/0x28)
[<c04aace0>] (rtmsg_ifinfo) from [<c048ef18>] (netdev_state_change+0x48/0x54)
[<c048ef18>] (netdev_state_change) from [<c04abee0>]
(linkwatch_do_dev+0x50/0x74)
[<c04abee0>] (linkwatch_do_dev) from [<c04ac150>]
(__linkwatch_run_queue+0x124/0x16c)
[<c04ac150>] (__linkwatch_run_queue) from [<c04ac1bc>]
(linkwatch_event+0x24/0x34)
[<c04ac1bc>] (linkwatch_event) from [<c013a6f8>] (process_one_work+0x240/0x3fc)
[<c013a6f8>] (process_one_work) from [<c013aecc>] (worker_thread+0x2b8/0x3f4)
[<c013aecc>] (worker_thread) from [<c0140200>] (kthread+0x124/0x144)
[<c0140200>] (kthread) from [<c01071c8>] (ret_from_fork+0x14/0x2c)
Code: e5903000 e0831001 e5910000 f57ff04f (e12fff1e)
---[ end trace e71c0b5246c61082 ]---

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Florian Fainelli Oct. 30, 2017, 4:09 p.m. UTC | #2
On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote:
> Hi Florian,
> 
> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>> Marc reported that he was not getting the PHY library adjust_link()
>> callback function to run when calling phy_stop() + phy_disconnect()
>> which does not indeed happen because we set the state machine to
>> PHY_HALTED but we don't get to run it to process this state past that
>> point.
>>
>> Fix this with a synchronous call to phy_state_machine() in order to have
>> the state machine actually act on PHY_HALTED, set the PHY device's link
>> down, turn the network device's carrier off and finally call the
>> adjust_link() function.
>>
>> At the end of phy_state_machine() though, if we are going to be moving
>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this
>> is pointless.
>>
>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
>> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> 
> Thanks for your patch!
> 
> Unfortunately, after applying this one, the last in your series, both
> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system
> suspend/resume path, due to register accesses while the device is already
> suspended:

OK, seems like there is another path, uncovered by this patch that we
can be hitting, does the following patch below help?

diff --git a/drivers/net/ethernet/smsc/smsc911x.c
b/drivers/net/ethernet/smsc/smsc911x.c
index 82f9a175073f..51498699b18c 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -1136,6 +1136,9 @@ static void smsc911x_tx_update_txcounters(struct
net_device *dev)
        struct smsc911x_data *pdata = netdev_priv(dev);
        unsigned int tx_stat;

+       if (!netif_running(dev))
+               return;
+
        while ((tx_stat = smsc911x_tx_get_txstatus(pdata)) != 0) {
                if (unlikely(tx_stat & 0x80000000)) {
                        /* In this driver the packet tag is used as the
packet

> 
> PM: suspend entry (deep)
> PM: Syncing filesystems ... done.
> Freezing user space processes ... (elapsed 0.000 seconds) done.
> OOM killer disabled.
> Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
> PM: suspend devices took 0.100 seconds
> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950
> Disabling non-boot CPUs ...
> pgd = c0004000
> [0005b950] *pgd=00000000
> Internal error: : 1406 [#1] SMP ARM
> Modules linked in:
> CPU: 1 PID: 17 Comm: kworker/1:0 Not tainted
> 4.14.0-rc5-kzm9g-00454-g13bf0e7c8794d746 #1002
> Hardware name: Generic SH73A0 (Flattened Device Tree)
> Workqueue: events linkwatch_event
> task: df490480 task.stack: df492000
> PC is at __smsc911x_reg_read+0x1c/0x60
> LR is at smsc911x_tx_get_txstatus+0x64/0x7c
> pc : [<c03cc77c>]    lr : [<c03cd000>]    psr: 20030093
> sp : df493d98  ip : 00000000  fp : c09313a0
> r10: c0931330  r9 : c0909a40  r8 : 00000000
> r7 : 00030013  r6 : df705e08  r5 : df705dc0  r4 : 001c0000
> r3 : e0903000  r2 : 00000001  r1 : e0903048  r0 : 00000000
> Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
> Control: 10c5387d  Table: 5ed1804a  DAC: 00000051
> Process kworker/1:0 (pid: 17, stack limit = 0xdf492210)
> Stack: (0xdf493d98 to 0xdf494000)
> 3d80:                                                       c03cd2dc df705800
> 3da0: df705dc0 c063b2b0 df493e18 c03cd248 c03cd2dc df705800 df705800 c03cd2e8
> 3dc0: c03cd2dc def4b0e4 df705800 c063b2b0 df493e18 c048a13c def4b0e0 dedbb300
> 3de0: df705800 c04ab388 dedbb300 df705800 def4b000 c04a5e10 0000002a 00000000
> 3e00: c04a5a8c c01ffcfc c04aabd0 dedbb300 014000c0 df493e47 00000000 00000000
> 3e20: 00000000 00000000 00000000 00000000 00000050 00000000 df493e47 014000c0
> 3e40: 00000000 dedbb300 df705800 00000010 00000000 00000000 00000000 c0931330
> 3e60: c09313a0 c04aac00 00000000 00000000 00000000 00000000 00000000 00000000
> 3e80: 014000c0 df705800 df705800 c09313a0 df493ee0 c04aac9c 014000c0 00000000
> 3ea0: 00000000 df705800 c0931330 c04aace0 014000c0 00000000 00000000 c048ef18
> 3ec0: df705800 df705800 00000000 00000000 df705800 c04abee0 df705ab0 c04ac150
> 3ee0: df493ee0 df493ee0 c109a9f8 df436d00 c0931330 dfbd7bc0 df493f30 dfbdae00
> 3f00: 00000000 00000000 00000001 c04ac1bc c04ac198 c013a6f8 00000001 00000000
> 3f20: c013a680 00000000 00000000 00000000 c0931330 00000000 00000000 c0758db6
> 3f40: c0905900 df436d00 dfbd7bc0 dfbd7bc0 df492000 dfbd7bf4 c0905900 df436d18
> 3f60: 00000008 c013aecc df490480 df436400 df42c900 00000000 df443e6c df436d00
> 3f80: c013ac14 df436438 00000000 c0140200 df42c900 c01400dc 00000000 00000000
> 3fa0: 00000000 00000000 00000000 c01071c8 00000000 00000000 00000000 00000000
> 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 7fdfffff fff7fdff
> [<c03cc77c>] (__smsc911x_reg_read) from [<c03cd000>]
> (smsc911x_tx_get_txstatus+0x64/0x7c)
> [<c03cd000>] (smsc911x_tx_get_txstatus) from [<c03cd248>]
> (smsc911x_tx_update_txcounters+0x14/0xa8)
> [<c03cd248>] (smsc911x_tx_update_txcounters) from [<c03cd2e8>]
> (smsc911x_get_stats+0xc/0x58)
> [<c03cd2e8>] (smsc911x_get_stats) from [<c048a13c>] (dev_get_stats+0x54/0xa4)
> [<c048a13c>] (dev_get_stats) from [<c04ab388>] (rtnl_fill_stats+0x38/0x118)
> [<c04ab388>] (rtnl_fill_stats) from [<c04a5e10>] (rtnl_fill_ifinfo+0x5f0/0xf74)
> [<c04a5e10>] (rtnl_fill_ifinfo) from [<c04aac00>]
> (rtmsg_ifinfo_build_skb+0x6c/0xc0)
> [<c04aac00>] (rtmsg_ifinfo_build_skb) from [<c04aac9c>]
> (rtmsg_ifinfo_event.part.5+0x1c/0x40)
> [<c04aac9c>] (rtmsg_ifinfo_event.part.5) from [<c04aace0>]
> (rtmsg_ifinfo+0x20/0x28)
> [<c04aace0>] (rtmsg_ifinfo) from [<c048ef18>] (netdev_state_change+0x48/0x54)
> [<c048ef18>] (netdev_state_change) from [<c04abee0>]
> (linkwatch_do_dev+0x50/0x74)
> [<c04abee0>] (linkwatch_do_dev) from [<c04ac150>]
> (__linkwatch_run_queue+0x124/0x16c)
> [<c04ac150>] (__linkwatch_run_queue) from [<c04ac1bc>]
> (linkwatch_event+0x24/0x34)
> [<c04ac1bc>] (linkwatch_event) from [<c013a6f8>] (process_one_work+0x240/0x3fc)
> [<c013a6f8>] (process_one_work) from [<c013aecc>] (worker_thread+0x2b8/0x3f4)
> [<c013aecc>] (worker_thread) from [<c0140200>] (kthread+0x124/0x144)
> [<c0140200>] (kthread) from [<c01071c8>] (ret_from_fork+0x14/0x2c)
> Code: e5903000 e0831001 e5910000 f57ff04f (e12fff1e)
> ---[ end trace e71c0b5246c61082 ]---
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
>
Geert Uytterhoeven Oct. 31, 2017, 3:26 p.m. UTC | #3
Hi Florian,

On Mon, Oct 30, 2017 at 5:09 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote:
>> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>>> Marc reported that he was not getting the PHY library adjust_link()
>>> callback function to run when calling phy_stop() + phy_disconnect()
>>> which does not indeed happen because we set the state machine to
>>> PHY_HALTED but we don't get to run it to process this state past that
>>> point.
>>>
>>> Fix this with a synchronous call to phy_state_machine() in order to have
>>> the state machine actually act on PHY_HALTED, set the PHY device's link
>>> down, turn the network device's carrier off and finally call the
>>> adjust_link() function.
>>>
>>> At the end of phy_state_machine() though, if we are going to be moving
>>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this
>>> is pointless.
>>>
>>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
>>> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>>
>> Thanks for your patch!
>>
>> Unfortunately, after applying this one, the last in your series, both
>> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system
>> suspend/resume path, due to register accesses while the device is already
>> suspended:
>
> OK, seems like there is another path, uncovered by this patch that we
> can be hitting, does the following patch below help?

Unfortunately it doesn't help.

>> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950

Note that this is an imprecise external abort, i.e. it's reporting may
be delayed,
and the backtrace may be inaccurate.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Florian Fainelli Oct. 31, 2017, 4:33 p.m. UTC | #4
On 10/31/2017 08:26 AM, Geert Uytterhoeven wrote:
> Hi Florian,
> 
> On Mon, Oct 30, 2017 at 5:09 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>> On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote:
>>> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>> Marc reported that he was not getting the PHY library adjust_link()
>>>> callback function to run when calling phy_stop() + phy_disconnect()
>>>> which does not indeed happen because we set the state machine to
>>>> PHY_HALTED but we don't get to run it to process this state past that
>>>> point.
>>>>
>>>> Fix this with a synchronous call to phy_state_machine() in order to have
>>>> the state machine actually act on PHY_HALTED, set the PHY device's link
>>>> down, turn the network device's carrier off and finally call the
>>>> adjust_link() function.
>>>>
>>>> At the end of phy_state_machine() though, if we are going to be moving
>>>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this
>>>> is pointless.
>>>>
>>>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
>>>> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>>>
>>> Thanks for your patch!
>>>
>>> Unfortunately, after applying this one, the last in your series, both
>>> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system
>>> suspend/resume path, due to register accesses while the device is already
>>> suspended:
>>
>> OK, seems like there is another path, uncovered by this patch that we
>> can be hitting, does the following patch below help?
> 
> Unfortunately it doesn't help.

OK :/

> 
>>> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950
> 
> Note that this is an imprecise external abort, i.e. it's reporting may
> be delayed,
> and the backtrace may be inaccurate.

True, can you help narrow it down with me? Can you confirm that
adjust_link() (assuming that is the problem) does not get called past
phy_stop_machine() as it should?
Geert Uytterhoeven Nov. 6, 2017, 3:50 p.m. UTC | #5
Hi Florian,

On Tue, Oct 31, 2017 at 5:33 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 10/31/2017 08:26 AM, Geert Uytterhoeven wrote:
>> On Mon, Oct 30, 2017 at 5:09 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>>> On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote:
>>>> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>>> Marc reported that he was not getting the PHY library adjust_link()
>>>>> callback function to run when calling phy_stop() + phy_disconnect()
>>>>> which does not indeed happen because we set the state machine to
>>>>> PHY_HALTED but we don't get to run it to process this state past that
>>>>> point.
>>>>>
>>>>> Fix this with a synchronous call to phy_state_machine() in order to have
>>>>> the state machine actually act on PHY_HALTED, set the PHY device's link
>>>>> down, turn the network device's carrier off and finally call the
>>>>> adjust_link() function.
>>>>>
>>>>> At the end of phy_state_machine() though, if we are going to be moving
>>>>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this
>>>>> is pointless.
>>>>>
>>>>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
>>>>> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>>>>
>>>> Thanks for your patch!
>>>>
>>>> Unfortunately, after applying this one, the last in your series, both
>>>> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system
>>>> suspend/resume path, due to register accesses while the device is already
>>>> suspended:
>>>
>>> OK, seems like there is another path, uncovered by this patch that we
>>> can be hitting, does the following patch below help?
>>
>> Unfortunately it doesn't help.
>
> OK :/
>
>>
>>>> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950
>>
>> Note that this is an imprecise external abort, i.e. it's reporting may
>> be delayed,
>> and the backtrace may be inaccurate.
>
> True, can you help narrow it down with me? Can you confirm that
> adjust_link() (assuming that is the problem) does not get called past
> phy_stop_machine() as it should?

I've added some additional debug checks (keep track of both phy and
smsc state, and refuse the access registers if smsc is disabled).

Apparently phy_stop_machine() is called twice:
  - Once from mdio_bus_phy_suspend(), cfr. the first backtrace,
  - A second time from smsc911x_suspend(), cfr. the second backtrace.

The second call causes a call to smsc911x_phy_adjust_link() while the smsc is
already disabled, cfr. the third backtrace. This would trigger the imprecise
external abort if I let it access the registers.

------------[ cut here ]------------
WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:597
phy_stop_machine+0x44/0xcc
phy_stop_machine: phy running, good
CPU: 0 PID: 1083 Comm: bash Not tainted
4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
Hardware name: Generic R8A73A4 (Flattened Device Tree)
[<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14)
[<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc)
[<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc)
[<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44)
[<c002ac8c>] (warn_slowpath_fmt) from [<c0314680>] (phy_stop_machine+0x44/0xcc)
[<c0314680>] (phy_stop_machine) from [<c03162ec>]
(mdio_bus_phy_suspend+0x24/0x40)
[<c03162ec>] (mdio_bus_phy_suspend) from [<c0307090>]
(dpm_run_callback+0x17c/0x3ec)
[<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0)
[<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568)
[<c030a974>] (dpm_suspend) from [<c00880bc>]
(suspend_devices_and_enter+0x78/0xe98)
[<c00880bc>] (suspend_devices_and_enter) from [<c008991c>]
(pm_suspend+0xa40/0xbec)
[<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc)
[<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0)
[<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c)
[<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144)
[<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80)
[<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28)
---[ end trace 8fc4c71351438007 ]---
libphy: phy_stop_machine: Kicking state machine synchronously
libphy: phy_stop_machine: Kicking state machine done
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:598
phy_stop_machine+0x64/0xcc
phy_stop_machine: phy already stopped
CPU: 0 PID: 1083 Comm: bash Tainted: G        W
4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
Hardware name: Generic R8A73A4 (Flattened Device Tree)
[<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14)
[<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc)
[<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc)
[<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44)
[<c002ac8c>] (warn_slowpath_fmt) from [<c03146a0>] (phy_stop_machine+0x64/0xcc)
[<c03146a0>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4)
[<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec)
[<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0)
[<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568)
[<c030a974>] (dpm_suspend) from [<c00880bc>]
(suspend_devices_and_enter+0x78/0xe98)
[<c00880bc>] (suspend_devices_and_enter) from [<c008991c>]
(pm_suspend+0xa40/0xbec)
[<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc)
[<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0)
[<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c)
[<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144)
[<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80)
[<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28)
---[ end trace 8fc4c71351438008 ]---
libphy: phy_stop_machine: Kicking state machine synchronously
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1083 at drivers/net/ethernet/smsc/smsc911x.c:988
smsc911x_phy_adjust_link+0x2c/0x2e0
PHY stopped
CPU: 0 PID: 1083 Comm: bash Tainted: G        W
4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
Hardware name: Generic R8A73A4 (Flattened Device Tree)
[<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14)
[<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc)
[<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc)
[<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44)
[<c002ac8c>] (warn_slowpath_fmt) from [<c031b1e8>]
(smsc911x_phy_adjust_link+0x2c/0x2e0)
[<c031b1e8>] (smsc911x_phy_adjust_link) from [<c031378c>]
(phy_link_down+0x18/0x24)
[<c031378c>] (phy_link_down) from [<c0314518>] (phy_state_machine+0x2d0/0x3f4)
[<c0314518>] (phy_state_machine) from [<c03146d8>] (phy_stop_machine+0x9c/0xcc)
[<c03146d8>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4)
[<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec)
[<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0)
[<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568)
[<c030a974>] (dpm_suspend) from [<c00880bc>]
(suspend_devices_and_enter+0x78/0xe98)
[<c00880bc>] (suspend_devices_and_enter) from [<c008991c>]
(pm_suspend+0xa40/0xbec)
[<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc)
[<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0)
[<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c)
[<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144)
[<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80)
[<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28)
---[ end trace 8fc4c71351438009 ]---
libphy: phy_stop_machine: Kicking state machine done

If I revert your "net: smsc911x: Properly manage PHY during suspend/resume",
phy_stop_machine() is no longer called twice, and system suspend works.

However, during resume, smsc911x_mii_read() is called before the
smsc is enabled, cfr. the fourth backtrace:

     WARNING: CPU: 1 PID: 17 at
drivers/net/ethernet/smsc/smsc911x.c:165 __smsc911x_reg_read+0x1c/0x88
    Modules linked in:
    CPU: 1 PID: 17 Comm: kworker/1:0 Tainted: G        W
4.14.0-rc7-kzm9g-00443-gcdfc0e18a47e0bb3-dirty #1013
    Hardware name: Generic SH73A0 (Flattened Device Tree)

smsc hangs of [fec10000.bus, which is started only here --->

    Workqueue: events_power_efficient phy_state_machine
    [<c010f304>] (unwind_backtrace) from [<c010b7ec>] (show_stack+0x10/0x14)
    [<c010b7ec>] (show_stack) from [<c0551ab0>] (dump_stack+0xa4/0xdc)
    [<c0551ab0>] (dump_stack) from [<c0120480>] (__warn+0xcc/0xfc)
    [<c0120480>] (__warn) from [<c0120554>] (warn_slowpath_null+0x1c/0x24)
    [<c0120554>] (warn_slowpath_null) from [<c03ce264>]
(__smsc911x_reg_read+0x1c/0x88)
    [<c03ce264>] (__smsc911x_reg_read) from [<c03cefc8>]
(smsc911x_mac_read+0x4c/0x118)
    [<c03cefc8>] (smsc911x_mac_read) from [<c03cf320>]
(smsc911x_mii_read+0x2c/0xa4)
    [<c03cf320>] (smsc911x_mii_read) from [<c03cd490>] (mdiobus_read+0x58/0x70)
    [<c03cd490>] (mdiobus_read) from [<c03cc4a0>] (genphy_update_link+0x18/0x50)
    [<c03cc4a0>] (genphy_update_link) from [<c03cc4e4>]
(genphy_read_status+0xc/0x1cc)
    [<c03cc4e4>] (genphy_read_status) from [<c03ca3e4>]
(phy_state_machine+0xa8/0x3f4)
    [<c03ca3e4>] (phy_state_machine) from [<c013a740>]
(process_one_work+0x240/0x3fc)
    [<c013a740>] (process_one_work) from [<c013af28>]
(worker_thread+0x2cc/0x40c)
    [<c013af28>] (worker_thread) from [<c014024c>] (kthread+0x124/0x144)
    [<c014024c>] (kthread) from [<c01071c8>] (ret_from_fork+0x14/0x2c)
    ---[ end trace 21b7024e273f9f21 ]---
    ------------[ cut here ]------------

(yes, the fourth backtrace is from another machine, but I can trigger
all of this
 on both r8a73a4/ape6evm and sh73a0/kzm9g anyway).

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Florian Fainelli Nov. 27, 2017, 4:05 a.m. UTC | #6
Hi Geert,

On 11/06/2017 07:50 AM, Geert Uytterhoeven wrote:
> Hi Florian,
> 
> On Tue, Oct 31, 2017 at 5:33 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>> On 10/31/2017 08:26 AM, Geert Uytterhoeven wrote:
>>> On Mon, Oct 30, 2017 at 5:09 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>> On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote:
>>>>> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>>>> Marc reported that he was not getting the PHY library adjust_link()
>>>>>> callback function to run when calling phy_stop() + phy_disconnect()
>>>>>> which does not indeed happen because we set the state machine to
>>>>>> PHY_HALTED but we don't get to run it to process this state past that
>>>>>> point.
>>>>>>
>>>>>> Fix this with a synchronous call to phy_state_machine() in order to have
>>>>>> the state machine actually act on PHY_HALTED, set the PHY device's link
>>>>>> down, turn the network device's carrier off and finally call the
>>>>>> adjust_link() function.
>>>>>>
>>>>>> At the end of phy_state_machine() though, if we are going to be moving
>>>>>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this
>>>>>> is pointless.
>>>>>>
>>>>>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>>>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
>>>>>> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>>>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>>>>>
>>>>> Thanks for your patch!
>>>>>
>>>>> Unfortunately, after applying this one, the last in your series, both
>>>>> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system
>>>>> suspend/resume path, due to register accesses while the device is already
>>>>> suspended:
>>>>
>>>> OK, seems like there is another path, uncovered by this patch that we
>>>> can be hitting, does the following patch below help?
>>>
>>> Unfortunately it doesn't help.
>>
>> OK :/
>>
>>>
>>>>> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950
>>>
>>> Note that this is an imprecise external abort, i.e. it's reporting may
>>> be delayed,
>>> and the backtrace may be inaccurate.
>>
>> True, can you help narrow it down with me? Can you confirm that
>> adjust_link() (assuming that is the problem) does not get called past
>> phy_stop_machine() as it should?
> 
> I've added some additional debug checks (keep track of both phy and
> smsc state, and refuse the access registers if smsc is disabled).

Thanks for doing that, and sorry for responding that late.

> 
> Apparently phy_stop_machine() is called twice:
>   - Once from mdio_bus_phy_suspend(), cfr. the first backtrace,
>   - A second time from smsc911x_suspend(), cfr. the second backtrace.
> 
> The second call causes a call to smsc911x_phy_adjust_link() while the smsc is
> already disabled, cfr. the third backtrace. This would trigger the imprecise
> external abort if I let it access the registers.
> 
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:597
> phy_stop_machine+0x44/0xcc
> phy_stop_machine: phy running, good
> CPU: 0 PID: 1083 Comm: bash Not tainted
> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
> Hardware name: Generic R8A73A4 (Flattened Device Tree)
> [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14)
> [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc)
> [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc)
> [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44)
> [<c002ac8c>] (warn_slowpath_fmt) from [<c0314680>] (phy_stop_machine+0x44/0xcc)
> [<c0314680>] (phy_stop_machine) from [<c03162ec>]
> (mdio_bus_phy_suspend+0x24/0x40)
> [<c03162ec>] (mdio_bus_phy_suspend) from [<c0307090>]
> (dpm_run_callback+0x17c/0x3ec)
> [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0)
> [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568)
> [<c030a974>] (dpm_suspend) from [<c00880bc>]
> (suspend_devices_and_enter+0x78/0xe98)
> [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>]
> (pm_suspend+0xa40/0xbec)
> [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc)
> [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0)
> [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c)
> [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144)
> [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80)
> [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28)
> ---[ end trace 8fc4c71351438007 ]---
> libphy: phy_stop_machine: Kicking state machine synchronously
> libphy: phy_stop_machine: Kicking state machine done
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:598
> phy_stop_machine+0x64/0xcc
> phy_stop_machine: phy already stopped
> CPU: 0 PID: 1083 Comm: bash Tainted: G        W
> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
> Hardware name: Generic R8A73A4 (Flattened Device Tree)
> [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14)
> [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc)
> [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc)
> [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44)
> [<c002ac8c>] (warn_slowpath_fmt) from [<c03146a0>] (phy_stop_machine+0x64/0xcc)
> [<c03146a0>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4)
> [<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec)
> [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0)
> [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568)
> [<c030a974>] (dpm_suspend) from [<c00880bc>]
> (suspend_devices_and_enter+0x78/0xe98)
> [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>]
> (pm_suspend+0xa40/0xbec)
> [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc)
> [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0)
> [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c)
> [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144)
> [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80)
> [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28)
> ---[ end trace 8fc4c71351438008 ]---
> libphy: phy_stop_machine: Kicking state machine synchronously
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1083 at drivers/net/ethernet/smsc/smsc911x.c:988
> smsc911x_phy_adjust_link+0x2c/0x2e0
> PHY stopped
> CPU: 0 PID: 1083 Comm: bash Tainted: G        W
> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
> Hardware name: Generic R8A73A4 (Flattened Device Tree)
> [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14)
> [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc)
> [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc)
> [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44)
> [<c002ac8c>] (warn_slowpath_fmt) from [<c031b1e8>]
> (smsc911x_phy_adjust_link+0x2c/0x2e0)
> [<c031b1e8>] (smsc911x_phy_adjust_link) from [<c031378c>]
> (phy_link_down+0x18/0x24)
> [<c031378c>] (phy_link_down) from [<c0314518>] (phy_state_machine+0x2d0/0x3f4)
> [<c0314518>] (phy_state_machine) from [<c03146d8>] (phy_stop_machine+0x9c/0xcc)
> [<c03146d8>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4)
> [<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec)
> [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0)
> [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568)
> [<c030a974>] (dpm_suspend) from [<c00880bc>]
> (suspend_devices_and_enter+0x78/0xe98)
> [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>]
> (pm_suspend+0xa40/0xbec)
> [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc)
> [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0)
> [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c)
> [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144)
> [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80)
> [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28)
> ---[ end trace 8fc4c71351438009 ]---
> libphy: phy_stop_machine: Kicking state machine done
> 
> If I revert your "net: smsc911x: Properly manage PHY during suspend/resume",
> phy_stop_machine() is no longer called twice, and system suspend works.

OK, there does appear to be a problem in how the network device vs. mdio
bus are suspended/resumed in this particular driver, and I have no idea
why, see below.

> 
> However, during resume, smsc911x_mii_read() is called before the
> smsc is enabled, cfr. the fourth backtrace:

Humm, how is that possible? smsc911x_mii_bus properly sets its parent
device to be the platform device of the network device (which is
correct) so by virtue of child -> parent relationships, we should have
the network device resume first, and then have the MDIO bus resume
second (unless I am completely off here).

> 
>      WARNING: CPU: 1 PID: 17 at
> drivers/net/ethernet/smsc/smsc911x.c:165 __smsc911x_reg_read+0x1c/0x88
>     Modules linked in:
>     CPU: 1 PID: 17 Comm: kworker/1:0 Tainted: G        W
> 4.14.0-rc7-kzm9g-00443-gcdfc0e18a47e0bb3-dirty #1013
>     Hardware name: Generic SH73A0 (Flattened Device Tree)
> 
> smsc hangs of [fec10000.bus, which is started only here --->
> 
>     Workqueue: events_power_efficient phy_state_machine
>     [<c010f304>] (unwind_backtrace) from [<c010b7ec>] (show_stack+0x10/0x14)
>     [<c010b7ec>] (show_stack) from [<c0551ab0>] (dump_stack+0xa4/0xdc)
>     [<c0551ab0>] (dump_stack) from [<c0120480>] (__warn+0xcc/0xfc)
>     [<c0120480>] (__warn) from [<c0120554>] (warn_slowpath_null+0x1c/0x24)
>     [<c0120554>] (warn_slowpath_null) from [<c03ce264>]
> (__smsc911x_reg_read+0x1c/0x88)
>     [<c03ce264>] (__smsc911x_reg_read) from [<c03cefc8>]
> (smsc911x_mac_read+0x4c/0x118)
>     [<c03cefc8>] (smsc911x_mac_read) from [<c03cf320>]
> (smsc911x_mii_read+0x2c/0xa4)
>     [<c03cf320>] (smsc911x_mii_read) from [<c03cd490>] (mdiobus_read+0x58/0x70)
>     [<c03cd490>] (mdiobus_read) from [<c03cc4a0>] (genphy_update_link+0x18/0x50)
>     [<c03cc4a0>] (genphy_update_link) from [<c03cc4e4>]
> (genphy_read_status+0xc/0x1cc)
>     [<c03cc4e4>] (genphy_read_status) from [<c03ca3e4>]
> (phy_state_machine+0xa8/0x3f4)
>     [<c03ca3e4>] (phy_state_machine) from [<c013a740>]
> (process_one_work+0x240/0x3fc)
>     [<c013a740>] (process_one_work) from [<c013af28>]
> (worker_thread+0x2cc/0x40c)
>     [<c013af28>] (worker_thread) from [<c014024c>] (kthread+0x124/0x144)
>     [<c014024c>] (kthread) from [<c01071c8>] (ret_from_fork+0x14/0x2c)
>     ---[ end trace 21b7024e273f9f21 ]---
>     ------------[ cut here ]------------
> 
> (yes, the fourth backtrace is from another machine, but I can trigger
> all of this
>  on both r8a73a4/ape6evm and sh73a0/kzm9g anyway).
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
>
Geert Uytterhoeven Nov. 27, 2017, 7:48 a.m. UTC | #7
Hi Florian,

On Mon, Nov 27, 2017 at 5:05 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 11/06/2017 07:50 AM, Geert Uytterhoeven wrote:
>> On Tue, Oct 31, 2017 at 5:33 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>>> On 10/31/2017 08:26 AM, Geert Uytterhoeven wrote:
>>>> On Mon, Oct 30, 2017 at 5:09 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>>> On 10/30/2017 06:56 AM, Geert Uytterhoeven wrote:
>>>>>> On Thu, Oct 26, 2017 at 1:21 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>>>>> Marc reported that he was not getting the PHY library adjust_link()
>>>>>>> callback function to run when calling phy_stop() + phy_disconnect()
>>>>>>> which does not indeed happen because we set the state machine to
>>>>>>> PHY_HALTED but we don't get to run it to process this state past that
>>>>>>> point.
>>>>>>>
>>>>>>> Fix this with a synchronous call to phy_state_machine() in order to have
>>>>>>> the state machine actually act on PHY_HALTED, set the PHY device's link
>>>>>>> down, turn the network device's carrier off and finally call the
>>>>>>> adjust_link() function.
>>>>>>>
>>>>>>> At the end of phy_state_machine() though, if we are going to be moving
>>>>>>> from PHY_HALTED to PHY_HALTED, do not reschedule the state machine, this
>>>>>>> is pointless.
>>>>>>>
>>>>>>> Reported-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>>>>>> Fixes: a390d1f379cf ("phylib: convert state_queue work to delayed_work")
>>>>>>> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
>>>>>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>>>>>>
>>>>>> Thanks for your patch!
>>>>>>
>>>>>> Unfortunately, after applying this one, the last in your series, both
>>>>>> sh73a0/kzm9g and r8a73a4/ape6evm start crashing again in the system
>>>>>> suspend/resume path, due to register accesses while the device is already
>>>>>> suspended:
>>>>>
>>>>> OK, seems like there is another path, uncovered by this patch that we
>>>>> can be hitting, does the following patch below help?
>>>>
>>>> Unfortunately it doesn't help.
>>>
>>> OK :/
>>>
>>>>
>>>>>> Unhandled fault: imprecise external abort (0x1406) at 0x0005b950
>>>>
>>>> Note that this is an imprecise external abort, i.e. it's reporting may
>>>> be delayed,
>>>> and the backtrace may be inaccurate.
>>>
>>> True, can you help narrow it down with me? Can you confirm that
>>> adjust_link() (assuming that is the problem) does not get called past
>>> phy_stop_machine() as it should?
>>
>> I've added some additional debug checks (keep track of both phy and
>> smsc state, and refuse the access registers if smsc is disabled).
>
> Thanks for doing that, and sorry for responding that late.
>
>>
>> Apparently phy_stop_machine() is called twice:
>>   - Once from mdio_bus_phy_suspend(), cfr. the first backtrace,
>>   - A second time from smsc911x_suspend(), cfr. the second backtrace.
>>
>> The second call causes a call to smsc911x_phy_adjust_link() while the smsc is
>> already disabled, cfr. the third backtrace. This would trigger the imprecise
>> external abort if I let it access the registers.
>>
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:597
>> phy_stop_machine+0x44/0xcc
>> phy_stop_machine: phy running, good
>> CPU: 0 PID: 1083 Comm: bash Not tainted
>> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
>> Hardware name: Generic R8A73A4 (Flattened Device Tree)
>> [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14)
>> [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc)
>> [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc)
>> [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44)
>> [<c002ac8c>] (warn_slowpath_fmt) from [<c0314680>] (phy_stop_machine+0x44/0xcc)
>> [<c0314680>] (phy_stop_machine) from [<c03162ec>]
>> (mdio_bus_phy_suspend+0x24/0x40)
>> [<c03162ec>] (mdio_bus_phy_suspend) from [<c0307090>]
>> (dpm_run_callback+0x17c/0x3ec)
>> [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0)
>> [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568)
>> [<c030a974>] (dpm_suspend) from [<c00880bc>]
>> (suspend_devices_and_enter+0x78/0xe98)
>> [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>]
>> (pm_suspend+0xa40/0xbec)
>> [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc)
>> [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0)
>> [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c)
>> [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144)
>> [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80)
>> [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28)
>> ---[ end trace 8fc4c71351438007 ]---
>> libphy: phy_stop_machine: Kicking state machine synchronously
>> libphy: phy_stop_machine: Kicking state machine done
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 1083 at drivers/net/phy/phy.c:598
>> phy_stop_machine+0x64/0xcc
>> phy_stop_machine: phy already stopped
>> CPU: 0 PID: 1083 Comm: bash Tainted: G        W
>> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
>> Hardware name: Generic R8A73A4 (Flattened Device Tree)
>> [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14)
>> [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc)
>> [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc)
>> [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44)
>> [<c002ac8c>] (warn_slowpath_fmt) from [<c03146a0>] (phy_stop_machine+0x64/0xcc)
>> [<c03146a0>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4)
>> [<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec)
>> [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0)
>> [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568)
>> [<c030a974>] (dpm_suspend) from [<c00880bc>]
>> (suspend_devices_and_enter+0x78/0xe98)
>> [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>]
>> (pm_suspend+0xa40/0xbec)
>> [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc)
>> [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0)
>> [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c)
>> [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144)
>> [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80)
>> [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28)
>> ---[ end trace 8fc4c71351438008 ]---
>> libphy: phy_stop_machine: Kicking state machine synchronously
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 1083 at drivers/net/ethernet/smsc/smsc911x.c:988
>> smsc911x_phy_adjust_link+0x2c/0x2e0
>> PHY stopped
>> CPU: 0 PID: 1083 Comm: bash Tainted: G        W
>> 4.14.0-rc7-ape6evm-00443-gcdfc0e18a47e0bb3-dirty #637
>> Hardware name: Generic R8A73A4 (Flattened Device Tree)
>> [<c0018f3c>] (unwind_backtrace) from [<c0014f40>] (show_stack+0x10/0x14)
>> [<c0014f40>] (show_stack) from [<c04bd928>] (dump_stack+0xa4/0xdc)
>> [<c04bd928>] (dump_stack) from [<c002ac28>] (__warn+0xcc/0xfc)
>> [<c002ac28>] (__warn) from [<c002ac8c>] (warn_slowpath_fmt+0x34/0x44)
>> [<c002ac8c>] (warn_slowpath_fmt) from [<c031b1e8>]
>> (smsc911x_phy_adjust_link+0x2c/0x2e0)
>> [<c031b1e8>] (smsc911x_phy_adjust_link) from [<c031378c>]
>> (phy_link_down+0x18/0x24)
>> [<c031378c>] (phy_link_down) from [<c0314518>] (phy_state_machine+0x2d0/0x3f4)
>> [<c0314518>] (phy_state_machine) from [<c03146d8>] (phy_stop_machine+0x9c/0xcc)
>> [<c03146d8>] (phy_stop_machine) from [<c031a448>] (smsc911x_suspend+0x44/0xa4)
>> [<c031a448>] (smsc911x_suspend) from [<c0307090>] (dpm_run_callback+0x17c/0x3ec)
>> [<c0307090>] (dpm_run_callback) from [<c030a584>] (__device_suspend+0x498/0x6b0)
>> [<c030a584>] (__device_suspend) from [<c030a974>] (dpm_suspend+0x1d8/0x568)
>> [<c030a974>] (dpm_suspend) from [<c00880bc>]
>> (suspend_devices_and_enter+0x78/0xe98)
>> [<c00880bc>] (suspend_devices_and_enter) from [<c008991c>]
>> (pm_suspend+0xa40/0xbec)
>> [<c008991c>] (pm_suspend) from [<c0086e68>] (state_store+0xac/0xcc)
>> [<c0086e68>] (state_store) from [<c01c7780>] (kernfs_fop_write+0x190/0x1d0)
>> [<c01c7780>] (kernfs_fop_write) from [<c0153418>] (__vfs_write+0x20/0x11c)
>> [<c0153418>] (__vfs_write) from [<c0153688>] (vfs_write+0xb8/0x144)
>> [<c0153688>] (vfs_write) from [<c0153808>] (SyS_write+0x40/0x80)
>> [<c0153808>] (SyS_write) from [<c0010200>] (ret_fast_syscall+0x0/0x28)
>> ---[ end trace 8fc4c71351438009 ]---
>> libphy: phy_stop_machine: Kicking state machine done
>>
>> If I revert your "net: smsc911x: Properly manage PHY during suspend/resume",
>> phy_stop_machine() is no longer called twice, and system suspend works.
>
> OK, there does appear to be a problem in how the network device vs. mdio
> bus are suspended/resumed in this particular driver, and I have no idea
> why, see below.
>
>>
>> However, during resume, smsc911x_mii_read() is called before the
>> smsc is enabled, cfr. the fourth backtrace:
>
> Humm, how is that possible? smsc911x_mii_bus properly sets its parent
> device to be the platform device of the network device (which is
> correct) so by virtue of child -> parent relationships, we should have
> the network device resume first, and then have the MDIO bus resume
> second (unless I am completely off here).

The MDIO bus callback is not called from the network device ...

>>      WARNING: CPU: 1 PID: 17 at
>> drivers/net/ethernet/smsc/smsc911x.c:165 __smsc911x_reg_read+0x1c/0x88
>>     Modules linked in:
>>     CPU: 1 PID: 17 Comm: kworker/1:0 Tainted: G        W
>> 4.14.0-rc7-kzm9g-00443-gcdfc0e18a47e0bb3-dirty #1013
>>     Hardware name: Generic SH73A0 (Flattened Device Tree)
>>
>> smsc hangs of [fec10000.bus, which is started only here --->
>>
>>     Workqueue: events_power_efficient phy_state_machine
>>     [<c010f304>] (unwind_backtrace) from [<c010b7ec>] (show_stack+0x10/0x14)
>>     [<c010b7ec>] (show_stack) from [<c0551ab0>] (dump_stack+0xa4/0xdc)
>>     [<c0551ab0>] (dump_stack) from [<c0120480>] (__warn+0xcc/0xfc)
>>     [<c0120480>] (__warn) from [<c0120554>] (warn_slowpath_null+0x1c/0x24)
>>     [<c0120554>] (warn_slowpath_null) from [<c03ce264>]
>> (__smsc911x_reg_read+0x1c/0x88)
>>     [<c03ce264>] (__smsc911x_reg_read) from [<c03cefc8>]
>> (smsc911x_mac_read+0x4c/0x118)
>>     [<c03cefc8>] (smsc911x_mac_read) from [<c03cf320>]
>> (smsc911x_mii_read+0x2c/0xa4)
>>     [<c03cf320>] (smsc911x_mii_read) from [<c03cd490>] (mdiobus_read+0x58/0x70)
>>     [<c03cd490>] (mdiobus_read) from [<c03cc4a0>] (genphy_update_link+0x18/0x50)
>>     [<c03cc4a0>] (genphy_update_link) from [<c03cc4e4>]
>> (genphy_read_status+0xc/0x1cc)
>>     [<c03cc4e4>] (genphy_read_status) from [<c03ca3e4>]
>> (phy_state_machine+0xa8/0x3f4)
>>     [<c03ca3e4>] (phy_state_machine) from [<c013a740>]
>> (process_one_work+0x240/0x3fc)
>>     [<c013a740>] (process_one_work) from [<c013af28>]
>> (worker_thread+0x2cc/0x40c)
>>     [<c013af28>] (worker_thread) from [<c014024c>] (kthread+0x124/0x144)

... but from the worker thread, which is unaware of PM states, unless it is
a freezable workqueue.

Cfr. my patch "[1/2] net: phy: Freeze PHY polling before suspending devices"
(https://patchwork.kernel.org/patch/9915901/), which made it a freezable
workqueue, like was done on PCI to solve similar issues with PME scanning.

>>     [<c014024c>] (kthread) from [<c01071c8>] (ret_from_fork+0x14/0x2c)
>>     ---[ end trace 21b7024e273f9f21 ]---
>>     ------------[ cut here ]------------
>>
>> (yes, the fourth backtrace is from another machine, but I can trigger
>> all of this
>>  on both r8a73a4/ape6evm and sh73a0/kzm9g anyway).

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Marc Gonzalez Dec. 4, 2017, 3:08 p.m. UTC | #8
On 27/11/2017 08:48, Geert Uytterhoeven wrote:

> The MDIO bus callback is not called from the network device ...
> 
>>> smsc hangs of [fec10000.bus, which is started only here --->
>>>
>>>     Workqueue: events_power_efficient phy_state_machine
>>>     [<c010f304>] (unwind_backtrace) from [<c010b7ec>] (show_stack+0x10/0x14)
>>>     [<c010b7ec>] (show_stack) from [<c0551ab0>] (dump_stack+0xa4/0xdc)
>>>     [<c0551ab0>] (dump_stack) from [<c0120480>] (__warn+0xcc/0xfc)
>>>     [<c0120480>] (__warn) from [<c0120554>] (warn_slowpath_null+0x1c/0x24)
>>>     [<c0120554>] (warn_slowpath_null) from [<c03ce264>] (__smsc911x_reg_read+0x1c/0x88)
>>>     [<c03ce264>] (__smsc911x_reg_read) from [<c03cefc8>] (smsc911x_mac_read+0x4c/0x118)
>>>     [<c03cefc8>] (smsc911x_mac_read) from [<c03cf320>] (smsc911x_mii_read+0x2c/0xa4)
>>>     [<c03cf320>] (smsc911x_mii_read) from [<c03cd490>] (mdiobus_read+0x58/0x70)
>>>     [<c03cd490>] (mdiobus_read) from [<c03cc4a0>] (genphy_update_link+0x18/0x50)
>>>     [<c03cc4a0>] (genphy_update_link) from [<c03cc4e4>] (genphy_read_status+0xc/0x1cc)
>>>     [<c03cc4e4>] (genphy_read_status) from [<c03ca3e4>] (phy_state_machine+0xa8/0x3f4)
>>>     [<c03ca3e4>] (phy_state_machine) from [<c013a740>] (process_one_work+0x240/0x3fc)
>>>     [<c013a740>] (process_one_work) from [<c013af28>] (worker_thread+0x2cc/0x40c)
>>>     [<c013af28>] (worker_thread) from [<c014024c>] (kthread+0x124/0x144)
>
> ... but from the worker thread, which is unaware of PM states, unless it is
> a freezable workqueue.
> 
> Cfr. my patch "[1/2] net: phy: Freeze PHY polling before suspending devices"
> (https://patchwork.kernel.org/patch/9915901/), which made it a freezable
> workqueue, like was done on PCI to solve similar issues with PME scanning.

Hello Geert, Florian,

Is there something I can do to help this patch move forward?

Should we look for a different solution than a synchronous call
to phy_state_machine() ? (Or should I call it from the device
driver, instead of expecting the phy framework to do it?)

Regards.
diff mbox series

Patch

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 0ddeb97217ce..20b84ea013c0 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -594,6 +594,9 @@  void phy_stop_machine(struct phy_device *phydev)
 	if (phydev->state > PHY_UP && phydev->state != PHY_HALTED)
 		phydev->state = PHY_UP;
 	mutex_unlock(&phydev->lock);
+
+	/* Now we can run the state machine synchronously */
+	phy_state_machine(&phydev->state_queue.work);
 }
 EXPORT_SYMBOL_GPL(phy_stop_machine);
 
@@ -1077,9 +1080,14 @@  void phy_state_machine(struct work_struct *work)
 
 	/* Only re-schedule a PHY state machine change if we are polling the
 	 * PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving
-	 * between states from phy_mac_interrupt()
+	 * between states from phy_mac_interrupt().
+	 *
+	 * If do_suspend is set to true from PHY_HALTED, in that case, do not
+	 * reschedule the state machine since that would be pointless and
+	 * possibly error prone when called from phy_disconnect()
+	 * synchronously.
 	 */
-	if (phydev->irq == PHY_POLL)
+	if (phydev->irq == PHY_POLL && !do_suspend)
 		queue_delayed_work(system_power_efficient_wq, &phydev->state_queue,
 				   PHY_STATE_TIME * HZ);
 }