[net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

Message ID 1504140569-2063-1-git-send-email-f.fainelli@gmail.com
State Accepted
Delegated to: David Miller
Series
  • [net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

Commit Message

Florian Fainelli Aug. 31, 2017, 12:49 a.m.
This reverts commit 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy:
Correctly process PHY_HALTED in phy_stop_machine()") because it is
creating the possibility for a NULL pointer dereference.

David Daney provided the following call trace and diagram of events:

When ndo_stop() is called we call:

 phy_disconnect()
    +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
    +---> phy_stop_machine()
    |      +---> phy_state_machine()
    |              +----> queue_delayed_work(): Work queued.
    +--->phy_detach() implies: phydev->attached_dev = NULL;

Now at a later time the queued work does:

 phy_state_machine()
    +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:

 CPU 12 Unable to handle kernel paging request at virtual address
0000000000000048, epc == ffffffff80de37ec, ra == ffffffff80c7c
Oops[#1]:
CPU: 12 PID: 1502 Comm: kworker/12:1 Not tainted 4.9.43-Cavium-Octeon+ #1
Workqueue: events_power_efficient phy_state_machine
task: 80000004021ed100 task.stack: 8000000409d70000
$ 0   : 0000000000000000 ffffffff84720060 0000000000000048 0000000000000004
$ 4   : 0000000000000000 0000000000000001 0000000000000004 0000000000000000
$ 8   : 0000000000000000 0000000000000000 00000000ffff98f3 0000000000000000
$12   : 8000000409d73fe0 0000000000009c00 ffffffff846547c8 000000000000af3b
$16   : 80000004096bab68 80000004096babd0 0000000000000000 80000004096ba800
$20   : 0000000000000000 0000000000000000 ffffffff81090000 0000000000000008
$24   : 0000000000000061 ffffffff808637b0
$28   : 8000000409d70000 8000000409d73cf0 80000000271bd300 ffffffff80c7804c
Hi    : 000000000000002a
Lo    : 000000000000003f
epc   : ffffffff80de37ec netif_carrier_off+0xc/0x58
ra    : ffffffff80c7804c phy_state_machine+0x48c/0x4f8
Status: 14009ce3        KX SX UX KERNEL EXL IE
Cause : 00800008 (ExcCode 02)
BadVA : 0000000000000048
PrId  : 000d9501 (Cavium Octeon III)
Modules linked in:
Process kworker/12:1 (pid: 1502, threadinfo=8000000409d70000,
task=80000004021ed100, tls=0000000000000000)
Stack : 8000000409a54000 80000004096bab68 80000000271bd300 80000000271c1e00
        0000000000000000 ffffffff808a1708 8000000409a54000 80000000271bd300
        80000000271bd320 8000000409a54030 ffffffff80ff0f00 0000000000000001
        ffffffff81090000 ffffffff808a1ac0 8000000402182080 ffffffff84650000
        8000000402182080 ffffffff84650000 ffffffff80ff0000 8000000409a54000
        ffffffff808a1970 0000000000000000 80000004099e8000 8000000402099240
        0000000000000000 ffffffff808a8598 0000000000000000 8000000408eeeb00
        8000000409a54000 00000000810a1d00 0000000000000000 8000000409d73de8
        8000000409d73de8 0000000000000088 000000000c009c00 8000000409d73e08
        8000000409d73e08 8000000402182080 ffffffff808a84d0 8000000402182080
        ...
Call Trace:
[<ffffffff80de37ec>] netif_carrier_off+0xc/0x58
[<ffffffff80c7804c>] phy_state_machine+0x48c/0x4f8
[<ffffffff808a1708>] process_one_work+0x158/0x368
[<ffffffff808a1ac0>] worker_thread+0x150/0x4c0
[<ffffffff808a8598>] kthread+0xc8/0xe0
[<ffffffff808617f0>] ret_from_kernel_thread+0x14/0x1c

The original motivation for this change originated from Marc Gonzalez
indicating that his network driver did not have its adjust_link callback
executing with phydev->link = 0 while he was expecting it.

PHYLIB has never made any such guarantee, because phy_stop() merely
tells the workqueue to move into the PHY_HALTED state, which happens
asynchronously.

Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reported-by: David Daney <ddaney.cavm@gmail.com>
Fixes: 7ad813f20853 ("net: phy: Correctly process PHY_HALTED in phy_stop_machine()")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/phy/phy.c | 3 ---
 1 file changed, 3 deletions(-)

Comments

David Miller Aug. 31, 2017, 1:47 a.m. | #1
From: Florian Fainelli <f.fainelli@gmail.com>
Date: Wed, 30 Aug 2017 17:49:29 -0700

> This reverts commit 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy:
> Correctly process PHY_HALTED in phy_stop_machine()") because it is
> creating the possibility for a NULL pointer dereference.
> 
> David Daney provide the following call trace and diagram of events:
> 
> When ndo_stop() is called we call:
> 
>  phy_disconnect()
>     +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
>     +---> phy_stop_machine()
>     |      +---> phy_state_machine()
>     |              +----> queue_delayed_work(): Work queued.
>     +--->phy_detach() implies: phydev->attached_dev = NULL;
> 
> Now at a later time the queued work does:
> 
>  phy_state_machine()
>     +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:
> 
>  CPU 12 Unable to handle kernel paging request at virtual address
> 0000000000000048, epc == ffffffff80de37ec, ra == ffffffff80c7c
> Oops[#1]:
> CPU: 12 PID: 1502 Comm: kworker/12:1 Not tainted 4.9.43-Cavium-Octeon+ #1
> Workqueue: events_power_efficient phy_state_machine
> task: 80000004021ed100 task.stack: 8000000409d70000
> $ 0   : 0000000000000000 ffffffff84720060 0000000000000048 0000000000000004
> $ 4   : 0000000000000000 0000000000000001 0000000000000004 0000000000000000
> $ 8   : 0000000000000000 0000000000000000 00000000ffff98f3 0000000000000000
> $12   : 8000000409d73fe0 0000000000009c00 ffffffff846547c8 000000000000af3b
> $16   : 80000004096bab68 80000004096babd0 0000000000000000 80000004096ba800
> $20   : 0000000000000000 0000000000000000 ffffffff81090000 0000000000000008
> $24   : 0000000000000061 ffffffff808637b0
> $28   : 8000000409d70000 8000000409d73cf0 80000000271bd300 ffffffff80c7804c
> Hi    : 000000000000002a
> Lo    : 000000000000003f
> epc   : ffffffff80de37ec netif_carrier_off+0xc/0x58
> ra    : ffffffff80c7804c phy_state_machine+0x48c/0x4f8
> Status: 14009ce3        KX SX UX KERNEL EXL IE
> Cause : 00800008 (ExcCode 02)
> BadVA : 0000000000000048
> PrId  : 000d9501 (Cavium Octeon III)
> Modules linked in:
> Process kworker/12:1 (pid: 1502, threadinfo=8000000409d70000,
> task=80000004021ed100, tls=0000000000000000)
> Stack : 8000000409a54000 80000004096bab68 80000000271bd300 80000000271c1e00
>         0000000000000000 ffffffff808a1708 8000000409a54000 80000000271bd300
>         80000000271bd320 8000000409a54030 ffffffff80ff0f00 0000000000000001
>         ffffffff81090000 ffffffff808a1ac0 8000000402182080 ffffffff84650000
>         8000000402182080 ffffffff84650000 ffffffff80ff0000 8000000409a54000
>         ffffffff808a1970 0000000000000000 80000004099e8000 8000000402099240
>         0000000000000000 ffffffff808a8598 0000000000000000 8000000408eeeb00
>         8000000409a54000 00000000810a1d00 0000000000000000 8000000409d73de8
>         8000000409d73de8 0000000000000088 000000000c009c00 8000000409d73e08
>         8000000409d73e08 8000000402182080 ffffffff808a84d0 8000000402182080
>         ...
> Call Trace:
> [<ffffffff80de37ec>] netif_carrier_off+0xc/0x58
> [<ffffffff80c7804c>] phy_state_machine+0x48c/0x4f8
> [<ffffffff808a1708>] process_one_work+0x158/0x368
> [<ffffffff808a1ac0>] worker_thread+0x150/0x4c0
> [<ffffffff808a8598>] kthread+0xc8/0xe0
> [<ffffffff808617f0>] ret_from_kernel_thread+0x14/0x1c
> 
> The original motivation for this change originated from Marc Gonzalez
> indicating that his network driver did not have its adjust_link callback
> executing with phydev->link = 0 while he was expecting it.
> 
> PHYLIB has never made any such guarantees ever because phy_stop() merely just
> tells the workqueue to move into PHY_HALTED state which will happen
> asynchronously.
> 
> Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
> Reported-by: David Daney <ddaney.cavm@gmail.com>
> Fixes: 7ad813f20853 ("net: phy: Correctly process PHY_HALTED in phy_stop_machine()")
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Applied and queued up for -stable, thanks Florian.
Marc Gonzalez Aug. 31, 2017, 12:29 p.m. | #2
On 31/08/2017 02:49, Florian Fainelli wrote:

> This reverts commit 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy:
> Correctly process PHY_HALTED in phy_stop_machine()") because it is
> creating the possibility for a NULL pointer dereference.
> 
> David Daney provide the following call trace and diagram of events:
> 
> When ndo_stop() is called we call:
> 
>  phy_disconnect()
>     +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;

What does this mean?

On the contrary, phy_stop_interrupts() is only called when *not* polling.

	if (phydev->irq > 0)
		phy_stop_interrupts(phydev);

>     +---> phy_stop_machine()
>     |      +---> phy_state_machine()
>     |              +----> queue_delayed_work(): Work queued.

You're referring to the fact that, at the end of phy_state_machine()
(in polling mode) the code reschedules itself through:

	if (phydev->irq == PHY_POLL)
		queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, PHY_STATE_TIME * HZ);

>     +--->phy_detach() implies: phydev->attached_dev = NULL;
> 
> Now at a later time the queued work does:
> 
>  phy_state_machine()
>     +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:

I tested a sequence of 500 link up / link down in polling mode,
and saw no such issue. Race condition?

For what case in phy_state_machine() is netif_carrier_off()
being called? Surely not PHY_HALTED?


> The original motivation for this change originated from Marc Gonzalez
> indicating that his network driver did not have its adjust_link callback
> executing with phydev->link = 0 while he was expecting it.

I expect the core to call phy_adjust_link() for link changes.
This used to work back in 3.4 and was broken somewhere along
the way.

> PHYLIB has never made any such guarantees ever because phy_stop() merely
> just tells the workqueue to move into PHY_HALTED state which will happen
> asynchronously.

My original proposal was to fix the issue in the driver.
I'll try locating it in my archives.

Regards.
Marc Gonzalez Aug. 31, 2017, 2:21 p.m. | #3
On 31/08/2017 14:29, Marc Gonzalez wrote:

> On 31/08/2017 02:49, Florian Fainelli wrote:
> 
>> The original motivation for this change originated from Marc Gonzalez
>> indicating that his network driver did not have its adjust_link callback
>> executing with phydev->link = 0 while he was expecting it.
> 
> I expect the core to call phy_adjust_link() for link changes.
> This used to work back in 3.4 and was broken somewhere along
> the way.
> 
>> PHYLIB has never made any such guarantees ever because phy_stop() merely
>> just tells the workqueue to move into PHY_HALTED state which will happen
>> asynchronously.
> 
> My original proposal was to fix the issue in the driver.
> I'll try locating it in my archives.

The original proposal was the following (i.e. basically a copy of
phy_state_machine()'s PHY_HALTED case). Is this what I need to submit,
now that the synchronous call to phy_state_machine() has been removed?

diff --git a/drivers/net/ethernet/aurora/nb8800.c b/drivers/net/ethernet/aurora/nb8800.c
index 607064a6d7a1..8b9a981c55c1 100644
--- a/drivers/net/ethernet/aurora/nb8800.c
+++ b/drivers/net/ethernet/aurora/nb8800.c
@@ -1017,6 +1017,10 @@ static int nb8800_stop(struct net_device *dev)
 
        phy_disconnect(phydev);
 
+       phydev->link = 0;
+       netif_carrier_off(dev);
+       nb8800_link_reconfigure(dev);
+
        free_irq(dev->irq, dev);
 
        nb8800_dma_free(dev);
David Daney Aug. 31, 2017, 4:36 p.m. | #4
On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
> On 31/08/2017 02:49, Florian Fainelli wrote:
> 
>> This reverts commit 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy:
>> Correctly process PHY_HALTED in phy_stop_machine()") because it is
>> creating the possibility for a NULL pointer dereference.
>>
>> David Daney provide the following call trace and diagram of events:
>>
>> When ndo_stop() is called we call:
>>
>>   phy_disconnect()
>>      +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
> 
> What does this mean?

I meant that after the call to phy_stop_interrupts(), phydev->irq = 
PHY_POLL;


> 
> On the contrary, phy_stop_interrupts() is only called when *not* polling.

That is the case I have.  We are using interrupts from the phy.


> 
> 	if (phydev->irq > 0)
> 		phy_stop_interrupts(phydev);
> 
>>      +---> phy_stop_machine()
>>      |      +---> phy_state_machine()
>>      |              +----> queue_delayed_work(): Work queued.
> 
> You're referring to the fact that, at the end of phy_state_machine()
> (in polling mode) the code reschedules itself through:
> 
> 	if (phydev->irq == PHY_POLL)
> 		queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, PHY_STATE_TIME * HZ);

Exactly.  The call to phy_disconnect() ensures that there are no more 
interrupts and also that phydev->irq = PHY_POLL

The call to cancel_delayed_work_sync() at the top of phy_stop_machine() 
was meant to ensure that phy_state_machine() was never run again.  No 
interrupts + no queued work means that it should be safe to do...

> 
>>      +--->phy_detach() implies: phydev->attached_dev = NULL;

The problem is that by calling phy_state_machine() again (which the 
offending patch added) we now have work scheduled that will try to 
dereference the pointer that was set to NULL as a result of phy_detach().

>>
>> Now at a later time the queued work does:
>>
>>   phy_state_machine()
>>      +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:
> 
> I tested a sequence of 500 link up / link down in polling mode,
> and saw no such issue. Race condition?
> 

You were lucky.

> For what case in phy_state_machine() is netif_carrier_off()
> being called? Surely not PHY_HALTED?
> 

The phy can be in a variety of states.  It is connected to something 
outside of the system that we don't control, so you cannot assume any 
particular state.  We must have code that doesn't crash the system no 
matter what state the phy is in.

I suspect, but have not checked, that the phy is in PHY_RUNNING.  I 
think that means that because this patch turned the state machine back 
on, it will start transitioning through PHY_UP, PHY_AN, ... and 
eventually get to the crash we see because phydev->attached_dev = NULL


> 
>> The original motivation for this change originated from Marc Gonzalez
>> indicating that his network driver did not have its adjust_link callback
>> executing with phydev->link = 0 while he was expecting it.
> 
> I expect the core to call phy_adjust_link() for link changes.
> This used to work back in 3.4 and was broken somewhere along
> the way.
> 
>> PHYLIB has never made any such guarantees ever because phy_stop() merely
>> just tells the workqueue to move into PHY_HALTED state which will happen
>> asynchronously.
> 
> My original proposal was to fix the issue in the driver.
> I'll try locating it in my archives.
> 
> Regards.
>
Florian Fainelli Aug. 31, 2017, 4:57 p.m. | #5
On 08/31/2017 09:36 AM, David Daney wrote:
> On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
>> On 31/08/2017 02:49, Florian Fainelli wrote:
>>
>>> This reverts commit 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy:
>>> Correctly process PHY_HALTED in phy_stop_machine()") because it is
>>> creating the possibility for a NULL pointer dereference.
>>>
>>> David Daney provide the following call trace and diagram of events:
>>>
>>> When ndo_stop() is called we call:
>>>
>>>   phy_disconnect()
>>>      +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
>>
>> What does this mean?
> 
> I meant that after the call to phy_stop_interrupts(), phydev->irq =
> PHY_POLL;
> 
> 
>>
>> On the contrary, phy_stop_interrupts() is only called when *not* polling.
> 
> That is the case I have.  We are using interrupts from the phy.
> 
> 
>>
>>     if (phydev->irq > 0)
>>         phy_stop_interrupts(phydev);
>>
>>>      +---> phy_stop_machine()
>>>      |      +---> phy_state_machine()
>>>      |              +----> queue_delayed_work(): Work queued.
>>
>> You're referring to the fact that, at the end of phy_state_machine()
>> (in polling mode) the code reschedules itself through:
>>
>>     if (phydev->irq == PHY_POLL)
>>         queue_delayed_work(system_power_efficient_wq,
>> &phydev->state_queue, PHY_STATE_TIME * HZ);
> 
> Exactly.  The call to phy_disconnect() ensures that there are no more
> interrupts and also that phydev->irq = PHY_POLL
> 
> The call to cancel_delayed_work_sync() at the top of phy_stop_machine()
> was meant to ensure that phy_state_machine() was never run again.  No
> interrupts + no queued work means that it should be save to do...
> 
>>
>>>      +--->phy_detach() implies: phydev->attached_dev = NULL;
> 
> The problem is that by calling phy_state_machine() again (which the
> offending patch added) we now have work scheduled that will try to
> dereference the pointer that was set to NULL as a result of the
> phy_detach()

And the race is between phy_detach() setting phydev->attached_dev = NULL
and phy_state_machine() running in PHY_HALTED state and calling
netif_carrier_off().

> 
> 
>>>
>>> Now at a later time the queued work does:
>>>
>>>   phy_state_machine()
>>>      +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:
>>
>> I tested a sequence of 500 link up / link down in polling mode,
>> and saw no such issue. Race condition?
>>
> 
> You were lucky.

I too tested this a number of times on 2-core and 4-core systems, but
the race is there; both of us were just lucky enough not to see any
crash. I suspect the race is easier to reproduce on a (at least 12-core)
system, possibly with a higher clock speed.

> 
>> For what case in phy_state_machine() is netif_carrier_off()
>> being called? Surely not PHY_HALTED?
>>
> 
> The phy can be in a variety of states.  It is connected to something
> outside of the system that we don't control, so you cannot assume any
> particular state.  We must have code that doesn't crash the system no
> matter what state the phy is in.
> 
> I suspect, but have not checked, that the phy is in PHY_RUNNING.  I
> think that means that because this patch turned the state machine back
> on, it will start transitioning through PHY_UP, PHY_AN, ... and
> eventually get to the crash we see because phydev->attached_dev = NULL

I actually think the PHY remains in PHY_HALTED but just re-schedules
itself, staying in PHY_HALTED until a call to phy_resume() or
phy_start() moves it back to another state. This is largely inefficient,
and we should look into using the patch I posted yesterday, which would
prevent a re-schedule once moved to PHY_HALTED:

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index d0626bf5c540..78168e19bd5d 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1234,7 +1234,7 @@ void phy_state_machine(struct work_struct *work)
         * PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving
         * between states from phy_mac_interrupt()
         */
-       if (phydev->irq == PHY_POLL)
+       if (phydev->irq == PHY_POLL && phydev->state != PHY_HALTED)
                queue_delayed_work(system_power_efficient_wq,
&phydev->state_queue,
                                   PHY_STATE_TIME * HZ);
 }
Florian Fainelli Aug. 31, 2017, 5:03 p.m. | #6
On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
> On 31/08/2017 02:49, Florian Fainelli wrote:
> 
>> This reverts commit 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy:
>> Correctly process PHY_HALTED in phy_stop_machine()") because it is
>> creating the possibility for a NULL pointer dereference.
>>
>> David Daney provide the following call trace and diagram of events:
>>
>> When ndo_stop() is called we call:
>>
>>  phy_disconnect()
>>     +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
> 
> What does this mean?
> 
> On the contrary, phy_stop_interrupts() is only called when *not* polling.
> 
> 	if (phydev->irq > 0)
> 		phy_stop_interrupts(phydev);
> 
>>     +---> phy_stop_machine()
>>     |      +---> phy_state_machine()
>>     |              +----> queue_delayed_work(): Work queued.
> 
> You're referring to the fact that, at the end of phy_state_machine()
> (in polling mode) the code reschedules itself through:
> 
> 	if (phydev->irq == PHY_POLL)
> 		queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, PHY_STATE_TIME * HZ);
> 
>>     +--->phy_detach() implies: phydev->attached_dev = NULL;
>>
>> Now at a later time the queued work does:
>>
>>  phy_state_machine()
>>     +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL:
> 
> I tested a sequence of 500 link up / link down in polling mode,
> and saw no such issue. Race condition?
> 
> For what case in phy_state_machine() is netif_carrier_off()
> being called? Surely not PHY_HALTED?
> 
> 
>> The original motivation for this change originated from Marc Gonzalez
>> indicating that his network driver did not have its adjust_link callback
>> executing with phydev->link = 0 while he was expecting it.
> 
> I expect the core to call phy_adjust_link() for link changes.
> This used to work back in 3.4 and was broken somewhere along
> the way.

If that was working correctly in 3.4, surely we can look at the diff
and figure out what changed, maybe even find the offending commit. Can
you do that?

> 
>> PHYLIB has never made any such guarantees ever because phy_stop() merely
>> just tells the workqueue to move into PHY_HALTED state which will happen
>> asynchronously.
> 
> My original proposal was to fix the issue in the driver.
> I'll try locating it in my archives.

Yes, I remember you saying that. By the way, I don't think you ever
provided a clear explanation of why this is absolutely necessary for
your driver?
Mason Aug. 31, 2017, 5:35 p.m. | #7
On 31/08/2017 18:36, David Daney wrote:
> On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
>> On 31/08/2017 02:49, Florian Fainelli wrote:
>>
>>> This reverts commit 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy:
>>> Correctly process PHY_HALTED in phy_stop_machine()") because it is
>>> creating the possibility for a NULL pointer dereference.
>>>
>>> David Daney provide the following call trace and diagram of events:
>>>
>>> When ndo_stop() is called we call:
>>>
>>>   phy_disconnect()
>>>      +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL;
>>
>> What does this mean?
> 
> I meant that after the call to phy_stop_interrupts(), phydev->irq = 
> PHY_POLL;

I must be missing something.

http://elixir.free-electrons.com/linux/latest/source/drivers/net/phy/phy.c#L868

phy_stop_interrupts() doesn't change phydev->irq right?

Only phy_start_interrupts() sets phydev->irq to
PHY_POLL if it cannot set up interrupt mode.

Regards.
Mason Aug. 31, 2017, 5:49 p.m. | #8
On 31/08/2017 18:57, Florian Fainelli wrote:
> And the race is between phy_detach() setting phydev->attached_dev = NULL
> and phy_state_machine() running in PHY_HALTED state and calling
> netif_carrier_off().

I must be missing something.
(Since a thread cannot race against itself.)

phy_disconnect calls phy_stop_machine which
1) stops the work queue from running in a separate thread
2) calls phy_state_machine *synchronously*
     which runs the PHY_HALTED case with everything well-defined
end of phy_stop_machine

phy_disconnect only then calls phy_detach()
which makes future calls of phy_state_machine perilous.

This all happens in the same thread, so I'm not yet
seeing where the race happens?

Regards.
Florian Fainelli Aug. 31, 2017, 5:53 p.m. | #9
On 08/31/2017 10:49 AM, Mason wrote:
> On 31/08/2017 18:57, Florian Fainelli wrote:
>> And the race is between phy_detach() setting phydev->attached_dev = NULL
>> and phy_state_machine() running in PHY_HALTED state and calling
>> netif_carrier_off().
> 
> I must be missing something.
> (Since a thread cannot race against itself.)
> 
> phy_disconnect calls phy_stop_machine which
> 1) stops the work queue from running in a separate thread
> 2) calls phy_state_machine *synchronously*
>      which runs the PHY_HALTED case with everything well-defined
> end of phy_stop_machine
> 
> phy_disconnect only then calls phy_detach()
> which makes future calls of phy_state_machine perilous.
> 
> This all happens in the same thread, so I'm not yet
> seeing where the race happens?

The race is as described in David's earlier email, so let's recap:

Thread 1			Thread 2
phy_disconnect()
phy_stop_interrupts()
phy_stop_machine()
phy_state_machine()
 -> queue_delayed_work()
phy_detach()
				phy_state_machine()
				-> netif_carrier_off()

If phy_detach() finishes earlier than the workqueue had a chance to be
scheduled and process PHY_HALTED again, then we trigger the NULL pointer
de-reference.

workqueues are not tasklets, the CPU scheduling them gets no guarantee
they will run on the same CPU.
Mason Aug. 31, 2017, 6:12 p.m. | #10
On 31/08/2017 19:53, Florian Fainelli wrote:
> On 08/31/2017 10:49 AM, Mason wrote:
>> On 31/08/2017 18:57, Florian Fainelli wrote:
>>> And the race is between phy_detach() setting phydev->attached_dev = NULL
>>> and phy_state_machine() running in PHY_HALTED state and calling
>>> netif_carrier_off().
>>
>> I must be missing something.
>> (Since a thread cannot race against itself.)
>>
>> phy_disconnect calls phy_stop_machine which
>> 1) stops the work queue from running in a separate thread
>> 2) calls phy_state_machine *synchronously*
>>      which runs the PHY_HALTED case with everything well-defined
>> end of phy_stop_machine
>>
>> phy_disconnect only then calls phy_detach()
>> which makes future calls of phy_state_machine perilous.
>>
>> This all happens in the same thread, so I'm not yet
>> seeing where the race happens?
> 
> The race is as described in David's earlier email, so let's recap:
> 
> Thread 1			Thread 2
> phy_disconnect()
> phy_stop_interrupts()
> phy_stop_machine()
> phy_state_machine()
>  -> queue_delayed_work()
> phy_detach()
> 				phy_state_machine()
> 				-> netif_carrier_off()
> 
> If phy_detach() finishes earlier than the workqueue had a chance to be
> scheduled and process PHY_HALTED again, then we trigger the NULL pointer
> de-reference.
> 
> workqueues are not tasklets, the CPU scheduling them gets no guarantee
> they will run on the same CPU.

Something does not add up.

The synchronous call to phy_state_machine() does:

	case PHY_HALTED:
		if (phydev->link) {
			phydev->link = 0;
			netif_carrier_off(phydev->attached_dev);
			phy_adjust_link(phydev);
			do_suspend = true;
		}

This sets phydev->link = 0; therefore subsequent calls to
phy_state_machine() will be no-ops.

Also, queue_delayed_work() is only called in polling mode.
David stated that he's using interrupt mode.

Regards.
Florian Fainelli Aug. 31, 2017, 6:29 p.m. | #11
On 08/31/2017 11:12 AM, Mason wrote:
> On 31/08/2017 19:53, Florian Fainelli wrote:
>> On 08/31/2017 10:49 AM, Mason wrote:
>>> On 31/08/2017 18:57, Florian Fainelli wrote:
>>>> And the race is between phy_detach() setting phydev->attached_dev = NULL
>>>> and phy_state_machine() running in PHY_HALTED state and calling
>>>> netif_carrier_off().
>>>
>>> I must be missing something.
>>> (Since a thread cannot race against itself.)
>>>
>>> phy_disconnect calls phy_stop_machine which
>>> 1) stops the work queue from running in a separate thread
>>> 2) calls phy_state_machine *synchronously*
>>>      which runs the PHY_HALTED case with everything well-defined
>>> end of phy_stop_machine
>>>
>>> phy_disconnect only then calls phy_detach()
>>> which makes future calls of phy_state_machine perilous.
>>>
>>> This all happens in the same thread, so I'm not yet
>>> seeing where the race happens?
>>
>> The race is as described in David's earlier email, so let's recap:
>>
>> Thread 1			Thread 2
>> phy_disconnect()
>> phy_stop_interrupts()
>> phy_stop_machine()
>> phy_state_machine()
>>  -> queue_delayed_work()
>> phy_detach()
>> 				phy_state_machine()
>> 				-> netif_carrier_off()
>>
>> If phy_detach() finishes earlier than the workqueue had a chance to be
>> scheduled and process PHY_HALTED again, then we trigger the NULL pointer
>> de-reference.
>>
>> workqueues are not tasklets, the CPU scheduling them gets no guarantee
>> they will run on the same CPU.
> 
> Something does not add up.
> 
> The synchronous call to phy_state_machine() does:
> 
> 	case PHY_HALTED:
> 		if (phydev->link) {
> 			phydev->link = 0;
> 			netif_carrier_off(phydev->attached_dev);
> 			phy_adjust_link(phydev);
> 			do_suspend = true;
> 		}
> 
> then sets phydev->link = 0; therefore subsequent calls to
> phy_state_machine() will be no-ops.

Actually you are right, once phydev->link is set to 0 these would become
no-ops. Still scratching my head as to what happens for David then...

> 
> Also, queue_delayed_work() is only called in polling mode.
> David stated that he's using interrupt mode.

Right, that's confusing too now. David, can you check if your tree has:

49d52e8108a21749dc2114b924c907db43358984 ("net: phy: handle state
correctly in phy_stop_machine")
Mason Aug. 31, 2017, 7:09 p.m. | #12
On 31/08/2017 19:03, Florian Fainelli wrote:

> On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
>
>> On 31/08/2017 02:49, Florian Fainelli wrote:
>>
>>> The original motivation for this change originated from Marc Gonzalez
>>> indicating that his network driver did not have its adjust_link callback
>>> executing with phydev->link = 0 while he was expecting it.
>>
>> I expect the core to call phy_adjust_link() for link changes.
>> This used to work back in 3.4 and was broken somewhere along
>> the way.
> 
> If that was working correctly in 3.4 surely we can look at the diff and
> figure out what changed, even maybe find the offending commit, can you
> do that?

Bisecting would be a huge pain because my platform was
not upstream until v4.4.

You mentioned the guarantees made by PHYLIB.
When is the adjust_link callback guaranteed to be called?

>>> PHYLIB has never made any such guarantees ever because phy_stop() merely
>>> just tells the workqueue to move into PHY_HALTED state which will happen
>>> asynchronously.
>>
>> My original proposal was to fix the issue in the driver.
>> I'll try locating it in my archives.
> 
> Yes I remember you telling that, by the way I don't think you ever
> provided a clear explanation why this is absolutely necessary for your
> driver though?

1) nb8800_link_reconfigure() calls phy_print_status()
which prints the "Link down" and "Link up" messages
to the console. With the patch reverted, nothing is
printed when the link goes down, and the result is
random when the link comes up. Sometimes, we get
down + up, sometimes just up.

2) nb8800_link_reconfigure() does some HW init when
the link state changes. If we miss some notifications,
we might not perform some HW init, and stuff breaks.

Regards.
Florian Fainelli Aug. 31, 2017, 7:18 p.m. | #13
On 08/31/2017 12:09 PM, Mason wrote:
> On 31/08/2017 19:03, Florian Fainelli wrote:
> 
>> On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
>>
>>> On 31/08/2017 02:49, Florian Fainelli wrote:
>>>
>>>> The original motivation for this change originated from Marc Gonzalez
>>>> indicating that his network driver did not have its adjust_link callback
>>>> executing with phydev->link = 0 while he was expecting it.
>>>
>>> I expect the core to call phy_adjust_link() for link changes.
>>> This used to work back in 3.4 and was broken somewhere along
>>> the way.
>>
>> If that was working correctly in 3.4 surely we can look at the diff and
>> figure out what changed, even maybe find the offending commit, can you
>> do that?
> 
> Bisecting would be a huge pain because my platform was
> not upstream until v4.4

Then just diff the file and try to pinpoint which commit may have
changed that?

> 
> You mentioned the guarantees made by PHYLIB.
> When is the adjust_link callback guaranteed to be called?

As long as the state machine is running after a call to phy_start(),
adjust_link will be called if there is a change in link state and/or
link settings. Once you call phy_stop(), no such guarantees are made.

> 
>>>> PHYLIB has never made any such guarantees ever because phy_stop() merely
>>>> just tells the workqueue to move into PHY_HALTED state which will happen
>>>> asynchronously.
>>>
>>> My original proposal was to fix the issue in the driver.
>>> I'll try locating it in my archives.
>>
>> Yes I remember you telling that, by the way I don't think you ever
>> provided a clear explanation why this is absolutely necessary for your
>> driver though?
> 
> 1) nb8800_link_reconfigure() calls phy_print_status()
> which prints the "Link down" and "Link up" messages
> to the console. With the patch reverted, nothing is
> printed when the link goes down, and the result is
> random when the link comes up. Sometimes, we get
> down + up, sometimes just up.

Nothing is printed when you bring down the network interface because the
link down is never signaled; there is a small nuance here.

Seeing a random message upon bringing the interface back up suggests you
may not be re-initializing your old vs. new link-state bookkeeping
variables, so state transitions are not properly detected and therefore
not printed. In fact, I don't see where priv->link is ever set to, say,
-1 to force the comparison phydev->link != priv->link to be true; is
that an oversight?

> 
> 2) nb8800_link_reconfigure() does some HW init when
> the link state changes. If we miss some notifications,
> we might not perform some HW init, and stuff breaks.

Care to be more specific? What specific HW init is required during link
notification that, if not done, breaks the HW? Both nb8800_mac_config()
and nb8800_pause_config() are called from the adjust_link callback, and
both could presumably be deferred until the link is detected, so why do
you absolutely need them during ndo_stop()?
Mason Sept. 6, 2017, 2:33 p.m. | #14
On 31/08/2017 20:29, Florian Fainelli wrote:
> On 08/31/2017 11:12 AM, Mason wrote:
>> On 31/08/2017 19:53, Florian Fainelli wrote:
>>> On 08/31/2017 10:49 AM, Mason wrote:
>>>> On 31/08/2017 18:57, Florian Fainelli wrote:
>>>>> And the race is between phy_detach() setting phydev->attached_dev = NULL
>>>>> and phy_state_machine() running in PHY_HALTED state and calling
>>>>> netif_carrier_off().
>>>>
>>>> I must be missing something.
>>>> (Since a thread cannot race against itself.)
>>>>
>>>> phy_disconnect calls phy_stop_machine which
>>>> 1) stops the work queue from running in a separate thread
>>>> 2) calls phy_state_machine *synchronously*
>>>>       which runs the PHY_HALTED case with everything well-defined
>>>> end of phy_stop_machine
>>>>
>>>> phy_disconnect only then calls phy_detach()
>>>> which makes future calls of phy_state_machine perilous.
>>>>
>>>> This all happens in the same thread, so I'm not yet
>>>> seeing where the race happens?
>>>
>>> The race is as described in David's earlier email, so let's recap:
>>>
>>> Thread 1			Thread 2
>>> phy_disconnect()
>>> phy_stop_interrupts()
>>> phy_stop_machine()
>>> phy_state_machine()
>>>   -> queue_delayed_work()
>>> phy_detach()
>>> 				phy_state_machine()
>>> 				-> netif_carrier_off()
>>>
>>> If phy_detach() finishes earlier than the workqueue had a chance to be
>>> scheduled and process PHY_HALTED again, then we trigger the NULL pointer
>>> de-reference.
>>>
>>> workqueues are not tasklets, the CPU scheduling them gets no guarantee
>>> they will run on the same CPU.
>>
>> Something does not add up.
>>
>> The synchronous call to phy_state_machine() does:
>>
>> 	case PHY_HALTED:
>> 		if (phydev->link) {
>> 			phydev->link = 0;
>> 			netif_carrier_off(phydev->attached_dev);
>> 			phy_adjust_link(phydev);
>> 			do_suspend = true;
>> 		}
>>
>> then sets phydev->link = 0; therefore subsequent calls to
>> phy_state_machin() will be no-op.
> 
> Actually you are right, once phydev->link is set to 0 these would become
> no-ops. Still scratching my head as to what happens for David then...
> 
>>
>> Also, queue_delayed_work() is only called in polling mode.
>> David stated that he's using interrupt mode.
> 
> Right that's confusing too now. David can you check if you tree has:
> 
> 49d52e8108a21749dc2114b924c907db43358984 ("net: phy: handle state
> correctly in phy_stop_machine")

Hello David,

A week ago, you wrote about my patch:
"This is broken.  Please revert."

I assume you tested the revert locally, and that reverting did make
the crash disappear. Is that correct?

The reason I ask is because the analysis you provided contains some
flaws, as noted above. But, if reverting my patch did fix your issue,
then perhaps understanding *why* is unimportant.

I'm a bit baffled that it took less than 90 minutes for your request
to be approved, and the patch reverted in all branches, before I even
had a chance to comment.

Regards.
Mason Sept. 6, 2017, 2:55 p.m. | #15
On 31/08/2017 21:18, Florian Fainelli wrote:

> On 08/31/2017 12:09 PM, Mason wrote:
> 
>> 1) nb8800_link_reconfigure() calls phy_print_status()
>> which prints the "Link down" and "Link up" messages
>> to the console. With the patch reverted, nothing is
>> printed when the link goes down, and the result is
>> random when the link comes up. Sometimes, we get
>> down + up, sometimes just up.
> 
> Nothing printed when you bring down the network interface as a result of
> not signaling the link down, there is a small nuance here.

Let me first focus on the "Link down" message.

Do you agree that such a message should be printed when the
link goes down, not when the link comes up?

Perhaps the issue is that the 2 following cases need to be
handled differently:
A) operator sets link down on the command-line
B) asynchronous event makes link go down (peer is dead, cable is cut, etc)

In B) the PHY state machine keeps on running, and eventually
calls adjust_link()

In A) the driver calls phy_stop() and phy_disconnect() and
therefore adjust_link() will not be called?

Regards.
Mason Sept. 6, 2017, 3:51 p.m. | #16
On 31/08/2017 21:18, Florian Fainelli wrote:

> On 08/31/2017 12:09 PM, Mason wrote:
>
>> On 31/08/2017 19:03, Florian Fainelli wrote:
>>
>>> On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
>>>
>>>> On 31/08/2017 02:49, Florian Fainelli wrote:
>>>>
>>>>> The original motivation for this change originated from Marc Gonzalez
>>>>> indicating that his network driver did not have its adjust_link callback
>>>>> executing with phydev->link = 0 while he was expecting it.
>>>>
>>>> I expect the core to call phy_adjust_link() for link changes.
>>>> This used to work back in 3.4 and was broken somewhere along
>>>> the way.
>>>
>>> If that was working correctly in 3.4 surely we can look at the diff and
>>> figure out what changed, even maybe find the offending commit, can you
>>> do that?
>>
>> Bisecting would a be a huge pain because my platform was
>> not upstream until v4.4
> 
> Then just diff the file and try to pinpoint which commit may have
> changed that?

Running 'ip link set eth0 down' on the command-line.

In v3.4 => adjust_link() callback is called
In v4.5 => adjust_link() callback is NOT called

$ git log --oneline --no-merges v3.4..v4.5 drivers/net/phy/phy.c | wc -l
59

I'm not sure what "just diff the file" entails.
I can't move 3.4 up, nor move 4.5 down.
I'm not even sure the problem comes from drivers/net/phy/phy.c
to be honest.

Regards.
David Daney Sept. 6, 2017, 5:53 p.m. | #17
On 09/06/2017 07:33 AM, Mason wrote:
> On 31/08/2017 20:29, Florian Fainelli wrote:
>> On 08/31/2017 11:12 AM, Mason wrote:
>>> On 31/08/2017 19:53, Florian Fainelli wrote:
>>>> On 08/31/2017 10:49 AM, Mason wrote:
>>>>> On 31/08/2017 18:57, Florian Fainelli wrote:
>>>>>> And the race is between phy_detach() setting phydev->attached_dev 
>>>>>> = NULL
>>>>>> and phy_state_machine() running in PHY_HALTED state and calling
>>>>>> netif_carrier_off().
>>>>>
>>>>> I must be missing something.
>>>>> (Since a thread cannot race against itself.)
>>>>>
>>>>> phy_disconnect calls phy_stop_machine which
>>>>> 1) stops the work queue from running in a separate thread
>>>>> 2) calls phy_state_machine *synchronously*
>>>>>       which runs the PHY_HALTED case with everything well-defined
>>>>> end of phy_stop_machine
>>>>>
>>>>> phy_disconnect only then calls phy_detach()
>>>>> which makes future calls of phy_state_machine perilous.
>>>>>
>>>>> This all happens in the same thread, so I'm not yet
>>>>> seeing where the race happens?
>>>>
>>>> The race is as described in David's earlier email, so let's recap:
>>>>
>>>> Thread 1            Thread 2
>>>> phy_disconnect()
>>>> phy_stop_interrupts()
>>>> phy_stop_machine()
>>>> phy_state_machine()
>>>>   -> queue_delayed_work()
>>>> phy_detach()
>>>>                 phy_state_machine()
>>>>                 -> netif_carrier_off()
>>>>
>>>> If phy_detach() finishes earlier than the workqueue had a chance to be
>>>> scheduled and process PHY_HALTED again, then we trigger the NULL 
>>>> pointer
>>>> de-reference.
>>>>
>>>> workqueues are not tasklets, the CPU scheduling them gets no guarantee
>>>> they will run on the same CPU.
>>>
>>> Something does not add up.
>>>
>>> The synchronous call to phy_state_machine() does:
>>>
>>>     case PHY_HALTED:
>>>         if (phydev->link) {
>>>             phydev->link = 0;
>>>             netif_carrier_off(phydev->attached_dev);
>>>             phy_adjust_link(phydev);
>>>             do_suspend = true;
>>>         }
>>>
>>> then sets phydev->link = 0; therefore subsequent calls to
>>> phy_state_machin() will be no-op.
>>
>> Actually you are right, once phydev->link is set to 0 these would become
>> no-ops. Still scratching my head as to what happens for David then...
>>
>>>
>>> Also, queue_delayed_work() is only called in polling mode.
>>> David stated that he's using interrupt mode.
>>
>> Right that's confusing too now. David can you check if you tree has:
>>
>> 49d52e8108a21749dc2114b924c907db43358984 ("net: phy: handle state
>> correctly in phy_stop_machine")
> 
> Hello David,
> 
> A week ago, you wrote about my patch:
> "This is broken.  Please revert."
> 
> I assume you tested the revert locally, and that reverting did make
> the crash disappear. Is that correct?
> 

Yes, I always test things before making this type of assertion.


> The reason I ask is because the analysis you provided contains some
> flaws, as noted above. But, if reverting my patch did fix your issue,
> then perhaps understanding *why* is unimportant.

I didn't want to take the time to generate calling sequence traces to 
verify each step of my analysis, but I believe the overall concept is 
essentially correct.

Once the polling work is canceled and we set a bunch of essential 
pointers to NULL, you cannot go blindly restarting the polling.

> 
> I'm a bit baffled that it took less than 90 minutes for your request
> to be approved, and the patch reverted in all branches, before I even
> had a chance to comment.
> 

o The last chance for patches to v4.13 was fast approaching.

o There were multiple reports of failures caused by the patch.

o The patch was clearly stand-alone.

The kernel maintainers are a model of efficiency; there was no reason to 
delay.
David Daney Sept. 6, 2017, 6 p.m. | #18
On 08/31/2017 11:29 AM, Florian Fainelli wrote:
> On 08/31/2017 11:12 AM, Mason wrote:
>> On 31/08/2017 19:53, Florian Fainelli wrote:
>>> On 08/31/2017 10:49 AM, Mason wrote:
>>>> On 31/08/2017 18:57, Florian Fainelli wrote:
>>>>> And the race is between phy_detach() setting phydev->attached_dev = NULL
>>>>> and phy_state_machine() running in PHY_HALTED state and calling
>>>>> netif_carrier_off().
>>>>
>>>> I must be missing something.
>>>> (Since a thread cannot race against itself.)
>>>>
>>>> phy_disconnect calls phy_stop_machine which
>>>> 1) stops the work queue from running in a separate thread
>>>> 2) calls phy_state_machine *synchronously*
>>>>       which runs the PHY_HALTED case with everything well-defined
>>>> end of phy_stop_machine
>>>>
>>>> phy_disconnect only then calls phy_detach()
>>>> which makes future calls of phy_state_machine perilous.
>>>>
>>>> This all happens in the same thread, so I'm not yet
>>>> seeing where the race happens?
>>>
>>> The race is as described in David's earlier email, so let's recap:
>>>
>>> Thread 1			Thread 2
>>> phy_disconnect()
>>> phy_stop_interrupts()
>>> phy_stop_machine()
>>> phy_state_machine()
>>>   -> queue_delayed_work()
>>> phy_detach()
>>> 				phy_state_machine()
>>> 				-> netif_carrier_off()
>>>
>>> If phy_detach() finishes earlier than the workqueue had a chance to be
>>> scheduled and process PHY_HALTED again, then we trigger the NULL pointer
>>> de-reference.
>>>
>>> workqueues are not tasklets, the CPU scheduling them gets no guarantee
>>> they will run on the same CPU.
>>
>> Something does not add up.
>>
>> The synchronous call to phy_state_machine() does:
>>
>> 	case PHY_HALTED:
>> 		if (phydev->link) {
>> 			phydev->link = 0;
>> 			netif_carrier_off(phydev->attached_dev);
>> 			phy_adjust_link(phydev);
>> 			do_suspend = true;
>> 		}
>>
>> then sets phydev->link = 0; therefore subsequent calls to
>> phy_state_machin() will be no-op.
> 
> Actually you are right, once phydev->link is set to 0 these would become
> no-ops. Still scratching my head as to what happens for David then...
> 
>>
>> Also, queue_delayed_work() is only called in polling mode.
>> David stated that he's using interrupt mode.

Did you see what I wrote?

phy_disconnect() calls phy_stop_interrupts() which puts it into polling 
mode.  So the polling work gets queued unconditionally.



> 
> Right that's confusing too now. David can you check if you tree has:
> 
> 49d52e8108a21749dc2114b924c907db43358984 ("net: phy: handle state
> correctly in phy_stop_machine")
> 

Yes, I am using the 4.9 stable branch, and that commit was also present.

David.
Florian Fainelli Sept. 6, 2017, 6:59 p.m. | #19
On 09/06/2017 11:00 AM, David Daney wrote:
> On 08/31/2017 11:29 AM, Florian Fainelli wrote:
>> On 08/31/2017 11:12 AM, Mason wrote:
>>> On 31/08/2017 19:53, Florian Fainelli wrote:
>>>> On 08/31/2017 10:49 AM, Mason wrote:
>>>>> On 31/08/2017 18:57, Florian Fainelli wrote:
>>>>>> And the race is between phy_detach() setting phydev->attached_dev
>>>>>> = NULL
>>>>>> and phy_state_machine() running in PHY_HALTED state and calling
>>>>>> netif_carrier_off().
>>>>>
>>>>> I must be missing something.
>>>>> (Since a thread cannot race against itself.)
>>>>>
>>>>> phy_disconnect calls phy_stop_machine which
>>>>> 1) stops the work queue from running in a separate thread
>>>>> 2) calls phy_state_machine *synchronously*
>>>>>       which runs the PHY_HALTED case with everything well-defined
>>>>> end of phy_stop_machine
>>>>>
>>>>> phy_disconnect only then calls phy_detach()
>>>>> which makes future calls of phy_state_machine perilous.
>>>>>
>>>>> This all happens in the same thread, so I'm not yet
>>>>> seeing where the race happens?
>>>>
>>>> The race is as described in David's earlier email, so let's recap:
>>>>
>>>> Thread 1            Thread 2
>>>> phy_disconnect()
>>>> phy_stop_interrupts()
>>>> phy_stop_machine()
>>>> phy_state_machine()
>>>>   -> queue_delayed_work()
>>>> phy_detach()
>>>>                 phy_state_machine()
>>>>                 -> netif_carrier_off()
>>>>
>>>> If phy_detach() finishes earlier than the workqueue had a chance to be
>>>> scheduled and process PHY_HALTED again, then we trigger the NULL
>>>> pointer
>>>> de-reference.
>>>>
>>>> workqueues are not tasklets, the CPU scheduling them gets no guarantee
>>>> they will run on the same CPU.
>>>
>>> Something does not add up.
>>>
>>> The synchronous call to phy_state_machine() does:
>>>
>>>     case PHY_HALTED:
>>>         if (phydev->link) {
>>>             phydev->link = 0;
>>>             netif_carrier_off(phydev->attached_dev);
>>>             phy_adjust_link(phydev);
>>>             do_suspend = true;
>>>         }
>>>
>>> then sets phydev->link = 0; therefore subsequent calls to
>>> phy_state_machin() will be no-op.
>>
>> Actually you are right, once phydev->link is set to 0 these would become
>> no-ops. Still scratching my head as to what happens for David then...
>>
>>>
>>> Also, queue_delayed_work() is only called in polling mode.
>>> David stated that he's using interrupt mode.
> 
> Did you see what I wrote?

Still not following, see below.

> 
> phy_disconnect() calls phy_stop_interrupts() which puts it into polling
> mode.  So the polling work gets queued unconditionally.

What part of phy_stop_interrupts() is responsible for changing
phydev->irq to PHY_POLL? free_irq() cannot touch phydev->irq, otherwise
subsequent request_irq() calls won't work anymore.
phy_disable_interrupts() only calls back into the PHY driver to
acknowledge and clear interrupts.

If we were using a PHY with PHY_POLL, as Marc said, the first
synchronous call to phy_state_machine() would have acted on PHY_HALTED
and even if we incorrectly keep re-scheduling the state machine from
PHY_HALTED to PHY_HALTED the second time around nothing can happen.

What are we missing here?

> 
> 
> 
>>
>> Right that's confusing too now. David can you check if you tree has:
>>
>> 49d52e8108a21749dc2114b924c907db43358984 ("net: phy: handle state
>> correctly in phy_stop_machine")
>>
> 
> Yes, I am using the 4.9 stable branch, and that commit was also present.

Thanks for checking that.
Mason Sept. 6, 2017, 7:14 p.m. | #20
On 06/09/2017 20:00, David Daney wrote:
> On 08/31/2017 11:29 AM, Florian Fainelli wrote:
>> On 08/31/2017 11:12 AM, Mason wrote:
>>> On 31/08/2017 19:53, Florian Fainelli wrote:
>>>> On 08/31/2017 10:49 AM, Mason wrote:
>>>>> On 31/08/2017 18:57, Florian Fainelli wrote:
>>>>>> And the race is between phy_detach() setting phydev->attached_dev = NULL
>>>>>> and phy_state_machine() running in PHY_HALTED state and calling
>>>>>> netif_carrier_off().
>>>>>
>>>>> I must be missing something.
>>>>> (Since a thread cannot race against itself.)
>>>>>
>>>>> phy_disconnect calls phy_stop_machine which
>>>>> 1) stops the work queue from running in a separate thread
>>>>> 2) calls phy_state_machine *synchronously*
>>>>>       which runs the PHY_HALTED case with everything well-defined
>>>>> end of phy_stop_machine
>>>>>
>>>>> phy_disconnect only then calls phy_detach()
>>>>> which makes future calls of phy_state_machine perilous.
>>>>>
>>>>> This all happens in the same thread, so I'm not yet
>>>>> seeing where the race happens?
>>>>
>>>> The race is as described in David's earlier email, so let's recap:
>>>>
>>>> Thread 1			Thread 2
>>>> phy_disconnect()
>>>> phy_stop_interrupts()
>>>> phy_stop_machine()
>>>> phy_state_machine()
>>>>   -> queue_delayed_work()
>>>> phy_detach()
>>>> 				phy_state_machine()
>>>> 				-> netif_carrier_off()
>>>>
>>>> If phy_detach() finishes earlier than the workqueue had a chance to be
>>>> scheduled and process PHY_HALTED again, then we trigger the NULL pointer
>>>> de-reference.
>>>>
>>>> workqueues are not tasklets, the CPU scheduling them gets no guarantee
>>>> they will run on the same CPU.
>>>
>>> Something does not add up.
>>>
>>> The synchronous call to phy_state_machine() does:
>>>
>>> 	case PHY_HALTED:
>>> 		if (phydev->link) {
>>> 			phydev->link = 0;
>>> 			netif_carrier_off(phydev->attached_dev);
>>> 			phy_adjust_link(phydev);
>>> 			do_suspend = true;
>>> 		}
>>>
>>> then sets phydev->link = 0; therefore subsequent calls to
>>> phy_state_machin() will be no-op.
>>
>> Actually you are right, once phydev->link is set to 0 these would become
>> no-ops. Still scratching my head as to what happens for David then...
>>
>>>
>>> Also, queue_delayed_work() is only called in polling mode.
>>> David stated that he's using interrupt mode.
> 
> Did you see what I wrote?
> 
> phy_disconnect() calls phy_stop_interrupts() which puts it into polling 
> mode.  So the polling work gets queued unconditionally.

I did address that remark in
https://www.mail-archive.com/netdev@vger.kernel.org/msg186336.html

int phy_stop_interrupts(struct phy_device *phydev)
{
	int err = phy_disable_interrupts(phydev);

	if (err)
		phy_error(phydev);

	free_irq(phydev->irq, phydev);

	/* If work indeed has been cancelled, disable_irq() will have
	 * been left unbalanced from phy_interrupt() and enable_irq()
	 * has to be called so that other devices on the line work.
	 */
	while (atomic_dec_return(&phydev->irq_disable) >= 0)
		enable_irq(phydev->irq);

	return err;
}

Which part of this function changes phydev->irq to PHY_POLL?

Perhaps phydev->drv->config_intr?

What PHY are you using?

Regards.
Florian Fainelli Sept. 6, 2017, 7:28 p.m. | #21
On 09/06/2017 07:55 AM, Mason wrote:
> On 31/08/2017 21:18, Florian Fainelli wrote:
> 
>> On 08/31/2017 12:09 PM, Mason wrote:
>>
>>> 1) nb8800_link_reconfigure() calls phy_print_status()
>>> which prints the "Link down" and "Link up" messages
>>> to the console. With the patch reverted, nothing is
>>> printed when the link goes down, and the result is
>>> random when the link comes up. Sometimes, we get
>>> down + up, sometimes just up.
>>
>> Nothing printed when you bring down the network interface as a result of
>> not signaling the link down, there is a small nuance here.
> 
> Let me first focus on the "Link down" message.
> 
> Do you agree that such a message should be printed when the
> link goes down, not when the link comes up?

The question is not so much about printing the message as it is about a)
having the adjust_link callback called and b) having this particular
callback correctly determine whether a "change" has occurred, but let's
focus on the notification too. Printing the message is a consequence of
these two conditions, and that's what matters.

Unfortunately there is no hard and fast answer; it's clearly a
philosophical problem.

The link is not physically down, the cable is still plugged in, so
generating a link down event is not necessarily correct. It would be
convenient for network manager programs, just like it is for network
drivers, to treat it as such, because that allows them to act as if the
cable had been unplugged, which may be a good way to perform a number of
actions including but not limited to: entering a low power state, or
re-initializing parts of the Ethernet MAC that need it (DMA, etc.).
That does not appear to be an absolute requirement for most, if not all,
drivers, since the behavior changed after 3.4 and no one batted an eye
about it.

Upon bringing the interface back up again, same thing: if the cable was
not disconnected, should we just generate a link UP event, and if we do
that, are we going to confuse any network manager application?
Generating a DOWN -> UP link transition is certainly helpful for any
network application in that they do not need to keep any state, and it
clearly indicates a change was detected.

> 
> Perhaps the issue is that the 2 following cases need to be
> handled differently:
> A) operator sets link down on the command-line

This is already handled differently since when you administratively
bring down an interface you call ndo_stop() which will be doing a
phy_stop() + phy_disconnect() which result in stopping the PHY state
machine and disconnecting from the PHY.

> B) asynchronous event makes link go down (peer is dead, cable is cut, etc)
> 
> In B) the PHY state machine keeps on running, and eventually
> calls adjust_link()

Correct.

> 
> In A) the driver calls phy_stop() and phy_disconnect() and
> therefore adjust_link() will not be called?

That is the current behavior (after the revert), and we can always
change it if deemed necessary. The problem is, this broke for two people
(one case still being discussed as of right now), so at this point I am
very wary of making any changes without more testing. I really need to
get one of these PHY interrupts wired up on one of my boards, or create
a software model of such a configuration, before accepting new changes
in that area.

Thank you
Florian Fainelli Sept. 6, 2017, 7:42 p.m. | #22
On 09/06/2017 08:51 AM, Mason wrote:
> On 31/08/2017 21:18, Florian Fainelli wrote:
> 
>> On 08/31/2017 12:09 PM, Mason wrote:
>>
>>> On 31/08/2017 19:03, Florian Fainelli wrote:
>>>
>>>> On 08/31/2017 05:29 AM, Marc Gonzalez wrote:
>>>>
>>>>> On 31/08/2017 02:49, Florian Fainelli wrote:
>>>>>
>>>>>> The original motivation for this change originated from Marc Gonzalez
>>>>>> indicating that his network driver did not have its adjust_link
>>>>>> callback
>>>>>> executing with phydev->link = 0 while he was expecting it.
>>>>>
>>>>> I expect the core to call phy_adjust_link() for link changes.
>>>>> This used to work back in 3.4 and was broken somewhere along
>>>>> the way.
>>>>
>>>> If that was working correctly in 3.4 surely we can look at the diff and
>>>> figure out what changed, even maybe find the offending commit, can you
>>>> do that?
>>>
>>> Bisecting would a be a huge pain because my platform was
>>> not upstream until v4.4
>>
>> Then just diff the file and try to pinpoint which commit may have
>> changed that?
> 
> Running 'ip link set eth0 down' on the command-line.
> 
> In v3.4 => adjust_link() callback is called
> In v4.5 => adjust_link() callback is NOT called
> 
> $ git log --oneline --no-merges v3.4..v4.5 drivers/net/phy/phy.c | wc -l
> 59
> 
> I'm not sure what "just diff the file" entails.

git log -p --no-merges v3.4..v4.5 drivers/net/phy/{phy,phy_device}.c and
see what would seem remotely relevant to what you are observing.

> I can't move 3.4 up, nor move 4.5 down.

You can always copy the PHYLIB files at any given commit back into an
older tree, or vice versa, because PHYLIB is largely self-contained with
little to no dependency on other headers/files/facilities. This is not
convenient, I agree, but it's a poor man's way of determining what
changed within PHYLIB that results in what you are seeing.

AFAICT you could use QEMU with the versatile board, which has an
smsc911x Ethernet adapter that is PHYLIB compliant; that could be used
to pinpoint which commit started changing this behavior. It's long,
it's painful.

> I'm not even sure the problem comes from drivers/net/phy/phy.c
> to be honest.

If that's the case then I am not sure what else we can do.
David Daney Sept. 6, 2017, 8:49 p.m. | #23
On 09/06/2017 11:59 AM, Florian Fainelli wrote:
> On 09/06/2017 11:00 AM, David Daney wrote:
>> On 08/31/2017 11:29 AM, Florian Fainelli wrote:
>>> On 08/31/2017 11:12 AM, Mason wrote:
>>>> On 31/08/2017 19:53, Florian Fainelli wrote:
>>>>> On 08/31/2017 10:49 AM, Mason wrote:
>>>>>> On 31/08/2017 18:57, Florian Fainelli wrote:
>>>>>>> And the race is between phy_detach() setting phydev->attached_dev
>>>>>>> = NULL
>>>>>>> and phy_state_machine() running in PHY_HALTED state and calling
>>>>>>> netif_carrier_off().
>>>>>>
>>>>>> I must be missing something.
>>>>>> (Since a thread cannot race against itself.)
>>>>>>
>>>>>> phy_disconnect calls phy_stop_machine which
>>>>>> 1) stops the work queue from running in a separate thread
>>>>>> 2) calls phy_state_machine *synchronously*
>>>>>>        which runs the PHY_HALTED case with everything well-defined
>>>>>> end of phy_stop_machine
>>>>>>
>>>>>> phy_disconnect only then calls phy_detach()
>>>>>> which makes future calls of phy_state_machine perilous.
>>>>>>
>>>>>> This all happens in the same thread, so I'm not yet
>>>>>> seeing where the race happens?
>>>>>
>>>>> The race is as described in David's earlier email, so let's recap:
>>>>>
>>>>> Thread 1            Thread 2
>>>>> phy_disconnect()
>>>>> phy_stop_interrupts()
>>>>> phy_stop_machine()
>>>>> phy_state_machine()
>>>>>    -> queue_delayed_work()
>>>>> phy_detach()
>>>>>                  phy_state_machine()
>>>>>                  -> netif_carrier_off()
>>>>>
>>>>> If phy_detach() finishes earlier than the workqueue had a chance to be
>>>>> scheduled and process PHY_HALTED again, then we trigger the NULL
>>>>> pointer
>>>>> de-reference.
>>>>>
>>>>> workqueues are not tasklets, the CPU scheduling them gets no guarantee
>>>>> they will run on the same CPU.
>>>>
>>>> Something does not add up.
>>>>
>>>> The synchronous call to phy_state_machine() does:
>>>>
>>>>      case PHY_HALTED:
>>>>          if (phydev->link) {
>>>>              phydev->link = 0;
>>>>              netif_carrier_off(phydev->attached_dev);
>>>>              phy_adjust_link(phydev);
>>>>              do_suspend = true;
>>>>          }
>>>>
>>>> then sets phydev->link = 0; therefore subsequent calls to
>>>> phy_state_machin() will be no-op.
>>>
>>> Actually you are right, once phydev->link is set to 0 these would become
>>> no-ops. Still scratching my head as to what happens for David then...
>>>
>>>>
>>>> Also, queue_delayed_work() is only called in polling mode.
>>>> David stated that he's using interrupt mode.
>>
>> Did you see what I wrote?
> 
> Still not following, see below.
> 
>>
>> phy_disconnect() calls phy_stop_interrupts() which puts it into polling
>> mode.  So the polling work gets queued unconditionally.
> 
> What part of phy_stop_interrupts() is responsible for changing
> phydev->irq to PHY_POLL? free_irq() cannot touch phydev->irq otherwise
> subsequent request_irq() calls won't work anymore.
> phy_disable_interrupts() only calls back into the PHY driver to
> acknowledge and clear interrupts.
> 
> If we were using a PHY with PHY_POLL, as Marc said, the first
> synchronous call to phy_state_machine() would have acted on PHY_HALTED
> and even if we incorrectly keep re-scheduling the state machine from
> PHY_HALTED to PHY_HALTED the second time around nothing can happen.
> 
> What are we missing here?
> 

OK, I am now as confused as you guys are.  I will go back and get an 
ftrace log out of the failure.

David.
David Daney Sept. 6, 2017, 10:51 p.m. | #24
On 09/06/2017 01:49 PM, David Daney wrote:
> On 09/06/2017 11:59 AM, Florian Fainelli wrote:
>> On 09/06/2017 11:00 AM, David Daney wrote:
>>> On 08/31/2017 11:29 AM, Florian Fainelli wrote:
>>>> On 08/31/2017 11:12 AM, Mason wrote:
>>>>> On 31/08/2017 19:53, Florian Fainelli wrote:
>>>>>> On 08/31/2017 10:49 AM, Mason wrote:
>>>>>>> On 31/08/2017 18:57, Florian Fainelli wrote:
>>>>>>>> And the race is between phy_detach() setting phydev->attached_dev
>>>>>>>> = NULL
>>>>>>>> and phy_state_machine() running in PHY_HALTED state and calling
>>>>>>>> netif_carrier_off().
>>>>>>>
>>>>>>> I must be missing something.
>>>>>>> (Since a thread cannot race against itself.)
>>>>>>>
>>>>>>> phy_disconnect calls phy_stop_machine which
>>>>>>> 1) stops the work queue from running in a separate thread
>>>>>>> 2) calls phy_state_machine *synchronously*
>>>>>>>        which runs the PHY_HALTED case with everything well-defined
>>>>>>> end of phy_stop_machine
>>>>>>>
>>>>>>> phy_disconnect only then calls phy_detach()
>>>>>>> which makes future calls of phy_state_machine perilous.
>>>>>>>
>>>>>>> This all happens in the same thread, so I'm not yet
>>>>>>> seeing where the race happens?
>>>>>>
>>>>>> The race is as described in David's earlier email, so let's recap:
>>>>>>
>>>>>> Thread 1            Thread 2
>>>>>> phy_disconnect()
>>>>>> phy_stop_interrupts()
>>>>>> phy_stop_machine()
>>>>>> phy_state_machine()
>>>>>>    -> queue_delayed_work()
>>>>>> phy_detach()
>>>>>>                  phy_state_machine()
>>>>>>                  -> netif_carrier_off()
>>>>>>
>>>>>> If phy_detach() finishes earlier than the workqueue had a chance 
>>>>>> to be
>>>>>> scheduled and process PHY_HALTED again, then we trigger the NULL
>>>>>> pointer
>>>>>> de-reference.
>>>>>>
>>>>>> workqueues are not tasklets, the CPU scheduling them gets no 
>>>>>> guarantee
>>>>>> they will run on the same CPU.
>>>>>
>>>>> Something does not add up.
>>>>>
>>>>> The synchronous call to phy_state_machine() does:
>>>>>
>>>>>      case PHY_HALTED:
>>>>>          if (phydev->link) {
>>>>>              phydev->link = 0;
>>>>>              netif_carrier_off(phydev->attached_dev);
>>>>>              phy_adjust_link(phydev);
>>>>>              do_suspend = true;
>>>>>          }
>>>>>
>>>>> then sets phydev->link = 0; therefore subsequent calls to
>>>>> phy_state_machine() will be no-ops.
>>>>
>>>> Actually you are right, once phydev->link is set to 0 these would 
>>>> become
>>>> no-ops. Still scratching my head as to what happens for David then...
>>>>
>>>>>
>>>>> Also, queue_delayed_work() is only called in polling mode.
>>>>> David stated that he's using interrupt mode.
>>>
>>> Did you see what I wrote?
>>
>> Still not following, see below.
>>
>>>
>>> phy_disconnect() calls phy_stop_interrupts() which puts it into polling
>>> mode.  So the polling work gets queued unconditionally.
>>
>> What part of phy_stop_interrupts() is responsible for changing
>> phydev->irq to PHY_POLL? free_irq() cannot touch phydev->irq otherwise
>> subsequent request_irq() calls won't work anymore.
>> phy_disable_interrupts() only calls back into the PHY driver to
>> acknowledge and clear interrupts.
>>
>> If we were using a PHY with PHY_POLL, as Marc said, the first
>> synchronous call to phy_state_machine() would have acted on PHY_HALTED
>> and even if we incorrectly keep re-scheduling the state machine from
>> PHY_HALTED to PHY_HALTED the second time around nothing can happen.
>>
>> What are we missing here?
>>
> 
> OK, I am now as confused as you guys are.  I will go back and get an 
> ftrace log out of the failure.
> 
OK, let's forget about the PHY_HALTED discussion.


Consider instead the case of a Marvell phy with no interrupts connected 
on a v4.9.43 kernel, single CPU:


   0)               |                 phy_disconnect() {
   0)               |                   phy_stop_machine() {
   0)               |                     cancel_delayed_work_sync() {
   0) + 23.986 us   |                     } /* cancel_delayed_work_sync */
   0)               |                     phy_state_machine() {
   0)               |                       phy_start_aneg_priv() {
   0)               |                         marvell_config_aneg() {
   0) ! 240.538 us  |                         } /* marvell_config_aneg */
   0) ! 244.971 us  |                       } /* phy_start_aneg_priv */
   0)               |                       queue_delayed_work_on() {
   0) + 18.016 us   |                       } /* queue_delayed_work_on */
   0) ! 268.184 us  |                     } /* phy_state_machine */
   0) ! 297.394 us  |                   } /* phy_stop_machine */
   0)               |                   phy_detach() {
   0)               |                     phy_suspend() {
   0)               |                       phy_ethtool_get_wol() {
   0)   0.677 us    |                       } /* phy_ethtool_get_wol */
   0)               |                       genphy_suspend() {
   0) + 71.250 us   |                       } /* genphy_suspend */
   0) + 74.197 us   |                     } /* phy_suspend */
   0) + 80.302 us   |                   } /* phy_detach */
   0) ! 380.072 us  |                 } /* phy_disconnect */
.
.
.
   0)               |  process_one_work() {
   0)               |    find_worker_executing_work() {
   0)   0.688 us    |    } /* find_worker_executing_work */
   0)               |    set_work_pool_and_clear_pending() {
   0)   0.734 us    |    } /* set_work_pool_and_clear_pending */
   0)               |    phy_state_machine() {
   0)               |      genphy_read_status() {
   0) ! 205.721 us  |      } /* genphy_read_status */
   0)               |      netif_carrier_off() {
   0)               |        do_page_fault() {


The do_page_fault() at the end indicates the NULL pointer dereference.

That added call to phy_state_machine() turns the polling back on 
unconditionally for a phy that should be disconnected.  How is that correct?

David.
Florian Fainelli Sept. 6, 2017, 11:14 p.m. | #25
On 09/06/2017 03:51 PM, David Daney wrote:
> On 09/06/2017 01:49 PM, David Daney wrote:
>> On 09/06/2017 11:59 AM, Florian Fainelli wrote:
>>> On 09/06/2017 11:00 AM, David Daney wrote:
>>>> On 08/31/2017 11:29 AM, Florian Fainelli wrote:
>>>>> On 08/31/2017 11:12 AM, Mason wrote:
>>>>>> On 31/08/2017 19:53, Florian Fainelli wrote:
>>>>>>> On 08/31/2017 10:49 AM, Mason wrote:
>>>>>>>> On 31/08/2017 18:57, Florian Fainelli wrote:
>>>>>>>>> And the race is between phy_detach() setting phydev->attached_dev
>>>>>>>>> = NULL
>>>>>>>>> and phy_state_machine() running in PHY_HALTED state and calling
>>>>>>>>> netif_carrier_off().
>>>>>>>>
>>>>>>>> I must be missing something.
>>>>>>>> (Since a thread cannot race against itself.)
>>>>>>>>
>>>>>>>> phy_disconnect calls phy_stop_machine which
>>>>>>>> 1) stops the work queue from running in a separate thread
>>>>>>>> 2) calls phy_state_machine *synchronously*
>>>>>>>>        which runs the PHY_HALTED case with everything well-defined
>>>>>>>> end of phy_stop_machine
>>>>>>>>
>>>>>>>> phy_disconnect only then calls phy_detach()
>>>>>>>> which makes future calls of phy_state_machine perilous.
>>>>>>>>
>>>>>>>> This all happens in the same thread, so I'm not yet
>>>>>>>> seeing where the race happens?
>>>>>>>
>>>>>>> The race is as described in David's earlier email, so let's recap:
>>>>>>>
>>>>>>> Thread 1            Thread 2
>>>>>>> phy_disconnect()
>>>>>>> phy_stop_interrupts()
>>>>>>> phy_stop_machine()
>>>>>>> phy_state_machine()
>>>>>>>    -> queue_delayed_work()
>>>>>>> phy_detach()
>>>>>>>                  phy_state_machine()
>>>>>>>                  -> netif_carrier_off()
>>>>>>>
>>>>>>> If phy_detach() finishes earlier than the workqueue had a chance
>>>>>>> to be
>>>>>>> scheduled and process PHY_HALTED again, then we trigger the NULL
>>>>>>> pointer
>>>>>>> de-reference.
>>>>>>>
>>>>>>> workqueues are not tasklets, the CPU scheduling them gets no
>>>>>>> guarantee
>>>>>>> they will run on the same CPU.
>>>>>>
>>>>>> Something does not add up.
>>>>>>
>>>>>> The synchronous call to phy_state_machine() does:
>>>>>>
>>>>>>      case PHY_HALTED:
>>>>>>          if (phydev->link) {
>>>>>>              phydev->link = 0;
>>>>>>              netif_carrier_off(phydev->attached_dev);
>>>>>>              phy_adjust_link(phydev);
>>>>>>              do_suspend = true;
>>>>>>          }
>>>>>>
>>>>>> then sets phydev->link = 0; therefore subsequent calls to
>>>>>> phy_state_machine() will be no-ops.
>>>>>
>>>>> Actually you are right, once phydev->link is set to 0 these would
>>>>> become
>>>>> no-ops. Still scratching my head as to what happens for David then...
>>>>>
>>>>>>
>>>>>> Also, queue_delayed_work() is only called in polling mode.
>>>>>> David stated that he's using interrupt mode.
>>>>
>>>> Did you see what I wrote?
>>>
>>> Still not following, see below.
>>>
>>>>
>>>> phy_disconnect() calls phy_stop_interrupts() which puts it into polling
>>>> mode.  So the polling work gets queued unconditionally.
>>>
>>> What part of phy_stop_interrupts() is responsible for changing
>>> phydev->irq to PHY_POLL? free_irq() cannot touch phydev->irq otherwise
>>> subsequent request_irq() calls won't work anymore.
>>> phy_disable_interrupts() only calls back into the PHY driver to
>>> acknowledge and clear interrupts.
>>>
>>> If we were using a PHY with PHY_POLL, as Marc said, the first
>>> synchronous call to phy_state_machine() would have acted on PHY_HALTED
>>> and even if we incorrectly keep re-scheduling the state machine from
>>> PHY_HALTED to PHY_HALTED the second time around nothing can happen.
>>>
>>> What are we missing here?
>>>
>>
>> OK, I am now as confused as you guys are.  I will go back and get an
>> ftrace log out of the failure.
>>
> OK, let's forget about the PHY_HALTED discussion.
> 
> 
> Consider instead the case of a Marvell phy with no interrupts connected
> on a v4.9.43 kernel, single CPU:
> 
> 
>   0)               |                 phy_disconnect() {
>   0)               |                   phy_stop_machine() {
>   0)               |                     cancel_delayed_work_sync() {
>   0) + 23.986 us   |                     } /* cancel_delayed_work_sync */
>   0)               |                     phy_state_machine() {
>   0)               |                       phy_start_aneg_priv() {

Thanks for providing the trace, I think I have an idea of what's going
on, see below.

>   0)               |                         marvell_config_aneg() {
>   0) ! 240.538 us  |                         } /* marvell_config_aneg */
>   0) ! 244.971 us  |                       } /* phy_start_aneg_priv */
>   0)               |                       queue_delayed_work_on() {
>   0) + 18.016 us   |                       } /* queue_delayed_work_on */
>   0) ! 268.184 us  |                     } /* phy_state_machine */
>   0) ! 297.394 us  |                   } /* phy_stop_machine */
>   0)               |                   phy_detach() {
>   0)               |                     phy_suspend() {
>   0)               |                       phy_ethtool_get_wol() {
>   0)   0.677 us    |                       } /* phy_ethtool_get_wol */
>   0)               |                       genphy_suspend() {
>   0) + 71.250 us   |                       } /* genphy_suspend */
>   0) + 74.197 us   |                     } /* phy_suspend */
>   0) + 80.302 us   |                   } /* phy_detach */
>   0) ! 380.072 us  |                 } /* phy_disconnect */
> .
> .
> .
>   0)               |  process_one_work() {
>   0)               |    find_worker_executing_work() {
>   0)   0.688 us    |    } /* find_worker_executing_work */
>   0)               |    set_work_pool_and_clear_pending() {
>   0)   0.734 us    |    } /* set_work_pool_and_clear_pending */
>   0)               |    phy_state_machine() {
>   0)               |      genphy_read_status() {
>   0) ! 205.721 us  |      } /* genphy_read_status */
>   0)               |      netif_carrier_off() {
>   0)               |        do_page_fault() {
> 
> 
> The do_page_fault() at the end indicates the NULL pointer dereference.
> 
> That added call to phy_state_machine() turns the polling back on
> unconditionally for a phy that should be disconnected.  How is that
> correct?

It is not fundamentally correct, and I don't think there was any
objection to that to begin with. In fact there is a bug/inefficiency
here: if we have entered the PHY state machine in PHY_HALTED we should
not re-schedule it, period. This only applies to PHY_POLL cases *and*
drivers that properly call phy_stop() followed by phy_disconnect().

What I now think is happening in your case is the following:

phy_stop() was not called, so nothing sets phydev->state to PHY_HALTED
in the first place, and we have:

phy_disconnect()
-> phy_stop_machine()
	-> cancel_delayed_work_sync() OK
		phydev->state is probably RUNNING so we have:
		-> phydev->state = PHY_UP
	phy_state_machine() is called synchronously
	-> PHY_UP -> needs_aneg = true
	-> phy_restart_aneg()
	-> queue_delayed_work()
-> phydev->adjust_link = NULL
-> phy_detach() -> boom

Can you confirm whether the driver you are using calls phy_stop()
prior to phy_disconnect()? If it does, then this whole theory falls
apart; if not, then this needs fixing in both the driver and PHYLIB.

Thanks
David Daney Sept. 7, 2017, 12:10 a.m. | #26
On 09/06/2017 04:14 PM, Florian Fainelli wrote:
> On 09/06/2017 03:51 PM, David Daney wrote:
[...]
>>
>> Consider instead the case of a Marvell phy with no interrupts connected
>> on a v4.9.43 kernel, single CPU:
>>
>>
>>    0)               |                 phy_disconnect() {
>>    0)               |                   phy_stop_machine() {
>>    0)               |                     cancel_delayed_work_sync() {
>>    0) + 23.986 us   |                     } /* cancel_delayed_work_sync */
>>    0)               |                     phy_state_machine() {
>>    0)               |                       phy_start_aneg_priv() {
> 
> Thanks for providing the trace, I think I have an idea of what's going
> on, see below.
> 
>>    0)               |                         marvell_config_aneg() {
>>    0) ! 240.538 us  |                         } /* marvell_config_aneg */
>>    0) ! 244.971 us  |                       } /* phy_start_aneg_priv */
>>    0)               |                       queue_delayed_work_on() {
>>    0) + 18.016 us   |                       } /* queue_delayed_work_on */
>>    0) ! 268.184 us  |                     } /* phy_state_machine */
>>    0) ! 297.394 us  |                   } /* phy_stop_machine */
>>    0)               |                   phy_detach() {
>>    0)               |                     phy_suspend() {
>>    0)               |                       phy_ethtool_get_wol() {
>>    0)   0.677 us    |                       } /* phy_ethtool_get_wol */
>>    0)               |                       genphy_suspend() {
>>    0) + 71.250 us   |                       } /* genphy_suspend */
>>    0) + 74.197 us   |                     } /* phy_suspend */
>>    0) + 80.302 us   |                   } /* phy_detach */
>>    0) ! 380.072 us  |                 } /* phy_disconnect */
>> .
>> .
>> .
>>    0)               |  process_one_work() {
>>    0)               |    find_worker_executing_work() {
>>    0)   0.688 us    |    } /* find_worker_executing_work */
>>    0)               |    set_work_pool_and_clear_pending() {
>>    0)   0.734 us    |    } /* set_work_pool_and_clear_pending */
>>    0)               |    phy_state_machine() {
>>    0)               |      genphy_read_status() {
>>    0) ! 205.721 us  |      } /* genphy_read_status */
>>    0)               |      netif_carrier_off() {
>>    0)               |        do_page_fault() {
>>
>>
>> The do_page_fault() at the end indicates the NULL pointer dereference.
>>
>> That added call to phy_state_machine() turns the polling back on
>> unconditionally for a phy that should be disconnected.  How is that
>> correct?
> 
> It is not fundamentally correct and I don't think there was any
> objection to that to begin with. In fact there is a bug/inefficiency
> here in that if we have entered the PHY state machine with PHY_HALTED we
> should not re-schedule it period, only applicable to PHY_POLL cases
> *and* properly calling phy_stop() followed by phy_disconnect().
> 
> What I now think is happening in your case is the following:
> 
> phy_stop() was not called, so nothing does set phydev->state to
> PHY_HALTED in the first place so we have:
> 
> phy_disconnect()
> -> phy_stop_machine()
> 	-> cancel_delayed_work_sync() OK
> 		phydev->state is probably RUNNING so we have:
> 		-> phydev->state = PHY_UP
> 	phy_state_machine() is called synchronously
> 	-> PHY_UP -> needs_aneg = true
> 	-> phy_restart_aneg()
> 	-> queue_delayed_work()
> -> phydev->adjust_link = NULL
> -> phy_detach() -> boom
> 
> Can you confirm whether the driver you are using does call phy_stop()
> prior to phy_disconnect()? 

There is no call to phy_stop().

I can add this to the ethernet drivers, but I wonder if it should be
called by the core code when doing phy_disconnect(), if it was not
already stopped.

> If that is the case then this whole theory
> falls apart, if not, then this needs fixing in both the driver and PHYLIB.
> 
> Thanks
>
Florian Fainelli Sept. 7, 2017, 1:41 a.m. | #27
On 09/06/2017 05:10 PM, David Daney wrote:
> On 09/06/2017 04:14 PM, Florian Fainelli wrote:
>> On 09/06/2017 03:51 PM, David Daney wrote:
> [...]
>>>
>>> Consider instead the case of a Marvell phy with no interrupts connected
>>> on a v4.9.43 kernel, single CPU:
>>>
>>>
>>>    0)               |                 phy_disconnect() {
>>>    0)               |                   phy_stop_machine() {
>>>    0)               |                     cancel_delayed_work_sync() {
>>>    0) + 23.986 us   |                     } /*
>>> cancel_delayed_work_sync */
>>>    0)               |                     phy_state_machine() {
>>>    0)               |                       phy_start_aneg_priv() {
>>
>> Thanks for providing the trace, I think I have an idea of what's going
>> on, see below.
>>
>>>    0)               |                         marvell_config_aneg() {
>>>    0) ! 240.538 us  |                         } /*
>>> marvell_config_aneg */
>>>    0) ! 244.971 us  |                       } /* phy_start_aneg_priv */
>>>    0)               |                       queue_delayed_work_on() {
>>>    0) + 18.016 us   |                       } /*
>>> queue_delayed_work_on */
>>>    0) ! 268.184 us  |                     } /* phy_state_machine */
>>>    0) ! 297.394 us  |                   } /* phy_stop_machine */
>>>    0)               |                   phy_detach() {
>>>    0)               |                     phy_suspend() {
>>>    0)               |                       phy_ethtool_get_wol() {
>>>    0)   0.677 us    |                       } /* phy_ethtool_get_wol */
>>>    0)               |                       genphy_suspend() {
>>>    0) + 71.250 us   |                       } /* genphy_suspend */
>>>    0) + 74.197 us   |                     } /* phy_suspend */
>>>    0) + 80.302 us   |                   } /* phy_detach */
>>>    0) ! 380.072 us  |                 } /* phy_disconnect */
>>> .
>>> .
>>> .
>>>    0)               |  process_one_work() {
>>>    0)               |    find_worker_executing_work() {
>>>    0)   0.688 us    |    } /* find_worker_executing_work */
>>>    0)               |    set_work_pool_and_clear_pending() {
>>>    0)   0.734 us    |    } /* set_work_pool_and_clear_pending */
>>>    0)               |    phy_state_machine() {
>>>    0)               |      genphy_read_status() {
>>>    0) ! 205.721 us  |      } /* genphy_read_status */
>>>    0)               |      netif_carrier_off() {
>>>    0)               |        do_page_fault() {
>>>
>>>
>>> The do_page_fault() at the end indicates the NULL pointer dereference.
>>>
>>> That added call to phy_state_machine() turns the polling back on
>>> unconditionally for a phy that should be disconnected.  How is that
>>> correct?
>>
>> It is not fundamentally correct and I don't think there was any
>> objection to that to begin with. In fact there is a bug/inefficiency
>> here in that if we have entered the PHY state machine with PHY_HALTED we
>> should not re-schedule it period, only applicable to PHY_POLL cases
>> *and* properly calling phy_stop() followed by phy_disconnect().
>>
>> What I now think is happening in your case is the following:
>>
>> phy_stop() was not called, so nothing does set phydev->state to
>> PHY_HALTED in the first place so we have:
>>
>> phy_disconnect()
>> -> phy_stop_machine()
>>     -> cancel_delayed_work_sync() OK
>>         phydev->state is probably RUNNING so we have:
>>         -> phydev->state = PHY_UP
>>     phy_state_machine() is called synchronously
>>     -> PHY_UP -> needs_aneg = true
>>     -> phy_restart_aneg()
>>     -> queue_delayed_work()
>> -> phydev->adjust_link = NULL
>> -> phy_detach() -> boom
>>
>> Can you confirm whether the driver you are using does call phy_stop()
>> prior to phy_disconnect()? 
> 
> There is no call to phy_stop().

OK this all makes sense now.

> 
> I can add this to the ethernet drivers, but I wonder if it should be
> called by the core code when doing phy_disconnect(), if it was not
> already stopped.

Fixing the driver should be reasonably quick and easy and can be done
independently from fixing PHYLIB, but I agree that PHYLIB should be
safeguarded against such a case.

Of course, now that I have looked at the code again, there is really a
ton of unnecessary workqueue scheduling going on. Similarly to
phy_stop() making us go from PHY_HALTED to PHY_HALTED,
phy_start_machine() does the same thing with PHY_READY -> PHY_READY. I
suppose that back when this was done, the assumption was that not much
time would be spent between a call to phy_connect()/phy_start_machine()
and phy_start(), or respectively between phy_stop() and
phy_disconnect(). Oh well.

Now that the revert is in 4.13 we can work on a solution that satisfies
everybody on this thread.

Thanks!

Patch

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 5068c582d502..d0626bf5c540 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -749,9 +749,6 @@  void phy_stop_machine(struct phy_device *phydev)
 	if (phydev->state > PHY_UP && phydev->state != PHY_HALTED)
 		phydev->state = PHY_UP;
 	mutex_unlock(&phydev->lock);
-
-	/* Now we can run the state machine synchronously */
-	phy_state_machine(&phydev->state_queue.work);
 }
 
 /**