diff mbox

[regression] net: phy: fix initialization (config_init) for Marvel 88E1116R PHYs

Message ID 1396396559-6971-1-git-send-email-holler@ahsoftware.de
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Alexander Holler April 1, 2014, 11:55 p.m. UTC
Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
changed the initialization of the mv643xx_eth driver to use phy_init_hw()
to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
was broken such, that it used mdelay() instead of really waiting for a
reset to finish.

The effect was that the ethernet on my Kirkwood 88F6281 based device didn't
come up anymore (no carrier).

Fix this by waiting for a reset to finish before proceeding further.

Signed-off-by: Alexander Holler <holler@ahsoftware.de>

Cc: Michal Simek <michal.simek@xilinx.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: <stable@vger.kernel.org>
---
 drivers/net/phy/marvell.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Florian Fainelli April 2, 2014, 12:57 a.m. UTC | #1
2014-04-01 16:55 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
> was broken such, that it used mdelay() instead of really waiting for a
> reset to finish.

So the only big difference before
7cd1463664c2a15721ff4ccfb61d4d970815cb3d ("net: mv643xx_eth: use
phy_init_hw to reset PHY") is that we waited potentially forever for
the BMCR_RESET bit to get cleared, while now, we only wait for up to
500 milliseconds.

Could you verify the following two things before your patch gets merged:

- how long does it take for your PHY to clear the BMCR_RESET bit, is
it more than the allowed time out in
drivers/net/phy/phy_device.c::phy_poll_reset
- is your PHY powered down (check BMCR_PWRDOWN), if that is the case,
we might be hitting some corner case where toggling BMCR_RESET will
power it on, but at the expense of waiting longer

Thanks

>
> The effect was that the ethernet on my Kirkwood 88F6281 based device didn't
> come up anymore (no carrier).
>
> Fix this by waiting for a reset to finish before proceeding further.
>
> Signed-off-by: Alexander Holler <holler@ahsoftware.de>
>
> Cc: Michal Simek <michal.simek@xilinx.com>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: <stable@vger.kernel.org>
> ---
>  drivers/net/phy/marvell.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
> index bd37e45..5b84808 100644
> --- a/drivers/net/phy/marvell.c
> +++ b/drivers/net/phy/marvell.c
> @@ -396,7 +396,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
>         if (err < 0)
>                 return err;
>
> -       mdelay(500);
> +       do
> +               temp = phy_read(phydev, MII_BMCR);
> +       while (temp & BMCR_RESET);
>
>         err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0);
>         if (err < 0)
> @@ -429,7 +431,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
>         if (err < 0)
>                 return err;
>
> -       mdelay(500);
> +       do
> +               temp = phy_read(phydev, MII_BMCR);
> +       while (temp & BMCR_RESET);
>
>         return 0;
>  }
> --
> 1.8.3.1
>
Alexander Holler April 2, 2014, 9:09 a.m. UTC | #2
Am 02.04.2014 02:57, schrieb Florian Fainelli:
> 2014-04-01 16:55 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>> was broken such, that it used mdelay() instead of really waiting for a
>> reset to finish.
>
> So the only big difference before
> 7cd1463664c2a15721ff4ccfb61d4d970815cb3d ("net: mv643xx_eth: use
> phy_init_hw to reset PHY") is that we waited potentially forever for
> the BMCR_RESET bit to get cleared, while now, we only wait for up to
> 500 milliseconds.
>
> Could you verify the following two things before your patch gets merged:
>
> - how long does it take for your PHY to clear the BMCR_RESET bit, is
> it more than the allowed time out in
> drivers/net/phy/phy_device.c::phy_poll_reset
> - is your PHY powered down (check BMCR_PWRDOWN), if that is the case,
> we might be hitting some corner case where toggling BMCR_RESET will
> power it on, but at the expense of waiting longer

I've  done two tests with pr_info before and after the two resets in 
m88e1116r_config_init:

with my patch:
-----------------------
dmesg | grep -A1 -B1  AHO
[    1.090099] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version 1.4
[    1.175916] AHO: before first reset
[    1.179613] AHO: after first reset
[    1.183743] AHO: before second reset
[    1.187530] AHO: after second reset
[    1.191905] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC 
address xx
--
[    1.426487] netpoll: netconsole: device eth0 not up yet, forcing it
[    1.505986] AHO: before first reset
[    1.509725] AHO: after first reset
[    1.513899] AHO: before second reset
[    1.517754] AHO: after second reset
[    1.521802] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
--
[   21.372591] Adding 2996116k swap on /dev/sda3.  Priority:-1 extents:1 
across:2996116k
[   28.305989] AHO: before first reset
[   28.306200] AHO: after first reset
[   28.306936] AHO: before second reset
[   28.307153] AHO: after second reset
[   31.509973] mv643xx_eth_port mv643xx_eth_port.0 eth0: link up, 1000 
Mb/s, full duplex, flow control disabled
-----------------------


with mdelay (the value after reset is what contains MII_BMCR):

-----------------------
dmesg | grep -A1 -B1  AHO
[    1.090072] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version 1.4
[    1.175888] AHO: before first reset
[    1.678806] AHO: after first reset 0x0
[    1.683281] AHO: before second reset
[    2.186288] AHO: after second reset 0x0
[    2.191010] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC 
address xx
--
[    2.426349] netpoll: netconsole: device eth0 not up yet, forcing it
[    2.505917] AHO: before first reset
[    2.605824] usb 1-1: new high-speed USB device number 2 using orion-ehci
--
[    2.829987] hub 1-1:1.0: 4 ports detected
[    3.044502] AHO: after first reset 0x0
[    3.049133] AHO: before second reset
[    3.126110] usb 1-1.1: new high-speed USB device number 3 using 
orion-ehci
--
[    3.526107] usb 1-1.3: device descriptor read/64, error -32
[    3.614264] AHO: after second reset 0x0
[    3.618708] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
--
[   21.335730] Adding 2996116k swap on /dev/sda3.  Priority:-1 extents:1 
across:2996116k
[   28.195942] AHO: before first reset
[   28.696270] AHO: after first reset 0x800
[   28.696958] AHO: before second reset
[   29.197354] AHO: after second reset 0x800
[  111.612263] RPC: Registered named UNIX socket transport module.
-----------------------

So we see, the first reset in the last call of m88e1116r_config_init() 
fails to complete in 500ms and the phy seems to be stuck afterwards, 
but, for whatever reason, it doesn't need 500ms if mdelay isn't used (if 
we can trust the timestamps).

(You can also see, I have netconsole enabled using netconsole=... in the 
kernel cmdline).

That behaviour is reproducible. The first reset in the last call to
m88e1116r_config_init() always fails if mdelay is used.

Regards,

Alexander Holler
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Holler April 2, 2014, 10:54 a.m. UTC | #3
Am 02.04.2014 11:09, schrieb Alexander Holler:
> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>> changed the initialization of the mv643xx_eth driver to use
>>> phy_init_hw()
>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>> was broken such, that it used mdelay() instead of really waiting for a
>>> reset to finish.
>>
>> So the only big difference before
>> 7cd1463664c2a15721ff4ccfb61d4d970815cb3d ("net: mv643xx_eth: use
>> phy_init_hw to reset PHY") is that we waited potentially forever for
>> the BMCR_RESET bit to get cleared, while now, we only wait for up to
>> 500 milliseconds.
>>
>> Could you verify the following two things before your patch gets merged:
>>
>> - how long does it take for your PHY to clear the BMCR_RESET bit, is
>> it more than the allowed time out in
>> drivers/net/phy/phy_device.c::phy_poll_reset
>> - is your PHY powered down (check BMCR_PWRDOWN), if that is the case,
>> we might be hitting some corner case where toggling BMCR_RESET will
>> power it on, but at the expense of waiting longer
>
> I've  done two tests with pr_info before and after the two resets in
> m88e1116r_config_init:
>
> with my patch:
> -----------------------
> dmesg | grep -A1 -B1  AHO
> [    1.090099] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version
> 1.4
> [    1.175916] AHO: before first reset
> [    1.179613] AHO: after first reset
> [    1.183743] AHO: before second reset
> [    1.187530] AHO: after second reset
> [    1.191905] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
> address xx
> --
> [    1.426487] netpoll: netconsole: device eth0 not up yet, forcing it
> [    1.505986] AHO: before first reset
> [    1.509725] AHO: after first reset
> [    1.513899] AHO: before second reset
> [    1.517754] AHO: after second reset
> [    1.521802] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> --
> [   21.372591] Adding 2996116k swap on /dev/sda3.  Priority:-1 extents:1
> across:2996116k
> [   28.305989] AHO: before first reset
> [   28.306200] AHO: after first reset
> [   28.306936] AHO: before second reset
> [   28.307153] AHO: after second reset
> [   31.509973] mv643xx_eth_port mv643xx_eth_port.0 eth0: link up, 1000
> Mb/s, full duplex, flow control disabled
> -----------------------
>
>
> with mdelay (the value after reset is what contains MII_BMCR):
>
> -----------------------
> dmesg | grep -A1 -B1  AHO
> [    1.090072] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version
> 1.4
> [    1.175888] AHO: before first reset
> [    1.678806] AHO: after first reset 0x0
> [    1.683281] AHO: before second reset
> [    2.186288] AHO: after second reset 0x0
> [    2.191010] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
> address xx
> --
> [    2.426349] netpoll: netconsole: device eth0 not up yet, forcing it
> [    2.505917] AHO: before first reset
> [    2.605824] usb 1-1: new high-speed USB device number 2 using orion-ehci
> --
> [    2.829987] hub 1-1:1.0: 4 ports detected
> [    3.044502] AHO: after first reset 0x0
> [    3.049133] AHO: before second reset
> [    3.126110] usb 1-1.1: new high-speed USB device number 3 using
> orion-ehci
> --
> [    3.526107] usb 1-1.3: device descriptor read/64, error -32
> [    3.614264] AHO: after second reset 0x0
> [    3.618708] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> --
> [   21.335730] Adding 2996116k swap on /dev/sda3.  Priority:-1 extents:1
> across:2996116k
> [   28.195942] AHO: before first reset
> [   28.696270] AHO: after first reset 0x800
> [   28.696958] AHO: before second reset
> [   29.197354] AHO: after second reset 0x800
> [  111.612263] RPC: Registered named UNIX socket transport module.
> -----------------------
>
> So we see, the first reset in the last call of m88e1116r_config_init()
> fails to complete in 500ms and the phy seems to be stuck afterwards,
> but, for whatever reason, it doesn't need 500ms if mdelay isn't used (if
> we can trust the timestamps).
>
> (You can also see, I have netconsole enabled using netconsole=... in the
> kernel cmdline).
>
> That behaviour is reproducible. The first reset in the last call to
> m88e1116r_config_init() always fails if mdelay is used.

I've forgotten to add that MII_BMCR is always zero when 
m88e1116r_config_init() is called.

So to conclude, I've no idea why using mdelay seems to break the PHY. It 
looks like the forced delay of together one second in 
m88e1116r_config_init() somehow breaks something, but that's pure 
speculation. And as my patch fixes things here, I've no reason to dig 
further.

Regards,

Alexander Holler
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sergei Shtylyov April 2, 2014, 11:51 a.m. UTC | #4
Hello.

On 02-04-2014 3:55, Alexander Holler wrote:

> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
> was broken such, that it used mdelay() instead of really waiting for a
> reset to finish.

> The effect was that the ethernet on my Kirkwood 88F6281 based device didn't
> come up anymore (no carrier).

> Fix this by waiting for a reset to finish before proceeding further.

> Signed-off-by: Alexander Holler <holler@ahsoftware.de>

> Cc: Michal Simek <michal.simek@xilinx.com>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: <stable@vger.kernel.org>
> ---
>   drivers/net/phy/marvell.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
> index bd37e45..5b84808 100644
> --- a/drivers/net/phy/marvell.c
> +++ b/drivers/net/phy/marvell.c
> @@ -396,7 +396,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
>   	if (err < 0)
>   		return err;
>
> -	mdelay(500);
> +	do
> +		temp = phy_read(phydev, MII_BMCR);
> +	while (temp & BMCR_RESET);

    I don't understand what's the point of this BMCR reset if phy_init_hw() 
will have already done it before calling phy_init_hw().

> @@ -429,7 +431,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
>   	if (err < 0)
>   		return err;
>
> -	mdelay(500);
> +	do
> +		temp = phy_read(phydev, MII_BMCR);
> +	while (temp & BMCR_RESET);

    Not clear why it's necessary to reset one more time... Also, tight loop 
without a timeout (0.5 sec, specified by IEEE 802.3) doesn't look good. The 
comment above phy_poll_reset() suggests that it could be used in the PHY 
drivers as well. Florian?

WBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sergei Shtylyov April 2, 2014, 12:07 p.m. UTC | #5
On 02-04-2014 15:51, Sergei Shtylyov wrote:

>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>> was broken such, that it used mdelay() instead of really waiting for a
>> reset to finish.

>> The effect was that the ethernet on my Kirkwood 88F6281 based device didn't
>> come up anymore (no carrier).

>> Fix this by waiting for a reset to finish before proceeding further.

>> Signed-off-by: Alexander Holler <holler@ahsoftware.de>

>> Cc: Michal Simek <michal.simek@xilinx.com>
>> Cc: Florian Fainelli <f.fainelli@gmail.com>
>> Cc: <stable@vger.kernel.org>
>> ---
>>   drivers/net/phy/marvell.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)

>> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
>> index bd37e45..5b84808 100644
>> --- a/drivers/net/phy/marvell.c
>> +++ b/drivers/net/phy/marvell.c
[...]
>> @@ -429,7 +431,9 @@ static int m88e1116r_config_init(struct phy_device *phydev)
>>       if (err < 0)
>>           return err;
>>
>> -    mdelay(500);
>> +    do
>> +        temp = phy_read(phydev, MII_BMCR);
>> +    while (temp & BMCR_RESET);

>     Not clear why it's necessary to reset one more time... Also, tight loop
> without a timeout (0.5 sec, specified by IEEE 802.3) doesn't look good. The
> comment above phy_poll_reset() suggests that it could be used in the PHY
> drivers as well. Florian?

    Ah, I was looking at Linus' tree, not net-next.git, and hadn't read 
Florian's replies before commenting on this...

WBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Holler April 2, 2014, 2:35 p.m. UTC | #6
Am 02.04.2014 14:07, schrieb Sergei Shtylyov:
> On 02-04-2014 15:51, Sergei Shtylyov wrote:
> 
>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>> changed the initialization of the mv643xx_eth driver to use
>>> phy_init_hw()
>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>> was broken such, that it used mdelay() instead of really waiting for a
>>> reset to finish.

(...)

>>>
>>> -    mdelay(500);
>>> +    do
>>> +        temp = phy_read(phydev, MII_BMCR);
>>> +    while (temp & BMCR_RESET);
> 
>>     Not clear why it's necessary to reset one more time... Also, tight
>> loop
>> without a timeout (0.5 sec, specified by IEEE 802.3) doesn't look
>> good. The
>> comment above phy_poll_reset() suggests that it could be used in the PHY
>> drivers as well. Florian?
> 
>    Ah, I was looking at Linus' tree, not net-next.git, and hadn't read
> Florian's replies before commenting on this...

Just to be clear, this patch isn't a change request, it's a bugfix
because the ethernet doesn't come up on my Kirkwood device(s) using
kernel 3.14. So it's mainly meant for other people which want to use
3.14 with similiar devices and have problems too.

Reverting the above commit does do the trick here too.

I haven't looked at the datasheet nor any possible erratas. So I don't
know why and if there are multiple resets necessary. I've just made the
code more similiar to what was used before the above commit changed the
reset of the PHY. The tight loop is a copy from other m*_config_inits in
the same file and was in use in mv643xx_eth before the above commit too.
So at least the tight loop is proven to work. And genphy_soft_reset(),
which does not wait indefinitely long if the phy never does come out of
reset, is only available in some -next tree, definitely not where I look
at and need it for a bugfix.

But I agree, all the stuff looks very curious, tight loops without a
timeout, multiple resets in order and such things more.

But this patch isn't a cleanup, it's meant to be as small as possible to
make things work again. And I decided that instead of just suggesting a
revert, it's better to fix m88e1116r_config_init() which might be used
at other places too.

It might make sense to wait for some other users to complain, I think it
won't need long as this SOC is used in Sheevaplugs and Dockstars, some
ARMv5 devices which were quiet popular. Or if nobody else complains,
maybe someone might test my change. Ut could be possible that the
failing sequence is through the/my use of netconsole, which is, I
assume, not very common.

Regards,

Alexander Holler
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Fainelli April 2, 2014, 7:01 p.m. UTC | #7
2014-04-02 2:09 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>
>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
>>>
>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>> was broken such, that it used mdelay() instead of really waiting for a
>>> reset to finish.
>>
>>
>> So the only big difference before
>> 7cd1463664c2a15721ff4ccfb61d4d970815cb3d ("net: mv643xx_eth: use
>> phy_init_hw to reset PHY") is that we waited potentially forever for
>> the BMCR_RESET bit to get cleared, while now, we only wait for up to
>> 500 milliseconds.
>>
>> Could you verify the following two things before your patch gets merged:
>>
>> - how long does it take for your PHY to clear the BMCR_RESET bit, is
>> it more than the allowed time out in
>> drivers/net/phy/phy_device.c::phy_poll_reset
>> - is your PHY powered down (check BMCR_PWRDOWN), if that is the case,
>> we might be hitting some corner case where toggling BMCR_RESET will
>> power it on, but at the expense of waiting longer
>
>
> I've  done two tests with pr_info before and after the two resets in
> m88e1116r_config_init:
>
> with my patch:
> -----------------------
> dmesg | grep -A1 -B1  AHO
> [    1.090099] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version 1.4
> [    1.175916] AHO: before first reset
> [    1.179613] AHO: after first reset

That makes about 3.697 milliseconds

> [    1.183743] AHO: before second reset
> [    1.187530] AHO: after second reset

That makes about 3.787 milliseconds

> [    1.191905] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
> address xx
> --
> [    1.426487] netpoll: netconsole: device eth0 not up yet, forcing it
> [    1.505986] AHO: before first reset
> [    1.509725] AHO: after first reset
> [    1.513899] AHO: before second reset
> [    1.517754] AHO: after second reset
> [    1.521802] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> --
> [   21.372591] Adding 2996116k swap on /dev/sda3.  Priority:-1 extents:1
> across:2996116k
> [   28.305989] AHO: before first reset
> [   28.306200] AHO: after first reset
> [   28.306936] AHO: before second reset
> [   28.307153] AHO: after second reset
> [   31.509973] mv643xx_eth_port mv643xx_eth_port.0 eth0: link up, 1000 Mb/s,
> full duplex, flow control disabled
> -----------------------
>
>
> with mdelay (the value after reset is what contains MII_BMCR):
>
> -----------------------
> dmesg | grep -A1 -B1  AHO
> [    1.090072] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version 1.4
> [    1.175888] AHO: before first reset
> [    1.678806] AHO: after first reset 0x0

That's 502.918 milliseconds

> [    1.683281] AHO: before second reset
> [    2.186288] AHO: after second reset 0x0

That's 503.007 milliseconds

> [    2.191010] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
> address xx
> --
> [    2.426349] netpoll: netconsole: device eth0 not up yet, forcing it
> [    2.505917] AHO: before first reset
> [    2.605824] usb 1-1: new high-speed USB device number 2 using orion-ehci
> --
> [    2.829987] hub 1-1:1.0: 4 ports detected
> [    3.044502] AHO: after first reset 0x0
> [    3.049133] AHO: before second reset
> [    3.126110] usb 1-1.1: new high-speed USB device number 3 using
> orion-ehci
> --
> [    3.526107] usb 1-1.3: device descriptor read/64, error -32
> [    3.614264] AHO: after second reset 0x0
> [    3.618708] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> --
> [   21.335730] Adding 2996116k swap on /dev/sda3.  Priority:-1 extents:1
> across:2996116k
> [   28.195942] AHO: before first reset
> [   28.696270] AHO: after first reset 0x800
> [   28.696958] AHO: before second reset
> [   29.197354] AHO: after second reset 0x800
> [  111.612263] RPC: Registered named UNIX socket transport module.
> -----------------------
>
> So we see, the first reset in the last call of m88e1116r_config_init() fails
> to complete in 500ms and the phy seems to be stuck afterwards, but, for
> whatever reason, it doesn't need 500ms if mdelay isn't used (if we can trust
> the timestamps).

I wonder if the extra 2/3 microseconds we are seeing are nothing more
than the buffered printk. At any rate, it looks like the software
reset of this PHY needs to polled, and frequently for it to complete
successfully.

Can you resubmit with the following information:

- you specify the commit that introduce the problem in parenthesis,
e.g: deadbeef ("dead: beef: cafe babe")
- put this dmesg snippet in your log message to clearly illustrate
what is happening
- clarify that the PHY needs to be polled in a comment in
m88e1116r_config_init()

Meanwhile, it would be good if someone with access to this particular
PHY datasheet could look for known erratas, problems, non-standard
compliant behavior ....

>
> (You can also see, I have netconsole enabled using netconsole=... in the
> kernel cmdline).
>
> That behaviour is reproducible. The first reset in the last call to
> m88e1116r_config_init() always fails if mdelay is used.
>
> Regards,
>
> Alexander Holler
Sebastian Hesselbarth April 2, 2014, 8:25 p.m. UTC | #8
On 04/02/2014 09:01 PM, Florian Fainelli wrote:
> 2014-04-02 2:09 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
>> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
>>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>>> was broken such, that it used mdelay() instead of really waiting for a
>>>> reset to finish.
>
> Can you resubmit with the following information:
> 
> - you specify the commit that introduce the problem in parenthesis,
> e.g: deadbeef ("dead: beef: cafe babe")
> - put this dmesg snippet in your log message to clearly illustrate
> what is happening
> - clarify that the PHY needs to be polled in a comment in
> m88e1116r_config_init()
> 
> Meanwhile, it would be good if someone with access to this particular
> PHY datasheet could look for known erratas, problems, non-standard
> compliant behavior ....

Alexander,

I tried todays linux/master on Seagate Dockstar with Marvell 88E1116R
(0x01410e40) and cannot reproduce any regression. I tried booting from
tftp and usb, I also rebooted twice to see if there are any side
effects - nothing - ethernet always comes up as expected.

I am curious, how you determined above commit to be the cause of the
regression you are seeing. Can you bisect, if you didn't already?

Sebastian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Holler April 2, 2014, 10:12 p.m. UTC | #9
Am 02.04.2014 22:25, schrieb Sebastian Hesselbarth:
> On 04/02/2014 09:01 PM, Florian Fainelli wrote:
>> 2014-04-02 2:09 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
>>> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>>>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
>>>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>>>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>>>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>>>> was broken such, that it used mdelay() instead of really waiting for a
>>>>> reset to finish.
>>
>> Can you resubmit with the following information:
>>
>> - you specify the commit that introduce the problem in parenthesis,
>> e.g: deadbeef ("dead: beef: cafe babe")
>> - put this dmesg snippet in your log message to clearly illustrate
>> what is happening
>> - clarify that the PHY needs to be polled in a comment in
>> m88e1116r_config_init()
>>
>> Meanwhile, it would be good if someone with access to this particular
>> PHY datasheet could look for known erratas, problems, non-standard
>> compliant behavior ....
> 
> Alexander,
> 
> I tried todays linux/master on Seagate Dockstar with Marvell 88E1116R
> (0x01410e40) and cannot reproduce any regression. I tried booting from
> tftp and usb, I also rebooted twice to see if there are any side
> effects - nothing - ethernet always comes up as expected.
> 
> I am curious, how you determined above commit to be the cause of the
> regression you are seeing. Can you bisect, if you didn't already?

There was no bisecting necessary. I've just looked at what changed in
mv643xx_eth since 3.13 and the first commit I've reverted was already a
hit. Reading a bit source revealed the differences between the old reset
and the newly used one and ended up with my patch (first try) and was a
hit too.

Actually I assumed the reset needs longer than the 500ms, but as the
printks revealed, the reset is much faster.
So the problem seems to be the much increased time (1s) the newly used
reset function idles in mdelay.

But I think I have found the real reason. and the change of the reset
just increased the chance the problem is hit (here with 100% success or
fail rate however you want to name it).

Just give me a day or two to find the time to verify my assumption (I
don't want to speculate) and maybe find a real fix for the problem. Of
course, I still like my patch because it greatly decreases the time
necessary for a reset (and the chance to hit the problem).

Regards,

Alexander Holler
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Fainelli April 2, 2014, 10:20 p.m. UTC | #10
2014-04-02 15:12 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
> Am 02.04.2014 22:25, schrieb Sebastian Hesselbarth:
>> On 04/02/2014 09:01 PM, Florian Fainelli wrote:
>>> 2014-04-02 2:09 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
>>>> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>>>>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
>>>>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>>>>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>>>>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>>>>> was broken such, that it used mdelay() instead of really waiting for a
>>>>>> reset to finish.
>>>
>>> Can you resubmit with the following information:
>>>
>>> - you specify the commit that introduce the problem in parenthesis,
>>> e.g: deadbeef ("dead: beef: cafe babe")
>>> - put this dmesg snippet in your log message to clearly illustrate
>>> what is happening
>>> - clarify that the PHY needs to be polled in a comment in
>>> m88e1116r_config_init()
>>>
>>> Meanwhile, it would be good if someone with access to this particular
>>> PHY datasheet could look for known erratas, problems, non-standard
>>> compliant behavior ....
>>
>> Alexander,
>>
>> I tried todays linux/master on Seagate Dockstar with Marvell 88E1116R
>> (0x01410e40) and cannot reproduce any regression. I tried booting from
>> tftp and usb, I also rebooted twice to see if there are any side
>> effects - nothing - ethernet always comes up as expected.
>>
>> I am curious, how you determined above commit to be the cause of the
>> regression you are seeing. Can you bisect, if you didn't already?
>
> There was no bisecting necessary. I've just looked at what changed in
> mv643xx_eth since 3.13 and the first commit I've reverted was already a
> hit. Reading a bit source revealed the differences between the old reset
> and the newly used one and ended up with my patch (first try) and was a
> hit too.
>
> Actually I assumed the reset needs longer than the 500ms, but as the
> printks revealed, the reset is much faster.
> So the problem seems to be the much increased time (1s) the newly used
> reset function idles in mdelay.
>
> But I think I have found the real reason. and the change of the reset
> just increased the chance the problem is hit (here with 100% success or
> fail rate however you want to name it).
>
> Just give me a day or two to find the time to verify my assumption (I
> don't want to speculate) and maybe find a real fix for the problem. Of
> course, I still like my patch because it greatly decreases the time
> necessary for a reset (and the chance to hit the problem).

Why so mysterious, care to share your assumption?
Sebastian Hesselbarth April 2, 2014, 10:27 p.m. UTC | #11
On 04/03/2014 12:12 AM, Alexander Holler wrote:
> Am 02.04.2014 22:25, schrieb Sebastian Hesselbarth:
>> On 04/02/2014 09:01 PM, Florian Fainelli wrote:
>>> 2014-04-02 2:09 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
>>>> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>>>>> 2014-04-01 16:55 GMT-07:00 Alexander Holler <holler@ahsoftware.de>:
>>>>>> Commit 7cd1463664c2a15721ff4ccfb61d4d970815cb3d (introduced with 3.14)
>>>>>> changed the initialization of the mv643xx_eth driver to use phy_init_hw()
>>>>>> to reset the PHY. Unfortunately the initialization for the 88E1116R PHY
>>>>>> was broken such, that it used mdelay() instead of really waiting for a
>>>>>> reset to finish.
>>>
>>> Can you resubmit with the following information:
>>>
>>> - you specify the commit that introduce the problem in parenthesis,
>>> e.g: deadbeef ("dead: beef: cafe babe")
>>> - put this dmesg snippet in your log message to clearly illustrate
>>> what is happening
>>> - clarify that the PHY needs to be polled in a comment in
>>> m88e1116r_config_init()
>>>
>>> Meanwhile, it would be good if someone with access to this particular
>>> PHY datasheet could look for known erratas, problems, non-standard
>>> compliant behavior ....
>>
>> Alexander,
>>
>> I tried todays linux/master on Seagate Dockstar with Marvell 88E1116R
>> (0x01410e40) and cannot reproduce any regression. I tried booting from
>> tftp and usb, I also rebooted twice to see if there are any side
>> effects - nothing - ethernet always comes up as expected.
>>
>> I am curious, how you determined above commit to be the cause of the
>> regression you are seeing. Can you bisect, if you didn't already?
> 
> There was no bisecting necessary. I've just looked at what changed in
> mv643xx_eth since 3.13 and the first commit I've reverted was already a
> hit. Reading a bit source revealed the differences between the old reset
> and the newly used one and ended up with my patch (first try) and was a
> hit too.

Honestly, your own fix changes a different driver than mv643xx_eth.
There is a lot of changes from v3.13 to v3.14 and bisecting really
helps to pin-point the one offending patch. As you can see from my
tests with Dockstar, poking in the PHY driver may not be the right
place to fix it.

> Actually I assumed the reset needs longer than the 500ms, but as the
> printks revealed, the reset is much faster.
> So the problem seems to be the much increased time (1s) the newly used
> reset function idles in mdelay.

You assume that the PHY issue comes from waiting for too long _after_
the reset? And again, the very same PHY on Dockstar is not affected.

> But I think I have found the real reason. and the change of the reset
> just increased the chance the problem is hit (here with 100% success or
> fail rate however you want to name it).
> 
> Just give me a day or two to find the time to verify my assumption (I
> don't want to speculate) and maybe find a real fix for the problem. Of
> course, I still like my patch because it greatly decreases the time
> necessary for a reset (and the chance to hit the problem).

Well, you can share your idea anytime. You already speculated that PHY
reset on 88e1116r is broken but it seems that is not true. The more
you share of your issue and the tries to fix it, the more likely is it
we can follow your patch immediately.

Again, if you really want to find the real patch breaking Sheevaplug,
use git bisect.

Sebastian

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sebastian Hesselbarth April 2, 2014, 10:30 p.m. UTC | #12
On 04/02/2014 11:09 AM, Alexander Holler wrote:
> Am 02.04.2014 02:57, schrieb Florian Fainelli:
>> Could you verify the following two things before your patch gets merged:
>>
>> - how long does it take for your PHY to clear the BMCR_RESET bit, is
>> it more than the allowed time out in
>> drivers/net/phy/phy_device.c::phy_poll_reset
>> - is your PHY powered down (check BMCR_PWRDOWN), if that is the case,
>> we might be hitting some corner case where toggling BMCR_RESET will
>> power it on, but at the expense of waiting longer
> 
> I've  done two tests with pr_info before and after the two resets in
> m88e1116r_config_init:
> 

[...]

> with mdelay (the value after reset is what contains MII_BMCR):
> 
> -----------------------
> dmesg | grep -A1 -B1  AHO
> [    1.090072] mv643xx_eth: MV-643xx 10/100/1000 ethernet driver version
> 1.4
> [    1.175888] AHO: before first reset
> [    1.678806] AHO: after first reset 0x0
> [    1.683281] AHO: before second reset
> [    2.186288] AHO: after second reset 0x0
> [    2.191010] mv643xx_eth_port mv643xx_eth_port.0 eth0: port 0 with MAC
> address xx
> -- 
> [    2.426349] netpoll: netconsole: device eth0 not up yet, forcing it
> [    2.505917] AHO: before first reset
> [    2.605824] usb 1-1: new high-speed USB device number 2 using orion-ehci
> -- 
> [    2.829987] hub 1-1:1.0: 4 ports detected
> [    3.044502] AHO: after first reset 0x0

If the above hex at the end is BMCR..

> [    3.049133] AHO: before second reset
> [    3.126110] usb 1-1.1: new high-speed USB device number 3 using
> orion-ehci
> -- 
> [    3.526107] usb 1-1.3: device descriptor read/64, error -32
> [    3.614264] AHO: after second reset 0x0
> [    3.618708] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> -- 
> [   21.335730] Adding 2996116k swap on /dev/sda3.  Priority:-1 extents:1
> across:2996116k
> [   28.195942] AHO: before first reset
> [   28.696270] AHO: after first reset 0x800

.. have you noticed that your PHY enters POWERDOWN here?

Sebastian

> [   28.696958] AHO: before second reset
> [   29.197354] AHO: after second reset 0x800
> [  111.612263] RPC: Registered named UNIX socket transport module.
> -----------------------

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Holler April 3, 2014, 7:17 a.m. UTC | #13
Am 03.04.2014 00:27, schrieb Sebastian Hesselbarth:
> On 04/03/2014 12:12 AM, Alexander Holler wrote:

>>> I am curious, how you determined above commit to be the cause of the
>>> regression you are seeing. Can you bisect, if you didn't already?
>>
>> There was no bisecting necessary. I've just looked at what changed in
>> mv643xx_eth since 3.13 and the first commit I've reverted was already a
>> hit. Reading a bit source revealed the differences between the old reset
>> and the newly used one and ended up with my patch (first try) and was a
>> hit too.
>
> Honestly, your own fix changes a different driver than mv643xx_eth.

It changes stuff which now (through the mentioned commit) gets used 
through the change in mv643xx_eth.

> There is a lot of changes from v3.13 to v3.14 and bisecting really
> helps to pin-point the one offending patch. As you can see from my
> tests with Dockstar, poking in the PHY driver may not be the right
> place to fix it.
>
>> Actually I assumed the reset needs longer than the 500ms, but as the
>> printks revealed, the reset is much faster.
>> So the problem seems to be the much increased time (1s) the newly used
>> reset function idles in mdelay.
>
> You assume that the PHY issue comes from waiting for too long _after_
> the reset? And again, the very same PHY on Dockstar is not affected.

Guess with which hardware I'm experiencing this problem? Hint: 
http://ahsoftware.de/dockstar/ ;)

>
>> But I think I have found the real reason. and the change of the reset
>> just increased the chance the problem is hit (here with 100% success or
>> fail rate however you want to name it).
>>
>> Just give me a day or two to find the time to verify my assumption (I
>> don't want to speculate) and maybe find a real fix for the problem. Of
>> course, I still like my patch because it greatly decreases the time
>> necessary for a reset (and the chance to hit the problem).
>
> Well, you can share your idea anytime. You already speculated that PHY
> reset on 88e1116r is broken but it seems that is not true. The more
> you share of your issue and the tries to fix it, the more likely is it
> we can follow your patch immediately.

Sorry, but wild speculating doesn't help always. Otherwise I could 
mention several dozen possible reasons, starting from broken memory or 
other hw up to some memory corruption elsewhere in the kernel.

But I already have given a hint before, try what happens if you enable 
netconsole (compiled in) through the kernel commandline 
(netconsole=...). Maybe the ethernet on your dockstar will get stuck too.

>
> Again, if you really want to find the real patch breaking Sheevaplug,
> use git bisect.

That's silly if I already know a/the change which brings the problem to 
light. If I revert the mentioned commit the problem disapears. So why 
should I go through the pain to bisect stuff? I already have found the 
knob to kill the ethernet on that machine.

Regards,

Alexander Holler
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sebastian Hesselbarth April 3, 2014, 8:49 a.m. UTC | #14
On 04/03/2014 09:17 AM, Alexander Holler wrote:
> Am 03.04.2014 00:27, schrieb Sebastian Hesselbarth:
>> On 04/03/2014 12:12 AM, Alexander Holler wrote:
>>>> I am curious, how you determined above commit to be the cause of the
>>>> regression you are seeing. Can you bisect, if you didn't already?
>>>
>>> There was no bisecting necessary. I've just looked at what changed in
>>> mv643xx_eth since 3.13 and the first commit I've reverted was already a
>>> hit. Reading a bit source revealed the differences between the old reset
>>> and the newly used one and ended up with my patch (first try) and was a
>>> hit too.
>>
>> Honestly, your own fix changes a different driver than mv643xx_eth.
> 
> It changes stuff which now (through the mentioned commit) gets used
> through the change in mv643xx_eth.

Sigh. You have proven youself that the commit isn't the root cause
of the issue you are seeing. Nor is "fixing" 88e1116r init sequence
a reasonable fix.

>> There is a lot of changes from v3.13 to v3.14 and bisecting really
>> helps to pin-point the one offending patch. As you can see from my
>> tests with Dockstar, poking in the PHY driver may not be the right
>> place to fix it.
>>
>>> Actually I assumed the reset needs longer than the 500ms, but as the
>>> printks revealed, the reset is much faster.
>>> So the problem seems to be the much increased time (1s) the newly used
>>> reset function idles in mdelay.
>>
>> You assume that the PHY issue comes from waiting for too long _after_
>> the reset? And again, the very same PHY on Dockstar is not affected.
> 
> Guess with which hardware I'm experiencing this problem? Hint:
> http://ahsoftware.de/dockstar/ ;)

I don't know, but now I guess it is Dockstar.

>>> But I think I have found the real reason. and the change of the reset
>>> just increased the chance the problem is hit (here with 100% success or
>>> fail rate however you want to name it).
>>>
>>> Just give me a day or two to find the time to verify my assumption (I
>>> don't want to speculate) and maybe find a real fix for the problem. Of
>>> course, I still like my patch because it greatly decreases the time
>>> necessary for a reset (and the chance to hit the problem).
>>
>> Well, you can share your idea anytime. You already speculated that PHY
>> reset on 88e1116r is broken but it seems that is not true. The more
>> you share of your issue and the tries to fix it, the more likely is it
>> we can follow your patch immediately.
> 
> Sorry, but wild speculating doesn't help always. Otherwise I could
> mention several dozen possible reasons, starting from broken memory or
> other hw up to some memory corruption elsewhere in the kernel.
> 
> But I already have given a hint before, try what happens if you enable
> netconsole (compiled in) through the kernel commandline
> (netconsole=...). Maybe the ethernet on your dockstar will get stuck too.

If it is related to netconsole, I would guess it is more platforms
affected than just Dockstar? If you share the idea, we can try to
find a way to allow netconsole on more than just that
mv643xx_eth/88e116r combination.

>> Again, if you really want to find the real patch breaking Sheevaplug,
>> use git bisect.
> 
> That's silly if I already know a/the change which brings the problem to
> light. If I revert the mentioned commit the problem disapears. So why
> should I go through the pain to bisect stuff? I already have found the
> knob to kill the ethernet on that machine.

Really, I can tell you two fixes for it right away: don't use netconsole
or remove Marvell PHY support. But neither is really helping here.

If you share your ideas early, it is at least two more who are looking
at it. This is just a suggestion, you are free to take it though.

Sebastian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Holler April 3, 2014, 3:06 p.m. UTC | #15
Am 03.04.2014 10:49, schrieb Sebastian Hesselbarth:
> On 04/03/2014 09:17 AM, Alexander Holler wrote:
>> Am 03.04.2014 00:27, schrieb Sebastian Hesselbarth:
>>> On 04/03/2014 12:12 AM, Alexander Holler wrote:
>>>>> I am curious, how you determined above commit to be the cause of the
>>>>> regression you are seeing. Can you bisect, if you didn't already?
>>>>
>>>> There was no bisecting necessary. I've just looked at what changed in
>>>> mv643xx_eth since 3.13 and the first commit I've reverted was already a
>>>> hit. Reading a bit source revealed the differences between the old reset
>>>> and the newly used one and ended up with my patch (first try) and was a
>>>> hit too.
>>>
>>> Honestly, your own fix changes a different driver than mv643xx_eth.
>>
>> It changes stuff which now (through the mentioned commit) gets used
>> through the change in mv643xx_eth.
>
> Sigh. You have proven youself that the commit isn't the root cause
> of the issue you are seeing. Nor is "fixing" 88e1116r init sequence
> a reasonable fix.

Sorry, continuing this discussion doesn't make sense. You have the same 
hw, so just try enabling netconsole with 3.14 and use dhcpcd from 
userland (this does the final reset here which hangs.

But don't suggest me (or insist on) a time consuming and totally useless 
bisect when I already know what makes the problem to appear (100% 
reliable here).

I have better ways to waste my time.

Alexander Holler
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 3, 2014, 3:14 p.m. UTC | #16
From: Alexander Holler <holler@ahsoftware.de>
Date: Thu, 03 Apr 2014 17:06:52 +0200

> But don't suggest me (or insist on) a time consuming

Bisects are not time consuming, and help developers analyze your
issue tremendously.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sebastian Hesselbarth April 3, 2014, 3:45 p.m. UTC | #17
On 04/03/2014 05:06 PM, Alexander Holler wrote:
> Am 03.04.2014 10:49, schrieb Sebastian Hesselbarth:
>> On 04/03/2014 09:17 AM, Alexander Holler wrote:
>>> Am 03.04.2014 00:27, schrieb Sebastian Hesselbarth:
>>>> On 04/03/2014 12:12 AM, Alexander Holler wrote:
>>>>>> I am curious, how you determined above commit to be the cause of the
>>>>>> regression you are seeing. Can you bisect, if you didn't already?
>>>>>
>>>>> There was no bisecting necessary. I've just looked at what changed in
>>>>> mv643xx_eth since 3.13 and the first commit I've reverted was
>>>>> already a
>>>>> hit. Reading a bit source revealed the differences between the old
>>>>> reset
>>>>> and the newly used one and ended up with my patch (first try) and
>>>>> was a
>>>>> hit too.
>>>>
>>>> Honestly, your own fix changes a different driver than mv643xx_eth.
>>>
>>> It changes stuff which now (through the mentioned commit) gets used
>>> through the change in mv643xx_eth.
>>
>> Sigh. You have proven youself that the commit isn't the root cause
>> of the issue you are seeing. Nor is "fixing" 88e1116r init sequence
>> a reasonable fix.
>
> Sorry, continuing this discussion doesn't make sense. You have the same
> hw, so just try enabling netconsole with 3.14 and use dhcpcd from
> userland (this does the final reset here which hangs.
>
> But don't suggest me (or insist on) a time consuming and totally useless
> bisect when I already know what makes the problem to appear (100%
> reliable here).

I _suggest_ to use bisect, I don't _insist_. But for a patch that 
tackles an issue, knowing the offending patch is really good.

Now that you revealed more input, it becomes more clear to me what is
happening (although I have no clue, what netconsole really does to the
eth/phy):

- u-boot powers on the PHY
- netconsole picks up the eth device and the PHY, drops it later
- the PHY state machine powers off the now unused PHY (also introduced
   in between v3.13 and v3.14).
- dhcp client ifup's the device and tries to reset the powered down PHY

I already made a comment about the set POWERDOWN bit before, which you
silently ignored.

I will try to reproduce it and if I hit it, will do a bisect to find
(and fix) the offending patch.

> I have better ways to waste my time.

Like writing workarounds instead of fixing bugs?

Sebastian

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Holler April 3, 2014, 3:58 p.m. UTC | #18
Am 03.04.2014 17:45, schrieb Sebastian Hesselbarth:

>
> I will try to reproduce it and if I hit it, will do a bisect to find
> (and fix) the offending patch.
>
>> I have better ways to waste my time.
>
> Like writing workarounds instead of fixing bugs?

Sure. If I would try to fix (or even describe) every bug in the kernel I 
already know about, I would never have time to do anything else.

Besides that you still try to ignore, that I didn't know at first that 
it's a workaround at all. Especially after I've discovered that the 
reset is the place where things are failing. Otherwise I wouldn't have 
sent the patch (which I know NOW that it just is a workaround).

Alexander Holler
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sebastian Hesselbarth April 3, 2014, 5:44 p.m. UTC | #19
On 04/03/2014 05:06 PM, Alexander Holler wrote:
> Am 03.04.2014 10:49, schrieb Sebastian Hesselbarth:
>> On 04/03/2014 09:17 AM, Alexander Holler wrote:
>>> Am 03.04.2014 00:27, schrieb Sebastian Hesselbarth:
>>>> On 04/03/2014 12:12 AM, Alexander Holler wrote:
>>>>>> I am curious, how you determined above commit to be the cause of the
>>>>>> regression you are seeing. Can you bisect, if you didn't already?
>>>>>
>>>>> There was no bisecting necessary. I've just looked at what changed in
>>>>> mv643xx_eth since 3.13 and the first commit I've reverted was
>>>>> already a
>>>>> hit. Reading a bit source revealed the differences between the old
>>>>> reset
>>>>> and the newly used one and ended up with my patch (first try) and
>>>>> was a
>>>>> hit too.
>>>>
> 
> Sorry, continuing this discussion doesn't make sense. You have the same
> hw, so just try enabling netconsole with 3.14 and use dhcpcd from
> userland (this does the final reset here which hangs.
> 
> But don't suggest me (or insist on) a time consuming and totally useless
> bisect when I already know what makes the problem to appear (100%
> reliable here).

Alexander,

I tried to reproduce the regression on Dockstar with 3.14.0 and
3.14.0-07247-gcd6362be, netconsole enabled, DHCP (dhclient).

I can neither reproduce any issues with ethernet before nor after
dhclient kicks in. I can also use netconsole just fine and see
plenty of kernel bootlog messages sent to netconsole receiver.

Can you _please_ share a full report how to properly reproduce it?

Sebastian


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Holler April 3, 2014, 6:20 p.m. UTC | #20
Am 03.04.2014 19:44, schrieb Sebastian Hesselbarth:

> I tried to reproduce the regression on Dockstar with 3.14.0 and
> 3.14.0-07247-gcd6362be, netconsole enabled, DHCP (dhclient).

Try using dhcpcd, that's why I mentioned it. DHCP-clients do work 
differently.

> I can neither reproduce any issues with ethernet before nor after
> dhclient kicks in. I can also use netconsole just fine and see
> plenty of kernel bootlog messages sent to netconsole receiver.
>
> Can you _please_ share a full report how to properly reproduce it?

Impossible because timing and other stuff (boot media, switch, 
type/version of iproute, dhcp-client, ...) might be involved.

Alexander Holler

BTW: if you want to fix another bug, changing the mtu does reset the 
ethernet device. Something special for mv643xx_eth and not very nice. ;)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index bd37e45..5b84808 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -396,7 +396,9 @@  static int m88e1116r_config_init(struct phy_device *phydev)
 	if (err < 0)
 		return err;
 
-	mdelay(500);
+	do
+		temp = phy_read(phydev, MII_BMCR);
+	while (temp & BMCR_RESET);
 
 	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0);
 	if (err < 0)
@@ -429,7 +431,9 @@  static int m88e1116r_config_init(struct phy_device *phydev)
 	if (err < 0)
 		return err;
 
-	mdelay(500);
+	do
+		temp = phy_read(phydev, MII_BMCR);
+	while (temp & BMCR_RESET);
 
 	return 0;
 }