diff mbox

[v3,00/10] split out emac cpdma and mdio for reuse

Message ID 4C893ADE.809@ti.com
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Cyril Chemparathy Sept. 9, 2010, 7:51 p.m. UTC
Hi Mike,

[...]
> The hang is in wait_for_user_access() in the davinci_mdio_read() call.  Looks like
> the state machine got put back into IDLE somewhere between the MDIO probe and the 
> EMAC probe. Seems like there should be some sort of time-out and error message 
> in the wait_for_user_access() method.... (maybe even a check for IDLE??)
> 
> If I add a patch to check the state machine for IDLE and then re-enable it in the
> davinci_mdio_read() call, it is able to press on and come up.  I don't see any calls
> to the davinci_mdio_suspend() call, so I am wondering if the EMAC probe routine, 
> particularly the application of the SOFTRESET, is causing the MDIO to drop back to 
> IDLE / disabled.
> 
> I can post the patch if you like, but it is a bit of a hack...

An EMAC soft-reset clobbering the MDIO controller state is a
possibility.  I will poll TI designers to see if this could be the case.

In any case, a couple of unanswered questions remain:

  1. Why don't other davinci devices display similar behavior?

  2. If the answer to #1 above is that the timing window is pretty slim
     (i.e., only if an MDIO read/write is in progress during EMAC
     soft-reset), why do we hit this situation consistently on
     mityomap?

I have put together a quick patch (tested dm365).  See attached.

Regards
Cyril.

Comments

Michael Williamson Sept. 9, 2010, 9:25 p.m. UTC | #1
Hi Cyril,

On 09/09/2010 03:51 PM, Cyril Chemparathy wrote:
> Hi Mike,
> 
> [...]
>> The hang is in wait_for_user_access() in the davinci_mdio_read() call.  Looks like
>> the state machine got put back into IDLE somewhere between the MDIO probe and the 
>> EMAC probe. Seems like there should be some sort of time-out and error message 
>> in the wait_for_user_access() method.... (maybe even a check for IDLE??)
>>
>> If I add a patch to check the state machine for IDLE and then re-enable it in the
>> davinci_mdio_read() call, it is able to press on and come up.  I don't see any calls
>> to the davinci_mdio_suspend() call, so I am wondering if the EMAC probe routine, 
>> particularly the application of the SOFTRESET, is causing the MDIO to drop back to 
>> IDLE / disabled.
>>
>> I can post the patch if you like, but it is a bit of a hack...
> 
> An EMAC soft-reset clobbering the MDIO controller state is a
> possibility.  I will poll TI designers to see if this could be the case.
> 
> In any case, a couple of unanswered questions remain:
> 
>   1. Why don't other davinci devices display similar behavior?
> 
>   2. If the answer to #1 above is that the timing window is pretty slim
>      (i.e., only if an MDIO read/write is in progress during EMAC
>      soft-reset), why do we hit this situation consistently on
>      mityomap?

Has it been confirmed that this only happens on mityomap?  Has anyone had success
using a da850 evm or other da850 platform?  The configuration for the mityomap, wrt 
to the EMAC/MII/MDIO, is pretty much identical to the da850 evm using the MII interface.  
The only difference I am aware of is the assigned address to the PHY chip.  The 
reference clocks and rates are identical, AFAIK, to the evm.

> 
> I have put together a quick patch (tested dm365).  See attached.

Your patch doesn't work with my board.  It does attempt to reset the bus on the read call, 
but following wait_for_user_access() calls are timing out and the _read() and _write() calls punt.  
I bumped up the MDIO_TIMEOUT to 100 ms, and it works.  I'm wondering if the scanning logic 
has to complete an entire cycle (all 32 phys) before issuing a request, and if it's a lot 
slower than expected.

I also found that the initial scanning logic would not reliably find the PHY until I bumped
up the delay time following the reset operation.  Sometimes it would, sometimes no.

Also, your while(1) loops with the continue conditions on the second wait_for_user_access() 
in the read and writes might need some consideration, i.e.:

        while (1) {
                ret = wait_for_user_access(data);
                if (ret == -EAGAIN)
                        continue;
                if (ret < 0)
                        break;

                __raw_writel(reg, &data->regs->user[0].access);

                ret = wait_for_user_access(data);
                if (ret == -EAGAIN)
                        continue;
                        ^^^^^^^^^ <--- this will re-issue the request.... what you want?
                if (ret < 0)
                        break;

                reg = __raw_readl(&data->regs->user[0].access);
                ret = (reg & USERACCESS_ACK) ? (reg & USERACCESS_DATA) : -EIO;
                break;
        }

Also, on the shutdown, I get a major kernel trace.  Here is the dump, as much 
as I could catch of it.... (I need a better terminal program)

Deconfiguring network interfaces... ------------[ cut here ]------------
WARNING: at kernel/softirq.c:143 local_bh_enable+0x40/0xb4()
Modules linked in:
[<c002e684>] (unwind_backtrace+0x0/0xec) from [<c003e284>] (warn_slowpath_common+0x4c/0x64)
[<c003e284>] (warn_slowpath_common+0x4c/0x64) from [<c003e2b8>] (warn_slowpath_null+0x1c/0x24)
[<c003e2b8>] (warn_slowpath_null+0x1c/0x24) from [<c0043e80>] (local_bh_enable+0x40/0xb4)
[<c0043e80>] (local_bh_enable+0x40/0xb4) from [<c01f3760>] (__netif_receive_skb+0xf8/0x3d0)
[<c01f3760>] (__netif_receive_skb+0xf8/0x3d0) from [<c01d344c>] (emac_rx_handler+0x40/0xcc)
[<c01d344c>] (emac_rx_handler+0x40/0xcc) from [<c01d3fe8>] (__cpdma_chan_free+0xac/0xb0)
[<c01d3fe8>] (__cpdma_chan_free+0xac/0xb0) from [<c01d40d0>] (__cpdma_chan_process+0xe4/0x114)
[<c01d40d0>] (__cpdma_chan_process+0xe4/0x114) from [<c01d4238>] (cpdma_chan_stop+0xf0/0x1c8)
[<c01d4238>] (cpdma_chan_stop+0xf0/0x1c8) from [<c01d4390>] (cpdma_ctlr_stop+0x80/0xe4)
[<c01d4390>] (cpdma_ctlr_stop+0x80/0xe4) from [<c01d2c00>] (emac_dev_stop+0xb0/0x13c)
[<c01d2c00>] (emac_dev_stop+0xb0/0x13c) from [<c01f5b40>] (__dev_close+0x74/0x98)
[<c01f5b40>] (__dev_close+0x74/0x98) from [<c01f3168>] (__dev_change_flags+0xac/0x13c)
[<c01f3168>] (__dev_change_flags+0xac/0x13c) from [<c01f5994>] (dev_change_flags+0x10/0x44)
[<c01f5994>] (dev_change_flags+0x10/0x44) from [<c023cf60>] (devinet_ioctl+0x2dc/0x6f4)
[<c023cf60>] (devinet_ioctl+0x2dc/0x6f4) from [<c01e484c>] (sock_ioctl+0x1fc/0x258)
[<c01e484c>] (sock_ioctl+0x1fc/0x258) from [<c00aa0f0>] (do_vfs_ioctl+0x550/0x5c0)
[<c00aa0f0>] (do_vfs_ioctl+0x550/0x5c0) from [<c00aa198>] (sys_ioctl+0x38/0x5c)
[<c00aa198>] (sys_ioctl+0x38/0x5c) from [<c0029f00>] (ret_fast_syscall+0x0/0x2c)
---[ end trace 0e686330480db12e ]---
------------[ cut here ]------------
WARNING: at drivers/net/davinci_emac.c:1025 __cpdma_chan_free+0xac/0xb0()
Modules linked in:
[<c002e684>] (unwind_backtrace+0x0/0xec) from [<c003e284>] (warn_slowpath_common+0x4c/0x64)
[<c003e284>] (warn_slowpath_common+0x4c/0x64) from [<c003e2b8>] (warn_slowpath_null+0x1c/0x24)
[<c003e2b8>] (warn_slowpath_null+0x1c/0x24) from [<c01d3fe8>] (__cpdma_chan_free+0xac/0xb0)
[<c01d3fe8>] (__cpdma_chan_free+0xac/0xb0) from [<c01d40d0>] (__cpdma_chan_process+0xe4/0x114)
[<c01d40d0>] (__cpdma_chan_process+0xe4/0x114) from [<c01d4238>] (cpdma_chan_stop+0xf0/0x1c8)
[<c01d4238>] (cpdma_chan_stop+0xf0/0x1c8) from [<c01d4390>] (cpdma_ctlr_stop+0x80/0xe4)
[<c01d4390>] (cpdma_ctlr_stop+0x80/0xe4) from [<c01d2c00>] (emac_dev_stop+0xb0/0x13c)
[<c01d2c00>] (emac_dev_stop+0xb0/0x13c) from [<c01f5b40>] (__dev_close+0x74/0x98)
[<c01f5b40>] (__dev_close+0x74/0x98) from [<c01f3168>] (__dev_change_flags+0xac/0x13c)
[<c01f3168>] (__dev_change_flags+0xac/0x13c) from [<c01f5994>] (dev_change_flags+0x10/0x44)
[<c01f5994>] (dev_change_flags+0x10/0x44) from [<c023cf60>] (devinet_ioctl+0x2dc/0x6f4)
[<c023cf60>] (devinet_ioctl+0x2dc/0x6f4) from [<c01e484c>] (sock_ioctl+0x1fc/0x258)
[<c01e484c>] (sock_ioctl+0x1fc/0x258) from [<c00aa0f0>] (do_vfs_ioctl+0x550/0x5c0)
[<c00aa0f0>] (do_vfs_ioctl+0x550/0x5c0) from [<c00aa198>] (sys_ioctl+0x38/0x5c)
[<c00aa198>] (sys_ioctl+0x38/0x5c) from [<c0029f00>] (ret_fast_syscall+0x0/0x2c)
---[ end trace 0e686330480db12f ]---


Thanks for the help.

-Mike
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Caglar Akyuz Sept. 10, 2010, 3:23 p.m. UTC | #2
On Friday 10 September 2010 12:25:40 am Michael Williamson wrote:
> Hi Cyril,
> 
> On 09/09/2010 03:51 PM, Cyril Chemparathy wrote:
> > Hi Mike,
> >
> > [...]
> >
> >> The hang is in wait_for_user_access() in the davinci_mdio_read() call. 
> >> Looks like the state machine got put back into IDLE somewhere between
> >> the MDIO probe and the EMAC probe. Seems like there should be some sort
> >> of time-out and error message in the wait_for_user_access() method....
> >> (maybe even a check for IDLE??)
> >>
> >> If I add a patch to check the state machine for IDLE and then re-enable
> >> it in the davinci_mdio_read() call, it is able to press on and come up. 
> >> I don't see any calls to the davinci_mdio_suspend() call, so I am
> >> wondering if the EMAC probe routine, particularly the application of the
> >> SOFTRESET, is causing the MDIO to drop back to IDLE / disabled.
> >>
> >> I can post the patch if you like, but it is a bit of a hack...
> >
> > An EMAC soft-reset clobbering the MDIO controller state is a
> > possibility.  I will poll TI designers to see if this could be the case.
> >
> > In any case, a couple of unanswered questions remain:
> >
> >   1. Why don't other davinci devices display similar behavior?
> >
> >   2. If the answer to #1 above is that the timing window is pretty slim
> >      (i.e., only if an MDIO read/write is in progress during EMAC
> >      soft-reset), why do we hit this situation consistently on
> >      mityomap?
> 
> Has it been confirmed that this only happens on mityomap?  Has anyone had
>  success using a da850 evm or other da850 platform?  The configuration for

Same problem exists on another DA850 board, Hawkboard.(Sorry no support in 
mainline yet)

>  the mityomap, wrt to the EMAC/MII/MDIO, is pretty much identical to the
>  da850 evm using the MII interface. The only difference I am aware of is
>  the assigned address to the PHY chip.  The reference clocks and rates are
>  identical, AFAIK, to the evm.
> 
> > I have put together a quick patch (tested dm365).  See attached.
> 
> Your patch doesn't work with my board.  It does attempt to reset the bus on

This patch fixes the problem here. I'm using kernel IP auto configuration and 
mounting fs over NFS. My system boots and I can login to my board.

Regards,
Caglar

>  the read call, but following wait_for_user_access() calls are timing out
>  and the _read() and _write() calls punt. I bumped up the MDIO_TIMEOUT to
>  100 ms, and it works.  I'm wondering if the scanning logic has to complete
>  an entire cycle (all 32 phys) before issuing a request, and if it's a lot
>  slower than expected.
> 
> I also found that the initial scanning logic would not reliably find the
>  PHY until I bumped up the delay time following the reset operation. 
>  Sometimes it would, sometimes no.
> 
> Also, your while(1) loops with the continue conditions on the second
>  wait_for_user_access() in the read and writes might need some
>  consideration, i.e.:
> 
>         while (1) {
>                 ret = wait_for_user_access(data);
>                 if (ret == -EAGAIN)
>                         continue;
>                 if (ret < 0)
>                         break;
> 
>                 __raw_writel(reg, &data->regs->user[0].access);
> 
>                 ret = wait_for_user_access(data);
>                 if (ret == -EAGAIN)
>                         continue;
>                         ^^^^^^^^^ <--- this will re-issue the request....
>  what you want? if (ret < 0)
>                         break;
> 
>                 reg = __raw_readl(&data->regs->user[0].access);
>                 ret = (reg & USERACCESS_ACK) ? (reg & USERACCESS_DATA) :
>  -EIO; break;
>         }
> 
> Also, on the shutdown, I get a major kernel trace.  Here is the dump, as
>  much as I could catch of it.... (I need a better terminal program)
> 
> Deconfiguring network interfaces... ------------[ cut here ]------------
> WARNING: at kernel/softirq.c:143 local_bh_enable+0x40/0xb4()
> Modules linked in:
> [<c002e684>] (unwind_backtrace+0x0/0xec) from [<c003e284>]
>  (warn_slowpath_common+0x4c/0x64) [<c003e284>]
>  (warn_slowpath_common+0x4c/0x64) from [<c003e2b8>]
>  (warn_slowpath_null+0x1c/0x24) [<c003e2b8>] (warn_slowpath_null+0x1c/0x24)
>  from [<c0043e80>] (local_bh_enable+0x40/0xb4) [<c0043e80>]
>  (local_bh_enable+0x40/0xb4) from [<c01f3760>]
>  (__netif_receive_skb+0xf8/0x3d0) [<c01f3760>]
>  (__netif_receive_skb+0xf8/0x3d0) from [<c01d344c>]
>  (emac_rx_handler+0x40/0xcc) [<c01d344c>] (emac_rx_handler+0x40/0xcc) from
>  [<c01d3fe8>] (__cpdma_chan_free+0xac/0xb0) [<c01d3fe8>]
>  (__cpdma_chan_free+0xac/0xb0) from [<c01d40d0>]
>  (__cpdma_chan_process+0xe4/0x114) [<c01d40d0>]
>  (__cpdma_chan_process+0xe4/0x114) from [<c01d4238>]
>  (cpdma_chan_stop+0xf0/0x1c8) [<c01d4238>] (cpdma_chan_stop+0xf0/0x1c8)
>  from [<c01d4390>] (cpdma_ctlr_stop+0x80/0xe4) [<c01d4390>]
>  (cpdma_ctlr_stop+0x80/0xe4) from [<c01d2c00>] (emac_dev_stop+0xb0/0x13c)
>  [<c01d2c00>] (emac_dev_stop+0xb0/0x13c) from [<c01f5b40>]
>  (__dev_close+0x74/0x98) [<c01f5b40>] (__dev_close+0x74/0x98) from
>  [<c01f3168>] (__dev_change_flags+0xac/0x13c) [<c01f3168>]
>  (__dev_change_flags+0xac/0x13c) from [<c01f5994>]
>  (dev_change_flags+0x10/0x44) [<c01f5994>] (dev_change_flags+0x10/0x44)
>  from [<c023cf60>] (devinet_ioctl+0x2dc/0x6f4) [<c023cf60>]
>  (devinet_ioctl+0x2dc/0x6f4) from [<c01e484c>] (sock_ioctl+0x1fc/0x258)
>  [<c01e484c>] (sock_ioctl+0x1fc/0x258) from [<c00aa0f0>]
>  (do_vfs_ioctl+0x550/0x5c0) [<c00aa0f0>] (do_vfs_ioctl+0x550/0x5c0) from
>  [<c00aa198>] (sys_ioctl+0x38/0x5c) [<c00aa198>] (sys_ioctl+0x38/0x5c) from
>  [<c0029f00>] (ret_fast_syscall+0x0/0x2c) ---[ end trace 0e686330480db12e
>  ]---
> ------------[ cut here ]------------
> WARNING: at drivers/net/davinci_emac.c:1025 __cpdma_chan_free+0xac/0xb0()
> Modules linked in:
> [<c002e684>] (unwind_backtrace+0x0/0xec) from [<c003e284>]
>  (warn_slowpath_common+0x4c/0x64) [<c003e284>]
>  (warn_slowpath_common+0x4c/0x64) from [<c003e2b8>]
>  (warn_slowpath_null+0x1c/0x24) [<c003e2b8>] (warn_slowpath_null+0x1c/0x24)
>  from [<c01d3fe8>] (__cpdma_chan_free+0xac/0xb0) [<c01d3fe8>]
>  (__cpdma_chan_free+0xac/0xb0) from [<c01d40d0>]
>  (__cpdma_chan_process+0xe4/0x114) [<c01d40d0>]
>  (__cpdma_chan_process+0xe4/0x114) from [<c01d4238>]
>  (cpdma_chan_stop+0xf0/0x1c8) [<c01d4238>] (cpdma_chan_stop+0xf0/0x1c8)
>  from [<c01d4390>] (cpdma_ctlr_stop+0x80/0xe4) [<c01d4390>]
>  (cpdma_ctlr_stop+0x80/0xe4) from [<c01d2c00>] (emac_dev_stop+0xb0/0x13c)
>  [<c01d2c00>] (emac_dev_stop+0xb0/0x13c) from [<c01f5b40>]
>  (__dev_close+0x74/0x98) [<c01f5b40>] (__dev_close+0x74/0x98) from
>  [<c01f3168>] (__dev_change_flags+0xac/0x13c) [<c01f3168>]
>  (__dev_change_flags+0xac/0x13c) from [<c01f5994>]
>  (dev_change_flags+0x10/0x44) [<c01f5994>] (dev_change_flags+0x10/0x44)
>  from [<c023cf60>] (devinet_ioctl+0x2dc/0x6f4) [<c023cf60>]
>  (devinet_ioctl+0x2dc/0x6f4) from [<c01e484c>] (sock_ioctl+0x1fc/0x258)
>  [<c01e484c>] (sock_ioctl+0x1fc/0x258) from [<c00aa0f0>]
>  (do_vfs_ioctl+0x550/0x5c0) [<c00aa0f0>] (do_vfs_ioctl+0x550/0x5c0) from
>  [<c00aa198>] (sys_ioctl+0x38/0x5c) [<c00aa198>] (sys_ioctl+0x38/0x5c) from
>  [<c0029f00>] (ret_fast_syscall+0x0/0x2c) ---[ end trace 0e686330480db12f
>  ]---
> 
> 
> Thanks for the help.
> 
> -Mike
> --
> To unsubscribe from this list: send the line "unsubscribe linux-omap" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cyril Chemparathy Sept. 10, 2010, 10:59 p.m. UTC | #3
Hi Mike,

I have merged your latest two emails and responded to both here.

[...]
> Your patch doesn't work with my board.  It does attempt to reset the bus on the read call, 
> but following wait_for_user_access() calls are timing out and the _read() and _write() calls punt.  
> I bumped up the MDIO_TIMEOUT to 100 ms, and it works.  I'm wondering if the scanning logic 
> has to complete an entire cycle (all 32 phys) before issuing a request, and if it's a lot 
> slower than expected.

Based on the mdio module's state machine code, the scan does not need to
complete before a user-request is issued.  However, when the module is
first enabled, it _must_ poll at least one phy (phy id 0) before it
handles the user-access request.  Consequently, the first access after
coming out of reset may be slower than subsequent accesses.

One of the patches posted on my repo [1] replaces the dumb mdelay() with
a bit more logic that calculates the worst-case access time.  This
mechanism may work a lot better for you.  Would you mind trying it out?

Since the MDIO_TIMEOUT is simply a defensive measure, I have no
objections to raising this timeout (included in the updated stack).

> I also found that the initial scanning logic would not reliably find the PHY until I bumped
> up the delay time following the reset operation.  Sometimes it would, sometimes no.
> 
> Also, your while(1) loops with the continue conditions on the second wait_for_user_access() 
> in the read and writes might need some consideration, i.e.:
> 
>         while (1) {
>                 ret = wait_for_user_access(data);
>                 if (ret == -EAGAIN)
>                         continue;
>                 if (ret < 0)
>                         break;
> 
>                 __raw_writel(reg, &data->regs->user[0].access);
> 
>                 ret = wait_for_user_access(data);
>                 if (ret == -EAGAIN)
>                         continue;
>                         ^^^^^^^^^ <--- this will re-issue the request.... what you want?

Yes, the wait_for_user_access() would have reset the controller,
throwing the current transaction out.  The intent here is to restart the
current transaction with a newly initialized controller.

>                 if (ret < 0)
>                         break;
> 
>                 reg = __raw_readl(&data->regs->user[0].access);
>                 ret = (reg & USERACCESS_ACK) ? (reg & USERACCESS_DATA) : -EIO;
>                 break;
>         }
> 
> Also, on the shutdown, I get a major kernel trace.  Here is the dump, as much 
> as I could catch of it.... (I need a better terminal program)

[...]
> WARNING: at drivers/net/davinci_emac.c:1025 __cpdma_chan_free+0xac/0xb0()

The current code spits out a huge volume of stuff as a result of a
WARN_ON in the rx handler.  The gitweb on [1] has a patch that fixes this.

[...]
>> > The MDIO module upgrade (rev 1.4 -> 1.5) could have something to do with
>> > this behavior.  Even so, I can't explain why this issue wasn't seen on
>> > da8xx prior to this series.  The original code should (at least in
>> > theory) have sporadically locked up on emac open.
>> > 
> I think, if I understand it correctly, that in the previous version of 
> this code, the emac was reset *prior* to enabling, scanning, and assigning 
> the associated phy on the MDIO bus.  The new implementation sets up and scans 
> the MDIO bus first, then comes back around to the EMAC second... hits a reset,
> and doesn't re-ENABLE the MDIO.

AFAICS, that isn't entirely accurate.  In the previous version, the mdio
bus was being brought up at probe time in davinci_emac_probe().  The
soft-reset was happening later on when the device is opened, in
emac_hw_enable().

The difference, however, is that the original code forced an
emac_mii_reset() immediately after the emac_hw_enable().  This is not
being done with the separated mdio, and that is the problem.  In terms
of behavior, with the current work around, the new and old versions
should be close to identical.  More below...

> Also, maybe hitting the EMAC reset while the MDIO state machine is up is *bad*, I 
> seem to recall some text in the user's guide about waiting for the state
> machine to stop before disabling it.  I wonder if that also applies to reset?

You are correct.  EMAC soft-reset stops the MDIO mid-transaction, quite
unlike disabling the module via the control register.  Therefore, there
is a risk that a badly designed phy could be left hanging in an
arbitrary state.  However, all earlier versions of the emac code have
been exposed to this very same vulnerability (i.e. arbitrary emac
soft-reset regardless of mdio state) all along.

Regards
Cyril.


[1]
http://arago-project.org/git/people/?p=cyril/linux-tnetv107x.git;a=shortlog;h=refs/heads/emac-cpdma-mdio-fixes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Caglar Akyuz Sept. 11, 2010, 8:54 a.m. UTC | #4
On Friday 10 September 2010 06:23:52 pm Caglar Akyuz wrote:
> On Friday 10 September 2010 12:25:40 am Michael Williamson wrote:
> > Hi Cyril,
> >
> > On 09/09/2010 03:51 PM, Cyril Chemparathy wrote:
> > > Hi Mike,
> > >
> > > [...]
> > >
> > >> The hang is in wait_for_user_access() in the davinci_mdio_read() call.
> > >> Looks like the state machine got put back into IDLE somewhere between
> > >> the MDIO probe and the EMAC probe. Seems like there should be some
> > >> sort of time-out and error message in the wait_for_user_access()
> > >> method.... (maybe even a check for IDLE??)
> > >>
> > >> If I add a patch to check the state machine for IDLE and then
> > >> re-enable it in the davinci_mdio_read() call, it is able to press on
> > >> and come up. I don't see any calls to the davinci_mdio_suspend() call,
> > >> so I am wondering if the EMAC probe routine, particularly the
> > >> application of the SOFTRESET, is causing the MDIO to drop back to IDLE
> > >> / disabled.
> > >>
> > >> I can post the patch if you like, but it is a bit of a hack...
> > >
> > > An EMAC soft-reset clobbering the MDIO controller state is a
> > > possibility.  I will poll TI designers to see if this could be the
> > > case.
> > >
> > > In any case, a couple of unanswered questions remain:
> > >
> > >   1. Why don't other davinci devices display similar behavior?
> > >
> > >   2. If the answer to #1 above is that the timing window is pretty slim
> > >      (i.e., only if an MDIO read/write is in progress during EMAC
> > >      soft-reset), why do we hit this situation consistently on
> > >      mityomap?
> >
> > Has it been confirmed that this only happens on mityomap?  Has anyone had
> >  success using a da850 evm or other da850 platform?  The configuration
> > for
> 
> Same problem exists on another DA850 board, Hawkboard.(Sorry no support in
> mainline yet)
> 
> >  the mityomap, wrt to the EMAC/MII/MDIO, is pretty much identical to the
> >  da850 evm using the MII interface. The only difference I am aware of is
> >  the assigned address to the PHY chip.  The reference clocks and rates
> > are identical, AFAIK, to the evm.
> >
> > > I have put together a quick patch (tested dm365).  See attached.
> >
> > Your patch doesn't work with my board.  It does attempt to reset the bus
> > on
> 
> This patch fixes the problem here. I'm using kernel IP auto configuration
>  and mounting fs over NFS. My system boots and I can login to my board.
> 
> Regards,
> Caglar
> 

Unfortunately emac driver is not stable after this series. I face lock-ups 
time to time, followed by attached kernel trace.

Regards,
Caglar
Michael Williamson Sept. 11, 2010, 1:14 p.m. UTC | #5
Hi Cyril,

On 09/10/2010 06:59 PM, Cyril Chemparathy wrote:
> Hi Mike,
> 
> I have merged your latest two emails and responded to both here.
> 
[...]
> One of the patches posted on my repo [1] replaces the dumb mdelay() with
> a bit more logic that calculates the worst-case access time.  This
> mechanism may work a lot better for you.  Would you mind trying it out?
> 

You patch from [1] is working much more reliably now (well, 6 for 6 boot cycles as well as
several ifup/ifdown cycles).  I do get the "resetting idled controlled" console message
every cycle, it seems that will be expected now.

[...]

>>
>> Also, your while(1) loops with the continue conditions on the second wait_for_user_access() 
>> in the read and writes might need some consideration, i.e.:
>>
>>         while (1) {
>>                 ret = wait_for_user_access(data);
>>                 if (ret == -EAGAIN)
>>                         continue;
>>                 if (ret < 0)
>>                         break;
>>
>>                 __raw_writel(reg, &data->regs->user[0].access);
>>
>>                 ret = wait_for_user_access(data);
>>                 if (ret == -EAGAIN)
>>                         continue;
>>                         ^^^^^^^^^ <--- this will re-issue the request.... what you want?
> 
> Yes, the wait_for_user_access() would have reset the controller,
> throwing the current transaction out.  The intent here is to restart the
> current transaction with a newly initialized controller.
> 

OK.  Makes sense.  Looking at it felt like there was a chance for end endless spin, but that
seems unlikely given how that condition might fire.

[...]

>> Also, on the shutdown, I get a major kernel trace.  Here is the dump, as much 
>> as I could catch of it.... (I need a better terminal program)
> 
> [...]
>> WARNING: at drivers/net/davinci_emac.c:1025 __cpdma_chan_free+0xac/0xb0()
> 
> The current code spits out a huge volume of stuff as a result of a
> WARN_ON in the rx handler.  The gitweb on [1] has a patch that fixes this.
> 

Yes,  these messages are no longer issued with the patches from [1].  Thanks.

> [...]
>>>> The MDIO module upgrade (rev 1.4 -> 1.5) could have something to do with
>>>> this behavior.  Even so, I can't explain why this issue wasn't seen on
>>>> da8xx prior to this series.  The original code should (at least in
>>>> theory) have sporadically locked up on emac open.
>>>>
>> I think, if I understand it correctly, that in the previous version of 
>> this code, the emac was reset *prior* to enabling, scanning, and assigning 
>> the associated phy on the MDIO bus.  The new implementation sets up and scans 
>> the MDIO bus first, then comes back around to the EMAC second... hits a reset,
>> and doesn't re-ENABLE the MDIO.
> 
> AFAICS, that isn't entirely accurate.  In the previous version, the mdio
> bus was being brought up at probe time in davinci_emac_probe().  The
> soft-reset was happening later on when the device is opened, in
> emac_hw_enable().
> 
> The difference, however, is that the original code forced an
> emac_mii_reset() immediately after the emac_hw_enable().  This is not
> being done with the separated mdio, and that is the problem.  In terms
> of behavior, with the current work around, the new and old versions
> should be close to identical.  More below...
> 
>> Also, maybe hitting the EMAC reset while the MDIO state machine is up is *bad*, I 
>> seem to recall some text in the user's guide about waiting for the state
>> machine to stop before disabling it.  I wonder if that also applies to reset?
> 
> You are correct.  EMAC soft-reset stops the MDIO mid-transaction, quite
> unlike disabling the module via the control register.  Therefore, there
> is a risk that a badly designed phy could be left hanging in an
> arbitrary state.  However, all earlier versions of the emac code have
> been exposed to this very same vulnerability (i.e. arbitrary emac
> soft-reset regardless of mdio state) all along.
> 

Thanks for straightening me out on this, Cryil.  Your patch series in [1] seems to
have resolved the issues I've been able to see on the da850 based board I'm using
here.  I appreciate your patience and quick response.  I may try to beat on it a 
bit more with some network performance tests (even though it's not at all related 
to the immediate problems you've fixed) -- we've had that on our list of todos 
anyway for this module.

I think it would be good to float your patches over to davinci-next, if possible, before the
37 merge window opens...  Speaking, of course, from a very partial perspective.

> [1]
> http://arago-project.org/git/people/?p=cyril/linux-tnetv107x.git;a=shortlog;h=refs/heads/emac-cpdma-mdio-fixes

-Mike
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cyril Chemparathy Sept. 13, 2010, 2:09 p.m. UTC | #6
Hi Caglar,

[...]
> Unfortunately emac driver is not stable after this series. I face lock-ups 
> time to time, followed by attached kernel trace.

Could you elaborate on your test scenario so that I can try and
reproduce the problem at my end?  Also, did you have the contents of my
commit stack in this particular kernel build?

Assuming that the DMA got stuck at some point leading up to the transmit
timeout, any ideas as to why a host error was not thrown?  To help
debug, I'll post out a set of patches that dump out the MAC (and DMA)
registers on timeout.  That should give us some visibility into the problem.

> [ 1651.440000] nfs: server 192.168.2.34 not responding, still trying
> [ 1859.010000] ------------[ cut here ]------------
> [ 1859.010000] WARNING: at net/sched/sch_generic.c:258 
> dev_watchdog+0x184/0x294()
> [ 1859.020000] NETDEV WATCHDOG: eth0 (davinci_emac): transmit queue 0 timed 
> out
[...]

Regards
Cyril.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cyril Chemparathy Sept. 13, 2010, 3:46 p.m. UTC | #7
Hi Caglar,

[...]
> Assuming that the DMA got stuck at some point leading up to the transmit
> timeout, any ideas as to why a host error was not thrown?  To help
> debug, I'll post out a set of patches that dump out the MAC (and DMA)
> registers on timeout.  That should give us some visibility into the problem.

I have posted a couple of additional patches for this on [1].  Would you
mind giving these a quick try?  The register dumps should prove useful
in figuring this out.

Regards
Cyril.

[1]
http://arago-project.org/git/people/?p=cyril/linux-tnetv107x.git;a=shortlog;h=refs/heads/emac-cpdma-mdio-fixes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Caglar Akyuz Sept. 13, 2010, 5:51 p.m. UTC | #8
On Monday 13 September 2010 05:09:15 pm Cyril Chemparathy wrote:
> Hi Caglar,
> 
> [...]
> 
> > Unfortunately emac driver is not stable after this series. I face
> > lock-ups time to time, followed by attached kernel trace.
> 
> Could you elaborate on your test scenario so that I can try and
> reproduce the problem at my end?  Also, did you have the contents of my
> commit stack in this particular kernel build?
> 

Ooops! I didn't noticed your commits till you noted, I was working with 
DaVinci head. Applying patches in your tree solves all my problems and all my 
use cases are working now. However, I just barely tested them.

To make myself forgiven here are my 'netperf' numbers before and after your 
patches applied:

  TCP_STREAM  	:	54.64 Mbit vs 52.84 Mbit        
  UDP_STREAM   	:       96.27 Mbit vs 96.22 Mbit

Regards,
Caglar

> Assuming that the DMA got stuck at some point leading up to the transmit
> timeout, any ideas as to why a host error was not thrown?  To help
> debug, I'll post out a set of patches that dump out the MAC (and DMA)
> registers on timeout.  That should give us some visibility into the
>  problem.
> 
> > [ 1651.440000] nfs: server 192.168.2.34 not responding, still trying
> > [ 1859.010000] ------------[ cut here ]------------
> > [ 1859.010000] WARNING: at net/sched/sch_generic.c:258
> > dev_watchdog+0x184/0x294()
> > [ 1859.020000] NETDEV WATCHDOG: eth0 (davinci_emac): transmit queue 0
> > timed out
> 
> [...]
> 
> Regards
> Cyril.
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/davinci_mdio.c b/drivers/net/davinci_mdio.c
index d34a53a..96a0f9e 100644
--- a/drivers/net/davinci_mdio.c
+++ b/drivers/net/davinci_mdio.c
@@ -36,6 +36,7 @@ 
 #include <linux/io.h>
 #include <linux/davinci_emac.h>
 
+#define MDIO_TIMEOUT		10		/* msecs */
 #define PHY_REG_MASK		0x1f
 #define PHY_ID_MASK		0x1f
 
@@ -85,31 +86,74 @@  struct davinci_mdio_data {
 	bool		suspended;
 };
 
+static void __davinci_mdio_reset(struct mii_bus *bus)
+{
+	struct davinci_mdio_data *data = bus->priv;
+	u32 mdio_in_freq, div;
+
+	mdio_in_freq = clk_get_rate(data->clk);
+	div = (mdio_in_freq / data->pdata.bus_freq) - 1;
+	if (div > CONTROL_MAX_DIV)
+		div = CONTROL_MAX_DIV;
+
+	/* set enable and clock divider */
+	__raw_writel(div | CONTROL_ENABLE, &data->regs->control);
+}
+
 /* wait until hardware is ready for another user access */
-static inline u32 wait_for_user_access(struct davinci_mdio_data *data)
+static inline int wait_for_user_access(struct davinci_mdio_data *data)
 {
-	struct davinci_mdio_regs __iomem *regs = data->regs;
+	unsigned long timeout = jiffies + msecs_to_jiffies(MDIO_TIMEOUT);
 	u32 reg;
+	int ret = -ETIMEDOUT;
 
-	while ((reg = __raw_readl(&regs->user[0].access)) & USERACCESS_GO)
-		;
+	while (time_after(timeout, jiffies)) {
+		reg = __raw_readl(&data->regs->user[0].access);
+		if ((reg & USERACCESS_GO) == 0) {
+			ret = 0;
+			break;
+		}
 
-	return reg;
+		reg = __raw_readl(&data->regs->control);
+		if (reg & CONTROL_IDLE) {
+			/*
+			 * An emac soft_reset may have clobbered the mdio
+			 * controller's state machine.  We need to reset and
+			 * retry the current operation
+			 */
+			dev_warn(data->dev, "controller idle in transaction, "
+					    "resetting\n");
+			__davinci_mdio_reset(data->bus);
+			ret = -EAGAIN;
+			break;
+		}
+	}
+	return ret;
 }
 
 /* wait until hardware state machine is idle */
-static inline void wait_for_idle(struct davinci_mdio_data *data)
+static inline int wait_for_idle(struct davinci_mdio_data *data)
 {
-	struct davinci_mdio_regs __iomem *regs = data->regs;
+	unsigned long timeout = jiffies + msecs_to_jiffies(MDIO_TIMEOUT);
+	int ret = -ETIMEDOUT;
+	u32 reg;
 
-	while ((__raw_readl(&regs->control) & CONTROL_IDLE) == 0)
-		;
+	while (time_after(timeout, jiffies)) {
+		reg = __raw_readl(&data->regs->control);
+		if (reg & CONTROL_IDLE) {
+			ret = 0;
+			break;
+		}
+	}
+
+	return ret;
 }
 
 static int davinci_mdio_read(struct mii_bus *bus, int phy_id, int phy_reg)
 {
 	struct davinci_mdio_data *data = bus->priv;
 	u32 reg;
+	int ret;
 
 	if (phy_reg & ~PHY_REG_MASK || phy_id & ~PHY_ID_MASK)
 		return -EINVAL;
@@ -121,14 +165,32 @@  static int davinci_mdio_read(struct mii_bus *bus, int phy_id, int phy_reg)
 		return -ENODEV;
 	}
 
-	wait_for_user_access(data);
 	reg = (USERACCESS_GO | USERACCESS_READ | (phy_reg << 21) |
 	       (phy_id << 16));
-	__raw_writel(reg, &data->regs->user[0].access);
-	reg = wait_for_user_access(data);
+
+	while (1) {
+		ret = wait_for_user_access(data);
+		if (ret == -EAGAIN)
+			continue;
+		if (ret < 0)
+			break;
+
+		__raw_writel(reg, &data->regs->user[0].access);
+
+		ret = wait_for_user_access(data);
+		if (ret == -EAGAIN)
+			continue;
+		if (ret < 0)
+			break;
+
+		reg = __raw_readl(&data->regs->user[0].access);
+		ret = (reg & USERACCESS_ACK) ? (reg & USERACCESS_DATA) : -EIO;
+		break;
+	}
+
 	spin_unlock(&data->lock);
 
-	return (reg & USERACCESS_ACK) ? (reg & USERACCESS_DATA) : -EIO;
+	return ret;
 }
 
 static int davinci_mdio_write(struct mii_bus *bus, int phy_id,
@@ -136,6 +198,7 @@  static int davinci_mdio_write(struct mii_bus *bus, int phy_id,
 {
 	struct davinci_mdio_data *data = bus->priv;
 	u32 reg;
+	int ret;
 
 	if (phy_reg & ~PHY_REG_MASK || phy_id & ~PHY_ID_MASK)
 		return -EINVAL;
@@ -147,23 +210,68 @@  static int davinci_mdio_write(struct mii_bus *bus, int phy_id,
 		return -ENODEV;
 	}
 
-	wait_for_user_access(data);
 	reg = (USERACCESS_GO | USERACCESS_WRITE | (phy_reg << 21) |
 		   (phy_id << 16) | (phy_data & USERACCESS_DATA));
-	__raw_writel(reg, &data->regs->user[0].access);
-	wait_for_user_access(data);
+
+	while (1) {
+		ret = wait_for_user_access(data);
+		if (ret == -EAGAIN)
+			continue;
+		if (ret < 0)
+			break;
+
+		__raw_writel(reg, &data->regs->user[0].access);
+
+		ret = wait_for_user_access(data);
+		if (ret == -EAGAIN)
+			continue;
+		break;
+	}
+
 	spin_unlock(&data->lock);
 
 	return 0;
 }
 
+static int davinci_mdio_reset(struct mii_bus *bus)
+{
+	struct davinci_mdio_data *data = bus->priv;
+	u32 phy_mask;
+
+	__davinci_mdio_reset(bus);
+
+	/*
+	 * wait for scan logic to settle:
+	 * the scan time consists of (a) a large fixed component, and (b) a
+	 * small component that varies with the mii bus frequency.  These
+	 * were estimated using measurements at 1.1 and 2.2 MHz on tnetv107x
+	 * silicon.  Since the effect of (b) was found to be largely
+	 * negligible, we keep things simple here.
+	 */
+	mdelay(1);
+
+	/* get phy mask from the alive register */
+	phy_mask = __raw_readl(&data->regs->alive);
+	if (phy_mask) {
+		/* restrict mdio bus to live phys only */
+		dev_info(data->dev, "detected phy mask %x\n", ~phy_mask);
+		phy_mask = ~phy_mask;
+	} else {
+		/* desperately scan all phys */
+		dev_warn(data->dev, "failed to detect live phys, scanning all\n");
+		phy_mask = 0;
+	}
+	bus->phy_mask = phy_mask;
+
+	return 0;
+}
+
 static int __devinit davinci_mdio_probe(struct platform_device *pdev)
 {
 	struct mdio_platform_data *pdata = pdev->dev.platform_data;
 	struct device *dev = &pdev->dev;
 	struct davinci_mdio_data *data;
 	struct resource *res;
-	u32 mdio_in_freq, mdio_out_freq, div, phy_mask, ver;
 	struct phy_device *phy;
 	int ret, addr;
 
@@ -185,6 +293,7 @@  static int __devinit davinci_mdio_probe(struct platform_device *pdev)
 	data->bus->name		= dev_name(dev);
 	data->bus->read		= davinci_mdio_read,
 	data->bus->write	= davinci_mdio_write,
+	data->bus->reset	= davinci_mdio_reset,
 	data->bus->parent	= dev;
 	data->bus->priv		= data;
 	snprintf(data->bus->id, MII_BUS_ID_SIZE, "%x", pdev->id);
@@ -225,43 +334,6 @@  static int __devinit davinci_mdio_probe(struct platform_device *pdev)
 		goto bail_out;
 	}
 
-	mdio_in_freq = clk_get_rate(data->clk);
-	div = (mdio_in_freq / data->pdata.bus_freq) - 1;
-	if (div > CONTROL_MAX_DIV)
-		div = CONTROL_MAX_DIV;
-	mdio_out_freq = mdio_in_freq / (div + 1);
-
-	/* set enable and clock divider */
-	__raw_writel(div | CONTROL_ENABLE, &data->regs->control);
-
-	/*
-	 * wait for scan logic to settle:
-	 * the scan time consists of (a) a large fixed component, and (b) a
-	 * small component that varies with the mii bus frequency.  These
-	 * were estimated using measurements at 1.1 and 2.2 MHz on tnetv107x
-	 * silicon.  Since the effect of (b) was found to be largely
-	 * negligible, we keep things simple here.
-	 */
-	mdelay(1);
-
-	/* dump hardware version info */
-	ver = __raw_readl(&data->regs->version);
-	dev_info(dev, "davinci mdio revision %d.%d\n",
-			(ver >> 8) & 0xff, ver & 0xff);
-
-	/* get phy mask from the alive register */
-	phy_mask = __raw_readl(&data->regs->alive);
-	if (phy_mask) {
-		/* restrict mdio bus to live phys only */
-		dev_info(dev, "detected phy mask %x\n", ~phy_mask);
-		phy_mask = ~phy_mask;
-	} else {
-		/* desperately scan all phys */
-		dev_warn(dev, "failed to detect live phys, scanning all\n");
-		phy_mask = 0;
-	}
-	data->bus->phy_mask = phy_mask;
-
 	/* register the mii bus */
 	ret = mdiobus_register(data->bus);
 	if (ret)
@@ -324,7 +396,7 @@  static int davinci_mdio_suspend(struct device *dev)
 	ctrl = __raw_readl(&data->regs->control);
 	ctrl &= ~CONTROL_ENABLE;
 	__raw_writel(ctrl, &data->regs->control);
-	wait_for_idle(data);
+	WARN_ON(wait_for_idle(data) < 0);
 
 	if (data->clk)
 		clk_disable(data->clk);