diff mbox

BQL support in gianfar causes network hickup

Message ID 1353950255.7553.6.camel@edumazet-glaptop
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Nov. 26, 2012, 5:17 p.m. UTC
On Mon, 2012-11-26 at 18:08 +0100, Keitel, Tino (ALC NetworX GmbH)
wrote:
> On Mo, 2012-11-26 at 08:34 -0800, Eric Dumazet wrote:
> > On Mon, 2012-11-26 at 11:01 +0100, Tino Keitel wrote:
> > > On Sat, Nov 24, 2012 at 15:43:36 -0800, Eric Dumazet wrote:
> > > 
> > > [...]
> > > 
> > > > Hmm, I wonder if BQL makes a particular bug showing more often.
> > > > 
> > > > I see gianfar uses a very small watchdog_timeo of 1 second, while many
> > > > drivers use 5 seconds.
> > > > 
> > > > What happens if you change this to 5 seconds ?
> > > 
> > > I still got the trace and a failing ptp client.
> > > 
> > 
> > Thanks. Is this bug easy to trigger ?
> > 
> > I suspect a core issue and a race, likely to happen on your (non x86)
> > hardware
> > 
> > Could you add the following debugging patch ?
> 
> No visible difference:

OK it seems you trigger the problem fast !

Please try the following as well :



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Nov. 27, 2012, 12:36 p.m. UTC | #1
On Tue, 2012-11-27 at 10:36 +0100, Keitel, Tino (ALC NetworX GmbH)
wrote:
> On Mo, 2012-11-26 at 09:17 -0800, Eric Dumazet wrote:
> > On Mon, 2012-11-26 at 18:08 +0100, Keitel, Tino (ALC NetworX GmbH)
> > wrote:
> > > On Mo, 2012-11-26 at 08:34 -0800, Eric Dumazet wrote:
> > > > On Mon, 2012-11-26 at 11:01 +0100, Tino Keitel wrote:
> > > > > On Sat, Nov 24, 2012 at 15:43:36 -0800, Eric Dumazet wrote:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > Hmm, I wonder if BQL makes a particular bug showing more often.
> > > > > > 
> > > > > > I see gianfar uses a very small watchdog_timeo of 1 second, while many
> > > > > > drivers use 5 seconds.
> > > > > > 
> > > > > > What happens if you change this to 5 seconds ?
> > > > > 
> > > > > I still got the trace and a failing ptp client.
> > > > > 
> > > > 
> > > > Thanks. Is this bug easy to trigger ?
> > > > 
> > > > I suspect a core issue and a race, likely to happen on your (non x86)
> > > > hardware
> > > > 
> > > > Could you add the following debugging patch ?
> > > 
> > > No visible difference:
> > 
> > OK it seems you trigger the problem fast !
> > 
> > Please try the following as well :
> 
> Hi,
> 
> yes, it can be triggered within 2 minutes.
> 
> The patch makes no difference:

Can you reproduce the problem using a single cpu ?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 27, 2012, 1:32 p.m. UTC | #2
On Tue, 2012-11-27 at 13:42 +0100, Keitel, Tino (ALC NetworX GmbH)
wrote:
> On Di, 2012-11-27 at 04:36 -0800, Eric Dumazet wrote:
> > 
> > Can you reproduce the problem using a single cpu ?
> 
> Yes, it is a single-CPU system.

Can you reproduce the problem without PTP running, or disabled in the
driver ?

(comment the "priv->hwts_tx_en = 1;" line)


This looks like we miss an interrupt ( or TXBD_INTERRUPT not correctly
set)

And it could be a bug occurring if we try to send one skb with fragments
and skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP 



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 27, 2012, 1:49 p.m. UTC | #3
On Tue, 2012-11-27 at 05:32 -0800, Eric Dumazet wrote:

> Can you reproduce the problem without PTP running, or disabled in the
> driver ?
> 
> (comment the "priv->hwts_tx_en = 1;" line)
> 
> 
> This looks like we miss an interrupt ( or TXBD_INTERRUPT not correctly
> set)
> 
> And it could be a bug occurring if we try to send one skb with fragments
> and skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP 
> 
> 

By the way are any errata flagged in gfar_detect_errata() ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul Gortmaker Feb. 6, 2013, 1:55 a.m. UTC | #4
On Tue, Feb 5, 2013 at 8:00 AM, Keitel, Tino (ALC NetworX GmbH)
<Tino.Keitel@alcnetworx.de> wrote:
> On Di, 2012-11-27 at 05:32 -0800, Eric Dumazet wrote:
>> On Tue, 2012-11-27 at 13:42 +0100, Keitel, Tino (ALC NetworX GmbH)
>> wrote:
>> > On Di, 2012-11-27 at 04:36 -0800, Eric Dumazet wrote:
>> > >
>> > > Can you reproduce the problem using a single cpu ?
>> >
>> > Yes, it is a single-CPU system.
>>
>> Can you reproduce the problem without PTP running, or disabled in the
>> driver ?
>>
>> (comment the "priv->hwts_tx_en = 1;" line)
>
> I can't reproduce it with that line commented. However, so far I was
> only able to reproduce it when starting the ptp2 client, so maybe this
> is connected.

How critical is ptp2 for this?  And/or the platform details?  I can
try and reproduce it on an mpc8349 system and/or an mpc8548
system (and even an mpc8641D system) but I'd rather know that
was a meaningful chase and not a snipe hunt before going there.

So, in that respect, a "If you run this, you will get this" type of
error message would be good.   I understand that may be too
idealistic, though.

Thanks,
Paul.
--

>
>> By the way are any errata flagged in gfar_detect_errata() ?
>
> This is from dmesg:
>
> fsl-gianfar e0024000.ethernet: enabled errata workarounds, flags: 0x7
>
> 0x7 would be GFAR_ERRATA_74, GFAR_ERRATA_76 and GFAR_ERRATA_A002
> according to drivers/net/ethernet/freescale/gianfar.h.
>
> Regards,
> Tino
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Keitel, Tino (ALC NetworX GmbH) Feb. 6, 2013, 3:20 p.m. UTC | #5
On Di, 2013-02-05 at 20:55 -0500, Paul Gortmaker wrote:
> How critical is ptp2 for this?  And/or the platform details?  I can
> try and reproduce it on an mpc8349 system and/or an mpc8548
> system (and even an mpc8641D system) but I'd rather know that
> was a meaningful chase and not a snipe hunt before going there.
> 
> So, in that respect, a "If you run this, you will get this" type of
> error message would be good.   I understand that may be too
> idealistic, though.

Hi,

I'm using an MPC8313ERDB board and was able to reproduce it with a
vanilla 3.6.7 kernel. There are some modifications to the device tree,
though. I'll try to reproduce this with as few changes to the device
tree as possible.

Ptp2 is critical for the desired use. So far I also only was able to
reproduce it with this ptp2 client.

If it is a possible option for you, I could provide an affected device
to make the issue easy to reproduce.

Regards,
Tino
Paul Gortmaker Feb. 7, 2013, 9:05 p.m. UTC | #6
On 13-02-05 08:00 AM, Keitel, Tino (ALC NetworX GmbH) wrote:
> On Di, 2012-11-27 at 05:32 -0800, Eric Dumazet wrote:
>> On Tue, 2012-11-27 at 13:42 +0100, Keitel, Tino (ALC NetworX GmbH)
>> wrote:
>>> On Di, 2012-11-27 at 04:36 -0800, Eric Dumazet wrote:
>>>>
>>>> Can you reproduce the problem using a single cpu ?
>>>
>>> Yes, it is a single-CPU system.
>>
>> Can you reproduce the problem without PTP running, or disabled in the
>> driver ?
>>
>> (comment the "priv->hwts_tx_en = 1;" line)
> 
> I can't reproduce it with that line commented. However, so far I was
> only able to reproduce it when starting the ptp2 client, so maybe this
> is connected.

I found an mpc8315erdb, and built the default yocto build (v3.4.20,
which should have the issue based on your earlier reports.)

> 
>> By the way are any errata flagged in gfar_detect_errata() ?
> 
> This is from dmesg:
> 
> fsl-gianfar e0024000.ethernet: enabled errata workarounds, flags: 0x7
> 
> 0x7 would be GFAR_ERRATA_74, GFAR_ERRATA_76 and GFAR_ERRATA_A002
> according to drivers/net/ethernet/freescale/gianfar.h.

The MPC8315ECE.pdf lists the 8315 as having 76 and A002 (amongst
a lot of others.)  However the driver only does errata checks
for the 8313 it seems.  I also then manually forced the driver
to enable the errata and confirmed I saw 0x7 flag in dmesg.

Finally I added the meta-networking layer to the yocto build to
get a ptpd2 (2.2.0-r1)

On another board, I ran a server as:
    ptpd2 -G -C -c -b eth0

On the 8315, I ran the client as:
    ptpd2 -g -C -c -b eth0

In neither case (errata off, and errata manually enabled) I did not
manage to get the tx timeout that you got.

There must be something more specific to your environment, your
ptpd client (and args) and perhaps the board itself.

Paul.
--

> 
> Regards,
> Tino
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Claudiu Manoil April 29, 2013, 1:14 p.m. UTC | #7
Hi,

I think I found a way to reproduce this easily, by forcing
timestamping on Tx with the following code change:

"""
drivers/net/ethernet/freescale/gianfar.c:
@@ -1210,6 +1210,8 @@ static int gfar_probe(struct platform_device *ofdev)
         /* Create all the sysfs files */
         gfar_init_sysfs(dev);

+       priv->hwts_tx_en = 1;
+
         /* Print out the device info */
         netdev_info(dev, "mac: %pM\n", dev->dev_addr);

@@ -2084,6 +2087,7 @@ static int gfar_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
         txq = netdev_get_tx_queue(dev, rq);
         base = tx_queue->tx_bd_base;
         regs = tx_queue->grp->regs;
+       skb_shinfo(skb)->tx_flags |= SKBTX_HW_TSTAMP;

         /* check if time stamp should be generated */
         if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP &&
"""

With this change, tx timeout triggers with the first packet sent.
I tested it on p1020/ p1010 boards, so it's not related to a specific
board.

The proposed fix (passes this test):
http://patchwork.ozlabs.org/patch/240365/

Does it work for you?
I'm a bit concerned about BQL support being added to Gianfar. Seems that
it only adds an overhead to this driver and possible failure points.
I'm not sure whether Gianfar has the issue of excessive H/W queuing.
I'm also wondering why only 5 or 6 eth drivers integrated BQL to date.

Regards,
Claudiu

On 2/6/2013 5:20 PM, Keitel, Tino (ALC NetworX GmbH) wrote:
> On Di, 2013-02-05 at 20:55 -0500, Paul Gortmaker wrote:
>> How critical is ptp2 for this?  And/or the platform details?  I can
>> try and reproduce it on an mpc8349 system and/or an mpc8548
>> system (and even an mpc8641D system) but I'd rather know that
>> was a meaningful chase and not a snipe hunt before going there.
>>
>> So, in that respect, a "If you run this, you will get this" type of
>> error message would be good.   I understand that may be too
>> idealistic, though.
>
> Hi,
>
> I'm using an MPC8313ERDB board and was able to reproduce it with a
> vanilla 3.6.7 kernel. There are some modifications to the device tree,
> though. I'll try to reproduce this with as few changes to the device
> tree as possible.
>
> Ptp2 is critical for the desired use. So far I also only was able to
> reproduce it with this ptp2 client.
>
> If it is a possible option for you, I could provide an affected device
> to make the issue easy to reproduce.
>
> Regards,
> Tino


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Keitel, Tino (ALC NetworX GmbH) April 29, 2013, 1:20 p.m. UTC | #8
On Mo, 2013-04-29 at 15:14 +0200, Claudiu Manoil wrote:

> The proposed fix (passes this test):
> http://patchwork.ozlabs.org/patch/240365/
> 
> Does it work for you?

Hi,

thanks a lot. I'll try the patch and report back.

Regards,
Tino


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Keitel, Tino (ALC NetworX GmbH) May 27, 2013, 12:47 p.m. UTC | #9
On Mo, 2013-04-29 at 15:20 +0200, Tino Keitel wrote:
> On Mo, 2013-04-29 at 15:14 +0200, Claudiu Manoil wrote:
> 
> > The proposed fix (passes this test):
> > http://patchwork.ozlabs.org/patch/240365/
> > 
> > Does it work for you?

Hi,

I tested the patch over the weekend and it looks fine. I then re-checked
without the patch and got the bug immediately. So the patch seems to
work fine.

Thanks a lot and regards,
Tino Keitel


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 19ac096..77190bf 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -101,7 +101,7 @@ 
 
 #include "gianfar.h"
 
-#define TX_TIMEOUT      (1*HZ)
+#define TX_TIMEOUT      (5*HZ)
 
 const char gfar_driver_version[] = "1.3";
 
@@ -2465,9 +2465,9 @@  static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 	struct sk_buff *skb;
 	int skb_dirtytx;
 	int tx_ring_size = tx_queue->tx_ring_size;
-	int frags = 0, nr_txbds = 0;
+	int frags, nr_txbds;
 	int i;
-	int howmany = 0;
+	int howmany = 0, total_txbds = 0;
 	int tqi = tx_queue->qindex;
 	unsigned int bytes_sent = 0;
 	u32 lstatus;
@@ -2479,7 +2479,6 @@  static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 	skb_dirtytx = tx_queue->skb_dirtytx;
 
 	while ((skb = tx_queue->tx_skbuff[skb_dirtytx])) {
-		unsigned long flags;
 
 		frags = skb_shinfo(skb)->nr_frags;
 
@@ -2541,20 +2540,24 @@  static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
 			      TX_RING_MOD_MASK(tx_ring_size);
 
 		howmany++;
-		spin_lock_irqsave(&tx_queue->txlock, flags);
-		tx_queue->num_txbdfree += nr_txbds;
-		spin_unlock_irqrestore(&tx_queue->txlock, flags);
+		total_txbds += nr_txbds;
 	}
 
-	/* If we freed a buffer, we can restart transmission, if necessary */
-	if (netif_tx_queue_stopped(txq) && tx_queue->num_txbdfree)
-		netif_wake_subqueue(dev, tqi);
+	if (howmany) {
+		unsigned long flags;
 
+		spin_lock_irqsave(&tx_queue->txlock, flags);
+		tx_queue->num_txbdfree += total_txbds;
+		/* If we freed a buffer, we can restart transmission, if necessary */
+		if (netif_tx_queue_stopped(txq))
+			netif_tx_wake_queue(txq);
+		netdev_tx_completed_queue(txq, howmany, bytes_sent);
+		spin_unlock_irqrestore(&tx_queue->txlock, flags);
+	}
 	/* Update dirty indicators */
 	tx_queue->skb_dirtytx = skb_dirtytx;
 	tx_queue->dirty_tx = bdp;
 
-	netdev_tx_completed_queue(txq, howmany, bytes_sent);
 
 	return howmany;
 }