diff mbox

[09/12] net: mediatek: increase watchdog_timeo

Message ID 1465108385-38286-10-git-send-email-john@phrozen.org
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

John Crispin June 5, 2016, 6:33 a.m. UTC
During stress testing, after reducing the threshold value, we have seen
TX timeouts that were caused by the watchdog_timeo value being too low.
Increase the value to 5 * HZ which is a value commonly used by many other
drivers.

Signed-off-by: John Crispin <john@phrozen.org>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Andrew Lunn June 5, 2016, 2:56 p.m. UTC | #1
On Sun, Jun 05, 2016 at 08:33:02AM +0200, John Crispin wrote:
> During stress testing, after reducing the threshold value, we have seen
> TX timeouts that were caused by the watchdog_timeo value being too low.
> Increase the value to 5 * HZ which is a value commonly used by many other
> drivers.

I've never studied what watchdog_timeo actually means. Does it mean a
transmit has not completed in that amount of time? Would this imply
you have 5 seconds worth of packets in your transmit queue? Do you
know what the driver is doing during this 5 seconds?

Thanks
    Andrew
John Crispin June 6, 2016, 6:24 a.m. UTC | #2
On 05/06/2016 16:56, Andrew Lunn wrote:
> On Sun, Jun 05, 2016 at 08:33:02AM +0200, John Crispin wrote:
>> During stress testing, after reducing the threshold value, we have seen
>> TX timeouts that were caused by the watchdog_timeo value being too low.
>> Increase the value to 5 * HZ which is a value commonly used by many other
>> drivers.
> 
> I've never studied what watchdog_timeo actually means. Does it mean a
> transmit has not completed in that amount of time? Would this imply
> you have 5 seconds worth of packets in your transmit queue? Do you
> know what the driver is doing during this 5 seconds?

Hi Andrew,

it is waiting for the watchdog to trigger :-) TBH the 1s seems to be too
short to for the dma ring length to be flushed and i had to pick some
value and 5 is used most places.

it really depends on the amount of packets in the queue, their length
and the mac setting. the timeout needs to be large enough that it would
not trigger incorrectly even if the mac is on 10mbit half duplex and all
frames in the queue were maximum size.

	John
Andrew Lunn June 6, 2016, 12:21 p.m. UTC | #3
> Hi Andrew,
> 
> it is waiting for the watchdog to trigger :-) TBH the 1s seems to be too
> short to for the dma ring length to be flushed and i had to pick some
> value and 5 is used most places.
> 
> it really depends on the amount of packets in the queue, their length
> and the mac setting. the timeout needs to be large enough that it would
> not trigger incorrectly even if the mac is on 10mbit half duplex and all
> frames in the queue were maximum size.

So you are saying there is 5 seconds worth of traffic in the transmit ring.

As a general point, not specific to this driver, is that wise? Isn't
that really bad buffer bloat?

I just wondered what happened to cause it to have 5 seconds worth of
traffic in the transmit ring. Did downstream signal a pause?  But i
thought the byte queue limit was designed to prevent a big backlog in
the transmit queue? At 10/Half, is it not reacting fast enough?  Since
it is half duplex, do you have a lot of traffic coming the other way
and something is not being fair at distributing up and down traffic?

I'm just wondering if by increasing the watchdog to 5 seconds, you are
just hiding a problem.

    Andrew
John Crispin June 6, 2016, 12:38 p.m. UTC | #4
On 06/06/2016 14:21, Andrew Lunn wrote:
>> Hi Andrew,
>>
>> it is waiting for the watchdog to trigger :-) TBH the 1s seems to be too
>> short to for the dma ring length to be flushed and i had to pick some
>> value and 5 is used most places.
>>
>> it really depends on the amount of packets in the queue, their length
>> and the mac setting. the timeout needs to be large enough that it would
>> not trigger incorrectly even if the mac is on 10mbit half duplex and all
>> frames in the queue were maximum size.
> 
> So you are saying there is 5 seconds worth of traffic in the transmit ring.
> 
> As a general point, not specific to this driver, is that wise? Isn't
> that really bad buffer bloat?
> 
> I just wondered what happened to cause it to have 5 seconds worth of
> traffic in the transmit ring. Did downstream signal a pause?  But i
> thought the byte queue limit was designed to prevent a big backlog in
> the transmit queue? At 10/Half, is it not reacting fast enough?  Since
> it is half duplex, do you have a lot of traffic coming the other way
> and something is not being fair at distributing up and down traffic?
> 
> I'm just wondering if by increasing the watchdog to 5 seconds, you are
> just hiding a problem.
> 
>     Andrew

Hi Andrew,

running the driver without any QoS and using the typical ringsize for
gigabit devices, 1s is not enough. we were seeing false positive
watchdog events. then i grepped to see what other drivers do and most
set 5seconds. ideally the watchdog never triggers as the driver is
functional an does not suffer from deadlocks.

at gbit ethernet can transmit 83 packets that are 1500 bytes long /
second, if there are no pause gaps. at 10Mbit that would be 6 packets.

so assuming we have a ring of 128 and napi set to 64, we would want at
least 2 seconds, 3 if there are a lot of pause gaps, 4-5 if it is half
duplex ... so i took the value commonly used, which is 5 according to grep.

figuring out when the queue is stuck seems to be a little bit more
complicated. imho the trigger should not be based on how long it took to
send a packet, but how long since the last packet was dequeued from dma.

personally i'd rather fix the deadlocks that can happen, which is what
we did, than rely on the watchdog to reset the queue. right now we can
hammer the driver with several streams on both macs utilizing all 4 cpu
cores for several days without seeing any hickups.

		John
diff mbox

Patch

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 8b289e1..24714ab 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -1688,7 +1688,7 @@  static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
 	mac->hw_stats->reg_offset = id * MTK_STAT_OFFSET;
 
 	SET_NETDEV_DEV(eth->netdev[id], eth->dev);
-	eth->netdev[id]->watchdog_timeo = HZ;
+	eth->netdev[id]->watchdog_timeo = 5 * HZ;
 	eth->netdev[id]->netdev_ops = &mtk_netdev_ops;
 	eth->netdev[id]->base_addr = (unsigned long)eth->base;
 	eth->netdev[id]->vlan_features = MTK_HW_FEATURES &