
Throughput regression with `tcp: refine TSO autosizing`

Message ID 1423156205.31870.86.camel@edumazet-glaptop2.roam.corp.google.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Feb. 5, 2015, 5:10 p.m. UTC
On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:

> Not at all. This basically removes backpressure.
> 
> A single UDP socket can now blast packets regardless of SO_SNDBUF
> limits.
> 
> This basically removes years of work trying to fix bufferbloat.
> 
> I still do not understand why increasing tcp_limit_output_bytes is not
> working for you.

Oh well, tcp_limit_output_bytes might be ok.

In fact, the problem comes from the GSO assumption. Maybe Herbert was right
when he suggested TCP would be simpler if we enforced GSO...

When GSO is used, the thing works because 2*skb->truesize is roughly 2
ms worth of traffic.
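
(For reference, sk->sk_pacing_rate is in bytes per second, so the shift
translates to time roughly like this:

    sk_pacing_rate >> 10  ~  pacing_rate / 1024  ~  1 ms worth of bytes
    sk_pacing_rate >> 9   ~  pacing_rate / 512   ~  2 ms worth of bytes

i.e. the change below doubles the allowance when GSO is off.)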

Because you do not use GSO, and tx completions are slow, we need this :




Comments

Michal Kazior Feb. 6, 2015, 9:42 a.m. UTC | #1
On 5 February 2015 at 18:10, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:
>
>> Not at all. This basically removes backpressure.
>>
>> A single UDP socket can now blast packets regardless of SO_SNDBUF
>> limits.
>>
>> This basically removes years of work trying to fix bufferbloat.
>>
>> I still do not understand why increasing tcp_limit_output_bytes is not
>> working for you.
>
> Oh well, tcp_limit_output_bytes might be ok.
>
> In fact, the problem comes from GSO assumption. Maybe Herbert was right,
> when he suggested TCP would be simpler if we enforced GSO...
>
> When GSO is used, the thing works because 2*skb->truesize is roughly 2
> ms worth of traffic.
>
> Because you do not use GSO, and tx completions are slow, we need this :
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 65caf8b95e17..ac01b4cd0035 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                         break;
>
>                 /* TCP Small Queues :
> -                * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> +                * Control number of packets in qdisc/devices to two packets /
> +                * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
>                  * This allows for :
>                  *  - better RTT estimation and ACK scheduling
>                  *  - faster recovery
> @@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                  * of queued bytes to ensure line rate.
>                  * One example is wifi aggregation (802.11 AMPDU)
>                  */
> -               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> +               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>                 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
>                 if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>

The above brings back the previous behaviour, i.e. I can get 600mbps TCP
on 5 flows again. A single flow is still (as it was before TSO
autosizing) limited to roughly 280mbps.

I never really bothered before to understand why I need to push a few
flows through ath10k to max it out, i.e. a single UDP flow gets me
~300mbps while, e.g., 5 flows easily get 670mbps.

I guess it was the tx completion latency all along.

I just put an extra debug to ath10k to see the latency between
submission and completion. Here's a log
(http://www.filedropper.com/complete-log) of 2s run of UDP iperf
trying to push 1gbps but managing only 300mbps.

I've made sure not to hold any locks nor introduce any delays internal
to ath10k. Frames get completed in 2-4ms on average during load.

When I tried a different ath10k hw&fw I got 1-2ms of tx completion
latency, yielding ~430mbps, while the max should be around 670mbps.
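
The debug itself is essentially a timestamp taken at submission and
compared at completion; a minimal sketch (not the exact ath10k patch --
the helper names and the raw use of skb->cb are only illustrative):

    #include <linux/ktime.h>
    #include <linux/skbuff.h>

    /* Stash a timestamp when the frame is handed to the firmware. */
    static void dbg_tx_submit(struct sk_buff *skb)
    {
    	*(ktime_t *)skb->cb = ktime_get();
    }

    /* On tx completion, report how long the frame was in flight. */
    static void dbg_tx_complete(struct sk_buff *skb)
    {
    	s64 delta_us = ktime_us_delta(ktime_get(), *(ktime_t *)skb->cb);

    	pr_debug("tx completion latency: %lld us\n", delta_us);
    }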


Michał
Eric Dumazet Feb. 6, 2015, 1:40 p.m. UTC | #2
On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:

> The above brings back previous behaviour, i.e. I can get 600mbps TCP
> on 5 flows again. Single flow is still (as it was before TSO
> autosizing) limited to roughly ~280mbps.
> 
> I never really bothered before to understand why I need to push a few
> flows through ath10k to max it out, i.e. if I run a single UDP flow I
> get ~300mbps while with, e.g. 5 I get 670mbps easily.
> 

For a single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
enough: UDP has no callback from TX completion to feed the following frames
(no write queue like TCP has).

# cat /proc/sys/net/core/wmem_default
212992
# ethtool -C eth1 tx-usecs 1024 tx-frames 120
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1450   10.00      697705      0     809.27
212992           10.00      673412            781.09

# echo 800000 >/proc/sys/net/core/wmem_default
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

800000    1450   10.00     7329221      0    8501.84
212992           10.00     7284051           8449.44


> I guess it was the tx completion latency all along.
> 
> I just put an extra debug to ath10k to see the latency between
> submission and completion. Here's a log
> (http://www.filedropper.com/complete-log) of 2s run of UDP iperf
> trying to push 1gbps but managing only 300mbps.
> 
> I've made sure to not hold any locks nor introduce internal to ath10k
> delays. Frames get completed between 2-4ms in average during load.


tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
of TX completion delay. But this would require yet another expensive
call to ktime_get() if HZ < 1000.

Then tcp_write_xmit() could use it to adjust :

   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);

to

   amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate 

   limit = max(2 * skb->truesize, amount / 1000);
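
A rough sketch of the two pieces (untested; tp->tx_completion_delay_ms is
a new, hypothetical field, and the EWMA weight is only a guess):

    /* tcp_wfree(): update an EWMA of the observed tx completion delay.
     * skb->tstamp here is a placeholder for a timestamp taken when the
     * skb was handed to the device.
     */
    s64 delay_ms = ktime_ms_delta(ktime_get(), skb->tstamp);

    tp->tx_completion_delay_ms +=
    	(delay_ms - tp->tx_completion_delay_ms) >> 3;

    /* tcp_write_xmit(): size the TSQ limit to cover that delay. */
    u64 amount = (2 + tp->tx_completion_delay_ms) * (u64)sk->sk_pacing_rate;
    u32 limit  = max_t(u32, 2 * skb->truesize, div_u64(amount, 1000));

    limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);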

I'll cook a patch.

Thanks.


Eric Dumazet Feb. 6, 2015, 1:53 p.m. UTC | #3
On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:

> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
> of TX completion delay. But this would require yet another expensive
> call to ktime_get() if HZ < 1000.
> 
> Then tcp_write_xmit() could use it to adjust :
> 
>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
> 
> to
> 
>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate 
> 
>    limit = max(2 * skb->truesize, amount / 1000);
> 
> I'll cook a patch.

Hmm... doing this in all protocols would be too expensive,
and we do not want to include time spent in qdiscs.

wifi could eventually do that, providing in skb->tx_completion_delay_us
the time spent in the wifi driver.

This way, we would have no penalty for network devices doing normal skb
orphaning (loopback interface, ethernet, ...)
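
Roughly (again, skb->tx_completion_delay_us is hypothetical and would have
to be added to struct sk_buff; a real driver would also keep the timestamp
in its own per-skb state rather than in raw skb->cb):

    /* wifi driver, when the frame enters its tx path */
    *(ktime_t *)skb->cb = ktime_get();

    /* wifi driver, on tx completion, just before freeing the skb
     * (i.e. before the socket destructor such as tcp_wfree() runs)
     */
    skb->tx_completion_delay_us =
    	ktime_us_delta(ktime_get(), *(ktime_t *)skb->cb);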


Michal Kazior Feb. 6, 2015, 2:08 p.m. UTC | #4
On 6 February 2015 at 14:40, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:
>
>> The above brings back previous behaviour, i.e. I can get 600mbps TCP
>> on 5 flows again. Single flow is still (as it was before TSO
>> autosizing) limited to roughly ~280mbps.
>>
>> I never really bothered before to understand why I need to push a few
>> flows through ath10k to max it out, i.e. if I run a single UDP flow I
>> get ~300mbps while with, e.g. 5 I get 670mbps easily.
>>
>
> For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
> enough : UDP has no callback from TX completion to feed following frames
> (No write queue like TCP)
>
> # cat /proc/sys/net/core/wmem_default
> 212992
> # ethtool -C eth1 tx-usecs 1024 tx-frames 120
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 212992    1450   10.00      697705      0     809.27
> 212992           10.00      673412            781.09
>
> # echo 800000 >/proc/sys/net/core/wmem_default
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 800000    1450   10.00     7329221      0    8501.84
> 212992           10.00     7284051           8449.44

Hmm.. I confirm it works. However the value at which I get full rate
on a single flow is more than 2048K. Also, using a non-default
wmem_default seems to introduce packet loss, as per iperf reports at
the receiver. I suppose this is kind of expected, but on the other hand
wmem_default=262992 and 5 flows of UDP max the device out with 0
packet loss.


Michał
Michal Kazior Feb. 6, 2015, 2:09 p.m. UTC | #5
On 6 February 2015 at 14:53, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:
>
>> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
>> of TX completion delay. But this would require yet another expensive
>> call to ktime_get() if HZ < 1000.
>>
>> Then tcp_write_xmit() could use it to adjust :
>>
>>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>>
>> to
>>
>>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
>>
>>    limit = max(2 * skb->truesize, amount / 1000);
>>
>> I'll cook a patch.
>
> Hmm... doing this in all protocols would be too expensive,
> and we do not want to include time spent in qdiscs.
>
> wifi could eventually do that, providing in skb->tx_completion_delay_us
> the time spent in wifi driver.
>
> This way, we would have no penalty for network devices doing normal skb
> orphaning (loopback interface, ethernet, ...)

I'll play around with this idea and report back later.


Michał
Eric Dumazet Feb. 6, 2015, 2:10 p.m. UTC | #6
On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:


> wifi could eventually do that, providing in skb->tx_completion_delay_us
> the time spent in wifi driver.
> 
> This way, we would have no penalty for network devices doing normal skb
> orphaning (loopback interface, ethernet, ...)

Another way would be that wifi does an automatic orphaning after 1 or
2ms.


David Laight Feb. 6, 2015, 2:31 p.m. UTC | #7
From: Eric Dumazet
> On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
> 
> > wifi could eventually do that, providing in skb->tx_completion_delay_us
> > the time spent in wifi driver.
> >
> > This way, we would have no penalty for network devices doing normal skb
> > orphaning (loopback interface, ethernet, ...)
> 
> Another way would be that wifi does an automatic orphaning after 1 or
> 2ms.

Couldn't you do byte counting?
So orphan enough packets to keep a few ms of tx traffic (at the current
tx rate) orphaned.
You might need to give the hardware both orphaned and non-orphaned (parented?)
packets and orphan some when you get a tx complete for an orphaned packet.

	David
Eric Dumazet Feb. 6, 2015, 2:35 p.m. UTC | #8
On Fri, 2015-02-06 at 15:08 +0100, Michal Kazior wrote:

> Hmm.. I confirm it works. However the value at which I get full rate
> on a single flow is more than 2048K. Also using non-default
> wmem_default seems to introduce packet loss as per iperf reports at
> the receiver. I suppose this is kind of expected but on the other hand
> wmem_default=262992 and 5 flows of UDP max the device out with 0
> packet loss.

If you increase the ability to flood on one flow, then you need to make
sure the receiver has a big rcvbuf as well.

echo 2000000 >/proc/sys/net/core/rmem_default

Otherwise it might drop bursts.

This is the kind of things that TCP does automatically, not UDP.


Eric Dumazet Feb. 6, 2015, 3:02 p.m. UTC | #9
On Fri, 2015-02-06 at 14:31 +0000, David Laight wrote:
> From: Eric Dumazet
> > On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
> > 
> > 
> > > wifi could eventually do that, providing in skb->tx_completion_delay_us
> > > the time spent in wifi driver.
> > >
> > > This way, we would have no penalty for network devices doing normal skb
> > > orphaning (loopback interface, ethernet, ...)
> > 
> > Another way would be that wifi does an automatic orphaning after 1 or
> > 2ms.
> 
> Couldn't you do byte counting?
> So orphan enough packets to keep a few ms of tx traffic (at the current
> tx rate) orphaned.
> You might need to give the hardware both orphaned and non-orphaned (parented?)
> packets and orphan some when you get a tx complete for an orphaned packet.

We already have byte counting.

The thing is: a driver can keep an skb for itself, but call
skb_orphan() in time to allow the socket to send more packets.

For, say, a UDP server, this would be pretty much mandatory, as it usually
uses a single UDP socket to receive and send messages.
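
Something like this in the driver's completion path, perhaps (drv_txq, the
pending list, the use of skb->cb for the submission timestamp and the 2 ms
budget are all illustrative, not existing code):

    struct drv_txq {
    	struct sk_buff_head pending;	/* frames given to hw, oldest first */
    };

    /* Release the socket accounting for frames that have been in flight
     * "too long", while the driver keeps holding the skbs until the
     * hardware has actually finished with them.
     */
    static void drv_tx_orphan_stale(struct drv_txq *txq)
    {
    	struct sk_buff *skb;
    	ktime_t now = ktime_get();

    	skb_queue_walk(&txq->pending, skb) {
    		/* list is in submission order: stop at the first young frame */
    		if (ktime_us_delta(now, *(ktime_t *)skb->cb) < 2000)
    			break;
    		skb_orphan(skb);	/* runs skb->destructor, e.g. sock_wfree() */
    	}
    }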





Rick Jones Feb. 6, 2015, 5:48 p.m. UTC | #10
> If you increase ability to flood on one flow, then you need to make sure
> receiver has big rcvbuf as well.
>
> echo 2000000 >/proc/sys/net/core/rmem_default
>
> Otherwise it might drop bursts.
>
> This is the kind of things that TCP does automatically, not UDP.

An alternative, if the application involved can make explicit 
setsockopt() calls to set SO_SNDBUF and/or SO_RCVBUF, is to tweak 
rmem_max and wmem_max and then let the application make the setsockopt() 
calls.

Which path one would take would depend on circumstances, I suspect.
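
For example (assuming fd is an already-open UDP socket, and that
wmem_max/rmem_max have been raised so the requested values are not clamped):

    #include <stdio.h>
    #include <sys/socket.h>

    static void bump_bufs(int fd)
    {
    	/* a couple of MB; Michal needed more than 2048K for full single-flow rate */
    	int sndbuf = 2 * 1024 * 1024;
    	int rcvbuf = 2 * 1024 * 1024;

    	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
    		perror("SO_SNDBUF");
    	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
    		perror("SO_RCVBUF");
    }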

rick jones

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..ac01b4cd0035 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2044,7 +2044,8 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 
 		/* TCP Small Queues :
-		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
+		 * Control number of packets in qdisc/devices to two packets /
+		 * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
 		 * This allows for :
 		 *  - better RTT estimation and ACK scheduling
 		 *  - faster recovery
@@ -2053,7 +2054,7 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
 		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {