
Throughput regression with `tcp: refine TSO autosizing`

Message ID 1423156205.31870.86.camel@edumazet-glaptop2.roam.corp.google.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Feb. 5, 2015, 5:10 p.m. UTC
On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:

> Not at all. This basically removes backpressure.
> 
> A single UDP socket can now blast packets regardless of SO_SNDBUF
> limits.
> 
> This basically removes years of work trying to fix bufferbloat.
> 
> I still do not understand why increasing tcp_limit_output_bytes is not
> working for you.

Oh well, tcp_limit_output_bytes might be ok.

In fact, the problem comes from the GSO assumption. Maybe Herbert was right
when he suggested TCP would be simpler if we enforced GSO...

When GSO is used, the thing works because 2*skb->truesize is roughly 2
ms worth of traffic.
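
(For reference, sk->sk_pacing_rate is in bytes per second, so the shift
translates to time roughly like this:

    sk_pacing_rate >> 10  ~  pacing_rate / 1024  ~  1 ms worth of bytes
    sk_pacing_rate >> 9   ~  pacing_rate / 512   ~  2 ms worth of bytes

i.e. the change below doubles the allowance when GSO is off.)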

Because you do not use GSO, and tx completions are slow, we need this :




Comments

Michal Kazior Feb. 6, 2015, 9:42 a.m. UTC | #1
On 5 February 2015 at 18:10, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:
>
>> Not at all. This basically removes backpressure.
>>
>> A single UDP socket can now blast packets regardless of SO_SNDBUF
>> limits.
>>
>> This basically removes years of work trying to fix bufferbloat.
>>
>> I still do not understand why increasing tcp_limit_output_bytes is not
>> working for you.
>
> Oh well, tcp_limit_output_bytes might be ok.
>
> In fact, the problem comes from GSO assumption. Maybe Herbert was right,
> when he suggested TCP would be simpler if we enforced GSO...
>
> When GSO is used, the thing works because 2*skb->truesize is roughly 2
> ms worth of traffic.
>
> Because you do not use GSO, and tx completions are slow, we need this :
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 65caf8b95e17..ac01b4cd0035 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                         break;
>
>                 /* TCP Small Queues :
> -                * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> +                * Control number of packets in qdisc/devices to two packets /
> +                * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
>                  * This allows for :
>                  *  - better RTT estimation and ACK scheduling
>                  *  - faster recovery
> @@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                  * of queued bytes to ensure line rate.
>                  * One example is wifi aggregation (802.11 AMPDU)
>                  */
> -               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> +               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>                 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
>                 if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>

The above brings back the previous behaviour, i.e. I can get 600mbps TCP
on 5 flows again. A single flow is still (as it was before TSO
autosizing) limited to roughly 280mbps.

I never really bothered before to understand why I need to push a few
flows through ath10k to max it out, i.e. a single UDP flow gets me
~300mbps while, e.g., 5 flows easily get 670mbps.

I guess it was the tx completion latency all along.

I just put an extra debug to ath10k to see the latency between
submission and completion. Here's a log
(http://www.filedropper.com/complete-log) of 2s run of UDP iperf
trying to push 1gbps but managing only 300mbps.

I've made sure not to hold any locks nor introduce any delays internal
to ath10k. Frames get completed in 2-4ms on average during load.

When I tried a different ath10k hw&fw I got 1-2ms of tx completion
latency, yielding ~430mbps, while the max should be around 670mbps.
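
The debug itself is essentially a timestamp taken at submission and
compared at completion; a minimal sketch (not the exact ath10k patch --
the helper names and the raw use of skb->cb are only illustrative):

    #include <linux/ktime.h>
    #include <linux/skbuff.h>

    /* Stash a timestamp when the frame is handed to the firmware. */
    static void dbg_tx_submit(struct sk_buff *skb)
    {
    	*(ktime_t *)skb->cb = ktime_get();
    }

    /* On tx completion, report how long the frame was in flight. */
    static void dbg_tx_complete(struct sk_buff *skb)
    {
    	s64 delta_us = ktime_us_delta(ktime_get(), *(ktime_t *)skb->cb);

    	pr_debug("tx completion latency: %lld us\n", delta_us);
    }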


Michał
Eric Dumazet Feb. 6, 2015, 1:40 p.m. UTC | #2
On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:

> The above brings back previous behaviour, i.e. I can get 600mbps TCP
> on 5 flows again. Single flow is still (as it was before TSO
> autosizing) limited to roughly ~280mbps.
> 
> I never really bothered before to understand why I need to push a few
> flows through ath10k to max it out, i.e. if I run a single UDP flow I
> get ~300mbps while with, e.g. 5 I get 670mbps easily.
> 

For a single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
enough: UDP has no callback from TX completion to feed the following frames
(no write queue like TCP has).

# cat /proc/sys/net/core/wmem_default
212992
# ethtool -C eth1 tx-usecs 1024 tx-frames 120
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1450   10.00      697705      0     809.27
212992           10.00      673412            781.09

# echo 800000 >/proc/sys/net/core/wmem_default
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

800000    1450   10.00     7329221      0    8501.84
212992           10.00     7284051           8449.44


> I guess it was the tx completion latency all along.
> 
> I just put an extra debug to ath10k to see the latency between
> submission and completion. Here's a log
> (http://www.filedropper.com/complete-log) of 2s run of UDP iperf
> trying to push 1gbps but managing only 300mbps.
> 
> I've made sure to not hold any locks nor introduce internal to ath10k
> delays. Frames get completed between 2-4ms in average during load.


tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
of TX completion delay. But this would require yet another expensive
call to ktime_get() if HZ < 1000.

Then tcp_write_xmit() could use it to adjust :

   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);

to

   amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate 

   limit = max(2 * skb->truesize, amount / 1000);
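
A rough sketch of the two pieces (untested; tp->tx_completion_delay_ms is
a new, hypothetical field, and the EWMA weight is only a guess):

    /* tcp_wfree(): update an EWMA of the observed tx completion delay.
     * skb->tstamp here is a placeholder for a timestamp taken when the
     * skb was handed to the device.
     */
    s64 delay_ms = ktime_ms_delta(ktime_get(), skb->tstamp);

    tp->tx_completion_delay_ms +=
    	(delay_ms - tp->tx_completion_delay_ms) >> 3;

    /* tcp_write_xmit(): size the TSQ limit to cover that delay. */
    u64 amount = (2 + tp->tx_completion_delay_ms) * (u64)sk->sk_pacing_rate;
    u32 limit  = max_t(u32, 2 * skb->truesize, div_u64(amount, 1000));

    limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);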

I'll cook a patch.

Thanks.


Eric Dumazet Feb. 6, 2015, 1:53 p.m. UTC | #3
On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:

> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
> of TX completion delay. But this would require yet another expensive
> call to ktime_get() if HZ < 1000.
> 
> Then tcp_write_xmit() could use it to adjust :
> 
>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
> 
> to
> 
>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate 
> 
>    limit = max(2 * skb->truesize, amount / 1000);
> 
> I'll cook a patch.

Hmm... doing this in all protocols would be too expensive,
and we do not want to include time spent in qdiscs.

wifi could eventually do that, providing in skb->tx_completion_delay_us
the time spent in the wifi driver.

This way, we would have no penalty for network devices doing normal skb
orphaning (loopback interface, ethernet, ...)
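
Roughly (again, skb->tx_completion_delay_us is hypothetical and would have
to be added to struct sk_buff; a real driver would also keep the timestamp
in its own per-skb state rather than in raw skb->cb):

    /* wifi driver, when the frame enters its tx path */
    *(ktime_t *)skb->cb = ktime_get();

    /* wifi driver, on tx completion, just before freeing the skb
     * (i.e. before the socket destructor such as tcp_wfree() runs)
     */
    skb->tx_completion_delay_us =
    	ktime_us_delta(ktime_get(), *(ktime_t *)skb->cb);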


Michal Kazior Feb. 6, 2015, 2:08 p.m. UTC | #4
On 6 February 2015 at 14:40, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:
>
>> The above brings back previous behaviour, i.e. I can get 600mbps TCP
>> on 5 flows again. Single flow is still (as it was before TSO
>> autosizing) limited to roughly ~280mbps.
>>
>> I never really bothered before to understand why I need to push a few
>> flows through ath10k to max it out, i.e. if I run a single UDP flow I
>> get ~300mbps while with, e.g. 5 I get 670mbps easily.
>>
>
> For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
> enough : UDP has no callback from TX completion to feed following frames
> (No write queue like TCP)
>
> # cat /proc/sys/net/core/wmem_default
> 212992
> # ethtool -C eth1 tx-usecs 1024 tx-frames 120
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 212992    1450   10.00      697705      0     809.27
> 212992           10.00      673412            781.09
>
> # echo 800000 >/proc/sys/net/core/wmem_default
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 800000    1450   10.00     7329221      0    8501.84
> 212992           10.00     7284051           8449.44

Hmm.. I confirm it works. However the value at which I get full rate
on a single flow is more than 2048K. Also, using a non-default
wmem_default seems to introduce packet loss, as per iperf reports at
the receiver. I suppose this is kind of expected, but on the other hand
wmem_default=262992 and 5 flows of UDP max the device out with 0
packet loss.


Michał
Michal Kazior Feb. 6, 2015, 2:09 p.m. UTC | #5
On 6 February 2015 at 14:53, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:
>
>> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
>> of TX completion delay. But this would require yet another expensive
>> call to ktime_get() if HZ < 1000.
>>
>> Then tcp_write_xmit() could use it to adjust :
>>
>>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>>
>> to
>>
>>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
>>
>>    limit = max(2 * skb->truesize, amount / 1000);
>>
>> I'll cook a patch.
>
> Hmm... doing this in all protocols would be too expensive,
> and we do not want to include time spent in qdiscs.
>
> wifi could eventually do that, providing in skb->tx_completion_delay_us
> the time spent in wifi driver.
>
> This way, we would have no penalty for network devices doing normal skb
> orphaning (loopback interface, ethernet, ...)

I'll play around with this idea and report back later.


Michał
Eric Dumazet Feb. 6, 2015, 2:10 p.m. UTC | #6
On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:


> wifi could eventually do that, providing in skb->tx_completion_delay_us
> the time spent in wifi driver.
> 
> This way, we would have no penalty for network devices doing normal skb
> orphaning (loopback interface, ethernet, ...)

Another way would be that wifi does an automatic orphaning after 1 or
2ms.


David Laight Feb. 6, 2015, 2:31 p.m. UTC | #7
From: Eric Dumazet
> On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
> 
> > wifi could eventually do that, providing in skb->tx_completion_delay_us
> > the time spent in wifi driver.
> >
> > This way, we would have no penalty for network devices doing normal skb
> > orphaning (loopback interface, ethernet, ...)
> 
> Another way would be that wifi does an automatic orphaning after 1 or
> 2ms.

Couldn't you do byte counting?
So orphan enough packets to keep a few ms of tx traffic (at the current
tx rate) orphaned.
You might need to give the hardware both orphaned and non-orphaned (parented?)
packets and orphan some when you get a tx complete for an orphaned packet.

	David
Eric Dumazet Feb. 6, 2015, 2:35 p.m. UTC | #8
On Fri, 2015-02-06 at 15:08 +0100, Michal Kazior wrote:

> Hmm.. I confirm it works. However the value at which I get full rate
> on a single flow is more than 2048K. Also using non-default
> wmem_default seems to introduce packet loss as per iperf reports at
> the receiver. I suppose this is kind of expected but on the other hand
> wmem_default=262992 and 5 flows of UDP max the device out with 0
> packet loss.

If you increase the ability to flood on one flow, then you need to make
sure the receiver has a big rcvbuf as well.

echo 2000000 >/proc/sys/net/core/rmem_default

Otherwise it might drop bursts.

This is the kind of things that TCP does automatically, not UDP.


Eric Dumazet Feb. 6, 2015, 3:02 p.m. UTC | #9
On Fri, 2015-02-06 at 14:31 +0000, David Laight wrote:
> From: Eric Dumazet
> > On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
> > 
> > 
> > > wifi could eventually do that, providing in skb->tx_completion_delay_us
> > > the time spent in wifi driver.
> > >
> > > This way, we would have no penalty for network devices doing normal skb
> > > orphaning (loopback interface, ethernet, ...)
> > 
> > Another way would be that wifi does an automatic orphaning after 1 or
> > 2ms.
> 
> Couldn't you do byte counting?
> So orphan enough packets to keep a few ms of tx traffic (at the current
> tx rate) orphaned.
> You might need to give the hardware both orphaned and non-orphaned (parented?)
> packets and orphan some when you get a tx complete for an orphaned packet.

We already have byte counting.

The thing is: a driver can keep an skb for itself, but call
skb_orphan() in time to allow the socket to send more packets.

For, say, a UDP server, this would be pretty much mandatory, as it usually
uses a single UDP socket to receive and send messages.
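
Something like this in the driver's completion path, perhaps (drv_txq, the
pending list, the use of skb->cb for the submission timestamp and the 2 ms
budget are all illustrative, not existing code):

    struct drv_txq {
    	struct sk_buff_head pending;	/* frames given to hw, oldest first */
    };

    /* Release the socket accounting for frames that have been in flight
     * "too long", while the driver keeps holding the skbs until the
     * hardware has actually finished with them.
     */
    static void drv_tx_orphan_stale(struct drv_txq *txq)
    {
    	struct sk_buff *skb;
    	ktime_t now = ktime_get();

    	skb_queue_walk(&txq->pending, skb) {
    		/* list is in submission order: stop at the first young frame */
    		if (ktime_us_delta(now, *(ktime_t *)skb->cb) < 2000)
    			break;
    		skb_orphan(skb);	/* runs skb->destructor, e.g. sock_wfree() */
    	}
    }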





Rick Jones Feb. 6, 2015, 5:48 p.m. UTC | #10
> If you increase ability to flood on one flow, then you need to make sure
> receiver has big rcvbuf as well.
>
> echo 2000000 >/proc/sys/net/core/rmem_default
>
> Otherwise it might drop bursts.
>
> This is the kind of things that TCP does automatically, not UDP.

An alternative, if the application involved can make explicit 
setsockopt() calls to set SO_SNDBUF and/or SO_RCVBUF, is to tweak 
rmem_max and wmem_max and then let the application make the setsockopt() 
calls.

Which path one would take would depend on circumstances, I suspect.
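
For example (assuming fd is an already-open UDP socket, and that
wmem_max/rmem_max have been raised so the requested values are not clamped):

    #include <stdio.h>
    #include <sys/socket.h>

    static void bump_bufs(int fd)
    {
    	/* a couple of MB; Michal needed more than 2048K for full single-flow rate */
    	int sndbuf = 2 * 1024 * 1024;
    	int rcvbuf = 2 * 1024 * 1024;

    	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
    		perror("SO_SNDBUF");
    	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
    		perror("SO_RCVBUF");
    }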

rick jones

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..ac01b4cd0035 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2044,7 +2044,8 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 
 		/* TCP Small Queues :
-		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
+		 * Control number of packets in qdisc/devices to two packets /
+		 * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
 		 * This allows for :
 		 *  - better RTT estimation and ACK scheduling
 		 *  - faster recovery
@@ -2053,7 +2054,7 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
 		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {