diff mbox

Throughput regression with `tcp: refine TSO autosizing`

Message ID 1422926330.21689.138.camel@edumazet-glaptop2.roam.corp.google.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Feb. 3, 2015, 1:18 a.m. UTC
On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:

> It seems to break ACK clocking badly (linux stack has a somewhat buggy
> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
> no timer is setup to split the TSO packet.)

Following patch might help the TSO split defer logic.

It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
Small Queue logic prevented the xmit at the expiration of first 'timer'.

This patch clears the tso_deferred variable only if we could really
send something.

Please try it, thanks !


 
@@ -2070,6 +2069,7 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;
 
+		tp->tso_deferred = 0;
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Michal Kazior Feb. 3, 2015, 11:50 a.m. UTC | #1
On 3 February 2015 at 02:18, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:
>
>> It seems to break ACK clocking badly (linux stack has a somewhat buggy
>> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
>> no timer is setup to split the TSO packet.)
>
> Following patch might help the TSO split defer logic.
>
> It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
> Small Queue logic prevented the xmit at the expiration of first 'timer'.
>
> This patch clears the tso_deferred variable only if we could really
> send something.
>
> Please try it, thanks !
[..patch..]

I've done a second round of tests. I've added the A-MSDU count
parameter I've mentioned in my other email into the mix.

 net - net/master (includes stretch ack patches)
 net-tso - net/master + your TSO defer patch
 net-gro - net/master + my ath10k GRO patch
 net-gro-tso - net/master + duh

Here's the best of amsdu count 1 and 3:

 ; for (i in */output.txt) { echo $i; for (j in (1 3)) { cat $i | awk
'x && /Mbits/ {y=$0}; x && y && !/Mbits/ {print y; x=0; y=""}; /set
amsdu cnt to '$j'/{x=1}' | awk '{ if (x < $(NF-1)) {x=$(NF-1)} }
END{print "A-MSDU limit='$j', " x " Mbits/sec"}' } }
 net-gro-tso/output.txt
 A-MSDU limit=1, 436 Mbits/sec
 A-MSDU limit=3, 284 Mbits/sec
 net-gro/output.txt
 A-MSDU limit=1, 444 Mbits/sec
 A-MSDU limit=3, 283 Mbits/sec
 net-tso/output.txt
 A-MSDU limit=1, 376 Mbits/sec
 A-MSDU limit=3, 251 Mbits/sec
 net/output.txt
 A-MSDU limit=1, 387 Mbits/sec
 A-MSDU limit=3, 260 Mbits/sec

IOW:
 - stretch acks / TSO defer don't seem to help much (when compared to
throughput results from yesterday)
 - GRO helps
 - disabling A-MSDU on sender helps
 - net/master+GRO still doesn't reach the performance from before the
regression (~600mbps w/ GRO)

You can grab logs and dumps here: http://www.filedropper.com/test2tar


Michał
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 3, 2015, 2:27 p.m. UTC | #2
On Tue, 2015-02-03 at 12:50 +0100, Michal Kazior wrote:
> On 3 February 2015 at 02:18, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:
> >
> >> It seems to break ACK clocking badly (linux stack has a somewhat buggy
> >> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
> >> no timer is setup to split the TSO packet.)
> >
> > Following patch might help the TSO split defer logic.
> >
> > It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
> > Small Queue logic prevented the xmit at the expiration of first 'timer'.
> >
> > This patch clears the tso_deferred variable only if we could really
> > send something.
> >
> > Please try it, thanks !
> [..patch..]
> 
> I've done a second round of tests. I've added the A-MSDU count
> parameter I've mentioned in my other email into the mix.
> 
>  net - net/master (includes stretch ack patches)
>  net-tso - net/master + your TSO defer patch
>  net-gro - net/master + my ath10k GRO patch
>  net-gro-tso - net/master + duh
> 
> Here's the best of amsdu count 1 and 3:
> 
>  ; for (i in */output.txt) { echo $i; for (j in (1 3)) { cat $i | awk
> 'x && /Mbits/ {y=$0}; x && y && !/Mbits/ {print y; x=0; y=""}; /set
> amsdu cnt to '$j'/{x=1}' | awk '{ if (x < $(NF-1)) {x=$(NF-1)} }
> END{print "A-MSDU limit='$j', " x " Mbits/sec"}' } }
>  net-gro-tso/output.txt
>  A-MSDU limit=1, 436 Mbits/sec
>  A-MSDU limit=3, 284 Mbits/sec
>  net-gro/output.txt
>  A-MSDU limit=1, 444 Mbits/sec
>  A-MSDU limit=3, 283 Mbits/sec
>  net-tso/output.txt
>  A-MSDU limit=1, 376 Mbits/sec
>  A-MSDU limit=3, 251 Mbits/sec
>  net/output.txt
>  A-MSDU limit=1, 387 Mbits/sec
>  A-MSDU limit=3, 260 Mbits/sec
> 
> IOW:
>  - stretch acks / TSO defer don't seem to help much (when compared to
> throughput results from yesterday)
>  - GRO helps
>  - disabling A-MSDU on sender helps
>  - net/master+GRO still doesn't reach the performance from before the
> regression (~600mbps w/ GRO)
> 
> You can grab logs and dumps here: http://www.filedropper.com/test2tar
> 

Thanks for these traces.

There is absolutely a problem at the sender, as we can see a big 2ms
delay between reception of ACK and send of following packets.
TCP stack should generate them immediately.
Are you using some kind of netem qdisc ?

These 2ms delays, in a flow with a 5ms RTT are terrible.

06:54:57.408391 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294899240, win 11268, options [nop,nop,TS val 1053302 ecr 1052250], length 0
06:54:57.408418 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294910824, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408431 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294936888, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408453 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294962952, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408474 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 0, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
<this 2ms delay is not generated by TCP stack.>
06:54:57.410243 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 82536:83984, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410271 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 83984:85432, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410289 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 85432:86880, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410310 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 86880:88328, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410326 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 88328:89776, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410339 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 89776:91224, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410353 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 91224:92672, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410370 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 92672:94120, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448

...
06:54:57.411178 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 28960, win 11268, options [nop,nop,TS val 1053306 ecr 1052253], length 0
06:54:57.411190 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 52128, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
06:54:57.411220 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 78192, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
06:54:57.411243 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 82536, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
< same 2ms unexplained gap here >
06:54:57.412912 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 165072:166520, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
06:54:57.412935 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 166520:167968, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
06:54:57.412948 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 167968:169416, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
...
06:54:57.413650 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 244712:246160, ack 1, win 457, options [nop,nop,TS val 1052260 ecr 1053306], length 1448
06:54:57.413662 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 246160:247608, ack 1, win 457, options [nop,nop,TS val 1052260 ecr 1053306], length 1448
06:54:57.413712 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 112944, win 11268, options [nop,nop,TS val 1053308 ecr 1052256], length 0
06:54:57.413730 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 130320, win 11268, options [nop,nop,TS val 1053308 ecr 1052257], length 0
06:54:57.413754 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 160728, win 11268, options [nop,nop,TS val 1053309 ecr 1052257], length 0
06:54:57.413779 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 165072, win 11268, options [nop,nop,TS val 1053309 ecr 1052257], length 0
< same 2ms delay >
06:54:57.415682 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 247608:249056, ack 1, win 457, options [nop,nop,TS val 1052262 ecr 1053309], length 1448
06:54:57.415709 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 249056:250504, ack 1, win 457, options [nop,nop,TS val 1052262 ecr 1053309], length 1448

Are packets TX completed after a timer or something ?

Some very heavy stuff might run from tasklet (or other softirq triggered) event.

BTW, traces tend to show that you 'receive' multiple ACK in the same burst,
its not clear if they are delayed at one side or the other.

GRO should delay only GRO candidates. ACK packets are not GRO candidates.

Have you tried to disable GSO on sender ?

(Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 3, 2015, 3:03 p.m. UTC | #3
On Tue, 2015-02-03 at 06:27 -0800, Eric Dumazet wrote:

> Are packets TX completed after a timer or something ?
> 
> Some very heavy stuff might run from tasklet (or other softirq triggered) event.
> 

Right, commit 6c5151a9ffa9f796f2d707617cecb6b6b241dff8
("ath10k: batch htt tx/rx completions")
is very suspicious.

Please revert it.

BTW, ath10k_htt_txrx_compl_task() runs from softirq context, so the 
_bh() prefixes are not really needed.

It seems lot of batching happens in wifi drivers, not necessarily at the
right places.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Kazior Feb. 4, 2015, 11:35 a.m. UTC | #4
On 3 February 2015 at 15:27, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-02-03 at 12:50 +0100, Michal Kazior wrote:
[...]
>> IOW:
>>  - stretch acks / TSO defer don't seem to help much (when compared to
>> throughput results from yesterday)
>>  - GRO helps
>>  - disabling A-MSDU on sender helps
>>  - net/master+GRO still doesn't reach the performance from before the
>> regression (~600mbps w/ GRO)
>>
>> You can grab logs and dumps here: http://www.filedropper.com/test2tar
>>
>
> Thanks for these traces.
>
> There is absolutely a problem at the sender, as we can see a big 2ms
> delay between reception of ACK and send of following packets.
> TCP stack should generate them immediately.
> Are you using some kind of netem qdisc ?

Both systems have identical setup:

; tc qdisc
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1
2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap  1 2 2 2 1
2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev wlan1 root
qdisc pfifo_fast 0: dev wlan1 parent :1 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :2 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :3 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :4 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1


> These 2ms delays, in a flow with a 5ms RTT are terrible.
>
> 06:54:57.408391 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294899240, win 11268, options [nop,nop,TS val 1053302 ecr 1052250], length 0
> 06:54:57.408418 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294910824, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> 06:54:57.408431 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294936888, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> 06:54:57.408453 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294962952, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> 06:54:57.408474 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 0, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> <this 2ms delay is not generated by TCP stack.>
> 06:54:57.410243 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 82536:83984, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
[...]
>
> Are packets TX completed after a timer or something ?

As far as ath10k is concerned - no timers here. Not sure about
firmware itself though.


> Some very heavy stuff might run from tasklet (or other softirq triggered) event.
>
> BTW, traces tend to show that you 'receive' multiple ACK in the same burst,
> its not clear if they are delayed at one side or the other.
>
> GRO should delay only GRO candidates. ACK packets are not GRO candidates.
>
> Have you tried to disable GSO on sender ?

I assume I do that via ethtool? This is my current setup on both systems:

; ethtool -k wlan1
Features for wlan1:
rx-checksumming: off [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on [fixed]
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: off
        tx-scatter-gather: off [fixed]
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [fixed]
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: off [requested on]
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]

; ethtool -K wlan1 generic-segmentation-offload off
ethtool: bad command line argument(s)
For more information run ethtool -h


> (Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)

This could work if your firmware/device supports this kind of thing.
To my understanding ath10k firmware doesn't.


Michał
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 4, 2015, 11:57 a.m. UTC | #5
On Wed, 2015-02-04 at 12:35 +0100, Michal Kazior wrote:

> > (Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)
> 
> This could work if your firmware/device supports this kind of thing.
> To my understanding ath10k firmware doesn't.

This is a pure software signal. You do not need firmware support.

Idea is the following :

Your driver gets a train of messages, coming from upper layers (TCP, IP,
qdisc)

It can know that a packet is not the last one, by looking at
skb->xmit_more.

Basically, aggregation logic could use this signal as a very clear
indicator you got the end of a train -> force the xmit right now.

To disable gso you would have to use :

ethtool -K wlan1 gso off



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Kazior Feb. 4, 2015, 12:22 p.m. UTC | #6
On 4 February 2015 at 12:57, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-02-04 at 12:35 +0100, Michal Kazior wrote:
>
>> > (Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)
>>
>> This could work if your firmware/device supports this kind of thing.
>> To my understanding ath10k firmware doesn't.
>
> This is a pure software signal. You do not need firmware support.
>
> Idea is the following :
>
> Your driver gets a train of messages, coming from upper layers (TCP, IP,
> qdisc)
>
> It can know that a packet is not the last one, by looking at
> skb->xmit_more.
>
> Basically, aggregation logic could use this signal as a very clear
> indicator you got the end of a train -> force the xmit right now.

There's no way to tell ath10k firmware: "xmit right now". The firmware
does all tx aggregation logic by itself. Host driver just submits a
frame and hopes it'll get out soon. It's not even a tx-ring you'd
expect. Each frame has a host assigned id which firmware then uses in
tx completion.


> To disable gso you would have to use :
>
> ethtool -K wlan1 gso off

Oh, thanks! This works. However I can't turn it on:

; ethtool -K wlan1 gso on
Could not change any device features

..so I guess it makes no sense to re-run tests because:

; ethtool -k wlan1 | grep generic
        tx-checksum-ip-generic: on [fixed]
generic-segmentation-offload: off [requested on]
generic-receive-offload: on

And this seems to never change.


Michał
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 4, 2015, 12:38 p.m. UTC | #7
On Wed, 2015-02-04 at 13:22 +0100, Michal Kazior wrote:
> On 4 February 2015 at 12:57, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> 
> > To disable gso you would have to use :
> >
> > ethtool -K wlan1 gso off
> 
> Oh, thanks! This works. However I can't turn it on:
> 
> ; ethtool -K wlan1 gso on
> Could not change any device features
> 
> ..so I guess it makes no sense to re-run tests because:
> 
> ; ethtool -k wlan1 | grep generic
>         tx-checksum-ip-generic: on [fixed]
> generic-segmentation-offload: off [requested on]
> generic-receive-offload: on
> 
> And this seems to never change.

GSO requires SG (Scatter Gather)

Are you sure this hardware has no SG support ?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Kazior Feb. 4, 2015, 12:53 p.m. UTC | #8
On 4 February 2015 at 13:38, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-02-04 at 13:22 +0100, Michal Kazior wrote:
>> On 4 February 2015 at 12:57, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>>
>> > To disable gso you would have to use :
>> >
>> > ethtool -K wlan1 gso off
>>
>> Oh, thanks! This works. However I can't turn it on:
>>
>> ; ethtool -K wlan1 gso on
>> Could not change any device features
>>
>> ..so I guess it makes no sense to re-run tests because:
>>
>> ; ethtool -k wlan1 | grep generic
>>         tx-checksum-ip-generic: on [fixed]
>> generic-segmentation-offload: off [requested on]
>> generic-receive-offload: on
>>
>> And this seems to never change.
>
> GSO requires SG (Scatter Gather)
>
> Are you sure this hardware has no SG support ?

The hardware itself seems to be capable. The firmware is a problem
though. I'm also not sure if mac80211 can handle this as is. No 802.11
driver seems to support SG except wil6210 which uses cfg80211 and
netdevs directly.


Michał
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg Feb. 4, 2015, 12:55 p.m. UTC | #9
On Wed, 2015-02-04 at 13:53 +0100, Michal Kazior wrote:

> The hardware itself seems to be capable. The firmware is a problem
> though. I'm also not sure if mac80211 can handle this as is. No 802.11
> driver seems to support SG except wil6210 which uses cfg80211 and
> netdevs directly.

mac80211 cannot deal with this right now. This would make a good topic
for the workshop since there's interest elsewhere in this as well. It's
probably not terribly hard to do as far as mac80211 is concerned.

How much offload do you really have though? Sometimes people just want
to build A-MSDUs.

johannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 4, 2015, 1:16 p.m. UTC | #10
OK guys

Using a mlx4 testbed I can reproduce the problem by pushing coalescing
settings and disabling SG (thus disabling GSO)

ethtool -K eth0 sg off
Actual changes:
scatter-gather: off
	tx-scatter-gather: off
generic-segmentation-offload: off [requested on]

ethtool -C eth0 tx-usecs 1024 tx-frames 64

Meaning that NIC waits one ms before sending the TX IRQ,
and can accumulate 64 frames before forcing the interrupt.

We probably have a bug in cwnd expansion logic :

lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=230 rttvar=30 snd_ssthresh=41 cwnd=59 reordering=3 total_retrans=1 ca_state=0 pacing_rate=5943.1 Mbits
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00       530.39   0.40     0.32     2.965   2.398  


-> final cwnd=59 which is not enough to avoid the 1ms delay between each
burst. 

So sender sends ~60 packets, then has to wait 1ms (to get NIC TX IRQ)
before sending the following burst.

I am CCing Neal, he probably can help to root cause the problem.

Thanks


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 4, 2015, 1:29 p.m. UTC | #11
On Wed, 2015-02-04 at 05:16 -0800, Eric Dumazet wrote:
> OK guys
> 
> Using a mlx4 testbed I can reproduce the problem by pushing coalescing
> settings and disabling SG (thus disabling GSO)
> 
> ethtool -K eth0 sg off
> Actual changes:
> scatter-gather: off
> 	tx-scatter-gather: off
> generic-segmentation-offload: off [requested on]
> 
> ethtool -C eth0 tx-usecs 1024 tx-frames 64
> 
> Meaning that NIC waits one ms before sending the TX IRQ,
> and can accumulate 64 frames before forcing the interrupt.
> 
> We probably have a bug in cwnd expansion logic :
> 
> lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
> rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=230 rttvar=30 snd_ssthresh=41 cwnd=59 reordering=3 total_retrans=1 ca_state=0 pacing_rate=5943.1 Mbits
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  16384    10.00       530.39   0.40     0.32     2.965   2.398  
> 
> 
> -> final cwnd=59 which is not enough to avoid the 1ms delay between each
> burst. 
> 
> So sender sends ~60 packets, then has to wait 1ms (to get NIC TX IRQ)
> before sending the following burst.
> 
> I am CCing Neal, he probably can help to root cause the problem.

Arg, this was with net-next, ie not including our recent stretch ack
fixes.

Using David Miller 'net' tree, cwnd seems OK.

Speed is low because of 64 queued frames are exceeding
tcp_limit_output_bytes

lpaa23:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
131072
lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=166 rttvar=16 snd_ssthresh=26 cwnd=59 reordering=3 total_retrans=0 ca_state=0 pacing_rate=8203.52 Mbits
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00       569.96   0.52     0.38     3.588   2.625  


lpaa23:~# echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=98 rttvar=18 snd_ssthresh=312 cwnd=313 reordering=3 total_retrans=23 ca_state=0
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      8518.40   2.60     1.57     1.200   0.727  




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..e735f38557db 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1821,7 +1821,6 @@  static bool tcp_tso_should_defer(struct sock *sk,
struct sk_buff *skb,
 	return true;
 
 send_now:
-	tp->tso_deferred = 0;
 	return false;
 }