
resend: tcp: performance issue with fastopen connections (mss > window)

Message ID 30f38b3c-8c5c-7fab-e424-985e63ad900a@oracle.com
State RFC, archived
Delegated to: David Miller

Commit Message

Alexey Kodanev Jan. 13, 2017, 3:01 p.m. UTC
Hi,

Got the issue when running the LTP/netstress test on localhost with an mss
greater than the send window advertised by the client (right after the 3WHS).
Here is the test scenario that can reproduce this:

The TCP client sends a 32-byte request, the TCP server replies with a 65KB
answer. net.ipv4.tcp_fastopen is set to 3. Also note that the first TCP
Fastopen connection is processed without delay, as tcp_send_mss() sets half
of the window size as the 'size_goal' inside tcp_sendmsg().

Though on the 2nd and subsequent connections:

    < S  seq 0:0 win 43690 options [mss 65495 wscale 7
         tfo cookie ac6246a51d5422fc] length 32
    > S. seq 0:0 ack 1 win 43690 options [mss 65495 wscale 7] length 0
    < .  ack 1 win 342 length 0

Inside tcp_sendmsg(), tcp_send_mss() returns 65483 in 'mss_now', as well as
in 'size_goal'. As a result, the segment is not queued for transmission
until all data is copied from the user buffer. Then, inside
__tcp_push_pending_frames(), it fails the send window test and continues
with the probe timer check, thus introducing a 200ms delay here.

Fragmentation occurs in tcp_write_wakeup()...

+0.2 > P. seq 1:43777 ack 1 win 342 length 43776
     < .  ack 43777, win 1365 length 0
     > P. seq 43777:65001 ack 1 win 342 options length 21224
     ...


Not sure what is the right fix for this. I guess we could limit size_goal
to the current window or mss, whichever is currently less, e.g. as in the
patch at the bottom of this mail.


Any ideas?

Best regards,
Alexey

Comments

Eric Dumazet Jan. 13, 2017, 3:35 p.m. UTC | #1
On Fri, 2017-01-13 at 18:01 +0300, Alexey Kodanev wrote:
> Hi,
> 
> Got the issue when running LTP/netstress test on localhost with mss
> greater than the send window advertised by client (right after 3WHS).
> Here is the test scenario that can reproduce this:

Hi Alexey

So this is a combination of Fastopen + small window + large MSS ?

I would rather not force burning tons of coal or other fossil fuel,
by making each tcp_sendmsg() done by billions of linux devices more
expensive, only to accommodate for some LTP test doing something not
sensible ;)

Fact that you removed one condition in the BUG_ON() might hide another
issue later in the path.

I would suggest to clamp MSS to half the initial window, but I guess
this is impractical since window in SYN/SYNACK are not scaled.

Care to send a packetdrill test so that we have a clear picture of what
is going on ?

Thanks.
Alexey Kodanev Jan. 13, 2017, 5:07 p.m. UTC | #2
Hi Eric,
On 13.01.2017 18:35, Eric Dumazet wrote:
> On Fri, 2017-01-13 at 18:01 +0300, Alexey Kodanev wrote:
>> Hi,
>>
>> Got the issue when running LTP/netstress test on localhost with mss
>> greater than the send window advertised by client (right after 3WHS).
>> Here is the test scenario that can reproduce this:
> Hi Alexey
>
> So this is a combination of Fastopen + small window + large MSS ?

Yeah, this happens only in the beginning, after the first ack from the
client. Later the window gets larger than the mss and it doesn't happen.

>
> I would rather not force burning tons of coal or other fossil fuel,
> by making each tcp_sendmsg() done by billions of linux devices more
> expensive, only to accommodate for some LTP test doing something not
> sensible ;)
>
> Fact that you removed one condition in the BUG_ON() might hide another
> issue later in the path.
>
> I would suggest to clamp MSS to half the initial window, but I guess
> this is impractical since window in SYN/SYNACK are not scaled.
> Care to send a packetdrill test so that we have a clear picture of what
> is going on ?

Is it capable of making two connections in a single test, one after
another?

Thanks,
Alexey
Eric Dumazet Jan. 13, 2017, 5:14 p.m. UTC | #3
On Fri, Jan 13, 2017 at 9:07 AM, Alexey Kodanev
<alexey.kodanev@oracle.com> wrote:
> Hi Eric,
> On 13.01.2017 18:35, Eric Dumazet wrote:

>> Care to send a packetdrill test so that we have a clear picture of what
>> is going on ?
>
> Is it capable of making two connections in the single test, one after
> another?

Absolutely.

Neal, Yuchung, would you be kind enough to send a Fastopen packetdrill
template showing a typical fastopen test
running on an upstream kernel ?

Thanks !
Neal Cardwell Jan. 13, 2017, 5:32 p.m. UTC | #4
On Fri, Jan 13, 2017 at 12:14 PM, Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Jan 13, 2017 at 9:07 AM, Alexey Kodanev
> <alexey.kodanev@oracle.com> wrote:
> > Hi Eric,
> > On 13.01.2017 18:35, Eric Dumazet wrote:
>
> >> Care to send a packetdrill test so that we have a clear picture of what
> >> is going on ?
> >
> > Is it capable of making two connections in the single test, one after
> > another?
>
> Absolutely.
>
> Neal, Yuchung, would you be kind enough to send a Fastopen packetdrill
> template showing a typical fastopen test
> running on an upstream kernel ?
>
> Thanks !

Sure, here is an example packetdrill script, IIRC written by Yuchung,
which demonstrates TCP fast open and consecutive active connections:

`sysctl -q net.ipv4.tcp_timestamps=0`

// Cache warmup: send a Fast Open cookie request
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
   +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
+.010 < S. 123:123(0) ack 1 win 5840 <mss 1040,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
   +0 > . 1:1(0) ack 1
   +0 close(3) = 0
   +0 > F. 1:1(0) ack 1
+.010 < F. 1:1(0) ack 2 win 92
   +0 > .  2:2(0) ack 2

//
// TEST1: Server sends SYN-ACK with data and another two data packets
//
   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
   +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
   +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO abcd1234,nop,nop>
+.010 < S. 1000000:1001400(1400) ack 1001 win 5840 <mss 1040,nop,nop,sackOK,nop,wscale 6>
   +0 < . 1401:2801(1400) ack 1001 win 257
   +0 < P. 2801:3001(200) ack 1001 win 257

neal
Eric Dumazet Jan. 13, 2017, 5:41 p.m. UTC | #5
On Fri, 2017-01-13 at 12:32 -0500, Neal Cardwell wrote:
> On Fri, Jan 13, 2017 at 12:14 PM, Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Jan 13, 2017 at 9:07 AM, Alexey Kodanev
> > <alexey.kodanev@oracle.com> wrote:
> > > Hi Eric,
> > > On 13.01.2017 18:35, Eric Dumazet wrote:
> >
> > >> Care to send a packetdrill test so that we have a clear picture of what
> > >> is going on ?
> > >
> > > Is it capable of making two connections in the single test, one after
> > > another?
> >
> > Absolutely.
> >
> > Neal, Yuchung, would you be kind enough to send a Fastopen packetdrill
> > template showing a typical fastopen test
> > running on an upstream kernel ?
> >
> > Thanks !
> 
> Sure, here is an example packetdrill script, IIRC written by Yuchung,
> which demonstrates TCP fast open and consecutive active connections:
> 
> `sysctl -q net.ipv4.tcp_timestamps=0`
> 
> // Cache warmup: send a Fast Open cookie request
> 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>    +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
>    +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
> +.010 < S. 123:123(0) ack 1 win 5840 <mss 1040,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
>    +0 > . 1:1(0) ack 1
>    +0 close(3) = 0
>    +0 > F. 1:1(0) ack 1
> +.010 < F. 1:1(0) ack 2 win 92
>    +0 > .  2:2(0) ack 2
> 
> //
> // TEST1: Server sends SYN-ACK with data and another two data packets
> //
>    +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>    +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>    +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>    +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO abcd1234,nop,nop>
> +.010 < S. 1000000:1001400(1400) ack 1001 win 5840 <mss 1040,nop,nop,sackOK,nop,wscale 6>
>    +0 < . 1401:2801(1400) ack 1001 win 257
>    +0 < P. 2801:3001(200) ack 1001 win 257
> 
> neal

Thanks Neal

Also worth adding that packetdrill has the following option to tune the
MTU on the tun device :

--mtu=xxxxx

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4a04496..3d3bd97 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -860,7 +860,12 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
                size_goal = tp->gso_segs * mss_now;
        }

-       return max(size_goal, mss_now);
+       size_goal = max(size_goal, mss_now);
+
+       if (tp->snd_wnd > TCP_MSS_DEFAULT)
+               return min(tp->snd_wnd, size_goal);
+
+       return size_goal;
 }

 static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1d5331a..0ac133f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2445,7 +2445,7 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
 {
        struct sk_buff *skb = tcp_send_head(sk);

-       BUG_ON(!skb || skb->len < mss_now);
+       BUG_ON(!skb);

        tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation);