Message ID: 1417788937.4322.21.camel@edumazet-glaptop2.roam.corp.google.com
State: Superseded, archived
Delegated to: David Miller
On Fri, Dec 5, 2014 at 9:15 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
> control TSO size, but did this at the wrong place (sendmsg() time).
>
> At sendmsg() time, we might have a pessimistic view of flow rate,
> and we end up building very small skbs (with 2 MSS per skb).
>
> This is bad because:
>
> - It sends small TSO packets even in Slow Start, where rate quickly
>   increases.
> - It tends to make the socket write queue very big, increasing tcp_ack()
>   processing time, but also increasing memory needs, not necessarily
>   accounted for, as fast clone overhead is currently ignored.
> - Lower GRO efficiency and more ACK packets.
>
> Servers with a lot of short-lived connections suffer from this.
>
> Let's instead fill skbs as much as possible (64KB of payload), but split
> them at xmit time, when we have a precise idea of the flow rate.
> skb split is actually quite efficient.

Nice. I definitely agree this is the right direction.

However, from my experience testing a variant of this approach, this
kind of late decision about packet size was sometimes causing
performance shortfalls on long-RTT, medium-bandwidth paths unless
tcp_tso_should_defer() was also modified to use the new, smaller packet
size goal.

The issue is that tcp_tso_should_defer() uses tp->xmit_size_goal_segs
as a yardstick, and says, "hey, if cwnd and rwin allow us to send
tp->xmit_size_goal_segs * tp->mss_cache, then let's go ahead and send
it now."

But if we remove the sendmsg-time autosizing logic that was tuning
tp->xmit_size_goal_segs, then tcp_tso_should_defer() is going to
keep waiting to accumulate permission to send a big skb with
tp->xmit_size_goal_segs (e.g. ~40) MSS in it.

In my tests I was able to fix this issue by making
tcp_tso_should_defer() use the latest size goal instead of
tp->xmit_size_goal_segs.

So, how about computing the rate-based TSO autosizing goal (stored in
"segs" in this patch) at the top of tcp_write_xmit()? Then we could
pass that segment goal to tcp_tso_should_defer() for use instead of
tp->xmit_size_goal_segs in deciding whether we have a big enough chunk
to send now. Similarly, that segment goal could be passed to
tcp_mss_split_point() instead of sk->sk_gso_max_segs.

(The autosizing calculation could be in a helper function to keep
tcp_write_xmit() manageable.)

neal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
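[Editor's note: a minimal userspace sketch of the helper Neal suggests, computing a rate-based segment goal once per tcp_write_xmit() call. The function name, the fixed 2-segment floor (the sysctl_tcp_min_tso_segs default), and the plain-C types are assumptions for illustration, not the kernel implementation.]

```c
#include <stdint.h>

/* Hypothetical sketch: derive a per-skb segment goal from the pacing
 * rate, so the flow sends roughly one packet per ms instead of one
 * huge TSO packet per 100 ms.
 */
static uint32_t tso_autosize_goal(uint64_t pacing_rate, uint32_t mss_now)
{
	/* Bytes sendable in ~1 ms; >> 10 divides by 1024, a cheap
	 * approximation of dividing by 1000. */
	uint32_t bytes = (uint32_t)(pacing_rate >> 10);
	uint32_t min_tso_segs = 2;	/* assumed sysctl default */
	uint32_t segs = bytes / mss_now;

	return segs > min_tso_segs ? segs : min_tso_segs;
}
```

For example, at 1 Gbit/s (125,000,000 bytes/s) with a 1448-byte MSS this yields an 84-segment goal, while a very slow flow falls back to the 2-segment floor.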
On Fri, 2014-12-05 at 10:32 -0500, Neal Cardwell wrote:
> On Fri, Dec 5, 2014 at 9:15 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Commit 95bd09eb2750 ("tcp: TSO packets automatic sizing") tried to
> > control TSO size, but did this at the wrong place (sendmsg() time).
> >
> > At sendmsg() time, we might have a pessimistic view of flow rate,
> > and we end up building very small skbs (with 2 MSS per skb).
> >
> > This is bad because:
> >
> > - It sends small TSO packets even in Slow Start, where rate quickly
> >   increases.
> > - It tends to make the socket write queue very big, increasing tcp_ack()
> >   processing time, but also increasing memory needs, not necessarily
> >   accounted for, as fast clone overhead is currently ignored.
> > - Lower GRO efficiency and more ACK packets.
> >
> > Servers with a lot of short-lived connections suffer from this.
> >
> > Let's instead fill skbs as much as possible (64KB of payload), but split
> > them at xmit time, when we have a precise idea of the flow rate.
> > skb split is actually quite efficient.
>
> Nice. I definitely agree this is the right direction.
>
> However, from my experience testing a variant of this approach, this
> kind of late decision about packet size was sometimes causing
> performance shortfalls on long-RTT, medium-bandwidth paths unless
> tcp_tso_should_defer() was also modified to use the new, smaller packet
> size goal.
>
> The issue is that tcp_tso_should_defer() uses tp->xmit_size_goal_segs
> as a yardstick, and says, "hey, if cwnd and rwin allow us to send
> tp->xmit_size_goal_segs * tp->mss_cache, then let's go ahead and send
> it now."
>
> But if we remove the sendmsg-time autosizing logic that was tuning
> tp->xmit_size_goal_segs, then tcp_tso_should_defer() is going to
> keep waiting to accumulate permission to send a big skb with
> tp->xmit_size_goal_segs (e.g. ~40) MSS in it.
>
> In my tests I was able to fix this issue by making
> tcp_tso_should_defer() use the latest size goal instead of
> tp->xmit_size_goal_segs.
>
> So, how about computing the rate-based TSO autosizing goal (stored in
> "segs" in this patch) at the top of tcp_write_xmit()? Then we could
> pass that segment goal to tcp_tso_should_defer() for use instead of
> tp->xmit_size_goal_segs in deciding whether we have a big enough chunk
> to send now. Similarly, that segment goal could be passed to
> tcp_mss_split_point() instead of sk->sk_gso_max_segs.
>
> (The autosizing calculation could be in a helper function to keep
> tcp_write_xmit() manageable.)

Sounds like an awesome suggestion indeed, I am working on it.

Thanks Neal !
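[Editor's note: the stall Neal describes can be illustrated with a toy userspace model of the yardstick check in tcp_tso_should_defer(). The function name and values below are assumptions for illustration only; the real kernel check involves several additional conditions.]

```c
#include <stdint.h>

/* Toy model: keep deferring transmission while cwnd/rwnd do not yet
 * permit a full size-goal worth of data in one skb. */
static int would_defer(uint32_t sendable_bytes, uint32_t goal_segs,
		       uint32_t mss)
{
	return sendable_bytes < goal_segs * mss;
}
```

With a stale goal of ~40 segments (57,920 bytes at a 1448-byte MSS), a flow currently allowed to send 20,000 bytes keeps waiting; with a rate-based goal of, say, 4 segments (5,792 bytes), the same flow transmits immediately.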
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dc13a3657e8e..91e6c1406313 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -840,24 +840,14 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	xmit_size_goal = mss_now;
 
 	if (large_allowed && sk_can_gso(sk)) {
-		u32 gso_size, hlen;
+		u32 hlen;
 
 		/* Maybe we should/could use sk->sk_prot->max_header here ? */
 		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
 		       inet_csk(sk)->icsk_ext_hdr_len +
 		       tp->tcp_header_len;
 
-		/* Goal is to send at least one packet per ms,
-		 * not one big TSO packet every 100 ms.
-		 * This preserves ACK clocking and is consistent
-		 * with tcp_tso_should_defer() heuristic.
-		 */
-		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
-		gso_size = max_t(u32, gso_size,
-				 sysctl_tcp_min_tso_segs * mss_now);
-
-		xmit_size_goal = min_t(u32, gso_size,
-				       sk->sk_gso_max_size - 1 - hlen);
+		xmit_size_goal = sk->sk_gso_max_size - 1 - hlen;
 
 		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f5bd4bd3f7e6..dac9991bccbf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1533,6 +1533,16 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	u32 partial, needed, window, max_len;
+	u32 bytes = sk->sk_pacing_rate >> 10;
+	u32 segs;
+
+	/* Goal is to send at least one packet per ms,
+	 * not one big TSO packet every 100 ms.
+	 * This preserves ACK clocking and is consistent
+	 * with tcp_tso_should_defer() heuristic.
+	 */
+	segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+	max_segs = min_t(u32, max_segs, segs);
 
 	window = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
 	max_len = mss_now * max_segs;
@@ -2008,6 +2018,18 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 		}
 
+		limit = mss_now;
+		if (tso_segs > 1 && !tcp_urg_mode(tp))
+			limit = tcp_mss_split_point(sk, skb, mss_now,
+						    min_t(unsigned int,
+							  cwnd_quota,
+							  sk->sk_gso_max_segs),
+						    nonagle);
+
+		if (skb->len > limit &&
+		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			break;
+
 		/* TCP Small Queues :
 		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
 		 * This allows for :
@@ -2018,8 +2040,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-			      sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
@@ -2032,18 +2054,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 		}
 
-		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
-			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min_t(unsigned int,
-							  cwnd_quota,
-							  sk->sk_gso_max_segs),
-						    nonagle);
-
-		if (skb->len > limit &&
-		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
-			break;
-
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;
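[Editor's note: the last hunk changes the TCP Small Queues limit from a flat max of the sysctl and the pacing estimate to "two skbs' worth of truesize or ~1 ms of bytes, capped by the sysctl". A minimal userspace model of that computation follows; the function name and plain-C types are assumptions, not kernel code.]

```c
#include <stdint.h>

/* Sketch of the revised TSQ limit: allow at least two in-flight skbs
 * (by truesize) or roughly 1 ms of data at the pacing rate, whichever
 * is larger, but never exceed the sysctl cap. */
static uint32_t tsq_limit(uint32_t skb_truesize, uint64_t pacing_rate,
			  uint32_t sysctl_limit_output_bytes)
{
	uint32_t limit = 2 * skb_truesize;
	uint32_t rate_bytes = (uint32_t)(pacing_rate >> 10); /* ~1 ms */

	if (rate_bytes > limit)
		limit = rate_bytes;
	if (limit > sysctl_limit_output_bytes)
		limit = sysctl_limit_output_bytes;
	return limit;
}
```

The effect is that slow flows are throttled down to two skbs queued below the socket (rather than always being allowed the full sysctl budget), while fast flows still get enough queued bytes to sustain line rate.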