Message ID: 1386958426.19078.162.camel@edumazet-glaptop2.roam.corp.google.com
State: Changes Requested, archived
Delegated to: David Miller
On Fri, Dec 13, 2013 at 1:13 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While investigating performance problems on small RPC workloads,
> I noticed linux TCP stack was always splitting the last TSO skb
> into two parts (skbs). One being a multiple of MSS, and a small one
> with the Push flag. This split is done even if TCP_NODELAY is set.
> ...
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Nandita Dukkipati <nanditad@google.com>
> Cc: Van Jacobson <vanj@google.com>
> ---
> v2: changed tcp_minshall_update() as Neal pointed out.

It occurred to me after staring at this code a little longer that although the Nagle test is already done before the call to tcp_mss_split_point(), tcp_nagle_check() is only run if tso_segs==1 and only checks whether skb->len < mss_now. So the Nagle code currently is implicitly assuming that if there is an skb that is not an exact multiple of MSS, then tcp_mss_split_point() will chop off the odd bytes at the end and we'll loop back to the top of the tcp_write_xmit() loop to make a Nagle decision on an skb that has tso_segs==1.

So if we just make this improvement to tcp_mss_split_point() and the adjustment to tcp_minshall_update() (as in patch v2), we will still have broken Nagle behavior in a scenario like:

- suppose MSS=1460 and Nagle is enabled (TCP_NODELAY is not set)

- suppose there is an outstanding un-ACKed small packet, so that tcp_minshall_check() returns true

- user writes 2000 bytes

- tso_segs is 2, so we do not call tcp_nagle_test()

- the new version of tcp_mss_split_point() sends out the full 2000 bytes, leading to having 2 packets smaller than an MSS un-ACKed in the network, violating the invariant the minshall/nagle code is trying to maintain of having only one such packet un-ACKed in the network.

Previously, tcp_mss_split_point() would have split off the last 2000-1460 = 540 bytes into a new skb, we would have looped back and executed the tcp_nagle_test() code, and decided not to send that small sub-MSS 540-byte packet.

neal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2013-12-13 at 13:17 -0500, Neal Cardwell wrote:
> It occurred to me after staring at this code a little longer that
> although the Nagle test is already done before the call to
> tcp_mss_split_point(), the tcp_nagle_check() is only run if
> tso_segs==1 and is only checking whether skb->len < mss_now, so the
> Nagle code currently is implicitly assuming that if there is an skb
> that is not an exact multiple of MSS, then tcp_mss_split_point()
> will chop off the odd bytes at the end and we'll loop back to the top
> of the tcp_write_xmit() loop to make a Nagle decision on an skb that
> has tso_segs==1.
>
> So if we just make this improvement to tcp_mss_split_point() and the
> adjustment to tcp_minshall_update() (as in patch v2), we will still
> have broken Nagle behavior in a scenario like:
>
> - suppose MSS=1460 and Nagle is enabled (TCP_NODELAY is not set)
>
> - suppose there is an outstanding un-ACKed small packet, so that
>   tcp_minshall_check() returns true
>
> - user writes 2000 bytes
>
> - tso_segs is 2, so we do not call tcp_nagle_test()
>
> - new version of tcp_mss_split_point() sends out the full 2000 bytes,
>   leading to having 2 packets smaller than an MSS un-ACKed in the
>   network, violating the invariant the minshall/nagle code is trying
>   to maintain of having only one such packet un-ACKed in the network.
>
> Previously, tcp_mss_split_point() would have split off the 2000-1460
> bytes into a new skb, we would have looped back and executed the
> tcp_nagle_test() code and decided not to send that small sub-MSS
> 2000-1460 packet.

Hmm, it seems we need to refactor a bit: tcp_snd_test() does not take a partial TSO send into account.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f7e1ab2139ef..9cd62bc09055 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -978,13 +978,6 @@ static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
 }
 
 bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
 
-static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss,
-				       const struct sk_buff *skb)
-{
-	if (skb->len < mss)
-		tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
-}
-
 static inline void tcp_check_probe_timer(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2a69f42e51ca..2c5c0029d1a0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1384,20 +1384,11 @@ static void tcp_cwnd_validate(struct sock *sk)
 	}
 }
 
-/* Returns the portion of skb which can be sent right away without
- * introducing MSS oddities to segment boundaries. In rare cases where
- * mss_now != mss_cache, we will request caller to create a small skb
- * per input skb which could be mostly avoided here (if desired).
- *
- * We explicitly want to create a request for splitting write queue tail
- * to a small skb for Nagle purposes while avoiding unnecessary modulos,
- * thus all the complexity (cwnd_len is always MSS multiple which we
- * return whenever allowed by the other factors). Basically we need the
- * modulo only when the receiver window alone is the limiting factor or
- * when we would be allowed to send the split-due-to-Nagle skb fully.
- */
-static unsigned int tcp_mss_split_point(const struct sock *sk, const struct sk_buff *skb,
-					unsigned int mss_now, unsigned int max_segs)
+/* Returns the portion of skb which can be sent right away */
+static unsigned int tcp_mss_split_point(const struct sock *sk,
+					const struct sk_buff *skb,
+					unsigned int mss_now,
+					unsigned int max_segs)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	u32 needed, window, max_len;
@@ -1410,10 +1401,7 @@ static unsigned int tcp_mss_split_point(const struct sock *sk, const struct sk_b
 
 	needed = min(skb->len, window);
 
-	if (max_len <= needed)
-		return max_len;
-
-	return needed - needed % mss_now;
+	return min(max_len, needed);
 }
 
 /* Can at least one segment of SKB be sent right now, according to the
@@ -1454,12 +1442,27 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
 }
 
 /* Minshall's variant of the Nagle send check. */
-static inline bool tcp_minshall_check(const struct tcp_sock *tp)
+static bool tcp_minshall_check(const struct tcp_sock *tp)
 {
 	return after(tp->snd_sml, tp->snd_una) &&
 	       !after(tp->snd_sml, tp->snd_nxt);
 }
 
+/* Update snd_sml if this skb is under mss
+ * Note that a TSO packet might end with a sub-mss segment
+ * The test is really :
+ *   if ((skb->len % mss_now) != 0)
+ *       tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+ * But we can avoid doing the divide again given we already have
+ *   skb_pcount = skb->len / mss_now
+ */
+static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
+				const struct sk_buff *skb)
+{
+	if (skb->len < tcp_skb_pcount(skb) * mss_now)
+		tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+}
+
 /* Return false, if packet can be sent now without violation Nagle's rules:
  * 1. It is full sized.
  * 2. Or it contains FIN. (already checked by caller)