Message ID | 1337703170.3361.217.camel@edumazet-glaptop
---|---
State | RFC, archived |
Delegated to: | David Miller |
On Tue, 2012-05-22 at 18:12 +0200, Eric Dumazet wrote:
>
> __tcp_select_window() (more precisely tcp_space()) takes into account
> memory used in the receive/ofo queue, but not frames in the backlog
> queue.
>
> So if you send bursts, it might explain why the TCP stack continues to
> advertise too big a window, instead of anticipating the problem.
>
> Please try the following patch:
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index e79aa48..82382cb 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1042,8 +1042,9 @@ static inline int tcp_win_from_space(int space)
> >  /* Note: caller must be prepared to deal with negative returns */
> >  static inline int tcp_space(const struct sock *sk)
> >  {
> > -	return tcp_win_from_space(sk->sk_rcvbuf -
> > -				  atomic_read(&sk->sk_rmem_alloc));
> > +	int used = atomic_read(&sk->sk_rmem_alloc) + sk->sk_backlog.len;
> > +
> > +	return tcp_win_from_space(sk->sk_rcvbuf - used);
> >  }
> >
> >  static inline int tcp_full_space(const struct sock *sk)

I can give this a try (not sure when - probably later this week) but I
think it is back to front. The patch above will reduce the advertised
window by sk_backlog.len, but at the time the window that allowed the
dropped packets to be sent was advertised, the backlog was empty. It is
later, when the kernel is waking the application and takes the socket
lock, that the backlog starts to be used and the drop happens. But
reducing the window advertised at this point is futile - the packets
that will be dropped are already in flight.

The problem exists because the backlog has a tighter limit on it than
the receive window does. I think the backlog should be able to accept
sk_rcvbuf bytes in addition to what is already in the receive buffer (or
up to the advertised receive window if that's smaller). At the moment it
will only accept sk_rcvbuf bytes including what is already in the
receive buffer. The logic is that in this case we're using the backlog
because the kernel is in the process of emptying the receive buffer into
the application, so the receive buffer will very soon be empty and we
will very soon be able to accept sk_rcvbuf bytes. This is evident from
the packet capture, as the kernel stack is quite happy to accept the
significant quantity of data that arrives as part of the same burst
immediately after it has dropped a couple of packets.

Perhaps it would be easier for me to write a patch to show this
suggested solution?

Kieran
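A minimal sketch of the two backlog admission policies Kieran contrasts
above. These are hypothetical helpers for illustration, not actual kernel
code; the real check of that era lives in sk_rcvqueues_full(), called from
sk_add_backlog():

```c
/* Hypothetical comparison of the two policies; "truesize" is the
 * memory charge of the arriving skb.
 */

/* Current policy (roughly what sk_rcvqueues_full() does): the backlog
 * and the receive queue share a single sk_rcvbuf budget.
 */
static bool backlog_full_today(const struct sock *sk, unsigned int truesize)
{
	unsigned int qsize = sk->sk_backlog.len +
			     atomic_read(&sk->sk_rmem_alloc);

	return qsize + truesize > (unsigned int)sk->sk_rcvbuf;
}

/* Suggested policy: the backlog may hold a full sk_rcvbuf on top of
 * whatever the (soon to be drained) receive queue still charges.
 */
static bool backlog_full_suggested(const struct sock *sk,
				   unsigned int truesize)
{
	return sk->sk_backlog.len + truesize > (unsigned int)sk->sk_rcvbuf;
}
```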
On Tue, 2012-05-22 at 17:32 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-22 at 18:12 +0200, Eric Dumazet wrote:
> >
> > __tcp_select_window() (more precisely tcp_space()) takes into account
> > memory used in the receive/ofo queue, but not frames in the backlog
> > queue.
> >
> > So if you send bursts, it might explain why the TCP stack continues
> > to advertise too big a window, instead of anticipating the problem.
> >
> > Please try the following patch:
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index e79aa48..82382cb 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -1042,8 +1042,9 @@ static inline int tcp_win_from_space(int space)
> >  /* Note: caller must be prepared to deal with negative returns */
> >  static inline int tcp_space(const struct sock *sk)
> >  {
> > -	return tcp_win_from_space(sk->sk_rcvbuf -
> > -				  atomic_read(&sk->sk_rmem_alloc));
> > +	int used = atomic_read(&sk->sk_rmem_alloc) + sk->sk_backlog.len;
> > +
> > +	return tcp_win_from_space(sk->sk_rcvbuf - used);
> >  }
> >
> >  static inline int tcp_full_space(const struct sock *sk)
>
> I can give this a try (not sure when - probably later this week) but I
> think it is back to front. The patch above will reduce the advertised
> window by sk_backlog.len, but at the time the window that allowed the
> dropped packets to be sent was advertised, the backlog was empty. It is
> later, when the kernel is waking the application and takes the socket
> lock, that the backlog starts to be used and the drop happens. But
> reducing the window advertised at this point is futile - the packets
> that will be dropped are already in flight.

Not really. If we receive these packets while the backlog is empty, then
the sender violates TCP rules.

We advertise the TCP window directly from the memory we are allowed to
consume. (On the premise the sender behaves correctly, not sending bytes
in small packets.)

> The problem exists because the backlog has a tighter limit on it than
> the receive window does. I think the backlog should be able to accept
> sk_rcvbuf bytes in addition to what is already in the receive buffer
> (or up to the advertised receive window if that's smaller). At the
> moment it will only accept sk_rcvbuf bytes including what is already
> in the receive buffer. The logic is that in this case we're using the
> backlog because the kernel is in the process of emptying the receive
> buffer into the application, so the receive buffer will very soon be
> empty and we will very soon be able to accept sk_rcvbuf bytes. This is
> evident from the packet capture, as the kernel stack is quite happy to
> accept the significant quantity of data that arrives as part of the
> same burst immediately after it has dropped a couple of packets.

This is not evident from the capture; you are mistaken.

tcpdump captures packets before the TCP stack, so it doesn't say whether
they are:

1) queued in the receive or ofo queue
2) queued in the socket backlog
3) dropped because we hit the socket rcvbuf limit

If the socket lock is held by the user, packets are queued to the
backlog, or dropped.

Then, when the socket lock is about to be released, we process the
backlog.
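For reference, the three fates Eric lists correspond to the dispatch at
the bottom of tcp_v4_rcv(). A simplified sketch from kernels of that era
(exact signatures varied between versions; prequeue handling and error
paths omitted):

```c
bh_lock_sock_nested(sk);
ret = 0;
if (!sock_owned_by_user(sk)) {
	/* lock free: process now, into the receive/ofo queue (1) */
	ret = tcp_v4_do_rcv(sk, skb);
} else if (unlikely(sk_add_backlog(sk, skb))) {
	/* owner busy and backlog over sk_rcvbuf: drop (3) */
	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
	goto discard_and_relse;
}
/* otherwise the skb sits in the backlog (2) until release_sock() */
bh_unlock_sock(sk);
```

LINUX_MIB_TCPBACKLOGDROP is the counter behind the "backlogdrop" figure
quoted later in the thread; release_sock() eventually walks the backlog
and feeds each skb back through tcp_v4_do_rcv().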
On Tue, 2012-05-22 at 18:45 +0200, Eric Dumazet wrote:
> This is not evident from the capture; you are mistaken.
>
> tcpdump captures packets before the TCP stack, so it doesn't say
> whether they are:
>
> 1) queued in the receive or ofo queue
> 2) queued in the socket backlog
> 3) dropped because we hit the socket rcvbuf limit
>
> If the socket lock is held by the user, packets are queued to the
> backlog, or dropped.
>
> Then, when the socket lock is about to be released, we process the
> backlog.

BTW, the latest iproute2 ss utility has nice information if you add -m:

misc/ss -m dst 192.168.99.2
State      Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
ESTAB      3441896 0        192.168.99.1:44409   192.168.99.2:41197
	skmem:(r5035136,rb6291456,t0,tb23080,f1149824,w0,o0)

Here you can see that for 3441896 bytes in the TCP queue (payload), we
have 5035136 bytes in rmem_alloc, and 6291456 'bytes' in sk_rcvbuf.

It lacks the backlog len; I'll send a patch when net-next reopens.
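For reference, ss(8) decodes these skmem fields as r = sk_rmem_alloc,
rb = sk_rcvbuf, t = sk_wmem_alloc, tb = sk_sndbuf, f = forward-allocated
memory, w = sk_wmem_queued and o = option memory; a bl field for the
backlog length appears only in later iproute2 versions, per the patch
Eric promises here. The gap between r and Recv-Q in this example
(5035136 - 3441896 = 1593240 bytes, roughly a third of the allocation)
is per-skb truesize overhead, the same overhead Eric attributes to ixgbe
below.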
Update: my tests show that sk_backlog.len can easily reach 1.5MB on a
netperf run, even with a fast machine receiving a 10Gb flow (ixgbe
adapter), if LRO/GRO are off.

424 backlogdrop events for 193,182,546 incoming packets (with my
tcp_space() patch applied).

I believe that as soon as ixgbe can use build_skb() and avoid the 1024
bytes of overhead per skb, the problem should go away.

Of course, another way to solve it would be to change tcp_recvmsg() to
use lock_sock_fast(), so that no frame is backlogged at all.

Locking the socket for the whole operation (including the copyout to
user space) is not very good. It was good enough years ago with small
receive windows.

With a potentially huge backlog, it means the user process has to
process it, regardless of its latency constraints. CPU caches are also
completely destroyed because of the huge amount of data spread over
thousands of skbs.
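A minimal sketch of the lock_sock_fast() pattern Eric refers to, modeled
on its existing use in udp_recvmsg(); this illustrates the locking
discipline only and is not an actual tcp_recvmsg() change:

```c
/* Sketch only.  On the fast path, lock_sock_fast() takes just the
 * socket spinlock and leaves sk_lock.owned clear, so concurrent
 * softirq delivery processes packets directly instead of diverting
 * them to the socket backlog.
 */
bool slow;

slow = lock_sock_fast(sk);
/* ... short critical section: dequeue from sk->sk_receive_queue ... */
unlock_sock_fast(sk, slow);	/* spin_unlock_bh() or release_sock() */
```

The idea is to keep the owned section short (queue manipulation only,
not the copy to user space) so delivery never has to accumulate a
backlog in the first place.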
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 23 May 2012 11:44:06 +0200

> Locking the socket for the whole operation (including the copyout to
> user space) is not very good. It was good enough years ago with small
> receive windows.
>
> With a potentially huge backlog, it means the user process has to
> process it, regardless of its latency constraints. CPU caches are also
> completely destroyed because of the huge amount of data spread over
> thousands of skbs.

But it is the only way we can have TCP processing scheduled and
accounted to user processes. That does have value when you have lots of
flows active.

The scheduler's ability to give the process CPU time influences TCP's
behavior, and under load, if the process can't get enough CPU time,
TCP will back off. We want that.
On Wed, 2012-05-23 at 13:34 -0400, David Miller wrote:
> But it is the only way we can have TCP processing scheduled and
> accounted to user processes. That does have value when you have lots
> of flows active.
>
> The scheduler's ability to give the process CPU time influences TCP's
> behavior, and under load, if the process can't get enough CPU time,
> TCP will back off. We want that.

But TCP already backs off if the user process is not blocked on socket
input.

Modern applications use select()/poll()/epoll() on many sockets in
parallel.

Only old ones still block on recv().
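A minimal sketch of the event-driven pattern Eric contrasts with blocking
recv(); the function name, buffer sizes and setup are arbitrary, and it
assumes the sockets are non-blocking and registered with EPOLLIN:

```c
#include <sys/epoll.h>
#include <sys/socket.h>

static void event_loop(int epfd)
{
	struct epoll_event events[64];
	char buf[65536];

	for (;;) {
		int n = epoll_wait(epfd, events, 64, -1);

		for (int i = 0; i < n; i++) {
			int fd = events[i].data.fd;

			/* Drain whatever is available without blocking;
			 * the process waits for readiness in epoll_wait()
			 * across many sockets rather than sleeping
			 * inside a single recv().
			 */
			while (recv(fd, buf, sizeof(buf), MSG_DONTWAIT) > 0)
				;
		}
	}
}
```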
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 23 May 2012 19:46:50 +0200

> But TCP already backs off if the user process is not blocked on socket
> input.
>
> Modern applications use select()/poll()/epoll() on many sockets in
> parallel.
>
> Only old ones still block on recv().

These arguments seem circular. Those modern applications still trigger
enough TCP work during their recv() calls that it's significant for
scheduling purposes, and to me, being able to account that TCP work as
process time is still extremely beneficial.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e79aa48..82382cb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1042,8 +1042,9 @@ static inline int tcp_win_from_space(int space)
 /* Note: caller must be prepared to deal with negative returns */
 static inline int tcp_space(const struct sock *sk)
 {
-	return tcp_win_from_space(sk->sk_rcvbuf -
-				  atomic_read(&sk->sk_rmem_alloc));
+	int used = atomic_read(&sk->sk_rmem_alloc) + sk->sk_backlog.len;
+
+	return tcp_win_from_space(sk->sk_rcvbuf - used);
 }
 
 static inline int tcp_full_space(const struct sock *sk)
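To make the patch's effect concrete, a standalone toy calculation (not
kernel code) plugging the numbers from the ss output above into the
patched formula; tcp_adv_win_scale = 1 and the 1.5MB backlog are assumed
values for illustration:

```c
/* Toy user-space calculation mirroring the 3.x tcp_win_from_space()/
 * tcp_space() pair; sysctl and backlog values are assumptions.
 */
#include <stdio.h>

static int tcp_adv_win_scale = 1;	/* assumed sysctl value */

static int tcp_win_from_space(int space)
{
	return tcp_adv_win_scale <= 0 ?
	       space >> -tcp_adv_win_scale :
	       space - (space >> tcp_adv_win_scale);
}

int main(void)
{
	int sk_rcvbuf   = 6291456;	/* rb from the ss output */
	int rmem_alloc  = 5035136;	/* r  from the ss output */
	int backlog_len = 1500000;	/* assumed, per the 1.5MB figure */

	/* before the patch: the backlog is invisible to the window
	 * computation, so roughly 628 KB of window is still offered
	 */
	printf("tcp_space before: %d\n",
	       tcp_win_from_space(sk_rcvbuf - rmem_alloc));

	/* after the patch: the backlog counts as used receive memory;
	 * the result goes negative, which the caller must handle (see
	 * the "negative returns" note in the hunk above)
	 */
	printf("tcp_space after:  %d\n",
	       tcp_win_from_space(sk_rcvbuf - rmem_alloc - backlog_len));

	return 0;
}
```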