
TCPBacklogDrops during aggressive bursts of traffic

Message ID 1337703170.3361.217.camel@edumazet-glaptop
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet May 22, 2012, 4:12 p.m. UTC
On Tue, 2012-05-22 at 16:09 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-22 at 11:30 +0200, Eric Dumazet wrote:
> > Also can you post a pcap capture of problematic flow ?
> 
> I'll email this to you directly. The capture is generated with netserver
> on the system under test, and NetPerf sending from a similar server.
> I've only included the first 1000 frames to keep the capture size down.
> There are 7 retransmissions in that capture, and the TCPBacklogDrops
> counter incremented by 7 during the test, so I'm happy to say they are
> the cause of the drops.
> 
> The system under test was running net-next.
> 
> I've not tried with another NIC (e.g. tg3) but will see if I can find
> one to test.

Or you could change sfc to allow its frames to be coalesced.

> 
> I've got a feeling that the drops might be easier to reproduce if I
> taskset the netserver process to a different package than the one that
> is handling the network interrupt for that NIC.  This fits with my
> earlier theory in that it is likely to increase the overhead of waking
> the user-level process to satisfy the read and so increase the time
> during which received packets could overflow the backlog.  Having a
> relatively aggressive sending TCP also helps, e.g. one that is
> configured to open its congestion window quickly, as this will produce
> more intensive bursts.

__tcp_select_window() (more precisely tcp_space()) takes into account
memory used in the receive/ofo queues, but not frames in the backlog
queue.

So if you send bursts, that might explain why the TCP stack continues to
advertise too big a window instead of anticipating the problem.
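
For context, this is roughly how the advertised window derives from
tcp_space() (a simplified sketch, not the exact kernel code; the function
name is invented and most of __tcp_select_window()'s rounding and scaling
logic is omitted):

#include <net/tcp.h>

/* Free receive space as computed before the patch below: sk_rcvbuf minus
 * what is charged to sk_rmem_alloc.  Frames parked in sk->sk_backlog are
 * invisible to this calculation, so the window we advertise can be larger
 * than what we can actually absorb.
 */
static u32 select_window_sketch(struct sock *sk)
{
	int free_space = tcp_win_from_space(sk->sk_rcvbuf -
					    atomic_read(&sk->sk_rmem_alloc));

	if (free_space <= 0)
		return 0;

	/* the real __tcp_select_window() also rounds this to a multiple of
	 * mss and applies the receive window scale
	 */
	return min_t(u32, free_space, tcp_sk(sk)->rcv_ssthresh);
}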

Please try the following patch :





Comments

Kieran Mansley May 22, 2012, 4:32 p.m. UTC | #1
On Tue, 2012-05-22 at 18:12 +0200, Eric Dumazet wrote:
> 
> __tcp_select_window() (more precisely tcp_space()) takes into account
> memory used in the receive/ofo queues, but not frames in the backlog
> queue.
> 
> So if you send bursts, that might explain why the TCP stack continues
> to advertise too big a window instead of anticipating the problem.
> 
> Please try the following patch :
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index e79aa48..82382cb 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1042,8 +1042,9 @@ static inline int tcp_win_from_space(int space)
>  /* Note: caller must be prepared to deal with negative returns */ 
>  static inline int tcp_space(const struct sock *sk)
>  {
> -       return tcp_win_from_space(sk->sk_rcvbuf -
> -                                 atomic_read(&sk->sk_rmem_alloc));
> +       int used = atomic_read(&sk->sk_rmem_alloc) + sk->sk_backlog.len;
> +
> +       return tcp_win_from_space(sk->sk_rcvbuf - used);
>  } 
>  
>  static inline int tcp_full_space(const struct sock *sk)


I can give this a try (not sure when - probably later this week) but I
think it is back to front.  The patch above will reduce the advertised
window by sk_backlog.len, but at the time the window that allowed the
dropped packets to be sent was advertised, the backlog was empty.  It is
only later, when the kernel wakes the application and takes the socket
lock, that the backlog starts to be used and the drop happens.  Reducing
the advertised window at that point is futile - the packets that will be
dropped are already in flight.

The problem exists because the backlog has a tighter limit on it than
the receive window does; I think the backlog should be able to accept
sk_rcvbuf bytes in addition to what is already in the receive buffer (or
up to the advertised receive window if that's smaller).  At the moment
it will only accept sk_rcvbuf bytes including what is already in the
receive buffer.  The logic being that in this case we're using the
backlog because it's in the process of emptying the receive buffer into
the application, and so the receive buffer will very soon be empty, and
so we will very soon be able to accept sk_rcvbuf bytes.  This is evident
from the packet capture as the kernel stack is quite happy to accept the
significant quantity of data that arrives as part of the same burst
immediately after it has dropped a couple of packets.
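
To illustrate the current limit (this is my paraphrase of
sk_rcvqueues_full() / sk_add_backlog() in include/net/sock.h, not the
exact code, and the helper name is invented): the backlog and the receive
queue are charged against a single sk_rcvbuf budget.

#include <net/sock.h>

/* Bytes already charged to the receive queue (sk_rmem_alloc) and bytes
 * waiting in the backlog share one sk_rcvbuf budget.  When the sum
 * exceeds it, sk_add_backlog() fails and the skb is dropped
 * (TCPBacklogDrops).
 */
static inline bool rcvqueues_full_sketch(const struct sock *sk,
					 unsigned int limit /* sk->sk_rcvbuf */)
{
	unsigned int qsize = sk->sk_backlog.len +
			     atomic_read(&sk->sk_rmem_alloc);

	return qsize > limit;
}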

Perhaps it would be easier for me to write a patch to show this
suggested solution?

Kieran

Eric Dumazet May 22, 2012, 4:45 p.m. UTC | #2
On Tue, 2012-05-22 at 17:32 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-22 at 18:12 +0200, Eric Dumazet wrote:
> > 
> > __tcp_select_window() (more precisely tcp_space()) takes into account
> > memory used in the receive/ofo queues, but not frames in the backlog
> > queue.
> > 
> > So if you send bursts, that might explain why the TCP stack continues
> > to advertise too big a window instead of anticipating the problem.
> > 
> > Please try the following patch :
> > 
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index e79aa48..82382cb 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -1042,8 +1042,9 @@ static inline int tcp_win_from_space(int space)
> >  /* Note: caller must be prepared to deal with negative returns */ 
> >  static inline int tcp_space(const struct sock *sk)
> >  {
> > -       return tcp_win_from_space(sk->sk_rcvbuf -
> > -                                 atomic_read(&sk->sk_rmem_alloc));
> > +       int used = atomic_read(&sk->sk_rmem_alloc) + sk->sk_backlog.len;
> > +
> > +       return tcp_win_from_space(sk->sk_rcvbuf - used);
> >  } 
> >  
> >  static inline int tcp_full_space(const struct sock *sk)
> 
> 
> I can give this a try (not sure when - probably later this week) but I
> think it is back to front.  The patch above will reduce the advertised
> window by sk_backlog.len, but at the time the window that allowed the
> dropped packets to be sent was advertised, the backlog was empty.  It is
> only later, when the kernel wakes the application and takes the socket
> lock, that the backlog starts to be used and the drop happens.  Reducing
> the advertised window at that point is futile - the packets that will be
> dropped are already in flight.
> 

Not really. If we receive these packets while the backlog is empty, then
the sender violates TCP rules.

We advertise the TCP window directly from the memory we are allowed to
consume.

(On the premise that the sender behaves correctly, not sending bytes in
small packets.)


> The problem exists because the backlog has a tighter limit on it than
> the receive window does; I think the backlog should be able to accept
> sk_rcvbuf bytes in addition to what is already in the receive buffer (or
> up to the advertised receive window if that's smaller).  At the moment
> it will only accept sk_rcvbuf bytes including what is already in the
> receive buffer.  The logic being that in this case we're using the
> backlog because it's in the process of emptying the receive buffer into
> the application, and so the receive buffer will very soon be empty, and
> so we will very soon be able to accept sk_rcvbuf bytes.  This is evident
> from the packet capture as the kernel stack is quite happy to accept the
> significant quantity of data that arrives as part of the same burst
> immediately after it has dropped a couple of packets.
> 

This is not evident from the capture; you are mistaken.

tcpdump captures packets before the TCP stack, so it doesn't say whether
they are:

1) queued in the receive or ofo queue
2) queued in the socket backlog
3) dropped because we hit the socket rcvbuf limit

If the socket lock is held by the user, packets are queued to the
backlog, or dropped.

Then, when the socket lock is about to be released, we process the
backlog.
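
Condensed, the relevant logic at the tail of tcp_v4_rcv() looks roughly
like this (a paraphrase, not verbatim; the wrapper function name is made
up and error handling is omitted):

#include <net/tcp.h>

static void rcv_path_sketch(struct net *net, struct sock *sk,
			    struct sk_buff *skb)
{
	bh_lock_sock_nested(sk);
	if (!sock_owned_by_user(sk)) {
		tcp_v4_do_rcv(sk, skb);		/* processed immediately */
	} else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
		/* backlog + rmem_alloc over sk_rcvbuf: this is the drop */
		NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
		kfree_skb(skb);
	}
	bh_unlock_sock(sk);

	/* Later, release_sock() -> __release_sock() walks sk->sk_backlog
	 * and feeds every queued skb to tcp_v4_do_rcv() before the lock
	 * is handed back to softirq context.
	 */
}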



Eric Dumazet May 22, 2012, 8:54 p.m. UTC | #3
On Tue, 2012-05-22 at 18:45 +0200, Eric Dumazet wrote:

> This is not evident from the capture; you are mistaken.
> 
> tcpdump captures packets before the TCP stack, so it doesn't say whether
> they are:
> 
> 1) queued in the receive or ofo queue
> 2) queued in the socket backlog
> 3) dropped because we hit the socket rcvbuf limit
> 
> If the socket lock is held by the user, packets are queued to the
> backlog, or dropped.
> 
> Then, when the socket lock is about to be released, we process the
> backlog.
> 
> 

BTW, the latest iproute2 ss utility has nice information if you add -m:

misc/ss -m dst 192.168.99.2
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
ESTAB      3441896 0            192.168.99.1:44409         192.168.99.2:41197   
	 skmem:(r5035136,rb6291456,t0,tb23080,f1149824,w0,o0)

Here you can see that for 3441896 bytes in the TCP receive queue
(payload), we have 5035136 bytes in rmem_alloc, and 6291456 'bytes' in
sk_rcvbuf.

It lacks the backlog length; I'll send a patch when net-next reopens.



Eric Dumazet May 23, 2012, 9:44 a.m. UTC | #4
Update:

My tests show that sk_backlog.len can easily reach 1.5MB on a netperf
run, even with a fast machine, when receiving a 10Gb flow (ixgbe
adapter) with LRO/GRO off.

424 backlog drops for 193.182.546 incoming packets (with my tcp_space()
patch applied).

I believe that as soon as ixgbe can use build_skb() and avoid the 1024
bytes of overhead per skb, the drops should go away.
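
The per-skb overhead matters because both queues are accounted in
skb->truesize (the real memory an skb consumes), not in payload bytes, so
allocation overhead eats into the same sk_rcvbuf budget. Roughly (a
paraphrase of skb_set_owner_r() and sk_add_backlog(); the helper names
are mine):

#include <net/sock.h>

/* Receive queue accounting, as done by skb_set_owner_r() */
static inline void charge_rcvqueue_sketch(struct sock *sk, struct sk_buff *skb)
{
	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
}

/* Backlog accounting, as done by sk_add_backlog(): same currency */
static inline void charge_backlog_sketch(struct sock *sk, struct sk_buff *skb)
{
	sk->sk_backlog.len += skb->truesize;
}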

Of course, another way to solve the problem would be to change
tcp_recvmsg() to use lock_sock_fast(), so that no frame is backlogged at
all.
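
The usage pattern would look something like the sketch below
(lock_sock_fast()/unlock_sock_fast() already exist; the surrounding
function is only illustrative). In the uncontended fast path the socket
is never marked as owned by user context, so incoming segments are not
diverted to the backlog in the first place:

#include <net/sock.h>

static void recvmsg_fast_sketch(struct sock *sk)
{
	bool slow = lock_sock_fast(sk);	/* fast path keeps the bh spinlock,
					 * sk_lock.owned stays 0 */

	/* ... short critical section: dequeue from sk->sk_receive_queue ... */

	unlock_sock_fast(sk, slow);	/* releases the spinlock, or does a
					 * full release_sock() if we fell
					 * back to the slow path */
}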

Locking the socket for the whole operation (including the copy to user
space) is not very good. It was good enough years ago with small receive
windows.

With a potentially huge backlog, the user process has to process it
regardless of its latency constraints. CPU caches are also completely
destroyed by the huge amount of data contained in thousands of skbs.



David Miller May 23, 2012, 5:34 p.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 23 May 2012 11:44:06 +0200

> Locking the socket for the whole operation (including the copy to user
> space) is not very good. It was good enough years ago with small receive
> windows.
> 
> With a potentially huge backlog, the user process has to process it
> regardless of its latency constraints. CPU caches are also completely
> destroyed by the huge amount of data contained in thousands of skbs.

But it is the only way we can have TCP processing scheduled and
accounted to user processes.  That does have value when you have lots
of flows active.

The scheduler's ability to give the process cpu time influences TCP's
behavior, and under load, if the process can't get enough cpu time, TCP
will back off.  We want that.
Eric Dumazet May 23, 2012, 5:46 p.m. UTC | #6
On Wed, 2012-05-23 at 13:34 -0400, David Miller wrote:

> But it is the only way we can have TCP processing scheduled and
> accounted to user processes.  That does have value when you have lots
> of flows active.
> 
> The scheduler's ability to give the process cpu time influences TCP's
> behavior, and under load, if the process can't get enough cpu time, TCP
> will back off.  We want that.

But TCP already backs off if the user process is not blocked on socket
input.

Modern applications use select()/poll()/epoll() on many sockets in
parallel.

Only old ones still block on recv().



David Miller May 23, 2012, 5:57 p.m. UTC | #7
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 23 May 2012 19:46:50 +0200

> But TCP already backs off if the user process is not blocked on socket
> input.
> 
> Modern applications use select()/poll()/epoll() on many sockets in
> parallel.
> 
> Only old ones still block on recv().

These arguments seem circular.

Those modern applications still trigger enough TCP work during their
recv() calls to be significant for scheduling purposes, and to me, being
able to account that TCP work as process time is still extremely
beneficial.

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e79aa48..82382cb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1042,8 +1042,9 @@  static inline int tcp_win_from_space(int space)
 /* Note: caller must be prepared to deal with negative returns */ 
 static inline int tcp_space(const struct sock *sk)
 {
-	return tcp_win_from_space(sk->sk_rcvbuf -
-				  atomic_read(&sk->sk_rmem_alloc));
+	int used = atomic_read(&sk->sk_rmem_alloc) + sk->sk_backlog.len;
+
+	return tcp_win_from_space(sk->sk_rcvbuf - used);
 } 
 
 static inline int tcp_full_space(const struct sock *sk)