Message ID | 1357801149.27446.1142.camel@edumazet-glaptop |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
On Wed, Jan 09, 2013 at 10:59:09PM -0800, Eric Dumazet wrote: > From: Eric Dumazet <edumazet@google.com> > > On Wed, 2013-01-09 at 09:09 -0800, Eric Dumazet wrote: > > > My feeling is that tcp_recv_skb() should eat skbs instead of only > > finding the right one > > > > Thats indeed the case. > > > Thats because skb_splice_bits() releases the socket lock before calling > > splice_to_pipe() > > > > Once socket is released, other incoming TCP frames can be processed, and > > the skb we are actually processing might be 'collapsed' into smaller > > units. > > > > Christian, if I send you patches, are you OK to test them ? > > > > > > Here is the patch fixing this issue. > > Thanks a lot for your report, this is a very very old bug. > > GRO being more deployed, and with TCP coalescing as well, chances to > trigger this bug increased a lot. > > To reproduce it, I had to force MSS=400 and stress the receiver, > adding extra delays in skb_splice_bits() with socket lock being not > held. FWIW, I tested your patch here and did not notice any regression compared to last week-end tests, at various MTU size combinations. Cheers, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2013-01-10 at 08:21 +0100, Willy Tarreau wrote: > FWIW, I tested your patch here and did not notice any regression > compared to last week-end tests, at various MTU size combinations. > Thanks Willy Tested-by: Willy Tarreau <w@1wt.eu> I believe I need to fix net-next, commit 9ca1b22d6d228177e6f929f6818a1cd3d5e30c4a (net: splice: avoid high order page splitting) missed the loopback case : the skb->head might need several linear_to_page() calls. I'll send a patch after full testing. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2013-01-10 at 07:29 -0800, Eric Dumazet wrote: > On Thu, 2013-01-10 at 08:21 +0100, Willy Tarreau wrote: > > > FWIW, I tested your patch here and did not notice any regression > > compared to last week-end tests, at various MTU size combinations. > > > > Thanks Willy I also want to thanks Rick, as the latest netperf has splice() support. Thanks Rick ! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/10/2013 08:20 AM, Eric Dumazet wrote: > I also want to thanks Rick, as the latest netperf has splice() support. > > Thanks Rick ! You are quite welcome - and thank you for helping me get it to actually work :) Those wishing to try it themselves should grab the top-of-trunk netperf bits from http://www.netperf.org/svn/netperf2/trunk . The use of splice() is gated by a test-specific -V option: raj@tardy:~/netperf2_trunk/src$ ./netperf -t omni -- -d recv -V OMNI Receive TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain () port 0 AF_INET : copy avoidance : demo Remote Local Remote Elapsed Throughput Throughput Send Socket Recv Socket Send Time Units Size Size Size (sec) Final Final 1661688 4194304 16384 10.00 26103.14 10^6bits/s You should see that "copy avoidance" appearing in the test banner. It will also "take" for things like a migrated TCP_mumble test. For those cases where you don't see a throughput change, enabling CPU utilization measurement and looking at that and service demand should show a difference. happy benchmarking, rick jones -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Eric, this is probably a dumb question ... but since the fix is in net/ipv4/tcp.c I was asking myself if this can affect IPv6 as well? Thanks, Lukas > [PATCH] tcp: fix splice() and tcp collapsing interaction > > Under unusual circumstances, TCP collapse can split a big GRO TCP packet > while its being used in a splice(socket->pipe) operation. > > skb_splice_bits() releases the socket lock before calling > splice_to_pipe(). > > [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc() > [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]- > [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13 > > To fix this problem, we must eat skbs in tcp_recv_skb(). > > Remove the inline keyword from tcp_recv_skb() definition since > it has three call sites. > > Reported-by: Christian Becker <c.becker@traviangames.com> > Cc: Willy Tarreau <w@1wt.eu> > Signed-off-by: Eric Dumazet <edumazet@google.com> > --- > net/ipv4/tcp.c | 13 ++++++++++--- > 1 file changed, 10 insertions(+), 3 deletions(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 1ca2536..1f901be 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -1428,12 +1428,12 @@ static void tcp_service_net_dma(struct sock *sk, bool wait) > } > #endif > > -static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off) > +static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off) > { > struct sk_buff *skb; > u32 offset; > > - skb_queue_walk(&sk->sk_receive_queue, skb) { > + while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) { > offset = seq - TCP_SKB_CB(skb)->seq; > if (tcp_hdr(skb)->syn) > offset--; > @@ -1441,6 +1441,11 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off) > *off = offset; > return skb; > } > + /* This looks weird, but this can happen if TCP collapsing > + * splitted a fat GRO packet, while we released socket lock > + * in skb_splice_bits() > + */ > + sk_eat_skb(sk, skb, false); > } > return NULL; > } > @@ -1520,8 +1525,10 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, > tcp_rcv_space_adjust(sk); > > /* Clean up data we have read: This will do ACK frames. */ > - if (copied > 0) > + if (copied > 0) { > + tcp_recv_skb(sk, seq, &offset); > tcp_cleanup_rbuf(sk, copied); > + } > return copied; > } > EXPORT_SYMBOL(tcp_read_sock); > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2013-01-10 at 19:27 +0100, Lukas Tribus wrote: > Hi Eric, > > this is probably a dumb question ... but since the fix is in net/ipv4/tcp.c I was asking myself if this can affect IPv6 as well? > Yes it can, net/ipv4 contains generic TCP code. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2013-01-10 at 10:22 -0800, Rick Jones wrote: > On 01/10/2013 08:20 AM, Eric Dumazet wrote: > > I also want to thanks Rick, as the latest netperf has splice() support. > > > > Thanks Rick ! > > You are quite welcome - and thank you for helping me get it to actually > work :) > > Those wishing to try it themselves should grab the top-of-trunk netperf > bits from http://www.netperf.org/svn/netperf2/trunk . The use of > splice() is gated by a test-specific -V option: > > raj@tardy:~/netperf2_trunk/src$ ./netperf -t omni -- -d recv -V > OMNI Receive TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to > localhost.localdomain () port 0 AF_INET : copy avoidance : demo > Remote Local Remote Elapsed Throughput Throughput > Send Socket Recv Socket Send Time Units > Size Size Size (sec) > Final Final > 1661688 4194304 16384 10.00 26103.14 10^6bits/s > > You should see that "copy avoidance" appearing in the test banner. It > will also "take" for things like a migrated TCP_mumble test. For those > cases where you don't see a throughput change, enabling CPU utilization > measurement and looking at that and service demand should show a difference. Thanks Rick ! Can we use zero copy for the sender as well (sendfile() or vmsplice()) ? Here are some results : Reference time (no splice()) # ./netperf -H 127.0.0.1 -t omni -- -d recv OMNI Receive TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 () port 0 AF_INET catcher: timer popped with times_up != 0 Remote Local Remote Elapsed Throughput Throughput Send Socket Recv Socket Send Time Units Size Size Size (sec) Final Final 2097152 2097152 16384 10.00 33237.74 10^6bits/s zero copy at receiver (splice(socket ->pipe), splice(pipe -> /dev/null)) # ./netperf -H 127.0.0.1 -t omni -- -d recv -V OMNI Receive TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 () port 0 AF_INET : copy avoidance catcher: timer popped with times_up != 0 Remote Local Remote Elapsed Throughput Throughput Send Socket Recv Socket Send Time Units Size Size Size (sec) Final Final 1325580 2097152 16384 10.00 51980.60 10^6bits/s -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/10/2013 10:42 AM, Eric Dumazet wrote:
> Can we use zero copy for the sender as well (sendfile() or vmsplice()) ?
Not at present. The TCP_SENDFILE test has not been "migrated" and there
is no corresponding sendfile() test in the omni code, so the attempt to
enable copy-avoidance won't "take."
I'll take that as a feature request for netperf.
rick
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 10, 2013 at 10:49:17AM -0800, Rick Jones wrote: > On 01/10/2013 10:42 AM, Eric Dumazet wrote: > >Can we use zero copy for the sender as well (sendfile() or vmsplice()) ? > > Not at present. The TCP_SENDFILE test has not been "migrated" and there > is no corresponding sendfile() test in the omni code, so the attempt to > enable copy-avoidance won't "take." Note that in my httpterm server, I'm using tee+splice(). I had a problem with vmsplice() in the past, I don't remember which one. For the "inject" http client, I'm using recv(MSG_TRUNC) which is always faster than splice() and does the job well, considering that the goal is to count and destroy data very fast. By combining them both on the same machine over the loopback, I see 70 Gbps on a core-2 duo 2.66 GHz. These are obviously not real gigs, but at least it shows what the network stack can do when we remove memory copies ! Regards, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Wed, 09 Jan 2013 22:59:09 -0800 > [PATCH] tcp: fix splice() and tcp collapsing interaction > > Under unusual circumstances, TCP collapse can split a big GRO TCP packet > while its being used in a splice(socket->pipe) operation. > > skb_splice_bits() releases the socket lock before calling > splice_to_pipe(). > > [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc() > [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]- > [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13 > > To fix this problem, we must eat skbs in tcp_recv_skb(). > > Remove the inline keyword from tcp_recv_skb() definition since > it has three call sites. > > Reported-by: Christian Becker <c.becker@traviangames.com> > Cc: Willy Tarreau <w@1wt.eu> > Signed-off-by: Eric Dumazet <edumazet@google.com> Applied. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 1ca2536..1f901be 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1428,12 +1428,12 @@ static void tcp_service_net_dma(struct sock *sk, bool wait) } #endif -static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off) +static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off) { struct sk_buff *skb; u32 offset; - skb_queue_walk(&sk->sk_receive_queue, skb) { + while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) { offset = seq - TCP_SKB_CB(skb)->seq; if (tcp_hdr(skb)->syn) offset--; @@ -1441,6 +1441,11 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off) *off = offset; return skb; } + /* This looks weird, but this can happen if TCP collapsing + * splitted a fat GRO packet, while we released socket lock + * in skb_splice_bits() + */ + sk_eat_skb(sk, skb, false); } return NULL; } @@ -1520,8 +1525,10 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, tcp_rcv_space_adjust(sk); /* Clean up data we have read: This will do ACK frames. */ - if (copied > 0) + if (copied > 0) { + tcp_recv_skb(sk, seq, &offset); tcp_cleanup_rbuf(sk, copied); + } return copied; } EXPORT_SYMBOL(tcp_read_sock);