diff mbox

tainted warnings with tcp splicing in 3.7.1

Message ID 1357801149.27446.1142.camel@edumazet-glaptop
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Jan. 10, 2013, 6:59 a.m. UTC
From: Eric Dumazet <edumazet@google.com>

On Wed, 2013-01-09 at 09:09 -0800, Eric Dumazet wrote:

> My feeling is that tcp_recv_skb() should eat skbs instead of only
> finding the right one
> 

Thats indeed the case.

> Thats because skb_splice_bits() releases the socket lock before calling
> splice_to_pipe()
> 
> Once socket is released, other incoming TCP frames can be processed, and
> the skb we are actually processing might be 'collapsed' into smaller
> units.
> 
> Christian, if I send you patches, are you OK to test them ?
> 
> 

Here is the patch fixing this issue.

Thanks a lot for your report, this is a very very old bug.

GRO being more deployed, and with TCP coalescing as well, chances to
trigger this bug increased a lot.

To reproduce it, I had to force MSS=400 and stress the receiver,
adding extra delays in skb_splice_bits() with socket lock being not
held.



[PATCH] tcp: fix splice() and tcp collapsing interaction

Under unusual circumstances, TCP collapse can split a big GRO TCP packet
while its being used in a splice(socket->pipe) operation.

skb_splice_bits() releases the socket lock before calling
splice_to_pipe().

[ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc()
[ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]-
[ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13

To fix this problem, we must eat skbs in tcp_recv_skb().

Remove the inline keyword from tcp_recv_skb() definition since
it has three call sites.

Reported-by: Christian Becker <c.becker@traviangames.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Willy Tarreau Jan. 10, 2013, 7:21 a.m. UTC | #1
On Wed, Jan 09, 2013 at 10:59:09PM -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> On Wed, 2013-01-09 at 09:09 -0800, Eric Dumazet wrote:
> 
> > My feeling is that tcp_recv_skb() should eat skbs instead of only
> > finding the right one
> > 
> 
> Thats indeed the case.
> 
> > Thats because skb_splice_bits() releases the socket lock before calling
> > splice_to_pipe()
> > 
> > Once socket is released, other incoming TCP frames can be processed, and
> > the skb we are actually processing might be 'collapsed' into smaller
> > units.
> > 
> > Christian, if I send you patches, are you OK to test them ?
> > 
> > 
> 
> Here is the patch fixing this issue.
> 
> Thanks a lot for your report, this is a very very old bug.
> 
> GRO being more deployed, and with TCP coalescing as well, chances to
> trigger this bug increased a lot.
> 
> To reproduce it, I had to force MSS=400 and stress the receiver,
> adding extra delays in skb_splice_bits() with socket lock being not
> held.

FWIW, I tested your patch here and did not notice any regression
compared to last week-end tests, at various MTU size combinations.

Cheers,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 10, 2013, 3:29 p.m. UTC | #2
On Thu, 2013-01-10 at 08:21 +0100, Willy Tarreau wrote:

> FWIW, I tested your patch here and did not notice any regression
> compared to last week-end tests, at various MTU size combinations.
> 

Thanks Willy

Tested-by: Willy Tarreau <w@1wt.eu>

I believe I need to fix net-next, 

commit 9ca1b22d6d228177e6f929f6818a1cd3d5e30c4a
(net: splice: avoid high order page splitting)

missed the loopback case : the skb->head might need several
linear_to_page() calls.

I'll send a patch after full testing.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 10, 2013, 4:20 p.m. UTC | #3
On Thu, 2013-01-10 at 07:29 -0800, Eric Dumazet wrote:
> On Thu, 2013-01-10 at 08:21 +0100, Willy Tarreau wrote:
> 
> > FWIW, I tested your patch here and did not notice any regression
> > compared to last week-end tests, at various MTU size combinations.
> > 
> 
> Thanks Willy

I also want to thanks Rick, as the latest netperf has splice() support.

Thanks Rick !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones Jan. 10, 2013, 6:22 p.m. UTC | #4
On 01/10/2013 08:20 AM, Eric Dumazet wrote:
> I also want to thanks Rick, as the latest netperf has splice() support.
>
> Thanks Rick !

You are quite welcome - and thank you for helping me get it to actually 
work :)

Those wishing to try it themselves should grab the top-of-trunk netperf 
bits from http://www.netperf.org/svn/netperf2/trunk .  The use of 
splice() is gated by a test-specific -V option:

raj@tardy:~/netperf2_trunk/src$ ./netperf -t omni -- -d recv -V
OMNI Receive TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
localhost.localdomain () port 0 AF_INET : copy avoidance : demo
Remote      Local       Remote Elapsed Throughput Throughput
Send Socket Recv Socket Send   Time               Units
Size        Size        Size   (sec)
Final       Final
1661688     4194304     16384  10.00   26103.14   10^6bits/s

You should see that "copy avoidance" appearing in the test banner.  It 
will also "take" for things like a migrated TCP_mumble test.  For those 
cases where you don't see a throughput change, enabling CPU utilization 
measurement and looking at that and service demand should show a difference.

happy benchmarking,

rick jones
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Tribus Jan. 10, 2013, 6:27 p.m. UTC | #5
Hi Eric,

this is probably a dumb question ... but since the fix is in net/ipv4/tcp.c I was asking myself if this can affect IPv6 as well?



Thanks,

Lukas


> [PATCH] tcp: fix splice() and tcp collapsing interaction
> 
> Under unusual circumstances, TCP collapse can split a big GRO TCP packet
> while its being used in a splice(socket->pipe) operation.
> 
> skb_splice_bits() releases the socket lock before calling
> splice_to_pipe().
> 
> [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc()
> [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]-
> [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13
> 
> To fix this problem, we must eat skbs in tcp_recv_skb().
> 
> Remove the inline keyword from tcp_recv_skb() definition since
> it has three call sites.
> 
> Reported-by: Christian Becker <c.becker@traviangames.com>
> Cc: Willy Tarreau <w@1wt.eu>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/ipv4/tcp.c |   13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1ca2536..1f901be 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1428,12 +1428,12 @@ static void tcp_service_net_dma(struct sock *sk, bool wait)
>  }
>  #endif
>  
> -static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
> +static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
>  {
>  	struct sk_buff *skb;
>  	u32 offset;
>  
> -	skb_queue_walk(&sk->sk_receive_queue, skb) {
> +	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
>  		offset = seq - TCP_SKB_CB(skb)->seq;
>  		if (tcp_hdr(skb)->syn)
>  			offset--;
> @@ -1441,6 +1441,11 @@ static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
>  			*off = offset;
>  			return skb;
>  		}
> +		/* This looks weird, but this can happen if TCP collapsing
> +		 * splitted a fat GRO packet, while we released socket lock
> +		 * in skb_splice_bits()
> +		 */
> +		sk_eat_skb(sk, skb, false);
>  	}
>  	return NULL;
>  }
> @@ -1520,8 +1525,10 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
>  	tcp_rcv_space_adjust(sk);
>  
>  	/* Clean up data we have read: This will do ACK frames. */
> -	if (copied > 0)
> +	if (copied > 0) {
> +		tcp_recv_skb(sk, seq, &offset);
>  		tcp_cleanup_rbuf(sk, copied);
> +	}
>  	return copied;
>  }
>  EXPORT_SYMBOL(tcp_read_sock);
> 
 		 	   		  --
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 10, 2013, 6:37 p.m. UTC | #6
On Thu, 2013-01-10 at 19:27 +0100, Lukas Tribus wrote:
> Hi Eric,
> 
> this is probably a dumb question ... but since the fix is in net/ipv4/tcp.c I was asking myself if this can affect IPv6 as well?
> 

Yes it can, net/ipv4 contains generic TCP code.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Jan. 10, 2013, 6:42 p.m. UTC | #7
On Thu, 2013-01-10 at 10:22 -0800, Rick Jones wrote:
> On 01/10/2013 08:20 AM, Eric Dumazet wrote:
> > I also want to thanks Rick, as the latest netperf has splice() support.
> >
> > Thanks Rick !
> 
> You are quite welcome - and thank you for helping me get it to actually 
> work :)
> 
> Those wishing to try it themselves should grab the top-of-trunk netperf 
> bits from http://www.netperf.org/svn/netperf2/trunk .  The use of 
> splice() is gated by a test-specific -V option:
> 
> raj@tardy:~/netperf2_trunk/src$ ./netperf -t omni -- -d recv -V
> OMNI Receive TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
> localhost.localdomain () port 0 AF_INET : copy avoidance : demo
> Remote      Local       Remote Elapsed Throughput Throughput
> Send Socket Recv Socket Send   Time               Units
> Size        Size        Size   (sec)
> Final       Final
> 1661688     4194304     16384  10.00   26103.14   10^6bits/s
> 
> You should see that "copy avoidance" appearing in the test banner.  It 
> will also "take" for things like a migrated TCP_mumble test.  For those 
> cases where you don't see a throughput change, enabling CPU utilization 
> measurement and looking at that and service demand should show a difference.

Thanks Rick !

Can we use zero copy for the sender as well (sendfile() or vmsplice()) ?

Here are some results :

Reference time (no splice())

# ./netperf -H 127.0.0.1 -t omni -- -d recv     
OMNI Receive TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 ()
port 0 AF_INET
catcher: timer popped with times_up != 0
Remote      Local       Remote Elapsed Throughput Throughput  
Send Socket Recv Socket Send   Time               Units       
Size        Size        Size   (sec)                          
Final       Final                                             
2097152     2097152     16384  10.00   33237.74   10^6bits/s  


zero copy at receiver (splice(socket ->pipe), splice(pipe -> /dev/null))

# ./netperf -H 127.0.0.1 -t omni -- -d recv -V
OMNI Receive TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 ()
port 0 AF_INET : copy avoidance
catcher: timer popped with times_up != 0
Remote      Local       Remote Elapsed Throughput Throughput  
Send Socket Recv Socket Send   Time               Units       
Size        Size        Size   (sec)                          
Final       Final                                             
1325580     2097152     16384  10.00   51980.60   10^6bits/s  



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones Jan. 10, 2013, 6:49 p.m. UTC | #8
On 01/10/2013 10:42 AM, Eric Dumazet wrote:
> Can we use zero copy for the sender as well (sendfile() or vmsplice()) ?

Not at present.  The TCP_SENDFILE test has not been "migrated" and there 
is no corresponding sendfile() test in the omni code, so the attempt to 
enable copy-avoidance won't "take."

I'll take that as a feature request for netperf.

rick

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willy Tarreau Jan. 10, 2013, 7:43 p.m. UTC | #9
On Thu, Jan 10, 2013 at 10:49:17AM -0800, Rick Jones wrote:
> On 01/10/2013 10:42 AM, Eric Dumazet wrote:
> >Can we use zero copy for the sender as well (sendfile() or vmsplice()) ?
> 
> Not at present.  The TCP_SENDFILE test has not been "migrated" and there 
> is no corresponding sendfile() test in the omni code, so the attempt to 
> enable copy-avoidance won't "take."

Note that in my httpterm server, I'm using tee+splice(). I had a problem
with vmsplice() in the past, I don't remember which one.

For the "inject" http client, I'm using recv(MSG_TRUNC) which is always
faster than splice() and does the job well, considering that the goal is
to count and destroy data very fast.

By combining them both on the same machine over the loopback, I see
70 Gbps on a core-2 duo 2.66 GHz. These are obviously not real gigs,
but at least it shows what the network stack can do when we remove
memory copies !

Regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Jan. 10, 2013, 10:39 p.m. UTC | #10
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 09 Jan 2013 22:59:09 -0800

> [PATCH] tcp: fix splice() and tcp collapsing interaction
> 
> Under unusual circumstances, TCP collapse can split a big GRO TCP packet
> while its being used in a splice(socket->pipe) operation.
> 
> skb_splice_bits() releases the socket lock before calling
> splice_to_pipe().
> 
> [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc()
> [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]-
> [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13
> 
> To fix this problem, we must eat skbs in tcp_recv_skb().
> 
> Remove the inline keyword from tcp_recv_skb() definition since
> it has three call sites.
> 
> Reported-by: Christian Becker <c.becker@traviangames.com>
> Cc: Willy Tarreau <w@1wt.eu>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ca2536..1f901be 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1428,12 +1428,12 @@  static void tcp_service_net_dma(struct sock *sk, bool wait)
 }
 #endif
 
-static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
+static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 {
 	struct sk_buff *skb;
 	u32 offset;
 
-	skb_queue_walk(&sk->sk_receive_queue, skb) {
+	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
 		offset = seq - TCP_SKB_CB(skb)->seq;
 		if (tcp_hdr(skb)->syn)
 			offset--;
@@ -1441,6 +1441,11 @@  static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 			*off = offset;
 			return skb;
 		}
+		/* This looks weird, but this can happen if TCP collapsing
+		 * splitted a fat GRO packet, while we released socket lock
+		 * in skb_splice_bits()
+		 */
+		sk_eat_skb(sk, skb, false);
 	}
 	return NULL;
 }
@@ -1520,8 +1525,10 @@  int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	tcp_rcv_space_adjust(sk);
 
 	/* Clean up data we have read: This will do ACK frames. */
-	if (copied > 0)
+	if (copied > 0) {
+		tcp_recv_skb(sk, seq, &offset);
 		tcp_cleanup_rbuf(sk, copied);
+	}
 	return copied;
 }
 EXPORT_SYMBOL(tcp_read_sock);