diff mbox

[net,v2] tcp: Force updating pcount after skb_pull() during mtu probing

Message ID 1433551592-3596877-1-git-send-email-kafai@fb.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Martin KaFai Lau June 6, 2015, 12:46 a.m. UTC
The problem is caught by this WARN_ON(len > skb->len) in tcp_fragment():

[<ffffffff810510ca>] warn_slowpath_null+0x1a/0x20
[<ffffffff8160ec90>] tcp_fragment+0x2a0/0x2b0
[<ffffffff81604e06>] tcp_mark_head_lost+0x196/0x230
[<ffffffff8160585d>] tcp_update_scoreboard+0x4d/0x80
[<ffffffff8160a9ac>] tcp_fastretrans_alert+0x6ac/0xa90
[<ffffffff8160b834>] tcp_ack+0x9d4/0x10e0
[<ffffffff8160c699>] tcp_rcv_established+0x309/0x7e0

The WARN_ON pointed out that tcp_skb_pcount (i.e.
TCP_SKB_CB(skb)->tcp_gso_segs) and skb->len is inconsistent.

The WARN_ON stack goes away after setting net.ipv4.tcp_mtu_probing to 0.

v2
- Replace the skb slicing codes by the existing tcp_trim_head(),
  suggested by Eric Dumazet.

v1
- Call tcp_set_skb_tso_segs() for all slicing cases.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Grant Zhang <gzhang@fastly.com>
Cc: Grant Zhang <gzhang@fastly.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_output.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

Comments

Eric Dumazet June 6, 2015, 1:11 a.m. UTC | #1
On Fri, 2015-06-05 at 17:46 -0700, Martin KaFai Lau wrote:
> The problem is caught by this WARN_ON(len > skb->len) in tcp_fragment():
> 
> [<ffffffff810510ca>] warn_slowpath_null+0x1a/0x20
> [<ffffffff8160ec90>] tcp_fragment+0x2a0/0x2b0
> [<ffffffff81604e06>] tcp_mark_head_lost+0x196/0x230
> [<ffffffff8160585d>] tcp_update_scoreboard+0x4d/0x80
> [<ffffffff8160a9ac>] tcp_fastretrans_alert+0x6ac/0xa90
> [<ffffffff8160b834>] tcp_ack+0x9d4/0x10e0
> [<ffffffff8160c699>] tcp_rcv_established+0x309/0x7e0
> 
> The WARN_ON pointed out that tcp_skb_pcount (i.e.
> TCP_SKB_CB(skb)->tcp_gso_segs) and skb->len is inconsistent.
> 
> The WARN_ON stack goes away after setting net.ipv4.tcp_mtu_probing to 0.
> 
> v2
> - Replace the skb slicing codes by the existing tcp_trim_head(),
>   suggested by Eric Dumazet.
> 
> v1
> - Call tcp_set_skb_tso_segs() for all slicing cases.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Reported-by: Grant Zhang <gzhang@fastly.com>
> Cc: Grant Zhang <gzhang@fastly.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---
>  net/ipv4/tcp_output.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index a369e8a..4ae4f0c 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1977,16 +1977,8 @@ static int tcp_mtu_probe(struct sock *sk)
>  		} else {
>  			TCP_SKB_CB(nskb)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags &
>  						   ~(TCPHDR_FIN|TCPHDR_PSH);
> -			if (!skb_shinfo(skb)->nr_frags) {
> -				skb_pull(skb, copy);
> -				if (skb->ip_summed != CHECKSUM_PARTIAL)
> -					skb->csum = csum_partial(skb->data,
> -								 skb->len, 0);
> -			} else {
> -				__pskb_trim_head(skb, copy);
> -				tcp_set_skb_tso_segs(sk, skb, mss_now);
> -			}
> -			TCP_SKB_CB(skb)->seq += copy;
> +			tcp_skb_pcount_set(skb, 0);
> +			tcp_trim_head(sk, skb, copy);
>  		}
>  
>  		len += copy;


I think the invariant should be that if a packet had been never sent,
its pcount should be already 0.

(cleared in do_tcp_sendpages() and tcp_sendmsg() : it seems we hacked
these functions already in the past :( )

So we might need to track places where we violate this rule, then get
rid of the tcp_skb_pcount_set(skb, 0); done in do_tcp_sendpages() and
tcp_sendmsg().

Here, trimming a packet that was never sent (by definition) should not
force pcount to 0, it should already be the case.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin KaFai Lau June 8, 2015, 5:58 p.m. UTC | #2
On Fri, Jun 05, 2015 at 06:11:33PM -0700, Eric Dumazet wrote:
> On Fri, 2015-06-05 at 17:46 -0700, Martin KaFai Lau wrote:
> > The problem is caught by this WARN_ON(len > skb->len) in tcp_fragment():
> > 
> > [<ffffffff810510ca>] warn_slowpath_null+0x1a/0x20
> > [<ffffffff8160ec90>] tcp_fragment+0x2a0/0x2b0
> > [<ffffffff81604e06>] tcp_mark_head_lost+0x196/0x230
> > [<ffffffff8160585d>] tcp_update_scoreboard+0x4d/0x80
> > [<ffffffff8160a9ac>] tcp_fastretrans_alert+0x6ac/0xa90
> > [<ffffffff8160b834>] tcp_ack+0x9d4/0x10e0
> > [<ffffffff8160c699>] tcp_rcv_established+0x309/0x7e0
> > 
> > The WARN_ON pointed out that tcp_skb_pcount (i.e.
> > TCP_SKB_CB(skb)->tcp_gso_segs) and skb->len is inconsistent.
> > 
> > The WARN_ON stack goes away after setting net.ipv4.tcp_mtu_probing to 0.
> > 
> > v2
> > - Replace the skb slicing codes by the existing tcp_trim_head(),
> >   suggested by Eric Dumazet.
> > 
> > v1
> > - Call tcp_set_skb_tso_segs() for all slicing cases.
> > 
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > Reported-by: Grant Zhang <gzhang@fastly.com>
> > Cc: Grant Zhang <gzhang@fastly.com>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Yuchung Cheng <ycheng@google.com>
> > ---
> >  net/ipv4/tcp_output.c | 12 ++----------
> >  1 file changed, 2 insertions(+), 10 deletions(-)
> > 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index a369e8a..4ae4f0c 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1977,16 +1977,8 @@ static int tcp_mtu_probe(struct sock *sk)
> >  		} else {
> >  			TCP_SKB_CB(nskb)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags &
> >  						   ~(TCPHDR_FIN|TCPHDR_PSH);
> > -			if (!skb_shinfo(skb)->nr_frags) {
> > -				skb_pull(skb, copy);
> > -				if (skb->ip_summed != CHECKSUM_PARTIAL)
> > -					skb->csum = csum_partial(skb->data,
> > -								 skb->len, 0);
> > -			} else {
> > -				__pskb_trim_head(skb, copy);
> > -				tcp_set_skb_tso_segs(sk, skb, mss_now);
> > -			}
> > -			TCP_SKB_CB(skb)->seq += copy;
> > +			tcp_skb_pcount_set(skb, 0);
> > +			tcp_trim_head(sk, skb, copy);
> >  		}
> >  
> >  		len += copy;
> 
> 
> I think the invariant should be that if a packet had been never sent,
> its pcount should be already 0.
> 
> (cleared in do_tcp_sendpages() and tcp_sendmsg() : it seems we hacked
> these functions already in the past :( )
> 
> So we might need to track places where we violate this rule, then get
> rid of the tcp_skb_pcount_set(skb, 0); done in do_tcp_sendpages() and
> tcp_sendmsg().
> 
> Here, trimming a packet that was never sent (by definition) should not
> force pcount to 0, it should already be the case.
It seems the invariant does not hold at this point also.
Should the invariant fix be something for net-next? or Would you like
to post a patch for it?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 8, 2015, 6:11 p.m. UTC | #3
On Mon, 2015-06-08 at 10:58 -0700, Martin KaFai Lau wrote:

> It seems the invariant does not hold at this point also.
> Should the invariant fix be something for net-next? or Would you like
> to post a patch for it?

This patch definitely can target net-next, it is not a new regression.

Once fully tested, we can ask David to submit a stable version later.

Yes, I can provide a v3, probably at the end of today.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 9, 2015, 5:06 p.m. UTC | #4
On Mon, 2015-06-08 at 11:11 -0700, Eric Dumazet wrote:
> On Mon, 2015-06-08 at 10:58 -0700, Martin KaFai Lau wrote:
> 
> > It seems the invariant does not hold at this point also.
> > Should the invariant fix be something for net-next? or Would you like
> > to post a patch for it?
> 
> This patch definitely can target net-next, it is not a new regression.
> 
> Once fully tested, we can ask David to submit a stable version later.
> 
> Yes, I can provide a v3, probably at the end of today.

I've been working on this, but still can get the bug triggering in
tcp_fragment(), no matter what (Neal patch , yours, mine...)

I suspect a problem in skb_shift() or tcp_shifted_skb() and am doing
additional tests.

[ 1806.206345] ------------[ cut here ]------------
[ 1806.206359] WARNING: CPU: 19 PID: 0 at net/ipv4/tcp_output.c:1159 tcp_fragment+0x2b0/0x2d0()
[ 1806.206360] Modules linked in: ip_gre gre sit ip_tunnel tunnel4 bonding w1_therm wire cdc_acm ehci_pci ehci_hcd i2c_dev i2c_i801 i2c_core msr cpuid ib_uverbs mlx4_ib ib_sa ib_mad ib_core ib_addr ip6table_mangle ip6_tables ipv6
[ 1806.206376] CPU: 19 PID: 0 Comm: swapper/19 Tainted: G        W       4.1.0-smp-DEV #1762
[ 1806.206378]  ffffffff8196ba4b ffff88407fae38f8 ffffffff81643125 00000000000000dd
[ 1806.206380]  0000000000000000 ffff88407fae3938 ffffffff810a8c97 ffff88407fae3998
[ 1806.206382]  ffff883f75498200 ffff881fe1d5f800 000000000000000d 0000000000004802
[ 1806.206384] Call Trace:
[ 1806.206385]  <IRQ>  [<ffffffff81643125>] dump_stack+0x45/0x57
[ 1806.206395]  [<ffffffff810a8c97>] warn_slowpath_common+0x97/0xe0
[ 1806.206396]  [<ffffffff810a8cfa>] warn_slowpath_null+0x1a/0x20
[ 1806.206398]  [<ffffffff815df0c0>] tcp_fragment+0x2b0/0x2d0
[ 1806.206400]  [<ffffffff815d62e1>] tcp_mark_head_lost+0x191/0x280
[ 1806.206402]  [<ffffffff815d641d>] tcp_update_scoreboard+0x4d/0x80
[ 1806.206404]  [<ffffffff815daa24>] tcp_fastretrans_alert+0x6e4/0xc20
[ 1806.206406]  [<ffffffff815dbcef>] tcp_ack+0xb0f/0x13c0
[ 1806.206408]  [<ffffffff815dcc31>] tcp_rcv_established+0x311/0x820
[ 1806.206410]  [<ffffffff815e79cb>] tcp_v4_do_rcv+0x14b/0x3d0
[ 1806.206414]  [<ffffffff812f15e6>] ? security_sock_rcv_skb+0x16/0x20
[ 1806.206415]  [<ffffffff815e8e46>] tcp_v4_rcv+0x8c6/0x9d0
[ 1806.206418]  [<ffffffff815c1dec>] ip_local_deliver_finish+0xac/0x230
[ 1806.206419]  [<ffffffff815c216a>] ip_local_deliver+0xaa/0xc0
[ 1806.206421]  [<ffffffff815c1d40>] ? ip_rcv_finish+0x380/0x380
[ 1806.206422]  [<ffffffff815c1a48>] ip_rcv_finish+0x88/0x380
[ 1806.206423]  [<ffffffff815c2487>] ip_rcv+0x307/0x420
[ 1806.206426]  [<ffffffff815c19c0>] ? inet_add_protocol+0x50/0x50
[ 1806.206431]  [<ffffffff8157cc0c>] __netif_receive_skb_core+0x5bc/0xa30
[ 1806.206432]  [<ffffffff8157d0a6>] __netif_receive_skb+0x26/0x70
[ 1806.206434]  [<ffffffff8157d180>] process_backlog+0x90/0x130
[ 1806.206436]  [<ffffffff8157d9a7>] net_rx_action+0x157/0x330
[ 1806.206439]  [<ffffffff810ac9c7>] __do_softirq+0xe7/0x270
[ 1806.206441]  [<ffffffff810acd85>] irq_exit+0xa5/0xb0
[ 1806.206445]  [<ffffffff8108aa65>] smp_call_function_single_interrupt+0x35/0x40
[ 1806.206448]  [<ffffffff8164ae2b>] call_function_single_interrupt+0x6b/0x70



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin KaFai Lau June 9, 2015, 5:45 p.m. UTC | #5
On Tue, Jun 09, 2015 at 10:06:25AM -0700, Eric Dumazet wrote:
> I've been working on this, but still can get the bug triggering in
> tcp_fragment(), no matter what (Neal patch , yours, mine...)
Can you describe the test case that can reproduce it?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 9, 2015, 5:59 p.m. UTC | #6
On Tue, 2015-06-09 at 10:45 -0700, Martin KaFai Lau wrote:
> On Tue, Jun 09, 2015 at 10:06:25AM -0700, Eric Dumazet wrote:
> > I've been working on this, but still can get the bug triggering in
> > tcp_fragment(), no matter what (Neal patch , yours, mine...)
> Can you describe the test case that can reproduce it?

This might not be easy : I used a 40Gb testbed and following.

Warning triggers in about 30 minutes, but this might be related to some
unrelated traffic.

echo 4 >/proc/sys/net/ipv4/tcp_min_tso_segs
echo 0 >/proc/sys/kernel/timer_migration
echo fq >/proc/sys/net/core/default_qdisc
tc qd replace dev eth1 root mq
./super_netperf 1500 --google-pacing-rate 3028000 -H lpaa24 -l 10000 &

You might implement the pacing stuff using regular netperf with

for ETH in eth1
do
 tc qd del dev $ETH root 2>/dev/null
 tc qd add dev $ETH root handle 1: mq
 for i in `seq 1 16`
 do
  slot=$( printf %x $(( i )) )
  tc qd add dev $ETH parent 1:$slot fq maxrate 3028000
 done
done


(Change eth1 by eth0 or something, and 16 by number of TX queues.)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a369e8a..4ae4f0c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1977,16 +1977,8 @@  static int tcp_mtu_probe(struct sock *sk)
 		} else {
 			TCP_SKB_CB(nskb)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags &
 						   ~(TCPHDR_FIN|TCPHDR_PSH);
-			if (!skb_shinfo(skb)->nr_frags) {
-				skb_pull(skb, copy);
-				if (skb->ip_summed != CHECKSUM_PARTIAL)
-					skb->csum = csum_partial(skb->data,
-								 skb->len, 0);
-			} else {
-				__pskb_trim_head(skb, copy);
-				tcp_set_skb_tso_segs(sk, skb, mss_now);
-			}
-			TCP_SKB_CB(skb)->seq += copy;
+			tcp_skb_pcount_set(skb, 0);
+			tcp_trim_head(sk, skb, copy);
 		}
 
 		len += copy;