
[net-next] tcp: be more strict before accepting ECN negociation

Message ID 1336144442.3752.348.camel@edumazet-glaptop
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet May 4, 2012, 3:14 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

It appears some networks play bad games with the two bits reserved for
ECN. This can trigger false congestion notifications and very slow
transfers.

Since RFC 3168 (6.1.1) forbids SYN packets from carrying ECT bits, we can
disable TCP ECN negotiation if we receive mangled ECT bits in
the SYN packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Perry Lorier <perryl@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Wilmer van der Gaast <wilmer@google.com>
Cc: Ankur Jain <jankur@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Dave Täht <dave.taht@bufferbloat.net>
---
 include/net/tcp.h   |   23 ++++++++++++++++-------
 net/ipv4/tcp_ipv4.c |    2 +-
 net/ipv6/tcp_ipv6.c |    2 +-
 3 files changed, 18 insertions(+), 9 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Neal Cardwell May 4, 2012, 3:54 p.m. UTC | #1
On Fri, May 4, 2012 at 11:14 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> It appears some networks play bad games with the two bits reserved for
> ECN. This can trigger false congestion notifications and very slow
> transfers.
>
> Since RFC 3168 (6.1.1) forbids SYN packets from carrying ECT bits, we can
> disable TCP ECN negotiation if we receive mangled ECT bits in
> the SYN packet.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Perry Lorier <perryl@google.com>
> Cc: Matt Mathis <mattmathis@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Wilmer van der Gaast <wilmer@google.com>
> Cc: Ankur Jain <jankur@google.com>
> Cc: Tom Herbert <therbert@google.com>
> Cc: Dave Täht <dave.taht@bufferbloat.net>
> ---
>  include/net/tcp.h   |   23 ++++++++++++++++-------
>  net/ipv4/tcp_ipv4.c |    2 +-
>  net/ipv6/tcp_ipv6.c |    2 +-
>  3 files changed, 18 insertions(+), 9 deletions(-)

Acked-by: Neal Cardwell <ncardwell@google.com>

neal
David Miller May 4, 2012, 4:06 p.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 04 May 2012 17:14:02 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> It appears some networks play bad games with the two bits reserved for
> ECN. This can trigger false congestion notifications and very slow
> transfers.
> 
> Since RFC 3168 (6.1.1) forbids SYN packets from carrying ECT bits, we can
> disable TCP ECN negotiation if we receive mangled ECT bits in
> the SYN packet.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.
Rick Jones May 4, 2012, 6:09 p.m. UTC | #3
On 05/04/2012 08:14 AM, Eric Dumazet wrote:
> From: Eric Dumazet<edumazet@google.com>
>
> It appears some networks play bad games with the two bits reserved for
> ECN. This can trigger false congestion notifications and very slow
> transfers.
>
> Since RFC 3168 (6.1.1) forbids SYN packets from carrying ECT bits, we can
> disable TCP ECN negotiation if we receive mangled ECT bits in
> the SYN packet.

What sort of networks were these?  Any chance it was some sort of 
attempt to add ECN to FastOpen?

rick jones

Eric Dumazet May 4, 2012, 6:23 p.m. UTC | #4
On Fri, 2012-05-04 at 11:09 -0700, Rick Jones wrote:
> On 05/04/2012 08:14 AM, Eric Dumazet wrote:
> > From: Eric Dumazet<edumazet@google.com>
> >
> > It appears some networks play bad games with the two bits reserved for
> > ECN. This can trigger false congestion notifications and very slow
> > transfers.
> >
> > Since RFC 3168 (6.1.1) forbids SYN packets from carrying ECT bits, we can
> > disable TCP ECN negotiation if we receive mangled ECT bits in
> > the SYN packet.
> 
> What sort of networks were these?  Any chance it was some sort of 
> attempt to add ECN to FastOpen?

Nothing to do with fastopen.

Just take a look at a random http server and sample all SYN packets it
receives.

Some of them have TOS bits 0 or 1 set, or even both bits set.
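
For readers who want to decode those two bits by hand, here is a minimal
sketch in userspace C. The codepoint values follow RFC 3168; the names
ecn_codepoint_of() and ecn_name() are made up for this illustration and
are not kernel or libpcap APIs.

```c
#include <stdint.h>

/* ECN codepoints (RFC 3168): the two low-order bits of the IPv4 TOS
 * byte, or of the IPv6 traffic class. */
enum ecn_codepoint {
	ECN_NOT_ECT = 0x0,	/* 00: not ECN-capable transport */
	ECN_ECT_1   = 0x1,	/* 01: ECT(1) */
	ECN_ECT_0   = 0x2,	/* 10: ECT(0) */
	ECN_CE      = 0x3,	/* 11: congestion experienced */
};

#define ECN_MASK 0x3

/* Hypothetical helper: extract the ECN codepoint from a TOS byte. */
static inline enum ecn_codepoint ecn_codepoint_of(uint8_t tos)
{
	return (enum ecn_codepoint)(tos & ECN_MASK);
}

/* Hypothetical helper: printable name for a codepoint. */
static inline const char *ecn_name(enum ecn_codepoint cp)
{
	switch (cp) {
	case ECN_NOT_ECT: return "Not-ECT";
	case ECN_ECT_1:   return "ECT(1)";
	case ECN_ECT_0:   return "ECT(0)";
	case ECN_CE:      return "CE";
	}
	return "?";
}
```

With this, a TOS byte of 0x03 (both bits set) decodes to CE, and any TOS
value whose two low bits are clear decodes to Not-ECT regardless of the
DSCP bits above them.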




Rick Jones May 4, 2012, 6:48 p.m. UTC | #5
On 05/04/2012 11:23 AM, Eric Dumazet wrote:
> On Fri, 2012-05-04 at 11:09 -0700, Rick Jones wrote:
>> What sort of networks were these?  Any chance it was some sort of
>> attempt to add ECN to FastOpen?
>
> Nothing to do with fastopen.
>
> Just take a look at a random http server and sample all SYN packets it
> receives.
>
> Some of them have TOS bits 0 or 1 set, or even both bits set.

I'll fire-up tcpdump on netperf.org:

tcpdump -i eth0 -vvv '(tcp[tcpflags] & tcp-syn != 0) && (ip[1] != 0x0)'

and see what appears.

rick
Eric Dumazet May 4, 2012, 7:05 p.m. UTC | #6
On Fri, 2012-05-04 at 11:48 -0700, Rick Jones wrote:
> On 05/04/2012 11:23 AM, Eric Dumazet wrote:
> > On Fri, 2012-05-04 at 11:09 -0700, Rick Jones wrote:
> >> What sort of networks were these?  Any chance it was some sort of
> >> attempt to add ECN to FastOpen?
> >
> > Nothing to do with fastopen.
> >
> > Just take a look at a random http server and sample all SYN packets it
> > receives.
> >
> > Some of them have TOS bits 0 or 1 set, or even both bits set.
> 
> I'll fire-up tcpdump on netperf.org:
> 
> tcpdump -i eth0 -vvv '(tcp[tcpflags] & tcp-syn != 0) && (ip[1] != 0x0)'
> 
> and see what appears.
> 
> rick

or (ip[1] & 3 != 0)


Note that you could catch SYNACK packets with this filter (if your machine
initiates some active TCP sessions), since a SYNACK might have ECT bits
set if some stacks implement:

http://tools.ietf.org/html/draft-kuzmanovic-ecn-syn-00  ( Adding
Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK
Packets )

http://tools.ietf.org/id/draft-ietf-tcpm-ecnsyn-04.txt




Rick Jones May 4, 2012, 8:20 p.m. UTC | #7
On 05/04/2012 12:05 PM, Eric Dumazet wrote:
> On Fri, 2012-05-04 at 11:48 -0700, Rick Jones wrote:
>> I'll fire-up tcpdump on netperf.org:
>>
>> tcpdump -i eth0 -vvv '(tcp[tcpflags] & tcp-syn != 0) && (ip[1] != 0x0)'
>>
>> and see what appears.
>>
>> rick
>
> or (ip[1] & 3 != 0)

True, I'm looking at more than the ECN bits, but in the 90 minutes the 
tcpdump has been running there have been no packets with any of the 
8 bits at ip[1] being 1 anyway :)  Netperf.org doesn't get a massive 
quantity of traffic.  It may go the entire week-end or longer without 
seeing such a packet.

> Note that you could catch SYNACK packets with this filter (if your machine
> initiates some active TCP sessions), since a SYNACK might have ECT bits
> set if some stacks implement:
>
> http://tools.ietf.org/html/draft-kuzmanovic-ecn-syn-00  ( Adding
> Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK
> Packets )
>
> http://tools.ietf.org/id/draft-ietf-tcpm-ecnsyn-04.txt

True.  I suspect that 99 times out of 10, the outbound connections 
established by netperf.org are in response to traffic to netperf-talk, 
which is itself a rather quiet list, so I'm not too worried about the 
output being cluttered with false hits.

rick
Rick Jones May 4, 2012, 8:36 p.m. UTC | #8
On 05/04/2012 01:20 PM, Rick Jones wrote:
> True, I'm looking at more than the ECN bits, but in the 90 minutes the
> tcpdump has been running there have been no packets with any of the
> 8 bits at ip[1] being 1 anyway :) Netperf.org doesn't get a massive
> quantity of traffic. It may go the entire week-end or longer without
> seeing such a packet.

I see fate is working as intended, or someone decided to try to feed me 
my words :) for within 6 minutes of my sending the above I got:

13:26:16.866007 IP (tos 0x3,CE, ttl 41, id 28850, offset 0, flags [DF], 
proto TCP (6), length 64)
     somesystemin.de.55363 > www.netperf.org.www: Flags [S], cksum 
0x4cfc (correct), seq 304457158, win 65535, options [mss 1460,nop,wscale 
3,nop,nop,TS val 288116308 ecr 0,sackOK,eol], length 0
13:26:17.831880 IP (tos 0x3,CE, ttl 41, id 6911, offset 0, flags [DF], 
proto TCP (6), length 64)
     somesystemin.de.55367 > www.netperf.org.www: Flags [S], cksum 
0x17aa (correct), seq 586073737, win 65535, options [mss 1460,nop,wscale 
3,nop,nop,TS val 288117270 ecr 0,sackOK,eol], length 0
13:26:17.831929 IP (tos 0x3,CE, ttl 41, id 28924, offset 0, flags [DF], 
proto TCP (6), length 64)
     somesystemin.de.55368 > www.netperf.org.www: Flags [S], cksum 
0x07cc (correct), seq 1513398047, win 65535, options [mss 
1460,nop,wscale 3,nop,nop,TS val 288117271 ecr 0,sackOK,eol], length 0
13:26:17.831952 IP (tos 0x3,CE, ttl 41, id 2494, offset 0, flags [DF], 
proto TCP (6), length 64)
     somesystemin.de.55366 > www.netperf.org.www: Flags [S], cksum 
0x75f4 (correct), seq 1153058420, win 65535, options [mss 
1460,nop,wscale 3,nop,nop,TS val 288117270 ecr 0,sackOK,eol], length 0
13:26:17.832177 IP (tos 0x3,CE, ttl 41, id 6854, offset 0, flags [DF], 
proto TCP (6), length 64)
     somesystemin.de.55365 > www.netperf.org.www: Flags [S], cksum 
0xfca0 (correct), seq 2332522875, win 65535, options [mss 
1460,nop,wscale 3,nop,nop,TS val 288117270 ecr 0,sackOK,eol], length 0
13:26:17.832239 IP (tos 0x3,CE, ttl 41, id 64733, offset 0, flags [DF], 
proto TCP (6), length 64)
     somesystemin.de.55364 > www.netperf.org.www: Flags [S], cksum 
0x7414 (correct), seq 1544827132, win 65535, options [mss 
1460,nop,wscale 3,nop,nop,TS val 288117270 ecr 0,sackOK,eol], length 0
13:26:38.649126 IP (tos 0x3,CE, ttl 41, id 9860, offset 0, flags [DF], 
proto TCP (6), length 64)
     somesystemin.de.55369 > www.netperf.org.www: Flags [S], cksum 
0x6270 (correct), seq 683091230, win 65535, options [mss 1460,nop,wscale 
3,nop,nop,TS val 288137968 ecr 0,sackOK,eol], length 0
13:26:39.417589 IP (tos 0x3,CE, ttl 41, id 13478, offset 0, flags [DF], 
proto TCP (6), length 64)
     somesystemin.de.55370 > www.netperf.org.www: Flags [S], cksum 
0x2862 (correct), seq 3168323595, win 65535, options [mss 
1460,nop,wscale 3,nop,nop,TS val 288138734 ecr 0,sackOK,eol], length 0

rick
Eric Dumazet May 4, 2012, 8:49 p.m. UTC | #9
On Fri, 2012-05-04 at 13:36 -0700, Rick Jones wrote:
> On 05/04/2012 01:20 PM, Rick Jones wrote:
> > True, I'm looking at more than the ECN bits, but in the 90 minutes the
> > tcpdump has been running there have been no packets with any of the
> > 8 bits at ip[1] being 1 anyway :) Netperf.org doesn't get a massive
> > quantity of traffic. It may go the entire week-end or longer without
> > seeing such a packet.
> 
> I see fate is working as intended, or someone decided to try to feed me 
> my words :) for within 6 minutes of my sending the above I got:
> 
> 13:26:16.866007 IP (tos 0x3,CE, ttl 41, id 28850, offset 0, flags [DF], 
> proto TCP (6), length 64)
>      somesystemin.de.55363 > www.netperf.org.www: Flags [S], cksum 
> 0x4cfc (correct), seq 304457158, win 65535, options [mss 1460,nop,wscale 
> 3,nop,nop,TS val 288116308 ecr 0,sackOK,eol], length 0
> 13:26:17.831880 IP (tos 0x3,CE, ttl 41, id 6911, offset 0, flags [DF], 
> proto TCP (6), length 64)
>      somesystemin.de.55367 > www.netperf.org.www: Flags [S], cksum 
> 0x17aa (correct), seq 586073737, win 65535, options [mss 1460,nop,wscale 
> 3,nop,nop,TS val 288117270 ecr 0,sackOK,eol], length 0
> 13:26:17.831929 IP (tos 0x3,CE, ttl 41, id 28924, offset 0, flags [DF], 
> proto TCP (6), length 64)
>      somesystemin.de.55368 > www.netperf.org.www: Flags [S], cksum 
> 0x07cc (correct), seq 1513398047, win 65535, options [mss 
> 1460,nop,wscale 3,nop,nop,TS val 288117271 ecr 0,sackOK,eol], length 0
> 13:26:17.831952 IP (tos 0x3,CE, ttl 41, id 2494, offset 0, flags [DF], 
> proto TCP (6), length 64)
>      somesystemin.de.55366 > www.netperf.org.www: Flags [S], cksum 
> 0x75f4 (correct), seq 1153058420, win 65535, options [mss 
> 1460,nop,wscale 3,nop,nop,TS val 288117270 ecr 0,sackOK,eol], length 0
> 13:26:17.832177 IP (tos 0x3,CE, ttl 41, id 6854, offset 0, flags [DF], 
> proto TCP (6), length 64)
>      somesystemin.de.55365 > www.netperf.org.www: Flags [S], cksum 
> 0xfca0 (correct), seq 2332522875, win 65535, options [mss 
> 1460,nop,wscale 3,nop,nop,TS val 288117270 ecr 0,sackOK,eol], length 0
> 13:26:17.832239 IP (tos 0x3,CE, ttl 41, id 64733, offset 0, flags [DF], 
> proto TCP (6), length 64)
>      somesystemin.de.55364 > www.netperf.org.www: Flags [S], cksum 
> 0x7414 (correct), seq 1544827132, win 65535, options [mss 
> 1460,nop,wscale 3,nop,nop,TS val 288117270 ecr 0,sackOK,eol], length 0
> 13:26:38.649126 IP (tos 0x3,CE, ttl 41, id 9860, offset 0, flags [DF], 
> proto TCP (6), length 64)
>      somesystemin.de.55369 > www.netperf.org.www: Flags [S], cksum 
> 0x6270 (correct), seq 683091230, win 65535, options [mss 1460,nop,wscale 
> 3,nop,nop,TS val 288137968 ecr 0,sackOK,eol], length 0
> 13:26:39.417589 IP (tos 0x3,CE, ttl 41, id 13478, offset 0, flags [DF], 
> proto TCP (6), length 64)
>      somesystemin.de.55370 > www.netperf.org.www: Flags [S], cksum 
> 0x2862 (correct), seq 3168323595, win 65535, options [mss 
> 1460,nop,wscale 3,nop,nop,TS val 288138734 ecr 0,sackOK,eol], length 0
> 
> rick

Interesting indeed ;)

Did you check if it was spoofed ?

(did the 3WHS really complete?)



Rick Jones May 4, 2012, 9:01 p.m. UTC | #10
>
> Interesting indeed ;)
>
> Did you check if it was spoofed ?
>
> (did the 3WHS really complete?)


Well, the tcpdump command was still:


tcpdump -i eth0 -vvv '(tcp[tcpflags] & tcp-syn != 0) && (ip[1] != 0x0)'

I didn't see any SYN|ACKs go out, but netperf.org would have had to set 
ECT for me to see a SYN|ACK going out.   FWIW, this is on a 2.6.31-15 
(Ubuntu) kernel with net.ipv4.tcp_ecn = 2 and I don't think the SYNs 
themselves were negotiating ECN:

13:26:16.866007 IP (tos 0x3,CE, ttl 41, id 28850, offset 0, flags [DF], 
proto TCP (6), length 64)
     somesystemin.de.55363 > www.netperf.org.www: Flags [S], cksum 
0x4cfc (correct), seq 304457158, win 65535, options [mss 1460,nop,wscale

rick
Eric Dumazet May 4, 2012, 9:14 p.m. UTC | #11
On Fri, 2012-05-04 at 14:01 -0700, Rick Jones wrote:
> >
> > Interesting indeed ;)
> >
> > Did you check if it was spoofed ?
> >
> > (did the 3WHS really complete?)
> 
> 
> Well, the tcpdump command was still:
> 
> 
> tcpdump -i eth0 -vvv '(tcp[tcpflags] & tcp-syn != 0) && (ip[1] != 0x0)'
> 
> I didn't see any SYN|ACKs go out, but netperf.org would have had to set 
> ECT for me to see a SYN|ACK going out.   FWIW, this is on a 2.6.31-15 
> (Ubuntu) kernel with net.ipv4.tcp_ecn = 2 and I don't think the SYNs 
> themselves were negotiating ECN:
> 
> 13:26:16.866007 IP (tos 0x3,CE, ttl 41, id 28850, offset 0, flags [DF], 
> proto TCP (6), length 64)
>      somesystemin.de.55363 > www.netperf.org.www: Flags [S], cksum 
> 0x4cfc (correct), seq 304457158, win 65535, options [mss 1460,nop,wscale

Probably not, or else you would see :

13:26:16.866007 IP (tos 0x3,CE, ttl 41, id 28850, offset 0, flags
[DF],proto TCP (6), length 64)
    somesystemin.de.55363 > www.netperf.org.www: Flags [SEW], cksum ...
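
In tcpdump's flag notation, E stands for ECE and W for CWR, so [SEW] is a
SYN that asks for ECN while a plain [S] does not. As a rough sketch, the
same test on the raw TCP flags byte could look like the following. The
bit values are the standard TCP header flag bits; is_ecn_setup_syn() is a
hypothetical name used only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP header flag bits (byte 13 of the TCP header). */
#define TCP_FLAG_SYN 0x02
#define TCP_FLAG_ACK 0x10
#define TCP_FLAG_ECE 0x40	/* printed as 'E' by tcpdump */
#define TCP_FLAG_CWR 0x80	/* printed as 'W' by tcpdump */

/* Hypothetical helper: true for an ECN-setup SYN in RFC 3168 terms,
 * i.e. SYN with both ECE and CWR set and ACK clear. */
static inline bool is_ecn_setup_syn(uint8_t flags)
{
	const uint8_t want = TCP_FLAG_SYN | TCP_FLAG_ECE | TCP_FLAG_CWR;

	return (flags & (want | TCP_FLAG_ACK)) == want;
}
```

A flags byte of 0xc2 corresponds to [SEW] and passes this test; 0x02, the
plain [S] seen in the capture above, does not.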







Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c826ed7..92faa6a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -367,13 +367,6 @@  static inline void tcp_dec_quickack_mode(struct sock *sk,
 #define	TCP_ECN_DEMAND_CWR	4
 #define	TCP_ECN_SEEN		8
 
-static __inline__ void
-TCP_ECN_create_request(struct request_sock *req, struct tcphdr *th)
-{
-	if (sysctl_tcp_ecn && th->ece && th->cwr)
-		inet_rsk(req)->ecn_ok = 1;
-}
-
 enum tcp_tw_status {
 	TCP_TW_SUCCESS = 0,
 	TCP_TW_RST = 1,
@@ -671,6 +664,22 @@  struct tcp_skb_cb {
 
 #define TCP_SKB_CB(__skb)	((struct tcp_skb_cb *)&((__skb)->cb[0]))
 
+/* RFC3168 : 6.1.1 SYN packets must not have ECT/ECN bits set
+ *
+ * If we receive a SYN packet with these bits set, it means a network is
+ * playing bad games with TOS bits. In order to avoid possible false congestion
+ * notifications, we disable TCP ECN negociation.
+ */
+static inline void
+TCP_ECN_create_request(struct request_sock *req, const struct sk_buff *skb)
+{
+	const struct tcphdr *th = tcp_hdr(skb);
+
+	if (sysctl_tcp_ecn && th->ece && th->cwr &&
+	    INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield))
+		inet_rsk(req)->ecn_ok = 1;
+}
+
 /* Due to TSO, an SKB can be composed of multiple actual
  * packets.  To keep these tracked properly, we use this.
  */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cf97e98..4ff5e1f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1368,7 +1368,7 @@  int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		goto drop_and_free;
 
 	if (!want_cookie || tmp_opt.tstamp_ok)
-		TCP_ECN_create_request(req, tcp_hdr(skb));
+		TCP_ECN_create_request(req, skb);
 
 	if (want_cookie) {
 		isn = cookie_v4_init_sequence(sk, skb, &req->mss);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 57b2109..078d039 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1140,7 +1140,7 @@  static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	treq->rmt_addr = ipv6_hdr(skb)->saddr;
 	treq->loc_addr = ipv6_hdr(skb)->daddr;
 	if (!want_cookie || tmp_opt.tstamp_ok)
-		TCP_ECN_create_request(req, tcp_hdr(skb));
+		TCP_ECN_create_request(req, skb);
 
 	treq->iif = sk->sk_bound_dev_if;
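
To experiment with the new acceptance rule outside the kernel, the
decision made by the patched TCP_ECN_create_request() can be approximated
in userspace C as below. This is only a sketch under stated assumptions:
struct syn_info and ecn_request_ok() are invented for the example, the
sysctl is reduced to a boolean, and the mask test is written out by hand
in place of INET_ECN_is_not_ect().

```c
#include <stdbool.h>
#include <stdint.h>

#define ECN_MASK 0x3	/* two low-order bits of the TOS / traffic class byte */

/* Hypothetical summary of the fields of an incoming SYN that matter here. */
struct syn_info {
	bool ece;		/* TCP ECE flag */
	bool cwr;		/* TCP CWR flag */
	uint8_t dsfield;	/* IPv4 TOS or IPv6 traffic class byte */
};

/* Accept ECN only if it is enabled locally, the SYN requests it with
 * ECE+CWR, and the IP header carries the Not-ECT codepoint (00). */
static inline bool ecn_request_ok(bool ecn_enabled, const struct syn_info *syn)
{
	return ecn_enabled && syn->ece && syn->cwr &&
	       (syn->dsfield & ECN_MASK) == 0;
}
```

Under this sketch, a SYN with ECE and CWR set and a clean TOS byte is
accepted, while the same SYN arriving with tos 0x3 is treated as mangled
and ECN stays off for that connection.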