diff mbox

[net-next-2.6] ipv4: more compliant RFC 3168 support

Message ID 1305466542.3120.129.camel@edumazet-laptop
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet May 15, 2011, 1:35 p.m. UTC
Commit 6623e3b24a5e (ipv4: IP defragmentation must be ECN aware) was an
attempt to not lose "Congestion Experienced" (CE) indications when
performing datagram defragmentation.

Stefanos Harhalakis raised the point that RFC 3168 requirements were not
completely met by this commit.

In particular, we MUST detect invalid combinations and eventually drop
illegal frames.

Reported-by: Stefanos Harhalakis <v13@v13.gr>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_fragment.c |   66 ++++++++++++++++++++++++++-------------
 1 file changed, 44 insertions(+), 22 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Stefanos Harhalakis May 15, 2011, 3:08 p.m. UTC | #1
Hello,

On Sunday 15 of May 2011, Eric Dumazet wrote:
> +static inline int ip4_frag_ecn_fold(u8 ecn)
> +{
> +	switch (ecn) {
> +	/* If same ECN combination was observed on all fragments, do nothing */
> +	case IPFRAG_ECN_NOT_ECT:
> +	case IPFRAG_ECN_ECT_1:
> +	case IPFRAG_ECN_ECT_0:
> +	case IPFRAG_ECN_CE:
> +	/* if a ECT_1 ECT_0 combination was observed, do nothing as well */
> +	case IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1:
> +		return 0;
> +	/* at least one fragment had CE, and others ECT_0 or ECT_1 */
> +	case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0:
> +	case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_1:
> +	case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1:
> +		return INET_ECN_CE;
> +	/* other combinations are invalid : drop frame */
> +	default:
> +		return -1;
> +	}
>  }

You may wish to simplify this exhaustive check to:

if (unlikely((ecn & IPFRAG_ECN_NOT_ECT) && ecn!=IPFRAG_ECN_NOT_ECT))
  return -1;
else if (ecn & IPFRAG_ECN_CE)
  return INET_ECN_CE;
else
  return 0;

although I'm not sure which method will be faster.

Also, returning the exact same value for NOT_ECT and ECT_X and then ORing
this with the TOS seems like it will make it loose the ECT_X info. No? (but
also, I'm not sure if this is needed anyway from that point on).

p.s. I'm not sure whether this message will make it to the netdev list.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 15, 2011, 4:01 p.m. UTC | #2
Le dimanche 15 mai 2011 à 18:08 +0300, Stefanos Harhalakis a écrit :
> Hello,
> 
> On Sunday 15 of May 2011, Eric Dumazet wrote:
> > +static inline int ip4_frag_ecn_fold(u8 ecn)
> > +{
> > +	switch (ecn) {
> > +	/* If same ECN combination was observed on all fragments, do nothing */
> > +	case IPFRAG_ECN_NOT_ECT:
> > +	case IPFRAG_ECN_ECT_1:
> > +	case IPFRAG_ECN_ECT_0:
> > +	case IPFRAG_ECN_CE:
> > +	/* if a ECT_1 ECT_0 combination was observed, do nothing as well */
> > +	case IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1:
> > +		return 0;
> > +	/* at least one fragment had CE, and others ECT_0 or ECT_1 */
> > +	case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0:
> > +	case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_1:
> > +	case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1:
> > +		return INET_ECN_CE;
> > +	/* other combinations are invalid : drop frame */
> > +	default:
> > +		return -1;
> > +	}
> >  }
> 
> You may wish to simplify this exhaustive check to:
> 
> if (unlikely((ecn & IPFRAG_ECN_NOT_ECT) && ecn!=IPFRAG_ECN_NOT_ECT))
>   return -1;
> else if (ecn & IPFRAG_ECN_CE)
>   return INET_ECN_CE;
> else
>   return 0;
> 
> although I'm not sure which method will be faster.
> 

Problem of this version is that common frames in the Internet (NOT_ECT
or ECT_X or ECT_X) will take the longest path to come to "return 0;"

a switch() version is fast because gcc emits a table based jump

> Also, returning the exact same value for NOT_ECT and ECT_X and then ORing
> this with the TOS seems like it will make it loose the ECT_X info. No? (but
> also, I'm not sure if this is needed anyway from that point on).
> 

I dont understand what you mean here. We really need to not loose ECT_X,
and I believe we dont.

-1 : Drop the frame anyway
0 : No change on iph->tos field (we keep its value. it can have ECT_X.
Remember all fragments share same (iph->tos & 3) value)
3 : We make sure iph->tos is ORed with 3 to assert CE on result frame.

> p.s. I'm not sure whether this message will make it to the netdev list.

It should, no worry.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 16, 2011, 5:16 p.m. UTC | #3
Le dimanche 15 mai 2011 à 18:01 +0200, Eric Dumazet a écrit :

> Problem of this version is that common frames in the Internet (NOT_ECT
> or ECT_X or ECT_X) will take the longest path to come to "return 0;"
> 
> a switch() version is fast because gcc emits a table based jump

Oh well, a table lookup is even faster...

I'll submit another version


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hagen Paul Pfeifer May 16, 2011, 9:33 p.m. UTC | #4
* Eric Dumazet | 2011-05-15 18:01:50 [+0200]:

>Problem of this version is that common frames in the Internet (NOT_ECT
>or ECT_X or ECT_X) will take the longest path to come to "return 0;"
>
>a switch() version is fast because gcc emits a table based jump

Sure? Is the table access not an dCache miss? E.g.

4003dc: ff 24 fd 58 06 40 00  jmpq   *0x400658(,%rdi,8)

Not sure if jump table access is superior these days ...


Hagen
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 16, 2011, 9:38 p.m. UTC | #5
Le lundi 16 mai 2011 à 23:33 +0200, Hagen Paul Pfeifer a écrit :
> * Eric Dumazet | 2011-05-15 18:01:50 [+0200]:
> 
> >Problem of this version is that common frames in the Internet (NOT_ECT
> >or ECT_X or ECT_X) will take the longest path to come to "return 0;"
> >
> >a switch() version is fast because gcc emits a table based jump
> 
> Sure? Is the table access not an dCache miss? E.g.
> 
> 4003dc: ff 24 fd 58 06 40 00  jmpq   *0x400658(,%rdi,8)
> 
> Not sure if jump table access is superior these days ...
> 

Check v2 of patch, it is fine, small, fast ;)

By the way, you dont want to know how many cpu cycles we spend in IP
defrag functions... Really its insane.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hagen Paul Pfeifer May 16, 2011, 9:49 p.m. UTC | #6
* Eric Dumazet | 2011-05-16 23:38:38 [+0200]:

>Check v2 of patch, it is fine, small, fast ;)

Eric, I trust you! ;-)

>By the way, you dont want to know how many cpu cycles we spend in IP
>defrag functions... Really its insane.

I can image it! I wanted to point out to the fact that gcc's (and other compiler
too) jump table optimizations are not advantageous than a few years ago.
Maybe jump table optimizations are a relict some decades ago ...

Hagen
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index b1d282f..1f74089 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -77,20 +77,46 @@  struct ipq {
 	struct inet_peer *peer;
 };
 
-#define IPFRAG_ECN_CLEAR  0x01 /* one frag had INET_ECN_NOT_ECT */
-#define IPFRAG_ECN_SET_CE 0x04 /* one frag had INET_ECN_CE */
+/* RFC 3168 support :
+ * We want to check ECN values of all fragments, do detect invalid combinations.
+ * In ipq->ecn, we store the OR value of each ip4_frag_ecn() fragment value.
+ */
+enum {
+	IPFRAG_ECN_NOT_ECT	= 0x01, /* one frag had ECN_NOT_ECT */
+	IPFRAG_ECN_ECT_1	= 0x02, /* one frag had ECN_ECT_1 */
+	IPFRAG_ECN_ECT_0	= 0x04, /* one frag had ECN_ECT_0 */
+	IPFRAG_ECN_CE		= 0x08, /* one frag had ECN_CE */
+};
 
 static inline u8 ip4_frag_ecn(u8 tos)
 {
-	tos = (tos & INET_ECN_MASK) + 1;
-	/*
-	 * After the last operation we have (in binary):
-	 * INET_ECN_NOT_ECT => 001
-	 * INET_ECN_ECT_1   => 010
-	 * INET_ECN_ECT_0   => 011
-	 * INET_ECN_CE      => 100
-	 */
-	return (tos & 2) ? 0 : tos;
+	return 1 << (tos & INET_ECN_MASK);
+}
+
+/* Given the OR values of all fragments, apply RFC 3168 5.3 requirements
+ * Return : -1 if frame should be dropped.
+ *           >= 0 value, to be ORed in to final iph->tos field
+ */
+static inline int ip4_frag_ecn_fold(u8 ecn)
+{
+	switch (ecn) {
+	/* If same ECN combination was observed on all fragments, do nothing */
+	case IPFRAG_ECN_NOT_ECT:
+	case IPFRAG_ECN_ECT_1:
+	case IPFRAG_ECN_ECT_0:
+	case IPFRAG_ECN_CE:
+	/* if a ECT_1 ECT_0 combination was observed, do nothing as well */
+	case IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1:
+		return 0;
+	/* at least one fragment had CE, and others ECT_0 or ECT_1 */
+	case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0:
+	case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_1:
+	case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1:
+		return INET_ECN_CE;
+	/* other combinations are invalid : drop frame */
+	default:
+		return -1;
+	}
 }
 
 static struct inet_frags ip4_frags;
@@ -524,9 +550,15 @@  static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	int len;
 	int ihlen;
 	int err;
+	int ecn;
 
 	ipq_kill(qp);
 
+	ecn = ip4_frag_ecn_fold(qp->ecn);
+	if (unlikely(ecn == -1)) {
+		err = -EINVAL;
+		goto out_fail;
+	}
 	/* Make the one we just received the head. */
 	if (prev) {
 		head = prev->next;
@@ -605,17 +637,7 @@  static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	iph = ip_hdr(head);
 	iph->frag_off = 0;
 	iph->tot_len = htons(len);
-	/* RFC3168 5.3 Fragmentation support
-	 * If one fragment had INET_ECN_NOT_ECT,
-	 *	reassembled frame also has INET_ECN_NOT_ECT
-	 * Elif one fragment had INET_ECN_CE
-	 *	reassembled frame also has INET_ECN_CE
-	 */
-	if (qp->ecn & IPFRAG_ECN_CLEAR)
-		iph->tos &= ~INET_ECN_MASK;
-	else if (qp->ecn & IPFRAG_ECN_SET_CE)
-		iph->tos |= INET_ECN_CE;
-
+	iph->tos |= ecn;
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
 	qp->q.fragments_tail = NULL;