Message ID | 1305466542.3120.129.camel@edumazet-laptop |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
Hello, On Sunday 15 of May 2011, Eric Dumazet wrote: > +static inline int ip4_frag_ecn_fold(u8 ecn) > +{ > + switch (ecn) { > + /* If same ECN combination was observed on all fragments, do nothing */ > + case IPFRAG_ECN_NOT_ECT: > + case IPFRAG_ECN_ECT_1: > + case IPFRAG_ECN_ECT_0: > + case IPFRAG_ECN_CE: > + /* if a ECT_1 ECT_0 combination was observed, do nothing as well */ > + case IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1: > + return 0; > + /* at least one fragment had CE, and others ECT_0 or ECT_1 */ > + case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0: > + case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_1: > + case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1: > + return INET_ECN_CE; > + /* other combinations are invalid : drop frame */ > + default: > + return -1; > + } > } You may wish to simplify this exhaustive check to: if (unlikely((ecn & IPFRAG_ECN_NOT_ECT) && ecn!=IPFRAG_ECN_NOT_ECT)) return -1; else if (ecn & IPFRAG_ECN_CE) return INET_ECN_CE; else return 0; although I'm not sure which method will be faster. Also, returning the exact same value for NOT_ECT and ECT_X and then ORing this with the TOS seems like it will make it loose the ECT_X info. No? (but also, I'm not sure if this is needed anyway from that point on). p.s. I'm not sure whether this message will make it to the netdev list. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le dimanche 15 mai 2011 à 18:08 +0300, Stefanos Harhalakis a écrit : > Hello, > > On Sunday 15 of May 2011, Eric Dumazet wrote: > > +static inline int ip4_frag_ecn_fold(u8 ecn) > > +{ > > + switch (ecn) { > > + /* If same ECN combination was observed on all fragments, do nothing */ > > + case IPFRAG_ECN_NOT_ECT: > > + case IPFRAG_ECN_ECT_1: > > + case IPFRAG_ECN_ECT_0: > > + case IPFRAG_ECN_CE: > > + /* if a ECT_1 ECT_0 combination was observed, do nothing as well */ > > + case IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1: > > + return 0; > > + /* at least one fragment had CE, and others ECT_0 or ECT_1 */ > > + case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0: > > + case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_1: > > + case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1: > > + return INET_ECN_CE; > > + /* other combinations are invalid : drop frame */ > > + default: > > + return -1; > > + } > > } > > You may wish to simplify this exhaustive check to: > > if (unlikely((ecn & IPFRAG_ECN_NOT_ECT) && ecn!=IPFRAG_ECN_NOT_ECT)) > return -1; > else if (ecn & IPFRAG_ECN_CE) > return INET_ECN_CE; > else > return 0; > > although I'm not sure which method will be faster. > Problem of this version is that common frames in the Internet (NOT_ECT or ECT_X or ECT_X) will take the longest path to come to "return 0;" a switch() version is fast because gcc emits a table based jump > Also, returning the exact same value for NOT_ECT and ECT_X and then ORing > this with the TOS seems like it will make it loose the ECT_X info. No? (but > also, I'm not sure if this is needed anyway from that point on). > I dont understand what you mean here. We really need to not loose ECT_X, and I believe we dont. -1 : Drop the frame anyway 0 : No change on iph->tos field (we keep its value. it can have ECT_X. Remember all fragments share same (iph->tos & 3) value) 3 : We make sure iph->tos is ORed with 3 to assert CE on result frame. > p.s. I'm not sure whether this message will make it to the netdev list. It should, no worry. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le dimanche 15 mai 2011 à 18:01 +0200, Eric Dumazet a écrit : > Problem of this version is that common frames in the Internet (NOT_ECT > or ECT_X or ECT_X) will take the longest path to come to "return 0;" > > a switch() version is fast because gcc emits a table based jump Oh well, a table lookup is even faster... I'll submit another version -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Eric Dumazet | 2011-05-15 18:01:50 [+0200]: >Problem of this version is that common frames in the Internet (NOT_ECT >or ECT_X or ECT_X) will take the longest path to come to "return 0;" > >a switch() version is fast because gcc emits a table based jump Sure? Is the table access not an dCache miss? E.g. 4003dc: ff 24 fd 58 06 40 00 jmpq *0x400658(,%rdi,8) Not sure if jump table access is superior these days ... Hagen -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le lundi 16 mai 2011 à 23:33 +0200, Hagen Paul Pfeifer a écrit : > * Eric Dumazet | 2011-05-15 18:01:50 [+0200]: > > >Problem of this version is that common frames in the Internet (NOT_ECT > >or ECT_X or ECT_X) will take the longest path to come to "return 0;" > > > >a switch() version is fast because gcc emits a table based jump > > Sure? Is the table access not an dCache miss? E.g. > > 4003dc: ff 24 fd 58 06 40 00 jmpq *0x400658(,%rdi,8) > > Not sure if jump table access is superior these days ... > Check v2 of patch, it is fine, small, fast ;) By the way, you dont want to know how many cpu cycles we spend in IP defrag functions... Really its insane. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Eric Dumazet | 2011-05-16 23:38:38 [+0200]: >Check v2 of patch, it is fine, small, fast ;) Eric, I trust you! ;-) >By the way, you dont want to know how many cpu cycles we spend in IP >defrag functions... Really its insane. I can image it! I wanted to point out to the fact that gcc's (and other compiler too) jump table optimizations are not advantageous than a few years ago. Maybe jump table optimizations are a relict some decades ago ... Hagen -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c index b1d282f..1f74089 100644 --- a/net/ipv4/ip_fragment.c +++ b/net/ipv4/ip_fragment.c @@ -77,20 +77,46 @@ struct ipq { struct inet_peer *peer; }; -#define IPFRAG_ECN_CLEAR 0x01 /* one frag had INET_ECN_NOT_ECT */ -#define IPFRAG_ECN_SET_CE 0x04 /* one frag had INET_ECN_CE */ +/* RFC 3168 support : + * We want to check ECN values of all fragments, do detect invalid combinations. + * In ipq->ecn, we store the OR value of each ip4_frag_ecn() fragment value. + */ +enum { + IPFRAG_ECN_NOT_ECT = 0x01, /* one frag had ECN_NOT_ECT */ + IPFRAG_ECN_ECT_1 = 0x02, /* one frag had ECN_ECT_1 */ + IPFRAG_ECN_ECT_0 = 0x04, /* one frag had ECN_ECT_0 */ + IPFRAG_ECN_CE = 0x08, /* one frag had ECN_CE */ +}; static inline u8 ip4_frag_ecn(u8 tos) { - tos = (tos & INET_ECN_MASK) + 1; - /* - * After the last operation we have (in binary): - * INET_ECN_NOT_ECT => 001 - * INET_ECN_ECT_1 => 010 - * INET_ECN_ECT_0 => 011 - * INET_ECN_CE => 100 - */ - return (tos & 2) ? 0 : tos; + return 1 << (tos & INET_ECN_MASK); +} + +/* Given the OR values of all fragments, apply RFC 3168 5.3 requirements + * Return : -1 if frame should be dropped. + * >= 0 value, to be ORed in to final iph->tos field + */ +static inline int ip4_frag_ecn_fold(u8 ecn) +{ + switch (ecn) { + /* If same ECN combination was observed on all fragments, do nothing */ + case IPFRAG_ECN_NOT_ECT: + case IPFRAG_ECN_ECT_1: + case IPFRAG_ECN_ECT_0: + case IPFRAG_ECN_CE: + /* if a ECT_1 ECT_0 combination was observed, do nothing as well */ + case IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1: + return 0; + /* at least one fragment had CE, and others ECT_0 or ECT_1 */ + case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0: + case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_1: + case IPFRAG_ECN_CE | IPFRAG_ECN_ECT_0 | IPFRAG_ECN_ECT_1: + return INET_ECN_CE; + /* other combinations are invalid : drop frame */ + default: + return -1; + } } static struct inet_frags ip4_frags; @@ -524,9 +550,15 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev, int len; int ihlen; int err; + int ecn; ipq_kill(qp); + ecn = ip4_frag_ecn_fold(qp->ecn); + if (unlikely(ecn == -1)) { + err = -EINVAL; + goto out_fail; + } /* Make the one we just received the head. */ if (prev) { head = prev->next; @@ -605,17 +637,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev, iph = ip_hdr(head); iph->frag_off = 0; iph->tot_len = htons(len); - /* RFC3168 5.3 Fragmentation support - * If one fragment had INET_ECN_NOT_ECT, - * reassembled frame also has INET_ECN_NOT_ECT - * Elif one fragment had INET_ECN_CE - * reassembled frame also has INET_ECN_CE - */ - if (qp->ecn & IPFRAG_ECN_CLEAR) - iph->tos &= ~INET_ECN_MASK; - else if (qp->ecn & IPFRAG_ECN_SET_CE) - iph->tos |= INET_ECN_CE; - + iph->tos |= ecn; IP_INC_STATS_BH(net, IPSTATS_MIB_REASMOKS); qp->q.fragments = NULL; qp->q.fragments_tail = NULL;
Commit 6623e3b24a5e (ipv4: IP defragmentation must be ECN aware) was an attempt to not lose "Congestion Experienced" (CE) indications when performing datagram defragmentation. Stefanos Harhalakis raised the point that RFC 3168 requirements were not completely met by this commit. In particular, we MUST detect invalid combinations and eventually drop illegal frames. Reported-by: Stefanos Harhalakis <v13@v13.gr> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/ipv4/ip_fragment.c | 66 ++++++++++++++++++++++++++------------- 1 file changed, 44 insertions(+), 22 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html