Message ID: E1LB0d1-0000rd-B2@gondolin.me.apana.org.au
State: Superseded, archived
Delegated to: David Miller
On Fri, Dec 12, 2008 at 04:31:55PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> +struct sk_buff **tcp_gro_receive(struct sk_buff **head, struct sk_buff *skb)
> +{
> +        struct sk_buff **pp = NULL;
> +        struct sk_buff *p;
> +        struct tcphdr *th;
> +        struct tcphdr *th2;
> +        unsigned int thlen;
> +        unsigned int flags;
> +        unsigned int total;
> +        unsigned int mss = 1;
> +        int flush = 1;
> +
> +        if (!pskb_may_pull(skb, sizeof(*th)))
> +                goto out;
> +
> +        th = tcp_hdr(skb);
> +        thlen = th->doff * 4;
> +        if (thlen < sizeof(*th))
> +                goto out;
> +
> +        if (!pskb_may_pull(skb, thlen))
> +                goto out;
> +
> +        th = tcp_hdr(skb);
> +        __skb_pull(skb, thlen);
> +
> +        flags = tcp_flag_word(th);
> +
> +        for (; (p = *head); head = &p->next) {
> +                if (!NAPI_GRO_CB(p)->same_flow)
> +                        continue;
> +
> +                th2 = tcp_hdr(p);
> +
> +                if (th->source != th2->source || th->dest != th2->dest) {
> +                        NAPI_GRO_CB(p)->same_flow = 0;
> +                        continue;
> +                }
> +
> +                goto found;
> +        }
> +
> +        goto out_check_final;
> +
> +found:
> +        flush = NAPI_GRO_CB(p)->flush;
> +        flush |= flags & TCP_FLAG_CWR;
> +        flush |= (flags ^ tcp_flag_word(th2)) &
> +                 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH);
> +        flush |= th->ack_seq != th2->ack_seq || th->window != th2->window;
> +        flush |= memcmp(th + 1, th2 + 1, thlen - sizeof(*th));
> +
> +        total = p->len;
> +        mss = total;
> +        if (skb_shinfo(p)->frag_list)
> +                mss = skb_shinfo(p)->frag_list->len;
> +
> +        flush |= skb->len > mss || skb->len <= 0;
> +        flush |= ntohl(th2->seq) + total != ntohl(th->seq);
> +

No timestamp check?

> +        if (flush || skb_gro_receive(head, skb)) {
> +                mss = 1;
> +                goto out_check_final;
> +        }
> +
> +        p = *head;
> +        th2 = tcp_hdr(p);
> +        tcp_flag_word(th2) |= flags & (TCP_FLAG_FIN | TCP_FLAG_PSH);
> +
> +out_check_final:
> +        flush = skb->len < mss;
> +        flush |= flags & (TCP_FLAG_URG | TCP_FLAG_PSH | TCP_FLAG_RST |
> +                          TCP_FLAG_SYN | TCP_FLAG_FIN);
> +
> +        if (p && (!NAPI_GRO_CB(skb)->same_flow || flush))
> +                pp = head;
> +
> +out:
> +        NAPI_GRO_CB(skb)->flush |= flush;
> +
> +        return pp;
> +}
On Fri, Dec 12, 2008 at 10:56:15PM +0300, Evgeniy Polyakov wrote:
>
> > +found:
> > +        flush = NAPI_GRO_CB(p)->flush;
> > +        flush |= flags & TCP_FLAG_CWR;
> > +        flush |= (flags ^ tcp_flag_word(th2)) &
> > +                 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH);
> > +        flush |= th->ack_seq != th2->ack_seq || th->window != th2->window;
> > +        flush |= memcmp(th + 1, th2 + 1, thlen - sizeof(*th));
> > +
> > +        total = p->len;
> > +        mss = total;
> > +        if (skb_shinfo(p)->frag_list)
> > +                mss = skb_shinfo(p)->frag_list->len;
> > +
> > +        flush |= skb->len > mss || skb->len <= 0;
> > +        flush |= ntohl(th2->seq) + total != ntohl(th->seq);
> > +
>
> No timestamp check?

The memcmp does that for us.

Cheers,
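Herbert's point here, that the blanket memcmp over the option bytes subsumes any per-option (including timestamp) comparison, can be illustrated with a small user-space sketch. This is not kernel code; the struct below is an assumed model of the usual NOP-NOP-padded RFC 1323 timestamp option:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/*
 * Model of a NOP-padded TCP timestamp option block, i.e. the bytes
 * that follow the fixed 20-byte TCP header when timestamps are on.
 * Layout has no padding: 4 bytes of uint8_t, then two aligned uint32_t.
 */
struct tcp_ts_opt {
        uint8_t  nop1, nop2;    /* two NOPs for 4-byte alignment */
        uint8_t  kind;          /* 8 = timestamp (RFC 1323) */
        uint8_t  len;           /* 10 */
        uint32_t tsval;         /* sender's timestamp clock */
        uint32_t tsecr;         /* echoed timestamp */
};

/*
 * Returns nonzero when the option regions differ, mirroring
 *     flush |= memcmp(th + 1, th2 + 1, thlen - sizeof(*th));
 * in tcp_gro_receive: any TSval change makes the bytes differ.
 */
static int options_differ(uint32_t tsval1, uint32_t tsval2)
{
        struct tcp_ts_opt a = { 1, 1, 8, 10, htonl(tsval1), htonl(0) };
        struct tcp_ts_opt b = { 1, 1, 8, 10, htonl(tsval2), htonl(0) };

        return memcmp(&a, &b, sizeof(a)) != 0;
}
```

Any change to TSval (or TSecr, or a newly added option) changes the option bytes, so the bytewise compare sets the flush flag without ever parsing individual options.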
On Sat, Dec 13, 2008 at 08:46:27AM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > > +        flush |= memcmp(th + 1, th2 + 1, thlen - sizeof(*th));
> > > +
> > > +        total = p->len;
> > > +        mss = total;
> > > +        if (skb_shinfo(p)->frag_list)
> > > +                mss = skb_shinfo(p)->frag_list->len;
> > > +
> > > +        flush |= skb->len > mss || skb->len <= 0;
> > > +        flush |= ntohl(th2->seq) + total != ntohl(th->seq);
> > > +
> >
> > No timestamp check?
>
> The memcmp does that for us.

So it will fail if the timestamp changed? Or if some new option is added?
On Sat, Dec 13, 2008 at 05:40:46AM +0300, Evgeniy Polyakov wrote:
>
> So it will fail if timestamp changed? Or if some new option added?

Correct, it must remain exactly the same. Even at 100Mbps, any sane
clock frequency will result in an average of 8 packets per clock
update. At the sort of speeds (>1Gbps) we're targeting, this will
easily get us to 64K, which is our maximum (actually it looks like I
forgot to add a check to stop this from growing beyond 64K :)

As I alluded to in the opening message, in future we can consider a
more generalised form of GRO where we end up with a super-packet that
retains the individual headers, which can then be refragmented as
necessary. That would allow us to merge in the presence of timestamp
changes. However, IMHO this isn't a killer feature for merging TCP
packets.

Cheers,
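Herbert's arithmetic can be checked with a quick sketch. The 1500-byte full-sized packet and the 1 kHz timestamp clock are assumed numbers (he does not state them), but a 1 ms tick was a common choice at the time:

```c
/*
 * Back-of-the-envelope check: how many full-sized packets arrive, on
 * average, between two updates of the TCP timestamp clock?
 * 100 Mbit/s / (1500 B * 8) = ~8333 packets/s; at a 1 kHz clock that
 * is ~8 packets per tick, matching the "8 packets per clock update"
 * figure quoted above.
 */
static long packets_per_tick(long link_bps, long pkt_bytes, long clock_hz)
{
        long pkts_per_sec = link_bps / (pkt_bytes * 8);

        return pkts_per_sec / clock_hz;
}
```

At 1 Gbit/s the same assumptions give ~83 packets per tick, far more than the ~44 full-sized segments that fit under the 64K IP length limit, so the same-timestamp restriction rarely limits merging at those speeds.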
On Sat, Dec 13, 2008 at 01:46:45PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> On Sat, Dec 13, 2008 at 05:40:46AM +0300, Evgeniy Polyakov wrote:
> >
> > So it will fail if timestamp changed? Or if some new option added?
>
> Correct, it must remain exactly the same. Even at 100Mbps any
> sane clock frequency will result in an average of 8 packets per
> clock update. At the sort speeds (>1Gbps) at which we're targetting
> this'll easily get us to 64K which is our maximum (actually it
> looks like I forgot to add a check to stop this from growing beyond
> 64K :)

Some stacks just use an increasing counter, since that is allowed by
the RFC :) Perhaps 'not sane' is the appropriate name, but still...
Probably not a serious problem, but what about just checking the
timestamp option before/after, like it was done in LRO?
On Sat, Dec 13, 2008 at 06:10:19AM +0300, Evgeniy Polyakov wrote:
>
> Some stacks use just increased counter since it is allowed by the RFC :)
> Probably 'not sane' is appropriate name, but still... Probably not a

Well, one big reason why TSO worked so well is that the rest of the
stack (well, most of it) simply treated it as a large packet. As it
stands, this means staying below the 64K threshold (e.g., the IP
header length field is 16 bits long). In future of course we'd want
to increase this, be it through IPv6 or some other means.

> serious problem, but what if just check the timestamp option before/after
> like it was in LRO?

As I said, I don't think the restriction on the timestamp is such a
big deal. At the sort of speeds where merging is actually useful,
only an insane clock frequency would require us to merge packets with
different timestamps.

Note also that the limited length of the TCP timestamp option means
that insane clock frequencies aren't practical anyway, as the
timestamp would wrap too quickly, thus defeating its purpose.

Cheers,
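The wrap argument can be made concrete. A minimal sketch, assuming only that the timestamp is the 32-bit field specified by RFC 1323:

```c
#include <stdint.h>

/*
 * The TCP timestamp is a 32-bit counter, so a clock running at hz
 * ticks per second wraps after 2^32 / hz seconds. RFC 1323 suggests
 * tick intervals between 1 ms and 1 s partly for this reason.
 */
static uint64_t seconds_until_wrap(uint64_t hz)
{
        return (UINT64_C(1) << 32) / hz;
}
```

At 1 kHz the counter wraps after about 4.29 million seconds (~50 days), while a hypothetical 1 MHz clock would wrap in roughly 72 minutes, too fast for the timestamp to serve its purpose on long-lived connections.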
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 438014d..cd571a9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1358,6 +1358,12 @@ extern void tcp_v4_destroy_sock(struct sock *sk);
 
 extern int tcp_v4_gso_send_check(struct sk_buff *skb);
 extern struct sk_buff *tcp_tso_segment(struct sk_buff *skb, int features);
+extern struct sk_buff **tcp_gro_receive(struct sk_buff **head,
+                                        struct sk_buff *skb);
+extern struct sk_buff **tcp4_gro_receive(struct sk_buff **head,
+                                         struct sk_buff *skb);
+extern int tcp_gro_complete(struct sk_buff *skb);
+extern int tcp4_gro_complete(struct sk_buff *skb);
 
 #ifdef CONFIG_PROC_FS
 extern int tcp4_proc_init(void);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 260f081..dafbfbd 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1413,6 +1413,8 @@ static struct net_protocol tcp_protocol = {
         .err_handler = tcp_v4_err,
         .gso_send_check = tcp_v4_gso_send_check,
         .gso_segment = tcp_tso_segment,
+        .gro_receive = tcp4_gro_receive,
+        .gro_complete = tcp4_gro_complete,
         .no_policy = 1,
         .netns_ok = 1,
 };
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c5aca0b..294d838 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2461,6 +2461,106 @@ out:
 }
 EXPORT_SYMBOL(tcp_tso_segment);
 
+struct sk_buff **tcp_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+{
+        struct sk_buff **pp = NULL;
+        struct sk_buff *p;
+        struct tcphdr *th;
+        struct tcphdr *th2;
+        unsigned int thlen;
+        unsigned int flags;
+        unsigned int total;
+        unsigned int mss = 1;
+        int flush = 1;
+
+        if (!pskb_may_pull(skb, sizeof(*th)))
+                goto out;
+
+        th = tcp_hdr(skb);
+        thlen = th->doff * 4;
+        if (thlen < sizeof(*th))
+                goto out;
+
+        if (!pskb_may_pull(skb, thlen))
+                goto out;
+
+        th = tcp_hdr(skb);
+        __skb_pull(skb, thlen);
+
+        flags = tcp_flag_word(th);
+
+        for (; (p = *head); head = &p->next) {
+                if (!NAPI_GRO_CB(p)->same_flow)
+                        continue;
+
+                th2 = tcp_hdr(p);
+
+                if (th->source != th2->source || th->dest != th2->dest) {
+                        NAPI_GRO_CB(p)->same_flow = 0;
+                        continue;
+                }
+
+                goto found;
+        }
+
+        goto out_check_final;
+
+found:
+        flush = NAPI_GRO_CB(p)->flush;
+        flush |= flags & TCP_FLAG_CWR;
+        flush |= (flags ^ tcp_flag_word(th2)) &
+                 ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH);
+        flush |= th->ack_seq != th2->ack_seq || th->window != th2->window;
+        flush |= memcmp(th + 1, th2 + 1, thlen - sizeof(*th));
+
+        total = p->len;
+        mss = total;
+        if (skb_shinfo(p)->frag_list)
+                mss = skb_shinfo(p)->frag_list->len;
+
+        flush |= skb->len > mss || skb->len <= 0;
+        flush |= ntohl(th2->seq) + total != ntohl(th->seq);
+
+        if (flush || skb_gro_receive(head, skb)) {
+                mss = 1;
+                goto out_check_final;
+        }
+
+        p = *head;
+        th2 = tcp_hdr(p);
+        tcp_flag_word(th2) |= flags & (TCP_FLAG_FIN | TCP_FLAG_PSH);
+
+out_check_final:
+        flush = skb->len < mss;
+        flush |= flags & (TCP_FLAG_URG | TCP_FLAG_PSH | TCP_FLAG_RST |
+                          TCP_FLAG_SYN | TCP_FLAG_FIN);
+
+        if (p && (!NAPI_GRO_CB(skb)->same_flow || flush))
+                pp = head;
+
+out:
+        NAPI_GRO_CB(skb)->flush |= flush;
+
+        return pp;
+}
+
+int tcp_gro_complete(struct sk_buff *skb)
+{
+        struct tcphdr *th = tcp_hdr(skb);
+
+        skb->csum_start = skb_transport_header(skb) - skb->head;
+        skb->csum_offset = offsetof(struct tcphdr, check);
+        skb->ip_summed = CHECKSUM_PARTIAL;
+
+        skb_shinfo(skb)->gso_size = skb_shinfo(skb)->frag_list->len;
+        skb_shinfo(skb)->gso_segs = NAPI_GRO_CB(skb)->count;
+
+        if (th->cwr)
+                skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
+
+        return 0;
+}
+
 #ifdef CONFIG_TCP_MD5SIG
 static unsigned long tcp_md5sig_users;
 static struct tcp_md5sig_pool **tcp_md5sig_pool;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5c8fa7f..5b7ce84 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2350,6 +2350,41 @@ void tcp4_proc_exit(void)
 }
 #endif /* CONFIG_PROC_FS */
 
+struct sk_buff **tcp4_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+{
+        struct iphdr *iph = ip_hdr(skb);
+
+        switch (skb->ip_summed) {
+        case CHECKSUM_COMPLETE:
+                if (!tcp_v4_check(skb->len, iph->saddr, iph->daddr,
+                                  skb->csum)) {
+                        skb->ip_summed = CHECKSUM_UNNECESSARY;
+                        break;
+                }
+
+                /* fall through */
+        case CHECKSUM_NONE:
+                NAPI_GRO_CB(skb)->flush = 1;
+                return NULL;
+        }
+
+        return tcp_gro_receive(head, skb);
+}
+EXPORT_SYMBOL(tcp4_gro_receive);
+
+int tcp4_gro_complete(struct sk_buff *skb)
+{
+        struct iphdr *iph = ip_hdr(skb);
+        struct tcphdr *th = tcp_hdr(skb);
+
+        th->check = ~tcp_v4_check(skb->len - skb_transport_offset(skb),
+                                  iph->saddr, iph->daddr, 0);
+        skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+
+        return tcp_gro_complete(skb);
+}
+EXPORT_SYMBOL(tcp4_gro_complete);
+
 struct proto tcp_prot = {
         .name = "TCP",
         .owner = THIS_MODULE,
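For readers unfamiliar with CHECKSUM_PARTIAL: tcp4_gro_complete seeds th->check with the complement of the pseudo-header sum, so that the transmit path (or the NIC) only has to add the sum over the TCP header and payload. A user-space model of that seed follows; the helper names are illustrative, not kernel API:

```c
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones'-complement sum. */
static uint16_t fold(uint32_t sum)
{
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
}

/*
 * Ones'-complement sum of the TCP pseudo-header: source address,
 * destination address, protocol number and TCP length. This is the
 * quantity whose complement tcp4_gro_complete stores in th->check
 * via ~tcp_v4_check(len, saddr, daddr, 0).
 */
static uint16_t pseudo_hdr_sum(uint32_t saddr, uint32_t daddr,
                               uint16_t tcp_len)
{
        uint32_t sum = (saddr >> 16) + (saddr & 0xffff) +
                       (daddr >> 16) + (daddr & 0xffff) +
                       6 /* IPPROTO_TCP */ + tcp_len;

        return fold(sum);
}
```

With saddr 192.168.0.1 (0xc0a80001), daddr 192.168.0.2 (0xc0a80002) and a 20-byte TCP segment, the folded sum is 0x816e; the kernel would store its complement as the partial checksum.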
tcp: Add GRO support

This patch adds the TCP-specific portion of GRO. The criterion for
merging is extremely strict (the TCP header must match exactly apart
from the checksum) so as to allow refragmentation. Otherwise this is
pretty much identical to LRO, except that we support the merging of
ECN packets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/tcp.h   |   6 +++
 net/ipv4/af_inet.c  |   2 +
 net/ipv4/tcp.c      | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_ipv4.c |  35 ++++++++++++++++++
 4 files changed, 143 insertions(+)
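The "extremely strict" merge criterion described in the changelog can be modelled in miniature. The struct and field set below are a simplified assumption for illustration, not the kernel's struct tcphdr, and the sketch omits the flag, option and size checks that tcp_gro_receive also performs:

```c
#include <stdint.h>

/* Simplified view of the header fields the merge decision looks at. */
struct seg {
        uint16_t source, dest;  /* ports: must identify the same flow */
        uint32_t seq, ack_seq;  /* ack must match; seq must be contiguous */
        uint16_t window;        /* advertised window must match */
        uint32_t len;           /* payload length of the held packet */
};

/*
 * A new segment may be merged onto a held one only if the headers
 * match exactly and it starts precisely where the held data ends,
 * so the super-packet can later be refragmented losslessly.
 */
static int can_merge(const struct seg *held, const struct seg *next)
{
        if (held->source != next->source || held->dest != next->dest)
                return 0;
        if (held->ack_seq != next->ack_seq || held->window != next->window)
                return 0;
        return held->seq + held->len == next->seq;
}

/* Convenience wrapper for exercising can_merge with fixed flow fields. */
static int demo_merge(uint32_t held_seq, uint32_t held_len, uint32_t next_seq)
{
        struct seg held = { 100, 200, held_seq, 5, 512, held_len };
        struct seg next = { 100, 200, next_seq, 5, 512, 0 };

        return can_merge(&held, &next);
}
```

A sequence gap or any header mismatch forces a flush instead of a merge, which is exactly what keeps GRO reversible where classic LRO was not.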