Message ID | 1363841553.3333.47.camel@edumazet-glaptop |
---|---|
State | Rejected, archived |
Delegated to: | David Miller |
Headers | show |
Ack. I've never understood the usefulness of the 'IP ID increments by one' check in the GRO TCP path anyway. TCP packets are DF. AFAICT, the IP identifier field does not really serve a useful purpose for non-fragment-ed/able packets. The only possible exception I can think of has to do with broken/non-spec-compliant stuff which fragments DF packets, or removes the DF flag. But that stuff shouldn't really work (and often doesn't) anyway. Maciej Żenczykowski, Kernel Networking Developer @ Google On Wed, Mar 20, 2013 at 9:52 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > From: Eric Dumazet <edumazet@google.com> > > GRE TSO support doesn't increment the ID in the inner IP header. > > Remove the ID check in inet_gro_receive() so that GRO can properly > aggregate GRE encapsulated TCP packets, instead of forcing > a flush for every packet. > > Testing the IP ID is not really needed anyway for proper GRO operation. > > We can use more readable (and faster) code to access tot_len and > frag_off fields. > > Tested on a bnx2x setup after commit a848ade408b6b > (bnx2x: add CSUM and TSO support for encapsulation protocols) > > Signed-off-by: Eric Dumazet <edumazet@google.com> > Cc: Dmitry Kravkov <dmitry@broadcom.com> > Cc: Eilon Greenstein <eilong@broadcom.com> > Cc: Pravin B Shelar <pshelar@nicira.com> > Cc: H.K. Jerry Chu <hkchu@google.com> > Cc: Maciej Żenczykowski <maze@google.com> > --- > net/ipv4/af_inet.c | 11 ++++------- > 1 file changed, 4 insertions(+), 7 deletions(-) > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c > index 9e5882c..302a47e 100644 > --- a/net/ipv4/af_inet.c > +++ b/net/ipv4/af_inet.c > @@ -1355,7 +1355,6 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head, > const struct iphdr *iph; > unsigned int hlen; > unsigned int off; > - unsigned int id; > int flush = 1; > int proto; > > @@ -1381,9 +1380,9 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head, > if (unlikely(ip_fast_csum((u8 *)iph, 5))) > goto out_unlock; > > - id = ntohl(*(__be32 *)&iph->id); > - flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id ^ IP_DF)); > - id >>= 16; > + flush = ntohs(iph->tot_len) ^ skb_gro_len(skb); > + > + flush |= (__force u16)iph->frag_off ^ htons(IP_DF); > > for (p = *head; p; p = p->next) { > struct iphdr *iph2; > @@ -1400,11 +1399,9 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head, > continue; > } > > - /* All fields must match except length and checksum. */ > NAPI_GRO_CB(p)->flush |= > (iph->ttl ^ iph2->ttl) | > - (iph->tos ^ iph2->tos) | > - ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id); > + (iph->tos ^ iph2->tos); > > NAPI_GRO_CB(p)->flush |= flush; > } > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> I've never understood the usefulness of the 'IP ID increments by one'
check in the GRO TCP path anyway.
I think the above logic holds for non-TCP GRO in exactly the same way.
Furthermore, there's nothing like it in IPv6 (due to the outright lack
of an IPv6 ID field).
(And I have to learn to pay attention to those stupid 3 little
gray-on-white dots that hide the entire email I'm replying to)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Maciej Żenczykowski <maze@google.com> Date: Thu, 21 Mar 2013 02:59:49 -0700 > TCP packets are DF. > AFAICT, the IP identifier field does not really serve a useful purpose > for non-fragment-ed/able packets. Not incrementing it by one each frame breaks serial line TCP header compression. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Wed, 20 Mar 2013 21:52:33 -0700 > GRE TSO support doesn't increment the ID in the inner IP header. Is this a fundamental limitation of doing TSO over GRO or were the Broadcom folks just being lazy with their firmware implementation? I really don't want to apply this patch, because ipv4 frames even with DF set should have an incrementing ID field, in order to accomodate various header compression schemes. We go out of our way to do this for normal unencapsulated TCP stream packets, rather than set the ID field to zero (which we did for some time until the compression issue was pointed out to us). -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2013-03-21 at 11:46 -0400, David Miller wrote: > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Wed, 20 Mar 2013 21:52:33 -0700 > > > GRE TSO support doesn't increment the ID in the inner IP header. > > Is this a fundamental limitation of doing TSO over GRO or > were the Broadcom folks just being lazy with their firmware > implementation? > Well, I suspect this hardware is not capable of doing the proper ID manipulation twice. (inner and outer header) Still TSO support permits a single GRE flow going from 3Gbps to 9Gbps on our hosts. So even if the inner IP id is 'broken', we are going to use TSO. Note we are limited by the receiver, as the receiver has to perform the tcp checksum in software (bnx2x doesnt support CHECKSUM_COMPLETE yet) Hopefully next firmware or NIC will do the right thing. > I really don't want to apply this patch, because ipv4 frames > even with DF set should have an incrementing ID field, in > order to accomodate various header compression schemes. > > We go out of our way to do this for normal unencapsulated TCP stream > packets, rather than set the ID field to zero (which we did for some > time until the compression issue was pointed out to us). I understand your concern, but this check in GRO brings nothing at all. Once we receive frames with 'bad IPv4 ID', should we accept them or drop them ? TCP stack doesn't care at receive (obviously as this ID is not a concern for the transport layer), so GRO should do the same, as GRO is a best effort to reduce cpu load. I fully understand the 'tos' check because of proper ECN support, but the ttl check or id check are totally useless and time consuming. GRO aggregation should roughly work the same than TCP coalescing, and we don't care of IP ID or ttl in TCP stack. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2013-03-21 at 11:46 -0400, David Miller wrote: > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Wed, 20 Mar 2013 21:52:33 -0700 > > > GRE TSO support doesn't increment the ID in the inner IP header. > > Is this a fundamental limitation of doing TSO over GRO or > were the Broadcom folks just being lazy with their firmware > implementation? > > I really don't want to apply this patch, because ipv4 frames > even with DF set should have an incrementing ID field, in > order to accomodate various header compression schemes. > > We go out of our way to do this for normal unencapsulated TCP stream > packets, rather than set the ID field to zero (which we did for some > time until the compression issue was pointed out to us). Besides which, GRO has been reliably reversible until now. (gso_size is available through packet sockets, even if tcpdump doesn't appear to use it yet.) Ignoring IPv4 IDs will break that guarantee. Ben.
From: Ben Hutchings <bhutchings@solarflare.com> Date: Thu, 21 Mar 2013 16:20:25 +0000 > On Thu, 2013-03-21 at 11:46 -0400, David Miller wrote: >> From: Eric Dumazet <eric.dumazet@gmail.com> >> Date: Wed, 20 Mar 2013 21:52:33 -0700 >> >> > GRE TSO support doesn't increment the ID in the inner IP header. >> >> Is this a fundamental limitation of doing TSO over GRO or >> were the Broadcom folks just being lazy with their firmware >> implementation? >> >> I really don't want to apply this patch, because ipv4 frames >> even with DF set should have an incrementing ID field, in >> order to accomodate various header compression schemes. >> >> We go out of our way to do this for normal unencapsulated TCP stream >> packets, rather than set the ID field to zero (which we did for some >> time until the compression issue was pointed out to us). > > Besides which, GRO has been reliably reversible until now. (gso_size is > available through packet sockets, even if tcpdump doesn't appear to use > it yet.) Ignoring IPv4 IDs will break that guarantee. Right, even ignoring the header compression issues, our segmentation offloads must be perfectly reversible. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu, 21 Mar 2013 09:08:11 -0700 > I understand your concern, but this check in GRO brings nothing at all. It brings reversibility, a fundamental rule of our segmentation offloads. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Resending for netdev in plantext. On Thu, Mar 21, 2013 at 6:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > On Thu, 2013-03-21 at 11:46 -0400, David Miller wrote: > > From: Eric Dumazet <eric.dumazet@gmail.com> > > Date: Wed, 20 Mar 2013 21:52:33 -0700 > > > > > GRE TSO support doesn't increment the ID in the inner IP header. > > > > Is this a fundamental limitation of doing TSO over GRO or > > were the Broadcom folks just being lazy with their firmware > > implementation? > > > > Well, I suspect this hardware is not capable of doing the proper ID > manipulation twice. (inner and outer header) This is correct: ID only for one of the headers can be handled with current FW/HW, for other DF is set. > Still TSO support permits a single GRE flow going from 3Gbps to 9Gbps on > our hosts. So even if the inner IP id is 'broken', we are going to use > TSO. > > Note we are limited by the receiver, as the receiver has to perform the > tcp checksum in software (bnx2x doesnt support CHECKSUM_COMPLETE yet) > > Hopefully next firmware or NIC will do the right thing. > > > I really don't want to apply this patch, because ipv4 frames > > even with DF set should have an incrementing ID field, in > > order to accomodate various header compression schemes. > > > > We go out of our way to do this for normal unencapsulated TCP stream > > packets, rather than set the ID field to zero (which we did for some > > time until the compression issue was pointed out to us). > > I understand your concern, but this check in GRO brings nothing at all. > > Once we receive frames with 'bad IPv4 ID', should we accept them or drop > them ? > > TCP stack doesn't care at receive (obviously as this ID is not a concern > for the transport layer), so GRO should do the same, as GRO is a best > effort to reduce cpu load. > > I fully understand the 'tos' check because of proper ECN support, but > the ttl check or id check are totally useless and time consuming. > > GRO aggregation should roughly work the same than TCP coalescing, and we > don't care of IP ID or ttl in TCP stack. > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Dmitry Kravkov <dkravkov@gmail.com> Date: Thu, 21 Mar 2013 20:11:37 +0200 >> Well, I suspect this hardware is not capable of doing the proper ID >> manipulation twice. (inner and outer header) > > This is correct: ID only for one of the headers can be handled with > current FW/HW, for other DF is set. DF does not matter. Regardless of DF, we must set the ID field correctly. It is abundantly clear that the current GRE tunnel segmentation is not generating packets according to our well documented rules, in that we must be able to precisely create exactly the original packet stream from the segmented frame. Someone needs to send me patches to revert the bnx2x GRE segmentation support, and any software implementation in our tree that has the same bug. If someone doesn't do it, I will revert all of this code myself. You simply will have to cope with not having this optimization until your hardware can do it properly and according to our well established rules for segmentation offloads. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> This is correct: ID only for one of the headers can be handled with current FW/HW, for other DF is set.
I would prefer for IP ID to be correctly incremented on the inner IPv4
packet rather than the outer IPv4.
Could that be changed?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Maciej Żenczykowski [mailto:maze@google.com] > Sent: Thursday, March 21, 2013 10:08 PM > To: Dmitry Kravkov > Cc: Eric Dumazet; David Miller; netdev@vger.kernel.org; Dmitry Kravkov; Eilon Greenstein; Pravin B Shelar; Hsiao-keng Jerry Chu > Subject: Re: [PATCH net-next] gro: relax ID check in inet_gro_receive() > > > This is correct: ID only for one of the headers can be handled with current FW/HW, for other DF is set. > > I would prefer for IP ID to be correctly incremented on the inner IPv4 > packet rather than the outer IPv4. > Could that be changed? Yes, I will check if it can be done in driver only ...
Reverting/deleting this code is going way too far. Mark it experimental, but don't delete it. This is (a) useful for continued development and experimentation and (b) will likely be widely deployed (and here I don't mean by my employer specifically, but mean in general by cloud/virtualization companies and anyone else that cares about tunnel performance) regardless of whether upstream linux decides to include it or not. (Furthermore if the fw/driver is changed to increment inner IP ID instead of outer IP ID, this will for the most part not be visible to the end users, since the tunneling is usually internal to the virtualization provider, and no-one besides their internal network sees the external IP header.) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Maciej Żenczykowski <maze@google.com> Date: Thu, 21 Mar 2013 13:14:44 -0700 > Reverting/deleting this code is going way too far. > Mark it experimental, but don't delete it. I disagree, people who want to play can apply the code to their tree. It's as simple as that. We have very reasonable requirements for segmentation offloads, if they are not met, the feature isn't working correctly and therefore isn't ready for upstream. Now if someone can provide a fix soon, that's fine too. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> We have very reasonable requirements for segmentation offloads
I would tend to disagree with reversibility being a reasonable requirement.
It offers extremely little benefit, and thus it's a nice to have at best.
However, I do understand why you might want it, and thus I can
understand why you might not want to accept this patch (which would
make GRO less reversible).
This logic does not apply to the GRE segmentation offload patches.
They might create non incrementing IP ID packet sequences on the wire,
and this may prevent GRO from kicking in, but it does not make TSO
non-reversible
(if anything it makes it easy to detect that it happened).
Would you perhaps be willing to take a patch which simply removes the
flag from the driver features?
But leaves the code in?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Maciej Żenczykowski <maze@google.com> Date: Thu, 21 Mar 2013 13:47:07 -0700 > They might create non incrementing IP ID packet sequences on the wire, > and this may prevent GRO from kicking in, but it does not make TSO > non-reversible > (if anything it makes it easy to detect that it happened). This is not an acceptable outcome for any form GSO, it should be completely transparent. The packets we would have output individually and those which result from the segmented frame must be indistinguishable. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 21, 2013 at 2:05 PM, David Miller <davem@davemloft.net> wrote: > From: Maciej Żenczykowski <maze@google.com> > Date: Thu, 21 Mar 2013 13:47:07 -0700 > >> They might create non incrementing IP ID packet sequences on the wire, >> and this may prevent GRO from kicking in, but it does not make TSO >> non-reversible >> (if anything it makes it easy to detect that it happened). > > This is not an acceptable outcome for any form GSO, it should be > completely transparent. The packets we would have output individually > and those which result from the segmented frame must be > indistinguishable. We actually don't set the outer IP ID for GRE packets with the DF bit set in the non-GSO/TSO case either (and this is not new behavior). Obviously TCP/IP header compression doesn't apply to the GRE packet itself. Therefore, I think if the driver is switched to increment the inner ID, which it sounds like it can be, then everything should be transparent. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Jesse Gross <jesse@nicira.com> Date: Thu, 21 Mar 2013 14:37:38 -0700 > On Thu, Mar 21, 2013 at 2:05 PM, David Miller <davem@davemloft.net> wrote: >> From: Maciej Żenczykowski <maze@google.com> >> Date: Thu, 21 Mar 2013 13:47:07 -0700 >> >>> They might create non incrementing IP ID packet sequences on the wire, >>> and this may prevent GRO from kicking in, but it does not make TSO >>> non-reversible >>> (if anything it makes it easy to detect that it happened). >> >> This is not an acceptable outcome for any form GSO, it should be >> completely transparent. The packets we would have output individually >> and those which result from the segmented frame must be >> indistinguishable. > > We actually don't set the outer IP ID for GRE packets with the DF bit > set in the non-GSO/TSO case either (and this is not new behavior). > Obviously TCP/IP header compression doesn't apply to the GRE packet > itself. > > Therefore, I think if the driver is switched to increment the inner > ID, which it sounds like it can be, then everything should be > transparent. That's my impression as well. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 9e5882c..302a47e 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1355,7 +1355,6 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head, const struct iphdr *iph; unsigned int hlen; unsigned int off; - unsigned int id; int flush = 1; int proto; @@ -1381,9 +1380,9 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head, if (unlikely(ip_fast_csum((u8 *)iph, 5))) goto out_unlock; - id = ntohl(*(__be32 *)&iph->id); - flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id ^ IP_DF)); - id >>= 16; + flush = ntohs(iph->tot_len) ^ skb_gro_len(skb); + + flush |= (__force u16)iph->frag_off ^ htons(IP_DF); for (p = *head; p; p = p->next) { struct iphdr *iph2; @@ -1400,11 +1399,9 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head, continue; } - /* All fields must match except length and checksum. */ NAPI_GRO_CB(p)->flush |= (iph->ttl ^ iph2->ttl) | - (iph->tos ^ iph2->tos) | - ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id); + (iph->tos ^ iph2->tos); NAPI_GRO_CB(p)->flush |= flush; }