diff mbox

[net-next] gro: relax ID check in inet_gro_receive()

Message ID 1363841553.3333.47.camel@edumazet-glaptop
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet March 21, 2013, 4:52 a.m. UTC
From: Eric Dumazet <edumazet@google.com>

GRE TSO support doesn't increment the ID in the inner IP header.

Remove the ID check in inet_gro_receive() so that GRO can properly
aggregate GRE encapsulated TCP packets, instead of forcing
a flush for every packet.

Testing the IP ID is not really needed anyway for proper GRO operation.

We can use more readable (and faster) code to access tot_len and
frag_off fields.

Tested on a bnx2x setup after commit a848ade408b6b
(bnx2x: add CSUM and TSO support for encapsulation protocols)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dmitry Kravkov <dmitry@broadcom.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
---
 net/ipv4/af_inet.c |   11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Maciej Żenczykowski March 21, 2013, 9:59 a.m. UTC | #1
Ack.

I've never understood the usefulness of the 'IP ID increments by one'
check in the GRO TCP path anyway.

TCP packets are DF.
AFAICT, the IP identifier field does not really serve a useful purpose
for non-fragment-ed/able packets.

The only possible exception I can think of has to do with
broken/non-spec-compliant stuff which fragments DF packets, or removes
the DF flag.
But that stuff shouldn't really work (and often doesn't) anyway.

Maciej Żenczykowski, Kernel Networking Developer @ Google

On Wed, Mar 20, 2013 at 9:52 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> GRE TSO support doesn't increment the ID in the inner IP header.
>
> Remove the ID check in inet_gro_receive() so that GRO can properly
> aggregate GRE encapsulated TCP packets, instead of forcing
> a flush for every packet.
>
> Testing the IP ID is not really needed anyway for proper GRO operation.
>
> We can use more readable (and faster) code to access tot_len and
> frag_off fields.
>
> Tested on a bnx2x setup after commit a848ade408b6b
> (bnx2x: add CSUM and TSO support for encapsulation protocols)
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Dmitry Kravkov <dmitry@broadcom.com>
> Cc: Eilon Greenstein <eilong@broadcom.com>
> Cc: Pravin B Shelar <pshelar@nicira.com>
> Cc: H.K. Jerry Chu <hkchu@google.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> ---
>  net/ipv4/af_inet.c |   11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 9e5882c..302a47e 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1355,7 +1355,6 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
>         const struct iphdr *iph;
>         unsigned int hlen;
>         unsigned int off;
> -       unsigned int id;
>         int flush = 1;
>         int proto;
>
> @@ -1381,9 +1380,9 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
>         if (unlikely(ip_fast_csum((u8 *)iph, 5)))
>                 goto out_unlock;
>
> -       id = ntohl(*(__be32 *)&iph->id);
> -       flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id ^ IP_DF));
> -       id >>= 16;
> +       flush = ntohs(iph->tot_len) ^ skb_gro_len(skb);
> +
> +       flush |= (__force u16)iph->frag_off ^ htons(IP_DF);
>
>         for (p = *head; p; p = p->next) {
>                 struct iphdr *iph2;
> @@ -1400,11 +1399,9 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
>                         continue;
>                 }
>
> -               /* All fields must match except length and checksum. */
>                 NAPI_GRO_CB(p)->flush |=
>                         (iph->ttl ^ iph2->ttl) |
> -                       (iph->tos ^ iph2->tos) |
> -                       ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
> +                       (iph->tos ^ iph2->tos);
>
>                 NAPI_GRO_CB(p)->flush |= flush;
>         }
>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Maciej Żenczykowski March 21, 2013, 10:05 a.m. UTC | #2
> I've never understood the usefulness of the 'IP ID increments by one'
check in the GRO TCP path anyway.

I think the above logic holds for non-TCP GRO in exactly the same way.
Furthermore, there's nothing like it in IPv6 (due to the outright lack
of an IPv6 ID field).

(And I have to learn to pay attention to those stupid 3 little
gray-on-white dots that hide the entire email I'm replying to)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 21, 2013, 2:56 p.m. UTC | #3
From: Maciej Żenczykowski <maze@google.com>
Date: Thu, 21 Mar 2013 02:59:49 -0700

> TCP packets are DF.
> AFAICT, the IP identifier field does not really serve a useful purpose
> for non-fragment-ed/able packets.

Not incrementing it by one each frame breaks serial line TCP header
compression.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 21, 2013, 3:46 p.m. UTC | #4
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 20 Mar 2013 21:52:33 -0700

> GRE TSO support doesn't increment the ID in the inner IP header.

Is this a fundamental limitation of doing TSO over GRO or
were the Broadcom folks just being lazy with their firmware
implementation?

I really don't want to apply this patch, because ipv4 frames
even with DF set should have an incrementing ID field, in
order to accomodate various header compression schemes.

We go out of our way to do this for normal unencapsulated TCP stream
packets, rather than set the ID field to zero (which we did for some
time until the compression issue was pointed out to us).

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet March 21, 2013, 4:08 p.m. UTC | #5
On Thu, 2013-03-21 at 11:46 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 20 Mar 2013 21:52:33 -0700
> 
> > GRE TSO support doesn't increment the ID in the inner IP header.
> 
> Is this a fundamental limitation of doing TSO over GRO or
> were the Broadcom folks just being lazy with their firmware
> implementation?
> 

Well, I suspect this hardware is not capable of doing the proper ID
manipulation twice. (inner and outer header)

Still TSO support permits a single GRE flow going from 3Gbps to 9Gbps on
our hosts. So even if the inner IP id is 'broken', we are going to use
TSO.

Note we are limited by the receiver, as the receiver has to perform the
tcp checksum in software (bnx2x doesnt support CHECKSUM_COMPLETE yet)

Hopefully next firmware or NIC will do the right thing.

> I really don't want to apply this patch, because ipv4 frames
> even with DF set should have an incrementing ID field, in
> order to accomodate various header compression schemes.
> 
> We go out of our way to do this for normal unencapsulated TCP stream
> packets, rather than set the ID field to zero (which we did for some
> time until the compression issue was pointed out to us).

I understand your concern, but this check in GRO brings nothing at all.

Once we receive frames with 'bad IPv4 ID', should we accept them or drop
them ?

TCP stack doesn't care at receive (obviously as this ID is not a concern
for the transport layer), so GRO should do the same, as GRO is a best
effort to reduce cpu load.

I fully understand the 'tos' check because of proper ECN support, but
the ttl check or id check are totally useless and time consuming.

GRO aggregation should roughly work the same than TCP coalescing, and we
don't care of IP ID or ttl in TCP stack.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings March 21, 2013, 4:20 p.m. UTC | #6
On Thu, 2013-03-21 at 11:46 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 20 Mar 2013 21:52:33 -0700
> 
> > GRE TSO support doesn't increment the ID in the inner IP header.
> 
> Is this a fundamental limitation of doing TSO over GRO or
> were the Broadcom folks just being lazy with their firmware
> implementation?
> 
> I really don't want to apply this patch, because ipv4 frames
> even with DF set should have an incrementing ID field, in
> order to accomodate various header compression schemes.
> 
> We go out of our way to do this for normal unencapsulated TCP stream
> packets, rather than set the ID field to zero (which we did for some
> time until the compression issue was pointed out to us).

Besides which, GRO has been reliably reversible until now.  (gso_size is
available through packet sockets, even if tcpdump doesn't appear to use
it yet.)  Ignoring IPv4 IDs will break that guarantee.

Ben.
David Miller March 21, 2013, 4:31 p.m. UTC | #7
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 21 Mar 2013 16:20:25 +0000

> On Thu, 2013-03-21 at 11:46 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 20 Mar 2013 21:52:33 -0700
>> 
>> > GRE TSO support doesn't increment the ID in the inner IP header.
>> 
>> Is this a fundamental limitation of doing TSO over GRO or
>> were the Broadcom folks just being lazy with their firmware
>> implementation?
>> 
>> I really don't want to apply this patch, because ipv4 frames
>> even with DF set should have an incrementing ID field, in
>> order to accomodate various header compression schemes.
>> 
>> We go out of our way to do this for normal unencapsulated TCP stream
>> packets, rather than set the ID field to zero (which we did for some
>> time until the compression issue was pointed out to us).
> 
> Besides which, GRO has been reliably reversible until now.  (gso_size is
> available through packet sockets, even if tcpdump doesn't appear to use
> it yet.)  Ignoring IPv4 IDs will break that guarantee.

Right, even ignoring the header compression issues, our segmentation
offloads must be perfectly reversible.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 21, 2013, 4:31 p.m. UTC | #8
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 21 Mar 2013 09:08:11 -0700

> I understand your concern, but this check in GRO brings nothing at all.

It brings reversibility, a fundamental rule of our segmentation
offloads.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitry Kravkov March 21, 2013, 6:11 p.m. UTC | #9
Resending for netdev in plantext.


On Thu, Mar 21, 2013 at 6:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On Thu, 2013-03-21 at 11:46 -0400, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 20 Mar 2013 21:52:33 -0700
> >
> > > GRE TSO support doesn't increment the ID in the inner IP header.
> >
> > Is this a fundamental limitation of doing TSO over GRO or
> > were the Broadcom folks just being lazy with their firmware
> > implementation?
> >
>
> Well, I suspect this hardware is not capable of doing the proper ID
> manipulation twice. (inner and outer header)

This is correct: ID only for one of the headers can be handled with
current FW/HW, for other DF is set.

> Still TSO support permits a single GRE flow going from 3Gbps to 9Gbps on
> our hosts. So even if the inner IP id is 'broken', we are going to use
> TSO.
>
> Note we are limited by the receiver, as the receiver has to perform the
> tcp checksum in software (bnx2x doesnt support CHECKSUM_COMPLETE yet)
>
> Hopefully next firmware or NIC will do the right thing.
>
> > I really don't want to apply this patch, because ipv4 frames
> > even with DF set should have an incrementing ID field, in
> > order to accomodate various header compression schemes.
> >
> > We go out of our way to do this for normal unencapsulated TCP stream
> > packets, rather than set the ID field to zero (which we did for some
> > time until the compression issue was pointed out to us).
>
> I understand your concern, but this check in GRO brings nothing at all.
>
> Once we receive frames with 'bad IPv4 ID', should we accept them or drop
> them ?
>
> TCP stack doesn't care at receive (obviously as this ID is not a concern
> for the transport layer), so GRO should do the same, as GRO is a best
> effort to reduce cpu load.
>
> I fully understand the 'tos' check because of proper ECN support, but
> the ttl check or id check are totally useless and time consuming.
>
> GRO aggregation should roughly work the same than TCP coalescing, and we
> don't care of IP ID or ttl in TCP stack.
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 21, 2013, 6:56 p.m. UTC | #10
From: Dmitry Kravkov <dkravkov@gmail.com>
Date: Thu, 21 Mar 2013 20:11:37 +0200

>> Well, I suspect this hardware is not capable of doing the proper ID
>> manipulation twice. (inner and outer header)
> 
> This is correct: ID only for one of the headers can be handled with
> current FW/HW, for other DF is set.

DF does not matter.

Regardless of DF, we must set the ID field correctly.

It is abundantly clear that the current GRE tunnel segmentation
is not generating packets according to our well documented
rules, in that we must be able to precisely create exactly
the original packet stream from the segmented frame.

Someone needs to send me patches to revert the bnx2x GRE segmentation
support, and any software implementation in our tree that has the same
bug.

If someone doesn't do it, I will revert all of this code myself.  You
simply will have to cope with not having this optimization until
your hardware can do it properly and according to our well established
rules for segmentation offloads.

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Maciej Żenczykowski March 21, 2013, 8:07 p.m. UTC | #11
> This is correct: ID only for one of the headers can be handled with current FW/HW, for other DF is set.

I would prefer for IP ID to be correctly incremented on the inner IPv4
packet rather than the outer IPv4.
Could that be changed?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitry Kravkov March 21, 2013, 8:13 p.m. UTC | #12
> -----Original Message-----

> From: Maciej Żenczykowski [mailto:maze@google.com]

> Sent: Thursday, March 21, 2013 10:08 PM

> To: Dmitry Kravkov

> Cc: Eric Dumazet; David Miller; netdev@vger.kernel.org; Dmitry Kravkov; Eilon Greenstein; Pravin B Shelar; Hsiao-keng Jerry Chu

> Subject: Re: [PATCH net-next] gro: relax ID check in inet_gro_receive()

> 

> > This is correct: ID only for one of the headers can be handled with current FW/HW, for other DF is set.

> 

> I would prefer for IP ID to be correctly incremented on the inner IPv4

> packet rather than the outer IPv4.

> Could that be changed?


Yes, I will check if it can be done in driver only ...
Maciej Żenczykowski March 21, 2013, 8:14 p.m. UTC | #13
Reverting/deleting this code is going way too far.
Mark it experimental, but don't delete it.

This is (a) useful for continued development and experimentation and
(b) will likely be widely deployed (and here I don't mean by my
employer specifically, but mean in general by cloud/virtualization
companies and anyone else that cares about tunnel performance)
regardless of whether upstream linux decides to include it or not.

(Furthermore if the fw/driver is changed to increment inner IP ID
instead of outer IP ID, this will for the most part not be visible to
the end users, since the tunneling is usually internal to the
virtualization provider, and no-one besides their internal network
sees the external IP header.)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 21, 2013, 8:20 p.m. UTC | #14
From: Maciej Żenczykowski <maze@google.com>
Date: Thu, 21 Mar 2013 13:14:44 -0700

> Reverting/deleting this code is going way too far.
> Mark it experimental, but don't delete it.

I disagree, people who want to play can apply the code
to their tree.

It's as simple as that.

We have very reasonable requirements for segmentation offloads, if
they are not met, the feature isn't working correctly and therefore
isn't ready for upstream.

Now if someone can provide a fix soon, that's fine too.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Maciej Żenczykowski March 21, 2013, 8:47 p.m. UTC | #15
> We have very reasonable requirements for segmentation offloads

I would tend to disagree with reversibility being a reasonable requirement.
It offers extremely little benefit, and thus it's a nice to have at best.
However, I do understand why you might want it, and thus I can
understand why you might not want to accept this patch (which would
make GRO less reversible).

This logic does not apply to the GRE segmentation offload patches.
They might create non incrementing IP ID packet sequences on the wire,
and this may prevent GRO from kicking in, but it does not make TSO
non-reversible
(if anything it makes it easy to detect that it happened).

Would you perhaps be willing to take a patch which simply removes the
flag from the driver features?
But leaves the code in?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 21, 2013, 9:05 p.m. UTC | #16
From: Maciej Żenczykowski <maze@google.com>
Date: Thu, 21 Mar 2013 13:47:07 -0700

> They might create non incrementing IP ID packet sequences on the wire,
> and this may prevent GRO from kicking in, but it does not make TSO
> non-reversible
> (if anything it makes it easy to detect that it happened).

This is not an acceptable outcome for any form GSO, it should be
completely transparent.  The packets we would have output individually
and those which result from the segmented frame must be
indistinguishable.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesse Gross March 21, 2013, 9:37 p.m. UTC | #17
On Thu, Mar 21, 2013 at 2:05 PM, David Miller <davem@davemloft.net> wrote:
> From: Maciej Żenczykowski <maze@google.com>
> Date: Thu, 21 Mar 2013 13:47:07 -0700
>
>> They might create non incrementing IP ID packet sequences on the wire,
>> and this may prevent GRO from kicking in, but it does not make TSO
>> non-reversible
>> (if anything it makes it easy to detect that it happened).
>
> This is not an acceptable outcome for any form GSO, it should be
> completely transparent.  The packets we would have output individually
> and those which result from the segmented frame must be
> indistinguishable.

We actually don't set the outer IP ID for GRE packets with the DF bit
set in the non-GSO/TSO case either (and this is not new behavior).
Obviously TCP/IP header compression doesn't apply to the GRE packet
itself.

Therefore, I think if the driver is switched to increment the inner
ID, which it sounds like it can be, then everything should be
transparent.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 21, 2013, 9:39 p.m. UTC | #18
From: Jesse Gross <jesse@nicira.com>
Date: Thu, 21 Mar 2013 14:37:38 -0700

> On Thu, Mar 21, 2013 at 2:05 PM, David Miller <davem@davemloft.net> wrote:
>> From: Maciej Żenczykowski <maze@google.com>
>> Date: Thu, 21 Mar 2013 13:47:07 -0700
>>
>>> They might create non incrementing IP ID packet sequences on the wire,
>>> and this may prevent GRO from kicking in, but it does not make TSO
>>> non-reversible
>>> (if anything it makes it easy to detect that it happened).
>>
>> This is not an acceptable outcome for any form GSO, it should be
>> completely transparent.  The packets we would have output individually
>> and those which result from the segmented frame must be
>> indistinguishable.
> 
> We actually don't set the outer IP ID for GRE packets with the DF bit
> set in the non-GSO/TSO case either (and this is not new behavior).
> Obviously TCP/IP header compression doesn't apply to the GRE packet
> itself.
> 
> Therefore, I think if the driver is switched to increment the inner
> ID, which it sounds like it can be, then everything should be
> transparent.

That's my impression as well.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9e5882c..302a47e 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1355,7 +1355,6 @@  static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 	const struct iphdr *iph;
 	unsigned int hlen;
 	unsigned int off;
-	unsigned int id;
 	int flush = 1;
 	int proto;
 
@@ -1381,9 +1380,9 @@  static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 	if (unlikely(ip_fast_csum((u8 *)iph, 5)))
 		goto out_unlock;
 
-	id = ntohl(*(__be32 *)&iph->id);
-	flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id ^ IP_DF));
-	id >>= 16;
+	flush = ntohs(iph->tot_len) ^ skb_gro_len(skb);
+
+	flush |= (__force u16)iph->frag_off ^ htons(IP_DF);
 
 	for (p = *head; p; p = p->next) {
 		struct iphdr *iph2;
@@ -1400,11 +1399,9 @@  static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 			continue;
 		}
 
-		/* All fields must match except length and checksum. */
 		NAPI_GRO_CB(p)->flush |=
 			(iph->ttl ^ iph2->ttl) |
-			(iph->tos ^ iph2->tos) |
-			((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
+			(iph->tos ^ iph2->tos);
 
 		NAPI_GRO_CB(p)->flush |= flush;
 	}