diff mbox

[net-next,1/2] net: Header length compution function

Message ID 20140729.215853.2058634690190963314.davem@davemloft.net
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

David Miller July 30, 2014, 4:58 a.m. UTC
From: Amir Vadai <amirv@mellanox.com>
Date: Mon, 28 Jul 2014 13:14:10 +0300

> From: Eric Dumazet <eric.dumazet@gmail.com>
> 
> This commit is based on Eric Dumazet suggestion.
> Use flow dissector to calculate header length.
> Tested the following with a mlx4, and it indeed speeds up GRE traffic,
> as GRO packets can now contain 17 MSS instead of 8.
> (Pulling payload means GRO had to use 2 'frags' per MSS)
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Amir Vadai <amirv@mellanox.com>

So I decided to see how bad it would be if we tried to avoid making
that funky on-stack SKB and came up with the patch below.

With this you can make __skb_get_poff() take "data" and "hlen" too,
pass it onward to __skb_flow_dissect(), and then your new function is
just:

u32 eth_frame_headlen(void *data, unsigned int len)
{
	if (unlikely(len < ETH_HLEN))
		return len;
	return __skb_get_poff(NULL, data, hlen) + ETH_HLEN;
}

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet July 30, 2014, 7 a.m. UTC | #1
On Tue, 2014-07-29 at 21:58 -0700, David Miller wrote:
> From: Amir Vadai <amirv@mellanox.com>
> Date: Mon, 28 Jul 2014 13:14:10 +0300
> 
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > 
> > This commit is based on Eric Dumazet suggestion.
> > Use flow dissector to calculate header length.
> > Tested the following with a mlx4, and it indeed speeds up GRE traffic,
> > as GRO packets can now contain 17 MSS instead of 8.
> > (Pulling payload means GRO had to use 2 'frags' per MSS)
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > Signed-off-by: Amir Vadai <amirv@mellanox.com>
> 
> So I decided to see how bad it would be if we tried to avoid making
> that funky on-stack SKB and came up with the patch below.
> 
> With this you can make __skb_get_poff() take "data" and "hlen" too,
> pass it onward to __skb_flow_dissect(), and then your new function is
> just:
> 
> u32 eth_frame_headlen(void *data, unsigned int len)
> {
> 	if (unlikely(len < ETH_HLEN))
> 		return len;
> 	return __skb_get_poff(NULL, data, hlen) + ETH_HLEN;
> }

Yep, this was basically what I got when I tried it as well, and I was
not convinced it was the way to go.

This adds quite large number of conditional jumps, as
skb_header_pointer() is heavily used in the stack.

If we want to restrict flow dissection to a 'fake linear skb',
needed fields from this fake skb are well known :

skb->data,		(data)
skb->len,	        (len)
skb->data_len		(0)   // this is a fake linear skb ...

Then skb_flow_dissect() needs 2 additional parts, because it assumed
ethernet header was already pulled (from eth_type_trans())

skb->network_header     (ETH_HLEN)
skb->protocol		(eth->h_proto)

This solution, admittedly a bit hacky, do not add more complexity into
fast path.

The last 2 fields (network_header and protocol) could even be passed as
__skb_flow_dissect() parameters.

By nature, skb_flow_dissect() should not access any information outside
of the frame itself.

Apparently, Alexander never trusted this core function and decided to
implement its own limited flow dissector in igb/ixgbe...

skb layout regarding how linear data is attached wont change anytime
soon.

We certainly can add a fat comment, but then any code using skb->data &
skb->len should get a fat comment as well.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Duyck, Alexander H July 30, 2014, 2:26 p.m. UTC | #2
On 07/30/2014 12:00 AM, Eric Dumazet wrote:
> On Tue, 2014-07-29 at 21:58 -0700, David Miller wrote:
>> From: Amir Vadai <amirv@mellanox.com>
>> Date: Mon, 28 Jul 2014 13:14:10 +0300
>>
>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>>
>>> This commit is based on Eric Dumazet suggestion.
>>> Use flow dissector to calculate header length.
>>> Tested the following with a mlx4, and it indeed speeds up GRE traffic,
>>> as GRO packets can now contain 17 MSS instead of 8.
>>> (Pulling payload means GRO had to use 2 'frags' per MSS)
>>>
>>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>>> Signed-off-by: Amir Vadai <amirv@mellanox.com>
>>
>> So I decided to see how bad it would be if we tried to avoid making
>> that funky on-stack SKB and came up with the patch below.
>>
>> With this you can make __skb_get_poff() take "data" and "hlen" too,
>> pass it onward to __skb_flow_dissect(), and then your new function is
>> just:
>>
>> u32 eth_frame_headlen(void *data, unsigned int len)
>> {
>> 	if (unlikely(len < ETH_HLEN))
>> 		return len;
>> 	return __skb_get_poff(NULL, data, hlen) + ETH_HLEN;
>> }
> 
> Yep, this was basically what I got when I tried it as well, and I was
> not convinced it was the way to go.
> 
> This adds quite large number of conditional jumps, as
> skb_header_pointer() is heavily used in the stack.
> 
> If we want to restrict flow dissection to a 'fake linear skb',
> needed fields from this fake skb are well known :
> 
> skb->data,		(data)
> skb->len,	        (len)
> skb->data_len		(0)   // this is a fake linear skb ...
> 
> Then skb_flow_dissect() needs 2 additional parts, because it assumed
> ethernet header was already pulled (from eth_type_trans())
> 
> skb->network_header     (ETH_HLEN)
> skb->protocol		(eth->h_proto)
> 
> This solution, admittedly a bit hacky, do not add more complexity into
> fast path.
> 
> The last 2 fields (network_header and protocol) could even be passed as
> __skb_flow_dissect() parameters.
> 
> By nature, skb_flow_dissect() should not access any information outside
> of the frame itself.
> 
> Apparently, Alexander never trusted this core function and decided to
> implement its own limited flow dissector in igb/ixgbe...
> 
> skb layout regarding how linear data is attached wont change anytime
> soon.
> 
> We certainly can add a fat comment, but then any code using skb->data &
> skb->len should get a fat comment as well.
> 
> 
> 

It wasn't that I don't trust the core function.  We already had some of
our own code floating around for the out-of-tree LRO and so I simply
made use of that as it allowed for code reuse in our driver.

My complaint isn't about using data and len.  It is the fact that there
is no shared info and the fact that most of the skb on the stack is
uninitialized memory so if you go to access any fields other than data
or len you will just pull up garbage.

I almost wonder if it wouldn't be worth while to move data_len and len
closer to the end of the sk_buff and perhaps create a structure within
the structure so that you could partition off all of the bits that we
don't need for these kind of simple operations.

Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller July 31, 2014, 1:34 a.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 30 Jul 2014 09:00:44 +0200

> This adds quite large number of conditional jumps, as
> skb_header_pointer() is heavily used in the stack.

There is only one, in __skb_header_pointer(), testing "!skb" when we
have to take the slow path of copying from a non-linear area.

The NULL check could even be moved into skb_copy_bits() so that it
would not be inlined.  It should be unlikely() (or even WARN_ON()),
and thus terribly easily to be predicted properly all the time.

We could even pass around a "sentinel" global const SKB for these
cases that would simply make skb_copy_bits() return immediately (just
having skb->len == 0 would do the trick) even if we accidently got
there with one.

Then there would be no extra conditional jumps.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller July 31, 2014, 1:39 a.m. UTC | #4
From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Wed, 30 Jul 2014 07:26:33 -0700

> It wasn't that I don't trust the core function.  We already had some of
> our own code floating around for the out-of-tree LRO and so I simply
> made use of that as it allowed for code reuse in our driver.

It would be nice if this code were converted to use the generic
infrastructure, at some point at least.

> I almost wonder if it wouldn't be worth while to move data_len and len
> closer to the end of the sk_buff and perhaps create a structure within
> the structure so that you could partition off all of the bits that we
> don't need for these kind of simple operations.

That is something we can certainly look into in the long term.

But it's also not all that desirable to try to extricate the
non-linear handling.  It took us a long time to get all the code to
handle non-linear SKBs :-)

And in that case you have the issue of the fragmenting state being
separate in the skb_shared_info.

Anyways, a discussion for a different thread I think.

I don't think my proposed patch is a bad trade off.  Where we have the
__skb_header_pointer() thing that takes preloaded pointers and header
length values.  It adds only one test which frankly should never
trigger and can be moved down into skb_copy_bits() or similar.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Duyck, Alexander H July 31, 2014, 3:34 p.m. UTC | #5
On 07/30/2014 06:39 PM, David Miller wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Wed, 30 Jul 2014 07:26:33 -0700
> 
>> It wasn't that I don't trust the core function.  We already had some of
>> our own code floating around for the out-of-tree LRO and so I simply
>> made use of that as it allowed for code reuse in our driver.
> 
> It would be nice if this code were converted to use the generic
> infrastructure, at some point at least.

I agree.  That is one of the reasons why I supported an earlier approach
that had made a function that was shared between the drivers.  My only
real concern was the on-stack skb approach.

The only other change I see that I might need to address would be to add
FCoE support to the function and then I can probably switch over ixgbe
to use it.

> 
> I don't think my proposed patch is a bad trade off.  Where we have the
> __skb_header_pointer() thing that takes preloaded pointers and header
> length values.  It adds only one test which frankly should never
> trigger and can be moved down into skb_copy_bits() or similar.
> 

This works for me.  Once it is in I can see about pushing a patch to add
some FCoE support and work on moving over igb and ixgbe.

Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Aug. 23, 2014, 7:19 p.m. UTC | #6
From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Thu, 31 Jul 2014 08:34:22 -0700

> On 07/30/2014 06:39 PM, David Miller wrote:
>> I don't think my proposed patch is a bad trade off.  Where we have the
>> __skb_header_pointer() thing that takes preloaded pointers and header
>> length values.  It adds only one test which frankly should never
>> trigger and can be moved down into skb_copy_bits() or similar.
> 
> This works for me.  Once it is in I can see about pushing a patch to add
> some FCoE support and work on moving over igb and ixgbe.

You should be able to do this against net-next now, just FYI.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Duyck, Alexander H Aug. 25, 2014, 10:21 p.m. UTC | #7
On 08/23/2014 12:19 PM, David Miller wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Thu, 31 Jul 2014 08:34:22 -0700
> 
>> On 07/30/2014 06:39 PM, David Miller wrote:
>>> I don't think my proposed patch is a bad trade off.  Where we have the
>>> __skb_header_pointer() thing that takes preloaded pointers and header
>>> length values.  It adds only one test which frankly should never
>>> trigger and can be moved down into skb_copy_bits() or similar.
>>
>> This works for me.  Once it is in I can see about pushing a patch to add
>> some FCoE support and work on moving over igb and ixgbe.
> 
> You should be able to do this against net-next now, just FYI.
> 

Actually I was just looking at the code.  It looks like commit
19469a873bafd4e65daef3597db2bd724c1b03c9 "flow_dissector: Use IPv6 flow
label in flow_dissector" is likely breaking things in terms of trying to
use this function to get the header length since the code now returns
early if an IPv6 flow label is present.

Thanks,

Alex


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Aug. 25, 2014, 10:32 p.m. UTC | #8
From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Mon, 25 Aug 2014 15:21:54 -0700

> On 08/23/2014 12:19 PM, David Miller wrote:
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>> Date: Thu, 31 Jul 2014 08:34:22 -0700
>> 
>>> On 07/30/2014 06:39 PM, David Miller wrote:
>>>> I don't think my proposed patch is a bad trade off.  Where we have the
>>>> __skb_header_pointer() thing that takes preloaded pointers and header
>>>> length values.  It adds only one test which frankly should never
>>>> trigger and can be moved down into skb_copy_bits() or similar.
>>>
>>> This works for me.  Once it is in I can see about pushing a patch to add
>>> some FCoE support and work on moving over igb and ixgbe.
>> 
>> You should be able to do this against net-next now, just FYI.
>> 
> 
> Actually I was just looking at the code.  It looks like commit
> 19469a873bafd4e65daef3597db2bd724c1b03c9 "flow_dissector: Use IPv6 flow
> label in flow_dissector" is likely breaking things in terms of trying to
> use this function to get the header length since the code now returns
> early if an IPv6 flow label is present.

Feel free to insert a facility to disable that logic.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 281dece..1e0b04c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2554,20 +2554,26 @@  __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len,
 __wsum skb_checksum(const struct sk_buff *skb, int offset, int len,
 		    __wsum csum);
 
-static inline void *skb_header_pointer(const struct sk_buff *skb, int offset,
-				       int len, void *buffer)
+static inline void *__skb_header_pointer(const struct sk_buff *skb, int offset,
+					 int len, void *data, int hlen, void *buffer)
 {
-	int hlen = skb_headlen(skb);
-
 	if (hlen - offset >= len)
-		return skb->data + offset;
+		return data + offset;
 
-	if (skb_copy_bits(skb, offset, buffer, len) < 0)
+	if (!skb ||
+	    skb_copy_bits(skb, offset, buffer, len) < 0)
 		return NULL;
 
 	return buffer;
 }
 
+static inline void *skb_header_pointer(const struct sk_buff *skb, int offset,
+				       int len, void *buffer)
+{
+	return __skb_header_pointer(skb, offset, len, skb->data,
+				    skb_headlen(skb), buffer);
+}
+
 /**
  *	skb_needs_linearize - check if we need to linearize a given skb
  *			      depending on the given device features.
diff --git a/include/net/flow_keys.h b/include/net/flow_keys.h
index 6667a05..4040f63 100644
--- a/include/net/flow_keys.h
+++ b/include/net/flow_keys.h
@@ -27,7 +27,17 @@  struct flow_keys {
 	u8 ip_proto;
 };
 
-bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow);
-__be32 skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto);
+bool __skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow,
+			void *data, int hlen);
+static inline bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow)
+{
+	return __skb_flow_dissect(skb, flow, NULL, 0);
+}
+__be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
+			    void *data, int hlen_proto);
+static inline __be32 skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto)
+{
+	return __skb_flow_get_ports(skb, thoff, ip_proto, NULL, 0);
+}
 u32 flow_hash_from_keys(struct flow_keys *keys);
 #endif
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 5f362c1..660c649 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -34,29 +34,40 @@  static void iph_to_flow_copy_addrs(struct flow_keys *flow, const struct iphdr *i
  * The function will try to retrieve the ports at offset thoff + poff where poff
  * is the protocol port offset returned from proto_ports_offset
  */
-__be32 skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto)
+__be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
+			    void *data, int hlen)
 {
 	int poff = proto_ports_offset(ip_proto);
 
+	if (!data) {
+		data = skb->data;
+		hlen = skb_headlen(skb);
+	}
+
 	if (poff >= 0) {
 		__be32 *ports, _ports;
 
-		ports = skb_header_pointer(skb, thoff + poff,
-					   sizeof(_ports), &_ports);
+		ports = __skb_header_pointer(skb, thoff + poff,
+					     sizeof(_ports), data, hlen, &_ports);
 		if (ports)
 			return *ports;
 	}
 
 	return 0;
 }
-EXPORT_SYMBOL(skb_flow_get_ports);
+EXPORT_SYMBOL(__skb_flow_get_ports);
 
-bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow)
+bool __skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow, void *data, int hlen)
 {
 	int nhoff = skb_network_offset(skb);
 	u8 ip_proto;
 	__be16 proto = skb->protocol;
 
+	if (!data) {
+		data = skb->data;
+		hlen = skb_headlen(skb);
+	}
+
 	memset(flow, 0, sizeof(*flow));
 
 again:
@@ -65,7 +76,7 @@  again:
 		const struct iphdr *iph;
 		struct iphdr _iph;
 ip:
-		iph = skb_header_pointer(skb, nhoff, sizeof(_iph), &_iph);
+		iph = __skb_header_pointer(skb, nhoff, sizeof(_iph), data, hlen, &_iph);
 		if (!iph || iph->ihl < 5)
 			return false;
 		nhoff += iph->ihl * 4;
@@ -83,7 +94,7 @@  ip:
 		__be32 flow_label;
 
 ipv6:
-		iph = skb_header_pointer(skb, nhoff, sizeof(_iph), &_iph);
+		iph = __skb_header_pointer(skb, nhoff, sizeof(_iph), data, hlen, &_iph);
 		if (!iph)
 			return false;
 
@@ -113,7 +124,7 @@  ipv6:
 		const struct vlan_hdr *vlan;
 		struct vlan_hdr _vlan;
 
-		vlan = skb_header_pointer(skb, nhoff, sizeof(_vlan), &_vlan);
+		vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan), data, hlen, &_vlan);
 		if (!vlan)
 			return false;
 
@@ -126,7 +137,7 @@  ipv6:
 			struct pppoe_hdr hdr;
 			__be16 proto;
 		} *hdr, _hdr;
-		hdr = skb_header_pointer(skb, nhoff, sizeof(_hdr), &_hdr);
+		hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, hlen, &_hdr);
 		if (!hdr)
 			return false;
 		proto = hdr->proto;
@@ -151,7 +162,7 @@  ipv6:
 			__be16 proto;
 		} *hdr, _hdr;
 
-		hdr = skb_header_pointer(skb, nhoff, sizeof(_hdr), &_hdr);
+		hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, hlen, &_hdr);
 		if (!hdr)
 			return false;
 		/*
@@ -171,8 +182,9 @@  ipv6:
 				const struct ethhdr *eth;
 				struct ethhdr _eth;
 
-				eth = skb_header_pointer(skb, nhoff,
-							 sizeof(_eth), &_eth);
+				eth = __skb_header_pointer(skb, nhoff,
+							   sizeof(_eth),
+							   data, hlen, &_eth);
 				if (!eth)
 					return false;
 				proto = eth->h_proto;
@@ -194,12 +206,12 @@  ipv6:
 
 	flow->n_proto = proto;
 	flow->ip_proto = ip_proto;
-	flow->ports = skb_flow_get_ports(skb, nhoff, ip_proto);
+	flow->ports = __skb_flow_get_ports(skb, nhoff, ip_proto, data, hlen);
 	flow->thoff = (u16) nhoff;
 
 	return true;
 }
-EXPORT_SYMBOL(skb_flow_dissect);
+EXPORT_SYMBOL(__skb_flow_dissect);
 
 static u32 hashrnd __read_mostly;
 static __always_inline void __flow_hash_secret_init(void)