diff mbox

[v6] bonding support for IPv6 transmit hashing

Message ID 11100.1345576795@death.nxdomain
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Jay Vosburgh Aug. 21, 2012, 7:19 p.m. UTC
Jeremy Brookman <jeremy.brookman@gmail.com> wrote:

>> You should use a mix of tabs, as necessary, to get things to line up
>> how I told you they need to line up.
>
>Unless I'm missing something, this change doesn't seem to have made it
>through to the kernel tip, but we could really use this bugfix. Is it
>in a repository I didn't notice, or not yet through the review?  If
>it's not through the review, is any help needed to get it there?

	The submitter (John Eaglesham) never posted an updated version
that addressed the various comments, nor did his original patch
submission include a Signed-off-by.

	I went ahead and updated the patch to address the comments; I've
only compile tested this.  Are you (Jeremy or John) able to test this to
confirm that it will hash ipv6 traffic as expected (I can test it, but
it won't be today)?

	John, can you post a Signed-off-by for your patch (really, this
updated version of your patch)?

	If John signs off and somebody tests this, I'll post a formal
submssion with the full commit message.

	-J



---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

John Aug. 21, 2012, 10:21 p.m. UTC | #1
On 8/21/2012 12:19 PM, Jay Vosburgh wrote:
> Jeremy Brookman <jeremy.brookman@gmail.com> wrote:
>
>>> You should use a mix of tabs, as necessary, to get things to line up
>>> how I told you they need to line up.
>>
>> Unless I'm missing something, this change doesn't seem to have made it
>> through to the kernel tip, but we could really use this bugfix. Is it
>> in a repository I didn't notice, or not yet through the review?  If
>> it's not through the review, is any help needed to get it there?
>
> 	The submitter (John Eaglesham) never posted an updated version
> that addressed the various comments, nor did his original patch
> submission include a Signed-off-by.
>
> 	I went ahead and updated the patch to address the comments; I've
> only compile tested this.  Are you (Jeremy or John) able to test this to
> confirm that it will hash ipv6 traffic as expected (I can test it, but
> it won't be today)?
>
> 	John, can you post a Signed-off-by for your patch (really, this
> updated version of your patch)?
>
> 	If John signs off and somebody tests this, I'll post a formal
> submssion with the full commit message.
>
> 	-J
>

Since my last submission I ended up making some changes on my end to 
streamline the logic. I can fold together my patch with yours and test 
them later tonight. If everything looks good I'll post the changes back 
to the list.

Thanks!
John

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeremy Brookman Aug. 22, 2012, 12:06 p.m. UTC | #2
>>         If John signs off and somebody tests this, I'll post a formal
>> submssion with the full commit message.
>
> Since my last submission I ended up making some changes on my end to
> streamline the logic. I can fold together my patch with yours and test them
> later tonight. If everything looks good I'll post the changes back to the
> list.

Great - thanks for that Jay/John; will look forward to the latest
patch later. I'm actually looking at the 2.6.32 branch so have tested
a backport of Jay's patch (which only took a couple of very minor
modifications); a quick test on an 8-port bond with layer3+4 hashing
looked fine.  (Will hold off until the final patch before doing more
testing.)

Regards,

Jeremy

On Tue, Aug 21, 2012 at 11:21 PM, John Eaglesham <linux@8192.net> wrote:
> On 8/21/2012 12:19 PM, Jay Vosburgh wrote:
>>
>> Jeremy Brookman <jeremy.brookman@gmail.com> wrote:
>>
>>>> You should use a mix of tabs, as necessary, to get things to line up
>>>> how I told you they need to line up.
>>>
>>>
>>> Unless I'm missing something, this change doesn't seem to have made it
>>> through to the kernel tip, but we could really use this bugfix. Is it
>>> in a repository I didn't notice, or not yet through the review?  If
>>> it's not through the review, is any help needed to get it there?
>>
>>
>>         The submitter (John Eaglesham) never posted an updated version
>> that addressed the various comments, nor did his original patch
>> submission include a Signed-off-by.
>>
>>         I went ahead and updated the patch to address the comments; I've
>> only compile tested this.  Are you (Jeremy or John) able to test this to
>> confirm that it will hash ipv6 traffic as expected (I can test it, but
>> it won't be today)?
>>
>>         John, can you post a Signed-off-by for your patch (really, this
>> updated version of your patch)?
>>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeremy Brookman Aug. 23, 2012, 10:42 a.m. UTC | #3
Hi,

A few questions on the actual patch inline now I've had a bit more time...

>  static int bond_xmit_hash_policy_l23(struct sk_buff *skb, int count)

...

> +       if (skb->protocol == htons(ETH_P_IP) &&
> +               skb_network_header_len(skb) >= sizeof(struct iphdr)) {
> +               iph = ip_hdr(skb);
>                 return ((ntohl(iph->saddr ^ iph->daddr) & 0xffff) ^
>                         (data->h_dest[5] ^ data->h_source[5])) % count;
> +       } else if (skb->protocol == htons(ETH_P_IPV6) &&
> +               skb_network_header_len(skb) >= sizeof(struct ipv6hdr)) {
> +               ipv6h = ipv6_hdr(skb);
> +               s = &ipv6h->saddr.s6_addr32[0];
> +               d = &ipv6h->daddr.s6_addr32[0];
> +               v6hash = (s[1] ^ d[1]) ^ (s[2] ^ d[2]) ^ (s[3] ^ d[3]);
> +               v6hash ^= (v6hash >> 24) ^ (v6hash >> 16) ^ (v6hash >> 8);
> +               return (v6hash ^ data->h_dest[5] ^ data->h_source[5]) % count;
>         }
>

If the IPv4 case needs an ntohl, does the IPv6 case (if we're
interpreting the address as 4 32-bits)?  If IPv4 hashing algorithm is
consistent across different endiannesses, then maybe IPv6 should be
too?

>  /*
>   * Hash for the output device based upon layer 3 and layer 4 data. If
>   * the packet is a frag or not TCP or UDP, just use layer 3 data.  If it is
> - * altogether not IP, mimic bond_xmit_hash_policy_l2()
> + * altogether not IP, fall back on bond_xmit_hash_policy_l2()
>   */

Looking at the code below, we only check the first value of next_hdr
in the chain; however RFC 2460 lists the following possible extension
headers, all of which will therefore cause fallback to L3 hashing:

           Hop-by-Hop Options
           Routing (Type 0)
           Fragment
           Destination Options
           Authentication
           Encapsulating Security Payload

Clearly with some (eg ESP and fragment) we do need to drop out of
using L4 header info in the hash.  And anyone using Routing (Type 0)
probably deserves anything they get.  But should we at least comment
on the limitation that the existence of the other headers also causes
fallback to L3 hashing only? (And possibly even include in
documentation?)  Or of course, fix?

>         if (skb->protocol == htons(ETH_P_IP)) {
> +               iph = ip_hdr(skb);
>                 if (!ip_is_fragment(iph) &&
>                     (iph->protocol == IPPROTO_TCP ||
>                      iph->protocol == IPPROTO_UDP)) {
> +                       if (iph->ihl * sizeof(u32) + sizeof(__be16) * 2 >
> +                           skb_headlen(skb) - skb_network_offset(skb))
> +                               goto short_header;
> +                       layer4hdr = (__be16 *)((u32 *)iph + iph->ihl);
>                         layer4_xor = ntohs((*layer4hdr ^ *(layer4hdr + 1)));
> +               } else if (skb_network_header_len(skb) < sizeof(struct iphdr)) {
> +                       goto short_header;
>                 }

I don't know the assertions we can make about
skb_network_header_len(skb), but it looks odd doing a length check
against sizeof(iphdr) after iph->protocol has already been
dereferenced. Is this really right?  (The pattern recurs a few times.)

Regards,

Jeremy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 6b1c711..834c919 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -752,12 +752,23 @@  xmit_hash_policy
 		protocol information to generate the hash.
 
 		Uses XOR of hardware MAC addresses and IP addresses to
-		generate the hash.  The formula is
+		generate the hash.  The IPv4 formula is
 
 		(((source IP XOR dest IP) AND 0xffff) XOR
 			( source MAC XOR destination MAC ))
 				modulo slave count
 
+		The IPv6 formula is
+
+		hash =
+			(source ip quad 2 XOR dest IP quad 2) XOR
+			(source ip quad 3 XOR dest IP quad 3) XOR
+			(source ip quad 4 XOR dest IP quad 4)
+
+		(((hash >> 24) XOR (hash >> 16) XOR (hash >> 8) XOR hash)
+			XOR (source MAC XOR destination MAC))
+				modulo slave count
+
 		This algorithm will place all traffic to a particular
 		network peer on the same slave.  For non-IP traffic,
 		the formula is the same as for the layer2 transmit
@@ -778,19 +789,30 @@  xmit_hash_policy
 		slaves, although a single connection will not span
 		multiple slaves.
 
-		The formula for unfragmented TCP and UDP packets is
+		The formula for unfragmented IPv4 TCP and UDP packets is
 
 		((source port XOR dest port) XOR
 			 ((source IP XOR dest IP) AND 0xffff)
 				modulo slave count
 
-		For fragmented TCP or UDP packets and all other IP
-		protocol traffic, the source and destination port
+		The formula for unfragmented IPv6 TCP and UDP packets is
+
+		hash =
+			(source ip quad 2 XOR dest IP quad 2) XOR
+			(source ip quad 3 XOR dest IP quad 3) XOR
+			(source ip quad 4 XOR dest IP quad 4)
+
+		((source port XOR dest port) XOR
+			(hash >> 24) XOR (hash >> 16) XOR (hash >> 8) XOR hash)
+				modulo slave count
+
+		For fragmented TCP or UDP packets and all other IPv4 and
+		IPv6 protocol traffic, the source and destination port
 		information is omitted.  For non-IP traffic, the
 		formula is the same as for the layer2 transmit hash
 		policy.
 
-		This policy is intended to mimic the behavior of
+		The IPv4 policy is intended to mimic the behavior of
 		certain switches, notably Cisco switches with PFC2 as
 		well as some Foundry and IBM products.
 
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index d95fbc3..4dfe7e3 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3354,56 +3354,99 @@  static struct notifier_block bond_netdev_notifier = {
 /*---------------------------- Hashing Policies -----------------------------*/
 
 /*
+ * Hash for the output device based upon layer 2 data
+ */
+static int bond_xmit_hash_policy_l2(struct sk_buff *skb, int count)
+{
+	struct ethhdr *data = (struct ethhdr *)skb->data;
+
+	if (skb_headlen(skb) >= offsetof(struct ethhdr, h_proto))
+		return (data->h_dest[5] ^ data->h_source[5]) % count;
+
+	return 0;
+}
+
+/*
  * Hash for the output device based upon layer 2 and layer 3 data. If
- * the packet is not IP mimic bond_xmit_hash_policy_l2()
+ * the packet is not IP, fall back on bond_xmit_hash_policy_l2()
  */
 static int bond_xmit_hash_policy_l23(struct sk_buff *skb, int count)
 {
 	struct ethhdr *data = (struct ethhdr *)skb->data;
-	struct iphdr *iph = ip_hdr(skb);
-
-	if (skb->protocol == htons(ETH_P_IP)) {
+	struct iphdr *iph;
+	struct ipv6hdr *ipv6h;
+	u32 v6hash;
+	__be32 *s, *d;
+
+	if (skb->protocol == htons(ETH_P_IP) &&
+		skb_network_header_len(skb) >= sizeof(struct iphdr)) {
+		iph = ip_hdr(skb);
 		return ((ntohl(iph->saddr ^ iph->daddr) & 0xffff) ^
 			(data->h_dest[5] ^ data->h_source[5])) % count;
+	} else if (skb->protocol == htons(ETH_P_IPV6) &&
+		skb_network_header_len(skb) >= sizeof(struct ipv6hdr)) {
+		ipv6h = ipv6_hdr(skb);
+		s = &ipv6h->saddr.s6_addr32[0];
+		d = &ipv6h->daddr.s6_addr32[0];
+		v6hash = (s[1] ^ d[1]) ^ (s[2] ^ d[2]) ^ (s[3] ^ d[3]);
+		v6hash ^= (v6hash >> 24) ^ (v6hash >> 16) ^ (v6hash >> 8);
+		return (v6hash ^ data->h_dest[5] ^ data->h_source[5]) % count;
 	}
 
-	return (data->h_dest[5] ^ data->h_source[5]) % count;
+	return bond_xmit_hash_policy_l2(skb, count);
 }
 
 /*
  * Hash for the output device based upon layer 3 and layer 4 data. If
  * the packet is a frag or not TCP or UDP, just use layer 3 data.  If it is
- * altogether not IP, mimic bond_xmit_hash_policy_l2()
+ * altogether not IP, fall back on bond_xmit_hash_policy_l2()
  */
 static int bond_xmit_hash_policy_l34(struct sk_buff *skb, int count)
 {
-	struct ethhdr *data = (struct ethhdr *)skb->data;
-	struct iphdr *iph = ip_hdr(skb);
-	__be16 *layer4hdr = (__be16 *)((u32 *)iph + iph->ihl);
-	int layer4_xor = 0;
+	u32 layer4_xor = 0;
+	struct iphdr *iph;
+	struct ipv6hdr *ipv6h;
+	__be32 *s, *d;
+	__be16 *layer4hdr;
 
 	if (skb->protocol == htons(ETH_P_IP)) {
+		iph = ip_hdr(skb);
 		if (!ip_is_fragment(iph) &&
 		    (iph->protocol == IPPROTO_TCP ||
 		     iph->protocol == IPPROTO_UDP)) {
+			if (iph->ihl * sizeof(u32) + sizeof(__be16) * 2 >
+			    skb_headlen(skb) - skb_network_offset(skb))
+				goto short_header;
+			layer4hdr = (__be16 *)((u32 *)iph + iph->ihl);
 			layer4_xor = ntohs((*layer4hdr ^ *(layer4hdr + 1)));
+		} else if (skb_network_header_len(skb) < sizeof(struct iphdr)) {
+			goto short_header;
 		}
 		return (layer4_xor ^
 			((ntohl(iph->saddr ^ iph->daddr)) & 0xffff)) % count;
-
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		ipv6h = ipv6_hdr(skb);
+		if (ipv6h->nexthdr == IPPROTO_TCP ||
+		    ipv6h->nexthdr == IPPROTO_UDP) {
+			layer4hdr = (__be16 *)(ipv6h + 1);
+			if (sizeof(struct ipv6hdr) + sizeof(__be16) * 2 >
+				skb_headlen(skb) - skb_network_offset(skb))
+				goto short_header;
+			layer4_xor = (*layer4hdr ^ *(layer4hdr + 1));
+		} else if (skb_network_header_len(skb) <
+			   sizeof(struct ipv6hdr)) {
+			goto short_header;
+		}
+		s = &ipv6h->saddr.s6_addr32[0];
+		d = &ipv6h->daddr.s6_addr32[0];
+		layer4_xor ^= (s[1] ^ d[1]) ^ (s[2] ^ d[2]) ^ (s[3] ^ d[3]);
+		layer4_xor ^= (layer4_xor >> 24) ^ (layer4_xor >> 16) ^
+			       (layer4_xor >> 8);
+		return layer4_xor % count;
 	}
 
-	return (data->h_dest[5] ^ data->h_source[5]) % count;
-}
-
-/*
- * Hash for the output device based upon layer 2 data
- */
-static int bond_xmit_hash_policy_l2(struct sk_buff *skb, int count)
-{
-	struct ethhdr *data = (struct ethhdr *)skb->data;
-
-	return (data->h_dest[5] ^ data->h_source[5]) % count;
+short_header:
+	return bond_xmit_hash_policy_l2(skb, count);
 }
 
 /*-------------------------- Device entry points ----------------------------*/