diff mbox

[net-next,v3] net: ipv4: add support for ECMP hash policy choice

Message ID 1489505775-2913-1-git-send-email-nikolay@cumulusnetworks.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Nikolay Aleksandrov March 14, 2017, 3:36 p.m. UTC
This patch adds support for ECMP hash policy choice via a new sysctl
called fib_multipath_hash_policy and also adds support for L4 hashes.
The current values for fib_multipath_hash_policy are:
 0 - layer 3 (default)
 1 - layer 4
If there's an skb hash already set and it matches the chosen policy then it
will be used instead of being calculated (currently only for L4).
In L3 mode we always calculate the hash due to the ICMP error special
case, the flow dissector's field consistentification should handle the
address order thus we can remove the address reversals.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
v3:
 - keep the ICMP error special handling and always calc L3 hash
   Jakub, could you please run your tests with this version ?

v2:
 - removed the output_key_hash as it's not needed anymore
 - reverted to my original/internal patch with L3 as default hash

 Documentation/networking/ip-sysctl.txt |  8 +++
 include/net/ip_fib.h                   | 14 ++----
 include/net/netns/ipv4.h               |  1 +
 include/net/route.h                    |  9 +---
 net/ipv4/fib_semantics.c               | 11 ++--
 net/ipv4/icmp.c                        | 19 +------
 net/ipv4/route.c                       | 92 ++++++++++++++++++++++++++--------
 net/ipv4/sysctl_net_ipv4.c             |  9 ++++
 8 files changed, 98 insertions(+), 65 deletions(-)

Comments

Stephen Hemminger March 14, 2017, 3:55 p.m. UTC | #1
On Tue, 14 Mar 2017 17:36:15 +0200
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:

> This patch adds support for ECMP hash policy choice via a new sysctl
> called fib_multipath_hash_policy and also adds support for L4 hashes.
> The current values for fib_multipath_hash_policy are:
>  0 - layer 3 (default)
>  1 - layer 4
> If there's an skb hash already set and it matches the chosen policy then it
> will be used instead of being calculated (currently only for L4).
> In L3 mode we always calculate the hash due to the ICMP error special
> case, the flow dissector's field consistentification should handle the
> address order thus we can remove the address reversals.
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

It is good to see ECMP come back from the grave.
Linux used to support it long ago but was abandoned after it was unstable
and removed from iproute2 in 2012.

The old API was through route attributes which makes more sense than
doing it with sysctl. It makes more sense to use netlink instead.
Therefore please go back and do something like the old API rather than doing it through
sysctl.
Nikolay Aleksandrov March 14, 2017, 3:58 p.m. UTC | #2
On 14/03/17 17:55, Stephen Hemminger wrote:
> On Tue, 14 Mar 2017 17:36:15 +0200
> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> 
>> This patch adds support for ECMP hash policy choice via a new sysctl
>> called fib_multipath_hash_policy and also adds support for L4 hashes.
>> The current values for fib_multipath_hash_policy are:
>>  0 - layer 3 (default)
>>  1 - layer 4
>> If there's an skb hash already set and it matches the chosen policy then it
>> will be used instead of being calculated (currently only for L4).
>> In L3 mode we always calculate the hash due to the ICMP error special
>> case, the flow dissector's field consistentification should handle the
>> address order thus we can remove the address reversals.
>>
>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> 
> It is good to see ECMP come back from the grave.
> Linux used to support it long ago but was abandoned after it was unstable
> and removed from iproute2 in 2012.
> 
> The old API was through route attributes which makes more sense than
> doing it with sysctl. It makes more sense to use netlink instead.
> Therefore please go back and do something like the old API rather than doing it through
> sysctl.
> 

That's what my initial version did, but this was discussed during NetConf in Seville
and it was decided that it's best to make a global sysctl, thus the change.

Thanks,
 Nik
David Miller March 14, 2017, 6:48 p.m. UTC | #3
From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Date: Tue, 14 Mar 2017 17:58:46 +0200

> On 14/03/17 17:55, Stephen Hemminger wrote:
>> On Tue, 14 Mar 2017 17:36:15 +0200
>> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>> 
>>> This patch adds support for ECMP hash policy choice via a new sysctl
>>> called fib_multipath_hash_policy and also adds support for L4 hashes.
>>> The current values for fib_multipath_hash_policy are:
>>>  0 - layer 3 (default)
>>>  1 - layer 4
>>> If there's an skb hash already set and it matches the chosen policy then it
>>> will be used instead of being calculated (currently only for L4).
>>> In L3 mode we always calculate the hash due to the ICMP error special
>>> case, the flow dissector's field consistentification should handle the
>>> address order thus we can remove the address reversals.
>>>
>>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>> 
>> It is good to see ECMP come back from the grave.
>> Linux used to support it long ago but was abandoned after it was unstable
>> and removed from iproute2 in 2012.
>> 
>> The old API was through route attributes which makes more sense than
>> doing it with sysctl. It makes more sense to use netlink instead.
>> Therefore please go back and do something like the old API rather than doing it through
>> sysctl.
>> 
> 
> That's what my initial version did, but this was discussed during NetConf in Seville
> and it was decided that it's best to make a global sysctl, thus the change.

Correct, we discussed this, and we all agreed to only have a sysctl for now.
Nikolay Aleksandrov March 14, 2017, 6:55 p.m. UTC | #4
> On Mar 14, 2017, at 5:36 PM, Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> 
> This patch adds support for ECMP hash policy choice via a new sysctl
> called fib_multipath_hash_policy and also adds support for L4 hashes.
> The current values for fib_multipath_hash_policy are:
> 0 - layer 3 (default)
> 1 - layer 4
> If there's an skb hash already set and it matches the chosen policy then it
> will be used instead of being calculated (currently only for L4).
> In L3 mode we always calculate the hash due to the ICMP error special
> case, the flow dissector's field consistentification should handle the
> address order thus we can remove the address reversals.
> 
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> ---
> v3:
> - keep the ICMP error special handling and always calc L3 hash
>   Jakub, could you please run your tests with this version ?
> 
> v2:
> - removed the output_key_hash as it's not needed anymore
> - reverted to my original/internal patch with L3 as default hash
> 
> Documentation/networking/ip-sysctl.txt |  8 +++
> include/net/ip_fib.h                   | 14 ++----
> include/net/netns/ipv4.h               |  1 +
> include/net/route.h                    |  9 +---
> net/ipv4/fib_semantics.c               | 11 ++--
> net/ipv4/icmp.c                        | 19 +------
> net/ipv4/route.c                       | 92 ++++++++++++++++++++++++++--------
> net/ipv4/sysctl_net_ipv4.c             |  9 ++++
> 8 files changed, 98 insertions(+), 65 deletions(-)
> 
[snip]
> /* To make ICMP packets follow the right flow, the multipath hash is
> - * calculated from the inner IP addresses in reverse order.
> + * calculated from the inner IP addresses.
>  */
> -static int ip_multipath_icmp_hash(struct sk_buff *skb)
> +static void ip_multipath_icmp_hash(const struct sk_buff *skb,
> +				   struct flowi4 *fl4)
> {
> 	const struct iphdr *outer_iph = ip_hdr(skb);
> 	struct icmphdr _icmph;
> @@ -1746,33 +1746,85 @@ static int ip_multipath_icmp_hash(struct sk_buff *skb)
> 	struct iphdr _inner_iph;
> 	const struct iphdr *inner_iph;
> 
> +	fl4->saddr = outer_iph->saddr;
> +	fl4->daddr = outer_iph->daddr;
> 	if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0))
> -		goto standard_hash;
> +		return;
> 
> 	icmph = skb_header_pointer(skb, outer_iph->ihl * 4, sizeof(_icmph),
> 				   &_icmph);
> 	if (!icmph)
> -		goto standard_hash;
> +		return;
> 
> 	if (icmph->type != ICMP_DEST_UNREACH &&
> 	    icmph->type != ICMP_REDIRECT &&
> 	    icmph->type != ICMP_TIME_EXCEEDED &&
> -	    icmph->type != ICMP_PARAMETERPROB) {
> -		goto standard_hash;
> -	}
> +	    icmph->type != ICMP_PARAMETERPROB)
> +		return;
> 
> 	inner_iph = skb_header_pointer(skb,
> 				       outer_iph->ihl * 4 + sizeof(_icmph),
> 				       sizeof(_inner_iph), &_inner_iph);
> 	if (!inner_iph)
> -		goto standard_hash;
> +		return;
> +	fl4->saddr = inner_iph->saddr;
> +	fl4->daddr = inner_iph->daddr;
> +}
> 
> -	return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr);
> +int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
> +		       const struct sk_buff *skb)
> +{
> +	struct net *net = fi->fib_net;
> +	struct flow_keys hash_keys;
> +	u32 mhash;
> 
> -standard_hash:
> -	return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr);
> -}
> +	switch (net->ipv4.sysctl_fib_multipath_hash_policy) {
> +	case 0:
> +		memset(&hash_keys, 0, sizeof(hash_keys));
> +		hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
> +		if (skb && ip_hdr(skb)->protocol == IPPROTO_ICMP) {
> +			struct flowi4 _fl4;
> 
> +			ip_multipath_icmp_hash(skb, &_fl4);

Ugh, obviously I could’ve just passed hash_keys here, will  change this in v4 but I’ll wait
to see if there aren’t any other comments or issues.

Thanks,
 Nik

> +			hash_keys.addrs.v4addrs.src = _fl4.saddr;
> +			hash_keys.addrs.v4addrs.dst = _fl4.daddr;
> +		} else {
> +			hash_keys.addrs.v4addrs.src = fl4->saddr;
> +			hash_keys.addrs.v4addrs.dst = fl4->daddr;
> +		}
> +		break;
> +	case 1:
> +		/* skb is currently provided only when forwarding */
> +		if (skb) {
> +			unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP;
> +			struct flow_keys keys;
> +
> +			/* short-circuit if we already have L4 hash present */
> +			if (skb->l4_hash)
> +				return skb_get_hash_raw(skb) >> 1;
> +			memset(&hash_keys, 0, sizeof(hash_keys));
> +			skb_flow_dissect_flow_keys(skb, &keys, flag);
> +			hash_keys.addrs.v4addrs.src = keys.addrs.v4addrs.src;
> +			hash_keys.addrs.v4addrs.dst = keys.addrs.v4addrs.dst;
> +			hash_keys.ports.src = keys.ports.src;
> +			hash_keys.ports.dst = keys.ports.dst;
> +			hash_keys.basic.ip_proto = keys.basic.ip_proto;
> +		} else {
> +			memset(&hash_keys, 0, sizeof(hash_keys));
> +			hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
> +			hash_keys.addrs.v4addrs.src = fl4->saddr;
> +			hash_keys.addrs.v4addrs.dst = fl4->daddr;
> +			hash_keys.ports.src = fl4->fl4_sport;
> +			hash_keys.ports.dst = fl4->fl4_dport;
> +			hash_keys.basic.ip_proto = fl4->flowi4_proto;
> +		}
> +		break;
> +	}
> +	mhash = flow_hash_from_keys(&hash_keys);
> +
> +	return mhash >> 1;
> +}
> +EXPORT_SYMBOL_GPL(fib_multipath_hash);
> #endif /* CONFIG_IP_ROUTE_MULTIPATH */
Stephen Hemminger March 14, 2017, 8:25 p.m. UTC | #5
On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> Date: Tue, 14 Mar 2017 17:58:46 +0200
> 
> > On 14/03/17 17:55, Stephen Hemminger wrote:  
> >> On Tue, 14 Mar 2017 17:36:15 +0200
> >> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> >>   
> >>> This patch adds support for ECMP hash policy choice via a new sysctl
> >>> called fib_multipath_hash_policy and also adds support for L4 hashes.
> >>> The current values for fib_multipath_hash_policy are:
> >>>  0 - layer 3 (default)
> >>>  1 - layer 4
> >>> If there's an skb hash already set and it matches the chosen policy then it
> >>> will be used instead of being calculated (currently only for L4).
> >>> In L3 mode we always calculate the hash due to the ICMP error special
> >>> case, the flow dissector's field consistentification should handle the
> >>> address order thus we can remove the address reversals.
> >>>
> >>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>  
> >> 
> >> It is good to see ECMP come back from the grave.
> >> Linux used to support it long ago but was abandoned after it was unstable
> >> and removed from iproute2 in 2012.
> >> 
> >> The old API was through route attributes which makes more sense than
> >> doing it with sysctl. It makes more sense to use netlink instead.
> >> Therefore please go back and do something like the old API rather than doing it through
> >> sysctl.
> >>   
> > 
> > That's what my initial version did, but this was discussed during NetConf in Seville
> > and it was decided that it's best to make a global sysctl, thus the change.  
> 
> Correct, we discussed this, and we all agreed to only have a sysctl for now.

Why? If you are going to have private discussions please post the rationale
in public.
Roopa Prabhu March 14, 2017, 9:10 p.m. UTC | #6
On Tue, Mar 14, 2017 at 1:25 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
> David Miller <davem@davemloft.net> wrote:
>
>> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>> Date: Tue, 14 Mar 2017 17:58:46 +0200
>>
>> > On 14/03/17 17:55, Stephen Hemminger wrote:
>> >> On Tue, 14 Mar 2017 17:36:15 +0200
>> >> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>> >>
>> >>> This patch adds support for ECMP hash policy choice via a new sysctl
>> >>> called fib_multipath_hash_policy and also adds support for L4 hashes.
>> >>> The current values for fib_multipath_hash_policy are:
>> >>>  0 - layer 3 (default)
>> >>>  1 - layer 4
>> >>> If there's an skb hash already set and it matches the chosen policy then it
>> >>> will be used instead of being calculated (currently only for L4).
>> >>> In L3 mode we always calculate the hash due to the ICMP error special
>> >>> case, the flow dissector's field consistentification should handle the
>> >>> address order thus we can remove the address reversals.
>> >>>
>> >>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>> >>
>> >> It is good to see ECMP come back from the grave.
>> >> Linux used to support it long ago but was abandoned after it was unstable
>> >> and removed from iproute2 in 2012.
>> >>
>> >> The old API was through route attributes which makes more sense than
>> >> doing it with sysctl. It makes more sense to use netlink instead.
>> >> Therefore please go back and do something like the old API rather than doing it through
>> >> sysctl.
>> >>
>> >
>> > That's what my initial version did, but this was discussed during NetConf in Seville
>> > and it was decided that it's best to make a global sysctl, thus the change.
>>
>> Correct, we discussed this, and we all agreed to only have a sysctl for now.
>
> Why? If you are going to have private discussions please post the rationale
> in public.

Stephen, is there any reason to have a per ecmp route multipath algo
selection ?.
All platforms have a global multipath selection algo. I also don't see
routing daemons ready or willing to specify a per ecmp route multipath
selection algo attribute.
Stephen Hemminger March 14, 2017, 9:42 p.m. UTC | #7
On Tue, 14 Mar 2017 14:10:22 -0700
Roopa Prabhu <roopa@cumulusnetworks.com> wrote:

> On Tue, Mar 14, 2017 at 1:25 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> > On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
> > David Miller <davem@davemloft.net> wrote:
> >  
> >> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> >> Date: Tue, 14 Mar 2017 17:58:46 +0200
> >>  
> >> > On 14/03/17 17:55, Stephen Hemminger wrote:  
> >> >> On Tue, 14 Mar 2017 17:36:15 +0200
> >> >> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
> >> >>  
> >> >>> This patch adds support for ECMP hash policy choice via a new sysctl
> >> >>> called fib_multipath_hash_policy and also adds support for L4 hashes.
> >> >>> The current values for fib_multipath_hash_policy are:
> >> >>>  0 - layer 3 (default)
> >> >>>  1 - layer 4
> >> >>> If there's an skb hash already set and it matches the chosen policy then it
> >> >>> will be used instead of being calculated (currently only for L4).
> >> >>> In L3 mode we always calculate the hash due to the ICMP error special
> >> >>> case, the flow dissector's field consistentification should handle the
> >> >>> address order thus we can remove the address reversals.
> >> >>>
> >> >>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>  
> >> >>
> >> >> It is good to see ECMP come back from the grave.
> >> >> Linux used to support it long ago but was abandoned after it was unstable
> >> >> and removed from iproute2 in 2012.
> >> >>
> >> >> The old API was through route attributes which makes more sense than
> >> >> doing it with sysctl. It makes more sense to use netlink instead.
> >> >> Therefore please go back and do something like the old API rather than doing it through
> >> >> sysctl.
> >> >>  
> >> >
> >> > That's what my initial version did, but this was discussed during NetConf in Seville
> >> > and it was decided that it's best to make a global sysctl, thus the change.  
> >>
> >> Correct, we discussed this, and we all agreed to only have a sysctl for now.  
> >
> > Why? If you are going to have private discussions please post the rationale
> > in public.  
> 
> Stephen, is there any reason to have a per ecmp route multipath algo
> selection ?.
> All platforms have a global multipath selection algo. I also don't see
> routing daemons ready or willing to specify a per ecmp route multipath
> selection algo attribute.

There is no compelling reason to make the attribute per route. But the
issue is more that configuration through sysctl's is problematic. It doesn't
fit into the standard API paradigm. Sysctl's are like routing patches not
part of the real CLI. Trying to trap sysctl's for things like switchedev
offload is particularly problematic. I can see the case for either way,
and don't have a fixed opinion.

The bigger discussion is trying to keep a record of the rationale for decisions
such that there isn't buried tribal knowledge. This is why Dave has always been
quite insistent on having discussions on the mailing list. There doesn't seem to
be a good long term record other than Documentation/networking or commit logs.
Roopa Prabhu March 14, 2017, 10:38 p.m. UTC | #8
On Tue, Mar 14, 2017 at 2:42 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 14 Mar 2017 14:10:22 -0700
> Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>
>> On Tue, Mar 14, 2017 at 1:25 PM, Stephen Hemminger
>> <stephen@networkplumber.org> wrote:
>> > On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
>> > David Miller <davem@davemloft.net> wrote:
>> >
>> >> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

[snip]


>> >> > That's what my initial version did, but this was discussed during NetConf in Seville
>> >> > and it was decided that it's best to make a global sysctl, thus the change.
>> >>
>> >> Correct, we discussed this, and we all agreed to only have a sysctl for now.
>> >
>> > Why? If you are going to have private discussions please post the rationale
>> > in public.
>>
>> Stephen, is there any reason to have a per ecmp route multipath algo
>> selection ?.
>> All platforms have a global multipath selection algo. I also don't see
>> routing daemons ready or willing to specify a per ecmp route multipath
>> selection algo attribute.
>
> There is no compelling reason to make the attribute per route. But the
> issue is more that configuration through sysctl's is problematic. It doesn't
> fit into the standard API paradigm. Sysctl's are like routing patches not
> part of the real CLI. Trying to trap sysctl's for things like switchedev
> offload is particularly problematic. I can see the case for either way,
> and don't have a fixed opinion.

ok. understand the switchdev offload part. It was that way in the past...but
today you can listen to sysctl updates on the netconf netlink channel.
it works pretty well.

>
> The bigger discussion is trying to keep a record of the rationale for decisions
> such that there isn't buried tribal knowledge. This is why Dave has always been
> quite insistent on having discussions on the mailing list. There doesn't seem to
> be a good long term record other than Documentation/networking or commit logs.
>

agree. Most of the discussion around this has happened on the netdev
mailing list so far.
The previous set that updated ecmp algo tried to add a per route
attribute and after review on this list was moved to
on by default. Nikolay's first version included a per route
attribute...to follow the previous ecmp work.
Before Nikolay posted the second version, my feedback was to make it
global. And this feedback was only re-iterated at
last netconf/netdev. so, we can continue the discussion here if there
are other opinions.
Stephen Hemminger March 14, 2017, 11:27 p.m. UTC | #9
On Tue, 14 Mar 2017 15:38:40 -0700
Roopa Prabhu <roopa@cumulusnetworks.com> wrote:

> >> >> > That's what my initial version did, but this was discussed during NetConf in Seville
> >> >> > and it was decided that it's best to make a global sysctl, thus the change.  
> >> >>
> >> >> Correct, we discussed this, and we all agreed to only have a sysctl for now.  
> >> >
> >> > Why? If you are going to have private discussions please post the rationale
> >> > in public.  
> >>
> >> Stephen, is there any reason to have a per ecmp route multipath algo
> >> selection ?.
> >> All platforms have a global multipath selection algo. I also don't see
> >> routing daemons ready or willing to specify a per ecmp route multipath
> >> selection algo attribute.  
> >
> > There is no compelling reason to make the attribute per route. But the
> > issue is more that configuration through sysctl's is problematic. It doesn't
> > fit into the standard API paradigm. Sysctl's are like routing patches not
> > part of the real CLI. Trying to trap sysctl's for things like switchedev
> > offload is particularly problematic. I can see the case for either way,
> > and don't have a fixed opinion.  
> 
> ok. understand the switchdev offload part. It was that way in the past...but
> today you can listen to sysctl updates on the netconf netlink channel.
> it works pretty well.

Is there another patch to add the NETCONFA_ECMP support?
David Ahern March 14, 2017, 11:45 p.m. UTC | #10
On 3/14/17 5:27 PM, Stephen Hemminger wrote:
> On Tue, 14 Mar 2017 15:38:40 -0700
> Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> 
>>>>>>> That's what my initial version did, but this was discussed during NetConf in Seville
>>>>>>> and it was decided that it's best to make a global sysctl, thus the change.  
>>>>>>
>>>>>> Correct, we discussed this, and we all agreed to only have a sysctl for now.  
>>>>>
>>>>> Why? If you are going to have private discussions please post the rationale
>>>>> in public.  
>>>>
>>>> Stephen, is there any reason to have a per ecmp route multipath algo
>>>> selection ?.
>>>> All platforms have a global multipath selection algo. I also don't see
>>>> routing daemons ready or willing to specify a per ecmp route multipath
>>>> selection algo attribute.  
>>>
>>> There is no compelling reason to make the attribute per route. But the
>>> issue is more that configuration through sysctl's is problematic. It doesn't
>>> fit into the standard API paradigm. Sysctl's are like routing patches not
>>> part of the real CLI. Trying to trap sysctl's for things like switchedev
>>> offload is particularly problematic. I can see the case for either way,
>>> and don't have a fixed opinion.  
>>
>> ok. understand the switchdev offload part. It was that way in the past...but
>> today you can listen to sysctl updates on the netconf netlink channel.
>> it works pretty well.
> 
> Is there another patch to add the NETCONFA_ECMP support?
> 

does userspace care?

switchdev uses notifiers for in kernel notification of FIB changes
(routes and rules).
David Miller March 15, 2017, 12:24 a.m. UTC | #11
From: Stephen Hemminger <stephen@networkplumber.org>
Date: Tue, 14 Mar 2017 13:25:06 -0700

> On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
> David Miller <davem@davemloft.net> wrote:
> 
>> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>> Date: Tue, 14 Mar 2017 17:58:46 +0200
>> 
>> > On 14/03/17 17:55, Stephen Hemminger wrote:  
>> >> On Tue, 14 Mar 2017 17:36:15 +0200
>> >> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>> >>   
>> >>> This patch adds support for ECMP hash policy choice via a new sysctl
>> >>> called fib_multipath_hash_policy and also adds support for L4 hashes.
>> >>> The current values for fib_multipath_hash_policy are:
>> >>>  0 - layer 3 (default)
>> >>>  1 - layer 4
>> >>> If there's an skb hash already set and it matches the chosen policy then it
>> >>> will be used instead of being calculated (currently only for L4).
>> >>> In L3 mode we always calculate the hash due to the ICMP error special
>> >>> case, the flow dissector's field consistentification should handle the
>> >>> address order thus we can remove the address reversals.
>> >>>
>> >>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>  
>> >> 
>> >> It is good to see ECMP come back from the grave.
>> >> Linux used to support it long ago but was abandoned after it was unstable
>> >> and removed from iproute2 in 2012.
>> >> 
>> >> The old API was through route attributes which makes more sense than
>> >> doing it with sysctl. It makes more sense to use netlink instead.
>> >> Therefore please go back and do something like the old API rather than doing it through
>> >> sysctl.
>> >>   
>> > 
>> > That's what my initial version did, but this was discussed during NetConf in Seville
>> > and it was decided that it's best to make a global sysctl, thus the change.  
>> 
>> Correct, we discussed this, and we all agreed to only have a sysctl for now.
> 
> Why? If you are going to have private discussions please post the rationale
> in public.

The idea is that we couldn't come up with an immediate use case, and if one
came up we could easily add the per-route or per-fib-table attribute.

Most people want the entire system to behave a certain way wrt. ECMP, rather
than have fine granularity.  For example, the case being discussed here is
to simply have software's behavior match that of hardware offloads.

We shouldn't add things until there is a real demonstrated and requested need.
Tom Herbert March 15, 2017, 2:30 a.m. UTC | #12
On Tue, Mar 14, 2017 at 5:24 PM, David Miller <davem@davemloft.net> wrote:
> From: Stephen Hemminger <stephen@networkplumber.org>
> Date: Tue, 14 Mar 2017 13:25:06 -0700
>
>> On Tue, 14 Mar 2017 11:48:37 -0700 (PDT)
>> David Miller <davem@davemloft.net> wrote:
>>
>>> From: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>>> Date: Tue, 14 Mar 2017 17:58:46 +0200
>>>
>>> > On 14/03/17 17:55, Stephen Hemminger wrote:
>>> >> On Tue, 14 Mar 2017 17:36:15 +0200
>>> >> Nikolay Aleksandrov <nikolay@cumulusnetworks.com> wrote:
>>> >>
>>> >>> This patch adds support for ECMP hash policy choice via a new sysctl
>>> >>> called fib_multipath_hash_policy and also adds support for L4 hashes.
>>> >>> The current values for fib_multipath_hash_policy are:
>>> >>>  0 - layer 3 (default)
>>> >>>  1 - layer 4
>>> >>> If there's an skb hash already set and it matches the chosen policy then it
>>> >>> will be used instead of being calculated (currently only for L4).
>>> >>> In L3 mode we always calculate the hash due to the ICMP error special
>>> >>> case, the flow dissector's field consistentification should handle the
>>> >>> address order thus we can remove the address reversals.
>>> >>>
>>> >>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>>> >>
>>> >> It is good to see ECMP come back from the grave.
>>> >> Linux used to support it long ago but was abandoned after it was unstable
>>> >> and removed from iproute2 in 2012.
>>> >>
>>> >> The old API was through route attributes which makes more sense than
>>> >> doing it with sysctl. It makes more sense to use netlink instead.
>>> >> Therefore please go back and do something like the old API rather than doing it through
>>> >> sysctl.
>>> >>
>>> >
>>> > That's what my initial version did, but this was discussed during NetConf in Seville
>>> > and it was decided that it's best to make a global sysctl, thus the change.
>>>
>>> Correct, we discussed this, and we all agreed to only have a sysctl for now.
>>
>> Why? If you are going to have private discussions please post the rationale
>> in public.
>
> The idea is that we couldn't come up with an immediate use case, and if one
> came up we could easily add the per-route or per-fib-table attribute.
>
> Most people want the entire system to behave a certain way wrt. ECMP, rather
> than have fine granularity.  For example, the case being discussed here is
> to simply have software's behavior match that of hardware offloads.
>
Agreed, but then why do we even need any complexity here by that
argument? RSS is specifically defined to do 5-tuple hashing for TCP
(and UDP), and 3-tuple. No one has ever complained that doing per flow
RSS for TCP is bad thing AFAIK. We followed that same model for RPS,
RFS, and XPS via state in the connection context. The skb_hash is
often given to us for free, whereas in order to do a 3-tuple we have
to actually do more work and do at least an extra jhash. I suppose the
argument is probably that switches allow this configuration and
somehow we want to have feature parity, but it would be very
interesting to know if anyone is not doing per flow ECMP in real life
and why...

Tom
Nicolas Dichtel March 15, 2017, 9:17 a.m. UTC | #13
Le 15/03/2017 à 00:45, David Ahern a écrit :
> On 3/14/17 5:27 PM, Stephen Hemminger wrote:
>> On Tue, 14 Mar 2017 15:38:40 -0700
>> Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>>
>>>>>>>> That's what my initial version did, but this was discussed during NetConf in Seville
>>>>>>>> and it was decided that it's best to make a global sysctl, thus the change.  
>>>>>>>
>>>>>>> Correct, we discussed this, and we all agreed to only have a sysctl for now.  
>>>>>>
>>>>>> Why? If you are going to have private discussions please post the rationale
>>>>>> in public.  
>>>>>
>>>>> Stephen, is there any reason to have a per ecmp route multipath algo
>>>>> selection ?.
>>>>> All platforms have a global multipath selection algo. I also don't see
>>>>> routing daemons ready or willing to specify a per ecmp route multipath
>>>>> selection algo attribute.  
>>>>
>>>> There is no compelling reason to make the attribute per route. But the
>>>> issue is more that configuration through sysctl's is problematic. It doesn't
>>>> fit into the standard API paradigm. Sysctl's are like routing patches not
>>>> part of the real CLI. Trying to trap sysctl's for things like switchedev
>>>> offload is particularly problematic. I can see the case for either way,
>>>> and don't have a fixed opinion.  
>>>
>>> ok. understand the switchdev offload part. It was that way in the past...but
>>> today you can listen to sysctl updates on the netconf netlink channel.
>>> it works pretty well.
>>
>> Is there another patch to add the NETCONFA_ECMP support?
>>
> 
> does userspace care?
Yes, I think it is needed so that userspace can correctly monitor this behavior.
It also enables to check this parameter through netlink.


Regards,
Nicolas
Nikolay Aleksandrov March 15, 2017, 10:46 a.m. UTC | #14
> On Mar 15, 2017, at 11:17 AM, Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
> 
> Le 15/03/2017 à 00:45, David Ahern a écrit :
>> On 3/14/17 5:27 PM, Stephen Hemminger wrote:
>>> On Tue, 14 Mar 2017 15:38:40 -0700
>>> Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>>> 
>>>>>>>>> That's what my initial version did, but this was discussed during NetConf in Seville
>>>>>>>>> and it was decided that it's best to make a global sysctl, thus the change.  
>>>>>>>> 
>>>>>>>> Correct, we discussed this, and we all agreed to only have a sysctl for now.  
>>>>>>> 
>>>>>>> Why? If you are going to have private discussions please post the rationale
>>>>>>> in public.  
>>>>>> 
>>>>>> Stephen, is there any reason to have a per ecmp route multipath algo
>>>>>> selection ?.
>>>>>> All platforms have a global multipath selection algo. I also don't see
>>>>>> routing daemons ready or willing to specify a per ecmp route multipath
>>>>>> selection algo attribute.  
>>>>> 
>>>>> There is no compelling reason to make the attribute per route. But the
>>>>> issue is more that configuration through sysctl's is problematic. It doesn't
>>>>> fit into the standard API paradigm. Sysctl's are like routing patches not
>>>>> part of the real CLI. Trying to trap sysctl's for things like switchedev
>>>>> offload is particularly problematic. I can see the case for either way,
>>>>> and don't have a fixed opinion.  
>>>> 
>>>> ok. understand the switchdev offload part. It was that way in the past...but
>>>> today you can listen to sysctl updates on the netconf netlink channel.
>>>> it works pretty well.
>>> 
>>> Is there another patch to add the NETCONFA_ECMP support?
>>> 
>> 
>> does userspace care?
> Yes, I think it is needed so that userspace can correctly monitor this behavior.
> It also enables to check this parameter through netlink.
> 
> 
> Regards,
> Nicolas

This doesn’t fit the NETCONFA model well, there is no “all”, “default” or per iface option to be set, also
the other multipath sysctl which affects behaviour doesn’t have a notification.

Thanks,
 Nik
Nicolas Dichtel March 15, 2017, 11:18 a.m. UTC | #15
Le 15/03/2017 à 11:46, Nikolay Aleksandrov a écrit :
[snip]
> This doesn’t fit the NETCONFA model well, there is no “all”, “default” or per iface option to be set, also
> the other multipath sysctl which affects behaviour doesn’t have a notification.
Right. Maybe adding a "system-wide" option can help, but it's probably another
patch to add netconf support for other ecmp properties.


Thank you,
Nicolas
Nikolay Aleksandrov March 15, 2017, 11:27 a.m. UTC | #16
> On Mar 15, 2017, at 1:18 PM, Nicolas Dichtel <nicolas.dichtel@6wind.com> wrote:
> 
> Le 15/03/2017 à 11:46, Nikolay Aleksandrov a écrit :
> [snip]
>> This doesn’t fit the NETCONFA model well, there is no “all”, “default” or per iface option to be set, also
>> the other multipath sysctl which affects behaviour doesn’t have a notification.
> Right. Maybe adding a "system-wide" option can help, but it's probably another
> patch to add netconf support for other ecmp properties.
> 
> 
> Thank you,
> Nicolas

Yep, sounds good.
I’ll add the netlink support for system-wide NETCONFA and ecmp as follow up patches so we can
have a separate discussion on the design there.

Thanks,
 Nik
Jakub Sitnicki March 15, 2017, 11:32 a.m. UTC | #17
On Tue, Mar 14, 2017 at 03:36 PM GMT, Nikolay Aleksandrov wrote:
> This patch adds support for ECMP hash policy choice via a new sysctl
> called fib_multipath_hash_policy and also adds support for L4 hashes.
> The current values for fib_multipath_hash_policy are:
>  0 - layer 3 (default)
>  1 - layer 4
> If there's an skb hash already set and it matches the chosen policy then it
> will be used instead of being calculated (currently only for L4).
> In L3 mode we always calculate the hash due to the ICMP error special
> case, the flow dissector's field consistentification should handle the
> address order thus we can remove the address reversals.
>
> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> ---
> v3:
>  - keep the ICMP error special handling and always calc L3 hash
>    Jakub, could you please run your tests with this version ?
>
> v2:
>  - removed the output_key_hash as it's not needed anymore
>  - reverted to my original/internal patch with L3 as default hash
>
>  Documentation/networking/ip-sysctl.txt |  8 +++
>  include/net/ip_fib.h                   | 14 ++----
>  include/net/netns/ipv4.h               |  1 +
>  include/net/route.h                    |  9 +---
>  net/ipv4/fib_semantics.c               | 11 ++--
>  net/ipv4/icmp.c                        | 19 +------
>  net/ipv4/route.c                       | 92 ++++++++++++++++++++++++++--------
>  net/ipv4/sysctl_net_ipv4.c             |  9 ++++
>  8 files changed, 98 insertions(+), 65 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index fc73eeb7b3b8..558e19653106 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -73,6 +73,14 @@ fib_multipath_use_neigh - BOOLEAN
>  	0 - disabled
>  	1 - enabled
>  
> +fib_multipath_hash_policy - INTEGER
> +	Controls which hash policy to use for multipath routes. Only valid
> +	for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled.
> +	Default: 0 (Layer 3)
> +	Possible values:
> +	0 - Layer 3
> +	1 - Layer 4
> +
>  route/max_size - INTEGER
>  	Maximum number of routes allowed in the kernel.  Increase
>  	this when using large numbers of interfaces and/or routes.
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index d9cee9659978..8c608a17bd9a 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -383,17 +383,13 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force);
>  int fib_sync_down_addr(struct net_device *dev, __be32 local);
>  int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
>  
> -extern u32 fib_multipath_secret __read_mostly;
> -
> -static inline int fib_multipath_hash(__be32 saddr, __be32 daddr)
> -{
> -	return jhash_2words((__force u32)saddr, (__force u32)daddr,
> -			    fib_multipath_secret) >> 1;
> -}
> -
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> +int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
> +		       const struct sk_buff *skb);
> +#endif
>  void fib_select_multipath(struct fib_result *res, int hash);
>  void fib_select_path(struct net *net, struct fib_result *res,
> -		     struct flowi4 *fl4, int mp_hash);
> +		     struct flowi4 *fl4);
>  
>  /* Exported by fib_trie.c */
>  void fib_trie_init(void);
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 622d2da27135..70a1d4251790 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -152,6 +152,7 @@ struct netns_ipv4 {
>  #endif
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>  	int sysctl_fib_multipath_use_neigh;
> +	int sysctl_fib_multipath_hash_policy;
>  #endif
>  
>  	unsigned int	fib_seq;	/* protected by rtnl_mutex */
> diff --git a/include/net/route.h b/include/net/route.h
> index c0874c87c173..32412c91c222 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -113,14 +113,7 @@ struct in_device;
>  int ip_rt_init(void);
>  void rt_cache_flush(struct net *net);
>  void rt_flush_dev(struct net_device *dev);
> -struct rtable *__ip_route_output_key_hash(struct net *, struct flowi4 *flp,
> -					  int mp_hash);
> -
> -static inline struct rtable *__ip_route_output_key(struct net *net,
> -						   struct flowi4 *flp)
> -{
> -	return __ip_route_output_key_hash(net, flp, -1);
> -}
> +struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *flp);
>  
>  struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
>  				    const struct sock *sk);
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index 317026a39cfa..6601bd9744c9 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -57,7 +57,6 @@ static unsigned int fib_info_cnt;
>  static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
>  
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> -u32 fib_multipath_secret __read_mostly;
>  
>  #define for_nexthops(fi) {						\
>  	int nhsel; const struct fib_nh *nh;				\
> @@ -576,9 +575,6 @@ static void fib_rebalance(struct fib_info *fi)
>  
>  		atomic_set(&nexthop_nh->nh_upper_bound, upper_bound);
>  	} endfor_nexthops(fi);
> -
> -	net_get_random_once(&fib_multipath_secret,
> -			    sizeof(fib_multipath_secret));
>  }
>  
>  static inline void fib_add_weight(struct fib_info *fi,
> @@ -1641,7 +1637,7 @@ void fib_select_multipath(struct fib_result *res, int hash)
>  #endif
>  
>  void fib_select_path(struct net *net, struct fib_result *res,
> -		     struct flowi4 *fl4, int mp_hash)
> +		     struct flowi4 *fl4)
>  {
>  	bool oif_check;
>  
> @@ -1650,10 +1646,9 @@ void fib_select_path(struct net *net, struct fib_result *res,
>  
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>  	if (res->fi->fib_nhs > 1 && oif_check) {
> -		if (mp_hash < 0)
> -			mp_hash = get_hash_from_flowi4(fl4) >> 1;
> +		int h = fib_multipath_hash(res->fi, fl4, NULL);
>  
> -		fib_select_multipath(res, mp_hash);
> +		fib_select_multipath(res, h);
>  	}
>  	else
>  #endif
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index fc310db2708b..46630bc30754 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -464,22 +464,6 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
>  	local_bh_enable();
>  }
>  
> -#ifdef CONFIG_IP_ROUTE_MULTIPATH
> -
> -/* Source and destination is swapped. See ip_multipath_icmp_hash */
> -static int icmp_multipath_hash_skb(const struct sk_buff *skb)
> -{
> -	const struct iphdr *iph = ip_hdr(skb);
> -
> -	return fib_multipath_hash(iph->daddr, iph->saddr);
> -}
> -
> -#else
> -
> -#define icmp_multipath_hash_skb(skb) (-1)
> -
> -#endif
> -
>  static struct rtable *icmp_route_lookup(struct net *net,
>  					struct flowi4 *fl4,
>  					struct sk_buff *skb_in,
> @@ -505,8 +489,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
>  	fl4->flowi4_oif = l3mdev_master_ifindex(skb_dst(skb_in)->dev);
>  
>  	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
> -	rt = __ip_route_output_key_hash(net, fl4,
> -					icmp_multipath_hash_skb(skb_in));
> +	rt = __ip_route_output_key(net, fl4);
>  	if (IS_ERR(rt))
>  		return rt;
>  

I'm observing that the above hunk changes things because saddr passed to
icmp_route_lookup doesn't always remain set to ip_hdr(skb_in)->daddr.

This happens when generating an ICMP error in response to a datagram
which destination address is not a local one, that is from ip_forward.

-Jakub

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 8471dd116771..3d7f99747285 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1734,11 +1734,11 @@ static int __mkroute_input(struct sk_buff *skb,
>  }
>  
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> -
>  /* To make ICMP packets follow the right flow, the multipath hash is
> - * calculated from the inner IP addresses in reverse order.
> + * calculated from the inner IP addresses.
>   */
> -static int ip_multipath_icmp_hash(struct sk_buff *skb)
> +static void ip_multipath_icmp_hash(const struct sk_buff *skb,
> +				   struct flowi4 *fl4)
>  {
>  	const struct iphdr *outer_iph = ip_hdr(skb);
>  	struct icmphdr _icmph;
> @@ -1746,33 +1746,85 @@ static int ip_multipath_icmp_hash(struct sk_buff *skb)
>  	struct iphdr _inner_iph;
>  	const struct iphdr *inner_iph;
>  
> +	fl4->saddr = outer_iph->saddr;
> +	fl4->daddr = outer_iph->daddr;
>  	if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0))
> -		goto standard_hash;
> +		return;
>  
>  	icmph = skb_header_pointer(skb, outer_iph->ihl * 4, sizeof(_icmph),
>  				   &_icmph);
>  	if (!icmph)
> -		goto standard_hash;
> +		return;
>  
>  	if (icmph->type != ICMP_DEST_UNREACH &&
>  	    icmph->type != ICMP_REDIRECT &&
>  	    icmph->type != ICMP_TIME_EXCEEDED &&
> -	    icmph->type != ICMP_PARAMETERPROB) {
> -		goto standard_hash;
> -	}
> +	    icmph->type != ICMP_PARAMETERPROB)
> +		return;
>  
>  	inner_iph = skb_header_pointer(skb,
>  				       outer_iph->ihl * 4 + sizeof(_icmph),
>  				       sizeof(_inner_iph), &_inner_iph);
>  	if (!inner_iph)
> -		goto standard_hash;
> +		return;
> +	fl4->saddr = inner_iph->saddr;
> +	fl4->daddr = inner_iph->daddr;
> +}
>  
> -	return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr);
> +int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
> +		       const struct sk_buff *skb)
> +{
> +	struct net *net = fi->fib_net;
> +	struct flow_keys hash_keys;
> +	u32 mhash;
>  
> -standard_hash:
> -	return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr);
> -}
> +	switch (net->ipv4.sysctl_fib_multipath_hash_policy) {
> +	case 0:
> +		memset(&hash_keys, 0, sizeof(hash_keys));
> +		hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
> +		if (skb && ip_hdr(skb)->protocol == IPPROTO_ICMP) {
> +			struct flowi4 _fl4;
>  
> +			ip_multipath_icmp_hash(skb, &_fl4);
> +			hash_keys.addrs.v4addrs.src = _fl4.saddr;
> +			hash_keys.addrs.v4addrs.dst = _fl4.daddr;
> +		} else {
> +			hash_keys.addrs.v4addrs.src = fl4->saddr;
> +			hash_keys.addrs.v4addrs.dst = fl4->daddr;
> +		}
> +		break;
> +	case 1:
> +		/* skb is currently provided only when forwarding */
> +		if (skb) {
> +			unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP;
> +			struct flow_keys keys;
> +
> +			/* short-circuit if we already have L4 hash present */
> +			if (skb->l4_hash)
> +				return skb_get_hash_raw(skb) >> 1;
> +			memset(&hash_keys, 0, sizeof(hash_keys));
> +			skb_flow_dissect_flow_keys(skb, &keys, flag);
> +			hash_keys.addrs.v4addrs.src = keys.addrs.v4addrs.src;
> +			hash_keys.addrs.v4addrs.dst = keys.addrs.v4addrs.dst;
> +			hash_keys.ports.src = keys.ports.src;
> +			hash_keys.ports.dst = keys.ports.dst;
> +			hash_keys.basic.ip_proto = keys.basic.ip_proto;
> +		} else {
> +			memset(&hash_keys, 0, sizeof(hash_keys));
> +			hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
> +			hash_keys.addrs.v4addrs.src = fl4->saddr;
> +			hash_keys.addrs.v4addrs.dst = fl4->daddr;
> +			hash_keys.ports.src = fl4->fl4_sport;
> +			hash_keys.ports.dst = fl4->fl4_dport;
> +			hash_keys.basic.ip_proto = fl4->flowi4_proto;
> +		}
> +		break;
> +	}
> +	mhash = flow_hash_from_keys(&hash_keys);
> +
> +	return mhash >> 1;
> +}
> +EXPORT_SYMBOL_GPL(fib_multipath_hash);
>  #endif /* CONFIG_IP_ROUTE_MULTIPATH */
>  
>  static int ip_mkroute_input(struct sk_buff *skb,
> @@ -1782,12 +1834,9 @@ static int ip_mkroute_input(struct sk_buff *skb,
>  {
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>  	if (res->fi && res->fi->fib_nhs > 1) {
> -		int h;
> +		struct flowi4 fl4 = { .daddr = daddr, .saddr = saddr };
> +		int h = fib_multipath_hash(res->fi, &fl4, skb);
>  
> -		if (unlikely(ip_hdr(skb)->protocol == IPPROTO_ICMP))
> -			h = ip_multipath_icmp_hash(skb);
> -		else
> -			h = fib_multipath_hash(saddr, daddr);
>  		fib_select_multipath(res, h);
>  	}
>  #endif
> @@ -2202,8 +2251,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
>   * Major route resolver routine.
>   */
>  
> -struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
> -					  int mp_hash)
> +struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
>  {
>  	struct net_device *dev_out = NULL;
>  	__u8 tos = RT_FL_TOS(fl4);
> @@ -2365,7 +2413,7 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
>  		goto make_route;
>  	}
>  
> -	fib_select_path(net, &res, fl4, mp_hash);
> +	fib_select_path(net, &res, fl4);
>  
>  	dev_out = FIB_RES_DEV(res);
>  	fl4->flowi4_oif = dev_out->ifindex;
> @@ -2378,7 +2426,7 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
>  	rcu_read_unlock();
>  	return rth;
>  }
> -EXPORT_SYMBOL_GPL(__ip_route_output_key_hash);
> +EXPORT_SYMBOL_GPL(__ip_route_output_key);
>  
>  static struct dst_entry *ipv4_blackhole_dst_check(struct dst_entry *dst, u32 cookie)
>  {
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index d6880a6149ee..62c4f94923e5 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -1004,6 +1004,15 @@ static struct ctl_table ipv4_net_table[] = {
>  		.extra1		= &zero,
>  		.extra2		= &one,
>  	},
> +	{
> +		.procname	= "fib_multipath_hash_policy",
> +		.data		= &init_net.ipv4.sysctl_fib_multipath_hash_policy,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &one,
> +	},
>  #endif
>  	{
>  		.procname	= "ip_unprivileged_port_start",
Nikolay Aleksandrov March 15, 2017, 12:10 p.m. UTC | #18
On 15/03/17 13:32, Jakub Sitnicki wrote:
> On Tue, Mar 14, 2017 at 03:36 PM GMT, Nikolay Aleksandrov wrote:
>> This patch adds support for ECMP hash policy choice via a new sysctl
>> called fib_multipath_hash_policy and also adds support for L4 hashes.
>> The current values for fib_multipath_hash_policy are:
>>  0 - layer 3 (default)
>>  1 - layer 4
>> If there's an skb hash already set and it matches the chosen policy then it
>> will be used instead of being calculated (currently only for L4).
>> In L3 mode we always calculate the hash due to the ICMP error special
>> case, the flow dissector's field consistentification should handle the
>> address order thus we can remove the address reversals.
>>
>> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
>> ---
>> v3:
>>  - keep the ICMP error special handling and always calc L3 hash
>>    Jakub, could you please run your tests with this version ?
>>
>> v2:
>>  - removed the output_key_hash as it's not needed anymore
>>  - reverted to my original/internal patch with L3 as default hash
>>
>>  Documentation/networking/ip-sysctl.txt |  8 +++
>>  include/net/ip_fib.h                   | 14 ++----
>>  include/net/netns/ipv4.h               |  1 +
>>  include/net/route.h                    |  9 +---
>>  net/ipv4/fib_semantics.c               | 11 ++--
>>  net/ipv4/icmp.c                        | 19 +------
>>  net/ipv4/route.c                       | 92 ++++++++++++++++++++++++++--------
>>  net/ipv4/sysctl_net_ipv4.c             |  9 ++++
>>  8 files changed, 98 insertions(+), 65 deletions(-)
>>
>> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>> index fc73eeb7b3b8..558e19653106 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -73,6 +73,14 @@ fib_multipath_use_neigh - BOOLEAN
>>  	0 - disabled
>>  	1 - enabled
>>  
>> +fib_multipath_hash_policy - INTEGER
>> +	Controls which hash policy to use for multipath routes. Only valid
>> +	for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled.
>> +	Default: 0 (Layer 3)
>> +	Possible values:
>> +	0 - Layer 3
>> +	1 - Layer 4
>> +
>>  route/max_size - INTEGER
>>  	Maximum number of routes allowed in the kernel.  Increase
>>  	this when using large numbers of interfaces and/or routes.
>> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
>> index d9cee9659978..8c608a17bd9a 100644
>> --- a/include/net/ip_fib.h
>> +++ b/include/net/ip_fib.h
>> @@ -383,17 +383,13 @@ int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force);
>>  int fib_sync_down_addr(struct net_device *dev, __be32 local);
>>  int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
>>  
>> -extern u32 fib_multipath_secret __read_mostly;
>> -
>> -static inline int fib_multipath_hash(__be32 saddr, __be32 daddr)
>> -{
>> -	return jhash_2words((__force u32)saddr, (__force u32)daddr,
>> -			    fib_multipath_secret) >> 1;
>> -}
>> -
>> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
>> +int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
>> +		       const struct sk_buff *skb);
>> +#endif
>>  void fib_select_multipath(struct fib_result *res, int hash);
>>  void fib_select_path(struct net *net, struct fib_result *res,
>> -		     struct flowi4 *fl4, int mp_hash);
>> +		     struct flowi4 *fl4);
>>  
>>  /* Exported by fib_trie.c */
>>  void fib_trie_init(void);
>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
>> index 622d2da27135..70a1d4251790 100644
>> --- a/include/net/netns/ipv4.h
>> +++ b/include/net/netns/ipv4.h
>> @@ -152,6 +152,7 @@ struct netns_ipv4 {
>>  #endif
>>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>>  	int sysctl_fib_multipath_use_neigh;
>> +	int sysctl_fib_multipath_hash_policy;
>>  #endif
>>  
>>  	unsigned int	fib_seq;	/* protected by rtnl_mutex */
>> diff --git a/include/net/route.h b/include/net/route.h
>> index c0874c87c173..32412c91c222 100644
>> --- a/include/net/route.h
>> +++ b/include/net/route.h
>> @@ -113,14 +113,7 @@ struct in_device;
>>  int ip_rt_init(void);
>>  void rt_cache_flush(struct net *net);
>>  void rt_flush_dev(struct net_device *dev);
>> -struct rtable *__ip_route_output_key_hash(struct net *, struct flowi4 *flp,
>> -					  int mp_hash);
>> -
>> -static inline struct rtable *__ip_route_output_key(struct net *net,
>> -						   struct flowi4 *flp)
>> -{
>> -	return __ip_route_output_key_hash(net, flp, -1);
>> -}
>> +struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *flp);
>>  
>>  struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
>>  				    const struct sock *sk);
>> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
>> index 317026a39cfa..6601bd9744c9 100644
>> --- a/net/ipv4/fib_semantics.c
>> +++ b/net/ipv4/fib_semantics.c
>> @@ -57,7 +57,6 @@ static unsigned int fib_info_cnt;
>>  static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
>>  
>>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>> -u32 fib_multipath_secret __read_mostly;
>>  
>>  #define for_nexthops(fi) {						\
>>  	int nhsel; const struct fib_nh *nh;				\
>> @@ -576,9 +575,6 @@ static void fib_rebalance(struct fib_info *fi)
>>  
>>  		atomic_set(&nexthop_nh->nh_upper_bound, upper_bound);
>>  	} endfor_nexthops(fi);
>> -
>> -	net_get_random_once(&fib_multipath_secret,
>> -			    sizeof(fib_multipath_secret));
>>  }
>>  
>>  static inline void fib_add_weight(struct fib_info *fi,
>> @@ -1641,7 +1637,7 @@ void fib_select_multipath(struct fib_result *res, int hash)
>>  #endif
>>  
>>  void fib_select_path(struct net *net, struct fib_result *res,
>> -		     struct flowi4 *fl4, int mp_hash)
>> +		     struct flowi4 *fl4)
>>  {
>>  	bool oif_check;
>>  
>> @@ -1650,10 +1646,9 @@ void fib_select_path(struct net *net, struct fib_result *res,
>>  
>>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>>  	if (res->fi->fib_nhs > 1 && oif_check) {
>> -		if (mp_hash < 0)
>> -			mp_hash = get_hash_from_flowi4(fl4) >> 1;
>> +		int h = fib_multipath_hash(res->fi, fl4, NULL);
>>  
>> -		fib_select_multipath(res, mp_hash);
>> +		fib_select_multipath(res, h);
>>  	}
>>  	else
>>  #endif
>> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>> index fc310db2708b..46630bc30754 100644
>> --- a/net/ipv4/icmp.c
>> +++ b/net/ipv4/icmp.c
>> @@ -464,22 +464,6 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
>>  	local_bh_enable();
>>  }
>>  
>> -#ifdef CONFIG_IP_ROUTE_MULTIPATH
>> -
>> -/* Source and destination is swapped. See ip_multipath_icmp_hash */
>> -static int icmp_multipath_hash_skb(const struct sk_buff *skb)
>> -{
>> -	const struct iphdr *iph = ip_hdr(skb);
>> -
>> -	return fib_multipath_hash(iph->daddr, iph->saddr);
>> -}
>> -
>> -#else
>> -
>> -#define icmp_multipath_hash_skb(skb) (-1)
>> -
>> -#endif
>> -
>>  static struct rtable *icmp_route_lookup(struct net *net,
>>  					struct flowi4 *fl4,
>>  					struct sk_buff *skb_in,
>> @@ -505,8 +489,7 @@ static struct rtable *icmp_route_lookup(struct net *net,
>>  	fl4->flowi4_oif = l3mdev_master_ifindex(skb_dst(skb_in)->dev);
>>  
>>  	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
>> -	rt = __ip_route_output_key_hash(net, fl4,
>> -					icmp_multipath_hash_skb(skb_in));
>> +	rt = __ip_route_output_key(net, fl4);
>>  	if (IS_ERR(rt))
>>  		return rt;
>>  
> 
> I'm observing that the above hunk changes things because saddr passed to
> icmp_route_lookup doesn't always remain set to ip_hdr(skb_in)->daddr.
> 
> This happens when generating an ICMP error in response to a datagram
> which destination address is not a local one, that is from ip_forward.
> 
> -Jakub

Oh, of course. Good catch, I'll fix it in v4.

Thank you,
 Nik

> 
>> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>> index 8471dd116771..3d7f99747285 100644
>> --- a/net/ipv4/route.c
>> +++ b/net/ipv4/route.c
>> @@ -1734,11 +1734,11 @@ static int __mkroute_input(struct sk_buff *skb,
>>  }
>>  
>>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>> -
>>  /* To make ICMP packets follow the right flow, the multipath hash is
>> - * calculated from the inner IP addresses in reverse order.
>> + * calculated from the inner IP addresses.
>>   */
>> -static int ip_multipath_icmp_hash(struct sk_buff *skb)
>> +static void ip_multipath_icmp_hash(const struct sk_buff *skb,
>> +				   struct flowi4 *fl4)
>>  {
>>  	const struct iphdr *outer_iph = ip_hdr(skb);
>>  	struct icmphdr _icmph;
>> @@ -1746,33 +1746,85 @@ static int ip_multipath_icmp_hash(struct sk_buff *skb)
>>  	struct iphdr _inner_iph;
>>  	const struct iphdr *inner_iph;
>>  
>> +	fl4->saddr = outer_iph->saddr;
>> +	fl4->daddr = outer_iph->daddr;
>>  	if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0))
>> -		goto standard_hash;
>> +		return;
>>  
>>  	icmph = skb_header_pointer(skb, outer_iph->ihl * 4, sizeof(_icmph),
>>  				   &_icmph);
>>  	if (!icmph)
>> -		goto standard_hash;
>> +		return;
>>  
>>  	if (icmph->type != ICMP_DEST_UNREACH &&
>>  	    icmph->type != ICMP_REDIRECT &&
>>  	    icmph->type != ICMP_TIME_EXCEEDED &&
>> -	    icmph->type != ICMP_PARAMETERPROB) {
>> -		goto standard_hash;
>> -	}
>> +	    icmph->type != ICMP_PARAMETERPROB)
>> +		return;
>>  
>>  	inner_iph = skb_header_pointer(skb,
>>  				       outer_iph->ihl * 4 + sizeof(_icmph),
>>  				       sizeof(_inner_iph), &_inner_iph);
>>  	if (!inner_iph)
>> -		goto standard_hash;
>> +		return;
>> +	fl4->saddr = inner_iph->saddr;
>> +	fl4->daddr = inner_iph->daddr;
>> +}
>>  
>> -	return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr);
>> +int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
>> +		       const struct sk_buff *skb)
>> +{
>> +	struct net *net = fi->fib_net;
>> +	struct flow_keys hash_keys;
>> +	u32 mhash;
>>  
>> -standard_hash:
>> -	return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr);
>> -}
>> +	switch (net->ipv4.sysctl_fib_multipath_hash_policy) {
>> +	case 0:
>> +		memset(&hash_keys, 0, sizeof(hash_keys));
>> +		hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
>> +		if (skb && ip_hdr(skb)->protocol == IPPROTO_ICMP) {
>> +			struct flowi4 _fl4;
>>  
>> +			ip_multipath_icmp_hash(skb, &_fl4);
>> +			hash_keys.addrs.v4addrs.src = _fl4.saddr;
>> +			hash_keys.addrs.v4addrs.dst = _fl4.daddr;
>> +		} else {
>> +			hash_keys.addrs.v4addrs.src = fl4->saddr;
>> +			hash_keys.addrs.v4addrs.dst = fl4->daddr;
>> +		}
>> +		break;
>> +	case 1:
>> +		/* skb is currently provided only when forwarding */
>> +		if (skb) {
>> +			unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP;
>> +			struct flow_keys keys;
>> +
>> +			/* short-circuit if we already have L4 hash present */
>> +			if (skb->l4_hash)
>> +				return skb_get_hash_raw(skb) >> 1;
>> +			memset(&hash_keys, 0, sizeof(hash_keys));
>> +			skb_flow_dissect_flow_keys(skb, &keys, flag);
>> +			hash_keys.addrs.v4addrs.src = keys.addrs.v4addrs.src;
>> +			hash_keys.addrs.v4addrs.dst = keys.addrs.v4addrs.dst;
>> +			hash_keys.ports.src = keys.ports.src;
>> +			hash_keys.ports.dst = keys.ports.dst;
>> +			hash_keys.basic.ip_proto = keys.basic.ip_proto;
>> +		} else {
>> +			memset(&hash_keys, 0, sizeof(hash_keys));
>> +			hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
>> +			hash_keys.addrs.v4addrs.src = fl4->saddr;
>> +			hash_keys.addrs.v4addrs.dst = fl4->daddr;
>> +			hash_keys.ports.src = fl4->fl4_sport;
>> +			hash_keys.ports.dst = fl4->fl4_dport;
>> +			hash_keys.basic.ip_proto = fl4->flowi4_proto;
>> +		}
>> +		break;
>> +	}
>> +	mhash = flow_hash_from_keys(&hash_keys);
>> +
>> +	return mhash >> 1;
>> +}
>> +EXPORT_SYMBOL_GPL(fib_multipath_hash);
>>  #endif /* CONFIG_IP_ROUTE_MULTIPATH */
>>  
>>  static int ip_mkroute_input(struct sk_buff *skb,
>> @@ -1782,12 +1834,9 @@ static int ip_mkroute_input(struct sk_buff *skb,
>>  {
>>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>>  	if (res->fi && res->fi->fib_nhs > 1) {
>> -		int h;
>> +		struct flowi4 fl4 = { .daddr = daddr, .saddr = saddr };
>> +		int h = fib_multipath_hash(res->fi, &fl4, skb);
>>  
>> -		if (unlikely(ip_hdr(skb)->protocol == IPPROTO_ICMP))
>> -			h = ip_multipath_icmp_hash(skb);
>> -		else
>> -			h = fib_multipath_hash(saddr, daddr);
>>  		fib_select_multipath(res, h);
>>  	}
>>  #endif
>> @@ -2202,8 +2251,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
>>   * Major route resolver routine.
>>   */
>>  
>> -struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
>> -					  int mp_hash)
>> +struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
>>  {
>>  	struct net_device *dev_out = NULL;
>>  	__u8 tos = RT_FL_TOS(fl4);
>> @@ -2365,7 +2413,7 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
>>  		goto make_route;
>>  	}
>>  
>> -	fib_select_path(net, &res, fl4, mp_hash);
>> +	fib_select_path(net, &res, fl4);
>>  
>>  	dev_out = FIB_RES_DEV(res);
>>  	fl4->flowi4_oif = dev_out->ifindex;
>> @@ -2378,7 +2426,7 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
>>  	rcu_read_unlock();
>>  	return rth;
>>  }
>> -EXPORT_SYMBOL_GPL(__ip_route_output_key_hash);
>> +EXPORT_SYMBOL_GPL(__ip_route_output_key);
>>  
>>  static struct dst_entry *ipv4_blackhole_dst_check(struct dst_entry *dst, u32 cookie)
>>  {
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index d6880a6149ee..62c4f94923e5 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -1004,6 +1004,15 @@ static struct ctl_table ipv4_net_table[] = {
>>  		.extra1		= &zero,
>>  		.extra2		= &one,
>>  	},
>> +	{
>> +		.procname	= "fib_multipath_hash_policy",
>> +		.data		= &init_net.ipv4.sysctl_fib_multipath_hash_policy,
>> +		.maxlen		= sizeof(int),
>> +		.mode		= 0644,
>> +		.proc_handler	= proc_dointvec_minmax,
>> +		.extra1		= &zero,
>> +		.extra2		= &one,
>> +	},
>>  #endif
>>  	{
>>  		.procname	= "ip_unprivileged_port_start",
>
David Ahern March 15, 2017, 3:01 p.m. UTC | #19
On 3/15/17 3:17 AM, Nicolas Dichtel wrote:
>>> Is there another patch to add the NETCONFA_ECMP support?
>>>
>>
>> does userspace care?
> Yes, I think it is needed so that userspace can correctly monitor this behavior.
> It also enables to check this parameter through netlink.
> 

I don't understand why userspace cares. What can userspace do with that
information?
Stephen Hemminger March 15, 2017, 3:20 p.m. UTC | #20
On Wed, 15 Mar 2017 09:01:16 -0600
David Ahern <dsa@cumulusnetworks.com> wrote:

> On 3/15/17 3:17 AM, Nicolas Dichtel wrote:
> >>> Is there another patch to add the NETCONFA_ECMP support?
> >>>  
> >>
> >> does userspace care?  
> > Yes, I think it is needed so that userspace can correctly monitor this behavior.
> > It also enables to check this parameter through netlink.
> >   
> 
> I don't understand why userspace cares. What can userspace do with that
> information?

There are cases such as routing or switch control from userspace where the desire
is to track kernel behavior in an external program. I know of at least 3 cases where
this is done.
David Miller March 17, 2017, 3:36 a.m. UTC | #21
From: Tom Herbert <tom@herbertland.com>
Date: Tue, 14 Mar 2017 19:30:47 -0700

> Agreed, but then why do we even need any complexity here by that
> argument? RSS is specifically defined to do 5-tuple hashing for TCP
> (and UDP), and 3-tuple. No one has ever complained that doing per flow
> RSS for TCP is bad thing AFAIK. We followed that same model for RPS,
> RFS, and XPS via state in the connection context. The skb_hash is
> often given to us for free, whereas in order to do a 3-tuple we have
> to actually do more work and do at least an extra jhash. I suppose the
> argument is probably that switches allow this configuration and
> somehow we want to have feature parity, but it would be very
> interesting to know if anyone is not doing per flow ECMP in real life
> and why...

Anyone doing any kind of hardware offload work requires the
possibility of getting feature parity on the software side.  It's the
only way to properly test and debug things.
diff mbox

Patch

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index fc73eeb7b3b8..558e19653106 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -73,6 +73,14 @@  fib_multipath_use_neigh - BOOLEAN
 	0 - disabled
 	1 - enabled
 
+fib_multipath_hash_policy - INTEGER
+	Controls which hash policy to use for multipath routes. Only valid
+	for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled.
+	Default: 0 (Layer 3)
+	Possible values:
+	0 - Layer 3
+	1 - Layer 4
+
 route/max_size - INTEGER
 	Maximum number of routes allowed in the kernel.  Increase
 	this when using large numbers of interfaces and/or routes.
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index d9cee9659978..8c608a17bd9a 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -383,17 +383,13 @@  int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force);
 int fib_sync_down_addr(struct net_device *dev, __be32 local);
 int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
 
-extern u32 fib_multipath_secret __read_mostly;
-
-static inline int fib_multipath_hash(__be32 saddr, __be32 daddr)
-{
-	return jhash_2words((__force u32)saddr, (__force u32)daddr,
-			    fib_multipath_secret) >> 1;
-}
-
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
+		       const struct sk_buff *skb);
+#endif
 void fib_select_multipath(struct fib_result *res, int hash);
 void fib_select_path(struct net *net, struct fib_result *res,
-		     struct flowi4 *fl4, int mp_hash);
+		     struct flowi4 *fl4);
 
 /* Exported by fib_trie.c */
 void fib_trie_init(void);
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 622d2da27135..70a1d4251790 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -152,6 +152,7 @@  struct netns_ipv4 {
 #endif
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	int sysctl_fib_multipath_use_neigh;
+	int sysctl_fib_multipath_hash_policy;
 #endif
 
 	unsigned int	fib_seq;	/* protected by rtnl_mutex */
diff --git a/include/net/route.h b/include/net/route.h
index c0874c87c173..32412c91c222 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -113,14 +113,7 @@  struct in_device;
 int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
 void rt_flush_dev(struct net_device *dev);
-struct rtable *__ip_route_output_key_hash(struct net *, struct flowi4 *flp,
-					  int mp_hash);
-
-static inline struct rtable *__ip_route_output_key(struct net *net,
-						   struct flowi4 *flp)
-{
-	return __ip_route_output_key_hash(net, flp, -1);
-}
+struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *flp);
 
 struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 				    const struct sock *sk);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 317026a39cfa..6601bd9744c9 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -57,7 +57,6 @@  static unsigned int fib_info_cnt;
 static struct hlist_head fib_info_devhash[DEVINDEX_HASHSIZE];
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-u32 fib_multipath_secret __read_mostly;
 
 #define for_nexthops(fi) {						\
 	int nhsel; const struct fib_nh *nh;				\
@@ -576,9 +575,6 @@  static void fib_rebalance(struct fib_info *fi)
 
 		atomic_set(&nexthop_nh->nh_upper_bound, upper_bound);
 	} endfor_nexthops(fi);
-
-	net_get_random_once(&fib_multipath_secret,
-			    sizeof(fib_multipath_secret));
 }
 
 static inline void fib_add_weight(struct fib_info *fi,
@@ -1641,7 +1637,7 @@  void fib_select_multipath(struct fib_result *res, int hash)
 #endif
 
 void fib_select_path(struct net *net, struct fib_result *res,
-		     struct flowi4 *fl4, int mp_hash)
+		     struct flowi4 *fl4)
 {
 	bool oif_check;
 
@@ -1650,10 +1646,9 @@  void fib_select_path(struct net *net, struct fib_result *res,
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	if (res->fi->fib_nhs > 1 && oif_check) {
-		if (mp_hash < 0)
-			mp_hash = get_hash_from_flowi4(fl4) >> 1;
+		int h = fib_multipath_hash(res->fi, fl4, NULL);
 
-		fib_select_multipath(res, mp_hash);
+		fib_select_multipath(res, h);
 	}
 	else
 #endif
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index fc310db2708b..46630bc30754 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -464,22 +464,6 @@  static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	local_bh_enable();
 }
 
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
-
-/* Source and destination is swapped. See ip_multipath_icmp_hash */
-static int icmp_multipath_hash_skb(const struct sk_buff *skb)
-{
-	const struct iphdr *iph = ip_hdr(skb);
-
-	return fib_multipath_hash(iph->daddr, iph->saddr);
-}
-
-#else
-
-#define icmp_multipath_hash_skb(skb) (-1)
-
-#endif
-
 static struct rtable *icmp_route_lookup(struct net *net,
 					struct flowi4 *fl4,
 					struct sk_buff *skb_in,
@@ -505,8 +489,7 @@  static struct rtable *icmp_route_lookup(struct net *net,
 	fl4->flowi4_oif = l3mdev_master_ifindex(skb_dst(skb_in)->dev);
 
 	security_skb_classify_flow(skb_in, flowi4_to_flowi(fl4));
-	rt = __ip_route_output_key_hash(net, fl4,
-					icmp_multipath_hash_skb(skb_in));
+	rt = __ip_route_output_key(net, fl4);
 	if (IS_ERR(rt))
 		return rt;
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8471dd116771..3d7f99747285 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1734,11 +1734,11 @@  static int __mkroute_input(struct sk_buff *skb,
 }
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-
 /* To make ICMP packets follow the right flow, the multipath hash is
- * calculated from the inner IP addresses in reverse order.
+ * calculated from the inner IP addresses.
  */
-static int ip_multipath_icmp_hash(struct sk_buff *skb)
+static void ip_multipath_icmp_hash(const struct sk_buff *skb,
+				   struct flowi4 *fl4)
 {
 	const struct iphdr *outer_iph = ip_hdr(skb);
 	struct icmphdr _icmph;
@@ -1746,33 +1746,85 @@  static int ip_multipath_icmp_hash(struct sk_buff *skb)
 	struct iphdr _inner_iph;
 	const struct iphdr *inner_iph;
 
+	fl4->saddr = outer_iph->saddr;
+	fl4->daddr = outer_iph->daddr;
 	if (unlikely((outer_iph->frag_off & htons(IP_OFFSET)) != 0))
-		goto standard_hash;
+		return;
 
 	icmph = skb_header_pointer(skb, outer_iph->ihl * 4, sizeof(_icmph),
 				   &_icmph);
 	if (!icmph)
-		goto standard_hash;
+		return;
 
 	if (icmph->type != ICMP_DEST_UNREACH &&
 	    icmph->type != ICMP_REDIRECT &&
 	    icmph->type != ICMP_TIME_EXCEEDED &&
-	    icmph->type != ICMP_PARAMETERPROB) {
-		goto standard_hash;
-	}
+	    icmph->type != ICMP_PARAMETERPROB)
+		return;
 
 	inner_iph = skb_header_pointer(skb,
 				       outer_iph->ihl * 4 + sizeof(_icmph),
 				       sizeof(_inner_iph), &_inner_iph);
 	if (!inner_iph)
-		goto standard_hash;
+		return;
+	fl4->saddr = inner_iph->saddr;
+	fl4->daddr = inner_iph->daddr;
+}
 
-	return fib_multipath_hash(inner_iph->daddr, inner_iph->saddr);
+int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
+		       const struct sk_buff *skb)
+{
+	struct net *net = fi->fib_net;
+	struct flow_keys hash_keys;
+	u32 mhash;
 
-standard_hash:
-	return fib_multipath_hash(outer_iph->saddr, outer_iph->daddr);
-}
+	switch (net->ipv4.sysctl_fib_multipath_hash_policy) {
+	case 0:
+		memset(&hash_keys, 0, sizeof(hash_keys));
+		hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+		if (skb && ip_hdr(skb)->protocol == IPPROTO_ICMP) {
+			struct flowi4 _fl4;
 
+			ip_multipath_icmp_hash(skb, &_fl4);
+			hash_keys.addrs.v4addrs.src = _fl4.saddr;
+			hash_keys.addrs.v4addrs.dst = _fl4.daddr;
+		} else {
+			hash_keys.addrs.v4addrs.src = fl4->saddr;
+			hash_keys.addrs.v4addrs.dst = fl4->daddr;
+		}
+		break;
+	case 1:
+		/* skb is currently provided only when forwarding */
+		if (skb) {
+			unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP;
+			struct flow_keys keys;
+
+			/* short-circuit if we already have L4 hash present */
+			if (skb->l4_hash)
+				return skb_get_hash_raw(skb) >> 1;
+			memset(&hash_keys, 0, sizeof(hash_keys));
+			skb_flow_dissect_flow_keys(skb, &keys, flag);
+			hash_keys.addrs.v4addrs.src = keys.addrs.v4addrs.src;
+			hash_keys.addrs.v4addrs.dst = keys.addrs.v4addrs.dst;
+			hash_keys.ports.src = keys.ports.src;
+			hash_keys.ports.dst = keys.ports.dst;
+			hash_keys.basic.ip_proto = keys.basic.ip_proto;
+		} else {
+			memset(&hash_keys, 0, sizeof(hash_keys));
+			hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+			hash_keys.addrs.v4addrs.src = fl4->saddr;
+			hash_keys.addrs.v4addrs.dst = fl4->daddr;
+			hash_keys.ports.src = fl4->fl4_sport;
+			hash_keys.ports.dst = fl4->fl4_dport;
+			hash_keys.basic.ip_proto = fl4->flowi4_proto;
+		}
+		break;
+	}
+	mhash = flow_hash_from_keys(&hash_keys);
+
+	return mhash >> 1;
+}
+EXPORT_SYMBOL_GPL(fib_multipath_hash);
 #endif /* CONFIG_IP_ROUTE_MULTIPATH */
 
 static int ip_mkroute_input(struct sk_buff *skb,
@@ -1782,12 +1834,9 @@  static int ip_mkroute_input(struct sk_buff *skb,
 {
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	if (res->fi && res->fi->fib_nhs > 1) {
-		int h;
+		struct flowi4 fl4 = { .daddr = daddr, .saddr = saddr };
+		int h = fib_multipath_hash(res->fi, &fl4, skb);
 
-		if (unlikely(ip_hdr(skb)->protocol == IPPROTO_ICMP))
-			h = ip_multipath_icmp_hash(skb);
-		else
-			h = fib_multipath_hash(saddr, daddr);
 		fib_select_multipath(res, h);
 	}
 #endif
@@ -2202,8 +2251,7 @@  static struct rtable *__mkroute_output(const struct fib_result *res,
  * Major route resolver routine.
  */
 
-struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
-					  int mp_hash)
+struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 {
 	struct net_device *dev_out = NULL;
 	__u8 tos = RT_FL_TOS(fl4);
@@ -2365,7 +2413,7 @@  struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
 		goto make_route;
 	}
 
-	fib_select_path(net, &res, fl4, mp_hash);
+	fib_select_path(net, &res, fl4);
 
 	dev_out = FIB_RES_DEV(res);
 	fl4->flowi4_oif = dev_out->ifindex;
@@ -2378,7 +2426,7 @@  struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4,
 	rcu_read_unlock();
 	return rth;
 }
-EXPORT_SYMBOL_GPL(__ip_route_output_key_hash);
+EXPORT_SYMBOL_GPL(__ip_route_output_key);
 
 static struct dst_entry *ipv4_blackhole_dst_check(struct dst_entry *dst, u32 cookie)
 {
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d6880a6149ee..62c4f94923e5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1004,6 +1004,15 @@  static struct ctl_table ipv4_net_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "fib_multipath_hash_policy",
+		.data		= &init_net.ipv4.sysctl_fib_multipath_hash_policy,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 #endif
 	{
 		.procname	= "ip_unprivileged_port_start",