diff mbox series

[net,1/2] ipv6: Fix route replacement with dev-only route

Message ID 20200212014107.110066-1-bpoirier@cumulusnetworks.com
State Accepted
Delegated to: David Miller
Headers show
Series [net,1/2] ipv6: Fix route replacement with dev-only route | expand

Commit Message

Benjamin Poirier Feb. 12, 2020, 1:41 a.m. UTC
After commit 27596472473a ("ipv6: fix ECMP route replacement") it is no
longer possible to replace an ECMP-able route by a non ECMP-able route.
For example,
	ip route add 2001:db8::1/128 via fe80::1 dev dummy0
	ip route replace 2001:db8::1/128 dev dummy0
does not work as expected.

Tweak the replacement logic so that point 3 in the log of the above commit
becomes:
3. If the new route is not ECMP-able, and no matching non-ECMP-able route
exists, replace matching ECMP-able route (if any) or add the new route.

We can now summarize the entire replace semantics to:
When doing a replace, prefer replacing a matching route of the same
"ECMP-able-ness" as the replace argument. If there is no such candidate,
fallback to the first route found.

Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
Signed-off-by: Benjamin Poirier <bpoirier@cumulusnetworks.com>
---
 net/ipv6/ip6_fib.c                       | 7 ++++---
 tools/testing/selftests/net/fib_tests.sh | 6 ++++++
 2 files changed, 10 insertions(+), 3 deletions(-)

Comments

Michal Kubecek Feb. 12, 2020, 9:57 p.m. UTC | #1
On Wed, Feb 12, 2020 at 10:41:06AM +0900, Benjamin Poirier wrote:
> After commit 27596472473a ("ipv6: fix ECMP route replacement") it is no
> longer possible to replace an ECMP-able route by a non ECMP-able route.
> For example,
> 	ip route add 2001:db8::1/128 via fe80::1 dev dummy0
> 	ip route replace 2001:db8::1/128 dev dummy0
> does not work as expected.
> 
> Tweak the replacement logic so that point 3 in the log of the above commit
> becomes:
> 3. If the new route is not ECMP-able, and no matching non-ECMP-able route
> exists, replace matching ECMP-able route (if any) or add the new route.
> 
> We can now summarize the entire replace semantics to:
> When doing a replace, prefer replacing a matching route of the same
> "ECMP-able-ness" as the replace argument. If there is no such candidate,
> fallback to the first route found.
> 
> Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
> Signed-off-by: Benjamin Poirier <bpoirier@cumulusnetworks.com>

Reviewed-by: Michal Kubecek <mkubecek@suse.cz>

> ---
>  net/ipv6/ip6_fib.c                       | 7 ++++---
>  tools/testing/selftests/net/fib_tests.sh | 6 ++++++
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index 58fbde244381..72abf892302f 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -1102,8 +1102,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
>  					found++;
>  					break;
>  				}
> -				if (rt_can_ecmp)
> -					fallback_ins = fallback_ins ?: ins;
> +				fallback_ins = fallback_ins ?: ins;
>  				goto next_iter;
>  			}
>  
> @@ -1146,7 +1145,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
>  	}
>  
>  	if (fallback_ins && !found) {
> -		/* No ECMP-able route found, replace first non-ECMP one */
> +		/* No matching route with same ecmp-able-ness found, replace
> +		 * first matching route
> +		 */
>  		ins = fallback_ins;
>  		iter = rcu_dereference_protected(*ins,
>  				    lockdep_is_held(&rt->fib6_table->tb6_lock));
> diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
> index 6dd403103800..60273f1bc7d9 100755
> --- a/tools/testing/selftests/net/fib_tests.sh
> +++ b/tools/testing/selftests/net/fib_tests.sh
> @@ -910,6 +910,12 @@ ipv6_rt_replace_mpath()
>  	check_route6 "2001:db8:104::/64 via 2001:db8:101::3 dev veth1 metric 1024"
>  	log_test $? 0 "Multipath with single path via multipath attribute"
>  
> +	# multipath with dev-only
> +	add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
> +	run_cmd "$IP -6 ro replace 2001:db8:104::/64 dev veth1"
> +	check_route6 "2001:db8:104::/64 dev veth1 metric 1024"
> +	log_test $? 0 "Multipath with dev-only"
> +
>  	# route replace fails - invalid nexthop 1
>  	add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
>  	run_cmd "$IP -6 ro replace 2001:db8:104::/64 nexthop via 2001:db8:111::3 nexthop via 2001:db8:103::3"
> -- 
> 2.25.0
>
David Ahern Feb. 15, 2020, 6 p.m. UTC | #2
On 2/11/20 6:41 PM, Benjamin Poirier wrote:
> After commit 27596472473a ("ipv6: fix ECMP route replacement") it is no
> longer possible to replace an ECMP-able route by a non ECMP-able route.
> For example,
> 	ip route add 2001:db8::1/128 via fe80::1 dev dummy0
> 	ip route replace 2001:db8::1/128 dev dummy0
> does not work as expected.
> 
> Tweak the replacement logic so that point 3 in the log of the above commit
> becomes:
> 3. If the new route is not ECMP-able, and no matching non-ECMP-able route
> exists, replace matching ECMP-able route (if any) or add the new route.
> 
> We can now summarize the entire replace semantics to:
> When doing a replace, prefer replacing a matching route of the same
> "ECMP-able-ness" as the replace argument. If there is no such candidate,
> fallback to the first route found.
> 
> Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
> Signed-off-by: Benjamin Poirier <bpoirier@cumulusnetworks.com>
> ---
>  net/ipv6/ip6_fib.c                       | 7 ++++---
>  tools/testing/selftests/net/fib_tests.sh | 6 ++++++
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index 58fbde244381..72abf892302f 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -1102,8 +1102,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
>  					found++;
>  					break;
>  				}
> -				if (rt_can_ecmp)
> -					fallback_ins = fallback_ins ?: ins;
> +				fallback_ins = fallback_ins ?: ins;
>  				goto next_iter;
>  			}
>  
> @@ -1146,7 +1145,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
>  	}
>  
>  	if (fallback_ins && !found) {
> -		/* No ECMP-able route found, replace first non-ECMP one */
> +		/* No matching route with same ecmp-able-ness found, replace
> +		 * first matching route
> +		 */
>  		ins = fallback_ins;
>  		iter = rcu_dereference_protected(*ins,
>  				    lockdep_is_held(&rt->fib6_table->tb6_lock));
> diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
> index 6dd403103800..60273f1bc7d9 100755
> --- a/tools/testing/selftests/net/fib_tests.sh
> +++ b/tools/testing/selftests/net/fib_tests.sh
> @@ -910,6 +910,12 @@ ipv6_rt_replace_mpath()
>  	check_route6 "2001:db8:104::/64 via 2001:db8:101::3 dev veth1 metric 1024"
>  	log_test $? 0 "Multipath with single path via multipath attribute"
>  
> +	# multipath with dev-only
> +	add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
> +	run_cmd "$IP -6 ro replace 2001:db8:104::/64 dev veth1"
> +	check_route6 "2001:db8:104::/64 dev veth1 metric 1024"
> +	log_test $? 0 "Multipath with dev-only"
> +
>  	# route replace fails - invalid nexthop 1
>  	add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
>  	run_cmd "$IP -6 ro replace 2001:db8:104::/64 nexthop via 2001:db8:111::3 nexthop via 2001:db8:103::3"
> 

Thanks for adding a test case. I take this to mean that all existing
tests pass with this change. We have found this code to be extremely
sensitive to seemingly obvious changes.

Added Ido.
David Miller Feb. 17, 2020, 2:35 a.m. UTC | #3
From: Benjamin Poirier <bpoirier@cumulusnetworks.com>
Date: Wed, 12 Feb 2020 10:41:06 +0900

> After commit 27596472473a ("ipv6: fix ECMP route replacement") it is no
> longer possible to replace an ECMP-able route by a non ECMP-able route.
> For example,
> 	ip route add 2001:db8::1/128 via fe80::1 dev dummy0
> 	ip route replace 2001:db8::1/128 dev dummy0
> does not work as expected.
> 
> Tweak the replacement logic so that point 3 in the log of the above commit
> becomes:
> 3. If the new route is not ECMP-able, and no matching non-ECMP-able route
> exists, replace matching ECMP-able route (if any) or add the new route.
> 
> We can now summarize the entire replace semantics to:
> When doing a replace, prefer replacing a matching route of the same
> "ECMP-able-ness" as the replace argument. If there is no such candidate,
> fallback to the first route found.
> 
> Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
> Signed-off-by: Benjamin Poirier <bpoirier@cumulusnetworks.com>

Applied and queued up for -stable.
Benjamin Poirier Feb. 17, 2020, 3:12 a.m. UTC | #4
On 2020/02/15 11:00 -0700, David Ahern wrote:
[...]
> 
> Thanks for adding a test case. I take this to mean that all existing
> tests pass with this change. We have found this code to be extremely
> sensitive to seemingly obvious changes.

I saw two failures from that test suite, regardless of this change:

IPv4 rp_filter tests
    TEST: rp_filter passes local packets                                [FAIL]
    TEST: rp_filter passes loopback packets                             [FAIL]

The other tests, including the ipv6_rt group of tests, are OK.


The rp_filter tests fail for me even if I build a kernel from commit
adb701d6cfa4 ("selftests: add a test case for rp_filter")

Running the first ping manually, tcpdump shows:
root@vsid:~# tcpdump -nepi lo
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
12:01:12.618623 52:54:00:6a:c7:5e > 52:54:00:6a:c7:5e, ethertype IPv4 (0x0800), length 98: 198.51.100.1 > 198.51.100.1: ICMP echo request, id 616, seq 435, length 64
12:01:12.618650 52:54:00:6a:c7:5e > 52:54:00:6a:c7:5e, ethertype IPv4 (0x0800), length 98: 198.51.100.1 > 198.51.100.1: ICMP echo reply, id 616, seq 435, length 64

`ping` doesn't show any replies (since it's bound to dummy1...?).

Cong?
Cong Wang Feb. 20, 2020, 3:20 a.m. UTC | #5
On Sun, Feb 16, 2020 at 7:12 PM Benjamin Poirier
<bpoirier@cumulusnetworks.com> wrote:
>
> On 2020/02/15 11:00 -0700, David Ahern wrote:
> [...]
> >
> > Thanks for adding a test case. I take this to mean that all existing
> > tests pass with this change. We have found this code to be extremely
> > sensitive to seemingly obvious changes.
>
> I saw two failures from that test suite, regardless of this change:
>
> IPv4 rp_filter tests
>     TEST: rp_filter passes local packets                                [FAIL]
>     TEST: rp_filter passes loopback packets                             [FAIL]
>
> The other tests, including the ipv6_rt group of tests, are OK.
>
>
> The rp_filter tests fail for me even if I build a kernel from commit
> adb701d6cfa4 ("selftests: add a test case for rp_filter")
>
> Running the first ping manually, tcpdump shows:
> root@vsid:~# tcpdump -nepi lo
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
> 12:01:12.618623 52:54:00:6a:c7:5e > 52:54:00:6a:c7:5e, ethertype IPv4 (0x0800), length 98: 198.51.100.1 > 198.51.100.1: ICMP echo request, id 616, seq 435, length 64
> 12:01:12.618650 52:54:00:6a:c7:5e > 52:54:00:6a:c7:5e, ethertype IPv4 (0x0800), length 98: 198.51.100.1 > 198.51.100.1: ICMP echo reply, id 616, seq 435, length 64
>
> `ping` doesn't show any replies (since it's bound to dummy1...?).

I suspect this is related to the version of ping, as it works perfectly
for me on an old Fedora:

IPv4 rp_filter tests
ping: Warning: source address might be selected on device other than dummy1.
PING 198.51.100.1 (198.51.100.1) from 198.51.100.1 dummy1: 56(84) bytes of data.
64 bytes from 198.51.100.1: icmp_seq=1 ttl=64 time=0.452 ms
64 bytes from 198.51.100.1: icmp_seq=2 ttl=64 time=0.418 ms
64 bytes from 198.51.100.1: icmp_seq=3 ttl=64 time=0.485 ms
64 bytes from 198.51.100.1: icmp_seq=4 ttl=64 time=0.446 ms
64 bytes from 198.51.100.1: icmp_seq=5 ttl=64 time=0.388 ms
64 bytes from 198.51.100.1: icmp_seq=6 ttl=64 time=0.362 ms
64 bytes from 198.51.100.1: icmp_seq=7 ttl=64 time=0.373 ms
64 bytes from 198.51.100.1: icmp_seq=8 ttl=64 time=0.515 ms
64 bytes from 198.51.100.1: icmp_seq=9 ttl=64 time=0.641 ms

--- 198.51.100.1 ping statistics ---
9 packets transmitted, 9 received, 0% packet loss, time 8012ms
rtt min/avg/max/mdev = 0.362/0.453/0.641/0.083 ms
    TEST: rp_filter passes local packets                                [ OK ]


I do see the same failure on a new Fedora. So, I will take a look
at this tomorrow.

Thanks!
diff mbox series

Patch

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 58fbde244381..72abf892302f 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1102,8 +1102,7 @@  static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
 					found++;
 					break;
 				}
-				if (rt_can_ecmp)
-					fallback_ins = fallback_ins ?: ins;
+				fallback_ins = fallback_ins ?: ins;
 				goto next_iter;
 			}
 
@@ -1146,7 +1145,9 @@  static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
 	}
 
 	if (fallback_ins && !found) {
-		/* No ECMP-able route found, replace first non-ECMP one */
+		/* No matching route with same ecmp-able-ness found, replace
+		 * first matching route
+		 */
 		ins = fallback_ins;
 		iter = rcu_dereference_protected(*ins,
 				    lockdep_is_held(&rt->fib6_table->tb6_lock));
diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
index 6dd403103800..60273f1bc7d9 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -910,6 +910,12 @@  ipv6_rt_replace_mpath()
 	check_route6 "2001:db8:104::/64 via 2001:db8:101::3 dev veth1 metric 1024"
 	log_test $? 0 "Multipath with single path via multipath attribute"
 
+	# multipath with dev-only
+	add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
+	run_cmd "$IP -6 ro replace 2001:db8:104::/64 dev veth1"
+	check_route6 "2001:db8:104::/64 dev veth1 metric 1024"
+	log_test $? 0 "Multipath with dev-only"
+
 	# route replace fails - invalid nexthop 1
 	add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
 	run_cmd "$IP -6 ro replace 2001:db8:104::/64 nexthop via 2001:db8:111::3 nexthop via 2001:db8:103::3"