diff mbox series

IPv6 PMTU discovery fails with source-specific routing

Message ID 71e7331f-d528-430e-f880-e995ff53d362@lists.m7n.se
State Not Applicable
Delegated to: David Miller
Headers show
Series IPv6 PMTU discovery fails with source-specific routing | expand

Commit Message

Mikael Magnusson May 13, 2019, 7:22 p.m. UTC
Hello list,

I think I have found a regression in 4.15+ kernels. IPv6 PMTU discovery 
doesn't seem to work with source-specific routing (AKA source-address 
dependent routing, SADR).

I made a test script (see attachment). It sets up a test environment 
with three network namespaces (a, b and c) using SADR. The link between 
b and c is configured with MTU 1280. It then runs a ping test with large 
packets.

I have tested a couple of kernels on Ubuntu 19.04 with the following 
results.

mainline 4.14.117-0414117-generic SUCCESS
ubuntu   4.15.0-1036-oem          FAIL
mainline 5.1.0-050100-generic     FAIL

The mainline kernels are from 
https://kernel.ubuntu.com/~kernel-ppa/mainline/, and the other from the 
Ubuntu 19.04 repository.

I have attached a patch against 5.1 which seems to fix the problem in 
the test case.

It's bug #1788623 in Ubuntu: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1788623

/Mikael

Comments

David Ahern May 14, 2019, 3:35 a.m. UTC | #1
On 5/13/19 1:22 PM, Mikael Magnusson wrote:
> Hello list,
> 
> I think I have found a regression in 4.15+ kernels. IPv6 PMTU discovery
> doesn't seem to work with source-specific routing (AKA source-address
> dependent routing, SADR).
> 
> I made a test script (see attachment). It sets up a test environment
> with three network namespaces (a, b and c) using SADR. The link between
> b and c is configured with MTU 1280. It then runs a ping test with large
> packets.
> 
> I have tested a couple of kernels on Ubuntu 19.04 with the following
> results.
> 
> mainline 4.14.117-0414117-generic SUCCESS
> ubuntu   4.15.0-1036-oem          FAIL
> mainline 5.1.0-050100-generic     FAIL
> 

git bisect shows

good: 38fbeeeeccdb38d0635398e8e344d245f6d8dc52
bad:  2b760fcf5cfb34e8610df56d83745b2b74ae1379

Those are back to back commits so
    2b760fcf5cfb ipv6: hook up exception table to store dst cache

has to be the bad commit.

Your patch may work, but does not seem logical relative to code at the
time of 4.15 and the commit that caused the failure. cc'ing authors of
the changes referenced above.
Wei Wang May 14, 2019, 6:12 a.m. UTC | #2
Thanks Mikael for reporting this issue. And thanks David for the bisection.
Let me spend some time to reproduce it and see what is going on.

From: David Ahern <dsahern@gmail.com>
Date: Mon, May 13, 2019 at 8:35 PM
To: Mikael Magnusson, <netdev@vger.kernel.org>, Martin KaFai Lau, Wei Wang

> On 5/13/19 1:22 PM, Mikael Magnusson wrote:
> > Hello list,
> >
> > I think I have found a regression in 4.15+ kernels. IPv6 PMTU discovery
> > doesn't seem to work with source-specific routing (AKA source-address
> > dependent routing, SADR).
> >
> > I made a test script (see attachment). It sets up a test environment
> > with three network namespaces (a, b and c) using SADR. The link between
> > b and c is configured with MTU 1280. It then runs a ping test with large
> > packets.
> >
> > I have tested a couple of kernels on Ubuntu 19.04 with the following
> > results.
> >
> > mainline 4.14.117-0414117-generic SUCCESS
> > ubuntu   4.15.0-1036-oem          FAIL
> > mainline 5.1.0-050100-generic     FAIL
> >
>
> git bisect shows
>
> good: 38fbeeeeccdb38d0635398e8e344d245f6d8dc52
> bad:  2b760fcf5cfb34e8610df56d83745b2b74ae1379
>
> Those are back to back commits so
>     2b760fcf5cfb ipv6: hook up exception table to store dst cache
>
> has to be the bad commit.
>
> Your patch may work, but does not seem logical relative to code at the
> time of 4.15 and the commit that caused the failure. cc'ing authors of
> the changes referenced above.
Stefano Brivio May 14, 2019, 2:33 p.m. UTC | #3
On Mon, 13 May 2019 23:12:31 -0700
Wei Wang <weiwan@google.com> wrote:

> Thanks Mikael for reporting this issue. And thanks David for the bisection.
> Let me spend some time to reproduce it and see what is going on.

Mikael, by the way, once this is sorted out, it would be nice if you
could add your test as a case in tools/testing/selftests/net/pmtu.sh --
you could probably reuse all the setup parts that are already
implemented there.
Wei Wang May 14, 2019, 7:33 p.m. UTC | #4
I think the bug is because when creating exceptions, src_addr is not
always set even though fib6_info is in the subtree. (because of
rt6_is_gw_or_nonexthop() check)
However, when looking up for exceptions, we always set src_addr to the
passed in flow->src_addr if fib6_info is in the subtree. That causes
the exception lookup to fail.
I will make it consistent.
However, I don't quite understand the following logic in ip6_rt_cache_alloc():
        if (!rt6_is_gw_or_nonexthop(ort)) {
                if (ort->fib6_dst.plen != 128 &&
                    ipv6_addr_equal(&ort->fib6_dst.addr, daddr))
                        rt->rt6i_flags |= RTF_ANYCAST;
#ifdef CONFIG_IPV6_SUBTREES
                if (rt->rt6i_src.plen && saddr) {
                        rt->rt6i_src.addr = *saddr;
                        rt->rt6i_src.plen = 128;
                }
#endif
        }
Why do we need to check that the route is not gateway and has next hop
for updating rt6i_src? I checked the git history and it seems this
part was there from very early on (with some refactor in between)...


From: Stefano Brivio <sbrivio@redhat.com>
Date: Tue, May 14, 2019 at 7:33 AM
To: Mikael Magnusson
Cc: Wei Wang, David Ahern, Linux Kernel Network Developers, Martin KaFai Lau

> On Mon, 13 May 2019 23:12:31 -0700
> Wei Wang <weiwan@google.com> wrote:
>
> > Thanks Mikael for reporting this issue. And thanks David for the bisection.
> > Let me spend some time to reproduce it and see what is going on.
>
> Mikael, by the way, once this is sorted out, it would be nice if you
> could add your test as a case in tools/testing/selftests/net/pmtu.sh --
> you could probably reuse all the setup parts that are already
> implemented there.
>
> --
> Stefano
David Ahern May 15, 2019, 3:43 p.m. UTC | #5
On 5/14/19 1:33 PM, Wei Wang wrote:
> I think the bug is because when creating exceptions, src_addr is not
> always set even though fib6_info is in the subtree. (because of
> rt6_is_gw_or_nonexthop() check)
> However, when looking up for exceptions, we always set src_addr to the
> passed in flow->src_addr if fib6_info is in the subtree. That causes
> the exception lookup to fail.
> I will make it consistent.
> However, I don't quite understand the following logic in ip6_rt_cache_alloc():
>         if (!rt6_is_gw_or_nonexthop(ort)) {
>                 if (ort->fib6_dst.plen != 128 &&
>                     ipv6_addr_equal(&ort->fib6_dst.addr, daddr))
>                         rt->rt6i_flags |= RTF_ANYCAST;
> #ifdef CONFIG_IPV6_SUBTREES
>                 if (rt->rt6i_src.plen && saddr) {
>                         rt->rt6i_src.addr = *saddr;
>                         rt->rt6i_src.plen = 128;
>                 }
> #endif
>         }
> Why do we need to check that the route is not gateway and has next hop
> for updating rt6i_src? I checked the git history and it seems this
> part was there from very early on (with some refactor in between)...

I can not make sense of that check either.
Martin KaFai Lau May 15, 2019, 6:06 p.m. UTC | #6
On Tue, May 14, 2019 at 12:33:25PM -0700, Wei Wang wrote:
> I think the bug is because when creating exceptions, src_addr is not
> always set even though fib6_info is in the subtree. (because of
> rt6_is_gw_or_nonexthop() check)
> However, when looking up for exceptions, we always set src_addr to the
> passed in flow->src_addr if fib6_info is in the subtree. That causes
> the exception lookup to fail.
> I will make it consistent.
> However, I don't quite understand the following logic in ip6_rt_cache_alloc():
>         if (!rt6_is_gw_or_nonexthop(ort)) {
>                 if (ort->fib6_dst.plen != 128 &&
>                     ipv6_addr_equal(&ort->fib6_dst.addr, daddr))
>                         rt->rt6i_flags |= RTF_ANYCAST;
> #ifdef CONFIG_IPV6_SUBTREES
>                 if (rt->rt6i_src.plen && saddr) {
>                         rt->rt6i_src.addr = *saddr;
>                         rt->rt6i_src.plen = 128;
>                 }
> #endif
>         }
> Why do we need to check that the route is not gateway and has next hop
> for updating rt6i_src? I checked the git history and it seems this
> part was there from very early on (with some refactor in between)...
I also failed to understand the RTF_GATEWAY check.  The earliest related
commit seems to be c440f1609b65 ("ipv6: Do not depend on rt->n in ip6_pol_route().")

How was it working when the exception route was in the tree?

> 
> 
> From: Stefano Brivio <sbrivio@redhat.com>
> Date: Tue, May 14, 2019 at 7:33 AM
> To: Mikael Magnusson
> Cc: Wei Wang, David Ahern, Linux Kernel Network Developers, Martin KaFai Lau
> 
> > On Mon, 13 May 2019 23:12:31 -0700
> > Wei Wang <weiwan@google.com> wrote:
> >
> > > Thanks Mikael for reporting this issue. And thanks David for the bisection.
> > > Let me spend some time to reproduce it and see what is going on.
> >
> > Mikael, by the way, once this is sorted out, it would be nice if you
> > could add your test as a case in tools/testing/selftests/net/pmtu.sh --
> > you could probably reuse all the setup parts that are already
> > implemented there.
> >
> > --
> > Stefano
Wei Wang May 15, 2019, 6:31 p.m. UTC | #7
On Wed, May 15, 2019 at 11:06 AM Martin Lau <kafai@fb.com> wrote:
>
> On Tue, May 14, 2019 at 12:33:25PM -0700, Wei Wang wrote:
> > I think the bug is because when creating exceptions, src_addr is not
> > always set even though fib6_info is in the subtree. (because of
> > rt6_is_gw_or_nonexthop() check)
> > However, when looking up for exceptions, we always set src_addr to the
> > passed in flow->src_addr if fib6_info is in the subtree. That causes
> > the exception lookup to fail.
> > I will make it consistent.
> > However, I don't quite understand the following logic in ip6_rt_cache_alloc():
> >         if (!rt6_is_gw_or_nonexthop(ort)) {
> >                 if (ort->fib6_dst.plen != 128 &&
> >                     ipv6_addr_equal(&ort->fib6_dst.addr, daddr))
> >                         rt->rt6i_flags |= RTF_ANYCAST;
> > #ifdef CONFIG_IPV6_SUBTREES
> >                 if (rt->rt6i_src.plen && saddr) {
> >                         rt->rt6i_src.addr = *saddr;
> >                         rt->rt6i_src.plen = 128;
> >                 }
> > #endif
> >         }
> > Why do we need to check that the route is not gateway and has next hop
> > for updating rt6i_src? I checked the git history and it seems this
> > part was there from very early on (with some refactor in between)...
> I also failed to understand the RTF_GATEWAY check.  The earliest related
> commit seems to be c440f1609b65 ("ipv6: Do not depend on rt->n in ip6_pol_route().")
>
> How was it working when the exception route was in the tree?
>
When adding all exception route to the main routing tree, because
route cache has dest_addr as /128, the longest prefix match will
always match the /128 route entry.

> >
> >
> > From: Stefano Brivio <sbrivio@redhat.com>
> > Date: Tue, May 14, 2019 at 7:33 AM
> > To: Mikael Magnusson
> > Cc: Wei Wang, David Ahern, Linux Kernel Network Developers, Martin KaFai Lau
> >
> > > On Mon, 13 May 2019 23:12:31 -0700
> > > Wei Wang <weiwan@google.com> wrote:
> > >
> > > > Thanks Mikael for reporting this issue. And thanks David for the bisection.
> > > > Let me spend some time to reproduce it and see what is going on.
> > >
> > > Mikael, by the way, once this is sorted out, it would be nice if you
> > > could add your test as a case in tools/testing/selftests/net/pmtu.sh --
> > > you could probably reuse all the setup parts that are already
> > > implemented there.
> > >
> > > --
> > > Stefano
Martin KaFai Lau May 15, 2019, 9:54 p.m. UTC | #8
On Wed, May 15, 2019 at 11:31:44AM -0700, Wei Wang wrote:
> On Wed, May 15, 2019 at 11:06 AM Martin Lau <kafai@fb.com> wrote:
> >
> > On Tue, May 14, 2019 at 12:33:25PM -0700, Wei Wang wrote:
> > > I think the bug is because when creating exceptions, src_addr is not
> > > always set even though fib6_info is in the subtree. (because of
> > > rt6_is_gw_or_nonexthop() check)
> > > However, when looking up for exceptions, we always set src_addr to the
> > > passed in flow->src_addr if fib6_info is in the subtree. That causes
> > > the exception lookup to fail.
> > > I will make it consistent.
> > > However, I don't quite understand the following logic in ip6_rt_cache_alloc():
> > >         if (!rt6_is_gw_or_nonexthop(ort)) {
> > >                 if (ort->fib6_dst.plen != 128 &&
> > >                     ipv6_addr_equal(&ort->fib6_dst.addr, daddr))
> > >                         rt->rt6i_flags |= RTF_ANYCAST;
> > > #ifdef CONFIG_IPV6_SUBTREES
> > >                 if (rt->rt6i_src.plen && saddr) {
> > >                         rt->rt6i_src.addr = *saddr;
> > >                         rt->rt6i_src.plen = 128;
> > >                 }
> > > #endif
> > >         }
> > > Why do we need to check that the route is not gateway and has next hop
> > > for updating rt6i_src? I checked the git history and it seems this
> > > part was there from very early on (with some refactor in between)...
> > I also failed to understand the RTF_GATEWAY check.  The earliest related
> > commit seems to be c440f1609b65 ("ipv6: Do not depend on rt->n in ip6_pol_route().")
> >
> > How was it working when the exception route was in the tree?
> >
> When adding all exception route to the main routing tree, because
> route cache has dest_addr as /128, the longest prefix match will
> always match the /128 route entry.
Got it.  Thanks for the explanation.

> 
> > >
> > >
> > > From: Stefano Brivio <sbrivio@redhat.com>
> > > Date: Tue, May 14, 2019 at 7:33 AM
> > > To: Mikael Magnusson
> > > Cc: Wei Wang, David Ahern, Linux Kernel Network Developers, Martin KaFai Lau
> > >
> > > > On Mon, 13 May 2019 23:12:31 -0700
> > > > Wei Wang <weiwan@google.com> wrote:
> > > >
> > > > > Thanks Mikael for reporting this issue. And thanks David for the bisection.
> > > > > Let me spend some time to reproduce it and see what is going on.
> > > >
> > > > Mikael, by the way, once this is sorted out, it would be nice if you
> > > > could add your test as a case in tools/testing/selftests/net/pmtu.sh --
> > > > you could probably reuse all the setup parts that are already
> > > > implemented there.
> > > >
> > > > --
> > > > Stefano
diff mbox series

Patch

From 4dd6d3e00663ec1749d8c6cf8139a2e255c7a797 Mon Sep 17 00:00:00 2001
From: Mikael Magnusson <mikma@users.sourceforge.net>
Date: Wed, 8 May 2019 22:06:44 +0000
Subject: [PATCH] net/ipv6/route: Fix PMTU for source-specific routes

---
 net/ipv6/route.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0520aca3354b..63d678588ec9 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1197,13 +1197,13 @@  static struct rt6_info *ip6_rt_cache_alloc(struct fib6_info *ort,
 		if (ort->fib6_dst.plen != 128 &&
 		    ipv6_addr_equal(&ort->fib6_dst.addr, daddr))
 			rt->rt6i_flags |= RTF_ANYCAST;
+	}
 #ifdef CONFIG_IPV6_SUBTREES
-		if (rt->rt6i_src.plen && saddr) {
-			rt->rt6i_src.addr = *saddr;
-			rt->rt6i_src.plen = 128;
-		}
-#endif
+	if (rt->rt6i_src.plen && saddr) {
+		rt->rt6i_src.addr = *saddr;
+		rt->rt6i_src.plen = 128;
 	}
+#endif
 
 	return rt;
 }
-- 
2.17.1