Message ID | 1285581957-30694-1-git-send-email-zenczykowski@gmail.com |
---|---|
State | Superseded, archived |
Delegated to: | David Miller |
Headers | show |
From: Maciej Żenczykowski <maze@google.com> Date: Mon, 27 Sep 2010 03:10:12 -0700 > FYI, I'm signed up to netdev on my personal account. You don't need to subscribe in order to post patches to the mailing list. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Maciej Żenczykowski <zenczykowski@gmail.com> Date: Mon, 27 Sep 2010 03:05:57 -0700 > From: Maciej Żenczykowski <maze@google.com> > > Signed-off-by: Maciej Żenczykowski <maze@google.com> Please handle all 4 cases just like the ipv4 routing code does: { saddr = SADDR, ifindex = dev->ifindex } { saddr = SADDR, ifindex = 0 } { saddr = INADDR_ANY, ifindex = dev->ifindex } { saddr = INADDR_ANY, ifindex = 0 } I believe I've specifically asked for this every time someone mentioned that they wanted to fix this issue. And the ipv4 side is a good guide in other ways, it uses a set of two arrays so you can just loop over them, making your rt6_do_pmtu_disc() calls along the way. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please handle all 4 cases just like the ipv4 routing code does: > > { saddr = SADDR, ifindex = dev->ifindex } > { saddr = SADDR, ifindex = 0 } > { saddr = INADDR_ANY, ifindex = dev->ifindex } > { saddr = INADDR_ANY, ifindex = 0 } > > I believe I've specifically asked for this every time someone mentioned that they wanted to fix this issue. I believe that is exactly what the following code fragment from the above patch does: + rt6_do_pmtu_disc(daddr, saddr, net, pmtu, 0); + rt6_do_pmtu_disc(daddr, saddr, net, pmtu, dev->ifindex); + rt6_do_pmtu_disc(daddr, NULL, net, pmtu, 0); + rt6_do_pmtu_disc(daddr, NULL, net, pmtu, dev->ifindex); The last two of these lines were added to the patch last time around per your feedback - precisely to address this issue. I reposted with those changes and you appeared to be ok with the patch, and your only comment was to add a 'Signed-off-by' line. Side note: * I still think that handling the saddr == NULL ie. INADDR_ANY case is entirely superfluous, since it doesn't actually iterate through all possible source addresses. With IPv6 there can be many, many possible source addresses (just think of link local vs global public vs privacy addresses and then tack on 6to4 and mobility, etc... for example I see 13 ipv6 addresses on eth0 on my desktop at home, 12 of them globally reachable). * Currently PMTU discovery is broken. With this patch with all 4 lines we end up doing PMTU discovery per source address that isn't the default source address. Without those last two lines we'd just do PMTU discovery per source address. Not much of a difference, both allow PMTU discovery to actually happen. Except: Imagine a machine with a 1500 MTU network card and 2 source addresses: - one default v6 native 'A' - one additional 6to4 address 'B' The native address can reach the internet with a max MTU of 1500. While due to source address based routing the 6to4 address goes through a sit ipv4 based tunnel and thus has a max MTU of 1480 (1500 - 20 bytes IPv4 header). Imagine a remote machine with v6 address 'Z' (our default address to reach 'Z' is 'A'). Clearly the PMTU of A->Z is 1500, while the PMTU of B->Z is 1480. [Assume there's nothing which would cause them to be lower.] If you do PMTU discovery of A->Z, you will indeed get 1500, and mark that as the PMTU of A->Z. However if you do PMTU discovery of B->Z, you will indeed get 1480, but you will mark that as the PMTU of both B->Z and A->Z (since A is the default to reach Z). Suddenly the PMTU of A->Z _needlessly and incorrectly_ dropped from 1500 to 1480 merely because of PMTU discovery being performed on B->Z. And this is precisely why I think that adding those last two lines that handle INADDR_ANY is actually more of a problem then solution (indeed as far as I can tell, adding them doesn't actually fix any problem - the code works just fine and performs PMTU discovery correctly with just the first two lines). Thankfully ending up with a lower MTU than actually possible is only a slight performance problem, while not doing PMTU discovery correctly at all (current state with asymmetric routing) is a big issue. Hence regardless of whether this patch includes INADDR_ANY or not the end result will still be an improvement. > And the ipv4 side is a good guide in other ways, it uses a set of two arrays so you can just loop over them, making your rt6_do_pmtu_disc() calls along the way. Sure, I can change it to use arrays, although I don't see any particular improvement in performance (even if rt6_do_pmtu_disc was inlined) or readability or anything. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Maciej Żenczykowski <zenczykowski@gmail.com> Date: Tue, 28 Sep 2010 15:37:26 -0700 > * I still think that handling the saddr == NULL ie. INADDR_ANY case is > entirely superfluous, since it doesn't actually iterate through all > possible source addresses. With IPv6 there can be many, many possible > source addresses (just think of link local vs global public vs privacy > addresses and then tack on 6to4 and mobility, etc... for example I see > 13 ipv6 addresses on eth0 on my desktop at home, 12 of them globally > reachable). I only have %100 confidence in the reasoning behind why ipv4 handles things this way, so I'll discuss this in those terms and then try to tie it into the ipv6 side. When we are looking up an ipv4 output route, there are 2 "source address" objects. 1) The one specified in the "struct flowi" for the lookup (the flp->fl4_src passed into ip_route_output_flow) which is also the one that ends up in the routing cache entry's ->fl.fl4_src member. 2) The one contained in the routing cache entry's specification. Ie. rth->rt_src These are distinct. #1 is what is used to hash and find a matching routing cache entry. Since a source address of INADDR_ANY is allowed for routing lookups, routing cache entries for the same daddr/saddr pair can exist in more than one hash chains. Therefore, if we didn't iterate over INADDR_ANY and the specific address in the icmp PMTU message, we'd miss some routing cache entries. Look at the PMTU loops in ipv4 ip_rt_frag_needed(): for (k = 0; k < 2; k++) { for (i = 0; i < 2; i++) { unsigned hash = rt_hash(daddr, skeys[i], ikeys[k], rt_genid(net)); ("ANY vs. specific" ifindex and saddr are used for the hash computation) if (rth->fl.fl4_dst != daddr || rth->fl.fl4_src != skeys[i] || rth->rt_dst != daddr || rth->rt_src != iph->saddr || rth->fl.oif != ikeys[k] || rth->fl.iif != 0 || (and for the routing cache entry flow member comparisons) But the routing cache entry "rt_src" member is compared always to "iph->saddr", it doesn't use the "ANY vs. specific" skey[] value. Unless ipv6 does not allow INADDR_ANY source address specifications during route lookups, it ought to have the same issue too. My understanding is that ipv6 uses a two-layered tree based scheme, one layer to key off of the source address and one layer to key off of the destination address. So it seems to me that the lookups would have the same aliasing issue that ipv4 does, and thus require checking both the specific saddr and also the saddr INADDR_ANY. Maybe the problem is that the ipv6 side uses the same saddr for both the lookup and the entry comparison in these PMTU code paths? Does it not allow specifying them seperately as the ipv4 PMTU (and incidently the RT redirect) code paths do? Or is this not an issue on the ipv6 side for some reason? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 3a74f90..d700d1c 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1559,14 +1559,13 @@ out: * i.e. Path MTU discovery */ -void rt6_pmtu_discovery(struct in6_addr *daddr, struct in6_addr *saddr, - struct net_device *dev, u32 pmtu) +static void rt6_do_pmtu_disc(struct in6_addr *daddr, struct in6_addr *saddr, + struct net *net, u32 pmtu, int ifindex) { struct rt6_info *rt, *nrt; - struct net *net = dev_net(dev); int allfrag = 0; - rt = rt6_lookup(net, daddr, saddr, dev->ifindex, 0); + rt = rt6_lookup(net, daddr, saddr, ifindex, 0); if (rt == NULL) return; @@ -1634,6 +1633,30 @@ out: dst_release(&rt->dst); } +void rt6_pmtu_discovery(struct in6_addr *daddr, struct in6_addr *saddr, + struct net_device *dev, u32 pmtu) +{ + struct net *net = dev_net(dev); + + /* + * RFC 1981 states that a node "MUST reduce the size of the packets it + * is sending along the path" that caused the Packet Too Big message. + * Since it's not possible in the general case to determine which + * interface was used to send the original packet, we update the MTU + * on the interface that will be used to send future packets. We also + * update the MTU on the interface that received the Packet Too Big in + * case the original packet was forced out that interface with + * SO_BINDTODEVICE or similar. This is the next best thing to the + * correct behaviour, which would be to update the MTU on all + * interfaces. + */ + rt6_do_pmtu_disc(daddr, saddr, net, pmtu, 0); + rt6_do_pmtu_disc(daddr, saddr, net, pmtu, dev->ifindex); + /* also support source address based routing */ + rt6_do_pmtu_disc(daddr, NULL, net, pmtu, 0); + rt6_do_pmtu_disc(daddr, NULL, net, pmtu, dev->ifindex); +} + /* * Misc support functions */