Message ID | 1457989187-92851-1-git-send-email-tracywwnj@gmail.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
From: Wei Wang <weiwan@google.com> Date: Mon, 14 Mar 2016 13:59:47 -0700 > From: Wei Wang <weiwan@google.com> > > When ICMPV6_PKT_TOOBIG message is received by a connected UDP socket, > the new mtu value is not properly updated in the dst_entry associated > with the socket. > This leads to the issue that the mtu value returned by getsockopt(sockfd, > IPPROTO_IPV6, IPV6_MTU, ...) is wrong. > The fix is to update sk->sk_dst_cache and other corresponding fields > when a new routing cache is allocated for the new pmtu in UDP connected > socket case. > > Signed-off-by: Wei Wang <weiwan@google.com> Wait a second: > if (nrt6) { > rt6_do_update_pmtu(nrt6, mtu); > - > - /* ip6_ins_rt(nrt6) will bump the > - * rt6->rt6i_node->fn_sernum > - * which will fail the next rt6_check() and > - * invalidate the sk->sk_dst_cache. > - */ > + if (sk) > + ip6_dst_store(sk, &nrt6->dst, daddr, saddr); > ip6_ins_rt(nrt6); > } > } I still haven't seen a satisfactory answer as to why the as-designed invalidation mechanism using fn_sernum is not working. If that's broken, then a lot of other things won't work properly either. I've read the ip6_ins_rt() code path several times, and it always increments the serial number, and therefore the next dst->check() call (which every cached route usage should invoke) should invalide this socket's route and lookup the new one. Why does this not work?
On Wed, 2016-03-16 at 19:53 -0400, David Miller wrote: > From: Wei Wang <weiwan@google.com> > Date: Mon, 14 Mar 2016 13:59:47 -0700 > > > From: Wei Wang <weiwan@google.com> > > > > When ICMPV6_PKT_TOOBIG message is received by a connected UDP socket, > > the new mtu value is not properly updated in the dst_entry associated > > with the socket. > > This leads to the issue that the mtu value returned by getsockopt(sockfd, > > IPPROTO_IPV6, IPV6_MTU, ...) is wrong. > > The fix is to update sk->sk_dst_cache and other corresponding fields > > when a new routing cache is allocated for the new pmtu in UDP connected > > socket case. > > > > Signed-off-by: Wei Wang <weiwan@google.com> > > Wait a second: > > > if (nrt6) { > > rt6_do_update_pmtu(nrt6, mtu); > > - > > - /* ip6_ins_rt(nrt6) will bump the > > - * rt6->rt6i_node->fn_sernum > > - * which will fail the next rt6_check() and > > - * invalidate the sk->sk_dst_cache. > > - */ > > + if (sk) > > + ip6_dst_store(sk, &nrt6->dst, daddr, saddr); > > ip6_ins_rt(nrt6); > > } > > } > > I still haven't seen a satisfactory answer as to why the as-designed > invalidation mechanism using fn_sernum is not working. > > If that's broken, then a lot of other things won't work properly > either. > > I've read the ip6_ins_rt() code path several times, and it always > increments the serial number, and therefore the next dst->check() call > (which every cached route usage should invoke) should invalide > this socket's route and lookup the new one. > > Why does this not work? One of the issue is that IPV6_MTU getsockopt() will not check the dst, but simply use __sk_dst_get() : It will then report old mtu. Not sure we want to use the full check and then if dst appears to be obsolete, do another route lookup ?
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Wed, 16 Mar 2016 17:22:07 -0700 > One of the issue is that IPV6_MTU getsockopt() will not check the dst, > but simply use __sk_dst_get() : It will then report old mtu. That's a bug. ipv4 does it right with a proper sk_dst_get() and so should ipv6.
On Wed, 2016-03-16 at 22:38 -0400, David Miller wrote: > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Wed, 16 Mar 2016 17:22:07 -0700 > > > One of the issue is that IPV6_MTU getsockopt() will not check the dst, > > but simply use __sk_dst_get() : It will then report old mtu. > > That's a bug. > > ipv4 does it right with a proper sk_dst_get() and so should > ipv6. Using rcu + __sk_dst_get() (in IPv6) is absolutely equivalent to sk_dst_get() + dst_release() (in IPv4), modulo atomic ops on dst refcnt... Presumably IPv6 implementation using rcu is slightly better if this getsockopt() is badly needed, but apparently nobody cares. sk_dst_check() is a different beast. The problem is that dst_mtu(dst) is not able to perform a route lookup by itself. Do we really want to use ip6_sk_dst_lookup_flow() and its associated setup for this IP6_MTU thing, and maybe other points we might use an obsolete dst ? Looking at the complexity of udpv6_sendmsg() and rawv6_sendmsg() I really wonder if it is worth extracting the route logic.
On Mon, Mar 14, 2016 at 01:59:47PM -0700, Wei Wang wrote: > From: Wei Wang <weiwan@google.com> > > When ICMPV6_PKT_TOOBIG message is received by a connected UDP socket, > the new mtu value is not properly updated in the dst_entry associated > with the socket. A nit picking, the new mtu value cannot always be set directly to the current dst_entry associated with the socket (i.e. sk->sk_dst_cache). In this case, a RTF_CACHE clone has to be created. The problem could be better understood if the commit message was like: "After creating the RTF_CACHE clone (with the new mtu value), sk->sk_dst_cache is not _immediately_ set to this RTF_CACHE clone. getsockopt(IPV6_MTU) does not do a dst_check() first. Hence, if there was no outgoing message to trigger the dst_check() invalidation logic, it may return the stale mtu value." > -void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu, int oif, > - u32 mark); > +void ip6_update_pmtu(struct net *net, struct sock *sk, struct sk_buff *skb, > + __be32 mtu, int oif, u32 mark); > void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu); This change seems to make the API a bit confusing. The none _sk_ version is also taking a sk param now. > --- a/net/ipv6/route.c > +++ b/net/ipv6/route.c > @@ -1346,7 +1346,7 @@ static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt) > (rt->rt6i_flags & RTF_PCPU || rt->rt6i_node); > } > > -static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk, > +static void __ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk, > const struct ipv6hdr *iph, u32 mtu) > { > struct rt6_info *rt6 = (struct rt6_info *)dst; > @@ -1377,12 +1377,8 @@ static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk, > nrt6 = ip6_rt_cache_alloc(rt6, daddr, saddr); > if (nrt6) { > rt6_do_update_pmtu(nrt6, mtu); > - > - /* ip6_ins_rt(nrt6) will bump the > - * rt6->rt6i_node->fn_sernum > - * which will fail the next rt6_check() and > - * invalidate the sk->sk_dst_cache. > - */ > + if (sk) > + ip6_dst_store(sk, &nrt6->dst, daddr, saddr); daddr/saddr could be from iph which is from skb. Considering skb could be gone, are they suitable to be set in np->daddr_cache and np->saddr_cache? After looking at this patch, I like your last patch more because this problem seems to be limited to the connected udp socket only (?) and udp knows better on what to pass to ip6_dst_store(). Feeling bad now about steering you to this direction :( Thanks, -- Martin
On Wed, Mar 16, 2016 at 8:34 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Wed, 2016-03-16 at 22:38 -0400, David Miller wrote: >> From: Eric Dumazet <eric.dumazet@gmail.com> >> Date: Wed, 16 Mar 2016 17:22:07 -0700 >> >> > One of the issue is that IPV6_MTU getsockopt() will not check the dst, >> > but simply use __sk_dst_get() : It will then report old mtu. >> >> That's a bug. >> >> ipv4 does it right with a proper sk_dst_get() and so should >> ipv6. > > Using rcu + __sk_dst_get() (in IPv6) is absolutely equivalent to > sk_dst_get() + dst_release() (in IPv4), modulo atomic ops on dst > refcnt... > > Presumably IPv6 implementation using rcu is slightly better if this > getsockopt() is badly needed, but apparently nobody cares. > > sk_dst_check() is a different beast. > > The problem is that dst_mtu(dst) is not able to perform a route lookup > by itself. > > Do we really want to use ip6_sk_dst_lookup_flow() and its associated > setup for this IP6_MTU thing, and maybe other points we might use an > obsolete dst ? > > Looking at the complexity of udpv6_sendmsg() and rawv6_sendmsg() I > really wonder if it is worth extracting the route logic. I think the reason why IPv4 does this correctly is ipv4_sk_update_pmtu() takes care of the sk cached dst, while the similar part for IPv6, ip6_sk_update_pmtu() does not that. As Martin already points it out in the previous version.
On Fri, Mar 18, 2016 at 10:45 AM, Wei Wang <weiwan@google.com> wrote: > Thanks all for the comments. > > Cong, you are right that for ipv6, the code does not take care of updating > sk_dst_cache entry like Ipv4. And this patch is updating the entry by > calling ip6_dst_store() to achieve similar functionality as Ipv4. My question is why not updating ip6_sk_update_pmtu() to sync with ipv4?
I don't think ip6_sk_update_pmtu() is a good place to put it as all it does is to call ip6_update_pmtu(). And ip6_update_pmtu() does the route lookup and call __ip6_rt_update_pmtu. We can put it in ip6_update_pmtu(). But that still means we need to pass sk to ip6_update_pmtu() and I don't think it makes any difference compared to the current fix. Thanks. Wei On Fri, Mar 18, 2016 at 11:10 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote: > On Fri, Mar 18, 2016 at 10:45 AM, Wei Wang <weiwan@google.com> wrote: >> Thanks all for the comments. >> >> Cong, you are right that for ipv6, the code does not take care of updating >> sk_dst_cache entry like Ipv4. And this patch is updating the entry by >> calling ip6_dst_store() to achieve similar functionality as Ipv4. > > My question is why not updating ip6_sk_update_pmtu() to sync with ipv4?
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index 295d291..2b147a8 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -115,8 +115,8 @@ void rt6_purge_dflt_routers(struct net *net); int rt6_route_rcv(struct net_device *dev, u8 *opt, int len, const struct in6_addr *gwaddr); -void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu, int oif, - u32 mark); +void ip6_update_pmtu(struct net *net, struct sock *sk, struct sk_buff *skb, + __be32 mtu, int oif, u32 mark); void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu); void ip6_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark); void ip6_redirect_no_header(struct sk_buff *skb, struct net *net, int oif, diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c index 0630a4d5..2c926ec 100644 --- a/net/ipv6/ah6.c +++ b/net/ipv6/ah6.c @@ -664,7 +664,7 @@ static int ah6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, if (type == NDISC_REDIRECT) ip6_redirect(skb, net, skb->dev->ifindex, 0); else - ip6_update_pmtu(skb, net, info, 0, 0); + ip6_update_pmtu(net, NULL, skb, info, 0, 0); xfrm_state_put(x); return 0; diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index 060a60b..b74847a 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -476,7 +476,7 @@ static int esp6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, if (type == NDISC_REDIRECT) ip6_redirect(skb, net, skb->dev->ifindex, 0); else - ip6_update_pmtu(skb, net, info, 0, 0); + ip6_update_pmtu(net, NULL, skb, info, 0, 0); xfrm_state_put(x); return 0; diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c index 0a37ddc..03816f5 100644 --- a/net/ipv6/icmp.c +++ b/net/ipv6/icmp.c @@ -92,7 +92,7 @@ static void icmpv6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, struct net *net = dev_net(skb->dev); if (type == ICMPV6_PKT_TOOBIG) - ip6_update_pmtu(skb, net, info, 0, 0); + ip6_update_pmtu(net, NULL, skb, info, 0, 0); else if (type == NDISC_REDIRECT) ip6_redirect(skb, net, skb->dev->ifindex, 0); diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c index d90a11f..fa873ca 100644 --- a/net/ipv6/ip6_vti.c +++ b/net/ipv6/ip6_vti.c @@ -599,7 +599,7 @@ static int vti6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, if (type == NDISC_REDIRECT) ip6_redirect(skb, net, skb->dev->ifindex, 0); else - ip6_update_pmtu(skb, net, info, 0, 0); + ip6_update_pmtu(net, NULL, skb, info, 0, 0); xfrm_state_put(x); return 0; diff --git a/net/ipv6/ipcomp6.c b/net/ipv6/ipcomp6.c index 1b9316e..c07a5ac 100644 --- a/net/ipv6/ipcomp6.c +++ b/net/ipv6/ipcomp6.c @@ -76,7 +76,7 @@ static int ipcomp6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, if (type == NDISC_REDIRECT) ip6_redirect(skb, net, skb->dev->ifindex, 0); else - ip6_update_pmtu(skb, net, info, 0, 0); + ip6_update_pmtu(net, NULL, skb, info, 0, 0); xfrm_state_put(x); return 0; diff --git a/net/ipv6/route.c b/net/ipv6/route.c index ed44663..8f6a5f1 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1346,7 +1346,7 @@ static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt) (rt->rt6i_flags & RTF_PCPU || rt->rt6i_node); } -static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk, +static void __ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk, const struct ipv6hdr *iph, u32 mtu) { struct rt6_info *rt6 = (struct rt6_info *)dst; @@ -1377,12 +1377,8 @@ static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk, nrt6 = ip6_rt_cache_alloc(rt6, daddr, saddr); if (nrt6) { rt6_do_update_pmtu(nrt6, mtu); - - /* ip6_ins_rt(nrt6) will bump the - * rt6->rt6i_node->fn_sernum - * which will fail the next rt6_check() and - * invalidate the sk->sk_dst_cache. - */ + if (sk) + ip6_dst_store(sk, &nrt6->dst, daddr, saddr); ip6_ins_rt(nrt6); } } @@ -1394,8 +1390,8 @@ static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk, __ip6_rt_update_pmtu(dst, sk, skb ? ipv6_hdr(skb) : NULL, mtu); } -void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu, - int oif, u32 mark) +void ip6_update_pmtu(struct net *net, struct sock *sk, + struct sk_buff *skb, __be32 mtu, int oif, u32 mark) { const struct ipv6hdr *iph = (struct ipv6hdr *) skb->data; struct dst_entry *dst; @@ -1410,15 +1406,16 @@ void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu, dst = ip6_route_output(net, NULL, &fl6); if (!dst->error) - __ip6_rt_update_pmtu(dst, NULL, iph, ntohl(mtu)); + __ip6_rt_update_pmtu(dst, sk, iph, ntohl(mtu)); dst_release(dst); } EXPORT_SYMBOL_GPL(ip6_update_pmtu); void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu) { - ip6_update_pmtu(skb, sock_net(sk), mtu, - sk->sk_bound_dev_if, sk->sk_mark); + ip6_update_pmtu(sock_net(sk), + (sk->sk_state != TCP_ESTABLISHED) ? NULL : sk, + skb, mtu, sk->sk_bound_dev_if, sk->sk_mark); } EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);