diff mbox

ipv6: Fix the pmtu path for connected UDP socket

Message ID 1457989187-92851-1-git-send-email-tracywwnj@gmail.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Wei Wang March 14, 2016, 8:59 p.m. UTC
From: Wei Wang <weiwan@google.com>

When ICMPV6_PKT_TOOBIG message is received by a connected UDP socket,
the new mtu value is not properly updated in the dst_entry associated
with the socket.
This leads to the issue that the mtu value returned by getsockopt(sockfd,
IPPROTO_IPV6, IPV6_MTU, ...) is wrong.
The fix is to update sk->sk_dst_cache and other corresponding fields
when a new routing cache is allocated for the new pmtu in UDP connected
socket case.

Signed-off-by: Wei Wang <weiwan@google.com>
---
 include/net/ip6_route.h |  4 ++--
 net/ipv6/ah6.c          |  2 +-
 net/ipv6/esp6.c         |  2 +-
 net/ipv6/icmp.c         |  2 +-
 net/ipv6/ip6_vti.c      |  2 +-
 net/ipv6/ipcomp6.c      |  2 +-
 net/ipv6/route.c        | 21 +++++++++------------
 7 files changed, 16 insertions(+), 19 deletions(-)

Comments

David Miller March 16, 2016, 11:53 p.m. UTC | #1
From: Wei Wang <weiwan@google.com>
Date: Mon, 14 Mar 2016 13:59:47 -0700

> From: Wei Wang <weiwan@google.com>
> 
> When ICMPV6_PKT_TOOBIG message is received by a connected UDP socket,
> the new mtu value is not properly updated in the dst_entry associated
> with the socket.
> This leads to the issue that the mtu value returned by getsockopt(sockfd,
> IPPROTO_IPV6, IPV6_MTU, ...) is wrong.
> The fix is to update sk->sk_dst_cache and other corresponding fields
> when a new routing cache is allocated for the new pmtu in UDP connected
> socket case.
> 
> Signed-off-by: Wei Wang <weiwan@google.com>

Wait a second:

>  		if (nrt6) {
>  			rt6_do_update_pmtu(nrt6, mtu);
> -
> -			/* ip6_ins_rt(nrt6) will bump the
> -			 * rt6->rt6i_node->fn_sernum
> -			 * which will fail the next rt6_check() and
> -			 * invalidate the sk->sk_dst_cache.
> -			 */
> +			if (sk)
> +				ip6_dst_store(sk, &nrt6->dst, daddr, saddr);
>  			ip6_ins_rt(nrt6);
>  		}
>  	}

I still haven't seen a satisfactory answer as to why the as-designed
invalidation mechanism using fn_sernum is not working.

If that's broken, then a lot of other things won't work properly
either.

I've read the ip6_ins_rt() code path several times, and it always
increments the serial number, and therefore the next dst->check() call
(which every cached route usage should invoke) should invalide
this socket's route and lookup the new one.

Why does this not work?
Eric Dumazet March 17, 2016, 12:22 a.m. UTC | #2
On Wed, 2016-03-16 at 19:53 -0400, David Miller wrote:
> From: Wei Wang <weiwan@google.com>
> Date: Mon, 14 Mar 2016 13:59:47 -0700
> 
> > From: Wei Wang <weiwan@google.com>
> > 
> > When ICMPV6_PKT_TOOBIG message is received by a connected UDP socket,
> > the new mtu value is not properly updated in the dst_entry associated
> > with the socket.
> > This leads to the issue that the mtu value returned by getsockopt(sockfd,
> > IPPROTO_IPV6, IPV6_MTU, ...) is wrong.
> > The fix is to update sk->sk_dst_cache and other corresponding fields
> > when a new routing cache is allocated for the new pmtu in UDP connected
> > socket case.
> > 
> > Signed-off-by: Wei Wang <weiwan@google.com>
> 
> Wait a second:
> 
> >  		if (nrt6) {
> >  			rt6_do_update_pmtu(nrt6, mtu);
> > -
> > -			/* ip6_ins_rt(nrt6) will bump the
> > -			 * rt6->rt6i_node->fn_sernum
> > -			 * which will fail the next rt6_check() and
> > -			 * invalidate the sk->sk_dst_cache.
> > -			 */
> > +			if (sk)
> > +				ip6_dst_store(sk, &nrt6->dst, daddr, saddr);
> >  			ip6_ins_rt(nrt6);
> >  		}
> >  	}
> 
> I still haven't seen a satisfactory answer as to why the as-designed
> invalidation mechanism using fn_sernum is not working.
> 
> If that's broken, then a lot of other things won't work properly
> either.
> 
> I've read the ip6_ins_rt() code path several times, and it always
> increments the serial number, and therefore the next dst->check() call
> (which every cached route usage should invoke) should invalide
> this socket's route and lookup the new one.
> 
> Why does this not work?


One of the issue is that IPV6_MTU getsockopt() will not check the dst,
but simply use __sk_dst_get() : It will then report old mtu.

Not sure we want to use the full check and then if dst appears to be
obsolete, do another route lookup ?
David Miller March 17, 2016, 2:38 a.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 16 Mar 2016 17:22:07 -0700

> One of the issue is that IPV6_MTU getsockopt() will not check the dst,
> but simply use __sk_dst_get() : It will then report old mtu.

That's a bug.

ipv4 does it right with a proper sk_dst_get() and so should
ipv6.
Eric Dumazet March 17, 2016, 3:34 a.m. UTC | #4
On Wed, 2016-03-16 at 22:38 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 16 Mar 2016 17:22:07 -0700
> 
> > One of the issue is that IPV6_MTU getsockopt() will not check the dst,
> > but simply use __sk_dst_get() : It will then report old mtu.
> 
> That's a bug.
> 
> ipv4 does it right with a proper sk_dst_get() and so should
> ipv6.

Using rcu + __sk_dst_get() (in IPv6) is absolutely equivalent to
sk_dst_get() + dst_release() (in IPv4), modulo atomic ops on dst
refcnt...

Presumably IPv6 implementation using rcu is slightly better if this
getsockopt() is badly needed, but apparently nobody cares.

sk_dst_check() is a different beast.

The problem is that dst_mtu(dst) is not able to perform a route lookup
by itself.

Do we really want to use ip6_sk_dst_lookup_flow() and its associated
setup for this IP6_MTU thing, and maybe other points we might use an
obsolete dst ?

Looking at the complexity of udpv6_sendmsg() and rawv6_sendmsg() I
really wonder if it is worth extracting the route logic.
Martin KaFai Lau March 18, 2016, 7:17 a.m. UTC | #5
On Mon, Mar 14, 2016 at 01:59:47PM -0700, Wei Wang wrote:
> From: Wei Wang <weiwan@google.com>
>
> When ICMPV6_PKT_TOOBIG message is received by a connected UDP socket,
> the new mtu value is not properly updated in the dst_entry associated
> with the socket.
A nit picking, the new mtu value cannot always be set directly
to the current dst_entry associated with the socket (i.e. sk->sk_dst_cache).
In this case, a RTF_CACHE clone has to be created.

The problem could be better understood if the commit message was like:
"After creating the RTF_CACHE clone (with the new mtu value),
sk->sk_dst_cache is not _immediately_ set to this RTF_CACHE clone.
getsockopt(IPV6_MTU) does not do a dst_check() first.  Hence,
if there was no outgoing message to trigger the dst_check() invalidation
logic, it may return the stale mtu value."

> -void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu, int oif,
> -		     u32 mark);
> +void ip6_update_pmtu(struct net *net, struct sock *sk, struct sk_buff *skb,
> +		     __be32 mtu, int oif, u32 mark);
>  void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu);
This change seems to make the API a bit confusing.  The none _sk_ version
is also taking a sk param now.

> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1346,7 +1346,7 @@ static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt)
>  		(rt->rt6i_flags & RTF_PCPU || rt->rt6i_node);
>  }
>
> -static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
> +static void __ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
>  				 const struct ipv6hdr *iph, u32 mtu)
>  {
>  	struct rt6_info *rt6 = (struct rt6_info *)dst;
> @@ -1377,12 +1377,8 @@ static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
>  		nrt6 = ip6_rt_cache_alloc(rt6, daddr, saddr);
>  		if (nrt6) {
>  			rt6_do_update_pmtu(nrt6, mtu);
> -
> -			/* ip6_ins_rt(nrt6) will bump the
> -			 * rt6->rt6i_node->fn_sernum
> -			 * which will fail the next rt6_check() and
> -			 * invalidate the sk->sk_dst_cache.
> -			 */
> +			if (sk)
> +				ip6_dst_store(sk, &nrt6->dst, daddr, saddr);
daddr/saddr could be from iph which is from skb.  Considering skb could be
gone, are they suitable to be set in np->daddr_cache and np->saddr_cache?

After looking at this patch, I like your last patch more because this
problem seems to be limited to the connected udp socket only (?) and
udp knows better on what to pass to ip6_dst_store().  Feeling bad
now about steering you to this direction :(

Thanks,
-- Martin
Cong Wang March 18, 2016, 5:19 p.m. UTC | #6
On Wed, Mar 16, 2016 at 8:34 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-03-16 at 22:38 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 16 Mar 2016 17:22:07 -0700
>>
>> > One of the issue is that IPV6_MTU getsockopt() will not check the dst,
>> > but simply use __sk_dst_get() : It will then report old mtu.
>>
>> That's a bug.
>>
>> ipv4 does it right with a proper sk_dst_get() and so should
>> ipv6.
>
> Using rcu + __sk_dst_get() (in IPv6) is absolutely equivalent to
> sk_dst_get() + dst_release() (in IPv4), modulo atomic ops on dst
> refcnt...
>
> Presumably IPv6 implementation using rcu is slightly better if this
> getsockopt() is badly needed, but apparently nobody cares.
>
> sk_dst_check() is a different beast.
>
> The problem is that dst_mtu(dst) is not able to perform a route lookup
> by itself.
>
> Do we really want to use ip6_sk_dst_lookup_flow() and its associated
> setup for this IP6_MTU thing, and maybe other points we might use an
> obsolete dst ?
>
> Looking at the complexity of udpv6_sendmsg() and rawv6_sendmsg() I
> really wonder if it is worth extracting the route logic.

I think the reason why IPv4 does this correctly is ipv4_sk_update_pmtu()
takes care of the sk cached dst, while the similar part for IPv6,
ip6_sk_update_pmtu() does not that. As Martin already points it out
in the previous version.
Cong Wang March 18, 2016, 6:10 p.m. UTC | #7
On Fri, Mar 18, 2016 at 10:45 AM, Wei Wang <weiwan@google.com> wrote:
> Thanks all for the comments.
>
> Cong, you are right that for ipv6, the code does not take care of updating
> sk_dst_cache entry like Ipv4. And this patch is updating the entry by
> calling ip6_dst_store() to achieve similar functionality as Ipv4.

My question is why not updating ip6_sk_update_pmtu() to sync with ipv4?
Wei Wang March 18, 2016, 9:26 p.m. UTC | #8
I don't think ip6_sk_update_pmtu() is a good place to put it as all it
does is to call ip6_update_pmtu(). And ip6_update_pmtu() does the
route lookup and call __ip6_rt_update_pmtu.
We can put it in ip6_update_pmtu(). But that still means we need to
pass sk to ip6_update_pmtu() and I don't think it makes any difference
compared to the current fix.

Thanks.
Wei

On Fri, Mar 18, 2016 at 11:10 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, Mar 18, 2016 at 10:45 AM, Wei Wang <weiwan@google.com> wrote:
>> Thanks all for the comments.
>>
>> Cong, you are right that for ipv6, the code does not take care of updating
>> sk_dst_cache entry like Ipv4. And this patch is updating the entry by
>> calling ip6_dst_store() to achieve similar functionality as Ipv4.
>
> My question is why not updating ip6_sk_update_pmtu() to sync with ipv4?
diff mbox

Patch

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 295d291..2b147a8 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -115,8 +115,8 @@  void rt6_purge_dflt_routers(struct net *net);
 int rt6_route_rcv(struct net_device *dev, u8 *opt, int len,
 		  const struct in6_addr *gwaddr);
 
-void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu, int oif,
-		     u32 mark);
+void ip6_update_pmtu(struct net *net, struct sock *sk, struct sk_buff *skb,
+		     __be32 mtu, int oif, u32 mark);
 void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu);
 void ip6_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark);
 void ip6_redirect_no_header(struct sk_buff *skb, struct net *net, int oif,
diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c
index 0630a4d5..2c926ec 100644
--- a/net/ipv6/ah6.c
+++ b/net/ipv6/ah6.c
@@ -664,7 +664,7 @@  static int ah6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	if (type == NDISC_REDIRECT)
 		ip6_redirect(skb, net, skb->dev->ifindex, 0);
 	else
-		ip6_update_pmtu(skb, net, info, 0, 0);
+		ip6_update_pmtu(net, NULL, skb, info, 0, 0);
 	xfrm_state_put(x);
 
 	return 0;
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 060a60b..b74847a 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -476,7 +476,7 @@  static int esp6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	if (type == NDISC_REDIRECT)
 		ip6_redirect(skb, net, skb->dev->ifindex, 0);
 	else
-		ip6_update_pmtu(skb, net, info, 0, 0);
+		ip6_update_pmtu(net, NULL, skb, info, 0, 0);
 	xfrm_state_put(x);
 
 	return 0;
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 0a37ddc..03816f5 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -92,7 +92,7 @@  static void icmpv6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	struct net *net = dev_net(skb->dev);
 
 	if (type == ICMPV6_PKT_TOOBIG)
-		ip6_update_pmtu(skb, net, info, 0, 0);
+		ip6_update_pmtu(net, NULL, skb, info, 0, 0);
 	else if (type == NDISC_REDIRECT)
 		ip6_redirect(skb, net, skb->dev->ifindex, 0);
 
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index d90a11f..fa873ca 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -599,7 +599,7 @@  static int vti6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	if (type == NDISC_REDIRECT)
 		ip6_redirect(skb, net, skb->dev->ifindex, 0);
 	else
-		ip6_update_pmtu(skb, net, info, 0, 0);
+		ip6_update_pmtu(net, NULL, skb, info, 0, 0);
 	xfrm_state_put(x);
 
 	return 0;
diff --git a/net/ipv6/ipcomp6.c b/net/ipv6/ipcomp6.c
index 1b9316e..c07a5ac 100644
--- a/net/ipv6/ipcomp6.c
+++ b/net/ipv6/ipcomp6.c
@@ -76,7 +76,7 @@  static int ipcomp6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	if (type == NDISC_REDIRECT)
 		ip6_redirect(skb, net, skb->dev->ifindex, 0);
 	else
-		ip6_update_pmtu(skb, net, info, 0, 0);
+		ip6_update_pmtu(net, NULL, skb, info, 0, 0);
 	xfrm_state_put(x);
 
 	return 0;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ed44663..8f6a5f1 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1346,7 +1346,7 @@  static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt)
 		(rt->rt6i_flags & RTF_PCPU || rt->rt6i_node);
 }
 
-static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
+static void __ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 				 const struct ipv6hdr *iph, u32 mtu)
 {
 	struct rt6_info *rt6 = (struct rt6_info *)dst;
@@ -1377,12 +1377,8 @@  static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
 		nrt6 = ip6_rt_cache_alloc(rt6, daddr, saddr);
 		if (nrt6) {
 			rt6_do_update_pmtu(nrt6, mtu);
-
-			/* ip6_ins_rt(nrt6) will bump the
-			 * rt6->rt6i_node->fn_sernum
-			 * which will fail the next rt6_check() and
-			 * invalidate the sk->sk_dst_cache.
-			 */
+			if (sk)
+				ip6_dst_store(sk, &nrt6->dst, daddr, saddr);
 			ip6_ins_rt(nrt6);
 		}
 	}
@@ -1394,8 +1390,8 @@  static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 	__ip6_rt_update_pmtu(dst, sk, skb ? ipv6_hdr(skb) : NULL, mtu);
 }
 
-void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
-		     int oif, u32 mark)
+void ip6_update_pmtu(struct net *net, struct sock *sk,
+		     struct sk_buff *skb, __be32 mtu, int oif, u32 mark)
 {
 	const struct ipv6hdr *iph = (struct ipv6hdr *) skb->data;
 	struct dst_entry *dst;
@@ -1410,15 +1406,16 @@  void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
 
 	dst = ip6_route_output(net, NULL, &fl6);
 	if (!dst->error)
-		__ip6_rt_update_pmtu(dst, NULL, iph, ntohl(mtu));
+		__ip6_rt_update_pmtu(dst, sk, iph, ntohl(mtu));
 	dst_release(dst);
 }
 EXPORT_SYMBOL_GPL(ip6_update_pmtu);
 
 void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
 {
-	ip6_update_pmtu(skb, sock_net(sk), mtu,
-			sk->sk_bound_dev_if, sk->sk_mark);
+	ip6_update_pmtu(sock_net(sk),
+			(sk->sk_state != TCP_ESTABLISHED) ? NULL : sk,
+			skb, mtu, sk->sk_bound_dev_if, sk->sk_mark);
 }
 EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);