net: Fix IPv6 PMTU disc. w/ asymmetric routes

Message ID	1285581957-30694-1-git-send-email-zenczykowski@gmail.com
State	Superseded, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=from:to:cc:subject:date:message-id:x-mailer:in-reply-to:references :mime-version:content-type:content-transfer-encoding; b=Glg+GDtJCqRBnDBg9yAOCxxM/2hhrD1viaojxcpj711BsZvD577dAgrT+InrAC7bnj HJ2QGCuhNx9mloBINacodEvnDDYGajD0T6LG3azvi006/qhayIP4KTzBZ0o47pJIfANi B+p7Xii+lKeblNdJfHzZZ7L6i9ZZVUzc6TwtM= From: =?UTF-8?q?Maciej=20=C5=BBenczykowski?= <zenczykowski@gmail.com> To: David Miller <davem@davemloft.net>, netdev@vger.kernel.org Cc: =?UTF-8?q?Maciej=20=C5=BBenczykowski?= <maze@google.com> Subject: [PATCH] net: Fix IPv6 PMTU disc. w/ asymmetric routes Date: Mon, 27 Sep 2010 03:05:57 -0700 Message-Id: <1285581957-30694-1-git-send-email-zenczykowski@gmail.com> In-Reply-To: <55a4f86e1001131651j102b600fm1552a42866c3c671@mail.gmail.com> References: <55a4f86e1001131651j102b600fm1552a42866c3c671@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: netdev-owner@vger.kernel.org Precedence: bulk

Message ID

1285581957-30694-1-git-send-email-zenczykowski@gmail.com

State

Superseded, archived

Delegated to:

David Miller

Headers

DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=from:to:cc:subject:date:message-id:x-mailer:in-reply-to:references
	:mime-version:content-type:content-transfer-encoding;
	b=Glg+GDtJCqRBnDBg9yAOCxxM/2hhrD1viaojxcpj711BsZvD577dAgrT+InrAC7bnj
	HJ2QGCuhNx9mloBINacodEvnDDYGajD0T6LG3azvi006/qhayIP4KTzBZ0o47pJIfANi
	B+p7Xii+lKeblNdJfHzZZ7L6i9ZZVUzc6TwtM=
From: =?UTF-8?q?Maciej=20=C5=BBenczykowski?= <zenczykowski@gmail.com>
To: David Miller <davem@davemloft.net>, netdev@vger.kernel.org
Cc: =?UTF-8?q?Maciej=20=C5=BBenczykowski?= <maze@google.com>
Subject: [PATCH] net: Fix IPv6 PMTU disc. w/ asymmetric routes
Date: Mon, 27 Sep 2010 03:05:57 -0700
Message-Id: <1285581957-30694-1-git-send-email-zenczykowski@gmail.com>
In-Reply-To: <55a4f86e1001131651j102b600fm1552a42866c3c671@mail.gmail.com>
References: <55a4f86e1001131651j102b600fm1552a42866c3c671@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: netdev-owner@vger.kernel.org
Precedence: bulk

From: Maciej Żenczykowski <maze@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> --- net/ipv6/route.c | 31 +++++++++++++++++++++++++++---- 1 files changed, 27 insertions(+), 4 deletions(-)

Comments

David Miller Sept. 27, 2010, 6:11 p.m. UTC | #1

From: Maciej Żenczykowski <maze@google.com>
Date: Mon, 27 Sep 2010 03:10:12 -0700

> FYI, I'm signed up to netdev on my personal account.

You don't need to subscribe in order to post patches to the mailing
list.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Miller Sept. 28, 2010, 8:58 p.m. UTC | #2

From: Maciej Żenczykowski <zenczykowski@gmail.com>
Date: Mon, 27 Sep 2010 03:05:57 -0700

> From: Maciej Żenczykowski <maze@google.com>
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>

Please handle all 4 cases just like the ipv4 routing code does:

{ saddr = SADDR, ifindex = dev->ifindex }
{ saddr = SADDR, ifindex = 0 }
{ saddr = INADDR_ANY, ifindex = dev->ifindex }
{ saddr = INADDR_ANY, ifindex = 0 }

I believe I've specifically asked for this every time someone
mentioned that they wanted to fix this issue.

And the ipv4 side is a good guide in other ways, it uses a set
of two arrays so you can just loop over them, making your
rt6_do_pmtu_disc() calls along the way.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Maciej Żenczykowski Sept. 28, 2010, 10:37 p.m. UTC | #3

> Please handle all 4 cases just like the ipv4 routing code does:
>
> { saddr = SADDR, ifindex = dev->ifindex }
> { saddr = SADDR, ifindex = 0 }
> { saddr = INADDR_ANY, ifindex = dev->ifindex }
> { saddr = INADDR_ANY, ifindex = 0 }
>
> I believe I've specifically asked for this every time someone mentioned that they wanted to fix this issue.

I believe that is exactly what the following code fragment from the
above patch does:

+       rt6_do_pmtu_disc(daddr, saddr, net, pmtu, 0);
+       rt6_do_pmtu_disc(daddr, saddr, net, pmtu, dev->ifindex);
+       rt6_do_pmtu_disc(daddr, NULL, net, pmtu, 0);
+       rt6_do_pmtu_disc(daddr, NULL, net, pmtu, dev->ifindex);

The last two of these lines were added to the patch last time around
per your feedback - precisely to address this issue.
I reposted with those changes and you appeared to be ok with the
patch, and your only comment was to add a 'Signed-off-by' line.


Side note:
* I still think that handling the saddr == NULL ie. INADDR_ANY case is
entirely superfluous, since it doesn't actually iterate through all
possible source addresses.  With IPv6 there can be many, many possible
source addresses (just think of link local vs global public vs privacy
addresses and then tack on 6to4 and mobility, etc... for example I see
13 ipv6 addresses on eth0 on my desktop at home, 12 of them globally
reachable).
* Currently PMTU discovery is broken.  With this patch with all 4
lines we end up doing PMTU discovery per source address that isn't the
default source address.  Without those last two lines we'd just do
PMTU discovery per source address.  Not much of a difference, both
allow PMTU discovery to actually happen.

Except:

Imagine a machine with a 1500 MTU network card and 2 source addresses:
  - one default v6 native 'A'
  - one additional 6to4 address 'B'
The native address can reach the internet with a max MTU of 1500.
While due to source address based routing the 6to4 address goes
through a sit ipv4 based tunnel and thus has a max MTU of 1480 (1500 -
20 bytes IPv4 header).

Imagine a remote machine with v6 address 'Z' (our default address to
reach 'Z' is 'A').
Clearly the PMTU of A->Z is 1500, while the PMTU of B->Z is 1480.
[Assume there's nothing which would cause them to be lower.]

If you do PMTU discovery of A->Z, you will indeed get 1500, and mark
that as the PMTU of A->Z.
However if you do PMTU discovery of B->Z, you will indeed get 1480,
but you will mark that as the PMTU of both B->Z and A->Z (since A is
the default to reach Z).
Suddenly the PMTU of A->Z _needlessly and incorrectly_ dropped from
1500 to 1480 merely because of PMTU discovery being performed on B->Z.

And this is precisely why I think that adding those last two lines
that handle INADDR_ANY is actually more of a problem then solution
(indeed as far as I can tell, adding them doesn't actually fix any
problem - the code works just fine and performs PMTU discovery
correctly with just the first two lines).

Thankfully ending up with a lower MTU than actually possible is only a
slight performance problem, while not doing PMTU discovery correctly
at all (current state with asymmetric routing) is a big issue.
Hence regardless of whether this patch includes INADDR_ANY or not the
end result will still be an improvement.

> And the ipv4 side is a good guide in other ways, it uses a set of two arrays so you can just loop over them, making your rt6_do_pmtu_disc() calls along the way.

Sure, I can change it to use arrays, although I don't see any
particular improvement in performance (even if rt6_do_pmtu_disc was
inlined) or readability or anything.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Miller Sept. 30, 2010, 7:41 a.m. UTC | #4

From: Maciej Żenczykowski <zenczykowski@gmail.com>
Date: Tue, 28 Sep 2010 15:37:26 -0700

> * I still think that handling the saddr == NULL ie. INADDR_ANY case is
> entirely superfluous, since it doesn't actually iterate through all
> possible source addresses.  With IPv6 there can be many, many possible
> source addresses (just think of link local vs global public vs privacy
> addresses and then tack on 6to4 and mobility, etc... for example I see
> 13 ipv6 addresses on eth0 on my desktop at home, 12 of them globally
> reachable).

I only have %100 confidence in the reasoning behind why ipv4 handles
things this way, so I'll discuss this in those terms and then try
to tie it into the ipv6 side.

When we are looking up an ipv4 output route, there are 2 "source
address" objects.

1) The one specified in the "struct flowi" for the lookup
   (the flp->fl4_src passed into ip_route_output_flow) which
   is also the one that ends up in the routing cache entry's
   ->fl.fl4_src member.

2) The one contained in the routing cache entry's specification.
   Ie. rth->rt_src

These are distinct.  #1 is what is used to hash and find a matching
routing cache entry.

Since a source address of INADDR_ANY is allowed for routing lookups,
routing cache entries for the same daddr/saddr pair can exist in more
than one hash chains.

Therefore, if we didn't iterate over INADDR_ANY and the specific
address in the icmp PMTU message, we'd miss some routing cache
entries.

Look at the PMTU loops in ipv4 ip_rt_frag_needed():

	for (k = 0; k < 2; k++) {
		for (i = 0; i < 2; i++) {
			unsigned hash = rt_hash(daddr, skeys[i], ikeys[k],
						rt_genid(net));

("ANY vs. specific" ifindex and saddr are used for the hash
 computation)

				if (rth->fl.fl4_dst != daddr ||
				    rth->fl.fl4_src != skeys[i] ||
				    rth->rt_dst != daddr ||
				    rth->rt_src != iph->saddr ||
				    rth->fl.oif != ikeys[k] ||
				    rth->fl.iif != 0 ||

(and for the routing cache entry flow member comparisons)

But the routing cache entry "rt_src" member is compared always to
"iph->saddr", it doesn't use the "ANY vs. specific" skey[] value.

Unless ipv6 does not allow INADDR_ANY source address specifications
during route lookups, it ought to have the same issue too.

My understanding is that ipv6 uses a two-layered tree based scheme,
one layer to key off of the source address and one layer to key off
of the destination address.

So it seems to me that the lookups would have the same aliasing issue
that ipv4 does, and thus require checking both the specific saddr
and also the saddr INADDR_ANY.

Maybe the problem is that the ipv6 side uses the same saddr for both
the lookup and the entry comparison in these PMTU code paths?  Does it
not allow specifying them seperately as the ipv4 PMTU (and incidently
the RT redirect) code paths do?

Or is this not an issue on the ipv6 side for some reason?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3a74f90..d700d1c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1559,14 +1559,13 @@  out:
  *	i.e. Path MTU discovery
  */
 
-void rt6_pmtu_discovery(struct in6_addr *daddr, struct in6_addr *saddr,
-			struct net_device *dev, u32 pmtu)
+static void rt6_do_pmtu_disc(struct in6_addr *daddr, struct in6_addr *saddr,
+			     struct net *net, u32 pmtu, int ifindex)
 {
 	struct rt6_info *rt, *nrt;
-	struct net *net = dev_net(dev);
 	int allfrag = 0;
 
-	rt = rt6_lookup(net, daddr, saddr, dev->ifindex, 0);
+	rt = rt6_lookup(net, daddr, saddr, ifindex, 0);
 	if (rt == NULL)
 		return;
 
@@ -1634,6 +1633,30 @@  out:
 	dst_release(&rt->dst);
 }
 
+void rt6_pmtu_discovery(struct in6_addr *daddr, struct in6_addr *saddr,
+			struct net_device *dev, u32 pmtu)
+{
+	struct net *net = dev_net(dev);
+
+	/*
+	 * RFC 1981 states that a node "MUST reduce the size of the packets it
+	 * is sending along the path" that caused the Packet Too Big message.
+	 * Since it's not possible in the general case to determine which
+	 * interface was used to send the original packet, we update the MTU
+	 * on the interface that will be used to send future packets. We also
+	 * update the MTU on the interface that received the Packet Too Big in
+	 * case the original packet was forced out that interface with
+	 * SO_BINDTODEVICE or similar. This is the next best thing to the
+	 * correct behaviour, which would be to update the MTU on all
+	 * interfaces.
+	 */
+	rt6_do_pmtu_disc(daddr, saddr, net, pmtu, 0);
+	rt6_do_pmtu_disc(daddr, saddr, net, pmtu, dev->ifindex);
+	/* also support source address based routing */
+	rt6_do_pmtu_disc(daddr, NULL, net, pmtu, 0);
+	rt6_do_pmtu_disc(daddr, NULL, net, pmtu, dev->ifindex);
+}
+
 /*
  *	Misc support functions
  */

net: Fix IPv6 PMTU disc. w/ asymmetric routes

Commit Message

Comments

Patch