diff mbox

[2/5] ipv4: Kill ip_rt_frag_needed().

Message ID 20120611.022911.885347106959530782.davem@davemloft.net
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

David Miller June 11, 2012, 9:29 a.m. UTC
There is zero point to this function.

It's only real substance is to perform an extremely outdated BSD4.2
ICMP check, which we can safely remove.  If you really have a MTU
limited link being routed by a BSD4.2 derived system, here's a nickel
go buy yourself a real router.

The other actions of ip_rt_frag_needed(), checking and conditionally
updating the peer, are done by the per-protocol handlers of the ICMP
event.

TCP, UDP, et al. have a handler which will receive this event and
transmit it back into the associated route via dst_ops->update_pmtu().

This simplification is important, because it eliminates the one place
where we do not have a proper route context in which to make an
inetpeer lookup.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h  |    2 --
 net/ipv4/icmp.c      |    4 +---
 net/ipv4/route.c     |   61 --------------------------------------------------
 net/rxrpc/ar-error.c |    4 ----
 4 files changed, 1 insertion(+), 70 deletions(-)

Comments

Steffen Klassert June 11, 2012, 11:16 a.m. UTC | #1
On Mon, Jun 11, 2012 at 02:29:11AM -0700, David Miller wrote:
> 
> -unsigned short ip_rt_frag_needed(struct net *net, const struct iphdr *iph,
> -				 unsigned short new_mtu,
> -				 struct net_device *dev)
> -{
> -	unsigned short old_mtu = ntohs(iph->tot_len);
> -	unsigned short est_mtu = 0;
> -	struct inet_peer *peer;
> -
> -	peer = inet_getpeer_v4(net->ipv4.peers, iph->daddr, 1);
> -	if (peer) {
> -		unsigned short mtu = new_mtu;
> -
> -		if (new_mtu < 68 || new_mtu >= old_mtu) {
> -			/* BSD 4.2 derived systems incorrectly adjust
> -			 * tot_len by the IP header length, and report
> -			 * a zero MTU in the ICMP message.
> -			 */
> -			if (mtu == 0 &&
> -			    old_mtu >= 68 + (iph->ihl << 2))
> -				old_mtu -= iph->ihl << 2;
> -			mtu = guess_mtu(old_mtu);
> -		}
> -
> -		if (mtu < ip_rt_min_pmtu)
> -			mtu = ip_rt_min_pmtu;
> -		if (!peer->pmtu_expires || mtu < peer->pmtu_learned) {
> -			unsigned long pmtu_expires;
> -
> -			pmtu_expires = jiffies + ip_rt_mtu_expires;
> -			if (!pmtu_expires)
> -				pmtu_expires = 1UL;
> -
> -			est_mtu = mtu;
> -			peer->pmtu_learned = mtu;
> -			peer->pmtu_expires = pmtu_expires;
> -			atomic_inc(&__rt_peer_genid);
> -		}
> -
> -		inet_putpeer(peer);
> -	}
> -	return est_mtu ? : new_mtu;
> -}
> -

It seems that we don't cache the learned pmtu informations
in some cases with ip_rt_frag_needed() removed. 

At least when doing a simple ping test on a network that has
a router with mtu 1300 along the path, the following happens:

bash-3.00# ping -c 4 -s 1400 192.168.40.2                                       
PING 192.168.40.2 (192.168.40.2) 1400(1428) bytes of data.                      
From 10.2.2.2 icmp_seq=1 Frag needed and DF set (mtu = 1300)                    
From 10.2.2.2 icmp_seq=2 Frag needed and DF set (mtu = 1300)                    
From 10.2.2.2 icmp_seq=3 Frag needed and DF set (mtu = 1300)                    
From 10.2.2.2 icmp_seq=4 Frag needed and DF set (mtu = 1300)                    
                                                                                
--- 192.168.40.2 ping statistics ---                                            
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3005ms     

We should learn the pmtu information with the first packet,
all further packets should get fragmented according to
the learned informations. Unfortunately we don't cache
these informations: 
                                                        
bash-3.00# ip r g 192.168.40.2                                                  
192.168.40.2 via 192.168.20.1 dev eth0  src 192.168.20.2                        
    cache

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller June 11, 2012, 11:20 a.m. UTC | #2
From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Mon, 11 Jun 2012 13:16:59 +0200

> It seems that we don't cache the learned pmtu informations
> in some cases with ip_rt_frag_needed() removed. 

We need to find a way to implement this then, in such a way
that we have the route context used to send the ping packet
out.

Otherwise, it's impossible to record the information properly.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller June 11, 2012, 11:28 a.m. UTC | #3
From: David Miller <davem@davemloft.net>
Date: Mon, 11 Jun 2012 04:20:24 -0700 (PDT)

> We need to find a way to implement this then, in such a way
> that we have the route context used to send the ping packet
> out.

The problem is RAW sockets right?  If so, then this is where the
fix belongs.

There is nothing preventing the RAW socket code from remembering the
last route used, as well as the flow4 key used to look it up, and
processing the PMTU message appropriately in raw_err() using that
remembered information if the flow4 key matches.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steffen Klassert June 11, 2012, 11:42 a.m. UTC | #4
On Mon, Jun 11, 2012 at 04:28:13AM -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Mon, 11 Jun 2012 04:20:24 -0700 (PDT)
> 
> > We need to find a way to implement this then, in such a way
> > that we have the route context used to send the ping packet
> > out.
> 
> The problem is RAW sockets right?  If so, then this is where the
> fix belongs.
> 

Hm, I've just tried with tracepath (udp) and I also don't see the
pmtu informations cached.

I still had no time to look deeper into the new inetpeer code,
I've just gave it a quick try. I'll try to find out what's going on.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/net/route.h b/include/net/route.h
index 6340c37..cc693a5 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -215,8 +215,6 @@  static inline int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 s
 	return ip_route_input_common(skb, dst, src, tos, devin, true);
 }
 
-extern unsigned short	ip_rt_frag_needed(struct net *net, const struct iphdr *iph,
-					  unsigned short new_mtu, struct net_device *dev);
 extern void		ip_rt_send_redirect(struct sk_buff *skb);
 
 extern unsigned int		inet_addr_type(struct net *net, __be32 addr);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 0c78ef1..e1caa1a 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -673,9 +673,7 @@  static void icmp_unreach(struct sk_buff *skb)
 				LIMIT_NETDEBUG(KERN_INFO pr_fmt("%pI4: fragmentation needed and DF set\n"),
 					       &iph->daddr);
 			} else {
-				info = ip_rt_frag_needed(net, iph,
-							 ntohs(icmph->un.frag.mtu),
-							 skb->dev);
+				info = ntohs(icmph->un.frag.mtu);
 				if (!info)
 					goto out;
 			}
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 03e5b61..4f5834c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1664,67 +1664,6 @@  out:	kfree_skb(skb);
 	return 0;
 }
 
-/*
- *	The last two values are not from the RFC but
- *	are needed for AMPRnet AX.25 paths.
- */
-
-static const unsigned short mtu_plateau[] =
-{32000, 17914, 8166, 4352, 2002, 1492, 576, 296, 216, 128 };
-
-static inline unsigned short guess_mtu(unsigned short old_mtu)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(mtu_plateau); i++)
-		if (old_mtu > mtu_plateau[i])
-			return mtu_plateau[i];
-	return 68;
-}
-
-unsigned short ip_rt_frag_needed(struct net *net, const struct iphdr *iph,
-				 unsigned short new_mtu,
-				 struct net_device *dev)
-{
-	unsigned short old_mtu = ntohs(iph->tot_len);
-	unsigned short est_mtu = 0;
-	struct inet_peer *peer;
-
-	peer = inet_getpeer_v4(net->ipv4.peers, iph->daddr, 1);
-	if (peer) {
-		unsigned short mtu = new_mtu;
-
-		if (new_mtu < 68 || new_mtu >= old_mtu) {
-			/* BSD 4.2 derived systems incorrectly adjust
-			 * tot_len by the IP header length, and report
-			 * a zero MTU in the ICMP message.
-			 */
-			if (mtu == 0 &&
-			    old_mtu >= 68 + (iph->ihl << 2))
-				old_mtu -= iph->ihl << 2;
-			mtu = guess_mtu(old_mtu);
-		}
-
-		if (mtu < ip_rt_min_pmtu)
-			mtu = ip_rt_min_pmtu;
-		if (!peer->pmtu_expires || mtu < peer->pmtu_learned) {
-			unsigned long pmtu_expires;
-
-			pmtu_expires = jiffies + ip_rt_mtu_expires;
-			if (!pmtu_expires)
-				pmtu_expires = 1UL;
-
-			est_mtu = mtu;
-			peer->pmtu_learned = mtu;
-			peer->pmtu_expires = pmtu_expires;
-			atomic_inc(&__rt_peer_genid);
-		}
-
-		inet_putpeer(peer);
-	}
-	return est_mtu ? : new_mtu;
-}
-
 static void check_peer_pmtu(struct dst_entry *dst, struct inet_peer *peer)
 {
 	unsigned long expires = ACCESS_ONCE(peer->pmtu_expires);
diff --git a/net/rxrpc/ar-error.c b/net/rxrpc/ar-error.c
index 5d6b572..a920608 100644
--- a/net/rxrpc/ar-error.c
+++ b/net/rxrpc/ar-error.c
@@ -81,10 +81,6 @@  void rxrpc_UDP_error_report(struct sock *sk)
 			_net("I/F MTU %u", mtu);
 		}
 
-		/* ip_rt_frag_needed() may have eaten the info */
-		if (mtu == 0)
-			mtu = ntohs(icmp_hdr(skb)->un.frag.mtu);
-
 		if (mtu == 0) {
 			/* they didn't give us a size, estimate one */
 			if (mtu > 1500) {