diff mbox

[net-next] ipv4: introduce ip_dst_mtu_secure and protect forwarding path against pmtu spoofing

Message ID 20131215022336.GA27866@order.stressinduktion.org
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Hannes Frederic Sowa Dec. 15, 2013, 2:23 a.m. UTC
While forwarding we should not use the protocol path mtu to calculate
the mtu for a forwarded packet but instead use the interface mtu.

We mark forwarded skbs in ip_forward with IPSKB_FORWARDED which was
introduced for multicast forwarding. But as it does not conflict with
our usage in unicast code path it is perfect for reuse.

I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
along with the new ip_dst_mtu_secure to net/ip.h to fix circular
dependencies because of IPSKB_FORWARDED.

Because someone might have written a software which does probe
destinations manually and expects the kernel to honour those path mtus
I introduced a new per-namespace "forwarding_uses_pmtu" knob so someone
can disable this new behaviour. We also use mtus which are locked on a
route for forwarding.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
 include/net/ip.h                       | 32 ++++++++++++++++++++++++++++++++
 include/net/netns/ipv4.h               |  1 +
 include/net/route.h                    | 19 +++----------------
 net/ipv4/ip_forward.c                  |  7 +++++--
 net/ipv4/ip_output.c                   |  8 +++++---
 net/ipv4/route.c                       |  3 ---
 net/ipv4/sysctl_net_ipv4.c             |  7 +++++++
 8 files changed, 65 insertions(+), 24 deletions(-)

Comments

David Miller Dec. 18, 2013, 10:34 p.m. UTC | #1
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Sun, 15 Dec 2013 03:23:36 +0100

> While forwarding we should not use the protocol path mtu to calculate
> the mtu for a forwarded packet but instead use the interface mtu.
> 
> We mark forwarded skbs in ip_forward with IPSKB_FORWARDED which was
> introduced for multicast forwarding. But as it does not conflict with
> our usage in unicast code path it is perfect for reuse.
> 
> I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
> along with the new ip_dst_mtu_secure to net/ip.h to fix circular
> dependencies because of IPSKB_FORWARDED.
> 
> Because someone might have written a software which does probe
> destinations manually and expects the kernel to honour those path mtus
> I introduced a new per-namespace "forwarding_uses_pmtu" knob so someone
> can disable this new behaviour. We also use mtus which are locked on a
> route for forwarding.
> 
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Sockets aren't the only entity that create cached PMTU information
we might like to use during forwarding, tunnels do too.

I'm afraid what the side effects of this change will be in that
situation, because the dst_mtu() is exactly what will allow us
to relay the ICMP back to the pre-decapsulation source address
properly.

In any event, this code was made to explicitly behave this way
back in 2007 by John Heffner, maybe he has some opinions too.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Dec. 18, 2013, 11:54 p.m. UTC | #2
On Wed, Dec 18, 2013 at 05:34:42PM -0500, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Sun, 15 Dec 2013 03:23:36 +0100
> 
> > While forwarding we should not use the protocol path mtu to calculate
> > the mtu for a forwarded packet but instead use the interface mtu.
> > 
> > We mark forwarded skbs in ip_forward with IPSKB_FORWARDED which was
> > introduced for multicast forwarding. But as it does not conflict with
> > our usage in unicast code path it is perfect for reuse.
> > 
> > I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
> > along with the new ip_dst_mtu_secure to net/ip.h to fix circular
> > dependencies because of IPSKB_FORWARDED.
> > 
> > Because someone might have written a software which does probe
> > destinations manually and expects the kernel to honour those path mtus
> > I introduced a new per-namespace "forwarding_uses_pmtu" knob so someone
> > can disable this new behaviour. We also use mtus which are locked on a
> > route for forwarding.
> > 
> > Cc: Eric Dumazet <eric.dumazet@gmail.com>
> > Cc: David Miller <davem@davemloft.net>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> 
> Sockets aren't the only entity that create cached PMTU information
> we might like to use during forwarding, tunnels do too.

I had tunnels in mind and they specifically clear IPCB bits when
reinserting the skb into the output functions. So they would get the
non-secure dst_mtu and honour the pmtu data.

Another patch I am currently preparing deals with xt_MSS, which I have
still to test.

> I'm afraid what the side effects of this change will be in that
> situation, because the dst_mtu() is exactly what will allow us
> to relay the ICMP back to the pre-decapsulation source address
> properly.
> 
> In any event, this code was made to explicitly behave this way
> back in 2007 by John Heffner, maybe he has some opinions too.

I doubt that it is the right thing to do, as currently one can easily
bring a router to fragment all packets along one path. E.g. icmp_err
handler lets one easily insert pmtu information for arbitary destinations
(as long as they pass uRPF check).

Thanks for looking after the original change and add the Cc. I thought
it was the original behaviour since the code was implemented.

Btw. this is my ongoing work to try to reduce possible sources of
fragmentation as one can expect that the 16 bit fragmentation id is not
secure enough.

Thanks,

  Hannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John Heffner Dec. 18, 2013, 11:55 p.m. UTC | #3
On Wed, Dec 18, 2013 at 5:34 PM, David Miller <davem@davemloft.net> wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Sun, 15 Dec 2013 03:23:36 +0100
>
>> While forwarding we should not use the protocol path mtu to calculate
>> the mtu for a forwarded packet but instead use the interface mtu.
>>
>> We mark forwarded skbs in ip_forward with IPSKB_FORWARDED which was
>> introduced for multicast forwarding. But as it does not conflict with
>> our usage in unicast code path it is perfect for reuse.
>>
>> I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
>> along with the new ip_dst_mtu_secure to net/ip.h to fix circular
>> dependencies because of IPSKB_FORWARDED.
>>
>> Because someone might have written a software which does probe
>> destinations manually and expects the kernel to honour those path mtus
>> I introduced a new per-namespace "forwarding_uses_pmtu" knob so someone
>> can disable this new behaviour. We also use mtus which are locked on a
>> route for forwarding.
>>
>> Cc: Eric Dumazet <eric.dumazet@gmail.com>
>> Cc: David Miller <davem@davemloft.net>
>> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>
> Sockets aren't the only entity that create cached PMTU information
> we might like to use during forwarding, tunnels do too.
>
> I'm afraid what the side effects of this change will be in that
> situation, because the dst_mtu() is exactly what will allow us
> to relay the ICMP back to the pre-decapsulation source address
> properly.
>
> In any event, this code was made to explicitly behave this way
> back in 2007 by John Heffner, maybe he has some opinions too.

I added IP_PMTUDISC_PROBE, intended for use with MTU diagnostic tools,
i.e., tracepath.

I don't keep up that well these days with Linux networking, but I see
now the recent addition of IP_PMTUDISC_INTERFACE.  As an aside, I'm a
bit skeptical that we'd actually want to send out packets with DF
unset.  (One, there is no equivalent in IPv6, plus all the
well-documented bad things that happen when you do this.)  But this is
already committed...

On using the interface MTU for all forwarded packets, I have a similar
reaction as David.  And why are forwarded packets more special than
local ones, from the routing code's point of view?  It seems like
there could be other ways to harden a router, like firewall rules.

  -John
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Dec. 19, 2013, 12:07 a.m. UTC | #4
On Wed, Dec 18, 2013 at 06:55:13PM -0500, John Heffner wrote:
> On Wed, Dec 18, 2013 at 5:34 PM, David Miller <davem@davemloft.net> wrote:
> > From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > Date: Sun, 15 Dec 2013 03:23:36 +0100
> >
> >> While forwarding we should not use the protocol path mtu to calculate
> >> the mtu for a forwarded packet but instead use the interface mtu.
> >>
> >> We mark forwarded skbs in ip_forward with IPSKB_FORWARDED which was
> >> introduced for multicast forwarding. But as it does not conflict with
> >> our usage in unicast code path it is perfect for reuse.
> >>
> >> I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
> >> along with the new ip_dst_mtu_secure to net/ip.h to fix circular
> >> dependencies because of IPSKB_FORWARDED.
> >>
> >> Because someone might have written a software which does probe
> >> destinations manually and expects the kernel to honour those path mtus
> >> I introduced a new per-namespace "forwarding_uses_pmtu" knob so someone
> >> can disable this new behaviour. We also use mtus which are locked on a
> >> route for forwarding.
> >>
> >> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> >> Cc: David Miller <davem@davemloft.net>
> >> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> >
> > Sockets aren't the only entity that create cached PMTU information
> > we might like to use during forwarding, tunnels do too.
> >
> > I'm afraid what the side effects of this change will be in that
> > situation, because the dst_mtu() is exactly what will allow us
> > to relay the ICMP back to the pre-decapsulation source address
> > properly.
> >
> > In any event, this code was made to explicitly behave this way
> > back in 2007 by John Heffner, maybe he has some opinions too.
> 
> I added IP_PMTUDISC_PROBE, intended for use with MTU diagnostic tools,
> i.e., tracepath.
> 
> I don't keep up that well these days with Linux networking, but I see
> now the recent addition of IP_PMTUDISC_INTERFACE.  As an aside, I'm a
> bit skeptical that we'd actually want to send out packets with DF
> unset.  (One, there is no equivalent in IPv6, plus all the
> well-documented bad things that happen when you do this.)  But this is
> already committed...

The idea is that DNS servers can prevent local fragmentation but the
packet is still allowed to get fragmented on the way. Otherwise those
packets get dropped and wouldn't reach their destination.

DNS servers should tweak their edns max packet size below the outgoing MTU
so packets will still reach the target but can retry with TCP transport
(trunc bit being set). In case of retry, pmtu data must be respected.

> On using the interface MTU for all forwarded packets, I have a similar
> reaction as David.  And why are forwarded packets more special than
> local ones, from the routing code's point of view?  It seems like
> there could be other ways to harden a router, like firewall rules.

I doubt it is trivial to set up such a filter as we have to inspect
the payload of the icmp error. I played with it and it is certainly
possible, but my intention was that the networking stack does try to
prevent fragmentation and delay generation of fragments to the last
router on the path where it is necessary.

Greetings,

  Hannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Dec. 19, 2013, 5:12 a.m. UTC | #5
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Thu, 19 Dec 2013 01:07:59 +0100

> On Wed, Dec 18, 2013 at 06:55:13PM -0500, John Heffner wrote:
>> On using the interface MTU for all forwarded packets, I have a similar
>> reaction as David.  And why are forwarded packets more special than
>> local ones, from the routing code's point of view?  It seems like
>> there could be other ways to harden a router, like firewall rules.
> 
> I doubt it is trivial to set up such a filter as we have to inspect
> the payload of the icmp error. I played with it and it is certainly
> possible, but my intention was that the networking stack does try to
> prevent fragmentation and delay generation of fragments to the last
> router on the path where it is necessary.

John's more important point is why treat forwarded traffic specially
from that which is locally generated?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Dec. 19, 2013, 12:17 p.m. UTC | #6
On Thu, Dec 19, 2013 at 12:12:03AM -0500, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Thu, 19 Dec 2013 01:07:59 +0100
> 
> > On Wed, Dec 18, 2013 at 06:55:13PM -0500, John Heffner wrote:
> >> On using the interface MTU for all forwarded packets, I have a similar
> >> reaction as David.  And why are forwarded packets more special than
> >> local ones, from the routing code's point of view?  It seems like
> >> there could be other ways to harden a router, like firewall rules.
> > 
> > I doubt it is trivial to set up such a filter as we have to inspect
> > the payload of the icmp error. I played with it and it is certainly
> > possible, but my intention was that the networking stack does try to
> > prevent fragmentation and delay generation of fragments to the last
> > router on the path where it is necessary.
> 
> John's more important point is why treat forwarded traffic specially
> from that which is locally generated?

Routers don't receive ICMP frag-needed packets at all for forwarded
traffic but only for locally generated packets. As we can't be very
strict with payload tag checking because we often don't have enough
information from the returned payload, we cannot easily distinguish
between ICMP packets sent because of local traffic or for forwarded
traffic. All frag-needed notifications sent in return of forwarded traffic
are inherently bogus or have a malicious background and should not be
handled.  Becasue we cannot harden the receive handler in all cases,
we need to protect the forwarding path to use such information.

Networking software on the end system which wants to guard against
that kind of fragmentation can do so by using the various knobs to
limit pmtu notification processing or use IP_PMTUDISC_INTERFACE to
protect itself from sending fragments. (Note, all other methods but
IP_PMTUDISC_INTERFACE have impact on all protocols, which could also
break other services on that box). But if one has a linux router without
complex icmp payload checking firewall enabled, the packets send without
DF-bit will get fragmented at the next hops if an attacker managed to
send such a spoofed pmtu notification to that router. They simply need to
traceroute the path. We should avoid that!

It would be great if we could use IP_PMTUDISC_PROBE (so setting DF-flag)
for DNS packets but it could break stuff for sub-dns-max-size mtu links.

Thanks,

  Hannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John Heffner Dec. 19, 2013, 3:42 p.m. UTC | #7
On Thu, Dec 19, 2013 at 7:17 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Networking software on the end system which wants to guard against
> that kind of fragmentation can do so by using the various knobs to
> limit pmtu notification processing or use IP_PMTUDISC_INTERFACE to
> protect itself from sending fragments. (Note, all other methods but
> IP_PMTUDISC_INTERFACE have impact on all protocols, which could also
> break other services on that box). But if one has a linux router without
> complex icmp payload checking firewall enabled, the packets send without
> DF-bit will get fragmented at the next hops if an attacker managed to
> send such a spoofed pmtu notification to that router. They simply need to
> traceroute the path. We should avoid that!

The best practice for setting up routers usually involves separate
management and forwarding interfaces.  It's fairly simple to set up
firewall rules in such a configuration.

  -John
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Dec. 19, 2013, 7:30 p.m. UTC | #8
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Thu, 19 Dec 2013 13:17:57 +0100

> Networking software on the end system which wants to guard against
> that kind of fragmentation can do so by using the various knobs to
> limit pmtu notification processing or use IP_PMTUDISC_INTERFACE to
> protect itself from sending fragments.

And that's part of where my irritation is coming from.

Applications have to opt-in to this new socket option based behavior,
but you're making the routing thing default to on.

And even if we default it to off, someone is going to cry and tell all
the distributions to turn it on in /etc/sysctl.conf, just like they
did for rp_filter.  And they will.  I don't have the strength and time
to fight every person who makes these decisions at all the major
distributions to explain to each and every one of them how foolish it
would be.

No end host should have rp_filter on.  It unnecessarily makes our
routing lookups much more expensive for zero gain on an end host.  But
people convinced the distributions that turning it on everywhere by
default was a good idea and it stuck.

I don't want to create a carrot for that kind of situation again.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Dec. 19, 2013, 11:53 p.m. UTC | #9
On Thu, Dec 19, 2013 at 02:30:12PM -0500, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Thu, 19 Dec 2013 13:17:57 +0100
> 
> > Networking software on the end system which wants to guard against
> > that kind of fragmentation can do so by using the various knobs to
> > limit pmtu notification processing or use IP_PMTUDISC_INTERFACE to
> > protect itself from sending fragments.
> 
> And that's part of where my irritation is coming from.
> 
> Applications have to opt-in to this new socket option based behavior,
> but you're making the routing thing default to on.
>
> And even if we default it to off, someone is going to cry and tell all
> the distributions to turn it on in /etc/sysctl.conf, just like they
> did for rp_filter.  And they will.  I don't have the strength and time
> to fight every person who makes these decisions at all the major
> distributions to explain to each and every one of them how foolish it
> would be.
> 
> No end host should have rp_filter on.  It unnecessarily makes our
> routing lookups much more expensive for zero gain on an end host.  But
> people convinced the distributions that turning it on everywhere by
> default was a good idea and it stuck.

Ack.

> I don't want to create a carrot for that kind of situation again.

I see your point. How should we proceed? I definitely see this as a problem.
I just want to point to RFC 1191 "4. Router specification", where it clearly
states that the MTU that should be considered is that of the next-hop network.
And the MTU carried in a fragmentation-needed notification should denote the
size so that the packet is not going to be fragmented at *this* router.

I am fine dropping the knob and just unconditionally ignoring the pmtu data in
the forwarding path? Is that what you wanted to suggest?

Greetings,

  Hannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Dec. 20, 2013, 12:33 a.m. UTC | #10
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Fri, 20 Dec 2013 00:53:38 +0100

> I am fine dropping the knob and just unconditionally ignoring the
> pmtu data in the forwarding path? Is that what you wanted to
> suggest?

Quite the contrary, I thought it was clear that I want the default to
be the current behavior.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Dec. 20, 2013, 12:44 a.m. UTC | #11
On Thu, Dec 19, 2013 at 07:33:21PM -0500, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Fri, 20 Dec 2013 00:53:38 +0100
> 
> > I am fine dropping the knob and just unconditionally ignoring the
> > pmtu data in the forwarding path? Is that what you wanted to
> > suggest?
> 
> Quite the contrary, I thought it was clear that I want the default to
> be the current behavior.

Ok, sorry for misinterpretation.

I'll do so and will send the IPv6 one adopted as well, thanks!

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight Dec. 20, 2013, 10:21 a.m. UTC | #12
> No end host should have rp_filter on.  It unnecessarily makes our
> routing lookups much more expensive for zero gain on an end host.  But
> people convinced the distributions that turning it on everywhere by
> default was a good idea and it stuck.

Yes I spent a long time trying to get asymmetric routing within a
subnet working [1]. I didn't expect anything on the ingress side
would discard the packets - especially with the 'host based' IP
address scheme that linux uses.

	David

[1] I wanted a netperf RR test to send on one interface and
receive on another so that I could see whether the additional
delays were on the tx or rx side.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Jan. 2, 2014, 5:57 p.m. UTC | #13
On Thu, Dec 19, 2013 at 02:30:12PM -0500, David Miller wrote:
> No end host should have rp_filter on.  It unnecessarily makes our
> routing lookups much more expensive for zero gain on an end host.  But
> people convinced the distributions that turning it on everywhere by
> default was a good idea and it stuck.

Just was thinking about that again when reading the quote on lwn:

Maybe distributions just wanted to be always on the safe side.  As there
is no easy conditional system to enable rp_filter in case a user enables
forwarding, maybe something like

rp_filter:
0 - disable
1 - enable only in case of forwarding
2 - always enable

could be helpful?

I guess it is too late to change this.

Greetings,

  Hannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d71afa8..6ccc3bd 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -32,6 +32,18 @@  ip_no_pmtu_disc - INTEGER
 min_pmtu - INTEGER
 	default 552 - minimum discovered Path MTU
 
+forwarding_uses_pmtu - BOOLEAN
+	By default we don't trust protocol path mtus while forwarding
+	because they could be easily forged and can lead to unwanted
+	fragmentation by the router. If you have software which tries
+	to discover path mtus by itself and depend on the kernel
+	honouring this information you can enable the kernel to do so
+	with this knob.
+	Default: 0 (disabled)
+	Possible values:
+	0 - disabled
+	1 - enabled
+
 route/max_size - INTEGER
 	Maximum number of routes allowed in the kernel.  Increase
 	this when using large numbers of interfaces and/or routes.
diff --git a/include/net/ip.h b/include/net/ip.h
index 5356644..58fbedc 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -263,6 +263,38 @@  int ip_dont_fragment(struct sock *sk, struct dst_entry *dst)
 		 !(dst_metric_locked(dst, RTAX_MTU)));
 }
 
+static inline bool ip_sk_accept_pmtu(const struct sock *sk)
+{
+	return inet_sk(sk)->pmtudisc != IP_PMTUDISC_INTERFACE;
+}
+
+static inline bool ip_sk_use_pmtu(const struct sock *sk)
+{
+	return inet_sk(sk)->pmtudisc < IP_PMTUDISC_PROBE;
+}
+static inline unsigned int ip_dst_mtu_secure(const struct dst_entry *dst,
+					     bool forwarding)
+{
+	struct net *net = dev_net(dst->dev);
+
+	if (net->ipv4.sysctl_ip_fwd_use_pmtu ||
+	    dst_metric_locked(dst, RTAX_MTU) ||
+	    !forwarding)
+		return dst_mtu(dst);
+
+	return min(dst->dev->mtu, IP_MAX_MTU);
+}
+
+static inline unsigned int ip_skb_dst_mtu(const struct sk_buff *skb)
+{
+	if (!skb->sk || ip_sk_use_pmtu(skb->sk)) {
+		bool forwarding = !!(IPCB(skb)->flags & IPSKB_FORWARDED);
+		return ip_dst_mtu_secure(skb_dst(skb), forwarding);
+	} else {
+		return min(skb_dst(skb)->dev->mtu, IP_MAX_MTU);
+	}
+}
+
 void __ip_select_ident(struct iphdr *iph, struct dst_entry *dst, int more);
 
 static inline void ip_select_ident(struct sk_buff *skb, struct dst_entry *dst, struct sock *sk)
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 929a668..80f500a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -70,6 +70,7 @@  struct netns_ipv4 {
 
 	int sysctl_tcp_ecn;
 	int sysctl_ip_no_pmtu_disc;
+	int sysctl_ip_fwd_use_pmtu;
 
 	kgid_t sysctl_ping_group_range[2];
 
diff --git a/include/net/route.h b/include/net/route.h
index f68c167..0d4c9da 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -36,6 +36,9 @@ 
 #include <linux/cache.h>
 #include <linux/security.h>
 
+/* IPv4 datagram length is stored into 16bit field (tot_len) */
+#define IP_MAX_MTU	0xFFFFU
+
 #define RTO_ONLINK	0x01
 
 #define RT_CONN_FLAGS(sk)   (RT_TOS(inet_sk(sk)->tos) | sock_flag(sk, SOCK_LOCALROUTE))
@@ -313,20 +316,4 @@  static inline int ip4_dst_hoplimit(const struct dst_entry *dst)
 	return hoplimit;
 }
 
-static inline bool ip_sk_accept_pmtu(const struct sock *sk)
-{
-	return inet_sk(sk)->pmtudisc != IP_PMTUDISC_INTERFACE;
-}
-
-static inline bool ip_sk_use_pmtu(const struct sock *sk)
-{
-	return inet_sk(sk)->pmtudisc < IP_PMTUDISC_PROBE;
-}
-
-static inline int ip_skb_dst_mtu(const struct sk_buff *skb)
-{
-	return (!skb->sk || ip_sk_use_pmtu(skb->sk)) ?
-	       dst_mtu(skb_dst(skb)) : skb_dst(skb)->dev->mtu;
-}
-
 #endif	/* _ROUTE_H */
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 694de3b..66d027f 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -54,6 +54,7 @@  static int ip_forward_finish(struct sk_buff *skb)
 
 int ip_forward(struct sk_buff *skb)
 {
+	u32 mtu;
 	struct iphdr *iph;	/* Our header */
 	struct rtable *rt;	/* Route we use */
 	struct ip_options *opt	= &(IPCB(skb)->opt);
@@ -88,11 +89,13 @@  int ip_forward(struct sk_buff *skb)
 	if (opt->is_strictroute && rt->rt_uses_gateway)
 		goto sr_failed;
 
-	if (unlikely(skb->len > dst_mtu(&rt->dst) && !skb_is_gso(skb) &&
+	IPCB(skb)->flags |= IPSKB_FORWARDED;
+	mtu = ip_dst_mtu_secure(&rt->dst, true);
+	if (unlikely(skb->len > mtu && !skb_is_gso(skb) &&
 		     (ip_hdr(skb)->frag_off & htons(IP_DF))) && !skb->local_df) {
 		IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
-			  htonl(dst_mtu(&rt->dst)));
+			  htonl(mtu));
 		goto drop;
 	}
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9124027..180de0d 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -449,6 +449,7 @@  int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	__be16 not_last_frag;
 	struct rtable *rt = skb_rtable(skb);
 	int err = 0;
+	bool forwarding = !!(IPCB(skb)->flags & IPSKB_FORWARDED);
 
 	dev = rt->dst.dev;
 
@@ -458,12 +459,13 @@  int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 	iph = ip_hdr(skb);
 
+	mtu = ip_dst_mtu_secure(&rt->dst, forwarding);
 	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->local_df) ||
 		     (IPCB(skb)->frag_max_size &&
-		      IPCB(skb)->frag_max_size > dst_mtu(&rt->dst)))) {
+		      IPCB(skb)->frag_max_size > mtu))) {
 		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
-			  htonl(ip_skb_dst_mtu(skb)));
+			  htonl(mtu));
 		kfree_skb(skb);
 		return -EMSGSIZE;
 	}
@@ -473,7 +475,7 @@  int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	 */
 
 	hlen = iph->ihl * 4;
-	mtu = dst_mtu(&rt->dst) - hlen;	/* Size of data space */
+	mtu = mtu - hlen;	/* Size of data space */
 #ifdef CONFIG_BRIDGE_NETFILTER
 	if (skb->nf_bridge)
 		mtu -= nf_bridge_mtu_reduction(skb);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f8da282..25071b4 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,9 +112,6 @@ 
 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
 
-/* IPv4 datagram length is stored into 16bit field (tot_len) */
-#define IP_MAX_MTU	0xFFFF
-
 #define RT_GC_TIMEOUT (300*HZ)
 
 static int ip_rt_max_size;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d7b63a6..56cc374 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -831,6 +831,13 @@  static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "forwarding_uses_pmtu",
+		.data		= &init_net.ipv4.sysctl_ip_fwd_use_pmtu,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };