diff mbox

[2/3] ipv4: Fix packet size calculation for IPsec packets in __ip_append_data

Message ID 20110606064802.GC31505@secunet.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Steffen Klassert June 6, 2011, 6:48 a.m. UTC
Git commit 59104f06 (ip: take care of last fragment in ip_append_data)
added a check to see if we exceed the mtu when we add trailer_len.
However, the mtu is already subtracted by trailer_len when the
xfrm transfomation bundles are set up. So IPsec packets with mtu
size get fragmented, or if the DF bit is set the packets will not
be send even though they match the mtu perfectly fine. This patch
actually reverts commit 59104f06.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv4/ip_output.c |    7 ++-----
 1 files changed, 2 insertions(+), 5 deletions(-)

Comments

Eric Dumazet June 6, 2011, 7:38 a.m. UTC | #1
Le lundi 06 juin 2011 à 08:48 +0200, Steffen Klassert a écrit :
> Git commit 59104f06 (ip: take care of last fragment in ip_append_data)
> added a check to see if we exceed the mtu when we add trailer_len.
> However, the mtu is already subtracted by trailer_len when the
> xfrm transfomation bundles are set up. So IPsec packets with mtu
> size get fragmented, or if the DF bit is set the packets will not
> be send even though they match the mtu perfectly fine. This patch
> actually reverts commit 59104f06.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> ---

Woh, I am afraid I wont have time in following days to check your
assertion.

What about original problem then, how should we fix it ?

We do have some cases where at least one fragment (the last one) is
oversized.

I remember I used Nick Bowler scripts at that time, I might find them
again...

[PATCH] ip : take care of last fragment in ip_append_data

While investigating a bit, I found ip_fragment() slow path was taken
because ip_append_data() provides following layout for a send(MTU +
N*(MTU - 20)) syscall :

- one skb with 1500 (mtu) bytes
- N fragments of 1480 (mtu-20) bytes (before adding IP header)
last fragment gets 17 bytes of trail data because of following bit:

        if (datalen == length + fraggap)
                alloclen += rt->dst.trailer_len;

Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
another bug ?)

In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
so we take slow path, building another skb chain.

In order to avoid taking slow path, we should correct ip_append_data()
to make sure last fragment has real trail space, under mtu...

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steffen Klassert June 6, 2011, 8:52 a.m. UTC | #2
On Mon, Jun 06, 2011 at 09:38:19AM +0200, Eric Dumazet wrote:
> 
> Woh, I am afraid I wont have time in following days to check your
> assertion.

My test setup was the following:

I use an IPsec tunnel with tunnel endpoints 192.168.1.1 and 192.168.1.2

Then I do at 192.168.1.2

ping -c1 -M do -s 1410 192.168.1.1

PING 192.168.1.1 (192.168.1.1) 1410(1438) bytes of data.
From 192.168.1.2 icmp_seq=1 Frag needed and DF set (mtu = 1438)

--- 192.168.1.1 ping statistics ---
0 packets transmitted, 0 received, +1 errors

So the packet matches the mtu but it is not send.
I used a kernel with your patch as head commit.

Reverting your patch (going one commit deeper in the history):

ping -c1 -M do -s 1410 192.168.1.1

PING 192.168.1.1 (192.168.1.1) 1410(1438) bytes of data.
1418 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=3.01 ms

--- 192.168.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.014/3.014/3.014/0.000 ms

> 
> What about original problem then, how should we fix it ?
> 

Hm, I don't know. I'll try to reproduce it here.

> We do have some cases where at least one fragment (the last one) is
> oversized.

trailer_len is used only on IPsec so the poroblem exists only when
using IPsec, right?

> 
> I remember I used Nick Bowler scripts at that time, I might find them
> again...

Would be nice if you could provide these scripts and some informations
on how to reproduce the problem.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 7, 2011, 5:06 a.m. UTC | #3
Le lundi 06 juin 2011 à 10:52 +0200, Steffen Klassert a écrit :
> On Mon, Jun 06, 2011 at 09:38:19AM +0200, Eric Dumazet wrote:
> > 
> > Woh, I am afraid I wont have time in following days to check your
> > assertion.
> 
> My test setup was the following:
> 
> I use an IPsec tunnel with tunnel endpoints 192.168.1.1 and 192.168.1.2
> 
> Then I do at 192.168.1.2
> 
> ping -c1 -M do -s 1410 192.168.1.1
> 
> PING 192.168.1.1 (192.168.1.1) 1410(1438) bytes of data.
> From 192.168.1.2 icmp_seq=1 Frag needed and DF set (mtu = 1438)
> 
> --- 192.168.1.1 ping statistics ---
> 0 packets transmitted, 0 received, +1 errors
> 
> So the packet matches the mtu but it is not send.
> I used a kernel with your patch as head commit.
> 
> Reverting your patch (going one commit deeper in the history):
> 
> ping -c1 -M do -s 1410 192.168.1.1
> 
> PING 192.168.1.1 (192.168.1.1) 1410(1438) bytes of data.
> 1418 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=3.01 ms
> 
> --- 192.168.1.1 ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 3.014/3.014/3.014/0.000 ms
> 
> > 
> > What about original problem then, how should we fix it ?
> > 
> 
> Hm, I don't know. I'll try to reproduce it here.
> 
> > We do have some cases where at least one fragment (the last one) is
> > oversized.
> 
> trailer_len is used only on IPsec so the poroblem exists only when
> using IPsec, right?
> 
> > 
> > I remember I used Nick Bowler scripts at that time, I might find them
> > again...
> 
> Would be nice if you could provide these scripts and some informations
> on how to reproduce the problem.
> 

Nick mail was :

http://www.spinics.net/lists/netdev/msg141308.html

Unfortunatly I could not find on my machines where I put my own
scripts...

Not a big deal, I suspect we can revert my commit if you say it added a
regression :)

Thanks


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steffen Klassert June 8, 2011, 5:30 a.m. UTC | #4
On Tue, Jun 07, 2011 at 07:06:57AM +0200, Eric Dumazet wrote:
> 
> Nick mail was :
> 
> http://www.spinics.net/lists/netdev/msg141308.html
> 

Thanks for providing these informations.

> Unfortunatly I could not find on my machines where I put my own
> scripts...
> 
> Not a big deal, I suspect we can revert my commit if you say it added a
> regression :)
> 

In between I can confirm that we get the slowpath problem back with my
patch, so we still have a bug somewhere. Reverting your commit would
be just a band aid. I think it is better to find the bug and do a real
fix instead. Unfortunatly I fear I'm not able to track it down before
my vacation that starts tomorrow. I'll continue to work at it once I'm
back...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller June 9, 2011, 9:47 p.m. UTC | #5
From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Wed, 8 Jun 2011 07:30:20 +0200

> In between I can confirm that we get the slowpath problem back with my
> patch, so we still have a bug somewhere. Reverting your commit would
> be just a band aid. I think it is better to find the bug and do a real
> fix instead. Unfortunatly I fear I'm not able to track it down before
> my vacation that starts tomorrow. I'll continue to work at it once I'm
> back...

Thanks Steffen, looking forward to this.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steffen Klassert June 22, 2011, 11:02 a.m. UTC | #6
On Thu, Jun 09, 2011 at 02:47:04PM -0700, David Miller wrote:
> From: Steffen Klassert <steffen.klassert@secunet.com>
> Date: Wed, 8 Jun 2011 07:30:20 +0200
> 
> > In between I can confirm that we get the slowpath problem back with my
> > patch, so we still have a bug somewhere. Reverting your commit would
> > be just a band aid. I think it is better to find the bug and do a real
> > fix instead. Unfortunatly I fear I'm not able to track it down before
> > my vacation that starts tomorrow. I'll continue to work at it once I'm
> > back...
> 
> Thanks Steffen, looking forward to this.

In between I found the problem. In ip_setup_cork() we take the mtu on the
base of dst_mtu(rt->dst.path) and assign it to cork->fragsize which is
used as the mtu in __ip_append_data(). The path dst_entry is a routing
dst_entry that does not take the IPsec header and trailer overhead into
account. So if we build an IPsec packet based on this mtu it might be to
big after the IPsec transformations are applied. If we take the actual
IPsec mtu from dst_mtu(&rt->dst) and adapt the exthdr handling to this,
it works as expected. So I'll send two patches, one that reverts Eric's
patch and one that fixes the slowpath issue.

While reading through the code of __ip_append_data() I noticed that we
might use ip_ufo_append_data() for packets that will be IPsec transformed
later, is this ok? I don't know how ufo handling works, but I would guess
that it expects an udp header and not an IPsec header as the packets
transport header.

I found another funny thing when I tested my patches:

I use an IPsec tunnel with tunnel endpoints 192.168.1.1 and 192.168.1.2

Then I do at 192.168.1.2

ping -c20 -M do -s 1500 192.168.1.1

PING 192.168.1.1 (192.168.1.1) 1500(1528) bytes of data.
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1438)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1438)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1374)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1310)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1246)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1182)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1118)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1054)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 990)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 926)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 862)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 798)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 734)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 670)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 606)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 552)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 494)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 494)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 494)
From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 494)

--- 192.168.1.1 ping statistics ---
0 packets transmitted, 0 received, +20 errors

These packets are locally generated and not from 192.168.1.1 as they
claim to be. I think the problem is the following:

The IPsec mtu is 1438 here, so the first packet is too big.
xfrm4_tunnel_check_size() notices this and sends a ICMP_FRAG_NEEDED
packet that announces a mtu of 1438 to the original sender of the ping
packet. Unfortunately the sender is a local address, it's the IPsec
tunnel entry point. So we update the mtu for this connection to 1438.
Now, with the next packet xfrm_bundle_ok() notices that the path mtu has
changed, so it subtracts the IPsec overhead from the mtu a second time
and we end up with a mtu of 1374. This game goes until we reach a minimal
mtu of 494.

Unfortunately I don't know how to fix this. Any ideas?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller June 28, 2011, 3:39 a.m. UTC | #7
From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Wed, 22 Jun 2011 13:02:19 +0200

> In between I found the problem. In ip_setup_cork() we take the mtu on the
> base of dst_mtu(rt->dst.path) and assign it to cork->fragsize which is
> used as the mtu in __ip_append_data(). The path dst_entry is a routing
> dst_entry that does not take the IPsec header and trailer overhead into
> account. So if we build an IPsec packet based on this mtu it might be to
> big after the IPsec transformations are applied. If we take the actual
> IPsec mtu from dst_mtu(&rt->dst) and adapt the exthdr handling to this,
> it works as expected. So I'll send two patches, one that reverts Eric's
> patch and one that fixes the slowpath issue.

Thanks for doing this work Steffen.

> While reading through the code of __ip_append_data() I noticed that we
> might use ip_ufo_append_data() for packets that will be IPsec transformed
> later, is this ok? I don't know how ufo handling works, but I would guess
> that it expects an udp header and not an IPsec header as the packets
> transport header.

Indeed, it could be a real problem.

> The IPsec mtu is 1438 here, so the first packet is too big.
> xfrm4_tunnel_check_size() notices this and sends a ICMP_FRAG_NEEDED
> packet that announces a mtu of 1438 to the original sender of the ping
> packet. Unfortunately the sender is a local address, it's the IPsec
> tunnel entry point. So we update the mtu for this connection to 1438.
> Now, with the next packet xfrm_bundle_ok() notices that the path mtu has
> changed, so it subtracts the IPsec overhead from the mtu a second time
> and we end up with a mtu of 1374. This game goes until we reach a minimal
> mtu of 494.
> 
> Unfortunately I don't know how to fix this. Any ideas?

If the generic PMTU handling in net/ipv4/route.c is adjusting the MTU
for the IPSEC path's route, that would be the problem.

In the case of IPSEC encapsulation, tunnels, and similar, only the
tunnel or the XFRM layer should be processing the PMTU notification.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steffen Klassert June 30, 2011, 9:06 a.m. UTC | #8
On Mon, Jun 27, 2011 at 08:39:38PM -0700, David Miller wrote:
> From: Steffen Klassert <steffen.klassert@secunet.com>
> Date: Wed, 22 Jun 2011 13:02:19 +0200
> 
> > While reading through the code of __ip_append_data() I noticed that we
> > might use ip_ufo_append_data() for packets that will be IPsec transformed
> > later, is this ok? I don't know how ufo handling works, but I would guess
> > that it expects an udp header and not an IPsec header as the packets
> > transport header.
> 
> Indeed, it could be a real problem.

Ok, so I'll send a patch to fix it up.

> 
> > The IPsec mtu is 1438 here, so the first packet is too big.
> > xfrm4_tunnel_check_size() notices this and sends a ICMP_FRAG_NEEDED
> > packet that announces a mtu of 1438 to the original sender of the ping
> > packet. Unfortunately the sender is a local address, it's the IPsec
> > tunnel entry point. So we update the mtu for this connection to 1438.
> > Now, with the next packet xfrm_bundle_ok() notices that the path mtu has
> > changed, so it subtracts the IPsec overhead from the mtu a second time
> > and we end up with a mtu of 1374. This game goes until we reach a minimal
> > mtu of 494.
> > 
> > Unfortunately I don't know how to fix this. Any ideas?
> 
> If the generic PMTU handling in net/ipv4/route.c is adjusting the MTU
> for the IPSEC path's route, that would be the problem.
> 

Yes, this is exactly what happens. We use icmp_send() to notify about
message size errors even for locally generated packets, this leads to
an incorrect pmtu update. Changing this to use ip_local_error() if we
have socket context fixes the problem. I'll send a patch for this too.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 98af369..a16859b 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -888,12 +888,9 @@  alloc_new_skb:
 			 * because we have no idea what fragment will be
 			 * the last.
 			 */
-			if (datalen == length + fraggap) {
+			if (datalen == length + fraggap)
 				alloclen += rt->dst.trailer_len;
-				/* make sure mtu is not reached */
-				if (datalen > mtu - fragheaderlen - rt->dst.trailer_len)
-					datalen -= ALIGN(rt->dst.trailer_len, 8);
-			}
+
 			if (transhdrlen) {
 				skb = sock_alloc_send_skb(sk,
 						alloclen + hh_len + 15,