Message ID | 20110606064802.GC31505@secunet.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
Le lundi 06 juin 2011 à 08:48 +0200, Steffen Klassert a écrit : > Git commit 59104f06 (ip: take care of last fragment in ip_append_data) > added a check to see if we exceed the mtu when we add trailer_len. > However, the mtu is already subtracted by trailer_len when the > xfrm transfomation bundles are set up. So IPsec packets with mtu > size get fragmented, or if the DF bit is set the packets will not > be send even though they match the mtu perfectly fine. This patch > actually reverts commit 59104f06. > > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> > --- Woh, I am afraid I wont have time in following days to check your assertion. What about original problem then, how should we fix it ? We do have some cases where at least one fragment (the last one) is oversized. I remember I used Nick Bowler scripts at that time, I might find them again... [PATCH] ip : take care of last fragment in ip_append_data While investigating a bit, I found ip_fragment() slow path was taken because ip_append_data() provides following layout for a send(MTU + N*(MTU - 20)) syscall : - one skb with 1500 (mtu) bytes - N fragments of 1480 (mtu-20) bytes (before adding IP header) last fragment gets 17 bytes of trail data because of following bit: if (datalen == length + fraggap) alloclen += rt->dst.trailer_len; Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm... another bug ?) In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu, so we take slow path, building another skb chain. In order to avoid taking slow path, we should correct ip_append_data() to make sure last fragment has real trail space, under mtu... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 06, 2011 at 09:38:19AM +0200, Eric Dumazet wrote: > > Woh, I am afraid I wont have time in following days to check your > assertion. My test setup was the following: I use an IPsec tunnel with tunnel endpoints 192.168.1.1 and 192.168.1.2 Then I do at 192.168.1.2 ping -c1 -M do -s 1410 192.168.1.1 PING 192.168.1.1 (192.168.1.1) 1410(1438) bytes of data. From 192.168.1.2 icmp_seq=1 Frag needed and DF set (mtu = 1438) --- 192.168.1.1 ping statistics --- 0 packets transmitted, 0 received, +1 errors So the packet matches the mtu but it is not send. I used a kernel with your patch as head commit. Reverting your patch (going one commit deeper in the history): ping -c1 -M do -s 1410 192.168.1.1 PING 192.168.1.1 (192.168.1.1) 1410(1438) bytes of data. 1418 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=3.01 ms --- 192.168.1.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 3.014/3.014/3.014/0.000 ms > > What about original problem then, how should we fix it ? > Hm, I don't know. I'll try to reproduce it here. > We do have some cases where at least one fragment (the last one) is > oversized. trailer_len is used only on IPsec so the poroblem exists only when using IPsec, right? > > I remember I used Nick Bowler scripts at that time, I might find them > again... Would be nice if you could provide these scripts and some informations on how to reproduce the problem. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le lundi 06 juin 2011 à 10:52 +0200, Steffen Klassert a écrit : > On Mon, Jun 06, 2011 at 09:38:19AM +0200, Eric Dumazet wrote: > > > > Woh, I am afraid I wont have time in following days to check your > > assertion. > > My test setup was the following: > > I use an IPsec tunnel with tunnel endpoints 192.168.1.1 and 192.168.1.2 > > Then I do at 192.168.1.2 > > ping -c1 -M do -s 1410 192.168.1.1 > > PING 192.168.1.1 (192.168.1.1) 1410(1438) bytes of data. > From 192.168.1.2 icmp_seq=1 Frag needed and DF set (mtu = 1438) > > --- 192.168.1.1 ping statistics --- > 0 packets transmitted, 0 received, +1 errors > > So the packet matches the mtu but it is not send. > I used a kernel with your patch as head commit. > > Reverting your patch (going one commit deeper in the history): > > ping -c1 -M do -s 1410 192.168.1.1 > > PING 192.168.1.1 (192.168.1.1) 1410(1438) bytes of data. > 1418 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=3.01 ms > > --- 192.168.1.1 ping statistics --- > 1 packets transmitted, 1 received, 0% packet loss, time 0ms > rtt min/avg/max/mdev = 3.014/3.014/3.014/0.000 ms > > > > > What about original problem then, how should we fix it ? > > > > Hm, I don't know. I'll try to reproduce it here. > > > We do have some cases where at least one fragment (the last one) is > > oversized. > > trailer_len is used only on IPsec so the poroblem exists only when > using IPsec, right? > > > > > I remember I used Nick Bowler scripts at that time, I might find them > > again... > > Would be nice if you could provide these scripts and some informations > on how to reproduce the problem. > Nick mail was : http://www.spinics.net/lists/netdev/msg141308.html Unfortunatly I could not find on my machines where I put my own scripts... Not a big deal, I suspect we can revert my commit if you say it added a regression :) Thanks -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jun 07, 2011 at 07:06:57AM +0200, Eric Dumazet wrote: > > Nick mail was : > > http://www.spinics.net/lists/netdev/msg141308.html > Thanks for providing these informations. > Unfortunatly I could not find on my machines where I put my own > scripts... > > Not a big deal, I suspect we can revert my commit if you say it added a > regression :) > In between I can confirm that we get the slowpath problem back with my patch, so we still have a bug somewhere. Reverting your commit would be just a band aid. I think it is better to find the bug and do a real fix instead. Unfortunatly I fear I'm not able to track it down before my vacation that starts tomorrow. I'll continue to work at it once I'm back... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Steffen Klassert <steffen.klassert@secunet.com> Date: Wed, 8 Jun 2011 07:30:20 +0200 > In between I can confirm that we get the slowpath problem back with my > patch, so we still have a bug somewhere. Reverting your commit would > be just a band aid. I think it is better to find the bug and do a real > fix instead. Unfortunatly I fear I'm not able to track it down before > my vacation that starts tomorrow. I'll continue to work at it once I'm > back... Thanks Steffen, looking forward to this. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 09, 2011 at 02:47:04PM -0700, David Miller wrote: > From: Steffen Klassert <steffen.klassert@secunet.com> > Date: Wed, 8 Jun 2011 07:30:20 +0200 > > > In between I can confirm that we get the slowpath problem back with my > > patch, so we still have a bug somewhere. Reverting your commit would > > be just a band aid. I think it is better to find the bug and do a real > > fix instead. Unfortunatly I fear I'm not able to track it down before > > my vacation that starts tomorrow. I'll continue to work at it once I'm > > back... > > Thanks Steffen, looking forward to this. In between I found the problem. In ip_setup_cork() we take the mtu on the base of dst_mtu(rt->dst.path) and assign it to cork->fragsize which is used as the mtu in __ip_append_data(). The path dst_entry is a routing dst_entry that does not take the IPsec header and trailer overhead into account. So if we build an IPsec packet based on this mtu it might be to big after the IPsec transformations are applied. If we take the actual IPsec mtu from dst_mtu(&rt->dst) and adapt the exthdr handling to this, it works as expected. So I'll send two patches, one that reverts Eric's patch and one that fixes the slowpath issue. While reading through the code of __ip_append_data() I noticed that we might use ip_ufo_append_data() for packets that will be IPsec transformed later, is this ok? I don't know how ufo handling works, but I would guess that it expects an udp header and not an IPsec header as the packets transport header. I found another funny thing when I tested my patches: I use an IPsec tunnel with tunnel endpoints 192.168.1.1 and 192.168.1.2 Then I do at 192.168.1.2 ping -c20 -M do -s 1500 192.168.1.1 PING 192.168.1.1 (192.168.1.1) 1500(1528) bytes of data. From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1438) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1438) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1374) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1310) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1246) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1182) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1118) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 1054) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 990) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 926) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 862) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 798) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 734) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 670) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 606) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 552) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 494) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 494) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 494) From 192.168.1.1 icmp_seq=1 Frag needed and DF set (mtu = 494) --- 192.168.1.1 ping statistics --- 0 packets transmitted, 0 received, +20 errors These packets are locally generated and not from 192.168.1.1 as they claim to be. I think the problem is the following: The IPsec mtu is 1438 here, so the first packet is too big. xfrm4_tunnel_check_size() notices this and sends a ICMP_FRAG_NEEDED packet that announces a mtu of 1438 to the original sender of the ping packet. Unfortunately the sender is a local address, it's the IPsec tunnel entry point. So we update the mtu for this connection to 1438. Now, with the next packet xfrm_bundle_ok() notices that the path mtu has changed, so it subtracts the IPsec overhead from the mtu a second time and we end up with a mtu of 1374. This game goes until we reach a minimal mtu of 494. Unfortunately I don't know how to fix this. Any ideas? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Steffen Klassert <steffen.klassert@secunet.com> Date: Wed, 22 Jun 2011 13:02:19 +0200 > In between I found the problem. In ip_setup_cork() we take the mtu on the > base of dst_mtu(rt->dst.path) and assign it to cork->fragsize which is > used as the mtu in __ip_append_data(). The path dst_entry is a routing > dst_entry that does not take the IPsec header and trailer overhead into > account. So if we build an IPsec packet based on this mtu it might be to > big after the IPsec transformations are applied. If we take the actual > IPsec mtu from dst_mtu(&rt->dst) and adapt the exthdr handling to this, > it works as expected. So I'll send two patches, one that reverts Eric's > patch and one that fixes the slowpath issue. Thanks for doing this work Steffen. > While reading through the code of __ip_append_data() I noticed that we > might use ip_ufo_append_data() for packets that will be IPsec transformed > later, is this ok? I don't know how ufo handling works, but I would guess > that it expects an udp header and not an IPsec header as the packets > transport header. Indeed, it could be a real problem. > The IPsec mtu is 1438 here, so the first packet is too big. > xfrm4_tunnel_check_size() notices this and sends a ICMP_FRAG_NEEDED > packet that announces a mtu of 1438 to the original sender of the ping > packet. Unfortunately the sender is a local address, it's the IPsec > tunnel entry point. So we update the mtu for this connection to 1438. > Now, with the next packet xfrm_bundle_ok() notices that the path mtu has > changed, so it subtracts the IPsec overhead from the mtu a second time > and we end up with a mtu of 1374. This game goes until we reach a minimal > mtu of 494. > > Unfortunately I don't know how to fix this. Any ideas? If the generic PMTU handling in net/ipv4/route.c is adjusting the MTU for the IPSEC path's route, that would be the problem. In the case of IPSEC encapsulation, tunnels, and similar, only the tunnel or the XFRM layer should be processing the PMTU notification. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 27, 2011 at 08:39:38PM -0700, David Miller wrote: > From: Steffen Klassert <steffen.klassert@secunet.com> > Date: Wed, 22 Jun 2011 13:02:19 +0200 > > > While reading through the code of __ip_append_data() I noticed that we > > might use ip_ufo_append_data() for packets that will be IPsec transformed > > later, is this ok? I don't know how ufo handling works, but I would guess > > that it expects an udp header and not an IPsec header as the packets > > transport header. > > Indeed, it could be a real problem. Ok, so I'll send a patch to fix it up. > > > The IPsec mtu is 1438 here, so the first packet is too big. > > xfrm4_tunnel_check_size() notices this and sends a ICMP_FRAG_NEEDED > > packet that announces a mtu of 1438 to the original sender of the ping > > packet. Unfortunately the sender is a local address, it's the IPsec > > tunnel entry point. So we update the mtu for this connection to 1438. > > Now, with the next packet xfrm_bundle_ok() notices that the path mtu has > > changed, so it subtracts the IPsec overhead from the mtu a second time > > and we end up with a mtu of 1374. This game goes until we reach a minimal > > mtu of 494. > > > > Unfortunately I don't know how to fix this. Any ideas? > > If the generic PMTU handling in net/ipv4/route.c is adjusting the MTU > for the IPSEC path's route, that would be the problem. > Yes, this is exactly what happens. We use icmp_send() to notify about message size errors even for locally generated packets, this leads to an incorrect pmtu update. Changing this to use ip_local_error() if we have socket context fixes the problem. I'll send a patch for this too. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 98af369..a16859b 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -888,12 +888,9 @@ alloc_new_skb: * because we have no idea what fragment will be * the last. */ - if (datalen == length + fraggap) { + if (datalen == length + fraggap) alloclen += rt->dst.trailer_len; - /* make sure mtu is not reached */ - if (datalen > mtu - fragheaderlen - rt->dst.trailer_len) - datalen -= ALIGN(rt->dst.trailer_len, 8); - } + if (transhdrlen) { skb = sock_alloc_send_skb(sk, alloclen + hh_len + 15,
Git commit 59104f06 (ip: take care of last fragment in ip_append_data) added a check to see if we exceed the mtu when we add trailer_len. However, the mtu is already subtracted by trailer_len when the xfrm transfomation bundles are set up. So IPsec packets with mtu size get fragmented, or if the DF bit is set the packets will not be send even though they match the mtu perfectly fine. This patch actually reverts commit 59104f06. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> --- net/ipv4/ip_output.c | 7 ++----- 1 files changed, 2 insertions(+), 5 deletions(-)