diff mbox

ip : take care of last fragment in ip_append_data

Message ID 1285049787.2323.797.camel@edumazet-laptop
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Sept. 21, 2010, 6:16 a.m. UTC
Le lundi 20 septembre 2010 à 23:31 +0200, Eric Dumazet a écrit :

> OK, I found a bug in ip_fragment() and ip6_fragment()
> 
> In case slow_path is hit, we have a truesize mismatch
> 
> Could you try following patch ?
> 
> Thanks !
> 
> [PATCH] ip : fix truesize mismatch in ip fragmentation
> 
> We should not set frag->destructor to sock_wkfree() until we are sure we
> dont hit slow path in ip_fragment(). Or we risk uncharging
> frag->truesize twice, and in the end, having negative socket
> sk_wmem_alloc counter, or even freeing socket sooner than expected.
> 
> Many thanks to Nick Bowler, who provided a very clean bug report and
> test programs.
> 
> While Nick bisection pointed to commit 2b85a34e911bf483 (net: No more
> expensive sock_hold()/sock_put() on each tx), underlying bug is older.
> 
> Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/ipv4/ip_output.c  |    8 ++++----
>  net/ipv6/ip6_output.c |   10 +++++-----
>  2 files changed, 9 insertions(+), 9 deletions(-)

This previous patch works for me. I also cooked following fix, to not
enter slow path on some lengthes (MTU + N*(MTU-20))

[PATCH] ip : take care of last fragment in ip_append_data

While investigating a bit, I found ip_fragment() slow path was taken
because ip_append_data() provides following layout for a send(MTU +
N*(MTU - 20)) syscall :

- one skb with 1500 (mtu) bytes
- N fragments of 1480 (mtu-20) bytes (before adding IP header)
last fragment gets 17 bytes of trail data because of following bit:

	if (datalen == length + fraggap)
		alloclen += rt->dst.trailer_len;

Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
another bug ?)

In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
so we take slow path, building another skb chain.

In order to avoid taking slow path, we should correct ip_append_data()
to make sure last fragment has real trail space, under mtu...

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_output.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Sept. 21, 2010, 11:38 p.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 21 Sep 2010 08:16:27 +0200

> [PATCH] ip : take care of last fragment in ip_append_data
> 
> While investigating a bit, I found ip_fragment() slow path was taken
> because ip_append_data() provides following layout for a send(MTU +
> N*(MTU - 20)) syscall :
> 
> - one skb with 1500 (mtu) bytes
> - N fragments of 1480 (mtu-20) bytes (before adding IP header)
> last fragment gets 17 bytes of trail data because of following bit:
> 
> 	if (datalen == length + fraggap)
> 		alloclen += rt->dst.trailer_len;
> 
> Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
> another bug ?)
> 
> In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
> so we take slow path, building another skb chain.
> 
> In order to avoid taking slow path, we should correct ip_append_data()
> to make sure last fragment has real trail space, under mtu...
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

This patch largely looks fine, but:

1) I want to find out where that "17" tailer_len comes from before
   applying this, that doesn't make any sense.

2) Even with #1 addressed, this function is tricky so I want to review
   this patch some more.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 22, 2010, 4:44 a.m. UTC | #2
Le mardi 21 septembre 2010 à 16:38 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 21 Sep 2010 08:16:27 +0200
> 
> > [PATCH] ip : take care of last fragment in ip_append_data
> > 
> > While investigating a bit, I found ip_fragment() slow path was taken
> > because ip_append_data() provides following layout for a send(MTU +
> > N*(MTU - 20)) syscall :
> > 
> > - one skb with 1500 (mtu) bytes
> > - N fragments of 1480 (mtu-20) bytes (before adding IP header)
> > last fragment gets 17 bytes of trail data because of following bit:
> > 
> > 	if (datalen == length + fraggap)
> > 		alloclen += rt->dst.trailer_len;
> > 
> > Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
> > another bug ?)
> > 
> > In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
> > so we take slow path, building another skb chain.
> > 
> > In order to avoid taking slow path, we should correct ip_append_data()
> > to make sure last fragment has real trail space, under mtu...
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> This patch largely looks fine, but:
> 
> 1) I want to find out where that "17" tailer_len comes from before
>    applying this, that doesn't make any sense.
> 
> 2) Even with #1 addressed, this function is tricky so I want to review
>    this patch some more.



The "17" (instead of probable 16 need) comes from :

net/ipv4/esp4.c line 599 :

x->props.trailer_len = align + 1 + crypto_aead_authsize(esp->aead);

In my Nick ipsec script case, 
crypto_aead_blocksize(aead) = 16, 
crypto_aead_authsize(esp->aead) = 0

-> align = 16
trailer_len = 16 + 1 + 0;

I am not sure we need the "+ 1", but I know nothing about this stuff.

Same in net/ipv6/esp6.c ?


Anyway the last frag problem is for packets with lengths :
 MTU + N*(MTU - 20) + LAST

LAST being from [(MTU - trailer_len) ... MTU], not only MTU as I wrote
in changelog



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 22, 2010, 4:53 a.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 22 Sep 2010 06:44:06 +0200

> The "17" (instead of probable 16 need) comes from :
> 
> net/ipv4/esp4.c line 599 :
> 
> x->props.trailer_len = align + 1 + crypto_aead_authsize(esp->aead);
> 
> In my Nick ipsec script case, 
> crypto_aead_blocksize(aead) = 16, 
> crypto_aead_authsize(esp->aead) = 0
> 
> -> align = 16
> trailer_len = 16 + 1 + 0;
> 
> I am not sure we need the "+ 1", but I know nothing about this stuff.
> 
> Same in net/ipv6/esp6.c ?

I think the author of this code thought that the alignment was
expressed as "N - 1" instead of plain "N".

I'll do some research. :-)

Thanks Eric.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 24, 2010, 9:42 p.m. UTC | #4
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 21 Sep 2010 08:16:27 +0200

> [PATCH] ip : take care of last fragment in ip_append_data
> 
> While investigating a bit, I found ip_fragment() slow path was taken
> because ip_append_data() provides following layout for a send(MTU +
> N*(MTU - 20)) syscall :
> 
> - one skb with 1500 (mtu) bytes
> - N fragments of 1480 (mtu-20) bytes (before adding IP header)
> last fragment gets 17 bytes of trail data because of following bit:
> 
> 	if (datalen == length + fraggap)
> 		alloclen += rt->dst.trailer_len;
> 
> Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
> another bug ?)
> 
> In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
> so we take slow path, building another skb chain.
> 
> In order to avoid taking slow path, we should correct ip_append_data()
> to make sure last fragment has real trail space, under mtu...
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04b6989..dfd9a0d 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -927,16 +927,19 @@  alloc_new_skb:
 			    !(rt->dst.dev->features&NETIF_F_SG))
 				alloclen = mtu;
 			else
-				alloclen = datalen + fragheaderlen;
+				alloclen = fraglen;
 
 			/* The last fragment gets additional space at tail.
 			 * Note, with MSG_MORE we overallocate on fragments,
 			 * because we have no idea what fragment will be
 			 * the last.
 			 */
-			if (datalen == length + fraggap)
+			if (datalen == length + fraggap) {
 				alloclen += rt->dst.trailer_len;
-
+				/* make sure mtu is not reached */
+				if (datalen > mtu - fragheaderlen - rt->dst.trailer_len)
+					datalen -= ALIGN(rt->dst.trailer_len, 8);
+			}
 			if (transhdrlen) {
 				skb = sock_alloc_send_skb(sk,
 						alloclen + hh_len + 15,