diff mbox

[net-next,v3,1/4] ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets

Message ID 1445982042-3207-2-git-send-email-hannes@stressinduktion.org
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Hannes Frederic Sowa Oct. 27, 2015, 9:40 p.m. UTC
We cannot reliable calculate packet size on MSG_MORE corked sockets
and thus cannot decide if they are going to be fragmented later on,
so better not use CHECKSUM_PARTIAL in the first place.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv4/ip_output.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Tom Herbert Oct. 27, 2015, 10:22 p.m. UTC | #1
On Tue, Oct 27, 2015 at 2:40 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> We cannot reliable calculate packet size on MSG_MORE corked sockets
> and thus cannot decide if they are going to be fragmented later on,
> so better not use CHECKSUM_PARTIAL in the first place.
>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Vlad Yasevich <vyasevich@gmail.com>
> Cc: Benjamin Coddington <bcodding@redhat.com>
> Cc: Tom Herbert <tom@herbertland.com>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
>  net/ipv4/ip_output.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 50e2973..0b02417 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -911,6 +911,7 @@ static int __ip_append_data(struct sock *sk,
>         if (transhdrlen &&
>             length + fragheaderlen <= mtu &&
>             rt->dst.dev->features & NETIF_F_V4_CSUM &&
> +           !(flags & MSG_MORE) &&

I still don't understand this. It seems like the effect is to disable
checksum offload for all UDP messages sent with MSG_MORE flag set.

>             !exthdrlen)
>                 csummode = CHECKSUM_PARTIAL;
>
> --
> 2.5.0
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Oct. 27, 2015, 11:53 p.m. UTC | #2
On Tue, Oct 27, 2015, at 23:22, Tom Herbert wrote:
> On Tue, Oct 27, 2015 at 2:40 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > We cannot reliable calculate packet size on MSG_MORE corked sockets
> > and thus cannot decide if they are going to be fragmented later on,
> > so better not use CHECKSUM_PARTIAL in the first place.
> >
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: Vlad Yasevich <vyasevich@gmail.com>
> > Cc: Benjamin Coddington <bcodding@redhat.com>
> > Cc: Tom Herbert <tom@herbertland.com>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > ---
> >  net/ipv4/ip_output.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index 50e2973..0b02417 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -911,6 +911,7 @@ static int __ip_append_data(struct sock *sk,
> >         if (transhdrlen &&
> >             length + fragheaderlen <= mtu &&
> >             rt->dst.dev->features & NETIF_F_V4_CSUM &&
> > +           !(flags & MSG_MORE) &&
> 
> I still don't understand this. It seems like the effect is to disable
> checksum offload for all UDP messages sent with MSG_MORE flag set.

Exactly.

MSG_MORE/UDP_CORK is a method to append data on the *same* UDP packet.
The probability this packet exceeds the MTU size is rather large, as it
is mostly used to prepare a header and later on send data via
sendpage/sendfile-syscall (IPv6 UDP as no sendpage so it falls back to
normal udpv6_sendmsg path). sendpage is mostly used to send rather large
amount of data because for small amounts regular copying might be faster
(IMHO). So the probability we exceed the MTU is quiet high. This is the
case for NFSv4 which uses this flag over UDP, sending xdr header and
later on the filesystem data directly from the page cache. You will
still have CHECKSUM_PARTIAL capability with sendmsg and multiple iovecs!

Because we cannot simply switch back to CHECKSUM_NONE in the second
write, the first write would not yet have been checksumed, I decided to
exclude MSG_MORE to set up a CHECKSUM_PARTIAL skb.

Because there would be at least some more syscalls between the first
write and the second write (not so in the  NFS example directly from the
kernel but normal user space usage) the data would already be cold in
the caches. So it makes sense to me to checksum the data during copy-in
to trash the CPU caches only once.

The ip6_fragment logic will now catch this case and fragment anyway, but
as I wrote, this is only a last resort.

Hope that makes it more clear.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 50e2973..0b02417 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -911,6 +911,7 @@  static int __ip_append_data(struct sock *sk,
 	if (transhdrlen &&
 	    length + fragheaderlen <= mtu &&
 	    rt->dst.dev->features & NETIF_F_V4_CSUM &&
+	    !(flags & MSG_MORE) &&
 	    !exthdrlen)
 		csummode = CHECKSUM_PARTIAL;