Message ID:   1395341341.9114.93.camel@edumazet-glaptop2.roam.corp.google.com
State:        RFC, archived
Delegated to: David Miller
On Thu, Mar 20, 2014 at 11:49:01AM -0700, Eric Dumazet wrote:
> csum_replace2() uses about 29 cycles, while a plain ip_send_check() is
> way faster (16 cycles)
>
> csum_partial() is not really meant for doing checksums over 8 bytes !
>
> Any idea how to make the thing really fast as intended ?
>
> I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> is insane...
>
> Following patch might be the fastest thing ?

Your patch makes sense to me. However, does it actually help your 1% in
the GRO profile?

Cheers,
Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> is insane...

Couldn't it just be the cache miss?

-Andi
On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > is insane...
>
> Couldn't it just be the cache miss?

Well, no, because we aggregate frames that were already fully checked
(by the GRO engine/layers) in the same NAPI run.

In this particular case, it seems we aggregate few frames per run (it's
a 40Gb Intel NIC, apparently tuned for low latency).

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 20 Mar 2014 11:49:01 -0700

> csum_replace2() uses about 29 cycles, while a plain ip_send_check() is
> way faster (16 cycles)
>
> csum_partial() is not really meant for doing checksums over 8 bytes !
>
> Any idea how to make the thing really fast as intended ?
>
> I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> is insane...
>
> Following patch might be the fastest thing ?
>
> (At this point we already have validated IP checksum)
 ...
> @@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
>  	int proto = iph->protocol;
>  	int err = -ENOSYS;
>
> -	csum_replace2(&iph->check, iph->tot_len, newlen);
>  	iph->tot_len = newlen;
> +	ip_send_check(iph);

Yeah, the csum_replace*() are extremely suboptimal.

We should be able to cons up something cheap like the trick that
ip_decrease_ttl() uses.

https://tools.ietf.org/html/rfc1624
https://tools.ietf.org/html/rfc1141
On Fri, 2014-03-21 at 14:07 -0400, David Miller wrote:
> Yeah, the csum_replace*() are extremely suboptimal.
>
> We should be able to cons up something cheap like the trick that
> ip_decrease_ttl() uses.
>
> https://tools.ietf.org/html/rfc1624
> https://tools.ietf.org/html/rfc1141

Thanks for the pointers, I'll cook a patch.
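[Editor's note] The ip_decrease_ttl()-style trick David refers to is the incremental update of RFC 1624, Eqn. 3: HC' = ~(~HC + ~m + m'), where m is the old 16-bit field value and m' the new one. A minimal userspace sketch of the idea (hypothetical helper names, host byte order for simplicity; the kernel works on __be16/__sum16 and this is not its actual code):

```c
#include <stddef.h>
#include <stdint.h>

/* Full one's-complement checksum over nwords 16-bit words; used here
 * only as a reference to validate the incremental version against. */
static uint16_t csum_full(const uint16_t *p, size_t nwords)
{
	uint32_t sum = 0;

	while (nwords--)
		sum += *p++;
	while (sum >> 16)			/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Incremental update per RFC 1624, Eqn. 3: HC' = ~(~HC + ~m + m').
 * 'check' is the current checksum, 'old' the 16-bit field being
 * replaced, 'new_val' its replacement.  No walk over the header. */
static uint16_t csum16_update(uint16_t check, uint16_t old, uint16_t new_val)
{
	uint32_t sum = (uint16_t)~check;

	sum += (uint16_t)~old;
	sum += new_val;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Updating tot_len this way touches three 16-bit values instead of re-summing the header, which is the cheapness being asked for; the full recompute serves only to cross-check it.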
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..86c924c16f3c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
 	int proto = iph->protocol;
 	int err = -ENOSYS;
 
-	csum_replace2(&iph->check, iph->tot_len, newlen);
 	iph->tot_len = newlen;
+	ip_send_check(iph);
 
 	rcu_read_lock();
 	ops = rcu_dereference(inet_offloads[proto]);
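[Editor's note] For comparison, ip_send_check() zeroes iph->check and recomputes the one's-complement sum over the ihl*4 header bytes from scratch. A hedged userspace sketch of that full recomputation (hypothetical function name; the kernel uses arch-optimized ip_fast_csum(), not this loop):

```c
#include <stddef.h>
#include <stdint.h>

/* Recompute an IPv4 header checksum over len bytes (len = ihl * 4,
 * assumed even).  The caller zeroes the check field (bytes 10-11)
 * first, mirroring what ip_send_check() does internally. */
static uint16_t ipv4_header_csum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	/* Sum the header as big-endian 16-bit words. */
	for (i = 0; i < len; i += 2)
		sum += (uint32_t)hdr[i] << 8 | hdr[i + 1];
	/* Fold the carries, then take the one's complement. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Even done naively this is a short fixed-length loop over 20 bytes, which is why it beats the csum_partial()-based csum_replace2() path in Eric's measurements.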