Message ID: 1395406250.9114.142.camel@edumazet-glaptop2.roam.corp.google.com
State: RFC, archived
Delegated to: David Miller
On Fri, Mar 21, 2014 at 05:50:50AM -0700, Eric Dumazet wrote:
> On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> writes:
> > >
> > > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > > is insane...
> >
> > Couldn't it just be the cache miss?
>
> Or the fact that we mix 16 bit stores and 32bit loads ?

It should cause a small stall from not doing load-store forwarding,
but 1% of a serious workload would be surprising.

Are you sure it's not some skid effect?

-Andi
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2014-03-21 at 05:50 -0700, Eric Dumazet wrote:
> Or the fact that we mix 16 bit stores and 32bit loads ?
>
> 	iph->tot_len = newlen;
> 	iph->check = 0;
> 	iph->check = ip_fast_csum(iph, 5);

Yep definitely. Using 16 bit loads in ip_fast_csum() totally removes
the stall. I no longer see inet_gro_complete() in perf top...

+	if (__builtin_constant_p(ihl) && ihl == 5) {
+		asm("  movw (%[iph]), %w[sum]\n"	/* ihl/version/tos */
+		    "  addw 2(%[iph]), %w[sum]\n"	/* tot_len */
+		    "  adcw 8(%[iph]), %w[sum]\n"	/* ttl/protocol */
+		    "  adcw 10(%[iph]), %w[sum]\n"	/* check */
+		    "  adcl 4(%[iph]), %[sum]\n"	/* id/frag_off */
+		    "  adcl 12(%[iph]), %[sum]\n"	/* saddr */
+		    "  adcl 16(%[iph]), %[sum]\n"	/* daddr */
+		    "  adcl $0, %[sum]\n"
+		    : [sum] "=r" (sum)
+		    : [iph] "r" (iph)
+		    );
+		return csum_fold(sum);
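The arithmetic the unrolled asm performs can be modelled in portable C. This is an illustrative sketch, not kernel code (the name `ipv4_hdr_csum` is made up here), and it simplifies one detail: the asm above only narrows the loads that alias the freshly written 16-bit fields (tot_len at offset 2, check at offset 10), while this sketch loads every word as 16 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model of the unrolled ihl == 5 path: sum the ten
 * 16-bit words of a 20-byte header with end-around carry, then fold
 * and complement (what csum_fold() does on the accumulated sum).
 */
static uint16_t ipv4_hdr_csum(const uint8_t hdr[20])
{
	uint32_t sum = 0;

	for (int i = 0; i < 20; i += 2) {
		uint16_t w;

		memcpy(&w, hdr + i, sizeof(w));	/* native-endian 16-bit load */
		sum += w;
	}
	while (sum >> 16)			/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Because ones'-complement addition commutes with byte swapping, the value returned can be stored back into the header with memcpy() on either endianness, after which the header sums to the all-ones check pattern (i.e. recomputing over it yields 0).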
From: Eric Dumazet
> On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> writes:
> > >
> > > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > > is insane...
> >
> > Couldn't it just be the cache miss?
>
> Or the fact that we mix 16 bit stores and 32bit loads ?
>
> BTW, any idea why ip_fast_csum() on x86 contains a "memory" constraint ?

The correct constraint would be one that told gcc that it accesses the
20 bytes from the source pointer. Without it gcc won't necessarily write
out the values before the asm instructions execute.

	David
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 21 Mar 2014 05:50:50 -0700

> It looks like a barrier() would be more appropriate.

barrier() == __asm__ __volatile__(:::"memory")
diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..a81cc3a5facc 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -46,6 +46,24 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
 	unsigned int sum;
 
+	/*
+	 * Callers might clear iph->check before calling us, make sure
+	 * compiler wont mess things...
+	 */
+	barrier();
+
+	if (__builtin_constant_p(ihl) && ihl == 5) {
+		asm("  movl (%[iph]), %[sum]\n"
+		    "  addl 4(%[iph]), %[sum]\n"
+		    "  adcl 8(%[iph]), %[sum]\n"
+		    "  adcl 12(%[iph]), %[sum]\n"
+		    "  adcl 16(%[iph]), %[sum]\n"
+		    "  adcl $0, %[sum]\n"
+		    : [sum] "=r" (sum)
+		    : [iph] "r" (iph)
+		    );
+		return csum_fold(sum);
+	}
 	asm("  movl (%1), %0\n"
 	    "  subl $4, %2\n"
 	    "  jbe 2f\n"
@@ -68,7 +86,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 	   will assume they contain their original values. */
 	: "=r" (sum), "=r" (iph), "=r" (ihl)
 	: "1" (iph), "2" (ihl)
-	: "memory");
+	);
 	return (__force __sum16)sum;
 }
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..90aa562dfbf5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,9 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
 	int proto = iph->protocol;
 	int err = -ENOSYS;
 
-	csum_replace2(&iph->check, iph->tot_len, newlen);
 	iph->tot_len = newlen;
+	iph->check = 0;
+	iph->check = ip_fast_csum((u8 *)iph, 5);
 
 	rcu_read_lock();
 	ops = rcu_dereference(inet_offloads[proto]);
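The csum_replace2() call the patch removes updated the checksum incrementally in the style of RFC 1624 (fold the old 16-bit value out, the new one in), whereas the replacement recomputes the whole header. Both must yield the same check field; the sketch below demonstrates that equivalence in portable C. The helper names `hdr_csum` and `csum_update16` are illustrative, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Full recompute: sum the header as 16-bit big-endian words with
 * end-around carry, then complement -- what ip_fast_csum() produces.
 */
static uint16_t hdr_csum(const uint8_t *hdr, int ihl)
{
	uint32_t sum = 0;

	for (int i = 0; i < ihl * 4; i += 2)
		sum += (uint32_t)hdr[i] << 8 | hdr[i + 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Incremental update in the style of csum_replace2()/RFC 1624 eqn 3:
 * HC' = ~(~HC + ~m + m'), replacing one 16-bit field m with m'.
 */
static uint16_t csum_update16(uint16_t check, uint16_t old, uint16_t new_)
{
	uint32_t sum = (uint16_t)~check + (uint16_t)~old + new_;

	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

The patch chose the full recompute not because the results differ, but because reading the header as matching-width words avoids the store-forwarding stall that the read-modify-write in csum_replace2() provoked.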