Message ID:   1395629496.9117.47.camel@edumazet-glaptop2.roam.corp.google.com
State:        Accepted, archived
Delegated to: David Miller
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 23 Mar 2014 19:51:36 -0700

> From: Eric Dumazet <edumazet@google.com>
>
> When changing one 16bit value by another in IP header, we can adjust the
> IP checksum by doing a simple operation described in RFC 1624,
> as reminded by David.
>
> csum_partial() is a complex function on x86_64, not really suited
> for small number of checksummed bytes.
>
> I spotted csum_partial() being in the top 20 most consuming
> functions (more than 1 %) in a GRO workload, which was rather
> unexpected.
>
> The caller was inet_gro_complete() doing a csum_replace2() when
> building the new IP header for the GRO packet.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks for following through on this Eric.

Would be nice to improve csum_replace4() similarly, since every NAT
packet uses that thing when we change the address in the protocol
header.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2014-03-24 at 10:22 +0000, David Laight wrote:
> From: Eric Dumazet <edumazet@google.com>
> >
> > When changing one 16bit value by another in IP header, we can adjust the
> > IP checksum by doing a simple operation described in RFC 1624,
> > as reminded by David.
> >
> > csum_partial() is a complex function on x86_64, not really suited
> > for small number of checksummed bytes.
> >
> > I spotted csum_partial() being in the top 20 most consuming
> > functions (more than 1 %) in a GRO workload, which was rather
> > unexpected.
> >
> > The caller was inet_gro_complete() doing a csum_replace2() when
> > building the new IP header for the GRO packet.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  include/net/checksum.h | 23 +++++++++++++++++++++--
> >  1 file changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/net/checksum.h b/include/net/checksum.h
> > index 37a0e24adbe7..a28f4e0f6251 100644
> > --- a/include/net/checksum.h
> > +++ b/include/net/checksum.h
> > @@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
> >  	return csum_add(csum, ~addend);
> >  }
> >
> > +static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
> > +{
> > +	u16 res = (__force u16)csum;
>
> Shouldn't that be u32 ?

Why ? We compute 16bit checksums here.

> > +	res += (__force u16)addend;
> > +	return (__force __sum16)(res + (res < (__force u16)addend));

Note how carry is propagated, and how we return 16 bit anyway.

Using u32 would force to use a fold, which is more expensive.
> > +}
> > +
> > +static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
> > +{
> > +	return csum16_add(csum, ~addend);
> > +}
> > +
> >  static inline __wsum
> >  csum_block_add(__wsum csum, __wsum csum2, int offset)
> >  {
> > @@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
> >  	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
> >  }
> >
> > -static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
> > +/* Implements RFC 1624 (Incremental Internet Checksum)
> > + * 3. Discussion states :
> > + * HC' = ~(~HC + ~m + m')
> > + *  m : old value of a 16bit field
> > + * m' : new value of a 16bit field
> > + */
> > +static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
> >  {
> > -	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
> > +	*sum = ~csum16_add(csum16_sub(~(*sum), old), new);
> >  }
>
> It might be clearer to just say:
> 	*sum = ~csum16_add(csum16_add(~*sum, ~old), new));
> or even:
> 	*sum = ~csum16_add(csum16_add(*sum ^ 0xffff, old ^ 0xffff), new));
> which might remove some mask instructions - especially if all the
> intermediate values are left larger than 16 bits.

We have csum_add() and csum_sub(), I added csum16_add() and
csum16_sub(), for analogy and completeness.

For linux guys, it's quite common stuff to use csum_sub(x, y) instead
of csum_add(c, ~y).

You can use whatever code matching your taste.

RFC 1624 mentions ~m, not m ^ 0xffff

But again if you prefer m ^ 0xffff, that's up to you ;)
From: Eric Dumazet
> On Mon, 2014-03-24 at 10:22 +0000, David Laight wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > >
> > > When changing one 16bit value by another in IP header, we can adjust the
> > > IP checksum by doing a simple operation described in RFC 1624,
> > > as reminded by David.
> > >
> > > csum_partial() is a complex function on x86_64, not really suited
> > > for small number of checksummed bytes.
> > >
> > > I spotted csum_partial() being in the top 20 most consuming
> > > functions (more than 1 %) in a GRO workload, which was rather
> > > unexpected.
> > >
> > > The caller was inet_gro_complete() doing a csum_replace2() when
> > > building the new IP header for the GRO packet.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > >  include/net/checksum.h | 23 +++++++++++++++++++++--
> > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/net/checksum.h b/include/net/checksum.h
> > > index 37a0e24adbe7..a28f4e0f6251 100644
> > > --- a/include/net/checksum.h
> > > +++ b/include/net/checksum.h
> > > @@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
> > >  	return csum_add(csum, ~addend);
> > >  }
> > >
> > > +static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
> > > +{
> > > +	u16 res = (__force u16)csum;
> >
> > Shouldn't that be u32 ?
>
> Why ? We compute 16bit checksums here.
>
> > > +	res += (__force u16)addend;
> > > +	return (__force __sum16)(res + (res < (__force u16)addend));
>
> Note how carry is propagated, and how we return 16 bit anyway.

Not enough coffee - my brain didn't parse that correctly at all :-(

> Using u32 would force to use a fold, which is more expensive.

Try compiling for something other than x86/amd64 (ie something without
16bit arithmetic).
The ppc compiler I have (not the latest) generates 3 masks with 0xffff
and a conditional branch for the above function (without the inline).
> > > +}
> > > +
> > > +static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
> > > +{
> > > +	return csum16_add(csum, ~addend);
> > > +}
> > > +
> > >  static inline __wsum
> > >  csum_block_add(__wsum csum, __wsum csum2, int offset)
> > >  {
> > > @@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
> > >  	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
> > >  }
> > >
> > > -static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
> > > +/* Implements RFC 1624 (Incremental Internet Checksum)
> > > + * 3. Discussion states :
> > > + * HC' = ~(~HC + ~m + m')
> > > + *  m : old value of a 16bit field
> > > + * m' : new value of a 16bit field
> > > + */
> > > +static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
> > >  {
> > > -	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
> > > +	*sum = ~csum16_add(csum16_sub(~(*sum), old), new);
> > >  }
> >
> > It might be clearer to just say:
> > 	*sum = ~csum16_add(csum16_add(~*sum, ~old), new));
> > or even:
> > 	*sum = ~csum16_add(csum16_add(*sum ^ 0xffff, old ^ 0xffff), new));
> > which might remove some mask instructions - especially if all the
> > intermediate values are left larger than 16 bits.
>
> We have csum_add() and csum_sub(), I added csum16_add() and
> csum16_sub(), for analogy and completeness.
>
> For linux guys, it's quite common stuff to use csum_sub(x, y) instead
> of csum_add(c, ~y).
>
> You can use whatever code matching your taste.
>
> RFC 1624 mentions ~m, not m ^ 0xffff

But C only has 32bit maths, so all the 16bit values keep
needing masking.
> But again if you prefer m ^ 0xffff, that's up to you ;)

On ppc I get much better code for (retyped):

	static inline u32 csum_add(u32 c, u32 a)
	{
		c += a;
		return (c + (c >> 16)) & 0xffff;
	}

	void csum_repl(u16 *c16, u16 o16, u16 n)
	{
		u32 c = *c16, o = o16;
		*c16 = csum_add(csum_add(c ^ 0xffff, o ^ 0xffff), n);
	}

13 instructions and no branches against 21 instructions and 2 branches.
The same is probably true of arm.
For amd64 (where the compiler can do a 16bit compare) my version is one
instruction longer.
I've not tried to guess at the actual cycle counts!

	David
On Mon, 2014-03-24 at 14:38 +0000, David Laight wrote:
> But C only has 32bit maths, so all the 16bit values keep
> needing masking.
>
> > But again if you prefer m ^ 0xffff, that's up to you ;)
>
> On ppc I get much better code for (retyped):
>
> 	static inline u32 csum_add(u32 c, u32 a)
> 	{
> 		c += a;
> 		return (c + (c >> 16)) & 0xffff;
> 	}
>
> 	void csum_repl(u16 *c16, u16 o16, u16 n)
> 	{
> 		u32 c = *c16, o = o16;
> 		*c16 = csum_add(csum_add(c ^ 0xffff, o ^ 0xffff), n);
> 	}
>
> 13 instructions and no branches against 21 instructions and 2 branches.
> The same is probably true of arm.
> For amd64 (where the compiler can do a 16bit compare) my version is one
> instruction longer.
> I've not tried to guess at the actual cycle counts!
>
> 	David

Code I provided uses no conditional branch on x86.

It sounds like you could provide a helper for arm, if you really care
about this path. I find it surprising you did not comment on my prior
mails on this subject and suddenly seem to care now that the patch is
merged.

We for example have the following helper in x86 :

static inline unsigned add32_with_carry(unsigned a, unsigned b)
{
	asm("addl %2,%0\n\t"
	    "adcl $0,%0"
	    : "=r" (a)
	    : "0" (a), "r" (b));
	return a;
}

But these days, gcc seems to do a pretty good job without these
helpers.

Yes we could save 4 instructions, but I doubt it is worth the pain of
adding arch helpers. I certainly won't do it.
0000000000000030 <inet_gro_complete>:
  30:	55                   	push   %rbp
  31:	48 63 c6             	movslq %esi,%rax
  34:	48 03 87 d0 00 00 00 	add    0xd0(%rdi),%rax
  3b:	48 89 e5             	mov    %rsp,%rbp
  3e:	8b 57 68             	mov    0x68(%rdi),%edx
  41:	44 0f b7 40 02       	movzwl 0x2(%rax),%r8d
  46:	0f b7 48 0a          	movzwl 0xa(%rax),%ecx
  4a:	66 29 f2             	sub    %si,%dx
  4d:	66 c1 c2 08          	rol    $0x8,%dx
  51:	44 0f b6 48 09       	movzbl 0x9(%rax),%r9d
  56:	66 89 50 02          	mov    %dx,0x2(%rax)
  5a:	41 f7 d0             	not    %r8d
  5d:	f7 d1                	not    %ecx
  5f:	66 44 01 c1          	add    %r8w,%cx
  63:	41 0f 92 c0          	setb   %r8b
  67:	45 0f b6 c0          	movzbl %r8b,%r8d
  6b:	44 01 c1             	add    %r8d,%ecx
  6e:	66 01 d1             	add    %dx,%cx
  71:	41 0f 92 c0          	setb   %r8b
  75:	45 0f b6 c0          	movzbl %r8b,%r8d
  79:	44 01 c1             	add    %r8d,%ecx
  7c:	f7 d1                	not    %ecx
  7e:	66 89 48 0a          	mov    %cx,0xa(%rax)
  82:	49 63 c1             	movslq %r9d,%rax
  85:	48 8b 04 c5 00 00 00 	mov    0x0(,%rax,8),%rax
From: Eric Dumazet
...
> Code I provided uses no conditional branch on x86.

All the world isn't x86...
The sparc, ppc and arm people might want to consider optimising this
code further.

> It sounds like you could provide a helper for arm, if you really care
> about this path. I find it surprising you did not comment on my prior
> mails on this subject and suddenly seem to care now that the patch is
> merged.

I don't remember seeing this particular patch before today.
Last week you were still sorting out the stall caused by the 16bit
write -> 32bit load in the existing code.
In any case your change is clearly a significant improvement on what
was there before.

> We for example have the following helper in x86 :
>
> static inline unsigned add32_with_carry(unsigned a, unsigned b)
> {
> 	asm("addl %2,%0\n\t"
> 	    "adcl $0,%0"
> 	    : "=r" (a)
> 	    : "0" (a), "r" (b));
> 	return a;
> }
>
> But these days, gcc seems to do a pretty good job without these
> helpers.

Indeed.

While x86 can do 16bit maths, most cpus can't - so the generated code
for 'short' (and 'char') maths must mask with 0xffff (or 0xff) every
time a value is written to a local (ie register) variable.
In general you get better code by using local variables that are the
same size as machine registers. This also applies to function arguments
and return types.

I'm not sure how much difference it would make overall. It rather
depends on whether anything appears in a very hot path.
OTOH a lot of small changes can add together.

	David
diff --git a/include/net/checksum.h b/include/net/checksum.h
index 37a0e24adbe7..a28f4e0f6251 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
 	return csum_add(csum, ~addend);
 }
 
+static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
+{
+	u16 res = (__force u16)csum;
+
+	res += (__force u16)addend;
+	return (__force __sum16)(res + (res < (__force u16)addend));
+}
+
+static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
+{
+	return csum16_add(csum, ~addend);
+}
+
 static inline __wsum
 csum_block_add(__wsum csum, __wsum csum2, int offset)
 {
@@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
 	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
 }
 
-static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
+/* Implements RFC 1624 (Incremental Internet Checksum)
+ * 3. Discussion states :
+ * HC' = ~(~HC + ~m + m')
+ *  m : old value of a 16bit field
+ * m' : new value of a 16bit field
+ */
+static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
 {
-	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
+	*sum = ~csum16_add(csum16_sub(~(*sum), old), new);
 }
 
 struct sk_buff;