Message ID: 1454527121-4007853-1-git-send-email-tom@herbertland.com
State: Changes Requested, archived
Delegated to: David Miller
* Tom Herbert <tom@herbertland.com> wrote:

> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
> conversion.
>
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
> we need to avoid unaligned accesses. Efficient unaligned accesses
> offer a nice additional speedup as demonstrated in the results
> provided below.
>
> This implementation is similar to csum_partial implemented in
> checksum_32.S, however since we are dealing with 8 bytes at a time
> there are more cases for small lengths (and alignments)-- for that
> we employ a jump table.
>
> Testing:
>
> Correctness:
>
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Checksum performance:
>
> Isolating old and new implementation for some common cases:
>
>             Old     NewA    NewA %  NewNoA  NewNoA %
>  Len/Aln    nsec    nsec    Improv  nsecs   Improve
>  --------+-------+--------+-------+-------+---------------------
>  1400/0     192.9   175.1   10%     174.9   10%   (Big packet)
>  40/0       13.8    7.7     44%     5.7     58%   (Ipv6 hdr cmn case)
>  8/4        8.4     6.9     18%     2.8     67%   (UDP, VXLAN in IPv4)
>  14/0       10.5    7.3     30%     5.4     48%   (Eth hdr)
>  14/4       10.8    8.7     19%     5.4     50%   (Eth hdr in IPv4)
>  14/3       11.0    9.8     11%     5.6     49%   (Eth with odd align)
>  7/1        10.0    5.8     42%     4.8     52%   (buffer in one quad)
>
> NewA   => CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
> NewNoA => CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
>
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
> Also test on these with similar results:
>
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>
> Branch prediction:
>
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between
> the baseline and the interleaving test case.

Possibly because the jumps are missing branch prediction for all these
variants...

Please run something like:

  triton:~/tip> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' ~/hackbench 10
  ...
  Performance counter stats for '/home/mingo/hackbench 10' (10 runs):

     1,701,689,802      instructions                                   ( +- 0.83% )
       305,439,262      branches                                       ( +- 0.92% )
         2,011,032      branch-misses        # 0.66% of all branches   ( +- 3.00% )

       0.074311070 seconds time elapsed                                ( +- 2.42% )
  ...

to see the quality of x86 branch prediction and to see the statistical quality
of the measurement (stddev).

The real comparison would be jump table against a hierarchy of open coded
branches. The latter is much more likely to be correctly branch-predicted.
I have a couple of (mostly stylistic) comments about the patch: > Signed-off-by: Tom Herbert <tom@herbertland.com> > --- > arch/x86/include/asm/checksum_64.h | 5 + > arch/x86/lib/csum-partial_64.S | 277 +++++++++++++++++++++++++++++++++++++ > arch/x86/lib/csum-partial_64.c | 148 -------------------- > 3 files changed, 282 insertions(+), 148 deletions(-) > create mode 100644 arch/x86/lib/csum-partial_64.S > delete mode 100644 arch/x86/lib/csum-partial_64.c > > diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h > index cd00e17..a888f65 100644 > --- a/arch/x86/include/asm/checksum_64.h > +++ b/arch/x86/include/asm/checksum_64.h > @@ -128,6 +128,11 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, > */ > extern __wsum csum_partial(const void *buff, int len, __wsum sum); > > +static inline __sum16 ip_compute_csum(const void *buff, int len) > +{ > + return csum_fold(csum_partial(buff, len, 0)); > +} > + > #define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1 > #define HAVE_CSUM_COPY_USER 1 > > diff --git a/arch/x86/lib/csum-partial_64.S b/arch/x86/lib/csum-partial_64.S > new file mode 100644 > index 0000000..520b400 > --- /dev/null > +++ b/arch/x86/lib/csum-partial_64.S > @@ -0,0 +1,277 @@ > +/* Copyright 2016 Tom Herbert <tom@herbertland.com> > + * > + * Checksum partial calculation s/ Partial checksum calculation > + * > + * __wsum csum_partial(const void *buff, int len, __wsum sum) > + * > + * Computes the checksum of a memory block at buff, length len, > + * and adds in "sum" (32-bit) So please refer to arguments consistently in an escaped manner, i.e. 'buff', 'len', 'sum'. It's a bit confusing to read otherwise. This applies to other comments in your patch as well. > + * > + * Returns a 32-bit number suitable for feeding into itself > + * or csum_tcpudp_magic In comments please refer to functions with (), i.e. csum_tcpudp_magic(). > + * > + * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS determines whether alignment of the > + * buffer must be dealt with. > + * > + * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set then the steps are: s/ If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y then the steps are: > + * 1) Initialize accumulator to initial sum > + * 2) Sum 8 bytes at a time using adcq (unroll main loop > + * to do 128 bytes at a time) Please refer to x86 instructions capitalized, i.e. ADCQ here. That makes them stand out better especially when they alias to some English word. This applies to other parts of your patch as well. > + * 3) Sum remaining length (less than 8 bytes) > + * > + * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set then the steps are: s/ If !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS then the steps are: > + * 1) Handle buffer that is not aligned to 8 bytes, sum up to 8 byte > + * alignment > + * 2) Sum 8 bytes at a time using adcq (unroll main loop > + * to do 128 bytes at a time) > + * 3) Sum remaining length (less than 8 bytes) > + * 4) Roll result if alignment is odd and add in initial sum argument > + * 5) If buffer is not aligned to 8 bytes and length is less than > + * or equal to 8 - alignment (whole buffer is in one quad), then > + * treat that as a special case. 
Had to read it 3 times to realize that '-' is not a separator, so please: s/8-alignment > + * > + * Register usage: > + * %rdi: argument #1, buff > + * %rsi: argument #2, length > + * %rdx: argument #3, add in value > + * %rax,%eax: accumulator and return value > + * %rcx,%ecx: counter and tmp > + * %r11: tmp > + * %r10: alignment (0-7) - when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set s/CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y > + */ > + > +#include <linux/linkage.h> > +#include <asm/errno.h> > +#include <asm/asm.h> > + > +#define branch_tbl_len .L_branch_tbl_len > + > +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > + > +/* Close the carry chain and return. */ > +#define RETURN \ > + adcl $0, %eax; \ > + ret > + > +#else /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */ > + > +/* Before returning need to roll the result if alignment was odd and then add > + * in the initial sum. > + */ Please use the customary (multi-line) comment style: /* * Comment ..... * ...... goes here. */ specified in Documentation/CodingStyle. This applies to other comments in your patch as well. > +#define RETURN \ > + adcl $0, %eax; \ > + test $0x1, %r10d; \ > + jz 99f; \ > + roll $8, %eax; \ > +99: addl %edx, %eax; \ > + adcl $0, %eax; \ > + ret > + > +#define branch_tbl_align .L_branch_tbl_align > + > +#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */ On x86 we always set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS. So this is dead, untested code that is never compiled on x86. > + > +ENTRY(csum_partial) > + > +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > + movl %edx, %eax /* Initialize with initial sum argument */ > +#else /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */ s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > + test %esi, %esi /* Zero length? */ > + jne 310f > + movl %edx, %eax > + ret > + > +310: xorl %eax, %eax So in modern x86 assembly we try to use local named labels instead of numeric ones. Easier to extend and the result is a lot easier to read. This applies to most other numeric labels in your patch. > + > + /* Determine alignment */ > + movl %edi, %r10d > + andl $0x7, %r10d > + jz 10f > + movl $8, %ecx > + subl %r10d, %ecx > + cmpl %ecx, %esi > + jle 320f > + clc > + jmpq *branch_tbl_align(, %r10, 8) > + > + /* Whole buffer fits into one quad. Sum up to a four byte alignment > + * and then call into the length table to finish. > + */ > +320: test $0x1, %r10d > + jz 330f > + movb (%rdi), %ah /* Align to two bytes */ > + decl %esi > + lea 1(%rdi), %rdi > +330: cmpl $2, %esi > + jl 340f > + test $0x2, %r10d > + jz 340f > + addw (%rdi), %ax /* Align to four bytes */ > + adcl $0, %eax > + lea 2(%rdi), %rdi > + subl $2, %esi > +340: > + clc > + jmpq *branch_tbl_len(, %rsi, 8) Also, please use extra newlines to better delineate the control flow. I.e. something like: .L_small_buffer: test $0x1, %r10d jz L_2_bytes_aligned: /* Align to two bytes: */ movb (%rdi), %ah decl %esi lea 1(%rdi), %rdi .L_2_bytes_aligned: cmpl $2, %esi jl .L_4_bytes_aligned test $0x2, %r10d jz .L_4_bytes_aligned /* Align to four bytes: */ addw (%rdi), %ax adcl $0, %eax lea 2(%rdi), %rdi subl $2, %esi .L_4_bytes_aligned: See how much more readable such an assembly construct is than numeric labels and no proper flow control separation? Also note how I changed the placement and style of the existing comments. Note that I just guessed about the labels based on patch context, some might be wrong. Also, this code will go away as it's dead code - but the comments still apply to the rest of the patch. 
Please propagate all such readability improvements to the rest of your patch. > + > +/* Jumps table for alignments */ s/Jump table > + > +201: /* Align 1 */ > + adcw 5(%rdi), %ax > +203: /* Align 3 */ > + adcw 3(%rdi), %ax > +205: /* Align 5 */ > + adcw 1(%rdi), %ax > +207: /* Align 7 */ > + adcl $0, %eax > + addb (%rdi), %ah > + jmp 222f > +202: /* Align 2 */ > + adcw 4(%rdi), %ax > +204: /* Align 4 */ > + adcw 2(%rdi), %ax > +206: /* Align 6 */ > + adcw (%rdi), %ax > + > +222: adcl $0, %eax > + subl %ecx, %esi /* %rcx is 8 - alignment */ > + addq %rcx, %rdi > +200: > + /* Fall through */ > + > +#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */ s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > + > + /* Check length */ > +10: cmpl $8, %esi > + jg 30f > + jl 20f > + > + /* Exactly 8 bytes length */ > + addl (%rdi), %eax > + adcl 4(%rdi), %eax > + RETURN > + > + /* Less than 8 bytes length */ > +20: clc > + jmpq *branch_tbl_len(, %rsi, 8) > + > + /* Greater than 8 bytes length. Determine number of quads (n). Sum > + * over first n % 16 quads > + */ > +30: movl %esi, %ecx > + shrl $3, %ecx > + andl $0xf, %ecx > + negq %rcx > + lea 40f(, %rcx, 4), %r11 > + clc > + jmp *%r11 Are you absolutely sure that a jump table is the proper solution here? It defeats branch prediction on most x86 uarchs. Why not label the loop stages and jump in directly with a binary tree of branches? > + > +.align 8 > + adcq 14*8(%rdi),%rax > + adcq 13*8(%rdi),%rax > + adcq 12*8(%rdi),%rax > + adcq 11*8(%rdi),%rax > + adcq 10*8(%rdi),%rax > + adcq 9*8(%rdi),%rax > + adcq 8*8(%rdi),%rax > + adcq 7*8(%rdi),%rax > + adcq 6*8(%rdi),%rax > + adcq 5*8(%rdi),%rax > + adcq 4*8(%rdi),%rax > + adcq 3*8(%rdi),%rax > + adcq 2*8(%rdi),%rax > + adcq 1*8(%rdi),%rax > + adcq 0*8(%rdi),%rax Stray whitespace. > + nop > +40: /* #quads % 16 jump table base */ So both the jump table and its initial alignment adds NOPs that at minimum blows up the I$ a bit. With direct jumps we could avoid all that. > + > + adcq $0, %rax > + shlq $3, %rcx > + subq %rcx, %rdi /* %rcx is already negative length */ > + > + /* Now determine number of blocks of 8 quads. Sum 128 bytes at a time > + * using unrolled loop. s/using an unrolled loop > + */ > + movl %esi, %ecx > + shrl $7, %ecx > + jz 60f > + clc > + > + /* Main loop */ > +50: adcq 0*8(%rdi),%rax > + adcq 1*8(%rdi),%rax > + adcq 2*8(%rdi),%rax > + adcq 3*8(%rdi),%rax > + adcq 4*8(%rdi),%rax > + adcq 5*8(%rdi),%rax > + adcq 6*8(%rdi),%rax > + adcq 7*8(%rdi),%rax > + adcq 8*8(%rdi),%rax > + adcq 9*8(%rdi),%rax > + adcq 10*8(%rdi),%rax > + adcq 11*8(%rdi),%rax > + adcq 12*8(%rdi),%rax > + adcq 13*8(%rdi),%rax > + adcq 14*8(%rdi),%rax > + adcq 15*8(%rdi),%rax > + lea 128(%rdi), %rdi > + loop 50b > + > + adcq $0, %rax > + > + /* Handle remaining length which is <= 8 bytes */ > +60: andl $0x7, %esi > + > + /* Fold 64 bit sum to 32 bits */ s/64-bit > + movq %rax, %rcx > + shrq $32, %rcx > + addl %ecx, %eax > + > + jmpq *branch_tbl_len(, %rsi, 8) Same fundamental question as above. 
> + > +/* Length table targets */ > + > +107: /* Length 7 */ > + adcw 4(%rdi), %ax > +105: /* Length 5 */ > + adcw 2(%rdi), %ax > +103: /* Length 3 */ > + adcw (%rdi), %ax > +101: /* Length 1, grab the odd byte */ > + adcb -1(%rdi, %rsi), %al > + adcb $0, %ah > + RETURN > +106: /* Length 6 */ > + adcw 4(%rdi), %ax > +104: /* Length 4, optimized for double word access*/ > + adcl (%rdi), %eax > + RETURN > +102: /* Length 2 */ > + adcw (%rdi), %ax > +100: /* Length 0 */ > + RETURN > + > +.section .rodata > +.align 64 Please use the proper parametric cache alignment define. There are subarchitectures that prefer (and need) much larger alignment. > +.L_branch_tbl_len: > + .quad 100b > + .quad 101b > + .quad 102b > + .quad 103b > + .quad 104b > + .quad 105b > + .quad 106b > + .quad 107b > + > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > +.L_branch_tbl_align: > + .quad 200b > + .quad 201b > + .quad 202b > + .quad 203b > + .quad 204b > + .quad 205b > + .quad 206b > + .quad 207b > +#endif This too is dead code. Thanks, Ingo
* Ingo Molnar <mingo@kernel.org> wrote: > s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > > > + > > + /* Check length */ > > +10: cmpl $8, %esi > > + jg 30f > > + jl 20f > > + > > + /* Exactly 8 bytes length */ > > + addl (%rdi), %eax > > + adcl 4(%rdi), %eax > > + RETURN > > + > > + /* Less than 8 bytes length */ > > +20: clc > > + jmpq *branch_tbl_len(, %rsi, 8) > > + > > + /* Greater than 8 bytes length. Determine number of quads (n). Sum > > + * over first n % 16 quads > > + */ > > +30: movl %esi, %ecx > > + shrl $3, %ecx > > + andl $0xf, %ecx > > + negq %rcx > > + lea 40f(, %rcx, 4), %r11 > > + clc > > + jmp *%r11 > > Are you absolutely sure that a jump table is the proper solution here? It > defeats branch prediction on most x86 uarchs. Why not label the loop stages and > jump in directly with a binary tree of branches? So just to expand on this a bit, attached below is a quick & simple & stupid testcase that generates a 16 entries call table. (Indirect jumps and indirect calls are predicted similarly on most x86 uarchs.) Just built it with: gcc -Wall -O2 -o jump-table jump-table.c Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel CPU and also on AMD Opteron. The numbers are from the Intel box.) this gives a high branch miss rate with a 16 entries jump table: triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16 ... using 16 jump table entries. ... using 100000000 loop iterations. ... result: 10000000100000000 [...] Performance counter stats for './jump-table 16' (10 runs): 1386.131780 task-clock (msec) # 1.001 CPUs utilized ( +- 0.18% ) 33 context-switches # 0.024 K/sec ( +- 1.71% ) 0 cpu-migrations # 0.000 K/sec 52 page-faults # 0.037 K/sec ( +- 0.71% ) 6,247,215,683 cycles # 4.507 GHz ( +- 0.18% ) 3,895,337,877 stalled-cycles-frontend # 62.35% frontend cycles idle ( +- 0.30% ) 1,404,014,996 instructions # 0.22 insns per cycle # 2.77 stalled cycles per insn ( +- 0.02% ) 300,820,988 branches # 217.022 M/sec ( +- 0.02% ) 87,518,741 branch-misses # 29.09% of all branches ( +- 0.01% ) 1.385240076 seconds time elapsed ( +- 0.21% ) ... as you can see the branch miss rate is very significant, causing a stalled decoder and very low instruction throughput. I have to reduce the jump table to a single entry (!) to get good performance on this CPU: Performance counter stats for './jump-table 1' (10 runs): 739.173505 task-clock (msec) # 1.001 CPUs utilized ( +- 0.26% ) 37 context-switches # 0.051 K/sec ( +- 16.79% ) 0 cpu-migrations # 0.000 K/sec 52 page-faults # 0.070 K/sec ( +- 0.41% ) 3,331,328,405 cycles # 4.507 GHz ( +- 0.26% ) 2,012,973,596 stalled-cycles-frontend # 60.43% frontend cycles idle ( +- 0.47% ) 1,403,880,792 instructions # 0.42 insns per cycle # 1.43 stalled cycles per insn ( +- 0.05% ) 300,817,064 branches # 406.964 M/sec ( +- 0.05% ) 12,177 branch-misses # 0.00% of all branches ( +- 12.39% ) 0.738616356 seconds time elapsed ( +- 0.26% ) Note how the runtime got halved: that is because stalls got halved and the instructions per cycle throughput doubled. 
Even a two entries jump table performs poorly: Performance counter stats for './jump-table 2' (10 runs): 1493.790686 task-clock (msec) # 1.001 CPUs utilized ( +- 0.06% ) 39 context-switches # 0.026 K/sec ( +- 4.73% ) 0 cpu-migrations # 0.000 K/sec 52 page-faults # 0.035 K/sec ( +- 0.26% ) 6,732,372,612 cycles # 4.507 GHz ( +- 0.06% ) 4,229,130,302 stalled-cycles-frontend # 62.82% frontend cycles idle ( +- 0.09% ) 1,407,803,145 instructions # 0.21 insns per cycle # 3.00 stalled cycles per insn ( +- 0.08% ) 301,688,946 branches # 201.962 M/sec ( +- 0.09% ) 100,022,150 branch-misses # 33.15% of all branches ( +- 0.00% ) 1.492820524 seconds time elapsed ( +- 0.08% ) In fact it's worse than the 16 entries runtime. ( Note that Intel SkyLake improved the indirect jump/call branch predictor. On another Intel CPU I have the boundary between good and bad prediction is at 4-6 entries. So this is highly uarch (and code layout) dependent. ) In contrast, a static hierarchy/tree of branches is predicted a lot better on most x86 uarchs, even with highly variable input. (This does not even count the extra calculations needed to calculate the jump table index, which, as you coded it in your patch, add 2-3 cycles of extra latency as well.) Such branch miss characteristics matter and they can become more visible with smaller skb sizes. So I'm generally sceptical of jump tables and I'd like to see careful and convincing perf stat runs on modern x86 uarchs, comparing the jump table solution to a branch hierarchy solution, before accepting a jump table solution into the x86 architecture. x86 uarchs are generally good at predicting function pointer indirect jumps (which tend to be static) - multi-target jump tables are a lot harder nut to crack, especially if the jump index is calculated right before the jump is performed, like your patch does it. The impact of branch misses is more pronounced on modern Intel CPUs, because speculation is deeper. But as always I can be convinced with contrary numbers! :-) Thanks, Ingo --------------------------------------> /* * Simple testcase to generate jump table usage. */ #include <stdio.h> #include <stdlib.h> #include <string.h> typedef unsigned long (fn_t) (unsigned long param); unsigned long global; #define DEFINE_FUN(x) \ \ unsigned long fun_##x(unsigned long param) \ { \ return param+global; \ } #define TABLE_SIZE_MAX 16L DEFINE_FUN( 1); DEFINE_FUN( 2); DEFINE_FUN( 3); DEFINE_FUN( 4); DEFINE_FUN( 5); DEFINE_FUN( 6); DEFINE_FUN( 7); DEFINE_FUN( 8); DEFINE_FUN( 9); DEFINE_FUN(10); DEFINE_FUN(11); DEFINE_FUN(12); DEFINE_FUN(13); DEFINE_FUN(14); DEFINE_FUN(15); DEFINE_FUN(16); int main(int argc, char **argv) { fn_t *fn_ptr [TABLE_SIZE_MAX] = { fun_1 , fun_2 , fun_3 , fun_4 , fun_5 , fun_6 , fun_7 , fun_8 , fun_9 , fun_10, fun_11, fun_12, fun_13, fun_14, fun_15, fun_16, }; unsigned long local = 0; unsigned long i; unsigned long table_size = TABLE_SIZE_MAX; unsigned long loops = 100000000; if ((argc >= 2 && !strcmp(argv[1], "-h")) || argc >= 4) { printf("usage: jump-table <table_size(1-%lu)(default: %lu)> <loops(default: %lu)>\n", TABLE_SIZE_MAX, TABLE_SIZE_MAX, loops); exit(0); } if (argc >= 2) { table_size = atol(argv[1]); if (table_size <= 1) table_size = 1; if (table_size > TABLE_SIZE_MAX) table_size = TABLE_SIZE_MAX; } printf("... using %lu jump table entries.\n", table_size); if (argc >= 3) loops = atol(argv[2]); printf("... 
using %lu loop iterations.\n", loops); /* Call a set of simple arithmetic functions from a jump table: */ for (i = 0; i < loops; i++) { global++; local += fn_ptr[global % table_size](global); } printf("... result: %lu\n", local); }
From: Tom Herbert
> Sent: 03 February 2016 19:19
...
> + /* Main loop */
> +50: adcq 0*8(%rdi),%rax
> + adcq 1*8(%rdi),%rax
> + adcq 2*8(%rdi),%rax
> + adcq 3*8(%rdi),%rax
> + adcq 4*8(%rdi),%rax
> + adcq 5*8(%rdi),%rax
> + adcq 6*8(%rdi),%rax
> + adcq 7*8(%rdi),%rax
> + adcq 8*8(%rdi),%rax
> + adcq 9*8(%rdi),%rax
> + adcq 10*8(%rdi),%rax
> + adcq 11*8(%rdi),%rax
> + adcq 12*8(%rdi),%rax
> + adcq 13*8(%rdi),%rax
> + adcq 14*8(%rdi),%rax
> + adcq 15*8(%rdi),%rax
> + lea 128(%rdi), %rdi
> + loop 50b

I'd need convincing that unrolling the loop like that gives any significant gain.
You have a dependency chain on the carry flag so have delays between the 'adcq'
instructions (these may be more significant than the memory reads from l1 cache).

I also don't remember (might be wrong) the 'loop' instruction being executed quickly.
If 'loop' is fast then you will probably find that:

  10:	adcq	0(%rdi),%rax
	lea	8(%rdi),%rdi
	loop	10b

is just as fast since the three instructions could all be executed in parallel.
But I suspect that 'dec %cx; jnz 10b' is actually better (and might execute as
a single micro-op).
IIRC 'adc' and 'dec' will both have dependencies on the flags register
so cannot execute together (which is a shame here).

It is also possible that breaking the carry-chain dependency by doing 32bit
adds (possibly after 64bit reads) can be made to be faster.

	David
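[Editor's illustration] A minimal, untested C sketch of the "break the carry chain with 32-bit adds" idea David describes: the 64-bit words are still read whole, but their 32-bit halves are accumulated with plain adds into a wide accumulator, so no ADC (and no carry-flag dependency between iterations) is needed until the final fold. The helper name and loop shape are made up for illustration, not taken from the patch.

  /*
   * Sketch only: sum 'n_quads' 8-byte words without ADC.  Each 32-bit
   * half is added into a 64-bit accumulator, so the adds carry no flag
   * dependency between iterations and can overlap.  The result still
   * needs the usual fold down to 16 bits (and the odd-alignment rotate,
   * where applicable) afterwards.
   */
  static unsigned long csum_quads_no_carry(const unsigned long *p,
					   unsigned long n_quads)
  {
	unsigned long sum = 0;
	unsigned long i;

	for (i = 0; i < n_quads; i++) {
		unsigned long v = p[i];

		sum += (unsigned int)v;	/* low 32 bits */
		sum += v >> 32;		/* high 32 bits */
	}
	return sum;
  }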
On Thu, Feb 4, 2016 at 3:08 AM, David Laight <David.Laight@aculab.com> wrote: > From: Tom Herbert >> Sent: 03 February 2016 19:19 > ... >> + /* Main loop */ >> +50: adcq 0*8(%rdi),%rax >> + adcq 1*8(%rdi),%rax >> + adcq 2*8(%rdi),%rax >> + adcq 3*8(%rdi),%rax >> + adcq 4*8(%rdi),%rax >> + adcq 5*8(%rdi),%rax >> + adcq 6*8(%rdi),%rax >> + adcq 7*8(%rdi),%rax >> + adcq 8*8(%rdi),%rax >> + adcq 9*8(%rdi),%rax >> + adcq 10*8(%rdi),%rax >> + adcq 11*8(%rdi),%rax >> + adcq 12*8(%rdi),%rax >> + adcq 13*8(%rdi),%rax >> + adcq 14*8(%rdi),%rax >> + adcq 15*8(%rdi),%rax >> + lea 128(%rdi), %rdi >> + loop 50b > > I'd need convincing that unrolling the loop like that gives any significant gain. > You have a dependency chain on the carry flag so have delays between the 'adcq' > instructions (these may be more significant than the memory reads from l1 cache). > > I also don't remember (might be wrong) the 'loop' instruction being executed quickly. > If 'loop' is fast then you will probably find that: > > 10: adcq 0(%rdi),%rax > lea 8(%rdi),%rdi > loop 10b > > is just as fast since the three instructions could all be executed in parallel. > But I suspect that 'dec %cx; jnz 10b' is actually better (and might execute as > a single micro-op). > IIRC 'adc' and 'dec' will both have dependencies on the flags register > so cannot execute together (which is a shame here). > > It is also possible that breaking the carry-chain dependency by doing 32bit > adds (possibly after 64bit reads) can be made to be faster. If nothing else reducing the size of this main loop may be desirable. I know the newer x86 is supposed to have a loop buffer so that it can basically loop on already decoded instructions. Normally it is only something like 64 or 128 bytes in size though. You might find that reducing this loop to that smaller size may improve the performance for larger payloads. - Alex
On Thu, Feb 4, 2016 at 8:51 AM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Thu, Feb 4, 2016 at 3:08 AM, David Laight <David.Laight@aculab.com> wrote: >> From: Tom Herbert >>> Sent: 03 February 2016 19:19 >> ... >>> + /* Main loop */ >>> +50: adcq 0*8(%rdi),%rax >>> + adcq 1*8(%rdi),%rax >>> + adcq 2*8(%rdi),%rax >>> + adcq 3*8(%rdi),%rax >>> + adcq 4*8(%rdi),%rax >>> + adcq 5*8(%rdi),%rax >>> + adcq 6*8(%rdi),%rax >>> + adcq 7*8(%rdi),%rax >>> + adcq 8*8(%rdi),%rax >>> + adcq 9*8(%rdi),%rax >>> + adcq 10*8(%rdi),%rax >>> + adcq 11*8(%rdi),%rax >>> + adcq 12*8(%rdi),%rax >>> + adcq 13*8(%rdi),%rax >>> + adcq 14*8(%rdi),%rax >>> + adcq 15*8(%rdi),%rax >>> + lea 128(%rdi), %rdi >>> + loop 50b >> >> I'd need convincing that unrolling the loop like that gives any significant gain. >> You have a dependency chain on the carry flag so have delays between the 'adcq' >> instructions (these may be more significant than the memory reads from l1 cache). >> >> I also don't remember (might be wrong) the 'loop' instruction being executed quickly. >> If 'loop' is fast then you will probably find that: >> >> 10: adcq 0(%rdi),%rax >> lea 8(%rdi),%rdi >> loop 10b >> >> is just as fast since the three instructions could all be executed in parallel. >> But I suspect that 'dec %cx; jnz 10b' is actually better (and might execute as >> a single micro-op). >> IIRC 'adc' and 'dec' will both have dependencies on the flags register >> so cannot execute together (which is a shame here). >> >> It is also possible that breaking the carry-chain dependency by doing 32bit >> adds (possibly after 64bit reads) can be made to be faster. > > If nothing else reducing the size of this main loop may be desirable. > I know the newer x86 is supposed to have a loop buffer so that it can > basically loop on already decoded instructions. Normally it is only > something like 64 or 128 bytes in size though. You might find that > reducing this loop to that smaller size may improve the performance > for larger payloads. > I saw 128 to be better in my testing. For large packets this loop does all the work. I see performance dependent on the amount of loop overhead, i.e. we got it down to two non-adcq instructions but it is still noticeable. Also, this helps a lot on sizes up to 128 bytes since we only need to do single call in the jump table and no trip through the loop. Tom > - Alex
From: Tom Herbert ... > > If nothing else reducing the size of this main loop may be desirable. > > I know the newer x86 is supposed to have a loop buffer so that it can > > basically loop on already decoded instructions. Normally it is only > > something like 64 or 128 bytes in size though. You might find that > > reducing this loop to that smaller size may improve the performance > > for larger payloads. > > I saw 128 to be better in my testing. For large packets this loop does > all the work. I see performance dependent on the amount of loop > overhead, i.e. we got it down to two non-adcq instructions but it is > still noticeable. Also, this helps a lot on sizes up to 128 bytes > since we only need to do single call in the jump table and no trip > through the loop. But one of your 'loop overhead' instructions is 'loop'. Look at http://www.agner.org/optimize/instruction_tables.pdf you don't want to be using 'loop' on intel cpus. You might get some benefit from pipelining the loop (so you do a read to register in one iteration and a register-register adc the next). David
On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote: > Implement assembly routine for csum_partial for 64 bit x86. This > primarily speeds up checksum calculation for smaller lengths such as > those that are present when doing skb_postpull_rcsum when getting > CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY > conversion. > > CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether > we need to avoid unaligned accesses. Efficient unaligned accesses > offer a nice additional speedup as demonstrated in the results > provided below. > > This implementation is similar to csum_partial implemented in > checksum_32.S, however since we are dealing with 8 bytes at a time > there are more cases for small lengths (and alignments)-- for that > we employ a jump table. > > Testing: > > Correctness: > > Verified correctness by testing arbitrary length buffer filled with > random data. For each buffer I compared the computed checksum > using the original algorithm for each possible alignment (0-7 bytes). > > Checksum performance: > > Isolating old and new implementation for some common cases: > > Old NewA NewA % NewNoA NewNoA % > Len/Aln nsec nsec Improv nsecs Improve > --------+-------+--------+-------+-------+--------------------- > 1400/0 192.9 175.1 10% 174.9 10% (Big packet) > 40/0 13.8 7.7 44% 5.7 58% (Ipv6 hdr cmn case) > 8/4 8.4 6.9 18% 2.8 67% (UDP, VXLAN in IPv4) > 14/0 10.5 7.3 30% 5.4 48% (Eth hdr) > 14/4 10.8 8.7 19% 5.4 50% (Eth hdr in IPv4) > 14/3 11.0 9.8 11% 5.6 49% (Eth with odd align) > 7/1 10.0 5.8 42% 4.8 52% (buffer in one quad) > > NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set > NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set > > Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz > > Also test on these with similar results: > > Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz > Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz > > Branch prediction: > > To test the effects of poor branch prediction in the jump tables I > tested checksum performance with runs for two combinations of length > and alignment. As the baseline I performed the test by doing half of > calls with the first combination, followed by using the second > combination for the second half. In the test case, I interleave the > two combinations so that in every call the length and alignment are > different to defeat the effects of branch prediction. Running several > cases, I did not see any material performance difference between > the baseline and the interleaving test case. > > Signed-off-by: Tom Herbert <tom@herbertland.com> So a couple of general questions about all this. First why do you break the alignment part out over so many reads? It seems like you could just get the offset, subtract that from your starting buffer, read the aligned long, and then mask off the bits you don't want. That should take maybe 4 or 5 instructions which would save you a bunch of jumping around since most of it is just arithmetic. I am still not sure about the loops, but as mentioned you probably don't want to use loop. You would likely be better off generating a pointer representing the last valid offset and just running through the loop that way. Then on the last read the same question as the first. Why not just do one read and mask the result. It should be faster. Finally I am not sure if your result is safe if the offset for the buffer is an odd value. The original code did a collapse to 16b and rotate by 8 if the offset was odd. I don't see anything like that in your code. - Alex
On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar <mingo@kernel.org> wrote: > > * Ingo Molnar <mingo@kernel.org> wrote: > >> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> >> > + >> > + /* Check length */ >> > +10: cmpl $8, %esi >> > + jg 30f >> > + jl 20f >> > + >> > + /* Exactly 8 bytes length */ >> > + addl (%rdi), %eax >> > + adcl 4(%rdi), %eax >> > + RETURN >> > + >> > + /* Less than 8 bytes length */ >> > +20: clc >> > + jmpq *branch_tbl_len(, %rsi, 8) >> > + >> > + /* Greater than 8 bytes length. Determine number of quads (n). Sum >> > + * over first n % 16 quads >> > + */ >> > +30: movl %esi, %ecx >> > + shrl $3, %ecx >> > + andl $0xf, %ecx >> > + negq %rcx >> > + lea 40f(, %rcx, 4), %r11 >> > + clc >> > + jmp *%r11 >> >> Are you absolutely sure that a jump table is the proper solution here? It >> defeats branch prediction on most x86 uarchs. Why not label the loop stages and >> jump in directly with a binary tree of branches? > > So just to expand on this a bit, attached below is a quick & simple & stupid > testcase that generates a 16 entries call table. (Indirect jumps and indirect > calls are predicted similarly on most x86 uarchs.) Just built it with: > > gcc -Wall -O2 -o jump-table jump-table.c > > Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel CPU > and also on AMD Opteron. The numbers are from the Intel box.) this gives a high > branch miss rate with a 16 entries jump table: > > triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16 > ... using 16 jump table entries. > ... using 100000000 loop iterations. > ... result: 10000000100000000 > [...] > > Performance counter stats for './jump-table 16' (10 runs): > > 1386.131780 task-clock (msec) # 1.001 CPUs utilized ( +- 0.18% ) > 33 context-switches # 0.024 K/sec ( +- 1.71% ) > 0 cpu-migrations # 0.000 K/sec > 52 page-faults # 0.037 K/sec ( +- 0.71% ) > 6,247,215,683 cycles # 4.507 GHz ( +- 0.18% ) > 3,895,337,877 stalled-cycles-frontend # 62.35% frontend cycles idle ( +- 0.30% ) > 1,404,014,996 instructions # 0.22 insns per cycle > # 2.77 stalled cycles per insn ( +- 0.02% ) > 300,820,988 branches # 217.022 M/sec ( +- 0.02% ) > 87,518,741 branch-misses # 29.09% of all branches ( +- 0.01% ) > > 1.385240076 seconds time elapsed ( +- 0.21% ) > > ... as you can see the branch miss rate is very significant, causing a stalled > decoder and very low instruction throughput. > > I have to reduce the jump table to a single entry (!) to get good performance on > this CPU: > > Performance counter stats for './jump-table 1' (10 runs): > > 739.173505 task-clock (msec) # 1.001 CPUs utilized ( +- 0.26% ) > 37 context-switches # 0.051 K/sec ( +- 16.79% ) > 0 cpu-migrations # 0.000 K/sec > 52 page-faults # 0.070 K/sec ( +- 0.41% ) > 3,331,328,405 cycles # 4.507 GHz ( +- 0.26% ) > 2,012,973,596 stalled-cycles-frontend # 60.43% frontend cycles idle ( +- 0.47% ) > 1,403,880,792 instructions # 0.42 insns per cycle > # 1.43 stalled cycles per insn ( +- 0.05% ) > 300,817,064 branches # 406.964 M/sec ( +- 0.05% ) > 12,177 branch-misses # 0.00% of all branches ( +- 12.39% ) > > 0.738616356 seconds time elapsed ( +- 0.26% ) > > Note how the runtime got halved: that is because stalls got halved and the > instructions per cycle throughput doubled. 
> > Even a two entries jump table performs poorly: > > Performance counter stats for './jump-table 2' (10 runs): > > 1493.790686 task-clock (msec) # 1.001 CPUs utilized ( +- 0.06% ) > 39 context-switches # 0.026 K/sec ( +- 4.73% ) > 0 cpu-migrations # 0.000 K/sec > 52 page-faults # 0.035 K/sec ( +- 0.26% ) > 6,732,372,612 cycles # 4.507 GHz ( +- 0.06% ) > 4,229,130,302 stalled-cycles-frontend # 62.82% frontend cycles idle ( +- 0.09% ) > 1,407,803,145 instructions # 0.21 insns per cycle > # 3.00 stalled cycles per insn ( +- 0.08% ) > 301,688,946 branches # 201.962 M/sec ( +- 0.09% ) > 100,022,150 branch-misses # 33.15% of all branches ( +- 0.00% ) > > 1.492820524 seconds time elapsed ( +- 0.08% ) > > In fact it's worse than the 16 entries runtime. > > ( Note that Intel SkyLake improved the indirect jump/call branch predictor. > On another Intel CPU I have the boundary between good and bad prediction is > at 4-6 entries. So this is highly uarch (and code layout) dependent. ) > > In contrast, a static hierarchy/tree of branches is predicted a lot better on most > x86 uarchs, even with highly variable input. (This does not even count the extra > calculations needed to calculate the jump table index, which, as you coded it in > your patch, add 2-3 cycles of extra latency as well.) > > Such branch miss characteristics matter and they can become more visible with > smaller skb sizes. > > So I'm generally sceptical of jump tables and I'd like to see careful and > convincing perf stat runs on modern x86 uarchs, comparing the jump table solution > to a branch hierarchy solution, before accepting a jump table solution into the > x86 architecture. > > x86 uarchs are generally good at predicting function pointer indirect jumps (which > tend to be static) - multi-target jump tables are a lot harder nut to crack, > especially if the jump index is calculated right before the jump is performed, > like your patch does it. > > The impact of branch misses is more pronounced on modern Intel CPUs, because > speculation is deeper. > > But as always I can be convinced with contrary numbers! :-) > Hi Ingo, Thanks for the explanation and sample code. Expanding on your example, I added a switch statement to perform to function (code below). gcc seems to be implementing this switch as a jump table: jmpq *0x400ac8(,%rdx,8) Which is a different call than the function table implementation: callq *(%rsp,%rdx,8) The switch jump table method is what I coded in the assembly for the checksum routine (I basically just copied the assembly from what gcc was doing with the C code for the same function). Running using the functptr and switch table gives: taskset 1 perf stat --repeat 10 ./jump_table -S 16 ... using funcptr ... 
result: 10000000100000000 Performance counter stats for './jump_table -S 16' (10 runs): 2318.192392 task-clock (msec) # 0.999 CPUs utilized ( +- 1.02% ) 9 context-switches # 0.004 K/sec ( +- 9.51% ) 1 cpu-migrations # 0.000 K/sec ( +- 68.31% ) 132 page-faults # 0.057 K/sec ( +- 0.10% ) 6,328,766,014 cycles # 2.730 GHz ( +- 0.04% ) [49.99%] 3,325,346,741 stalled-cycles-frontend # 52.54% frontend cycles idle ( +- 0.10% ) [50.01%] 1,793,615,301 stalled-cycles-backend # 28.34% backend cycles idle ( +- 2.84% ) [50.04%] 1,509,733,276 instructions # 0.24 insns per cycle # 2.20 stalled cycles per insn ( +- 0.02% ) [50.03%] 301,788,275 branches # 130.183 M/sec ( +- 0.01% ) [50.01%] 100,030,120 branch-misses # 33.15% of all branches ( +- 0.01% ) [49.98%] 2.320510795 seconds time elapsed ( +- 1.02% ) taskset 1 perf stat --repeat 10 ./jump_table -S 16 -x ... using switch ... result: 10000000100000000 Performance counter stats for './jump_table -S 16 -x' (10 runs): 942.117858 task-clock (msec) # 0.999 CPUs utilized ( +- 1.07% ) 5 context-switches # 0.005 K/sec ( +- 13.04% ) 2 cpu-migrations # 0.003 K/sec ( +- 42.27% ) 132 page-faults # 0.140 K/sec ( +- 0.12% ) 2,521,873,028 cycles # 2.677 GHz ( +- 0.09% ) [50.01%] 1,021,412,581 stalled-cycles-frontend # 40.50% frontend cycles idle ( +- 0.17% ) [50.06%] 255,591,030 stalled-cycles-backend # 10.13% backend cycles idle ( +- 13.11% ) [50.10%] 1,104,081,483 instructions # 0.44 insns per cycle # 0.93 stalled cycles per insn ( +- 0.01% ) [50.05%] 300,713,493 branches # 319.189 M/sec ( +- 0.01% ) [49.98%] 11,500 branch-misses # 0.00% of all branches ( +- 12.18% ) [49.95%] 0.943088132 seconds time elapsed ( +- 1.07% ) So the switch implementation does not seem to be causing branch-misses to be counted and is a lot faster. This is also what I see when looking at the branch-misses with the checksum code-- I haven't yet found any test case thrashing lengths or alignments increase branch-misses and shows a performance degradation over the case where they are static. Tom /* * * Simple testcase to generate jump table usage. 
* */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <getopt.h> typedef unsigned long (fn_t) (unsigned long param); #define TABLE_SIZE_MAX 16L unsigned long global; unsigned long table_size = TABLE_SIZE_MAX; unsigned long loops = 100000000; int doswitch = 0; #define DEFINE_FUN(x) \ \ unsigned long fun_##x(unsigned long param) \ { \ return param+global; \ } DEFINE_FUN( 0); DEFINE_FUN( 1); DEFINE_FUN( 2); DEFINE_FUN( 3); DEFINE_FUN( 4); DEFINE_FUN( 5); DEFINE_FUN( 6); DEFINE_FUN( 7); DEFINE_FUN( 8); DEFINE_FUN( 9); DEFINE_FUN(10); DEFINE_FUN(11); DEFINE_FUN(12); DEFINE_FUN(13); DEFINE_FUN(14); DEFINE_FUN(15); static void usage(void) { printf("usage: jump-table [-S <table_size>] [-L <loops> ] [-h] [-x]\n"); exit(0); } unsigned long do_funcptr(void) { int i; unsigned long local = 0; fn_t *fn_ptr [TABLE_SIZE_MAX] = { fun_0 , fun_1 , fun_2 , fun_3 , fun_4 , fun_5 , fun_6 , fun_7 , fun_8 , fun_9, fun_10, fun_11, fun_12, fun_13, fun_14, fun_15, }; /* Call a set of simple arithmetic functions from a jump table: */ for (i = 0; i < loops; i++) { global++; local += fn_ptr[global % table_size](global); } return local; } unsigned long do_switch(void) { int i; unsigned long local = 0; /* Call a set of simple arithmetic functions via switch: */ #define CASE(X) case X: \ local += fun_##X(global); \ break; for (i = 0; i < loops; i++) { global++; switch (global % table_size) { CASE(0) CASE(1) CASE(2) CASE(3) CASE(4) CASE(5) CASE(6) CASE(7) CASE(8) CASE(9) CASE(10) CASE(11) CASE(12) CASE(13) CASE(14) CASE(15) } } return local; } int main(int argc, char **argv) { unsigned long local = 0; int c; while ((c = getopt(argc, argv, "hS:xL:")) != -1) switch (c) { case 'S': table_size = atoi(optarg); if (table_size <= 1) table_size = 1; if (table_size > TABLE_SIZE_MAX) table_size = TABLE_SIZE_MAX; break; case 'x': doswitch = 1; break; case 'L': loops = atoi(optarg); case 'h': default: usage(); } printf("... using %lu jump table entries.\n", table_size); printf("... using %lu loop iterations.\n", loops); if (doswitch) { printf("... using switch\n"); } else { printf("... using funcptr\n"); } local = doswitch ? do_switch() : do_funcptr(); printf("... result: %lu\n", local); return 0; } > Thanks, > > Ingo > > --------------------------------------> > /* > * Simple testcase to generate jump table usage. 
> */ > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > > typedef unsigned long (fn_t) (unsigned long param); > > unsigned long global; > > #define DEFINE_FUN(x) \ > \ > unsigned long fun_##x(unsigned long param) \ > { \ > return param+global; \ > } > > #define TABLE_SIZE_MAX 16L > > DEFINE_FUN( 1); > DEFINE_FUN( 2); > DEFINE_FUN( 3); > DEFINE_FUN( 4); > DEFINE_FUN( 5); > DEFINE_FUN( 6); > DEFINE_FUN( 7); > DEFINE_FUN( 8); > DEFINE_FUN( 9); > DEFINE_FUN(10); > DEFINE_FUN(11); > DEFINE_FUN(12); > DEFINE_FUN(13); > DEFINE_FUN(14); > DEFINE_FUN(15); > DEFINE_FUN(16); > > int main(int argc, char **argv) > { > fn_t *fn_ptr [TABLE_SIZE_MAX] = { fun_1 , fun_2 , fun_3 , fun_4 , > fun_5 , fun_6 , fun_7 , fun_8 , > fun_9 , fun_10, fun_11, fun_12, > fun_13, fun_14, fun_15, fun_16, > }; > unsigned long local = 0; > unsigned long i; > unsigned long table_size = TABLE_SIZE_MAX; > unsigned long loops = 100000000; > > > if ((argc >= 2 && !strcmp(argv[1], "-h")) || argc >= 4) { > printf("usage: jump-table <table_size(1-%lu)(default: %lu)> <loops(default: %lu)>\n", TABLE_SIZE_MAX, TABLE_SIZE_MAX, loops); > exit(0); > } > > if (argc >= 2) { > table_size = atol(argv[1]); > if (table_size <= 1) > table_size = 1; > if (table_size > TABLE_SIZE_MAX) > table_size = TABLE_SIZE_MAX; > } > printf("... using %lu jump table entries.\n", table_size); > > if (argc >= 3) > loops = atol(argv[2]); > printf("... using %lu loop iterations.\n", loops); > > /* Call a set of simple arithmetic functions from a jump table: */ > for (i = 0; i < loops; i++) { > global++; > local += fn_ptr[global % table_size](global); > } > > printf("... result: %lu\n", local); > }
On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote: >> Implement assembly routine for csum_partial for 64 bit x86. This >> primarily speeds up checksum calculation for smaller lengths such as >> those that are present when doing skb_postpull_rcsum when getting >> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY >> conversion. >> >> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether >> we need to avoid unaligned accesses. Efficient unaligned accesses >> offer a nice additional speedup as demonstrated in the results >> provided below. >> >> This implementation is similar to csum_partial implemented in >> checksum_32.S, however since we are dealing with 8 bytes at a time >> there are more cases for small lengths (and alignments)-- for that >> we employ a jump table. >> >> Testing: >> >> Correctness: >> >> Verified correctness by testing arbitrary length buffer filled with >> random data. For each buffer I compared the computed checksum >> using the original algorithm for each possible alignment (0-7 bytes). >> >> Checksum performance: >> >> Isolating old and new implementation for some common cases: >> >> Old NewA NewA % NewNoA NewNoA % >> Len/Aln nsec nsec Improv nsecs Improve >> --------+-------+--------+-------+-------+--------------------- >> 1400/0 192.9 175.1 10% 174.9 10% (Big packet) >> 40/0 13.8 7.7 44% 5.7 58% (Ipv6 hdr cmn case) >> 8/4 8.4 6.9 18% 2.8 67% (UDP, VXLAN in IPv4) >> 14/0 10.5 7.3 30% 5.4 48% (Eth hdr) >> 14/4 10.8 8.7 19% 5.4 50% (Eth hdr in IPv4) >> 14/3 11.0 9.8 11% 5.6 49% (Eth with odd align) >> 7/1 10.0 5.8 42% 4.8 52% (buffer in one quad) >> >> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set >> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set >> >> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz >> >> Also test on these with similar results: >> >> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz >> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz >> >> Branch prediction: >> >> To test the effects of poor branch prediction in the jump tables I >> tested checksum performance with runs for two combinations of length >> and alignment. As the baseline I performed the test by doing half of >> calls with the first combination, followed by using the second >> combination for the second half. In the test case, I interleave the >> two combinations so that in every call the length and alignment are >> different to defeat the effects of branch prediction. Running several >> cases, I did not see any material performance difference between >> the baseline and the interleaving test case. >> >> Signed-off-by: Tom Herbert <tom@herbertland.com> > > So a couple of general questions about all this. > > First why do you break the alignment part out over so many reads? It > seems like you could just get the offset, subtract that from your > starting buffer, read the aligned long, and then mask off the bits you > don't want. That should take maybe 4 or 5 instructions which would > save you a bunch of jumping around since most of it is just > arithmetic. > I will look at doing that. > I am still not sure about the loops, but as mentioned you probably > don't want to use loop. You would likely be better off generating a > pointer representing the last valid offset and just running through > the loop that way. > > Then on the last read the same question as the first. Why not just do > one read and mask the result. It should be faster. 
> > Finally I am not sure if your result is safe if the offset for the > buffer is an odd value. The original code did a collapse to 16b and > rotate by 8 if the offset was odd. I don't see anything like that in > your code. The roll $8 is done in the RETURN macro. We only need to worry about this when alignment matters. Thanks, Tom > > - Alex
On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote: >> Implement assembly routine for csum_partial for 64 bit x86. This >> primarily speeds up checksum calculation for smaller lengths such as >> those that are present when doing skb_postpull_rcsum when getting >> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY >> conversion. >> >> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether >> we need to avoid unaligned accesses. Efficient unaligned accesses >> offer a nice additional speedup as demonstrated in the results >> provided below. >> >> This implementation is similar to csum_partial implemented in >> checksum_32.S, however since we are dealing with 8 bytes at a time >> there are more cases for small lengths (and alignments)-- for that >> we employ a jump table. >> >> Testing: >> >> Correctness: >> >> Verified correctness by testing arbitrary length buffer filled with >> random data. For each buffer I compared the computed checksum >> using the original algorithm for each possible alignment (0-7 bytes). >> >> Checksum performance: >> >> Isolating old and new implementation for some common cases: >> >> Old NewA NewA % NewNoA NewNoA % >> Len/Aln nsec nsec Improv nsecs Improve >> --------+-------+--------+-------+-------+--------------------- >> 1400/0 192.9 175.1 10% 174.9 10% (Big packet) >> 40/0 13.8 7.7 44% 5.7 58% (Ipv6 hdr cmn case) >> 8/4 8.4 6.9 18% 2.8 67% (UDP, VXLAN in IPv4) >> 14/0 10.5 7.3 30% 5.4 48% (Eth hdr) >> 14/4 10.8 8.7 19% 5.4 50% (Eth hdr in IPv4) >> 14/3 11.0 9.8 11% 5.6 49% (Eth with odd align) >> 7/1 10.0 5.8 42% 4.8 52% (buffer in one quad) >> >> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set >> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set >> >> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz >> >> Also test on these with similar results: >> >> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz >> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz >> >> Branch prediction: >> >> To test the effects of poor branch prediction in the jump tables I >> tested checksum performance with runs for two combinations of length >> and alignment. As the baseline I performed the test by doing half of >> calls with the first combination, followed by using the second >> combination for the second half. In the test case, I interleave the >> two combinations so that in every call the length and alignment are >> different to defeat the effects of branch prediction. Running several >> cases, I did not see any material performance difference between >> the baseline and the interleaving test case. >> >> Signed-off-by: Tom Herbert <tom@herbertland.com> > > So a couple of general questions about all this. > > First why do you break the alignment part out over so many reads? It > seems like you could just get the offset, subtract that from your > starting buffer, read the aligned long, and then mask off the bits you > don't want. That should take maybe 4 or 5 instructions which would > save you a bunch of jumping around since most of it is just > arithmetic. > This would require reading bytes before the buffer pointer back to an 8 byte offset. Is this *always* guaranteed to be safe? Similar question on reading bytes past the length. > I am still not sure about the loops, but as mentioned you probably > don't want to use loop. 
You would likely be better off generating a > pointer representing the last valid offset and just running through > the loop that way. > > Then on the last read the same question as the first. Why not just do > one read and mask the result. It should be faster. > > Finally I am not sure if your result is safe if the offset for the > buffer is an odd value. The original code did a collapse to 16b and > rotate by 8 if the offset was odd. I don't see anything like that in > your code. > > - Alex
On Thu, Feb 4, 2016 at 11:44 AM, Tom Herbert <tom@herbertland.com> wrote: > On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck > <alexander.duyck@gmail.com> wrote: >> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote: >>> Implement assembly routine for csum_partial for 64 bit x86. This >>> primarily speeds up checksum calculation for smaller lengths such as >>> those that are present when doing skb_postpull_rcsum when getting >>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY >>> conversion. >>> >>> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether >>> we need to avoid unaligned accesses. Efficient unaligned accesses >>> offer a nice additional speedup as demonstrated in the results >>> provided below. >>> >>> This implementation is similar to csum_partial implemented in >>> checksum_32.S, however since we are dealing with 8 bytes at a time >>> there are more cases for small lengths (and alignments)-- for that >>> we employ a jump table. >>> >>> Testing: >>> >>> Correctness: >>> >>> Verified correctness by testing arbitrary length buffer filled with >>> random data. For each buffer I compared the computed checksum >>> using the original algorithm for each possible alignment (0-7 bytes). >>> >>> Checksum performance: >>> >>> Isolating old and new implementation for some common cases: >>> >>> Old NewA NewA % NewNoA NewNoA % >>> Len/Aln nsec nsec Improv nsecs Improve >>> --------+-------+--------+-------+-------+--------------------- >>> 1400/0 192.9 175.1 10% 174.9 10% (Big packet) >>> 40/0 13.8 7.7 44% 5.7 58% (Ipv6 hdr cmn case) >>> 8/4 8.4 6.9 18% 2.8 67% (UDP, VXLAN in IPv4) >>> 14/0 10.5 7.3 30% 5.4 48% (Eth hdr) >>> 14/4 10.8 8.7 19% 5.4 50% (Eth hdr in IPv4) >>> 14/3 11.0 9.8 11% 5.6 49% (Eth with odd align) >>> 7/1 10.0 5.8 42% 4.8 52% (buffer in one quad) >>> >>> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set >>> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set >>> >>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz >>> >>> Also test on these with similar results: >>> >>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz >>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz >>> >>> Branch prediction: >>> >>> To test the effects of poor branch prediction in the jump tables I >>> tested checksum performance with runs for two combinations of length >>> and alignment. As the baseline I performed the test by doing half of >>> calls with the first combination, followed by using the second >>> combination for the second half. In the test case, I interleave the >>> two combinations so that in every call the length and alignment are >>> different to defeat the effects of branch prediction. Running several >>> cases, I did not see any material performance difference between >>> the baseline and the interleaving test case. >>> >>> Signed-off-by: Tom Herbert <tom@herbertland.com> >> >> So a couple of general questions about all this. >> >> First why do you break the alignment part out over so many reads? It >> seems like you could just get the offset, subtract that from your >> starting buffer, read the aligned long, and then mask off the bits you >> don't want. That should take maybe 4 or 5 instructions which would >> save you a bunch of jumping around since most of it is just >> arithmetic. >> > This would require reading bytes before the buffer pointer back to an > 8 byte offset. Is this *always* guaranteed to be safe? Similar > question on reading bytes past the length. 
I would think we should be fine as long as we are only confining ourselves to the nearest long. I would assume page and segmentation boundaries are at least aligned by the size of a long. As such I don't see how reading a few extra bytes to mask off later should cause any issue. - Alex
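[Editor's illustration] The "read the aligned quad and mask" idea might look roughly like the sketch below (untested, little-endian, helper name made up). It rounds the pointer down to an 8-byte boundary, loads the whole word, and zeroes the bytes that precede 'buf'; the load never leaves the aligned quad containing 'buf', which is the safety argument Alexander makes above. It assumes 'buf' is misaligned and that at least (8 - align) bytes remain; if the misalignment is odd, the rotate-by-8 fixup discussed elsewhere in the thread is still needed at the end.

  /*
   * Sketch only: sum the leading bytes of a misaligned buffer up to the
   * next 8-byte boundary with one aligned load plus a mask, instead of
   * separate 1/2/4-byte reads.
   */
  static unsigned long csum_leading_quad(const unsigned char *buf,
					 unsigned long sum)
  {
	unsigned long align = (unsigned long)buf & 7;	/* 1-7 bytes of misalignment */
	unsigned long v = *(const unsigned long *)(buf - align);

	v &= ~0UL << (8 * align);	/* zero the bytes before 'buf' (little endian) */

	asm("addq %1,%0 ; adcq $0,%0" : "+r" (sum) : "r" (v));
	return sum;
  }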
On Thu, Feb 4, 2016 at 9:09 AM, David Laight <David.Laight@aculab.com> wrote: > From: Tom Herbert > ... >> > If nothing else reducing the size of this main loop may be desirable. >> > I know the newer x86 is supposed to have a loop buffer so that it can >> > basically loop on already decoded instructions. Normally it is only >> > something like 64 or 128 bytes in size though. You might find that >> > reducing this loop to that smaller size may improve the performance >> > for larger payloads. >> >> I saw 128 to be better in my testing. For large packets this loop does >> all the work. I see performance dependent on the amount of loop >> overhead, i.e. we got it down to two non-adcq instructions but it is >> still noticeable. Also, this helps a lot on sizes up to 128 bytes >> since we only need to do single call in the jump table and no trip >> through the loop. > > But one of your 'loop overhead' instructions is 'loop'. > Look at http://www.agner.org/optimize/instruction_tables.pdf > you don't want to be using 'loop' on intel cpus. > I'm not following. We can replace loop with decl %ecx and jg, but why is that better? Tom > You might get some benefit from pipelining the loop (so you do > a read to register in one iteration and a register-register adc > the next). > > David >
On Thu, Feb 4, 2016 at 12:59 PM, Tom Herbert <tom@herbertland.com> wrote: > On Thu, Feb 4, 2016 at 9:09 AM, David Laight <David.Laight@aculab.com> wrote: >> From: Tom Herbert >> ... >>> > If nothing else reducing the size of this main loop may be desirable. >>> > I know the newer x86 is supposed to have a loop buffer so that it can >>> > basically loop on already decoded instructions. Normally it is only >>> > something like 64 or 128 bytes in size though. You might find that >>> > reducing this loop to that smaller size may improve the performance >>> > for larger payloads. >>> >>> I saw 128 to be better in my testing. For large packets this loop does >>> all the work. I see performance dependent on the amount of loop >>> overhead, i.e. we got it down to two non-adcq instructions but it is >>> still noticeable. Also, this helps a lot on sizes up to 128 bytes >>> since we only need to do single call in the jump table and no trip >>> through the loop. >> >> But one of your 'loop overhead' instructions is 'loop'. >> Look at http://www.agner.org/optimize/instruction_tables.pdf >> you don't want to be using 'loop' on intel cpus. >> > I'm not following. We can replace loop with decl %ecx and jg, but why > is that better? Because loop takes something like 7 cycles whereas the decl/jg approach takes 2 or 3. It is probably one of the reasons things look so much better with the loop unrolled. - Alex
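[Editor's illustration] As a concrete sketch of that point, the loop tail can use DEC/JNZ instead of LOOP; DEC leaves the carry flag untouched, so the ADCQ carry chain survives across iterations. Untested C wrapper around inline assembly, with made-up names, assuming n >= 1:

  /*
   * Sketch only: sum 'n' 8-byte words, closing the loop with DEC/JNZ
   * rather than LOOP.  DEC does not modify CF, so the carry produced by
   * each ADCQ is preserved into the next iteration.
   */
  static unsigned long csum_quads_adc(const unsigned long *p, unsigned long n,
				      unsigned long sum)
  {
	asm("clc\n"
	    "1:\n\t"
	    "adcq (%[p]), %[sum]\n\t"
	    "leaq 8(%[p]), %[p]\n\t"
	    "decq %[n]\n\t"
	    "jnz 1b\n\t"
	    "adcq $0, %[sum]"
	    : [sum] "+r" (sum), [p] "+r" (p), [n] "+r" (n)
	    : : "memory", "cc");
	return sum;
  }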
I missed the original email (I don't have net-devel in my mailbox), but based on Ingo's quoting have a more fundamental question: Why wasn't that done with C code instead of asm with odd numerical targets? It seems likely that the real issue is avoiding the short loops (that will cause branch prediction problems) and use a lookup table instead. But we can probably do better than that asm. For example, for the actual "8 bytes or shorter" case, I think something like this might just work fine: unsigned long csum_partial_8orless(const unsigned char *buf, unsigned long len, unsigned long sum) { static const unsigned long mask[9] = { 0x0000000000000000, 0x000000000000ff00, 0x000000000000ffff, 0x00000000ff00ffff, 0x00000000ffffffff, 0x0000ff00ffffffff, 0x0000ffffffffffff, 0xff00ffffffffffff, 0xffffffffffffffff }; unsigned long val = load_unaligned_zeropad(buf + (len & 1)); val &= mask[len]; asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum)); return sum; } NOTE! The above is 100% untested. But I _think_ it may handle the odd-byte-case correctly, and it should result in just one 8-byte load (the "load_unaligned_zeropad()" is just in case that ends up overflowing and we have page-alloc-debug triggering a page fault at the end). All without looping or any conditional branches that might mispredict. My point is that going to assembly results in pretty hard-to-read code, but it's also fairly inflexible. If we stay with C, we can much more easily play tricks. So for example, make the above an inline function, and now you can do things like this: static inline unsigned long csum_partial_64bit(void *buf, unsigned long len, unsigned long sum) { if (len <= 8) return csum_partial_8orless(buf, len, sum); /* We know it's larger than 8 bytes, so handle alignment */ align = 7 & -(unsigned long)buf; sum = csum_partial_8orless(buf, align, sum); buf += align; /* We need to do the big-endian thing */ sum = rotate_by8_if_odd(sum, align); /* main loop for big packets */ .. do the unrolled inline asm thing that we already do .. sum = rotate_by8_if_odd(sum, align); /* Handle the last bytes */ return csum_partial_8orless(buf, len, sum); } /* Fold the 64-bit sum we computed down to 32 bits __wsum */ __wsum int csum_partial(void *buf, unsigned int len, __wsum partial) { unsigned long sum = csum_partial_64bit(ptr, len, partial); asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum)); return sum; } or something like that. NOTE NOTE NOTE! I did a half-arsed attempt at getting the whole "big-endian 16-bit add" thing right by doing the odd byte masking in the end cases, and by rotating the sum by 8 bits around the 8-byte-unrolled-loop, but I didn't test the above. It's literally written in my mail client. So I can almost guarantee that it is buggy, but it is meant as an *example* of "why not do it this way" rather than production code. I think that writing it in C and trying to be intelligent about it like the above would result in more maintainable code, and it is possible that it would even be faster. Yes, writing it in C *does* likely result in a few more cases of "adcq $0" in order to finish up the carry calculations. The *only* advantage of inline asm is how it allows you to keep the carry flag around. So there is downside to the C model, and it might cause a cycle or two of extra work, but the upside of C is that you can try to do clever things without turning the code completely unreadable. 
For example, doing the exception handling (that will never actually trigger) for the "let's just do an 8-byte load" is just painful in assembly. In C, we already have the helper function to do it.

Hmm? Would somebody be willing to take the likely very buggy code above, and test it and make it work? I assume there's a test harness for the whole csum thing somewhere.

              Linus
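On the test-harness question: there does not appear to be a ready-made one in the tree, but a minimal user-space harness is easy to sketch. Everything below is illustrative: csum_partial_under_test() stands in for whichever implementation is extracted to user space, and results are compared only after folding to 16 bits, since different implementations may legitimately return different (but equivalent) 32-bit partial sums.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Whichever implementation is being tested, linked into this program. */
extern uint32_t csum_partial_under_test(const void *buff, int len, uint32_t sum);

/* Plain RFC 1071 style reference: little-endian 16-bit words, odd byte last. */
static uint32_t csum_reference(const unsigned char *buf, int len, uint32_t sum)
{
	uint64_t acc = sum;
	int i;

	for (i = 0; i + 1 < len; i += 2)
		acc += (uint32_t)buf[i] | ((uint32_t)buf[i + 1] << 8);
	if (len & 1)
		acc += buf[len - 1];
	while (acc >> 32)
		acc = (acc & 0xffffffff) + (acc >> 32);
	return (uint32_t)acc;
}

static uint16_t fold16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

int main(void)
{
	static unsigned char buf[4096 + 16];
	int len, align, i;

	srand(1);
	for (i = 0; i < (int)sizeof(buf); i++)
		buf[i] = rand();

	for (len = 0; len <= 4096; len++) {
		for (align = 0; align < 8; align++) {
			uint32_t init = rand();
			uint16_t want = fold16(csum_reference(buf + align, len, init));
			uint16_t got = fold16(csum_partial_under_test(buf + align, len, init));

			if (want != got) {
				printf("MISMATCH len=%d align=%d: %04x != %04x\n",
				       len, align, got, want);
				return 1;
			}
		}
	}
	printf("ok\n");
	return 0;
}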
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>         static const unsigned long mask[9] = {
>                 0x0000000000000000,
>                 0x000000000000ff00,
>                 0x000000000000ffff,
>                 0x00000000ff00ffff,
>                 0x00000000ffffffff,
>                 0x0000ff00ffffffff,
>                 0x0000ffffffffffff,
>                 0xff00ffffffffffff,
>                 0xffffffffffffffff };
>         unsigned long val = load_unaligned_zeropad(buf + (len & 1));
>         val &= mask[len];

Yeah, that was buggy. I knew it was likely buggy, but that "buf + (len & 1)"
was just stupid.

The "+" should be "-", of course - the point is to shift up the value by 8
bits for odd cases, and we need to load starting one byte early for that.
The idea is that we use the byte shifter in the load unit to do some work
for us.

And the bitmasks are the wrong way around for the odd cases - it's the low
byte that ends up being bogus for those cases.

So it should probably look something like

        static const unsigned long mask[9] = {
                0x0000000000000000,
                0x000000000000ff00,
                0x000000000000ffff,
                0x00000000ffffff00,
                0x00000000ffffffff,
                0x0000ffffffffff00,
                0x0000ffffffffffff,
                0xffffffffffffff00,
                0xffffffffffffffff };
        unsigned long val = load_unaligned_zeropad(buf - (len & 1));
        val &= mask[len];

and then it *might* work. Still entirely and utterly untested, I just
decided to look at it a bit more and noticed my thinko.

              Linus
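As a side note, the corrected table above has a simple arithmetic structure, so it can be sanity-checked (or generated) rather than hand-written. The helper below is only a sketch of that structure under the semantics described in this mail (odd lengths load one byte early and ignore the low byte of the quad):

#include <stdio.h>
#include <stdint.h>

/* Even len: keep the low 'len' bytes. Odd len: the even mask for len+1
 * with the low byte cleared. */
static uint64_t tail_mask(unsigned int len)
{
	unsigned int bytes = (len + 1) & ~1u;		/* round up to even */
	uint64_t keep = len ? ~0ull >> (64 - 8 * bytes) : 0;

	return (len & 1) ? keep & ~0xffull : keep;
}

int main(void)
{
	static const uint64_t table[9] = {
		0x0000000000000000, 0x000000000000ff00, 0x000000000000ffff,
		0x00000000ffffff00, 0x00000000ffffffff, 0x0000ffffffffff00,
		0x0000ffffffffffff, 0xffffffffffffff00, 0xffffffffffffffff };
	unsigned int len;

	for (len = 0; len <= 8; len++)
		printf("len %u: %016llx %s\n", len,
		       (unsigned long long)tail_mask(len),
		       tail_mask(len) == table[len] ? "ok" : "MISMATCH");
	return 0;
}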
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > I missed the original email (I don't have net-devel in my mailbox), > but based on Ingo's quoting have a more fundamental question: > > Why wasn't that done with C code instead of asm with odd numerical targets? > The reason I did this in assembly is precisely about the your point of having to close the carry chains with adcq $0. I do have a first implementation in C which using switch() to handle alignment, excess length less than 8 bytes, and the odd number of quads to sum in the main loop. gcc turns these switch statements into jump tables (not function tables which is what Ingo's example code was using). The problem I hit was that for each case I needed to close the carry chain in the inline asm so fall through wouldn't have much value and each case is expanded. The C version using switch gave a nice performance gain, moving to all assembly was somewhat better. There is also question of alignment. I f we really don't need to worry about alignment at all on x86, then we should be able to eliminate the complexity of dealing with it. > It seems likely that the real issue is avoiding the short loops (that > will cause branch prediction problems) and use a lookup table instead. > > But we can probably do better than that asm. > > For example, for the actual "8 bytes or shorter" case, I think > something like this might just work fine: > > unsigned long csum_partial_8orless(const unsigned char *buf, > unsigned long len, unsigned long sum) > { > static const unsigned long mask[9] = { > 0x0000000000000000, > 0x000000000000ff00, > 0x000000000000ffff, > 0x00000000ff00ffff, > 0x00000000ffffffff, > 0x0000ff00ffffffff, > 0x0000ffffffffffff, > 0xff00ffffffffffff, > 0xffffffffffffffff }; > unsigned long val = load_unaligned_zeropad(buf + (len & 1)); > val &= mask[len]; > asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum)); > return sum; > } > I will look at doing that. Thanks, Tom > NOTE! The above is 100% untested. But I _think_ it may handle the > odd-byte-case correctly, and it should result in just one 8-byte load > (the "load_unaligned_zeropad()" is just in case that ends up > overflowing and we have page-alloc-debug triggering a page fault at > the end). All without looping or any conditional branches that might > mispredict. > > My point is that going to assembly results in pretty hard-to-read > code, but it's also fairly inflexible. If we stay with C, we can much > more easily play tricks. So for example, make the above an inline > function, and now you can do things like this: > > static inline unsigned long csum_partial_64bit(void *buf, unsigned > long len, unsigned long sum) > { > if (len <= 8) > return csum_partial_8orless(buf, len, sum); > > /* We know it's larger than 8 bytes, so handle alignment */ > align = 7 & -(unsigned long)buf; > sum = csum_partial_8orless(buf, align, sum); > buf += align; > > /* We need to do the big-endian thing */ > sum = rotate_by8_if_odd(sum, align); > > /* main loop for big packets */ > .. do the unrolled inline asm thing that we already do .. 
> > sum = rotate_by8_if_odd(sum, align); > > /* Handle the last bytes */ > return csum_partial_8orless(buf, len, sum); > } > > /* Fold the 64-bit sum we computed down to 32 bits __wsum */ > __wsum int csum_partial(void *buf, unsigned int len, __wsum partial) > { > unsigned long sum = csum_partial_64bit(ptr, len, partial); > asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum)); > return sum; > } > > or something like that. > > NOTE NOTE NOTE! I did a half-arsed attempt at getting the whole > "big-endian 16-bit add" thing right by doing the odd byte masking in > the end cases, and by rotating the sum by 8 bits around the > 8-byte-unrolled-loop, but I didn't test the above. It's literally > written in my mail client. So I can almost guarantee that it is buggy, > but it is meant as an *example* of "why not do it this way" rather > than production code. > > I think that writing it in C and trying to be intelligent about it > like the above would result in more maintainable code, and it is > possible that it would even be faster. > > Yes, writing it in C *does* likely result in a few more cases of "adcq > $0" in order to finish up the carry calculations. The *only* advantage > of inline asm is how it allows you to keep the carry flag around. So > there is downside to the C model, and it might cause a cycle or two of > extra work, but the upside of C is that you can try to do clever > things without turning the code completely unreadable. > > For example, doing the exception handling (that will never actually > trigger) for the "let's just do a 8-byte load" is just painful in > assembly. In C, we already have the helper function to do it. > > Hmm? Would somebody be willing to take the likely very buggy code > above, and test it and make it work? I assume there's a test harness > for the whole csum thing somewhere. > > Linus
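To illustrate Tom's point above about carry chains in the C version: because each inline asm statement is a separate unit, every case of the switch has to close its own chain with an extra ADCQ $0 instead of letting the carry flow across a fall-through. A stripped-down sketch of that shape, with illustrative names rather than the actual patch code:

#include <stdint.h>

/* Add one quad and immediately close the carry chain. */
static inline uint64_t add64_close_carry(uint64_t sum, uint64_t val)
{
	asm("addq %1,%0\n\t"
	    "adcq $0,%0"
	    : "+r" (sum) : "r" (val));
	return sum;
}

/* Sum the remaining 0..3 quads via a switch; each add pays its own
 * "adcq $0" even though the cases fall through. */
static uint64_t sum_tail_quads(const uint64_t *q, unsigned int nquads, uint64_t sum)
{
	switch (nquads & 3) {
	case 3: sum = add64_close_carry(sum, q[2]);	/* fallthrough */
	case 2: sum = add64_close_carry(sum, q[1]);	/* fallthrough */
	case 1: sum = add64_close_carry(sum, q[0]);	/* fallthrough */
	case 0: break;
	}
	return sum;
}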
On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert <tom@herbertland.com> wrote: > > The reason I did this in assembly is precisely about the your point of > having to close the carry chains with adcq $0. I do have a first > implementation in C which using switch() to handle alignment, excess > length less than 8 bytes, and the odd number of quads to sum in the > main loop. gcc turns these switch statements into jump tables (not > function tables which is what Ingo's example code was using). The > problem I hit was that for each case I needed to close the carry chain > in the inline asm so fall through wouldn't have much value and each > case is expanded. The C version using switch gave a nice performance > gain, moving to all assembly was somewhat better. Yeah,. so I _think_ that if my approach works, the code generated for the 0-8 byte case looks just something like movq %rsi, %rax andl $1, %eax subq %rax, %rdi movq %rdx, %rax movq (%rdi), %rcx andq mask(,%rsi,8), %rcx addq %rcx,%rax adcq $0,%rax which is pretty close to optimal (that's with my hopefully fixed version). That's not with the final narrowing from 64-bit to 32-bit, but it looks like it should perform well on most x86 cores, and then the only remaining branches end up being the actual size checking ones. And yes, you could avoid a few "adc $0" instructions in asm, but quite frankly, I think that's a tradeoff that is worth it. My guess is that you can get to within a percent of optimal with the C approach, and I really think it makes it easier to try different things (ie things like the above that avoids the switch table entirely) > There is also question of alignment. I f we really don't need to worry > about alignment at all on x86, then we should be able to eliminate the > complexity of dealing with it. So most x86 alignment issues are about crossing the cache bank size, which is usually 16 or 32 bytes. Unaligned accesses *within* one of those banks should be basically free (there's a whoppign big byte lane shifter, so there's lots of hardware support for that). Also, even when you do cross a cache bank, it's usually pretty cheap. It extra cycles, but it's generally *less* extra cycles than it would be to try to align things in software and doing two accesses. The rule of thumb is that you should never worry about _individual_ unaligned accesses. It's only really worth it aligning things before biggish loops. So aligning to an 8-byte boundary before you then do a lot of "adcq" instructions makes sense, but worrying about unaligned accesses for the beginning/end does generally not. >>> For example, for the actual "8 bytes or shorter" case, I think >> something like this might just work fine: [ snip ] > > I will look at doing that. So I already found and fixed one bug in that approach, but I still think it's a viable model. But you will almost certainly have to fix a few more of my bugs before it really works ;) Linus
On Thu, Feb 4, 2016 at 2:09 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The "+" should be "-", of course - the point is to shift up the value
> by 8 bits for odd cases, and we need to load starting one byte early
> for that. The idea is that we use the byte shifter in the load unit to
> do some work for us.

Ok, so I thought some more about this, and the fact is, we don't actually
want to do the byte shifting at all for the first case (the "length < 8"
case), since the address of that one hasn't been shifted. It's only for
the "we're going to align things to 8 bytes" case that we would want to do
it. But then we might as well use the rotate_by8_if_odd() model, so I
suspect the address games are just entirely pointless.

So here is something that is actually tested (although admittedly not
well), and uses that fairly simple model.

NOTE! I did not do the unrolling of the "adcq" loop in the middle, but
that's a totally trivial thing now. So this isn't very optimized, because
it will do a *lot* of extra "adcq $0" to get rid of the carry bit. But
with that core loop unrolled, you'd get rid of most of them.

              Linus

---
static unsigned long rotate_by8_if_odd(unsigned long sum, unsigned long aligned)
{
	asm("rorq %b1,%0"
		:"=r" (sum)
		:"c" ((aligned & 1) << 3), "0" (sum));
	return sum;
}

static unsigned long csum_partial_lt8(unsigned long val, int len, unsigned long sum)
{
	unsigned long mask = (1ul << len*8)-1;
	val &= mask;
	return add64_with_carry(val, sum);
}

static unsigned long csum_partial_64(const void *buff, unsigned long len, unsigned long sum)
{
	unsigned long align, val;

	// This is the only potentially unaligned access, and it can
	// also theoretically overflow into the next page
	val = load_unaligned_zeropad(buff);
	if (len < 8)
		return csum_partial_lt8(val, len, sum);

	align = 7 & -(unsigned long)buff;
	sum = csum_partial_lt8(val, align, sum);
	buff += align;
	len -= align;

	sum = rotate_by8_if_odd(sum, align);
	while (len >= 8) {
		val = *(unsigned long *) buff;
		sum = add64_with_carry(sum, val);
		buff += 8;
		len -= 8;
	}
	sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
	return rotate_by8_if_odd(sum, align);
}

__wsum csum_partial(const void *buff, unsigned long len, unsigned long sum)
{
	sum = csum_partial_64(buff, len, sum);
	return add32_with_carry(sum, sum >> 32);
}
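The code above leans on three helpers. load_unaligned_zeropad() and add32_with_carry() already exist in the kernel (in <asm/word-at-a-time.h> and arch/x86/include/asm/checksum_64.h respectively); add64_with_carry() does not appear to exist yet, but would presumably be the straightforward 64-bit analogue. Roughly:

/* Sketch only; add32_with_carry() mirrors the existing header, and
 * add64_with_carry() is the assumed 64-bit counterpart. */
static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
{
	asm("addq %2,%0\n\t"
	    "adcq $0,%0"
	    : "=r" (a)
	    : "0" (a), "r" (b));
	return a;
}

static inline unsigned add32_with_carry(unsigned a, unsigned b)
{
	asm("addl %2,%0\n\t"
	    "adcl $0,%0"
	    : "=r" (a)
	    : "0" (a), "r" (b));
	return a;
}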
On Thu, Feb 4, 2016 at 5:27 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>	sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
>	return rotate_by8_if_odd(sum, align);

Actually, that last word-sized access to "buff" might be past the end of
the buffer. The code does the right thing if "len" is zero, except for the
possible page fault or address verifier complaint.

So that very last call to "csum_partial_lt8()" either needs to be
conditional (easy enough to add an "if (len)" around that whole statement)
or the thing could be unconditional but the load needs to use
"load_unaligned_zeropad()" so that the exception is harmless.

It's probably immaterial which one you pick. The few possibly useless ALU
operations vs a possible branch mispredict penalty are probably going to
make it a wash.

The exception will never happen in practice, but if DEBUG_PAGEALLOC is
enabled, or if something like KASAN is active, it will complain loudly if
it happens to go past the allocation.

              Linus
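Of the two options, the conditional one is the smaller change: after the main loop 'buff' is 8-byte aligned, so an aligned quad load cannot cross a page for len 1..7, and the only access that can actually fault is the len == 0 case, which the guard removes. A sketch of that variant, reusing csum_partial_lt8() and load_unaligned_zeropad() from the earlier mails:

/* 'buff' is 8-byte aligned and 'len' is 0..7 here. */
static unsigned long csum_tail(const void *buff, unsigned long len, unsigned long sum)
{
	if (len)	/* skip the read entirely when nothing is left */
		sum = csum_partial_lt8(*(const unsigned long *)buff, len, sum);
	/* or, unconditionally:
	 * sum = csum_partial_lt8(load_unaligned_zeropad(buff), len, sum); */
	return sum;
}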
* Tom Herbert <tom@herbertland.com> wrote:

> [....] gcc turns these switch statements into jump tables (not function tables
> which is what Ingo's example code was using). [...]

So to the extent this still matters, on most x86 microarchitectures that count,
jump tables and function call tables (i.e. virtual functions that C++ uses) are
generally optimized by the same branch predictor hardware mechanism.

Indirect jumps (jump tables) and indirect calls (function pointer tables) are very
similar conceptually. That is why I posted the indirect calls test code.

( The only branching variant that will perform badly even on the latest uarchs are
  indirect returns: to modify the return address on the stack. )

So my narrow performance point stands, if any sort of indirect jump is used. They
should be avoided if possible, because it's pretty hard for the hardware to get it
right.

As Linus noticed, data lookup tables are the intelligent solution: if you manage
to offload the logic into arithmetics and not affect the control flow then that's
a big win. The inherent branching will be hidden by executing on massively
parallel arithmetic units which effectively execute everything fed to them in a
single cycle.

In any case, when submitting such patches, please get into the habit of looking at
and posting perf stat output - it will give us a good idea about the quality of an
implementation.

Thanks,

	Ingo
* Tom Herbert <tom@herbertland.com> wrote: > Thanks for the explanation and sample code. Expanding on your example, I added a > switch statement to perform to function (code below). So I think your new switch() based testcase is broken in a subtle way. The problem is that in your added testcase GCC effectively optimizes out the function call, because it is able to recognize that all the (trivial) functions are identical. (At least here, with GCC 5.2.1) So all overhead of the workload goes into just that do_switch() function you added: # # Total Lost Samples: 0 # # Samples: 5K of event 'cycles:pp' # Event count (approx.): 5738959683 # # Overhead Command Shared Object Symbol # ........ ........... ................. ........................... # 99.94% jump-table2 jump-table2 [.] do_switch 0.02% jump-table2 [kernel.kallsyms] [k] native_read_msr_safe 0.02% jump-table2 [kernel.kallsyms] [k] native_apic_mem_write 0.01% jump-table2 [kernel.kallsyms] [k] __fput 0.00% jump-table2 [kernel.kallsyms] [k] change_protection_range 0.00% perf [kernel.kallsyms] [k] strrchr 0.00% perf [kernel.kallsyms] [k] intel_pmu_handle_irq 0.00% perf [kernel.kallsyms] [k] native_write_msr_safe ... and per instruction overhead in the do_switch() looks like this: : : Disassembly of section .text: : : 0000000000401100 <do_switch>: : do_switch(): 0.00 : 401100: mov 0x201f61(%rip),%r8 # 603068 <loops> 0.00 : 401107: test %r8,%r8 0.00 : 40110a: je 401178 <do_switch+0x78> 0.00 : 40110c: mov 0x201f5d(%rip),%rdi # 603070 <table_size> 0.00 : 401113: xor %esi,%esi 0.00 : 401115: xor %ecx,%ecx 0.00 : 401117: nopw 0x0(%rax,%rax,1) 0.00 : 401120: mov 0x201f61(%rip),%rax # 603088 <global> 1.46 : 401127: xor %edx,%edx 0.00 : 401129: add $0x1,%rax 0.00 : 40112d: mov %rax,0x201f54(%rip) # 603088 <global> 0.00 : 401134: mov 0x201f4d(%rip),%rax # 603088 <global> 51.68 : 40113b: div %rdi 0.00 : 40113e: cmp $0x3f,%rdx 1.34 : 401142: ja 40116b <do_switch+0x6b> 0.02 : 401144: mov 0x201f3d(%rip),%r9 # 603088 <global> 0.10 : 40114b: mov 0x201f36(%rip),%rax # 603088 <global> 6.91 : 401152: jmpq *0x401420(,%rdx,8) 0.00 : 401159: nopl 0x0(%rax) 0.02 : 401160: xor %edx,%edx 35.71 : 401162: div %r9 1.07 : 401165: add %rdx,%r9 1.69 : 401168: add %r9,%rsi 0.00 : 40116b: add $0x1,%rcx 0.00 : 40116f: cmp %r8,%rcx 0.00 : 401172: jne 401120 <do_switch+0x20> 0.00 : 401174: mov %rsi,%rax 0.00 : 401177: retq 0.00 : 401178: xor %esi,%esi 0.00 : 40117a: jmp 401174 <do_switch+0x74> Note that as you remarked there _is_ an indirect jump in there: 6.91 : 401152: jmpq *0x401420(,%rdx,8) But: > gcc seems to be implementing this switch as a jump table: > > jmpq *0x400ac8(,%rdx,8) ... the 'jump table' has identical entries: 40141f: 00 60 11 add %ah,0x11(%rax) 401422: 40 00 00 add %al,(%rax) 401425: 00 00 add %al,(%rax) 401427: 00 60 11 add %ah,0x11(%rax) 40142a: 40 00 00 add %al,(%rax) 40142d: 00 00 add %al,(%rax) 40142f: 00 60 11 add %ah,0x11(%rax) 401432: 40 00 00 add %al,(%rax) 401435: 00 00 add %al,(%rax) 401437: 00 60 11 add %ah,0x11(%rax) 40143a: 40 00 00 add %al,(%rax) (misaligned dump - jump table starts at 0x401420) Every jump table entry points to 0x401160 - which is the code after the indirect jump in do_switch(). So the hardware predictor simply recognizes that it's the same target address all the time, and thus (understandably) performs much faster - none of the functions are actually executed. 
That is why there are no function call entries in perf report, while in the function pointer case we get: # # Total Lost Samples: 0 # # Samples: 4K of event 'cycles:pp' # Event count (approx.): 5482523153 # # Overhead Command Shared Object Symbol # ........ ........... ................. .......................... # 13.47% jump-table2 jump-table2 [.] fun_1 13.22% jump-table2 jump-table2 [.] fun_6 13.12% jump-table2 jump-table2 [.] fun_5 12.87% jump-table2 jump-table2 [.] fun_7 12.23% jump-table2 jump-table2 [.] fun_2 12.09% jump-table2 jump-table2 [.] fun_0 8.23% jump-table2 jump-table2 [.] do_funcptr 7.43% jump-table2 jump-table2 [.] fun_3 7.27% jump-table2 jump-table2 [.] fun_4 0.02% jump-table2 [kernel.kallsyms] [k] page_add_new_anon_rmap 0.02% jump-table2 [kernel.kallsyms] [k] native_read_msr_safe 0.02% jump-table2 [kernel.kallsyms] [k] perf_pmu_sched_task 0.00% jump-table2 [kernel.kallsyms] [k] flush_tlb_mm_range 0.00% perf [kernel.kallsyms] [k] _raw_spin_lock 0.00% perf [kernel.kallsyms] [k] intel_bts_enable_local 0.00% perf [kernel.kallsyms] [k] native_write_msr_safe Same-target indirect jumps were optimized well starting from PentiumM IIRC, so this would perform well all across the x86 spectrum. Btw., I think it's a bit weird that GCC didn't eliminate the switch jump table itself. (Maybe newer versions of GCC do?) > Which is a different call than the function table implementation: > > callq *(%rsp,%rdx,8) > So the switch implementation does not seem to be causing branch-misses to be > counted and is a lot faster. This is also what I see when looking at the > branch-misses with the checksum code-- I haven't yet found any test case > thrashing lengths or alignments increase branch-misses and shows a performance > degradation over the case where they are static. That's because AFAICS there's no indirect branching in your testcase! :-) And before you spend too much time on this, I do concede your general point: that jump tables are simpler than call tables, and that newer x86 uarchs are able to distinguish between multiple target branches pretty well indexed by a wide range of input registers, resulting in pretty good branch prediction. Anyway - I think Linus's C based approach is vastly superior, so the jump tables vs. call tables topic is somewhat of a side track. Thanks, Ingo
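For what it's worth, the collapse Ingo describes is easy to avoid in a test case by giving every case distinct work, so the compiler cannot merge the jump-table targets. The sketch below is illustrative only (names and constants are made up, and whether the compiler really emits a jump table still has to be checked with objdump, as above):

#include <stdio.h>
#include <stdlib.h>

static unsigned long do_switch(unsigned long loops)
{
	unsigned long sum = 0, x = 123456789;
	unsigned long i;

	for (i = 0; i < loops; i++) {
		x = x * 6364136223846793005ul + 1442695040888963407ul;
		switch ((x >> 33) & 7) {	/* pseudo-random case selector */
		case 0: sum += x ^ 0x01; break;	/* each case does distinct work */
		case 1: sum += x >> 1;   break;
		case 2: sum -= x;        break;
		case 3: sum += x << 2;   break;
		case 4: sum ^= x + 3;    break;
		case 5: sum += ~x;       break;
		case 6: sum -= x >> 7;   break;
		case 7: sum ^= x << 5;   break;
		}
	}
	return sum;
}

int main(int argc, char **argv)
{
	unsigned long loops = argc > 1 ? strtoul(argv[1], NULL, 0) : 100000000ul;

	printf("%lx\n", do_switch(loops));	/* keep the work observable */
	return 0;
}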
From: Ingo Molnar
...
> As Linus noticed, data lookup tables are the intelligent solution: if you manage
> to offload the logic into arithmetics and not affect the control flow then that's
> a big win. The inherent branching will be hidden by executing on massively
> parallel arithmetic units which effectively execute everything fed to them in a
> single cycle.

Not necessarily, in real life they are likely to be a cache miss.
With the parallel execution units you just need to worry about the register
dependency chains. In this case the 'adc' instructions have dependencies on
both the result register and the flags register.

The best loop might even be:

10:	adc	%rax,0(%rdi,%rcx)
	lea	%rcx,8(%rcx)
	jcxz	20f
	jmp	10b
20:

You then need to insert just enough copies of the adc so they take as long
as the other instructions.

	David
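The interleaving David alludes to can be sketched as two independent carry chains over alternating quadwords. Note that the comment block in the old csum-partial_64.c below says this kind of interleaving was already tried without a measurable win, so this is only an illustration of the technique, with illustrative names:

#include <stdint.h>

/* Two independent ADC chains over alternating quadwords of each 64-byte
 * block, combined at the end. 'buf' must point to nblocks * 64 bytes. */
static uint64_t sum64_two_chains(const void *buf, uint64_t nblocks, uint64_t sum)
{
	uint64_t sum2 = 0;

	for (; nblocks; nblocks--, buf = (const char *)buf + 64)
		asm("addq 0*8(%[src]),%[a]\n\t"
		    "adcq 2*8(%[src]),%[a]\n\t"
		    "adcq 4*8(%[src]),%[a]\n\t"
		    "adcq 6*8(%[src]),%[a]\n\t"
		    "adcq $0,%[a]\n\t"		/* close chain a */
		    "addq 1*8(%[src]),%[b]\n\t"
		    "adcq 3*8(%[src]),%[b]\n\t"
		    "adcq 5*8(%[src]),%[b]\n\t"
		    "adcq 7*8(%[src]),%[b]\n\t"
		    "adcq $0,%[b]"		/* close chain b */
		    : [a] "+r" (sum), [b] "+r" (sum2)
		    : [src] "r" (buf)
		    : "memory", "cc");

	/* Fold the second chain back into the first, closing the carry. */
	asm("addq %1,%0\n\t"
	    "adcq $0,%0" : "+r" (sum) : "r" (sum2));
	return sum;
}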
diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h index cd00e17..a888f65 100644 --- a/arch/x86/include/asm/checksum_64.h +++ b/arch/x86/include/asm/checksum_64.h @@ -128,6 +128,11 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr, */ extern __wsum csum_partial(const void *buff, int len, __wsum sum); +static inline __sum16 ip_compute_csum(const void *buff, int len) +{ + return csum_fold(csum_partial(buff, len, 0)); +} + #define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1 #define HAVE_CSUM_COPY_USER 1 diff --git a/arch/x86/lib/csum-partial_64.S b/arch/x86/lib/csum-partial_64.S new file mode 100644 index 0000000..520b400 --- /dev/null +++ b/arch/x86/lib/csum-partial_64.S @@ -0,0 +1,277 @@ +/* Copyright 2016 Tom Herbert <tom@herbertland.com> + * + * Checksum partial calculation + * + * __wsum csum_partial(const void *buff, int len, __wsum sum) + * + * Computes the checksum of a memory block at buff, length len, + * and adds in "sum" (32-bit) + * + * Returns a 32-bit number suitable for feeding into itself + * or csum_tcpudp_magic + * + * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS determines whether alignment of the + * buffer must be dealt with. + * + * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set then the steps are: + * 1) Initialize accumulator to initial sum + * 2) Sum 8 bytes at a time using adcq (unroll main loop + * to do 128 bytes at a time) + * 3) Sum remaining length (less than 8 bytes) + * + * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set then the steps are: + * 1) Handle buffer that is not aligned to 8 bytes, sum up to 8 byte + * alignment + * 2) Sum 8 bytes at a time using adcq (unroll main loop + * to do 128 bytes at a time) + * 3) Sum remaining length (less than 8 bytes) + * 4) Roll result if alignment is odd and add in initial sum argument + * 5) If buffer is not aligned to 8 bytes and length is less than + * or equal to 8 - alignment (whole buffer is in one quad), then + * treat that as a special case. + * + * Register usage: + * %rdi: argument #1, buff + * %rsi: argument #2, length + * %rdx: argument #3, add in value + * %rax,%eax: accumulator and return value + * %rcx,%ecx: counter and tmp + * %r11: tmp + * %r10: alignment (0-7) - when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set + */ + +#include <linux/linkage.h> +#include <asm/errno.h> +#include <asm/asm.h> + +#define branch_tbl_len .L_branch_tbl_len + +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + +/* Close the carry chain and return. */ +#define RETURN \ + adcl $0, %eax; \ + ret + +#else /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */ + +/* Before returning need to roll the result if alignment was odd and then add + * in the initial sum. + */ +#define RETURN \ + adcl $0, %eax; \ + test $0x1, %r10d; \ + jz 99f; \ + roll $8, %eax; \ +99: addl %edx, %eax; \ + adcl $0, %eax; \ + ret + +#define branch_tbl_align .L_branch_tbl_align + +#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */ + +ENTRY(csum_partial) + +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + movl %edx, %eax /* Initialize with initial sum argument */ +#else /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */ + test %esi, %esi /* Zero length? */ + jne 310f + movl %edx, %eax + ret + +310: xorl %eax, %eax + + /* Determine alignment */ + movl %edi, %r10d + andl $0x7, %r10d + jz 10f + movl $8, %ecx + subl %r10d, %ecx + cmpl %ecx, %esi + jle 320f + clc + jmpq *branch_tbl_align(, %r10, 8) + + /* Whole buffer fits into one quad. Sum up to a four byte alignment + * and then call into the length table to finish. 
+ */ +320: test $0x1, %r10d + jz 330f + movb (%rdi), %ah /* Align to two bytes */ + decl %esi + lea 1(%rdi), %rdi +330: cmpl $2, %esi + jl 340f + test $0x2, %r10d + jz 340f + addw (%rdi), %ax /* Align to four bytes */ + adcl $0, %eax + lea 2(%rdi), %rdi + subl $2, %esi +340: + clc + jmpq *branch_tbl_len(, %rsi, 8) + +/* Jumps table for alignments */ + +201: /* Align 1 */ + adcw 5(%rdi), %ax +203: /* Align 3 */ + adcw 3(%rdi), %ax +205: /* Align 5 */ + adcw 1(%rdi), %ax +207: /* Align 7 */ + adcl $0, %eax + addb (%rdi), %ah + jmp 222f +202: /* Align 2 */ + adcw 4(%rdi), %ax +204: /* Align 4 */ + adcw 2(%rdi), %ax +206: /* Align 6 */ + adcw (%rdi), %ax + +222: adcl $0, %eax + subl %ecx, %esi /* %rcx is 8 - alignment */ + addq %rcx, %rdi +200: + /* Fall through */ + +#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */ + + /* Check length */ +10: cmpl $8, %esi + jg 30f + jl 20f + + /* Exactly 8 bytes length */ + addl (%rdi), %eax + adcl 4(%rdi), %eax + RETURN + + /* Less than 8 bytes length */ +20: clc + jmpq *branch_tbl_len(, %rsi, 8) + + /* Greater than 8 bytes length. Determine number of quads (n). Sum + * over first n % 16 quads + */ +30: movl %esi, %ecx + shrl $3, %ecx + andl $0xf, %ecx + negq %rcx + lea 40f(, %rcx, 4), %r11 + clc + jmp *%r11 + +.align 8 + adcq 14*8(%rdi),%rax + adcq 13*8(%rdi),%rax + adcq 12*8(%rdi),%rax + adcq 11*8(%rdi),%rax + adcq 10*8(%rdi),%rax + adcq 9*8(%rdi),%rax + adcq 8*8(%rdi),%rax + adcq 7*8(%rdi),%rax + adcq 6*8(%rdi),%rax + adcq 5*8(%rdi),%rax + adcq 4*8(%rdi),%rax + adcq 3*8(%rdi),%rax + adcq 2*8(%rdi),%rax + adcq 1*8(%rdi),%rax + adcq 0*8(%rdi),%rax + nop +40: /* #quads % 16 jump table base */ + + adcq $0, %rax + shlq $3, %rcx + subq %rcx, %rdi /* %rcx is already negative length */ + + /* Now determine number of blocks of 8 quads. Sum 128 bytes at a time + * using unrolled loop. 
+ */ + movl %esi, %ecx + shrl $7, %ecx + jz 60f + clc + + /* Main loop */ +50: adcq 0*8(%rdi),%rax + adcq 1*8(%rdi),%rax + adcq 2*8(%rdi),%rax + adcq 3*8(%rdi),%rax + adcq 4*8(%rdi),%rax + adcq 5*8(%rdi),%rax + adcq 6*8(%rdi),%rax + adcq 7*8(%rdi),%rax + adcq 8*8(%rdi),%rax + adcq 9*8(%rdi),%rax + adcq 10*8(%rdi),%rax + adcq 11*8(%rdi),%rax + adcq 12*8(%rdi),%rax + adcq 13*8(%rdi),%rax + adcq 14*8(%rdi),%rax + adcq 15*8(%rdi),%rax + lea 128(%rdi), %rdi + loop 50b + + adcq $0, %rax + + /* Handle remaining length which is <= 8 bytes */ +60: andl $0x7, %esi + + /* Fold 64 bit sum to 32 bits */ + movq %rax, %rcx + shrq $32, %rcx + addl %ecx, %eax + + jmpq *branch_tbl_len(, %rsi, 8) + +/* Length table targets */ + +107: /* Length 7 */ + adcw 4(%rdi), %ax +105: /* Length 5 */ + adcw 2(%rdi), %ax +103: /* Length 3 */ + adcw (%rdi), %ax +101: /* Length 1, grab the odd byte */ + adcb -1(%rdi, %rsi), %al + adcb $0, %ah + RETURN +106: /* Length 6 */ + adcw 4(%rdi), %ax +104: /* Length 4, optimized for double word access*/ + adcl (%rdi), %eax + RETURN +102: /* Length 2 */ + adcw (%rdi), %ax +100: /* Length 0 */ + RETURN + +.section .rodata +.align 64 +.L_branch_tbl_len: + .quad 100b + .quad 101b + .quad 102b + .quad 103b + .quad 104b + .quad 105b + .quad 106b + .quad 107b + +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS +.L_branch_tbl_align: + .quad 200b + .quad 201b + .quad 202b + .quad 203b + .quad 204b + .quad 205b + .quad 206b + .quad 207b +#endif + diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c deleted file mode 100644 index 9845371..0000000 --- a/arch/x86/lib/csum-partial_64.c +++ /dev/null @@ -1,148 +0,0 @@ -/* - * arch/x86_64/lib/csum-partial.c - * - * This file contains network checksum routines that are better done - * in an architecture-specific manner due to speed. - */ - -#include <linux/compiler.h> -#include <linux/module.h> -#include <asm/checksum.h> - -static inline unsigned short from32to16(unsigned a) -{ - unsigned short b = a >> 16; - asm("addw %w2,%w0\n\t" - "adcw $0,%w0\n" - : "=r" (b) - : "0" (b), "r" (a)); - return b; -} - -/* - * Do a 64-bit checksum on an arbitrary memory area. - * Returns a 32bit checksum. - * - * This isn't as time critical as it used to be because many NICs - * do hardware checksumming these days. - * - * Things tried and found to not make it faster: - * Manual Prefetching - * Unrolling to an 128 bytes inner loop. - * Using interleaving with more registers to break the carry chains. - */ -static unsigned do_csum(const unsigned char *buff, unsigned len) -{ - unsigned odd, count; - unsigned long result = 0; - - if (unlikely(len == 0)) - return result; - odd = 1 & (unsigned long) buff; - if (unlikely(odd)) { - result = *buff << 8; - len--; - buff++; - } - count = len >> 1; /* nr of 16-bit words.. */ - if (count) { - if (2 & (unsigned long) buff) { - result += *(unsigned short *)buff; - count--; - len -= 2; - buff += 2; - } - count >>= 1; /* nr of 32-bit words.. */ - if (count) { - unsigned long zero; - unsigned count64; - if (4 & (unsigned long) buff) { - result += *(unsigned int *) buff; - count--; - len -= 4; - buff += 4; - } - count >>= 1; /* nr of 64-bit words.. 
*/ - - /* main loop using 64byte blocks */ - zero = 0; - count64 = count >> 3; - while (count64) { - asm("addq 0*8(%[src]),%[res]\n\t" - "adcq 1*8(%[src]),%[res]\n\t" - "adcq 2*8(%[src]),%[res]\n\t" - "adcq 3*8(%[src]),%[res]\n\t" - "adcq 4*8(%[src]),%[res]\n\t" - "adcq 5*8(%[src]),%[res]\n\t" - "adcq 6*8(%[src]),%[res]\n\t" - "adcq 7*8(%[src]),%[res]\n\t" - "adcq %[zero],%[res]" - : [res] "=r" (result) - : [src] "r" (buff), [zero] "r" (zero), - "[res]" (result)); - buff += 64; - count64--; - } - - /* last up to 7 8byte blocks */ - count %= 8; - while (count) { - asm("addq %1,%0\n\t" - "adcq %2,%0\n" - : "=r" (result) - : "m" (*(unsigned long *)buff), - "r" (zero), "0" (result)); - --count; - buff += 8; - } - result = add32_with_carry(result>>32, - result&0xffffffff); - - if (len & 4) { - result += *(unsigned int *) buff; - buff += 4; - } - } - if (len & 2) { - result += *(unsigned short *) buff; - buff += 2; - } - } - if (len & 1) - result += *buff; - result = add32_with_carry(result>>32, result & 0xffffffff); - if (unlikely(odd)) { - result = from32to16(result); - result = ((result >> 8) & 0xff) | ((result & 0xff) << 8); - } - return result; -} - -/* - * computes the checksum of a memory block at buff, length len, - * and adds in "sum" (32-bit) - * - * returns a 32-bit number suitable for feeding into itself - * or csum_tcpudp_magic - * - * this function must be called with even lengths, except - * for the last fragment, which may be odd - * - * it's best to have buff aligned on a 64-bit boundary - */ -__wsum csum_partial(const void *buff, int len, __wsum sum) -{ - return (__force __wsum)add32_with_carry(do_csum(buff, len), - (__force u32)sum); -} - -/* - * this routine is used for miscellaneous IP-like checksums, mainly - * in icmp.c - */ -__sum16 ip_compute_csum(const void *buff, int len) -{ - return csum_fold(csum_partial(buff,len,0)); -} -EXPORT_SYMBOL(ip_compute_csum); -