
[v4,net-next] net: Implement fast csum_partial for x86_64

Message ID 1456516995-3471163-1-git-send-email-tom@herbertland.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Tom Herbert Feb. 26, 2016, 8:03 p.m. UTC
This patch implements a performant csum_partial for x86_64. The intent is
to speed up checksum calculation, particularly for smaller lengths such
as those present when doing skb_postpull_rcsum after getting
CHECKSUM_COMPLETE from the device or after a CHECKSUM_UNNECESSARY conversion.

- v4
   - went back to C code with inline assembly for critical routines
   - implemented suggestion from Linus to deal with lengths < 8

Testing:

Correctness:

Verified correctness by testing arbitrary-length buffers filled with
random data. For each buffer I compared the computed checksum against
the original algorithm for each possible alignment (0-7 bytes).

Performance:

Isolating old and new implementation for some common cases:

             Old      New     %
    Len/Aln  nsecs    nsecs   Improv
    --------+-------+--------+-------
    1400/0    195.6    181.7   7%     (Big packet)
    40/0      11.4     6.2     45%    (IPv6 hdr cmn case)
    8/4       7.9      3.2     59%    (UDP, VXLAN in IPv4)
    14/0      8.9      5.9     33%    (Eth hdr)
    14/4      9.2      5.9     35%    (Eth hdr in IPv4)
    14/3      9.6      5.9     38%    (Eth with odd align)
    20/0      9.0      6.2     31%    (IP hdr without options)
    7/1       8.9      4.2     52%    (buffer in one quad)
    100/0    17.4     13.9     20%    (medium-sized pkt)
    100/2    17.8     14.2     20%    (medium-sized pkt w/ alignment)

Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Also tested on these with similar results:

Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz

Branch prediction:

To test the effects of poor branch prediction in the jump tables I
tested checksum performance with runs for two combinations of length
and alignment. As the baseline, I did half of the calls with the first
combination, followed by the second combination for the second half.
In the test case, I interleaved the two combinations so that the
length and alignment differ on every call, to defeat the effects of
branch prediction. Running several cases, I did not see any material
performance difference between the two scenarios (perf stat output is
below), nor does either case show a significant number of branch
misses.

Interleave lengths case:

perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
    ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000

 Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):

     9,556,693,202      instructions               ( +-  0.00% )
     1,176,208,640       branches                                                     ( +-  0.00% )
            19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )

       2.049732539 seconds time elapsed

Non-interleave case:

perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
     ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000

Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):

     9,782,188,310      instructions               ( +-  0.00% )
     1,251,286,958       branches                                                     ( +-  0.01% )
            18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )

       2.271789046 seconds time elapsed

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 arch/x86/include/asm/checksum_64.h |  21 ++++
 arch/x86/lib/csum-partial_64.c     | 225 ++++++++++++++++++++-----------------
 2 files changed, 143 insertions(+), 103 deletions(-)

Comments

Linus Torvalds Feb. 26, 2016, 8:29 p.m. UTC | #1
Looks ok to me.

I am left wondering if the code should just do that

    add32_with_carry3(sum, result >> 32, result);

in the caller instead - right now pretty much every return point in
do_csum() effectively does that, with the exception of

 - the 0-length case, which is presumably not really an issue in real
life and could as well just return 0

 - the 8-byte case that does two 32-bit loads instead, but could just
do a single 8-byte load and return it (and then the generic version in
the caller would do a shift).

That would simplify the code a bit - it wouldn't need to pass in
"sum" to do_csum at all, and we'd have just a single case of that
special 3-input carry add..
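
A minimal sketch of that caller-side fold, assuming do_csum() is
reworked into a do_csum64() returning the raw 64-bit sum (the name is
illustrative, not from the patch) and reusing the patch's
add32_with_carry3():

    static unsigned long do_csum64(const void *buff, int len);

    __wsum csum_partial(const void *buff, int len, __wsum sum)
    {
            unsigned long result = do_csum64(buff, len);

            /* Single fold point: 64-bit sum plus the caller's seed. */
            return (__force __wsum)add32_with_carry3((__force u32)sum,
                                                     result >> 32,
                                                     result & 0xffffffff);
    }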

But I'm certainly ok with it as-is. I'm not sure how performance
critical the whole csum routine is, but at least now it doesn't
introduce a lot of new complex asm.

And this version might be reasonable to make generic, so that non-x86
architectures could use the same approach. That's what we ended up
doing for the dcache word-at-a-time code too in the end.

                Linus
Tom Herbert Feb. 26, 2016, 8:41 p.m. UTC | #2
On Fri, Feb 26, 2016 at 12:29 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Looks ok to me.
>
> I am left wondering if the code should just do that
>
>     add32_with_carry3(sum, result >> 32, result);
>
> in the caller instead - right now pretty much every return point in
> do_csum() effectively does that,  with the exception of
>
>  - the 0-length case, which is presumably not really an issue in real
> life and could as well just return 0
>
>  - the 8-byte case that does two 32-bit loads instead, but could just
> do a single 8-byte load and return it (and then the generic version in
> the caller would do a shift).

Right, it is slightly faster for those two cases to return the result
directly. (csum over 8 bytes might be common with some encapsulation
protocols).

>
> That would simplifiy the code a bit - it wouldn't need to pass in
> "sum" to do_csum at all, and we'd have just a single case of that
> special 3-input carry add..
>
> But I'm certainly ok with it as-is. I'm not sure how performance
> critical the whole csum routine is, but at least now it doesn't
> introduce a lot of new complex asm.
>
Micro-performance optimizations may become more relevant as we
introduce more high performance network paths to the kernel. But the
short term reason for this is to dispel any remaining notion that NIC
HW support for CHECKSUM_UNNECESSARY is somehow better than
CHECKSUM_COMPLETE because pulling up checksums in the protocols is too
expensive.

Tom

> And this version might be reasonable to make generic, so that non-x86
> architectures could use the same approach. That's what we ended up
> doing for the dcache word-at-a-time code too in the end.
>
>                 Linus
Alexander H Duyck Feb. 26, 2016, 10:52 p.m. UTC | #3
On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert <tom@herbertland.com> wrote:
> This patch implements performant csum_partial for x86_64. The intent is
> to speed up checksum calculation, particularly for smaller lengths such
> as those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>
> - v4
>    - went back to C code with inline assembly for critical routines
>    - implemented suggestion from Linus to deal with lengths < 8
>
> Testing:
>
> Correctness:
>
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Performance:
>
> Isolating old and new implementation for some common cases:
>
>              Old      New     %
>     Len/Aln  nsecs    nsecs   Improv
>     --------+-------+--------+-------
>     1400/0    195.6    181.7   7%     (Big packet)
>     40/0      11.4     6.2     45%    (Ipv6 hdr cmn case)
>     8/4       7.9      3.2     59%    (UDP, VXLAN in IPv4)
>     14/0      8.9      5.9     33%    (Eth hdr)
>     14/4      9.2      5.9     35%    (Eth hdr in IPv4)
>     14/3      9.6      5.9     38%    (Eth with odd align)
>     20/0      9.0      6.2     31%    (IP hdr without options)
>     7/1       8.9      4.2     52%    (buffer in one quad)
>     100/0    17.4     13.9     20%    (medium-sized pkt)
>     100/2    17.8     14.2     20%    (medium-sized pkt w/ alignment)
>
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
> Also tested on these with similar results:
>
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>
> Branch  prediction:
>
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between the
> two scenarios (perf stat output is below), neither does either case
> show a significant number of branch misses.
>
> Interleave lengths case:
>
> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>     ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>
>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>
>      9,556,693,202      instructions               ( +-  0.00% )
>      1,176,208,640       branches                                                     ( +-  0.00% )
>             19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )
>
>        2.049732539 seconds time elapsed
>
>     Non-interleave case:
>
> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>      ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>
> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>
>      9,782,188,310      instructions               ( +-  0.00% )
>      1,251,286,958       branches                                                     ( +-  0.01% )
>             18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )
>
>        2.271789046 seconds time elapsed
>
> Signed-off-by: Tom Herbert <tom@herbertland.com>
> ---
>  arch/x86/include/asm/checksum_64.h |  21 ++++
>  arch/x86/lib/csum-partial_64.c     | 225 ++++++++++++++++++++-----------------
>  2 files changed, 143 insertions(+), 103 deletions(-)
>
> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
> index cd00e17..e20c35b 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>         return a;
>  }
>
> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
> +{
> +       asm("addq %2,%0\n\t"
> +           "adcq $0,%0"
> +           : "=r" (a)
> +           : "0" (a), "rm" (b));
> +       return a;
> +}
> +
> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
> +                                            unsigned int c)
> +{
> +       asm("addl %2,%0\n\t"
> +           "adcl %3,%0\n\t"
> +           "adcl $0,%0"
> +           : "=r" (a)
> +           : "" (a), "rm" (b), "rm" (c));
> +
> +       return a;
> +}
> +
>  #define HAVE_ARCH_CSUM_ADD
>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>  {
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> index 9845371..df82c9b 100644
> --- a/arch/x86/lib/csum-partial_64.c
> +++ b/arch/x86/lib/csum-partial_64.c
> @@ -8,114 +8,66 @@
>  #include <linux/compiler.h>
>  #include <linux/module.h>
>  #include <asm/checksum.h>
> +#include <asm/word-at-a-time.h>
>
> -static inline unsigned short from32to16(unsigned a)
> +static inline unsigned long rotate_by8_if_odd(unsigned long sum, bool aligned)
>  {
> -       unsigned short b = a >> 16;
> -       asm("addw %w2,%w0\n\t"
> -           "adcw $0,%w0\n"
> -           : "=r" (b)
> -           : "0" (b), "r" (a));
> -       return b;
> +       if (unlikely(aligned))
> +               asm("rorq $0x8,%0"
> +                       : "=r" (sum)
> +                       : "0" (sum));
> +       return sum;
>  }

Instead of making this conditional you might just want to perform the
rotate unconditionally and bump the aligned value up by 3 bits.  That
way you reduce the length of the dependency chain by 1, and rotates
tend to only take a cycle on most x86_64 capable systems anyway.

>
> -/*
> - * Do a 64-bit checksum on an arbitrary memory area.
> - * Returns a 32bit checksum.
> - *
> - * This isn't as time critical as it used to be because many NICs
> - * do hardware checksumming these days.
> - *
> - * Things tried and found to not make it faster:
> - * Manual Prefetching
> - * Unrolling to an 128 bytes inner loop.
> - * Using interleaving with more registers to break the carry chains.
> - */
> -static unsigned do_csum(const unsigned char *buff, unsigned len)
> +/* Extract eight high order (nbo) bytes of quad "val". */
> +static inline unsigned long csum_partial_lt8_head(unsigned long val, int len)
>  {
> -       unsigned odd, count;
> -       unsigned long result = 0;
> +       return val & ((1ul << len * 8) - 1);
> +}
>
> -       if (unlikely(len == 0))
> -               return result;
> -       odd = 1 & (unsigned long) buff;
> -       if (unlikely(odd)) {
> -               result = *buff << 8;
> -               len--;
> -               buff++;
> -       }
> -       count = len >> 1;               /* nr of 16-bit words.. */
> -       if (count) {
> -               if (2 & (unsigned long) buff) {
> -                       result += *(unsigned short *)buff;
> -                       count--;
> -                       len -= 2;
> -                       buff += 2;
> -               }
> -               count >>= 1;            /* nr of 32-bit words.. */
> -               if (count) {
> -                       unsigned long zero;
> -                       unsigned count64;
> -                       if (4 & (unsigned long) buff) {
> -                               result += *(unsigned int *) buff;
> -                               count--;
> -                               len -= 4;
> -                               buff += 4;
> -                       }
> -                       count >>= 1;    /* nr of 64-bit words.. */
> -
> -                       /* main loop using 64byte blocks */
> -                       zero = 0;
> -                       count64 = count >> 3;
> -                       while (count64) {
> -                               asm("addq 0*8(%[src]),%[res]\n\t"
> -                                   "adcq 1*8(%[src]),%[res]\n\t"
> -                                   "adcq 2*8(%[src]),%[res]\n\t"
> -                                   "adcq 3*8(%[src]),%[res]\n\t"
> -                                   "adcq 4*8(%[src]),%[res]\n\t"
> -                                   "adcq 5*8(%[src]),%[res]\n\t"
> -                                   "adcq 6*8(%[src]),%[res]\n\t"
> -                                   "adcq 7*8(%[src]),%[res]\n\t"
> -                                   "adcq %[zero],%[res]"
> -                                   : [res] "=r" (result)
> -                                   : [src] "r" (buff), [zero] "r" (zero),
> -                                   "[res]" (result));
> -                               buff += 64;
> -                               count64--;
> -                       }
> -
> -                       /* last up to 7 8byte blocks */
> -                       count %= 8;
> -                       while (count) {
> -                               asm("addq %1,%0\n\t"
> -                                   "adcq %2,%0\n"
> -                                           : "=r" (result)
> -                                   : "m" (*(unsigned long *)buff),
> -                                   "r" (zero),  "0" (result));
> -                               --count;
> -                                       buff += 8;
> -                       }
> -                       result = add32_with_carry(result>>32,
> -                                                 result&0xffffffff);
> -
> -                       if (len & 4) {
> -                               result += *(unsigned int *) buff;
> -                               buff += 4;
> -                       }
> -               }
> -               if (len & 2) {
> -                       result += *(unsigned short *) buff;
> -                       buff += 2;
> -               }
> -       }
> -       if (len & 1)
> -               result += *buff;
> -       result = add32_with_carry(result>>32, result & 0xffffffff);
> -       if (unlikely(odd)) {
> -               result = from32to16(result);
> -               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> +/* Extract eight low order (nbo) bytes of quad in "val". */
> +static inline unsigned long csum_partial_lt8_tail(unsigned long val, int len)
> +{
> +       return val >> (8 * (8 - len));
> +}

From what I can tell these functions only get used in one or two
spots.  It may be worth it to just place the code directly at the call
sites rather than obscuring it behind a helper function.

> +/* Sum over buffer up to 64 bytes. */
> +static unsigned long csum_partial_le64(const void *buff, unsigned int len,
> +                                      unsigned long sum)
> +{
> +       asm("lea 40f(, %[slen], 4), %%r11\n\t"
> +           "clc\n\t"
> +           "jmpq *%%r11\n\t"
> +           "adcq 7*8(%[src]),%[res]\n\t"

Technically this first adcq could just be an addq.  You don't have
anything to carry anyway.

> +           "adcq 6*8(%[src]),%[res]\n\t"
> +           "adcq 5*8(%[src]),%[res]\n\t"
> +           "adcq 4*8(%[src]),%[res]\n\t"
> +           "adcq 3*8(%[src]),%[res]\n\t"
> +           "adcq 2*8(%[src]),%[res]\n\t"
> +           "adcq 1*8(%[src]),%[res]\n\t"
> +           "adcq 0*8(%[src]),%[res]\n\t"
> +           "nop\n\t"
> +           "40: adcq $0,%[res]"
> +                   : [res] "=r" (sum)
> +                   : [src] "r" (buff),
> +                     [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
> +                   : "r11");
> +

It would probably be useful to add a comment explaining this block.
I had to compile it to verify my understanding is correct: you are
using the label 40 as a jump target and simply subtracting the length
in longs from that jump target.  If nothing else you might want to
rename the label from 40 to something else just to improve
readability.  The code could get shifted around or inlined, and at
that point the 40 becomes completely meaningless.

Also it wouldn't hurt to explain the nop and the reason for the 4 in
the lea instruction: adcq is 4 bytes per instruction with a memory
offset, but the last one has an offset of 0, so the nop is needed to
keep everything 4-byte aligned from the end.
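
One way that could read, sketched as the jump-table portion of the
patch's csum_partial_le64() with the explanatory comment added and the
bare "40" replaced by a named local label; "csum_tail" and the _sketch
suffix are illustrative, and the (len & 0x7) tail handling from the
patch is omitted here:

    /* Sum over whole quads of a buffer up to 64 bytes. */
    static unsigned long csum_partial_le64_sketch(const void *buff,
                                                  unsigned int len,
                                                  unsigned long sum)
    {
            /*
             * Computed jump into an unrolled chain of adcq instructions.
             * Each "adcq disp8(%[src]),%[res]" assembles to 4 bytes, so
             * jumping to (csum_tail - 4 * (len / 8)) executes exactly
             * len / 8 of them.  The zero-displacement form of the last
             * adcq is only 3 bytes, hence the nop to keep the 4-byte
             * spacing, and clc clears CF before the first add.
             */
            asm("lea csum_tail%=(, %[slen], 4), %%r11\n\t"
                "clc\n\t"
                "jmpq *%%r11\n\t"
                "adcq 7*8(%[src]),%[res]\n\t"
                "adcq 6*8(%[src]),%[res]\n\t"
                "adcq 5*8(%[src]),%[res]\n\t"
                "adcq 4*8(%[src]),%[res]\n\t"
                "adcq 3*8(%[src]),%[res]\n\t"
                "adcq 2*8(%[src]),%[res]\n\t"
                "adcq 1*8(%[src]),%[res]\n\t"
                "adcq 0*8(%[src]),%[res]\n\t"
                "nop\n\t"
                "csum_tail%=: adcq $0,%[res]"
                : [res] "=r" (sum)
                : [src] "r" (buff),
                  [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
                : "r11");

            return sum;
    }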

> +       if (len & 0x7) {
> +               unsigned long val;
> +               /*
> +                * Since "len" is > 8 here we backtrack in the buffer to load
> +                * the outstaning bytes into the low order bytes of a quad and
> +                * sum those in csum_partial_lt8_tail. By doing this we avoid
> +                * additional calls to load_unaligned_zeropad.
> +                */
> +
> +               val = csum_partial_lt8_tail(*(unsigned long *)(buff + len - 8),
> +                                           len & 0x7);
> +               sum = add64_with_carry(val, sum);
>         }
> -       return result;
> +
> +       return sum;
>  }
>
>  /*
> @@ -130,10 +82,77 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>   *
>   * it's best to have buff aligned on a 64-bit boundary
>   */
> +static unsigned int do_csum(const void *buff, int len, unsigned int sum)
> +{
> +       bool aligned = false;
> +       unsigned long result = 0;
> +
> +       /*
> +        * For "small" lengths don't bother with alignment, x86 handles
> +        * this pretty well.
> +        */
> +       if (len <= 8) {
> +               unsigned long val;
> +
> +               /* Optimize trivial cases */
> +               if (len == 8)
> +                       return add32_with_carry3(sum,
> +                                      *(unsigned int *)buff,
> +                                      *(unsigned int *)(buff + 4));

Doesn't this still run the risk of triggering a fault if the buffer
crosses pages on a non-4 byte aligned boundary?

> +               else if (len == 0)
> +                       return sum;
> +
> +               val = load_unaligned_zeropad(buff);
> +               result = csum_partial_lt8_head(val, len);
> +
> +               return add32_with_carry3(sum, result >> 32,
> +                                        result & 0xffffffff);
> +       } else if (len <= 64) {
> +               result = csum_partial_le64(buff, len, result);
> +
> +               return add32_with_carry3(sum, result >> 32,
> +                                        result & 0xffffffff);
> +       }
> +

One side effect of calling csum_partial_le64 in two spots is that you
end up making an actual call to it instead of having it inlined
directly into your do_csum function.  What you may want to do is alter
your code flow so that, if (len > 64), you do the bits below that occur
before you call into csum_partial_le64; that way csum_partial_le64 has
only one caller, which should get the compiler to inline it into your
function.
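
A sketch of that reordering, reusing the patch's helpers
(csum_partial_lt8_head, csum_partial_le64, rotate_by8_if_odd,
add32_with_carry3, load_unaligned_zeropad); only the control flow
differs from v4, and it is illustrative rather than tested:

    static unsigned int do_csum(const void *buff, int len, unsigned int sum)
    {
            bool aligned = false;
            unsigned long result = 0;

            if (len <= 8) {
                    /* Small lengths: same trivial cases as in v4. */
                    if (len == 8)
                            return add32_with_carry3(sum,
                                           *(unsigned int *)buff,
                                           *(unsigned int *)(buff + 4));
                    else if (len == 0)
                            return sum;

                    result = csum_partial_lt8_head(load_unaligned_zeropad(buff),
                                                   len);
                    return add32_with_carry3(sum, result >> 32,
                                             result & 0xffffffff);
            }

            if (len > 64) {
                    /* Sum to eight byte alignment before the main loop. */
                    aligned = !!((unsigned long)buff & 0x1);
                    if (aligned) {
                            unsigned int align = 7 & -(unsigned long)buff;

                            result = csum_partial_lt8_head(
                                            *(unsigned long *)buff, align);
                            buff += align;
                            len -= align;
                            result = rotate_by8_if_odd(result, align);
                    }

                    /* main loop using 64byte blocks, as in the patch */
                    for (; len >= 64; len -= 64, buff += 64) {
                            asm("addq 0*8(%[src]),%[res]\n\t"
                                "adcq 1*8(%[src]),%[res]\n\t"
                                "adcq 2*8(%[src]),%[res]\n\t"
                                "adcq 3*8(%[src]),%[res]\n\t"
                                "adcq 4*8(%[src]),%[res]\n\t"
                                "adcq 5*8(%[src]),%[res]\n\t"
                                "adcq 6*8(%[src]),%[res]\n\t"
                                "adcq 7*8(%[src]),%[res]\n\t"
                                "adcq $0,%[res]"
                                : [res] "=r" (result)
                                : [src] "r" (buff), "[res]" (result));
                    }
            }

            /* Single call site: at most 64 bytes remain here. */
            result = csum_partial_le64(buff, len, result);
            result = rotate_by8_if_odd(result, aligned);

            return add32_with_carry3(sum, result >> 32, result & 0xffffffff);
    }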

> +       /*
> +        * Length is greater than 64. Sum to eight byte alignment before
> +        * proceeding with main loop.
> +        */
> +       aligned = !!((unsigned long)buff & 0x1);
> +       if (aligned) {
> +               unsigned int align = 7 & -(unsigned long)buff;
> +
> +               result = csum_partial_lt8_head(*(unsigned long *)buff, align);
> +               buff += align;
> +               len -= align;
> +               result = rotate_by8_if_odd(result, align);
> +       }

I'm still not a fan of the unaligned reads.  They may be okay but it
just seems like we are going to run into corner cases all over the
place where this ends up biting us.

> +       /* main loop using 64byte blocks */
> +       for (; len >= 64; len -= 64, buff += 64) {
> +               asm("addq 0*8(%[src]),%[res]\n\t"
> +                   "adcq 1*8(%[src]),%[res]\n\t"
> +                   "adcq 2*8(%[src]),%[res]\n\t"
> +                   "adcq 3*8(%[src]),%[res]\n\t"
> +                   "adcq 4*8(%[src]),%[res]\n\t"
> +                   "adcq 5*8(%[src]),%[res]\n\t"
> +                   "adcq 6*8(%[src]),%[res]\n\t"
> +                   "adcq 7*8(%[src]),%[res]\n\t"
> +                   "adcq $0,%[res]"
> +                   : [res] "=r" (result)
> +                   : [src] "r" (buff),
> +                   "[res]" (result));
> +       }
> +
> +       result = csum_partial_le64(buff, len, result);
> +       result = rotate_by8_if_odd(result, aligned);
> +
> +       return add32_with_carry3(sum, result >> 32, result & 0xffffffff);
> +}
> +
>  __wsum csum_partial(const void *buff, int len, __wsum sum)
>  {
> -       return (__force __wsum)add32_with_carry(do_csum(buff, len),
> -                                               (__force u32)sum);
> +       return (__force __wsum)do_csum(buff, len, (__force u32)sum);
>  }
>
>  /*
> --
> 2.6.5
>
Tom Herbert Feb. 27, 2016, 3:11 a.m. UTC | #4
On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert <tom@herbertland.com> wrote:
>> This patch implements performant csum_partial for x86_64. The intent is
>> to speed up checksum calculation, particularly for smaller lengths such
>> as those that are present when doing skb_postpull_rcsum when getting
>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>>
>> - v4
>>    - went back to C code with inline assembly for critical routines
>>    - implemented suggestion from Linus to deal with lengths < 8
>>
>> Testing:
>>
>> Correctness:
>>
>> Verified correctness by testing arbitrary length buffer filled with
>> random data. For each buffer I compared the computed checksum
>> using the original algorithm for each possible alignment (0-7 bytes).
>>
>> Performance:
>>
>> Isolating old and new implementation for some common cases:
>>
>>              Old      New     %
>>     Len/Aln  nsecs    nsecs   Improv
>>     --------+-------+--------+-------
>>     1400/0    195.6    181.7   7%     (Big packet)
>>     40/0      11.4     6.2     45%    (Ipv6 hdr cmn case)
>>     8/4       7.9      3.2     59%    (UDP, VXLAN in IPv4)
>>     14/0      8.9      5.9     33%    (Eth hdr)
>>     14/4      9.2      5.9     35%    (Eth hdr in IPv4)
>>     14/3      9.6      5.9     38%    (Eth with odd align)
>>     20/0      9.0      6.2     31%    (IP hdr without options)
>>     7/1       8.9      4.2     52%    (buffer in one quad)
>>     100/0    17.4     13.9     20%    (medium-sized pkt)
>>     100/2    17.8     14.2     20%    (medium-sized pkt w/ alignment)
>>
>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>
>> Also tested on these with similar results:
>>
>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>
>> Branch  prediction:
>>
>> To test the effects of poor branch prediction in the jump tables I
>> tested checksum performance with runs for two combinations of length
>> and alignment. As the baseline I performed the test by doing half of
>> calls with the first combination, followed by using the second
>> combination for the second half. In the test case, I interleave the
>> two combinations so that in every call the length and alignment are
>> different to defeat the effects of branch prediction. Running several
>> cases, I did not see any material performance difference between the
>> two scenarios (perf stat output is below), neither does either case
>> show a significant number of branch misses.
>>
>> Interleave lengths case:
>>
>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>     ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>>
>>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>
>>      9,556,693,202      instructions               ( +-  0.00% )
>>      1,176,208,640       branches                                                     ( +-  0.00% )
>>             19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )
>>
>>        2.049732539 seconds time elapsed
>>
>>     Non-interleave case:
>>
>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>      ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>>
>> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>
>>      9,782,188,310      instructions               ( +-  0.00% )
>>      1,251,286,958       branches                                                     ( +-  0.01% )
>>             18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )
>>
>>        2.271789046 seconds time elapsed
>>
>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>> ---
>>  arch/x86/include/asm/checksum_64.h |  21 ++++
>>  arch/x86/lib/csum-partial_64.c     | 225 ++++++++++++++++++++-----------------
>>  2 files changed, 143 insertions(+), 103 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
>> index cd00e17..e20c35b 100644
>> --- a/arch/x86/include/asm/checksum_64.h
>> +++ b/arch/x86/include/asm/checksum_64.h
>> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>>         return a;
>>  }
>>
>> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
>> +{
>> +       asm("addq %2,%0\n\t"
>> +           "adcq $0,%0"
>> +           : "=r" (a)
>> +           : "0" (a), "rm" (b));
>> +       return a;
>> +}
>> +
>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
>> +                                            unsigned int c)
>> +{
>> +       asm("addl %2,%0\n\t"
>> +           "adcl %3,%0\n\t"
>> +           "adcl $0,%0"
>> +           : "=r" (a)
>> +           : "" (a), "rm" (b), "rm" (c));
>> +
>> +       return a;
>> +}
>> +
>>  #define HAVE_ARCH_CSUM_ADD
>>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>>  {
>> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>> index 9845371..df82c9b 100644
>> --- a/arch/x86/lib/csum-partial_64.c
>> +++ b/arch/x86/lib/csum-partial_64.c
>> @@ -8,114 +8,66 @@
>>  #include <linux/compiler.h>
>>  #include <linux/module.h>
>>  #include <asm/checksum.h>
>> +#include <asm/word-at-a-time.h>
>>
>> -static inline unsigned short from32to16(unsigned a)
>> +static inline unsigned long rotate_by8_if_odd(unsigned long sum, bool aligned)
>>  {
>> -       unsigned short b = a >> 16;
>> -       asm("addw %w2,%w0\n\t"
>> -           "adcw $0,%w0\n"
>> -           : "=r" (b)
>> -           : "0" (b), "r" (a));
>> -       return b;
>> +       if (unlikely(aligned))
>> +               asm("rorq $0x8,%0"
>> +                       : "=r" (sum)
>> +                       : "0" (sum));
>> +       return sum;
>>  }
>
> Instead of making this conditional you might just want to perform the
> rotate always and just bump the aligned value up bits.  That way you
> reduce the length of the dependency chain by 1, and rotates tend to
> only take a cycle on most x86_64 capable systems anyways.
>
I don't understand what you want. Can you provide more detail?

>>
>> -/*
>> - * Do a 64-bit checksum on an arbitrary memory area.
>> - * Returns a 32bit checksum.
>> - *
>> - * This isn't as time critical as it used to be because many NICs
>> - * do hardware checksumming these days.
>> - *
>> - * Things tried and found to not make it faster:
>> - * Manual Prefetching
>> - * Unrolling to an 128 bytes inner loop.
>> - * Using interleaving with more registers to break the carry chains.
>> - */
>> -static unsigned do_csum(const unsigned char *buff, unsigned len)
>> +/* Extract eight high order (nbo) bytes of quad "val". */
>> +static inline unsigned long csum_partial_lt8_head(unsigned long val, int len)
>>  {
>> -       unsigned odd, count;
>> -       unsigned long result = 0;
>> +       return val & ((1ul << len * 8) - 1);
>> +}
>>
>> -       if (unlikely(len == 0))
>> -               return result;
>> -       odd = 1 & (unsigned long) buff;
>> -       if (unlikely(odd)) {
>> -               result = *buff << 8;
>> -               len--;
>> -               buff++;
>> -       }
>> -       count = len >> 1;               /* nr of 16-bit words.. */
>> -       if (count) {
>> -               if (2 & (unsigned long) buff) {
>> -                       result += *(unsigned short *)buff;
>> -                       count--;
>> -                       len -= 2;
>> -                       buff += 2;
>> -               }
>> -               count >>= 1;            /* nr of 32-bit words.. */
>> -               if (count) {
>> -                       unsigned long zero;
>> -                       unsigned count64;
>> -                       if (4 & (unsigned long) buff) {
>> -                               result += *(unsigned int *) buff;
>> -                               count--;
>> -                               len -= 4;
>> -                               buff += 4;
>> -                       }
>> -                       count >>= 1;    /* nr of 64-bit words.. */
>> -
>> -                       /* main loop using 64byte blocks */
>> -                       zero = 0;
>> -                       count64 = count >> 3;
>> -                       while (count64) {
>> -                               asm("addq 0*8(%[src]),%[res]\n\t"
>> -                                   "adcq 1*8(%[src]),%[res]\n\t"
>> -                                   "adcq 2*8(%[src]),%[res]\n\t"
>> -                                   "adcq 3*8(%[src]),%[res]\n\t"
>> -                                   "adcq 4*8(%[src]),%[res]\n\t"
>> -                                   "adcq 5*8(%[src]),%[res]\n\t"
>> -                                   "adcq 6*8(%[src]),%[res]\n\t"
>> -                                   "adcq 7*8(%[src]),%[res]\n\t"
>> -                                   "adcq %[zero],%[res]"
>> -                                   : [res] "=r" (result)
>> -                                   : [src] "r" (buff), [zero] "r" (zero),
>> -                                   "[res]" (result));
>> -                               buff += 64;
>> -                               count64--;
>> -                       }
>> -
>> -                       /* last up to 7 8byte blocks */
>> -                       count %= 8;
>> -                       while (count) {
>> -                               asm("addq %1,%0\n\t"
>> -                                   "adcq %2,%0\n"
>> -                                           : "=r" (result)
>> -                                   : "m" (*(unsigned long *)buff),
>> -                                   "r" (zero),  "0" (result));
>> -                               --count;
>> -                                       buff += 8;
>> -                       }
>> -                       result = add32_with_carry(result>>32,
>> -                                                 result&0xffffffff);
>> -
>> -                       if (len & 4) {
>> -                               result += *(unsigned int *) buff;
>> -                               buff += 4;
>> -                       }
>> -               }
>> -               if (len & 2) {
>> -                       result += *(unsigned short *) buff;
>> -                       buff += 2;
>> -               }
>> -       }
>> -       if (len & 1)
>> -               result += *buff;
>> -       result = add32_with_carry(result>>32, result & 0xffffffff);
>> -       if (unlikely(odd)) {
>> -               result = from32to16(result);
>> -               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
>> +/* Extract eight low order (nbo) bytes of quad in "val". */
>> +static inline unsigned long csum_partial_lt8_tail(unsigned long val, int len)
>> +{
>> +       return val >> (8 * (8 - len));
>> +}
>
> From what I can tell these functions only get used in one or two
> spots.  It may be worth it to just place the code directly in the
> function rather than obscuring it by placing it in a function.
>
>> +/* Sum over buffer up to 64 bytes. */
>> +static unsigned long csum_partial_le64(const void *buff, unsigned int len,
>> +                                      unsigned long sum)
>> +{
>> +       asm("lea 40f(, %[slen], 4), %%r11\n\t"
>> +           "clc\n\t"
>> +           "jmpq *%%r11\n\t"
>> +           "adcq 7*8(%[src]),%[res]\n\t"
>
> Technically this first adcq could just be an addq.  You don't have
> anything to carry anyway.
>
To what end?

>> +           "adcq 6*8(%[src]),%[res]\n\t"
>> +           "adcq 5*8(%[src]),%[res]\n\t"
>> +           "adcq 4*8(%[src]),%[res]\n\t"
>> +           "adcq 3*8(%[src]),%[res]\n\t"
>> +           "adcq 2*8(%[src]),%[res]\n\t"
>> +           "adcq 1*8(%[src]),%[res]\n\t"
>> +           "adcq 0*8(%[src]),%[res]\n\t"
>> +           "nop\n\t"
>> +           "40: adcq $0,%[res]"
>> +                   : [res] "=r" (sum)
>> +                   : [src] "r" (buff),
>> +                     [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
>> +                   : "r11");
>> +
>
> It would probably useful to add a comment explaining that this block.
> I had to compile it to verify my understanding of this is correct in
> that you are using the label 40 as a jump target and simply
> subtracting the length in longs from that jump target.  If nothing
> else you might want to rename the label from 40 to something else just
> to improve the readability.  It is possible that the code could get
> shifted around or inlined and at that point the 40 becomes completely
> meaningless.
>
> Also it wouldn't hurt to explain that the nop, and the reason for the
> 4 in the lea instruction is because adcq is 4 bytes per instruction
> with a memory offset, and the last one has an offset of 0 so we need
> to add the nop to keep everything 4 byte aligned from the end.
>
>> +       if (len & 0x7) {
>> +               unsigned long val;
>> +               /*
>> +                * Since "len" is > 8 here we backtrack in the buffer to load
>> +                * the outstaning bytes into the low order bytes of a quad and
>> +                * sum those in csum_partial_lt8_tail. By doing this we avoid
>> +                * additional calls to load_unaligned_zeropad.
>> +                */
>> +
>> +               val = csum_partial_lt8_tail(*(unsigned long *)(buff + len - 8),
>> +                                           len & 0x7);
>> +               sum = add64_with_carry(val, sum);
>>         }
>> -       return result;
>> +
>> +       return sum;
>>  }
>>
>>  /*
>> @@ -130,10 +82,77 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>>   *
>>   * it's best to have buff aligned on a 64-bit boundary
>>   */
>> +static unsigned int do_csum(const void *buff, int len, unsigned int sum)
>> +{
>> +       bool aligned = false;
>> +       unsigned long result = 0;
>> +
>> +       /*
>> +        * For "small" lengths don't bother with alignment, x86 handles
>> +        * this pretty well.
>> +        */
>> +       if (len <= 8) {
>> +               unsigned long val;
>> +
>> +               /* Optimize trivial cases */
>> +               if (len == 8)
>> +                       return add32_with_carry3(sum,
>> +                                      *(unsigned int *)buff,
>> +                                      *(unsigned int *)(buff + 4));
>
> Doesn't this still run the risk of triggering a fault if the buffer
> crosses pages on a non-4 byte aligned boundary?
>
It's safe because we know the length of the buffer is eight bytes.
There is only one instance where we need load_unaligned_zeropad to
deal with potentially crossing over to a bad page.

>> +               else if (len == 0)
>> +                       return sum;
>> +
>> +               val = load_unaligned_zeropad(buff);
>> +               result = csum_partial_lt8_head(val, len);
>> +
>> +               return add32_with_carry3(sum, result >> 32,
>> +                                        result & 0xffffffff);
>> +       } else if (len <= 64) {
>> +               result = csum_partial_le64(buff, len, result);
>> +
>> +               return add32_with_carry3(sum, result >> 32,
>> +                                        result & 0xffffffff);
>> +       }
>> +
>
> One side effect I see from the fact that you are calling
> csum_partial_le64 in two spots is that you are ending up having to
> make a call to it instead of having it be inlined directly into your
> do_csum function.  What you may want to do is look at altering your
> code flow so that if (len > 64) you do the bits below that occur
> before you call into csum_partial_le64, and that way you will reduce
> the number of callers to csum_partial_le64 to one which should get the
> compiler to inline it into your function.
>
I see that gcc is inlining both calls to csum_partial_le64.

>> +       /*
>> +        * Length is greater than 64. Sum to eight byte alignment before
>> +        * proceeding with main loop.
>> +        */
>> +       aligned = !!((unsigned long)buff & 0x1);
>> +       if (aligned) {
>> +               unsigned int align = 7 & -(unsigned long)buff;
>> +
>> +               result = csum_partial_lt8_head(*(unsigned long *)buff, align);
>> +               buff += align;
>> +               len -= align;
>> +               result = rotate_by8_if_odd(result, align);
>> +       }
>
> I'm still not a fan of the unaligned reads.  They may be okay but it
> just seems like we are going run into corner cases all over the place
> where this ends up biting us.
>
They are already all over the place in the stack. For instance, any
time we process an IP header within VXLAN we're likely doing several
unaligned reads to load IP addresses, etc. These are a big problem for
architectures like Sparc, but they do not appear to be an issue for x86.

>> +       /* main loop using 64byte blocks */
>> +       for (; len >= 64; len -= 64, buff += 64) {
>> +               asm("addq 0*8(%[src]),%[res]\n\t"
>> +                   "adcq 1*8(%[src]),%[res]\n\t"
>> +                   "adcq 2*8(%[src]),%[res]\n\t"
>> +                   "adcq 3*8(%[src]),%[res]\n\t"
>> +                   "adcq 4*8(%[src]),%[res]\n\t"
>> +                   "adcq 5*8(%[src]),%[res]\n\t"
>> +                   "adcq 6*8(%[src]),%[res]\n\t"
>> +                   "adcq 7*8(%[src]),%[res]\n\t"
>> +                   "adcq $0,%[res]"
>> +                   : [res] "=r" (result)
>> +                   : [src] "r" (buff),
>> +                   "[res]" (result));
>> +       }
>> +
>> +       result = csum_partial_le64(buff, len, result);
>> +       result = rotate_by8_if_odd(result, aligned);
>> +
>> +       return add32_with_carry3(sum, result >> 32, result & 0xffffffff);
>> +}
>> +
>>  __wsum csum_partial(const void *buff, int len, __wsum sum)
>>  {
>> -       return (__force __wsum)add32_with_carry(do_csum(buff, len),
>> -                                               (__force u32)sum);
>> +       return (__force __wsum)do_csum(buff, len, (__force u32)sum);
>>  }
>>
>>  /*
>> --
>> 2.6.5
>>
Alexander H Duyck Feb. 27, 2016, 3:40 a.m. UTC | #5
On Fri, Feb 26, 2016 at 7:11 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert <tom@herbertland.com> wrote:
>>> This patch implements performant csum_partial for x86_64. The intent is
>>> to speed up checksum calculation, particularly for smaller lengths such
>>> as those that are present when doing skb_postpull_rcsum when getting
>>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>>>
>>> - v4
>>>    - went back to C code with inline assembly for critical routines
>>>    - implemented suggestion from Linus to deal with lengths < 8
>>>
>>> Testing:
>>>
>>> Correctness:
>>>
>>> Verified correctness by testing arbitrary length buffer filled with
>>> random data. For each buffer I compared the computed checksum
>>> using the original algorithm for each possible alignment (0-7 bytes).
>>>
>>> Performance:
>>>
>>> Isolating old and new implementation for some common cases:
>>>
>>>              Old      New     %
>>>     Len/Aln  nsecs    nsecs   Improv
>>>     --------+-------+--------+-------
>>>     1400/0    195.6    181.7   7%     (Big packet)
>>>     40/0      11.4     6.2     45%    (Ipv6 hdr cmn case)
>>>     8/4       7.9      3.2     59%    (UDP, VXLAN in IPv4)
>>>     14/0      8.9      5.9     33%    (Eth hdr)
>>>     14/4      9.2      5.9     35%    (Eth hdr in IPv4)
>>>     14/3      9.6      5.9     38%    (Eth with odd align)
>>>     20/0      9.0      6.2     31%    (IP hdr without options)
>>>     7/1       8.9      4.2     52%    (buffer in one quad)
>>>     100/0    17.4     13.9     20%    (medium-sized pkt)
>>>     100/2    17.8     14.2     20%    (medium-sized pkt w/ alignment)
>>>
>>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>>
>>> Also tested on these with similar results:
>>>
>>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>>
>>> Branch  prediction:
>>>
>>> To test the effects of poor branch prediction in the jump tables I
>>> tested checksum performance with runs for two combinations of length
>>> and alignment. As the baseline I performed the test by doing half of
>>> calls with the first combination, followed by using the second
>>> combination for the second half. In the test case, I interleave the
>>> two combinations so that in every call the length and alignment are
>>> different to defeat the effects of branch prediction. Running several
>>> cases, I did not see any material performance difference between the
>>> two scenarios (perf stat output is below), neither does either case
>>> show a significant number of branch misses.
>>>
>>> Interleave lengths case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>>     ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>>>
>>>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>>
>>>      9,556,693,202      instructions               ( +-  0.00% )
>>>      1,176,208,640       branches                                                     ( +-  0.00% )
>>>             19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )
>>>
>>>        2.049732539 seconds time elapsed
>>>
>>>     Non-interleave case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>>      ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>>>
>>> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>>
>>>      9,782,188,310      instructions               ( +-  0.00% )
>>>      1,251,286,958       branches                                                     ( +-  0.01% )
>>>             18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )
>>>
>>>        2.271789046 seconds time elapsed
>>>
>>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>>> ---
>>>  arch/x86/include/asm/checksum_64.h |  21 ++++
>>>  arch/x86/lib/csum-partial_64.c     | 225 ++++++++++++++++++++-----------------
>>>  2 files changed, 143 insertions(+), 103 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
>>> index cd00e17..e20c35b 100644
>>> --- a/arch/x86/include/asm/checksum_64.h
>>> +++ b/arch/x86/include/asm/checksum_64.h
>>> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>>>         return a;
>>>  }
>>>
>>> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
>>> +{
>>> +       asm("addq %2,%0\n\t"
>>> +           "adcq $0,%0"
>>> +           : "=r" (a)
>>> +           : "0" (a), "rm" (b));
>>> +       return a;
>>> +}
>>> +
>>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
>>> +                                            unsigned int c)
>>> +{
>>> +       asm("addl %2,%0\n\t"
>>> +           "adcl %3,%0\n\t"
>>> +           "adcl $0,%0"
>>> +           : "=r" (a)
>>> +           : "" (a), "rm" (b), "rm" (c));
>>> +
>>> +       return a;
>>> +}
>>> +
>>>  #define HAVE_ARCH_CSUM_ADD
>>>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>>>  {
>>> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>>> index 9845371..df82c9b 100644
>>> --- a/arch/x86/lib/csum-partial_64.c
>>> +++ b/arch/x86/lib/csum-partial_64.c
>>> @@ -8,114 +8,66 @@
>>>  #include <linux/compiler.h>
>>>  #include <linux/module.h>
>>>  #include <asm/checksum.h>
>>> +#include <asm/word-at-a-time.h>
>>>
>>> -static inline unsigned short from32to16(unsigned a)
>>> +static inline unsigned long rotate_by8_if_odd(unsigned long sum, bool aligned)
>>>  {
>>> -       unsigned short b = a >> 16;
>>> -       asm("addw %w2,%w0\n\t"
>>> -           "adcw $0,%w0\n"
>>> -           : "=r" (b)
>>> -           : "0" (b), "r" (a));
>>> -       return b;
>>> +       if (unlikely(aligned))
>>> +               asm("rorq $0x8,%0"
>>> +                       : "=r" (sum)
>>> +                       : "0" (sum));
>>> +       return sum;
>>>  }
>>
>> Instead of making this conditional you might just want to perform the
>> rotate always and just bump the aligned value up bits.  That way you
>> reduce the length of the dependency chain by 1, and rotates tend to
>> only take a cycle on most x86_64 capable systems anyways.
>>
> I don't understand what you want. Can you provide more detail?

So right now you have aligned as a boolean value.  The code path ends
up being something like cmp, jmp, rorq.  If you converted it to an
integer value you could simply shift aligned left by 3 and then perform
the rotate in all cases, since a rotate by 0 has no effect.  It
basically just drops one uop from the path.
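
A sketch of that, assuming rotate_by8_if_odd() is changed to take the
low address bit (0 or 1) instead of a bool (illustrative only):

    static inline unsigned long rotate_by8_if_odd(unsigned long sum,
                                                  unsigned long odd)
    {
            /* ror by (odd * 8); a rotate by 0 is a no-op, so no branch */
            asm("rorq %b1,%0"
                : "=r" (sum)
                : "c" (odd << 3), "0" (sum));
            return sum;
    }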

>>>
>>> -/*
>>> - * Do a 64-bit checksum on an arbitrary memory area.
>>> - * Returns a 32bit checksum.
>>> - *
>>> - * This isn't as time critical as it used to be because many NICs
>>> - * do hardware checksumming these days.
>>> - *
>>> - * Things tried and found to not make it faster:
>>> - * Manual Prefetching
>>> - * Unrolling to an 128 bytes inner loop.
>>> - * Using interleaving with more registers to break the carry chains.
>>> - */
>>> -static unsigned do_csum(const unsigned char *buff, unsigned len)
>>> +/* Extract eight high order (nbo) bytes of quad "val". */
>>> +static inline unsigned long csum_partial_lt8_head(unsigned long val, int len)
>>>  {
>>> -       unsigned odd, count;
>>> -       unsigned long result = 0;
>>> +       return val & ((1ul << len * 8) - 1);
>>> +}
>>>
>>> -       if (unlikely(len == 0))
>>> -               return result;
>>> -       odd = 1 & (unsigned long) buff;
>>> -       if (unlikely(odd)) {
>>> -               result = *buff << 8;
>>> -               len--;
>>> -               buff++;
>>> -       }
>>> -       count = len >> 1;               /* nr of 16-bit words.. */
>>> -       if (count) {
>>> -               if (2 & (unsigned long) buff) {
>>> -                       result += *(unsigned short *)buff;
>>> -                       count--;
>>> -                       len -= 2;
>>> -                       buff += 2;
>>> -               }
>>> -               count >>= 1;            /* nr of 32-bit words.. */
>>> -               if (count) {
>>> -                       unsigned long zero;
>>> -                       unsigned count64;
>>> -                       if (4 & (unsigned long) buff) {
>>> -                               result += *(unsigned int *) buff;
>>> -                               count--;
>>> -                               len -= 4;
>>> -                               buff += 4;
>>> -                       }
>>> -                       count >>= 1;    /* nr of 64-bit words.. */
>>> -
>>> -                       /* main loop using 64byte blocks */
>>> -                       zero = 0;
>>> -                       count64 = count >> 3;
>>> -                       while (count64) {
>>> -                               asm("addq 0*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 1*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 2*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 3*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 4*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 5*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 6*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 7*8(%[src]),%[res]\n\t"
>>> -                                   "adcq %[zero],%[res]"
>>> -                                   : [res] "=r" (result)
>>> -                                   : [src] "r" (buff), [zero] "r" (zero),
>>> -                                   "[res]" (result));
>>> -                               buff += 64;
>>> -                               count64--;
>>> -                       }
>>> -
>>> -                       /* last up to 7 8byte blocks */
>>> -                       count %= 8;
>>> -                       while (count) {
>>> -                               asm("addq %1,%0\n\t"
>>> -                                   "adcq %2,%0\n"
>>> -                                           : "=r" (result)
>>> -                                   : "m" (*(unsigned long *)buff),
>>> -                                   "r" (zero),  "0" (result));
>>> -                               --count;
>>> -                                       buff += 8;
>>> -                       }
>>> -                       result = add32_with_carry(result>>32,
>>> -                                                 result&0xffffffff);
>>> -
>>> -                       if (len & 4) {
>>> -                               result += *(unsigned int *) buff;
>>> -                               buff += 4;
>>> -                       }
>>> -               }
>>> -               if (len & 2) {
>>> -                       result += *(unsigned short *) buff;
>>> -                       buff += 2;
>>> -               }
>>> -       }
>>> -       if (len & 1)
>>> -               result += *buff;
>>> -       result = add32_with_carry(result>>32, result & 0xffffffff);
>>> -       if (unlikely(odd)) {
>>> -               result = from32to16(result);
>>> -               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
>>> +/* Extract eight low order (nbo) bytes of quad in "val". */
>>> +static inline unsigned long csum_partial_lt8_tail(unsigned long val, int len)
>>> +{
>>> +       return val >> (8 * (8 - len));
>>> +}
>>
>> From what I can tell these functions only get used in one or two
>> spots.  It may be worth it to just place the code directly in the
>> function rather than obscuring it by placing it in a function.
>>
>>> +/* Sum over buffer up to 64 bytes. */
>>> +static unsigned long csum_partial_le64(const void *buff, unsigned int len,
>>> +                                      unsigned long sum)
>>> +{
>>> +       asm("lea 40f(, %[slen], 4), %%r11\n\t"
>>> +           "clc\n\t"
>>> +           "jmpq *%%r11\n\t"
>>> +           "adcq 7*8(%[src]),%[res]\n\t"
>>
>> Technically this first adcq could just be an addq.  You don't have
>> anything to carry anyway.
>>
> To what end?

It basically allows us to get back to what I was proposing at netconf:
merging the main loop and your jump-table path so that you can put
together a Duff's device and save yourself some code overhead.
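
For reference, a plain-C illustration of the Duff's device idea (a
simple word sum, not the proposed checksum asm; sum_words() is
hypothetical): the switch computes the loop entry point from the
remainder, so the tail and the unrolled body share one sequence.

    static unsigned long sum_words(const unsigned long *p, unsigned int n)
    {
            unsigned long sum = 0;
            unsigned int chunks = (n + 7) / 8;

            if (n == 0)
                    return 0;

            /* Enter the unrolled loop at an offset based on n % 8. */
            switch (n % 8) {
            case 0: do {    sum += *p++;
            case 7:         sum += *p++;
            case 6:         sum += *p++;
            case 5:         sum += *p++;
            case 4:         sum += *p++;
            case 3:         sum += *p++;
            case 2:         sum += *p++;
            case 1:         sum += *p++;
                    } while (--chunks > 0);
            }
            return sum;
    }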

>>> +           "adcq 6*8(%[src]),%[res]\n\t"
>>> +           "adcq 5*8(%[src]),%[res]\n\t"
>>> +           "adcq 4*8(%[src]),%[res]\n\t"
>>> +           "adcq 3*8(%[src]),%[res]\n\t"
>>> +           "adcq 2*8(%[src]),%[res]\n\t"
>>> +           "adcq 1*8(%[src]),%[res]\n\t"
>>> +           "adcq 0*8(%[src]),%[res]\n\t"
>>> +           "nop\n\t"
>>> +           "40: adcq $0,%[res]"
>>> +                   : [res] "=r" (sum)
>>> +                   : [src] "r" (buff),
>>> +                     [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
>>> +                   : "r11");
>>> +
>>
>> It would probably useful to add a comment explaining that this block.
>> I had to compile it to verify my understanding of this is correct in
>> that you are using the label 40 as a jump target and simply
>> subtracting the length in longs from that jump target.  If nothing
>> else you might want to rename the label from 40 to something else just
>> to improve the readability.  It is possible that the code could get
>> shifted around or inlined and at that point the 40 becomes completely
>> meaningless.
>>
>> Also it wouldn't hurt to explain that the nop, and the reason for the
>> 4 in the lea instruction is because adcq is 4 bytes per instruction
>> with a memory offset, and the last one has an offset of 0 so we need
>> to add the nop to keep everything 4 byte aligned from the end.
>>
>>> +       if (len & 0x7) {
>>> +               unsigned long val;
>>> +               /*
>>> +                * Since "len" is > 8 here we backtrack in the buffer to load
>>> +                * the outstaning bytes into the low order bytes of a quad and
>>> +                * sum those in csum_partial_lt8_tail. By doing this we avoid
>>> +                * additional calls to load_unaligned_zeropad.
>>> +                */
>>> +
>>> +               val = csum_partial_lt8_tail(*(unsigned long *)(buff + len - 8),
>>> +                                           len & 0x7);
>>> +               sum = add64_with_carry(val, sum);
>>>         }
>>> -       return result;
>>> +
>>> +       return sum;
>>>  }
>>>
>>>  /*
>>> @@ -130,10 +82,77 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>>>   *
>>>   * it's best to have buff aligned on a 64-bit boundary
>>>   */
>>> +static unsigned int do_csum(const void *buff, int len, unsigned int sum)
>>> +{
>>> +       bool aligned = false;
>>> +       unsigned long result = 0;
>>> +
>>> +       /*
>>> +        * For "small" lengths don't bother with alignment, x86 handles
>>> +        * this pretty well.
>>> +        */
>>> +       if (len <= 8) {
>>> +               unsigned long val;
>>> +
>>> +               /* Optimize trivial cases */
>>> +               if (len == 8)
>>> +                       return add32_with_carry3(sum,
>>> +                                      *(unsigned int *)buff,
>>> +                                      *(unsigned int *)(buff + 4));
>>
>> Doesn't this still run the risk of triggering a fault if the buffer
>> crosses pages on a non-4 byte aligned boundary?
>>
> It's safe because we know the length of the buffer is eight bytes.
> There is only one instance where we need load_unaligned_zeropad to
> deal with potentially crossing over to a bad page.

I think I see what you are getting at, but to me it just seems like
there are some unnecessary optimizations going on.  For example I don't
really see the point in optimizing for the 0 case below.  If someone
requested a sum over 0 bytes I believe the lt8_head function will
return 0 anyway.

You might also be better off just converting this to a (len < 7) check
and letting the code fall through into the section for less than 64
as well.

>>> +               else if (len == 0)
>>> +                       return sum;
>>> +
>>> +               val = load_unaligned_zeropad(buff);
>>> +               result = csum_partial_lt8_head(val, len);
>>> +
>>> +               return add32_with_carry3(sum, result >> 32,
>>> +                                        result & 0xffffffff);
>>> +       } else if (len <= 64) {
>>> +               result = csum_partial_le64(buff, len, result);
>>> +
>>> +               return add32_with_carry3(sum, result >> 32,
>>> +                                        result & 0xffffffff);
>>> +       }
>>> +
>>
>> One side effect of calling csum_partial_le64 in two spots is that you
>> end up making an actual call to it instead of having it inlined
>> directly into your do_csum function.  What you may want to do is alter
>> your code flow so that if (len > 64) you do the bits below that occur
>> before you call into csum_partial_le64; that way you reduce the number
>> of callers of csum_partial_le64 to one, which should get the compiler
>> to inline it into your function.
>>
> I see that gcc is inlining both calls to csum_partial_le64.

Maybe your compiler is smarter than mine, or I have some other build
option enabled that isn't triggering that.

>>> +       /*
>>> +        * Length is greater than 64. Sum to eight byte alignment before
>>> +        * proceeding with main loop.
>>> +        */
>>> +       aligned = !!((unsigned long)buff & 0x1);
>>> +       if (aligned) {
>>> +               unsigned int align = 7 & -(unsigned long)buff;
>>> +
>>> +               result = csum_partial_lt8_head(*(unsigned long *)buff, align);
>>> +               buff += align;
>>> +               len -= align;
>>> +               result = rotate_by8_if_odd(result, align);
>>> +       }
>>
>> I'm still not a fan of the unaligned reads.  They may be okay but it
>> just seems like we are going to run into corner cases all over the place
>> where this ends up biting us.
>>
> They are already all over the place in the stack. For instance, any
> time we process an IP header within VXLAN we're likely doing several
> unaligned reads to load IP addresses etc. These are a big problem for
> architectures like Sparc, but do not appear to be an issue on x86.

Yes, but in our case we can align ourselves with minimal overhead,
which is why I figure it might be worth the effort to do so.

>>> +       /* main loop using 64byte blocks */
>>> +       for (; len >= 64; len -= 64, buff += 64) {
>>> +               asm("addq 0*8(%[src]),%[res]\n\t"
>>> +                   "adcq 1*8(%[src]),%[res]\n\t"
>>> +                   "adcq 2*8(%[src]),%[res]\n\t"
>>> +                   "adcq 3*8(%[src]),%[res]\n\t"
>>> +                   "adcq 4*8(%[src]),%[res]\n\t"
>>> +                   "adcq 5*8(%[src]),%[res]\n\t"
>>> +                   "adcq 6*8(%[src]),%[res]\n\t"
>>> +                   "adcq 7*8(%[src]),%[res]\n\t"
>>> +                   "adcq $0,%[res]"
>>> +                   : [res] "=r" (result)
>>> +                   : [src] "r" (buff),
>>> +                   "[res]" (result));
>>> +       }
>>> +
>>> +       result = csum_partial_le64(buff, len, result);
>>> +       result = rotate_by8_if_odd(result, aligned);
>>> +
>>> +       return add32_with_carry3(sum, result >> 32, result & 0xffffffff);
>>> +}
>>> +

I'll play around with this code a bit over the weekend.  I have a few
ideas I want to try.

I figure we should be able to get to the point of having this code
running pretty lean.  For example I was looking at the Duff's device
idea and I think we can probably get this code quite a bit smaller and
faster.

As far as the patch itself goes, I have no objections.  It is better
than what we have now.  I think I am just starting to realize we might be
able to squeeze some more speed out of it and might even be able to
reduce the code size in the process.

- Alex
Linus Torvalds Feb. 27, 2016, 5:25 a.m. UTC | #6
On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> I'm still not a fan of the unaligned reads.  They may be okay but it
> just seems like we are going to run into corner cases all over the place
> where this ends up biting us.

No.

Unaligned reads are not just "ok".

The fact is, not doing unaligned reads is just stupid.

Guys, the RISC people tried the whole "only do aligned crap". It was a
mistake. It's stupid. It's wrong.

Every single successful remaining RISC architecture learnt from their
mistakes. That should tell you something.

It should tell you that the people who tried to teach you that
unaligned reads were bad were charlatans.

It's *much* better to do unaligned reads in software and let hardware
sort it out than to try to actively avoid them.

On x86, unaligned reads have never even been expensive (except back in
the dark days when people started doing vector extensions and got them
wrong - those people learnt their lesson too).

And on other architectures, that historically got this wrong (ARM got
it *really* wrong originally), sanity eventually prevailed. So there
isn't a single relevant architecture left where it would make sense to
do extra work in order to only do aligned reads.

The whole "unaligned reads are bad" ship sailed long ago, and it sank.
Let it be.

                    Linus
Alexander H Duyck Feb. 27, 2016, 8:30 a.m. UTC | #7
> +{
> +       asm("lea 40f(, %[slen], 4), %%r11\n\t"
> +           "clc\n\t"
> +           "jmpq *%%r11\n\t"
> +           "adcq 7*8(%[src]),%[res]\n\t"
> +           "adcq 6*8(%[src]),%[res]\n\t"
> +           "adcq 5*8(%[src]),%[res]\n\t"
> +           "adcq 4*8(%[src]),%[res]\n\t"
> +           "adcq 3*8(%[src]),%[res]\n\t"
> +           "adcq 2*8(%[src]),%[res]\n\t"
> +           "adcq 1*8(%[src]),%[res]\n\t"
> +           "adcq 0*8(%[src]),%[res]\n\t"
> +           "nop\n\t"
> +           "40: adcq $0,%[res]"
> +                   : [res] "=r" (sum)
> +                   : [src] "r" (buff),
> +                     [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
> +                   : "r11");
> +

With this patch I cannot mix/match different length checksums without
things failing.  In perf the jmpq in the loop above seems to be set to
a fixed value so perhaps it is something in how the compiler is
interpreting the inline assembler.

- Alex
Alexander H Duyck Feb. 28, 2016, 6:56 p.m. UTC | #8
On Sat, Feb 27, 2016 at 12:30 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>> +{
>> +       asm("lea 40f(, %[slen], 4), %%r11\n\t"
>> +           "clc\n\t"
>> +           "jmpq *%%r11\n\t"
>> +           "adcq 7*8(%[src]),%[res]\n\t"
>> +           "adcq 6*8(%[src]),%[res]\n\t"
>> +           "adcq 5*8(%[src]),%[res]\n\t"
>> +           "adcq 4*8(%[src]),%[res]\n\t"
>> +           "adcq 3*8(%[src]),%[res]\n\t"
>> +           "adcq 2*8(%[src]),%[res]\n\t"
>> +           "adcq 1*8(%[src]),%[res]\n\t"
>> +           "adcq 0*8(%[src]),%[res]\n\t"
>> +           "nop\n\t"
>> +           "40: adcq $0,%[res]"
>> +                   : [res] "=r" (sum)
>> +                   : [src] "r" (buff),
>> +                     [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
>> +                   : "r11");
>> +
>
> With this patch I cannot mix/match different length checksums without
> things failing.  In perf the jmpq in the loop above seems to be set to
> a fixed value so perhaps it is something in how the compiler is
> interpreting the inline assembler.

The perf thing was a red herring.  Turns out the code is working
correctly there.

I actually found the root cause.  The problem is in add32_with_carry3.

> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
> +                                            unsigned int c)
> +{
> +       asm("addl %2,%0\n\t"
> +           "adcl %3,%0\n\t"
> +           "adcl $0,%0"
> +           : "=r" (a)
> +           : "" (a), "rm" (b), "rm" (c));
> +
> +       return a;
> +}
> +

You need to set the 'a' input operand constraint to "0" instead of ""
and then things work for me correctly.
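
I.e., if I read the fix right, only the first input constraint changes,
the same way add64_with_carry in this patch ties its input to the output:

static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
					     unsigned int c)
{
	asm("addl %2,%0\n\t"
	    "adcl %3,%0\n\t"
	    "adcl $0,%0"
	    : "=r" (a)
	    : "0" (a), "rm" (b), "rm" (c));

	return a;
}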

- Alex
Tom Herbert Feb. 28, 2016, 7:15 p.m. UTC | #9
On Sun, Feb 28, 2016 at 10:56 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Sat, Feb 27, 2016 at 12:30 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>>> +{
>>> +       asm("lea 40f(, %[slen], 4), %%r11\n\t"
>>> +           "clc\n\t"
>>> +           "jmpq *%%r11\n\t"
>>> +           "adcq 7*8(%[src]),%[res]\n\t"
>>> +           "adcq 6*8(%[src]),%[res]\n\t"
>>> +           "adcq 5*8(%[src]),%[res]\n\t"
>>> +           "adcq 4*8(%[src]),%[res]\n\t"
>>> +           "adcq 3*8(%[src]),%[res]\n\t"
>>> +           "adcq 2*8(%[src]),%[res]\n\t"
>>> +           "adcq 1*8(%[src]),%[res]\n\t"
>>> +           "adcq 0*8(%[src]),%[res]\n\t"
>>> +           "nop\n\t"
>>> +           "40: adcq $0,%[res]"
>>> +                   : [res] "=r" (sum)
>>> +                   : [src] "r" (buff),
>>> +                     [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
>>> +                   : "r11");
>>> +
>>
>> With this patch I cannot mix/match different length checksums without
>> things failing.  In perf the jmpq in the loop above seems to be set to
>> a fixed value so perhaps it is something in how the compiler is
>> interpreting the inline assembler.
>
> The perf thing was a red herring.  Turns out the code is working
> correctly there.
>
> I actually found the root cause.  The problem is in add32_with_carry3.
>
Thanks for the follow-up. btw are you trying to build csum_partial in
userspace for testing, or was this all in kernel?


>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
>> +                                            unsigned int c)
>> +{
>> +       asm("addl %2,%0\n\t"
>> +           "adcl %3,%0\n\t"
>> +           "adcl $0,%0"
>> +           : "=r" (a)
>> +           : "" (a), "rm" (b), "rm" (c));
>> +
>> +       return a;
>> +}
>> +
>
> You need to set the 'a' input operand constraint to "0" instead of ""
> and then things work for me correctly.
>
> - Alex
Eric Dumazet Feb. 28, 2016, 8:35 p.m. UTC | #10
On ven., 2016-02-26 at 12:03 -0800, Tom Herbert wrote:
> +
> +	/*
> +	 * Length is greater than 64. Sum to eight byte alignment before
> +	 * proceeding with main loop.
> +	 */
> +	aligned = !!((unsigned long)buff & 0x1);
> +	if (aligned) {
> +		unsigned int align = 7 & -(unsigned long)buff;
> +
> +		result = csum_partial_lt8_head(*(unsigned long *)buff, align);
> +		buff += align;
> +		len -= align;
> +		result = rotate_by8_if_odd(result, align);
> +	}
> +

This looks like you wanted to test the 3 low order bits, not only the
lowest one.

aligned = !((unsigned long)buff & 0x7);

if (!aligned) {
 ...
}

Or rename the variable to notaligned.
Alexander H Duyck Feb. 29, 2016, 1:33 a.m. UTC | #11
On Sun, Feb 28, 2016 at 11:15 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Sun, Feb 28, 2016 at 10:56 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Sat, Feb 27, 2016 at 12:30 AM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>>> +{
>>>> +       asm("lea 40f(, %[slen], 4), %%r11\n\t"
>>>> +           "clc\n\t"
>>>> +           "jmpq *%%r11\n\t"
>>>> +           "adcq 7*8(%[src]),%[res]\n\t"
>>>> +           "adcq 6*8(%[src]),%[res]\n\t"
>>>> +           "adcq 5*8(%[src]),%[res]\n\t"
>>>> +           "adcq 4*8(%[src]),%[res]\n\t"
>>>> +           "adcq 3*8(%[src]),%[res]\n\t"
>>>> +           "adcq 2*8(%[src]),%[res]\n\t"
>>>> +           "adcq 1*8(%[src]),%[res]\n\t"
>>>> +           "adcq 0*8(%[src]),%[res]\n\t"
>>>> +           "nop\n\t"
>>>> +           "40: adcq $0,%[res]"
>>>> +                   : [res] "=r" (sum)
>>>> +                   : [src] "r" (buff),
>>>> +                     [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
>>>> +                   : "r11");
>>>> +
>>>
>>> With this patch I cannot mix/match different length checksums without
>>> things failing.  In perf the jmpq in the loop above seems to be set to
>>> a fixed value so perhaps it is something in how the compiler is
>>> interpreting the inline assembler.
>>
>> The perf thing was a red herring.  Turns out the code is working
>> correctly there.
>>
>> I actually found the root cause.  The problem is in add32_with_carry3.
>>
> Thanks for the follow-up. btw are you trying to build csum_partial in
> userspace for testing, or was this all in kernel?

It was in the kernel.  I have been doing some user space work, but all
of the problems I was having were in the kernel.  My guess is that the
original sum value wasn't being used.

- Alex
Maciej W. Rozycki Feb. 29, 2016, 1:39 a.m. UTC | #12
On Sun, 28 Feb 2016, Alexander Duyck wrote:

> I actually found the root cause.  The problem is in add32_with_carry3.
> 
> > +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
> > +                                            unsigned int c)
> > +{
> > +       asm("addl %2,%0\n\t"
> > +           "adcl %3,%0\n\t"
> > +           "adcl $0,%0"
> > +           : "=r" (a)
> > +           : "" (a), "rm" (b), "rm" (c));
> > +
> > +       return a;
> > +}
> > +
> 
> You need to set the 'a' input operand constraint to "0" instead of ""
> and then things work for me correctly.

 Or alternatively you can reduce the number of operands by one, by using 
`"+r" (a)' as output, and then removing `a' as a separate input and 
renumbering references to `b' and `c' accordingly.  It would IMHO actually 
better match the in-place operation as well.
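
 I.e. something along these lines (untested):

static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
					     unsigned int c)
{
	asm("addl %1,%0\n\t"
	    "adcl %2,%0\n\t"
	    "adcl $0,%0"
	    : "+r" (a)
	    : "rm" (b), "rm" (c));

	return a;
}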

 FWIW,

  Maciej
diff mbox

Patch

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index cd00e17..e20c35b 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -188,6 +188,27 @@  static inline unsigned add32_with_carry(unsigned a, unsigned b)
 	return a;
 }
 
+static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
+{
+	asm("addq %2,%0\n\t"
+	    "adcq $0,%0"
+	    : "=r" (a)
+	    : "0" (a), "rm" (b));
+	return a;
+}
+
+static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
+					     unsigned int c)
+{
+	asm("addl %2,%0\n\t"
+	    "adcl %3,%0\n\t"
+	    "adcl $0,%0"
+	    : "=r" (a)
+	    : "" (a), "rm" (b), "rm" (c));
+
+	return a;
+}
+
 #define HAVE_ARCH_CSUM_ADD
 static inline __wsum csum_add(__wsum csum, __wsum addend)
 {
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..df82c9b 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -8,114 +8,66 @@ 
 #include <linux/compiler.h>
 #include <linux/module.h>
 #include <asm/checksum.h>
+#include <asm/word-at-a-time.h>
 
-static inline unsigned short from32to16(unsigned a) 
+static inline unsigned long rotate_by8_if_odd(unsigned long sum, bool aligned)
 {
-	unsigned short b = a >> 16; 
-	asm("addw %w2,%w0\n\t"
-	    "adcw $0,%w0\n" 
-	    : "=r" (b)
-	    : "0" (b), "r" (a));
-	return b;
+	if (unlikely(aligned))
+		asm("rorq $0x8,%0"
+			: "=r" (sum)
+			: "0" (sum));
+	return sum;
 }
 
-/*
- * Do a 64-bit checksum on an arbitrary memory area.
- * Returns a 32bit checksum.
- *
- * This isn't as time critical as it used to be because many NICs
- * do hardware checksumming these days.
- * 
- * Things tried and found to not make it faster:
- * Manual Prefetching
- * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
- */
-static unsigned do_csum(const unsigned char *buff, unsigned len)
+/* Extract the "len" high order (nbo) bytes of quad "val". */
+static inline unsigned long csum_partial_lt8_head(unsigned long val, int len)
 {
-	unsigned odd, count;
-	unsigned long result = 0;
+	return val & ((1ul << len * 8) - 1);
+}
 
-	if (unlikely(len == 0))
-		return result; 
-	odd = 1 & (unsigned long) buff;
-	if (unlikely(odd)) {
-		result = *buff << 8;
-		len--;
-		buff++;
-	}
-	count = len >> 1;		/* nr of 16-bit words.. */
-	if (count) {
-		if (2 & (unsigned long) buff) {
-			result += *(unsigned short *)buff;
-			count--;
-			len -= 2;
-			buff += 2;
-		}
-		count >>= 1;		/* nr of 32-bit words.. */
-		if (count) {
-			unsigned long zero;
-			unsigned count64;
-			if (4 & (unsigned long) buff) {
-				result += *(unsigned int *) buff;
-				count--;
-				len -= 4;
-				buff += 4;
-			}
-			count >>= 1;	/* nr of 64-bit words.. */
-
-			/* main loop using 64byte blocks */
-			zero = 0;
-			count64 = count >> 3;
-			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
-				    "adcq 1*8(%[src]),%[res]\n\t"
-				    "adcq 2*8(%[src]),%[res]\n\t"
-				    "adcq 3*8(%[src]),%[res]\n\t"
-				    "adcq 4*8(%[src]),%[res]\n\t"
-				    "adcq 5*8(%[src]),%[res]\n\t"
-				    "adcq 6*8(%[src]),%[res]\n\t"
-				    "adcq 7*8(%[src]),%[res]\n\t"
-				    "adcq %[zero],%[res]"
-				    : [res] "=r" (result)
-				    : [src] "r" (buff), [zero] "r" (zero),
-				    "[res]" (result));
-				buff += 64;
-				count64--;
-			}
-
-			/* last up to 7 8byte blocks */
-			count %= 8; 
-			while (count) { 
-				asm("addq %1,%0\n\t"
-				    "adcq %2,%0\n" 
-					    : "=r" (result)
-				    : "m" (*(unsigned long *)buff), 
-				    "r" (zero),  "0" (result));
-				--count; 
-					buff += 8;
-			}
-			result = add32_with_carry(result>>32,
-						  result&0xffffffff); 
-
-			if (len & 4) {
-				result += *(unsigned int *) buff;
-				buff += 4;
-			}
-		}
-		if (len & 2) {
-			result += *(unsigned short *) buff;
-			buff += 2;
-		}
-	}
-	if (len & 1)
-		result += *buff;
-	result = add32_with_carry(result>>32, result & 0xffffffff); 
-	if (unlikely(odd)) { 
-		result = from32to16(result);
-		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
+/* Extract the "len" low order (nbo) bytes of quad "val". */
+static inline unsigned long csum_partial_lt8_tail(unsigned long val, int len)
+{
+	return val >> (8 * (8 - len));
+}
+
+/* Sum over buffer up to 64 bytes. */
+static unsigned long csum_partial_le64(const void *buff, unsigned int len,
+				       unsigned long sum)
+{
+	asm("lea 40f(, %[slen], 4), %%r11\n\t"
+	    "clc\n\t"
+	    "jmpq *%%r11\n\t"
+	    "adcq 7*8(%[src]),%[res]\n\t"
+	    "adcq 6*8(%[src]),%[res]\n\t"
+	    "adcq 5*8(%[src]),%[res]\n\t"
+	    "adcq 4*8(%[src]),%[res]\n\t"
+	    "adcq 3*8(%[src]),%[res]\n\t"
+	    "adcq 2*8(%[src]),%[res]\n\t"
+	    "adcq 1*8(%[src]),%[res]\n\t"
+	    "adcq 0*8(%[src]),%[res]\n\t"
+	    "nop\n\t"
+	    "40: adcq $0,%[res]"
+		    : [res] "=r" (sum)
+		    : [src] "r" (buff),
+		      [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
+		    : "r11");
+
+	if (len & 0x7) {
+		unsigned long val;
+		/*
+		 * Since "len" is > 8 here we backtrack in the buffer to load
+		 * the outstanding bytes into the low order bytes of a quad and
+		 * sum those in csum_partial_lt8_tail. By doing this we avoid
+		 * additional calls to load_unaligned_zeropad.
+		 */
+
+		val = csum_partial_lt8_tail(*(unsigned long *)(buff + len - 8),
+					    len & 0x7);
+		sum = add64_with_carry(val, sum);
 	}
-	return result;
+
+	return sum;
 }
 
 /*
@@ -130,10 +82,77 @@  static unsigned do_csum(const unsigned char *buff, unsigned len)
  *
  * it's best to have buff aligned on a 64-bit boundary
  */
+static unsigned int do_csum(const void *buff, int len, unsigned int sum)
+{
+	bool aligned = false;
+	unsigned long result = 0;
+
+	/*
+	 * For "small" lengths don't bother with alignment, x86 handles
+	 * this pretty well.
+	 */
+	if (len <= 8) {
+		unsigned long val;
+
+		/* Optimize trivial cases */
+		if (len == 8)
+			return add32_with_carry3(sum,
+				       *(unsigned int *)buff,
+				       *(unsigned int *)(buff + 4));
+		else if (len == 0)
+			return sum;
+
+		val = load_unaligned_zeropad(buff);
+		result = csum_partial_lt8_head(val, len);
+
+		return add32_with_carry3(sum, result >> 32,
+					 result & 0xffffffff);
+	} else if (len <= 64) {
+		result = csum_partial_le64(buff, len, result);
+
+		return add32_with_carry3(sum, result >> 32,
+					 result & 0xffffffff);
+	}
+
+	/*
+	 * Length is greater than 64. Sum to eight byte alignment before
+	 * proceeding with main loop.
+	 */
+	aligned = !!((unsigned long)buff & 0x1);
+	if (aligned) {
+		unsigned int align = 7 & -(unsigned long)buff;
+
+		result = csum_partial_lt8_head(*(unsigned long *)buff, align);
+		buff += align;
+		len -= align;
+		result = rotate_by8_if_odd(result, align);
+	}
+
+	/* main loop using 64byte blocks */
+	for (; len >= 64; len -= 64, buff += 64) {
+		asm("addq 0*8(%[src]),%[res]\n\t"
+		    "adcq 1*8(%[src]),%[res]\n\t"
+		    "adcq 2*8(%[src]),%[res]\n\t"
+		    "adcq 3*8(%[src]),%[res]\n\t"
+		    "adcq 4*8(%[src]),%[res]\n\t"
+		    "adcq 5*8(%[src]),%[res]\n\t"
+		    "adcq 6*8(%[src]),%[res]\n\t"
+		    "adcq 7*8(%[src]),%[res]\n\t"
+		    "adcq $0,%[res]"
+		    : [res] "=r" (result)
+		    : [src] "r" (buff),
+		    "[res]" (result));
+	}
+
+	result = csum_partial_le64(buff, len, result);
+	result = rotate_by8_if_odd(result, aligned);
+
+	return add32_with_carry3(sum, result >> 32, result & 0xffffffff);
+}
+
 __wsum csum_partial(const void *buff, int len, __wsum sum)
 {
-	return (__force __wsum)add32_with_carry(do_csum(buff, len),
-						(__force u32)sum);
+	return (__force __wsum)do_csum(buff, len, (__force u32)sum);
 }
 
 /*