
[v5,net-next] net: Implement fast csum_partial for x86_64

Message ID 1456957112-2702469-1-git-send-email-tom@herbertland.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Tom Herbert March 2, 2016, 10:18 p.m. UTC
This patch implements a performant csum_partial for x86_64. The intent
is to speed up checksum calculation, particularly for the smaller
lengths seen when doing skb_postpull_rcsum after getting
CHECKSUM_COMPLETE from the device or after a CHECKSUM_UNNECESSARY
conversion.
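
(For context, the caller in question looks roughly like this; a
paraphrase of skb_postpull_rcsum() from include/linux/skbuff.h around
that time, shown only to illustrate where the short csum_partial()
calls come from:)

static inline void skb_postpull_rcsum(struct sk_buff *skb,
				      const void *start, unsigned int len)
{
	if (skb->ip_summed == CHECKSUM_COMPLETE)
		skb->csum = csum_sub(skb->csum, csum_partial(start, len, 0));
	else if (skb->ip_summed == CHECKSUM_PARTIAL &&
		 skb_checksum_start_offset(skb) < 0)
		skb->ip_summed = CHECKSUM_NONE;
}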

- v4
   - went back to C code with inline assembly for critical routines
   - implemented suggestion from Linus to deal with lengths < 8
- v5
   - fixed register attribute in add32_with_carry3
   - do_csum returns unsigned long
   - don't consider alignment at all. The rationale is that x86
     handles unaligned accesses very well except when the access
     crosses a page boundary, which carries a performance penalty
     (I see about 10 nsecs on my system). Drivers and the stack go
     to considerable lengths to keep packets from crossing page
     boundaries, so csum_partial being called with a buffer that
     crosses a page boundary should be a very rare occurrence. Not
     dealing with alignment is a significant simplification.

Testing:

Correctness:

Verified correctness by testing arbitrary-length buffers filled with
random data. For each buffer I compared the checksum computed by the
new code against that of the original algorithm, for each possible
alignment (0-7 bytes).

Performance:

Isolating the old and new implementations for some common cases:

             Old      New     %
    Len/Aln  nsecs    nsecs   Improv
    --------+-------+--------+-------
    1400/0    195.6    181.7   3%     (Big packet)
    40/0      11.8     6.5     45%    (IPv6 hdr cmn case)
    8/4       8.1      3.2     60%    (UDP, VXLAN in IPv4)
    14/0      8.9      6.3     29%    (Eth hdr)
    14/4      9.5      6.3     33%    (Eth hdr in IPv4)
    14/3      9.6      6.3     34%    (Eth with odd align)
    20/0      9.1      6.8     25%    (IP hdr without options)
    7/1       9.1      3.9     57%    (buffer in one quad)
    100/0    17.4     13.6     21%    (medium-sized pkt)
    100/2    17.7     13.5     24%    (medium-sized pkt w/ alignment)

Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Also tested on these with similar results:

Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Intel(R) Atom(TM) CPU N450   @ 1.66GHz

Branch prediction:

To test the effects of poor branch prediction in the jump tables I
tested checksum performance with runs for two combinations of length
and alignment. As the baseline I performed the test by doing half of
the calls with the first combination, followed by the second
combination for the second half. In the test case, I interleaved the
two combinations so that the length and alignment are different on
every call, to defeat the effects of branch prediction. Running several
cases, I did not see any material performance difference between the
two scenarios (perf stat output is below), nor does either case show a
significant number of branch misses.

Interleave lengths case:

perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
    ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000

 Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):

     9,556,693,202      instructions               ( +-  0.00% )
     1,176,208,640       branches                                                     ( +-  0.00% )
            19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )

       2.049732539 seconds time elapsed

Non-interleave case:

perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
     ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000

Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):

     9,782,188,310      instructions               ( +-  0.00% )
     1,251,286,958       branches                                                     ( +-  0.01% )
            18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )

       2.271789046 seconds time elapsed

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 arch/x86/include/asm/checksum_64.h |  21 +++++
 arch/x86/lib/csum-partial_64.c     | 171 ++++++++++++++++++-------------------
 2 files changed, 102 insertions(+), 90 deletions(-)

Comments

Eric Dumazet March 2, 2016, 10:35 p.m. UTC | #1
On Wed, 2016-03-02 at 14:18 -0800, Tom Herbert wrote:
> +	asm("lea 0f(, %[slen], 4), %%r11\n\t"
> +	    "clc\n\t"
> +	    "jmpq *%%r11\n\t"
> +	    "adcq 7*8(%[src]),%[res]\n\t"
> +	    "adcq 6*8(%[src]),%[res]\n\t"
> +	    "adcq 5*8(%[src]),%[res]\n\t"
> +	    "adcq 4*8(%[src]),%[res]\n\t"
> +	    "adcq 3*8(%[src]),%[res]\n\t"
> +	    "adcq 2*8(%[src]),%[res]\n\t"
> +	    "adcq 1*8(%[src]),%[res]\n\t"
> +	    "adcq 0*8(%[src]),%[res]\n\t"
> +	    "nop\n\t"
> +	    "0: adcq $0,%[res]"
> +		    : [res] "=r" (result)
> +		    : [src] "r" (buff),
> +		      [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
> +		    : "r11");
>  


hpa mentioned we could use adcq.d8 0*8(%[src]),%[res]

to avoid the mandatory 'nop'
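
Something like this, presumably (an untested sketch; it assumes the
assembler accepts the .d8 displacement-size suffix, which forces a
one-byte displacement so that the 0*8 case is also encoded in 4 bytes
and the padding nop can go away):

	asm("lea 0f(, %[slen], 4), %%r11\n\t"
	    "clc\n\t"
	    "jmpq *%%r11\n\t"
	    "adcq 7*8(%[src]),%[res]\n\t"
	    "adcq 6*8(%[src]),%[res]\n\t"
	    "adcq 5*8(%[src]),%[res]\n\t"
	    "adcq 4*8(%[src]),%[res]\n\t"
	    "adcq 3*8(%[src]),%[res]\n\t"
	    "adcq 2*8(%[src]),%[res]\n\t"
	    "adcq 1*8(%[src]),%[res]\n\t"
	    "adcq.d8 0*8(%[src]),%[res]\n\t"	/* forced disp8: 4 bytes, no nop */
	    "0: adcq $0,%[res]"
		    : [res] "=r" (result)
		    : [src] "r" (buff),
		      [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
		    : "r11");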
Alexander H Duyck March 2, 2016, 11:42 p.m. UTC | #2
On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert <tom@herbertland.com> wrote:
> This patch implements performant csum_partial for x86_64. The intent is
> to speed up checksum calculation, particularly for smaller lengths such
> as those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>
> - v4
>    - went back to C code with inline assembly for critical routines
>    - implemented suggestion from Linus to deal with lengths < 8
> - v5
>    - fixed register attribute add32_with_carry3
>    - do_csum returns unsigned long
>    - don't consider alignment at all. Rationalization is that x86
>      handles unaligned accesses very well except in the case that
>      the access crosses a page boundary which has a performance
>      penalty (I see about 10nsecs on my system). Drivers and the
>      stack go to considerable lengths to not have packets cross page
>      boundaries, so the case that csum_partial is called with
>      buffer that crosses a page boundary should be a very rare
>      occurrence. Not dealing with alignment is a significant
>      simplification.
>
> Testing:
>
> Correctness:
>
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Performance:
>
> Isolating old and new implementation for some common cases:
>
>              Old      New     %
>     Len/Aln  nsecs    nsecs   Improv
>     --------+-------+--------+-------
>     1400/0    195.6    181.7   3%     (Big packet)

The interesting bit here for me would be how we handle the 1480 byte
values with a 2 byte alignment offset.  That would typically be our
case where we have 14 (eth) + 20 (IP) and no IP_ALIGN since we set it
to 0 on x86.  This is also the case where we would have around a 30%
chance of there being a page boundary in there somewhere, since if I
recall 1536 + 64 (NET_SKB_PAD) + 384 (skb_shared_info) doesn't add up
to an even power of 2.  So there is a good likelihood of us actually
regressing in the receive path for large packet checksumming, since
there is about a 10ns penalty for spanning a page boundary with a
single read.

>     40/0      11.8     6.5     45%    (Ipv6 hdr cmn case)

Common case for transmit maybe.  For receive on x86 this isn't IP
aligned so the offset would be 6.

>     8/4       8.1      3.2     60%    (UDP, VXLAN in IPv4)

How likely is the 8/4 case in reality?  I ask because you have a
special handler for the case and as such that extra bit of code is
going to cost you one cycle or more in all other cases.  As far as the
checksum for VXLAN I would think you are probably looking at something
more like 20/4 because you would likely be pulling the UDP, VXLAN, and
inner Ethernet header to get to the inner IP header.  The only case I
can think of where we might be working on 8 bytes would be something
like L3 encapsulated inside of GRE.

>     14/0      8.9      6.3     29%    (Eth hdr)
>     14/4      9.5      6.3     33%    (Eth hdr in IPv4)
>     14/3      9.6      6.3     34%    (Eth with odd align)
>     20/0      9.1      6.8     25%    (IP hdr without options)
>     7/1       9.1      3.9     57%    (buffer in one quad)
>     100/0    17.4     13.6     21%    (medium-sized pkt)
>     100/2    17.7     13.5     24%    (medium-sized pkt w/ alignment)
>
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz


> Also tested on these with similar results:
>
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
> Intel(R) Atom(TM) CPU N450   @ 1.66GHz
>
> Branch  prediction:
>
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between the
> two scenarios (perf stat output is below), neither does either case
> show a significant number of branch misses.
>
> Interleave lengths case:
>
> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>     ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>
>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>
>      9,556,693,202      instructions               ( +-  0.00% )
>      1,176,208,640       branches                                                     ( +-  0.00% )
>             19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )
>
>        2.049732539 seconds time elapsed
>
>     Non-interleave case:
>
> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>      ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>
> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>
>      9,782,188,310      instructions               ( +-  0.00% )
>      1,251,286,958       branches                                                     ( +-  0.01% )
>             18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )
>
>        2.271789046 seconds time elapsed
>
> Signed-off-by: Tom Herbert <tom@herbertland.com>
> ---
>  arch/x86/include/asm/checksum_64.h |  21 +++++
>  arch/x86/lib/csum-partial_64.c     | 171 ++++++++++++++++++-------------------
>  2 files changed, 102 insertions(+), 90 deletions(-)
>
> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
> index cd00e17..1224f7d 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>         return a;
>  }
>
> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
> +{
> +       asm("addq %2,%0\n\t"
> +           "adcq $0,%0"
> +           : "=r" (a)
> +           : "0" (a), "rm" (b));
> +       return a;
> +}
> +

You can probably just convert this and the add32_with_carry over to
the +r approach instead of using "0".
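
(i.e. something like this sketch of the suggested change:)

static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
{
	asm("addq %1,%0\n\t"
	    "adcq $0,%0"
	    : "+r" (a)		/* read-modify-write, replaces the "0" matching constraint */
	    : "rm" (b));
	return a;
}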

> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
> +                                            unsigned int c)
> +{
> +       asm("addl %1,%0\n\t"
> +           "adcl %2,%0\n\t"
> +           "adcl $0,%0"
> +           : "+r" (a)
> +           : "rm" (b), "rm" (c));
> +
> +       return a;
> +}
> +
>  #define HAVE_ARCH_CSUM_ADD
>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>  {
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> index 9845371..7f1f60f 100644
> --- a/arch/x86/lib/csum-partial_64.c
> +++ b/arch/x86/lib/csum-partial_64.c
> @@ -8,6 +8,7 @@
>  #include <linux/compiler.h>
>  #include <linux/module.h>
>  #include <asm/checksum.h>
> +#include <asm/word-at-a-time.h>
>
>  static inline unsigned short from32to16(unsigned a)
>  {
> @@ -21,99 +22,78 @@ static inline unsigned short from32to16(unsigned a)
>
>  /*
>   * Do a 64-bit checksum on an arbitrary memory area.
> - * Returns a 32bit checksum.
> + * Returns a 64bit checksum.
>   *
> - * This isn't as time critical as it used to be because many NICs
> - * do hardware checksumming these days.
> - *
> - * Things tried and found to not make it faster:
> - * Manual Prefetching
> - * Unrolling to an 128 bytes inner loop.
> - * Using interleaving with more registers to break the carry chains.
> + * This is optimized for small lengths such as might be common when pulling
> + * up checksums over protocol headers to handle CHECKSUM_COMPLETE (e.g.
> + * checksum over 40 bytes will be quite common for pulling up checksum over
> + * IPv6 headers).
>   */
> -static unsigned do_csum(const unsigned char *buff, unsigned len)
> +static unsigned long do_csum(const void *buff, int len)
>  {
> -       unsigned odd, count;
>         unsigned long result = 0;
>
> -       if (unlikely(len == 0))
> -               return result;
> -       odd = 1 & (unsigned long) buff;
> -       if (unlikely(odd)) {
> -               result = *buff << 8;
> -               len--;
> -               buff++;
> +       /* Check for less than a quad to sum */
> +       if (len < 8) {
> +               unsigned long val = load_unaligned_zeropad(buff);
> +
> +               return (val & ((1ul << len * 8) - 1));
> +       }
> +
> +       /* Main loop using 64byte blocks */
> +       for (; len > 64; len -= 64, buff += 64) {
> +               asm("addq 0*8(%[src]),%[res]\n\t"
> +                   "adcq 1*8(%[src]),%[res]\n\t"
> +                   "adcq 2*8(%[src]),%[res]\n\t"
> +                   "adcq 3*8(%[src]),%[res]\n\t"
> +                   "adcq 4*8(%[src]),%[res]\n\t"
> +                   "adcq 5*8(%[src]),%[res]\n\t"
> +                   "adcq 6*8(%[src]),%[res]\n\t"
> +                   "adcq 7*8(%[src]),%[res]\n\t"
> +                   "adcq $0,%[res]"
> +                   : [res] "=r" (result)

The +r would probably work here too, just to be consistent.

> +                   : [src] "r" (buff),
> +                   "[res]" (result));
>         }
> -       count = len >> 1;               /* nr of 16-bit words.. */
> -       if (count) {
> -               if (2 & (unsigned long) buff) {
> -                       result += *(unsigned short *)buff;
> -                       count--;
> -                       len -= 2;
> -                       buff += 2;
> -               }
> -               count >>= 1;            /* nr of 32-bit words.. */
> -               if (count) {
> -                       unsigned long zero;
> -                       unsigned count64;
> -                       if (4 & (unsigned long) buff) {
> -                               result += *(unsigned int *) buff;
> -                               count--;
> -                               len -= 4;
> -                               buff += 4;
> -                       }
> -                       count >>= 1;    /* nr of 64-bit words.. */
>
> -                       /* main loop using 64byte blocks */
> -                       zero = 0;
> -                       count64 = count >> 3;
> -                       while (count64) {
> -                               asm("addq 0*8(%[src]),%[res]\n\t"
> -                                   "adcq 1*8(%[src]),%[res]\n\t"
> -                                   "adcq 2*8(%[src]),%[res]\n\t"
> -                                   "adcq 3*8(%[src]),%[res]\n\t"
> -                                   "adcq 4*8(%[src]),%[res]\n\t"
> -                                   "adcq 5*8(%[src]),%[res]\n\t"
> -                                   "adcq 6*8(%[src]),%[res]\n\t"
> -                                   "adcq 7*8(%[src]),%[res]\n\t"
> -                                   "adcq %[zero],%[res]"
> -                                   : [res] "=r" (result)
> -                                   : [src] "r" (buff), [zero] "r" (zero),
> -                                   "[res]" (result));
> -                               buff += 64;
> -                               count64--;
> -                       }
> +       /*
> +        * Sum over remaining quads (<= 8 of them). This uses a jump table
> +        * based on number of quads to sum. The jump assumes that each case
> +        * is 4 bytes. Each adcq instruction is 4 bytes except for adcq 0()
> +        * which is 3 bytes, so a nop instruction is inserted to make that case
> +        * 4 bytes.
> +        */
> +       asm("lea 0f(, %[slen], 4), %%r11\n\t"
> +           "clc\n\t"
> +           "jmpq *%%r11\n\t"
> +           "adcq 7*8(%[src]),%[res]\n\t"
> +           "adcq 6*8(%[src]),%[res]\n\t"
> +           "adcq 5*8(%[src]),%[res]\n\t"
> +           "adcq 4*8(%[src]),%[res]\n\t"
> +           "adcq 3*8(%[src]),%[res]\n\t"
> +           "adcq 2*8(%[src]),%[res]\n\t"
> +           "adcq 1*8(%[src]),%[res]\n\t"
> +           "adcq 0*8(%[src]),%[res]\n\t"
> +           "nop\n\t"
> +           "0: adcq $0,%[res]"
> +                   : [res] "=r" (result)

Same comment about +r here.

> +                   : [src] "r" (buff),
> +                     [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
> +                   : "r11");
>
> -                       /* last up to 7 8byte blocks */
> -                       count %= 8;
> -                       while (count) {
> -                               asm("addq %1,%0\n\t"
> -                                   "adcq %2,%0\n"
> -                                           : "=r" (result)
> -                                   : "m" (*(unsigned long *)buff),
> -                                   "r" (zero),  "0" (result));
> -                               --count;
> -                                       buff += 8;
> -                       }
> -                       result = add32_with_carry(result>>32,
> -                                                 result&0xffffffff);
> +       /* Sum over any remaining bytes (< 8 of them) */
> +       if (len & 0x7) {
> +               unsigned long val;
> +               /*
> +                * Since "len" is > 8 here we backtrack in the buffer to load
> +                * the outstanding bytes into the low order bytes of a quad and
> +                * then shift to extract the relevant bytes. By doing this we
> +                * avoid additional calls to load_unaligned_zeropad.
> +                */
>
> -                       if (len & 4) {
> -                               result += *(unsigned int *) buff;
> -                               buff += 4;
> -                       }
> -               }
> -               if (len & 2) {
> -                       result += *(unsigned short *) buff;
> -                       buff += 2;
> -               }
> -       }
> -       if (len & 1)
> -               result += *buff;
> -       result = add32_with_carry(result>>32, result & 0xffffffff);
> -       if (unlikely(odd)) {
> -               result = from32to16(result);
> -               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> +               val = *(unsigned long *)(buff + len - 8);
> +               val >>= 8 * (-len & 0x7);
> +               result = add64_with_carry(val, result);
>         }
>         return result;
>  }
> @@ -125,15 +105,26 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>   * returns a 32-bit number suitable for feeding into itself
>   * or csum_tcpudp_magic
>   *
> - * this function must be called with even lengths, except
> - * for the last fragment, which may be odd
> - *
> - * it's best to have buff aligned on a 64-bit boundary
> + * Note that this implementation makes no attempt to avoid unaligned accesses
> + * (e.g. load a quad word with non 8-byte alignment). On x86 unaligned accesses
> + * only seem to be a performance penalty when the access crosses a page
> + * boundary-- such a scenario should be an extremely rare occurrence for use
> + * cases of csum_partial.
>   */
>  __wsum csum_partial(const void *buff, int len, __wsum sum)
>  {
> -       return (__force __wsum)add32_with_carry(do_csum(buff, len),
> -                                               (__force u32)sum);
> +       if (len == 8) {
> +               /* Optimize trivial case */
> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
> +                                                *(unsigned int *)buff,
> +                                                *(unsigned int *)(buff + 4));
> +       } else {
> +               unsigned long result = do_csum(buff, len);
> +
> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
> +                                                result >> 32,
> +                                                result & 0xffffffff);
> +       }
>  }
>
>  /*
> --
> 2.6.5
>
Tom Herbert March 3, 2016, 12:40 a.m. UTC | #3
On Wed, Mar 2, 2016 at 3:42 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert <tom@herbertland.com> wrote:
>> This patch implements performant csum_partial for x86_64. The intent is
>> to speed up checksum calculation, particularly for smaller lengths such
>> as those that are present when doing skb_postpull_rcsum when getting
>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>>
>> - v4
>>    - went back to C code with inline assembly for critical routines
>>    - implemented suggestion from Linus to deal with lengths < 8
>> - v5
>>    - fixed register attribute add32_with_carry3
>>    - do_csum returns unsigned long
>>    - don't consider alignment at all. Rationalization is that x86
>>      handles unaligned accesses very well except in the case that
>>      the access crosses a page boundary which has a performance
>>      penalty (I see about 10nsecs on my system). Drivers and the
>>      stack go to considerable lengths to not have packets cross page
>>      boundaries, so the case that csum_partial is called with
>>      buffer that crosses a page boundary should be a very rare
>>      occurrence. Not dealing with alignment is a significant
>>      simplification.
>>
>> Testing:
>>
>> Correctness:
>>
>> Verified correctness by testing arbitrary length buffer filled with
>> random data. For each buffer I compared the computed checksum
>> using the original algorithm for each possible alignment (0-7 bytes).
>>
>> Performance:
>>
>> Isolating old and new implementation for some common cases:
>>
>>              Old      New     %
>>     Len/Aln  nsecs    nsecs   Improv
>>     --------+-------+--------+-------
>>     1400/0    195.6    181.7   3%     (Big packet)
>
> The interesting bit here for me would be how we handle the 1480 byte
> values with a 2 byte alignment offset.  That would typically be our
> case where we have 14 (eth) + 20 (IP) and no IP_ALIGN since we set it
> to 0 on x86.  This is also the case where we would have around a 30%
> chance of there being a page boundary in there somewhere since if I
> recall 1536 + 64 (NET_SKB_PAD) + 384 (skb_shared_info) doesn't add up
> to an even power of 2 so there is a good likelihood of us actually
> regressing in the receive path for large packet checksumming since it
> is about a 10ns penalty for spanning a page boundary for a single
> read.
>
Yes, but the case you're considering is one in which we need to
perform a full packet checksum in the host-- normally we'd expect to
have HW offload for that. But even in that case, the regression would
not be the full 10ns, since the logic to check alignment would be
needed and that has some cost. Also, for longer checksums the cost is
amortized, so any relative regression would be smaller. Interestingly,
on the Atom CPU it was still better performance to ignore the
alignment than to go through the code to handle it. For Xeon there was
some regression. The potentially bad case would be if headers are
split over a page boundary and we are doing checksum complete (this in
theory could also be a problem for other accesses, like if the IP
addresses end up straddling the page boundary). I think the
probability of that is well less than 30%, but if we really are
worried about checksumming over a page boundary, then I would put in
one conditional to switch between the old do_csum and the new do_csum
(call it do_fast_csum) based on the buffer crossing a page boundary--
so the cost of needing to deal with alignment in the common case is at
most a single conditional (1 ns).
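
(As a rough sketch of that idea, with do_fast_csum standing in for the
new routine and do_csum for the existing alignment-aware one; both
names are placeholders here:)

__wsum csum_partial(const void *buff, int len, __wsum sum)
{
	unsigned long result;

	/* Fall back to the old code only if the buffer straddles a page */
	if (unlikely(((unsigned long)buff & ~PAGE_MASK) + len > PAGE_SIZE))
		result = do_csum(buff, len);
	else
		result = do_fast_csum(buff, len);

	return (__force __wsum)add32_with_carry3((__force u32)sum,
						 result >> 32,
						 result & 0xffffffff);
}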

>>     40/0      11.8     6.5     45%    (Ipv6 hdr cmn case)
>
> Common case for transmit maybe.  For receive on x86 this isn't IP
> aligned so the offset would be 6.

Offset 6 gives about the same results.

>
>>     8/4       8.1      3.2     60%    (UDP, VXLAN in IPv4)
>
> How likely is the 8/4 case in reality?  I ask because you have a
> special handler for the case and as such that extra bit of code is
> going to cost you one cycle or more in all other cases.  As far as the
> checksum for VXLAN I would think you are probably looking at something
> more like 20/4 because you would likely be pulling the UDP, VXLAN, and
> inner Ethernet header to get to the inner IP header.  The only case I
> can think of where we might be working on 8 bytes would be something
> like L3 encapsulated inside of GRE.
>
>>     14/0      8.9      6.3     29%    (Eth hdr)
>>     14/4      9.5      6.3     33%    (Eth hdr in IPv4)
>>     14/3      9.6      6.3     34%    (Eth with odd align)
>>     20/0      9.1      6.8     25%    (IP hdr without options)
>>     7/1       9.1      3.9     57%    (buffer in one quad)
>>     100/0    17.4     13.6     21%    (medium-sized pkt)
>>     100/2    17.7     13.5     24%    (medium-sized pkt w/ alignment)
>>
>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
>
>> Also tested on these with similar results:
>>
>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>> Intel(R) Atom(TM) CPU N450   @ 1.66GHz
>>
>> Branch  prediction:
>>
>> To test the effects of poor branch prediction in the jump tables I
>> tested checksum performance with runs for two combinations of length
>> and alignment. As the baseline I performed the test by doing half of
>> calls with the first combination, followed by using the second
>> combination for the second half. In the test case, I interleave the
>> two combinations so that in every call the length and alignment are
>> different to defeat the effects of branch prediction. Running several
>> cases, I did not see any material performance difference between the
>> two scenarios (perf stat output is below), neither does either case
>> show a significant number of branch misses.
>>
>> Interleave lengths case:
>>
>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>     ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>>
>>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>
>>      9,556,693,202      instructions               ( +-  0.00% )
>>      1,176,208,640       branches                                                     ( +-  0.00% )
>>             19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )
>>
>>        2.049732539 seconds time elapsed
>>
>>     Non-interleave case:
>>
>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>      ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>>
>> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>
>>      9,782,188,310      instructions               ( +-  0.00% )
>>      1,251,286,958       branches                                                     ( +-  0.01% )
>>             18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )
>>
>>        2.271789046 seconds time elapsed
>>
>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>> ---
>>  arch/x86/include/asm/checksum_64.h |  21 +++++
>>  arch/x86/lib/csum-partial_64.c     | 171 ++++++++++++++++++-------------------
>>  2 files changed, 102 insertions(+), 90 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
>> index cd00e17..1224f7d 100644
>> --- a/arch/x86/include/asm/checksum_64.h
>> +++ b/arch/x86/include/asm/checksum_64.h
>> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>>         return a;
>>  }
>>
>> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
>> +{
>> +       asm("addq %2,%0\n\t"
>> +           "adcq $0,%0"
>> +           : "=r" (a)
>> +           : "0" (a), "rm" (b));
>> +       return a;
>> +}
>> +
>
> You can probably just convert this and the add32_with_carry over to
> the +r approach instead of using "0".
>
>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
>> +                                            unsigned int c)
>> +{
>> +       asm("addl %1,%0\n\t"
>> +           "adcl %2,%0\n\t"
>> +           "adcl $0,%0"
>> +           : "+r" (a)
>> +           : "rm" (b), "rm" (c));
>> +
>> +       return a;
>> +}
>> +
>>  #define HAVE_ARCH_CSUM_ADD
>>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>>  {
>> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>> index 9845371..7f1f60f 100644
>> --- a/arch/x86/lib/csum-partial_64.c
>> +++ b/arch/x86/lib/csum-partial_64.c
>> @@ -8,6 +8,7 @@
>>  #include <linux/compiler.h>
>>  #include <linux/module.h>
>>  #include <asm/checksum.h>
>> +#include <asm/word-at-a-time.h>
>>
>>  static inline unsigned short from32to16(unsigned a)
>>  {
>> @@ -21,99 +22,78 @@ static inline unsigned short from32to16(unsigned a)
>>
>>  /*
>>   * Do a 64-bit checksum on an arbitrary memory area.
>> - * Returns a 32bit checksum.
>> + * Returns a 64bit checksum.
>>   *
>> - * This isn't as time critical as it used to be because many NICs
>> - * do hardware checksumming these days.
>> - *
>> - * Things tried and found to not make it faster:
>> - * Manual Prefetching
>> - * Unrolling to an 128 bytes inner loop.
>> - * Using interleaving with more registers to break the carry chains.
>> + * This is optimized for small lengths such as might be common when pulling
>> + * up checksums over protocol headers to handle CHECKSUM_COMPLETE (e.g.
>> + * checksum over 40 bytes will be quite common for pulling up checksum over
>> + * IPv6 headers).
>>   */
>> -static unsigned do_csum(const unsigned char *buff, unsigned len)
>> +static unsigned long do_csum(const void *buff, int len)
>>  {
>> -       unsigned odd, count;
>>         unsigned long result = 0;
>>
>> -       if (unlikely(len == 0))
>> -               return result;
>> -       odd = 1 & (unsigned long) buff;
>> -       if (unlikely(odd)) {
>> -               result = *buff << 8;
>> -               len--;
>> -               buff++;
>> +       /* Check for less than a quad to sum */
>> +       if (len < 8) {
>> +               unsigned long val = load_unaligned_zeropad(buff);
>> +
>> +               return (val & ((1ul << len * 8) - 1));
>> +       }
>> +
>> +       /* Main loop using 64byte blocks */
>> +       for (; len > 64; len -= 64, buff += 64) {
>> +               asm("addq 0*8(%[src]),%[res]\n\t"
>> +                   "adcq 1*8(%[src]),%[res]\n\t"
>> +                   "adcq 2*8(%[src]),%[res]\n\t"
>> +                   "adcq 3*8(%[src]),%[res]\n\t"
>> +                   "adcq 4*8(%[src]),%[res]\n\t"
>> +                   "adcq 5*8(%[src]),%[res]\n\t"
>> +                   "adcq 6*8(%[src]),%[res]\n\t"
>> +                   "adcq 7*8(%[src]),%[res]\n\t"
>> +                   "adcq $0,%[res]"
>> +                   : [res] "=r" (result)
>
> The +r would probably work here to just to be consistent.
>
>> +                   : [src] "r" (buff),
>> +                   "[res]" (result));
>>         }
>> -       count = len >> 1;               /* nr of 16-bit words.. */
>> -       if (count) {
>> -               if (2 & (unsigned long) buff) {
>> -                       result += *(unsigned short *)buff;
>> -                       count--;
>> -                       len -= 2;
>> -                       buff += 2;
>> -               }
>> -               count >>= 1;            /* nr of 32-bit words.. */
>> -               if (count) {
>> -                       unsigned long zero;
>> -                       unsigned count64;
>> -                       if (4 & (unsigned long) buff) {
>> -                               result += *(unsigned int *) buff;
>> -                               count--;
>> -                               len -= 4;
>> -                               buff += 4;
>> -                       }
>> -                       count >>= 1;    /* nr of 64-bit words.. */
>>
>> -                       /* main loop using 64byte blocks */
>> -                       zero = 0;
>> -                       count64 = count >> 3;
>> -                       while (count64) {
>> -                               asm("addq 0*8(%[src]),%[res]\n\t"
>> -                                   "adcq 1*8(%[src]),%[res]\n\t"
>> -                                   "adcq 2*8(%[src]),%[res]\n\t"
>> -                                   "adcq 3*8(%[src]),%[res]\n\t"
>> -                                   "adcq 4*8(%[src]),%[res]\n\t"
>> -                                   "adcq 5*8(%[src]),%[res]\n\t"
>> -                                   "adcq 6*8(%[src]),%[res]\n\t"
>> -                                   "adcq 7*8(%[src]),%[res]\n\t"
>> -                                   "adcq %[zero],%[res]"
>> -                                   : [res] "=r" (result)
>> -                                   : [src] "r" (buff), [zero] "r" (zero),
>> -                                   "[res]" (result));
>> -                               buff += 64;
>> -                               count64--;
>> -                       }
>> +       /*
>> +        * Sum over remaining quads (<= 8 of them). This uses a jump table
>> +        * based on number of quads to sum. The jump assumes that each case
>> +        * is 4 bytes. Each adcq instruction is 4 bytes except for adcq 0()
>> +        * which is 3 bytes, so a nop instruction is inserted to make that case
>> +        * 4 bytes.
>> +        */
>> +       asm("lea 0f(, %[slen], 4), %%r11\n\t"
>> +           "clc\n\t"
>> +           "jmpq *%%r11\n\t"
>> +           "adcq 7*8(%[src]),%[res]\n\t"
>> +           "adcq 6*8(%[src]),%[res]\n\t"
>> +           "adcq 5*8(%[src]),%[res]\n\t"
>> +           "adcq 4*8(%[src]),%[res]\n\t"
>> +           "adcq 3*8(%[src]),%[res]\n\t"
>> +           "adcq 2*8(%[src]),%[res]\n\t"
>> +           "adcq 1*8(%[src]),%[res]\n\t"
>> +           "adcq 0*8(%[src]),%[res]\n\t"
>> +           "nop\n\t"
>> +           "0: adcq $0,%[res]"
>> +                   : [res] "=r" (result)
>
> Same comment about +r here.
>
>> +                   : [src] "r" (buff),
>> +                     [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
>> +                   : "r11");
>>
>> -                       /* last up to 7 8byte blocks */
>> -                       count %= 8;
>> -                       while (count) {
>> -                               asm("addq %1,%0\n\t"
>> -                                   "adcq %2,%0\n"
>> -                                           : "=r" (result)
>> -                                   : "m" (*(unsigned long *)buff),
>> -                                   "r" (zero),  "0" (result));
>> -                               --count;
>> -                                       buff += 8;
>> -                       }
>> -                       result = add32_with_carry(result>>32,
>> -                                                 result&0xffffffff);
>> +       /* Sum over any remaining bytes (< 8 of them) */
>> +       if (len & 0x7) {
>> +               unsigned long val;
>> +               /*
>> +                * Since "len" is > 8 here we backtrack in the buffer to load
>> +                * the outstanding bytes into the low order bytes of a quad and
>> +                * then shift to extract the relevant bytes. By doing this we
>> +                * avoid additional calls to load_unaligned_zeropad.
>> +                */
>>
>> -                       if (len & 4) {
>> -                               result += *(unsigned int *) buff;
>> -                               buff += 4;
>> -                       }
>> -               }
>> -               if (len & 2) {
>> -                       result += *(unsigned short *) buff;
>> -                       buff += 2;
>> -               }
>> -       }
>> -       if (len & 1)
>> -               result += *buff;
>> -       result = add32_with_carry(result>>32, result & 0xffffffff);
>> -       if (unlikely(odd)) {
>> -               result = from32to16(result);
>> -               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
>> +               val = *(unsigned long *)(buff + len - 8);
>> +               val >>= 8 * (-len & 0x7);
>> +               result = add64_with_carry(val, result);
>>         }
>>         return result;
>>  }
>> @@ -125,15 +105,26 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>>   * returns a 32-bit number suitable for feeding into itself
>>   * or csum_tcpudp_magic
>>   *
>> - * this function must be called with even lengths, except
>> - * for the last fragment, which may be odd
>> - *
>> - * it's best to have buff aligned on a 64-bit boundary
>> + * Note that this implementation makes no attempt to avoid unaligned accesses
>> + * (e.g. load a quad word with non 8-byte alignment). On x86 unaligned accesses
>> + * only seem to be a performance penalty when the access crosses a page
>> + * boundary-- such a scenario should be an extremely rare occurrence for use
>> + * cases of csum_partial.
>>   */
>>  __wsum csum_partial(const void *buff, int len, __wsum sum)
>>  {
>> -       return (__force __wsum)add32_with_carry(do_csum(buff, len),
>> -                                               (__force u32)sum);
>> +       if (len == 8) {
>> +               /* Optimize trivial case */
>> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
>> +                                                *(unsigned int *)buff,
>> +                                                *(unsigned int *)(buff + 4));
>> +       } else {
>> +               unsigned long result = do_csum(buff, len);
>> +
>> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
>> +                                                result >> 32,
>> +                                                result & 0xffffffff);
>> +       }
>>  }
>>
>>  /*
>> --
>> 2.6.5
>>
Alexander H Duyck March 3, 2016, 1:35 a.m. UTC | #4
On Wed, Mar 2, 2016 at 4:40 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Wed, Mar 2, 2016 at 3:42 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert <tom@herbertland.com> wrote:
>>> This patch implements performant csum_partial for x86_64. The intent is
>>> to speed up checksum calculation, particularly for smaller lengths such
>>> as those that are present when doing skb_postpull_rcsum when getting
>>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>>>
>>> - v4
>>>    - went back to C code with inline assembly for critical routines
>>>    - implemented suggestion from Linus to deal with lengths < 8
>>> - v5
>>>    - fixed register attribute add32_with_carry3
>>>    - do_csum returns unsigned long
>>>    - don't consider alignment at all. Rationalization is that x86
>>>      handles unaligned accesses very well except in the case that
>>>      the access crosses a page boundary which has a performance
>>>      penalty (I see about 10nsecs on my system). Drivers and the
>>>      stack go to considerable lengths to not have packets cross page
>>>      boundaries, so the case that csum_partial is called with
>>>      buffer that crosses a page boundary should be a very rare
>>>      occurrence. Not dealing with alignment is a significant
>>>      simplification.
>>>
>>> Testing:
>>>
>>> Correctness:
>>>
>>> Verified correctness by testing arbitrary length buffer filled with
>>> random data. For each buffer I compared the computed checksum
>>> using the original algorithm for each possible alignment (0-7 bytes).
>>>
>>> Performance:
>>>
>>> Isolating old and new implementation for some common cases:
>>>
>>>              Old      New     %
>>>     Len/Aln  nsecs    nsecs   Improv
>>>     --------+-------+--------+-------
>>>     1400/0    195.6    181.7   3%     (Big packet)
>>
>> The interesting bit here for me would be how we handle the 1480 byte
>> values with a 2 byte alignment offset.  That would typically be our
>> case where we have 14 (eth) + 20 (IP) and no IP_ALIGN since we set it
>> to 0 on x86.  This is also the case where we would have around a 30%
>> chance of there being a page boundary in there somewhere since if I
>> recall 1536 + 64 (NET_SKB_PAD) + 384 (skb_shared_info) doesn't add up
>> to an even power of 2 so there is a good likelihood of us actually
>> regressing in the receive path for large packet checksumming since it
>> is about a 10ns penalty for spanning a page boundary for a single
>> read.
>>
> Yes, but the case your considering would be that in which we need to
> perform a full packet checksum in the host-- normally we'd expect to
> have HW offload for that. But even in that case, the regression would
> not be the full 10ns since the logic to check alignment would be
> needed and that has some cost. Also, for longer checksums the cost is
> amortized so any relative regression would be smaller. Interestingly,
> on the Atom CPU it was still better performance to ignore the
> alignment than got through the code to handle it. For Xeon there was
> some regression. The potentially bad case would be if headers are
> split over a page boundary and we are doing checksum complete (this in
> theory could also be a problem for other accesses like if the IP
> addresses end up straddling the page boundary). I think the
> probability of that is well less than 30%, but if we really are
> worried about checksum over a page boundary, then I would put in one
> conditional to switch between the old do_csum and the new do_csum
> (call it do_fast_csum) based on buffer crossing page boundary-- so the
> cost of needing to deal with alignment in the common case is at most a
> single conditional (1 ns).

The part I would be interested in seeing is how much we gain/lose by
dealing with the alignment.  Based on the 10ns value I would expect
that we would see a 2% regression per 4K for dealing with an unaligned
page.  So the question is what we gain by taking the 2% hit.  I agree
that the 2% hit won't occur often as long as we are only doing
headers, but we can't be myopic about how this code is to be used.
The fact is there are plenty of other paths that use it as well, and
it doesn't do us much good if we make the slow path that much slower
to accelerate the fast path by 1 or 2 ns.

By any chance could you email me the code you are using to test this?
I'm assuming you are probably testing this in user-space with some
rdtsc type calls thrown in.  I want to try applying a few patches and
see what the gain/loss is for aligning the code versus not bothering
to align it.
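
(i.e. presumably something along the lines of this user-space harness,
where csum_partial() is a copy of the routine under test compiled into
the program; a hypothetical sketch, not the actual ./csum tool:)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* user-space copy of the kernel routine under test (assumed) */
extern uint32_t csum_partial(const void *buff, int len, uint32_t sum);

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	static unsigned char buf[2048];
	volatile uint32_t sink = 0;
	const int len = 100, align = 2;
	const long iters = 100000000;
	uint64_t start, end;
	long i;

	memset(buf, 0xa5, sizeof(buf));

	start = rdtsc();
	for (i = 0; i < iters; i++)
		sink += csum_partial(buf + align, len, 0);
	end = rdtsc();

	printf("avg cycles/call: %.2f\n", (double)(end - start) / iters);
	return 0;
}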

>>>     40/0      11.8     6.5     45%    (Ipv6 hdr cmn case)
>>
>> Common case for transmit maybe.  For receive on x86 this isn't IP
>> aligned so the offset would be 6.
>
> With offset 6 gives about the same results.

Yeah, I kind of figured that.  Just stating that the common alignment is 6.

>>
>>>     8/4       8.1      3.2     60%    (UDP, VXLAN in IPv4)
>>
>> How likely is the 8/4 case in reality?  I ask because you have a
>> special handler for the case and as such that extra bit of code is
>> going to cost you one cycle or more in all other cases.  As far as the
>> checksum for VXLAN I would think you are probably looking at something
>> more like 20/4 because you would likely be pulling the UDP, VXLAN, and
>> inner Ethernet header to get to the inner IP header.  The only case I
>> can think of where we might be working on 8 bytes would be something
>> like L3 encapsulated inside of GRE.
>>
>>>     14/0      8.9      6.3     29%    (Eth hdr)
>>>     14/4      9.5      6.3     33%    (Eth hdr in IPv4)
>>>     14/3      9.6      6.3     34%    (Eth with odd align)
>>>     20/0      9.1      6.8     25%    (IP hdr without options)
>>>     7/1       9.1      3.9     57%    (buffer in one quad)
>>>     100/0    17.4     13.6     21%    (medium-sized pkt)
>>>     100/2    17.7     13.5     24%    (medium-sized pkt w/ alignment)
>>>
>>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Same for all these.  The alignment values to test will likely be 0, 2,
4, or 6.  I hadn't thought about it before but odd values will likely
be less than 1% of what we actually see.  We probably don't need to
worry about odd align all that much since it should be extremely rare.



>>
>>> Also tested on these with similar results:
>>>
>>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>> Intel(R) Atom(TM) CPU N450   @ 1.66GHz
>>>
>>> Branch  prediction:
>>>
>>> To test the effects of poor branch prediction in the jump tables I
>>> tested checksum performance with runs for two combinations of length
>>> and alignment. As the baseline I performed the test by doing half of
>>> calls with the first combination, followed by using the second
>>> combination for the second half. In the test case, I interleave the
>>> two combinations so that in every call the length and alignment are
>>> different to defeat the effects of branch prediction. Running several
>>> cases, I did not see any material performance difference between the
>>> two scenarios (perf stat output is below), neither does either case
>>> show a significant number of branch misses.
>>>
>>> Interleave lengths case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>>     ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>>>
>>>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>>
>>>      9,556,693,202      instructions               ( +-  0.00% )
>>>      1,176,208,640       branches                                                     ( +-  0.00% )
>>>             19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )
>>>
>>>        2.049732539 seconds time elapsed
>>>
>>>     Non-interleave case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>>      ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>>>
>>> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>>
>>>      9,782,188,310      instructions               ( +-  0.00% )
>>>      1,251,286,958       branches                                                     ( +-  0.01% )
>>>             18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )
>>>
>>>        2.271789046 seconds time elapsed
>>>
>>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>>> ---
>>>  arch/x86/include/asm/checksum_64.h |  21 +++++
>>>  arch/x86/lib/csum-partial_64.c     | 171 ++++++++++++++++++-------------------
>>>  2 files changed, 102 insertions(+), 90 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
>>> index cd00e17..1224f7d 100644
>>> --- a/arch/x86/include/asm/checksum_64.h
>>> +++ b/arch/x86/include/asm/checksum_64.h
>>> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>>>         return a;
>>>  }
>>>
>>> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
>>> +{
>>> +       asm("addq %2,%0\n\t"
>>> +           "adcq $0,%0"
>>> +           : "=r" (a)
>>> +           : "0" (a), "rm" (b));
>>> +       return a;
>>> +}
>>> +
>>
>> You can probably just convert this and the add32_with_carry over to
>> the +r approach instead of using "0".
>>
>>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
>>> +                                            unsigned int c)
>>> +{
>>> +       asm("addl %1,%0\n\t"
>>> +           "adcl %2,%0\n\t"
>>> +           "adcl $0,%0"
>>> +           : "+r" (a)
>>> +           : "rm" (b), "rm" (c));
>>> +
>>> +       return a;
>>> +}
>>> +
>>>  #define HAVE_ARCH_CSUM_ADD
>>>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>>>  {
>>> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>>> index 9845371..7f1f60f 100644
>>> --- a/arch/x86/lib/csum-partial_64.c
>>> +++ b/arch/x86/lib/csum-partial_64.c
>>> @@ -8,6 +8,7 @@
>>>  #include <linux/compiler.h>
>>>  #include <linux/module.h>
>>>  #include <asm/checksum.h>
>>> +#include <asm/word-at-a-time.h>
>>>
>>>  static inline unsigned short from32to16(unsigned a)
>>>  {
>>> @@ -21,99 +22,78 @@ static inline unsigned short from32to16(unsigned a)
>>>
>>>  /*
>>>   * Do a 64-bit checksum on an arbitrary memory area.
>>> - * Returns a 32bit checksum.
>>> + * Returns a 64bit checksum.
>>>   *
>>> - * This isn't as time critical as it used to be because many NICs
>>> - * do hardware checksumming these days.
>>> - *
>>> - * Things tried and found to not make it faster:
>>> - * Manual Prefetching
>>> - * Unrolling to an 128 bytes inner loop.
>>> - * Using interleaving with more registers to break the carry chains.
>>> + * This is optimized for small lengths such as might be common when pulling
>>> + * up checksums over protocol headers to handle CHECKSUM_COMPLETE (e.g.
>>> + * checksum over 40 bytes will be quite common for pulling up checksum over
>>> + * IPv6 headers).
>>>   */
>>> -static unsigned do_csum(const unsigned char *buff, unsigned len)
>>> +static unsigned long do_csum(const void *buff, int len)
>>>  {
>>> -       unsigned odd, count;
>>>         unsigned long result = 0;
>>>
>>> -       if (unlikely(len == 0))
>>> -               return result;
>>> -       odd = 1 & (unsigned long) buff;
>>> -       if (unlikely(odd)) {
>>> -               result = *buff << 8;
>>> -               len--;
>>> -               buff++;
>>> +       /* Check for less than a quad to sum */
>>> +       if (len < 8) {
>>> +               unsigned long val = load_unaligned_zeropad(buff);
>>> +
>>> +               return (val & ((1ul << len * 8) - 1));
>>> +       }
>>> +
>>> +       /* Main loop using 64byte blocks */
>>> +       for (; len > 64; len -= 64, buff += 64) {
>>> +               asm("addq 0*8(%[src]),%[res]\n\t"
>>> +                   "adcq 1*8(%[src]),%[res]\n\t"
>>> +                   "adcq 2*8(%[src]),%[res]\n\t"
>>> +                   "adcq 3*8(%[src]),%[res]\n\t"
>>> +                   "adcq 4*8(%[src]),%[res]\n\t"
>>> +                   "adcq 5*8(%[src]),%[res]\n\t"
>>> +                   "adcq 6*8(%[src]),%[res]\n\t"
>>> +                   "adcq 7*8(%[src]),%[res]\n\t"
>>> +                   "adcq $0,%[res]"
>>> +                   : [res] "=r" (result)
>>
>> The +r would probably work here to just to be consistent.
>>
>>> +                   : [src] "r" (buff),
>>> +                   "[res]" (result));
>>>         }
>>> -       count = len >> 1;               /* nr of 16-bit words.. */
>>> -       if (count) {
>>> -               if (2 & (unsigned long) buff) {
>>> -                       result += *(unsigned short *)buff;
>>> -                       count--;
>>> -                       len -= 2;
>>> -                       buff += 2;
>>> -               }
>>> -               count >>= 1;            /* nr of 32-bit words.. */
>>> -               if (count) {
>>> -                       unsigned long zero;
>>> -                       unsigned count64;
>>> -                       if (4 & (unsigned long) buff) {
>>> -                               result += *(unsigned int *) buff;
>>> -                               count--;
>>> -                               len -= 4;
>>> -                               buff += 4;
>>> -                       }
>>> -                       count >>= 1;    /* nr of 64-bit words.. */
>>>
>>> -                       /* main loop using 64byte blocks */
>>> -                       zero = 0;
>>> -                       count64 = count >> 3;
>>> -                       while (count64) {
>>> -                               asm("addq 0*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 1*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 2*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 3*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 4*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 5*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 6*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 7*8(%[src]),%[res]\n\t"
>>> -                                   "adcq %[zero],%[res]"
>>> -                                   : [res] "=r" (result)
>>> -                                   : [src] "r" (buff), [zero] "r" (zero),
>>> -                                   "[res]" (result));
>>> -                               buff += 64;
>>> -                               count64--;
>>> -                       }
>>> +       /*
>>> +        * Sum over remaining quads (<= 8 of them). This uses a jump table
>>> +        * based on number of quads to sum. The jump assumes that each case
>>> +        * is 4 bytes. Each adcq instruction is 4 bytes except for adcq 0()
>>> +        * which is 3 bytes, so a nop instruction is inserted to make that case
>>> +        * 4 bytes.
>>> +        */
>>> +       asm("lea 0f(, %[slen], 4), %%r11\n\t"
>>> +           "clc\n\t"
>>> +           "jmpq *%%r11\n\t"
>>> +           "adcq 7*8(%[src]),%[res]\n\t"
>>> +           "adcq 6*8(%[src]),%[res]\n\t"
>>> +           "adcq 5*8(%[src]),%[res]\n\t"
>>> +           "adcq 4*8(%[src]),%[res]\n\t"
>>> +           "adcq 3*8(%[src]),%[res]\n\t"
>>> +           "adcq 2*8(%[src]),%[res]\n\t"
>>> +           "adcq 1*8(%[src]),%[res]\n\t"
>>> +           "adcq 0*8(%[src]),%[res]\n\t"
>>> +           "nop\n\t"
>>> +           "0: adcq $0,%[res]"
>>> +                   : [res] "=r" (result)
>>
>> Same comment about +r here.
>>
>>> +                   : [src] "r" (buff),
>>> +                     [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
>>> +                   : "r11");
>>>
>>> -                       /* last up to 7 8byte blocks */
>>> -                       count %= 8;
>>> -                       while (count) {
>>> -                               asm("addq %1,%0\n\t"
>>> -                                   "adcq %2,%0\n"
>>> -                                           : "=r" (result)
>>> -                                   : "m" (*(unsigned long *)buff),
>>> -                                   "r" (zero),  "0" (result));
>>> -                               --count;
>>> -                                       buff += 8;
>>> -                       }
>>> -                       result = add32_with_carry(result>>32,
>>> -                                                 result&0xffffffff);
>>> +       /* Sum over any remaining bytes (< 8 of them) */
>>> +       if (len & 0x7) {
>>> +               unsigned long val;
>>> +               /*
>>> +                * Since "len" is > 8 here we backtrack in the buffer to load
>>> +                * the outstanding bytes into the low order bytes of a quad and
>>> +                * then shift to extract the relevant bytes. By doing this we
>>> +                * avoid additional calls to load_unaligned_zeropad.
>>> +                */
>>>
>>> -                       if (len & 4) {
>>> -                               result += *(unsigned int *) buff;
>>> -                               buff += 4;
>>> -                       }
>>> -               }
>>> -               if (len & 2) {
>>> -                       result += *(unsigned short *) buff;
>>> -                       buff += 2;
>>> -               }
>>> -       }
>>> -       if (len & 1)
>>> -               result += *buff;
>>> -       result = add32_with_carry(result>>32, result & 0xffffffff);
>>> -       if (unlikely(odd)) {
>>> -               result = from32to16(result);
>>> -               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
>>> +               val = *(unsigned long *)(buff + len - 8);
>>> +               val >>= 8 * (-len & 0x7);
>>> +               result = add64_with_carry(val, result);
>>>         }
>>>         return result;
>>>  }
>>> @@ -125,15 +105,26 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>>>   * returns a 32-bit number suitable for feeding into itself
>>>   * or csum_tcpudp_magic
>>>   *
>>> - * this function must be called with even lengths, except
>>> - * for the last fragment, which may be odd
>>> - *
>>> - * it's best to have buff aligned on a 64-bit boundary
>>> + * Note that this implementation makes no attempt to avoid unaligned accesses
>>> + * (e.g. load a quad word with non 8-byte alignment). On x86 unaligned accesses
>>> + * only seem to be a performance penalty when the access crosses a page
>>> + * boundary-- such a scenario should be an extremely rare occurrence for use
>>> + * cases of csum_partial.
>>>   */
>>>  __wsum csum_partial(const void *buff, int len, __wsum sum)
>>>  {
>>> -       return (__force __wsum)add32_with_carry(do_csum(buff, len),
>>> -                                               (__force u32)sum);
>>> +       if (len == 8) {
>>> +               /* Optimize trivial case */
>>> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
>>> +                                                *(unsigned int *)buff,
>>> +                                                *(unsigned int *)(buff + 4));
>>> +       } else {
>>> +               unsigned long result = do_csum(buff, len);
>>> +
>>> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
>>> +                                                result >> 32,
>>> +                                                result & 0xffffffff);
>>> +       }
>>>  }
>>>
>>>  /*
>>> --
>>> 2.6.5
>>>
David Laight March 3, 2016, 4:12 p.m. UTC | #5
From: Tom Herbert
> Sent: 02 March 2016 22:19
...
> +	/* Main loop using 64byte blocks */
> +	for (; len > 64; len -= 64, buff += 64) {
> +		asm("addq 0*8(%[src]),%[res]\n\t"
> +		    "adcq 1*8(%[src]),%[res]\n\t"
> +		    "adcq 2*8(%[src]),%[res]\n\t"
> +		    "adcq 3*8(%[src]),%[res]\n\t"
> +		    "adcq 4*8(%[src]),%[res]\n\t"
> +		    "adcq 5*8(%[src]),%[res]\n\t"
> +		    "adcq 6*8(%[src]),%[res]\n\t"
> +		    "adcq 7*8(%[src]),%[res]\n\t"
> +		    "adcq $0,%[res]"
> +		    : [res] "=r" (result)
> +		    : [src] "r" (buff),
> +		    "[res]" (result));

Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
without any unrolling?

...
> +	/* Sum over any remaining bytes (< 8 of them) */
> +	if (len & 0x7) {
> +		unsigned long val;
> +		/*
> +		 * Since "len" is > 8 here we backtrack in the buffer to load
> +		 * the outstanding bytes into the low order bytes of a quad and
> +		 * then shift to extract the relevant bytes. By doing this we
> +		 * avoid additional calls to load_unaligned_zeropad.

That comment is wrong. Maybe:
		 * Read the last 8 bytes of the buffer then shift to extract
		 * the required bytes.
		 * This is safe because the original length was > 8 and avoids
		 * any problems reading beyond the end of the valid data.

	David
Linus Torvalds March 3, 2016, 6:43 p.m. UTC | #6
On Thu, Mar 3, 2016 at 8:12 AM, David Laight <David.Laight@aculab.com> wrote:
>
> Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
> without any unrolling?

Is that actually supposed to work ok these days? jcxz used to be quite
slow, and is historically *never* used.

Now, in theory, loop constructs can actually do better on branch
prediction etc, but Intel seems to have never really tried to push
them, and has instead pretty much discouraged them in favor of making
the normal jumps go faster (including all the instruction fusion etc)

From what I have seen, the whole "don't use LOOP or JRCXZ" has not changed.

           Linus
David Laight March 4, 2016, 10:38 a.m. UTC | #7
From: Linus Torvalds
> Sent: 03 March 2016 18:44
>
> On Thu, Mar 3, 2016 at 8:12 AM, David Laight <David.Laight@aculab.com> wrote:
> >
> > Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
> > without any unrolling?
>
> Is that actually supposed to work ok these days? jcxz used to be quite
> slow, and is historically *never* used.
>
> Now, in theory, loop constructs can actually do better on branch
> prediction etc, but Intel seems to have never really tried to push
> them, and has instead pretty much discouraged them in favor of making
> the normal jumps go faster (including all the instruction fusion etc)

Yes, they've even added the 'adc using the overflow flag' but not made
it possible to put that into a loop.

> From what I have seen, the whole "don't use LOOP or JRCXZ" has not changed.

LOOP is still slow on intel cpus (but is single clock on recentish amd ones).

JCXZ is reasonable on most cpus, certainly all the ones we care about these days.
On intel cpus JCXZ is still 2 clocks, but it removes the dependency on any
flags (which all other conditional instructions have).
The difficulty is using it for a loop (you need JCXNZ or a fast LOOP).
An alternative to the 'JCXZ, JMPS' pair would be to move the high bits
of the counter into the low bits of cx so that cx would become non-zero
on the last iteration.

	David
Alexander H Duyck March 5, 2016, 12:15 a.m. UTC | #8
On Fri, Mar 4, 2016 at 2:38 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Linus Torvalds
>> Sent: 03 March 2016 18:44
>>
>> On Thu, Mar 3, 2016 at 8:12 AM, David Laight <David.Laight@aculab.com> wrote:
>> >
>> > Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
>> > without any unrolling?
>>
>> Is that actually supposed to work ok these days? jcxz used to be quite
>> slow, and is historically *never* used.
>>
>> Now, in theory, loop constructs can actually do better on branch
>> prediction etc, but Intel seems to have never really tried to push
>> them, and has instead pretty much discouraged them in favor of making
>> the normal jumps go faster (including all the instruction fusion etc)
>
> Yes, they've even added the 'adc using the overflow flag' but not made
> it possible to put that into a loop.
>
>> From what I have seen, the whole "don't use LOOP or JRCXZ" has not changed.
>
> LOOP is still slow on intel cpus (but is single clock on recentish amd ones).
>
> JCXZ is reasonable on most cpus, certainly all the ones we care about these days.
> On intel cpus JCXZ is still 2 clocks, but it removes the dependency on any
> flags (which all other conditional instructions have).
> The difficulty is using it for a loop (you need JCXNZ or a fast LOOP).
> An alternative to the 'JCXZ, JMPS' pair would be to move the high bits
> of the counter into the low bits of cx so that cx would become non-zero
> on the last iteration.

Actually probably the easiest way to go on x86 is to just replace the
use of len with (len >> 6) and use decl or incl instead of addl or
subl, and lea instead of addq for the buff address.  None of those
instructions affect the carry flag, which is how such loops were
intended to be implemented.

I've been doing a bit of testing and that seems to work without
needing the adcq until after you exit the loop, but doesn't give that
much of a gain in speed for dropping the instruction from the
hot-path.  I suspect we are probably memory bottle-necked already in
the loop so dropping an instruction or two doesn't gain you much.
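
A minimal sketch of that loop shape, assuming at least one full 64-byte
block and a separate tail path; the helper name is made up here for
illustration and is not part of the patch:

static unsigned long csum_64byte_blocks(const void *buff, unsigned int blocks,
					unsigned long result)
{
	/* blocks must be >= 1; the caller handles any remaining tail bytes */
	asm("clc\n\t"
	    "0:\n\t"
	    "adcq 0*8(%[src]),%[res]\n\t"
	    "adcq 1*8(%[src]),%[res]\n\t"
	    "adcq 2*8(%[src]),%[res]\n\t"
	    "adcq 3*8(%[src]),%[res]\n\t"
	    "adcq 4*8(%[src]),%[res]\n\t"
	    "adcq 5*8(%[src]),%[res]\n\t"
	    "adcq 6*8(%[src]),%[res]\n\t"
	    "adcq 7*8(%[src]),%[res]\n\t"
	    "leaq 64(%[src]),%[src]\n\t"	/* leaq advances src without touching flags */
	    "decl %[cnt]\n\t"			/* decl sets ZF but preserves CF */
	    "jnz 0b\n\t"
	    "adcq $0,%[res]"			/* fold the final carry once, after the loop */
	    : [res] "+r" (result), [src] "+r" (buff), [cnt] "+r" (blocks)
	    :
	    : "cc", "memory");
	return result;
}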

- Alex
David Laight March 7, 2016, 1:56 p.m. UTC | #9
From: Alexander Duyck
 ...
> Actually probably the easiest way to go on x86 is to just replace the
> use of len with (len >> 6) and use decl or incl instead of addl or
> subl, and lea instead of addq for the buff address.  None of those
> instructions effect the carry flag as this is how such loops were
> intended to be implemented.
>
> I've been doing a bit of testing and that seems to work without
> needing the adcq until after you exit the loop, but doesn't give that
> much of a gain in speed for dropping the instruction from the
> hot-path.  I suspect we are probably memory bottle-necked already in
> the loop so dropping an instruction or two doesn't gain you much.

Right, any superscalar architecture gives you some instructions
'for free' if they can execute at the same time as those on the
critical path (in this case the memory reads and the adc).
This is why loop unrolling can be pointless.

So the loop:
10:	adcq (%rdx,%rcx,8),%rax
	inc %rcx
	jnz 10b
could easily be as fast as anything that doesn't use the 'new'
instructions that use the overflow flag.
That loop might be measurably faster for aligned buffers.

	David
Tom Herbert March 7, 2016, 5:33 p.m. UTC | #10
On Mon, Mar 7, 2016 at 5:56 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Alexander Duyck
>  ...
>> Actually probably the easiest way to go on x86 is to just replace the
>> use of len with (len >> 6) and use decl or incl instead of addl or
>> subl, and lea instead of addq for the buff address.  None of those
>> instructions effect the carry flag as this is how such loops were
>> intended to be implemented.
>>
>> I've been doing a bit of testing and that seems to work without
>> needing the adcq until after you exit the loop, but doesn't give that
>> much of a gain in speed for dropping the instruction from the
>> hot-path.  I suspect we are probably memory bottle-necked already in
>> the loop so dropping an instruction or two doesn't gain you much.
>
> Right, any superscalar architecture gives you some instructions
> 'for free' if they can execute at the same time as those on the
> critical path (in this case the memory reads and the adc).
> This is why loop unrolling can be pointless.
>
> So the loop:
> 10:     addc %rax,(%rdx,%rcx,8)
>         inc %rcx
>         jnz 10b
> could easily be as fast as anything that doesn't use the 'new'
> instructions that use the overflow flag.
> That loop might be measurable faster for aligned buffers.

Tested by replacing the unrolled loop in my patch with just:

if (len >= 8) {
                asm("clc\n\t"
                    "0: adcq (%[src],%%rcx,8),%[res]\n\t"
                    "decl %%ecx\n\t"
                    "jge 0b\n\t"
                    "adcq $0, %[res]\n\t"
                            : [res] "=r" (result)
                            : [src] "r" (buff), "[res]" (result), "c"
((len >> 3) - 1));
}

This seems to be significantly slower:

1400 bytes: 797 nsecs vs. 202 nsecs
40 bytes: 6.5 nsecs vs. 26.8 nsecs

Tom

>
>         David
>
Alexander H Duyck March 7, 2016, 11:52 p.m. UTC | #11
On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <David.Laight@aculab.com> wrote:
>> From: Alexander Duyck
>>  ...
>>> Actually probably the easiest way to go on x86 is to just replace the
>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>> subl, and lea instead of addq for the buff address.  None of those
>>> instructions effect the carry flag as this is how such loops were
>>> intended to be implemented.
>>>
>>> I've been doing a bit of testing and that seems to work without
>>> needing the adcq until after you exit the loop, but doesn't give that
>>> much of a gain in speed for dropping the instruction from the
>>> hot-path.  I suspect we are probably memory bottle-necked already in
>>> the loop so dropping an instruction or two doesn't gain you much.
>>
>> Right, any superscalar architecture gives you some instructions
>> 'for free' if they can execute at the same time as those on the
>> critical path (in this case the memory reads and the adc).
>> This is why loop unrolling can be pointless.
>>
>> So the loop:
>> 10:     addc %rax,(%rdx,%rcx,8)
>>         inc %rcx
>>         jnz 10b
>> could easily be as fast as anything that doesn't use the 'new'
>> instructions that use the overflow flag.
>> That loop might be measurable faster for aligned buffers.
>
> Tested by replacing the unrolled loop in my patch with just:
>
> if (len >= 8) {
>                 asm("clc\n\t"
>                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>                     "decl %%ecx\n\t"
>                     "jge 0b\n\t"
>                     "adcq $0, %[res]\n\t"
>                             : [res] "=r" (result)
>                             : [src] "r" (buff), "[res]" (result), "c"
> ((len >> 3) - 1));
> }
>
> This seems to be significantly slower:
>
> 1400 bytes: 797 nsecs vs. 202 nsecs
> 40 bytes: 6.5 nsecs vs. 26.8 nsecs

You still need the loop unrolling as the decl and jge have some
overhead.  You can't just get rid of it with a single call in a tight
loop but it should improve things.  The gain from what I have seen
ends up being minimal though.  I haven't really noticed all that much
in my tests anyway.

I have been doing some testing and the penalty for an unaligned
checksum can get pretty big if the data-set is big enough.  I was
messing around and tried doing a checksum over 32K minus some offset
and was seeing a penalty of about 200 cycles per 64K frame.

One thought I had is that we may want to look into making an inline
function that we can call for compile-time defined lengths less than
64.  Maybe call it something like __csum_partial and we could then use
that in place of csum_partial for all those headers that are a fixed
length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
we might be able to look at taking care of alignment for csum_partial
which will improve the skb_checksum() case without impacting the
header pulling cases as much since that code would be inlined
elsewhere.
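
A rough sketch of that idea, using __csum_partial as a placeholder name
and assuming the existing add32_with_carry() helper from checksum_64.h;
with a compile-time-constant length that is a multiple of 4 the compiler
can fully unroll the loop, and anything else (including the 14-byte
Ethernet header, which would need an extra 2-byte step) falls back to
the out-of-line csum_partial():

static __always_inline __wsum __csum_partial(const void *buff, int len,
					     __wsum sum)
{
	if (__builtin_constant_p(len) && len < 64 && (len & 3) == 0) {
		unsigned int result = (__force unsigned int)sum;
		int i;

		/* Constant trip count, so this unrolls at compile time */
		for (i = 0; i < len; i += 4)
			result = add32_with_carry(result,
						  *(const unsigned int *)(buff + i));
		return (__force __wsum)result;
	}
	return csum_partial(buff, len, sum);
}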

- Alex
Tom Herbert March 8, 2016, 12:07 a.m. UTC | #12
On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <tom@herbertland.com> wrote:
>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <David.Laight@aculab.com> wrote:
>>> From: Alexander Duyck
>>>  ...
>>>> Actually probably the easiest way to go on x86 is to just replace the
>>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>>> subl, and lea instead of addq for the buff address.  None of those
>>>> instructions effect the carry flag as this is how such loops were
>>>> intended to be implemented.
>>>>
>>>> I've been doing a bit of testing and that seems to work without
>>>> needing the adcq until after you exit the loop, but doesn't give that
>>>> much of a gain in speed for dropping the instruction from the
>>>> hot-path.  I suspect we are probably memory bottle-necked already in
>>>> the loop so dropping an instruction or two doesn't gain you much.
>>>
>>> Right, any superscalar architecture gives you some instructions
>>> 'for free' if they can execute at the same time as those on the
>>> critical path (in this case the memory reads and the adc).
>>> This is why loop unrolling can be pointless.
>>>
>>> So the loop:
>>> 10:     addc %rax,(%rdx,%rcx,8)
>>>         inc %rcx
>>>         jnz 10b
>>> could easily be as fast as anything that doesn't use the 'new'
>>> instructions that use the overflow flag.
>>> That loop might be measurable faster for aligned buffers.
>>
>> Tested by replacing the unrolled loop in my patch with just:
>>
>> if (len >= 8) {
>>                 asm("clc\n\t"
>>                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>>                     "decl %%ecx\n\t"
>>                     "jge 0b\n\t"
>>                     "adcq $0, %[res]\n\t"
>>                             : [res] "=r" (result)
>>                             : [src] "r" (buff), "[res]" (result), "c"
>> ((len >> 3) - 1));
>> }
>>
>> This seems to be significantly slower:
>>
>> 1400 bytes: 797 nsecs vs. 202 nsecs
>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>
> You still need the loop unrolling as the decl and jge have some
> overhead.  You can't just get rid of it with a single call in a tight
> loop but it should improve things.  The gain from what I have seen
> ends up being minimal though.  I haven't really noticed all that much
> in my tests anyway.
>
> I have been doing some testing and the penalty for an unaligned
> checksum can get pretty big if the data-set is big enough.  I was
> messing around and tried doing a checksum over 32K minus some offset
> and was seeing a penalty of about 200 cycles per 64K frame.
>
Out of how many cycles to checksum 64K though?

> One thought I had is that we may want to look into making an inline
> function that we can call for compile-time defined lengths less than
> 64.  Maybe call it something like __csum_partial and we could then use
> that in place of csum_partial for all those headers that are a fixed
> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
> we might be able to look at taking care of alignment for csum_partial
> which will improve the skb_checksum() case without impacting the
> header pulling cases as much since that code would be inlined
> elsewhere.
>
As I said previously, if alignment really is a factor then we can
check up front if a buffer crosses a page boundary and call the slow
path function (original code). I'm seeing a 1 nsec hit to add this
check.
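
A sketch of what that up-front test could look like (helper name made
up, PAGE_MASK from asm/page.h; assumes len >= 1):

/* True if [buff, buff + len) straddles a page boundary */
static inline bool csum_crosses_page(const void *buff, int len)
{
	unsigned long first = (unsigned long)buff;
	unsigned long last = first + len - 1;

	return ((first ^ last) & PAGE_MASK) != 0;
}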

Tom

> - Alex
Alexander H Duyck March 8, 2016, 12:49 a.m. UTC | #13
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <tom@herbertland.com> wrote:
>>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <David.Laight@aculab.com> wrote:
>>>> From: Alexander Duyck
>>>>  ...
>>>>> Actually probably the easiest way to go on x86 is to just replace the
>>>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>>>> subl, and lea instead of addq for the buff address.  None of those
>>>>> instructions effect the carry flag as this is how such loops were
>>>>> intended to be implemented.
>>>>>
>>>>> I've been doing a bit of testing and that seems to work without
>>>>> needing the adcq until after you exit the loop, but doesn't give that
>>>>> much of a gain in speed for dropping the instruction from the
>>>>> hot-path.  I suspect we are probably memory bottle-necked already in
>>>>> the loop so dropping an instruction or two doesn't gain you much.
>>>>
>>>> Right, any superscalar architecture gives you some instructions
>>>> 'for free' if they can execute at the same time as those on the
>>>> critical path (in this case the memory reads and the adc).
>>>> This is why loop unrolling can be pointless.
>>>>
>>>> So the loop:
>>>> 10:     addc %rax,(%rdx,%rcx,8)
>>>>         inc %rcx
>>>>         jnz 10b
>>>> could easily be as fast as anything that doesn't use the 'new'
>>>> instructions that use the overflow flag.
>>>> That loop might be measurable faster for aligned buffers.
>>>
>>> Tested by replacing the unrolled loop in my patch with just:
>>>
>>> if (len >= 8) {
>>>                 asm("clc\n\t"
>>>                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>>>                     "decl %%ecx\n\t"
>>>                     "jge 0b\n\t"
>>>                     "adcq $0, %[res]\n\t"
>>>                             : [res] "=r" (result)
>>>                             : [src] "r" (buff), "[res]" (result), "c"
>>> ((len >> 3) - 1));
>>> }
>>>
>>> This seems to be significantly slower:
>>>
>>> 1400 bytes: 797 nsecs vs. 202 nsecs
>>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>>
>> You still need the loop unrolling as the decl and jge have some
>> overhead.  You can't just get rid of it with a single call in a tight
>> loop but it should improve things.  The gain from what I have seen
>> ends up being minimal though.  I haven't really noticed all that much
>> in my tests anyway.
>>
>> I have been doing some testing and the penalty for an unaligned
>> checksum can get pretty big if the data-set is big enough.  I was
>> messing around and tried doing a checksum over 32K minus some offset
>> and was seeing a penalty of about 200 cycles per 64K frame.
>>
> Out of how many cycles to checksum 64K though?

So the clock cycle counts I am seeing are ~16660 for unaligned vs 16416
for aligned.  So yeah the effect is only about a 1.5% penalty on the
total time.

>> One thought I had is that we may want to look into making an inline
>> function that we can call for compile-time defined lengths less than
>> 64.  Maybe call it something like __csum_partial and we could then use
>> that in place of csum_partial for all those headers that are a fixed
>> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
>> we might be able to look at taking care of alignment for csum_partial
>> which will improve the skb_checksum() case without impacting the
>> header pulling cases as much since that code would be inlined
>> elsewhere.
>>
> As I said previously, if alignment really is a factor then we can
> check up front if a buffer crosses a page boundary and call the slow
> path function (original code). I'm seeing a 1 nsec hit to add this
> check.

Well I was just noticing there are a number of places we can get an
even bigger benefit if we just bypass the need for csum_partial
entirely.  For example the DSA code is calling csum_partial to extract
2 bytes.  Same thing for protocols such as VXLAN and the like.  If we
could catch cases like these with a __builtin_constant_p check then we
might be able to save some significant CPU time by avoiding the
function call entirely and just doing some inline addition on the
input values directly.

- Alex
Tom Herbert March 8, 2016, 1:03 a.m. UTC | #14
On Mon, Mar 7, 2016 at 4:49 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <tom@herbertland.com> wrote:
>>>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <David.Laight@aculab.com> wrote:
>>>>> From: Alexander Duyck
>>>>>  ...
>>>>>> Actually probably the easiest way to go on x86 is to just replace the
>>>>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>>>>> subl, and lea instead of addq for the buff address.  None of those
>>>>>> instructions effect the carry flag as this is how such loops were
>>>>>> intended to be implemented.
>>>>>>
>>>>>> I've been doing a bit of testing and that seems to work without
>>>>>> needing the adcq until after you exit the loop, but doesn't give that
>>>>>> much of a gain in speed for dropping the instruction from the
>>>>>> hot-path.  I suspect we are probably memory bottle-necked already in
>>>>>> the loop so dropping an instruction or two doesn't gain you much.
>>>>>
>>>>> Right, any superscalar architecture gives you some instructions
>>>>> 'for free' if they can execute at the same time as those on the
>>>>> critical path (in this case the memory reads and the adc).
>>>>> This is why loop unrolling can be pointless.
>>>>>
>>>>> So the loop:
>>>>> 10:     addc %rax,(%rdx,%rcx,8)
>>>>>         inc %rcx
>>>>>         jnz 10b
>>>>> could easily be as fast as anything that doesn't use the 'new'
>>>>> instructions that use the overflow flag.
>>>>> That loop might be measurable faster for aligned buffers.
>>>>
>>>> Tested by replacing the unrolled loop in my patch with just:
>>>>
>>>> if (len >= 8) {
>>>>                 asm("clc\n\t"
>>>>                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>>>>                     "decl %%ecx\n\t"
>>>>                     "jge 0b\n\t"
>>>>                     "adcq $0, %[res]\n\t"
>>>>                             : [res] "=r" (result)
>>>>                             : [src] "r" (buff), "[res]" (result), "c"
>>>> ((len >> 3) - 1));
>>>> }
>>>>
>>>> This seems to be significantly slower:
>>>>
>>>> 1400 bytes: 797 nsecs vs. 202 nsecs
>>>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>>>
>>> You still need the loop unrolling as the decl and jge have some
>>> overhead.  You can't just get rid of it with a single call in a tight
>>> loop but it should improve things.  The gain from what I have seen
>>> ends up being minimal though.  I haven't really noticed all that much
>>> in my tests anyway.
>>>
>>> I have been doing some testing and the penalty for an unaligned
>>> checksum can get pretty big if the data-set is big enough.  I was
>>> messing around and tried doing a checksum over 32K minus some offset
>>> and was seeing a penalty of about 200 cycles per 64K frame.
>>>
>> Out of how many cycles to checksum 64K though?
>
> So the clock cycles I am seeing is ~16660 for unaligned vs 16416
> aligned.  So yeah the effect is only a 1.5% penalty for the total
> time.
>
>>> One thought I had is that we may want to look into making an inline
>>> function that we can call for compile-time defined lengths less than
>>> 64.  Maybe call it something like __csum_partial and we could then use
>>> that in place of csum_partial for all those headers that are a fixed
>>> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
>>> we might be able to look at taking care of alignment for csum_partial
>>> which will improve the skb_checksum() case without impacting the
>>> header pulling cases as much since that code would be inlined
>>> elsewhere.
>>>
>> As I said previously, if alignment really is a factor then we can
>> check up front if a buffer crosses a page boundary and call the slow
>> path function (original code). I'm seeing a 1 nsec hit to add this
>> check.
>
> Well I was just noticing there are a number of places we can get an
> even bigger benefit if we just bypass the need for csum_partial
> entirely.  For example the DSA code is calling csum_partial to extract
> 2 bytes.  Same thing for protocols such as VXLAN and the like.  If we
> could catch cases like these with a __builtin_constant_p check then we
> might be able to save some significant CPU time by avoiding the
> function call entirely and just doing some inline addition on the
> input values directly.
>
Sure, we could inline a switch function for common values (0, 2, 4, 8,
14, 16, 20, 40) maybe.

> - Alex
Linus Torvalds March 8, 2016, 1:39 a.m. UTC | #15
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert <tom@herbertland.com> wrote:
>
> As I said previously, if alignment really is a factor then we can
> check up front if a buffer crosses a page boundary and call the slow
> path function (original code). I'm seeing a 1 nsec hit to add this
> check.

It shouldn't be a factor, and you shouldn't check for it. My code was
self-aligning, and had at most one unaligned access at the beginning
(the data of which was then used to align the rest).

Tom had a version that used that. Although now that I look back at it,
it seems to be broken by some confusion about the one-byte alignment
vs 8-byte alignment.

             Linus
David Laight March 8, 2016, 11:03 a.m. UTC | #16
From: Alexander Duyck
...
> One thought I had is that we may want to look into making an inline
> function that we can call for compile-time defined lengths less than
> 64.  Maybe call it something like __csum_partial and we could then use
> that in place of csum_partial for all those headers that are a fixed
> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
> we might be able to look at taking care of alignment for csum_partial
> which will improve the skb_checksum() case without impacting the
> header pulling cases as much since that code would be inlined
> elsewhere.

I think there are some patches going through the ppc tree to do
exactly that.

	David
David Laight March 8, 2016, 11:11 a.m. UTC | #17
From: Alexander Duyck
...
> >> So the loop:
> >> 10:     addc %rax,(%rdx,%rcx,8)
> >>         inc %rcx
> >>         jnz 10b
> >> could easily be as fast as anything that doesn't use the 'new'
> >> instructions that use the overflow flag.
> >> That loop might be measurable faster for aligned buffers.
> >
> > Tested by replacing the unrolled loop in my patch with just:
> >
> > if (len >= 8) {
> >                 asm("clc\n\t"
> >                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
> >                     "decl %%ecx\n\t"
> >                     "jge 0b\n\t"
> >                     "adcq $0, %[res]\n\t"
> >                             : [res] "=r" (result)
> >                             : [src] "r" (buff), "[res]" (result), "c"
> > ((len >> 3) - 1));
> > }
> >
> > This seems to be significantly slower:
> >
> > 1400 bytes: 797 nsecs vs. 202 nsecs
> > 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>
> You still need the loop unrolling as the decl and jge have some
> overhead.  You can't just get rid of it with a single call in a tight
> loop but it should improve things.  The gain from what I have seen
> ends up being minimal though.  I haven't really noticed all that much
> in my tests anyway.

The overhead from the jge and decl is probably similar to that of the adc.
The problem is that they can't be executed at the same time because they
both have dependencies on the carry flag.

Tom did some extra tests last night: using a loop construct of 4 instructions
that didn't modify the flags register was twice the speed of the above.
I think there is a 3 instruction loop; add a second adc and it may well
be as fast as your 8-way unrolled version - but it is much simpler to implement.

That loop (using swab and jecxz to loop until the high 32bits of rcx are non-zero)
could also be used with the 'add carry using overflow bit' instructions.

	David
Tom Herbert March 8, 2016, 4:49 p.m. UTC | #18
On Mon, Mar 7, 2016 at 5:39 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert <tom@herbertland.com> wrote:
>>
>> As I said previously, if alignment really is a factor then we can
>> check up front if a buffer crosses a page boundary and call the slow
>> path function (original code). I'm seeing a 1 nsec hit to add this
>> check.
>
> It shouldn't be a factor, and you shouldn't check for it. My code was
> self-aligning, and had at most one unaligned access at the beginnig
> (the data of which was then used to align the rest).
>
Yes, but the logic to do the alignment does not come for free. The
intent of these patches is really to speed up checksums over small
buffers (like the checksum over the IP header or pulling up checksums
over protocol headers for dealing with checksum-complete). For
checksums over larger buffers, e.g. TCP/UDP checksums, we are depending
on checksum offload (there are still some cases where the host will
need to do a packet checksum, but as vendors move to providing
protocol agnostic checksums those should go away). In the VXLAN GRO
path for instance, we do a checksum pull over both the UDP header and
VXLAN header, each of which is 8 bytes. csum_partial can be trivially
implemented for a buffer of length 8 with three adcq instructions (as
in my patch). When we're using VXLAN in IPv4, both the VXLAN and UDP
headers will likely not be eight byte aligned, but alignment seems to
only be an issue when crossing a page boundary. The probability that
an 8 byte header crosses a page boundary is already very low, and with
a little bit of code drivers could pretty much guarantee that packet
headers don't straddle page boundaries. So it seems like the effort to
align small buffers, assuming they don't straddle page boundaries,
provides little or no value.
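
For reference, that trivial 8-byte case is just the len == 8 branch of
csum_partial() in the patch below, shown here as a standalone sketch
(hypothetical helper name) built on the patch's add32_with_carry3():

static inline __wsum csum_8byte_header(const void *buff, __wsum sum)
{
	/* Two 32-bit loads folded into the running sum with carry */
	return (__force __wsum)add32_with_carry3((__force u32)sum,
						 *(const unsigned int *)buff,
						 *(const unsigned int *)(buff + 4));
}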

Tom

> Tom had a version that used that. Although now that I look back at it,
> it seems to be broken by some confusion about the one-byte alignment
> vs 8-byte alignment.
>
>              Linus
diff mbox

Patch

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index cd00e17..1224f7d 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -188,6 +188,27 @@  static inline unsigned add32_with_carry(unsigned a, unsigned b)
 	return a;
 }
 
+static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
+{
+	asm("addq %2,%0\n\t"
+	    "adcq $0,%0"
+	    : "=r" (a)
+	    : "0" (a), "rm" (b));
+	return a;
+}
+
+static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
+					     unsigned int c)
+{
+	asm("addl %1,%0\n\t"
+	    "adcl %2,%0\n\t"
+	    "adcl $0,%0"
+	    : "+r" (a)
+	    : "rm" (b), "rm" (c));
+
+	return a;
+}
+
 #define HAVE_ARCH_CSUM_ADD
 static inline __wsum csum_add(__wsum csum, __wsum addend)
 {
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..7f1f60f 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -8,6 +8,7 @@ 
 #include <linux/compiler.h>
 #include <linux/module.h>
 #include <asm/checksum.h>
+#include <asm/word-at-a-time.h>
 
 static inline unsigned short from32to16(unsigned a) 
 {
@@ -21,99 +22,78 @@  static inline unsigned short from32to16(unsigned a)
 
 /*
  * Do a 64-bit checksum on an arbitrary memory area.
- * Returns a 32bit checksum.
+ * Returns a 64bit checksum.
  *
- * This isn't as time critical as it used to be because many NICs
- * do hardware checksumming these days.
- * 
- * Things tried and found to not make it faster:
- * Manual Prefetching
- * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
+ * This is optimized for small lengths such as might be common when pulling
+ * up checksums over protocol headers to handle CHECKSUM_COMPLETE (e.g.
+ * checksum over 40 bytes will be quite common for pulling up checksum over
+ * IPv6 headers).
  */
-static unsigned do_csum(const unsigned char *buff, unsigned len)
+static unsigned long do_csum(const void *buff, int len)
 {
-	unsigned odd, count;
 	unsigned long result = 0;
 
-	if (unlikely(len == 0))
-		return result; 
-	odd = 1 & (unsigned long) buff;
-	if (unlikely(odd)) {
-		result = *buff << 8;
-		len--;
-		buff++;
+	/* Check for less than a quad to sum */
+	if (len < 8) {
+		unsigned long val = load_unaligned_zeropad(buff);
+
+		return (val & ((1ul << len * 8) - 1));
+	}
+
+	/* Main loop using 64byte blocks */
+	for (; len > 64; len -= 64, buff += 64) {
+		asm("addq 0*8(%[src]),%[res]\n\t"
+		    "adcq 1*8(%[src]),%[res]\n\t"
+		    "adcq 2*8(%[src]),%[res]\n\t"
+		    "adcq 3*8(%[src]),%[res]\n\t"
+		    "adcq 4*8(%[src]),%[res]\n\t"
+		    "adcq 5*8(%[src]),%[res]\n\t"
+		    "adcq 6*8(%[src]),%[res]\n\t"
+		    "adcq 7*8(%[src]),%[res]\n\t"
+		    "adcq $0,%[res]"
+		    : [res] "=r" (result)
+		    : [src] "r" (buff),
+		    "[res]" (result));
 	}
-	count = len >> 1;		/* nr of 16-bit words.. */
-	if (count) {
-		if (2 & (unsigned long) buff) {
-			result += *(unsigned short *)buff;
-			count--;
-			len -= 2;
-			buff += 2;
-		}
-		count >>= 1;		/* nr of 32-bit words.. */
-		if (count) {
-			unsigned long zero;
-			unsigned count64;
-			if (4 & (unsigned long) buff) {
-				result += *(unsigned int *) buff;
-				count--;
-				len -= 4;
-				buff += 4;
-			}
-			count >>= 1;	/* nr of 64-bit words.. */
 
-			/* main loop using 64byte blocks */
-			zero = 0;
-			count64 = count >> 3;
-			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
-				    "adcq 1*8(%[src]),%[res]\n\t"
-				    "adcq 2*8(%[src]),%[res]\n\t"
-				    "adcq 3*8(%[src]),%[res]\n\t"
-				    "adcq 4*8(%[src]),%[res]\n\t"
-				    "adcq 5*8(%[src]),%[res]\n\t"
-				    "adcq 6*8(%[src]),%[res]\n\t"
-				    "adcq 7*8(%[src]),%[res]\n\t"
-				    "adcq %[zero],%[res]"
-				    : [res] "=r" (result)
-				    : [src] "r" (buff), [zero] "r" (zero),
-				    "[res]" (result));
-				buff += 64;
-				count64--;
-			}
+	/*
+	 * Sum over remaining quads (<= 8 of them). This uses a jump table
+	 * based on number of quads to sum. The jump assumes that each case
+	 * is 4 bytes. Each adcq instruction is 4 bytes except for adcq 0()
+	 * which is 3 bytes, so a nop instruction is inserted to make that case
+	 * 4 bytes.
+	 */
+	asm("lea 0f(, %[slen], 4), %%r11\n\t"
+	    "clc\n\t"
+	    "jmpq *%%r11\n\t"
+	    "adcq 7*8(%[src]),%[res]\n\t"
+	    "adcq 6*8(%[src]),%[res]\n\t"
+	    "adcq 5*8(%[src]),%[res]\n\t"
+	    "adcq 4*8(%[src]),%[res]\n\t"
+	    "adcq 3*8(%[src]),%[res]\n\t"
+	    "adcq 2*8(%[src]),%[res]\n\t"
+	    "adcq 1*8(%[src]),%[res]\n\t"
+	    "adcq 0*8(%[src]),%[res]\n\t"
+	    "nop\n\t"
+	    "0: adcq $0,%[res]"
+		    : [res] "=r" (result)
+		    : [src] "r" (buff),
+		      [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
+		    : "r11");
 
-			/* last up to 7 8byte blocks */
-			count %= 8; 
-			while (count) { 
-				asm("addq %1,%0\n\t"
-				    "adcq %2,%0\n" 
-					    : "=r" (result)
-				    : "m" (*(unsigned long *)buff), 
-				    "r" (zero),  "0" (result));
-				--count; 
-					buff += 8;
-			}
-			result = add32_with_carry(result>>32,
-						  result&0xffffffff); 
+	/* Sum over any remaining bytes (< 8 of them) */
+	if (len & 0x7) {
+		unsigned long val;
+		/*
+		 * Since "len" is > 8 here we backtrack in the buffer to load
+		 * the outstanding bytes into the low order bytes of a quad and
+		 * then shift to extract the relevant bytes. By doing this we
+		 * avoid additional calls to load_unaligned_zeropad.
+		 */
 
-			if (len & 4) {
-				result += *(unsigned int *) buff;
-				buff += 4;
-			}
-		}
-		if (len & 2) {
-			result += *(unsigned short *) buff;
-			buff += 2;
-		}
-	}
-	if (len & 1)
-		result += *buff;
-	result = add32_with_carry(result>>32, result & 0xffffffff); 
-	if (unlikely(odd)) { 
-		result = from32to16(result);
-		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
+		val = *(unsigned long *)(buff + len - 8);
+		val >>= 8 * (-len & 0x7);
+		result = add64_with_carry(val, result);
 	}
 	return result;
 }
@@ -125,15 +105,26 @@  static unsigned do_csum(const unsigned char *buff, unsigned len)
  * returns a 32-bit number suitable for feeding into itself
  * or csum_tcpudp_magic
  *
- * this function must be called with even lengths, except
- * for the last fragment, which may be odd
- *
- * it's best to have buff aligned on a 64-bit boundary
+ * Note that this implementation makes no attempt to avoid unaligned accesses
+ * (e.g. load a quad word with non 8-byte alignment). On x86 unaligned accesses
+ * only seem to be a performance penalty when the access crosses a page
+ * boundary-- such a scenario should be an extremely rare occurrence for use
+ * cases of csum_partial.
  */
 __wsum csum_partial(const void *buff, int len, __wsum sum)
 {
-	return (__force __wsum)add32_with_carry(do_csum(buff, len),
-						(__force u32)sum);
+	if (len == 8) {
+		/* Optimize trivial case */
+		return (__force __wsum)add32_with_carry3((__force u32)sum,
+						 *(unsigned int *)buff,
+						 *(unsigned int *)(buff + 4));
+	} else {
+		unsigned long result = do_csum(buff, len);
+
+		return (__force __wsum)add32_with_carry3((__force u32)sum,
+						 result >> 32,
+						 result & 0xffffffff);
+	}
 }
 
 /*