
[v5,net-next] net: Implement fast csum_partial for x86_64

Message ID 1456957112-2702469-1-git-send-email-tom@herbertland.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Tom Herbert March 2, 2016, 10:18 p.m. UTC
This patch implements a performant csum_partial for x86_64. The intent
is to speed up checksum calculation, particularly for the smaller
lengths seen when doing skb_postpull_rcsum after getting
CHECKSUM_COMPLETE from the device or after a CHECKSUM_UNNECESSARY
conversion.
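
(For context, the caller in question looks roughly like this; a
paraphrase of skb_postpull_rcsum() from include/linux/skbuff.h around
that time, shown only to illustrate where the short csum_partial()
calls come from:)

static inline void skb_postpull_rcsum(struct sk_buff *skb,
				      const void *start, unsigned int len)
{
	if (skb->ip_summed == CHECKSUM_COMPLETE)
		skb->csum = csum_sub(skb->csum, csum_partial(start, len, 0));
	else if (skb->ip_summed == CHECKSUM_PARTIAL &&
		 skb_checksum_start_offset(skb) < 0)
		skb->ip_summed = CHECKSUM_NONE;
}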

- v4
   - went back to C code with inline assembly for critical routines
   - implemented suggestion from Linus to deal with lengths < 8
- v5
   - fixed register attribute in add32_with_carry3
   - do_csum returns unsigned long
   - don't consider alignment at all. The rationale is that x86
     handles unaligned accesses very well except when the access
     crosses a page boundary, which carries a performance penalty
     (I see about 10 nsecs on my system). Drivers and the stack go
     to considerable lengths to keep packets from crossing page
     boundaries, so csum_partial being called with a buffer that
     crosses a page boundary should be a very rare occurrence. Not
     dealing with alignment is a significant simplification.

Testing:

Correctness:

Verified correctness by testing arbitrary-length buffers filled with
random data. For each buffer I compared the checksum computed by the
new code against that of the original algorithm, for each possible
alignment (0-7 bytes).

Performance:

Isolating the old and new implementations for some common cases:

             Old      New     %
    Len/Aln  nsecs    nsecs   Improv
    --------+-------+--------+-------
    1400/0    195.6    181.7   3%     (Big packet)
    40/0      11.8     6.5     45%    (IPv6 hdr cmn case)
    8/4       8.1      3.2     60%    (UDP, VXLAN in IPv4)
    14/0      8.9      6.3     29%    (Eth hdr)
    14/4      9.5      6.3     33%    (Eth hdr in IPv4)
    14/3      9.6      6.3     34%    (Eth with odd align)
    20/0      9.1      6.8     25%    (IP hdr without options)
    7/1       9.1      3.9     57%    (buffer in one quad)
    100/0    17.4     13.6     21%    (medium-sized pkt)
    100/2    17.7     13.5     24%    (medium-sized pkt w/ alignment)

Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Also tested on these with similar results:

Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Intel(R) Atom(TM) CPU N450   @ 1.66GHz

Branch prediction:

To test the effects of poor branch prediction in the jump tables I
tested checksum performance with runs for two combinations of length
and alignment. As the baseline I performed the test by doing half of
the calls with the first combination, followed by the second
combination for the second half. In the test case, I interleaved the
two combinations so that the length and alignment are different on
every call, to defeat the effects of branch prediction. Running several
cases, I did not see any material performance difference between the
two scenarios (perf stat output is below), nor does either case show a
significant number of branch misses.

Interleave lengths case:

perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
    ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000

 Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):

     9,556,693,202      instructions               ( +-  0.00% )
     1,176,208,640       branches                                                     ( +-  0.00% )
            19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )

       2.049732539 seconds time elapsed

Non-interleave case:

perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
     ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000

Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):

     9,782,188,310      instructions               ( +-  0.00% )
     1,251,286,958       branches                                                     ( +-  0.01% )
            18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )

       2.271789046 seconds time elapsed

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 arch/x86/include/asm/checksum_64.h |  21 +++++
 arch/x86/lib/csum-partial_64.c     | 171 ++++++++++++++++++-------------------
 2 files changed, 102 insertions(+), 90 deletions(-)

Comments

Eric Dumazet March 2, 2016, 10:35 p.m. UTC | #1
On Wed, 2016-03-02 at 14:18 -0800, Tom Herbert wrote:
> +	asm("lea 0f(, %[slen], 4), %%r11\n\t"
> +	    "clc\n\t"
> +	    "jmpq *%%r11\n\t"
> +	    "adcq 7*8(%[src]),%[res]\n\t"
> +	    "adcq 6*8(%[src]),%[res]\n\t"
> +	    "adcq 5*8(%[src]),%[res]\n\t"
> +	    "adcq 4*8(%[src]),%[res]\n\t"
> +	    "adcq 3*8(%[src]),%[res]\n\t"
> +	    "adcq 2*8(%[src]),%[res]\n\t"
> +	    "adcq 1*8(%[src]),%[res]\n\t"
> +	    "adcq 0*8(%[src]),%[res]\n\t"
> +	    "nop\n\t"
> +	    "0: adcq $0,%[res]"
> +		    : [res] "=r" (result)
> +		    : [src] "r" (buff),
> +		      [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
> +		    : "r11");
>  


hpa mentioned we could use adcq.d8 0*8(%[src]),%[res]

to avoid the mandatory 'nop'
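
Something like this, presumably (an untested sketch; it assumes the
assembler accepts the .d8 displacement-size suffix, which forces a
one-byte displacement so that the 0*8 case is also encoded in 4 bytes
and the padding nop can go away):

	asm("lea 0f(, %[slen], 4), %%r11\n\t"
	    "clc\n\t"
	    "jmpq *%%r11\n\t"
	    "adcq 7*8(%[src]),%[res]\n\t"
	    "adcq 6*8(%[src]),%[res]\n\t"
	    "adcq 5*8(%[src]),%[res]\n\t"
	    "adcq 4*8(%[src]),%[res]\n\t"
	    "adcq 3*8(%[src]),%[res]\n\t"
	    "adcq 2*8(%[src]),%[res]\n\t"
	    "adcq 1*8(%[src]),%[res]\n\t"
	    "adcq.d8 0*8(%[src]),%[res]\n\t"	/* forced disp8: 4 bytes, no nop */
	    "0: adcq $0,%[res]"
		    : [res] "=r" (result)
		    : [src] "r" (buff),
		      [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
		    : "r11");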
Alexander H Duyck March 2, 2016, 11:42 p.m. UTC | #2
On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert <tom@herbertland.com> wrote:
> This patch implements performant csum_partial for x86_64. The intent is
> to speed up checksum calculation, particularly for smaller lengths such
> as those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>
> - v4
>    - went back to C code with inline assembly for critical routines
>    - implemented suggestion from Linus to deal with lengths < 8
> - v5
>    - fixed register attribute add32_with_carry3
>    - do_csum returns unsigned long
>    - don't consider alignment at all. Rationalization is that x86
>      handles unaligned accesses very well except in the case that
>      the access crosses a page boundary which has a performance
>      penalty (I see about 10nsecs on my system). Drivers and the
>      stack go to considerable lengths to not have packets cross page
>      boundaries, so the case that csum_partial is called with
>      buffer that crosses a page boundary should be a very rare
>      occurrence. Not dealing with alignment is a significant
>      simplification.
>
> Testing:
>
> Correctness:
>
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Performance:
>
> Isolating old and new implementation for some common cases:
>
>              Old      New     %
>     Len/Aln  nsecs    nsecs   Improv
>     --------+-------+--------+-------
>     1400/0    195.6    181.7   3%     (Big packet)

The interesting bit here for me would be how we handle the 1480 byte
values with a 2 byte alignment offset.  That would typically be our
case where we have 14 (eth) + 20 (IP) and no IP_ALIGN since we set it
to 0 on x86.  This is also the case where we would have around a 30%
chance of there being a page boundary in there somewhere, since if I
recall 1536 + 64 (NET_SKB_PAD) + 384 (skb_shared_info) doesn't add up
to an even power of 2.  So there is a good likelihood of us actually
regressing in the receive path for large packet checksumming, since
there is about a 10ns penalty for spanning a page boundary with a
single read.

>     40/0      11.8     6.5     45%    (Ipv6 hdr cmn case)

Common case for transmit maybe.  For receive on x86 this isn't IP
aligned so the offset would be 6.

>     8/4       8.1      3.2     60%    (UDP, VXLAN in IPv4)

How likely is the 8/4 case in reality?  I ask because you have a
special handler for the case and as such that extra bit of code is
going to cost you one cycle or more in all other cases.  As far as the
checksum for VXLAN I would think you are probably looking at something
more like 20/4 because you would likely be pulling the UDP, VXLAN, and
inner Ethernet header to get to the inner IP header.  The only case I
can think of where we might be working on 8 bytes would be something
like L3 encapsulated inside of GRE.

>     14/0      8.9      6.3     29%    (Eth hdr)
>     14/4      9.5      6.3     33%    (Eth hdr in IPv4)
>     14/3      9.6      6.3     34%    (Eth with odd align)
>     20/0      9.1      6.8     25%    (IP hdr without options)
>     7/1       9.1      3.9     57%    (buffer in one quad)
>     100/0    17.4     13.6     21%    (medium-sized pkt)
>     100/2    17.7     13.5     24%    (medium-sized pkt w/ alignment)
>
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz


> Also tested on these with similar results:
>
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
> Intel(R) Atom(TM) CPU N450   @ 1.66GHz
>
> Branch  prediction:
>
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between the
> two scenarios (perf stat output is below), neither does either case
> show a significant number of branch misses.
>
> Interleave lengths case:
>
> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>     ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>
>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>
>      9,556,693,202      instructions               ( +-  0.00% )
>      1,176,208,640       branches                                                     ( +-  0.00% )
>             19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )
>
>        2.049732539 seconds time elapsed
>
>     Non-interleave case:
>
> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>      ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>
> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>
>      9,782,188,310      instructions               ( +-  0.00% )
>      1,251,286,958       branches                                                     ( +-  0.01% )
>             18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )
>
>        2.271789046 seconds time elapsed
>
> Signed-off-by: Tom Herbert <tom@herbertland.com>
> ---
>  arch/x86/include/asm/checksum_64.h |  21 +++++
>  arch/x86/lib/csum-partial_64.c     | 171 ++++++++++++++++++-------------------
>  2 files changed, 102 insertions(+), 90 deletions(-)
>
> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
> index cd00e17..1224f7d 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>         return a;
>  }
>
> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
> +{
> +       asm("addq %2,%0\n\t"
> +           "adcq $0,%0"
> +           : "=r" (a)
> +           : "0" (a), "rm" (b));
> +       return a;
> +}
> +

You can probably just convert this and the add32_with_carry over to
the +r approach instead of using "0".
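
(i.e. something like this sketch of the suggested change:)

static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
{
	asm("addq %1,%0\n\t"
	    "adcq $0,%0"
	    : "+r" (a)		/* read-modify-write, replaces the "0" matching constraint */
	    : "rm" (b));
	return a;
}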

> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
> +                                            unsigned int c)
> +{
> +       asm("addl %1,%0\n\t"
> +           "adcl %2,%0\n\t"
> +           "adcl $0,%0"
> +           : "+r" (a)
> +           : "rm" (b), "rm" (c));
> +
> +       return a;
> +}
> +
>  #define HAVE_ARCH_CSUM_ADD
>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>  {
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> index 9845371..7f1f60f 100644
> --- a/arch/x86/lib/csum-partial_64.c
> +++ b/arch/x86/lib/csum-partial_64.c
> @@ -8,6 +8,7 @@
>  #include <linux/compiler.h>
>  #include <linux/module.h>
>  #include <asm/checksum.h>
> +#include <asm/word-at-a-time.h>
>
>  static inline unsigned short from32to16(unsigned a)
>  {
> @@ -21,99 +22,78 @@ static inline unsigned short from32to16(unsigned a)
>
>  /*
>   * Do a 64-bit checksum on an arbitrary memory area.
> - * Returns a 32bit checksum.
> + * Returns a 64bit checksum.
>   *
> - * This isn't as time critical as it used to be because many NICs
> - * do hardware checksumming these days.
> - *
> - * Things tried and found to not make it faster:
> - * Manual Prefetching
> - * Unrolling to an 128 bytes inner loop.
> - * Using interleaving with more registers to break the carry chains.
> + * This is optimized for small lengths such as might be common when pulling
> + * up checksums over protocol headers to handle CHECKSUM_COMPLETE (e.g.
> + * checksum over 40 bytes will be quite common for pulling up checksum over
> + * IPv6 headers).
>   */
> -static unsigned do_csum(const unsigned char *buff, unsigned len)
> +static unsigned long do_csum(const void *buff, int len)
>  {
> -       unsigned odd, count;
>         unsigned long result = 0;
>
> -       if (unlikely(len == 0))
> -               return result;
> -       odd = 1 & (unsigned long) buff;
> -       if (unlikely(odd)) {
> -               result = *buff << 8;
> -               len--;
> -               buff++;
> +       /* Check for less than a quad to sum */
> +       if (len < 8) {
> +               unsigned long val = load_unaligned_zeropad(buff);
> +
> +               return (val & ((1ul << len * 8) - 1));
> +       }
> +
> +       /* Main loop using 64byte blocks */
> +       for (; len > 64; len -= 64, buff += 64) {
> +               asm("addq 0*8(%[src]),%[res]\n\t"
> +                   "adcq 1*8(%[src]),%[res]\n\t"
> +                   "adcq 2*8(%[src]),%[res]\n\t"
> +                   "adcq 3*8(%[src]),%[res]\n\t"
> +                   "adcq 4*8(%[src]),%[res]\n\t"
> +                   "adcq 5*8(%[src]),%[res]\n\t"
> +                   "adcq 6*8(%[src]),%[res]\n\t"
> +                   "adcq 7*8(%[src]),%[res]\n\t"
> +                   "adcq $0,%[res]"
> +                   : [res] "=r" (result)

The +r would probably work here too, just to be consistent.

> +                   : [src] "r" (buff),
> +                   "[res]" (result));
>         }
> -       count = len >> 1;               /* nr of 16-bit words.. */
> -       if (count) {
> -               if (2 & (unsigned long) buff) {
> -                       result += *(unsigned short *)buff;
> -                       count--;
> -                       len -= 2;
> -                       buff += 2;
> -               }
> -               count >>= 1;            /* nr of 32-bit words.. */
> -               if (count) {
> -                       unsigned long zero;
> -                       unsigned count64;
> -                       if (4 & (unsigned long) buff) {
> -                               result += *(unsigned int *) buff;
> -                               count--;
> -                               len -= 4;
> -                               buff += 4;
> -                       }
> -                       count >>= 1;    /* nr of 64-bit words.. */
>
> -                       /* main loop using 64byte blocks */
> -                       zero = 0;
> -                       count64 = count >> 3;
> -                       while (count64) {
> -                               asm("addq 0*8(%[src]),%[res]\n\t"
> -                                   "adcq 1*8(%[src]),%[res]\n\t"
> -                                   "adcq 2*8(%[src]),%[res]\n\t"
> -                                   "adcq 3*8(%[src]),%[res]\n\t"
> -                                   "adcq 4*8(%[src]),%[res]\n\t"
> -                                   "adcq 5*8(%[src]),%[res]\n\t"
> -                                   "adcq 6*8(%[src]),%[res]\n\t"
> -                                   "adcq 7*8(%[src]),%[res]\n\t"
> -                                   "adcq %[zero],%[res]"
> -                                   : [res] "=r" (result)
> -                                   : [src] "r" (buff), [zero] "r" (zero),
> -                                   "[res]" (result));
> -                               buff += 64;
> -                               count64--;
> -                       }
> +       /*
> +        * Sum over remaining quads (<= 8 of them). This uses a jump table
> +        * based on number of quads to sum. The jump assumes that each case
> +        * is 4 bytes. Each adcq instruction is 4 bytes except for adcq 0()
> +        * which is 3 bytes, so a nop instruction is inserted to make that case
> +        * 4 bytes.
> +        */
> +       asm("lea 0f(, %[slen], 4), %%r11\n\t"
> +           "clc\n\t"
> +           "jmpq *%%r11\n\t"
> +           "adcq 7*8(%[src]),%[res]\n\t"
> +           "adcq 6*8(%[src]),%[res]\n\t"
> +           "adcq 5*8(%[src]),%[res]\n\t"
> +           "adcq 4*8(%[src]),%[res]\n\t"
> +           "adcq 3*8(%[src]),%[res]\n\t"
> +           "adcq 2*8(%[src]),%[res]\n\t"
> +           "adcq 1*8(%[src]),%[res]\n\t"
> +           "adcq 0*8(%[src]),%[res]\n\t"
> +           "nop\n\t"
> +           "0: adcq $0,%[res]"
> +                   : [res] "=r" (result)

Same comment about +r here.

> +                   : [src] "r" (buff),
> +                     [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
> +                   : "r11");
>
> -                       /* last up to 7 8byte blocks */
> -                       count %= 8;
> -                       while (count) {
> -                               asm("addq %1,%0\n\t"
> -                                   "adcq %2,%0\n"
> -                                           : "=r" (result)
> -                                   : "m" (*(unsigned long *)buff),
> -                                   "r" (zero),  "0" (result));
> -                               --count;
> -                                       buff += 8;
> -                       }
> -                       result = add32_with_carry(result>>32,
> -                                                 result&0xffffffff);
> +       /* Sum over any remaining bytes (< 8 of them) */
> +       if (len & 0x7) {
> +               unsigned long val;
> +               /*
> +                * Since "len" is > 8 here we backtrack in the buffer to load
> +                * the outstanding bytes into the low order bytes of a quad and
> +                * then shift to extract the relevant bytes. By doing this we
> +                * avoid additional calls to load_unaligned_zeropad.
> +                */
>
> -                       if (len & 4) {
> -                               result += *(unsigned int *) buff;
> -                               buff += 4;
> -                       }
> -               }
> -               if (len & 2) {
> -                       result += *(unsigned short *) buff;
> -                       buff += 2;
> -               }
> -       }
> -       if (len & 1)
> -               result += *buff;
> -       result = add32_with_carry(result>>32, result & 0xffffffff);
> -       if (unlikely(odd)) {
> -               result = from32to16(result);
> -               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> +               val = *(unsigned long *)(buff + len - 8);
> +               val >>= 8 * (-len & 0x7);
> +               result = add64_with_carry(val, result);
>         }
>         return result;
>  }
> @@ -125,15 +105,26 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>   * returns a 32-bit number suitable for feeding into itself
>   * or csum_tcpudp_magic
>   *
> - * this function must be called with even lengths, except
> - * for the last fragment, which may be odd
> - *
> - * it's best to have buff aligned on a 64-bit boundary
> + * Note that this implementation makes no attempt to avoid unaligned accesses
> + * (e.g. load a quad word with non 8-byte alignment). On x86 unaligned accesses
> + * only seem to be a performance penalty when the access crosses a page
> + * boundary-- such a scenario should be an extremely rare occurrence for use
> + * cases of csum_partial.
>   */
>  __wsum csum_partial(const void *buff, int len, __wsum sum)
>  {
> -       return (__force __wsum)add32_with_carry(do_csum(buff, len),
> -                                               (__force u32)sum);
> +       if (len == 8) {
> +               /* Optimize trivial case */
> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
> +                                                *(unsigned int *)buff,
> +                                                *(unsigned int *)(buff + 4));
> +       } else {
> +               unsigned long result = do_csum(buff, len);
> +
> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
> +                                                result >> 32,
> +                                                result & 0xffffffff);
> +       }
>  }
>
>  /*
> --
> 2.6.5
>
Tom Herbert March 3, 2016, 12:40 a.m. UTC | #3
On Wed, Mar 2, 2016 at 3:42 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert <tom@herbertland.com> wrote:
>> This patch implements performant csum_partial for x86_64. The intent is
>> to speed up checksum calculation, particularly for smaller lengths such
>> as those that are present when doing skb_postpull_rcsum when getting
>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>>
>> - v4
>>    - went back to C code with inline assembly for critical routines
>>    - implemented suggestion from Linus to deal with lengths < 8
>> - v5
>>    - fixed register attribute add32_with_carry3
>>    - do_csum returns unsigned long
>>    - don't consider alignment at all. Rationalization is that x86
>>      handles unaligned accesses very well except in the case that
>>      the access crosses a page boundary which has a performance
>>      penalty (I see about 10nsecs on my system). Drivers and the
>>      stack go to considerable lengths to not have packets cross page
>>      boundaries, so the case that csum_partial is called with
>>      buffer that crosses a page boundary should be a very rare
>>      occurrence. Not dealing with alignment is a significant
>>      simplification.
>>
>> Testing:
>>
>> Correctness:
>>
>> Verified correctness by testing arbitrary length buffer filled with
>> random data. For each buffer I compared the computed checksum
>> using the original algorithm for each possible alignment (0-7 bytes).
>>
>> Performance:
>>
>> Isolating old and new implementation for some common cases:
>>
>>              Old      New     %
>>     Len/Aln  nsecs    nsecs   Improv
>>     --------+-------+--------+-------
>>     1400/0    195.6    181.7   3%     (Big packet)
>
> The interesting bit here for me would be how we handle the 1480 byte
> values with a 2 byte alignment offset.  That would typically be our
> case where we have 14 (eth) + 20 (IP) and no IP_ALIGN since we set it
> to 0 on x86.  This is also the case where we would have around a 30%
> chance of there being a page boundary in there somewhere since if I
> recall 1536 + 64 (NET_SKB_PAD) + 384 (skb_shared_info) doesn't add up
> to an even power of 2 so there is a good likelihood of us actually
> regressing in the receive path for large packet checksumming since it
> is about a 10ns penalty for spanning a page boundary for a single
> read.
>
Yes, but the case you're considering is one in which we need to
perform a full packet checksum in the host-- normally we'd expect to
have HW offload for that. But even in that case, the regression would
not be the full 10ns, since the logic to check alignment would be
needed and that has some cost. Also, for longer checksums the cost is
amortized, so any relative regression would be smaller. Interestingly,
on the Atom CPU it was still better performance to ignore the
alignment than to go through the code to handle it. For Xeon there was
some regression. The potentially bad case would be if headers are
split over a page boundary and we are doing checksum complete (this in
theory could also be a problem for other accesses, like if the IP
addresses end up straddling the page boundary). I think the
probability of that is well less than 30%, but if we really are
worried about checksumming over a page boundary, then I would put in
one conditional to switch between the old do_csum and the new do_csum
(call it do_fast_csum) based on the buffer crossing a page boundary--
so the cost of needing to deal with alignment in the common case is at
most a single conditional (1 ns).
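
(As a rough sketch of that idea, with do_fast_csum standing in for the
new routine and do_csum for the existing alignment-aware one; both
names are placeholders here:)

__wsum csum_partial(const void *buff, int len, __wsum sum)
{
	unsigned long result;

	/* Fall back to the old code only if the buffer straddles a page */
	if (unlikely(((unsigned long)buff & ~PAGE_MASK) + len > PAGE_SIZE))
		result = do_csum(buff, len);
	else
		result = do_fast_csum(buff, len);

	return (__force __wsum)add32_with_carry3((__force u32)sum,
						 result >> 32,
						 result & 0xffffffff);
}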

>>     40/0      11.8     6.5     45%    (Ipv6 hdr cmn case)
>
> Common case for transmit maybe.  For receive on x86 this isn't IP
> aligned so the offset would be 6.

Offset 6 gives about the same results.

>
>>     8/4       8.1      3.2     60%    (UDP, VXLAN in IPv4)
>
> How likely is the 8/4 case in reality?  I ask because you have a
> special handler for the case and as such that extra bit of code is
> going to cost you one cycle or more in all other cases.  As far as the
> checksum for VXLAN I would think you are probably looking at something
> more like 20/4 because you would likely be pulling the UDP, VXLAN, and
> inner Ethernet header to get to the inner IP header.  The only case I
> can think of where we might be working on 8 bytes would be something
> like L3 encapsulated inside of GRE.
>
>>     14/0      8.9      6.3     29%    (Eth hdr)
>>     14/4      9.5      6.3     33%    (Eth hdr in IPv4)
>>     14/3      9.6      6.3     34%    (Eth with odd align)
>>     20/0      9.1      6.8     25%    (IP hdr without options)
>>     7/1       9.1      3.9     57%    (buffer in one quad)
>>     100/0    17.4     13.6     21%    (medium-sized pkt)
>>     100/2    17.7     13.5     24%    (medium-sized pkt w/ alignment)
>>
>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
>
>> Also tested on these with similar results:
>>
>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>> Intel(R) Atom(TM) CPU N450   @ 1.66GHz
>>
>> Branch  prediction:
>>
>> To test the effects of poor branch prediction in the jump tables I
>> tested checksum performance with runs for two combinations of length
>> and alignment. As the baseline I performed the test by doing half of
>> calls with the first combination, followed by using the second
>> combination for the second half. In the test case, I interleave the
>> two combinations so that in every call the length and alignment are
>> different to defeat the effects of branch prediction. Running several
>> cases, I did not see any material performance difference between the
>> two scenarios (perf stat output is below), neither does either case
>> show a significant number of branch misses.
>>
>> Interleave lengths case:
>>
>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>     ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>>
>>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>
>>      9,556,693,202      instructions               ( +-  0.00% )
>>      1,176,208,640       branches                                                     ( +-  0.00% )
>>             19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )
>>
>>        2.049732539 seconds time elapsed
>>
>>     Non-interleave case:
>>
>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>      ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>>
>> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>
>>      9,782,188,310      instructions               ( +-  0.00% )
>>      1,251,286,958       branches                                                     ( +-  0.01% )
>>             18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )
>>
>>        2.271789046 seconds time elapsed
>>
>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>> ---
>>  arch/x86/include/asm/checksum_64.h |  21 +++++
>>  arch/x86/lib/csum-partial_64.c     | 171 ++++++++++++++++++-------------------
>>  2 files changed, 102 insertions(+), 90 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
>> index cd00e17..1224f7d 100644
>> --- a/arch/x86/include/asm/checksum_64.h
>> +++ b/arch/x86/include/asm/checksum_64.h
>> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>>         return a;
>>  }
>>
>> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
>> +{
>> +       asm("addq %2,%0\n\t"
>> +           "adcq $0,%0"
>> +           : "=r" (a)
>> +           : "0" (a), "rm" (b));
>> +       return a;
>> +}
>> +
>
> You can probably just convert this and the add32_with_carry over to
> the +r approach instead of using "0".
>
>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
>> +                                            unsigned int c)
>> +{
>> +       asm("addl %1,%0\n\t"
>> +           "adcl %2,%0\n\t"
>> +           "adcl $0,%0"
>> +           : "+r" (a)
>> +           : "rm" (b), "rm" (c));
>> +
>> +       return a;
>> +}
>> +
>>  #define HAVE_ARCH_CSUM_ADD
>>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>>  {
>> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>> index 9845371..7f1f60f 100644
>> --- a/arch/x86/lib/csum-partial_64.c
>> +++ b/arch/x86/lib/csum-partial_64.c
>> @@ -8,6 +8,7 @@
>>  #include <linux/compiler.h>
>>  #include <linux/module.h>
>>  #include <asm/checksum.h>
>> +#include <asm/word-at-a-time.h>
>>
>>  static inline unsigned short from32to16(unsigned a)
>>  {
>> @@ -21,99 +22,78 @@ static inline unsigned short from32to16(unsigned a)
>>
>>  /*
>>   * Do a 64-bit checksum on an arbitrary memory area.
>> - * Returns a 32bit checksum.
>> + * Returns a 64bit checksum.
>>   *
>> - * This isn't as time critical as it used to be because many NICs
>> - * do hardware checksumming these days.
>> - *
>> - * Things tried and found to not make it faster:
>> - * Manual Prefetching
>> - * Unrolling to an 128 bytes inner loop.
>> - * Using interleaving with more registers to break the carry chains.
>> + * This is optimized for small lengths such as might be common when pulling
>> + * up checksums over protocol headers to handle CHECKSUM_COMPLETE (e.g.
>> + * checksum over 40 bytes will be quite common for pulling up checksum over
>> + * IPv6 headers).
>>   */
>> -static unsigned do_csum(const unsigned char *buff, unsigned len)
>> +static unsigned long do_csum(const void *buff, int len)
>>  {
>> -       unsigned odd, count;
>>         unsigned long result = 0;
>>
>> -       if (unlikely(len == 0))
>> -               return result;
>> -       odd = 1 & (unsigned long) buff;
>> -       if (unlikely(odd)) {
>> -               result = *buff << 8;
>> -               len--;
>> -               buff++;
>> +       /* Check for less than a quad to sum */
>> +       if (len < 8) {
>> +               unsigned long val = load_unaligned_zeropad(buff);
>> +
>> +               return (val & ((1ul << len * 8) - 1));
>> +       }
>> +
>> +       /* Main loop using 64byte blocks */
>> +       for (; len > 64; len -= 64, buff += 64) {
>> +               asm("addq 0*8(%[src]),%[res]\n\t"
>> +                   "adcq 1*8(%[src]),%[res]\n\t"
>> +                   "adcq 2*8(%[src]),%[res]\n\t"
>> +                   "adcq 3*8(%[src]),%[res]\n\t"
>> +                   "adcq 4*8(%[src]),%[res]\n\t"
>> +                   "adcq 5*8(%[src]),%[res]\n\t"
>> +                   "adcq 6*8(%[src]),%[res]\n\t"
>> +                   "adcq 7*8(%[src]),%[res]\n\t"
>> +                   "adcq $0,%[res]"
>> +                   : [res] "=r" (result)
>
> The +r would probably work here to just to be consistent.
>
>> +                   : [src] "r" (buff),
>> +                   "[res]" (result));
>>         }
>> -       count = len >> 1;               /* nr of 16-bit words.. */
>> -       if (count) {
>> -               if (2 & (unsigned long) buff) {
>> -                       result += *(unsigned short *)buff;
>> -                       count--;
>> -                       len -= 2;
>> -                       buff += 2;
>> -               }
>> -               count >>= 1;            /* nr of 32-bit words.. */
>> -               if (count) {
>> -                       unsigned long zero;
>> -                       unsigned count64;
>> -                       if (4 & (unsigned long) buff) {
>> -                               result += *(unsigned int *) buff;
>> -                               count--;
>> -                               len -= 4;
>> -                               buff += 4;
>> -                       }
>> -                       count >>= 1;    /* nr of 64-bit words.. */
>>
>> -                       /* main loop using 64byte blocks */
>> -                       zero = 0;
>> -                       count64 = count >> 3;
>> -                       while (count64) {
>> -                               asm("addq 0*8(%[src]),%[res]\n\t"
>> -                                   "adcq 1*8(%[src]),%[res]\n\t"
>> -                                   "adcq 2*8(%[src]),%[res]\n\t"
>> -                                   "adcq 3*8(%[src]),%[res]\n\t"
>> -                                   "adcq 4*8(%[src]),%[res]\n\t"
>> -                                   "adcq 5*8(%[src]),%[res]\n\t"
>> -                                   "adcq 6*8(%[src]),%[res]\n\t"
>> -                                   "adcq 7*8(%[src]),%[res]\n\t"
>> -                                   "adcq %[zero],%[res]"
>> -                                   : [res] "=r" (result)
>> -                                   : [src] "r" (buff), [zero] "r" (zero),
>> -                                   "[res]" (result));
>> -                               buff += 64;
>> -                               count64--;
>> -                       }
>> +       /*
>> +        * Sum over remaining quads (<= 8 of them). This uses a jump table
>> +        * based on number of quads to sum. The jump assumes that each case
>> +        * is 4 bytes. Each adcq instruction is 4 bytes except for adcq 0()
>> +        * which is 3 bytes, so a nop instruction is inserted to make that case
>> +        * 4 bytes.
>> +        */
>> +       asm("lea 0f(, %[slen], 4), %%r11\n\t"
>> +           "clc\n\t"
>> +           "jmpq *%%r11\n\t"
>> +           "adcq 7*8(%[src]),%[res]\n\t"
>> +           "adcq 6*8(%[src]),%[res]\n\t"
>> +           "adcq 5*8(%[src]),%[res]\n\t"
>> +           "adcq 4*8(%[src]),%[res]\n\t"
>> +           "adcq 3*8(%[src]),%[res]\n\t"
>> +           "adcq 2*8(%[src]),%[res]\n\t"
>> +           "adcq 1*8(%[src]),%[res]\n\t"
>> +           "adcq 0*8(%[src]),%[res]\n\t"
>> +           "nop\n\t"
>> +           "0: adcq $0,%[res]"
>> +                   : [res] "=r" (result)
>
> Same comment about +r here.
>
>> +                   : [src] "r" (buff),
>> +                     [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
>> +                   : "r11");
>>
>> -                       /* last up to 7 8byte blocks */
>> -                       count %= 8;
>> -                       while (count) {
>> -                               asm("addq %1,%0\n\t"
>> -                                   "adcq %2,%0\n"
>> -                                           : "=r" (result)
>> -                                   : "m" (*(unsigned long *)buff),
>> -                                   "r" (zero),  "0" (result));
>> -                               --count;
>> -                                       buff += 8;
>> -                       }
>> -                       result = add32_with_carry(result>>32,
>> -                                                 result&0xffffffff);
>> +       /* Sum over any remaining bytes (< 8 of them) */
>> +       if (len & 0x7) {
>> +               unsigned long val;
>> +               /*
>> +                * Since "len" is > 8 here we backtrack in the buffer to load
>> +                * the outstanding bytes into the low order bytes of a quad and
>> +                * then shift to extract the relevant bytes. By doing this we
>> +                * avoid additional calls to load_unaligned_zeropad.
>> +                */
>>
>> -                       if (len & 4) {
>> -                               result += *(unsigned int *) buff;
>> -                               buff += 4;
>> -                       }
>> -               }
>> -               if (len & 2) {
>> -                       result += *(unsigned short *) buff;
>> -                       buff += 2;
>> -               }
>> -       }
>> -       if (len & 1)
>> -               result += *buff;
>> -       result = add32_with_carry(result>>32, result & 0xffffffff);
>> -       if (unlikely(odd)) {
>> -               result = from32to16(result);
>> -               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
>> +               val = *(unsigned long *)(buff + len - 8);
>> +               val >>= 8 * (-len & 0x7);
>> +               result = add64_with_carry(val, result);
>>         }
>>         return result;
>>  }
>> @@ -125,15 +105,26 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>>   * returns a 32-bit number suitable for feeding into itself
>>   * or csum_tcpudp_magic
>>   *
>> - * this function must be called with even lengths, except
>> - * for the last fragment, which may be odd
>> - *
>> - * it's best to have buff aligned on a 64-bit boundary
>> + * Note that this implementation makes no attempt to avoid unaligned accesses
>> + * (e.g. load a quad word with non 8-byte alignment). On x86 unaligned accesses
>> + * only seem to be a performance penalty when the access crosses a page
>> + * boundary-- such a scenario should be an extremely rare occurrence for use
>> + * cases of csum_partial.
>>   */
>>  __wsum csum_partial(const void *buff, int len, __wsum sum)
>>  {
>> -       return (__force __wsum)add32_with_carry(do_csum(buff, len),
>> -                                               (__force u32)sum);
>> +       if (len == 8) {
>> +               /* Optimize trivial case */
>> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
>> +                                                *(unsigned int *)buff,
>> +                                                *(unsigned int *)(buff + 4));
>> +       } else {
>> +               unsigned long result = do_csum(buff, len);
>> +
>> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
>> +                                                result >> 32,
>> +                                                result & 0xffffffff);
>> +       }
>>  }
>>
>>  /*
>> --
>> 2.6.5
>>
Alexander H Duyck March 3, 2016, 1:35 a.m. UTC | #4
On Wed, Mar 2, 2016 at 4:40 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Wed, Mar 2, 2016 at 3:42 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Wed, Mar 2, 2016 at 2:18 PM, Tom Herbert <tom@herbertland.com> wrote:
>>> This patch implements performant csum_partial for x86_64. The intent is
>>> to speed up checksum calculation, particularly for smaller lengths such
>>> as those that are present when doing skb_postpull_rcsum when getting
>>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.
>>>
>>> - v4
>>>    - went back to C code with inline assembly for critical routines
>>>    - implemented suggestion from Linus to deal with lengths < 8
>>> - v5
>>>    - fixed register attribute add32_with_carry3
>>>    - do_csum returns unsigned long
>>>    - don't consider alignment at all. Rationalization is that x86
>>>      handles unaligned accesses very well except in the case that
>>>      the access crosses a page boundary which has a performance
>>>      penalty (I see about 10nsecs on my system). Drivers and the
>>>      stack go to considerable lengths to not have packets cross page
>>>      boundaries, so the case that csum_partial is called with
>>>      buffer that crosses a page boundary should be a very rare
>>>      occurrence. Not dealing with alignment is a significant
>>>      simplification.
>>>
>>> Testing:
>>>
>>> Correctness:
>>>
>>> Verified correctness by testing arbitrary length buffer filled with
>>> random data. For each buffer I compared the computed checksum
>>> using the original algorithm for each possible alignment (0-7 bytes).
>>>
>>> Performance:
>>>
>>> Isolating old and new implementation for some common cases:
>>>
>>>              Old      New     %
>>>     Len/Aln  nsecs    nsecs   Improv
>>>     --------+-------+--------+-------
>>>     1400/0    195.6    181.7   3%     (Big packet)
>>
>> The interesting bit here for me would be how we handle the 1480 byte
>> values with a 2 byte alignment offset.  That would typically be our
>> case where we have 14 (eth) + 20 (IP) and no IP_ALIGN since we set it
>> to 0 on x86.  This is also the case where we would have around a 30%
>> chance of there being a page boundary in there somewhere since if I
>> recall 1536 + 64 (NET_SKB_PAD) + 384 (skb_shared_info) doesn't add up
>> to an even power of 2 so there is a good likelihood of us actually
>> regressing in the receive path for large packet checksumming since it
>> is about a 10ns penalty for spanning a page boundary for a single
>> read.
>>
> Yes, but the case your considering would be that in which we need to
> perform a full packet checksum in the host-- normally we'd expect to
> have HW offload for that. But even in that case, the regression would
> not be the full 10ns since the logic to check alignment would be
> needed and that has some cost. Also, for longer checksums the cost is
> amortized so any relative regression would be smaller. Interestingly,
> on the Atom CPU it was still better performance to ignore the
> alignment than got through the code to handle it. For Xeon there was
> some regression. The potentially bad case would be if headers are
> split over a page boundary and we are doing checksum complete (this in
> theory could also be a problem for other accesses like if the IP
> addresses end up straddling the page boundary). I think the
> probability of that is well less than 30%, but if we really are
> worried about checksum over a page boundary, then I would put in one
> conditional to switch between the old do_csum and the new do_csum
> (call it do_fast_csum) based on buffer crossing page boundary-- so the
> cost of needing to deal with alignment in the common case is at most a
> single conditional (1 ns).

The part I would be interested in seeing is how much we gain/lose by
dealing with the alignment.  Based on the 10ns value I would expect
that we would see a 2% regression per 4K for dealing with an unaligned
page.  So the question is what we gain by taking the 2% hit.  I agree
that the 2% hit won't occur often as long as we are only doing
headers, but we can't be myopic about how this code is to be used.
The fact is there are plenty of other paths that use it as well, and
it doesn't do us much good if we make the slow path that much slower
to accelerate the fast path by 1 or 2 ns.

By any chance could you email me the code you are using to test this?
I'm assuming you are probably testing this in user-space with some
rdtsc type calls thrown in.  I want to try applying a few patches and
see what the gain/loss is for aligning the code versus not bothering
to align it.
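
(i.e. presumably something along the lines of this user-space harness,
where csum_partial() is a copy of the routine under test compiled into
the program; a hypothetical sketch, not the actual ./csum tool:)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* user-space copy of the kernel routine under test (assumed) */
extern uint32_t csum_partial(const void *buff, int len, uint32_t sum);

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	static unsigned char buf[2048];
	volatile uint32_t sink = 0;
	const int len = 100, align = 2;
	const long iters = 100000000;
	uint64_t start, end;
	long i;

	memset(buf, 0xa5, sizeof(buf));

	start = rdtsc();
	for (i = 0; i < iters; i++)
		sink += csum_partial(buf + align, len, 0);
	end = rdtsc();

	printf("avg cycles/call: %.2f\n", (double)(end - start) / iters);
	return 0;
}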

>>>     40/0      11.8     6.5     45%    (Ipv6 hdr cmn case)
>>
>> Common case for transmit maybe.  For receive on x86 this isn't IP
>> aligned so the offset would be 6.
>
> With offset 6 gives about the same results.

Yeah, I kind of figured that.  Just stating that the common alignment is 6.

>>
>>>     8/4       8.1      3.2     60%    (UDP, VXLAN in IPv4)
>>
>> How likely is the 8/4 case in reality?  I ask because you have a
>> special handler for the case and as such that extra bit of code is
>> going to cost you one cycle or more in all other cases.  As far as the
>> checksum for VXLAN I would think you are probably looking at something
>> more like 20/4 because you would likely be pulling the UDP, VXLAN, and
>> inner Ethernet header to get to the inner IP header.  The only case I
>> can think of where we might be working on 8 bytes would be something
>> like L3 encapsulated inside of GRE.
>>
>>>     14/0      8.9      6.3     29%    (Eth hdr)
>>>     14/4      9.5      6.3     33%    (Eth hdr in IPv4)
>>>     14/3      9.6      6.3     34%    (Eth with odd align)
>>>     20/0      9.1      6.8     25%    (IP hdr without options)
>>>     7/1       9.1      3.9     57%    (buffer in one quad)
>>>     100/0    17.4     13.6     21%    (medium-sized pkt)
>>>     100/2    17.7     13.5     24%    (medium-sized pkt w/ alignment)
>>>
>>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Same for all these.  The alignment values to test will likely be 0, 2,
4, or 6.  I hadn't thought about it before but odd values will likely
be less than 1% of what we actually see.  We probably don't need to
worry about odd align all that much since it should be extremely rare.



>>
>>> Also tested on these with similar results:
>>>
>>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>> Intel(R) Atom(TM) CPU N450   @ 1.66GHz
>>>
>>> Branch  prediction:
>>>
>>> To test the effects of poor branch prediction in the jump tables I
>>> tested checksum performance with runs for two combinations of length
>>> and alignment. As the baseline I performed the test by doing half of
>>> calls with the first combination, followed by using the second
>>> combination for the second half. In the test case, I interleave the
>>> two combinations so that in every call the length and alignment are
>>> different to defeat the effects of branch prediction. Running several
>>> cases, I did not see any material performance difference between the
>>> two scenarios (perf stat output is below), neither does either case
>>> show a significant number of branch misses.
>>>
>>> Interleave lengths case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>>     ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>>>
>>>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>>
>>>      9,556,693,202      instructions               ( +-  0.00% )
>>>      1,176,208,640       branches                                                     ( +-  0.00% )
>>>             19,487       branch-misses            #    0.00% of all branches          ( +-  6.07% )
>>>
>>>        2.049732539 seconds time elapsed
>>>
>>>     Non-interleave case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>>      ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>>>
>>> Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>>
>>>      9,782,188,310      instructions               ( +-  0.00% )
>>>      1,251,286,958       branches                                                     ( +-  0.01% )
>>>             18,950       branch-misses            #    0.00% of all branches          ( +- 12.74% )
>>>
>>>        2.271789046 seconds time elapsed
>>>
>>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>>> ---
>>>  arch/x86/include/asm/checksum_64.h |  21 +++++
>>>  arch/x86/lib/csum-partial_64.c     | 171 ++++++++++++++++++-------------------
>>>  2 files changed, 102 insertions(+), 90 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
>>> index cd00e17..1224f7d 100644
>>> --- a/arch/x86/include/asm/checksum_64.h
>>> +++ b/arch/x86/include/asm/checksum_64.h
>>> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>>>         return a;
>>>  }
>>>
>>> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
>>> +{
>>> +       asm("addq %2,%0\n\t"
>>> +           "adcq $0,%0"
>>> +           : "=r" (a)
>>> +           : "0" (a), "rm" (b));
>>> +       return a;
>>> +}
>>> +
>>
>> You can probably just convert this and the add32_with_carry over to
>> the +r approach instead of using "0".
>>
>>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
>>> +                                            unsigned int c)
>>> +{
>>> +       asm("addl %1,%0\n\t"
>>> +           "adcl %2,%0\n\t"
>>> +           "adcl $0,%0"
>>> +           : "+r" (a)
>>> +           : "rm" (b), "rm" (c));
>>> +
>>> +       return a;
>>> +}
>>> +
>>>  #define HAVE_ARCH_CSUM_ADD
>>>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>>>  {
>>> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>>> index 9845371..7f1f60f 100644
>>> --- a/arch/x86/lib/csum-partial_64.c
>>> +++ b/arch/x86/lib/csum-partial_64.c
>>> @@ -8,6 +8,7 @@
>>>  #include <linux/compiler.h>
>>>  #include <linux/module.h>
>>>  #include <asm/checksum.h>
>>> +#include <asm/word-at-a-time.h>
>>>
>>>  static inline unsigned short from32to16(unsigned a)
>>>  {
>>> @@ -21,99 +22,78 @@ static inline unsigned short from32to16(unsigned a)
>>>
>>>  /*
>>>   * Do a 64-bit checksum on an arbitrary memory area.
>>> - * Returns a 32bit checksum.
>>> + * Returns a 64bit checksum.
>>>   *
>>> - * This isn't as time critical as it used to be because many NICs
>>> - * do hardware checksumming these days.
>>> - *
>>> - * Things tried and found to not make it faster:
>>> - * Manual Prefetching
>>> - * Unrolling to an 128 bytes inner loop.
>>> - * Using interleaving with more registers to break the carry chains.
>>> + * This is optimized for small lengths such as might be common when pulling
>>> + * up checksums over protocol headers to handle CHECKSUM_COMPLETE (e.g.
>>> + * checksum over 40 bytes will be quite common for pulling up checksum over
>>> + * IPv6 headers).
>>>   */
>>> -static unsigned do_csum(const unsigned char *buff, unsigned len)
>>> +static unsigned long do_csum(const void *buff, int len)
>>>  {
>>> -       unsigned odd, count;
>>>         unsigned long result = 0;
>>>
>>> -       if (unlikely(len == 0))
>>> -               return result;
>>> -       odd = 1 & (unsigned long) buff;
>>> -       if (unlikely(odd)) {
>>> -               result = *buff << 8;
>>> -               len--;
>>> -               buff++;
>>> +       /* Check for less than a quad to sum */
>>> +       if (len < 8) {
>>> +               unsigned long val = load_unaligned_zeropad(buff);
>>> +
>>> +               return (val & ((1ul << len * 8) - 1));
>>> +       }
>>> +
>>> +       /* Main loop using 64byte blocks */
>>> +       for (; len > 64; len -= 64, buff += 64) {
>>> +               asm("addq 0*8(%[src]),%[res]\n\t"
>>> +                   "adcq 1*8(%[src]),%[res]\n\t"
>>> +                   "adcq 2*8(%[src]),%[res]\n\t"
>>> +                   "adcq 3*8(%[src]),%[res]\n\t"
>>> +                   "adcq 4*8(%[src]),%[res]\n\t"
>>> +                   "adcq 5*8(%[src]),%[res]\n\t"
>>> +                   "adcq 6*8(%[src]),%[res]\n\t"
>>> +                   "adcq 7*8(%[src]),%[res]\n\t"
>>> +                   "adcq $0,%[res]"
>>> +                   : [res] "=r" (result)
>>
>> The +r would probably work here to just to be consistent.
>>
>>> +                   : [src] "r" (buff),
>>> +                   "[res]" (result));
>>>         }
>>> -       count = len >> 1;               /* nr of 16-bit words.. */
>>> -       if (count) {
>>> -               if (2 & (unsigned long) buff) {
>>> -                       result += *(unsigned short *)buff;
>>> -                       count--;
>>> -                       len -= 2;
>>> -                       buff += 2;
>>> -               }
>>> -               count >>= 1;            /* nr of 32-bit words.. */
>>> -               if (count) {
>>> -                       unsigned long zero;
>>> -                       unsigned count64;
>>> -                       if (4 & (unsigned long) buff) {
>>> -                               result += *(unsigned int *) buff;
>>> -                               count--;
>>> -                               len -= 4;
>>> -                               buff += 4;
>>> -                       }
>>> -                       count >>= 1;    /* nr of 64-bit words.. */
>>>
>>> -                       /* main loop using 64byte blocks */
>>> -                       zero = 0;
>>> -                       count64 = count >> 3;
>>> -                       while (count64) {
>>> -                               asm("addq 0*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 1*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 2*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 3*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 4*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 5*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 6*8(%[src]),%[res]\n\t"
>>> -                                   "adcq 7*8(%[src]),%[res]\n\t"
>>> -                                   "adcq %[zero],%[res]"
>>> -                                   : [res] "=r" (result)
>>> -                                   : [src] "r" (buff), [zero] "r" (zero),
>>> -                                   "[res]" (result));
>>> -                               buff += 64;
>>> -                               count64--;
>>> -                       }
>>> +       /*
>>> +        * Sum over remaining quads (<= 8 of them). This uses a jump table
>>> +        * based on number of quads to sum. The jump assumes that each case
>>> +        * is 4 bytes. Each adcq instruction is 4 bytes except for adcq 0()
>>> +        * which is 3 bytes, so a nop instruction is inserted to make that case
>>> +        * 4 bytes.
>>> +        */
>>> +       asm("lea 0f(, %[slen], 4), %%r11\n\t"
>>> +           "clc\n\t"
>>> +           "jmpq *%%r11\n\t"
>>> +           "adcq 7*8(%[src]),%[res]\n\t"
>>> +           "adcq 6*8(%[src]),%[res]\n\t"
>>> +           "adcq 5*8(%[src]),%[res]\n\t"
>>> +           "adcq 4*8(%[src]),%[res]\n\t"
>>> +           "adcq 3*8(%[src]),%[res]\n\t"
>>> +           "adcq 2*8(%[src]),%[res]\n\t"
>>> +           "adcq 1*8(%[src]),%[res]\n\t"
>>> +           "adcq 0*8(%[src]),%[res]\n\t"
>>> +           "nop\n\t"
>>> +           "0: adcq $0,%[res]"
>>> +                   : [res] "=r" (result)
>>
>> Same comment about +r here.
>>
>>> +                   : [src] "r" (buff),
>>> +                     [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
>>> +                   : "r11");
>>>
>>> -                       /* last up to 7 8byte blocks */
>>> -                       count %= 8;
>>> -                       while (count) {
>>> -                               asm("addq %1,%0\n\t"
>>> -                                   "adcq %2,%0\n"
>>> -                                           : "=r" (result)
>>> -                                   : "m" (*(unsigned long *)buff),
>>> -                                   "r" (zero),  "0" (result));
>>> -                               --count;
>>> -                                       buff += 8;
>>> -                       }
>>> -                       result = add32_with_carry(result>>32,
>>> -                                                 result&0xffffffff);
>>> +       /* Sum over any remaining bytes (< 8 of them) */
>>> +       if (len & 0x7) {
>>> +               unsigned long val;
>>> +               /*
>>> +                * Since "len" is > 8 here we backtrack in the buffer to load
>>> +                * the outstanding bytes into the low order bytes of a quad and
>>> +                * then shift to extract the relevant bytes. By doing this we
>>> +                * avoid additional calls to load_unaligned_zeropad.
>>> +                */
>>>
>>> -                       if (len & 4) {
>>> -                               result += *(unsigned int *) buff;
>>> -                               buff += 4;
>>> -                       }
>>> -               }
>>> -               if (len & 2) {
>>> -                       result += *(unsigned short *) buff;
>>> -                       buff += 2;
>>> -               }
>>> -       }
>>> -       if (len & 1)
>>> -               result += *buff;
>>> -       result = add32_with_carry(result>>32, result & 0xffffffff);
>>> -       if (unlikely(odd)) {
>>> -               result = from32to16(result);
>>> -               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
>>> +               val = *(unsigned long *)(buff + len - 8);
>>> +               val >>= 8 * (-len & 0x7);
>>> +               result = add64_with_carry(val, result);
>>>         }
>>>         return result;
>>>  }
>>> @@ -125,15 +105,26 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>>>   * returns a 32-bit number suitable for feeding into itself
>>>   * or csum_tcpudp_magic
>>>   *
>>> - * this function must be called with even lengths, except
>>> - * for the last fragment, which may be odd
>>> - *
>>> - * it's best to have buff aligned on a 64-bit boundary
>>> + * Note that this implementation makes no attempt to avoid unaligned accesses
>>> + * (e.g. load a quad word with non 8-byte alignment). On x86 unaligned accesses
>>> + * only seem to be a performance penalty when the access crosses a page
>>> + * boundary-- such a scenario should be an extremely rare occurrence for use
>>> + * cases of csum_partial.
>>>   */
>>>  __wsum csum_partial(const void *buff, int len, __wsum sum)
>>>  {
>>> -       return (__force __wsum)add32_with_carry(do_csum(buff, len),
>>> -                                               (__force u32)sum);
>>> +       if (len == 8) {
>>> +               /* Optimize trivial case */
>>> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
>>> +                                                *(unsigned int *)buff,
>>> +                                                *(unsigned int *)(buff + 4));
>>> +       } else {
>>> +               unsigned long result = do_csum(buff, len);
>>> +
>>> +               return (__force __wsum)add32_with_carry3((__force u32)sum,
>>> +                                                result >> 32,
>>> +                                                result & 0xffffffff);
>>> +       }
>>>  }
>>>
>>>  /*
>>> --
>>> 2.6.5
>>>
David Laight March 3, 2016, 4:12 p.m. UTC | #5
From: Tom Herbert
> Sent: 02 March 2016 22:19
...
> +	/* Main loop using 64byte blocks */
> +	for (; len > 64; len -= 64, buff += 64) {
> +		asm("addq 0*8(%[src]),%[res]\n\t"
> +		    "adcq 1*8(%[src]),%[res]\n\t"
> +		    "adcq 2*8(%[src]),%[res]\n\t"
> +		    "adcq 3*8(%[src]),%[res]\n\t"
> +		    "adcq 4*8(%[src]),%[res]\n\t"
> +		    "adcq 5*8(%[src]),%[res]\n\t"
> +		    "adcq 6*8(%[src]),%[res]\n\t"
> +		    "adcq 7*8(%[src]),%[res]\n\t"
> +		    "adcq $0,%[res]"
> +		    : [res] "=r" (result)
> +		    : [src] "r" (buff),
> +		    "[res]" (result));

Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
without any unrolling?

...
> +	/* Sum over any remaining bytes (< 8 of them) */
> +	if (len & 0x7) {
> +		unsigned long val;
> +		/*
> +		 * Since "len" is > 8 here we backtrack in the buffer to load
> +		 * the outstanding bytes into the low order bytes of a quad and
> +		 * then shift to extract the relevant bytes. By doing this we
> +		 * avoid additional calls to load_unaligned_zeropad.

That comment is wrong. Maybe:
		 * Read the last 8 bytes of the buffer then shift to extract
		 * the required bytes.
		 * This is safe because the original length was > 8 and avoids
		 * any problems reading beyond the end of the valid data.

	David
Linus Torvalds March 3, 2016, 6:43 p.m. UTC | #6
On Thu, Mar 3, 2016 at 8:12 AM, David Laight <David.Laight@aculab.com> wrote:
>
> Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
> without any unrolling?

Is that actually supposed to work ok these days? jcxz used to be quite
slow, and is historically *never* used.

Now, in theory, loop constructs can actually do better on branch
prediction etc, but Intel seems to have never really tried to push
them, and has instead pretty much discouraged them in favor of making
the normal jumps go faster (including all the instruction fusion etc)

From what I have seen, the whole "don't use LOOP or JRCXZ" has not changed.

           Linus
David Laight March 4, 2016, 10:38 a.m. UTC | #7
From: Linus Torvalds
> Sent: 03 March 2016 18:44
>
> On Thu, Mar 3, 2016 at 8:12 AM, David Laight <David.Laight@aculab.com> wrote:
> >
> > Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
> > without any unrolling?
>
> Is that actually supposed to work ok these days? jcxz used to be quite
> slow, and is historically *never* used.
>
> Now, in theory, loop constructs can actually do better on branch
> prediction etc, but Intel seems to have never really tried to push
> them, and has instead pretty much discouraged them in favor of making
> the normal jumps go faster (including all the instruction fusion etc)

Yes, they've even added the 'adc using the overflow flag' but not made
it possible to put that into a loop.

> From what I have seen, the whole "don't use LOOP or JRCXZ" has not changed.

LOOP is still slow on intel cpus (but is single clock on recentish amd ones).

JCXZ is reasonable on most cpus, certainly all the ones we care about these days.
On intel cpus JCXZ is still 2 clocks, but it removes the dependency on any
flags (which all other conditional instructions have).
The difficulty is using it for a loop (you need JCXNZ or a fast LOOP).
An alternative to the 'JCXZ, JMPS' pair would be to move the high bits
of the counter into the low bits of cx so that cx would become non-zero
on the last iteration.

	David
Alexander H Duyck March 5, 2016, 12:15 a.m. UTC | #8
On Fri, Mar 4, 2016 at 2:38 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Linus Torvalds
>> Sent: 03 March 2016 18:44
>>
>> On Thu, Mar 3, 2016 at 8:12 AM, David Laight <David.Laight@aculab.com> wrote:
>> >
>> > Did you try the asm loop that used 'leax %rcx..., jcxz... jmps..'
>> > without any unrolling?
>>
>> Is that actually supposed to work ok these days? jcxz used to be quite
>> slow, and is historically *never* used.
>>
>> Now, in theory, loop constructs can actually do better on branch
>> prediction etc, but Intel seems to have never really tried to push
>> them, and has instead pretty much discouraged them in favor of making
>> the normal jumps go faster (including all the instruction fusion etc)
>
> Yes, they've even added the 'adc using the overflow flag' but not made
> it possible to put that into a loop.
>
>> From what I have seen, the whole "don't use LOOP or JRCXZ" has not changed.
>
> LOOP is still slow on intel cpus (but is single clock on recentish amd ones).
>
> JCXZ is reasonable on most cpus, certainly all the ones we care about these days.
> On intel cpus JCXZ is still 2 clocks, but it removes the dependency on any
> flags (which all other conditional instructions have).
> The difficulty is using it for a loop (you need JCXNZ or a fast LOOP).
> An alternative to the 'JCXZ, JMPS' pair would be to move the high bits
> of the counter into the low bits of cx so that cx would become non-zero
> on the last iteration.

Actually probably the easiest way to go on x86 is to just replace the
use of len with (len >> 6) and use decl or incl instead of addl or
subl, and lea instead of addq for the buff address.  None of those
instructions affect the carry flag, which is how such loops were
intended to be implemented.

I've been doing a bit of testing and that seems to work without
needing the adcq until after you exit the loop, but doesn't give that
much of a gain in speed for dropping the instruction from the
hot-path.  I suspect we are probably memory bottle-necked already in
the loop so dropping an instruction or two doesn't gain you much.
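
A minimal sketch of that loop shape, assuming at least one full 64-byte
block and a separate tail path; the helper name is made up here for
illustration and is not part of the patch:

static unsigned long csum_64byte_blocks(const void *buff, unsigned int blocks,
					unsigned long result)
{
	/* blocks must be >= 1; the caller handles any remaining tail bytes */
	asm("clc\n\t"
	    "0:\n\t"
	    "adcq 0*8(%[src]),%[res]\n\t"
	    "adcq 1*8(%[src]),%[res]\n\t"
	    "adcq 2*8(%[src]),%[res]\n\t"
	    "adcq 3*8(%[src]),%[res]\n\t"
	    "adcq 4*8(%[src]),%[res]\n\t"
	    "adcq 5*8(%[src]),%[res]\n\t"
	    "adcq 6*8(%[src]),%[res]\n\t"
	    "adcq 7*8(%[src]),%[res]\n\t"
	    "leaq 64(%[src]),%[src]\n\t"	/* leaq advances src without touching flags */
	    "decl %[cnt]\n\t"			/* decl sets ZF but preserves CF */
	    "jnz 0b\n\t"
	    "adcq $0,%[res]"			/* fold the final carry once, after the loop */
	    : [res] "+r" (result), [src] "+r" (buff), [cnt] "+r" (blocks)
	    :
	    : "cc", "memory");
	return result;
}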

- Alex
David Laight March 7, 2016, 1:56 p.m. UTC | #9
From: Alexander Duyck
 ...
> Actually probably the easiest way to go on x86 is to just replace the
> use of len with (len >> 6) and use decl or incl instead of addl or
> subl, and lea instead of addq for the buff address.  None of those
> instructions effect the carry flag as this is how such loops were
> intended to be implemented.
>
> I've been doing a bit of testing and that seems to work without
> needing the adcq until after you exit the loop, but doesn't give that
> much of a gain in speed for dropping the instruction from the
> hot-path.  I suspect we are probably memory bottle-necked already in
> the loop so dropping an instruction or two doesn't gain you much.

Right, any superscalar architecture gives you some instructions
'for free' if they can execute at the same time as those on the
critical path (in this case the memory reads and the adc).
This is why loop unrolling can be pointless.

So the loop:
10:	adcq (%rdx,%rcx,8),%rax
	inc %rcx
	jnz 10b
could easily be as fast as anything that doesn't use the 'new'
instructions that use the overflow flag.
That loop might be measurably faster for aligned buffers.

	David
Tom Herbert March 7, 2016, 5:33 p.m. UTC | #10
On Mon, Mar 7, 2016 at 5:56 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Alexander Duyck
>  ...
>> Actually probably the easiest way to go on x86 is to just replace the
>> use of len with (len >> 6) and use decl or incl instead of addl or
>> subl, and lea instead of addq for the buff address.  None of those
>> instructions effect the carry flag as this is how such loops were
>> intended to be implemented.
>>
>> I've been doing a bit of testing and that seems to work without
>> needing the adcq until after you exit the loop, but doesn't give that
>> much of a gain in speed for dropping the instruction from the
>> hot-path.  I suspect we are probably memory bottle-necked already in
>> the loop so dropping an instruction or two doesn't gain you much.
>
> Right, any superscalar architecture gives you some instructions
> 'for free' if they can execute at the same time as those on the
> critical path (in this case the memory reads and the adc).
> This is why loop unrolling can be pointless.
>
> So the loop:
> 10:     addc %rax,(%rdx,%rcx,8)
>         inc %rcx
>         jnz 10b
> could easily be as fast as anything that doesn't use the 'new'
> instructions that use the overflow flag.
> That loop might be measurable faster for aligned buffers.

Tested by replacing the unrolled loop in my patch with just:

if (len >= 8) {
                asm("clc\n\t"
                    "0: adcq (%[src],%%rcx,8),%[res]\n\t"
                    "decl %%ecx\n\t"
                    "jge 0b\n\t"
                    "adcq $0, %[res]\n\t"
                            : [res] "=r" (result)
                            : [src] "r" (buff), "[res]" (result), "c"
((len >> 3) - 1));
}

This seems to be significantly slower:

1400 bytes: 797 nsecs vs. 202 nsecs
40 bytes: 6.5 nsecs vs. 26.8 nsecs

Tom

>
>         David
>
Alexander H Duyck March 7, 2016, 11:52 p.m. UTC | #11
On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <David.Laight@aculab.com> wrote:
>> From: Alexander Duyck
>>  ...
>>> Actually probably the easiest way to go on x86 is to just replace the
>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>> subl, and lea instead of addq for the buff address.  None of those
>>> instructions effect the carry flag as this is how such loops were
>>> intended to be implemented.
>>>
>>> I've been doing a bit of testing and that seems to work without
>>> needing the adcq until after you exit the loop, but doesn't give that
>>> much of a gain in speed for dropping the instruction from the
>>> hot-path.  I suspect we are probably memory bottle-necked already in
>>> the loop so dropping an instruction or two doesn't gain you much.
>>
>> Right, any superscalar architecture gives you some instructions
>> 'for free' if they can execute at the same time as those on the
>> critical path (in this case the memory reads and the adc).
>> This is why loop unrolling can be pointless.
>>
>> So the loop:
>> 10:     addc %rax,(%rdx,%rcx,8)
>>         inc %rcx
>>         jnz 10b
>> could easily be as fast as anything that doesn't use the 'new'
>> instructions that use the overflow flag.
>> That loop might be measurable faster for aligned buffers.
>
> Tested by replacing the unrolled loop in my patch with just:
>
> if (len >= 8) {
>                 asm("clc\n\t"
>                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>                     "decl %%ecx\n\t"
>                     "jge 0b\n\t"
>                     "adcq $0, %[res]\n\t"
>                             : [res] "=r" (result)
>                             : [src] "r" (buff), "[res]" (result), "c"
> ((len >> 3) - 1));
> }
>
> This seems to be significantly slower:
>
> 1400 bytes: 797 nsecs vs. 202 nsecs
> 40 bytes: 6.5 nsecs vs. 26.8 nsecs

You still need the loop unrolling as the decl and jge have some
overhead.  You can't just get rid of it with a single call in a tight
loop but it should improve things.  The gain from what I have seen
ends up being minimal though.  I haven't really noticed all that much
in my tests anyway.

I have been doing some testing and the penalty for an unaligned
checksum can get pretty big if the data-set is big enough.  I was
messing around and tried doing a checksum over 32K minus some offset
and was seeing a penalty of about 200 cycles per 64K frame.

One thought I had is that we may want to look into making an inline
function that we can call for compile-time defined lengths less than
64.  Maybe call it something like __csum_partial and we could then use
that in place of csum_partial for all those headers that are a fixed
length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
we might be able to look at taking care of alignment for csum_partial
which will improve the skb_checksum() case without impacting the
header pulling cases as much since that code would be inlined
elsewhere.
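
A rough sketch of that idea, using __csum_partial as a placeholder name
and assuming the existing add32_with_carry() helper from checksum_64.h;
with a compile-time-constant length that is a multiple of 4 the compiler
can fully unroll the loop, and anything else (including the 14-byte
Ethernet header, which would need an extra 2-byte step) falls back to
the out-of-line csum_partial():

static __always_inline __wsum __csum_partial(const void *buff, int len,
					     __wsum sum)
{
	if (__builtin_constant_p(len) && len < 64 && (len & 3) == 0) {
		unsigned int result = (__force unsigned int)sum;
		int i;

		/* Constant trip count, so this unrolls at compile time */
		for (i = 0; i < len; i += 4)
			result = add32_with_carry(result,
						  *(const unsigned int *)(buff + i));
		return (__force __wsum)result;
	}
	return csum_partial(buff, len, sum);
}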

- Alex
Tom Herbert March 8, 2016, 12:07 a.m. UTC | #12
On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <tom@herbertland.com> wrote:
>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <David.Laight@aculab.com> wrote:
>>> From: Alexander Duyck
>>>  ...
>>>> Actually probably the easiest way to go on x86 is to just replace the
>>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>>> subl, and lea instead of addq for the buff address.  None of those
>>>> instructions effect the carry flag as this is how such loops were
>>>> intended to be implemented.
>>>>
>>>> I've been doing a bit of testing and that seems to work without
>>>> needing the adcq until after you exit the loop, but doesn't give that
>>>> much of a gain in speed for dropping the instruction from the
>>>> hot-path.  I suspect we are probably memory bottle-necked already in
>>>> the loop so dropping an instruction or two doesn't gain you much.
>>>
>>> Right, any superscalar architecture gives you some instructions
>>> 'for free' if they can execute at the same time as those on the
>>> critical path (in this case the memory reads and the adc).
>>> This is why loop unrolling can be pointless.
>>>
>>> So the loop:
>>> 10:     addc %rax,(%rdx,%rcx,8)
>>>         inc %rcx
>>>         jnz 10b
>>> could easily be as fast as anything that doesn't use the 'new'
>>> instructions that use the overflow flag.
>>> That loop might be measurable faster for aligned buffers.
>>
>> Tested by replacing the unrolled loop in my patch with just:
>>
>> if (len >= 8) {
>>                 asm("clc\n\t"
>>                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>>                     "decl %%ecx\n\t"
>>                     "jge 0b\n\t"
>>                     "adcq $0, %[res]\n\t"
>>                             : [res] "=r" (result)
>>                             : [src] "r" (buff), "[res]" (result), "c"
>> ((len >> 3) - 1));
>> }
>>
>> This seems to be significantly slower:
>>
>> 1400 bytes: 797 nsecs vs. 202 nsecs
>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>
> You still need the loop unrolling as the decl and jge have some
> overhead.  You can't just get rid of it with a single call in a tight
> loop but it should improve things.  The gain from what I have seen
> ends up being minimal though.  I haven't really noticed all that much
> in my tests anyway.
>
> I have been doing some testing and the penalty for an unaligned
> checksum can get pretty big if the data-set is big enough.  I was
> messing around and tried doing a checksum over 32K minus some offset
> and was seeing a penalty of about 200 cycles per 64K frame.
>
Out of how many cycles to checksum 64K though?

> One thought I had is that we may want to look into making an inline
> function that we can call for compile-time defined lengths less than
> 64.  Maybe call it something like __csum_partial and we could then use
> that in place of csum_partial for all those headers that are a fixed
> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
> we might be able to look at taking care of alignment for csum_partial
> which will improve the skb_checksum() case without impacting the
> header pulling cases as much since that code would be inlined
> elsewhere.
>
As I said previously, if alignment really is a factor then we can
check up front if a buffer crosses a page boundary and call the slow
path function (original code). I'm seeing a 1 nsec hit to add this
check.
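
A sketch of what that up-front test could look like (helper name made
up, PAGE_MASK from asm/page.h; assumes len >= 1):

/* True if [buff, buff + len) straddles a page boundary */
static inline bool csum_crosses_page(const void *buff, int len)
{
	unsigned long first = (unsigned long)buff;
	unsigned long last = first + len - 1;

	return ((first ^ last) & PAGE_MASK) != 0;
}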

Tom

> - Alex
Alexander H Duyck March 8, 2016, 12:49 a.m. UTC | #13
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <tom@herbertland.com> wrote:
>>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <David.Laight@aculab.com> wrote:
>>>> From: Alexander Duyck
>>>>  ...
>>>>> Actually probably the easiest way to go on x86 is to just replace the
>>>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>>>> subl, and lea instead of addq for the buff address.  None of those
>>>>> instructions effect the carry flag as this is how such loops were
>>>>> intended to be implemented.
>>>>>
>>>>> I've been doing a bit of testing and that seems to work without
>>>>> needing the adcq until after you exit the loop, but doesn't give that
>>>>> much of a gain in speed for dropping the instruction from the
>>>>> hot-path.  I suspect we are probably memory bottle-necked already in
>>>>> the loop so dropping an instruction or two doesn't gain you much.
>>>>
>>>> Right, any superscalar architecture gives you some instructions
>>>> 'for free' if they can execute at the same time as those on the
>>>> critical path (in this case the memory reads and the adc).
>>>> This is why loop unrolling can be pointless.
>>>>
>>>> So the loop:
>>>> 10:     addc %rax,(%rdx,%rcx,8)
>>>>         inc %rcx
>>>>         jnz 10b
>>>> could easily be as fast as anything that doesn't use the 'new'
>>>> instructions that use the overflow flag.
>>>> That loop might be measurable faster for aligned buffers.
>>>
>>> Tested by replacing the unrolled loop in my patch with just:
>>>
>>> if (len >= 8) {
>>>                 asm("clc\n\t"
>>>                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>>>                     "decl %%ecx\n\t"
>>>                     "jge 0b\n\t"
>>>                     "adcq $0, %[res]\n\t"
>>>                             : [res] "=r" (result)
>>>                             : [src] "r" (buff), "[res]" (result), "c"
>>> ((len >> 3) - 1));
>>> }
>>>
>>> This seems to be significantly slower:
>>>
>>> 1400 bytes: 797 nsecs vs. 202 nsecs
>>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>>
>> You still need the loop unrolling as the decl and jge have some
>> overhead.  You can't just get rid of it with a single call in a tight
>> loop but it should improve things.  The gain from what I have seen
>> ends up being minimal though.  I haven't really noticed all that much
>> in my tests anyway.
>>
>> I have been doing some testing and the penalty for an unaligned
>> checksum can get pretty big if the data-set is big enough.  I was
>> messing around and tried doing a checksum over 32K minus some offset
>> and was seeing a penalty of about 200 cycles per 64K frame.
>>
> Out of how many cycles to checksum 64K though?

So the clock cycle counts I am seeing are ~16660 for unaligned vs 16416
for aligned.  So yeah the effect is only about a 1.5% penalty on the
total time.

>> One thought I had is that we may want to look into making an inline
>> function that we can call for compile-time defined lengths less than
>> 64.  Maybe call it something like __csum_partial and we could then use
>> that in place of csum_partial for all those headers that are a fixed
>> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
>> we might be able to look at taking care of alignment for csum_partial
>> which will improve the skb_checksum() case without impacting the
>> header pulling cases as much since that code would be inlined
>> elsewhere.
>>
> As I said previously, if alignment really is a factor then we can
> check up front if a buffer crosses a page boundary and call the slow
> path function (original code). I'm seeing a 1 nsec hit to add this
> check.

Well I was just noticing there are a number of places we can get an
even bigger benefit if we just bypass the need for csum_partial
entirely.  For example the DSA code is calling csum_partial to extract
2 bytes.  Same thing for protocols such as VXLAN and the like.  If we
could catch cases like these with a __builtin_constant_p check then we
might be able to save some significant CPU time by avoiding the
function call entirely and just doing some inline addition on the
input values directly.

- Alex
Tom Herbert March 8, 2016, 1:03 a.m. UTC | #14
On Mon, Mar 7, 2016 at 4:49 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert <tom@herbertland.com> wrote:
>>>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight <David.Laight@aculab.com> wrote:
>>>>> From: Alexander Duyck
>>>>>  ...
>>>>>> Actually probably the easiest way to go on x86 is to just replace the
>>>>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>>>>> subl, and lea instead of addq for the buff address.  None of those
>>>>>> instructions effect the carry flag as this is how such loops were
>>>>>> intended to be implemented.
>>>>>>
>>>>>> I've been doing a bit of testing and that seems to work without
>>>>>> needing the adcq until after you exit the loop, but doesn't give that
>>>>>> much of a gain in speed for dropping the instruction from the
>>>>>> hot-path.  I suspect we are probably memory bottle-necked already in
>>>>>> the loop so dropping an instruction or two doesn't gain you much.
>>>>>
>>>>> Right, any superscalar architecture gives you some instructions
>>>>> 'for free' if they can execute at the same time as those on the
>>>>> critical path (in this case the memory reads and the adc).
>>>>> This is why loop unrolling can be pointless.
>>>>>
>>>>> So the loop:
>>>>> 10:     addc %rax,(%rdx,%rcx,8)
>>>>>         inc %rcx
>>>>>         jnz 10b
>>>>> could easily be as fast as anything that doesn't use the 'new'
>>>>> instructions that use the overflow flag.
>>>>> That loop might be measurable faster for aligned buffers.
>>>>
>>>> Tested by replacing the unrolled loop in my patch with just:
>>>>
>>>> if (len >= 8) {
>>>>                 asm("clc\n\t"
>>>>                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>>>>                     "decl %%ecx\n\t"
>>>>                     "jge 0b\n\t"
>>>>                     "adcq $0, %[res]\n\t"
>>>>                             : [res] "=r" (result)
>>>>                             : [src] "r" (buff), "[res]" (result), "c"
>>>> ((len >> 3) - 1));
>>>> }
>>>>
>>>> This seems to be significantly slower:
>>>>
>>>> 1400 bytes: 797 nsecs vs. 202 nsecs
>>>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>>>
>>> You still need the loop unrolling as the decl and jge have some
>>> overhead.  You can't just get rid of it with a single call in a tight
>>> loop but it should improve things.  The gain from what I have seen
>>> ends up being minimal though.  I haven't really noticed all that much
>>> in my tests anyway.
>>>
>>> I have been doing some testing and the penalty for an unaligned
>>> checksum can get pretty big if the data-set is big enough.  I was
>>> messing around and tried doing a checksum over 32K minus some offset
>>> and was seeing a penalty of about 200 cycles per 64K frame.
>>>
>> Out of how many cycles to checksum 64K though?
>
> So the clock cycles I am seeing is ~16660 for unaligned vs 16416
> aligned.  So yeah the effect is only a 1.5% penalty for the total
> time.
>
>>> One thought I had is that we may want to look into making an inline
>>> function that we can call for compile-time defined lengths less than
>>> 64.  Maybe call it something like __csum_partial and we could then use
>>> that in place of csum_partial for all those headers that are a fixed
>>> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
>>> we might be able to look at taking care of alignment for csum_partial
>>> which will improve the skb_checksum() case without impacting the
>>> header pulling cases as much since that code would be inlined
>>> elsewhere.
>>>
>> As I said previously, if alignment really is a factor then we can
>> check up front if a buffer crosses a page boundary and call the slow
>> path function (original code). I'm seeing a 1 nsec hit to add this
>> check.
>
> Well I was just noticing there are a number of places we can get an
> even bigger benefit if we just bypass the need for csum_partial
> entirely.  For example the DSA code is calling csum_partial to extract
> 2 bytes.  Same thing for protocols such as VXLAN and the like.  If we
> could catch cases like these with a __builtin_constant_p check then we
> might be able to save some significant CPU time by avoiding the
> function call entirely and just doing some inline addition on the
> input values directly.
>
Sure, we could inline a switch function for common values (0, 2, 4, 8,
14, 16, 20, 40) maybe.

> - Alex
Linus Torvalds March 8, 2016, 1:39 a.m. UTC | #15
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert <tom@herbertland.com> wrote:
>
> As I said previously, if alignment really is a factor then we can
> check up front if a buffer crosses a page boundary and call the slow
> path function (original code). I'm seeing a 1 nsec hit to add this
> check.

It shouldn't be a factor, and you shouldn't check for it. My code was
self-aligning, and had at most one unaligned access at the beginning
(the data of which was then used to align the rest).

Tom had a version that used that. Although now that I look back at it,
it seems to be broken by some confusion about the one-byte alignment
vs 8-byte alignment.

             Linus
David Laight March 8, 2016, 11:03 a.m. UTC | #16
From: Alexander Duyck
...
> One thought I had is that we may want to look into making an inline
> function that we can call for compile-time defined lengths less than
> 64.  Maybe call it something like __csum_partial and we could then use
> that in place of csum_partial for all those headers that are a fixed
> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
> we might be able to look at taking care of alignment for csum_partial
> which will improve the skb_checksum() case without impacting the
> header pulling cases as much since that code would be inlined
> elsewhere.

I think there are some patches going through the ppc tree to do
exactly that.

	David
David Laight March 8, 2016, 11:11 a.m. UTC | #17
From: Alexander Duyck
...
> >> So the loop:
> >> 10:     addc %rax,(%rdx,%rcx,8)
> >>         inc %rcx
> >>         jnz 10b
> >> could easily be as fast as anything that doesn't use the 'new'
> >> instructions that use the overflow flag.
> >> That loop might be measurable faster for aligned buffers.
> >
> > Tested by replacing the unrolled loop in my patch with just:
> >
> > if (len >= 8) {
> >                 asm("clc\n\t"
> >                     "0: adcq (%[src],%%rcx,8),%[res]\n\t"
> >                     "decl %%ecx\n\t"
> >                     "jge 0b\n\t"
> >                     "adcq $0, %[res]\n\t"
> >                             : [res] "=r" (result)
> >                             : [src] "r" (buff), "[res]" (result), "c"
> > ((len >> 3) - 1));
> > }
> >
> > This seems to be significantly slower:
> >
> > 1400 bytes: 797 nsecs vs. 202 nsecs
> > 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>
> You still need the loop unrolling as the decl and jge have some
> overhead.  You can't just get rid of it with a single call in a tight
> loop but it should improve things.  The gain from what I have seen
> ends up being minimal though.  I haven't really noticed all that much
> in my tests anyway.

The overhead from the jge and decl is probably similar to that of the adc.
The problem is that they can't be executed at the same time because they
both have dependencies on the carry flag.

Tom did some extra tests last night: using a loop construct of 4 instructions
that didn't modify the flags register was twice the speed of the above.
I think there is a 3 instruction loop; add a second adc and it may well
be as fast as your 8-way unrolled version - but it is much simpler to implement.

That loop (using swab and jecxz to loop until the high 32bits of rcx are non-zero)
could also be used with the 'add carry using overflow bit' instructions.

	David
Tom Herbert March 8, 2016, 4:49 p.m. UTC | #18
On Mon, Mar 7, 2016 at 5:39 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert <tom@herbertland.com> wrote:
>>
>> As I said previously, if alignment really is a factor then we can
>> check up front if a buffer crosses a page boundary and call the slow
>> path function (original code). I'm seeing a 1 nsec hit to add this
>> check.
>
> It shouldn't be a factor, and you shouldn't check for it. My code was
> self-aligning, and had at most one unaligned access at the beginnig
> (the data of which was then used to align the rest).
>
Yes, but the logic to do the alignment does not come for free. The
intent of these patches is really to speed up checksums over small
buffers (like the checksum over the IP header or pulling up checksums
over protocol headers for dealing with checksum-complete). For
checksums over larger buffers, e.g. TCP/UDP checksums, we are depending
on checksum offload (there are still some cases where the host will
need to do a packet checksum, but as vendors move to providing
protocol agnostic checksums those should go away). In the VXLAN GRO
path for instance, we do a checksum pull over both the UDP header and
VXLAN header, each of which is 8 bytes. csum_partial can be trivially
implemented for a buffer of length 8 with three adcq instructions (as
in my patch). When we're using VXLAN in IPv4, both the VXLAN and UDP
headers will likely not be eight byte aligned, but alignment seems to
only be an issue when crossing a page boundary. The probability that
an 8 byte header crosses a page boundary is already very low, and with
a little bit of code drivers could pretty much guarantee that packet
headers don't straddle page boundaries. So it seems like the effort to
align small buffers, assuming they don't straddle page boundaries,
provides little or no value.
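
For reference, that trivial 8-byte case is just the len == 8 branch of
csum_partial() in the patch below, shown here as a standalone sketch
(hypothetical helper name) built on the patch's add32_with_carry3():

static inline __wsum csum_8byte_header(const void *buff, __wsum sum)
{
	/* Two 32-bit loads folded into the running sum with carry */
	return (__force __wsum)add32_with_carry3((__force u32)sum,
						 *(const unsigned int *)buff,
						 *(const unsigned int *)(buff + 4));
}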

Tom

> Tom had a version that used that. Although now that I look back at it,
> it seems to be broken by some confusion about the one-byte alignment
> vs 8-byte alignment.
>
>              Linus
diff mbox

Patch

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index cd00e17..1224f7d 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -188,6 +188,27 @@  static inline unsigned add32_with_carry(unsigned a, unsigned b)
 	return a;
 }
 
+static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
+{
+	asm("addq %2,%0\n\t"
+	    "adcq $0,%0"
+	    : "=r" (a)
+	    : "0" (a), "rm" (b));
+	return a;
+}
+
+static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
+					     unsigned int c)
+{
+	asm("addl %1,%0\n\t"
+	    "adcl %2,%0\n\t"
+	    "adcl $0,%0"
+	    : "+r" (a)
+	    : "rm" (b), "rm" (c));
+
+	return a;
+}
+
 #define HAVE_ARCH_CSUM_ADD
 static inline __wsum csum_add(__wsum csum, __wsum addend)
 {
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..7f1f60f 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -8,6 +8,7 @@ 
 #include <linux/compiler.h>
 #include <linux/module.h>
 #include <asm/checksum.h>
+#include <asm/word-at-a-time.h>
 
 static inline unsigned short from32to16(unsigned a) 
 {
@@ -21,99 +22,78 @@  static inline unsigned short from32to16(unsigned a)
 
 /*
  * Do a 64-bit checksum on an arbitrary memory area.
- * Returns a 32bit checksum.
+ * Returns a 64bit checksum.
  *
- * This isn't as time critical as it used to be because many NICs
- * do hardware checksumming these days.
- * 
- * Things tried and found to not make it faster:
- * Manual Prefetching
- * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
+ * This is optimized for small lengths such as might be common when pulling
+ * up checksums over protocol headers to handle CHECKSUM_COMPLETE (e.g.
+ * checksum over 40 bytes will be quite common for pulling up checksum over
+ * IPv6 headers).
  */
-static unsigned do_csum(const unsigned char *buff, unsigned len)
+static unsigned long do_csum(const void *buff, int len)
 {
-	unsigned odd, count;
 	unsigned long result = 0;
 
-	if (unlikely(len == 0))
-		return result; 
-	odd = 1 & (unsigned long) buff;
-	if (unlikely(odd)) {
-		result = *buff << 8;
-		len--;
-		buff++;
+	/* Check for less than a quad to sum */
+	if (len < 8) {
+		unsigned long val = load_unaligned_zeropad(buff);
+
+		return (val & ((1ul << len * 8) - 1));
+	}
+
+	/* Main loop using 64byte blocks */
+	for (; len > 64; len -= 64, buff += 64) {
+		asm("addq 0*8(%[src]),%[res]\n\t"
+		    "adcq 1*8(%[src]),%[res]\n\t"
+		    "adcq 2*8(%[src]),%[res]\n\t"
+		    "adcq 3*8(%[src]),%[res]\n\t"
+		    "adcq 4*8(%[src]),%[res]\n\t"
+		    "adcq 5*8(%[src]),%[res]\n\t"
+		    "adcq 6*8(%[src]),%[res]\n\t"
+		    "adcq 7*8(%[src]),%[res]\n\t"
+		    "adcq $0,%[res]"
+		    : [res] "=r" (result)
+		    : [src] "r" (buff),
+		    "[res]" (result));
 	}
-	count = len >> 1;		/* nr of 16-bit words.. */
-	if (count) {
-		if (2 & (unsigned long) buff) {
-			result += *(unsigned short *)buff;
-			count--;
-			len -= 2;
-			buff += 2;
-		}
-		count >>= 1;		/* nr of 32-bit words.. */
-		if (count) {
-			unsigned long zero;
-			unsigned count64;
-			if (4 & (unsigned long) buff) {
-				result += *(unsigned int *) buff;
-				count--;
-				len -= 4;
-				buff += 4;
-			}
-			count >>= 1;	/* nr of 64-bit words.. */
 
-			/* main loop using 64byte blocks */
-			zero = 0;
-			count64 = count >> 3;
-			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
-				    "adcq 1*8(%[src]),%[res]\n\t"
-				    "adcq 2*8(%[src]),%[res]\n\t"
-				    "adcq 3*8(%[src]),%[res]\n\t"
-				    "adcq 4*8(%[src]),%[res]\n\t"
-				    "adcq 5*8(%[src]),%[res]\n\t"
-				    "adcq 6*8(%[src]),%[res]\n\t"
-				    "adcq 7*8(%[src]),%[res]\n\t"
-				    "adcq %[zero],%[res]"
-				    : [res] "=r" (result)
-				    : [src] "r" (buff), [zero] "r" (zero),
-				    "[res]" (result));
-				buff += 64;
-				count64--;
-			}
+	/*
+	 * Sum over remaining quads (<= 8 of them). This uses a jump table
+	 * based on number of quads to sum. The jump assumes that each case
+	 * is 4 bytes. Each adcq instruction is 4 bytes except for adcq 0()
+	 * which is 3 bytes, so a nop instruction is inserted to make that case
+	 * 4 bytes.
+	 */
+	asm("lea 0f(, %[slen], 4), %%r11\n\t"
+	    "clc\n\t"
+	    "jmpq *%%r11\n\t"
+	    "adcq 7*8(%[src]),%[res]\n\t"
+	    "adcq 6*8(%[src]),%[res]\n\t"
+	    "adcq 5*8(%[src]),%[res]\n\t"
+	    "adcq 4*8(%[src]),%[res]\n\t"
+	    "adcq 3*8(%[src]),%[res]\n\t"
+	    "adcq 2*8(%[src]),%[res]\n\t"
+	    "adcq 1*8(%[src]),%[res]\n\t"
+	    "adcq 0*8(%[src]),%[res]\n\t"
+	    "nop\n\t"
+	    "0: adcq $0,%[res]"
+		    : [res] "=r" (result)
+		    : [src] "r" (buff),
+		      [slen] "r" (-(unsigned long)(len >> 3)), "[res]" (result)
+		    : "r11");
 
-			/* last up to 7 8byte blocks */
-			count %= 8; 
-			while (count) { 
-				asm("addq %1,%0\n\t"
-				    "adcq %2,%0\n" 
-					    : "=r" (result)
-				    : "m" (*(unsigned long *)buff), 
-				    "r" (zero),  "0" (result));
-				--count; 
-					buff += 8;
-			}
-			result = add32_with_carry(result>>32,
-						  result&0xffffffff); 
+	/* Sum over any remaining bytes (< 8 of them) */
+	if (len & 0x7) {
+		unsigned long val;
+		/*
+		 * Since "len" is > 8 here we backtrack in the buffer to load
+		 * the outstanding bytes into the low order bytes of a quad and
+		 * then shift to extract the relevant bytes. By doing this we
+		 * avoid additional calls to load_unaligned_zeropad.
+		 */
 
-			if (len & 4) {
-				result += *(unsigned int *) buff;
-				buff += 4;
-			}
-		}
-		if (len & 2) {
-			result += *(unsigned short *) buff;
-			buff += 2;
-		}
-	}
-	if (len & 1)
-		result += *buff;
-	result = add32_with_carry(result>>32, result & 0xffffffff); 
-	if (unlikely(odd)) { 
-		result = from32to16(result);
-		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
+		val = *(unsigned long *)(buff + len - 8);
+		val >>= 8 * (-len & 0x7);
+		result = add64_with_carry(val, result);
 	}
 	return result;
 }
@@ -125,15 +105,26 @@  static unsigned do_csum(const unsigned char *buff, unsigned len)
  * returns a 32-bit number suitable for feeding into itself
  * or csum_tcpudp_magic
  *
- * this function must be called with even lengths, except
- * for the last fragment, which may be odd
- *
- * it's best to have buff aligned on a 64-bit boundary
+ * Note that this implementation makes no attempt to avoid unaligned accesses
+ * (e.g. load a quad word with non 8-byte alignment). On x86 unaligned accesses
+ * only seem to be a performance penalty when the access crosses a page
+ * boundary-- such a scenario should be an extremely rare occurrence for use
+ * cases of csum_partial.
  */
 __wsum csum_partial(const void *buff, int len, __wsum sum)
 {
-	return (__force __wsum)add32_with_carry(do_csum(buff, len),
-						(__force u32)sum);
+	if (len == 8) {
+		/* Optimize trivial case */
+		return (__force __wsum)add32_with_carry3((__force u32)sum,
+						 *(unsigned int *)buff,
+						 *(unsigned int *)(buff + 4));
+	} else {
+		unsigned long result = do_csum(buff, len);
+
+		return (__force __wsum)add32_with_carry3((__force u32)sum,
+						 result >> 32,
+						 result & 0xffffffff);
+	}
 }
 
 /*