diff mbox

[net-next] loopback: sctp: add NETIF_F_SCTP_CSUM to device features

Message ID 1393074113-9922-1-git-send-email-dborkman@redhat.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Daniel Borkmann Feb. 22, 2014, 1:01 p.m. UTC
Drivers are allowed to set NETIF_F_SCTP_CSUM if they have
hardware crc32c checksumming support for the SCTP protocol.
Currently, NETIF_F_SCTP_CSUM flag is available in igb,
ixgbe, i40e/i40evf drivers and for vlan devices.

If we don't have NETIF_F_SCTP_CSUM then crc32c is done
through CPU instructions, invoked from crypto layer, or
if not available as slow-path fallback in software.

Currently, loopback device propagates checksum offloading
feature flags in dev->features, but is missing SCTP checksum
offloading. Therefore, account for NETIF_F_SCTP_CSUM as
well.

Before patch:

./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY
SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

4194304 4194304   4096    10.00    4683.50

After patch:

./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY
SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

4194304 4194304   4096    10.00    15348.26

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 drivers/net/loopback.c | 1 +
 1 file changed, 1 insertion(+)

Comments

David Laight Feb. 24, 2014, 10:17 a.m. UTC | #1
From: Daniel Borkmann
> Drivers are allowed to set NETIF_F_SCTP_CSUM if they have
> hardware crc32c checksumming support for the SCTP protocol.
> Currently, NETIF_F_SCTP_CSUM flag is available in igb,
> ixgbe, i40e/i40evf drivers and for vlan devices.
> 
> If we don't have NETIF_F_SCTP_CSUM then crc32c is done
> through CPU instructions, invoked from crypto layer, or
> if not available as slow-path fallback in software.
> 
> Currently, loopback device propagates checksum offloading
> feature flags in dev->features, but is missing SCTP checksum
> offloading. Therefore, account for NETIF_F_SCTP_CSUM as
> well.
> 
> Before patch:
> 
> ./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY
> SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 4194304 4194304   4096    10.00    4683.50
> 
> After patch:
...
> 4194304 4194304   4096    10.00    15348.26

That seems a much larger increase than you'd expect from removing
a software CRC of the data chunks.
Are you sure that some other difference in the data flows wasn't
also triggered.

I'm also not sure that 4096 is a representative message size for SCTP.
I'm not sure what else it is used for, but SCTP is used to carry various
SS7 protocols and interfaces over IP (SIGTRAN: eg M3UA). For SS7 the maximum
message size is around 270 bytes, and the average size nearer 100.

It is also difficult to model, but the typical traffic for SS7 is neither
bulk data, nor command-response. Rather it is single packets (maybe received
from one of many 64k links) that need transmitting within a reasonable time
interval, a delay of 1-2ms might be acceptable.
This basically means that Nagle has to be disabled (because you never want
the 'timeout waiting for an ack') and almost every data chunk ends up in
its own (short) ethernet frame.

	David



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Borkmann Feb. 24, 2014, 10:31 a.m. UTC | #2
On 02/24/2014 11:17 AM, David Laight wrote:
> From: Daniel Borkmann
>> Drivers are allowed to set NETIF_F_SCTP_CSUM if they have
>> hardware crc32c checksumming support for the SCTP protocol.
>> Currently, NETIF_F_SCTP_CSUM flag is available in igb,
>> ixgbe, i40e/i40evf drivers and for vlan devices.
>>
>> If we don't have NETIF_F_SCTP_CSUM then crc32c is done
>> through CPU instructions, invoked from crypto layer, or
>> if not available as slow-path fallback in software.
>>
>> Currently, loopback device propagates checksum offloading
>> feature flags in dev->features, but is missing SCTP checksum
>> offloading. Therefore, account for NETIF_F_SCTP_CSUM as
>> well.
>>
>> Before patch:
>>
>> ./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY
>> SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET
>> Recv   Send    Send
>> Socket Socket  Message  Elapsed
>> Size   Size    Size     Time     Throughput
>> bytes  bytes   bytes    secs.    10^6bits/sec
>>
>> 4194304 4194304   4096    10.00    4683.50
>>
>> After patch:
> ...
>> 4194304 4194304   4096    10.00    15348.26
>
> That seems a much larger increase than you'd expect from removing
> a software CRC of the data chunks.
> Are you sure that some other difference in the data flows wasn't
> also triggered.

Yes, I run this multiple times with similar results and I double-checked
it with perf. Current code triggers crc32c implementation in software
fallback on my machine which is very expensive.

> I'm also not sure that 4096 is a representative message size for SCTP.

I used netperf default in this case.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight Feb. 24, 2014, 10:42 a.m. UTC | #3
From: Daniel Borkmann 
> On 02/24/2014 11:17 AM, David Laight wrote:
> > From: Daniel Borkmann
> >> Drivers are allowed to set NETIF_F_SCTP_CSUM if they have
> >> hardware crc32c checksumming support for the SCTP protocol.
> >> Currently, NETIF_F_SCTP_CSUM flag is available in igb,
> >> ixgbe, i40e/i40evf drivers and for vlan devices.
> >>
> >> If we don't have NETIF_F_SCTP_CSUM then crc32c is done
> >> through CPU instructions, invoked from crypto layer, or
> >> if not available as slow-path fallback in software.
> >>
> >> Currently, loopback device propagates checksum offloading
> >> feature flags in dev->features, but is missing SCTP checksum
> >> offloading. Therefore, account for NETIF_F_SCTP_CSUM as
> >> well.
> >>
> >> Before patch:
> >>
> >> ./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY
> >> SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET
> >> Recv   Send    Send
> >> Socket Socket  Message  Elapsed
> >> Size   Size    Size     Time     Throughput
> >> bytes  bytes   bytes    secs.    10^6bits/sec
> >>
> >> 4194304 4194304   4096    10.00    4683.50
> >>
> >> After patch:
> > ...
> >> 4194304 4194304   4096    10.00    15348.26
> >
> > That seems a much larger increase than you'd expect from removing
> > a software CRC of the data chunks.
> > Are you sure that some other difference in the data flows wasn't
> > also triggered.
> 
> Yes, I run this multiple times with similar results and I double-checked
> it with perf. Current code triggers crc32c implementation in software
> fallback on my machine which is very expensive.

I'm sure it shouldn't be that expensive, you are implying that it spent
about 70% of the time doing crc32.
The loop should be dominated by the per-byte lookup in a 256 word table.
With 4k data the table will soon be in the data cache.
Unless it is (stupidly) generating the table on each call, or trying
to use a crc32 instruction, faulting, and emulating it, I wouldn't
really have expected more than a few % improvement.

> > I'm also not sure that 4096 is a representative message size for SCTP.
> 
> I used netperf default in this case.

I was just pointing out that the defaults may not be appropriate here.

	David



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Borkmann Feb. 24, 2014, 12:02 p.m. UTC | #4
On 02/24/2014 11:42 AM, David Laight wrote:
...
> I'm sure it shouldn't be that expensive, you are implying that it spent
> about 70% of the time doing crc32.

In this scenario, the following perf log I get that shows where cycles
are being spent on my machine:

  65.95%       netperf  [kernel.kallsyms]  [k] __crc32c_le
   3.79%       netperf  [kernel.kallsyms]  [k] memcpy
   2.38%       netperf  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
   0.62%       netperf  [sctp]             [k] sctp_datamsg_from_user
   0.62%       netperf  [sctp]             [k] sctp_sendmsg
   0.55%       netperf  [kernel.kallsyms]  [k] __slab_free
   0.52%       netperf  [sctp]             [k] sctp_outq_flush
   0.50%       netperf  [kernel.kallsyms]  [k] kfree
   0.49%       netperf  [kernel.kallsyms]  [k] cmpxchg_double_slab.isra.52
   0.48%       netperf  [kernel.kallsyms]  [k] kmem_cache_alloc
   0.43%       netperf  [kernel.kallsyms]  [k] __slab_alloc
   0.42%       netperf  [kernel.kallsyms]  [k] __copy_skb_header
   0.41%       netperf  [kernel.kallsyms]  [k] __alloc_skb

> The loop should be dominated by the per-byte lookup in a 256 word table.
> With 4k data the table will soon be in the data cache.
> Unless it is (stupidly) generating the table on each call, or trying
> to use a crc32 instruction, faulting, and emulating it, I wouldn't
> really have expected more than a few % improvement.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight Feb. 24, 2014, 1:24 p.m. UTC | #5
From: Daniel Borkmann
> On 02/24/2014 11:42 AM, David Laight wrote:
> ...
> > I'm sure it shouldn't be that expensive, you are implying that it spent
> > about 70% of the time doing crc32.
> 
> In this scenario, the following perf log I get that shows where cycles
> are being spent on my machine:
> 
>   65.95%       netperf  [kernel.kallsyms]  [k] __crc32c_le

WTF is that function doing!
Even doing the crc naively bit by bit shouldn't manage to use 65%.

Maybe the _le has something to do with it.
Could it be bit-reversing the crc and data bytes all the time?
The packet will (one would hope) want the crc in the same bit-order
as the data, so no bit reversal is needed - just the correct logic
and lookup table.

Which architecture and which version of crc32_le() does your kernel use?
(I'd guess there are several lurking).
Whichever function you are using wants killing.

	David



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Borkmann Feb. 24, 2014, 1:36 p.m. UTC | #6
On 02/24/2014 02:24 PM, David Laight wrote:
...
> Which architecture and which version of crc32_le() does your kernel use?

It's using slice-by-8 algorithm, and my machine is [only]
a core i7 (x86_64). Apparently, it is not using the crypto
version that is accelerated.

> (I'd guess there are several lurking).
> Whichever function you are using wants killing.
>
> 	David
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Borkmann Feb. 24, 2014, 2:06 p.m. UTC | #7
On 02/24/2014 02:24 PM, David Laight wrote:
...
> Maybe the _le has something to do with it.
> Could it be bit-reversing the crc and data bytes all the time?
> The packet will (one would hope) want the crc in the same bit-order
> as the data, so no bit reversal is needed - just the correct logic
> and lookup table.

Although it's a bit off-topic for this patch, but I'll have a look
into __crc32c_le(); maybe something is just wrong with that code.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight Feb. 24, 2014, 2:07 p.m. UTC | #8
From: Daniel Borkmann
> > Which architecture and which version of crc32_le() does your kernel use?
> 
> It's using slice-by-8 algorithm, and my machine is [only]
> a core i7 (x86_64). Apparently, it is not using the crypto
> version that is accelerated.

I still can't imagine it being that slow.
The sctp code isn't known for its fast paths, although these are
per-message, not per byte I'd have thought they'd still be significant.

Having looked at the code, try compiling with CRC_LE_BITS == 8.
I suspect the much smaller data cache footprint will help, and
the number of instructions won't be that many less (due to all
the shifts).

What may help is to have a 'double shift' table and compute
the crc's of the odd and even bytes separately, then xor them
together at the end. That might improve the dependency chains
and let all the cpu execute more instructions in parallel.

Another improvement is to read the next byte from the buffer
before processing the previous one (possibly with a slight
loop unroll).

As has been noted elsewhere, using a 16 entry lookup table
can be faster than the 256 entry one.
What I don't know is whether the crc32 can be analysed down do
a small number of shifts and xors (crc16 reduces quite nicely)
that might easily execute faster than the lookup table.

	David



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight Feb. 24, 2014, 2:28 p.m. UTC | #9
From: Daniel Borkmann 
> > Maybe the _le has something to do with it.
> > Could it be bit-reversing the crc and data bytes all the time?
> > The packet will (one would hope) want the crc in the same bit-order
> > as the data, so no bit reversal is needed - just the correct logic
> > and lookup table.
> 
> Although it's a bit off-topic for this patch, but I'll have a look
> into __crc32c_le(); maybe something is just wrong with that code.

Yes it generates a nice dependency chain of:
	xor	table_b(,%reg,4),%eax
using 8 tables (256*4 bytes each).

I bet that with a little bit of pipelining a simple byte by byte
loop should manage 1 byte every two clocks (ie 1 memory cycle per
clock).

	David



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Feb. 25, 2014, midnight UTC | #10
From: Daniel Borkmann <dborkman@redhat.com>
Date: Sat, 22 Feb 2014 14:01:53 +0100

> Drivers are allowed to set NETIF_F_SCTP_CSUM if they have
> hardware crc32c checksumming support for the SCTP protocol.
> Currently, NETIF_F_SCTP_CSUM flag is available in igb,
> ixgbe, i40e/i40evf drivers and for vlan devices.
> 
> If we don't have NETIF_F_SCTP_CSUM then crc32c is done
> through CPU instructions, invoked from crypto layer, or
> if not available as slow-path fallback in software.
> 
> Currently, loopback device propagates checksum offloading
> feature flags in dev->features, but is missing SCTP checksum
> offloading. Therefore, account for NETIF_F_SCTP_CSUM as
> well.
 ...
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>

Applied, thanks a lot Daniel.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 771c9bf..282effe 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -176,6 +176,7 @@  static void loopback_setup(struct net_device *dev)
 		| NETIF_F_UFO
 		| NETIF_F_HW_CSUM
 		| NETIF_F_RXCSUM
+		| NETIF_F_SCTP_CSUM
 		| NETIF_F_HIGHDMA
 		| NETIF_F_LLTX
 		| NETIF_F_NETNS_LOCAL