Message ID | 1393074113-9922-1-git-send-email-dborkman@redhat.com |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
From: Daniel Borkmann > Drivers are allowed to set NETIF_F_SCTP_CSUM if they have > hardware crc32c checksumming support for the SCTP protocol. > Currently, NETIF_F_SCTP_CSUM flag is available in igb, > ixgbe, i40e/i40evf drivers and for vlan devices. > > If we don't have NETIF_F_SCTP_CSUM then crc32c is done > through CPU instructions, invoked from crypto layer, or > if not available as slow-path fallback in software. > > Currently, loopback device propagates checksum offloading > feature flags in dev->features, but is missing SCTP checksum > offloading. Therefore, account for NETIF_F_SCTP_CSUM as > well. > > Before patch: > > ./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY > SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET > Recv Send Send > Socket Socket Message Elapsed > Size Size Size Time Throughput > bytes bytes bytes secs. 10^6bits/sec > > 4194304 4194304 4096 10.00 4683.50 > > After patch: ... > 4194304 4194304 4096 10.00 15348.26 That seems a much larger increase than you'd expect from removing a software CRC of the data chunks. Are you sure that some other difference in the data flows wasn't also triggered. I'm also not sure that 4096 is a representative message size for SCTP. I'm not sure what else it is used for, but SCTP is used to carry various SS7 protocols and interfaces over IP (SIGTRAN: eg M3UA). For SS7 the maximum message size is around 270 bytes, and the average size nearer 100. It is also difficult to model, but the typical traffic for SS7 is neither bulk data, nor command-response. Rather it is single packets (maybe received from one of many 64k links) that need transmitting within a reasonable time interval, a delay of 1-2ms might be acceptable. This basically means that Nagle has to be disabled (because you never want the 'timeout waiting for an ack') and almost every data chunk ends up in its own (short) ethernet frame. David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/24/2014 11:17 AM, David Laight wrote: > From: Daniel Borkmann >> Drivers are allowed to set NETIF_F_SCTP_CSUM if they have >> hardware crc32c checksumming support for the SCTP protocol. >> Currently, NETIF_F_SCTP_CSUM flag is available in igb, >> ixgbe, i40e/i40evf drivers and for vlan devices. >> >> If we don't have NETIF_F_SCTP_CSUM then crc32c is done >> through CPU instructions, invoked from crypto layer, or >> if not available as slow-path fallback in software. >> >> Currently, loopback device propagates checksum offloading >> feature flags in dev->features, but is missing SCTP checksum >> offloading. Therefore, account for NETIF_F_SCTP_CSUM as >> well. >> >> Before patch: >> >> ./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY >> SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET >> Recv Send Send >> Socket Socket Message Elapsed >> Size Size Size Time Throughput >> bytes bytes bytes secs. 10^6bits/sec >> >> 4194304 4194304 4096 10.00 4683.50 >> >> After patch: > ... >> 4194304 4194304 4096 10.00 15348.26 > > That seems a much larger increase than you'd expect from removing > a software CRC of the data chunks. > Are you sure that some other difference in the data flows wasn't > also triggered. Yes, I run this multiple times with similar results and I double-checked it with perf. Current code triggers crc32c implementation in software fallback on my machine which is very expensive. > I'm also not sure that 4096 is a representative message size for SCTP. I used netperf default in this case. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Daniel Borkmann > On 02/24/2014 11:17 AM, David Laight wrote: > > From: Daniel Borkmann > >> Drivers are allowed to set NETIF_F_SCTP_CSUM if they have > >> hardware crc32c checksumming support for the SCTP protocol. > >> Currently, NETIF_F_SCTP_CSUM flag is available in igb, > >> ixgbe, i40e/i40evf drivers and for vlan devices. > >> > >> If we don't have NETIF_F_SCTP_CSUM then crc32c is done > >> through CPU instructions, invoked from crypto layer, or > >> if not available as slow-path fallback in software. > >> > >> Currently, loopback device propagates checksum offloading > >> feature flags in dev->features, but is missing SCTP checksum > >> offloading. Therefore, account for NETIF_F_SCTP_CSUM as > >> well. > >> > >> Before patch: > >> > >> ./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY > >> SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET > >> Recv Send Send > >> Socket Socket Message Elapsed > >> Size Size Size Time Throughput > >> bytes bytes bytes secs. 10^6bits/sec > >> > >> 4194304 4194304 4096 10.00 4683.50 > >> > >> After patch: > > ... > >> 4194304 4194304 4096 10.00 15348.26 > > > > That seems a much larger increase than you'd expect from removing > > a software CRC of the data chunks. > > Are you sure that some other difference in the data flows wasn't > > also triggered. > > Yes, I run this multiple times with similar results and I double-checked > it with perf. Current code triggers crc32c implementation in software > fallback on my machine which is very expensive. I'm sure it shouldn't be that expensive, you are implying that it spent about 70% of the time doing crc32. The loop should be dominated by the per-byte lookup in a 256 word table. With 4k data the table will soon be in the data cache. Unless it is (stupidly) generating the table on each call, or trying to use a crc32 instruction, faulting, and emulating it, I wouldn't really have expected more than a few % improvement. > > I'm also not sure that 4096 is a representative message size for SCTP. > > I used netperf default in this case. I was just pointing out that the defaults may not be appropriate here. David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/24/2014 11:42 AM, David Laight wrote: ... > I'm sure it shouldn't be that expensive, you are implying that it spent > about 70% of the time doing crc32. In this scenario, the following perf log I get that shows where cycles are being spent on my machine: 65.95% netperf [kernel.kallsyms] [k] __crc32c_le 3.79% netperf [kernel.kallsyms] [k] memcpy 2.38% netperf [kernel.kallsyms] [k] copy_user_enhanced_fast_string 0.62% netperf [sctp] [k] sctp_datamsg_from_user 0.62% netperf [sctp] [k] sctp_sendmsg 0.55% netperf [kernel.kallsyms] [k] __slab_free 0.52% netperf [sctp] [k] sctp_outq_flush 0.50% netperf [kernel.kallsyms] [k] kfree 0.49% netperf [kernel.kallsyms] [k] cmpxchg_double_slab.isra.52 0.48% netperf [kernel.kallsyms] [k] kmem_cache_alloc 0.43% netperf [kernel.kallsyms] [k] __slab_alloc 0.42% netperf [kernel.kallsyms] [k] __copy_skb_header 0.41% netperf [kernel.kallsyms] [k] __alloc_skb > The loop should be dominated by the per-byte lookup in a 256 word table. > With 4k data the table will soon be in the data cache. > Unless it is (stupidly) generating the table on each call, or trying > to use a crc32 instruction, faulting, and emulating it, I wouldn't > really have expected more than a few % improvement. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Daniel Borkmann > On 02/24/2014 11:42 AM, David Laight wrote: > ... > > I'm sure it shouldn't be that expensive, you are implying that it spent > > about 70% of the time doing crc32. > > In this scenario, the following perf log I get that shows where cycles > are being spent on my machine: > > 65.95% netperf [kernel.kallsyms] [k] __crc32c_le WTF is that function doing! Even doing the crc naively bit by bit shouldn't manage to use 65%. Maybe the _le has something to do with it. Could it be bit-reversing the crc and data bytes all the time? The packet will (one would hope) want the crc in the same bit-order as the data, so no bit reversal is needed - just the correct logic and lookup table. Which architecture and which version of crc32_le() does your kernel use? (I'd guess there are several lurking). Whichever function you are using wants killing. David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/24/2014 02:24 PM, David Laight wrote: ... > Which architecture and which version of crc32_le() does your kernel use? It's using slice-by-8 algorithm, and my machine is [only] a core i7 (x86_64). Apparently, it is not using the crypto version that is accelerated. > (I'd guess there are several lurking). > Whichever function you are using wants killing. > > David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/24/2014 02:24 PM, David Laight wrote: ... > Maybe the _le has something to do with it. > Could it be bit-reversing the crc and data bytes all the time? > The packet will (one would hope) want the crc in the same bit-order > as the data, so no bit reversal is needed - just the correct logic > and lookup table. Although it's a bit off-topic for this patch, but I'll have a look into __crc32c_le(); maybe something is just wrong with that code. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Daniel Borkmann > > Which architecture and which version of crc32_le() does your kernel use? > > It's using slice-by-8 algorithm, and my machine is [only] > a core i7 (x86_64). Apparently, it is not using the crypto > version that is accelerated. I still can't imagine it being that slow. The sctp code isn't known for its fast paths, although these are per-message, not per byte I'd have thought they'd still be significant. Having looked at the code, try compiling with CRC_LE_BITS == 8. I suspect the much smaller data cache footprint will help, and the number of instructions won't be that many less (due to all the shifts). What may help is to have a 'double shift' table and compute the crc's of the odd and even bytes separately, then xor them together at the end. That might improve the dependency chains and let all the cpu execute more instructions in parallel. Another improvement is to read the next byte from the buffer before processing the previous one (possibly with a slight loop unroll). As has been noted elsewhere, using a 16 entry lookup table can be faster than the 256 entry one. What I don't know is whether the crc32 can be analysed down do a small number of shifts and xors (crc16 reduces quite nicely) that might easily execute faster than the lookup table. David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Daniel Borkmann > > Maybe the _le has something to do with it. > > Could it be bit-reversing the crc and data bytes all the time? > > The packet will (one would hope) want the crc in the same bit-order > > as the data, so no bit reversal is needed - just the correct logic > > and lookup table. > > Although it's a bit off-topic for this patch, but I'll have a look > into __crc32c_le(); maybe something is just wrong with that code. Yes it generates a nice dependency chain of: xor table_b(,%reg,4),%eax using 8 tables (256*4 bytes each). I bet that with a little bit of pipelining a simple byte by byte loop should manage 1 byte every two clocks (ie 1 memory cycle per clock). David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Daniel Borkmann <dborkman@redhat.com> Date: Sat, 22 Feb 2014 14:01:53 +0100 > Drivers are allowed to set NETIF_F_SCTP_CSUM if they have > hardware crc32c checksumming support for the SCTP protocol. > Currently, NETIF_F_SCTP_CSUM flag is available in igb, > ixgbe, i40e/i40evf drivers and for vlan devices. > > If we don't have NETIF_F_SCTP_CSUM then crc32c is done > through CPU instructions, invoked from crypto layer, or > if not available as slow-path fallback in software. > > Currently, loopback device propagates checksum offloading > feature flags in dev->features, but is missing SCTP checksum > offloading. Therefore, account for NETIF_F_SCTP_CSUM as > well. ... > Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Applied, thanks a lot Daniel. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index 771c9bf..282effe 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -176,6 +176,7 @@ static void loopback_setup(struct net_device *dev) | NETIF_F_UFO | NETIF_F_HW_CSUM | NETIF_F_RXCSUM + | NETIF_F_SCTP_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX | NETIF_F_NETNS_LOCAL
Drivers are allowed to set NETIF_F_SCTP_CSUM if they have hardware crc32c checksumming support for the SCTP protocol. Currently, NETIF_F_SCTP_CSUM flag is available in igb, ixgbe, i40e/i40evf drivers and for vlan devices. If we don't have NETIF_F_SCTP_CSUM then crc32c is done through CPU instructions, invoked from crypto layer, or if not available as slow-path fallback in software. Currently, loopback device propagates checksum offloading feature flags in dev->features, but is missing SCTP checksum offloading. Therefore, account for NETIF_F_SCTP_CSUM as well. Before patch: ./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 4194304 4194304 4096 10.00 4683.50 After patch: ./netperf_sctp -H 192.168.0.100 -t SCTP_STREAM_MANY SCTP 1-TO-MANY STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.100 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 4194304 4194304 4096 10.00 15348.26 Signed-off-by: Daniel Borkmann <dborkman@redhat.com> --- drivers/net/loopback.c | 1 + 1 file changed, 1 insertion(+)