Message ID:   a4bc263e62ced9f5790932f11e7d829c36566808.1471386679.git.marcelo.leitner@gmail.com
State:        Superseded, archived
Delegated to: David Miller
On 08/17/2016 12:35 AM, Marcelo Ricardo Leitner wrote:
> Because otherwise when crc computation is still needed it's way more
> expensive than on a linear buffer to the point that it affects
> performance.
>
> It's so expensive that netperf test gives a perf output as below:
>
>   Overhead  Shared Object  Symbol
>     69,44%  [kernel]       [k] gf2_matrix_square
>      2,84%  [kernel]       [k] crc32_generic_combine.part.0
>      2,78%  [kernel]       [k] _raw_spin_lock_bh

What kernel is this, seems not net kernel?

  $ git grep -n gf2_matrix_square
  $ git grep -n crc32_generic_combine
  $

Maybe RHEL? Did you consider backporting 6d514b4e7737 et al?

> And performance goes from 2Gbit/s to 0.5Gbit/s on this test. Doing the
> linearization before checksumming is enough to restore it.
>
> Fixes: 3acb50c18d8d ("sctp: delay as much as possible skb_linearize")
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
On Wed, Aug 17, 2016 at 12:59:17AM +0200, Daniel Borkmann wrote:
> On 08/17/2016 12:35 AM, Marcelo Ricardo Leitner wrote:
> > Because otherwise when crc computation is still needed it's way more
> > expensive than on a linear buffer to the point that it affects
> > performance.
> >
> > It's so expensive that netperf test gives a perf output as below:
> >
> >   Overhead  Shared Object  Symbol
> >     69,44%  [kernel]       [k] gf2_matrix_square
> >      2,84%  [kernel]       [k] crc32_generic_combine.part.0
> >      2,78%  [kernel]       [k] _raw_spin_lock_bh
>
> What kernel is this, seems not net kernel?
>
>   $ git grep -n gf2_matrix_square
>   $ git grep -n crc32_generic_combine
>   $
>
> Maybe RHEL? Did you consider backporting 6d514b4e7737 et al?

Damn, correct. I'll post a v2 later with a proper changelog.
No, I hadn't considered backporting that commit.

Now from a different environment: upstream kernel, without the patch,
using mlx4 and perf record -a -- sleep 5 during netperf
(Xeon E5-2690 v3, 24 cpus):

  Overhead  Command    Shared Object     Symbol
    16,85%  netserver  [kernel.vmlinux]  [k] crc32_generic_shift
     3,46%  swapper    [kernel.vmlinux]  [k] intel_idle
     2,00%  netserver  [kernel.vmlinux]  [k] __pskb_pull_tail
     1,73%  netserver  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     1,72%  swapper    [kernel.vmlinux]  [k] crc32_generic_shift
     1,64%  swapper    [kernel.vmlinux]  [k] poll_idle
     1,59%  netserver  [kernel.vmlinux]  [k] memcpy_erms
     1,57%  netserver  [kernel.vmlinux]  [k] fib_table_lookup
     1,47%  netserver  [kernel.vmlinux]  [k] _raw_spin_lock
     1,37%  netserver  [kernel.vmlinux]  [k] __slab_free
     1,32%  netserver  [sctp]            [k] sctp_packet_transmit
     1,18%  netserver  [kernel.vmlinux]  [k] skb_copy_datagram_iter

With the patch:

  Overhead  Command    Shared Object     Symbol
     4,71%  swapper    [kernel.vmlinux]  [k] intel_idle
     2,11%  netserver  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     1,45%  netserver  [kernel.vmlinux]  [k] memcpy_erms
     1,29%  swapper    [kernel.vmlinux]  [k] memcpy_erms
     1,28%  netserver  [kernel.vmlinux]  [k] fib_table_lookup
     1,27%  netserver  [kernel.vmlinux]  [k] __slab_free
     1,27%  swapper    [kernel.vmlinux]  [k] fib_table_lookup
     1,26%  netserver  [kernel.vmlinux]  [k] kmem_cache_free
     1,14%  netserver  [kernel.vmlinux]  [k] _raw_spin_lock
     1,07%  netserver  [kernel.vmlinux]  [k] __pskb_pull_tail
     1,06%  netserver  [kernel.vmlinux]  [k] skb_copy_datagram_iter
     1,04%  netserver  [sctp]            [k] sctp_packet_transmit
     1,04%  swapper    [kernel.vmlinux]  [k] __pskb_pull_tail
     1,01%  swapper    [mlx4_en]         [k] mlx4_en_process_rx_cq
     0,99%  swapper    [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
     0,96%  swapper    [kernel.vmlinux]  [k] _raw_spin_lock
     0,89%  swapper    [sctp]            [k] sctp_packet_transmit

Without the patch:

  # netperf -H 192.168.10.1 -l 10 -t SCTP_STREAM -cC -- -m 12000
  SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 () port 0 AF_INET
  Recv   Send    Send                          Utilization       Service Demand
  Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
  Size   Size    Size     Time     Throughput  local    remote   local   remote
  bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

  212992 212992  12000    10.00     2896.13    3.34     3.88     2.267   2.635

With the patch:

  # netperf -H 192.168.10.1 -l 10 -t SCTP_STREAM -cC -- -m 12000
  SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 () port 0 AF_INET
  Recv   Send    Send                          Utilization       Service Demand
  Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
  Size   Size    Size     Time     Throughput  local    remote   local   remote
  bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

  212992 212992  12000    10.00     3444.89    3.88     3.02     2.216   1.721

And without the patch netperf fluctuates more, as there are more packet
drops and netserver is constantly at 100% cpu usage.

Thanks,
Marcelo
diff --git a/net/sctp/input.c b/net/sctp/input.c
index c182db7d691ff44a52923fb36c9170e49c141c04..69444d32ecda6cd1a4924911172feba89c5ae976 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -119,7 +119,13 @@ int sctp_rcv(struct sk_buff *skb)
 		       skb_transport_offset(skb))
 		goto discard_it;
 
-	if (!pskb_may_pull(skb, sizeof(struct sctphdr)))
+	/* If the packet is fragmented and we need to do crc checking,
+	 * it's better to just linearize it otherwise crc computing
+	 * takes longer.
+	 */
+	if ((!(skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) &&
+	     skb_linearize(skb)) ||
+	    !pskb_may_pull(skb, sizeof(struct sctphdr)))
 		goto discard_it;
 
 	/* Pull up the IP header. */
@@ -1177,9 +1183,6 @@ static struct sctp_association *__sctp_rcv_lookup_harder(struct net *net,
 	if ((skb_shinfo(skb)->gso_type & SKB_GSO_SCTP) == SKB_GSO_SCTP)
 		return NULL;
 
-	if (skb_linearize(skb))
-		return NULL;
-
 	ch = (sctp_chunkhdr_t *) skb->data;
 
 	/* The code below will attempt to walk the chunk and extract
diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
index c30ddb0f31907f57c5ce85b00dbe04260ca1cb2e..6437aa97cfd79f14c633499c2b131389204c435b 100644
--- a/net/sctp/inqueue.c
+++ b/net/sctp/inqueue.c
@@ -170,19 +170,6 @@ next_chunk:
 
 	chunk = list_entry(entry, struct sctp_chunk, list);
 
-	/* Linearize if it's not GSO */
-	if ((skb_shinfo(chunk->skb)->gso_type & SKB_GSO_SCTP) != SKB_GSO_SCTP &&
-	    skb_is_nonlinear(chunk->skb)) {
-		if (skb_linearize(chunk->skb)) {
-			__SCTP_INC_STATS(dev_net(chunk->skb->dev), SCTP_MIB_IN_PKT_DISCARDS);
-			sctp_chunk_free(chunk);
-			goto next_chunk;
-		}
-
-		/* Update sctp_hdr as it probably changed */
-		chunk->sctp_hdr = sctp_hdr(chunk->skb);
-	}
-
 	if ((skb_shinfo(chunk->skb)->gso_type & SKB_GSO_SCTP) == SKB_GSO_SCTP) {
 		/* GSO-marked skbs but without frags, handle
 		 * them normally
Because otherwise when crc computation is still needed it's way more
expensive than on a linear buffer to the point that it affects
performance.

It's so expensive that netperf test gives a perf output as below:

  Overhead  Shared Object  Symbol
    69,44%  [kernel]       [k] gf2_matrix_square
     2,84%  [kernel]       [k] crc32_generic_combine.part.0
     2,78%  [kernel]       [k] _raw_spin_lock_bh

And performance goes from 2Gbit/s to 0.5Gbit/s on this test. Doing the
linearization before checksumming is enough to restore it.

Fixes: 3acb50c18d8d ("sctp: delay as much as possible skb_linearize")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
Please consider this to -stable too as v4.7 performance is affected.
Thanks

 net/sctp/input.c   | 11 +++++++----
 net/sctp/inqueue.c | 13 -------------
 2 files changed, 7 insertions(+), 17 deletions(-)