Message ID:   153132125549.13161.16380200872856218805.stgit@firesoul
State:        Accepted, archived
Delegated to: David Miller
Series:       [net-next] net: ipv4: fix listify ip_rcv_finish in case of forwarding
On 11/07/18 16:01, Jesper Dangaard Brouer wrote:
> In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling
> dst_input(skb) was split out. The ip_sublist_rcv_finish() just calls
> dst_input(skb) in a loop.
>
> The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
> from the list before invoking dst_input(). Furthermore, we need to
> clear skb->next as other parts of the network stack use another kind
> of SKB lists for xmit_more (see dev_hard_start_xmit).
>
> A crash occurs if e.g. dst_input() invokes ip_forward(), which calls
> dst_output()/ip_output() that eventually calls __dev_queue_xmit() +
> sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list().
>
> This patch only fixes the crash, but there is a huge potential for
> a performance boost if we can pass an SKB-list through to ip_forward.
>
> Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Edward Cree <ecree@solarflare.com>

But it feels weird and asymmetric to only NULL skb->next (not ->prev),
and to have to do this by hand rather than e.g. being able to use
list_del_init(&skb->list). Hopefully this can be revisited once
sch_direct_xmit() has been changed to use the list_head rather than SKB
special lists.
On Wed, 2018-07-11 at 17:01 +0200, Jesper Dangaard Brouer wrote:
> Only driver sfc actually uses this, but I don't have this NIC, so I
> tested this on mlx5, with my own changes to make it use
> netif_receive_skb_list(), but I'm not ready to upstream the mlx5
> driver change yet.

Thanks Jesper for sharing this. Should we look forward to those patches,
or do you want us to implement them?

Thanks,
Saeed.
On Wed, 11 Jul 2018 19:05:20 +0000 Saeed Mahameed <saeedm@mellanox.com> wrote:
> On Wed, 2018-07-11 at 17:01 +0200, Jesper Dangaard Brouer wrote:
> > Only driver sfc actually uses this, but I don't have this NIC, so I
> > tested this on mlx5, with my own changes to make it use
> > netif_receive_skb_list(), but I'm not ready to upstream the mlx5
> > driver change yet.
>
> Thanks Jesper for sharing this, should we look forward to those patches
> or do you want us to implement them?

Well, I would prefer you to implement those. I just did a quick
implementation (it's trivially easy) so I have something to benchmark
with. The performance boost is quite impressive!

One reason I didn't "just" send a patch is that Edward so far only
implemented netif_receive_skb_list() and not napi_gro_receive_list().
And your driver uses napi_gro_receive(). This sort-of disables GRO for
your driver, which is not a choice I can make. Interestingly, I get
around the same netperf TCP_STREAM performance. I assume we can get
even better perf if we "listify" napi_gro_receive.
On Wed, 11 Jul 2018 16:41:35 +0100 Edward Cree <ecree@solarflare.com> wrote:
> On 11/07/18 16:01, Jesper Dangaard Brouer wrote:
> > In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling
> > dst_input(skb) was split out. The ip_sublist_rcv_finish() just calls
> > dst_input(skb) in a loop.
> >
> > The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
> > from the list before invoking dst_input(). Furthermore, we need to
> > clear skb->next as other parts of the network stack use another kind
> > of SKB lists for xmit_more (see dev_hard_start_xmit).
> >
> > A crash occurs if e.g. dst_input() invokes ip_forward(), which calls
> > dst_output()/ip_output() that eventually calls __dev_queue_xmit() +
> > sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list().
> >
> > This patch only fixes the crash, but there is a huge potential for
> > a performance boost if we can pass an SKB-list through to ip_forward.
> >
> > Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Acked-by: Edward Cree <ecree@solarflare.com>
>
> But it feels weird and asymmetric to only NULL skb->next (not ->prev),
> and to have to do this by hand rather than e.g. being able to use
> list_del_init(&skb->list). Hopefully this can be revisited once
> sch_direct_xmit() has been changed to use the list_head rather than SKB
> special lists.

I cannot use list_del_init(&skb->list); it would also break. This is a
fix, and this code should be revisited. The reason I used the
list_del() + skb->next = NULL combo is to keep as much as possible of
the list-poisoning, e.g. 'prev' will be LIST_POISON2.
On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Well, I would prefer you to implement those. I just did a quick
> implementation (it's trivially easy) so I have something to benchmark
> with. The performance boost is quite impressive!

sounds good, but wait

> One reason I didn't "just" send a patch is that Edward so far only
> implemented netif_receive_skb_list() and not napi_gro_receive_list().

sfc doesn't support gro?! doesn't make sense.. Edward?

> And your driver uses napi_gro_receive(). This sort-of disables GRO for
> your driver, which is not a choice I can make. Interestingly I get
> around the same netperf TCP_STREAM performance.

Same TCP performance

with GRO and no rx-batching

or

without GRO and yes rx-batching

is by far not an intuitive result to me, unless both these techniques
mostly serve to eliminate lots of instruction cache misses, and the
TCP stack is so much optimized that if the code is in the cache, going
through it once with a 64K byte GRO-ed packet is like going through it
~40 (64K/1500) times with non-GRO-ed packets. What's the baseline
(with GRO and no rx-batching) number on your setup?

> I assume we can get even better perf if we "listify" napi_gro_receive.

yeah, that would be very interesting to get there
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 11 Jul 2018 17:01:20 +0200

> In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling
> dst_input(skb) was split out. The ip_sublist_rcv_finish() just calls
> dst_input(skb) in a loop.
>
> The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
> from the list before invoking dst_input(). Furthermore, we need to
> clear skb->next as other parts of the network stack use another kind
> of SKB lists for xmit_more (see dev_hard_start_xmit).
>
> A crash occurs if e.g. dst_input() invokes ip_forward(), which calls
> dst_output()/ip_output() that eventually calls __dev_queue_xmit() +
> sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list().
>
> This patch only fixes the crash, but there is a huge potential for
> a performance boost if we can pass an SKB-list through to ip_forward.
>
> Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
> Only driver sfc actually uses this, but I don't have this NIC, so I
> tested this on mlx5, with my own changes to make it use
> netif_receive_skb_list(), but I'm not ready to upstream the mlx5
> driver change yet.

Applied, thanks Jesper.

This whole:

	list_del();
	skb->next = NULL;

business is exactly the kind of dragons I was worried about when
starting to use list_head with SKBs.

There is a similar fix wrt. the GRO stuff that I'm about to apply as
well.

It definitely is better if we don't have to forcefully NULL out ->next
pointers by hand like this in the long term.
On Thu, 12 Jul 2018 23:10:28 +0300 Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>
> > Well, I would prefer you to implement those. I just did a quick
> > implementation (it's trivially easy) so I have something to benchmark
> > with. The performance boost is quite impressive!
>
> sounds good, but wait
>
> > One reason I didn't "just" send a patch is that Edward so far only
> > implemented netif_receive_skb_list() and not napi_gro_receive_list().
>
> sfc doesn't support gro?! doesn't make sense.. Edward?
>
> > And your driver uses napi_gro_receive(). This sort-of disables GRO for
> > your driver, which is not a choice I can make. Interestingly I get
> > around the same netperf TCP_STREAM performance.
>
> Same TCP performance

I said around the same... I'll redo the benchmarks and verify...
(did it.. see later).

> with GRO and no rx-batching
>
> or
>
> without GRO and yes rx-batching

Yes, obviously without GRO and yes rx-batching.

> is by far not intuitive result to me unless both these techniques
> mostly serve to eliminate lots of instruction cache misses and the
> TCP stack is so much optimized that if the code is in the cache,
> going through it once with 64K byte GRO-ed packet is like going
> through it ~40 (64K/1500) times with non GRO-ed packets.

Actually the GRO code path is rather expensive, and uses a lot of
indirect calls. If you have a UDP workload, then disabling GRO will
give you a 10-15% performance boost. Edward's changes are basically a
generalized version of GRO, up to the IP layer (ip_rcv). So, for me it
makes perfect sense.

> What's the baseline (with GRO and no rx-batching) number on your setup?

Okay, redoing the benchmarks... Implemented a code hack so I can
control at runtime whether the mlx5 driver uses napi_gro_receive() or
netif_receive_skb_list() (abusing a netdev ethtool-controlled feature
flag not in use).
To get a quick test going with feedback every 3 sec I use:

 $ netperf -t TCP_STREAM -H 198.18.1.1 -D3 -l 60000 -T 4,4

Default: using napi_gro_receive() with GRO enabled:
 Interim result: 25995.28 10^6bits/s over 3.000 seconds

Disable GRO but still use napi_gro_receive():
 Interim result: 21980.45 10^6bits/s over 3.001 seconds

Make driver use netif_receive_skb_list():
 Interim result: 25490.67 10^6bits/s over 3.002 seconds

As you can see, using netif_receive_skb_list() gives a huge performance
boost over disabled-GRO, and it comes very close to the performance of
enabled-GRO. Which is rather impressive! :-)

Notice, even more impressively, these tests are without CONFIG_RETPOLINE.
We primarily merged netif_receive_skb_list() due to the overhead of
RETPOLINEs, but we see a benefit even when not using RETPOLINEs.

> > I assume we can get even better perf if we "listify" napi_gro_receive.
>
> yeah, that would be very interesting to get there
On 12/07/18 21:10, Or Gerlitz wrote:
> On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>> One reason I didn't "just" send a patch is that Edward so far only
>> implemented netif_receive_skb_list() and not napi_gro_receive_list().
> sfc doesn't support gro?! doesn't make sense.. Edward?
sfc has a flag EFX_RX_PKT_TCP set according to bits in the RX event; we
call napi_{get,gro}_frags() (via efx_rx_packet_gro()) for TCP packets and
netif_receive_skb() (or now the list handling) (via efx_rx_deliver()) for
non-TCP packets. So we avoid the GRO overhead for non-TCP workloads.

> Same TCP performance
>
> with GRO and no rx-batching
>
> or
>
> without GRO and yes rx-batching
>
> is by far not intuitive result
I'm also surprised by this. If I can find the time I'll try to do similar
experiments on sfc.
Jesper, are the CPU utilisations similar in both cases? You're sure your
stream isn't TX-limited?
-Ed
On Fri, 13 Jul 2018 15:19:40 +0100 Edward Cree <ecree@solarflare.com> wrote:
> On 12/07/18 21:10, Or Gerlitz wrote:
> > On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> >> One reason I didn't "just" send a patch is that Edward so far only
> >> implemented netif_receive_skb_list() and not napi_gro_receive_list().
> > sfc doesn't support gro?! doesn't make sense.. Edward?
> sfc has a flag EFX_RX_PKT_TCP set according to bits in the RX event; we
> call napi_{get,gro}_frags() (via efx_rx_packet_gro()) for TCP packets and
> netif_receive_skb() (or now the list handling) (via efx_rx_deliver()) for
> non-TCP packets. So we avoid the GRO overhead for non-TCP workloads.
>
> I'm also surprised by this. If I can find the time I'll try to do similar
> experiments on sfc.
> Jesper, are the CPU utilisations similar in both cases?

The CPU util is very different.
With enabled-GRO, netperf CPU is only 60.89% loaded in %sys.
With netif_receive_skb_list() it is almost 100% loaded.
Same CPU load with just disabling GRO.

> You're sure your stream isn't TX-limited?

It might be the case, as the netperf sender HW is not as new as the
device under test. And the 60% load and idle cycles in the GRO case do
indicate this is the case.
On 07/13/2018 07:19 AM, Edward Cree wrote:
> On 12/07/18 21:10, Or Gerlitz wrote:
>> On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>> One reason I didn't "just" send a patch is that Edward so far only
>>> implemented netif_receive_skb_list() and not napi_gro_receive_list().
>> sfc doesn't support gro?! doesn't make sense.. Edward?
> sfc has a flag EFX_RX_PKT_TCP set according to bits in the RX event; we
> call napi_{get,gro}_frags() (via efx_rx_packet_gro()) for TCP packets and
> netif_receive_skb() (or now the list handling) (via efx_rx_deliver()) for
> non-TCP packets. So we avoid the GRO overhead for non-TCP workloads.
>
> I'm also surprised by this. If I can find the time I'll try to do similar
> experiments on sfc.
> Jesper, are the CPU utilisations similar in both cases? You're sure your
> stream isn't TX-limited?

1) Make sure to test the case where packets of X flows are interleaved
on the wire, instead of being nice with the receiver (trains of packets
for each flow). (Typical case on a fabric, since switches will mix the
ingress traffic to one egress port.)

2) Do not test TCP_STREAM traffic, but TCP_RR (RPC-like traffic where
GRO really cuts down the number of ACK packets). TCP_STREAM can hide
the GRO gain, since ACKs are naturally decimated under sufficient load.
On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 11 Jul 2018 19:05:20 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
>
>> On Wed, 2018-07-11 at 17:01 +0200, Jesper Dangaard Brouer wrote:
>> > Only driver sfc actually uses this, but I don't have this NIC, so I
>> > tested this on mlx5, with my own changes to make it use
>> > netif_receive_skb_list(), but I'm not ready to upstream the mlx5
>> > driver change yet.
>>
>> Thanks Jesper for sharing this, should we look forward to those patches
>> or do you want us to implement them?
>
> Well, I would prefer you to implement those. I just did a quick
> implementation (it's trivially easy) so I have something to benchmark
> with. The performance boost is quite impressive!
>
> One reason I didn't "just" send a patch is that Edward so far only
> implemented netif_receive_skb_list() and not napi_gro_receive_list().
> And your driver uses napi_gro_receive(). This sort-of disables GRO for
> your driver, which is not a choice I can make.

But we can apply such a patch on the mlx5 code path which comes into
play when GRO is disabled, right? Or is there just one code path there?
Can you send your patch as RFC to the list for the purpose of
experimentation?

> Interestingly I get around the same netperf TCP_STREAM performance.
> I assume we can get even better perf if we "listify" napi_gro_receive.
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 1a3b6f32b1c9..3196cf58f418 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -530,8 +530,14 @@ static void ip_sublist_rcv_finish(struct list_head *head)
 {
 	struct sk_buff *skb, *next;
 
-	list_for_each_entry_safe(skb, next, head, list)
+	list_for_each_entry_safe(skb, next, head, list) {
+		list_del(&skb->list);
+		/* Handle ip{6}_forward case, as sch_direct_xmit have
+		 * another kind of SKB-list usage (see validate_xmit_skb_list)
+		 */
+		skb->next = NULL;
 		dst_input(skb);
+	}
 }
 
 static void ip_list_rcv_finish(struct net *net, struct sock *sk,
In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling
dst_input(skb) was split out. The ip_sublist_rcv_finish() just calls
dst_input(skb) in a loop.

The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
from the list before invoking dst_input(). Furthermore, we need to
clear skb->next as other parts of the network stack use another kind
of SKB lists for xmit_more (see dev_hard_start_xmit).

A crash occurs if e.g. dst_input() invokes ip_forward(), which calls
dst_output()/ip_output() that eventually calls __dev_queue_xmit() +
sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list().

This patch only fixes the crash, but there is a huge potential for
a performance boost if we can pass an SKB-list through to ip_forward.

Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
Only driver sfc actually uses this, but I don't have this NIC, so I
tested this on mlx5, with my own changes to make it use
netif_receive_skb_list(), but I'm not ready to upstream the mlx5
driver change yet.

 net/ipv4/ip_input.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)