[net-next] net: ipv4: fix listify ip_rcv_finish in case of forwarding

Message ID 153132125549.13161.16380200872856218805.stgit@firesoul
State Accepted
Delegated to: David Miller

Commit Message

Jesper Dangaard Brouer July 11, 2018, 3:01 p.m.
In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") the
dst_input(skb) call was split out.  ip_sublist_rcv_finish() just calls
dst_input(skb) in a loop.

The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
from the list before invoking dst_input().  Furthermore, we need to
clear skb->next, as other parts of the network stack use another kind
of SKB list for xmit_more (see dev_hard_start_xmit).

A crash occurs if e.g. dst_input() invokes ip_forward(), which calls
dst_output()/ip_output(), which eventually calls __dev_queue_xmit() +
sch_direct_xmit(); the crash then occurs in validate_xmit_skb_list().

This patch only fixes the crash, but there is a huge potential for
a performance boost if we can pass an SKB list through to ip_forward.

Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
Only the sfc driver actually uses this, but I don't have that NIC, so I
tested this on mlx5, with my own changes to make it use netif_receive_skb_list().
I'm not ready to upstream the mlx5 driver change yet.

 net/ipv4/ip_input.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Edward Cree July 11, 2018, 3:41 p.m. | #1
On 11/07/18 16:01, Jesper Dangaard Brouer wrote:
> In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling
> dst_input(skb) was split-out.  The ip_sublist_rcv_finish() just calls
> dst_input(skb) in a loop.
>
> The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
> from the list before invoking dst_input().  Further more we need to
> clear skb->next as other parts of the network stack use another kind
> of SKB lists for xmit_more (see dev_hard_start_xmit).
>
> A crash occurs if e.g. dst_input() invoke ip_forward(), which calls
> dst_output()/ip_output() that eventually calls __dev_queue_xmit() +
> sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list().
>
> This patch only fixes the crash, but there is a huge potential for
> a performance boost if we can pass an SKB-list through to ip_forward.
>
> Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Edward Cree <ecree@solarflare.com>

But it feels weird and asymmetric to only NULL skb->next (not ->prev), and
 to have to do this by hand rather than e.g. being able to use
 list_del_init(&skb->list).  Hopefully this can be revisited once
 sch_direct_xmit() has been changed to use the list_head rather than SKB
 special lists.
Saeed Mahameed July 11, 2018, 7:05 p.m. | #2
On Wed, 2018-07-11 at 17:01 +0200, Jesper Dangaard Brouer wrote:
> Only driver sfc actually uses this, but I don't have this NIC, so I
> tested this on mlx5, with my own changes to make it use
> netif_receive_skb_list(),
> but I'm not ready to upstream the mlx5 driver change yet.


Thanks Jesper for sharing this. Should we look forward to those patches,
or do you want us to implement them?

Thanks,
Saeed.
Jesper Dangaard Brouer July 11, 2018, 8:06 p.m. | #3
On Wed, 11 Jul 2018 19:05:20 +0000
Saeed Mahameed <saeedm@mellanox.com> wrote:

> On Wed, 2018-07-11 at 17:01 +0200, Jesper Dangaard Brouer wrote:
> > Only driver sfc actually uses this, but I don't have this NIC, so I
> > tested this on mlx5, with my own changes to make it use
> > netif_receive_skb_list(),
> > but I'm not ready to upstream the mlx5 driver change yet.  
> 
> 
> Thanks Jesper for sharing this, should we look forward to those patches
> or do you want us to implement them ?

Well, I would prefer you to implement those.  I just did a quick
implementation (it's trivially easy) so I have something to benchmark
with.  The performance boost is quite impressive!

One reason I didn't "just" send a patch is that Edward so far only
implemented netif_receive_skb_list() and not napi_gro_receive_list().
And your driver uses napi_gro_receive().  This sort of disables GRO for
your driver, which is not a choice I can make.  Interestingly, I get
around the same netperf TCP_STREAM performance.  I assume we can get
even better perf if we "listify" napi_gro_receive.
Jesper Dangaard Brouer July 11, 2018, 8:15 p.m. | #4
On Wed, 11 Jul 2018 16:41:35 +0100
Edward Cree <ecree@solarflare.com> wrote:

> On 11/07/18 16:01, Jesper Dangaard Brouer wrote:
> > In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling
> > dst_input(skb) was split-out.  The ip_sublist_rcv_finish() just calls
> > dst_input(skb) in a loop.
> >
> > The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
> > from the list before invoking dst_input().  Further more we need to
> > clear skb->next as other parts of the network stack use another kind
> > of SKB lists for xmit_more (see dev_hard_start_xmit).
> >
> > A crash occurs if e.g. dst_input() invoke ip_forward(), which calls
> > dst_output()/ip_output() that eventually calls __dev_queue_xmit() +
> > sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list().
> >
> > This patch only fixes the crash, but there is a huge potential for
> > a performance boost if we can pass an SKB-list through to ip_forward.
> >
> > Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>  
> Acked-by: Edward Cree <ecree@solarflare.com>
> 
> But it feels weird and asymmetric to only NULL skb->next (not ->prev), and
>  to have to do this by hand rather than e.g. being able to use
>  list_del_init(&skb->list).  Hopefully this can be revisited once
>  sch_direct_xmit() has been changed to use the list_head rather than SKB
>  special lists.

I cannot use list_del_init(&skb->list); it would also break.
This is a fix, and this code should be revisited.

The reason I used the list_del() + skb->next = NULL combo is to keep as
much of the list poisoning as possible, e.g. 'prev' will be LIST_POISON2.
Or Gerlitz July 12, 2018, 8:10 p.m. | #5
On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:

> Well, I would prefer you to implement those.  I just did a quick
> implementation (its trivially easy) so I have something to benchmark
> with.  The performance boost is quite impressive!

sounds good, but wait


> One reason I didn't "just" send a patch, is that Edward so-fare only
> implemented netif_receive_skb_list() and not napi_gro_receive_list().

sfc doesn't support gro?! doesn't make sense... Edward?

> And your driver uses napi_gro_receive().  This sort-of disables GRO for
> your driver, which is not a choice I can make.  Interestingly I get
> around the same netperf TCP_STREAM performance.

Same TCP performance

with GRO and no rx-batching

or

without GRO and yes rx-batching

is by far not an intuitive result to me, unless both these techniques
mostly serve to eliminate lots of instruction-cache misses and the
TCP stack is so well optimized that, if the code is in the cache,
going through it once with a 64KB GRO-ed packet is like going
through it ~40 (64K/1500) times with non-GRO-ed packets.

What's the baseline (with GRO and no rx-batching) number on your setup?

> I assume we can get even better perf if we "listify" napi_gro_receive.

yeah, that would be very interesting to get there
David Miller July 12, 2018, 11:41 p.m. | #6
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 11 Jul 2018 17:01:20 +0200

> In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling
> dst_input(skb) was split-out.  The ip_sublist_rcv_finish() just calls
> dst_input(skb) in a loop.
> 
> The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
> from the list before invoking dst_input().  Further more we need to
> clear skb->next as other parts of the network stack use another kind
> of SKB lists for xmit_more (see dev_hard_start_xmit).
> 
> A crash occurs if e.g. dst_input() invoke ip_forward(), which calls
> dst_output()/ip_output() that eventually calls __dev_queue_xmit() +
> sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list().
> 
> This patch only fixes the crash, but there is a huge potential for
> a performance boost if we can pass an SKB-list through to ip_forward.
> 
> Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
> Only driver sfc actually uses this, but I don't have this NIC, so I
> tested this on mlx5, with my own changes to make it use netif_receive_skb_list(),
> but I'm not ready to upstream the mlx5 driver change yet.

Applied, thanks Jesper.

This whole:

	list_del();
	skb->next = NULL;

business is exactly the kind of dragons I was worried about when starting
to use list_head with SKBs.

There is a similar fix wrt. the GRO stuff that I'm about to apply as well.

It definitely is better if we don't have to forcefully NULL out
->next pointers like this in the long term.
Jesper Dangaard Brouer July 13, 2018, 11:08 a.m. | #7
On Thu, 12 Jul 2018 23:10:28 +0300 Or Gerlitz <gerlitz.or@gmail.com> wrote:

> On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> 
> > Well, I would prefer you to implement those.  I just did a quick
> > implementation (its trivially easy) so I have something to benchmark
> > with.  The performance boost is quite impressive!  
> 
> sounds good, but wait
> 
> 
> > One reason I didn't "just" send a patch, is that Edward so-fare only
> > implemented netif_receive_skb_list() and not napi_gro_receive_list().  
> 
> sfc does't support gro?! doesn't make sense.. Edward?
> 
> > And your driver uses napi_gro_receive().  This sort-of disables GRO for
> > your driver, which is not a choice I can make.  Interestingly I get
> > around the same netperf TCP_STREAM performance.  
> 
> Same TCP performance

I said around the same... I'll redo the benchmarks and verify...
(did it.. see later).

> with GRO and no rx-batching
> 
> or
> 
> without GRO and yes rx-batching

Yes, obviously without GRO and yes rx-batching.


> is by far not intuitive result to me unless both these techniques
> mostly serve to eliminate lots of instruction cache misses and the
> TCP stack is so much optimized that if the code is in the cache,
> going through it once with 64K byte GRO-ed packet is like going
> through it ~40 (64K/1500) times with non GRO-ed packets.

Actually, the GRO code path is rather expensive and uses a lot
of indirect calls.  If you have a UDP workload, then disabling GRO will
give you a 10-15% performance boost.

Edward's changes are basically a generalized version of GRO, up to the
IP layer (ip_rcv).  So, for me it makes perfect sense.

> What's the baseline (with GRO and no rx-batching) number on your setup?

Okay, redoing the benchmarks...

I implemented a code hack so I can control at runtime whether the mlx5
driver uses napi_gro_receive() or netif_receive_skb_list() (abusing a
netdev ethtool-controlled feature flag not currently in use).

To get a quick test going with feedback every 3 sec I use:

 $ netperf -t TCP_STREAM -H 198.18.1.1 -D3 -l 60000 -T 4,4


Default: using napi_gro_receive() with GRO enabled:

 Interim result: 25995.28 10^6bits/s over 3.000 seconds

Disable GRO but still use napi_gro_receive():

 Interim result: 21980.45 10^6bits/s over 3.001 seconds

Make driver use netif_receive_skb_list():

 Interim result: 25490.67 10^6bits/s over 3.002 seconds

As you can see, using netif_receive_skb_list() gives a huge performance
boost over disabled-GRO, and it comes very close to the performance
of enabled-GRO.  Which is rather impressive! :-)

Notice, even more impressively, these tests are without CONFIG_RETPOLINE.
We primarily merged netif_receive_skb_list() due to the overhead of
RETPOLINEs, but we see a benefit even when not using RETPOLINEs.


> > I assume we can get even better perf if we "listify" napi_gro_receive.  
> 
> yeah, that would be very interesting to get there
Edward Cree July 13, 2018, 2:19 p.m. | #8
On 12/07/18 21:10, Or Gerlitz wrote:
> On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>> One reason I didn't "just" send a patch, is that Edward so-fare only
>> implemented netif_receive_skb_list() and not napi_gro_receive_list().
> sfc does't support gro?! doesn't make sense.. Edward?
sfc has a flag EFX_RX_PKT_TCP, set according to bits in the RX event; we
 call napi_{get,gro}_frags() (via efx_rx_packet_gro()) for TCP packets and
 netif_receive_skb() (or now the list handling) (via efx_rx_deliver()) for
 non-TCP packets.  So we avoid the GRO overhead for non-TCP workloads.

> Same TCP performance
>
> with GRO and no rx-batching
>
> or
>
> without GRO and yes rx-batching
>
> is by far not intuitive result
I'm also surprised by this.  If I can find the time I'll try to do similar
 experiments on sfc.
Jesper, are the CPU utilisations similar in both cases?  You're sure your
 stream isn't TX-limited?

-Ed
Jesper Dangaard Brouer July 13, 2018, 4:04 p.m. | #9
On Fri, 13 Jul 2018 15:19:40 +0100
Edward Cree <ecree@solarflare.com> wrote:

> On 12/07/18 21:10, Or Gerlitz wrote:
> > On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:  
> >> One reason I didn't "just" send a patch, is that Edward so-fare only
> >> implemented netif_receive_skb_list() and not napi_gro_receive_list().  
> > sfc does't support gro?! doesn't make sense.. Edward?  
> sfc has a flag EFX_RX_PKT_TCP set according to bits in the RX event, we
>  call napi_{get,gro}_frags() (via efx_rx_packet_gro()) for TCP packets and
>  netif_receive_skb() (or now the list handling) (via efx_rx_deliver()) for
>  non-TCP packets.  So we avoid the GRO overhead for non-TCP workloads.
> 
> > Same TCP performance
> >
> > with GRO and no rx-batching
> >
> > or
> >
> > without GRO and yes rx-batching
> >
> > is by far not intuitive result  
>
> I'm also surprised by this.  If I can find the time I'll try to do similar
> experiments on sfc.
> Jesper, are the CPU utilisations similar in both cases?

The CPU utilisation is very different.

 With GRO enabled, the netperf CPU is only 60.89% loaded in %sys.
 With netif_receive_skb_list() it is almost 100% loaded.
 Same CPU load with just disabling GRO.

> You're sure your stream isn't TX-limited?

It might be the case, as the netperf sender HW is not as new as the
device under test.  The 60% load and idle cycles in the GRO case
do indicate this.
Eric Dumazet July 13, 2018, 6:14 p.m. | #10
On 07/13/2018 07:19 AM, Edward Cree wrote:
> On 12/07/18 21:10, Or Gerlitz wrote:
>> On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>> One reason I didn't "just" send a patch, is that Edward so-fare only
>>> implemented netif_receive_skb_list() and not napi_gro_receive_list().
>> sfc does't support gro?! doesn't make sense.. Edward?
> sfc has a flag EFX_RX_PKT_TCP set according to bits in the RX event, we
>  call napi_{get,gro}_frags() (via efx_rx_packet_gro()) for TCP packets and
>  netif_receive_skb() (or now the list handling) (via efx_rx_deliver()) for
>  non-TCP packets.  So we avoid the GRO overhead for non-TCP workloads.
> 
>> Same TCP performance
>>
>> with GRO and no rx-batching
>>
>> or
>>
>> without GRO and yes rx-batching
>>
>> is by far not intuitive result
> I'm also surprised by this.  If I can find the time I'll try to do similar
>  experiments on sfc.
> Jesper, are the CPU utilisations similar in both cases?  You're sure your
>  stream isn't TX-limited?

1) Make sure to test the case where packets of X flows are interleaved on the wire,
instead of being nice to the receiver (trains of packets for each flow).

(Typical case on a fabric, since switches will mix the ingress traffic onto one egress port.)

2) Do not test only TCP_STREAM traffic, but also TCP_RR
(RPC-like traffic, where GRO really cuts down the number of ACK packets).

  TCP_STREAM can hide the GRO gain, since ACKs are naturally decimated under sufficient
  load.
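Concretely, Eric's second suggestion could look something like the following, mirroring the TCP_STREAM invocation Jesper posted earlier. The host address and CPU binding are from his setup, and the request/response sizes are an arbitrary assumed choice for RPC-like traffic:

```shell
# TCP_RR: RPC-like traffic, where GRO's ACK reduction actually matters.
# 198.18.1.1 and -T 4,4 come from Jesper's testbed; adjust for yours.
# -r 256,256 sets the request/response sizes (an assumed, arbitrary choice).
netperf -t TCP_RR -H 198.18.1.1 -D3 -l 60 -T 4,4 -- -r 256,256
```

TCP_RR reports transactions per second rather than throughput, so a GRO-vs-listified comparison here exposes the per-packet and ACK-path costs that TCP_STREAM can hide.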
Or Gerlitz July 30, 2018, 2:38 p.m. | #11
On Wed, Jul 11, 2018 at 11:06 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 11 Jul 2018 19:05:20 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
>
>> On Wed, 2018-07-11 at 17:01 +0200, Jesper Dangaard Brouer wrote:
>> > Only driver sfc actually uses this, but I don't have this NIC, so I
>> > tested this on mlx5, with my own changes to make it use
>> > netif_receive_skb_list(),
>> > but I'm not ready to upstream the mlx5 driver change yet.
>>
>>
>> Thanks Jesper for sharing this, should we look forward to those patches
>> or do you want us to implement them ?
>
> Well, I would prefer you to implement those.  I just did a quick
> implementation (its trivially easy) so I have something to benchmark
> with.  The performance boost is quite impressive!
>
> One reason I didn't "just" send a patch, is that Edward so-fare only
> implemented netif_receive_skb_list() and not napi_gro_receive_list().
> And your driver uses napi_gro_receive().  This sort-of disables GRO for
> your driver, which is not a choice I can make.  Interestingly I get

But we can apply such a patch on the mlx5 code path that comes
into play when GRO is disabled, right? Or is there just one code path
there? Can you send your patch as an RFC to the list for the purpose of
experimentation?

> around the same netperf TCP_STREAM performance.  I assume we can get
> even better perf if we "listify" napi_gro_receive.

Patch

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 1a3b6f32b1c9..3196cf58f418 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -530,8 +530,14 @@  static void ip_sublist_rcv_finish(struct list_head *head)
 {
 	struct sk_buff *skb, *next;
 
-	list_for_each_entry_safe(skb, next, head, list)
+	list_for_each_entry_safe(skb, next, head, list) {
+		list_del(&skb->list);
+		/* Handle ip{6}_forward case, as sch_direct_xmit have
+		 * another kind of SKB-list usage (see validate_xmit_skb_list)
+		 */
+		skb->next = NULL;
 		dst_input(skb);
+	}
 }
 
 static void ip_list_rcv_finish(struct net *net, struct sock *sk,