mbox series

[v3,net-next,0/3] net: batched receive in GRO path

Message ID c6e2474e-2c8a-5881-86bf-59c66bdfc34f@solarflare.com
Headers show
Series net: batched receive in GRO path | expand

Message

Edward Cree Aug. 6, 2019, 1:52 p.m. UTC
This series listifies part of GRO processing, in a manner which allows those
 packets which are not GROed (i.e. for which dev_gro_receive returns
 GRO_NORMAL) to be passed on to the listified regular receive path.
dev_gro_receive() itself is not listified, nor the per-protocol GRO
 callback, since GRO's need to hold packets on lists under napi->gro_hash
 makes keeping the packets on other lists awkward, and since the GRO control
 block state of held skbs can refer only to one 'new' skb at a time.
Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
 onto a list in the napi struct, which is received at the end of the napi
 poll or when its length exceeds the (new) sysctl net.core.gro_normal_batch.

Performance figures with this series, collected on a back-to-back pair of
 Solarflare sfn8522-r2 NICs with 120-second NetPerf tests.  In the stats,
 sample size n for old and new code is 6 runs each; p is from a Welch t-test.
Tests were run both with GRO enabled and disabled, the latter simulating
 uncoalesceable packets (e.g. due to IP or TCP options).  The receive side
 (which was the device under test) had the NetPerf process pinned to one CPU,
 and the device interrupts pinned to a second CPU.  CPU utilisation figures
 (used in cases of line-rate performance) are summed across all CPUs.
net.core.gro_normal_batch was left at its default value of 8.

TCP 4 streams, GRO on: all results line rate (9.415Gbps)
net-next: 210.3% cpu
after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
after #3: 196.7% cpu (- 8.4%, p=0.136 vs net-next)
TCP 4 streams, GRO off:
net-next: 8.017 Gbps
after #1: 7.785 Gbps (- 2.9%, p=0.385 vs net-next)
after #3: 7.604 Gbps (- 5.1%, p=0.282 vs net-next.  But note *)
TCP 1 stream, GRO off:
net-next: 6.553 Gbps
after #1: 6.444 Gbps (- 1.7%, p=0.302 vs net-next)
after #3: 6.790 Gbps (+ 3.6%, p=0.169 vs net-next)
TCP 1 stream, GRO on, busy_read = 50: all results line rate
net-next: 156.0% cpu
after #1: 174.5% cpu (+11.9%, p=0.015 vs net-next)
after #3: 165.0% cpu (+ 5.8%, p=0.147 vs net-next)
TCP 1 stream, GRO off, busy_read = 50:
net-next: 6.488 Gbps
after #1: 6.625 Gbps (+ 2.1%, p=0.059 vs net-next)
after #3: 7.351 Gbps (+13.3%, p=0.026 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload
net-next: 995.083 us
after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)
TCP_RR 100 streams, GRO off, 8000 byte payload, busy_read = 50:
net-next:   2.851 ms
after #1:   2.871 ms (+ 0.7%, p=0.134 vs net-next)
after #3:   2.937 ms (+ 3.0%, p<0.001 vs net-next)
TCP_RR 100 streams, GRO off, 1 byte payload, busy_read = 50:
net-next: 867.317 us
after #1: 865.717 us (- 0.2%, p=0.334 vs net-next)
after #3: 868.517 us (+ 0.1%, p=0.414 vs net-next)

(*) These tests produced a mixture of line-rate and below-line-rate results,
 meaning that statistically speaking the results were 'censored' by the
 upper bound, and were thus not normally distributed, making a Welch t-test
 mathematically invalid.  I therefore also calculated estimators according
 to [1], which gave the following:
net-next: 8.133 Gbps
after #1: 8.130 Gbps (- 0.0%, p=0.499 vs net-next)
after #3: 7.680 Gbps (- 5.6%, p=0.285 vs net-next)
(though my procedure for determining ν wasn't mathematically well-founded
 either, so take that p-value with a grain of salt).
A further check came from dividing the bandwidth figure by the CPU usage for
 each test run, giving:
net-next: 3.461
after #1: 3.198 (- 7.6%, p=0.145 vs net-next)
after #3: 3.641 (+ 5.2%, p=0.280 vs net-next)

The above results are fairly mixed, and in most cases not statistically
 significant.  But I think we can roughly conclude that the series
 marginally improves non-GROable throughput, without hurting latency
 (except in the large-payload busy-polling case, which in any case yields
 horrid performance even on net-next (almost triple the latency without
 busy-poll).  Also, drivers which, unlike sfc, pass UDP traffic to GRO
 would expect to see a benefit from gaining access to batching.

Changed in v3:
 * gro_normal_batch sysctl now uses SYSCTL_ONE instead of &one
 * removed RFC tags (no comments after a week means no-one objects, right?)

Changed in v2:
 * During busy poll, call gro_normal_list() to receive batched packets
   after each cycle of the napi busy loop.  See comments in Patch #3 for
   complications of doing the same in busy_poll_stop().

[1]: Cohen 1959, doi: 10.1080/00401706.1959.10489859

Edward Cree (3):
  sfc: don't score irq moderation points for GRO
  sfc: falcon: don't score irq moderation points for GRO
  net: use listified RX for handling GRO_NORMAL skbs

 drivers/net/ethernet/sfc/falcon/rx.c |  5 +---
 drivers/net/ethernet/sfc/rx.c        |  5 +---
 include/linux/netdevice.h            |  3 ++
 net/core/dev.c                       | 44 ++++++++++++++++++++++++++--
 net/core/sysctl_net_core.c           |  8 +++++
 5 files changed, 54 insertions(+), 11 deletions(-)

Comments

David Miller Aug. 9, 2019, 1:22 a.m. UTC | #1
From: Edward Cree <ecree@solarflare.com>
Date: Tue, 6 Aug 2019 14:52:06 +0100

> This series listifies part of GRO processing, in a manner which allows those
>  packets which are not GROed (i.e. for which dev_gro_receive returns
>  GRO_NORMAL) to be passed on to the listified regular receive path.
> dev_gro_receive() itself is not listified, nor the per-protocol GRO
>  callback, since GRO's need to hold packets on lists under napi->gro_hash
>  makes keeping the packets on other lists awkward, and since the GRO control
>  block state of held skbs can refer only to one 'new' skb at a time.
> Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
>  onto a list in the napi struct, which is received at the end of the napi
>  poll or when its length exceeds the (new) sysctl net.core.gro_normal_batch.
> 
> Performance figures with this series, collected on a back-to-back pair of
>  Solarflare sfn8522-r2 NICs with 120-second NetPerf tests.  In the stats,
>  sample size n for old and new code is 6 runs each; p is from a Welch t-test.
> Tests were run both with GRO enabled and disabled, the latter simulating
>  uncoalesceable packets (e.g. due to IP or TCP options).  The receive side
>  (which was the device under test) had the NetPerf process pinned to one CPU,
>  and the device interrupts pinned to a second CPU.  CPU utilisation figures
>  (used in cases of line-rate performance) are summed across all CPUs.
> net.core.gro_normal_batch was left at its default value of 8.
 ...
> The above results are fairly mixed, and in most cases not statistically
>  significant.  But I think we can roughly conclude that the series
>  marginally improves non-GROable throughput, without hurting latency
>  (except in the large-payload busy-polling case, which in any case yields
>  horrid performance even on net-next (almost triple the latency without
>  busy-poll).  Also, drivers which, unlike sfc, pass UDP traffic to GRO
>  would expect to see a benefit from gaining access to batching.
> 
> Changed in v3:
>  * gro_normal_batch sysctl now uses SYSCTL_ONE instead of &one
>  * removed RFC tags (no comments after a week means no-one objects, right?)
> 
> Changed in v2:
>  * During busy poll, call gro_normal_list() to receive batched packets
>    after each cycle of the napi busy loop.  See comments in Patch #3 for
>    complications of doing the same in busy_poll_stop().
> 
> [1]: Cohen 1959, doi: 10.1080/00401706.1959.10489859

Series applied, thanks Edward.
Ioana Radulescu Aug. 9, 2019, 5:14 p.m. UTC | #2
> -----Original Message-----
> From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> Behalf Of Edward Cree
> Sent: Tuesday, August 6, 2019 4:52 PM
> To: David Miller <davem@davemloft.net>
> Cc: netdev <netdev@vger.kernel.org>; Eric Dumazet
> <eric.dumazet@gmail.com>; linux-net-drivers@solarflare.com
> Subject: [PATCH v3 net-next 0/3] net: batched receive in GRO path
> 
> This series listifies part of GRO processing, in a manner which allows those
>  packets which are not GROed (i.e. for which dev_gro_receive returns
>  GRO_NORMAL) to be passed on to the listified regular receive path.
> dev_gro_receive() itself is not listified, nor the per-protocol GRO
>  callback, since GRO's need to hold packets on lists under napi->gro_hash
>  makes keeping the packets on other lists awkward, and since the GRO control
>  block state of held skbs can refer only to one 'new' skb at a time.
> Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
>  onto a list in the napi struct, which is received at the end of the napi
>  poll or when its length exceeds the (new) sysctl net.core.gro_normal_batch.

Hi Edward,

I'm probably missing a lot of context here, but is there a reason
this change targets only the napi_gro_frags() path and not the
napi_gro_receive() one?

I'm trying to understand what drivers that don't call napi_gro_frags()
should do in order to benefit from this batching feature.

Thanks,
Ioana
Edward Cree Aug. 9, 2019, 5:32 p.m. UTC | #3
On 09/08/2019 18:14, Ioana Ciocoi Radulescu wrote:
> Hi Edward,
>
> I'm probably missing a lot of context here, but is there a reason
> this change targets only the napi_gro_frags() path and not the
> napi_gro_receive() one?
> I'm trying to understand what drivers that don't call napi_gro_frags()
> should do in order to benefit from this batching feature.
The sfc driver (which is what I have lots of hardware for, so I can
 test it) uses napi_gro_frags().
It should be possible to do a similar patch to napi_gro_receive(),
 if someone wants to put in the effort of writing and testing it.
However, there are many more callers, so more effort required to
 make sure none of them care whether the return value is GRO_DROP
 or GRO_NORMAL (since the listified version cannot give that
 indication).
Also, the guidance from Eric is that drivers seeking high performance
 should use napi_gro_frags(), as this allows GRO to recycle the SKB.

All of this together means I don't plan to submit such a patch; I
 would however be happy to review a patch if someone else writes one.

HTH,
-Ed
Ioana Radulescu Aug. 12, 2019, 5:51 p.m. UTC | #4
> -----Original Message-----
> From: Edward Cree <ecree@solarflare.com>
> Sent: Friday, August 9, 2019 8:32 PM
> To: Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>
> Cc: David Miller <davem@davemloft.net>; netdev <netdev@vger.kernel.org>;
> Eric Dumazet <eric.dumazet@gmail.com>; linux-net-drivers@solarflare.com
> Subject: Re: [PATCH v3 net-next 0/3] net: batched receive in GRO path
> 
> On 09/08/2019 18:14, Ioana Ciocoi Radulescu wrote:
> > Hi Edward,
> >
> > I'm probably missing a lot of context here, but is there a reason
> > this change targets only the napi_gro_frags() path and not the
> > napi_gro_receive() one?
> > I'm trying to understand what drivers that don't call napi_gro_frags()
> > should do in order to benefit from this batching feature.
> The sfc driver (which is what I have lots of hardware for, so I can
>  test it) uses napi_gro_frags().
> It should be possible to do a similar patch to napi_gro_receive(),
>  if someone wants to put in the effort of writing and testing it.

Rather tricky, since I'm not really familiar with GRO internals and
probably don't understand all the implications of such a change :-/
Any pointers to what I should pay attention to/sensitive areas that
need extra care?

> However, there are many more callers, so more effort required to
>  make sure none of them care whether the return value is GRO_DROP
>  or GRO_NORMAL (since the listified version cannot give that
>  indication).

At a quick glance, there's only one driver that looks at the return
value of napi_gro_receive (drivers/net/ethernet/socionext/netsec.c),
and it only updates interface stats based on it.

> Also, the guidance from Eric is that drivers seeking high performance
>  should use napi_gro_frags(), as this allows GRO to recycle the SKB.

But this guidance is for GRO-able frames only, right? If I try to use
napi_gro_frags() indiscriminately on the Rx path, I get a big
performance penalty in some cases - e.g. forwarding of non-TCP
single buffer frames.

On the other hand, Eric shot down my attempt to differentiate between
TCP and non-TCP frames inside the driver (see 
https://patchwork.ozlabs.org/patch/1135817/#2222236), so I'm not
really sure what's the recommended approach here?

> 
> All of this together means I don't plan to submit such a patch; I
>  would however be happy to review a patch if someone else writes one.

Thanks a lot for the explanations!
Ioana
Eric Dumazet Aug. 12, 2019, 6:20 p.m. UTC | #5
On 8/12/19 7:51 PM, Ioana Ciocoi Radulescu wrote:
>> -----Original Message-----
>> From: Edward Cree <ecree@solarflare.com>
>> Sent: Friday, August 9, 2019 8:32 PM
>> To: Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>
>> Cc: David Miller <davem@davemloft.net>; netdev <netdev@vger.kernel.org>;
>> Eric Dumazet <eric.dumazet@gmail.com>; linux-net-drivers@solarflare.com
>> Subject: Re: [PATCH v3 net-next 0/3] net: batched receive in GRO path
>>
>> On 09/08/2019 18:14, Ioana Ciocoi Radulescu wrote:
>>> Hi Edward,
>>>
>>> I'm probably missing a lot of context here, but is there a reason
>>> this change targets only the napi_gro_frags() path and not the
>>> napi_gro_receive() one?
>>> I'm trying to understand what drivers that don't call napi_gro_frags()
>>> should do in order to benefit from this batching feature.
>> The sfc driver (which is what I have lots of hardware for, so I can
>>  test it) uses napi_gro_frags().
>> It should be possible to do a similar patch to napi_gro_receive(),
>>  if someone wants to put in the effort of writing and testing it.
> 
> Rather tricky, since I'm not really familiar with GRO internals and
> probably don't understand all the implications of such a change :-/
> Any pointers to what I should pay attention to/sensitive areas that
> need extra care?
> 
>> However, there are many more callers, so more effort required to
>>  make sure none of them care whether the return value is GRO_DROP
>>  or GRO_NORMAL (since the listified version cannot give that
>>  indication).
> 
> At a quick glance, there's only one driver that looks at the return
> value of napi_gro_receive (drivers/net/ethernet/socionext/netsec.c),
> and it only updates interface stats based on it.
> 
>> Also, the guidance from Eric is that drivers seeking high performance
>>  should use napi_gro_frags(), as this allows GRO to recycle the SKB.
> 
> But this guidance is for GRO-able frames only, right? If I try to use
> napi_gro_frags() indiscriminately on the Rx path, I get a big
> performance penalty in some cases - e.g. forwarding of non-TCP
> single buffer frames.

How big is big ?

You can not win all the time.

Some design (or optimizations) are for the most common case,
they might hurt some other use cases.

> 
> On the other hand, Eric shot down my attempt to differentiate between
> TCP and non-TCP frames inside the driver (see 
> https://patchwork.ozlabs.org/patch/1135817/#2222236), so I'm not
> really sure what's the recommended approach here?

If GRO is not good enough for non-TCP buffer frames, please make the change in GRO,
or document that disabling GRO might help some setups.

We do not want each driver to implement their own logic that are a
maintenance nightmare.

GRO can aggregate non-TCP frames (say if you add any encapsulation over TCP),
with a very significant gain, so detecting if an incoming frame is a 'TCP packet'
in the driver would be a serious problem if the traffic is 100% SIT for example.