
[net-next,2/2] udp: implement and use per cpu rx skbs cache

Message ID dacbc7a3626bb170629e02159ed2f90120f06382.1524045911.git.pabeni@redhat.com
State Deferred, archived
Delegated to: David Miller
Series UDP: introduce RX skb cache

Commit Message

Paolo Abeni April 18, 2018, 10:22 a.m. UTC
This changeset extends the idea behind commit c8c8b127091b ("udp:
under rx pressure, try to condense skbs"), trading more BH cpu
time and memory bandwidth to decrease the load on the user space
receiver.

At boot time we allocate a limited number of skbs with a small
data buffer, storing them in per cpu arrays. Such skbs are never
freed.

At run time, under rx pressure, the BH tries to copy the ingress
skb contents into the cache skb - if the current cache skb is
available, the ingress skb is small enough, and it carries no head
states.

When using the cache skb, the ingress skb is dropped by the BH
- while still hot in the cache - and the cache skb is inserted into
the rx queue after its usage count has been increased. The cache
array index then moves to the next entry.

The receive side is unmodified: in udp_recvmsg() the skb usage
count is decreased and the skb is _not_ freed, since the cache
keeps the usage count above zero. Because skb->users is hot in the
receiver's cache at consume time - the receiver has just read
skb->data, which lies in the same cacheline - the whole
skb_consume_udp() becomes really cheap.

UDP receive performance under flood improves as follows:

NR RX queues	Kpps	Kpps	Delta (%)
		Before	After

1		2252	2305	2
2		2151	2569	19
4		2033	2396	17
8		1969	2329	18

Overall performance of the knotd DNS server under a real traffic
flood improves as follows:

		Kpps	Kpps	Delta (%)
		Before	After

		3777	3981	5

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
--
Performance figures are with both PAGE_TABLE_ISOLATION and
RETPOLINES enabled, which explains the baseline numbers.
---
 net/ipv4/udp.c | 160 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 159 insertions(+), 1 deletion(-)

Comments

Eric Dumazet April 18, 2018, 4:56 p.m. UTC | #1
On 04/18/2018 03:22 AM, Paolo Abeni wrote:
> This changeset extends the idea behind commit c8c8b127091b ("udp:
> under rx pressure, try to condense skbs"), trading more BH cpu
> time and memory bandwidth to decrease the load on the user space
> receiver.
> 
> At boot time we allocate a limited amount of skbs with small
> data buffer, storing them in per cpu arrays. Such skbs are never
> freed.
> 
> At run time, under rx pressure, the BH tries to copy the current
> skb contents into the cache - if the current cache skb is available,
> and the ingress skb is small enough and without any head states.
> 
> When using the cache skb, the ingress skb is dropped by the BH
> - while still hot on cache - and the cache skb is inserted into
> the rx queue, after increasing its usage count. Also, the cache
> array index is moved to the next entry.
> 
> The receive side is unmodified: in udp_rcvmsg() the usage skb
> usage count is decreased and the skb is _not_ freed - since the
> cache keeps usage > 0. Since skb->usage is hot in the cache of the
> receiver at consume time - the receiver has just read skb->data,
> which lies in the same cacheline - the whole skb_consume_udp() becomes
> really cheap.
> 
> UDP receive performances under flood improve as follow:
> 
> NR RX queues	Kpps	Kpps	Delta (%)
> 		Before	After
> 
> 1		2252	2305	2
> 2		2151	2569	19
> 4		2033	2396	17
> 8		1969	2329	18
> 
> Overall performances of knotd DNS server under real traffic flood
> improves as follow:
> 
> 		Kpps	Kpps	Delta (%)
> 		Before	After
> 
> 		3777	3981	5


It might be time for the knotd DNS server to finally use SO_REUSEPORT instead of
adding this bloat to the kernel?

Sorry, a 5% improvement when you can easily get a 300% improvement with no kernel change
is not appealing to me :/
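
For context, SO_REUSEPORT load balancing only requires that each worker
process or thread create its own UDP socket, set the option before
bind() and bind to the same address/port; the kernel then spreads
incoming packets across the sockets. A minimal userspace sketch of the
kind of setup being suggested (the port number and the lack of error
checking on setsockopt() are illustrative assumptions, not taken from
knotd or from this patch):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* one such socket per worker, all bound to the same port */
static int open_reuseport_socket(void)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	/* must be set on every socket, before bind() */
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(53);	/* assumed DNS port, for illustration */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

Each worker then calls recvmsg()/sendmsg() on its own descriptor, which
is what lets the receive load scale with the number of workers.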
Paolo Abeni April 18, 2018, 5:15 p.m. UTC | #2
Hi,

On Wed, 2018-04-18 at 09:56 -0700, Eric Dumazet wrote:
> 
> On 04/18/2018 03:22 AM, Paolo Abeni wrote:
> > This changeset extends the idea behind commit c8c8b127091b ("udp:
> > under rx pressure, try to condense skbs"), trading more BH cpu
> > time and memory bandwidth to decrease the load on the user space
> > receiver.
> > 
> > At boot time we allocate a limited amount of skbs with small
> > data buffer, storing them in per cpu arrays. Such skbs are never
> > freed.
> > 
> > At run time, under rx pressure, the BH tries to copy the current
> > skb contents into the cache - if the current cache skb is available,
> > and the ingress skb is small enough and without any head states.
> > 
> > When using the cache skb, the ingress skb is dropped by the BH
> > - while still hot on cache - and the cache skb is inserted into
> > the rx queue, after increasing its usage count. Also, the cache
> > array index is moved to the next entry.
> > 
> > The receive side is unmodified: in udp_rcvmsg() the usage skb
> > usage count is decreased and the skb is _not_ freed - since the
> > cache keeps usage > 0. Since skb->usage is hot in the cache of the
> > receiver at consume time - the receiver has just read skb->data,
> > which lies in the same cacheline - the whole skb_consume_udp() becomes
> > really cheap.
> > 
> > UDP receive performances under flood improve as follow:
> > 
> > NR RX queues	Kpps	Kpps	Delta (%)
> > 		Before	After
> > 
> > 1		2252	2305	2
> > 2		2151	2569	19
> > 4		2033	2396	17
> > 8		1969	2329	18
> > 
> > Overall performances of knotd DNS server under real traffic flood
> > improves as follow:
> > 
> > 		Kpps	Kpps	Delta (%)
> > 		Before	After
> > 
> > 		3777	3981	5
> 
> 
> It might be time for knotd DNS server to finally use SO_REUSEPORT instead of
> adding this bloat to the kernel ?
> 
> Sorry, 5% improvement while you easily can get 300% improvement with no kernel change
> is not appealing to me :/

Thank you for the feedback.
Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
the above tests are leveraging it.

That 5% is on top of that 300%.

Cheers,

Paolo
Eric Dumazet April 18, 2018, 7:21 p.m. UTC | #3
On 04/18/2018 10:15 AM, Paolo Abeni wrote:
is not appealing to me :/
> 
> Thank you for the feedback.
> Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
> the above tests are leveraging it.
> 
> That 5% is on top of that 300%.

Then there is something wrong.

Adding copies should not increase performance.

If it does, there is certainly another way, reaching 10% instead of 5%
Paolo Abeni April 19, 2018, 7:40 a.m. UTC | #4
Hi,

On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
> 
> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
> is not appealing to me :/
> > 
> > Thank you for the feedback.
> > Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
> > the above tests are leveraging it.
> > 
> > That 5% is on top of that 300%.
> 
> Then there is something wrong.
> 
> Adding copies should not increase performance.

The skb and data are copied into the UDP skb cache only if the socket
is under memory pressure, and that happens if and only if the receiver
is slower than the BH/IP receive path.

The copy slows down the RX path - which was dropping packets - and
makes udp_recvmsg() considerably faster, as consuming the skb becomes
almost a no-op.

AFAICS, this is similar to the strategy you used in:

commit c8c8b127091b758f5768f906bcdeeb88bc9951ca
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Dec 7 09:19:33 2016 -0800

    udp: under rx pressure, try to condense skbs

with the difference that with the UDP skb cache there is a hard limit
to the amount of memory the BH is allowed to copy.

> If it does, there is certainly another way, reaching 10% instead of 5%

I benchmarked against a DNS server to verify that we get measurable
benefits in a real-life scenario. The measured performance gain for the
RX path with reasonable configurations is ~20%.

Any suggestions for better results are more than welcome!

Cheers,

Paolo
Eric Dumazet April 19, 2018, 1:47 p.m. UTC | #5
On 04/19/2018 12:40 AM, Paolo Abeni wrote:
> Hi,
> 
> On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
>>
>> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
>> is not appealing to me :/
>>>
>>> Thank you for the feedback.
>>> Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
>>> the above tests are leveraging it.
>>>
>>> That 5% is on top of that 300%.
>>
>> Then there is something wrong.
>>
>> Adding copies should not increase performance.
> 
> The skb and data are copied into the UDP skb cache only if the socket
> is under memory pressure, and that happens if and only if the receiver
> is slower than the BH/IP receive path.

Which is going to happen under attack.

Bimodal behavior is dangerous for system stability.

> 
> The copy slows down the RX path - which was dropping packets - and
> makes the udp_recvmsg() considerably faster, as consuming skb becomes
> almost a no-op.
> 
> AFAICS, this is similar to the strategy you used in:
> 
> ommit c8c8b127091b758f5768f906bcdeeb88bc9951ca
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Wed Dec 7 09:19:33 2016 -0800
> 
>     udp: under rx pressure, try to condense skbs
> 
> with the difference that with the UDP skb cache there is an hard limit
> to the amount of memory the BH is allowed to copy.
>

Very different strategy really.

We do not copy 500 bytes per skb :/

and the total amount of memory is tunable (socket rcvbuf)
instead of hard coded in the kernel :/

 
>> If it does, there is certainly another way, reaching 10% instead of 5%
> 
> I benchmarked vs a DNS server to test and verify that we get measurable
> benefits in real life scenario. The measured performance gain for the
> RX path with reasonable configurations is ~20%.

Then we probably can make +40% without copies.



> 
> Any suggestions for better results are more than welcome!


Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in Seoul (netdev conference)

Not tied to UDP, but a generic solution.

You are adding more and more code that only helps in some benchmarks, really.

The UDP stack is becoming a very complex beast, while heavy-duty UDP servers have alternatives
and cannot cope with arbitrary floods anyway.
Jesper Dangaard Brouer April 20, 2018, 1:48 p.m. UTC | #6
On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 04/19/2018 12:40 AM, Paolo Abeni wrote:
> > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:  
> >> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
[...]
> > 
> > Any suggestions for better results are more than welcome!  
> 
> Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
> Seoul (netdev conference). Not tied to UDP, but a generic solution.

Yes, I remember.  I think... was it the idea where you basically
wanted to queue SKBs back to the CPU that allocated them, right?

Freeing an SKB on the same CPU that allocated it has multiple
advantages: (1) the SLUB allocator can use a non-atomic
"cpu-local" (double)cmpxchg; (2) the 4 cache lines of the SKB cleared
by memset stay local; (3) the atomic SKB refcnt/users stays local.

We just have to make sure that the queue-back-SKBs mechanism doesn't
cost more than the operations we expect to save.  Bulk transfer is an
obvious approach.  For storing SKBs until they are returned, we already
have a fast mechanism: see napi_consume_skb() calling _kfree_skb_defer(),
which uses SLUB/SLAB bulk freeing to amortize cost (1).

I guess the missing information is that we don't know which CPU the
SKB was created on...

Where to store this CPU info?

(a) In struct sk_buff, in a cache line that is already read on the
remote CPU in UDP code?

(b) In struct page: as SLUB hands out objects/SKBs on a per-page
basis, we could have SLUB store a hint about the CPU it was allocated
on, and bet on returning to that CPU? (might be bad to read the
struct-page cache line)
Willem de Bruijn April 21, 2018, 3:54 p.m. UTC | #7
On Fri, Apr 20, 2018 at 9:48 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On 04/19/2018 12:40 AM, Paolo Abeni wrote:
>> > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
>> >> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
> [...]
>> >
>> > Any suggestions for better results are more than welcome!
>>
>> Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
>> Seoul (netdev conference). Not tied to UDP, but a generic solution.
>
> Yes, I remember.  I think... was it the idea, where you basically
> wanted to queue back SKBs to the CPU that allocated them, right?
>
> Freeing an SKB on the same CPU that allocated it, have multiple
> advantages. (1) the SLUB allocator can use a non-atomic
> "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
> the SKB stay local.  (3) the atomic SKB refcnt/users stay local.
>
> We just have to avoid that queue back SKB's mechanism, doesn't cost
> more than the operations we expect to save.  Bulk transfer is an
> obvious approach.  For storing SKBs until they are returned, we already
> have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
> which SLUB/SLAB-bulk free to amortize cost (1).
>
> I guess, the missing information is that we don't know what CPU the SKB
> were created on...

For connected sockets, sk->sk_incoming_cpu has this data. It
records the BH CPU on enqueue to the UDP socket, so one caveat is
that it may be wrong with RPS/RFS.

Another option is to associate the skb not with the source CPU but
with the napi struct, and have the device driver free it in the
context of its napi processing. This has the additional benefit that
skb->napi_id is already stored per skb, so this also works for
unconnected sockets.

Third, the skb->napi_id field is unused after setting sk->sk_napi_id
on sk enqueue, so the BH CPU could be stored there after that,
essentially extending sk_incoming_cpu to unconnected sockets.
Eric Dumazet April 21, 2018, 4:45 p.m. UTC | #8
On 04/21/2018 08:54 AM, Willem de Bruijn wrote:
> On Fri, Apr 20, 2018 at 9:48 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>>
>> On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On 04/19/2018 12:40 AM, Paolo Abeni wrote:
>>>> On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
>>>>> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
>> [...]
>>>>
>>>> Any suggestions for better results are more than welcome!
>>>
>>> Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
>>> Seoul (netdev conference). Not tied to UDP, but a generic solution.
>>
>> Yes, I remember.  I think... was it the idea, where you basically
>> wanted to queue back SKBs to the CPU that allocated them, right?
>>
>> Freeing an SKB on the same CPU that allocated it, have multiple
>> advantages. (1) the SLUB allocator can use a non-atomic
>> "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
>> the SKB stay local.  (3) the atomic SKB refcnt/users stay local.
>>
>> We just have to avoid that queue back SKB's mechanism, doesn't cost
>> more than the operations we expect to save.  Bulk transfer is an
>> obvious approach.  For storing SKBs until they are returned, we already
>> have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
>> which SLUB/SLAB-bulk free to amortize cost (1).
>>
>> I guess, the missing information is that we don't know what CPU the SKB
>> were created on...
> 
> For connected sockets, sk->sk_incoming_cpu has this data. It
> records BH cpu on enqueue to udp socket, so one caveat is that
> it may be wrong with rps/rfs.
> 
> Another option is to associate not with source cpu but napi struct
> and have the device driver free in the context of its napi processing.
> This has the additional benefit that skb->napi_id is already stored
> per skb, so this also works for unconnected sockets.
> 
> Third, the skb->napi_id field is unused after setting sk->sk_napi_id
> on sk enqueue, so the BH cpu could be stored here after that,
> essentially extending sk_incoming_cpu to unconnected sockets.

At Google we use something named TXCS, which is what I mentioned to Jesper and Tariq.

(In our case, we wanted to perform the skb destructor/freeing not on the CPU handling the TX queue,
but on the CPUs that originally cooked the skb (running the TCP stack).)

To accommodate generic needs (both RX and TX), I do not believe we can union any existing fields
without a lot of pain/bugs.
Paolo Abeni April 22, 2018, 11:22 a.m. UTC | #9
On Fri, 2018-04-20 at 15:48 +0200, Jesper Dangaard Brouer wrote:
> On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On 04/19/2018 12:40 AM, Paolo Abeni wrote:
> > > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:  
> > > > On 04/18/2018 10:15 AM, Paolo Abeni wrote:
> 
> [...]
> > > 
> > > Any suggestions for better results are more than welcome!  
> > 
> > Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
> > Seoul (netdev conference). Not tied to UDP, but a generic solution.
> 
> Yes, I remember.  I think... was it the idea, where you basically
> wanted to queue back SKBs to the CPU that allocated them, right?
> 
> Freeing an SKB on the same CPU that allocated it, have multiple
> advantages. (1) the SLUB allocator can use a non-atomic
> "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
> the SKB stay local.  (3) the atomic SKB refcnt/users stay local.

By the time the skb is returned to the ingress cpu, isn't that skb most
probably out of the cache?

> We just have to avoid that queue back SKB's mechanism, doesn't cost
> more than the operations we expect to save.  Bulk transfer is an
> obvious approach.  For storing SKBs until they are returned, we already
> have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
> which SLUB/SLAB-bulk free to amortize cost (1).
> 
> I guess, the missing information is that we don't know what CPU the SKB
> were created on...
> 
> Where to store this CPU info?
> 
> (a) In struct sk_buff, in a cache-line that is already read on remote
> CPU in UDP code?
> 
> (b) In struct page, as SLUB alloc hand-out objects/SKBs on a per page
> basis, we could have SLUB store a hint about the CPU it was allocated
> on, and bet on returning to that CPU ? (might be bad to read the
> struct-page cache-line)

Bulking would be doable only for connected sockets; elsewhere it would
be difficult to assemble a burst long enough to amortize the handshake
with the remote CPU (spinlock + IPI needed?!?).

Would it be good enough for unconnected sockets to send a whole skb
burst back to one of the (several) ingress CPUs? E.g. peeking at the
CPU associated with the first skb inside the burst, we would somewhat
balance the load between the ingress CPUs.

Cheers,

Paolo
Tariq Toukan April 23, 2018, 8:13 a.m. UTC | #10
On 20/04/2018 4:48 PM, Jesper Dangaard Brouer wrote:
> 
> On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On 04/19/2018 12:40 AM, Paolo Abeni wrote:
>>> On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
>>>> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
> [...]
>>>
>>> Any suggestions for better results are more than welcome!
>>
>> Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
>> Seoul (netdev conference). Not tied to UDP, but a generic solution.
> 
> Yes, I remember.  I think... was it the idea, where you basically
> wanted to queue back SKBs to the CPU that allocated them, right?
> 
> Freeing an SKB on the same CPU that allocated it, have multiple
> advantages. (1) the SLUB allocator can use a non-atomic
> "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
> the SKB stay local.  (3) the atomic SKB refcnt/users stay local.
> 
> We just have to avoid that queue back SKB's mechanism, doesn't cost
> more than the operations we expect to save.  Bulk transfer is an
> obvious approach.  For storing SKBs until they are returned, we already
> have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
> which SLUB/SLAB-bulk free to amortize cost (1).
> 
> I guess, the missing information is that we don't know what CPU the SKB
> were created on...
> 
> Where to store this CPU info?
> 
> (a) In struct sk_buff, in a cache-line that is already read on remote
> CPU in UDP code?
> 
> (b) In struct page, as SLUB alloc hand-out objects/SKBs on a per page
> basis, we could have SLUB store a hint about the CPU it was allocated
> on, and bet on returning to that CPU ? (might be bad to read the
> struct-page cache-line)
> 

I'm in favor of (a).

The pages of an SKB originate on the same CPU (guaranteed by NAPI),
so a single field in the SKB covers all of its fragments and there is
no need to read this info from every single page. This also keeps the
change local to the networking subsystem.

Best if we find a hole in struct sk_buff (for a u16?), or union it
with a mutually exclusive field.

Regards,
Tariq
Jesper Dangaard Brouer April 23, 2018, 8:52 a.m. UTC | #11
On Sun, 22 Apr 2018 13:22:58 +0200
Paolo Abeni <pabeni@redhat.com> wrote:

> On Fri, 2018-04-20 at 15:48 +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:  
> > > On 04/19/2018 12:40 AM, Paolo Abeni wrote:  
> > > > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:    
> > > > > On 04/18/2018 10:15 AM, Paolo Abeni wrote:  
> > 
> > [...]  
> > > > 
> > > > Any suggestions for better results are more than welcome!    
> > > 
> > > Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
> > > Seoul (netdev conference). Not tied to UDP, but a generic solution.  
> > 
> > Yes, I remember.  I think... was it the idea, where you basically
> > wanted to queue back SKBs to the CPU that allocated them, right?
> > 
> > Freeing an SKB on the same CPU that allocated it, have multiple
> > advantages. (1) the SLUB allocator can use a non-atomic
> > "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
> > the SKB stay local.  (3) the atomic SKB refcnt/users stay local.  
> 
> By the time the skb is returned to the ingress cpu, isn't that skb most
> probably out of the cache?

This is too simplistic a view.  You have to look at the cache
coherence state [1] of the individual cache lines (an SKB consists of 4
cache lines), and newer Intel CPUs [2] can "Forward (F)" cache lines
between caches.  The SKB cache line that holds the atomic refcnt/users
is the important one to analyze (the Read For Ownership (RFO) case).
Analyzing the other cache lines is actually more complicated due to
techniques like store buffers and invalidate queues.

[1] https://en.wikipedia.org/wiki/MESI_protocol
[2] https://en.wikipedia.org/wiki/MESIF_protocol

There is also a lot of detail in point (1) about how the SLUB
allocator works internally, and how it avoids bouncing the struct-page
cache line.  Some of the performance benefit from your current patch
also comes from this...


> > We just have to avoid that queue back SKB's mechanism, doesn't cost
> > more than the operations we expect to save.  Bulk transfer is an
> > obvious approach.  For storing SKBs until they are returned, we already
> > have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
> > which SLUB/SLAB-bulk free to amortize cost (1).
> > 
> > I guess, the missing information is that we don't know what CPU the SKB
> > were created on...
> > 
> > Where to store this CPU info?
> > 
> > (a) In struct sk_buff, in a cache-line that is already read on remote
> > CPU in UDP code?
> > 
> > (b) In struct page, as SLUB alloc hand-out objects/SKBs on a per page
> > basis, we could have SLUB store a hint about the CPU it was allocated
> > on, and bet on returning to that CPU ? (might be bad to read the
> > struct-page cache-line)  
> 
> Bulking would be doable only for connected sockets, elsewhere would be
> difficult to assemble a burst long enough to amortize the handshake
> with the remote CPU (spinlock + ipi needed ?!?)

We obviously need some level of bulking.

I would likely try to avoid any explicit IPI calls, and instead use a
queue like the ptr_ring queue, because it has good separation between
the cache lines used by the consumer and the producer (but it might be
overkill for this use case).

 
> Would be good enough for unconnected sockets sending a whole skb burst
> back to one of the (several) ingress CPU? e.g. peeking the CPU
> associated with the first skb inside the burst, we would somewhat
> balance the load between the ingress CPUs.

See Willem de Bruijn's suggestions...
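
To make the remote-freeing direction discussed in this thread more
concrete, here is a rough, untested sketch. It assumes a hypothetical
u16 skb->alloc_cpu hint (Tariq's option (a); no such field exists in
this patch or in the kernel tree under discussion) and uses a per-cpu
ptr_ring as the return queue, as suggested above; the names and the
drain hook are illustrative only:

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/ptr_ring.h>
#include <linux/skbuff.h>
#include <linux/smp.h>

#define SKB_RETURN_RING_SIZE	1024

/* one return queue per CPU: remote CPUs produce, the owner consumes */
static DEFINE_PER_CPU(struct ptr_ring, skb_return_ring);

static int __init skb_return_ring_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		if (ptr_ring_init(per_cpu_ptr(&skb_return_ring, cpu),
				  SKB_RETURN_RING_SIZE, GFP_KERNEL))
			return -ENOMEM;
	return 0;
}

/* run by the allocating CPU, e.g. right after the skb is built */
static inline void skb_record_alloc_cpu(struct sk_buff *skb)
{
	skb->alloc_cpu = raw_smp_processor_id();	/* hypothetical field */
}

/* run by the consumer instead of freeing the skb directly */
static void skb_return_to_alloc_cpu(struct sk_buff *skb)
{
	struct ptr_ring *ring = per_cpu_ptr(&skb_return_ring, skb->alloc_cpu);

	/* ptr_ring_produce() takes the producer spinlock; if the ring is
	 * full, fall back to an immediate (remote) free
	 */
	if (ptr_ring_produce(ring, skb))
		kfree_skb(skb);
}

/* drained by the owning CPU, e.g. from its NAPI poll / softirq, so that
 * the refcount and slab operations stay CPU-local
 */
static void skb_return_ring_drain(void)
{
	struct ptr_ring *ring = this_cpu_ptr(&skb_return_ring);
	struct sk_buff *skb;

	while ((skb = ptr_ring_consume(ring)) != NULL)
		kfree_skb(skb);
}

Whether the drain runs from NAPI poll, a dedicated softirq, or via the
bulk transfers discussed above is exactly the open question in this
thread; the sketch only shows where the per-CPU ownership information
would be recorded and used.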

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3fb0fbf4977d..bb1879cd51b4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -125,6 +125,26 @@  EXPORT_SYMBOL(sysctl_udp_mem);
 atomic_long_t udp_memory_allocated;
 EXPORT_SYMBOL(udp_memory_allocated);
 
+struct skb_cache_entry {
+	int size;
+	int head;
+	struct sk_buff *skbs[0];
+};
+
+static struct skb_cache_entry __percpu *skb_cache;
+
+/* Under socket memory pressure, small packets are copied to a percpu cache
+ * before enqueuing them, to decrease the load on the receiver process.
+ * To avoid excessive copy overhead we use a small skb size threshold.
+ * Each percpu cache should be able to cope with at least a socket under
+ * memory pressure. It doesn't need to handle many of them: if there are
+ * more than a few sockets under memory pressure, the user-space is most
+ * probably too lazy and there is no gain using the cache
+ */
+#define UDP_CACHE_MAX_SKB_LEN		512
+#define UDP_CACHE_MIN_SIZE		_SK_MEM_PACKETS
+#define UDP_CACHE_MAX_SIZE		(_SK_MEM_PACKETS * 3)
+
 #define MAX_UDP_PORTS 65536
 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
 
@@ -1246,6 +1266,82 @@  static void udp_skb_dtor_locked(struct sock *sk, struct sk_buff *skb)
 	udp_rmem_release(sk, udp_skb_truesize(skb), 1, true);
 }
 
+static inline struct sk_buff *udp_cache_get_skb(void)
+{
+	struct skb_cache_entry *cache;
+	struct sk_buff *skb;
+
+	if (unlikely(!skb_cache))
+		return NULL;
+
+	cache = this_cpu_ptr(skb_cache);
+	skb = cache->skbs[cache->head];
+	if (refcount_read(&skb->users) != 1)
+		return NULL;
+
+	/* peeking with offset clones the queued skbs, we must check that all
+	 * the cloned references are gone.
+	 * This barrier is paired with the implicit one in skb_unref(), while
+	 * decrementing skb->users.
+	 */
+	rmb();
+	if (unlikely(skb->cloned)) {
+		if (atomic_read(&skb_shinfo(skb)->dataref) != 1)
+			return NULL;
+		skb->cloned = 0;
+	}
+
+	cache->head++;
+	if (cache->head == cache->size)
+		cache->head = 0;
+	refcount_inc(&skb->users);
+	return skb;
+}
+
+static bool udp_copy_to_cache(struct sk_buff **s)
+{
+	struct sk_buff *skb2, *skb = *s;
+	int hlen;
+
+	/* check if we can copy the specified skb into the cache: data + l3 +
+	 * l4 must be below the cached skb size and no head states must
+	 * be attached.
+	 */
+	hlen = skb_network_header_len(skb) + sizeof(struct udphdr);
+	if ((hlen + skb->len) >= UDP_CACHE_MAX_SKB_LEN || skb_sec_path(skb))
+		return false;
+
+	skb2 = udp_cache_get_skb();
+	if (!skb2)
+		return false;
+
+	/* copy the relevant header: we skip the head states - we know no state
+	 * is attached to 'skb' - the irrelevant part of the CB, and
+	 * skb->dev - will be overwritten later by udp_set_dev_scratch()
+	 */
+	skb2->tstamp	    = skb->tstamp;
+	*UDP_SKB_CB(skb2)   = *UDP_SKB_CB(skb);
+	skb2->queue_mapping = skb->queue_mapping;
+	memcpy(&skb2->headers_start, &skb->headers_start,
+	       offsetof(struct sk_buff, headers_end) -
+	       offsetof(struct sk_buff, headers_start));
+
+	/* skip the mac header, we don't need it */
+	skb_copy_bits(skb, -hlen, skb2->head, skb->len + hlen);
+
+	/* override the relevant offsets: skb2 starts from the network hdr */
+	skb2->transport_header = hlen - sizeof(struct udphdr);
+	skb2->network_header  = 0;
+	skb2->mac_header = 0;
+	skb2->data = skb2->head + hlen;
+	skb_set_tail_pointer(skb2, skb->len);
+	skb2->len = skb->len;
+	consume_skb(skb);
+
+	*s = skb2;
+	return true;
+}
+
 /* Idea of busylocks is to let producers grab an extra spinlock
  * to relieve pressure on the receive_queue spinlock shared by consumer.
  * Under flood, this means that only one producer can be in line
@@ -1290,9 +1386,12 @@  int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	 * - Reduce memory overhead and thus increase receive queue capacity
 	 * - Less cache line misses at copyout() time
 	 * - Less work at consume_skb() (less alien page frag freeing)
+	 * Additionally, processing skbs from the cache allows udp_recvmsg()
+	 * to 'free' them with a single atomic operation on a hot cacheline
 	 */
 	if (rmem > (sk->sk_rcvbuf >> 1)) {
-		skb_condense(skb);
+		if (!udp_copy_to_cache(&skb))
+			skb_condense(skb);
 
 		busy = busylock_acquire(sk);
 	}
@@ -2858,6 +2957,64 @@  static struct pernet_operations __net_initdata udp_sysctl_ops = {
 	.init	= udp_sysctl_init,
 };
 
+static void udp_free_cache(int nr)
+{
+	int i, cpu;
+
+	for_each_possible_cpu(cpu)
+		for (i = 0; i < nr; ++i)
+			kfree_skb(per_cpu_ptr(skb_cache, cpu)->skbs[i]);
+
+	free_percpu(skb_cache);
+	skb_cache = NULL;
+}
+
+static void udp_init_cache(unsigned long max_size)
+{
+	size_t skb_guessed_size, per_cpu_size;
+	unsigned long total_size = 0;
+	struct sk_buff *skb;
+	int i, nr, cpu = 0;
+
+	/* try to fill the cache only if we can allocate a reasonable number
+	 * of skbs
+	 */
+	skb_guessed_size = SKB_TRUESIZE(UDP_CACHE_MAX_SKB_LEN);
+	nr = min_t(unsigned long, UDP_CACHE_MAX_SIZE,
+		   max_size / (nr_cpu_ids * skb_guessed_size));
+	if (nr < UDP_CACHE_MIN_SIZE) {
+		pr_info("low memory, UDP skbs cache will not be allocated\n");
+		return;
+	}
+
+	per_cpu_size = nr * sizeof(void *) + sizeof(struct skb_cache_entry);
+	skb_cache = __alloc_percpu_gfp(per_cpu_size, L1_CACHE_BYTES,
+				       GFP_KERNEL | __GFP_ZERO);
+	if (!skb_cache) {
+		pr_warn("Can't allocate UDP skb cache\n");
+		return;
+	}
+
+	pr_info("allocating %d skbs on %d CPUs for rx cache\n", nr, nr_cpu_ids);
+	for (i = 0; i < nr && total_size < max_size; ++i) {
+		for_each_possible_cpu(cpu) {
+			skb = __alloc_skb(UDP_CACHE_MAX_SKB_LEN, GFP_KERNEL,
+					  0, cpu_to_node(cpu));
+			if (!skb) {
+				pr_warn("allocation failure, cache disabled");
+				udp_free_cache(nr);
+				return;
+			}
+
+			total_size += skb->truesize;
+			per_cpu_ptr(skb_cache, cpu)->skbs[i] = skb;
+		}
+	}
+
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(skb_cache, cpu)->size = nr;
+}
+
 void __init udp_init(void)
 {
 	unsigned long limit;
@@ -2871,6 +3028,7 @@  void __init udp_init(void)
 	sysctl_udp_mem[2] = sysctl_udp_mem[0] * 2;
 
 	__udp_sysctl_init(&init_net);
+	udp_init_cache(sysctl_udp_mem[0] / 100 * PAGE_SIZE);
 
 	/* 16 spinlocks per cpu */
 	udp_busylocks_log = ilog2(nr_cpu_ids) + 4;