diff mbox

[RFC] Could we avoid touching dst->refcount in some cases ?

Message ID 492A6C94.7030308@cosmosbay.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Nov. 24, 2008, 8:57 a.m. UTC
tbench has hard time incrementing decrementing the route cache refcount
shared by all communications on localhost.

On real world, we also have this problem on RTP servers sending many UDP
frames to mediagateways, especially big ones handling thousand of streams.

Given that route entries are using RCU, we probably can avoid incrementing
their refcount in case of connected sockets ?

Here is a (untested and probably not working at all) patch on UDP part to
illustrate the idea :

Comments

Andi Kleen Nov. 24, 2008, 9:42 a.m. UTC | #1
Eric Dumazet <dada1@cosmosbay.com> writes:

> tbench has hard time incrementing decrementing the route cache refcount
> shared by all communications on localhost.

iirc there was a patch some time ago to use per CPU loopback devices to 
avoid this, but it was considered too much a benchmark hack.
As core counts increase it might stop being that though.

>
> On real world, we also have this problem on RTP servers sending many UDP
> frames to mediagateways, especially big ones handling thousand of streams.
>
> Given that route entries are using RCU, we probably can avoid incrementing
> their refcount in case of connected sockets ?

Normally they can be hold over sleeps or queuing of skbs too, and RCU
doesn't handle that. To make it handle that you would need to define a
custom RCU period designed for this case, but this would be probably
tricky and fragile: especially I'm not sure even if you had a "any
packet queued" RCU method it be guaranteed to always finish 
because there is no fixed upper livetime of a packet.

The other issue is that on preemptible kernels you would need to 
disable preemption all the time such a routing entry is hold, which
could be potentially quite long.

-Andi
Eric Dumazet Nov. 24, 2008, 10:14 a.m. UTC | #2
Andi Kleen a écrit :
> Eric Dumazet <dada1@cosmosbay.com> writes:
> 
>> tbench has hard time incrementing decrementing the route cache refcount
>> shared by all communications on localhost.
> 
> iirc there was a patch some time ago to use per CPU loopback devices to 
> avoid this, but it was considered too much a benchmark hack.
> As core counts increase it might stop being that though.

Well, you probably mention Stephen patch to avoid dirtying other contended
cache lines (one napi structure per cpu)

Having multiple loopback dev would really be a hack I agree.

> 
>> On real world, we also have this problem on RTP servers sending many UDP
>> frames to mediagateways, especially big ones handling thousand of streams.
>>
>> Given that route entries are using RCU, we probably can avoid incrementing
>> their refcount in case of connected sockets ?
> 
> Normally they can be hold over sleeps or queuing of skbs too, and RCU
> doesn't handle that. To make it handle that you would need to define a
> custom RCU period designed for this case, but this would be probably
> tricky and fragile: especially I'm not sure even if you had a "any
> packet queued" RCU method it be guaranteed to always finish 
> because there is no fixed upper livetime of a packet.
> 
> The other issue is that on preemptible kernels you would need to 
> disable preemption all the time such a routing entry is hold, which
> could be potentially quite long.
> 

Well, in case of UDP, we call ip_push_pending_frames() and this one
does the increment of refcount (again). I was not considering
avoiding the refcount hold we do when queing a skb in transmit
queue, only during a short period of time. Oh well, ip_append_data()
might sleep, so this cannot work...

I agree avoiding one refcount increment/decrement is probably
not a huge gain, considering we *have* to do the increment,
but when many cpus are using UDP send/receive in //, this might
show a gain somehow.

So maybe we could make ip_append_data() (or its callers) a
litle bit smarter, avoiding increment/decrement if possible.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andi Kleen Nov. 24, 2008, 11:27 a.m. UTC | #3
On Mon, Nov 24, 2008 at 11:14:29AM +0100, Eric Dumazet wrote:
> Andi Kleen a écrit :
> >Eric Dumazet <dada1@cosmosbay.com> writes:
> >
> >>tbench has hard time incrementing decrementing the route cache refcount
> >>shared by all communications on localhost.
> >
> >iirc there was a patch some time ago to use per CPU loopback devices to 
> >avoid this, but it was considered too much a benchmark hack.
> >As core counts increase it might stop being that though.
> 
> Well, you probably mention Stephen patch to avoid dirtying other contended
> cache lines (one napi structure per cpu)

No that patch wasn't from Stephen. iirc it was from someone at SGI.

-Andi
David Miller Nov. 24, 2008, 11:36 p.m. UTC | #4
From: Andi Kleen <andi@firstfloor.org>
Date: Mon, 24 Nov 2008 12:27:09 +0100

> On Mon, Nov 24, 2008 at 11:14:29AM +0100, Eric Dumazet wrote:
> > Andi Kleen a écrit :
> > >Eric Dumazet <dada1@cosmosbay.com> writes:
> > >
> > >>tbench has hard time incrementing decrementing the route cache refcount
> > >>shared by all communications on localhost.
> > >
> > >iirc there was a patch some time ago to use per CPU loopback devices to 
> > >avoid this, but it was considered too much a benchmark hack.
> > >As core counts increase it might stop being that though.
> > 
> > Well, you probably mention Stephen patch to avoid dirtying other contended
> > cache lines (one napi structure per cpu)
> 
> No that patch wasn't from Stephen. iirc it was from someone at SGI.

That's how I remember it too.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Nov. 24, 2008, 11:39 p.m. UTC | #5
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 24 Nov 2008 11:14:29 +0100

> So maybe we could make ip_append_data() (or its callers) a
> litle bit smarter, avoiding increment/decrement if possible.

These ideas are interesting but hard to make work.

I think the receive path has more chance of getting gains
from this, to be honest.

One third (effectively) of TCP stream packets are ACKs and
freed immediately.  This means that the looked up route does
not escape the packet receive path.  So we could elide the
counter increment in that case.

In fact, once we queue even TCP data, there is no need for
that cached skb->dst route any longer.

So pretty much all TCP packets could avoid the dst refcounting
on receive.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 25, 2008, 4:43 a.m. UTC | #6
David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Mon, 24 Nov 2008 11:14:29 +0100
> 
>> So maybe we could make ip_append_data() (or its callers) a
>> litle bit smarter, avoiding increment/decrement if possible.
> 
> These ideas are interesting but hard to make work.
> 
> I think the receive path has more chance of getting gains
> from this, to be honest.
> 
> One third (effectively) of TCP stream packets are ACKs and
> freed immediately.  This means that the looked up route does
> not escape the packet receive path.  So we could elide the
> counter increment in that case.
> 
> In fact, once we queue even TCP data, there is no need for
> that cached skb->dst route any longer.
> 
> So pretty much all TCP packets could avoid the dst refcounting
> on receive.

Very interesting. So we could try the following path :

1) First try to release dst when queueing skb to various queues
(UDP, TCP, ...) while its hot. Reader wont have to release it
while its cold.

2) Check if we can handle the input path without any refcount
   dirtying ?

To make the transition easy, we could use a bit on skb to mark
dst being not refcounted (ie no dst_release() should be done on it)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Nov. 25, 2008, 5 a.m. UTC | #7
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Tue, 25 Nov 2008 05:43:32 +0100

> Very interesting. So we could try the following path :
> 
> 1) First try to release dst when queueing skb to various queues
> (UDP, TCP, ...) while its hot. Reader wont have to release it
> while its cold.
> 
> 2) Check if we can handle the input path without any refcount
>    dirtying ?
> 
> To make the transition easy, we could use a bit on skb to mark
> dst being not refcounted (ie no dst_release() should be done on it)

It is possible to make this self-auditing.  For example, by
using the usual trick where we encode a pointer in an
unsigned long and use the low bits for states.

In the first step, make each skb->dst access go through some
accessor inline function.

Next, audit the paths where skb->dst's can "escape" the pure
packet input path.  Add annotations, in the form of a
inline function call, for these locations.

Also, audit the other locations where we enqueue into a socket
queue and no longer care about the skb->dst, and annotate
those with another inline function.

Finally, the initial skb->dst assignment in the input path doesn't
grab a reference, but sets the low bit ("refcount pending") in
the encoded skb->dst pointer.  The skb->dst "escape" inline
function performs the deferred refcount grab.  And kfree_skb()
is taught to not dst_release() on skb->dst's which have the
low bit set.

Anyways, something like that.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index da869ce..c385f13 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -553,6 +553,7 @@  int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int ulen = len;
 	struct ipcm_cookie ipc;
 	struct rtable *rt = NULL;
+	int rt_release = 0;
 	int free = 0;
 	int connected = 0;
 	__be32 daddr, faddr, saddr;
@@ -656,8 +657,9 @@  int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		connected = 0;
 	}
 
+	rcu_read_lock();
 	if (connected)
-		rt = (struct rtable*)sk_dst_check(sk, 0);
+		rt = (struct rtable *)__sk_dst_check(sk, 0);
 
 	if (rt == NULL) {
 		struct flowi fl = { .oif = ipc.oif,
@@ -681,11 +683,14 @@  int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		}
 
 		err = -EACCES;
+		rt_release = 1;
 		if ((rt->rt_flags & RTCF_BROADCAST) &&
 		    !sock_flag(sk, SOCK_BROADCAST))
 			goto out;
-		if (connected)
-			sk_dst_set(sk, dst_clone(&rt->u.dst));
+		if (connected) {
+			sk_dst_set(sk, &rt->u.dst);
+			rt_release = 0;
+		}
 	}
 
 	if (msg->msg_flags&MSG_CONFIRM)
@@ -730,7 +735,9 @@  do_append_data:
 	release_sock(sk);
 
 out:
-	ip_rt_put(rt);
+	if (rt_release)
+		ip_rt_put(rt);
+	rcu_read_unlock();
 	if (free)
 		kfree(ipc.opt);
 	if (!err)