diff mbox

net: release skb->dst in sock_queue_rcv_skb()

Message ID 492C919E.3050108@cosmosbay.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Nov. 26, 2008, midnight UTC
David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Tue, 25 Nov 2008 05:43:32 +0100
> 
>> Very interesting. So we could try the following path :
>>
>> 1) First try to release dst when queueing skb to various queues
>> (UDP, TCP, ...) while its hot. Reader wont have to release it
>> while its cold.
>>
>> 2) Check if we can handle the input path without any refcount
>>    dirtying ?
>>
>> To make the transition easy, we could use a bit on skb to mark
>> dst being not refcounted (ie no dst_release() should be done on it)
> 
> It is possible to make this self-auditing.  For example, by
> using the usual trick where we encode a pointer in an
> unsigned long and use the low bits for states.
> 
> In the first step, make each skb->dst access go through some
> accessor inline function.
> 
> Next, audit the paths where skb->dst's can "escape" the pure
> packet input path.  Add annotations, in the form of a
> inline function call, for these locations.
> 
> Also, audit the other locations where we enqueue into a socket
> queue and no longer care about the skb->dst, and annotate
> those with another inline function.
> 
> Finally, the initial skb->dst assignment in the input path doesn't
> grab a reference, but sets the low bit ("refcount pending") in
> the encoded skb->dst pointer.  The skb->dst "escape" inline
> function performs the deferred refcount grab.  And kfree_skb()
> is taught to not dst_release() on skb->dst's which have the
> low bit set.
> 
> Anyways, something like that.

I looked this stuff and found it would be difficult to not grab a 
reference (and more important not writing to dst) in input path.

ip_rcv_finish() calls ip_route_input()
and ip_route_input() calls dst_use(&rth->u.dst, jiffies);

static inline void dst_use(struct dst_entry *dst, unsigned long time)
{
        dst_hold(dst);
        dst->__use++;
        dst->lastuse = time;
}

Even if we avoid the refcount increment, I guess we need the lastuse
assignement in order to keep dst in cache. Not sure about the role of
__use field. Hum... for a tcp connection, dst refcount should already
be pinned by a sk->sk_dst_cache. Maybe test refcount value, and if this
value is > 1, dont take a reference. (given rcu_read_lock() is done
before calling ip_rcv_finish())

In the meantime, what do you think of the following patch ?

[PATCH] net: release skb->dst in sock_queue_rcv_skb()

When queuing a skb to sk->sk_receive_queue, we can release its dst, not
anymore needed.
Since current cpu did the dst_hold(), refcount is probably still hot
int this cpu caches.

This avoids readers to access the original dst to decrement its refcount,
possibly a long time after packet reception. This should speedup UDP
and RAW receive path.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Comments

David Miller Nov. 26, 2008, 12:23 a.m. UTC | #1
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 26 Nov 2008 01:00:30 +0100

> Hum... for a tcp connection, dst refcount should already be pinned
> by a sk->sk_dst_cache. Maybe test refcount value, and if this value
> is > 1, dont take a reference. (given rcu_read_lock() is done before
> calling ip_rcv_finish())

Input route is different from output route.  sk->sk_dst_cache holds
the output route.  So I can't see how this can help here.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Nov. 26, 2008, 2:04 a.m. UTC | #2
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 26 Nov 2008 01:00:30 +0100

> In the meantime, what do you think of the following patch ?
> 
> [PATCH] net: release skb->dst in sock_queue_rcv_skb()
> 
> When queuing a skb to sk->sk_receive_queue, we can release its dst, not
> anymore needed.
> Since current cpu did the dst_hold(), refcount is probably still hot
> int this cpu caches.
> 
> This avoids readers to access the original dst to decrement its refcount,
> possibly a long time after packet reception. This should speedup UDP
> and RAW receive path.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

I guess the idea is that if we release quickly we'll not have
to reget the cacheline in owned state.

I wonder if this might actually slightly hurt loads like tbench where
we are banging on the refcnt constantly on every cpu anyways.

Can you do a quick check?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 26, 2008, 7:39 a.m. UTC | #3
David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Wed, 26 Nov 2008 01:00:30 +0100
> 
>> In the meantime, what do you think of the following patch ?
>>
>> [PATCH] net: release skb->dst in sock_queue_rcv_skb()
>>
>> When queuing a skb to sk->sk_receive_queue, we can release its dst, not
>> anymore needed.
>> Since current cpu did the dst_hold(), refcount is probably still hot
>> int this cpu caches.
>>
>> This avoids readers to access the original dst to decrement its refcount,
>> possibly a long time after packet reception. This should speedup UDP
>> and RAW receive path.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> I guess the idea is that if we release quickly we'll not have
> to reget the cacheline in owned state.

Yes, this is the idea.

> 
> I wonder if this might actually slightly hurt loads like tbench where
> we are banging on the refcnt constantly on every cpu anyways.

Yes, the only way to reduce the load on this case is not moving the
increments/decrements : It could help on some machines, and hurt on others.
In the long term, only reducing the number of dirtying can help the average.

> 
> Can you do a quick check?
> 
> 

Sure I can do a check :)

No impact at all on tbench, unless this bench can run in UDP mode :)


CPU: Core 2, speed 3000.1 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
599983   599983        11.2948  11.2948    copy_from_user
549349   1149332       10.3416  21.6364    ipt_do_table
245697   1395029        4.6253  26.2617    copy_to_user
221663   1616692        4.1729  30.4346    schedule
144888   1761580        2.7275  33.1621    tcp_sendmsg
136967   1898547        2.5784  35.7406    tcp_ack
115513   2014060        2.1746  37.9151    tcp_transmit_skb
99517    2113577        1.8734  39.7885    sysenter_past_esp
99438    2213015        1.8719  41.6605    ip_queue_xmit
98171    2311186        1.8481  43.5086    tcp_recvmsg
80763    2391949        1.5204  45.0290    __switch_to
79621    2471570        1.4989  46.5278    tcp_v4_rcv
77210    2548780        1.4535  47.9813    dst_release
69387    2618167        1.3062  49.2876    tcp_rcv_established
64709    2682876        1.2182  50.5057    __tcp_push_pending_frames
55223    2738099        1.0396  51.5453    lock_sock_nested
53754    2791853        1.0119  52.5572    sys_socketcall
50092    2841945        0.9430  53.5002    netif_receive_skb
49499    2891444        0.9318  54.4321    release_sock
47796    2939240        0.8998  55.3318    __inet_lookup_established
45162    2984402        0.8502  56.1820    update_curr
44895    3029297        0.8452  57.0272    ip_rcv
42945    3072242        0.8084  57.8356    dev_queue_xmit
42892    3115134        0.8075  58.6431    tcp_event_data_recv
42768    3157902        0.8051  59.4482    local_bh_enable
41555    3199457        0.7823  60.2305    netif_rx
38613    3238070        0.7269  60.9574    __alloc_skb
38016    3276086        0.7157  61.6730    ip_finish_output
36867    3312953        0.6940  62.3671    tcp_current_mss
36759    3349712        0.6920  63.0591    skb_release_data
35560    3385272        0.6694  63.7285    local_bh_enable_ip
34100    3419372        0.6419  64.3704    sock_recvmsg
33829    3453201        0.6368  65.0073    __kfree_skb
32949    3486150        0.6203  65.6275    sched_clock_cpu

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Nov. 26, 2008, 9:08 a.m. UTC | #4
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 26 Nov 2008 08:39:30 +0100

> David Miller a écrit :
> > From: Eric Dumazet <dada1@cosmosbay.com>
> > Date: Wed, 26 Nov 2008 01:00:30 +0100
> > 
> > Can you do a quick check?
> > 
> 
> Sure I can do a check :)
> 
> No impact at all on tbench, unless this bench can run in UDP mode :)

Durrr, of course!

Patch applied, thanks Eric.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark McLoughlin Dec. 17, 2008, 11:25 a.m. UTC | #5
Hi,

On Wed, 2008-11-26 at 01:00 +0100, Eric Dumazet wrote:

> [PATCH] net: release skb->dst in sock_queue_rcv_skb()
> 
> When queuing a skb to sk->sk_receive_queue, we can release its dst, not
> anymore needed.
> Since current cpu did the dst_hold(), refcount is probably still hot
> int this cpu caches.
> 
> This avoids readers to access the original dst to decrement its refcount,
> possibly a long time after packet reception. This should speedup UDP
> and RAW receive path.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> plain text document attachment (sock_queue_rcv_skb.patch)
> diff --git a/net/core/sock.c b/net/core/sock.c
> index a4e840e..b287645 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -289,7 +289,11 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  
>  	skb->dev = NULL;
>  	skb_set_owner_r(skb, sk);
> -
> +	/*
> +	 * release dst right now while its hot
> +	 */
> +	dst_release(skb->dst);
> +	skb->dst = NULL;

IP_PKTINFO cmsg data is one post-queueing user:

  static void ip_cmsg_recv_pktinfo(struct msghdr *msg, struct sk_buff *skb)
  {
          struct in_pktinfo info;
          struct rtable *rt = skb->rtable;

          info.ipi_addr.s_addr = ip_hdr(skb)->daddr;
          if (rt) {
                  info.ipi_ifindex = rt->rt_iif;
                  info.ipi_spec_dst.s_addr = rt->rt_spec_dst;
          } else {
                  info.ipi_ifindex = 0;
                  info.ipi_spec_dst.s_addr = 0;
          }

          put_cmsg(msg, SOL_IP, IP_PKTINFO, sizeof(info), &info);
  }

(i.e. skb->rtable is NULL at this point)

I'm seeing dnsmasq not working on net-next-2.6 because of this and
reverting commit 7035560 makes things work as expected again.

Cheers,
Mark.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Dec. 18, 2008, 3:34 a.m. UTC | #6
From: Mark McLoughlin <markmc@redhat.com>
Date: Wed, 17 Dec 2008 11:25:01 +0000

> On Wed, 2008-11-26 at 01:00 +0100, Eric Dumazet wrote:
> 
> > [PATCH] net: release skb->dst in sock_queue_rcv_skb()
 ...
> IP_PKTINFO cmsg data is one post-queueing user:

Eric, we'll need to rever this change I think.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/sock.c b/net/core/sock.c
index a4e840e..b287645 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -289,7 +289,11 @@  int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	skb->dev = NULL;
 	skb_set_owner_r(skb, sk);
-
+	/*
+	 * release dst right now while its hot
+	 */
+	dst_release(skb->dst);
+	skb->dst = NULL;
 	/* Cache the SKB length before we tack it onto the receive
 	 * queue.  Once it is added it no longer belongs to us and
 	 * may be freed by other threads of control pulling packets