Message ID | 4A53CD39.7080407@gmail.com |
---|---|
State | Superseded, archived |
Delegated to: | David Miller |
Headers | show |
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Wed, 08 Jul 2009 00:33:29 +0200 > [PATCH] net: sk_prot_alloc() should not blindly overwrite memory > > Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code rely that some > fields should not be blindly overwritten, even with null. > > These fields are sk->sk_refcnt and sk->sk_nulls_node.next > > Current sk_prot_alloc() implementation doesnt respect this hypothesis, > calling kmem_cache_alloc() with __GFP_ZERO and setting sk_refcnt to 1 > instead of atomically increment it. > > Reported-by: Emil S Tantilov <emils.tantilov@gmail.com> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> I've applied this but will wait for some more testing before I push it out for real to kernel.org -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Miller wrote: > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Wed, 08 Jul 2009 00:33:29 +0200 > >> [PATCH] net: sk_prot_alloc() should not blindly overwrite memory >> >> Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code rely that some >> fields should not be blindly overwritten, even with null. >> >> These fields are sk->sk_refcnt and sk->sk_nulls_node.next >> >> Current sk_prot_alloc() implementation doesnt respect this >> hypothesis, calling kmem_cache_alloc() with __GFP_ZERO and setting >> sk_refcnt to 1 instead of atomically increment it. >> >> Reported-by: Emil S Tantilov <emils.tantilov@gmail.com> >> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > > I've applied this but will wait for some more testing before > I push it out for real to kernel.org Still seeing traces during the test even with this patch applied: [ 1089.430093] ------------[ cut here ]------------ [ 1089.435667] WARNING: at include/net/sock.h:423 udp_lib_unhash+0x73/0xa0() [ 1089.435670] Hardware name: S5520HC [ 1089.435671] Modules linked in: igb dca mdio [last unloaded: ixgbe] [ 1089.435678] Pid: 15545, comm: netserver Not tainted 2.6.31-rc1-net-2.6-igb-ed-07071641 #4 [ 1089.435681] Call Trace: [ 1089.435686] [<ffffffff813e8a2f>] ? udp_lib_unhash+0x73/0xa0 [ 1089.435691] [<ffffffff81057b49>] warn_slowpath_common+0x77/0x8f [ 1089.435696] [<ffffffff81057b70>] warn_slowpath_null+0xf/0x11 [ 1089.435700] [<ffffffff813e8a2f>] udp_lib_unhash+0x73/0xa0 [ 1089.435705] [<ffffffff8138e616>] sk_common_release+0x2f/0xb4 [ 1089.435710] [<ffffffff81429028>] udp_lib_close+0x9/0xb [ 1089.435715] [<ffffffff813ee62a>] inet_release+0x58/0x5f [ 1089.435720] [<ffffffff814158e5>] inet6_release+0x30/0x35 [ 1089.435725] [<ffffffff8138be4b>] sock_release+0x1a/0x6c [ 1089.435729] [<ffffffff8138c366>] sock_close+0x22/0x26 [ 1089.435735] [<ffffffff810ec923>] __fput+0xf0/0x18c [ 1089.435739] [<ffffffff810eccd1>] fput+0x15/0x18 [ 1089.435742] [<ffffffff810e9bfa>] filp_close+0x5c/0x67 [ 1089.435746] [<ffffffff810e9c80>] sys_close+0x7b/0xb6 [ 1089.435751] [<ffffffff81027aab>] system_call_fastpath+0x16/0x1b [ 1089.435755] ---[ end trace a79410bd00b8b1ac ]--- Emil-- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: "Tantilov, Emil S" <emil.s.tantilov@intel.com> Date: Wed, 8 Jul 2009 11:02:22 -0600 > Still seeing traces during the test even with this patch applied: > > [ 1089.430093] ------------[ cut here ]------------ > [ 1089.435667] WARNING: at include/net/sock.h:423 udp_lib_unhash+0x73/0xa0() > [ 1089.435670] Hardware name: S5520HC Ok I'll back this out for now, needs more investigation obviously. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Miller a écrit : > From: "Tantilov, Emil S" <emil.s.tantilov@intel.com> > Date: Wed, 8 Jul 2009 11:02:22 -0600 > >> Still seeing traces during the test even with this patch applied: >> >> [ 1089.430093] ------------[ cut here ]------------ >> [ 1089.435667] WARNING: at include/net/sock.h:423 udp_lib_unhash+0x73/0xa0() >> [ 1089.435670] Hardware name: S5520HC > > Ok I'll back this out for now, needs more investigation > obviously. Hmm... I never said it was supposed to fix Emil problem, just that I discovered one potential problem by code inspection. I could not find yet sk_refcnt mismatch. As we do less atomic ops per packet than before, some old bug could surface now... Emil, is it easy to reproduce this problem, considering I have a similar platform than yours (dual quad core machine, E5450 cpus @ 3GHz) ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Dumazet wrote: > David Miller a écrit : >> From: "Tantilov, Emil S" <emil.s.tantilov@intel.com> >> Date: Wed, 8 Jul 2009 11:02:22 -0600 >> >>> Still seeing traces during the test even with this patch applied: >>> >>> [ 1089.430093] ------------[ cut here ]------------ >>> [ 1089.435667] WARNING: at include/net/sock.h:423 >>> udp_lib_unhash+0x73/0xa0() [ 1089.435670] Hardware name: S5520HC >> >> Ok I'll back this out for now, needs more investigation >> obviously. > > Hmm... I never said it was supposed to fix Emil problem, just that > I discovered one potential problem by code inspection. > > I could not find yet sk_refcnt mismatch. > As we do less atomic ops per packet than before, some old bug could > surface now... > > Emil, is it easy to reproduce this problem, considering I have a > similar platform than yours (dual quad core machine, E5450 cpus @ > 3GHz) ? Eric, It should be easy to reproduce. At least I have been able to consistently reproduce it on several different systems with different drivers (e1000, e1000e, igb). The test I'm running is a mix of IPV4/6 TCP/UDP traffic with netperf (also mixing different types TCP/UDP_STREAM, TCP_MAERTS, TCP_UDP_RR etc). How much this matters I don't know - it's possible that just UDP traffic would do it. I also think it may have something to do with IPv6 because of the trace, but I am not sure. If you need more information let me know. Thanks, Emil-- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/net/sock.h b/include/net/sock.h index 352f06b..3f6284e 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -103,15 +103,15 @@ struct net; /** * struct sock_common - minimal network layer representation of sockets + * @skc_node: main hash linkage for various protocol lookup tables + * @skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol + * @skc_refcnt: reference count + * @skc_hash: hash value used with various protocol lookup tables * @skc_family: network address family * @skc_state: Connection state * @skc_reuse: %SO_REUSEADDR setting * @skc_bound_dev_if: bound device index if != 0 - * @skc_node: main hash linkage for various protocol lookup tables - * @skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol * @skc_bind_node: bind hash linkage for various protocol lookup tables - * @skc_refcnt: reference count - * @skc_hash: hash value used with various protocol lookup tables * @skc_prot: protocol handlers inside a network family * @skc_net: reference to the network namespace of this socket * @@ -119,17 +119,23 @@ struct net; * for struct sock and struct inet_timewait_sock. */ struct sock_common { - unsigned short skc_family; - volatile unsigned char skc_state; - unsigned char skc_reuse; - int skc_bound_dev_if; + /* + * WARNING : skc_node, skc_refcnt must be first elements of this struct + * (sk_prot_alloc() should not clear skc_node.next & skc_refcnt because + * of RCU lookups) + */ union { struct hlist_node skc_node; struct hlist_nulls_node skc_nulls_node; }; - struct hlist_node skc_bind_node; atomic_t skc_refcnt; + unsigned int skc_hash; + unsigned short skc_family; + volatile unsigned char skc_state; + unsigned char skc_reuse; + int skc_bound_dev_if; + struct hlist_node skc_bind_node; struct proto *skc_prot; #ifdef CONFIG_NET_NS struct net *skc_net; diff --git a/net/core/sock.c b/net/core/sock.c index b0ba569..cf43ff1 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -939,8 +939,31 @@ static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, struct kmem_cache *slab; slab = prot->slab; - if (slab != NULL) - sk = kmem_cache_alloc(slab, priority); + if (slab != NULL) { + /* + * caches using SLAB_DESTROY_BY_RCU should let sk_node + * and sk_refcnt untouched. + * + * Following makes sense only if sk first fields are + * in following order : sk_node, refcnt, sk_hash + */ + BUILD_BUG_ON(offsetof(struct sock, sk_node) != 0); + BUILD_BUG_ON(offsetof(struct sock, sk_refcnt) != + sizeof(sk->sk_node)); + BUILD_BUG_ON(offsetof(struct sock, sk_hash) != + sizeof(sk->sk_node) + sizeof(sk->sk_refcnt)); + sk = kmem_cache_alloc(slab, priority & ~__GFP_ZERO); + if (!sk) + return sk; + if (priority & __GFP_ZERO) { + if (prot->slab_flags & SLAB_DESTROY_BY_RCU) + memset(&sk->sk_hash, 0, + prot->obj_size - + offsetof(struct sock, sk_hash)); + else + memset(sk, 0, prot->obj_size); + } + } else sk = kmalloc(prot->obj_size, priority); @@ -1125,7 +1148,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority) newsk->sk_err = 0; newsk->sk_priority = 0; - atomic_set(&newsk->sk_refcnt, 2); + atomic_add(2, &newsk->sk_refcnt); /* * Increment the counter in the same struct proto as the master @@ -1840,7 +1863,7 @@ void sock_init_data(struct socket *sock, struct sock *sk) sk->sk_stamp = ktime_set(-1L, 0); - atomic_set(&sk->sk_refcnt, 1); + atomic_inc(&sk->sk_refcnt); atomic_set(&sk->sk_wmem_alloc, 1); atomic_set(&sk->sk_drops, 0); } @@ -2140,12 +2163,25 @@ static inline void release_proto_idx(struct proto *prot) } #endif +/* + * We need to initialize skc->skc_refcnt to 0, and skc->skc_node.pprev to NULL + * but only on newly allocated objects (check sk_prot_alloc()) + */ +static void sk_init_once(void *arg) +{ + struct sock_common *skc = arg; + + atomic_set(&skc->skc_refcnt, 0); + skc->skc_node.pprev = NULL; +} + int proto_register(struct proto *prot, int alloc_slab) { if (alloc_slab) { prot->slab = kmem_cache_create(prot->name, prot->obj_size, 0, SLAB_HWCACHE_ALIGN | prot->slab_flags, - NULL); + prot->slab_flags & SLAB_DESTROY_BY_RCU ? + sk_init_once : NULL); if (prot->slab == NULL) { printk(KERN_CRIT "%s: Can't create sock SLAB cache!\n", @@ -2187,7 +2223,8 @@ int proto_register(struct proto *prot, int alloc_slab) 0, SLAB_HWCACHE_ALIGN | prot->slab_flags, - NULL); + prot->slab_flags & SLAB_DESTROY_BY_RCU ? + sk_init_once : NULL); if (prot->twsk_prot->twsk_slab == NULL) goto out_free_timewait_sock_slab_name; } diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c index 61283f9..a2997df 100644 --- a/net/ipv4/inet_timewait_sock.c +++ b/net/ipv4/inet_timewait_sock.c @@ -139,7 +139,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, const int stat tw->tw_transparent = inet->transparent; tw->tw_prot = sk->sk_prot_creator; twsk_net_set(tw, hold_net(sock_net(sk))); - atomic_set(&tw->tw_refcnt, 1); + atomic_inc(&tw->tw_refcnt); inet_twsk_dead_node_init(tw); __module_get(tw->tw_prot->owner); }