diff mbox

net: Fix IP_MULTICAST_IF

Message ID 4ADC96D6.4000909@gmail.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Oct. 19, 2009, 4:41 p.m. UTC
ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.

This function should be called only with RTNL or dev_base_lock held, or reader
could see a corrupt hash chain and eventually enter an endless loop.

Fix is to call dev_get_by_index()/dev_put().

If this happens to be performance critical, we could define a new dev_exist_by_index()
function to avoid touching dev refcount.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_sockglue.c   |    7 +++----
 net/ipv6/ipv6_sockglue.c |    6 +++++-
 2 files changed, 8 insertions(+), 5 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Oct. 20, 2009, 3:59 a.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 19 Oct 2009 18:41:58 +0200

> ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.

Dubious, how so?

Yes, I know RTNL/dev_base_lock, but it's not using what it gets
back at all.

It's testing existence, a boolean, it doesn't dereference the
'dev' it gets back at all.

This code is intentional and perfectly fine.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 20, 2009, 4:07 a.m. UTC | #2
David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 19 Oct 2009 18:41:58 +0200
> 
>> ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.
> 
> Dubious, how so?
> 
> Yes, I know RTNL/dev_base_lock, but it's not using what it gets
> back at all.
> 
> It's testing existence, a boolean, it doesn't dereference the
> 'dev' it gets back at all.
> 
> This code is intentional and perfectly fine.

If this was intentional, something changed and made the prereq false.

Final target might be fine, but an element in the chain, before target
could be deleted while reader scans hash chain.

/* Device list removal */
static void unlist_netdevice(struct net_device *dev)
{
        ASSERT_RTNL();

        /* Unlink dev from the device chain */
        write_lock_bh(&dev_base_lock);
        list_del(&dev->dev_list);
        hlist_del(&dev->name_hlist);
        hlist_del(&dev->index_hlist);   <<< HERE >>>
        write_unlock_bh(&dev_base_lock);
}


static inline void hlist_del(struct hlist_node *n)
{
        __hlist_del(n);
        n->next = LIST_POISON1;   <<< HERE >>>
        n->pprev = LIST_POISON2;
}
include/linux/poison.h:#define LIST_POISON1  ((void *) 0x00100100)

reader tries to pass over this delete net_device, doing a dev->index_hlist->next

#define hlist_for_each(pos, head) \
        for (pos = (head)->first; pos && ({ prefetch(pos->next); 1; }); \
             pos = pos->next)

So it should visit a nice memory location ?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 20, 2009, 4:16 a.m. UTC | #3
Eric Dumazet a écrit :
> David Miller a écrit :
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Mon, 19 Oct 2009 18:41:58 +0200
>>
>>> ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.
>> Dubious, how so?
>>
>> Yes, I know RTNL/dev_base_lock, but it's not using what it gets
>> back at all.
>>
>> It's testing existence, a boolean, it doesn't dereference the
>> 'dev' it gets back at all.
>>
>> This code is intentional and perfectly fine.
> 
> If this was intentional, something changed and made the prereq false.
> 
> Final target might be fine, but an element in the chain, before target
> could be deleted while reader scans hash chain.
> 

BTW, even an insertion can crash a lockless reader, since reader could see a corrupt
 n->next (hlist_add_head() has no barrier between n->next = first and h->first = n;)

static inline void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
        struct hlist_node *first = h->first;
        n->next = first;
        if (first)
                first->pprev = &n->next;
        h->first = n;
        n->pprev = &h->first;
}

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 20, 2009, 4:20 a.m. UTC | #4
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Oct 2009 06:07:48 +0200

> Final target might be fine, but an element in the chain, before target
> could be deleted while reader scans hash chain.
 ...
> So it should visit a nice memory location ?

It should hit a NULL eventually and deterministically even if an
unlink happens at the same time..... unless the object gets free'd
meanwhile, hmmm...

This code is definitely intentional, I remember when I added it to
the tree, Alexey wrote it :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 20, 2009, 4:21 a.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Oct 2009 06:16:02 +0200

> BTW, even an insertion can crash a lockless reader, since reader
>  could see a corrupt n->next (hlist_add_head() has no barrier
>  between n->next = first and h->first = n;)

Ok, now that convinces it for me, I'll apply your patch, thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 20, 2009, 4:23 a.m. UTC | #6
David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 20 Oct 2009 06:07:48 +0200
> 
>> Final target might be fine, but an element in the chain, before target
>> could be deleted while reader scans hash chain.
>  ...
>> So it should visit a nice memory location ?
> 
> It should hit a NULL eventually and deterministically even if an
> unlink happens at the same time..... unless the object gets free'd
> meanwhile, hmmm...
> 
> This code is definitely intentional, I remember when I added it to
> the tree, Alexey wrote it :-)

I wonder if the whole thing could use RCU somehow, since some workloads hit
this dev_base_lock rwlock pretty hard...

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 20, 2009, 4:28 a.m. UTC | #7
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Oct 2009 06:23:54 +0200

> I wonder if the whole thing could use RCU somehow, since some
> workloads hit this dev_base_lock rwlock pretty hard...

True, but for now we'll put your fix in :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 0c0b6e3..e982b5c 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -634,17 +634,16 @@  static int do_ip_setsockopt(struct sock *sk, int level,
 				break;
 			}
 			dev = ip_dev_find(sock_net(sk), mreq.imr_address.s_addr);
-			if (dev) {
+			if (dev)
 				mreq.imr_ifindex = dev->ifindex;
-				dev_put(dev);
-			}
 		} else
-			dev = __dev_get_by_index(sock_net(sk), mreq.imr_ifindex);
+			dev = dev_get_by_index(sock_net(sk), mreq.imr_ifindex);
 
 
 		err = -EADDRNOTAVAIL;
 		if (!dev)
 			break;
+		dev_put(dev);
 
 		err = -EINVAL;
 		if (sk->sk_bound_dev_if &&
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 14f54eb..4f7aaf6 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -496,13 +496,17 @@  done:
 			goto e_inval;
 
 		if (val) {
+			struct net_device *dev;
+
 			if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != val)
 				goto e_inval;
 
-			if (__dev_get_by_index(net, val) == NULL) {
+			dev = dev_get_by_index(net, val);
+			if (!dev) {
 				retv = -ENODEV;
 				break;
 			}
+			dev_put(dev);
 		}
 		np->mcast_oif = val;
 		retv = 0;