diff mbox

PROBLEM: System call 'sendmsg' of process ospfd (quagga) causes kernel oops

Message ID 20111019.030914.218521668377472155.davem@davemloft.net
State Deferred, archived
Delegated to: David Miller
Headers show

Commit Message

David Miller Oct. 19, 2011, 7:09 a.m. UTC
From: Herbert Xu <herbert@gondor.hengli.com.au>
Date: Tue, 18 Oct 2011 15:45:37 +0200

> On Tue, Oct 18, 2011 at 02:56:00PM +0200, Eric Dumazet wrote:
>>
>> I am ok by this way, but we might hit another similar problem elsewhere.
>> 
>> (igmp.c ip6_output, ...)
>> 
>> We effectively want to remove LL_ALLOCATED_SPACE() usage and obfuscate
>> code...
> 
> Here's another idea, provide a helper to do the skb allocation
> and the skb_reserve in one go.  That way this ugliness would only
> need to be done once.

Someone please test this:

--------------------
net: Fix crashes on devices which dynamically change needed headroom.

One such device is IP_GRE.

The problem is that we evaluate the device characteristics twice, once
to determine the allocation size, and once to do the skb_reserve().

Combine these into one operation using a helper function.

With help from Eric Dumazet and Herbert Xu.

Reported-by: Reported-by: Elmar Vonlanthen <evonlanthen@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Oct. 19, 2011, 7:18 a.m. UTC | #1
Le mercredi 19 octobre 2011 à 03:09 -0400, David Miller a écrit :
> From: Herbert Xu <herbert@gondor.hengli.com.au>
> Date: Tue, 18 Oct 2011 15:45:37 +0200
> 
> > On Tue, Oct 18, 2011 at 02:56:00PM +0200, Eric Dumazet wrote:
> >>
> >> I am ok by this way, but we might hit another similar problem elsewhere.
> >> 
> >> (igmp.c ip6_output, ...)
> >> 
> >> We effectively want to remove LL_ALLOCATED_SPACE() usage and obfuscate
> >> code...
> > 
> > Here's another idea, provide a helper to do the skb allocation
> > and the skb_reserve in one go.  That way this ugliness would only
> > need to be done once.
> 
> Someone please test this:
> 
> --------------------
> net: Fix crashes on devices which dynamically change needed headroom.
> 
> One such device is IP_GRE.
> 
> The problem is that we evaluate the device characteristics twice, once
> to determine the allocation size, and once to do the skb_reserve().
> 
> Combine these into one operation using a helper function.
> 
> With help from Eric Dumazet and Herbert Xu.
> 
> Reported-by: Reported-by: Elmar Vonlanthen <evonlanthen@gmail.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> 



Seems fine (Maybe do the +15 in caller site ?), but we also have other
problematic cases, using alloc_skb() only...



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 19, 2011, 7:30 a.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 19 Oct 2011 09:18:33 +0200

> Seems fine (Maybe do the +15 in caller site ?), but we also have other
> problematic cases, using alloc_skb() only...

Ok, which ones operate over these problematic GRE tunnels?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 19, 2011, 7:52 a.m. UTC | #3
Le mercredi 19 octobre 2011 à 03:30 -0400, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 19 Oct 2011 09:18:33 +0200
> 
> > Seems fine (Maybe do the +15 in caller site ?), but we also have other
> > problematic cases, using alloc_skb() only...
> 
> Ok, which ones operate over these problematic GRE tunnels?
> 

I was more thinking of general idea to allow any device to change its
needed headroom.

For GRE tunnels, I dont think IPv6 fragmentation could be relevant,
but maybe IGMP could trigger a problem ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Oct. 19, 2011, 8:02 a.m. UTC | #4
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 19 Oct 2011 09:52:19 +0200

> For GRE tunnels, I dont think IPv6 fragmentation could be relevant,
> but maybe IGMP could trigger a problem ?

Funny... icmpv6 uses ip6_append_data() which uses hh_len in a
local variable, which seems to suggest that it's immune to
this problem.

IPV4 side seems identical in this regard.

So, as far as I can see, my patch is sufficient to cover IP_GRE
reasonably in the 'net' tree.

Agreed?

If so, someone please test this thing :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Oct. 19, 2011, 8:08 a.m. UTC | #5
On Wed, Oct 19, 2011 at 03:30:52AM -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 19 Oct 2011 09:18:33 +0200
> 
> > Seems fine (Maybe do the +15 in caller site ?), but we also have other
> > problematic cases, using alloc_skb() only...
> 
> Ok, which ones operate over these problematic GRE tunnels?

Potentially all of them since we now support Ethernet-over-GRE.

I think Eric's initial patch is probably the safest bet for rc10.
We can then work on the proper fix for the next release.

As to the latter, I've just done a grep over net and it seems that
all users of LL_ALLOCATED_SPACE fall into two cases, alloc_skb users
or sock_alloc_send_skb (including pskb) users.

For alloc_skb we could add a new helper.  While for the other
case we could either create a new helper or just add an extra
dev argument that may be NULL for those that don't care about
LL_ALLOCATED_SPACE.

Cheers,
David Miller Oct. 20, 2011, 9:30 a.m. UTC | #6
From: Herbert Xu <herbert@gondor.hengli.com.au>
Date: Wed, 19 Oct 2011 10:08:07 +0200

> I think Eric's initial patch is probably the safest bet for rc10.
> We can then work on the proper fix for the next release.

There are two "initial patch", I wonder which one you mean.

There's his really first patch, which remoevs the lines in IP_GRE
which change dev->needed_headroom.  I was under the impression we
were against doing that.

The other patch he posted duplicates the device attribute variable
caching in two functions.

My patch is just a tweak so that we only do this sequence in one
place, the new sock_alloc_send_skb_reserve() helper, instead of
in both the ipv4 and ipv6 RAW code.

So I'm a little confused what your suggestion for rc10 really
is :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Oct. 20, 2011, 9:35 a.m. UTC | #7
On Thu, Oct 20, 2011 at 05:30:50AM -0400, David Miller wrote:
>
> So I'm a little confused what your suggestion for rc10 really
> is :-)

I meant his first initial patch :)

While it is suboptimal in the sense that should the value of
needed_headroom increase we'll end up constantly reallocating
skbs, I believe that it is at least semantically correct.

In the time being I'll look more closely at all the users of
needed_headroom to see if there's anything we've missed.

Thanks,
David Miller Oct. 20, 2011, 8:21 p.m. UTC | #8
From: Herbert Xu <herbert@gondor.hengli.com.au>
Date: Thu, 20 Oct 2011 11:35:41 +0200

> On Thu, Oct 20, 2011 at 05:30:50AM -0400, David Miller wrote:
>>
>> So I'm a little confused what your suggestion for rc10 really
>> is :-)
> 
> I meant his first initial patch :)
> 
> While it is suboptimal in the sense that should the value of
> needed_headroom increase we'll end up constantly reallocating
> skbs, I believe that it is at least semantically correct.

Ok, I applied Eric's patch which removes the dynamic changing of the
needed_headroom in IP_GRE.

Thanks everyone!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Oct. 25, 2011, 11:54 a.m. UTC | #9
On Thu, Oct 20, 2011 at 11:35:41AM +0200, Herbert Xu wrote:
>
> In the time being I'll look more closely at all the users of
> needed_headroom to see if there's anything we've missed.

OK I've reviewed all the users of needed_headroom and I haven't
found anything other than the cases that we have enumerated.

One thing I noticed is that the macro LL_ALLOCATED_SPACE is
completely bogus and buggy.  It applies the alignment to the
sum of headroom and tailroom.  However, the alignment is then
applied separately to the headroom when reserving, meaning that
we may end up with insufficient tailroom.

So I'm going to get rid of LL_ALLOCATED_SPACE completely and
replace it with explicit references to the tailroom as it doesn't
need the alignment anyway (The headroom needs alignment since
we use it to ensure the head is aligned).

Cheers,
David Miller Oct. 26, 2011, 3:12 a.m. UTC | #10
From: Herbert Xu <herbert@gondor.hengli.com.au>
Date: Tue, 25 Oct 2011 13:54:25 +0200

> So I'm going to get rid of LL_ALLOCATED_SPACE completely and
> replace it with explicit references to the tailroom as it doesn't
> need the alignment anyway (The headroom needs alignment since
> we use it to ensure the head is aligned).

Ok.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Nov. 18, 2011, 12:18 p.m. UTC | #11
On Tue, Oct 25, 2011 at 11:12:04PM -0400, David Miller wrote:
> From: Herbert Xu <herbert@gondor.hengli.com.au>
> Date: Tue, 25 Oct 2011 13:54:25 +0200
> 
> > So I'm going to get rid of LL_ALLOCATED_SPACE completely and
> > replace it with explicit references to the tailroom as it doesn't
> > need the alignment anyway (The headroom needs alignment since
> > we use it to ensure the head is aligned).
> 
> Ok.

Here are the patches that do this.  I also picked up one spot
that should have used LL_ALLOCATED_SPACE but did not (see 2nd
last patch).

Cheers,
David Miller Nov. 18, 2011, 8:01 p.m. UTC | #12
From: Herbert Xu <herbert@gondor.hengli.com.au>
Date: Fri, 18 Nov 2011 20:18:32 +0800

> On Tue, Oct 25, 2011 at 11:12:04PM -0400, David Miller wrote:
>> From: Herbert Xu <herbert@gondor.hengli.com.au>
>> Date: Tue, 25 Oct 2011 13:54:25 +0200
>> 
>> > So I'm going to get rid of LL_ALLOCATED_SPACE completely and
>> > replace it with explicit references to the tailroom as it doesn't
>> > need the alignment anyway (The headroom needs alignment since
>> > we use it to ensure the head is aligned).
>> 
>> Ok.
> 
> Here are the patches that do this.  I also picked up one spot
> that should have used LL_ALLOCATED_SPACE but did not (see 2nd
> last patch).

This all looks good to me, applied to net-next, thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddee79b..d4abb400 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -276,12 +276,22 @@  struct hh_cache {
  * LL_ALLOCATED_SPACE also takes into account the tailroom the device
  * may need.
  */
+#define LL_ALIGN(__len) (((__len)&~(HH_DATA_MOD - 1)) + HH_DATA_MOD)
+
+#define __LL_RESERVED_SPACE(__hard_hdr_len, __needed_headroom) \
+	LL_ALIGN((__hard_hdr_len) + (__needed_headroom))
+
 #define LL_RESERVED_SPACE(dev) \
-	((((dev)->hard_header_len+(dev)->needed_headroom)&~(HH_DATA_MOD - 1)) + HH_DATA_MOD)
+	__LL_RESERVED_SPACE((dev)->hard_header_len, (dev)->needed_headroom))
+
 #define LL_RESERVED_SPACE_EXTRA(dev,extra) \
-	((((dev)->hard_header_len+(dev)->needed_headroom+(extra))&~(HH_DATA_MOD - 1)) + HH_DATA_MOD)
+	LL_ALIGN((dev)->hard_header_len + (dev)->needed_headroom + (extra))
+
+#define __LL_ALLOCATED_SPACE(__hard_hdr_len, __needed_headroom, __needed_tailroom) \
+	LL_ALIGN((__hard_hdr_len) + (__needed_headroom) + (__needed_tailroom))
+
 #define LL_ALLOCATED_SPACE(dev) \
-	((((dev)->hard_header_len+(dev)->needed_headroom+(dev)->needed_tailroom)&~(HH_DATA_MOD - 1)) + HH_DATA_MOD)
+	__LL_ALLOCATED_SPACE((dev)->hard_header_len, (dev)->needed_headroom, (dev)->needed_tailroom)
 
 struct header_ops {
 	int	(*create) (struct sk_buff *skb, struct net_device *dev,
diff --git a/include/net/sock.h b/include/net/sock.h
index 8e4062f..9d96995 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1103,6 +1103,11 @@  extern struct sk_buff 		*sock_alloc_send_skb(struct sock *sk,
 						     unsigned long size,
 						     int noblock,
 						     int *errcode);
+extern struct sk_buff 		*sock_alloc_send_skb_reserve(struct sock *sk,
+							     struct net_device *dev,
+							     unsigned long base_size,
+							     int noblock,
+							     int *errcode);
 extern struct sk_buff 		*sock_alloc_send_pskb(struct sock *sk,
 						      unsigned long header_len,
 						      unsigned long data_len,
diff --git a/net/core/sock.c b/net/core/sock.c
index bc745d0..9cefb72 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1586,6 +1586,26 @@  struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
 }
 EXPORT_SYMBOL(sock_alloc_send_skb);
 
+struct sk_buff *sock_alloc_send_skb_reserve(struct sock *sk, struct net_device *dev,
+					    unsigned long base_size, int noblock,
+					    int *errcode)
+{
+	unsigned int hh_len = dev->hard_header_len;
+	unsigned int needed_hroom = dev->needed_headroom;
+	unsigned int needed_troom = dev->needed_tailroom;
+	unsigned int reserve, lls;
+	struct sk_buff *skb;
+
+	lls = __LL_ALLOCATED_SPACE(hh_len, needed_hroom, needed_troom);
+	skb = sock_alloc_send_skb(sk, (base_size + lls + 15), noblock, errcode);
+	if (skb) {
+		reserve = __LL_RESERVED_SPACE(hh_len, needed_hroom);
+		skb_reserve(skb, reserve);
+	}
+	return skb;
+}
+EXPORT_SYMBOL(sock_alloc_send_skb_reserve);
+
 static void __lock_sock(struct sock *sk)
 	__releases(&sk->sk_lock.slock)
 	__acquires(&sk->sk_lock.slock)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 61714bd..dcecb97 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -335,12 +335,11 @@  static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	if (flags&MSG_PROBE)
 		goto out;
 
-	skb = sock_alloc_send_skb(sk,
-				  length + LL_ALLOCATED_SPACE(rt->dst.dev) + 15,
-				  flags & MSG_DONTWAIT, &err);
+	skb = sock_alloc_send_skb_reserve(sk, rt->dst.dev, length,
+					  flags & MSG_DONTWAIT,
+					  &err);
 	if (skb == NULL)
 		goto error;
-	skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));
 
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 343852e..5a8e141 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -618,12 +618,11 @@  static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
 	if (flags&MSG_PROBE)
 		goto out;
 
-	skb = sock_alloc_send_skb(sk,
-				  length + LL_ALLOCATED_SPACE(rt->dst.dev) + 15,
-				  flags & MSG_DONTWAIT, &err);
+	skb = sock_alloc_send_skb_reserve(sk, rt->dst.dev, length,
+					  flags & MSG_DONTWAIT,
+					  &err);
 	if (skb == NULL)
 		goto error;
-	skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));
 
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;