diff mbox

PROBLEM: System call 'sendmsg' of process ospfd (quagga) causes kernel oops

Message ID 1318604266.2223.29.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Oct. 14, 2011, 2:57 p.m. UTC
Le vendredi 14 octobre 2011 à 10:08 +0200, Elmar Vonlanthen a écrit :
> Hello again
> 
> Kernel 3.0.6 is affected as well.
> 
> 2011-10-14 09:50:46 chgut11fw01 kernel: skb_over_panic: text:c13059c9
> len:64 put:64 head:c730c400 data:c730c450 tail:0xc730c490
> end:0xc730c480 dev:<NULL>
> 2011-10-14 09:50:46 chgut11fw01 kernel: ------------[ cut here ]------------
> 2011-10-14 09:50:46 chgut11fw01 kernel: kernel BUG at net/core/skbuff.c:128!
> 2011-10-14 09:50:46 chgut11fw01 kernel: invalid opcode: 0000 [#1] SMP
> 2011-10-14 09:50:46 chgut11fw01 kernel: Modules linked in: ip_gre gre
> authenc xfrm4_mode_transport deflate zlib_deflate ctr twofish_generic
> twofish_common serpent cryptd aes_i586 aes_generic blowfish cast5 cbc
> ecb rmd160 sha512_generic sha256_generic sha1_generic xfrm_user
> xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key tun ipt_LOG
> xt_limit ipt_REJECT xt_state ipt_REDIRECT ipt_MASQUERADE xt_policy
> xt_TCPMSS xt_tcpmss xt_tcpudp xt_NOTRACK iptable_filter iptable_nat
> xt_mark xt_connmark iptable_mangle iptable_raw ip_tables x_tables
> nf_conntrack_tftp nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> nf_conntrack_ftp nf_conntrack rtc ppdev parport_pc parport w83792d
> i2c_dev i2c_i801 i2c_core pl2303 usbserial coretemp hwmon usbhid
> ohci_hcd uhci_hcd ehci_hcd usbcore e1000 e1000e aufs ata_piix libata
> 2011-10-14 09:50:46 chgut11fw01 kernel:
> 2011-10-14 09:50:46 chgut11fw01 kernel: Pid: 6299, comm: ospfd Not
> tainted 3.0.6-SMP #1    /LakePort
> 2011-10-14 09:50:46 chgut11fw01 kernel: EIP: 0060:[<c12b6815>] EFLAGS:
> 00010292 CPU: 0
> 2011-10-14 09:50:46 chgut11fw01 kernel: EIP is at skb_put+0x85/0x90
> 2011-10-14 09:50:46 chgut11fw01 kernel: EAX: 00000078 EBX: c13059c9
> ECX: 00000096 EDX: ffffff8b
> 2011-10-14 09:50:46 chgut11fw01 kernel: ESI: deac4a80 EDI: 00000040
> EBP: d8979c80 ESP: d8979c58
> 2011-10-14 09:50:46 chgut11fw01 kernel: DS: 007b ES: 007b FS: 00d8 GS:
> 0033 SS: 0068
> 2011-10-14 09:50:46 chgut11fw01 kernel: Process ospfd (pid: 6299,
> ti=d8978000 task=de94ee80 task.ti=d8978000)
> 2011-10-14 09:50:46 chgut11fw01 kernel: Stack:
> 2011-10-14 09:50:46 chgut11fw01 kernel: c13fd194 c13059c9 00000040
> 00000040 c730c400 c730c450 c730c490 c730c480
> 2011-10-14 09:50:46 chgut11fw01 kernel: c13fb21c d897da40 d8979d38
> c13059c9 d8979d24 8db04700 00000001 00000001
> 2011-10-14 09:50:46 chgut11fw01 kernel: d8979ce4 d8979cc8 c121ce7d
> 00000000 000012b5 d8979ecc dbff0c00 c730c450
> 2011-10-14 09:50:46 chgut11fw01 kernel: Call Trace:
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c13059c9>] ? raw_sendmsg+0x5a9/0x850
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c13059c9>] raw_sendmsg+0x5a9/0x850
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c121ce7d>] ? extract_entropy+0x5d/0x70
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c1027f73>] ?
> try_to_wake_up+0x173/0x1f0
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c12b2700>] ? release_sock+0xb0/0xd0
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c130f222>] inet_sendmsg+0x42/0xa0
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c1021407>] ?
> __wake_up_sync_key+0x47/0x60
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c12afc07>] sock_sendmsg+0xa7/0xd0
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c12b29f3>] ?
> sock_def_readable+0x33/0x60
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c12afc07>] ? sock_sendmsg+0xa7/0xd0
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c12b8e83>] ? verify_iovec+0x53/0xb0
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c12b08e4>] __sys_sendmsg+0x2d4/0x2e0
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c12afc9e>] ?
> sockfd_lookup_light+0x1e/0x70
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c12b02fa>] ? sys_sendto+0xaa/0xe0
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c102f928>] ? nsecs_to_jiffies+0x8/0x10
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c12eaf11>] ? ip_setsockopt+0x41/0xa0
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c12b0a36>] sys_sendmsg+0x36/0x60
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c12b1669>] sys_socketcall+0xe9/0x280
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c133d8c5>] syscall_call+0x7/0xb
> 2011-10-14 09:50:46 chgut11fw01 kernel: [<c1330000>] ?
> packet_recvmsg+0x170/0x440
> 2011-10-14 09:50:46 chgut11fw01 kernel: Code: 00 00 89 4c 24 14 8b 88
> a4 00 00 00 89 54 24 0c 89 4c 24 10 8b 40 50 89 5c 24 04 c7 04 24 94
> d1 3f c1 89 44 24 08 e8 b4 4b 08 00 <0f> 0b eb fe b9 1c b2 3f c1 eb ae
> 55 89 e5 57 56 89 d6 53 89 c3
> 2011-10-14 09:50:46 chgut11fw01 kernel: EIP: [<c12b6815>]
> skb_put+0x85/0x90 SS:ESP 0068:d8979c58
> 2011-10-14 09:50:46 chgut11fw01 kernel: ---[ end trace ff341104610beeed ]---

> Could anyone give me a hint, how I can furher proceed to find the error?
> Thanks.

Please try following patch :

[PATCH] ip_gre: dont increase dev->needed_headroom on a live device

It seems ip_gre is able to change dev->needed_headroom on the fly.

Its is not legal unfortunately and triggers a BUG in raw_sendmsg()

skb = sock_alloc_send_skb(sk, ... + LL_ALLOCATED_SPACE(rt->dst.dev) 

< another cpu change dev->needed_headromm (making it bigger)

...
skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));

We end with LL_RESERVED_SPACE() being bigger than LL_ALLOCATED_SPACE()
-> we crash later because skb head is exhausted.

Bug introduced in commit 243aad83 in 2.6.34 (ip_gre: include route
header_len in max_headroom calculation)

Reported-by: Elmar Vonlanthen <evonlanthen@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Timo Teräs <timo.teras@iki.fi>
CC: Herbert Xu <herbert@gondor.apana.org.au>
---



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Elmar Vonlanthen Oct. 17, 2011, 7:16 a.m. UTC | #1
2011/10/14 Eric Dumazet <eric.dumazet@gmail.com>:
> Please try following patch :
>
> [PATCH] ip_gre: dont increase dev->needed_headroom on a live device
>
> It seems ip_gre is able to change dev->needed_headroom on the fly.
>
> Its is not legal unfortunately and triggers a BUG in raw_sendmsg()
>
> skb = sock_alloc_send_skb(sk, ... + LL_ALLOCATED_SPACE(rt->dst.dev)
>
> < another cpu change dev->needed_headromm (making it bigger)
>
> ...
> skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));
>
> We end with LL_RESERVED_SPACE() being bigger than LL_ALLOCATED_SPACE()
> -> we crash later because skb head is exhausted.
>
> Bug introduced in commit 243aad83 in 2.6.34 (ip_gre: include route
> header_len in max_headroom calculation)
>
> Reported-by: Elmar Vonlanthen <evonlanthen@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Timo Teräs <timo.teras@iki.fi>
> CC: Herbert Xu <herbert@gondor.apana.org.au>
> ---
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 8871067..1505dcf 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -835,8 +835,6 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
>        if (skb_headroom(skb) < max_headroom || skb_shared(skb)||
>            (skb_cloned(skb) && !skb_clone_writable(skb, 0))) {
>                struct sk_buff *new_skb = skb_realloc_headroom(skb, max_headroom);
> -               if (max_headroom > dev->needed_headroom)
> -                       dev->needed_headroom = max_headroom;
>                if (!new_skb) {
>                        ip_rt_put(rt);
>                        dev->stats.tx_dropped++;

Hello

I tried this patch and I was not able anymore to reproduce the kernel
oops. So the patch solved the bug.
Thank you very much!

Would it be possible to add the patch to the long term kernel 2.6.35
as well? Because this is the one I use at the moment in production.

And sorry for posting to the wrong mailing list (linux-kernel).

Best regards
Elmar
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Oct. 18, 2011, 2:30 a.m. UTC | #2
Le lundi 17 octobre 2011 à 09:16 +0200, Elmar Vonlanthen a écrit :
> 2011/10/14 Eric Dumazet <eric.dumazet@gmail.com>:
> > Please try following patch :
> >
> > [PATCH] ip_gre: dont increase dev->needed_headroom on a live device
> >
> > It seems ip_gre is able to change dev->needed_headroom on the fly.
> >
> > Its is not legal unfortunately and triggers a BUG in raw_sendmsg()
> >
> > skb = sock_alloc_send_skb(sk, ... + LL_ALLOCATED_SPACE(rt->dst.dev)
> >
> > < another cpu change dev->needed_headromm (making it bigger)
> >
> > ...
> > skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));
> >
> > We end with LL_RESERVED_SPACE() being bigger than LL_ALLOCATED_SPACE()
> > -> we crash later because skb head is exhausted.
> >
> > Bug introduced in commit 243aad83 in 2.6.34 (ip_gre: include route
> > header_len in max_headroom calculation)
> >
> > Reported-by: Elmar Vonlanthen <evonlanthen@gmail.com>
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > CC: Timo Teräs <timo.teras@iki.fi>
> > CC: Herbert Xu <herbert@gondor.apana.org.au>
> > ---
> > diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> > index 8871067..1505dcf 100644
> > --- a/net/ipv4/ip_gre.c
> > +++ b/net/ipv4/ip_gre.c
> > @@ -835,8 +835,6 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
> >        if (skb_headroom(skb) < max_headroom || skb_shared(skb)||
> >            (skb_cloned(skb) && !skb_clone_writable(skb, 0))) {
> >                struct sk_buff *new_skb = skb_realloc_headroom(skb, max_headroom);
> > -               if (max_headroom > dev->needed_headroom)
> > -                       dev->needed_headroom = max_headroom;
> >                if (!new_skb) {
> >                        ip_rt_put(rt);
> >                        dev->stats.tx_dropped++;
> 
> Hello
> 
> I tried this patch and I was not able anymore to reproduce the kernel
> oops. So the patch solved the bug.
> Thank you very much!
> 
> Would it be possible to add the patch to the long term kernel 2.6.35
> as well? Because this is the one I use at the moment in production.
> 

Thanks for testing.

If David/Herbert/Timo agree, then patch should find its way into current
kernel, then to stable trees as well.

Thanks


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Oct. 18, 2011, 9:34 a.m. UTC | #3
On Tue, Oct 18, 2011 at 04:30:32AM +0200, Eric Dumazet wrote:
>
> If David/Herbert/Timo agree, then patch should find its way into current
> kernel, then to stable trees as well.

Actually, I think we should instead fix the users of needed_headroom
to not read it twice which is causing problems here.

GRE tunnels by their nature do not have a fixed value for
needed_headroom.  As the underlying routes change the necessary
headroom may need to be adjusted due to further encapsulation such
as IPsec.

Keeping it constant from tunnel creation may result in suboptimal
performance due to unnecessary header reallocations.

However, until we audit the stack to see if there are further
instances of double-readings such as the one causing the crash
here, I'm fine with your patch making it constant.

Once we're sure that all of the double-readings are gone we
can revert to a dynamic needed_headroom.

Thanks,
Eric Dumazet Oct. 18, 2011, 10:01 a.m. UTC | #4
Le mardi 18 octobre 2011 à 11:34 +0200, Herbert Xu a écrit :
> On Tue, Oct 18, 2011 at 04:30:32AM +0200, Eric Dumazet wrote:
> >
> > If David/Herbert/Timo agree, then patch should find its way into current
> > kernel, then to stable trees as well.
> 
> Actually, I think we should instead fix the users of needed_headroom
> to not read it twice which is causing problems here.
> 
> GRE tunnels by their nature do not have a fixed value for
> needed_headroom.  As the underlying routes change the necessary
> headroom may need to be adjusted due to further encapsulation such
> as IPsec.
> 
> Keeping it constant from tunnel creation may result in suboptimal
> performance due to unnecessary header reallocations.
> 
> However, until we audit the stack to see if there are further
> instances of double-readings such as the one causing the crash
> here, I'm fine with your patch making it constant.
> 
> Once we're sure that all of the double-readings are gone we
> can revert to a dynamic needed_headroom.
> 

Sure, we can work on this path for future kernels.

Adding an RCU protected structure to hold hard_header_len /
needed_headroom / needed_tailroom should be possible, but this adds yet
another pointer dereference...

Thanks !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Oct. 18, 2011, 10:05 a.m. UTC | #5
On Tue, Oct 18, 2011 at 12:01:33PM +0200, Eric Dumazet wrote:
>
> Adding an RCU protected structure to hold hard_header_len /
> needed_headroom / needed_tailroom should be possible, but this adds yet
> another pointer dereference...

I don't think we need RCU here since the problem is simply that
we're using two different values for skb allocations and skb_reserve.

As long as we use one and the same value it should work.  The value
will rarely be incorrect and when it is, automatic reallocation will
occur.

Cheers,
Eric Dumazet Oct. 18, 2011, 10:23 a.m. UTC | #6
Le mardi 18 octobre 2011 à 12:05 +0200, Herbert Xu a écrit :
> On Tue, Oct 18, 2011 at 12:01:33PM +0200, Eric Dumazet wrote:
> >
> > Adding an RCU protected structure to hold hard_header_len /
> > needed_headroom / needed_tailroom should be possible, but this adds yet
> > another pointer dereference...
> 
> I don't think we need RCU here since the problem is simply that
> we're using two different values for skb allocations and skb_reserve.
> 
> As long as we use one and the same value it should work.  The value
> will rarely be incorrect and when it is, automatic reallocation will
> occur.
> 

You're right, if reallocations are OK in all paths.

We'll need to change LL_RESERVED_SPACE() / LL_RESERVED_SPACE_EXTRA() /
LL_ALLOCATED_SPACE() macros and provide the [read once] values, instead
of a [read once] pointer to values.

Thats a bit complex change, but doable.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Oct. 18, 2011, 10:45 a.m. UTC | #7
On Tue, Oct 18, 2011 at 12:23:43PM +0200, Eric Dumazet wrote:
> 
> You're right, if reallocations are OK in all paths.

If it wasn't OK then making needed_headroom constant won't work
anyway.

> We'll need to change LL_RESERVED_SPACE() / LL_RESERVED_SPACE_EXTRA() /
> LL_ALLOCATED_SPACE() macros and provide the [read once] values, instead
> of a [read once] pointer to values.

I'm not sure what you mean here.  I don't see any need to change
these macros.  All we need is to save the value in a local variable:

	hh_len = LL_RESERVED_SPACE(dev);

	skb = alloc_skb(hh_len + len);
	skb_reserve(skb, hh_len);

Cheers,
Eric Dumazet Oct. 18, 2011, 11:37 a.m. UTC | #8
Le mardi 18 octobre 2011 à 12:45 +0200, Herbert Xu a écrit :
> On Tue, Oct 18, 2011 at 12:23:43PM +0200, Eric Dumazet wrote:
> > 
> > You're right, if reallocations are OK in all paths.
> 
> If it wasn't OK then making needed_headroom constant won't work
> anyway.
> 
> > We'll need to change LL_RESERVED_SPACE() / LL_RESERVED_SPACE_EXTRA() /
> > LL_ALLOCATED_SPACE() macros and provide the [read once] values, instead
> > of a [read once] pointer to values.
> 
> I'm not sure what you mean here.  I don't see any need to change
> these macros.  All we need is to save the value in a local variable:
> 
> 	hh_len = LL_RESERVED_SPACE(dev);
> 
> 	skb = alloc_skb(hh_len + len);
> 	skb_reserve(skb, hh_len);
> 

Not really Herbert. Please read again my patch changelog.

In the bug we try to fix, we have :

skb = sock_alloc_send_skb(sk, ... + LL_ALLOCATED_SPACE(rt->dst.dev) 

... < increase of dev->needed_headroom by another cpu/task >

skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));

skb_put() -> crash because we reserved too much space

So we really want LL_ALLOCATED_SPACE() and LL_RESERVED_SPACE() use the
same needed_headroom, or else you can have LL_RESERVED_SPACE() >
LL_ALLOCATED_SPACE().

There are several way to fix this, but this kind of code assumed the
dev->needed... values were consistent for the whole block.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Oct. 18, 2011, 11:49 a.m. UTC | #9
On Tue, Oct 18, 2011 at 01:37:58PM +0200, Eric Dumazet wrote:
>
> In the bug we try to fix, we have :
> 
> skb = sock_alloc_send_skb(sk, ... + LL_ALLOCATED_SPACE(rt->dst.dev) 
> 
> ... < increase of dev->needed_headroom by another cpu/task >
> 
> skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));

OK, in that case one fix would be to replace LL_ALLOCATED_SPACE
with its two constiuents so that they may be stored in local
variables for later use.

	hlen = LL_HEADROOM(skb);
	tlen = LL_TAILROOM(skb);
	skb_alloc_send_skb(sk, ... + LL_ALIGN(hlen + tlen));

	skb_reserve(skb, LL_ALIGN(hlen));

Cheers,
diff mbox

Patch

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 8871067..1505dcf 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -835,8 +835,6 @@  static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff *skb, struct net_device *dev
 	if (skb_headroom(skb) < max_headroom || skb_shared(skb)||
 	    (skb_cloned(skb) && !skb_clone_writable(skb, 0))) {
 		struct sk_buff *new_skb = skb_realloc_headroom(skb, max_headroom);
-		if (max_headroom > dev->needed_headroom)
-			dev->needed_headroom = max_headroom;
 		if (!new_skb) {
 			ip_rt_put(rt);
 			dev->stats.tx_dropped++;