
bug report: use after free bug leading to kernel panic

Message ID CAHZ_AjsbkkXNMG8oZfOhr6WbUmV=4eK0j23ijbP=cNkfVdoAjQ@mail.gmail.com
State Not Applicable
Delegated to: Pablo Neira

Commit Message

eric gisse Oct. 31, 2014, 3:28 p.m. UTC
Background:

This was discovered on a server running a Tor exit node (crazy high
packet flow) with a firewall that uses a few connection tracking rules
in the INPUT chain:

# iptables-save | grep conn
-A INPUT -m comment --comment "001-v4 drop invalid traffic" -m conntrack --ctstate INVALID -j DROP
-A INPUT -m comment --comment "990-v4 accept existing connections" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

The kernel was not stock; it was patched with grsecurity. I worked
with the grsecurity folks on this issue first
(https://forums.grsecurity.net/viewtopic.php?f=1&t=4071) to isolate
and explain what's going on. They were very helpful.

One of the developers was nice enough to write a test-case patch,
which I have attached.

The bug:

I am using the PaX memory sanitization feature
(https://en.wikibooks.org/wiki/Grsecurity/Appendix/Grsecurity_and_PaX_Configuration_Options#Sanitize_all_freed_memory),
which, long story short, overwrites memory as soon as it is freed.

What happens is that after a few hours at 50k packets per second,
"something" triggers a general protection fault (GPF) in
__nf_conntrack_find_get. This happens on both 3.16.5 and 3.17.1.

The panics I am dumping here were NOT produced with the patch I've
attached, because netconsole is... inconsistent about when it chooses
to work. As an aside, what is the ideal way to capture kernel oops
output anyway? Relying on netconsole is a massive pain in the ass and
not very consistent.

* Patch-specific oops:

https://imgur.com/4li7ePm

This is obviously missing a lot, but it looks like the same issue as
far as my semi-educated eye can tell.

Note: please ignore the xt_* modules; they were not in use at the
time, and were not present for either the 3.16.5 panics or the 3.17.1
+ sanitize test case patch.

* 3.17.1 grsecurity kernel oops via netconsole:

Oct 27 09:52:53 REDACTED [23041.341354] general protection fault: 0000 [#4] SMP
Oct 27 09:52:53 REDACTED [23041.341413] Modules linked in: xt_DELUDE(O) xt_CHAOS(O) xt_TARPIT(O)
Oct 27 09:52:53 REDACTED [23041.341476] CPU: 6 PID: 3052 Comm: tor Tainted: G D O 3.17.1-hardened #1
Oct 27 09:52:53 REDACTED [23041.341538] Hardware name: Supermicro A1SA2-2750F/A1SA2-2750F, BIOS 1.0a 07/14/2014
Oct 27 09:52:53 REDACTED [23041.341600] task: ffff880276ed6b10 ti: ffff880276ed6f60 task.ti: ffff880276ed6f60
Oct 27 09:52:53 REDACTED [23041.341660] RIP: 0010:[<ffffffff814b58ce>] [<ffffffff814b58ce>] __nf_conntrack_find_get+0x6e/0x290
Oct 27 09:52:53 REDACTED [23041.341732] RSP: 0018:ffffc90006073930 EFLAGS: 00010246
Oct 27 09:52:53 REDACTED [23041.341770] RAX: 0000000000014230 RBX: fefefefefefefefe RCX: 0000000000014a70
Oct 27 09:52:53 REDACTED [23041.341811] RDX: 000000000000294e RSI: 00000000000266e2 RDI: 00000000fefefefe
Oct 27 09:52:53 REDACTED [23041.341852] RBP: ffffc90006073958 R08: 0000000073a1bccf R09: 00000000bd127271
Oct 27 09:52:53 REDACTED [23041.341894] R10: ffffc900060739c0 R11: ffff880273943f08 R12: ffffc900060739a8
Oct 27 09:52:53 REDACTED [23041.341935] R13: 0000000000000000 R14: 00000000a538c88a R15: ffffffff81a7e240
Oct 27 09:52:53 REDACTED [23041.341976] FS: 0000031cb9d65700(0000) GS:ffff88027fd80000(0000) knlGS:0000000000000000
Oct 27 09:52:53 REDACTED [23041.342037] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 27 09:52:53 REDACTED [23041.342077] CR2: 000002faa3f38000 CR3: 0000000001654000 CR4: 00000000001007f0
Oct 27 09:52:53 REDACTED [23041.342117] Stack:
Oct 27 09:52:53 REDACTED [23041.342147] ffff880211a540e0 ffffffff81a7e240 0000000000000014 ffffffff81a9e660
Oct 27 09:52:53 REDACTED [23041.342225] 0000000000000000 ffffc90006073a28 ffffffff814b6d2c ffffffff81a9e660
Oct 27 09:52:53 REDACTED [23041.342303] ffffffff81a904a0 ffff880079cabc4c ffff8802a538c88a ffffffff81a904a0
Oct 27 09:52:53 REDACTED [23041.342380] Call Trace:
Oct 27 09:52:53 REDACTED [23041.342416] [<ffffffff814b6d2c>] nf_conntrack_in+0x1fc/0x990
Oct 27 09:52:53 REDACTED [23041.342459] [<ffffffff8158bcab>] ipv4_conntrack_local+0x4b/0x50
Oct 27 09:52:53 REDACTED [23041.342501] [<ffffffff814ae7f8>] nf_iterate+0xa8/0xc0
Oct 27 09:52:53 REDACTED [23041.342543] [<ffffffff8152ffe0>] ? ip_forward_options+0x1f0/0x1f0
Oct 27 09:52:53 REDACTED [23041.342585] [<ffffffff814ae885>] nf_hook_slow+0x75/0x120
Oct 27 09:52:53 REDACTED [23041.342625] [<ffffffff8152ffe0>] ? ip_forward_options+0x1f0/0x1f0
Oct 27 09:52:53 REDACTED [23041.342667] [<ffffffff81532503>] __ip_local_out+0xa3/0xb0
Oct 27 09:52:53 REDACTED [23041.342708] [<ffffffff81532525>] ip_local_out_sk+0x15/0x50
Oct 27 09:52:53 REDACTED [23041.342749] [<ffffffff815328cf>] ip_queue_xmit+0x14f/0x400
Oct 27 09:52:53 REDACTED [23041.342791] [<ffffffff8154b99b>] tcp_transmit_skb+0x48b/0x930
Oct 27 09:52:53 REDACTED [23041.342832] [<ffffffff8154bf82>] tcp_write_xmit+0x142/0xd10
Oct 27 09:52:53 REDACTED [23041.342873] [<ffffffff8154cdb9>] __tcp_push_pending_frames+0x29/0x90
Oct 27 09:52:53 REDACTED [23041.342915] [<ffffffff8153b737>] tcp_push+0xe7/0x120
Oct 27 09:52:53 REDACTED [23041.342954] [<ffffffff8153d027>] tcp_sendmsg+0x107/0x11d0
Oct 27 09:52:53 REDACTED [23041.342995] [<ffffffff8126e1ce>] ? selinux_socket_sendmsg+0x1e/0x30
Oct 27 09:52:53 REDACTED [23041.343037] [<ffffffff8126dbc3>] ? avc_has_perm+0xa3/0x190
Oct 27 09:52:53 REDACTED [23041.343079] [<ffffffff8142b02f>] ? sock_sendmsg+0x9f/0xd0
Oct 27 09:52:53 REDACTED [23041.343120] [<ffffffff8156955e>] inet_sendmsg+0x6e/0xc0
Oct 27 09:52:53 REDACTED [23041.343160] [<ffffffff8126e1ce>] ? selinux_socket_sendmsg+0x1e/0x30
Oct 27 09:52:53 REDACTED [23041.343203] [<ffffffff81429d38>] sock_aio_write+0x118/0x150
Oct 27 09:52:53 REDACTED [23041.343243] [<ffffffff8126fd72>] ? inode_has_perm.isra.28+0x22/0x40
Oct 27 09:52:53 REDACTED [23041.343285] [<ffffffff8126febe>] ? file_has_perm+0x8e/0x90
Oct 27 09:52:53 REDACTED [23041.343327] [<ffffffff81186fd3>] do_sync_write+0x63/0x90
Oct 27 09:52:53 REDACTED [23041.343367] [<ffffffff81187ee2>] vfs_write+0x242/0x2b0
Oct 27 09:52:53 REDACTED [23041.343407] [<ffffffff81188a47>] SyS_write+0x47/0xb0
Oct 27 09:52:53 REDACTED [23041.343448] [<ffffffff81632dfe>] system_call_fastpath+0x16/0x1b
Oct 27 09:52:53 REDACTED [23041.343487] Code: 00 00 48 8b 18 f6 c3 01 74 21 e9 56 01 00 00 66 0f 1f 44 00 00 49 8b 87 58 0d 00 00 65 ff 00 48 8b 1b f6 c3 01 0f 85 3a 01 00 00 <0f> b6 43 37 8b 7b 10 41 39 3c 24 75 dd 8b 73 14 41 39 74 24 04
Oct 27 09:52:53 REDACTED [23041.343964] RIP [<ffffffff814b58ce>] __nf_conntrack_find_get+0x6e/0x290
Oct 27 09:52:53 REDACTED [23041.344011] RSP <ffffc90006073930>
Oct 27 09:52:53 REDACTED [23041.344609] ---[ end trace 874c3cf41b00aa37 ]---
Oct 27 09:52:53 REDACTED [23041.344717] Kernel panic - not syncing: Fatal exception in interrupt
Oct 27 09:52:53 REDACTED [23041.344832] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
Oct 27 09:52:53 REDACTED [23041.344965] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

The spot of code that's causing grief:

# addr2line -e vmlinux -fip ffffffff814b58ce
nf_ct_tuplehash_to_ctrack at /usr/src/linux/include/net/netfilter/nf_conntrack.h:122
 (inlined by) nf_ct_key_equal at /usr/src/linux/net/netfilter/nf_conntrack_core.c:393
 (inlined by) ____nf_conntrack_find at /usr/src/linux/net/netfilter/nf_conntrack_core.c:422
 (inlined by) __nf_conntrack_find_get at /usr/src/linux/net/netfilter/nf_conntrack_core.c:453

* 3.16.5 panic:

Oct 25 10:56:25 REDACTED [13480.030174] general protection fault: 0000 [#1] SMP
Oct 25 10:56:25 REDACTED [13480.030209] Modules linked in:
Oct 25 10:56:25 REDACTED [13480.030229] CPU: 6 PID: 3945 Comm: tor Not tainted 3.16.5-hardened #6
Oct 25 10:56:25 REDACTED [13480.030248] Hardware name: Supermicro A1SA2-2750F/A1SA2-2750F, BIOS 1.0a 07/14/2014
Oct 25 10:56:25 REDACTED [13480.030270] task: ffff880273de0aa0 ti: ffff880273de1100 task.ti: ffff880273de1100
Oct 25 10:56:25 REDACTED [13480.030291] RIP: 0010:[<ffffffff814ad7ae>] [<ffffffff814ad7ae>] __nf_conntrack_find_get+0x6e/0x2c0
Oct 25 10:56:25 REDACTED [13480.030323] RSP: 0018:ffffc900077d3938 EFLAGS: 00010246
Oct 25 10:56:25 REDACTED [13480.030338] RAX: 0000000000014758 RBX: fefefefefefefefe RCX: 0000000000014240
Oct 25 10:56:25 REDACTED [13480.030357] RDX: 0000000000002848 RSI: 000000000002dca2 RDI: 00000000fefefefe
Oct 25 10:56:25 REDACTED [13480.030376] RBP: ffffc900077d3960 R08: 000000008ae71bc9 R09: 00000000d83e14ec
Oct 25 10:56:25 REDACTED [13480.030417] R10: ffffc900077d39c0 R11: ffff880273c55f08 R12: ffffc900077d39a8
Oct 25 10:56:25 REDACTED [13480.030457] R13: 0000000000000000 R14: 00000000a1227919 R15: ffffffff81a5c040
Oct 25 10:56:25 REDACTED [13480.030497] FS:  000002fde93dc700(0000) GS:ffff88027fd80000(0000) knlGS:0000000000000000
Oct 25 10:56:25 REDACTED [13480.030557] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 25 10:56:25 REDACTED [13480.030594] CR2: 000003620f851018 CR3: 000000000164d000 CR4: 00000000001007f0
Oct 25 10:56:25 REDACTED [13480.030634] Stack:
Oct 25 10:56:25 REDACTED [13480.030663]  ffff88006a1cf6e0 ffffffff81a5c040 0000000000000014 ffffffff81a7a6a0
Oct 25 10:56:25 REDACTED [13480.030739]  0000000000000000 ffffc900077d3a30 ffffffff814aee5a ffffffff81a7a6a0
Oct 25 10:56:25 REDACTED [13480.030813]  ffffffff81a6c2c0 ffff8802a1227919 ffffffff81a6c2c0 0000000300000002
Oct 25 10:56:25 REDACTED [13480.030889] Call Trace:
Oct 25 10:56:25 REDACTED [13480.030924]  [<ffffffff814aee5a>] nf_conntrack_in+0x32a/0x980
Oct 25 10:56:25 REDACTED [13480.030965]  [<ffffffff81584e8b>] ipv4_conntrack_local+0x4b/0x50
Oct 25 10:56:25 REDACTED [13480.031005]  [<ffffffff814a6b08>] nf_iterate+0xa8/0xc0
Oct 25 10:56:25 REDACTED [13480.031045]  [<ffffffff81529ee0>] ? ip_forward_options+0x1f0/0x1f0
Oct 25 10:56:25 REDACTED [13480.031085]  [<ffffffff814a6b95>] nf_hook_slow+0x75/0x120
Oct 25 10:56:25 REDACTED [13480.031124]  [<ffffffff81529ee0>] ? ip_forward_options+0x1f0/0x1f0
Oct 25 10:56:25 REDACTED [13480.031165]  [<ffffffff8152c282>] __ip_local_out+0x72/0x80
Oct 25 10:56:25 REDACTED [13480.031203]  [<ffffffff8152c2a5>] ip_local_out_sk+0x15/0x50
Oct 25 10:56:25 REDACTED [13480.031242]  [<ffffffff8152c650>] ip_queue_xmit+0x150/0x3e0
Oct 25 10:56:25 REDACTED [13480.031281]  [<ffffffff815449bd>] tcp_transmit_skb+0x41d/0x8d0
Oct 25 10:56:25 REDACTED [13480.031320]  [<ffffffff81544fb2>] tcp_write_xmit+0x142/0xc10
Oct 25 10:56:25 REDACTED [13480.031360]  [<ffffffff8142f72f>] ? __alloc_skb+0x12f/0x1c0
Oct 25 10:56:25 REDACTED [13480.031399]  [<ffffffff81545d7b>] tcp_push_one+0x2b/0x40
Oct 25 10:56:25 REDACTED [13480.031438]  [<ffffffff815377c9>] tcp_sendmsg+0xba9/0x1580
Oct 25 10:56:25 REDACTED [13480.031478]  [<ffffffff81244bc0>] ? avc_has_perm+0x50/0x130
Oct 25 10:56:25 REDACTED [13480.031518]  [<ffffffff81561304>] inet_sendmsg+0x54/0xc0
Oct 25 10:56:25 REDACTED [13480.031557]  [<ffffffff8124565e>] ? selinux_socket_sendmsg+0x1e/0x30
Oct 25 10:56:25 REDACTED [13480.031598]  [<ffffffff81424e7b>] sock_aio_write+0x10b/0x150
Oct 25 10:56:25 REDACTED [13480.031639]  [<ffffffff81153eb6>] do_sync_write+0x66/0xa0
Oct 25 10:56:25 REDACTED [13480.031677]  [<ffffffff81154d05>] vfs_write+0x255/0x2c0
Oct 25 10:56:25 REDACTED [13480.031715]  [<ffffffff8115591b>] SyS_write+0x4b/0xc0
Oct 25 10:56:25 REDACTED [13480.031754]  [<ffffffff816288be>] system_call_fastpath+0x16/0x1b
Oct 25 10:56:25 REDACTED [13480.031792] Code: 00 00 48 8b 18 f6 c3 01 74 21 e9 6e 01 00 00 66 0f 1f 44 00 00 49 8b 87 50 0b 00 00 65 ff 00 48 8b 1b f6 c3 01 0f 85 52 01 00 00 <0f> b6 43 37 8b 7b 10 41 39 3c 24 75 dd 8b 73 14 41 39 74 24 04
Oct 25 10:56:25 REDACTED [13480.032227] RIP  [<ffffffff814ad7ae>] __nf_conntrack_find_get+0x6e/0x2c0
Oct 25 10:56:25 REDACTED [13480.032273]  RSP <ffffc900077d3938>
Oct 25 10:56:25 REDACTED [13480.032859] ---[ end trace c5991a03f3433531 ]---
Oct 25 10:56:25 REDACTED [13480.032965] Kernel panic - not syncing: Fatal exception in interrupt
Oct 25 10:56:25 REDACTED [13480.033075] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
Oct 25 10:56:25 REDACTED [13480.033206] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Comments

Florian Westphal Oct. 31, 2014, 4:50 p.m. UTC | #1
eric gisse <jowr.pi@gmail.com> wrote:
> Background:
> 
> This was discovered on a server running a tor exit node (crazy high
> packet flow) with a firewall that uses a few connection tracking rules
> in the INPUT chain:
> 
> # iptables-save | grep conn
> -A INPUT -m comment --comment "001-v4 drop invalid traffic" -m
> conntrack --ctstate INVALID -j DROP
> -A INPUT -m comment --comment "990-v4 accept existing connections" -m
> conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
> 
> The kernel was not stock, but rather was modified with grsecurity. I
> worked with the grsecurity folks first on this issue (
> https://forums.grsecurity.net/viewtopic.php?f=1&t=4071 ) to isolate
> and explain what's going on. They were very helpful.

Thanks for reporting.

> because netconsole is ... inconsistent with when choosing to work. As
> an aside, what is the ideal way to get kernel oops output anyway?

Booting into a crash kernel (kdump) has worked for me in the past to
salvage the original trace from memory.

> Note: please Ignore the xt_* modules as they were not in use at the
> time, and were not present for either the 3.16.5 panics or the 3.17.1
> + sanitize test case patch.

Just to be clear, the 3.16.5 panic is also with PaX memory
sanitizing...?

> The spot of code that's causing grief:
> 
> # addr2line -e vmlinux -fip ffffffff814b58ce
> nf_ct_tuplehash_to_ctrack at
> /usr/src/linux/include/net/netfilter/nf_conntrack.h:122
>  (inlined by) nf_ct_key_equal at
> /usr/src/linux/net/netfilter/nf_conntrack_core.c:393
>  (inlined by) ____nf_conntrack_find at
> /usr/src/linux/net/netfilter/nf_conntrack_core.c:422
>  (inlined by) __nf_conntrack_find_get at
> /usr/src/linux/net/netfilter/nf_conntrack_core.c:453

Thanks.
So this happens when we walk the conntrack hash lists to find
a matching entry.

> diff --git a/mm/slub.c b/mm/slub.c
> index 3e8afcc07a76..08a7cbcf2274 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2643,6 +2643,12 @@ static __always_inline void slab_free(struct kmem_cache *s,
>  
>  	slab_free_hook(s, x);
>  
> +	if (pax_sanitize_slab && !(s->flags & SLAB_NO_SANITIZE)) {
> +		memset(x, PAX_MEMORY_SANITIZE_VALUE, s->object_size);
> +		if (s->ctor)
> +			s->ctor(x);
> +	}
> +

I am no SLUB expert, but this looks wrong.
slab_free() is called directly from kmem_cache_free().

conntrack objects are alloc'd/free'd from a SLAB_DESTROY_BY_RCU cache.

It is therefore legal to access a conntrack object on one CPU even
after kmem_cache_free() was invoked on another CPU, provided all
readers that do so hold rcu_read_lock() and verify that the object has
not been freed yet by issuing the appropriate atomic_inc_not_zero() calls.

Therefore, object poisoning is only safe from an RCU callback, once
accesses are known to be illegal/invalid.

(not saying that conntrack is bug free..., we had races there in the
 past).

From a short glance at SLUB, it seems poisoning objects of SLAB_DESTROY_BY_RCU
caches is safe in __free_slab(), but not earlier.

If you use a different allocator, please tell us which one (check your
kernel config; SLUB is the default).

If it's reproducible with poisoning done after the RCU grace periods
have elapsed (i.e., when it's no longer legal to access the memory),
please let us know and we can have another look at it.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
eric gisse Oct. 31, 2014, 5:30 p.m. UTC | #2
On Fri, Oct 31, 2014 at 4:50 PM, Florian Westphal <fw@strlen.de> wrote:
> eric gisse <jowr.pi@gmail.com> wrote:
>> Background:
>>
>> This was discovered on a server running a tor exit node (crazy high
>> packet flow) with a firewall that uses a few connection tracking rules
>> in the INPUT chain:
>>
>> # iptables-save | grep conn
>> -A INPUT -m comment --comment "001-v4 drop invalid traffic" -m
>> conntrack --ctstate INVALID -j DROP
>> -A INPUT -m comment --comment "990-v4 accept existing connections" -m
>> conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
>>
>> The kernel was not stock, but rather was modified with grsecurity. I
>> worked with the grsecurity folks first on this issue (
>> https://forums.grsecurity.net/viewtopic.php?f=1&t=4071 ) to isolate
>> and explain what's going on. They were very helpful.
>
> Thanks for reporting.
>
>> because netconsole is ... inconsistent with when choosing to work. As
>> an aside, what is the ideal way to get kernel oops output anyway?
>
> booting into a crash-kernel has worked for me in the past to salvage
> original trace from memory.

I'm using Gentoo, which doesn't have the super nice crash kernel /
abrtd stuff set up.

That's the one thing I really like about RHEL, though with those tools
I wouldn't be able to use grsecurity (or anything else custom) in
kernel space anyway...

>
>> Note: please Ignore the xt_* modules as they were not in use at the
>> time, and were not present for either the 3.16.5 panics or the 3.17.1
>> + sanitize test case patch.
>
> Just to be clear, the 3.16.5 panic is also with pax memory
> sanitizing...?

Correct.

Since it ran along the same syscall path as the 3.17.1 panics, I am
making the assumption it is the same bug.

I don't have the 3.16.5 kernel built with the needed debugging flags,
though, so I can't verify it 100% after the fact, but I'm reasonably
confident given how reproducible this issue has been.

>
>> The spot of code that's causing grief:
>>
>> # addr2line -e vmlinux -fip ffffffff814b58ce
>> nf_ct_tuplehash_to_ctrack at
>> /usr/src/linux/include/net/netfilter/nf_conntrack.h:122
>>  (inlined by) nf_ct_key_equal at
>> /usr/src/linux/net/netfilter/nf_conntrack_core.c:393
>>  (inlined by) ____nf_conntrack_find at
>> /usr/src/linux/net/netfilter/nf_conntrack_core.c:422
>>  (inlined by) __nf_conntrack_find_get at
>> /usr/src/linux/net/netfilter/nf_conntrack_core.c:453
>
> Thanks.
> So this happens when we walk the conntrack hash lists to find
> a matching entry.

That is as far as I was able to understand.

My connection tracking table gets *big*. This is what it looks like at
this instant in time on the machine in question:

# sysctl -a | grep conntrack_count
net.ipv4.netfilter.ip_conntrack_count = 46205
net.netfilter.nf_conntrack_count = 46203



>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 3e8afcc07a76..08a7cbcf2274 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2643,6 +2643,12 @@ static __always_inline void slab_free(struct kmem_cache *s,
>>
>>       slab_free_hook(s, x);
>>
>> +     if (pax_sanitize_slab && !(s->flags & SLAB_NO_SANITIZE)) {
>> +             memset(x, PAX_MEMORY_SANITIZE_VALUE, s->object_size);
>> +             if (s->ctor)
>> +                     s->ctor(x);
>> +     }
>> +
>
> I am no SLUB expert, but this looks wrong.
> slab_free() is called directly via kmem_cache_free().

I can't help with that one. My competence does not extend to kernel
memory management / allocation issues :)

>
> conntrack objects are alloc'd/free'd from a SLAB_DESTROY_BY_RCU cache.
>
> It is therefore legal to access a conntrack object from another
> CPU even after kmem_cache_free() was invoked on another cpu, provided all
> readers that do so hold rcu_read_lock, and verify that object has not been
> freed yet by issuing appropriate atomic_inc_not_zero calls.
>
> Therefore, object poisoning will only be safe from rcu callback, after
> accesses are known to be illegal/invalid.

Can you expand on that? The term "object poisoning" to me means an
object (you are talking about the conntrack tuple, right?) with
problematic values is put into memory, but the way you phrase it seems
more like the hash table itself is being manipulated improperly.

I'm still trying to work out what the actual ISSUE is. My
understanding is this, thus far:

It seems like an object in the connection tracking hash table is
being improperly marked as free, which is then sanitized, and is
later accessed by the netfilter code path that loops through the table.

>
> (not saying that conntrack is bug free..., we had races there in the
>  past).
>
> From a short glance at SLUB it seems poisoning objects for SLAB_DESTROY_BY_RCU
> caches is safe in __free_slab(), but not earlier.
>
> If you use different allocator, please tell us which one (check kernel
> config, slub is default).

SLAB allocator, though I do not remember making the choice.

From the kernel config that's causing issues:

# egrep 'SLAB|SLUB' .config
CONFIG_SLAB=y
# CONFIG_SLUB is not set
CONFIG_SLABINFO=y
# CONFIG_DEBUG_SLAB is not set
CONFIG_PAX_USERCOPY_SLABS=y

For reference, the current kernel, with the PaX sanitization feature
disabled, doesn't exhibit the issue. Not that I am surprised.

I don't, as a rule, mess with kernel memory/process management
internals without a good reason because I don't have enough
information to make a proper choice. Usually the defaults are "good
enough". I can only think of a handful of instances where I have had
reason to do so, and even then the results were inconsistent at best.

>
> If its reproduceable with poisoning done after the RCU grace periods
> have elapsed (i.e., where its not legal anymore to access the memory),
> please let us know and we can have another look at it.
>
> Thanks.

Reproducibility is an issue since I don't know what triggers it in the
first place, only that it happens after a variable length of time
along the same code path, subject to differences between the two
kernel versions I've seen this issue with.

The machine itself is pushing 20-25 megabytes (~50k packets) per
second at any given time and has hit the default conntrack hash table
maximums, so the netfilter subsystem is under nontrivial stress.

I'll happily work with you guys to isolate this as this is an
interesting problem and I'm bored, but I need a bit of help and
prompting to get this done properly.

I am a sysadmin of reasonable (in my own estimate) skill and a
developer in Puppet / Perl, but kernel work beyond surface-level
debugging of panics is well outside my aegis.

Even after your explanation I am not yet sure I understand the issue,
and am definitely sure I don't understand how to debug this further.
Mathias Krause Oct. 31, 2014, 5:46 p.m. UTC | #3
On Fri, Oct 31, 2014 at 05:50:42PM +0100, Florian Westphal wrote:
> eric gisse <jowr.pi@gmail.com> wrote:
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 3e8afcc07a76..08a7cbcf2274 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2643,6 +2643,12 @@ static __always_inline void slab_free(struct kmem_cache *s,
> >  
> >  	slab_free_hook(s, x);
> >  
> > +	if (pax_sanitize_slab && !(s->flags & SLAB_NO_SANITIZE)) {
> > +		memset(x, PAX_MEMORY_SANITIZE_VALUE, s->object_size);
> > +		if (s->ctor)
> > +			s->ctor(x);
> > +	}
> > +
> 
> I am no SLUB expert, but this looks wrong.
> slab_free() is called directly via kmem_cache_free().
> 
> conntrack objects are alloc'd/free'd from a SLAB_DESTROY_BY_RCU cache.
> 
> It is therefore legal to access a conntrack object from another
> CPU even after kmem_cache_free() was invoked on another cpu, provided all
> readers that do so hold rcu_read_lock, and verify that object has not been
> freed yet by issuing appropriate atomic_inc_not_zero calls.
> 
> Therefore, object poisoning will only be safe from rcu callback, after
> accesses are known to be illegal/invalid.

Snap, you're right! I misread the following comment in
include/linux/slab.h as allowing "free reuse" by the slab allocator as
well, e.g. for sanitizing/poisoning the object:

 * SLAB_DESTROY_BY_RCU - **WARNING** READ THIS!
 *  
 * This delays freeing the SLAB page by a grace period, it does _NOT_
 * delay object freeing. This means that if you do kmem_cache_free()
 * that memory location is free to be reused at any time. Thus it may
 * be possible to see another object there in the same RCU grace period.

But, in fact, that assumption is not true. I now see how the conntrack
code exploits this feature by testing &ct->ct_general.use. So if we
scratch that by writing '\xfe' over the entire object, that test no
longer works.

I guess we need to change the slab sanitization feature in PaX to handle
SLAB_DESTROY_BY_RCU marked slabs the way they need to.

> 
> (not saying that conntrack is bug free..., we had races there in the
>  past).
> 
> From a short glance at SLUB it seems poisoning objects for SLAB_DESTROY_BY_RCU
> caches is safe in __free_slab(), but not earlier.
> 
> If you use different allocator, please tell us which one (check kernel
> config, slub is default).
> 
> If its reproduceable with poisoning done after the RCU grace periods
> have elapsed (i.e., where its not legal anymore to access the memory),
> please let us know and we can have another look at it.
> 

Thanks,
Mathias
Florian Westphal Oct. 31, 2014, 10 p.m. UTC | #4
eric gisse <jowr.pi@gmail.com> wrote:
> On Fri, Oct 31, 2014 at 4:50 PM, Florian Westphal <fw@strlen.de> wrote:
> >> +     if (pax_sanitize_slab && !(s->flags & SLAB_NO_SANITIZE)) {
> >> +             memset(x, PAX_MEMORY_SANITIZE_VALUE, s->object_size);
> >> +             if (s->ctor)
> >> +                     s->ctor(x);
> >> +     }
> >> +
> >
> > I am no SLUB expert, but this looks wrong.
> > slab_free() is called directly via kmem_cache_free().
> 
> I can't help with that one. My competence does not extend to kernel
> memory managment / allocation issues :)

It seems Mathias Krause will work on improving PaX poisoning to treat
SLAB_DESTROY_BY_RCU caches specially.

> > conntrack objects are alloc'd/free'd from a SLAB_DESTROY_BY_RCU cache.
> >
> > It is therefore legal to access a conntrack object from another
> > CPU even after kmem_cache_free() was invoked on another cpu, provided all
> > readers that do so hold rcu_read_lock, and verify that object has not been
> > freed yet by issuing appropriate atomic_inc_not_zero calls.
> >
> > Therefore, object poisoning will only be safe from rcu callback, after
> > accesses are known to be illegal/invalid.
> 
> Can you expand on that? The term "object poisoning" to me means an
> object (you are talking about the conntract tuple, right?) with

Yes.

> problematic values is put into memory, but the way you phrase it seems
> more like the hash table itself is being manipulated improperly.

No, afaics the conntrack object accesses are correct.

> I'm still trying to work out what the actual ISSUE is. My
> understanding is this, thus far:
> 
> It seems like an object in the connection track hash table is being
> improperly marked as free, which then is sanitized, and is then later
> being accessed by the netfilter codepath that loops through the table.

No. Conntrack objects are freed when the last reference is dropped.
However, because lookups in the conntrack hash table are lockless,
another CPU might be accessing the conntrack object that is being
freed right now.

Usually this means that the access is invalid.  However, in the
conntrack case, the conntrack objects are allocated from a special
cache that delays freeing of the underlying pages until we know that no
other CPU is currently accessing them.

So there are 2 possible cases:
1 - the conntrack object that is being looked at is alive (refcnt > 1).
2 - the conntrack object that is being looked at is being freed RIGHT NOW
on another CPU.  RCU protects us from a page fault, since the underlying
memory page cannot be freed yet.

So we're safe to look at the memory contents of the tuple and decide
whether it's the object (conntrack tuple) we're trying to find or not.

If it is, we try to obtain a reference; this will only succeed if the
reference count is not already 0, so we can detect the "it's freed"
case.

If we obtained a reference, we still need to re-validate the tuple,
since it's possible that the object was freed on CPU x and
almost instantly reallocated for use by a different tuple.

If you are interested in this, have a look at the bug fixes made
in that area; they come with more explanations.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/net/netfilter/nf_conntrack_core.c?id=e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/net/netfilter/nf_conntrack_core.c?id=c6825c0976fa7893692e0e43b09740b419b23c09

> > If you use a different allocator, please tell us which one (check the
> > kernel config; SLUB is the default).
> 
> SLAB allocator, though I do not remember making the choice.
> 
> From the kernel config that's causing issues:
> 
> # egrep 'SLAB|SLUB' .config
> CONFIG_SLAB=y
> # CONFIG_SLUB is not set
> CONFIG_SLABINFO=y
> # CONFIG_DEBUG_SLAB is not set
> CONFIG_PAX_USERCOPY_SLABS=y

OK, at a quick glance, the PaX SLAB kfree() is also zapping
objects before the grace period has elapsed.

> > If it's reproducible with poisoning done after the RCU grace period
> > has elapsed (i.e., when it's no longer legal to access the memory),
> > please let us know and we can have another look at it.
> >
> > Thanks.
> 
> Reproducibility is an issue since I don't know what's triggering it in
> the first place. I only know that it happens after a variable length of
> time along the same code path, subject to differences between the two
> kernel versions I've seen this issue with.
> 
> The machine itself is pushing 20-25 megabytes (~50k packets) per
> second at any given time and has smacked the default conntrack hash
> table maximums. So the netfilter system is under nontrivial stresses.

It should be able to handle a lot more.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/net/netfilter/nf_conntrack_core.c?id=93bb0ceb75be2fdfa9fc0dd1fb522d9ada515d9c

> I'll happily work with you guys to isolate this as this is an
> interesting problem and I'm bored, but I need a bit of help and
> prompting to get this done properly.

Sure, my understanding is that someone from the PaX team is working on
the object poisoning to handle SLAB_DESTROY_BY_RCU properly.

Please don't hesitate to report back with newer PaX versions if you
still see invalid accesses.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mathias Krause Nov. 1, 2014, 10:56 a.m. UTC | #5
On 31 October 2014 23:00, Florian Westphal <fw@strlen.de> wrote:
> eric gisse <jowr.pi@gmail.com> wrote:
>> On Fri, Oct 31, 2014 at 4:50 PM, Florian Westphal <fw@strlen.de> wrote:
>> >> +     if (pax_sanitize_slab && !(s->flags & SLAB_NO_SANITIZE)) {
>> >> +             memset(x, PAX_MEMORY_SANITIZE_VALUE, s->object_size);
>> >> +             if (s->ctor)
>> >> +                     s->ctor(x);
>> >> +     }
>> >> +
>> >
>> > I am no SLUB expert, but this looks wrong.
>> > slab_free() is called directly via kmem_cache_free().
>>
>> I can't help with that one. My competence does not extend to kernel
>> memory management / allocation issues :)
>
> It seems Mathias Krause will work on improving PaX poisoning to treat
> SLAB_DESTROY_BY_RCU caches specially.

Well, the fix is as simple as it is destructive: PaX sanitize has to exclude
SLAB_DESTROY_BY_RCU-marked caches from per-object sanitization. It
can only do page-based sanitization for such slabs.

>> > conntrack objects are alloc'd/free'd from a SLAB_DESTROY_BY_RCU cache.
>> >
>> > It is therefore legal to access a conntrack object from another
>> > CPU even after kmem_cache_free() was invoked on another CPU, provided all
>> > readers that do so hold rcu_read_lock and verify that the object has not been
>> > freed yet by issuing appropriate atomic_inc_not_zero() calls.
>> >
>> > Therefore, object poisoning will only be safe from an RCU callback, after
>> > accesses are known to be illegal/invalid.
>>
>> Can you expand on that? The term "object poisoning" to me means an
>> object (you are talking about the conntrack tuple, right?) with
>
> Yes.
>
>> problematic values is put into memory, but the way you phrase it seems
>> more like the hash table itself is being manipulated improperly.
>
> No, afaics the conntrack object accesses are correct.
>
>> I'm still trying to work out what the actual ISSUE is. My
>> understanding is this, thus far:
>>
>> It seems like an object in the connection tracking hash table is being
>> improperly marked as free, then sanitized, and later accessed by the
>> netfilter code path that loops through the table.
>
> No.  Conntrack objects are freed when the last reference goes
> away.  However, because lookup in the conntrack hash table is lockless,
> another CPU might be accessing a conntrack object that is being freed
> right now.
>
> Usually this means that the access is invalid.  However, in the
> conntrack case, the conntrack objects are allocated from a special
> cache that delays freeing of the underlying pages until we know that no
> other CPU is currently accessing them.
>
> So there are 2 possible cases:
> 1 - the conntrack object that is being looked at is alive (refcnt > 1).
> 2 - the conntrack object that is being looked at is being freed RIGHT NOW
> on another CPU.  RCU protects us from a page fault, since the underlying
> memory page cannot be freed yet.
>
> So we're safe to look at the memory contents of the tuple and decide
> whether it's the object (conntrack tuple) we're trying to find or not.
>
> If it is, we try to obtain a reference; this will only succeed if the
> reference count is not already 0, so we can detect the "it's freed"
> case.
>
> If we obtained a reference, we still need to re-validate the tuple,
> since it's possible that the object was freed on CPU x and
> almost instantly reallocated for use by a different tuple.
>
> If you are interested in this, have a look at the bug fixes made
> in that area; they come with more explanations.
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/net/netfilter/nf_conntrack_core.c?id=e53376bef2cd97d3e3f61fdc677fb8da7d03d0da
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/net/netfilter/nf_conntrack_core.c?id=c6825c0976fa7893692e0e43b09740b419b23c09
>

PaX per-object sanitization simply invalidates this assumption, as its
poisoning would set the reference count to 0xfefefefe, which is clearly
not zero. So it looks like the only place to safely sanitize the
object is at the page level, when the page gets released after the RCU
grace period has passed. And that's what the latest version of PaX
sanitize is doing now.

>> > If you use a different allocator, please tell us which one (check the
>> > kernel config; SLUB is the default).
>>
>> SLAB allocator, though I do not remember making the choice.
>>
>> From the kernel config that's causing issues:
>>
>> # egrep 'SLAB|SLUB' .config
>> CONFIG_SLAB=y
>> # CONFIG_SLUB is not set
>> CONFIG_SLABINFO=y
>> # CONFIG_DEBUG_SLAB is not set
>> CONFIG_PAX_USERCOPY_SLABS=y
>
> OK, at a quick glance, the PaX SLAB kfree() is also zapping
> objects before the grace period has elapsed.

Yep. But that is now fixed in the latest version, for all of them:
SLAB/SLOB/SLUB.

>> > If it's reproducible with poisoning done after the RCU grace period
>> > has elapsed (i.e., when it's no longer legal to access the memory),
>> > please let us know and we can have another look at it.
>> >
>> > Thanks.
>>
>> Reproducibility is an issue since I don't know what's triggering it in
>> the first place. I only know that it happens after a variable length of
>> time along the same code path, subject to differences between the two
>> kernel versions I've seen this issue with.
>>
>> The machine itself is pushing 20-25 megabytes (~50k packets) per
>> second at any given time and has smacked the default conntrack hash
>> table maximums. So the netfilter system is under nontrivial stresses.
>
> It should be able to handle a lot more.
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/net/netfilter/nf_conntrack_core.c?id=93bb0ceb75be2fdfa9fc0dd1fb522d9ada515d9c
>
>> I'll happily work with you guys to isolate this as this is an
>> interesting problem and I'm bored, but I need a bit of help and
>> prompting to get this done properly.
>
> Sure, my understanding is that someone from the PaX team is working on
> the object poisoning to handle SLAB_DESTROY_BY_RCU properly.
>
> Please don't hesitate to report back with newer PaX versions if you
> still see invalid accesses.

Thanks, Florian, for investigating this!

Regards,
Mathias

Patch

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1edd5fdc629d..14eda90aa38e 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2467,6 +2467,10 @@  bytes respectively. Such letter suffixes can also be entirely omitted.
 			the specified number of seconds.  This is to be used if
 			your oopses keep scrolling off the screen.
 
+	pax_sanitize_slab=
+			0/1 to disable/enable slab object sanitization (enabled
+			by default).
+
 	pcbit=		[HW,ISDN]
 
 	pcd.		[PARIDE]
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 1d9abb7d22a0..067bd01fed92 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -23,6 +23,7 @@ 
 #define SLAB_DEBUG_FREE		0x00000100UL	/* DEBUG: Perform (expensive) checks on free */
 #define SLAB_RED_ZONE		0x00000400UL	/* DEBUG: Red zone objs in a cache */
 #define SLAB_POISON		0x00000800UL	/* DEBUG: Poison objects */
+#define SLAB_NO_SANITIZE	0x00001000UL	/* PaX: Do not sanitize objs on free */
 #define SLAB_HWCACHE_ALIGN	0x00002000UL	/* Align objs on cache lines */
 #define SLAB_CACHE_DMA		0x00004000UL	/* Use GFP_DMA memory */
 #define SLAB_STORE_USER		0x00010000UL	/* DEBUG: Store the last owner for bug hunting */
diff --git a/mm/slab.c b/mm/slab.c
index 7c52b3890d25..3f111541d1ce 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3384,6 +3384,16 @@  static inline void __cache_free(struct kmem_cache *cachep, void *objp,
 	struct array_cache *ac = cpu_cache_get(cachep);
 
 	check_irq_off();
+
+	if (pax_sanitize_slab) {
+		if (!(cachep->flags & (SLAB_POISON | SLAB_NO_SANITIZE))) {
+			memset(objp, PAX_MEMORY_SANITIZE_VALUE, cachep->object_size);
+
+			if (cachep->ctor)
+				cachep->ctor(objp);
+		}
+	}
+
 	kmemleak_free_recursive(objp, cachep->flags);
 	objp = cache_free_debugcheck(cachep, objp, caller);
 
diff --git a/mm/slab.h b/mm/slab.h
index 0e0fdd365840..3a2d6cbae601 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -32,6 +32,13 @@  extern struct list_head slab_caches;
 /* The slab cache that manages slab cache information */
 extern struct kmem_cache *kmem_cache;
 
+#ifdef CONFIG_X86_64
+#define PAX_MEMORY_SANITIZE_VALUE      '\xfe'
+#else
+#define PAX_MEMORY_SANITIZE_VALUE      '\xff'
+#endif
+extern bool pax_sanitize_slab;
+
 unsigned long calculate_alignment(unsigned long flags,
 		unsigned long align, unsigned long size);
 
@@ -67,7 +74,7 @@  __kmem_cache_alias(const char *name, size_t size, size_t align,
 
 /* Legal flag mask for kmem_cache_create(), for various configurations */
 #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | SLAB_PANIC | \
-			 SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS )
+			 SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS | SLAB_NO_SANITIZE)
 
 #if defined(CONFIG_DEBUG_SLAB)
 #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index d319502b2403..f88dbc3fa1e7 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -30,6 +30,15 @@  LIST_HEAD(slab_caches);
 DEFINE_MUTEX(slab_mutex);
 struct kmem_cache *kmem_cache;
 
+bool pax_sanitize_slab __read_mostly = true;
+static int __init pax_sanitize_slab_setup(char *str)
+{
+	pax_sanitize_slab = !!simple_strtol(str, NULL, 0);
+	printk("%sabled PaX slab sanitization\n", pax_sanitize_slab ? "En" : "Dis");
+	return 1;
+}
+__setup("pax_sanitize_slab=", pax_sanitize_slab_setup);
+
 #ifdef CONFIG_DEBUG_VM
 static int kmem_cache_sanity_check(const char *name, size_t size)
 {
diff --git a/mm/slob.c b/mm/slob.c
index 21980e0f39a8..c4907d766048 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -365,6 +365,9 @@  static void slob_free(void *block, int size)
 		return;
 	}
 
+	if (pax_sanitize_slab)
+		memset(block, PAX_MEMORY_SANITIZE_VALUE, size);
+
 	if (!slob_page_free(sp)) {
 		/* This slob page is about to become partially free. Easy! */
 		sp->units = units;
diff --git a/mm/slub.c b/mm/slub.c
index 3e8afcc07a76..08a7cbcf2274 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2643,6 +2643,12 @@  static __always_inline void slab_free(struct kmem_cache *s,
 
 	slab_free_hook(s, x);
 
+	if (pax_sanitize_slab && !(s->flags & SLAB_NO_SANITIZE)) {
+		memset(x, PAX_MEMORY_SANITIZE_VALUE, s->object_size);
+		if (s->ctor)
+			s->ctor(x);
+	}
+
 redo:
 	/*
 	 * Determine the currently cpus per cpu slab.
@@ -2986,6 +2992,7 @@  static int calculate_sizes(struct kmem_cache *s, int forced_order)
 	s->inuse = size;
 
 	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) ||
+		(pax_sanitize_slab && !(flags & SLAB_NO_SANITIZE)) ||
 		s->ctor)) {
 		/*
 		 * Relocate free pointer after the object if it is not
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8d289697cc7a..7a4e52d90eed 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3237,13 +3237,15 @@  void __init skb_init(void)
 	skbuff_head_cache = kmem_cache_create("skbuff_head_cache",
 					      sizeof(struct sk_buff),
 					      0,
-					      SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+					      SLAB_HWCACHE_ALIGN|SLAB_PANIC|
+					      SLAB_NO_SANITIZE,
 					      NULL);
 	skbuff_fclone_cache = kmem_cache_create("skbuff_fclone_cache",
 						(2*sizeof(struct sk_buff)) +
 						sizeof(atomic_t),
 						0,
-						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+						SLAB_HWCACHE_ALIGN|SLAB_PANIC|
+						SLAB_NO_SANITIZE,
 						NULL);
 }