Message ID | 31096a330a56cfd40eea136be505270184fbb5d4.1501775813.git.pabeni@redhat.com |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
On Thu, 2017-08-03 at 18:07 +0200, Paolo Abeni wrote: > __ip_options_echo() uses the current network namespace, and > currently retrives it via skb->dst->dev. > > This commit adds an explicit 'net' argument to __ip_options_echo() > and update all the call sites to provide it, usually via a simpler > sock_net(). > > After this change, __ip_options_echo() no more needs to access > skb->dst and we can drop a couple of hack to preserve such > info in the rx path. > > Signed-off-by: Paolo Abeni <pabeni@redhat.com> > --- David, Paolo This commit (91ed1e666a4ea2e260452a7d7d311ac5ae852cba "ip/options: explicitly provide net ns to __ip_options_echo()") needs to be backported to linux-4.13 stable version to avoid these kind of crashes [1] This is because of MSG_PEEK operation, hitting skb_consume_udp() while skb is still in receive queue. Next read() finding again the skb then can see a NULL skb->dst Thanks ! [1] general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 3017 Comm: syzkaller446772 Not tainted 4.13.0+ #68 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 task: ffff8801cd0a4380 task.stack: ffff8801cc498000 RIP: 0010:__ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143 RSP: 0018:ffff8801cc49f628 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: ffff8801cc49f928 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004 RBP: ffff8801cc49f6b8 R08: ffff8801cc49f936 R09: ffffed0039893f28 R10: 0000000000000003 R11: ffffed0039893f27 R12: ffff8801cc49f918 R13: ffff8801ccbcf36c R14: 000000000000000f R15: 0000000000000018 FS: 0000000000979880(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000200c0ff0 CR3: 00000001cc4ed000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ip_options_echo include/net/ip.h:574 [inline] ip_cmsg_recv_retopts net/ipv4/ip_sockglue.c:91 [inline] ip_cmsg_recv_offset+0xa17/0x1280 net/ipv4/ip_sockglue.c:207 udp_recvmsg+0xe0b/0x1260 net/ipv4/udp.c:1641 inet_recvmsg+0x14c/0x5f0 net/ipv4/af_inet.c:793 sock_recvmsg_nosec net/socket.c:792 [inline] sock_recvmsg+0xc9/0x110 net/socket.c:799 SYSC_recvfrom+0x2dc/0x570 net/socket.c:1788 SyS_recvfrom+0x40/0x50 net/socket.c:1760 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x444c89 RSP: 002b:00007ffd80c788e8 EFLAGS: 00000286 ORIG_RAX: 000000000000002d RAX: ffffffffffffffda RBX: ffffffffffffffff RCX: 0000000000444c89 RDX: 0000000000000000 RSI: 0000000020bc0000 RDI: 0000000000000004 RBP: 0000000000000082 R08: 00000000200c0ff0 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000402390 R13: 0000000000402420 R14: 0000000000000000 R15: 0000000000000000 Code: f6 c1 01 0f 85 a5 01 00 00 48 89 4d b8 e8 31 e9 6b fd 48 8b 4d b8 48 b8 00 00 00 00 00 fc ff df 48 83 e1 fe 48 89 ca 48 c1 ea 03 <80> 3c 02 00 0f 85 41 02 00 00 48 8b 09 48 b8 00 00 00 00 00 fc RIP: __ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143 RSP: ffff8801cc49f628 ---[ end trace b30d95b284222843 ]--- Kernel panic - not syncing: Fatal exception
On Tue, 2017-09-05 at 10:18 -0700, Eric Dumazet wrote: > On Thu, 2017-08-03 at 18:07 +0200, Paolo Abeni wrote: > > __ip_options_echo() uses the current network namespace, and > > currently retrives it via skb->dst->dev. > > > > This commit adds an explicit 'net' argument to __ip_options_echo() > > and update all the call sites to provide it, usually via a simpler > > sock_net(). > > > > After this change, __ip_options_echo() no more needs to access > > skb->dst and we can drop a couple of hack to preserve such > > info in the rx path. > > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com> > > --- > > David, Paolo > > This commit (91ed1e666a4ea2e260452a7d7d311ac5ae852cba "ip/options: > explicitly provide net ns to __ip_options_echo()") > > needs to be backported to linux-4.13 stable version to avoid these kind > of crashes [1] > > This is because of MSG_PEEK operation, hitting skb_consume_udp() while > skb is still in receive queue. > > Next read() finding again the skb then can see a NULL skb->dst > > Thanks ! > > [1] > general protection fault: 0000 [#1] SMP KASAN > Dumping ftrace buffer: > (ftrace buffer empty) > Modules linked in: > CPU: 0 PID: 3017 Comm: syzkaller446772 Not tainted 4.13.0+ #68 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > task: ffff8801cd0a4380 task.stack: ffff8801cc498000 > RIP: 0010:__ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143 > RSP: 0018:ffff8801cc49f628 EFLAGS: 00010246 > RAX: dffffc0000000000 RBX: ffff8801cc49f928 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004 > RBP: ffff8801cc49f6b8 R08: ffff8801cc49f936 R09: ffffed0039893f28 > R10: 0000000000000003 R11: ffffed0039893f27 R12: ffff8801cc49f918 > R13: ffff8801ccbcf36c R14: 000000000000000f R15: 0000000000000018 > FS: 0000000000979880(0000) GS:ffff8801db200000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00000000200c0ff0 CR3: 00000001cc4ed000 CR4: 00000000001406f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > ip_options_echo include/net/ip.h:574 [inline] > ip_cmsg_recv_retopts net/ipv4/ip_sockglue.c:91 [inline] > ip_cmsg_recv_offset+0xa17/0x1280 net/ipv4/ip_sockglue.c:207 > udp_recvmsg+0xe0b/0x1260 net/ipv4/udp.c:1641 > inet_recvmsg+0x14c/0x5f0 net/ipv4/af_inet.c:793 > sock_recvmsg_nosec net/socket.c:792 [inline] > sock_recvmsg+0xc9/0x110 net/socket.c:799 > SYSC_recvfrom+0x2dc/0x570 net/socket.c:1788 > SyS_recvfrom+0x40/0x50 net/socket.c:1760 > entry_SYSCALL_64_fastpath+0x1f/0xbe > RIP: 0033:0x444c89 > RSP: 002b:00007ffd80c788e8 EFLAGS: 00000286 ORIG_RAX: 000000000000002d > RAX: ffffffffffffffda RBX: ffffffffffffffff RCX: 0000000000444c89 > RDX: 0000000000000000 RSI: 0000000020bc0000 RDI: 0000000000000004 > RBP: 0000000000000082 R08: 00000000200c0ff0 R09: 0000000000000010 > R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000402390 > R13: 0000000000402420 R14: 0000000000000000 R15: 0000000000000000 > Code: f6 c1 01 0f 85 a5 01 00 00 48 89 4d b8 e8 31 e9 6b fd 48 8b 4d b8 > 48 b8 00 00 00 00 00 fc ff df 48 83 e1 fe 48 89 ca 48 c1 ea 03 <80> 3c > 02 00 0f 85 41 02 00 00 48 8b 09 48 b8 00 00 00 00 00 fc > RIP: __ip_options_echo+0xea8/0x1430 net/ipv4/ip_options.c:143 RSP: > ffff8801cc49f628 > ---[ end trace b30d95b284222843 ]--- > Kernel panic - not syncing: Fatal exception Thank you Eric for the report! Darn me, I seriously messed-up with the stateless consume. I think we can have similar issues pith ipsec/secpath and MSG_PEEK, even if with less catastropthic outcome. What about the following, which should cover both cases? (only compile tested, I'll test it tomorrow morning my time) --- diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index d67a8182e5eb..63df75ae70ee 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -885,7 +885,7 @@ void kfree_skb(struct sk_buff *skb); void kfree_skb_list(struct sk_buff *segs); void skb_tx_error(struct sk_buff *skb); void consume_skb(struct sk_buff *skb); -void consume_stateless_skb(struct sk_buff *skb); +void __consume_stateless_skb(struct sk_buff *skb); void __kfree_skb(struct sk_buff *skb); extern struct kmem_cache *skbuff_head_cache; diff --git a/net/core/skbuff.c b/net/core/skbuff.c index e07556606284..f2411a8744d7 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -753,14 +753,11 @@ EXPORT_SYMBOL(consume_skb); * consume_stateless_skb - free an skbuff, assuming it is stateless * @skb: buffer to free * - * Works like consume_skb(), but this variant assumes that all the head - * states have been already dropped. + * Alike consume_skb(), but this variant assumes that all the head + * states have been already dropped and usage count is one */ -void consume_stateless_skb(struct sk_buff *skb) +void __consume_stateless_skb(struct sk_buff *skb) { - if (!skb_unref(skb)) - return; - trace_consume_skb(skb); if (likely(skb->head)) skb_release_data(skb); diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 62344804baae..979e4d8526ba 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1386,12 +1386,15 @@ void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len) unlock_sock_fast(sk, slow); } + if (!skb_unref(skb)) + return; + /* In the more common cases we cleared the head states previously, * see __udp_queue_rcv_skb(). */ if (unlikely(udp_skb_has_head_state(skb))) skb_release_head_state(skb); - consume_stateless_skb(skb); + __consume_stateless_skb(skb); } EXPORT_SYMBOL_GPL(skb_consume_udp);
diff --git a/include/net/ip.h b/include/net/ip.h index 821cedcc8e73..9e59dcf1787a 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -567,11 +567,12 @@ int ip_forward(struct sk_buff *skb); void ip_options_build(struct sk_buff *skb, struct ip_options *opt, __be32 daddr, struct rtable *rt, int is_frag); -int __ip_options_echo(struct ip_options *dopt, struct sk_buff *skb, - const struct ip_options *sopt); -static inline int ip_options_echo(struct ip_options *dopt, struct sk_buff *skb) +int __ip_options_echo(struct net *net, struct ip_options *dopt, + struct sk_buff *skb, const struct ip_options *sopt); +static inline int ip_options_echo(struct net *net, struct ip_options *dopt, + struct sk_buff *skb) { - return __ip_options_echo(dopt, skb, &IPCB(skb)->opt); + return __ip_options_echo(net, dopt, skb, &IPCB(skb)->opt); } void ip_options_fragment(struct sk_buff *skb); diff --git a/include/net/tcp.h b/include/net/tcp.h index bb1881b4ce48..5173fecde495 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1885,7 +1885,8 @@ extern void tcp_rack_reo_timeout(struct sock *sk); /* * Save and compile IPv4 options, return a pointer to it */ -static inline struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb) +static inline struct ip_options_rcu *tcp_v4_save_options(struct net *net, + struct sk_buff *skb) { const struct ip_options *opt = &TCP_SKB_CB(skb)->header.h4.opt; struct ip_options_rcu *dopt = NULL; @@ -1894,7 +1895,7 @@ static inline struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb) int opt_size = sizeof(*dopt) + opt->optlen; dopt = kmalloc(opt_size, GFP_ATOMIC); - if (dopt && __ip_options_echo(&dopt->opt, skb, opt)) { + if (dopt && __ip_options_echo(net, &dopt->opt, skb, opt)) { kfree(dopt); dopt = NULL; } diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index c2be26b98b5f..681e33998e03 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -412,7 +412,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb) int type = icmp_param->data.icmph.type; int code = icmp_param->data.icmph.code; - if (ip_options_echo(&icmp_param->replyopts.opt.opt, skb)) + if (ip_options_echo(net, &icmp_param->replyopts.opt.opt, skb)) return; /* Needed by both icmp_global_allow and icmp_xmit_lock */ @@ -694,7 +694,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info) iph->tos; mark = IP4_REPLY_MARK(net, skb_in->mark); - if (ip_options_echo(&icmp_param.replyopts.opt.opt, skb_in)) + if (ip_options_echo(net, &icmp_param.replyopts.opt.opt, skb_in)) goto out_unlock; diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c index fdda97308c0b..525ae88d1e58 100644 --- a/net/ipv4/ip_options.c +++ b/net/ipv4/ip_options.c @@ -86,8 +86,8 @@ void ip_options_build(struct sk_buff *skb, struct ip_options *opt, * NOTE: dopt cannot point to skb. */ -int __ip_options_echo(struct ip_options *dopt, struct sk_buff *skb, - const struct ip_options *sopt) +int __ip_options_echo(struct net *net, struct ip_options *dopt, + struct sk_buff *skb, const struct ip_options *sopt) { unsigned char *sptr, *dptr; int soffset, doffset; @@ -140,7 +140,7 @@ int __ip_options_echo(struct ip_options *dopt, struct sk_buff *skb, __be32 addr; memcpy(&addr, dptr+soffset-1, 4); - if (inet_addr_type(dev_net(skb_dst(skb)->dev), addr) != RTN_UNICAST) { + if (inet_addr_type(net, addr) != RTN_UNICAST) { dopt->ts_needtime = 1; soffset += 8; } diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index b631ec685d77..73b0b15245b6 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1525,7 +1525,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, int err; int oif; - if (__ip_options_echo(&replyopts.opt.opt, skb, sopt)) + if (__ip_options_echo(net, &replyopts.opt.opt, skb, sopt)) return; ipc.addr = daddr; diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index ecc4b4a2413e..1c3354d028a4 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -80,7 +80,8 @@ static void ip_cmsg_recv_opts(struct msghdr *msg, struct sk_buff *skb) } -static void ip_cmsg_recv_retopts(struct msghdr *msg, struct sk_buff *skb) +static void ip_cmsg_recv_retopts(struct net *net, struct msghdr *msg, + struct sk_buff *skb) { unsigned char optbuf[sizeof(struct ip_options) + 40]; struct ip_options *opt = (struct ip_options *)optbuf; @@ -88,7 +89,7 @@ static void ip_cmsg_recv_retopts(struct msghdr *msg, struct sk_buff *skb) if (IPCB(skb)->opt.optlen == 0) return; - if (ip_options_echo(opt, skb)) { + if (ip_options_echo(net, opt, skb)) { msg->msg_flags |= MSG_CTRUNC; return; } @@ -204,7 +205,7 @@ void ip_cmsg_recv_offset(struct msghdr *msg, struct sock *sk, } if (flags & IP_CMSG_RETOPTS) { - ip_cmsg_recv_retopts(msg, skb); + ip_cmsg_recv_retopts(sock_net(sk), msg, skb); flags &= ~IP_CMSG_RETOPTS; if (!flags) diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c index 03ad8778c395..b1bb1b3a1082 100644 --- a/net/ipv4/syncookies.c +++ b/net/ipv4/syncookies.c @@ -355,7 +355,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb) /* We throwed the options of the initial SYN away, so we hope * the ACK carries the same options again (see RFC1122 4.2.3.8) */ - ireq->opt = tcp_v4_save_options(skb); + ireq->opt = tcp_v4_save_options(sock_net(sk), skb); if (security_inet_conn_request(sk, skb, req)) { reqsk_free(req); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 9b51663cd5a4..5f708c85110e 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1267,7 +1267,7 @@ static void tcp_v4_init_req(struct request_sock *req, sk_rcv_saddr_set(req_to_sk(req), ip_hdr(skb)->daddr); sk_daddr_set(req_to_sk(req), ip_hdr(skb)->saddr); - ireq->opt = tcp_v4_save_options(skb); + ireq->opt = tcp_v4_save_options(sock_net(sk_listener), skb); } static struct dst_entry *tcp_v4_route_req(const struct sock *sk,
__ip_options_echo() uses the current network namespace, and currently retrives it via skb->dst->dev. This commit adds an explicit 'net' argument to __ip_options_echo() and update all the call sites to provide it, usually via a simpler sock_net(). After this change, __ip_options_echo() no more needs to access skb->dst and we can drop a couple of hack to preserve such info in the rx path. Signed-off-by: Paolo Abeni <pabeni@redhat.com> --- include/net/ip.h | 9 +++++---- include/net/tcp.h | 5 +++-- net/ipv4/icmp.c | 4 ++-- net/ipv4/ip_options.c | 6 +++--- net/ipv4/ip_output.c | 2 +- net/ipv4/ip_sockglue.c | 7 ++++--- net/ipv4/syncookies.c | 2 +- net/ipv4/tcp_ipv4.c | 2 +- 8 files changed, 20 insertions(+), 17 deletions(-)