diff mbox

Octeon crash in virt_to_page(&core0_stack_variable)

Message ID 201109091623.29000.cratiu@ixiacom.com
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Cosmin Ratiu Sept. 9, 2011, 1:23 p.m. UTC
Hello,

I've been investigating a strange crash and I wanted to ask for your help.
The crash happens when virt_to_page is called with an address from the softirq 
stack of core 0 on Cavium Octeon. It may happen on other MIPS processors as 
well, but I'm not sure.

I've attached a simple kernel module to demonstrate the problem and the output 
of dmesg + the crash. Two seconds after inserting the module, the kernel 
should crash.

From what I've dug up in the kernel sources, it seems the stack for the first 
idle task resides in the data segment (mapped in kseg2) while the rest are 
allocated with kmalloc in __cpu_up() and reside in a different area (CAC_BASE 
upwards).
It seems virt_to_phys produces bogus results for kseg2 and after that, 
virt_to_page crashes trying to access invalid memory.

This problem was discovered when doing BGP traffic with the TCP MD5 option 
activated, where the following call chain caused a crash:

 * tcp_v4_rcv
 *  tcp_v4_timewait_ack
 *   tcp_v4_send_ack -> follow stack variable rep.th
 *    tcp_v4_md5_hash_hdr
 *     tcp_md5_hash_header
 *      sg_init_one
 *       sg_set_buf
 *        virt_to_page

I noticed that tcp_v4_send_reset uses a similar stack variable and also calls 
tcp_v4_md5_hash_hdr, so it has the same problem.

I don't fully understand octeon mm details, so I wanted to bring up this issue 
in order to find a proper fix.
To avoid the problem, I've implemented a quick hack to declare those variables 
percpu instead of on the stack, so they would also reside in CAC_BASE upwards. 
I've attached a patch against 2.6.32 for reference.

Cosmin.

Comments

David Daney Sept. 9, 2011, 4:59 p.m. UTC | #1
On 09/09/2011 06:23 AM, Cosmin Ratiu wrote:
> Hello,
>
> I've been investigating a strange crash and I wanted to ask for your help.
> The crash happens when virt_to_page is called with an address from the softirq
> stack of core 0 on Cavium Octeon. It may happen on other MIPS processors as
> well, but I'm not sure.
>
> I've attached a simple kernel module to demonstrate the problem and the output
> of dmesg + the crash. Two seconds after inserting the module, the kernel
> should crash.
>
>  From what I've dug up in the kernel sources, it seems the stack for the first
> idle task resides in the data segment (mapped in kseg2) while the rest are
> allocated with kmalloc in __cpu_up() and reside in a different area (CAC_BASE
> upwards).
> It seems virt_to_phys produces bogus results for kseg2 and after that,
> virt_to_page crashes trying to access invalid memory.
>
> This problem was discovered when doing BGP traffic with the TCP MD5 option
> activated, where the following call chain caused a crash:
>
>   * tcp_v4_rcv
>   *  tcp_v4_timewait_ack
>   *   tcp_v4_send_ack ->  follow stack variable rep.th
>   *    tcp_v4_md5_hash_hdr
>   *     tcp_md5_hash_header
>   *      sg_init_one
>   *       sg_set_buf
>   *        virt_to_page
>
> I noticed that tcp_v4_send_reset uses a similar stack variable and also calls
> tcp_v4_md5_hash_hdr, so it has the same problem.
>
> I don't fully understand octeon mm details, so I wanted to bring up this issue
> in order to find a proper fix.
> To avoid the problem, I've implemented a quick hack to declare those variables
> percpu instead of on the stack, so they would also reside in CAC_BASE upwards.
> I've attached a patch against 2.6.32 for reference.
>
> Cosmin.
>
>
[...]
> [ 2040.300/0] Call Trace:
> [ 2040.300/0] [<ffffffffc123a054>] vcrash+0x54/0x80 [vcrash]
> [ 2040.300/0] [<ffffffffc0065f28>] run_timer_softirq+0x198/0x23c
> [ 2040.300/0] [<ffffffffc00609e0>] __do_softirq+0xd8/0x188

                   ^^^^^^^^^ CKSEG2 addresses detected!

You are using the out-of-tree mapped kernel patch which mucks about with 
the implementation of virt_to_phys().

Can you reproduce the TCP related crash in an unpatched kernel?

If not, then it would point to problems in the out-of-tree patches you 
have applied.

David Daney
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Change 3360379 by cratiu@cratiu on 2011/09/02 09:44:48 *pending*

	TCP/md5: Switch from a stack var to a percpu var to avoid a crash.
	
	tcp_v4_send_ack uses a stack variable to construct the TCP header for
	the response packet.
	When using TCP MD5 signatures on mips architecture a crash happens
	sometimes when the current core is the master core using the initial
	stack allocated in vmlinux.
	The reason for this is that the initial stack is mapped in kseg2
	so it can't be directly translated to a physical address by
	virt_to_phys as expected by sg_set_buf from the following call chain:
	
	> (optimized: sg_set_buf)
	> sg_init_one+0x58/0xa4
	> tcp_md5_hash_header+0x30/0x64
	> tcp_v4_md5_hash_hdr+0xb4/0x134
	> tcp_v4_send_ack+0x16c/0x25c
	> (optimized: tcp_v4_timewait_ack)
	> tcp_v4_rcv+0x1b3c/0x1e58
	
	As a temporary fix that should not affect performance, the stack
	variable is converted in a percpu variable allocated at boot time.

Affected files ...

... //packages/linux_2.6.32/main/src/include/net/tcp.h#6 edit
... //packages/linux_2.6.32/main/src/net/ipv4/tcp.c#11 edit
... //packages/linux_2.6.32/main/src/net/ipv4/tcp_ipv4.c#15 edit

 include/net/tcp.h   |   10 +++++++++
 net/ipv4/tcp.c      |    5 ++++
 net/ipv4/tcp_ipv4.c |   53 ++++++++++++++++++++++++----------------------------
 3 files changed, 40 insertions(+), 28 deletions(-)

Signed-off-by: Cosmin Ratiu <cratiu@ixiacom.com>
--- src/include/net/tcp.h~
+++ src/include/net/tcp.h
@@ -1570,5 +1570,15 @@ 
 	return skc->skc_net_params->tcp.rmem;
 }
 
+struct tcp_reply_hdr {
+	struct tcphdr th;
+	__be32 opt[(TCPOLEN_TSTAMP_ALIGNED >> 2)
+#ifdef CONFIG_TCP_MD5SIG
+		   + (TCPOLEN_MD5SIG_ALIGNED >> 2)
+#endif
+		];
+};
+
+extern struct tcp_reply_hdr *tcp_rep_percpu;
 
 #endif	/* _TCP_H */
--- src/net/ipv4/tcp.c~
+++ src/net/ipv4/tcp.c
@@ -3150,6 +3150,11 @@ 
 	       tcp_hashinfo.lhash_size);
 
 	tcp_register_congestion_control(&tcp_reno);
+
+	/* Hack alert: a proper fix should be implemented for the md5 crash */
+	tcp_rep_percpu = alloc_percpu(struct tcp_reply_hdr);
+	if (!tcp_rep_percpu)
+		panic("Cannot allocate per cpu tcp reply hdr\n");
 }
 
 EXPORT_SYMBOL(tcp_close);
--- src/net/ipv4/tcp_ipv4.c~
+++ src/net/ipv4/tcp_ipv4.c
@@ -680,6 +680,8 @@ 
 	SOCK_STAT_INC(groupptr, TcpRstSent, skb_get_portid(skb));
 }
 
+struct tcp_reply_hdr *tcp_rep_percpu;
+
 /* The code following below sending ACKs in SYN-RECV and TIME-WAIT states
    outside socket context is ugly, certainly. What can I do?
  */
@@ -691,53 +693,48 @@ 
 			    int reply_flags, u32 vlanprio)
 {
 	struct tcphdr *th = tcp_hdr(skb);
-	struct {
-		struct tcphdr th;
-		__be32 opt[(TCPOLEN_TSTAMP_ALIGNED >> 2)
-#ifdef CONFIG_TCP_MD5SIG
-			   + (TCPOLEN_MD5SIG_ALIGNED >> 2)
-#endif
-			];
-	} rep;
+	struct tcp_reply_hdr *rep;
 	struct ip_reply_arg arg;
 
-	memset(&rep.th, 0, sizeof(struct tcphdr));
+	rep = per_cpu_ptr(tcp_rep_percpu, get_cpu());
+
+	memset(&rep->th, 0, sizeof(struct tcphdr));
 	memset(&arg, 0, sizeof(arg));
 
-	arg.iov[0].iov_base = (unsigned char *)&rep;
-	arg.iov[0].iov_len  = sizeof(rep.th);
+	arg.iov[0].iov_base = (unsigned char *)rep;
+	arg.iov[0].iov_len  = sizeof(rep->th);
 	if (ts) {
-		rep.opt[0] = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
+		rep->opt[0] = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
 				   (TCPOPT_TIMESTAMP << 8) |
 				   TCPOLEN_TIMESTAMP);
-		rep.opt[1] = htonl(tcp_time_stamp);
-		rep.opt[2] = htonl(ts);
+		rep->opt[1] = htonl(tcp_time_stamp);
+		rep->opt[2] = htonl(ts);
 		arg.iov[0].iov_len += TCPOLEN_TSTAMP_ALIGNED;
 	}
 
 	/* Swap the send and the receive. */
-	rep.th.dest    = th->source;
-	rep.th.source  = th->dest;
-	rep.th.doff    = arg.iov[0].iov_len / 4;
-	rep.th.seq     = htonl(seq);
-	rep.th.ack_seq = htonl(ack);
-	rep.th.ack     = 1;
-	rep.th.window  = htons(win);
+	rep->th.dest    = th->source;
+	rep->th.source  = th->dest;
+	rep->th.doff    = arg.iov[0].iov_len / 4;
+	rep->th.seq     = htonl(seq);
+	rep->th.ack_seq = htonl(ack);
+	rep->th.ack     = 1;
+	rep->th.window  = htons(win);
 
 #ifdef CONFIG_TCP_MD5SIG
 	if (key) {
 		int offset = (ts) ? 3 : 0;
 
-		rep.opt[offset++] = htonl((TCPOPT_NOP << 24) |
-					  (TCPOPT_NOP << 16) |
-					  (TCPOPT_MD5SIG << 8) |
-					  TCPOLEN_MD5SIG);
+		rep->opt[offset++] = htonl((TCPOPT_NOP << 24) |
+					   (TCPOPT_NOP << 16) |
+					   (TCPOPT_MD5SIG << 8) |
+					   TCPOLEN_MD5SIG);
 		arg.iov[0].iov_len += TCPOLEN_MD5SIG_ALIGNED;
-		rep.th.doff = arg.iov[0].iov_len/4;
+		rep->th.doff = arg.iov[0].iov_len/4;
 
-		tcp_v4_md5_hash_hdr((__u8 *) &rep.opt[offset],
+		tcp_v4_md5_hash_hdr((__u8 *) &rep->opt[offset],
 				    key, ip_hdr(skb)->saddr,
-				    ip_hdr(skb)->daddr, &rep.th);
+				    ip_hdr(skb)->daddr, &rep->th);
 	}
 #endif
 	arg.flags = reply_flags;