Message ID: 20110806121247.GC23937@htj.dyndns.org
State: RFC, archived
Delegated to: David Miller
On 08/06/2011 08:12 AM, Tejun Heo wrote:
> Hello, guys.
>
> So, here's transparent TCP connection hijacking (ie. checkpointing in
> one process and restoring in another) which adds only relatively small
> pieces to the kernel.  It's by no means complete but already works
> rather reliably in my test setup even with heavy delay induced with
> tc.
>
> I wrote a rather long README describing how it's working, what's
> missing which is appended at the end of this mail so if you're
> interested in the details please go ahead and read.

That's a little gross but quite cool.  I think you have an annoying
corner case, though:

> 2. Decide where to inject the foreign code and save the original code
>    with PTRACE_PEEKDATA.  Tracer can poke any mapped area regardless
>    of protection flags but it can't add execution permission to the
>    code, so it needs to choose memory area which already has X flag
>    set.  The example code uses the page the %rip is in.

If the process is executing from the vsyscall page, then you'll
probably fail.  (Admittedly, this is rather unlikely, given that the
vsyscalls are now exactly one instruction.)  Presumably you also fail
if executing from a read-only MAP_SHARED mapping.

Windows has a facility to more-or-less call mmap on behalf of another
process, and another one to directly inject a thread into a remote
process.  It's traditional to use them for this type of manipulation.
Perhaps Linux should get the same thing.  (Although you could
accomplish much the same thing if you could create a task with your mm
but the tracee's fs.)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello,

On Sat, Aug 06, 2011 at 08:45:28AM -0400, Andy Lutomirski wrote:
> > 2. Decide where to inject the foreign code and save the original code
> >    with PTRACE_PEEKDATA.  Tracer can poke any mapped area regardless
> >    of protection flags but it can't add execution permission to the
> >    code, so it needs to choose memory area which already has X flag
> >    set.  The example code uses the page the %rip is in.
>
> If the process is executing from the vsyscall page, then you'll
> probably fail.  (Admittedly, this is rather unlikely, given that the
> vsyscalls are now exactly one instruction.)  Presumably you also
> fail if executing from a read-only MAP_SHARED mapping.

Heh, yeah, I originally thought about scanning /proc/PID/maps to look
for the page to use but was lazy and just used %rip.  I think that
should work.  I'll note the problem in the README.

> Windows has a facility to more-or-less call mmap on behalf of
> another process, and another one to directly inject a thread into a
> remote process.  It's traditional to use them for this type of
> manipulation.  Perhaps Linux should get the same thing.  (Although
> you could accomplish much the same thing if you could create a task
> with your mm but the tracee's fs.)

Actually, the only thing we need on x86_64 is two bytes for the
syscall instruction because all params are passed through registers
anyway.  We can just set up the parameters for mmap, turn on single
step, and point %rip at a syscall in the vsyscall page.  So, either
way, I don't think this would be too difficult to solve.

Thanks.
On Sat, Aug 6, 2011 at 9:00 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Sat, Aug 06, 2011 at 08:45:28AM -0400, Andy Lutomirski wrote:
>> > 2. Decide where to inject the foreign code and save the original code
>> >    with PTRACE_PEEKDATA.  Tracer can poke any mapped area regardless
>> >    of protection flags but it can't add execution permission to the
>> >    code, so it needs to choose memory area which already has X flag
>> >    set.  The example code uses the page the %rip is in.
>>
>> If the process is executing from the vsyscall page, then you'll
>> probably fail.  (Admittedly, this is rather unlikely, given that the
>> vsyscalls are now exactly one instruction.)  Presumably you also
>> fail if executing from a read-only MAP_SHARED mapping.
>
> Heh, yeah, I originally thought about scanning /proc/PID/maps to look
> for the page to use but was lazy and just used %rip.  I think that
> should work.  I'll note the problem in README.
>
>> Windows has a facility to more-or-less call mmap on behalf of
>> another process, and another one to directly inject a thread into a
>> remote process.  It's traditional to use them for this type of
>> manipulation.  Perhaps Linux should get the same thing.  (Although
>> you could accomplish much the same thing if you could create a task
>> with your mm but the tracee's fs.)
>
> Actually, the only thing we need on x86_64 is two bytes for the
> syscall instruction because all params are passed through registers
> anyway.  We can just set up parameters for mmap, turn on single step,
> point %rip to syscall in the vsyscall page.  So, either way, I don't
> think this would be too difficult to solve.

Not any more -- that syscall instruction is gone as of 3.1.  You could
search through the vdso to find a syscall, but that seems fragile.

Why not just add a ptrace command to issue a syscall?
--Andy
Hello,

On Sat, Aug 06, 2011 at 09:15:45AM -0400, Andrew Lutomirski wrote:
> On Sat, Aug 6, 2011 at 9:00 AM, Tejun Heo <tj@kernel.org> wrote:
> > Actually, the only thing we need on x86_64 is two bytes for the
> > syscall instruction because all params are passed through registers
> > anyway.  We can just set up parameters for mmap, turn on single step,
> > point %rip to syscall in the vsyscall page.  So, either way, I don't
> > think this would be too difficult to solve.
>
> Not any more -- that syscall instruction is gone as of 3.1.  You could
> search through the vdso to find a syscall, but that seems fragile.
>
> Why not just add a ptrace command to issue a syscall?

Yeah, maybe -- if this thing proves to be useful enough and looking
for a page to poke under /proc becomes too cumbersome.  I'm not
against it but don't really see a strong need at this point either.

Thanks.
On Sat, Aug 06, 2011 at 03:00:37PM +0200, Tejun Heo wrote:
> Hello,
>
> On Sat, Aug 06, 2011 at 08:45:28AM -0400, Andy Lutomirski wrote:
> > > 2. Decide where to inject the foreign code and save the original code
> > >    with PTRACE_PEEKDATA.  Tracer can poke any mapped area regardless
> > >    of protection flags but it can't add execution permission to the
> > >    code, so it needs to choose memory area which already has X flag
> > >    set.  The example code uses the page the %rip is in.
> >
> > If the process is executing from the vsyscall page, then you'll
> > probably fail.  (Admittedly, this is rather unlikely, given that the
> > vsyscalls are now exactly one instruction.)  Presumably you also
> > fail if executing from a read-only MAP_SHARED mapping.
>
> Heh, yeah, I originally thought about scanning /proc/PID/maps to look
> for the page to use but was lazy and just used %rip.  I think that
> should work.  I'll note the problem in README.

Okay, updated README.

  http://code.google.com/p/ptrace-parasite/source/browse/README

Thanks.
On Sat, Aug 06, 2011 at 02:12:47PM +0200, Tejun Heo wrote:
> Hello, guys.
>
> So, here's transparent TCP connection hijacking (ie. checkpointing in
> one process and restoring in another) which adds only relatively small
> pieces to the kernel.  It's by no means complete but already works
> rather reliably in my test setup even with heavy delay induced with
> tc.

I saw the write-up on this on lwn.net, pretty creative by the way, and
it got me thinking about a different checkpoint/restart problem I've
been running into, specifically in hibernating to disk.  In the
hibernate case active TCP connections hang after resuming, while an
idle TCP connection will continue after the system is back up.

My observation is that the kernel checkpoints itself to memory,
re-enables devices, writes out that checkpoint image to storage, then
powers off.  The problem is that if TCP packets are received while
writing to storage, the kernel will continue to queue and ack those
TCP packets, but the running kernel and its network state are shortly
lost.  When the computer resumes, the TCP connection hangs for an
extended period of time: the resumed computer refuses to acknowledge
the data that was received after checkpointing, which the now-running
kernel knows nothing about, while the other computer tries in vain to
resend any data that hadn't yet been acknowledged (which always comes
after the data that was lost), until one of them eventually gives up.

I've been wondering if it was safe or possible to leave the network
interfaces down after the checkpoint, or what the right solution would
be.  I didn't think marking every TCP connection with a ZOMBIE_KERNEL
bit just after the kernel checkpoint (for the kernel is walking dead
and won't remember anything that happens), and then preventing any TCP
acks from being sent for those connections, would be the right
solution.  I've taken to unplugging the physical LAN cable,
hibernating to disk, and plugging it back in after the system is down,
to avoid the problem.  Any ideas?
(cc'ing Rafael and linux-pm)

On Sat, Oct 29, 2011 at 11:48:21PM -0500, David Fries wrote:
> I saw the write up on this on lwn.net, pretty creative by the way, and
> it got me thinking about a different checkpoint/restart problem I've
> been running into.  Specifically in hibernating to disk.  In the
> hibernate case active TCP connections hang after resuming, while an
> idle TCP connection will continue after the system is back up.  My
> observation is the kernel checkpoints itself to memory, enables
> devices, writes out that checkpoint image to storage, then powers off.
> The problem is if TCP packets are received while writing to storage,
> the kernel will continue to queue and ack those TCP packets, but the
> running kernel and it's network state is shortly lost.  When the
> computer resumes, those TCP byte sequences hang the TCP connection for
> an extended period of time while the resumed computer refuses to
> acknowledge the data that was received after checkpointing and the now
> running kernel knew nothing about, and the other computer tries in
> vain to resend any data that hadn't yet been acknowledged, which is
> always after the data that was lost, until one of them eventually
> gives up.
>
> I've been wondering if it was safe or possible to leave any network
> interfaces down after the checkpoint, or what the right solution would
> be.  I didn't think marking every TCP connection with a ZOMBIE_KERNEL
> bit just after the kernel checkpoint (for the kernel is walking dead
> and won't remember anything that happens), and then prevent any TCP
> acks from being sent for those connections would be the right
> solution.  I've taken to unplugging the physical lan cable,
> hibernating to disk, and plugging it back in after the system is down,
> to avoid the problem.  Any ideas?

Hmmm... sounds like taking down network interfaces before starting the
hibernation sequence should be enough, which shouldn't be too
difficult to implement from userland.  Rafael, what do you think?

Thanks.
On Sun, Oct 30, 2011 at 01:16:18PM -0700, Tejun Heo wrote:
> (cc'ing Rafael and linux-pm)
>
> On Sat, Oct 29, 2011 at 11:48:21PM -0500, David Fries wrote:
> > I saw the write up on this on lwn.net, pretty creative by the way, and
> > it got me thinking about a different checkpoint/restart problem I've
> > been running into.  Specifically in hibernating to disk.  In the
> > hibernate case active TCP connections hang after resuming, while an
> > idle TCP connection will continue after the system is back up.  My
> > observation is the kernel checkpoints itself to memory, enables
> > devices, writes out that checkpoint image to storage, then powers off.
> > The problem is if TCP packets are received while writing to storage,
> > the kernel will continue to queue and ack those TCP packets, but the
> > running kernel and it's network state is shortly lost.
> >
> > I've been wondering if it was safe or possible to leave any network
> > interfaces down after the checkpoint, or what the right solution would
> > be.  I've taken to unplugging the physical lan cable, hibernating to
> > disk, and plugging it back in after the system is down, to avoid the
> > problem.  Any ideas?
>
> Hmmm... sounds like taking down network interfaces before starting
> hibernation sequence should be enough, which shouldn't be too
> difficult to implement from userland.  Rafael, what do you think?

What I observe is the kernel prints out "Preallocating image memory",
then when the screen goes blank the network link light also goes out,
then the screen comes back on with "Compressing and saving" along with
the link light, until the image has been saved and the system shuts
down.  So the kernel is already bringing the network down; it just
needs to keep it there until the original checkpointed kernel is back
up.

Userspace bringing the network interfaces down is problematic.  As an
example, one of my systems is running hostapd as an access point and
bridging that to the wired ethernet; that's not a trivial setup to
bring up and take down (the Debian ifup can set it up, but I've not
yet figured out how to get ifdown to take everything down cleanly, and
I sometimes run hostapd manually when I'm troubleshooting).  Any
manually added routes would go away; good luck setting everything back
up the way it was before for all the different configurations out
there in userspace.  On top of those issues, programs would now see a
window when networking is down that they wouldn't have otherwise seen.
On Mon, Oct 31, 2011 at 5:16 AM, Tejun Heo <tj@kernel.org> wrote:
> (cc'ing Rafael and linux-pm)
>
> On Sat, Oct 29, 2011 at 11:48:21PM -0500, David Fries wrote:
>> I saw the write up on this on lwn.net, pretty creative by the way, and
>> it got me thinking about a different checkpoint/restart problem I've
>> been running into.  Specifically in hibernating to disk.  In the
>> hibernate case active TCP connections hang after resuming, while an
>> idle TCP connection will continue after the system is back up.  My
>> observation is the kernel checkpoints itself to memory, enables
>> devices, writes out that checkpoint image to storage, then powers off.
>> The problem is if TCP packets are received while writing to storage,
>> the kernel will continue to queue and ack those TCP packets, but the
>> running kernel and it's network state is shortly lost.
>>
>> I've been wondering if it was safe or possible to leave any network
>> interfaces down after the checkpoint, or what the right solution would
>> be.  I've taken to unplugging the physical lan cable, hibernating to
>> disk, and plugging it back in after the system is down, to avoid the
>> problem.  Any ideas?
>
> Hmmm... sounds like taking down network interfaces before starting
> hibernation sequence should be enough, which shouldn't be too
> difficult to implement from userland.  Rafael, what do you think?
>
> Thanks.

Um... it seems that the "thaw" callbacks of network interfaces or TCP
should do something about this.

Probably the "thaw" callbacks should make sure that the TCP
connections are closed?

Cheers,
MyungJoo
Hello,

On Wed, Nov 02, 2011 at 06:44:31PM +0900, MyungJoo Ham wrote:
> > Hmmm... sounds like taking down network interfaces before starting
> > hibernation sequence should be enough, which shouldn't be too
> > difficult to implement from userland.  Rafael, what do you think?
> >
> > Thanks.
>
> Um... it seems that the "thaw" callbacks of network interfaces or TCP
> should do something on this.
>
> Probably, the "thaw" callbacks should make sure that the TCP
> connections are closed?

I don't think it's a good idea to diddle with TCP connections from
that layer.  From what I understand, it seems all we need is plugging
tx/rx while preparing for hibernation.  That shouldn't be too
difficult.

Thanks.
On Wed 2011-11-02 08:10:39, Tejun Heo wrote:
> Hello,
>
> On Wed, Nov 02, 2011 at 06:44:31PM +0900, MyungJoo Ham wrote:
> > > Hmmm... sounds like taking down network interfaces before starting
> > > hibernation sequence should be enough, which shouldn't be too
> > > difficult to implement from userland.  Rafael, what do you think?
> > >
> > > Thanks.
> >
> > Um... it seems that the "thaw" callbacks of network interfaces or TCP
> > should do something on this.
> >
> > Probably, the "thaw" callbacks should make sure that the TCP
> > connections are closed?
>
> I don't think it's a good idea to diddle with TCP connections from
> that layer.  From what I understand, it seem all we need is plugging
> tx/rx while preparing for hibernation.  That shouldn't be too
> difficult.

Yes, that should be done.  If someone has a uswsusp setup where they
talk over the network, it might break them, but hopefully no one is
doing that.  Also hopefully no one does hibernation on /dev/nbd.

								Pavel
diff --git a/include/linux/sockios.h b/include/linux/sockios.h
index 7997a50..f5c3e41 100644
--- a/include/linux/sockios.h
+++ b/include/linux/sockios.h
@@ -127,6 +127,12 @@
 /* hardware time stamping: parameters in linux/net_tstamp.h */
 #define SIOCSHWTSTAMP	0x89b0
 
+#define SIOCGINSEQ	0x89b1		/* get copied_seq */
+#define SIOCGOUTSEQS	0x89b2		/* get seqs for pending tx pkts */
+#define SIOCSOUTSEQ	0x89b3		/* set write_seq */
+#define SIOCPEEKOUTQ	0x89b4		/* peek output queue */
+#define SIOCFORCEOUTBD	0x89b5		/* force output packet boundary */
+
 /* Device private ioctl calls */
 
 /*
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 531ede8..c0945fe 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -365,6 +365,8 @@ struct tcp_sock {
 	u32	snd_up;		/* Urgent pointer */
 
 	u8	keepalive_probes; /* num of allowed keep alive probes */
+	u8	wseq_set : 1;	/* Write sequence set via setsockopt */
+	u8	force_outbd : 1;/* force packet boundary on next send */
 
 	/*
	 * Options received (usually on last packet, some only on SYN packets).
	 */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 46febca..3389827 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -464,12 +464,118 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 }
 EXPORT_SYMBOL(tcp_poll);
 
+static int tcp_get_out_seqs(struct sock *sk, u32 __user *p, int size)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
+	int pos = 0, cnt = size / sizeof(u32);
+
+	if (pos < cnt && put_user(tp->write_seq, &p[pos++]))
+		return -EFAULT;
+
+	skb_queue_reverse_walk(&sk->sk_write_queue, skb) {
+		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+		if (pos < cnt && put_user(tcb->seq, &p[pos++]))
+			return -EFAULT;
+	}
+	return pos * sizeof(u32);
+}
+
+static int tcp_peek_outq(struct sock *sk, void __user *arg, int size)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct iovec iov = { .iov_base = arg, .iov_len = size };
+	struct sk_buff *skb;
+	int copied = 0, err = 0;
+	int outq, skip;
+
+	lock_sock(sk);
+
+	/* XXX: why doesn't SIOCOUTQ[NSD] account for queued fin? */
+	outq = tp->write_seq - tp->snd_una;
+	skb = skb_peek_tail(&sk->sk_write_queue);
+	if (outq && skb)
+		outq -= tcp_hdr(skb)->fin;
+
+	skip = outq - min(size, outq);
+
+	skb_queue_walk(&sk->sk_write_queue, skb) {
+		int off = 0, todo;
+
+		if (skip) {
+			off = min_t(int, skip, skb->len);
+			skip -= off;
+		}
+
+		if (!(todo = skb->len - off))
+			continue;
+
+		if (WARN_ON_ONCE(iov.iov_len < todo)) {
+			err = -EINVAL;
+			break;
+		}
+
+		err = skb_copy_datagram_iovec(skb, off, &iov, todo);
+		if (err)
+			break;
+		copied += todo;
+	}
+
+	release_sock(sk);
+
+	return err ?: copied;
+}
+
 int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int answ;
 
 	switch (cmd) {
+	case SIOCGOUTSEQS: {
+		s32 size;
+
+		if (get_user(size, (s32 __user *)arg))
+			return -EFAULT;
+		if (size < 0)
+			return -EINVAL;
+		return tcp_get_out_seqs(sk, (u32 __user *)arg, size);
+	}
+	case SIOCSOUTSEQ: {
+		u32 seq;
+
+		if (get_user(seq, (u32 __user *)arg))
+			return -EFAULT;
+
+		lock_sock(sk);
+		answ = -EISCONN;
+		if ((sk->sk_socket->state == SS_UNCONNECTED &&
+		     sk->sk_state == TCP_CLOSE) || sk->sk_state == TCP_LISTEN) {
+			tp->write_seq = seq;
+			tp->wseq_set = true;
+			answ = 0;
+		}
+		release_sock(sk);
+		return answ;
+	}
+	case SIOCPEEKOUTQ: {
+		u32 size;
+
+		if (get_user(size, (u32 __user *)arg))
+			return -EFAULT;
+		if ((int)size < size)
+			return -EINVAL;
+		return tcp_peek_outq(sk, (void __user *)arg, size);
+	}
+	case SIOCFORCEOUTBD:
+		lock_sock(sk);
+		tp->force_outbd = true;
+		release_sock(sk);
+		return 0;
+	}
+
+	switch (cmd) {
 	case SIOCINQ:
 		if (sk->sk_state == TCP_LISTEN)
 			return -EINVAL;
@@ -514,6 +620,9 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 		else
 			answ = tp->write_seq - tp->snd_nxt;
 		break;
+	case SIOCGINSEQ:
+		answ = tp->copied_seq;
+		break;
 	default:
 		return -ENOIOCTLCMD;
 	}
@@ -965,7 +1074,7 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			copy = max - skb->len;
 		}
 
-		if (copy <= 0) {
+		if (copy <= 0 || unlikely(tp->force_outbd)) {
 new_segment:
 			/* Allocate new segment. If the interface is SG,
 			 * allocate skb fitting to single page.
@@ -979,6 +1088,8 @@ new_segment:
 			if (!skb)
 				goto wait_for_memory;
 
+			tp->force_outbd = false;
+
 			/*
 			 * Check whether we can use HW checksum.
 			 */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 955b8e6..579234c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -201,7 +201,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		/* Reset inherited state */
 		tp->rx_opt.ts_recent = 0;
 		tp->rx_opt.ts_recent_stamp = 0;
-		tp->write_seq = 0;
+		if (!tp->wseq_set)
+			tp->write_seq = 0;
 	}
 
 	if (tcp_death_row.sysctl_tw_recycle &&
@@ -252,12 +253,12 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	sk->sk_gso_type = SKB_GSO_TCPV4;
 	sk_setup_caps(sk, &rt->dst);
 
-	if (!tp->write_seq)
+	if (!tp->write_seq && !tp->wseq_set)
 		tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr,
 							   inet->inet_daddr,
 							   inet->inet_sport,
 							   usin->sin_port);
-
+	tp->wseq_set = false;
 	inet->inet_id = tp->write_seq ^ jiffies;
 
 	err = tcp_connect(sk);
@@ -1252,7 +1253,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		if (net_ratelimit())
 			syn_flood_warning(skb);
 #ifdef CONFIG_SYN_COOKIES
-		if (sysctl_tcp_syncookies) {
+		if (sysctl_tcp_syncookies && !tp->wseq_set) {
 			want_cookie = 1;
 		} else
 #endif
@@ -1334,7 +1335,10 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (!want_cookie || tmp_opt.tstamp_ok)
 		TCP_ECN_create_request(req, tcp_hdr(skb));
 
-	if (want_cookie) {
+	if (unlikely(tp->wseq_set)) {
+		isn = tp->write_seq;
+		tp->wseq_set = false;
+	} else if (want_cookie) {
 		isn = cookie_v4_init_sequence(sk, skb, &req->mss);
 		req->cookie_ts = tmp_opt.tstamp_ok;
 	} else if (!isn) {
@@ -1526,7 +1530,7 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
 	}
 
 #ifdef CONFIG_SYN_COOKIES
-	if (!th->syn)
+	if (!th->syn && !tcp_sk(sk)->wseq_set)
 		sk = cookie_v4_check(sk, skb, &(IPCB(skb)->opt));
 #endif
 	return sk;