
[2/2] c/r: Add AF_INET support (v3)

Message ID 1246994776-1882-3-git-send-email-danms@us.ibm.com
State RFC, archived
Delegated to: David Miller

Commit Message

Dan Smith July 7, 2009, 7:26 p.m. UTC
This patch adds AF_INET c/r support based on the framework established in
my AF_UNIX patch.  I've tested it by checkpointing a single app with a
pair of sockets connected over loopback.

A couple points about the operation:

 1. In order to properly hook up the established sockets with the matching
    listening parent socket, I added a new list to the ckpt_ctx and run the
    parent attachment in the deferqueue at the end of the restart process.
 2. I don't do anything to redirect or freeze traffic flowing to or from the
    remote system (to prevent a RST from breaking things).  I expect that
    userspace will bring down a veth device or freeze traffic to the remote
    system to handle this case.

Changes in v3:
 - Add AF_INET6 support

Changes in v2:
 - Check for data in the TCP out-of-order queue and fail if present
 - Fix a logic issue in sock_add_parent()
 - Add comment about holding a reference to sock for parent list
 - Write error messages to checkpoint stream where appropriate
 - Fix up checking of some return values in restart phase
 - Fix up restart logic to restore socket info for all states
 - Avoid running sk_proto->hash() for non-TCP sockets
 - Fix calling bind() for unconnected (i.e. DGRAM) sockets
 - Change 'in' to 'inet' in structure and function names

Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Dan Smith <danms@us.ibm.com>
---
 checkpoint/sys.c                 |    2 +
 include/linux/checkpoint_hdr.h   |    1 +
 include/linux/checkpoint_types.h |    2 +
 include/linux/socket.h           |  102 ++++++++++++
 net/checkpoint.c                 |  337 +++++++++++++++++++++++++++++++++++++-
 5 files changed, 439 insertions(+), 5 deletions(-)

Comments

Brian Haley July 8, 2009, 1:23 a.m. UTC | #1
Dan Smith wrote:
> This patch adds AF_INET c/r support based on the framework established in
> my AF_UNIX patch.  I've tested it by checkpointing a single app with a
> pair of sockets connected over loopback.
 
> +struct ckpt_hdr_socket_inet {
> +	struct ckpt_hdr h;
> +
> +	__u32 daddr;
> +	__u32 rcv_saddr;
> +	__u32 saddr;
> +	__u16 dport;
> +	__u16 num;
> +	__u16 sport;
> +	__s16 uc_ttl;
> +	__u16 cmsg_flags;
> +	__u16 __pad;
> +
> +	struct {
> +		__u64 timeout;
> +		__u32 ato;
> +		__u32 lrcvtime;
> +		__u16 last_seg_size;
> +		__u16 rcv_mss;
> +		__u8 pending;
> +		__u8 quick;
> +		__u8 pingpong;
> +		__u8 blocked;
> +	} icsk_ack __attribute__ ((aligned(8)));
> +
> +	/* FIXME: Skipped opt, tos, multicast, cork settings */
> +
<snip>
> +
> +	struct {
> +		char saddr[16];
> +		char rcv_saddr[16];
> +		char daddr[16];
> +	} inet6 __attribute__ ((aligned(8)));

These should be 'struct in6_addr'.

And just like in your IPv4 section you need a FIXME here for the things
you skipped :)

> +static int sock_inet_cptrst(struct ckpt_ctx *ctx,
> +			    struct sock *sock,
> +			    struct ckpt_hdr_socket_inet *hh,
> +			  int op)
> +{
<snip>
> +
> +	if (sock->sk_family == AF_INET6) {
> +		struct ipv6_pinfo *inet6 = inet6_sk(sock);
> +		unsigned alen = sizeof(struct in6_addr);
> +		if (op == CKPT_CPT) {
> +			memcpy(hh->inet6.saddr, &inet6->saddr, alen);
> +			memcpy(hh->inet6.rcv_saddr, &inet6->rcv_saddr, alen);
> +			memcpy(hh->inet6.daddr, &inet6->daddr, alen);
> +		} else {
> +			memcpy(&inet6->saddr, hh->inet6.saddr, alen);
> +			memcpy(&inet6->rcv_saddr, hh->inet6.rcv_saddr, alen);
> +			memcpy(&inet6->daddr, hh->inet6.daddr, alen);
> +		}
> +	}

There's an inline called ipv6_addr_copy() that will do the same thing,
then you could drop the alen argument.
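
Something like this, assuming the inet6 fields above are changed to
struct in6_addr (untested sketch):

	if (op == CKPT_CPT) {
		ipv6_addr_copy(&hh->inet6.saddr, &inet6->saddr);
		ipv6_addr_copy(&hh->inet6.rcv_saddr, &inet6->rcv_saddr);
		ipv6_addr_copy(&hh->inet6.daddr, &inet6->daddr);
	} else {
		ipv6_addr_copy(&inet6->saddr, &hh->inet6.saddr);
		ipv6_addr_copy(&inet6->rcv_saddr, &hh->inet6.rcv_saddr);
		ipv6_addr_copy(&inet6->daddr, &hh->inet6.daddr);
	}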

-Brian
Dan Smith July 8, 2009, 1:31 a.m. UTC | #2
Hi Brian,

>> +	struct {
>> +		char saddr[16];
>> +		char rcv_saddr[16];
>> +		char daddr[16];
>> +	} inet6 __attribute__ ((aligned(8)));

BH> These should be 'struct in6_addr'.

Okay.

BH> And just like in your IPv4 section you need a FIXME here for the
BH> things you skipped :)

Heh, okay :)

BH> There's an inline called ipv6_addr_copy() that will do the same
BH> thing, then you could drop the alen argument.

Okay, cool.

Thanks!
Oren Laadan July 8, 2009, 1:58 p.m. UTC | #3
Hi,

A few quick ones ...

Dan Smith wrote:
> This patch adds AF_INET c/r support based on the framework established in
> my AF_UNIX patch.  I've tested it by checkpointing a single app with a
> pair of sockets connected over loopback.

You can easily test it against a network peer, without too much
magic in userspace:
- app connects (or accepts) a connection
- use iptables to block traffic to/from peer
- app sends data, and has non-received data
- checkpoint and kill the applications
- restart the applications
- use iptables to unblock traffic to/from peer

> 
> A couple points about the operation:
> 
>  1. In order to properly hook up the established sockets with the matching
>     listening parent socket, I added a new list to the ckpt_ctx and run the
>     parent attachment in the deferqueue at the end of the restart process.
>  2. I don't do anything to redirect or freeze traffic flowing to or from the
>     remote system (to prevent a RST from breaking things).  I expect that
>     userspace will bring down a veth device or freeze traffic to the remote
>     system to handle this case.
> 
> Changes in v3:
>  - Add AF_INET6 support
> 
> Changes in v2:
>  - Check for data in the TCP out-of-order queue and fail if present
>  - Fix a logic issue in sock_add_parent()
>  - Add comment about holding a reference to sock for parent list
>  - Write error messages to checkpoint stream where appropriate
>  - Fix up checking of some return values in restart phase
>  - Fix up restart logic to restore socket info for all states
>  - Avoid running sk_proto->hash() for non-TCP sockets
>  - Fix calling bind() for unconnected (i.e. DGRAM) sockets
>  - Change 'in' to 'inet' in structure and function names
> 

The live c/r of open connections is very useful for process
migration. For other scenarios (e.g. fault recovery in HPC)
it does not matter that much.

The user needs a way to ask restart to only restore listening
sockets and make previous connections appear closed.

One way to do it is leave checkpoint as is and have userspace
modify the checkpoint image prior to restart. But it can be
beneficial to have a checkpoint flag (or cradvise ?) to just
skip them at checkpoint time, and reduce c/r time and size.

[...]

>  
> +struct ckpt_parent_sock {
> +	struct sock *sock;
> +	__u32 oref;

Nit: objref ?

> +	struct list_head list;
> +};
> +
> +static int sock_add_parent(struct ckpt_ctx *ctx, struct sock *sock)
> +{
> +	struct ckpt_parent_sock *parent;
> +	__u32 objref;
> +	int new;
> +
> +	objref = ckpt_obj_lookup_add(ctx, sock, CKPT_OBJ_SOCK, &new);
> +	if (objref < 0)
> +		return objref;
> +	else if (!new)
> +		return 0;
> +
> +	parent = kmalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent)
> +		return -ENOMEM;
> +
> +	/* The deferqueue is processed before the objhash is free()'d,
> +	 * thus the objhash holds a reference to sock for us
> +	 */
> +	parent->sock = sock;
> +	parent->oref = objref;
> +	INIT_LIST_HEAD(&parent->list);

list_add() below makes this moot.

> +
> +	list_add(&parent->list, &ctx->listen_sockets);
> +
> +	return 0;
> +}
> +
> +static struct sock *sock_get_parent(struct ckpt_ctx *ctx, struct sock *sock)
			   ^^^^^
Nit: usually "get" implies refcnt, maybe s/get/find/ ?

> +{
> +	struct ckpt_parent_sock *parent;
> +	struct inet_sock *c = inet_sk(sock);
> +
> +	list_for_each_entry(parent, &ctx->listen_sockets, list) {
> +		struct inet_sock *p = inet_sk(parent->sock);
> +
> +		if (c->sport == p->sport)
> +			return parent->sock;
> +	}
> +
> +	return NULL;
> +}
> +
>  static inline int sock_unix_need_cwd(struct sockaddr_un *a)
>  {
>  	return (a->sun_path[0] != '/');
> @@ -55,17 +108,23 @@ static int sock_copy_buffers(struct sk_buff_head *from, struct sk_buff_head *to)
>  }
>  
>  static int __sock_write_buffers(struct ckpt_ctx *ctx,
> +				uint16_t family,
>  				struct sk_buff_head *queue)
>  {
>  	struct sk_buff *skb;
>  	int ret = 0;
>  
>  	skb_queue_walk(queue, skb) {
> -		if (UNIXCB(skb).fp) {
> +		if ((family == AF_UNIX) && UNIXCB(skb).fp) {
>  			ckpt_write_err(ctx, "fd-passing is not supported");
>  			return -EBUSY;
>  		}
>  
> +		if (skb_shinfo(skb)->nr_frags) {
> +			ckpt_write_err(ctx, "socket has fragments in-flight");
> +			return -EBUSY;
> +		}
> +
>  		ret = ckpt_write_obj_type(ctx, skb->data, skb->len,
>  					  CKPT_HDR_SOCKET_BUFFER);
>  		if (ret)
> @@ -75,7 +134,9 @@ static int __sock_write_buffers(struct ckpt_ctx *ctx,
>  	return 0;
>  }
>  
> -static int sock_write_buffers(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
> +static int sock_write_buffers(struct ckpt_ctx *ctx,
> +			      uint16_t family,
> +			      struct sk_buff_head *queue)
>  {
>  	struct ckpt_hdr_socket_buffer *h;
>  	struct sk_buff_head tmpq;
> @@ -95,7 +156,7 @@ static int sock_write_buffers(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
>  
>  	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
>  	if (!ret)
> -		ret = __sock_write_buffers(ctx, &tmpq);
> +		ret = __sock_write_buffers(ctx, family, &tmpq);
>  
>   out:
>  	ckpt_hdr_put(ctx, h);
> @@ -309,6 +370,166 @@ static int sock_cptrst(struct ckpt_ctx *ctx,
>  		return 0;
>  }
>  
> +static int sock_inet_tcp_cptrst(struct ckpt_ctx *ctx,
> +				struct tcp_sock *sk,
> +				struct ckpt_hdr_socket_inet *hh,
> +				int op)
> +{

Are you certain that none of the values below needs to be
sanitized before being put into the kernel?

IOW, can bad/malicious data cause data corruption, a crash,
or simply really bad behavior of TCP, a DoS, etc?
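
Even a minimal restart-side sanity pass would help, e.g. (hypothetical
sketch only, using the sequence-number helpers from net/tcp.h; clearly
not exhaustive):

	if (op == CKPT_RST) {
		/* reject an image whose sequence numbers are
		 * internally inconsistent
		 */
		if (before(hh->tcp.snd_nxt, hh->tcp.snd_una))
			return -EINVAL;	/* nothing can be in flight behind snd_una */
		if (after(hh->tcp.copied_seq, hh->tcp.rcv_nxt))
			return -EINVAL;	/* can't have copied data never received */
	}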

> +	CKPT_COPY(op, hh->tcp.rcv_nxt, sk->rcv_nxt);
> +	CKPT_COPY(op, hh->tcp.copied_seq, sk->copied_seq);
> +	CKPT_COPY(op, hh->tcp.rcv_wup, sk->rcv_wup);
> +	CKPT_COPY(op, hh->tcp.snd_nxt, sk->snd_nxt);
> +	CKPT_COPY(op, hh->tcp.snd_una, sk->snd_una);
> +	CKPT_COPY(op, hh->tcp.snd_sml, sk->snd_sml);
> +	CKPT_COPY(op, hh->tcp.rcv_tstamp, sk->rcv_tstamp);
> +	CKPT_COPY(op, hh->tcp.lsndtime, sk->lsndtime);
> +
> +	CKPT_COPY(op, hh->tcp.snd_wl1, sk->snd_wl1);
> +	CKPT_COPY(op, hh->tcp.snd_wnd, sk->snd_wnd);
> +	CKPT_COPY(op, hh->tcp.max_window, sk->max_window);
> +	CKPT_COPY(op, hh->tcp.mss_cache, sk->mss_cache);
> +	CKPT_COPY(op, hh->tcp.window_clamp, sk->window_clamp);
> +	CKPT_COPY(op, hh->tcp.rcv_ssthresh, sk->rcv_ssthresh);
> +	CKPT_COPY(op, hh->tcp.frto_highmark, sk->frto_highmark);
> +	CKPT_COPY(op, hh->tcp.advmss, sk->advmss);
> +	CKPT_COPY(op, hh->tcp.frto_counter, sk->frto_counter);
> +	CKPT_COPY(op, hh->tcp.nonagle, sk->nonagle);
> +
> +	CKPT_COPY(op, hh->tcp.srtt, sk->srtt);
> +	CKPT_COPY(op, hh->tcp.mdev, sk->mdev);
> +	CKPT_COPY(op, hh->tcp.mdev_max, sk->mdev_max);
> +	CKPT_COPY(op, hh->tcp.rttvar, sk->rttvar);
> +	CKPT_COPY(op, hh->tcp.rtt_seq, sk->rtt_seq);
> +
> +	CKPT_COPY(op, hh->tcp.packets_out, sk->packets_out);
> +	CKPT_COPY(op, hh->tcp.retrans_out, sk->retrans_out);
> +
> +	CKPT_COPY(op, hh->tcp.urg_data, sk->urg_data);
> +	CKPT_COPY(op, hh->tcp.ecn_flags, sk->ecn_flags);
> +	CKPT_COPY(op, hh->tcp.reordering, sk->reordering);
> +	CKPT_COPY(op, hh->tcp.snd_up, sk->snd_up);
> +
> +	CKPT_COPY(op, hh->tcp.keepalive_probes, sk->keepalive_probes);
> +
> +	CKPT_COPY(op, hh->tcp.rcv_wnd, sk->rcv_wnd);
> +	CKPT_COPY(op, hh->tcp.write_seq, sk->write_seq);
> +	CKPT_COPY(op, hh->tcp.pushed_seq, sk->pushed_seq);
> +	CKPT_COPY(op, hh->tcp.lost_out, sk->lost_out);
> +	CKPT_COPY(op, hh->tcp.sacked_out, sk->sacked_out);
> +	CKPT_COPY(op, hh->tcp.fackets_out, sk->fackets_out);
> +	CKPT_COPY(op, hh->tcp.tso_deferred, sk->tso_deferred);
> +	CKPT_COPY(op, hh->tcp.bytes_acked, sk->bytes_acked);
> +
> +	CKPT_COPY(op, hh->tcp.lost_cnt_hint, sk->lost_cnt_hint);
> +	CKPT_COPY(op, hh->tcp.retransmit_high, sk->retransmit_high);
> +
> +	CKPT_COPY(op, hh->tcp.lost_retrans_low, sk->lost_retrans_low);
> +
> +	CKPT_COPY(op, hh->tcp.prior_ssthresh, sk->prior_ssthresh);
> +	CKPT_COPY(op, hh->tcp.high_seq, sk->high_seq);
> +
> +	CKPT_COPY(op, hh->tcp.retrans_stamp, sk->retrans_stamp);
> +	CKPT_COPY(op, hh->tcp.undo_marker, sk->undo_marker);
> +	CKPT_COPY(op, hh->tcp.undo_retrans, sk->undo_retrans);
> +	CKPT_COPY(op, hh->tcp.total_retrans, sk->total_retrans);
> +
> +	CKPT_COPY(op, hh->tcp.urg_seq, sk->urg_seq);
> +	CKPT_COPY(op, hh->tcp.keepalive_time, sk->keepalive_time);
> +	CKPT_COPY(op, hh->tcp.keepalive_intvl, sk->keepalive_intvl);
> +
> +	CKPT_COPY(op, hh->tcp.last_synq_overflow, sk->last_synq_overflow);
> +
> +	return 0;
> +}
> +
> +static int sock_inet_cptrst(struct ckpt_ctx *ctx,
> +			    struct sock *sock,
> +			    struct ckpt_hdr_socket_inet *hh,
> +			  int op)
> +{
> +	struct inet_sock *sk = inet_sk(sock);
> +	struct inet_connection_sock *icsk = inet_csk(sock);
> +	int ret;
> +

Ditto here.

Also, as TCP uses timestamps, we probably need to add a way for TCP
to bias the local time source relative to the original clock - perhaps
a delta_time field in the protocol control block.

At restart, compute the delta between the original time (already
available in @ctx) and the current time, then adjust the delta_time and
the time-related fields in the protocol control block as needed.
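
Something along these lines in sock_inet_tcp_cptrst(), perhaps (sketch;
hh->tcp.ckpt_tstamp is a hypothetical new field holding the clock value
recorded at checkpoint time):

	if (op == CKPT_RST) {
		/* shift jiffies-based stamps forward by the time spent
		 * between checkpoint and restart
		 */
		u32 delta = tcp_time_stamp - hh->tcp.ckpt_tstamp;

		sk->rcv_tstamp += delta;
		sk->lsndtime += delta;
		if (sk->retrans_stamp)
			sk->retrans_stamp += delta;
	}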

> +	CKPT_COPY(op, hh->daddr, sk->daddr);
> +	CKPT_COPY(op, hh->rcv_saddr, sk->rcv_saddr);
> +	CKPT_COPY(op, hh->dport, sk->dport);
> +	CKPT_COPY(op, hh->num, sk->num);
> +	CKPT_COPY(op, hh->saddr, sk->saddr);
> +	CKPT_COPY(op, hh->sport, sk->sport);
> +	CKPT_COPY(op, hh->uc_ttl, sk->uc_ttl);
> +	CKPT_COPY(op, hh->cmsg_flags, sk->cmsg_flags);
> +
> +	CKPT_COPY(op, hh->icsk_ack.pending, icsk->icsk_ack.pending);
> +	CKPT_COPY(op, hh->icsk_ack.quick, icsk->icsk_ack.quick);
> +	CKPT_COPY(op, hh->icsk_ack.pingpong, icsk->icsk_ack.pingpong);
> +	CKPT_COPY(op, hh->icsk_ack.blocked, icsk->icsk_ack.blocked);
> +	CKPT_COPY(op, hh->icsk_ack.ato, icsk->icsk_ack.ato);
> +	CKPT_COPY(op, hh->icsk_ack.timeout, icsk->icsk_ack.timeout);
> +	CKPT_COPY(op, hh->icsk_ack.lrcvtime, icsk->icsk_ack.lrcvtime);
> +	CKPT_COPY(op,
> +		hh->icsk_ack.last_seg_size, icsk->icsk_ack.last_seg_size);
> +	CKPT_COPY(op, hh->icsk_ack.rcv_mss, icsk->icsk_ack.rcv_mss);
> +
> +	if (sock->sk_protocol == IPPROTO_TCP)
> +		ret = sock_inet_tcp_cptrst(ctx, tcp_sk(sock), hh, op);
> +	else if (sock->sk_protocol == IPPROTO_UDP)
> +		ret = 0;
> +	else {
> +		ckpt_write_err(ctx, "unknown socket protocol %d",
> +			       sock->sk_protocol);
		^^^^^^^^^^^^^^^
May only be called when @op is checkpoint ...
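
i.e. something like (sketch only):

	else {
		if (op == CKPT_CPT)
			ckpt_write_err(ctx, "unknown socket protocol %d",
				       sock->sk_protocol);
		ret = -EINVAL;
	}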


> +		ret = -EINVAL;
> +	}
> +
> +	if (sock->sk_family == AF_INET6) {
> +		struct ipv6_pinfo *inet6 = inet6_sk(sock);
> +		unsigned alen = sizeof(struct in6_addr);
> +		if (op == CKPT_CPT) {
> +			memcpy(hh->inet6.saddr, &inet6->saddr, alen);
> +			memcpy(hh->inet6.rcv_saddr, &inet6->rcv_saddr, alen);
> +			memcpy(hh->inet6.daddr, &inet6->daddr, alen);
> +		} else {
> +			memcpy(&inet6->saddr, hh->inet6.saddr, alen);
> +			memcpy(&inet6->rcv_saddr, hh->inet6.rcv_saddr, alen);
> +			memcpy(&inet6->daddr, hh->inet6.daddr, alen);
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +static int sock_inet_checkpoint(struct ckpt_ctx *ctx,
> +				struct sock *sock,
> +				struct ckpt_hdr_socket *h)
> +{
> +	int ret = -EINVAL;
> +	struct ckpt_hdr_socket_inet *in;
> +
> +	if (sock->sk_protocol == IPPROTO_TCP) {
> +		struct tcp_sock *tsock = tcp_sk(sock);
> +		if (!skb_queue_empty(&tsock->out_of_order_queue)) {
> +			ckpt_write_err(ctx, "TCP socket has out-of-order data");
> +			return -EBUSY;
> +		}
> +	}
> +
> +	in = ckpt_hdr_get_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET);
> +	if (!in)
> +		goto out;
> +
> +	ret = sock_inet_cptrst(ctx, sock, in, CKPT_CPT);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) in);
> + out:
> +	return ret;
> +}
> +
>  int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
>  {
>  	struct socket *socket = file->private_data;
> @@ -349,6 +570,11 @@ int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
>  		ret = sock_unix_checkpoint(ctx, sock, h);
>  		if (ret)
>  			goto out;
> +	} else if ((sock->sk_family == AF_INET) ||
> +		   (sock->sk_family == AF_INET6)) {
> +		ret = sock_inet_checkpoint(ctx, sock, h);
> +		if (ret)
> +			goto out;
>  	} else {
>  		ckpt_write_err(ctx, "unsupported socket family %i",
>  			       sock->sk_family);
> @@ -357,11 +583,13 @@ int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
>  	}

What about protocol states that aren't established, listen and
closed?  syn_sent, close_wait, etc ...

Also there may be sockets in close_wait that still have data in
the send buffer that is (re)transmitted, but do not belong to any
particular process. Hmm... this is more related to migration of
a network namespace - but worth a FIXME/TODO comment somewhere.

[unrelated] When saving skb's, is it sufficient to only save their
contents for INET?  I see that the tcp code uses skb->cb[] (via the
TCP_SKB_CB macro). Also there is skb->peeked and possibly other
fields?

>  
>  	if (sock->sk_state != TCP_LISTEN) {
> -		ret = sock_write_buffers(ctx, &sock->sk_receive_queue);
> +		uint16_t family = sock->sk_family;
> +
> +		ret = sock_write_buffers(ctx, family, &sock->sk_receive_queue);
>  		if (ret)
>  			goto out;
>  
> -		ret = sock_write_buffers(ctx, &sock->sk_write_queue);
> +		ret = sock_write_buffers(ctx, family, &sock->sk_write_queue);
>  		if (ret)
>  			goto out;
>  	}
> @@ -677,6 +905,101 @@ static int sock_unix_restart(struct ckpt_ctx *ctx,
>  	return ret;
>  }
>  
> +struct dq_sock {
> +	struct sock *sock;
> +	struct ckpt_ctx *ctx;
> +};
> +
> +static int __sock_hash_parent(void *data)
> +{
> +	struct dq_sock *dq = (struct dq_sock *)data;
> +	struct sock *parent;
> +
> +	dq->sock->sk_prot->hash(dq->sock);
> +
> +	parent = sock_get_parent(dq->ctx, dq->sock);
> +	if (parent) {
> +		inet_sk(dq->sock)->num = ntohs(inet_sk(dq->sock)->sport);
> +		local_bh_disable();
> +		__inet_inherit_port(parent, dq->sock);
> +		local_bh_enable();
> +	} else {
> +		inet_sk(dq->sock)->num = 0;
> +		inet_hash_connect(&tcp_death_row, dq->sock);
> +		inet_sk(dq->sock)->num = ntohs(inet_sk(dq->sock)->sport);
> +	}

Where do the retransmit timers etc. get reconstructed/restored?
(or am I jumping too far ahead...)

> +
> +	return 0;
> +}
> +
> +static int sock_defer_hash(struct ckpt_ctx *ctx, struct sock *sock)
> +{
> +	struct dq_sock dq;
> +
> +	dq.sock = sock;
> +	dq.ctx = ctx;
> +
> +	return deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
> +			      __sock_hash_parent, __sock_hash_parent);
> +}
> +
> +static int sock_inet_restart(struct ckpt_ctx *ctx,
> +			     struct ckpt_hdr_socket *h,
> +			     struct socket *socket)
> +{
> +	int ret;
> +	struct ckpt_hdr_socket_inet *in;
> +	struct sockaddr_in *l = (struct sockaddr_in *)&h->laddr;
> +
> +	in = ckpt_read_obj_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET);
> +	if (IS_ERR(in))
> +		return PTR_ERR(in);
> +
> +	/* Listening sockets and those that are closed but have a local
> +	 * address need to call bind()
> +	 */
> +	if ((h->sock.state == TCP_LISTEN) ||
> +	    ((h->sock.state == TCP_CLOSE) && (h->laddr_len > 0))) {
> +		socket->sk->sk_reuse = 2;
> +		inet_sk(socket->sk)->freebind = 1;
> +		ret = socket->ops->bind(socket,
> +					(struct sockaddr *)l,
> +					h->laddr_len);
> +		ckpt_debug("inet bind: %i", ret);
> +		if (ret < 0)
> +			goto out;
> +
> +		if (h->sock.state == TCP_LISTEN) {
> +			ret = socket->ops->listen(socket, h->sock.backlog);
> +			ckpt_debug("inet listen: %i", ret);
> +			if (ret < 0)
> +				goto out;
> +
> +			ret = sock_add_parent(ctx, socket->sk);
> +			if (ret < 0)
> +				goto out;
> +		}
> +	} else {
> +		ret = sock_cptrst(ctx, socket->sk, h, CKPT_RST);
> +		if (ret)
> +			goto out;
> +
> +		ret = sock_inet_cptrst(ctx, socket->sk, in, CKPT_RST);
> +		if (ret)
> +			goto out;
> +
> +		if ((h->sock.state == TCP_ESTABLISHED) &&
> +		    (h->sock.protocol == IPPROTO_TCP))
> +			/* Delay hashing this sock until the end so we can
> +			 * hook it up with its parent (if appropriate)
> +			 */
> +			ret = sock_defer_hash(ctx, socket->sk);
> +	}

Haven't looked at the tcp code deeply, but does this correctly handle
the case of an established socket that was explicitly bound locally
with bind() before connect() to somewhere outside?

> +
> +  out:
> +	return ret;
> + }
> +
>  struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
>  				    struct ckpt_hdr_socket *h)
>  {
> @@ -690,6 +1013,10 @@ struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
>  	if (h->sock_common.family == AF_UNIX) {
>  		ret = sock_unix_restart(ctx, h, socket);
>  		ckpt_debug("sock_unix_restart: %i\n", ret);
> +	} else if ((h->sock_common.family == AF_INET) ||
> +		   (h->sock_common.family == AF_INET6)) {
> +		ret = sock_inet_restart(ctx, h, socket);
> +		ckpt_debug("sock_inet_restart: %i\n", ret);
>  	} else {
>  		ckpt_write_err(ctx, "unsupported family %i\n",
>  			       h->sock_common.family);

Thanks,

Oren.
Dan Smith July 8, 2009, 3:30 p.m. UTC | #4
OL> The live c/r of open connections is very useful for process
OL> migration. For other scenarios (e.g. fault recovery in HPC) it
OL> does not matter that much.

Perhaps it makes sense to shelve the INET patch for a while until we
can get the UNIX patch worked into decent shape?

OL> Are you certain that none of the values below needs to be
OL> sanitized before put in the kernel ?

No.

I removed the "I'm sure there are lots of holes here" comment so as
not to scare off the netdev folks.  In case it's not (painfully)
obvious, the UNIX patch has gotten significantly more attention so
far.
John Dykstra July 13, 2009, 7:02 p.m. UTC | #5
On Tue, 2009-07-07 at 12:26 -0700, Dan Smith wrote:
> Changes in v2:
>  - Check for data in the TCP out-of-order queue and fail if present

Why?  It seems that this would cause checkpoints to fail unexpectedly,
and is probably unnecessary in a migration scenario, because the peer
will retransmit the segments given appropriate ACKs.

  --  John

Dan Smith July 13, 2009, 7:10 p.m. UTC | #6
JD> Why?  It seems that this would cause checkpoints to fail
JD> unexpectedly,

Well, there are other places where the checkpoint will fail with EBUSY
because something is in a transitional state.

JD> and is probably unnecessary in a migration scenario, because the
JD> peer will retransmit the segments given appropriate ACKs.

Cool, sounds like we can punt on this particular one... Thanks :)
John Dykstra July 24, 2009, 8:44 p.m. UTC | #7
On Tue, 2009-07-07 at 12:26 -0700, Dan Smith wrote:
>  2. I don't do anything to redirect or freeze traffic flowing to or from the
>     remote system (to prevent a RST from breaking things).  I expect that
>     userspace will bring down a veth device or freeze traffic to the remote
>     system to handle this case.

Theoretically, you can drop any packet that's in flight (ingress or
egress), because IP doesn't guarantee delivery.  TCP is able to recover,
and a UDP or raw-socket application should already be designed to.  Of
course, retransmissions will have an impact on application performance
in the migration case, so that's got to be considered in the tradeoff.
Main goal should probably be avoiding anything that shoves either end
into slow-start.

Thinking out loud, have you considered draining TCP buffers rather than
including them in the checkpoint?  You'd stop ingress traffic, and let
the app run until it had read everything in the socket buffer.  On the
egress side, you'd cork the app by telling it that buffers were full,
and then wait until the data already at the socket layer had been
transmitted.  Both are somewhat unbounded re time, and probably not
worth it, but maybe there's some variant of this idea that has value.
TCP transmit buffers on 10GE links can be pretty big...

BTW, if you see RSTs, that probably means you've created a protocol
violation due to a buggy restore.  Just blocking or dropping packets
shouldn't result in an RST unless the interruption is very long.

  --  John

John Dykstra July 25, 2009, 9:02 p.m. UTC | #8
Some comments on the patch.  My apologies that these are late, and keep
in mind that I'm new to your c/r implementation.  

There's a lot more to think about, especially re socket state, but I
wanted to get these comments out to you without more delay.

On Tue, 2009-07-07 at 12:26 -0700, Dan Smith wrote:
> This patch adds AF_INET c/r support based on the framework established in
> my AF_UNIX patch.  I've tested it by checkpointing a single app with a
> pair of sockets connected over loopback.
> 
> A couple points about the operation:
> 
>  1. In order to properly hook up the established sockets with the matching
>     listening parent socket, I added a new list to the ckpt_ctx and run the
>     parent attachment in the deferqueue at the end of the restart process.
>  2. I don't do anything to redirect or freeze traffic flowing to or from the
>     remote system (to prevent a RST from breaking things).  I expect that
>     userspace will bring down a veth device or freeze traffic to the remote
>     system to handle this case.
> 
> Changes in v3:
>  - Add AF_INET6 support
> 
> Changes in v2:
>  - Check for data in the TCP out-of-order queue and fail if present
>  - Fix a logic issue in sock_add_parent()
>  - Add comment about holding a reference to sock for parent list
>  - Write error messages to checkpoint stream where appropriate
>  - Fix up checking of some return values in restart phase
>  - Fix up restart logic to restore socket info for all states
>  - Avoid running sk_proto->hash() for non-TCP sockets
>  - Fix calling bind() for unconnected (i.e. DGRAM) sockets
>  - Change 'in' to 'inet' in structure and function names
> 
> Cc: Oren Laaden <orenl@cs.columbia.edu>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Dan Smith <danms@us.ibm.com>
> ---
>  checkpoint/sys.c                 |    2 +
>  include/linux/checkpoint_hdr.h   |    1 +
>  include/linux/checkpoint_types.h |    2 +
>  include/linux/socket.h           |  102 ++++++++++++
>  net/checkpoint.c                 |  337 +++++++++++++++++++++++++++++++++++++-
>  5 files changed, 439 insertions(+), 5 deletions(-)
> 
> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> index 38a5299..b6f18ea 100644
> --- a/checkpoint/sys.c
> +++ b/checkpoint/sys.c
> @@ -242,6 +242,8 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
>  	INIT_LIST_HEAD(&ctx->pgarr_pool);
>  	init_waitqueue_head(&ctx->waitq);
>  
> +	INIT_LIST_HEAD(&ctx->listen_sockets);
> +
>  	err = -EBADF;
>  	ctx->file = fget(fd);
>  	if (!ctx->file)
> diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
> index f59b071..16e21ee 100644
> --- a/include/linux/checkpoint_hdr.h
> +++ b/include/linux/checkpoint_hdr.h
> @@ -93,6 +93,7 @@ enum {
>  	CKPT_HDR_SOCKET_BUFFERS,
>  	CKPT_HDR_SOCKET_BUFFER,
>  	CKPT_HDR_SOCKET_UNIX,
> +	CKPT_HDR_SOCKET_INET,
>  
>  	CKPT_HDR_TAIL = 9001,
>  
> diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
> index 27fbe26..d7db190 100644
> --- a/include/linux/checkpoint_types.h
> +++ b/include/linux/checkpoint_types.h
> @@ -60,6 +60,8 @@ struct ckpt_ctx {
>  	struct list_head pgarr_list;	/* page array to dump VMA contents */
>  	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
>  
> +	struct list_head listen_sockets;/* listening parent sockets */
> +
>  	/* [multi-process checkpoint] */
>  	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
>  	int nr_tasks;                   /* size of tasks array */
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index e7d64eb..c723d96 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -334,6 +334,108 @@ struct ckpt_hdr_socket_unix {
>  	__u32 flags;
>  } __attribute__ ((aligned(8)));
>  
> +struct ckpt_hdr_socket_inet {
> +	struct ckpt_hdr h;

I don't know whether this would affect palatibility (sic) for the
mainline, but it bothers me...  socket.h (and sock.h) are included all
over the place in the kernel, and this definition is only needed in very
limited locations.  Can it be placed in a .h used only by checkpoint
code?

> +
> +	__u32 daddr;
> +	__u32 rcv_saddr;
> +	__u32 saddr;
> +	__u16 dport;
> +	__u16 num;
> +	__u16 sport;
> +	__s16 uc_ttl;
> +	__u16 cmsg_flags;
> +	__u16 __pad;
> +
> +	struct {
> +		__u64 timeout;
> +		__u32 ato;
> +		__u32 lrcvtime;
> +		__u16 last_seg_size;
> +		__u16 rcv_mss;
> +		__u8 pending;
> +		__u8 quick;
> +		__u8 pingpong;
> +		__u8 blocked;
> +	} icsk_ack __attribute__ ((aligned(8)));
> +
> +	/* FIXME: Skipped opt, tos, multicast, cork settings */
> +
> +	struct {
> +		__u64 last_synq_overflow;
> +
> +		__u32 rcv_nxt;
> +		__u32 copied_seq;
> +		__u32 rcv_wup;
> +		__u32 snd_nxt;
> +		__u32 snd_una;
> +		__u32 snd_sml;
> +		__u32 rcv_tstamp;
> +		__u32 lsndtime;
> +
> +		__u32 snd_wl1;
> +		__u32 snd_wnd;
> +		__u32 max_window;
> +		__u32 mss_cache;
> +		__u32 window_clamp;
> +		__u32 rcv_ssthresh;
> +		__u32 frto_highmark;
> +
> +		__u32 srtt;
> +		__u32 mdev;
> +		__u32 mdev_max;
> +		__u32 rttvar;
> +		__u32 rtt_seq;
> +
> +		__u32 packets_out;
> +		__u32 retrans_out;
> +
> +		__u32 snd_up;
> +		__u32 rcv_wnd;
> +		__u32 write_seq;
> +		__u32 pushed_seq;
> +		__u32 lost_out;
> +		__u32 sacked_out;
> +		__u32 fackets_out;
> +		__u32 tso_deferred;
> +		__u32 bytes_acked;
> +
> +		__s32 lost_cnt_hint;
> +		__u32 retransmit_high;
> +
> +		__u32 lost_retrans_low;
> +
> +		__u32 prior_ssthresh;
> +		__u32 high_seq;
> +
> +		__u32 retrans_stamp;
> +		__u32 undo_marker;
> +		__s32 undo_retrans;
> +		__u32 total_retrans;
> +
> +		__u32 urg_seq;
> +		__u32 keepalive_time;
> +		__u32 keepalive_intvl;
> +
> +		__u16 urg_data;
> +		__u16 advmss;
> +		__u8 frto_counter;
> +		__u8 nonagle;
> +
> +		__u8 ecn_flags;
> +		__u8 reordering;
> +
> +		__u8 keepalive_probes;
> +	} tcp __attribute__ ((aligned(8)));
> +
> +	struct {
> +		char saddr[16];
> +		char rcv_saddr[16];
> +		char daddr[16];
> +	} inet6 __attribute__ ((aligned(8)));
> +
> +} __attribute__ ((aligned(8)));
> +
>  struct ckpt_hdr_socket {
>  	struct ckpt_hdr h;
>  
> diff --git a/net/checkpoint.c b/net/checkpoint.c
> index 0ff1656..ee069fa 100644
> --- a/net/checkpoint.c
> +++ b/net/checkpoint.c
> @@ -16,16 +16,69 @@
>  #include <linux/syscalls.h>
>  #include <linux/sched.h>
>  #include <linux/fs_struct.h>
> +#include <linux/tcp.h>
> +#include <linux/in.h>
>  
>  #include <net/af_unix.h>
>  #include <net/tcp_states.h>
> +#include <net/tcp.h>
>  
>  #include <linux/checkpoint.h>
>  #include <linux/checkpoint_hdr.h>
> +#include <linux/deferqueue.h>
>  
>  /* Size of an empty struct sockaddr_un */
>  #define UNIX_LEN_EMPTY 2
>  
> +struct ckpt_parent_sock {
> +	struct sock *sock;
> +	__u32 oref;
> +	struct list_head list;
> +};
> +
> +static int sock_add_parent(struct ckpt_ctx *ctx, struct sock *sock)
> +{
> +	struct ckpt_parent_sock *parent;
> +	__u32 objref;
> +	int new;
> +
> +	objref = ckpt_obj_lookup_add(ctx, sock, CKPT_OBJ_SOCK, &new);
> +	if (objref < 0)
> +		return objref;
> +	else if (!new)
> +		return 0;
> +
> +	parent = kmalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent)
> +		return -ENOMEM;
> +
> +	/* The deferqueue is processed before the objhash is free()'d,
> +	 * thus the objhash holds a reference to sock for us
> +	 */
> +	parent->sock = sock;
> +	parent->oref = objref;
> +	INIT_LIST_HEAD(&parent->list);
> +
> +	list_add(&parent->list, &ctx->listen_sockets);
> +
> +	return 0;
> +}
> +
> +static struct sock *sock_get_parent(struct ckpt_ctx *ctx, struct sock *sock)
> +{
> +	struct ckpt_parent_sock *parent;
> +	struct inet_sock *c = inet_sk(sock);
> +
> +	list_for_each_entry(parent, &ctx->listen_sockets, list) {
> +		struct inet_sock *p = inet_sk(parent->sock);
> +
> +		if (c->sport == p->sport)
> +			return parent->sock;
> +	}
> +
> +	return NULL;
> +}
> +
>  static inline int sock_unix_need_cwd(struct sockaddr_un *a)
>  {
>  	return (a->sun_path[0] != '/');
> @@ -55,17 +108,23 @@ static int sock_copy_buffers(struct sk_buff_head *from, struct sk_buff_head *to)
>  }
>  
>  static int __sock_write_buffers(struct ckpt_ctx *ctx,
> +				uint16_t family,
>  				struct sk_buff_head *queue)
>  {
>  	struct sk_buff *skb;
>  	int ret = 0;
>  
>  	skb_queue_walk(queue, skb) {
> -		if (UNIXCB(skb).fp) {
> +		if ((family == AF_UNIX) && UNIXCB(skb).fp) {
>  			ckpt_write_err(ctx, "fd-passing is not supported");
>  			return -EBUSY;
>  		}
>  
> +		if (skb_shinfo(skb)->nr_frags) {
> +			ckpt_write_err(ctx, "socket has fragments in-flight");
> +			return -EBUSY;
> +		}
> +
>  		ret = ckpt_write_obj_type(ctx, skb->data, skb->len,
>  					  CKPT_HDR_SOCKET_BUFFER);
>  		if (ret)
> @@ -75,7 +134,9 @@ static int __sock_write_buffers(struct ckpt_ctx *ctx,
>  	return 0;
>  }
>  
> -static int sock_write_buffers(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
> +static int sock_write_buffers(struct ckpt_ctx *ctx,
> +			      uint16_t family,
> +			      struct sk_buff_head *queue)
>  {
>  	struct ckpt_hdr_socket_buffer *h;
>  	struct sk_buff_head tmpq;
> @@ -95,7 +156,7 @@ static int sock_write_buffers(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
>  
>  	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
>  	if (!ret)
> -		ret = __sock_write_buffers(ctx, &tmpq);
> +		ret = __sock_write_buffers(ctx, family, &tmpq);
>  
>   out:
>  	ckpt_hdr_put(ctx, h);
> @@ -309,6 +370,166 @@ static int sock_cptrst(struct ckpt_ctx *ctx,
>  		return 0;
>  }
>  
> +static int sock_inet_tcp_cptrst(struct ckpt_ctx *ctx,
> +				struct tcp_sock *sk,
> +				struct ckpt_hdr_socket_inet *hh,
> +				int op)
> +{
> +	CKPT_COPY(op, hh->tcp.rcv_nxt, sk->rcv_nxt);
> +	CKPT_COPY(op, hh->tcp.copied_seq, sk->copied_seq);
> +	CKPT_COPY(op, hh->tcp.rcv_wup, sk->rcv_wup);
> +	CKPT_COPY(op, hh->tcp.snd_nxt, sk->snd_nxt);
> +	CKPT_COPY(op, hh->tcp.snd_una, sk->snd_una);
> +	CKPT_COPY(op, hh->tcp.snd_sml, sk->snd_sml);
> +	CKPT_COPY(op, hh->tcp.rcv_tstamp, sk->rcv_tstamp);
> +	CKPT_COPY(op, hh->tcp.lsndtime, sk->lsndtime);
> +
> +	CKPT_COPY(op, hh->tcp.snd_wl1, sk->snd_wl1);
> +	CKPT_COPY(op, hh->tcp.snd_wnd, sk->snd_wnd);
> +	CKPT_COPY(op, hh->tcp.max_window, sk->max_window);
> +	CKPT_COPY(op, hh->tcp.mss_cache, sk->mss_cache);
> +	CKPT_COPY(op, hh->tcp.window_clamp, sk->window_clamp);
> +	CKPT_COPY(op, hh->tcp.rcv_ssthresh, sk->rcv_ssthresh);
> +	CKPT_COPY(op, hh->tcp.frto_highmark, sk->frto_highmark);
> +	CKPT_COPY(op, hh->tcp.advmss, sk->advmss);
> +	CKPT_COPY(op, hh->tcp.frto_counter, sk->frto_counter);
> +	CKPT_COPY(op, hh->tcp.nonagle, sk->nonagle);
> +
> +	CKPT_COPY(op, hh->tcp.srtt, sk->srtt);
> +	CKPT_COPY(op, hh->tcp.mdev, sk->mdev);
> +	CKPT_COPY(op, hh->tcp.mdev_max, sk->mdev_max);
> +	CKPT_COPY(op, hh->tcp.rttvar, sk->rttvar);
> +	CKPT_COPY(op, hh->tcp.rtt_seq, sk->rtt_seq);

I doubt the RTT metrics should be migrated.
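
One option (sketch): still record them in the image for debugging, but
skip the restore direction and let the estimator re-converge on the new
path:

	if (op == CKPT_CPT) {
		hh->tcp.srtt = sk->srtt;
		hh->tcp.mdev = sk->mdev;
		hh->tcp.mdev_max = sk->mdev_max;
		hh->tcp.rttvar = sk->rttvar;
		hh->tcp.rtt_seq = sk->rtt_seq;
	}
	/* on CKPT_RST, deliberately leave the fresh socket's RTT state alone */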
> +
> +	CKPT_COPY(op, hh->tcp.packets_out, sk->packets_out);
> +	CKPT_COPY(op, hh->tcp.retrans_out, sk->retrans_out);
> +
> +	CKPT_COPY(op, hh->tcp.urg_data, sk->urg_data);
> +	CKPT_COPY(op, hh->tcp.ecn_flags, sk->ecn_flags);
> +	CKPT_COPY(op, hh->tcp.reordering, sk->reordering);
> +	CKPT_COPY(op, hh->tcp.snd_up, sk->snd_up);
> +
> +	CKPT_COPY(op, hh->tcp.keepalive_probes, sk->keepalive_probes);
> +
> +	CKPT_COPY(op, hh->tcp.rcv_wnd, sk->rcv_wnd);
> +	CKPT_COPY(op, hh->tcp.write_seq, sk->write_seq);
> +	CKPT_COPY(op, hh->tcp.pushed_seq, sk->pushed_seq);
> +	CKPT_COPY(op, hh->tcp.lost_out, sk->lost_out);
> +	CKPT_COPY(op, hh->tcp.sacked_out, sk->sacked_out);

You need to either save the SACK state or reset it.

> +	CKPT_COPY(op, hh->tcp.fackets_out, sk->fackets_out);
> +	CKPT_COPY(op, hh->tcp.tso_deferred, sk->tso_deferred);

Don't need to migrate this.

> +	CKPT_COPY(op, hh->tcp.bytes_acked, sk->bytes_acked);
> +
> +	CKPT_COPY(op, hh->tcp.lost_cnt_hint, sk->lost_cnt_hint);
> +	CKPT_COPY(op, hh->tcp.retransmit_high, sk->retransmit_high);
> +
> +	CKPT_COPY(op, hh->tcp.lost_retrans_low, sk->lost_retrans_low);
> +
> +	CKPT_COPY(op, hh->tcp.prior_ssthresh, sk->prior_ssthresh);
> +	CKPT_COPY(op, hh->tcp.high_seq, sk->high_seq);
> +
> +	CKPT_COPY(op, hh->tcp.retrans_stamp, sk->retrans_stamp);
> +	CKPT_COPY(op, hh->tcp.undo_marker, sk->undo_marker);
> +	CKPT_COPY(op, hh->tcp.undo_retrans, sk->undo_retrans);
> +	CKPT_COPY(op, hh->tcp.total_retrans, sk->total_retrans);
> +
> +	CKPT_COPY(op, hh->tcp.urg_seq, sk->urg_seq);
> +	CKPT_COPY(op, hh->tcp.keepalive_time, sk->keepalive_time);
> +	CKPT_COPY(op, hh->tcp.keepalive_intvl, sk->keepalive_intvl);
> +
> +	CKPT_COPY(op, hh->tcp.last_synq_overflow, sk->last_synq_overflow);
> +
> +	return 0;
> +}
> +
> +static int sock_inet_cptrst(struct ckpt_ctx *ctx,
> +			    struct sock *sock,
> +			    struct ckpt_hdr_socket_inet *hh,
> +			  int op)
> +{
> +	struct inet_sock *sk = inet_sk(sock);
> +	struct inet_connection_sock *icsk = inet_csk(sock);
> +	int ret;
> +
> +	CKPT_COPY(op, hh->daddr, sk->daddr);
> +	CKPT_COPY(op, hh->rcv_saddr, sk->rcv_saddr);
> +	CKPT_COPY(op, hh->dport, sk->dport);
> +	CKPT_COPY(op, hh->num, sk->num);
> +	CKPT_COPY(op, hh->saddr, sk->saddr);
> +	CKPT_COPY(op, hh->sport, sk->sport);
> +	CKPT_COPY(op, hh->uc_ttl, sk->uc_ttl);
> +	CKPT_COPY(op, hh->cmsg_flags, sk->cmsg_flags);
> +
> +	CKPT_COPY(op, hh->icsk_ack.pending, icsk->icsk_ack.pending);
> +	CKPT_COPY(op, hh->icsk_ack.quick, icsk->icsk_ack.quick);
> +	CKPT_COPY(op, hh->icsk_ack.pingpong, icsk->icsk_ack.pingpong);
> +	CKPT_COPY(op, hh->icsk_ack.blocked, icsk->icsk_ack.blocked);
> +	CKPT_COPY(op, hh->icsk_ack.ato, icsk->icsk_ack.ato);
> +	CKPT_COPY(op, hh->icsk_ack.timeout, icsk->icsk_ack.timeout);
> +	CKPT_COPY(op, hh->icsk_ack.lrcvtime, icsk->icsk_ack.lrcvtime);
> +	CKPT_COPY(op,
> +		hh->icsk_ack.last_seg_size, icsk->icsk_ack.last_seg_size);
> +	CKPT_COPY(op, hh->icsk_ack.rcv_mss, icsk->icsk_ack.rcv_mss);
> +
> +	if (sock->sk_protocol == IPPROTO_TCP)
> +		ret = sock_inet_tcp_cptrst(ctx, tcp_sk(sock), hh, op);
> +	else if (sock->sk_protocol == IPPROTO_UDP)
> +		ret = 0;
> +	else {
> +		ckpt_write_err(ctx, "unknown socket protocol %d",
> +			       sock->sk_protocol);
> +		ret = -EINVAL;
> +	}
> +
> +	if (sock->sk_family == AF_INET6) {
> +		struct ipv6_pinfo *inet6 = inet6_sk(sock);
> +		unsigned alen = sizeof(struct in6_addr);
> +		if (op == CKPT_CPT) {
> +			memcpy(hh->inet6.saddr, &inet6->saddr, alen);
> +			memcpy(hh->inet6.rcv_saddr, &inet6->rcv_saddr, alen);
> +			memcpy(hh->inet6.daddr, &inet6->daddr, alen);
> +		} else {
> +			memcpy(&inet6->saddr, hh->inet6.saddr, alen);
> +			memcpy(&inet6->rcv_saddr, hh->inet6.rcv_saddr, alen);
> +			memcpy(&inet6->daddr, hh->inet6.daddr, alen);
> +		}
> +	}
> +
> +	return ret;
> +}

For your someday to-do list--the router alert lists.  Lots of stuff is
more important than this.

> +
> +static int sock_inet_checkpoint(struct ckpt_ctx *ctx,
> +				struct sock *sock,
> +				struct ckpt_hdr_socket *h)
> +{
> +	int ret = -EINVAL;
> +	struct ckpt_hdr_socket_inet *in;
> +
> +	if (sock->sk_protocol == IPPROTO_TCP) {
> +		struct tcp_sock *tsock = tcp_sk(sock);
> +		if (!skb_queue_empty(&tsock->out_of_order_queue)) {
> +			ckpt_write_err(ctx, "TCP socket has out-of-order data");
> +			return -EBUSY;
> +		}
> +	}
> +
> +	in = ckpt_hdr_get_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET);
> +	if (!in)
> +		goto out;
> +
> +	ret = sock_inet_cptrst(ctx, sock, in, CKPT_CPT);
> +	if (ret < 0)
> +		goto out;
> +

There's an interesting design choice re TIME_WAIT and similar states.
In a migration scenario, do those sks migrate?  If the tree isn't going
to be restored for a while, you want the original host to continue to
respond to those four-tuples.  If the IP address of the original is
immediately migrated to another machine, you want the TIME_WAIT sks to
migrate too.

> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
> +	if (ret < 0)
> +		goto out;
> +
> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) in);
> + out:
> +	return ret;
> +}
> +
>  int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
>  {
>  	struct socket *socket = file->private_data;
> @@ -349,6 +570,11 @@ int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
>  		ret = sock_unix_checkpoint(ctx, sock, h);
>  		if (ret)
>  			goto out;
> +	} else if ((sock->sk_family == AF_INET) ||
> +		   (sock->sk_family == AF_INET6)) {
> +		ret = sock_inet_checkpoint(ctx, sock, h);
> +		if (ret)
> +			goto out;
>  	} else {
>  		ckpt_write_err(ctx, "unsupported socket family %i",
>  			       sock->sk_family);
> @@ -357,11 +583,13 @@ int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
>  	}
>  
>  	if (sock->sk_state != TCP_LISTEN) {
> -		ret = sock_write_buffers(ctx, &sock->sk_receive_queue);
> +		uint16_t family = sock->sk_family;
> +
> +		ret = sock_write_buffers(ctx, family, &sock->sk_receive_queue);
>  		if (ret)
>  			goto out;
>  
> -		ret = sock_write_buffers(ctx, &sock->sk_write_queue);
> +		ret = sock_write_buffers(ctx, family, &sock->sk_write_queue);
>  		if (ret)
>  			goto out;
>  	}
> @@ -677,6 +905,101 @@ static int sock_unix_restart(struct ckpt_ctx *ctx,
>  	return ret;
>  }
>  
> +struct dq_sock {
> +	struct sock *sock;
> +	struct ckpt_ctx *ctx;
> +};
> +
> +static int __sock_hash_parent(void *data)
> +{
> +	struct dq_sock *dq = (struct dq_sock *)data;
> +	struct sock *parent;
> +
> +	dq->sock->sk_prot->hash(dq->sock);
> +
> +	parent = sock_get_parent(dq->ctx, dq->sock);
> +	if (parent) {
> +		inet_sk(dq->sock)->num = ntohs(inet_sk(dq->sock)->sport);
> +		local_bh_disable();
> +		__inet_inherit_port(parent, dq->sock);
> +		local_bh_enable();
> +	} else {
> +		inet_sk(dq->sock)->num = 0;
> +		inet_hash_connect(&tcp_death_row, dq->sock);
> +		inet_sk(dq->sock)->num = ntohs(inet_sk(dq->sock)->sport);
> +	}
> +
> +	return 0;
> +}
> +
> +static int sock_defer_hash(struct ckpt_ctx *ctx, struct sock *sock)
> +{
> +	struct dq_sock dq;
> +
> +	dq.sock = sock;
> +	dq.ctx = ctx;
> +
> +	return deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
> +			      __sock_hash_parent, __sock_hash_parent);
> +}
> +
> +static int sock_inet_restart(struct ckpt_ctx *ctx,
> +			     struct ckpt_hdr_socket *h,
> +			     struct socket *socket)
> +{
> +	int ret;
> +	struct ckpt_hdr_socket_inet *in;
> +	struct sockaddr_in *l = (struct sockaddr_in *)&h->laddr;
> +
> +	in = ckpt_read_obj_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET);
> +	if (IS_ERR(in))
> +		return PTR_ERR(in);
> +
> +	/* Listening sockets and those that are closed but have a local
> +	 * address need to call bind()
> +	 */
> +	if ((h->sock.state == TCP_LISTEN) ||
> +	    ((h->sock.state == TCP_CLOSE) && (h->laddr_len > 0))) {
> +		socket->sk->sk_reuse = 2;
> +		inet_sk(socket->sk)->freebind = 1;
> +		ret = socket->ops->bind(socket,
> +					(struct sockaddr *)l,
> +					h->laddr_len);
> +		ckpt_debug("inet bind: %i", ret);
> +		if (ret < 0)
> +			goto out;
> +
> +		if (h->sock.state == TCP_LISTEN) {
> +			ret = socket->ops->listen(socket, h->sock.backlog);
> +			ckpt_debug("inet listen: %i", ret);
> +			if (ret < 0)
> +				goto out;
> +
> +			ret = sock_add_parent(ctx, socket->sk);
> +			if (ret < 0)
> +				goto out;

So this is just dummied off as a proof-of-concept for LISTEN?

> +		}
> +	} else {
> +		ret = sock_cptrst(ctx, socket->sk, h, CKPT_RST);
> +		if (ret)
> +			goto out;
> +
> +		ret = sock_inet_cptrst(ctx, socket->sk, in, CKPT_RST);
> +		if (ret)
> +			goto out;

At a minimum, you'll want to start the TCP retransmit timer if there is
unacknowledged data outstanding.  And other timers for other states, as
they're supported.
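
Untested sketch of the minimal restart-side version:

	/* once the TCP state is restored, re-arm the retransmit timer
	 * if anything is still unacknowledged; icsk_rto here is whatever
	 * the fresh socket has, since the image does not carry it
	 */
	if (tcp_sk(socket->sk)->packets_out)
		inet_csk_reset_xmit_timer(socket->sk, ICSK_TIME_RETRANS,
					  inet_csk(socket->sk)->icsk_rto,
					  TCP_RTO_MAX);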

And you probably do want to do slow-start again--disregard my babbling
from yesterday. 

> +
> +		if ((h->sock.state == TCP_ESTABLISHED) &&
> +		    (h->sock.protocol == IPPROTO_TCP))
> +			/* Delay hashing this sock until the end so we can
> +			 * hook it up with its parent (if appropriate)
> +			 */
> +			ret = sock_defer_hash(ctx, socket->sk);
> +	}
> +
> +  out:
> +	return ret;
> + }
> +
>  struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
>  				    struct ckpt_hdr_socket *h)
>  {
> @@ -690,6 +1013,10 @@ struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
>  	if (h->sock_common.family == AF_UNIX) {
>  		ret = sock_unix_restart(ctx, h, socket);
>  		ckpt_debug("sock_unix_restart: %i\n", ret);
> +	} else if ((h->sock_common.family == AF_INET) ||
> +		   (h->sock_common.family == AF_INET6)) {
> +		ret = sock_inet_restart(ctx, h, socket);
> +		ckpt_debug("sock_inet_restart: %i\n", ret);
>  	} else {
>  		ckpt_write_err(ctx, "unsupported family %i\n",
>  			       h->sock_common.family);

This is part of your AF_UNIX patch:

        struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
        				    struct ckpt_hdr_socket *h)
        {
        	struct socket *socket;
        	int ret;
        
        	ret = sock_create(h->sock_common.family, h->sock.type, 0, &socket);
        	if (ret < 0)
        		return ERR_PTR(ret);
        

You _really_ want to pass the actual protocol number to sock_create().
The size of the sk it creates depends on this.  You'll quickly be in
memory corruption hell without it.
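
i.e. something like (the protocol is already carried in the image as
h->sock.protocol, which is used elsewhere in this patch):

        	ret = sock_create(h->sock_common.family, h->sock.type,
        			  h->sock.protocol, &socket);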

Also from the AF_UNIX patch:
        
        static int obj_sock_users(void *ptr)
        {
        	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
        }
        
Network sockets also use sk->sk_wmem_alloc to track references to the
sock from egress skb's
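
A rough sketch of what folding that in might look like (whether the
leak-detection framework can use a bytes+refs sum is an open question):

        static int obj_sock_users(void *ptr)
        {
        	struct sock *sk = ptr;

        	/* count both direct references and bytes charged to egress
        	 * skb's; sk_wmem_alloc is a byte count, so treat the sum
        	 * only as a "did anything change" value
        	 */
        	return atomic_read(&sk->sk_refcnt) +
        	       atomic_read(&sk->sk_wmem_alloc);
        }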

And:

        static int sock_copy_buffers(struct sk_buff_head *from, struct
        sk_buff_head *to)
        {
        	int count = 0;
        	struct sk_buff *skb;
        
        	skb_queue_walk(from, skb) {
        		struct sk_buff *tmp;
        
        		tmp = dev_alloc_skb(skb->len);
        		if (!tmp)
        			return -ENOMEM;
        
        		spin_lock(&from->lock);
        		skb_morph(tmp, skb);
        		spin_unlock(&from->lock);

Why not just clone the skb?
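
i.e. roughly (sketch; the queueing of the copy onto @to and the count
are assumed to match the rest of the original loop, which isn't shown
here):

        	skb_queue_walk(from, skb) {
        		struct sk_buff *tmp;

        		/* skb_clone() copies the metadata and shares the data
        		 * area, so the separate allocate-and-morph step goes away
        		 */
        		tmp = skb_clone(skb, GFP_KERNEL);
        		if (!tmp)
        			return -ENOMEM;

        		skb_queue_tail(to, tmp);
        		count++;
        	}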

  --  John

Dan Smith July 28, 2009, 4 p.m. UTC | #9
JD> I don't know whether this would affect palatibility (sic) for the
JD> mainline, but it bothers me...  socket.h (and sock.h) are included
JD> all over the place in the kernel, and this definition is only
JD> needed in very limited locations.  Can it be placed in a .h used
JD> only by checkpoint code?

This was moved here based on previous comments on the patch.
Personally, I think socket.h is a little too wide.  While iterating on
this patch locally, I've discovered that socket.h won't really work
because in6.h includes it early and thus I'm unable to include some of
the address structures as a result.  I think going back to a
checkpoint-specific header would be helpful.

JD> There's an interesting design choice re TIME_WAIT and similar
JD> states.  In a migration scenario, do those sks migrate?  If the
JD> tree isn't going to be restored for a while., you want the
JD> original host to continue to respond to those four-tuples If the
JD> IP address of the original is immediately migrated to another
JD> machine, you want the TIME_WAIT sks to migrate too.

Well, as far as I'm concerned, the only sane migration scenario is one
where you migrate the IP address and virtual interface with the
task.  When you destroy the virtual interface after (or during) the
migration, the TIME_WAIT socket will disappear from the sending host.

I think that we need to increment timers like this on restore anyway,
which should make sure that the TIME_WAIT timers run for the same
amount of wall-clock time regardless of how much time was spent in the
process of migration, right?

Since checkpoint is not aware of a potential migration, a regular
(i.e. not intended for migration) checkpoint operation on a task
running in the init netns will leave the TIME_WAIT socket in place
until the timer expires.

>> +static int sock_inet_restart(struct ckpt_ctx *ctx,
>> +			     struct ckpt_hdr_socket *h,
>> +			     struct socket *socket)
>> +{
>> +	int ret;
>> +	struct ckpt_hdr_socket_inet *in;
>> +	struct sockaddr_in *l = (struct sockaddr_in *)&h->laddr;
>> +
>> +	in = ckpt_read_obj_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET);
>> +	if (IS_ERR(in))
>> +		return PTR_ERR(in);
>> +
>> +	/* Listening sockets and those that are closed but have a local
>> +	 * address need to call bind()
>> +	 */
>> +	if ((h->sock.state == TCP_LISTEN) ||
>> +	    ((h->sock.state == TCP_CLOSE) && (h->laddr_len > 0))) {
>> +		socket->sk->sk_reuse = 2;
>> +		inet_sk(socket->sk)->freebind = 1;
>> +		ret = socket->ops->bind(socket,
>> +					(struct sockaddr *)l,
>> +					h->laddr_len);
>> +		ckpt_debug("inet bind: %i", ret);
>> +		if (ret < 0)
>> +			goto out;
>> +
>> +		if (h->sock.state == TCP_LISTEN) {
>> +			ret = socket->ops->listen(socket, h->sock.backlog);
>> +			ckpt_debug("inet listen: %i", ret);
>> +			if (ret < 0)
>> +				goto out;
>> +
>> +			ret = sock_add_parent(ctx, socket->sk);
>> +			if (ret < 0)
>> +				goto out;

JD> So this is just dummied off as a proof-of-concept for LISTEN?

I'm not sure what you mean...

>> +		}
>> +	} else {
>> +		ret = sock_cptrst(ctx, socket->sk, h, CKPT_RST);
>> +		if (ret)
>> +			goto out;
>> +
>> +		ret = sock_inet_cptrst(ctx, socket->sk, in, CKPT_RST);
>> +		if (ret)
>> +			goto out;

JD> At a minimum, you'll want to start the TCP retransmit timer if there is
JD> unacknowledged data outstanding.  And other timers for other states, as
JD> they're supported.

JD> And you probably do want to do slow-start again--disregard my babbling
JD> from yesterday. 

Heh, okay :)

JD> This is part of your AF_UNIX patch:

JD>         struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
JD>         				    struct ckpt_hdr_socket *h)
JD>         {
JD>         	struct socket *socket;
JD>         	int ret;
        
JD>         	ret = sock_create(h->sock_common.family, h->sock.type, 0, &socket);
JD>         	if (ret < 0)
JD>         		return ERR_PTR(ret);
        

JD> You _really_ want to pass the actual protocol number to sock_create().
JD> The size of the sk it creates depends on this.  You'll quickly be in
JD> memory corruption hell without it.

You mean I need to verify that the protocol is one of IPPROTO_TCP,
IPPROTO_UDP, or PF_UNIX, right?

JD> Also from the AF_UNIX patch:
        
JD>         static int obj_sock_users(void *ptr)
JD>         {
JD>         	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
JD>         }
        
JD> Network sockets also use sk->sk_wmem_alloc to track references to
JD> the sock from egress skb's

IIUC, this function is part of the framework's leak detection.  That
code looks to make sure objects don't gain any additional users
between the start and end of the checkpoint operation.  I think the
sk_refcnt satisfies that requirement here, no?

JD> And:

JD>         static int sock_copy_buffers(struct sk_buff_head *from, struct
JD>         sk_buff_head *to)
JD>         {
JD>         	int count = 0;
JD>         	struct sk_buff *skb;
        
JD>         	skb_queue_walk(from, skb) {
JD>         		struct sk_buff *tmp;
        
JD>         		tmp = dev_alloc_skb(skb->len);
JD>         		if (!tmp)
JD>         			return -ENOMEM;
        
JD>         		spin_lock(&from->lock);
JD>         		skb_morph(tmp, skb);
JD>         		spin_unlock(&from->lock);

JD> Why not just clone the skb?

Okay, good call.

Thanks!
Oren Laadan July 28, 2009, 5:07 p.m. UTC | #10
Dan Smith wrote:
> JD> I don't know whether this would affect palatibility (sic) for the
> JD> mainline, but it bothers me...  socket.h (and sock.h) are included
> JD> all over the place in the kernel, and this definition is only
> JD> needed in very limited locations.  Can it be placed in a .h used
> JD> only by checkpoint code?
> 
> This was moved here based on previous comments on the patch.
> Personally, I think socket.h is a little too wide.  While iterating on
> this patch locally, I've discovered that socket.h won't really work
> because in6.h includes it early and thus I'm unable to include some of
> the address structures as a result.  I think going back to a
> checkpoint-specific header would be helpful.
> 
> JD> There's an interesting design choice re TIME_WAIT and similar
> JD> states.  In a migration scenario, do those sks migrate?  If the
> JD> tree isn't going to be restored for a while., you want the
> JD> original host to continue to respond to those four-tuples If the
> JD> IP address of the original is immediately migrated to another
> JD> machine, you want the TIME_WAIT sks to migrate too.
> 
> Well, as far as I'm concerned, the only sane migration scenario is one
> where you migrate the IP address and virtual interface with the
> task.  When you destroy the virtual interface after (or during) the
> migration, the TIME_WAIT socket will disappear from the sending host.

Note that such sockets are unlikely to be referenced by _any_ process
because they will have been closed already. So the checkpoint code
will need to find such sockets through some means other than file
descriptors.

> 
> I think that we need to increment timers like this on restore anyway,
> which should make sure that the TIME_WAIT timers run for the same
> amount of wall-clock time regardless of how much time was spent in the
> process of migration, right?
> 
> Since checkpoint is not aware of a potential migration, a regular
> (i.e. not intended for migration) checkpoint operation on a task
> running in the init netns will leave the TIME_WAIT socket in place
> until the timer expires.
> 
>>> +static int sock_inet_restart(struct ckpt_ctx *ctx,
>>> +			     struct ckpt_hdr_socket *h,
>>> +			     struct socket *socket)
>>> +{
>>> +	int ret;
>>> +	struct ckpt_hdr_socket_inet *in;
>>> +	struct sockaddr_in *l = (struct sockaddr_in *)&h->laddr;
>>> +
>>> +	in = ckpt_read_obj_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET);
>>> +	if (IS_ERR(in))
>>> +		return PTR_ERR(in);
>>> +
>>> +	/* Listening sockets and those that are closed but have a local
>>> +	 * address need to call bind()
>>> +	 */
>>> +	if ((h->sock.state == TCP_LISTEN) ||
>>> +	    ((h->sock.state == TCP_CLOSE) && (h->laddr_len > 0))) {
>>> +		socket->sk->sk_reuse = 2;
>>> +		inet_sk(socket->sk)->freebind = 1;
>>> +		ret = socket->ops->bind(socket,
>>> +					(struct sockaddr *)l,
>>> +					h->laddr_len);
>>> +		ckpt_debug("inet bind: %i", ret);
>>> +		if (ret < 0)
>>> +			goto out;
>>> +
>>> +		if (h->sock.state == TCP_LISTEN) {
>>> +			ret = socket->ops->listen(socket, h->sock.backlog);
>>> +			ckpt_debug("inet listen: %i", ret);
>>> +			if (ret < 0)
>>> +				goto out;
>>> +
>>> +			ret = sock_add_parent(ctx, socket->sk);
>>> +			if (ret < 0)
>>> +				goto out;
> 
> JD> So this is just dummied off as a proof-of-concept for LISTEN?
> 
> I'm not sure what you mean...
> 
>>> +		}
>>> +	} else {
>>> +		ret = sock_cptrst(ctx, socket->sk, h, CKPT_RST);
>>> +		if (ret)
>>> +			goto out;
>>> +
>>> +		ret = sock_inet_cptrst(ctx, socket->sk, in, CKPT_RST);
>>> +		if (ret)
>>> +			goto out;
> 
> JD> At a minimum, you'll want to start the TCP retransmit timer if there is
> JD> unacknowledged data outstanding.  And other timers for other states, as
> JD> they're supported.
> 
> JD> And you probably do want to do slow-start again--disregard my babbling
> JD> from yesterday. 
> 
> Heh, okay :)
> 
> JD> This is part of your AF_UNIX patch:
> 
> JD>         struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
> JD>         				    struct ckpt_hdr_socket *h)
> JD>         {
> JD>         	struct socket *socket;
> JD>         	int ret;
>         
> JD>         	ret = sock_create(h->sock_common.family, h->sock.type, 0, &socket);
> JD>         	if (ret < 0)
> JD>         		return ERR_PTR(ret);
>         
> 
> JD> You _really_ want to pass the actual protocol number to sock_create().
> JD> The size of the sk it creates depends on this.  You'll quickly be in
> JD> memory corruption hell without it.
> 
> You mean I need to verify that the protocol is one of IPPROTO_TCP,
> IPPROTO_UDP, or PF_UNIX, right?
> 
> JD> Also from the AF_UNIX patch:
>         
> JD>         static int obj_sock_users(void *ptr)
> JD>         {
> JD>         	return atomic_read(&((struct sock *) ptr)->sk_refcnt);
> JD>         }
>         
> JD> Network sockets also use sk->sk_wmem_alloc to track references to
> JD> the sock from egress skb's
> 
> IIUC, this function is part of the framework's leak detection.  That
> code looks to make sure objects don't gain any additional users
> between the start and end of the checkpoint operation.  I think the
> sk_refcnt satisfies that requirement here, no?

As I commented on patch 5/5 AF_UNIX -- IMO socket objects don't require
leak-detection, since they are 1-to-1 with files, which are already
accounted for. Sockets can't be otherwise cloned/inherited/transferred
between processes.

> 
> JD> And:
> 
> JD>         static int sock_copy_buffers(struct sk_buff_head *from, struct
> JD>         sk_buff_head *to)
> JD>         {
> JD>         	int count = 0;
> JD>         	struct sk_buff *skb;
>         
> JD>         	skb_queue_walk(from, skb) {
> JD>         		struct sk_buff *tmp;
>         
> JD>         		tmp = dev_alloc_skb(skb->len);
> JD>         		if (!tmp)
> JD>         			return -ENOMEM;
>         
> JD>         		spin_lock(&from->lock);
> JD>         		skb_morph(tmp, skb);
> JD>         		spin_unlock(&from->lock);
> 
> JD> Why not just clone the skb?
> 
> Okay, good call.
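
A sketch of the cloned-skb variant being suggested (not from the posted
patch; it assumes skb_clone() with GFP_KERNEL is acceptable at this point
and that queue locking is handled by the caller):

	skb_queue_walk(from, skb) {
		struct sk_buff *tmp;

		/* clone shares the payload rather than copying it */
		tmp = skb_clone(skb, GFP_KERNEL);
		if (!tmp)
			return -ENOMEM;

		skb_queue_tail(to, tmp);
		count++;
	}
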
> 
> Thanks!
> 

Oren.
Oren Laadan July 28, 2009, 5:22 p.m. UTC | #11
John Dykstra wrote:
> On Tue, 2009-07-07 at 12:26 -0700, Dan Smith wrote:
>>  2. I don't do anything to redirect or freeze traffic flowing to or from the
>>     remote system (to prevent a RST from breaking things).  I expect that
>>     userspace will bring down a veth device or freeze traffic to the remote
>>     system to handle this case.
> 
> Theoretically, you can drop any packet that's in flight (ingress or
> egress), because IP doesn't guarantee delivery.  TCP is able to recover,
> and a UDP or raw-socket application should already be designed to.  Of
> course, retransmissions will have an impact on application performance
> in the migration case, so that's got to be considered in the tradeoff.
> Main goal should probably be avoiding anything that shoves either end
> into slow-start.

Sure. Still, the network needs to be blocked for the duration of the
migration to ensure that the socket at the origin does not ACK any
new data after the receive buffers have been saved.

> 
> Thinking out loud, have you considered draining TCP buffers rather than
> including them in the checkpoint?  You'd stop ingress traffic, and let
> the app run until it had read everything in the socket buffer.  On the
> egress side, you'd cork the app by telling it that buffers were full,
> and then wait until the data already at the socket layer had been
> transmitted.  Both are somewhat unbounded re time, and probably not
> worth it, but maybe there's some variant of this idea that has value.
> TCP transmit buffers on 10GE links can be pretty big...

Hmmm... buffers can be big, but I would expect that in most cases the
memory footprint of the application will be bigger (unless all it
does is some very simple receive-filter-send of data).

Oren.

> 
> BTW, if you see RSTs, that probably means you've created a protocol
> violation due to a buggy restore.  Just blocking or dropping packets
> shouldn't result in an RST unless the interruption is very long.
> 
>   --  John
> 
John Dykstra July 29, 2009, 12:28 a.m. UTC | #12
On Tue, 2009-07-28 at 09:00 -0700, Dan Smith wrote:
> I think that we need to increment timers like this on restore anyway,
> which should make sure that the TIME_WAIT timers run for the same
> amount of wall-clock time regardless of how much time was spent in the
> process of migration, right?

Agreed.

> Since checkpoint is not aware of a potential migration, a regular
> (i.e. not intended for migration) checkpoint operation on a task
> running in the init netns will leave the TIME_WAIT socket in place
> until the timer expires.

Can this relationship be counted on--that checkpoints in the init
namespace will be stored indefinitely, and that otherwise the
checkpointed tree will be migrated in a bounded amount of time?  I think
you'll need to differentiate the two cases to get proper network
behavior, so there needs to be some reliable and documented indicator.

Or should this be an advise option?

> >> +		if (h->sock.state == TCP_LISTEN) {
> >> +			ret = socket->ops->listen(socket, h->sock.backlog);
> >> +			ckpt_debug("inet listen: %i", ret);
> >> +			if (ret < 0)
> >> +				goto out;
> >> +
> >> +			ret = sock_add_parent(ctx, socket->sk);
> >> +			if (ret < 0)
> >> +				goto out;
> 
> JD> So this is just dummied off as a proof-of-concept for LISTEN?
> 
> I'm not sure what you mean...

It's not as functional as the rest of the code.  The listening address
is hardwired and you're not restoring the socket state.  I'm assuming
that was intended as a development step.
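
For completeness, a sketch of the sort of thing the listen path would also
need (mirroring what the non-listening branch already does; not part of the
posted patch):

	/* restore the generic sock state before bind()/listen() */
	ret = sock_cptrst(ctx, socket->sk, h, CKPT_RST);
	if (ret)
		goto out;
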

> JD> At a minimum, you'll want to start the TCP retransmit timer if there is
> JD> unacknowledged data outstanding.  And other timers for other states, as
> JD> they're supported.
> 
> JD> And you probably do want to do slow-start again--disregard my babbling
> JD> from yesterday. 

It seems there's a wide range of requirements here.  A tree migrating
within a rack doesn't want to do slow-start again, nor restart its RTT
metrics.  But in the general case, you do need to.  Maybe the scaling of
congestion window and RTT on restore should be a tunable.
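
One possible shape for such a knob on the restore side (a sketch only; the
ckpt_tcp_reset_cwnd flag is hypothetical, and the fields are those of the
2.6.30-era tcp_sock):

	/* hypothetical tunable: when set, discard congestion state so the
	 * restored connection re-probes the path from scratch
	 */
	if (ckpt_tcp_reset_cwnd) {
		struct tcp_sock *tp = tcp_sk(socket->sk);

		tp->snd_cwnd = 2;		/* conservative initial window */
		tp->snd_ssthresh = 0x7fffffff;	/* "infinite", forces slow-start */
		tp->srtt = 0;			/* RTT will be re-measured */
	}
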

> JD>         	ret = sock_create(h->sock_common.family, h->sock.type, 0, &socket);
> JD>         	if (ret < 0)
> JD>         		return ERR_PTR(ret);
>         
> 
> JD> You _really_ want to pass the actual protocol number to sock_create().
> JD> The size of the sk it creates depends on this.  You'll quickly be in
> JD> memory corruption hell without it.
> 
> You mean I need to verify that the protocol is one of IPPROTO_TCP,
> IPPROTO_UDP, or PF_UNIX, right?

Right now, you're passing zero.  For inet and inet6 sockets, you want to
pass the num field from ckpt_hdr_socket_inet.
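
For illustration, a sketch of that change (using the protocol number that
the generic socket header in this patch already carries; not the posted
code):

	ret = sock_create(h->sock_common.family, h->sock.type,
			  h->sock.protocol, &socket);
	if (ret < 0)
		return ERR_PTR(ret);
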

> JD> Network sockets also use sk->sk_wmem_alloc to track references to
> JD> the sock from egress skb's
> 
> IIUC, this function is part of the framework's leak detection.  That
> code looks to make sure objects don't gain any additional users
> between the start and end of the checkpoint operation.  I think the
> sk_refcnt satisfies that requirement here, no?

As long as the function is only used for this, you're fine.  But it's
named very generally, and if it gets used in the future for something
else, there might be a hole.

  --  John

John Dykstra July 29, 2009, 10:10 p.m. UTC | #13
On Tue, 2009-07-28 at 13:07 -0400, Oren Laadan wrote:
> When you destroy the virtual interface after (or during) the
> > migration, the TIME_WAIT socket will disappear from the sending
> host.
> 
> Note that such sockets are unlikely to be referenced by _any_ process
> because they will have been closed already. So the checkpoint code
> will need to find such sockets not through file descriptors.

Dan, to maybe save you some searching...  The time-wait chains can be
accessed via tcp_hashinfo.
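
Roughly along these lines (a sketch against the 2.6.30-era structures, not
part of the patch; names like twchain and ehash_size may differ on other
kernel versions):

	int i;

	for (i = 0; i < tcp_hashinfo.ehash_size; i++) {
		spinlock_t *lock = inet_ehash_lockp(&tcp_hashinfo, i);
		struct hlist_nulls_node *node;
		struct inet_timewait_sock *tw;

		spin_lock_bh(lock);
		inet_twsk_for_each(tw, node, &tcp_hashinfo.ehash[i].twchain) {
			/* checkpoint tw->tw_daddr, tw->tw_dport, the
			 * remaining timeout, etc. for sockets in the
			 * right netns
			 */
		}
		spin_unlock_bh(lock);
	}
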

  -- John

John Dykstra July 31, 2009, 7:35 p.m. UTC | #14
On Tue, 2009-07-07 at 12:26 -0700, Dan Smith wrote:
> +       CKPT_COPY(op, hh->tcp.last_synq_overflow, sk->last_synq_overflow);

This sock field was removed from the mainline kernel by commit
a0f82f64e26929776c58a5c93c2ecb38e3d82815.

  --  John

Dan Smith July 31, 2009, 7:40 p.m. UTC | #15
>> +       CKPT_COPY(op, hh->tcp.last_synq_overflow, sk->last_synq_overflow);

JD> This sock field was removed from the mainline kernel by commit
JD> a0f82f64e26929776c58a5c93c2ecb38e3d82815.

Ah, thanks for this.  I'm trying to get the UNIX patch updated with
the latest round of feedback and then I'll take this and the other
comments forward with the INET patch.

Thanks!
diff mbox

Patch

diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 38a5299..b6f18ea 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -242,6 +242,8 @@  static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
 	INIT_LIST_HEAD(&ctx->pgarr_pool);
 	init_waitqueue_head(&ctx->waitq);
 
+	INIT_LIST_HEAD(&ctx->listen_sockets);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index f59b071..16e21ee 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -93,6 +93,7 @@  enum {
 	CKPT_HDR_SOCKET_BUFFERS,
 	CKPT_HDR_SOCKET_BUFFER,
 	CKPT_HDR_SOCKET_UNIX,
+	CKPT_HDR_SOCKET_INET,
 
 	CKPT_HDR_TAIL = 9001,
 
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 27fbe26..d7db190 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -60,6 +60,8 @@  struct ckpt_ctx {
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
 
+	struct list_head listen_sockets;/* listening parent sockets */
+
 	/* [multi-process checkpoint] */
 	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
 	int nr_tasks;                   /* size of tasks array */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index e7d64eb..c723d96 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -334,6 +334,108 @@  struct ckpt_hdr_socket_unix {
 	__u32 flags;
 } __attribute__ ((aligned(8)));
 
+struct ckpt_hdr_socket_inet {
+	struct ckpt_hdr h;
+
+	__u32 daddr;
+	__u32 rcv_saddr;
+	__u32 saddr;
+	__u16 dport;
+	__u16 num;
+	__u16 sport;
+	__s16 uc_ttl;
+	__u16 cmsg_flags;
+	__u16 __pad;
+
+	struct {
+		__u64 timeout;
+		__u32 ato;
+		__u32 lrcvtime;
+		__u16 last_seg_size;
+		__u16 rcv_mss;
+		__u8 pending;
+		__u8 quick;
+		__u8 pingpong;
+		__u8 blocked;
+	} icsk_ack __attribute__ ((aligned(8)));
+
+	/* FIXME: Skipped opt, tos, multicast, cork settings */
+
+	struct {
+		__u64 last_synq_overflow;
+
+		__u32 rcv_nxt;
+		__u32 copied_seq;
+		__u32 rcv_wup;
+		__u32 snd_nxt;
+		__u32 snd_una;
+		__u32 snd_sml;
+		__u32 rcv_tstamp;
+		__u32 lsndtime;
+
+		__u32 snd_wl1;
+		__u32 snd_wnd;
+		__u32 max_window;
+		__u32 mss_cache;
+		__u32 window_clamp;
+		__u32 rcv_ssthresh;
+		__u32 frto_highmark;
+
+		__u32 srtt;
+		__u32 mdev;
+		__u32 mdev_max;
+		__u32 rttvar;
+		__u32 rtt_seq;
+
+		__u32 packets_out;
+		__u32 retrans_out;
+
+		__u32 snd_up;
+		__u32 rcv_wnd;
+		__u32 write_seq;
+		__u32 pushed_seq;
+		__u32 lost_out;
+		__u32 sacked_out;
+		__u32 fackets_out;
+		__u32 tso_deferred;
+		__u32 bytes_acked;
+
+		__s32 lost_cnt_hint;
+		__u32 retransmit_high;
+
+		__u32 lost_retrans_low;
+
+		__u32 prior_ssthresh;
+		__u32 high_seq;
+
+		__u32 retrans_stamp;
+		__u32 undo_marker;
+		__s32 undo_retrans;
+		__u32 total_retrans;
+
+		__u32 urg_seq;
+		__u32 keepalive_time;
+		__u32 keepalive_intvl;
+
+		__u16 urg_data;
+		__u16 advmss;
+		__u8 frto_counter;
+		__u8 nonagle;
+
+		__u8 ecn_flags;
+		__u8 reordering;
+
+		__u8 keepalive_probes;
+	} tcp __attribute__ ((aligned(8)));
+
+	struct {
+		char saddr[16];
+		char rcv_saddr[16];
+		char daddr[16];
+	} inet6 __attribute__ ((aligned(8)));
+
+} __attribute__ ((aligned(8)));
+
 struct ckpt_hdr_socket {
 	struct ckpt_hdr h;
 
diff --git a/net/checkpoint.c b/net/checkpoint.c
index 0ff1656..ee069fa 100644
--- a/net/checkpoint.c
+++ b/net/checkpoint.c
@@ -16,16 +16,69 @@ 
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 #include <linux/fs_struct.h>
+#include <linux/tcp.h>
+#include <linux/in.h>
 
 #include <net/af_unix.h>
 #include <net/tcp_states.h>
+#include <net/tcp.h>
 
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
+#include <linux/deferqueue.h>
 
 /* Size of an empty struct sockaddr_un */
 #define UNIX_LEN_EMPTY 2
 
+struct ckpt_parent_sock {
+	struct sock *sock;
+	__u32 oref;
+	struct list_head list;
+};
+
+static int sock_add_parent(struct ckpt_ctx *ctx, struct sock *sock)
+{
+	struct ckpt_parent_sock *parent;
+	__u32 objref;
+	int new;
+
+	objref = ckpt_obj_lookup_add(ctx, sock, CKPT_OBJ_SOCK, &new);
+	if (objref < 0)
+		return objref;
+	else if (!new)
+		return 0;
+
+	parent = kmalloc(sizeof(*parent), GFP_KERNEL);
+	if (!parent)
+		return -ENOMEM;
+
+	/* The deferqueue is processed before the objhash is free()'d,
+	 * thus the objhash holds a reference to sock for us
+	 */
+	parent->sock = sock;
+	parent->oref = objref;
+	INIT_LIST_HEAD(&parent->list);
+
+	list_add(&parent->list, &ctx->listen_sockets);
+
+	return 0;
+}
+
+static struct sock *sock_get_parent(struct ckpt_ctx *ctx, struct sock *sock)
+{
+	struct ckpt_parent_sock *parent;
+	struct inet_sock *c = inet_sk(sock);
+
+	list_for_each_entry(parent, &ctx->listen_sockets, list) {
+		struct inet_sock *p = inet_sk(parent->sock);
+
+		if (c->sport == p->sport)
+			return parent->sock;
+	}
+
+	return NULL;
+}
+
 static inline int sock_unix_need_cwd(struct sockaddr_un *a)
 {
 	return (a->sun_path[0] != '/');
@@ -55,17 +108,23 @@  static int sock_copy_buffers(struct sk_buff_head *from, struct sk_buff_head *to)
 }
 
 static int __sock_write_buffers(struct ckpt_ctx *ctx,
+				uint16_t family,
 				struct sk_buff_head *queue)
 {
 	struct sk_buff *skb;
 	int ret = 0;
 
 	skb_queue_walk(queue, skb) {
-		if (UNIXCB(skb).fp) {
+		if ((family == AF_UNIX) && UNIXCB(skb).fp) {
 			ckpt_write_err(ctx, "fd-passing is not supported");
 			return -EBUSY;
 		}
 
+		if (skb_shinfo(skb)->nr_frags) {
+			ckpt_write_err(ctx, "socket has fragments in-flight");
+			return -EBUSY;
+		}
+
 		ret = ckpt_write_obj_type(ctx, skb->data, skb->len,
 					  CKPT_HDR_SOCKET_BUFFER);
 		if (ret)
@@ -75,7 +134,9 @@  static int __sock_write_buffers(struct ckpt_ctx *ctx,
 	return 0;
 }
 
-static int sock_write_buffers(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
+static int sock_write_buffers(struct ckpt_ctx *ctx,
+			      uint16_t family,
+			      struct sk_buff_head *queue)
 {
 	struct ckpt_hdr_socket_buffer *h;
 	struct sk_buff_head tmpq;
@@ -95,7 +156,7 @@  static int sock_write_buffers(struct ckpt_ctx *ctx, struct sk_buff_head *queue)
 
 	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
 	if (!ret)
-		ret = __sock_write_buffers(ctx, &tmpq);
+		ret = __sock_write_buffers(ctx, family, &tmpq);
 
  out:
 	ckpt_hdr_put(ctx, h);
@@ -309,6 +370,166 @@  static int sock_cptrst(struct ckpt_ctx *ctx,
 		return 0;
 }
 
+static int sock_inet_tcp_cptrst(struct ckpt_ctx *ctx,
+				struct tcp_sock *sk,
+				struct ckpt_hdr_socket_inet *hh,
+				int op)
+{
+	CKPT_COPY(op, hh->tcp.rcv_nxt, sk->rcv_nxt);
+	CKPT_COPY(op, hh->tcp.copied_seq, sk->copied_seq);
+	CKPT_COPY(op, hh->tcp.rcv_wup, sk->rcv_wup);
+	CKPT_COPY(op, hh->tcp.snd_nxt, sk->snd_nxt);
+	CKPT_COPY(op, hh->tcp.snd_una, sk->snd_una);
+	CKPT_COPY(op, hh->tcp.snd_sml, sk->snd_sml);
+	CKPT_COPY(op, hh->tcp.rcv_tstamp, sk->rcv_tstamp);
+	CKPT_COPY(op, hh->tcp.lsndtime, sk->lsndtime);
+
+	CKPT_COPY(op, hh->tcp.snd_wl1, sk->snd_wl1);
+	CKPT_COPY(op, hh->tcp.snd_wnd, sk->snd_wnd);
+	CKPT_COPY(op, hh->tcp.max_window, sk->max_window);
+	CKPT_COPY(op, hh->tcp.mss_cache, sk->mss_cache);
+	CKPT_COPY(op, hh->tcp.window_clamp, sk->window_clamp);
+	CKPT_COPY(op, hh->tcp.rcv_ssthresh, sk->rcv_ssthresh);
+	CKPT_COPY(op, hh->tcp.frto_highmark, sk->frto_highmark);
+	CKPT_COPY(op, hh->tcp.advmss, sk->advmss);
+	CKPT_COPY(op, hh->tcp.frto_counter, sk->frto_counter);
+	CKPT_COPY(op, hh->tcp.nonagle, sk->nonagle);
+
+	CKPT_COPY(op, hh->tcp.srtt, sk->srtt);
+	CKPT_COPY(op, hh->tcp.mdev, sk->mdev);
+	CKPT_COPY(op, hh->tcp.mdev_max, sk->mdev_max);
+	CKPT_COPY(op, hh->tcp.rttvar, sk->rttvar);
+	CKPT_COPY(op, hh->tcp.rtt_seq, sk->rtt_seq);
+
+	CKPT_COPY(op, hh->tcp.packets_out, sk->packets_out);
+	CKPT_COPY(op, hh->tcp.retrans_out, sk->retrans_out);
+
+	CKPT_COPY(op, hh->tcp.urg_data, sk->urg_data);
+	CKPT_COPY(op, hh->tcp.ecn_flags, sk->ecn_flags);
+	CKPT_COPY(op, hh->tcp.reordering, sk->reordering);
+	CKPT_COPY(op, hh->tcp.snd_up, sk->snd_up);
+
+	CKPT_COPY(op, hh->tcp.keepalive_probes, sk->keepalive_probes);
+
+	CKPT_COPY(op, hh->tcp.rcv_wnd, sk->rcv_wnd);
+	CKPT_COPY(op, hh->tcp.write_seq, sk->write_seq);
+	CKPT_COPY(op, hh->tcp.pushed_seq, sk->pushed_seq);
+	CKPT_COPY(op, hh->tcp.lost_out, sk->lost_out);
+	CKPT_COPY(op, hh->tcp.sacked_out, sk->sacked_out);
+	CKPT_COPY(op, hh->tcp.fackets_out, sk->fackets_out);
+	CKPT_COPY(op, hh->tcp.tso_deferred, sk->tso_deferred);
+	CKPT_COPY(op, hh->tcp.bytes_acked, sk->bytes_acked);
+
+	CKPT_COPY(op, hh->tcp.lost_cnt_hint, sk->lost_cnt_hint);
+	CKPT_COPY(op, hh->tcp.retransmit_high, sk->retransmit_high);
+
+	CKPT_COPY(op, hh->tcp.lost_retrans_low, sk->lost_retrans_low);
+
+	CKPT_COPY(op, hh->tcp.prior_ssthresh, sk->prior_ssthresh);
+	CKPT_COPY(op, hh->tcp.high_seq, sk->high_seq);
+
+	CKPT_COPY(op, hh->tcp.retrans_stamp, sk->retrans_stamp);
+	CKPT_COPY(op, hh->tcp.undo_marker, sk->undo_marker);
+	CKPT_COPY(op, hh->tcp.undo_retrans, sk->undo_retrans);
+	CKPT_COPY(op, hh->tcp.total_retrans, sk->total_retrans);
+
+	CKPT_COPY(op, hh->tcp.urg_seq, sk->urg_seq);
+	CKPT_COPY(op, hh->tcp.keepalive_time, sk->keepalive_time);
+	CKPT_COPY(op, hh->tcp.keepalive_intvl, sk->keepalive_intvl);
+
+	CKPT_COPY(op, hh->tcp.last_synq_overflow, sk->last_synq_overflow);
+
+	return 0;
+}
+
+static int sock_inet_cptrst(struct ckpt_ctx *ctx,
+			    struct sock *sock,
+			    struct ckpt_hdr_socket_inet *hh,
+			  int op)
+{
+	struct inet_sock *sk = inet_sk(sock);
+	struct inet_connection_sock *icsk = inet_csk(sock);
+	int ret;
+
+	CKPT_COPY(op, hh->daddr, sk->daddr);
+	CKPT_COPY(op, hh->rcv_saddr, sk->rcv_saddr);
+	CKPT_COPY(op, hh->dport, sk->dport);
+	CKPT_COPY(op, hh->num, sk->num);
+	CKPT_COPY(op, hh->saddr, sk->saddr);
+	CKPT_COPY(op, hh->sport, sk->sport);
+	CKPT_COPY(op, hh->uc_ttl, sk->uc_ttl);
+	CKPT_COPY(op, hh->cmsg_flags, sk->cmsg_flags);
+
+	CKPT_COPY(op, hh->icsk_ack.pending, icsk->icsk_ack.pending);
+	CKPT_COPY(op, hh->icsk_ack.quick, icsk->icsk_ack.quick);
+	CKPT_COPY(op, hh->icsk_ack.pingpong, icsk->icsk_ack.pingpong);
+	CKPT_COPY(op, hh->icsk_ack.blocked, icsk->icsk_ack.blocked);
+	CKPT_COPY(op, hh->icsk_ack.ato, icsk->icsk_ack.ato);
+	CKPT_COPY(op, hh->icsk_ack.timeout, icsk->icsk_ack.timeout);
+	CKPT_COPY(op, hh->icsk_ack.lrcvtime, icsk->icsk_ack.lrcvtime);
+	CKPT_COPY(op,
+		hh->icsk_ack.last_seg_size, icsk->icsk_ack.last_seg_size);
+	CKPT_COPY(op, hh->icsk_ack.rcv_mss, icsk->icsk_ack.rcv_mss);
+
+	if (sock->sk_protocol == IPPROTO_TCP)
+		ret = sock_inet_tcp_cptrst(ctx, tcp_sk(sock), hh, op);
+	else if (sock->sk_protocol == IPPROTO_UDP)
+		ret = 0;
+	else {
+		ckpt_write_err(ctx, "unknown socket protocol %d",
+			       sock->sk_protocol);
+		ret = -EINVAL;
+	}
+
+	if (sock->sk_family == AF_INET6) {
+		struct ipv6_pinfo *inet6 = inet6_sk(sock);
+		unsigned alen = sizeof(struct in6_addr);
+		if (op == CKPT_CPT) {
+			memcpy(hh->inet6.saddr, &inet6->saddr, alen);
+			memcpy(hh->inet6.rcv_saddr, &inet6->rcv_saddr, alen);
+			memcpy(hh->inet6.daddr, &inet6->daddr, alen);
+		} else {
+			memcpy(&inet6->saddr, hh->inet6.saddr, alen);
+			memcpy(&inet6->rcv_saddr, hh->inet6.rcv_saddr, alen);
+			memcpy(&inet6->daddr, hh->inet6.daddr, alen);
+		}
+	}
+
+	return ret;
+}
+
+static int sock_inet_checkpoint(struct ckpt_ctx *ctx,
+				struct sock *sock,
+				struct ckpt_hdr_socket *h)
+{
+	int ret = -EINVAL;
+	struct ckpt_hdr_socket_inet *in;
+
+	if (sock->sk_protocol == IPPROTO_TCP) {
+		struct tcp_sock *tsock = tcp_sk(sock);
+		if (!skb_queue_empty(&tsock->out_of_order_queue)) {
+			ckpt_write_err(ctx, "TCP socket has out-of-order data");
+			return -EBUSY;
+		}
+	}
+
+	in = ckpt_hdr_get_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET);
+	if (!in)
+		goto out;
+
+	ret = sock_inet_cptrst(ctx, sock, in, CKPT_CPT);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	if (ret < 0)
+		goto out;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) in);
+ out:
+	return ret;
+}
+
 int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 {
 	struct socket *socket = file->private_data;
@@ -349,6 +570,11 @@  int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 		ret = sock_unix_checkpoint(ctx, sock, h);
 		if (ret)
 			goto out;
+	} else if ((sock->sk_family == AF_INET) ||
+		   (sock->sk_family == AF_INET6)) {
+		ret = sock_inet_checkpoint(ctx, sock, h);
+		if (ret)
+			goto out;
 	} else {
 		ckpt_write_err(ctx, "unsupported socket family %i",
 			       sock->sk_family);
@@ -357,11 +583,13 @@  int do_sock_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 	}
 
 	if (sock->sk_state != TCP_LISTEN) {
-		ret = sock_write_buffers(ctx, &sock->sk_receive_queue);
+		uint16_t family = sock->sk_family;
+
+		ret = sock_write_buffers(ctx, family, &sock->sk_receive_queue);
 		if (ret)
 			goto out;
 
-		ret = sock_write_buffers(ctx, &sock->sk_write_queue);
+		ret = sock_write_buffers(ctx, family, &sock->sk_write_queue);
 		if (ret)
 			goto out;
 	}
@@ -677,6 +905,101 @@  static int sock_unix_restart(struct ckpt_ctx *ctx,
 	return ret;
 }
 
+struct dq_sock {
+	struct sock *sock;
+	struct ckpt_ctx *ctx;
+};
+
+static int __sock_hash_parent(void *data)
+{
+	struct dq_sock *dq = (struct dq_sock *)data;
+	struct sock *parent;
+
+	dq->sock->sk_prot->hash(dq->sock);
+
+	parent = sock_get_parent(dq->ctx, dq->sock);
+	if (parent) {
+		inet_sk(dq->sock)->num = ntohs(inet_sk(dq->sock)->sport);
+		local_bh_disable();
+		__inet_inherit_port(parent, dq->sock);
+		local_bh_enable();
+	} else {
+		inet_sk(dq->sock)->num = 0;
+		inet_hash_connect(&tcp_death_row, dq->sock);
+		inet_sk(dq->sock)->num = ntohs(inet_sk(dq->sock)->sport);
+	}
+
+	return 0;
+}
+
+static int sock_defer_hash(struct ckpt_ctx *ctx, struct sock *sock)
+{
+	struct dq_sock dq;
+
+	dq.sock = sock;
+	dq.ctx = ctx;
+
+	return deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+			      __sock_hash_parent, __sock_hash_parent);
+}
+
+static int sock_inet_restart(struct ckpt_ctx *ctx,
+			     struct ckpt_hdr_socket *h,
+			     struct socket *socket)
+{
+	int ret;
+	struct ckpt_hdr_socket_inet *in;
+	struct sockaddr_in *l = (struct sockaddr_in *)&h->laddr;
+
+	in = ckpt_read_obj_type(ctx, sizeof(*in), CKPT_HDR_SOCKET_INET);
+	if (IS_ERR(in))
+		return PTR_ERR(in);
+
+	/* Listening sockets and those that are closed but have a local
+	 * address need to call bind()
+	 */
+	if ((h->sock.state == TCP_LISTEN) ||
+	    ((h->sock.state == TCP_CLOSE) && (h->laddr_len > 0))) {
+		socket->sk->sk_reuse = 2;
+		inet_sk(socket->sk)->freebind = 1;
+		ret = socket->ops->bind(socket,
+					(struct sockaddr *)l,
+					h->laddr_len);
+		ckpt_debug("inet bind: %i", ret);
+		if (ret < 0)
+			goto out;
+
+		if (h->sock.state == TCP_LISTEN) {
+			ret = socket->ops->listen(socket, h->sock.backlog);
+			ckpt_debug("inet listen: %i", ret);
+			if (ret < 0)
+				goto out;
+
+			ret = sock_add_parent(ctx, socket->sk);
+			if (ret < 0)
+				goto out;
+		}
+	} else {
+		ret = sock_cptrst(ctx, socket->sk, h, CKPT_RST);
+		if (ret)
+			goto out;
+
+		ret = sock_inet_cptrst(ctx, socket->sk, in, CKPT_RST);
+		if (ret)
+			goto out;
+
+		if ((h->sock.state == TCP_ESTABLISHED) &&
+		    (h->sock.protocol == IPPROTO_TCP))
+			/* Delay hashing this sock until the end so we can
+			 * hook it up with its parent (if appropriate)
+			 */
+			ret = sock_defer_hash(ctx, socket->sk);
+	}
+
+  out:
+	return ret;
+ }
+
 struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
 				    struct ckpt_hdr_socket *h)
 {
@@ -690,6 +1013,10 @@  struct socket *do_sock_file_restore(struct ckpt_ctx *ctx,
 	if (h->sock_common.family == AF_UNIX) {
 		ret = sock_unix_restart(ctx, h, socket);
 		ckpt_debug("sock_unix_restart: %i\n", ret);
+	} else if ((h->sock_common.family == AF_INET) ||
+		   (h->sock_common.family == AF_INET6)) {
+		ret = sock_inet_restart(ctx, h, socket);
+		ckpt_debug("sock_inet_restart: %i\n", ret);
 	} else {
 		ckpt_write_err(ctx, "unsupported family %i\n",
 			       h->sock_common.family);