[C/R,v20,76/96] c/r: Add AF_UNIX support (v12)

From: Dan Smith <danms@us.ibm.com>

From: Dan Smith <danms@us.ibm.com>

This patch adds basic checkpoint/restart support for AF_UNIX sockets.  It
has been tested with a single and multiple processes, and with data inflight
at the time of checkpoint.  It supports socketpair()s, path-based, and
abstract sockets.

Changes in ckpt-v19:
  - [Serge Hallyn] skb->tail can be offset

Changes in ckpt-v19-rc3:
  - Rebase to kernel 2.6.33: export and leverage sock_alloc_file()
  - [Nathan Lynch] Fix net/checkpoint.c for 64-bit

Changes in ckpt-v19-rc2:
  - Change select uses of ckpt_debug() to ckpt_err() in net c/r
  - [Dan Smith] Unify skb read/write functions and handle fragmented buffers
  - [Dan Smith] Update buffer restore code to match the new format

Changes in ckpt-v19-rc1:
  - [Dan Smith] Fix compile issue with CONFIG_CHECKPOINT=n
  - [Dan Smith] Remove an unnecessary check on socket restart
  - [Matt Helsley] Add cpp definitions for enums
  - [Dan Smith] Pass the stored sock->protocol into sock_create() on restore

Changes in v12:
  - Collect sockets for leak-detection
  - Adjust socket reference count during leak detection phase

Changes in v11:
  - Create a struct socket for orphan socket during checkpoint
  - Make sockets proper objhash objects and use checkpoint_obj() on them
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - Remove struct timeval from socket header
  - Save and restore UNIX socket peer credentials
  - Set socket flags on restore using sock_setsockopt() where possible
  - Fail on the TIMESTAMPING_* flags for the moment (with a TODO)
  - Remove other explicit flag checks that are no longer copied blindly
  - Changed functions/variables names to follow existing conventions
  - Use proto_ops->{checkpoint,restart} methods for af_unix
  - Cleanup sock_file_restore()/sock_file_checkpoint()
  - Make ckpt_hdr_socket be part of ckpt_hdr_file_socket
  - Fold do_sock_file_checkpoint() into sock_file_checkpoint()
  - Fold do_sock_file_restore() into sock_file_restore()
  - Move sock_file_{checkpoint,restore} to net/checkpoint.c
  - Properly define sock_file_{checkpoint,restore} in header file
  - sock_file_restore() now calls restore_file_common()

Changes in v10:
  - Moved header structure definitions back to checkpoint_hdr.h
  - Moved AF_UNIX checkpoint/restart code to net/unix/checkpoint.c
  - Make sock_unix_*() functions only compile if CONFIG_UNIX=y
  - Add TODO for CONFIG_UNIX=m case

Changes in v9:
  - Fix double-free of skb's in the list and target holding queue in the
    error path of sock_copy_buffers()
  - Adjust use of ckpt_read_string() to match new signature

Changes in v8:
  - Fix stale dev_alloc_skb() from before the conversion to skb_clone()
  - Fix a couple of broken error paths
  - Fix memory leak of kvec.iov_base on successful return from sendmsg()
  - Fix condition for deciding when to run sock_cptrst_verify()
  - Fix buffer queue copy algorithm to hold the lock during walk(s)
  - Log the errno when either getname() or getpeer() fails
  - Add comments about ancillary messages in the UNIX queue
  - Add TODO comments for credential restore and flags via setsockopt()
  - Add TODO comment about strangely-connected dgram sockets and the use
    of sendmsg(peer)

Changes in v7:
  - Fix failure to free iov_base in error path of sock_read_buffer()
  - Change sock_read_buffer() to use _ckpt_read_obj_type() to get the
    header length and then use ckpt_kread() directly to read the payload
  - Change sock_read_buffers() to sock_unix_read_buffers() and break out
    some common functionality to better accommodate the subsequent INET
    patch
  - Generalize sock_unix_getnames() into sock_getnames() so INET can use it
  - Change skb_morph() to skb_clone() which uses the more common path and
    still avoids the copy
  - Add check to validate the socket type before creating socket
    on restore
  - Comment the CAP_NET_ADMIN override in sock_read_buffer_hdr
  - Strengthen the comment about priming the buffer limits
  - Change the objhash functions to deny direct checkpoint of sockets and
    remove the reference counting function
  - Change SOCKET_BUFFERS to SOCKET_QUEUE
  - Change this,peer objrefs to signed integers
  - Remove names from internal socket structures
  - Fix handling of sock_copy_buffers() result
  - Use ckpt_fill_fname() instead of d_path() for writing CWD
  - Use sock_getname() and sock_getpeer() for proper security hookage
  - Return -ENOSYS for unsupported socket families in checkpoint and restart
  - Use sock_setsockopt() and sock_getsockopt() where possible to save and
    restore socket option values
  - Check for SOCK_DESTROY flag in the global verify function because none
    of our supported socket types use it
  - Check for SOCK_USE_WRITE_QUEUE in AF_UNIX restore function because
    that flag should not be used on such a socket
  - Check socket state in UNIX restart path to validate the subset of valid
    values

Changes in v6:
  - Moved the socket addresses to the per-type header
  - Eliminated the HASCWD flag
  - Remove use of ckpt_write_err() in restart paths
  - Change the order in which buffers are read so that we can set the
    socket's limit equal to the size of the image's buffers (if appropriate)
    and then restore the original values afterwards.
  - Use the ckpt_validate_errno() helper
  - Add a check to make sure that we didn't restore a (UNIX) socket with
    any skb's in the send buffer
  - Fix up sock_unix_join() to not leave addr uninitialized for socketpair
  - Remove inclusion of checkpoint_hdr.h in the socket files
  - Make sock_unix_write_cwd() use ckpt_write_string() and use the new
    ckpt_read_string() for reading the cwd
  - Use the restored realcred credentials in sock_unix_join()
  - Fix error path of the chdir_and_bind
  - Change the algorithm for reloading the socket buffers to use sendmsg()
    on the socket's peer for better accounting
  - For DGRAM sockets, check the backlog value against the system max
    to avoid letting a restart bypass the overloaded queue length
  - Use sock_bind() instead of sock->ops->bind() to gain the security hook
  - Change "restart" to "restore" in some of the function names

Changes in v5:
  - Change laddr and raddr buffers in socket header to be long enough
    for INET6 addresses
  - Place socket.c and sock.h function definitions inside #ifdef
    CONFIG_CHECKPOINT
  - Add explicit check in sock_unix_makeaddr() to refuse if the
    checkpoint image specifies an addr length of 0
  - Split sock_unix_restart() into a few pieces to facilitate:
  - Changed behavior of the unix restore code so that unlinked LISTEN
    sockets don't do a bind()...unlink()
  - Save the base path of a bound socket's path so that we can chdir()
    to the base before bind() if it is a relative path
  - Call bind() for any socket that is not established but has a
    non-zero-length local address
  - Enforce the current sysctl limit on socket buffer size during restart
    unless the user holds CAP_NET_ADMIN
  - Unlink a path-based socket before calling bind()

Changes in v4:
  - Changed the signdness of rcvlowat, rcvtimeo, sndtimeo, and backlog
    to match their struct sock definitions.  This should avoid issues
    with sign extension.
  - Add a sock_cptrst_verify() function to be run at restore time to
    validate several of the values in the checkpoint image against
    limits, flag masks, etc.
  - Write an error string with ctk_write_err() in the obscure cases
  - Don't write socket buffers for listen sockets
  - Sanity check address lengths before we agree to allocate memory
  - Check the result of inserting the peer object in the objhash on
    restart
  - Check return value of sock_cptrst() on restart
  - Change logic in remote getname() phase of checkpoint to not fail for
    closed (et al) sockets
  - Eliminate the memory copy while reading socket buffers on restart

Changes in v3:
  - Move sock_file_checkpoint() above sock_file_restore()
  - Change __sock_file_*() functions to do_sock_file_*()
  - Adjust some of the struct cr_hdr_socket alignment
  - Improve the sock_copy_buffers() algorithm to avoid locking the source
    queue for the entire operation
  - Fix alignment in the socket header struct(s)
  - Move the per-protocol structure (ckpt_hdr_socket_un) out of the
    common socket header and read/write it separately
  - Fix missing call to sock_cptrst() in restore path
  - Break out the socket joining into another function
  - Fix failure to restore the socket address thus fixing getname()
  - Check the state values on restart
  - Fix case of state being TCP_CLOSE, which allows dgram sockets to be
    properly connected (if appropriate) to their peer and maintain the
    sockaddr for getname() operation
  - Fix restoring a listening socket that has been unlink()'d
  - Fix checkpointing sockets with an in-flight FD-passing SKB.  Fail
    with EBUSY.
  - Fix checkpointing listening sockets with an unaccepted connection.
    Fail with EBUSY.
  - Changed 'un' to 'unix' in function and structure names

Changes in v2:
  - Change GFP_KERNEL to GFP_ATOMIC in sock_copy_buffers() (this seems
    to be rather common in other uses of skb_copy())
  - Move the ckpt_hdr_socket structure definition to linux/socket.h
  - Fix whitespace issue
  - Move sock_file_checkpoint() to net/socket.c for symmetry

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: netdev@vger.kernel.org
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/files.c             |    7 +
 checkpoint/objhash.c           |   69 +++
 include/linux/checkpoint.h     |    8 +
 include/linux/checkpoint_hdr.h |  117 +++++-
 include/linux/net.h            |    2 +
 include/net/af_unix.h          |   15 +
 include/net/sock.h             |   12 +
 net/Makefile                   |    2 +
 net/checkpoint.c               |  983 ++++++++++++++++++++++++++++++++++++++++
 net/socket.c                   |    6 +-
 net/unix/Makefile              |    1 +
 net/unix/af_unix.c             |    9 +
 net/unix/checkpoint.c          |  647 ++++++++++++++++++++++++++
 13 files changed, 1872 insertions(+), 6 deletions(-)
 create mode 100644 net/checkpoint.c
 create mode 100644 net/unix/checkpoint.c

Message ID	1268842164-5590-77-git-send-email-orenl@cs.columbia.edu
State	Not Applicable, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 87E6CB7CFD for <patchwork-incoming@ozlabs.org>; Thu, 18 Mar 2010 03:24:28 +1100 (EST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755473Ab0CQQXd (ORCPT <rfc822;patchwork-incoming@ozlabs.org>); Wed, 17 Mar 2010 12:23:33 -0400 Received: from serrano.cc.columbia.edu ([128.59.29.6]:39543 "EHLO serrano.cc.columbia.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755393Ab0CQQX1 (ORCPT <rfc822;netdev@vger.kernel.org>); Wed, 17 Mar 2010 12:23:27 -0400 Received: from localhost.localdomain (dejaview.cs.columbia.edu [128.59.22.193]) (user=ol2104 mech=PLAIN bits=0) by serrano.cc.columbia.edu (8.14.3/8.14.3) with ESMTP id o2HG9Q9J017295 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Wed, 17 Mar 2010 12:21:17 -0400 (EDT) From: Oren Laadan <orenl@cs.columbia.edu> To: Andrew Morton <akpm@linux-foundation.org> Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-api@vger.kernel.org, Serge Hallyn <serue@us.ibm.com>, Ingo Molnar <mingo@elte.hu>, containers@lists.linux-foundation.org, Dan Smith <danms@us.ibm.com>, Alexey Dobriyan <adobriyan@gmail.com>, netdev@vger.kernel.org, Oren Laadan <orenl@cs.columbia.edu> Subject: [C/R v20][PATCH 76/96] c/r: Add AF_UNIX support (v12) Date: Wed, 17 Mar 2010 12:09:04 -0400 Message-Id: <1268842164-5590-77-git-send-email-orenl@cs.columbia.edu> X-Mailer: git-send-email 1.6.3.3 In-Reply-To: <1268842164-5590-76-git-send-email-orenl@cs.columbia.edu> References: <1268842164-5590-1-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-2-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-3-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-4-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-5-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-6-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-7-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-8-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-9-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-10-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-11-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-12-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-13-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-14-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-15-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-16-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-17-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-18-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-19-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-20-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-21-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-22-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-23-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-24-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-25-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-26-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-27-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-28-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-29-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-30-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-31-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-32-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-33-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-34-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-35-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-36-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-37-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-38-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-39-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-40-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-41-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-42-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-43-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-44-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-45-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-46-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-47-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-48-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-49-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-50-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-51-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-52-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-53-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-54-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-55-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-56-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-57-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-58-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-59-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-60-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-61-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-62-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-63-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-64-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-65-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-66-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-67-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-68-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-69-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-70-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-71-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-72-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-73-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-74-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-75-git-send-email-orenl@cs.columbia.edu> <1268842164-5590-76-git-send-email-orenl@cs.columbia.edu> X-No-Spam-Score: Local X-Scanned-By: MIMEDefang 2.68 on 128.59.29.6 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: <netdev.vger.kernel.org> X-Mailing-List: netdev@vger.kernel.org

[C/R,v20,76/96] c/r: Add AF_UNIX support (v12)

Commit Message

Patch