
[bpf-next,05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

Message ID 20180305195122.6612.78322.stgit@john-Precision-Tower-5810
State Changes Requested, archived
Delegated to: BPF Maintainers
Series bpf,sockmap: sendmsg/sendfile ULP

Commit Message

John Fastabend March 5, 2018, 7:51 p.m. UTC
This implements a BPF ULP layer to allow policy enforcement and
monitoring at the socket layer. To support this, a new program type,
BPF_PROG_TYPE_SK_MSG, is used to run the policy at the sendmsg/sendpage
hooks. The policy is attached to sockets via a sockmap, using a new
program attach type, BPF_SK_MSG_VERDICT.

As with previous sockmap usage, when a sock is added to a sockmap via
a map update and the map has a BPF_SK_MSG_VERDICT program attached,
the BPF ULP layer is created on the socket and the attached
BPF_PROG_TYPE_SK_MSG program is run for every msg in the sendmsg case
and for every page/offset in the sendpage case.

BPF_PROG_TYPE_SK_MSG Semantics/API:

BPF_PROG_TYPE_SK_MSG supports only two return codes, SK_PASS and
SK_DROP. Returning SK_DROP frees the copied data in the sendmsg case
and leaves the data untouched in the sendpage case; both cases return
-EACCES to the user. Returning SK_PASS allows the msg to be sent.

In the sendmsg case data is copied into kernel space buffers before
running the BPF program. In the sendpage case data is never copied,
so users may change the data after the BPF program runs. (A flag will
be added shortly for policies that require the copy to always be
performed.)

The verdict from the BPF_PROG_TYPE_SK_MSG program applies to the entire
msg in the sendmsg() case and to the entire page/offset in the sendpage
case. This avoids ambiguity about how to handle mixed return codes in
the sendmsg case. The readable/writable data provided to the program in
the sendmsg case may not be the entire message; in fact, for large sends
this is likely the case. The data range that can be read is part of the
sk_msg_md structure because, similar to the tc cls_bpf case, the data is
stored in a scatter gather list. Future work will address this
shortcoming and allow users to pull in more data if needed (similar to
TC BPF).

The helper msg_redirect_map() can be used to select the socket to send
the data on. It is used similarly to the existing redirect use cases
and allows a policy to redirect msgs.
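
For illustration, a minimal verdict program on the BPF side might look
roughly like the sketch below. This is modeled on the selftests style
and is not part of this patch: the map definition macro, section name,
toy policy and the bpf_msg_redirect_map() wrapper around the
msg_redirect_map helper are assumptions of the sketch.

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  // hypothetical sockmap holding redirect targets
  struct bpf_map_def SEC("maps") my_sock_map = {
          .type        = BPF_MAP_TYPE_SOCKMAP,
          .key_size    = sizeof(int),
          .value_size  = sizeof(int),
          .max_entries = 20,
  };

  SEC("sk_msg")
  int msg_prog(struct sk_msg_md *msg)
  {
          void *data_end = (void *)(long)msg->data_end;
          void *data = (void *)(long)msg->data;

          // Bounds check before reading; only part of a large msg may
          // be visible here (first scatterlist element in sendmsg).
          if (data + 1 > data_end)
                  return SK_PASS;

          // Toy policy: drop anything that does not start with 'G'
          // (e.g. an HTTP GET), otherwise redirect the msg to the
          // socket stored at index 0 of the map.
          if (*(char *)data != 'G')
                  return SK_DROP;

          return bpf_msg_redirect_map(msg, &my_sock_map, 0, 0);
  }

  char _license[] SEC("license") = "GPL";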

Pseudo code simple example:

The basic logic to attach a program to a socket is as follows,

  // load the programs
  bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
		&obj, &msg_prog);

  // lookup the sockmap
  bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

  // get fd for sockmap
  map_fd_msg = bpf_map__fd(bpf_map_msg);

  // attach program to sockmap
  bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

Adding sockets to the map is done in the normal way,

  // Add a socket 'fd' to sockmap at location 'i'
  bpf_map_update_elem(map_fd_msg, &i, &fd, BPF_ANY);

After the above any socket attached to "my_sock_map", in this case
'fd', will run the BPF msg verdict program (msg_prog) on every
sendmsg and sendpage system call.

For a complete example see BPF selftests or sockmap samples.

Implementation notes:

It seemed simplest, to me at least, to use a refcnt to ensure the
psock is not lost across the sendmsg copy into the sg list, the BPF
program running on the data in sg_data, and the final pass to the TCP
stack. Performance testing may show a better method that avoids the
refcnt cost, but for now use the simpler method.

Another item that will come after basic support is in place is
support for the MSG_MORE flag. At the moment we call sendpage even if
the MSG_MORE flag is set. An enhancement would be to collect the pages
into a larger scatterlist and pass it down the stack. Note that
bpf_tcp_sendmsg() could support this with some additional state saved
across sendmsg calls. I built the code to support this without having
to do refactoring work. Other features TBD include ZEROCOPY and
TCP_RECV_QUEUE/TCP_NO_QUEUE support. These will follow the initial
series shortly.

Future work could improve size limits on the scatterlist rings used
here. Currently, we use MAX_SKB_FRAGS simply because this was being
used already in the TLS case. Future work could extend the kernel sk
APIs to tune this depending on workload. This is a trade-off
between memory usage and throughput performance.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/bpf.h       |    1 
 include/linux/bpf_types.h |    1 
 include/linux/filter.h    |   14 +
 include/uapi/linux/bpf.h  |   28 ++
 kernel/bpf/sockmap.c      |  517 ++++++++++++++++++++++++++++++++++++++++++++-
 kernel/bpf/syscall.c      |   14 +
 kernel/bpf/verifier.c     |    5 
 net/core/filter.c         |  106 +++++++++
 8 files changed, 668 insertions(+), 18 deletions(-)

Comments

David Miller March 5, 2018, 9:40 p.m. UTC | #1
From: John Fastabend <john.fastabend@gmail.com>
Date: Mon, 05 Mar 2018 11:51:22 -0800

> BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
> SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
> case and in the sendpage case leaves the data untouched. Both cases
> return -EACESS to the user. Returning SK_PASS will allow the msg to
> be sent.
> 
> In the sendmsg case data is copied into kernel space buffers before
> running the BPF program. In the sendpage case data is never copied.
> The implication being users may change data after BPF programs run in
> the sendpage case. (A flag will be added to always copy shortly
> if the copy must always be performed).

I don't see how the sendpage case can be right.

The user can asynchronously change the page contents whenever they
want, and if the BPF program runs on the old contents then the verdict
is not for what actually ends up being sent on the socket.

There is really no way to cheaply freeze the page contents other than
to make a copy.
John Fastabend March 5, 2018, 10:53 p.m. UTC | #2
On 03/05/2018 01:40 PM, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Mon, 05 Mar 2018 11:51:22 -0800
> 
>> BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
>> SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
>> case and in the sendpage case leaves the data untouched. Both cases
>> return -EACESS to the user. Returning SK_PASS will allow the msg to
>> be sent.
>>
>> In the sendmsg case data is copied into kernel space buffers before
>> running the BPF program. In the sendpage case data is never copied.
>> The implication being users may change data after BPF programs run in
>> the sendpage case. (A flag will be added to always copy shortly
>> if the copy must always be performed).
> 
> I don't see how the sendpage case can be right.
> 
> The user can asynchronously change the page contents whenever they
> want, and if the BPF program runs on the old contents then the verdict
> is not for what actually ends up being sent on the socket.
> 
> There is really no way to cheaply freeze the page contents other than
> to make a copy.
> 

Right, so we have two cases. The first is that we are not trying to
protect against malicious users but merely to monitor the connection.
This case is primarily for L7 statistics, for example the number of
bytes sent to URL foo. If users (in a real program, not something
malicious) are changing data mid-sendfile() this is really buggy
anyway; there is no way to know when/if the data is being copied lower
in the stack. Even worse would be if it changed a msg header, such as
the HTTP or Kafka header; I don't see how such a program would work
reliably at all. Some of my L7 monitoring BPF programs fall into this
category.

The second case is that we want to implement a strict policy, for
example never allow user 'bar' to send to URL foo. In the current
patches this would be vulnerable to async data changes. I was planning
a follow-up patch to this series adding an "always copy" flag, which
handles the asynchronous case by always copying the data when the BPF
policy cannot tolerate the user changing data mid-send. Another class
of BPF programs I have falls into this bucket.

However, the performance cost of the copy can be significant, so
allowing the BPF policy to decide which mode it requires seems best to
me. I decided to make the default no-copy to mirror the existing
sendpage() semantics and then to add the flag later. The flag support
is not in this series simply because I wanted to get the base support
in first.

Make sense? The default could be to copy sendpage data and then a flag
could be added to allow skipping the copy. But I prefer the current
defaults.

Thanks,
John
David Miller March 6, 2018, 5:42 a.m. UTC | #3
From: John Fastabend <john.fastabend@gmail.com>
Date: Mon, 5 Mar 2018 14:53:08 -0800

> I decided to make the default no-copy to mirror the existing
> sendpage() semantics and then to add the flag later. The flag
> support is not in this series simply because I wanted to get the
> base support in first.

What existing sendpage semantics are you referring to?
John Fastabend March 6, 2018, 6:22 a.m. UTC | #4
On 03/05/2018 09:42 PM, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Mon, 5 Mar 2018 14:53:08 -0800
> 
>> I decided to make the default no-copy to mirror the existing
>> sendpage() semantics and then to add the flag later. The flag
>> support is not in this series simply because I wanted to get the
>> base support in first.
> 
> What existing sendpage semantics are you referring to?
> 

All I meant by this is that if an application uses the sendfile()
call, there is no good way to know when/if the kernel side will copy
or xmit the data. So a reliable user space application will need to
only modify the data if it "knows" there are no outstanding sends
in-flight. If we assume applications follow this, then it is OK to
avoid the copy. Of course this is not good enough for security, but it
works for monitoring/statistics (my use case 1).

By keeping existing sendpage semantics I just meant that applications
should already follow the above.
David Miller March 6, 2018, 6:42 a.m. UTC | #5
From: John Fastabend <john.fastabend@gmail.com>
Date: Mon, 5 Mar 2018 22:22:21 -0800

> All I meant by this is if an application uses sendfile() call
> there is no good way to know when/if the kernel side will copy or
> xmit the  data. So a reliable user space application will need to
> only modify the data if it "knows" there are no outstanding sends
> in-flight. So if we assume applications follow this then it
> is OK to avoid the copy. Of course this is not good enough for
> security, but for monitoring/statistics (my use case 1 it works).

For an application implementing a networking file system, it's pretty
legitimate for file contents to change before the page gets DMA'd to
the networking card.

And that's perfectly fine, and we arrange everything such that this
will work properly.

The card checksums what ends up being DMA'd so nothing from the
networking side is broken.

So this assumption you mention really does not hold.

There needs to be some feedback from the BPF program that parses the
packet.  This way it can say, "I need at least X more bytes before I
can generate a verdict".  And you keep copying more and more bytes
into a linear buffer and calling the parser over and over until it can
generate a full verdict or you run out of networking data.
John Fastabend March 6, 2018, 7:06 a.m. UTC | #6
On 03/05/2018 10:42 PM, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Mon, 5 Mar 2018 22:22:21 -0800
> 
>> All I meant by this is if an application uses sendfile() call
>> there is no good way to know when/if the kernel side will copy or
>> xmit the  data. So a reliable user space application will need to
>> only modify the data if it "knows" there are no outstanding sends
>> in-flight. So if we assume applications follow this then it
>> is OK to avoid the copy. Of course this is not good enough for
>> security, but for monitoring/statistics (my use case 1 it works).
> 
> For an application implementing a networking file system, it's pretty
> legitimate for file contents to change before the page gets DMA's to
> the networking card.
> 

Still there are useful BPF programs that can tolerate this. So I
would prefer to allow BPF programs to operate in the no-copy mode if
wanted. It doesn't have to be the default though, as it currently is.
An L7 load balancer is a good example of this.

> And that's perfectly fine, and we everything such that this will work
> properly.
> 
> The card checksums what ends up being DMA'd so nothing from the
> networking side is broken.

Assuming the card has checksum support, correct? Which is why we have
SKBTX_SHARED_FRAG checked in skb_has_shared_frag() and the checksum
helpers called by the drivers when they do not support the protocol
being used. So probably an OK assumption if using supported protocols
and hardware? Perhaps in general folks just use normal protocols and
hardware, so it works.

> 
> So this assumption you mention really does not hold.
> 

OK.

> There needs to be some feedback from the BPF program that parses the
> packet.  This way it can say, "I need at least X more bytes before I
> can generate a verdict".  And you keep copying more and more bytes
> into a linear buffer and calling the parser over and over until it can
> generate a full verdict or you run out of networking data.
> 

So the "I need at least X more bytes" is the msg_cork_bytes() in patch
7. I could handle the sendpage case the same as I handle the sendmsg
case and copy the data into the buffer until N bytes are received. I
had planned to add this mode in a follow up series but could add it in
this series so we have all the pieces in one submission.

Although I used a scatterlist instead of a linear buffer. I was
planning to add a helper to pull in next sg list item if needed
rather than try to allocate a large linear block up front.
David Miller March 6, 2018, 3:47 p.m. UTC | #7
From: John Fastabend <john.fastabend@gmail.com>
Date: Mon, 5 Mar 2018 23:06:01 -0800

> On 03/05/2018 10:42 PM, David Miller wrote:
>> From: John Fastabend <john.fastabend@gmail.com>
>> Date: Mon, 5 Mar 2018 22:22:21 -0800
>> 
>>> All I meant by this is if an application uses sendfile() call
>>> there is no good way to know when/if the kernel side will copy or
>>> xmit the  data. So a reliable user space application will need to
>>> only modify the data if it "knows" there are no outstanding sends
>>> in-flight. So if we assume applications follow this then it
>>> is OK to avoid the copy. Of course this is not good enough for
>>> security, but for monitoring/statistics (my use case 1 it works).
>> 
>> For an application implementing a networking file system, it's pretty
>> legitimate for file contents to change before the page gets DMA's to
>> the networking card.
>> 
> 
> Still there are useful BPF programs that can tolerate this. So I
> would prefer to allow BPF programs to operate in the no-copy mode
> if wanted. It doesn't have to be the default though as it currently
> is. A l7 load balancer is a good example of this.

Maybe I'd be ok if it were not the default.  But do you really want to
expose a potential attack vector, even if the app gets to choose and
say "I'm ok"?

>> And that's perfectly fine, and we everything such that this will work
>> properly.
>> 
>> The card checksums what ends up being DMA'd so nothing from the
>> networking side is broken.
> 
> Assuming the card has checksum support correct? Which is why we have
> the SKBTX_SHARED_FRAG checked in skb_has_shared_frag() and the checksum
> helpers called by the drivers when they do not support the protocol
> being used. So probably OK assumption if using supported protocols and
> hardware? Perhaps in general folks just use normal protocols and
> hardware so it works.

If the hardware doesn't support the checksums, we linearize the SKB
(therefore obtain a snapshot of the data), and checksum.  Exactly what
would happen if the hardware did the checksum.

So OK in that case too.

We always guarantee that you will always get a correct checksum on
outgoing packets, even if you modify the page contents meanwhile.

> So the "I need at least X more bytes" is the msg_cork_bytes() in patch
> 7. I could handle the sendpage case the same as I handle the sendmsg
> case and copy the data into the buffer until N bytes are received. I
> had planned to add this mode in a follow up series but could add it in
> this series so we have all the pieces in one submission.
> 
> Although I used a scatterlist instead of a linear buffer. I was
> planning to add a helper to pull in next sg list item if needed
> rather than try to allocate a large linear block up front.

For non-deep packet inspection cases this re-running of the parser case
will probably not trigger at all.
John Fastabend March 6, 2018, 6:18 p.m. UTC | #8
On 03/06/2018 07:47 AM, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Mon, 5 Mar 2018 23:06:01 -0800
> 
>> On 03/05/2018 10:42 PM, David Miller wrote:
>>> From: John Fastabend <john.fastabend@gmail.com>
>>> Date: Mon, 5 Mar 2018 22:22:21 -0800
>>>
>>>> All I meant by this is if an application uses sendfile() call
>>>> there is no good way to know when/if the kernel side will copy or
>>>> xmit the  data. So a reliable user space application will need to
>>>> only modify the data if it "knows" there are no outstanding sends
>>>> in-flight. So if we assume applications follow this then it
>>>> is OK to avoid the copy. Of course this is not good enough for
>>>> security, but for monitoring/statistics (my use case 1 it works).
>>>
>>> For an application implementing a networking file system, it's pretty
>>> legitimate for file contents to change before the page gets DMA's to
>>> the networking card.
>>>
>>
>> Still there are useful BPF programs that can tolerate this. So I
>> would prefer to allow BPF programs to operate in the no-copy mode
>> if wanted. It doesn't have to be the default though as it currently
>> is. A l7 load balancer is a good example of this.
> 
> Maybe I'd be ok if it were not the default.  But do you really want to
> expose a potential attack vector, even if the app gets to choose and
> say "I'm ok"?
> 

Yes, because I have use cases where I don't need to read the data, but
have already "approved" the data. One example: applications like nginx
can serve static HTTP data. Just reading over the code, what they do
when sendfile is enabled is a sendmsg call with the header. We want to
enforce the policy on the header; then we know the next N bytes are OK.
Nginx will then send the payload over the sendfile syscall. Because we
already know the data is good from the initial sendmsg call, the next N
bytes can get the verdict SK_PASS without even touching the data. If we
do a copy in this case we see significant performance degradation.

The other use case is the L7 load balancer mentioned above. If we are
using RR policies or some other heuristic, it is also fine if the user
modifies the payload after the BPF verdict. A malicious user could
rewrite the header and try to game the load balancer, but the BPF
program can always just dev/null (SK_DROP) the application when it
detects this. This also assumes the load balancer is using the header
for its heuristic; some interesting heuristics may not use the header
at all.
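
To make the nginx flow concrete, a rough sketch of the BPF side could
look like the following. This is illustrative only: HDR_BYTES, the
constants and the my_header_is_allowed() check are placeholders, and
bpf_msg_apply_bytes() stands in for an apply-bytes style helper added
elsewhere in this series (the cork/pull-data helpers discussed later
in the thread would make the header check more robust).

  #define HDR_BYTES     128     // assumed header size, sketch only
  #define PAYLOAD_BYTES 65536   // bytes trusted once the header is OK

  SEC("sk_msg")
  int nginx_policy(struct sk_msg_md *msg)
  {
          void *data_end = (void *)(long)msg->data_end;
          void *data = (void *)(long)msg->data;

          // nginx sends the header in its own sendmsg() call; if it
          // is not fully visible here, reject it.
          if (data + HDR_BYTES > data_end)
                  return SK_DROP;

          if (!my_header_is_allowed(data, data_end))  // policy, not shown
                  return SK_DROP;

          // Header approved: apply this verdict to the next
          // PAYLOAD_BYTES so the sendfile() pages that follow are
          // never read or copied by the program.
          bpf_msg_apply_bytes(msg, PAYLOAD_BYTES);
          return SK_PASS;
  }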

>>> And that's perfectly fine, and we everything such that this will work
>>> properly.
>>>
>>> The card checksums what ends up being DMA'd so nothing from the
>>> networking side is broken.
>>
>> Assuming the card has checksum support correct? Which is why we have
>> the SKBTX_SHARED_FRAG checked in skb_has_shared_frag() and the checksum
>> helpers called by the drivers when they do not support the protocol
>> being used. So probably OK assumption if using supported protocols and
>> hardware? Perhaps in general folks just use normal protocols and
>> hardware so it works.
> 
> If the hardware doesn't support the checksums, we linearize the SKB
> (therefore obtain a snapshot of the data), and checksum.  Exactly what
> would happen if the hardware did the checksum.
> 
> So OK in that case too.
> 
> We always guarantee that you will always get a correct checksum on
> outgoing packets, even if you modify the page contents meanwhile.
> 

Agreed the checksum is correct, but the user doesn't know if the
linearize happened while it was modifying the data, potentially
creating data with a partial update. Because the user modifying the
data doesn't block the linearize operation in the kernel, and vice
versa, the linearize operation can happen in parallel with the
user-side data modification. So maybe I'm still missing something, but
it seems the data can be in some unknown state on the wire.

Either way, though, I think it's fine to make the default sendpage
hook do the copy. A flag to avoid the copy can be added later to
resolve my use cases above. I'll code this up in a v2 today/tomorrow.

>> So the "I need at least X more bytes" is the msg_cork_bytes() in patch
>> 7. I could handle the sendpage case the same as I handle the sendmsg
>> case and copy the data into the buffer until N bytes are received. I
>> had planned to add this mode in a follow up series but could add it in
>> this series so we have all the pieces in one submission.
>>
>> Although I used a scatterlist instead of a linear buffer. I was
>> planning to add a helper to pull in next sg list item if needed
>> rather than try to allocate a large linear block up front.
> 
> For non-deep packet inspection cases this re-running of the parser case
> will probably not trigger at all.
> 

Agreed, it's mostly there to handle cases where the sendmsg call only
sent part of an application (Kafka, HTTP, etc.) header. This can
happen if the user is sending multiple messages in a single
sendmsg/sendfile call. But, yeah, I see it rarely in practice; it's
mostly there for completeness and to handle these edge cases.
John Fastabend March 7, 2018, 3:25 a.m. UTC | #9
On 03/06/2018 10:18 AM, John Fastabend wrote:
> On 03/06/2018 07:47 AM, David Miller wrote:
>> From: John Fastabend <john.fastabend@gmail.com>
>> Date: Mon, 5 Mar 2018 23:06:01 -0800
>>
>>> On 03/05/2018 10:42 PM, David Miller wrote:
>>>> From: John Fastabend <john.fastabend@gmail.com>
>>>> Date: Mon, 5 Mar 2018 22:22:21 -0800
>>>>
>>>>> All I meant by this is if an application uses sendfile() call
>>>>> there is no good way to know when/if the kernel side will copy or
>>>>> xmit the  data. So a reliable user space application will need to
>>>>> only modify the data if it "knows" there are no outstanding sends
>>>>> in-flight. So if we assume applications follow this then it
>>>>> is OK to avoid the copy. Of course this is not good enough for
>>>>> security, but for monitoring/statistics (my use case 1 it works).
>>>>
>>>> For an application implementing a networking file system, it's pretty
>>>> legitimate for file contents to change before the page gets DMA's to
>>>> the networking card.
>>>>
>>>
>>> Still there are useful BPF programs that can tolerate this. So I
>>> would prefer to allow BPF programs to operate in the no-copy mode
>>> if wanted. It doesn't have to be the default though as it currently
>>> is. A l7 load balancer is a good example of this.
>>
>> Maybe I'd be ok if it were not the default.  But do you really want to
>> expose a potential attack vector, even if the app gets to choose and
>> say "I'm ok"?
>>
> 
> Yes, because I have use cases where I don't need to read the data, but
> have already "approved" the data. One example applications like
> nginx can serve static http data. Just reading over the code what they
> do, when sendfile is enabled, is a sendmsg call with the header. We want
> to enforce the policy on the header. Then we know the next N bytes are
> OK. Nginx will then send the payload over sendfile syscall. We already
> know the data is good from initial sendmsg call the next N bytes can
> get the verdict SK_PASS without even touching the data. If we do a
> copy in this case we see significant performance degradation.
> 
> The other use case is the L7 load balancer mentioned above. If we are
> using RR policies or some other heuristic if the user modifies the
> payload after the BPF verdict that is also fine. A malicious user
> could rewrite the header and try to game the load balancer but the
> BPF program can always just dev/null (SK_DROP) the application when
> it detects this. This also assumes the load balancer is using the
> header for its heuristic some interesting heuristics may not use
> the header at all.
> 
>>>> And that's perfectly fine, and we everything such that this will work
>>>> properly.
>>>>
>>>> The card checksums what ends up being DMA'd so nothing from the
>>>> networking side is broken.
>>>
>>> Assuming the card has checksum support correct? Which is why we have
>>> the SKBTX_SHARED_FRAG checked in skb_has_shared_frag() and the checksum
>>> helpers called by the drivers when they do not support the protocol
>>> being used. So probably OK assumption if using supported protocols and
>>> hardware? Perhaps in general folks just use normal protocols and
>>> hardware so it works.
>>
>> If the hardware doesn't support the checksums, we linearize the SKB
>> (therefore obtain a snapshot of the data), and checksum.  Exactly what
>> would happen if the hardware did the checksum.
>>
>> So OK in that case too.
>>
>> We always guarantee that you will always get a correct checksum on
>> outgoing packets, even if you modify the page contents meanwhile.
>>
> 
> Agreed the checksum is correct, but the user doesn't know if the linearize
> happened while it was modifying the data, potentially creating data with
> a partial update. Because the user modifying the data doesn't block the
> linearize operation in the kernel and vice versa the linearize operation
> can happen in parallel with the user side data modification. So maybe
> I'm still missing something but it seems the data can be in some unknown
> state on the wire.
> 
> Either way though I think its fine to make the default sendpage hook do
> the copy. A flag to avoid the copy can be added later to resolve my use
> cases above. I'll code this up in a v2 today/tomorrow.

Hi,

Thought about this a bit more and chatted with Daniel a bit. I think
a better solution is to set data_start = data_end = 0 by default in the
sendpage case. This will disallow any read/writes into the sendpage
data. Then if the user needs to read/write data we can use a helper
bpf_sk_msg_pull_data(start_byte, end_byte) which can pull the data into a
linear buffer as needed. This will ensure any user writes will not
change data after the BPF verdict (your concern). Also it will minimize
the amount of data that needs to be copied (my concern). In some of my
use cases where no data is needed we can simply not use the helper. Then
on the sendmsg side we can continue to set the (data_start, data_end)
pointers to the first scatterlist element. But, also use this helper to
set the data pointers past the first scatterlist element if needed. So
if someone wants to read past the first 4k bytes on a large send for
example this can be done with the above helper. BPF programs just
need to check the (start, end) data pointers and can be oblivious to
whether the program is being invoked by a call from sendpage or sendmsg.

I think this is a fairly elegant solution. Finally we can further
optimize later with a flag if needed to cover the case where we
want to read lots of bytes but _not_ do the copy. We can debate
the usefulness of this later with actual perf data.

All this has the added bonus that all I need is another patch on
top to add the helper. Pseudo code might look like this,

my_bpf_prog(struct sk_msg_md *msg) {
	void *data_end = msg->data_end;
	void *data_start = msg->data_start;

	need = PARSE_BYTES;

	// ensure user actually sent full header
	if (msg->size < PARSE_BYTES) {
		bpf_msg_cork(PARSE_BYTES);
		return SK_DROP;
	}

	/* ensure we can read full header, if this is a
	 * sendmsg system call AND PARSE_BYTES are all in
	 * the first scatterlist elem this is a no-op.
	 * If this is a sendpage call will put PARSE_BYTES
	 * in a psock buffer to avoid user modifications.
	 */
	if (data_end - data_start < PARSE_BYTES) {
		err = bpf_sk_msg_pull_data(0, PARSE_BYTES, flags);
		if (err)
			return SK_DROP;
	}

	// we have the full header parse it now
	verdict = my_bpf_header_parser(msg);
	return verdict;
}

Future optimization can work with a prologue to pull in bytes
more efficiently. And for what it's worth, I found a couple of bugs
in the error path of the sendpage hook that I can fix in the v2 as well.

What do you think?

@Daniel, sound more or less like what you were thinking?

> 
>>> So the "I need at least X more bytes" is the msg_cork_bytes() in patch
>>> 7. I could handle the sendpage case the same as I handle the sendmsg
>>> case and copy the data into the buffer until N bytes are received. I
>>> had planned to add this mode in a follow up series but could add it in
>>> this series so we have all the pieces in one submission.
>>>
>>> Although I used a scatterlist instead of a linear buffer. I was
>>> planning to add a helper to pull in next sg list item if needed
>>> rather than try to allocate a large linear block up front.
>>
>> For non-deep packet inspection cases this re-running of the parser case
>> will probably not trigger at all.
>>
> 
> Agreed, its mostly there to handle cases where the sendmsg call
> only sent part of a application (kafka, http, etc) header. This can
> happen if user is sending multiple messages in a single sendmsg/sendfile
> call. But, yeah I see it rarely in practice its mostly there for
> completeness and to handle these edge cases.
>
David Miller March 7, 2018, 4:41 a.m. UTC | #10
From: John Fastabend <john.fastabend@gmail.com>
Date: Tue, 6 Mar 2018 19:25:01 -0800

> What do you think? 

Sounds good from your description, I can't wait to see it :-)
Daniel Borkmann March 7, 2018, 1:03 p.m. UTC | #11
On 03/07/2018 04:25 AM, John Fastabend wrote:
[...]
> Thought about this a bit more and chatted with Daniel a bit. I think
> a better solution is to set data_start = data_end = 0 by default in the
> sendpage case. This will disallow any read/writes into the sendpage
> data. Then if the user needs to read/write data we can use a helper
> bpf_sk_msg_pull_data(start_byte, end_byte) which can pull the data into a
> linear buffer as needed. This will ensure any user writes will not
> change data after the BPF verdict (your concern). Also it will minimize
> the amount of data that needs to be copied (my concern). In some of my
> use cases where no data is needed we can simple not use the helper. Then
> on the sendmsg side we can continue to set the (data_start, data_end)
> pointers to the first scatterlist element. But, also use this helper to
> set the data pointers past the first scatterlist element if needed. So
> if someone wants to read past the first 4k bytes on a large send for
> example this can be done with the above helper. BPF programs just
> need to check (start,end) data pointers and can be oblivious to
> if the program is being invoked by a call from sendpage or sendmsg.
> 
> I think this is a fairly elegant solution. Finally we can further
> optimize later with a flag if needed to cover the case where we
> want to read lots of bytes but _not_ do the copy. We can debate
> the usefulness of this later with actual perf data.
> 
> All this has the added bonus that all I need is another patch on
> top to add the helper. Pseudo code might look like this,
> 
> my_bpf_prog(struct sk_msg_md *msg) {
> 	void *data_end = msg->data_end;
> 	void *data_start = msg->data_start;
> 
> 	need = PARSE_BYTES;
> 
> 	// ensure user actually sent full header
> 	if (msg->size < PARSE_BYTES) {
> 		bpf_msg_cork(PARSE_BYTES);
> 		return SK_DROP;
> 	}
> 
> 	/* ensure we can read full header, if this is a
> 	 * sendmsg system call AND PARSE_BYTES are all in
> 	 * the first scatterlist elem this is a no-op.
> 	 * If this is a sendpage call will put PARSE_BYTES
> 	 * in a psock buffer to avoid user modifications.
> 	 */
> 	if (data_end - data_start < PARSE_BYTES) {

I think it might need to look like 'data_start + PARSE_BYTES > data_end'
for the verifier to recognize it (unless LLVM generates code that way).

> 		err = bpf_sk_msg_pull_data(0, PARSE_BYTES, flags);
> 		if (err)
> 			return SK_DROP;

Above should be:

		if (unlikely(err || data_start + PARSE_BYTES > data_end))
			return SK_DROP;

Here, for the successful case, you need to recheck since the data
pointers were invalidated by the helper call. For the very first case,
bpf_sk_msg_pull_data() could potentially be called unconditionally at
prog start, since you start out with 0 len anyway, basically right
after the msg->size test.

> 	}
> 
> 	// we have the full header parse it now
> 	verdict = my_bpf_header_parser(msg);
> 	return verdict;
> }
> 
> Future optimization can work with prologue to pull in bytes
> more efficiently. And for what its worth I found a couple bugs
> in the error path of the sendpage hook I can fix in the v2 as well.
> 
> What do you think? 
> 
> @Daniel, sound more or less like what you were thinking?

Yes, absolutely what I was thinking.

We have exactly the same logic in tc/BPF today for the case when the
direct packet access test fails and we want to pull in skb data from
the non-linear area; in that case we can just call
bpf_skb_pull_data(skb, len) and redo the test to access it privately
after that.

Thanks,
Daniel
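
Folding Daniel's two corrections back into John's pseudo code gives
roughly the fragment below. This is still pseudo code: PARSE_BYTES,
my_bpf_header_parser(), bpf_msg_cork() and bpf_sk_msg_pull_data()
follow the naming used in the thread, the ctx-first helper signature
and flags argument are assumptions, and the field names follow the
patch's sk_msg_md (data/data_end).

my_bpf_prog(struct sk_msg_md *msg) {
        void *data_end = msg->data_end;
        void *data_start = msg->data;

        // ensure user actually sent the full header
        if (msg->size < PARSE_BYTES) {
                bpf_msg_cork(PARSE_BYTES);
                return SK_DROP;
        }

        // verifier-friendly form of the length test (Daniel's first point)
        if (data_start + PARSE_BYTES > data_end) {
                err = bpf_sk_msg_pull_data(msg, 0, PARSE_BYTES, flags);

                // the helper invalidates the data pointers, so reload
                // and recheck before reading (Daniel's second point)
                data_end = msg->data_end;
                data_start = msg->data;
                if (err || data_start + PARSE_BYTES > data_end)
                        return SK_DROP;
        }

        // we have the full header and it is stable, parse it now
        verdict = my_bpf_header_parser(msg);
        return verdict;
}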

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 66df387..819229c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -21,6 +21,7 @@ 
 struct perf_event;
 struct bpf_prog;
 struct bpf_map;
+struct sock;
 
 /* map is generic key/value storage optionally accesible by eBPF programs */
 struct bpf_map_ops {
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 19b8349..5e2e8a4 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -13,6 +13,7 @@ 
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
 #endif
 #ifdef CONFIG_BPF_EVENTS
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index fdb691b..15c663e 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -507,6 +507,19 @@  struct xdp_buff {
 	struct xdp_rxq_info *rxq;
 };
 
+struct sk_msg_buff {
+	void *data;
+	void *data_end;
+	int sg_start;
+	int sg_curr;
+	int sg_end;
+	int sg_size;
+	struct scatterlist sg_data[MAX_SKB_FRAGS];
+	__u32 key;
+	__u32 flags;
+	struct bpf_map *map;
+};
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
@@ -771,6 +784,7 @@  int xdp_do_redirect(struct net_device *dev,
 void bpf_warn_invalid_xdp_action(u32 act);
 
 struct sock *do_sk_redirect_map(struct sk_buff *skb);
+struct sock *do_msg_redirect_map(struct sk_msg_buff *md);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2a66769..b8275f0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -133,6 +133,7 @@  enum bpf_prog_type {
 	BPF_PROG_TYPE_SOCK_OPS,
 	BPF_PROG_TYPE_SK_SKB,
 	BPF_PROG_TYPE_CGROUP_DEVICE,
+	BPF_PROG_TYPE_SK_MSG,
 };
 
 enum bpf_attach_type {
@@ -143,6 +144,7 @@  enum bpf_attach_type {
 	BPF_SK_SKB_STREAM_PARSER,
 	BPF_SK_SKB_STREAM_VERDICT,
 	BPF_CGROUP_DEVICE,
+	BPF_SK_MSG_VERDICT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -696,6 +698,15 @@  enum bpf_attach_type {
  * int bpf_override_return(pt_regs, rc)
  *	@pt_regs: pointer to struct pt_regs
  *	@rc: the return value to set
+ *
+ * int bpf_msg_redirect_map(map, key, flags)
+ *     Redirect msg to a sock in map using key as a lookup key for the
+ *     sock in map.
+ *     @map: pointer to sockmap
+ *     @key: key to lookup sock in map
+ *     @flags: reserved for future use
+ *     Return: SK_PASS
+ *
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -757,7 +768,8 @@  enum bpf_attach_type {
 	FN(perf_prog_read_value),	\
 	FN(getsockopt),			\
 	FN(override_return),		\
-	FN(sock_ops_cb_flags_set),
+	FN(sock_ops_cb_flags_set),	\
+	FN(msg_redirect_map),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -920,6 +932,20 @@  enum sk_action {
 	SK_PASS,
 };
 
+/* User return codes for SK_MSG prog type. */
+enum sk_msg_action {
+	SK_MSG_DROP = 0,
+	SK_MSG_PASS,
+};
+
+/* user accessible metadata for SK_MSG packet hook, new fields must
+ * be added to the end of this structure
+ */
+struct sk_msg_md {
+	__u32 data;
+	__u32 data_end;
+};
+
 #define BPF_TAG_SIZE	8
 
 struct bpf_prog_info {
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 051b2242..0fd5556 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -38,6 +38,7 @@ 
 #include <linux/skbuff.h>
 #include <linux/workqueue.h>
 #include <linux/list.h>
+#include <linux/mm.h>
 #include <net/strparser.h>
 #include <net/tcp.h>
 
@@ -47,6 +48,7 @@ 
 struct bpf_stab {
 	struct bpf_map map;
 	struct sock **sock_map;
+	struct bpf_prog *bpf_tx_msg;
 	struct bpf_prog *bpf_parse;
 	struct bpf_prog *bpf_verdict;
 };
@@ -74,6 +76,7 @@  struct smap_psock {
 	struct sk_buff *save_skb;
 
 	struct strparser strp;
+	struct bpf_prog *bpf_tx_msg;
 	struct bpf_prog *bpf_parse;
 	struct bpf_prog *bpf_verdict;
 	struct list_head maps;
@@ -91,6 +94,11 @@  struct smap_psock {
 	void (*save_write_space)(struct sock *sk);
 };
 
+static void smap_release_sock(struct smap_psock *psock, struct sock *sock);
+static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
+static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
+			    int offset, size_t size, int flags);
+
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
 {
 	return rcu_dereference_sk_user_data(sk);
@@ -115,6 +123,12 @@  static int bpf_tcp_init(struct sock *sk)
 
 	psock->save_close = sk->sk_prot->close;
 	psock->sk_proto = sk->sk_prot;
+
+	if (psock->bpf_tx_msg) {
+		tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
+		tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
+	}
+
 	sk->sk_prot = &tcp_bpf_proto;
 	rcu_read_unlock();
 	return 0;
@@ -174,6 +188,7 @@  enum __sk_action {
 	__SK_DROP = 0,
 	__SK_PASS,
 	__SK_REDIRECT,
+	__SK_NONE,
 };
 
 static struct tcp_ulp_ops bpf_tcp_ulp_ops __read_mostly = {
@@ -185,10 +200,459 @@  enum __sk_action {
 	.release	= bpf_tcp_release,
 };
 
+static int memcopy_from_iter(struct sock *sk,
+			     struct sk_msg_buff *md,
+			     struct iov_iter *from, int bytes)
+{
+	struct scatterlist *sg = md->sg_data;
+	int i = md->sg_curr, rc = 0;
+
+	do {
+		int copy;
+		char *to;
+
+		copy = sg[i].length;
+		to = sg_virt(&sg[i]);
+
+		if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY)
+			rc = copy_from_iter_nocache(to, copy, from);
+		else
+			rc = copy_from_iter(to, copy, from);
+
+		if (rc != copy) {
+			rc = -EFAULT;
+			goto out;
+		}
+
+		bytes -= copy;
+		if (!bytes)
+			break;
+
+		if (++i == MAX_SKB_FRAGS)
+			i = 0;
+	} while (i != md->sg_end);
+out:
+	md->sg_curr = i;
+	return rc;
+}
+
+static int bpf_tcp_push(struct sock *sk,
+			struct smap_psock *psock, struct sk_msg_buff *md,
+			int flags, bool uncharge)
+{
+	struct scatterlist *sg;
+	int offset, ret = 0;
+	struct page *p;
+	size_t size;
+
+	while (1) {
+		sg = md->sg_data + md->sg_start;
+		size = sg->length;
+		offset = sg->offset;
+
+		tcp_rate_check_app_limited(sk);
+		p = sg_page(sg);
+retry:
+		ret = do_tcp_sendpages(sk, p, offset, size, flags);
+		if (ret != size) {
+			if (ret > 0) {
+				size -= ret;
+				offset += ret;
+				if (uncharge)
+					sk_mem_uncharge(sk, ret);
+				goto retry;
+			}
+
+			sg->length = size;
+			sg->offset = offset;
+			return ret;
+		}
+
+		put_page(p);
+		sg->offset += ret;
+		sg->length -= ret;
+		if (uncharge)
+			sk_mem_uncharge(sk, ret);
+
+		if (!sg->length) {
+			put_page(p);
+			md->sg_start++;
+			if (md->sg_start == MAX_SKB_FRAGS)
+				md->sg_start = 0;
+			memset(sg, 0, sizeof(*sg));
+		}
+
+		if (md->sg_start == md->sg_end)
+			break;
+	}
+	return 0;
+}
+
+static inline void bpf_compute_data_pointers_sg(struct sk_msg_buff *md)
+{
+	struct scatterlist *sg = md->sg_data + md->sg_start;
+
+	md->data = sg_virt(sg);
+	md->data_end = md->data + sg->length;
+}
+
+static void return_mem_sg(struct sock *sk, struct sk_msg_buff *md)
+{
+	struct scatterlist *sg = md->sg_data;
+	int i;
+
+	i = md->sg_start;
+	do {
+		sk_mem_uncharge(sk, sg[i].length);
+
+		i++;
+		if (i == MAX_SKB_FRAGS)
+			i = 0;
+	} while (i != md->sg_end);
+}
+
+static int free_sg(struct sock *sk, int start, struct sk_msg_buff *md)
+{
+	struct scatterlist *sg = md->sg_data;
+	int i = start, free = 0;
+
+	while (sg[i].length) {
+		free += sg[i].length;
+		sk_mem_uncharge(sk, sg[i].length);
+		put_page(sg_page(&sg[i]));
+		sg[i].length = 0;
+		sg[i].page_link = 0;
+		sg[i].offset = 0;
+		i++;
+
+		if (i == MAX_SKB_FRAGS)
+			i = 0;
+	}
+
+	return free;
+}
+
+static int free_start_sg(struct sock *sk, struct sk_msg_buff *md)
+{
+	int free = free_sg(sk, md->sg_start, md);
+
+	md->sg_start = md->sg_end;
+	return free;
+}
+
+static int free_curr_sg(struct sock *sk, struct sk_msg_buff *md)
+{
+	return free_sg(sk, md->sg_curr, md);
+}
+
+static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
+{
+	return ((_rc == SK_PASS) ?
+	       (md->map ? __SK_REDIRECT : __SK_PASS) :
+	       __SK_DROP);
+}
+
+static unsigned int smap_do_tx_msg(struct sock *sk,
+				   struct smap_psock *psock,
+				   struct sk_msg_buff *md)
+{
+	struct bpf_prog *prog;
+	unsigned int rc, _rc;
+
+	preempt_disable();
+	rcu_read_lock();
+
+	/* If the policy was removed mid-send then default to 'accept' */
+	prog = READ_ONCE(psock->bpf_tx_msg);
+	if (unlikely(!prog)) {
+		_rc = SK_PASS;
+		goto verdict;
+	}
+
+	bpf_compute_data_pointers_sg(md);
+	rc = (*prog->bpf_func)(md, prog->insnsi);
+
+	/* Moving return codes from UAPI namespace into internal namespace */
+	_rc = bpf_map_msg_verdict(rc, md);
+verdict:
+	rcu_read_unlock();
+	preempt_enable();
+
+	return _rc;
+}
+
+static int bpf_tcp_sendmsg_do_redirect(struct sk_msg_buff *md,
+				       int flags)
+{
+	struct smap_psock *psock;
+	struct scatterlist *sg;
+	int i, err, free = 0;
+	struct sock *sk;
+
+	sg = md->sg_data;
+
+	rcu_read_lock();
+	sk = do_msg_redirect_map(md);
+	if (unlikely(!sk))
+		goto out_rcu;
+
+	psock = smap_psock_sk(sk);
+	if (unlikely(!psock))
+		goto out_rcu;
+
+	if (!refcount_inc_not_zero(&psock->refcnt))
+		goto out_rcu;
+
+	rcu_read_unlock();
+	lock_sock(sk);
+	err = bpf_tcp_push(sk, psock, md, flags, false);
+	release_sock(sk);
+	smap_release_sock(psock, sk);
+	if (unlikely(err))
+		goto out;
+	return 0;
+out_rcu:
+	rcu_read_unlock();
+out:
+	i = md->sg_start;
+	while (sg[i].length) {
+		free += sg[i].length;
+		put_page(sg_page(&sg[i]));
+		sg[i].length = 0;
+		i++;
+		if (i == MAX_SKB_FRAGS)
+			i = 0;
+	}
+	return free;
+}
+
+static inline void bpf_md_init(struct sk_msg_buff *md)
+{
+	md->sg_size = 0;
+}
+
+static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
+{
+	int flags = msg->msg_flags | MSG_NO_SHARED_FRAGS;
+	int err = 0, eval = __SK_NONE;
+	struct sk_msg_buff md = {0};
+	unsigned int sg_copy = 0;
+	struct smap_psock *psock;
+	size_t copy, copied = 0;
+	struct scatterlist *sg;
+	long timeo;
+
+	/* Its possible a sock event or user removed the psock _but_ the ops
+	 * have not been reprogrammed yet so we get here. In this case fallback
+	 * to tcp_sendmsg. Note this only works because we _only_ ever allow
+	 * a single ULP there is no hierarchy here.
+	 */
+	rcu_read_lock();
+	psock = smap_psock_sk(sk);
+	if (unlikely(!psock)) {
+		rcu_read_unlock();
+		return tcp_sendmsg(sk, msg, size);
+	}
+
+	/* Increment the psock refcnt to ensure its not released while sending a
+	 * message. Required because sk lookup and bpf programs are used in
+	 * separate rcu critical sections. Its OK if we lose the map entry
+	 * but we can't lose the sock reference, possible when the refcnt hits
+	 * zero and garbage collection calls sock_put().
+	 */
+	if (!refcount_inc_not_zero(&psock->refcnt)) {
+		rcu_read_unlock();
+		return tcp_sendmsg(sk, msg, size);
+	}
+
+	sg = md.sg_data;
+	sg_init_table(sg, MAX_SKB_FRAGS);
+	rcu_read_unlock();
+
+	lock_sock(sk);
+	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+
+	md.sg_size = 0;
+
+	while (msg_data_left(msg)) {
+		if (sk->sk_err) {
+			err = sk->sk_err;
+			goto out_err;
+		}
+
+		copy = msg_data_left(msg);
+		if (!sk_stream_memory_free(sk))
+			goto wait_for_sndbuf;
+
+		md.sg_curr = md.sg_end;
+		err = sk_alloc_sg(sk, copy, sg,
+				  md.sg_start, &md.sg_end, &sg_copy,
+				  md.sg_end);
+		if (err) {
+			if (err != -ENOSPC)
+				goto wait_for_memory;
+			copy = sg_copy;
+		}
+
+		err = memcopy_from_iter(sk, &md, &msg->msg_iter, copy);
+		if (err < 0) {
+			free_curr_sg(sk, &md);
+			goto out_err;
+		}
+
+		copied += copy;
+		sg_copy = 0;
+		/* If msg is larger than MAX_SKB_FRAGS we can send multiple
+		 * scatterlists per msg. However BPF decisions apply to the
+		 * entire msg.
+		 */
+		if (eval == __SK_NONE)
+			eval = smap_do_tx_msg(sk, psock, &md);
+
+		switch (eval) {
+		case __SK_PASS:
+			err = bpf_tcp_push(sk, psock, &md, flags, true);
+			if (unlikely(err)) {
+				copied -= free_start_sg(sk, &md);
+				goto out_err;
+			}
+			break;
+		case __SK_REDIRECT:
+			return_mem_sg(sk, &md);
+			release_sock(sk);
+			err = bpf_tcp_sendmsg_do_redirect(&md, flags);
+			if (unlikely(err)) {
+				copied -= err;
+				goto out_redir;
+			}
+			lock_sock(sk);
+			break;
+		case __SK_DROP:
+		default:
+			copied -= free_start_sg(sk, &md);
+			goto out_err;
+		}
+
+		bpf_md_init(&md);
+		continue;
+wait_for_sndbuf:
+		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+wait_for_memory:
+		err = sk_stream_wait_memory(sk, &timeo);
+		if (err)
+			goto out_err;
+	}
+out_err:
+	bpf_md_init(&md);
+	if (err < 0)
+		err = sk_stream_error(sk, msg->msg_flags, err);
+	release_sock(sk);
+out_redir:
+	smap_release_sock(psock, sk);
+	return copied ? copied : err;
+}
+
+static int bpf_tcp_sendpage_do_redirect(struct page *page, int offset,
+					size_t size, int flags,
+					struct sk_msg_buff *md)
+{
+	struct smap_psock *psock;
+	struct sock *sk;
+	int rc;
+
+	rcu_read_lock();
+	sk = do_msg_redirect_map(md);
+	if (unlikely(!sk))
+		goto out_rcu;
+
+	psock = smap_psock_sk(sk);
+	if (unlikely(!psock))
+		goto out_rcu;
+
+	if (!refcount_inc_not_zero(&psock->refcnt))
+		goto out_rcu;
+
+	rcu_read_unlock();
+
+	lock_sock(sk);
+	rc = tcp_sendpage_locked(sk, page, offset, size, flags);
+	release_sock(sk);
+
+	smap_release_sock(psock, sk);
+	return rc;
+out_rcu:
+	rcu_read_unlock();
+	return -EINVAL;
+}
+
+static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
+			    int offset, size_t size, int flags)
+{
+	struct sk_msg_buff md = {0};
+	struct smap_psock *psock;
+	int rc, _rc = __SK_PASS;
+	struct bpf_prog *prog;
+
+	preempt_disable();
+	rcu_read_lock();
+	psock = smap_psock_sk(sk);
+	if (unlikely(!psock))
+		goto verdict;
+
+	/* If the policy was removed mid-send then default to 'accept' */
+	prog = READ_ONCE(psock->bpf_tx_msg);
+	if (unlikely(!prog))
+		goto verdict;
+
+	/* Calculate pkt data pointers and run BPF program */
+	md.data = page_address(page) + offset;
+	md.data_end = md.data + size;
+	_rc = (*prog->bpf_func)(&md, prog->insnsi);
+
+verdict:
+	rcu_read_unlock();
+	preempt_enable();
+
+	/* Moving return codes from UAPI namespace into internal namespace */
+	rc = bpf_map_msg_verdict(_rc, &md);
+
+	switch (rc) {
+	case __SK_PASS:
+		lock_sock(sk);
+		rc = tcp_sendpage_locked(sk, page, offset, size, flags);
+		release_sock(sk);
+		break;
+	case __SK_REDIRECT:
+		rc = bpf_tcp_sendpage_do_redirect(page, offset, size, flags,
+						  &md);
+		break;
+	case __SK_DROP:
+	default:
+		rc = -EACCES;
+	}
+
+	return rc;
+}
+
+static void bpf_tcp_msg_add(struct smap_psock *psock,
+			    struct sock *sk,
+			    struct bpf_prog *tx_msg)
+{
+	struct bpf_prog *orig_tx_msg;
+
+	orig_tx_msg = xchg(&psock->bpf_tx_msg, tx_msg);
+	if (orig_tx_msg)
+		bpf_prog_put(orig_tx_msg);
+}
+
 static int bpf_tcp_ulp_register(void)
 {
 	tcp_bpf_proto = tcp_prot;
 	tcp_bpf_proto.close = bpf_tcp_close;
+	/* Once BPF TX ULP is registered it is never unregistered. It
+	 * will be in the ULP list for the lifetime of the system. Doing
+	 * duplicate registers is not a problem.
+	 */
 	return tcp_register_ulp(&bpf_tcp_ulp_ops);
 }
 
@@ -412,7 +876,6 @@  static int smap_parse_func_strparser(struct strparser *strp,
 	return rc;
 }
 
-
 static int smap_read_sock_done(struct strparser *strp, int err)
 {
 	return err;
@@ -482,6 +945,8 @@  static void smap_gc_work(struct work_struct *w)
 		bpf_prog_put(psock->bpf_parse);
 	if (psock->bpf_verdict)
 		bpf_prog_put(psock->bpf_verdict);
+	if (psock->bpf_tx_msg)
+		bpf_prog_put(psock->bpf_tx_msg);
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
 		list_del(&e->list);
@@ -668,8 +1133,6 @@  static int sock_map_delete_elem(struct bpf_map *map, void *key)
 	if (!psock)
 		goto out;
 
-	if (psock->bpf_parse)
-		smap_stop_sock(psock, sock);
 	smap_list_remove(psock, &stab->sock_map[k]);
 	smap_release_sock(psock, sock);
 out:
@@ -711,10 +1174,11 @@  static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 {
 	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
 	struct smap_psock_map_entry *e = NULL;
-	struct bpf_prog *verdict, *parse;
+	struct bpf_prog *verdict, *parse, *tx_msg;
 	struct sock *osock, *sock;
 	struct smap_psock *psock;
 	u32 i = *(u32 *)key;
+	bool new = false;
 	int err;
 
 	if (unlikely(flags > BPF_EXIST))
@@ -737,6 +1201,7 @@  static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	 */
 	verdict = READ_ONCE(stab->bpf_verdict);
 	parse = READ_ONCE(stab->bpf_parse);
+	tx_msg = READ_ONCE(stab->bpf_tx_msg);
 
 	if (parse && verdict) {
 		/* bpf prog refcnt may be zero if a concurrent attach operation
@@ -755,6 +1220,17 @@  static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		}
 	}
 
+	if (tx_msg) {
+		tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
+		if (IS_ERR(tx_msg)) {
+			if (verdict)
+				bpf_prog_put(verdict);
+			if (parse)
+				bpf_prog_put(parse);
+			return PTR_ERR(tx_msg);
+		}
+	}
+
 	write_lock_bh(&sock->sk_callback_lock);
 	psock = smap_psock_sk(sock);
 
@@ -769,7 +1245,14 @@  static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 			err = -EBUSY;
 			goto out_progs;
 		}
-		refcount_inc(&psock->refcnt);
+		if (READ_ONCE(psock->bpf_tx_msg) && tx_msg) {
+			err = -EBUSY;
+			goto out_progs;
+		}
+		if (!refcount_inc_not_zero(&psock->refcnt)) {
+			err = -EAGAIN;
+			goto out_progs;
+		}
 	} else {
 		psock = smap_init_psock(sock, stab);
 		if (IS_ERR(psock)) {
@@ -777,11 +1260,8 @@  static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 			goto out_progs;
 		}
 
-		err = tcp_set_ulp_id(sock, TCP_ULP_BPF);
-		if (err)
-			goto out_progs;
-
 		set_bit(SMAP_TX_RUNNING, &psock->state);
+		new = true;
 	}
 
 	e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
@@ -794,6 +1274,14 @@  static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	/* 3. At this point we have a reference to a valid psock that is
 	 * running. Attach any BPF programs needed.
 	 */
+	if (tx_msg)
+		bpf_tcp_msg_add(psock, sock, tx_msg);
+	if (new) {
+		err = tcp_set_ulp_id(sock, TCP_ULP_BPF);
+		if (err)
+			goto out_free;
+	}
+
 	if (parse && verdict && !psock->strp_enabled) {
 		err = smap_init_sock(psock, sock);
 		if (err)
@@ -815,8 +1303,6 @@  static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		struct smap_psock *opsock = smap_psock_sk(osock);
 
 		write_lock_bh(&osock->sk_callback_lock);
-		if (osock != sock && parse)
-			smap_stop_sock(opsock, osock);
 		smap_list_remove(opsock, &stab->sock_map[i]);
 		smap_release_sock(opsock, osock);
 		write_unlock_bh(&osock->sk_callback_lock);
@@ -829,6 +1315,8 @@  static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		bpf_prog_put(verdict);
 	if (parse)
 		bpf_prog_put(parse);
+	if (tx_msg)
+		bpf_prog_put(tx_msg);
 	write_unlock_bh(&sock->sk_callback_lock);
 	kfree(e);
 	return err;
@@ -843,6 +1331,9 @@  int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
 		return -EINVAL;
 
 	switch (type) {
+	case BPF_SK_MSG_VERDICT:
+		orig = xchg(&stab->bpf_tx_msg, prog);
+		break;
 	case BPF_SK_SKB_STREAM_PARSER:
 		orig = xchg(&stab->bpf_parse, prog);
 		break;
@@ -904,6 +1395,10 @@  static void sock_map_release(struct bpf_map *map, struct file *map_file)
 	orig = xchg(&stab->bpf_verdict, NULL);
 	if (orig)
 		bpf_prog_put(orig);
+
+	orig = xchg(&stab->bpf_tx_msg, NULL);
+	if (orig)
+		bpf_prog_put(orig);
 }
 
 const struct bpf_map_ops sock_map_ops = {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e24aa32..3aeb4ea 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1315,7 +1315,8 @@  static int bpf_obj_get(const union bpf_attr *attr)
 
 #define BPF_PROG_ATTACH_LAST_FIELD attach_flags
 
-static int sockmap_get_from_fd(const union bpf_attr *attr, bool attach)
+static int sockmap_get_from_fd(const union bpf_attr *attr,
+			       int type, bool attach)
 {
 	struct bpf_prog *prog = NULL;
 	int ufd = attr->target_fd;
@@ -1329,8 +1330,7 @@  static int sockmap_get_from_fd(const union bpf_attr *attr, bool attach)
 		return PTR_ERR(map);
 
 	if (attach) {
-		prog = bpf_prog_get_type(attr->attach_bpf_fd,
-					 BPF_PROG_TYPE_SK_SKB);
+		prog = bpf_prog_get_type(attr->attach_bpf_fd, type);
 		if (IS_ERR(prog)) {
 			fdput(f);
 			return PTR_ERR(prog);
@@ -1382,9 +1382,11 @@  static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_DEVICE:
 		ptype = BPF_PROG_TYPE_CGROUP_DEVICE;
 		break;
+	case BPF_SK_MSG_VERDICT:
+		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_MSG, true);
 	case BPF_SK_SKB_STREAM_PARSER:
 	case BPF_SK_SKB_STREAM_VERDICT:
-		return sockmap_get_from_fd(attr, true);
+		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, true);
 	default:
 		return -EINVAL;
 	}
@@ -1437,9 +1439,11 @@  static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_CGROUP_DEVICE:
 		ptype = BPF_PROG_TYPE_CGROUP_DEVICE;
 		break;
+	case BPF_SK_MSG_VERDICT:
+		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_MSG, false);
 	case BPF_SK_SKB_STREAM_PARSER:
 	case BPF_SK_SKB_STREAM_VERDICT:
-		return sockmap_get_from_fd(attr, false);
+		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, false);
 	default:
 		return -EINVAL;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3c74b16..3d14059 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1248,6 +1248,7 @@  static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_XDP:
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
+	case BPF_PROG_TYPE_SK_MSG:
 		if (meta)
 			return meta->pkt_access;
 
@@ -2062,7 +2063,8 @@  static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_MAP_TYPE_SOCKMAP:
 		if (func_id != BPF_FUNC_sk_redirect_map &&
 		    func_id != BPF_FUNC_sock_map_update &&
-		    func_id != BPF_FUNC_map_delete_elem)
+		    func_id != BPF_FUNC_map_delete_elem &&
+		    func_id != BPF_FUNC_msg_redirect_map)
 			goto error;
 		break;
 	default:
@@ -2100,6 +2102,7 @@  static int check_map_func_compatibility(struct bpf_verifier_env *env,
 			goto error;
 		break;
 	case BPF_FUNC_sk_redirect_map:
+	case BPF_FUNC_msg_redirect_map:
 		if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
 			goto error;
 		break;
diff --git a/net/core/filter.c b/net/core/filter.c
index 33edfa8..314c311 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1890,6 +1890,44 @@  struct sock *do_sk_redirect_map(struct sk_buff *skb)
 	.arg4_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg_buff *, msg,
+	   struct bpf_map *, map, u32, key, u64, flags)
+{
+	/* If user passes invalid input drop the packet. */
+	if (unlikely(flags))
+		return SK_DROP;
+
+	msg->key = key;
+	msg->flags = flags;
+	msg->map = map;
+
+	return SK_PASS;
+}
+
+struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
+{
+	struct sock *sk = NULL;
+
+	if (msg->map) {
+		sk = __sock_map_lookup_elem(msg->map, msg->key);
+
+		msg->key = 0;
+		msg->map = NULL;
+	}
+
+	return sk;
+}
+
+static const struct bpf_func_proto bpf_msg_redirect_map_proto = {
+	.func           = bpf_msg_redirect_map,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_ANYTHING,
+	.arg4_type      = ARG_ANYTHING,
+};
+
 BPF_CALL_1(bpf_get_cgroup_classid, const struct sk_buff *, skb)
 {
 	return task_get_classid(skb);
@@ -3591,6 +3629,16 @@  static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 	}
 }
 
+static const struct bpf_func_proto *sk_msg_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_msg_redirect_map:
+		return &bpf_msg_redirect_map_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
 static const struct bpf_func_proto *sk_skb_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -3980,6 +4028,32 @@  static bool sk_skb_is_valid_access(int off, int size,
 	return bpf_skb_is_valid_access(off, size, type, info);
 }
 
+static bool sk_msg_is_valid_access(int off, int size,
+				   enum bpf_access_type type,
+				   struct bpf_insn_access_aux *info)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	switch (off) {
+	case offsetof(struct sk_msg_md, data):
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case offsetof(struct sk_msg_md, data_end):
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	}
+
+	if (off < 0 || off >= sizeof(struct sk_msg_md))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (size != sizeof(__u32))
+		return false;
+
+	return true;
+}
+
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				  const struct bpf_insn *si,
 				  struct bpf_insn *insn_buf,
@@ -4778,6 +4852,29 @@  static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
+				     const struct bpf_insn *si,
+				     struct bpf_insn *insn_buf,
+				     struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct sk_msg_md, data):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_msg_buff, data));
+		break;
+	case offsetof(struct sk_msg_md, data_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data_end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_msg_buff, data_end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 const struct bpf_verifier_ops sk_filter_verifier_ops = {
 	.get_func_proto		= sk_filter_func_proto,
 	.is_valid_access	= sk_filter_is_valid_access,
@@ -4868,6 +4965,15 @@  static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
 const struct bpf_prog_ops sk_skb_prog_ops = {
 };
 
+const struct bpf_verifier_ops sk_msg_verifier_ops = {
+	.get_func_proto		= sk_msg_func_proto,
+	.is_valid_access	= sk_msg_is_valid_access,
+	.convert_ctx_access	= sk_msg_convert_ctx_access,
+};
+
+const struct bpf_prog_ops sk_msg_prog_ops = {
+};
+
 int sk_detach_filter(struct sock *sk)
 {
 	int ret = -ENOENT;