
[RFC,05/24] bpf: added bpf_xdpsk_redirect

Message ID 20180131135356.19134-6-bjorn.topel@gmail.com
State RFC, archived
Delegated to: David Miller
Series: Introducing AF_XDP support

Commit Message

Björn Töpel Jan. 31, 2018, 1:53 p.m. UTC
From: Björn Töpel <bjorn.topel@intel.com>

The bpf_xdpsk_redirect call redirects the XDP context to the XDP
socket bound to the receiving queue (if any).

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/uapi/linux/bpf.h                  |  6 +++++-
 net/core/filter.c                         | 24 ++++++++++++++++++++++++
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 3 files changed, 31 insertions(+), 1 deletion(-)

Comments

Jesper Dangaard Brouer Feb. 5, 2018, 1:42 p.m. UTC | #1
On Wed, 31 Jan 2018 14:53:37 +0100 Björn Töpel <bjorn.topel@gmail.com> wrote:

> The bpf_xdpsk_redirect call redirects the XDP context to the XDP
> socket bound to the receiving queue (if any).

As I explained in person at FOSDEM, my suggestion is to use the
bpf-map infrastructure for AF_XDP redirect, but people on this
upstream mailing list also need a chance to validate my idea ;-)

The important thing to keep in mind is how we can still maintain an
SPSC (Single Producer, Single Consumer) relationship between an
RX-queue and a userspace consumer process.

This AF_XDP "FOSDEM" patchset stores the "xsk" (xdp_sock) pointer
directly in the net_device (_rx[].netdev_rx_queue.xs) structure.
This limits each RX-queue to servicing a single xdp_sock.  That
sounds good from an SPSC point of view, but it is not very flexible.
With an "xdp_sock_map" we gain the flexibility to select among
multiple xdp_socks (via an XDP pre-filter selecting a different map)
while still maintaining an SPSC relationship: the RX-queue simply has
several SPSC relationships with the individual xdp_socks.
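
[Editor's note: a minimal sketch of what such a pre-filter could look
like from the BPF program side.  The BPF_MAP_TYPE_XSKMAP map type is
hypothetical at this point in the discussion, and the kernel is
assumed to key the lookup by the receiving queue, as described
further below.]

/* Editor's sketch (hypothetical): an XDP pre-filter choosing between
 * two xdp_sock maps.  The index argument to bpf_redirect_map() is
 * assumed to be ignored for this map type; the kernel keys the
 * lookup by the receiving queue, so each RX-queue keeps one SPSC
 * pairing per map.
 */
#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") xsks_small = {
	.type        = BPF_MAP_TYPE_XSKMAP,	/* hypothetical map type */
	.key_size    = sizeof(int),
	.value_size  = sizeof(int),
	.max_entries = 64,			/* one slot per RX-queue */
};

struct bpf_map_def SEC("maps") xsks_large = {
	.type        = BPF_MAP_TYPE_XSKMAP,	/* hypothetical map type */
	.key_size    = sizeof(int),
	.value_size  = sizeof(int),
	.max_entries = 64,
};

SEC("xdp")
int xdp_prefilter(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	/* Toy classifier: frames shorter than 128 bytes go to one
	 * set of sockets, everything else to the other.
	 */
	if (data + 128 > data_end)
		return bpf_redirect_map(&xsks_small, 0, 0);
	return bpf_redirect_map(&xsks_large, 0, 0);
}

char _license[] SEC("license") = "GPL";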

This is true for the AF_XDP copy mode, and it requires fewer driver
changes.  For the AF_XDP zero-copy (ZC) mode, drivers need
significant changes anyhow, and in the ZC case we will have to
disallow multiple xdp_socks.  That is simply achieved by checking
that the xdp_sock pointer returned from the map lookup matches the
one for which userspace requested the driver to register its RX-ring
memory.

The "xdp_sock_map" is an array, where the index correspond to the
queue_index.  The bpf_redirect_map() ignore the specified index and
instead use xdp_rxq_info->queue_index in the lookup.
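
[Editor's note: in kernel terms, the lookup rule described above
could look roughly like the sketch below.  struct xsk_map and its
members are hypothetical; only xdp_buff and xdp_rxq_info exist at
this point.]

/* Editor's sketch (hypothetical kernel side): the index passed by
 * the BPF program is ignored; the lookup is keyed by the queue the
 * frame arrived on.
 */
static struct xdp_sock *xsk_map_lookup(struct bpf_map *map,
				       struct xdp_buff *xdp)
{
	struct xsk_map *m = container_of(map, struct xsk_map, map);
	u32 qid = xdp->rxq->queue_index;	/* from xdp_rxq_info */

	if (qid >= m->map.max_entries)
		return NULL;
	return READ_ONCE(m->xsk_map[qid]);	/* xdp_sock or NULL */
}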

Notice that a bpf-map has no pinned relationship with the device or
the loaded XDP prog.  Thus, userspace needs to bind() this map to the
device before traffic can flow, like the proposed bind() on the
xdp_sock; this establishes the SPSC binding.  My proposal is that
userspace inserts the xdp_sock file descriptor(s) into the map at the
queue index, and that the map (which is also just a file descriptor)
is then bound, perhaps via bind(), to a specific device (via the
ifindex).  The kernel side will walk the map and perform the required
actions for the xdp_socks it finds in the map.
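
[Editor's note: a sketch of the proposed userspace flow, using the
libbpf-style helpers from tools/lib/bpf.  The BPF_MAP_TYPE_XSKMAP
type and the bind_map_to_device() step are proposals from this
thread, not existing API.]

/* Editor's sketch (hypothetical userspace side). */
#include <bpf/bpf.h>	/* bpf_create_map(), bpf_map_update_elem() */

int setup_xsk_map(int xsk_fd, int ifindex, int qid, int num_rx_queues)
{
	int map_fd;

	map_fd = bpf_create_map(BPF_MAP_TYPE_XSKMAP,	/* hypothetical */
				sizeof(int), sizeof(int),
				num_rx_queues, 0);
	if (map_fd < 0)
		return map_fd;

	/* Insert the xdp_sock file descriptor at its queue index... */
	bpf_map_update_elem(map_fd, &qid, &xsk_fd, BPF_ANY);

	/* ...and tie the map to the device so the kernel can walk it
	 * and establish the SPSC binding for each socket it finds.
	 * The exact bind mechanism is left open in the proposal.
	 */
	return bind_map_to_device(map_fd, ifindex);	/* hypothetical */
}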

The TX side is harder, as multiple xdp_socks can now have the same
queue-pair ID on the same net_device.  But Magnus proposes that this
can be solved in hardware: newer NICs have many TX queues, the
queue-pair ID is just an externally visible number, and the
kernel-internal structure can point to a dedicated TX queue per
xdp_sock.
Björn Töpel Feb. 7, 2018, 9:11 p.m. UTC | #2
2018-02-05 14:42 GMT+01:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> On Wed, 31 Jan 2018 14:53:37 +0100 Björn Töpel <bjorn.topel@gmail.com> wrote:
>
>> The bpf_xdpsk_redirect call redirects the XDP context to the XDP
>> socket bound to the receiving queue (if any).
>
> As I explained in person at FOSDEM, my suggestion is to use the
> bpf-map infrastructure for AF_XDP redirect, but people on this
> upstream mailing list also need a chance to validate my idea ;-)
>
> The important thing to keep in mind is how we can still maintain an
> SPSC (Single Producer, Single Consumer) relationship between an
> RX-queue and a userspace consumer process.
>
> This AF_XDP "FOSDEM" patchset stores the "xsk" (xdp_sock) pointer
> directly in the net_device (_rx[].netdev_rx_queue.xs) structure.
> This limits each RX-queue to servicing a single xdp_sock.  That
> sounds good from an SPSC point of view, but it is not very flexible.
> With an "xdp_sock_map" we gain the flexibility to select among
> multiple xdp_socks (via an XDP pre-filter selecting a different map)
> while still maintaining an SPSC relationship: the RX-queue simply has
> several SPSC relationships with the individual xdp_socks.
>
> This is true for the AF_XDP copy mode, and it requires fewer driver
> changes.  For the AF_XDP zero-copy (ZC) mode, drivers need
> significant changes anyhow, and in the ZC case we will have to
> disallow multiple xdp_socks.  That is simply achieved by checking
> that the xdp_sock pointer returned from the map lookup matches the
> one for which userspace requested the driver to register its RX-ring
> memory.
>
> The "xdp_sock_map" is an array, where the index correspond to the
> queue_index.  The bpf_redirect_map() ignore the specified index and
> instead use xdp_rxq_info->queue_index in the lookup.
>
> Notice that a bpf-map has no pinned relationship with the device or
> the loaded XDP prog.  Thus, userspace needs to bind() this map to the
> device before traffic can flow, like the proposed bind() on the
> xdp_sock; this establishes the SPSC binding.  My proposal is that
> userspace inserts the xdp_sock file descriptor(s) into the map at the
> queue index, and that the map (which is also just a file descriptor)
> is then bound, perhaps via bind(), to a specific device (via the
> ifindex).  The kernel side will walk the map and perform the required
> actions for the xdp_socks it finds in the map.
>

As we discussed at FOSDEM, I like the idea of using a map.  It also
opens up the possibility of configuring the AF_XDP sockets via bpf
code, like sockmap does.

I'll have a stab at adding an "xdp_sock_map/xskmap" or similar, and
also extending the cgroup sock_ops to support AF_XDP sockets, so that
the xskmap can be configured from bpf-land.


Björn

> The TX side is harder, as multiple xdp_socks can now have the same
> queue-pair ID on the same net_device.  But Magnus proposes that this
> can be solved in hardware: newer NICs have many TX queues, the
> queue-pair ID is just an externally visible number, and the
> kernel-internal structure can point to a dedicated TX queue per
> xdp_sock.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index db6bdc375126..18f9ee7cb529 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -696,6 +696,9 @@  union bpf_attr {
  * int bpf_override_return(pt_regs, rc)
  *	@pt_regs: pointer to struct pt_regs
  *	@rc: the return value to set
+ *
+ * int bpf_xdpsk_redirect()
+ *	Return: XDP_REDIRECT
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -757,7 +760,8 @@  union bpf_attr {
 	FN(perf_prog_read_value),	\
 	FN(getsockopt),			\
 	FN(override_return),		\
-	FN(sock_ops_cb_flags_set),
+	FN(sock_ops_cb_flags_set),	\
+	FN(xdpsk_redirect),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 08ab4c65a998..aedf57489cb5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1809,6 +1809,8 @@  struct redirect_info {
 	struct bpf_map *map;
 	struct bpf_map *map_to_flush;
 	unsigned long   map_owner;
+	bool to_xsk;
+	/* XXX cache xsk socket here, to avoid lookup? */
 };
 
 static DEFINE_PER_CPU(struct redirect_info, redirect_info);
@@ -2707,6 +2709,7 @@  static int xdp_do_generic_redirect_map(struct net_device *dev,
 	ri->ifindex = 0;
 	ri->map = NULL;
 	ri->map_owner = 0;
+	ri->to_xsk = false;
 
 	if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
 		err = -EFAULT;
@@ -2817,6 +2820,25 @@  static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_0(bpf_xdpsk_redirect)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+
+	/* XXX Would it be better to check for socket existence here,
+	 * and XDP_ABORTED on failure? Also, then we can populate xsk
+	 * in ri, and don't have to do the lookup multiple times.
+	 */
+	ri->to_xsk = true;
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdpsk_redirect_proto = {
+	.func		= bpf_xdpsk_redirect,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+};
+
 bool bpf_helper_changes_pkt_data(void *func)
 {
 	if (func == bpf_skb_vlan_push ||
@@ -3544,6 +3566,8 @@  xdp_func_proto(enum bpf_func_id func_id)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_xdpsk_redirect:
+		return &bpf_xdpsk_redirect_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index dde2c11d7771..5898ad7a8e40 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -86,6 +86,8 @@  static int (*bpf_perf_prog_read_value)(void *ctx, void *buf,
 	(void *) BPF_FUNC_perf_prog_read_value;
 static int (*bpf_override_return)(void *ctx, unsigned long rc) =
 	(void *) BPF_FUNC_override_return;
+static int (*bpf_xdpsk_redirect)(void) =
+	(void *) BPF_FUNC_xdpsk_redirect;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
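
[Editor's note: for reference, with the helper as declared in the
bpf_helpers.h hunk above, a minimal XDP program using this patch
would look like the following sketch; the program and section names
are illustrative.]

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
	/* Redirect every frame to the XDP socket bound to the
	 * receiving queue, if any; the helper only flags the per-CPU
	 * redirect_info and returns XDP_REDIRECT.
	 */
	return bpf_xdpsk_redirect();
}

char _license[] SEC("license") = "GPL";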