diff mbox series

[RFC,net-next,08/18] tun: run offloaded XDP program in Tx path

Message ID 20191126100744.5083-9-prashantbhole.linux@gmail.com
State RFC
Delegated to: David Miller
Headers show
Series virtio_net XDP offload | expand

Commit Message

Prashant Bhole Nov. 26, 2019, 10:07 a.m. UTC
run offloaded XDP program as soon as packet is removed from the ptr
ring. Since this is XDP in Tx path, the traditional handling of
XDP actions XDP_TX/REDIRECT isn't valid. For this reason we call
do_xdp_generic_core instead of do_xdp_generic. do_xdp_generic_core
just runs the program and leaves the action handling to us.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tun.c | 149 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 146 insertions(+), 3 deletions(-)

Comments

David Ahern Dec. 1, 2019, 4:39 p.m. UTC | #1
On 11/26/19 4:07 AM, Prashant Bhole wrote:
> run offloaded XDP program as soon as packet is removed from the ptr
> ring. Since this is XDP in Tx path, the traditional handling of
> XDP actions XDP_TX/REDIRECT isn't valid. For this reason we call
> do_xdp_generic_core instead of do_xdp_generic. do_xdp_generic_core
> just runs the program and leaves the action handling to us.

What happens if an offloaded program returns XDP_REDIRECT?

Below you just drop the packet which is going to be a bad user
experience. A better user experience is to detect XDP return codes a
program uses, catch those that are not supported for this use case and
fail the install of the program.
David Miller Dec. 1, 2019, 8:56 p.m. UTC | #2
From: David Ahern <dsahern@gmail.com>
Date: Sun, 1 Dec 2019 09:39:54 -0700

> Below you just drop the packet which is going to be a bad user
> experience. A better user experience is to detect XDP return codes a
> program uses, catch those that are not supported for this use case and
> fail the install of the program.

This is not universally possible.

Return codes can be calculated dynamically, come from maps potentially
shared with other bpf programs, etc.

So unfortunately this suggestion is not tenable.
Michael S. Tsirkin Dec. 1, 2019, 9:40 p.m. UTC | #3
On Sun, Dec 01, 2019 at 12:56:21PM -0800, David Miller wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Sun, 1 Dec 2019 09:39:54 -0700
> 
> > Below you just drop the packet which is going to be a bad user
> > experience. A better user experience is to detect XDP return codes a
> > program uses, catch those that are not supported for this use case and
> > fail the install of the program.
> 
> This is not universally possible.
> 
> Return codes can be calculated dynamically, come from maps potentially
> shared with other bpf programs, etc.
> 
> So unfortunately this suggestion is not tenable.

Right. But it is helpful to expose the supported functionality
to guest in some way, if nothing else then so that
guests can be moved between different hosts.

Also, we need a way to report this kind of event to guest
so it's possible to figure out what went wrong.
David Miller Dec. 1, 2019, 9:54 p.m. UTC | #4
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Sun, 1 Dec 2019 16:40:22 -0500

> Right. But it is helpful to expose the supported functionality
> to guest in some way, if nothing else then so that
> guests can be moved between different hosts.
> 
> Also, we need a way to report this kind of event to guest
> so it's possible to figure out what went wrong.

On the contrary, this is why it is of utmost importance that all
XDP implementations support the full suite of XDP facilities from
the very beginning.

This is why we keep giving people a hard time when they add support
only for some of the XDP return values and semantics.  Users will get
killed by this, and it makes XDP a poor technology to use because
behavior is not consistent across device types.

That's not acceptable and I'll push back on anything that continues
this trend.

If you can't HW offload it, kick it to software.
Jason Wang Dec. 2, 2019, 2:45 a.m. UTC | #5
On 2019/12/2 上午12:39, David Ahern wrote:
> On 11/26/19 4:07 AM, Prashant Bhole wrote:
>> run offloaded XDP program as soon as packet is removed from the ptr
>> ring. Since this is XDP in Tx path, the traditional handling of
>> XDP actions XDP_TX/REDIRECT isn't valid. For this reason we call
>> do_xdp_generic_core instead of do_xdp_generic. do_xdp_generic_core
>> just runs the program and leaves the action handling to us.
> What happens if an offloaded program returns XDP_REDIRECT?
>
> Below you just drop the packet which is going to be a bad user
> experience. A better user experience is to detect XDP return codes a
> program uses, catch those that are not supported for this use case and
> fail the install of the program.


Yes, this could be done in the guest.

Thanks
Jason Wang Dec. 2, 2019, 2:56 a.m. UTC | #6
On 2019/12/2 上午5:54, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Sun, 1 Dec 2019 16:40:22 -0500
>
>> Right. But it is helpful to expose the supported functionality
>> to guest in some way, if nothing else then so that
>> guests can be moved between different hosts.
>>
>> Also, we need a way to report this kind of event to guest
>> so it's possible to figure out what went wrong.
> On the contrary, this is why it is of utmost importance that all
> XDP implementations support the full suite of XDP facilities from
> the very beginning.
>
> This is why we keep giving people a hard time when they add support
> only for some of the XDP return values and semantics.  Users will get
> killed by this, and it makes XDP a poor technology to use because
> behavior is not consistent across device types.
>
> That's not acceptable and I'll push back on anything that continues
> this trend.
>
> If you can't HW offload it, kick it to software.


We can try to work out a solution for XDP_REDIRECT.

Thanks
diff mbox series

Patch

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ecb49101b0b5..466ea69f00ee 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -131,6 +131,7 @@  struct tap_filter {
 /* MAX_TAP_QUEUES 256 is chosen to allow rx/tx queues to be equal
  * to max number of VCPUs in guest. */
 #define MAX_TAP_QUEUES 256
+#define MAX_TAP_BATCH 64
 #define MAX_TAP_FLOWS  4096
 
 #define TUN_FLOW_EXPIRE (3 * HZ)
@@ -2156,6 +2157,109 @@  static ssize_t tun_put_user(struct tun_struct *tun,
 	return total;
 }
 
+static struct sk_buff *tun_prepare_xdp_skb(struct sk_buff *skb)
+{
+	struct sk_buff *nskb;
+
+	if (skb_shared(skb) || skb_cloned(skb)) {
+		nskb = skb_copy(skb, GFP_ATOMIC);
+		consume_skb(skb);
+		return nskb;
+	}
+
+	return skb;
+}
+
+static u32 tun_do_xdp_offload_generic(struct tun_struct *tun,
+				      struct sk_buff *skb)
+{
+	struct tun_prog *xdp_prog;
+	struct xdp_buff xdp;
+	u32 act = XDP_PASS;
+
+	xdp_prog = rcu_dereference(tun->offloaded_xdp_prog);
+	if (xdp_prog) {
+		skb = tun_prepare_xdp_skb(skb);
+		if (!skb) {
+			act = XDP_DROP;
+			kfree_skb(skb);
+			goto drop;
+		}
+
+		act = do_xdp_generic_core(skb, &xdp, xdp_prog->prog);
+		switch (act) {
+		case XDP_TX:
+			/*
+			 * Rx path generic XDP will be called in this path
+			 */
+			netif_receive_skb(skb);
+			break;
+		case XDP_PASS:
+			break;
+		case XDP_REDIRECT:
+			/*
+			 * Since we are not handling this case yet, let's free
+			 * skb here. In case of XDP_DROP/XDP_ABORTED, the skb
+			 * was already freed in do_xdp_generic_core()
+			 */
+			kfree_skb(skb);
+			/* fall through */
+		default:
+			bpf_warn_invalid_xdp_action(act);
+			/* fall through */
+		case XDP_ABORTED:
+			trace_xdp_exception(tun->dev, xdp_prog->prog, act);
+			/* fall through */
+		case XDP_DROP:
+			goto drop;
+		}
+	}
+
+	return act;
+drop:
+	this_cpu_inc(tun->pcpu_stats->tx_dropped);
+	return act;
+}
+
+static u32 tun_do_xdp_offload(struct tun_struct *tun, struct tun_file *tfile,
+			      struct xdp_frame *frame)
+{
+	struct tun_prog *xdp_prog;
+	struct tun_page tpage;
+	struct xdp_buff xdp;
+	u32 act = XDP_PASS;
+	int flush = 0;
+
+	xdp_prog = rcu_dereference(tun->offloaded_xdp_prog);
+	if (xdp_prog) {
+		xdp.data_hard_start = frame->data - frame->headroom;
+		xdp.data = frame->data;
+		xdp.data_end = xdp.data + frame->len;
+		xdp.data_meta = xdp.data - frame->metasize;
+
+		act = bpf_prog_run_xdp(xdp_prog->prog, &xdp);
+		switch (act) {
+		case XDP_PASS:
+			break;
+		case XDP_TX:
+			/* fall through */
+		case XDP_REDIRECT:
+			/* fall through */
+		default:
+			bpf_warn_invalid_xdp_action(act);
+			/* fall through */
+		case XDP_ABORTED:
+			trace_xdp_exception(tun->dev, xdp_prog->prog, act);
+			/* fall through */
+		case XDP_DROP:
+			xdp_return_frame_rx_napi(frame);
+			break;
+		}
+	}
+
+	return act;
+}
+
 static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
 {
 	DECLARE_WAITQUEUE(wait, current);
@@ -2574,6 +2678,47 @@  static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 	return ret;
 }
 
+static int tun_consume_packets(struct tun_file *tfile, void **ptr_array, int n)
+{
+	struct tun_prog *xdp_prog;
+	struct xdp_frame *frame;
+	struct tun_struct *tun;
+	int i, num_ptrs;
+	int pkt_cnt = 0;
+	void *pkts[MAX_TAP_BATCH];
+	void *ptr;
+	u32 act;
+
+	if (unlikely(!tfile))
+		return 0;
+
+	if (n > MAX_TAP_BATCH)
+		n = MAX_TAP_BATCH;
+
+	rcu_read_lock();
+	tun = rcu_dereference(tfile->tun);
+	if (unlikely(!tun))
+		return 0;
+	xdp_prog = rcu_dereference(tun->offloaded_xdp_prog);
+
+	num_ptrs = ptr_ring_consume_batched(&tfile->tx_ring, pkts, n);
+	for (i = 0; i < num_ptrs; i++) {
+		ptr = pkts[i];
+		if (tun_is_xdp_frame(ptr)) {
+			frame = tun_ptr_to_xdp(ptr);
+			act = tun_do_xdp_offload(tun, tfile, frame);
+		} else {
+			act = tun_do_xdp_offload_generic(tun, ptr);
+		}
+
+		if (act == XDP_PASS)
+			ptr_array[pkt_cnt++] = ptr;
+	}
+
+	rcu_read_unlock();
+	return pkt_cnt;
+}
+
 static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 		       int flags)
 {
@@ -2594,9 +2739,7 @@  static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 			ptr = ctl->ptr;
 			break;
 		case TUN_MSG_CONSUME_PKTS:
-			ret = ptr_ring_consume_batched(&tfile->tx_ring,
-						       ctl->ptr,
-						       ctl->num);
+			ret = tun_consume_packets(tfile, ctl->ptr, ctl->num);
 			goto out;
 		case TUN_MSG_UNCONSUME_PKTS:
 			ptr_ring_unconsume(&tfile->tx_ring, ctl->ptr,