
[RFC,v4,bpf-next,09/11] tun: Support xdp in the Tx path for xdp_frames

Message ID: 20200227032013.12385-10-dsahern@kernel.org
State: RFC
Delegated to: BPF Maintainers
Series: Add support for XDP in egress path

Commit Message

David Ahern Feb. 27, 2020, 3:20 a.m. UTC
From: David Ahern <dahern@digitalocean.com>

Add support for running a Tx path program on packets arriving at a tun
device via XDP redirect.

XDP_TX return code means move the packet to the Tx path of the device.
For a program run in the Tx / egress path, XDP_TX is essentially the
same as "continue on" which is XDP_PASS.

Conceptually, XDP_REDIRECT for this path can work the same as it
does for the Rx path, but that return code is left for a follow-on
series.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
Signed-off-by: David Ahern <dahern@digitalocean.com>
---
 drivers/net/tun.c | 49 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)
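
For illustration, a minimal egress program of the kind this hook runs could
look like the following sketch. It is hypothetical and not part of this
series; the SEC() name and attach mechanics are assumptions:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical egress XDP program: drop runt frames, let everything
 * else continue down the Tx path via XDP_PASS.
 */
SEC("xdp")
int xdp_egress_filter(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	if (data + sizeof(struct ethhdr) > data_end)
		return XDP_DROP;

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";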

Comments

Alexei Starovoitov March 2, 2020, 6:30 p.m. UTC | #1
On Wed, Feb 26, 2020 at 08:20:11PM -0700, David Ahern wrote:
> +
> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
> +		switch (act) {
> +		case XDP_TX:    /* for Tx path, XDP_TX == XDP_PASS */
> +			act = XDP_PASS;
> +			break;
> +		case XDP_PASS:
> +			break;
> +		case XDP_REDIRECT:
> +			/* fall through */
> +		default:
> +			bpf_warn_invalid_xdp_action(act);
> +			/* fall through */
> +		case XDP_ABORTED:
> +			trace_xdp_exception(tun->dev, xdp_prog, act);
> +			/* fall through */
> +		case XDP_DROP:
> +			break;
> +		}

patch 8 has very similar switch. Can you share the code?

I'm worried that XDP_TX is a silent alias to XDP_PASS.
What were the reasons to go with this approach?
imo it's less error prone and extensible to warn on XDP_TX.
Which will mean that both XDP_TX and XDP_REDIRECT are not supported for egress atm.
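
A minimal sketch of what that alternative could look like, reusing the names
from the patch above (illustrative only; not code from this series):

	switch (act) {
	case XDP_PASS:
		break;
	case XDP_TX:
	case XDP_REDIRECT:
		/* not supported on the Tx path yet */
	default:
		bpf_warn_invalid_xdp_action(act);
		/* fall through */
	case XDP_ABORTED:
		trace_xdp_exception(tun->dev, xdp_prog, act);
		/* fall through */
	case XDP_DROP:
		break;
	}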

Patches 8 and 9 cover tun only. I'd like to see egress hook to be implemented
in at least one physical NIC. Pick any hw. Something that handles real frames.
Adding this hook to virtual NIC is easy, but it doesn't demonstrate design
trade-offs one would need to think through by adding egress hook to physical
nic. That's why I think it's mandatory to have it as part of the patch set.

Patch 11 exposes egress to samples/bpf. It's nice, but without selftests it's
no go. All new features must be exercised as part of selftests/bpf.

re: patch 3. I agree with Toke and Jesper that union in uapi is unnecessary.
Just drop it. xdp_md is a virtual data structure. It's not allocated anywhere.
There is no need to save non-existent memory.
David Ahern March 3, 2020, 4:27 a.m. UTC | #2
On 3/2/20 11:30 AM, Alexei Starovoitov wrote:
> On Wed, Feb 26, 2020 at 08:20:11PM -0700, David Ahern wrote:
>> +
>> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
>> +		switch (act) {
>> +		case XDP_TX:    /* for Tx path, XDP_TX == XDP_PASS */
>> +			act = XDP_PASS;
>> +			break;
>> +		case XDP_PASS:
>> +			break;
>> +		case XDP_REDIRECT:
>> +			/* fall through */
>> +		default:
>> +			bpf_warn_invalid_xdp_action(act);
>> +			/* fall through */
>> +		case XDP_ABORTED:
>> +			trace_xdp_exception(tun->dev, xdp_prog, act);
>> +			/* fall through */
>> +		case XDP_DROP:
>> +			break;
>> +		}
> 
> patch 8 has very similar switch. Can you share the code?

Most likely; I'll take a look.

> 
> I'm worried that XDP_TX is a silent alias to XDP_PASS.
> What were the reasons to go with this approach?

As I stated in the cover letter:

"XDP_TX on Rx means send the packet out the device it arrived
on; given that, XDP_TX for the Tx path is treated as equivalent to
XDP_PASS - ie., continue on the Tx path."

> imo it's less error prone and extensible to warn on XDP_TX.
> Which will mean that both XDP_TX and XDP_REDIRECT are not supported for egress atm.

I personally don't care either way; I was going with the simplest
concept from a user perspective.

> 
> Patches 8 and 9 cover tun only. I'd like to see egress hook to be implemented
> in at least one physical NIC. Pick any hw. Something that handles real frames.
> Adding this hook to virtual NIC is easy, but it doesn't demonstrate design
> trade-offs one would need to think through by adding egress hook to physical
> nic. That's why I think it's mandatory to have it as part of the patch set.
> 
> Patch 11 exposes egress to samples/bpf. It's nice, but without selftests it's
> no go. All new features must be exercised as part of selftests/bpf.

Patches that exercise the rtnetlink uapi are fairly easy to do on a single
node; anything traffic related requires multiple nodes or namespace-level
capabilities.  Unless I am missing something, that is why all
current XDP tests ride on top of veth; veth changes are not part of this
set.

So to be clear, you are saying that all new XDP features require patches
to a h/w nic, veth and whatever the author really cares about before new
features like this go in?
Jesper Dangaard Brouer March 3, 2020, 9:08 a.m. UTC | #3
On Mon, 2 Mar 2020 21:27:08 -0700
David Ahern <dahern@digitalocean.com> wrote:

> On 3/2/20 11:30 AM, Alexei Starovoitov wrote:
> > On Wed, Feb 26, 2020 at 08:20:11PM -0700, David Ahern wrote:  
> >> +
> >> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
> >> +		switch (act) {
> >> +		case XDP_TX:    /* for Tx path, XDP_TX == XDP_PASS */
> >> +			act = XDP_PASS;
> >> +			break;
> >> +		case XDP_PASS:
> >> +			break;
> >> +		case XDP_REDIRECT:
> >> +			/* fall through */
> >> +		default:
> >> +			bpf_warn_invalid_xdp_action(act);
> >> +			/* fall through */
> >> +		case XDP_ABORTED:
> >> +			trace_xdp_exception(tun->dev, xdp_prog, act);
> >> +			/* fall through */
> >> +		case XDP_DROP:
> >> +			break;
> >> +		}  
> > 
> > patch 8 has very similar switch. Can you share the code?  
> 
> Most likely; I'll take a look.
> 
> > 
> > I'm worried that XDP_TX is a silent alias to XDP_PASS.
> > What were the reasons to go with this approach?  
> 
> As I stated in the cover letter:
> 
> "XDP_TX on Rx means send the packet out the device it arrived
> on; given that, XDP_TX for the Tx path is treated as equivalent to
> XDP_PASS - ie., continue on the Tx path."

I'm not really buying this.  IMHO XDP_PASS should mean continue on the
Tx path, as this is a tx/egress XDP hook.  I don't see a reason to
basically "remove" the XDP_TX return code at this point.

> > imo it's less error prone and extensible to warn on XDP_TX.
> > Which will mean that both XDP_TX and XDP_REDIRECT are not supported
> > for egress atm.  

I agree.

I see "XDP_TX" more as a mirror facility... maybe there is a use-case
for bouncing packets back in the TX/egress hook? That is future work,
but that is no reason to disable the option now.


> I personally don't care either way; I was going with the simplest
> concept from a user perspective.
> 
> > 
> > Patches 8 and 9 cover tun only. I'd like to see egress hook to be
> > implemented in at least one physical NIC. Pick any hw. Something
> > that handles real frames. Adding this hook to virtual NIC is easy,
> > but it doesn't demonstrate design trade-offs one would need to
> > think through by adding egress hook to physical nic. That's why I
> > think it's mandatory to have it as part of the patch set.
> > 
> > Patch 11 exposes egress to samples/bpf. It's nice, but without
> > selftests it's no go. All new features must be exercised as part of
> > selftests/bpf.  
> 
> Patches that exercise the rtnetlink uapi are fairly easy to do on
> single node; anything traffic related requires multiple nodes or
> namespace level capabilities.  Unless I am missing something that is
> why all current XDP tests ride on top of veth; veth changes are not
> part of this set.
> 
> So to be clear you are saying that all new XDP features require
> patches to a h/w nic, veth and whatever the author really cares about
> before new features like this go in?

I would say yes. XDP was created for physical HW NICs, and we need to
show it makes sense for physical HW NICs.
Jesper Dangaard Brouer March 3, 2020, 10:40 a.m. UTC | #4
On Wed, 26 Feb 2020 20:20:11 -0700
David Ahern <dsahern@kernel.org> wrote:

> From: David Ahern <dahern@digitalocean.com>
> 
> Add support to run Tx path program on packets arriving at a tun
> device via XDP redirect.
> 
> XDP_TX return code means move the packet to the Tx path of the device.
> For a program run in the Tx / egress path, XDP_TX is essentially the
> same as "continue on" which is XDP_PASS.
> 
> Conceptually, XDP_REDIRECT for this path can work the same as it
> does for the Rx path, but that return code is left for a follow
> on series.
> 
> Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
> Signed-off-by: David Ahern <dahern@digitalocean.com>
> ---
>  drivers/net/tun.c | 49 +++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 47 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index dcae6521a39d..d3fc7e921c85 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1359,10 +1359,50 @@ static void __tun_xdp_flush_tfile(struct tun_file *tfile)
>  	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
>  }
>  
> +static u32 tun_do_xdp_tx(struct tun_struct *tun, struct tun_file *tfile,
> +			 struct xdp_frame *frame, struct xdp_txq_info *txq)
> +{
> +	struct bpf_prog *xdp_prog;
> +	u32 act = XDP_PASS;
> +
> +	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
> +	if (xdp_prog) {
> +		struct xdp_buff xdp;
> +
> +		xdp.data_hard_start = frame->data - frame->headroom;

This is correct only because frame->headroom has already been reduced by
sizeof(*xdp_frame), as we want to avoid the BPF prog having access
to the xdp_frame memory.  Remember that the memory storing the xdp_frame is
located at the top of the payload/page.
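
An illustrative sketch of that layout (relying on how convert_to_xdp_frame()
builds the frame; not code from this series):

	void *buf_top    = (void *)frame;                  /* top of the buffer/page  */
	void *hard_start = frame->data - frame->headroom;  /* what this patch computes */

	/* convert_to_xdp_frame() stored the struct at the top of the buffer
	 * and subtracted sizeof(*frame) from the headroom, so hard_start ends
	 * up just past the xdp_frame:
	 *
	 *   hard_start == buf_top + sizeof(*frame)
	 *
	 * meaning bpf_xdp_adjust_head() can never expose the xdp_frame memory
	 * to the program.
	 */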


> +		xdp.data = frame->data;
> +		xdp.data_end = xdp.data + frame->len;
> +		xdp_set_data_meta_invalid(&xdp);
> +		xdp.txq = txq;
> +
> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);

The BPF prog can change/adjust the headroom and tailroom (the tail can only
shrink, but I'm working on extending this).  Thus, you need to adjust the
xdp_frame accordingly afterwards.

(The main use-case is pushing on a header, right?)
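
A sketch of the kind of fix-up being asked for here, mirroring how the
xdp_buff was populated above (illustrative only):

	if (act == XDP_PASS) {
		/* propagate any bpf_xdp_adjust_head()/_tail() changes
		 * from the xdp_buff back into the xdp_frame
		 */
		frame->data     = xdp.data;
		frame->len      = xdp.data_end - xdp.data;
		frame->headroom = xdp.data - xdp.data_hard_start;
	}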

> +		switch (act) {
> +		case XDP_TX:    /* for Tx path, XDP_TX == XDP_PASS */
> +			act = XDP_PASS;
> +			break;
> +		case XDP_PASS:
> +			break;
> +		case XDP_REDIRECT:
> +			/* fall through */
> +		default:
> +			bpf_warn_invalid_xdp_action(act);
> +			/* fall through */
> +		case XDP_ABORTED:
> +			trace_xdp_exception(tun->dev, xdp_prog, act);
> +			/* fall through */
> +		case XDP_DROP:
> +			break;
> +		}
> +	}
> +
> +	return act;
> +}
> +
>  static int tun_xdp_xmit(struct net_device *dev, int n,
>  			struct xdp_frame **frames, u32 flags)
>  {
>  	struct tun_struct *tun = netdev_priv(dev);
> +	struct xdp_txq_info txq = { .dev = dev };
>  	struct tun_file *tfile;
>  	u32 numqueues;
>  	int drops = 0;
> @@ -1389,12 +1429,17 @@ static int tun_xdp_xmit(struct net_device *dev, int n,
>  	spin_lock(&tfile->tx_ring.producer_lock);
>  	for (i = 0; i < n; i++) {
>  		struct xdp_frame *xdp = frames[i];
> +		void *frame;
> +
> +		if (tun_do_xdp_tx(tun, tfile, xdp, &txq) != XDP_PASS)
> +			goto drop;
> +
>  		/* Encode the XDP flag into lowest bit for consumer to differ
>  		 * XDP buffer from sk_buff.
>  		 */
> -		void *frame = tun_xdp_to_ptr(xdp);
> -
> +		frame = tun_xdp_to_ptr(xdp);
>  		if (__ptr_ring_produce(&tfile->tx_ring, frame)) {
> +drop:
>  			this_cpu_inc(tun->pcpu_stats->tx_dropped);
>  			xdp_return_frame_rx_napi(xdp);
>  			drops++;
Alexei Starovoitov March 3, 2020, 6:16 p.m. UTC | #5
On Mon, Mar 02, 2020 at 09:27:08PM -0700, David Ahern wrote:
> > 
> > I'm worried that XDP_TX is a silent alias to XDP_PASS.
> > What were the reasons to go with this approach?
> 
> As I stated in the cover letter:
> 
> "XDP_TX on Rx means send the packet out the device it arrived
> on; given that, XDP_TX for the Tx path is treated as equivalent to
> XDP_PASS - ie., continue on the Tx path."

I saw that, but it states the behavior and doesn't answer my "why" question.

> > imo it's less error prone and extensible to warn on XDP_TX.
> > Which will mean that both XDP_TX and XDP_REDIRECT are not supported for egress atm.
> 
> I personally don't care either way; I was going with the simplest
> concept from a user perspective.

That's not a good sign when uapi is designed as "don't care either way".

> > 
> > Patches 8 and 9 cover tun only. I'd like to see egress hook to be implemented
> > in at least one physical NIC. Pick any hw. Something that handles real frames.
> > Adding this hook to virtual NIC is easy, but it doesn't demonstrate design
> > trade-offs one would need to think through by adding egress hook to physical
> > nic. That's why I think it's mandatory to have it as part of the patch set.
> > 
> > Patch 11 exposes egress to samples/bpf. It's nice, but without selftests it's
> > no go. All new features must be exercised as part of selftests/bpf.
> 
> Patches that exercise the rtnetlink uapi are fairly easy to do on single
> node; anything traffic related requires multiple nodes or namespace
> level capabilities.  Unless I am missing something that is why all
> current XDP tests ride on top of veth; veth changes are not part of this
> set.
> 
> So to be clear you are saying that all new XDP features require patches
> to a h/w nic, veth and whatever the author really cares about before new
> features like this go in?

I didn't say 'veth'. I really meant 'physical nic'.
The patch set implies that XDP_EGRESS is a generic concept applicable to
physical and virtual netdevs. There is an implementation for tun. But reading
between the lines I don't see that the api was thought through for a physical nic.
Hence I'm requesting to see the patches that implement it. When you try to
add xdp_egress to a physical nic I suspect there will be challenges that will
force changes to the xdp_egress api, and I want that to happen before the uapi
lands in the tree and becomes frozen.
David Ahern March 10, 2020, 3:06 a.m. UTC | #6
On 3/3/20 3:40 AM, Jesper Dangaard Brouer wrote:
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index dcae6521a39d..d3fc7e921c85 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1359,10 +1359,50 @@ static void __tun_xdp_flush_tfile(struct tun_file *tfile)
>>  	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
>>  }
>>  
>> +static u32 tun_do_xdp_tx(struct tun_struct *tun, struct tun_file *tfile,
>> +			 struct xdp_frame *frame, struct xdp_txq_info *txq)
>> +{
>> +	struct bpf_prog *xdp_prog;
>> +	u32 act = XDP_PASS;
>> +
>> +	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
>> +	if (xdp_prog) {
>> +		struct xdp_buff xdp;
>> +
>> +		xdp.data_hard_start = frame->data - frame->headroom;
> 
> This is correct, only because frame->headroom have been reduced with
> sizeof(*xdp_frame), as we want to avoid that the BPF-prog have access
> to xdp_frame memory.  Remember that memory storing xdp_frame in located
> in the top of the payload/page.
> 
> 
>> +		xdp.data = frame->data;
>> +		xdp.data_end = xdp.data + frame->len;
>> +		xdp_set_data_meta_invalid(&xdp);
>> +		xdp.txq = txq;
>> +
>> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
> 
> The BPF-prog can change/adjust headroom and tailroom (tail only shrink,
> but I'm working on extending this).  Thus, you need to adjust the
> xdp_frame accordingly afterwards.

Why do I need to make any adjustments beyond what is done by
bpf_xdp_adjust_head and bpf_xdp_adjust_tail?

The frame is on its way out, so the stack will not see the frame after
any head or tail changes. (REDIRECT is not supported, only PASS or DROP)
David Ahern March 10, 2020, 3:44 a.m. UTC | #7
On 3/9/20 9:06 PM, David Ahern wrote:
> Why do I need to make any adjustments beyond what is done by
> bpf_xdp_adjust_head and bpf_xdp_adjust_tail?

Never mind. I forgot the switch from xdp_frame to xdp_buff there, so I need
to copy the changes back to the frame.
Jesper Dangaard Brouer March 10, 2020, 9:03 a.m. UTC | #8
On Mon, 9 Mar 2020 21:44:22 -0600
David Ahern <dahern@digitalocean.com> wrote:

> On 3/9/20 9:06 PM, David Ahern wrote:
> > Why do I need to make any adjustments beyond what is done by
> > bpf_xdp_adjust_head and bpf_xdp_adjust_tail?  
> 
> never mind. forgot the switch from xdp_frame to xdp_buff there so need
> to go back to frame.

Yes exactly, glad you realized this yourself ;-)

Patch

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index dcae6521a39d..d3fc7e921c85 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1359,10 +1359,50 @@  static void __tun_xdp_flush_tfile(struct tun_file *tfile)
 	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
 }
 
+static u32 tun_do_xdp_tx(struct tun_struct *tun, struct tun_file *tfile,
+			 struct xdp_frame *frame, struct xdp_txq_info *txq)
+{
+	struct bpf_prog *xdp_prog;
+	u32 act = XDP_PASS;
+
+	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
+	if (xdp_prog) {
+		struct xdp_buff xdp;
+
+		xdp.data_hard_start = frame->data - frame->headroom;
+		xdp.data = frame->data;
+		xdp.data_end = xdp.data + frame->len;
+		xdp_set_data_meta_invalid(&xdp);
+		xdp.txq = txq;
+
+		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+		switch (act) {
+		case XDP_TX:    /* for Tx path, XDP_TX == XDP_PASS */
+			act = XDP_PASS;
+			break;
+		case XDP_PASS:
+			break;
+		case XDP_REDIRECT:
+			/* fall through */
+		default:
+			bpf_warn_invalid_xdp_action(act);
+			/* fall through */
+		case XDP_ABORTED:
+			trace_xdp_exception(tun->dev, xdp_prog, act);
+			/* fall through */
+		case XDP_DROP:
+			break;
+		}
+	}
+
+	return act;
+}
+
 static int tun_xdp_xmit(struct net_device *dev, int n,
 			struct xdp_frame **frames, u32 flags)
 {
 	struct tun_struct *tun = netdev_priv(dev);
+	struct xdp_txq_info txq = { .dev = dev };
 	struct tun_file *tfile;
 	u32 numqueues;
 	int drops = 0;
@@ -1389,12 +1429,17 @@  static int tun_xdp_xmit(struct net_device *dev, int n,
 	spin_lock(&tfile->tx_ring.producer_lock);
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *xdp = frames[i];
+		void *frame;
+
+		if (tun_do_xdp_tx(tun, tfile, xdp, &txq) != XDP_PASS)
+			goto drop;
+
 		/* Encode the XDP flag into lowest bit for consumer to differ
 		 * XDP buffer from sk_buff.
 		 */
-		void *frame = tun_xdp_to_ptr(xdp);
-
+		frame = tun_xdp_to_ptr(xdp);
 		if (__ptr_ring_produce(&tfile->tx_ring, frame)) {
+drop:
 			this_cpu_inc(tun->pcpu_stats->tx_dropped);
 			xdp_return_frame_rx_napi(xdp);
 			drops++;
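
For context on the "Encode the XDP flag into lowest bit" comment above: tun
queues both sk_buffs and xdp_frames on the same ptr_ring and tags xdp_frame
pointers via their lowest bit. A sketch of that convention, modelled on the
existing tun_xdp_to_ptr()/tun_ptr_to_xdp()/tun_is_xdp_frame() helpers in
tun.c (abbreviated; check the tree for the exact definitions):

#define TUN_XDP_FLAG 0x1UL

/* tag an xdp_frame pointer before placing it on the ring */
static void *tun_xdp_to_ptr(void *ptr)
{
	return (void *)((unsigned long)ptr | TUN_XDP_FLAG);
}

/* strip the tag when the consumer pulls the entry back off the ring */
static void *tun_ptr_to_xdp(void *ptr)
{
	return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG);
}

/* distinguish an xdp_frame entry from an sk_buff entry */
static bool tun_is_xdp_frame(void *ptr)
{
	return (unsigned long)ptr & TUN_XDP_FLAG;
}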