| Message ID | 20200227032013.12385-10-dsahern@kernel.org |
|---|---|
| State | RFC |
| Delegated to | BPF Maintainers |
| Series | Add support for XDP in egress path |
On Wed, Feb 26, 2020 at 08:20:11PM -0700, David Ahern wrote:
> +
> +	act = bpf_prog_run_xdp(xdp_prog, &xdp);
> +	switch (act) {
> +	case XDP_TX:	/* for Tx path, XDP_TX == XDP_PASS */
> +		act = XDP_PASS;
> +		break;
> +	case XDP_PASS:
> +		break;
> +	case XDP_REDIRECT:
> +		/* fall through */
> +	default:
> +		bpf_warn_invalid_xdp_action(act);
> +		/* fall through */
> +	case XDP_ABORTED:
> +		trace_xdp_exception(tun->dev, xdp_prog, act);
> +		/* fall through */
> +	case XDP_DROP:
> +		break;
> +	}

patch 8 has a very similar switch. Can you share the code?

I'm worried that XDP_TX is a silent alias to XDP_PASS. What were the reasons to go with this approach? imo it's less error prone and extensible to warn on XDP_TX, which will mean that both XDP_TX and XDP_REDIRECT are not supported for egress atm.

Patches 8 and 9 cover tun only. I'd like to see the egress hook implemented in at least one physical NIC. Pick any hw. Something that handles real frames. Adding this hook to a virtual NIC is easy, but it doesn't demonstrate the design trade-offs one would need to think through when adding an egress hook to a physical NIC. That's why I think it's mandatory to have it as part of the patch set.

Patch 11 exposes egress to samples/bpf. It's nice, but without selftests it's a no-go. All new features must be exercised as part of selftests/bpf.

re: patch 3. I agree with Toke and Jesper that the union in the uapi is unnecessary. Just drop it. xdp_md is a virtual data structure. It's not allocated anywhere. There is no need to save non-existent memory.
On 3/2/20 11:30 AM, Alexei Starovoitov wrote:
> On Wed, Feb 26, 2020 at 08:20:11PM -0700, David Ahern wrote:
>> +
>> +	act = bpf_prog_run_xdp(xdp_prog, &xdp);
>> +	switch (act) {
>> +	case XDP_TX:	/* for Tx path, XDP_TX == XDP_PASS */
>> +		act = XDP_PASS;
>> +		break;
>> +	case XDP_PASS:
>> +		break;
>> +	case XDP_REDIRECT:
>> +		/* fall through */
>> +	default:
>> +		bpf_warn_invalid_xdp_action(act);
>> +		/* fall through */
>> +	case XDP_ABORTED:
>> +		trace_xdp_exception(tun->dev, xdp_prog, act);
>> +		/* fall through */
>> +	case XDP_DROP:
>> +		break;
>> +	}
>
> patch 8 has very similar switch. Can you share the code?

Most likely; I'll take a look.

> I'm worried that XDP_TX is a silent alias to XDP_PASS.
> What were the reasons to go with this approach?

As I stated in the cover letter:

"XDP_TX on Rx means send the packet out the device it arrived on; given that, XDP_TX for the Tx path is treated as equivalent to XDP_PASS - ie., continue on the Tx path."

> imo it's less error prone and extensible to warn on XDP_TX.
> Which will mean that both XDP_TX and XDP_REDIRECT are not supported for egress atm.

I personally don't care either way; I was going with the simplest concept from a user perspective.

> Patches 8 and 9 cover tun only. I'd like to see egress hook to be implemented
> in at least one physical NIC. Pick any hw. Something that handles real frames.
> Adding this hook to virtual NIC is easy, but it doesn't demonstrate design
> trade-offs one would need to think through by adding egress hook to physical
> nic. That's why I think it's mandatory to have it as part of the patch set.
>
> Patch 11 exposes egress to samples/bpf. It's nice, but without selftests it's
> no go. All new features must be exercised as part of selftests/bpf.

Patches that exercise the rtnetlink uapi are fairly easy to do on a single node; anything traffic related requires multiple nodes or namespace-level capabilities. Unless I am missing something, that is why all current XDP tests ride on top of veth; veth changes are not part of this set.

So to be clear, you are saying that all new XDP features require patches to a h/w NIC, veth, and whatever the author really cares about before new features like this go in?
On Mon, 2 Mar 2020 21:27:08 -0700 David Ahern <dahern@digitalocean.com> wrote:
> On 3/2/20 11:30 AM, Alexei Starovoitov wrote:
> > On Wed, Feb 26, 2020 at 08:20:11PM -0700, David Ahern wrote:
> >> +
> >> +	act = bpf_prog_run_xdp(xdp_prog, &xdp);
> >> +	switch (act) {
> >> +	case XDP_TX:	/* for Tx path, XDP_TX == XDP_PASS */
> >> +		act = XDP_PASS;
> >> +		break;
> >> +	case XDP_PASS:
> >> +		break;
> >> +	case XDP_REDIRECT:
> >> +		/* fall through */
> >> +	default:
> >> +		bpf_warn_invalid_xdp_action(act);
> >> +		/* fall through */
> >> +	case XDP_ABORTED:
> >> +		trace_xdp_exception(tun->dev, xdp_prog, act);
> >> +		/* fall through */
> >> +	case XDP_DROP:
> >> +		break;
> >> +	}
> >
> > patch 8 has very similar switch. Can you share the code?
>
> Most likely; I'll take a look.
>
> > I'm worried that XDP_TX is a silent alias to XDP_PASS.
> > What were the reasons to go with this approach?
>
> As I stated in the cover letter:
>
> "XDP_TX on Rx means send the packet out the device it arrived
> on; given that, XDP_TX for the Tx path is treated as equivalent to
> XDP_PASS - ie., continue on the Tx path."

I'm not really buying this. IMHO XDP_PASS should mean continue on the Tx path, as this is a tx/egress XDP hook. I don't see a reason to basically "remove" the XDP_TX return code at this point.

> > imo it's less error prone and extensible to warn on XDP_TX.
> > Which will mean that both XDP_TX and XDP_REDIRECT are not supported
> > for egress atm.

I agree. I see "XDP_TX" more as a mirror facility... maybe there is a use-case for bouncing packets back in the TX/egress hook? That is future work, but not a reason to disable the option now.

> I personally don't care either way; I was going with the simplest
> concept from a user perspective.
>
> > Patches 8 and 9 cover tun only. I'd like to see egress hook to be
> > implemented in at least one physical NIC. Pick any hw. Something
> > that handles real frames. Adding this hook to virtual NIC is easy,
> > but it doesn't demonstrate design trade-offs one would need to
> > think through by adding egress hook to physical nic. That's why I
> > think it's mandatory to have it as part of the patch set.
> >
> > Patch 11 exposes egress to samples/bpf. It's nice, but without
> > selftests it's no go. All new features must be exercised as part of
> > selftests/bpf.
>
> Patches that exercise the rtnetlink uapi are fairly easy to do on
> single node; anything traffic related requires multiple nodes or
> namespace level capabilities. Unless I am missing something that is
> why all current XDP tests ride on top of veth; veth changes are not
> part of this set.
>
> So to be clear you are saying that all new XDP features require
> patches to a h/w nic, veth and whatever the author really cares about
> before new features like this go in?

I would say yes. XDP was founded on physical HW NICs, and we need to show it makes sense for physical HW NICs.
On Wed, 26 Feb 2020 20:20:11 -0700 David Ahern <dsahern@kernel.org> wrote:
> From: David Ahern <dahern@digitalocean.com>
>
> Add support to run Tx path program on packets arriving at a tun
> device via XDP redirect.
>
> XDP_TX return code means move the packet to the Tx path of the device.
> For a program run in the Tx / egress path, XDP_TX is essentially the
> same as "continue on" which is XDP_PASS.
>
> Conceptually, XDP_REDIRECT for this path can work the same as it
> does for the Rx path, but that return code is left for a follow
> on series.
>
> Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
> Signed-off-by: David Ahern <dahern@digitalocean.com>
> ---
>  drivers/net/tun.c | 49 +++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 47 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index dcae6521a39d..d3fc7e921c85 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1359,10 +1359,50 @@ static void __tun_xdp_flush_tfile(struct tun_file *tfile)
>  	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
>  }
>
> +static u32 tun_do_xdp_tx(struct tun_struct *tun, struct tun_file *tfile,
> +			 struct xdp_frame *frame, struct xdp_txq_info *txq)
> +{
> +	struct bpf_prog *xdp_prog;
> +	u32 act = XDP_PASS;
> +
> +	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
> +	if (xdp_prog) {
> +		struct xdp_buff xdp;
> +
> +		xdp.data_hard_start = frame->data - frame->headroom;

This is correct only because frame->headroom has been reduced by sizeof(*xdp_frame), as we want to avoid the BPF-prog having access to xdp_frame memory. Remember that the memory storing the xdp_frame is located at the top of the payload/page.

> +		xdp.data = frame->data;
> +		xdp.data_end = xdp.data + frame->len;
> +		xdp_set_data_meta_invalid(&xdp);
> +		xdp.txq = txq;
> +
> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);

The BPF-prog can change/adjust headroom and tailroom (tail can only shrink, but I'm working on extending this). Thus, you need to adjust the xdp_frame accordingly afterwards. (The main use-case is pushing on a header, right?)

> +		switch (act) {
> +		case XDP_TX:	/* for Tx path, XDP_TX == XDP_PASS */
> +			act = XDP_PASS;
> +			break;
> +		case XDP_PASS:
> +			break;
> +		case XDP_REDIRECT:
> +			/* fall through */
> +		default:
> +			bpf_warn_invalid_xdp_action(act);
> +			/* fall through */
> +		case XDP_ABORTED:
> +			trace_xdp_exception(tun->dev, xdp_prog, act);
> +			/* fall through */
> +		case XDP_DROP:
> +			break;
> +		}
> +	}
> +
> +	return act;
> +}
> +
>  static int tun_xdp_xmit(struct net_device *dev, int n,
>  			struct xdp_frame **frames, u32 flags)
>  {
>  	struct tun_struct *tun = netdev_priv(dev);
> +	struct xdp_txq_info txq = { .dev = dev };
>  	struct tun_file *tfile;
>  	u32 numqueues;
>  	int drops = 0;
> @@ -1389,12 +1429,17 @@ static int tun_xdp_xmit(struct net_device *dev, int n,
>  	spin_lock(&tfile->tx_ring.producer_lock);
>  	for (i = 0; i < n; i++) {
>  		struct xdp_frame *xdp = frames[i];
> +		void *frame;
> +
> +		if (tun_do_xdp_tx(tun, tfile, xdp, &txq) != XDP_PASS)
> +			goto drop;
> +
>  		/* Encode the XDP flag into lowest bit for consumer to differ
>  		 * XDP buffer from sk_buff.
>  		 */
> -		void *frame = tun_xdp_to_ptr(xdp);
> -
> +		frame = tun_xdp_to_ptr(xdp);
>  		if (__ptr_ring_produce(&tfile->tx_ring, frame)) {
> +drop:
>  			this_cpu_inc(tun->pcpu_stats->tx_dropped);
>  			xdp_return_frame_rx_napi(xdp);
>  			drops++;
On Mon, Mar 02, 2020 at 09:27:08PM -0700, David Ahern wrote:
> > I'm worried that XDP_TX is a silent alias to XDP_PASS.
> > What were the reasons to go with this approach?
>
> As I stated in the cover letter:
>
> "XDP_TX on Rx means send the packet out the device it arrived
> on; given that, XDP_TX for the Tx path is treated as equivalent to
> XDP_PASS - ie., continue on the Tx path."

I saw that, but it states the behavior and doesn't answer my "why" question.

> > imo it's less error prone and extensible to warn on XDP_TX.
> > Which will mean that both XDP_TX and XDP_REDIRECT are not supported for egress atm.
>
> I personally don't care either way; I was going with the simplest
> concept from a user perspective.

That's not a good sign when a uapi is designed as "don't care either way".

> > Patches 8 and 9 cover tun only. I'd like to see egress hook to be implemented
> > in at least one physical NIC. Pick any hw. Something that handles real frames.
> > Adding this hook to virtual NIC is easy, but it doesn't demonstrate design
> > trade-offs one would need to think through by adding egress hook to physical
> > nic. That's why I think it's mandatory to have it as part of the patch set.
> >
> > Patch 11 exposes egress to samples/bpf. It's nice, but without selftests it's
> > no go. All new features must be exercised as part of selftests/bpf.
>
> Patches that exercise the rtnetlink uapi are fairly easy to do on single
> node; anything traffic related requires multiple nodes or namespace
> level capabilities. Unless I am missing something that is why all
> current XDP tests ride on top of veth; veth changes are not part of this
> set.
>
> So to be clear you are saying that all new XDP features require patches
> to a h/w nic, veth and whatever the author really cares about before new
> features like this go in?

I didn't say 'veth'. I really meant 'physical nic'. The patch set implies that XDP_EGRESS is a generic concept, applicable to physical and virtual netdevs. There is an implementation for tun, but reading between the lines I don't see that the api was thought through on a physical nic. Hence I'm requesting to see the patches that implement it. When you try to add xdp_egress to a physical nic, I suspect there will be challenges that will force changes to the xdp_egress api, and I want that to happen before the uapi lands in the tree and becomes frozen.
On 3/3/20 3:40 AM, Jesper Dangaard Brouer wrote:
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index dcae6521a39d..d3fc7e921c85 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1359,10 +1359,50 @@ static void __tun_xdp_flush_tfile(struct tun_file *tfile)
>>  	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
>>  }
>>
>> +static u32 tun_do_xdp_tx(struct tun_struct *tun, struct tun_file *tfile,
>> +			 struct xdp_frame *frame, struct xdp_txq_info *txq)
>> +{
>> +	struct bpf_prog *xdp_prog;
>> +	u32 act = XDP_PASS;
>> +
>> +	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
>> +	if (xdp_prog) {
>> +		struct xdp_buff xdp;
>> +
>> +		xdp.data_hard_start = frame->data - frame->headroom;
>
> This is correct only because frame->headroom has been reduced by
> sizeof(*xdp_frame), as we want to avoid the BPF-prog having access
> to xdp_frame memory. Remember that the memory storing the xdp_frame
> is located at the top of the payload/page.
>
>> +		xdp.data = frame->data;
>> +		xdp.data_end = xdp.data + frame->len;
>> +		xdp_set_data_meta_invalid(&xdp);
>> +		xdp.txq = txq;
>> +
>> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
>
> The BPF-prog can change/adjust headroom and tailroom (tail can only
> shrink, but I'm working on extending this). Thus, you need to adjust
> the xdp_frame accordingly afterwards.

Why do I need to make any adjustments beyond what is done by bpf_xdp_adjust_head and bpf_xdp_adjust_tail? The frame is on its way out, so the stack will not see the frame after any head or tail changes. (REDIRECT is not supported, only PASS or DROP.)
On 3/9/20 9:06 PM, David Ahern wrote:
> Why do I need to make any adjustments beyond what is done by
> bpf_xdp_adjust_head and bpf_xdp_adjust_tail?

never mind. forgot the switch from xdp_frame to xdp_buff there, so the changes need to go back to the frame.
On Mon, 9 Mar 2020 21:44:22 -0600 David Ahern <dahern@digitalocean.com> wrote:
> On 3/9/20 9:06 PM, David Ahern wrote:
> > Why do I need to make any adjustments beyond what is done by
> > bpf_xdp_adjust_head and bpf_xdp_adjust_tail?
>
> never mind. forgot the switch from xdp_frame to xdp_buff there so need
> to go back to frame.

Yes exactly, glad you realized this yourself ;-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index dcae6521a39d..d3fc7e921c85 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1359,10 +1359,50 @@ static void __tun_xdp_flush_tfile(struct tun_file *tfile)
 	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
 }
 
+static u32 tun_do_xdp_tx(struct tun_struct *tun, struct tun_file *tfile,
+			 struct xdp_frame *frame, struct xdp_txq_info *txq)
+{
+	struct bpf_prog *xdp_prog;
+	u32 act = XDP_PASS;
+
+	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
+	if (xdp_prog) {
+		struct xdp_buff xdp;
+
+		xdp.data_hard_start = frame->data - frame->headroom;
+		xdp.data = frame->data;
+		xdp.data_end = xdp.data + frame->len;
+		xdp_set_data_meta_invalid(&xdp);
+		xdp.txq = txq;
+
+		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+		switch (act) {
+		case XDP_TX:	/* for Tx path, XDP_TX == XDP_PASS */
+			act = XDP_PASS;
+			break;
+		case XDP_PASS:
+			break;
+		case XDP_REDIRECT:
+			/* fall through */
+		default:
+			bpf_warn_invalid_xdp_action(act);
+			/* fall through */
+		case XDP_ABORTED:
+			trace_xdp_exception(tun->dev, xdp_prog, act);
+			/* fall through */
+		case XDP_DROP:
+			break;
+		}
+	}
+
+	return act;
+}
+
 static int tun_xdp_xmit(struct net_device *dev, int n,
 			struct xdp_frame **frames, u32 flags)
 {
 	struct tun_struct *tun = netdev_priv(dev);
+	struct xdp_txq_info txq = { .dev = dev };
 	struct tun_file *tfile;
 	u32 numqueues;
 	int drops = 0;
@@ -1389,12 +1429,17 @@ static int tun_xdp_xmit(struct net_device *dev, int n,
 	spin_lock(&tfile->tx_ring.producer_lock);
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *xdp = frames[i];
+		void *frame;
+
+		if (tun_do_xdp_tx(tun, tfile, xdp, &txq) != XDP_PASS)
+			goto drop;
+
 		/* Encode the XDP flag into lowest bit for consumer to differ
 		 * XDP buffer from sk_buff.
 		 */
-		void *frame = tun_xdp_to_ptr(xdp);
-
+		frame = tun_xdp_to_ptr(xdp);
 		if (__ptr_ring_produce(&tfile->tx_ring, frame)) {
+drop:
 			this_cpu_inc(tun->pcpu_stats->tx_dropped);
 			xdp_return_frame_rx_napi(xdp);
 			drops++;