diff mbox series

[bpf-next,V1-RFC,01/14] xdp: base API for new XDP rx-queue info concept

Message ID 151316396600.14967.5648145904814220715.stgit@firesoul
State RFC, archived
Delegated to: BPF Maintainers
Headers show
Series xdp: new XDP rx-queue info concept | expand

Commit Message

Jesper Dangaard Brouer Dec. 13, 2017, 11:19 a.m. UTC
This patch only introduce the core data structures and API functions.
All XDP enabled drivers must use the API before this info can used.

There is a need for XDP to know more about the RX-queue a given XDP
frames have arrived on.  For both the XDP bpf-prog and kernel side.

Instead of extending xdp_buff each time new info is needed, the patch
creates a separate read-mostly struct xdp_rxq_info, that contains this
info.  We stress this data/cache-line is for read-only info.  This is
NOT for dynamic per packet info, use the data_meta for such use-cases.

The performance advantage is this info can be setup at RX-ring init
time, instead of updating N-members in xdp_buff.  A possible (driver
level) micro optimization is that xdp_buff->rxq assignment could be
done once per XDP/NAPI loop.  The extra pointer deref only happens for
program needing access to this info (thus, no slowdown to existing
use-cases).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/filter.h |    2 ++
 include/net/xdp.h      |   45 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/Makefile      |    2 +-
 net/core/xdp.c         |   40 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 include/net/xdp.h
 create mode 100644 net/core/xdp.c

Comments

David Ahern Dec. 14, 2017, 2:34 a.m. UTC | #1
On 12/13/17 4:19 AM, Jesper Dangaard Brouer wrote:
> +
> +void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
> +{
> +	xdp_rxq->reg_state = REG_STATE_UNREGISTRED;
> +}
> +EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
> +
> +void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq)
> +{
> +	if (xdp_rxq->reg_state == REG_STATE_REGISTRED) {
> +		WARN(1, "Missing unregister, handled but fix driver\n");
> +		xdp_rxq_info_unreg(xdp_rxq);
> +	}
> +	memset(xdp_rxq, 0, sizeof(*xdp_rxq));
> +	xdp_rxq->queue_index = U32_MAX;
> +	xdp_rxq->reg_state = REG_STATE_NEW;
> +}
> +EXPORT_SYMBOL_GPL(xdp_rxq_info_init);
> +
> +void xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq)
> +{
> +	WARN(!xdp_rxq->dev, "Missing net_device from driver");
> +	WARN(xdp_rxq->queue_index == U32_MAX, "Miss queue_index from driver");
> +	WARN(!(xdp_rxq->reg_state == REG_STATE_NEW),"API violation, miss init");
> +	xdp_rxq->reg_state = REG_STATE_REGISTRED;
> +}
> +EXPORT_SYMBOL_GPL(xdp_rxq_info_reg);
> 

Rather than WARN()'s why not make the _reg and _init functions return an
int that indicates an error? For example you don't want to continue if
the dev is expected but missing.
Jesper Dangaard Brouer Dec. 18, 2017, 10:55 a.m. UTC | #2
On Wed, 13 Dec 2017 19:34:40 -0700
David Ahern <dsahern@gmail.com> wrote:

> On 12/13/17 4:19 AM, Jesper Dangaard Brouer wrote:
> > +
> > +void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
> > +{
> > +	xdp_rxq->reg_state = REG_STATE_UNREGISTRED;
> > +}
> > +EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
> > +
> > +void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq)
> > +{
> > +	if (xdp_rxq->reg_state == REG_STATE_REGISTRED) {
> > +		WARN(1, "Missing unregister, handled but fix driver\n");
> > +		xdp_rxq_info_unreg(xdp_rxq);
> > +	}
> > +	memset(xdp_rxq, 0, sizeof(*xdp_rxq));
> > +	xdp_rxq->queue_index = U32_MAX;
> > +	xdp_rxq->reg_state = REG_STATE_NEW;
> > +}
> > +EXPORT_SYMBOL_GPL(xdp_rxq_info_init);
> > +
> > +void xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq)
> > +{
> > +	WARN(!xdp_rxq->dev, "Missing net_device from driver");
> > +	WARN(xdp_rxq->queue_index == U32_MAX, "Miss queue_index from driver");
> > +	WARN(!(xdp_rxq->reg_state == REG_STATE_NEW),"API violation, miss init");
> > +	xdp_rxq->reg_state = REG_STATE_REGISTRED;
> > +}
> > +EXPORT_SYMBOL_GPL(xdp_rxq_info_reg);
> >   
> 
> Rather than WARN()'s why not make the _reg and _init functions return an
> int that indicates an error? For example you don't want to continue if
> the dev is expected but missing.

Handling return-errors in the drivers complicated the driver code, as it
involves unraveling and deallocating other RX-rings etc (that were
already allocated) if the reg fails. (Also notice next patch will allow
dev == NULL, if right ptype is set).

I'm not completely rejecting you idea, as this is a good optimization
trick, which is to move validation checks to setup-time, thus allowing
less validation checks at runtime.  I sort-of actually already did
this, as I allow bpf to deref dev without NULL check.  I would argue
this is good enough, as we will crash in a predictable way, as above
WARN will point to which driver violated the API.

If people think it is valuable I can change this API to return an err?

I guess, it would be more future-proof to do this, as we (Bjørn,
Michael, Andy) want to extend this to implement a XDP frame/mem return
code-path.  And the register call will likely have to allocate some
resource that could fail, which need to be handled...


If we do this, we might as well (slab) alloc the xdp_rxq_info
structure to reduce the bloat in the drivers RX-rings to a single
pointer (and a pointer to xdp_rxq_info is what xdp_buff.rxq need).
David Ahern Dec. 18, 2017, 1:23 p.m. UTC | #3
On 12/18/17 3:55 AM, Jesper Dangaard Brouer wrote:
> 
> Handling return-errors in the drivers complicated the driver code, as it
> involves unraveling and deallocating other RX-rings etc (that were
> already allocated) if the reg fails. (Also notice next patch will allow
> dev == NULL, if right ptype is set).
> 
> I'm not completely rejecting you idea, as this is a good optimization
> trick, which is to move validation checks to setup-time, thus allowing
> less validation checks at runtime.  I sort-of actually already did
> this, as I allow bpf to deref dev without NULL check.  I would argue
> this is good enough, as we will crash in a predictable way, as above
> WARN will point to which driver violated the API.
> 
> If people think it is valuable I can change this API to return an err?

Saeed's suggested API in a comment on patch 12 also removes most of the
WARN_ONs as it sets the device and index:

xdp_rxq_info_reg(netdev, rxq_index)
{
    rxqueue = dev->_rx + rxq_index;
    xdp_rxq = rxqueue.xdp_rxq;
        xdp_rxq_info_init(xdp_rxq);
    xdp_rxq.dev = netdev;
    xdp_rxq.queue_index = rxq_index;
}

xdp_rxq_info_unreg(netdev, rxq_index)
{
...
}
Jesper Dangaard Brouer Dec. 18, 2017, 3:52 p.m. UTC | #4
On Mon, 18 Dec 2017 06:23:40 -0700
David Ahern <dsahern@gmail.com> wrote:

> On 12/18/17 3:55 AM, Jesper Dangaard Brouer wrote:
> > 
> > Handling return-errors in the drivers complicated the driver code, as it
> > involves unraveling and deallocating other RX-rings etc (that were
> > already allocated) if the reg fails. (Also notice next patch will allow
> > dev == NULL, if right ptype is set).
> > 
> > I'm not completely rejecting you idea, as this is a good optimization
> > trick, which is to move validation checks to setup-time, thus allowing
> > less validation checks at runtime.  I sort-of actually already did
> > this, as I allow bpf to deref dev without NULL check.  I would argue
> > this is good enough, as we will crash in a predictable way, as above
> > WARN will point to which driver violated the API.
> > 
> > If people think it is valuable I can change this API to return an err?  
> 
> Saeed's suggested API in a comment on patch 12 also removes most of the
> WARN_ONs as it sets the device and index:
> 
> xdp_rxq_info_reg(netdev, rxq_index)
> {
>     rxqueue = netdev->_rx + rxq_index;
>     xdp_rxq = rxqueue.xdp_rxq;
>         xdp_rxq_info_init(xdp_rxq);
>     xdp_rxq.dev = netdev;
>     xdp_rxq.queue_index = rxq_index;
> }
> 
> xdp_rxq_info_unreg(netdev, rxq_index)
> {
> ...
> }

No, we still need the other WARN_ON's.

I don't understand why you think above API is better.  In case
netdev==NULL the system will simply crash on deref of netdev.  That
case happened for both drivers i40e and mlx5, when I was adding this.
The WARN_ON help me quickly identify the issue, and in both drivers it
was a non-critical error, as these queues are not used by XDP. IHMO a
better experience for the driver developer.

IHMO WARN_ON's are a good thing.  For example the:

 if (xdp_rxq->reg_state == REG_STATE_REGISTERED)
   WARN(1, "Missing unregister, handled but fix driver\n");

Just helped me identify a bug in i40e driver.  It turns out that
changing the RX-ring queue size via ethtool <-G|--set-ring> (_not_ the
number of RX-rings, but frames per RX-ring). Then i40e_set_ringparam()
allocates some temp RX-rings and copy-around struct contents, causing
this strange issue.  It will not crash with our currently simple content,
but later this would cause a hard-to-debug issue.  I'm happy I could
catch this now, instead of later as a strange crash.

The WARN's are there to assist driver developers when using this API
in their drivers (better than crash/BUG_ON as they don't have to dig-up
their serial cable console).  For me it is also part of the
documentation, as it document the API assumptions/assertions together
with a small text field.
Jesper Dangaard Brouer Dec. 21, 2017, 4:59 p.m. UTC | #5
On Mon, 18 Dec 2017 11:55:01 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Wed, 13 Dec 2017 19:34:40 -0700
> David Ahern <dsahern@gmail.com> wrote:
> 
> > On 12/13/17 4:19 AM, Jesper Dangaard Brouer wrote:  
> > > +
> > > +void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
> > > +{
> > > +	xdp_rxq->reg_state = REG_STATE_UNREGISTRED;
> > > +}
> > > +EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
> > > +
> > > +void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq)
> > > +{
> > > +	if (xdp_rxq->reg_state == REG_STATE_REGISTRED) {
> > > +		WARN(1, "Missing unregister, handled but fix driver\n");
> > > +		xdp_rxq_info_unreg(xdp_rxq);
> > > +	}
> > > +	memset(xdp_rxq, 0, sizeof(*xdp_rxq));
> > > +	xdp_rxq->queue_index = U32_MAX;
> > > +	xdp_rxq->reg_state = REG_STATE_NEW;
> > > +}
> > > +EXPORT_SYMBOL_GPL(xdp_rxq_info_init);
> > > +
> > > +void xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq)
> > > +{
> > > +	WARN(!xdp_rxq->dev, "Missing net_device from driver");
> > > +	WARN(xdp_rxq->queue_index == U32_MAX, "Miss queue_index from driver");
> > > +	WARN(!(xdp_rxq->reg_state == REG_STATE_NEW),"API violation, miss init");
> > > +	xdp_rxq->reg_state = REG_STATE_REGISTRED;
> > > +}
> > > +EXPORT_SYMBOL_GPL(xdp_rxq_info_reg);
> > >     
> > 
> > Rather than WARN()'s why not make the _reg and _init functions return an
> > int that indicates an error? For example you don't want to continue if
> > the dev is expected but missing.  
> 
> Handling return-errors in the drivers complicated the driver code, as it
> involves unraveling and deallocating other RX-rings etc (that were
> already allocated) if the reg fails. (Also notice next patch will allow
> dev == NULL, if right ptype is set).
> 
> I'm not completely rejecting you idea, as this is a good optimization
> trick, which is to move validation checks to setup-time, thus allowing
> less validation checks at runtime.  I sort-of actually already did
> this, as I allow bpf to deref dev without NULL check.  I would argue
> this is good enough, as we will crash in a predictable way, as above
> WARN will point to which driver violated the API.
> 
> If people think it is valuable I can change this API to return an err?

I will take Ahern's suggestion of returning an err-code, but only from
xdp_rxq_info_reg().  And I'm going to move xdp_rxq_info_init to be an
internal function (which Saeed also implicitly suggested).
I'm working through the drivers now, and only two drivers don't have a
proper error-return for handling xdp_rxq_info_reg() could fail.

I've also extended xdp_rxq_info_reg() to take args dev + idx, to reduce
the code-lines (given we now also have to check return code, this got
too big).  Thus, reg is a single call with if-return-check.


> I guess, it would be more future-proof to do this, as we (Bjørn,
> Michael, Andy) want to extend this to implement a XDP frame/mem return
> code-path.  And the register call will likely have to allocate some
> resource that could fail, which need to be handled...

I'm mostly doing it for above reason, as I'm hoping to avoid touching
every XDP driver once again.  It is a real pain.

> If we do this, we might as well (slab) alloc the xdp_rxq_info
> structure to reduce the bloat in the drivers RX-rings to a single
> pointer (and a pointer to xdp_rxq_info is what xdp_buff.rxq need).

I've dropped my idea of (slab) allocating the xdp_rxq_info structure.
I started coding this up, but realized the number of lines added per
driver got too excessive for no apparent gain. (e.g. I also needed to
take the numa-node into account in some drivers).
diff mbox series

Patch

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 5feb441d3dd9..111107fcace6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -19,6 +19,7 @@ 
 #include <linux/cryptohash.h>
 #include <linux/set_memory.h>
 
+#include <net/xdp.h>
 #include <net/sch_generic.h>
 
 #include <uapi/linux/filter.h>
@@ -496,6 +497,7 @@  struct xdp_buff {
 	void *data_end;
 	void *data_meta;
 	void *data_hard_start;
+	struct xdp_rxq_info *rxq;
 };
 
 /* Compute the linear packet data range [data, data_end) which
diff --git a/include/net/xdp.h b/include/net/xdp.h
new file mode 100644
index 000000000000..e4acd198fd60
--- /dev/null
+++ b/include/net/xdp.h
@@ -0,0 +1,45 @@ 
+/* include/net/xdp.h
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+#ifndef __LINUX_NET_XDP_H__
+#define __LINUX_NET_XDP_H__
+
+/**
+ * DOC: XDP RX-queue information
+ *
+ * The XDP RX-queue info (xdp_rxq_info) is associated with the driver
+ * level RX-ring queues.  It is information that is specific to how
+ * the driver have configured a given RX-ring queue.
+ *
+ * Each xdp_buff frame received in the driver carry a (pointer)
+ * reference to this xdp_rxq_info structure.  This provides the XDP
+ * data-path read-access to RX-info for both kernel and bpf-side
+ * (limited subset).
+ *
+ * For now, direct access is only safe while running in NAPI/softirq
+ * context.
+ *
+ * The driver usage API is an init, register and unregister API.
+ *
+ * The struct is not directly tied to the XDP prog.  A new XDP prog
+ * can be attached as long as it doesn't change the underlying
+ * RX-ring.  If the RX-ring does change significantly, the NIC driver
+ * naturally need to stop the RX-ring before purging and reallocating
+ * memory.  In that process the driver MUST call unregistor (which
+ * also apply for driver shutdown and unload).  The init and register
+ * API is also mandatory during RX-ring setup.
+ */
+
+struct xdp_rxq_info {
+	struct net_device *dev;
+	u32 queue_index;
+	u32 reg_state;
+} ____cacheline_aligned; /* perf critical, avoid false-sharing */
+
+void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq);
+void xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq);
+void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
+
+#endif /* __LINUX_NET_XDP_H__ */
diff --git a/net/core/Makefile b/net/core/Makefile
index 1fd0a9c88b1b..6dbbba8c57ae 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -11,7 +11,7 @@  obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
 			sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
-			fib_notifier.o
+			fib_notifier.o xdp.o
 
 obj-y += net-sysfs.o
 obj-$(CONFIG_PROC_FS) += net-procfs.o
diff --git a/net/core/xdp.c b/net/core/xdp.c
new file mode 100644
index 000000000000..a9d2dd7b1ede
--- /dev/null
+++ b/net/core/xdp.c
@@ -0,0 +1,40 @@ 
+/* net/core/xdp.c
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+#include <linux/types.h>
+#include <linux/mm.h>
+
+#include <net/xdp.h>
+
+#define REG_STATE_NEW		0x0
+#define REG_STATE_REGISTRED	0x1
+#define REG_STATE_UNREGISTRED	0x2
+
+void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
+{
+	xdp_rxq->reg_state = REG_STATE_UNREGISTRED;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
+
+void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq)
+{
+	if (xdp_rxq->reg_state == REG_STATE_REGISTRED) {
+		WARN(1, "Missing unregister, handled but fix driver\n");
+		xdp_rxq_info_unreg(xdp_rxq);
+	}
+	memset(xdp_rxq, 0, sizeof(*xdp_rxq));
+	xdp_rxq->queue_index = U32_MAX;
+	xdp_rxq->reg_state = REG_STATE_NEW;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_init);
+
+void xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq)
+{
+	WARN(!xdp_rxq->dev, "Missing net_device from driver");
+	WARN(xdp_rxq->queue_index == U32_MAX, "Miss queue_index from driver");
+	WARN(!(xdp_rxq->reg_state == REG_STATE_NEW),"API violation, miss init");
+	xdp_rxq->reg_state = REG_STATE_REGISTRED;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_reg);