
[bpf-next,v2,1/2] xdp: Move devmap bulk queue into struct net_device

Message ID 157893905569.861394.457637639114847149.stgit@toke.dk
State Changes Requested
Delegated to: BPF Maintainers
Series xdp: Introduce bulking for non-map XDP_REDIRECT

Commit Message

Toke Høiland-Jørgensen Jan. 13, 2020, 6:10 p.m. UTC
From: Toke Høiland-Jørgensen <toke@redhat.com>

Commit 96360004b862 ("xdp: Make devmap flush_list common for all map
instances"), changed devmap flushing to be a global operation instead of a
per-map operation. However, the queue structure used for bulking was still
allocated as part of the containing map.

This patch moves the devmap bulk queue into struct net_device. The
motivation for this is reusing it for the non-map variant of XDP_REDIRECT,
which will be changed in a subsequent commit.  To avoid other fields of
struct net_device moving to different cache lines, we also move a couple of
other members around.

We defer the actual allocation of the bulk queue structure until the
NETDEV_REGISTER notification in devmap.c. This makes it possible to check for
ndo_xdp_xmit support before allocating the structure, which is not possible
at the time struct net_device is allocated. However, we keep the freeing in
free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER.

Because of this change, we lose the reference back to the map that
originated the redirect, so change the tracepoint to always return 0 as the
map ID and index. Otherwise no functional change is intended with this
patch.

Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/netdevice.h  |   11 +++++---
 include/trace/events/xdp.h |    2 +
 kernel/bpf/devmap.c        |   63 +++++++++++++++++++-------------------------
 net/core/dev.c             |    2 +
 4 files changed, 37 insertions(+), 41 deletions(-)
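
In outline, the new mechanism boils down to the sketch below. This is a
condensed illustration rather than the literal patch, and the
xdp_bulkq_register() helper name is hypothetical; in the actual patch (at the
bottom of this page) the same logic sits inline in the NETDEV_REGISTER case of
dev_map_notification().

/* Per-CPU bulk queue, now hanging off struct net_device (dev->xdp_bulkq)
 * instead of the devmap entry:
 */
struct xdp_dev_bulk_queue {
        struct xdp_frame *q[DEV_MAP_BULK_SIZE];
        struct list_head flush_node;
        struct net_device *dev;
        struct net_device *dev_rx;
        unsigned int count;
};

/* Hypothetical helper, condensed from the NETDEV_REGISTER handling in
 * dev_map_notification(): allocate the queue only for devices that
 * actually support ndo_xdp_xmit.
 */
static int xdp_bulkq_register(struct net_device *netdev)
{
        int cpu;

        if (!netdev->netdev_ops->ndo_xdp_xmit || netdev->xdp_bulkq)
                return NOTIFY_OK;

        /* freed in free_netdev(), avoiding an extra RCU callback on
         * NETDEV_UNREGISTER
         */
        netdev->xdp_bulkq =
                __alloc_percpu_gfp(sizeof(struct xdp_dev_bulk_queue),
                                   sizeof(void *), GFP_ATOMIC);
        if (!netdev->xdp_bulkq)
                return NOTIFY_BAD;

        for_each_possible_cpu(cpu)
                per_cpu_ptr(netdev->xdp_bulkq, cpu)->dev = netdev;

        return NOTIFY_OK;
}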

Comments

John Fastabend Jan. 15, 2020, 7:45 p.m. UTC | #1
Toke Høiland-Jørgensen wrote:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> 
> Commit 96360004b862 ("xdp: Make devmap flush_list common for all map
> instances"), changed devmap flushing to be a global operation instead of a
> per-map operation. However, the queue structure used for bulking was still
> allocated as part of the containing map.
> 
> This patch moves the devmap bulk queue into struct net_device. The
> motivation for this is reusing it for the non-map variant of XDP_REDIRECT,
> which will be changed in a subsequent commit.  To avoid other fields of
> struct net_device moving to different cache lines, we also move a couple of
> other members around.
> 
> We defer the actual allocation of the bulk queue structure until the
> NETDEV_REGISTER notification in devmap.c. This makes it possible to check for
> ndo_xdp_xmit support before allocating the structure, which is not possible
> at the time struct net_device is allocated. However, we keep the freeing in
> free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER.
> 
> Because of this change, we lose the reference back to the map that
> originated the redirect, so change the tracepoint to always return 0 as the
> map ID and index. Otherwise no functional change is intended with this
> patch.
> 
> Acked-by: Björn Töpel <bjorn.topel@intel.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---

LGTM. I didn't check the net_device layout with pahole though so I'm
trusting they are good from v1 discussion.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Jesper Dangaard Brouer Jan. 15, 2020, 8:17 p.m. UTC | #2
On Mon, 13 Jan 2020 19:10:55 +0100
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index da9c832fc5c8..030d125c3839 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
[...]
> @@ -346,8 +340,7 @@ static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags)
>  out:
>  	bq->count = 0;
>  
> -	trace_xdp_devmap_xmit(&obj->dtab->map, obj->idx,
> -			      sent, drops, bq->dev_rx, dev, err);
> +	trace_xdp_devmap_xmit(NULL, 0, sent, drops, bq->dev_rx, dev, err);

Hmm ... I don't like that we lose the map_id and map_index identifier.
This is part of our troubleshooting interface.  

>  	bq->dev_rx = NULL;
>  	__list_del_clearprev(&bq->flush_node);
>  	return 0;
Toke Høiland-Jørgensen Jan. 15, 2020, 10:11 p.m. UTC | #3
Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Mon, 13 Jan 2020 19:10:55 +0100
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>> index da9c832fc5c8..030d125c3839 100644
>> --- a/kernel/bpf/devmap.c
>> +++ b/kernel/bpf/devmap.c
> [...]
>> @@ -346,8 +340,7 @@ static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags)
>>  out:
>>  	bq->count = 0;
>>  
>> -	trace_xdp_devmap_xmit(&obj->dtab->map, obj->idx,
>> -			      sent, drops, bq->dev_rx, dev, err);
>> +	trace_xdp_devmap_xmit(NULL, 0, sent, drops, bq->dev_rx, dev, err);
>
> Hmm ... I don't like that we lose the map_id and map_index identifier.
> This is part of our troubleshooting interface.

Hmm, I guess I can take another look at whether there's a way to avoid
that. Any ideas?

-Toke
Toke Høiland-Jørgensen Jan. 15, 2020, 10:22 p.m. UTC | #4
John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>> 
>> Commit 96360004b862 ("xdp: Make devmap flush_list common for all map
>> instances"), changed devmap flushing to be a global operation instead of a
>> per-map operation. However, the queue structure used for bulking was still
>> allocated as part of the containing map.
>> 
>> This patch moves the devmap bulk queue into struct net_device. The
>> motivation for this is reusing it for the non-map variant of XDP_REDIRECT,
>> which will be changed in a subsequent commit.  To avoid other fields of
>> struct net_device moving to different cache lines, we also move a couple of
>> other members around.
>> 
>> We defer the actual allocation of the bulk queue structure until the
>> NETDEV_REGISTER notification in devmap.c. This makes it possible to check for
>> ndo_xdp_xmit support before allocating the structure, which is not possible
>> at the time struct net_device is allocated. However, we keep the freeing in
>> free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER.
>> 
>> Because of this change, we lose the reference back to the map that
>> originated the redirect, so change the tracepoint to always return 0 as the
>> map ID and index. Otherwise no functional change is intended with this
>> patch.
>> 
>> Acked-by: Björn Töpel <bjorn.topel@intel.com>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> ---
>
> LGTM. I didn't check the net_device layout with pahole though so I'm
> trusting they are good from v1 discussion.

I believe so; looks like this now:

	/* --- cacheline 14 boundary (896 bytes) --- */
	struct netdev_queue *      _tx __attribute__((__aligned__(64))); /*   896     8 */
	unsigned int               num_tx_queues;        /*   904     4 */
	unsigned int               real_num_tx_queues;   /*   908     4 */
	struct Qdisc *             qdisc;                /*   912     8 */
	unsigned int               tx_queue_len;         /*   920     4 */
	spinlock_t                 tx_global_lock;       /*   924     4 */
	struct xdp_dev_bulk_queue * xdp_bulkq;           /*   928     8 */
	struct xps_dev_maps *      xps_cpus_map;         /*   936     8 */
	struct xps_dev_maps *      xps_rxqs_map;         /*   944     8 */
	struct mini_Qdisc *        miniq_egress;         /*   952     8 */
	/* --- cacheline 15 boundary (960 bytes) --- */
	struct hlist_head  qdisc_hash[16];               /*   960   128 */
	/* --- cacheline 17 boundary (1088 bytes) --- */
	struct timer_list  watchdog_timer;               /*  1088    40 */

	/* XXX last struct has 4 bytes of padding */

	int                        watchdog_timeo;       /*  1128     4 */

	/* XXX 4 bytes hole, try to pack */

	int *                      pcpu_refcnt;          /*  1136     8 */
	struct list_head   todo_list;                    /*  1144    16 */
	/* --- cacheline 18 boundary (1152 bytes) was 8 bytes ago --- */

-Toke
Jesper Dangaard Brouer Jan. 16, 2020, 11:24 a.m. UTC | #5
On Wed, 15 Jan 2020 23:11:21 +0100
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> 
> > On Mon, 13 Jan 2020 19:10:55 +0100
> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >  
> >> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> >> index da9c832fc5c8..030d125c3839 100644
> >> --- a/kernel/bpf/devmap.c
> >> +++ b/kernel/bpf/devmap.c  
> > [...]  
> >> @@ -346,8 +340,7 @@ static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags)
> >>  out:
> >>  	bq->count = 0;
> >>  
> >> -	trace_xdp_devmap_xmit(&obj->dtab->map, obj->idx,
> >> -			      sent, drops, bq->dev_rx, dev, err);
> >> +	trace_xdp_devmap_xmit(NULL, 0, sent, drops, bq->dev_rx, dev, err);  
> >
> > Hmm ... I don't like that we lose the map_id and map_index identifier.
> > This is part of our troubleshooting interface.  
> 
> Hmm, I guess I can take another look at whether there's a way to avoid
> that. Any ideas?

Looking at the code and the other tracepoints...

I would actually suggest removing these two arguments, because the
trace_xdp_redirect_map tracepoint also contains the ifindexes; to
troubleshoot, people can record both tracepoints and do the correlation
themselves.

When changing the tracepoint I would like to keep the 'drops' and
'sent' members at the same struct offsets, as our xdp_monitor example
reads these and I hope we can keep it working this way.

I've coded it up and tested it.  The new xdp_monitor will work on
older kernels, but the old xdp_monitor will fail to attach on newer
kernels. I think this is fair enough, as the new xdp_monitor remains
backwards compatible.


[PATCH] devmap: adjust tracepoint after Toke's changes

From: Jesper Dangaard Brouer <brouer@redhat.com>

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/trace/events/xdp.h     |   29 ++++++++++++-----------------
 kernel/bpf/devmap.c            |    2 +-
 samples/bpf/xdp_monitor_kern.c |    8 +++-----
 3 files changed, 16 insertions(+), 23 deletions(-)

diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index cf568a38f852..f1e64689ce94 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -247,43 +247,38 @@ TRACE_EVENT(xdp_cpumap_enqueue,
 
 TRACE_EVENT(xdp_devmap_xmit,
 
-	TP_PROTO(const struct bpf_map *map, u32 map_index,
-		 int sent, int drops,
-		 const struct net_device *from_dev,
-		 const struct net_device *to_dev, int err),
+	TP_PROTO(const struct net_device *from_dev,
+		 const struct net_device *to_dev,
+		 int sent, int drops, int err),
 
-	TP_ARGS(map, map_index, sent, drops, from_dev, to_dev, err),
+	TP_ARGS(from_dev, to_dev, sent, drops, err),
 
 	TP_STRUCT__entry(
-		__field(int, map_id)
+		__field(int, from_ifindex)
 		__field(u32, act)
-		__field(u32, map_index)
+		__field(int, to_ifindex)
 		__field(int, drops)
 		__field(int, sent)
-		__field(int, from_ifindex)
-		__field(int, to_ifindex)
 		__field(int, err)
 	),
 
 	TP_fast_assign(
-		__entry->map_id		= map ? map->id : 0;
+		__entry->from_ifindex	= from_dev->ifindex;
 		__entry->act		= XDP_REDIRECT;
-		__entry->map_index	= map_index;
+		__entry->to_ifindex	= to_dev->ifindex;
 		__entry->drops		= drops;
 		__entry->sent		= sent;
-		__entry->from_ifindex	= from_dev->ifindex;
-		__entry->to_ifindex	= to_dev->ifindex;
 		__entry->err		= err;
 	),
 
 	TP_printk("ndo_xdp_xmit"
-		  " map_id=%d map_index=%d action=%s"
+		  " from_ifindex=%d to_ifindex=%d action=%s"
 		  " sent=%d drops=%d"
-		  " from_ifindex=%d to_ifindex=%d err=%d",
-		  __entry->map_id, __entry->map_index,
+		  " err=%d",
+		  __entry->from_ifindex, __entry->to_ifindex,
 		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
 		  __entry->sent, __entry->drops,
-		  __entry->from_ifindex, __entry->to_ifindex, __entry->err)
+		  __entry->err)
 );
 
 /* Expect users already include <net/xdp.h>, but not xdp_priv.h */
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index db32272c4f77..1b4bfe4e06d6 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -340,7 +340,7 @@ static int bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
 out:
 	bq->count = 0;
 
-	trace_xdp_devmap_xmit(NULL, 0, sent, drops, bq->dev_rx, dev, err);
+	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);
 	bq->dev_rx = NULL;
 	__list_del_clearprev(&bq->flush_node);
 	return 0;
diff --git a/samples/bpf/xdp_monitor_kern.c b/samples/bpf/xdp_monitor_kern.c
index ad10fe700d7d..39458a44472e 100644
--- a/samples/bpf/xdp_monitor_kern.c
+++ b/samples/bpf/xdp_monitor_kern.c
@@ -222,14 +222,12 @@ struct bpf_map_def SEC("maps") devmap_xmit_cnt = {
  */
 struct devmap_xmit_ctx {
 	u64 __pad;		// First 8 bytes are not accessible by bpf code
-	int map_id;		//	offset:8;  size:4; signed:1;
+	int from_ifindex;	//	offset:8;  size:4; signed:1;
 	u32 act;		//	offset:12; size:4; signed:0;
-	u32 map_index;		//	offset:16; size:4; signed:0;
+	int to_ifindex; 	//	offset:16; size:4; signed:1;
 	int drops;		//	offset:20; size:4; signed:1;
 	int sent;		//	offset:24; size:4; signed:1;
-	int from_ifindex;	//	offset:28; size:4; signed:1;
-	int to_ifindex;		//	offset:32; size:4; signed:1;
-	int err;		//	offset:36; size:4; signed:1;
+	int err;		//	offset:28; size:4; signed:1;
 };
 
 SEC("tracepoint/xdp/xdp_devmap_xmit")
Toke Høiland-Jørgensen Jan. 16, 2020, 1:51 p.m. UTC | #6
Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Wed, 15 Jan 2020 23:11:21 +0100
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
>> Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> 
>> > On Mon, 13 Jan 2020 19:10:55 +0100
>> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >  
>> >> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>> >> index da9c832fc5c8..030d125c3839 100644
>> >> --- a/kernel/bpf/devmap.c
>> >> +++ b/kernel/bpf/devmap.c  
>> > [...]  
>> >> @@ -346,8 +340,7 @@ static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags)
>> >>  out:
>> >>  	bq->count = 0;
>> >>  
>> >> -	trace_xdp_devmap_xmit(&obj->dtab->map, obj->idx,
>> >> -			      sent, drops, bq->dev_rx, dev, err);
>> >> +	trace_xdp_devmap_xmit(NULL, 0, sent, drops, bq->dev_rx, dev, err);  
>> >
>> > Hmm ... I don't like that we lose the map_id and map_index identifier.
>> > This is part of our troubleshooting interface.  
>> 
>> Hmm, I guess I can take another look at whether there's a way to avoid
>> that. Any ideas?
>
> Looking at the code and the other tracepoints...
>
> I would actually suggest removing these two arguments, because the
> trace_xdp_redirect_map tracepoint also contains the ifindexes; to
> troubleshoot, people can record both tracepoints and do the correlation
> themselves.
>
> When changing the tracepoint I would like to keep the 'drops' and
> 'sent' members at the same struct offsets, as our xdp_monitor example
> reads these and I hope we can keep it working this way.
>
> I've coded it up and tested it.  The new xdp_monitor will work on
> older kernels, but the old xdp_monitor will fail to attach on newer
> kernels. I think this is fair enough, as the new xdp_monitor remains
> backwards compatible.

SGTM - thanks! I'll respin and include this :)

-Toke

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2741aa35bec6..1f24405c1ec5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -876,6 +876,7 @@  enum bpf_netdev_command {
 struct bpf_prog_offload_ops;
 struct netlink_ext_ack;
 struct xdp_umem;
+struct xdp_dev_bulk_queue;
 
 struct netdev_bpf {
 	enum bpf_netdev_command command;
@@ -1986,12 +1987,10 @@  struct net_device {
 	unsigned int		num_tx_queues;
 	unsigned int		real_num_tx_queues;
 	struct Qdisc		*qdisc;
-#ifdef CONFIG_NET_SCHED
-	DECLARE_HASHTABLE	(qdisc_hash, 4);
-#endif
 	unsigned int		tx_queue_len;
 	spinlock_t		tx_global_lock;
-	int			watchdog_timeo;
+
+	struct xdp_dev_bulk_queue __percpu *xdp_bulkq;
 
 #ifdef CONFIG_XPS
 	struct xps_dev_maps __rcu *xps_cpus_map;
@@ -2001,8 +2000,12 @@  struct net_device {
 	struct mini_Qdisc __rcu	*miniq_egress;
 #endif
 
+#ifdef CONFIG_NET_SCHED
+	DECLARE_HASHTABLE	(qdisc_hash, 4);
+#endif
 	/* These may be needed for future network-power-down code. */
 	struct timer_list	watchdog_timer;
+	int			watchdog_timeo;
 
 	int __percpu		*pcpu_refcnt;
 	struct list_head	todo_list;
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index a7378bcd9928..72bad13d4a3c 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -278,7 +278,7 @@  TRACE_EVENT(xdp_devmap_xmit,
 	),
 
 	TP_fast_assign(
-		__entry->map_id		= map->id;
+		__entry->map_id		= map ? map->id : 0;
 		__entry->act		= XDP_REDIRECT;
 		__entry->map_index	= map_index;
 		__entry->drops		= drops;
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index da9c832fc5c8..030d125c3839 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -53,13 +53,11 @@ 
 	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
 
 #define DEV_MAP_BULK_SIZE 16
-struct bpf_dtab_netdev;
-
-struct xdp_bulk_queue {
+struct xdp_dev_bulk_queue {
 	struct xdp_frame *q[DEV_MAP_BULK_SIZE];
 	struct list_head flush_node;
+	struct net_device *dev;
 	struct net_device *dev_rx;
-	struct bpf_dtab_netdev *obj;
 	unsigned int count;
 };
 
@@ -67,9 +65,8 @@  struct bpf_dtab_netdev {
 	struct net_device *dev; /* must be first member, due to tracepoint */
 	struct hlist_node index_hlist;
 	struct bpf_dtab *dtab;
-	struct xdp_bulk_queue __percpu *bulkq;
 	struct rcu_head rcu;
-	unsigned int idx; /* keep track of map index for tracepoint */
+	unsigned int idx;
 };
 
 struct bpf_dtab {
@@ -219,7 +216,6 @@  static void dev_map_free(struct bpf_map *map)
 
 			hlist_for_each_entry_safe(dev, next, head, index_hlist) {
 				hlist_del_rcu(&dev->index_hlist);
-				free_percpu(dev->bulkq);
 				dev_put(dev->dev);
 				kfree(dev);
 			}
@@ -234,7 +230,6 @@  static void dev_map_free(struct bpf_map *map)
 			if (!dev)
 				continue;
 
-			free_percpu(dev->bulkq);
 			dev_put(dev->dev);
 			kfree(dev);
 		}
@@ -320,10 +315,9 @@  static int dev_map_hash_get_next_key(struct bpf_map *map, void *key,
 	return -ENOENT;
 }
 
-static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags)
+static int bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
 {
-	struct bpf_dtab_netdev *obj = bq->obj;
-	struct net_device *dev = obj->dev;
+	struct net_device *dev = bq->dev;
 	int sent = 0, drops = 0, err = 0;
 	int i;
 
@@ -346,8 +340,7 @@  static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags)
 out:
 	bq->count = 0;
 
-	trace_xdp_devmap_xmit(&obj->dtab->map, obj->idx,
-			      sent, drops, bq->dev_rx, dev, err);
+	trace_xdp_devmap_xmit(NULL, 0, sent, drops, bq->dev_rx, dev, err);
 	bq->dev_rx = NULL;
 	__list_del_clearprev(&bq->flush_node);
 	return 0;
@@ -374,7 +367,7 @@  static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags)
 void __dev_map_flush(void)
 {
 	struct list_head *flush_list = this_cpu_ptr(&dev_map_flush_list);
-	struct xdp_bulk_queue *bq, *tmp;
+	struct xdp_dev_bulk_queue *bq, *tmp;
 
 	rcu_read_lock();
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node)
@@ -401,12 +394,12 @@  struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
 /* Runs under RCU-read-side, plus in softirq under NAPI protection.
  * Thus, safe percpu variable access.
  */
-static int bq_enqueue(struct bpf_dtab_netdev *obj, struct xdp_frame *xdpf,
+static int bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
 		      struct net_device *dev_rx)
 
 {
 	struct list_head *flush_list = this_cpu_ptr(&dev_map_flush_list);
-	struct xdp_bulk_queue *bq = this_cpu_ptr(obj->bulkq);
+	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
 
 	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
 		bq_xmit_all(bq, 0);
@@ -444,7 +437,7 @@  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	if (unlikely(!xdpf))
 		return -EOVERFLOW;
 
-	return bq_enqueue(dst, xdpf, dev_rx);
+	return bq_enqueue(dev, xdpf, dev_rx);
 }
 
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
@@ -483,7 +476,6 @@  static void __dev_map_entry_free(struct rcu_head *rcu)
 	struct bpf_dtab_netdev *dev;
 
 	dev = container_of(rcu, struct bpf_dtab_netdev, rcu);
-	free_percpu(dev->bulkq);
 	dev_put(dev->dev);
 	kfree(dev);
 }
@@ -538,30 +530,15 @@  static struct bpf_dtab_netdev *__dev_map_alloc_node(struct net *net,
 						    u32 ifindex,
 						    unsigned int idx)
 {
-	gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
 	struct bpf_dtab_netdev *dev;
-	struct xdp_bulk_queue *bq;
-	int cpu;
 
-	dev = kmalloc_node(sizeof(*dev), gfp, dtab->map.numa_node);
+	dev = kmalloc_node(sizeof(*dev), GFP_ATOMIC | __GFP_NOWARN,
+			   dtab->map.numa_node);
 	if (!dev)
 		return ERR_PTR(-ENOMEM);
 
-	dev->bulkq = __alloc_percpu_gfp(sizeof(*dev->bulkq),
-					sizeof(void *), gfp);
-	if (!dev->bulkq) {
-		kfree(dev);
-		return ERR_PTR(-ENOMEM);
-	}
-
-	for_each_possible_cpu(cpu) {
-		bq = per_cpu_ptr(dev->bulkq, cpu);
-		bq->obj = dev;
-	}
-
 	dev->dev = dev_get_by_index(net, ifindex);
 	if (!dev->dev) {
-		free_percpu(dev->bulkq);
 		kfree(dev);
 		return ERR_PTR(-EINVAL);
 	}
@@ -721,9 +698,23 @@  static int dev_map_notification(struct notifier_block *notifier,
 {
 	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
 	struct bpf_dtab *dtab;
-	int i;
+	int i, cpu;
 
 	switch (event) {
+	case NETDEV_REGISTER:
+		if (!netdev->netdev_ops->ndo_xdp_xmit || netdev->xdp_bulkq)
+			break;
+
+		/* will be freed in free_netdev() */
+		netdev->xdp_bulkq =
+			__alloc_percpu_gfp(sizeof(struct xdp_dev_bulk_queue),
+					   sizeof(void *), GFP_ATOMIC);
+		if (!netdev->xdp_bulkq)
+			return NOTIFY_BAD;
+
+		for_each_possible_cpu(cpu)
+			per_cpu_ptr(netdev->xdp_bulkq, cpu)->dev = netdev;
+		break;
 	case NETDEV_UNREGISTER:
 		/* This rcu_read_lock/unlock pair is needed because
 		 * dev_map_list is an RCU list AND to ensure a delete
diff --git a/net/core/dev.c b/net/core/dev.c
index d99f88c58636..e7802a41ae7f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9847,6 +9847,8 @@  void free_netdev(struct net_device *dev)
 
 	free_percpu(dev->pcpu_refcnt);
 	dev->pcpu_refcnt = NULL;
+	free_percpu(dev->xdp_bulkq);
+	dev->xdp_bulkq = NULL;
 
 	netdev_unregister_lockdep_key(dev);