
[net-next,1/2] xdp: Always use a devmap for XDP_REDIRECT to a device

Message ID: 155075021399.13610.12521373406832889226.stgit@alrua-x1
State: Changes Requested
Delegated to: BPF Maintainers
Series: [net-next,1/2] xdp: Always use a devmap for XDP_REDIRECT to a device

Commit Message

Toke Høiland-Jørgensen Feb. 21, 2019, 11:56 a.m. UTC
An XDP program can redirect packets between interfaces using either the
xdp_redirect() helper or the xdp_redirect_map() helper. Apart from the
flexibility of updating maps from userspace, the xdp_redirect_map() helper
also uses the map structure to batch packets, which results in a significant
(around 50%) performance boost. However, the xdp_redirect() API is simpler
if one just wants to redirect to another interface, which means people tend
to use it and then wonder why they get worse performance than expected.

This patch seeks to close this performance difference between the two APIs.
It achieves this by changing xdp_redirect() to use a hidden devmap for
looking up destination interfaces, thus gaining the batching benefit with
no visible difference from the user API point of view.
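
The batching in question comes from the devmap's per-CPU bulk queue: frames
are queued per destination during the NAPI poll and handed to the driver in
a single ndo_xdp_xmit() call at flush time. Roughly, as a simplified
conceptual sketch of the existing devmap flush path (the function name and
the omitted error handling are mine):

static void bq_flush_sketch(struct net_device *dev, struct xdp_bulk_queue *bq)
{
	/* One driver call for up to DEV_MAP_BULK_SIZE (16) frames instead of
	 * one ndo_xdp_xmit() call per packet, as the non-map xdp_redirect()
	 * path did before this patch. Error/partial-send handling omitted.
	 */
	dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, XDP_XMIT_FLUSH);
	bq->count = 0;
}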

A hidden per-namespace map is allocated when an XDP program that uses the
non-map xdp_redirect() helper is first loaded. This map is populated with
all available interfaces in its namespace, and kept up to date as
interfaces come and go. Once allocated, the map is kept around until the
namespace is removed.

The hidden map uses the ifindex as the map key, which means redirects are
limited to interfaces with an ifindex smaller than the map size of 64. A
later patch introduces a new map type to lift this restriction.

Performance numbers:

Before patch:
xdp_redirect:     5426035 pkt/s
xdp_redirect_map: 8412754 pkt/s

After patch:
xdp_redirect:     8314702 pkt/s
xdp_redirect_map: 8411854 pkt/s

This corresponds to a 53% increase in xdp_redirect performance, or a
reduction in per-packet processing time by 64 nanoseconds.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/bpf.h         |   12 +++
 include/net/net_namespace.h |    2 +-
 include/net/netns/xdp.h     |    5 +
 kernel/bpf/devmap.c         |  159 +++++++++++++++++++++++++++++++++++--------
 kernel/bpf/verifier.c       |    6 ++
 net/core/filter.c           |   58 +---------------
 6 files changed, 157 insertions(+), 85 deletions(-)

Comments

Jesper Dangaard Brouer Feb. 21, 2019, 3:19 p.m. UTC | #1
You forgot a cover letter describing why we are doing this...
even though it should be obvious from the performance results ;-)


On Thu, 21 Feb 2019 12:56:54 +0100 Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Before patch:
> xdp_redirect:     5426035 pkt/s
> xdp_redirect_map: 8412754 pkt/s
> 
> After patch:
> xdp_redirect:     8314702 pkt/s
> xdp_redirect_map: 8411854 pkt/s
> 
> This corresponds to a 53% increase in xdp_redirect performance, or a
> reduction in per-packet processing time by 64 nanoseconds.

(1/5426035-1/8314702)*10^9 = 64.0277 almost exactly 64 nanosec

(1/8412754-1/8411854)*10^9 = -0.012 => no regression for xdp_redirect_map
Toke Høiland-Jørgensen Feb. 21, 2019, 3:52 p.m. UTC | #2
Jesper Dangaard Brouer <brouer@redhat.com> writes:

> You forgot a cover letter describing why we are doing this...
> even though it should be obvious from the performance results ;-)

Well, I tried to put the motivation into the first paragraph of each
patch description instead of as a separate cover letter. I guess I could
have put it in a separate cover letter as well, but that was actually a
deliberate omission in this case ;)

-Toke
Jakub Kicinski Feb. 22, 2019, 12:36 a.m. UTC | #3
On Thu, 21 Feb 2019 12:56:54 +0100, Toke Høiland-Jørgensen wrote:
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index b63bc77af2d1..629661db36ee 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7527,6 +7527,12 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
>  			prog->dst_needed = 1;
>  		if (insn->imm == BPF_FUNC_get_prandom_u32)
>  			bpf_user_rnd_init_once();
> +		if (insn->imm == BPF_FUNC_redirect) {
> +			int err = dev_map_alloc_default_map();
> +
> +			if (err)
> +				return err;
> +		}
>  		if (insn->imm == BPF_FUNC_override_return)
>  			prog->kprobe_override = 1;
>  		if (insn->imm == BPF_FUNC_tail_call) {

> +int dev_map_alloc_default_map(void)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	struct bpf_dtab *dtab, *old_dtab;
> +	struct net_device *netdev;
> +	union bpf_attr attr = {};
> +	u32 idx;
> +	int err;

BPF programs don't obey by netns boundaries.  The fact the program is
verified in one ns doesn't mean this is the only ns it will be used in :(
Meaning if any program is using the redirect map you may need a secret
map in every ns.. no?
Toke Høiland-Jørgensen Feb. 22, 2019, 10:13 a.m. UTC | #4
Jakub Kicinski <jakub.kicinski@netronome.com> writes:

> On Thu, 21 Feb 2019 12:56:54 +0100, Toke Høiland-Jørgensen wrote:
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index b63bc77af2d1..629661db36ee 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -7527,6 +7527,12 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
>>  			prog->dst_needed = 1;
>>  		if (insn->imm == BPF_FUNC_get_prandom_u32)
>>  			bpf_user_rnd_init_once();
>> +		if (insn->imm == BPF_FUNC_redirect) {
>> +			int err = dev_map_alloc_default_map();
>> +
>> +			if (err)
>> +				return err;
>> +		}
>>  		if (insn->imm == BPF_FUNC_override_return)
>>  			prog->kprobe_override = 1;
>>  		if (insn->imm == BPF_FUNC_tail_call) {
>
>> +int dev_map_alloc_default_map(void)
>> +{
>> +	struct net *net = current->nsproxy->net_ns;
>> +	struct bpf_dtab *dtab, *old_dtab;
>> +	struct net_device *netdev;
>> +	union bpf_attr attr = {};
>> +	u32 idx;
>> +	int err;
>
> BPF programs don't obey by netns boundaries.  The fact the program is
> verified in one ns doesn't mean this is the only ns it will be used in :(
> Meaning if any program is using the redirect map you may need a secret
> map in every ns.. no?

Ah, yes, good point. Totally didn't think about the fact that load and
attach are decoupled. Hmm, guess I'll just have to move the call to
alloc_default_map() to the point where the program is attached to an
interface, then...

I trust it's safe to skip the allocation in case the program is
offloaded to hardware, right? I.e., an offloaded program will never need
to fall back to the kernel helper?

-Toke
Jakub Kicinski Feb. 22, 2019, 9:37 p.m. UTC | #5
On Fri, 22 Feb 2019 11:13:50 +0100, Toke Høiland-Jørgensen wrote:
> Jakub Kicinski <jakub.kicinski@netronome.com> writes:
> > On Thu, 21 Feb 2019 12:56:54 +0100, Toke Høiland-Jørgensen wrote:  
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index b63bc77af2d1..629661db36ee 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -7527,6 +7527,12 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
> >>  			prog->dst_needed = 1;
> >>  		if (insn->imm == BPF_FUNC_get_prandom_u32)
> >>  			bpf_user_rnd_init_once();
> >> +		if (insn->imm == BPF_FUNC_redirect) {
> >> +			int err = dev_map_alloc_default_map();
> >> +
> >> +			if (err)
> >> +				return err;
> >> +		}
> >>  		if (insn->imm == BPF_FUNC_override_return)
> >>  			prog->kprobe_override = 1;
> >>  		if (insn->imm == BPF_FUNC_tail_call) {  
> >  
> >> +int dev_map_alloc_default_map(void)
> >> +{
> >> +	struct net *net = current->nsproxy->net_ns;
> >> +	struct bpf_dtab *dtab, *old_dtab;
> >> +	struct net_device *netdev;
> >> +	union bpf_attr attr = {};
> >> +	u32 idx;
> >> +	int err;  
> >
> > BPF programs don't obey by netns boundaries.  The fact the program is
> > verified in one ns doesn't mean this is the only ns it will be used in :(
> > Meaning if any program is using the redirect map you may need a secret
> > map in every ns.. no?  
> 
> Ah, yes, good point. Totally didn't think about the fact that load and
> attach are decoupled. Hmm, guess I'll just have to move the call to
> alloc_default_map() to the point where the program is attached to an
> interface, then...

Possibly.. and you also need to handle the case where interface with a
program attached is moved, no?

> I trust it's safe to skip the allocation in case the program is
> offloaded to hardware, right? I.e., an offloaded program will never need
> to fall back to the kernel helper?

We will cross that bridge when we get there ;)  I'd definitely want the
ability to do a redirect to a non-offloaded netdev (e.g. redirect to a
veth) via some fallback, but the plan is to try to only add support for
the map version of redirect on offload, anyway.
Jesper Dangaard Brouer Feb. 23, 2019, 10:43 a.m. UTC | #6
On Fri, 22 Feb 2019 13:37:34 -0800 Jakub Kicinski <jakub.kicinski@netronome.com> wrote:

> On Fri, 22 Feb 2019 11:13:50 +0100, Toke Høiland-Jørgensen wrote:
> > Jakub Kicinski <jakub.kicinski@netronome.com> writes:  
> > > On Thu, 21 Feb 2019 12:56:54 +0100, Toke Høiland-Jørgensen wrote:    
[...]
> > >
> > > BPF programs don't obey by netns boundaries.  The fact the program is
> > > verified in one ns doesn't mean this is the only ns it will be used in :(
> > > Meaning if any program is using the redirect map you may need a secret
> > > map in every ns.. no?    
> > 
> > Ah, yes, good point. Totally didn't think about the fact that load and
> > attach are decoupled. Hmm, guess I'll just have to move the call to
> > alloc_default_map() to the point where the program is attached to an
> > interface, then...  
> 
> Possibly.. and you also need to handle the case where interface with a
> program attached is moved, no?

True, we need to handle if e.g. a veth gets an XDP program attached and
then is moved into a network namespace (as I've already explained to
Toke in a meeting).

I'm still not sure how to handle this...
Toke Høiland-Jørgensen Feb. 23, 2019, 12:11 p.m. UTC | #7
Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Fri, 22 Feb 2019 13:37:34 -0800 Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
>
>> On Fri, 22 Feb 2019 11:13:50 +0100, Toke Høiland-Jørgensen wrote:
>> > Jakub Kicinski <jakub.kicinski@netronome.com> writes:  
>> > > On Thu, 21 Feb 2019 12:56:54 +0100, Toke Høiland-Jørgensen wrote:    
> [...]
>> > >
>> > > BPF programs don't obey by netns boundaries.  The fact the program is
>> > > verified in one ns doesn't mean this is the only ns it will be used in :(
>> > > Meaning if any program is using the redirect map you may need a secret
>> > > map in every ns.. no?    
>> > 
>> > Ah, yes, good point. Totally didn't think about the fact that load and
>> > attach are decoupled. Hmm, guess I'll just have to move the call to
>> > alloc_default_map() to the point where the program is attached to an
>> > interface, then...  
>> 
>> Possibly.. and you also need to handle the case where interface with a
>> program attached is moved, no?

Yup, alloc on attach was easy enough; the moving turns out to be the
tricky part :)

> True, we need to handle if e.g. a veth gets an XDP program attached and
> then is moved into a network namespace (as I've already explained to
> Toke in a meeting).

Yeah, I had somehow convinced myself that the XDP program was being
removed when the interface was being torn down before moving between
namespaces. Jesper pointed out that this was not in fact the case... :P

> I'm still not sure how to handle this...

There are a couple of options, I think. At least:

1. Maintain a flag on struct net_device indicating that this device
   needs the redirect map allocated, and react to that when interfaces
   are being moved.

2. Lookup the BPF program by ID (which we can get from the driver) on
   move, and react to the program flag.

3. Keep the allocation on program load, but allocate maps for all active
   namespaces (which would probably need a refcnt mechanism to
   deallocate things again).

I think I'm leaning towards #2; possibly combined with a refcnt so we
can actually deallocate the map in the root namespace when it's not
needed anymore.

-Toke
Jakub Kicinski Feb. 25, 2019, 6:47 p.m. UTC | #8
On Sat, 23 Feb 2019 13:11:02 +0100, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> > On Fri, 22 Feb 2019 13:37:34 -0800 Jakub Kicinski wrote:
> >> On Fri, 22 Feb 2019 11:13:50 +0100, Toke Høiland-Jørgensen wrote:  
> >> > Jakub Kicinski <jakub.kicinski@netronome.com> writes:    
> >> > > On Thu, 21 Feb 2019 12:56:54 +0100, Toke Høiland-Jørgensen wrote:      
> > [...]  
> >> > >
> >> > > BPF programs don't obey by netns boundaries.  The fact the program is
> >> > > verified in one ns doesn't mean this is the only ns it will be used in :(
> >> > > Meaning if any program is using the redirect map you may need a secret
> >> > > map in every ns.. no?      
> >> > 
> >> > Ah, yes, good point. Totally didn't think about the fact that load and
> >> > attach are decoupled. Hmm, guess I'll just have to move the call to
> >> > alloc_default_map() to the point where the program is attached to an
> >> > interface, then...    
> >> 
> >> Possibly.. and you also need to handle the case where interface with a
> >> program attached is moved, no?  
> 
> Yup, alloc on attach was easy enough; the moving turns out to be the
> tricky part :)
> 
> > True, we need to handle if e.g. a veth gets an XDP program attached and
> > then is moved into a network namespace (as I've already explained to
> > Toke in a meeting).  
> 
> Yeah, I had somehow convinced myself that the XDP program was being
> removed when the interface was being torn down before moving between
> namespaces. Jesper pointed out that this was not in fact the case... :P
> 
> > I'm still not sure how to handle this...  
> 
> There are a couple of options, I think. At least:
> 
> 1. Maintain a flag on struct net_device indicating that this device
>    needs the redirect map allocated, and react to that when interfaces
>    are being moved.
> 
> 2. Lookup the BPF program by ID (which we can get from the driver) on
>    move, and react to the program flag.
> 
> 3. Keep the allocation on program load, but allocate maps for all active
>    namespaces (which would probably need a refcnt mechanism to
>    deallocate things again).
> 
> I think I'm leaning towards #2; possibly combined with a refcnt so we
> can actually deallocate the map in the root namespace when it's not
> needed anymore.

Okay.. what about tail calls?  I think #3 is most reasonable
complexity-wise, or some mix of #2 and #3 - cnt the programs with legacy
redirects, and then allocate the resources if cnt && name space has any
XDP program attached.

Can users really not be told to just use the correct helper? ;)
Toke Høiland-Jørgensen Feb. 26, 2019, 11 a.m. UTC | #9
Jakub Kicinski <jakub.kicinski@netronome.com> writes:

> On Sat, 23 Feb 2019 13:11:02 +0100, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> > On Fri, 22 Feb 2019 13:37:34 -0800 Jakub Kicinski wrote:
>> >> On Fri, 22 Feb 2019 11:13:50 +0100, Toke Høiland-Jørgensen wrote:  
>> >> > Jakub Kicinski <jakub.kicinski@netronome.com> writes:    
>> >> > > On Thu, 21 Feb 2019 12:56:54 +0100, Toke Høiland-Jørgensen wrote:      
>> > [...]  
>> >> > >
>> >> > > BPF programs don't obey by netns boundaries.  The fact the program is
>> >> > > verified in one ns doesn't mean this is the only ns it will be used in :(
>> >> > > Meaning if any program is using the redirect map you may need a secret
>> >> > > map in every ns.. no?      
>> >> > 
>> >> > Ah, yes, good point. Totally didn't think about the fact that load and
>> >> > attach are decoupled. Hmm, guess I'll just have to move the call to
>> >> > alloc_default_map() to the point where the program is attached to an
>> >> > interface, then...    
>> >> 
>> >> Possibly.. and you also need to handle the case where interface with a
>> >> program attached is moved, no?  
>> 
>> Yup, alloc on attach was easy enough; the moving turns out to be the
>> tricky part :)
>> 
>> > True, we need to handle if e.g. a veth gets an XDP program attached and
>> > then is moved into a network namespace (as I've already explained to
>> > Toke in a meeting).  
>> 
>> Yeah, I had somehow convinced myself that the XDP program was being
>> removed when the interface was being torn down before moving between
>> namespaces. Jesper pointed out that this was not in fact the case... :P
>> 
>> > I'm still not sure how to handle this...  
>> 
>> There are a couple of options, I think. At least:
>> 
>> 1. Maintain a flag on struct net_device indicating that this device
>>    needs the redirect map allocated, and react to that when interfaces
>>    are being moved.
>> 
>> 2. Lookup the BPF program by ID (which we can get from the driver) on
>>    move, and react to the program flag.
>> 
>> 3. Keep the allocation on program load, but allocate maps for all active
>>    namespaces (which would probably need a refcnt mechanism to
>>    deallocate things again).
>> 
>> I think I'm leaning towards #2; possibly combined with a refcnt so we
>> can actually deallocate the map in the root namespace when it's not
>> needed anymore.
>
> Okay.. what about tail calls? I think #3 is most reasonable
> complexity-wise, or some mix of #2 and #3 - cnt the programs with
> legacy redirects, and then allocate the resources if cnt && name space
> has any XDP program attached.

Yeah, I have that more or less working; I did forget about tail
calls, but that should not be too difficult to fix.

> Can users really not be told to just use the correct helper? ;)

Experience would suggest not; users tend to use the simplest API that
gets their job done. And then wonder why they don't get the nice
performance numbers they were "promised". And, well, I tend to agree
that it's not terribly friendly to just go "use this other more
complicated API if you want proper performance". If we really mean that,
then we should formally deprecate xdp_redirect() as an API, IMO :)

-Toke

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bd169a7bcc93..4f8f179df9fd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -606,6 +606,8 @@  struct xdp_buff;
 struct sk_buff;
 
 struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
+struct bpf_map *__dev_map_get_default_map(struct net_device *dev);
+int dev_map_alloc_default_map(void);
 void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
 void __dev_map_flush(struct bpf_map *map);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
@@ -687,6 +689,16 @@  static inline struct net_device  *__dev_map_lookup_elem(struct bpf_map *map,
 	return NULL;
 }
 
+static inline struct bpf_map *__dev_map_get_default_map(struct net_device *dev)
+{
+	return NULL;
+}
+
+static inline int dev_map_alloc_default_map(void)
+{
+	return 0;
+}
+
 static inline void __dev_map_insert_ctx(struct bpf_map *map, u32 index)
 {
 }
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index a68ced28d8f4..6706ecc25d8f 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -162,7 +162,7 @@  struct net {
 #if IS_ENABLED(CONFIG_CAN)
 	struct netns_can	can;
 #endif
-#ifdef CONFIG_XDP_SOCKETS
+#ifdef CONFIG_BPF_SYSCALL
 	struct netns_xdp	xdp;
 #endif
 	struct sock		*diag_nlsk;
diff --git a/include/net/netns/xdp.h b/include/net/netns/xdp.h
index e5734261ba0a..4e2d6394b45d 100644
--- a/include/net/netns/xdp.h
+++ b/include/net/netns/xdp.h
@@ -5,9 +5,14 @@ 
 #include <linux/rculist.h>
 #include <linux/mutex.h>
 
+struct bpf_dtab;
+
 struct netns_xdp {
+#ifdef CONFIG_XDP_SOCKETS
 	struct mutex		lock;
 	struct hlist_head	list;
+#endif
+	struct bpf_dtab	*default_map;
 };
 
 #endif /* __NETNS_XDP_H__ */
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 191b79948424..425077664ac6 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -56,6 +56,7 @@ 
 	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
 
 #define DEV_MAP_BULK_SIZE 16
+#define DEV_MAP_DEFAULT_SIZE 64
 struct xdp_bulk_queue {
 	struct xdp_frame *q[DEV_MAP_BULK_SIZE];
 	struct net_device *dev_rx;
@@ -85,23 +86,11 @@  static u64 dev_map_bitmap_size(const union bpf_attr *attr)
 	return BITS_TO_LONGS((u64) attr->max_entries) * sizeof(unsigned long);
 }
 
-static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
+static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr,
+			    bool check_memlock)
 {
-	struct bpf_dtab *dtab;
-	int err = -EINVAL;
 	u64 cost;
-
-	if (!capable(CAP_NET_ADMIN))
-		return ERR_PTR(-EPERM);
-
-	/* check sanity of attributes */
-	if (attr->max_entries == 0 || attr->key_size != 4 ||
-	    attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK)
-		return ERR_PTR(-EINVAL);
-
-	dtab = kzalloc(sizeof(*dtab), GFP_USER);
-	if (!dtab)
-		return ERR_PTR(-ENOMEM);
+	int err;
 
 	bpf_map_init_from_attr(&dtab->map, attr);
 
@@ -109,39 +98,63 @@  static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	cost = (u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *);
 	cost += dev_map_bitmap_size(attr) * num_possible_cpus();
 	if (cost >= U32_MAX - PAGE_SIZE)
-		goto free_dtab;
+		return -EINVAL;
 
 	dtab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
 
-	/* if map size is larger than memlock limit, reject it early */
-	err = bpf_map_precharge_memlock(dtab->map.pages);
-	if (err)
-		goto free_dtab;
-
-	err = -ENOMEM;
+	if (check_memlock) {
+		/* if map size is larger than memlock limit, reject it early */
+		err = bpf_map_precharge_memlock(dtab->map.pages);
+		if (err)
+			return -EINVAL;
+	}
 
 	/* A per cpu bitfield with a bit per possible net device */
 	dtab->flush_needed = __alloc_percpu_gfp(dev_map_bitmap_size(attr),
 						__alignof__(unsigned long),
 						GFP_KERNEL | __GFP_NOWARN);
 	if (!dtab->flush_needed)
-		goto free_dtab;
+		return -ENOMEM;
 
 	dtab->netdev_map = bpf_map_area_alloc(dtab->map.max_entries *
 					      sizeof(struct bpf_dtab_netdev *),
 					      dtab->map.numa_node);
-	if (!dtab->netdev_map)
-		goto free_dtab;
+	if (!dtab->netdev_map) {
+		free_percpu(dtab->flush_needed);
+		return -ENOMEM;
+	}
 
 	spin_lock(&dev_map_lock);
 	list_add_tail_rcu(&dtab->list, &dev_map_list);
 	spin_unlock(&dev_map_lock);
 
+	return 0;
+}
+
+static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
+{
+	struct bpf_dtab *dtab;
+	int err = -EINVAL;
+
+	if (!capable(CAP_NET_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->key_size != 4 ||
+	    attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK)
+		return ERR_PTR(-EINVAL);
+
+	dtab = kzalloc(sizeof(*dtab), GFP_USER);
+	if (!dtab)
+		return ERR_PTR(-ENOMEM);
+
+	err = dev_map_init_map(dtab, attr, true);
+	if (err) {
+		kfree(dtab);
+		return ERR_PTR(err);
+	}
+
 	return &dtab->map;
-free_dtab:
-	free_percpu(dtab->flush_needed);
-	kfree(dtab);
-	return ERR_PTR(err);
 }
 
 static void dev_map_free(struct bpf_map *map)
@@ -311,6 +324,17 @@  struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
 	return obj;
 }
 
+/* This is only being called from xdp_do_redirect() if the xdp_redirect helper
+ * is used; the default map is allocated on XDP program load if the helper is
+ * used, so will always be available at this point.
+ */
+struct bpf_map *__dev_map_get_default_map(struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+
+	return &net->xdp.default_map->map;
+}
+
 /* Runs under RCU-read-side, plus in softirq under NAPI protection.
  * Thus, safe percpu variable access.
  */
@@ -496,10 +520,19 @@  static int dev_map_notification(struct notifier_block *notifier,
 				ulong event, void *ptr)
 {
 	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
+	struct net *net = dev_net(netdev);
+	u32 idx = netdev->ifindex;
 	struct bpf_dtab *dtab;
 	int i;
 
 	switch (event) {
+	case NETDEV_REGISTER:
+		rcu_read_lock();
+		dtab = READ_ONCE(net->xdp.default_map);
+		if (dtab)
+			dev_map_update_elem(&dtab->map, &idx, &idx, 0);
+		rcu_read_unlock();
+		break;
 	case NETDEV_UNREGISTER:
 		/* This rcu_read_lock/unlock pair is needed because
 		 * dev_map_list is an RCU list AND to ensure a delete
@@ -528,16 +561,84 @@  static int dev_map_notification(struct notifier_block *notifier,
 	return NOTIFY_OK;
 }
 
+static void __net_exit dev_map_net_exit(struct net *net)
+{
+	struct bpf_dtab *dtab;
+
+	dtab = xchg(&net->xdp.default_map, NULL);
+	if (dtab)
+		dev_map_free(&dtab->map);
+}
+
 static struct notifier_block dev_map_notifier = {
 	.notifier_call = dev_map_notification,
 };
 
+static struct pernet_operations dev_map_net_ops = {
+	.exit = dev_map_net_exit,
+};
+
+int dev_map_alloc_default_map(void)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_dtab *dtab, *old_dtab;
+	struct net_device *netdev;
+	union bpf_attr attr = {};
+	u32 idx;
+	int err;
+
+	if (READ_ONCE(net->xdp.default_map))
+		return 0;
+
+	dtab = kzalloc(sizeof(*net->xdp.default_map), GFP_USER);
+	if (!dtab)
+		return -ENOMEM;
+
+	attr.max_entries = DEV_MAP_DEFAULT_SIZE;
+	attr.map_type = BPF_MAP_TYPE_DEVMAP;
+	attr.value_size = 4;
+	attr.key_size = 4;
+
+	err = dev_map_init_map(dtab, &attr, false);
+	if (err) {
+		kfree(dtab);
+		return err;
+	}
+
+	for_each_netdev(net, netdev) {
+		if (netdev->ifindex < DEV_MAP_DEFAULT_SIZE) {
+			idx = netdev->ifindex;
+			err = dev_map_update_elem(&dtab->map, &idx, &idx, 0);
+			if (err) {
+				dev_map_free(&dtab->map);
+				return err;
+			}
+		}
+	}
+
+	old_dtab = xchg(&net->xdp.default_map, dtab);
+	if (old_dtab)
+		dev_map_free(&old_dtab->map);
+
+	return 0;
+}
+
 static int __init dev_map_init(void)
 {
+	int err;
+
 	/* Assure tracepoint shadow struct _bpf_dtab_netdev is in sync */
 	BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) !=
 		     offsetof(struct _bpf_dtab_netdev, dev));
+
 	register_netdevice_notifier(&dev_map_notifier);
+
+	err = register_pernet_subsys(&dev_map_net_ops);
+	if (err) {
+		unregister_netdevice_notifier(&dev_map_notifier);
+		return err;
+	}
+
 	return 0;
 }
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b63bc77af2d1..629661db36ee 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7527,6 +7527,12 @@  static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			prog->dst_needed = 1;
 		if (insn->imm == BPF_FUNC_get_prandom_u32)
 			bpf_user_rnd_init_once();
+		if (insn->imm == BPF_FUNC_redirect) {
+			int err = dev_map_alloc_default_map();
+
+			if (err)
+				return err;
+		}
 		if (insn->imm == BPF_FUNC_override_return)
 			prog->kprobe_override = 1;
 		if (insn->imm == BPF_FUNC_tail_call) {
diff --git a/net/core/filter.c b/net/core/filter.c
index b5a002d7b263..c709b1468bb6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3326,58 +3326,6 @@  static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
-static int __bpf_tx_xdp(struct net_device *dev,
-			struct bpf_map *map,
-			struct xdp_buff *xdp,
-			u32 index)
-{
-	struct xdp_frame *xdpf;
-	int err, sent;
-
-	if (!dev->netdev_ops->ndo_xdp_xmit) {
-		return -EOPNOTSUPP;
-	}
-
-	err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
-	if (unlikely(err))
-		return err;
-
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
-	sent = dev->netdev_ops->ndo_xdp_xmit(dev, 1, &xdpf, XDP_XMIT_FLUSH);
-	if (sent <= 0)
-		return sent;
-	return 0;
-}
-
-static noinline int
-xdp_do_redirect_slow(struct net_device *dev, struct xdp_buff *xdp,
-		     struct bpf_prog *xdp_prog, struct bpf_redirect_info *ri)
-{
-	struct net_device *fwd;
-	u32 index = ri->ifindex;
-	int err;
-
-	fwd = dev_get_by_index_rcu(dev_net(dev), index);
-	ri->ifindex = 0;
-	if (unlikely(!fwd)) {
-		err = -EINVAL;
-		goto err;
-	}
-
-	err = __bpf_tx_xdp(fwd, NULL, xdp, 0);
-	if (unlikely(err))
-		goto err;
-
-	_trace_xdp_redirect(dev, xdp_prog, index);
-	return 0;
-err:
-	_trace_xdp_redirect_err(dev, xdp_prog, index, err);
-	return err;
-}
-
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 			    struct bpf_map *map,
 			    struct xdp_buff *xdp,
@@ -3508,10 +3456,10 @@  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct bpf_map *map = READ_ONCE(ri->map);
 
-	if (likely(map))
-		return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri);
+	if (unlikely(!map))
+		map = __dev_map_get_default_map(dev);
 
-	return xdp_do_redirect_slow(dev, xdp, xdp_prog, ri);
+	return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri);
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);