
[bpf-next,v2,8/9] bpf: Provide helper to do forwarding lookups in kernel FIB table

Message ID 20180504025432.23451-9-dsahern@gmail.com
State Changes Requested, archived
Delegated to: BPF Maintainers
Series bpf: Add helper to do FIB lookups

Commit Message

David Ahern May 4, 2018, 2:54 a.m. UTC
Provide a helper for doing a FIB and neighbor lookup in the kernel
tables from an XDP program. The helper provides a fastpath for forwarding
packets. If the packet is a local delivery or for any reason is not a
simple lookup and forward, the packet continues up the stack.

If it is to be forwarded, the forwarding can be done directly if the
neighbor is already known. If the neighbor does not exist, the first
few packets go up the stack for neighbor resolution. Once resolved, the
xdp program provides the fast path.
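
As a rough illustration of the fast-path versus slow-path split described above, the userspace sketch below maps the helper's return value onto a verdict. The enum values mirror the kernel's enum xdp_action; fib_verdict() is a hypothetical name, and handing every non-success case to the full stack is an assumption, not something the patch mandates.

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the XDP verdicts; the values mirror enum xdp_action in
 * include/uapi/linux/bpf.h.
 */
enum xdp_verdict {
	VERDICT_ABORTED = 0,
	VERDICT_DROP = 1,
	VERDICT_PASS = 2,
	VERDICT_TX = 3,
	VERDICT_REDIRECT = 4,
};

/* Hypothetical helper: decide what to do with a packet given the value
 * returned by the FIB lookup helper.
 *   > 0   egress ifindex, neighbor resolved: forward directly
 *   == 0  not a simple forward (local delivery, unresolved neighbor, ...)
 *   < 0   lookup error
 * Assumption: anything other than success goes up the full stack.
 */
static enum xdp_verdict fib_verdict(int lookup_ret, int *egress_ifindex)
{
	assert(egress_ifindex != NULL);

	if (lookup_ret > 0) {
		*egress_ifindex = lookup_ret;
		return VERDICT_REDIRECT;
	}

	*egress_ifindex = 0;
	return VERDICT_PASS;
}
```

A real XDP program would pair the redirect verdict with a redirect helper call on the returned ifindex; that step is left out here.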

On successful lookup the nexthop dmac, current device smac and egress
device index are returned.

The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
are implemented in this patch. The API includes layer 4 parameters so
that an XDP program that chooses to do deep packet inspection can have
the lookup compared against ACLs implemented as FIB rules.

Header rewrite is left to the XDP program.
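
To make the rewrite step concrete, here is a minimal userspace sketch of the L2 part, assuming an untagged Ethernet frame. struct eth_hdr, struct fwd_result and rewrite_l2() are hypothetical names for illustration; only the smac/dmac semantics come from the commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6 /* same value as ETH_ALEN, open-coded for the sketch */

/* Minimal Ethernet header layout (no VLAN tag). */
struct eth_hdr {
	uint8_t dst[MAC_LEN];
	uint8_t src[MAC_LEN];
	uint16_t proto; /* network byte order, left untouched */
};

/* Only the lookup outputs that the L2 rewrite needs. */
struct fwd_result {
	uint8_t smac[MAC_LEN]; /* MAC address of the egress device */
	uint8_t dmac[MAC_LEN]; /* MAC address of the nexthop */
};

/* Hypothetical rewrite step: address the frame to the nexthop and mark
 * the egress device as the sender.
 */
static void rewrite_l2(struct eth_hdr *eth, const struct fwd_result *res)
{
	assert(eth && res);
	memcpy(eth->dst, res->dmac, MAC_LEN);
	memcpy(eth->src, res->smac, MAC_LEN);
}
```

A forwarder would also decrement the IPv4 TTL (updating the header checksum) or the IPv6 hop limit before transmitting; that part is omitted here.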

The lookup takes 2 flags:
- BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
  straight to the table associated with the device (expert setting for
  those looking to maximize throughput)

- BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
  Default is an ingress lookup.
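
The two flags are independent bits and can be combined. The sketch below decodes them in userspace; the FIB_LOOKUP_* names and decode_flags() are hypothetical, and the bit positions are assumed to match BIT(0) and BIT(1) from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: same bit positions as BIT(0) and BIT(1) in the patch. */
#define FIB_LOOKUP_DIRECT (1U << 0)
#define FIB_LOOKUP_OUTPUT (1U << 1)

static_assert((FIB_LOOKUP_DIRECT & FIB_LOOKUP_OUTPUT) == 0,
	      "flags must occupy distinct bits");

struct lookup_mode {
	bool bypass_rules; /* skip FIB rules, use the device's table */
	bool egress;       /* egress perspective; ingress otherwise */
};

/* Hypothetical decoder: translate the flags word into a lookup mode. */
static struct lookup_mode decode_flags(uint32_t flags)
{
	struct lookup_mode m = {
		.bypass_rules = (flags & FIB_LOOKUP_DIRECT) != 0,
		.egress = (flags & FIB_LOOKUP_OUTPUT) != 0,
	};

	return m;
}
```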

Initial performance numbers collected by Jesper, forwarded packets/sec:

       Full stack    XDP FIB lookup    XDP Direct lookup
IPv4   1,947,969       7,074,156          7,415,333
IPv6   1,728,000       6,165,504          7,262,720

These numbers are for single CPU core forwarding on a Broadwell
E5-1650 v4 @ 3.60GHz.
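
For readers who want the table as ratios, the two small functions below do the arithmetic. speedup() and ns_per_packet() are hypothetical helper names; the inputs are the packets/sec figures quoted above.

```c
#include <assert.h>

/* Ratio of a fast-path rate to the full-stack rate (both packets/sec). */
static double speedup(double fast_pps, double stack_pps)
{
	assert(stack_pps > 0.0);
	return fast_pps / stack_pps;
}

/* Average time spent per packet, in nanoseconds, at a given rate. */
static double ns_per_packet(double pps)
{
	assert(pps > 0.0);
	return 1e9 / pps;
}
```

With the IPv4 figures, speedup(7074156, 1947969) comes out at roughly 3.6, and ns_per_packet(7074156) at roughly 141 ns per packet versus roughly 513 ns for the full stack.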

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/uapi/linux/bpf.h |  83 ++++++++++++++-
 net/core/filter.c        | 266 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 348 insertions(+), 1 deletion(-)

Comments

Jesper Dangaard Brouer May 7, 2018, 1:35 p.m. UTC | #1
On Thu,  3 May 2018 19:54:31 -0700 David Ahern <dsahern@gmail.com> wrote:

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 6877426c23a6..cf0d27acf1d1 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
[...]
> +static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
> +	.func		= bpf_xdp_fib_lookup,
> +	.gpl_only	= true,

Is it a deliberate choice to require BPF-progs using this helper to be
GPL licensed?

Asking as this seems to be the first network related helper with this
requirement, while this is typical for tracing related helpers.


> +	.ret_type	= RET_INTEGER,
> +	.arg1_type      = ARG_PTR_TO_CTX,
> +	.arg2_type      = ARG_PTR_TO_MEM,
> +	.arg3_type      = ARG_CONST_SIZE,
> +	.arg4_type	= ARG_ANYTHING,
> +};
Daniel Borkmann May 7, 2018, 2:10 p.m. UTC | #2
On 05/07/2018 03:35 PM, Jesper Dangaard Brouer wrote:
> On Thu,  3 May 2018 19:54:31 -0700 David Ahern <dsahern@gmail.com> wrote:
> 
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 6877426c23a6..cf0d27acf1d1 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
> [...]
>> +static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
>> +	.func		= bpf_xdp_fib_lookup,
>> +	.gpl_only	= true,
> 
> Is it a deliberate choice to require BPF-progs using this helper to be
> GPL licensed?
> 
> Asking as this seems to be the first network related helper with this
> requirement, while this is typical for tracing related helpers.

Good point, we should remove that. In networking it's only the perf event
output helpers tying into tracing bits. After all, if you do a route lookup
via netlink from user space there's no such restriction at all.
David Ahern May 7, 2018, 2:26 p.m. UTC | #3
On 5/7/18 8:10 AM, Daniel Borkmann wrote:
> On 05/07/2018 03:35 PM, Jesper Dangaard Brouer wrote:
>> On Thu,  3 May 2018 19:54:31 -0700 David Ahern <dsahern@gmail.com> wrote:
>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 6877426c23a6..cf0d27acf1d1 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>> [...]
>>> +static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
>>> +	.func		= bpf_xdp_fib_lookup,
>>> +	.gpl_only	= true,
>>
>> Is it a deliberate choice to require BPF-progs using this helper to be
>> GPL licensed?
>>
>> Asking as this seems to be the first network related helper with this
>> requirement, while this is typical for tracing related helpers.
> 
> Good point, we should remove that. In networking it's only the perf event
> output helpers tying into tracing bits. After all, if you do a route lookup
> via netlink from user space there's no such restriction at all.
> 

Networking symbols are typically exported GPL for modules. The person
writing the code and exporting GPL is specifying a desire that only GPL
licensed modules can link to the symbol.

Given the common analogy of modules and bpf programs, why can't a writer
of a bpf helper specify a preference that only GPL licensed programs
leverage a BPF helper?
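
For context on what gpl_only does in practice: at load time, a program whose license is not GPL compatible is refused if it calls a helper marked gpl_only. The userspace model below is a simplified sketch of that rule; the struct and function names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only the fields that
 * matter for the license check are modelled.
 */
struct helper_proto {
	const char *name;
	bool gpl_only;
};

struct prog_info {
	bool gpl_compatible; /* derived from the program's license string */
};

/* Hypothetical model of the load-time check: a GPL-only helper may only
 * be called from a program that declares a GPL-compatible license.
 */
static bool helper_call_allowed(const struct prog_info *prog,
				const struct helper_proto *fn)
{
	assert(prog != NULL && fn != NULL);
	return prog->gpl_compatible || !fn->gpl_only;
}
```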
David Miller May 7, 2018, 3:36 p.m. UTC | #4
From: David Ahern <dsahern@gmail.com>
Date: Mon, 7 May 2018 08:26:47 -0600

> On 5/7/18 8:10 AM, Daniel Borkmann wrote:
>> On 05/07/2018 03:35 PM, Jesper Dangaard Brouer wrote:
>>> On Thu,  3 May 2018 19:54:31 -0700 David Ahern <dsahern@gmail.com> wrote:
>>>
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index 6877426c23a6..cf0d27acf1d1 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>> [...]
>>>> +static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
>>>> +	.func		= bpf_xdp_fib_lookup,
>>>> +	.gpl_only	= true,
>>>
>>> Is it a deliberate choice to require BPF-progs using this helper to be
>>> GPL licensed?
>>>
>>> Asking as this seems to be the first network related helper with this
>>> requirement, while this is typical for tracing related helpers.
>> 
>> Good point, we should remove that. In networking it's only the perf event
>> output helpers tying into tracing bits. After all, if you do a route lookup
>> via netlink from user space there's no such restriction at all.
>> 
> 
> Networking symbols are typically exported GPL for modules. The person
> writing the code and exporting GPL is specifying a desire that only GPL
> licensed modules can link to the symbol.
> 
> Given the common analogy of modules and bpf programs, why can't a writer
> of a bpf helper specify a preference that only GPL licensed programs
> leverage a BPF helper?

I also think that for this particular set of helpers GPL is appropriate.

Yes, via netlink the same lookup can happen, but not with the same level
of performance and microsecond tuning we've done over the years on this
sophisticated trie lookup code.

Therefore, I think David's choice is very appropriate.
Daniel Borkmann May 9, 2018, 8:15 a.m. UTC | #5
On 05/04/2018 04:54 AM, David Ahern wrote:
> Provide a helper for doing a FIB and neighbor lookup in the kernel
> tables from an XDP program. The helper provides a fastpath for forwarding
> packets. If the packet is a local delivery or for any reason is not a
> simple lookup and forward, the packet continues up the stack.
> 
> If it is to be forwarded, the forwarding can be done directly if the
> neighbor is already known. If the neighbor does not exist, the first
> few packets go up the stack for neighbor resolution. Once resolved, the
> xdp program provides the fast path.
> 
> On successful lookup the nexthop dmac, current device smac and egress
> device index are returned.
> 
> The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
> are implemented in this patch. The API includes layer 4 parameters if
> the XDP program chooses to do deep packet inspection to allow compare
> against ACLs implemented as FIB rules.
> 
> Header rewrite is left to the XDP program.
> 
> The lookup takes 2 flags:
> - BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
>   straight to the table associated with the device (expert setting for
>   those looking to maximize throughput)
> 
> - BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
>   Default is an ingress lookup.
> 
> Initial performance numbers collected by Jesper, forwarded packets/sec:
> 
>        Full stack    XDP FIB lookup    XDP Direct lookup
> IPv4   1,947,969       7,074,156          7,415,333
> IPv6   1,728,000       6,165,504          7,262,720
> 
> These number are single CPU core forwarding on a Broadwell
> E5-1650 v4 @ 3.60GHz.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Ohh well, this is causing allmodconfig build warnings (e.g. on x86) as reported today:

In file included from include/linux/dma-mapping.h:5:0,
                 from include/linux/skbuff.h:34,
                 from include/linux/if_ether.h:23,
                 from include/uapi/linux/bpf.h:13,
                 from include/linux/bpf-cgroup.h:6,
                 from include/linux/cgroup-defs.h:22,
                 from include/linux/cgroup.h:28,
                 from include/linux/perf_event.h:57,
                 from include/linux/trace_events.h:10,
                 from include/trace/trace_events.h:20,
                 from include/trace/define_trace.h:96,
                 from drivers/android/binder_trace.h:387,
                 from drivers/android/binder.c:5702:
include/linux/sizes.h:24:0: warning: "SZ_1K" redefined
 #define SZ_1K    0x00000400
drivers/android/binder.c:116:0: note: this is the location of the previous definition
 #define SZ_1K                               0x400
In file included from include/linux/dma-mapping.h:5:0,
                 from include/linux/skbuff.h:34,
                 from include/linux/if_ether.h:23,
                 from include/uapi/linux/bpf.h:13,
                 from include/linux/bpf-cgroup.h:6,
                 from include/linux/cgroup-defs.h:22,
                 from include/linux/cgroup.h:28,
                 from include/linux/perf_event.h:57,
                 from include/linux/trace_events.h:10,
                 from include/trace/trace_events.h:20,
                 from include/trace/define_trace.h:96,
                 from drivers/android/binder_trace.h:387,
                 from drivers/android/binder.c:5702:
include/linux/sizes.h:37:0: warning: "SZ_4M" redefined
 #define SZ_4M    0x00400000
drivers/android/binder.c:120:0: note: this is the location of the previous definition
 #define SZ_4M                               0x400000
fs/ecryptfs/miscdev.c:206:0: warning: "PKT_TYPE_OFFSET" redefined
 #define PKT_TYPE_OFFSET  0
In file included from include/linux/if_ether.h:23:0,
                 from include/uapi/linux/bpf.h:13,
                 from include/linux/bpf-cgroup.h:6,
                 from include/linux/cgroup-defs.h:22,
                 from include/linux/cgroup.h:28,
                 from include/linux/writeback.h:183,
                 from include/linux/backing-dev.h:16,
                 from fs/ecryptfs/ecryptfs_kernel.h:41,
                 from fs/ecryptfs/miscdev.c:30:
include/linux/skbuff.h:753:0: note: this is the location of the previous definition
 #define PKT_TYPE_OFFSET() offsetof(struct sk_buff, __pkt_type_offset)

Let's get a clean, proper version of the whole series into bpf-next. I've dropped it
from there for now and am waiting for your v3 respin to apply with the above fixed.

Thank you.
David Ahern May 9, 2018, 4:05 p.m. UTC | #6
On 5/9/18 2:15 AM, Daniel Borkmann wrote:
> 
> Ohh well, this is causing allmodconfig build warnings (e.g. on x86) as reported today:

lovely.

> 
> In file included from include/linux/dma-mapping.h:5:0,
>                  from include/linux/skbuff.h:34,
>                  from include/linux/if_ether.h:23,
>                  from include/uapi/linux/bpf.h:13,
>                  from include/linux/bpf-cgroup.h:6,
>                  from include/linux/cgroup-defs.h:22,
>                  from include/linux/cgroup.h:28,
>                  from include/linux/perf_event.h:57,
>                  from include/linux/trace_events.h:10,
>                  from include/trace/trace_events.h:20,
>                  from include/trace/define_trace.h:96,
>                  from drivers/android/binder_trace.h:387,
>                  from drivers/android/binder.c:5702:
> include/linux/sizes.h:24:0: warning: "SZ_1K" redefined
>  #define SZ_1K    0x00000400
> drivers/android/binder.c:116:0: note: this is the location of the previous definition
>  #define SZ_1K                               0x400

binder.c has very few recent commits to it. Are you ok with me
submitting the change with the others in this set (with proper cc's of
course)?

> fs/ecryptfs/miscdev.c:206:0: warning: "PKT_TYPE_OFFSET" redefined
>  #define PKT_TYPE_OFFSET  0
> In file included from include/linux/if_ether.h:23:0,
>                  from include/uapi/linux/bpf.h:13,
>                  from include/linux/bpf-cgroup.h:6,
>                  from include/linux/cgroup-defs.h:22,
>                  from include/linux/cgroup.h:28,
>                  from include/linux/writeback.h:183,
>                  from include/linux/backing-dev.h:16,
>                  from fs/ecryptfs/ecryptfs_kernel.h:41,
>                  from fs/ecryptfs/miscdev.c:30:
> include/linux/skbuff.h:753:0: note: this is the location of the previous definition
>  #define PKT_TYPE_OFFSET() offsetof(struct sk_buff, __pkt_type_offset)

And this one I renamed to SKB_PKT_TYPE_OFFSET

With that it compiles cleanly.
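
A stripped-down illustration of this kind of clash and of the prefix fix follows. struct pkt_buf and both accessor functions are hypothetical; the sketch renames the header-side macro, which is one possible reading of the rename mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a widely included header: the macro carries a subsystem
 * prefix so it cannot collide with names used by unrelated files.
 */
struct pkt_buf {
	unsigned char flags;
	unsigned char pkt_type;
};
#define SKB_PKT_TYPE_OFFSET() offsetof(struct pkt_buf, pkt_type)

static_assert(SKB_PKT_TYPE_OFFSET() == 1,
	      "pkt_type follows the one-byte flags field");

/* Stand-in for an unrelated file that uses the short name for its own
 * purposes. If the header macro above were also called PKT_TYPE_OFFSET,
 * this line would produce a "redefined" warning.
 */
#define PKT_TYPE_OFFSET 0

static size_t header_offset(void)
{
	return SKB_PKT_TYPE_OFFSET();
}

static size_t local_offset(void)
{
	return PKT_TYPE_OFFSET;
}
```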

> 
> Lets get a clean, proper version of the whole series into bpf-next. I've dropped it
> from there right now and waiting for your v3 respin to apply with the above fixed.
> 
> Thank you.
>
Daniel Borkmann May 9, 2018, 8:44 p.m. UTC | #7
On 05/09/2018 06:05 PM, David Ahern wrote:
> On 5/9/18 2:15 AM, Daniel Borkmann wrote:
>>
>> Ohh well, this is causing allmodconfig build warnings (e.g. on x86) as reported today:
> 
> lovely.
> 
>> In file included from include/linux/dma-mapping.h:5:0,
>>                  from include/linux/skbuff.h:34,
>>                  from include/linux/if_ether.h:23,
>>                  from include/uapi/linux/bpf.h:13,
>>                  from include/linux/bpf-cgroup.h:6,
>>                  from include/linux/cgroup-defs.h:22,
>>                  from include/linux/cgroup.h:28,
>>                  from include/linux/perf_event.h:57,
>>                  from include/linux/trace_events.h:10,
>>                  from include/trace/trace_events.h:20,
>>                  from include/trace/define_trace.h:96,
>>                  from drivers/android/binder_trace.h:387,
>>                  from drivers/android/binder.c:5702:
>> include/linux/sizes.h:24:0: warning: "SZ_1K" redefined
>>  #define SZ_1K    0x00000400
>> drivers/android/binder.c:116:0: note: this is the location of the previous definition
>>  #define SZ_1K                               0x400
> 
> binder.c has very few recent commits to it. Are you ok with me
> submitting the change with the others in this set (with proper cc's of
> course)?
> 
>> fs/ecryptfs/miscdev.c:206:0: warning: "PKT_TYPE_OFFSET" redefined
>>  #define PKT_TYPE_OFFSET  0
>> In file included from include/linux/if_ether.h:23:0,
>>                  from include/uapi/linux/bpf.h:13,
>>                  from include/linux/bpf-cgroup.h:6,
>>                  from include/linux/cgroup-defs.h:22,
>>                  from include/linux/cgroup.h:28,
>>                  from include/linux/writeback.h:183,
>>                  from include/linux/backing-dev.h:16,
>>                  from fs/ecryptfs/ecryptfs_kernel.h:41,
>>                  from fs/ecryptfs/miscdev.c:30:
>> include/linux/skbuff.h:753:0: note: this is the location of the previous definition
>>  #define PKT_TYPE_OFFSET() offsetof(struct sk_buff, __pkt_type_offset)
> 
> And this one I renamed to SKB_PKT_TYPE_OFFSET
> 
> With that it compiles cleanly.

Generally, no objection. However, could we get rid of the two extra includes altogether
to avoid running into any such dependency issue? Right now the only includes we have in
the bpf uapi header are linux/types.h and linux/bpf_common.h (the latter has no extra deps
by itself). Both ETH_ALEN and struct in6_addr are in uapi and therefore never allowed
to change, so we can e.g. avoid using ETH_ALEN and just have the value instead. In
other places of the header we use __u32 remote_ipv6[4], __u32 src_ip6[4] etc. to denote
a v6 address; we could do the same here and should be all good then.
David Ahern May 9, 2018, 9:29 p.m. UTC | #8
On 5/9/18 2:44 PM, Daniel Borkmann wrote:
> Generally, no objection. However, could we get rid of the two extra includes altogether
> to avoid running into any such dependency issue? Right now the only includes we have in
> the bpf uapi header is linux/types.h and linux/bpf_common.h (latter has no extra deps
> by itself). Both the ETH_ALEN and struct in6_addr are in uapi and therefore never allowed
> to change so we can e.g. avoid to use ETH_ALEN and just have the value instead. In the
> other places of the header we use __u32 remote_ipv6[4], __u32 src_ip6[4] etc to denote
> a v6 address, we could do the same here and should be all good then.

I was able to drop the include of linux/in6.h and still use in6_addr. I
would prefer to keep in6_addr since it works and avoid the need to add
typecasts.

As for ETH_ALEN, I could redefine it but it just kicks the can down the
road. If if_ether.h is included after bpf.h, it will cause redefinition
warnings.
David Ahern May 9, 2018, 9:39 p.m. UTC | #9
On 5/9/18 3:29 PM, David Ahern wrote:
> On 5/9/18 2:44 PM, Daniel Borkmann wrote:
>> Generally, no objection. However, could we get rid of the two extra includes altogether
>> to avoid running into any such dependency issue? Right now the only includes we have in
>> the bpf uapi header is linux/types.h and linux/bpf_common.h (latter has no extra deps
>> by itself). Both the ETH_ALEN and struct in6_addr are in uapi and therefore never allowed
>> to change so we can e.g. avoid to use ETH_ALEN and just have the value instead. In the
>> other places of the header we use __u32 remote_ipv6[4], __u32 src_ip6[4] etc to denote
>> a v6 address, we could do the same here and should be all good then.
> 
> I was able to drop the include of linux/in6.h and still use in6_addr. I
> would prefer to keep in6_addr since it works and avoid the need to add
> typecasts.

Never mind; that was working because if_ether.h was pulling in skbuff.h
which included in6.h.
> 
> As for ETH_ALEN, I could redefine it but it just kicks the can down the
> road. If if_ether.h is included after bpf.h, it will cause redefinition
> warnings.
> 

I guess I will continue with the open-coded magic numbers for mac and ipv6
addresses.
Alexei Starovoitov May 9, 2018, 9:49 p.m. UTC | #10
On Wed, May 09, 2018 at 03:39:52PM -0600, David Ahern wrote:
> On 5/9/18 3:29 PM, David Ahern wrote:
> > On 5/9/18 2:44 PM, Daniel Borkmann wrote:
> >> Generally, no objection. However, could we get rid of the two extra includes altogether
> >> to avoid running into any such dependency issue? Right now the only includes we have in
> >> the bpf uapi header is linux/types.h and linux/bpf_common.h (latter has no extra deps
> >> by itself). Both the ETH_ALEN and struct in6_addr are in uapi and therefore never allowed
> >> to change so we can e.g. avoid to use ETH_ALEN and just have the value instead. In the
> >> other places of the header we use __u32 remote_ipv6[4], __u32 src_ip6[4] etc to denote
> >> a v6 address, we could do the same here and should be all good then.
> > 
> > I was able to drop the include of linux/in6.h and still use in6_addr. I
> > would prefer to keep in6_addr since it works and avoid the need to add
> > typecasts.
> 
> Never mind; that was working because if_ether.h was pulling in skbuff.h
> which included in6.h.
> > 
> > As for ETH_ALEN, I could redefine it but it just kicks the can down the
> > road. If if_ether.h is included after bpf.h, it will cause redefinition
> > warnings.
> > 
> 
> I guess I will continue the open coded magic numbers for mac and ipv6
> addresses.

That's the only way.
Adding
+#include <linux/if_ether.h>
+#include <linux/in6.h>
to uapi/bpf.h is a no-go. It will cause all sorts of breakage,
not only to the kernel build as we realized, but to various user space apps too.
Please use be32 ipv6[4] and a hard-coded mac instead.
Daniel Borkmann May 9, 2018, 9:49 p.m. UTC | #11
On 05/09/2018 11:39 PM, David Ahern wrote:
> On 5/9/18 3:29 PM, David Ahern wrote:
>> On 5/9/18 2:44 PM, Daniel Borkmann wrote:
>>> Generally, no objection. However, could we get rid of the two extra includes altogether
>>> to avoid running into any such dependency issue? Right now the only includes we have in
>>> the bpf uapi header is linux/types.h and linux/bpf_common.h (latter has no extra deps
>>> by itself). Both the ETH_ALEN and struct in6_addr are in uapi and therefore never allowed
>>> to change so we can e.g. avoid to use ETH_ALEN and just have the value instead. In the
>>> other places of the header we use __u32 remote_ipv6[4], __u32 src_ip6[4] etc to denote
>>> a v6 address, we could do the same here and should be all good then.
>>
>> I was able to drop the include of linux/in6.h and still use in6_addr. I
>> would prefer to keep in6_addr since it works and avoid the need to add
>> typecasts.
> 
> Never mind; that was working because if_ether.h was pulling in skbuff.h
> which included in6.h.
>
>> As for ETH_ALEN, I could redefine it but it just kicks the can down the
>> road. If if_ether.h is included after bpf.h, it will cause redefinition
>> warnings.
> 
> I guess I will continue the open coded magic numbers for mac and ipv6
> addresses.

Agreed, it will avoid breakage. We cannot assume that every BPF prog out there has
one specific ordering of if_ether.h and bpf.h includes. Open-coding the numbers
seems best here.
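
A sketch of what the open-coded layout could look like with no dependency on if_ether.h or in6.h. Field names follow the v2 patch; struct fib_lookup_sketch, the use of stdint types in place of the kernel's __u8/__be32, and the literal sizes (6 bytes per MAC address, 4 x 32 bits per IPv6 address) are assumptions for illustration, not the v3 definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lookup parameters with open-coded address sizes. */
struct fib_lookup_sketch {
	uint8_t  family;
	uint8_t  l4_protocol;
	uint16_t sport;        /* network byte order */
	uint16_t dport;        /* network byte order */
	uint16_t tot_len;
	uint32_t ifindex;

	union {
		uint8_t  tos;        /* input, IPv4 */
		uint32_t flowlabel;  /* input, IPv6 */
		uint32_t rt_metric;  /* output */
	};

	union {
		uint32_t mpls_in;
		uint32_t ipv4_src;
		uint32_t ipv6_src[4]; /* instead of struct in6_addr */
	};

	union {
		uint32_t mpls_out[4];
		uint32_t ipv4_dst;
		uint32_t ipv6_dst[4]; /* instead of struct in6_addr */
	};

	uint16_t h_vlan_proto;
	uint16_t h_vlan_TCI;
	uint8_t  smac[6];      /* instead of ETH_ALEN */
	uint8_t  dmac[6];      /* instead of ETH_ALEN */
};

/* The layout has no padding holes, so the size is fixed at 64 bytes. */
static_assert(sizeof(struct fib_lookup_sketch) == 64,
	      "unexpected padding in lookup struct");
```

Because nothing here depends on a macro or type from another header, the include order of if_ether.h and bpf.h no longer matters to a program using the struct.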

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 93d5a4eeec2a..ddc566cb7492 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -10,6 +10,8 @@ 
 
 #include <linux/types.h>
 #include <linux/bpf_common.h>
+#include <linux/if_ether.h>
+#include <linux/in6.h>
 
 /* Extended instruction set based on top of classic BPF */
 
@@ -1826,6 +1828,33 @@  union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
+ *
+ * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
+ *	Description
+ *		Do FIB lookup in kernel tables using parameters in *params*.
+ *		If lookup is successful and result shows packet is to be
+ *		forwarded, the neighbor tables are searched for the nexthop.
+ *		If successful (i.e., FIB lookup shows forwarding and nexthop
+ *		is resolved), the nexthop address is returned in ipv4_dst,
+ *		ipv6_dst or mpls_out based on family, smac is set to mac
+ *		address of egress device, dmac is set to nexthop mac address,
+ *		rt_metric is set to metric from route.
+ *
+ *		*plen* argument is the size of the passed in struct.
+ *		*flags* argument can be one or more BPF_FIB_LOOKUP_ flags:
+ *
+ *		**BPF_FIB_LOOKUP_DIRECT** means do a direct table lookup vs
+ *		full lookup using FIB rules
+ *		**BPF_FIB_LOOKUP_OUTPUT** means do lookup from an egress
+ *		perspective (default is ingress)
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** for tc cls_act programs.
+ *
+ *	Return
+ *		Egress device index on success, 0 if packet needs to continue
+ *		up the stack for further processing or a negative error in case
+ *		of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -1896,7 +1925,8 @@  union bpf_attr {
 	FN(xdp_adjust_tail),		\
 	FN(skb_get_xfrm_state),		\
 	FN(get_stack),			\
-	FN(skb_load_bytes_relative),
+	FN(skb_load_bytes_relative),	\
+	FN(fib_lookup),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2310,4 +2340,55 @@  struct bpf_raw_tracepoint_args {
 	__u64 args[0];
 };
 
+/* DIRECT:  Skip the FIB rules and go to FIB table associated with device
+ * OUTPUT:  Do lookup from egress perspective; default is ingress
+ */
+#define BPF_FIB_LOOKUP_DIRECT  BIT(0)
+#define BPF_FIB_LOOKUP_OUTPUT  BIT(1)
+
+struct bpf_fib_lookup {
+	/* input */
+	__u8	family;   /* network family, AF_INET, AF_INET6, AF_MPLS */
+
+	/* set if lookup is to consider L4 data - e.g., FIB rules */
+	__u8	l4_protocol;
+	__be16	sport;
+	__be16	dport;
+
+	/* total length of packet from network header - used for MTU check */
+	__u16	tot_len;
+	__u32	ifindex;  /* L3 device index for lookup */
+
+	union {
+		/* inputs to lookup */
+		__u8	tos;		/* AF_INET  */
+		__be32	flowlabel;	/* AF_INET6 */
+
+		/* output: metric of fib result */
+		__u32 rt_metric;
+	};
+
+	union {
+		__be32		mpls_in;
+		__be32		ipv4_src;
+		struct in6_addr	ipv6_src;
+	};
+
+	/* input to bpf_fib_lookup, *dst is destination address.
+	 * output: bpf_fib_lookup sets to gateway address
+	 */
+	union {
+		/* return for MPLS lookups */
+		__be32		mpls_out[4];  /* support up to 4 labels */
+		__be32		ipv4_dst;
+		struct in6_addr	ipv6_dst;
+	};
+
+	/* output */
+	__be16	h_vlan_proto;
+	__be16	h_vlan_TCI;
+	__u8	smac[ETH_ALEN];
+	__u8	dmac[ETH_ALEN];
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index 6877426c23a6..cf0d27acf1d1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -60,6 +60,10 @@ 
 #include <net/xfrm.h>
 #include <linux/bpf_trace.h>
 #include <net/xdp_sock.h>
+#include <linux/inetdevice.h>
+#include <net/ip_fib.h>
+#include <net/flow.h>
+#include <net/arp.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -4032,6 +4036,264 @@  static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 };
 #endif
 
+#if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
+static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
+				  const struct neighbour *neigh,
+				  const struct net_device *dev)
+{
+	memcpy(params->dmac, neigh->ha, ETH_ALEN);
+	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
+	params->h_vlan_TCI = 0;
+	params->h_vlan_proto = 0;
+
+	return dev->ifindex;
+}
+#endif
+
+#if IS_ENABLED(CONFIG_INET)
+static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
+			       u32 flags)
+{
+	struct in_device *in_dev;
+	struct neighbour *neigh;
+	struct net_device *dev;
+	struct fib_result res;
+	struct fib_nh *nh;
+	struct flowi4 fl4;
+	int err;
+
+	dev = dev_get_by_index_rcu(net, params->ifindex);
+	if (unlikely(!dev))
+		return -ENODEV;
+
+	/* verify forwarding is enabled on this interface */
+	in_dev = __in_dev_get_rcu(dev);
+	if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
+		return 0;
+
+	if (flags & BPF_FIB_LOOKUP_OUTPUT) {
+		fl4.flowi4_iif = 1;
+		fl4.flowi4_oif = params->ifindex;
+	} else {
+		fl4.flowi4_iif = params->ifindex;
+		fl4.flowi4_oif = 0;
+	}
+	fl4.flowi4_tos = params->tos & IPTOS_RT_MASK;
+	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
+	fl4.flowi4_flags = 0;
+
+	fl4.flowi4_proto = params->l4_protocol;
+	fl4.daddr = params->ipv4_dst;
+	fl4.saddr = params->ipv4_src;
+	fl4.fl4_sport = params->sport;
+	fl4.fl4_dport = params->dport;
+
+	if (flags & BPF_FIB_LOOKUP_DIRECT) {
+		u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
+		struct fib_table *tb;
+
+		tb = fib_get_table(net, tbid);
+		if (unlikely(!tb))
+			return 0;
+
+		err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
+	} else {
+		fl4.flowi4_mark = 0;
+		fl4.flowi4_secid = 0;
+		fl4.flowi4_tun_key.tun_id = 0;
+		fl4.flowi4_uid = sock_net_uid(net, NULL);
+
+		err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF);
+	}
+
+	if (err || res.type != RTN_UNICAST)
+		return 0;
+
+	if (res.fi->fib_nhs > 1)
+		fib_select_path(net, &res, &fl4, NULL);
+
+	nh = &res.fi->fib_nh[res.nh_sel];
+
+	/* do not handle lwt encaps right now */
+	if (nh->nh_lwtstate)
+		return 0;
+
+	dev = nh->nh_dev;
+	if (unlikely(!dev))
+		return 0;
+
+	if (nh->nh_gw)
+		params->ipv4_dst = nh->nh_gw;
+
+	params->rt_metric = res.fi->fib_priority;
+
+	/* xdp and cls_bpf programs are run in RCU-bh so
+	 * rcu_read_lock_bh is not needed here
+	 */
+	neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)params->ipv4_dst);
+	if (neigh)
+		return bpf_fib_set_fwd_params(params, neigh, dev);
+
+	return 0;
+}
+#endif
+
+#if IS_ENABLED(CONFIG_IPV6)
+static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
+			       u32 flags)
+{
+	struct neighbour *neigh;
+	struct net_device *dev;
+	struct inet6_dev *idev;
+	struct fib6_info *f6i;
+	struct flowi6 fl6;
+	int strict = 0;
+	int oif;
+
+	/* link local addresses are never forwarded */
+	if (rt6_need_strict(&params->ipv6_dst) ||
+	    rt6_need_strict(&params->ipv6_src))
+		return 0;
+
+	dev = dev_get_by_index_rcu(net, params->ifindex);
+	if (unlikely(!dev))
+		return -ENODEV;
+
+	idev = __in6_dev_get_safely(dev);
+	if (unlikely(!idev || !net->ipv6.devconf_all->forwarding))
+		return 0;
+
+	if (flags & BPF_FIB_LOOKUP_OUTPUT) {
+		fl6.flowi6_iif = 1;
+		oif = fl6.flowi6_oif = params->ifindex;
+	} else {
+		oif = fl6.flowi6_iif = params->ifindex;
+		fl6.flowi6_oif = 0;
+		strict = RT6_LOOKUP_F_HAS_SADDR;
+	}
+	fl6.flowlabel = params->flowlabel;
+	fl6.flowi6_scope = 0;
+	fl6.flowi6_flags = 0;
+	fl6.mp_hash = 0;
+
+	fl6.flowi6_proto = params->l4_protocol;
+	fl6.daddr = params->ipv6_dst;
+	fl6.saddr = params->ipv6_src;
+	fl6.fl6_sport = params->sport;
+	fl6.fl6_dport = params->dport;
+
+	if (flags & BPF_FIB_LOOKUP_DIRECT) {
+		u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
+		struct fib6_table *tb;
+
+		tb = ipv6_stub->fib6_get_table(net, tbid);
+		if (unlikely(!tb))
+			return 0;
+
+		f6i = ipv6_stub->fib6_table_lookup(net, tb, oif, &fl6, strict);
+	} else {
+		fl6.flowi6_mark = 0;
+		fl6.flowi6_secid = 0;
+		fl6.flowi6_tun_key.tun_id = 0;
+		fl6.flowi6_uid = sock_net_uid(net, NULL);
+
+		f6i = ipv6_stub->fib6_lookup(net, oif, &fl6, strict);
+	}
+
+	if (unlikely(IS_ERR_OR_NULL(f6i) || f6i == net->ipv6.fib6_null_entry))
+		return 0;
+
+	if (unlikely(f6i->fib6_flags & RTF_REJECT ||
+	    f6i->fib6_type != RTN_UNICAST))
+		return 0;
+
+	if (f6i->fib6_nsiblings && fl6.flowi6_oif == 0)
+		f6i = ipv6_stub->fib6_multipath_select(net, f6i, &fl6,
+						       fl6.flowi6_oif, NULL,
+						       strict);
+
+	if (f6i->fib6_nh.nh_lwtstate)
+		return 0;
+
+	if (f6i->fib6_flags & RTF_GATEWAY)
+		params->ipv6_dst = f6i->fib6_nh.nh_gw;
+
+	dev = f6i->fib6_nh.nh_dev;
+	params->rt_metric = f6i->fib6_metric;
+
+	/* xdp and cls_bpf programs are run in RCU-bh so rcu_read_lock_bh is
+	 * not needed here. Can not use __ipv6_neigh_lookup_noref here
+	 * because we need to get nd_tbl via the stub
+	 */
+	neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
+				      ndisc_hashfn, &params->ipv6_dst, dev);
+	if (neigh)
+		return bpf_fib_set_fwd_params(params, neigh, dev);
+
+	return 0;
+}
+#endif
+
+BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
+	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
+{
+	if (plen < sizeof(*params))
+		return -EINVAL;
+
+	switch (params->family) {
+#if IS_ENABLED(CONFIG_INET)
+	case AF_INET:
+		return bpf_ipv4_fib_lookup(dev_net(ctx->rxq->dev), params,
+					   flags);
+#endif
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		return bpf_ipv6_fib_lookup(dev_net(ctx->rxq->dev), params,
+					   flags);
+#endif
+	}
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
+	.func		= bpf_xdp_fib_lookup,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_PTR_TO_MEM,
+	.arg3_type      = ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
+	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
+{
+	if (plen < sizeof(*params))
+		return -EINVAL;
+
+	switch (params->family) {
+#if IS_ENABLED(CONFIG_INET)
+	case AF_INET:
+		return bpf_ipv4_fib_lookup(dev_net(skb->dev), params, flags);
+#endif
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		return bpf_ipv6_fib_lookup(dev_net(skb->dev), params, flags);
+#endif
+	}
+	return -ENOTSUPP;
+}
+
+static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
+	.func		= bpf_skb_fib_lookup,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_PTR_TO_MEM,
+	.arg3_type      = ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -4181,6 +4443,8 @@  tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_skb_get_xfrm_state:
 		return &bpf_skb_get_xfrm_state_proto;
 #endif
+	case BPF_FUNC_fib_lookup:
+		return &bpf_skb_fib_lookup_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -4206,6 +4470,8 @@  xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_map_proto;
 	case BPF_FUNC_xdp_adjust_tail:
 		return &bpf_xdp_adjust_tail_proto;
+	case BPF_FUNC_fib_lookup:
+		return &bpf_xdp_fib_lookup_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}