
[bpf-next,v3,02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point

Message ID 20200702092416.11961-3-jakub@cloudflare.com
State: Changes Requested
Delegated to: BPF Maintainers
Series: Run a BPF program on socket lookup

Commit Message

Jakub Sitnicki July 2, 2020, 9:24 a.m. UTC
Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
BPF_SK_LOOKUP. The new program type is to be invoked by the transport layer
when looking up a listening socket for a new connection request
(connection-oriented protocols), or when looking up an unconnected socket
for a packet (connection-less protocols).

When called, an SK_LOOKUP BPF program can select a socket that will receive
the packet. This serves as a mechanism to overcome the limits of what the
bind() API allows to express. Two use cases driving this work are:

 (1) steer packets destined to an IP range, on a fixed port, to a socket

     192.0.2.0/24, port 80 -> NGINX socket

 (2) steer packets destined to an IP address, on any port, to a socket

     198.51.100.1, any port -> L7 proxy socket

In its run-time context the program receives information about the packet
that triggered the socket lookup, namely the IP version, the L4 protocol
identifier, and the address 4-tuple. The context can be further extended to
include the ingress interface identifier.

To select a socket, the BPF program fetches it from a map holding socket
references, such as SOCKMAP or SOCKHASH, and calls the
bpf_sk_assign(ctx, sk, ...) helper to record the selection. The transport
layer then uses the selected socket as the result of the socket lookup.
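
A minimal sketch of what such a program could look like (illustration only;
the SEC() name, the SK_PASS verdict convention, and the map layout follow
the selftests added later in the series and are assumptions, not part of
this patch):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SOCKMAP);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} redir_map SEC(".maps");

SEC("sk_lookup")
int select_sock(struct bpf_sk_lookup *ctx)
{
	const __u32 key = 0;
	struct bpf_sock *sk;

	/* Use case (1): match on a fixed destination port; an IP range
	 * check would additionally test ctx->local_ip4.
	 */
	if (ctx->local_port != 80)
		return SK_PASS;

	sk = bpf_map_lookup_elem(&redir_map, &key);
	if (!sk)
		return SK_PASS; /* nothing in the map, use regular lookup */

	/* Record the selection; 0 means none of the BPF_SK_LOOKUP_F_*
	 * flags are set.
	 */
	bpf_sk_assign(ctx, sk, 0);
	bpf_sk_release(sk);
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";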

This patch only enables the user to attach an SK_LOOKUP program to a
network namespace. Subsequent patches hook it up to run on the local
delivery path in the ipv4 and ipv6 stacks.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - Allow the bpf_sk_assign helper to replace a previously selected socket
      only when the BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution
      against multiple programs running in series accidentally overriding
      each other's verdict.
    - Let the BPF program decide that load-balancing within a reuseport socket
      group should be skipped for the socket selected with bpf_sk_assign() by
      passing the BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
    - Extend the struct bpf_sk_lookup program context with an 'sk' field
      containing the selected socket, with the intention of letting multiple
      attached programs running in series see each other's choices. However,
      currently the verifier doesn't allow checking whether the pointer is set.
    - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
    - Get rid of macros in convert_ctx_access to make it easier to read.
    - Disallow 1-,2-byte access to context fields containing IP addresses.
    
    v2:
    - Make bpf_sk_assign reject sockets that don't use RCU freeing.
      Update bpf_sk_assign docs accordingly. (Martin)
    - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
    - Fix broken build when CONFIG_INET is not selected. (Martin)
    - Rename bpf_sk_lookup{} src_/dst_* fields to remote_/local_*. (Martin)
    - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)

 include/linux/bpf-netns.h  |   3 +
 include/linux/bpf_types.h  |   2 +
 include/linux/filter.h     |  19 ++++
 include/uapi/linux/bpf.h   |  74 +++++++++++++++
 kernel/bpf/net_namespace.c |   5 +
 kernel/bpf/syscall.c       |   9 ++
 net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
 scripts/bpf_helpers_doc.py |   9 +-
 8 files changed, 306 insertions(+), 1 deletion(-)
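
On the user-space side, attachment goes through the bpf-netns link
infrastructure mentioned in the v3 notes. A rough sketch, assuming libbpf's
bpf_link_create() wrapper; the exact user-space API available at the time
of this series may differ:

#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

/* prog_fd refers to a loaded BPF_PROG_TYPE_SK_LOOKUP program with
 * expected_attach_type set to BPF_SK_LOOKUP.
 */
int attach_sk_lookup(int prog_fd)
{
	int netns_fd, link_fd;

	netns_fd = open("/proc/self/ns/net", O_RDONLY);
	if (netns_fd < 0)
		return -1;

	/* Link-based attach to the network namespace; multiple links
	 * can coexist and their programs run in series.
	 */
	link_fd = bpf_link_create(prog_fd, netns_fd, BPF_SK_LOOKUP, NULL);

	close(netns_fd);
	return link_fd;
}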

Comments

Yonghong Song July 4, 2020, 6:42 p.m. UTC | #1
On 7/2/20 2:24 AM, Jakub Sitnicki wrote:
> Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
> BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
> when looking up a listening socket for a new connection request for
> connection oriented protocols, or when looking up an unconnected socket for
> a packet for connection-less protocols.
> 
> When called, SK_LOOKUP BPF program can select a socket that will receive
> the packet. This serves as a mechanism to overcome the limits of what
> bind() API allows to express. Two use-cases driving this work are:
> 
>   (1) steer packets destined to an IP range, on fixed port to a socket
> 
>       192.0.2.0/24, port 80 -> NGINX socket
> 
>   (2) steer packets destined to an IP address, on any port to a socket
> 
>       198.51.100.1, any port -> L7 proxy socket
> 
> In its run-time context program receives information about the packet that
> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> address 4-tuple. Context can be further extended to include ingress
> interface identifier.
> 
> To select a socket BPF program fetches it from a map holding socket
> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
> helper to record the selection. Transport layer then uses the selected
> socket as a result of socket lookup.
> 
> This patch only enables the user to attach an SK_LOOKUP program to a
> network namespace. Subsequent patches hook it up to run on local delivery
> path in ipv4 and ipv6 stacks.
> 
> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
> 
> Notes:
>      v3:
>      - Allow bpf_sk_assign helper to replace previously selected socket only
>        when BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution for multiple
>        programs running in series to accidentally override each other's verdict.
>      - Let BPF program decide that load-balancing within a reuseport socket group
>        should be skipped for the socket selected with bpf_sk_assign() by passing
>        BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
>      - Extend struct bpf_sk_lookup program context with an 'sk' field containing
>        the selected socket with an intention for multiple attached program
>        running in series to see each other's choices. However, currently the
>        verifier doesn't allow checking if pointer is set.
>      - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
>      - Get rid of macros in convert_ctx_access to make it easier to read.
>      - Disallow 1-,2-byte access to context fields containing IP addresses.
>      
>      v2:
>      - Make bpf_sk_assign reject sockets that don't use RCU freeing.
>        Update bpf_sk_assign docs accordingly. (Martin)
>      - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
>      - Fix broken build when CONFIG_INET is not selected. (Martin)
>      - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
>      - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)
> 
>   include/linux/bpf-netns.h  |   3 +
>   include/linux/bpf_types.h  |   2 +
>   include/linux/filter.h     |  19 ++++
>   include/uapi/linux/bpf.h   |  74 +++++++++++++++
>   kernel/bpf/net_namespace.c |   5 +
>   kernel/bpf/syscall.c       |   9 ++
>   net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
>   scripts/bpf_helpers_doc.py |   9 +-
>   8 files changed, 306 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/bpf-netns.h b/include/linux/bpf-netns.h
> index 4052d649f36d..cb1d849c5d4f 100644
> --- a/include/linux/bpf-netns.h
> +++ b/include/linux/bpf-netns.h
> @@ -8,6 +8,7 @@
>   enum netns_bpf_attach_type {
>   	NETNS_BPF_INVALID = -1,
>   	NETNS_BPF_FLOW_DISSECTOR = 0,
> +	NETNS_BPF_SK_LOOKUP,
>   	MAX_NETNS_BPF_ATTACH_TYPE
>   };
>   
> @@ -17,6 +18,8 @@ to_netns_bpf_attach_type(enum bpf_attach_type attach_type)
>   	switch (attach_type) {
>   	case BPF_FLOW_DISSECTOR:
>   		return NETNS_BPF_FLOW_DISSECTOR;
> +	case BPF_SK_LOOKUP:
> +		return NETNS_BPF_SK_LOOKUP;
>   	default:
>   		return NETNS_BPF_INVALID;
>   	}
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index a18ae82a298a..a52a5688418e 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -64,6 +64,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
>   #ifdef CONFIG_INET
>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
>   	      struct sk_reuseport_md, struct sk_reuseport_kern)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
> +	      struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
>   #endif
>   #if defined(CONFIG_BPF_JIT)
>   BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 259377723603..ba4f8595fa54 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1278,4 +1278,23 @@ struct bpf_sockopt_kern {
>   	s32		retval;
>   };
>   
> +struct bpf_sk_lookup_kern {
> +	u16		family;
> +	u16		protocol;
> +	union {
> +		struct {
> +			__be32 saddr;
> +			__be32 daddr;
> +		} v4;
> +		struct {
> +			const struct in6_addr *saddr;
> +			const struct in6_addr *daddr;
> +		} v6;
> +	};
> +	__be16		sport;
> +	u16		dport;
> +	struct sock	*selected_sk;
> +	bool		no_reuseport;
> +};
> +
>   #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0cb8ec948816..8dd6e6ce5de9 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -189,6 +189,7 @@ enum bpf_prog_type {
>   	BPF_PROG_TYPE_STRUCT_OPS,
>   	BPF_PROG_TYPE_EXT,
>   	BPF_PROG_TYPE_LSM,
> +	BPF_PROG_TYPE_SK_LOOKUP,
>   };
>   
>   enum bpf_attach_type {
> @@ -226,6 +227,7 @@ enum bpf_attach_type {
>   	BPF_CGROUP_INET4_GETSOCKNAME,
>   	BPF_CGROUP_INET6_GETSOCKNAME,
>   	BPF_XDP_DEVMAP,
> +	BPF_SK_LOOKUP,
>   	__MAX_BPF_ATTACH_TYPE
>   };
>   
> @@ -3067,6 +3069,10 @@ union bpf_attr {
>    *
>    * long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
>    *	Description
> + *		Helper is overloaded depending on BPF program type. This
> + *		description applies to **BPF_PROG_TYPE_SCHED_CLS** and
> + *		**BPF_PROG_TYPE_SCHED_ACT** programs.
> + *
>    *		Assign the *sk* to the *skb*. When combined with appropriate
>    *		routing configuration to receive the packet towards the socket,
>    *		will cause *skb* to be delivered to the specified socket.
> @@ -3092,6 +3098,53 @@ union bpf_attr {
>    *		**-ESOCKTNOSUPPORT** if the socket type is not supported
>    *		(reuseport).
>    *
> + * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)

Recently, we have changed the return value from "int" to "long" if the
helper intends to return a negative error. See above:
    long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)

> + *	Description
> + *		Helper is overloaded depending on BPF program type. This
> + *		description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
> + *
> + *		Select the *sk* as a result of a socket lookup.
> + *
> + *		For the operation to succeed passed socket must be compatible
> + *		with the packet description provided by the *ctx* object.
> + *
> + *		L4 protocol (**IPPROTO_TCP** or **IPPROTO_UDP**) must
> + *		be an exact match. While IP family (**AF_INET** or
> + *		**AF_INET6**) must be compatible, that is IPv6 sockets
> + *		that are not v6-only can be selected for IPv4 packets.
> + *
> + *		Only TCP listeners and UDP unconnected sockets can be
> + *		selected.
> + *
> + *		*flags* argument can be a combination of the following values:
> + *
> + *		* **BPF_SK_LOOKUP_F_REPLACE** to override the previous
> + *		  socket selection, potentially done by a BPF program
> + *		  that ran before us.
> + *
> + *		* **BPF_SK_LOOKUP_F_NO_REUSEPORT** to skip
> + *		  load-balancing within reuseport group for the socket
> + *		  being selected.
> + *
> + *	Return
> + *		0 on success, or a negative errno in case of failure.
> + *
> + *		* **-EAFNOSUPPORT** if socket family (*sk->family*) is
> + *		  not compatible with packet family (*ctx->family*).
> + *
> + *		* **-EEXIST** if socket has been already selected,
> + *		  potentially by another program, and
> + *		  **BPF_SK_LOOKUP_F_REPLACE** flag was not specified.
> + *
> + *		* **-EINVAL** if unsupported flags were specified.
> + *
> + *		* **-EPROTOTYPE** if socket L4 protocol
> + *		  (*sk->protocol*) doesn't match packet protocol
> + *		  (*ctx->protocol*).
> + *
> + *		* **-ESOCKTNOSUPPORT** if socket is not in allowed
> + *		  state (TCP listening or UDP unconnected).
> + *
[...]
> +static bool sk_lookup_is_valid_access(int off, int size,
> +				      enum bpf_access_type type,
> +				      const struct bpf_prog *prog,
> +				      struct bpf_insn_access_aux *info)
> +{
> +	if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
> +		return false;
> +	if (off % size != 0)
> +		return false;
> +	if (type != BPF_READ)
> +		return false;
> +
> +	switch (off) {
> +	case bpf_ctx_range(struct bpf_sk_lookup, family):
> +	case bpf_ctx_range(struct bpf_sk_lookup, protocol):
> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_ip4):
> +	case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
> +	case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
> +	case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
> +	case bpf_ctx_range(struct bpf_sk_lookup, local_port):
> +		return size == sizeof(__u32);

Maybe some of the above forcing 4-byte access is too restrictive?
For example, if the user did
    __u16 remote_port = ctx->remote_port;
    __u16 local_port = ctx->local_port;
the compiler is likely to generate a 2-byte load and the verifier
will reject the program. The same goes for protocol, family, ...
Even for local_ip4, the user may just want to read one byte to
do something ...

One example: bpf_sock_addr->user_port.

We have numerous instances like this where the kernel has to be
patched to permit such accesses later.

I think for reads we should allow 1/2/4-byte accesses
whenever possible. Pointers, of course, are not allowed.
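
For reference, the pattern other program types use to permit narrow reads
looks roughly like this (a sketch of what a relaxed check could look like,
not code from this patch):

	case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
		/* Accept 1-, 2-, and 4-byte reads within the field; the
		 * verifier rewrites narrow loads based on the recorded
		 * field size.
		 */
		bpf_ctx_record_field_size(info, sizeof(__u32));
		return bpf_ctx_narrow_access_ok(off, size, sizeof(__u32));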

> +
> +	case offsetof(struct bpf_sk_lookup, sk):
> +		info->reg_type = PTR_TO_SOCKET;
> +		return size == sizeof(__u64);
> +
> +	default:
> +		return false;
> +	}
> +}
> +
> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
> +					const struct bpf_insn *si,
> +					struct bpf_insn *insn_buf,
> +					struct bpf_prog *prog,
> +					u32 *target_size)
> +{
> +	struct bpf_insn *insn = insn_buf;
> +#if IS_ENABLED(CONFIG_IPV6)
> +	int off;
> +#endif
> +
> +	switch (si->off) {
> +	case offsetof(struct bpf_sk_lookup, family):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, family) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, family));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, protocol):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, protocol) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, protocol));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, remote_ip4):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.saddr) != 4);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v4.saddr));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, local_ip4):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.daddr) != 4);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v4.daddr));
> +		break;
> +
> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
> +				remote_ip6[0], remote_ip6[3]):
> +#if IS_ENABLED(CONFIG_IPV6)
> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
> +
> +		off = si->off;
> +		off -= offsetof(struct bpf_sk_lookup, remote_ip6[0]);
> +		off += offsetof(struct in6_addr, s6_addr32[0]);
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v6.saddr));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
> +#else
> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
> +#endif
> +		break;
> +
> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
> +				local_ip6[0], local_ip6[3]):
> +#if IS_ENABLED(CONFIG_IPV6)
> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
> +
> +		off = si->off;
> +		off -= offsetof(struct bpf_sk_lookup, local_ip6[0]);
> +		off += offsetof(struct in6_addr, s6_addr32[0]);
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v6.daddr));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
> +#else
> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
> +#endif
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, remote_port):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, sport) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, sport));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, local_port):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, dport) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, dport));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, sk):
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, selected_sk));
> +		break;
> +	}
> +
> +	return insn - insn_buf;
> +}
> +
> +const struct bpf_prog_ops sk_lookup_prog_ops = {
> +};
> +
> +const struct bpf_verifier_ops sk_lookup_verifier_ops = {
> +	.get_func_proto		= sk_lookup_func_proto,
> +	.is_valid_access	= sk_lookup_is_valid_access,
> +	.convert_ctx_access	= sk_lookup_convert_ctx_access,
> +};
> +
>   #endif /* CONFIG_INET */
>   
>   DEFINE_BPF_DISPATCHER(xdp)
[...]
kernel test robot July 5, 2020, 9:20 a.m. UTC | #2
Hi Jakub,

I love your patch! Perhaps something to improve:

[auto build test WARNING on next-20200702]
[cannot apply to bpf-next/master bpf/master net/master vhost/linux-next ipvs/master net-next/master linus/master v5.8-rc3 v5.8-rc2 v5.8-rc1 v5.8-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jakub-Sitnicki/Run-a-BPF-program-on-socket-lookup/20200702-173127
base:    d37d57041350dff35dd17cbdf9aef4011acada38
config: x86_64-randconfig-s021-20200705 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-14) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-3-gfa153962-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

   net/core/filter.c:402:33: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:405:33: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:408:33: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:411:33: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:414:33: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:488:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:491:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:494:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:1382:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sock_filter const *filter @@     got struct sock_filter [noderef] __user *filter @@
   net/core/filter.c:1382:39: sparse:     expected struct sock_filter const *filter
   net/core/filter.c:1382:39: sparse:     got struct sock_filter [noderef] __user *filter
   net/core/filter.c:1460:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sock_filter const *filter @@     got struct sock_filter [noderef] __user *filter @@
   net/core/filter.c:1460:39: sparse:     expected struct sock_filter const *filter
   net/core/filter.c:1460:39: sparse:     got struct sock_filter [noderef] __user *filter
   net/core/filter.c:7044:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:7047:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:7050:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:8770:31: sparse: sparse: symbol 'sk_filter_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8777:27: sparse: sparse: symbol 'sk_filter_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8781:31: sparse: sparse: symbol 'tc_cls_act_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8789:27: sparse: sparse: symbol 'tc_cls_act_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8793:31: sparse: sparse: symbol 'xdp_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8804:31: sparse: sparse: symbol 'cg_skb_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8810:27: sparse: sparse: symbol 'cg_skb_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8814:31: sparse: sparse: symbol 'lwt_in_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8820:27: sparse: sparse: symbol 'lwt_in_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8824:31: sparse: sparse: symbol 'lwt_out_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8830:27: sparse: sparse: symbol 'lwt_out_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8834:31: sparse: sparse: symbol 'lwt_xmit_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8841:27: sparse: sparse: symbol 'lwt_xmit_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8845:31: sparse: sparse: symbol 'lwt_seg6local_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8851:27: sparse: sparse: symbol 'lwt_seg6local_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8855:31: sparse: sparse: symbol 'cg_sock_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8861:27: sparse: sparse: symbol 'cg_sock_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8864:31: sparse: sparse: symbol 'cg_sock_addr_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8870:27: sparse: sparse: symbol 'cg_sock_addr_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8873:31: sparse: sparse: symbol 'sock_ops_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8879:27: sparse: sparse: symbol 'sock_ops_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8882:31: sparse: sparse: symbol 'sk_skb_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8889:27: sparse: sparse: symbol 'sk_skb_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8892:31: sparse: sparse: symbol 'sk_msg_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8899:27: sparse: sparse: symbol 'sk_msg_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8902:31: sparse: sparse: symbol 'flow_dissector_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8908:27: sparse: sparse: symbol 'flow_dissector_prog_ops' was not declared. Should it be static?
   net/core/filter.c:9214:31: sparse: sparse: symbol 'sk_reuseport_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:9220:27: sparse: sparse: symbol 'sk_reuseport_prog_ops' was not declared. Should it be static?
>> net/core/filter.c:9399:27: sparse: sparse: symbol 'sk_lookup_prog_ops' was not declared. Should it be static?
>> net/core/filter.c:9402:31: sparse: sparse: symbol 'sk_lookup_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:217:32: sparse: sparse: cast to restricted __be16
   net/core/filter.c:217:32: sparse: sparse: cast to restricted __be16
   net/core/filter.c:217:32: sparse: sparse: cast to restricted __be16
   net/core/filter.c:217:32: sparse: sparse: cast to restricted __be16
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:1884:43: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __wsum [usertype] diff @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1884:43: sparse:     expected restricted __wsum [usertype] diff
   net/core/filter.c:1884:43: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1887:36: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __be16 [usertype] old @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1887:36: sparse:     expected restricted __be16 [usertype] old
   net/core/filter.c:1887:36: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1887:42: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be16 [usertype] new @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1887:42: sparse:     expected restricted __be16 [usertype] new
   net/core/filter.c:1887:42: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1890:36: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __be32 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1890:36: sparse:     expected restricted __be32 [usertype] from
   net/core/filter.c:1890:36: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1890:42: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be32 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1890:42: sparse:     expected restricted __be32 [usertype] to
   net/core/filter.c:1890:42: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1935:59: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __wsum [usertype] diff @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1935:59: sparse:     expected restricted __wsum [usertype] diff
   net/core/filter.c:1935:59: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1938:52: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be16 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1938:52: sparse:     expected restricted __be16 [usertype] from
   net/core/filter.c:1938:52: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1938:58: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __be16 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1938:58: sparse:     expected restricted __be16 [usertype] to
   net/core/filter.c:1938:58: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1941:52: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be32 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1941:52: sparse:     expected restricted __be32 [usertype] from
   net/core/filter.c:1941:52: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1941:58: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __be32 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1941:58: sparse:     expected restricted __be32 [usertype] to
   net/core/filter.c:1941:58: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1987:28: sparse: sparse: incorrect type in return expression (different base types) @@     expected unsigned long long @@     got restricted __wsum @@
   net/core/filter.c:1987:28: sparse:     expected unsigned long long
   net/core/filter.c:1987:28: sparse:     got restricted __wsum
   net/core/filter.c:2009:35: sparse: sparse: incorrect type in return expression (different base types) @@     expected unsigned long long @@     got restricted __wsum [usertype] csum @@
   net/core/filter.c:2009:35: sparse:     expected unsigned long long
   net/core/filter.c:2009:35: sparse:     got restricted __wsum [usertype] csum
   net/core/filter.c:4730:17: sparse: sparse: incorrect type in assignment (different base types) @@     expected unsigned int [usertype] spi @@     got restricted __be32 const [usertype] spi @@
   net/core/filter.c:4730:17: sparse:     expected unsigned int [usertype] spi
   net/core/filter.c:4730:17: sparse:     got restricted __be32 const [usertype] spi
   net/core/filter.c:4738:33: sparse: sparse: incorrect type in assignment (different base types) @@     expected unsigned int [usertype] remote_ipv4 @@     got restricted __be32 const [usertype] a4 @@
   net/core/filter.c:4738:33: sparse:     expected unsigned int [usertype] remote_ipv4
   net/core/filter.c:4738:33: sparse:     got restricted __be32 const [usertype] a4

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Jakub Sitnicki July 6, 2020, 11:44 a.m. UTC | #3
On Sat, Jul 04, 2020 at 08:42 PM CEST, Yonghong Song wrote:
> On 7/2/20 2:24 AM, Jakub Sitnicki wrote:
>> Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
>> BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
>> when looking up a listening socket for a new connection request for
>> connection oriented protocols, or when looking up an unconnected socket for
>> a packet for connection-less protocols.
>>
>> When called, SK_LOOKUP BPF program can select a socket that will receive
>> the packet. This serves as a mechanism to overcome the limits of what
>> bind() API allows to express. Two use-cases driving this work are:
>>
>>   (1) steer packets destined to an IP range, on fixed port to a socket
>>
>>       192.0.2.0/24, port 80 -> NGINX socket
>>
>>   (2) steer packets destined to an IP address, on any port to a socket
>>
>>       198.51.100.1, any port -> L7 proxy socket
>>
>> In its run-time context program receives information about the packet that
>> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
>> address 4-tuple. Context can be further extended to include ingress
>> interface identifier.
>>
>> To select a socket BPF program fetches it from a map holding socket
>> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
>> helper to record the selection. Transport layer then uses the selected
>> socket as a result of socket lookup.
>>
>> This patch only enables the user to attach an SK_LOOKUP program to a
>> network namespace. Subsequent patches hook it up to run on local delivery
>> path in ipv4 and ipv6 stacks.
>>
>> Suggested-by: Marek Majkowski <marek@cloudflare.com>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>
>> Notes:
>>      v3:
>>      - Allow bpf_sk_assign helper to replace previously selected socket only
>>        when BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution for multiple
>>        programs running in series to accidentally override each other's verdict.
>>      - Let BPF program decide that load-balancing within a reuseport socket group
>>        should be skipped for the socket selected with bpf_sk_assign() by passing
>>        BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
>>      - Extend struct bpf_sk_lookup program context with an 'sk' field containing
>>        the selected socket with an intention for multiple attached program
>>        running in series to see each other's choices. However, currently the
>>        verifier doesn't allow checking if pointer is set.
>>      - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
>>      - Get rid of macros in convert_ctx_access to make it easier to read.
>>      - Disallow 1-,2-byte access to context fields containing IP addresses.
>>
>>      v2:
>>      - Make bpf_sk_assign reject sockets that don't use RCU freeing.
>>        Update bpf_sk_assign docs accordingly. (Martin)
>>      - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
>>      - Fix broken build when CONFIG_INET is not selected. (Martin)
>>      - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
>>      - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)
>>
>>   include/linux/bpf-netns.h  |   3 +
>>   include/linux/bpf_types.h  |   2 +
>>   include/linux/filter.h     |  19 ++++
>>   include/uapi/linux/bpf.h   |  74 +++++++++++++++
>>   kernel/bpf/net_namespace.c |   5 +
>>   kernel/bpf/syscall.c       |   9 ++
>>   net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
>>   scripts/bpf_helpers_doc.py |   9 +-
>>   8 files changed, 306 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf-netns.h b/include/linux/bpf-netns.h
>> index 4052d649f36d..cb1d849c5d4f 100644
>> --- a/include/linux/bpf-netns.h
>> +++ b/include/linux/bpf-netns.h
>> @@ -8,6 +8,7 @@
>>   enum netns_bpf_attach_type {
>>   	NETNS_BPF_INVALID = -1,
>>   	NETNS_BPF_FLOW_DISSECTOR = 0,
>> +	NETNS_BPF_SK_LOOKUP,
>>   	MAX_NETNS_BPF_ATTACH_TYPE
>>   };
>>
>> @@ -17,6 +18,8 @@ to_netns_bpf_attach_type(enum bpf_attach_type attach_type)
>>   	switch (attach_type) {
>>   	case BPF_FLOW_DISSECTOR:
>>   		return NETNS_BPF_FLOW_DISSECTOR;
>> +	case BPF_SK_LOOKUP:
>> +		return NETNS_BPF_SK_LOOKUP;
>>   	default:
>>   		return NETNS_BPF_INVALID;
>>   	}
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index a18ae82a298a..a52a5688418e 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -64,6 +64,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
>>   #ifdef CONFIG_INET
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
>>   	      struct sk_reuseport_md, struct sk_reuseport_kern)
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
>> +	      struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
>>   #endif
>>   #if defined(CONFIG_BPF_JIT)
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 259377723603..ba4f8595fa54 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1278,4 +1278,23 @@ struct bpf_sockopt_kern {
>>   	s32		retval;
>>   };
>>
>> +struct bpf_sk_lookup_kern {
>> +	u16		family;
>> +	u16		protocol;
>> +	union {
>> +		struct {
>> +			__be32 saddr;
>> +			__be32 daddr;
>> +		} v4;
>> +		struct {
>> +			const struct in6_addr *saddr;
>> +			const struct in6_addr *daddr;
>> +		} v6;
>> +	};
>> +	__be16		sport;
>> +	u16		dport;
>> +	struct sock	*selected_sk;
>> +	bool		no_reuseport;
>> +};
>> +
>>   #endif /* __LINUX_FILTER_H__ */
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 0cb8ec948816..8dd6e6ce5de9 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -189,6 +189,7 @@ enum bpf_prog_type {
>>   	BPF_PROG_TYPE_STRUCT_OPS,
>>   	BPF_PROG_TYPE_EXT,
>>   	BPF_PROG_TYPE_LSM,
>> +	BPF_PROG_TYPE_SK_LOOKUP,
>>   };
>>
>>   enum bpf_attach_type {
>> @@ -226,6 +227,7 @@ enum bpf_attach_type {
>>   	BPF_CGROUP_INET4_GETSOCKNAME,
>>   	BPF_CGROUP_INET6_GETSOCKNAME,
>>   	BPF_XDP_DEVMAP,
>> +	BPF_SK_LOOKUP,
>>   	__MAX_BPF_ATTACH_TYPE
>>   };
>>
>> @@ -3067,6 +3069,10 @@ union bpf_attr {
>>    *
>>    * long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
>>    *	Description
>> + *		Helper is overloaded depending on BPF program type. This
>> + *		description applies to **BPF_PROG_TYPE_SCHED_CLS** and
>> + *		**BPF_PROG_TYPE_SCHED_ACT** programs.
>> + *
>>    *		Assign the *sk* to the *skb*. When combined with appropriate
>>    *		routing configuration to receive the packet towards the socket,
>>    *		will cause *skb* to be delivered to the specified socket.
>> @@ -3092,6 +3098,53 @@ union bpf_attr {
>>    *		**-ESOCKTNOSUPPORT** if the socket type is not supported
>>    *		(reuseport).
>>    *
>> + * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
>
> recently, we have changed return value from "int" to "long" if the helper
> intends to return a negative error. See above
>    long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)

Thanks. I missed that one. Will fix in v4.

>
>> + *	Description
>> + *		Helper is overloaded depending on BPF program type. This
>> + *		description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
>> + *
>> + *		Select the *sk* as a result of a socket lookup.
>> + *
>> + *		For the operation to succeed passed socket must be compatible
>> + *		with the packet description provided by the *ctx* object.
>> + *
>> + *		L4 protocol (**IPPROTO_TCP** or **IPPROTO_UDP**) must
>> + *		be an exact match. While IP family (**AF_INET** or
>> + *		**AF_INET6**) must be compatible, that is IPv6 sockets
>> + *		that are not v6-only can be selected for IPv4 packets.
>> + *
>> + *		Only TCP listeners and UDP unconnected sockets can be
>> + *		selected.
>> + *
>> + *		*flags* argument can be a combination of the following values:
>> + *
>> + *		* **BPF_SK_LOOKUP_F_REPLACE** to override the previous
>> + *		  socket selection, potentially done by a BPF program
>> + *		  that ran before us.
>> + *
>> + *		* **BPF_SK_LOOKUP_F_NO_REUSEPORT** to skip
>> + *		  load-balancing within reuseport group for the socket
>> + *		  being selected.
>> + *
>> + *	Return
>> + *		0 on success, or a negative errno in case of failure.
>> + *
>> + *		* **-EAFNOSUPPORT** if socket family (*sk->family*) is
>> + *		  not compatible with packet family (*ctx->family*).
>> + *
>> + *		* **-EEXIST** if socket has been already selected,
>> + *		  potentially by another program, and
>> + *		  **BPF_SK_LOOKUP_F_REPLACE** flag was not specified.
>> + *
>> + *		* **-EINVAL** if unsupported flags were specified.
>> + *
>> + *		* **-EPROTOTYPE** if socket L4 protocol
>> + *		  (*sk->protocol*) doesn't match packet protocol
>> + *		  (*ctx->protocol*).
>> + *
>> + *		* **-ESOCKTNOSUPPORT** if socket is not in allowed
>> + *		  state (TCP listening or UDP unconnected).
>> + *
> [...]
>> +static bool sk_lookup_is_valid_access(int off, int size,
>> +				      enum bpf_access_type type,
>> +				      const struct bpf_prog *prog,
>> +				      struct bpf_insn_access_aux *info)
>> +{
>> +	if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
>> +		return false;
>> +	if (off % size != 0)
>> +		return false;
>> +	if (type != BPF_READ)
>> +		return false;
>> +
>> +	switch (off) {
>> +	case bpf_ctx_range(struct bpf_sk_lookup, family):
>> +	case bpf_ctx_range(struct bpf_sk_lookup, protocol):
>> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_ip4):
>> +	case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
>> +	case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
>> +	case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
>> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
>> +	case bpf_ctx_range(struct bpf_sk_lookup, local_port):
>> +		return size == sizeof(__u32);
>
> Maybe some of the above forcing 4-byte access too restrictive?
> For example, if user did
>    __u16 remote_port = ctx->remote_port;
>    __u16 local_port = ctx->local_port;
> compiler is likely to generate a 2-byte load and the verifier
> will reject the program. The same for protocol, family, ...
> Even for local_ip4, user may just want to read one byte to
> do something ...
>
> One example, bpf_sock_addr->user_port.
>
> We have numerous instances like this and kernel has to be
> patched to permit it later.
>
> I think for read we should allow 1/2/4 byte accesses
> whenever possible. pointer of course not allowed.

You have a point. I've tried to keep it simple, but did not consider
that this creates a pain point for users and can lead to fights with
the compiler.

Will revert to having 1,2,4-byte reads in v4.

Thanks for comments.

>
>> +
>> +	case offsetof(struct bpf_sk_lookup, sk):
>> +		info->reg_type = PTR_TO_SOCKET;
>> +		return size == sizeof(__u64);
>> +
>> +	default:
>> +		return false;
>> +	}
>> +}
>> +
>> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
>> +					const struct bpf_insn *si,
>> +					struct bpf_insn *insn_buf,
>> +					struct bpf_prog *prog,
>> +					u32 *target_size)
>> +{
>> +	struct bpf_insn *insn = insn_buf;
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +	int off;
>> +#endif
>> +
>> +	switch (si->off) {
>> +	case offsetof(struct bpf_sk_lookup, family):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, family) != 2);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, family));
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, protocol):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, protocol) != 2);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, protocol));
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, remote_ip4):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.saddr) != 4);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, v4.saddr));
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, local_ip4):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.daddr) != 4);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, v4.daddr));
>> +		break;
>> +
>> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
>> +				remote_ip6[0], remote_ip6[3]):
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
>> +
>> +		off = si->off;
>> +		off -= offsetof(struct bpf_sk_lookup, remote_ip6[0]);
>> +		off += offsetof(struct in6_addr, s6_addr32[0]);
>> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, v6.saddr));
>> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
>> +#else
>> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
>> +#endif
>> +		break;
>> +
>> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
>> +				local_ip6[0], local_ip6[3]):
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
>> +
>> +		off = si->off;
>> +		off -= offsetof(struct bpf_sk_lookup, local_ip6[0]);
>> +		off += offsetof(struct in6_addr, s6_addr32[0]);
>> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, v6.daddr));
>> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
>> +#else
>> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
>> +#endif
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, remote_port):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, sport) != 2);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, sport));
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, local_port):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, dport) != 2);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, dport));
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, sk):
>> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, selected_sk));
>> +		break;
>> +	}
>> +
>> +	return insn - insn_buf;
>> +}
>> +
>> +const struct bpf_prog_ops sk_lookup_prog_ops = {
>> +};
>> +
>> +const struct bpf_verifier_ops sk_lookup_verifier_ops = {
>> +	.get_func_proto		= sk_lookup_func_proto,
>> +	.is_valid_access	= sk_lookup_is_valid_access,
>> +	.convert_ctx_access	= sk_lookup_convert_ctx_access,
>> +};
>> +
>>   #endif /* CONFIG_INET */
>>
>>   DEFINE_BPF_DISPATCHER(xdp)
> [...]
Jakub Sitnicki July 7, 2020, 9:21 a.m. UTC | #4
On Thu, Jul 02, 2020 at 11:24 AM CEST, Jakub Sitnicki wrote:
> Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
> BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
> when looking up a listening socket for a new connection request for
> connection oriented protocols, or when looking up an unconnected socket for
> a packet for connection-less protocols.
>
> When called, SK_LOOKUP BPF program can select a socket that will receive
> the packet. This serves as a mechanism to overcome the limits of what
> bind() API allows to express. Two use-cases driving this work are:
>
>  (1) steer packets destined to an IP range, on fixed port to a socket
>
>      192.0.2.0/24, port 80 -> NGINX socket
>
>  (2) steer packets destined to an IP address, on any port to a socket
>
>      198.51.100.1, any port -> L7 proxy socket
>
> In its run-time context program receives information about the packet that
> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> address 4-tuple. Context can be further extended to include ingress
> interface identifier.
>
> To select a socket BPF program fetches it from a map holding socket
> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
> helper to record the selection. Transport layer then uses the selected
> socket as a result of socket lookup.
>
> This patch only enables the user to attach an SK_LOOKUP program to a
> network namespace. Subsequent patches hook it up to run on local delivery
> path in ipv4 and ipv6 stacks.
>
> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>
> Notes:
>     v3:
>     - Allow bpf_sk_assign helper to replace previously selected socket only
>       when BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution for multiple
>       programs running in series to accidentally override each other's verdict.
>     - Let BPF program decide that load-balancing within a reuseport socket group
>       should be skipped for the socket selected with bpf_sk_assign() by passing
>       BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
>     - Extend struct bpf_sk_lookup program context with an 'sk' field containing
>       the selected socket with an intention for multiple attached program
>       running in series to see each other's choices. However, currently the
>       verifier doesn't allow checking if pointer is set.
>     - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
>     - Get rid of macros in convert_ctx_access to make it easier to read.
>     - Disallow 1-,2-byte access to context fields containing IP addresses.
>
>     v2:
>     - Make bpf_sk_assign reject sockets that don't use RCU freeing.
>       Update bpf_sk_assign docs accordingly. (Martin)
>     - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
>     - Fix broken build when CONFIG_INET is not selected. (Martin)
>     - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
>     - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)
>
>  include/linux/bpf-netns.h  |   3 +
>  include/linux/bpf_types.h  |   2 +
>  include/linux/filter.h     |  19 ++++
>  include/uapi/linux/bpf.h   |  74 +++++++++++++++
>  kernel/bpf/net_namespace.c |   5 +
>  kernel/bpf/syscall.c       |   9 ++
>  net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
>  scripts/bpf_helpers_doc.py |   9 +-
>  8 files changed, 306 insertions(+), 1 deletion(-)
>

[...]

> diff --git a/net/core/filter.c b/net/core/filter.c
> index c796e141ea8e..286f90e0c824 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -9219,6 +9219,192 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
>
>  const struct bpf_prog_ops sk_reuseport_prog_ops = {
>  };
> +
> +BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
> +	   struct sock *, sk, u64, flags)
> +{
> +	if (unlikely(flags & ~(BPF_SK_LOOKUP_F_REPLACE |
> +			       BPF_SK_LOOKUP_F_NO_REUSEPORT)))
> +		return -EINVAL;
> +	if (unlikely(sk_is_refcounted(sk)))
> +		return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
> +	if (unlikely(sk->sk_state == TCP_ESTABLISHED))
> +		return -ESOCKTNOSUPPORT; /* reject connected sockets */
> +
> +	/* Check if socket is suitable for packet L3/L4 protocol */
> +	if (sk->sk_protocol != ctx->protocol)
> +		return -EPROTOTYPE;
> +	if (sk->sk_family != ctx->family &&
> +	    (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
> +		return -EAFNOSUPPORT;
> +
> +	if (ctx->selected_sk && !(flags & BPF_SK_LOOKUP_F_REPLACE))
> +		return -EEXIST;
> +
> +	/* Select socket as lookup result */
> +	ctx->selected_sk = sk;
> +	ctx->no_reuseport = flags & BPF_SK_LOOKUP_F_NO_REUSEPORT;
> +	return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_sk_lookup_assign_proto = {
> +	.func		= bpf_sk_lookup_assign,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +	.arg2_type	= ARG_PTR_TO_SOCKET,
> +	.arg3_type	= ARG_ANYTHING,
> +};
> +
> +static const struct bpf_func_proto *
> +sk_lookup_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	switch (func_id) {
> +	case BPF_FUNC_sk_assign:
> +		return &bpf_sk_lookup_assign_proto;
> +	case BPF_FUNC_sk_release:
> +		return &bpf_sk_release_proto;
> +	default:
> +		return bpf_base_func_proto(func_id);
> +	}
> +}
> +
> +static bool sk_lookup_is_valid_access(int off, int size,
> +				      enum bpf_access_type type,
> +				      const struct bpf_prog *prog,
> +				      struct bpf_insn_access_aux *info)
> +{
> +	if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
> +		return false;
> +	if (off % size != 0)
> +		return false;
> +	if (type != BPF_READ)
> +		return false;
> +
> +	switch (off) {
> +	case bpf_ctx_range(struct bpf_sk_lookup, family):
> +	case bpf_ctx_range(struct bpf_sk_lookup, protocol):
> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_ip4):
> +	case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
> +	case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
> +	case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
> +	case bpf_ctx_range(struct bpf_sk_lookup, local_port):
> +		return size == sizeof(__u32);
> +
> +	case offsetof(struct bpf_sk_lookup, sk):
> +		info->reg_type = PTR_TO_SOCKET;

There's a bug here: the bpf_sk_lookup 'sk' field is initially NULL, so
reg_type should be PTR_TO_SOCKET_OR_NULL to inform the verifier.
Will fix in v4.
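
The fix would be along these lines (a sketch, assuming only the reg_type
needs to change):

	case offsetof(struct bpf_sk_lookup, sk):
		info->reg_type = PTR_TO_SOCKET_OR_NULL;
		return size == sizeof(__u64);

Programs reading ctx->sk then have to NULL-check the pointer before
dereferencing it, which is what the OR_NULL register type enforces.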

> +		return size == sizeof(__u64);
> +
> +	default:
> +		return false;
> +	}
> +}
> +
> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
> +					const struct bpf_insn *si,
> +					struct bpf_insn *insn_buf,
> +					struct bpf_prog *prog,
> +					u32 *target_size)
> +{
> +	struct bpf_insn *insn = insn_buf;
> +#if IS_ENABLED(CONFIG_IPV6)
> +	int off;
> +#endif
> +
> +	switch (si->off) {
> +	case offsetof(struct bpf_sk_lookup, family):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, family) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, family));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, protocol):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, protocol) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, protocol));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, remote_ip4):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.saddr) != 4);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v4.saddr));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, local_ip4):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.daddr) != 4);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v4.daddr));
> +		break;
> +
> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
> +				remote_ip6[0], remote_ip6[3]):
> +#if IS_ENABLED(CONFIG_IPV6)
> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
> +
> +		off = si->off;
> +		off -= offsetof(struct bpf_sk_lookup, remote_ip6[0]);
> +		off += offsetof(struct in6_addr, s6_addr32[0]);
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v6.saddr));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
> +#else
> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
> +#endif
> +		break;
> +
> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
> +				local_ip6[0], local_ip6[3]):
> +#if IS_ENABLED(CONFIG_IPV6)
> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
> +
> +		off = si->off;
> +		off -= offsetof(struct bpf_sk_lookup, local_ip6[0]);
> +		off += offsetof(struct in6_addr, s6_addr32[0]);
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v6.daddr));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
> +#else
> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
> +#endif
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, remote_port):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, sport) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, sport));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, local_port):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, dport) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, dport));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, sk):
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, selected_sk));
> +		break;
> +	}
> +
> +	return insn - insn_buf;
> +}
> +
> +const struct bpf_prog_ops sk_lookup_prog_ops = {
> +};
> +
> +const struct bpf_verifier_ops sk_lookup_verifier_ops = {
> +	.get_func_proto		= sk_lookup_func_proto,
> +	.is_valid_access	= sk_lookup_is_valid_access,
> +	.convert_ctx_access	= sk_lookup_convert_ctx_access,
> +};
> +
>  #endif /* CONFIG_INET */
>
>  DEFINE_BPF_DISPATCHER(xdp)

[...]
Andrii Nakryiko July 9, 2020, 4:08 a.m. UTC | #5
On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> [...]
>
> diff --git a/include/linux/bpf-netns.h b/include/linux/bpf-netns.h
> index 4052d649f36d..cb1d849c5d4f 100644
> --- a/include/linux/bpf-netns.h
> +++ b/include/linux/bpf-netns.h
> @@ -8,6 +8,7 @@
>  enum netns_bpf_attach_type {
>         NETNS_BPF_INVALID = -1,
>         NETNS_BPF_FLOW_DISSECTOR = 0,
> +       NETNS_BPF_SK_LOOKUP,
>         MAX_NETNS_BPF_ATTACH_TYPE
>  };
>

[...]

> +struct bpf_sk_lookup_kern {
> +       u16             family;
> +       u16             protocol;
> +       union {
> +               struct {
> +                       __be32 saddr;
> +                       __be32 daddr;
> +               } v4;
> +               struct {
> +                       const struct in6_addr *saddr;
> +                       const struct in6_addr *daddr;
> +               } v6;
> +       };
> +       __be16          sport;
> +       u16             dport;
> +       struct sock     *selected_sk;
> +       bool            no_reuseport;
> +};
> +
>  #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0cb8ec948816..8dd6e6ce5de9 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -189,6 +189,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_STRUCT_OPS,
>         BPF_PROG_TYPE_EXT,
>         BPF_PROG_TYPE_LSM,
> +       BPF_PROG_TYPE_SK_LOOKUP,
>  };
>
>  enum bpf_attach_type {
> @@ -226,6 +227,7 @@ enum bpf_attach_type {
>         BPF_CGROUP_INET4_GETSOCKNAME,
>         BPF_CGROUP_INET6_GETSOCKNAME,
>         BPF_XDP_DEVMAP,
> +       BPF_SK_LOOKUP,


This point is not specific to your changes, but I've been meaning to bring
it up for a while now, so this seemed as good an opportunity as any.

It seems like enum bpf_attach_type was originally intended only for
cgroup BPF programs. To that end, cgroup_bpf has a bunch of fields
with sizes proportional to MAX_BPF_ATTACH_TYPE. That costs at least
8+4+16=28 bytes for each different type, *per cgroup*. At this point,
we have 22 cgroup-specific attach types, and this will be the 13th
non-cgroup attach type. So cgroups pay a price each time we extend
bpf_attach_type with a new non-cgroup attach type: with 12 such types
already, cgroup_bpf is now 12 * 28 = 336 bytes bigger than it needs to be.

So I wanted to propose that we do the same thing for cgroup_bpf as you
did for net_ns with netns_bpf_attach_type: have a densely packed enum
just for cgroup attach types and translate the now-generic
bpf_attach_type into a cgroup-specific cgroup_bpf_attach_type.

I wonder what people think? Is that a good idea? Is anyone up for doing this?
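
A minimal sketch of that translation, modeled on the to_netns_bpf_attach_type()
helper this patch extends (the cgroup-side names below are hypothetical,
chosen only to illustrate the shape):

/* Densely packed, cgroup-only attach types; the cgroup_bpf arrays would
 * be sized by MAX_CGROUP_BPF_ATTACH_TYPE instead of MAX_BPF_ATTACH_TYPE.
 */
enum cgroup_bpf_attach_type {
	CGROUP_BPF_INVALID = -1,
	CGROUP_INET_INGRESS = 0,
	CGROUP_INET_EGRESS,
	/* ... remaining cgroup attach types ... */
	MAX_CGROUP_BPF_ATTACH_TYPE
};

static inline enum cgroup_bpf_attach_type
to_cgroup_bpf_attach_type(enum bpf_attach_type attach_type)
{
	switch (attach_type) {
	case BPF_CGROUP_INET_INGRESS:
		return CGROUP_INET_INGRESS;
	case BPF_CGROUP_INET_EGRESS:
		return CGROUP_INET_EGRESS;
	/* ... one case per cgroup attach type ... */
	default:
		return CGROUP_BPF_INVALID;
	}
}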

>         __MAX_BPF_ATTACH_TYPE
>  };
>

[...]

> +
> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
> +                                       const struct bpf_insn *si,
> +                                       struct bpf_insn *insn_buf,
> +                                       struct bpf_prog *prog,
> +                                       u32 *target_size)

Would it be too extreme to rely on BTF and direct memory access
(similar to tp_raw, fentry/fexit, etc.) for accessing context fields,
instead of all these assembly rewrites? So instead of having
bpf_sk_lookup and bpf_sk_lookup_kern, there would always be one full
variant (bpf_sk_lookup_kern, or however we'd want to name it then),
and the verifier would just ensure that direct memory reads fall on
the right field boundaries?
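
For illustration, a rough sketch of what a program might look like under
that model (purely hypothetical: nothing in this series exposes
bpf_sk_lookup_kern to programs, and the verdict codes land in later
patches):

#include "vmlinux.h"		/* kernel types straight from BTF */
#include <bpf/bpf_helpers.h>

/* Hypothetical BTF-based variant: the program is handed the kernel-side
 * context (type taken from vmlinux.h) and reads its fields directly,
 * the way fentry/fexit programs read kernel structs; the verifier would
 * check each load against BTF field boundaries instead of rewriting it
 * in convert_ctx_access().
 */
SEC("sk_lookup")
int direct_ctx_read(struct bpf_sk_lookup_kern *ctx)
{
	if (ctx->protocol == IPPROTO_UDP && ctx->dport == 53)
		return 0; /* drop, say; verdict codes come later in the series */
	return 1;
}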

> +{
> +       struct bpf_insn *insn = insn_buf;
> +#if IS_ENABLED(CONFIG_IPV6)
> +       int off;
> +#endif
> +

[...]
Jakub Sitnicki July 9, 2020, 1:25 p.m. UTC | #6
On Thu, Jul 09, 2020 at 06:08 AM CEST, Andrii Nakryiko wrote:
> On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> [...]

>> +
>> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
>> +                                       const struct bpf_insn *si,
>> +                                       struct bpf_insn *insn_buf,
>> +                                       struct bpf_prog *prog,
>> +                                       u32 *target_size)
>
> Would it be too extreme to rely on BTF and direct memory access
> (similar to tp_raw, fentry/fexit, etc.) for accessing context fields,
> instead of all these assembly rewrites? So instead of having
> bpf_sk_lookup and bpf_sk_lookup_kern, there would always be one full
> variant (bpf_sk_lookup_kern, or however we'd want to name it then),
> and the verifier would just ensure that direct memory reads fall on
> the right field boundaries?

Sounds like a decision related to the long-term vision. I'd appreciate
input from the maintainers on whether this is the direction we want to
go in.

From the implementation PoV it's hard for me to say what would be
needed to get it working; I'm not familiar with how BPF_TRACE_* attach
types provide access to the context, so I'd need to look around and
prototype it first. (Actually, I'm not sure if you're asking whether
it is doable or you already know?)

Off the top of my head, I have one concern: I'm exposing the selected
socket in the context. This is for the benefit of one program being
aware of another program's selection, if multiple programs are attached.

I understand that any piece of data reachable from struct sock * would
be readable by an SK_LOOKUP prog (writes can be blocked in
is_valid_access), and that this is a desired property for tracing. I'm
not sure how to limit it for a network program that doesn't need all
that info.

>
>> +{
>> +       struct bpf_insn *insn = insn_buf;
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +       int off;
>> +#endif
>> +
>
> [...]
Andrii Nakryiko July 9, 2020, 11:09 p.m. UTC | #7
On Thu, Jul 9, 2020 at 6:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Thu, Jul 09, 2020 at 06:08 AM CEST, Andrii Nakryiko wrote:
> > On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> [...]
>
> >> +
> >> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
> >> +                                       const struct bpf_insn *si,
> >> +                                       struct bpf_insn *insn_buf,
> >> +                                       struct bpf_prog *prog,
> >> +                                       u32 *target_size)
> >
> > Would it be too extreme to rely on BTF and direct memory access
> > (similar to tp_raw, fentry/fexit, etc.) for accessing context fields,
> > instead of all these assembly rewrites? So instead of having
> > bpf_sk_lookup and bpf_sk_lookup_kern, there would always be one full
> > variant (bpf_sk_lookup_kern, or however we'd want to name it then),
> > and the verifier would just ensure that direct memory reads fall on
> > the right field boundaries?
>
> Sounds like a decision related to the long-term vision. I'd appreciate
> input from the maintainers on whether this is the direction we want to
> go in.
>
> From the implementation PoV it's hard for me to say what would be
> needed to get it working; I'm not familiar with how BPF_TRACE_* attach
> types provide access to the context, so I'd need to look around and
> prototype it first. (Actually, I'm not sure if you're asking whether
> it is doable or you already know?)

I'm pretty sure it's doable with what we have in the verifier, but I'm
not sure about all the details and the amount of work. So consider this
the start of a medium-term discussion. I was also curious to hear an
opinion from Alexei and Daniel on whether that would be the right way
to do this moving forward (not necessarily with your changes, though).

>
> Off the top of my head, I have one concern: I'm exposing the selected
> socket in the context. This is for the benefit of one program being
> aware of another program's selection, if multiple programs are attached.
>
> I understand that any piece of data reachable from struct sock * would
> be readable by an SK_LOOKUP prog (writes can be blocked in
> is_valid_access), and that this is a desired property for tracing. I'm
> not sure how to limit it for a network program that doesn't need all
> that info.
>
> >
> >> +{
> >> +       struct bpf_insn *insn = insn_buf;
> >> +#if IS_ENABLED(CONFIG_IPV6)
> >> +       int off;
> >> +#endif
> >> +
> >
> > [...]
Jakub Sitnicki July 10, 2020, 8:55 a.m. UTC | #8
On Fri, Jul 10, 2020 at 01:09 AM CEST, Andrii Nakryiko wrote:
> On Thu, Jul 9, 2020 at 6:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> On Thu, Jul 09, 2020 at 06:08 AM CEST, Andrii Nakryiko wrote:
>> > On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >>
>> >> [...]
>>
>> >> +
>> >> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
>> >> +                                       const struct bpf_insn *si,
>> >> +                                       struct bpf_insn *insn_buf,
>> >> +                                       struct bpf_prog *prog,
>> >> +                                       u32 *target_size)
>> >
>> > Would it be too extreme to rely on BTF and direct memory access
>> > (similar to tp_raw, fentry/fexit, etc.) for accessing context fields,
>> > instead of all these assembly rewrites? So instead of having
>> > bpf_sk_lookup and bpf_sk_lookup_kern, there would always be one full
>> > variant (bpf_sk_lookup_kern, or however we'd want to name it then),
>> > and the verifier would just ensure that direct memory reads fall on
>> > the right field boundaries?
>>
>> Sounds like a decision related to the long-term vision. I'd appreciate
>> input from the maintainers on whether this is the direction we want to
>> go in.
>>
>> From the implementation PoV it's hard for me to say what would be
>> needed to get it working; I'm not familiar with how BPF_TRACE_* attach
>> types provide access to the context, so I'd need to look around and
>> prototype it first. (Actually, I'm not sure if you're asking whether
>> it is doable or you already know?)
>
> I'm pretty sure it's doable with what we have in the verifier, but I'm
> not sure about all the details and the amount of work. So consider this
> the start of a medium-term discussion. I was also curious to hear an
> opinion from Alexei and Daniel on whether that would be the right way
> to do this moving forward (not necessarily with your changes, though).

From my side I can vouch that convert_ctx_access is not easy to get
right (at least for me) when the backing structure is non-trivial,
e.g. has pointers or unions.

v4 will contain two fixes exactly in this area. I also have a patch for
how the verifier handles narrow loads when load size <= target field
size < ctx field size.

That is to say, any alternative approach that "automates" this would be
very welcome.
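
To make that size relation concrete with fields from this very patch (a
hypothetical reproducer, not code from the series): remote_port occupies
4 bytes in struct bpf_sk_lookup but is backed by the 2-byte sport member
of bpf_sk_lookup_kern, so a 2-byte read is exactly load size (2) <=
target field size (2) < ctx field size (4):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("sk_lookup")
int narrow_port_read(struct bpf_sk_lookup *ctx)
{
	/* A 2-byte load from the 4-byte remote_port ctx field, which
	 * convert_ctx_access() maps onto the 2-byte sport member of
	 * bpf_sk_lookup_kern. As posted, v3 rejects any such narrow read
	 * (sk_lookup_is_valid_access demands size == sizeof(__u32));
	 * handling it is what the mentioned verifier patch is about.
	 */
	__u16 port = *(__u16 *)&ctx->remote_port;

	return port == bpf_htons(80) ? 1 /* SK_PASS */ : 0 /* SK_DROP */;
}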

I've accumulated quite a few changes already since v3, so I was planning
to roll out v4 to keep things moving while we continue the discussion.

>
>>
>> Off the top of my head, I have one concern: I'm exposing the selected
>> socket in the context. This is for the benefit of one program being
>> aware of another program's selection, if multiple programs are attached.
>>
>> I understand that any piece of data reachable from struct sock * would
>> be readable by an SK_LOOKUP prog (writes can be blocked in
>> is_valid_access), and that this is a desired property for tracing. I'm
>> not sure how to limit it for a network program that doesn't need all
>> that info.
>>
>> >
>> >> +{
>> >> +       struct bpf_insn *insn = insn_buf;
>> >> +#if IS_ENABLED(CONFIG_IPV6)
>> >> +       int off;
>> >> +#endif
>> >> +
>> >
>> > [...]
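
The intended end-to-end flow of the new hook is easiest to see in a small
program. A sketch of the fixed-port steering use case against this patch's
UAPI (hypothetical: it assumes a SOCKMAP populated from user space, socket
lookups from SK_LOOKUP programs, and the SK_PASS/SK_DROP verdict handling
that later patches in the series wire up):

#include <linux/bpf.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SOCKMAP);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} dst_sock SEC(".maps");

SEC("sk_lookup")
int steer_port_80(struct bpf_sk_lookup *ctx)
{
	const __u32 zero = 0;
	struct bpf_sock *sk;
	long err;

	/* Steer TCP packets arriving on local port 80 to the one socket
	 * stored in the map, whatever the destination address; local_port
	 * is in host byte order per the UAPI comment.
	 */
	if (ctx->protocol != IPPROTO_TCP || ctx->local_port != 80)
		return 1; /* SK_PASS: fall back to regular lookup */

	sk = bpf_map_lookup_elem(&dst_sock, &zero);
	if (!sk)
		return 1;

	err = bpf_sk_assign(ctx, sk, 0);
	bpf_sk_release(sk); /* drop the reference from the map lookup */
	return err ? 0 /* SK_DROP */ : 1 /* SK_PASS */;
}
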
diff mbox series

Patch

diff --git a/include/linux/bpf-netns.h b/include/linux/bpf-netns.h
index 4052d649f36d..cb1d849c5d4f 100644
--- a/include/linux/bpf-netns.h
+++ b/include/linux/bpf-netns.h
@@ -8,6 +8,7 @@ 
 enum netns_bpf_attach_type {
 	NETNS_BPF_INVALID = -1,
 	NETNS_BPF_FLOW_DISSECTOR = 0,
+	NETNS_BPF_SK_LOOKUP,
 	MAX_NETNS_BPF_ATTACH_TYPE
 };
 
@@ -17,6 +18,8 @@  to_netns_bpf_attach_type(enum bpf_attach_type attach_type)
 	switch (attach_type) {
 	case BPF_FLOW_DISSECTOR:
 		return NETNS_BPF_FLOW_DISSECTOR;
+	case BPF_SK_LOOKUP:
+		return NETNS_BPF_SK_LOOKUP;
 	default:
 		return NETNS_BPF_INVALID;
 	}
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index a18ae82a298a..a52a5688418e 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -64,6 +64,8 @@  BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
 	      struct sk_reuseport_md, struct sk_reuseport_kern)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
+	      struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
 #endif
 #if defined(CONFIG_BPF_JIT)
 BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 259377723603..ba4f8595fa54 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1278,4 +1278,23 @@  struct bpf_sockopt_kern {
 	s32		retval;
 };
 
+struct bpf_sk_lookup_kern {
+	u16		family;
+	u16		protocol;
+	union {
+		struct {
+			__be32 saddr;
+			__be32 daddr;
+		} v4;
+		struct {
+			const struct in6_addr *saddr;
+			const struct in6_addr *daddr;
+		} v6;
+	};
+	__be16		sport;
+	u16		dport;
+	struct sock	*selected_sk;
+	bool		no_reuseport;
+};
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0cb8ec948816..8dd6e6ce5de9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -189,6 +189,7 @@  enum bpf_prog_type {
 	BPF_PROG_TYPE_STRUCT_OPS,
 	BPF_PROG_TYPE_EXT,
 	BPF_PROG_TYPE_LSM,
+	BPF_PROG_TYPE_SK_LOOKUP,
 };
 
 enum bpf_attach_type {
@@ -226,6 +227,7 @@  enum bpf_attach_type {
 	BPF_CGROUP_INET4_GETSOCKNAME,
 	BPF_CGROUP_INET6_GETSOCKNAME,
 	BPF_XDP_DEVMAP,
+	BPF_SK_LOOKUP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -3067,6 +3069,10 @@  union bpf_attr {
  *
  * long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
  *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SCHED_CLS** and
+ *		**BPF_PROG_TYPE_SCHED_ACT** programs.
+ *
  *		Assign the *sk* to the *skb*. When combined with appropriate
  *		routing configuration to receive the packet towards the socket,
  *		will cause *skb* to be delivered to the specified socket.
@@ -3092,6 +3098,53 @@  union bpf_attr {
  *		**-ESOCKTNOSUPPORT** if the socket type is not supported
  *		(reuseport).
  *
+ * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
+ *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
+ *
+ *		Select the *sk* as a result of a socket lookup.
+ *
+ *		For the operation to succeed, the passed socket must be
+ *		compatible with the packet description provided by the *ctx* object.
+ *
+ *		The L4 protocol (**IPPROTO_TCP** or **IPPROTO_UDP**) must
+ *		be an exact match, while the IP family (**AF_INET** or
+ *		**AF_INET6**) must be compatible: IPv6 sockets that are
+ *		not v6-only can be selected for IPv4 packets.
+ *
+ *		Only TCP listeners and UDP unconnected sockets can be
+ *		selected.
+ *
+ *		The *flags* argument can be a combination of the following values:
+ *
+ *		* **BPF_SK_LOOKUP_F_REPLACE** to override the previous
+ *		  socket selection, potentially done by a BPF program
+ *		  that ran before us.
+ *
+ *		* **BPF_SK_LOOKUP_F_NO_REUSEPORT** to skip
+ *		  load-balancing within reuseport group for the socket
+ *		  being selected.
+ *
+ *	Return
+ *		0 on success, or a negative errno in case of failure.
+ *
+ *		* **-EAFNOSUPPORT** if socket family (*sk->family*) is
+ *		  not compatible with packet family (*ctx->family*).
+ *
+ *		* **-EEXIST** if a socket has already been selected,
+ *		  potentially by another program, and the
+ *		  **BPF_SK_LOOKUP_F_REPLACE** flag was not specified.
+ *
+ *		* **-EINVAL** if unsupported flags were specified.
+ *
+ *		* **-EPROTOTYPE** if socket L4 protocol
+ *		  (*sk->protocol*) doesn't match packet protocol
+ *		  (*ctx->protocol*).
+ *
+ *		* **-ESOCKTNOSUPPORT** if the socket is not in an
+ *		  allowed state (TCP listening or UDP unconnected).
+ *
  * u64 bpf_ktime_get_boot_ns(void)
  * 	Description
  * 		Return the time elapsed since system boot, in nanoseconds.
@@ -3569,6 +3622,12 @@  enum {
 	BPF_RINGBUF_HDR_SZ		= 8,
 };
 
+/* BPF_FUNC_sk_assign flags in bpf_sk_lookup context. */
+enum {
+	BPF_SK_LOOKUP_F_REPLACE		= (1ULL << 0),
+	BPF_SK_LOOKUP_F_NO_REUSEPORT	= (1ULL << 1),
+};
+
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
 	BPF_ADJ_ROOM_NET,
@@ -4298,4 +4357,19 @@  struct bpf_pidns_info {
 	__u32 pid;
 	__u32 tgid;
 };
+
+/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
+struct bpf_sk_lookup {
+	__u32 family;		/* Protocol family (AF_INET, AF_INET6) */
+	__u32 protocol;		/* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
+	__u32 remote_ip4;	/* Network byte order */
+	__u32 remote_ip6[4];	/* Network byte order */
+	__u32 remote_port;	/* Network byte order */
+	__u32 local_ip4;	/* Network byte order */
+	__u32 local_ip6[4];	/* Network byte order */
+	__u32 local_port;	/* Host byte order */
+
+	__bpf_md_ptr(struct bpf_sock *, sk); /* Selected socket */
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
index 6011122c35b6..090166824ca4 100644
--- a/kernel/bpf/net_namespace.c
+++ b/kernel/bpf/net_namespace.c
@@ -360,6 +360,8 @@  static int netns_bpf_max_progs(enum netns_bpf_attach_type type)
 	switch (type) {
 	case NETNS_BPF_FLOW_DISSECTOR:
 		return 1;
+	case NETNS_BPF_SK_LOOKUP:
+		return 64;
 	default:
 		return 0;
 	}
@@ -391,6 +393,9 @@  static int netns_bpf_link_attach(struct net *net, struct bpf_link *link,
 	case NETNS_BPF_FLOW_DISSECTOR:
 		err = flow_dissector_bpf_prog_attach_check(net, link->prog);
 		break;
+	case NETNS_BPF_SK_LOOKUP:
+		err = 0; /* nothing to check */
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8da159936bab..e7d49959340e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2021,6 +2021,10 @@  bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 		default:
 			return -EINVAL;
 		}
+	case BPF_PROG_TYPE_SK_LOOKUP:
+		if (expected_attach_type == BPF_SK_LOOKUP)
+			return 0;
+		return -EINVAL;
 	case BPF_PROG_TYPE_EXT:
 		if (expected_attach_type)
 			return -EINVAL;
@@ -2755,6 +2759,7 @@  static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+	case BPF_PROG_TYPE_SK_LOOKUP:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
 	case BPF_PROG_TYPE_CGROUP_SKB:
 		if (!capable(CAP_NET_ADMIN))
@@ -2815,6 +2820,8 @@  attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_CGROUP_SOCKOPT;
 	case BPF_TRACE_ITER:
 		return BPF_PROG_TYPE_TRACING;
+	case BPF_SK_LOOKUP:
+		return BPF_PROG_TYPE_SK_LOOKUP;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -2952,6 +2959,7 @@  static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_LIRC_MODE2:
 		return lirc_prog_query(attr, uattr);
 	case BPF_FLOW_DISSECTOR:
+	case BPF_SK_LOOKUP:
 		return netns_bpf_prog_query(attr, uattr);
 	default:
 		return -EINVAL;
@@ -3885,6 +3893,7 @@  static int link_create(union bpf_attr *attr)
 		ret = tracing_bpf_link_attach(attr, prog);
 		break;
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
+	case BPF_PROG_TYPE_SK_LOOKUP:
 		ret = netns_bpf_link_create(attr, prog);
 		break;
 	default:
diff --git a/net/core/filter.c b/net/core/filter.c
index c796e141ea8e..286f90e0c824 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9219,6 +9219,192 @@  const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
 
 const struct bpf_prog_ops sk_reuseport_prog_ops = {
 };
+
+BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
+	   struct sock *, sk, u64, flags)
+{
+	if (unlikely(flags & ~(BPF_SK_LOOKUP_F_REPLACE |
+			       BPF_SK_LOOKUP_F_NO_REUSEPORT)))
+		return -EINVAL;
+	if (unlikely(sk_is_refcounted(sk)))
+		return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
+	if (unlikely(sk->sk_state == TCP_ESTABLISHED))
+		return -ESOCKTNOSUPPORT; /* reject connected sockets */
+
+	/* Check if socket is suitable for packet L3/L4 protocol */
+	if (sk->sk_protocol != ctx->protocol)
+		return -EPROTOTYPE;
+	if (sk->sk_family != ctx->family &&
+	    (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
+		return -EAFNOSUPPORT;
+
+	if (ctx->selected_sk && !(flags & BPF_SK_LOOKUP_F_REPLACE))
+		return -EEXIST;
+
+	/* Select socket as lookup result */
+	ctx->selected_sk = sk;
+	ctx->no_reuseport = flags & BPF_SK_LOOKUP_F_NO_REUSEPORT;
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sk_lookup_assign_proto = {
+	.func		= bpf_sk_lookup_assign,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_SOCKET,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *
+sk_lookup_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_sk_assign:
+		return &bpf_sk_lookup_assign_proto;
+	case BPF_FUNC_sk_release:
+		return &bpf_sk_release_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static bool sk_lookup_is_valid_access(int off, int size,
+				      enum bpf_access_type type,
+				      const struct bpf_prog *prog,
+				      struct bpf_insn_access_aux *info)
+{
+	if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (type != BPF_READ)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range(struct bpf_sk_lookup, family):
+	case bpf_ctx_range(struct bpf_sk_lookup, protocol):
+	case bpf_ctx_range(struct bpf_sk_lookup, remote_ip4):
+	case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
+	case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
+	case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
+	case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
+	case bpf_ctx_range(struct bpf_sk_lookup, local_port):
+		return size == sizeof(__u32);
+
+	case offsetof(struct bpf_sk_lookup, sk):
+		info->reg_type = PTR_TO_SOCKET;
+		return size == sizeof(__u64);
+
+	default:
+		return false;
+	}
+}
+
+static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
+					const struct bpf_insn *si,
+					struct bpf_insn *insn_buf,
+					struct bpf_prog *prog,
+					u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+#if IS_ENABLED(CONFIG_IPV6)
+	int off;
+#endif
+
+	switch (si->off) {
+	case offsetof(struct bpf_sk_lookup, family):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, family) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, family));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, protocol):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, protocol) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, protocol));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, remote_ip4):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.saddr) != 4);
+
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, v4.saddr));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, local_ip4):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.daddr) != 4);
+
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, v4.daddr));
+		break;
+
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				remote_ip6[0], remote_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
+
+		off = si->off;
+		off -= offsetof(struct bpf_sk_lookup, remote_ip6[0]);
+		off += offsetof(struct in6_addr, s6_addr32[0]);
+		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, v6.saddr));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
+#else
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				local_ip6[0], local_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
+
+		off = si->off;
+		off -= offsetof(struct bpf_sk_lookup, local_ip6[0]);
+		off += offsetof(struct in6_addr, s6_addr32[0]);
+		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, v6.daddr));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
+#else
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case offsetof(struct bpf_sk_lookup, remote_port):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, sport) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, sport));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, local_port):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, dport) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, dport));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, sk):
+		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, selected_sk));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+const struct bpf_prog_ops sk_lookup_prog_ops = {
+};
+
+const struct bpf_verifier_ops sk_lookup_verifier_ops = {
+	.get_func_proto		= sk_lookup_func_proto,
+	.is_valid_access	= sk_lookup_is_valid_access,
+	.convert_ctx_access	= sk_lookup_convert_ctx_access,
+};
+
 #endif /* CONFIG_INET */
 
 DEFINE_BPF_DISPATCHER(xdp)
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index 6bab40ff442e..ea21e86a807c 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -404,6 +404,7 @@  class PrinterHelpers(Printer):
 
     type_fwds = [
             'struct bpf_fib_lookup',
+            'struct bpf_sk_lookup',
             'struct bpf_perf_event_data',
             'struct bpf_perf_event_value',
             'struct bpf_pidns_info',
@@ -449,6 +450,7 @@  class PrinterHelpers(Printer):
             'struct bpf_perf_event_data',
             'struct bpf_perf_event_value',
             'struct bpf_pidns_info',
+            'struct bpf_sk_lookup',
             'struct bpf_sock',
             'struct bpf_sock_addr',
             'struct bpf_sock_ops',
@@ -485,6 +487,11 @@  class PrinterHelpers(Printer):
             'struct sk_msg_buff': 'struct sk_msg_md',
             'struct xdp_buff': 'struct xdp_md',
     }
+    # Helpers overloaded for different context types.
+    overloaded_helpers = [
+        'bpf_get_socket_cookie',
+        'bpf_sk_assign',
+    ]
 
     def print_header(self):
         header = '''\
@@ -541,7 +548,7 @@  class PrinterHelpers(Printer):
         for i, a in enumerate(proto['args']):
             t = a['type']
             n = a['name']
-            if proto['name'] == 'bpf_get_socket_cookie' and i == 0:
+            if proto['name'] in self.overloaded_helpers and i == 0:
                     t = 'void'
                     n = 'ctx'
             one_arg = '{}{}'.format(comma, self.map_type(t))
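
Since SK_LOOKUP attaches through the bpf-netns infrastructure, user space
would attach it with a netns-scoped bpf_link. A sketch using the libbpf
netns-link API that already exists for the flow dissector (the "sk_lookup"
section handling and libbpf plumbing are added later in the series):

#include <fcntl.h>
#include <bpf/libbpf.h>

int attach_sk_lookup(const char *obj_path)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	struct bpf_link *link;
	int netns_fd;

	obj = bpf_object__open_file(obj_path, NULL);
	if (libbpf_get_error(obj))
		return -1;

	prog = bpf_object__find_program_by_title(obj, "sk_lookup");
	if (!prog)
		return -1;

	/* This patch enforces expected_attach_type == BPF_SK_LOOKUP at
	 * both load and attach time.
	 */
	bpf_program__set_type(prog, BPF_PROG_TYPE_SK_LOOKUP);
	bpf_program__set_expected_attach_type(prog, BPF_SK_LOOKUP);

	if (bpf_object__load(obj))
		return -1;

	netns_fd = open("/proc/self/ns/net", O_RDONLY);
	if (netns_fd < 0)
		return -1;

	/* BPF_LINK_CREATE with the netns fd as target, the same path
	 * netns_bpf_link_create() already serves for the flow dissector.
	 */
	link = bpf_program__attach_netns(prog, netns_fd);
	if (libbpf_get_error(link))
		return -1;

	return bpf_link__fd(link); /* keep the link alive while attached */
}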