
[bpf-next,09/13] bpf: add bpf_skb_adjust_room encap flags

Message ID 20190320144944.147862-10-willemdebruijn.kernel@gmail.com
State Changes Requested
Delegated to: BPF Maintainers
Series bpf tc tunneling

Commit Message

Willem de Bruijn March 20, 2019, 2:49 p.m. UTC
From: Willem de Bruijn <willemb@google.com>

When pushing tunnel headers, annotate skbs in the same way as tunnel
devices.

For GSO packets, the network stack requires certain fields set to
segment packets with tunnel headers. gre_gso_segment depends on
transport and inner mac header, for instance.

Add an option to pass this information.

Remove the restriction on len_diff to network header length, which
is too short, e.g., for GRE protocols.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/uapi/linux/bpf.h | 14 +++++++++-
 net/core/filter.c        | 58 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 67 insertions(+), 5 deletions(-)
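
For illustration, a tc program could then request room for an outer
IPv4 + UDP header and get the GSO annotations in a single call along
the lines of the sketch below (section name, includes and error
handling are assumptions, not part of this patch):

#include <linux/bpf.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>

SEC("classifier")
int encap_ipv4_udp(struct __sk_buff *skb)
{
	__u32 len = sizeof(struct iphdr) + sizeof(struct udphdr);
	__u64 flags = BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
		      BPF_F_ADJ_ROOM_ENCAP_L4_UDP;

	/* Make room between mac and network header and annotate the
	 * skb (inner headers, gso_type) for later segmentation.
	 */
	if (bpf_skb_adjust_room(skb, len, BPF_ADJ_ROOM_NET, flags))
		return TC_ACT_SHOT;

	/* The outer iphdr and udphdr bytes would be written here
	 * with bpf_skb_store_bytes() (omitted).
	 */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";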

Comments

Alan Maguire March 20, 2019, 3:51 p.m. UTC | #1
On Wed, 20 Mar 2019, Willem de Bruijn wrote:

> From: Willem de Bruijn <willemb@google.com>
> 
> When pushing tunnel headers, annotate skbs in the same way as tunnel
> devices.
>
This is great stuff Willem!
 
> For GSO packets, the network stack requires certain fields set to
> segment packets with tunnel headers. gre_gso_segment depends on
> transport and inner mac header, for instance.
>

By coincidence I've been working on a patch to solve part of
this problem (attached).

I took a slightly different approach (which I think you mentioned
you considered) - adding an additional helper to mark the inner
headers.  The reason I needed this is that the mac header length
in my case was sometimes not the same as the outer mac size (it
could be a set of MPLS labels sometimes, or indeed there might
be no mac header at all).  If I'm reading your code correctly,
you derive the mac header length from the outer mac header size -
would there be a possibility of overloading the flags field for
bpf_skb_adjust_room to use 8 bits to store a non-standard mac length 
perhaps? I'd be happy to work on that as a separate patch if that seems 
reasonable.
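
Purely to illustrate that suggestion (the macro name and shift below
are invented, this is not code from either patch): the idea would be
to carve the top byte of the flags word out for an explicit L2 length.

#define BPF_F_ADJ_ROOM_ENCAP_L2_SHIFT	56
#define BPF_F_ADJ_ROOM_ENCAP_L2(len)	(((__u64)(len) & 0xff) << \
					 BPF_F_ADJ_ROOM_ENCAP_L2_SHIFT)

	/* e.g. outer IPv4 + GRE carrying a two-label MPLS stack */
	flags = BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
		BPF_F_ADJ_ROOM_ENCAP_L4_GRE |
		BPF_F_ADJ_ROOM_ENCAP_L2(2 * 4);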
 
> Add an option to pass this information.
> 
> Remove the restriction on len_diff to network header length, which
> is too short, e.g., for GRE protocols.
>

I think this solves another problem I'd observed: when de-encapsulating
packets which had been GRO re-assembled, bpf_skb_adjust_room would
fail because GRO reassembly set the transport header, and shrinkage
was limited to ensure we still had an IPv4/IPv6 header's worth of
space between the network and transport headers.  I think that
problem is fixed here, is that right?

Reviewed-by: Alan Maguire <alan.maguire@oracle.com>

Thanks!

Alan

From 388c1bf0cfc76901782520c5af58f73b2649a4c0 Mon Sep 17 00:00:00 2001
From: Alan Maguire <alan.maguire@oracle.com>
Date: Wed, 20 Mar 2019 14:12:36 +0100
Subject: [PATCH] bpf: add bpf_skb_set_inner_header helper

This work adds a helper which can mark inner mac and network
headers and sets inner protocol type for the relevant skb such
that generic segmentation offload (GSO) can segment the packet
appropriately while taking newly-added encapsulation headers into
account.

It is intended to be used in conjunction with other helpers such
as bpf_skb_adjust_room for cases where custom encapsulation is
implemented in a tc BPF program and GSO functionality is needed.
Currently UDP and GRE encapsulation are supported.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
---
 include/uapi/linux/bpf.h | 23 ++++++++++++++++-
 net/core/filter.c        | 64 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 929c8e5..3ce3c16 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2431,6 +2431,22 @@ struct bpf_stack_build_id {
  *	Return
  *		A **struct bpf_sock** pointer on success, or **NULL** in
  *		case of failure.
+ *
+ * int bpf_skb_set_inner_header(skb, outer_proto, inner_proto, inner_offset,
+ *				flags)
+ *	Description
+ *		Set inner header at *inner_offset* for specified *inner_proto*
+ *		for tunnel of type *outer_proto*. *outer_proto* must be one
+ *		of **IPPROTO_UDP** or **IPPROTO_GRE**.  *inner_proto* must be
+ *		**ETH_P_IP** or **ETH_P_IPV6**.  *inner_offset* should specify
+ *		offset of the relevant inner header, or should be 0 to reset
+ *		inner headers. *flags* should be a combination of
+ *
+ *			* **BPF_F_L2_INNER_OFFSET** offset is inner mac header
+ *			* **BPF_F_L3_INNER_OFFSET** offset is inner network
+ *			  header
+ *	Return
+ *		0 on success, or negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2531,7 +2547,8 @@ struct bpf_stack_build_id {
 	FN(sk_fullsock),		\
 	FN(tcp_sock),			\
 	FN(skb_ecn_set_ce),		\
-	FN(get_listener_sock),
+	FN(get_listener_sock),		\
+	FN(skb_set_inner_header),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2595,6 +2612,10 @@ enum bpf_adj_room_mode {
 	BPF_ADJ_ROOM_NET,
 };
 
+/* BPF_FUNC_skb_set_inner_header flags. */
+#define	BPF_F_L2_INNER_OFFSET		(1ULL << 0)
+#define	BPF_F_L3_INNER_OFFSET		(1ULL << 1)
+
 /* Mode for BPF_FUNC_skb_load_bytes_relative helper. */
 enum bpf_hdr_start_off {
 	BPF_HDR_START_MAC,
diff --git a/net/core/filter.c b/net/core/filter.c
index 647c63a..f8265f3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5463,6 +5463,68 @@ u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
 };
 #endif /* CONFIG_INET */
 
+BPF_CALL_5(bpf_skb_set_inner_header, struct sk_buff *, skb,
+	   __be16, outer_proto, __be16, inner_proto, u32, inner_offset,
+	   u64, flags)
+{
+	unsigned int gso_type;
+
+	if (unlikely(inner_offset > skb->len))
+		return -EINVAL;
+
+	if (unlikely(flags & ~(BPF_F_L2_INNER_OFFSET | BPF_F_L3_INNER_OFFSET)))
+		return -EINVAL;
+
+	switch (outer_proto) {
+	case IPPROTO_UDP:
+		gso_type = SKB_GSO_UDP_TUNNEL;
+		break;
+	case IPPROTO_GRE:
+		gso_type = SKB_GSO_GRE;
+		break;
+	default:
+		return -ENOTSUPP;
+	}
+
+	switch (inner_proto) {
+	case ETH_P_IP:
+	case ETH_P_IPV6:
+		break;
+	default:
+		return -ENOTSUPP;
+	}
+
+	if (inner_offset == 0) {
+		skb->encapsulation = 0;
+		skb_reset_inner_headers(skb);
+		return 0;
+	}
+
+	if (flags & BPF_F_L2_INNER_OFFSET)
+		skb_set_inner_mac_header(skb, inner_offset);
+	if (flags & BPF_F_L3_INNER_OFFSET)
+		skb_set_inner_network_header(skb, inner_offset);
+
+	skb_set_inner_protocol(skb, htons(inner_proto));
+	skb->encapsulation = 1;
+
+	if (skb_is_gso(skb))
+		skb_shinfo(skb)->gso_type |= gso_type;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_skb_set_inner_header_proto = {
+	.func		= bpf_skb_set_inner_header,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_ANYTHING,
+	.arg5_type	= ARG_ANYTHING,
+};
+
 bool bpf_helper_changes_pkt_data(void *func)
 {
 	if (func == bpf_skb_vlan_push ||
@@ -5720,6 +5782,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 	case BPF_FUNC_get_listener_sock:
 		return &bpf_get_listener_sock_proto;
 #endif
+	case BPF_FUNC_skb_set_inner_header:
+		return &bpf_skb_set_inner_header_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
Willem de Bruijn March 20, 2019, 6:10 p.m. UTC | #2
On Wed, Mar 20, 2019 at 11:51 AM Alan Maguire <alan.maguire@oracle.com> wrote:
>
> On Wed, 20 Mar 2019, Willem de Bruijn wrote:
>
> > From: Willem de Bruijn <willemb@google.com>
> >
> > When pushing tunnel headers, annotate skbs in the same way as tunnel
> > devices.
> >
> This is great stuff Willem!
>
> > For GSO packets, the network stack requires certain fields set to
> > segment packets with tunnel headers. gro_gse_segment depends on
> > transport and inner mac header, for instance.
> >
>
> By coincidence I've been working on a patch to solve part of
> this problem (attached).
>
> I took a slightly different approach (which I think you mentioned
> you considered) - adding an additional helper to mark the inner
> headers.  The reason I needed this is that the mac header length
> in my case was sometimes not the same as the outer mac size (it
> could be a set of MPLS labels sometimes, or indeed there might
> be no mac header at all).  If I'm reading your code correctly,
> you derive the mac header length from the outer mac header size -
> would there be a possibility of overloading the flags field for
> bpf_skb_adjust_room to use 8 bits to store a non-standard mac length
> perhaps? I'd be happy to work on that as a separate patch if that seems
> reasonable.

My patch indeed still has some limitations in that regard. It assumes
network layer encap, so no inner mac. This excludes GRE with
ETH_P_TEB or MPLS. And there can be MPLS both on the outer and inner
packet.

My plan to deal with those eventually was to add modes
BPF_F_ADJ_ROOM_ENCAP_L2_... and require a separate call to
bpf_skb_adjust_room for each layer of encap. Would that work for your
use-case?
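
Roughly, the calling pattern would look like the sketch below; the
BPF_F_ADJ_ROOM_ENCAP_L2_ETH flag is hypothetical and not part of this
series, and the exact semantics and ordering are still open:

	/* first call: outer IPv4 + GRE, using the flags in this patch */
	err = bpf_skb_adjust_room(skb, sizeof(struct iphdr) + 4 /* GRE */,
				  BPF_ADJ_ROOM_NET,
				  BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
				  BPF_F_ADJ_ROOM_ENCAP_L4_GRE);

	/* second call: inner Ethernet header for the ETH_P_TEB case,
	 * using a follow-up L2 encap flag (hypothetical)
	 */
	if (!err)
		err = bpf_skb_adjust_room(skb, ETH_HLEN, BPF_ADJ_ROOM_NET,
					  BPF_F_ADJ_ROOM_ENCAP_L2_ETH);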

> > Add an option to pass this information.
> >
> > Remove the restriction on len_diff to network header length, which
> > is too short, e.g., for GRE protocols.
> >
>
> I think this solves another problem I'd observed: when de-encapsulating
> packets which had been GRO re-assembled, bpf_skb_adjust_room would
> fail because GRO reassembly set the transport header, and shrinkage
> was limited to ensure we still had an IPv4/IPv6 header's worth of
> space between the network and transport headers.  I think that
> problem is fixed here, is that right?

Yes, sounds familiar. The included selftest also fails without this because
the transport header is set in the transmit path and not scrubbed when
looped over veth.

Speaking of GRO, the patchset as-is still lacks the inverse of this
encap patch: modifying gso_type to remove tunnel types like
SKB_GSO_GRE.
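
Roughly the inverse of the grow path in this patch; a sketch of what
the shrink/decap side would have to undo (nothing in this series
implements it yet, so the exact fields touched are an assumption):

	/* on decap, drop the tunnel annotations added on encap */
	if (skb_is_gso(skb))
		skb_shinfo(skb)->gso_type &= ~(SKB_GSO_GRE |
					       SKB_GSO_UDP_TUNNEL |
					       SKB_GSO_IPXIP4 |
					       SKB_GSO_IPXIP6);
	skb->encapsulation = 0;
	skb_reset_inner_headers(skb);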

> Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
>
> Thanks!

Thanks for reviewing and sharing your patch, Alan :)

There is definitely something to be said for moving this out of
bpf_skb_adjust_room into a separate call, bpf_skb_adjust_tunnel_headers
or similar, that is called after bpf_skb_store_bytes.

Instead of taking all the arguments explicitly, this could also infer
the values by parsing the headers. I was a bit hesitant that this
would turn into (yet another) custom packet parser.

Otherwise, merging this with bpf_skb_adjust_room is essentially just
a performance optimization to avoid one extra indirect function call.


>
> Alan
>
> From 388c1bf0cfc76901782520c5af58f73b2649a4c0 Mon Sep 17 00:00:00 2001
> From: Alan Maguire <alan.maguire@oracle.com>
> Date: Wed, 20 Mar 2019 14:12:36 +0100
> Subject: [PATCH] bpf: add bpf_skb_set_inner_header helper
>
> This work adds a helper which can mark inner mac and network
> headers and sets inner protocol type for the relevant skb such
> that generic segmentation offload (GSO) can segment the packet
> appropriately while taking newly-added encapsulation headers into
> account.
>
> It is intended to be used in conjunction with other helpers such
> as bpf_skb_adjust_room for cases where custom encapsulation is
> implemented in a tc BPF program and GSO functionality is needed.
> Currently UDP and GRE encapsulation are supported.
>
> Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
> ---
>  include/uapi/linux/bpf.h | 23 ++++++++++++++++-
>  net/core/filter.c        | 64 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 86 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 929c8e5..3ce3c16 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2431,6 +2431,22 @@ struct bpf_stack_build_id {
>   *     Return
>   *             A **struct bpf_sock** pointer on success, or **NULL** in
>   *             case of failure.
> + *
> + * int bpf_skb_set_inner_header(skb, outer_proto, inner_proto, inner_offset,
> + *                             flags)
> + *     Description
> + *             Set inner header at *inner_offset* for specified *inner_proto*
> + *             for tunnel of type *outer_proto*. *outer_proto* must be one
> + *             of **IPPROTO_UDP** or **IPPROTO_GRE**.  *inner_proto* must be
> + *             **ETH_P_IP** or **ETH_P_IPV6**.  *inner_offset* should specify
> + *             offset of the relevant inner header, or should be 0 to reset
> + *             inner headers. *flags* should be a combination of
> + *
> + *                     * **BPF_F_L2_INNER_OFFSET** offset is inner mac header
> + *                     * **BPF_F_L3_INNER_OFFSET** offset is inner network
> + *                       header
> + *     Return
> + *             0 on success, or negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -2531,7 +2547,8 @@ struct bpf_stack_build_id {
>         FN(sk_fullsock),                \
>         FN(tcp_sock),                   \
>         FN(skb_ecn_set_ce),             \
> -       FN(get_listener_sock),
> +       FN(get_listener_sock),          \
> +       FN(skb_set_inner_header),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -2595,6 +2612,10 @@ enum bpf_adj_room_mode {
>         BPF_ADJ_ROOM_NET,
>  };
>
> +/* BPF_FUNC_skb_set_inner_header flags. */
> +#define        BPF_F_L2_INNER_OFFSET           (1ULL << 0)
> +#define        BPF_F_L3_INNER_OFFSET           (1ULL << 1)
> +
>  /* Mode for BPF_FUNC_skb_load_bytes_relative helper. */
>  enum bpf_hdr_start_off {
>         BPF_HDR_START_MAC,
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 647c63a..f8265f3 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5463,6 +5463,68 @@ u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
>  };
>  #endif /* CONFIG_INET */
>
> +BPF_CALL_5(bpf_skb_set_inner_header, struct sk_buff *, skb,
> +          __be16, outer_proto, __be16, inner_proto, u32, inner_offset,
> +          u64, flags)
> +{
> +       unsigned int gso_type;
> +
> +       if (unlikely(inner_offset > skb->len))
> +               return -EINVAL;
> +
> +       if (unlikely(flags & ~(BPF_F_L2_INNER_OFFSET | BPF_F_L3_INNER_OFFSET)))
> +               return -EINVAL;
> +
> +       switch (outer_proto) {
> +       case IPPROTO_UDP:
> +               gso_type = SKB_GSO_UDP_TUNNEL;
> +               break;
> +       case IPPROTO_GRE:
> +               gso_type = SKB_GSO_GRE;
> +               break;
> +       default:
> +               return -ENOTSUPP;
> +       }
> +
> +       switch (inner_proto) {
> +       case ETH_P_IP:
> +       case ETH_P_IPV6:
> +               break;
> +       default:
> +               return -ENOTSUPP;
> +       }
> +
> +       if (inner_offset == 0) {
> +               skb->encapsulation = 0;
> +               skb_reset_inner_headers(skb);
> +               return 0;
> +       }
> +
> +       if (flags & BPF_F_L2_INNER_OFFSET)
> +               skb_set_inner_mac_header(skb, inner_offset);
> +       if (flags & BPF_F_L3_INNER_OFFSET)
> +               skb_set_inner_network_header(skb, inner_offset);
> +
> +       skb_set_inner_protocol(skb, htons(inner_proto));
> +       skb->encapsulation = 1;
> +
> +       if (skb_is_gso(skb))
> +               skb_shinfo(skb)->gso_type |= gso_type;
> +
> +       return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_skb_set_inner_header_proto = {
> +       .func           = bpf_skb_set_inner_header,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_CTX,
> +       .arg2_type      = ARG_ANYTHING,
> +       .arg3_type      = ARG_ANYTHING,
> +       .arg4_type      = ARG_ANYTHING,
> +       .arg5_type      = ARG_ANYTHING,
> +};
> +
>  bool bpf_helper_changes_pkt_data(void *func)
>  {
>         if (func == bpf_skb_vlan_push ||
> @@ -5720,6 +5782,8 @@ bool bpf_helper_changes_pkt_data(void *func)
>         case BPF_FUNC_get_listener_sock:
>                 return &bpf_get_listener_sock_proto;
>  #endif
> +       case BPF_FUNC_skb_set_inner_header:
> +               return &bpf_skb_set_inner_header_proto;
>         default:
>                 return bpf_base_func_proto(func_id);
>         }
> --
> 1.8.3.1
>
>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
> >  include/uapi/linux/bpf.h | 14 +++++++++-
> >  net/core/filter.c        | 58 +++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 67 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 0eda8f564a381..a444534cc88d7 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2595,7 +2595,19 @@ enum bpf_func_id {
> >
> >  /* BPF_FUNC_skb_adjust_room flags. */
> >  #define BPF_F_ADJ_ROOM_FIXED_GSO     (1ULL << 0)
> > -#define BPF_F_ADJ_ROOM_MASK          (BPF_F_ADJ_ROOM_FIXED_GSO)
> > +
> > +#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 (1ULL << 1)
> > +#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 (1ULL << 2)
> > +#define BPF_F_ADJ_ROOM_ENCAP_L3_MASK (BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 | \
> > +                                      BPF_F_ADJ_ROOM_ENCAP_L3_IPV6)
> > +
> > +#define BPF_F_ADJ_ROOM_ENCAP_L4_GRE  (1ULL << 3)
> > +#define BPF_F_ADJ_ROOM_ENCAP_L4_UDP  (1ULL << 4)
> > +
> > +#define BPF_F_ADJ_ROOM_MASK          (BPF_F_ADJ_ROOM_FIXED_GSO | \
> > +                                      BPF_F_ADJ_ROOM_ENCAP_L3_MASK | \
> > +                                      BPF_F_ADJ_ROOM_ENCAP_L4_GRE | \
> > +                                      BPF_F_ADJ_ROOM_ENCAP_L4_UDP)
> >
> >  /* Mode for BPF_FUNC_skb_adjust_room helper. */
> >  enum bpf_adj_room_mode {
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index e346e48098000..6007b0b4bc0d7 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2966,6 +2966,9 @@ static u32 bpf_skb_net_base_len(const struct sk_buff *skb)
> >  static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff,
> >                           u64 flags)
> >  {
> > +     bool encap = flags & BPF_F_ADJ_ROOM_ENCAP_L3_MASK;
> > +     unsigned int gso_type = SKB_GSO_DODGY;
> > +     u16 mac_len, inner_net, inner_trans;
> >       int ret;
> >
> >       if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) {
> > @@ -2979,10 +2982,60 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff,
> >       if (unlikely(ret < 0))
> >               return ret;
> >
> > +     if (encap) {
> > +             if (skb->protocol != htons(ETH_P_IP) &&
> > +                 skb->protocol != htons(ETH_P_IPV6))
> > +                     return -ENOTSUPP;
> > +
> > +             if (flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 &&
> > +                 flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV6)
> > +                     return -EINVAL;
> > +
> > +             if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_GRE &&
> > +                 flags & BPF_F_ADJ_ROOM_ENCAP_L4_UDP)
> > +                     return -EINVAL;
> > +
> > +             if (skb->encapsulation)
> > +                     return -EALREADY;
> > +
> > +             mac_len = skb->network_header - skb->mac_header;
> > +             inner_net = skb->network_header;
> > +             inner_trans = skb->transport_header;
> > +     }
> > +
> >       ret = bpf_skb_net_hdr_push(skb, off, len_diff);
> >       if (unlikely(ret < 0))
> >               return ret;
> >
> > +     if (encap) {
> > +             /* inner mac == inner_net on l3 encap */
> > +             skb->inner_mac_header = inner_net;
> > +             skb->inner_network_header = inner_net;
> > +             skb->inner_transport_header = inner_trans;
> > +             skb_set_inner_protocol(skb, skb->protocol);
> > +
> > +             skb->encapsulation = 1;
> > +             skb_set_network_header(skb, mac_len);
> > +
> > +             if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_UDP)
> > +                     gso_type |= SKB_GSO_UDP_TUNNEL;
> > +             else if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_GRE)
> > +                     gso_type |= SKB_GSO_GRE;
> > +             else if (flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV6)
> > +                     gso_type |= SKB_GSO_IPXIP6;
> > +             else
> > +                     gso_type |= SKB_GSO_IPXIP4;
> > +
> > +             if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_GRE ||
> > +                 flags & BPF_F_ADJ_ROOM_ENCAP_L4_UDP) {
> > +                     int nh_len = flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 ?
> > +                                     sizeof(struct ipv6hdr) :
> > +                                     sizeof(struct iphdr);
> > +
> > +                     skb_set_transport_header(skb, mac_len + nh_len);
> > +             }
> > +     }
> > +
> >       if (skb_is_gso(skb)) {
> >               struct skb_shared_info *shinfo = skb_shinfo(skb);
> >
> > @@ -2991,7 +3044,7 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff,
> >                       skb_decrease_gso_size(shinfo, len_diff);
> >
> >               /* Header must be checked, and gso_segs recomputed. */
> > -             shinfo->gso_type |= SKB_GSO_DODGY;
> > +             shinfo->gso_type |= gso_type;
> >               shinfo->gso_segs = 0;
> >       }
> >
> > @@ -3042,7 +3095,6 @@ static u32 __bpf_skb_max_len(const struct sk_buff *skb)
> >  BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
> >          u32, mode, u64, flags)
> >  {
> > -     bool trans_same = skb->transport_header == skb->network_header;
> >       u32 len_cur, len_diff_abs = abs(len_diff);
> >       u32 len_min = bpf_skb_net_base_len(skb);
> >       u32 len_max = __bpf_skb_max_len(skb);
> > @@ -3071,8 +3123,6 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
> >       }
> >
> >       len_cur = skb->len - skb_network_offset(skb);
> > -     if (skb_transport_header_was_set(skb) && !trans_same)
> > -             len_cur = skb_network_header_len(skb);
> >       if ((shrink && (len_diff_abs >= len_cur ||
> >                       len_cur - len_diff_abs < len_min)) ||
> >           (!shrink && (skb->len + len_diff_abs > len_max &&
> > --
> > 2.21.0.225.g810b269d1ac-goog
> >
> >
Alexei Starovoitov March 21, 2019, 3:13 a.m. UTC | #3
On Wed, Mar 20, 2019 at 10:49:40AM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> When pushing tunnel headers, annotate skbs in the same way as tunnel
> devices.
> 
> For GSO packets, the network stack requires certain fields set to
> segment packets with tunnel headers. gre_gso_segment depends on
> transport and inner mac header, for instance.
> 
> Add an option to pass this information.
> 
> Remove the restriction on len_diff to network header length, which
> is too short, e.g., for GRE protocols.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  include/uapi/linux/bpf.h | 14 +++++++++-
>  net/core/filter.c        | 58 +++++++++++++++++++++++++++++++++++++---
>  2 files changed, 67 insertions(+), 5 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0eda8f564a381..a444534cc88d7 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2595,7 +2595,19 @@ enum bpf_func_id {
>  
>  /* BPF_FUNC_skb_adjust_room flags. */
>  #define BPF_F_ADJ_ROOM_FIXED_GSO	(1ULL << 0)
> -#define BPF_F_ADJ_ROOM_MASK		(BPF_F_ADJ_ROOM_FIXED_GSO)
> +
> +#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV4	(1ULL << 1)
> +#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV6	(1ULL << 2)
> +#define BPF_F_ADJ_ROOM_ENCAP_L3_MASK	(BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 | \
> +					 BPF_F_ADJ_ROOM_ENCAP_L3_IPV6)
> +
> +#define BPF_F_ADJ_ROOM_ENCAP_L4_GRE	(1ULL << 3)
> +#define BPF_F_ADJ_ROOM_ENCAP_L4_UDP	(1ULL << 4)
> +
> +#define BPF_F_ADJ_ROOM_MASK		(BPF_F_ADJ_ROOM_FIXED_GSO | \
> +					 BPF_F_ADJ_ROOM_ENCAP_L3_MASK | \
> +					 BPF_F_ADJ_ROOM_ENCAP_L4_GRE | \
> +					 BPF_F_ADJ_ROOM_ENCAP_L4_UDP)

The patches look great.
One small nit: please move *_MASK out of uapi.
When flags are added, this uapi macro keeps changing,
which may break user space compiled with an older macro.
Essentially, user space cannot rely on this macro and cannot use it.
I think it's best to leave this macro for kernel internals.
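
A sketch of the suggested split (whether v2 ends up exactly like this
is an assumption): keep the individual flag defines in
include/uapi/linux/bpf.h and move the mask definitions into
net/core/filter.c.

/* net/core/filter.c, not uapi */
#define BPF_F_ADJ_ROOM_ENCAP_L3_MASK	(BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 | \
					 BPF_F_ADJ_ROOM_ENCAP_L3_IPV6)
#define BPF_F_ADJ_ROOM_MASK		(BPF_F_ADJ_ROOM_FIXED_GSO | \
					 BPF_F_ADJ_ROOM_ENCAP_L3_MASK | \
					 BPF_F_ADJ_ROOM_ENCAP_L4_GRE | \
					 BPF_F_ADJ_ROOM_ENCAP_L4_UDP)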
Willem de Bruijn March 21, 2019, 1:25 p.m. UTC | #4
On Wed, Mar 20, 2019 at 11:13 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Mar 20, 2019 at 10:49:40AM -0400, Willem de Bruijn wrote:
> > From: Willem de Bruijn <willemb@google.com>
> >
> > When pushing tunnel headers, annotate skbs in the same way as tunnel
> > devices.
> >
> > For GSO packets, the network stack requires certain fields set to
> > segment packets with tunnel headers. gre_gso_segment depends on
> > transport and inner mac header, for instance.
> >
> > Add an option to pass this information.
> >
> > Remove the restriction on len_diff to network header length, which
> > is too short, e.g., for GRE protocols.
> >
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
> >  include/uapi/linux/bpf.h | 14 +++++++++-
> >  net/core/filter.c        | 58 +++++++++++++++++++++++++++++++++++++---
> >  2 files changed, 67 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 0eda8f564a381..a444534cc88d7 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2595,7 +2595,19 @@ enum bpf_func_id {
> >
> >  /* BPF_FUNC_skb_adjust_room flags. */
> >  #define BPF_F_ADJ_ROOM_FIXED_GSO     (1ULL << 0)
> > -#define BPF_F_ADJ_ROOM_MASK          (BPF_F_ADJ_ROOM_FIXED_GSO)
> > +
> > +#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 (1ULL << 1)
> > +#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 (1ULL << 2)
> > +#define BPF_F_ADJ_ROOM_ENCAP_L3_MASK (BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 | \
> > +                                      BPF_F_ADJ_ROOM_ENCAP_L3_IPV6)
> > +
> > +#define BPF_F_ADJ_ROOM_ENCAP_L4_GRE  (1ULL << 3)
> > +#define BPF_F_ADJ_ROOM_ENCAP_L4_UDP  (1ULL << 4)
> > +
> > +#define BPF_F_ADJ_ROOM_MASK          (BPF_F_ADJ_ROOM_FIXED_GSO | \
> > +                                      BPF_F_ADJ_ROOM_ENCAP_L3_MASK | \
> > +                                      BPF_F_ADJ_ROOM_ENCAP_L4_GRE | \
> > +                                      BPF_F_ADJ_ROOM_ENCAP_L4_UDP)
>
> The patches look great.
> One small nit: please move *_MASK out of uapi.
> When flags are added, this uapi macro keeps changing,
> which may break user space compiled with an older macro.

Ah, good point. Thanks, Alexei. Will fix in v2.


> Essentially, user space cannot rely on this macro and cannot use it.
> I think it's best to leave this macro for kernel internals.

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eda8f564a381..a444534cc88d7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2595,7 +2595,19 @@  enum bpf_func_id {
 
 /* BPF_FUNC_skb_adjust_room flags. */
 #define BPF_F_ADJ_ROOM_FIXED_GSO	(1ULL << 0)
-#define BPF_F_ADJ_ROOM_MASK		(BPF_F_ADJ_ROOM_FIXED_GSO)
+
+#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV4	(1ULL << 1)
+#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV6	(1ULL << 2)
+#define BPF_F_ADJ_ROOM_ENCAP_L3_MASK	(BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 | \
+					 BPF_F_ADJ_ROOM_ENCAP_L3_IPV6)
+
+#define BPF_F_ADJ_ROOM_ENCAP_L4_GRE	(1ULL << 3)
+#define BPF_F_ADJ_ROOM_ENCAP_L4_UDP	(1ULL << 4)
+
+#define BPF_F_ADJ_ROOM_MASK		(BPF_F_ADJ_ROOM_FIXED_GSO | \
+					 BPF_F_ADJ_ROOM_ENCAP_L3_MASK | \
+					 BPF_F_ADJ_ROOM_ENCAP_L4_GRE | \
+					 BPF_F_ADJ_ROOM_ENCAP_L4_UDP)
 
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
diff --git a/net/core/filter.c b/net/core/filter.c
index e346e48098000..6007b0b4bc0d7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2966,6 +2966,9 @@  static u32 bpf_skb_net_base_len(const struct sk_buff *skb)
 static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff,
 			    u64 flags)
 {
+	bool encap = flags & BPF_F_ADJ_ROOM_ENCAP_L3_MASK;
+	unsigned int gso_type = SKB_GSO_DODGY;
+	u16 mac_len, inner_net, inner_trans;
 	int ret;
 
 	if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) {
@@ -2979,10 +2982,60 @@  static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff,
 	if (unlikely(ret < 0))
 		return ret;
 
+	if (encap) {
+		if (skb->protocol != htons(ETH_P_IP) &&
+		    skb->protocol != htons(ETH_P_IPV6))
+			return -ENOTSUPP;
+
+		if (flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 &&
+		    flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV6)
+			return -EINVAL;
+
+		if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_GRE &&
+		    flags & BPF_F_ADJ_ROOM_ENCAP_L4_UDP)
+			return -EINVAL;
+
+		if (skb->encapsulation)
+			return -EALREADY;
+
+		mac_len = skb->network_header - skb->mac_header;
+		inner_net = skb->network_header;
+		inner_trans = skb->transport_header;
+	}
+
 	ret = bpf_skb_net_hdr_push(skb, off, len_diff);
 	if (unlikely(ret < 0))
 		return ret;
 
+	if (encap) {
+		/* inner mac == inner_net on l3 encap */
+		skb->inner_mac_header = inner_net;
+		skb->inner_network_header = inner_net;
+		skb->inner_transport_header = inner_trans;
+		skb_set_inner_protocol(skb, skb->protocol);
+
+		skb->encapsulation = 1;
+		skb_set_network_header(skb, mac_len);
+
+		if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_UDP)
+			gso_type |= SKB_GSO_UDP_TUNNEL;
+		else if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_GRE)
+			gso_type |= SKB_GSO_GRE;
+		else if (flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV6)
+			gso_type |= SKB_GSO_IPXIP6;
+		else
+			gso_type |= SKB_GSO_IPXIP4;
+
+		if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_GRE ||
+		    flags & BPF_F_ADJ_ROOM_ENCAP_L4_UDP) {
+			int nh_len = flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 ?
+					sizeof(struct ipv6hdr) :
+					sizeof(struct iphdr);
+
+			skb_set_transport_header(skb, mac_len + nh_len);
+		}
+	}
+
 	if (skb_is_gso(skb)) {
 		struct skb_shared_info *shinfo = skb_shinfo(skb);
 
@@ -2991,7 +3044,7 @@  static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff,
 			skb_decrease_gso_size(shinfo, len_diff);
 
 		/* Header must be checked, and gso_segs recomputed. */
-		shinfo->gso_type |= SKB_GSO_DODGY;
+		shinfo->gso_type |= gso_type;
 		shinfo->gso_segs = 0;
 	}
 
@@ -3042,7 +3095,6 @@  static u32 __bpf_skb_max_len(const struct sk_buff *skb)
 BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
 	   u32, mode, u64, flags)
 {
-	bool trans_same = skb->transport_header == skb->network_header;
 	u32 len_cur, len_diff_abs = abs(len_diff);
 	u32 len_min = bpf_skb_net_base_len(skb);
 	u32 len_max = __bpf_skb_max_len(skb);
@@ -3071,8 +3123,6 @@  BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
 	}
 
 	len_cur = skb->len - skb_network_offset(skb);
-	if (skb_transport_header_was_set(skb) && !trans_same)
-		len_cur = skb_network_header_len(skb);
 	if ((shrink && (len_diff_abs >= len_cur ||
 			len_cur - len_diff_abs < len_min)) ||
 	    (!shrink && (skb->len + len_diff_abs > len_max &&