diff mbox series

[RFC,v2,29/33] xdp: allow bpf_xdp_adjust_tail() to grow packet size

Message ID 158634678170.707275.10720666808605360076.stgit@firesoul
State RFC
Delegated to: BPF Maintainers
Headers show
Series [RFC,v2,01/33] xdp: add frame size to xdp_buff | expand

Commit Message

Jesper Dangaard Brouer April 8, 2020, 11:53 a.m. UTC
Finally, after all drivers have a frame size, allow BPF-helper
bpf_xdp_adjust_tail() to grow or extend packet size at frame tail.

Remember that helper/macro xdp_data_hard_end have reserved some
tailroom.  Thus, this helper makes sure that the BPF-prog don't have
access to this tailroom area.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/uapi/linux/bpf.h |    4 ++--
 net/core/filter.c        |   18 ++++++++++++++++--
 2 files changed, 18 insertions(+), 4 deletions(-)

Comments

Saeed Mahameed April 9, 2020, 3:31 a.m. UTC | #1
On Wed, 2020-04-08 at 13:53 +0200, Jesper Dangaard Brouer wrote:
> Finally, after all drivers have a frame size, allow BPF-helper
> bpf_xdp_adjust_tail() to grow or extend packet size at frame tail.
> 

can you provide a list of usecases for why tail extension is necessary
?

and what do you have in mind as immediate use of bpf_xdp_adjust_tail()
? 

both cover letter and commit messages didn't list any actual use case..

> Remember that helper/macro xdp_data_hard_end have reserved some
> tailroom.  Thus, this helper makes sure that the BPF-prog don't have
> access to this tailroom area.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/uapi/linux/bpf.h |    4 ++--
>  net/core/filter.c        |   18 ++++++++++++++++--
>  2 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2e29a671d67e..0e5abe991ca3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1969,8 +1969,8 @@ union bpf_attr {
>   * int bpf_xdp_adjust_tail(struct xdp_buff *xdp_md, int delta)
>   * 	Description
>   * 		Adjust (move) *xdp_md*\ **->data_end** by *delta*
> bytes. It is
> - * 		only possible to shrink the packet as of this writing,
> - * 		therefore *delta* must be a negative integer.
> + * 		possible to both shrink and grow the packet tail.
> + * 		Shrink done via *delta* being a negative integer.
>   *
>   * 		A call to this helper is susceptible to change the
> underlying
>   * 		packet buffer. Therefore, at load time, all checks on
> pointers
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7628b947dbc3..4d58a147eed0 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3422,12 +3422,26 @@ static const struct bpf_func_proto
> bpf_xdp_adjust_head_proto = {
>  
>  BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
>  {
> +	void *data_hard_end = xdp_data_hard_end(xdp);
>  	void *data_end = xdp->data_end + offset;
>  
> -	/* only shrinking is allowed for now. */
> -	if (unlikely(offset >= 0))
> +	/* Notice that xdp_data_hard_end have reserved some tailroom */
> +	if (unlikely(data_end > data_hard_end))
>  		return -EINVAL;
>  

i don't know if i like this approach for couple of reasons.

1. drivers will provide arbitrary frames_sz, which is normally larger
than mtu, and could be a full page size, for XDP_TX action this can be
problematic if xdp progs will allow oversized packets to get caught at
the driver level.. 

2. xdp_data_hard_end(xdp) has a hardcoded assumption of the skb shinfo
and it introduces a reverse dependency between xdp buff and skbuff 

both of the above can be solved if the drivers provided the max allowed
frame size, already accounting for mtu and shinfo when setting
xdp_buff.frame_sz at the driver level.


> +	/* DANGER: ALL drivers MUST be converted to init xdp->frame_sz
> +	 * - Adding some chicken checks below
> +	 * - Will (likely) not be for upstream
> +	 */
> +	if (unlikely(xdp->frame_sz < (xdp->data_end - xdp-
> >data_hard_start))) {
> +		WARN(1, "Too small xdp->frame_sz = %d\n", xdp-
> >frame_sz);
> +		return -EINVAL;
> +	}
> +	if (unlikely(xdp->frame_sz > PAGE_SIZE)) {
> +		WARN(1, "Too BIG xdp->frame_sz = %d\n", xdp->frame_sz);
> +		return -EINVAL;
> +	}
> +
>  	if (unlikely(data_end < xdp->data + ETH_HLEN))
>  		return -EINVAL;
>  
> 
>
Jesper Dangaard Brouer April 14, 2020, 9:56 a.m. UTC | #2
On Wed, 08 Apr 2020 13:53:01 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7628b947dbc3..4d58a147eed0 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3422,12 +3422,26 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
>  
>  BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
>  {
> +	void *data_hard_end = xdp_data_hard_end(xdp);
>  	void *data_end = xdp->data_end + offset;
>  
[...]
> +	/* DANGER: ALL drivers MUST be converted to init xdp->frame_sz
> +	 * - Adding some chicken checks below
> +	 * - Will (likely) not be for upstream
> +	 */
> +	if (unlikely(xdp->frame_sz < (xdp->data_end - xdp->data_hard_start))) {
> +		WARN(1, "Too small xdp->frame_sz = %d\n", xdp->frame_sz);
> +		return -EINVAL;
> +	}
> +	if (unlikely(xdp->frame_sz > PAGE_SIZE)) {
> +		WARN(1, "Too BIG xdp->frame_sz = %d\n", xdp->frame_sz);
> +		return -EINVAL;
> +	}

Any opinions on above checks?
Should they be removed or kept?

The idea is to catch drivers that forgot to update xdp_buff->frame_sz,
by doing some sanity checks on this uninit value.  If I correctly
updated all XDP drivers in this patchset, then these checks should be
unnecessary, but will this be valuable for driver developers converting
new drivers to XDP to have these WARN checks?
Toke Høiland-Jørgensen April 14, 2020, 10:11 a.m. UTC | #3
Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Wed, 08 Apr 2020 13:53:01 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 7628b947dbc3..4d58a147eed0 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -3422,12 +3422,26 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
>>  
>>  BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
>>  {
>> +	void *data_hard_end = xdp_data_hard_end(xdp);
>>  	void *data_end = xdp->data_end + offset;
>>  
> [...]
>> +	/* DANGER: ALL drivers MUST be converted to init xdp->frame_sz
>> +	 * - Adding some chicken checks below
>> +	 * - Will (likely) not be for upstream
>> +	 */
>> +	if (unlikely(xdp->frame_sz < (xdp->data_end - xdp->data_hard_start))) {
>> +		WARN(1, "Too small xdp->frame_sz = %d\n", xdp->frame_sz);
>> +		return -EINVAL;
>> +	}
>> +	if (unlikely(xdp->frame_sz > PAGE_SIZE)) {
>> +		WARN(1, "Too BIG xdp->frame_sz = %d\n", xdp->frame_sz);
>> +		return -EINVAL;
>> +	}
>
> Any opinions on above checks?
> Should they be removed or kept?
>
> The idea is to catch drivers that forgot to update xdp_buff->frame_sz,
> by doing some sanity checks on this uninit value.  If I correctly
> updated all XDP drivers in this patchset, then these checks should be
> unnecessary, but will this be valuable for driver developers converting
> new drivers to XDP to have these WARN checks?

Hmm, I wonder if there's a way we could have these kinds of checks
available, but disabled by default? A new macro (e.g.,
XDP_CHECK(condition)) that is only enabled when some debug option is
enabled in the kernel build, perhaps? Or just straight ifdef'ing them
out, but maybe a macro would be generally useful?

-Toke
Jesper Dangaard Brouer April 14, 2020, 12:46 p.m. UTC | #4
On Thu, 9 Apr 2020 03:31:14 +0000
Saeed Mahameed <saeedm@mellanox.com> wrote:

> On Wed, 2020-04-08 at 13:53 +0200, Jesper Dangaard Brouer wrote:
> > Finally, after all drivers have a frame size, allow BPF-helper
> > bpf_xdp_adjust_tail() to grow or extend packet size at frame tail.
> >   
> 
> can you provide a list of usecases for why tail extension is necessary
> ?

Use-cases:
(1) IPsec / XFRM needs a tail extend[1][2].
(2) DNS-cache replies in XDP.
(3) HA-proxy ALOHA would need it to convert to XDP.
 
> and what do you have in mind as immediate use of bpf_xdp_adjust_tail()
> ? 

I guess Steffen Klassert's ipsec use-case(1) it the most immediate.

[1] http://vger.kernel.org/netconf2019_files/xfrm_xdp.pdf
[2] http://vger.kernel.org/netconf2019.html

> both cover letter and commit messages didn't list any actual use case..

Sorry about that.

> > Remember that helper/macro xdp_data_hard_end have reserved some
> > tailroom.  Thus, this helper makes sure that the BPF-prog don't have
> > access to this tailroom area.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  include/uapi/linux/bpf.h |    4 ++--
> >  net/core/filter.c        |   18 ++++++++++++++++--
> >  2 files changed, 18 insertions(+), 4 deletions(-)
> > 
[... cut ...]
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 7628b947dbc3..4d58a147eed0 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -3422,12 +3422,26 @@ static const struct bpf_func_proto
> > bpf_xdp_adjust_head_proto = {
> >  
> >  BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
> >  {
> > +	void *data_hard_end = xdp_data_hard_end(xdp);
> >  	void *data_end = xdp->data_end + offset;
> >  
> > -	/* only shrinking is allowed for now. */
> > -	if (unlikely(offset >= 0))
> > +	/* Notice that xdp_data_hard_end have reserved some tailroom */
> > +	if (unlikely(data_end > data_hard_end))
> >  		return -EINVAL;
> >    
> 
> i don't know if i like this approach for couple of reasons.
> 
> 1. drivers will provide arbitrary frames_sz, which is normally larger
> than mtu, and could be a full page size, for XDP_TX action this can be
> problematic if xdp progs will allow oversized packets to get caught at
> the driver level..

We already check if MTU is exceeded for a specific device when we
redirect into this, see helper xdp_ok_fwd_dev().  For the XDP_TX case,
I guess some drivers bypass that check, which should be fixed. The
XDP_TX case is IMHO a place where we allow drivers do special
optimizations, thus drivers can choose to do something faster than
calling generic helper xdp_ok_fwd_dev().  
  
> 
> 2. xdp_data_hard_end(xdp) has a hardcoded assumption of the skb shinfo
> and it introduces a reverse dependency between xdp buff and skbuff 
> 
(I'll address this in another mail)

> both of the above can be solved if the drivers provided the max
> allowed frame size, already accounting for mtu and shinfo when setting
> xdp_buff.frame_sz at the driver level.

It seems we look at the problem from two different angles.  You have
the drivers perspective, while I have the network stacks perspective
(the XDP_PASS case).  The mlx5 driver treats XDP as a special case, by
hiding or confining xdp_buff to functions fairly deep in the
call-stack.  My goal is different (moving SKB out of drivers), I see
the xdp_buff/xdp_frame as the main packet object in the drivers, that
gets send up the network stack (after converting to xdp_frame) and
converted into SKB in core-code (yes, there is a long road-ahead). The
larger tailroom can be used by netstack in SKB-coalesce.

The next step is making xdp_buff (and xdp_frame) multi-buffer aware.
This is why I reserve room for skb_shared_info.  I have considered
reducing the size of xdp_buff.frame_sz, with sizeof(skb_shared_info),
but it got kind of ugly having this in each drivers.

I also considered having drivers setup a direct pointer to
{skb,xdp}_shared_info section in xdp_buff, because will make it more
flexible (for what I imagined Alexander Duyck want).  (But we can still
do/change that later, once we start work in multi-buffer code)
Saeed Mahameed April 18, 2020, 3:33 a.m. UTC | #5
On Tue, 2020-04-14 at 14:46 +0200, Jesper Dangaard Brouer wrote:
> On Thu, 9 Apr 2020 03:31:14 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
> 
> > On Wed, 2020-04-08 at 13:53 +0200, Jesper Dangaard Brouer wrote:
> > > Finally, after all drivers have a frame size, allow BPF-helper
> > > bpf_xdp_adjust_tail() to grow or extend packet size at frame
> > > tail.
> > >   
> > 
> > can you provide a list of usecases for why tail extension is
> > necessary
> > ?
> 
> Use-cases:
> (1) IPsec / XFRM needs a tail extend[1][2].
> (2) DNS-cache replies in XDP.
> (3) HA-proxy ALOHA would need it to convert to XDP.
>  
> > and what do you have in mind as immediate use of
> > bpf_xdp_adjust_tail()
> > ? 
> 
> I guess Steffen Klassert's ipsec use-case(1) it the most immediate.
> 
> [1] http://vger.kernel.org/netconf2019_files/xfrm_xdp.pdf
> [2] http://vger.kernel.org/netconf2019.html
> 

Thanks !

> > both cover letter and commit messages didn't list any actual use
> > case..
> 
> Sorry about that.
> 
> > > Remember that helper/macro xdp_data_hard_end have reserved some
> > > tailroom.  Thus, this helper makes sure that the BPF-prog don't
> > > have
> > > access to this tailroom area.
> > > 
> > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > > ---
> > >  include/uapi/linux/bpf.h |    4 ++--
> > >  net/core/filter.c        |   18 ++++++++++++++++--
> > >  2 files changed, 18 insertions(+), 4 deletions(-)
> > > 
> [... cut ...]
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 7628b947dbc3..4d58a147eed0 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -3422,12 +3422,26 @@ static const struct bpf_func_proto
> > > bpf_xdp_adjust_head_proto = {
> > >  
> > >  BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int,
> > > offset)
> > >  {
> > > +	void *data_hard_end = xdp_data_hard_end(xdp);
> > >  	void *data_end = xdp->data_end + offset;
> > >  
> > > -	/* only shrinking is allowed for now. */
> > > -	if (unlikely(offset >= 0))
> > > +	/* Notice that xdp_data_hard_end have reserved some tailroom */
> > > +	if (unlikely(data_end > data_hard_end))
> > >  		return -EINVAL;
> > >    
> > 
> > i don't know if i like this approach for couple of reasons.
> > 
> > 1. drivers will provide arbitrary frames_sz, which is normally
> > larger
> > than mtu, and could be a full page size, for XDP_TX action this can
> > be
> > problematic if xdp progs will allow oversized packets to get caught
> > at
> > the driver level..
> 
> We already check if MTU is exceeded for a specific device when we
> redirect into this, see helper xdp_ok_fwd_dev().  For the XDP_TX
> case,
> I guess some drivers bypass that check, which should be fixed. The
> XDP_TX case is IMHO a place where we allow drivers do special
> optimizations, thus drivers can choose to do something faster than
> calling generic helper xdp_ok_fwd_dev().  
>   
> > 2. xdp_data_hard_end(xdp) has a hardcoded assumption of the skb
> > shinfo
> > and it introduces a reverse dependency between xdp buff and skbuff 
> > 
> (I'll address this in another mail)
> 
> > both of the above can be solved if the drivers provided the max
> > allowed frame size, already accounting for mtu and shinfo when
> > setting
> > xdp_buff.frame_sz at the driver level.
> 
> It seems we look at the problem from two different angles.  You have
> the drivers perspective, while I have the network stacks perspective
> (the XDP_PASS case).  The mlx5 driver treats XDP as a special case,
> by
> hiding or confining xdp_buff to functions fairly deep in the
> call-stack.  My goal is different (moving SKB out of drivers), I see
> the xdp_buff/xdp_frame as the main packet object in the drivers, that
> gets send up the network stack (after converting to xdp_frame) and
> converted into SKB in core-code (yes, there is a long road-ahead).
> The
> larger tailroom can be used by netstack in SKB-coalesce.
> 

But to achieve a proper model, the drivers must be notified about the
size of the tailroom they must preserve, now we are just hardcoding it,
where it even doesn't belong. I don't know what the right solution yet.
but we are still not there .. once we totally move memory management
out of the driver, then we might have a better way to preserve head and
tail-room .. 

> The next step is making xdp_buff (and xdp_frame) multi-buffer aware.
> This is why I reserve room for skb_shared_info.  I have considered

this needs to be carefully crafted.. as we don't want to endup with one
more SKB type thing to deal with.. 


> reducing the size of xdp_buff.frame_sz, with sizeof(skb_shared_info),
> but it got kind of ugly having this in each drivers.
> 

can be done via memory model registration ?

> I also considered having drivers setup a direct pointer to
> {skb,xdp}_shared_info section in xdp_buff, because will make it more
> flexible (for what I imagined Alexander Duyck want).  (But we can
> still
> do/change that later, once we start work in multi-buffer code)
> 

you mean something like xdp->data_tail or xdp->data_hard_end ?
diff mbox series

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2e29a671d67e..0e5abe991ca3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1969,8 +1969,8 @@  union bpf_attr {
  * int bpf_xdp_adjust_tail(struct xdp_buff *xdp_md, int delta)
  * 	Description
  * 		Adjust (move) *xdp_md*\ **->data_end** by *delta* bytes. It is
- * 		only possible to shrink the packet as of this writing,
- * 		therefore *delta* must be a negative integer.
+ * 		possible to both shrink and grow the packet tail.
+ * 		Shrink done via *delta* being a negative integer.
  *
  * 		A call to this helper is susceptible to change the underlying
  * 		packet buffer. Therefore, at load time, all checks on pointers
diff --git a/net/core/filter.c b/net/core/filter.c
index 7628b947dbc3..4d58a147eed0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3422,12 +3422,26 @@  static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
 
 BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
 {
+	void *data_hard_end = xdp_data_hard_end(xdp);
 	void *data_end = xdp->data_end + offset;
 
-	/* only shrinking is allowed for now. */
-	if (unlikely(offset >= 0))
+	/* Notice that xdp_data_hard_end have reserved some tailroom */
+	if (unlikely(data_end > data_hard_end))
 		return -EINVAL;
 
+	/* DANGER: ALL drivers MUST be converted to init xdp->frame_sz
+	 * - Adding some chicken checks below
+	 * - Will (likely) not be for upstream
+	 */
+	if (unlikely(xdp->frame_sz < (xdp->data_end - xdp->data_hard_start))) {
+		WARN(1, "Too small xdp->frame_sz = %d\n", xdp->frame_sz);
+		return -EINVAL;
+	}
+	if (unlikely(xdp->frame_sz > PAGE_SIZE)) {
+		WARN(1, "Too BIG xdp->frame_sz = %d\n", xdp->frame_sz);
+		return -EINVAL;
+	}
+
 	if (unlikely(data_end < xdp->data + ETH_HLEN))
 		return -EINVAL;