[net-next,2/6] bpf: add meta pointer for direct access

Message ID 458f9c13ab58abb1a15627906d03c33c42b02a7c.1506297988.git.daniel@iogearbox.net
State Accepted, archived
Delegated to: David Miller
Series BPF metadata for direct access

Commit Message

Daniel Borkmann Sept. 25, 2017, 12:25 a.m. UTC
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out such that we first point
to data_hard_start, then data_meta directly prepended to data, followed
by data_end marking the end of the packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.

xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as a DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be a multiple of 4 bytes, up to 32 bytes in size. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1, since bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.

The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the comparison against xdp->data is that it hasn't been modified from its
original address that we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.
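
As a rough illustration only (not taken from the patches; struct meta_info,
its mark value and the section names below are made up, and the includes
assume the usual bpf_helpers.h from samples/bpf providing SEC() and the
bpf_xdp_adjust_meta() wrapper), an XDP program could reserve and fill the
area, and a clsact ingress program could read it back along these lines:

 #include <linux/bpf.h>
 #include <linux/pkt_cls.h>
 #include "bpf_helpers.h"

 struct meta_info {
 	__u32 mark;	/* total size: multiple of 4 bytes, at most 32 bytes */
 };

 SEC("xdp_mark")
 int xdp_prog(struct xdp_md *ctx)
 {
 	void *data, *data_meta;
 	struct meta_info *meta;

 	/* Reserve room in front of xdp->data; a negative offset grows the
 	 * meta area, and the helper rejects sizes that are not a multiple
 	 * of 4 bytes or are larger than 32 bytes.
 	 */
 	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)) < 0)
 		return XDP_PASS;

 	data      = (void *)(unsigned long)ctx->data;
 	data_meta = (void *)(unsigned long)ctx->data_meta;
 	meta      = data_meta;

 	/* Bounds check against xdp->data, as the verifier requires. */
 	if ((void *)(meta + 1) > data)
 		return XDP_PASS;

 	meta->mark = 42;
 	return XDP_PASS;
 }

 SEC("tc_mark")
 int tc_prog(struct __sk_buff *skb)
 {
 	void *data      = (void *)(unsigned long)skb->data;
 	void *data_meta = (void *)(unsigned long)skb->data_meta;
 	struct meta_info *meta = data_meta;

 	/* If no metadata was reserved in XDP, data_meta equals data and
 	 * this check fails, so we simply pass the packet on.
 	 */
 	if ((void *)(meta + 1) > data)
 		return TC_ACT_OK;

 	skb->mark = meta->mark;
 	return TC_ACT_OK;
 }

 char _license[] SEC("license") = "GPL";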

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |   1 +
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c        |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      |   1 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   1 +
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |   1 +
 drivers/net/ethernet/qlogic/qede/qede_fp.c         |   1 +
 drivers/net/tun.c                                  |   1 +
 drivers/net/virtio_net.c                           |   2 +
 include/linux/bpf.h                                |   1 +
 include/linux/filter.h                             |  21 +++-
 include/linux/skbuff.h                             |  68 +++++++++++-
 include/uapi/linux/bpf.h                           |  13 ++-
 kernel/bpf/verifier.c                              | 114 ++++++++++++++++-----
 net/bpf/test_run.c                                 |   1 +
 net/core/dev.c                                     |  31 +++++-
 net/core/filter.c                                  |  77 +++++++++++++-
 net/core/skbuff.c                                  |   2 +
 19 files changed, 297 insertions(+), 42 deletions(-)

Comments

Andy Gospodarek Sept. 25, 2017, 6:10 p.m. UTC | #1
On Mon, Sep 25, 2017 at 02:25:51AM +0200, Daniel Borkmann wrote:
> This work enables generic transfer of metadata from XDP into skb. The
> basic idea is that we can make use of the fact that the resulting skb
> must be linear and already comes with a larger headroom for supporting
> bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
> on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
> for adjusting a new pointer called xdp->data_meta. Thus, the packet has
> a flexible and programmable room for meta data, followed by the actual
> packet data. struct xdp_buff is therefore laid out that we first point
> to data_hard_start, then data_meta directly prepended to data followed
> by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
> account whether we have meta data already prepended and if so, memmove()s
> this along with the given offset provided there's enough room.
> 
> xdp->data_meta is optional and programs are not required to use it. The
> rationale is that when we process the packet in XDP (e.g. as DoS filter),
> we can push further meta data along with it for the XDP_PASS case, and
> give the guarantee that a clsact ingress BPF program on the same device
> can pick this up for further post-processing. Since we work with skb
> there, we can also set skb->mark, skb->priority or other skb meta data
> out of BPF, thus having this scratch space generic and programmable
> allows for more flexibility than defining a direct 1:1 transfer of
> potentially new XDP members into skb (it's also more efficient as we
> don't need to initialize/handle each of such new members).  The facility
> also works together with GRO aggregation. The scratch space at the head
> of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
> yet supporting xdp->data_meta can simply be set up with xdp->data_meta
> as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
> such that the subsequent match against xdp->data for later access is
> guaranteed to fail.
> 
> The verifier treats xdp->data_meta/xdp->data the same way as we treat
> xdp->data/xdp->data_end pointer comparisons. The requirement for doing
> the compare against xdp->data is that it hasn't been modified from it's
> original address we got from ctx access. It may have a range marking
> already from prior successful xdp->data/xdp->data_end pointer comparisons
> though.

First, thanks for this detailed description.  It was helpful to read
along with the patches.

My only concern about this area being generic is that you are now in a
state where any bpf program must know about all the bpf programs in the
receive pipeline before it can properly parse what is stored in the
meta-data and add it to an skb (or perform any other action).
Especially if each program adds its own meta-data along the way.

Maybe this isn't a big concern based on the number of users of this
today, but it just starts to seem like a concern as there are these
hints being passed between layers that are challenging to track due to the
lack of a standard format for passing data between them.

The main reason I bring this up is that Michael and I had discussed and
designed a way for drivers to communicate with each other that rx
resources could be freed after a tx completion on an XDP_REDIRECT
action.  Much like this code, it involved adding a new element to
struct xdp_md that could point to the important information.  Now that
there is a generic way to handle this, it would seem nice to be able to
leverage it, but I'm not sure how reliable this meta-data area would be
without the ability to mark it in some manner.

For additional background, the minimum amount of data needed in the case
Michael and I were discussing was really 2 words: one to serve as a
pointer to an rx_ring structure and one to hold a counter for the rx
producer entry.  This data could be accessed by the driver processing the
tx completions, with a callback to the driver that received the frame off the wire
to perform any needed processing.  (For those curious this would also require a
new callback/netdev op to act on this data stored in the XDP buffer.)

IIUC, I could use this meta_data area to store this information, but would it
also be useful to create some type field/marker that could also be stored in
the meta_data to indicate what type of information is there?  I hate to propose
such a thing as it may add unneeded complexity, but I just wanted to make sure
to say something before it was too late as there may be more users of this
right away.  :-)
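
Purely to illustrate the type field/marker idea (nothing like this exists
in the patch set, and all names below are hypothetical), the metadata area
could start with a small tag so a consumer knows what a given producer
stored there:

 enum xdp_meta_type {
 	XDP_META_NONE	= 0,
 	XDP_META_MARK	= 1,	/* e.g. a BPF program handing over an skb->mark */
 	XDP_META_REDIR	= 2,	/* e.g. the rx_ring pointer + producer counter */
 };

 struct xdp_meta_hdr {
 	__u16 type;	/* enum xdp_meta_type */
 	__u16 len;	/* bytes of payload following this header */
 	/* payload follows, e.g. for XDP_META_REDIR the two words
 	 * (rx_ring pointer and rx producer index) mentioned above
 	 */
 };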

> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |   1 +
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   1 +
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c        |   1 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c      |   1 +
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c         |   1 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |   1 +
>  .../net/ethernet/netronome/nfp/nfp_net_common.c    |   1 +
>  drivers/net/ethernet/qlogic/qede/qede_fp.c         |   1 +
>  drivers/net/tun.c                                  |   1 +
>  drivers/net/virtio_net.c                           |   2 +
>  include/linux/bpf.h                                |   1 +
>  include/linux/filter.h                             |  21 +++-
>  include/linux/skbuff.h                             |  68 +++++++++++-
>  include/uapi/linux/bpf.h                           |  13 ++-
>  kernel/bpf/verifier.c                              | 114 ++++++++++++++++-----
>  net/bpf/test_run.c                                 |   1 +
>  net/core/dev.c                                     |  31 +++++-
>  net/core/filter.c                                  |  77 +++++++++++++-
>  net/core/skbuff.c                                  |   2 +
>  19 files changed, 297 insertions(+), 42 deletions(-)
[...]
Daniel Borkmann Sept. 25, 2017, 6:50 p.m. UTC | #2
On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
[...]
> First, thanks for this detailed description.  It was helpful to read
> along with the patches.
>
> My only concern about this area being generic is that you are now in a
> state where any bpf program must know about all the bpf programs in the
> receive pipeline before it can properly parse what is stored in the
> meta-data and add it to an skb (or perform any other action).
> Especially if each program adds it's own meta-data along the way.
>
> Maybe this isn't a big concern based on the number of users of this
> today, but it just starts to seem like a concern as there are these
> hints being passed between layers that are challenging to track due to a
> lack of a standard format for passing data between.

Btw, we do have similar kinds of programmable scratch buffers today as
well: skb cb[] that you can program from the tc side, the perf ring
buffer, which doesn't have any fixed layout for its slots, a per-cpu map
where you can transfer data between tail calls for example, tail calls
themselves that need to coordinate, or simply mangling of the packet
itself if you will, but more below on your use case ...
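
As a small, hedged sketch of the cb[] case (not from this set; section
names and values are made up), two cooperating clsact programs could
share state like this, with the layout purely a convention between them:

 #include <linux/bpf.h>
 #include <linux/pkt_cls.h>
 #include "bpf_helpers.h"

 SEC("tc_producer")
 int tc_producer(struct __sk_buff *skb)
 {
 	skb->cb[0] = 0xfeed;	/* arbitrary example value */
 	return TC_ACT_UNSPEC;	/* continue to the next filter */
 }

 SEC("tc_consumer")
 int tc_consumer(struct __sk_buff *skb)
 {
 	if (skb->cb[0] == 0xfeed)
 		skb->mark = 1;
 	return TC_ACT_OK;
 }

 char _license[] SEC("license") = "GPL";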

> The main reason I bring this up is that Michael and I had discussed and
> designed a way for drivers to communicate between each other that rx
> resources could be freed after a tx completion on an XDP_REDIRECT
> action.  Much like this code, it involved adding an new element to
> struct xdp_md that could point to the important information.  Now that
> there is a generic way to handle this, it would seem nice to be able to
> leverage it, but I'm not sure how reliable this meta-data area would be
> without the ability to mark it in some manner.
>
> For additional background, the minimum amount of data needed in the case
> Michael and I were discussing was really 2 words.  One to serve as a
> pointer to an rx_ring structure and one to have a counter to the rx
> producer entry.  This data could be acessed by the driver processing the
> tx completions and callback to the driver that received the frame off the wire
> to perform any needed processing.  (For those curious this would also require a
> new callback/netdev op to act on this data stored in the XDP buffer.)

What you describe above doesn't seem to fit the use case of this set:
the area here is fully programmable from within the BPF program, whereas
the infrastructure you're describing is some sort of means of
communication between drivers for XDP_REDIRECT, and should be outside
of the control of the BPF program to mangle.

You could probably reuse the base infra here and make a part of that
inaccessible for the program with some sort of a fixed layout, but I
haven't seen your code yet to be able to fully judge. Intention here
is to allow for programmability within the BPF prog in a generic way,
such that based on the use-case it can be populated in specific ways
and propagated to the skb w/o having to define a fixed layout and
bloat xdp_buff all the way to an skb while still retaining all the
flexibility.

Thanks,
Daniel
John Fastabend Sept. 25, 2017, 7:47 p.m. UTC | #3
On 09/25/2017 11:50 AM, Daniel Borkmann wrote:
> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
> [...]
>> First, thanks for this detailed description.  It was helpful to read
>> along with the patches.
>>
>> My only concern about this area being generic is that you are now in a
>> state where any bpf program must know about all the bpf programs in the
>> receive pipeline before it can properly parse what is stored in the
>> meta-data and add it to an skb (or perform any other action).
>> Especially if each program adds it's own meta-data along the way.
>>
>> Maybe this isn't a big concern based on the number of users of this
>> today, but it just starts to seem like a concern as there are these
>> hints being passed between layers that are challenging to track due to a
>> lack of a standard format for passing data between.
> 
> Btw, we do have similar kind of programmable scratch buffer also today
> wrt skb cb[] that you can program from tc side, the perf ring buffer,
> which doesn't have any fixed layout for the slots, or a per-cpu map
> where you can transfer data between tail calls for example, then tail
> calls themselves that need to coordinate, or simply mangling of packets
> itself if you will, but more below to your use case ...
> 
>> The main reason I bring this up is that Michael and I had discussed and
>> designed a way for drivers to communicate between each other that rx
>> resources could be freed after a tx completion on an XDP_REDIRECT
>> action.  Much like this code, it involved adding an new element to
>> struct xdp_md that could point to the important information.  Now that
>> there is a generic way to handle this, it would seem nice to be able to
>> leverage it, but I'm not sure how reliable this meta-data area would be
>> without the ability to mark it in some manner.
>>
>> For additional background, the minimum amount of data needed in the case
>> Michael and I were discussing was really 2 words.  One to serve as a
>> pointer to an rx_ring structure and one to have a counter to the rx
>> producer entry.  This data could be acessed by the driver processing the
>> tx completions and callback to the driver that received the frame off
>> the wire
>> to perform any needed processing.  (For those curious this would also
>> require a
>> new callback/netdev op to act on this data stored in the XDP buffer.)
> 
> What you describe above doesn't seem to be fitting to the use-case of
> this set, meaning the area here is fully programmable out of the BPF
> program, the infrastructure you're describing is some sort of means of
> communication between drivers for the XDP_REDIRECT, and should be
> outside of the control of the BPF program to mangle.
> 
> You could probably reuse the base infra here and make a part of that
> inaccessible for the program with some sort of a fixed layout, but I
> haven't seen your code yet to be able to fully judge. Intention here
> is to allow for programmability within the BPF prog in a generic way,
> such that based on the use-case it can be populated in specific ways
> and propagated to the skb w/o having to define a fixed layout and
> bloat xdp_buff all the way to an skb while still retaining all the
> flexibility.
> 
> Thanks,
> Daniel

Hi Andy,

I'm guessing this data needs to be passed from the input dev to the
output dev based on your description. If the driver data
is pushed after the BPF program is run but before the xdp_do_flush_map
call, no other BPF programs can be run on that xdp_buff. It
should be safe at this point to use the metadata region directly
from the driver. We would just need to add a few helpers for the
drivers to use for this, maybe xdp_metadata_write_drv and
xdp_metadata_read_drv. I think this would work for your use case?
The data structure would have to be agreed upon by all the drivers
but would not be UAPI because it would only be exposed in the
drivers, so we would be free to change/update it as needed.
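
A rough sketch of what such driver-only helpers might look like (the two
names come from the paragraph above; the struct layout and error handling
are made up and not part of this series):

 #include <linux/errno.h>
 #include <linux/filter.h>	/* struct xdp_buff */
 #include <linux/string.h>

 struct xdp_drv_meta {
 	void *rx_ring;		/* opaque to everything but the RX driver */
 	u32   rx_prod;		/* producer index to recycle on TX completion */
 };

 /* Pushed by the RX driver after the BPF program has run, so the program
  * itself never sees it and cannot mangle it.
  */
 static inline int xdp_metadata_write_drv(struct xdp_buff *xdp,
 					 const struct xdp_drv_meta *meta)
 {
 	if (xdp->data_meta - sizeof(*meta) < xdp->data_hard_start)
 		return -ENOSPC;

 	xdp->data_meta -= sizeof(*meta);
 	memcpy(xdp->data_meta, meta, sizeof(*meta));
 	return 0;
 }

 /* Popped again by the driver handling the TX completion after an
  * XDP_REDIRECT, given it still has a handle on the buffer.
  */
 static inline void xdp_metadata_read_drv(struct xdp_buff *xdp,
 					 struct xdp_drv_meta *meta)
 {
 	memcpy(meta, xdp->data_meta, sizeof(*meta));
 	xdp->data_meta += sizeof(*meta);
 }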

Thanks,
John
Andy Gospodarek Sept. 26, 2017, 5:21 p.m. UTC | #4
On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
> [...]
> > First, thanks for this detailed description.  It was helpful to read
> > along with the patches.
> > 
> > My only concern about this area being generic is that you are now in a
> > state where any bpf program must know about all the bpf programs in the
> > receive pipeline before it can properly parse what is stored in the
> > meta-data and add it to an skb (or perform any other action).
> > Especially if each program adds it's own meta-data along the way.
> > 
> > Maybe this isn't a big concern based on the number of users of this
> > today, but it just starts to seem like a concern as there are these
> > hints being passed between layers that are challenging to track due to a
> > lack of a standard format for passing data between.
> 
> Btw, we do have similar kind of programmable scratch buffer also today
> wrt skb cb[] that you can program from tc side, the perf ring buffer,
> which doesn't have any fixed layout for the slots, or a per-cpu map
> where you can transfer data between tail calls for example, then tail
> calls themselves that need to coordinate, or simply mangling of packets
> itself if you will, but more below to your use case ...
> 
> > The main reason I bring this up is that Michael and I had discussed and
> > designed a way for drivers to communicate between each other that rx
> > resources could be freed after a tx completion on an XDP_REDIRECT
> > action.  Much like this code, it involved adding an new element to
> > struct xdp_md that could point to the important information.  Now that
> > there is a generic way to handle this, it would seem nice to be able to
> > leverage it, but I'm not sure how reliable this meta-data area would be
> > without the ability to mark it in some manner.
> > 
> > For additional background, the minimum amount of data needed in the case
> > Michael and I were discussing was really 2 words.  One to serve as a
> > pointer to an rx_ring structure and one to have a counter to the rx
> > producer entry.  This data could be acessed by the driver processing the
> > tx completions and callback to the driver that received the frame off the wire
> > to perform any needed processing.  (For those curious this would also require a
> > new callback/netdev op to act on this data stored in the XDP buffer.)
> 
> What you describe above doesn't seem to be fitting to the use-case of
> this set, meaning the area here is fully programmable out of the BPF
> program, the infrastructure you're describing is some sort of means of
> communication between drivers for the XDP_REDIRECT, and should be
> outside of the control of the BPF program to mangle.

OK, I understand that perspective.  I think saying this is really meant
as a BPF<->BPF communication channel for now is fine.

> You could probably reuse the base infra here and make a part of that
> inaccessible for the program with some sort of a fixed layout, but I
> haven't seen your code yet to be able to fully judge. Intention here
> is to allow for programmability within the BPF prog in a generic way,
> such that based on the use-case it can be populated in specific ways
> and propagated to the skb w/o having to define a fixed layout and
> bloat xdp_buff all the way to an skb while still retaining all the
> flexibility.

Some level of reuse might be appropriate, but I'd rather it be explicit for
my use case, since it's not exclusively something that will need to be used
by a BPF prog, but rather by the driver.  I'll produce some patches this
week for reference.
Jesper Dangaard Brouer Sept. 26, 2017, 7:13 p.m. UTC | #5
On Mon, 25 Sep 2017 02:25:51 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> This work enables generic transfer of metadata from XDP into skb. The
> basic idea is that we can make use of the fact that the resulting skb
> must be linear and already comes with a larger headroom for supporting
> bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
> on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
> for adjusting a new pointer called xdp->data_meta. Thus, the packet has
> a flexible and programmable room for meta data, followed by the actual
> packet data. struct xdp_buff is therefore laid out that we first point
> to data_hard_start, then data_meta directly prepended to data followed
> by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
> account whether we have meta data already prepended and if so, memmove()s
> this along with the given offset provided there's enough room.
> 
> [...] The scratch space at the head
> of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
> yet supporting xdp->data_meta can simply be set up with xdp->data_meta
> as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
> such that the subsequent match against xdp->data for later access is
> guaranteed to fail.

So, xdp->data_meta is placed just before where the packet data (xdp->data) starts.

I'm currently implementing a cpumap type, that transfers raw XDP frames
to another CPU, and the SKB is allocated on the remote CPU.  (It
actually works extremely well).  

For transferring info I need, I'm currently using xdp->data_hard_start
(the top/start of the xdp page).  Which should be compatible with your
approach, right?

The info I need:

 struct xdp_pkt {
	void *data;
	u16 len;
	u16 headroom;
	struct net_device *dev_rx;
 };

When I enqueue the xdp packet I do the following:

 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
	struct net_device *dev_rx)
 {
	struct xdp_pkt *xdp_pkt;
	int headroom;

	/* Convert xdp_buff to xdp_pkt */
	headroom = xdp->data - xdp->data_hard_start;
	if (headroom < sizeof(*xdp_pkt))
		return -EOVERFLOW;
	xdp_pkt = xdp->data_hard_start;
	xdp_pkt->data = xdp->data;
	xdp_pkt->len  = xdp->data_end - xdp->data;
	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);

	/* Info needed when constructing SKB on remote CPU */
	xdp_pkt->dev_rx = dev_rx;

	bq_enqueue(rcpu, xdp_pkt);
	return 0;
 }

On the remote CPU dequeueing the packet, I'm doing the following.  As
you can see I'm still lacking some meta-data, that would be nice to
also transfer.  Could I use your infrastructure for that?

 static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
					  struct xdp_pkt *xdp_pkt)
 {
	unsigned int truesize;
	void *pkt_data_start;
	struct sk_buff *skb;

	/* TODO: rcpu could provide truesize, it's static per RX-ring */
	truesize = 2048;

	// pkt_data_start = xdp_pkt + sizeof(*xdp_pkt);
	pkt_data_start = xdp_pkt->data - xdp_pkt->headroom;

	/* Need to adjust "truesize" for skb_shared_info to get proper
	 * placed, to take into account that xdp_pkt is using part of
	 * headroom
	 */
	skb = build_skb(pkt_data_start, truesize - sizeof(*xdp_pkt));
	if (!skb)
		return NULL;

	skb_reserve(skb, xdp_pkt->headroom);
	__skb_put(skb, xdp_pkt->len);

	// skb_record_rx_queue(skb, rx_ring->queue_index);
	skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);

	// How much does csum matter? 
 //	skb->ip_summed = CHECKSUM_UNNECESSARY; // Try to fake it...

	// Does setting skb_set_hash()) matter?
 //	__skb_set_hash(skb, 42, true, false); // Say it is software
 //	__skb_set_hash(skb, 42, false, true); // Say it is hardware

	// Do we lack setting rx_queue... it doesn't seem to matter
 //	skb_record_rx_queue(skb, 0);

	return skb;
 }

(I'll send out some patches soonish, hopefully tomorrow... to show in
more detail what I'm doing)
Daniel Borkmann Sept. 26, 2017, 7:58 p.m. UTC | #6
On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
[...]
> I'm currently implementing a cpumap type, that transfers raw XDP frames
> to another CPU, and the SKB is allocated on the remote CPU.  (It
> actually works extremely well).

Meaning you let all the XDP_PASS packets get processed on a
different CPU, so you can reserve the whole CPU just for
prefiltering, right? Do you have some numbers to share at
this point, just curious when you mention it works extremely
well.

> For transferring info I need, I'm currently using xdp->data_hard_start
> (the top/start of the xdp page).  Which should be compatible with your
> approach, right?

Should be possible, yes. More below.

> The info I need:
>
>   struct xdp_pkt {
> 	void *data;
> 	u16 len;
> 	u16 headroom;
> 	struct net_device *dev_rx;
>   };
>
> When I enqueue the xdp packet I do the following:
>
>   int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
> 	struct net_device *dev_rx)
>   {
> 	struct xdp_pkt *xdp_pkt;
> 	int headroom;
>
> 	/* Convert xdp_buff to xdp_pkt */
> 	headroom = xdp->data - xdp->data_hard_start;
> 	if (headroom < sizeof(*xdp_pkt))
> 		return -EOVERFLOW;
> 	xdp_pkt = xdp->data_hard_start;
> 	xdp_pkt->data = xdp->data;
> 	xdp_pkt->len  = xdp->data_end - xdp->data;
> 	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
>
> 	/* Info needed when constructing SKB on remote CPU */
> 	xdp_pkt->dev_rx = dev_rx;
>
> 	bq_enqueue(rcpu, xdp_pkt);
> 	return 0;
>   }
>
> On the remote CPU dequeueing the packet, I'm doing the following.  As
> you can see I'm still lacking some meta-data, that would be nice to
> also transfer.  Could I use your infrastructure for that?

There could be multiple options to use it. In case you have a
helper where you look up the CPU in the map and would also store
the meta data, you could use a per-CPU scratch buffer, similar to
what we do with struct redirect_info, and move that later, e.g.
after program return, into the xdp->data_hard_start area. You
could also potentially reserve that upfront, so it's hidden from
the program from the beginning unless you want the program itself
to fill it out (modulo the pointers). Not all drivers currently
leave room though; I've also seen cases where xdp->data_hard_start
points directly to xdp->data, so there's 0 headroom available to
use. In such a case it could either be treated as a hint, where
those drivers just pass the skb up on the current CPU, or you
would need some other means to move the meta data to the remote
CPU, or potentially just use tail room.
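
A hedged sketch of that per-CPU scratch approach, loosely modelled on how
struct redirect_info is used today (all names are hypothetical, and struct
xdp_pkt is the one from your mail above):

 #include <linux/errno.h>
 #include <linux/filter.h>	/* struct xdp_buff */
 #include <linux/netdevice.h>
 #include <linux/percpu-defs.h>

 struct cpu_map_scratch {
 	struct net_device *dev_rx;	/* whatever the remote CPU will need */
 };

 static DEFINE_PER_CPU(struct cpu_map_scratch, cpu_map_scratch);

 /* Filled in from the cpumap lookup/redirect helper while the program runs. */
 static void cpu_map_scratch_set(struct net_device *dev_rx)
 {
 	this_cpu_ptr(&cpu_map_scratch)->dev_rx = dev_rx;
 }

 /* After program return, move the scratch data into the headroom at
  * xdp->data_hard_start (if the driver left any) before bq_enqueue().
  */
 static int cpu_map_scratch_flush(struct xdp_buff *xdp)
 {
 	struct cpu_map_scratch *scratch = this_cpu_ptr(&cpu_map_scratch);
 	struct xdp_pkt *xdp_pkt;
 	int headroom = xdp->data - xdp->data_hard_start;

 	if (headroom < sizeof(*xdp_pkt))
 		return -EOVERFLOW;

 	xdp_pkt = xdp->data_hard_start;
 	xdp_pkt->dev_rx = scratch->dev_rx;
 	return 0;
 }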

Thanks,
Daniel
Jesper Dangaard Brouer Sept. 27, 2017, 9:26 a.m. UTC | #7
On Tue, 26 Sep 2017 21:58:53 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> [...]
> > I'm currently implementing a cpumap type, that transfers raw XDP frames
> > to another CPU, and the SKB is allocated on the remote CPU.  (It
> > actually works extremely well).  
> 
> Meaning you let all the XDP_PASS packets get processed on a
> different CPU, so you can reserve the whole CPU just for
> prefiltering, right? 

Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
The trick is using the map-flush point, to transfer packets in bulk to
the remote CPU (single call IPC is too slow), but at the same time
flush single packets if NAPI didn't see a bulk.

> Do you have some numbers to share at this point, just curious when
> you mention it works extremely well.

Sure... I've done a lot of benchmarking on this patchset ;-)
I have a benchmark program called xdp_redirect_cpu [1][2], that collects
stats via tracepoints (atm I'm limiting bulking to 8 packets, and have
tracepoints at the bulk spots, to amortize the tracepoint cost 25ns/8=3.125ns)

 [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
 [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c

Here I'm installing a DDoS program that drops UDP port 9 (pktgen
packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU, that
the 11.9Mpps DDoS attack is hitting.

Running XDP/eBPF prog_num:4
XDP-cpumap      CPU:to  pps            drop-pps    extra-info
XDP-RX          0       12,030,471     11,966,982  0          
XDP-RX          total   12,030,471     11,966,982 
cpumap-enqueue    0:2   63,488         0           0          
cpumap-enqueue  sum:2   63,488         0           0          
cpumap_kthread  2       63,488         0           3          time_exceed
cpumap_kthread  total   63,488         0           0          
redirect_err    total   0              0          

$ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  1024     1024    10.00    12735.97   
16384  87380 

The netperf TCP_CRR performance is the same as without XDP loaded.


> Another test

I've previously shown (and optimized) in commit c0303efeab73 ("net:
reduce cycles spend on ICMP replies that gets rate limited"), that my
system can handle approx 2.7Mpps for UdpNoPorts, before the network
stack chokes.

Thus it is interesting to see, when I get UDP traffic that hits the
same CPU, if I can simply round-robin distribute it other CPUs.  This
evaluate if the cross-CPU transfer mechanism is fast-enough.

I do have to increase the ixgbe RX-ring size, else the ixgbe recycle
scheme breaks down, and we stall on the page spin_lock (as Tariq has
demonstrated before).

 # ethtool -G ixgbe1 rx 1024 tx 1024

Start RR program and add some CPUs:

 # ./xdp_redirect_cpu --dev ixgbe1 --prog 2 --cpu 1 --cpu 2 --cpu 3 --cpu 4

Running XDP/eBPF prog_num:2
XDP-cpumap      CPU:to  pps            drop-pps    extra-info
XDP-RX          0       11,006,992     0           0          
XDP-RX          total   11,006,992     0          
cpumap-enqueue    0:1   2,751,744      0           0          
cpumap-enqueue  sum:1   2,751,744      0           0          
cpumap-enqueue    0:2   2,751,748      0           0          
cpumap-enqueue  sum:2   2,751,748      0           0          
cpumap-enqueue    0:3   2,751,744      35          0          
cpumap-enqueue  sum:3   2,751,744      35          0          
cpumap-enqueue    0:4   2,751,748      0           0          
cpumap-enqueue  sum:4   2,751,748      0           0          
cpumap_kthread  1       2,751,745      0           156        time_exceed
cpumap_kthread  2       2,751,749      0           142        time_exceed
cpumap_kthread  3       2,751,713      0           131        time_exceed
cpumap_kthread  4       2,751,749      0           128        time_exceed
cpumap_kthread  total   11,006,957     0           0          
redirect_err    total   0              0          

$ nstat > /dev/null && sleep 1 && nstat | grep UdpNoPorts
UdpNoPorts                      11042282           0.0

The nstat output shows that the Linux network stack is actually now
processing (SKB alloc + free) 11Mpps.

The generator was sending at 14Mpps, thus the XDP-RX program is
actually a bottleneck here, and I do see some drops at the HW level.
Thus, one CPU was not 100% fast enough.

Thus, let's allocate two CPUs for XDP-RX:

Running XDP/eBPF prog_num:2
XDP-cpumap      CPU:to  pps            drop-pps    extra-info
XDP-RX          0       6,352,578      0           0          
XDP-RX          1       6,352,711      0           0          
XDP-RX          total   12,705,289     0          
cpumap-enqueue    0:2   1,588,156      1,351       0          
cpumap-enqueue    1:2   1,588,174      1,330       0          
cpumap-enqueue  sum:2   3,176,331      2,682       0          
cpumap-enqueue    0:3   1,588,157      994         0          
cpumap-enqueue    1:3   1,588,170      912         0          
cpumap-enqueue  sum:3   3,176,327      1,907       0          
cpumap-enqueue    0:4   1,588,157      529         0          
cpumap-enqueue    1:4   1,588,167      514         0          
cpumap-enqueue  sum:4   3,176,324      1,044       0          
cpumap-enqueue    0:5   1,588,159      625         0          
cpumap-enqueue    1:5   1,588,166      614         0          
cpumap-enqueue  sum:5   3,176,326      1,240       0          
cpumap_kthread  2       3,173,642      0           11257      time_exceed
cpumap_kthread  3       3,174,423      0           9779       time_exceed
cpumap_kthread  4       3,175,283      0           3938       time_exceed
cpumap_kthread  5       3,175,083      0           3120       time_exceed
cpumap_kthread  total   12,698,432     0           0          (null)
redirect_err    total   0              0          

Below, I'm using ./pktgen_sample04_many_flows.sh, and my generator
machine cannot generate more than 12,682,445 tx_packets /sec.
nstat says: UdpNoPorts 12,698,001 pps.  The XDP-RX CPUs actually have
30% idle CPU cycles, as they "only" handle 6.3Mpps each ;-)

Perf top on a CPU(3) that have to alloc and free SKBs etc.

# Overhead  CPU  Symbol                                 
# ........  ...  .......................................
#
    15.51%  003  [k] fib_table_lookup
     8.91%  003  [k] cpu_map_kthread_run
     8.04%  003  [k] build_skb
     7.88%  003  [k] page_frag_free
     5.13%  003  [k] kmem_cache_alloc
     4.76%  003  [k] ip_route_input_rcu
     4.59%  003  [k] kmem_cache_free
     4.02%  003  [k] __udp4_lib_rcv
     3.20%  003  [k] fib_validate_source
     3.02%  003  [k] __netif_receive_skb_core
     3.02%  003  [k] udp_v4_early_demux
     2.90%  003  [k] ip_rcv
     2.80%  003  [k] ip_rcv_finish
     2.26%  003  [k] eth_type_trans
     2.23%  003  [k] __build_skb
     2.00%  003  [k] icmp_send
     1.84%  003  [k] __rcu_read_unlock
     1.30%  003  [k] ip_local_deliver_finish
     1.26%  003  [k] netif_receive_skb_internal
     1.17%  003  [k] ip_route_input_noref
     1.11%  003  [k] make_kuid
     1.09%  003  [k] __udp4_lib_lookup
     1.07%  003  [k] skb_release_head_state
     1.04%  003  [k] __rcu_read_lock
     0.95%  003  [k] kfree_skb
     0.89%  003  [k] __local_bh_enable_ip
     0.88%  003  [k] skb_release_data
     0.71%  003  [k] ip_local_deliver
     0.58%  003  [k] netif_receive_skb

cmdline:
 perf report --sort cpu,symbol --kallsyms=/proc/kallsyms  --no-children  -C3 -g none --stdio
John Fastabend Sept. 27, 2017, 1:35 p.m. UTC | #8
On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:
> On Tue, 26 Sep 2017 21:58:53 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
>> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
>> [...]
>>> I'm currently implementing a cpumap type, that transfers raw XDP frames
>>> to another CPU, and the SKB is allocated on the remote CPU.  (It
>>> actually works extremely well).  
>>
>> Meaning you let all the XDP_PASS packets get processed on a
>> different CPU, so you can reserve the whole CPU just for
>> prefiltering, right? 
> 
> Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
> The trick is using the map-flush point, to transfer packets in bulk to
> the remote CPU (single call IPC is too slow), but at the same time
> flush single packets if NAPI didn't see a bulk.
> 
>> Do you have some numbers to share at this point, just curious when
>> you mention it works extremely well.
> 
> Sure... I've done a lot of benchmarking on this patchset ;-)
> I have a benchmark program called xdp_redirect_cpu [1][2], that collect
> stats via tracepoints (atm I'm limiting bulking 8 packets, and have
> tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)
> 
>  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
>  [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
> 
> Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU, that
> the 11.9Mpps DDoS attack is hitting.
> 
> Running XDP/eBPF prog_num:4
> XDP-cpumap      CPU:to  pps            drop-pps    extra-info
> XDP-RX          0       12,030,471     11,966,982  0          
> XDP-RX          total   12,030,471     11,966,982 
> cpumap-enqueue    0:2   63,488         0           0          
> cpumap-enqueue  sum:2   63,488         0           0          
> cpumap_kthread  2       63,488         0           3          time_exceed
> cpumap_kthread  total   63,488         0           0          
> redirect_err    total   0              0          
> 
> $ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate         
> bytes  Bytes  bytes    bytes   secs.    per sec   
> 
> 16384  87380  1024     1024    10.00    12735.97   
> 16384  87380 
> 
> The netperf TCP_CRR performance is the same, without XDP loaded.
> 

Just curious, could you also try this with RPS enabled (or does this have
RPS enabled)? RPS should effectively do the same thing but higher in the
stack. I'm curious what the delta would be. Might be another interesting
case and fairly easy to set up if you already have the above scripts.

Thanks,
John

[...]
Jesper Dangaard Brouer Sept. 27, 2017, 2:54 p.m. UTC | #9
On Wed, 27 Sep 2017 06:35:40 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 26 Sep 2017 21:58:53 +0200
> > Daniel Borkmann <daniel@iogearbox.net> wrote:
> >   
> >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> >> [...]  
> >>> I'm currently implementing a cpumap type, that transfers raw XDP frames
> >>> to another CPU, and the SKB is allocated on the remote CPU.  (It
> >>> actually works extremely well).    
> >>
> >> Meaning you let all the XDP_PASS packets get processed on a
> >> different CPU, so you can reserve the whole CPU just for
> >> prefiltering, right?   
> > 
> > Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
> > The trick is using the map-flush point, to transfer packets in bulk to
> > the remote CPU (single call IPC is too slow), but at the same time
> > flush single packets if NAPI didn't see a bulk.
> >   
> >> Do you have some numbers to share at this point, just curious when
> >> you mention it works extremely well.  
> > 
> > Sure... I've done a lot of benchmarking on this patchset ;-)
> > I have a benchmark program called xdp_redirect_cpu [1][2], that collect
> > stats via tracepoints (atm I'm limiting bulking 8 packets, and have
> > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)
> > 
> >  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
> >  [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
> > 
> > Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> > packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU, that
> > the 11.9Mpps DDoS attack is hitting.
> > 
> > Running XDP/eBPF prog_num:4
> > XDP-cpumap      CPU:to  pps            drop-pps    extra-info
> > XDP-RX          0       12,030,471     11,966,982  0          
> > XDP-RX          total   12,030,471     11,966,982 
> > cpumap-enqueue    0:2   63,488         0           0          
> > cpumap-enqueue  sum:2   63,488         0           0          
> > cpumap_kthread  2       63,488         0           3          time_exceed
> > cpumap_kthread  total   63,488         0           0          
> > redirect_err    total   0              0          
> > 
> > $ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
> > Local /Remote
> > Socket Size   Request  Resp.   Elapsed  Trans.
> > Send   Recv   Size     Size    Time     Rate         
> > bytes  Bytes  bytes    bytes   secs.    per sec   
> > 
> > 16384  87380  1024     1024    10.00    12735.97   
> > 16384  87380 
> > 
> > The netperf TCP_CRR performance is the same, without XDP loaded.
> >   
> 
> Just curious could you also try this with RPS enabled (or does this have
> RPS enabled). RPS should effectively do the same thing but higher in the
> stack. I'm curious what the delta would be. Might be another interesting
> case and fairly easy to setup if you already have the above scripts.

Yes, I'm essentially competing with RPS, thus such a comparison is very
relevant...

This is only a 6-CPU system. Allocate 2 CPUs to RPS receive and let the
other 4 CPUs process packets.

Summary of RPS (Receive Packet Steering) performance:
 * End result is 6.3 Mpps max performance
 * netperf TCP_CRR is 1 trans/sec.
 * Each RX-RPS CPU stalls at ~3.2Mpps.

The full test report below with setup:

The mask needed::

 perl -e 'printf "%b\n",0x3C'
 111100

RPS setup::

 sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries'

 for N in $(seq 0 5) ; do \
   sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \
   sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \
   grep -H . /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \
 done

Reduce RX queues to two ::

 ethtool -L ixgbe1 combined 2

IRQ align to CPU numbers::

 $ ~/setup01.sh
 Not root, running with sudo
  --- Disable Ethernet flow-control ---
 rx unmodified, ignoring
 tx unmodified, ignoring
 no pause parameters changed, aborting
 rx unmodified, ignoring
 tx unmodified, ignoring
 no pause parameters changed, aborting
  --- Align IRQs ---
 /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0
 /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1
 /proc/irq/56/ixgbe1/../smp_affinity_list:0-5

$ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus
/sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c
/sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c

Generator is sending: 12,715,782 tx_packets /sec

 ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \
    -d 172.16.0.2 -t8

$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives                    6346544            0.0
IpInDelivers                    6346544            0.0
IpOutRequests                   1020               0.0
IcmpOutMsgs                     1020               0.0
IcmpOutDestUnreachs             1020               0.0
IcmpMsgOutType3                 1020               0.0
UdpNoPorts                      6346898            0.0
IpExtInOctets                   291964714          0.0
IpExtOutOctets                  73440              0.0
IpExtInNoECTPkts                6347063            0.0

$ mpstat -P ALL -u -I SCPU -I SUM

Average:     CPU    %usr   %nice    %sys   %irq   %soft  %idle
Average:     all    0.00    0.00    0.00   0.42   72.97  26.61
Average:       0    0.00    0.00    0.00   0.17   99.83   0.00
Average:       1    0.00    0.00    0.00   0.17   99.83   0.00
Average:       2    0.00    0.00    0.00   0.67   60.37  38.96
Average:       3    0.00    0.00    0.00   0.67   58.70  40.64
Average:       4    0.00    0.00    0.00   0.67   59.53  39.80
Average:       5    0.00    0.00    0.00   0.67   58.93  40.40

Average:     CPU    intr/s
Average:     all 152067.22
Average:       0  50064.73
Average:       1  50089.35
Average:       2  45095.17
Average:       3  44875.04
Average:       4  44906.32
Average:       5  45152.08

Average:     CPU     TIMER/s   NET_TX/s   NET_RX/s TASKLET/s  SCHED/s     RCU/s
Average:       0      609.48       0.17   49431.28      0.00     2.66     21.13
Average:       1      567.55       0.00   49498.00      0.00     2.66     21.13
Average:       2      998.34       0.00   43941.60      4.16    82.86     68.22
Average:       3      540.60       0.17   44140.27      0.00    85.52    108.49
Average:       4      537.27       0.00   44219.63      0.00    84.53     64.89
Average:       5      530.78       0.17   44445.59      0.00    85.02     90.52

From mpstat it looks like it is the RX-RPS CPUs that are the bottleneck.

Show adapter(s) (ixgbe1) statistics (ONLY that changed!)
Ethtool(ixgbe1) stat:     11109531 (   11,109,531) <= fdir_miss /sec
Ethtool(ixgbe1) stat:    380632356 (  380,632,356) <= rx_bytes /sec
Ethtool(ixgbe1) stat:    812792611 (  812,792,611) <= rx_bytes_nic /sec
Ethtool(ixgbe1) stat:      1753550 (    1,753,550) <= rx_missed_errors /sec
Ethtool(ixgbe1) stat:      4602487 (    4,602,487) <= rx_no_dma_resources /sec
Ethtool(ixgbe1) stat:      6343873 (    6,343,873) <= rx_packets /sec
Ethtool(ixgbe1) stat:     10946441 (   10,946,441) <= rx_pkts_nic /sec
Ethtool(ixgbe1) stat:    190287853 (  190,287,853) <= rx_queue_0_bytes /sec
Ethtool(ixgbe1) stat:      3171464 (    3,171,464) <= rx_queue_0_packets /sec
Ethtool(ixgbe1) stat:    190344503 (  190,344,503) <= rx_queue_1_bytes /sec
Ethtool(ixgbe1) stat:      3172408 (    3,172,408) <= rx_queue_1_packets /sec

Notice, each RX-CPU can only process 3.1Mpps.

RPS RX-CPU(0):

 # Overhead  CPU  Symbol
 # ........  ...  .......................................
 #
    11.72%  000  [k] ixgbe_poll
    11.29%  000  [k] _raw_spin_lock
    10.35%  000  [k] dev_gro_receive
     8.36%  000  [k] __build_skb
     7.35%  000  [k] __skb_get_hash
     6.22%  000  [k] enqueue_to_backlog
     5.89%  000  [k] __skb_flow_dissect
     4.43%  000  [k] inet_gro_receive
     4.19%  000  [k] ___slab_alloc
     3.90%  000  [k] queued_spin_lock_slowpath
     3.85%  000  [k] kmem_cache_alloc
     3.06%  000  [k] build_skb
     2.66%  000  [k] get_rps_cpu
     2.57%  000  [k] napi_gro_receive
     2.34%  000  [k] eth_type_trans
     1.81%  000  [k] __cmpxchg_double_slab.isra.61
     1.47%  000  [k] ixgbe_alloc_rx_buffers
     1.43%  000  [k] get_partial_node.isra.81
     0.84%  000  [k] swiotlb_sync_single
     0.74%  000  [k] udp4_gro_receive
     0.73%  000  [k] netif_receive_skb_internal
     0.72%  000  [k] udp_gro_receive
     0.63%  000  [k] skb_gro_reset_offset
     0.49%  000  [k] __skb_flow_get_ports
     0.48%  000  [k] llist_add_batch
     0.36%  000  [k] swiotlb_sync_single_for_cpu
     0.34%  000  [k] __slab_alloc


Remote RPS-CPU(3) getting packets::

 # Overhead  CPU  Symbol
 # ........  ...  ..............................................
 #
    33.02%  003  [k] poll_idle
    10.99%  003  [k] __netif_receive_skb_core
    10.45%  003  [k] page_frag_free
     8.49%  003  [k] ip_rcv
     4.19%  003  [k] fib_table_lookup
     2.84%  003  [k] __udp4_lib_rcv
     2.81%  003  [k] __slab_free
     2.23%  003  [k] __udp4_lib_lookup
     2.09%  003  [k] ip_route_input_rcu
     2.07%  003  [k] kmem_cache_free
     2.06%  003  [k] udp_v4_early_demux
     1.73%  003  [k] ip_rcv_finish
     1.44%  003  [k] process_backlog
     1.32%  003  [k] icmp_send
     1.30%  003  [k] cmpxchg_double_slab.isra.73
     0.95%  003  [k] intel_idle
     0.88%  003  [k] _raw_spin_lock
     0.84%  003  [k] fib_validate_source
     0.79%  003  [k] ip_local_deliver_finish
     0.67%  003  [k] ip_local_deliver
     0.56%  003  [k] skb_release_data
     0.53%  003  [k] unfreeze_partials.isra.80
     0.51%  003  [k] skb_release_head_state
     0.44%  003  [k] kfree_skb
     0.44%  003  [k] queued_spin_lock_slowpath
     0.44%  003  [k] __cmpxchg_double_slab.isra.61

$ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -T5,5 -- -r 1024,1024
MIGRATED TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.16.0.2 () port 0 AF_INET : histogram : demo : cpu bind
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  1024     1024    10.00       1.10   
16384  87380
Alexei Starovoitov Sept. 27, 2017, 5:32 p.m. UTC | #10
On Wed, Sep 27, 2017 at 04:54:57PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 27 Sep 2017 06:35:40 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
> > On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:
> > > On Tue, 26 Sep 2017 21:58:53 +0200
> > > Daniel Borkmann <daniel@iogearbox.net> wrote:
> > >   
> > >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> > >> [...]  
> > >>> I'm currently implementing a cpumap type, that transfers raw XDP frames
> > >>> to another CPU, and the SKB is allocated on the remote CPU.  (It
> > >>> actually works extremely well).    
> > >>
> > >> Meaning you let all the XDP_PASS packets get processed on a
> > >> different CPU, so you can reserve the whole CPU just for
> > >> prefiltering, right?   
> > > 
> > > Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
> > > The trick is using the map-flush point, to transfer packets in bulk to
> > > the remote CPU (single call IPC is too slow), but at the same time
> > > flush single packets if NAPI didn't see a bulk.
> > >   
> > >> Do you have some numbers to share at this point, just curious when
> > >> you mention it works extremely well.  
> > > 
> > > Sure... I've done a lot of benchmarking on this patchset ;-)
> > > I have a benchmark program called xdp_redirect_cpu [1][2], that collect
> > > stats via tracepoints (atm I'm limiting bulking 8 packets, and have
> > > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)
> > > 
> > >  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
> > >  [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
> > > 
> > > Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> > > packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU, that
> > > the 11.9Mpps DDoS attack is hitting.
> > > 
> > > Running XDP/eBPF prog_num:4
> > > XDP-cpumap      CPU:to  pps            drop-pps    extra-info
> > > XDP-RX          0       12,030,471     11,966,982  0          
> > > XDP-RX          total   12,030,471     11,966,982 
> > > cpumap-enqueue    0:2   63,488         0           0          
> > > cpumap-enqueue  sum:2   63,488         0           0          
> > > cpumap_kthread  2       63,488         0           3          time_exceed
> > > cpumap_kthread  total   63,488         0           0          
> > > redirect_err    total   0              0          
> > > 
> > > $ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
> > > Local /Remote
> > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > Send   Recv   Size     Size    Time     Rate         
> > > bytes  Bytes  bytes    bytes   secs.    per sec   
> > > 
> > > 16384  87380  1024     1024    10.00    12735.97   
> > > 16384  87380 
> > > 
> > > The netperf TCP_CRR performance is the same, without XDP loaded.
> > >   
> > 
> > Just curious could you also try this with RPS enabled (or does this have
> > RPS enabled). RPS should effectively do the same thing but higher in the
> > stack. I'm curious what the delta would be. Might be another interesting
> > case and fairly easy to setup if you already have the above scripts.
> 
> Yes, I'm essentially competing with RSP, thus such a comparison is very
> relevant...
> 
> This is only a 6 CPUs system. Allocate 2 CPUs to RPS receive and let
> other 4 CPUS process packet.
> 
> Summary of RPS (Receive Packet Steering) performance:
>  * End result is 6.3 Mpps max performance
>  * netperf TCP_CRR is 1 trans/sec.
>  * Each RX-RPS CPU stall at ~3.2Mpps.
> 
> The full test report below with setup:
> 
> The mask needed::
> 
>  perl -e 'printf "%b\n",0x3C'
>  111100
> 
> RPS setup::
> 
>  sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries'
> 
>  for N in $(seq 0 5) ; do \
>    sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \
>    sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \
>    grep -H . /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \
>  done
> 
> Reduce RX queues to two ::
> 
>  ethtool -L ixgbe1 combined 2
> 
> IRQ align to CPU numbers::
> 
>  $ ~/setup01.sh
>  Not root, running with sudo
>   --- Disable Ethernet flow-control ---
>  rx unmodified, ignoring
>  tx unmodified, ignoring
>  no pause parameters changed, aborting
>  rx unmodified, ignoring
>  tx unmodified, ignoring
>  no pause parameters changed, aborting
>   --- Align IRQs ---
>  /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0
>  /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1
>  /proc/irq/56/ixgbe1/../smp_affinity_list:0-5
> 
> $ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus
> /sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c
> /sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c
> 
> Generator is sending: 12,715,782 tx_packets /sec
> 
>  ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \
>     -d 172.16.0.2 -t8
> 
> $ nstat > /dev/null && sleep 1 && nstat
> #kernel
> IpInReceives                    6346544            0.0
> IpInDelivers                    6346544            0.0
> IpOutRequests                   1020               0.0
> IcmpOutMsgs                     1020               0.0
> IcmpOutDestUnreachs             1020               0.0
> IcmpMsgOutType3                 1020               0.0
> UdpNoPorts                      6346898            0.0
> IpExtInOctets                   291964714          0.0
> IpExtOutOctets                  73440              0.0
> IpExtInNoECTPkts                6347063            0.0
> 
> $ mpstat -P ALL -u -I SCPU -I SUM
> 
> Average:     CPU    %usr   %nice    %sys   %irq   %soft  %idle
> Average:     all    0.00    0.00    0.00   0.42   72.97  26.61
> Average:       0    0.00    0.00    0.00   0.17   99.83   0.00
> Average:       1    0.00    0.00    0.00   0.17   99.83   0.00
> Average:       2    0.00    0.00    0.00   0.67   60.37  38.96
> Average:       3    0.00    0.00    0.00   0.67   58.70  40.64
> Average:       4    0.00    0.00    0.00   0.67   59.53  39.80
> Average:       5    0.00    0.00    0.00   0.67   58.93  40.40
> 
> Average:     CPU    intr/s
> Average:     all 152067.22
> Average:       0  50064.73
> Average:       1  50089.35
> Average:       2  45095.17
> Average:       3  44875.04
> Average:       4  44906.32
> Average:       5  45152.08
> 
> Average:     CPU     TIMER/s   NET_TX/s   NET_RX/s TASKLET/s  SCHED/s     RCU/s
> Average:       0      609.48       0.17   49431.28      0.00     2.66     21.13
> Average:       1      567.55       0.00   49498.00      0.00     2.66     21.13
> Average:       2      998.34       0.00   43941.60      4.16    82.86     68.22
> Average:       3      540.60       0.17   44140.27      0.00    85.52    108.49
> Average:       4      537.27       0.00   44219.63      0.00    84.53     64.89
> Average:       5      530.78       0.17   44445.59      0.00    85.02     90.52
> 
> From mpstat it looks like it is the RX-RPS CPUs that are the bottleneck.
> 
> Show adapter(s) (ixgbe1) statistics (ONLY that changed!)
> Ethtool(ixgbe1) stat:     11109531 (   11,109,531) <= fdir_miss /sec
> Ethtool(ixgbe1) stat:    380632356 (  380,632,356) <= rx_bytes /sec
> Ethtool(ixgbe1) stat:    812792611 (  812,792,611) <= rx_bytes_nic /sec
> Ethtool(ixgbe1) stat:      1753550 (    1,753,550) <= rx_missed_errors /sec
> Ethtool(ixgbe1) stat:      4602487 (    4,602,487) <= rx_no_dma_resources /sec
> Ethtool(ixgbe1) stat:      6343873 (    6,343,873) <= rx_packets /sec
> Ethtool(ixgbe1) stat:     10946441 (   10,946,441) <= rx_pkts_nic /sec
> Ethtool(ixgbe1) stat:    190287853 (  190,287,853) <= rx_queue_0_bytes /sec
> Ethtool(ixgbe1) stat:      3171464 (    3,171,464) <= rx_queue_0_packets /sec
> Ethtool(ixgbe1) stat:    190344503 (  190,344,503) <= rx_queue_1_bytes /sec
> Ethtool(ixgbe1) stat:      3172408 (    3,172,408) <= rx_queue_1_packets /sec
> 
> Notice, each RX-CPU can only process 3.1Mpps.
> 
> RPS RX-CPU(0):
> 
>  # Overhead  CPU  Symbol
>  # ........  ...  .......................................
>  #
>     11.72%  000  [k] ixgbe_poll
>     11.29%  000  [k] _raw_spin_lock
>     10.35%  000  [k] dev_gro_receive
>      8.36%  000  [k] __build_skb
>      7.35%  000  [k] __skb_get_hash
>      6.22%  000  [k] enqueue_to_backlog
>      5.89%  000  [k] __skb_flow_dissect
>      4.43%  000  [k] inet_gro_receive
>      4.19%  000  [k] ___slab_alloc
>      3.90%  000  [k] queued_spin_lock_slowpath
>      3.85%  000  [k] kmem_cache_alloc
>      3.06%  000  [k] build_skb
>      2.66%  000  [k] get_rps_cpu
>      2.57%  000  [k] napi_gro_receive
>      2.34%  000  [k] eth_type_trans
>      1.81%  000  [k] __cmpxchg_double_slab.isra.61
>      1.47%  000  [k] ixgbe_alloc_rx_buffers
>      1.43%  000  [k] get_partial_node.isra.81
>      0.84%  000  [k] swiotlb_sync_single
>      0.74%  000  [k] udp4_gro_receive
>      0.73%  000  [k] netif_receive_skb_internal
>      0.72%  000  [k] udp_gro_receive
>      0.63%  000  [k] skb_gro_reset_offset
>      0.49%  000  [k] __skb_flow_get_ports
>      0.48%  000  [k] llist_add_batch
>      0.36%  000  [k] swiotlb_sync_single_for_cpu
>      0.34%  000  [k] __slab_alloc
> 
> 
> Remote RPS-CPU(3) getting packets::
> 
>  # Overhead  CPU  Symbol
>  # ........  ...  ..............................................
>  #
>     33.02%  003  [k] poll_idle
>     10.99%  003  [k] __netif_receive_skb_core
>     10.45%  003  [k] page_frag_free
>      8.49%  003  [k] ip_rcv
>      4.19%  003  [k] fib_table_lookup
>      2.84%  003  [k] __udp4_lib_rcv
>      2.81%  003  [k] __slab_free
>      2.23%  003  [k] __udp4_lib_lookup
>      2.09%  003  [k] ip_route_input_rcu
>      2.07%  003  [k] kmem_cache_free
>      2.06%  003  [k] udp_v4_early_demux
>      1.73%  003  [k] ip_rcv_finish

Very interesting data.
So the above perf report compares against this one from xdp-redirect-cpu:
Perf top on a CPU(3) that has to alloc and free SKBs etc.

# Overhead  CPU  Symbol
# ........  ...  .......................................
#
    15.51%  003  [k] fib_table_lookup
     8.91%  003  [k] cpu_map_kthread_run
     8.04%  003  [k] build_skb
     7.88%  003  [k] page_frag_free
     5.13%  003  [k] kmem_cache_alloc
     4.76%  003  [k] ip_route_input_rcu
     4.59%  003  [k] kmem_cache_free
     4.02%  003  [k] __udp4_lib_rcv
     3.20%  003  [k] fib_validate_source
     3.02%  003  [k] __netif_receive_skb_core
     3.02%  003  [k] udp_v4_early_demux
     2.90%  003  [k] ip_rcv
     2.80%  003  [k] ip_rcv_finish

right?
and in the RPS case the consumer cpu is 33% idle whereas in redirect-cpu
you can load it up all the way.
Am I interpreting all this correctly that with RPS cpu0 cannot
distribute the packets to other cpus fast enough and that's
a bottleneck?
whereas in redirect-cpu you're doing early packet distribution
before skb alloc?
So in other words with redirect-cpu all consumer cpus are doing
skb alloc, whereas with RPS cpu0 is allocating skbs for all of them?
and that's where 6M->12M performance gain comes from?
Waskiewicz Jr, Peter Sept. 28, 2017, 5:59 a.m. UTC | #11
On 9/26/17 10:21 AM, Andy Gospodarek wrote:
> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
>> [...]
>>> First, thanks for this detailed description.  It was helpful to read
>>> along with the patches.
>>>
>>> My only concern about this area being generic is that you are now in a
>>> state where any bpf program must know about all the bpf programs in the
>>> receive pipeline before it can properly parse what is stored in the
>>> meta-data and add it to an skb (or perform any other action).
>>> Especially if each program adds its own meta-data along the way.
>>>
>>> Maybe this isn't a big concern based on the number of users of this
>>> today, but it just starts to seem like a concern as there are these
>>> hints being passed between layers that are challenging to track due to a
>>> lack of a standard format for passing data between.
>>
>> Btw, we do have similar kind of programmable scratch buffer also today
>> wrt skb cb[] that you can program from tc side, the perf ring buffer,
>> which doesn't have any fixed layout for the slots, or a per-cpu map
>> where you can transfer data between tail calls for example, then tail
>> calls themselves that need to coordinate, or simply mangling of packets
>> itself if you will, but more below to your use case ...
>>
>>> The main reason I bring this up is that Michael and I had discussed and
>>> designed a way for drivers to communicate between each other that rx
>>> resources could be freed after a tx completion on an XDP_REDIRECT
>>> action.  Much like this code, it involved adding a new element to
>>> struct xdp_md that could point to the important information.  Now that
>>> there is a generic way to handle this, it would seem nice to be able to
>>> leverage it, but I'm not sure how reliable this meta-data area would be
>>> without the ability to mark it in some manner.
>>>
>>> For additional background, the minimum amount of data needed in the case
>>> Michael and I were discussing was really 2 words.  One to serve as a
>>> pointer to an rx_ring structure and one to have a counter to the rx
>>> producer entry.  This data could be accessed by the driver processing the
>>> tx completions and callback to the driver that received the frame off the wire
>>> to perform any needed processing.  (For those curious this would also require a
>>> new callback/netdev op to act on this data stored in the XDP buffer.)
>>
>> What you describe above doesn't seem to be fitting to the use-case of
>> this set, meaning the area here is fully programmable out of the BPF
>> program, the infrastructure you're describing is some sort of means of
>> communication between drivers for the XDP_REDIRECT, and should be
>> outside of the control of the BPF program to mangle.
> 
> OK, I understand that perspective.  I think saying this is really meant
> as a BPF<->BPF communication channel for now is fine.
> 
>> You could probably reuse the base infra here and make a part of that
>> inaccessible for the program with some sort of a fixed layout, but I
>> haven't seen your code yet to be able to fully judge. Intention here
>> is to allow for programmability within the BPF prog in a generic way,
>> such that based on the use-case it can be populated in specific ways
>> and propagated to the skb w/o having to define a fixed layout and
>> bloat xdp_buff all the way to an skb while still retaining all the
>> flexibility.
> 
> Some level of reuse might be proper, but I'd rather it be explicit for
> my use since it's not exclusively something that will need to be used by
> a BPF prog, but rather the driver.  I'll produce some patches this week
> for reference.

Sorry for chiming in late, I've been offline.

We're looking to add some functionality from driver to XDP inside this 
xdp_buff->data_meta region.  We want to assign it to an opaque
structure that would be specific per driver (think of a flex descriptor
coming out of the hardware).  We'd like to pass these offloaded 
computations into XDP programs to help accelerate them, such as packet 
type, where headers are located, etc.  It's similar to Jesper's RFC 
patches back in May when passing through the mlx Rx descriptor to XDP.
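
To make that a bit more concrete, here is a rough sketch of how an XDP
program could consume such hints through the data_meta pointer from this
series.  The struct layout and the names (my_hw_hints, ptype, l4_offset)
are made-up placeholders, not our actual descriptor format:

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC() macro, samples/bpf style */

/* Hypothetical hints a driver might place in front of the packet. */
struct my_hw_hints {
	__u32 ptype;		/* packet type pre-computed by hardware */
	__u32 l4_offset;	/* offset of the L4 header, if known */
};

SEC("xdp")
int xdp_use_hw_hints(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data      = (void *)(long)ctx->data;
	struct my_hw_hints *hints = data_meta;

	/* Prove to the verifier that the meta area holds the struct. */
	if ((void *)(hints + 1) > data)
		return XDP_PASS;

	/* Skip re-parsing headers when hardware already classified it. */
	if (hints->ptype == 0)
		return XDP_DROP;

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";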

This is actually what a few of us are planning to present at NetDev 2.2 
in November.  If you're hoping to restrict this headroom in the xdp_buff 
for an exclusive use case with XDP_REDIRECT, then I'd like to discuss 
that further.

-PJ
Andy Gospodarek Sept. 28, 2017, 7:58 p.m. UTC | #12
On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter
<peter.waskiewicz.jr@intel.com> wrote:
> On 9/26/17 10:21 AM, Andy Gospodarek wrote:
>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
>>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
>>> [...]
>>>> First, thanks for this detailed description.  It was helpful to read
>>>> along with the patches.
>>>>
>>>> My only concern about this area being generic is that you are now in a
>>>> state where any bpf program must know about all the bpf programs in the
>>>> receive pipeline before it can properly parse what is stored in the
>>>> meta-data and add it to an skb (or perform any other action).
>>>> Especially if each program adds its own meta-data along the way.
>>>>
>>>> Maybe this isn't a big concern based on the number of users of this
>>>> today, but it just starts to seem like a concern as there are these
>>>> hints being passed between layers that are challenging to track due to a
>>>> lack of a standard format for passing data between.
>>>
>>> Btw, we do have similar kind of programmable scratch buffer also today
>>> wrt skb cb[] that you can program from tc side, the perf ring buffer,
>>> which doesn't have any fixed layout for the slots, or a per-cpu map
>>> where you can transfer data between tail calls for example, then tail
>>> calls themselves that need to coordinate, or simply mangling of packets
>>> itself if you will, but more below to your use case ...
>>>
>>>> The main reason I bring this up is that Michael and I had discussed and
>>>> designed a way for drivers to communicate between each other that rx
>>>> resources could be freed after a tx completion on an XDP_REDIRECT
>>>> action.  Much like this code, it involved adding a new element to
>>>> struct xdp_md that could point to the important information.  Now that
>>>> there is a generic way to handle this, it would seem nice to be able to
>>>> leverage it, but I'm not sure how reliable this meta-data area would be
>>>> without the ability to mark it in some manner.
>>>>
>>>> For additional background, the minimum amount of data needed in the case
>>>> Michael and I were discussing was really 2 words.  One to serve as a
>>>> pointer to an rx_ring structure and one to have a counter to the rx
>>>> producer entry.  This data could be accessed by the driver processing the
>>>> tx completions and callback to the driver that received the frame off the wire
>>>> to perform any needed processing.  (For those curious this would also require a
>>>> new callback/netdev op to act on this data stored in the XDP buffer.)
>>>
>>> What you describe above doesn't seem to be fitting to the use-case of
>>> this set, meaning the area here is fully programmable out of the BPF
>>> program, the infrastructure you're describing is some sort of means of
>>> communication between drivers for the XDP_REDIRECT, and should be
>>> outside of the control of the BPF program to mangle.
>>
>> OK, I understand that perspective.  I think saying this is really meant
>> as a BPF<->BPF communication channel for now is fine.
>>
>>> You could probably reuse the base infra here and make a part of that
>>> inaccessible for the program with some sort of a fixed layout, but I
>>> haven't seen your code yet to be able to fully judge. Intention here
>>> is to allow for programmability within the BPF prog in a generic way,
>>> such that based on the use-case it can be populated in specific ways
>>> and propagated to the skb w/o having to define a fixed layout and
>>> bloat xdp_buff all the way to an skb while still retaining all the
>>> flexibility.
>>
>> Some level of reuse might be proper, but I'd rather it be explicit for
>> my use since it's not exclusively something that will need to be used by
>> a BPF prog, but rather the driver.  I'll produce some patches this week
>> for reference.
>
> Sorry for chiming in late, I've been offline.
>
> We're looking to add some functionality from driver to XDP inside this
> xdp_buff->data_meta region.  We want to assign it to an opaque
> structure, that would be specific per driver (think of a flex descriptor
> coming out of the hardware).  We'd like to pass these offloaded
> computations into XDP programs to help accelerate them, such as packet
> type, where headers are located, etc.  It's similar to Jesper's RFC
> patches back in May when passing through the mlx Rx descriptor to XDP.
>
> This is actually what a few of us are planning to present at NetDev 2.2
> in November.  If you're hoping to restrict this headroom in the xdp_buff
> for an exclusive use case with XDP_REDIRECT, then I'd like to discuss
> that further.
>

No sweat, PJ, thanks for replying.  I saw the notes for your accepted
session and I'm looking forward to it.

John's suggestion earlier in the thread was actually similar to the
conclusion I reached when thinking about Daniel's patch a bit more.
(I like John's better though as it doesn't get constrained by UAPI.)
Since redirect actions happen at a point where no other programs will
run on the buffer, that space can be used for this redirect data and
there are no conflicts.

It sounds like the idea behind your proposal includes populating some
data into the buffer before the XDP program is executed so that it can
be used by the program.  Would this data be useful later in the driver
or stack or are you just hoping to accelerate processing of frames in
the BPF program?

If the headroom needed for redirect info was only added after it was
clear the redirect action was needed, would this conflict with the
information you are trying to provide?  I had planned to add this just
after the XDP_REDIRECT action was selected or at the end of the
driver's ndo_xdp_xmit function -- it seems like it would not conflict.
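
Roughly, I am picturing something like the sketch below, where the
receiving driver stashes those two words in the headroom once
XDP_REDIRECT has been selected.  The names (struct xdp_redir_info,
my_rx_ring, stash_redirect_info) are placeholders, not actual patches:

#include <linux/filter.h>	/* struct xdp_buff */

struct my_rx_ring;		/* stand-in for the driver's real RX ring type */

struct xdp_redir_info {
	struct my_rx_ring *rxr;	/* ring the buffer came from */
	u16 producer_idx;	/* producer entry to recycle on tx completion */
};

/* Called only after the program returned XDP_REDIRECT, so no BPF program
 * will look at this area anymore and it is free to reuse.
 */
static inline void stash_redirect_info(struct xdp_buff *xdp,
				       struct my_rx_ring *rxr, u16 prod)
{
	struct xdp_redir_info *ri = xdp->data - sizeof(*ri);

	ri->rxr = rxr;
	ri->producer_idx = prod;
}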

(There's also Jesper's series from today -- I've seen it but have not
had time to fully grok all of those changes.)

Thoughts?
Waskiewicz Jr, Peter Sept. 28, 2017, 8:52 p.m. UTC | #13
On 9/28/17 12:59 PM, Andy Gospodarek wrote:
> On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter
> <peter.waskiewicz.jr@intel.com> wrote:
>> On 9/26/17 10:21 AM, Andy Gospodarek wrote:
>>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
>>>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
>>>> [...]
>>>>> First, thanks for this detailed description.  It was helpful to read
>>>>> along with the patches.
>>>>>
>>>>> My only concern about this area being generic is that you are now in a
>>>>> state where any bpf program must know about all the bpf programs in the
>>>>> receive pipeline before it can properly parse what is stored in the
>>>>> meta-data and add it to an skb (or perform any other action).
>>>>> Especially if each program adds its own meta-data along the way.
>>>>>
>>>>> Maybe this isn't a big concern based on the number of users of this
>>>>> today, but it just starts to seem like a concern as there are these
>>>>> hints being passed between layers that are challenging to track due to a
>>>>> lack of a standard format for passing data between.
>>>>
>>>> Btw, we do have similar kind of programmable scratch buffer also today
>>>> wrt skb cb[] that you can program from tc side, the perf ring buffer,
>>>> which doesn't have any fixed layout for the slots, or a per-cpu map
>>>> where you can transfer data between tail calls for example, then tail
>>>> calls themselves that need to coordinate, or simply mangling of packets
>>>> itself if you will, but more below to your use case ...
>>>>
>>>>> The main reason I bring this up is that Michael and I had discussed and
>>>>> designed a way for drivers to communicate between each other that rx
>>>>> resources could be freed after a tx completion on an XDP_REDIRECT
>>>>> action.  Much like this code, it involved adding a new element to
>>>>> struct xdp_md that could point to the important information.  Now that
>>>>> there is a generic way to handle this, it would seem nice to be able to
>>>>> leverage it, but I'm not sure how reliable this meta-data area would be
>>>>> without the ability to mark it in some manner.
>>>>>
>>>>> For additional background, the minimum amount of data needed in the case
>>>>> Michael and I were discussing was really 2 words.  One to serve as a
>>>>> pointer to an rx_ring structure and one to have a counter to the rx
>>>>> producer entry.  This data could be accessed by the driver processing the
>>>>> tx completions and callback to the driver that received the frame off the wire
>>>>> to perform any needed processing.  (For those curious this would also require a
>>>>> new callback/netdev op to act on this data stored in the XDP buffer.)
>>>>
>>>> What you describe above doesn't seem to be fitting to the use-case of
>>>> this set, meaning the area here is fully programmable out of the BPF
>>>> program, the infrastructure you're describing is some sort of means of
>>>> communication between drivers for the XDP_REDIRECT, and should be
>>>> outside of the control of the BPF program to mangle.
>>>
>>> OK, I understand that perspective.  I think saying this is really meant
>>> as a BPF<->BPF communication channel for now is fine.
>>>
>>>> You could probably reuse the base infra here and make a part of that
>>>> inaccessible for the program with some sort of a fixed layout, but I
>>>> haven't seen your code yet to be able to fully judge. Intention here
>>>> is to allow for programmability within the BPF prog in a generic way,
>>>> such that based on the use-case it can be populated in specific ways
>>>> and propagated to the skb w/o having to define a fixed layout and
>>>> bloat xdp_buff all the way to an skb while still retaining all the
>>>> flexibility.
>>>
>>> Some level of reuse might be proper, but I'd rather it be explicit for
>>> my use since it's not exclusively something that will need to be used by
>>> a BPF prog, but rather the driver.  I'll produce some patches this week
>>> for reference.
>>
>> Sorry for chiming in late, I've been offline.
>>
>> We're looking to add some functionality from driver to XDP inside this
>> xdp_buff->data_meta region.  We want to assign it to an opaque
>> structure, that would be specific per driver (think of a flex descriptor
>> coming out of the hardware).  We'd like to pass these offloaded
>> computations into XDP programs to help accelerate them, such as packet
>> type, where headers are located, etc.  It's similar to Jesper's RFC
>> patches back in May when passing through the mlx Rx descriptor to XDP.
>>
>> This is actually what a few of us are planning to present at NetDev 2.2
>> in November.  If you're hoping to restrict this headroom in the xdp_buff
>> for an exclusive use case with XDP_REDIRECT, then I'd like to discuss
>> that further.
>>
> 
> No sweat, PJ, thanks for replying.  I saw the notes for your accepted
> session and I'm looking forward to it.
> 
> John's suggestion earlier in the thread was actually similar to the
> conclusion I reached when thinking about Daniel's patch a bit more.
> (I like John's better though as it doesn't get constrained by UAPI.)
> Since redirect actions happen at a point where no other programs will
> run on the buffer, that space can be used for this redirect data and
> there are no conflicts.

Ah, yes, John and I spoke about this at Plumber's and this is basically
what we came to as well: a set of helpers that won't have to be in
UAPI, but will potentially be vendor-specific, to extract the
meta-data hints for the XDP program to use.

> It sounds like the idea behind your proposal includes populating some
> data into the buffer before the XDP program is executed so that it can
> be used by the program.  Would this data be useful later in the driver
> or stack or are you just hoping to accelerate processing of frames in
> the BPF program?

Right now we're thinking it would only be useful for XDP programs to 
execute things quicker, i.e. not have to compute things that are already 
computed by the hardware (rxhash, ptype, header locations, etc.).  I 
don't have any plans to pass this data off elsewhere in the stack or 
back to the driver at this point.

> If the headroom needed for redirect info was only added after it was
> clear the redirect action was needed, would this conflict with the
> information you are trying to provide?  I had planned to add this just
> after the action was XDP_REDIRECT was selected or at the end of the
> driver's ndo_xdp_xmit function -- it seems like it would not conflict.

I'm pretty sure I misunderstood what you were going after with 
XDP_REDIRECT reserving the headroom.  Our use case (patches coming in a 
few weeks) will populate the headroom coming out of the driver to XDP, 
and then once the XDP program extracts whatever hints it wants via 
helpers, I fully expect that area in the headroom to get stomped by 
something else.  If we want to send any of that hint data up farther, 
we'll already have it extracted via the helpers, and the eBPF program 
can happily assign it to wherever in the outbound metadata area.

> (There's also Jesper's series from today -- I've seen it but have not
> had time to fully grok all of those changes.)

I'm also working through my inbox to get to that series.  I have some 
email to catch up on...

Thanks Andy,
-PJ
John Fastabend Sept. 28, 2017, 9:22 p.m. UTC | #14
[...]

> I'm pretty sure I misunderstood what you were going after with 
> XDP_REDIRECT reserving the headroom.  Our use case (patches coming in a 
> few weeks) will populate the headroom coming out of the driver to XDP, 
> and then once the XDP program extracts whatever hints it wants via 
> helpers, I fully expect that area in the headroom to get stomped by 
> something else.  If we want to send any of that hint data up farther, 
> we'll already have it extracted via the helpers, and the eBPF program 
> can happily assign it to wherever in the outbound metadata area.

In case it's not obvious, with the latest xdp metadata patches the outbound
metadata can then be pushed into skb fields via a tc_cls program if needed.
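
A rough sketch of that flow, with purely illustrative values, assuming the
bpf_xdp_adjust_meta() helper and the data_meta fields from this series are
available to the programs:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"	/* SEC() macro, samples/bpf style */

/* Helper is new in this series, so declare it by hand here. */
static int (*bpf_xdp_adjust_meta)(void *ctx, int delta) =
	(void *) BPF_FUNC_xdp_adjust_meta;

/* XDP side: reserve 4 bytes of metadata and store a classification result. */
SEC("xdp")
int xdp_store_meta(struct xdp_md *ctx)
{
	void *data, *data_meta;
	__u32 *meta;

	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;

	data      = (void *)(long)ctx->data;
	data_meta = (void *)(long)ctx->data_meta;
	meta = data_meta;
	if ((void *)(meta + 1) > data)		/* bounds check for the verifier */
		return XDP_PASS;

	*meta = 0x42;				/* e.g. result of the XDP filter */
	return XDP_PASS;
}

/* tc clsact ingress side: read the metadata back and push it into the skb. */
SEC("classifier")
int tc_read_meta(struct __sk_buff *skb)
{
	void *data      = (void *)(long)skb->data;
	void *data_meta = (void *)(long)skb->data_meta;
	__u32 *meta = data_meta;

	if ((void *)(meta + 1) > data)
		return TC_ACT_OK;

	skb->mark = *meta;			/* transfer the XDP hint into skb->mark */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";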

.John

> 
>> (There's also Jesper's series from today -- I've seen it but have not
>> had time to fully grok all of those changes.)
> 
> I'm also working through my inbox to get to that series.  I have some 
> email to catch up on...
> 
> Thanks Andy,
> -PJ
>
Daniel Borkmann Sept. 28, 2017, 9:29 p.m. UTC | #15
On 09/28/2017 10:52 PM, Waskiewicz Jr, Peter wrote:
> On 9/28/17 12:59 PM, Andy Gospodarek wrote:
>> On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter
>> <peter.waskiewicz.jr@intel.com> wrote:
>>> On 9/26/17 10:21 AM, Andy Gospodarek wrote:
>>>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
>>>>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
>>>>> [...]
>>>>>> First, thanks for this detailed description.  It was helpful to read
>>>>>> along with the patches.
>>>>>>
>>>>>> My only concern about this area being generic is that you are now in a
>>>>>> state where any bpf program must know about all the bpf programs in the
>>>>>> receive pipeline before it can properly parse what is stored in the
>>>>>> meta-data and add it to an skb (or perform any other action).
>>>>>> Especially if each program adds its own meta-data along the way.
>>>>>>
>>>>>> Maybe this isn't a big concern based on the number of users of this
>>>>>> today, but it just starts to seem like a concern as there are these
>>>>>> hints being passed between layers that are challenging to track due to a
>>>>>> lack of a standard format for passing data between.
>>>>>
>>>>> Btw, we do have similar kind of programmable scratch buffer also today
>>>>> wrt skb cb[] that you can program from tc side, the perf ring buffer,
>>>>> which doesn't have any fixed layout for the slots, or a per-cpu map
>>>>> where you can transfer data between tail calls for example, then tail
>>>>> calls themselves that need to coordinate, or simply mangling of packets
>>>>> itself if you will, but more below to your use case ...
>>>>>
>>>>>> The main reason I bring this up is that Michael and I had discussed and
>>>>>> designed a way for drivers to communicate between each other that rx
>>>>>> resources could be freed after a tx completion on an XDP_REDIRECT
>>>>>> action.  Much like this code, it involved adding a new element to
>>>>>> struct xdp_md that could point to the important information.  Now that
>>>>>> there is a generic way to handle this, it would seem nice to be able to
>>>>>> leverage it, but I'm not sure how reliable this meta-data area would be
>>>>>> without the ability to mark it in some manner.
>>>>>>
>>>>>> For additional background, the minimum amount of data needed in the case
>>>>>> Michael and I were discussing was really 2 words.  One to serve as a
>>>>>> pointer to an rx_ring structure and one to have a counter to the rx
>>>>>> producer entry.  This data could be accessed by the driver processing the
>>>>>> tx completions and callback to the driver that received the frame off the wire
>>>>>> to perform any needed processing.  (For those curious this would also require a
>>>>>> new callback/netdev op to act on this data stored in the XDP buffer.)
>>>>>
>>>>> What you describe above doesn't seem to be fitting to the use-case of
>>>>> this set, meaning the area here is fully programmable out of the BPF
>>>>> program, the infrastructure you're describing is some sort of means of
>>>>> communication between drivers for the XDP_REDIRECT, and should be
>>>>> outside of the control of the BPF program to mangle.
>>>>
>>>> OK, I understand that perspective.  I think saying this is really meant
>>>> as a BPF<->BPF communication channel for now is fine.
>>>>
>>>>> You could probably reuse the base infra here and make a part of that
>>>>> inaccessible for the program with some sort of a fixed layout, but I
>>>>> haven't seen your code yet to be able to fully judge. Intention here
>>>>> is to allow for programmability within the BPF prog in a generic way,
>>>>> such that based on the use-case it can be populated in specific ways
>>>>> and propagated to the skb w/o having to define a fixed layout and
>>>>> bloat xdp_buff all the way to an skb while still retaining all the
>>>>> flexibility.
>>>>
>>>> Some level of reuse might be proper, but I'd rather it be explicit for
>>>> my use since it's not exclusively something that will need to be used by
>>>> a BPF prog, but rather the driver.  I'll produce some patches this week
>>>> for reference.
>>>
>>> Sorry for chiming in late, I've been offline.
>>>
>>> We're looking to add some functionality from driver to XDP inside this
>>> xdp_buff->data_meta region.  We want to assign it to an opaque
>>> structure, that would be specific per driver (think of a flex descriptor
>>> coming out of the hardware).  We'd like to pass these offloaded
>>> computations into XDP programs to help accelerate them, such as packet
>>> type, where headers are located, etc.  It's similar to Jesper's RFC
>>> patches back in May when passing through the mlx Rx descriptor to XDP.
>>>
>>> This is actually what a few of us are planning to present at NetDev 2.2
>>> in November.  If you're hoping to restrict this headroom in the xdp_buff
>>> for an exclusive use case with XDP_REDIRECT, then I'd like to discuss
>>> that further.
>>
>> No sweat, PJ, thanks for replying.  I saw the notes for your accepted
>> session and I'm looking forward to it.
>>
>> John's suggestion earlier in the thread was actually similar to the
>> conclusion I reached when thinking about Daniel's patch a bit more.
>> (I like John's better though as it doesn't get constrained by UAPI.)
>> Since redirect actions happen at a point where no other programs will
>> run on the buffer, that space can be used for this redirect data and
>> there are no conflicts.

Yep, fully agree; it's not read anywhere else anymore, nor could it go up
the stack where we'd read it out again, so that's the best solution
for your use-case moving forward, Andy. I do like that we don't expose
this to uapi.

> Ah, yes, John and I spoke about this at Plumber's and this is basically
> what we came to as well.  A set of helpers that won't have to be in
> UAPI, but they will be potentially vendor-specific to extract the
> meta-data hints out for the XDP program to use.
>
>> It sounds like the idea behind your proposal includes populating some
>> data into the buffer before the XDP program is executed so that it can
>> be used by the program.  Would this data be useful later in the driver
>> or stack or are you just hoping to accelerate processing of frames in
>> the BPF program?
>
> Right now we're thinking it would only be useful for XDP programs to
> execute things quicker, i.e. not have to compute things that are already
> computed by the hardware (rxhash, ptype, header locations, etc.).  I
> don't have any plans to pass this data off elsewhere in the stack or
> back to the driver at this point.
>
>> If the headroom needed for redirect info was only added after it was
>> clear the redirect action was needed, would this conflict with the
>> information you are trying to provide?  I had planned to add this just
>> after the action was XDP_REDIRECT was selected or at the end of the
>> driver's ndo_xdp_xmit function -- it seems like it would not conflict.
>
> I'm pretty sure I misunderstood what you were going after with
> XDP_REDIRECT reserving the headroom.  Our use case (patches coming in a
> few weeks) will populate the headroom coming out of the driver to XDP,
> and then once the XDP program extracts whatever hints it wants via
> helpers, I fully expect that area in the headroom to get stomped by
> something else.  If we want to send any of that hint data up farther,
> we'll already have it extracted via the helpers, and the eBPF program
> can happily assign it to wherever in the outbound metadata area.

Sure, these two are compatible with each other; in your case it's
populated before the prog is called, and the prog can use it while
processing; in Andy's case it's populated after the prog was called,
when we need to redirect, so both are fine.

Thanks,
Daniel
Waskiewicz Jr, Peter Sept. 28, 2017, 9:40 p.m. UTC | #16
On 9/28/17 2:23 PM, John Fastabend wrote:
> [...]
> 
>> I'm pretty sure I misunderstood what you were going after with
>> XDP_REDIRECT reserving the headroom.  Our use case (patches coming in a
>> few weeks) will populate the headroom coming out of the driver to XDP,
>> and then once the XDP program extracts whatever hints it wants via
>> helpers, I fully expect that area in the headroom to get stomped by
>> something else.  If we want to send any of that hint data up farther,
>> we'll already have it extracted via the helpers, and the eBPF program
>> can happily assign it to wherever in the outbound metadata area.
> 
> In case its not obvious with the latest xdp metadata patches the outbound
> metadata can then be pushed into skb fields via a tc_cls program if needed.

Yes, that was what I was alluding to with "can happily assign it to 
wherever."  The patches we're working on are driver->XDP, then anything 
else using the latest meta-data patches would be XDP->anywhere else.  So 
I don't think we're going to step on any toes.

Thanks John,
-PJ
Jesper Dangaard Brouer Sept. 29, 2017, 7:09 a.m. UTC | #17
On Wed, 27 Sep 2017 10:32:36 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Sep 27, 2017 at 04:54:57PM +0200, Jesper Dangaard Brouer wrote:
> > On Wed, 27 Sep 2017 06:35:40 -0700
> > John Fastabend <john.fastabend@gmail.com> wrote:
> >   
> > > On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:  
> > > > On Tue, 26 Sep 2017 21:58:53 +0200
> > > > Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > >     
> > > >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> > > >> [...]    
> > > >>> I'm currently implementing a cpumap type, that transfers raw XDP frames
> > > >>> to another CPU, and the SKB is allocated on the remote CPU.  (It
> > > >>> actually works extremely well).      
> > > >>
> > > >> Meaning you let all the XDP_PASS packets get processed on a
> > > >> different CPU, so you can reserve the whole CPU just for
> > > >> prefiltering, right?     
> > > > 
> > > > Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
> > > > The trick is using the map-flush point, to transfer packets in bulk to
> > > > the remote CPU (single call IPC is too slow), but at the same time
> > > > flush single packets if NAPI didn't see a bulk.
> > > >     
> > > >> Do you have some numbers to share at this point, just curious when
> > > >> you mention it works extremely well.    
> > > > 
> > > > Sure... I've done a lot of benchmarking on this patchset ;-)
> > > > I have a benchmark program called xdp_redirect_cpu [1][2], that collect
> > > > stats via tracepoints (atm I'm limiting bulking 8 packets, and have
> > > > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)
> > > > 
> > > >  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
> > > >  [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
> > > > 
> > > > Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> > > > packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU, that
> > > > the 11.9Mpps DDoS attack is hitting.
> > > > 
> > > > Running XDP/eBPF prog_num:4
> > > > XDP-cpumap      CPU:to  pps            drop-pps    extra-info
> > > > XDP-RX          0       12,030,471     11,966,982  0          
> > > > XDP-RX          total   12,030,471     11,966,982 
> > > > cpumap-enqueue    0:2   63,488         0           0          
> > > > cpumap-enqueue  sum:2   63,488         0           0          
> > > > cpumap_kthread  2       63,488         0           3          time_exceed
> > > > cpumap_kthread  total   63,488         0           0          
> > > > redirect_err    total   0              0          
> > > > 
> > > > $ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
> > > > Local /Remote
> > > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > > Send   Recv   Size     Size    Time     Rate         
> > > > bytes  Bytes  bytes    bytes   secs.    per sec   
> > > > 
> > > > 16384  87380  1024     1024    10.00    12735.97   
> > > > 16384  87380 
> > > > 
> > > > The netperf TCP_CRR performance is the same, without XDP loaded.
> > > >     
> > > 
> > > Just curious could you also try this with RPS enabled (or does this have
> > > RPS enabled). RPS should effectively do the same thing but higher in the
> > > stack. I'm curious what the delta would be. Might be another interesting
> > > case and fairly easy to setup if you already have the above scripts.  
> > 
> > Yes, I'm essentially competing with RSP, thus such a comparison is very
> > relevant...
> > 
> > This is only a 6 CPUs system. Allocate 2 CPUs to RPS receive and let
> > other 4 CPUS process packet.
> > 
> > Summary of RPS (Receive Packet Steering) performance:
> >  * End result is 6.3 Mpps max performance
> >  * netperf TCP_CRR is 1 trans/sec.
> >  * Each RX-RPS CPU stall at ~3.2Mpps.
> > 
> > The full test report below with setup:
> > 
> > The mask needed::
> > 
> >  perl -e 'printf "%b\n",0x3C'
> >  111100
> > 
> > RPS setup::
> > 
> >  sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries'
> > 
> >  for N in $(seq 0 5) ; do \
> >    sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \
> >    sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \
> >    grep -H . /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \
> >  done
> > 
> > Reduce RX queues to two ::
> > 
> >  ethtool -L ixgbe1 combined 2
> > 
> > IRQ align to CPU numbers::
> > 
> >  $ ~/setup01.sh
> >  Not root, running with sudo
> >   --- Disable Ethernet flow-control ---
> >  rx unmodified, ignoring
> >  tx unmodified, ignoring
> >  no pause parameters changed, aborting
> >  rx unmodified, ignoring
> >  tx unmodified, ignoring
> >  no pause parameters changed, aborting
> >   --- Align IRQs ---
> >  /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0
> >  /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1
> >  /proc/irq/56/ixgbe1/../smp_affinity_list:0-5
> > 
> > $ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus
> > /sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c
> > /sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c
> > 
> > Generator is sending: 12,715,782 tx_packets /sec
> > 
> >  ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \
> >     -d 172.16.0.2 -t8
> > 
> > $ nstat > /dev/null && sleep 1 && nstat
> > #kernel
> > IpInReceives                    6346544            0.0
> > IpInDelivers                    6346544            0.0
> > IpOutRequests                   1020               0.0
> > IcmpOutMsgs                     1020               0.0
> > IcmpOutDestUnreachs             1020               0.0
> > IcmpMsgOutType3                 1020               0.0
> > UdpNoPorts                      6346898            0.0
> > IpExtInOctets                   291964714          0.0
> > IpExtOutOctets                  73440              0.0
> > IpExtInNoECTPkts                6347063            0.0
> > 
> > $ mpstat -P ALL -u -I SCPU -I SUM
> > 
> > Average:     CPU    %usr   %nice    %sys   %irq   %soft  %idle
> > Average:     all    0.00    0.00    0.00   0.42   72.97  26.61
> > Average:       0    0.00    0.00    0.00   0.17   99.83   0.00
> > Average:       1    0.00    0.00    0.00   0.17   99.83   0.00
> > Average:       2    0.00    0.00    0.00   0.67   60.37  38.96
> > Average:       3    0.00    0.00    0.00   0.67   58.70  40.64
> > Average:       4    0.00    0.00    0.00   0.67   59.53  39.80
> > Average:       5    0.00    0.00    0.00   0.67   58.93  40.40
> > 
> > Average:     CPU    intr/s
> > Average:     all 152067.22
> > Average:       0  50064.73
> > Average:       1  50089.35
> > Average:       2  45095.17
> > Average:       3  44875.04
> > Average:       4  44906.32
> > Average:       5  45152.08
> > 
> > Average:     CPU     TIMER/s   NET_TX/s   NET_RX/s TASKLET/s  SCHED/s     RCU/s
> > Average:       0      609.48       0.17   49431.28      0.00     2.66     21.13
> > Average:       1      567.55       0.00   49498.00      0.00     2.66     21.13
> > Average:       2      998.34       0.00   43941.60      4.16    82.86     68.22
> > Average:       3      540.60       0.17   44140.27      0.00    85.52    108.49
> > Average:       4      537.27       0.00   44219.63      0.00    84.53     64.89
> > Average:       5      530.78       0.17   44445.59      0.00    85.02     90.52
> > 
> > From mpstat it looks like it is the RX-RPS CPUs that are the bottleneck.
> > 
> > Show adapter(s) (ixgbe1) statistics (ONLY that changed!)
> > Ethtool(ixgbe1) stat:     11109531 (   11,109,531) <= fdir_miss /sec
> > Ethtool(ixgbe1) stat:    380632356 (  380,632,356) <= rx_bytes /sec
> > Ethtool(ixgbe1) stat:    812792611 (  812,792,611) <= rx_bytes_nic /sec
> > Ethtool(ixgbe1) stat:      1753550 (    1,753,550) <= rx_missed_errors /sec
> > Ethtool(ixgbe1) stat:      4602487 (    4,602,487) <= rx_no_dma_resources /sec
> > Ethtool(ixgbe1) stat:      6343873 (    6,343,873) <= rx_packets /sec
> > Ethtool(ixgbe1) stat:     10946441 (   10,946,441) <= rx_pkts_nic /sec
> > Ethtool(ixgbe1) stat:    190287853 (  190,287,853) <= rx_queue_0_bytes /sec
> > Ethtool(ixgbe1) stat:      3171464 (    3,171,464) <= rx_queue_0_packets /sec
> > Ethtool(ixgbe1) stat:    190344503 (  190,344,503) <= rx_queue_1_bytes /sec
> > Ethtool(ixgbe1) stat:      3172408 (    3,172,408) <= rx_queue_1_packets /sec
> > 
> > Notice, each RX-CPU can only process 3.1Mpps.
> > 
> > RPS RX-CPU(0):
> > 
> >  # Overhead  CPU  Symbol
> >  # ........  ...  .......................................
> >  #
> >     11.72%  000  [k] ixgbe_poll
> >     11.29%  000  [k] _raw_spin_lock
> >     10.35%  000  [k] dev_gro_receive
> >      8.36%  000  [k] __build_skb
> >      7.35%  000  [k] __skb_get_hash
> >      6.22%  000  [k] enqueue_to_backlog
> >      5.89%  000  [k] __skb_flow_dissect
> >      4.43%  000  [k] inet_gro_receive
> >      4.19%  000  [k] ___slab_alloc
> >      3.90%  000  [k] queued_spin_lock_slowpath
> >      3.85%  000  [k] kmem_cache_alloc
> >      3.06%  000  [k] build_skb
> >      2.66%  000  [k] get_rps_cpu
> >      2.57%  000  [k] napi_gro_receive
> >      2.34%  000  [k] eth_type_trans
> >      1.81%  000  [k] __cmpxchg_double_slab.isra.61
> >      1.47%  000  [k] ixgbe_alloc_rx_buffers
> >      1.43%  000  [k] get_partial_node.isra.81
> >      0.84%  000  [k] swiotlb_sync_single
> >      0.74%  000  [k] udp4_gro_receive
> >      0.73%  000  [k] netif_receive_skb_internal
> >      0.72%  000  [k] udp_gro_receive
> >      0.63%  000  [k] skb_gro_reset_offset
> >      0.49%  000  [k] __skb_flow_get_ports
> >      0.48%  000  [k] llist_add_batch
> >      0.36%  000  [k] swiotlb_sync_single_for_cpu
> >      0.34%  000  [k] __slab_alloc
> > 
> > 
> > Remote RPS-CPU(3) getting packets::
> > 
> >  # Overhead  CPU  Symbol
> >  # ........  ...  ..............................................
> >  #
> >     33.02%  003  [k] poll_idle
> >     10.99%  003  [k] __netif_receive_skb_core
> >     10.45%  003  [k] page_frag_free
> >      8.49%  003  [k] ip_rcv
> >      4.19%  003  [k] fib_table_lookup
> >      2.84%  003  [k] __udp4_lib_rcv
> >      2.81%  003  [k] __slab_free

Notice slow-path of SLUB

> >      2.23%  003  [k] __udp4_lib_lookup
> >      2.09%  003  [k] ip_route_input_rcu
> >      2.07%  003  [k] kmem_cache_free
> >      2.06%  003  [k] udp_v4_early_demux
> >      1.73%  003  [k] ip_rcv_finish  
> 
> Very interesting data.

You removed some of the more interesting parts of the perf report, which
showed us hitting more of the SLUB slowpath for SKBs.  The slowpath
consists of many separate function calls, thus it doesn't bubble to the
top (the FlameGraph tool shows them more easily).

> So above perf report compares to xdp-redirect-cpu this one:
> Perf top on a CPU(3) that has to alloc and free SKBs etc.
> 
> # Overhead  CPU  Symbol
> # ........  ...  .......................................
> #
>     15.51%  003  [k] fib_table_lookup
>      8.91%  003  [k] cpu_map_kthread_run
>      8.04%  003  [k] build_skb
>      7.88%  003  [k] page_frag_free
>      5.13%  003  [k] kmem_cache_alloc
>      4.76%  003  [k] ip_route_input_rcu
>      4.59%  003  [k] kmem_cache_free
>      4.02%  003  [k] __udp4_lib_rcv
>      3.20%  003  [k] fib_validate_source
>      3.02%  003  [k] __netif_receive_skb_core
>      3.02%  003  [k] udp_v4_early_demux
>      2.90%  003  [k] ip_rcv
>      2.80%  003  [k] ip_rcv_finish
> 
> right?
> and in RPS case the consumer cpu is 33% idle whereas in redirect-cpu
> you can load it up all the way.
> Am I interpreting all this correctly that with RPS cpu0 cannot
> distribute the packets to other cpus fast enough and that's
> a bottleneck?

Yes, exactly. The work needed on the RPS cpu0 is simply too much.

> whereas in redirect-cpu you're doing early packet distribution
> before skb alloc?

Yes, the main point is reducing the CPU cycles spent per packet by
doing early packet distribution.
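
For illustration only, the XDP side of that early distribution in the
samples linked above boils down to something like the sketch below;
BPF_MAP_TYPE_CPUMAP is from my not-yet-merged cpumap patches, and the
CPU selection here is deliberately trivial.

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* samples/bpf map/helper declarations */

struct bpf_map_def SEC("maps") cpu_map = {
	.type        = BPF_MAP_TYPE_CPUMAP,	/* prototype map type */
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(__u32),		/* queue size on the remote CPU */
	.max_entries = 64,
};

SEC("xdp")
int xdp_cpu_distribute(struct xdp_md *ctx)
{
	/* Pick a destination CPU before any skb work happens; the skb is
	 * then built by that CPU's kthread instead of the RX CPU.
	 */
	__u32 cpu = 2;	/* trivial static pick; the real prog filters/hashes */

	return bpf_redirect_map(&cpu_map, cpu, 0);
}

char _license[] SEC("license") = "GPL";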

> So in other words with redirect-cpu all consumer cpus are doing
> skb alloc and in RPS cpu0 is allocating skbs for all ?

Yes.

> and that's where 6M->12M performance gain comes from?

Yes, basically.  There are many small things that help this along, like
the cpumap case always hitting the SLUB fastpath.  Another big thing is
bulking.  It is sort of hidden, but the XDP_REDIRECT flush mechanism is
implementing the RX bulking (which I've been "screaming" about for the last
couple of years! ;-))
diff mbox series

Patch

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index d8f0c83..06ce63c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -94,6 +94,7 @@  bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons,
 
 	xdp.data_hard_start = *data_ptr - offset;
 	xdp.data = *data_ptr;
+	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = *data_ptr + *len;
 	orig_data = xdp.data;
 	mapping = rx_buf->mapping - bp->rx_dma_offset;
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 49b80da..d68478a 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -523,6 +523,7 @@  static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
 
 	xdp.data_hard_start = page_address(page);
 	xdp.data = (void *)cpu_addr;
+	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = xdp.data + len;
 	orig_data = xdp.data;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 1519dfb..f426762 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2107,6 +2107,7 @@  static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 		if (!skb) {
 			xdp.data = page_address(rx_buffer->page) +
 				   rx_buffer->page_offset;
+			xdp_set_data_meta_invalid(&xdp);
 			xdp.data_hard_start = xdp.data -
 					      i40e_rx_offset(rx_ring);
 			xdp.data_end = xdp.data + size;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d962368..04bb03b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2326,6 +2326,7 @@  static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		if (!skb) {
 			xdp.data = page_address(rx_buffer->page) +
 				   rx_buffer->page_offset;
+			xdp_set_data_meta_invalid(&xdp);
 			xdp.data_hard_start = xdp.data -
 					      ixgbe_rx_offset(rx_ring);
 			xdp.data_end = xdp.data + size;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index b97a55c8..8f9cb8a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -762,6 +762,7 @@  int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 
 			xdp.data_hard_start = va - frags[0].page_offset;
 			xdp.data = va;
+			xdp_set_data_meta_invalid(&xdp);
 			xdp.data_end = xdp.data + length;
 			orig_data = xdp.data;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index f1dd638..30b3f3f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -794,6 +794,7 @@  static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
 		return false;
 
 	xdp.data = va + *rx_headroom;
+	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = xdp.data + *len;
 	xdp.data_hard_start = va;
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 1c0187f..e3a38be 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1583,6 +1583,7 @@  static int nfp_net_run_xdp(struct bpf_prog *prog, void *data, void *hard_start,
 
 	xdp.data_hard_start = hard_start;
 	xdp.data = data + *off;
+	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = data + *off + *len;
 
 	orig_data = xdp.data;
diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index 6fc854b..48ec4c5 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -1004,6 +1004,7 @@  static bool qede_rx_xdp(struct qede_dev *edev,
 
 	xdp.data_hard_start = page_address(bd->data);
 	xdp.data = xdp.data_hard_start + *data_offset;
+	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = xdp.data + *len;
 
 	/* Queues always have a full reset currently, so for the time
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3c9985f..1757fd7 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1314,6 +1314,7 @@  static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 
 		xdp.data_hard_start = buf;
 		xdp.data = buf + pad;
+		xdp_set_data_meta_invalid(&xdp);
 		xdp.data_end = xdp.data + len;
 		orig_data = xdp.data;
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dd14a45..fc059f1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -554,6 +554,7 @@  static struct sk_buff *receive_small(struct net_device *dev,
 
 		xdp.data_hard_start = buf + VIRTNET_RX_PAD + vi->hdr_len;
 		xdp.data = xdp.data_hard_start + xdp_headroom;
+		xdp_set_data_meta_invalid(&xdp);
 		xdp.data_end = xdp.data + len;
 		orig_data = xdp.data;
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
@@ -686,6 +687,7 @@  static struct sk_buff *receive_mergeable(struct net_device *dev,
 		data = page_address(xdp_page) + offset;
 		xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len;
 		xdp.data = data + vi->hdr_len;
+		xdp_set_data_meta_invalid(&xdp);
 		xdp.data_end = xdp.data + (len - vi->hdr_len);
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8390859..2b672c5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -137,6 +137,7 @@  enum bpf_reg_type {
 	PTR_TO_MAP_VALUE,	 /* reg points to map element value */
 	PTR_TO_MAP_VALUE_OR_NULL,/* points to map elem value or NULL */
 	PTR_TO_STACK,		 /* reg == frame_pointer + offset */
+	PTR_TO_PACKET_META,	 /* skb->data - meta_len */
 	PTR_TO_PACKET,		 /* reg points to skb->data */
 	PTR_TO_PACKET_END,	 /* skb->data + headlen */
 };
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 052bab3..911d454 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -487,12 +487,14 @@  struct sk_filter {
 
 struct bpf_skb_data_end {
 	struct qdisc_skb_cb qdisc_cb;
+	void *data_meta;
 	void *data_end;
 };
 
 struct xdp_buff {
 	void *data;
 	void *data_end;
+	void *data_meta;
 	void *data_hard_start;
 };
 
@@ -507,7 +509,8 @@  static inline void bpf_compute_data_pointers(struct sk_buff *skb)
 	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
 
 	BUILD_BUG_ON(sizeof(*cb) > FIELD_SIZEOF(struct sk_buff, cb));
-	cb->data_end = skb->data + skb_headlen(skb);
+	cb->data_meta = skb->data - skb_metadata_len(skb);
+	cb->data_end  = skb->data + skb_headlen(skb);
 }
 
 static inline u8 *bpf_skb_cb(struct sk_buff *skb)
@@ -728,8 +731,22 @@  int xdp_do_redirect(struct net_device *dev,
 		    struct bpf_prog *prog);
 void xdp_do_flush_map(void);
 
+/* Drivers not supporting XDP metadata can use this helper, which
+ * rejects any room expansion for metadata as a result.
+ */
+static __always_inline void
+xdp_set_data_meta_invalid(struct xdp_buff *xdp)
+{
+	xdp->data_meta = xdp->data + 1;
+}
+
+static __always_inline bool
+xdp_data_meta_unsupported(const struct xdp_buff *xdp)
+{
+	return unlikely(xdp->data_meta > xdp->data);
+}
+
 void bpf_warn_invalid_xdp_action(u32 act);
-void bpf_warn_invalid_xdp_redirect(u32 ifindex);
 
 struct sock *do_sk_redirect_map(void);
 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f9db553..19e64bf 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -489,8 +489,9 @@  int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
  * the end of the header data, ie. at skb->end.
  */
 struct skb_shared_info {
-	unsigned short	_unused;
-	unsigned char	nr_frags;
+	__u8		__unused;
+	__u8		meta_len;
+	__u8		nr_frags;
 	__u8		tx_flags;
 	unsigned short	gso_size;
 	/* Warning: this field is not always filled in (UFO)! */
@@ -3400,6 +3401,69 @@  static inline ktime_t net_invalid_timestamp(void)
 	return 0;
 }
 
+static inline u8 skb_metadata_len(const struct sk_buff *skb)
+{
+	return skb_shinfo(skb)->meta_len;
+}
+
+static inline void *skb_metadata_end(const struct sk_buff *skb)
+{
+	return skb_mac_header(skb);
+}
+
+static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
+					  const struct sk_buff *skb_b,
+					  u8 meta_len)
+{
+	const void *a = skb_metadata_end(skb_a);
+	const void *b = skb_metadata_end(skb_b);
+	/* Using a more efficient variant than a plain call to memcmp(). */
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
+	u64 diffs = 0;
+
+	switch (meta_len) {
+#define __it(x, op) (x -= sizeof(u##op))
+#define __it_diff(a, b, op) (*(u##op *)__it(a, op)) ^ (*(u##op *)__it(b, op))
+	case 32: diffs |= __it_diff(a, b, 64);
+	case 24: diffs |= __it_diff(a, b, 64);
+	case 16: diffs |= __it_diff(a, b, 64);
+	case  8: diffs |= __it_diff(a, b, 64);
+		break;
+	case 28: diffs |= __it_diff(a, b, 64);
+	case 20: diffs |= __it_diff(a, b, 64);
+	case 12: diffs |= __it_diff(a, b, 64);
+	case  4: diffs |= __it_diff(a, b, 32);
+		break;
+	}
+	return diffs;
+#else
+	return memcmp(a - meta_len, b - meta_len, meta_len);
+#endif
+}
+
+static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
+					const struct sk_buff *skb_b)
+{
+	u8 len_a = skb_metadata_len(skb_a);
+	u8 len_b = skb_metadata_len(skb_b);
+
+	if (!(len_a | len_b))
+		return false;
+
+	return len_a != len_b ?
+	       true : __skb_metadata_differs(skb_a, skb_b, len_a);
+}
+
+static inline void skb_metadata_set(struct sk_buff *skb, u8 meta_len)
+{
+	skb_shinfo(skb)->meta_len = meta_len;
+}
+
+static inline void skb_metadata_clear(struct sk_buff *skb)
+{
+	skb_metadata_set(skb, 0);
+}
+
 struct sk_buff *skb_clone_sk(struct sk_buff *skb);
 
 #ifdef CONFIG_NETWORK_PHY_TIMESTAMPING
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 43ab5c4..e43491a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -582,6 +582,12 @@  enum bpf_attach_type {
  *	@map: pointer to sockmap to update
  *	@key: key to insert/update sock in map
  *	@flags: same flags as map update elem
+ *
+ * int bpf_xdp_adjust_meta(xdp_md, delta)
+ *     Adjust the xdp_md.data_meta by delta
+ *     @xdp_md: pointer to xdp_md
+ *     @delta: A positive/negative integer to be added to xdp_md.data_meta
+ *     Return: 0 on success or negative on error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -638,6 +644,7 @@  enum bpf_attach_type {
 	FN(redirect_map),		\
 	FN(sk_redirect_map),		\
 	FN(sock_map_update),		\
+	FN(xdp_adjust_meta),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -715,7 +722,7 @@  struct __sk_buff {
 	__u32 data_end;
 	__u32 napi_id;
 
-	/* accessed by BPF_PROG_TYPE_sk_skb types */
+	/* Accessed by BPF_PROG_TYPE_sk_skb types from here to ... */
 	__u32 family;
 	__u32 remote_ip4;	/* Stored in network byte order */
 	__u32 local_ip4;	/* Stored in network byte order */
@@ -723,6 +730,9 @@  struct __sk_buff {
 	__u32 local_ip6[4];	/* Stored in network byte order */
 	__u32 remote_port;	/* Stored in network byte order */
 	__u32 local_port;	/* stored in host byte order */
+	/* ... here. */
+
+	__u32 data_meta;
 };
 
 struct bpf_tunnel_key {
@@ -783,6 +793,7 @@  enum xdp_action {
 struct xdp_md {
 	__u32 data;
 	__u32 data_end;
+	__u32 data_meta;
 };
 
 enum sk_action {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b914fbe..f849eca 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -177,6 +177,12 @@  static __printf(1, 2) void verbose(const char *fmt, ...)
 	va_end(args);
 }
 
+static bool type_is_pkt_pointer(enum bpf_reg_type type)
+{
+	return type == PTR_TO_PACKET ||
+	       type == PTR_TO_PACKET_META;
+}
+
 /* string representation of 'enum bpf_reg_type' */
 static const char * const reg_type_str[] = {
 	[NOT_INIT]		= "?",
@@ -187,6 +193,7 @@  static __printf(1, 2) void verbose(const char *fmt, ...)
 	[PTR_TO_MAP_VALUE_OR_NULL] = "map_value_or_null",
 	[PTR_TO_STACK]		= "fp",
 	[PTR_TO_PACKET]		= "pkt",
+	[PTR_TO_PACKET_META]	= "pkt_meta",
 	[PTR_TO_PACKET_END]	= "pkt_end",
 };
 
@@ -226,7 +233,7 @@  static void print_verifier_state(struct bpf_verifier_state *state)
 			verbose("(id=%d", reg->id);
 			if (t != SCALAR_VALUE)
 				verbose(",off=%d", reg->off);
-			if (t == PTR_TO_PACKET)
+			if (type_is_pkt_pointer(t))
 				verbose(",r=%d", reg->range);
 			else if (t == CONST_PTR_TO_MAP ||
 				 t == PTR_TO_MAP_VALUE ||
@@ -519,6 +526,31 @@  static void mark_reg_known_zero(struct bpf_reg_state *regs, u32 regno)
 	__mark_reg_known_zero(regs + regno);
 }
 
+static bool reg_is_pkt_pointer(const struct bpf_reg_state *reg)
+{
+	return type_is_pkt_pointer(reg->type);
+}
+
+static bool reg_is_pkt_pointer_any(const struct bpf_reg_state *reg)
+{
+	return reg_is_pkt_pointer(reg) ||
+	       reg->type == PTR_TO_PACKET_END;
+}
+
+/* Unmodified PTR_TO_PACKET[_META,_END] register from ctx access. */
+static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg,
+				    enum bpf_reg_type which)
+{
+	/* The register can already have a range from prior markings.
+	 * This is fine as long as it hasn't been advanced from its
+	 * origin.
+	 */
+	return reg->type == which &&
+	       reg->id == 0 &&
+	       reg->off == 0 &&
+	       tnum_equals_const(reg->var_off, 0);
+}
+
 /* Attempts to improve min/max values based on var_off information */
 static void __update_reg_bounds(struct bpf_reg_state *reg)
 {
@@ -702,6 +734,7 @@  static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_STACK:
 	case PTR_TO_CTX:
 	case PTR_TO_PACKET:
+	case PTR_TO_PACKET_META:
 	case PTR_TO_PACKET_END:
 	case CONST_PTR_TO_MAP:
 		return true;
@@ -1047,7 +1080,10 @@  static int check_ptr_alignment(struct bpf_verifier_env *env,
 
 	switch (reg->type) {
 	case PTR_TO_PACKET:
-		/* special case, because of NET_IP_ALIGN */
+	case PTR_TO_PACKET_META:
+		/* Special case, because of NET_IP_ALIGN. Given metadata sits
+		 * right in front, treat it the very same way.
+		 */
 		return check_pkt_ptr_alignment(reg, off, size, strict);
 	case PTR_TO_MAP_VALUE:
 		pointer_desc = "value ";
@@ -1124,8 +1160,8 @@  static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		err = check_ctx_access(env, insn_idx, off, size, t, &reg_type);
 		if (!err && t == BPF_READ && value_regno >= 0) {
 			/* ctx access returns either a scalar, or a
-			 * PTR_TO_PACKET[_END].  In the latter case, we know
-			 * the offset is zero.
+			 * PTR_TO_PACKET[_META,_END]. In the latter
+			 * case, we know the offset is zero.
 			 */
 			if (reg_type == SCALAR_VALUE)
 				mark_reg_unknown(state->regs, value_regno);
@@ -1170,7 +1206,7 @@  static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		} else {
 			err = check_stack_read(state, off, size, value_regno);
 		}
-	} else if (reg->type == PTR_TO_PACKET) {
+	} else if (reg_is_pkt_pointer(reg)) {
 		if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL, t)) {
 			verbose("cannot write into packet\n");
 			return -EACCES;
@@ -1310,6 +1346,7 @@  static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
 
 	switch (reg->type) {
 	case PTR_TO_PACKET:
+	case PTR_TO_PACKET_META:
 		return check_packet_access(env, regno, reg->off, access_size);
 	case PTR_TO_MAP_VALUE:
 		return check_map_access(env, regno, reg->off, access_size);
@@ -1342,7 +1379,7 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		return 0;
 	}
 
-	if (type == PTR_TO_PACKET &&
+	if (type_is_pkt_pointer(type) &&
 	    !may_access_direct_pkt_data(env, meta, BPF_READ)) {
 		verbose("helper access to the packet is not allowed\n");
 		return -EACCES;
@@ -1351,7 +1388,8 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 	if (arg_type == ARG_PTR_TO_MAP_KEY ||
 	    arg_type == ARG_PTR_TO_MAP_VALUE) {
 		expected_type = PTR_TO_STACK;
-		if (type != PTR_TO_PACKET && type != expected_type)
+		if (!type_is_pkt_pointer(type) &&
+		    type != expected_type)
 			goto err_type;
 	} else if (arg_type == ARG_CONST_SIZE ||
 		   arg_type == ARG_CONST_SIZE_OR_ZERO) {
@@ -1375,7 +1413,8 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		 */
 		if (register_is_null(*reg))
 			/* final test in check_stack_boundary() */;
-		else if (type != PTR_TO_PACKET && type != PTR_TO_MAP_VALUE &&
+		else if (!type_is_pkt_pointer(type) &&
+			 type != PTR_TO_MAP_VALUE &&
 			 type != expected_type)
 			goto err_type;
 		meta->raw_mode = arg_type == ARG_PTR_TO_UNINIT_MEM;
@@ -1401,7 +1440,7 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 			verbose("invalid map_ptr to access map->key\n");
 			return -EACCES;
 		}
-		if (type == PTR_TO_PACKET)
+		if (type_is_pkt_pointer(type))
 			err = check_packet_access(env, regno, reg->off,
 						  meta->map_ptr->key_size);
 		else
@@ -1417,7 +1456,7 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 			verbose("invalid map_ptr to access map->value\n");
 			return -EACCES;
 		}
-		if (type == PTR_TO_PACKET)
+		if (type_is_pkt_pointer(type))
 			err = check_packet_access(env, regno, reg->off,
 						  meta->map_ptr->value_size);
 		else
@@ -1590,8 +1629,8 @@  static int check_raw_mode(const struct bpf_func_proto *fn)
 	return count > 1 ? -EINVAL : 0;
 }
 
-/* Packet data might have moved, any old PTR_TO_PACKET[_END] are now invalid,
- * so turn them into unknown SCALAR_VALUE.
+/* Packet data might have moved, any old PTR_TO_PACKET[_META,_END]
+ * are now invalid, so turn them into unknown SCALAR_VALUE.
  */
 static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
 {
@@ -1600,18 +1639,15 @@  static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
 	int i;
 
 	for (i = 0; i < MAX_BPF_REG; i++)
-		if (regs[i].type == PTR_TO_PACKET ||
-		    regs[i].type == PTR_TO_PACKET_END)
+		if (reg_is_pkt_pointer_any(&regs[i]))
 			mark_reg_unknown(regs, i);
 
 	for (i = 0; i < MAX_BPF_STACK; i += BPF_REG_SIZE) {
 		if (state->stack_slot_type[i] != STACK_SPILL)
 			continue;
 		reg = &state->spilled_regs[i / BPF_REG_SIZE];
-		if (reg->type != PTR_TO_PACKET &&
-		    reg->type != PTR_TO_PACKET_END)
-			continue;
-		__mark_reg_unknown(reg);
+		if (reg_is_pkt_pointer_any(reg))
+			__mark_reg_unknown(reg);
 	}
 }
 
@@ -1871,7 +1907,7 @@  static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 		}
 		dst_reg->var_off = tnum_add(ptr_reg->var_off, off_reg->var_off);
 		dst_reg->off = ptr_reg->off;
-		if (ptr_reg->type == PTR_TO_PACKET) {
+		if (reg_is_pkt_pointer(ptr_reg)) {
 			dst_reg->id = ++env->id_gen;
 			/* something was added to pkt_ptr, set range to zero */
 			dst_reg->range = 0;
@@ -1931,7 +1967,7 @@  static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 		}
 		dst_reg->var_off = tnum_sub(ptr_reg->var_off, off_reg->var_off);
 		dst_reg->off = ptr_reg->off;
-		if (ptr_reg->type == PTR_TO_PACKET) {
+		if (reg_is_pkt_pointer(ptr_reg)) {
 			dst_reg->id = ++env->id_gen;
 			/* something was added to pkt_ptr, set range to zero */
 			if (smin_val < 0)
@@ -2421,7 +2457,8 @@  static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 }
 
 static void find_good_pkt_pointers(struct bpf_verifier_state *state,
-				   struct bpf_reg_state *dst_reg)
+				   struct bpf_reg_state *dst_reg,
+				   enum bpf_reg_type type)
 {
 	struct bpf_reg_state *regs = state->regs, *reg;
 	int i;
@@ -2483,7 +2520,7 @@  static void find_good_pkt_pointers(struct bpf_verifier_state *state,
 	 * dst_reg->off is known < MAX_PACKET_OFF, therefore it fits in a u16.
 	 */
 	for (i = 0; i < MAX_BPF_REG; i++)
-		if (regs[i].type == PTR_TO_PACKET && regs[i].id == dst_reg->id)
+		if (regs[i].type == type && regs[i].id == dst_reg->id)
 			/* keep the maximum range already checked */
 			regs[i].range = max_t(u16, regs[i].range, dst_reg->off);
 
@@ -2491,7 +2528,7 @@  static void find_good_pkt_pointers(struct bpf_verifier_state *state,
 		if (state->stack_slot_type[i] != STACK_SPILL)
 			continue;
 		reg = &state->spilled_regs[i / BPF_REG_SIZE];
-		if (reg->type == PTR_TO_PACKET && reg->id == dst_reg->id)
+		if (reg->type == type && reg->id == dst_reg->id)
 			reg->range = max_t(u16, reg->range, dst_reg->off);
 	}
 }
@@ -2856,19 +2893,39 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 	} else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JGT &&
 		   dst_reg->type == PTR_TO_PACKET &&
 		   regs[insn->src_reg].type == PTR_TO_PACKET_END) {
-		find_good_pkt_pointers(this_branch, dst_reg);
+		find_good_pkt_pointers(this_branch, dst_reg, PTR_TO_PACKET);
 	} else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JLT &&
 		   dst_reg->type == PTR_TO_PACKET &&
 		   regs[insn->src_reg].type == PTR_TO_PACKET_END) {
-		find_good_pkt_pointers(other_branch, dst_reg);
+		find_good_pkt_pointers(other_branch, dst_reg, PTR_TO_PACKET);
 	} else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JGE &&
 		   dst_reg->type == PTR_TO_PACKET_END &&
 		   regs[insn->src_reg].type == PTR_TO_PACKET) {
-		find_good_pkt_pointers(other_branch, &regs[insn->src_reg]);
+		find_good_pkt_pointers(other_branch, &regs[insn->src_reg],
+				       PTR_TO_PACKET);
 	} else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JLE &&
 		   dst_reg->type == PTR_TO_PACKET_END &&
 		   regs[insn->src_reg].type == PTR_TO_PACKET) {
-		find_good_pkt_pointers(this_branch, &regs[insn->src_reg]);
+		find_good_pkt_pointers(this_branch, &regs[insn->src_reg],
+				       PTR_TO_PACKET);
+	} else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JGT &&
+		   dst_reg->type == PTR_TO_PACKET_META &&
+		   reg_is_init_pkt_pointer(&regs[insn->src_reg], PTR_TO_PACKET)) {
+		find_good_pkt_pointers(this_branch, dst_reg, PTR_TO_PACKET_META);
+	} else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JLT &&
+		   dst_reg->type == PTR_TO_PACKET_META &&
+		   reg_is_init_pkt_pointer(&regs[insn->src_reg], PTR_TO_PACKET)) {
+		find_good_pkt_pointers(other_branch, dst_reg, PTR_TO_PACKET_META);
+	} else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JGE &&
+		   reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
+		   regs[insn->src_reg].type == PTR_TO_PACKET_META) {
+		find_good_pkt_pointers(other_branch, &regs[insn->src_reg],
+				       PTR_TO_PACKET_META);
+	} else if (BPF_SRC(insn->code) == BPF_X && opcode == BPF_JLE &&
+		   reg_is_init_pkt_pointer(dst_reg, PTR_TO_PACKET) &&
+		   regs[insn->src_reg].type == PTR_TO_PACKET_META) {
+		find_good_pkt_pointers(this_branch, &regs[insn->src_reg],
+				       PTR_TO_PACKET_META);
 	} else if (is_pointer_value(env, insn->dst_reg)) {
 		verbose("R%d pointer comparison prohibited\n", insn->dst_reg);
 		return -EACCES;
@@ -3298,8 +3355,9 @@  static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 			return false;
 		/* Check our ids match any regs they're supposed to */
 		return check_ids(rold->id, rcur->id, idmap);
+	case PTR_TO_PACKET_META:
 	case PTR_TO_PACKET:
-		if (rcur->type != PTR_TO_PACKET)
+		if (rcur->type != rold->type)
 			return false;
 		/* We must have at least as much range as the old ptr
 		 * did, so that any accesses which were safe before are
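
To make the new verifier cases concrete: the packet-data side of the compare
has to be the pointer as loaded from the context, not an advanced copy. A
hedged sketch, under the same toolchain assumptions as the example further up:

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* assumed: SEC() and bpf_xdp_adjust_meta() */

SEC("xdp")
int xdp_meta_bounds(struct xdp_md *ctx)
{
	__u8 *data_meta, *data;

	if (bpf_xdp_adjust_meta(ctx, -8))
		return XDP_PASS;

	data_meta = (void *)(long)ctx->data_meta;
	data = (void *)(long)ctx->data;

	/* Accepted: "data" is exactly the pointer loaded from the context,
	 * so the fall-through branch gets a safe range for data_meta.
	 */
	if (data_meta + 8 > data)
		return XDP_PASS;

	/* A compare against an advanced pointer, e.g.
	 *	if (data_meta + 8 > data + 14)
	 * is not matched by these cases, and a later data_meta load
	 * would then be rejected.
	 */
	return data_meta[0] ? XDP_DROP : XDP_PASS;
}

char _license[] SEC("license") = "GPL";
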
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index df67251..a86e668 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -162,6 +162,7 @@  int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 
 	xdp.data_hard_start = data;
 	xdp.data = data + XDP_PACKET_HEADROOM + NET_IP_ALIGN;
+	xdp.data_meta = xdp.data;
 	xdp.data_end = xdp.data + size;
 
 	retval = bpf_test_run(prog, &xdp, repeat, &duration);
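
For drivers that do support metadata, the per-frame xdp_buff setup mirrors
what bpf_prog_test_run_xdp() does above. A sketch with hypothetical parameter
names, not code from this series:

#include <linux/filter.h>	/* struct xdp_buff */

/* Hypothetical helper: how a driver that supports metadata would set up
 * its xdp_buff for each received frame.
 */
static void example_init_xdp(struct xdp_buff *xdp, void *hard_start,
			     unsigned int headroom, unsigned int len)
{
	xdp->data_hard_start = hard_start;
	xdp->data = hard_start + headroom;
	xdp->data_meta = xdp->data;	/* empty metadata area, but supported */
	xdp->data_end = xdp->data + len;
}
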
diff --git a/net/core/dev.c b/net/core/dev.c
index 97abddd..e350c76 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3864,8 +3864,8 @@  static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 				     struct bpf_prog *xdp_prog)
 {
+	u32 metalen, act = XDP_DROP;
 	struct xdp_buff xdp;
-	u32 act = XDP_DROP;
 	void *orig_data;
 	int hlen, off;
 	u32 mac_len;
@@ -3876,8 +3876,25 @@  static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	if (skb_cloned(skb))
 		return XDP_PASS;
 
-	if (skb_linearize(skb))
-		goto do_drop;
+	/* XDP packets must be linear and must have sufficient headroom
+	 * of XDP_PACKET_HEADROOM bytes. This is a guarantee that native
+	 * XDP also provides, so we need to enforce it here as well.
+	 */
+	if (skb_is_nonlinear(skb) ||
+	    skb_headroom(skb) < XDP_PACKET_HEADROOM) {
+		int hroom = XDP_PACKET_HEADROOM - skb_headroom(skb);
+		int troom = skb->tail + skb->data_len - skb->end;
+
+		/* In case we have to go down the path and also linearize,
+		 * then let's do the pskb_expand_head() work just once here.
+		 */
+		if (pskb_expand_head(skb,
+				     hroom > 0 ? ALIGN(hroom, NET_SKB_PAD) : 0,
+				     troom > 0 ? troom + 128 : 0, GFP_ATOMIC))
+			goto do_drop;
+		if (troom > 0 && __skb_linearize(skb))
+			goto do_drop;
+	}
 
 	/* The XDP program wants to see the packet starting at the MAC
 	 * header.
@@ -3885,6 +3902,7 @@  static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	mac_len = skb->data - skb_mac_header(skb);
 	hlen = skb_headlen(skb) + mac_len;
 	xdp.data = skb->data - mac_len;
+	xdp.data_meta = xdp.data;
 	xdp.data_end = xdp.data + hlen;
 	xdp.data_hard_start = skb->data - skb_headroom(skb);
 	orig_data = xdp.data;
@@ -3902,10 +3920,12 @@  static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	case XDP_REDIRECT:
 	case XDP_TX:
 		__skb_push(skb, mac_len);
-		/* fall through */
+		break;
 	case XDP_PASS:
+		metalen = xdp.data - xdp.data_meta;
+		if (metalen)
+			skb_metadata_set(skb, metalen);
 		break;
-
 	default:
 		bpf_warn_invalid_xdp_action(act);
 		/* fall through */
@@ -4695,6 +4715,7 @@  static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
 		diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
 		diffs |= p->vlan_tci ^ skb->vlan_tci;
 		diffs |= skb_metadata_dst_cmp(p, skb);
+		diffs |= skb_metadata_differs(p, skb);
 		if (maclen == ETH_HLEN)
 			diffs |= compare_ether_header(skb_mac_header(p),
 						      skb_mac_header(skb));
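
On the skb side of this transfer, a clsact ingress program can pick the
metadata up through the new __sk_buff data_meta field and, for example, move
it into skb->mark. Sketch only; loader conventions and the helper header are
assumptions, and the value read back matches the illustrative XDP program
further up:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"	/* assumed: provides SEC() */

SEC("classifier")
int cls_read_meta(struct __sk_buff *skb)
{
	__u32 *meta = (void *)(long)skb->data_meta;
	void *data = (void *)(long)skb->data;

	/* Same rule as on the XDP side: bound data_meta against data
	 * before loading from it.
	 */
	if ((void *)(meta + 1) > data)
		return TC_ACT_OK;

	skb->mark = *meta;	/* hand the XDP-provided value to the stack */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
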
diff --git a/net/core/filter.c b/net/core/filter.c
index c468e7c..9b6e7e8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2447,14 +2447,26 @@  static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len)
 	.arg3_type	= ARG_ANYTHING,
 };
 
+static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
+{
+	return xdp_data_meta_unsupported(xdp) ? 0 :
+	       xdp->data - xdp->data_meta;
+}
+
 BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
 {
+	unsigned long metalen = xdp_get_metalen(xdp);
+	void *data_start = xdp->data_hard_start + metalen;
 	void *data = xdp->data + offset;
 
-	if (unlikely(data < xdp->data_hard_start ||
+	if (unlikely(data < data_start ||
 		     data > xdp->data_end - ETH_HLEN))
 		return -EINVAL;
 
+	if (metalen)
+		memmove(xdp->data_meta + offset,
+			xdp->data_meta, metalen);
+	xdp->data_meta += offset;
 	xdp->data = data;
 
 	return 0;
@@ -2468,6 +2480,33 @@  static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len)
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
+{
+	void *meta = xdp->data_meta + offset;
+	unsigned long metalen = xdp->data - meta;
+
+	if (xdp_data_meta_unsupported(xdp))
+		return -ENOTSUPP;
+	if (unlikely(meta < xdp->data_hard_start ||
+		     meta > xdp->data))
+		return -EINVAL;
+	if (unlikely((metalen & (sizeof(__u32) - 1)) ||
+		     (metalen > 32)))
+		return -EACCES;
+
+	xdp->data_meta = meta;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
+	.func		= bpf_xdp_adjust_meta,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+};
+
 static int __bpf_tx_xdp(struct net_device *dev,
 			struct bpf_map *map,
 			struct xdp_buff *xdp,
@@ -2692,7 +2731,8 @@  bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_clone_redirect ||
 	    func == bpf_l3_csum_replace ||
 	    func == bpf_l4_csum_replace ||
-	    func == bpf_xdp_adjust_head)
+	    func == bpf_xdp_adjust_head ||
+	    func == bpf_xdp_adjust_meta)
 		return true;
 
 	return false;
@@ -3288,6 +3328,8 @@  static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 		return &bpf_get_smp_processor_id_proto;
 	case BPF_FUNC_xdp_adjust_head:
 		return &bpf_xdp_adjust_head_proto;
+	case BPF_FUNC_xdp_adjust_meta:
+		return &bpf_xdp_adjust_meta_proto;
 	case BPF_FUNC_redirect:
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
@@ -3418,6 +3460,7 @@  static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
 	case bpf_ctx_range_till(struct __sk_buff, remote_ip4, remote_ip4):
 	case bpf_ctx_range_till(struct __sk_buff, local_ip4, local_ip4):
 	case bpf_ctx_range(struct __sk_buff, data):
+	case bpf_ctx_range(struct __sk_buff, data_meta):
 	case bpf_ctx_range(struct __sk_buff, data_end):
 		if (size != size_default)
 			return false;
@@ -3444,6 +3487,7 @@  static bool sk_filter_is_valid_access(int off, int size,
 	switch (off) {
 	case bpf_ctx_range(struct __sk_buff, tc_classid):
 	case bpf_ctx_range(struct __sk_buff, data):
+	case bpf_ctx_range(struct __sk_buff, data_meta):
 	case bpf_ctx_range(struct __sk_buff, data_end):
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 		return false;
@@ -3468,6 +3512,7 @@  static bool lwt_is_valid_access(int off, int size,
 	switch (off) {
 	case bpf_ctx_range(struct __sk_buff, tc_classid):
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
+	case bpf_ctx_range(struct __sk_buff, data_meta):
 		return false;
 	}
 
@@ -3586,6 +3631,9 @@  static bool tc_cls_act_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, data):
 		info->reg_type = PTR_TO_PACKET;
 		break;
+	case bpf_ctx_range(struct __sk_buff, data_meta):
+		info->reg_type = PTR_TO_PACKET_META;
+		break;
 	case bpf_ctx_range(struct __sk_buff, data_end):
 		info->reg_type = PTR_TO_PACKET_END;
 		break;
@@ -3619,6 +3667,9 @@  static bool xdp_is_valid_access(int off, int size,
 	case offsetof(struct xdp_md, data):
 		info->reg_type = PTR_TO_PACKET;
 		break;
+	case offsetof(struct xdp_md, data_meta):
+		info->reg_type = PTR_TO_PACKET_META;
+		break;
 	case offsetof(struct xdp_md, data_end):
 		info->reg_type = PTR_TO_PACKET_END;
 		break;
@@ -3677,6 +3728,12 @@  static bool sk_skb_is_valid_access(int off, int size,
 				   enum bpf_access_type type,
 				   struct bpf_insn_access_aux *info)
 {
+	switch (off) {
+	case bpf_ctx_range(struct __sk_buff, tc_classid):
+	case bpf_ctx_range(struct __sk_buff, data_meta):
+		return false;
+	}
+
 	if (type == BPF_WRITE) {
 		switch (off) {
 		case bpf_ctx_range(struct __sk_buff, mark):
@@ -3689,8 +3746,6 @@  static bool sk_skb_is_valid_access(int off, int size,
 	}
 
 	switch (off) {
-	case bpf_ctx_range(struct __sk_buff, tc_classid):
-		return false;
 	case bpf_ctx_range(struct __sk_buff, data):
 		info->reg_type = PTR_TO_PACKET;
 		break;
@@ -3847,6 +3902,15 @@  static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				      offsetof(struct sk_buff, data));
 		break;
 
+	case offsetof(struct __sk_buff, data_meta):
+		off  = si->off;
+		off -= offsetof(struct __sk_buff, data_meta);
+		off += offsetof(struct sk_buff, cb);
+		off += offsetof(struct bpf_skb_data_end, data_meta);
+		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg,
+				      si->src_reg, off);
+		break;
+
 	case offsetof(struct __sk_buff, data_end):
 		off  = si->off;
 		off -= offsetof(struct __sk_buff, data_end);
@@ -4095,6 +4159,11 @@  static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 				      si->dst_reg, si->src_reg,
 				      offsetof(struct xdp_buff, data));
 		break;
+	case offsetof(struct xdp_md, data_meta):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_meta),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct xdp_buff, data_meta));
+		break;
 	case offsetof(struct xdp_md, data_end):
 		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_end),
 				      si->dst_reg, si->src_reg,
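
From the program's point of view, the checks in bpf_xdp_adjust_meta() above
come down to the following. A sketch where the calls run in sequence on a
packet with enough headroom and an initially empty metadata area:

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* assumed: SEC() and bpf_xdp_adjust_meta() */

SEC("xdp")
int xdp_meta_limits(struct xdp_md *ctx)
{
	/* Starting from data_meta == data; on drivers without data_meta
	 * support every call below returns -ENOTSUPP instead.
	 */
	bpf_xdp_adjust_meta(ctx, -8);	/* 0: 8-byte area, multiple of 4, <= 32 */
	bpf_xdp_adjust_meta(ctx, -6);	/* -EACCES: 14 bytes is not 4-aligned */
	bpf_xdp_adjust_meta(ctx, -32);	/* -EACCES: 40 bytes exceeds the 32 byte cap */
	bpf_xdp_adjust_meta(ctx, 8);	/* 0: shrinks the area back to empty */
	bpf_xdp_adjust_meta(ctx, 8);	/* -EINVAL: data_meta would move past data */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
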
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 16982de..681177b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1509,6 +1509,8 @@  int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	skb->nohdr    = 0;
 	atomic_set(&skb_shinfo(skb)->dataref, 1);
 
+	skb_metadata_clear(skb);
+
 	/* It is not generally safe to change skb->truesize.
 	 * For the moment, we really care of rx path, or
 	 * when skb is orphaned (not attached to a socket).