diff mbox series

[v2,bpf,1/5] net: ethtool: add xdp properties flag set

Message ID 20201204102901.109709-2-marekx.majtyka@intel.com
State Superseded
Headers show
Series New netdev feature flags for XDP | expand

Commit Message

Marek Majtyka Dec. 4, 2020, 10:28 a.m. UTC
From: Marek Majtyka <marekx.majtyka@intel.com>

Implement support for checking what kind of xdp functionality a netdev
supports. Previously, there was no way to do this other than to try
to create an AF_XDP socket on the interface or load an XDP program and see
if it worked. This commit changes this by adding a new variable which
describes all xdp supported functions on pretty detailed level:
 - aborted
 - drop
 - pass
 - tx
 - redirect
 - zero copy
 - hardware offload.

Zerocopy mode requires that redirect xdp operation is implemented
in a driver and the driver supports also zero copy mode.
Full mode requires that all xdp operation are implemented in the driver.
Basic mode is just full mode without redirect operation.

Initially, these new flags are disabled for all drivers by default.

Signed-off-by: Marek Majtyka <marekx.majtyka@intel.com>
---
 .../networking/netdev-xdp-properties.rst      | 42 ++++++++
 include/linux/netdevice.h                     |  2 +
 include/linux/xdp_properties.h                | 53 +++++++++++
 include/net/xdp.h                             | 95 +++++++++++++++++++
 include/net/xdp_sock_drv.h                    | 10 ++
 include/uapi/linux/ethtool.h                  |  1 +
 include/uapi/linux/xdp_properties.h           | 32 +++++++
 net/ethtool/common.c                          | 11 +++
 net/ethtool/common.h                          |  4 +
 net/ethtool/strset.c                          |  5 +
 10 files changed, 255 insertions(+)
 create mode 100644 Documentation/networking/netdev-xdp-properties.rst
 create mode 100644 include/linux/xdp_properties.h
 create mode 100644 include/uapi/linux/xdp_properties.h

Comments

Toke Høiland-Jørgensen Dec. 4, 2020, 12:18 p.m. UTC | #1
alardam@gmail.com writes:

> From: Marek Majtyka <marekx.majtyka@intel.com>
>
> Implement support for checking what kind of xdp functionality a netdev
> supports. Previously, there was no way to do this other than to try
> to create an AF_XDP socket on the interface or load an XDP program and see
> if it worked. This commit changes this by adding a new variable which
> describes all xdp supported functions on pretty detailed level:

I like the direction this is going! :)

>  - aborted
>  - drop
>  - pass
>  - tx
>  - redirect

Drivers can in principle implement support for the XDP_REDIRECT return
code (and calling xdp_do_redirect()) without implementing ndo_xdp_xmit()
for being the *target* of a redirect. While my quick grepping doesn't
turn up any drivers that do only one of these right now, I think we've
had examples of it in the past, so it would probably be better to split
the redirect feature flag in two.

This would also make it trivial to replace the check in __xdp_enqueue()
(in devmap.c) from looking at whether the ndo is defined, and just
checking the flag. It would be great if you could do this as part of
this series.

Maybe we could even make the 'redirect target' flag be set automatically
if a driver implements ndo_xdp_xmit?

>  - zero copy
>  - hardware offload.
>
> Zerocopy mode requires that redirect xdp operation is implemented
> in a driver and the driver supports also zero copy mode.
> Full mode requires that all xdp operation are implemented in the driver.
> Basic mode is just full mode without redirect operation.
>
> Initially, these new flags are disabled for all drivers by default.
>
> Signed-off-by: Marek Majtyka <marekx.majtyka@intel.com>
> ---
>  .../networking/netdev-xdp-properties.rst      | 42 ++++++++
>  include/linux/netdevice.h                     |  2 +
>  include/linux/xdp_properties.h                | 53 +++++++++++
>  include/net/xdp.h                             | 95 +++++++++++++++++++
>  include/net/xdp_sock_drv.h                    | 10 ++
>  include/uapi/linux/ethtool.h                  |  1 +
>  include/uapi/linux/xdp_properties.h           | 32 +++++++
>  net/ethtool/common.c                          | 11 +++
>  net/ethtool/common.h                          |  4 +
>  net/ethtool/strset.c                          |  5 +
>  10 files changed, 255 insertions(+)
>  create mode 100644 Documentation/networking/netdev-xdp-properties.rst
>  create mode 100644 include/linux/xdp_properties.h
>  create mode 100644 include/uapi/linux/xdp_properties.h
>
> diff --git a/Documentation/networking/netdev-xdp-properties.rst b/Documentation/networking/netdev-xdp-properties.rst
> new file mode 100644
> index 000000000000..4a434a1c512b
> --- /dev/null
> +++ b/Documentation/networking/netdev-xdp-properties.rst
> @@ -0,0 +1,42 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Netdev XDP properties
> +=====================
> +
> + * XDP PROPERTIES FLAGS
> +
> +Following netdev xdp properties flags can be retrieve over netlink ethtool
> +interface the same way as netdev feature flags. These properties flags are
> +read only and cannot be change in the runtime.
> +
> +
> +*  XDP_ABORTED
> +
> +This property informs if netdev supports xdp aborted action.
> +
> +*  XDP_DROP
> +
> +This property informs if netdev supports xdp drop action.
> +
> +*  XDP_PASS
> +
> +This property informs if netdev supports xdp pass action.
> +
> +*  XDP_TX
> +
> +This property informs if netdev supports xdp tx action.
> +
> +*  XDP_REDIRECT
> +
> +This property informs if netdev supports xdp redirect action.
> +It assumes the all beforehand mentioned flags are enabled.
> +
> +*  XDP_ZEROCOPY
> +
> +This property informs if netdev driver supports xdp zero copy.
> +It assumes the all beforehand mentioned flags are enabled.

Nit: I think 'XDP_ZEROCOPY' can lead people to think that this is
zero-copy support for all XDP operations, which is obviously not the
case. So maybe 'XDP_SOCK_ZEROCOPY' (and update the description to
mention AF_XDP sockets explicitly)?

-Toke
Maciej Fijalkowski Dec. 4, 2020, 12:46 p.m. UTC | #2
On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote:
> alardam@gmail.com writes:
> 
> > From: Marek Majtyka <marekx.majtyka@intel.com>
> >
> > Implement support for checking what kind of xdp functionality a netdev
> > supports. Previously, there was no way to do this other than to try
> > to create an AF_XDP socket on the interface or load an XDP program and see
> > if it worked. This commit changes this by adding a new variable which
> > describes all xdp supported functions on pretty detailed level:
> 
> I like the direction this is going! :)
> 
> >  - aborted
> >  - drop
> >  - pass
> >  - tx
> >  - redirect
> 
> Drivers can in principle implement support for the XDP_REDIRECT return
> code (and calling xdp_do_redirect()) without implementing ndo_xdp_xmit()
> for being the *target* of a redirect. While my quick grepping doesn't
> turn up any drivers that do only one of these right now, I think we've
> had examples of it in the past, so it would probably be better to split
> the redirect feature flag in two.
> 
> This would also make it trivial to replace the check in __xdp_enqueue()
> (in devmap.c) from looking at whether the ndo is defined, and just
> checking the flag. It would be great if you could do this as part of
> this series.
> 
> Maybe we could even make the 'redirect target' flag be set automatically
> if a driver implements ndo_xdp_xmit?

+1

> 
> >  - zero copy
> >  - hardware offload.
> >
> > Zerocopy mode requires that redirect xdp operation is implemented
> > in a driver and the driver supports also zero copy mode.
> > Full mode requires that all xdp operation are implemented in the driver.
> > Basic mode is just full mode without redirect operation.
> >
> > Initially, these new flags are disabled for all drivers by default.
> >
> > Signed-off-by: Marek Majtyka <marekx.majtyka@intel.com>
> > ---
> >  .../networking/netdev-xdp-properties.rst      | 42 ++++++++
> >  include/linux/netdevice.h                     |  2 +
> >  include/linux/xdp_properties.h                | 53 +++++++++++
> >  include/net/xdp.h                             | 95 +++++++++++++++++++
> >  include/net/xdp_sock_drv.h                    | 10 ++
> >  include/uapi/linux/ethtool.h                  |  1 +
> >  include/uapi/linux/xdp_properties.h           | 32 +++++++
> >  net/ethtool/common.c                          | 11 +++
> >  net/ethtool/common.h                          |  4 +
> >  net/ethtool/strset.c                          |  5 +
> >  10 files changed, 255 insertions(+)
> >  create mode 100644 Documentation/networking/netdev-xdp-properties.rst
> >  create mode 100644 include/linux/xdp_properties.h
> >  create mode 100644 include/uapi/linux/xdp_properties.h
> >
> > diff --git a/Documentation/networking/netdev-xdp-properties.rst b/Documentation/networking/netdev-xdp-properties.rst
> > new file mode 100644
> > index 000000000000..4a434a1c512b
> > --- /dev/null
> > +++ b/Documentation/networking/netdev-xdp-properties.rst
> > @@ -0,0 +1,42 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=====================
> > +Netdev XDP properties
> > +=====================
> > +
> > + * XDP PROPERTIES FLAGS
> > +
> > +Following netdev xdp properties flags can be retrieve over netlink ethtool
> > +interface the same way as netdev feature flags. These properties flags are
> > +read only and cannot be change in the runtime.
> > +
> > +
> > +*  XDP_ABORTED
> > +
> > +This property informs if netdev supports xdp aborted action.
> > +
> > +*  XDP_DROP
> > +
> > +This property informs if netdev supports xdp drop action.
> > +
> > +*  XDP_PASS
> > +
> > +This property informs if netdev supports xdp pass action.
> > +
> > +*  XDP_TX
> > +
> > +This property informs if netdev supports xdp tx action.
> > +
> > +*  XDP_REDIRECT
> > +
> > +This property informs if netdev supports xdp redirect action.
> > +It assumes the all beforehand mentioned flags are enabled.
> > +
> > +*  XDP_ZEROCOPY
> > +
> > +This property informs if netdev driver supports xdp zero copy.
> > +It assumes the all beforehand mentioned flags are enabled.
> 
> Nit: I think 'XDP_ZEROCOPY' can lead people to think that this is
> zero-copy support for all XDP operations, which is obviously not the
> case. So maybe 'XDP_SOCK_ZEROCOPY' (and update the description to
> mention AF_XDP sockets explicitly)?

AF_XDP_ZEROCOPY?

> 
> -Toke
>
Maciej Fijalkowski Dec. 4, 2020, 12:57 p.m. UTC | #3
On Fri, Dec 04, 2020 at 11:28:57AM +0100, alardam@gmail.com wrote:
> From: Marek Majtyka <marekx.majtyka@intel.com>
> 
> Implement support for checking what kind of xdp functionality a netdev
> supports. Previously, there was no way to do this other than to try
> to create an AF_XDP socket on the interface or load an XDP program and see
> if it worked. This commit changes this by adding a new variable which
> describes all xdp supported functions on pretty detailed level:
>  - aborted
>  - drop
>  - pass
>  - tx
>  - redirect
>  - zero copy
>  - hardware offload.
> 
> Zerocopy mode requires that redirect xdp operation is implemented
> in a driver and the driver supports also zero copy mode.
> Full mode requires that all xdp operation are implemented in the driver.
> Basic mode is just full mode without redirect operation.
> 
> Initially, these new flags are disabled for all drivers by default.
> 
> Signed-off-by: Marek Majtyka <marekx.majtyka@intel.com>
> ---
>  .../networking/netdev-xdp-properties.rst      | 42 ++++++++
>  include/linux/netdevice.h                     |  2 +
>  include/linux/xdp_properties.h                | 53 +++++++++++
>  include/net/xdp.h                             | 95 +++++++++++++++++++
>  include/net/xdp_sock_drv.h                    | 10 ++
>  include/uapi/linux/ethtool.h                  |  1 +
>  include/uapi/linux/xdp_properties.h           | 32 +++++++
>  net/ethtool/common.c                          | 11 +++
>  net/ethtool/common.h                          |  4 +
>  net/ethtool/strset.c                          |  5 +
>  10 files changed, 255 insertions(+)
>  create mode 100644 Documentation/networking/netdev-xdp-properties.rst
>  create mode 100644 include/linux/xdp_properties.h
>  create mode 100644 include/uapi/linux/xdp_properties.h
> 
> diff --git a/Documentation/networking/netdev-xdp-properties.rst b/Documentation/networking/netdev-xdp-properties.rst
> new file mode 100644
> index 000000000000..4a434a1c512b
> --- /dev/null
> +++ b/Documentation/networking/netdev-xdp-properties.rst
> @@ -0,0 +1,42 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Netdev XDP properties
> +=====================
> +
> + * XDP PROPERTIES FLAGS
> +
> +Following netdev xdp properties flags can be retrieve over netlink ethtool
> +interface the same way as netdev feature flags. These properties flags are
> +read only and cannot be change in the runtime.
> +
> +
> +*  XDP_ABORTED
> +
> +This property informs if netdev supports xdp aborted action.
> +
> +*  XDP_DROP
> +
> +This property informs if netdev supports xdp drop action.
> +
> +*  XDP_PASS
> +
> +This property informs if netdev supports xdp pass action.
> +
> +*  XDP_TX
> +
> +This property informs if netdev supports xdp tx action.
> +
> +*  XDP_REDIRECT
> +
> +This property informs if netdev supports xdp redirect action.
> +It assumes the all beforehand mentioned flags are enabled.
> +
> +*  XDP_ZEROCOPY
> +
> +This property informs if netdev driver supports xdp zero copy.
> +It assumes the all beforehand mentioned flags are enabled.
> +
> +*  XDP_HW_OFFLOAD
> +
> +This property informs if netdev driver supports xdp hw oflloading.
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 52d1cc2bd8a7..2544c7f0e1b7 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -43,6 +43,7 @@
>  #include <net/xdp.h>
>  
>  #include <linux/netdev_features.h>
> +#include <linux/xdp_properties.h>
>  #include <linux/neighbour.h>
>  #include <uapi/linux/netdevice.h>
>  #include <uapi/linux/if_bonding.h>
> @@ -2171,6 +2172,7 @@ struct net_device {
>  
>  	/* protected by rtnl_lock */
>  	struct bpf_xdp_entity	xdp_state[__MAX_XDP_MODE];
> +	xdp_properties_t	xdp_properties;
>  };
>  #define to_net_dev(d) container_of(d, struct net_device, dev)
>  
> diff --git a/include/linux/xdp_properties.h b/include/linux/xdp_properties.h
> new file mode 100644
> index 000000000000..c72c9bcc50de
> --- /dev/null
> +++ b/include/linux/xdp_properties.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Network device xdp properties.
> + */
> +#ifndef _LINUX_XDP_PROPERTIES_H
> +#define _LINUX_XDP_PROPERTIES_H
> +
> +#include <linux/types.h>
> +#include <linux/bitops.h>
> +#include <asm/byteorder.h>
> +
> +typedef u64 xdp_properties_t;
> +
> +enum {
> +	XDP_F_ABORTED_BIT,
> +	XDP_F_DROP_BIT,
> +	XDP_F_PASS_BIT,
> +	XDP_F_TX_BIT,
> +	XDP_F_REDIRECT_BIT,
> +	XDP_F_ZEROCOPY_BIT,
> +	XDP_F_HW_OFFLOAD_BIT,
> +
> +	/*
> +	 * Add your fresh new property above and remember to update
> +	 * xdp_properties_strings [] in net/core/ethtool.c and maybe
> +	 * some xdp_properties mask #defines below. Please also describe it
> +	 * in Documentation/networking/xdp_properties.rst.
> +	 */
> +
> +	/**/XDP_PROPERTIES_COUNT
> +};
> +
> +#define __XDP_F_BIT(bit)	((xdp_properties_t)1 << (bit))
> +#define __XDP_F(name)		__XDP_F_BIT(XDP_F_##name##_BIT)
> +
> +#define XDP_F_ABORTED		__XDP_F(ABORTED)
> +#define XDP_F_DROP		__XDP_F(DROP)
> +#define XDP_F_PASS		__XDP_F(PASS)
> +#define XDP_F_TX		__XDP_F(TX)
> +#define XDP_F_REDIRECT		__XDP_F(REDIRECT)
> +#define XDP_F_ZEROCOPY		__XDP_F(ZEROCOPY)
> +#define XDP_F_HW_OFFLOAD	__XDP_F(HW_OFFLOAD)
> +
> +#define XDP_F_BASIC		(XDP_F_ABORTED |	\
> +				 XDP_F_DROP |		\
> +				 XDP_F_PASS |		\
> +				 XDP_F_TX)
> +
> +#define XDP_F_FULL		(XDP_F_BASIC | XDP_F_REDIRECT)
> +
> +#define XDP_F_FULL_ZC		(XDP_F_FULL | XDP_F_ZEROCOPY)

Seems like you're not making use of this flag? Next patch combines two
calls for XDP_F_FULL and XDP_F_ZEROCOPY, like:

xdp_set_full_properties(&netdev->xdp_properties);
xsk_set_zc_property(&netdev->xdp_properties);

So either drop the flag, or introduce xdp_set_full_zc_properties().

I was also thinking if it would make sense to align the naming here and
refer to these as 'xdp features', like netdevice.h tends to do, not 'xdp
properties'.

> +
> +#endif /* _LINUX_XDP_PROPERTIES_H */
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 700ad5db7f5d..a9fabc1282cf 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -7,6 +7,7 @@
>  #define __LINUX_NET_XDP_H__
>  
>  #include <linux/skbuff.h> /* skb_shared_info */
> +#include <linux/xdp_properties.h>
>  
>  /**
>   * DOC: XDP RX-queue information
> @@ -255,6 +256,100 @@ struct xdp_attachment_info {
>  	u32 flags;
>  };
>  
> +#if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
> +
> +static __always_inline void
> +xdp_set_aborted_property(xdp_properties_t *properties)
> +{
> +	*properties |= XDP_F_ABORTED;
> +}
> +
> +static __always_inline void
> +xdp_set_pass_property(xdp_properties_t *properties)
> +{
> +	*properties |= XDP_F_PASS;
> +}
> +
> +static __always_inline void
> +xdp_set_drop_property(xdp_properties_t *properties)
> +{
> +	*properties |= XDP_F_DROP;
> +}
> +
> +static __always_inline void
> +xdp_set_tx_property(xdp_properties_t *properties)
> +{
> +	*properties |= XDP_F_TX;
> +}
> +
> +static __always_inline void
> +xdp_set_redirect_property(xdp_properties_t *properties)
> +{
> +	*properties |= XDP_F_REDIRECT;
> +}
> +
> +static __always_inline void
> +xdp_set_hw_offload_property(xdp_properties_t *properties)
> +{
> +	*properties |= XDP_F_HW_OFFLOAD;
> +}
> +
> +static __always_inline void
> +xdp_set_basic_properties(xdp_properties_t *properties)
> +{
> +	*properties |= XDP_F_BASIC;
> +}
> +
> +static __always_inline void
> +xdp_set_full_properties(xdp_properties_t *properties)
> +{
> +	*properties |= XDP_F_FULL;
> +}
> +
> +#else
> +
> +static __always_inline void
> +xdp_set_aborted_property(xdp_properties_t *properties)
> +{
> +}
> +
> +static __always_inline void
> +xdp_set_pass_property(xdp_properties_t *properties)
> +{
> +}
> +
> +static __always_inline void
> +xdp_set_drop_property(xdp_properties_t *properties)
> +{
> +}
> +
> +static __always_inline void
> +xdp_set_tx_property(xdp_properties_t *properties)
> +{
> +}
> +
> +static __always_inline void
> +xdp_set_redirect_property(xdp_properties_t *properties)
> +{
> +}
> +
> +static __always_inline void
> +xdp_set_hw_offload_property(xdp_properties_t *properties)
> +{
> +}
> +
> +static __always_inline void
> +xdp_set_basic_properties(xdp_properties_t *properties)
> +{
> +}
> +
> +static __always_inline void
> +xdp_set_full_properties(xdp_properties_t *properties)
> +{
> +}
> +
> +#endif
> +
>  struct netdev_bpf;
>  bool xdp_attachment_flags_ok(struct xdp_attachment_info *info,
>  			     struct netdev_bpf *bpf);
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 4e295541e396..48a3b6d165c7 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -8,6 +8,7 @@
>  
>  #include <net/xdp_sock.h>
>  #include <net/xsk_buff_pool.h>
> +#include <linux/xdp_properties.h>
>  
>  #ifdef CONFIG_XDP_SOCKETS
>  
> @@ -117,6 +118,11 @@ static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool,
>  	xp_dma_sync_for_device(pool, dma, size);
>  }
>  
> +static inline void xsk_set_zc_property(xdp_properties_t *properties)
> +{
> +	*properties |= XDP_F_ZEROCOPY;
> +}
> +
>  #else
>  
>  static inline void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
> @@ -242,6 +248,10 @@ static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool,
>  {
>  }
>  
> +static inline void xsk_set_zc_property(xdp_properties_t *properties)
> +{
> +}
> +
>  #endif /* CONFIG_XDP_SOCKETS */
>  
>  #endif /* _LINUX_XDP_SOCK_DRV_H */
> diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
> index 9ca87bc73c44..dfcb0e2c98b2 100644
> --- a/include/uapi/linux/ethtool.h
> +++ b/include/uapi/linux/ethtool.h
> @@ -688,6 +688,7 @@ enum ethtool_stringset {
>  	ETH_SS_TS_TX_TYPES,
>  	ETH_SS_TS_RX_FILTERS,
>  	ETH_SS_UDP_TUNNEL_TYPES,
> +	ETH_SS_XDP_PROPERTIES,
>  
>  	/* add new constants above here */
>  	ETH_SS_COUNT
> diff --git a/include/uapi/linux/xdp_properties.h b/include/uapi/linux/xdp_properties.h
> new file mode 100644
> index 000000000000..e85be03eb707
> --- /dev/null
> +++ b/include/uapi/linux/xdp_properties.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +
> +/*
> + * Copyright (c) 2020 Intel
> + */
> +
> +#ifndef __UAPI_LINUX_XDP_PROPERTIES__
> +#define __UAPI_LINUX_XDP_PROPERTIES__
> +
> +/* ETH_GSTRING_LEN define is needed. */
> +#include <linux/ethtool.h>
> +
> +#define XDP_PROPERTIES_ABORTED_STR	"xdp-aborted"
> +#define XDP_PROPERTIES_DROP_STR		"xdp-drop"
> +#define XDP_PROPERTIES_PASS_STR		"xdp-pass"
> +#define XDP_PROPERTIES_TX_STR		"xdp-tx"
> +#define XDP_PROPERTIES_REDIRECT_STR	"xdp-redirect"
> +#define XDP_PROPERTIES_ZEROCOPY_STR	"xdp-zerocopy"
> +#define XDP_PROPERTIES_HW_OFFLOAD_STR	"xdp-hw-offload"
> +
> +#define	DECLARE_XDP_PROPERTIES_TABLE(name)		\
> +	const char name[][ETH_GSTRING_LEN] = {		\
> +		XDP_PROPERTIES_ABORTED_STR,		\
> +		XDP_PROPERTIES_DROP_STR,		\
> +		XDP_PROPERTIES_PASS_STR,		\
> +		XDP_PROPERTIES_TX_STR,			\
> +		XDP_PROPERTIES_REDIRECT_STR,		\
> +		XDP_PROPERTIES_ZEROCOPY_STR,		\
> +		XDP_PROPERTIES_HW_OFFLOAD_STR,		\
> +	}
> +
> +#endif  /* __UAPI_LINUX_XDP_PROPERTIES__ */
> diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> index 24036e3055a1..8f15f96b8922 100644
> --- a/net/ethtool/common.c
> +++ b/net/ethtool/common.c
> @@ -4,6 +4,7 @@
>  #include <linux/net_tstamp.h>
>  #include <linux/phy.h>
>  #include <linux/rtnetlink.h>
> +#include <uapi/linux/xdp_properties.h>
>  
>  #include "common.h"
>  
> @@ -283,6 +284,16 @@ const char udp_tunnel_type_names[][ETH_GSTRING_LEN] = {
>  static_assert(ARRAY_SIZE(udp_tunnel_type_names) ==
>  	      __ETHTOOL_UDP_TUNNEL_TYPE_CNT);
>  
> +const char xdp_properties_strings[XDP_PROPERTIES_COUNT][ETH_GSTRING_LEN] = {
> +	[XDP_F_ABORTED_BIT] =		XDP_PROPERTIES_ABORTED_STR,
> +	[XDP_F_DROP_BIT] =		XDP_PROPERTIES_DROP_STR,
> +	[XDP_F_PASS_BIT] =		XDP_PROPERTIES_PASS_STR,
> +	[XDP_F_TX_BIT] =		XDP_PROPERTIES_TX_STR,
> +	[XDP_F_REDIRECT_BIT] =		XDP_PROPERTIES_REDIRECT_STR,
> +	[XDP_F_ZEROCOPY_BIT] =		XDP_PROPERTIES_ZEROCOPY_STR,
> +	[XDP_F_HW_OFFLOAD_BIT] =	XDP_PROPERTIES_HW_OFFLOAD_STR,
> +};
> +
>  /* return false if legacy contained non-0 deprecated fields
>   * maxtxpkt/maxrxpkt. rest of ksettings always updated
>   */
> diff --git a/net/ethtool/common.h b/net/ethtool/common.h
> index 3d9251c95a8b..85a35f8781eb 100644
> --- a/net/ethtool/common.h
> +++ b/net/ethtool/common.h
> @@ -5,8 +5,10 @@
>  
>  #include <linux/netdevice.h>
>  #include <linux/ethtool.h>
> +#include <linux/xdp_properties.h>
>  
>  #define ETHTOOL_DEV_FEATURE_WORDS	DIV_ROUND_UP(NETDEV_FEATURE_COUNT, 32)
> +#define ETHTOOL_XDP_PROPERTIES_WORDS	DIV_ROUND_UP(XDP_PROPERTIES_COUNT, 32)
>  
>  /* compose link mode index from speed, type and duplex */
>  #define ETHTOOL_LINK_MODE(speed, type, duplex) \
> @@ -22,6 +24,8 @@ extern const char
>  tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN];
>  extern const char
>  phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN];
> +extern const char
> +xdp_properties_strings[XDP_PROPERTIES_COUNT][ETH_GSTRING_LEN];
>  extern const char link_mode_names[][ETH_GSTRING_LEN];
>  extern const char netif_msg_class_names[][ETH_GSTRING_LEN];
>  extern const char wol_mode_names[][ETH_GSTRING_LEN];
> diff --git a/net/ethtool/strset.c b/net/ethtool/strset.c
> index 0baad0ce1832..684e751b31a9 100644
> --- a/net/ethtool/strset.c
> +++ b/net/ethtool/strset.c
> @@ -80,6 +80,11 @@ static const struct strset_info info_template[] = {
>  		.count		= __ETHTOOL_UDP_TUNNEL_TYPE_CNT,
>  		.strings	= udp_tunnel_type_names,
>  	},
> +	[ETH_SS_XDP_PROPERTIES] = {
> +		.per_dev	= false,
> +		.count		= ARRAY_SIZE(xdp_properties_strings),
> +		.strings	= xdp_properties_strings,
> +	},
>  };
>  
>  struct strset_req_info {
> -- 
> 2.27.0
>
Daniel Borkmann Dec. 4, 2020, 3:21 p.m. UTC | #4
On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:
> On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote:
>> alardam@gmail.com writes:
>>> From: Marek Majtyka <marekx.majtyka@intel.com>
>>>
>>> Implement support for checking what kind of xdp functionality a netdev
>>> supports. Previously, there was no way to do this other than to try
>>> to create an AF_XDP socket on the interface or load an XDP program and see
>>> if it worked. This commit changes this by adding a new variable which
>>> describes all xdp supported functions on pretty detailed level:
>>
>> I like the direction this is going! :)
>>
>>>   - aborted
>>>   - drop
>>>   - pass
>>>   - tx

I strongly think we should _not_ merge any native XDP driver patchset that does
not support/implement the above return codes. Could we instead group them together
and call this something like XDP_BASE functionality to not give a wrong impression?
If this is properly documented that these are basic must-have _requirements_, then
users and driver developers both know what the expectations are.

>>>   - redirect
>>
>> Drivers can in principle implement support for the XDP_REDIRECT return
>> code (and calling xdp_do_redirect()) without implementing ndo_xdp_xmit()
>> for being the *target* of a redirect. While my quick grepping doesn't
>> turn up any drivers that do only one of these right now, I think we've
>> had examples of it in the past, so it would probably be better to split
>> the redirect feature flag in two.
>>
>> This would also make it trivial to replace the check in __xdp_enqueue()
>> (in devmap.c) from looking at whether the ndo is defined, and just
>> checking the flag. It would be great if you could do this as part of
>> this series.
>>
>> Maybe we could even make the 'redirect target' flag be set automatically
>> if a driver implements ndo_xdp_xmit?
> 
> +1
> 
>>>   - zero copy
>>>   - hardware offload.

One other thing that is quite annoying to figure out sometimes and not always
obvious from reading the driver code (and it may even differ depending on how
the driver was built :/) is how much XDP headroom a driver really provides.

We tried to standardize on a minimum guaranteed amount, but unfortunately not
everyone seems to implement it, but I think it would be very useful to query
this from application side, for example, consider that an app inserts a BPF
prog at XDP doing custom encap shortly before XDP_TX so it would be useful to
know which of the different encaps it implements are realistically possible on
the underlying XDP supported dev.

Thanks,
Daniel
Toke Høiland-Jørgensen Dec. 4, 2020, 5:20 p.m. UTC | #5
Daniel Borkmann <daniel@iogearbox.net> writes:

> On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:
>> On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote:
>>> alardam@gmail.com writes:
>>>> From: Marek Majtyka <marekx.majtyka@intel.com>
>>>>
>>>> Implement support for checking what kind of xdp functionality a netdev
>>>> supports. Previously, there was no way to do this other than to try
>>>> to create an AF_XDP socket on the interface or load an XDP program and see
>>>> if it worked. This commit changes this by adding a new variable which
>>>> describes all xdp supported functions on pretty detailed level:
>>>
>>> I like the direction this is going! :)
>>>
>>>>   - aborted
>>>>   - drop
>>>>   - pass
>>>>   - tx
>
> I strongly think we should _not_ merge any native XDP driver patchset
> that does not support/implement the above return codes. Could we
> instead group them together and call this something like XDP_BASE
> functionality to not give a wrong impression? If this is properly
> documented that these are basic must-have _requirements_, then users
> and driver developers both know what the expectations are.

I think there may have been drivers that only did DROP/PASS on first
merge; but adding TX to the "base set" is fine by me, as long as it's
actually enforced ;)

(As in, we originally said the same thing about the full feature set and
that never really worked out).

>>>>   - redirect
>>>
>>> Drivers can in principle implement support for the XDP_REDIRECT return
>>> code (and calling xdp_do_redirect()) without implementing ndo_xdp_xmit()
>>> for being the *target* of a redirect. While my quick grepping doesn't
>>> turn up any drivers that do only one of these right now, I think we've
>>> had examples of it in the past, so it would probably be better to split
>>> the redirect feature flag in two.
>>>
>>> This would also make it trivial to replace the check in __xdp_enqueue()
>>> (in devmap.c) from looking at whether the ndo is defined, and just
>>> checking the flag. It would be great if you could do this as part of
>>> this series.
>>>
>>> Maybe we could even make the 'redirect target' flag be set automatically
>>> if a driver implements ndo_xdp_xmit?
>> 
>> +1
>> 
>>>>   - zero copy
>>>>   - hardware offload.
>
> One other thing that is quite annoying to figure out sometimes and not always
> obvious from reading the driver code (and it may even differ depending on how
> the driver was built :/) is how much XDP headroom a driver really provides.
>
> We tried to standardize on a minimum guaranteed amount, but unfortunately not
> everyone seems to implement it, but I think it would be very useful to query
> this from application side, for example, consider that an app inserts a BPF
> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to
> know which of the different encaps it implements are realistically possible on
> the underlying XDP supported dev.

How many distinct values are there in reality? Enough to express this in
a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need
an additional field to get the exact value? If we implement the latter
we also run the risk of people actually implementing all sorts of weird
values, whereas if we constrain it to a few distinct values it's easier
to push back against adding new values (as it'll be obvious from the
addition of new flags).

-Toke
Daniel Borkmann Dec. 4, 2020, 10:19 p.m. UTC | #6
On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
[...]
>> We tried to standardize on a minimum guaranteed amount, but unfortunately not
>> everyone seems to implement it, but I think it would be very useful to query
>> this from application side, for example, consider that an app inserts a BPF
>> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to
>> know which of the different encaps it implements are realistically possible on
>> the underlying XDP supported dev.
> 
> How many distinct values are there in reality? Enough to express this in
> a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need
> an additional field to get the exact value? If we implement the latter
> we also run the risk of people actually implementing all sorts of weird
> values, whereas if we constrain it to a few distinct values it's easier
> to push back against adding new values (as it'll be obvious from the
> addition of new flags).

It's not everywhere straight forward to determine unfortunately, see also [0,1]
as some data points where Jesper looked into in the past, so in some cases it
might differ depending on the build/runtime config..

   [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/
   [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/
Jesper Dangaard Brouer Dec. 7, 2020, 11:54 a.m. UTC | #7
On Fri, 4 Dec 2020 23:19:55 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote:
> > Daniel Borkmann <daniel@iogearbox.net> writes:  
> [...]
> >> We tried to standardize on a minimum guaranteed amount, but unfortunately not
> >> everyone seems to implement it, but I think it would be very useful to query
> >> this from application side, for example, consider that an app inserts a BPF
> >> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to
> >> know which of the different encaps it implements are realistically possible on
> >> the underlying XDP supported dev.  
> > 
> > How many distinct values are there in reality? Enough to express this in
> > a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need
> > an additional field to get the exact value? If we implement the latter
> > we also run the risk of people actually implementing all sorts of weird
> > values, whereas if we constrain it to a few distinct values it's easier
> > to push back against adding new values (as it'll be obvious from the
> > addition of new flags).  
> 
> It's not everywhere straight forward to determine unfortunately, see also [0,1]
> as some data points where Jesper looked into in the past, so in some cases it
> might differ depending on the build/runtime config..
> 
>    [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/
>    [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/

Yes, unfortunately drivers have already gotten creative in this area,
and variations have sneaked in.  I remember that we were forced to
allow SFC driver to use 128 bytes headroom, to avoid a memory
corruption. I tried hard to have the minimum 192 bytes as it is 3
cachelines, but I failed to enforce this.

It might be valuable to expose info on the drivers headroom size, as
this will allow end-users to take advantage of this (instead of having
to use the lowest common headroom) and up-front in userspace rejecting
to load on e.g. SFC that have this annoying limitation.

BUT thinking about what the drivers headroom size MEANS to userspace,
I'm not sure it is wise to give this info to userspace.  The
XDP-headroom is used for several kernel internal things, that limit the
available space for growing packet-headroom.  E.g. (1) xdp_frame is
something that we likely need to grow (even-though I'm pushing back),
E.g. (2) metadata area which Saeed is looking to populate from driver
code (also reduce packet-headroom for encap-headers).  So, userspace
cannot use the XDP-headroom size to much...
Toke Høiland-Jørgensen Dec. 7, 2020, 12:03 p.m. UTC | #8
Daniel Borkmann <daniel@iogearbox.net> writes:

> On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
> [...]
>>> We tried to standardize on a minimum guaranteed amount, but unfortunately not
>>> everyone seems to implement it, but I think it would be very useful to query
>>> this from application side, for example, consider that an app inserts a BPF
>>> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to
>>> know which of the different encaps it implements are realistically possible on
>>> the underlying XDP supported dev.
>> 
>> How many distinct values are there in reality? Enough to express this in
>> a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need
>> an additional field to get the exact value? If we implement the latter
>> we also run the risk of people actually implementing all sorts of weird
>> values, whereas if we constrain it to a few distinct values it's easier
>> to push back against adding new values (as it'll be obvious from the
>> addition of new flags).
>
> It's not everywhere straight forward to determine unfortunately, see also [0,1]
> as some data points where Jesper looked into in the past, so in some cases it
> might differ depending on the build/runtime config..
>
>    [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/
>    [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/

Right, well in that case maybe we should just expose the actual headroom
as a separate netlink attribute? Although I suppose that would require
another round of driver changes since Jesper's patch you linked above
only puts this into xdp_buff at XDP program runtime.

Jesper, WDYT?

-Toke
Toke Høiland-Jørgensen Dec. 7, 2020, 12:08 p.m. UTC | #9
Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Fri, 4 Dec 2020 23:19:55 +0100
> Daniel Borkmann <daniel@iogearbox.net> wrote:
>
>> On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote:
>> > Daniel Borkmann <daniel@iogearbox.net> writes:  
>> [...]
>> >> We tried to standardize on a minimum guaranteed amount, but unfortunately not
>> >> everyone seems to implement it, but I think it would be very useful to query
>> >> this from application side, for example, consider that an app inserts a BPF
>> >> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to
>> >> know which of the different encaps it implements are realistically possible on
>> >> the underlying XDP supported dev.  
>> > 
>> > How many distinct values are there in reality? Enough to express this in
>> > a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need
>> > an additional field to get the exact value? If we implement the latter
>> > we also run the risk of people actually implementing all sorts of weird
>> > values, whereas if we constrain it to a few distinct values it's easier
>> > to push back against adding new values (as it'll be obvious from the
>> > addition of new flags).  
>> 
>> It's not everywhere straight forward to determine unfortunately, see also [0,1]
>> as some data points where Jesper looked into in the past, so in some cases it
>> might differ depending on the build/runtime config..
>> 
>>    [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/
>>    [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/
>
> Yes, unfortunately drivers have already gotten creative in this area,
> and variations have sneaked in.  I remember that we were forced to
> allow SFC driver to use 128 bytes headroom, to avoid a memory
> corruption. I tried hard to have the minimum 192 bytes as it is 3
> cachelines, but I failed to enforce this.
>
> It might be valuable to expose info on the drivers headroom size, as
> this will allow end-users to take advantage of this (instead of having
> to use the lowest common headroom) and up-front in userspace rejecting
> to load on e.g. SFC that have this annoying limitation.
>
> BUT thinking about what the drivers headroom size MEANS to userspace,
> I'm not sure it is wise to give this info to userspace.  The
> XDP-headroom is used for several kernel internal things, that limit the
> available space for growing packet-headroom.  E.g. (1) xdp_frame is
> something that we likely need to grow (even-though I'm pushing back),
> E.g. (2) metadata area which Saeed is looking to populate from driver
> code (also reduce packet-headroom for encap-headers).  So, userspace
> cannot use the XDP-headroom size to much...

(Ah, you had already replied, sorry seems I missed that).

Can we calculate a number from the headroom that is meaningful for
userspace? I suppose that would be "total number of bytes available for
metadata+packet extension"? Even with growing data structures, any
particular kernel should be able to inform userspace of the current
value, no?

-Toke
Jesper Dangaard Brouer Dec. 7, 2020, 12:54 p.m. UTC | #10
On Fri, 4 Dec 2020 16:21:08 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:
> > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote:  
> >> alardam@gmail.com writes:  
> >>> From: Marek Majtyka <marekx.majtyka@intel.com>
> >>>
> >>> Implement support for checking what kind of xdp functionality a netdev
> >>> supports. Previously, there was no way to do this other than to try
> >>> to create an AF_XDP socket on the interface or load an XDP program and see
> >>> if it worked. This commit changes this by adding a new variable which
> >>> describes all xdp supported functions on pretty detailed level:  
> >>
> >> I like the direction this is going! :)

(Me too, don't get discouraged by our nitpicking, keep working on this! :-))

> >>  
> >>>   - aborted
> >>>   - drop
> >>>   - pass
> >>>   - tx  
> 
> I strongly think we should _not_ merge any native XDP driver patchset
> that does not support/implement the above return codes. 

I agree, with above statement.

> Could we instead group them together and call this something like
> XDP_BASE functionality to not give a wrong impression?

I disagree.  I can accept that XDP_BASE include aborted+drop+pass.

I think we need to keep XDP_TX action separate, because I think that
there are use-cases where the we want to disable XDP_TX due to end-user
policy or hardware limitations.

Use-case(1): Cloud-provider want to give customers (running VMs) ability
to load XDP program for DDoS protection (only), but don't want to allow
customer to use XDP_TX (that can implement LB or cheat their VM
isolation policy).

Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue
resources, as the use-case is only DDoS.  Today we have this problem
with the ixgbe hardware, that cannot load XDP programs on systems with
more than 192 CPUs.


> If this is properly documented that these are basic must-have
> _requirements_, then users and driver developers both know what the
> expectations are.

We can still document that XDP_TX is a must-have requirement, when a
driver implements XDP.


> >>>   - redirect  
> >>
John Fastabend Dec. 7, 2020, 8:52 p.m. UTC | #11
Jesper Dangaard Brouer wrote:
> On Fri, 4 Dec 2020 16:21:08 +0100
> Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
> > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:
> > > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote:  
> > >> alardam@gmail.com writes:  
> > >>> From: Marek Majtyka <marekx.majtyka@intel.com>
> > >>>
> > >>> Implement support for checking what kind of xdp functionality a netdev
> > >>> supports. Previously, there was no way to do this other than to try
> > >>> to create an AF_XDP socket on the interface or load an XDP program and see
> > >>> if it worked. This commit changes this by adding a new variable which
> > >>> describes all xdp supported functions on pretty detailed level:  
> > >>
> > >> I like the direction this is going! :)
> 
> (Me too, don't get discouraged by our nitpicking, keep working on this! :-))
> 
> > >>  
> > >>>   - aborted
> > >>>   - drop
> > >>>   - pass
> > >>>   - tx  
> > 
> > I strongly think we should _not_ merge any native XDP driver patchset
> > that does not support/implement the above return codes. 
> 
> I agree, with above statement.
> 
> > Could we instead group them together and call this something like
> > XDP_BASE functionality to not give a wrong impression?
> 
> I disagree.  I can accept that XDP_BASE include aborted+drop+pass.
> 
> I think we need to keep XDP_TX action separate, because I think that
> there are use-cases where the we want to disable XDP_TX due to end-user
> policy or hardware limitations.

How about we discover this at load time though. Meaning if the program
doesn't use XDP_TX then the hardware can skip resource allocations for
it. I think we could have verifier or extra pass discover the use of
XDP_TX and then pass a bit down to driver to enable/disable TX caps.

> 
> Use-case(1): Cloud-provider want to give customers (running VMs) ability
> to load XDP program for DDoS protection (only), but don't want to allow
> customer to use XDP_TX (that can implement LB or cheat their VM
> isolation policy).

Not following. What interface do they want to allow loading on? If its
the VM interface then I don't see how it matters. From outside the
VM there should be no way to discover if its done in VM or in tc or
some other stack.

If its doing some onloading/offloading I would assume they need to
ensure the isolation, etc. is still maintained because you can't
let one VMs program work on other VMs packets safely.

So what did I miss, above doesn't make sense to me.

> 
> Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue
> resources, as the use-case is only DDoS.  Today we have this problem
> with the ixgbe hardware, that cannot load XDP programs on systems with
> more than 192 CPUs.

The ixgbe issues is just a bug or missing-feature in my opinion.

I think we just document that XDP_TX consumes resources and if users
care they shouldn't use XD_TX in programs and in that case hardware
should via program discovery not allocate the resource. This seems
cleaner in my opinion then more bits for features.

> 
> 
> > If this is properly documented that these are basic must-have
> > _requirements_, then users and driver developers both know what the
> > expectations are.
> 
> We can still document that XDP_TX is a must-have requirement, when a
> driver implements XDP.

+1

> 
> 
> > >>>   - redirect  
> > >>
> 
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
Saeed Mahameed Dec. 7, 2020, 10:38 p.m. UTC | #12
On Mon, 2020-12-07 at 12:52 -0800, John Fastabend wrote:
> Jesper Dangaard Brouer wrote:
> > On Fri, 4 Dec 2020 16:21:08 +0100
> > Daniel Borkmann <daniel@iogearbox.net> wrote:
> > 
> > > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:
> > > > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-
> > > > Jørgensen wrote:  
> > > > > alardam@gmail.com writes:  
> > > > > > From: Marek Majtyka <marekx.majtyka@intel.com>
> > > > > > 
> > > > > > Implement support for checking what kind of xdp
> > > > > > functionality a netdev
> > > > > > supports. Previously, there was no way to do this other
> > > > > > than to try
> > > > > > to create an AF_XDP socket on the interface or load an XDP
> > > > > > program and see
> > > > > > if it worked. This commit changes this by adding a new
> > > > > > variable which
> > > > > > describes all xdp supported functions on pretty detailed
> > > > > > level:  
> > > > > 
> > > > > I like the direction this is going! :)
> > 
> > (Me too, don't get discouraged by our nitpicking, keep working on
> > this! :-))
> > 
> > > > >  
> > > > > >   - aborted
> > > > > >   - drop
> > > > > >   - pass
> > > > > >   - tx  
> > > 
> > > I strongly think we should _not_ merge any native XDP driver
> > > patchset
> > > that does not support/implement the above return codes. 
> > 
> > I agree, with above statement.
> > 
> > > Could we instead group them together and call this something like
> > > XDP_BASE functionality to not give a wrong impression?
> > 
> > I disagree.  I can accept that XDP_BASE include aborted+drop+pass.
> > 
XDP_BASE is a weird name i vote:  
XDP_FLAG_RX,
XDP_FLAG_TX,
XDP_FLAG_REDIRECT,
XDP_FLAG_AF_XDP,
XDP_FLAG_AFXDP_ZC

> > I think we need to keep XDP_TX action separate, because I think
> > that
> > there are use-cases where the we want to disable XDP_TX due to end-
> > user
> > policy or hardware limitations.
> 
> How about we discover this at load time though. Meaning if the
> program
> doesn't use XDP_TX then the hardware can skip resource allocations
> for
> it. I think we could have verifier or extra pass discover the use of
> XDP_TX and then pass a bit down to driver to enable/disable TX caps.
> 

+1, how about we also attach some attributes to the program that would
tell the kernel/driver how to prepare configure itself for the new
program ?

Attributes like how much headroom the program needs, what meta data
driver must provide, should the driver do csum on tx, etc .. 

some attribute can be extracted from the byte code/logic others are
stated explicitly in some predefined section in the XDP prog itself.

On a second thought, this could be disruptive, users will eventually
want to replace XDP progs, and they might want a persistent config
prior to loading/reloading any prog to avoid reconfigs (packet drops)
between progs.

> > Use-case(1): Cloud-provider want to give customers (running VMs)
> > ability
> > to load XDP program for DDoS protection (only), but don't want to
> > allow
> > customer to use XDP_TX (that can implement LB or cheat their VM
> > isolation policy).
> 
> Not following. What interface do they want to allow loading on? If
> its
> the VM interface then I don't see how it matters. From outside the
> VM there should be no way to discover if its done in VM or in tc or
> some other stack.
> 
> If its doing some onloading/offloading I would assume they need to
> ensure the isolation, etc. is still maintained because you can't
> let one VMs program work on other VMs packets safely.
> 
> So what did I miss, above doesn't make sense to me.
> 
> > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue
> > resources, as the use-case is only DDoS.  Today we have this
> > problem
> > with the ixgbe hardware, that cannot load XDP programs on systems
> > with
> > more than 192 CPUs.
> 
> The ixgbe issues is just a bug or missing-feature in my opinion.
> 
> I think we just document that XDP_TX consumes resources and if users
> care they shouldn't use XD_TX in programs and in that case hardware
> should via program discovery not allocate the resource. This seems
> cleaner in my opinion then more bits for features.
> 
> > 
> > > If this is properly documented that these are basic must-have
> > > _requirements_, then users and driver developers both know what
> > > the
> > > expectations are.
> > 
> > We can still document that XDP_TX is a must-have requirement, when
> > a
> > driver implements XDP.
> 
> +1
> 

Ho about xdp redirect ? 
do we still need to load a no-op program on the egress netdev so it
would allocate the xdp tx/redirect queues ? 

Adding the above discovery feature will break xdp redirect native mode
and will require to have a special flag for xdp_redirect, so it
actually makes more sense to have a unique knob to turn on XDP tx, for
the redirect use case.
Maciej Fijalkowski Dec. 7, 2020, 11:07 p.m. UTC | #13
On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote:
> Jesper Dangaard Brouer wrote:
> > On Fri, 4 Dec 2020 16:21:08 +0100
> > Daniel Borkmann <daniel@iogearbox.net> wrote:
> > 
> > > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:
> > > > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote:  
> > > >> alardam@gmail.com writes:  
> > > >>> From: Marek Majtyka <marekx.majtyka@intel.com>
> > > >>>
> > > >>> Implement support for checking what kind of xdp functionality a netdev
> > > >>> supports. Previously, there was no way to do this other than to try
> > > >>> to create an AF_XDP socket on the interface or load an XDP program and see
> > > >>> if it worked. This commit changes this by adding a new variable which
> > > >>> describes all xdp supported functions on pretty detailed level:  
> > > >>
> > > >> I like the direction this is going! :)
> > 
> > (Me too, don't get discouraged by our nitpicking, keep working on this! :-))
> > 
> > > >>  
> > > >>>   - aborted
> > > >>>   - drop
> > > >>>   - pass
> > > >>>   - tx  
> > > 
> > > I strongly think we should _not_ merge any native XDP driver patchset
> > > that does not support/implement the above return codes. 
> > 
> > I agree, with above statement.
> > 
> > > Could we instead group them together and call this something like
> > > XDP_BASE functionality to not give a wrong impression?
> > 
> > I disagree.  I can accept that XDP_BASE include aborted+drop+pass.
> > 
> > I think we need to keep XDP_TX action separate, because I think that
> > there are use-cases where the we want to disable XDP_TX due to end-user
> > policy or hardware limitations.
> 
> How about we discover this at load time though. Meaning if the program
> doesn't use XDP_TX then the hardware can skip resource allocations for
> it. I think we could have verifier or extra pass discover the use of
> XDP_TX and then pass a bit down to driver to enable/disable TX caps.

+1

> 
> > 
> > Use-case(1): Cloud-provider want to give customers (running VMs) ability
> > to load XDP program for DDoS protection (only), but don't want to allow
> > customer to use XDP_TX (that can implement LB or cheat their VM
> > isolation policy).
> 
> Not following. What interface do they want to allow loading on? If its
> the VM interface then I don't see how it matters. From outside the
> VM there should be no way to discover if its done in VM or in tc or
> some other stack.
> 
> If its doing some onloading/offloading I would assume they need to
> ensure the isolation, etc. is still maintained because you can't
> let one VMs program work on other VMs packets safely.
> 
> So what did I miss, above doesn't make sense to me.
> 
> > 
> > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue
> > resources, as the use-case is only DDoS.  Today we have this problem
> > with the ixgbe hardware, that cannot load XDP programs on systems with
> > more than 192 CPUs.
> 
> The ixgbe issues is just a bug or missing-feature in my opinion.

Not a bug, rather HW limitation?

> 
> I think we just document that XDP_TX consumes resources and if users
> care they shouldn't use XD_TX in programs and in that case hardware
> should via program discovery not allocate the resource. This seems
> cleaner in my opinion then more bits for features.

But what if I'm with some limited HW that actually has a support for XDP
and I would like to utilize XDP_TX?

Not all drivers that support XDP consume Tx resources. Recently igb got
support and it shares Tx queues between netstack and XDP.

I feel like we should have a sort-of best effort approach in case we
stumble upon the XDP_TX in prog being loaded and query the driver if it
would be able to provide the Tx resources on the current system, given
that normally we tend to have a queue per core.

In that case igb would say yes, ixgbe would say no and prog would be
rejected.

> 
> > 
> > 
> > > If this is properly documented that these are basic must-have
> > > _requirements_, then users and driver developers both know what the
> > > expectations are.
> > 
> > We can still document that XDP_TX is a must-have requirement, when a
> > driver implements XDP.
> 
> +1
> 
> > 
> > 
> > > >>>   - redirect  
> > > >>
> > 
> > 
> > -- 
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> > 
> 
>
David Ahern Dec. 8, 2020, 1:01 a.m. UTC | #14
On 12/7/20 1:52 PM, John Fastabend wrote:
>>
>> I think we need to keep XDP_TX action separate, because I think that
>> there are use-cases where the we want to disable XDP_TX due to end-user
>> policy or hardware limitations.
> 
> How about we discover this at load time though. Meaning if the program
> doesn't use XDP_TX then the hardware can skip resource allocations for
> it. I think we could have verifier or extra pass discover the use of
> XDP_TX and then pass a bit down to driver to enable/disable TX caps.
> 

This was discussed in the context of virtio_net some months back - it is
hard to impossible to know a program will not return XDP_TX (e.g., value
comes from a map).

Flipping that around, what if the program attach indicates whether
XDP_TX could be returned. If so, driver manages the resource needs. If
not, no resource needed and if the program violates that and returns
XDP_TX the packet is dropped.
Jesper Dangaard Brouer Dec. 8, 2020, 8:28 a.m. UTC | #15
On Mon, 7 Dec 2020 18:01:00 -0700
David Ahern <dsahern@gmail.com> wrote:

> On 12/7/20 1:52 PM, John Fastabend wrote:
> >>
> >> I think we need to keep XDP_TX action separate, because I think that
> >> there are use-cases where the we want to disable XDP_TX due to end-user
> >> policy or hardware limitations.  
> > 
> > How about we discover this at load time though. 

Nitpick at XDP "attach" time. The general disconnect between BPF and
XDP is that BPF can verify at "load" time (as kernel knows what it
support) while XDP can have different support/features per driver, and
cannot do this until attachment time. (See later issue with tail calls).
(All other BPF-hooks don't have this issue)

> > Meaning if the program
> > doesn't use XDP_TX then the hardware can skip resource allocations for
> > it. I think we could have verifier or extra pass discover the use of
> > XDP_TX and then pass a bit down to driver to enable/disable TX caps.
> >   
> 
> This was discussed in the context of virtio_net some months back - it is
> hard to impossible to know a program will not return XDP_TX (e.g., value
> comes from a map).

It is hard, and sometimes not possible.  For maps the workaround is
that BPF-programmer adds a bound check on values from the map. If not
doing that the verifier have to assume all possible return codes are
used by BPF-prog.

The real nemesis is program tail calls, that can be added dynamically
after the XDP program is attached.  It is at attachment time that
changing the NIC resources is possible.  So, for program tail calls the
verifier have to assume all possible return codes are used by BPF-prog.

BPF now have function calls and function replace right(?)  How does
this affect this detection of possible return codes?


> Flipping that around, what if the program attach indicates whether
> XDP_TX could be returned. If so, driver manages the resource needs. If
> not, no resource needed and if the program violates that and returns
> XDP_TX the packet is dropped.

I do like this idea, as IMHO we do need something that is connected
with the BPF-prog, that describe what resources the program request
(either like above via detecting this in verifier, or simply manually
configuring this in the BPF-prog ELF file)

The main idea is that we all (I assume) want to provide a better
end-user interface/experience. By direct feedback to the end-user that
"loading+attaching" this XDP BPF-prog will not work, as e.g. driver
don't support a specific return code.  Thus, we need to reject
"loading+attaching" if features cannot be satisfied.

We need a solution as; today it is causing frustration for end-users
that packets can be (silently) dropped by XDP, e.g. if driver don't
support XDP_REDIRECT.
Jesper Dangaard Brouer Dec. 8, 2020, 9 a.m. UTC | #16
On Mon, 07 Dec 2020 12:52:22 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> > Use-case(1): Cloud-provider want to give customers (running VMs) ability
> > to load XDP program for DDoS protection (only), but don't want to allow
> > customer to use XDP_TX (that can implement LB or cheat their VM
> > isolation policy).  
> 
> Not following. What interface do they want to allow loading on? If its
> the VM interface then I don't see how it matters. From outside the
> VM there should be no way to discover if its done in VM or in tc or
> some other stack.
> 
> If its doing some onloading/offloading I would assume they need to
> ensure the isolation, etc. is still maintained because you can't
> let one VMs program work on other VMs packets safely.
> 
> So what did I miss, above doesn't make sense to me.

The Cloud-provider want to load customer provided BPF-code on the
physical Host-OS NIC (that support XDP).  The customer can get access
to a web-interface where they can write or upload their BPF-prog.

As multiple customers can upload BPF-progs, the Cloud-provider have to
write a BPF-prog dispatcher that runs these multiple program.  This
could be done via BPF tail-calls, or via Toke's libxdp[1], or via
devmap XDP-progs per egress port.

The Cloud-provider don't fully trust customers BPF-prog.   They already
pre-filtered traffic to the given VM, so they can allow customers
freedom to see traffic and do XDP_PASS and XDP_DROP.  They
administratively (via ethtool) want to disable the XDP_REDIRECT and
XDP_TX driver feature, as it can be used for violation their VM
isolation policy between customers.

Is the use-case more clear now?


[1] https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp
Daniel Borkmann Dec. 8, 2020, 9:42 a.m. UTC | #17
On 12/8/20 10:00 AM, Jesper Dangaard Brouer wrote:
> On Mon, 07 Dec 2020 12:52:22 -0800
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>>> Use-case(1): Cloud-provider want to give customers (running VMs) ability
>>> to load XDP program for DDoS protection (only), but don't want to allow
>>> customer to use XDP_TX (that can implement LB or cheat their VM
>>> isolation policy).
>>
>> Not following. What interface do they want to allow loading on? If its
>> the VM interface then I don't see how it matters. From outside the
>> VM there should be no way to discover if its done in VM or in tc or
>> some other stack.
>>
>> If its doing some onloading/offloading I would assume they need to
>> ensure the isolation, etc. is still maintained because you can't
>> let one VMs program work on other VMs packets safely.
>>
>> So what did I miss, above doesn't make sense to me.
> 
> The Cloud-provider want to load customer provided BPF-code on the
> physical Host-OS NIC (that support XDP).  The customer can get access
> to a web-interface where they can write or upload their BPF-prog.
> 
> As multiple customers can upload BPF-progs, the Cloud-provider have to
> write a BPF-prog dispatcher that runs these multiple program.  This
> could be done via BPF tail-calls, or via Toke's libxdp[1], or via
> devmap XDP-progs per egress port.
> 
> The Cloud-provider don't fully trust customers BPF-prog.   They already
> pre-filtered traffic to the given VM, so they can allow customers
> freedom to see traffic and do XDP_PASS and XDP_DROP.  They
> administratively (via ethtool) want to disable the XDP_REDIRECT and
> XDP_TX driver feature, as it can be used for violation their VM
> isolation policy between customers.
> 
> Is the use-case more clear now?

I think we're talking about two different things. The use case as I understood
it in (1) mentioned to be able to disable XDP_TX for NICs that are deployed in
the VM. This would be a no-go as-is since that would mean my basic assumption
for attaching XDP progs is gone in that today return codes pass/drop/tx is
pretty much available everywhere on native XDP supported NICs. And if you've
tried it on major cloud providers like AWS or Azure that offer SRIOV-based
networking that works okay and further restricting this would mean breakage of
existing programs.

What you mean here is "offload" from guest to host which is a different use
case than what likely John and I read from your description in (1). Such program
should then be loaded via BPF offload API. Meaning, if offload is used and the
host is then configured to disallow XDP_TX for such requests from guests, then
these get rejected through such facility, but if the /same/ program was loaded as
regular native XDP where it's still running in the guest, then it must succeed.
These are two entirely different things.

It's not clear to me whether some ethtool XDP properties flag is the right place
to describe this (plus this needs to differ between offloaded / non-offloaded progs)
or whether this should be an implementation detail for things like virtio_net e.g.
via virtio_has_feature(). Feels more like the latter to me which already has such
a facility in place.
Toke Høiland-Jørgensen Dec. 8, 2020, 11:58 a.m. UTC | #18
Jesper Dangaard Brouer <jbrouer@redhat.com> writes:

> On Mon, 7 Dec 2020 18:01:00 -0700
> David Ahern <dsahern@gmail.com> wrote:
>
>> On 12/7/20 1:52 PM, John Fastabend wrote:
>> >>
>> >> I think we need to keep XDP_TX action separate, because I think that
>> >> there are use-cases where the we want to disable XDP_TX due to end-user
>> >> policy or hardware limitations.  
>> > 
>> > How about we discover this at load time though. 
>
> Nitpick at XDP "attach" time. The general disconnect between BPF and
> XDP is that BPF can verify at "load" time (as kernel knows what it
> support) while XDP can have different support/features per driver, and
> cannot do this until attachment time. (See later issue with tail calls).
> (All other BPF-hooks don't have this issue)
>
>> > Meaning if the program
>> > doesn't use XDP_TX then the hardware can skip resource allocations for
>> > it. I think we could have verifier or extra pass discover the use of
>> > XDP_TX and then pass a bit down to driver to enable/disable TX caps.
>> >   
>> 
>> This was discussed in the context of virtio_net some months back - it is
>> hard to impossible to know a program will not return XDP_TX (e.g., value
>> comes from a map).
>
> It is hard, and sometimes not possible.  For maps the workaround is
> that BPF-programmer adds a bound check on values from the map. If not
> doing that the verifier have to assume all possible return codes are
> used by BPF-prog.
>
> The real nemesis is program tail calls, that can be added dynamically
> after the XDP program is attached.  It is at attachment time that
> changing the NIC resources is possible.  So, for program tail calls the
> verifier have to assume all possible return codes are used by BPF-prog.

We actually had someone working on a scheme for how to express this for
programs some months ago, but unfortunately that stalled out (Jesper
already knows this, but FYI to the rest of you). In any case, I view
this as a "next step". Just exposing the feature bits to userspace will
help users today, and as a side effect, this also makes drivers declare
what they support, which we can then incorporate into the core code to,
e.g., reject attachment of programs that won't work anyway. But let's
do this in increments and not make the perfect the enemy of the good
here.

> BPF now have function calls and function replace right(?)  How does
> this affect this detection of possible return codes?

It does have the same issue as tail calls, in that the return code of
the function being replaced can obviously change. However, the verifier
knows the target of a replace, so it can propagate any constraints put
upon the caller if we implement it that way.

-Toke
John Fastabend Dec. 9, 2020, 5:50 a.m. UTC | #19
Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:
> 
> > On Mon, 7 Dec 2020 18:01:00 -0700
> > David Ahern <dsahern@gmail.com> wrote:
> >
> >> On 12/7/20 1:52 PM, John Fastabend wrote:
> >> >>
> >> >> I think we need to keep XDP_TX action separate, because I think that
> >> >> there are use-cases where the we want to disable XDP_TX due to end-user
> >> >> policy or hardware limitations.  
> >> > 
> >> > How about we discover this at load time though. 
> >
> > Nitpick at XDP "attach" time. The general disconnect between BPF and
> > XDP is that BPF can verify at "load" time (as kernel knows what it
> > support) while XDP can have different support/features per driver, and
> > cannot do this until attachment time. (See later issue with tail calls).
> > (All other BPF-hooks don't have this issue)
> >
> >> > Meaning if the program
> >> > doesn't use XDP_TX then the hardware can skip resource allocations for
> >> > it. I think we could have verifier or extra pass discover the use of
> >> > XDP_TX and then pass a bit down to driver to enable/disable TX caps.
> >> >   
> >> 
> >> This was discussed in the context of virtio_net some months back - it is
> >> hard to impossible to know a program will not return XDP_TX (e.g., value
> >> comes from a map).
> >
> > It is hard, and sometimes not possible.  For maps the workaround is
> > that BPF-programmer adds a bound check on values from the map. If not
> > doing that the verifier have to assume all possible return codes are
> > used by BPF-prog.
> >
> > The real nemesis is program tail calls, that can be added dynamically
> > after the XDP program is attached.  It is at attachment time that
> > changing the NIC resources is possible.  So, for program tail calls the
> > verifier have to assume all possible return codes are used by BPF-prog.
> 
> We actually had someone working on a scheme for how to express this for
> programs some months ago, but unfortunately that stalled out (Jesper
> already knows this, but FYI to the rest of you). In any case, I view
> this as a "next step". Just exposing the feature bits to userspace will
> help users today, and as a side effect, this also makes drivers declare
> what they support, which we can then incorporate into the core code to,
> e.g., reject attachment of programs that won't work anyway. But let's
> do this in increments and not make the perfect the enemy of the good
> here.
> 
> > BPF now have function calls and function replace right(?)  How does
> > this affect this detection of possible return codes?
> 
> It does have the same issue as tail calls, in that the return code of
> the function being replaced can obviously change. However, the verifier
> knows the target of a replace, so it can propagate any constraints put
> upon the caller if we implement it that way.

OK I'm convinced its not possible to tell at attach time if a program
will use XDP_TX or not in general. And in fact for most real programs it
likely will not be knowable. At least most programs I look at these days
use either tail calls or function calls so seems like a dead end.

Also above somewhere it was pointed out that XDP_REDIRECT would want
the queues and it seems even more challenging to sort that out.
John Fastabend Dec. 9, 2020, 6:03 a.m. UTC | #20
> On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote:
> > Jesper Dangaard Brouer wrote:
> > > On Fri, 4 Dec 2020 16:21:08 +0100
> > > Daniel Borkmann <daniel@iogearbox.net> wrote:

[...] pruning the thread to answer Jesper.

> > > 
> > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue
> > > resources, as the use-case is only DDoS.  Today we have this problem
> > > with the ixgbe hardware, that cannot load XDP programs on systems with
> > > more than 192 CPUs.
> > 
> > The ixgbe issues is just a bug or missing-feature in my opinion.
> 
> Not a bug, rather HW limitation?

Well hardware has some max queue limit. Likely <192 otherwise I would
have kept doing queue per core on up to 192. But, ideally we should
still load and either share queues across multiple cores or restirct
down to a subset of CPUs. Do you need 192 cores for a 10gbps nic,
probably not. Yes, it requires some extra care, but should be doable
if someone cares enough. I gather current limitation/bug is because
no one has that configuration and/or has complained loud enough.

> 
> > 
> > I think we just document that XDP_TX consumes resources and if users
> > care they shouldn't use XD_TX in programs and in that case hardware
> > should via program discovery not allocate the resource. This seems
> > cleaner in my opinion then more bits for features.
> 
> But what if I'm with some limited HW that actually has a support for XDP
> and I would like to utilize XDP_TX?
> 
> Not all drivers that support XDP consume Tx resources. Recently igb got
> support and it shares Tx queues between netstack and XDP.

Makes sense to me.

> 
> I feel like we should have a sort-of best effort approach in case we
> stumble upon the XDP_TX in prog being loaded and query the driver if it
> would be able to provide the Tx resources on the current system, given
> that normally we tend to have a queue per core.

Why do we need to query? I guess you want some indication from the
driver its not going to be running in the ideal NIC configuraition?
I guess printing a warning would be the normal way to show that. But,
maybe your point is you want something easier to query?

> 
> In that case igb would say yes, ixgbe would say no and prog would be
> rejected.

I think the driver should load even if it can't meet the queue per
core quota. Refusing to load at all or just dropping packets on the
floor is not very friendly. I think we agree on that point.
Maciej Fijalkowski Dec. 9, 2020, 9:54 a.m. UTC | #21
On Tue, Dec 08, 2020 at 10:03:51PM -0800, John Fastabend wrote:
> > On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote:
> > > Jesper Dangaard Brouer wrote:
> > > > On Fri, 4 Dec 2020 16:21:08 +0100
> > > > Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
> [...] pruning the thread to answer Jesper.

I think you meant me, but thanks anyway for responding :)

> 
> > > > 
> > > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue
> > > > resources, as the use-case is only DDoS.  Today we have this problem
> > > > with the ixgbe hardware, that cannot load XDP programs on systems with
> > > > more than 192 CPUs.
> > > 
> > > The ixgbe issues is just a bug or missing-feature in my opinion.
> > 
> > Not a bug, rather HW limitation?
> 
> Well hardware has some max queue limit. Likely <192 otherwise I would
> have kept doing queue per core on up to 192. But, ideally we should

Data sheet states its 128 Tx qs for ixgbe.

> still load and either share queues across multiple cores or restirct
> down to a subset of CPUs.

And that's the missing piece of logic, I suppose.

> Do you need 192 cores for a 10gbps nic, probably not.

Let's hear from Jesper :p

> Yes, it requires some extra care, but should be doable
> if someone cares enough. I gather current limitation/bug is because
> no one has that configuration and/or has complained loud enough.

I would say we're safe for queue per core approach for newer devices where
we have thousands of queues to play with. Older devices combined with big
cpu count can cause us some problems.

Wondering if drivers could have a problem when user would do something
weird as limiting the queue count to a lower value than cpu count and then
changing the irq affinity?

> 
> > 
> > > 
> > > I think we just document that XDP_TX consumes resources and if users
> > > care they shouldn't use XD_TX in programs and in that case hardware
> > > should via program discovery not allocate the resource. This seems
> > > cleaner in my opinion then more bits for features.
> > 
> > But what if I'm with some limited HW that actually has a support for XDP
> > and I would like to utilize XDP_TX?
> > 
> > Not all drivers that support XDP consume Tx resources. Recently igb got
> > support and it shares Tx queues between netstack and XDP.
> 
> Makes sense to me.
> 
> > 
> > I feel like we should have a sort-of best effort approach in case we
> > stumble upon the XDP_TX in prog being loaded and query the driver if it
> > would be able to provide the Tx resources on the current system, given
> > that normally we tend to have a queue per core.
> 
> Why do we need to query? I guess you want some indication from the
> driver its not going to be running in the ideal NIC configuraition?
> I guess printing a warning would be the normal way to show that. But,
> maybe your point is you want something easier to query?

I meant that given Jesper's example, what should we do? You don't have Tx
resources to pull at all. Should we have a data path for that case that
would share Tx qs between XDP/netstack? Probably not.

> 
> > 
> > In that case igb would say yes, ixgbe would say no and prog would be
> > rejected.
> 
> I think the driver should load even if it can't meet the queue per
> core quota. Refusing to load at all or just dropping packets on the
> floor is not very friendly. I think we agree on that point.

Agreed on that. But it needs some work. I can dabble on that a bit.
Toke Høiland-Jørgensen Dec. 9, 2020, 10:26 a.m. UTC | #22
John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:
>> 
>> > On Mon, 7 Dec 2020 18:01:00 -0700
>> > David Ahern <dsahern@gmail.com> wrote:
>> >
>> >> On 12/7/20 1:52 PM, John Fastabend wrote:
>> >> >>
>> >> >> I think we need to keep XDP_TX action separate, because I think that
>> >> >> there are use-cases where the we want to disable XDP_TX due to end-user
>> >> >> policy or hardware limitations.  
>> >> > 
>> >> > How about we discover this at load time though. 
>> >
>> > Nitpick at XDP "attach" time. The general disconnect between BPF and
>> > XDP is that BPF can verify at "load" time (as kernel knows what it
>> > support) while XDP can have different support/features per driver, and
>> > cannot do this until attachment time. (See later issue with tail calls).
>> > (All other BPF-hooks don't have this issue)
>> >
>> >> > Meaning if the program
>> >> > doesn't use XDP_TX then the hardware can skip resource allocations for
>> >> > it. I think we could have verifier or extra pass discover the use of
>> >> > XDP_TX and then pass a bit down to driver to enable/disable TX caps.
>> >> >   
>> >> 
>> >> This was discussed in the context of virtio_net some months back - it is
>> >> hard to impossible to know a program will not return XDP_TX (e.g., value
>> >> comes from a map).
>> >
>> > It is hard, and sometimes not possible.  For maps the workaround is
>> > that BPF-programmer adds a bound check on values from the map. If not
>> > doing that the verifier have to assume all possible return codes are
>> > used by BPF-prog.
>> >
>> > The real nemesis is program tail calls, that can be added dynamically
>> > after the XDP program is attached.  It is at attachment time that
>> > changing the NIC resources is possible.  So, for program tail calls the
>> > verifier have to assume all possible return codes are used by BPF-prog.
>> 
>> We actually had someone working on a scheme for how to express this for
>> programs some months ago, but unfortunately that stalled out (Jesper
>> already knows this, but FYI to the rest of you). In any case, I view
>> this as a "next step". Just exposing the feature bits to userspace will
>> help users today, and as a side effect, this also makes drivers declare
>> what they support, which we can then incorporate into the core code to,
>> e.g., reject attachment of programs that won't work anyway. But let's
>> do this in increments and not make the perfect the enemy of the good
>> here.
>> 
>> > BPF now have function calls and function replace right(?)  How does
>> > this affect this detection of possible return codes?
>> 
>> It does have the same issue as tail calls, in that the return code of
>> the function being replaced can obviously change. However, the verifier
>> knows the target of a replace, so it can propagate any constraints put
>> upon the caller if we implement it that way.
>
> OK I'm convinced its not possible to tell at attach time if a program
> will use XDP_TX or not in general. And in fact for most real programs it
> likely will not be knowable. At least most programs I look at these days
> use either tail calls or function calls so seems like a dead end.
>
> Also above somewhere it was pointed out that XDP_REDIRECT would want
> the queues and it seems even more challenging to sort that out.

Yeah. Doesn't mean that all hope is lost for "reject stuff that doesn't
work". We could either do pessimistic return code detection (if we don't
know for sure assume all codes are used), or we could add metadata where
the program declares what it wants to do...

-Toke
Jesper Dangaard Brouer Dec. 9, 2020, 11:52 a.m. UTC | #23
On Wed, 9 Dec 2020 10:54:54 +0100
Maciej Fijalkowski <maciej.fijalkowski@intel.com> wrote:

> On Tue, Dec 08, 2020 at 10:03:51PM -0800, John Fastabend wrote:
> > > On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote:  
> > > > Jesper Dangaard Brouer wrote:  
> > > > > On Fri, 4 Dec 2020 16:21:08 +0100
> > > > > Daniel Borkmann <daniel@iogearbox.net> wrote:  
> > 
> > [...] pruning the thread to answer Jesper.  
> 
> I think you meant me, but thanks anyway for responding :)

I was about to say that ;-)

> > > > > 
> > > > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue
> > > > > resources, as the use-case is only DDoS.  Today we have this problem
> > > > > with the ixgbe hardware, that cannot load XDP programs on systems with
> > > > > more than 192 CPUs.  
> > > > 
> > > > The ixgbe issues is just a bug or missing-feature in my opinion.  
> > > 
> > > Not a bug, rather HW limitation?  
> > 
> > Well hardware has some max queue limit. Likely <192 otherwise I would
> > have kept doing queue per core on up to 192. But, ideally we should  
> 
> Data sheet states its 128 Tx qs for ixgbe.

I likely remember wrong, maybe it was only ~96 CPUs.  I do remember that
some TX queue were reserved for something else, and QA reported issues
(as I don't have this high end system myself).


> > still load and either share queues across multiple cores or restirct
> > down to a subset of CPUs.  
> 
> And that's the missing piece of logic, I suppose.
> 
> > Do you need 192 cores for a 10gbps nic, probably not.  
> 
> Let's hear from Jesper :p

LOL - of-cause you don't need 192 cores.  With XDP I will claim that
you only need 2 cores (with high GHz) to forward 10gbps wirespeed small
packets.

The point is that this only works, when we avoid atomic lock operations
per packet and bulk NIC PCIe tail/doorbell.  It was actually John's
invention/design to have a dedicated TX queue per core to avoid the
atomic lock operation per packet when queuing packets to the NIC.

 10G @64B give budget of 67.2 ns (241 cycles @ 3.60GHz)
 Atomic lock operation use:[1]
 - Type:spin_lock_unlock         Per elem: 34 cycles(tsc) 9.485 ns
 - Type:spin_lock_unlock_irqsave Per elem: 61 cycles(tsc) 17.125 ns
 (And atomic can affect Inst per cycle)

But I have redesigned the ndo_xdp_xmit call to take a bulk of packets
(up-to 16) so it should not be a problem to solve this by sharing
TX-queue and talking a lock per 16 packets.  I still recommend that,
for fallback case,  you allocated a number a TX-queue and distribute
this across CPUs to avoid hitting a congested lock (above measurements
are the optimal non-congested atomic lock operation)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

> > Yes, it requires some extra care, but should be doable
> > if someone cares enough. I gather current limitation/bug is because
> > no one has that configuration and/or has complained loud enough.  
> 
> I would say we're safe for queue per core approach for newer devices where
> we have thousands of queues to play with. Older devices combined with big
> cpu count can cause us some problems.
> 
> Wondering if drivers could have a problem when user would do something
> weird as limiting the queue count to a lower value than cpu count and then
> changing the irq affinity?

Not sure what you mean.

But for XDP RX-side we use softirq NAPI guarantee to guard against
concurrent access to our (per-cpu) data structures.

> >   
> > >   
> > > > 
> > > > I think we just document that XDP_TX consumes resources and if users
> > > > care they shouldn't use XD_TX in programs and in that case hardware
> > > > should via program discovery not allocate the resource. This seems
> > > > cleaner in my opinion then more bits for features.  
> > > 
> > > But what if I'm with some limited HW that actually has a support for XDP
> > > and I would like to utilize XDP_TX?
> > > 
> > > Not all drivers that support XDP consume Tx resources. Recently igb got
> > > support and it shares Tx queues between netstack and XDP.  
> > 
> > Makes sense to me.
> >   
> > > 
> > > I feel like we should have a sort-of best effort approach in case we
> > > stumble upon the XDP_TX in prog being loaded and query the driver if it
> > > would be able to provide the Tx resources on the current system, given
> > > that normally we tend to have a queue per core.  
> > 
> > Why do we need to query? I guess you want some indication from the
> > driver its not going to be running in the ideal NIC configuraition?
> > I guess printing a warning would be the normal way to show that. But,
> > maybe your point is you want something easier to query?  
> 
> I meant that given Jesper's example, what should we do? You don't have Tx
> resources to pull at all. Should we have a data path for that case that
> would share Tx qs between XDP/netstack? Probably not.
> 

I think ixgbe should have a fallback mode, where it allocated e.g. 32
TX-queue for XDP xmits or even just same amount as RX-queues (I think
XDP_TX and XDP_REDIRECT can share these TX-queues dedicated to XDP).
When in fallback mode a lock need to be taken (sharded across CPUs),
but ndo_xdp_xmit will bulk up-to 16 packets, so it should not matter
too much.

I do think ixgbe should output a dmesg log message, to say it is in XDP
fallback mode with X number of TX-queues.  For us QA usually collect
the dmesg output after a test run.

   
> > > 
> > > In that case igb would say yes, ixgbe would say no and prog would be
> > > rejected.  
> > 
> > I think the driver should load even if it can't meet the queue per
> > core quota. Refusing to load at all or just dropping packets on the
> > floor is not very friendly. I think we agree on that point.  
> 
> Agreed on that. But it needs some work. I can dabble on that a bit.
> 

I will really appreciate if Intel can fix this in the ixgbe driver, and
implement a fallback method.
David Ahern Dec. 9, 2020, 3:41 p.m. UTC | #24
On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:
>>> still load and either share queues across multiple cores or restirct
>>> down to a subset of CPUs.  
>>
>> And that's the missing piece of logic, I suppose.
>>
>>> Do you need 192 cores for a 10gbps nic, probably not.  
>>
>> Let's hear from Jesper :p
> 
> LOL - of-cause you don't need 192 cores.  With XDP I will claim that
> you only need 2 cores (with high GHz) to forward 10gbps wirespeed small
> packets.

You don't need 192 for 10G on Rx. However, if you are using XDP_REDIRECT
from VM tap devices the next device (presumably the host NIC) does need
to be able to handle the redirect.

My personal experience with this one is mlx5/ConnectX4-LX with a limit
of 63 queues and a server with 96 logical cpus. If the vhost thread for
the tap device runs on a cpu that does not have an XDP TX Queue, the
packet is dropped. This is a really bizarre case to debug as some
packets go out fine while others are dropped.
David Ahern Dec. 9, 2020, 3:44 p.m. UTC | #25
On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:
> But I have redesigned the ndo_xdp_xmit call to take a bulk of packets
> (up-to 16) so it should not be a problem to solve this by sharing
> TX-queue and talking a lock per 16 packets.  I still recommend that,
> for fallback case,  you allocated a number a TX-queue and distribute
> this across CPUs to avoid hitting a congested lock (above measurements
> are the optimal non-congested atomic lock operation)

I have been meaning to ask you why 16 for the XDP batching? If the
netdev budget is 64, why not something higher like 32 or 64?
Saeed Mahameed Dec. 9, 2020, 5:15 p.m. UTC | #26
On Wed, 2020-12-09 at 08:41 -0700, David Ahern wrote:
> On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:
> > > > still load and either share queues across multiple cores or
> > > > restirct
> > > > down to a subset of CPUs.  
> > > 
> > > And that's the missing piece of logic, I suppose.
> > > 
> > > > Do you need 192 cores for a 10gbps nic, probably not.  
> > > 
> > > Let's hear from Jesper :p
> > 
> > LOL - of-cause you don't need 192 cores.  With XDP I will claim
> > that
> > you only need 2 cores (with high GHz) to forward 10gbps wirespeed
> > small
> > packets.
> 
> You don't need 192 for 10G on Rx. However, if you are using
> XDP_REDIRECT
> from VM tap devices the next device (presumably the host NIC) does
> need
> to be able to handle the redirect.
> 
> My personal experience with this one is mlx5/ConnectX4-LX with a
> limit

This limit was removed from mlx5
https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/
Note: you still need to use ehttool to increase from 64 to 128 or 96 in
your case.

> of 63 queues and a server with 96 logical cpus. If the vhost thread
> for
> the tap device runs on a cpu that does not have an XDP TX Queue, the
> packet is dropped. This is a really bizarre case to debug as some
> packets go out fine while others are dropped.

I agree, the user experience horrible.
David Ahern Dec. 10, 2020, 3:34 a.m. UTC | #27
On 12/9/20 10:15 AM, Saeed Mahameed wrote:
>> My personal experience with this one is mlx5/ConnectX4-LX with a
>> limit
> 
> This limit was removed from mlx5
> https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/
> Note: you still need to use ehttool to increase from 64 to 128 or 96 in
> your case.
> 

I asked you about that commit back in May:

https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/

As noted in the thread, it did not work for me.
Saeed Mahameed Dec. 10, 2020, 6:48 a.m. UTC | #28
On Wed, 2020-12-09 at 20:34 -0700, David Ahern wrote:
> On 12/9/20 10:15 AM, Saeed Mahameed wrote:
> > > My personal experience with this one is mlx5/ConnectX4-LX with a
> > > limit
> > 
> > This limit was removed from mlx5
> > https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/
> > Note: you still need to use ehttool to increase from 64 to 128 or
> > 96 in
> > your case.
> > 
> 
> I asked you about that commit back in May:
> 

:/, sorry i missed this email, must have been the mlnx nvidia email
transition.

> https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/
> 
> As noted in the thread, it did not work for me.

Still relevant ? I might need to get you some tools to increase #msix
in Firmware.
Jesper Dangaard Brouer Dec. 10, 2020, 1:32 p.m. UTC | #29
On Wed, 9 Dec 2020 08:44:33 -0700
David Ahern <dsahern@gmail.com> wrote:

> On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:
> > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets
> > (up-to 16) so it should not be a problem to solve this by sharing
> > TX-queue and talking a lock per 16 packets.  I still recommend that,
> > for fallback case,  you allocated a number a TX-queue and distribute
> > this across CPUs to avoid hitting a congested lock (above measurements
> > are the optimal non-congested atomic lock operation)  
> 
> I have been meaning to ask you why 16 for the XDP batching? If the
> netdev budget is 64, why not something higher like 32 or 64?

Thanks you for asking as there are multiple good reasons and
consideration for this 16 batch size.  Notice cpumap have batch size 8,
which is also an explicit choice.  And AF_XDP went in the wrong
direction IMHO and I think have 256.  I designed this to be a choice in
the map code, for the level of bulking it needs/wants.

The low level explanation is that these 8 and 16 batch sizes are
optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher
with 10 elements).  I'm betting on that memory backing these 8 or 16
packets have higher chance to remain/being in cache, and I can prefetch
them without evicting them from cache again.  In some cases the pointer
to these packets are queued into a ptr_ring, and it is more optimal to
write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring.

The general explanation is my goal to do bulking without adding latency.
This is explicitly stated in my presentation[1] as of Feb 2016, slide 20.
Sure, you/we can likely make the micro-benchmarks look better by using
64 batch size, but that will introduce added latency and likely shoot
our-selves in the foot for real workloads.  With experience from
bufferbloat and real networks, we know that massive TX bulking have bad
effects.  Still XDP-redirect does massive bulking (NIC flush is after
full 64 budget) and we don't have pushback or a queue mechanism (so I
know we are already shooting ourselves in the foot) ...  Fortunately we
now have a PhD student working on queuing for XDP.

It is also important to understand that this is an adaptive bulking
scheme, which comes from NAPI.  We don't wait for packets arriving
shortly, we pickup what NIC have available, but by only taking 8 or 16
packets (instead of emptying the entire RX-queue), and then spending
some time to send them along, I'm hoping that NIC could have gotten
some more frame.  For cpumap and veth (in-some-cases) they can start to
consume packets from these batches, but NIC drivers gets XDP_XMIT_FLUSH
signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to
update their internal queue state (and BQL), and if it gets close to
full they can choose to flush/doorbell the NIC earlier.  When doing
queuing for XDP we need to expose these NIC queue states, and having 4
calls with 16 packets (64 budget) also gives us more chances to get NIC
queue state info which the NIC already touch.


[1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf
Magnus Karlsson Dec. 10, 2020, 2:14 p.m. UTC | #30
On Thu, Dec 10, 2020 at 2:32 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Wed, 9 Dec 2020 08:44:33 -0700
> David Ahern <dsahern@gmail.com> wrote:
>
> > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:
> > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets
> > > (up-to 16) so it should not be a problem to solve this by sharing
> > > TX-queue and talking a lock per 16 packets.  I still recommend that,
> > > for fallback case,  you allocated a number a TX-queue and distribute
> > > this across CPUs to avoid hitting a congested lock (above measurements
> > > are the optimal non-congested atomic lock operation)
> >
> > I have been meaning to ask you why 16 for the XDP batching? If the
> > netdev budget is 64, why not something higher like 32 or 64?
>
> Thanks you for asking as there are multiple good reasons and
> consideration for this 16 batch size.  Notice cpumap have batch size 8,
> which is also an explicit choice.  And AF_XDP went in the wrong
> direction IMHO and I think have 256.  I designed this to be a choice in
> the map code, for the level of bulking it needs/wants.

FYI, as far as I know, there is nothing in AF_XDP that says bulking
should be 256. There is a 256 number in the i40e driver that states
the maximum number of packets to be sent within one napi_poll loop.
But this is just a maximum number and only for that driver. (In case
you wonder, that number was inherited from the original skb Tx
implementation in the driver.) The actual batch size is controlled by
the application. If it puts 1 packet in the Tx ring and calls send(),
the batch size will be 1. If it puts 128 packets in the Tx ring and
calls send(), you get a batch size of 128, and so on. It is flexible,
so you can trade-off latency with throughput in the way the
application desires. Rx batch size has also become flexible now with
the introduction of Björn's prefer_busy_poll patch set [1].

[1] https://lore.kernel.org/netdev/20201130185205.196029-1-bjorn.topel@gmail.com/

> The low level explanation is that these 8 and 16 batch sizes are
> optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher
> with 10 elements).  I'm betting on that memory backing these 8 or 16
> packets have higher chance to remain/being in cache, and I can prefetch
> them without evicting them from cache again.  In some cases the pointer
> to these packets are queued into a ptr_ring, and it is more optimal to
> write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring.
>
> The general explanation is my goal to do bulking without adding latency.
> This is explicitly stated in my presentation[1] as of Feb 2016, slide 20.
> Sure, you/we can likely make the micro-benchmarks look better by using
> 64 batch size, but that will introduce added latency and likely shoot
> our-selves in the foot for real workloads.  With experience from
> bufferbloat and real networks, we know that massive TX bulking have bad
> effects.  Still XDP-redirect does massive bulking (NIC flush is after
> full 64 budget) and we don't have pushback or a queue mechanism (so I
> know we are already shooting ourselves in the foot) ...  Fortunately we
> now have a PhD student working on queuing for XDP.
>
> It is also important to understand that this is an adaptive bulking
> scheme, which comes from NAPI.  We don't wait for packets arriving
> shortly, we pickup what NIC have available, but by only taking 8 or 16
> packets (instead of emptying the entire RX-queue), and then spending
> some time to send them along, I'm hoping that NIC could have gotten
> some more frame.  For cpumap and veth (in-some-cases) they can start to
> consume packets from these batches, but NIC drivers gets XDP_XMIT_FLUSH
> signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to
> update their internal queue state (and BQL), and if it gets close to
> full they can choose to flush/doorbell the NIC earlier.  When doing
> queuing for XDP we need to expose these NIC queue states, and having 4
> calls with 16 packets (64 budget) also gives us more chances to get NIC
> queue state info which the NIC already touch.
>
>
> [1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
David Ahern Dec. 10, 2020, 3:30 p.m. UTC | #31
On 12/9/20 11:48 PM, Saeed Mahameed wrote:
> On Wed, 2020-12-09 at 20:34 -0700, David Ahern wrote:
>> On 12/9/20 10:15 AM, Saeed Mahameed wrote:
>>>> My personal experience with this one is mlx5/ConnectX4-LX with a
>>>> limit
>>>
>>> This limit was removed from mlx5
>>> https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/
>>> Note: you still need to use ehttool to increase from 64 to 128 or
>>> 96 in
>>> your case.
>>>
>>
>> I asked you about that commit back in May:
>>
> 
> :/, sorry i missed this email, must have been the mlnx nvidia email
> transition.
> 
>> https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/
>>
>> As noted in the thread, it did not work for me.
> 
> Still relevant ? I might need to get you some tools to increase #msix
> in Firmware.
> 

not for me at the moment, but it would be good to document what a user
needs to do - especially if it involves vendor specific tools and steps.
Jesper Dangaard Brouer Dec. 10, 2020, 5:30 p.m. UTC | #32
On Thu, 10 Dec 2020 15:14:18 +0100
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Thu, Dec 10, 2020 at 2:32 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > On Wed, 9 Dec 2020 08:44:33 -0700
> > David Ahern <dsahern@gmail.com> wrote:
> >  
> > > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:  
> > > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets
> > > > (up-to 16) so it should not be a problem to solve this by sharing
> > > > TX-queue and talking a lock per 16 packets.  I still recommend that,
> > > > for fallback case,  you allocated a number a TX-queue and distribute
> > > > this across CPUs to avoid hitting a congested lock (above measurements
> > > > are the optimal non-congested atomic lock operation)  
> > >
> > > I have been meaning to ask you why 16 for the XDP batching? If the
> > > netdev budget is 64, why not something higher like 32 or 64?  
> >
> > Thanks you for asking as there are multiple good reasons and
> > consideration for this 16 batch size.  Notice cpumap have batch size 8,
> > which is also an explicit choice.  And AF_XDP went in the wrong
> > direction IMHO and I think have 256.  I designed this to be a choice in
> > the map code, for the level of bulking it needs/wants.  
> 
> FYI, as far as I know, there is nothing in AF_XDP that says bulking
> should be 256. There is a 256 number in the i40e driver that states
> the maximum number of packets to be sent within one napi_poll loop.
> But this is just a maximum number and only for that driver. (In case
> you wonder, that number was inherited from the original skb Tx
> implementation in the driver.) 

Ah, that explains the issue I have on the production system that runs
the EDT-pacer[2].  I see that i40e function i40e_clean_tx_irq() ignores
napi_budget but uses it own budget, that defaults to 256.  Looks like I
can adjust this via ethtool -C tx-frames-irq.   I turned this down to
64 (32 was giving worse results, and below 16 system acted strange).

Now the issue is gone, which was that if TX-DMA completion was running
(i40e_clean_tx_irq()) on the same CPU that send packets via FQ-pacer
qdisc, then the pacing was not accurate, and was sending too bursty.

System have already tuned "net/core/dev_weight" and RX/TX-bias to
reduce bulking, as this can influence latency and the EDT-pacing
accuracy. (It is a middlebox bridging VLANs and BPF-EDT tiemstamping and
FQ-pacing packets to solve bursts overflowing switch ports).

  sudo sysctl net/core/dev_weight
  net.core.dev_weight = 1
  net.core.dev_weight_rx_bias = 32
  net.core.dev_weight_tx_bias = 1

This net.core.dev_weight_tx_bias=1 (together with dev_weight=1) cause
qdisc transmit budget to become one packet, cycling through
NET_TX_SOFTIRQ which consumes time and gives a little more pacing space
for the packets.


> The actual batch size is controlled by
> the application. If it puts 1 packet in the Tx ring and calls send(),
> the batch size will be 1. If it puts 128 packets in the Tx ring and
> calls send(), you get a batch size of 128, and so on. It is flexible,
> so you can trade-off latency with throughput in the way the
> application desires. Rx batch size has also become flexible now with
> the introduction of Björn's prefer_busy_poll patch set [1].
> 
> [1] https://lore.kernel.org/netdev/20201130185205.196029-1-bjorn.topel@gmail.com/

This looks like a cool trick, to get even more accurate packet scheduling.

I played with the tunings, and could see changed behavior with mpstat,
but ended up tuning it off again, as I could not measure a direct
correlation with the bpftrace tools[3].


> > The low level explanation is that these 8 and 16 batch sizes are
> > optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher
> > with 10 elements).  I'm betting on that memory backing these 8 or 16
> > packets have higher chance to remain/being in cache, and I can prefetch
> > them without evicting them from cache again.  In some cases the pointer
> > to these packets are queued into a ptr_ring, and it is more optimal to
> > write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring.
> >
> > The general explanation is my goal to do bulking without adding latency.
> > This is explicitly stated in my presentation[1] as of Feb 2016, slide 20.
> > Sure, you/we can likely make the micro-benchmarks look better by using
> > 64 batch size, but that will introduce added latency and likely shoot
> > our-selves in the foot for real workloads.  With experience from
> > bufferbloat and real networks, we know that massive TX bulking have bad
> > effects.  Still XDP-redirect does massive bulking (NIC flush is after
> > full 64 budget) and we don't have pushback or a queue mechanism (so I
> > know we are already shooting ourselves in the foot) ...  Fortunately we
> > now have a PhD student working on queuing for XDP.
> >
> > It is also important to understand that this is an adaptive bulking
> > scheme, which comes from NAPI.  We don't wait for packets arriving
> > shortly, we pickup what NIC have available, but by only taking 8 or 16
> > packets (instead of emptying the entire RX-queue), and then spending
> > some time to send them along, I'm hoping that NIC could have gotten
> > some more frame.  For cpumap and veth (in-some-cases) they can start to
> > consume packets from these batches, but NIC drivers gets XDP_XMIT_FLUSH
> > signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to
> > update their internal queue state (and BQL), and if it gets close to
> > full they can choose to flush/doorbell the NIC earlier.  When doing
> > queuing for XDP we need to expose these NIC queue states, and having 4
> > calls with 16 packets (64 budget) also gives us more chances to get NIC
> > queue state info which the NIC already touch.
> >
> >
> > [1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf

[2] https://github.com/netoptimizer/bpf-examples/tree/master/traffic-pacing-edt/

[3] https://github.com/netoptimizer/bpf-examples/tree/master/traffic-pacing-edt/bpftrace
Saeed Mahameed Dec. 10, 2020, 6:58 p.m. UTC | #33
On Thu, 2020-12-10 at 08:30 -0700, David Ahern wrote:
> On 12/9/20 11:48 PM, Saeed Mahameed wrote:
> > On Wed, 2020-12-09 at 20:34 -0700, David Ahern wrote:
> > > On 12/9/20 10:15 AM, Saeed Mahameed wrote:
> > > > > My personal experience with this one is mlx5/ConnectX4-LX
> > > > > with a
> > > > > limit
> > > > 
> > > > This limit was removed from mlx5
> > > > https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/
> > > > Note: you still need to use ehttool to increase from 64 to 128
> > > > or
> > > > 96 in
> > > > your case.
> > > > 
> > > 
> > > I asked you about that commit back in May:
> > > 
> > 
> > :/, sorry i missed this email, must have been the mlnx nvidia email
> > transition.
> > 
> > > https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/
> > > 
> > > As noted in the thread, it did not work for me.
> > 
> > Still relevant ? I might need to get you some tools to increase
> > #msix
> > in Firmware.
> > 
> 
> not for me at the moment, but it would be good to document what a
> user
> needs to do - especially if it involves vendor specific tools and
> steps.

Ack, Thanks for pointing that out, I will take action on this.
Saeed Mahameed Dec. 10, 2020, 7:20 p.m. UTC | #34
On Thu, 2020-12-10 at 14:32 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 9 Dec 2020 08:44:33 -0700
> David Ahern <dsahern@gmail.com> wrote:
> 
> > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:
> > > But I have redesigned the ndo_xdp_xmit call to take a bulk of
> > > packets
> > > (up-to 16) so it should not be a problem to solve this by sharing
> > > TX-queue and talking a lock per 16 packets.  I still recommend
> > > that,
> > > for fallback case,  you allocated a number a TX-queue and
> > > distribute
> > > this across CPUs to avoid hitting a congested lock (above
> > > measurements
> > > are the optimal non-congested atomic lock operation)  
> > 
> > I have been meaning to ask you why 16 for the XDP batching? If the
> > netdev budget is 64, why not something higher like 32 or 64?
> 
> Thanks you for asking as there are multiple good reasons and
> consideration for this 16 batch size.  Notice cpumap have batch size
> 8,
> which is also an explicit choice.  And AF_XDP went in the wrong
> direction IMHO and I think have 256.  I designed this to be a choice
> in
> the map code, for the level of bulking it needs/wants.
> 
> The low level explanation is that these 8 and 16 batch sizes are
> optimized towards cache sizes and Intel's Line-Fill-Buffer
> (prefetcher
> with 10 elements).  I'm betting on that memory backing these 8 or 16
> packets have higher chance to remain/being in cache, and I can
> prefetch
> them without evicting them from cache again.  In some cases the
> pointer
> to these packets are queued into a ptr_ring, and it is more optimal
> to
> write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the
> ptr_ring.
> 

I've warned people about this once or twice on the mailing list, for
example re-populating the rx ring, a common mistake is to use the napi
budget, which has the exact side effects as you are explaining here
Jesper !

these 8/16 numbers are used in more than one place in the stack, xdp,
gro, hw buffer re-population, etc..
how can we enforce such numbers and a uniform handling in all drivers?
1. have a clear documentation ? well know defines, for people to copy?

2. for XDP we must keep track on the memory backing of the xdp bulked
data as Jesper pointed out, so we always make sure whatever bulk-size
we define it always remains cache friendly, especially now where people
stated working on  multi-buff and other features that will extend the
xdp_buff and xdp_frame, do we need a selftest that maybe runs pahole to
see the those data strcutre remain within reasonable format/sizes ?



> The general explanation is my goal to do bulking without adding
> latency.
> This is explicitly stated in my presentation[1] as of Feb 2016, slide
> 20.
> Sure, you/we can likely make the micro-benchmarks look better by
> using
> 64 batch size, but that will introduce added latency and likely shoot
> our-selves in the foot for real workloads.  With experience from
> bufferbloat and real networks, we know that massive TX bulking have
> bad
> effects.  Still XDP-redirect does massive bulking (NIC flush is after
> full 64 budget) and we don't have pushback or a queue mechanism (so I
> know we are already shooting ourselves in the foot) ...  Fortunately
> we
> now have a PhD student working on queuing for XDP.
> 
> It is also important to understand that this is an adaptive bulking
> scheme, which comes from NAPI.  We don't wait for packets arriving
> shortly, we pickup what NIC have available, but by only taking 8 or
> 16
> packets (instead of emptying the entire RX-queue), and then spending
> some time to send them along, I'm hoping that NIC could have gotten
> some more frame.  For cpumap and veth (in-some-cases) they can start
> to
> consume packets from these batches, but NIC drivers gets
> XDP_XMIT_FLUSH
> signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to
> update their internal queue state (and BQL), and if it gets close to
> full they can choose to flush/doorbell the NIC earlier.  When doing
> queuing for XDP we need to expose these NIC queue states, and having
> 4
> calls with 16 packets (64 budget) also gives us more chances to get
> NIC
> queue state info which the NIC already touch.
> 
> 
> [1] 
> https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf
diff mbox series

Patch

diff --git a/Documentation/networking/netdev-xdp-properties.rst b/Documentation/networking/netdev-xdp-properties.rst
new file mode 100644
index 000000000000..4a434a1c512b
--- /dev/null
+++ b/Documentation/networking/netdev-xdp-properties.rst
@@ -0,0 +1,42 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Netdev XDP properties
+=====================
+
+ * XDP PROPERTIES FLAGS
+
+Following netdev xdp properties flags can be retrieve over netlink ethtool
+interface the same way as netdev feature flags. These properties flags are
+read only and cannot be change in the runtime.
+
+
+*  XDP_ABORTED
+
+This property informs if netdev supports xdp aborted action.
+
+*  XDP_DROP
+
+This property informs if netdev supports xdp drop action.
+
+*  XDP_PASS
+
+This property informs if netdev supports xdp pass action.
+
+*  XDP_TX
+
+This property informs if netdev supports xdp tx action.
+
+*  XDP_REDIRECT
+
+This property informs if netdev supports xdp redirect action.
+It assumes the all beforehand mentioned flags are enabled.
+
+*  XDP_ZEROCOPY
+
+This property informs if netdev driver supports xdp zero copy.
+It assumes the all beforehand mentioned flags are enabled.
+
+*  XDP_HW_OFFLOAD
+
+This property informs if netdev driver supports xdp hw oflloading.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 52d1cc2bd8a7..2544c7f0e1b7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -43,6 +43,7 @@ 
 #include <net/xdp.h>
 
 #include <linux/netdev_features.h>
+#include <linux/xdp_properties.h>
 #include <linux/neighbour.h>
 #include <uapi/linux/netdevice.h>
 #include <uapi/linux/if_bonding.h>
@@ -2171,6 +2172,7 @@  struct net_device {
 
 	/* protected by rtnl_lock */
 	struct bpf_xdp_entity	xdp_state[__MAX_XDP_MODE];
+	xdp_properties_t	xdp_properties;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
diff --git a/include/linux/xdp_properties.h b/include/linux/xdp_properties.h
new file mode 100644
index 000000000000..c72c9bcc50de
--- /dev/null
+++ b/include/linux/xdp_properties.h
@@ -0,0 +1,53 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Network device xdp properties.
+ */
+#ifndef _LINUX_XDP_PROPERTIES_H
+#define _LINUX_XDP_PROPERTIES_H
+
+#include <linux/types.h>
+#include <linux/bitops.h>
+#include <asm/byteorder.h>
+
+typedef u64 xdp_properties_t;
+
+enum {
+	XDP_F_ABORTED_BIT,
+	XDP_F_DROP_BIT,
+	XDP_F_PASS_BIT,
+	XDP_F_TX_BIT,
+	XDP_F_REDIRECT_BIT,
+	XDP_F_ZEROCOPY_BIT,
+	XDP_F_HW_OFFLOAD_BIT,
+
+	/*
+	 * Add your fresh new property above and remember to update
+	 * xdp_properties_strings [] in net/core/ethtool.c and maybe
+	 * some xdp_properties mask #defines below. Please also describe it
+	 * in Documentation/networking/xdp_properties.rst.
+	 */
+
+	/**/XDP_PROPERTIES_COUNT
+};
+
+#define __XDP_F_BIT(bit)	((xdp_properties_t)1 << (bit))
+#define __XDP_F(name)		__XDP_F_BIT(XDP_F_##name##_BIT)
+
+#define XDP_F_ABORTED		__XDP_F(ABORTED)
+#define XDP_F_DROP		__XDP_F(DROP)
+#define XDP_F_PASS		__XDP_F(PASS)
+#define XDP_F_TX		__XDP_F(TX)
+#define XDP_F_REDIRECT		__XDP_F(REDIRECT)
+#define XDP_F_ZEROCOPY		__XDP_F(ZEROCOPY)
+#define XDP_F_HW_OFFLOAD	__XDP_F(HW_OFFLOAD)
+
+#define XDP_F_BASIC		(XDP_F_ABORTED |	\
+				 XDP_F_DROP |		\
+				 XDP_F_PASS |		\
+				 XDP_F_TX)
+
+#define XDP_F_FULL		(XDP_F_BASIC | XDP_F_REDIRECT)
+
+#define XDP_F_FULL_ZC		(XDP_F_FULL | XDP_F_ZEROCOPY)
+
+#endif /* _LINUX_XDP_PROPERTIES_H */
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 700ad5db7f5d..a9fabc1282cf 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -7,6 +7,7 @@ 
 #define __LINUX_NET_XDP_H__
 
 #include <linux/skbuff.h> /* skb_shared_info */
+#include <linux/xdp_properties.h>
 
 /**
  * DOC: XDP RX-queue information
@@ -255,6 +256,100 @@  struct xdp_attachment_info {
 	u32 flags;
 };
 
+#if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
+
+static __always_inline void
+xdp_set_aborted_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_ABORTED;
+}
+
+static __always_inline void
+xdp_set_pass_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_PASS;
+}
+
+static __always_inline void
+xdp_set_drop_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_DROP;
+}
+
+static __always_inline void
+xdp_set_tx_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_TX;
+}
+
+static __always_inline void
+xdp_set_redirect_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_REDIRECT;
+}
+
+static __always_inline void
+xdp_set_hw_offload_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_HW_OFFLOAD;
+}
+
+static __always_inline void
+xdp_set_basic_properties(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_BASIC;
+}
+
+static __always_inline void
+xdp_set_full_properties(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_FULL;
+}
+
+#else
+
+static __always_inline void
+xdp_set_aborted_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_pass_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_drop_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_tx_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_redirect_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_hw_offload_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_basic_properties(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_full_properties(xdp_properties_t *properties)
+{
+}
+
+#endif
+
 struct netdev_bpf;
 bool xdp_attachment_flags_ok(struct xdp_attachment_info *info,
 			     struct netdev_bpf *bpf);
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 4e295541e396..48a3b6d165c7 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -8,6 +8,7 @@ 
 
 #include <net/xdp_sock.h>
 #include <net/xsk_buff_pool.h>
+#include <linux/xdp_properties.h>
 
 #ifdef CONFIG_XDP_SOCKETS
 
@@ -117,6 +118,11 @@  static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool,
 	xp_dma_sync_for_device(pool, dma, size);
 }
 
+static inline void xsk_set_zc_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_ZEROCOPY;
+}
+
 #else
 
 static inline void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
@@ -242,6 +248,10 @@  static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool,
 {
 }
 
+static inline void xsk_set_zc_property(xdp_properties_t *properties)
+{
+}
+
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_DRV_H */
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 9ca87bc73c44..dfcb0e2c98b2 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -688,6 +688,7 @@  enum ethtool_stringset {
 	ETH_SS_TS_TX_TYPES,
 	ETH_SS_TS_RX_FILTERS,
 	ETH_SS_UDP_TUNNEL_TYPES,
+	ETH_SS_XDP_PROPERTIES,
 
 	/* add new constants above here */
 	ETH_SS_COUNT
diff --git a/include/uapi/linux/xdp_properties.h b/include/uapi/linux/xdp_properties.h
new file mode 100644
index 000000000000..e85be03eb707
--- /dev/null
+++ b/include/uapi/linux/xdp_properties.h
@@ -0,0 +1,32 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+/*
+ * Copyright (c) 2020 Intel
+ */
+
+#ifndef __UAPI_LINUX_XDP_PROPERTIES__
+#define __UAPI_LINUX_XDP_PROPERTIES__
+
+/* ETH_GSTRING_LEN define is needed. */
+#include <linux/ethtool.h>
+
+#define XDP_PROPERTIES_ABORTED_STR	"xdp-aborted"
+#define XDP_PROPERTIES_DROP_STR		"xdp-drop"
+#define XDP_PROPERTIES_PASS_STR		"xdp-pass"
+#define XDP_PROPERTIES_TX_STR		"xdp-tx"
+#define XDP_PROPERTIES_REDIRECT_STR	"xdp-redirect"
+#define XDP_PROPERTIES_ZEROCOPY_STR	"xdp-zerocopy"
+#define XDP_PROPERTIES_HW_OFFLOAD_STR	"xdp-hw-offload"
+
+#define	DECLARE_XDP_PROPERTIES_TABLE(name)		\
+	const char name[][ETH_GSTRING_LEN] = {		\
+		XDP_PROPERTIES_ABORTED_STR,		\
+		XDP_PROPERTIES_DROP_STR,		\
+		XDP_PROPERTIES_PASS_STR,		\
+		XDP_PROPERTIES_TX_STR,			\
+		XDP_PROPERTIES_REDIRECT_STR,		\
+		XDP_PROPERTIES_ZEROCOPY_STR,		\
+		XDP_PROPERTIES_HW_OFFLOAD_STR,		\
+	}
+
+#endif  /* __UAPI_LINUX_XDP_PROPERTIES__ */
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 24036e3055a1..8f15f96b8922 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -4,6 +4,7 @@ 
 #include <linux/net_tstamp.h>
 #include <linux/phy.h>
 #include <linux/rtnetlink.h>
+#include <uapi/linux/xdp_properties.h>
 
 #include "common.h"
 
@@ -283,6 +284,16 @@  const char udp_tunnel_type_names[][ETH_GSTRING_LEN] = {
 static_assert(ARRAY_SIZE(udp_tunnel_type_names) ==
 	      __ETHTOOL_UDP_TUNNEL_TYPE_CNT);
 
+const char xdp_properties_strings[XDP_PROPERTIES_COUNT][ETH_GSTRING_LEN] = {
+	[XDP_F_ABORTED_BIT] =		XDP_PROPERTIES_ABORTED_STR,
+	[XDP_F_DROP_BIT] =		XDP_PROPERTIES_DROP_STR,
+	[XDP_F_PASS_BIT] =		XDP_PROPERTIES_PASS_STR,
+	[XDP_F_TX_BIT] =		XDP_PROPERTIES_TX_STR,
+	[XDP_F_REDIRECT_BIT] =		XDP_PROPERTIES_REDIRECT_STR,
+	[XDP_F_ZEROCOPY_BIT] =		XDP_PROPERTIES_ZEROCOPY_STR,
+	[XDP_F_HW_OFFLOAD_BIT] =	XDP_PROPERTIES_HW_OFFLOAD_STR,
+};
+
 /* return false if legacy contained non-0 deprecated fields
  * maxtxpkt/maxrxpkt. rest of ksettings always updated
  */
diff --git a/net/ethtool/common.h b/net/ethtool/common.h
index 3d9251c95a8b..85a35f8781eb 100644
--- a/net/ethtool/common.h
+++ b/net/ethtool/common.h
@@ -5,8 +5,10 @@ 
 
 #include <linux/netdevice.h>
 #include <linux/ethtool.h>
+#include <linux/xdp_properties.h>
 
 #define ETHTOOL_DEV_FEATURE_WORDS	DIV_ROUND_UP(NETDEV_FEATURE_COUNT, 32)
+#define ETHTOOL_XDP_PROPERTIES_WORDS	DIV_ROUND_UP(XDP_PROPERTIES_COUNT, 32)
 
 /* compose link mode index from speed, type and duplex */
 #define ETHTOOL_LINK_MODE(speed, type, duplex) \
@@ -22,6 +24,8 @@  extern const char
 tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN];
 extern const char
 phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN];
+extern const char
+xdp_properties_strings[XDP_PROPERTIES_COUNT][ETH_GSTRING_LEN];
 extern const char link_mode_names[][ETH_GSTRING_LEN];
 extern const char netif_msg_class_names[][ETH_GSTRING_LEN];
 extern const char wol_mode_names[][ETH_GSTRING_LEN];
diff --git a/net/ethtool/strset.c b/net/ethtool/strset.c
index 0baad0ce1832..684e751b31a9 100644
--- a/net/ethtool/strset.c
+++ b/net/ethtool/strset.c
@@ -80,6 +80,11 @@  static const struct strset_info info_template[] = {
 		.count		= __ETHTOOL_UDP_TUNNEL_TYPE_CNT,
 		.strings	= udp_tunnel_type_names,
 	},
+	[ETH_SS_XDP_PROPERTIES] = {
+		.per_dev	= false,
+		.count		= ARRAY_SIZE(xdp_properties_strings),
+		.strings	= xdp_properties_strings,
+	},
 };
 
 struct strset_req_info {