
[net-next,v6,00/12] add flow_rule infrastructure

Message ID 20181214181205.28812-1-pablo@netfilter.org

Pablo Neira Ayuso Dec. 14, 2018, 6:11 p.m. UTC
Hi,

This patchset introduces a kernel intermediate representation (IR) to
express ACL hardware offloads. It is heavily based on the existing
flow dissector infrastructure and the TC actions. This IR can be used by
different frontend ACL interfaces such as ethtool_rxnfc, tc and
netfilter to represent ACL hardware offloads.

The main goals of this patchset are:

* Provide a unified representation that can be passed to the driver
  for translation to the HW IR. This consolidates the code to be
  maintained in the driver and it also simplifies the development of ACL
  hardware offloads. Driver developers do not need to add one specific
  parser for each supported ACL frontend; instead, each frontend just
  generates this flow_rule IR and passes it to the driver to populate
  the hardware IR.

* Do not expose TC software frontend details to the drivers anymore,
  such as structure layouts. Hence, if TC needs to be updated to support
  a new software feature, no changes to the drivers will be required.

In handcrafted ascii art, the idea is the following:

                .   ethtool_rxnfc   tc      netfilter
               |       (ioctl)   (netlink)  (netlink)
               |          |         |           |     translate native
      Frontend |          |         |           |  interface representation
               |          |         |           |      to flow_rule IR
               |          |         |           |
                .         \/        \/         \/
                .        ----- flow_rule IR ------
               |                     |
       Drivers |                     | parsing of flow_rule IR
               |                     |  to populate hardware IR
               |                    \/
                .            hardware IR (driver)

This patchset only converts tc and ethtool_rxnfc to use this
infrastructure. Therefore, this patchset comes with no netfilter changes
at this stage.

The proposed object that represents rules is the following:

  struct flow_rule {
	struct flow_match       match;
	struct flow_action      action;
  };

The flow_match structure wraps Jiri Pirko's existing representation
available in cls_flower based on flow dissectors to represent the
matching side:

  struct flow_match {
	struct flow_dissector   *dissector;
	void                    *mask;
	void                    *key;
  };

The mask and key layouts are opaque: the dissector object provides
the used_keys flags - to check which rule selectors are in use - and
the offset to the corresponding key and mask in the opaque container
structures.
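As a rough user-space sketch of this lookup scheme - all model_* names
and the layout are hypothetical, not the kernel's actual definitions -
the used_keys bitmask gates access to a selector and the per-key offset
locates it inside the opaque container:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of the dissector-driven lookup: used_keys says
 * which selectors a rule uses, offset[] says where each key lives
 * inside the opaque blob. */
enum { KEY_BASIC = 0, KEY_IPV4_ADDRS = 1, KEY_MAX = 2 };

struct model_dissector {
	uint32_t used_keys;        /* bitmask over the enum above */
	uint16_t offset[KEY_MAX];  /* byte offset of each key in the blob */
};

/* Mirrors the spirit of skb_flow_dissector_target(): resolve a key id
 * to a pointer into the opaque key or mask container. */
static void *model_target(const struct model_dissector *d, int key,
			  void *container)
{
	if (!(d->used_keys & (1u << key)))
		return NULL;  /* this rule does not use that selector */
	return (char *)container + d->offset[key];
}
```

The same offset applies to both the key and the mask container, which is
why a single dissector can describe both opaque layouts.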

Then, the actions to be performed on the matching packets are represented
through the flow_action object:

  struct flow_action {
	unsigned int              num_entries;
	struct flow_action_entry  entries[0];
  };

This object comes with a num_entries field that specifies the number of
actions - this supports an arbitrary number of actions, and the driver
imposes its own restrictions on this - and the array that stores the
flow_action_entry structures (entries). Each flow action entry is
defined as follows:

  struct flow_action_entry {
        enum flow_action_id             id;
        union {
                u32                     chain_index;    /* FLOW_ACTION_GOTO */
                struct net_device       *dev;           /* FLOW_ACTION_REDIRECT */
                struct {                                /* FLOW_ACTION_VLAN */
                        u16             vid;
                        __be16          proto;
                        u8              prio;
                } vlan;
                struct {                                /* FLOW_ACTION_PACKET_EDIT */
                        enum flow_action_mangle_base htype;
                        u32             offset;
                        u32             mask;
                        u32             val;
                } mangle;
                const struct ip_tunnel_info *tunnel;    /* FLOW_ACTION_TUNNEL_ENCAP */
                u32                     csum_flags;     /* FLOW_ACTION_CSUM */
                u32                     mark;           /* FLOW_ACTION_MARK */
                struct {                                /* FLOW_ACTION_QUEUE */
                        u32             ctx;
                        u32             index;
                        u8              vf;
                } queue;
        };
  };
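The entries[0] member above is the zero-length array idiom: the header
and its action entries live in one contiguous allocation sized by
num_entries. A minimal user-space sketch of that allocation - the
model_* names are illustrative assumptions, not kernel code:

```c
#include <stdlib.h>

/* Simplified stand-ins for flow_action / flow_action_entry. */
struct model_entry {
	int id;            /* would be enum flow_action_id */
	unsigned int arg;  /* stand-in for the per-action union */
};

struct model_action {
	unsigned int num_entries;
	struct model_entry entries[];  /* C99 flexible array member */
};

/* One allocation covers the header plus n trailing entries. */
static struct model_action *model_action_alloc(unsigned int n)
{
	struct model_action *a =
		calloc(1, sizeof(*a) + n * sizeof(a->entries[0]));

	if (a)
		a->num_entries = n;
	return a;
}
```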

Possible actions are extracted from what existing drivers already
support through the tc-flower and ethtool_rxnfc interfaces:

  enum flow_action_id {
        FLOW_ACTION_ACCEPT              = 0,
        FLOW_ACTION_DROP,
        FLOW_ACTION_TRAP,
        FLOW_ACTION_GOTO,
        FLOW_ACTION_REDIRECT,
        FLOW_ACTION_MIRRED,
        FLOW_ACTION_VLAN_PUSH,
        FLOW_ACTION_VLAN_POP,
        FLOW_ACTION_VLAN_MANGLE,
        FLOW_ACTION_TUNNEL_ENCAP,
        FLOW_ACTION_TUNNEL_DECAP,
        FLOW_ACTION_MANGLE,
        FLOW_ACTION_ADD,
        FLOW_ACTION_CSUM,
        FLOW_ACTION_MARK,
        FLOW_ACTION_WAKE,
        FLOW_ACTION_QUEUE,
  };

The common code pattern that drivers use to populate the hardware
intermediate representation looks like this:

        if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_IPV4_ADDRS)) {
                struct flow_match_ipv4_addrs match;

                flow_rule_match_ipv4_addrs(rule, &match);
                flow->l3_key.ipv4.daddr.s_addr = match.key->dst;
                flow->l3_mask.ipv4.daddr.s_addr = match.mask->dst;
                flow->l3_key.ipv4.saddr.s_addr = match.key->src;
                flow->l3_mask.ipv4.saddr.s_addr = match.mask->src;
        }

Then, the flow action parsing code looks like this:

        flow_action_for_each(i, act, flow_action) {
               switch (act->id) {
               case FLOW_ACTION_DROP:
                        actions->flags |= DRIVER_XYZ_ACTION_FLAG_DROP;
                        break;
               case ...:
                        break;
               default:
                        return -EOPNOTSUPP;
               }
        }
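The flow_action_for_each() helper above presumably just walks the
entries array up to num_entries. A plausible user-space model of that
iteration - the macro body and model_* names are assumptions, not the
kernel's actual definition:

```c
#include <stdlib.h>

struct model_entry {
	int id;  /* stand-in for enum flow_action_id */
};

struct model_action {
	unsigned int num_entries;
	struct model_entry entries[];  /* flexible array member */
};

/* Iterate i over indices and act over entries, in lockstep. */
#define model_action_for_each(i, act, actions)			\
	for ((i) = 0, (act) = &(actions)->entries[0];		\
	     (i) < (actions)->num_entries;			\
	     (act) = &(actions)->entries[++(i)])
```

With a macro like this, a driver-style switch on act->id matches the
pattern shown above.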

This new infrastructure resides in:

   net/core/flow_offload.c
   include/net/flow_offload.h
   net/core/ethtool.c (translator from ethtool_rxnfc to flow_rule)

This patchset is composed of 12 patches:

Patch #1 adds the flow_match structure, including the
         flow_rule_match_key() interface to check for selectors that
         are in use in the rule and the flow_rule_match_*() functions
         to fetch the selector value and mask. This also introduces
         the initial flow_rule structure skeleton to avoid a follow-up
         patch that would update the same LoCs.

Patch #2 makes changes to the packet edit parser of the mlx5e driver,
         to prepare for the introduction of the new flow_action to
         mangle packets.

Patch #3 introduces the flow_action infrastructure. This infrastructure
         is based on the TC actions. Patch #8 extends it so it also
         supports two new actions that are only available through the
         ethtool_rxnfc interface.

Patch #4 adds a function to translate TC actions to flow_action from
         cls_flower.

Patch #5 adds infrastructure to fetch statistics into a container
         structure and synchronize them to TC actions from cls_flower.
         This is another preparation patch for patch #7, so we can stop
         exposing the TC action native layout to the drivers.

Patch #6 uses the flow_action infrastructure from the drivers.

Patch #7 stops exposing TC actions to the drivers, now that they have
         been converted to use the flow_action infrastructure by
         patch #6.

Patch #8 adds support for the wake-on-LAN and queue actions to the
         flow_action infrastructure, two actions supported by the
         ethtool_rxnfc interface.

Patch #9 adds a function to translate the ethtool_rx_flow_spec
         structure to the flow_rule structure. This function resides
         in net/core/ethtool.c.

Patch #10 uses the ethtool_rxnfc to flow_rule translation from the
          bcm_sf2 driver.

Patch #11 is a preparation patch for the qlogic/qede driver changes
          coming in #12.

Patch #12 updates the qlogic/qede driver to use the flow_rule
          infrastructure. This removes the duplicated ethtool_rxnfc
          parser by using the flow_rule structure.

Please apply, thanks!

P.S: Moving forward, my rough plan is to propose and discuss the
following changes to use this infrastructure from netfilter for ACL
hardware offload:

* Rename the tc_setup ndo to flow_offload (or similar), so it can be used
  from netfilter too. Otherwise, I'm open to alternatives.
* Add a conversion function from the native netfilter representation to
  flow_rule, including the extra new glue code for the two-phase commit
  protocol to integrate this infrastructure.

Pablo Neira Ayuso (12):
  flow_offload: add flow_rule and flow_match structures and use them
  net/mlx5e: support for two independent packet edit actions
  flow_offload: add flow action infrastructure
  cls_api: add translator to flow_action representation
  flow_offload: add statistics retrieval infrastructure and use it
  drivers: net: use flow action infrastructure
  cls_flower: don't expose TC actions to drivers anymore
  flow_offload: add wake-up-on-lan and queue to flow_action
  ethtool: add ethtool_rx_flow_spec to flow_rule structure translator
  dsa: bcm_sf2: use flow_rule infrastructure
  qede: place ethtool_rx_flow_spec after code after TC flower codebase
  qede: use ethtool_rx_flow_rule() to remove duplicated parser code

 drivers/net/dsa/bcm_sf2_cfp.c                      | 102 +--
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c       | 252 ++++----
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 450 ++++++-------
 drivers/net/ethernet/intel/i40e/i40e_main.c        | 178 ++----
 drivers/net/ethernet/intel/iavf/iavf_main.c        | 195 +++---
 drivers/net/ethernet/intel/igb/igb_main.c          |  64 +-
 .../net/ethernet/mellanox/mlx5/core/en/tc_tun.c    |  68 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    | 697 ++++++++++-----------
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |   2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 258 ++++----
 drivers/net/ethernet/netronome/nfp/flower/action.c | 198 +++---
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 417 ++++++------
 .../net/ethernet/netronome/nfp/flower/offload.c    | 150 ++---
 drivers/net/ethernet/qlogic/qede/qede_filter.c     | 572 +++++++----------
 include/linux/ethtool.h                            |  15 +
 include/net/flow_offload.h                         | 203 ++++++
 include/net/pkt_cls.h                              |  18 +-
 net/core/Makefile                                  |   2 +-
 net/core/ethtool.c                                 | 241 +++++++
 net/core/flow_offload.c                            | 153 +++++
 net/sched/cls_api.c                                | 116 ++++
 net/sched/cls_flower.c                             |  71 ++-
 22 files changed, 2416 insertions(+), 2006 deletions(-)
 create mode 100644 include/net/flow_offload.h
 create mode 100644 net/core/flow_offload.c

--
2.11.0

Comments

Jakub Kicinski Dec. 18, 2018, 1:39 a.m. UTC | #1
On Fri, 14 Dec 2018 19:11:53 +0100, Pablo Neira Ayuso wrote:
> Hi,
> 
> This patchset introduces a kernel intermediate representation (IR) to
> express ACL hardware offloads, this is heavily based on the existing
> flow dissector infrastructure and the TC actions. This IR can be used by
> different frontend ACL interfaces such as ethtool_rxnfc, tc and
> netfilter to represent ACL hardware offloads.
> 
> The main goals of this patchset are:
> 
> * Provide an unified representation that can be passed to the driver
>   for translation to HW IR. This consolidates the code to be maintained
>   in the driver and it also simplifies the development of ACL hardware
>   offloads. Driver developers do not need to add one specific parser for
>   each supported ACL frontend, instead each frontend just generates
>   this flow_rule IR and pass it to drivers to populate the hardware IR.

Ack.

> * Do not expose TC software frontend details to the drivers anymore,
>   such as structure layouts. Hence, if TC needs to be updated to support
>   a new software feature, no changes to the drivers will be required.

This one you may need to clarify.  TC already transforms the rules and
for the most part provides abstract enough accessor helpers to protect
from changes in internal implementation of the SW path.

You could make an argument that if one subsystem already offloads
something and the IR can express it, another subsystem can add the same
functionality and drivers don't have to change.  But why do we have
multiple subsystems offloading the same thing in the first place?..

> In handcrafted ascii art, the idea is the following:
> 
>                 .   ethtool_rxnfc   tc      netfilter
>                |       (ioctl)   (netlink)  (netlink)
>                |          |         |           |     translate native
>       Frontend |          |         |           |  interface representation
>                |          |         |           |      to flow_rule IR
>                |          |         |           |
>                 .         \/        \/         \/
>                 .        ----- flow_rule IR ------
>                |                     |
>        Drivers |                     | parsing of flow_rule IR
>                |                     |  to populate hardware IR
>                |                    \/
>                 .            hardware IR (driver)
> 
> This patchset only converts tc and ethtool_rxnfc to use this
> infrastructure. Therefore. this patchset comes with no netfilter
> changes at this stage.

Let's try to differentiate between cls_flower and TC as a subsystem.

[...]

> P.S: Moving forward, my rough plan is to propose and discuss the
> following changes to use this infrastructure from netfilter for ACL
> hardware offload:
> 
> * Rename tc_setup ndo to flow_offload (or similar), so it can be used
>   from netfilter too. Otherwise, I'm open for alternatives.

That is the part which really makes me uneasy.  Ordering rules, tables
and offloads in HW is already a hard question, which keeps coming back
(see Jesse's talk from LPC for example).  (And if you think you know
the answer you probably forgot about legacy SR-IOV :))

The TC offloads themselves today are still quite flaky.  I don't think
anyone respects rule ordering/priorities at all.  Even if we did order
cls_flower's rules in HW - what if someone added a cls_basic rule before
them.  SW path will no longer match HW path.  And we are about to mix
another subsystem into that soup.

I don't really see how you could rename ndo_setup_tc to flow_offload.
It offloads block notifications, and Qdiscs.  You must realize that
actual rules are offloaded via block callbacks..?  Are you planning to
create fake blocks for netfilter?  Create yet another "give me all
rules for device X" registration construct?

What level of abstraction are we really going to achieve here?  The
devices still need to know netfilter is a separate subsystem with its
quirks and rules.  Say someone adds a PASS rule to flower, and a DROP
rule to netfilter.  Packet has to be dropped, but the device will most
likely PASS because flower's rule will hit first.  n-tuple filters
don't have pass so the problem isn't obvious from this patch set.

I'm starting to like Or's idea of "expressing nft rules in cls_flower".
I don't mean pretending it's flower rules just for offload, but rather
have a TC classifier module that can take nft rules (in SW path) and
which would semantically fit very clearly into the existing TC
framework.  Mirroring SW behaviour in HW becomes easier, we don't have
to teach drivers about another subsystem.

> * Add a conversion function from native netfilter representation to
>   flow_rule, including extra new glue code from the two-phase commit
>   protocol to integrate this infrastructure.
Pablo Neira Ayuso Dec. 18, 2018, 7:57 p.m. UTC | #2
Hi Jakub,

On Mon, Dec 17, 2018 at 05:39:29PM -0800, Jakub Kicinski wrote:
> On Fri, 14 Dec 2018 19:11:53 +0100, Pablo Neira Ayuso wrote:
> > Hi,
> > 
> > This patchset introduces a kernel intermediate representation (IR) to
> > express ACL hardware offloads, this is heavily based on the existing
> > flow dissector infrastructure and the TC actions. This IR can be used by
> > different frontend ACL interfaces such as ethtool_rxnfc, tc and
> > netfilter to represent ACL hardware offloads.
> > 
> > The main goals of this patchset are:
> > 
> > * Provide an unified representation that can be passed to the driver
> >   for translation to HW IR. This consolidates the code to be maintained
> >   in the driver and it also simplifies the development of ACL hardware
> >   offloads. Driver developers do not need to add one specific parser for
> >   each supported ACL frontend, instead each frontend just generates
> >   this flow_rule IR and pass it to drivers to populate the hardware IR.
> 
> Ack.

Thanks.

> > * Do not expose TC software frontend details to the drivers anymore,
> >   such as structure layouts. Hence, if TC needs to be updated to support
> >   a new software feature, no changes to the drivers will be required.
> 
> This one you may need to clarify.  TC already transforms the rules and
> for the most part provides abstract enough accessor helpers to protect
> from changes in internal implementation of the SW path.

Drivers can still modify the TC structure content: they can modify
frontend configurations, make tc dump nonsense values to userspace
via netlink, or they may crash the software plane at a later stage.
It is better if we expose configuration plane data through read-only
infrastructure that is consumed by drivers.

> You could make an argument that if one subsystem already offloads
> something and the IR can express it another subsystem can add the same
> functionality and drivers don't have to change.  But why do we have
> multiple subsystem offloading the same thing in the first place?..

Having multiple subsystems that allow you to do the same thing is a
decision that was made a long time ago. This allows multiple
approaches to compete inside the kernel, and it also allows users
to select the subsystem that best fits their requirements.

Anyway, the problem that this patchset addresses _already exists_ with
ethtool_rxnfc and cls_flower in place; it would be good to fix it now.

> > In handcrafted ascii art, the idea is the following:
> > 
> >                 .   ethtool_rxnfc   tc      netfilter
> >                |       (ioctl)   (netlink)  (netlink)
> >                |          |         |           |     translate native
> >       Frontend |          |         |           |  interface representation
> >                |          |         |           |      to flow_rule IR
> >                |          |         |           |
> >                 .         \/        \/         \/
> >                 .        ----- flow_rule IR ------
> >                |                     |
> >        Drivers |                     | parsing of flow_rule IR
> >                |                     |  to populate hardware IR
> >                |                    \/
> >                 .            hardware IR (driver)
> > 
> > This patchset only converts tc and ethtool_rxnfc to use this
> > infrastructure. Therefore. this patchset comes with no netfilter
> > changes at this stage.
> 
> Let's try to differentiate between cls_flower and TC as a subsystem.

Right. cls_flower is the most capable subset of tc at this stage.

> [...]
> 
> > P.S: Moving forward, my rough plan is to propose and discuss the
> > following changes to use this infrastructure from netfilter for ACL
> > hardware offload:
> > 
> > * Rename tc_setup ndo to flow_offload (or similar), so it can be used
> >   from netfilter too. Otherwise, I'm open for alternatives.
> 
> That is the part which really makes me uneasy.  Ordering rules, tables
> and offloads in HW is already a hard question, which keeps coming back
> (see Jesse's talk from LPC for example).  (And if you think you know
> the answer you probably forgot about legacy SR-IOV :))
>
> The TC offloads themselves today are still quite flaky. [...]

I'm very optimistic about what I'm seeing: patch 1/12 is rather large
because we already have a _fairly good number of drivers_ using the
existing upstream infrastructure.

I don't see anything unfixable in the existing approach; we can
incrementally extend what is already upstream to solve the existing
limitations.

[...]
> I don't really see how you could rename ndo_setup_tc to flow_offload.

We can start by using a single .ndo and use the enum parameter to
route the offload request to the corresponding driver handler, which
deals with the tc details at this stage. So we use the single .ndo as a
multiplexer. By consolidating the parsers for ethtool_rxnfc and
cls_flower, we are already saving redundant code in this patchset, and
I can see more code in the tree that drivers could simplify.

> What level of abstraction are we really going to achieve here?  The
> devices still need to know netfilter is a separate subsystem with its
> quirks and rules.  Say someone adds a PASS rule to flower, and a DROP
> rule to netfilter.  Packet has to be dropped, but the device will most
> likely PASS because flower's rule will hit first.  n-tuple filters
> don't have pass so the problem isn't obvious from this patch set.

In software, tc ingress comes before netfilter, and ethtool_rxnfc
comes before both tc ingress and netfilter. Once we have a single .ndo,
performing unified resource management for offloads will be easier,
ie. store ownership of rules in the driver ACL database; then, when
rebuilding the ACL offload, we make sure we place ethtool rules before
tc rules, and tc rules before netfilter rules. So we achieve the same
semantics as in software. We cannot solve this problem until we have
unified infrastructure to address it.

[...]
> I don't mean pretending its flower rules just for offload, but rather
> have a TC classifier module that can take nft rules (in SW path) and
> which would semantically fit very clearly into the existing TC
> framework.  Mirroring SW behaviour in HW becomes easier, we don't have
> to teach drivers about another subsystem.

Then we'll have to rewrite userspace netfilter to run on top of TC,
and moreover add support for conntrack, NAT, logging... And after all
that is done, for netlink users, we'll have to translate netfilter
netlink messages to TC in the kernel, then route them to TC. It will
take ages to rewrite netfilter on top of tc, and there will be semantic
differences. This patchset already deals with the existing diversity
in the ecosystem by providing a single unified representation for the
hardware, no matter which frontend is being used for this goal.

Cheers,
Pablo
Jakub Kicinski Dec. 19, 2018, 7:57 p.m. UTC | #3
On Tue, 18 Dec 2018 20:57:05 +0100, Pablo Neira Ayuso wrote:
> On Mon, Dec 17, 2018 at 05:39:29PM -0800, Jakub Kicinski wrote:
> > On Fri, 14 Dec 2018 19:11:53 +0100, Pablo Neira Ayuso wrote:  
> > > * Do not expose TC software frontend details to the drivers anymore,
> > >   such as structure layouts. Hence, if TC needs to be updated to support
> > >   a new software feature, no changes to the drivers will be required.  
> > 
> > This one you may need to clarify.  TC already transforms the rules and
> > for the most part provides abstract enough accessor helpers to protect
> > from changes in internal implementation of the SW path.  
> 
> Drivers can still modify the TC structure content, they can modify
> frontend configurations, make tc dump non-sense values to userspace
> via netlink or they may crash software plane at a later stage. Better
> if we tend to expose configuration plane data through read-only
> infrastructure that is consumed by drivers.

Sounds like inventing a problem to fit the solution :)  I don't think
we've even had such a bug in TC offloads.

> > You could make an argument that if one subsystem already offloads
> > something and the IR can express it another subsystem can add the same
> > functionality and drivers don't have to change.  But why do we have
> > multiple subsystem offloading the same thing in the first place?..  
> 
> Having multiple subsystems that allow you to do the same thing is a
> decision that was made long time ago. This is allowing multiple
> approaches to compete inside the kernel and this also allowing users
> to select the subsystem that fits better their requirements.

Perhaps we learnt something new from the few years we spent working on
this stuff?  I'm not sure where we stand.  I believe I asked you
something like "why do we need this if we already have flower" when you
posted your first nf_flow patches, no?  So my position hasn't
changed :)

> Anyway, the problem that this patchset addresses _already exists_ with
> ethtool_rxnfc and cls_flower in place, it would be good to fix it now.

That's not the point of contention.

> > > In handcrafted ascii art, the idea is the following:
> > > 
> > >                 .   ethtool_rxnfc   tc      netfilter
> > >                |       (ioctl)   (netlink)  (netlink)
> > >                |          |         |           |     translate native
> > >       Frontend |          |         |           |  interface representation
> > >                |          |         |           |      to flow_rule IR
> > >                |          |         |           |
> > >                 .         \/        \/         \/
> > >                 .        ----- flow_rule IR ------
> > >                |                     |
> > >        Drivers |                     | parsing of flow_rule IR
> > >                |                     |  to populate hardware IR
> > >                |                    \/
> > >                 .            hardware IR (driver)
> > > 
> > > This patchset only converts tc and ethtool_rxnfc to use this
> > > infrastructure. Therefore. this patchset comes with no netfilter
> > > changes at this stage.  
> > 
> > Let's try to differentiate between cls_flower and TC as a subsystem.  
> 
> Right. cls_flower is the one more capable subset of tc at this stage.

cls_flower fits fixed-function HW and ACL use cases, but that's
slightly beside the point; TC has its pipeline, chains, blocks,
templates, qdiscs...

> > I don't really see how you could rename ndo_setup_tc to flow_offload.  
> 
> We can start by using a single .ndo and use the enum parameter to
> route the offload request to the corresponding driver handler to deal
> with tc details at this stage. So we use the single .ndo as a
> multiplexor. By consolidating parsers for ethtool_rxnfc and
> cls_flower, we are already saving redundant code in this patchset, and
> I can see more codebase in the tree that drivers could simplify.

ndo_setup_tc does not carry flows.  It carries information about
changes to the TC pipeline.

> > What level of abstraction are we really going to achieve here?  The
> > devices still need to know netfilter is a separate subsystem with its
> > quirks and rules.  Say someone adds a PASS rule to flower, and a DROP
> > rule to netfilter.  Packet has to be dropped, but the device will most
> > likely PASS because flower's rule will hit first.  n-tuple filters
> > don't have pass so the problem isn't obvious from this patch set.  
> 
> In software, tc ingress comes before netfilter, then ethtool_rxnfc
> comes before tc ingress and netfilter. Once we have a single .ndo,
> performing unified resource management for offloads will be easier,
> ie. store ownership of rules in the driver ACL database, then when
> rebuilding ACL offload, we make sure we place ethtool rules before tc,
> then tc before netfilter. So we achieve the same semantics as in
> software. We cannot solve this problem until we have unified
> infrastructure to address this.

I'm sorry to say your vision of unified resource management is
extremely hazy.  These patches are prep for something for which
there isn't even a clear plan.

There is no way having drivers learn semantics of another subsystem
would make things easier. Expressing the rules in terms of flow
dissector is not the hard part.

Did you consider things like tunnels and bonds?  There are a lot of
problems which have been solved for TC offloads, if you create a
separate subsystem offload you'll have to repeat all that work.

> > I don't mean pretending its flower rules just for offload, but rather
> > have a TC classifier module that can take nft rules (in SW path) and
> > which would semantically fit very clearly into the existing TC
> > framework.  Mirroring SW behaviour in HW becomes easier, we don't have
> > to teach drivers about another subsystem.  
> 
> Then, we'll have to rewrite userspace netfilter to run on top of TC.
> Moreover, add support for conntrack, NAT, logging... And after all
> that is done. For netlink users, we'll have to translate netfilter
> netlink message to TC from the kernel, then route this to TC. It will
> take ages to rewrite netfilter on top of tc and there will be semantic
> differences. This patchset already deals with the existing diversity
> in the ecosystem by providing a single unified representation for the
> hardware, no matter what frontend is being used for this goal.

I'm not suggesting we replace netfilter with TC.  I'm suggesting we
replace nf_flow_offload table with something that fits into TC.  

You're not going to offload entire netfilter.  You want to offload
simplistic flows through the nf_flow_table.  What I'm saying, is add a
equivalent of that table into TC.  Make user space "link" netfilter to
that.
Pablo Neira Ayuso Dec. 20, 2018, 12:03 a.m. UTC | #4
Dear Jakub,

On Wed, Dec 19, 2018 at 11:57:03AM -0800, Jakub Kicinski wrote:
[...]
> > Anyway, the problem that this patchset addresses _already exists_ with
> > ethtool_rxnfc and cls_flower in place, it would be good to fix it now.
> 
> That's not the point of contention.

What is your alternative plan to unify the driver codebase for
ethtool_rx_flow_spec and tc/cls_flower to offload ACLs from the ingress
path?

[...]
> Did you consider things like tunnels and bonds?  There are a lot of
> problems which have been solved for TC offloads, if you create a
> separate subsystem offload you'll have to repeat all that work.

Did I repeat any work to unify ethtool_rx_flow_spec and tc/cls_flower?
No :-). I spun on top of the existing work and delivered an incremental
update.

[...]
> I'm not suggesting we replace netfilter with TC.  I'm suggesting we
> replace nf_flow_offload table with something that fits into TC.
> 
> You're not going to offload entire netfilter.  You want to offload
> simplistic flows through the nf_flow_table.  What I'm saying, is add a
> equivalent of that table into TC.  Make user space "link" netfilter to
> that.

This patchset is not related to nf_flow_table infrastructure.

From what I'm reading, you assume we can use nf_flow_table everywhere,
which is probably true if you assume people use your NICs. However,
there is a class of hardware where the CPU is far too small to cope
with flows in software, where the dataplane is almost entirely offloaded
to hardware. In that scenario, nf_flow_table cannot be used.
Jakub Kicinski Dec. 20, 2018, 12:26 a.m. UTC | #5
On Thu, 20 Dec 2018 01:03:13 +0100, Pablo Neira Ayuso wrote:
> On Wed, Dec 19, 2018 at 11:57:03AM -0800, Jakub Kicinski wrote:
> > > Anyway, the problem that this patchset addresses _already exists_ with
> > > ethtool_rxnfc and cls_flower in place, it would be good to fix it now.  
> > 
> > That's not the point of contention.  
> 
> What is your alternative plan to unify the driver codebase for
> ethtool_rx_flow_spec and tc/cls_flower to offload ACLs from the
> ingress path?

The point of contention for me is the "reuse of ndo_setup_tc" :)  
I haven't played with your IR, but it looks fairly close to the
flower-dissector-based thing, so it's probably fine.

> > Did you consider things like tunnels and bonds?  There are a lot of
> > problems which have been solved for TC offloads, if you create a
> > separate subsystem offload you'll have to repeat all that work.  
> 
> Did I repeat any work to unify ethtool_rx_flow_spec and tc/cls_flower?
> No :-). I built on the existing work and delivered an incremental
> update.

Ethtool exists, so you don't have to teach the driver about it.  nft
offload does not yet exist in upstream drivers, AFAIK.  Besides, if
ethtool n-tuples were rich enough to support complex use cases, we
wouldn't have TC flower offloads, so it's not apples to apples ;)

> > I'm not suggesting we replace netfilter with TC.  I'm suggesting we
> > replace nf_flow_offload table with something that fits into TC.
> > 
> > You're not going to offload entire netfilter.  You want to offload
> > simplistic flows through the nf_flow_table.  What I'm saying is: add
> > an equivalent of that table to TC.  Make user space "link" netfilter
> > to that.
> 
> This patchset is not related to nf_flow_table infrastructure.

Let's not let the elephant back into the room, please :)

> From what I'm reading, you assume we can use nf_flow_table everywhere,
> which is probably true if you assume people use your NICs. However,
> there is a class of hardware whose CPU is far too small to cope with
> flows in software, where the dataplane is almost entirely offloaded
> to hardware. In that scenario, nf_flow_table cannot be used.

I'm confused, could you rephrase?  How does your work help such devices?
How is tc not suitable for them?
Pablo Neira Ayuso Dec. 20, 2018, 12:35 p.m. UTC | #6
On Wed, Dec 19, 2018 at 04:26:53PM -0800, Jakub Kicinski wrote:
> On Thu, 20 Dec 2018 01:03:13 +0100, Pablo Neira Ayuso wrote:
> > On Wed, Dec 19, 2018 at 11:57:03AM -0800, Jakub Kicinski wrote:
> > > > Anyway, the problem that this patchset addresses _already exists_ with
> > > > ethtool_rxnfc and cls_flower in place, it would be good to fix it now.  
> > > 
> > > That's not the point of contention.  
> > 
> > What is your alternative plan to unify the driver codebase for
> > ethtool_rx_flow_spec and tc/cls_flower to offload ACLs from the
> > ingress path?
> 
> The point of contention for me is the "reuse of ndo_setup_tc" :)

OK, this patchset is not updating ndo_setup_tc.

> I haven't played with your IR, but it looks fairly close to the
> flower-dissector-based thing, so it's probably fine.

OK.

[...]
> > From what I'm reading, you assume we can use nf_flow_table everywhere,
> > which is probably true if you assume people use your NICs. However,
> > there is a class of hardware whose CPU is far too small to cope with
> > flows in software, where the dataplane is almost entirely offloaded
> > to hardware. In that scenario, nf_flow_table cannot be used.
> 
> I'm confused, could you rephrase?  How does your work help such devices?
> How is tc not suitable for them?

There are two HW offload use cases:

#1 Policy resides in software: the host CPU sees the initial packets
   of a flow and, based on policy, places the flow into hardware via
   the nf_flow_table infrastructure. This use case is fine for your
   NIC, since you assume the host CPU can cope with running the policy
   in software for those few initial packets. However, switches
   usually have a small CPU that runs control-plane software only, so
   there we _cannot_ use this approach.

#2 Policy resides in hardware: for switches with a small CPU, the ACL
   is deployed in hardware, and the host CPU runs control-plane
   configuration only.

This patchset _is not_ related to #1, this patchset _is_ related to #2.

So far, there is infrastructure in Netfilter to do #1; it should be
possible to use it from TC too. In TC, there is infrastructure for #2
which can be reused from Netfilter.
Or Gerlitz Dec. 20, 2018, 1:51 p.m. UTC | #7
On Thu, Dec 20, 2018 at 2:35 PM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Wed, Dec 19, 2018 at 04:26:53PM -0800, Jakub Kicinski wrote:

> > I'm confused, could you rephrase?  How does your work help such devices?
> > How is tc not suitable for them?

> There are two HW offload use cases:
>
> #1 Policy resides in software: the host CPU sees the initial packets
>    of a flow and, based on policy, places the flow into hardware via
>    the nf_flow_table infrastructure. This use case is fine for your
>    NIC, since you assume the host CPU can cope with running the policy
>    in software for those few initial packets. However, switches
>    usually have a small CPU that runs control-plane software only, so
>    there we _cannot_ use this approach.
>
> #2 Policy resides in hardware: for switches with a small CPU, the ACL
>    is deployed in hardware, and the host CPU runs control-plane
>    configuration only.
>
> This patchset _is not_ related to #1, this patchset _is_ related to #2.

Confused, isn't this patch set related to connection tracking offloads
on modern NIC HWs?

> So far, there is infrastructure in Netfilter to do #1; it should be
> possible to use it from TC too. In TC, there is infrastructure for #2
> which can be reused from Netfilter.
Pablo Neira Ayuso Dec. 20, 2018, 3:39 p.m. UTC | #8
On Thu, Dec 20, 2018 at 03:51:00PM +0200, Or Gerlitz wrote:
> On Thu, Dec 20, 2018 at 2:35 PM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Wed, Dec 19, 2018 at 04:26:53PM -0800, Jakub Kicinski wrote:
> 
> > > I'm confused, could you rephrase?  How does your work help such devices?
> > > How is tc not suitable for them?
> 
> > There are two HW offload use cases:
> >
> > #1 Policy resides in software: the host CPU sees the initial packets
> >    of a flow and, based on policy, places the flow into hardware via
> >    the nf_flow_table infrastructure. This use case is fine for your
> >    NIC, since you assume the host CPU can cope with running the policy
> >    in software for those few initial packets. However, switches
> >    usually have a small CPU that runs control-plane software only, so
> >    there we _cannot_ use this approach.
> >
> > #2 Policy resides in hardware: for switches with a small CPU, the ACL
> >    is deployed in hardware, and the host CPU runs control-plane
> >    configuration only.
> >
> > This patchset _is not_ related to #1, this patchset _is_ related to #2.
> 
> confused, isn't this patch set related to connection tracking offloads
> on modern NIC HWs?

This patchset aims to unify the ethtool_rxnfc and tc/cls_flower
representations to simplify the driver codebase, i.e. to have a single
parser populate the HW IR. My immediate future work is to reuse this
new infrastructure to explore #2 for netfilter.
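For readers skimming the thread, what "a single parser" looks like on
the driver side is roughly the following non-buildable sketch. It uses
the match and action helpers this series introduces
(flow_rule_match_key(), flow_rule_match_ipv4_addrs(),
flow_action_for_each); the foo_* driver types and functions are made
up, and the merged API may differ in detail from what was posted.

```c
/* Sketch only: foo_priv, foo_hw_write_ipv4_match() and
 * foo_hw_write_drop() are hypothetical driver internals. */
static int foo_setup_flow_rule(struct foo_priv *priv,
			       struct flow_rule *rule)
{
	const struct flow_action_entry *act;
	int i;

	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_IPV4_ADDRS)) {
		struct flow_match_ipv4_addrs match;

		flow_rule_match_ipv4_addrs(rule, &match);
		/* Same code path regardless of whether the rule came
		 * from ethtool_rxnfc or from tc/cls_flower. */
		foo_hw_write_ipv4_match(priv, match.key->dst,
					match.mask->dst);
	}

	flow_action_for_each(i, act, &rule->action) {
		switch (act->id) {
		case FLOW_ACTION_DROP:
			foo_hw_write_drop(priv);
			break;
		default:
			return -EOPNOTSUPP;
		}
	}
	return 0;
}
```

The frontend-specific translation into struct flow_rule happens once in
core code, so the driver carries exactly one such parser.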