mbox series

[net-next,ct-offload,v3,00/15] Introduce connection tracking offload

Message ID 1583937238-21511-1-git-send-email-paulb@mellanox.com
Headers show
Series Introduce connection tracking offload | expand

Message

Paul Blakey March 11, 2020, 2:33 p.m. UTC
Background
----------

The connection tracking action provides the ability to associate connection state to a packet.
The connection state may be used for stateful packet processing such as stateful firewalls
and NAT operations.

Connection tracking in TC SW
----------------------------

The CT state may be matched only after the CT action is performed.
As such, CT use cases are commonly implemented using multiple chains.
Consider the following TC filters, as an example:
1. tc filter add dev ens1f0_0 ingress prio 1 chain 0 proto ip flower \
    src_mac 24:8a:07:a5:28:01 ct_state -trk \
    action ct \
    pipe action goto chain 2
       
2. tc filter add dev ens1f0_0 ingress prio 1 chain 2 proto ip flower \
    ct_state +trk+new \
    action ct commit \
    pipe action tunnel_key set \
        src_ip 0.0.0.0 \
        dst_ip 7.7.7.8 \
        id 98 \
        dst_port 4789 \
    action mirred egress redirect dev vxlan0
       
3. tc filter add dev ens1f0_0 ingress prio 1 chain 2 proto ip flower \
    ct_state +trk+est \
    action tunnel_key set \
        src_ip 0.0.0.0 \
        dst_ip 7.7.7.8 \
        id 98 \
        dst_port 4789 \
    action mirred egress redirect dev vxlan0
       
Filter #1 (chain 0) decides, after initial packet classification, to send the packet to the
connection tracking module (ct action).
Once the ct_state is initialized by the CT action the packet processing continues on chain 2.

Chain 2 classifies the packet based on the ct_state.
Filter #2 matches on the +trk+new CT state while filter #3 matches on the +trk+est ct_state.

MLX5 Connection tracking HW offload - MLX5 driver patches
------------------------------

The MLX5 hardware model aligns with the software model by realizing a multi-table
architecture. In SW the TC CT action sets the CT state on the skb. Similarly,
HW sets the CT state on a HW register. Driver gets this CT state while offloading
a tuple with a new ct_metadata action that provides it.

Matches on ct_state are translated to HW register matches.
    
TC filter with CT action broken to two rules, a pre_ct rule, and a post_ct rule.
pre_ct rule:
   Inserted on the corrosponding tc chain table, matches on original tc match, with
   actions: any pre ct actions, set fte_id, set zone, and goto the ct table.
   The fte_id is a register mapping uniquely identifying this filter.
post_ct_rule:
   Inserted in a post_ct table, matches on the fte_id register mapping, with
   actions: counter + any post ct actions (this is usally 'goto chain X')

post_ct table is a table that all the tuples inserted to the ct table goto, so
if there is a tuple hit, packet will continue from ct table to post_ct table,
after being marked with the CT state (mark/label..)

This design ensures that the rule's actions and counters will be executed only after a CT hit.
HW misses will continue processing in SW from the last chain ID that was processed in hardware.

The following illustrates the HW model:

+-------------------+      +--------------------+    +--------------+
+ pre_ct (tc chain) +----->+ CT (nat or no nat) +--->+ post_ct      +----->
+ original match    +   |  + tuple + zone match + |  + fte_id match +  |
+-------------------+   |  +--------------------+ |  +--------------+  |
                        v                         v                    v
                     set chain miss mapping    set mark             original
                     set fte_id                set label            filter
                     set zone                  set established      actions
                     set tunnel_id             do nat (if needed)
                     do decap

To fill CT table, driver registers a CB for flow offload events, for each new
flow table that is passed to it from offloading ct actions. Once a flow offload
event is triggered on this CB, offload this flow to the hardware CT table.

Established events offload
--------------------------

Currently, act_ct maintains an FT instance per ct zone. Flow table entries
are created, per ct connection, when connections enter an established
state and deleted otherwise. Once an entry is created, the FT assumes
ownership of the entries, and manages their aging. FT is used for software
offload of conntrack. FT entries associate 5-tuples with an action list.

The act_ct changes in this patchset:
Populate the action list with a (new) ct_metadata action, providing the
connection's ct state (zone,mark and label), and mangle actions if NAT
is configured.

Pass the action's flow table instance as ct action entry parameter,
so  when the action is offloaded, the driver may register a callback on
it's block to receive FT flow offload add/del/stats events.

Netilter changes
--------------------------
The netfilter changes export the relevant bits, and add the relevant CBs
to support the above.

Applying this patchset
--------------------------

On top of current net-next ("r8169: simplify getting stats by using netdev_stats_to_stats64"),
pull Saeed's ct-offload branch, from git git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git
and fix the following non trivial conflict in fs_core.c as follows:
#define OFFLOADS_MAX_FT 2
#define OFFLOADS_NUM_PRIOS 2
#define OFFLOADS_MIN_LEVEL (ANCHOR_MIN_LEVEL + OFFLOADS_NUM_PRIOS)

Then apply this patchset.

Changelog:
  v2->v3:
    Added the first two patches needed after rebasing on net-next:
     "net/mlx5: E-Switch, Enable reg c1 loopback when possible"
     "net/mlx5e: en_rep: Create uplink rep root table after eswitch offloads table"

Paul Blakey (15):
  net/mlx5: E-Switch, Enable reg c1 loopback when possible
  net/mlx5e: en_rep: Create uplink rep root table after eswitch offloads table
  netfilter: flowtable: Add API for registering to flow table events
  net/sched: act_ct: Instantiate flow table entry actions
  net/sched: act_ct: Support restoring conntrack info on skbs
  net/sched: act_ct: Support refreshing the flow table entries
  net/sched: act_ct: Enable hardware offload of flow table entires
  net/mlx5: E-Switch, Introduce global tables
  net/mlx5: E-Switch, Add support for offloading rules with no in_port
  net/mlx5: E-Switch, Support getting chain mapping
  flow_offload: Add flow_match_ct to get rule ct match
  net/mlx5e: CT: Introduce connection tracking
  net/mlx5e: CT: Offload established flows
  net/mlx5e: CT: Handle misses after executing CT action
  net/mlx5e: CT: Support clear action

 drivers/net/ethernet/mellanox/mlx5/core/Kconfig    |   10 +
 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |    1 +
 drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c | 1347 ++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.h |  171 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |    1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.h   |    3 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c    |  120 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.h    |    9 +
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  |    6 +
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |   66 +-
 .../mellanox/mlx5/core/eswitch_offloads_chains.c   |   43 +
 .../mellanox/mlx5/core/eswitch_offloads_chains.h   |   13 +
 include/linux/mlx5/eswitch.h                       |    7 +
 include/net/flow_offload.h                         |   13 +
 include/net/netfilter/nf_flow_table.h              |   32 +
 include/net/tc_act/tc_ct.h                         |   17 +
 net/core/flow_offload.c                            |    7 +
 net/netfilter/nf_flow_table_core.c                 |   60 +
 net/netfilter/nf_flow_table_ip.c                   |   15 +-
 net/netfilter/nf_flow_table_offload.c              |   27 +-
 net/sched/act_ct.c                                 |  225 ++++
 net/sched/cls_api.c                                |    1 +
 22 files changed, 2124 insertions(+), 70 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.h

Comments

Marcelo Ricardo Leitner March 11, 2020, 7:13 p.m. UTC | #1
On Wed, Mar 11, 2020 at 04:33:43PM +0200, Paul Blakey wrote:
> Applying this patchset
> --------------------------
> 
> On top of current net-next ("r8169: simplify getting stats by using netdev_stats_to_stats64"),
> pull Saeed's ct-offload branch, from git git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git
> and fix the following non trivial conflict in fs_core.c as follows:
> #define OFFLOADS_MAX_FT 2
> #define OFFLOADS_NUM_PRIOS 2
> #define OFFLOADS_MIN_LEVEL (ANCHOR_MIN_LEVEL + OFFLOADS_NUM_PRIOS)
> 
> Then apply this patchset.

I did this and I couldn't get tc offloading (not ct) to work anymore.
Then I moved to current net-next (the commit you mentioned above), and
got the same thing.

What I can tell so far is that 
perf script | head
       handler11  4415 [009]  1263.438424: probe:mlx5e_configure_flower__return: (ffffffffc094fa80 <- ffffffff93dc510a) arg1=0xffffffffffffffa1

and that's EOPNOTSUPP. Not sure yet where that is coming from.


Btw, it's prooving to be a nice exercise to find out why it is failing
to offload. Perhaps some netdev_err_once() is welcomed.

  Marcelo
Paul Blakey March 11, 2020, 10:27 p.m. UTC | #2
On 11/03/2020 21:13, Marcelo Ricardo Leitner wrote:
> On Wed, Mar 11, 2020 at 04:33:43PM +0200, Paul Blakey wrote:
>> Applying this patchset
>> --------------------------
>>
>> On top of current net-next ("r8169: simplify getting stats by using netdev_stats_to_stats64"),
>> pull Saeed's ct-offload branch, from git git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git
>> and fix the following non trivial conflict in fs_core.c as follows:
>> #define OFFLOADS_MAX_FT 2
>> #define OFFLOADS_NUM_PRIOS 2
>> #define OFFLOADS_MIN_LEVEL (ANCHOR_MIN_LEVEL + OFFLOADS_NUM_PRIOS)
>>
>> Then apply this patchset.
> I did this and I couldn't get tc offloading (not ct) to work anymore.
> Then I moved to current net-next (the commit you mentioned above), and
> got the same thing.
>
> What I can tell so far is that 
> perf script | head
>        handler11  4415 [009]  1263.438424: probe:mlx5e_configure_flower__return: (ffffffffc094fa80 <- ffffffff93dc510a) arg1=0xffffffffffffffa1
>
> and that's EOPNOTSUPP. Not sure yet where that is coming from.

I guess you fail at what this patch "flow_offload: fix allowed types check" is suppose to fix

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=a393daa8993fd7d6c9c33110d5dac08bc0dc2696

That was the last bug that caused what you decribe - all tc rules failed.

It should be before the patch I detailed as seen in git log here:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/log/


Can you check you have it? im not sure if compiling just the driver is enough.

if it is this, it should have fixed it.

if not try skipping flow_action_hw_stats_types_check() calls in mlx5 driver.

there is a netlink error msg most of the cases, try skip_sw or verbose to see.


>
> Btw, it's prooving to be a nice exercise to find out why it is failing
> to offload. Perhaps some netdev_err_once() is welcomed.
>
>   Marcelo
Marcelo Ricardo Leitner March 11, 2020, 10:44 p.m. UTC | #3
On Thu, Mar 12, 2020 at 12:27:37AM +0200, Paul Blakey wrote:
> 
> On 11/03/2020 21:13, Marcelo Ricardo Leitner wrote:
> > On Wed, Mar 11, 2020 at 04:33:43PM +0200, Paul Blakey wrote:
> >> Applying this patchset
> >> --------------------------
> >>
> >> On top of current net-next ("r8169: simplify getting stats by using netdev_stats_to_stats64"),
> >> pull Saeed's ct-offload branch, from git git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git
> >> and fix the following non trivial conflict in fs_core.c as follows:
> >> #define OFFLOADS_MAX_FT 2
> >> #define OFFLOADS_NUM_PRIOS 2
> >> #define OFFLOADS_MIN_LEVEL (ANCHOR_MIN_LEVEL + OFFLOADS_NUM_PRIOS)
> >>
> >> Then apply this patchset.
> > I did this and I couldn't get tc offloading (not ct) to work anymore.
> > Then I moved to current net-next (the commit you mentioned above), and
> > got the same thing.
> >
> > What I can tell so far is that 
> > perf script | head
> >        handler11  4415 [009]  1263.438424: probe:mlx5e_configure_flower__return: (ffffffffc094fa80 <- ffffffff93dc510a) arg1=0xffffffffffffffa1
> >
> > and that's EOPNOTSUPP. Not sure yet where that is coming from.
> 
> I guess you fail at what this patch "flow_offload: fix allowed types check" is suppose to fix
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=a393daa8993fd7d6c9c33110d5dac08bc0dc2696
> 
> That was the last bug that caused what you decribe - all tc rules failed.
> 
> It should be before the patch I detailed as seen in git log here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/log/
> 
> 
> Can you check you have it? im not sure if compiling just the driver is enough.

I had this one, in both tests.

> 
> if it is this, it should have fixed it.
> 
> if not try skipping flow_action_hw_stats_types_check() calls in mlx5 driver.

Ok. This one was my main suspect now after some extra printks. I could
confirm it parse_tc_fdb_actions is returning the error, but not sure
why yet.

> 
> there is a netlink error msg most of the cases, try skip_sw or verbose to see.

Yeah.. that probably would be easier, thanks. I'm using it under OvS
and without skip_sw, got just no logs for it.

> 
> >
> > Btw, it's prooving to be a nice exercise to find out why it is failing
> > to offload. Perhaps some netdev_err_once() is welcomed.
> >
> >   Marcelo
Marcelo Ricardo Leitner March 12, 2020, 12:01 a.m. UTC | #4
On Wed, Mar 11, 2020 at 07:44:15PM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, Mar 12, 2020 at 12:27:37AM +0200, Paul Blakey wrote:
...
> > if not try skipping flow_action_hw_stats_types_check() calls in mlx5 driver.
> 
> Ok. This one was my main suspect now after some extra printks. I could
> confirm it parse_tc_fdb_actions is returning the error, but not sure
> why yet.

Copied the extack in flow_action_mixed_hw_stats_types_check() onto a
printk and logged the if parameters:

@@ -284,6 +284,8 @@ flow_action_mixed_hw_stats_types_check(const struct flow_action *action,

        flow_action_for_each(i, action_entry, action) {
                if (i && action_entry->hw_stats_type != last_hw_stats_type) {
+                       printk("Mixing HW stats types for actions is not supported, %d %d %d\n",
+                              i, action_entry->hw_stats_type, last_hw_stats_type);
                        NL_SET_ERR_MSG_MOD(extack, "Mixing HW stats types for actions is not supported");

[  188.554636] Mixing HW stats types for actions is not supported, 2 0 3

with iproute-next from today loaded with
'iproute2/net-next v2] tc: m_action: introduce support for hw stats
type', I get a dump like:

# ./tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 2 flower chain 0
filter protocol ip pref 2 flower chain 0 handle 0x1
  dst_mac 6a:66:2d:48:92:c2
  src_mac 00:00:00:00:0e:b7
  eth_type ipv4
  ip_proto udp
  ip_ttl 64
  src_port 100
  enc_dst_ip 1.1.1.1
  enc_src_ip 1.1.1.2
  enc_key_id 0
  enc_dst_port 4789
  enc_tos 0
  ip_flags nofrag
  not_in_hw
        action order 1: tunnel_key  unset pipe
         index 4 ref 1 bind 1 installed 2432 sec used 0 sec
        Action statistics:
        Sent 223744 bytes 4864 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
        no_percpu

        action order 2:  pedit action pipe keys 2
         index 4 ref 1 bind 1 installed 2432 sec used 0 sec
         key #0  at ipv4+8: val 3f000000 mask 00ffffff
         key #1  at udp+0: val 13890000 mask 0000ffff
        Action statistics:
        Sent 223744 bytes 4864 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0

        action order 3: csum (iph, udp) action pipe
        index 4 ref 1 bind 1 installed 2432 sec used 0 sec
        Action statistics:
        Sent 223744 bytes 4864 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
        no_percpu

        action order 4: mirred (Egress Redirect to device eth0) stolen
        index 1 ref 1 bind 1 installed 2432 sec used 0 sec
        Action statistics:
        Sent 223744 bytes 4864 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
        cookie 9c7d6b3aa84a32498181d98da0af80d2
        no_percpu

Which is quite interesting as it is not listing any hwstats fields at all.
Considering that due to tcf_action_hw_stats_type_get() and
hw_stats_type initialization in tcf_action_init_1(), the default is
for it to be 3 if not specified (which is then not dumped over netlink
by the kernel).

I'm wondering how it was 0 for i==0,1 and is 3 for i==2.
Paul Blakey March 12, 2020, 9:33 a.m. UTC | #5
On 3/12/2020 2:01 AM, Marcelo Ricardo Leitner wrote:
> On Wed, Mar 11, 2020 at 07:44:15PM -0300, Marcelo Ricardo Leitner wrote:
>> On Thu, Mar 12, 2020 at 12:27:37AM +0200, Paul Blakey wrote:
> ...
>>> if not try skipping flow_action_hw_stats_types_check() calls in mlx5 driver.
>> Ok. This one was my main suspect now after some extra printks. I could
>> confirm it parse_tc_fdb_actions is returning the error, but not sure
>> why yet.
> Copied the extack in flow_action_mixed_hw_stats_types_check() onto a
> printk and logged the if parameters:
>
> @@ -284,6 +284,8 @@ flow_action_mixed_hw_stats_types_check(const struct flow_action *action,
>
>         flow_action_for_each(i, action_entry, action) {
>                 if (i && action_entry->hw_stats_type != last_hw_stats_type) {
> +                       printk("Mixing HW stats types for actions is not supported, %d %d %d\n",
> +                              i, action_entry->hw_stats_type, last_hw_stats_type);
>                         NL_SET_ERR_MSG_MOD(extack, "Mixing HW stats types for actions is not supported");
>
> [  188.554636] Mixing HW stats types for actions is not supported, 2 0 3
>
> with iproute-next from today loaded with
> 'iproute2/net-next v2] tc: m_action: introduce support for hw stats
> type', I get a dump like:
>
> # ./tc -s filter show dev vxlan_sys_4789 ingress
> filter protocol ip pref 2 flower chain 0
> filter protocol ip pref 2 flower chain 0 handle 0x1
>   dst_mac 6a:66:2d:48:92:c2
>   src_mac 00:00:00:00:0e:b7
>   eth_type ipv4
>   ip_proto udp
>   ip_ttl 64
>   src_port 100
>   enc_dst_ip 1.1.1.1
>   enc_src_ip 1.1.1.2
>   enc_key_id 0
>   enc_dst_port 4789
>   enc_tos 0
>   ip_flags nofrag
>   not_in_hw
>         action order 1: tunnel_key  unset pipe
>          index 4 ref 1 bind 1 installed 2432 sec used 0 sec
>         Action statistics:
>         Sent 223744 bytes 4864 pkt (dropped 0, overlimits 0 requeues 0)
>         backlog 0b 0p requeues 0
>         no_percpu
>
>         action order 2:  pedit action pipe keys 2
>          index 4 ref 1 bind 1 installed 2432 sec used 0 sec
>          key #0  at ipv4+8: val 3f000000 mask 00ffffff
>          key #1  at udp+0: val 13890000 mask 0000ffff
>         Action statistics:
>         Sent 223744 bytes 4864 pkt (dropped 0, overlimits 0 requeues 0)
>         backlog 0b 0p requeues 0
>
>         action order 3: csum (iph, udp) action pipe
>         index 4 ref 1 bind 1 installed 2432 sec used 0 sec
>         Action statistics:
>         Sent 223744 bytes 4864 pkt (dropped 0, overlimits 0 requeues 0)
>         backlog 0b 0p requeues 0
>         no_percpu
>
>         action order 4: mirred (Egress Redirect to device eth0) stolen
>         index 1 ref 1 bind 1 installed 2432 sec used 0 sec
>         Action statistics:
>         Sent 223744 bytes 4864 pkt (dropped 0, overlimits 0 requeues 0)
>         backlog 0b 0p requeues 0
>         cookie 9c7d6b3aa84a32498181d98da0af80d2
>         no_percpu
>
> Which is quite interesting as it is not listing any hwstats fields at all.
> Considering that due to tcf_action_hw_stats_type_get() and
> hw_stats_type initialization in tcf_action_init_1(), the default is
> for it to be 3 if not specified (which is then not dumped over netlink
> by the kernel).
>
> I'm wondering how it was 0 for i==0,1 and is 3 for i==2.
There is a current bug with pedit, as it is split to muliple mangle action entries, and the action entry hw_stats field is not
init for all of them. Then we fail on checking all hw_stats are the same ...