[ovs-dev,RFC,v2,0/8] Introduce connection tracking tc offload

Message ID 1562250507-20335-1-git-send-email-paulb@mellanox.com

Message

Paul Blakey July 4, 2019, 2:28 p.m. UTC
Hi,

The following patches add connection tracking offload to tc.

We plan on offloading the datapath rules one-to-one to tc rules on the netdev.
We'll be using the upcoming act_ct tc module, currently under review on netdev, for the datapath ct() action,
and tc chains with the tc goto chain action for the recirc_id() match and recirc() action.
cls_flower will do the matching on skb conntrack metadata for the ct_state matches.

The patchset for act_ct and cls_flower is here: https://lwn.net/Articles/791584/

So datapath OVS connection tracking rules like these:
recirc_id(0),in_port(ens1f0_0),ct_state(-trk),... actions:ct(zone=2),recirc(2)
recirc_id(2),in_port(ens1f0_0),ct_state(+new+trk),ct_mark(0xbb),... actions:ct(commit,zone=2,nat(src=5.5.5.7),mark=0xbb),ens1f0_1
recirc_id(2),in_port(ens1f0_0),ct_state(+est+trk),ct_mark(0xbb),... actions:ct(zone=2,nat),ens1f0_1

recirc_id(0),in_port(ens1f0_1),ct_state(-trk),... actions:ct(zone=2),recirc(1)
recirc_id(1),in_port(ens1f0_1),ct_state(+est+trk),... actions:ct(zone=2,nat),ens1f0_0

Will be translated to these:
$ tc filter add dev ens1f0_0 ingress \
  prio 1 chain 0 proto ip \
  flower ip_proto tcp ct_state -trk \
  action ct zone 2 pipe \
  action goto chain 2
$ tc filter add dev ens1f0_0 ingress \
  prio 1 chain 2 proto ip \
  flower ct_state +trk+new \
  action ct zone 2 commit mark 0xbb nat src addr 5.5.5.7 pipe \
  action mirred egress redirect dev ens1f0_1
$ tc filter add dev ens1f0_0 ingress \
  prio 1 chain 2 proto ip \
  flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
  action ct nat pipe \
  action mirred egress redirect dev ens1f0_1

$ tc filter add dev ens1f0_1 ingress \
  prio 1 chain 0 proto ip \
  flower ip_proto tcp ct_state -trk \
  action ct zone 2 pipe \
  action goto chain 1
$ tc filter add dev ens1f0_1 ingress \
  prio 1 chain 1 proto ip \
  flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
  action ct nat pipe \
  action mirred egress redirect dev ens1f0_0


Changelog:
V1->V2:
    Renamed netdev-tc-offloads to netdev-offload-tc (sorry about double email)

Paul Blakey (8):
  match: Add match_set_ct_zone_masked helper
  compat: Add tc ct action and flower matches defines for older kernels
  tc: Introduce tc_id to specify a tc filter
  netdev-offload-tc: Implement netdev tc flush via tc filter del
  netdev-offload-tc: Add recirculation support via tc chains
  netdev-offload-tc: Add conntrack support
  netdev-offload-tc: Add conntrack label and mark support
  netdev-offload-tc: Add conntrack nat support

 acinclude.m4                 |   6 +-
 include/linux/automake.mk    |   3 +-
 include/linux/pkt_cls.h      |  50 +++-
 include/linux/tc_act/tc_ct.h |  41 +++
 include/openvswitch/match.h  |   1 +
 lib/dpif-netlink.c           |   5 +
 lib/match.c                  |  10 +-
 lib/netdev-linux.c           |   6 +-
 lib/netdev-offload-tc.c      | 595 ++++++++++++++++++++++++++++++-------------
 lib/tc.c                     | 411 ++++++++++++++++++++++++------
 lib/tc.h                     |  75 +++++-
 11 files changed, 921 insertions(+), 282 deletions(-)
 create mode 100644 include/linux/tc_act/tc_ct.h

Comments

Marcelo Ricardo Leitner July 18, 2019, 9:59 p.m. UTC | #1
On Thu, Jul 04, 2019 at 05:28:19PM +0300, Paul Blakey wrote:
> Hi,
> 
> The following patches add connection tracking offload to tc.
> 
> We plan on offloading the datapath rules one-to-one to tc rules on the netdev.
> We'll be using the upcoming act_ct tc module, currently under review on netdev, for the datapath ct() action,
> and tc chains with the tc goto chain action for the recirc_id() match and recirc() action.
> cls_flower will do the matching on skb conntrack metadata for the ct_state matches.
> 
> The patchset for act_ct and cls_flower is here: https://lwn.net/Articles/791584/
> 
> So datapath OVS connection tracking rules like these:
> recirc_id(0),in_port(ens1f0_0),ct_state(-trk),... actions:ct(zone=2),recirc(2)
> recirc_id(2),in_port(ens1f0_0),ct_state(+new+trk),ct_mark(0xbb),... actions:ct(commit,zone=2,nat(src=5.5.5.7),mark=0xbb),ens1f0_1
> recirc_id(2),in_port(ens1f0_0),ct_state(+est+trk),ct_mark(0xbb),... actions:ct(zone=2,nat),ens1f0_1
> 
> recirc_id(0),in_port(ens1f0_1),ct_state(-trk),... actions:ct(zone=2),recirc(1)
> recirc_id(1),in_port(ens1f0_1),ct_state(+est+trk),... actions:ct(zone=2,nat),ens1f0_0

Hi Paul,

Sometimes I'm seeing ghosts^Wbogus values with nat:

[root@localhost ~]# ovs-ofctl dump-flows br0
 cookie=0x0, duration=788.595s, table=0, n_packets=3, n_bytes=180, priority=50,ct_state=-trk,tcp,in_port="ns2-veth-ab" actions=ct(table=0,zone=2)
 cookie=0x0, duration=788.584s, table=0, n_packets=0, n_bytes=0, priority=50,ct_state=-trk,tcp,in_port="ns1-veth-ab" actions=ct(table=0,zone=2)
 cookie=0x0, duration=788.589s, table=0, n_packets=3, n_bytes=180, priority=50,ct_state=+new+trk,tcp,in_port="ns2-veth-ab" actions=ct(commit,zone=2,nat(src=192.168.0.30)),output:"ns1-veth-ab"
 cookie=0x0, duration=788.551s, table=0, n_packets=0, n_bytes=0, priority=50,ct_state=+est+trk,tcp,in_port="ns1-veth-ab" actions=ct(zone=2,nat),output:"ns2-veth-ab"
 cookie=0x0, duration=788.546s, table=0, n_packets=0, n_bytes=0, priority=50,ct_state=+est+trk,tcp,in_port="ns2-veth-ab" actions=ct(zone=2,nat),output:"ns1-veth-ab"
 cookie=0x0, duration=788.531s, table=0, n_packets=22, n_bytes=1672, priority=10 actions=NORMAL

[root@localhost ~]# cat /proc/net/nf_conntrack
ipv4     2 tcp      6 26 SYN_SENT src=192.168.0.2 dst=192.168.0.1 sport=41524 dport=5001 [UNREPLIED] src=192.168.0.1 dst=233.185.30.138 sport=5001 dport=41524 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=2 use=2

Note the 'dst=233.185.30.138' where it should have been 192.168.0.30.
Interesting that it is always this address.

Here, it worked:
ipv4     2 tcp      6 58 CLOSE_WAIT src=192.168.0.2 dst=192.168.0.1 sport=41616 dport=5001 src=192.168.0.1 dst=192.168.0.30 sport=5001 dport=41616 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=2 use=2

I cannot reproduce this with direct tc tests, nor with
hw_offload=false.

I'm using:
kernel: 192f0f8e9db7efe4ac98d47f5fa4334e43c1204d + recirc_id patches
iproute2: 0f48f9f46ae83c042d36c1208b0f79966f92a951 + act_ct patches
ovs: c99d14775f78cb38b2109add063f58201ba07652 + this series (including the fixup)
Paul Blakey July 20, 2019, 8:26 a.m. UTC | #2
Hi Marcelo, thanks for reporting this. Can you dump the datapath rules via "ovs-appctl dpctl/dump-flows -m --names"?
Also, running "tc filter show dev ns2-veth-ab ingress" while it happens would show us more details.

I'll try and reproduce it on my end.
How often does this happen, and what is the setup?

Thanks,
Paul.
Marcelo Ricardo Leitner July 23, 2019, 3:42 p.m. UTC | #3
On Sat, Jul 20, 2019 at 08:26:59AM +0000, Paul Blakey wrote:
> Hi Marcelo, thanks for reporting this. Can you dump the datapath rules via "ovs-appctl dpctl/dump-flows -m --names"?
> Also, running "tc filter show dev ns2-veth-ab ingress" while it happens would show us more details.
> 
> I'll try and reproduce it on my end.
> How often does this happen, and what is the setup?

Quite often.
Setup is:
ns1                                                             ns2
--------------.                                     .--------------
              |         v------ br0 ------v         |
 ns1-veth-ba--x----ns1-veth-ab       ns2-veth-ab----x-- ns2-veth-ba
              |                                     |
--------------'                                     '--------------

Seems related to the skb_ext patches... With a kernel with more
debugging enabled, I'm seeing:

[   17.467576] =============================
[   17.468717] WARNING: suspicious RCU usage
[   17.469687] 5.2.0.c1f3d.g83fb7bc8ff16+ #2 Not tainted
[   17.471496] -----------------------------
[   17.472922] net/sched/sch_ingress.c:52 suspicious rcu_dereference_protected() usage!
[   17.474715]
               other info that might help us debug this:

[   17.476219]
               rcu_scheduler_active = 2, debug_locks = 1
[   17.477408] no locks held by ovs-vswitchd/1063.
[   17.478267]
               stack backtrace:
[   17.479545] CPU: 10 PID: 1063 Comm: ovs-vswitchd Not tainted 5.2.0.c1f3d.g83fb7bc8ff16+ #2
[   17.481932] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[   17.483725] Call Trace:
[   17.484179]  dump_stack+0x85/0xc0
[   17.484782]  ingress_tcf_block+0x4c/0x50 [sch_ingress]
[   17.485717]  __tcf_block_find+0x28/0x80
[   17.486419]  tc_new_tfilter+0x1b6/0x960
[   17.487093]  ? tc_del_tfilter+0x720/0x720
[   17.487854]  rtnetlink_rcv_msg+0x389/0x4b0
[   17.488590]  ? netlink_deliver_tap+0x95/0x400
[   17.489343]  ? rtnl_dellink+0x2d0/0x2d0
[   17.490034]  netlink_rcv_skb+0x49/0x110
[   17.490726]  netlink_unicast+0x171/0x200
[   17.491409]  netlink_sendmsg+0x21e/0x3e0
[   17.492096]  sock_sendmsg+0x5e/0x60
[   17.492792]  ___sys_sendmsg+0x2ae/0x330
[   17.493725]  ? ___sys_recvmsg+0x159/0x1f0
[   17.494734]  ? up_write+0x1c/0xc0
[   17.495515]  ? ext4_file_write_iter+0xd1/0x3b0
[   17.496624]  ? find_held_lock+0x2b/0x80
[   17.497362]  ? ksys_write+0xc0/0xe0
[   17.498000]  __sys_sendmsg+0x59/0xa0
[   17.498653]  do_syscall_64+0x5c/0xa0
[   17.499307]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   17.500225] RIP: 0033:0x7f6bd4ab2b45

And I could trigger a crash with this trace:
(crash was on a different run, while the RCU trace above and outputs below are
from the same run)

[  382.670318] BUG: kernel NULL pointer dereference, address: 0000000000000030
[  382.671927] #PF: supervisor read access in kernel mode
[  382.673070] #PF: error_code(0x0000) - not-present page
[  382.674170] PGD 0 P4D 0
[  382.674873] Oops: 0000 [#1] SMP NOPTI
[  382.675753] CPU: 1 PID: 6988 Comm: nc Kdump: loaded Not tainted 5.2.0.c1f3d.g83fb7bc8ff16+ #1
[  382.677350] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[  382.679255] RIP: 0010:tcf_classify+0x104/0x1e0
[  382.680942] Code: 5c 41 5d 41 5e 41 5f c3 48 8b 55 00 be 01 00 00 00 4c 89 ff 48 89 14 24 e8 79 e3 f9 ff 48 8b 14 24 48 85 c0 0f 84 cb 00 00
 00 <48> 8b 4a 30 8b 49 40 89 08 41 8d 45 01 41 83 fd 03 7e ac e8 84 37
[  382.684792] RSP: 0018:ffffbd19000f4dd0 EFLAGS: 00010282
[  382.685746] RAX: ffff9e04772b0008 RBX: ffff9e0479088360 RCX: 0000000000000001
[  382.686929] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9e047ad9a140
[  382.688144] RBP: ffffbd19000f4e40 R08: ffff9e047b870b60 R09: 0000000000000002
[  382.689347] R10: ffff9e04550ccd68 R11: ffff9e0476050a00 R12: 0000000000000001
[  382.690534] R13: 0000000000000000 R14: ffff9e0479088360 R15: ffff9e0479229ce0
[  382.691730] FS:  00007efd4d75bb80(0000) GS:ffff9e047b840000(0000) knlGS:0000000000000000
[  382.693033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  382.694061] CR2: 0000000000000030 CR3: 0000000163abc000 CR4: 00000000003406e0
[  382.695283] Call Trace:
[  382.695962]  <IRQ>
[  382.696573]  __netif_receive_skb_core+0x3c0/0xcf0
[  382.697492]  ? reweight_entity+0x15b/0x1a0
[  382.698340]  __netif_receive_skb_one_core+0x37/0x90
[  382.699274]  process_backlog+0x9c/0x150
[  382.700199]  net_rx_action+0xff/0x350
[  382.701093]  __do_softirq+0xee/0x2ff
[  382.701976]  do_softirq_own_stack+0x2a/0x40
[  382.702948]  </IRQ>
[  382.703626]  do_softirq.part.0+0x41/0x50
[  382.704556]  __local_bh_enable_ip+0x4b/0x50
[  382.705536]  ip_finish_output2+0x1a9/0x580
[  382.706518]  ip_output+0x71/0xf0
[  382.707392]  ? __ip_finish_output+0x1c0/0x1c0
[  382.708422]  __ip_queue_xmit+0x16c/0x420
[  382.709423]  ? __kmalloc_reserve.isra.0+0x2d/0x70
[  382.711184]  __tcp_transmit_skb+0x52c/0xab0
[  382.712582]  tcp_connect+0xaed/0xda0
[  382.713403]  ? kvm_clock_get_cycles+0xd/0x10
[  382.714315]  tcp_v4_connect+0x3f1/0x500
[  382.715156]  __inet_stream_connect+0xc5/0x360
[  382.716070]  inet_stream_connect+0x37/0x50
[  382.716947]  __sys_connect+0xd6/0x100
[  382.717766]  ? do_fcntl+0x1a5/0x650
[  382.718588]  ? __x64_sys_fcntl+0x84/0xb0
[  382.719450]  __x64_sys_connect+0x16/0x20
[  382.720310]  do_syscall_64+0x5b/0x180
[  382.721146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Then on the test:
+ ovs-appctl dpctl/dump-conntrack
tcp,orig=(src=192.168.0.2,dst=192.168.0.1,sport=35948,dport=5001),reply=(src=192.168.0.1,dst=233.185.30.138,sport=5001,dport=35948),zone=2,protoinfo=(state=SYN_SENT)
tcp,orig=(src=192.168.0.2,dst=192.168.0.1,sport=35946,dport=5001),reply=(src=192.168.0.1,dst=192.168.0.2,sport=5001,dport=35946),protoinfo=(state=TIME_WAIT)

+ ovs-appctl dpctl/dump-flows -m --names
ufid:a7998e58-919b-4d05-9cf5-4ab05527bf91, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x20),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns2-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=6,tos=0/0,ttl=0/0,frag=no),tcp(src=0/0,dst=0/0), packets:3, bytes:180, used:2.890s, dp:tc, actions:ct(zone=2),recirc(0x1)
ufid:89ec315b-9068-4c2c-a795-3ea78c406616, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns2-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=12:1d:05:89:52:0b,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:1, bytes:56, used:1.480s, dp:tc, actions:br0,ns1-veth-ab
ufid:7becab91-3e24-4ba5-abe0-39b7650243ff, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns2-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=12:1d:05:89:52:0b,dst=33:33:ff:89:52:0b),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:0, bytes:0, used:6.010s, dp:tc, actions:br0,ns1-veth-ab
ufid:868ba402-078f-4220-9865-6c58ee6e02bb, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns2-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=12:1d:05:89:52:0b,dst=33:33:00:00:00:16),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:0, bytes:0, used:6.010s, dp:tc, actions:br0,ns1-veth-ab
ufid:10757d31-40c8-4359-8735-b40e2e16b07c, skb_priority(0/0),skb_mark(0/0),ct_state(0x21/0x21),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0x1),dp_hash(0/0),in_port(ns2-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=6,tos=0/0,ttl=0/0,frag=no),tcp(src=0/0,dst=0/0), packets:3, bytes:180, used:2.890s, dp:tc, actions:ct(commit,zone=2,nat(src=192.168.0.30)),ns1-veth-ab
ufid:9e322377-a849-45d5-8c9e-719bb60bf76a, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns1-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=0e:78:05:f3:ea:8c,dst=33:33:ff:f3:ea:8c),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:0, bytes:0, used:6.010s, dp:tc, actions:br0,ns2-veth-ab
ufid:1c0c9b46-ceb1-4c6f-95ae-f44169f6395b, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns1-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=0e:78:05:f3:ea:8c,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:1, bytes:56, used:2.500s, dp:tc, actions:br0,ns2-veth-ab
ufid:6e12004f-a24f-42d5-b946-d213491f93c6, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns1-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=0e:78:05:f3:ea:8c,dst=33:33:00:00:00:16),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:0, bytes:0, used:6.000s, dp:tc, actions:br0,ns2-veth-ab
+ tc filter show dev ns2-veth-ab ingress
filter protocol ip pref 3 flower chain 0 
filter protocol ip pref 3 flower chain 0 handle 0x1 
  eth_type ipv4
  ip_proto tcp
  ip_flags nofrag
  ct_state -trk
  not_in_hw
        action order 1: ct zone 2 pipe
         index 1 ref 1 bind 1
        cookie 588e99a7054d9b91b04af59c91bf2755

        action order 2: gact action goto chain 1
         random type none pass val 0
         index 1 ref 1 bind 1
        cookie 588e99a7054d9b91b04af59c91bf2755

filter protocol ipv6 pref 4 flower chain 0 
filter protocol ipv6 pref 4 flower chain 0 handle 0x1 
  dst_mac 33:33:00:00:00:02
  src_mac 12:1d:05:89:52:0b
  eth_type ipv6
  ip_flags nofrag
  ct_state -trk-new-est
  not_in_hw
        action order 1: skbedit  ptype host pipe
         index 1 ref 1 bind 1

        action order 2: mirred (Ingress Mirror to device br0) pipe
        index 2 ref 1 bind 1
        cookie 5b31ec892c4c6890a73e95a71666408c

        action order 3: mirred (Egress Redirect to device ns1-veth-ab) stolen
        index 3 ref 1 bind 1
        cookie 5b31ec892c4c6890a73e95a71666408c

filter protocol ipv6 pref 4 flower chain 0 handle 0x2 
  dst_mac 33:33:ff:89:52:0b
  src_mac 12:1d:05:89:52:0b
  eth_type ipv6
  ip_flags nofrag
  ct_state -trk-new-est
  not_in_hw
        action order 1: skbedit  ptype host pipe
         index 2 ref 1 bind 1

        action order 2: mirred (Ingress Mirror to device br0) pipe
        index 4 ref 1 bind 1
        cookie 91abec7ba54b243eb739e0abff430265

        action order 3: mirred (Egress Redirect to device ns1-veth-ab) stolen
        index 5 ref 1 bind 1
        cookie 91abec7ba54b243eb739e0abff430265

filter protocol ipv6 pref 4 flower chain 0 handle 0x3
  dst_mac 33:33:00:00:00:16
  src_mac 12:1d:05:89:52:0b
  eth_type ipv6
  ip_flags nofrag
  ct_state -trk-new-est
  not_in_hw
        action order 1: skbedit  ptype host pipe
         index 3 ref 1 bind 1

        action order 2: mirred (Ingress Mirror to device br0) pipe
        index 6 ref 1 bind 1
        cookie 02a48b8620428f07586c6598bb026eee

        action order 3: mirred (Egress Redirect to device ns1-veth-ab) stolen
        index 7 ref 1 bind 1
        cookie 02a48b8620428f07586c6598bb026eee

filter protocol ip pref 3 flower chain 1
filter protocol ip pref 3 flower chain 1 handle 0x1
  eth_type ipv4
  ip_proto tcp
  ip_flags nofrag
  ct_state +trk+new
  not_in_hw
        action order 1: ct commit zone 2 nat src addr 192.168.0.30-0.0.0.0 pipe
         index 2 ref 1 bind 1
        cookie 317d75105943c8400eb435877cb0162e

        action order 2: mirred (Egress Redirect to device ns1-veth-ab) stolen
        index 1 ref 1 bind 1
        cookie 317d75105943c8400eb435877cb0162e

Test script is attached.
Paul Blakey July 23, 2019, 4:18 p.m. UTC | #4
On 7/23/2019 6:42 PM, Marcelo Ricardo Leitner wrote:

> On Sat, Jul 20, 2019 at 08:26:59AM +0000, Paul Blakey wrote:
>> Hi Marcelo, thanks for reporting this. Can you dump the datapath rules via "ovs-appctl dpctl/dump-flows -m --names"?
>> Also, running "tc filter show dev ns2-veth-ab ingress" while it happens would show us more details.
>>
>> I'll try and reproduce it on my end.
>> How often does this happen, and what is the setup?
> Quite often.
> Setup is:
> ns1                                                             ns2
> --------------.                                     .--------------
>                |         v------ br0 ------v         |
>   ns1-veth-ba--x----ns1-veth-ab       ns2-veth-ab----x-- ns2-veth-ba
>                |                                     |
> --------------'                                     '--------------
>   [...]


Thanks, I'll try and reproduce it.
Paul Blakey July 24, 2019, 3:44 p.m. UTC | #5
On 7/19/2019 12:59 AM, Marcelo Ricardo Leitner wrote:
>
> Hi Paul,
>
> Sometimes I'm seeing ghosts^Wbogus values with nat:
>
> [root@localhost ~]# ovs-ofctl dump-flows br0
>   cookie=0x0, duration=788.595s, table=0, n_packets=3, n_bytes=180, priority=50,ct_state=-trk,tcp,in_port="ns2-veth-ab" actions=ct(table=0,zone=2)
>   cookie=0x0, duration=788.584s, table=0, n_packets=0, n_bytes=0, priority=50,ct_state=-trk,tcp,in_port="ns1-veth-ab" actions=ct(table=0,zone=2)
>   cookie=0x0, duration=788.589s, table=0, n_packets=3, n_bytes=180, priority=50,ct_state=+new+trk,tcp,in_port="ns2-veth-ab" actions=ct(commit,zone=2,nat(src=192.168.0.30)),output:"ns1-veth-ab"
>   cookie=0x0, duration=788.551s, table=0, n_packets=0, n_bytes=0, priority=50,ct_state=+est+trk,tcp,in_port="ns1-veth-ab" actions=ct(zone=2,nat),output:"ns2-veth-ab"
>   cookie=0x0, duration=788.546s, table=0, n_packets=0, n_bytes=0, priority=50,ct_state=+est+trk,tcp,in_port="ns2-veth-ab" actions=ct(zone=2,nat),output:"ns1-veth-ab"
>   cookie=0x0, duration=788.531s, table=0, n_packets=22, n_bytes=1672, priority=10 actions=NORMAL
>
> [root@localhost ~]# cat /proc/net/nf_conntrack
> ipv4     2 tcp      6 26 SYN_SENT src=192.168.0.2 dst=192.168.0.1 sport=41524 dport=5001 [UNREPLIED] src=192.168.0.1 dst=233.185.30.138 sport=5001 dport=41524 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=2 use=2
>
> Note the 'dst=233.185.30.138' where it should have been 192.168.0.30.
> Interesting that it is always this address.
>
> Here, it worked:
> ipv4     2 tcp      6 58 CLOSE_WAIT src=192.168.0.2 dst=192.168.0.1 sport=41616 dport=5001 src=192.168.0.1 dst=192.168.0.30 sport=5001 dport=41616 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=2 use=2
>
> I cannot reproduce this with direct tc tests, nor with
> hw_offload=false.
>
> I'm using:
> kernel: 192f0f8e9db7efe4ac98d47f5fa4334e43c1204d + recirc_id patches
> iproute2: 0f48f9f46ae83c042d36c1208b0f79966f92a951 + act_ct patches
> ovs: c99d14775f78cb38b2109add063f58201ba07652 + this series (including the fixup)


Hi Marcelo,

Thanks for the test, I reproduced it on my end.

The bug is that we supply a max range of 0.0.0.0 to act_ct if a max
range isn't specified in the OpenFlow rule. We then get a min range of
192.168.0.30 and a max of 0.0.0.0, and act_ct uses this 0.0.0.0 as the
max; the IP you see is the result of some internal conntrack nat logic,
and is dependent on the previous test.

This can be fixed in this RFC as below, or in act_ct by interpreting a
0.0.0.0 max the same as not specified and using min instead.

Try the below fix,

Thanks,

Paul.

-----

diff --git a/lib/tc.c b/lib/tc.c
index eacfe4a..c2658d1 100644
--- a/lib/tc.c
+++ b/lib/tc.c
@@ -1948,13 +1948,17 @@ nl_msg_put_act_ct(struct ofpbuf *request, struct tc_action *action)
                 if (action->ct.range.ip_family == AF_INET) {
                     nl_msg_put_be32(request, TCA_CT_NAT_IPV4_MIN,
                                     action->ct.range.min_addr.ipv4);
-                    nl_msg_put_be32(request, TCA_CT_NAT_IPV4_MAX,
+                    if (action->ct.range.max_addr.ipv4)
+                        nl_msg_put_be32(request, TCA_CT_NAT_IPV4_MAX,
                                         action->ct.range.max_addr.ipv4);
                 } else if (action->ct.range.ip_family == AF_INET6) {
                     nl_msg_put_in6_addr(request, TCA_CT_NAT_IPV6_MIN,
                                         &action->ct.range.min_addr.ipv6);
-                    nl_msg_put_in6_addr(request, TCA_CT_NAT_IPV6_MAX,
-                                        &action->ct.range.max_addr.ipv6);
+                    if (!is_all_zeros(&action->ct.range.max_addr.ipv6,
+                                      sizeof(action->ct.range.max_addr.ipv6))) {
+                        nl_msg_put_in6_addr(request, TCA_CT_NAT_IPV6_MAX,
+                                            &action->ct.range.max_addr.ipv6);
+                    }
                 }
 
                 if (action->ct.range.min_port) {
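
For completeness, the act_ct-side alternative mentioned above would
amount to something like this in the kernel's nat-range parsing (a
hypothetical sketch, not the actual act_ct code; the exact attribute
handling may differ):

    /* Hypothetical sketch (not actual act_ct code): when parsing the nat
     * range attributes, treat a missing or all-zero max address as "same
     * as min" instead of taking 0.0.0.0 literally as the top of range. */
    if (tb[TCA_CT_NAT_IPV4_MIN]) {
            range->min_addr.ip = nla_get_in_addr(tb[TCA_CT_NAT_IPV4_MIN]);
            range->max_addr.ip = tb[TCA_CT_NAT_IPV4_MAX]
                                 ? nla_get_in_addr(tb[TCA_CT_NAT_IPV4_MAX])
                                 : range->min_addr.ip;
            if (!range->max_addr.ip)   /* 0.0.0.0 max: fall back to min */
                    range->max_addr.ip = range->min_addr.ip;
            range->flags |= NF_NAT_RANGE_MAP_IPS;
    }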
Paul Blakey July 29, 2019, 1:05 p.m. UTC | #6
On 7/23/2019 6:42 PM, Marcelo Ricardo Leitner wrote:

> On Sat, Jul 20, 2019 at 08:26:59AM +0000, Paul Blakey wrote:
>> Hi Marcelo, thanks for reporting this. Can you dump the datapath rules via "ovs-appctl dpctl/dump-flows -m --names"?
>> Also, running "tc filter show dev ns2-veth-ab ingress" while it happens would show us more details.
>>
>> I'll try and reproduce it on my end.
>> How often does this happen, and what is the setup?
> Quite often.
> Setup is:
> ns1                                                             ns2
> --------------.                                     .--------------
>                |         v------ br0 ------v         |
>   ns1-veth-ba--x----ns1-veth-ab       ns2-veth-ab----x-- ns2-veth-ba
>                |                                     |
> --------------'                                     '--------------
>
> Seems related to the skb_ext patches... With a kernel with more
> debugging enabled, I'm seeing:
>
> [...]
>
> And I could trigger a crash with this trace:
> (crash was on a different run, while the RCU trace above and outputs below are
> from the same run)
>
> [...]
>

Hi,

I reproduced both traces.

The first rcu trace is because we dereference the qdisc block without
rtnl locked (flower allows running unlocked), and the second trace is
because res->goto_tp can be NULL if the last prio of a chain we goto
was just deleted, and we dereference it unconditionally when we set the
skb extension in tcf_classify.

This can happen when a filter is deleted while there is still traffic;
I have a test for it and will try it against the fixes.
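
For the second trace, the fix needs a guard of roughly this shape in
tcf_classify() (a sketch based on the description above, with assumed
field names; this is not the actual patch):

    /* Sketch only, field names assumed: res->goto_tp may be NULL if the
     * last filter of the goto target chain was concurrently deleted, so
     * don't dereference it unconditionally when filling the extension. */
    if (res->goto_tp) {
            struct tc_skb_ext *ext = skb_ext_add(skb, TC_SKB_EXT);

            if (unlikely(!ext))
                    return TC_ACT_SHOT;
            ext->chain = res->goto_tp->chain->index;
    }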

I'll post fixed patches (Patch by VladB) to the tc skb extension to fix 
both of the above, and ovs userspace RFC v2 to fix the nat issue.

Thanks,

Paul.


Ilya Maximets July 29, 2019, 1:16 p.m. UTC | #7
Hi.

This patch-set has some checkpatch issues. It'll be good if you
could fix them before submitting the next version.

Best regards, Ilya Maximets.

On 04.07.2019 17:28, Paul Blakey wrote:
> [...]
Paul Blakey July 30, 2019, 1:40 p.m. UTC | #8
On 7/29/2019 4:16 PM, Ilya Maximets wrote:

> Hi.
>
> This patch-set has some checkpatch issues. It'll be good if you
> could fix them before submitting the next version.
>
> Best regards, Ilya Maximets.
Sure, thanks.
Marcelo Ricardo Leitner July 30, 2019, 2:08 p.m. UTC | #9
On Mon, Jul 29, 2019 at 01:05:50PM +0000, Paul Blakey wrote:
> Hi,
> 
> I reproduced both traces.
> 
> The first rcu trace is because we dereference the qdisc block without
> rtnl locked (flower allows running unlocked), and the second trace is
> because res->goto_tp can be NULL if the last prio of a chain we goto
> was just deleted, and we dereference it unconditionally when we set the
> skb extension in tcf_classify.
> 
> This can happen when a filter is deleted while there is still traffic;
> I have a test for it and will try it against the fixes.
> 
> I'll post fixed patches (Patch by VladB) to the tc skb extension to fix
> both of the above, and ovs userspace RFC v2 to fix the nat issue.

Cool, thanks Paul!

  Marcelo