Message ID: 1562250507-20335-1-git-send-email-paulb@mellanox.com
Series: Introduce connection tracking tc offload
On Thu, Jul 04, 2019 at 05:28:19PM +0300, Paul Blakey wrote:
> Hi,
>
> The following patches add connection tracking offload to tc.
>
> We plan on offloading the datapath rules to netdev one to one to tc rules.
> We'll be using the upcoming act_ct tc module, currently under review in
> netdev, for the datapath ct() action; tc chains and the tc goto chain
> action for the recirc_id() match and recirc() action; and cls_flower will
> do the matching on skb conntrack metadata for the ct_state matches.
>
> The patchset for act_ct and cls_flower is here: https://lwn.net/Articles/791584/
>
> So datapath ovs connection tracking rules:
> recirc_id(0),in_port(ens1f0_0),ct_state(-trk),... actions:ct(zone=2),recirc(2)
> recirc_id(2),in_port(ens1f0_0),ct_state(+new+trk),ct_mark(0xbb),... actions:ct(commit,zone=2,nat(src=5.5.5.7),mark=0xbb),ens1f0_1
> recirc_id(2),in_port(ens1f0_0),ct_state(+est+trk),ct_mark(0xbb),... actions:ct(zone=2,nat),ens1f0_1
>
> recirc_id(1),in_port(ens1f0_1),ct_state(-trk),... actions:ct(zone=2),recirc(1)
> recirc_id(1),in_port(ens1f0_1),ct_state(+est+trk),... actions:ct(zone=2,nat),ens1f0_0

Hi Paul,

Sometimes I'm seeing ghosts^Wbogus values with nat:

[root@localhost ~]# ovs-ofctl dump-flows br0
 cookie=0x0, duration=788.595s, table=0, n_packets=3, n_bytes=180, priority=50,ct_state=-trk,tcp,in_port="ns2-veth-ab" actions=ct(table=0,zone=2)
 cookie=0x0, duration=788.584s, table=0, n_packets=0, n_bytes=0, priority=50,ct_state=-trk,tcp,in_port="ns1-veth-ab" actions=ct(table=0,zone=2)
 cookie=0x0, duration=788.589s, table=0, n_packets=3, n_bytes=180, priority=50,ct_state=+new+trk,tcp,in_port="ns2-veth-ab" actions=ct(commit,zone=2,nat(src=192.168.0.30)),output:"ns1-veth-ab"
 cookie=0x0, duration=788.551s, table=0, n_packets=0, n_bytes=0, priority=50,ct_state=+est+trk,tcp,in_port="ns1-veth-ab" actions=ct(zone=2,nat),output:"ns2-veth-ab"
 cookie=0x0, duration=788.546s, table=0, n_packets=0, n_bytes=0, priority=50,ct_state=+est+trk,tcp,in_port="ns2-veth-ab" actions=ct(zone=2,nat),output:"ns1-veth-ab"
 cookie=0x0, duration=788.531s, table=0, n_packets=22, n_bytes=1672, priority=10 actions=NORMAL

[root@localhost ~]# cat /proc/net/nf_conntrack
ipv4 2 tcp 6 26 SYN_SENT src=192.168.0.2 dst=192.168.0.1 sport=41524 dport=5001 [UNREPLIED] src=192.168.0.1 dst=233.185.30.138 sport=5001 dport=41524 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=2 use=2

Note the 'dst=233.185.30.138' where it should have been 192.168.0.30.
Interesting that it is always this address.

Here, it worked:
ipv4 2 tcp 6 58 CLOSE_WAIT src=192.168.0.2 dst=192.168.0.1 sport=41616 dport=5001 src=192.168.0.1 dst=192.168.0.30 sport=5001 dport=41616 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=2 use=2

I cannot reproduce this with direct tc tests and neither with hw_offload=false.

I'm using:
kernel: 192f0f8e9db7efe4ac98d47f5fa4334e43c1204d + recirc_id patches
iproute2: 0f48f9f46ae83c042d36c1208b0f79966f92a951 + act_ct patches
ovs: c99d14775f78cb38b2109add063f58201ba07652 + this series (including the fixup)
Hi Marcelo,

Thanks for reporting this. Can you dump the datapath rules via
"ovs-appctl dpctl/dump-flows -m --names"?
Also, running "tc filter show dev ns2-veth-ab ingress" while it happened can show us more details.

I'll try and reproduce it on my end.
How often does this happen, and what is the setup?

Thanks,
Paul.
On Sat, Jul 20, 2019 at 08:26:59AM +0000, Paul Blakey wrote:
> Hi Marcelo, thanks for reporting this, can you dump the datapath rules via ovs-appctl dpctl/dump-flows -m --names
> Also running "tc filter show dev ns2-veth-ab ingress" while it happened can show us more details.
>
> I'll try and reproduce it on my end.
> How often does this happen? and what is the setup ?

Quite often.

Setup is:

         ns1                                ns2
  --------------.                    .--------------
  |             v------ br0 ------v                |
  ns1-veth-ba--x----ns1-veth-ab    ns2-veth-ab----x-- ns2-veth-ba
  |                                                |
  --------------'                    '--------------

Seems related to the skb_ext patches.. with a kernel with more debugging stuff enabled, I'm seeing:

[   17.467576] =============================
[   17.468717] WARNING: suspicious RCU usage
[   17.469687] 5.2.0.c1f3d.g83fb7bc8ff16+ #2 Not tainted
[   17.471496] -----------------------------
[   17.472922] net/sched/sch_ingress.c:52 suspicious rcu_dereference_protected() usage!
[   17.474715] other info that might help us debug this:
[   17.476219] rcu_scheduler_active = 2, debug_locks = 1
[   17.477408] no locks held by ovs-vswitchd/1063.
[   17.478267] stack backtrace:
[   17.479545] CPU: 10 PID: 1063 Comm: ovs-vswitchd Not tainted 5.2.0.c1f3d.g83fb7bc8ff16+ #2
[   17.481932] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[   17.483725] Call Trace:
[   17.484179]  dump_stack+0x85/0xc0
[   17.484782]  ingress_tcf_block+0x4c/0x50 [sch_ingress]
[   17.485717]  __tcf_block_find+0x28/0x80
[   17.486419]  tc_new_tfilter+0x1b6/0x960
[   17.487093]  ? tc_del_tfilter+0x720/0x720
[   17.487854]  rtnetlink_rcv_msg+0x389/0x4b0
[   17.488590]  ? netlink_deliver_tap+0x95/0x400
[   17.489343]  ? rtnl_dellink+0x2d0/0x2d0
[   17.490034]  netlink_rcv_skb+0x49/0x110
[   17.490726]  netlink_unicast+0x171/0x200
[   17.491409]  netlink_sendmsg+0x21e/0x3e0
[   17.492096]  sock_sendmsg+0x5e/0x60
[   17.492792]  ___sys_sendmsg+0x2ae/0x330
[   17.493725]  ? ___sys_recvmsg+0x159/0x1f0
[   17.494734]  ? up_write+0x1c/0xc0
[   17.495515]  ? ext4_file_write_iter+0xd1/0x3b0
[   17.496624]  ? find_held_lock+0x2b/0x80
[   17.497362]  ? ksys_write+0xc0/0xe0
[   17.498000]  __sys_sendmsg+0x59/0xa0
[   17.498653]  do_syscall_64+0x5c/0xa0
[   17.499307]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   17.500225] RIP: 0033:0x7f6bd4ab2b45

And I could trigger a crash with this trace:
(crash was on a different run, while the RCU trace above and outputs below are from the same run)

[  382.670318] BUG: kernel NULL pointer dereference, address: 0000000000000030
[  382.671927] #PF: supervisor read access in kernel mode
[  382.673070] #PF: error_code(0x0000) - not-present page
[  382.674170] PGD 0 P4D 0
[  382.674873] Oops: 0000 [#1] SMP NOPTI
[  382.675753] CPU: 1 PID: 6988 Comm: nc Kdump: loaded Not tainted 5.2.0.c1f3d.g83fb7bc8ff16+ #1
[  382.677350] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[  382.679255] RIP: 0010:tcf_classify+0x104/0x1e0
[  382.680942] Code: 5c 41 5d 41 5e 41 5f c3 48 8b 55 00 be 01 00 00 00 4c 89 ff 48 89 14 24 e8 79 e3 f9 ff 48 8b 14 24 48 85 c0 0f 84 cb 00 00 00 <48> 8b 4a 30 8b 49 40 89 08 41 8d 45 01 41 83 fd 03 7e ac e8 84 37
[  382.684792] RSP: 0018:ffffbd19000f4dd0 EFLAGS: 00010282
[  382.685746] RAX: ffff9e04772b0008 RBX: ffff9e0479088360 RCX: 0000000000000001
[  382.686929] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9e047ad9a140
[  382.688144] RBP: ffffbd19000f4e40 R08: ffff9e047b870b60 R09: 0000000000000002
[  382.689347] R10: ffff9e04550ccd68 R11: ffff9e0476050a00 R12: 0000000000000001
[  382.690534] R13: 0000000000000000 R14: ffff9e0479088360 R15: ffff9e0479229ce0
[  382.691730] FS:  00007efd4d75bb80(0000) GS:ffff9e047b840000(0000) knlGS:0000000000000000
[  382.693033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  382.694061] CR2: 0000000000000030 CR3: 0000000163abc000 CR4: 00000000003406e0
[  382.695283] Call Trace:
[  382.695962]  <IRQ>
[  382.696573]  __netif_receive_skb_core+0x3c0/0xcf0
[  382.697492]  ? reweight_entity+0x15b/0x1a0
[  382.698340]  __netif_receive_skb_one_core+0x37/0x90
[  382.699274]  process_backlog+0x9c/0x150
[  382.700199]  net_rx_action+0xff/0x350
[  382.701093]  __do_softirq+0xee/0x2ff
[  382.701976]  do_softirq_own_stack+0x2a/0x40
[  382.702948]  </IRQ>
[  382.703626]  do_softirq.part.0+0x41/0x50
[  382.704556]  __local_bh_enable_ip+0x4b/0x50
[  382.705536]  ip_finish_output2+0x1a9/0x580
[  382.706518]  ip_output+0x71/0xf0
[  382.707392]  ? __ip_finish_output+0x1c0/0x1c0
[  382.708422]  __ip_queue_xmit+0x16c/0x420
[  382.709423]  ? __kmalloc_reserve.isra.0+0x2d/0x70
[  382.711184]  __tcp_transmit_skb+0x52c/0xab0
[  382.712582]  tcp_connect+0xaed/0xda0
[  382.713403]  ? kvm_clock_get_cycles+0xd/0x10
[  382.714315]  tcp_v4_connect+0x3f1/0x500
[  382.715156]  __inet_stream_connect+0xc5/0x360
[  382.716070]  inet_stream_connect+0x37/0x50
[  382.716947]  __sys_connect+0xd6/0x100
[  382.717766]  ? do_fcntl+0x1a5/0x650
[  382.718588]  ? __x64_sys_fcntl+0x84/0xb0
[  382.719450]  __x64_sys_connect+0x16/0x20
[  382.720310]  do_syscall_64+0x5b/0x180
[  382.721146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Then on the test:

+ ovs-appctl dpctl/dump-conntrack
tcp,orig=(src=192.168.0.2,dst=192.168.0.1,sport=35948,dport=5001),reply=(src=192.168.0.1,dst=233.185.30.138,sport=5001,dport=35948),zone=2,protoinfo=(state=SYN_SENT)
tcp,orig=(src=192.168.0.2,dst=192.168.0.1,sport=35946,dport=5001),reply=(src=192.168.0.1,dst=192.168.0.2,sport=5001,dport=35946),protoinfo=(state=TIME_WAIT)

+ ovs-appctl dpctl/dump-flows -m --names
ufid:a7998e58-919b-4d05-9cf5-4ab05527bf91, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x20),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns2-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=6,tos=0/0,ttl=0/0,frag=no),tcp(src=0/0,dst=0/0), packets:3, bytes:180, used:2.890s, dp:tc, actions:ct(zone=2),recirc(0x1)
ufid:89ec315b-9068-4c2c-a795-3ea78c406616, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns2-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=12:1d:05:89:52:0b,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:1, bytes:56, used:1.480s, dp:tc, actions:br0,ns1-veth-ab
ufid:7becab91-3e24-4ba5-abe0-39b7650243ff, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns2-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=12:1d:05:89:52:0b,dst=33:33:ff:89:52:0b),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:0, bytes:0, used:6.010s, dp:tc, actions:br0,ns1-veth-ab
ufid:868ba402-078f-4220-9865-6c58ee6e02bb, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns2-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=12:1d:05:89:52:0b,dst=33:33:00:00:00:16),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:0, bytes:0, used:6.010s, dp:tc, actions:br0,ns1-veth-ab
ufid:10757d31-40c8-4359-8735-b40e2e16b07c, skb_priority(0/0),skb_mark(0/0),ct_state(0x21/0x21),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0x1),dp_hash(0/0),in_port(ns2-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=6,tos=0/0,ttl=0/0,frag=no),tcp(src=0/0,dst=0/0), packets:3, bytes:180, used:2.890s, dp:tc, actions:ct(commit,zone=2,nat(src=192.168.0.30)),ns1-veth-ab
ufid:9e322377-a849-45d5-8c9e-719bb60bf76a, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns1-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=0e:78:05:f3:ea:8c,dst=33:33:ff:f3:ea:8c),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:0, bytes:0, used:6.010s, dp:tc, actions:br0,ns2-veth-ab
ufid:1c0c9b46-ceb1-4c6f-95ae-f44169f6395b, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns1-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=0e:78:05:f3:ea:8c,dst=33:33:00:00:00:02),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:1, bytes:56, used:2.500s, dp:tc, actions:br0,ns2-veth-ab
ufid:6e12004f-a24f-42d5-b946-d213491f93c6, skb_priority(0/0),skb_mark(0/0),ct_state(0/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ns1-veth-ab),packet_type(ns=0/0,id=0/0),eth(src=0e:78:05:f3:ea:8c,dst=33:33:00:00:00:16),eth_type(0x86dd),ipv6(src=::/::,dst=::/::,label=0/0,proto=0/0,tclass=0/0,hlimit=0/0,frag=no), packets:0, bytes:0, used:6.000s, dp:tc, actions:br0,ns2-veth-ab

+ tc filter show dev ns2-veth-ab ingress
filter protocol ip pref 3 flower chain 0
filter protocol ip pref 3 flower chain 0 handle 0x1 eth_type ipv4 ip_proto tcp ip_flags nofrag ct_state -trk not_in_hw
  action order 1: ct zone 2 pipe index 1 ref 1 bind 1 cookie 588e99a7054d9b91b04af59c91bf2755
  action order 2: gact action goto chain 1 random type none pass val 0 index 1 ref 1 bind 1 cookie 588e99a7054d9b91b04af59c91bf2755
filter protocol ipv6 pref 4 flower chain 0
filter protocol ipv6 pref 4 flower chain 0 handle 0x1 dst_mac 33:33:00:00:00:02 src_mac 12:1d:05:89:52:0b eth_type ipv6 ip_flags nofrag ct_state -trk-new-est not_in_hw
  action order 1: skbedit ptype host pipe index 1 ref 1 bind 1
  action order 2: mirred (Ingress Mirror to device br0) pipe index 2 ref 1 bind 1 cookie 5b31ec892c4c6890a73e95a71666408c
  action order 3: mirred (Egress Redirect to device ns1-veth-ab) stolen index 3 ref 1 bind 1 cookie 5b31ec892c4c6890a73e95a71666408c
filter protocol ipv6 pref 4 flower chain 0 handle 0x2 dst_mac 33:33:ff:89:52:0b src_mac 12:1d:05:89:52:0b eth_type ipv6 ip_flags nofrag ct_state -trk-new-est not_in_hw
  action order 1: skbedit ptype host pipe index 2 ref 1 bind 1
  action order 2: mirred (Ingress Mirror to device br0) pipe index 4 ref 1 bind 1 cookie 91abec7ba54b243eb739e0abff430265
  action order 3: mirred (Egress Redirect to device ns1-veth-ab) stolen index 5 ref 1 bind 1 cookie 91abec7ba54b243eb739e0abff430265
filter protocol ipv6 pref 4 flower chain 0 handle 0x3 dst_mac 33:33:00:00:00:16 src_mac 12:1d:05:89:52:0b eth_type ipv6 ip_flags nofrag ct_state -trk-new-est not_in_hw
  action order 1: skbedit ptype host pipe index 3 ref 1 bind 1
  action order 2: mirred (Ingress Mirror to device br0) pipe index 6 ref 1 bind 1 cookie 02a48b8620428f07586c6598bb026eee
  action order 3: mirred (Egress Redirect to device ns1-veth-ab) stolen index 7 ref 1 bind 1 cookie 02a48b8620428f07586c6598bb026eee
filter protocol ip pref 3 flower chain 1
filter protocol ip pref 3 flower chain 1 handle 0x1 eth_type ipv4 ip_proto tcp ip_flags nofrag ct_state +trk+new not_in_hw
  action order 1: ct commit zone 2 nat src addr 192.168.0.30-0.0.0.0 pipe index 2 ref 1 bind 1 cookie 317d75105943c8400eb435877cb0162e
  action order 2: mirred (Egress Redirect to device ns1-veth-ab) stolen index 1 ref 1 bind 1 cookie 317d75105943c8400eb435877cb0162e

Test script is attached.
On 7/23/2019 6:42 PM, Marcelo Ricardo Leitner wrote:
> On Sat, Jul 20, 2019 at 08:26:59AM +0000, Paul Blakey wrote:
>> Hi Marcelo, thanks for reporting this, can you dump the datapath rules via ovs-appctl dpctl/dump-flows -m --names
>> Also running "tc filter show dev ns2-veth-ab ingress" while it happened can show us more details.
>>
>> I'll try and reproduce it on my end.
>> How often does this happen? and what is the setup ?
> Quite often.
> [...]

Thanks, I'll try and reproduce it.
On 7/19/2019 12:59 AM, Marcelo Ricardo Leitner wrote:
>
> Hi Paul,
>
> Sometimes I'm seeing ghosts^Wbogus values with nat:
> [...]
> Note the 'dst=233.185.30.138' where it should have been 192.168.0.30.
> Interesting that it is always this address.
> [...]
> I cannot reproduce this with direct tc tests and neither with
> hw_offload=false.
>
> I'm using:
> kernel: 192f0f8e9db7efe4ac98d47f5fa4334e43c1204d + recirc_id patches
> iproute2: 0f48f9f46ae83c042d36c1208b0f79966f92a951 + act_ct patches
> ovs: c99d14775f78cb38b2109add063f58201ba07652 + this series (including the fixup)

Hi Marcelo,

Thanks for the test, I reproduced it on my end.

The bug is that we supply a max range of 0.0.0.0 to act_ct if a max range
isn't specified in the OpenFlow action. Then we get min range 192.168.0.30
and max 0.0.0.0. act_ct uses this 0.0.0.0 as the max, and the IP you see is
the result of some internal conntrack NAT logic that is dependent on the
previous test.

This can be fixed in this RFC as below, or in act_ct, by interpreting
0.0.0.0 for max the same as not specified and using min instead.

Try the below fix,
Thanks,
Paul.

-----
diff --git a/lib/tc.c b/lib/tc.c
index eacfe4a..c2658d1 100644
--- a/lib/tc.c
+++ b/lib/tc.c
@@ -1948,13 +1948,17 @@ nl_msg_put_act_ct(struct ofpbuf *request, struct tc_action *action)
         if (action->ct.range.ip_family == AF_INET) {
             nl_msg_put_be32(request, TCA_CT_NAT_IPV4_MIN,
                             action->ct.range.min_addr.ipv4);
-            nl_msg_put_be32(request, TCA_CT_NAT_IPV4_MAX,
+            if (action->ct.range.max_addr.ipv4)
+                nl_msg_put_be32(request, TCA_CT_NAT_IPV4_MAX,
                             action->ct.range.max_addr.ipv4);
         } else if (action->ct.range.ip_family == AF_INET6) {
             nl_msg_put_in6_addr(request, TCA_CT_NAT_IPV6_MIN,
                                 &action->ct.range.min_addr.ipv6);
-            nl_msg_put_in6_addr(request, TCA_CT_NAT_IPV6_MAX,
-                                &action->ct.range.max_addr.ipv6);
+            if (!is_all_zeros(&action->ct.range.max_addr.ipv6,
+                              sizeof(action->ct.range.max_addr.ipv6))) {
+                nl_msg_put_in6_addr(request, TCA_CT_NAT_IPV6_MAX,
+                                    &action->ct.range.max_addr.ipv6);
+            }
         }

         if (action->ct.range.min_port) {
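The bug Paul describes can be modeled in a few lines. The sketch below is a hypothetical userspace illustration, not OVS or kernel code (names like `encode_nat` and `struct nat_range` are invented): the fixed encoder omits the MAX attribute when no upper bound was given, and the consumer treats an absent MAX as "max = min", so the range degenerates to a single address instead of [min, 0.0.0.0].

```c
/* Toy model of the NAT range encoding bug (hypothetical names; the real
 * code lives in OVS lib/tc.c and the kernel's act_ct). */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct nat_range {
    uint32_t min_addr;   /* IPv4 address, host byte order for simplicity */
    uint32_t max_addr;
    bool max_present;    /* was a TCA_CT_NAT_IPV4_MAX attribute emitted? */
};

/* Fixed encoder: emit MAX only when an upper bound was actually given.
 * The buggy version emitted MAX unconditionally, so an unspecified
 * bound reached act_ct as 0.0.0.0. */
static struct nat_range encode_nat(uint32_t min_addr, uint32_t max_addr)
{
    struct nat_range r = { .min_addr = min_addr };
    if (max_addr != 0) {
        r.max_addr = max_addr;
        r.max_present = true;
    }
    return r;
}

/* Consumer side: when MAX is absent, default it to MIN, i.e. NAT to
 * exactly one address. */
static uint32_t effective_max(struct nat_range r)
{
    return r.max_present ? r.max_addr : r.min_addr;
}
```

With this, `nat(src=192.168.0.30)` (0xC0A8001E) and no upper bound yields the range [192.168.0.30, 192.168.0.30] rather than [192.168.0.30, 0.0.0.0].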
On 7/23/2019 6:42 PM, Marcelo Ricardo Leitner wrote:
> On Sat, Jul 20, 2019 at 08:26:59AM +0000, Paul Blakey wrote:
>> Hi Marcelo, thanks for reporting this, can you dump the datapath rules via ovs-appctl dpctl/dump-flows -m --names
>> Also running "tc filter show dev ns2-veth-ab ingress" while it happened can show us more details.
>>
>> I'll try and reproduce it on my end.
>> How often does this happen? and what is the setup ?
> Quite often.
>
> Seems related to the skb_ext patches.. with a kernel with more
> debugging stuff enabled, I'm seeing:
> [...]

Hi,

I reproduced both traces.

The first RCU trace is because we dereference the qdisc block without rtnl
locked (flower allows it unlocked), and the second trace is because
res->goto_tp can be NULL if the last prio of a chain we goto was just
deleted, and we try to dereference it unconditionally when we set the skb
extension in tcf_classify.

This can happen when a filter is deleted while there is still traffic. I
have a test for it and will try it on the fixes.
I'll post fixed patches (patch by VladB) to the tc skb extension to fix
both of the above, and an OVS userspace RFC v2 to fix the NAT issue.

Thanks,
Paul.
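The goto_tp crash Paul describes boils down to a missing NULL check. Below is a simplified userspace model, not the actual tcf_classify() kernel code (struct and function names are only illustrative): if the chain that a goto chain action targets has just lost its last filter, res->goto_tp is NULL and must be checked before the tc skb extension is filled from it.

```c
/* Hypothetical, simplified model of the NULL dereference in
 * tcf_classify() -- not the real kernel code. */
#include <assert.h>
#include <stddef.h>

struct tcf_proto  { unsigned int chain_index; };
struct tcf_result { struct tcf_proto *goto_tp; };
struct tc_skb_ext { unsigned int chain; };

#define TC_ACT_SHOT (-1)
#define TC_ACT_OK   0

/* Guarded version: bail out instead of dereferencing NULL. */
static int classify_set_ext(const struct tcf_result *res,
                            struct tc_skb_ext *ext)
{
    if (!res->goto_tp)      /* target chain was emptied under traffic */
        return TC_ACT_SHOT;
    ext->chain = res->goto_tp->chain_index;
    return TC_ACT_OK;
}

/* Small demos of the two cases. */
static int demo_live_chain(void)
{
    struct tcf_proto tp = { .chain_index = 2 };
    struct tcf_result res = { .goto_tp = &tp };
    struct tc_skb_ext ext = { 0 };
    return classify_set_ext(&res, &ext) == TC_ACT_OK && ext.chain == 2;
}

static int demo_deleted_chain(void)
{
    struct tcf_result res = { .goto_tp = NULL };
    struct tc_skb_ext ext = { 0 };
    /* The unguarded code would crash here (RIP: tcf_classify in the oops). */
    return classify_set_ext(&res, &ext) == TC_ACT_SHOT;
}
```

This matches the race Paul outlines: the filter delete and the packet classifying concurrently, with the goto chain target disappearing between lookup and dereference.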
Hi.

This patch-set has some checkpatch issues. It'll be good if you could fix
them before submitting the next version.

Best regards, Ilya Maximets.

On 04.07.2019 17:28, Paul Blakey wrote:
> Hi,
>
> The following patches add connection tracking offload to tc.
> [...]
> So datapath ovs connection tracking rules:
> [...]
>
> Will be translated to these:
> $ tc filter add dev ens1f0_0 ingress \
>     prio 1 chain 0 proto ip \
>     flower ip_proto tcp ct_state -trk \
>     action ct zone 2 pipe \
>     action goto chain 2
> $ tc filter add dev ens1f0_0 ingress \
>     prio 1 chain 2 proto ip \
>     flower ct_state +trk+new \
>     action ct zone 2 commit mark 0xbb nat src addr 5.5.5.7 pipe \
>     action mirred egress redirect dev ens1f0_1
> $ tc filter add dev ens1f0_0 ingress \
>     prio 1 chain 2 proto ip \
>     flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
>     action ct nat pipe \
>     action mirred egress redirect dev ens1f0_1
>
> $ tc filter add dev ens1f0_1 ingress \
>     prio 1 chain 0 proto ip \
>     flower ip_proto tcp ct_state -trk \
>     action ct zone 2 pipe \
>     action goto chain 1
> $ tc filter add dev ens1f0_1 ingress \
>     prio 1 chain 1 proto ip \
>     flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
>     action ct nat pipe \
>     action mirred egress redirect dev ens1f0_0
>
> Changelog:
> V1->V2:
>   Renamed netdev-tc-offloads to netdev-offload-tc (sorry about double email)
>
> Paul Blakey (8):
>   match: Add match_set_ct_zone_masked helper
>   compat: Add tc ct action and flower matches defines for older kernels
>   tc: Introduce tc_id to specify a tc filter
>   netdev-offload-tc: Implement netdev tc flush via tc filter del
>   netdev-offload-tc: Add recirculation support via tc chains
>   netdev-offload-tc: Add conntrack support
>   netdev-offload-tc: Add conntrack label and mark support
>   netdev-offload-tc: Add conntrack nat support
>
>  acinclude.m4                 |   6 +-
>  include/linux/automake.mk    |   3 +-
>  include/linux/pkt_cls.h      |  50 +++-
>  include/linux/tc_act/tc_ct.h |  41 +++
>  include/openvswitch/match.h  |   1 +
>  lib/dpif-netlink.c           |   5 +
>  lib/match.c                  |  10 +-
>  lib/netdev-linux.c           |   6 +-
>  lib/netdev-offload-tc.c      | 595 ++++++++++++++++++++++++++++++-------------
>  lib/tc.c                     | 411 ++++++++++++++++++++++++------
>  lib/tc.h                     |  75 +++++-
>  11 files changed, 921 insertions(+), 282 deletions(-)
>  create mode 100644 include/linux/tc_act/tc_ct.h
On 7/29/2019 4:16 PM, Ilya Maximets wrote:
> Hi.
>
> This patch-set has some checkpatch issues. It'll be good if you
> could fix them before submitting the next version.
>
> Best regards, Ilya Maximets.

Sure, thanks.
On Mon, Jul 29, 2019 at 01:05:50PM +0000, Paul Blakey wrote:
> Hi,
>
> I reproduced both traces.
>
> The first RCU trace is because we dereference the qdisc block without rtnl
> locked (flower allows it unlocked), and the second trace is because
> res->goto_tp can be NULL if the last prio of a chain we goto was just
> deleted, and we try to dereference it unconditionally when we set the skb
> extension in tcf_classify.
>
> This can happen when a filter is deleted while there is still traffic. I
> have a test for it and will try it on the fixes.
>
> I'll post fixed patches (patch by VladB) to the tc skb extension to fix
> both of the above, and an OVS userspace RFC v2 to fix the NAT issue.

Cool, thanks Paul!

  Marcelo