[ovs-dev,v1] ofproto: Fix OVS crash when packets hit Openflow rules with certain combinations of nested actions
diff mbox series

Message ID 1566987898-344-1-git-send-email-anilkumar.k@altencalsoftlabs.com
State New
Headers show
Series
  • [ovs-dev,v1] ofproto: Fix OVS crash when packets hit Openflow rules with certain combinations of nested actions
Related show

Commit Message

Vishal Deep Ajmera via dev Aug. 28, 2019, 10:24 a.m. UTC
OVS crashes when a packet sent with action set to OFPP_TABLE hits Openflow
rules with conntrack and learn actions.

For example:
The crash can be triggered by installing the following Open flow rules and
sending packet with action set to OFPP_TABLE

1. ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50, \
   ct_state=-trk,ip, in_port=10 actions=ct(table=0)"

2. ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50, \
   ct_state=+trk,ip, in_port=10 actions=ct(commit),resubmit(,1)"

3. ovs-ofctl -OOpenflow13 add-flow br-int "table=1 \
   actions=learn(table=2,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IN_PORT[]->NXM_NX_REG0[0..15], \
   output:NXM_NX_REG0[0..15]),resubmit(,2)"

4. Send a packet with output as OFPP_TABLE
   ovs-ofctl -OOpenflow13 packet-out br-int 'in_port=10 \
   packet=505400000007101111111111080045000028000000004006f97cc0a80001c0a800020008000a0000000000000000500200002e7d0000, \
    actions=TABLE'

The processing code path results in the same thread context attempting to
acquire a mutex that it already holds. Since the mutex is of error checking
type this situation is considered fatal and OVS aborts. The crash isn’t
limited to only the above combination of actions

Signed-off-by: Anil Kumar <anilkumar.k@altencalsoftlabs.com>
---
 ofproto/ofproto.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Ilya Maximets Aug. 28, 2019, 11:56 a.m. UTC | #1
On 28.08.2019 13:24, Anil Kumar wrote:
> OVS crashes when a packet sent with action set to OFPP_TABLE hits Openflow
> rules with conntrack and learn actions.
> 
> For example:
> The crash can be triggered by installing the following Open flow rules and
> sending packet with action set to OFPP_TABLE
> 
> 1. ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50, \
>    ct_state=-trk,ip, in_port=10 actions=ct(table=0)"
> 
> 2. ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50, \
>    ct_state=+trk,ip, in_port=10 actions=ct(commit),resubmit(,1)"
> 
> 3. ovs-ofctl -OOpenflow13 add-flow br-int "table=1 \
>    actions=learn(table=2,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IN_PORT[]->NXM_NX_REG0[0..15], \
>    output:NXM_NX_REG0[0..15]),resubmit(,2)"
> 
> 4. Send a packet with output as OFPP_TABLE
>    ovs-ofctl -OOpenflow13 packet-out br-int 'in_port=10 \
>    packet=505400000007101111111111080045000028000000004006f97cc0a80001c0a800020008000a0000000000000000500200002e7d0000, \
>     actions=TABLE'
> 
> The processing code path results in the same thread context attempting to
> acquire a mutex that it already holds. Since the mutex is of error checking
> type this situation is considered fatal and OVS aborts. The crash isn’t
> limited to only the above combination of actions
> 
> Signed-off-by: Anil Kumar <anilkumar.k@altencalsoftlabs.com>
> ---

Hi. Thanks for the patch.
But I cannot reproduce the issue on current master using commands you provided.
Could you, please, provide also a stack trace of the asserted thread?
This should make it easier to understand the issue.

Best regards, Ilya Maximets.
Vishal Deep Ajmera via dev Aug. 29, 2019, 9:52 a.m. UTC | #2
Hi Ilya,

Thanks for the review.
Please find below the stack trace of the crash 

 (gdb) bt
#0  0x00007f0a3da46c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f0a3da4a028 in __GI_abort () at abort.c:89
#2  0x000000000094262e in ovs_abort_valist (err_no=<optimized out>, format=<optimized out>, args=args@entry=0x7fffaf59e308) at lib/util.c:419
#3  0x00000000009426b7 in ovs_abort (err_no=<optimized out>, format=format@entry=0xb0289f "%s: pthread_%s_%s failed") at lib/util.c:411
#4  0x00000000009104f9 in ovs_mutex_lock_at (l_=l_@entry=0xe47cc0 <ofproto_mutex>, where=where@entry=0xad4199 "ofproto/ofproto.c:5391") at lib/ovs-thread.c:75
#5  0x000000000083fb1f in ofproto_flow_mod_learn (ofm=ofm@entry=0x7fffaf59e620, keep_ref=<optimized out>, limit=<optimized out>, below_limitp=below_limitp@entry=0x7fffaf59e540) at ofproto/ofproto.c:5391
#6  0x000000000085e5d0 in xlate_learn_action (ctx=ctx@entry=0x7fffaf5a02e0, learn=learn@entry=0x2b18308) at ofproto/ofproto-dpif-xlate.c:5378
#7  0x00000000008615c3 in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=<optimized out>, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:6893
#8  0x000000000085edba in xlate_recursively (actions_xlator=0x860730 <do_xlate_actions>, is_last_action=false, deepens=<optimized out>, rule=0x2b8e440, ctx=0x7fffaf5a02e0) at ofproto/ofproto-dpif-xlate.c:4233
#9  xlate_table_action (ctx=0x7fffaf5a02e0, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=false,
    xlator=0x860730 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4361
#10 0x0000000000861aaa in xlate_ofpact_resubmit (resubmit=<optimized out>, resubmit=<optimized out>, resubmit=<optimized out>, is_last_action=<optimized out>, ctx=0x7fffaf5a02e0) at ofproto/ofproto-dpif-xlate.c:4672
#11 do_xlate_actions (ofpacts=ofpacts@entry=0x2b654e8, ofpacts_len=ofpacts_len@entry=48, ctx=ctx@entry=0x7fffaf5a02e0, is_last_action=is_last_action@entry=true, group_bucket_action=group_bucket_action@entry=false)
    at ofproto/ofproto-dpif-xlate.c:6773
#12 0x00000000008696d6 in xlate_actions (xin=xin@entry=0x7fffaf5a0da0, xout=xout@entry=0x7fffaf5a11b0) at ofproto/ofproto-dpif-xlate.c:7570
#13 0x0000000000859b56 in upcall_xlate (wc=0x7fffaf5a23f0, odp_actions=0x7fffaf5a1bc0, upcall=0x7fffaf5a1150, udpif=0x2b10670) at ofproto/ofproto-dpif-upcall.c:1197
#14 process_upcall (udpif=udpif@entry=0x2b10670, upcall=upcall@entry=0x7fffaf5a1150, odp_actions=odp_actions@entry=0x7fffaf5a1bc0, wc=wc@entry=0x7fffaf5a23f0) at ofproto/ofproto-dpif-upcall.c:1413
#15 0x000000000085a9eb in upcall_cb (packet=<optimized out>, flow=0x7fffaf5a2150, ufid=<optimized out>, pmd_id=<optimized out>, type=<optimized out>, userdata=<optimized out>, actions=0x7fffaf5a1bc0, wc=0x7fffaf5a23f0,
    put_actions=0x7fffaf5a1c00, aux=0x2b10670) at ofproto/ofproto-dpif-upcall.c:1315
#16 0x000000000088419c in dp_netdev_upcall (pmd=pmd@entry=0x7f0a35f34010, packet_=packet_@entry=0x2b68930, flow=flow@entry=0x7fffaf5a2150, wc=wc@entry=0x7fffaf5a23f0, ufid=ufid@entry=0x7fffaf5a1ba0, type=type@entry=DPIF_UC_MISS,
    userdata=userdata@entry=0x0, actions=actions@entry=0x7fffaf5a1bc0, put_actions=put_actions@entry=0x7fffaf5a1c00) at lib/dpif-netdev.c:6236
#17 0x000000000088d053 in handle_packet_upcall (put_actions=0x7fffaf5a1c00, actions=0x7fffaf5a1bc0, key=0x7fffaf5a3080, packet=0x2b68930, pmd=0x7f0a35f34010) at lib/dpif-netdev.c:6591
#18 fast_path_processing (pmd=pmd@entry=0x7f0a35f34010, packets_=packets_@entry=0x7fffaf5a35f0, keys=keys@entry=0x7fffaf5a3040, flow_map=flow_map@entry=0x7fffaf5a2ef0, index_map=index_map@entry=0x7fffaf5a2ee0 "",
    in_port=<optimized out>) at lib/dpif-netdev.c:6709
#19 0x000000000088e9ce in dp_netdev_input__ (pmd=pmd@entry=0x7f0a35f34010, packets=packets@entry=0x7fffaf5a35f0, md_is_valid=md_is_valid@entry=true, port_no=port_no@entry=0) at lib/dpif-netdev.c:6797
#20 0x0000000000890710 in dp_netdev_recirculate (packets=0x7fffaf5a35f0, pmd=0x7f0a35f34010) at lib/dpif-netdev.c:6842
#21 dp_execute_cb (aux_=aux_@entry=0x7fffaf5a3ea0, packets_=0x7fffaf5a35f0, packets_@entry=0x7fffaf5a3eb0, a=a@entry=0x2b1c7d4, should_steal=<optimized out>) at lib/dpif-netdev.c:7158
#22 0x00000000008c3048 in odp_execute_actions (dp=dp@entry=0x7fffaf5a3ea0, batch=batch@entry=0x7fffaf5a3eb0, steal=steal@entry=false, actions=<optimized out>, actions_len=<optimized out>,
    dp_execute_action=dp_execute_action@entry=0x8904f0 <dp_execute_cb>) at lib/odp-execute.c:794
#23 0x0000000000887bbd in dp_netdev_execute_actions (actions_len=<optimized out>, actions=<optimized out>, flow=<optimized out>, should_steal=false, packets=0x7fffaf5a3eb0, pmd=0x7f0a35f34010) at lib/dpif-netdev.c:7332
#24 dpif_netdev_execute (dpif=dpif@entry=0x2b105d0, execute=execute@entry=0x7fffaf5a40f8) at lib/dpif-netdev.c:3725
#25 0x000000000088dccc in dpif_netdev_operate (dpif=0x2b105d0, ops=0x7fffaf5a40e8, n_ops=1, offload_type=<optimized out>) at lib/dpif-netdev.c:3756
#26 0x0000000000895b07 in dpif_operate (dpif=0x2b105d0, ops=ops@entry=0x7fffaf5a40e8, n_ops=n_ops@entry=1, offload_type=offload_type@entry=DPIF_OFFLOAD_AUTO) at lib/dpif.c:1367
#27 0x0000000000896233 in dpif_execute (dpif=<optimized out>, execute=execute@entry=0x7fffaf5a4180) at lib/dpif.c:1321
#28 0x00000000008459eb in packet_execute (ofproto_=0x2b3ec60, opo=0x7fffaf5a4200) at ofproto/ofproto-dpif.c:4760
#29 0x000000000083a9bb in ofproto_packet_out_finish (opo=0x7fffaf5a4200, ofproto=0x2b3ec60) at ofproto/ofproto.c:3594
#30 handle_packet_out (ofconn=ofconn@entry=0x2b23d50, oh=oh@entry=0x2b89470) at ofproto/ofproto.c:3635
#31 0x000000000083f06f in handle_single_part_openflow (type=OFPTYPE_PACKET_OUT, oh=0x2b89470, ofconn=0x2b23d50) at ofproto/ofproto.c:8400
#32 handle_openflow (ofconn=ofconn@entry=0x2b23d50, msgs=msgs@entry=0x7fffaf5a5600) at ofproto/ofproto.c:8587
#33 0x000000000086ff6b in ofconn_run (handle_openflow=0x83e1f0 <handle_openflow>, ofconn=0x2b23d50) at ofproto/connmgr.c:1318
#34 connmgr_run (mgr=0x2a26820, handle_openflow=handle_openflow@entry=0x83e1f0 <handle_openflow>) at ofproto/connmgr.c:355
#35 0x0000000000838706 in ofproto_run (p=0x2b3ec60) at ofproto/ofproto.c:1826
#36 0x0000000000826234 in bridge_run__ () at vswitchd/bridge.c:2965
#37 0x000000000082bae8 in bridge_run () at vswitchd/bridge.c:3023
#38 0x0000000000422a05 in main (argc=11, argv=0x7fffaf5a5a28) at vswitchd/ovs-vswitchd.c:127
(gdb) f 5
#5  0x000000000083fb1f in ofproto_flow_mod_learn (ofm=ofm@entry=0x7fffaf59e620, keep_ref=<optimized out>, limit=<optimized out>, below_limitp=below_limitp@entry=0x7fffaf59e540) at ofproto/ofproto.c:5391
5391    ofproto/ofproto.c: No such file or directory.
(gdb) p ofproto_mutex
$2 = {lock = {__data = {__lock = 1, __count = 0, __owner = 67357, __nusers = 1, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = "\001\000\000\000\000\000\000\000\035\a\001\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 1}, where = 0xad3dd6 "ofproto/ofproto.c:3631"}

Please find the discussion link regarding this crash with community. 
https://mail.openvswitch.org/pipermail/ovs-discuss/2019-April/048494.html

Best Regards,
Anil Kumar
-----Original Message-----
From: Ilya Maximets <i.maximets@samsung.com> 
Sent: Wednesday, 28 August, 2019 05:26 PM
To: Anil Kumar <anilkumar.k@altencalsoftlabs.com>; ovs-dev@openvswitch.org
Subject: Re: [ovs-dev] [PATCH v1] ofproto: Fix OVS crash when packets hit Openflow rules with certain combinations of nested actions

On 28.08.2019 13:24, Anil Kumar wrote:
> OVS crashes when a packet sent with action set to OFPP_TABLE hits 
> Openflow rules with conntrack and learn actions.
> 
> For example:
> The crash can be triggered by installing the following Open flow rules 
> and sending packet with action set to OFPP_TABLE
> 
> 1. ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50, \
>    ct_state=-trk,ip, in_port=10 actions=ct(table=0)"
> 
> 2. ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50, \
>    ct_state=+trk,ip, in_port=10 actions=ct(commit),resubmit(,1)"
> 
> 3. ovs-ofctl -OOpenflow13 add-flow br-int "table=1 \
>    actions=learn(table=2,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IN_PORT[]->NXM_NX_REG0[0..15], \
>    output:NXM_NX_REG0[0..15]),resubmit(,2)"
> 
> 4. Send a packet with output as OFPP_TABLE
>    ovs-ofctl -OOpenflow13 packet-out br-int 'in_port=10 \
>    packet=505400000007101111111111080045000028000000004006f97cc0a80001c0a800020008000a0000000000000000500200002e7d0000, \
>     actions=TABLE'
> 
> The processing code path results in the same thread context attempting 
> to acquire a mutex that it already holds. Since the mutex is of error 
> checking type this situation is considered fatal and OVS aborts. The 
> crash isn’t limited to only the above combination of actions
> 
> Signed-off-by: Anil Kumar <anilkumar.k@altencalsoftlabs.com>
> ---

Hi. Thanks for the patch.
But I cannot reproduce the issue on current master using commands you provided.
Could you, please, provide also a stack trace of the asserted thread?
This should make it easier to understand the issue.

Best regards, Ilya Maximets.
Ilya Maximets Aug. 29, 2019, 11:39 a.m. UTC | #3
On 29.08.2019 12:52, Anil Kumar Koli wrote:
> Hi Ilya,
> 
> Thanks for the review.
> Please find below the stack trace of the crash 
> 
>  (gdb) bt
> #0  0x00007f0a3da46c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x00007f0a3da4a028 in __GI_abort () at abort.c:89
> #2  0x000000000094262e in ovs_abort_valist (err_no=<optimized out>, format=<optimized out>, args=args@entry=0x7fffaf59e308) at lib/util.c:419
> #3  0x00000000009426b7 in ovs_abort (err_no=<optimized out>, format=format@entry=0xb0289f "%s: pthread_%s_%s failed") at lib/util.c:411
> #4  0x00000000009104f9 in ovs_mutex_lock_at (l_=l_@entry=0xe47cc0 <ofproto_mutex>, where=where@entry=0xad4199 "ofproto/ofproto.c:5391") at lib/ovs-thread.c:75
> #5  0x000000000083fb1f in ofproto_flow_mod_learn (ofm=ofm@entry=0x7fffaf59e620, keep_ref=<optimized out>, limit=<optimized out>, below_limitp=below_limitp@entry=0x7fffaf59e540) at ofproto/ofproto.c:5391
> #6  0x000000000085e5d0 in xlate_learn_action (ctx=ctx@entry=0x7fffaf5a02e0, learn=learn@entry=0x2b18308) at ofproto/ofproto-dpif-xlate.c:5378
> #7  0x00000000008615c3 in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=<optimized out>, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:6893
> #8  0x000000000085edba in xlate_recursively (actions_xlator=0x860730 <do_xlate_actions>, is_last_action=false, deepens=<optimized out>, rule=0x2b8e440, ctx=0x7fffaf5a02e0) at ofproto/ofproto-dpif-xlate.c:4233
> #9  xlate_table_action (ctx=0x7fffaf5a02e0, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=false,
>     xlator=0x860730 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4361
> #10 0x0000000000861aaa in xlate_ofpact_resubmit (resubmit=<optimized out>, resubmit=<optimized out>, resubmit=<optimized out>, is_last_action=<optimized out>, ctx=0x7fffaf5a02e0) at ofproto/ofproto-dpif-xlate.c:4672
> #11 do_xlate_actions (ofpacts=ofpacts@entry=0x2b654e8, ofpacts_len=ofpacts_len@entry=48, ctx=ctx@entry=0x7fffaf5a02e0, is_last_action=is_last_action@entry=true, group_bucket_action=group_bucket_action@entry=false)
>     at ofproto/ofproto-dpif-xlate.c:6773
> #12 0x00000000008696d6 in xlate_actions (xin=xin@entry=0x7fffaf5a0da0, xout=xout@entry=0x7fffaf5a11b0) at ofproto/ofproto-dpif-xlate.c:7570
> #13 0x0000000000859b56 in upcall_xlate (wc=0x7fffaf5a23f0, odp_actions=0x7fffaf5a1bc0, upcall=0x7fffaf5a1150, udpif=0x2b10670) at ofproto/ofproto-dpif-upcall.c:1197
> #14 process_upcall (udpif=udpif@entry=0x2b10670, upcall=upcall@entry=0x7fffaf5a1150, odp_actions=odp_actions@entry=0x7fffaf5a1bc0, wc=wc@entry=0x7fffaf5a23f0) at ofproto/ofproto-dpif-upcall.c:1413
> #15 0x000000000085a9eb in upcall_cb (packet=<optimized out>, flow=0x7fffaf5a2150, ufid=<optimized out>, pmd_id=<optimized out>, type=<optimized out>, userdata=<optimized out>, actions=0x7fffaf5a1bc0, wc=0x7fffaf5a23f0,
>     put_actions=0x7fffaf5a1c00, aux=0x2b10670) at ofproto/ofproto-dpif-upcall.c:1315
> #16 0x000000000088419c in dp_netdev_upcall (pmd=pmd@entry=0x7f0a35f34010, packet_=packet_@entry=0x2b68930, flow=flow@entry=0x7fffaf5a2150, wc=wc@entry=0x7fffaf5a23f0, ufid=ufid@entry=0x7fffaf5a1ba0, type=type@entry=DPIF_UC_MISS,
>     userdata=userdata@entry=0x0, actions=actions@entry=0x7fffaf5a1bc0, put_actions=put_actions@entry=0x7fffaf5a1c00) at lib/dpif-netdev.c:6236
> #17 0x000000000088d053 in handle_packet_upcall (put_actions=0x7fffaf5a1c00, actions=0x7fffaf5a1bc0, key=0x7fffaf5a3080, packet=0x2b68930, pmd=0x7f0a35f34010) at lib/dpif-netdev.c:6591
> #18 fast_path_processing (pmd=pmd@entry=0x7f0a35f34010, packets_=packets_@entry=0x7fffaf5a35f0, keys=keys@entry=0x7fffaf5a3040, flow_map=flow_map@entry=0x7fffaf5a2ef0, index_map=index_map@entry=0x7fffaf5a2ee0 "",
>     in_port=<optimized out>) at lib/dpif-netdev.c:6709
> #19 0x000000000088e9ce in dp_netdev_input__ (pmd=pmd@entry=0x7f0a35f34010, packets=packets@entry=0x7fffaf5a35f0, md_is_valid=md_is_valid@entry=true, port_no=port_no@entry=0) at lib/dpif-netdev.c:6797
> #20 0x0000000000890710 in dp_netdev_recirculate (packets=0x7fffaf5a35f0, pmd=0x7f0a35f34010) at lib/dpif-netdev.c:6842
> #21 dp_execute_cb (aux_=aux_@entry=0x7fffaf5a3ea0, packets_=0x7fffaf5a35f0, packets_@entry=0x7fffaf5a3eb0, a=a@entry=0x2b1c7d4, should_steal=<optimized out>) at lib/dpif-netdev.c:7158
> #22 0x00000000008c3048 in odp_execute_actions (dp=dp@entry=0x7fffaf5a3ea0, batch=batch@entry=0x7fffaf5a3eb0, steal=steal@entry=false, actions=<optimized out>, actions_len=<optimized out>,
>     dp_execute_action=dp_execute_action@entry=0x8904f0 <dp_execute_cb>) at lib/odp-execute.c:794
> #23 0x0000000000887bbd in dp_netdev_execute_actions (actions_len=<optimized out>, actions=<optimized out>, flow=<optimized out>, should_steal=false, packets=0x7fffaf5a3eb0, pmd=0x7f0a35f34010) at lib/dpif-netdev.c:7332
> #24 dpif_netdev_execute (dpif=dpif@entry=0x2b105d0, execute=execute@entry=0x7fffaf5a40f8) at lib/dpif-netdev.c:3725
> #25 0x000000000088dccc in dpif_netdev_operate (dpif=0x2b105d0, ops=0x7fffaf5a40e8, n_ops=1, offload_type=<optimized out>) at lib/dpif-netdev.c:3756
> #26 0x0000000000895b07 in dpif_operate (dpif=0x2b105d0, ops=ops@entry=0x7fffaf5a40e8, n_ops=n_ops@entry=1, offload_type=offload_type@entry=DPIF_OFFLOAD_AUTO) at lib/dpif.c:1367
> #27 0x0000000000896233 in dpif_execute (dpif=<optimized out>, execute=execute@entry=0x7fffaf5a4180) at lib/dpif.c:1321
> #28 0x00000000008459eb in packet_execute (ofproto_=0x2b3ec60, opo=0x7fffaf5a4200) at ofproto/ofproto-dpif.c:4760
> #29 0x000000000083a9bb in ofproto_packet_out_finish (opo=0x7fffaf5a4200, ofproto=0x2b3ec60) at ofproto/ofproto.c:3594
> #30 handle_packet_out (ofconn=ofconn@entry=0x2b23d50, oh=oh@entry=0x2b89470) at ofproto/ofproto.c:3635
> #31 0x000000000083f06f in handle_single_part_openflow (type=OFPTYPE_PACKET_OUT, oh=0x2b89470, ofconn=0x2b23d50) at ofproto/ofproto.c:8400
> #32 handle_openflow (ofconn=ofconn@entry=0x2b23d50, msgs=msgs@entry=0x7fffaf5a5600) at ofproto/ofproto.c:8587
> #33 0x000000000086ff6b in ofconn_run (handle_openflow=0x83e1f0 <handle_openflow>, ofconn=0x2b23d50) at ofproto/connmgr.c:1318
> #34 connmgr_run (mgr=0x2a26820, handle_openflow=handle_openflow@entry=0x83e1f0 <handle_openflow>) at ofproto/connmgr.c:355
> #35 0x0000000000838706 in ofproto_run (p=0x2b3ec60) at ofproto/ofproto.c:1826
> #36 0x0000000000826234 in bridge_run__ () at vswitchd/bridge.c:2965
> #37 0x000000000082bae8 in bridge_run () at vswitchd/bridge.c:3023
> #38 0x0000000000422a05 in main (argc=11, argv=0x7fffaf5a5a28) at vswitchd/ovs-vswitchd.c:127
> (gdb) f 5
> #5  0x000000000083fb1f in ofproto_flow_mod_learn (ofm=ofm@entry=0x7fffaf59e620, keep_ref=<optimized out>, limit=<optimized out>, below_limitp=below_limitp@entry=0x7fffaf59e540) at ofproto/ofproto.c:5391
> 5391    ofproto/ofproto.c: No such file or directory.
> (gdb) p ofproto_mutex
> $2 = {lock = {__data = {__lock = 1, __count = 0, __owner = 67357, __nusers = 1, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
>     __size = "\001\000\000\000\000\000\000\000\035\a\001\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 1}, where = 0xad3dd6 "ofproto/ofproto.c:3631"}
> 
> Please find the discussion link regarding this crash with community. 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2019-April/048494.html

Thanks for the trace and link. It's very useful.
You may, probably, add above link to the commit message via 'Reported-at' tag.

Regarding the issue.
Is it correct that this issue only happens for userspace datapath?
In case of userspace datapath execution of the packet actions happens
in the same thread that handles the ofproto packet-out request. This
is seen in above stack trace.  But in case of kernel datapath,
packet_execute() sends packet to the kernel via netlink and further
processing will happen in a separate thread that handles upcall (if any).

If above statements are correct, there could be another way to fix the
issue.

Ben, you said previously:

> How do we fix this?
> 
> * We can't get rid of ofproto_mutex in do_bundle_commit(), or drop it
>   temporarily around flow translation (i.e. the call to
>   ofproto_packet_out_start()), because it might need to revert all the
>   flow table changes and dropping the mutex would allow other threads to
>   race in and make conflicting changes

But it seems that the issue happens on ofproto_packet_out_finish() and
not on ofproto_packet_out_start(). So, the flow translation should be OK
under the ofproto_mutex and revert is still possible under the mutex.
The only thing we need to take out of the mutex is real action execution
by the datapath triggered by ofproto_packet_out_finish(). Callers never
check the status of this function so it should be not so hard.

So, possible solution:
* Move ofproto_packet_out_finish() out of ofproto_mutex in all the callers:
  * It's easy for handle_packet_out()
  * For do_bundle_commit() we could temporary store all the ofproto_packet_out
    entities and finish them later.

Am I missing something?  Is there reason for ofproto_packet_out_finish() to
require ofproto_mutex?

Ben, Anil, what do you think?

Best regards, Ilya Maximets.
Vishal Deep Ajmera via dev Sept. 3, 2019, 6:05 a.m. UTC | #4
Hi Ilya,

Sorry for late reply and thanks for the review comments.
Yes, the issue happens only in the userspace datapath and not in kernel 
datapath. I have tested it on kernel datapath and could not see this issue for 
the same build.

Regarding the suggested solution, I still feel like going with recursive mutex 
approach is more cleaner then releasing the lock for 
ofproto_packet_out_finish().  We have the ofproto_dpif_xcache_execute() which 
may lead to race when a flow mod learn is called from another thread. May be 
it could also be possible that in certain combination we may result the crash 
in ofproto_packet_out_start().  May be Ben can share his inputs as well.

Best Regards,Anil 
Kumar

-----Original Message-----
From: Ilya Maximets <i.maximets@samsung.com>
Sent: Thursday, 29 August, 2019 05:10 PM
To: Anil Kumar Koli <anilkumar.k@altencalsoftlabs.com>; 
ovs-dev@openvswitch.org
Cc: 'Ben Pfaff' <blp@ovn.org>
Subject: Re: [ovs-dev] [PATCH v1] ofproto: Fix OVS crash when packets hit 
Openflow rules with certain combinations of nested actions

On 29.08.2019 12:52, Anil Kumar Koli wrote:
> Hi Ilya,
>
> Thanks for the review.
> Please find below the stack trace of the crash
>
>  (gdb) bt
> #0  0x00007f0a3da46c37 in __GI_raise (sig=sig@entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x00007f0a3da4a028 in __GI_abort () at abort.c:89
> #2  0x000000000094262e in ovs_abort_valist (err_no=<optimized out>,
> format=<optimized out>, args=args@entry=0x7fffaf59e308) at
> lib/util.c:419
> #3  0x00000000009426b7 in ovs_abort (err_no=<optimized out>,
> format=format@entry=0xb0289f "%s: pthread_%s_%s failed") at
> lib/util.c:411
> #4  0x00000000009104f9 in ovs_mutex_lock_at (l_=l_@entry=0xe47cc0
> <ofproto_mutex>, where=where@entry=0xad4199 "ofproto/ofproto.c:5391")
> at lib/ovs-thread.c:75
> #5  0x000000000083fb1f in ofproto_flow_mod_learn
> (ofm=ofm@entry=0x7fffaf59e620, keep_ref=<optimized out>,
> limit=<optimized out>, below_limitp=below_limitp@entry=0x7fffaf59e540)
> at ofproto/ofproto.c:5391
> #6  0x000000000085e5d0 in xlate_learn_action
> (ctx=ctx@entry=0x7fffaf5a02e0, learn=learn@entry=0x2b18308) at
> ofproto/ofproto-dpif-xlate.c:5378
> #7  0x00000000008615c3 in do_xlate_actions (ofpacts=<optimized out>,
> ofpacts_len=<optimized out>, ctx=<optimized out>,
> is_last_action=<optimized out>, group_bucket_action=<optimized out>)
> at ofproto/ofproto-dpif-xlate.c:6893
> #8  0x000000000085edba in xlate_recursively (actions_xlator=0x860730
> <do_xlate_actions>, is_last_action=false, deepens=<optimized out>,
> rule=0x2b8e440, ctx=0x7fffaf5a02e0) at
> ofproto/ofproto-dpif-xlate.c:4233
> #9  xlate_table_action (ctx=0x7fffaf5a02e0, in_port=<optimized out>, 
> table_id=<optimized out>, may_packet_in=<optimized out>, 
> honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, 
> is_last_action=false,
>     xlator=0x860730 <do_xlate_actions>) at
> ofproto/ofproto-dpif-xlate.c:4361
> #10 0x0000000000861aaa in xlate_ofpact_resubmit (resubmit=<optimized
> out>, resubmit=<optimized out>, resubmit=<optimized out>,
> is_last_action=<optimized out>, ctx=0x7fffaf5a02e0) at
> ofproto/ofproto-dpif-xlate.c:4672
> #11 do_xlate_actions (ofpacts=ofpacts@entry=0x2b654e8, 
> ofpacts_len=ofpacts_len@entry=48, ctx=ctx@entry=0x7fffaf5a02e0, 
> is_last_action=is_last_action@entry=true, 
> group_bucket_action=group_bucket_action@entry=false)
>     at ofproto/ofproto-dpif-xlate.c:6773
> #12 0x00000000008696d6 in xlate_actions (xin=xin@entry=0x7fffaf5a0da0,
> xout=xout@entry=0x7fffaf5a11b0) at ofproto/ofproto-dpif-xlate.c:7570
> #13 0x0000000000859b56 in upcall_xlate (wc=0x7fffaf5a23f0,
> odp_actions=0x7fffaf5a1bc0, upcall=0x7fffaf5a1150, udpif=0x2b10670) at
> ofproto/ofproto-dpif-upcall.c:1197
> #14 process_upcall (udpif=udpif@entry=0x2b10670,
> upcall=upcall@entry=0x7fffaf5a1150,
> odp_actions=odp_actions@entry=0x7fffaf5a1bc0,
> wc=wc@entry=0x7fffaf5a23f0) at ofproto/ofproto-dpif-upcall.c:1413
> #15 0x000000000085a9eb in upcall_cb (packet=<optimized out>, 
> flow=0x7fffaf5a2150, ufid=<optimized out>, pmd_id=<optimized out>, 
> type=<optimized out>, userdata=<optimized out>, actions=0x7fffaf5a1bc0, 
> wc=0x7fffaf5a23f0,
>     put_actions=0x7fffaf5a1c00, aux=0x2b10670) at
> ofproto/ofproto-dpif-upcall.c:1315
> #16 0x000000000088419c in dp_netdev_upcall (pmd=pmd@entry=0x7f0a35f34010, 
> packet_=packet_@entry=0x2b68930, flow=flow@entry=0x7fffaf5a2150, 
> wc=wc@entry=0x7fffaf5a23f0, ufid=ufid@entry=0x7fffaf5a1ba0, 
> type=type@entry=DPIF_UC_MISS,
>     userdata=userdata@entry=0x0, actions=actions@entry=0x7fffaf5a1bc0,
> put_actions=put_actions@entry=0x7fffaf5a1c00) at
> lib/dpif-netdev.c:6236
> #17 0x000000000088d053 in handle_packet_upcall
> (put_actions=0x7fffaf5a1c00, actions=0x7fffaf5a1bc0,
> key=0x7fffaf5a3080, packet=0x2b68930, pmd=0x7f0a35f34010) at
> lib/dpif-netdev.c:6591
> #18 fast_path_processing (pmd=pmd@entry=0x7f0a35f34010, 
> packets_=packets_@entry=0x7fffaf5a35f0, keys=keys@entry=0x7fffaf5a3040, 
> flow_map=flow_map@entry=0x7fffaf5a2ef0, 
> index_map=index_map@entry=0x7fffaf5a2ee0 "",
>     in_port=<optimized out>) at lib/dpif-netdev.c:6709
> #19 0x000000000088e9ce in dp_netdev_input__
> (pmd=pmd@entry=0x7f0a35f34010, packets=packets@entry=0x7fffaf5a35f0,
> md_is_valid=md_is_valid@entry=true, port_no=port_no@entry=0) at
> lib/dpif-netdev.c:6797
> #20 0x0000000000890710 in dp_netdev_recirculate
> (packets=0x7fffaf5a35f0, pmd=0x7f0a35f34010) at lib/dpif-netdev.c:6842
> #21 dp_execute_cb (aux_=aux_@entry=0x7fffaf5a3ea0,
> packets_=0x7fffaf5a35f0, packets_@entry=0x7fffaf5a3eb0,
> a=a@entry=0x2b1c7d4, should_steal=<optimized out>) at
> lib/dpif-netdev.c:7158
> #22 0x00000000008c3048 in odp_execute_actions (dp=dp@entry=0x7fffaf5a3ea0, 
> batch=batch@entry=0x7fffaf5a3eb0, steal=steal@entry=false, 
> actions=<optimized out>, actions_len=<optimized out>,
>     dp_execute_action=dp_execute_action@entry=0x8904f0
> <dp_execute_cb>) at lib/odp-execute.c:794
> #23 0x0000000000887bbd in dp_netdev_execute_actions
> (actions_len=<optimized out>, actions=<optimized out>, flow=<optimized
> out>, should_steal=false, packets=0x7fffaf5a3eb0, pmd=0x7f0a35f34010)
> at lib/dpif-netdev.c:7332
> #24 dpif_netdev_execute (dpif=dpif@entry=0x2b105d0,
> execute=execute@entry=0x7fffaf5a40f8) at lib/dpif-netdev.c:3725
> #25 0x000000000088dccc in dpif_netdev_operate (dpif=0x2b105d0,
> ops=0x7fffaf5a40e8, n_ops=1, offload_type=<optimized out>) at
> lib/dpif-netdev.c:3756
> #26 0x0000000000895b07 in dpif_operate (dpif=0x2b105d0,
> ops=ops@entry=0x7fffaf5a40e8, n_ops=n_ops@entry=1,
> offload_type=offload_type@entry=DPIF_OFFLOAD_AUTO) at lib/dpif.c:1367
> #27 0x0000000000896233 in dpif_execute (dpif=<optimized out>,
> execute=execute@entry=0x7fffaf5a4180) at lib/dpif.c:1321
> #28 0x00000000008459eb in packet_execute (ofproto_=0x2b3ec60,
> opo=0x7fffaf5a4200) at ofproto/ofproto-dpif.c:4760
> #29 0x000000000083a9bb in ofproto_packet_out_finish
> (opo=0x7fffaf5a4200, ofproto=0x2b3ec60) at ofproto/ofproto.c:3594
> #30 handle_packet_out (ofconn=ofconn@entry=0x2b23d50,
> oh=oh@entry=0x2b89470) at ofproto/ofproto.c:3635
> #31 0x000000000083f06f in handle_single_part_openflow
> (type=OFPTYPE_PACKET_OUT, oh=0x2b89470, ofconn=0x2b23d50) at
> ofproto/ofproto.c:8400
> #32 handle_openflow (ofconn=ofconn@entry=0x2b23d50,
> msgs=msgs@entry=0x7fffaf5a5600) at ofproto/ofproto.c:8587
> #33 0x000000000086ff6b in ofconn_run (handle_openflow=0x83e1f0
> <handle_openflow>, ofconn=0x2b23d50) at ofproto/connmgr.c:1318
> #34 connmgr_run (mgr=0x2a26820,
> handle_openflow=handle_openflow@entry=0x83e1f0 <handle_openflow>) at
> ofproto/connmgr.c:355
> #35 0x0000000000838706 in ofproto_run (p=0x2b3ec60) at
> ofproto/ofproto.c:1826
> #36 0x0000000000826234 in bridge_run__ () at vswitchd/bridge.c:2965
> #37 0x000000000082bae8 in bridge_run () at vswitchd/bridge.c:3023
> #38 0x0000000000422a05 in main (argc=11, argv=0x7fffaf5a5a28) at
> vswitchd/ovs-vswitchd.c:127
> (gdb) f 5
> #5  0x000000000083fb1f in ofproto_flow_mod_learn 
> (ofm=ofm@entry=0x7fffaf59e620, keep_ref=<optimized out>, limit=<optimized 
> out>, below_limitp=below_limitp@entry=0x7fffaf59e540) at 
> ofproto/ofproto.c:5391
> 5391    ofproto/ofproto.c: No such file or directory.
> (gdb) p ofproto_mutex
> $2 = {lock = {__data = {__lock = 1, __count = 0, __owner = 67357, __nusers = 
> 1, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 
> 0x0}},
>     __size =
> "\001\000\000\000\000\000\000\000\035\a\001\000\001\000\000\000\002",
> '\000' <repeats 22 times>, __align = 1}, where = 0xad3dd6
> "ofproto/ofproto.c:3631"}
>
> Please find the discussion link regarding this crash with community.
> https://mail.openvswitch.org/pipermail/ovs-discuss/2019-April/048494.h
> tml

Thanks for the trace and link. It's very useful.
You may, probably, add above link to the commit message via 'Reported-at' tag.

Regarding the issue.
Is it correct that this issue only happens for userspace datapath?
In case of userspace datapath execution of the packet actions happens in the 
same thread that handles the ofproto packet-out request. This is seen in above 
stack trace.  But in case of kernel datapath,
packet_execute() sends packet to the kernel via netlink and further processing 
will happen in a separate thread that handles upcall (if any).

If above statements are correct, there could be another way to fix the issue.

Ben, you said previously:

> How do we fix this?
>
> * We can't get rid of ofproto_mutex in do_bundle_commit(), or drop it
>   temporarily around flow translation (i.e. the call to
>   ofproto_packet_out_start()), because it might need to revert all the
>   flow table changes and dropping the mutex would allow other threads to
>   race in and make conflicting changes

But it seems that the issue happens on ofproto_packet_out_finish() and not on 
ofproto_packet_out_start(). So, the flow translation should be OK under the 
ofproto_mutex and revert is still possible under the mutex.
The only thing we need to take out of the mutex is real action execution by 
the datapath triggered by ofproto_packet_out_finish(). Callers never check the 
status of this function so it should be not so hard.

So, possible solution:
* Move ofproto_packet_out_finish() out of ofproto_mutex in all the callers:
  * It's easy for handle_packet_out()
  * For do_bundle_commit() we could temporary store all the ofproto_packet_out
    entities and finish them later.

Am I missing something?  Is there reason for ofproto_packet_out_finish() to 
require ofproto_mutex?

Ben, Anil, what do you think?

Best regards, Ilya Maximets.
Vishal Deep Ajmera via dev Sept. 5, 2019, 5:42 p.m. UTC | #5
Hi Ben,

Please consider this as a gentle remainder and provide your inputs on the 
below issue.
This issue has been discussed earlier in the following thread.
https://mail.openvswitch.org/pipermail/ovs-discuss/2019-April/048494.html

Best Regards,
Anil Kumar.
-----Original Message-----
From: Anil Kumar Koli <anilkumar.k@altencalsoftlabs.com>
Sent: Tuesday, 3 September, 2019 11:35 AM
To: 'Ilya Maximets' <i.maximets@samsung.com>; 'ovs-dev@openvswitch.org' 
<ovs-dev@openvswitch.org>
Cc: 'Ben Pfaff' <blp@ovn.org>
Subject: RE: [ovs-dev] [PATCH v1] ofproto: Fix OVS crash when packets hit 
Openflow rules with certain combinations of nested actions

Hi Ilya,

Sorry for late reply and thanks for the review comments.
Yes, the issue happens only in the userspace datapath and not in kernel
datapath. I have tested it on kernel datapath and could not see this issue for
the same build.

Regarding the suggested solution, I still feel like going with recursive mutex
approach is more cleaner then releasing the lock for
ofproto_packet_out_finish().  We have the ofproto_dpif_xcache_execute() which
may lead to race when a flow mod learn is called from another thread. May be
it could also be possible that in certain combination we may result the crash
in ofproto_packet_out_start().  May be Ben can share his inputs as well.

Best Regards,Anil
Kumar

-----Original Message-----
From: Ilya Maximets <i.maximets@samsung.com>
Sent: Thursday, 29 August, 2019 05:10 PM
To: Anil Kumar Koli <anilkumar.k@altencalsoftlabs.com>;
ovs-dev@openvswitch.org
Cc: 'Ben Pfaff' <blp@ovn.org>
Subject: Re: [ovs-dev] [PATCH v1] ofproto: Fix OVS crash when packets hit
Openflow rules with certain combinations of nested actions

On 29.08.2019 12:52, Anil Kumar Koli wrote:
> Hi Ilya,
>
> Thanks for the review.
> Please find below the stack trace of the crash
>
>  (gdb) bt
> #0  0x00007f0a3da46c37 in __GI_raise (sig=sig@entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x00007f0a3da4a028 in __GI_abort () at abort.c:89
> #2  0x000000000094262e in ovs_abort_valist (err_no=<optimized out>,
> format=<optimized out>, args=args@entry=0x7fffaf59e308) at
> lib/util.c:419
> #3  0x00000000009426b7 in ovs_abort (err_no=<optimized out>,
> format=format@entry=0xb0289f "%s: pthread_%s_%s failed") at
> lib/util.c:411
> #4  0x00000000009104f9 in ovs_mutex_lock_at (l_=l_@entry=0xe47cc0
> <ofproto_mutex>, where=where@entry=0xad4199 "ofproto/ofproto.c:5391")
> at lib/ovs-thread.c:75
> #5  0x000000000083fb1f in ofproto_flow_mod_learn
> (ofm=ofm@entry=0x7fffaf59e620, keep_ref=<optimized out>,
> limit=<optimized out>, below_limitp=below_limitp@entry=0x7fffaf59e540)
> at ofproto/ofproto.c:5391
> #6  0x000000000085e5d0 in xlate_learn_action
> (ctx=ctx@entry=0x7fffaf5a02e0, learn=learn@entry=0x2b18308) at
> ofproto/ofproto-dpif-xlate.c:5378
> #7  0x00000000008615c3 in do_xlate_actions (ofpacts=<optimized out>,
> ofpacts_len=<optimized out>, ctx=<optimized out>,
> is_last_action=<optimized out>, group_bucket_action=<optimized out>)
> at ofproto/ofproto-dpif-xlate.c:6893
> #8  0x000000000085edba in xlate_recursively (actions_xlator=0x860730
> <do_xlate_actions>, is_last_action=false, deepens=<optimized out>,
> rule=0x2b8e440, ctx=0x7fffaf5a02e0) at
> ofproto/ofproto-dpif-xlate.c:4233
> #9  xlate_table_action (ctx=0x7fffaf5a02e0, in_port=<optimized out>,
> table_id=<optimized out>, may_packet_in=<optimized out>,
> honor_table_miss=<optimized out>, with_ct_orig=<optimized out>,
> is_last_action=false,
>     xlator=0x860730 <do_xlate_actions>) at
> ofproto/ofproto-dpif-xlate.c:4361
> #10 0x0000000000861aaa in xlate_ofpact_resubmit (resubmit=<optimized
> out>, resubmit=<optimized out>, resubmit=<optimized out>,
> is_last_action=<optimized out>, ctx=0x7fffaf5a02e0) at
> ofproto/ofproto-dpif-xlate.c:4672
> #11 do_xlate_actions (ofpacts=ofpacts@entry=0x2b654e8,
> ofpacts_len=ofpacts_len@entry=48, ctx=ctx@entry=0x7fffaf5a02e0,
> is_last_action=is_last_action@entry=true,
> group_bucket_action=group_bucket_action@entry=false)
>     at ofproto/ofproto-dpif-xlate.c:6773
> #12 0x00000000008696d6 in xlate_actions (xin=xin@entry=0x7fffaf5a0da0,
> xout=xout@entry=0x7fffaf5a11b0) at ofproto/ofproto-dpif-xlate.c:7570
> #13 0x0000000000859b56 in upcall_xlate (wc=0x7fffaf5a23f0,
> odp_actions=0x7fffaf5a1bc0, upcall=0x7fffaf5a1150, udpif=0x2b10670) at
> ofproto/ofproto-dpif-upcall.c:1197
> #14 process_upcall (udpif=udpif@entry=0x2b10670,
> upcall=upcall@entry=0x7fffaf5a1150,
> odp_actions=odp_actions@entry=0x7fffaf5a1bc0,
> wc=wc@entry=0x7fffaf5a23f0) at ofproto/ofproto-dpif-upcall.c:1413
> #15 0x000000000085a9eb in upcall_cb (packet=<optimized out>,
> flow=0x7fffaf5a2150, ufid=<optimized out>, pmd_id=<optimized out>,
> type=<optimized out>, userdata=<optimized out>, actions=0x7fffaf5a1bc0,
> wc=0x7fffaf5a23f0,
>     put_actions=0x7fffaf5a1c00, aux=0x2b10670) at
> ofproto/ofproto-dpif-upcall.c:1315
> #16 0x000000000088419c in dp_netdev_upcall (pmd=pmd@entry=0x7f0a35f34010,
> packet_=packet_@entry=0x2b68930, flow=flow@entry=0x7fffaf5a2150,
> wc=wc@entry=0x7fffaf5a23f0, ufid=ufid@entry=0x7fffaf5a1ba0,
> type=type@entry=DPIF_UC_MISS,
>     userdata=userdata@entry=0x0, actions=actions@entry=0x7fffaf5a1bc0,
> put_actions=put_actions@entry=0x7fffaf5a1c00) at
> lib/dpif-netdev.c:6236
> #17 0x000000000088d053 in handle_packet_upcall
> (put_actions=0x7fffaf5a1c00, actions=0x7fffaf5a1bc0,
> key=0x7fffaf5a3080, packet=0x2b68930, pmd=0x7f0a35f34010) at
> lib/dpif-netdev.c:6591
> #18 fast_path_processing (pmd=pmd@entry=0x7f0a35f34010,
> packets_=packets_@entry=0x7fffaf5a35f0, keys=keys@entry=0x7fffaf5a3040,
> flow_map=flow_map@entry=0x7fffaf5a2ef0,
> index_map=index_map@entry=0x7fffaf5a2ee0 "",
>     in_port=<optimized out>) at lib/dpif-netdev.c:6709
> #19 0x000000000088e9ce in dp_netdev_input__
> (pmd=pmd@entry=0x7f0a35f34010, packets=packets@entry=0x7fffaf5a35f0,
> md_is_valid=md_is_valid@entry=true, port_no=port_no@entry=0) at
> lib/dpif-netdev.c:6797
> #20 0x0000000000890710 in dp_netdev_recirculate
> (packets=0x7fffaf5a35f0, pmd=0x7f0a35f34010) at lib/dpif-netdev.c:6842
> #21 dp_execute_cb (aux_=aux_@entry=0x7fffaf5a3ea0,
> packets_=0x7fffaf5a35f0, packets_@entry=0x7fffaf5a3eb0,
> a=a@entry=0x2b1c7d4, should_steal=<optimized out>) at
> lib/dpif-netdev.c:7158
> #22 0x00000000008c3048 in odp_execute_actions (dp=dp@entry=0x7fffaf5a3ea0,
> batch=batch@entry=0x7fffaf5a3eb0, steal=steal@entry=false,
> actions=<optimized out>, actions_len=<optimized out>,
>     dp_execute_action=dp_execute_action@entry=0x8904f0
> <dp_execute_cb>) at lib/odp-execute.c:794
> #23 0x0000000000887bbd in dp_netdev_execute_actions
> (actions_len=<optimized out>, actions=<optimized out>, flow=<optimized
> out>, should_steal=false, packets=0x7fffaf5a3eb0, pmd=0x7f0a35f34010)
> at lib/dpif-netdev.c:7332
> #24 dpif_netdev_execute (dpif=dpif@entry=0x2b105d0,
> execute=execute@entry=0x7fffaf5a40f8) at lib/dpif-netdev.c:3725
> #25 0x000000000088dccc in dpif_netdev_operate (dpif=0x2b105d0,
> ops=0x7fffaf5a40e8, n_ops=1, offload_type=<optimized out>) at
> lib/dpif-netdev.c:3756
> #26 0x0000000000895b07 in dpif_operate (dpif=0x2b105d0,
> ops=ops@entry=0x7fffaf5a40e8, n_ops=n_ops@entry=1,
> offload_type=offload_type@entry=DPIF_OFFLOAD_AUTO) at lib/dpif.c:1367
> #27 0x0000000000896233 in dpif_execute (dpif=<optimized out>,
> execute=execute@entry=0x7fffaf5a4180) at lib/dpif.c:1321
> #28 0x00000000008459eb in packet_execute (ofproto_=0x2b3ec60,
> opo=0x7fffaf5a4200) at ofproto/ofproto-dpif.c:4760
> #29 0x000000000083a9bb in ofproto_packet_out_finish
> (opo=0x7fffaf5a4200, ofproto=0x2b3ec60) at ofproto/ofproto.c:3594
> #30 handle_packet_out (ofconn=ofconn@entry=0x2b23d50,
> oh=oh@entry=0x2b89470) at ofproto/ofproto.c:3635
> #31 0x000000000083f06f in handle_single_part_openflow
> (type=OFPTYPE_PACKET_OUT, oh=0x2b89470, ofconn=0x2b23d50) at
> ofproto/ofproto.c:8400
> #32 handle_openflow (ofconn=ofconn@entry=0x2b23d50,
> msgs=msgs@entry=0x7fffaf5a5600) at ofproto/ofproto.c:8587
> #33 0x000000000086ff6b in ofconn_run (handle_openflow=0x83e1f0
> <handle_openflow>, ofconn=0x2b23d50) at ofproto/connmgr.c:1318
> #34 connmgr_run (mgr=0x2a26820,
> handle_openflow=handle_openflow@entry=0x83e1f0 <handle_openflow>) at
> ofproto/connmgr.c:355
> #35 0x0000000000838706 in ofproto_run (p=0x2b3ec60) at
> ofproto/ofproto.c:1826
> #36 0x0000000000826234 in bridge_run__ () at vswitchd/bridge.c:2965
> #37 0x000000000082bae8 in bridge_run () at vswitchd/bridge.c:3023
> #38 0x0000000000422a05 in main (argc=11, argv=0x7fffaf5a5a28) at
> vswitchd/ovs-vswitchd.c:127
> (gdb) f 5
> #5  0x000000000083fb1f in ofproto_flow_mod_learn
> (ofm=ofm@entry=0x7fffaf59e620, keep_ref=<optimized out>, limit=<optimized
> out>, below_limitp=below_limitp@entry=0x7fffaf59e540) at
> ofproto/ofproto.c:5391
> 5391    ofproto/ofproto.c: No such file or directory.
> (gdb) p ofproto_mutex
> $2 = {lock = {__data = {__lock = 1, __count = 0, __owner = 67357, __nusers =
> 1, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next =
> 0x0}},
>     __size =
> "\001\000\000\000\000\000\000\000\035\a\001\000\001\000\000\000\002",
> '\000' <repeats 22 times>, __align = 1}, where = 0xad3dd6
> "ofproto/ofproto.c:3631"}
>
> Please find the discussion link regarding this crash with community.
> https://mail.openvswitch.org/pipermail/ovs-discuss/2019-April/048494.h
> tml

Thanks for the trace and link. It's very useful.
You may, probably, add above link to the commit message via 'Reported-at' tag.

Regarding the issue.
Is it correct that this issue only happens for userspace datapath?
In case of userspace datapath execution of the packet actions happens in the
same thread that handles the ofproto packet-out request. This is seen in above
stack trace.  But in case of kernel datapath,
packet_execute() sends packet to the kernel via netlink and further processing
will happen in a separate thread that handles upcall (if any).

If above statements are correct, there could be another way to fix the issue.

Ben, you said previously:

> How do we fix this?
>
> * We can't get rid of ofproto_mutex in do_bundle_commit(), or drop it
>   temporarily around flow translation (i.e. the call to
>   ofproto_packet_out_start()), because it might need to revert all the
>   flow table changes and dropping the mutex would allow other threads to
>   race in and make conflicting changes

But it seems that the issue happens on ofproto_packet_out_finish() and not on
ofproto_packet_out_start(). So, the flow translation should be OK under the
ofproto_mutex and revert is still possible under the mutex.
The only thing we need to take out of the mutex is real action execution by
the datapath triggered by ofproto_packet_out_finish(). Callers never check the
status of this function so it should be not so hard.

So, possible solution:
* Move ofproto_packet_out_finish() out of ofproto_mutex in all the callers:
  * It's easy for handle_packet_out()
  * For do_bundle_commit() we could temporary store all the ofproto_packet_out
    entities and finish them later.

Am I missing something?  Is there reason for ofproto_packet_out_finish() to
require ofproto_mutex?

Ben, Anil, what do you think?

Best regards, Ilya Maximets.
Vishal Deep Ajmera via dev Sept. 13, 2019, 2:50 a.m. UTC | #6
Hi Ben,

Please consider this as a gentle remainder and provide your inputs on the 
below issue.

Thanks & Regards,
Anil Kumar

-----Original Message-----
From: Anil Kumar Koli <anilkumar.k@altencalsoftlabs.com>
Sent: Thursday, 5 September, 2019 11:12 PM
To: 'Ben Pfaff' <blp@ovn.org>
Cc: 'Ilya Maximets' <i.maximets@samsung.com>; 'ovs-dev@openvswitch.org' 
<ovs-dev@openvswitch.org>
Subject: RE: [ovs-dev] [PATCH v1] ofproto: Fix OVS crash when packets hit 
Openflow rules with certain combinations of nested actions

Hi Ben,

Please consider this as a gentle remainder and provide your inputs on the
below issue.
This issue has been discussed earlier in the following thread.
https://mail.openvswitch.org/pipermail/ovs-discuss/2019-April/048494.html

Best Regards,
Anil Kumar.
-----Original Message-----
From: Anil Kumar Koli <anilkumar.k@altencalsoftlabs.com>
Sent: Tuesday, 3 September, 2019 11:35 AM
To: 'Ilya Maximets' <i.maximets@samsung.com>; 'ovs-dev@openvswitch.org'
<ovs-dev@openvswitch.org>
Cc: 'Ben Pfaff' <blp@ovn.org>
Subject: RE: [ovs-dev] [PATCH v1] ofproto: Fix OVS crash when packets hit
Openflow rules with certain combinations of nested actions

Hi Ilya,

Sorry for late reply and thanks for the review comments.
Yes, the issue happens only in the userspace datapath and not in kernel
datapath. I have tested it on kernel datapath and could not see this issue for
the same build.

Regarding the suggested solution, I still feel like going with recursive mutex
approach is more cleaner then releasing the lock for
ofproto_packet_out_finish().  We have the ofproto_dpif_xcache_execute() which
may lead to race when a flow mod learn is called from another thread. May be
it could also be possible that in certain combination we may result the crash
in ofproto_packet_out_start().  May be Ben can share his inputs as well.

Best Regards,Anil
Kumar

-----Original Message-----
From: Ilya Maximets <i.maximets@samsung.com>
Sent: Thursday, 29 August, 2019 05:10 PM
To: Anil Kumar Koli <anilkumar.k@altencalsoftlabs.com>;
ovs-dev@openvswitch.org
Cc: 'Ben Pfaff' <blp@ovn.org>
Subject: Re: [ovs-dev] [PATCH v1] ofproto: Fix OVS crash when packets hit
Openflow rules with certain combinations of nested actions

On 29.08.2019 12:52, Anil Kumar Koli wrote:
> Hi Ilya,
>
> Thanks for the review.
> Please find below the stack trace of the crash
>
>  (gdb) bt
> #0  0x00007f0a3da46c37 in __GI_raise (sig=sig@entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x00007f0a3da4a028 in __GI_abort () at abort.c:89
> #2  0x000000000094262e in ovs_abort_valist (err_no=<optimized out>,
> format=<optimized out>, args=args@entry=0x7fffaf59e308) at
> lib/util.c:419
> #3  0x00000000009426b7 in ovs_abort (err_no=<optimized out>,
> format=format@entry=0xb0289f "%s: pthread_%s_%s failed") at
> lib/util.c:411
> #4  0x00000000009104f9 in ovs_mutex_lock_at (l_=l_@entry=0xe47cc0
> <ofproto_mutex>, where=where@entry=0xad4199 "ofproto/ofproto.c:5391")
> at lib/ovs-thread.c:75
> #5  0x000000000083fb1f in ofproto_flow_mod_learn
> (ofm=ofm@entry=0x7fffaf59e620, keep_ref=<optimized out>,
> limit=<optimized out>, below_limitp=below_limitp@entry=0x7fffaf59e540)
> at ofproto/ofproto.c:5391
> #6  0x000000000085e5d0 in xlate_learn_action
> (ctx=ctx@entry=0x7fffaf5a02e0, learn=learn@entry=0x2b18308) at
> ofproto/ofproto-dpif-xlate.c:5378
> #7  0x00000000008615c3 in do_xlate_actions (ofpacts=<optimized out>,
> ofpacts_len=<optimized out>, ctx=<optimized out>,
> is_last_action=<optimized out>, group_bucket_action=<optimized out>)
> at ofproto/ofproto-dpif-xlate.c:6893
> #8  0x000000000085edba in xlate_recursively (actions_xlator=0x860730
> <do_xlate_actions>, is_last_action=false, deepens=<optimized out>,
> rule=0x2b8e440, ctx=0x7fffaf5a02e0) at
> ofproto/ofproto-dpif-xlate.c:4233
> #9  xlate_table_action (ctx=0x7fffaf5a02e0, in_port=<optimized out>,
> table_id=<optimized out>, may_packet_in=<optimized out>,
> honor_table_miss=<optimized out>, with_ct_orig=<optimized out>,
> is_last_action=false,
>     xlator=0x860730 <do_xlate_actions>) at
> ofproto/ofproto-dpif-xlate.c:4361
> #10 0x0000000000861aaa in xlate_ofpact_resubmit (resubmit=<optimized
> out>, resubmit=<optimized out>, resubmit=<optimized out>,
> is_last_action=<optimized out>, ctx=0x7fffaf5a02e0) at
> ofproto/ofproto-dpif-xlate.c:4672
> #11 do_xlate_actions (ofpacts=ofpacts@entry=0x2b654e8,
> ofpacts_len=ofpacts_len@entry=48, ctx=ctx@entry=0x7fffaf5a02e0,
> is_last_action=is_last_action@entry=true,
> group_bucket_action=group_bucket_action@entry=false)
>     at ofproto/ofproto-dpif-xlate.c:6773
> #12 0x00000000008696d6 in xlate_actions (xin=xin@entry=0x7fffaf5a0da0,
> xout=xout@entry=0x7fffaf5a11b0) at ofproto/ofproto-dpif-xlate.c:7570
> #13 0x0000000000859b56 in upcall_xlate (wc=0x7fffaf5a23f0,
> odp_actions=0x7fffaf5a1bc0, upcall=0x7fffaf5a1150, udpif=0x2b10670) at
> ofproto/ofproto-dpif-upcall.c:1197
> #14 process_upcall (udpif=udpif@entry=0x2b10670,
> upcall=upcall@entry=0x7fffaf5a1150,
> odp_actions=odp_actions@entry=0x7fffaf5a1bc0,
> wc=wc@entry=0x7fffaf5a23f0) at ofproto/ofproto-dpif-upcall.c:1413
> #15 0x000000000085a9eb in upcall_cb (packet=<optimized out>,
> flow=0x7fffaf5a2150, ufid=<optimized out>, pmd_id=<optimized out>,
> type=<optimized out>, userdata=<optimized out>, actions=0x7fffaf5a1bc0,
> wc=0x7fffaf5a23f0,
>     put_actions=0x7fffaf5a1c00, aux=0x2b10670) at
> ofproto/ofproto-dpif-upcall.c:1315
> #16 0x000000000088419c in dp_netdev_upcall (pmd=pmd@entry=0x7f0a35f34010,
> packet_=packet_@entry=0x2b68930, flow=flow@entry=0x7fffaf5a2150,
> wc=wc@entry=0x7fffaf5a23f0, ufid=ufid@entry=0x7fffaf5a1ba0,
> type=type@entry=DPIF_UC_MISS,
>     userdata=userdata@entry=0x0, actions=actions@entry=0x7fffaf5a1bc0,
> put_actions=put_actions@entry=0x7fffaf5a1c00) at
> lib/dpif-netdev.c:6236
> #17 0x000000000088d053 in handle_packet_upcall
> (put_actions=0x7fffaf5a1c00, actions=0x7fffaf5a1bc0,
> key=0x7fffaf5a3080, packet=0x2b68930, pmd=0x7f0a35f34010) at
> lib/dpif-netdev.c:6591
> #18 fast_path_processing (pmd=pmd@entry=0x7f0a35f34010,
> packets_=packets_@entry=0x7fffaf5a35f0, keys=keys@entry=0x7fffaf5a3040,
> flow_map=flow_map@entry=0x7fffaf5a2ef0,
> index_map=index_map@entry=0x7fffaf5a2ee0 "",
>     in_port=<optimized out>) at lib/dpif-netdev.c:6709
> #19 0x000000000088e9ce in dp_netdev_input__
> (pmd=pmd@entry=0x7f0a35f34010, packets=packets@entry=0x7fffaf5a35f0,
> md_is_valid=md_is_valid@entry=true, port_no=port_no@entry=0) at
> lib/dpif-netdev.c:6797
> #20 0x0000000000890710 in dp_netdev_recirculate
> (packets=0x7fffaf5a35f0, pmd=0x7f0a35f34010) at lib/dpif-netdev.c:6842
> #21 dp_execute_cb (aux_=aux_@entry=0x7fffaf5a3ea0,
> packets_=0x7fffaf5a35f0, packets_@entry=0x7fffaf5a3eb0,
> a=a@entry=0x2b1c7d4, should_steal=<optimized out>) at
> lib/dpif-netdev.c:7158
> #22 0x00000000008c3048 in odp_execute_actions (dp=dp@entry=0x7fffaf5a3ea0,
> batch=batch@entry=0x7fffaf5a3eb0, steal=steal@entry=false,
> actions=<optimized out>, actions_len=<optimized out>,
>     dp_execute_action=dp_execute_action@entry=0x8904f0
> <dp_execute_cb>) at lib/odp-execute.c:794
> #23 0x0000000000887bbd in dp_netdev_execute_actions
> (actions_len=<optimized out>, actions=<optimized out>, flow=<optimized
> out>, should_steal=false, packets=0x7fffaf5a3eb0, pmd=0x7f0a35f34010)
> at lib/dpif-netdev.c:7332
> #24 dpif_netdev_execute (dpif=dpif@entry=0x2b105d0,
> execute=execute@entry=0x7fffaf5a40f8) at lib/dpif-netdev.c:3725
> #25 0x000000000088dccc in dpif_netdev_operate (dpif=0x2b105d0,
> ops=0x7fffaf5a40e8, n_ops=1, offload_type=<optimized out>) at
> lib/dpif-netdev.c:3756
> #26 0x0000000000895b07 in dpif_operate (dpif=0x2b105d0,
> ops=ops@entry=0x7fffaf5a40e8, n_ops=n_ops@entry=1,
> offload_type=offload_type@entry=DPIF_OFFLOAD_AUTO) at lib/dpif.c:1367
> #27 0x0000000000896233 in dpif_execute (dpif=<optimized out>,
> execute=execute@entry=0x7fffaf5a4180) at lib/dpif.c:1321
> #28 0x00000000008459eb in packet_execute (ofproto_=0x2b3ec60,
> opo=0x7fffaf5a4200) at ofproto/ofproto-dpif.c:4760
> #29 0x000000000083a9bb in ofproto_packet_out_finish
> (opo=0x7fffaf5a4200, ofproto=0x2b3ec60) at ofproto/ofproto.c:3594
> #30 handle_packet_out (ofconn=ofconn@entry=0x2b23d50,
> oh=oh@entry=0x2b89470) at ofproto/ofproto.c:3635
> #31 0x000000000083f06f in handle_single_part_openflow
> (type=OFPTYPE_PACKET_OUT, oh=0x2b89470, ofconn=0x2b23d50) at
> ofproto/ofproto.c:8400
> #32 handle_openflow (ofconn=ofconn@entry=0x2b23d50,
> msgs=msgs@entry=0x7fffaf5a5600) at ofproto/ofproto.c:8587
> #33 0x000000000086ff6b in ofconn_run (handle_openflow=0x83e1f0
> <handle_openflow>, ofconn=0x2b23d50) at ofproto/connmgr.c:1318
> #34 connmgr_run (mgr=0x2a26820,
> handle_openflow=handle_openflow@entry=0x83e1f0 <handle_openflow>) at
> ofproto/connmgr.c:355
> #35 0x0000000000838706 in ofproto_run (p=0x2b3ec60) at
> ofproto/ofproto.c:1826
> #36 0x0000000000826234 in bridge_run__ () at vswitchd/bridge.c:2965
> #37 0x000000000082bae8 in bridge_run () at vswitchd/bridge.c:3023
> #38 0x0000000000422a05 in main (argc=11, argv=0x7fffaf5a5a28) at
> vswitchd/ovs-vswitchd.c:127
> (gdb) f 5
> #5  0x000000000083fb1f in ofproto_flow_mod_learn
> (ofm=ofm@entry=0x7fffaf59e620, keep_ref=<optimized out>, limit=<optimized
> out>, below_limitp=below_limitp@entry=0x7fffaf59e540) at
> ofproto/ofproto.c:5391
> 5391    ofproto/ofproto.c: No such file or directory.
> (gdb) p ofproto_mutex
> $2 = {lock = {__data = {__lock = 1, __count = 0, __owner = 67357, __nusers =
> 1, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next =
> 0x0}},
>     __size =
> "\001\000\000\000\000\000\000\000\035\a\001\000\001\000\000\000\002",
> '\000' <repeats 22 times>, __align = 1}, where = 0xad3dd6
> "ofproto/ofproto.c:3631"}
>
> Please find the discussion link regarding this crash with community.
> https://mail.openvswitch.org/pipermail/ovs-discuss/2019-April/048494.h
> tml

Thanks for the trace and link. It's very useful.
You may, probably, add above link to the commit message via 'Reported-at' tag.

Regarding the issue.
Is it correct that this issue only happens for userspace datapath?
In case of userspace datapath execution of the packet actions happens in the
same thread that handles the ofproto packet-out request. This is seen in above
stack trace.  But in case of kernel datapath,
packet_execute() sends packet to the kernel via netlink and further processing
will happen in a separate thread that handles upcall (if any).

If above statements are correct, there could be another way to fix the issue.

Ben, you said previously:

> How do we fix this?
>
> * We can't get rid of ofproto_mutex in do_bundle_commit(), or drop it
>   temporarily around flow translation (i.e. the call to
>   ofproto_packet_out_start()), because it might need to revert all the
>   flow table changes and dropping the mutex would allow other threads to
>   race in and make conflicting changes

But it seems that the issue happens on ofproto_packet_out_finish() and not on
ofproto_packet_out_start(). So, the flow translation should be OK under the
ofproto_mutex and revert is still possible under the mutex.
The only thing we need to take out of the mutex is real action execution by
the datapath triggered by ofproto_packet_out_finish(). Callers never check the
status of this function so it should be not so hard.

So, possible solution:
* Move ofproto_packet_out_finish() out of ofproto_mutex in all the callers:
  * It's easy for handle_packet_out()
  * For do_bundle_commit() we could temporary store all the ofproto_packet_out
    entities and finish them later.

Am I missing something?  Is there reason for ofproto_packet_out_finish() to
require ofproto_mutex?

Ben, Anil, what do you think?

Best regards, Ilya Maximets.

Patch
diff mbox series

diff --git a/ofproto/ofproto.c b/ofproto/ofproto.c
index 12758a3..ff7d90b 100644
--- a/ofproto/ofproto.c
+++ b/ofproto/ofproto.c
@@ -302,7 +302,7 @@  static size_t n_ofproto_classes;
 static size_t allocated_ofproto_classes;
 
 /* Global lock that protects all flow table operations. */
-struct ovs_mutex ofproto_mutex = OVS_MUTEX_INITIALIZER;
+struct ovs_mutex ofproto_mutex;
 
 unsigned ofproto_flow_limit = OFPROTO_FLOW_LIMIT_DEFAULT;
 unsigned ofproto_max_idle = OFPROTO_MAX_IDLE_DEFAULT;
@@ -337,6 +337,8 @@  ofproto_init(const struct shash *iface_hints)
     struct shash_node *node;
     size_t i;
 
+    ovs_mutex_init_recursive(&ofproto_mutex);
+
     ofproto_class_register(&ofproto_dpif_class);
 
     /* Make a local copy, since we don't own 'iface_hints' elements. */