Message ID | 1613596145-3110-1-git-send-email-pkusunyifeng@gmail.com |
---|---|
State | Accepted |
Headers | show |
Series | [ovs-dev,PATCHv2] connmgr: Check nullptr inside ofmonitor_report() | expand |
On Wed, Feb 17, 2021 at 1:09 PM Yifeng Sun <pkusunyifeng@gmail.com> wrote: > > ovs-vswitchd could crash under these circumstances: > 1. When one bridge is being destroyed, ofproto_destroy() is called and > connmgr pointer of its ofproto struct is nullified. This ofproto struct is > deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'. > 2. Before RCU enters quiesce state to actually free this ofproto struct, > revalidator thread calls udpif_revalidator(), which could handle > a learn flow and calls ofproto_flow_mod_learn(), it later calls > ofmonitor_report() and ofproto struct's connmgr pointer is accessed. > > The crash stack trace is shown below: > > 0 ofmonitor_report (mgr=0x0, rule=rule@entry=0x7fa4ac067c30, event=event@entry=NXFME_ADDED, > reason=reason@entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, abbrev_xid=0, old_actions=old_actions@entry=0x0) > at ofproto/connmgr.c:2160 > 1 0x00007fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, ofm=<optimized out>, req=req@entry=0x0) > at ofproto/ofproto.c:5221 > 2 0x00007fa4d68036af in modify_flows_finish (req=0x0, ofm=0x7fa4980753f0, ofproto=0x55d9075d4ab0) > at ofproto/ofproto.c:5823 > 3 ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, ofm=ofm@entry=0x7fa4980753f0, req=req@entry=0x0) > at ofproto/ofproto.c:8088 > 4 0x00007fa4d680372d in ofproto_flow_mod_learn_finish (ofm=ofm@entry=0x7fa4980753f0, > orig_ofproto=orig_ofproto@entry=0x0) at ofproto/ofproto.c:5439 > 5 0x00007fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, keep_ref=keep_ref@entry=true, > limit=<optimized out>, below_limitp=below_limitp@entry=0x0) at ofproto/ofproto.c:5499 > 6 0x00007fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, stats=stats@entry=0x7fa4d2701a10, > offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:127 > 7 0x00007fa4d6835e3a in xlate_push_stats (xcache=<optimized out>, stats=stats@entry=0x7fa4d2701a10, > offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:181 > 8 0x00007fa4d6822046 in revalidate_ukey (udpif=udpif@entry=0x55d90760b240, ukey=ukey@entry=0x7fa4b0191660, > stats=stats@entry=0x7fa4d2705118, odp_actions=odp_actions@entry=0x7fa4d2701b50, > reval_seq=reval_seq@entry=5655486242, recircs=recircs@entry=0x7fa4d2701b40, offloaded=false) > at ofproto/ofproto-dpif-upcall.c:2294 > 9 0x00007fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:2683 > 10 0x00007fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:936 > 11 0x00007fa4d6259c9f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:423 > 12 0x00007fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0 > 13 0x00007fa4d504b96d in clone () from /usr/lib64/libc.so.6 > > At the time of crash, the involved ofproto was already deallocated: > > (gdb) print *ofproto > $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {..., > one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ... > > This patch fixes it. > > VMware-BZ: #2700626 > Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> > --- > v1->v2: Add check for ofmonitor_flush, thanks William. > LGTM, thanks. Acked-by: William Tu < u9012063@gmail.com> CC Ilya and Ben to see if any comments. I feel this kind of RCU issue is hard to find out, and existing tools such as addressSanitizer are usually not helpful. William > ofproto/connmgr.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/ofproto/connmgr.c b/ofproto/connmgr.c > index 9c5c633b4171..fa8f6cd0e83a 100644 > --- a/ofproto/connmgr.c > +++ b/ofproto/connmgr.c > @@ -2140,7 +2140,7 @@ ofmonitor_report(struct connmgr *mgr, struct rule *rule, > const struct rule_actions *old_actions) > OVS_REQUIRES(ofproto_mutex) > { > - if (rule_is_hidden(rule)) { > + if (!mgr || rule_is_hidden(rule)) { > return; > } > > @@ -2244,6 +2244,10 @@ ofmonitor_flush(struct connmgr *mgr) > { > struct ofconn *ofconn; > > + if (!mgr) { > + return; > + } > + > LIST_FOR_EACH (ofconn, connmgr_node, &mgr->conns) { > struct rconn_packet_counter *counter = ofconn->monitor_counter; > > -- > 2.7.4 > > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Hi, Looks like this bug is caused by violating the fact that if a rule is referenced, the related ofproto should not be destroyed. If so, I have a patch that also fixes the problem, not sure if this helps. http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/ William Tu <u9012063@gmail.com> 于2021年2月19日周五 上午8:10写道: > > On Wed, Feb 17, 2021 at 1:09 PM Yifeng Sun <pkusunyifeng@gmail.com> wrote: > > > > ovs-vswitchd could crash under these circumstances: > > 1. When one bridge is being destroyed, ofproto_destroy() is called and > > connmgr pointer of its ofproto struct is nullified. This ofproto struct is > > deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'. > > 2. Before RCU enters quiesce state to actually free this ofproto struct, > > revalidator thread calls udpif_revalidator(), which could handle > > a learn flow and calls ofproto_flow_mod_learn(), it later calls > > ofmonitor_report() and ofproto struct's connmgr pointer is accessed. > > > > The crash stack trace is shown below: > > > > 0 ofmonitor_report (mgr=0x0, rule=rule@entry=0x7fa4ac067c30, event=event@entry=NXFME_ADDED, > > reason=reason@entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, abbrev_xid=0, old_actions=old_actions@entry=0x0) > > at ofproto/connmgr.c:2160 > > 1 0x00007fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, ofm=<optimized out>, req=req@entry=0x0) > > at ofproto/ofproto.c:5221 > > 2 0x00007fa4d68036af in modify_flows_finish (req=0x0, ofm=0x7fa4980753f0, ofproto=0x55d9075d4ab0) > > at ofproto/ofproto.c:5823 > > 3 ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, ofm=ofm@entry=0x7fa4980753f0, req=req@entry=0x0) > > at ofproto/ofproto.c:8088 > > 4 0x00007fa4d680372d in ofproto_flow_mod_learn_finish (ofm=ofm@entry=0x7fa4980753f0, > > orig_ofproto=orig_ofproto@entry=0x0) at ofproto/ofproto.c:5439 > > 5 0x00007fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, keep_ref=keep_ref@entry=true, > > limit=<optimized out>, below_limitp=below_limitp@entry=0x0) at ofproto/ofproto.c:5499 > > 6 0x00007fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, stats=stats@entry=0x7fa4d2701a10, > > offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:127 > > 7 0x00007fa4d6835e3a in xlate_push_stats (xcache=<optimized out>, stats=stats@entry=0x7fa4d2701a10, > > offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:181 > > 8 0x00007fa4d6822046 in revalidate_ukey (udpif=udpif@entry=0x55d90760b240, ukey=ukey@entry=0x7fa4b0191660, > > stats=stats@entry=0x7fa4d2705118, odp_actions=odp_actions@entry=0x7fa4d2701b50, > > reval_seq=reval_seq@entry=5655486242, recircs=recircs@entry=0x7fa4d2701b40, offloaded=false) > > at ofproto/ofproto-dpif-upcall.c:2294 > > 9 0x00007fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:2683 > > 10 0x00007fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:936 > > 11 0x00007fa4d6259c9f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:423 > > 12 0x00007fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0 > > 13 0x00007fa4d504b96d in clone () from /usr/lib64/libc.so.6 > > > > At the time of crash, the involved ofproto was already deallocated: > > > > (gdb) print *ofproto > > $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {..., > > one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ... > > > > This patch fixes it. > > > > VMware-BZ: #2700626 > > Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> > > --- > > v1->v2: Add check for ofmonitor_flush, thanks William. > > > > LGTM, thanks. > Acked-by: William Tu < u9012063@gmail.com> > > CC Ilya and Ben to see if any comments. > > I feel this kind of RCU issue is hard to find out, and > existing tools such as addressSanitizer are usually > not helpful. > > William > > > ofproto/connmgr.c | 6 +++++- > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > diff --git a/ofproto/connmgr.c b/ofproto/connmgr.c > > index 9c5c633b4171..fa8f6cd0e83a 100644 > > --- a/ofproto/connmgr.c > > +++ b/ofproto/connmgr.c > > @@ -2140,7 +2140,7 @@ ofmonitor_report(struct connmgr *mgr, struct rule *rule, > > const struct rule_actions *old_actions) > > OVS_REQUIRES(ofproto_mutex) > > { > > - if (rule_is_hidden(rule)) { > > + if (!mgr || rule_is_hidden(rule)) { > > return; > > } > > > > @@ -2244,6 +2244,10 @@ ofmonitor_flush(struct connmgr *mgr) > > { > > struct ofconn *ofconn; > > > > + if (!mgr) { > > + return; > > + } > > + > > LIST_FOR_EACH (ofconn, connmgr_node, &mgr->conns) { > > struct rconn_packet_counter *counter = ofconn->monitor_counter; > > > > -- > > 2.7.4 > > > > _______________________________________________ > > dev mailing list > > dev@openvswitch.org > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev > _______________________________________________ > dev mailing list > dev@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
On 2/19/21 3:12 AM, 贺鹏 wrote: > Hi, > > Looks like this bug is caused by violating the fact that if a rule is > referenced, the related ofproto should not be destroyed. > > If so, I have a patch that also fixes the problem, not sure if this helps. > > http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/ There is at least one more problem that is not strictly related but in more or less the same part of the code: https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html > > William Tu <u9012063@gmail.com> 于2021年2月19日周五 上午8:10写道: >> >> On Wed, Feb 17, 2021 at 1:09 PM Yifeng Sun <pkusunyifeng@gmail.com> wrote: >>> >>> ovs-vswitchd could crash under these circumstances: >>> 1. When one bridge is being destroyed, ofproto_destroy() is called and >>> connmgr pointer of its ofproto struct is nullified. This ofproto struct is >>> deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'. >>> 2. Before RCU enters quiesce state to actually free this ofproto struct, >>> revalidator thread calls udpif_revalidator(), which could handle >>> a learn flow and calls ofproto_flow_mod_learn(), it later calls >>> ofmonitor_report() and ofproto struct's connmgr pointer is accessed. >>> >>> The crash stack trace is shown below: >>> >>> 0 ofmonitor_report (mgr=0x0, rule=rule@entry=0x7fa4ac067c30, event=event@entry=NXFME_ADDED, >>> reason=reason@entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, abbrev_xid=0, old_actions=old_actions@entry=0x0) >>> at ofproto/connmgr.c:2160 >>> 1 0x00007fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, ofm=<optimized out>, req=req@entry=0x0) >>> at ofproto/ofproto.c:5221 >>> 2 0x00007fa4d68036af in modify_flows_finish (req=0x0, ofm=0x7fa4980753f0, ofproto=0x55d9075d4ab0) >>> at ofproto/ofproto.c:5823 >>> 3 ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, ofm=ofm@entry=0x7fa4980753f0, req=req@entry=0x0) >>> at ofproto/ofproto.c:8088 >>> 4 0x00007fa4d680372d in ofproto_flow_mod_learn_finish (ofm=ofm@entry=0x7fa4980753f0, >>> orig_ofproto=orig_ofproto@entry=0x0) at ofproto/ofproto.c:5439 >>> 5 0x00007fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, keep_ref=keep_ref@entry=true, >>> limit=<optimized out>, below_limitp=below_limitp@entry=0x0) at ofproto/ofproto.c:5499 >>> 6 0x00007fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, stats=stats@entry=0x7fa4d2701a10, >>> offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:127 >>> 7 0x00007fa4d6835e3a in xlate_push_stats (xcache=<optimized out>, stats=stats@entry=0x7fa4d2701a10, >>> offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:181 >>> 8 0x00007fa4d6822046 in revalidate_ukey (udpif=udpif@entry=0x55d90760b240, ukey=ukey@entry=0x7fa4b0191660, >>> stats=stats@entry=0x7fa4d2705118, odp_actions=odp_actions@entry=0x7fa4d2701b50, >>> reval_seq=reval_seq@entry=5655486242, recircs=recircs@entry=0x7fa4d2701b40, offloaded=false) >>> at ofproto/ofproto-dpif-upcall.c:2294 >>> 9 0x00007fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:2683 >>> 10 0x00007fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:936 >>> 11 0x00007fa4d6259c9f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:423 >>> 12 0x00007fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0 >>> 13 0x00007fa4d504b96d in clone () from /usr/lib64/libc.so.6 >>> >>> At the time of crash, the involved ofproto was already deallocated: >>> >>> (gdb) print *ofproto >>> $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {..., >>> one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ... >>> >>> This patch fixes it. >>> >>> VMware-BZ: #2700626 >>> Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> >>> --- >>> v1->v2: Add check for ofmonitor_flush, thanks William. >>> >> >> LGTM, thanks. >> Acked-by: William Tu < u9012063@gmail.com> >> >> CC Ilya and Ben to see if any comments. >> >> I feel this kind of RCU issue is hard to find out, and >> existing tools such as addressSanitizer are usually >> not helpful. >> >> William >> >>> ofproto/connmgr.c | 6 +++++- >>> 1 file changed, 5 insertions(+), 1 deletion(-) >>> >>> diff --git a/ofproto/connmgr.c b/ofproto/connmgr.c >>> index 9c5c633b4171..fa8f6cd0e83a 100644 >>> --- a/ofproto/connmgr.c >>> +++ b/ofproto/connmgr.c >>> @@ -2140,7 +2140,7 @@ ofmonitor_report(struct connmgr *mgr, struct rule *rule, >>> const struct rule_actions *old_actions) >>> OVS_REQUIRES(ofproto_mutex) >>> { >>> - if (rule_is_hidden(rule)) { >>> + if (!mgr || rule_is_hidden(rule)) { >>> return; >>> } >>> >>> @@ -2244,6 +2244,10 @@ ofmonitor_flush(struct connmgr *mgr) >>> { >>> struct ofconn *ofconn; >>> >>> + if (!mgr) { >>> + return; >>> + } >>> + >>> LIST_FOR_EACH (ofconn, connmgr_node, &mgr->conns) { >>> struct rconn_packet_counter *counter = ofconn->monitor_counter; >>> >>> -- >>> 2.7.4 >>> >>> _______________________________________________ >>> dev mailing list >>> dev@openvswitch.org >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >> _______________________________________________ >> dev mailing list >> dev@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > >
Hi, Ilya Ilya Maximets <i.maximets@ovn.org> 于2021年2月19日周五 下午7:19写道: > > On 2/19/21 3:12 AM, 贺鹏 wrote: > > Hi, > > > > Looks like this bug is caused by violating the fact that if a rule is > > referenced, the related ofproto should not be destroyed. > > > > If so, I have a patch that also fixes the problem, not sure if this helps. > > > > http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/ > > There is at least one more problem that is not strictly related but > in more or less the same part of the code: > https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html So maybe before add it into *ovsrcu_postpone*, we should add refcount of ofproto also? > > > > > William Tu <u9012063@gmail.com> 于2021年2月19日周五 上午8:10写道: > >> > >> On Wed, Feb 17, 2021 at 1:09 PM Yifeng Sun <pkusunyifeng@gmail.com> wrote: > >>> > >>> ovs-vswitchd could crash under these circumstances: > >>> 1. When one bridge is being destroyed, ofproto_destroy() is called and > >>> connmgr pointer of its ofproto struct is nullified. This ofproto struct is > >>> deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'. > >>> 2. Before RCU enters quiesce state to actually free this ofproto struct, > >>> revalidator thread calls udpif_revalidator(), which could handle > >>> a learn flow and calls ofproto_flow_mod_learn(), it later calls > >>> ofmonitor_report() and ofproto struct's connmgr pointer is accessed. > >>> > >>> The crash stack trace is shown below: > >>> > >>> 0 ofmonitor_report (mgr=0x0, rule=rule@entry=0x7fa4ac067c30, event=event@entry=NXFME_ADDED, > >>> reason=reason@entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, abbrev_xid=0, old_actions=old_actions@entry=0x0) > >>> at ofproto/connmgr.c:2160 > >>> 1 0x00007fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, ofm=<optimized out>, req=req@entry=0x0) > >>> at ofproto/ofproto.c:5221 > >>> 2 0x00007fa4d68036af in modify_flows_finish (req=0x0, ofm=0x7fa4980753f0, ofproto=0x55d9075d4ab0) > >>> at ofproto/ofproto.c:5823 > >>> 3 ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, ofm=ofm@entry=0x7fa4980753f0, req=req@entry=0x0) > >>> at ofproto/ofproto.c:8088 > >>> 4 0x00007fa4d680372d in ofproto_flow_mod_learn_finish (ofm=ofm@entry=0x7fa4980753f0, > >>> orig_ofproto=orig_ofproto@entry=0x0) at ofproto/ofproto.c:5439 > >>> 5 0x00007fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, keep_ref=keep_ref@entry=true, > >>> limit=<optimized out>, below_limitp=below_limitp@entry=0x0) at ofproto/ofproto.c:5499 > >>> 6 0x00007fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, stats=stats@entry=0x7fa4d2701a10, > >>> offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:127 > >>> 7 0x00007fa4d6835e3a in xlate_push_stats (xcache=<optimized out>, stats=stats@entry=0x7fa4d2701a10, > >>> offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:181 > >>> 8 0x00007fa4d6822046 in revalidate_ukey (udpif=udpif@entry=0x55d90760b240, ukey=ukey@entry=0x7fa4b0191660, > >>> stats=stats@entry=0x7fa4d2705118, odp_actions=odp_actions@entry=0x7fa4d2701b50, > >>> reval_seq=reval_seq@entry=5655486242, recircs=recircs@entry=0x7fa4d2701b40, offloaded=false) > >>> at ofproto/ofproto-dpif-upcall.c:2294 > >>> 9 0x00007fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:2683 > >>> 10 0x00007fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:936 > >>> 11 0x00007fa4d6259c9f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:423 > >>> 12 0x00007fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0 > >>> 13 0x00007fa4d504b96d in clone () from /usr/lib64/libc.so.6 > >>> > >>> At the time of crash, the involved ofproto was already deallocated: > >>> > >>> (gdb) print *ofproto > >>> $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {..., > >>> one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ... > >>> > >>> This patch fixes it. > >>> > >>> VMware-BZ: #2700626 > >>> Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> > >>> --- > >>> v1->v2: Add check for ofmonitor_flush, thanks William. > >>> > >> > >> LGTM, thanks. > >> Acked-by: William Tu < u9012063@gmail.com> > >> > >> CC Ilya and Ben to see if any comments. > >> > >> I feel this kind of RCU issue is hard to find out, and > >> existing tools such as addressSanitizer are usually > >> not helpful. > >> > >> William > >> > >>> ofproto/connmgr.c | 6 +++++- > >>> 1 file changed, 5 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/ofproto/connmgr.c b/ofproto/connmgr.c > >>> index 9c5c633b4171..fa8f6cd0e83a 100644 > >>> --- a/ofproto/connmgr.c > >>> +++ b/ofproto/connmgr.c > >>> @@ -2140,7 +2140,7 @@ ofmonitor_report(struct connmgr *mgr, struct rule *rule, > >>> const struct rule_actions *old_actions) > >>> OVS_REQUIRES(ofproto_mutex) > >>> { > >>> - if (rule_is_hidden(rule)) { > >>> + if (!mgr || rule_is_hidden(rule)) { > >>> return; > >>> } > >>> > >>> @@ -2244,6 +2244,10 @@ ofmonitor_flush(struct connmgr *mgr) > >>> { > >>> struct ofconn *ofconn; > >>> > >>> + if (!mgr) { > >>> + return; > >>> + } > >>> + > >>> LIST_FOR_EACH (ofconn, connmgr_node, &mgr->conns) { > >>> struct rconn_packet_counter *counter = ofconn->monitor_counter; > >>> > >>> -- > >>> 2.7.4 > >>> > >>> _______________________________________________ > >>> dev mailing list > >>> dev@openvswitch.org > >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >> _______________________________________________ > >> dev mailing list > >> dev@openvswitch.org > >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > > > > > >
On Fri, Feb 19, 2021 at 6:29 PM 贺鹏 <xnhp0320@gmail.com> wrote: > > Hi, Ilya > > Ilya Maximets <i.maximets@ovn.org> 于2021年2月19日周五 下午7:19写道: > > > > On 2/19/21 3:12 AM, 贺鹏 wrote: > > > Hi, > > > > > > Looks like this bug is caused by violating the fact that if a rule is > > > referenced, the related ofproto should not be destroyed. > > > > > > If so, I have a patch that also fixes the problem, not sure if this helps. > > > > > > http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/ > > > > There is at least one more problem that is not strictly related but > > in more or less the same part of the code: > > https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html > > So maybe before add it into *ovsrcu_postpone*, we should add refcount > of ofproto also? > Yes, I think that will fix the issue. But are we able to find out all the places that we need to add refcount of ofproto? Looks like we might have multiple rcu postponed function that might access the 'ofproto'. And 'ofproto' might be freed already and cause segfault. Hepeng's patch fixes two places. http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/ Ilya pointed out another place https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html yifeng's case is a little different (not due to ofproto = NULL, but due to setting the p->connmgr = NULL before postponed) Regards, William
On 2/21/21 5:11 PM, William Tu wrote: > On Fri, Feb 19, 2021 at 6:29 PM 贺鹏 <xnhp0320@gmail.com> wrote: >> >> Hi, Ilya >> >> Ilya Maximets <i.maximets@ovn.org> 于2021年2月19日周五 下午7:19写道: >>> >>> On 2/19/21 3:12 AM, 贺鹏 wrote: >>>> Hi, >>>> >>>> Looks like this bug is caused by violating the fact that if a rule is >>>> referenced, the related ofproto should not be destroyed. >>>> >>>> If so, I have a patch that also fixes the problem, not sure if this helps. >>>> >>>> http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/ >>> >>> There is at least one more problem that is not strictly related but >>> in more or less the same part of the code: >>> https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html >> >> So maybe before add it into *ovsrcu_postpone*, we should add refcount >> of ofproto also? >> > Yes, I think that will fix the issue. But are we able to find out all the > places that we need to add refcount of ofproto? destruct() is called for ofproto_dpif unconditionally without any deferring, so this will not help unless we're delaying the the destruct() itself, and I'm not sure if we can do that. > > Looks like we might have multiple rcu postponed function that might > access the 'ofproto'. And 'ofproto' might be freed already and cause segfault. > > Hepeng's patch fixes two places. > http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/ > Ilya pointed out another place > https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html > yifeng's case is a little different (not due to ofproto = NULL, but > due to setting > the p->connmgr = NULL before postponed) In my case ofproto is alive too. The problem is destroyed ofproto->baker. And in case of meter_id we don't really need to refcount the ofproto itself. 'baker' already has a refcount, so we could increase it and pass 'baker' pointer to free_meter_id instead of 'ofproto'. This function doesn't actually need an 'ofproto' pointer. Regarding this connmgr related patch... It's still valid even with refcounts because destruction of connmgr is explicitly not deferred for a reason. It's to free the listening socket that users might want to reuse. Refcounts will not help here. Also for the meters related issue.. destruct() is called for ofproto_dpif without any postponing and I'm not sure if we need or actually able to postpone it without consequences. So, baker->refcount might be a better solution keeping the ofproto->refcount only for rules. This will reduce the scope of changes, otherwise we will have to do a lot more work tracking down all the users that holds 'ofproto' pointer and will miss some of them with high probability. I'll recheck this (connmgr) patch and, likely, apply it. Will also review ofproto refcount patch. Best regards, Ilya Maximets.
Hi, Ilya and William, Ilya Maximets <i.maximets@ovn.org> 于2021年2月22日周一 下午8:12写道: > > On 2/21/21 5:11 PM, William Tu wrote: > > On Fri, Feb 19, 2021 at 6:29 PM 贺鹏 <xnhp0320@gmail.com> wrote: > >> > >> Hi, Ilya > >> > >> Ilya Maximets <i.maximets@ovn.org> 于2021年2月19日周五 下午7:19写道: > >>> > >>> On 2/19/21 3:12 AM, 贺鹏 wrote: > >>>> Hi, > >>>> > >>>> Looks like this bug is caused by violating the fact that if a rule is > >>>> referenced, the related ofproto should not be destroyed. > >>>> > >>>> If so, I have a patch that also fixes the problem, not sure if this helps. > >>>> > >>>> http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/ > >>> > >>> There is at least one more problem that is not strictly related but > >>> in more or less the same part of the code: > >>> https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html > >> > >> So maybe before add it into *ovsrcu_postpone*, we should add refcount > >> of ofproto also? > >> > > Yes, I think that will fix the issue. But are we able to find out all the > > places that we need to add refcount of ofproto? Yes, this could be difficult. > > destruct() is called for ofproto_dpif unconditionally without any deferring, > so this will not help unless we're delaying the the destruct() itself, and > I'm not sure if we can do that. > > > > > Looks like we might have multiple rcu postponed function that might > > access the 'ofproto'. And 'ofproto' might be freed already and cause segfault. > > > > Hepeng's patch fixes two places. > > http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/ > > Ilya pointed out another place > > https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html > > yifeng's case is a little different (not due to ofproto = NULL, but > > due to setting > > the p->connmgr = NULL before postponed) > > In my case ofproto is alive too. The problem is destroyed ofproto->baker. > And in case of meter_id we don't really need to refcount the ofproto itself. > 'baker' already has a refcount, so we could increase it and pass 'baker' > pointer to free_meter_id instead of 'ofproto'. This function doesn't > actually need an 'ofproto' pointer. > > Regarding this connmgr related patch... It's still valid even with refcounts > because destruction of connmgr is explicitly not deferred for a reason. It's > to free the listening socket that users might want to reuse. Refcounts will > not help here. > Also for the meters related issue.. destruct() is called for ofproto_dpif > without any postponing and I'm not sure if we need or actually able to > postpone it without consequences. So, baker->refcount might be a better > solution keeping the ofproto->refcount only for rules. This will reduce > the scope of changes, otherwise we will have to do a lot more work tracking > down all the users that holds 'ofproto' pointer and will miss some of them > with high probability. Agree. I recheck the patch. It's a better idea to use ofproto->backer's refcount. But in the rule case, looks like the rule->ofproto needs to be alive to access both ofproto->vl_mff_map and ofproto->ofproto_class. static void rule_destroy_cb(struct rule *rule) OVS_NO_THREAD_SAFETY_ANALYSIS { /* Send rule removed if needed. */ if (rule->flags & OFPUTIL_FF_SEND_FLOW_REM && rule->removed_reason != OVS_OFPRR_NONE && !rule_is_hidden(rule)) { ofproto_rule_send_removed(rule); } rule->ofproto->ofproto_class->rule_destruct(rule); mf_vl_mff_unref(&rule->ofproto->vl_mff_map, rule->match_tlv_bitmap); mf_vl_mff_unref(&rule->ofproto->vl_mff_map, rule->ofpacts_tlv_bitmap); ofproto_rule_destroy__(rule); } Maybe add refcount into vl_mff_map and ofproto_class? > > I'll recheck this (connmgr) patch and, likely, apply it. Will also review > ofproto refcount patch. > > Best regards, Ilya Maximets.
On 2/19/21 1:09 AM, William Tu wrote: > On Wed, Feb 17, 2021 at 1:09 PM Yifeng Sun <pkusunyifeng@gmail.com> wrote: >> >> ovs-vswitchd could crash under these circumstances: >> 1. When one bridge is being destroyed, ofproto_destroy() is called and >> connmgr pointer of its ofproto struct is nullified. This ofproto struct is >> deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'. >> 2. Before RCU enters quiesce state to actually free this ofproto struct, >> revalidator thread calls udpif_revalidator(), which could handle >> a learn flow and calls ofproto_flow_mod_learn(), it later calls >> ofmonitor_report() and ofproto struct's connmgr pointer is accessed. >> >> The crash stack trace is shown below: >> >> 0 ofmonitor_report (mgr=0x0, rule=rule@entry=0x7fa4ac067c30, event=event@entry=NXFME_ADDED, >> reason=reason@entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, abbrev_xid=0, old_actions=old_actions@entry=0x0) >> at ofproto/connmgr.c:2160 >> 1 0x00007fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, ofm=<optimized out>, req=req@entry=0x0) >> at ofproto/ofproto.c:5221 >> 2 0x00007fa4d68036af in modify_flows_finish (req=0x0, ofm=0x7fa4980753f0, ofproto=0x55d9075d4ab0) >> at ofproto/ofproto.c:5823 >> 3 ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, ofm=ofm@entry=0x7fa4980753f0, req=req@entry=0x0) >> at ofproto/ofproto.c:8088 >> 4 0x00007fa4d680372d in ofproto_flow_mod_learn_finish (ofm=ofm@entry=0x7fa4980753f0, >> orig_ofproto=orig_ofproto@entry=0x0) at ofproto/ofproto.c:5439 >> 5 0x00007fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, keep_ref=keep_ref@entry=true, >> limit=<optimized out>, below_limitp=below_limitp@entry=0x0) at ofproto/ofproto.c:5499 >> 6 0x00007fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, stats=stats@entry=0x7fa4d2701a10, >> offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:127 >> 7 0x00007fa4d6835e3a in xlate_push_stats (xcache=<optimized out>, stats=stats@entry=0x7fa4d2701a10, >> offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:181 >> 8 0x00007fa4d6822046 in revalidate_ukey (udpif=udpif@entry=0x55d90760b240, ukey=ukey@entry=0x7fa4b0191660, >> stats=stats@entry=0x7fa4d2705118, odp_actions=odp_actions@entry=0x7fa4d2701b50, >> reval_seq=reval_seq@entry=5655486242, recircs=recircs@entry=0x7fa4d2701b40, offloaded=false) >> at ofproto/ofproto-dpif-upcall.c:2294 >> 9 0x00007fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:2683 >> 10 0x00007fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:936 >> 11 0x00007fa4d6259c9f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:423 >> 12 0x00007fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0 >> 13 0x00007fa4d504b96d in clone () from /usr/lib64/libc.so.6 >> >> At the time of crash, the involved ofproto was already deallocated: >> >> (gdb) print *ofproto >> $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {..., >> one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ... >> >> This patch fixes it. >> >> VMware-BZ: #2700626 >> Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> >> --- >> v1->v2: Add check for ofmonitor_flush, thanks William. >> > > LGTM, thanks. > Acked-by: William Tu < u9012063@gmail.com> Thanks! Applied to master and backported down to 2.12. Best regards, Ilya Maximets.
On 2/22/21 2:05 PM, 贺鹏 wrote: > Hi, Ilya and William, > > Ilya Maximets <i.maximets@ovn.org> 于2021年2月22日周一 下午8:12写道: >> >> On 2/21/21 5:11 PM, William Tu wrote: >>> On Fri, Feb 19, 2021 at 6:29 PM 贺鹏 <xnhp0320@gmail.com> wrote: >>>> >>>> Hi, Ilya >>>> >>>> Ilya Maximets <i.maximets@ovn.org> 于2021年2月19日周五 下午7:19写道: >>>>> >>>>> On 2/19/21 3:12 AM, 贺鹏 wrote: >>>>>> Hi, >>>>>> >>>>>> Looks like this bug is caused by violating the fact that if a rule is >>>>>> referenced, the related ofproto should not be destroyed. >>>>>> >>>>>> If so, I have a patch that also fixes the problem, not sure if this helps. >>>>>> >>>>>> http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/ >>>>> >>>>> There is at least one more problem that is not strictly related but >>>>> in more or less the same part of the code: >>>>> https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html >>>> >>>> So maybe before add it into *ovsrcu_postpone*, we should add refcount >>>> of ofproto also? >>>> >>> Yes, I think that will fix the issue. But are we able to find out all the >>> places that we need to add refcount of ofproto? > > Yes, this could be difficult. > >> >> destruct() is called for ofproto_dpif unconditionally without any deferring, >> so this will not help unless we're delaying the the destruct() itself, and >> I'm not sure if we can do that. >> >>> >>> Looks like we might have multiple rcu postponed function that might >>> access the 'ofproto'. And 'ofproto' might be freed already and cause segfault. >>> >>> Hepeng's patch fixes two places. >>> http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/ >>> Ilya pointed out another place >>> https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html >>> yifeng's case is a little different (not due to ofproto = NULL, but >>> due to setting >>> the p->connmgr = NULL before postponed) >> >> In my case ofproto is alive too. The problem is destroyed ofproto->baker. >> And in case of meter_id we don't really need to refcount the ofproto itself. >> 'baker' already has a refcount, so we could increase it and pass 'baker' >> pointer to free_meter_id instead of 'ofproto'. This function doesn't >> actually need an 'ofproto' pointer. >> >> Regarding this connmgr related patch... It's still valid even with refcounts >> because destruction of connmgr is explicitly not deferred for a reason. It's >> to free the listening socket that users might want to reuse. Refcounts will >> not help here. >> Also for the meters related issue.. destruct() is called for ofproto_dpif >> without any postponing and I'm not sure if we need or actually able to >> postpone it without consequences. So, baker->refcount might be a better >> solution keeping the ofproto->refcount only for rules. This will reduce >> the scope of changes, otherwise we will have to do a lot more work tracking >> down all the users that holds 'ofproto' pointer and will miss some of them >> with high probability. > > Agree. > I recheck the patch. It's a better idea to use ofproto->backer's refcount. > But in the rule case, looks like the rule->ofproto needs to be alive > to access both ofproto->vl_mff_map and ofproto->ofproto_class. > > static void > rule_destroy_cb(struct rule *rule) > OVS_NO_THREAD_SAFETY_ANALYSIS > { > /* Send rule removed if needed. */ > if (rule->flags & OFPUTIL_FF_SEND_FLOW_REM > && rule->removed_reason != OVS_OFPRR_NONE > && !rule_is_hidden(rule)) { > ofproto_rule_send_removed(rule); > } > rule->ofproto->ofproto_class->rule_destruct(rule); > mf_vl_mff_unref(&rule->ofproto->vl_mff_map, rule->match_tlv_bitmap); > mf_vl_mff_unref(&rule->ofproto->vl_mff_map, rule->ofpacts_tlv_bitmap); > ofproto_rule_destroy__(rule); > } > > Maybe add refcount into vl_mff_map and ofproto_class? IMHO, too many refcounts. It's better to have one global for ofproto. 'baker' could be handled by its own refcount. > >> >> I'll recheck this (connmgr) patch and, likely, apply it. Will also review >> ofproto refcount patch. >> >> Best regards, Ilya Maximets. > > >
diff --git a/ofproto/connmgr.c b/ofproto/connmgr.c index 9c5c633b4171..fa8f6cd0e83a 100644 --- a/ofproto/connmgr.c +++ b/ofproto/connmgr.c @@ -2140,7 +2140,7 @@ ofmonitor_report(struct connmgr *mgr, struct rule *rule, const struct rule_actions *old_actions) OVS_REQUIRES(ofproto_mutex) { - if (rule_is_hidden(rule)) { + if (!mgr || rule_is_hidden(rule)) { return; } @@ -2244,6 +2244,10 @@ ofmonitor_flush(struct connmgr *mgr) { struct ofconn *ofconn; + if (!mgr) { + return; + } + LIST_FOR_EACH (ofconn, connmgr_node, &mgr->conns) { struct rconn_packet_counter *counter = ofconn->monitor_counter;
ovs-vswitchd could crash under these circumstances: 1. When one bridge is being destroyed, ofproto_destroy() is called and connmgr pointer of its ofproto struct is nullified. This ofproto struct is deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'. 2. Before RCU enters quiesce state to actually free this ofproto struct, revalidator thread calls udpif_revalidator(), which could handle a learn flow and calls ofproto_flow_mod_learn(), it later calls ofmonitor_report() and ofproto struct's connmgr pointer is accessed. The crash stack trace is shown below: 0 ofmonitor_report (mgr=0x0, rule=rule@entry=0x7fa4ac067c30, event=event@entry=NXFME_ADDED, reason=reason@entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, abbrev_xid=0, old_actions=old_actions@entry=0x0) at ofproto/connmgr.c:2160 1 0x00007fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, ofm=<optimized out>, req=req@entry=0x0) at ofproto/ofproto.c:5221 2 0x00007fa4d68036af in modify_flows_finish (req=0x0, ofm=0x7fa4980753f0, ofproto=0x55d9075d4ab0) at ofproto/ofproto.c:5823 3 ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, ofm=ofm@entry=0x7fa4980753f0, req=req@entry=0x0) at ofproto/ofproto.c:8088 4 0x00007fa4d680372d in ofproto_flow_mod_learn_finish (ofm=ofm@entry=0x7fa4980753f0, orig_ofproto=orig_ofproto@entry=0x0) at ofproto/ofproto.c:5439 5 0x00007fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, keep_ref=keep_ref@entry=true, limit=<optimized out>, below_limitp=below_limitp@entry=0x0) at ofproto/ofproto.c:5499 6 0x00007fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, stats=stats@entry=0x7fa4d2701a10, offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:127 7 0x00007fa4d6835e3a in xlate_push_stats (xcache=<optimized out>, stats=stats@entry=0x7fa4d2701a10, offloaded=offloaded@entry=false) at ofproto/ofproto-dpif-xlate-cache.c:181 8 0x00007fa4d6822046 in revalidate_ukey (udpif=udpif@entry=0x55d90760b240, ukey=ukey@entry=0x7fa4b0191660, stats=stats@entry=0x7fa4d2705118, odp_actions=odp_actions@entry=0x7fa4d2701b50, reval_seq=reval_seq@entry=5655486242, recircs=recircs@entry=0x7fa4d2701b40, offloaded=false) at ofproto/ofproto-dpif-upcall.c:2294 9 0x00007fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:2683 10 0x00007fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:936 11 0x00007fa4d6259c9f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:423 12 0x00007fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0 13 0x00007fa4d504b96d in clone () from /usr/lib64/libc.so.6 At the time of crash, the involved ofproto was already deallocated: (gdb) print *ofproto $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {..., one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ... This patch fixes it. VMware-BZ: #2700626 Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> --- v1->v2: Add check for ofmonitor_flush, thanks William. ofproto/connmgr.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)