Message ID | 1518634569-5209-1-git-send-email-u9012063@gmail.com |
---|---|
State | Changes Requested |
Headers | show |
Series | [ovs-dev,PATCHv2,1/2] xlate: auto ofproto trace when recursion too deep | expand |
On Wed, Feb 14, 2018 at 10:56:08AM -0800, William Tu wrote: > Usually ofproto/trace is used to debug the flow translation error. > When translation error such as recursion too deep or too many resubmit, > the issue might happen momentary; flows causing the recursion expire > when users try to debug it. This patch enables the ofproto trace > automatically when recursion is too deep or too many resubmit, by > invoking the translation again, and log the ofproto trace as warnings. > Since the log will be huge, rate limit to one per minute. > > VMWare-BZ: #2054659 > Signed-off-by: William Tu <u9012063@gmail.com> > Tested-by: Greg Rose <gvrose8192@gmail.com> > Reviewed-by: Greg Rose <gvrose8192@gmail.com> Thanks for working on this! Some of the data passed to ofproto_trace() is also passed to the xlate_actions() call, indirectly. Did you check whether that data is possibly modified by xlate_actions()? If it is, then we might have to reconsider this approach, because flow data, etc. is very large and I don't think that we can afford to always make a copy of it in advance on the chance that the original might be needed for tracing later. I think that VLOG_WARN is a very high log level for this data. I would tend to use DBG. Thanks, Ben.
On Wed, Feb 14, 2018 at 3:03 PM, Ben Pfaff <blp@ovn.org> wrote: > On Wed, Feb 14, 2018 at 10:56:08AM -0800, William Tu wrote: >> Usually ofproto/trace is used to debug the flow translation error. >> When translation error such as recursion too deep or too many resubmit, >> the issue might happen momentary; flows causing the recursion expire >> when users try to debug it. This patch enables the ofproto trace >> automatically when recursion is too deep or too many resubmit, by >> invoking the translation again, and log the ofproto trace as warnings. >> Since the log will be huge, rate limit to one per minute. >> >> VMWare-BZ: #2054659 >> Signed-off-by: William Tu <u9012063@gmail.com> >> Tested-by: Greg Rose <gvrose8192@gmail.com> >> Reviewed-by: Greg Rose <gvrose8192@gmail.com> > > Thanks for working on this! > > Some of the data passed to ofproto_trace() is also passed to the > xlate_actions() call, indirectly. Did you check whether that data is > possibly modified by xlate_actions()? If it is, then we might have to > reconsider this approach, because flow data, etc. is very large and I > don't think that we can afford to always make a copy of it in advance on > the chance that the original might be needed for tracing later. We pass the "const struct flow *" and "const struct dp_packet *" to the ofproto_trace(), so I think these two data is unmodified and xlate_in_init() actually make a copy of the "struct flow", so xlate_actions() modifies a separate copy of the flow. > > I think that VLOG_WARN is a very high log level for this data. I would > tend to use DBG. > I worried that most of the time DBG is not on, and asking people to turn it on and reproduce the issue is hard. Can we keep VLOG_WARN and rate limit to very low frequent, "VLOG_RATE_LIMIT_INIT(1, 5);",? Thanks William
On Mon, Feb 26, 2018 at 01:43:39PM -0800, William Tu wrote: > On Wed, Feb 14, 2018 at 3:03 PM, Ben Pfaff <blp@ovn.org> wrote: > > On Wed, Feb 14, 2018 at 10:56:08AM -0800, William Tu wrote: > >> Usually ofproto/trace is used to debug the flow translation error. > >> When translation error such as recursion too deep or too many resubmit, > >> the issue might happen momentary; flows causing the recursion expire > >> when users try to debug it. This patch enables the ofproto trace > >> automatically when recursion is too deep or too many resubmit, by > >> invoking the translation again, and log the ofproto trace as warnings. > >> Since the log will be huge, rate limit to one per minute. > >> > >> VMWare-BZ: #2054659 > >> Signed-off-by: William Tu <u9012063@gmail.com> > >> Tested-by: Greg Rose <gvrose8192@gmail.com> > >> Reviewed-by: Greg Rose <gvrose8192@gmail.com> > > > > Thanks for working on this! > > > > Some of the data passed to ofproto_trace() is also passed to the > > xlate_actions() call, indirectly. Did you check whether that data is > > possibly modified by xlate_actions()? If it is, then we might have to > > reconsider this approach, because flow data, etc. is very large and I > > don't think that we can afford to always make a copy of it in advance on > > the chance that the original might be needed for tracing later. > > We pass the "const struct flow *" and "const struct dp_packet *" to > the ofproto_trace(), > so I think these two data is unmodified and xlate_in_init() actually make a copy > of the "struct flow", so xlate_actions() modifies a separate copy of the flow. OK. > > I think that VLOG_WARN is a very high log level for this data. I would > > tend to use DBG. > > > I worried that most of the time DBG is not on, and asking people > to turn it on and reproduce the issue is hard. Can we keep VLOG_WARN and > rate limit to very low frequent, "VLOG_RATE_LIMIT_INIT(1, 5);",? OK, let's try that.
On Wed, Feb 28, 2018 at 2:14 PM, Ben Pfaff <blp@ovn.org> wrote: > On Mon, Feb 26, 2018 at 01:43:39PM -0800, William Tu wrote: >> On Wed, Feb 14, 2018 at 3:03 PM, Ben Pfaff <blp@ovn.org> wrote: >> > On Wed, Feb 14, 2018 at 10:56:08AM -0800, William Tu wrote: >> >> Usually ofproto/trace is used to debug the flow translation error. >> >> When translation error such as recursion too deep or too many resubmit, >> >> the issue might happen momentary; flows causing the recursion expire >> >> when users try to debug it. This patch enables the ofproto trace >> >> automatically when recursion is too deep or too many resubmit, by >> >> invoking the translation again, and log the ofproto trace as warnings. >> >> Since the log will be huge, rate limit to one per minute. >> >> >> >> VMWare-BZ: #2054659 >> >> Signed-off-by: William Tu <u9012063@gmail.com> >> >> Tested-by: Greg Rose <gvrose8192@gmail.com> >> >> Reviewed-by: Greg Rose <gvrose8192@gmail.com> >> > >> > Thanks for working on this! >> > >> > Some of the data passed to ofproto_trace() is also passed to the >> > xlate_actions() call, indirectly. Did you check whether that data is >> > possibly modified by xlate_actions()? If it is, then we might have to >> > reconsider this approach, because flow data, etc. is very large and I >> > don't think that we can afford to always make a copy of it in advance on >> > the chance that the original might be needed for tracing later. >> >> We pass the "const struct flow *" and "const struct dp_packet *" to >> the ofproto_trace(), >> so I think these two data is unmodified and xlate_in_init() actually make a copy >> of the "struct flow", so xlate_actions() modifies a separate copy of the flow. > > OK. > >> > I think that VLOG_WARN is a very high log level for this data. I would >> > tend to use DBG. >> > >> I worried that most of the time DBG is not on, and asking people >> to turn it on and reproduce the issue is hard. Can we keep VLOG_WARN and >> rate limit to very low frequent, "VLOG_RATE_LIMIT_INIT(1, 5);",? > > OK, let's try that. Thanks Ben. Should I send another version of the patch? William
On Wed, Feb 28, 2018 at 03:01:24PM -0800, William Tu wrote: > On Wed, Feb 28, 2018 at 2:14 PM, Ben Pfaff <blp@ovn.org> wrote: > > On Mon, Feb 26, 2018 at 01:43:39PM -0800, William Tu wrote: > >> On Wed, Feb 14, 2018 at 3:03 PM, Ben Pfaff <blp@ovn.org> wrote: > >> > On Wed, Feb 14, 2018 at 10:56:08AM -0800, William Tu wrote: > >> >> Usually ofproto/trace is used to debug the flow translation error. > >> >> When translation error such as recursion too deep or too many resubmit, > >> >> the issue might happen momentary; flows causing the recursion expire > >> >> when users try to debug it. This patch enables the ofproto trace > >> >> automatically when recursion is too deep or too many resubmit, by > >> >> invoking the translation again, and log the ofproto trace as warnings. > >> >> Since the log will be huge, rate limit to one per minute. > >> >> > >> >> VMWare-BZ: #2054659 > >> >> Signed-off-by: William Tu <u9012063@gmail.com> > >> >> Tested-by: Greg Rose <gvrose8192@gmail.com> > >> >> Reviewed-by: Greg Rose <gvrose8192@gmail.com> > >> > > >> > Thanks for working on this! > >> > > >> > Some of the data passed to ofproto_trace() is also passed to the > >> > xlate_actions() call, indirectly. Did you check whether that data is > >> > possibly modified by xlate_actions()? If it is, then we might have to > >> > reconsider this approach, because flow data, etc. is very large and I > >> > don't think that we can afford to always make a copy of it in advance on > >> > the chance that the original might be needed for tracing later. > >> > >> We pass the "const struct flow *" and "const struct dp_packet *" to > >> the ofproto_trace(), > >> so I think these two data is unmodified and xlate_in_init() actually make a copy > >> of the "struct flow", so xlate_actions() modifies a separate copy of the flow. > > > > OK. > > > >> > I think that VLOG_WARN is a very high log level for this data. I would > >> > tend to use DBG. > >> > > >> I worried that most of the time DBG is not on, and asking people > >> to turn it on and reproduce the issue is hard. Can we keep VLOG_WARN and > >> rate limit to very low frequent, "VLOG_RATE_LIMIT_INIT(1, 5);",? > > > > OK, let's try that. > > Thanks Ben. > Should I send another version of the patch? Yes please.
diff --git a/ofproto/ofproto-dpif-trace.c b/ofproto/ofproto-dpif-trace.c index 4071cc0ffd65..7211eac8e1a5 100644 --- a/ofproto/ofproto-dpif-trace.c +++ b/ofproto/ofproto-dpif-trace.c @@ -23,11 +23,6 @@ #include "ofproto-dpif-xlate.h" #include "unixctl.h" -static void ofproto_trace(struct ofproto_dpif *, const struct flow *, - const struct dp_packet *packet, - const struct ofpact[], size_t ofpacts_len, - struct ovs_list *next_ct_states, - struct ds *); static void oftrace_node_destroy(struct oftrace_node *); /* Creates a new oftrace_node, populates it with the given 'type' and a copy of @@ -661,7 +656,7 @@ ofproto_trace__(struct ofproto_dpif *ofproto, const struct flow *flow, * * If 'ofpacts' is nonnull then its 'ofpacts_len' bytes specify the actions to * trace, otherwise the actions are determined by a flow table lookup. */ -static void +void ofproto_trace(struct ofproto_dpif *ofproto, const struct flow *flow, const struct dp_packet *packet, const struct ofpact ofpacts[], size_t ofpacts_len, diff --git a/ofproto/ofproto-dpif-trace.h b/ofproto/ofproto-dpif-trace.h index 5e51771b19b0..ea6cb3fc01ac 100644 --- a/ofproto/ofproto-dpif-trace.h +++ b/ofproto/ofproto-dpif-trace.h @@ -28,6 +28,8 @@ * each action (OFT_ACTION) executed in the table. */ +#include "openvswitch/dynamic-string.h" +#include "ofproto/ofproto-dpif.h" #include "openvswitch/compiler.h" #include "openvswitch/list.h" #include "flow.h" @@ -79,6 +81,10 @@ struct oftrace_next_ct_state { }; void ofproto_dpif_trace_init(void); +void ofproto_trace(struct ofproto_dpif *ofproto, const struct flow *flow, + const struct dp_packet *packet, + const struct ofpact *, size_t ofpacts_len, + struct ovs_list *next_ct_states, struct ds *output); struct oftrace_node *oftrace_report(struct ovs_list *, enum oftrace_node_type, const char *text); diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c index 5eb20f7cc236..1741741fb5c5 100644 --- a/ofproto/ofproto-dpif-upcall.c +++ b/ofproto/ofproto-dpif-upcall.c @@ -34,6 +34,7 @@ #include "ofproto-dpif-sflow.h" #include "ofproto-dpif-xlate.h" #include "ofproto-dpif-xlate-cache.h" +#include "ofproto-dpif-trace.h" #include "ovs-rcu.h" #include "packets.h" #include "openvswitch/poll-loop.h" @@ -1130,7 +1131,9 @@ upcall_xlate(struct udpif *udpif, struct upcall *upcall, struct ofpbuf *odp_actions, struct flow_wildcards *wc) { struct dpif_flow_stats stats; + enum xlate_error xerr; struct xlate_in xin; + struct ds output; stats.n_packets = 1; stats.n_bytes = dp_packet_size(upcall->packet); @@ -1165,7 +1168,24 @@ upcall_xlate(struct udpif *udpif, struct upcall *upcall, upcall->dump_seq = seq_read(udpif->dump_seq); upcall->reval_seq = seq_read(udpif->reval_seq); - xlate_actions(&xin, &upcall->xout); + xerr = xlate_actions(&xin, &upcall->xout); + + /* Translate again and log the ofproto trace for + * these two error types. */ + if (xerr == XLATE_RECURSION_TOO_DEEP || + xerr == XLATE_TOO_MANY_RESUBMITS) { + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5); + + /* This is a huge log, so be conservative. */ + if (!VLOG_DROP_WARN(&rl)) { + ds_init(&output); + ofproto_trace(upcall->ofproto, upcall->flow, + upcall->packet, NULL, 0, NULL, &output); + VLOG_WARN("%s", ds_cstr(&output)); + ds_destroy(&output); + } + } + if (wc) { /* Convert the input port wildcard from OFP to ODP format. There's no * real way to do this for arbitrary bitmasks since the numbering spaces diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at index 600afdda8528..fd78c5d9b593 100644 --- a/tests/ofproto-dpif.at +++ b/tests/ofproto-dpif.at @@ -8473,6 +8473,31 @@ AT_CHECK([grep -c 'over max translation depth 64' stdout], OVS_VSWITCHD_STOP(["/resubmit actions recursed/d"]) AT_CLEANUP +dnl Without using ofproto/trace, make sure the +dnl ofproto trace is still logged +AT_SETUP([ofproto-dpif - backward resubmit without trace]) +OVS_VSWITCHD_START +(echo "table=0, actions=resubmit(,66)" + for i in `seq 2 66`; do + j=`expr $i - 1` + echo "table=$i, actions=resubmit(,$j)" + done + echo "table=1, actions=local") > flows +AT_CHECK([ovs-ofctl add-flows br0 flows]) + +AT_CHECK([ovs-appctl netdev-dummy/receive br0 'in_port(1),eth(src=50:54:00:00:00:09,dst=50:54:00:00:00:0a),eth_type(0x1234)'], [0], [stdout]) + +OVS_WAIT_UNTIL([grep 'over max translation' ovs-vswitchd.log]) + +dnl make sure the full ofproto trace dump is present +AT_CHECK([grep -c "^ *resubmit" ovs-vswitchd.log], + [0], [66 +]) + +OVS_VSWITCHD_STOP(["/over max translation depth/d +/ofproto_dpif_upcall/d"]) +AT_CLEANUP + AT_SETUP([ofproto-dpif - exponential resubmit chain]) OVS_VSWITCHD_START add_of_ports br0 1