Message ID | 170314679245.1799515.2320316726971805940.stgit@ebuild |
---|---|
State | Changes Requested |
Delegated to: | Eelco Chaudron |
Headers | show |
Series | [ovs-dev,1/3] ofproto-dpif-upcall: Change flow dump duration message to WARN level. | expand |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/intel-ovs-compilation | success | test: success |
On Thu, Dec 21, 2023 at 09:20:28AM +0100, Eelco Chaudron wrote: > Add new coverage counters that might help debugging flow > dump length related issues. > > Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Simon Horman <horms@ovn.org>
On 12/21/23 09:20, Eelco Chaudron wrote: > Add new coverage counters that might help debugging flow > dump length related issues. > > Signed-off-by: Eelco Chaudron <echaudro@redhat.com> > --- > ofproto/ofproto-dpif-upcall.c | 13 ++++++++++--- > 1 file changed, 10 insertions(+), 3 deletions(-) > > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c > index cd71e3ee3..85f78675a 100644 > --- a/ofproto/ofproto-dpif-upcall.c > +++ b/ofproto/ofproto-dpif-upcall.c > @@ -52,6 +52,9 @@ > > VLOG_DEFINE_THIS_MODULE(ofproto_dpif_upcall); > > +COVERAGE_DEFINE(dump_duration_low); > +COVERAGE_DEFINE(dump_duration_medium); > +COVERAGE_DEFINE(dump_duration_high); > COVERAGE_DEFINE(dumped_duplicate_flow); > COVERAGE_DEFINE(dumped_inconsistent_flow); > COVERAGE_DEFINE(dumped_new_flow); > @@ -1039,11 +1042,15 @@ udpif_revalidator(void *arg) > udpif->dump_duration = duration; > if (duration > 2000) { > flow_limit /= duration / 1000; > + COVERAGE_INC(dump_duration_high); > } else if (duration > 1300) { > flow_limit = flow_limit * 3 / 4; > - } else if (duration < 1000 && > - flow_limit < n_flows * 1000 / duration) { > - flow_limit += 1000; > + COVERAGE_INC(dump_duration_medium); > + } else if (duration < 1000) { > + if (flow_limit < n_flows * 1000 / duration) { > + flow_limit += 1000; > + } > + COVERAGE_INC(dump_duration_low); > } Hi, Eelco. Thanks for the change! Though I'm not sure we actually need all three of these. The main concern being a low counter. In my experience, counters that just continuously go up create a lot of confusion and panic among users, who do not fully understand what they mean. For example the memory allocation counter; nobody really knows what is the good or bad value or a grow rate for it, it just always goes up at a fairly high rate. And the 'dump_duration_low' counter name doesn't help much here, because something being low can mean either a bad or a good thing. E.g. low battery or something like that. In practice, I'm not sure if there is a lot of value of having this counter as it just tells how many revalidation cycles we had, which is not a very meaningful number either since every cycle can be different and can take different amount of time. I'd suggest not having this counter. For the high and medium, I'm also not sure why we need to distinguish them. Both mean that something is not good, and both can't actually tell what exactly is wrong. For the emphasis we have a warning in the log now that will differentiate the high from the medium, if necessary. Maybe we can go with a single counter with a more understandable name like 'upcall_flow_limit_reduced' ? Adding to a family of other flow limit counters - upcall_flow_limit_hit and upcall_flow_limit_kill. What do you think? Best regards, Ilya Maximets.
On 2 Jan 2024, at 23:10, Ilya Maximets wrote: > On 12/21/23 09:20, Eelco Chaudron wrote: >> Add new coverage counters that might help debugging flow >> dump length related issues. >> >> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> >> --- >> ofproto/ofproto-dpif-upcall.c | 13 ++++++++++--- >> 1 file changed, 10 insertions(+), 3 deletions(-) >> >> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c >> index cd71e3ee3..85f78675a 100644 >> --- a/ofproto/ofproto-dpif-upcall.c >> +++ b/ofproto/ofproto-dpif-upcall.c >> @@ -52,6 +52,9 @@ >> >> VLOG_DEFINE_THIS_MODULE(ofproto_dpif_upcall); >> >> +COVERAGE_DEFINE(dump_duration_low); >> +COVERAGE_DEFINE(dump_duration_medium); >> +COVERAGE_DEFINE(dump_duration_high); >> COVERAGE_DEFINE(dumped_duplicate_flow); >> COVERAGE_DEFINE(dumped_inconsistent_flow); >> COVERAGE_DEFINE(dumped_new_flow); >> @@ -1039,11 +1042,15 @@ udpif_revalidator(void *arg) >> udpif->dump_duration = duration; >> if (duration > 2000) { >> flow_limit /= duration / 1000; >> + COVERAGE_INC(dump_duration_high); >> } else if (duration > 1300) { >> flow_limit = flow_limit * 3 / 4; >> - } else if (duration < 1000 && >> - flow_limit < n_flows * 1000 / duration) { >> - flow_limit += 1000; >> + COVERAGE_INC(dump_duration_medium); >> + } else if (duration < 1000) { >> + if (flow_limit < n_flows * 1000 / duration) { >> + flow_limit += 1000; >> + } >> + COVERAGE_INC(dump_duration_low); >> } > > Hi, Eelco. Thanks for the change! > Though I'm not sure we actually need all three of these. > > The main concern being a low counter. In my experience, counters that > just continuously go up create a lot of confusion and panic among users, > who do not fully understand what they mean. For example the memory > allocation counter; nobody really knows what is the good or bad value > or a grow rate for it, it just always goes up at a fairly high rate. Guess it’s time to add documentation around coverage counters. I can add an initial framework mentioning the existing ones, and add a clear description for the new ones :) Of course, stating that these are debug counters and are subject to change without notice ;) > And the 'dump_duration_low' counter name doesn't help much here, because > something being low can mean either a bad or a good thing. > E.g. low battery or something like that. In practice, I'm not sure if > there is a lot of value of having this counter as it just tells how many > revalidation cycles we had, which is not a very meaningful number either > since every cycle can be different and can take different amount of time. > I'd suggest not having this counter. > > For the high and medium, I'm also not sure why we need to distinguish > them. Both mean that something is not good, and both can't actually > tell what exactly is wrong. For the emphasis we have a warning in the > log now that will differentiate the high from the medium, if necessary. > > Maybe we can go with a single counter with a more understandable name > like 'upcall_flow_limit_reduced' ? Adding to a family of other flow > limit counters - upcall_flow_limit_hit and upcall_flow_limit_kill. > > What do you think? I guess my goal here was to use the counters to characterize the revalidation process in case of problems. With all three of these counters, we can see how often we have medium and long compared to the short ones. We also can differentiate between moderate and extreme flow reductions. This will give us a counter for the log message you refer to, which might have been lost by the time we need to investigate issues. Although my preference was to keep three individual coverage counters, with good documentation, and maybe rename them to be more self-explanatory, i.e., dump_duration_lt_1000, dump_duration_gt_2000, dump_duration_gt_1300 (_lt_2000). I realized there is the [1000..1300] range missing a count so getting the percentage will not work :( I still want to differentiate the bad from the really bad, as the log message might get lost. So what about the following diff: --- a/ofproto/ofproto-dpif-upcall.c +++ b/ofproto/ofproto-dpif-upcall.c @@ -60,7 +60,10 @@ COVERAGE_DEFINE(revalidate_missed_dp_flow); COVERAGE_DEFINE(ukey_dp_change); COVERAGE_DEFINE(ukey_invalid_stat_reset); COVERAGE_DEFINE(upcall_flow_limit_hit); +COVERAGE_DEFINE(upcall_flow_limit_increased_1k); COVERAGE_DEFINE(upcall_flow_limit_kill); +COVERAGE_DEFINE(upcall_flow_limit_reduced_div_1k); +COVERAGE_DEFINE(upcall_flow_limit_reduced_min_1qtr); COVERAGE_DEFINE(upcall_ukey_contention); COVERAGE_DEFINE(upcall_ukey_replace); @@ -1039,11 +1042,14 @@ udpif_revalidator(void *arg) udpif->dump_duration = duration; if (duration > 2000) { flow_limit /= duration / 1000; + COVERAGE_INC(upcall_flow_limit_reduced_div_1k); } else if (duration > 1300) { flow_limit = flow_limit * 3 / 4; + COVERAGE_INC(upcall_flow_limit_reduced_min_1qtr); } else if (duration < 1000 && flow_limit < n_flows * 1000 / duration) { flow_limit += 1000; + COVERAGE_INC(upcall_flow_limit_increased_1k); } flow_limit = MIN(ofproto_flow_limit, MAX(flow_limit, 1000)); atomic_store_relaxed(&udpif->flow_limit, flow_limit); //Eelco
On 1/3/24 15:40, Eelco Chaudron wrote: > > > On 2 Jan 2024, at 23:10, Ilya Maximets wrote: > >> On 12/21/23 09:20, Eelco Chaudron wrote: >>> Add new coverage counters that might help debugging flow >>> dump length related issues. >>> >>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> >>> --- >>> ofproto/ofproto-dpif-upcall.c | 13 ++++++++++--- >>> 1 file changed, 10 insertions(+), 3 deletions(-) >>> >>> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c >>> index cd71e3ee3..85f78675a 100644 >>> --- a/ofproto/ofproto-dpif-upcall.c >>> +++ b/ofproto/ofproto-dpif-upcall.c >>> @@ -52,6 +52,9 @@ >>> >>> VLOG_DEFINE_THIS_MODULE(ofproto_dpif_upcall); >>> >>> +COVERAGE_DEFINE(dump_duration_low); >>> +COVERAGE_DEFINE(dump_duration_medium); >>> +COVERAGE_DEFINE(dump_duration_high); >>> COVERAGE_DEFINE(dumped_duplicate_flow); >>> COVERAGE_DEFINE(dumped_inconsistent_flow); >>> COVERAGE_DEFINE(dumped_new_flow); >>> @@ -1039,11 +1042,15 @@ udpif_revalidator(void *arg) >>> udpif->dump_duration = duration; >>> if (duration > 2000) { >>> flow_limit /= duration / 1000; >>> + COVERAGE_INC(dump_duration_high); >>> } else if (duration > 1300) { >>> flow_limit = flow_limit * 3 / 4; >>> - } else if (duration < 1000 && >>> - flow_limit < n_flows * 1000 / duration) { >>> - flow_limit += 1000; >>> + COVERAGE_INC(dump_duration_medium); >>> + } else if (duration < 1000) { >>> + if (flow_limit < n_flows * 1000 / duration) { >>> + flow_limit += 1000; >>> + } >>> + COVERAGE_INC(dump_duration_low); >>> } >> >> Hi, Eelco. Thanks for the change! >> Though I'm not sure we actually need all three of these. >> >> The main concern being a low counter. In my experience, counters that >> just continuously go up create a lot of confusion and panic among users, >> who do not fully understand what they mean. For example the memory >> allocation counter; nobody really knows what is the good or bad value >> or a grow rate for it, it just always goes up at a fairly high rate. > > Guess it’s time to add documentation around coverage counters. I can add > an initial framework mentioning the existing ones, and add a clear description > for the new ones :) > > Of course, stating that these are debug counters and are subject to change > without notice ;) Even with that, I'd still rather not have documentation for the counters. Some of them have no external meaning outside of a piece of code they are in. And maintaining these docs will be painful. That energy can be better spent on maintaining other parts of existing documentation that are outdated or incomplete. There are lots of places like this. For the counters we just need to avoid scary names that can be easily misinterpreted. > >> And the 'dump_duration_low' counter name doesn't help much here, because >> something being low can mean either a bad or a good thing. >> E.g. low battery or something like that. In practice, I'm not sure if >> there is a lot of value of having this counter as it just tells how many >> revalidation cycles we had, which is not a very meaningful number either >> since every cycle can be different and can take different amount of time. >> I'd suggest not having this counter. >> >> For the high and medium, I'm also not sure why we need to distinguish >> them. Both mean that something is not good, and both can't actually >> tell what exactly is wrong. For the emphasis we have a warning in the >> log now that will differentiate the high from the medium, if necessary. >> >> Maybe we can go with a single counter with a more understandable name >> like 'upcall_flow_limit_reduced' ? Adding to a family of other flow >> limit counters - upcall_flow_limit_hit and upcall_flow_limit_kill. >> >> What do you think? > > I guess my goal here was to use the counters to characterize the revalidation > process in case of problems. With all three of these counters, we can see how > often we have medium and long compared to the short ones. We also can differentiate > between moderate and extreme flow reductions. This will give us a counter for the > log message you refer to, which might have been lost by the time we need to > investigate issues. > > Although my preference was to keep three individual coverage counters, with > good documentation, and maybe rename them to be more self-explanatory, i.e., > dump_duration_lt_1000, dump_duration_gt_2000, dump_duration_gt_1300 (_lt_2000). > I realized there is the [1000..1300] range missing a count so getting the > percentage will not work :( > > I still want to differentiate the bad from the really bad, as the log message > might get lost. So what about the following diff: > > --- a/ofproto/ofproto-dpif-upcall.c > +++ b/ofproto/ofproto-dpif-upcall.c > @@ -60,7 +60,10 @@ COVERAGE_DEFINE(revalidate_missed_dp_flow); > COVERAGE_DEFINE(ukey_dp_change); > COVERAGE_DEFINE(ukey_invalid_stat_reset); > COVERAGE_DEFINE(upcall_flow_limit_hit); > +COVERAGE_DEFINE(upcall_flow_limit_increased_1k); > COVERAGE_DEFINE(upcall_flow_limit_kill); > +COVERAGE_DEFINE(upcall_flow_limit_reduced_div_1k); > +COVERAGE_DEFINE(upcall_flow_limit_reduced_min_1qtr); > COVERAGE_DEFINE(upcall_ukey_contention); > COVERAGE_DEFINE(upcall_ukey_replace); > > @@ -1039,11 +1042,14 @@ udpif_revalidator(void *arg) > udpif->dump_duration = duration; > if (duration > 2000) { > flow_limit /= duration / 1000; > + COVERAGE_INC(upcall_flow_limit_reduced_div_1k); This name is still a bit misleading. Because we're not reducing the limit by a 1000x, we're reducing it times the number of seconds we spent in the dump. So, maybe 'upcall_flow_limit_scaled' (because we're scaling it down to the amount of flows that can be revalidated within 1 second) or 'upcall_flow_limit_reduced_by_time', though the names are growing too long... > } else if (duration > 1300) { > flow_limit = flow_limit * 3 / 4; > + COVERAGE_INC(upcall_flow_limit_reduced_min_1qtr); Here we can drop the 'min' part, "reduced [by] 1 quarter" sounds fine. 'min_1qtr' can be dropped altogether if some different name is used for the previous counter. And we don't need to make above two cases look similar, because they are fairly different in terms of how reduction works. > } else if (duration < 1000 && > flow_limit < n_flows * 1000 / duration) { > flow_limit += 1000; > + COVERAGE_INC(upcall_flow_limit_increased_1k); '1k' potentially can be dropped, since it's the only place where the limit is increased. 'grow'/'grew' can be another (shorter) word to use instead of 'increased'. Note: this counter will always have a value around 190, because the flow limit starts up with 10K, IIRC, and then ramps up over time to 200K. Can be higher if the limit is ever reduced, but in most cases will stay at that number. > } > flow_limit = MIN(ofproto_flow_limit, MAX(flow_limit, 1000)); > atomic_store_relaxed(&udpif->flow_limit, flow_limit); > > //Eelco >
diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c index cd71e3ee3..85f78675a 100644 --- a/ofproto/ofproto-dpif-upcall.c +++ b/ofproto/ofproto-dpif-upcall.c @@ -52,6 +52,9 @@ VLOG_DEFINE_THIS_MODULE(ofproto_dpif_upcall); +COVERAGE_DEFINE(dump_duration_low); +COVERAGE_DEFINE(dump_duration_medium); +COVERAGE_DEFINE(dump_duration_high); COVERAGE_DEFINE(dumped_duplicate_flow); COVERAGE_DEFINE(dumped_inconsistent_flow); COVERAGE_DEFINE(dumped_new_flow); @@ -1039,11 +1042,15 @@ udpif_revalidator(void *arg) udpif->dump_duration = duration; if (duration > 2000) { flow_limit /= duration / 1000; + COVERAGE_INC(dump_duration_high); } else if (duration > 1300) { flow_limit = flow_limit * 3 / 4; - } else if (duration < 1000 && - flow_limit < n_flows * 1000 / duration) { - flow_limit += 1000; + COVERAGE_INC(dump_duration_medium); + } else if (duration < 1000) { + if (flow_limit < n_flows * 1000 / duration) { + flow_limit += 1000; + } + COVERAGE_INC(dump_duration_low); } flow_limit = MIN(ofproto_flow_limit, MAX(flow_limit, 1000)); atomic_store_relaxed(&udpif->flow_limit, flow_limit);
Add new coverage counters that might help debugging flow dump length related issues. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> --- ofproto/ofproto-dpif-upcall.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-)