Message ID | e1c94b33df145a0034fd32417a22b1466ca12ed2.1704885957.git.echaudro@redhat.com |
---|---|
State | Accepted |
Commit | 05d6f419cd91c1dc6f676309479cde0c3b35e542 |
Delegated to: | Eelco Chaudron |
Headers | show |
Series | [ovs-dev,v2,1/3] ofproto-dpif-upcall: Change flow dump duration message to WARN level. | expand |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/intel-ovs-compilation | fail | test: fail |
On 1/10/24 12:25, Eelco Chaudron wrote: > Martin Kennelly observes that even though this data is available to > humans via the journal/log files, these aren't exactly easy for a > developer to make any kind of behavioral inferences. This kind of > log and counter would be useful when checking on system health to > let us know that an Open vSwitch component is noticing some kind of > system level hiccup. > > Add a new coverage counter to track information on these events, and > let a developer or system engineer know how long these events have > occurred with some historical context. > > Reported-at: https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html > Suggested-by: Martin Kennelly <mkennell@redhat.com> > Co-Authored-By: Aaron Conole <aconole@redhat.com> > Signed-off-by: Aaron Conole <aconole@redhat.com> > Signed-off-by: Eelco Chaudron <echaudro@redhat.com> > Acked-by: Simon Horman <horms@ovn.org> > --- > v2: Updated commit message, based on similar patch upstream. > > lib/timeval.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/lib/timeval.c b/lib/timeval.c > index 193c7bab1..0abe7e555 100644 > --- a/lib/timeval.c > +++ b/lib/timeval.c > @@ -41,6 +41,8 @@ > > VLOG_DEFINE_THIS_MODULE(timeval); > > +COVERAGE_DEFINE(long_poll_interval); > + > #if !defined(HAVE_CLOCK_GETTIME) > typedef unsigned int clockid_t; > static int clock_gettime(clock_t id, struct timespec *ts); > @@ -644,6 +646,8 @@ log_poll_interval(long long int last_wakeup) > const struct rusage *last_rusage = get_recent_rusage(); > struct rusage rusage; > > + COVERAGE_INC(long_poll_interval); > + > if (!getrusage_thread(&rusage)) { > VLOG_WARN("Unreasonably long %lldms poll interval" > " (%lldms user, %lldms system)", Potentially interesting side effect of this change would be that coverage counters will likely be logged on every long poll interval, since every long poll interval will change them. Before this change many coverage dumps are skipped, because they are the same (see the coverage_log() function). Did you notice such a behavior change? Best regards, Ilya Maximets.
On 10 Jan 2024, at 20:34, Ilya Maximets wrote: > On 1/10/24 12:25, Eelco Chaudron wrote: >> Martin Kennelly observes that even though this data is available to >> humans via the journal/log files, these aren't exactly easy for a >> developer to make any kind of behavioral inferences. This kind of >> log and counter would be useful when checking on system health to >> let us know that an Open vSwitch component is noticing some kind of >> system level hiccup. >> >> Add a new coverage counter to track information on these events, and >> let a developer or system engineer know how long these events have >> occurred with some historical context. >> >> Reported-at: https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html >> Suggested-by: Martin Kennelly <mkennell@redhat.com> >> Co-Authored-By: Aaron Conole <aconole@redhat.com> >> Signed-off-by: Aaron Conole <aconole@redhat.com> >> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> >> Acked-by: Simon Horman <horms@ovn.org> >> --- >> v2: Updated commit message, based on similar patch upstream. >> >> lib/timeval.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/lib/timeval.c b/lib/timeval.c >> index 193c7bab1..0abe7e555 100644 >> --- a/lib/timeval.c >> +++ b/lib/timeval.c >> @@ -41,6 +41,8 @@ >> >> VLOG_DEFINE_THIS_MODULE(timeval); >> >> +COVERAGE_DEFINE(long_poll_interval); >> + >> #if !defined(HAVE_CLOCK_GETTIME) >> typedef unsigned int clockid_t; >> static int clock_gettime(clock_t id, struct timespec *ts); >> @@ -644,6 +646,8 @@ log_poll_interval(long long int last_wakeup) >> const struct rusage *last_rusage = get_recent_rusage(); >> struct rusage rusage; >> >> + COVERAGE_INC(long_poll_interval); >> + >> if (!getrusage_thread(&rusage)) { >> VLOG_WARN("Unreasonably long %lldms poll interval" >> " (%lldms user, %lldms system)", > > Potentially interesting side effect of this change would be that > coverage counters will likely be logged on every long poll interval, > since every long poll interval will change them. Before this change > many coverage dumps are skipped, because they are the same (see the > coverage_log() function). Did you notice such a behavior change? I did not notice this, as I only forced a single instance, but this is a good catch. I could exclude this new counter from the coverage_hash() function. However, there will probably be a lot of malloc ones anyway. What do you think? //Eelco
On 11 Jan 2024, at 0:34, Eelco Chaudron wrote: > On 10 Jan 2024, at 20:34, Ilya Maximets wrote: > >> On 1/10/24 12:25, Eelco Chaudron wrote: >>> Martin Kennelly observes that even though this data is available to >>> humans via the journal/log files, these aren't exactly easy for a >>> developer to make any kind of behavioral inferences. This kind of >>> log and counter would be useful when checking on system health to >>> let us know that an Open vSwitch component is noticing some kind of >>> system level hiccup. >>> >>> Add a new coverage counter to track information on these events, and >>> let a developer or system engineer know how long these events have >>> occurred with some historical context. >>> >>> Reported-at: https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html >>> Suggested-by: Martin Kennelly <mkennell@redhat.com> >>> Co-Authored-By: Aaron Conole <aconole@redhat.com> >>> Signed-off-by: Aaron Conole <aconole@redhat.com> >>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> >>> Acked-by: Simon Horman <horms@ovn.org> >>> --- >>> v2: Updated commit message, based on similar patch upstream. >>> >>> lib/timeval.c | 4 ++++ >>> 1 file changed, 4 insertions(+) >>> >>> diff --git a/lib/timeval.c b/lib/timeval.c >>> index 193c7bab1..0abe7e555 100644 >>> --- a/lib/timeval.c >>> +++ b/lib/timeval.c >>> @@ -41,6 +41,8 @@ >>> >>> VLOG_DEFINE_THIS_MODULE(timeval); >>> >>> +COVERAGE_DEFINE(long_poll_interval); >>> + >>> #if !defined(HAVE_CLOCK_GETTIME) >>> typedef unsigned int clockid_t; >>> static int clock_gettime(clock_t id, struct timespec *ts); >>> @@ -644,6 +646,8 @@ log_poll_interval(long long int last_wakeup) >>> const struct rusage *last_rusage = get_recent_rusage(); >>> struct rusage rusage; >>> >>> + COVERAGE_INC(long_poll_interval); >>> + >>> if (!getrusage_thread(&rusage)) { >>> VLOG_WARN("Unreasonably long %lldms poll interval" >>> " (%lldms user, %lldms system)", >> >> Potentially interesting side effect of this change would be that >> coverage counters will likely be logged on every long poll interval, >> since every long poll interval will change them. Before this change >> many coverage dumps are skipped, because they are the same (see the >> coverage_log() function). Did you notice such a behavior change? > > I did not notice this, as I only forced a single instance, but this is a good catch. I could exclude this new counter from the coverage_hash() function. However, there will probably be a lot of malloc ones anyway. What do you think? Ok, after adding some code to exclude this new event in the hash. I did not see any difference in behavior. The dump will happen each time the total values are updated (around every 5 seconds) due to the fact seq_change and util_xalloc are constantly being increased. The only thing that stops this from happening all the time in my setup is the rate-limiting in coverage_hash(). So I guess this change does not change anything to the current behavior, so adding the exclude in the hash makes no sense. Thoughts? //Eelco
On 1/11/24 09:16, Eelco Chaudron wrote: > > > On 11 Jan 2024, at 0:34, Eelco Chaudron wrote: > >> On 10 Jan 2024, at 20:34, Ilya Maximets wrote: >> >>> On 1/10/24 12:25, Eelco Chaudron wrote: >>>> Martin Kennelly observes that even though this data is available to >>>> humans via the journal/log files, these aren't exactly easy for a >>>> developer to make any kind of behavioral inferences. This kind of >>>> log and counter would be useful when checking on system health to >>>> let us know that an Open vSwitch component is noticing some kind of >>>> system level hiccup. >>>> >>>> Add a new coverage counter to track information on these events, and >>>> let a developer or system engineer know how long these events have >>>> occurred with some historical context. >>>> >>>> Reported-at: https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html >>>> Suggested-by: Martin Kennelly <mkennell@redhat.com> >>>> Co-Authored-By: Aaron Conole <aconole@redhat.com> >>>> Signed-off-by: Aaron Conole <aconole@redhat.com> >>>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> >>>> Acked-by: Simon Horman <horms@ovn.org> >>>> --- >>>> v2: Updated commit message, based on similar patch upstream. >>>> >>>> lib/timeval.c | 4 ++++ >>>> 1 file changed, 4 insertions(+) >>>> >>>> diff --git a/lib/timeval.c b/lib/timeval.c >>>> index 193c7bab1..0abe7e555 100644 >>>> --- a/lib/timeval.c >>>> +++ b/lib/timeval.c >>>> @@ -41,6 +41,8 @@ >>>> >>>> VLOG_DEFINE_THIS_MODULE(timeval); >>>> >>>> +COVERAGE_DEFINE(long_poll_interval); >>>> + >>>> #if !defined(HAVE_CLOCK_GETTIME) >>>> typedef unsigned int clockid_t; >>>> static int clock_gettime(clock_t id, struct timespec *ts); >>>> @@ -644,6 +646,8 @@ log_poll_interval(long long int last_wakeup) >>>> const struct rusage *last_rusage = get_recent_rusage(); >>>> struct rusage rusage; >>>> >>>> + COVERAGE_INC(long_poll_interval); >>>> + >>>> if (!getrusage_thread(&rusage)) { >>>> VLOG_WARN("Unreasonably long %lldms poll interval" >>>> " (%lldms user, %lldms system)", >>> >>> Potentially interesting side effect of this change would be that >>> coverage counters will likely be logged on every long poll interval, >>> since every long poll interval will change them. Before this change >>> many coverage dumps are skipped, because they are the same (see the >>> coverage_log() function). Did you notice such a behavior change? >> >> I did not notice this, as I only forced a single instance, but this is a good catch. I could exclude this new counter from the coverage_hash() function. However, there will probably be a lot of malloc ones anyway. What do you think? > > Ok, after adding some code to exclude this new event in the hash. I did not see any difference in behavior. The dump will happen each time the total values are updated (around every 5 seconds) due to the fact seq_change and util_xalloc are constantly being increased. The only thing that stops this from happening all the time in my setup is the rate-limiting in coverage_hash(). > > So I guess this change does not change anything to the current behavior, so adding the exclude in the hash makes no sense. Thoughts? Yeah, I don't think exclusion is a good thing. We should just keep as is. I think, the main thing that prevents logging is also the fact that hashing doesn't hash the actual counters (I missed that yesterday), but the ranks of each counter, i.e. if the relative order of all the counters sorted by the total count didn't change, there will be no log. And I agree that xalloc counter is basically increased all the time anyway. Should be fine as-is. Thanks for checking! Acked-by: Ilya Maximets <i.maximets@ovn.org>
Eelco Chaudron <echaudro@redhat.com> writes: > Martin Kennelly observes that even though this data is available to > humans via the journal/log files, these aren't exactly easy for a > developer to make any kind of behavioral inferences. This kind of > log and counter would be useful when checking on system health to > let us know that an Open vSwitch component is noticing some kind of > system level hiccup. > > Add a new coverage counter to track information on these events, and > let a developer or system engineer know how long these events have > occurred with some historical context. > > Reported-at: https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html > Suggested-by: Martin Kennelly <mkennell@redhat.com> > Co-Authored-By: Aaron Conole <aconole@redhat.com> > Signed-off-by: Aaron Conole <aconole@redhat.com> > Signed-off-by: Eelco Chaudron <echaudro@redhat.com> > Acked-by: Simon Horman <horms@ovn.org> > --- Thanks, applied.
diff --git a/lib/timeval.c b/lib/timeval.c index 193c7bab1..0abe7e555 100644 --- a/lib/timeval.c +++ b/lib/timeval.c @@ -41,6 +41,8 @@ VLOG_DEFINE_THIS_MODULE(timeval); +COVERAGE_DEFINE(long_poll_interval); + #if !defined(HAVE_CLOCK_GETTIME) typedef unsigned int clockid_t; static int clock_gettime(clock_t id, struct timespec *ts); @@ -644,6 +646,8 @@ log_poll_interval(long long int last_wakeup) const struct rusage *last_rusage = get_recent_rusage(); struct rusage rusage; + COVERAGE_INC(long_poll_interval); + if (!getrusage_thread(&rusage)) { VLOG_WARN("Unreasonably long %lldms poll interval" " (%lldms user, %lldms system)",