diff mbox series

[ovs-dev,v2,3/3] timeval: Add coverage counter for long poll interval events.

Message ID e1c94b33df145a0034fd32417a22b1466ca12ed2.1704885957.git.echaudro@redhat.com
State Accepted
Commit 05d6f419cd91c1dc6f676309479cde0c3b35e542
Delegated to: Eelco Chaudron
Headers show
Series [ovs-dev,v2,1/3] ofproto-dpif-upcall: Change flow dump duration message to WARN level. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation fail test: fail

Commit Message

Eelco Chaudron Jan. 10, 2024, 11:25 a.m. UTC
Martin Kennelly observes that even though this data is available to
humans via the journal/log files, these aren't exactly easy for a
developer to make any kind of behavioral inferences.  This kind of
log and counter would be useful when checking on system health to
let us know that an Open vSwitch component is noticing some kind of
system level hiccup.

Add a new coverage counter to track information on these events, and
let a developer or system engineer know how long these events have
occurred with some historical context.

Reported-at: https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html
Suggested-by: Martin Kennelly <mkennell@redhat.com>
Co-Authored-By: Aaron Conole <aconole@redhat.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Simon Horman <horms@ovn.org>
---
 v2: Updated commit message, based on similar patch upstream.

 lib/timeval.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Ilya Maximets Jan. 10, 2024, 7:34 p.m. UTC | #1
On 1/10/24 12:25, Eelco Chaudron wrote:
> Martin Kennelly observes that even though this data is available to
> humans via the journal/log files, these aren't exactly easy for a
> developer to make any kind of behavioral inferences.  This kind of
> log and counter would be useful when checking on system health to
> let us know that an Open vSwitch component is noticing some kind of
> system level hiccup.
> 
> Add a new coverage counter to track information on these events, and
> let a developer or system engineer know how long these events have
> occurred with some historical context.
> 
> Reported-at: https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html
> Suggested-by: Martin Kennelly <mkennell@redhat.com>
> Co-Authored-By: Aaron Conole <aconole@redhat.com>
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
> Acked-by: Simon Horman <horms@ovn.org>
> ---
>  v2: Updated commit message, based on similar patch upstream.
> 
>  lib/timeval.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/lib/timeval.c b/lib/timeval.c
> index 193c7bab1..0abe7e555 100644
> --- a/lib/timeval.c
> +++ b/lib/timeval.c
> @@ -41,6 +41,8 @@
>  
>  VLOG_DEFINE_THIS_MODULE(timeval);
>  
> +COVERAGE_DEFINE(long_poll_interval);
> +
>  #if !defined(HAVE_CLOCK_GETTIME)
>  typedef unsigned int clockid_t;
>  static int clock_gettime(clock_t id, struct timespec *ts);
> @@ -644,6 +646,8 @@ log_poll_interval(long long int last_wakeup)
>          const struct rusage *last_rusage = get_recent_rusage();
>          struct rusage rusage;
>  
> +        COVERAGE_INC(long_poll_interval);
> +
>          if (!getrusage_thread(&rusage)) {
>              VLOG_WARN("Unreasonably long %lldms poll interval"
>                        " (%lldms user, %lldms system)",

Potentially interesting side effect of this change would be that
coverage counters will likely be logged on every long poll interval,
since every long poll interval will change them.  Before this change
many coverage dumps are skipped, because they are the same (see the
coverage_log() function).  Did you notice such a behavior change?

Best regards, Ilya Maximets.
Eelco Chaudron Jan. 10, 2024, 11:34 p.m. UTC | #2
On 10 Jan 2024, at 20:34, Ilya Maximets wrote:

> On 1/10/24 12:25, Eelco Chaudron wrote:
>> Martin Kennelly observes that even though this data is available to
>> humans via the journal/log files, these aren't exactly easy for a
>> developer to make any kind of behavioral inferences.  This kind of
>> log and counter would be useful when checking on system health to
>> let us know that an Open vSwitch component is noticing some kind of
>> system level hiccup.
>>
>> Add a new coverage counter to track information on these events, and
>> let a developer or system engineer know how long these events have
>> occurred with some historical context.
>>
>> Reported-at: https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html
>> Suggested-by: Martin Kennelly <mkennell@redhat.com>
>> Co-Authored-By: Aaron Conole <aconole@redhat.com>
>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>> Acked-by: Simon Horman <horms@ovn.org>
>> ---
>>  v2: Updated commit message, based on similar patch upstream.
>>
>>  lib/timeval.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/lib/timeval.c b/lib/timeval.c
>> index 193c7bab1..0abe7e555 100644
>> --- a/lib/timeval.c
>> +++ b/lib/timeval.c
>> @@ -41,6 +41,8 @@
>>
>>  VLOG_DEFINE_THIS_MODULE(timeval);
>>
>> +COVERAGE_DEFINE(long_poll_interval);
>> +
>>  #if !defined(HAVE_CLOCK_GETTIME)
>>  typedef unsigned int clockid_t;
>>  static int clock_gettime(clock_t id, struct timespec *ts);
>> @@ -644,6 +646,8 @@ log_poll_interval(long long int last_wakeup)
>>          const struct rusage *last_rusage = get_recent_rusage();
>>          struct rusage rusage;
>>
>> +        COVERAGE_INC(long_poll_interval);
>> +
>>          if (!getrusage_thread(&rusage)) {
>>              VLOG_WARN("Unreasonably long %lldms poll interval"
>>                        " (%lldms user, %lldms system)",
>
> Potentially interesting side effect of this change would be that
> coverage counters will likely be logged on every long poll interval,
> since every long poll interval will change them.  Before this change
> many coverage dumps are skipped, because they are the same (see the
> coverage_log() function).  Did you notice such a behavior change?

I did not notice this, as I only forced a single instance, but this is a good catch. I could exclude this new counter from the coverage_hash() function. However, there will probably be a lot of malloc ones anyway. What do you think?

//Eelco
Eelco Chaudron Jan. 11, 2024, 8:16 a.m. UTC | #3
On 11 Jan 2024, at 0:34, Eelco Chaudron wrote:

> On 10 Jan 2024, at 20:34, Ilya Maximets wrote:
>
>> On 1/10/24 12:25, Eelco Chaudron wrote:
>>> Martin Kennelly observes that even though this data is available to
>>> humans via the journal/log files, these aren't exactly easy for a
>>> developer to make any kind of behavioral inferences.  This kind of
>>> log and counter would be useful when checking on system health to
>>> let us know that an Open vSwitch component is noticing some kind of
>>> system level hiccup.
>>>
>>> Add a new coverage counter to track information on these events, and
>>> let a developer or system engineer know how long these events have
>>> occurred with some historical context.
>>>
>>> Reported-at: https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html
>>> Suggested-by: Martin Kennelly <mkennell@redhat.com>
>>> Co-Authored-By: Aaron Conole <aconole@redhat.com>
>>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>>> Acked-by: Simon Horman <horms@ovn.org>
>>> ---
>>>  v2: Updated commit message, based on similar patch upstream.
>>>
>>>  lib/timeval.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/lib/timeval.c b/lib/timeval.c
>>> index 193c7bab1..0abe7e555 100644
>>> --- a/lib/timeval.c
>>> +++ b/lib/timeval.c
>>> @@ -41,6 +41,8 @@
>>>
>>>  VLOG_DEFINE_THIS_MODULE(timeval);
>>>
>>> +COVERAGE_DEFINE(long_poll_interval);
>>> +
>>>  #if !defined(HAVE_CLOCK_GETTIME)
>>>  typedef unsigned int clockid_t;
>>>  static int clock_gettime(clock_t id, struct timespec *ts);
>>> @@ -644,6 +646,8 @@ log_poll_interval(long long int last_wakeup)
>>>          const struct rusage *last_rusage = get_recent_rusage();
>>>          struct rusage rusage;
>>>
>>> +        COVERAGE_INC(long_poll_interval);
>>> +
>>>          if (!getrusage_thread(&rusage)) {
>>>              VLOG_WARN("Unreasonably long %lldms poll interval"
>>>                        " (%lldms user, %lldms system)",
>>
>> Potentially interesting side effect of this change would be that
>> coverage counters will likely be logged on every long poll interval,
>> since every long poll interval will change them.  Before this change
>> many coverage dumps are skipped, because they are the same (see the
>> coverage_log() function).  Did you notice such a behavior change?
>
> I did not notice this, as I only forced a single instance, but this is a good catch. I could exclude this new counter from the coverage_hash() function. However, there will probably be a lot of malloc ones anyway. What do you think?

Ok, after adding some code to exclude this new event in the hash. I did not see any difference in behavior. The dump will happen each time the total values are updated (around every 5 seconds) due to the fact seq_change and util_xalloc are constantly being increased. The only thing that stops this from happening all the time in my setup is the rate-limiting in coverage_hash().

So I guess this change does not change anything to the current behavior, so adding the exclude in the hash makes no sense. Thoughts?

//Eelco
Ilya Maximets Jan. 11, 2024, 11:16 a.m. UTC | #4
On 1/11/24 09:16, Eelco Chaudron wrote:
> 
> 
> On 11 Jan 2024, at 0:34, Eelco Chaudron wrote:
> 
>> On 10 Jan 2024, at 20:34, Ilya Maximets wrote:
>>
>>> On 1/10/24 12:25, Eelco Chaudron wrote:
>>>> Martin Kennelly observes that even though this data is available to
>>>> humans via the journal/log files, these aren't exactly easy for a
>>>> developer to make any kind of behavioral inferences.  This kind of
>>>> log and counter would be useful when checking on system health to
>>>> let us know that an Open vSwitch component is noticing some kind of
>>>> system level hiccup.
>>>>
>>>> Add a new coverage counter to track information on these events, and
>>>> let a developer or system engineer know how long these events have
>>>> occurred with some historical context.
>>>>
>>>> Reported-at: https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html
>>>> Suggested-by: Martin Kennelly <mkennell@redhat.com>
>>>> Co-Authored-By: Aaron Conole <aconole@redhat.com>
>>>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>>>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>>>> Acked-by: Simon Horman <horms@ovn.org>
>>>> ---
>>>>  v2: Updated commit message, based on similar patch upstream.
>>>>
>>>>  lib/timeval.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/lib/timeval.c b/lib/timeval.c
>>>> index 193c7bab1..0abe7e555 100644
>>>> --- a/lib/timeval.c
>>>> +++ b/lib/timeval.c
>>>> @@ -41,6 +41,8 @@
>>>>
>>>>  VLOG_DEFINE_THIS_MODULE(timeval);
>>>>
>>>> +COVERAGE_DEFINE(long_poll_interval);
>>>> +
>>>>  #if !defined(HAVE_CLOCK_GETTIME)
>>>>  typedef unsigned int clockid_t;
>>>>  static int clock_gettime(clock_t id, struct timespec *ts);
>>>> @@ -644,6 +646,8 @@ log_poll_interval(long long int last_wakeup)
>>>>          const struct rusage *last_rusage = get_recent_rusage();
>>>>          struct rusage rusage;
>>>>
>>>> +        COVERAGE_INC(long_poll_interval);
>>>> +
>>>>          if (!getrusage_thread(&rusage)) {
>>>>              VLOG_WARN("Unreasonably long %lldms poll interval"
>>>>                        " (%lldms user, %lldms system)",
>>>
>>> Potentially interesting side effect of this change would be that
>>> coverage counters will likely be logged on every long poll interval,
>>> since every long poll interval will change them.  Before this change
>>> many coverage dumps are skipped, because they are the same (see the
>>> coverage_log() function).  Did you notice such a behavior change?
>>
>> I did not notice this, as I only forced a single instance, but this is a good catch. I could exclude this new counter from the coverage_hash() function. However, there will probably be a lot of malloc ones anyway. What do you think?
> 
> Ok, after adding some code to exclude this new event in the hash. I did not see any difference in behavior. The dump will happen each time the total values are updated (around every 5 seconds) due to the fact seq_change and util_xalloc are constantly being increased. The only thing that stops this from happening all the time in my setup is the rate-limiting in coverage_hash().
> 
> So I guess this change does not change anything to the current behavior, so adding the exclude in the hash makes no sense. Thoughts?

Yeah, I don't think exclusion is a good thing.  We should just keep
as is.  I think, the main thing that prevents logging is also the
fact that hashing doesn't hash the actual counters (I missed that
yesterday), but the ranks of each counter, i.e. if the relative
order of all the counters sorted by the total count didn't change,
there will be no log.  And I agree that xalloc counter is basically
increased all the time anyway.  Should be fine as-is.

Thanks for checking!

Acked-by: Ilya Maximets <i.maximets@ovn.org>
Aaron Conole Jan. 15, 2024, 2:16 p.m. UTC | #5
Eelco Chaudron <echaudro@redhat.com> writes:

> Martin Kennelly observes that even though this data is available to
> humans via the journal/log files, these aren't exactly easy for a
> developer to make any kind of behavioral inferences.  This kind of
> log and counter would be useful when checking on system health to
> let us know that an Open vSwitch component is noticing some kind of
> system level hiccup.
>
> Add a new coverage counter to track information on these events, and
> let a developer or system engineer know how long these events have
> occurred with some historical context.
>
> Reported-at: https://lists.linuxfoundation.org/pipermail/ovs-discuss/2023-June/052523.html
> Suggested-by: Martin Kennelly <mkennell@redhat.com>
> Co-Authored-By: Aaron Conole <aconole@redhat.com>
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
> Acked-by: Simon Horman <horms@ovn.org>
> ---

Thanks, applied.
diff mbox series

Patch

diff --git a/lib/timeval.c b/lib/timeval.c
index 193c7bab1..0abe7e555 100644
--- a/lib/timeval.c
+++ b/lib/timeval.c
@@ -41,6 +41,8 @@ 
 
 VLOG_DEFINE_THIS_MODULE(timeval);
 
+COVERAGE_DEFINE(long_poll_interval);
+
 #if !defined(HAVE_CLOCK_GETTIME)
 typedef unsigned int clockid_t;
 static int clock_gettime(clock_t id, struct timespec *ts);
@@ -644,6 +646,8 @@  log_poll_interval(long long int last_wakeup)
         const struct rusage *last_rusage = get_recent_rusage();
         struct rusage rusage;
 
+        COVERAGE_INC(long_poll_interval);
+
         if (!getrusage_thread(&rusage)) {
             VLOG_WARN("Unreasonably long %lldms poll interval"
                       " (%lldms user, %lldms system)",