[ovs-dev,v2,1/2] Add performance measuring API

Message ID 20180112225800.31938-2-mmichels@redhat.com
State Changes Requested
Headers show
Series
  • Add performance measurement library
Related show

Commit Message

Mark Michelson Jan. 12, 2018, 10:57 p.m.
This is similar to the existing coverage and perf-counter APIs in OVS.
However, rather than keeping counters, this is aimed at timing how long
operations take to perform. "Operations" in this case can be anything
from a loop iteration, to a function, to something more complex.

The library will keep track of how long it takes to perform the
particular operations and will maintain statistics of those running
times.

Statistics for a particular operation can be queried from the command
line by using ovs-appctl -t <target> performance/show <operation name>.

The API is designed to be pretty small. The expected steps are as
follows:

1) Create a performance measurement, providing a unique name, using
performance_create()
2) Add calls to start_sample() and end_sample() to mark the start and
stop of the operation you wish to measure.
3) Periodically call performance_run() in order to compile statistics.
4) Upon completion (likely program shutdown), call performance_destroy()
to clean up.

Signed-off-by: Mark Michelson <mmichels@redhat.com>
---
 lib/automake.mk   |   2 +
 lib/performance.c | 387 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/performance.h |  58 ++++++++
 3 files changed, 447 insertions(+)
 create mode 100644 lib/performance.c
 create mode 100644 lib/performance.h

Comments

Aaron Conole Jan. 18, 2018, 4:20 p.m. | #1
Hi Mark,

Mark Michelson <mmichels@redhat.com> writes:

> This is similar to the existing coverage and perf-counter APIs in OVS.
> However, rather than keeping counters, this is aimed at timing how long
> operations take to perform. "Operations" in this case can be anything
> from a loop iteration, to a function, to something more complex.

It's important to note that the granularity on this is measured in 'ms',
which means this library is more suited towards measuring larger groups
of operations (those which take longer than the clock resolution on
time_msec).  I guess one advantage of 

I might suggest the ability to pick the time resolution.  For instance,
sometimes picking msec is useful when clocking response time in a
multi-message system.  Sometimes a tighter granularity (using finer
controls such as 'clock_gettime', 'rdtsc', 'read register 9 of
coprocessor 0') may be desired.  A function pointer passed to the
performance 'bucket' ie:
   performance_create("something", perf_time_msec)
that can have a function signature that makes sense for most cases (ull
return, no args, etc.) might be good to allow users to customize the
performance system to suit their needs.

OTOH, since it won't have much for initial users, maybe it's okay.

Also, apologies if I'm really picky or seem overly verbose with moving
average filters.  Once upon a time, I worked on an embedded system that
was a giant... :)

> The library will keep track of how long it takes to perform the
> particular operations and will maintain statistics of those running
> times.
>
> Statistics for a particular operation can be queried from the command
> line by using ovs-appctl -t <target> performance/show <operation name>.

I noticed that you'll have a global lock preventing statistics updates
from happening.  No problem - except the larger issue is while that lock
is held no other forward progress for the thread tracking the statistics
can happen.

Maybe it makes sense to think about an alternate set of data structures
that will accommodate multiple threads contending on performance
monitoring.  I'm guessing in your use case there isn't any contention,
but perhaps in the future someone will use this in, for example, a
netdev area to profile and the results would get skewed.

> The API is designed to be pretty small. The expected steps are as
> follows:
>
> 1) Create a performance measurement, providing a unique name, using
> performance_create()
> 2) Add calls to start_sample() and end_sample() to mark the start and
> stop of the operation you wish to measure.
> 3) Periodically call performance_run() in order to compile statistics.
> 4) Upon completion (likely program shutdown), call performance_destroy()
> to clean up.

You could register an atexit() handler to automatically destroy these.
Then the framework user doesn't have to deal with that.  There's also a
fatal_signal_add_hook() function where you can register deletion
(although I'd advise against it - usually a fatal signal means 'leave
memory alone' and cleanup here would require deletion).

> Signed-off-by: Mark Michelson <mmichels@redhat.com>
> ---
>  lib/automake.mk   |   2 +
>  lib/performance.c | 387 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  lib/performance.h |  58 ++++++++
>  3 files changed, 447 insertions(+)
>  create mode 100644 lib/performance.c
>  create mode 100644 lib/performance.h
>
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 4b38a1163..fef2c4e1a 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -205,6 +205,8 @@ lib_libopenvswitch_la_SOURCES = \
>  	lib/pcap-file.h \
>  	lib/perf-counter.h \
>  	lib/perf-counter.c \
> +	lib/performance.h \
> +	lib/performance.c \
>  	lib/poll-loop.c \
>  	lib/process.c \
>  	lib/process.h \
> diff --git a/lib/performance.c b/lib/performance.c
> new file mode 100644
> index 000000000..5b54cfc29
> --- /dev/null
> +++ b/lib/performance.c
> @@ -0,0 +1,387 @@
> +/* Copyright (c) 2017 Red Hat, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +
> +#include "performance.h"
> +#include "timer.h"
> +#include "timeval.h"
> +#include "openvswitch/shash.h"
> +#include "openvswitch/vlog.h"
> +#include "unixctl.h"
> +#include "openvswitch/dynamic-string.h"
> +#include "openvswitch/poll-loop.h"
> +
> +VLOG_DEFINE_THIS_MODULE(performance);
> +
> +struct sample {
> +    long long int start_time; /* Time when we started this sample */
> +    long long int end_time;   /* Time when we ended this sample */
> +    long long int elapsed;    /* Elapsed time: end_time - start_time */
> +};
> +
> +struct sample_vec {
> +    struct sample *samples; /* Dynamic array of samples */
> +    size_t n_samples;       /* Number of samples */
> +    size_t capacity;        /* Number of allocated samples */
> +};
> +
> +struct stats {
> +    long long int min;         /* Minimum measurement (ms) */
> +    long long int max;         /* Maximum measurement (ms) */
> +    long long int average;     /* Average measurement (ms) */

I think signed values doesn't make sense.  I don't know that time moves
backwards (unless you're using "wall clock", instead of a monotonic time
source).  I guess this is to keep the return type correct with
time_msec(), but I don't understand why that should be signed either
(since it says it will be monotonic).  In practice, it generally *won't*
matter at msec granularity - I 'napkin math' think it takes something
like 6.2bn years to rollover from epoch.  I just worry about the
rollover to negative case (and maybe it isn't important).

Additionally, I think integers aren't a good way of storing the average,
since it will truncate, and 1ms vs 1.9ms are very different when it
comes to measurements.

> +    long long int percentile;  /* 95th percentile measurement (ms) */
> +    long long int num_samples; /* Total number of measurements */
> +};
> +
> +struct performance {
> +    struct sample_vec vec;
> +    long long int sample_rate;
> +    struct timer timer;
> +    struct sample cur_sample;
> +};

It looks like you're keeping around the samples because you want to 'age
out' from the moving average window.  Otherwise I think it can be
implemented using an IIR filter (psuedo-code below):

       current_avg -= (current_avg/n_window_samples)
       current_avg += (sample / n_window_samples)

BUT - if you're going to keep around the samples, you might as well
include a calculation of additional statistics (variance comes to mind).

> +static struct shash performances = SHASH_INITIALIZER(&performances);
> +static struct ovs_mutex performances_lock = OVS_MUTEX_INITIALIZER;

As I wrote before, having this global mutex means statistics
accumulation blocks for everyone.  You might think it doesn't matter,
but imagine this scenario:

  thread 0                      |    thread 1

  call perf_start()             |  .............
  .... do things                |  call perf_start() (get slept)
  call perf_end() (block...)    |  wake up and complete
     (unblock)                  |

In the above, the perf_end() will intermittently include lock calls from
the perf_start(), which may smooth out over time but will introduce
variance.

> +#define PERFORMANCE_INTERVAL 10000
> +#define PERFORMANCE_OLDEST 600000
> +
> +static void
> +add_sample(struct sample_vec *vec, struct sample *new_sample)
> +{
> +    if (vec->capacity == vec->n_samples) {
> +        vec->samples = x2nrealloc(vec->samples, &vec->capacity,
> +            sizeof *vec->samples);

This resize operation can grow very quickly.  Assuming the samples come
very fast, you'll run out of memory very quickly.  Is there a reason not
to have just roll samples in the vector?  I guess it makes your
'selection' for various time periods more difficult, but you can also
track those with independent counters as well (just set a different time
interval... :).  That also means no allocations when creating samples,
and no deallocations when removing samples, freeing up additional memory
pressure.

The "disadvantage" is you need to determine your sampling window sizes,
but that may be something you expose as a tune-able setting?  Also,
tracking your 1m,5m,10m counters would be a bit different (but you could
just treat them as additional 'performance counters' in your framework,
and every time the performance_run() decides it's time to take an
average, if the time interval makes sense, it can update "name",
"name_1m", "name_5m", "name_10m", etc).

> +    }
> +
> +    vec->samples[vec->n_samples++] = *new_sample;
> +}
> +
> +static int
> +find_earliest(const struct sample_vec *vec, long long int age_ms)
> +{
> +    long long int cutoff = time_msec() - age_ms;
> +
> +    for (int i = 0; i < vec->n_samples; i++) {
> +        if (vec->samples[i].end_time >= cutoff) {
> +            return i;
> +        }
> +    }
> +
> +    /* Either the vector is empty or all times are
> +     * older than the cutoff.
> +     */
> +    return -1;
> +}
> +
> +static long long int
> +average(const struct sample *samples, size_t num_samples)
> +{
> +    /* Avoid division by zero */
> +    if (num_samples == 0) {
> +        return 0;
> +    }
> +
> +    long long int sum = 0;
> +    for (size_t i = 0; i < num_samples; i++) {
> +        sum += samples[i].elapsed;
> +    }
> +
> +    return sum / num_samples;
> +}
> +
> +/* Calculate a specific percentile value.
> + * Valid percentile values range from 0 to 99
> + */
> +static long long int
> +percentile(const struct sample *samples, size_t num_samples, int percentile)
> +{
> +    if (num_samples == 0) {
> +        return 0;
> +    }
> +
> +    size_t pctl = num_samples * percentile / 100;
> +    return samples[pctl].elapsed;
> +}
> +
> +static void
> +cull_old_times(struct sample_vec *vec, long long int age_ms)
> +{
> +    int i = find_earliest(vec, age_ms);
> +
> +    if (i <= 0) {
> +        return;
> +    }
> +
> +    size_t new_size = vec->n_samples - i;
> +    memmove(vec->samples, &vec->samples[i], new_size * sizeof *vec->samples);
> +    vec->n_samples = new_size;
> +}
> +
> +static void
> +format_stats(struct ds *s, const char *prefix, const struct stats *stats)
> +{
> +    if (stats->num_samples) {
> +        ds_put_format(s, "\t%s samples: %lld\n", prefix, stats->num_samples);
> +        ds_put_format(s, "\t%s minimum: %lld\n", prefix, stats->min);
> +        ds_put_format(s, "\t%s maximum: %lld\n", prefix, stats->max);
> +        ds_put_format(s, "\t%s average: %lld\n", prefix, stats->average);
> +        ds_put_format(s, "\t%s 95th percentile: %lld\n\n", prefix,
> +                      stats->percentile);
> +    } else {
> +        ds_put_format(s, "\t%s samples: 0\n", prefix);
> +        ds_put_format(s, "\t%s minimum: N/A\n", prefix);
> +        ds_put_format(s, "\t%s maximum: N/A\n", prefix);
> +        ds_put_format(s, "\t%s average: N/A\n", prefix);
> +        ds_put_format(s, "\t%s 95th percentile: N/A\n\n", prefix);
> +    }
> +}
> +
> +static int
> +cmp_times(const void *left_, const void *right_)
> +{
> +    const struct sample *left = left_;
> +    const struct sample *right = right_;
> +
> +    return left->elapsed == right->elapsed ? 0
> +        : left->elapsed > right->elapsed;
> +}
> +
> +static struct sample *
> +sort_times(const struct sample *by_time, size_t vec_size)

I got confused reading through this a few times, because the name
doesn't convey the 'copy' operation that happens.  It would be good to
call this something like:

  sorted_copy_times()

which would make reasoning about the get_stats block a little bit better
(I read through it twice forgetting that sort_times wasn't an in-place
sort).

> +{
> +    struct sample *copy = xmemdup(by_time, vec_size * sizeof *copy);
> +    qsort(copy, vec_size, sizeof *copy, cmp_times);
> +
> +    return copy;
> +}
> +
> +static void
> +get_stats(const struct sample_vec *vec, long long int age_ms,
> +          struct stats *stats)
> +{
> +    int start_idx = find_earliest(vec, age_ms);
> +    if (start_idx < 0) {
> +        memset(stats, 0, sizeof *stats);
> +        return;
> +    }
> +    size_t vec_size = vec->n_samples - start_idx;
> +
> +    struct sample *by_time = &vec->samples[start_idx];
> +    struct sample *by_elapsed = sort_times(by_time, vec_size);
> +
> +    stats->min = by_elapsed[0].elapsed;
> +    stats->max = by_elapsed[vec_size - 1].elapsed;
> +    stats->average = average(by_time, vec_size);
> +    stats->percentile = percentile(by_elapsed, vec_size, 95);
> +    stats->num_samples = vec_size;
> +
> +    free(by_elapsed);
> +}
> +
> +static void
> +performance_print(struct performance *perf, const char *name,
> +                  struct ds *s)
> +{
> +    struct stats one_min;
> +    struct stats five_min;
> +    struct stats ten_min;
> +    ds_put_format(s, "Statistics for '%s'\n", name);
> +
> +    get_stats(&perf->vec, 60000, &one_min);
> +    get_stats(&perf->vec, 300000, &five_min);
> +    get_stats(&perf->vec, 600000, &ten_min);
> +
> +    format_stats(s, "1 minute", &one_min);
> +    format_stats(s, "5 minute", &five_min);
> +    format_stats(s, "10 minute", &ten_min);
> +}
> +
> +static bool
> +performance_show_protected(int argc, const char *argv[], struct ds *s)
> +{
> +    struct performance *perf;
> +
> +    if (argc > 1) {
> +        perf = shash_find_data(&performances, argv[1]);
> +        if (!perf) {
> +            ds_put_cstr(s, "No such performance");
> +            return false;
> +        }
> +        performance_print(perf, argv[1], s);
> +    } else {
> +        struct shash_node *node;
> +        SHASH_FOR_EACH (node, &performances) {
> +            perf = node->data;
> +            performance_print(perf, node->name, s);
> +        }
> +    }
> +
> +    return true;
> +}
> +
> +static void
> +performance_show(struct unixctl_conn *conn, int argc OVS_UNUSED,
> +        const char *argv[], void *ignore OVS_UNUSED)
> +{
> +    struct ds s = DS_EMPTY_INITIALIZER;
> +    bool success;
> +
> +    ovs_mutex_lock(&performances_lock);
> +    success = performance_show_protected(argc, argv, &s);
> +    ovs_mutex_unlock(&performances_lock);
> +
> +    if (success) {
> +        unixctl_command_reply(conn, ds_cstr(&s));
> +    } else {
> +        unixctl_command_reply_error(conn, ds_cstr(&s));
> +    }
> +    ds_destroy(&s);
> +}
> +
> +static void
> +do_init_performance(void)
> +{
> +    unixctl_command_register("performance/show", "[NAME]", 0, 1,
> +                             performance_show, NULL);
> +}
> +
> +static void
> +performance_init(void)
> +{
> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> +    if (ovsthread_once_start(&once)) {
> +        do_init_performance();
> +        ovsthread_once_done(&once);
> +    }
> +}
> +
> +void
> +performance_create(const char *name)
> +{
> +    performance_init();
> +
> +    struct performance *perf = xzalloc(sizeof *perf);
> +    timer_set_duration(&perf->timer, PERFORMANCE_INTERVAL);
> +    ovs_mutex_lock(&performances_lock);
> +    shash_add(&performances, name, perf);
> +    ovs_mutex_unlock(&performances_lock);
> +}
> +
> +void
> +performance_destroy(const char *name)
> +{
> +    struct performance *perf;
> +
> +    ovs_mutex_lock(&performances_lock);
> +    perf = shash_find_and_delete(&performances, name);
> +    ovs_mutex_unlock(&performances_lock);
> +
> +    if (perf) {
> +        free(perf->vec.samples);
> +    }
> +
> +    free(perf);
> +}
> +
> +static bool
> +performance_start_sample_protected(const char *name)
> +{
> +    struct performance *perf = shash_find_data(&performances, name);
> +    if (!perf) {
> +        return false;
> +    }
> +
> +    /* We already started sampling. Need an end before
> +     * we start another sample
> +     */
> +    if (perf->cur_sample.start_time) {
> +        return false;
> +    }
> +
> +    perf->cur_sample.start_time = time_msec();
> +    return true;
> +}
> +
> +bool
> +performance_start_sample(const char *name)
> +{
> +    ovs_mutex_lock(&performances_lock);
> +    bool ret = performance_start_sample_protected(name);
> +    ovs_mutex_unlock(&performances_lock);
> +
> +    return ret;
> +}
> +
> +static bool
> +performance_end_sample_protected(const char *name)
> +{
> +    struct performance *perf = shash_find_data(&performances, name);
> +    if (!perf) {
> +        return false;
> +    }
> +
> +    /* We can't end a sample if we haven't started one */
> +    if (!perf->cur_sample.start_time) {
> +        return false;
> +    }
> +
> +    perf->cur_sample.end_time = time_msec();
> +    perf->cur_sample.elapsed = perf->cur_sample.end_time
> +        - perf->cur_sample.start_time;
> +
> +    add_sample(&perf->vec, &perf->cur_sample);
> +
> +    memset(&perf->cur_sample, 0, sizeof perf->cur_sample);
> +    return true;
> +}
> +
> +bool
> +performance_end_sample(const char *name)
> +{
> +    ovs_mutex_lock(&performances_lock);
> +    bool ret = performance_end_sample_protected(name);
> +    ovs_mutex_unlock(&performances_lock);
> +
> +    return ret;
> +}
> +
> +static void
> +performance_run_protected(void)
> +{
> +    struct shash_node *node;
> +    SHASH_FOR_EACH (node, &performances) {
> +        struct performance *perf = node->data;
> +        if (!timer_expired(&perf->timer)) {
> +            timer_wait(&perf->timer);
> +            continue;
> +        }
> +
> +        cull_old_times(&perf->vec, PERFORMANCE_OLDEST);
> +
> +        timer_set_duration(&perf->timer, perf->sample_rate);
> +        timer_wait(&perf->timer);
> +    }
> +}
> +
> +void
> +performance_run(void)
> +{
> +    ovs_mutex_lock(&performances_lock);
> +    performance_run_protected();
> +    ovs_mutex_unlock(&performances_lock);
> +}
> diff --git a/lib/performance.h b/lib/performance.h
> new file mode 100644
> index 000000000..ec115a88c
> --- /dev/null
> +++ b/lib/performance.h
> @@ -0,0 +1,58 @@
> +/* Copyright (c) 2017 Red Hat, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef PERFORMANCE_H
> +#define PERFORMANCE_H 1
> +
> +#include <stdbool.h>
> +
> +/* This file provides a method for timing operations in OVS.
> + *
> + * The expected operation is as follows:
> + * 1) Call performance_create(), supplying a unique name for the
> + *    operation that is being measured.
> + * 2) Place calls to start_sample() and end_sample() at the beginning
> + *    and end of the operation you wish to measure.
> + * 3) Periodically call performance_run() so that statistics can be
> + *    gathered based on the recorded times. If you are running a loop,
> + *    calling this once per loop is a good practice.
> + * 4) When complete, call performance_destroy() to clean up.
> + *
> + * This implementation is thread-safe.
> + */
> +
> +/* Create a new performance measurement tracker*/
> +void performance_create(const char *name);
> +
> +/* Destroy a performance measurement tracker */
> +void performance_destroy(const char *name);
> +
> +/* Indicate that a performance measurement is beginning. */
> +bool performance_start_sample(const char *name);
> +
> +/* Indicate that a performance measurement has ended. The
> + * sample will be added to the history of performance
> + * measurements for this tracker
> + */
> +bool performance_end_sample(const char *name);
> +
> +/* Perform periodic duties on all performance measurement
> + * trackers. Currently, this just takes care of culling old
> + * measurements. It is a good practice to call this periodically
> + * from a daemon's main loop
> + */
> +void performance_run(void);
> +
> +#endif /* performance.h */
Mark Michelson Jan. 18, 2018, 9:36 p.m. | #2
Hi Aaron,

Thanks for the feedback. Not having the same sort of statistical 
background as you, some of the suggestions are a bit over my head. See 
below for my responses.

On 01/18/2018 10:20 AM, Aaron Conole wrote:
> Hi Mark,
> 
> Mark Michelson <mmichels@redhat.com> writes:
> 
>> This is similar to the existing coverage and perf-counter APIs in OVS.
>> However, rather than keeping counters, this is aimed at timing how long
>> operations take to perform. "Operations" in this case can be anything
>> from a loop iteration, to a function, to something more complex.
> 
> It's important to note that the granularity on this is measured in 'ms',
> which means this library is more suited towards measuring larger groups
> of operations (those which take longer than the clock resolution on
> time_msec).  I guess one advantage of

One advantage of...?

> 
> I might suggest the ability to pick the time resolution.  For instance,
> sometimes picking msec is useful when clocking response time in a
> multi-message system.  Sometimes a tighter granularity (using finer
> controls such as 'clock_gettime', 'rdtsc', 'read register 9 of
> coprocessor 0') may be desired.  A function pointer passed to the
> performance 'bucket' ie:
>     performance_create("something", perf_time_msec)
> that can have a function signature that makes sense for most cases (ull
> return, no args, etc.) might be good to allow users to customize the
> performance system to suit their needs.
> 
> OTOH, since it won't have much for initial users, maybe it's okay.

Sure, this is a good idea.

> 
> Also, apologies if I'm really picky or seem overly verbose with moving
> average filters.  Once upon a time, I worked on an embedded system that
> was a giant... :)
> 
>> The library will keep track of how long it takes to perform the
>> particular operations and will maintain statistics of those running
>> times.
>>
>> Statistics for a particular operation can be queried from the command
>> line by using ovs-appctl -t <target> performance/show <operation name>.
> 
> I noticed that you'll have a global lock preventing statistics updates
> from happening.  No problem - except the larger issue is while that lock
> is held no other forward progress for the thread tracking the statistics
> can happen.
> 
> Maybe it makes sense to think about an alternate set of data structures
> that will accommodate multiple threads contending on performance
> monitoring.  I'm guessing in your use case there isn't any contention,
> but perhaps in the future someone will use this in, for example, a
> netdev area to profile and the results would get skewed.

Yes, I agree with this. Do you have some suggestions for a method that 
would work better? I've thought of a couple of ideas:

1) Use thread-local storage. This has the advantage of eliminating 
contention. However, I'm not sure how this would work to be able to 
print statistics to the CLI if the thread handling the CLI command 
cannot get to statistics in another thread.

2) Separate responsibilities into different threads. The threads that 
are actually performing operations we want to time would be responsible 
for grabbing the current timestamp and then passing that information to 
a background thread. This background thread would be responsible for 
then storing that data in a structure that is local only to that thread. 
This way, the "foreground" threads are only responsible for getting the 
current timestamp and writing the data to a pipe/socket.

If you have alternate ideas, please let me know.

> 
>> The API is designed to be pretty small. The expected steps are as
>> follows:
>>
>> 1) Create a performance measurement, providing a unique name, using
>> performance_create()
>> 2) Add calls to start_sample() and end_sample() to mark the start and
>> stop of the operation you wish to measure.
>> 3) Periodically call performance_run() in order to compile statistics.
>> 4) Upon completion (likely program shutdown), call performance_destroy()
>> to clean up.
> 
> You could register an atexit() handler to automatically destroy these.
> Then the framework user doesn't have to deal with that.  There's also a
> fatal_signal_add_hook() function where you can register deletion
> (although I'd advise against it - usually a fatal signal means 'leave
> memory alone' and cleanup here would require deletion).

An atexit() handler makes sense. I'll implement that.

> 
>> Signed-off-by: Mark Michelson <mmichels@redhat.com>
>> ---
>>   lib/automake.mk   |   2 +
>>   lib/performance.c | 387 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   lib/performance.h |  58 ++++++++
>>   3 files changed, 447 insertions(+)
>>   create mode 100644 lib/performance.c
>>   create mode 100644 lib/performance.h
>>
>> diff --git a/lib/automake.mk b/lib/automake.mk
>> index 4b38a1163..fef2c4e1a 100644
>> --- a/lib/automake.mk
>> +++ b/lib/automake.mk
>> @@ -205,6 +205,8 @@ lib_libopenvswitch_la_SOURCES = \
>>   	lib/pcap-file.h \
>>   	lib/perf-counter.h \
>>   	lib/perf-counter.c \
>> +	lib/performance.h \
>> +	lib/performance.c \
>>   	lib/poll-loop.c \
>>   	lib/process.c \
>>   	lib/process.h \
>> diff --git a/lib/performance.c b/lib/performance.c
>> new file mode 100644
>> index 000000000..5b54cfc29
>> --- /dev/null
>> +++ b/lib/performance.c
>> @@ -0,0 +1,387 @@
>> +/* Copyright (c) 2017 Red Hat, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#include <config.h>
>> +
>> +#include "performance.h"
>> +#include "timer.h"
>> +#include "timeval.h"
>> +#include "openvswitch/shash.h"
>> +#include "openvswitch/vlog.h"
>> +#include "unixctl.h"
>> +#include "openvswitch/dynamic-string.h"
>> +#include "openvswitch/poll-loop.h"
>> +
>> +VLOG_DEFINE_THIS_MODULE(performance);
>> +
>> +struct sample {
>> +    long long int start_time; /* Time when we started this sample */
>> +    long long int end_time;   /* Time when we ended this sample */
>> +    long long int elapsed;    /* Elapsed time: end_time - start_time */
>> +};
>> +
>> +struct sample_vec {
>> +    struct sample *samples; /* Dynamic array of samples */
>> +    size_t n_samples;       /* Number of samples */
>> +    size_t capacity;        /* Number of allocated samples */
>> +};
>> +
>> +struct stats {
>> +    long long int min;         /* Minimum measurement (ms) */
>> +    long long int max;         /* Maximum measurement (ms) */
>> +    long long int average;     /* Average measurement (ms) */
> 
> I think signed values doesn't make sense.  I don't know that time moves
> backwards (unless you're using "wall clock", instead of a monotonic time
> source).  I guess this is to keep the return type correct with
> time_msec(), but I don't understand why that should be signed either
> (since it says it will be monotonic).  In practice, it generally *won't*
> matter at msec granularity - I 'napkin math' think it takes something
> like 6.2bn years to rollover from epoch.  I just worry about the
> rollover to negative case (and maybe it isn't important).

Yeah the reason I used signed values here is simply because the 
time_msec() function (and others of its ilk in OVS) return a signed 
value. I agree that using an unsigned value makes sense since we 
shouldn't be dealing with wall clock times.

> 
> Additionally, I think integers aren't a good way of storing the average,
> since it will truncate, and 1ms vs 1.9ms are very different when it
> comes to measurements.

Heh, I think I had a XXX comment initially in my code saying "should I 
use floating point here?" I'll change it over.

> 
>> +    long long int percentile;  /* 95th percentile measurement (ms) */
>> +    long long int num_samples; /* Total number of measurements */
>> +};
>> +
>> +struct performance {
>> +    struct sample_vec vec;
>> +    long long int sample_rate;
>> +    struct timer timer;
>> +    struct sample cur_sample;
>> +};
> 
> It looks like you're keeping around the samples because you want to 'age
> out' from the moving average window.  Otherwise I think it can be
> implemented using an IIR filter (psuedo-code below):
> 
>         current_avg -= (current_avg/n_window_samples)
>         current_avg += (sample / n_window_samples)
> 
> BUT - if you're going to keep around the samples, you might as well
> include a calculation of additional statistics (variance comes to mind).

I'm not familiar with IIR filters, so I just want to make sure I 
understand what you're suggesting before trying to implement it. The 
idea is that rather than keeping around, say, 10 minutes worth of 
samples, I just use the calculations above each time that a new sample 
arrives. This way, I don't have to store my samples at all. I'm not sure 
how to choose n_window_samples though, or how to relate that to a 
specific amount of time that has passed.

> 
>> +static struct shash performances = SHASH_INITIALIZER(&performances);
>> +static struct ovs_mutex performances_lock = OVS_MUTEX_INITIALIZER;
> 
> As I wrote before, having this global mutex means statistics
> accumulation blocks for everyone.  You might think it doesn't matter,
> but imagine this scenario:
> 
>    thread 0                      |    thread 1
> 
>    call perf_start()             |  .............
>    .... do things                |  call perf_start() (get slept)
>    call perf_end() (block...)    |  wake up and complete
>       (unblock)                  |
> 
> In the above, the perf_end() will intermittently include lock calls from
> the perf_start(), which may smooth out over time but will introduce
> variance.

I think my suggestions above about alternate threading models might 
address this. Let me know if there are other factors to take into 
consideration.

> 
>> +#define PERFORMANCE_INTERVAL 10000
>> +#define PERFORMANCE_OLDEST 600000
>> +
>> +static void
>> +add_sample(struct sample_vec *vec, struct sample *new_sample)
>> +{
>> +    if (vec->capacity == vec->n_samples) {
>> +        vec->samples = x2nrealloc(vec->samples, &vec->capacity,
>> +            sizeof *vec->samples);
> 
> This resize operation can grow very quickly.  Assuming the samples come
> very fast, you'll run out of memory very quickly.  Is there a reason not
> to have just roll samples in the vector?  I guess it makes your
> 'selection' for various time periods more difficult, but you can also
> track those with independent counters as well (just set a different time
> interval... :).  That also means no allocations when creating samples,
> and no deallocations when removing samples, freeing up additional memory
> pressure.

I'll be honest here, I'm not following you. Rather, I understand your 
concerns about memory. The thing to understand here is that the vector 
will grow for a certain amount of time, but then after a while, old 
times will start getting culled from the vector. At that point, the size 
of the vector should stabilize (unless you have very bursty periods of 
activity). However, I agree that if there are strategies that result in 
the use of less memory, then I should go with those :)

What I'm not following are the parts about "roll samples in the vector" 
and "independent counters". Can you elaborate on those points some more?


> 
> The "disadvantage" is you need to determine your sampling window sizes,
> but that may be something you expose as a tune-able setting?  Also,
> tracking your 1m,5m,10m counters would be a bit different (but you could
> just treat them as additional 'performance counters' in your framework,
> and every time the performance_run() decides it's time to take an
> average, if the time interval makes sense, it can update "name",
> "name_1m", "name_5m", "name_10m", etc).

Again, I'm not really following you here. When you say "sampling window 
sizes" are these what you were referencing earlier in your avg formula? 
Are these based solely on the number of samples collected, rather than 
amount of time that has passed collecting them? I'm not really sure what 
you're talking about with the "name" suggestions in the last sentence.

> 
>> +    }
>> +
>> +    vec->samples[vec->n_samples++] = *new_sample;
>> +}
>> +
>> +static int
>> +find_earliest(const struct sample_vec *vec, long long int age_ms)
>> +{
>> +    long long int cutoff = time_msec() - age_ms;
>> +
>> +    for (int i = 0; i < vec->n_samples; i++) {
>> +        if (vec->samples[i].end_time >= cutoff) {
>> +            return i;
>> +        }
>> +    }
>> +
>> +    /* Either the vector is empty or all times are
>> +     * older than the cutoff.
>> +     */
>> +    return -1;
>> +}
>> +
>> +static long long int
>> +average(const struct sample *samples, size_t num_samples)
>> +{
>> +    /* Avoid division by zero */
>> +    if (num_samples == 0) {
>> +        return 0;
>> +    }
>> +
>> +    long long int sum = 0;
>> +    for (size_t i = 0; i < num_samples; i++) {
>> +        sum += samples[i].elapsed;
>> +    }
>> +
>> +    return sum / num_samples;
>> +}
>> +
>> +/* Calculate a specific percentile value.
>> + * Valid percentile values range from 0 to 99
>> + */
>> +static long long int
>> +percentile(const struct sample *samples, size_t num_samples, int percentile)
>> +{
>> +    if (num_samples == 0) {
>> +        return 0;
>> +    }
>> +
>> +    size_t pctl = num_samples * percentile / 100;
>> +    return samples[pctl].elapsed;
>> +}
>> +
>> +static void
>> +cull_old_times(struct sample_vec *vec, long long int age_ms)
>> +{
>> +    int i = find_earliest(vec, age_ms);
>> +
>> +    if (i <= 0) {
>> +        return;
>> +    }
>> +
>> +    size_t new_size = vec->n_samples - i;
>> +    memmove(vec->samples, &vec->samples[i], new_size * sizeof *vec->samples);
>> +    vec->n_samples = new_size;
>> +}
>> +
>> +static void
>> +format_stats(struct ds *s, const char *prefix, const struct stats *stats)
>> +{
>> +    if (stats->num_samples) {
>> +        ds_put_format(s, "\t%s samples: %lld\n", prefix, stats->num_samples);
>> +        ds_put_format(s, "\t%s minimum: %lld\n", prefix, stats->min);
>> +        ds_put_format(s, "\t%s maximum: %lld\n", prefix, stats->max);
>> +        ds_put_format(s, "\t%s average: %lld\n", prefix, stats->average);
>> +        ds_put_format(s, "\t%s 95th percentile: %lld\n\n", prefix,
>> +                      stats->percentile);
>> +    } else {
>> +        ds_put_format(s, "\t%s samples: 0\n", prefix);
>> +        ds_put_format(s, "\t%s minimum: N/A\n", prefix);
>> +        ds_put_format(s, "\t%s maximum: N/A\n", prefix);
>> +        ds_put_format(s, "\t%s average: N/A\n", prefix);
>> +        ds_put_format(s, "\t%s 95th percentile: N/A\n\n", prefix);
>> +    }
>> +}
>> +
>> +static int
>> +cmp_times(const void *left_, const void *right_)
>> +{
>> +    const struct sample *left = left_;
>> +    const struct sample *right = right_;
>> +
>> +    return left->elapsed == right->elapsed ? 0
>> +        : left->elapsed > right->elapsed;
>> +}
>> +
>> +static struct sample *
>> +sort_times(const struct sample *by_time, size_t vec_size)
> 
> I got confused reading through this a few times, because the name
> doesn't convey the 'copy' operation that happens.  It would be good to
> call this something like:
> 
>    sorted_copy_times()
> 
> which would make reasoning about the get_stats block a little bit better
> (I read through it twice forgetting that sort_times wasn't an in-place
> sort).

Sure, I can do that.

> 
>> +{
>> +    struct sample *copy = xmemdup(by_time, vec_size * sizeof *copy);
>> +    qsort(copy, vec_size, sizeof *copy, cmp_times);
>> +
>> +    return copy;
>> +}
>> +
>> +static void
>> +get_stats(const struct sample_vec *vec, long long int age_ms,
>> +          struct stats *stats)
>> +{
>> +    int start_idx = find_earliest(vec, age_ms);
>> +    if (start_idx < 0) {
>> +        memset(stats, 0, sizeof *stats);
>> +        return;
>> +    }
>> +    size_t vec_size = vec->n_samples - start_idx;
>> +
>> +    struct sample *by_time = &vec->samples[start_idx];
>> +    struct sample *by_elapsed = sort_times(by_time, vec_size);
>> +
>> +    stats->min = by_elapsed[0].elapsed;
>> +    stats->max = by_elapsed[vec_size - 1].elapsed;
>> +    stats->average = average(by_time, vec_size);
>> +    stats->percentile = percentile(by_elapsed, vec_size, 95);
>> +    stats->num_samples = vec_size;
>> +
>> +    free(by_elapsed);
>> +}
>> +
>> +static void
>> +performance_print(struct performance *perf, const char *name,
>> +                  struct ds *s)
>> +{
>> +    struct stats one_min;
>> +    struct stats five_min;
>> +    struct stats ten_min;
>> +    ds_put_format(s, "Statistics for '%s'\n", name);
>> +
>> +    get_stats(&perf->vec, 60000, &one_min);
>> +    get_stats(&perf->vec, 300000, &five_min);
>> +    get_stats(&perf->vec, 600000, &ten_min);
>> +
>> +    format_stats(s, "1 minute", &one_min);
>> +    format_stats(s, "5 minute", &five_min);
>> +    format_stats(s, "10 minute", &ten_min);
>> +}
>> +
>> +static bool
>> +performance_show_protected(int argc, const char *argv[], struct ds *s)
>> +{
>> +    struct performance *perf;
>> +
>> +    if (argc > 1) {
>> +        perf = shash_find_data(&performances, argv[1]);
>> +        if (!perf) {
>> +            ds_put_cstr(s, "No such performance");
>> +            return false;
>> +        }
>> +        performance_print(perf, argv[1], s);
>> +    } else {
>> +        struct shash_node *node;
>> +        SHASH_FOR_EACH (node, &performances) {
>> +            perf = node->data;
>> +            performance_print(perf, node->name, s);
>> +        }
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static void
>> +performance_show(struct unixctl_conn *conn, int argc OVS_UNUSED,
>> +        const char *argv[], void *ignore OVS_UNUSED)
>> +{
>> +    struct ds s = DS_EMPTY_INITIALIZER;
>> +    bool success;
>> +
>> +    ovs_mutex_lock(&performances_lock);
>> +    success = performance_show_protected(argc, argv, &s);
>> +    ovs_mutex_unlock(&performances_lock);
>> +
>> +    if (success) {
>> +        unixctl_command_reply(conn, ds_cstr(&s));
>> +    } else {
>> +        unixctl_command_reply_error(conn, ds_cstr(&s));
>> +    }
>> +    ds_destroy(&s);
>> +}
>> +
>> +static void
>> +do_init_performance(void)
>> +{
>> +    unixctl_command_register("performance/show", "[NAME]", 0, 1,
>> +                             performance_show, NULL);
>> +}
>> +
>> +static void
>> +performance_init(void)
>> +{
>> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> +    if (ovsthread_once_start(&once)) {
>> +        do_init_performance();
>> +        ovsthread_once_done(&once);
>> +    }
>> +}
>> +
>> +void
>> +performance_create(const char *name)
>> +{
>> +    performance_init();
>> +
>> +    struct performance *perf = xzalloc(sizeof *perf);
>> +    timer_set_duration(&perf->timer, PERFORMANCE_INTERVAL);
>> +    ovs_mutex_lock(&performances_lock);
>> +    shash_add(&performances, name, perf);
>> +    ovs_mutex_unlock(&performances_lock);
>> +}
>> +
>> +void
>> +performance_destroy(const char *name)
>> +{
>> +    struct performance *perf;
>> +
>> +    ovs_mutex_lock(&performances_lock);
>> +    perf = shash_find_and_delete(&performances, name);
>> +    ovs_mutex_unlock(&performances_lock);
>> +
>> +    if (perf) {
>> +        free(perf->vec.samples);
>> +    }
>> +
>> +    free(perf);
>> +}
>> +
>> +static bool
>> +performance_start_sample_protected(const char *name)
>> +{
>> +    struct performance *perf = shash_find_data(&performances, name);
>> +    if (!perf) {
>> +        return false;
>> +    }
>> +
>> +    /* We already started sampling. Need an end before
>> +     * we start another sample
>> +     */
>> +    if (perf->cur_sample.start_time) {
>> +        return false;
>> +    }
>> +
>> +    perf->cur_sample.start_time = time_msec();
>> +    return true;
>> +}
>> +
>> +bool
>> +performance_start_sample(const char *name)
>> +{
>> +    ovs_mutex_lock(&performances_lock);
>> +    bool ret = performance_start_sample_protected(name);
>> +    ovs_mutex_unlock(&performances_lock);
>> +
>> +    return ret;
>> +}
>> +
>> +static bool
>> +performance_end_sample_protected(const char *name)
>> +{
>> +    struct performance *perf = shash_find_data(&performances, name);
>> +    if (!perf) {
>> +        return false;
>> +    }
>> +
>> +    /* We can't end a sample if we haven't started one */
>> +    if (!perf->cur_sample.start_time) {
>> +        return false;
>> +    }
>> +
>> +    perf->cur_sample.end_time = time_msec();
>> +    perf->cur_sample.elapsed = perf->cur_sample.end_time
>> +        - perf->cur_sample.start_time;
>> +
>> +    add_sample(&perf->vec, &perf->cur_sample);
>> +
>> +    memset(&perf->cur_sample, 0, sizeof perf->cur_sample);
>> +    return true;
>> +}
>> +
>> +bool
>> +performance_end_sample(const char *name)
>> +{
>> +    ovs_mutex_lock(&performances_lock);
>> +    bool ret = performance_end_sample_protected(name);
>> +    ovs_mutex_unlock(&performances_lock);
>> +
>> +    return ret;
>> +}
>> +
>> +static void
>> +performance_run_protected(void)
>> +{
>> +    struct shash_node *node;
>> +    SHASH_FOR_EACH (node, &performances) {
>> +        struct performance *perf = node->data;
>> +        if (!timer_expired(&perf->timer)) {
>> +            timer_wait(&perf->timer);
>> +            continue;
>> +        }
>> +
>> +        cull_old_times(&perf->vec, PERFORMANCE_OLDEST);
>> +
>> +        timer_set_duration(&perf->timer, perf->sample_rate);
>> +        timer_wait(&perf->timer);
>> +    }
>> +}
>> +
>> +void
>> +performance_run(void)
>> +{
>> +    ovs_mutex_lock(&performances_lock);
>> +    performance_run_protected();
>> +    ovs_mutex_unlock(&performances_lock);
>> +}
>> diff --git a/lib/performance.h b/lib/performance.h
>> new file mode 100644
>> index 000000000..ec115a88c
>> --- /dev/null
>> +++ b/lib/performance.h
>> @@ -0,0 +1,58 @@
>> +/* Copyright (c) 2017 Red Hat, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#ifndef PERFORMANCE_H
>> +#define PERFORMANCE_H 1
>> +
>> +#include <stdbool.h>
>> +
>> +/* This file provides a method for timing operations in OVS.
>> + *
>> + * The expected operation is as follows:
>> + * 1) Call performance_create(), supplying a unique name for the
>> + *    operation that is being measured.
>> + * 2) Place calls to start_sample() and end_sample() at the beginning
>> + *    and end of the operation you wish to measure.
>> + * 3) Periodically call performance_run() so that statistics can be
>> + *    gathered based on the recorded times. If you are running a loop,
>> + *    calling this once per loop is a good practice.
>> + * 4) When complete, call performance_destroy() to clean up.
>> + *
>> + * This implementation is thread-safe.
>> + */
>> +
>> +/* Create a new performance measurement tracker*/
>> +void performance_create(const char *name);
>> +
>> +/* Destroy a performance measurement tracker */
>> +void performance_destroy(const char *name);
>> +
>> +/* Indicate that a performance measurement is beginning. */
>> +bool performance_start_sample(const char *name);
>> +
>> +/* Indicate that a performance measurement has ended. The
>> + * sample will be added to the history of performance
>> + * measurements for this tracker
>> + */
>> +bool performance_end_sample(const char *name);
>> +
>> +/* Perform periodic duties on all performance measurement
>> + * trackers. Currently, this just takes care of culling old
>> + * measurements. It is a good practice to call this periodically
>> + * from a daemon's main loop
>> + */
>> +void performance_run(void);
>> +
>> +#endif /* performance.h */
Aaron Conole Jan. 19, 2018, 7:04 p.m. | #3
Mark Michelson <mmichels@redhat.com> writes:

> Hi Aaron,
>
> Thanks for the feedback. Not having the same sort of statistical
> background as you, some of the suggestions are a bit over my head. See
> below for my responses.
>
> On 01/18/2018 10:20 AM, Aaron Conole wrote:
>> Hi Mark,
>>
>> Mark Michelson <mmichels@redhat.com> writes:
>>
>>> This is similar to the existing coverage and perf-counter APIs in OVS.
>>> However, rather than keeping counters, this is aimed at timing how long
>>> operations take to perform. "Operations" in this case can be anything
>>> from a loop iteration, to a function, to something more complex.
>>
>> It's important to note that the granularity on this is measured in 'ms',
>> which means this library is more suited towards measuring larger groups
>> of operations (those which take longer than the clock resolution on
>> time_msec).  I guess one advantage of
>
> One advantage of...?

D'oh!  I wrote (and saved, but didn't send when I picked up the draft):

"I guess one advantage of this framework would be that for larger
'macro' operations there wouldn't be as much of a concern on an ever
growing window of samples - the start and end time encompasses so much.
My assumption is you intend this to be more on the macro side, so take
some of my suggestions with a grain of salt - I'm focusing more on a
benchmarking usecase for development."

For more clarity (because re-reading, it reads very opaque), apart from
this first counter you add in 2/2, how do you envision this framework to
be used?  Is it meant as a benchmarking tool that provides visibility at
run-time (sort of like a developer defined gcov)?  Is it supposed to
know about heirarchy (if your 'main' has a performance counter, and a
subroutine also has a performance counter, is the 'main' counter aware
of the 'subroutine' counter)?  Who would be using this?  Is it targetted
to in-field support?  Developers?  Depending on that, some of my other
questions, suggestions, and ideas are possibly inappropriate.

>>
>> I might suggest the ability to pick the time resolution.  For instance,
>> sometimes picking msec is useful when clocking response time in a
>> multi-message system.  Sometimes a tighter granularity (using finer
>> controls such as 'clock_gettime', 'rdtsc', 'read register 9 of
>> coprocessor 0') may be desired.  A function pointer passed to the
>> performance 'bucket' ie:
>>     performance_create("something", perf_time_msec)
>> that can have a function signature that makes sense for most cases (ull
>> return, no args, etc.) might be good to allow users to customize the
>> performance system to suit their needs.
>>
>> OTOH, since it won't have much for initial users, maybe it's okay.
>
> Sure, this is a good idea.
>
>>
>> Also, apologies if I'm really picky or seem overly verbose with moving
>> average filters.  Once upon a time, I worked on an embedded system that
>> was a giant... :)
>>
>>> The library will keep track of how long it takes to perform the
>>> particular operations and will maintain statistics of those running
>>> times.
>>>
>>> Statistics for a particular operation can be queried from the command
>>> line by using ovs-appctl -t <target> performance/show <operation name>.
>>
>> I noticed that you'll have a global lock preventing statistics updates
>> from happening.  No problem - except the larger issue is while that lock
>> is held no other forward progress for the thread tracking the statistics
>> can happen.
>>
>> Maybe it makes sense to think about an alternate set of data structures
>> that will accommodate multiple threads contending on performance
>> monitoring.  I'm guessing in your use case there isn't any contention,
>> but perhaps in the future someone will use this in, for example, a
>> netdev area to profile and the results would get skewed.
>
> Yes, I agree with this. Do you have some suggestions for a method that
> would work better? I've thought of a couple of ideas:
>
> 1) Use thread-local storage. This has the advantage of eliminating
> contention. However, I'm not sure how this would work to be able to
> print statistics to the CLI if the thread handling the CLI command
> cannot get to statistics in another thread.

I agree - I'm not sure this works.

> 2) Separate responsibilities into different threads. The threads that
> are actually performing operations we want to time would be
> responsible for grabbing the current timestamp and then passing that
> information to a background thread. This background thread would be
> responsible for then storing that data in a structure that is local
> only to that thread. This way, the "foreground" threads are only
> responsible for getting the current timestamp and writing the data to
> a pipe/socket.
>
> If you have alternate ideas, please let me know.

There may be a way to use RCU to do the vector updates, but that does
introduce some error in the measurements.

Also, you might use some kind of atomic primitives for a lock-free
approach.  That has a worse-case performing time, but it is a) constant,
and b) usually simple to understand.

>>> The API is designed to be pretty small. The expected steps are as
>>> follows:
>>>
>>> 1) Create a performance measurement, providing a unique name, using
>>> performance_create()
>>> 2) Add calls to start_sample() and end_sample() to mark the start and
>>> stop of the operation you wish to measure.
>>> 3) Periodically call performance_run() in order to compile statistics.
>>> 4) Upon completion (likely program shutdown), call performance_destroy()
>>> to clean up.
>>
>> You could register an atexit() handler to automatically destroy these.
>> Then the framework user doesn't have to deal with that.  There's also a
>> fatal_signal_add_hook() function where you can register deletion
>> (although I'd advise against it - usually a fatal signal means 'leave
>> memory alone' and cleanup here would require deletion).
>
> An atexit() handler makes sense. I'll implement that.
>
>>
>>> Signed-off-by: Mark Michelson <mmichels@redhat.com>
>>> ---
>>>   lib/automake.mk   |   2 +
>>>   lib/performance.c | 387 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>   lib/performance.h |  58 ++++++++
>>>   3 files changed, 447 insertions(+)
>>>   create mode 100644 lib/performance.c
>>>   create mode 100644 lib/performance.h
>>>
>>> diff --git a/lib/automake.mk b/lib/automake.mk
>>> index 4b38a1163..fef2c4e1a 100644
>>> --- a/lib/automake.mk
>>> +++ b/lib/automake.mk
>>> @@ -205,6 +205,8 @@ lib_libopenvswitch_la_SOURCES = \
>>>   	lib/pcap-file.h \
>>>   	lib/perf-counter.h \
>>>   	lib/perf-counter.c \
>>> +	lib/performance.h \
>>> +	lib/performance.c \
>>>   	lib/poll-loop.c \
>>>   	lib/process.c \
>>>   	lib/process.h \
>>> diff --git a/lib/performance.c b/lib/performance.c
>>> new file mode 100644
>>> index 000000000..5b54cfc29
>>> --- /dev/null
>>> +++ b/lib/performance.c
>>> @@ -0,0 +1,387 @@
>>> +/* Copyright (c) 2017 Red Hat, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#include <config.h>
>>> +
>>> +#include "performance.h"
>>> +#include "timer.h"
>>> +#include "timeval.h"
>>> +#include "openvswitch/shash.h"
>>> +#include "openvswitch/vlog.h"
>>> +#include "unixctl.h"
>>> +#include "openvswitch/dynamic-string.h"
>>> +#include "openvswitch/poll-loop.h"
>>> +
>>> +VLOG_DEFINE_THIS_MODULE(performance);
>>> +
>>> +struct sample {
>>> +    long long int start_time; /* Time when we started this sample */
>>> +    long long int end_time;   /* Time when we ended this sample */
>>> +    long long int elapsed;    /* Elapsed time: end_time - start_time */
>>> +};
>>> +
>>> +struct sample_vec {
>>> +    struct sample *samples; /* Dynamic array of samples */
>>> +    size_t n_samples;       /* Number of samples */
>>> +    size_t capacity;        /* Number of allocated samples */
>>> +};
>>> +
>>> +struct stats {
>>> +    long long int min;         /* Minimum measurement (ms) */
>>> +    long long int max;         /* Maximum measurement (ms) */
>>> +    long long int average;     /* Average measurement (ms) */
>>
>> I think signed values doesn't make sense.  I don't know that time moves
>> backwards (unless you're using "wall clock", instead of a monotonic time
>> source).  I guess this is to keep the return type correct with
>> time_msec(), but I don't understand why that should be signed either
>> (since it says it will be monotonic).  In practice, it generally *won't*
>> matter at msec granularity - I 'napkin math' think it takes something
>> like 6.2bn years to rollover from epoch.  I just worry about the
>> rollover to negative case (and maybe it isn't important).
>
> Yeah the reason I used signed values here is simply because the
> time_msec() function (and others of its ilk in OVS) return a signed
> value. I agree that using an unsigned value makes sense since we
> shouldn't be dealing with wall clock times.

As I wrote, it still takes something like 6.2 billion years (napkin math
billion years) worth of samples at msec granularity to wrap.  If the
clock stays fixed at msec, then it probably isn't the biggest fish to
fry.  But, I'm not sure what the 'purpose' of this measurement is if it
is so long.  Even disk IO is getting sub-ms performance.  And network
communication on the local node should be well under that threshold
also (is it not?).

>>
>> Additionally, I think integers aren't a good way of storing the average,
>> since it will truncate, and 1ms vs 1.9ms are very different when it
>> comes to measurements.
>
> Heh, I think I had a XXX comment initially in my code saying "should I
> use floating point here?" I'll change it over.
>
>>
>>> +    long long int percentile;  /* 95th percentile measurement (ms) */
>>> +    long long int num_samples; /* Total number of measurements */
>>> +};
>>> +
>>> +struct performance {
>>> +    struct sample_vec vec;
>>> +    long long int sample_rate;
>>> +    struct timer timer;
>>> +    struct sample cur_sample;
>>> +};
>>
>> It looks like you're keeping around the samples because you want to 'age
>> out' from the moving average window.  Otherwise I think it can be
>> implemented using an IIR filter (psuedo-code below):
>>
>>         current_avg -= (current_avg/n_window_samples)
>>         current_avg += (sample / n_window_samples)
>>
>> BUT - if you're going to keep around the samples, you might as well
>> include a calculation of additional statistics (variance comes to mind).
>
> I'm not familiar with IIR filters, so I just want to make sure I
> understand what you're suggesting before trying to implement it. The
> idea is that rather than keeping around, say, 10 minutes worth of
> samples, I just use the calculations above each time that a new sample
> arrives. This way, I don't have to store my samples at all. I'm not
> sure how to choose n_window_samples though, or how to relate that to a
> specific amount of time that has passed.

In this case, mostly ignore the comment.  It leads down a side path of
accumulated errors over time, etc.

But the same types of operations can still be applied to this average.
Instead of summing up each time, simply subtract the 'culled' counters,
and just keep adding the new counters to the total.  No need to walk the
list.  There's 1000 ways to skin this cat, and it's not as important in
the long run.

>>
>>> +static struct shash performances = SHASH_INITIALIZER(&performances);
>>> +static struct ovs_mutex performances_lock = OVS_MUTEX_INITIALIZER;
>>
>> As I wrote before, having this global mutex means statistics
>> accumulation blocks for everyone.  You might think it doesn't matter,
>> but imagine this scenario:
>>
>>    thread 0                      |    thread 1
>>
>>    call perf_start()             |  .............
>>    .... do things                |  call perf_start() (get slept)
>>    call perf_end() (block...)    |  wake up and complete
>>       (unblock)                  |
>>
>> In the above, the perf_end() will intermittently include lock calls from
>> the perf_start(), which may smooth out over time but will introduce
>> variance.
>
> I think my suggestions above about alternate threading models might
> address this. Let me know if there are other factors to take into
> consideration.

Okay.  I think it should too - let's see what you cook up :)

>>
>>> +#define PERFORMANCE_INTERVAL 10000
>>> +#define PERFORMANCE_OLDEST 600000
>>> +
>>> +static void
>>> +add_sample(struct sample_vec *vec, struct sample *new_sample)
>>> +{
>>> +    if (vec->capacity == vec->n_samples) {
>>> +        vec->samples = x2nrealloc(vec->samples, &vec->capacity,
>>> +            sizeof *vec->samples);
>>
>> This resize operation can grow very quickly.  Assuming the samples come
>> very fast, you'll run out of memory very quickly.  Is there a reason not
>> to have just roll samples in the vector?  I guess it makes your
>> 'selection' for various time periods more difficult, but you can also
>> track those with independent counters as well (just set a different time
>> interval... :).  That also means no allocations when creating samples,
>> and no deallocations when removing samples, freeing up additional memory
>> pressure.
>
> I'll be honest here, I'm not following you. Rather, I understand your
> concerns about memory. The thing to understand here is that the vector
> will grow for a certain amount of time, but then after a while, old
> times will start getting culled from the vector. At that point, the
> size of the vector should stabilize (unless you have very bursty
> periods of activity). However, I agree that if there are strategies
> that result in the use of less memory, then I should go with those :)
>
> What I'm not following are the parts about "roll samples in the
> vector" and "independent counters". Can you elaborate on those points
> some more?

Sure, I'll try to state what I was suggesting with an (poor) ascii art
drawing.

   vec=  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |

vec[0] holds sample 1, vec[6] holds sample 7.  When sample 8 arrives,
simply overwrite sample 1 (circular buffer might be a better descriptor,
sorry).

   vec=  |  8  |  2  |  3  |  4  |  5  |  6  |  7  |

You just roll over the sample.

As for the independent counters, I was thinking that since you know the
frequency with which samples come, and you know the window, you could
accumulate totals in additional performance elements.  Or sample the
averages over time (making sure that the explanation of what is actually
being measured is conveyed).

That's probably something that should be left to the person adding the
performance element, though.  For instance, some areas of the code won't
make sense to have a "last 1m, last 5m, last 10m."  Providing that data
could be misleading.

>>
>> The "disadvantage" is you need to determine your sampling window sizes,
>> but that may be something you expose as a tune-able setting?  Also,
>> tracking your 1m,5m,10m counters would be a bit different (but you could
>> just treat them as additional 'performance counters' in your framework,
>> and every time the performance_run() decides it's time to take an
>> average, if the time interval makes sense, it can update "name",
>> "name_1m", "name_5m", "name_10m", etc).
>
> Again, I'm not really following you here. When you say "sampling
> window sizes" are these what you were referencing earlier in your avg
> formula? Are these based solely on the number of samples collected,
> rather than amount of time that has passed collecting them? I'm not
> really sure what you're talking about with the "name" suggestions in
> the last sentence.

Let's just come back to this.

>>
>>> +    }
>>> +
>>> +    vec->samples[vec->n_samples++] = *new_sample;
>>> +}
>>> +
>>> +static int
>>> +find_earliest(const struct sample_vec *vec, long long int age_ms)
>>> +{
>>> +    long long int cutoff = time_msec() - age_ms;
>>> +
>>> +    for (int i = 0; i < vec->n_samples; i++) {
>>> +        if (vec->samples[i].end_time >= cutoff) {
>>> +            return i;
>>> +        }
>>> +    }
>>> +
>>> +    /* Either the vector is empty or all times are
>>> +     * older than the cutoff.
>>> +     */
>>> +    return -1;
>>> +}
>>> +
>>> +static long long int
>>> +average(const struct sample *samples, size_t num_samples)
>>> +{
>>> +    /* Avoid division by zero */
>>> +    if (num_samples == 0) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    long long int sum = 0;
>>> +    for (size_t i = 0; i < num_samples; i++) {
>>> +        sum += samples[i].elapsed;
>>> +    }
>>> +
>>> +    return sum / num_samples;
>>> +}
>>> +
>>> +/* Calculate a specific percentile value.
>>> + * Valid percentile values range from 0 to 99
>>> + */
>>> +static long long int
>>> +percentile(const struct sample *samples, size_t num_samples, int percentile)
>>> +{
>>> +    if (num_samples == 0) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    size_t pctl = num_samples * percentile / 100;
>>> +    return samples[pctl].elapsed;
>>> +}
>>> +
>>> +static void
>>> +cull_old_times(struct sample_vec *vec, long long int age_ms)
>>> +{
>>> +    int i = find_earliest(vec, age_ms);
>>> +
>>> +    if (i <= 0) {
>>> +        return;
>>> +    }
>>> +
>>> +    size_t new_size = vec->n_samples - i;
>>> +    memmove(vec->samples, &vec->samples[i], new_size * sizeof *vec->samples);
>>> +    vec->n_samples = new_size;
>>> +}
>>> +
>>> +static void
>>> +format_stats(struct ds *s, const char *prefix, const struct stats *stats)
>>> +{
>>> +    if (stats->num_samples) {
>>> +        ds_put_format(s, "\t%s samples: %lld\n", prefix, stats->num_samples);
>>> +        ds_put_format(s, "\t%s minimum: %lld\n", prefix, stats->min);
>>> +        ds_put_format(s, "\t%s maximum: %lld\n", prefix, stats->max);
>>> +        ds_put_format(s, "\t%s average: %lld\n", prefix, stats->average);
>>> +        ds_put_format(s, "\t%s 95th percentile: %lld\n\n", prefix,
>>> +                      stats->percentile);
>>> +    } else {
>>> +        ds_put_format(s, "\t%s samples: 0\n", prefix);
>>> +        ds_put_format(s, "\t%s minimum: N/A\n", prefix);
>>> +        ds_put_format(s, "\t%s maximum: N/A\n", prefix);
>>> +        ds_put_format(s, "\t%s average: N/A\n", prefix);
>>> +        ds_put_format(s, "\t%s 95th percentile: N/A\n\n", prefix);
>>> +    }
>>> +}
>>> +
>>> +static int
>>> +cmp_times(const void *left_, const void *right_)
>>> +{
>>> +    const struct sample *left = left_;
>>> +    const struct sample *right = right_;
>>> +
>>> +    return left->elapsed == right->elapsed ? 0
>>> +        : left->elapsed > right->elapsed;
>>> +}
>>> +
>>> +static struct sample *
>>> +sort_times(const struct sample *by_time, size_t vec_size)
>>
>> I got confused reading through this a few times, because the name
>> doesn't convey the 'copy' operation that happens.  It would be good to
>> call this something like:
>>
>>    sorted_copy_times()
>>
>> which would make reasoning about the get_stats block a little bit better
>> (I read through it twice forgetting that sort_times wasn't an in-place
>> sort).
>
> Sure, I can do that.
>
>>
>>> +{
>>> +    struct sample *copy = xmemdup(by_time, vec_size * sizeof *copy);
>>> +    qsort(copy, vec_size, sizeof *copy, cmp_times);
>>> +
>>> +    return copy;
>>> +}
>>> +
>>> +static void
>>> +get_stats(const struct sample_vec *vec, long long int age_ms,
>>> +          struct stats *stats)
>>> +{
>>> +    int start_idx = find_earliest(vec, age_ms);
>>> +    if (start_idx < 0) {
>>> +        memset(stats, 0, sizeof *stats);
>>> +        return;
>>> +    }
>>> +    size_t vec_size = vec->n_samples - start_idx;
>>> +
>>> +    struct sample *by_time = &vec->samples[start_idx];
>>> +    struct sample *by_elapsed = sort_times(by_time, vec_size);
>>> +
>>> +    stats->min = by_elapsed[0].elapsed;
>>> +    stats->max = by_elapsed[vec_size - 1].elapsed;
>>> +    stats->average = average(by_time, vec_size);
>>> +    stats->percentile = percentile(by_elapsed, vec_size, 95);
>>> +    stats->num_samples = vec_size;
>>> +
>>> +    free(by_elapsed);
>>> +}
>>> +
>>> +static void
>>> +performance_print(struct performance *perf, const char *name,
>>> +                  struct ds *s)
>>> +{
>>> +    struct stats one_min;
>>> +    struct stats five_min;
>>> +    struct stats ten_min;
>>> +    ds_put_format(s, "Statistics for '%s'\n", name);
>>> +
>>> +    get_stats(&perf->vec, 60000, &one_min);
>>> +    get_stats(&perf->vec, 300000, &five_min);
>>> +    get_stats(&perf->vec, 600000, &ten_min);
>>> +
>>> +    format_stats(s, "1 minute", &one_min);
>>> +    format_stats(s, "5 minute", &five_min);
>>> +    format_stats(s, "10 minute", &ten_min);
>>> +}
>>> +
>>> +static bool
>>> +performance_show_protected(int argc, const char *argv[], struct ds *s)
>>> +{
>>> +    struct performance *perf;
>>> +
>>> +    if (argc > 1) {
>>> +        perf = shash_find_data(&performances, argv[1]);
>>> +        if (!perf) {
>>> +            ds_put_cstr(s, "No such performance");
>>> +            return false;
>>> +        }
>>> +        performance_print(perf, argv[1], s);
>>> +    } else {
>>> +        struct shash_node *node;
>>> +        SHASH_FOR_EACH (node, &performances) {
>>> +            perf = node->data;
>>> +            performance_print(perf, node->name, s);
>>> +        }
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +static void
>>> +performance_show(struct unixctl_conn *conn, int argc OVS_UNUSED,
>>> +        const char *argv[], void *ignore OVS_UNUSED)
>>> +{
>>> +    struct ds s = DS_EMPTY_INITIALIZER;
>>> +    bool success;
>>> +
>>> +    ovs_mutex_lock(&performances_lock);
>>> +    success = performance_show_protected(argc, argv, &s);
>>> +    ovs_mutex_unlock(&performances_lock);
>>> +
>>> +    if (success) {
>>> +        unixctl_command_reply(conn, ds_cstr(&s));
>>> +    } else {
>>> +        unixctl_command_reply_error(conn, ds_cstr(&s));
>>> +    }
>>> +    ds_destroy(&s);
>>> +}
>>> +
>>> +static void
>>> +do_init_performance(void)
>>> +{
>>> +    unixctl_command_register("performance/show", "[NAME]", 0, 1,
>>> +                             performance_show, NULL);
>>> +}
>>> +
>>> +static void
>>> +performance_init(void)
>>> +{
>>> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>>> +    if (ovsthread_once_start(&once)) {
>>> +        do_init_performance();
>>> +        ovsthread_once_done(&once);
>>> +    }
>>> +}
>>> +
>>> +void
>>> +performance_create(const char *name)
>>> +{
>>> +    performance_init();
>>> +
>>> +    struct performance *perf = xzalloc(sizeof *perf);
>>> +    timer_set_duration(&perf->timer, PERFORMANCE_INTERVAL);
>>> +    ovs_mutex_lock(&performances_lock);
>>> +    shash_add(&performances, name, perf);
>>> +    ovs_mutex_unlock(&performances_lock);
>>> +}
>>> +
>>> +void
>>> +performance_destroy(const char *name)
>>> +{
>>> +    struct performance *perf;
>>> +
>>> +    ovs_mutex_lock(&performances_lock);
>>> +    perf = shash_find_and_delete(&performances, name);
>>> +    ovs_mutex_unlock(&performances_lock);
>>> +
>>> +    if (perf) {
>>> +        free(perf->vec.samples);
>>> +    }
>>> +
>>> +    free(perf);
>>> +}
>>> +
>>> +static bool
>>> +performance_start_sample_protected(const char *name)
>>> +{
>>> +    struct performance *perf = shash_find_data(&performances, name);
>>> +    if (!perf) {
>>> +        return false;
>>> +    }
>>> +
>>> +    /* We already started sampling. Need an end before
>>> +     * we start another sample
>>> +     */
>>> +    if (perf->cur_sample.start_time) {
>>> +        return false;
>>> +    }
>>> +
>>> +    perf->cur_sample.start_time = time_msec();
>>> +    return true;
>>> +}
>>> +
>>> +bool
>>> +performance_start_sample(const char *name)
>>> +{
>>> +    ovs_mutex_lock(&performances_lock);
>>> +    bool ret = performance_start_sample_protected(name);
>>> +    ovs_mutex_unlock(&performances_lock);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static bool
>>> +performance_end_sample_protected(const char *name)
>>> +{
>>> +    struct performance *perf = shash_find_data(&performances, name);
>>> +    if (!perf) {
>>> +        return false;
>>> +    }
>>> +
>>> +    /* We can't end a sample if we haven't started one */
>>> +    if (!perf->cur_sample.start_time) {
>>> +        return false;
>>> +    }
>>> +
>>> +    perf->cur_sample.end_time = time_msec();
>>> +    perf->cur_sample.elapsed = perf->cur_sample.end_time
>>> +        - perf->cur_sample.start_time;
>>> +
>>> +    add_sample(&perf->vec, &perf->cur_sample);
>>> +
>>> +    memset(&perf->cur_sample, 0, sizeof perf->cur_sample);
>>> +    return true;
>>> +}
>>> +
>>> +bool
>>> +performance_end_sample(const char *name)
>>> +{
>>> +    ovs_mutex_lock(&performances_lock);
>>> +    bool ret = performance_end_sample_protected(name);
>>> +    ovs_mutex_unlock(&performances_lock);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static void
>>> +performance_run_protected(void)
>>> +{
>>> +    struct shash_node *node;
>>> +    SHASH_FOR_EACH (node, &performances) {
>>> +        struct performance *perf = node->data;
>>> +        if (!timer_expired(&perf->timer)) {
>>> +            timer_wait(&perf->timer);
>>> +            continue;
>>> +        }
>>> +
>>> +        cull_old_times(&perf->vec, PERFORMANCE_OLDEST);
>>> +
>>> +        timer_set_duration(&perf->timer, perf->sample_rate);
>>> +        timer_wait(&perf->timer);
>>> +    }
>>> +}
>>> +
>>> +void
>>> +performance_run(void)
>>> +{
>>> +    ovs_mutex_lock(&performances_lock);
>>> +    performance_run_protected();
>>> +    ovs_mutex_unlock(&performances_lock);
>>> +}
>>> diff --git a/lib/performance.h b/lib/performance.h
>>> new file mode 100644
>>> index 000000000..ec115a88c
>>> --- /dev/null
>>> +++ b/lib/performance.h
>>> @@ -0,0 +1,58 @@
>>> +/* Copyright (c) 2017 Red Hat, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing, software
>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>> + * See the License for the specific language governing permissions and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef PERFORMANCE_H
>>> +#define PERFORMANCE_H 1
>>> +
>>> +#include <stdbool.h>
>>> +
>>> +/* This file provides a method for timing operations in OVS.
>>> + *
>>> + * The expected operation is as follows:
>>> + * 1) Call performance_create(), supplying a unique name for the
>>> + *    operation that is being measured.
>>> + * 2) Place calls to start_sample() and end_sample() at the beginning
>>> + *    and end of the operation you wish to measure.
>>> + * 3) Periodically call performance_run() so that statistics can be
>>> + *    gathered based on the recorded times. If you are running a loop,
>>> + *    calling this once per loop is a good practice.
>>> + * 4) When complete, call performance_destroy() to clean up.
>>> + *
>>> + * This implementation is thread-safe.
>>> + */
>>> +
>>> +/* Create a new performance measurement tracker*/
>>> +void performance_create(const char *name);
>>> +
>>> +/* Destroy a performance measurement tracker */
>>> +void performance_destroy(const char *name);
>>> +
>>> +/* Indicate that a performance measurement is beginning. */
>>> +bool performance_start_sample(const char *name);
>>> +
>>> +/* Indicate that a performance measurement has ended. The
>>> + * sample will be added to the history of performance
>>> + * measurements for this tracker
>>> + */
>>> +bool performance_end_sample(const char *name);
>>> +
>>> +/* Perform periodic duties on all performance measurement
>>> + * trackers. Currently, this just takes care of culling old
>>> + * measurements. It is a good practice to call this periodically
>>> + * from a daemon's main loop
>>> + */
>>> +void performance_run(void);
>>> +
>>> +#endif /* performance.h */
Mark Michelson Jan. 22, 2018, 6:31 p.m. | #4
On 01/19/2018 01:04 PM, Aaron Conole wrote:
> Mark Michelson <mmichels@redhat.com> writes:
> 
>> Hi Aaron,
>>
>> Thanks for the feedback. Not having the same sort of statistical
>> background as you, some of the suggestions are a bit over my head. See
>> below for my responses.
>>
>> On 01/18/2018 10:20 AM, Aaron Conole wrote:
>>> Hi Mark,
>>>
>>> Mark Michelson <mmichels@redhat.com> writes:
>>>
>>>> This is similar to the existing coverage and perf-counter APIs in OVS.
>>>> However, rather than keeping counters, this is aimed at timing how long
>>>> operations take to perform. "Operations" in this case can be anything
>>>> from a loop iteration, to a function, to something more complex.
>>>
>>> It's important to note that the granularity on this is measured in 'ms',
>>> which means this library is more suited towards measuring larger groups
>>> of operations (those which take longer than the clock resolution on
>>> time_msec).  I guess one advantage of
>>
>> One advantage of...?
> 
> D'oh!  I wrote (and saved, but didn't send when I picked up the draft):
> 
> "I guess one advantage of this framework would be that for larger
> 'macro' operations there wouldn't be as much of a concern on an ever
> growing window of samples - the start and end time encompasses so much.
> My assumption is you intend this to be more on the macro side, so take
> some of my suggestions with a grain of salt - I'm focusing more on a
> benchmarking usecase for development."
> 
> For more clarity (because re-reading, it reads very opaque), apart from
> this first counter you add in 2/2, how do you envision this framework to
> be used?  Is it meant as a benchmarking tool that provides visibility at
> run-time (sort of like a developer defined gcov)?  Is it supposed to
> know about heirarchy (if your 'main' has a performance counter, and a
> subroutine also has a performance counter, is the 'main' counter aware
> of the 'subroutine' counter)?  Who would be using this?  Is it targetted
> to in-field support?  Developers?  Depending on that, some of my other
> questions, suggestions, and ideas are possibly inappropriate.
The initial idea was that this would be useful for both developers and 
network administrators.

In my first draft of this patch, I included both the in-memory 
statistics-gathering and a second patch that would store gathered 
statistics in the OVS database. The idea behind storing data in the 
database would be that it could allow for admins to monitor performance. 
This way they could set up alarms if things were performing poorly and 
potentially take action to address the problem. It also would provide 
useful information on support tickets potentially.

For the second draft of this patch, I removed the database support. The 
big reason was that if I wanted to do things right, I would integrate 
with a time-series database instead of using the approach that I had 
previously used. While no reviewers called it this, I determined that my 
patch was a bit "hacky" in retrospect.

So to answer your question, at this stage it's more aimed for developers 
while running performance tests, but it could be expanded to be more 
useful for network administrators and support techs.

> 
>>>
>>> I might suggest the ability to pick the time resolution.  For instance,
>>> sometimes picking msec is useful when clocking response time in a
>>> multi-message system.  Sometimes a tighter granularity (using finer
>>> controls such as 'clock_gettime', 'rdtsc', 'read register 9 of
>>> coprocessor 0') may be desired.  A function pointer passed to the
>>> performance 'bucket' ie:
>>>      performance_create("something", perf_time_msec)
>>> that can have a function signature that makes sense for most cases (ull
>>> return, no args, etc.) might be good to allow users to customize the
>>> performance system to suit their needs.
>>>
>>> OTOH, since it won't have much for initial users, maybe it's okay.
>>
>> Sure, this is a good idea.
>>
>>>
>>> Also, apologies if I'm really picky or seem overly verbose with moving
>>> average filters.  Once upon a time, I worked on an embedded system that
>>> was a giant... :)
>>>
>>>> The library will keep track of how long it takes to perform the
>>>> particular operations and will maintain statistics of those running
>>>> times.
>>>>
>>>> Statistics for a particular operation can be queried from the command
>>>> line by using ovs-appctl -t <target> performance/show <operation name>.
>>>
>>> I noticed that you'll have a global lock preventing statistics updates
>>> from happening.  No problem - except the larger issue is while that lock
>>> is held no other forward progress for the thread tracking the statistics
>>> can happen.
>>>
>>> Maybe it makes sense to think about an alternate set of data structures
>>> that will accommodate multiple threads contending on performance
>>> monitoring.  I'm guessing in your use case there isn't any contention,
>>> but perhaps in the future someone will use this in, for example, a
>>> netdev area to profile and the results would get skewed.
>>
>> Yes, I agree with this. Do you have some suggestions for a method that
>> would work better? I've thought of a couple of ideas:
>>
>> 1) Use thread-local storage. This has the advantage of eliminating
>> contention. However, I'm not sure how this would work to be able to
>> print statistics to the CLI if the thread handling the CLI command
>> cannot get to statistics in another thread.
> 
> I agree - I'm not sure this works.
> 
>> 2) Separate responsibilities into different threads. The threads that
>> are actually performing operations we want to time would be
>> responsible for grabbing the current timestamp and then passing that
>> information to a background thread. This background thread would be
>> responsible for then storing that data in a structure that is local
>> only to that thread. This way, the "foreground" threads are only
>> responsible for getting the current timestamp and writing the data to
>> a pipe/socket.
>>
>> If you have alternate ideas, please let me know.
> 
> There may be a way to use RCU to do the vector updates, but that does
> introduce some error in the measurements.
> 
> Also, you might use some kind of atomic primitives for a lock-free
> approach.  That has a worse-case performing time, but it is a) constant,
> and b) usually simple to understand.
> 
>>>> The API is designed to be pretty small. The expected steps are as
>>>> follows:
>>>>
>>>> 1) Create a performance measurement, providing a unique name, using
>>>> performance_create()
>>>> 2) Add calls to start_sample() and end_sample() to mark the start and
>>>> stop of the operation you wish to measure.
>>>> 3) Periodically call performance_run() in order to compile statistics.
>>>> 4) Upon completion (likely program shutdown), call performance_destroy()
>>>> to clean up.
>>>
>>> You could register an atexit() handler to automatically destroy these.
>>> Then the framework user doesn't have to deal with that.  There's also a
>>> fatal_signal_add_hook() function where you can register deletion
>>> (although I'd advise against it - usually a fatal signal means 'leave
>>> memory alone' and cleanup here would require deletion).
>>
>> An atexit() handler makes sense. I'll implement that.
>>
>>>
>>>> Signed-off-by: Mark Michelson <mmichels@redhat.com>
>>>> ---
>>>>    lib/automake.mk   |   2 +
>>>>    lib/performance.c | 387 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>    lib/performance.h |  58 ++++++++
>>>>    3 files changed, 447 insertions(+)
>>>>    create mode 100644 lib/performance.c
>>>>    create mode 100644 lib/performance.h
>>>>
>>>> diff --git a/lib/automake.mk b/lib/automake.mk
>>>> index 4b38a1163..fef2c4e1a 100644
>>>> --- a/lib/automake.mk
>>>> +++ b/lib/automake.mk
>>>> @@ -205,6 +205,8 @@ lib_libopenvswitch_la_SOURCES = \
>>>>    	lib/pcap-file.h \
>>>>    	lib/perf-counter.h \
>>>>    	lib/perf-counter.c \
>>>> +	lib/performance.h \
>>>> +	lib/performance.c \
>>>>    	lib/poll-loop.c \
>>>>    	lib/process.c \
>>>>    	lib/process.h \
>>>> diff --git a/lib/performance.c b/lib/performance.c
>>>> new file mode 100644
>>>> index 000000000..5b54cfc29
>>>> --- /dev/null
>>>> +++ b/lib/performance.c
>>>> @@ -0,0 +1,387 @@
>>>> +/* Copyright (c) 2017 Red Hat, Inc.
>>>> + *
>>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>>> + * you may not use this file except in compliance with the License.
>>>> + * You may obtain a copy of the License at:
>>>> + *
>>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>>> + *
>>>> + * Unless required by applicable law or agreed to in writing, software
>>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>>> + * See the License for the specific language governing permissions and
>>>> + * limitations under the License.
>>>> + */
>>>> +
>>>> +#include <config.h>
>>>> +
>>>> +#include "performance.h"
>>>> +#include "timer.h"
>>>> +#include "timeval.h"
>>>> +#include "openvswitch/shash.h"
>>>> +#include "openvswitch/vlog.h"
>>>> +#include "unixctl.h"
>>>> +#include "openvswitch/dynamic-string.h"
>>>> +#include "openvswitch/poll-loop.h"
>>>> +
>>>> +VLOG_DEFINE_THIS_MODULE(performance);
>>>> +
>>>> +struct sample {
>>>> +    long long int start_time; /* Time when we started this sample */
>>>> +    long long int end_time;   /* Time when we ended this sample */
>>>> +    long long int elapsed;    /* Elapsed time: end_time - start_time */
>>>> +};
>>>> +
>>>> +struct sample_vec {
>>>> +    struct sample *samples; /* Dynamic array of samples */
>>>> +    size_t n_samples;       /* Number of samples */
>>>> +    size_t capacity;        /* Number of allocated samples */
>>>> +};
>>>> +
>>>> +struct stats {
>>>> +    long long int min;         /* Minimum measurement (ms) */
>>>> +    long long int max;         /* Maximum measurement (ms) */
>>>> +    long long int average;     /* Average measurement (ms) */
>>>
>>> I think signed values doesn't make sense.  I don't know that time moves
>>> backwards (unless you're using "wall clock", instead of a monotonic time
>>> source).  I guess this is to keep the return type correct with
>>> time_msec(), but I don't understand why that should be signed either
>>> (since it says it will be monotonic).  In practice, it generally *won't*
>>> matter at msec granularity - I 'napkin math' think it takes something
>>> like 6.2bn years to rollover from epoch.  I just worry about the
>>> rollover to negative case (and maybe it isn't important).
>>
>> Yeah the reason I used signed values here is simply because the
>> time_msec() function (and others of its ilk in OVS) return a signed
>> value. I agree that using an unsigned value makes sense since we
>> shouldn't be dealing with wall clock times.
> 
> As I wrote, it still takes something like 6.2 billion years (napkin math
> billion years) worth of samples at msec granularity to wrap.  If the
> clock stays fixed at msec, then it probably isn't the biggest fish to
> fry.  But, I'm not sure what the 'purpose' of this measurement is if it
> is so long.  Even disk IO is getting sub-ms performance.  And network
> communication on the local node should be well under that threshold
> also (is it not?).
> 

In my next draft the ability to resolve to usec and nsec will be 
present. So the point is somewhat moot.

However, yes the ability to measure in msec is useful. Patch 2 in this 
series adds a timer that starts when we detect a change in the OVN 
southbound database and finishes when we write the generated OpenFlow 
flows to vswitchd. This operation can vary wildly depending on the 
network setup. On a small network, this will complete in <1 msec. 
However, when the logical network gets complex (and especially if I set 
up lots of ACLs with large address sets), the resulting number of 
OpenFlow flows that get generated are huge. I've been able to run a 
script in the OVS sandbox where I start seeing this operation take 700+ 
msec to complete on my laptop. This is probably due to algorithmic 
complexity more than I/O (though I cannot confirm this yet).

>>>
>>> Additionally, I think integers aren't a good way of storing the average,
>>> since it will truncate, and 1ms vs 1.9ms are very different when it
>>> comes to measurements.
>>
>> Heh, I think I had a XXX comment initially in my code saying "should I
>> use floating point here?" I'll change it over.
>>
>>>
>>>> +    long long int percentile;  /* 95th percentile measurement (ms) */
>>>> +    long long int num_samples; /* Total number of measurements */
>>>> +};
>>>> +
>>>> +struct performance {
>>>> +    struct sample_vec vec;
>>>> +    long long int sample_rate;
>>>> +    struct timer timer;
>>>> +    struct sample cur_sample;
>>>> +};
>>>
>>> It looks like you're keeping around the samples because you want to 'age
>>> out' from the moving average window.  Otherwise I think it can be
>>> implemented using an IIR filter (psuedo-code below):
>>>
>>>          current_avg -= (current_avg/n_window_samples)
>>>          current_avg += (sample / n_window_samples)
>>>
>>> BUT - if you're going to keep around the samples, you might as well
>>> include a calculation of additional statistics (variance comes to mind).
>>
>> I'm not familiar with IIR filters, so I just want to make sure I
>> understand what you're suggesting before trying to implement it. The
>> idea is that rather than keeping around, say, 10 minutes worth of
>> samples, I just use the calculations above each time that a new sample
>> arrives. This way, I don't have to store my samples at all. I'm not
>> sure how to choose n_window_samples though, or how to relate that to a
>> specific amount of time that has passed.
> 
> In this case, mostly ignore the comment.  It leads down a side path of
> accumulated errors over time, etc.
> 
> But the same types of operations can still be applied to this average.
> Instead of summing up each time, simply subtract the 'culled' counters,
> and just keep adding the new counters to the total.  No need to walk the
> list.  There's 1000 ways to skin this cat, and it's not as important in
> the long run.
> 
>>>
>>>> +static struct shash performances = SHASH_INITIALIZER(&performances);
>>>> +static struct ovs_mutex performances_lock = OVS_MUTEX_INITIALIZER;
>>>
>>> As I wrote before, having this global mutex means statistics
>>> accumulation blocks for everyone.  You might think it doesn't matter,
>>> but imagine this scenario:
>>>
>>>     thread 0                      |    thread 1
>>>
>>>     call perf_start()             |  .............
>>>     .... do things                |  call perf_start() (get slept)
>>>     call perf_end() (block...)    |  wake up and complete
>>>        (unblock)                  |
>>>
>>> In the above, the perf_end() will intermittently include lock calls from
>>> the perf_start(), which may smooth out over time but will introduce
>>> variance.
>>
>> I think my suggestions above about alternate threading models might
>> address this. Let me know if there are other factors to take into
>> consideration.
> 
> Okay.  I think it should too - let's see what you cook up :)
> 
>>>
>>>> +#define PERFORMANCE_INTERVAL 10000
>>>> +#define PERFORMANCE_OLDEST 600000
>>>> +
>>>> +static void
>>>> +add_sample(struct sample_vec *vec, struct sample *new_sample)
>>>> +{
>>>> +    if (vec->capacity == vec->n_samples) {
>>>> +        vec->samples = x2nrealloc(vec->samples, &vec->capacity,
>>>> +            sizeof *vec->samples);
>>>
>>> This resize operation can grow very quickly.  Assuming the samples come
>>> very fast, you'll run out of memory very quickly.  Is there a reason not
>>> to have just roll samples in the vector?  I guess it makes your
>>> 'selection' for various time periods more difficult, but you can also
>>> track those with independent counters as well (just set a different time
>>> interval... :).  That also means no allocations when creating samples,
>>> and no deallocations when removing samples, freeing up additional memory
>>> pressure.
>>
>> I'll be honest here, I'm not following you. Rather, I understand your
>> concerns about memory. The thing to understand here is that the vector
>> will grow for a certain amount of time, but then after a while, old
>> times will start getting culled from the vector. At that point, the
>> size of the vector should stabilize (unless you have very bursty
>> periods of activity). However, I agree that if there are strategies
>> that result in the use of less memory, then I should go with those :)
>>
>> What I'm not following are the parts about "roll samples in the
>> vector" and "independent counters". Can you elaborate on those points
>> some more?
> 
> Sure, I'll try to state what I was suggesting with an (poor) ascii art
> drawing.
> 
>     vec=  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |
> 
> vec[0] holds sample 1, vec[6] holds sample 7.  When sample 8 arrives,
> simply overwrite sample 1 (circular buffer might be a better descriptor,
> sorry).
> 
>     vec=  |  8  |  2  |  3  |  4  |  5  |  6  |  7  |
> 
> You just roll over the sample.

Gotcha, this clears up what you meant very well.

> 
> As for the independent counters, I was thinking that since you know the
> frequency with which samples come, and you know the window, you could
> accumulate totals in additional performance elements.  Or sample the
> averages over time (making sure that the explanation of what is actually
> being measured is conveyed).
> 
> That's probably something that should be left to the person adding the
> performance element, though.  For instance, some areas of the code won't
> make sense to have a "last 1m, last 5m, last 10m."  Providing that data
> could be misleading.
> 
>>>
>>> The "disadvantage" is you need to determine your sampling window sizes,
>>> but that may be something you expose as a tune-able setting?  Also,
>>> tracking your 1m,5m,10m counters would be a bit different (but you could
>>> just treat them as additional 'performance counters' in your framework,
>>> and every time the performance_run() decides it's time to take an
>>> average, if the time interval makes sense, it can update "name",
>>> "name_1m", "name_5m", "name_10m", etc).
>>
>> Again, I'm not really following you here. When you say "sampling
>> window sizes" are these what you were referencing earlier in your avg
>> formula? Are these based solely on the number of samples collected,
>> rather than amount of time that has passed collecting them? I'm not
>> really sure what you're talking about with the "name" suggestions in
>> the last sentence.
> 
> Let's just come back to this.
> 
>>>
>>>> +    }
>>>> +
>>>> +    vec->samples[vec->n_samples++] = *new_sample;
>>>> +}
>>>> +
>>>> +static int
>>>> +find_earliest(const struct sample_vec *vec, long long int age_ms)
>>>> +{
>>>> +    long long int cutoff = time_msec() - age_ms;
>>>> +
>>>> +    for (int i = 0; i < vec->n_samples; i++) {
>>>> +        if (vec->samples[i].end_time >= cutoff) {
>>>> +            return i;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* Either the vector is empty or all times are
>>>> +     * older than the cutoff.
>>>> +     */
>>>> +    return -1;
>>>> +}
>>>> +
>>>> +static long long int
>>>> +average(const struct sample *samples, size_t num_samples)
>>>> +{
>>>> +    /* Avoid division by zero */
>>>> +    if (num_samples == 0) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    long long int sum = 0;
>>>> +    for (size_t i = 0; i < num_samples; i++) {
>>>> +        sum += samples[i].elapsed;
>>>> +    }
>>>> +
>>>> +    return sum / num_samples;
>>>> +}
>>>> +
>>>> +/* Calculate a specific percentile value.
>>>> + * Valid percentile values range from 0 to 99
>>>> + */
>>>> +static long long int
>>>> +percentile(const struct sample *samples, size_t num_samples, int percentile)
>>>> +{
>>>> +    if (num_samples == 0) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    size_t pctl = num_samples * percentile / 100;
>>>> +    return samples[pctl].elapsed;
>>>> +}
>>>> +
>>>> +static void
>>>> +cull_old_times(struct sample_vec *vec, long long int age_ms)
>>>> +{
>>>> +    int i = find_earliest(vec, age_ms);
>>>> +
>>>> +    if (i <= 0) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    size_t new_size = vec->n_samples - i;
>>>> +    memmove(vec->samples, &vec->samples[i], new_size * sizeof *vec->samples);
>>>> +    vec->n_samples = new_size;
>>>> +}
>>>> +
>>>> +static void
>>>> +format_stats(struct ds *s, const char *prefix, const struct stats *stats)
>>>> +{
>>>> +    if (stats->num_samples) {
>>>> +        ds_put_format(s, "\t%s samples: %lld\n", prefix, stats->num_samples);
>>>> +        ds_put_format(s, "\t%s minimum: %lld\n", prefix, stats->min);
>>>> +        ds_put_format(s, "\t%s maximum: %lld\n", prefix, stats->max);
>>>> +        ds_put_format(s, "\t%s average: %lld\n", prefix, stats->average);
>>>> +        ds_put_format(s, "\t%s 95th percentile: %lld\n\n", prefix,
>>>> +                      stats->percentile);
>>>> +    } else {
>>>> +        ds_put_format(s, "\t%s samples: 0\n", prefix);
>>>> +        ds_put_format(s, "\t%s minimum: N/A\n", prefix);
>>>> +        ds_put_format(s, "\t%s maximum: N/A\n", prefix);
>>>> +        ds_put_format(s, "\t%s average: N/A\n", prefix);
>>>> +        ds_put_format(s, "\t%s 95th percentile: N/A\n\n", prefix);
>>>> +    }
>>>> +}
>>>> +
>>>> +static int
>>>> +cmp_times(const void *left_, const void *right_)
>>>> +{
>>>> +    const struct sample *left = left_;
>>>> +    const struct sample *right = right_;
>>>> +
>>>> +    return left->elapsed == right->elapsed ? 0
>>>> +        : left->elapsed > right->elapsed;
>>>> +}
>>>> +
>>>> +static struct sample *
>>>> +sort_times(const struct sample *by_time, size_t vec_size)
>>>
>>> I got confused reading through this a few times, because the name
>>> doesn't convey the 'copy' operation that happens.  It would be good to
>>> call this something like:
>>>
>>>     sorted_copy_times()
>>>
>>> which would make reasoning about the get_stats block a little bit better
>>> (I read through it twice forgetting that sort_times wasn't an in-place
>>> sort).
>>
>> Sure, I can do that.
>>
>>>
>>>> +{
>>>> +    struct sample *copy = xmemdup(by_time, vec_size * sizeof *copy);
>>>> +    qsort(copy, vec_size, sizeof *copy, cmp_times);
>>>> +
>>>> +    return copy;
>>>> +}
>>>> +
>>>> +static void
>>>> +get_stats(const struct sample_vec *vec, long long int age_ms,
>>>> +          struct stats *stats)
>>>> +{
>>>> +    int start_idx = find_earliest(vec, age_ms);
>>>> +    if (start_idx < 0) {
>>>> +        memset(stats, 0, sizeof *stats);
>>>> +        return;
>>>> +    }
>>>> +    size_t vec_size = vec->n_samples - start_idx;
>>>> +
>>>> +    struct sample *by_time = &vec->samples[start_idx];
>>>> +    struct sample *by_elapsed = sort_times(by_time, vec_size);
>>>> +
>>>> +    stats->min = by_elapsed[0].elapsed;
>>>> +    stats->max = by_elapsed[vec_size - 1].elapsed;
>>>> +    stats->average = average(by_time, vec_size);
>>>> +    stats->percentile = percentile(by_elapsed, vec_size, 95);
>>>> +    stats->num_samples = vec_size;
>>>> +
>>>> +    free(by_elapsed);
>>>> +}
>>>> +
>>>> +static void
>>>> +performance_print(struct performance *perf, const char *name,
>>>> +                  struct ds *s)
>>>> +{
>>>> +    struct stats one_min;
>>>> +    struct stats five_min;
>>>> +    struct stats ten_min;
>>>> +    ds_put_format(s, "Statistics for '%s'\n", name);
>>>> +
>>>> +    get_stats(&perf->vec, 60000, &one_min);
>>>> +    get_stats(&perf->vec, 300000, &five_min);
>>>> +    get_stats(&perf->vec, 600000, &ten_min);
>>>> +
>>>> +    format_stats(s, "1 minute", &one_min);
>>>> +    format_stats(s, "5 minute", &five_min);
>>>> +    format_stats(s, "10 minute", &ten_min);
>>>> +}
>>>> +
>>>> +static bool
>>>> +performance_show_protected(int argc, const char *argv[], struct ds *s)
>>>> +{
>>>> +    struct performance *perf;
>>>> +
>>>> +    if (argc > 1) {
>>>> +        perf = shash_find_data(&performances, argv[1]);
>>>> +        if (!perf) {
>>>> +            ds_put_cstr(s, "No such performance");
>>>> +            return false;
>>>> +        }
>>>> +        performance_print(perf, argv[1], s);
>>>> +    } else {
>>>> +        struct shash_node *node;
>>>> +        SHASH_FOR_EACH (node, &performances) {
>>>> +            perf = node->data;
>>>> +            performance_print(perf, node->name, s);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static void
>>>> +performance_show(struct unixctl_conn *conn, int argc OVS_UNUSED,
>>>> +        const char *argv[], void *ignore OVS_UNUSED)
>>>> +{
>>>> +    struct ds s = DS_EMPTY_INITIALIZER;
>>>> +    bool success;
>>>> +
>>>> +    ovs_mutex_lock(&performances_lock);
>>>> +    success = performance_show_protected(argc, argv, &s);
>>>> +    ovs_mutex_unlock(&performances_lock);
>>>> +
>>>> +    if (success) {
>>>> +        unixctl_command_reply(conn, ds_cstr(&s));
>>>> +    } else {
>>>> +        unixctl_command_reply_error(conn, ds_cstr(&s));
>>>> +    }
>>>> +    ds_destroy(&s);
>>>> +}
>>>> +
>>>> +static void
>>>> +do_init_performance(void)
>>>> +{
>>>> +    unixctl_command_register("performance/show", "[NAME]", 0, 1,
>>>> +                             performance_show, NULL);
>>>> +}
>>>> +
>>>> +static void
>>>> +performance_init(void)
>>>> +{
>>>> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>>>> +    if (ovsthread_once_start(&once)) {
>>>> +        do_init_performance();
>>>> +        ovsthread_once_done(&once);
>>>> +    }
>>>> +}
>>>> +
>>>> +void
>>>> +performance_create(const char *name)
>>>> +{
>>>> +    performance_init();
>>>> +
>>>> +    struct performance *perf = xzalloc(sizeof *perf);
>>>> +    timer_set_duration(&perf->timer, PERFORMANCE_INTERVAL);
>>>> +    ovs_mutex_lock(&performances_lock);
>>>> +    shash_add(&performances, name, perf);
>>>> +    ovs_mutex_unlock(&performances_lock);
>>>> +}
>>>> +
>>>> +void
>>>> +performance_destroy(const char *name)
>>>> +{
>>>> +    struct performance *perf;
>>>> +
>>>> +    ovs_mutex_lock(&performances_lock);
>>>> +    perf = shash_find_and_delete(&performances, name);
>>>> +    ovs_mutex_unlock(&performances_lock);
>>>> +
>>>> +    if (perf) {
>>>> +        free(perf->vec.samples);
>>>> +    }
>>>> +
>>>> +    free(perf);
>>>> +}
>>>> +
>>>> +static bool
>>>> +performance_start_sample_protected(const char *name)
>>>> +{
>>>> +    struct performance *perf = shash_find_data(&performances, name);
>>>> +    if (!perf) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    /* We already started sampling. Need an end before
>>>> +     * we start another sample
>>>> +     */
>>>> +    if (perf->cur_sample.start_time) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    perf->cur_sample.start_time = time_msec();
>>>> +    return true;
>>>> +}
>>>> +
>>>> +bool
>>>> +performance_start_sample(const char *name)
>>>> +{
>>>> +    ovs_mutex_lock(&performances_lock);
>>>> +    bool ret = performance_start_sample_protected(name);
>>>> +    ovs_mutex_unlock(&performances_lock);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static bool
>>>> +performance_end_sample_protected(const char *name)
>>>> +{
>>>> +    struct performance *perf = shash_find_data(&performances, name);
>>>> +    if (!perf) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    /* We can't end a sample if we haven't started one */
>>>> +    if (!perf->cur_sample.start_time) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    perf->cur_sample.end_time = time_msec();
>>>> +    perf->cur_sample.elapsed = perf->cur_sample.end_time
>>>> +        - perf->cur_sample.start_time;
>>>> +
>>>> +    add_sample(&perf->vec, &perf->cur_sample);
>>>> +
>>>> +    memset(&perf->cur_sample, 0, sizeof perf->cur_sample);
>>>> +    return true;
>>>> +}
>>>> +
>>>> +bool
>>>> +performance_end_sample(const char *name)
>>>> +{
>>>> +    ovs_mutex_lock(&performances_lock);
>>>> +    bool ret = performance_end_sample_protected(name);
>>>> +    ovs_mutex_unlock(&performances_lock);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static void
>>>> +performance_run_protected(void)
>>>> +{
>>>> +    struct shash_node *node;
>>>> +    SHASH_FOR_EACH (node, &performances) {
>>>> +        struct performance *perf = node->data;
>>>> +        if (!timer_expired(&perf->timer)) {
>>>> +            timer_wait(&perf->timer);
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        cull_old_times(&perf->vec, PERFORMANCE_OLDEST);
>>>> +
>>>> +        timer_set_duration(&perf->timer, perf->sample_rate);
>>>> +        timer_wait(&perf->timer);
>>>> +    }
>>>> +}
>>>> +
>>>> +void
>>>> +performance_run(void)
>>>> +{
>>>> +    ovs_mutex_lock(&performances_lock);
>>>> +    performance_run_protected();
>>>> +    ovs_mutex_unlock(&performances_lock);
>>>> +}
>>>> diff --git a/lib/performance.h b/lib/performance.h
>>>> new file mode 100644
>>>> index 000000000..ec115a88c
>>>> --- /dev/null
>>>> +++ b/lib/performance.h
>>>> @@ -0,0 +1,58 @@
>>>> +/* Copyright (c) 2017 Red Hat, Inc.
>>>> + *
>>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>>> + * you may not use this file except in compliance with the License.
>>>> + * You may obtain a copy of the License at:
>>>> + *
>>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>>> + *
>>>> + * Unless required by applicable law or agreed to in writing, software
>>>> + * distributed under the License is distributed on an "AS IS" BASIS,
>>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>>>> + * See the License for the specific language governing permissions and
>>>> + * limitations under the License.
>>>> + */
>>>> +
>>>> +#ifndef PERFORMANCE_H
>>>> +#define PERFORMANCE_H 1
>>>> +
>>>> +#include <stdbool.h>
>>>> +
>>>> +/* This file provides a method for timing operations in OVS.
>>>> + *
>>>> + * The expected operation is as follows:
>>>> + * 1) Call performance_create(), supplying a unique name for the
>>>> + *    operation that is being measured.
>>>> + * 2) Place calls to start_sample() and end_sample() at the beginning
>>>> + *    and end of the operation you wish to measure.
>>>> + * 3) Periodically call performance_run() so that statistics can be
>>>> + *    gathered based on the recorded times. If you are running a loop,
>>>> + *    calling this once per loop is a good practice.
>>>> + * 4) When complete, call performance_destroy() to clean up.
>>>> + *
>>>> + * This implementation is thread-safe.
>>>> + */
>>>> +
>>>> +/* Create a new performance measurement tracker*/
>>>> +void performance_create(const char *name);
>>>> +
>>>> +/* Destroy a performance measurement tracker */
>>>> +void performance_destroy(const char *name);
>>>> +
>>>> +/* Indicate that a performance measurement is beginning. */
>>>> +bool performance_start_sample(const char *name);
>>>> +
>>>> +/* Indicate that a performance measurement has ended. The
>>>> + * sample will be added to the history of performance
>>>> + * measurements for this tracker
>>>> + */
>>>> +bool performance_end_sample(const char *name);
>>>> +
>>>> +/* Perform periodic duties on all performance measurement
>>>> + * trackers. Currently, this just takes care of culling old
>>>> + * measurements. It is a good practice to call this periodically
>>>> + * from a daemon's main loop
>>>> + */
>>>> +void performance_run(void);
>>>> +
>>>> +#endif /* performance.h */

Patch

diff --git a/lib/automake.mk b/lib/automake.mk
index 4b38a1163..fef2c4e1a 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -205,6 +205,8 @@  lib_libopenvswitch_la_SOURCES = \
 	lib/pcap-file.h \
 	lib/perf-counter.h \
 	lib/perf-counter.c \
+	lib/performance.h \
+	lib/performance.c \
 	lib/poll-loop.c \
 	lib/process.c \
 	lib/process.h \
diff --git a/lib/performance.c b/lib/performance.c
new file mode 100644
index 000000000..5b54cfc29
--- /dev/null
+++ b/lib/performance.c
@@ -0,0 +1,387 @@ 
+/* Copyright (c) 2017 Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include "performance.h"
+#include "timer.h"
+#include "timeval.h"
+#include "openvswitch/shash.h"
+#include "openvswitch/vlog.h"
+#include "unixctl.h"
+#include "openvswitch/dynamic-string.h"
+#include "openvswitch/poll-loop.h"
+
+VLOG_DEFINE_THIS_MODULE(performance);
+
+struct sample {
+    long long int start_time; /* Time when we started this sample */
+    long long int end_time;   /* Time when we ended this sample */
+    long long int elapsed;    /* Elapsed time: end_time - start_time */
+};
+
+struct sample_vec {
+    struct sample *samples; /* Dynamic array of samples */
+    size_t n_samples;       /* Number of samples */
+    size_t capacity;        /* Number of allocated samples */
+};
+
+struct stats {
+    long long int min;         /* Minimum measurement (ms) */
+    long long int max;         /* Maximum measurement (ms) */
+    long long int average;     /* Average measurement (ms) */
+    long long int percentile;  /* 95th percentile measurement (ms) */
+    long long int num_samples; /* Total number of measurements */
+};
+
+struct performance {
+    struct sample_vec vec;
+    long long int sample_rate;
+    struct timer timer;
+    struct sample cur_sample;
+};
+
+static struct shash performances = SHASH_INITIALIZER(&performances);
+static struct ovs_mutex performances_lock = OVS_MUTEX_INITIALIZER;
+
+#define PERFORMANCE_INTERVAL 10000
+#define PERFORMANCE_OLDEST 600000
+
+static void
+add_sample(struct sample_vec *vec, struct sample *new_sample)
+{
+    if (vec->capacity == vec->n_samples) {
+        vec->samples = x2nrealloc(vec->samples, &vec->capacity,
+            sizeof *vec->samples);
+    }
+
+    vec->samples[vec->n_samples++] = *new_sample;
+}
+
+static int
+find_earliest(const struct sample_vec *vec, long long int age_ms)
+{
+    long long int cutoff = time_msec() - age_ms;
+
+    for (int i = 0; i < vec->n_samples; i++) {
+        if (vec->samples[i].end_time >= cutoff) {
+            return i;
+        }
+    }
+
+    /* Either the vector is empty or all times are
+     * older than the cutoff.
+     */
+    return -1;
+}
+
+static long long int
+average(const struct sample *samples, size_t num_samples)
+{
+    /* Avoid division by zero */
+    if (num_samples == 0) {
+        return 0;
+    }
+
+    long long int sum = 0;
+    for (size_t i = 0; i < num_samples; i++) {
+        sum += samples[i].elapsed;
+    }
+
+    return sum / num_samples;
+}
+
+/* Calculate a specific percentile value.
+ * Valid percentile values range from 0 to 99
+ */
+static long long int
+percentile(const struct sample *samples, size_t num_samples, int percentile)
+{
+    if (num_samples == 0) {
+        return 0;
+    }
+
+    size_t pctl = num_samples * percentile / 100;
+    return samples[pctl].elapsed;
+}
+
+static void
+cull_old_times(struct sample_vec *vec, long long int age_ms)
+{
+    int i = find_earliest(vec, age_ms);
+
+    if (i <= 0) {
+        return;
+    }
+
+    size_t new_size = vec->n_samples - i;
+    memmove(vec->samples, &vec->samples[i], new_size * sizeof *vec->samples);
+    vec->n_samples = new_size;
+}
+
+static void
+format_stats(struct ds *s, const char *prefix, const struct stats *stats)
+{
+    if (stats->num_samples) {
+        ds_put_format(s, "\t%s samples: %lld\n", prefix, stats->num_samples);
+        ds_put_format(s, "\t%s minimum: %lld\n", prefix, stats->min);
+        ds_put_format(s, "\t%s maximum: %lld\n", prefix, stats->max);
+        ds_put_format(s, "\t%s average: %lld\n", prefix, stats->average);
+        ds_put_format(s, "\t%s 95th percentile: %lld\n\n", prefix,
+                      stats->percentile);
+    } else {
+        ds_put_format(s, "\t%s samples: 0\n", prefix);
+        ds_put_format(s, "\t%s minimum: N/A\n", prefix);
+        ds_put_format(s, "\t%s maximum: N/A\n", prefix);
+        ds_put_format(s, "\t%s average: N/A\n", prefix);
+        ds_put_format(s, "\t%s 95th percentile: N/A\n\n", prefix);
+    }
+}
+
+static int
+cmp_times(const void *left_, const void *right_)
+{
+    const struct sample *left = left_;
+    const struct sample *right = right_;
+
+    return left->elapsed == right->elapsed ? 0
+        : left->elapsed > right->elapsed;
+}
+
+static struct sample *
+sort_times(const struct sample *by_time, size_t vec_size)
+{
+    struct sample *copy = xmemdup(by_time, vec_size * sizeof *copy);
+    qsort(copy, vec_size, sizeof *copy, cmp_times);
+
+    return copy;
+}
+
+static void
+get_stats(const struct sample_vec *vec, long long int age_ms,
+          struct stats *stats)
+{
+    int start_idx = find_earliest(vec, age_ms);
+    if (start_idx < 0) {
+        memset(stats, 0, sizeof *stats);
+        return;
+    }
+    size_t vec_size = vec->n_samples - start_idx;
+
+    struct sample *by_time = &vec->samples[start_idx];
+    struct sample *by_elapsed = sort_times(by_time, vec_size);
+
+    stats->min = by_elapsed[0].elapsed;
+    stats->max = by_elapsed[vec_size - 1].elapsed;
+    stats->average = average(by_time, vec_size);
+    stats->percentile = percentile(by_elapsed, vec_size, 95);
+    stats->num_samples = vec_size;
+
+    free(by_elapsed);
+}
+
+static void
+performance_print(struct performance *perf, const char *name,
+                  struct ds *s)
+{
+    struct stats one_min;
+    struct stats five_min;
+    struct stats ten_min;
+    ds_put_format(s, "Statistics for '%s'\n", name);
+
+    get_stats(&perf->vec, 60000, &one_min);
+    get_stats(&perf->vec, 300000, &five_min);
+    get_stats(&perf->vec, 600000, &ten_min);
+
+    format_stats(s, "1 minute", &one_min);
+    format_stats(s, "5 minute", &five_min);
+    format_stats(s, "10 minute", &ten_min);
+}
+
+static bool
+performance_show_protected(int argc, const char *argv[], struct ds *s)
+{
+    struct performance *perf;
+
+    if (argc > 1) {
+        perf = shash_find_data(&performances, argv[1]);
+        if (!perf) {
+            ds_put_cstr(s, "No such performance");
+            return false;
+        }
+        performance_print(perf, argv[1], s);
+    } else {
+        struct shash_node *node;
+        SHASH_FOR_EACH (node, &performances) {
+            perf = node->data;
+            performance_print(perf, node->name, s);
+        }
+    }
+
+    return true;
+}
+
+static void
+performance_show(struct unixctl_conn *conn, int argc OVS_UNUSED,
+        const char *argv[], void *ignore OVS_UNUSED)
+{
+    struct ds s = DS_EMPTY_INITIALIZER;
+    bool success;
+
+    ovs_mutex_lock(&performances_lock);
+    success = performance_show_protected(argc, argv, &s);
+    ovs_mutex_unlock(&performances_lock);
+
+    if (success) {
+        unixctl_command_reply(conn, ds_cstr(&s));
+    } else {
+        unixctl_command_reply_error(conn, ds_cstr(&s));
+    }
+    ds_destroy(&s);
+}
+
+static void
+do_init_performance(void)
+{
+    unixctl_command_register("performance/show", "[NAME]", 0, 1,
+                             performance_show, NULL);
+}
+
+static void
+performance_init(void)
+{
+    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+    if (ovsthread_once_start(&once)) {
+        do_init_performance();
+        ovsthread_once_done(&once);
+    }
+}
+
+void
+performance_create(const char *name)
+{
+    performance_init();
+
+    struct performance *perf = xzalloc(sizeof *perf);
+    timer_set_duration(&perf->timer, PERFORMANCE_INTERVAL);
+
+    ovs_mutex_lock(&performances_lock);
+    shash_add(&performances, name, perf);
+    ovs_mutex_unlock(&performances_lock);
+}
+
+void
+performance_destroy(const char *name)
+{
+    struct performance *perf;
+
+    ovs_mutex_lock(&performances_lock);
+    perf = shash_find_and_delete(&performances, name);
+    ovs_mutex_unlock(&performances_lock);
+
+    if (perf) {
+        free(perf->vec.samples);
+    }
+
+    free(perf);
+}
+
+static bool
+performance_start_sample_protected(const char *name)
+{
+    struct performance *perf = shash_find_data(&performances, name);
+    if (!perf) {
+        return false;
+    }
+
+    /* We already started sampling. Need an end before
+     * we start another sample
+     */
+    if (perf->cur_sample.start_time) {
+        return false;
+    }
+
+    perf->cur_sample.start_time = time_msec();
+    return true;
+}
+
+bool
+performance_start_sample(const char *name)
+{
+    ovs_mutex_lock(&performances_lock);
+    bool ret = performance_start_sample_protected(name);
+    ovs_mutex_unlock(&performances_lock);
+
+    return ret;
+}
+
+static bool
+performance_end_sample_protected(const char *name)
+{
+    struct performance *perf = shash_find_data(&performances, name);
+    if (!perf) {
+        return false;
+    }
+
+    /* We can't end a sample if we haven't started one */
+    if (!perf->cur_sample.start_time) {
+        return false;
+    }
+
+    perf->cur_sample.end_time = time_msec();
+    perf->cur_sample.elapsed = perf->cur_sample.end_time
+        - perf->cur_sample.start_time;
+
+    add_sample(&perf->vec, &perf->cur_sample);
+
+    memset(&perf->cur_sample, 0, sizeof perf->cur_sample);
+    return true;
+}
+
+bool
+performance_end_sample(const char *name)
+{
+    ovs_mutex_lock(&performances_lock);
+    bool ret = performance_end_sample_protected(name);
+    ovs_mutex_unlock(&performances_lock);
+
+    return ret;
+}
+
+static void
+performance_run_protected(void)
+{
+    struct shash_node *node;
+    SHASH_FOR_EACH (node, &performances) {
+        struct performance *perf = node->data;
+        if (!timer_expired(&perf->timer)) {
+            timer_wait(&perf->timer);
+            continue;
+        }
+
+        cull_old_times(&perf->vec, PERFORMANCE_OLDEST);
+
+        timer_set_duration(&perf->timer, perf->sample_rate);
+        timer_wait(&perf->timer);
+    }
+}
+
+void
+performance_run(void)
+{
+    ovs_mutex_lock(&performances_lock);
+    performance_run_protected();
+    ovs_mutex_unlock(&performances_lock);
+}
diff --git a/lib/performance.h b/lib/performance.h
new file mode 100644
index 000000000..ec115a88c
--- /dev/null
+++ b/lib/performance.h
@@ -0,0 +1,58 @@ 
+/* Copyright (c) 2017 Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef PERFORMANCE_H
+#define PERFORMANCE_H 1
+
+#include <stdbool.h>
+
+/* This file provides a method for timing operations in OVS.
+ *
+ * The expected operation is as follows:
+ * 1) Call performance_create(), supplying a unique name for the
+ *    operation that is being measured.
+ * 2) Place calls to start_sample() and end_sample() at the beginning
+ *    and end of the operation you wish to measure.
+ * 3) Periodically call performance_run() so that statistics can be
+ *    gathered based on the recorded times. If you are running a loop,
+ *    calling this once per loop is a good practice.
+ * 4) When complete, call performance_destroy() to clean up.
+ *
+ * This implementation is thread-safe.
+ */
+
+/* Create a new performance measurement tracker*/
+void performance_create(const char *name);
+
+/* Destroy a performance measurement tracker */
+void performance_destroy(const char *name);
+
+/* Indicate that a performance measurement is beginning. */
+bool performance_start_sample(const char *name);
+
+/* Indicate that a performance measurement has ended. The
+ * sample will be added to the history of performance
+ * measurements for this tracker
+ */
+bool performance_end_sample(const char *name);
+
+/* Perform periodic duties on all performance measurement
+ * trackers. Currently, this just takes care of culling old
+ * measurements. It is a good practice to call this periodically
+ * from a daemon's main loop
+ */
+void performance_run(void);
+
+#endif /* performance.h */