
[bpf-next,1/3] selftests/bpf: add benchmark runner infrastructure

Message ID: 20200508070548.2358701-2-andriin@fb.com
State: Changes Requested
Delegated to: BPF Maintainers
Series: Add benchmark runner and few benchmarks

Commit Message

Andrii Nakryiko May 8, 2020, 7:05 a.m. UTC
While working on BPF ringbuf implementation, testing, and benchmarking, I've
developed a pretty generic and modular benchmark runner, which seems to be
more widely useful, as I've already used it for one more purpose (testing the
fastest way to trigger a BPF program, to minimize the overhead of in-kernel code).

This patch adds the generic part of the benchmark runner and sets up the
Makefile for extending it with more sets of benchmarks.

The benchmark runner itself operates by spinning up a specified number of
producer and consumer threads and setting up an interval timer that delivers
a SIGALRM signal to the application once a second. Every second, a snapshot
of the hits/drops counters is collected and stored in an array. Drops are
useful for producer/consumer benchmarks in which producers might overwhelm
consumers.
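
The sampling mechanism is an ordinary POSIX interval timer. A minimal sketch
of the idea (the actual setup_timer() in bench.c below uses sigaction() and
checks for errors; setup_sampling_timer() here is purely illustrative):

#include <signal.h>
#include <sys/time.h>

static void sigalarm_handler(int signo)
{
	/* snapshot hits/drops counters into the results array */
}

static void setup_sampling_timer(void)
{
	struct itimerval timer = {
		.it_value.tv_sec = 1,		/* first firing, 1s from now */
		.it_interval.tv_sec = 1,	/* then once every second */
	};

	signal(SIGALRM, sigalarm_handler);
	setitimer(ITIMER_REAL, &timer, NULL);
}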

Once the test finishes after the given number of warm-up and testing seconds,
the mean and stddev are calculated (ignoring warm-up results) and printed to
stdout. This setup seems to give consistent and accurate results.
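
The summary statistics are the plain sample mean and sample standard
deviation over the per-second results. A condensed sketch of the aggregation
that hits_drops_report_final() below performs (summarize() is an illustrative
helper, not part of the patch):

#include <math.h>

static void summarize(const long *hits, int n, double *mean, double *stddev)
{
	double m = 0.0, s = 0.0;
	int i;

	for (i = 0; i < n; i++)
		m += hits[i] / 1000000.0 / n;	/* mean throughput, in M/s */

	if (n > 1) {
		for (i = 0; i < n; i++)
			s += (m - hits[i] / 1000000.0) *
			     (m - hits[i] / 1000000.0) / (n - 1.0);
		s = sqrt(s);	/* sample stddev, hence the n - 1 divisor */
	}

	*mean = m;
	*stddev = s;
}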

To validate the behavior, I added two atomic counting tests: global and
local. For the global one, all producer threads atomically increment the same
counter as fast as possible. This, of course, leads to a huge drop in
performance once there is more than one producer thread, due to CPUs fighting
over the same memory location.

Local counting, on the other hand, maintains one counter per producer thread,
incremented independently. Once per second, all counters are read and added
together to form the final "counting throughput" measurement. As expected,
such a setup demonstrates linear scalability with the number of producers (as
long as there are enough physical CPU cores, of course). See example output
below. This setup can also nicely demonstrate the disastrous effects of false
sharing, if care is not taken to pad the per-producer counters out to
independent cache lines.
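
That padding is what struct counter in bench.h below provides; the entire
difference is one alignment attribute (counter_unpadded is a hypothetical
contrast, only struct counter is in the patch):

/* Unpadded: neighboring producers' counters share a cache line, which then
 * ping-pongs between CPUs even though each thread only writes its own slot.
 */
struct counter_unpadded { long value; };

/* Padded: each counter gets a cache line to itself (128 bytes also covers
 * adjacent-line prefetching), so producers never interfere.
 */
struct counter { long value; } __attribute__((aligned(128)));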

The demo output below shows the global counter first with 1 producer, then
with 4: both total and per-producer performance drop significantly. The last
run is the local counter with 4 producers, demonstrating near-perfect
scalability.

$ ./bench -a -w1 -d2 -p1 count-global
Setting up benchmark 'count-global'...
Benchmark 'count-global' started.
Iter   0 ( 24.822us): hits  148.179M/s (148.179M/prod), drops    0.000M/s
Iter   1 ( 37.939us): hits  149.308M/s (149.308M/prod), drops    0.000M/s
Iter   2 (-10.774us): hits  150.717M/s (150.717M/prod), drops    0.000M/s
Iter   3 (  3.807us): hits  151.435M/s (151.435M/prod), drops    0.000M/s
Summary: hits  150.488 ± 1.079M/s (150.488M/prod), drops    0.000 ± 0.000M/s

$ ./bench -a -w1 -d2 -p4 count-global
Setting up benchmark 'count-global'...
Benchmark 'count-global' started.
Iter   0 ( 60.659us): hits   53.910M/s ( 13.477M/prod), drops    0.000M/s
Iter   1 (-17.658us): hits   53.722M/s ( 13.431M/prod), drops    0.000M/s
Iter   2 (  5.865us): hits   53.495M/s ( 13.374M/prod), drops    0.000M/s
Iter   3 (  0.104us): hits   53.606M/s ( 13.402M/prod), drops    0.000M/s
Summary: hits   53.608 ± 0.113M/s ( 13.402M/prod), drops    0.000 ± 0.000M/s

$ ./bench -a -w1 -d2 -p4 count-local
Setting up benchmark 'count-local'...
Benchmark 'count-local' started.
Iter   0 ( 23.388us): hits  640.450M/s (160.113M/prod), drops    0.000M/s
Iter   1 (  2.291us): hits  605.661M/s (151.415M/prod), drops    0.000M/s
Iter   2 ( -6.415us): hits  607.092M/s (151.773M/prod), drops    0.000M/s
Iter   3 ( -1.361us): hits  601.796M/s (150.449M/prod), drops    0.000M/s
Summary: hits  604.849 ± 2.739M/s (151.212M/prod), drops    0.000 ± 0.000M/s

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/testing/selftests/bpf/.gitignore    |   1 +
 tools/testing/selftests/bpf/Makefile      |  11 +-
 tools/testing/selftests/bpf/bench.c       | 364 ++++++++++++++++++++++
 tools/testing/selftests/bpf/bench.h       |  74 +++++
 tools/testing/selftests/bpf/bench_count.c |  91 ++++++
 5 files changed, 540 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/bench.c
 create mode 100644 tools/testing/selftests/bpf/bench.h
 create mode 100644 tools/testing/selftests/bpf/bench_count.c

Comments

John Fastabend May 8, 2020, 3:49 p.m. UTC | #1
Andrii Nakryiko wrote:
> While working on BPF ringbuf implementation, testing, and benchmarking, I've
> developed a pretty generic and modular benchmark runner, which seems to be
> more widely useful, as I've already used it for one more purpose (testing the
> fastest way to trigger a BPF program, to minimize the overhead of in-kernel code).
> 
> This patch adds the generic part of the benchmark runner and sets up the
> Makefile for extending it with more sets of benchmarks.

Seems useful.

[...]

> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---

Couple nits but otherwise lgtm. I think it should probably be moved
into its own directory though ./bpf/bench/

The other question would be how much stuff we want to live in
selftests vs outside selftests/bpf, but I think it's fine and makes
it easy to build small benchmark programs in ./bpf/progs/

[...]

> +struct env env = {
> +	.duration_sec = 10,
> +	.warmup_sec = 5,

Just curious, I'm guessing the duration/warmup are arbitrary here? Seems
a bit long; I would bet 5,1 would be enough for the global/local tests at
least.

> +	.affinity = false,
> +	.consumer_cnt = 1,
> +	.producer_cnt = 1,
> +};
> +

[...]

> +void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns)
> +{
> +	double hits_per_sec, drops_per_sec;
> +	double hits_per_prod;
> +
> +	hits_per_sec = res->hits / 1000000.0 / (delta_ns / 1000000000.0);
> +	hits_per_prod = hits_per_sec / env.producer_cnt;

Per-producer counts would also be useful. Averaging over the producer count
could hide issues with fairness.

> +	drops_per_sec = res->drops / 1000000.0 / (delta_ns / 1000000000.0);
> +
> +	printf("Iter %3d (%7.3lfus): ",
> +	       iter, (delta_ns - 1000000000) / 1000.0);
> +
> +	printf("hits %8.3lfM/s (%7.3lfM/prod), drops %8.3lfM/s\n",
> +	       hits_per_sec, hits_per_prod, drops_per_sec);
> +}
> +

[...]

> +const char *argp_program_version = "benchmark";
> +const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
> +const char argp_program_doc[] =
> +"benchmark    Generic benchmarking framework.\n"
> +"\n"
> +"This tool runs benchmarks.\n"
> +"\n"
> +"USAGE: benchmark <mode>\n"
> +"\n"
> +"EXAMPLES:\n"
> +"    benchmark count-local                # run 'count-local' benchmark with 1 producer and 1 consumer\n"
> +"    benchmark -p16 -c8 -a count-local    # run 'count-local' benchmark with 16 producer and 8 consumer threads, pinned to CPUs\n";
> +
> +static const struct argp_option opts[] = {
> +	{ "mode", 'm', "MODE", 0, "Benchmark mode"},

"Benchmark mode" hmm not sure what this is for yet. Only on
first patch though so maybe I'll become enlightened?

> +	{ "list", 'l', NULL, 0, "List available benchmarks"},
> +	{ "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
> +	{ "warmup", 'w', "SEC", 0, "Warm-up period, seconds"},
> +	{ "producers", 'p', "NUM", 0, "Number of producer threads"},
> +	{ "consumers", 'c', "NUM", 0, "Number of consumer threads"},
> +	{ "verbose", 'v', NULL, 0, "Verbose debug output"},
> +	{ "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> +	{ "b2b", 'b', NULL, 0, "Back-to-back mode"},
> +	{ "rb-output", 10001, NULL, 0, "Set consumer/producer thread affinity"},
> +	{},
> +};

[...]

> +
> +static void set_thread_affinity(pthread_t thread, int cpu)
> +{
> +	cpu_set_t cpuset;
> +
> +	CPU_ZERO(&cpuset);
> +	CPU_SET(cpu, &cpuset);
> +	if (pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset))
> +		printf("setting affinity to CPU #%d failed: %d\n", cpu, errno);
> +}

Should we error out on affinity errors?

> +
> +static struct bench_state {
> +	int res_cnt;
> +	struct bench_res *results;
> +	pthread_t *consumers;
> +	pthread_t *producers;
> +} state;

[...]

> +
> +static void setup_benchmark()
> +{
> +	int i, err;
> +
> +	if (!env.mode) {
> +		fprintf(stderr, "benchmark mode is not specified\n");
> +		exit(1);
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(benchs); i++) {
> +		if (strcmp(benchs[i]->name, env.mode) == 0) {

Ah the mode. OK maybe in the description call it "Benchmark mode to run" or
"Benchmark test"? Or leave it, it's probably fine.

> +			bench = benchs[i];
> +			break;
> +		}
> +	}
> +	if (!bench) {
> +		fprintf(stderr, "benchmark '%s' not found\n", env.mode);
> +		exit(1);
> +	}
> +
> +	printf("Setting up benchmark '%s'...\n", bench->name);
> +
> +	state.producers = calloc(env.producer_cnt, sizeof(*state.producers));
> +	state.consumers = calloc(env.consumer_cnt, sizeof(*state.consumers));
> +	state.results = calloc(env.duration_sec + env.warmup_sec + 2,
> +			       sizeof(*state.results));
> +	if (!state.producers || !state.consumers || !state.results)
> +		exit(1);
> +
> +	if (bench->validate)
> +		bench->validate();
> +	if (bench->setup)
> +		bench->setup();
> +
> +	for (i = 0; i < env.consumer_cnt; i++) {
> +		err = pthread_create(&state.consumers[i], NULL,
> +				     bench->consumer_thread, (void *)(long)i);
> +		if (err) {
> +			fprintf(stderr, "failed to create consumer thread #%d: %d\n",
> +				i, -errno);
> +			exit(1);
> +		}
> +		if (env.affinity)
> +			set_thread_affinity(state.consumers[i], i);
> +	}
> +	for (i = 0; i < env.producer_cnt; i++) {
> +		err = pthread_create(&state.producers[i], NULL,
> +				     bench->producer_thread, (void *)(long)i);
> +		if (err) {
> +			fprintf(stderr, "failed to create producer thread #%d: %d\n",
> +				i, -errno);
> +			exit(1);
> +		}
> +		if (env.affinity)
> +			set_thread_affinity(state.producers[i],
> +					    env.consumer_cnt + i);
> +	}
> +
> +	printf("Benchmark '%s' started.\n", bench->name);
> +}

[...]

> --- /dev/null
> +++ b/tools/testing/selftests/bpf/bench_count.c

How about a ./bpf/bench/ directory? Seems we are going to get a few
bench_* tests here.

Andrii Nakryiko May 8, 2020, 5:59 p.m. UTC | #2
On Fri, May 8, 2020 at 8:49 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Andrii Nakryiko wrote:
> > While working on BPF ringbuf implementation, testing, and benchmarking, I've
> > developed a pretty generic and modular benchmark runner, which seems to be
> > more widely useful, as I've already used it for one more purpose (testing the
> > fastest way to trigger a BPF program, to minimize the overhead of in-kernel code).
> >
> > This patch adds the generic part of the benchmark runner and sets up the
> > Makefile for extending it with more sets of benchmarks.
>
> Seems useful.

thanks :)

>
> > [...]
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
>
> Couple nits but otherwise lgtm. I think it should probably be moved
> into its own directory though ./bpf/bench/

I assume you are talking about the benchmark implementations themselves,
all those bench_xxx.c files, right? bench.c probably should stay in the
selftests/bpf root.

>
> The other question would be how much stuff we want to live in
> selftests vs outside selftests/bpf, but I think it's fine and makes
> it easy to build small benchmark programs in ./bpf/progs/

The selftests/bpf Makefile is so convenient for BPF/skeleton/user-space
building, libbpf, vmlinux.h generation, etc., that moving this outside
would be a major pain and a lot of extra work. Adding this benchmark
was trivial from the Makefile modification point of view (and there was
no debugging of the Makefile either, everything just worked).

>
> > [...]
> > +struct env env = {
> > +     .duration_sec = 10,
> > +     .warmup_sec = 5,
>
> Just curious, I'm guessing the duration/warmup are arbitrary here? Seems
> a bit long; I would bet 5,1 would be enough for the global/local tests at
> least.

Yeah, completely arbitrary. I started with just 1 second, but for some
benchmarks stabilization came around seconds 4-5, so I bumped it to 5.
It's easy to modify with the -d and -w arguments, but I can bump it down
to 5, 1 for the defaults.

>
> > +     .affinity = false,
> > +     .consumer_cnt = 1,
> > +     .producer_cnt = 1,
> > +};
> > +
>
> [...]
>
> > +void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns)
> > +{
> > +     double hits_per_sec, drops_per_sec;
> > +     double hits_per_prod;
> > +
> > +     hits_per_sec = res->hits / 1000000.0 / (delta_ns / 1000000000.0);
> > +     hits_per_prod = hits_per_sec / env.producer_cnt;
>
> Per-producer counts would also be useful. Averaging over the producer count
> could hide issues with fairness.

True about hiding fairness issues, but for benchmarks with lots of
producers that's so many numbers that it would be hard to interpret them
per-producer. We could probably add a stddev calculation across multiple
producers and stuff like that, but I'd defer that to future
enhancements. This benchmarker is a side-product of the BPF ringbuf work,
not the goal in itself.

>
> > +     drops_per_sec = res->drops / 1000000.0 / (delta_ns / 1000000000.0);
> > +
> > +     printf("Iter %3d (%7.3lfus): ",
> > +            iter, (delta_ns - 1000000000) / 1000.0);
> > +
> > +     printf("hits %8.3lfM/s (%7.3lfM/prod), drops %8.3lfM/s\n",
> > +            hits_per_sec, hits_per_prod, drops_per_sec);
> > +}
> > +
>
> [...]
>
> > +const char *argp_program_version = "benchmark";
> > +const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
> > +const char argp_program_doc[] =
> > +"benchmark    Generic benchmarking framework.\n"
> > +"\n"
> > +"This tool runs benchmarks.\n"
> > +"\n"
> > +"USAGE: benchmark <mode>\n"
> > +"\n"
> > +"EXAMPLES:\n"
> > +"    benchmark count-local                # run 'count-local' benchmark with 1 producer and 1 consumer\n"
> > +"    benchmark -p16 -c8 -a count-local    # run 'count-local' benchmark with 16 producer and 8 consumer threads, pinned to CPUs\n";
> > +
> > +static const struct argp_option opts[] = {
> > +     { "mode", 'm', "MODE", 0, "Benchmark mode"},
>
> "Benchmark mode" hmm not sure what this is for yet. Only on
> first patch though so maybe I'll become enlightened?

Oh, actually I don't need it, it's just a positional argument, I'll
drop this line.

>
> > +     { "list", 'l', NULL, 0, "List available benchmarks"},
> > +     { "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
> > +     { "warmup", 'w', "SEC", 0, "Warm-up period, seconds"},
> > +     { "producers", 'p', "NUM", 0, "Number of producer threads"},
> > +     { "consumers", 'c', "NUM", 0, "Number of consumer threads"},
> > +     { "verbose", 'v', NULL, 0, "Verbose debug output"},
> > +     { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> > +     { "b2b", 'b', NULL, 0, "Back-to-back mode"},
> > +     { "rb-output", 10001, NULL, 0, "Set consumer/producer thread affinity"},
> > +     {},
> > +};
>
> [...]
>
> > +
> > +static void set_thread_affinity(pthread_t thread, int cpu)
> > +{
> > +     cpu_set_t cpuset;
> > +
> > +     CPU_ZERO(&cpuset);
> > +     CPU_SET(cpu, &cpuset);
> > +     if (pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset))
> > +             printf("setting affinity to CPU #%d failed: %d\n", cpu, errno);
> > +}
>
> Should we error out on affinity errors?

Given I made affinity setting optional (in the end), I guess I could
make it fail.

>
> > +
> > +static struct bench_state {
> > +     int res_cnt;
> > +     struct bench_res *results;
> > +     pthread_t *consumers;
> > +     pthread_t *producers;
> > +} state;
>
> [...]
>
> > +
> > +static void setup_benchmark()
> > +{
> > +     int i, err;
> > +
> > +     if (!env.mode) {
> > +             fprintf(stderr, "benchmark mode is not specified\n");
> > +             exit(1);
> > +     }
> > +
> > +     for (i = 0; i < ARRAY_SIZE(benchs); i++) {
> > +             if (strcmp(benchs[i]->name, env.mode) == 0) {
>
> Ah the mode. OK maybe in the description call it "Benchmark mode to run" or
> "Benchmark test"? Or leave it, it's probably fine.

How about bench_name?


>
> > +                     bench = benchs[i];
> > +                     break;
> > +             }
> > +     }
> > +     if (!bench) {
> > +             fprintf(stderr, "benchmark '%s' not found\n", env.mode);
> > +             exit(1);
> > +     }
> > +
> > +     printf("Setting up benchmark '%s'...\n", bench->name);
> > +
> > +     state.producers = calloc(env.producer_cnt, sizeof(*state.producers));
> > +     state.consumers = calloc(env.consumer_cnt, sizeof(*state.consumers));
> > +     state.results = calloc(env.duration_sec + env.warmup_sec + 2,
> > +                            sizeof(*state.results));
> > +     if (!state.producers || !state.consumers || !state.results)
> > +             exit(1);
> > +
> > +     if (bench->validate)
> > +             bench->validate();
> > +     if (bench->setup)
> > +             bench->setup();
> > +
> > +     for (i = 0; i < env.consumer_cnt; i++) {
> > +             err = pthread_create(&state.consumers[i], NULL,
> > +                                  bench->consumer_thread, (void *)(long)i);
> > +             if (err) {
> > +                     fprintf(stderr, "failed to create consumer thread #%d: %d\n",
> > +                             i, -errno);
> > +                     exit(1);
> > +             }
> > +             if (env.affinity)
> > +                     set_thread_affinity(state.consumers[i], i);
> > +     }
> > +     for (i = 0; i < env.producer_cnt; i++) {
> > +             err = pthread_create(&state.producers[i], NULL,
> > +                                  bench->producer_thread, (void *)(long)i);
> > +             if (err) {
> > +                     fprintf(stderr, "failed to create producer thread #%d: %d\n",
> > +                             i, -errno);
> > +                     exit(1);
> > +             }
> > +             if (env.affinity)
> > +                     set_thread_affinity(state.producers[i],
> > +                                         env.consumer_cnt + i);
> > +     }
> > +
> > +     printf("Benchmark '%s' started.\n", bench->name);
> > +}
>
> [...]
>
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/bench_count.c
>
> How about a ./bpf/bench/ directory? Seems we are going to get a few
> bench_* tests here.
>

Sounds good to me, I'll move.

Patch

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 3ff031972975..1bb204cee853 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -38,3 +38,4 @@ test_cpp
 /bpf_gcc
 /tools
 /runqslower
+/bench
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 3d942be23d09..ab03362d46e4 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -77,7 +77,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
-	test_lirc_mode2_user xdping test_cpp runqslower
+	test_lirc_mode2_user xdping test_cpp runqslower bench
 
 TEST_CUSTOM_PROGS = urandom_read
 
@@ -405,6 +405,15 @@ $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
 	$(call msg,CXX,,$@)
 	$(CXX) $(CFLAGS) $^ $(LDLIBS) -o $@
 
+# Benchmark runner
+$(OUTPUT)/bench.o:          bench.h
+$(OUTPUT)/bench_count.o:    bench.h
+$(OUTPUT)/bench: LDLIBS += -lm
+$(OUTPUT)/bench: $(OUTPUT)/bench.o \
+		 $(OUTPUT)/bench_count.o
+	$(call msg,BINARY,,$@)
+	$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
+
 EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR)			\
 	prog_tests/tests.h map_tests/tests.h verifier/tests.h		\
 	feature								\
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
new file mode 100644
index 000000000000..a20482bb74e2
--- /dev/null
+++ b/tools/testing/selftests/bpf/bench.c
@@ -0,0 +1,364 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#define _GNU_SOURCE
+#include <argp.h>
+#include <linux/compiler.h>
+#include <sys/time.h>
+#include <sched.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <sys/sysinfo.h>
+#include <sys/resource.h>
+#include <signal.h>
+#include "bench.h"
+
+struct env env = {
+	.duration_sec = 10,
+	.warmup_sec = 5,
+	.affinity = false,
+	.consumer_cnt = 1,
+	.producer_cnt = 1,
+};
+
+static int libbpf_print_fn(enum libbpf_print_level level,
+		    const char *format, va_list args)
+{
+	if (level == LIBBPF_DEBUG && !env.verbose)
+		return 0;
+	return vfprintf(stderr, format, args);
+}
+
+static int bump_memlock_rlimit(void)
+{
+	struct rlimit rlim_new = {
+		.rlim_cur	= RLIM_INFINITY,
+		.rlim_max	= RLIM_INFINITY,
+	};
+
+	return setrlimit(RLIMIT_MEMLOCK, &rlim_new);
+}
+
+void setup_libbpf()
+{
+	int err;
+
+	libbpf_set_print(libbpf_print_fn);
+
+	err = bump_memlock_rlimit();
+	if (err)
+		fprintf(stderr, "failed to increase RLIMIT_MEMLOCK: %d", err);
+}
+
+void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns)
+{
+	double hits_per_sec, drops_per_sec;
+	double hits_per_prod;
+
+	hits_per_sec = res->hits / 1000000.0 / (delta_ns / 1000000000.0);
+	hits_per_prod = hits_per_sec / env.producer_cnt;
+	drops_per_sec = res->drops / 1000000.0 / (delta_ns / 1000000000.0);
+
+	printf("Iter %3d (%7.3lfus): ",
+	       iter, (delta_ns - 1000000000) / 1000.0);
+
+	printf("hits %8.3lfM/s (%7.3lfM/prod), drops %8.3lfM/s\n",
+	       hits_per_sec, hits_per_prod, drops_per_sec);
+}
+
+void hits_drops_report_final(struct bench_res res[], int res_cnt)
+{
+	int i;
+	double hits_mean = 0.0, drops_mean = 0.0;
+	double hits_stddev = 0.0, drops_stddev = 0.0;
+
+	for (i = 0; i < res_cnt; i++) {
+		hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
+		drops_mean += res[i].drops / 1000000.0 / (0.0 + res_cnt);
+	}
+
+	if (res_cnt > 1)  {
+		for (i = 0; i < res_cnt; i++) {
+			hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
+				       (hits_mean - res[i].hits / 1000000.0) /
+				       (res_cnt - 1.0);
+			drops_stddev += (drops_mean - res[i].drops / 1000000.0) *
+					(drops_mean - res[i].drops / 1000000.0) /
+					(res_cnt - 1.0);
+		}
+		hits_stddev = sqrt(hits_stddev);
+		drops_stddev = sqrt(drops_stddev);
+	}
+	printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
+	       hits_mean, hits_stddev, hits_mean / env.producer_cnt);
+	printf("drops %8.3lf \u00B1 %5.3lfM/s\n",
+	       drops_mean, drops_stddev);
+}
+
+const char *argp_program_version = "benchmark";
+const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
+const char argp_program_doc[] =
+"benchmark    Generic benchmarking framework.\n"
+"\n"
+"This tool runs benchmarks.\n"
+"\n"
+"USAGE: benchmark <mode>\n"
+"\n"
+"EXAMPLES:\n"
+"    benchmark count-local                # run 'count-local' benchmark with 1 producer and 1 consumer\n"
+"    benchmark -p16 -c8 -a count-local    # run 'count-local' benchmark with 16 producer and 8 consumer threads, pinned to CPUs\n";
+
+static const struct argp_option opts[] = {
+	{ "mode", 'm', "MODE", 0, "Benchmark mode"},
+	{ "list", 'l', NULL, 0, "List available benchmarks"},
+	{ "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
+	{ "warmup", 'w', "SEC", 0, "Warm-up period, seconds"},
+	{ "producers", 'p', "NUM", 0, "Number of producer threads"},
+	{ "consumers", 'c', "NUM", 0, "Number of consumer threads"},
+	{ "verbose", 'v', NULL, 0, "Verbose debug output"},
+	{ "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
+	{ "b2b", 'b', NULL, 0, "Back-to-back mode"},
+	{ "rb-output", 10001, NULL, 0, "Set consumer/producer thread affinity"},
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	static int pos_args;
+
+	switch (key) {
+	case 'v':
+		env.verbose = true;
+		break;
+	case 'l':
+		env.list = true;
+		break;
+	case 'd':
+		env.duration_sec = strtol(arg, NULL, 10);
+		if (env.duration_sec <= 0) {
+			fprintf(stderr, "Invalid duration: %s\n", arg);
+			argp_usage(state);
+		}
+		break;
+	case 'w':
+		env.warmup_sec = strtol(arg, NULL, 10);
+		if (env.warmup_sec <= 0) {
+			fprintf(stderr, "Invalid warm-up duration: %s\n", arg);
+			argp_usage(state);
+		}
+		break;
+	case 'p':
+		env.producer_cnt = strtol(arg, NULL, 10);
+		if (env.producer_cnt <= 0) {
+			fprintf(stderr, "Invalid producer count: %s\n", arg);
+			argp_usage(state);
+		}
+		break;
+	case 'c':
+		env.consumer_cnt = strtol(arg, NULL, 10);
+		if (env.consumer_cnt <= 0) {
+			fprintf(stderr, "Invalid consumer count: %s\n", arg);
+			argp_usage(state);
+		}
+		break;
+	case 'a':
+		env.affinity = true;
+		break;
+	case ARGP_KEY_ARG:
+		if (pos_args++) {
+			fprintf(stderr,
+				"Unrecognized positional argument: %s\n", arg);
+			argp_usage(state);
+		}
+		env.mode = strdup(arg);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+static void parse_cmdline_args(int argc, char **argv)
+{
+	static const struct argp argp = {
+		.options = opts,
+		.parser = parse_arg,
+		.doc = argp_program_doc,
+	};
+	if (argp_parse(&argp, argc, argv, 0, NULL, NULL))
+		exit(1);
+}
+
+static void collect_measurements(long delta_ns);
+
+static __u64 last_time_ns;
+static void sigalarm_handler(int signo)
+{
+	long new_time_ns = get_time_ns();
+	long delta_ns = new_time_ns - last_time_ns;
+
+	collect_measurements(delta_ns);
+
+	last_time_ns = new_time_ns;
+}
+
+/* set up periodic 1-second timer */
+static void setup_timer()
+{
+	static struct sigaction sigalarm_action = {
+		.sa_handler = sigalarm_handler,
+	};
+	struct itimerval timer_settings = {};
+	int err;
+
+	last_time_ns = get_time_ns();
+	err = sigaction(SIGALRM, &sigalarm_action, NULL);
+	if (err < 0) {
+		fprintf(stderr, "failed to install SIGALARM handler: %d\n", -errno);
+		exit(1);
+	}
+	timer_settings.it_interval.tv_sec = 1;
+	timer_settings.it_value.tv_sec = 1;
+	err = setitimer(ITIMER_REAL, &timer_settings, NULL);
+	if (err < 0) {
+		fprintf(stderr, "failed to arm interval timer: %d\n", -errno);
+		exit(1);
+	}
+}
+
+static void set_thread_affinity(pthread_t thread, int cpu)
+{
+	cpu_set_t cpuset;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	if (pthread_setaffinity_np(thread, sizeof(cpuset), &cpuset))
+		printf("setting affinity to CPU #%d failed: %d\n", cpu, errno);
+}
+
+static struct bench_state {
+	int res_cnt;
+	struct bench_res *results;
+	pthread_t *consumers;
+	pthread_t *producers;
+} state;
+
+const struct bench *bench = NULL;
+
+extern const struct bench bench_count_global;
+extern const struct bench bench_count_local;
+
+static const struct bench *benchs[] = {
+	&bench_count_global,
+	&bench_count_local,
+};
+
+static void setup_benchmark()
+{
+	int i, err;
+
+	if (!env.mode) {
+		fprintf(stderr, "benchmark mode is not specified\n");
+		exit(1);
+	}
+
+	for (i = 0; i < ARRAY_SIZE(benchs); i++) {
+		if (strcmp(benchs[i]->name, env.mode) == 0) {
+			bench = benchs[i];
+			break;
+		}
+	}
+	if (!bench) {
+		fprintf(stderr, "benchmark '%s' not found\n", env.mode);
+		exit(1);
+	}
+
+	printf("Setting up benchmark '%s'...\n", bench->name);
+
+	state.producers = calloc(env.producer_cnt, sizeof(*state.producers));
+	state.consumers = calloc(env.consumer_cnt, sizeof(*state.consumers));
+	state.results = calloc(env.duration_sec + env.warmup_sec + 2,
+			       sizeof(*state.results));
+	if (!state.producers || !state.consumers || !state.results)
+		exit(1);
+
+	if (bench->validate)
+		bench->validate();
+	if (bench->setup)
+		bench->setup();
+
+	for (i = 0; i < env.consumer_cnt; i++) {
+		err = pthread_create(&state.consumers[i], NULL,
+				     bench->consumer_thread, (void *)(long)i);
+		if (err) {
+			fprintf(stderr, "failed to create consumer thread #%d: %d\n",
+				i, -errno);
+			exit(1);
+		}
+		if (env.affinity)
+			set_thread_affinity(state.consumers[i], i);
+	}
+	for (i = 0; i < env.producer_cnt; i++) {
+		err = pthread_create(&state.producers[i], NULL,
+				     bench->producer_thread, (void *)(long)i);
+		if (err) {
+			fprintf(stderr, "failed to create producer thread #%d: %d\n",
+				i, -errno);
+			exit(1);
+		}
+		if (env.affinity)
+			set_thread_affinity(state.producers[i],
+					    env.consumer_cnt + i);
+	}
+
+	printf("Benchmark '%s' started.\n", bench->name);
+}
+
+static pthread_mutex_t bench_done_mtx = PTHREAD_MUTEX_INITIALIZER;
+static pthread_cond_t bench_done = PTHREAD_COND_INITIALIZER;
+
+static void collect_measurements(long delta_ns) {
+	int iter = state.res_cnt++;
+	struct bench_res *res = &state.results[iter];
+
+	bench->measure(res);
+
+	if (bench->report_progress)
+		bench->report_progress(iter, res, delta_ns);
+
+	if (iter == env.duration_sec + env.warmup_sec) {
+		pthread_mutex_lock(&bench_done_mtx);
+		pthread_cond_signal(&bench_done);
+		pthread_mutex_unlock(&bench_done_mtx);
+	}
+}
+
+int main(int argc, char **argv)
+{
+	parse_cmdline_args(argc, argv);
+
+	if (env.list) {
+		int i;
+
+		printf("Available benchmarks:\n");
+		for (i = 0; i < ARRAY_SIZE(benchs); i++) {
+			printf("- %s\n", benchs[i]->name);
+		}
+		return 0;
+	}
+
+	setup_benchmark();
+
+	setup_timer();
+
+	pthread_mutex_lock(&bench_done_mtx);
+	pthread_cond_wait(&bench_done, &bench_done_mtx);
+	pthread_mutex_unlock(&bench_done_mtx);
+
+	if (bench->report_final)
+		/* skip warm-up results */
+		bench->report_final(state.results + env.warmup_sec,
+				    state.res_cnt - env.warmup_sec);
+
+	return 0;
+}
+
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
new file mode 100644
index 000000000000..a9daff10af18
--- /dev/null
+++ b/tools/testing/selftests/bpf/bench.h
@@ -0,0 +1,74 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#pragma once
+#include <stdlib.h>
+#include <stdbool.h>
+#include <linux/err.h>
+#include <errno.h>
+#include <unistd.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+#include <math.h>
+#include <time.h>
+#include <sys/syscall.h>
+
+struct env {
+	char *mode;
+	int duration_sec;
+	int warmup_sec;
+	bool verbose;
+	bool list;
+	bool back2back;
+	bool affinity;
+	int consumer_cnt;
+	int producer_cnt;
+};
+
+struct bench_res {
+	long hits;
+	long drops;
+};
+
+struct bench {
+	const char *name;
+	void (*validate)();
+	void (*setup)();
+	void *(*producer_thread)(void *ctx);
+	void *(*consumer_thread)(void *ctx);
+	void (*measure)(struct bench_res* res);
+	void (*report_progress)(int iter, struct bench_res* res, long delta_ns);
+	void (*report_final)(struct bench_res res[], int res_cnt);
+};
+
+struct counter {
+	long value;
+} __attribute__((aligned(128)));
+
+extern struct env env;
+extern const struct bench *bench;
+
+void setup_libbpf();
+void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns);
+void hits_drops_report_final(struct bench_res res[], int res_cnt);
+
+static inline __u64 get_time_ns() {
+	struct timespec t;
+
+	clock_gettime(CLOCK_MONOTONIC, &t);
+
+	return (u64)t.tv_sec * 1000000000 + t.tv_nsec;
+}
+
+static inline void atomic_inc(long *value)
+{
+	(void)__atomic_add_fetch(value, 1, __ATOMIC_RELAXED);
+}
+
+static inline void atomic_add(long *value, long n)
+{
+	(void)__atomic_add_fetch(value, n, __ATOMIC_RELAXED);
+}
+
+static inline long atomic_swap(long *value, long n)
+{
+	return __atomic_exchange_n(value, n, __ATOMIC_RELAXED);
+}
diff --git a/tools/testing/selftests/bpf/bench_count.c b/tools/testing/selftests/bpf/bench_count.c
new file mode 100644
index 000000000000..befba7a82643
--- /dev/null
+++ b/tools/testing/selftests/bpf/bench_count.c
@@ -0,0 +1,91 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include "bench.h"
+
+/* COUNT-GLOBAL benchmark */
+
+static struct count_global_ctx {
+	struct counter hits;
+} count_global_ctx;
+
+static void *count_global_producer(void *input)
+{
+	struct count_global_ctx *ctx = &count_global_ctx;
+
+	while (true) {
+		atomic_inc(&ctx->hits.value);
+	}
+	return NULL;
+}
+
+static void *count_global_consumer(void *input)
+{
+	return NULL;
+}
+
+static void count_global_measure(struct bench_res *res)
+{
+	struct count_global_ctx *ctx = &count_global_ctx;
+
+	res->hits = atomic_swap(&ctx->hits.value, 0);
+}
+
+/* COUNT-local benchmark */
+
+static struct count_local_ctx {
+	struct counter *hits;
+} count_local_ctx;
+
+static void count_local_setup()
+{
+	struct count_local_ctx *ctx = &count_local_ctx;
+
+	ctx->hits = calloc(env.producer_cnt, sizeof(*ctx->hits));
+	if (!ctx->hits)
+		exit(1);
+}
+
+static void *count_local_producer(void *input)
+{
+	struct count_local_ctx *ctx = &count_local_ctx;
+	int idx = (long)input;
+
+	while (true) {
+		atomic_inc(&ctx->hits[idx].value);
+	}
+	return NULL;
+}
+
+static void *count_local_consumer(void *input)
+{
+	return NULL;
+}
+
+static void count_local_measure(struct bench_res *res)
+{
+	struct count_local_ctx *ctx = &count_local_ctx;
+	int i;
+
+	for (i = 0; i < env.producer_cnt; i++) {
+		res->hits += atomic_swap(&ctx->hits[i].value, 0);
+	}
+}
+
+const struct bench bench_count_global = {
+	.name = "count-global",
+	.producer_thread = count_global_producer,
+	.consumer_thread = count_global_consumer,
+	.measure = count_global_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_count_local = {
+	.name = "count-local",
+	.setup = count_local_setup,
+	.producer_thread = count_local_producer,
+	.consumer_thread = count_local_consumer,
+	.measure = count_local_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};