[v2,bpf-next,2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench

Message ID: 20200508232032.1974027-3-andriin@fb.com
State: Changes Requested
Delegated to: BPF Maintainers
Series: Add benchmark runner and few benchmarks

Commit Message

Andrii Nakryiko May 8, 2020, 11:20 p.m. UTC
Add fmod_ret BPF program to existing test_overhead selftest. Also
re-implement the user-space benchmarking part as a set of benchmarks
within the benchmark runner to compare results. Results with ./bench are
consistently somewhat lower than test_overhead's, but the relative
performance of various types of BPF programs stays consistent (e.g.,
kretprobe is noticeably slower).

run_bench_rename.sh script (in benchs/ directory) was used to produce the
following numbers:

  base      :    3.975 ± 0.065M/s
  kprobe    :    3.268 ± 0.095M/s
  kretprobe :    2.496 ± 0.040M/s
  rawtp     :    3.899 ± 0.078M/s
  fentry    :    3.836 ± 0.049M/s
  fexit     :    3.660 ± 0.082M/s
  fmodret   :    3.776 ± 0.033M/s

While running test_overhead gives:

  task_rename base        4457K events per sec
  task_rename kprobe      3849K events per sec
  task_rename kretprobe   2729K events per sec
  task_rename raw_tp      4506K events per sec
  task_rename fentry      4381K events per sec
  task_rename fexit       4349K events per sec
  task_rename fmod_ret    4130K events per sec
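
(In absolute terms, ./bench comes out roughly 8-16% lower across program
types, e.g., base at 3.975M/s is 3975K events/sec vs. test_overhead's
4457K; since the gap is of similar magnitude everywhere, the relative
ordering of program types is preserved.)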

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   4 +-
 tools/testing/selftests/bpf/bench.c           |  14 ++
 .../selftests/bpf/benchs/bench_rename.c       | 195 ++++++++++++++++++
 .../selftests/bpf/benchs/run_bench_rename.sh  |   9 +
 .../selftests/bpf/prog_tests/test_overhead.c  |  14 +-
 .../selftests/bpf/progs/test_overhead.c       |   6 +
 6 files changed, 240 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_rename.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_rename.sh

Comments

Yonghong Song May 9, 2020, 5:23 p.m. UTC | #1
On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
> Add fmod_ret BPF program to existing test_overhead selftest. Also
> re-implement the user-space benchmarking part as a set of benchmarks
> within the benchmark runner to compare results. Results with ./bench are
> consistently somewhat lower than test_overhead's, but the relative
> performance of various types of BPF programs stays consistent (e.g.,
> kretprobe is noticeably slower).
> 
> run_bench_rename.sh script (in benchs/ directory) was used to produce the
> following numbers:
> 
>    base      :    3.975 ± 0.065M/s
>    kprobe    :    3.268 ± 0.095M/s
>    kretprobe :    2.496 ± 0.040M/s
>    rawtp     :    3.899 ± 0.078M/s
>    fentry    :    3.836 ± 0.049M/s
>    fexit     :    3.660 ± 0.082M/s
>    fmodret   :    3.776 ± 0.033M/s
> 
> While running test_overhead gives:
> 
>    task_rename base        4457K events per sec
>    task_rename kprobe      3849K events per sec
>    task_rename kretprobe   2729K events per sec
>    task_rename raw_tp      4506K events per sec
>    task_rename fentry      4381K events per sec
>    task_rename fexit       4349K events per sec
>    task_rename fmod_ret    4130K events per sec

Do you know where the overhead is and how we could provide options in
bench to reduce the overhead so we can achieve similar numbers?
For benchmarking, sometimes you really want to see the "true"
potential of a particular implementation.

[...]
Andrii Nakryiko May 12, 2020, 4:22 a.m. UTC | #2
On Sat, May 9, 2020 at 10:24 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
> > Add fmod_ret BPF program to existing test_overhead selftest. Also
> > re-implement the user-space benchmarking part as a set of benchmarks
> > within the benchmark runner to compare results. Results with ./bench are
> > consistently somewhat lower than test_overhead's, but the relative
> > performance of various types of BPF programs stays consistent (e.g.,
> > kretprobe is noticeably slower).
> >
> > run_bench_rename.sh script (in benchs/ directory) was used to produce the
> > following numbers:
> >
> >    base      :    3.975 ± 0.065M/s
> >    kprobe    :    3.268 ± 0.095M/s
> >    kretprobe :    2.496 ± 0.040M/s
> >    rawtp     :    3.899 ± 0.078M/s
> >    fentry    :    3.836 ± 0.049M/s
> >    fexit     :    3.660 ± 0.082M/s
> >    fmodret   :    3.776 ± 0.033M/s
> >
> > While running test_overhead gives:
> >
> >    task_rename base        4457K events per sec
> >    task_rename kprobe      3849K events per sec
> >    task_rename kretprobe   2729K events per sec
> >    task_rename raw_tp      4506K events per sec
> >    task_rename fentry      4381K events per sec
> >    task_rename fexit       4349K events per sec
> >    task_rename fmod_ret    4130K events per sec
>
> Do you know where the overhead is and how we could provide options in
> bench to reduce the overhead so we can achieve similar numbers?
> For benchmarking, sometimes you really want to see the "true"
> potential of a particular implementation.

Alright, let's make it an official bench-off... :) And the reason for
this discrepancy turns out to be... not atomics at all! But rather a
single-threaded vs multi-threaded process (well, at least task_rename
happening from non-main thread, I didn't narrow it down further).
Atomics actually make very little difference, which gives me a good
peace of mind :)
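
To make it concrete, here's roughly what I compared (a minimal sketch,
not the actual benchmark code; the loop body is the same write() to
/proc/self/comm that both test_overhead and bench use):

#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static void *rename_loop(void *arg)
{
        char buf[] = "test_overhead";
        int fd = open("/proc/self/comm", O_WRONLY | O_TRUNC);
        int i;

        if (fd < 0)
                exit(1);
        /* each write() renames the task and triggers the attached prog */
        for (i = 0; i < 10000000; i++)
                if (write(fd, buf, sizeof(buf)) < 0)
                        exit(1);
        close(fd);
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t t;

        if (argc > 1) {
                /* "multi-threaded": renames happen on a spawned thread */
                pthread_create(&t, NULL, rename_loop, NULL);
                pthread_join(t, NULL);
        } else {
                /* "single-threaded": renames happen on the main thread */
                rename_loop(NULL);
        }
        return 0;
}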

So, I've built and run test_overhead (selftest) and bench both as
multi-threaded and single-threaded apps. Corresponding results match
almost perfectly. And that's while test_overhead doesn't use atomics
at all and bench still does. Then I also ran test_overhead with added
atomics to match the bench implementation. There are barely any
differences, see the last two sets of results.
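
(By "atomics" I mean the counter increments in the producer loop. The
bench.h helpers from patch 1/3 are presumably just thin wrappers around
the GCC/clang __atomic builtins, along these lines; the "no atomics"
variant is a plain value++.)

static inline void atomic_inc(long *value)
{
        /* relaxed ordering is enough for a statistics counter */
        (void)__atomic_add_fetch(value, 1, __ATOMIC_RELAXED);
}

static inline long atomic_swap(long *value, long n)
{
        /* atomically read-and-reset the counter for one measurement */
        return __atomic_exchange_n(value, n, __ATOMIC_RELAXED);
}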

BTW, selftest results seem a bit lower than the ones in the original
commit, probably because I made it run more iterations (about 40 times
more) to get more stable results.

So here are the results:

Single-threaded implementations
===============================

/* bench: single-threaded, atomics */
base      :    4.622 ± 0.049M/s
kprobe    :    3.673 ± 0.052M/s
kretprobe :    2.625 ± 0.052M/s
rawtp     :    4.369 ± 0.089M/s
fentry    :    4.201 ± 0.558M/s
fexit     :    4.309 ± 0.148M/s
fmodret   :    4.314 ± 0.203M/s

/* selftest: single-threaded, no atomics */
task_rename base        4555K events per sec
task_rename kprobe      3643K events per sec
task_rename kretprobe   2506K events per sec
task_rename raw_tp      4303K events per sec
task_rename fentry      4307K events per sec
task_rename fexit       4010K events per sec
task_rename fmod_ret    3984K events per sec


Multi-threaded implementations
==============================

/* bench: multi-threaded w/ atomics */
base      :    3.910 ± 0.023M/s
kprobe    :    3.048 ± 0.037M/s
kretprobe :    2.300 ± 0.015M/s
rawtp     :    3.687 ± 0.034M/s
fentry    :    3.740 ± 0.087M/s
fexit     :    3.510 ± 0.009M/s
fmodret   :    3.485 ± 0.050M/s

/* selftest: multi-threaded w/ atomics */
task_rename base        3872K events per sec
task_rename kprobe      3068K events per sec
task_rename kretprobe   2350K events per sec
task_rename raw_tp      3731K events per sec
task_rename fentry      3639K events per sec
task_rename fexit       3558K events per sec
task_rename fmod_ret    3511K events per sec

/* selftest: multi-threaded, no atomics */
task_rename base        3945K events per sec
task_rename kprobe      3298K events per sec
task_rename kretprobe   2451K events per sec
task_rename raw_tp      3718K events per sec
task_rename fentry      3782K events per sec
task_rename fexit       3543K events per sec
task_rename fmod_ret    3526K events per sec


[...]
Yonghong Song May 12, 2020, 3:11 p.m. UTC | #3
On 5/11/20 9:22 PM, Andrii Nakryiko wrote:
> On Sat, May 9, 2020 at 10:24 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
>>> Add fmod_ret BPF program to existing test_overhead selftest. Also
>>> re-implement the user-space benchmarking part as a set of benchmarks
>>> within the benchmark runner to compare results. Results with ./bench are
>>> consistently somewhat lower than test_overhead's, but the relative
>>> performance of various types of BPF programs stays consistent (e.g.,
>>> kretprobe is noticeably slower).
>>>
>>> run_bench_rename.sh script (in benchs/ directory) was used to produce the
>>> following numbers:
>>>
>>>     base      :    3.975 ± 0.065M/s
>>>     kprobe    :    3.268 ± 0.095M/s
>>>     kretprobe :    2.496 ± 0.040M/s
>>>     rawtp     :    3.899 ± 0.078M/s
>>>     fentry    :    3.836 ± 0.049M/s
>>>     fexit     :    3.660 ± 0.082M/s
>>>     fmodret   :    3.776 ± 0.033M/s
>>>
>>> While running test_overhead gives:
>>>
>>>     task_rename base        4457K events per sec
>>>     task_rename kprobe      3849K events per sec
>>>     task_rename kretprobe   2729K events per sec
>>>     task_rename raw_tp      4506K events per sec
>>>     task_rename fentry      4381K events per sec
>>>     task_rename fexit       4349K events per sec
>>>     task_rename fmod_ret    4130K events per sec
>>
>> Do you know where the overhead is and how we could provide options in
>> bench to reduce the overhead so we can achieve similar numbers?
>> For benchmarking, sometimes you really want to see the "true"
>> potential of a particular implementation.
> 
> Alright, let's make it an official bench-off... :) And the reason for
> > this discrepancy turns out to be... not atomics at all! But rather a
> single-threaded vs multi-threaded process (well, at least task_rename
> happening from non-main thread, I didn't narrow it down further).

It would be good to find out why and have a scheme (e.g. some kind
of affinity binding) to close the gap.

[...]
Andrii Nakryiko May 12, 2020, 5:23 p.m. UTC | #4
On Tue, May 12, 2020 at 8:11 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 5/11/20 9:22 PM, Andrii Nakryiko wrote:
> > On Sat, May 9, 2020 at 10:24 AM Yonghong Song <yhs@fb.com> wrote:
> >>
> >>
> >>
> >> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
> >>> Add fmod_ret BPF program to existing test_overhead selftest. Also
> >>> re-implement the user-space benchmarking part as a set of benchmarks
> >>> within the benchmark runner to compare results. Results with ./bench are
> >>> consistently somewhat lower than test_overhead's, but the relative
> >>> performance of various types of BPF programs stays consistent (e.g.,
> >>> kretprobe is noticeably slower).
> >>>
> >>> run_bench_rename.sh script (in benchs/ directory) was used to produce the
> >>> following numbers:
> >>>
> >>>     base      :    3.975 ± 0.065M/s
> >>>     kprobe    :    3.268 ± 0.095M/s
> >>>     kretprobe :    2.496 ± 0.040M/s
> >>>     rawtp     :    3.899 ± 0.078M/s
> >>>     fentry    :    3.836 ± 0.049M/s
> >>>     fexit     :    3.660 ± 0.082M/s
> >>>     fmodret   :    3.776 ± 0.033M/s
> >>>
> >>> While running test_overhead gives:
> >>>
> >>>     task_rename base        4457K events per sec
> >>>     task_rename kprobe      3849K events per sec
> >>>     task_rename kretprobe   2729K events per sec
> >>>     task_rename raw_tp      4506K events per sec
> >>>     task_rename fentry      4381K events per sec
> >>>     task_rename fexit       4349K events per sec
> >>>     task_rename fmod_ret    4130K events per sec
> >>
> >> Do you know where the overhead is and how we could provide options in
> >> bench to reduce the overhead so we can achieve similar numbers?
> >> For benchmarking, sometimes you really want to see the "true"
> >> potential of a particular implementation.
> >
> > Alright, let's make it an official bench-off... :) And the reason for
> > this discrepancy turns out to be... not atomics at all! But rather a
> > single-threaded vs multi-threaded process (well, at least task_rename
> > happening from non-main thread, I didn't narrow it down further).
>
> It would be good to find out why and have a scheme (e.g. some kind
> of affinity binding) to close the gap.

I don't think affinity has anything to do with this. test_overhead
sets affinity for the entire process, and that doesn't change results
at all. Same for bench: with and without setting affinity, results
are pretty much the same. Affinity helps to get somewhat more stable
and consistent results, but doesn't hurt or help performance for this
benchmark.
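
(The pinning itself is nothing fancy; a sketch of the per-thread
variant, with a hard-coded CPU index purely for illustration:)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_to_cpu(int cpu)
{
        cpu_set_t cpuset;

        CPU_ZERO(&cpuset);
        CPU_SET(cpu, &cpuset);
        /* restrict just the calling thread to the given CPU */
        pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}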

I don't think we need to spend that much time trying to understand
the behavior of task renaming for such a particular setup.
Benchmarking has to be multi-threaded in most cases anyways; there is
no way around that.

[...]
Yonghong Song May 12, 2020, 5:47 p.m. UTC | #5
On 5/12/20 10:23 AM, Andrii Nakryiko wrote:
> On Tue, May 12, 2020 at 8:11 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 5/11/20 9:22 PM, Andrii Nakryiko wrote:
>>> On Sat, May 9, 2020 at 10:24 AM Yonghong Song <yhs@fb.com> wrote:
>>>>
>>>>
>>>>
>>>> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
>>>>> Add fmod_ret BPF program to existing test_overhead selftest. Also
>>>>> re-implement the user-space benchmarking part as a set of benchmarks
>>>>> within the benchmark runner to compare results. Results with ./bench are
>>>>> consistently somewhat lower than test_overhead's, but the relative
>>>>> performance of various types of BPF programs stays consistent (e.g.,
>>>>> kretprobe is noticeably slower).
>>>>>
>>>>> run_bench_rename.sh script (in benchs/ directory) was used to produce the
>>>>> following numbers:
>>>>>
>>>>>      base      :    3.975 ± 0.065M/s
>>>>>      kprobe    :    3.268 ± 0.095M/s
>>>>>      kretprobe :    2.496 ± 0.040M/s
>>>>>      rawtp     :    3.899 ± 0.078M/s
>>>>>      fentry    :    3.836 ± 0.049M/s
>>>>>      fexit     :    3.660 ± 0.082M/s
>>>>>      fmodret   :    3.776 ± 0.033M/s
>>>>>
>>>>> While running test_overhead gives:
>>>>>
>>>>>      task_rename base        4457K events per sec
>>>>>      task_rename kprobe      3849K events per sec
>>>>>      task_rename kretprobe   2729K events per sec
>>>>>      task_rename raw_tp      4506K events per sec
>>>>>      task_rename fentry      4381K events per sec
>>>>>      task_rename fexit       4349K events per sec
>>>>>      task_rename fmod_ret    4130K events per sec
>>>>
>>>> Do you know where the overhead is and how we could provide options in
>>>> bench to reduce the overhead so we can achieve similar numbers?
>>>> For benchmarking, sometimes you really want to see the "true"
>>>> potential of a particular implementation.
>>>
>>> Alright, let's make it an official bench-off... :) And the reason for
>>> this discrepancy turns out to be... not atomics at all! But rather a
>>> single-threaded vs multi-threaded process (well, at least task_rename
>>> happening from non-main thread, I didn't narrow it down further).
>>
>> It would be good to find out why and have a scheme (e.g. some kind
>> of affinity binding) to close the gap.
> 
> I don't think affinity has anything to do with this. test_overhead
> sets affinity for the entire process, and that doesn't change results
> at all. Same for bench: with and without setting affinity, results
> are pretty much the same. Affinity helps to get somewhat more stable
> and consistent results, but doesn't hurt or help performance for this
> benchmark.
> 
> I don't think we need to spend that much time trying to understand
> the behavior of task renaming for such a particular setup.
> Benchmarking has to be multi-threaded in most cases anyways; there is
> no way around that.

Okay. This might be related to kernel scheduling of the main thread
vs. secondary threads? That is then indeed beyond this patch.

I am fine with the current mechanism as is. Maybe put the above
experimental data in the commit message? If other people want to
investigate further later, they will have some data to start with.

[...]

Patch

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 289fffbf975e..29a02abf81a3 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -409,10 +409,12 @@  $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
 $(OUTPUT)/bench_%.o: benchs/bench_%.c bench.h
 	$(call msg,CC,,$@)
 	$(CC) $(CFLAGS) -c $(filter %.c,$^) $(LDLIBS) -o $@
+$(OUTPUT)/bench_rename.o: $(OUTPUT)/test_overhead.skel.h
 $(OUTPUT)/bench.o: bench.h
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
-		 $(OUTPUT)/bench_count.o
+		 $(OUTPUT)/bench_count.o \
+		 $(OUTPUT)/bench_rename.o
 	$(call msg,BINARY,,$@)
 	$(CC) $(LDFLAGS) -o $@ $(filter %.a %.o,$^) $(LDLIBS)
 
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index dddc97cd4db6..650697df47af 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -254,10 +254,24 @@  const struct bench *bench = NULL;
 
 extern const struct bench bench_count_global;
 extern const struct bench bench_count_local;
+extern const struct bench bench_rename_base;
+extern const struct bench bench_rename_kprobe;
+extern const struct bench bench_rename_kretprobe;
+extern const struct bench bench_rename_rawtp;
+extern const struct bench bench_rename_fentry;
+extern const struct bench bench_rename_fexit;
+extern const struct bench bench_rename_fmodret;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
 	&bench_count_local,
+	&bench_rename_base,
+	&bench_rename_kprobe,
+	&bench_rename_kretprobe,
+	&bench_rename_rawtp,
+	&bench_rename_fentry,
+	&bench_rename_fexit,
+	&bench_rename_fmodret,
 };
 
 static void setup_benchmark()
diff --git a/tools/testing/selftests/bpf/benchs/bench_rename.c b/tools/testing/selftests/bpf/benchs/bench_rename.c
new file mode 100644
index 000000000000..e74cff40f4fe
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_rename.c
@@ -0,0 +1,195 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020 Facebook */
+#include <fcntl.h>
+#include "bench.h"
+#include "test_overhead.skel.h"
+
+/* BPF triggering benchmarks */
+static struct ctx {
+	struct test_overhead *skel;
+	struct counter hits;
+	int fd;
+} ctx;
+
+static void validate()
+{
+	if (env.producer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-producer!\n");
+		exit(1);
+	}
+	if (env.consumer_cnt != 1) {
+		fprintf(stderr, "benchmark doesn't support multi-consumer!\n");
+		exit(1);
+	}
+}
+
+static void *producer(void *input)
+{
+	char buf[] = "test_overhead";
+	int err;
+
+	while (true) {
+		err = write(ctx.fd, buf, sizeof(buf));
+		if (err < 0) {
+			fprintf(stderr, "write failed\n");
+			exit(1);
+		}
+		atomic_inc(&ctx.hits.value);
+	}
+}
+
+static void measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.hits.value, 0);
+}
+
+static void setup_ctx()
+{
+	setup_libbpf();
+
+	ctx.skel = test_overhead__open_and_load();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	ctx.fd = open("/proc/self/comm", O_WRONLY|O_TRUNC);
+	if (ctx.fd < 0) {
+		fprintf(stderr, "failed to open /proc/self/comm: %d\n", -errno);
+		exit(1);
+	}
+}
+
+static void attach_bpf(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+
+	link = bpf_program__attach(prog);
+	if (IS_ERR(link)) {
+		fprintf(stderr, "failed to attach program!\n");
+		exit(1);
+	}
+}
+
+static void setup_base()
+{
+	setup_ctx();
+}
+
+static void setup_kprobe()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog1);
+}
+
+static void setup_kretprobe()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog2);
+}
+
+static void setup_rawtp()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog3);
+}
+
+static void setup_fentry()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog4);
+}
+
+static void setup_fexit()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog5);
+}
+
+static void setup_fmodret()
+{
+	setup_ctx();
+	attach_bpf(ctx.skel->progs.prog6);
+}
+
+static void *consumer(void *input)
+{
+	return NULL;
+}
+
+const struct bench bench_rename_base = {
+	.name = "rename-base",
+	.validate = validate,
+	.setup = setup_base,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_kprobe = {
+	.name = "rename-kprobe",
+	.validate = validate,
+	.setup = setup_kprobe,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_kretprobe = {
+	.name = "rename-kretprobe",
+	.validate = validate,
+	.setup = setup_kretprobe,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_rawtp = {
+	.name = "rename-rawtp",
+	.validate = validate,
+	.setup = setup_rawtp,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_fentry = {
+	.name = "rename-fentry",
+	.validate = validate,
+	.setup = setup_fentry,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_fexit = {
+	.name = "rename-fexit",
+	.validate = validate,
+	.setup = setup_fexit,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
+
+const struct bench bench_rename_fmodret = {
+	.name = "rename-fmodret",
+	.validate = validate,
+	.setup = setup_fmodret,
+	.producer_thread = producer,
+	.consumer_thread = consumer,
+	.measure = measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_rename.sh b/tools/testing/selftests/bpf/benchs/run_bench_rename.sh
new file mode 100755
index 000000000000..16f774b1cdbe
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_rename.sh
@@ -0,0 +1,9 @@ 
+#!/bin/bash
+
+set -eufo pipefail
+
+for i in base kprobe kretprobe rawtp fentry fexit fmodret
+do
+	summary=$(sudo ./bench -w2 -d5 -a rename-$i | tail -n1 | cut -d'(' -f1 | cut -d' ' -f3-)
+	printf "%-10s: %s\n" $i "$summary"
+done
diff --git a/tools/testing/selftests/bpf/prog_tests/test_overhead.c b/tools/testing/selftests/bpf/prog_tests/test_overhead.c
index 465b371a561d..2702df2b2343 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_overhead.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_overhead.c
@@ -61,9 +61,10 @@  void test_test_overhead(void)
 	const char *raw_tp_name = "raw_tp/task_rename";
 	const char *fentry_name = "fentry/__set_task_comm";
 	const char *fexit_name = "fexit/__set_task_comm";
+	const char *fmodret_name = "fmod_ret/__set_task_comm";
 	const char *kprobe_func = "__set_task_comm";
 	struct bpf_program *kprobe_prog, *kretprobe_prog, *raw_tp_prog;
-	struct bpf_program *fentry_prog, *fexit_prog;
+	struct bpf_program *fentry_prog, *fexit_prog, *fmodret_prog;
 	struct bpf_object *obj;
 	struct bpf_link *link;
 	int err, duration = 0;
@@ -96,6 +97,10 @@  void test_test_overhead(void)
 	if (CHECK(!fexit_prog, "find_probe",
 		  "prog '%s' not found\n", fexit_name))
 		goto cleanup;
+	fmodret_prog = bpf_object__find_program_by_title(obj, fmodret_name);
+	if (CHECK(!fmodret_prog, "find_probe",
+		  "prog '%s' not found\n", fmodret_name))
+		goto cleanup;
 
 	err = bpf_object__load(obj);
 	if (CHECK(err, "obj_load", "err %d\n", err))
@@ -142,6 +147,13 @@  void test_test_overhead(void)
 		goto cleanup;
 	test_run("fexit");
 	bpf_link__destroy(link);
+
+	/* attach fmod_ret */
+	link = bpf_program__attach_trace(fmodret_prog);
+	if (CHECK(IS_ERR(link), "attach fmod_ret", "err %ld\n", PTR_ERR(link)))
+		goto cleanup;
+	test_run("fmod_ret");
+	bpf_link__destroy(link);
 cleanup:
 	prctl(PR_SET_NAME, comm, 0L, 0L, 0L);
 	bpf_object__close(obj);
diff --git a/tools/testing/selftests/bpf/progs/test_overhead.c b/tools/testing/selftests/bpf/progs/test_overhead.c
index 56a50b25cd33..450bf819beac 100644
--- a/tools/testing/selftests/bpf/progs/test_overhead.c
+++ b/tools/testing/selftests/bpf/progs/test_overhead.c
@@ -39,4 +39,10 @@  int BPF_PROG(prog5, struct task_struct *tsk, const char *buf, bool exec)
 	return !tsk;
 }
 
+SEC("fmod_ret/__set_task_comm")
+int BPF_PROG(prog6, struct task_struct *tsk, const char *buf, bool exec)
+{
+	return !tsk;
+}
+
 char _license[] SEC("license") = "GPL";