Message ID: 20181106205246.567448-6-songliubraving@fb.com
State: RFC, archived
Delegated to: BPF Maintainers
Series: reveal invisible bpf programs
On 11/6/18 1:52 PM, Song Liu wrote:
> +
>  static int record__mmap_read_all(struct record *rec)
>  {
>  	int err;
>
> +	err = record__mmap_process_vip_events(rec);
> +	if (err)
> +		return err;
> +
>  	err = record__mmap_read_evlist(rec, rec->evlist, false);
>  	if (err)
>  		return err;

Seems to me that is going to increase the overhead of perf on any system
doing BPF updates. The BPF events cause a wakeup every load and unload,
and perf processes not only the VIP events but then walks all of the
other maps.

> @@ -1686,6 +1734,8 @@ static struct option __record_options[] = {
>  		    "signal"),
>  	OPT_BOOLEAN(0, "dry-run", &dry_run,
>  		    "Parse options then exit"),
> +	OPT_BOOLEAN(0, "no-bpf-event", &record.no_bpf_event,
> +		    "do not record event on bpf program load/unload"),

Why should this default on? If I am recording FIB events, I don't care
about BPF events.
> On Nov 6, 2018, at 1:54 PM, David Ahern <dsahern@gmail.com> wrote:
>
> On 11/6/18 1:52 PM, Song Liu wrote:
>> +
>>  static int record__mmap_read_all(struct record *rec)
>>  {
>>  	int err;
>>
>> +	err = record__mmap_process_vip_events(rec);
>> +	if (err)
>> +		return err;
>> +
>>  	err = record__mmap_read_evlist(rec, rec->evlist, false);
>>  	if (err)
>>  		return err;
>
> Seems to me that is going to increase the overhead of perf on any system
> doing BPF updates. The BPF events cause a wakeup every load and unload,
> and perf processes not only the VIP events but then walks all of the
> other maps.

BPF prog load/unload events should be rare events in real world use cases,
so I think the overhead is OK. Also, I don't see an easy way to improve
this.

>
>> @@ -1686,6 +1734,8 @@ static struct option __record_options[] = {
>>  		    "signal"),
>>  	OPT_BOOLEAN(0, "dry-run", &dry_run,
>>  		    "Parse options then exit"),
>> +	OPT_BOOLEAN(0, "no-bpf-event", &record.no_bpf_event,
>> +		    "do not record event on bpf program load/unload"),
>
> Why should this default on? If I am recording FIB events, I don't care
> about BPF events.
>

I am OK with default off if that's the preferred way.

Thanks,
Song
On 11/6/18 3:17 PM, Song Liu wrote:
>
>> On Nov 6, 2018, at 1:54 PM, David Ahern <dsahern@gmail.com> wrote:
>>
>> On 11/6/18 1:52 PM, Song Liu wrote:
>>> +
>>>  static int record__mmap_read_all(struct record *rec)
>>>  {
>>>  	int err;
>>>
>>> +	err = record__mmap_process_vip_events(rec);
>>> +	if (err)
>>> +		return err;
>>> +
>>>  	err = record__mmap_read_evlist(rec, rec->evlist, false);
>>>  	if (err)
>>>  		return err;
>>
>> Seems to me that is going to increase the overhead of perf on any system
>> doing BPF updates. The BPF events cause a wakeup every load and unload,
>> and perf processes not only the VIP events but then walks all of the
>> other maps.
>
> BPF prog load/unload events should be rare events in real world use cases,
> so I think the overhead is OK. Also, I don't see an easy way to improve
> this.
>
>>
>>> @@ -1686,6 +1734,8 @@ static struct option __record_options[] = {
>>>  		    "signal"),
>>>  	OPT_BOOLEAN(0, "dry-run", &dry_run,
>>>  		    "Parse options then exit"),
>>> +	OPT_BOOLEAN(0, "no-bpf-event", &record.no_bpf_event,
>>> +		    "do not record event on bpf program load/unload"),
>>
>> Why should this default on? If I am recording FIB events, I don't care
>> about BPF events.
>>
>
> I am OK with default off if that's the preferred way.

I think concerns with perf overhead from collecting bpf events
are unfounded.
I would prefer for this flag to be on by default.
From: Alexei Starovoitov <ast@fb.com>
Date: Tue, 6 Nov 2018 23:29:07 +0000

> I think concerns with perf overhead from collecting bpf events
> are unfounded.
> I would prefer for this flag to be on by default.

I will sit in userspace looping over bpf load/unload and see how the
person trying to monitor something else with perf feels about that.

Really, it is inappropriate to turn this on by default; I completely
agree with David Ahern.

It's hard enough, _AS IS_, for me to fight back all of the bloat that
is in perf right now and get it back to being able to handle simple
full workloads without dropping events. Every new event type like this
sets us back.

If people want to monitor new things, or have new functionality, fine.
But not by default, please.
On 11/6/18 3:36 PM, David Miller wrote:
> From: Alexei Starovoitov <ast@fb.com>
> Date: Tue, 6 Nov 2018 23:29:07 +0000
>
>> I think concerns with perf overhead from collecting bpf events
>> are unfounded.
>> I would prefer for this flag to be on by default.
>
> I will sit in userspace looping over bpf load/unload and see how the
> person trying to monitor something else with perf feels about that.
>
> Really, it is inappropriate to turn this on by default; I completely
> agree with David Ahern.
>
> It's hard enough, _AS IS_, for me to fight back all of the bloat that
> is in perf right now and get it back to being able to handle simple
> full workloads without dropping events.

It's a separate perf thread and separate event with its own epoll.
I don't see how it can affect the main event collection.
Let's put it this way: if it does affect it somehow, then yes, it
should not be on. If it does not, there is no downside to keeping it on.
A typical user expects to type 'perf record' and see everything that
is happening on the system. Right now short-lived bpf programs
will not be seen. How is the user supposed to even know when to use the flag?
The only option is to always pass the flag 'just in case',
which is an unnecessary burden.

The problem of dropped events is certainly valid, but it's a separate
issue. The aio work that Alexey Budankov is doing is supposed to
address that.
On 11/6/18 5:13 PM, Alexei Starovoitov wrote:
> On 11/6/18 3:36 PM, David Miller wrote:
>> From: Alexei Starovoitov <ast@fb.com>
>> Date: Tue, 6 Nov 2018 23:29:07 +0000
>>
>>> I think concerns with perf overhead from collecting bpf events
>>> are unfounded.
>>> I would prefer for this flag to be on by default.
>>
>> I will sit in userspace looping over bpf load/unload and see how the
>> person trying to monitor something else with perf feels about that.
>>
>> Really, it is inappropriate to turn this on by default; I completely
>> agree with David Ahern.
>>
>> It's hard enough, _AS IS_, for me to fight back all of the bloat that
>> is in perf right now and get it back to being able to handle simple
>> full workloads without dropping events.
>
> It's a separate perf thread and separate event with its own epoll.
> I don't see how it can affect the main event collection.
> Let's put it this way: if it does affect it somehow, then yes, it
> should not be on. If it does not, there is no downside to keeping it on.
> A typical user expects to type 'perf record' and see everything that
> is happening on the system. Right now short-lived bpf programs
> will not be seen. How is the user supposed to even know when to use the flag?

The default is profiling, where perf record collects task events and
periodic samples. So for the default record/report, the bpf load /
unload events are not relevant.

> The only option is to always pass the flag 'just in case',
> which is an unnecessary burden.

People who care about an event enable collection of that event.
On 11/6/18 4:23 PM, David Ahern wrote:
> On 11/6/18 5:13 PM, Alexei Starovoitov wrote:
>> On 11/6/18 3:36 PM, David Miller wrote:
>>> From: Alexei Starovoitov <ast@fb.com>
>>> Date: Tue, 6 Nov 2018 23:29:07 +0000
>>>
>>>> I think concerns with perf overhead from collecting bpf events
>>>> are unfounded.
>>>> I would prefer for this flag to be on by default.
>>>
>>> I will sit in userspace looping over bpf load/unload and see how the
>>> person trying to monitor something else with perf feels about that.
>>>
>>> Really, it is inappropriate to turn this on by default; I completely
>>> agree with David Ahern.
>>>
>>> It's hard enough, _AS IS_, for me to fight back all of the bloat that
>>> is in perf right now and get it back to being able to handle simple
>>> full workloads without dropping events.
>>
>> It's a separate perf thread and separate event with its own epoll.
>> I don't see how it can affect the main event collection.
>> Let's put it this way: if it does affect it somehow, then yes, it
>> should not be on. If it does not, there is no downside to keeping it on.
>> A typical user expects to type 'perf record' and see everything that
>> is happening on the system. Right now short-lived bpf programs
>> will not be seen. How is the user supposed to even know when to use the flag?
>
> The default is profiling, where perf record collects task events and
> periodic samples. So for the default record/report, the bpf load /
> unload events are not relevant.

Exactly the opposite: it's for the default 'perf record' collection of
periodic samples. It can be off for -e collection. That's easy.
On 11/6/18 5:26 PM, Alexei Starovoitov wrote:
> On 11/6/18 4:23 PM, David Ahern wrote:
>> On 11/6/18 5:13 PM, Alexei Starovoitov wrote:
>>> On 11/6/18 3:36 PM, David Miller wrote:
>>>> From: Alexei Starovoitov <ast@fb.com>
>>>> Date: Tue, 6 Nov 2018 23:29:07 +0000
>>>>
>>>>> I think concerns with perf overhead from collecting bpf events
>>>>> are unfounded.
>>>>> I would prefer for this flag to be on by default.
>>>>
>>>> I will sit in userspace looping over bpf load/unload and see how the
>>>> person trying to monitor something else with perf feels about that.
>>>>
>>>> Really, it is inappropriate to turn this on by default; I completely
>>>> agree with David Ahern.
>>>>
>>>> It's hard enough, _AS IS_, for me to fight back all of the bloat that
>>>> is in perf right now and get it back to being able to handle simple
>>>> full workloads without dropping events.
>>>
>>> It's a separate perf thread and separate event with its own epoll.
>>> I don't see how it can affect the main event collection.
>>> Let's put it this way: if it does affect it somehow, then yes, it
>>> should not be on. If it does not, there is no downside to keeping it on.
>>> A typical user expects to type 'perf record' and see everything that
>>> is happening on the system. Right now short-lived bpf programs
>>> will not be seen. How is the user supposed to even know when to use the flag?
>>
>> The default is profiling, where perf record collects task events and
>> periodic samples. So for the default record/report, the bpf load /
>> unload events are not relevant.
>
> Exactly the opposite: it's for the default 'perf record' collection of
> periodic samples. It can be off for -e collection. That's easy.

So one use case is profiling bpf programs. I was also considering the
auditing discussion from some weeks ago, which I thought these events
were also targeting.

As for the overhead: I did not see a separate thread getting spun off
for the bpf events, so the events are processed inline in this RFC set.
On 11/6/18 4:44 PM, David Ahern wrote:
>
> So one use case is profiling bpf programs. I was also considering the
> auditing discussion from some weeks ago, which I thought these events
> were also targeting.

Yes. There should be a separate mode for the audit discussion where
only bpf events are collected. This patch set doesn't add that on the
perf user space side.
The kernel side is common, though. It can be used for bpf load/unload
only, and for the different use case in this set, which is making
bpf programs appear in a normal 'perf report'. Please see the link in
the cover letter. We decided to abandon my old approach in favor of
this one, but the end result is the same: turning 0.81% cpu of some
magic 0x00007fffa001a660 into 18.13% of
bpf_prog_1accc788e7f04c38_balancer_ingres.

> As for the overhead: I did not see a separate thread getting spun off
> for the bpf events, so the events are processed inline in this RFC set.

Argh, you're right. We have to fix that.
On 11/6/18 6:09 PM, Alexei Starovoitov wrote:
> On 11/6/18 4:44 PM, David Ahern wrote:
>>
>> So one use case is profiling bpf programs. I was also considering the
>> auditing discussion from some weeks ago, which I thought these events
>> were also targeting.
>
> Yes. There should be a separate mode for the audit discussion where
> only bpf events are collected. This patch set doesn't add that on the
> perf user space side.
> The kernel side is common, though. It can be used for bpf load/unload
> only, and for the different use case in this set, which is making
> bpf programs appear in a normal 'perf report'.

It would be good for perf-script to work in the next version. Users
should be able to dump the events from the kernel and the synthesized
events. It should be trivial to add, and it allows a review of both
perspectives -- profiling and monitoring events.
> On Nov 7, 2018, at 10:12 AM, David Ahern <dsahern@gmail.com> wrote:
>
> On 11/6/18 6:09 PM, Alexei Starovoitov wrote:
>> On 11/6/18 4:44 PM, David Ahern wrote:
>>>
>>> So one use case is profiling bpf programs. I was also considering the
>>> auditing discussion from some weeks ago, which I thought these events
>>> were also targeting.
>>
>> Yes. There should be a separate mode for the audit discussion where
>> only bpf events are collected. This patch set doesn't add that on the
>> perf user space side.
>> The kernel side is common, though. It can be used for bpf load/unload
>> only, and for the different use case in this set, which is making
>> bpf programs appear in a normal 'perf report'.
>
> It would be good for perf-script to work in the next version. Users
> should be able to dump the events from the kernel and the synthesized
> events. It should be trivial to add, and it allows a review of both
> perspectives -- profiling and monitoring events.

Yes, we plan to include this information in other perf commands,
including perf-script, perf-top, perf-annotate, etc.

Thanks,
Song
This patch enables perf-record to listen to bpf_event and to generate a
bpf_prog_info_event for bpf programs loaded and unloaded while
perf-record runs.

To minimize the latency between a bpf_event and the following bpf
calls, a separate mmap with a watermark of 1 is created to process
these VIP events, and a separate dummy event is attached to this
special mmap.

By default, perf-record listens to bpf_event. The option --no-bpf-event
is added in case the user wants to opt out.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 tools/perf/builtin-record.c | 50 +++++++++++++++++++++++++++++++++++++
 tools/perf/util/evlist.c    | 42 ++++++++++++++++++++++++++++---
 tools/perf/util/evlist.h    |  4 +++
 tools/perf/util/evsel.c     |  8 ++++++
 tools/perf/util/evsel.h     |  3 +++
 5 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 73b02bde1ebc..1036a64eb9f7 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -80,6 +80,7 @@ struct record {
 	bool			buildid_all;
 	bool			timestamp_filename;
 	bool			timestamp_boundary;
+	bool			no_bpf_event;
 	struct switch_output	switch_output;
 	unsigned long long	samples;
 };
@@ -381,6 +382,8 @@ static int record__open(struct record *rec)
 		pos->tracking = 1;
 		pos->attr.enable_on_exec = 1;
 	}
+	if (!rec->no_bpf_event)
+		perf_evlist__add_bpf_tracker(evlist);

 	perf_evlist__config(evlist, opts, &callchain_param);
@@ -562,10 +565,55 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 	return rc;
 }

+static int record__mmap_process_vip_events(struct record *rec)
+{
+	int i;
+
+	for (i = 0; i < rec->evlist->nr_mmaps; i++) {
+		struct perf_mmap *map = &rec->evlist->vip_mmap[i];
+		union perf_event *event;
+
+		perf_mmap__read_init(map);
+		while ((event = perf_mmap__read_event(map)) != NULL) {
+			pr_debug("processing vip event of type %d\n",
+				 event->header.type);
+			switch (event->header.type) {
+			case PERF_RECORD_BPF_EVENT:
+				switch (event->bpf_event.type) {
+				case PERF_BPF_EVENT_PROG_LOAD:
+					perf_event__synthesize_one_bpf_prog_info(
+						&rec->tool,
+						process_synthesized_event,
+						&rec->session->machines.host,
+						event->bpf_event.id);
+					/* fall through */
+				case PERF_BPF_EVENT_PROG_UNLOAD:
+					record__write(rec, NULL, event,
+						      event->header.size);
+					break;
+				default:
+					break;
+				}
+				break;
+			default:
+				break;
+			}
+			perf_mmap__consume(map);
+		}
+		perf_mmap__read_done(map);
+	}
+
+	return 0;
+}
+
 static int record__mmap_read_all(struct record *rec)
 {
 	int err;

+	err = record__mmap_process_vip_events(rec);
+	if (err)
+		return err;
+
 	err = record__mmap_read_evlist(rec, rec->evlist, false);
 	if (err)
 		return err;
@@ -1686,6 +1734,8 @@ static struct option __record_options[] = {
 		    "signal"),
 	OPT_BOOLEAN(0, "dry-run", &dry_run,
 		    "Parse options then exit"),
+	OPT_BOOLEAN(0, "no-bpf-event", &record.no_bpf_event,
+		    "do not record event on bpf program load/unload"),
 	OPT_END()
 };
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index be440df29615..466a9f7b1e93 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -45,6 +45,7 @@ void perf_evlist__init(struct perf_evlist *evlist, struct cpu_map *cpus,
 	for (i = 0; i < PERF_EVLIST__HLIST_SIZE; ++i)
 		INIT_HLIST_HEAD(&evlist->heads[i]);
 	INIT_LIST_HEAD(&evlist->entries);
+	INIT_LIST_HEAD(&evlist->vip_entries);
 	perf_evlist__set_maps(evlist, cpus, threads);
 	fdarray__init(&evlist->pollfd, 64);
 	evlist->workload.pid = -1;
@@ -177,6 +178,8 @@ void perf_evlist__add(struct perf_evlist *evlist, struct perf_evsel *entry)
 {
 	entry->evlist = evlist;
 	list_add_tail(&entry->node, &evlist->entries);
+	if (entry->vip)
+		list_add_tail(&entry->vip_node, &evlist->vip_entries);
 	entry->idx = evlist->nr_entries;
 	entry->tracking = !entry->idx;

@@ -267,6 +270,27 @@ int perf_evlist__add_dummy(struct perf_evlist *evlist)
 	return 0;
 }

+int perf_evlist__add_bpf_tracker(struct perf_evlist *evlist)
+{
+	struct perf_event_attr attr = {
+		.type             = PERF_TYPE_SOFTWARE,
+		.config           = PERF_COUNT_SW_DUMMY,
+		.watermark        = 1,
+		.bpf_event        = 1,
+		.wakeup_watermark = 1,
+		.size             = sizeof(attr), /* to capture ABI version */
+	};
+	struct perf_evsel *evsel = perf_evsel__new_idx(&attr,
+						       evlist->nr_entries);
+
+	if (evsel == NULL)
+		return -ENOMEM;
+
+	evsel->vip = true;
+	perf_evlist__add(evlist, evsel);
+	return 0;
+}
+
 static int perf_evlist__add_attrs(struct perf_evlist *evlist,
 				  struct perf_event_attr *attrs, size_t nr_attrs)
 {
@@ -770,6 +794,7 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 	int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);

 	evlist__for_each_entry(evlist, evsel) {
+		struct perf_mmap *vip_maps = evlist->vip_mmap;
 		struct perf_mmap *maps = evlist->mmap;
 		int *output = _output;
 		int fd;
@@ -800,7 +825,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,

 		fd = FD(evsel, cpu, thread);

-		if (*output == -1) {
+		if (evsel->vip) {
+			if (perf_mmap__mmap(&vip_maps[idx], mp,
+					    fd, evlist_cpu) < 0)
+				return -1;
+		} else if (*output == -1) {
 			*output = fd;

 			if (perf_mmap__mmap(&maps[idx], mp, *output,
 					    evlist_cpu) < 0)
@@ -822,8 +851,12 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 		 * Therefore don't add it for polling.
 		 */
 		if (!evsel->system_wide &&
-		    __perf_evlist__add_pollfd(evlist, fd, &maps[idx], revent) < 0) {
-			perf_mmap__put(&maps[idx]);
+		    __perf_evlist__add_pollfd(
+			    evlist, fd,
+			    evsel->vip ? &vip_maps[idx] : &maps[idx],
+			    revent) < 0) {
+			perf_mmap__put(evsel->vip ?
+				       &vip_maps[idx] : &maps[idx]);
 			return -1;
 		}

@@ -1035,6 +1068,9 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 	if (!evlist->mmap)
 		return -ENOMEM;

+	if (!evlist->vip_mmap)
+		evlist->vip_mmap = perf_evlist__alloc_mmap(evlist, false);
+
 	if (evlist->pollfd.entries == NULL && perf_evlist__alloc_pollfd(evlist) < 0)
 		return -ENOMEM;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index dc66436add98..6d99e8dab570 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -26,6 +26,7 @@ struct record_opts;

 struct perf_evlist {
 	struct list_head entries;
+	struct list_head vip_entries;
 	struct hlist_head heads[PERF_EVLIST__HLIST_SIZE];
 	int		 nr_entries;
 	int		 nr_groups;
@@ -43,6 +44,7 @@ struct perf_evlist {
 	} workload;
 	struct fdarray pollfd;
 	struct perf_mmap *mmap;
+	struct perf_mmap *vip_mmap;
 	struct perf_mmap *overwrite_mmap;
 	struct thread_map *threads;
 	struct cpu_map *cpus;
@@ -84,6 +86,8 @@ int __perf_evlist__add_default_attrs(struct perf_evlist *evlist,

 int perf_evlist__add_dummy(struct perf_evlist *evlist);

+int perf_evlist__add_bpf_tracker(struct perf_evlist *evlist);
+
 int perf_evlist__add_newtp(struct perf_evlist *evlist,
 			   const char *sys, const char *name, void *handler);
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index af9d539e4b6a..94456a493607 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -235,6 +235,7 @@ void perf_evsel__init(struct perf_evsel *evsel,
 	evsel->evlist	   = NULL;
 	evsel->bpf_fd	   = -1;
 	INIT_LIST_HEAD(&evsel->node);
+	INIT_LIST_HEAD(&evsel->vip_node);
 	INIT_LIST_HEAD(&evsel->config_terms);
 	perf_evsel__object.init(evsel);
 	evsel->sample_size = __perf_evsel__sample_size(attr->sample_type);
@@ -1795,6 +1796,8 @@ int perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
 				     PERF_SAMPLE_BRANCH_NO_CYCLES);
 	if (perf_missing_features.group_read && evsel->attr.inherit)
 		evsel->attr.read_format &= ~(PERF_FORMAT_GROUP|PERF_FORMAT_ID);
+	if (perf_missing_features.bpf_event)
+		evsel->attr.bpf_event = 0;
 retry_sample_id:
 	if (perf_missing_features.sample_id_all)
 		evsel->attr.sample_id_all = 0;
@@ -1939,6 +1942,11 @@ int perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
 		perf_missing_features.exclude_guest = true;
 		pr_debug2("switching off exclude_guest, exclude_host\n");
 		goto fallback_missing_features;
+	} else if (!perf_missing_features.bpf_event &&
+		   evsel->attr.bpf_event) {
+		perf_missing_features.bpf_event = true;
+		pr_debug2("switching off bpf_event\n");
+		goto fallback_missing_features;
 	} else if (!perf_missing_features.sample_id_all) {
 		perf_missing_features.sample_id_all = true;
 		pr_debug2("switching off sample_id_all\n");
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 4107c39f4a54..82b1d3e42603 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -89,6 +89,7 @@ struct perf_stat_evsel;
  */
 struct perf_evsel {
 	struct list_head	node;
+	struct list_head	vip_node;
 	struct perf_evlist	*evlist;
 	struct perf_event_attr	attr;
 	char			*filter;
@@ -128,6 +129,7 @@ struct perf_evsel {
 	bool			ignore_missing_thread;
 	bool			forced_leader;
 	bool			use_uncore_alias;
+	bool			vip; /* vip events have their own mmap */
 	/* parse modifier helper */
 	int			exclude_GH;
 	int			nr_members;
@@ -163,6 +165,7 @@ struct perf_missing_features {
 	bool lbr_flags;
 	bool write_backward;
 	bool group_read;
+	bool bpf_event;
 };

 extern struct perf_missing_features perf_missing_features;