Message ID:   f6ffffc1011706426682be92f80fc6aa6a8cc196.1468362046.git.daniel@iogearbox.net
State:        Changes Requested, archived
Delegated to: David Miller
On Wed, Jul 13, 2016 at 12:36:17AM +0200, Daniel Borkmann wrote:
> This patch adds support for non-linear data on raw records. It means
> that for such data, the newly introduced __output_custom() helper will
> be used instead of __output_copy(). __output_custom() will invoke
> whatever custom callback is passed in via struct perf_raw_record_frag
> to extract the data into the ring buffer slot.
>
> To keep changes in perf_prepare_sample() and in perf_output_sample()
> minimal, size/size_head split was added to perf_raw_record that call
> sites fill out, so that two extra tests in fast-path can be avoided.
>
> The few users of raw records are adapted to initialize their size_head
> and frag data; no change in behavior for them. Later patch will extend
> BPF side with a first user and callback for this facility, future users
> could be things like XDP BPF programs (that work on different context
> though and would thus have a different callback), etc.

Why? What problem are we solving?
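[Editorial note: the size/size_head split and frag callback described in the quoted changelog can be sketched in plain userspace C. Everything below (raw_record, output_raw, memcpy_cb, ...) is a simplified, hypothetical stand-in for the kernel structures, purely for illustration: the output path copies the linear head as before and hands the non-linear remainder to the record's callback.]

```c
#include <assert.h>
#include <string.h>

typedef unsigned long (*copy_cb_t)(void *dst, const void *src,
				   unsigned long n);

/* Stand-in for struct perf_raw_record_frag. */
struct raw_frag {
	void *data;
	copy_cb_t copy_cb;
};

/* Stand-in for struct perf_raw_record with the size/size_head split. */
struct raw_record {
	unsigned int size;       /* total payload size */
	unsigned int size_head;  /* linear head; equals size when no frag */
	void *data;
	struct raw_frag *frag;   /* NULL for purely linear records */
};

static unsigned long memcpy_cb(void *dst, const void *src, unsigned long n)
{
	memcpy(dst, src, n);
	return n;
}

/* Mirrors the perf_output_sample() change: the head goes out via a plain
 * copy; the non-linear remainder, if any, via the record's callback. */
static void output_raw(char *slot, const struct raw_record *raw)
{
	memcpy(slot, raw->data, raw->size_head);
	if (raw->frag)
		raw->frag->copy_cb(slot + raw->size_head, raw->frag->data,
				   raw->size - raw->size_head);
}
```

Because size_head equals size for the existing callers, their behavior is unchanged; only records that set a frag take the callback path.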
Hi Peter,

On 07/13/2016 09:52 AM, Peter Zijlstra wrote:
> On Wed, Jul 13, 2016 at 12:36:17AM +0200, Daniel Borkmann wrote:
>> This patch adds support for non-linear data on raw records. It means
>> that for such data, the newly introduced __output_custom() helper will
>> be used instead of __output_copy(). __output_custom() will invoke
>> whatever custom callback is passed in via struct perf_raw_record_frag
>> to extract the data into the ring buffer slot.
>>
>> To keep changes in perf_prepare_sample() and in perf_output_sample()
>> minimal, size/size_head split was added to perf_raw_record that call
>> sites fill out, so that two extra tests in fast-path can be avoided.
>>
>> The few users of raw records are adapted to initialize their size_head
>> and frag data; no change in behavior for them. Later patch will extend
>> BPF side with a first user and callback for this facility, future users
>> could be things like XDP BPF programs (that work on different context
>> though and would thus have a different callback), etc.
>
> Why? What problem are we solving?

I've tried to summarize it in patch 3/3; there are a number of
improvements this set would provide. The current BPF_FUNC_perf_event_output
helper that can be used from tc/networking programs has the limitation
that if data from the skb should be part of the sample, that data first
needs to be copied to the eBPF stack with the bpf_skb_load_bytes()
helper, and only then passed into bpf_event_output(). This currently has
3 issues we'd like to resolve:

i) We need two copies instead of just a single one for the skb data.
The data can be non-linear; see skb_copy_bits() as an example for
walking/extracting it.

ii) For static verification reasons, the bpf_skb_load_bytes() helper
needs to see a constant size on the passed buffer so that the BPF
verifier can do its sanity checks on it at verification time. Just
passing in skb->len (or any other non-constant value) wouldn't work.
Changing bpf_skb_load_bytes() is not the real solution either, since we
would still have the two copies we'd like to avoid.

iii) bpf_skb_load_bytes() is only meant for rather small buffers (e.g.
headers), since they need to sit on the limited eBPF stack anyway.

The set would improve the BPF helper to address all 3 at once.

Thanks,
Daniel
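[Editorial note: the two-copy pattern Daniel describes, versus the single-copy callback approach the set proposes, can be modeled in a small userspace C sketch. No kernel or BPF APIs are involved; counted_copy, output_via_stack, and output_via_callback are invented names for illustration.]

```c
#include <assert.h>
#include <string.h>

static int copies; /* counts every copy so the two paths can be compared */

static unsigned long counted_copy(void *dst, const void *src, unsigned long n)
{
	memcpy(dst, src, n);
	copies++;
	return n;
}

/* Current pattern: bounce the data through a fixed-size buffer (standing
 * in for the eBPF stack) before it reaches the ring buffer. */
static void output_via_stack(char *ring, const char *pkt, unsigned long n)
{
	char stack_buf[64];	/* constant size, as the verifier demands */

	assert(n <= sizeof(stack_buf));
	counted_copy(stack_buf, pkt, n);  /* 1st copy: bpf_skb_load_bytes()-like */
	counted_copy(ring, stack_buf, n); /* 2nd copy: into the ring buffer */
}

typedef unsigned long (*copy_cb_t)(void *dst, const void *src,
				   unsigned long n);

/* Proposed pattern: the output path invokes a callback that extracts the
 * data straight into the ring buffer slot; one copy, no stack bounce,
 * no fixed-size buffer limiting the payload. */
static void output_via_callback(char *ring, const char *pkt, unsigned long n,
				copy_cb_t cb)
{
	cb(ring, pkt, n);
}
```

The stack bounce also caps the payload at the buffer size (issue iii above); the callback variant has no such limit, since no intermediate buffer exists.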
On Wed, Jul 13, 2016 at 11:24:13AM +0200, Daniel Borkmann wrote:
> Hi Peter,
>
> On 07/13/2016 09:52 AM, Peter Zijlstra wrote:
> >On Wed, Jul 13, 2016 at 12:36:17AM +0200, Daniel Borkmann wrote:
> >>This patch adds support for non-linear data on raw records. It means
> >>that for such data, the newly introduced __output_custom() helper will
> >>be used instead of __output_copy(). __output_custom() will invoke
> >>whatever custom callback is passed in via struct perf_raw_record_frag
> >>to extract the data into the ring buffer slot.
> >>
> >>To keep changes in perf_prepare_sample() and in perf_output_sample()
> >>minimal, size/size_head split was added to perf_raw_record that call
> >>sites fill out, so that two extra tests in fast-path can be avoided.
> >>
> >>The few users of raw records are adapted to initialize their size_head
> >>and frag data; no change in behavior for them. Later patch will extend
> >>BPF side with a first user and callback for this facility, future users
> >>could be things like XDP BPF programs (that work on different context
> >>though and would thus have a different callback), etc.
> >
> >Why? What problem are we solving?
>
> I've tried to summarize it in patch 3/3,

Which is pretty useless if you're staring at this patch.

> This currently has 3 issues we'd like to resolve:
>
> i) We need two copies instead of just a single one for the skb data.
> The data can be non-linear, see also skb_copy_bits() as an example for
> walking/extracting it,

I'm not familiar enough with the network gunk to be able to read that.
But upto skb_walk_frags() it looks entirely linear to me.

> ii) for static verification reasons, the bpf_skb_load_bytes() helper
> needs to see a constant size on the passed buffer to make sure BPF
> verifier can do its sanity checks on it during verification time, so
> just passing in skb->len (or any other non-constant value) wouldn't
> work, but changing bpf_skb_load_bytes() is also not the real solution
> since we still have two copies we'd like to avoid as well, and
>
> iii) bpf_skb_load_bytes() is just for rather smaller buffers (e.g.
> headers) since they need to sit on the limited eBPF stack anyway. The
> set would improve the BPF helper to address all 3 at once.

Humm, maybe. Lemme go try and reverse engineer that patch, because I'm
not at all sure wth it's supposed to do, nor am I entirely sure this
clarified things :/
On 07/13/2016 02:10 PM, Peter Zijlstra wrote:
> On Wed, Jul 13, 2016 at 11:24:13AM +0200, Daniel Borkmann wrote:
>> On 07/13/2016 09:52 AM, Peter Zijlstra wrote:
>>> On Wed, Jul 13, 2016 at 12:36:17AM +0200, Daniel Borkmann wrote:
>>>> This patch adds support for non-linear data on raw records. It means
>>>> that for such data, the newly introduced __output_custom() helper will
>>>> be used instead of __output_copy(). __output_custom() will invoke
>>>> whatever custom callback is passed in via struct perf_raw_record_frag
>>>> to extract the data into the ring buffer slot.
>>>>
>>>> To keep changes in perf_prepare_sample() and in perf_output_sample()
>>>> minimal, size/size_head split was added to perf_raw_record that call
>>>> sites fill out, so that two extra tests in fast-path can be avoided.
>>>>
>>>> The few users of raw records are adapted to initialize their size_head
>>>> and frag data; no change in behavior for them. Later patch will extend
>>>> BPF side with a first user and callback for this facility, future users
>>>> could be things like XDP BPF programs (that work on different context
>>>> though and would thus have a different callback), etc.
>>>
>>> Why? What problem are we solving?
>>
>> I've tried to summarize it in patch 3/3,
>
> Which is pretty useless if you're staring at this patch.
>
>> This currently has 3 issues we'd like to resolve:
>
>> i) We need two copies instead of just a single one for the skb data.
>> The data can be non-linear, see also skb_copy_bits() as an example for
>> walking/extracting it,
>
> I'm not familiar enough with the network gunk to be able to read that.
> But upto skb_walk_frags() it looks entirely linear to me.

Hm, fair enough. There are three parts: an skb has a linear part that is
accessed via skb->data, which may hold the data in its entirety.
Appended to that there can be a non-linear part consisting of pages that
sit in the shared info section (skb_shinfo(skb)->frags[], with nr_frags
members). And in addition, chained after the frags[] data, there can be
further skbs hanging off the 'root' skb that contain fragmented data.
skb_copy_bits() copies all of this linearized into the 'to' buffer. So
depending on the origin of the skb, its structure can be quite
different, and skb_copy_bits() covers all the cases generically. Maybe
[1] summarizes it better if you want to familiarize yourself with how
skbs work, although some parts are not up to date anymore.

[1] http://vger.kernel.org/~davem/skb_data.html

>> ii) for static verification reasons, the bpf_skb_load_bytes() helper
>> needs to see a constant size on the passed buffer to make sure BPF
>> verifier can do its sanity checks on it during verification time, so
>> just passing in skb->len (or any other non-constant value) wouldn't
>> work, but changing bpf_skb_load_bytes() is also not the real solution
>> since we still have two copies we'd like to avoid as well, and
>
>> iii) bpf_skb_load_bytes() is just for rather smaller buffers (e.g.
>> headers) since they need to sit on the limited eBPF stack anyway. The
>> set would improve the BPF helper to address all 3 at once.
>
> Humm, maybe. Lemme go try and reverse engineer that patch, because I'm
> not at all sure wth it's supposed to do, nor am I entirely sure this
> clarified things :/
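[Editorial note: the three-part layout Daniel describes can be modeled with a toy structure. toy_skb, toy_frag, and toy_copy_bits are invented for illustration and share only the shape, not the fields, of the real struct sk_buff; the walker linearizes head, then frags[], then the chained skbs, the way skb_copy_bits() does.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_frag {
	const char *data;
	size_t len;
};

struct toy_skb {
	const char *head;          /* linear part, like skb->data */
	size_t head_len;
	struct toy_frag frags[4];  /* like skb_shinfo(skb)->frags[] */
	int nr_frags;
	struct toy_skb *frag_list; /* chained skbs with further data */
};

/* Copy the whole (possibly non-linear) payload linearized into 'to':
 * first the linear head, then the paged frags, then recurse into the
 * frag_list chain. Returns the number of bytes copied. */
static size_t toy_copy_bits(const struct toy_skb *skb, char *to)
{
	size_t off = 0;
	int i;

	memcpy(to + off, skb->head, skb->head_len);
	off += skb->head_len;

	for (i = 0; i < skb->nr_frags; i++) {
		memcpy(to + off, skb->frags[i].data, skb->frags[i].len);
		off += skb->frags[i].len;
	}

	if (skb->frag_list)
		off += toy_copy_bits(skb->frag_list, to + off);

	return off;
}
```

A purely linear skb is the degenerate case (nr_frags == 0, frag_list == NULL), which is why code that only looks at the head can appear "entirely linear".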
diff --git a/arch/s390/kernel/perf_cpum_sf.c b/arch/s390/kernel/perf_cpum_sf.c
index a8e8321..99c5952 100644
--- a/arch/s390/kernel/perf_cpum_sf.c
+++ b/arch/s390/kernel/perf_cpum_sf.c
@@ -984,7 +984,9 @@ static int perf_push_sample(struct perf_event *event, struct sf_raw_sample *sfr)
 	/* Setup perf sample */
 	perf_sample_data_init(&data, 0, event->hw.last_period);
 	raw.size = sfr->size;
+	raw.size_head = raw.size;
 	raw.data = sfr;
+	raw.frag = NULL;
 	data.raw = &raw;
 
 	/* Setup pt_regs to look like an CPU-measurement external interrupt
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index feb90f6..9b27dff 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -656,7 +656,9 @@ fail:
 
 	if (event->attr.sample_type & PERF_SAMPLE_RAW) {
 		raw.size = sizeof(u32) + ibs_data.size;
+		raw.size_head = raw.size;
 		raw.data = ibs_data.data;
+		raw.frag = NULL;
 		data.raw = &raw;
 	}
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1a827ce..bf08bdf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -69,9 +69,17 @@ struct perf_callchain_entry_ctx {
 	bool contexts_maxed;
 };
 
+struct perf_raw_record_frag {
+	void *data;
+	unsigned long (*copy_cb) (void *dst, const void *src,
+				  unsigned long n);
+};
+
 struct perf_raw_record {
 	u32 size;
+	u32 size_head;
 	void *data;
+	struct perf_raw_record_frag *frag;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9c51ec3..3e1dd7a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5553,14 +5553,20 @@ void perf_output_sample(struct perf_output_handle *handle,
 	}
 
 	if (sample_type & PERF_SAMPLE_RAW) {
-		if (data->raw) {
-			u32 raw_size = data->raw->size;
+		struct perf_raw_record *raw = data->raw;
+
+		if (raw) {
+			u32 raw_size = raw->size;
 			u32 real_size = round_up(raw_size + sizeof(u32),
 						 sizeof(u64)) - sizeof(u32);
 			u64 zero = 0;
 
 			perf_output_put(handle, real_size);
-			__output_copy(handle, data->raw->data, raw_size);
+			__output_copy(handle, raw->data, raw->size_head);
+			if (raw->frag)
+				__output_custom(handle, raw->frag->copy_cb,
+						raw->frag->data,
+						raw->size - raw->size_head);
 			if (real_size - raw_size)
 				__output_copy(handle, &zero, real_size - raw_size);
 		} else {
@@ -7388,6 +7394,7 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
 
 	struct perf_raw_record raw = {
 		.size = entry_size,
+		.size_head = entry_size,
 		.data = record,
 	};
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 05f9f6d..1b08d94 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -123,10 +123,7 @@ static inline unsigned long perf_aux_size(struct ring_buffer *rb)
 	return rb->aux_nr_pages << PAGE_SHIFT;
 }
 
-#define DEFINE_OUTPUT_COPY(func_name, memcpy_func)			\
-static inline unsigned long						\
-func_name(struct perf_output_handle *handle,				\
-	  const void *buf, unsigned long len)				\
+#define __DEFINE_OUTPUT_COPY_BODY(memcpy_func)				\
 {									\
 	unsigned long size, written;					\
 									\
@@ -152,6 +149,19 @@ func_name(struct perf_output_handle *handle,			\
 	return len;							\
 }
 
+#define DEFINE_OUTPUT_COPY(func_name, memcpy_func)			\
+static inline unsigned long						\
+func_name(struct perf_output_handle *handle,				\
+	  const void *buf, unsigned long len)				\
+__DEFINE_OUTPUT_COPY_BODY(memcpy_func)
+
+static inline unsigned long
+__output_custom(struct perf_output_handle *handle,
+		unsigned long (*copy_cb)(void *dst, const void *src,
+					 unsigned long n),
+		const void *buf, unsigned long len)
+__DEFINE_OUTPUT_COPY_BODY(copy_cb)
+
 static inline unsigned long
 memcpy_common(void *dst, const void *src, unsigned long n)
 {
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 094c716..8540bd5 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -246,6 +246,7 @@ static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4, u64 size)
 	struct perf_event *event;
 	struct perf_raw_record raw = {
 		.size = size,
+		.size_head = size,
 		.data = record,
 	};
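[Editorial note: the internal.h hunk's trick of sharing one function body between the fixed-copy functions and the new callback-driven __output_custom() can be reproduced in a standalone userspace sketch. struct handle, __COPY_BODY, output_copy, and output_custom below are simplified stand-ins, not the kernel definitions; the 'handle' is just a cursor into a flat buffer.]

```c
#include <assert.h>
#include <string.h>

struct handle {
	char *buf;
	unsigned long off;
};

/* One shared body, parameterized on the copy routine, mirroring
 * __DEFINE_OUTPUT_COPY_BODY(): append len bytes at the cursor. */
#define __COPY_BODY(copy_func)						\
{									\
	copy_func(h->buf + h->off, buf, len);				\
	h->off += len;							\
	return len;							\
}

typedef unsigned long (*copy_cb_t)(void *dst, const void *src,
				   unsigned long n);

static unsigned long memcpy_common(void *dst, const void *src,
				   unsigned long n)
{
	memcpy(dst, src, n);
	return n;
}

/* Fixed variant: the copy routine is baked in at definition time, like
 * DEFINE_OUTPUT_COPY(__output_copy, memcpy_common). */
static unsigned long output_copy(struct handle *h, const void *buf,
				 unsigned long len)
__COPY_BODY(memcpy_common)

/* Callback variant: same body, but the copy routine arrives as a
 * runtime parameter, like __output_custom(). */
static unsigned long output_custom(struct handle *h, copy_cb_t copy_cb,
				   const void *buf, unsigned long len)
__COPY_BODY(copy_cb)
```

The point of the refactor is that the copy loop (slot sizing, wrap-around, etc. in the real code) is written once, and only the copy routine differs between the fixed and custom variants.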