[bpf-next,2/2] bpf, libbpf: simplify perf RB walk and do incremental updates

Message ID 20181011140207.27602-3-daniel@iogearbox.net
State Changes Requested
Delegated to: BPF Maintainers
Headers show
Series
  • Two libbpf RB walk improvements
Related show

Commit Message

Daniel Borkmann Oct. 11, 2018, 2:02 p.m.
Clean up and improve bpf_perf_event_read_simple() ring walk a bit
to use similar tail update scheme as in perf and bcc allowing the
kernel to make forward progress not only after full timely walk.
Also few other improvements to use realloc() instead of free() and
malloc() combination and for the callback use proper perf_event_header
instead of void pointer, so that real applications can use container_of()
macro with proper type checking.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/bpf/bpftool/map_perf_ring.c           | 10 ++--
 tools/lib/bpf/libbpf.c                      | 77 +++++++++++++----------------
 tools/lib/bpf/libbpf.h                      | 14 +++---
 tools/testing/selftests/bpf/trace_helpers.c |  7 +--
 4 files changed, 53 insertions(+), 55 deletions(-)

Comments

Jakub Kicinski Oct. 12, 2018, 3:04 a.m. | #1
On Thu, 11 Oct 2018 16:02:07 +0200, Daniel Borkmann wrote:
> Clean up and improve bpf_perf_event_read_simple() ring walk a bit
> to use similar tail update scheme as in perf and bcc allowing the
> kernel to make forward progress not only after full timely walk.

The extra memory barriers won't impact performance?  If I read the code
correctly we now have:

	while (bla) {
		head = HEAD
		rmb()

		...

		mb()
		TAIL = tail
	}

Would it make sense to try to piggy back on the mb() for head re-read
at least?  Perhaps that's a non-issue, just wondering.

> Also few other improvements to use realloc() instead of free() and
> malloc() combination and for the callback use proper perf_event_header
> instead of void pointer, so that real applications can use container_of()
> macro with proper type checking.

FWIW the free() + malloc() was to avoid the the needless copy of the
previous event realloc() may do.  It makes sense to use realloc()
especially if you want to put extra info in front of the buffer, just
sayin' it wasn't a complete braino ;)

> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann Oct. 12, 2018, 8:39 a.m. | #2
On 10/12/2018 05:04 AM, Jakub Kicinski wrote:
> On Thu, 11 Oct 2018 16:02:07 +0200, Daniel Borkmann wrote:
>> Clean up and improve bpf_perf_event_read_simple() ring walk a bit
>> to use similar tail update scheme as in perf and bcc allowing the
>> kernel to make forward progress not only after full timely walk.
> 
> The extra memory barriers won't impact performance?  If I read the code
> correctly we now have:
> 
> 	while (bla) {
> 		head = HEAD
> 		rmb()
> 
> 		...
> 
> 		mb()
> 		TAIL = tail
> 	}
> 
> Would it make sense to try to piggy back on the mb() for head re-read
> at least?  Perhaps that's a non-issue, just wondering.

From the scheme specified in the comment in prior patch my understanding
would be that they don't pair (see B and C) so there would be no guarantee
that load of head would happen before load of data. Fwiw, I've been using
the exact same semantics as user space perf tool walks the perf mmap'ed
ring buffer (tools/perf/util/mmap.{h,c}) here. Given kernel doesn't stop
pushing into ring buffer while user space walks it and indicates how far
it has consumed data via tail update, it would allow for making room
successively and not only after full run has complete, so we don't make
any assumptions in the generic libbpf library helper on how slow/quick
the callback would be processing resp. how full ring is, etc, and kernel
pushing new data can be processed in the same run if necessary. One thing
we could consider is to batch tail updates, say, every 8 elements and a
final update once we break out walking the ring; probably okay'ish as a
heuristic..

>> Also few other improvements to use realloc() instead of free() and
>> malloc() combination and for the callback use proper perf_event_header
>> instead of void pointer, so that real applications can use container_of()
>> macro with proper type checking.
> 
> FWIW the free() + malloc() was to avoid the the needless copy of the
> previous event realloc() may do.  It makes sense to use realloc()
> especially if you want to put extra info in front of the buffer, just
> sayin' it wasn't a complete braino ;)

No strong preference from my side, I'd think that it might be sensible in
any case from applications to call the bpf_perf_event_read_simple() with a
already preallocated buffer, depending on the expected max element size from
BPF could e.g. be a buffer of 1 page or so. Given 512 byte stack space from
the BPF prog and MTU 1500 this would more than suffice to avoid new
allocations altogether. Anyway, given we only grow the new memory area I
did some testing on realloc() with an array of pointers to prior malloc()'ed
buffers, running randomly 10M realloc()s to increase size over them and
saw <1% where area had to be moved, so we're hitting corner case of a corner
case, I'm also ok to leave the combination, though. :)

Thanks,
Daniel
Daniel Borkmann Oct. 12, 2018, 1:30 p.m. | #3
On 10/12/2018 10:39 AM, Daniel Borkmann wrote:
> On 10/12/2018 05:04 AM, Jakub Kicinski wrote:
>> On Thu, 11 Oct 2018 16:02:07 +0200, Daniel Borkmann wrote:
>>> Clean up and improve bpf_perf_event_read_simple() ring walk a bit
>>> to use similar tail update scheme as in perf and bcc allowing the
>>> kernel to make forward progress not only after full timely walk.
>>
>> The extra memory barriers won't impact performance?  If I read the code
>> correctly we now have:
>>
>> 	while (bla) {
>> 		head = HEAD
>> 		rmb()
>>
>> 		...
>>
>> 		mb()
>> 		TAIL = tail
>> 	}
>>
>> Would it make sense to try to piggy back on the mb() for head re-read
>> at least?  Perhaps that's a non-issue, just wondering.
> 
> From the scheme specified in the comment in prior patch my understanding
> would be that they don't pair (see B and C) so there would be no guarantee
> that load of head would happen before load of data. Fwiw, I've been using
> the exact same semantics as user space perf tool walks the perf mmap'ed
> ring buffer (tools/perf/util/mmap.{h,c}) here. Given kernel doesn't stop

On that note, I'll also respin, after some clarification with PeterZ on
why perf is using {rmb,mb}() barriers today as opposed to more lightweight
smp_{rmb,mb}() ones it turns out there is no real reason other than
historic one and perf can be changed and made more efficient as well. ;)

> pushing into ring buffer while user space walks it and indicates how far
> it has consumed data via tail update, it would allow for making room
> successively and not only after full run has complete, so we don't make
> any assumptions in the generic libbpf library helper on how slow/quick
> the callback would be processing resp. how full ring is, etc, and kernel
> pushing new data can be processed in the same run if necessary. One thing
> we could consider is to batch tail updates, say, every 8 elements and a
> final update once we break out walking the ring; probably okay'ish as a
> heuristic..
> 
>>> Also few other improvements to use realloc() instead of free() and
>>> malloc() combination and for the callback use proper perf_event_header
>>> instead of void pointer, so that real applications can use container_of()
>>> macro with proper type checking.
>>
>> FWIW the free() + malloc() was to avoid the the needless copy of the
>> previous event realloc() may do.  It makes sense to use realloc()
>> especially if you want to put extra info in front of the buffer, just
>> sayin' it wasn't a complete braino ;)
> 
> No strong preference from my side, I'd think that it might be sensible in
> any case from applications to call the bpf_perf_event_read_simple() with a
> already preallocated buffer, depending on the expected max element size from
> BPF could e.g. be a buffer of 1 page or so. Given 512 byte stack space from
> the BPF prog and MTU 1500 this would more than suffice to avoid new
> allocations altogether. Anyway, given we only grow the new memory area I
> did some testing on realloc() with an array of pointers to prior malloc()'ed
> buffers, running randomly 10M realloc()s to increase size over them and
> saw <1% where area had to be moved, so we're hitting corner case of a corner
> case, I'm also ok to leave the combination, though. :)
> 
> Thanks,
> Daniel
Jakub Kicinski Oct. 12, 2018, 3:39 p.m. | #4
On Fri, 12 Oct 2018 15:30:57 +0200, Daniel Borkmann wrote:
> On 10/12/2018 10:39 AM, Daniel Borkmann wrote:
> > On 10/12/2018 05:04 AM, Jakub Kicinski wrote:  
> >> On Thu, 11 Oct 2018 16:02:07 +0200, Daniel Borkmann wrote:  
> >>> Clean up and improve bpf_perf_event_read_simple() ring walk a bit
> >>> to use similar tail update scheme as in perf and bcc allowing the
> >>> kernel to make forward progress not only after full timely walk.  
> >>
> >> The extra memory barriers won't impact performance?  If I read the code
> >> correctly we now have:
> >>
> >> 	while (bla) {
> >> 		head = HEAD
> >> 		rmb()
> >>
> >> 		...
> >>
> >> 		mb()
> >> 		TAIL = tail
> >> 	}
> >>
> >> Would it make sense to try to piggy back on the mb() for head re-read
> >> at least?  Perhaps that's a non-issue, just wondering.  
> > 
> > From the scheme specified in the comment in prior patch my understanding
> > would be that they don't pair (see B and C) so there would be no guarantee
> > that load of head would happen before load of data. Fwiw, I've been using
> > the exact same semantics as user space perf tool walks the perf mmap'ed
> > ring buffer (tools/perf/util/mmap.{h,c}) here. Given kernel doesn't stop  
> 
> On that note, I'll also respin, after some clarification with PeterZ on
> why perf is using {rmb,mb}() barriers today as opposed to more lightweight
> smp_{rmb,mb}() ones it turns out there is no real reason other than
> historic one and perf can be changed and made more efficient as well. ;)

Cool, to be clear I meant something like:

	head = HEAD
	rmb()

	while (bla) {
		...

		head = HEAD
		mb()
		TAIL = tail
	}

> > pushing into ring buffer while user space walks it and indicates how far
> > it has consumed data via tail update, it would allow for making room
> > successively and not only after full run has complete, so we don't make
> > any assumptions in the generic libbpf library helper on how slow/quick
> > the callback would be processing resp. how full ring is, etc, and kernel
> > pushing new data can be processed in the same run if necessary. One thing
> > we could consider is to batch tail updates, say, every 8 elements and a
> > final update once we break out walking the ring; probably okay'ish as a
> > heuristic..

Makes sense, I would be tempted to do some batching, but since I'm
mostly working with I/O devices I may think that memory barriers cost
more than they do for main memory-only accesses :)

> >>> Also few other improvements to use realloc() instead of free() and
> >>> malloc() combination and for the callback use proper perf_event_header
> >>> instead of void pointer, so that real applications can use container_of()
> >>> macro with proper type checking.  
> >>
> >> FWIW the free() + malloc() was to avoid the the needless copy of the
> >> previous event realloc() may do.  It makes sense to use realloc()
> >> especially if you want to put extra info in front of the buffer, just
> >> sayin' it wasn't a complete braino ;)  
> > 
> > No strong preference from my side, I'd think that it might be sensible in
> > any case from applications to call the bpf_perf_event_read_simple() with a
> > already preallocated buffer, depending on the expected max element size from
> > BPF could e.g. be a buffer of 1 page or so. Given 512 byte stack space from
> > the BPF prog and MTU 1500 this would more than suffice to avoid new
> > allocations altogether. Anyway, given we only grow the new memory area I
> > did some testing on realloc() with an array of pointers to prior malloc()'ed
> > buffers, running randomly 10M realloc()s to increase size over them and
> > saw <1% where area had to be moved, so we're hitting corner case of a corner
> > case, I'm also ok to leave the combination, though. :)

Thanks for doing the experiment, not strong preference here either, so
realloc() seems good!

Patch

diff --git a/tools/bpf/bpftool/map_perf_ring.c b/tools/bpf/bpftool/map_perf_ring.c
index 6d41323..bdaf406 100644
--- a/tools/bpf/bpftool/map_perf_ring.c
+++ b/tools/bpf/bpftool/map_perf_ring.c
@@ -50,15 +50,17 @@  static void int_exit(int signo)
 	stop = true;
 }
 
-static enum bpf_perf_event_ret print_bpf_output(void *event, void *priv)
+static enum bpf_perf_event_ret
+print_bpf_output(struct perf_event_header *event, void *private_data)
 {
-	struct event_ring_info *ring = priv;
-	struct perf_event_sample *e = event;
+	struct perf_event_sample *e = container_of(event, struct perf_event_sample,
+						   header);
+	struct event_ring_info *ring = private_data;
 	struct {
 		struct perf_event_header header;
 		__u64 id;
 		__u64 lost;
-	} *lost = event;
+	} *lost = (typeof(lost))event;
 
 	if (json_output) {
 		jsonw_start_object(json_wtr);
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1ac8856..35f4a77 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2448,58 +2448,51 @@  static void bpf_perf_write_tail(struct perf_event_mmap_page *header,
 }
 
 enum bpf_perf_event_ret
-bpf_perf_event_read_simple(void *mem, unsigned long size,
-			   unsigned long page_size, void **buf, size_t *buf_len,
-			   bpf_perf_event_print_t fn, void *priv)
-{
-	struct perf_event_mmap_page *header = mem;
-	__u64 data_head = bpf_perf_read_head(header);
-	__u64 data_tail = header->data_tail;
-	int ret = LIBBPF_PERF_EVENT_ERROR;
-	void *base, *begin, *end;
-
-	if (data_head == data_tail)
-		return LIBBPF_PERF_EVENT_CONT;
-
-	base = ((char *)header) + page_size;
-
-	begin = base + data_tail % size;
-	end = base + data_head % size;
-
-	while (begin != end) {
-		struct perf_event_header *ehdr;
-
-		ehdr = begin;
-		if (begin + ehdr->size > base + size) {
-			long len = base + size - begin;
-
-			if (*buf_len < ehdr->size) {
-				free(*buf);
-				*buf = malloc(ehdr->size);
-				if (!*buf) {
+bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t page_size,
+			   void **copy_mem, size_t *copy_size,
+			   bpf_perf_event_print_t fn, void *private_data)
+{
+	struct perf_event_mmap_page *header = mmap_mem;
+	void *base = ((__u8 *)header) + page_size;
+	int ret = LIBBPF_PERF_EVENT_CONT;
+	struct perf_event_header *ehdr;
+	__u64 data_head, data_tail;
+	size_t ehdr_size;
+
+	for (data_tail = header->data_tail,
+	     data_head = bpf_perf_read_head(header);
+	     data_head != data_tail;
+	     data_head = bpf_perf_read_head(header)) {
+		ehdr = base + (data_tail & (mmap_size - 1));
+		ehdr_size = ehdr->size;
+		if (((void *)ehdr) + ehdr_size > base + mmap_size) {
+			void *copy_start = ehdr;
+			size_t len_first = base + mmap_size - copy_start;
+			size_t len_secnd = ehdr_size - len_first;
+
+			if (*copy_size < ehdr_size) {
+				void *ptr = realloc(*copy_mem, ehdr_size);
+
+				if (!ptr) {
 					ret = LIBBPF_PERF_EVENT_ERROR;
 					break;
 				}
-				*buf_len = ehdr->size;
+				*copy_mem = ptr;
+				*copy_size = ehdr_size;
 			}
 
-			memcpy(*buf, begin, len);
-			memcpy(*buf + len, base, ehdr->size - len);
-			ehdr = (void *)*buf;
-			begin = base + ehdr->size - len;
-		} else if (begin + ehdr->size == base + size) {
-			begin = base;
-		} else {
-			begin += ehdr->size;
+			memcpy(*copy_mem, copy_start, len_first);
+			memcpy(*copy_mem + len_first, base, len_secnd);
+			ehdr = *copy_mem;
 		}
 
-		ret = fn(ehdr, priv);
+		ret = fn(ehdr, private_data);
+
+		data_tail += ehdr_size;
+		bpf_perf_write_tail(header, data_tail);
 		if (ret != LIBBPF_PERF_EVENT_CONT)
 			break;
-
-		data_tail += ehdr->size;
 	}
 
-	bpf_perf_write_tail(header, data_tail);
 	return ret;
 }
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 8af8d36..16b5c95 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -284,12 +284,14 @@  enum bpf_perf_event_ret {
 	LIBBPF_PERF_EVENT_CONT	= -2,
 };
 
-typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(void *event,
-							  void *priv);
-int bpf_perf_event_read_simple(void *mem, unsigned long size,
-			       unsigned long page_size,
-			       void **buf, size_t *buf_len,
-			       bpf_perf_event_print_t fn, void *priv);
+struct perf_event_header;
+typedef enum bpf_perf_event_ret
+	(*bpf_perf_event_print_t)(struct perf_event_header *hdr,
+				  void *private_data);
+enum bpf_perf_event_ret
+bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t page_size,
+			   void **copy_mem, size_t *copy_size,
+			   bpf_perf_event_print_t fn, void *private_data);
 
 struct nlattr;
 typedef int (*libbpf_dump_nlmsg_t)(void *cookie, void *msg, struct nlattr **tb);
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index cabe2a3..1764093 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -124,10 +124,11 @@  struct perf_event_sample {
 	char data[];
 };
 
-static enum bpf_perf_event_ret bpf_perf_event_print(void *event, void *priv)
+static enum bpf_perf_event_ret
+bpf_perf_event_print(struct perf_event_header *hdr, void *private_data)
 {
-	struct perf_event_sample *e = event;
-	perf_event_print_fn fn = priv;
+	struct perf_event_sample *e = (struct perf_event_sample *)hdr;
+	perf_event_print_fn fn = private_data;
 	int ret;
 
 	if (e->header.type == PERF_RECORD_SAMPLE) {