mbox series

[00/12,v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file

Message ID 20200319232219.446480829@goodmis.org
Headers show
Series ring-buffer/tracing: Remove disabling of ring buffer while reading trace file | expand

Message

Steven Rostedt March 19, 2020, 11:22 p.m. UTC
When the ring buffer was first written for ftrace, there was two
human readable files to read it. One was a standard "producer/consumer"
file (trace_pipe), which would consume data from the ring buffer as
it read it, and the other was a "static iterator" that would not
consume the events, such that the file could be read multiple times
and return the same output each time.

The "static iterator" was never meant to be read while there was an
active writer to the ring buffer. If writing was enabled, then it
would disable the writer when the trace file was opened.

There has been some complaints about this by the BPF folks, that did
not realize this little bit of information and it was requested that
the "trace" file does not stop the writing to the ring buffer.

This patch series attempts to satisfy that request, by creating a
temporary buffer in each of the per cpu iterators to place the
read event into, such that it can be passed to users without worrying
about a writer to corrupt the event while it was being written out.
It also uses the fact that the ring buffer is broken up into pages,
where each page has its own timestamp that gets updated when a
writer crosses over to it. By copying it to the temp buffer, and
doing a "before and after" test of the time stamp with memory barriers,
can allow the events to be saved.

Changes since v1:

 - Added fix to selftest first, where these changes wont break it

 - Changed comment in trace_find_next_entry() to better explain what
   it was doing, as pointed out by Masami Hiramatsu.

 - Allocated the iterator temp buffer when the iterator is created,
   as Masami pointed out, it would be better than allocating it each
   time it was used. It is initiated as 128 bytes as most trace events
   are less than that, but will be expanded if needed. Note that
   function is only used when latency measurements are needed (seeing
   two events at once).

Steven Rostedt (VMware) (12):
      selftest/ftrace: Fix function trigger test to handle trace not disabling the tracer
      tracing: Save off entry when peeking at next entry
      ring-buffer: Have ring_buffer_empty() not depend on tracing stopped
      ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance()
      ring-buffer: Add page_stamp to iterator for synchronization
      ring-buffer: Have rb_iter_head_event() handle concurrent writer
      ring-buffer: Do not die if rb_iter_peek() fails more than thrice
      ring-buffer: Optimize rb_iter_head_event()
      ring-buffer: Do not disable recording when there is an iterator
      tracing: Do not disable tracing when reading the trace file
      ring-buffer/tracing: Have iterator acknowledge dropped events
      tracing: Have the document reflect that the trace file keeps tracing enabled

----
 Documentation/trace/ftrace.rst                     |  13 +-
 include/linux/ring_buffer.h                        |   4 +-
 include/linux/trace_events.h                       |   2 +
 kernel/trace/ring_buffer.c                         | 196 +++++++++++++++------
 kernel/trace/trace.c                               |  68 +++++--
 kernel/trace/trace_functions_graph.c               |   2 +-
 kernel/trace/trace_output.c                        |  15 +-
 .../test.d/ftrace/func_traceonoff_triggers.tc      |   2 +-
 8 files changed, 211 insertions(+), 91 deletions(-)

Comments

David Laight March 21, 2020, 7:13 p.m. UTC | #1
From: Steven Rostedt
> Sent: 19 March 2020 23:22
...
> 
> This patch series attempts to satisfy that request, by creating a
> temporary buffer in each of the per cpu iterators to place the
> read event into, such that it can be passed to users without worrying
> about a writer to corrupt the event while it was being written out.
> It also uses the fact that the ring buffer is broken up into pages,
> where each page has its own timestamp that gets updated when a
> writer crosses over to it. By copying it to the temp buffer, and
> doing a "before and after" test of the time stamp with memory barriers,
> can allow the events to be saved.

Does this mean the you will no longer be able to look at a snapshot
of the trace by running 'less trace' (and typically going to the end
to get info for all cpus).

A lot of the time trace is being written far too fast for it to make
any sense to try to read it continuously.

Also, if BPF start using ftrace, no one will be able to use it for
'normal debugging' on such systems.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Steven Rostedt March 22, 2020, 6:07 p.m. UTC | #2
On Sat, 21 Mar 2020 19:13:51 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Steven Rostedt
> > Sent: 19 March 2020 23:22  
> ...
> > 
> > This patch series attempts to satisfy that request, by creating a
> > temporary buffer in each of the per cpu iterators to place the
> > read event into, such that it can be passed to users without worrying
> > about a writer to corrupt the event while it was being written out.
> > It also uses the fact that the ring buffer is broken up into pages,
> > where each page has its own timestamp that gets updated when a
> > writer crosses over to it. By copying it to the temp buffer, and
> > doing a "before and after" test of the time stamp with memory barriers,
> > can allow the events to be saved.  
> 
> Does this mean the you will no longer be able to look at a snapshot
> of the trace by running 'less trace' (and typically going to the end
> to get info for all cpus).

If there's a use case for this, it will be trivial to add an option to
bring back the old behavior. If you want that, I can do that, and even add
a config that makes it the default.

> 
> A lot of the time trace is being written far too fast for it to make
> any sense to try to read it continuously.
> 
> Also, if BPF start using ftrace, no one will be able to use it for
> 'normal debugging' on such systems.

I believe its used for debugging bpf, not for normal tracing. BPF only
uses this when it has their trace_printk() using it. Which gives that nasty
"THIS IS A DEBUG KERNEL" message ;-)   Thus, I don't think you need to
worry about bpf having this in production.

-- Steve
Steven Rostedt March 27, 2020, 1:46 a.m. UTC | #3
On Sat, 21 Mar 2020 19:13:51 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Steven Rostedt
> > Sent: 19 March 2020 23:22  
> ...
> > 
> > This patch series attempts to satisfy that request, by creating a
> > temporary buffer in each of the per cpu iterators to place the
> > read event into, such that it can be passed to users without worrying
> > about a writer to corrupt the event while it was being written out.
> > It also uses the fact that the ring buffer is broken up into pages,
> > where each page has its own timestamp that gets updated when a
> > writer crosses over to it. By copying it to the temp buffer, and
> > doing a "before and after" test of the time stamp with memory barriers,
> > can allow the events to be saved.  
> 
> Does this mean the you will no longer be able to look at a snapshot
> of the trace by running 'less trace' (and typically going to the end
> to get info for all cpus).

I changed patch 9 to be this:

It adds an option "pause-on-trace" that when set, will bring back the
old behavior of pausing recording to the ring buffer when the trace
file is open.

If needed, I can add a kernel command line option and a Kconfig that
makes this set to true by default.

-- Steve

From 71f44d604e5b16cc239d6276b447a515448f582f Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Date: Tue, 17 Mar 2020 17:32:31 -0400
Subject: [PATCH] tracing: Do not disable tracing when reading the trace file

When opening the "trace" file, it is no longer necessary to disable tracing.

Note, a new option is created called "pause-on-trace", when set, will cause
the trace file to emulate its original behavior.

Link: http://lkml.kernel.org/r/20200317213416.903351225@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 Documentation/trace/ftrace.rst | 6 ++++++
 kernel/trace/trace.c           | 9 ++++++---
 kernel/trace/trace.h           | 1 +
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 99a0890e20ec..c33950a35d65 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1125,6 +1125,12 @@ Here are the available options:
 	the trace displays additional information about the
 	latency, as described in "Latency trace format".
 
+  pause-on-trace
+	When set, opening the trace file for read, will pause
+	writing to the ring buffer (as if tracing_on was set to zero).
+	This simulates the original behavior of the trace file.
+	When the file is closed, tracing will be enabled again.
+
   record-cmd
 	When any event or tracer is enabled, a hook is enabled
 	in the sched_switch trace point to fill comm cache
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 47889123be7f..650fa81fffe8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4273,8 +4273,11 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot)
 	if (trace_clocks[tr->clock_id].in_ns)
 		iter->iter_flags |= TRACE_FILE_TIME_IN_NS;
 
-	/* stop the trace while dumping if we are not opening "snapshot" */
-	if (!iter->snapshot)
+	/*
+	 * If pause-on-trace is enabled, then stop the trace while
+	 * dumping, unless this is the "snapshot" file
+	 */
+	if (!iter->snapshot && (tr->trace_flags & TRACE_ITER_PAUSE_ON_TRACE))
 		tracing_stop_tr(tr);
 
 	if (iter->cpu_file == RING_BUFFER_ALL_CPUS) {
@@ -4371,7 +4374,7 @@ static int tracing_release(struct inode *inode, struct file *file)
 	if (iter->trace && iter->trace->close)
 		iter->trace->close(iter);
 
-	if (!iter->snapshot)
+	if (!iter->snapshot && tr->stop_count)
 		/* reenable tracing if it was previously enabled */
 		tracing_start_tr(tr);
 
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index c61e1b1c85a6..f37e05135986 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1302,6 +1302,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		C(IRQ_INFO,		"irq-info"),		\
 		C(MARKERS,		"markers"),		\
 		C(EVENT_FORK,		"event-fork"),		\
+		C(PAUSE_ON_TRACE,	"pause-on-trace"),	\
 		FUNCTION_FLAGS					\
 		FGRAPH_FLAGS					\
 		STACK_FLAGS					\
David Laight March 27, 2020, 10:07 a.m. UTC | #4
From: Steven Rostedt
> Sent: 27 March 2020 01:46
> On Sat, 21 Mar 2020 19:13:51 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > From: Steven Rostedt
> > > Sent: 19 March 2020 23:22
> > ...
> > >
> > > This patch series attempts to satisfy that request, by creating a
> > > temporary buffer in each of the per cpu iterators to place the
> > > read event into, such that it can be passed to users without worrying
> > > about a writer to corrupt the event while it was being written out.
> > > It also uses the fact that the ring buffer is broken up into pages,
> > > where each page has its own timestamp that gets updated when a
> > > writer crosses over to it. By copying it to the temp buffer, and
> > > doing a "before and after" test of the time stamp with memory barriers,
> > > can allow the events to be saved.
> >
> > Does this mean the you will no longer be able to look at a snapshot
> > of the trace by running 'less trace' (and typically going to the end
> > to get info for all cpus).
> 
> I changed patch 9 to be this:
> 
> It adds an option "pause-on-trace" that when set, will bring back the
> old behavior of pausing recording to the ring buffer when the trace
> file is open.
> 
> If needed, I can add a kernel command line option and a Kconfig that
> makes this set to true by default.

Maybe a different file 'trace_no_pause' ?
Along with the one that lets you read the raw trace and get EOF.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Steven Rostedt March 27, 2020, 2:30 p.m. UTC | #5
On Fri, 27 Mar 2020 10:07:00 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> > If needed, I can add a kernel command line option and a Kconfig that
> > makes this set to true by default.  
> 
> Maybe a different file 'trace_no_pause' ?

I rather not add another file, it adds more complexity, and confuses the
interface even more. I'll leave this as is.

> Along with the one that lets you read the raw trace and get EOF.

Can you explain this more? I think we talked about this before, but I don't
remember the details.

-- Steve
David Laight March 27, 2020, 2:56 p.m. UTC | #6
From: Steven Rostedt
> Sent: 27 March 2020 14:31
..
> > Along with the one that lets you read the raw trace and get EOF.
> 
> Can you explain this more? I think we talked about this before, but I don't
> remember the details.

I was trying to use schedviz (on github from google).
It reads out the raw trace (for some scheduler events)
and then postprocesses it to generate to nice pictures.
However the shell script it uses for the captures has to run the
copies in the background, sleep a random time, and then kill all
the copies having hoped it has waited long enough - truly horrid.

There is also some confusion (IIRC between a header and your library
code) about the 'time-delta' on 'pad' entries.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)