[v5,1/4] printk/nmi: generic solution for safe printk in NMI

Submitted by Sergey Senozhatsky on April 21, 2017, 1:57 a.m.

Details

Message ID 20170421015724.GA586@jagdpanzerIV.localdomain
State Not Applicable
Headers show

Commit Message

Sergey Senozhatsky April 21, 2017, 1:57 a.m.
Hello,

On (04/20/17 15:11), Petr Mladek wrote:
[..]
> Good analyze. I would summarize it that we need to be careful of:
> 
>   + logbug_lock
>   + PRINTK_SAFE_CONTEXT
>   + locks used by console drivers
> 
> The first two things are easy to check. Except that a check for logbuf_lock
> might produce false negatives. The last check is very hard.
> 
> > so at the moment what I can think of is something like
> > 
> >   -- check this_cpu_read(printk_context) in NMI prink
> > 
> > 	-- if we are NOT in printk_safe on this CPU, then do printk_deferred()
> > 	   and bypass `nmi_print_seq' buffer
> 
> I would add also a check for logbuf_lock.
> > 	-- if we are in printk_safe
> > 	  -- well... bad luck... have a bigger buffer.
> 
> Yup, we do the best effort while still trying to stay on the safe
> side.
> 
> I have cooked up a patch based on this. It uses printk_deferred()
> in NMI when it is safe. Note that console_flush_on_panic() will
> try to get them on the console when a kdump is not generated.
> I believe that it will help Steven.


OK. I need to look more at the patch. It does more than I'd expected/imagined.


[..]
>  void printk_nmi_enter(void)
>  {
> -	this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
> +	/*
> +	 * The size of the extra per-CPU buffer is limited. Use it
> +	 * only when really needed.
> +	 */
> +	if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK ||
> +	    raw_spin_is_locked(&logbuf_lock)) {
> +		this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
> +	} else {
> +		this_cpu_or(printk_context, PRINTK_NMI_DEFERRED_CONTEXT_MASK);
> +	}
>  }

well... the logbuf_lock can temporarily be locked from another CPU. I'd say
that spin_is_locked() has better chances for false positive than
this_cpu_read(printk_context). because this_cpu_read(printk_context) depends
only on this CPU state, while spin_is_locked() depends on all CPUs. and the
idea with this_cpu_read(printk_context) was that we check if the logbuf_lock
was locked from this particular CPU.

I agree that this_cpu_read(printk_context) covers slightly more than
logbuf_lock scope, so we may get positive this_cpu_read(printk_context)
with unlocked logbuf_lock, but I don't tend to think that it's a big
problem.


wouldn't something as simple as below do the trick?
// absolutely and completely untested //

Comments

Petr Mladek April 21, 2017, 12:06 p.m.
On Fri 2017-04-21 10:57:25, Sergey Senozhatsky wrote:
> On (04/20/17 15:11), Petr Mladek wrote:
> [..]
> >  void printk_nmi_enter(void)
> >  {
> > -	this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
> > +	/*
> > +	 * The size of the extra per-CPU buffer is limited. Use it
> > +	 * only when really needed.
> > +	 */
> > +	if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK ||
> > +	    raw_spin_is_locked(&logbuf_lock)) {
> > +		this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
> > +	} else {
> > +		this_cpu_or(printk_context, PRINTK_NMI_DEFERRED_CONTEXT_MASK);
> > +	}
> >  }
> 
> well... the logbuf_lock can temporarily be locked from another CPU. I'd say
> that spin_is_locked() has better chances for false positive than
> this_cpu_read(printk_context). because this_cpu_read(printk_context) depends
> only on this CPU state, while spin_is_locked() depends on all CPUs. and the
> idea with this_cpu_read(printk_context) was that we check if the logbuf_lock
> was locked from this particular CPU.

I finally see the point. I confess that I did not think much about
this way yesterday because it looked too tricky (indirect) and possibly
error prone.

> I agree that this_cpu_read(printk_context) covers slightly more than
> logbuf_lock scope, so we may get positive this_cpu_read(printk_context)
> with unlocked logbuf_lock, but I don't tend to think that it's a big
> problem.

PRINTK_SAFE_CONTEXT is set also in call_console_drivers().
It might take rather long and logbuf_lock is availe. So, it is
noticeable source of false positives.

Hmm, my version actually checks both the lock and the context.
It is very deffensive to be always on the safe side.

We could get the best restults with both checks and by using "&&":

void printk_nmi_enter(void)
{
	/*
	 * The size of the extra per-CPU buffer is limited. Use it
	 * only when the lock for the main log buffer is not
	 * available.
	 *
	 * logbuf_lock might be taken on another CPU. But it must be
	 * in PRINTK_SAFE context. Reduce false positives by a check
	 * of the context.
	 */
	if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK ||
	    raw_spin_is_locked(&logbuf_lock)) {
		this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
	} else {
		this_cpu_or(printk_context, PRINTK_NMI_DEFERRED_CONTEXT_MASK);
	}
}


But after all, I would prefer to keep it simple, strightforward,
and check only the logbuf_lock:

	if (raw_spin_is_locked(&logbuf_lock))
		this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
	else
		this_cpu_or(printk_context, PRINTK_NMI_DEFERRED_CONTEXT_MASK);


> wouldn't something as simple as below do the trick?
> // absolutely and completely untested //
> 
> 
> diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
> index 033e50a7d706..c7477654c5b1 100644
> --- a/kernel/printk/printk_safe.c
> +++ b/kernel/printk/printk_safe.c
> @@ -303,7 +303,10 @@ static int vprintk_nmi(const char *fmt, va_list args)
>  {
>         struct printk_safe_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
>  
> -       return printk_safe_log_store(s, fmt, args);
> +       if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK)
> +               return printk_safe_log_store(s, fmt, args);
> +
> +       return vprintk_emit(0, LOGLEVEL_SCHED, NULL, 0, fmt, args);
>  }

It looks simple but some things are missing. It will be used also
outside panic/oops, so it should queue the irq_work to flush the console.
Also the serialization of nmi_cpu_backtrace() backtrace calls is
needed.

All in all, we could get rid only of the extra
PRINTK_NMI_DEFERRED_CONTEXT_MASK with this approach. Other than
that it looks more tricky to me.

Sigh, I hate problems without the single best solution.

Best Regards,
Petr
Sergey Senozhatsky April 24, 2017, 2:17 a.m.
On (04/21/17 14:06), Petr Mladek wrote:
[..]
> > I agree that this_cpu_read(printk_context) covers slightly more than
> > logbuf_lock scope, so we may get positive this_cpu_read(printk_context)
> > with unlocked logbuf_lock, but I don't tend to think that it's a big
> > problem.
> 
> PRINTK_SAFE_CONTEXT is set also in call_console_drivers().
> It might take rather long and logbuf_lock is availe. So, it is
> noticeable source of false positives.

yes, agree.

probably we need additional printk_safe annotations for
		"logbuf_lock is locked from _this_ CPU"

false positives there can be very painful.

[..]
> 	if (raw_spin_is_locked(&logbuf_lock))
> 		this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
> 	else
> 		this_cpu_or(printk_context, PRINTK_NMI_DEFERRED_CONTEXT_MASK);

well, if everyone is fine with logbuf_lock access from every CPU from every
NMI then I won't object either. but may be it makes sense to reduce the
possibility of false positives. Steven is loosing critically important logs,
after all.


by the way,
does this `nmi_print_seq' bypass even fix anything for Steven? it sort of
can, in theory, but just in theory. so may be we need direct message flush
from NMI handler (printk->console_unlock), which will be a really big problem.

logbuf might not be big enough for 4890096 messages (Steven's report
mentions "Lost 4890096 message(s)!"). we are counting on the fact that
in case of `nmi_print_seq' bypass some other CPU will call console_unlock()
and print pending logbuf messages, but this is not guaranteed and the
messages can be dropped even from logbuf.

I don't know,
should we try to queue printk_deferred irq_work for all online CPUs from
vprintk_nmi() when it bypasses printk_safe_log_store()? in order to minimize
possibilities of logbuf overflow. printk_deferred() will queue work on
vprintk_nmi() CPU, sure, but we don't know how many messages we are going
to add to logbuf from NMI.


> > @@ -303,7 +303,10 @@ static int vprintk_nmi(const char *fmt, va_list args)
> >  {
> >         struct printk_safe_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
> >  
> > -       return printk_safe_log_store(s, fmt, args);
> > +       if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK)
> > +               return printk_safe_log_store(s, fmt, args);
> > +
> > +       return vprintk_emit(0, LOGLEVEL_SCHED, NULL, 0, fmt, args);
> >  }
> 
> It looks simple but some things are missing. It will be used also
> outside panic/oops, so it should queue the irq_work to flush the console.

you are right. I thought about moving irq_work to vprintk_emit(), but
completely forgot about it. without that missing bit the proposed two-liner
is not complete.

	-ss
Petr Mladek April 27, 2017, 1:38 p.m.
On Mon 2017-04-24 11:17:47, Sergey Senozhatsky wrote:
> On (04/21/17 14:06), Petr Mladek wrote:
> [..]
> > > I agree that this_cpu_read(printk_context) covers slightly more than
> > > logbuf_lock scope, so we may get positive this_cpu_read(printk_context)
> > > with unlocked logbuf_lock, but I don't tend to think that it's a big
> > > problem.
> > 
> > PRINTK_SAFE_CONTEXT is set also in call_console_drivers().
> > It might take rather long and logbuf_lock is availe. So, it is
> > noticeable source of false positives.
> 
> yes, agree.
> 
> probably we need additional printk_safe annotations for
> 		"logbuf_lock is locked from _this_ CPU"
> 
> false positives there can be very painful.
> 
> [..]
> > 	if (raw_spin_is_locked(&logbuf_lock))
> > 		this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
> > 	else
> > 		this_cpu_or(printk_context, PRINTK_NMI_DEFERRED_CONTEXT_MASK);
> 
> well, if everyone is fine with logbuf_lock access from every CPU from every
> NMI then I won't object either. but may be it makes sense to reduce the
> possibility of false positives. Steven is loosing critically important logs,
> after all.
> 
> 
> by the way,
> does this `nmi_print_seq' bypass even fix anything for Steven?

I think that this is the most important question.

Steven, does the patch from
https://lkml.kernel.org/r/20170420131154.GL3452@pathway.suse.cz
help you to see the debug messages, please?


> it sort of
> can, in theory, but just in theory. so may be we need direct message flush
> from NMI handler (printk->console_unlock), which will be a really big problem.

I thought about it a lot and got scared where this might go.
We need to balance the usefulness and the complexity of the solution.

It took one year to discover this regression. Before it was
suggested to avoid calling printk() in NMI context at all.
Now, we are trying to fix printk() to handle MBs of messages
in NMI context.

If my proposed patch solves the problem for Steven, I would still
like to get similar solution in. It is not that complex and helps
to bypass the limited per-CPU buffer in most cases. I always thought
that 8kB might be not enough in some cases.

Note that my patch is very defensive. It uses the main log buffer
only when it is really safe. It has higher potential for unneeded
fallback but if it works for Steven (really existing usecase), ...

On the other hand, I would prefer to avoid any much more complex
solution until we have a real reports that they are needed.

Also we need to look for alternatives. There is a chance
to create crashdump and get the ftrace messages from it.
Also this might be scenario when we might need to suggest
the early_printk() patchset from Peter Zijlstra.


> logbuf might not be big enough for 4890096 messages (Steven's report
> mentions "Lost 4890096 message(s)!"). we are counting on the fact that
> in case of `nmi_print_seq' bypass some other CPU will call console_unlock()
> and print pending logbuf messages, but this is not guaranteed and the
> messages can be dropped even from logbuf.

Yup. I tested the patch here and I needed to increase the main log buffer
size to see all ftrace messages. Fortunately, it was possible to use a really
huge global buffer. But it is not realistic to use huge per-CPU ones.

Best Regards,
Petr
Steven Rostedt April 27, 2017, 2:31 p.m.
On Thu, 27 Apr 2017 15:38:19 +0200
Petr Mladek <pmladek@suse.com> wrote:

> > by the way,
> > does this `nmi_print_seq' bypass even fix anything for Steven?  
> 
> I think that this is the most important question.
> 
> Steven, does the patch from
> https://lkml.kernel.org/r/20170420131154.GL3452@pathway.suse.cz
> help you to see the debug messages, please?

You'll have to wait for a bit. The box that I was debugging takes 45
minutes to reboot. And I don't have much more time to play on it before
I have to give it back. I already found the bug I was looking for and
I'm trying not to crash it again (due to the huge bring up time).

When I get a chance, I'll see if I can insert a trigger to crash the
kernel from NMI on another box and see if this patch helps.

Thanks,

-- Steve
Petr Mladek April 27, 2017, 3:28 p.m.
On Thu 2017-04-27 10:31:18, Steven Rostedt wrote:
> On Thu, 27 Apr 2017 15:38:19 +0200
> Petr Mladek <pmladek@suse.com> wrote:
> 
> > > by the way,
> > > does this `nmi_print_seq' bypass even fix anything for Steven?  
> > 
> > I think that this is the most important question.
> > 
> > Steven, does the patch from
> > https://lkml.kernel.org/r/20170420131154.GL3452@pathway.suse.cz
> > help you to see the debug messages, please?
> 
> You'll have to wait for a bit. The box that I was debugging takes 45
> minutes to reboot. And I don't have much more time to play on it before
> I have to give it back. I already found the bug I was looking for and
> I'm trying not to crash it again (due to the huge bring up time).

I see.

> When I get a chance, I'll see if I can insert a trigger to crash the
> kernel from NMI on another box and see if this patch helps.

I actually tested it here using this hack:

diff --cc lib/nmi_backtrace.c
index d531f85c0c9b,0bc0a3535a8a..000000000000
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@@ -89,8 -90,7 +90,9 @@@ bool nmi_cpu_backtrace(struct pt_regs *
        int cpu = smp_processor_id();
  
        if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
 +              if (in_nmi())
 +                      panic("Simulating panic in NMI\n");
+               arch_spin_lock(&lock);
                if (regs && cpu_in_idle(instruction_pointer(regs))) {
                        pr_warn("NMI backtrace for cpu %d skipped: idling at pc %#lx\n",
                                cpu, instruction_pointer(regs));

and triggered by:

   echo  l > /proc/sysrq-trigger

The patch really helped to see much more (all) messages from the ftrace
buffers in NMI mode.

But the test is a bit artifical. The patch might not help when there
is a big printk() activity on the system when the panic() is
triggered. We might wrongly use the small per-CPU buffer when
the logbuf_lock is tested and taken on another CPU at the same time.
It means that it will not always help.

I personally think that the patch might be good enough. I am not sure
if a perfect (more comlpex) solution is worth it.

Best Regards,
Petr
Steven Rostedt April 27, 2017, 3:42 p.m.
On Thu, 27 Apr 2017 17:28:07 +0200
Petr Mladek <pmladek@suse.com> wrote:


> > When I get a chance, I'll see if I can insert a trigger to crash the
> > kernel from NMI on another box and see if this patch helps.  
> 
> I actually tested it here using this hack:
> 
> diff --cc lib/nmi_backtrace.c
> index d531f85c0c9b,0bc0a3535a8a..000000000000
> --- a/lib/nmi_backtrace.c
> +++ b/lib/nmi_backtrace.c
> @@@ -89,8 -90,7 +90,9 @@@ bool nmi_cpu_backtrace(struct pt_regs *
>         int cpu = smp_processor_id();
>   
>         if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
>  +              if (in_nmi())
>  +                      panic("Simulating panic in NMI\n");
> +               arch_spin_lock(&lock);

I was going to create a ftrace trigger, to crash on demand, but this
may do as well.

>                 if (regs && cpu_in_idle(instruction_pointer(regs))) {
>                         pr_warn("NMI backtrace for cpu %d skipped: idling at pc %#lx\n",
>                                 cpu, instruction_pointer(regs));
> 
> and triggered by:
> 
>    echo  l > /proc/sysrq-trigger
> 
> The patch really helped to see much more (all) messages from the ftrace
> buffers in NMI mode.
> 
> But the test is a bit artifical. The patch might not help when there
> is a big printk() activity on the system when the panic() is
> triggered. We might wrongly use the small per-CPU buffer when
> the logbuf_lock is tested and taken on another CPU at the same time.
> It means that it will not always help.
> 
> I personally think that the patch might be good enough. I am not sure
> if a perfect (more comlpex) solution is worth it.

I wasn't asking for perfect, as the previous solutions never were
either. I just want an optimistic dump if possible.

I'll try to get some time today to test this, and let you know. But it
wont be on the machine that I originally had the issue with.

Thanks,

-- Steve
Peter Zijlstra April 28, 2017, 9:02 a.m.
On Thu, Apr 27, 2017 at 03:38:19PM +0200, Petr Mladek wrote:
> Also we need to look for alternatives. There is a chance
> to create crashdump and get the ftrace messages from it.
> Also this might be scenario when we might need to suggest
> the early_printk() patchset from Peter Zijlstra.

I'd be happy to repost those. I still carry them in my tree.
Petr Mladek April 28, 2017, 1:44 p.m.
On Fri 2017-04-28 11:02:26, Peter Zijlstra wrote:
> On Thu, Apr 27, 2017 at 03:38:19PM +0200, Petr Mladek wrote:
> > Also we need to look for alternatives. There is a chance
> > to create crashdump and get the ftrace messages from it.
> > Also this might be scenario when we might need to suggest
> > the early_printk() patchset from Peter Zijlstra.
> 
> I'd be happy to repost those. I still carry them in my tree.

You do not need to if they are still the same as
https://lkml.kernel.org/r/20161018170830.405990950@infradead.org

I rather do not promise anything but I would like to look at them
within next few weeks (after the merge window).

Best Regards,
Petr
Peter Zijlstra April 28, 2017, 1:58 p.m.
On Fri, Apr 28, 2017 at 03:44:23PM +0200, Petr Mladek wrote:
> On Fri 2017-04-28 11:02:26, Peter Zijlstra wrote:
> > On Thu, Apr 27, 2017 at 03:38:19PM +0200, Petr Mladek wrote:
> > > Also we need to look for alternatives. There is a chance
> > > to create crashdump and get the ftrace messages from it.
> > > Also this might be scenario when we might need to suggest
> > > the early_printk() patchset from Peter Zijlstra.
> > 
> > I'd be happy to repost those. I still carry them in my tree.
> 
> You do not need to if they are still the same as
> https://lkml.kernel.org/r/20161018170830.405990950@infradead.org

More or less, I think I fixed an intermediate compile fail reported by
you and the 0-day thing and I wrote a few comments in the last patch.

They live here fwiw:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git debug/tmp

> I rather do not promise anything but I would like to look at them
> within next few weeks (after the merge window).

Sure, I'll post them again during/after the merge window.
Steven Rostedt April 28, 2017, 2:47 p.m.
On Fri, 28 Apr 2017 15:58:10 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Apr 28, 2017 at 03:44:23PM +0200, Petr Mladek wrote:
> > On Fri 2017-04-28 11:02:26, Peter Zijlstra wrote:  
> > > On Thu, Apr 27, 2017 at 03:38:19PM +0200, Petr Mladek wrote:  
> > > > Also we need to look for alternatives. There is a chance
> > > > to create crashdump and get the ftrace messages from it.
> > > > Also this might be scenario when we might need to suggest
> > > > the early_printk() patchset from Peter Zijlstra.  
> > > 
> > > I'd be happy to repost those. I still carry them in my tree.  
> > 
> > You do not need to if they are still the same as
> > https://lkml.kernel.org/r/20161018170830.405990950@infradead.org  
> 
> More or less, I think I fixed an intermediate compile fail reported by
> you and the 0-day thing and I wrote a few comments in the last patch.
> 
> They live here fwiw:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git debug/tmp
> 
> > I rather do not promise anything but I would like to look at them
> > within next few weeks (after the merge window).  
> 
> Sure, I'll post them again during/after the merge window.

BTW, this is what I ended up using to get the debug information I
needed.

-- Steve

Patch hide | download patch | download mbox

diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index 033e50a7d706..c7477654c5b1 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -303,7 +303,10 @@  static int vprintk_nmi(const char *fmt, va_list args)
 {
        struct printk_safe_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
 
-       return printk_safe_log_store(s, fmt, args);
+       if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK)
+               return printk_safe_log_store(s, fmt, args);
+
+       return vprintk_emit(0, LOGLEVEL_SCHED, NULL, 0, fmt, args);
 }

	-ss