Patchwork panic.c: export panic_on_oops

login
register
mail settings
Submitter Artem Bityutskiy
Date Oct. 12, 2009, 11:25 a.m.
Message ID <1255346731.9659.31.camel@localhost>
Download mbox | patch
Permalink /patch/35753/
State New
Headers show

Comments

Artem Bityutskiy - Oct. 12, 2009, 11:25 a.m.
On Mon, 2009-10-12 at 13:15 +0200, ext Ingo Molnar wrote:
> > We use the mtdoops module which stores oopses on the Flash partition,
> > in order to make it possible to analyze them later. And mtdoops needs
> > to know whether 'panic_on_oops' is on or off. Thus, we need this
> > variable to be exported.
> 
> hm, why does it need it? Could you send the patch please that makes use 
> of it (as an easy way to explain the need for the export) - current 
> upstream drivers/mtd/mtdoops.c doesnt need it so it must be some pending 
> change.

Yes, current drivers/mtd/mtdoops.c does not need it, but we have this
patch:

http://lists.infradead.org/pipermail/linux-mtd/2009-October/027450.html

which makes sure mtdoops writes the oops log immediately, because we
have 'panic_on_oops' set so this is our last chance to save the oops.

Here is the patch inlined as well:

Author: Aaro Koskinen <aaro.koskinen@nokia.com>
Date:   Thu Oct 1 18:16:55 2009 +0300

    mtd: mtdoops: do not schedule work if we are going to die

    If panic_on_oops is set, the log should be written right away
    because this is not going to be a second chance.

    Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
    Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Ingo Molnar - Oct. 12, 2009, 11:37 a.m.
* Artem Bityutskiy <dedekind1@gmail.com> wrote:

> On Mon, 2009-10-12 at 13:15 +0200, ext Ingo Molnar wrote:
> > > We use the mtdoops module which stores oopses on the Flash partition,
> > > in order to make it possible to analyze them later. And mtdoops needs
> > > to know whether 'panic_on_oops' is on or off. Thus, we need this
> > > variable to be exported.
> > 
> > hm, why does it need it? Could you send the patch please that makes use 
> > of it (as an easy way to explain the need for the export) - current 
> > upstream drivers/mtd/mtdoops.c doesnt need it so it must be some pending 
> > change.
> 
> Yes, current drivers/mtd/mtdoops.c does not need it, but we have this
> patch:
> 
> http://lists.infradead.org/pipermail/linux-mtd/2009-October/027450.html
> 
> which makes sure mtdoops writes the oops log immediately, because we
> have 'panic_on_oops' set so this is our last chance to save the oops.
> 
> Here is the patch inlined as well:
> 
> Author: Aaro Koskinen <aaro.koskinen@nokia.com>
> Date:   Thu Oct 1 18:16:55 2009 +0300
> 
>     mtd: mtdoops: do not schedule work if we are going to die
> 
>     If panic_on_oops is set, the log should be written right away
>     because this is not going to be a second chance.
> 
>     Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
>     Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
> 
> diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
> index 1060337..ac67833 100644
> --- a/drivers/mtd/mtdoops.c
> +++ b/drivers/mtd/mtdoops.c
> @@ -335,7 +335,7 @@ static void mtdoops_console_sync(void)
>         cxt->ready = 0;
>         spin_unlock_irqrestore(&cxt->writecount_lock, flags);
> 
> -       if (mtd->panic_write && in_interrupt())
> +       if (mtd->panic_write && (in_interrupt() || panic_on_oops))
>                 /* Interrupt context, we're going to panic so try and log */
>                 mtdoops_write(cxt, 1);

Hm, the code seems to be somewhat confused about this. It tries to guess 
when it's panic-ing, right? in_interrupt() is the wrong test for that.

mtdoops_console_sync() gets called by regular printk() as well via:

static void
mtdoops_console_write(struct console *co, const char *s, unsigned int count)
{
        struct mtdoops_context *cxt = co->data;
        struct mtd_info *mtd = cxt->mtd;
        unsigned long flags;

        if (!oops_in_progress) {
                mtdoops_console_sync();
                return;

I think this is all layered incorrectly - because mtdoops (which is a 
_VERY_ cool debugging facility btw - i wish generic x86 had something 
like that) tries to be a 'driver' and tries to achieve these things 
without modifying the core kernel.

But it really should do that. We can certainly enhance the core kernel 
logic to tell the console driver more clearly when printk should go out 
immediately.

Instead of using oops_in_progress it might be better to go into 'sync 
immeditately' mode if the kernel gets tainted. Add a callback for that 
to struct console (in include/linux/console.h). The ->unblank() callback 
is already such a "user attention needed immediately" callback. We could 
add a ->kernel_bug() callback to that.

	Ingo
Simon Kagstrom - Oct. 12, 2009, 12:01 p.m.
(Risking that Artem also replies, I'll bite on this one! Let's hope we
agree at least :-))

On Mon, 12 Oct 2009 13:37:58 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> > -       if (mtd->panic_write && in_interrupt())
> > +       if (mtd->panic_write && (in_interrupt() || panic_on_oops))
> >                 /* Interrupt context, we're going to panic so try and log */
> >                 mtdoops_write(cxt, 1);
> 
> Hm, the code seems to be somewhat confused about this. It tries to guess 
> when it's panic-ing, right? in_interrupt() is the wrong test for that.

Well, the main reason is to get the write done directly if we know
we're going to crash. The rest of the code around the patch looks like
this:

         if (mtd->panic_write && (in_interrupt() || panic_on_oops))
                 /* Interrupt context, we're going to panic so try and log */
                 mtdoops_write(cxt, 1);
         else
                 schedule_work(&cxt->work_write);

so if we're oopsing in interrupt context or are going to panic, we just
write directly. mtdoops_write will then use mtd->panic_write if it's
available to get the write done immediately without sleeping.

> mtdoops_console_sync() gets called by regular printk() as well via:
> 
> static void
> mtdoops_console_write(struct console *co, const char *s, unsigned int count)
> {
>         struct mtdoops_context *cxt = co->data;
>         struct mtd_info *mtd = cxt->mtd;
>         unsigned long flags;
> 
>         if (!oops_in_progress) {
>                 mtdoops_console_sync();
>                 return;
> 
> I think this is all layered incorrectly - because mtdoops (which is a 
> _VERY_ cool debugging facility btw - i wish generic x86 had something 
> like that) tries to be a 'driver' and tries to achieve these things 
> without modifying the core kernel.
> 
> But it really should do that. We can certainly enhance the core kernel 
> logic to tell the console driver more clearly when printk should go out 
> immediately.
> 
> Instead of using oops_in_progress it might be better to go into 'sync 
> immeditately' mode if the kernel gets tainted. Add a callback for that 
> to struct console (in include/linux/console.h). The ->unblank() callback 
> is already such a "user attention needed immediately" callback. We could 
> add a ->kernel_bug() callback to that.

I sent another patch which changes the mtdoops behavior a bit:

  http://patchwork.ozlabs.org/patch/35750/

(the thread with the patch series is here:

  http://lists.infradead.org/pipermail/linux-mtd/2009-October/027576.html)

the change is basically to log all messages in a circular buffer (the
current one only logs messages after an oops has started). It also
needed some restructuring, and I believe parts of the code becomes
simpler that way.

The console write now only adds messages to the buffer (never calls
mtdoops_console_sync), and mtdoops_console_sync (i.e., the ->unblank()
callback) simply becomes

  static void mtdoops_console_sync(void)
  {
	struct mtdoops_context *cxt = &oops_cxt;

	/* Write out the buffer if we are called during an oops */
	if (oops_in_progress)
		schedule_work(&cxt->work_write);
  }

To handle the panic case, I've simply added a panic notifier which does

  static int mtdoops_panic(struct notifier_block *this, unsigned long event,
		void *ptr)
  {
	struct mtdoops_context *cxt = &oops_cxt;

	cancel_work_sync(&cxt->work_write);
	cxt->ready = 0;
	if (cxt->mtd->panic_write)
		mtdoops_write(cxt, 1);
	else
		printk(KERN_WARNING "mtdoops: panic_write is not defined, "
					"cannot store dump from panic\n");

	return NOTIFY_DONE;
  }

So with this one, the exported panic_on_oops is no longer needed, and
normal oopses are handled by the scheduled work while panic_on_oopses
are handled by the panic handler.


Anyway, this particular patch is still up for discussion.

// Simon
Ingo Molnar - Oct. 12, 2009, 12:09 p.m.
* Simon Kagstrom <simon.kagstrom@netinsight.net> wrote:

> (Risking that Artem also replies, I'll bite on this one! Let's hope we
> agree at least :-))
> 
> On Mon, 12 Oct 2009 13:37:58 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > -       if (mtd->panic_write && in_interrupt())
> > > +       if (mtd->panic_write && (in_interrupt() || panic_on_oops))
> > >                 /* Interrupt context, we're going to panic so try and log */
> > >                 mtdoops_write(cxt, 1);
> > 
> > Hm, the code seems to be somewhat confused about this. It tries to guess 
> > when it's panic-ing, right? in_interrupt() is the wrong test for that.
> 
> Well, the main reason is to get the write done directly if we know 
> we're going to crash. The rest of the code around the patch looks like 
> this:
> 
>          if (mtd->panic_write && (in_interrupt() || panic_on_oops))
>                  /* Interrupt context, we're going to panic so try and log */
>                  mtdoops_write(cxt, 1);
>          else
>                  schedule_work(&cxt->work_write);
> 
> so if we're oopsing in interrupt context or are going to panic, we 
> just write directly. mtdoops_write will then use mtd->panic_write if 
> it's available to get the write done immediately without sleeping.

but i'm not sure that code achieves your intention.

in_interrupt() is a generic test. It will be true whenever you printk in 
irq context - be that a panic or not a panic.

Also, the panic_on_oops usage looks wrong as well: it is set on a system 
that wants a panic on oops - but the flag will be set all the time, even 
when we are not oopsing.

I suppose the intention is to add a logic like this:

 - buffer writes to the MTD async writeout thread for regular printks

 - if we are in some sort of emergency, write to the MTD device directly 
   as we cannot buffer anymore.

Correct?

> [...]
>
> To handle the panic case, I've simply added a panic notifier which 
> does
> 
>   static int mtdoops_panic(struct notifier_block *this, unsigned long event,
> 		void *ptr)
>   {
> 	struct mtdoops_context *cxt = &oops_cxt;
> 
> 	cancel_work_sync(&cxt->work_write);
> 	cxt->ready = 0;
> 	if (cxt->mtd->panic_write)
> 		mtdoops_write(cxt, 1);
> 	else
> 		printk(KERN_WARNING "mtdoops: panic_write is not defined, "
> 					"cannot store dump from panic\n");
> 
> 	return NOTIFY_DONE;
>   }
> 
> So with this one, the exported panic_on_oops is no longer needed, and 
> normal oopses are handled by the scheduled work while panic_on_oopses 
> are handled by the panic handler.

Yes, that looks like the better direction - but 'panic' is still the 
wrong trigger condition i think. We generally just crash and dont panic. 
Often we'll display a kernel warning and then hang. Etc.

Also, would it be possible to just simplify the thing and not do any 
buffering at all? Extra buffering complexity in a console driver is only 
asking for trouble. Or is flash storage write cycles optimization that 
important in this case?

	Ingo
David Woodhouse - Oct. 12, 2009, 12:15 p.m.
On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> Also, would it be possible to just simplify the thing and not do any 
> buffering at all? Extra buffering complexity in a console driver is only 
> asking for trouble. Or is flash storage write cycles optimization that 
> important in this case?

That and the fact that on NAND flash you have to write full pages at a
time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. So
we really do want to buffer it where we can.

We don't want to write a 2KiB page for every line of printk output.
Ingo Molnar - Oct. 12, 2009, 12:20 p.m.
* David Woodhouse <dwmw2@infradead.org> wrote:

> On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> > Also, would it be possible to just simplify the thing and not do any 
> > buffering at all? Extra buffering complexity in a console driver is only 
> > asking for trouble. Or is flash storage write cycles optimization that 
> > important in this case?
> 
> That and the fact that on NAND flash you have to write full pages at a 
> time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. 
> So we really do want to buffer it where we can.
> 
> We don't want to write a 2KiB page for every line of printk output.

Then i think the buffering is at the wrong place: we should instead 
buffer in the generic layer and pass it to lowlevel if we know that we 
have gone past a 2K boundary.

The size of the generic log buffer is always a power of two so detecting 
2K boundaries is very easy. On any emergency the generic console layer 
will do faster flushes - this is nothing the console driver itself 
should bother with.

And that would avoid the whole workqueue logic - which is fragile to be 
done in a printk to begin with.

So what we need is an extension to struct console that sets a buffering 
limit. Zero (the default) means unbuffered.

(Btw., things like netconsole might make use of such buffering too.)

Agreed?

	Ingo
Simon Kagstrom - Oct. 12, 2009, 12:27 p.m.
On Mon, 12 Oct 2009 14:09:51 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> > Well, the main reason is to get the write done directly if we know 
> > we're going to crash. The rest of the code around the patch looks like 
> > this:
> > 
> >          if (mtd->panic_write && (in_interrupt() || panic_on_oops))
> >                  /* Interrupt context, we're going to panic so try and log */
> >                  mtdoops_write(cxt, 1);
> >          else
> >                  schedule_work(&cxt->work_write);
> > 
> > so if we're oopsing in interrupt context or are going to panic, we 
> > just write directly. mtdoops_write will then use mtd->panic_write if 
> > it's available to get the write done immediately without sleeping.
> 
> but i'm not sure that code achieves your intention.
> 
> in_interrupt() is a generic test. It will be true whenever you printk in 
> irq context - be that a panic or not a panic.

Well, this particular code is only called when oops_in_progress is set,
so we know we have an oops chasing us. And if we oops in an interrupt,
we're going to panic and the same thing goes if panic_on_oops is set.

So the current code only logs messages which happen when an oops is in
progress.

> Also, the panic_on_oops usage looks wrong as well: it is set on a system 
> that wants a panic on oops - but the flag will be set all the time, even 
> when we are not oopsing.
> 
> I suppose the intention is to add a logic like this:
> 
>  - buffer writes to the MTD async writeout thread for regular printks
> 
>  - if we are in some sort of emergency, write to the MTD device directly 
>    as we cannot buffer anymore.
> 
> Correct?

Almost :-). During an oops, it stores in a memory buffer until we have
either filled the memory buffer, the oops is over, or unblank() is called,
and after that it will write out the buffer to flash.

I believe the other implementation (outlined below) is a bit simpler in
that respect. That will also log all messages (well the last 4-8KB or
so of them), but only write it out when a panic or oops occurs.

> > So with this one, the exported panic_on_oops is no longer needed, and 
> > normal oopses are handled by the scheduled work while panic_on_oopses 
> > are handled by the panic handler.
> 
> Yes, that looks like the better direction - but 'panic' is still the 
> wrong trigger condition i think. We generally just crash and dont panic. 
> Often we'll display a kernel warning and then hang. Etc.

But how can we detect that? The code above will write to the MTD device
either if an oops happens, or if we panic for some reason. If the
kernel just hangs (and is reset by the watchdog, if we have one), how
should we know when to write the log out?


For me personally, the panic case is the important one. Regular oopses
(without panic_on_oops) can go to syslog, but it's important to be able
to catch the cases when the kernel panics as well.

// Simon
Ingo Molnar - Oct. 12, 2009, 12:32 p.m.
* Simon Kagstrom <simon.kagstrom@netinsight.net> wrote:

> > Yes, that looks like the better direction - but 'panic' is still the 
> > wrong trigger condition i think. We generally just crash and dont 
> > panic. Often we'll display a kernel warning and then hang. Etc.
> 
> But how can we detect that? The code above will write to the MTD 
> device either if an oops happens, or if we panic for some reason. If 
> the kernel just hangs (and is reset by the watchdog, if we have one), 
> how should we know when to write the log out?

You shouldnt need to care about that in a console driver - it's up to 
higher layers.

See my reply to David Woodhouse, i think we should add support for 
buffering in kernel/printk.c and that would both fix your problems, 
would simplify the driver (significantly!) and would expose the generic 
buffering capability to other console drivers as well.

	Ingo
David Woodhouse - Oct. 12, 2009, 12:33 p.m.
On Mon, 2009-10-12 at 14:20 +0200, Ingo Molnar wrote:
> * David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> > > Also, would it be possible to just simplify the thing and not do any 
> > > buffering at all? Extra buffering complexity in a console driver is only 
> > > asking for trouble. Or is flash storage write cycles optimization that 
> > > important in this case?
> > 
> > That and the fact that on NAND flash you have to write full pages at a 
> > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. 
> > So we really do want to buffer it where we can.
> > 
> > We don't want to write a 2KiB page for every line of printk output.
> 
> Then i think the buffering is at the wrong place: we should instead 
> buffer in the generic layer and pass it to lowlevel if we know that we 
> have gone past a 2K boundary.
> 
> The size of the generic log buffer is always a power of two so detecting 
> 2K boundaries is very easy. On any emergency the generic console layer 
> will do faster flushes - this is nothing the console driver itself 
> should bother with.
> 
> And that would avoid the whole workqueue logic - which is fragile to be 
> done in a printk to begin with.
> 
> So what we need is an extension to struct console that sets a buffering 
> limit. Zero (the default) means unbuffered.
> 
> (Btw., things like netconsole might make use of such buffering too.)
> 
> Agreed?

Makes some sense, yes.

We also use the workqueue logic to allow us to co-ordinate access to the
hardware properly -- taking locks where appropriate, etc. We can't do
that directly from a console ->write() method.

Some device drivers do provide a ->panic_write() function which breaks
all the locks and just resets the hardware because it knows we're
panicking, but we don't want to do that in the common case.
Ingo Molnar - Oct. 12, 2009, 12:36 p.m.
* David Woodhouse <dwmw2@infradead.org> wrote:

> On Mon, 2009-10-12 at 14:20 +0200, Ingo Molnar wrote:
> > * David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> > > > Also, would it be possible to just simplify the thing and not do any 
> > > > buffering at all? Extra buffering complexity in a console driver is only 
> > > > asking for trouble. Or is flash storage write cycles optimization that 
> > > > important in this case?
> > > 
> > > That and the fact that on NAND flash you have to write full pages at a 
> > > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. 
> > > So we really do want to buffer it where we can.
> > > 
> > > We don't want to write a 2KiB page for every line of printk output.
> > 
> > Then i think the buffering is at the wrong place: we should instead 
> > buffer in the generic layer and pass it to lowlevel if we know that we 
> > have gone past a 2K boundary.
> > 
> > The size of the generic log buffer is always a power of two so detecting 
> > 2K boundaries is very easy. On any emergency the generic console layer 
> > will do faster flushes - this is nothing the console driver itself 
> > should bother with.
> > 
> > And that would avoid the whole workqueue logic - which is fragile to be 
> > done in a printk to begin with.
> > 
> > So what we need is an extension to struct console that sets a buffering 
> > limit. Zero (the default) means unbuffered.
> > 
> > (Btw., things like netconsole might make use of such buffering too.)
> > 
> > Agreed?
> 
> Makes some sense, yes.
> 
> We also use the workqueue logic to allow us to co-ordinate access to 
> the hardware properly -- taking locks where appropriate, etc. We can't 
> do that directly from a console ->write() method.
> 
> Some device drivers do provide a ->panic_write() function which breaks 
> all the locks and just resets the hardware because it knows we're 
> panicking, but we don't want to do that in the common case.

Well other than not using sleeping locks in that codepath it should be 
properly serializable. If the serial driver, netconsole, fbcon and all 
the other non-trivial console drivers can do it then MTD should be able 
to do it too.

Printk via workqueue ... lets not go there, really. (I know you already 
have that but i think it's a mistake on a fundamental level.)

	Ingo
David Woodhouse - Oct. 12, 2009, 12:48 p.m.
On Mon, 2009-10-12 at 14:36 +0200, Ingo Molnar wrote:
> Well other than not using sleeping locks in that codepath it should be 
> properly serializable. If the serial driver, netconsole, fbcon and all 
> the other non-trivial console drivers can do it then MTD should be able 
> to do it too.

It's distinctly non-trivial. A 'real' user of the hardware may have just
triggered a block erase, which could take hundreds (or even thousands)
of milliseconds to complete. We can't always suspend that erase; do we
really want to _wait_ for it?

If we're actually in a _panic_ situation, then yes -- we want to wait,
and that's what the device driver's panic_write() method should do.

But for less critical output? I'm not so sure.

Of course, I'm currently looking at revamping the MTD APIs so they're
not entirely synchronous, and are queue-based like the block device API.
Then the console ->write() method could just queue the write, and we
need some other logic to ensure that it really does _happen_ when the
system crashes. A kind of 'purge queue of console writes' function in a
panic handler... except that as you point out, sometimes we actually
want that even when we haven't actually panicked.
Simon Kagstrom - Oct. 12, 2009, 1:06 p.m.
On Mon, 12 Oct 2009 14:20:23 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> * David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> > > Also, would it be possible to just simplify the thing and not do any 
> > > buffering at all? Extra buffering complexity in a console driver is only 
> > > asking for trouble. Or is flash storage write cycles optimization that 
> > > important in this case?
> > 
> > That and the fact that on NAND flash you have to write full pages at a 
> > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. 
> > So we really do want to buffer it where we can.
> > 
> > We don't want to write a 2KiB page for every line of printk output.
> 
> Then i think the buffering is at the wrong place: we should instead 
> buffer in the generic layer and pass it to lowlevel if we know that we 
> have gone past a 2K boundary.
>
> The size of the generic log buffer is always a power of two so detecting 
> 2K boundaries is very easy. On any emergency the generic console layer 
> will do faster flushes - this is nothing the console driver itself 
> should bother with.

But this is only part of the mtdoops problem (the reason why we don't
write all the time). The current code only stores messages printed
during an oops, and this behavior will surely change if the console
driver gets large buffers of output - or it would have to take in the
output unbuffered anyway.

My patch changes this behavior, and with that I don't think buffered
output would be a problem - it would indeed make it more simple as you
say - assuming there is something like ->kernel_bug() that would flush
the last 4KiB or so of messages to mtdoops when there is an oops or
panic.

> And that would avoid the whole workqueue logic - which is fragile to be 
> done in a printk to begin with.

I'm afraid I don't really see this issue. The workqueue is used to
write the buffer to the mtd device if we are not in a panic or
interrupt context - in which case we do it directly.

So it's only used when an oops is ongoing.

// Simon
Alan Cox - Oct. 12, 2009, 1:08 p.m.
> See my reply to David Woodhouse, i think we should add support for 
> buffering in kernel/printk.c and that would both fix your problems, 
> would simplify the driver (significantly!) and would expose the generic 
> buffering capability to other console drivers as well.

Buffering printk in general is bad. Given a driver needs only to provide
about 6 lines of code using a kfifo is it really that hard for the odd
code that wants to buffer to do that ?
Ingo Molnar - Oct. 12, 2009, 1:15 p.m.
* Simon Kagstrom <simon.kagstrom@netinsight.net> wrote:

> On Mon, 12 Oct 2009 14:20:23 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > * David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote:
> > > > Also, would it be possible to just simplify the thing and not do any 
> > > > buffering at all? Extra buffering complexity in a console driver is only 
> > > > asking for trouble. Or is flash storage write cycles optimization that 
> > > > important in this case?
> > > 
> > > That and the fact that on NAND flash you have to write full pages at a 
> > > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. 
> > > So we really do want to buffer it where we can.
> > > 
> > > We don't want to write a 2KiB page for every line of printk output.
> > 
> > Then i think the buffering is at the wrong place: we should instead 
> > buffer in the generic layer and pass it to lowlevel if we know that we 
> > have gone past a 2K boundary.
> >
> > The size of the generic log buffer is always a power of two so 
> > detecting 2K boundaries is very easy. On any emergency the generic 
> > console layer will do faster flushes - this is nothing the console 
> > driver itself should bother with.
> 
> But this is only part of the mtdoops problem (the reason why we don't 
> write all the time). The current code only stores messages printed 
> during an oops, and this behavior will surely change if the console 
> driver gets large buffers of output - or it would have to take in the 
> output unbuffered anyway.
> 
> My patch changes this behavior, and with that I don't think buffered 
> output would be a problem - it would indeed make it more simple as you 
> say - assuming there is something like ->kernel_bug() that would flush 
> the last 4KiB or so of messages to mtdoops when there is an oops or 
> panic.
> 
> > And that would avoid the whole workqueue logic - which is fragile to 
> > be done in a printk to begin with.
> 
> I'm afraid I don't really see this issue. The workqueue is used to 
> write the buffer to the mtd device if we are not in a panic or 
> interrupt context - in which case we do it directly.
> 
> So it's only used when an oops is ongoing.

This fixation on 'panic' is so wrong!

90% of the bugs users care about dont involve any panic. And even if 
there is a panic down the line, most of the interesting messages are in 
the stream leading up to the panic - now tucked away in that async 
workqueue mechanism and not visible.

There's two clean solutions i think:

1) add some new "ok, there's trouble!" callback to struct console and 
   the console driver could via that mechanism send out the _last_ 2KB 
   (or more) of kernel log messages. Basically we can go back in time by 
   looking at the dmesg buffer. The low level console driver does not 
   need to 'follow' the high level console state - it only wants to 
   print in case of trouble anyway.

2) or add buffered (flash-friendly) writes for all printk output - panic
   and non-panic alike. This would be useful to debug suspend/resume
   bugs for example. This would also optimize the packets of netconsole
   output. (last i checked we sent a packet per line.)

The workqueue looks wrong in both variants. If we are panic-ing (or 
hanging, or ...) then we are halting the machine - the workqueue has no 
chance to actually execute.

	Ingo
Ingo Molnar - Oct. 12, 2009, 1:25 p.m.
* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> > See my reply to David Woodhouse, i think we should add support for 
> > buffering in kernel/printk.c and that would both fix your problems, 
> > would simplify the driver (significantly!) and would expose the 
> > generic buffering capability to other console drivers as well.
> 
> Buffering printk in general is bad. [...]

The general (and default) case would be 0 buffering - i.e. finegrained 
per line calls to ->console_write().

This is the common case indeed, we want to get console output out as 
soon as possible and as finegrained as possible - as we dont know when a 
failure mode removes our ability to print anything else.

My argument is that instead of a complicated dance of workqueue versus 
non-workqueue printk support in the MTD code in drivers/mtd/mtdoops.c, 
this should be done at the generic console level.

It's not hard, nor complex if done at the right level, nor does it 
impact the regular zero-buffering codepath in a significant way - and 
the end result would be a significantly simpler (and, in turn, more 
robust) MTD printk driver.

I care about this because i still havent given up hope that the company 
you are working for will finally give us some permanent storage in the 
CPU itself, so that we can have cross-reboot printk buffering ;-) If 
that storage is in the form of Flash, then buffering (to optimize write 
cycles) is probably a must.

> [...] Given a driver needs only to provide about 6 lines of code using 
> a kfifo is it really that hard for the odd code that wants to buffer 
> to do that ?

Avoiding a workqueue in printk is about the critical path of failure and 
about complexity in general.

I dont see how kfifo helps here much - kfifo is really just a relatively 
simple dynamic memory buffer abstraction - while most of the complexity 
here is elsewhere. Could you explain what you meant with kfifo here 
please?

	Ingo
David Woodhouse - Oct. 12, 2009, 1:32 p.m.
On Mon, 2009-10-12 at 15:25 +0200, Ingo Molnar wrote:
> 
> I care about this because i still havent given up hope that the company 
> you are working for will finally give us some permanent storage in the 
> CPU itself, so that we can have cross-reboot printk buffering ;-) 

Why would it have to be in the CPU? If we had decent firmware, couldn't
we just look at the old kernel's log_buf in RAM?

On machines where coreboot is supported (which is mostly machines
_other_ than those from my employer, unfortunately), you ought to be
able to do this today.
Simon Kagstrom - Oct. 12, 2009, 1:39 p.m.
OK, I don't think we understand each other. Sorry if I'm being slow
here, please tell me if I'm misunderstanding something fundamental
below.

On Mon, 12 Oct 2009 15:15:29 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> > I'm afraid I don't really see this issue. The workqueue is used to 
> > write the buffer to the mtd device if we are not in a panic or 
> > interrupt context - in which case we do it directly.
> > 
> > So it's only used when an oops is ongoing.
> 
> This fixation on 'panic' is so wrong!
> 
> 90% of the bugs users care about dont involve any panic. And even if 
> there is a panic down the line, most of the interesting messages are in 
> the stream leading up to the panic - now tucked away in that async 
> workqueue mechanism and not visible.

Well, this is what my patch [1] aims to fix. What it does is to put all
messages in a circular buffer, and when an oops or panic occurs it
writes them out. The current version only collects messages _during_ an
oops. I'll rework it with using kfifo as per Alans suggestion though.

Neither the current code nor the new patch has them stored in the work
queue during a panic though. If this happens, they will call
panic_write (if it's available) to write it out directly.

> There's two clean solutions i think:
> 
> 1) add some new "ok, there's trouble!" callback to struct console and 
>    the console driver could via that mechanism send out the _last_ 2KB 
>    (or more) of kernel log messages. Basically we can go back in time by 
>    looking at the dmesg buffer. The low level console driver does not 
>    need to 'follow' the high level console state - it only wants to 
>    print in case of trouble anyway.
>
> 2) or add buffered (flash-friendly) writes for all printk output - panic
>    and non-panic alike. This would be useful to debug suspend/resume
>    bugs for example. This would also optimize the packets of netconsole
>    output. (last i checked we sent a packet per line.)

Well, suspend/resume hangs is one of the cases which mtdoops won't
catch. But at least on NAND flash, I'd be a bit weary about logging all
printk output for fear of wearing out the flash.

> The workqueue looks wrong in both variants. If we are panic-ing (or 
> hanging, or ...) then we are halting the machine - the workqueue has no 
> chance to actually execute.

but then we are using mtd->panic_write to write it out directly, not
via the work queue.

// Simon

[1] http://patchwork.ozlabs.org/patch/35750/
Arjan van de Ven - Oct. 12, 2009, 2:12 p.m.
On Mon, 12 Oct 2009 13:37:58 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Artem Bityutskiy <dedekind1@gmail.com> wrote:
> 
> > On Mon, 2009-10-12 at 13:15 +0200, ext Ingo Molnar wrote:
> > > > We use the mtdoops module which stores oopses on the Flash
> > > > partition, in order to make it possible to analyze them later.
> > > > And mtdoops needs to know whether 'panic_on_oops' is on or off.
> > > > Thus, we need this variable to be exported.
> > > 
> > > hm, why does it need it? Could you send the patch please that
> > > makes use of it (as an easy way to explain the need for the
> > > export) - current upstream drivers/mtd/mtdoops.c doesnt need it
> > > so it must be some pending change.
> > 
> > Yes, current drivers/mtd/mtdoops.c does not need it, but we have
> > this patch:
> > 
> > http://lists.infradead.org/pipermail/linux-mtd/2009-October/027450.html
> > 
> > which makes sure mtdoops writes the oops log immediately, because we
> > have 'panic_on_oops' set so this is our last chance to save the
> > oops.
> > 
> > Here is the patch inlined as well:
> > 
> > Author: Aaro Koskinen <aaro.koskinen@nokia.com>
> > Date:   Thu Oct 1 18:16:55 2009 +0300
> > 
> >     mtd: mtdoops: do not schedule work if we are going to die
> > 
> >     If panic_on_oops is set, the log should be written right away
> >     because this is not going to be a second chance.
> > 
> >     Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
> >     Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
> > 
> > diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
> > index 1060337..ac67833 100644
> > --- a/drivers/mtd/mtdoops.c
> > +++ b/drivers/mtd/mtdoops.c
> > @@ -335,7 +335,7 @@ static void mtdoops_console_sync(void)
> >         cxt->ready = 0;
> >         spin_unlock_irqrestore(&cxt->writecount_lock, flags);
> > 
> > -       if (mtd->panic_write && in_interrupt())
> > +       if (mtd->panic_write && (in_interrupt() || panic_on_oops))
> >                 /* Interrupt context, we're going to panic so try
> > and log */ mtdoops_write(cxt, 1);
> 
> Hm, the code seems to be somewhat confused about this. It tries to
> guess when it's panic-ing, right? in_interrupt() is the wrong test
> for that.
> 
> mtdoops_console_sync() gets called by regular printk() as well via:
> 
> static void
> mtdoops_console_write(struct console *co, const char *s, unsigned int
> count) {
>         struct mtdoops_context *cxt = co->data;
>         struct mtd_info *mtd = cxt->mtd;
>         unsigned long flags;
> 
>         if (!oops_in_progress) {
>                 mtdoops_console_sync();
>                 return;
> 
> I think this is all layered incorrectly - because mtdoops (which is a 
> _VERY_ cool debugging facility btw - i wish generic x86 had something 
> like that) tries to be a 'driver' and tries to achieve these things 
> without modifying the core kernel.
> 
> But it really should do that. We can certainly enhance the core
> kernel logic to tell the console driver more clearly when printk
> should go out immediately.
> 
> Instead of using oops_in_progress it might be better to go into 'sync 
> immeditately' mode if the kernel gets tainted.

actually there are very nice "begin of oops" and "end of oops"
functions... they are the perfect places for this ;-)
Ingo Molnar - Oct. 12, 2009, 2:26 p.m.
* David Woodhouse <dwmw2@infradead.org> wrote:

> On Mon, 2009-10-12 at 15:25 +0200, Ingo Molnar wrote:
> > 
> > I care about this because i still havent given up hope that the 
> > company you are working for will finally give us some permanent 
> > storage in the CPU itself, so that we can have cross-reboot printk 
> > buffering ;-)
> 
> Why would it have to be in the CPU? If we had decent firmware, 
> couldn't we just look at the old kernel's log_buf in RAM?

Not if the failure is say a s2ram hang that requires a power cycle. Also 
there are certain classes of bugs that only occur on cold boot. Plus 
there's the "need to unplug the battery to revive the system" class of 
bugs (but they are rare).

So i think the MTD / flash stuff is powerful.

	Ingo
Ingo Molnar - Oct. 12, 2009, 2:30 p.m.
* Simon Kagstrom <simon.kagstrom@netinsight.net> wrote:

> OK, I don't think we understand each other. Sorry if I'm being slow 
> here, please tell me if I'm misunderstanding something fundamental 
> below.

[ it could easily be me being confused - i dont know the mtdoops code 
  that well - i just raised an eyebrow at the export request, which 
  yelled 'layering violation' at me ;-) ]

> On Mon, 12 Oct 2009 15:15:29 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > I'm afraid I don't really see this issue. The workqueue is used to 
> > > write the buffer to the mtd device if we are not in a panic or 
> > > interrupt context - in which case we do it directly.
> > > 
> > > So it's only used when an oops is ongoing.
> > 
> > This fixation on 'panic' is so wrong!
> > 
> > 90% of the bugs users care about dont involve any panic. And even if 
> > there is a panic down the line, most of the interesting messages are in 
> > the stream leading up to the panic - now tucked away in that async 
> > workqueue mechanism and not visible.
> 
> Well, this is what my patch [1] aims to fix. What it does is to put 
> all messages in a circular buffer, and when an oops or panic occurs it 
> writes them out. The current version only collects messages _during_ 
> an oops. I'll rework it with using kfifo as per Alans suggestion 
> though.
> 
> Neither the current code nor the new patch has them stored in the work 
> queue during a panic though. If this happens, they will call 
> panic_write (if it's available) to write it out directly.
> 
> > There's two clean solutions i think:
> > 
> > 1) add some new "ok, there's trouble!" callback to struct console and 
> >    the console driver could via that mechanism send out the _last_ 2KB 
> >    (or more) of kernel log messages. Basically we can go back in time by 
> >    looking at the dmesg buffer. The low level console driver does not 
> >    need to 'follow' the high level console state - it only wants to 
> >    print in case of trouble anyway.
> >
> > 2) or add buffered (flash-friendly) writes for all printk output - panic
> >    and non-panic alike. This would be useful to debug suspend/resume
> >    bugs for example. This would also optimize the packets of netconsole
> >    output. (last i checked we sent a packet per line.)
> 
> Well, suspend/resume hangs is one of the cases which mtdoops won't 
> catch. [...]

( Sidenote: i see no reason why that wouldnt be possible if it's 
  implemented properly. )

> [...] But at least on NAND flash, I'd be a bit weary about logging all 
> printk output for fear of wearing out the flash.

Clearly should be optional - like the s2ram debug hack to RTC registers 
is optional on x86.

> > The workqueue looks wrong in both variants. If we are panic-ing (or 
> > hanging, or ...) then we are halting the machine - the workqueue has 
> > no chance to actually execute.
> 
> but then we are using mtd->panic_write to write it out directly, not 
> via the work queue.

... i might be confused, but in which case _is_ the workqueue used?

It clearly shows up in the codepaths i've read, but maybe i've 
misinterpreted what it does.

	Ingo
David Woodhouse - Oct. 12, 2009, 2:36 p.m.
On Mon, 2009-10-12 at 16:26 +0200, Ingo Molnar wrote:
> Not if the failure is say a s2ram hang that requires a power cycle. Also 
> there are certain classes of bugs that only occur on cold boot. Plus 
> there's the "need to unplug the battery to revive the system" class of 
> bugs (but they are rare).

So you need to build in enough ECC to cope with the decay which happens
when RAM isn't being refreshed for a few seconds... :)

> So i think the MTD / flash stuff is powerful. 

Yeah, definitely. I was just pointing out that we can actually do a lot
better on today's commodity hardware too.
Simon Kagstrom - Oct. 12, 2009, 3:01 p.m.
On Mon, 12 Oct 2009 16:30:17 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> * Simon Kagstrom <simon.kagstrom@netinsight.net> wrote:
> 
> > OK, I don't think we understand each other. Sorry if I'm being slow 
> > here, please tell me if I'm misunderstanding something fundamental 
> > below.
> 
> [ it could easily be me being confused - i dont know the mtdoops code 
>   that well - i just raised an eyebrow at the export request, which 
>   yelled 'layering violation' at me ;-) ]

I sent the same patch (export panic_on_oops) to LKML a week ago, but
got no replies. I now know who to Cc next time ;-)

> > > 2) or add buffered (flash-friendly) writes for all printk output - panic
> > >    and non-panic alike. This would be useful to debug suspend/resume
> > >    bugs for example. This would also optimize the packets of netconsole
> > >    output. (last i checked we sent a packet per line.)
> > 
> > Well, suspend/resume hangs is one of the cases which mtdoops won't 
> > catch. [...]
> 
> ( Sidenote: i see no reason why that wouldnt be possible if it's 
>   implemented properly. )

Provided there is a callback for "really_dump_the_console" which gets
called from interesting places (e.g., suspend/resume) it should be easy
to do with the patched mtdoops without much changes, at least with my
proposed circular buffer patch it will be trivial.

> > [...] But at least on NAND flash, I'd be a bit weary about logging all 
> > printk output for fear of wearing out the flash.
> 
> Clearly should be optional - like the s2ram debug hack to RTC registers 
> is optional on x86.

That would be OK for me at least, and again should not be very difficult
to implement even with todays code.

> > > The workqueue looks wrong in both variants. If we are panic-ing (or 
> > > hanging, or ...) then we are halting the machine - the workqueue has 
> > > no chance to actually execute.
> > 
> > but then we are using mtd->panic_write to write it out directly, not 
> > via the work queue.
> 
> ... i might be confused, but in which case _is_ the workqueue used?
> 
> It clearly shows up in the codepaths i've read, but maybe i've 
> misinterpreted what it does.

With the code in mainline, it basically works like this. 

- When oops_in_progress is not set, mtdoops_console_write will not put
  anything in the buffer. It will call mtdoops_console_sync(), but
  since the buffer will be empty (cxt->writecount is 0), no writes will
  be done.

- When oops_in_progress _is_ set, mtdoops_console_write will start
  putting things in the buffer.

- mtdoops_console_sync is then called either when the buffer has been
  filled, or when ->unblank() is called, or when oops_in_progress is no
  longer true.

  This is the place when the workqueue _can_ be used (in the cases when
  this is not in interrupt context or panic_on_oops is unset).

With my patch it instead works like this:

- mtdoops_console_write continuously writes messages to the buffer, but
  never calls mtdoops_console_sync() itself.

- mtdoops_console_sync (i.e., the ->unblank() callback) will schedule
  work if oops_in_progress is set.

- if we have a panic, it will call mtdoops_write directly (if
  mtd->panic_write is set, otherwise we are out of luck). This is also
  the code path on oopses in interrupt context.

So the workqueue only gets used on unblank() from oopses. I think the
second implementation is simpler, but it also changes the behavior of
mtdoops a bit to include messages before the oops/panic as well.

// Simon
Ingo Molnar - Oct. 12, 2009, 3:14 p.m.
* David Woodhouse <dwmw2@infradead.org> wrote:

> On Mon, 2009-10-12 at 16:26 +0200, Ingo Molnar wrote:

> > Not if the failure is say a s2ram hang that requires a power cycle. 
> > Also there are certain classes of bugs that only occur on cold boot. 
> > Plus there's the "need to unplug the battery to revive the system" 
> > class of bugs (but they are rare).
> 
> So you need to build in enough ECC to cope with the decay which 
> happens when RAM isn't being refreshed for a few seconds... :)

[ hey, i think you should line up with BIOS writers at that wall ;-) ]

> > So i think the MTD / flash stuff is powerful.
> 
> Yeah, definitely. I was just pointing out that we can actually do a 
> lot better on today's commodity hardware too.

I wish it worked on any of the 10+ x86 systems i have. Is there anyone 
who'd be interested in exploring whether warm BIOS reboots work 
_anywhere_?

A simple patch with a new (default-off) CONFIG_DEBUG_ feature that just 
puts a signature into a predictable spot in RAM, switches the reboot 
method over to warm reboot (reboot=w) and prints some friendly "yay, 
this BIOS rocks!" message if the signature is still there after a reboot 
and not zeroed out.

If that works _anywhere_ we could complete it: we could cache the dmesg 
buffer address (__log_buf[]) across reboots (and maybe the printk tail 
offset (log_end)), and that would be an _excellent_ debuggability 
feature for a large class of otherwise undebuggable crashes ...

We could use that to preserve a kernel function trace (or a branch 
execution hardware trace using BTS on Intel CPUs) across crashes, etc. 
etc.

	Ingo
Ingo Molnar - Oct. 12, 2009, 3:23 p.m.
* Simon Kagstrom <simon.kagstrom@netinsight.net> wrote:

> With my patch it instead works like this:
> 
> - mtdoops_console_write continuously writes messages to the buffer, but
>   never calls mtdoops_console_sync() itself.
> 
> - mtdoops_console_sync (i.e., the ->unblank() callback) will schedule
>   work if oops_in_progress is set.
> 
> - if we have a panic, it will call mtdoops_write directly (if
>   mtd->panic_write is set, otherwise we are out of luck). This is also
>   the code path on oopses in interrupt context.
> 
> So the workqueue only gets used on unblank() from oopses. I think the 
> second implementation is simpler, but it also changes the behavior of 
> mtdoops a bit to include messages before the oops/panic as well.

The main printk principle is simplicity:

- The simpler the printk codepath, the higher the chances that we still
  are within the window of opportunity to get anything out to the user.

- Furthermore, the simpler the printk codepath, the larger the window of
  opportunity is to begin with: we rely on less external state, so we
  have a smaller surface of interaction that might break printk.

In that sense, i think your modified workqueue use is less wrong than 
what is in current mainline, but i'm afraid it's still wrong.

Why use a workqueue on unblank()? Why use a workqueue _at all_? (If we 
piggyback to any kernel thread we could as well piggyback to syslog 
itself - and we all know how often syslog fails at capturing oopses.)

The mtdoops console driver only seems to act if there's an emergency - 
and in emergencies we really _never_ want to do complex things like 
using a workqueue thread.

( Sidenote: i proffer that we dont want to use a workqueue in the 
  'regular' printk case either - but that seems to be irrelevant here as 
  mtdoops does not seem to save / care about regular non-emergency 
  printks. )

	Ingo
Linus Torvalds - Oct. 12, 2009, 3:36 p.m.
On Mon, 12 Oct 2009, Simon Kagstrom wrote:
> 
> Well, this is what my patch [1] aims to fix. What it does is to put all
> messages in a circular buffer, and when an oops or panic occurs it
> writes them out.

Umm. That's wrong. We already _have_ the circular buffer.

It's called 'log_buf'.

I agree with the "save kernel buffer on panic" thing, but I disagree with 
making it anything new, and hooking into "printk()" or the console 
subsystem AT ALL. That's just bogus, stupid, and WRONG.

What you can do is to just flush the 'log_buf' buffer (or as much of it as 
you want - the buffer may be a megabyte in size, and maybe you only want 
to flush the last 8kB or something like that) on oops. And _not_ mix this 
up with anything else. It's a really simple circular buffer, which just 
has

 - log_buf: buffer start
 - log_end: number of characters ever seen
 - log_buf_len: size of buffer (guaranteed to be a power-of-2)

so it's literally as easy as looking at those three values (there's a few 
more that you _can_ look at, but they'd not likely be relevant for a 
"panic_on_oops" thing)

> The current version only collects messages _during_ an
> oops. I'll rework it with using kfifo as per Alans suggestion though.

Don't. kfifo's aren't going to help. You're doing this at all the wrong 
levels ENTIRELY, and we already have the buffer you want to flush.

			Linus
Linus Torvalds - Oct. 12, 2009, 3:44 p.m.
On Mon, 12 Oct 2009, Linus Torvalds wrote:
> 
> Don't. kfifo's aren't going to help. You're doing this at all the wrong 
> levels ENTIRELY, and we already have the buffer you want to flush.

Btw, a few simple rules:

 - if you need to make your device look like a "console device" for 
   dumping at oops time, you're doing things wrong. You don't want line 
   buffered output to begin with, and you don't want to see each line, you 
   only want this at exceptional points.

 - if you need to look at "in_interrupt()" or "panic_on_oops", you're 
   doing things wrong.

 - if you add your own buffers, you're doing things wrong. I have 
   CONFIG_LOG_BUF_SHIFT=18 in my kernel, so I've already got 256kB worth 
   of memory allocated for kernel messages. Sometimes I increase that 
   further, just because I do some silly printk debugging.

IOW, just add a very simple "flush the dmesg buffer on oops" callback to 
the end of the oops printout code (just a single call after the oops thing 
is now known to be in the dmesg buffers!)

It's not just oopses, btw. Maybe people would like to do this as the last 
stage of a reboot/shutdown too. Because some of the final printouts from 
the shutdown will never make it to disk, because 'ksyslogd' has been 
killed, and the root filesystem has been turned read-only.

			Linus
Artem Bityutskiy - Oct. 12, 2009, 5:29 p.m.
On Mon, 2009-10-12 at 08:44 -0700, Linus Torvalds wrote:
> 
> On Mon, 12 Oct 2009, Linus Torvalds wrote:
> > 
> > Don't. kfifo's aren't going to help. You're doing this at all the wrong 
> > levels ENTIRELY, and we already have the buffer you want to flush.
> 
> Btw, a few simple rules:
> 
>  - if you need to make your device look like a "console device" for 
>    dumping at oops time, you're doing things wrong. You don't want line 
>    buffered output to begin with, and you don't want to see each line, you 
>    only want this at exceptional points.
> 
>  - if you need to look at "in_interrupt()" or "panic_on_oops", you're 
>    doing things wrong.
> 
>  - if you add your own buffers, you're doing things wrong. I have 
>    CONFIG_LOG_BUF_SHIFT=18 in my kernel, so I've already got 256kB worth 
>    of memory allocated for kernel messages. Sometimes I increase that 
>    further, just because I do some silly printk debugging.
> 
> IOW, just add a very simple "flush the dmesg buffer on oops" callback to 
> the end of the oops printout code (just a single call after the oops thing 
> is now known to be in the dmesg buffers!)
> 
> It's not just oopses, btw. Maybe people would like to do this as the last 
> stage of a reboot/shutdown too. Because some of the final printouts from 
> the shutdown will never make it to disk, because 'ksyslogd' has been 
> killed, and the root filesystem has been turned read-only.

Yes, it is difficult to disagree with this. As Ingo also pointed,
pretending to be a console driver is ugly, although I think it
was a clever way to avoid touching generic code.

But mtdoops tries to solves the following problem. What if we are
oopsing in an interrupt, which interrupted the mtd driver, so we have
all the locks held, and the mtd driver is in a unexpected stage ATM? Or
what if we are oopsing in the mtd driver, or in something which was
called by the MTD driver.?

This is what all that "workqueue vs. mtd->panic_write()" logic is about,
which Ingo dislikes.

The logic of mtdoops (my interpretation):

1. OK, there was an oops, we have it in a buffer and we need to write
   it to flash now.
2. If we are in interrupt context, we have to write right now, because
   the system is about to die very soon. And must use special
   'mtd->panic_write()' call.
3. If we have 'panic_on_oops' set, the system will be stopped soon. So
   we also have to write now, because this is our last chance. And we
   also have to use 'mtd->panic_write()'.
4. Otherwise, it is probable that system will be somewhat alive, and we
   may schedule the write instead of using 'mtd->panic_write()', because
   'mtd->panic_write()' abuses locking and chip state, and will screw up
   the mtd flash driver. So we prefer to schedule the write instead.

I guess similar could be done in 'panic()' instead. Or you have a more
elegant solution?
Linus Torvalds - Oct. 12, 2009, 5:43 p.m.
On Mon, 12 Oct 2009, Artem Bityutskiy wrote:
> 
> But mtdoops tries to solves the following problem. What if we are
> oopsing in an interrupt, which interrupted the mtd driver, so we have
> all the locks held, and the mtd driver is in a unexpected stage ATM? Or
> what if we are oopsing in the mtd driver, or in something which was
> called by the MTD driver.?

Well, quite frankly, if you have an oops while holding a spinlock, then 
the machine is dead _anyway_. 

So what I would suggest is to just ignore the above problem. No amount of 
workqueue logic will help it - if the oops happened while an interrupt 
held a critical mtd lock, that lock will _never_ be released, so exactly 
what would be helped?

Now, I realize that _if_ you treat mtdoops as a 'console' layer, then you 
need to do that crazy thing, because you still want the oops to print out 
to the other consoles, and you're only getting data one line at a time. 
But since that was the wrong thing to do for a lot of other reasons 
anyway, that's not a very good argument.

Once you do the final flush in a controlled place _after_ you've printed 
out all the oops information, you simply don't care about locks any more. 
Because if you were holding critical locks, you're done anyway.

Sure, maybe you want to do a "trylock()" and skip the oops flush entirely 
in the mtd layer if you can't do it, but it's the "let's use a workqueue" 
or something that doesn't seem to make a lot of sense to me.

			Linus
Linus Torvalds - Oct. 12, 2009, 5:46 p.m.
On Mon, 12 Oct 2009, Linus Torvalds wrote:
> 
> Once you do the final flush in a controlled place _after_ you've printed 
> out all the oops information, you simply don't care about locks any more. 
> Because if you were holding critical locks, you're done anyway.
> 
> Sure, maybe you want to do a "trylock()" and skip the oops flush entirely 
> in the mtd layer if you can't do it, but it's the "let's use a workqueue" 
> or something that doesn't seem to make a lot of sense to me.

Side note: I don't actually care deeply. Once it's all inside some driver, 
and once it's not messing up the console layer, I don't think the small 
details matter all that much. I just think it's likely to be a sign of 
something wrong if you need to use workqueues to flush - it probably means 
that the most interesting oopses will never make it to the mtd device in 
the first place.

			Linus
Andrew Morton - Oct. 12, 2009, 6:09 p.m.
On Mon, 12 Oct 2009 08:36:38 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I agree with the "save kernel buffer on panic" thing, but I disagree with 
> making it anything new, and hooking into "printk()" or the console 
> subsystem AT ALL. That's just bogus, stupid, and WRONG.
> 
> What you can do is to just flush the 'log_buf' buffer (or as much of it as 
> you want - the buffer may be a megabyte in size, and maybe you only want 
> to flush the last 8kB or something like that) on oops. And _not_ mix this 
> up with anything else.

What he said.  I did it that way in the Digeo kernel back in 2002. 
Worked good.  

Doing it via a console is rather weird.  It will need core kernel
changes to do it properly.

Perhaps oops_enter() is a good place to mark the start of the log, and
flush it within oops_exit().
Ingo Molnar - Oct. 12, 2009, 6:23 p.m.
* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 12 Oct 2009 08:36:38 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > I agree with the "save kernel buffer on panic" thing, but I disagree with 
> > making it anything new, and hooking into "printk()" or the console 
> > subsystem AT ALL. That's just bogus, stupid, and WRONG.
> > 
> > What you can do is to just flush the 'log_buf' buffer (or as much of it as 
> > you want - the buffer may be a megabyte in size, and maybe you only want 
> > to flush the last 8kB or something like that) on oops. And _not_ mix this 
> > up with anything else.
> 
> What he said.  I did it that way in the Digeo kernel back in 2002. 
> Worked good.
> 
> Doing it via a console is rather weird.  It will need core kernel 
> changes to do it properly.
> 
> Perhaps oops_enter() is a good place to mark the start of the log, and 
> flush it within oops_exit().

Simplest would be to do the last 2K in oops_exit()? That gives the oops, 
and the history leading up to it. Since the blocking is 2K, the extra 
log output is for free.

(unless the oops is larger than 2K - but that is rather rare.)

	Ingo
Carl-Daniel Hailfinger - Oct. 12, 2009, 6:32 p.m.
On 12.10.2009 17:14, Ingo Molnar wrote:
> * David Woodhouse <dwmw2@infradead.org> wrote:
>   
>> On Mon, 2009-10-12 at 16:26 +0200, Ingo Molnar wrote:
>>     
>>> Not if the failure is say a s2ram hang that requires a power cycle. 
>>> Also there are certain classes of bugs that only occur on cold boot. 
>>> Plus there's the "need to unplug the battery to revive the system" 
>>> class of bugs (but they are rare).
>>>       
>> So you need to build in enough ECC to cope with the decay which 
>> happens when RAM isn't being refreshed for a few seconds... :)
>>     
>
> [ hey, i think you should line up with BIOS writers at that wall ;-) ]
>   

Not all of us x86 firmware writers are evil.


>>> So i think the MTD / flash stuff is powerful.
>>>       
>> Yeah, definitely. I was just pointing out that we can actually do a 
>> lot better on today's commodity hardware too.
>>     
>
> I wish it worked on any of the 10+ x86 systems i have. Is there anyone 
> who'd be interested in exploring whether warm BIOS reboots work 
> _anywhere_?
>   

AFAIK memory clearing is default off in coreboot for non-ECC RAM and
default on for ECC RAM (to avoid parity errors on read, but that can
probably be worked around). Unless I'm mistaken, the SeaBIOS BIOS
compatibility layer on top of coreboot doesn't erase RAM at all, so
contents can survive.
No idea about classic AMI/Award/Phoenix/Insyde/whatever BIOS, though.


> A simple patch with a new (default-off) CONFIG_DEBUG_ feature that just 
> puts a signature into a predictable spot in RAM, switches the reboot 
> method over to warm reboot (reboot=w) and prints some friendly "yay, 
> this BIOS rocks!" message if the signature is still there after a reboot 
> and not zeroed out.
>
> If that works _anywhere_ we could complete it: we could cache the dmesg 
> buffer address (__log_buf[]) across reboots (and maybe the printk tail 
> offset (log_end)), and that would be an _excellent_ debuggability 
> feature for a large class of otherwise undebuggable crashes ...
>
> We could use that to preserve a kernel function trace (or a branch 
> execution hardware trace using BTS on Intel CPUs) across crashes, etc. 
> etc.
>   

Since we're discussing log buffers anyway, does it make sense to have
this feature interact with the coreboot log buffer which is passed on to
the OS (no official patches for that one, yet)? Basically, coreboot has
its own log buffer where it stores the hardware init diagnostic messages
(very similar to the Linux kernel message buffer) and it could make
sense to use the same memory area for both purposes (if you can deal
with coreboot messages being absent after a kernel Oops).

Thoughts?

Regards,
Carl-Daniel
Andrew Morton - Oct. 12, 2009, 6:36 p.m.
On Mon, 12 Oct 2009 20:23:46 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Mon, 12 Oct 2009 08:36:38 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > 
> > > I agree with the "save kernel buffer on panic" thing, but I disagree with 
> > > making it anything new, and hooking into "printk()" or the console 
> > > subsystem AT ALL. That's just bogus, stupid, and WRONG.
> > > 
> > > What you can do is to just flush the 'log_buf' buffer (or as much of it as 
> > > you want - the buffer may be a megabyte in size, and maybe you only want 
> > > to flush the last 8kB or something like that) on oops. And _not_ mix this 
> > > up with anything else.
> > 
> > What he said.  I did it that way in the Digeo kernel back in 2002. 
> > Worked good.
> > 
> > Doing it via a console is rather weird.  It will need core kernel 
> > changes to do it properly.
> > 
> > Perhaps oops_enter() is a good place to mark the start of the log, and 
> > flush it within oops_exit().
> 
> Simplest would be to do the last 2K in oops_exit()? That gives the oops, 
> and the history leading up to it. Since the blocking is 2K, the extra 
> log output is for free.
> 
> (unless the oops is larger than 2K - but that is rather rare.)
> 

Some oops traces can be pretty large.  Perhaps "oops_enter minus 1k up
to oops_exit".

The digeo kernel later got changed to package the oops into a binary
record format for logging to nvram.  iirc that was for space reasons,
and for ease of downstream analysis.

That kernel used to be downloadable but I can't immediately find it. 
Probably not very interesting anyway.
Linus Torvalds - Oct. 12, 2009, 6:45 p.m.
On Mon, 12 Oct 2009, Ingo Molnar wrote:
> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > Perhaps oops_enter() is a good place to mark the start of the log, and 
> > flush it within oops_exit().
> 
> Simplest would be to do the last 2K in oops_exit()? That gives the oops, 
> and the history leading up to it. Since the blocking is 2K, the extra 
> log output is for free.

I agree, except I don't think it should be fixed to 2k.

We should just dump as much as is "appropriate" for the dump device. It 
might be the last 2kB, it might be 8kB, it might be 64kB. We don't know, 
we don't care. The device may have its own per-device limits. Any extra 
data we get from before the oops is just gravy (often there might be 
interestign warning messages leadign up to the dump), and if the oops is 
too big for the dump device, it's not something we can do anything about 
anyway.

So the logic should literally be something like this:

 - kernel/printk.c:

	void dump_kmsg(void)
	{
		unsigned long len = ACCESS_ONCE(log_end);
		struct dump_device *dump;
		const char *s1, *s2;
		unsigned long l1, l2;

		s1 = "";
		l1 = 0;
		s2 = log_buf;
		l2 = len;

		/* Have we rotated around the circular buffer? */
		if (len > log_buf_len) {
			unsigned long pos = (len & LOG_BUF_MASK);

			s1 = log_buf + pos;
			l1 = log_buf_len - pos;

			s2 = log_buf;
			l2 = pos;
		}

		list_for_each_entry (dump, dump_list, list) {
			dump->fn(s1, l1, s2, l2);
		}
	}

ie we just always give the whole buffer (as two "sections", since it's a 
circular buffer) to the dumper, and then the dumper can decide how much of 
those buffers it is able to dump sanely.

If the dumper has some hardware limitation where it can only dump a single 
aligned block, it can decide to just look at <s2,l2> which is the "latter" 
part of the dump. It's usually going to be fine, and it should be 
page-aligned (and you can even round up the size to a page, since the 
worst that can happen is that you'll see some old printk output at the 
end)

Look ma, no locking, no buffer allocations, no nothing.

			Linus
Ingo Molnar - Oct. 12, 2009, 7:14 p.m.
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Mon, 12 Oct 2009, Ingo Molnar wrote:
> > 
> > * Andrew Morton <akpm@linux-foundation.org> wrote:
> > > 
> > > Perhaps oops_enter() is a good place to mark the start of the log, and 
> > > flush it within oops_exit().
> > 
> > Simplest would be to do the last 2K in oops_exit()? That gives the 
> > oops, and the history leading up to it. Since the blocking is 2K, 
> > the extra log output is for free.
> 
> I agree, except I don't think it should be fixed to 2k.

Yeah - i cited 2K only because that is what mtdoops uses.

>       	void dump_kmsg(void)
> [...]
>
> Look ma, no locking, no buffer allocations, no nothing.

Neat ...

This could also be used for a warm-reboot preserve-memory thing as well. 
A well-known 4K (or so) area to preserve and print out during the next 
bootup after a crash. dump_kmsg() could copy the kernel's last will out 
to that area, or so.

That would be cross-kernel compatible and the newly booted kernel image 
wouldnt overwrite this area. (which it does currently via its 
__log_buf[])

	Ingo
Ingo Molnar - Oct. 12, 2009, 7:18 p.m.
* Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote:

> > I wish it worked on any of the 10+ x86 systems i have. Is there 
> > anyone who'd be interested in exploring whether warm BIOS reboots 
> > work _anywhere_?
> 
> AFAIK memory clearing is default off in coreboot for non-ECC RAM and 
> default on for ECC RAM (to avoid parity errors on read, but that can 
> probably be worked around). Unless I'm mistaken, the SeaBIOS BIOS 
> compatibility layer on top of coreboot doesn't erase RAM at all, so 
> contents can survive.
>
> No idea about classic AMI/Award/Phoenix/Insyde/whatever BIOS, though.

I wouldnt mind to support this for coreboot too, but it will only be 
practical if there's at least one standard BIOS out of the many i run 
that supports warm reboot in practice ...

Can try a test-patch on all of those systems btw. - we do support 
warm-reboot on x86 in arch/x86/kernel/reboot.c:

        /* Write 0x1234 to absolute memory location 0x472.  The BIOS reads
           this on booting to tell it to "Bypass memory test (also warm
           boot)".  This seems like a fairly standard thing that gets set by
           REBOOT.COM programs, and the previous reset routine did this
           too. */
        *((unsigned short *)0x472) = reboot_mode;

But someone would have to come up with a test-patch to prove whether it 
works really works.

	Ingo
Dirk Hohndel - Oct. 12, 2009, 7:18 p.m.
Here's what I get for traveling and not reading mail for two days...

On Mon, 2009-10-12 at 11:45 -0700, Linus Torvalds wrote:
> 
> On Mon, 12 Oct 2009, Ingo Molnar wrote:
> > 
> > * Andrew Morton <akpm@linux-foundation.org> wrote:
> > > 
> > > Perhaps oops_enter() is a good place to mark the start of the log, and 
> > > flush it within oops_exit().
> > 
> > Simplest would be to do the last 2K in oops_exit()? That gives the oops, 
> > and the history leading up to it. Since the blocking is 2K, the extra 
> > log output is for free.
> 
> I agree, except I don't think it should be fixed to 2k.
> 
> We should just dump as much as is "appropriate" for the dump device. It 
> might be the last 2kB, it might be 8kB, it might be 64kB. We don't know, 
> we don't care. The device may have its own per-device limits. Any extra 
> data we get from before the oops is just gravy (often there might be 
> interestign warning messages leadign up to the dump), and if the oops is 
> too big for the dump device, it's not something we can do anything about 
> anyway.

I'm working on something different but related - also using the ring
buffer and just getting as much from its tail as I am able to conserve.

The approach in my case is to write a 2D bar code to the screen and have
the user take a picture / submit the picture to kerneloops.org where it
then gets decoded back into the oops message. This is intended for
situations where you don't have access to other storage / network - or
where a picture of the screen is actually the easiest way to get to the
information.

Right now the project is slightly stalled as I am running into an
unexpected project on the decode side, but I'd love to make sure that
the core changes I'm doing integrate cleanly with this project...

> 
> So the logic should literally be something like this:
> 
>  - kernel/printk.c:
> 
> 	void dump_kmsg(void)
> 	{
> 		unsigned long len = ACCESS_ONCE(log_end);
> 		struct dump_device *dump;
> 		const char *s1, *s2;
> 		unsigned long l1, l2;
> 
> 		s1 = "";
> 		l1 = 0;
> 		s2 = log_buf;
> 		l2 = len;
> 
> 		/* Have we rotated around the circular buffer? */
> 		if (len > log_buf_len) {
> 			unsigned long pos = (len & LOG_BUF_MASK);
> 
> 			s1 = log_buf + pos;
> 			l1 = log_buf_len - pos;
> 
> 			s2 = log_buf;
> 			l2 = pos;
> 		}
> 
> 		list_for_each_entry (dump, dump_list, list) {
> 			dump->fn(s1, l1, s2, l2);
> 		}
> 	}
> 
> ie we just always give the whole buffer (as two "sections", since it's a 
> circular buffer) to the dumper, and then the dumper can decide how much of 
> those buffers it is able to dump sanely.

That's pretty close to what I do - only in my case the information then
doesn't get written to a device but instead gets compressed, encoded and
displayed on the framebuffer...

/D
Simon Kagstrom - Oct. 13, 2009, 7:58 a.m.
On Mon, 12 Oct 2009 11:45:15 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So the logic should literally be something like this:
> 
>  - kernel/printk.c:
> 
> 	void dump_kmsg(void)
> 	{
> 		unsigned long len = ACCESS_ONCE(log_end);

OK, I'll get back with a few patches that does this later today.

// Simon
Artem Bityutskiy - Oct. 13, 2009, 8:57 a.m.
On Tue, 2009-10-13 at 09:58 +0200, Simon Kagstrom wrote:
> On Mon, 12 Oct 2009 11:45:15 -0700 (PDT)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > So the logic should literally be something like this:
> > 
> >  - kernel/printk.c:
> > 
> > 	void dump_kmsg(void)
> > 	{
> > 		unsigned long len = ACCESS_ONCE(log_end);
> 
> OK, I'll get back with a few patches that does this later today.

Yeah, I think it is good idea to work on a generic support, and simplify
mtdoops.
Simon Kagstrom - Oct. 13, 2009, 1:17 p.m.
Hi!

Here are a couple of patches to mtdoops which I've been working on for
a while and which were recently discussed on the MTD list. They are
applied on top of Artems tree,

  http://git.infradead.org/users/dedekind/l2-mtd-2.6.git

which apart from the mainline tree contains a panic fix from Aaro and
Artems mtdoops style cleanup patch.


Now, I'd like to warn sensitive readers that patch 5 *does* contain work
queue elements. However, I believe there is a good reason for this: The
mtd->write call is not good to call while oopsing, and mtd->panic_write
is sort of a last resort. Now, if we panic anyhow, mtd->panic_write
will be called, so the oops will still be written and the scheduled
work won't run anyway.


The patches are:

1. Avoid erasing already empty areas (the area would be erased at each
   module load otherwise)

2. Keep track of mtdoops page cleanliness in an array. This allows
   mtdoops_inc_counter to be called during panic (which fails in my
   case with the current code in mtd->read, although I believe this is
   MTD-driver dependent).

3. Make page size configurable to support longer messages. Mainly
   needed for patch 3, but also allows longer messages to be stored
   during panics (when the next oops area cannot be erased).

4. (kernel/printk.c) Add a dump_device as per Linus suggestion which
   includes a dump_kmsg function which dumps the kernel log buffer to
   registered devices.

5. Refactor mtdoops as a dump_device device instead of a console device.

ChangeLog:

v2:
  * Patches have been reordered to keep the least controversial (?) first.
  * Implement Artems suggestion to keep cleanliness in an array
  * Use Aaros patch to fix the panic_on_oops problem

v3:
  * Correct checkpatch issues

v4: Review corrections
  * Rebase on top of "[PATCH] mtd: mtdoops: several minor cleanups"
  * Patch 1 has been added
  * Use 1 bit per oops page, and rename clean/dirty unused/used
  * Don't initialize oops_page_dirty (it's a global structure anyway so
    it will be cleared together with the rest of BSS)
  * Rename mtdoops_page_size record_size (although only for the page
    size setting)

v5: Corrections after panic_on_oops discussion
  * Add dump_device
  * Refactor mtdoops as a dump device.
Simon Kagstrom - Oct. 14, 2009, 1:34 p.m.
Hi!

Here are a couple of patches to mtdoops which I've been working on for
a while and which were recently discussed on the MTD list. They are
applied on top of Artems tree,

  http://git.infradead.org/users/dedekind/l2-mtd-2.6.git

which apart from the mainline tree contains a panic fix from Aaro and
Artems mtdoops style cleanup patch.


Now, I'd like to warn sensitive readers that patch 5 *does* contain work
queue elements. However, I believe there is a good reason for this: The
mtd->write call is not good to call while oopsing, and mtd->panic_write
is sort of a last resort. Now, if we panic anyhow, mtd->panic_write
will be called, so the oops will still be written and the scheduled
work won't run anyway.


The patches are:

1. Avoid erasing already empty areas (the area would be erased at each
   module load otherwise)

2. Keep track of mtdoops page cleanliness in an array. This allows
   mtdoops_inc_counter to be called during panic (which fails in my
   case with the current code in mtd->read, although I believe this is
   MTD-driver dependent).

3. Make page size configurable to support longer messages. Mainly
   needed for patch 4, but also allows longer messages to be stored
   during panics (when the next oops area cannot be erased).

4. (kernel/printk.c) Add a dump_device as per Linus suggestion which
   includes a dump_kmsg function which dumps the kernel log buffer to
   registered devices.

5. Refactor mtdoops as a dump_device device instead of a console device.

ChangeLog:

v2:
  * Patches have been reordered to keep the least controversial (?) first.
  * Implement Artems suggestion to keep cleanliness in an array
  * Use Aaros patch to fix the panic_on_oops problem

v3:
  * Correct checkpatch issues

v4: Review corrections
  * Rebase on top of "[PATCH] mtd: mtdoops: several minor cleanups"
  * Patch 1 has been added
  * Use 1 bit per oops page, and rename clean/dirty unused/used
  * Don't initialize oops_page_dirty (it's a global structure anyway so
    it will be cleared together with the rest of BSS)
  * Rename mtdoops_page_size record_size (although only for the page
    size setting)

v5: Corrections after panic_on_oops discussion
  * Add dump_device
  * Refactor mtdoops as a dump device.

v6: More corrections (panic_on_oops and Anders Grafström, my room-mate at work)
  * Pathces 2-5 modified
  * Use set_bit/clear_bit primitives in patch 2
  * Set permissions for mtdoops arguments correct in patch 3 and 5
  * Rename files and structures after Linus suggestions in patch 4
  * Correct output under some conditions in patch 5

// Simon

Patch

diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c
index 1060337..ac67833 100644
--- a/drivers/mtd/mtdoops.c
+++ b/drivers/mtd/mtdoops.c
@@ -335,7 +335,7 @@  static void mtdoops_console_sync(void)
        cxt->ready = 0;
        spin_unlock_irqrestore(&cxt->writecount_lock, flags);

-       if (mtd->panic_write && in_interrupt())
+       if (mtd->panic_write && (in_interrupt() || panic_on_oops))
                /* Interrupt context, we're going to panic so try and log */
                mtdoops_write(cxt, 1);
        else