Message ID | 1255346731.9659.31.camel@localhost |
---|---|
State | New, archived |
Headers | show |
* Artem Bityutskiy <dedekind1@gmail.com> wrote: > On Mon, 2009-10-12 at 13:15 +0200, ext Ingo Molnar wrote: > > > We use the mtdoops module which stores oopses on the Flash partition, > > > in order to make it possible to analyze them later. And mtdoops needs > > > to know whether 'panic_on_oops' is on or off. Thus, we need this > > > variable to be exported. > > > > hm, why does it need it? Could you send the patch please that makes use > > of it (as an easy way to explain the need for the export) - current > > upstream drivers/mtd/mtdoops.c doesnt need it so it must be some pending > > change. > > Yes, current drivers/mtd/mtdoops.c does not need it, but we have this > patch: > > http://lists.infradead.org/pipermail/linux-mtd/2009-October/027450.html > > which makes sure mtdoops writes the oops log immediately, because we > have 'panic_on_oops' set so this is our last chance to save the oops. > > Here is the patch inlined as well: > > Author: Aaro Koskinen <aaro.koskinen@nokia.com> > Date: Thu Oct 1 18:16:55 2009 +0300 > > mtd: mtdoops: do not schedule work if we are going to die > > If panic_on_oops is set, the log should be written right away > because this is not going to be a second chance. > > Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com> > Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> > > diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c > index 1060337..ac67833 100644 > --- a/drivers/mtd/mtdoops.c > +++ b/drivers/mtd/mtdoops.c > @@ -335,7 +335,7 @@ static void mtdoops_console_sync(void) > cxt->ready = 0; > spin_unlock_irqrestore(&cxt->writecount_lock, flags); > > - if (mtd->panic_write && in_interrupt()) > + if (mtd->panic_write && (in_interrupt() || panic_on_oops)) > /* Interrupt context, we're going to panic so try and log */ > mtdoops_write(cxt, 1); Hm, the code seems to be somewhat confused about this. It tries to guess when it's panic-ing, right? in_interrupt() is the wrong test for that. mtdoops_console_sync() gets called by regular printk() as well via: static void mtdoops_console_write(struct console *co, const char *s, unsigned int count) { struct mtdoops_context *cxt = co->data; struct mtd_info *mtd = cxt->mtd; unsigned long flags; if (!oops_in_progress) { mtdoops_console_sync(); return; I think this is all layered incorrectly - because mtdoops (which is a _VERY_ cool debugging facility btw - i wish generic x86 had something like that) tries to be a 'driver' and tries to achieve these things without modifying the core kernel. But it really should do that. We can certainly enhance the core kernel logic to tell the console driver more clearly when printk should go out immediately. Instead of using oops_in_progress it might be better to go into 'sync immeditately' mode if the kernel gets tainted. Add a callback for that to struct console (in include/linux/console.h). The ->unblank() callback is already such a "user attention needed immediately" callback. We could add a ->kernel_bug() callback to that. Ingo
(Risking that Artem also replies, I'll bite on this one! Let's hope we agree at least :-)) On Mon, 12 Oct 2009 13:37:58 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > - if (mtd->panic_write && in_interrupt()) > > + if (mtd->panic_write && (in_interrupt() || panic_on_oops)) > > /* Interrupt context, we're going to panic so try and log */ > > mtdoops_write(cxt, 1); > > Hm, the code seems to be somewhat confused about this. It tries to guess > when it's panic-ing, right? in_interrupt() is the wrong test for that. Well, the main reason is to get the write done directly if we know we're going to crash. The rest of the code around the patch looks like this: if (mtd->panic_write && (in_interrupt() || panic_on_oops)) /* Interrupt context, we're going to panic so try and log */ mtdoops_write(cxt, 1); else schedule_work(&cxt->work_write); so if we're oopsing in interrupt context or are going to panic, we just write directly. mtdoops_write will then use mtd->panic_write if it's available to get the write done immediately without sleeping. > mtdoops_console_sync() gets called by regular printk() as well via: > > static void > mtdoops_console_write(struct console *co, const char *s, unsigned int count) > { > struct mtdoops_context *cxt = co->data; > struct mtd_info *mtd = cxt->mtd; > unsigned long flags; > > if (!oops_in_progress) { > mtdoops_console_sync(); > return; > > I think this is all layered incorrectly - because mtdoops (which is a > _VERY_ cool debugging facility btw - i wish generic x86 had something > like that) tries to be a 'driver' and tries to achieve these things > without modifying the core kernel. > > But it really should do that. We can certainly enhance the core kernel > logic to tell the console driver more clearly when printk should go out > immediately. > > Instead of using oops_in_progress it might be better to go into 'sync > immeditately' mode if the kernel gets tainted. Add a callback for that > to struct console (in include/linux/console.h). The ->unblank() callback > is already such a "user attention needed immediately" callback. We could > add a ->kernel_bug() callback to that. I sent another patch which changes the mtdoops behavior a bit: http://patchwork.ozlabs.org/patch/35750/ (the thread with the patch series is here: http://lists.infradead.org/pipermail/linux-mtd/2009-October/027576.html) the change is basically to log all messages in a circular buffer (the current one only logs messages after an oops has started). It also needed some restructuring, and I believe parts of the code becomes simpler that way. The console write now only adds messages to the buffer (never calls mtdoops_console_sync), and mtdoops_console_sync (i.e., the ->unblank() callback) simply becomes static void mtdoops_console_sync(void) { struct mtdoops_context *cxt = &oops_cxt; /* Write out the buffer if we are called during an oops */ if (oops_in_progress) schedule_work(&cxt->work_write); } To handle the panic case, I've simply added a panic notifier which does static int mtdoops_panic(struct notifier_block *this, unsigned long event, void *ptr) { struct mtdoops_context *cxt = &oops_cxt; cancel_work_sync(&cxt->work_write); cxt->ready = 0; if (cxt->mtd->panic_write) mtdoops_write(cxt, 1); else printk(KERN_WARNING "mtdoops: panic_write is not defined, " "cannot store dump from panic\n"); return NOTIFY_DONE; } So with this one, the exported panic_on_oops is no longer needed, and normal oopses are handled by the scheduled work while panic_on_oopses are handled by the panic handler. Anyway, this particular patch is still up for discussion. // Simon
* Simon Kagstrom <simon.kagstrom@netinsight.net> wrote: > (Risking that Artem also replies, I'll bite on this one! Let's hope we > agree at least :-)) > > On Mon, 12 Oct 2009 13:37:58 +0200 > Ingo Molnar <mingo@elte.hu> wrote: > > > > - if (mtd->panic_write && in_interrupt()) > > > + if (mtd->panic_write && (in_interrupt() || panic_on_oops)) > > > /* Interrupt context, we're going to panic so try and log */ > > > mtdoops_write(cxt, 1); > > > > Hm, the code seems to be somewhat confused about this. It tries to guess > > when it's panic-ing, right? in_interrupt() is the wrong test for that. > > Well, the main reason is to get the write done directly if we know > we're going to crash. The rest of the code around the patch looks like > this: > > if (mtd->panic_write && (in_interrupt() || panic_on_oops)) > /* Interrupt context, we're going to panic so try and log */ > mtdoops_write(cxt, 1); > else > schedule_work(&cxt->work_write); > > so if we're oopsing in interrupt context or are going to panic, we > just write directly. mtdoops_write will then use mtd->panic_write if > it's available to get the write done immediately without sleeping. but i'm not sure that code achieves your intention. in_interrupt() is a generic test. It will be true whenever you printk in irq context - be that a panic or not a panic. Also, the panic_on_oops usage looks wrong as well: it is set on a system that wants a panic on oops - but the flag will be set all the time, even when we are not oopsing. I suppose the intention is to add a logic like this: - buffer writes to the MTD async writeout thread for regular printks - if we are in some sort of emergency, write to the MTD device directly as we cannot buffer anymore. Correct? > [...] > > To handle the panic case, I've simply added a panic notifier which > does > > static int mtdoops_panic(struct notifier_block *this, unsigned long event, > void *ptr) > { > struct mtdoops_context *cxt = &oops_cxt; > > cancel_work_sync(&cxt->work_write); > cxt->ready = 0; > if (cxt->mtd->panic_write) > mtdoops_write(cxt, 1); > else > printk(KERN_WARNING "mtdoops: panic_write is not defined, " > "cannot store dump from panic\n"); > > return NOTIFY_DONE; > } > > So with this one, the exported panic_on_oops is no longer needed, and > normal oopses are handled by the scheduled work while panic_on_oopses > are handled by the panic handler. Yes, that looks like the better direction - but 'panic' is still the wrong trigger condition i think. We generally just crash and dont panic. Often we'll display a kernel warning and then hang. Etc. Also, would it be possible to just simplify the thing and not do any buffering at all? Extra buffering complexity in a console driver is only asking for trouble. Or is flash storage write cycles optimization that important in this case? Ingo
On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote: > Also, would it be possible to just simplify the thing and not do any > buffering at all? Extra buffering complexity in a console driver is only > asking for trouble. Or is flash storage write cycles optimization that > important in this case? That and the fact that on NAND flash you have to write full pages at a time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. So we really do want to buffer it where we can. We don't want to write a 2KiB page for every line of printk output.
* David Woodhouse <dwmw2@infradead.org> wrote: > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote: > > Also, would it be possible to just simplify the thing and not do any > > buffering at all? Extra buffering complexity in a console driver is only > > asking for trouble. Or is flash storage write cycles optimization that > > important in this case? > > That and the fact that on NAND flash you have to write full pages at a > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. > So we really do want to buffer it where we can. > > We don't want to write a 2KiB page for every line of printk output. Then i think the buffering is at the wrong place: we should instead buffer in the generic layer and pass it to lowlevel if we know that we have gone past a 2K boundary. The size of the generic log buffer is always a power of two so detecting 2K boundaries is very easy. On any emergency the generic console layer will do faster flushes - this is nothing the console driver itself should bother with. And that would avoid the whole workqueue logic - which is fragile to be done in a printk to begin with. So what we need is an extension to struct console that sets a buffering limit. Zero (the default) means unbuffered. (Btw., things like netconsole might make use of such buffering too.) Agreed? Ingo
On Mon, 12 Oct 2009 14:09:51 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > Well, the main reason is to get the write done directly if we know > > we're going to crash. The rest of the code around the patch looks like > > this: > > > > if (mtd->panic_write && (in_interrupt() || panic_on_oops)) > > /* Interrupt context, we're going to panic so try and log */ > > mtdoops_write(cxt, 1); > > else > > schedule_work(&cxt->work_write); > > > > so if we're oopsing in interrupt context or are going to panic, we > > just write directly. mtdoops_write will then use mtd->panic_write if > > it's available to get the write done immediately without sleeping. > > but i'm not sure that code achieves your intention. > > in_interrupt() is a generic test. It will be true whenever you printk in > irq context - be that a panic or not a panic. Well, this particular code is only called when oops_in_progress is set, so we know we have an oops chasing us. And if we oops in an interrupt, we're going to panic and the same thing goes if panic_on_oops is set. So the current code only logs messages which happen when an oops is in progress. > Also, the panic_on_oops usage looks wrong as well: it is set on a system > that wants a panic on oops - but the flag will be set all the time, even > when we are not oopsing. > > I suppose the intention is to add a logic like this: > > - buffer writes to the MTD async writeout thread for regular printks > > - if we are in some sort of emergency, write to the MTD device directly > as we cannot buffer anymore. > > Correct? Almost :-). During an oops, it stores in a memory buffer until we have either filled the memory buffer, the oops is over, or unblank() is called, and after that it will write out the buffer to flash. I believe the other implementation (outlined below) is a bit simpler in that respect. That will also log all messages (well the last 4-8KB or so of them), but only write it out when a panic or oops occurs. > > So with this one, the exported panic_on_oops is no longer needed, and > > normal oopses are handled by the scheduled work while panic_on_oopses > > are handled by the panic handler. > > Yes, that looks like the better direction - but 'panic' is still the > wrong trigger condition i think. We generally just crash and dont panic. > Often we'll display a kernel warning and then hang. Etc. But how can we detect that? The code above will write to the MTD device either if an oops happens, or if we panic for some reason. If the kernel just hangs (and is reset by the watchdog, if we have one), how should we know when to write the log out? For me personally, the panic case is the important one. Regular oopses (without panic_on_oops) can go to syslog, but it's important to be able to catch the cases when the kernel panics as well. // Simon
* Simon Kagstrom <simon.kagstrom@netinsight.net> wrote: > > Yes, that looks like the better direction - but 'panic' is still the > > wrong trigger condition i think. We generally just crash and dont > > panic. Often we'll display a kernel warning and then hang. Etc. > > But how can we detect that? The code above will write to the MTD > device either if an oops happens, or if we panic for some reason. If > the kernel just hangs (and is reset by the watchdog, if we have one), > how should we know when to write the log out? You shouldnt need to care about that in a console driver - it's up to higher layers. See my reply to David Woodhouse, i think we should add support for buffering in kernel/printk.c and that would both fix your problems, would simplify the driver (significantly!) and would expose the generic buffering capability to other console drivers as well. Ingo
On Mon, 2009-10-12 at 14:20 +0200, Ingo Molnar wrote: > * David Woodhouse <dwmw2@infradead.org> wrote: > > > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote: > > > Also, would it be possible to just simplify the thing and not do any > > > buffering at all? Extra buffering complexity in a console driver is only > > > asking for trouble. Or is flash storage write cycles optimization that > > > important in this case? > > > > That and the fact that on NAND flash you have to write full pages at a > > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. > > So we really do want to buffer it where we can. > > > > We don't want to write a 2KiB page for every line of printk output. > > Then i think the buffering is at the wrong place: we should instead > buffer in the generic layer and pass it to lowlevel if we know that we > have gone past a 2K boundary. > > The size of the generic log buffer is always a power of two so detecting > 2K boundaries is very easy. On any emergency the generic console layer > will do faster flushes - this is nothing the console driver itself > should bother with. > > And that would avoid the whole workqueue logic - which is fragile to be > done in a printk to begin with. > > So what we need is an extension to struct console that sets a buffering > limit. Zero (the default) means unbuffered. > > (Btw., things like netconsole might make use of such buffering too.) > > Agreed? Makes some sense, yes. We also use the workqueue logic to allow us to co-ordinate access to the hardware properly -- taking locks where appropriate, etc. We can't do that directly from a console ->write() method. Some device drivers do provide a ->panic_write() function which breaks all the locks and just resets the hardware because it knows we're panicking, but we don't want to do that in the common case.
* David Woodhouse <dwmw2@infradead.org> wrote: > On Mon, 2009-10-12 at 14:20 +0200, Ingo Molnar wrote: > > * David Woodhouse <dwmw2@infradead.org> wrote: > > > > > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote: > > > > Also, would it be possible to just simplify the thing and not do any > > > > buffering at all? Extra buffering complexity in a console driver is only > > > > asking for trouble. Or is flash storage write cycles optimization that > > > > important in this case? > > > > > > That and the fact that on NAND flash you have to write full pages at a > > > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. > > > So we really do want to buffer it where we can. > > > > > > We don't want to write a 2KiB page for every line of printk output. > > > > Then i think the buffering is at the wrong place: we should instead > > buffer in the generic layer and pass it to lowlevel if we know that we > > have gone past a 2K boundary. > > > > The size of the generic log buffer is always a power of two so detecting > > 2K boundaries is very easy. On any emergency the generic console layer > > will do faster flushes - this is nothing the console driver itself > > should bother with. > > > > And that would avoid the whole workqueue logic - which is fragile to be > > done in a printk to begin with. > > > > So what we need is an extension to struct console that sets a buffering > > limit. Zero (the default) means unbuffered. > > > > (Btw., things like netconsole might make use of such buffering too.) > > > > Agreed? > > Makes some sense, yes. > > We also use the workqueue logic to allow us to co-ordinate access to > the hardware properly -- taking locks where appropriate, etc. We can't > do that directly from a console ->write() method. > > Some device drivers do provide a ->panic_write() function which breaks > all the locks and just resets the hardware because it knows we're > panicking, but we don't want to do that in the common case. Well other than not using sleeping locks in that codepath it should be properly serializable. If the serial driver, netconsole, fbcon and all the other non-trivial console drivers can do it then MTD should be able to do it too. Printk via workqueue ... lets not go there, really. (I know you already have that but i think it's a mistake on a fundamental level.) Ingo
On Mon, 2009-10-12 at 14:36 +0200, Ingo Molnar wrote: > Well other than not using sleeping locks in that codepath it should be > properly serializable. If the serial driver, netconsole, fbcon and all > the other non-trivial console drivers can do it then MTD should be able > to do it too. It's distinctly non-trivial. A 'real' user of the hardware may have just triggered a block erase, which could take hundreds (or even thousands) of milliseconds to complete. We can't always suspend that erase; do we really want to _wait_ for it? If we're actually in a _panic_ situation, then yes -- we want to wait, and that's what the device driver's panic_write() method should do. But for less critical output? I'm not so sure. Of course, I'm currently looking at revamping the MTD APIs so they're not entirely synchronous, and are queue-based like the block device API. Then the console ->write() method could just queue the write, and we need some other logic to ensure that it really does _happen_ when the system crashes. A kind of 'purge queue of console writes' function in a panic handler... except that as you point out, sometimes we actually want that even when we haven't actually panicked.
On Mon, 12 Oct 2009 14:20:23 +0200 Ingo Molnar <mingo@elte.hu> wrote: > * David Woodhouse <dwmw2@infradead.org> wrote: > > > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote: > > > Also, would it be possible to just simplify the thing and not do any > > > buffering at all? Extra buffering complexity in a console driver is only > > > asking for trouble. Or is flash storage write cycles optimization that > > > important in this case? > > > > That and the fact that on NAND flash you have to write full pages at a > > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. > > So we really do want to buffer it where we can. > > > > We don't want to write a 2KiB page for every line of printk output. > > Then i think the buffering is at the wrong place: we should instead > buffer in the generic layer and pass it to lowlevel if we know that we > have gone past a 2K boundary. > > The size of the generic log buffer is always a power of two so detecting > 2K boundaries is very easy. On any emergency the generic console layer > will do faster flushes - this is nothing the console driver itself > should bother with. But this is only part of the mtdoops problem (the reason why we don't write all the time). The current code only stores messages printed during an oops, and this behavior will surely change if the console driver gets large buffers of output - or it would have to take in the output unbuffered anyway. My patch changes this behavior, and with that I don't think buffered output would be a problem - it would indeed make it more simple as you say - assuming there is something like ->kernel_bug() that would flush the last 4KiB or so of messages to mtdoops when there is an oops or panic. > And that would avoid the whole workqueue logic - which is fragile to be > done in a printk to begin with. I'm afraid I don't really see this issue. The workqueue is used to write the buffer to the mtd device if we are not in a panic or interrupt context - in which case we do it directly. So it's only used when an oops is ongoing. // Simon
> See my reply to David Woodhouse, i think we should add support for > buffering in kernel/printk.c and that would both fix your problems, > would simplify the driver (significantly!) and would expose the generic > buffering capability to other console drivers as well. Buffering printk in general is bad. Given a driver needs only to provide about 6 lines of code using a kfifo is it really that hard for the odd code that wants to buffer to do that ?
* Simon Kagstrom <simon.kagstrom@netinsight.net> wrote: > On Mon, 12 Oct 2009 14:20:23 +0200 > Ingo Molnar <mingo@elte.hu> wrote: > > > * David Woodhouse <dwmw2@infradead.org> wrote: > > > > > On Mon, 2009-10-12 at 14:09 +0200, Ingo Molnar wrote: > > > > Also, would it be possible to just simplify the thing and not do any > > > > buffering at all? Extra buffering complexity in a console driver is only > > > > asking for trouble. Or is flash storage write cycles optimization that > > > > important in this case? > > > > > > That and the fact that on NAND flash you have to write full pages at a > > > time -- that's 512 bytes, 2KiB or 4KiB depending on the type of chip. > > > So we really do want to buffer it where we can. > > > > > > We don't want to write a 2KiB page for every line of printk output. > > > > Then i think the buffering is at the wrong place: we should instead > > buffer in the generic layer and pass it to lowlevel if we know that we > > have gone past a 2K boundary. > > > > The size of the generic log buffer is always a power of two so > > detecting 2K boundaries is very easy. On any emergency the generic > > console layer will do faster flushes - this is nothing the console > > driver itself should bother with. > > But this is only part of the mtdoops problem (the reason why we don't > write all the time). The current code only stores messages printed > during an oops, and this behavior will surely change if the console > driver gets large buffers of output - or it would have to take in the > output unbuffered anyway. > > My patch changes this behavior, and with that I don't think buffered > output would be a problem - it would indeed make it more simple as you > say - assuming there is something like ->kernel_bug() that would flush > the last 4KiB or so of messages to mtdoops when there is an oops or > panic. > > > And that would avoid the whole workqueue logic - which is fragile to > > be done in a printk to begin with. > > I'm afraid I don't really see this issue. The workqueue is used to > write the buffer to the mtd device if we are not in a panic or > interrupt context - in which case we do it directly. > > So it's only used when an oops is ongoing. This fixation on 'panic' is so wrong! 90% of the bugs users care about dont involve any panic. And even if there is a panic down the line, most of the interesting messages are in the stream leading up to the panic - now tucked away in that async workqueue mechanism and not visible. There's two clean solutions i think: 1) add some new "ok, there's trouble!" callback to struct console and the console driver could via that mechanism send out the _last_ 2KB (or more) of kernel log messages. Basically we can go back in time by looking at the dmesg buffer. The low level console driver does not need to 'follow' the high level console state - it only wants to print in case of trouble anyway. 2) or add buffered (flash-friendly) writes for all printk output - panic and non-panic alike. This would be useful to debug suspend/resume bugs for example. This would also optimize the packets of netconsole output. (last i checked we sent a packet per line.) The workqueue looks wrong in both variants. If we are panic-ing (or hanging, or ...) then we are halting the machine - the workqueue has no chance to actually execute. Ingo
* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote: > > See my reply to David Woodhouse, i think we should add support for > > buffering in kernel/printk.c and that would both fix your problems, > > would simplify the driver (significantly!) and would expose the > > generic buffering capability to other console drivers as well. > > Buffering printk in general is bad. [...] The general (and default) case would be 0 buffering - i.e. finegrained per line calls to ->console_write(). This is the common case indeed, we want to get console output out as soon as possible and as finegrained as possible - as we dont know when a failure mode removes our ability to print anything else. My argument is that instead of a complicated dance of workqueue versus non-workqueue printk support in the MTD code in drivers/mtd/mtdoops.c, this should be done at the generic console level. It's not hard, nor complex if done at the right level, nor does it impact the regular zero-buffering codepath in a significant way - and the end result would be a significantly simpler (and, in turn, more robust) MTD printk driver. I care about this because i still havent given up hope that the company you are working for will finally give us some permanent storage in the CPU itself, so that we can have cross-reboot printk buffering ;-) If that storage is in the form of Flash, then buffering (to optimize write cycles) is probably a must. > [...] Given a driver needs only to provide about 6 lines of code using > a kfifo is it really that hard for the odd code that wants to buffer > to do that ? Avoiding a workqueue in printk is about the critical path of failure and about complexity in general. I dont see how kfifo helps here much - kfifo is really just a relatively simple dynamic memory buffer abstraction - while most of the complexity here is elsewhere. Could you explain what you meant with kfifo here please? Ingo
On Mon, 2009-10-12 at 15:25 +0200, Ingo Molnar wrote: > > I care about this because i still havent given up hope that the company > you are working for will finally give us some permanent storage in the > CPU itself, so that we can have cross-reboot printk buffering ;-) Why would it have to be in the CPU? If we had decent firmware, couldn't we just look at the old kernel's log_buf in RAM? On machines where coreboot is supported (which is mostly machines _other_ than those from my employer, unfortunately), you ought to be able to do this today.
OK, I don't think we understand each other. Sorry if I'm being slow here, please tell me if I'm misunderstanding something fundamental below. On Mon, 12 Oct 2009 15:15:29 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > I'm afraid I don't really see this issue. The workqueue is used to > > write the buffer to the mtd device if we are not in a panic or > > interrupt context - in which case we do it directly. > > > > So it's only used when an oops is ongoing. > > This fixation on 'panic' is so wrong! > > 90% of the bugs users care about dont involve any panic. And even if > there is a panic down the line, most of the interesting messages are in > the stream leading up to the panic - now tucked away in that async > workqueue mechanism and not visible. Well, this is what my patch [1] aims to fix. What it does is to put all messages in a circular buffer, and when an oops or panic occurs it writes them out. The current version only collects messages _during_ an oops. I'll rework it with using kfifo as per Alans suggestion though. Neither the current code nor the new patch has them stored in the work queue during a panic though. If this happens, they will call panic_write (if it's available) to write it out directly. > There's two clean solutions i think: > > 1) add some new "ok, there's trouble!" callback to struct console and > the console driver could via that mechanism send out the _last_ 2KB > (or more) of kernel log messages. Basically we can go back in time by > looking at the dmesg buffer. The low level console driver does not > need to 'follow' the high level console state - it only wants to > print in case of trouble anyway. > > 2) or add buffered (flash-friendly) writes for all printk output - panic > and non-panic alike. This would be useful to debug suspend/resume > bugs for example. This would also optimize the packets of netconsole > output. (last i checked we sent a packet per line.) Well, suspend/resume hangs is one of the cases which mtdoops won't catch. But at least on NAND flash, I'd be a bit weary about logging all printk output for fear of wearing out the flash. > The workqueue looks wrong in both variants. If we are panic-ing (or > hanging, or ...) then we are halting the machine - the workqueue has no > chance to actually execute. but then we are using mtd->panic_write to write it out directly, not via the work queue. // Simon [1] http://patchwork.ozlabs.org/patch/35750/
On Mon, 12 Oct 2009 13:37:58 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > * Artem Bityutskiy <dedekind1@gmail.com> wrote: > > > On Mon, 2009-10-12 at 13:15 +0200, ext Ingo Molnar wrote: > > > > We use the mtdoops module which stores oopses on the Flash > > > > partition, in order to make it possible to analyze them later. > > > > And mtdoops needs to know whether 'panic_on_oops' is on or off. > > > > Thus, we need this variable to be exported. > > > > > > hm, why does it need it? Could you send the patch please that > > > makes use of it (as an easy way to explain the need for the > > > export) - current upstream drivers/mtd/mtdoops.c doesnt need it > > > so it must be some pending change. > > > > Yes, current drivers/mtd/mtdoops.c does not need it, but we have > > this patch: > > > > http://lists.infradead.org/pipermail/linux-mtd/2009-October/027450.html > > > > which makes sure mtdoops writes the oops log immediately, because we > > have 'panic_on_oops' set so this is our last chance to save the > > oops. > > > > Here is the patch inlined as well: > > > > Author: Aaro Koskinen <aaro.koskinen@nokia.com> > > Date: Thu Oct 1 18:16:55 2009 +0300 > > > > mtd: mtdoops: do not schedule work if we are going to die > > > > If panic_on_oops is set, the log should be written right away > > because this is not going to be a second chance. > > > > Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com> > > Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> > > > > diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c > > index 1060337..ac67833 100644 > > --- a/drivers/mtd/mtdoops.c > > +++ b/drivers/mtd/mtdoops.c > > @@ -335,7 +335,7 @@ static void mtdoops_console_sync(void) > > cxt->ready = 0; > > spin_unlock_irqrestore(&cxt->writecount_lock, flags); > > > > - if (mtd->panic_write && in_interrupt()) > > + if (mtd->panic_write && (in_interrupt() || panic_on_oops)) > > /* Interrupt context, we're going to panic so try > > and log */ mtdoops_write(cxt, 1); > > Hm, the code seems to be somewhat confused about this. It tries to > guess when it's panic-ing, right? in_interrupt() is the wrong test > for that. > > mtdoops_console_sync() gets called by regular printk() as well via: > > static void > mtdoops_console_write(struct console *co, const char *s, unsigned int > count) { > struct mtdoops_context *cxt = co->data; > struct mtd_info *mtd = cxt->mtd; > unsigned long flags; > > if (!oops_in_progress) { > mtdoops_console_sync(); > return; > > I think this is all layered incorrectly - because mtdoops (which is a > _VERY_ cool debugging facility btw - i wish generic x86 had something > like that) tries to be a 'driver' and tries to achieve these things > without modifying the core kernel. > > But it really should do that. We can certainly enhance the core > kernel logic to tell the console driver more clearly when printk > should go out immediately. > > Instead of using oops_in_progress it might be better to go into 'sync > immeditately' mode if the kernel gets tainted. actually there are very nice "begin of oops" and "end of oops" functions... they are the perfect places for this ;-)
* David Woodhouse <dwmw2@infradead.org> wrote: > On Mon, 2009-10-12 at 15:25 +0200, Ingo Molnar wrote: > > > > I care about this because i still havent given up hope that the > > company you are working for will finally give us some permanent > > storage in the CPU itself, so that we can have cross-reboot printk > > buffering ;-) > > Why would it have to be in the CPU? If we had decent firmware, > couldn't we just look at the old kernel's log_buf in RAM? Not if the failure is say a s2ram hang that requires a power cycle. Also there are certain classes of bugs that only occur on cold boot. Plus there's the "need to unplug the battery to revive the system" class of bugs (but they are rare). So i think the MTD / flash stuff is powerful. Ingo
* Simon Kagstrom <simon.kagstrom@netinsight.net> wrote: > OK, I don't think we understand each other. Sorry if I'm being slow > here, please tell me if I'm misunderstanding something fundamental > below. [ it could easily be me being confused - i dont know the mtdoops code that well - i just raised an eyebrow at the export request, which yelled 'layering violation' at me ;-) ] > On Mon, 12 Oct 2009 15:15:29 +0200 > Ingo Molnar <mingo@elte.hu> wrote: > > > > I'm afraid I don't really see this issue. The workqueue is used to > > > write the buffer to the mtd device if we are not in a panic or > > > interrupt context - in which case we do it directly. > > > > > > So it's only used when an oops is ongoing. > > > > This fixation on 'panic' is so wrong! > > > > 90% of the bugs users care about dont involve any panic. And even if > > there is a panic down the line, most of the interesting messages are in > > the stream leading up to the panic - now tucked away in that async > > workqueue mechanism and not visible. > > Well, this is what my patch [1] aims to fix. What it does is to put > all messages in a circular buffer, and when an oops or panic occurs it > writes them out. The current version only collects messages _during_ > an oops. I'll rework it with using kfifo as per Alans suggestion > though. > > Neither the current code nor the new patch has them stored in the work > queue during a panic though. If this happens, they will call > panic_write (if it's available) to write it out directly. > > > There's two clean solutions i think: > > > > 1) add some new "ok, there's trouble!" callback to struct console and > > the console driver could via that mechanism send out the _last_ 2KB > > (or more) of kernel log messages. Basically we can go back in time by > > looking at the dmesg buffer. The low level console driver does not > > need to 'follow' the high level console state - it only wants to > > print in case of trouble anyway. > > > > 2) or add buffered (flash-friendly) writes for all printk output - panic > > and non-panic alike. This would be useful to debug suspend/resume > > bugs for example. This would also optimize the packets of netconsole > > output. (last i checked we sent a packet per line.) > > Well, suspend/resume hangs is one of the cases which mtdoops won't > catch. [...] ( Sidenote: i see no reason why that wouldnt be possible if it's implemented properly. ) > [...] But at least on NAND flash, I'd be a bit weary about logging all > printk output for fear of wearing out the flash. Clearly should be optional - like the s2ram debug hack to RTC registers is optional on x86. > > The workqueue looks wrong in both variants. If we are panic-ing (or > > hanging, or ...) then we are halting the machine - the workqueue has > > no chance to actually execute. > > but then we are using mtd->panic_write to write it out directly, not > via the work queue. ... i might be confused, but in which case _is_ the workqueue used? It clearly shows up in the codepaths i've read, but maybe i've misinterpreted what it does. Ingo
On Mon, 2009-10-12 at 16:26 +0200, Ingo Molnar wrote: > Not if the failure is say a s2ram hang that requires a power cycle. Also > there are certain classes of bugs that only occur on cold boot. Plus > there's the "need to unplug the battery to revive the system" class of > bugs (but they are rare). So you need to build in enough ECC to cope with the decay which happens when RAM isn't being refreshed for a few seconds... :) > So i think the MTD / flash stuff is powerful. Yeah, definitely. I was just pointing out that we can actually do a lot better on today's commodity hardware too.
On Mon, 12 Oct 2009 16:30:17 +0200 Ingo Molnar <mingo@elte.hu> wrote: > * Simon Kagstrom <simon.kagstrom@netinsight.net> wrote: > > > OK, I don't think we understand each other. Sorry if I'm being slow > > here, please tell me if I'm misunderstanding something fundamental > > below. > > [ it could easily be me being confused - i dont know the mtdoops code > that well - i just raised an eyebrow at the export request, which > yelled 'layering violation' at me ;-) ] I sent the same patch (export panic_on_oops) to LKML a week ago, but got no replies. I now know who to Cc next time ;-) > > > 2) or add buffered (flash-friendly) writes for all printk output - panic > > > and non-panic alike. This would be useful to debug suspend/resume > > > bugs for example. This would also optimize the packets of netconsole > > > output. (last i checked we sent a packet per line.) > > > > Well, suspend/resume hangs is one of the cases which mtdoops won't > > catch. [...] > > ( Sidenote: i see no reason why that wouldnt be possible if it's > implemented properly. ) Provided there is a callback for "really_dump_the_console" which gets called from interesting places (e.g., suspend/resume) it should be easy to do with the patched mtdoops without much changes, at least with my proposed circular buffer patch it will be trivial. > > [...] But at least on NAND flash, I'd be a bit weary about logging all > > printk output for fear of wearing out the flash. > > Clearly should be optional - like the s2ram debug hack to RTC registers > is optional on x86. That would be OK for me at least, and again should not be very difficult to implement even with todays code. > > > The workqueue looks wrong in both variants. If we are panic-ing (or > > > hanging, or ...) then we are halting the machine - the workqueue has > > > no chance to actually execute. > > > > but then we are using mtd->panic_write to write it out directly, not > > via the work queue. > > ... i might be confused, but in which case _is_ the workqueue used? > > It clearly shows up in the codepaths i've read, but maybe i've > misinterpreted what it does. With the code in mainline, it basically works like this. - When oops_in_progress is not set, mtdoops_console_write will not put anything in the buffer. It will call mtdoops_console_sync(), but since the buffer will be empty (cxt->writecount is 0), no writes will be done. - When oops_in_progress _is_ set, mtdoops_console_write will start putting things in the buffer. - mtdoops_console_sync is then called either when the buffer has been filled, or when ->unblank() is called, or when oops_in_progress is no longer true. This is the place when the workqueue _can_ be used (in the cases when this is not in interrupt context or panic_on_oops is unset). With my patch it instead works like this: - mtdoops_console_write continuously writes messages to the buffer, but never calls mtdoops_console_sync() itself. - mtdoops_console_sync (i.e., the ->unblank() callback) will schedule work if oops_in_progress is set. - if we have a panic, it will call mtdoops_write directly (if mtd->panic_write is set, otherwise we are out of luck). This is also the code path on oopses in interrupt context. So the workqueue only gets used on unblank() from oopses. I think the second implementation is simpler, but it also changes the behavior of mtdoops a bit to include messages before the oops/panic as well. // Simon
* David Woodhouse <dwmw2@infradead.org> wrote: > On Mon, 2009-10-12 at 16:26 +0200, Ingo Molnar wrote: > > Not if the failure is say a s2ram hang that requires a power cycle. > > Also there are certain classes of bugs that only occur on cold boot. > > Plus there's the "need to unplug the battery to revive the system" > > class of bugs (but they are rare). > > So you need to build in enough ECC to cope with the decay which > happens when RAM isn't being refreshed for a few seconds... :) [ hey, i think you should line up with BIOS writers at that wall ;-) ] > > So i think the MTD / flash stuff is powerful. > > Yeah, definitely. I was just pointing out that we can actually do a > lot better on today's commodity hardware too. I wish it worked on any of the 10+ x86 systems i have. Is there anyone who'd be interested in exploring whether warm BIOS reboots work _anywhere_? A simple patch with a new (default-off) CONFIG_DEBUG_ feature that just puts a signature into a predictable spot in RAM, switches the reboot method over to warm reboot (reboot=w) and prints some friendly "yay, this BIOS rocks!" message if the signature is still there after a reboot and not zeroed out. If that works _anywhere_ we could complete it: we could cache the dmesg buffer address (__log_buf[]) across reboots (and maybe the printk tail offset (log_end)), and that would be an _excellent_ debuggability feature for a large class of otherwise undebuggable crashes ... We could use that to preserve a kernel function trace (or a branch execution hardware trace using BTS on Intel CPUs) across crashes, etc. etc. Ingo
* Simon Kagstrom <simon.kagstrom@netinsight.net> wrote: > With my patch it instead works like this: > > - mtdoops_console_write continuously writes messages to the buffer, but > never calls mtdoops_console_sync() itself. > > - mtdoops_console_sync (i.e., the ->unblank() callback) will schedule > work if oops_in_progress is set. > > - if we have a panic, it will call mtdoops_write directly (if > mtd->panic_write is set, otherwise we are out of luck). This is also > the code path on oopses in interrupt context. > > So the workqueue only gets used on unblank() from oopses. I think the > second implementation is simpler, but it also changes the behavior of > mtdoops a bit to include messages before the oops/panic as well. The main printk principle is simplicity: - The simpler the printk codepath, the higher the chances that we still are within the window of opportunity to get anything out to the user. - Furthermore, the simpler the printk codepath, the larger the window of opportunity is to begin with: we rely on less external state, so we have a smaller surface of interaction that might break printk. In that sense, i think your modified workqueue use is less wrong than what is in current mainline, but i'm afraid it's still wrong. Why use a workqueue on unblank()? Why use a workqueue _at all_? (If we piggyback to any kernel thread we could as well piggyback to syslog itself - and we all know how often syslog fails at capturing oopses.) The mtdoops console driver only seems to act if there's an emergency - and in emergencies we really _never_ want to do complex things like using a workqueue thread. ( Sidenote: i proffer that we dont want to use a workqueue in the 'regular' printk case either - but that seems to be irrelevant here as mtdoops does not seem to save / care about regular non-emergency printks. ) Ingo
On Mon, 12 Oct 2009, Simon Kagstrom wrote: > > Well, this is what my patch [1] aims to fix. What it does is to put all > messages in a circular buffer, and when an oops or panic occurs it > writes them out. Umm. That's wrong. We already _have_ the circular buffer. It's called 'log_buf'. I agree with the "save kernel buffer on panic" thing, but I disagree with making it anything new, and hooking into "printk()" or the console subsystem AT ALL. That's just bogus, stupid, and WRONG. What you can do is to just flush the 'log_buf' buffer (or as much of it as you want - the buffer may be a megabyte in size, and maybe you only want to flush the last 8kB or something like that) on oops. And _not_ mix this up with anything else. It's a really simple circular buffer, which just has - log_buf: buffer start - log_end: number of characters ever seen - log_buf_len: size of buffer (guaranteed to be a power-of-2) so it's literally as easy as looking at those three values (there's a few more that you _can_ look at, but they'd not likely be relevant for a "panic_on_oops" thing) > The current version only collects messages _during_ an > oops. I'll rework it with using kfifo as per Alans suggestion though. Don't. kfifo's aren't going to help. You're doing this at all the wrong levels ENTIRELY, and we already have the buffer you want to flush. Linus
On Mon, 12 Oct 2009, Linus Torvalds wrote: > > Don't. kfifo's aren't going to help. You're doing this at all the wrong > levels ENTIRELY, and we already have the buffer you want to flush. Btw, a few simple rules: - if you need to make your device look like a "console device" for dumping at oops time, you're doing things wrong. You don't want line buffered output to begin with, and you don't want to see each line, you only want this at exceptional points. - if you need to look at "in_interrupt()" or "panic_on_oops", you're doing things wrong. - if you add your own buffers, you're doing things wrong. I have CONFIG_LOG_BUF_SHIFT=18 in my kernel, so I've already got 256kB worth of memory allocated for kernel messages. Sometimes I increase that further, just because I do some silly printk debugging. IOW, just add a very simple "flush the dmesg buffer on oops" callback to the end of the oops printout code (just a single call after the oops thing is now known to be in the dmesg buffers!) It's not just oopses, btw. Maybe people would like to do this as the last stage of a reboot/shutdown too. Because some of the final printouts from the shutdown will never make it to disk, because 'ksyslogd' has been killed, and the root filesystem has been turned read-only. Linus
On Mon, 2009-10-12 at 08:44 -0700, Linus Torvalds wrote: > > On Mon, 12 Oct 2009, Linus Torvalds wrote: > > > > Don't. kfifo's aren't going to help. You're doing this at all the wrong > > levels ENTIRELY, and we already have the buffer you want to flush. > > Btw, a few simple rules: > > - if you need to make your device look like a "console device" for > dumping at oops time, you're doing things wrong. You don't want line > buffered output to begin with, and you don't want to see each line, you > only want this at exceptional points. > > - if you need to look at "in_interrupt()" or "panic_on_oops", you're > doing things wrong. > > - if you add your own buffers, you're doing things wrong. I have > CONFIG_LOG_BUF_SHIFT=18 in my kernel, so I've already got 256kB worth > of memory allocated for kernel messages. Sometimes I increase that > further, just because I do some silly printk debugging. > > IOW, just add a very simple "flush the dmesg buffer on oops" callback to > the end of the oops printout code (just a single call after the oops thing > is now known to be in the dmesg buffers!) > > It's not just oopses, btw. Maybe people would like to do this as the last > stage of a reboot/shutdown too. Because some of the final printouts from > the shutdown will never make it to disk, because 'ksyslogd' has been > killed, and the root filesystem has been turned read-only. Yes, it is difficult to disagree with this. As Ingo also pointed, pretending to be a console driver is ugly, although I think it was a clever way to avoid touching generic code. But mtdoops tries to solves the following problem. What if we are oopsing in an interrupt, which interrupted the mtd driver, so we have all the locks held, and the mtd driver is in a unexpected stage ATM? Or what if we are oopsing in the mtd driver, or in something which was called by the MTD driver.? This is what all that "workqueue vs. mtd->panic_write()" logic is about, which Ingo dislikes. The logic of mtdoops (my interpretation): 1. OK, there was an oops, we have it in a buffer and we need to write it to flash now. 2. If we are in interrupt context, we have to write right now, because the system is about to die very soon. And must use special 'mtd->panic_write()' call. 3. If we have 'panic_on_oops' set, the system will be stopped soon. So we also have to write now, because this is our last chance. And we also have to use 'mtd->panic_write()'. 4. Otherwise, it is probable that system will be somewhat alive, and we may schedule the write instead of using 'mtd->panic_write()', because 'mtd->panic_write()' abuses locking and chip state, and will screw up the mtd flash driver. So we prefer to schedule the write instead. I guess similar could be done in 'panic()' instead. Or you have a more elegant solution?
On Mon, 12 Oct 2009, Artem Bityutskiy wrote: > > But mtdoops tries to solves the following problem. What if we are > oopsing in an interrupt, which interrupted the mtd driver, so we have > all the locks held, and the mtd driver is in a unexpected stage ATM? Or > what if we are oopsing in the mtd driver, or in something which was > called by the MTD driver.? Well, quite frankly, if you have an oops while holding a spinlock, then the machine is dead _anyway_. So what I would suggest is to just ignore the above problem. No amount of workqueue logic will help it - if the oops happened while an interrupt held a critical mtd lock, that lock will _never_ be released, so exactly what would be helped? Now, I realize that _if_ you treat mtdoops as a 'console' layer, then you need to do that crazy thing, because you still want the oops to print out to the other consoles, and you're only getting data one line at a time. But since that was the wrong thing to do for a lot of other reasons anyway, that's not a very good argument. Once you do the final flush in a controlled place _after_ you've printed out all the oops information, you simply don't care about locks any more. Because if you were holding critical locks, you're done anyway. Sure, maybe you want to do a "trylock()" and skip the oops flush entirely in the mtd layer if you can't do it, but it's the "let's use a workqueue" or something that doesn't seem to make a lot of sense to me. Linus
On Mon, 12 Oct 2009, Linus Torvalds wrote: > > Once you do the final flush in a controlled place _after_ you've printed > out all the oops information, you simply don't care about locks any more. > Because if you were holding critical locks, you're done anyway. > > Sure, maybe you want to do a "trylock()" and skip the oops flush entirely > in the mtd layer if you can't do it, but it's the "let's use a workqueue" > or something that doesn't seem to make a lot of sense to me. Side note: I don't actually care deeply. Once it's all inside some driver, and once it's not messing up the console layer, I don't think the small details matter all that much. I just think it's likely to be a sign of something wrong if you need to use workqueues to flush - it probably means that the most interesting oopses will never make it to the mtd device in the first place. Linus
On Mon, 12 Oct 2009 08:36:38 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > I agree with the "save kernel buffer on panic" thing, but I disagree with > making it anything new, and hooking into "printk()" or the console > subsystem AT ALL. That's just bogus, stupid, and WRONG. > > What you can do is to just flush the 'log_buf' buffer (or as much of it as > you want - the buffer may be a megabyte in size, and maybe you only want > to flush the last 8kB or something like that) on oops. And _not_ mix this > up with anything else. What he said. I did it that way in the Digeo kernel back in 2002. Worked good. Doing it via a console is rather weird. It will need core kernel changes to do it properly. Perhaps oops_enter() is a good place to mark the start of the log, and flush it within oops_exit().
* Andrew Morton <akpm@linux-foundation.org> wrote: > On Mon, 12 Oct 2009 08:36:38 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > I agree with the "save kernel buffer on panic" thing, but I disagree with > > making it anything new, and hooking into "printk()" or the console > > subsystem AT ALL. That's just bogus, stupid, and WRONG. > > > > What you can do is to just flush the 'log_buf' buffer (or as much of it as > > you want - the buffer may be a megabyte in size, and maybe you only want > > to flush the last 8kB or something like that) on oops. And _not_ mix this > > up with anything else. > > What he said. I did it that way in the Digeo kernel back in 2002. > Worked good. > > Doing it via a console is rather weird. It will need core kernel > changes to do it properly. > > Perhaps oops_enter() is a good place to mark the start of the log, and > flush it within oops_exit(). Simplest would be to do the last 2K in oops_exit()? That gives the oops, and the history leading up to it. Since the blocking is 2K, the extra log output is for free. (unless the oops is larger than 2K - but that is rather rare.) Ingo
On 12.10.2009 17:14, Ingo Molnar wrote: > * David Woodhouse <dwmw2@infradead.org> wrote: > >> On Mon, 2009-10-12 at 16:26 +0200, Ingo Molnar wrote: >> >>> Not if the failure is say a s2ram hang that requires a power cycle. >>> Also there are certain classes of bugs that only occur on cold boot. >>> Plus there's the "need to unplug the battery to revive the system" >>> class of bugs (but they are rare). >>> >> So you need to build in enough ECC to cope with the decay which >> happens when RAM isn't being refreshed for a few seconds... :) >> > > [ hey, i think you should line up with BIOS writers at that wall ;-) ] > Not all of us x86 firmware writers are evil. >>> So i think the MTD / flash stuff is powerful. >>> >> Yeah, definitely. I was just pointing out that we can actually do a >> lot better on today's commodity hardware too. >> > > I wish it worked on any of the 10+ x86 systems i have. Is there anyone > who'd be interested in exploring whether warm BIOS reboots work > _anywhere_? > AFAIK memory clearing is default off in coreboot for non-ECC RAM and default on for ECC RAM (to avoid parity errors on read, but that can probably be worked around). Unless I'm mistaken, the SeaBIOS BIOS compatibility layer on top of coreboot doesn't erase RAM at all, so contents can survive. No idea about classic AMI/Award/Phoenix/Insyde/whatever BIOS, though. > A simple patch with a new (default-off) CONFIG_DEBUG_ feature that just > puts a signature into a predictable spot in RAM, switches the reboot > method over to warm reboot (reboot=w) and prints some friendly "yay, > this BIOS rocks!" message if the signature is still there after a reboot > and not zeroed out. > > If that works _anywhere_ we could complete it: we could cache the dmesg > buffer address (__log_buf[]) across reboots (and maybe the printk tail > offset (log_end)), and that would be an _excellent_ debuggability > feature for a large class of otherwise undebuggable crashes ... > > We could use that to preserve a kernel function trace (or a branch > execution hardware trace using BTS on Intel CPUs) across crashes, etc. > etc. > Since we're discussing log buffers anyway, does it make sense to have this feature interact with the coreboot log buffer which is passed on to the OS (no official patches for that one, yet)? Basically, coreboot has its own log buffer where it stores the hardware init diagnostic messages (very similar to the Linux kernel message buffer) and it could make sense to use the same memory area for both purposes (if you can deal with coreboot messages being absent after a kernel Oops). Thoughts? Regards, Carl-Daniel
On Mon, 12 Oct 2009 20:23:46 +0200 Ingo Molnar <mingo@elte.hu> wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > On Mon, 12 Oct 2009 08:36:38 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > I agree with the "save kernel buffer on panic" thing, but I disagree with > > > making it anything new, and hooking into "printk()" or the console > > > subsystem AT ALL. That's just bogus, stupid, and WRONG. > > > > > > What you can do is to just flush the 'log_buf' buffer (or as much of it as > > > you want - the buffer may be a megabyte in size, and maybe you only want > > > to flush the last 8kB or something like that) on oops. And _not_ mix this > > > up with anything else. > > > > What he said. I did it that way in the Digeo kernel back in 2002. > > Worked good. > > > > Doing it via a console is rather weird. It will need core kernel > > changes to do it properly. > > > > Perhaps oops_enter() is a good place to mark the start of the log, and > > flush it within oops_exit(). > > Simplest would be to do the last 2K in oops_exit()? That gives the oops, > and the history leading up to it. Since the blocking is 2K, the extra > log output is for free. > > (unless the oops is larger than 2K - but that is rather rare.) > Some oops traces can be pretty large. Perhaps "oops_enter minus 1k up to oops_exit". The digeo kernel later got changed to package the oops into a binary record format for logging to nvram. iirc that was for space reasons, and for ease of downstream analysis. That kernel used to be downloadable but I can't immediately find it. Probably not very interesting anyway.
On Mon, 12 Oct 2009, Ingo Molnar wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > > Perhaps oops_enter() is a good place to mark the start of the log, and > > flush it within oops_exit(). > > Simplest would be to do the last 2K in oops_exit()? That gives the oops, > and the history leading up to it. Since the blocking is 2K, the extra > log output is for free. I agree, except I don't think it should be fixed to 2k. We should just dump as much as is "appropriate" for the dump device. It might be the last 2kB, it might be 8kB, it might be 64kB. We don't know, we don't care. The device may have its own per-device limits. Any extra data we get from before the oops is just gravy (often there might be interestign warning messages leadign up to the dump), and if the oops is too big for the dump device, it's not something we can do anything about anyway. So the logic should literally be something like this: - kernel/printk.c: void dump_kmsg(void) { unsigned long len = ACCESS_ONCE(log_end); struct dump_device *dump; const char *s1, *s2; unsigned long l1, l2; s1 = ""; l1 = 0; s2 = log_buf; l2 = len; /* Have we rotated around the circular buffer? */ if (len > log_buf_len) { unsigned long pos = (len & LOG_BUF_MASK); s1 = log_buf + pos; l1 = log_buf_len - pos; s2 = log_buf; l2 = pos; } list_for_each_entry (dump, dump_list, list) { dump->fn(s1, l1, s2, l2); } } ie we just always give the whole buffer (as two "sections", since it's a circular buffer) to the dumper, and then the dumper can decide how much of those buffers it is able to dump sanely. If the dumper has some hardware limitation where it can only dump a single aligned block, it can decide to just look at <s2,l2> which is the "latter" part of the dump. It's usually going to be fine, and it should be page-aligned (and you can even round up the size to a page, since the worst that can happen is that you'll see some old printk output at the end) Look ma, no locking, no buffer allocations, no nothing. Linus
* Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Mon, 12 Oct 2009, Ingo Molnar wrote: > > > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > Perhaps oops_enter() is a good place to mark the start of the log, and > > > flush it within oops_exit(). > > > > Simplest would be to do the last 2K in oops_exit()? That gives the > > oops, and the history leading up to it. Since the blocking is 2K, > > the extra log output is for free. > > I agree, except I don't think it should be fixed to 2k. Yeah - i cited 2K only because that is what mtdoops uses. > void dump_kmsg(void) > [...] > > Look ma, no locking, no buffer allocations, no nothing. Neat ... This could also be used for a warm-reboot preserve-memory thing as well. A well-known 4K (or so) area to preserve and print out during the next bootup after a crash. dump_kmsg() could copy the kernel's last will out to that area, or so. That would be cross-kernel compatible and the newly booted kernel image wouldnt overwrite this area. (which it does currently via its __log_buf[]) Ingo
* Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net> wrote: > > I wish it worked on any of the 10+ x86 systems i have. Is there > > anyone who'd be interested in exploring whether warm BIOS reboots > > work _anywhere_? > > AFAIK memory clearing is default off in coreboot for non-ECC RAM and > default on for ECC RAM (to avoid parity errors on read, but that can > probably be worked around). Unless I'm mistaken, the SeaBIOS BIOS > compatibility layer on top of coreboot doesn't erase RAM at all, so > contents can survive. > > No idea about classic AMI/Award/Phoenix/Insyde/whatever BIOS, though. I wouldnt mind to support this for coreboot too, but it will only be practical if there's at least one standard BIOS out of the many i run that supports warm reboot in practice ... Can try a test-patch on all of those systems btw. - we do support warm-reboot on x86 in arch/x86/kernel/reboot.c: /* Write 0x1234 to absolute memory location 0x472. The BIOS reads this on booting to tell it to "Bypass memory test (also warm boot)". This seems like a fairly standard thing that gets set by REBOOT.COM programs, and the previous reset routine did this too. */ *((unsigned short *)0x472) = reboot_mode; But someone would have to come up with a test-patch to prove whether it works really works. Ingo
Here's what I get for traveling and not reading mail for two days... On Mon, 2009-10-12 at 11:45 -0700, Linus Torvalds wrote: > > On Mon, 12 Oct 2009, Ingo Molnar wrote: > > > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > Perhaps oops_enter() is a good place to mark the start of the log, and > > > flush it within oops_exit(). > > > > Simplest would be to do the last 2K in oops_exit()? That gives the oops, > > and the history leading up to it. Since the blocking is 2K, the extra > > log output is for free. > > I agree, except I don't think it should be fixed to 2k. > > We should just dump as much as is "appropriate" for the dump device. It > might be the last 2kB, it might be 8kB, it might be 64kB. We don't know, > we don't care. The device may have its own per-device limits. Any extra > data we get from before the oops is just gravy (often there might be > interestign warning messages leadign up to the dump), and if the oops is > too big for the dump device, it's not something we can do anything about > anyway. I'm working on something different but related - also using the ring buffer and just getting as much from its tail as I am able to conserve. The approach in my case is to write a 2D bar code to the screen and have the user take a picture / submit the picture to kerneloops.org where it then gets decoded back into the oops message. This is intended for situations where you don't have access to other storage / network - or where a picture of the screen is actually the easiest way to get to the information. Right now the project is slightly stalled as I am running into an unexpected project on the decode side, but I'd love to make sure that the core changes I'm doing integrate cleanly with this project... > > So the logic should literally be something like this: > > - kernel/printk.c: > > void dump_kmsg(void) > { > unsigned long len = ACCESS_ONCE(log_end); > struct dump_device *dump; > const char *s1, *s2; > unsigned long l1, l2; > > s1 = ""; > l1 = 0; > s2 = log_buf; > l2 = len; > > /* Have we rotated around the circular buffer? */ > if (len > log_buf_len) { > unsigned long pos = (len & LOG_BUF_MASK); > > s1 = log_buf + pos; > l1 = log_buf_len - pos; > > s2 = log_buf; > l2 = pos; > } > > list_for_each_entry (dump, dump_list, list) { > dump->fn(s1, l1, s2, l2); > } > } > > ie we just always give the whole buffer (as two "sections", since it's a > circular buffer) to the dumper, and then the dumper can decide how much of > those buffers it is able to dump sanely. That's pretty close to what I do - only in my case the information then doesn't get written to a device but instead gets compressed, encoded and displayed on the framebuffer... /D
On Mon, 12 Oct 2009 11:45:15 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > So the logic should literally be something like this: > > - kernel/printk.c: > > void dump_kmsg(void) > { > unsigned long len = ACCESS_ONCE(log_end); OK, I'll get back with a few patches that does this later today. // Simon
On Tue, 2009-10-13 at 09:58 +0200, Simon Kagstrom wrote: > On Mon, 12 Oct 2009 11:45:15 -0700 (PDT) > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > So the logic should literally be something like this: > > > > - kernel/printk.c: > > > > void dump_kmsg(void) > > { > > unsigned long len = ACCESS_ONCE(log_end); > > OK, I'll get back with a few patches that does this later today. Yeah, I think it is good idea to work on a generic support, and simplify mtdoops.
Hi! Here are a couple of patches to mtdoops which I've been working on for a while and which were recently discussed on the MTD list. They are applied on top of Artems tree, http://git.infradead.org/users/dedekind/l2-mtd-2.6.git which apart from the mainline tree contains a panic fix from Aaro and Artems mtdoops style cleanup patch. Now, I'd like to warn sensitive readers that patch 5 *does* contain work queue elements. However, I believe there is a good reason for this: The mtd->write call is not good to call while oopsing, and mtd->panic_write is sort of a last resort. Now, if we panic anyhow, mtd->panic_write will be called, so the oops will still be written and the scheduled work won't run anyway. The patches are: 1. Avoid erasing already empty areas (the area would be erased at each module load otherwise) 2. Keep track of mtdoops page cleanliness in an array. This allows mtdoops_inc_counter to be called during panic (which fails in my case with the current code in mtd->read, although I believe this is MTD-driver dependent). 3. Make page size configurable to support longer messages. Mainly needed for patch 3, but also allows longer messages to be stored during panics (when the next oops area cannot be erased). 4. (kernel/printk.c) Add a dump_device as per Linus suggestion which includes a dump_kmsg function which dumps the kernel log buffer to registered devices. 5. Refactor mtdoops as a dump_device device instead of a console device. ChangeLog: v2: * Patches have been reordered to keep the least controversial (?) first. * Implement Artems suggestion to keep cleanliness in an array * Use Aaros patch to fix the panic_on_oops problem v3: * Correct checkpatch issues v4: Review corrections * Rebase on top of "[PATCH] mtd: mtdoops: several minor cleanups" * Patch 1 has been added * Use 1 bit per oops page, and rename clean/dirty unused/used * Don't initialize oops_page_dirty (it's a global structure anyway so it will be cleared together with the rest of BSS) * Rename mtdoops_page_size record_size (although only for the page size setting) v5: Corrections after panic_on_oops discussion * Add dump_device * Refactor mtdoops as a dump device.
Hi! Here are a couple of patches to mtdoops which I've been working on for a while and which were recently discussed on the MTD list. They are applied on top of Artems tree, http://git.infradead.org/users/dedekind/l2-mtd-2.6.git which apart from the mainline tree contains a panic fix from Aaro and Artems mtdoops style cleanup patch. Now, I'd like to warn sensitive readers that patch 5 *does* contain work queue elements. However, I believe there is a good reason for this: The mtd->write call is not good to call while oopsing, and mtd->panic_write is sort of a last resort. Now, if we panic anyhow, mtd->panic_write will be called, so the oops will still be written and the scheduled work won't run anyway. The patches are: 1. Avoid erasing already empty areas (the area would be erased at each module load otherwise) 2. Keep track of mtdoops page cleanliness in an array. This allows mtdoops_inc_counter to be called during panic (which fails in my case with the current code in mtd->read, although I believe this is MTD-driver dependent). 3. Make page size configurable to support longer messages. Mainly needed for patch 4, but also allows longer messages to be stored during panics (when the next oops area cannot be erased). 4. (kernel/printk.c) Add a dump_device as per Linus suggestion which includes a dump_kmsg function which dumps the kernel log buffer to registered devices. 5. Refactor mtdoops as a dump_device device instead of a console device. ChangeLog: v2: * Patches have been reordered to keep the least controversial (?) first. * Implement Artems suggestion to keep cleanliness in an array * Use Aaros patch to fix the panic_on_oops problem v3: * Correct checkpatch issues v4: Review corrections * Rebase on top of "[PATCH] mtd: mtdoops: several minor cleanups" * Patch 1 has been added * Use 1 bit per oops page, and rename clean/dirty unused/used * Don't initialize oops_page_dirty (it's a global structure anyway so it will be cleared together with the rest of BSS) * Rename mtdoops_page_size record_size (although only for the page size setting) v5: Corrections after panic_on_oops discussion * Add dump_device * Refactor mtdoops as a dump device. v6: More corrections (panic_on_oops and Anders Grafström, my room-mate at work) * Pathces 2-5 modified * Use set_bit/clear_bit primitives in patch 2 * Set permissions for mtdoops arguments correct in patch 3 and 5 * Rename files and structures after Linus suggestions in patch 4 * Correct output under some conditions in patch 5 // Simon
diff --git a/drivers/mtd/mtdoops.c b/drivers/mtd/mtdoops.c index 1060337..ac67833 100644 --- a/drivers/mtd/mtdoops.c +++ b/drivers/mtd/mtdoops.c @@ -335,7 +335,7 @@ static void mtdoops_console_sync(void) cxt->ready = 0; spin_unlock_irqrestore(&cxt->writecount_lock, flags); - if (mtd->panic_write && in_interrupt()) + if (mtd->panic_write && (in_interrupt() || panic_on_oops)) /* Interrupt context, we're going to panic so try and log */ mtdoops_write(cxt, 1); else