Message ID | 20110112172646.GB13496@thunk.org |
---|---|
State | Not Applicable, archived |
Headers | show |
On Wed, 12 Jan 2011, Ted Ts'o wrote: > Since I don't have a machine set up to test hibernation easily at > hand, I'd really appreciate it if you could try this patch to > determine which inode had the NULL jinode --- and then once you get > the device and inode number, to use debugfs's "ncheck" command to map > the inode number to a pathname. > > If you could do that, it would be a huge help. ok, will do > > Thanks, regards, > > - Ted > > P.S. Also, if you could try suspending once or twice, with different > programs running, to see if the inode number and pathname are constant > or vary, that would also be helpful. > > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h > index d8b992e..7d6d7d7 100644 > --- a/fs/ext4/ext4_jbd2.h > +++ b/fs/ext4/ext4_jbd2.h > @@ -252,8 +252,15 @@ static inline int ext4_journal_force_commit(journal_t *journal) > > static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode) > { > - if (ext4_handle_valid(handle)) > + if (ext4_handle_valid(handle)) { > + if (unlikely(EXT4_I(inode)->jinode == NULL)) { > + /* Should never happen */ > + ext4_msg(inode->i_sb, KERN_CRIT, > + "inode #%lu has NULL jinode", inode->i_ino); > + BUG(); > + } > return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode); > + } > return 0; > } > > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On Wed, 12 Jan 2011, Ted Ts'o wrote: > Since I don't have a machine set up to test hibernation easily at > hand, I'd really appreciate it if you could try this patch to > determine which inode had the NULL jinode --- and then once you get > the device and inode number, to use debugfs's "ncheck" command to map > the inode number to a pathname. > > If you could do that, it would be a huge help. I did about 5 re-tests with different loads, the result was always: [ 249.238697] EXT4-fs (dasda1): inode #1106 has NULL jinode [ 249.238732] ------------[ cut here ]------------ [ 249.238738] kernel BUG at fs/ext4/ext4_jbd2.h:260! [ 249.238747] illegal operation: 0001 [#1] PREEMPT SMP [ 249.238760] Modules linked in: binfmt_misc dm_multipath scsi_dh vmur [last unloaded: scsi_wait_scan] [ 249.238786] CPU: 2 Not tainted 2.6.37-05668-g4162cf6-dirty #20 [ 249.238794] Process flush-94:0 (pid: 1244, task: 000000003f110040, ksp: 000000003dca3c70) [ 249.238803] Krnl PSW : 0704000180000000 00000000001d1130 (mpage_da_map_and_submit+0x464/0x468) [ 249.238824] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3 [ 249.238835] Krnl GPRS: 000000000000c65f 0000000000000000 0000000000000043 0000000000000000 [ 249.238848] 00000000004735c0 0000000000000000 000000003ee45478 0000000000000001 [ 249.238863] 0000000035544aa8 000000003dca3a88 0000000000000004 000000003d29d680 [ 249.238868] 000000003dca3b80 000000000049ef98 00000000001d112c 000000003dca39d8 [ 249.238886] Krnl Code: 00000000001d1120: c040001e01c2 larl %r4,5914a4 [ 249.238893] 00000000001d1126: c0e5000092ed brasl %r14,1e3700 [ 249.238899] 00000000001d112c: a7f40001 brc 15,1d112e [ 249.238906] >00000000001d1130: a7f40000 brc 15,1d1130 [ 249.238913] 00000000001d1134: e3e0f0080024 stg %r14,8(%r15) [ 249.238920] 00000000001d113a: c0f400000006 brcl 15,1d1146 [ 249.238942] 00000000001d1140: b9040000 lgr %r0,%r0 [ 249.238963] 00000000001d1144: 0de1 basr %r14,%r1 [ 249.238989] Call Trace: [ 249.238998] ([<00000000001d112c>] mpage_da_map_and_submit+0x460/0x468) [ 249.239012] [<00000000001d19f2>] ext4_da_writepages+0x32e/0x744 [ 249.239028] [<000000000016fe64>] writeback_single_inode+0xd4/0x25c [ 249.239044] [<0000000000170514>] writeback_sb_inodes+0xd0/0x160 [ 249.239050] [<0000000000171278>] writeback_inodes_wb+0x90/0x16c [ 249.239055] [<00000000001715fa>] wb_writeback+0x2a6/0x480 [ 249.239060] [<00000000001718b6>] wb_do_writeback+0xe2/0x294 [ 249.239066] [<0000000000171b14>] bdi_writeback_thread+0xac/0x2fc [ 249.239072] [<000000000007d192>] kthread+0xbe/0xc8 [ 249.239087] [<0000000000479182>] kernel_thread_starter+0x6/0xc [ 249.239101] [<000000000047917c>] kernel_thread_starter+0x0/0xc [ 249.239112] INFO: lockdep is turned off. [ 249.239119] Last Breaking-Event-Address: [ 249.239126] [<00000000001d112c>] mpage_da_map_and_submit+0x460/0x468 # debugfs /dev/dasda1 debugfs 1.41.10 (10-Feb-2009) debugfs: ncheck 1106 Inode Pathname 1106 /usr/bin/killall debugfs: q there are also a few messages like: EXT4-fs error (device dasda1): ext4_lookup:1043: inode #130849: comm find: deleted inode referenced: 129148 I guess they appeared after one of the crashes during bisecting (can't run fsck right now, since its my /) Regards, Sebastian > > Thanks, regards, > > - Ted > > P.S. Also, if you could try suspending once or twice, with different > programs running, to see if the inode number and pathname are constant > or vary, that would also be helpful. > > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h > index d8b992e..7d6d7d7 100644 > --- a/fs/ext4/ext4_jbd2.h > +++ b/fs/ext4/ext4_jbd2.h > @@ -252,8 +252,15 @@ static inline int ext4_journal_force_commit(journal_t *journal) > > static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode) > { > - if (ext4_handle_valid(handle)) > + if (ext4_handle_valid(handle)) { > + if (unlikely(EXT4_I(inode)->jinode == NULL)) { > + /* Should never happen */ > + ext4_msg(inode->i_sb, KERN_CRIT, > + "inode #%lu has NULL jinode", inode->i_ino); > + BUG(); > + } > return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode); > + } > return 0; > } > > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Jan 12, 2011, at 1:49 PM, Sebastian Ott wrote: > > I did about 5 re-tests with different loads, the result was always: > > [ 249.238697] EXT4-fs (dasda1): inode #1106 has NULL jinode > > # debugfs /dev/dasda1 > debugfs 1.41.10 (10-Feb-2009) > debugfs: ncheck 1106 > Inode Pathname > 1106 /usr/bin/killall > debugfs: q > That looks really bogus. /usr/bin/killall is a system binary, and there's no good reason that hibernation should be trying to write pages to that binary. You said originally that the oops was happening "while going into hibernation right after resuming with...". So that means you did a successful suspend/resume, and then the second suspend caused the oops? It looks like somehow the pages were left marked as dirty, so the writeback daemons attempted writing back a page to an inode which was never opened read/write (and in fact as a text page for /usr/bin/killall, was mapped read/only). Given that ext4 initializes jinode only when the file is opened read/write, the fact that it is null, and the fact that it makes no sense that a program would be modifying /usr/bin/killall as part of a suspend/resume, it looks very much like we just unmasked a software suspend bug.... -- Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 12, 2011 at 07:44:17PM -0500, Theodore Tso wrote: > > You said originally that the oops was happening "while going into > hibernation right after resuming with...". So that means you did a > successful suspend/resume, and then the second suspend caused the > oops? It looks like somehow the pages were left marked as dirty, so > the writeback daemons attempted writing back a page to an inode > which was never opened read/write (and in fact as a text page for > /usr/bin/killall, was mapped read/only). Given that ext4 > initializes jinode only when the file is opened read/write, the fact > that it is null, and the fact that it makes no sense that a program > would be modifying /usr/bin/killall as part of a suspend/resume, it > looks very much like we just unmasked a software suspend bug.... ... and I think I've found the problem. In kernel/power/block_io.c, in the function submit(), we see this: if (bio_chain == NULL) { submit_bio(bio_rw, bio); wait_on_page_locked(page); if (rw == READ) bio_set_pages_dirty(bio); <==== bio_put(bio); So when we read in pages from the software suspend device, we end up marking the pages as dirty(!). I'm guessing this was caused by a copy and paste from the only other caller of bio_set_pages_dirty(), which is the direct I/O code, which needs this when we are writing from a file into a user-provided buffer. But for restoring from a software suspend case, this is as far as I can tell wholely inappropriate. This causes needless writes, which is bad even before ext4 unmasked the problem. I will send a patch under separate cover; could you give it a try and see if it fixes your crash? I will look into bulletproofing ext4 by adding checks for this case and printing warning messages, but neverthe less, I think the root cause is actually in the hibernation's bio code. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-01-12 at 19:44 -0500, Theodore Tso wrote: > It looks like somehow the pages were left marked as dirty, so the > writeback daemons attempted writing back a page to an inode which was > never opened read/write (and in fact as a text page > for /usr/bin/killall, was mapped read/only). Just out of curiosity, when this happens, is LZO compression being used with hibernation (default in 2.6.37) or has it been disabled?
On Wed, 12 Jan 2011, Theodore Tso wrote: > That looks really bogus. /usr/bin/killall is a system binary, and there's > no good reason that hibernation should be trying to write pages to that > binary. > > You said originally that the oops was happening "while going into > hibernation right after resuming with...". So that means you did a > successful suspend/resume, and then the second suspend caused the oops? Yes. I basically did a echo reboot > /sys/power/disk for i in {1..5} ;do echo disk > /sys/power/state ;done and it crashed very early in the second suspend. > It looks like somehow the pages were left marked as dirty, so the > writeback daemons attempted writing back a page to an inode which was > never opened read/write (and in fact as a text page for > /usr/bin/killall, was mapped read/only). > Given that ext4 initializes jinode only when the file is opened > read/write, the fact that it is null, and the fact that it makes no > sense that a program would be modifying /usr/bin/killall as part of a > suspend/resume, it looks very much like we just unmasked a software > suspend bug.... Ah, ok. Thanks for the explanation! Regards, Sebastian -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On Thu, 13 Jan 2011, Bojan Smojver wrote: > On Wed, 2011-01-12 at 19:44 -0500, Theodore Tso wrote: > > It looks like somehow the pages were left marked as dirty, so the > > writeback daemons attempted writing back a page to an inode which was > > never opened read/write (and in fact as a text page > > for /usr/bin/killall, was mapped read/only). > > Just out of curiosity, when this happens, is LZO compression being used > with hibernation (default in 2.6.37) or has it been disabled? LZO is being used - should I try with hibernate=nocompress? Regards, Sebastian -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Yes. I basically did a > echo reboot > /sys/power/disk > for i in {1..5} ;do echo disk > /sys/power/state ;done > and it crashed very early in the second suspend. Was compression on or off? -- Bojan -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 13 Jan 2011, Bojan Smojver wrote: > > Yes. I basically did a > > echo reboot > /sys/power/disk > > for i in {1..5} ;do echo disk > /sys/power/state ;done > > and it crashed very early in the second suspend. > > Was compression on or off? compression was on. I did a re-test with hibernate=nocompress ..but with the same result Regards, Sebastian -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 13, 2011 at 12:48:40PM +0100, Sebastian Ott wrote: > On Wed, 12 Jan 2011, Theodore Tso wrote: > > It looks like somehow the pages were left marked as dirty, so the > > writeback daemons attempted writing back a page to an inode which was > > never opened read/write (and in fact as a text page for > > /usr/bin/killall, was mapped read/only). > > Given that ext4 initializes jinode only when the file is opened > > read/write, the fact that it is null, and the fact that it makes no > > sense that a program would be modifying /usr/bin/killall as part of a > > suspend/resume, it looks very much like we just unmasked a software > > suspend bug.... > Ah, ok. Thanks for the explanation! Eeeek... this seems to be an architecture specific bug that is only present on s390. The dirty bit for user space pages on all architectures but s390 are stored into the PTE's. On s390 however they are stored into the storage key that exists per _physical_ page. So, what we should have done, when implementing suspend/resume on s390, is to save the storage key for each page and write that to the suspend device and upon resume restore the storage key contents for each physical page. The code that would do that is missing... Hence _all_ pages of the resumed image are dirty after they have been copied to their location. *ouch* Will fix. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 13, 2011 at 02:36:12PM +0100, Heiko Carstens wrote: > > Eeeek... this seems to be an architecture specific bug that is only present > on s390. > The dirty bit for user space pages on all architectures but s390 are stored > into the PTE's. On s390 however they are stored into the storage key that > exists per _physical_ page. > So, what we should have done, when implementing suspend/resume on s390, is > to save the storage key for each page and write that to the suspend device > and upon resume restore the storage key contents for each physical page. > The code that would do that is missing... Hence _all_ pages of the resumed > image are dirty after they have been copied to their location. > *ouch* > > Will fix. Glad you found the root cause. If you don't think you can get this fixed quickly, before -rc2 or -rc3, I can fairly quickly add some checks to ext4 to detect this condition, issue a warning, and then return an error code from the ->writepages() hook. (Which will then promptly be ignored by the writeback code, since, hey, what are they going to do with an error, but that's a discussion for another forum.) Would that be helpful? I'm still a bit concerned with the call to set the pages' PTE to be dirty that I found in the hibernate code, but I accept the fact that removing it doesn't solve the s390 crash. It still seems wrong to me, and hopefully someone from linux-pm can look at that more closely. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2011-01-13 at 13:46 -0500, Ted Ts'o wrote: > I'm still a bit concerned with the call to set the pages' PTE to be > dirty that I found in the hibernate code, but I accept the fact that > removing it doesn't solve the s390 crash. It still seems wrong to me, > and hopefully someone from linux-pm can look at that more closely. If I'm understanding things correctly, this should affect only the situation when compression is not used. Otherwise, pages that are read into by block I/O are decompressed first and copied into different pages. No?
On Fri, Jan 14, 2011 at 08:30:12AM +1100, Bojan Smojver wrote: > On Thu, 2011-01-13 at 13:46 -0500, Ted Ts'o wrote: > > I'm still a bit concerned with the call to set the pages' PTE to be > > dirty that I found in the hibernate code, but I accept the fact that > > removing it doesn't solve the s390 crash. It still seems wrong to me, > > and hopefully someone from linux-pm can look at that more closely. > > If I'm understanding things correctly, this should affect only the > situation when compression is not used. Otherwise, pages that are read > into by block I/O are decompressed first and copied into different > pages. No? When the pages get copied to their final resting place the dirty bits of the corresponding physical pages get set automatically by the hardware. If there would be some code that would clear the dirty bit after the copy operation then we would have some underindication and as a result the possibility of data corruption, but we've never seen this. However since s390 is the only architecture which has dirty bits for physical pages I doubt that there is any such code present. So the bug should happen independently of image compression. What is missing is code that restores the original storage keys after the pages have been copied to their final place. I need to read the suspend/resume code and figure out how to fix this. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> However since s390 is the only architecture which has dirty bits for > physical pages I doubt that there is any such code present. So the > bug should happen independently of image compression. Ah, sorry - I wasn't clear there. I meant for other archs, where setting the dirty bit on read may matter when compression is off. Obviously, s390 will have to have its own problem fixed separately. -- Bojan -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h index d8b992e..7d6d7d7 100644 --- a/fs/ext4/ext4_jbd2.h +++ b/fs/ext4/ext4_jbd2.h @@ -252,8 +252,15 @@ static inline int ext4_journal_force_commit(journal_t *journal) static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode) { - if (ext4_handle_valid(handle)) + if (ext4_handle_valid(handle)) { + if (unlikely(EXT4_I(inode)->jinode == NULL)) { + /* Should never happen */ + ext4_msg(inode->i_sb, KERN_CRIT, + "inode #%lu has NULL jinode", inode->i_ino); + BUG(); + } return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode); + } return 0; }