Patchwork Oops while going into hibernate

login
register
mail settings
Submitter Theodore Ts'o
Date Jan. 12, 2011, 5:26 p.m.
Message ID <20110112172646.GB13496@thunk.org>
Download mbox | patch
Permalink /patch/78593/
State Not Applicable
Headers show

Comments

Theodore Ts'o - Jan. 12, 2011, 5:26 p.m.
Since I don't have a machine set up to test hibernation easily at
hand, I'd really appreciate it if you could try this patch to
determine which inode had the NULL jinode --- and then once you get
the device and inode number, to use debugfs's "ncheck" command to map
the inode number to a pathname.

If you could do that, it would be a huge help.

Thanks, regards,

						- Ted

P.S.  Also, if you could try suspending once or twice, with different
programs running, to see if the inode number and pathname are constant
or vary, that would also be helpful.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sebastian Ott - Jan. 12, 2011, 5:37 p.m.
On Wed, 12 Jan 2011, Ted Ts'o wrote:
> Since I don't have a machine set up to test hibernation easily at
> hand, I'd really appreciate it if you could try this patch to
> determine which inode had the NULL jinode --- and then once you get
> the device and inode number, to use debugfs's "ncheck" command to map
> the inode number to a pathname.
> 
> If you could do that, it would be a huge help.
ok, will do

> 
> Thanks, regards,
> 
> 						- Ted
> 
> P.S.  Also, if you could try suspending once or twice, with different
> programs running, to see if the inode number and pathname are constant
> or vary, that would also be helpful.
> 
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index d8b992e..7d6d7d7 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -252,8 +252,15 @@ static inline int ext4_journal_force_commit(journal_t *journal)
> 
>  static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
>  {
> -	if (ext4_handle_valid(handle))
> +	if (ext4_handle_valid(handle)) {
> +		if (unlikely(EXT4_I(inode)->jinode == NULL)) {
> +			/* Should never happen */
> +			ext4_msg(inode->i_sb, KERN_CRIT, 
> +				 "inode #%lu has NULL jinode", inode->i_ino);
> +			BUG();
> +		}
>  		return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
> +	}
>  	return 0;
>  }
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sebastian Ott - Jan. 12, 2011, 6:49 p.m.
Hi,
On Wed, 12 Jan 2011, Ted Ts'o wrote:
> Since I don't have a machine set up to test hibernation easily at
> hand, I'd really appreciate it if you could try this patch to
> determine which inode had the NULL jinode --- and then once you get
> the device and inode number, to use debugfs's "ncheck" command to map
> the inode number to a pathname.
> 
> If you could do that, it would be a huge help.

I did about 5 re-tests with different loads, the result was always:

[  249.238697] EXT4-fs (dasda1): inode #1106 has NULL jinode
[  249.238732] ------------[ cut here ]------------
[  249.238738] kernel BUG at fs/ext4/ext4_jbd2.h:260!
[  249.238747] illegal operation: 0001 [#1] PREEMPT SMP 
[  249.238760] Modules linked in: binfmt_misc dm_multipath scsi_dh vmur [last unloaded: scsi_wait_scan]
[  249.238786] CPU: 2 Not tainted 2.6.37-05668-g4162cf6-dirty #20
[  249.238794] Process flush-94:0 (pid: 1244, task: 000000003f110040, ksp: 000000003dca3c70)
[  249.238803] Krnl PSW : 0704000180000000 00000000001d1130 (mpage_da_map_and_submit+0x464/0x468)
[  249.238824]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
[  249.238835] Krnl GPRS: 000000000000c65f 0000000000000000 0000000000000043 0000000000000000
[  249.238848]            00000000004735c0 0000000000000000 000000003ee45478 0000000000000001
[  249.238863]            0000000035544aa8 000000003dca3a88 0000000000000004 000000003d29d680
[  249.238868]            000000003dca3b80 000000000049ef98 00000000001d112c 000000003dca39d8
[  249.238886] Krnl Code: 00000000001d1120: c040001e01c2        larl    %r4,5914a4
[  249.238893]            00000000001d1126: c0e5000092ed        brasl   %r14,1e3700
[  249.238899]            00000000001d112c: a7f40001            brc     15,1d112e
[  249.238906]           >00000000001d1130: a7f40000            brc     15,1d1130
[  249.238913]            00000000001d1134: e3e0f0080024        stg     %r14,8(%r15)
[  249.238920]            00000000001d113a: c0f400000006        brcl    15,1d1146
[  249.238942]            00000000001d1140: b9040000            lgr     %r0,%r0
[  249.238963]            00000000001d1144: 0de1                basr    %r14,%r1
[  249.238989] Call Trace:
[  249.238998] ([<00000000001d112c>] mpage_da_map_and_submit+0x460/0x468)
[  249.239012]  [<00000000001d19f2>] ext4_da_writepages+0x32e/0x744
[  249.239028]  [<000000000016fe64>] writeback_single_inode+0xd4/0x25c
[  249.239044]  [<0000000000170514>] writeback_sb_inodes+0xd0/0x160
[  249.239050]  [<0000000000171278>] writeback_inodes_wb+0x90/0x16c
[  249.239055]  [<00000000001715fa>] wb_writeback+0x2a6/0x480
[  249.239060]  [<00000000001718b6>] wb_do_writeback+0xe2/0x294
[  249.239066]  [<0000000000171b14>] bdi_writeback_thread+0xac/0x2fc
[  249.239072]  [<000000000007d192>] kthread+0xbe/0xc8
[  249.239087]  [<0000000000479182>] kernel_thread_starter+0x6/0xc
[  249.239101]  [<000000000047917c>] kernel_thread_starter+0x0/0xc
[  249.239112] INFO: lockdep is turned off.
[  249.239119] Last Breaking-Event-Address:
[  249.239126]  [<00000000001d112c>] mpage_da_map_and_submit+0x460/0x468


# debugfs /dev/dasda1
debugfs 1.41.10 (10-Feb-2009)
debugfs:  ncheck 1106
Inode   Pathname
1106    /usr/bin/killall
debugfs:  q

there are also a few messages like:
EXT4-fs error (device dasda1): ext4_lookup:1043: inode #130849: comm find: deleted inode referenced: 129148
I guess they appeared after one of the crashes during bisecting (can't 
run fsck right now, since its my /)

Regards,
Sebastian

> 
> Thanks, regards,
> 
> 						- Ted
> 
> P.S.  Also, if you could try suspending once or twice, with different
> programs running, to see if the inode number and pathname are constant
> or vary, that would also be helpful.
> 
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index d8b992e..7d6d7d7 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -252,8 +252,15 @@ static inline int ext4_journal_force_commit(journal_t *journal)
> 
>  static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
>  {
> -	if (ext4_handle_valid(handle))
> +	if (ext4_handle_valid(handle)) {
> +		if (unlikely(EXT4_I(inode)->jinode == NULL)) {
> +			/* Should never happen */
> +			ext4_msg(inode->i_sb, KERN_CRIT, 
> +				 "inode #%lu has NULL jinode", inode->i_ino);
> +			BUG();
> +		}
>  		return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
> +	}
>  	return 0;
>  }
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Jan. 13, 2011, 12:44 a.m.
On Jan 12, 2011, at 1:49 PM, Sebastian Ott wrote:

> 
> I did about 5 re-tests with different loads, the result was always:
> 
> [  249.238697] EXT4-fs (dasda1): inode #1106 has NULL jinode
> 
> # debugfs /dev/dasda1
> debugfs 1.41.10 (10-Feb-2009)
> debugfs:  ncheck 1106
> Inode   Pathname
> 1106    /usr/bin/killall
> debugfs:  q
> 

That looks really bogus.  /usr/bin/killall is a system binary, and there's no good reason that hibernation should be trying to write pages to that binary.

You said originally that the oops was happening "while going into hibernation right after resuming with...".   So that means you did a successful suspend/resume, and then the second suspend caused the oops?   It looks like somehow the pages were left marked as dirty, so the writeback daemons attempted writing back a page to an inode which was never opened read/write (and in fact as a text page for /usr/bin/killall, was mapped read/only).   Given that ext4 initializes jinode only when the file is opened read/write, the fact that it is null, and the fact that it makes no sense that a program would be modifying /usr/bin/killall as part of a suspend/resume, it looks very much like we just unmasked a software suspend bug....

-- Ted



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Jan. 13, 2011, 5:56 a.m.
On Wed, Jan 12, 2011 at 07:44:17PM -0500, Theodore Tso wrote:
> 
> You said originally that the oops was happening "while going into
> hibernation right after resuming with...".  So that means you did a
> successful suspend/resume, and then the second suspend caused the
> oops?  It looks like somehow the pages were left marked as dirty, so
> the writeback daemons attempted writing back a page to an inode
> which was never opened read/write (and in fact as a text page for
> /usr/bin/killall, was mapped read/only).  Given that ext4
> initializes jinode only when the file is opened read/write, the fact
> that it is null, and the fact that it makes no sense that a program
> would be modifying /usr/bin/killall as part of a suspend/resume, it
> looks very much like we just unmasked a software suspend bug....

... and I think I've found the problem.  In kernel/power/block_io.c,
in the function submit(), we see this:

	if (bio_chain == NULL) {
		submit_bio(bio_rw, bio);
		wait_on_page_locked(page);
		if (rw == READ)
			bio_set_pages_dirty(bio); <====
		bio_put(bio);

So when we read in pages from the software suspend device, we end up
marking the pages as dirty(!).  I'm guessing this was caused by a copy
and paste from the only other caller of bio_set_pages_dirty(), which
is the direct I/O code, which needs this when we are writing from a
file into a user-provided buffer.  But for restoring from a software
suspend case, this is as far as I can tell wholely inappropriate.

This causes needless writes, which is bad even before ext4 unmasked
the problem.  I will send a patch under separate cover; could you give
it a try and see if it fixes your crash?

I will look into bulletproofing ext4 by adding checks for this case
and printing warning messages, but neverthe less, I think the root
cause is actually in the hibernation's bio code.

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bojan Smojver - Jan. 13, 2011, 11:12 a.m.
On Wed, 2011-01-12 at 19:44 -0500, Theodore Tso wrote:
> It looks like somehow the pages were left marked as dirty, so the
> writeback daemons attempted writing back a page to an inode which was
> never opened read/write (and in fact as a text page
> for /usr/bin/killall, was mapped read/only).

Just out of curiosity, when this happens, is LZO compression being used
with hibernation (default in 2.6.37) or has it been disabled?
Sebastian Ott - Jan. 13, 2011, 11:48 a.m.
On Wed, 12 Jan 2011, Theodore Tso wrote:
> That looks really bogus.  /usr/bin/killall is a system binary, and there's
> no good reason that hibernation should be trying to write pages to that
> binary.
> 
> You said originally that the oops was happening "while going into
> hibernation right after resuming with...".   So that means you did a
> successful suspend/resume, and then the second suspend caused the oops?
Yes. I basically did a
echo reboot > /sys/power/disk
for i in {1..5} ;do echo disk > /sys/power/state ;done
and it crashed very early in the second suspend.

> It looks like somehow the pages were left marked as dirty, so the
> writeback daemons attempted writing back a page to an inode which was
> never opened read/write (and in fact as a text page for
> /usr/bin/killall, was mapped read/only).
> Given that ext4 initializes jinode only when the file is opened
> read/write, the fact that it is null, and the fact that it makes no
> sense that a program would be modifying /usr/bin/killall as part of a
> suspend/resume, it looks very much like we just unmasked a software
> suspend bug....
Ah, ok. Thanks for the explanation!

Regards,
Sebastian
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sebastian Ott - Jan. 13, 2011, 11:49 a.m.
Hi,

On Thu, 13 Jan 2011, Bojan Smojver wrote:
> On Wed, 2011-01-12 at 19:44 -0500, Theodore Tso wrote:
> > It looks like somehow the pages were left marked as dirty, so the
> > writeback daemons attempted writing back a page to an inode which was
> > never opened read/write (and in fact as a text page
> > for /usr/bin/killall, was mapped read/only).
> 
> Just out of curiosity, when this happens, is LZO compression being used
> with hibernation (default in 2.6.37) or has it been disabled?
LZO is being used - should I try with hibernate=nocompress?

Regards,
Sebastian
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bojan Smojver - Jan. 13, 2011, 12:11 p.m.
> Yes. I basically did a
> echo reboot > /sys/power/disk
> for i in {1..5} ;do echo disk > /sys/power/state ;done
> and it crashed very early in the second suspend.

Was compression on or off?

--
Bojan 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sebastian Ott - Jan. 13, 2011, 12:31 p.m.
On Thu, 13 Jan 2011, Bojan Smojver wrote:
> > Yes. I basically did a
> > echo reboot > /sys/power/disk
> > for i in {1..5} ;do echo disk > /sys/power/state ;done
> > and it crashed very early in the second suspend.
> 
> Was compression on or off?
compression was on. I did a re-test with hibernate=nocompress
..but with the same result

Regards,
Sebastian
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Heiko Carstens - Jan. 13, 2011, 1:36 p.m.
On Thu, Jan 13, 2011 at 12:48:40PM +0100, Sebastian Ott wrote:
> On Wed, 12 Jan 2011, Theodore Tso wrote:
> > It looks like somehow the pages were left marked as dirty, so the
> > writeback daemons attempted writing back a page to an inode which was
> > never opened read/write (and in fact as a text page for
> > /usr/bin/killall, was mapped read/only).
> > Given that ext4 initializes jinode only when the file is opened
> > read/write, the fact that it is null, and the fact that it makes no
> > sense that a program would be modifying /usr/bin/killall as part of a
> > suspend/resume, it looks very much like we just unmasked a software
> > suspend bug....
> Ah, ok. Thanks for the explanation!

Eeeek... this seems to be an architecture specific bug that is only present
on s390.
The dirty bit for user space pages on all architectures but s390 are stored
into the PTE's. On s390 however they are stored into the storage key that
exists per _physical_ page.
So, what we should have done, when implementing suspend/resume on s390, is
to save the storage key for each page and write that to the suspend device
and upon resume restore the storage key contents for each physical page.
The code that would do that is missing... Hence _all_ pages of the resumed
image are dirty after they have been copied to their location.
*ouch*

Will fix.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Jan. 13, 2011, 6:46 p.m.
On Thu, Jan 13, 2011 at 02:36:12PM +0100, Heiko Carstens wrote:
> 
> Eeeek... this seems to be an architecture specific bug that is only present
> on s390.
> The dirty bit for user space pages on all architectures but s390 are stored
> into the PTE's. On s390 however they are stored into the storage key that
> exists per _physical_ page.
> So, what we should have done, when implementing suspend/resume on s390, is
> to save the storage key for each page and write that to the suspend device
> and upon resume restore the storage key contents for each physical page.
> The code that would do that is missing... Hence _all_ pages of the resumed
> image are dirty after they have been copied to their location.
> *ouch*
> 
> Will fix.

Glad you found the root cause.  If you don't think you can get this
fixed quickly, before -rc2 or -rc3, I can fairly quickly add some
checks to ext4 to detect this condition, issue a warning, and then
return an error code from the ->writepages() hook.  (Which will then
promptly be ignored by the writeback code, since, hey, what are they
going to do with an error, but that's a discussion for another forum.)
Would that be helpful?

I'm still a bit concerned with the call to set the pages' PTE to be
dirty that I found in the hibernate code, but I accept the fact that
removing it doesn't solve the s390 crash.  It still seems wrong to me,
and hopefully someone from linux-pm can look at that more closely.

    	      	      	   	    	     - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bojan Smojver - Jan. 13, 2011, 9:30 p.m.
On Thu, 2011-01-13 at 13:46 -0500, Ted Ts'o wrote:
> I'm still a bit concerned with the call to set the pages' PTE to be
> dirty that I found in the hibernate code, but I accept the fact that
> removing it doesn't solve the s390 crash.  It still seems wrong to me,
> and hopefully someone from linux-pm can look at that more closely. 

If I'm understanding things correctly, this should affect only the
situation when compression is not used. Otherwise, pages that are read
into by block I/O are decompressed first and copied into different
pages. No?
Heiko Carstens - Jan. 14, 2011, 9:53 a.m.
On Fri, Jan 14, 2011 at 08:30:12AM +1100, Bojan Smojver wrote:
> On Thu, 2011-01-13 at 13:46 -0500, Ted Ts'o wrote:
> > I'm still a bit concerned with the call to set the pages' PTE to be
> > dirty that I found in the hibernate code, but I accept the fact that
> > removing it doesn't solve the s390 crash.  It still seems wrong to me,
> > and hopefully someone from linux-pm can look at that more closely. 
> 
> If I'm understanding things correctly, this should affect only the
> situation when compression is not used. Otherwise, pages that are read
> into by block I/O are decompressed first and copied into different
> pages. No?

When the pages get copied to their final resting place the dirty bits
of the corresponding physical pages get set automatically by the
hardware.
If there would be some code that would clear the dirty bit after the
copy operation then we would have some underindication and as a result
the possibility of data corruption, but we've never seen this.
However since s390 is the only architecture which has dirty bits for
physical pages I doubt that there is any such code present. So the
bug should happen independently of image compression.

What is missing is code that restores the original storage keys after
the pages have been copied to their final place.
I need to read the suspend/resume code and figure out how to fix this.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bojan Smojver - Jan. 14, 2011, 1:14 p.m.
> However since s390 is the only architecture which has dirty bits for
> physical pages I doubt that there is any such code present. So the
> bug should happen independently of image compression.

Ah, sorry - I wasn't clear there. I meant for other archs, where setting 
the dirty bit on read may matter when compression is off.

Obviously, s390 will have to have its own problem fixed separately.

--
Bojan 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index d8b992e..7d6d7d7 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -252,8 +252,15 @@  static inline int ext4_journal_force_commit(journal_t *journal)
 
 static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
 {
-	if (ext4_handle_valid(handle))
+	if (ext4_handle_valid(handle)) {
+		if (unlikely(EXT4_I(inode)->jinode == NULL)) {
+			/* Should never happen */
+			ext4_msg(inode->i_sb, KERN_CRIT, 
+				 "inode #%lu has NULL jinode", inode->i_ino);
+			BUG();
+		}
 		return jbd2_journal_file_inode(handle, EXT4_I(inode)->jinode);
+	}
 	return 0;
 }