2.6.28.9: EXT3/NFS inodes corruption

Message ID 20090804225619.GB11097@duck.suse.cz
State Not Applicable, archived

Commit Message

Jan Kara Aug. 4, 2009, 10:56 p.m. UTC
Hi,

On Tue 04-08-09 13:15:05, Sylvain Rochet wrote:
> On Tue, Aug 04, 2009 at 12:29:01AM +0200, Jan Kara wrote:
> > 
> >   OK, I've found some time and written the debugging patch. Hopefully it
> > will tell us more. It should output messages to the kernel log if it
> > finds something suspicious - like:
> > No dentry for unlinked inode...
> > Dentry ... for unlinked inode ... has no parent
> > Found directory entry ... for unlinked inode
> > 
> >   When you see such messages in the log, send them to me please. Also
> > attach the System.map file so that I can translate the address where
> > i_nlink was dropped - for that ext3 should be compiled into the kernel
> > (should not be a module). Thanks a lot for testing.
> 
> Patch applied.
> 
> And there is already a lot of output.
> 
> http://edony.tuxfamily.org/~grad/bazooka/System.map-2.6.30.4
> http://edony.tuxfamily.org/~grad/bazooka/config-2.6.30.4
> http://edony.tuxfamily.org/~grad/bazooka/kern.log
  Thanks for testing. So you seem to be really stressing the path where
creation of new files / directories fails (probably due to group quota).  I
have one idea what could cause your filesystem corruption, although it's a
wild guess... Please try attached oneliner.
  Also your corruption reminded me that Al Viro has been fixing problems
where we could cache one inode twice when a filesystem was mounted over NFS
and that could also lead to a filesystem corruption. So I'm adding him to
CC just in case he has some idea. BTW Al, what do you think about the
problem I describe in the attached patch? I'm not sure if it can cause some
real problems but in theory it could...

								Honza

Comments

Sylvain Rochet Aug. 6, 2009, 1:15 p.m. UTC | #1
Hi,


On Wed, Aug 05, 2009 at 12:56:19AM +0200, Jan Kara wrote:
> 
> Thanks for testing. So you seem to be really stressing the path where
> creation of new files / directories fails (probably due to group quota).

Yes, there are 29 groups over quota out of a total of 4499. Those are
mainly spammed websites and therefore quite stressed due to the number
of attempts to add new "data".


> I have one idea what could cause your filesystem corruption, although 
> it's a wild guess... Please try attached oneliner.

Running since yesterday.


> Also your corruption reminded me that Al Viro has been fixing problems
> where we could cache one inode twice when a filesystem was mounted over NFS
> and that could also lead to a filesystem corruption. So I'm adding him to
> CC just in case he has some idea. BTW Al, what do you think about the
> problem I describe in the attached patch? I'm not sure if it can cause some
> real problems but in theory it could...

Should we upgrade NFS clients as well ?  (now running 2.6.28.9)


Sylvain
J . Bruce Fields Aug. 6, 2009, 5:05 p.m. UTC | #2
On Thu, Aug 06, 2009 at 03:15:56PM +0200, Sylvain Rochet wrote:
> Hi,
> 
> 
> On Wed, Aug 05, 2009 at 12:56:19AM +0200, Jan Kara wrote:
> > 
> > Thanks for testing. So you seem to be really stressing the path where
> > creation of new files / directories fails (probably due to group quota).
> 
> Yes, there are 29 groups over quota out of a total of 4499. Those are
> mainly spammed websites and therefore quite stressed due to the number
> of attempts to add new "data".
> 
> 
> > I have one idea what could cause your filesystem corruption, although 
> > it's a wild guess... Please try attached oneliner.
> 
> Running since yesterday.
> 
> 
> > Also your corruption reminded me that Al Viro has been fixing problems
> > where we could cache one inode twice when a filesystem was mounted over NFS
> > and that could also lead to a filesystem corruption. So I'm adding him to
> > CC just in case he has some idea. BTW Al, what do you think about the
> > problem I describe in the attached patch? I'm not sure if it can cause some
> > real problems but in theory it could...
> 
> Should we upgrade NFS clients as well ?  (now running 2.6.28.9)

The client version shouldn't matter.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Aug. 12, 2009, 10:34 p.m. UTC | #3
Hello,

On Thu 06-08-09 15:15:56, Sylvain Rochet wrote:
> > I have one idea what could cause your filesystem corruption, although 
> > it's a wild guess... Please try attached oneliner.
> 
> Running since yesterday.
  Any news after a week of running? How often did the corruption happen
previously?

								Honza
Sylvain Rochet Aug. 20, 2009, 5:19 p.m. UTC | #4
Hi!,

On Thu, Aug 13, 2009 at 12:34:53AM +0200, Jan Kara wrote:
>   Hello,
> 
> On Thu 06-08-09 15:15:56, Sylvain Rochet wrote:
> > > I have one idea what could cause your filesystem corruption, although 
> > > it's a wild guess... Please try attached oneliner.
> > 
> > Running since yesterday.
> 
> Any news after a week of running? How often did the corruption happen
> previously?

Sorry for the late answer, I was lurking at HAR ;-)

So, everything is fine, but the problem happened only one time on this 
server, so we cannot conclude anything after a few weeks. However, 
I now have physical access back, so we will switch back to the former 
server where the problem happened quite frequently, then we will see!

By the way, syslogd is happy, eating about 350 MiB of kernel logs a day ;)

Sylvain
Simon Kirby Aug. 21, 2009, midnight UTC | #5
On Thu, Aug 20, 2009 at 07:19:53PM +0200, Sylvain Rochet wrote:

> So, everything is fine, but the problem happened only one time on this 
> server, so we cannot conclude anything after a few weeks. However, 
> I now have physical access back, so we will switch back to the former 
> server where the problem happened quite frequently, then we will see!

Not to derail the thread, but you were definitely seeing the same issues
with stock 2.6.30.4, right?  We had all sorts of corruption happening for
files served via NFS with 2.6.28 and 2.6.29, but everything was magically
fixed on 2.6.30 (though we needed a lot of fscking).  I never did track
down what change fixed it, since it took a while to reproduce.

Hmm.  I just noticed what seems to be a new occurrence of "deleted inode
referenced" on a box with 2.6.30.  We saw many when we first upgraded to
2.6.30 due to the corruption caused by 2.6.29, but those all occurred
within a day or so and were fsck'd.  I would have thought the backup
sweeps would have tripped over that inode way before now...

Just wondering if you can confirm that the errors you saw with 2.6.30.4
were not leftover from older kernels.

Cheers,

Simon-
Sylvain Rochet Aug. 21, 2009, 10:51 a.m. UTC | #6
Hi,


On Thu, Aug 20, 2009 at 05:00:35PM -0700, Simon Kirby wrote:
> On Thu, Aug 20, 2009 at 07:19:53PM +0200, Sylvain Rochet wrote:
> 
> > So, everything is fine, but the problem happened only one time on this 
> > server, so we cannot conclude anything after a few weeks. However, 
> > I now have physical access back, so we will switch back to the former 
> > server where the problem happened quite frequently, then we will see!
> 
> Not to derail the thread, but you were definitely seeing the same issues
> with stock 2.6.30.4, right?

Nope, the last issue we had came from 2.6.28.9.

We upgraded to 2.6.30.3 on Jan's advice, then we "upgraded" to
2.6.30.3 with Jan's first patch, which adds some debug output
(0001-ext3-Debug-unlinking-of-inodes.patch). Finally we upgraded to
2.6.30.4 with both Jan's first and second patches; the second
(0001-fs-Make-sure-data-stored-into-inode-is-properly-see.patch) adds
a smp_mb() in the unlock_new_inode() function.


> We had all sorts of corruption happening for files served via NFS with 
> 2.6.28 and 2.6.29, but everything was magically fixed on 2.6.30 
> (though we needed a lot of fscking).  I never did track down what 
> change fixed it, since it took a while to reproduce.

Same here, everything is fine since 2.6.30. In a few days we will switch
back to the quad-core server where the corruption happened. We are now
using a dual-Opteron server because we suspected hardware issues on the
quad-core; the corruption happened only once on the dual-Opteron (which
is IMHO sufficient evidence to rule out a hardware issue). I guess the
issue was (or is) somehow SMP related.

And yep, we also spent a long time playing with fsck ;-) Luckily, the
corruption only occurs on new files, and new files are mostly caches,
sessions, logs, and such, so fsck used its chainsaw on files that are
not really important.


> Hmm.  I just noticed what seems to be a new occurrence of "deleted inode
> referenced" on a box with 2.6.30.  We saw many when we first upgraded to
> 2.6.30 due to the corruption caused by 2.6.29, but those all occurred
> within a day or so and were fsck'd.  I would have thought the backup
> sweeps would have tripped over that inode way before now...
> 
> Just wondering if you can confirm that the errors you saw with 2.6.30.4
> were not leftover from older kernels.

The few corrupted inodes from 2.6.28.9 (and previous) were pushed to
lost+found to prevent future use of them. We ran fsck when we moved to
2.6.30.4, and that fixed everything. We have not had any corruption yet
with 2.6.30.4.


Sylvain

Patch

From 78513d3a5628fda0f8d685d732b7bc73bd4c9222 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 5 Aug 2009 00:42:21 +0200
Subject: [PATCH] fs: Make sure data stored into inode is properly seen before unlocking new inode

In theory it could happen that on one CPU we initialize a new inode but clearing
of I_NEW | I_LOCK gets reordered before some of the initialization. Thus on
another CPU we return not fully uptodate inode from iget_locked().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/inode.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 901bad1..e9a8e77 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -696,6 +696,7 @@  void unlock_new_inode(struct inode *inode)
 	 * just created it (so there can be no old holders
 	 * that haven't tested I_LOCK).
 	 */
+	smp_mb();
 	WARN_ON((inode->i_state & (I_LOCK|I_NEW)) != (I_LOCK|I_NEW));
 	inode->i_state &= ~(I_LOCK|I_NEW);
 	wake_up_inode(inode);
-- 
1.6.0.2
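
For readers who want to see the ordering concern from the patch description
outside the kernel, here is a minimal userspace sketch in C11. It is not the
kernel code: `struct sk_inode`, `SK_NEW`, `sk_init_new()`, `sk_unlock_new()`
and `sk_read_if_ready()` are hypothetical names made up for illustration, and
`atomic_thread_fence()` merely stands in for the role the patch gives to
`smp_mb()`. The assumption is that one thread fills in the fields and clears
the "new" flag while another thread checks the flag before reading the fields.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel's I_NEW | I_LOCK state bits. */
#define SK_NEW 0x1u

/* Hypothetical, heavily simplified inode used only for this sketch. */
struct sk_inode {
	unsigned long ino;	/* plain field, written during initialization */
	unsigned int nlink;	/* plain field, written during initialization */
	atomic_uint state;	/* SK_NEW is set while initialization runs */
};

/* Writer side: mark the inode as new, then fill in its fields. */
static void sk_init_new(struct sk_inode *in, unsigned long ino,
			unsigned int nlink)
{
	atomic_store_explicit(&in->state, SK_NEW, memory_order_relaxed);
	in->ino = ino;
	in->nlink = nlink;
}

/* Writer side: publish the inode by clearing SK_NEW. */
static void sk_unlock_new(struct sk_inode *in)
{
	/*
	 * Full fence, playing the role of smp_mb() in the patch: the
	 * field stores above cannot be reordered after the flag update.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	/* Loosely mirrors the WARN_ON(): the inode must still be new here. */
	assert(atomic_load_explicit(&in->state, memory_order_relaxed) & SK_NEW);
	atomic_fetch_and_explicit(&in->state, ~SK_NEW, memory_order_relaxed);
}

/*
 * Reader side: returns 0 and copies the fields if the inode has been
 * published, or -1 if it is still marked new.
 */
static int sk_read_if_ready(struct sk_inode *in, unsigned long *ino,
			    unsigned int *nlink)
{
	if (atomic_load_explicit(&in->state, memory_order_relaxed) & SK_NEW)
		return -1;
	/* Pairs with the writer's fence: the field loads stay after the check. */
	atomic_thread_fence(memory_order_acquire);
	*ino = in->ino;
	*nlink = in->nlink;
	return 0;
}
```

Without the fence in `sk_unlock_new()`, a weakly ordered CPU or the compiler
would be free to make the cleared flag visible before the field stores, which
is the theoretical window the patch description talks about. Note that the
sketch also puts an acquire fence on the reader side; whether the kernel's
lookup path needs or already has an equivalent is not something this thread
settles.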