diff mbox

[0/2] Add inode checksum support to ext4

Message ID m2zknkstx1.fsf@firstfloor.org
State New, archived
Headers show

Commit Message

Andi Kleen April 20, 2011, 5:40 p.m. UTC
"Darrick J. Wong" <djwong@us.ibm.com> writes:

FWIW I had a similar idea quite some time ago and implemented checksums for
superblocks and inodes. I never posted it because I didn't get around
to write the e2fsprogs support and do a lot of performance testing. 
There was one result that showed a slow down in quick tests, but 
I suspect it was a fluke and probably needs to be redone. In other
tests it was neutral.

Anyways here's the old patch if anyone is interested (against a ~.35ish kernel)
I can forward port it if there's interest.

IMHO it's a good idea but it should be done for the super blocks too
(and ideally for all objects, although unfortunately that breaks 
the disk format)

The locking for stable buffers is still something that needs to be
double checked.

-Andi

commit 8ece10cfa4b148075dbb93ca65855a7e2aad7b07
Author: Andi Kleen <ak@linux.intel.com>
Date:   Mon May 31 18:27:29 2010 +0200

    EXT4: Add checksums to the super block and the inodes
    
    Currently when a on-disk structure is corrupted ext4 usually
    detects it only by some internal inconsistency, but this can happen
    too late or give confusing error messages.
    
    One way to detect corruptions sooner is to use checksums. This
    means the file system can stop using a corrupted object ASAP, not
    follow any potentially corrupted pointers to other blocks,
    and also give a clear message on what happened.
    
    Potentially it can also be used to limit read only mounts and
    forced fscks. When the extent of a corruption can be detected
    accurately using checksums and only force errors on that
    particular object (this is right now not enabled though
    because there are still too many objects with missing checksums)
    
    There's a general trend in file systems recently to add metadata
    checksums, this just makes ext4 catch up a bit (in addition
    to the already existing transaction and block group checksums)
    
    Another advantage is that the checksum can also check for misplaced
    writes (by including the intended block location) and objects
    left over from before mkfs (by including the file system UUID).
    
    Adding checksums to all metadata in ext4 would likely need some
    incompatible format changes, but it's relatively straight-forward to add
    them to inodes and the super block at least. This patch-kit does this.
    
    Again it only handles some meta-data objects. This is not a full
    user-data checksumming feature or not even a "all metadata objects"
    feature.
    
    I didn't do a lot of research into different CRC functions, but simply
    used the same one as btrfs and iSCSI (crc32c).  ocfs2 uses ECCs
    that have some self-correcting ability, but that seemed overkill to me.
    The standard CRC has the advantage that the CRC is accelerated by hardware
    instructions on newer Intel CPUs and possibly on other platforms too.
    The CRC32C function is also hard coded, although in theory it could be made
    configurable per file system. But since that would lead to a multitude of
    incompatible formats I decided to simply hard code for now.
    
    Right now there is no e2fsprogs support, so fsck doesn't know about CRCs.
    
    For now the checksums can be enabled with a simple shell script
    (which is just a wrapper around debugfs):
    
    wget ftp://firstfloor.org/pub/ak/shell/e2checksum
    chmod +x e2checksum
    
    To enable:
    
    ./e2checksum /dev/device enable
    
    To disable:
    
    ./e2checksum /dev/device disable
    
    This works with the current ext4 e2fsprogs without any changes.
    
    The kernel will automatically add checksums to the file system when the
    feature is turned on, as the inodes are rewritten.
    
    WARNING: Note there is no fsck support currently, so before you run fsck better
    disable the checksums. After disabling/changing/reenabling
    you may also need to mount with ignore_crc until the checksums
    are fixed up again.
    
    The checksums are implemented as a read-only compat feature: this means
    the a file system with checksums enabled can be read by a kernel that
    does not know about checksums, but not written. Of course you can turn
    off the flag again to make the file system write compatible again (but
    that will put the checksums into a inconsistent state) or use the
    ignore_crc mount option.
    
    WARNING: the in-kernel upgrader relies on the checksum fields being
    zero when checksums are enabled. When you turn on the feature, use the
    file system, turn it off again, use it again and turn it on again the
    checksums will be in a inconsistent state. Right now this can be
    fixed by mounting with -o ignore_crc and then doing a find | xargs touch
    on the file system.
    
    I expect real e2fsprogs support can do that better.
    
    The code also adds some more sanity checks on inodes to distingush zeroed
    inodes from inodes with no checksums. This is currently done by enforcing
    that a/m/ctime are not zero. If there are broken file systems around
    where that is not true, the sanity check can be turned off (see below)
    
    There are two new mount options:
    
    ignore_crc	       Ignore the CRCs on reading but still update them
    noinode_sanity	       Don't do new inode sanity checks
    
    The implementation is not particularly optimized: it always recomputes
    the full CRC on each inode or super block write. Some more optimizations
    would be possible in this area.
    
    BENCHMARKS
    
    I ran kernelbench and it didn't show any significant difference between
    the run with checksums and the run without.
    
    I also tried timing fsstress from XFS/LTP, and it gave a 35% slowdown
    on a disk and 5% slow down on the ramdisk for the system time
    during the test.  However I'm a bit suspicious of these results.
    This was on a core 2.
    
    I also tested on a Nehalem system which has CRC acceleration instructions
    in the CPU.
    
    Anyone has a better inode metadata intensive benchmark to run?
    
    This gobbled up the last mount flag bits, but there are still some unused
    holes in the middle.
    
    The patch is definitely still experimental and needs more testing and
    review and e2fsprogs support. In particular I would appreciate review
    of my bh locking scheme.
    
    Also ext4_abort() now takes the super lock, which implies that the callers
    shouldn't.  I think I checked all callers that it's safe, but double checking
    that would be good.
    
    Opens:
    - e2fsprogs support. In this case the zero crc hack could be removed from
    the super block code at least.
    - Right now checksum numbers by default lead to a read-only file system remount
    and are also logged as "IO error". This could be made more gentle, because
    a checksum catches problems early enough that continuation might be possible.
    - Incremental checksumming
    - Checksums for extents and directories
    - In ignore_crc mode the checksums are only rewritten when the inode is somehow
    dirtied otherwise. Do this implicitely? Might be better to move this to e2fsprogs

Comments

Darrick J. Wong April 20, 2011, 10:54 p.m. UTC | #1
On Wed, Apr 20, 2011 at 10:40:26AM -0700, Andi Kleen wrote:
> "Darrick J. Wong" <djwong@us.ibm.com> writes:
> 
> FWIW I had a similar idea quite some time ago and implemented checksums for
> superblocks and inodes. I never posted it because I didn't get around
> to write the e2fsprogs support and do a lot of performance testing. 
> There was one result that showed a slow down in quick tests, but 
> I suspect it was a fluke and probably needs to be redone. In other
> tests it was neutral.

I've also seen comparable results between having inode checksums and not having
them.  Unfortunately, like you said, modifying e2fsprogs is really what's
slowing this down right now-- there are a lot of places that assume inode_size
= 128 and therefore only read/write that much.

> Anyways here's the old patch if anyone is interested (against a ~.35ish kernel)
> I can forward port it if there's interest.
> 
> IMHO it's a good idea but it should be done for the super blocks too
> (and ideally for all objects, although unfortunately that breaks 
> the disk format)

I think I've seen some proposals for checksumming the bitmaps and the extent
tree nodes.  It might be worth it to save some rocompat bits and combine them
all into one big(ger) rocompat flag.  I guess it wouldn't be too hard to throw
in superblock checksumming too. :)

> The locking for stable buffers is still something that needs to be
> double checked.



> -Andi
> 
> commit 8ece10cfa4b148075dbb93ca65855a7e2aad7b07
> Author: Andi Kleen <ak@linux.intel.com>
> Date:   Mon May 31 18:27:29 2010 +0200
> 
>     EXT4: Add checksums to the super block and the inodes
>     
>     Currently when a on-disk structure is corrupted ext4 usually
>     detects it only by some internal inconsistency, but this can happen
>     too late or give confusing error messages.
>     
>     One way to detect corruptions sooner is to use checksums. This
>     means the file system can stop using a corrupted object ASAP, not
>     follow any potentially corrupted pointers to other blocks,
>     and also give a clear message on what happened.
>     
>     Potentially it can also be used to limit read only mounts and
>     forced fscks. When the extent of a corruption can be detected
>     accurately using checksums and only force errors on that
>     particular object (this is right now not enabled though
>     because there are still too many objects with missing checksums)
>     
>     There's a general trend in file systems recently to add metadata
>     checksums, this just makes ext4 catch up a bit (in addition
>     to the already existing transaction and block group checksums)
>     
>     Another advantage is that the checksum can also check for misplaced
>     writes (by including the intended block location) and objects
>     left over from before mkfs (by including the file system UUID).
>     
>     Adding checksums to all metadata in ext4 would likely need some
>     incompatible format changes, but it's relatively straight-forward to add
>     them to inodes and the super block at least. This patch-kit does this.
>     
>     Again it only handles some meta-data objects. This is not a full
>     user-data checksumming feature or not even a "all metadata objects"
>     feature.
>     
>     I didn't do a lot of research into different CRC functions, but simply
>     used the same one as btrfs and iSCSI (crc32c).  ocfs2 uses ECCs
>     that have some self-correcting ability, but that seemed overkill to me.
>     The standard CRC has the advantage that the CRC is accelerated by hardware
>     instructions on newer Intel CPUs and possibly on other platforms too.
>     The CRC32C function is also hard coded, although in theory it could be made
>     configurable per file system. But since that would lead to a multitude of
>     incompatible formats I decided to simply hard code for now.
>     
>     Right now there is no e2fsprogs support, so fsck doesn't know about CRCs.
>     
>     For now the checksums can be enabled with a simple shell script
>     (which is just a wrapper around debugfs):
>     
>     wget ftp://firstfloor.org/pub/ak/shell/e2checksum
>     chmod +x e2checksum
>     
>     To enable:
>     
>     ./e2checksum /dev/device enable
>     
>     To disable:
>     
>     ./e2checksum /dev/device disable
>     
>     This works with the current ext4 e2fsprogs without any changes.
>     
>     The kernel will automatically add checksums to the file system when the
>     feature is turned on, as the inodes are rewritten.
>     
>     WARNING: Note there is no fsck support currently, so before you run fsck better
>     disable the checksums. After disabling/changing/reenabling
>     you may also need to mount with ignore_crc until the checksums
>     are fixed up again.
>     
>     The checksums are implemented as a read-only compat feature: this means
>     the a file system with checksums enabled can be read by a kernel that
>     does not know about checksums, but not written. Of course you can turn
>     off the flag again to make the file system write compatible again (but
>     that will put the checksums into a inconsistent state) or use the
>     ignore_crc mount option.
>     
>     WARNING: the in-kernel upgrader relies on the checksum fields being
>     zero when checksums are enabled. When you turn on the feature, use the
>     file system, turn it off again, use it again and turn it on again the
>     checksums will be in a inconsistent state. Right now this can be
>     fixed by mounting with -o ignore_crc and then doing a find | xargs touch
>     on the file system.
>     
>     I expect real e2fsprogs support can do that better.
>     
>     The code also adds some more sanity checks on inodes to distingush zeroed
>     inodes from inodes with no checksums. This is currently done by enforcing
>     that a/m/ctime are not zero. If there are broken file systems around
>     where that is not true, the sanity check can be turned off (see below)
>     
>     There are two new mount options:
>     
>     ignore_crc	       Ignore the CRCs on reading but still update them
>     noinode_sanity	       Don't do new inode sanity checks

Hm, the usage model of my patch is tune2fs; fsck; mount.  I suspect it's a good
idea to run fsck before/while turning on this feature to correct any other
errors.  Though, that means that the kernel will simply -EIO anything it thinks
is corrupt.

>     The implementation is not particularly optimized: it always recomputes
>     the full CRC on each inode or super block write. Some more optimizations
>     would be possible in this area.
>     
>     BENCHMARKS
>     
>     I ran kernelbench and it didn't show any significant difference between
>     the run with checksums and the run without.
>     
>     I also tried timing fsstress from XFS/LTP, and it gave a 35% slowdown
>     on a disk and 5% slow down on the ramdisk for the system time
>     during the test.  However I'm a bit suspicious of these results.
>     This was on a core 2.

I'll give fsstress a try when I get back to this (Friday skunkworks project).

Either way, thanks for the patch!

--D
>     
>     I also tested on a Nehalem system which has CRC acceleration instructions
>     in the CPU.
>     
>     Anyone has a better inode metadata intensive benchmark to run?
>     
>     This gobbled up the last mount flag bits, but there are still some unused
>     holes in the middle.
>     
>     The patch is definitely still experimental and needs more testing and
>     review and e2fsprogs support. In particular I would appreciate review
>     of my bh locking scheme.
>     
>     Also ext4_abort() now takes the super lock, which implies that the callers
>     shouldn't.  I think I checked all callers that it's safe, but double checking
>     that would be good.
>     
>     Opens:
>     - e2fsprogs support. In this case the zero crc hack could be removed from
>     the super block code at least.
>     - Right now checksum numbers by default lead to a read-only file system remount
>     and are also logged as "IO error". This could be made more gentle, because
>     a checksum catches problems early enough that continuation might be possible.
>     - Incremental checksumming
>     - Checksums for extents and directories
>     - In ignore_crc mode the checksums are only rewritten when the inode is somehow
>     dirtied otherwise. Do this implicitely? Might be better to move this to e2fsprogs
> 
> diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
> index e1def17..e98141a 100644
> --- a/Documentation/filesystems/ext4.txt
> +++ b/Documentation/filesystems/ext4.txt
> @@ -359,6 +359,14 @@ nodiscard(*)		commands to the underlying block device when
>  			and sparse/thinly-provisioned LUNs, but it is off
>  			by default until sufficient testing has been done.
> 
> +nosanity_check		Don't check for zero m/c/a times when reading a inode.
> +			Should not normally be needed.
> +
> +ignore_crc		Ignore checksum failures while reading the super block
> +			or inodes, but still update the checksums on writing
> +			(if you don't want to update the checksums, clear the
> +			checksum feature bit in the super block)
> +
>  Data Mode
>  =========
>  There are 3 different data modes:
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 19a4de5..3a99769 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -611,6 +611,7 @@ struct ext4_inode {
>  	__le32  i_crtime;       /* File Creation time */
>  	__le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
>  	__le32  i_version_hi;	/* high 32 bits for 64-bit version */
> +	__le32  i_crc;		/* CRC32 for this inode */
>  };
> 
>  struct move_extent {
> @@ -836,6 +837,12 @@ struct ext4_inode_info {
>  	 */
>  	tid_t i_sync_tid;
>  	tid_t i_datasync_tid;
> +
> +	/* 
> +	 * Protect raw inode modifications, mostly to ensure
> +	 * stability for checksumming.
> +	 */
> +	spinlock_t raw_inode_lock;
>  };
> 
>  /*
> @@ -881,10 +888,12 @@ struct ext4_inode_info {
>  #define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
>  #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
>  #define EXT4_MOUNT_I_VERSION            0x2000000 /* i_version support */
> +#define EXT4_MOUNT_IGNORE_CRC           0x4000000 /* Ignore CRCs on read */
>  #define EXT4_MOUNT_DELALLOC		0x8000000 /* Delalloc support */
>  #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
>  #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
>  #define EXT4_MOUNT_DISCARD		0x40000000 /* Issue DISCARD requests */
> +#define EXT4_MOUNT_INODE_SANITY         0x80000000 /* Inode sanity check */
> 
>  #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
>  #define set_opt(o, opt)			o |= EXT4_MOUNT_##opt
> @@ -1003,7 +1012,8 @@ struct ext4_super_block {
>  	__u8	s_reserved_char_pad2;
>  	__le16  s_reserved_pad;
>  	__le64	s_kbytes_written;	/* nr of lifetime kilobytes written */
> -	__u32   s_reserved[160];        /* Padding to the end of the block */
> +	__le32  s_crc;			/* CRC32 for this super block */
> +	__u32   s_reserved[159];        /* Padding to the end of the block */
>  };
> 
>  #ifdef __KERNEL__
> @@ -1051,6 +1061,7 @@ struct ext4_sb_info {
>  	u32 s_hash_seed[4];
>  	int s_def_hash_version;
>  	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
> +	u8 s_uuid[16];
>  	struct percpu_counter s_freeblocks_counter;
>  	struct percpu_counter s_freeinodes_counter;
>  	struct percpu_counter s_dirs_counter;
> @@ -1265,6 +1276,7 @@ EXT4_INODE_BIT_FNS(state, state_flags)
>  #define EXT4_FEATURE_RO_COMPAT_GDT_CSUM		0x0010
>  #define EXT4_FEATURE_RO_COMPAT_DIR_NLINK	0x0020
>  #define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE	0x0040
> +#define EXT4_FEATURE_RO_COMPAT_SBI_CRC		0x0080 /* sb and inode CRCs */
> 
>  #define EXT4_FEATURE_INCOMPAT_COMPRESSION	0x0001
>  #define EXT4_FEATURE_INCOMPAT_FILETYPE		0x0002
> @@ -1291,7 +1303,8 @@ EXT4_INODE_BIT_FNS(state, state_flags)
>  					 EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
>  					 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
>  					 EXT4_FEATURE_RO_COMPAT_BTREE_DIR |\
> -					 EXT4_FEATURE_RO_COMPAT_HUGE_FILE)
> +					 EXT4_FEATURE_RO_COMPAT_HUGE_FILE |\
> +					 EXT4_FEATURE_RO_COMPAT_SBI_CRC)
> 
>  /*
>   * Default values for user and/or group using reserved blocks
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 42272d6..8ba2f24 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -37,6 +37,7 @@
>  #include <linux/namei.h>
>  #include <linux/uio.h>
>  #include <linux/bio.h>
> +#include <linux/crc32c.h>
>  #include <linux/workqueue.h>
>  #include <linux/kernel.h>
>  #include <linux/slab.h>
> @@ -72,6 +73,141 @@ static int ext4_inode_is_fast_symlink(struct inode *inode)
>  	return (S_ISLNK(inode->i_mode) && inode->i_blocks - ea_blocks == 0);
>  }
> 
> +#define ZERO_MAGIC 1
> +
> +static __le32 inode_crc(struct super_block *sb, struct ext4_inode *raw_inode, long ino)
> +{
> +	struct ext4_csum_header {
> +		__le64 ino;
> +		__le64 pad;
> +		u8     uuid[16];
> +	} __attribute__((aligned)) hdr;
> +	u32 crc;
> +
> +	/*
> +	 * First CRC in the inode number and the file system UUID
> +	 * to detect inodes from before the last mkfs and misplaced inode
> +	 * writes.
> +	 */
> +	hdr.ino = cpu_to_le64(ino);
> +	memcpy(&hdr.uuid, EXT4_SB(sb)->s_uuid, 16);
> +	hdr.pad = 0;
> +
> +	raw_inode->i_crc = 0;
> +	crc = crc32c(~0U, &hdr, sizeof(struct ext4_csum_header));
> +	/* Could CRC precompute the common zero tail? (if it's really common) */
> +	crc = crc32c(crc, raw_inode, EXT4_INODE_SIZE(sb));
> +	if (crc == 0)
> +		crc = ZERO_MAGIC;
> +	return cpu_to_le32(crc);
> +}
> +
> +/*
> + * Update CRC in a inode, including all additional fields after the
> + * standard inode structure.
> + *
> + * This relies on the raw_inode_lock protecting against all writes
> + * to the raw inode state in the bh. Right now the JBD locking
> + * is not good enough for that.
> + *
> + * This always precomputes the full checksum. In principle it would
> + * be possible to be more clever and do incremental changes at least
> + * for some state changes.
> + */
> +static void inode_update_crc(struct super_block *sb, struct ext4_inode *raw_inode,
> +			     long ino)
> +{
> +	if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC))
> +		return;
> +	raw_inode->i_crc = inode_crc(sb, raw_inode, ino);
> +}
> +
> +/*
> + * This is needed because nfsd might try to access dead inodes
> + * the test is that same one that e2fsck uses
> + * NeilBrown 1999oct15
> + */
> +static int inode_deleted(struct super_block *sb, struct ext4_inode *raw_inode)
> +{
> +	if (raw_inode->i_links_count == 0) {
> +		if (raw_inode->i_mode == 0 ||
> +		    !(EXT4_SB(sb)->s_mount_state & EXT4_ORPHAN_FS))
> +			return -ESTALE;
> +		/*
> +		 * The only unlinked inodes we let through here have
> +		 * valid i_mode and are being read by the orphan
> +		 * recovery code: that's fine, we're about to complete
> +		 * the process of deleting those.
> +		 */
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Does this checksum-less inode look like a valid inode?
> + * Do a few sanity checks.
> + */
> +static int inode_sanity_check(struct super_block *sb, struct ext4_inode *raw_inode, char **msg)
> +{
> +	int err = inode_deleted(sb, raw_inode);
> +	if (err)
> +		return err;
> +	if (!test_opt(sb, INODE_SANITY))
> +		return 0;
> +	if (raw_inode->i_mode == 0 ||
> +	    raw_inode->i_atime == 0 ||
> +	    raw_inode->i_ctime == 0 ||
> +	    raw_inode->i_mtime == 0) {
> +		*msg = "inode has invalid zero times";
> +		return -1;
> +	}
> +	/* More sanity checks? */
> +	return 0;
> +}
> +
> +/*
> + * Check CRC for a newly read inode.
> + */
> +static int inode_check_crc(struct super_block *sb, struct ext4_inode *raw_inode,
> +			   long ino, char **msg)
> +{
> +	__le32 crc, check;
> +
> +	/*
> +	 * Special case: zero CRC.  This is handled like no checksum yet,
> +	 * otherwise tune2fs would need to rewrite all inodes when CRCs
> +	 * are turned on. The CRC will be updated when the inode
> +	 * is written out.
> +	 *
> +	 * This however means that if zeroes are blasted over the inodes
> +	 * we would think the checksum is not there. So instead do
> +	 * some special sanity checks in the other fields when this happens,
> +	 * to catch this case.
> + 	 * This is not 100% fool proof, but should be reasonable.
> +	 * When the checksum function returns a real zero we turn that
> +	 * into a one to avoid ambiguity.
> +	 *
> +	 * The sanity check is done unconditionally even if the checksum passed
> +	 * because it's cheap enough.
> +	 */
> +	crc = raw_inode->i_crc;
> +	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC) && crc
> +	    && !test_opt(sb, IGNORE_CRC)) {
> + 		check = inode_crc(sb, raw_inode, ino);
> +		if (check != crc) {
> +			*msg = "inode has invalid checksum";
> +			/*
> +			 * Restore bad CRC so that if the inode is reread it'll fail
> +			 * the check again.
> +			 */
> +			raw_inode->i_crc = crc;
> +			return -1;
> +		}
> +	}
> +	return inode_sanity_check(sb, raw_inode, msg);
> +}
> +
>  /*
>   * Work out how many blocks we need to proceed with the next chunk of a
>   * truncate transaction.
> @@ -4996,6 +5132,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>  	journal_t *journal = EXT4_SB(sb)->s_journal;
>  	long ret;
>  	int block;
> +	char *msg = NULL;
> 
>  	inode = iget_locked(sb, ino);
>  	if (!inode)
> @@ -5010,6 +5147,26 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>  	if (ret < 0)
>  		goto bad_inode;
>  	raw_inode = ext4_raw_inode(&iloc);
> +
> + 	/*
> + 	 * Relies on the inode lock to protect the raw_inode bh contents.
> + 	 */
> + 	ret = inode_check_crc(sb, raw_inode, ino, &msg);
> + 	if (ret < 0) {
> + 		/*
> + 		 * Here would be the place to send a "read other mirror"
> + 		 * retry hint to the block layer.
> + 		 */
> + 		brelse(iloc.bh);
> + 		if (ret != -ESTALE)
> + 			ext4_error(sb,
> + 				   "%s: inode=%lu, block=%llu", msg,
> + 				   ino, 
> +				   (unsigned long long)iloc.bh->b_blocknr);
> + 		goto bad_inode;
> + 	}
> + 	ret = 0;
> +
>  	inode->i_mode = le16_to_cpu(raw_inode->i_mode);
>  	inode->i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
>  	inode->i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
> @@ -5022,23 +5179,6 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>  	ei->i_state_flags = 0;
>  	ei->i_dir_start_lookup = 0;
>  	ei->i_dtime = le32_to_cpu(raw_inode->i_dtime);
> -	/* We now have enough fields to check if the inode was active or not.
> -	 * This is needed because nfsd might try to access dead inodes
> -	 * the test is that same one that e2fsck uses
> -	 * NeilBrown 1999oct15
> -	 */
> -	if (inode->i_nlink == 0) {
> -		if (inode->i_mode == 0 ||
> -		    !(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS)) {
> -			/* this inode is deleted */
> -			ret = -ESTALE;
> -			goto bad_inode;
> -		}
> -		/* The only unlinked inodes we let through here have
> -		 * valid i_mode and are being read by the orphan
> -		 * recovery code: that's fine, we're about to complete
> -		 * the process of deleting those. */
> -	}
>  	ei->i_flags = le32_to_cpu(raw_inode->i_flags);
>  	inode->i_blocks = ext4_inode_blocks(raw_inode, ei);
>  	ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl_lo);
> @@ -5232,11 +5372,19 @@ static int ext4_do_update_inode(handle_t *handle,
>  				struct inode *inode,
>  				struct ext4_iloc *iloc)
>  {
> +	struct super_block *sb = inode->i_sb;
>  	struct ext4_inode *raw_inode = ext4_raw_inode(iloc);
>  	struct ext4_inode_info *ei = EXT4_I(inode);
>  	struct buffer_head *bh = iloc->bh;
>  	int err = 0, rc, block;
> 
> +	/*
> +	 * Protect the on disk inode against parallel modification
> +	 * until we compute the checksum and pass the resulting block
> +	 * to JBD, which protects it then.
> +	 */
> +	spin_lock(&ei->raw_inode_lock);
> +
>  	/* For fields not not tracking in the in-memory inode,
>  	 * initialise them to zero for new inodes. */
>  	if (ext4_test_inode_state(inode, EXT4_STATE_NEW))
> @@ -5291,18 +5439,23 @@ static int ext4_do_update_inode(handle_t *handle,
>  				EXT4_FEATURE_RO_COMPAT_LARGE_FILE) ||
>  				EXT4_SB(sb)->s_es->s_rev_level ==
>  				cpu_to_le32(EXT4_GOOD_OLD_REV)) {
> + 			spin_unlock(&ei->raw_inode_lock);
>  			/* If this is the first large file
>  			 * created, add a flag to the superblock.
>  			 */
>  			err = ext4_journal_get_write_access(handle,
>  					EXT4_SB(sb)->s_sbh);
> -			if (err)
> +			if (err) {
> + 				spin_lock(&ei->raw_inode_lock);
>  				goto out_brelse;
> +			}
>  			ext4_update_dynamic_rev(sb);
>  			EXT4_SET_RO_COMPAT_FEATURE(sb,
>  					EXT4_FEATURE_RO_COMPAT_LARGE_FILE);
>  			sb->s_dirt = 1;
>  			ext4_handle_sync(handle);
> + 			spin_lock(&ei->raw_inode_lock);
> + 			inode_update_crc(sb, raw_inode, inode->i_ino);
>  			err = ext4_handle_dirty_metadata(handle, NULL,
>  					EXT4_SB(sb)->s_sbh);
>  		}
> @@ -5330,6 +5483,7 @@ static int ext4_do_update_inode(handle_t *handle,
>  			cpu_to_le32(inode->i_version >> 32);
>  		raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
>  	}
> +	inode_update_crc(sb, raw_inode, inode->i_ino);
> 
>  	BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
>  	rc = ext4_handle_dirty_metadata(handle, NULL, bh);
> @@ -5339,6 +5493,7 @@ static int ext4_do_update_inode(handle_t *handle,
> 
>  	ext4_update_inode_fsync_trans(handle, inode, 0);
>  out_brelse:
> +	spin_unlock(&ei->raw_inode_lock);
>  	brelse(bh);
>  	ext4_std_error(inode->i_sb, err);
>  	return err;
> @@ -5759,6 +5914,8 @@ static int ext4_expand_extra_isize(struct inode *inode,
>  	if (EXT4_I(inode)->i_extra_isize >= new_extra_isize)
>  		return 0;
> 
> +	spin_lock(&EXT4_I(inode)->raw_inode_lock);
> +
>  	raw_inode = ext4_raw_inode(&iloc);
> 
>  	header = IHDR(inode, raw_inode);
> @@ -5770,10 +5927,12 @@ static int ext4_expand_extra_isize(struct inode *inode,
>  		memset((void *)raw_inode + EXT4_GOOD_OLD_INODE_SIZE, 0,
>  			new_extra_isize);
>  		EXT4_I(inode)->i_extra_isize = new_extra_isize;
> +		spin_unlock(&EXT4_I(inode)->raw_inode_lock);
>  		return 0;
>  	}
> 
>  	/* try to expand with EAs present */
> +	spin_unlock(&EXT4_I(inode)->raw_inode_lock);
>  	return ext4_expand_extra_isize_ea(inode, new_extra_isize,
>  					  raw_inode, handle);
>  }
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 4e8983a..4012753 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -39,6 +39,7 @@
>  #include <linux/ctype.h>
>  #include <linux/log2.h>
>  #include <linux/crc16.h>
> +#include <linux/crc32c.h>
>  #include <asm/uaccess.h>
> 
>  #include "ext4.h"
> @@ -70,6 +71,8 @@ static void ext4_write_super(struct super_block *sb);
>  static int ext4_freeze(struct super_block *sb);
>  static int ext4_get_sb(struct file_system_type *fs_type, int flags,
>  		       const char *dev_name, void *data, struct vfsmount *mnt);
> +static void ext4_super_update_crc(struct super_block *sb,
> +				  struct ext4_super_block *es);
> 
>  #if !defined(CONFIG_EXT3_FS) && !defined(CONFIG_EXT3_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT23)
>  static struct file_system_type ext3_fs_type = {
> @@ -342,7 +345,14 @@ static void ext4_handle_error(struct super_block *sb)
>  		ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only");
>  		sb->s_flags |= MS_RDONLY;
>  	}
> +	/*
> +	 * Unfortunately must take the lock here, to make sure there
> +	 * is consistent super state for the checksum. This is very easy to get
> +	 * wrong in the caller.
> +	 */
> +	lock_super(sb);
>  	ext4_commit_super(sb, 1);
> +	unlock_super(sb);
>  	if (test_opt(sb, ERRORS_PANIC))
>  		panic("EXT4-fs (device %s): panic forced after error\n",
>  			sb->s_id);
> @@ -531,7 +541,9 @@ __acquires(bitlock)
>  	if (test_opt(sb, ERRORS_CONT)) {
>  		EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
>  		es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
> +		lock_super(sb);
>  		ext4_commit_super(sb, 0);
> +		unlock_super(sb);
>  		return;
>  	}
>  	ext4_unlock_group(sb, grp);
> @@ -659,9 +671,12 @@ static void ext4_put_super(struct super_block *sb)
>  	if (sbi->s_journal) {
>  		err = jbd2_journal_destroy(sbi->s_journal);
>  		sbi->s_journal = NULL;
> -		if (err < 0)
> +		if (err < 0) {
> +			unlock_super(sb);
>  			ext4_abort(sb, __func__,
>  				   "Couldn't clean up the journal");
> +			lock_super(sb);
> +		}
>  	}
> 
>  	ext4_release_system_zone(sb);
> @@ -758,6 +773,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
>  	ei->i_da_metadata_calc_len = 0;
>  	ei->i_delalloc_reserved_flag = 0;
>  	spin_lock_init(&(ei->i_block_reservation_lock));
> +	spin_lock_init(&(ei->raw_inode_lock));
>  #ifdef CONFIG_QUOTA
>  	ei->i_reserved_quota = 0;
>  #endif
> @@ -1161,6 +1177,7 @@ enum {
>  	Opt_inode_readahead_blks, Opt_journal_ioprio,
>  	Opt_dioread_nolock, Opt_dioread_lock,
>  	Opt_discard, Opt_nodiscard,
> + 	Opt_noinode_sanity, Opt_ignore_crc,
>  };
> 
>  static const match_table_t tokens = {
> @@ -1231,6 +1248,8 @@ static const match_table_t tokens = {
>  	{Opt_dioread_lock, "dioread_lock"},
>  	{Opt_discard, "discard"},
>  	{Opt_nodiscard, "nodiscard"},
> + 	{Opt_noinode_sanity, "noinode_sanity"},
> + 	{Opt_ignore_crc, "ignore_crc"},
>  	{Opt_err, NULL},
>  };
> 
> @@ -1408,6 +1427,12 @@ static int parse_options(char *options, struct super_block *sb,
>  		case Opt_orlov:
>  			clear_opt(sbi->s_mount_opt, OLDALLOC);
>  			break;
> +		case Opt_noinode_sanity:
> +			clear_opt(sbi->s_mount_opt, INODE_SANITY);
> +			break;
> +		case Opt_ignore_crc:
> +			set_opt(sbi->s_mount_opt, IGNORE_CRC);
> +			break;
>  #ifdef CONFIG_EXT4_FS_XATTR
>  		case Opt_user_xattr:
>  			set_opt(sbi->s_mount_opt, XATTR_USER);
> @@ -1946,6 +1971,7 @@ static int ext4_check_descriptors(struct super_block *sb)
>  	}
> 
>  	ext4_free_blocks_count_set(sbi->s_es, ext4_count_free_blocks(sb));
> + 	ext4_super_update_crc(sb, sbi->s_es);
>  	sbi->s_es->s_free_inodes_count =cpu_to_le32(ext4_count_free_inodes(sb));
>  	return 1;
>  }
> @@ -2431,6 +2457,50 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
>  	return 1;
>  }
> 
> +/* could be removed for SBs once e2fsutils are fixed to always compute
> +   the CRC when the feature is turned on. */
> +#define ZERO_MAGIC 1
> +
> +/*
> + * Manage CRC32 for the superblock.
> + */
> +static int ext4_super_check_crc(struct super_block *sb,
> +				struct ext4_super_block *es)
> +{
> +	u32 crc, check;
> +
> +	if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC))
> +		return 0;
> +	crc = le32_to_cpu(es->s_crc);
> +	if (crc == 0 || test_opt(sb, IGNORE_CRC)) {
> +		ext4_msg(sb, KERN_INFO, "super block checksum ignored");
> +		return 0;
> +	}
> +	es->s_crc = 0;
> +	check = crc32c(~0U, es, sizeof(struct ext4_super_block));
> +	if (check == 0)
> +		check = ZERO_MAGIC;
> +	if (check != crc)
> +		return -1;
> +	/* Remove me */
> +	ext4_msg(sb, KERN_INFO, "super block checksum passed");
> +	return 0;
> +}
> +
> +static void ext4_super_update_crc(struct super_block *sb,
> +				  struct ext4_super_block *es)
> +{
> +	u32 crc;
> +
> +	if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC))
> +		return;
> +	es->s_crc = 0;
> +        crc = crc32c(~0U, es, sizeof(struct ext4_super_block));
> +	if (crc == 0)
> +		crc = ZERO_MAGIC;
> +	es->s_crc = cpu_to_le32(crc);
> +}
> +
>  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  				__releases(kernel_lock)
>  				__acquires(kernel_lock)
> @@ -2554,6 +2624,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
> 
>  	set_opt(sbi->s_mount_opt, BARRIER);
> +	set_opt(sbi->s_mount_opt, INODE_SANITY);
> 
>  	/*
>  	 * enable delayed allocation by default
> @@ -2585,6 +2656,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  	if (!ext4_feature_set_ok(sb, (sb->s_flags & MS_RDONLY)))
>  		goto failed_mount;
> 
> +	if (ext4_super_check_crc(sb, es) < 0) {
> +		ext4_msg(sb, KERN_ERR, "Invalid checksum in super block");
> +		goto failed_mount;
> +	}
> +
>  	blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
> 
>  	if (blocksize < EXT4_MIN_BLOCK_SIZE ||
> @@ -2675,6 +2751,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> 
>  	for (i = 0; i < 4; i++)
>  		sbi->s_hash_seed[i] = le32_to_cpu(es->s_hash_seed[i]);
> +	memcpy(sbi->s_uuid, es->s_uuid, 16);
>  	sbi->s_def_hash_version = es->s_def_hash_version;
>  	i = le32_to_cpu(es->s_flags);
>  	if (i & EXT2_FLAGS_UNSIGNED_HASH)
> @@ -3393,6 +3470,9 @@ static int ext4_commit_super(struct super_block *sb, int sync)
>  	es->s_free_inodes_count = cpu_to_le32(percpu_counter_sum_positive(
>  					&EXT4_SB(sb)->s_freeinodes_counter));
>  	sb->s_dirt = 0;
> +	/* can be removed if this is properly journaled, see
> +         * http://bugzilla.kernel.org/show_bug.cgi?id=14354 */
> +	ext4_super_update_crc(sb, es);
>  	BUFFER_TRACE(sbh, "marking dirty");
>  	mark_buffer_dirty(sbh);
>  	if (sync) {
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 0433800..fbba95a 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1171,6 +1171,7 @@ retry:
> 
>  	free = ext4_xattr_free_space(last, &min_offs, base, &total_ino);
>  	if (free >= new_extra_isize) {
> +		spin_lock(&EXT4_I(inode)->raw_inode_lock);
>  		entry = IFIRST(header);
>  		ext4_xattr_shift_entries(entry,	EXT4_I(inode)->i_extra_isize
>  				- new_extra_isize, (void *)raw_inode +
> @@ -1179,6 +1180,7 @@ retry:
>  				inode->i_sb->s_blocksize);
>  		EXT4_I(inode)->i_extra_isize = new_extra_isize;
>  		error = 0;
> +		spin_unlock(&EXT4_I(inode)->raw_inode_lock);
>  		goto cleanup;
>  	}
> 
> @@ -1301,6 +1303,7 @@ retry:
>  		if (error)
>  			goto cleanup;
> 
> +		spin_lock(&EXT4_I(inode)->raw_inode_lock);
>  		entry = IFIRST(header);
>  		if (entry_size + EXT4_XATTR_SIZE(size) >= new_extra_isize)
>  			shift_bytes = new_extra_isize;
> @@ -1312,6 +1315,7 @@ retry:
>  			EXT4_GOOD_OLD_INODE_SIZE + extra_isize + shift_bytes,
>  			(void *)header, total_ino - entry_size,
>  			inode->i_sb->s_blocksize);
> +		spin_unlock(&EXT4_I(inode)->raw_inode_lock);
> 
>  		extra_isize += shift_bytes;
>  		new_extra_isize -= shift_bytes;
> 
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger April 21, 2011, 12:25 a.m. UTC | #2
On 2011-04-20, at 4:54 PM, Darrick J. Wong wrote:
> On Wed, Apr 20, 2011 at 10:40:26AM -0700, Andi Kleen wrote:
>> "Darrick J. Wong" <djwong@us.ibm.com> writes:
>> 
>> FWIW I had a similar idea quite some time ago and implemented checksums for
>> superblocks and inodes. I never posted it because I didn't get around
>> to write the e2fsprogs support and do a lot of performance testing. 
>> There was one result that showed a slow down in quick tests, but 
>> I suspect it was a fluke and probably needs to be redone. In other
>> tests it was neutral.
> 
> I've also seen comparable results between having inode checksums and not having
> them.  Unfortunately, like you said, modifying e2fsprogs is really what's
> slowing this down right now-- there are a lot of places that assume inode_size
> = 128 and therefore only read/write that much.

I think it makes sense to include the inode checksum into the core 128-bit inode, so that it is available to all upgraded ext3 filesystems as well.  Having a 16-bit checksum is probably sufficient for the 128- or 256-byte inodes, I don't know that we really need to have a full 32-bit checksum?  Also, the current group descriptor checksums are already CRC-16 so it probably makes sense to stick with that.

>> Anyways here's the old patch if anyone is interested (against a ~.35ish kernel)
>> I can forward port it if there's interest.
>> 
>> IMHO it's a good idea but it should be done for the super blocks too
>> (and ideally for all objects, although unfortunately that breaks 
>> the disk format)
> 
> I think I've seen some proposals for checksumming the bitmaps and the extent
> tree nodes.  It might be worth it to save some rocompat bits and combine them
> all into one big(ger) rocompat flag.  I guess it wouldn't be too hard to throw
> in superblock checksumming too. :)
> 
>> The locking for stable buffers is still something that needs to be
>> double checked.
> 
> 
> 
>> -Andi
>> 
>> commit 8ece10cfa4b148075dbb93ca65855a7e2aad7b07
>> Author: Andi Kleen <ak@linux.intel.com>
>> Date:   Mon May 31 18:27:29 2010 +0200
>> 
>>    EXT4: Add checksums to the super block and the inodes
>> 
>>    Currently when a on-disk structure is corrupted ext4 usually
>>    detects it only by some internal inconsistency, but this can happen
>>    too late or give confusing error messages.
>> 
>>    One way to detect corruptions sooner is to use checksums. This
>>    means the file system can stop using a corrupted object ASAP, not
>>    follow any potentially corrupted pointers to other blocks,
>>    and also give a clear message on what happened.
>> 
>>    Potentially it can also be used to limit read only mounts and
>>    forced fscks. When the extent of a corruption can be detected
>>    accurately using checksums and only force errors on that
>>    particular object (this is right now not enabled though
>>    because there are still too many objects with missing checksums)
>> 
>>    There's a general trend in file systems recently to add metadata
>>    checksums, this just makes ext4 catch up a bit (in addition
>>    to the already existing transaction and block group checksums)
>> 
>>    Another advantage is that the checksum can also check for misplaced
>>    writes (by including the intended block location) and objects
>>    left over from before mkfs (by including the file system UUID).
>> 
>>    Adding checksums to all metadata in ext4 would likely need some
>>    incompatible format changes, but it's relatively straight-forward to add
>>    them to inodes and the super block at least. This patch-kit does this.
>> 
>>    Again it only handles some meta-data objects. This is not a full
>>    user-data checksumming feature or not even a "all metadata objects"
>>    feature.
>> 
>>    I didn't do a lot of research into different CRC functions, but simply
>>    used the same one as btrfs and iSCSI (crc32c).  ocfs2 uses ECCs
>>    that have some self-correcting ability, but that seemed overkill to me.
>>    The standard CRC has the advantage that the CRC is accelerated by hardware
>>    instructions on newer Intel CPUs and possibly on other platforms too.
>>    The CRC32C function is also hard coded, although in theory it could be made
>>    configurable per file system. But since that would lead to a multitude of
>>    incompatible formats I decided to simply hard code for now.
>> 
>>    Right now there is no e2fsprogs support, so fsck doesn't know about CRCs.
>> 
>>    For now the checksums can be enabled with a simple shell script
>>    (which is just a wrapper around debugfs):
>> 
>>    wget ftp://firstfloor.org/pub/ak/shell/e2checksum
>>    chmod +x e2checksum
>> 
>>    To enable:
>> 
>>    ./e2checksum /dev/device enable
>> 
>>    To disable:
>> 
>>    ./e2checksum /dev/device disable
>> 
>>    This works with the current ext4 e2fsprogs without any changes.
>> 
>>    The kernel will automatically add checksums to the file system when the
>>    feature is turned on, as the inodes are rewritten.
>> 
>>    WARNING: Note there is no fsck support currently, so before you run fsck better
>>    disable the checksums. After disabling/changing/reenabling
>>    you may also need to mount with ignore_crc until the checksums
>>    are fixed up again.
>> 
>>    The checksums are implemented as a read-only compat feature: this means
>>    the a file system with checksums enabled can be read by a kernel that
>>    does not know about checksums, but not written. Of course you can turn
>>    off the flag again to make the file system write compatible again (but
>>    that will put the checksums into a inconsistent state) or use the
>>    ignore_crc mount option.
>> 
>>    WARNING: the in-kernel upgrader relies on the checksum fields being
>>    zero when checksums are enabled. When you turn on the feature, use the
>>    file system, turn it off again, use it again and turn it on again the
>>    checksums will be in a inconsistent state. Right now this can be
>>    fixed by mounting with -o ignore_crc and then doing a find | xargs touch
>>    on the file system.
>> 
>>    I expect real e2fsprogs support can do that better.
>> 
>>    The code also adds some more sanity checks on inodes to distingush zeroed
>>    inodes from inodes with no checksums. This is currently done by enforcing
>>    that a/m/ctime are not zero. If there are broken file systems around
>>    where that is not true, the sanity check can be turned off (see below)
>> 
>>    There are two new mount options:
>> 
>>    ignore_crc	       Ignore the CRCs on reading but still update them
>>    noinode_sanity	       Don't do new inode sanity checks
> 
> Hm, the usage model of my patch is tune2fs; fsck; mount.  I suspect it's a good
> idea to run fsck before/while turning on this feature to correct any other
> errors.  Though, that means that the kernel will simply -EIO anything it thinks
> is corrupt.
> 
>>    The implementation is not particularly optimized: it always recomputes
>>    the full CRC on each inode or super block write. Some more optimizations
>>    would be possible in this area.
>> 
>>    BENCHMARKS
>> 
>>    I ran kernelbench and it didn't show any significant difference between
>>    the run with checksums and the run without.
>> 
>>    I also tried timing fsstress from XFS/LTP, and it gave a 35% slowdown
>>    on a disk and 5% slow down on the ramdisk for the system time
>>    during the test.  However I'm a bit suspicious of these results.
>>    This was on a core 2.
> 
> I'll give fsstress a try when I get back to this (Friday skunkworks project).
> 
> Either way, thanks for the patch!
> 
> --D
>> 
>>    I also tested on a Nehalem system which has CRC acceleration instructions
>>    in the CPU.
>> 
>>    Anyone has a better inode metadata intensive benchmark to run?
>> 
>>    This gobbled up the last mount flag bits, but there are still some unused
>>    holes in the middle.
>> 
>>    The patch is definitely still experimental and needs more testing and
>>    review and e2fsprogs support. In particular I would appreciate review
>>    of my bh locking scheme.
>> 
>>    Also ext4_abort() now takes the super lock, which implies that the callers
>>    shouldn't.  I think I checked all callers that it's safe, but double checking
>>    that would be good.
>> 
>>    Opens:
>>    - e2fsprogs support. In this case the zero crc hack could be removed from
>>    the super block code at least.
>>    - Right now checksum numbers by default lead to a read-only file system remount
>>    and are also logged as "IO error". This could be made more gentle, because
>>    a checksum catches problems early enough that continuation might be possible.
>>    - Incremental checksumming
>>    - Checksums for extents and directories
>>    - In ignore_crc mode the checksums are only rewritten when the inode is somehow
>>    dirtied otherwise. Do this implicitely? Might be better to move this to e2fsprogs
>> 
>> diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
>> index e1def17..e98141a 100644
>> --- a/Documentation/filesystems/ext4.txt
>> +++ b/Documentation/filesystems/ext4.txt
>> @@ -359,6 +359,14 @@ nodiscard(*)		commands to the underlying block device when
>> 			and sparse/thinly-provisioned LUNs, but it is off
>> 			by default until sufficient testing has been done.
>> 
>> +nosanity_check		Don't check for zero m/c/a times when reading a inode.
>> +			Should not normally be needed.
>> +
>> +ignore_crc		Ignore checksum failures while reading the super block
>> +			or inodes, but still update the checksums on writing
>> +			(if you don't want to update the checksums, clear the
>> +			checksum feature bit in the super block)
>> +
>> Data Mode
>> =========
>> There are 3 different data modes:
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 19a4de5..3a99769 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -611,6 +611,7 @@ struct ext4_inode {
>> 	__le32  i_crtime;       /* File Creation time */
>> 	__le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
>> 	__le32  i_version_hi;	/* high 32 bits for 64-bit version */
>> +	__le32  i_crc;		/* CRC32 for this inode */
>> };
>> 
>> struct move_extent {
>> @@ -836,6 +837,12 @@ struct ext4_inode_info {
>> 	 */
>> 	tid_t i_sync_tid;
>> 	tid_t i_datasync_tid;
>> +
>> +	/* 
>> +	 * Protect raw inode modifications, mostly to ensure
>> +	 * stability for checksumming.
>> +	 */
>> +	spinlock_t raw_inode_lock;
>> };
>> 
>> /*
>> @@ -881,10 +888,12 @@ struct ext4_inode_info {
>> #define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
>> #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
>> #define EXT4_MOUNT_I_VERSION            0x2000000 /* i_version support */
>> +#define EXT4_MOUNT_IGNORE_CRC           0x4000000 /* Ignore CRCs on read */
>> #define EXT4_MOUNT_DELALLOC		0x8000000 /* Delalloc support */
>> #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
>> #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
>> #define EXT4_MOUNT_DISCARD		0x40000000 /* Issue DISCARD requests */
>> +#define EXT4_MOUNT_INODE_SANITY         0x80000000 /* Inode sanity check */
>> 
>> #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
>> #define set_opt(o, opt)			o |= EXT4_MOUNT_##opt
>> @@ -1003,7 +1012,8 @@ struct ext4_super_block {
>> 	__u8	s_reserved_char_pad2;
>> 	__le16  s_reserved_pad;
>> 	__le64	s_kbytes_written;	/* nr of lifetime kilobytes written */
>> -	__u32   s_reserved[160];        /* Padding to the end of the block */
>> +	__le32  s_crc;			/* CRC32 for this super block */
>> +	__u32   s_reserved[159];        /* Padding to the end of the block */
>> };
>> 
>> #ifdef __KERNEL__
>> @@ -1051,6 +1061,7 @@ struct ext4_sb_info {
>> 	u32 s_hash_seed[4];
>> 	int s_def_hash_version;
>> 	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
>> +	u8 s_uuid[16];
>> 	struct percpu_counter s_freeblocks_counter;
>> 	struct percpu_counter s_freeinodes_counter;
>> 	struct percpu_counter s_dirs_counter;
>> @@ -1265,6 +1276,7 @@ EXT4_INODE_BIT_FNS(state, state_flags)
>> #define EXT4_FEATURE_RO_COMPAT_GDT_CSUM		0x0010
>> #define EXT4_FEATURE_RO_COMPAT_DIR_NLINK	0x0020
>> #define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE	0x0040
>> +#define EXT4_FEATURE_RO_COMPAT_SBI_CRC		0x0080 /* sb and inode CRCs */
>> 
>> #define EXT4_FEATURE_INCOMPAT_COMPRESSION	0x0001
>> #define EXT4_FEATURE_INCOMPAT_FILETYPE		0x0002
>> @@ -1291,7 +1303,8 @@ EXT4_INODE_BIT_FNS(state, state_flags)
>> 					 EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
>> 					 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
>> 					 EXT4_FEATURE_RO_COMPAT_BTREE_DIR |\
>> -					 EXT4_FEATURE_RO_COMPAT_HUGE_FILE)
>> +					 EXT4_FEATURE_RO_COMPAT_HUGE_FILE |\
>> +					 EXT4_FEATURE_RO_COMPAT_SBI_CRC)
>> 
>> /*
>>  * Default values for user and/or group using reserved blocks
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 42272d6..8ba2f24 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -37,6 +37,7 @@
>> #include <linux/namei.h>
>> #include <linux/uio.h>
>> #include <linux/bio.h>
>> +#include <linux/crc32c.h>
>> #include <linux/workqueue.h>
>> #include <linux/kernel.h>
>> #include <linux/slab.h>
>> @@ -72,6 +73,141 @@ static int ext4_inode_is_fast_symlink(struct inode *inode)
>> 	return (S_ISLNK(inode->i_mode) && inode->i_blocks - ea_blocks == 0);
>> }
>> 
>> +#define ZERO_MAGIC 1
>> +
>> +static __le32 inode_crc(struct super_block *sb, struct ext4_inode *raw_inode, long ino)
>> +{
>> +	struct ext4_csum_header {
>> +		__le64 ino;
>> +		__le64 pad;
>> +		u8     uuid[16];
>> +	} __attribute__((aligned)) hdr;
>> +	u32 crc;
>> +
>> +	/*
>> +	 * First CRC in the inode number and the file system UUID
>> +	 * to detect inodes from before the last mkfs and misplaced inode
>> +	 * writes.
>> +	 */
>> +	hdr.ino = cpu_to_le64(ino);
>> +	memcpy(&hdr.uuid, EXT4_SB(sb)->s_uuid, 16);
>> +	hdr.pad = 0;
>> +
>> +	raw_inode->i_crc = 0;
>> +	crc = crc32c(~0U, &hdr, sizeof(struct ext4_csum_header));
>> +	/* Could CRC precompute the common zero tail? (if it's really common) */
>> +	crc = crc32c(crc, raw_inode, EXT4_INODE_SIZE(sb));
>> +	if (crc == 0)
>> +		crc = ZERO_MAGIC;
>> +	return cpu_to_le32(crc);
>> +}
>> +
>> +/*
>> + * Update CRC in a inode, including all additional fields after the
>> + * standard inode structure.
>> + *
>> + * This relies on the raw_inode_lock protecting against all writes
>> + * to the raw inode state in the bh. Right now the JBD locking
>> + * is not good enough for that.
>> + *
>> + * This always precomputes the full checksum. In principle it would
>> + * be possible to be more clever and do incremental changes at least
>> + * for some state changes.
>> + */
>> +static void inode_update_crc(struct super_block *sb, struct ext4_inode *raw_inode,
>> +			     long ino)
>> +{
>> +	if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC))
>> +		return;
>> +	raw_inode->i_crc = inode_crc(sb, raw_inode, ino);
>> +}
>> +
>> +/*
>> + * This is needed because nfsd might try to access dead inodes
>> + * the test is that same one that e2fsck uses
>> + * NeilBrown 1999oct15
>> + */
>> +static int inode_deleted(struct super_block *sb, struct ext4_inode *raw_inode)
>> +{
>> +	if (raw_inode->i_links_count == 0) {
>> +		if (raw_inode->i_mode == 0 ||
>> +		    !(EXT4_SB(sb)->s_mount_state & EXT4_ORPHAN_FS))
>> +			return -ESTALE;
>> +		/*
>> +		 * The only unlinked inodes we let through here have
>> +		 * valid i_mode and are being read by the orphan
>> +		 * recovery code: that's fine, we're about to complete
>> +		 * the process of deleting those.
>> +		 */
>> +		return 1;
>> +	}
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Does this checksum-less inode look like a valid inode?
>> + * Do a few sanity checks.
>> + */
>> +static int inode_sanity_check(struct super_block *sb, struct ext4_inode *raw_inode, char **msg)
>> +{
>> +	int err = inode_deleted(sb, raw_inode);
>> +	if (err)
>> +		return err;
>> +	if (!test_opt(sb, INODE_SANITY))
>> +		return 0;
>> +	if (raw_inode->i_mode == 0 ||
>> +	    raw_inode->i_atime == 0 ||
>> +	    raw_inode->i_ctime == 0 ||
>> +	    raw_inode->i_mtime == 0) {
>> +		*msg = "inode has invalid zero times";
>> +		return -1;
>> +	}
>> +	/* More sanity checks? */
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Check CRC for a newly read inode.
>> + */
>> +static int inode_check_crc(struct super_block *sb, struct ext4_inode *raw_inode,
>> +			   long ino, char **msg)
>> +{
>> +	__le32 crc, check;
>> +
>> +	/*
>> +	 * Special case: zero CRC.  This is handled like no checksum yet,
>> +	 * otherwise tune2fs would need to rewrite all inodes when CRCs
>> +	 * are turned on. The CRC will be updated when the inode
>> +	 * is written out.
>> +	 *
>> +	 * This however means that if zeroes are blasted over the inodes
>> +	 * we would think the checksum is not there. So instead do
>> +	 * some special sanity checks in the other fields when this happens,
>> +	 * to catch this case.
>> + 	 * This is not 100% fool proof, but should be reasonable.
>> +	 * When the checksum function returns a real zero we turn that
>> +	 * into a one to avoid ambiguity.
>> +	 *
>> +	 * The sanity check is done unconditionally even if the checksum passed
>> +	 * because it's cheap enough.
>> +	 */
>> +	crc = raw_inode->i_crc;
>> +	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC) && crc
>> +	    && !test_opt(sb, IGNORE_CRC)) {
>> + 		check = inode_crc(sb, raw_inode, ino);
>> +		if (check != crc) {
>> +			*msg = "inode has invalid checksum";
>> +			/*
>> +			 * Restore bad CRC so that if the inode is reread it'll fail
>> +			 * the check again.
>> +			 */
>> +			raw_inode->i_crc = crc;
>> +			return -1;
>> +		}
>> +	}
>> +	return inode_sanity_check(sb, raw_inode, msg);
>> +}
>> +
>> /*
>>  * Work out how many blocks we need to proceed with the next chunk of a
>>  * truncate transaction.
>> @@ -4996,6 +5132,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>> 	journal_t *journal = EXT4_SB(sb)->s_journal;
>> 	long ret;
>> 	int block;
>> +	char *msg = NULL;
>> 
>> 	inode = iget_locked(sb, ino);
>> 	if (!inode)
>> @@ -5010,6 +5147,26 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>> 	if (ret < 0)
>> 		goto bad_inode;
>> 	raw_inode = ext4_raw_inode(&iloc);
>> +
>> + 	/*
>> + 	 * Relies on the inode lock to protect the raw_inode bh contents.
>> + 	 */
>> + 	ret = inode_check_crc(sb, raw_inode, ino, &msg);
>> + 	if (ret < 0) {
>> + 		/*
>> + 		 * Here would be the place to send a "read other mirror"
>> + 		 * retry hint to the block layer.
>> + 		 */
>> + 		brelse(iloc.bh);
>> + 		if (ret != -ESTALE)
>> + 			ext4_error(sb,
>> + 				   "%s: inode=%lu, block=%llu", msg,
>> + 				   ino, 
>> +				   (unsigned long long)iloc.bh->b_blocknr);
>> + 		goto bad_inode;
>> + 	}
>> + 	ret = 0;
>> +
>> 	inode->i_mode = le16_to_cpu(raw_inode->i_mode);
>> 	inode->i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
>> 	inode->i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
>> @@ -5022,23 +5179,6 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>> 	ei->i_state_flags = 0;
>> 	ei->i_dir_start_lookup = 0;
>> 	ei->i_dtime = le32_to_cpu(raw_inode->i_dtime);
>> -	/* We now have enough fields to check if the inode was active or not.
>> -	 * This is needed because nfsd might try to access dead inodes
>> -	 * the test is that same one that e2fsck uses
>> -	 * NeilBrown 1999oct15
>> -	 */
>> -	if (inode->i_nlink == 0) {
>> -		if (inode->i_mode == 0 ||
>> -		    !(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS)) {
>> -			/* this inode is deleted */
>> -			ret = -ESTALE;
>> -			goto bad_inode;
>> -		}
>> -		/* The only unlinked inodes we let through here have
>> -		 * valid i_mode and are being read by the orphan
>> -		 * recovery code: that's fine, we're about to complete
>> -		 * the process of deleting those. */
>> -	}
>> 	ei->i_flags = le32_to_cpu(raw_inode->i_flags);
>> 	inode->i_blocks = ext4_inode_blocks(raw_inode, ei);
>> 	ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl_lo);
>> @@ -5232,11 +5372,19 @@ static int ext4_do_update_inode(handle_t *handle,
>> 				struct inode *inode,
>> 				struct ext4_iloc *iloc)
>> {
>> +	struct super_block *sb = inode->i_sb;
>> 	struct ext4_inode *raw_inode = ext4_raw_inode(iloc);
>> 	struct ext4_inode_info *ei = EXT4_I(inode);
>> 	struct buffer_head *bh = iloc->bh;
>> 	int err = 0, rc, block;
>> 
>> +	/*
>> +	 * Protect the on disk inode against parallel modification
>> +	 * until we compute the checksum and pass the resulting block
>> +	 * to JBD, which protects it then.
>> +	 */
>> +	spin_lock(&ei->raw_inode_lock);
>> +
>> 	/* For fields not not tracking in the in-memory inode,
>> 	 * initialise them to zero for new inodes. */
>> 	if (ext4_test_inode_state(inode, EXT4_STATE_NEW))
>> @@ -5291,18 +5439,23 @@ static int ext4_do_update_inode(handle_t *handle,
>> 				EXT4_FEATURE_RO_COMPAT_LARGE_FILE) ||
>> 				EXT4_SB(sb)->s_es->s_rev_level ==
>> 				cpu_to_le32(EXT4_GOOD_OLD_REV)) {
>> + 			spin_unlock(&ei->raw_inode_lock);
>> 			/* If this is the first large file
>> 			 * created, add a flag to the superblock.
>> 			 */
>> 			err = ext4_journal_get_write_access(handle,
>> 					EXT4_SB(sb)->s_sbh);
>> -			if (err)
>> +			if (err) {
>> + 				spin_lock(&ei->raw_inode_lock);
>> 				goto out_brelse;
>> +			}
>> 			ext4_update_dynamic_rev(sb);
>> 			EXT4_SET_RO_COMPAT_FEATURE(sb,
>> 					EXT4_FEATURE_RO_COMPAT_LARGE_FILE);
>> 			sb->s_dirt = 1;
>> 			ext4_handle_sync(handle);
>> + 			spin_lock(&ei->raw_inode_lock);
>> + 			inode_update_crc(sb, raw_inode, inode->i_ino);
>> 			err = ext4_handle_dirty_metadata(handle, NULL,
>> 					EXT4_SB(sb)->s_sbh);
>> 		}
>> @@ -5330,6 +5483,7 @@ static int ext4_do_update_inode(handle_t *handle,
>> 			cpu_to_le32(inode->i_version >> 32);
>> 		raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
>> 	}
>> +	inode_update_crc(sb, raw_inode, inode->i_ino);
>> 
>> 	BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
>> 	rc = ext4_handle_dirty_metadata(handle, NULL, bh);
>> @@ -5339,6 +5493,7 @@ static int ext4_do_update_inode(handle_t *handle,
>> 
>> 	ext4_update_inode_fsync_trans(handle, inode, 0);
>> out_brelse:
>> +	spin_unlock(&ei->raw_inode_lock);
>> 	brelse(bh);
>> 	ext4_std_error(inode->i_sb, err);
>> 	return err;
>> @@ -5759,6 +5914,8 @@ static int ext4_expand_extra_isize(struct inode *inode,
>> 	if (EXT4_I(inode)->i_extra_isize >= new_extra_isize)
>> 		return 0;
>> 
>> +	spin_lock(&EXT4_I(inode)->raw_inode_lock);
>> +
>> 	raw_inode = ext4_raw_inode(&iloc);
>> 
>> 	header = IHDR(inode, raw_inode);
>> @@ -5770,10 +5927,12 @@ static int ext4_expand_extra_isize(struct inode *inode,
>> 		memset((void *)raw_inode + EXT4_GOOD_OLD_INODE_SIZE, 0,
>> 			new_extra_isize);
>> 		EXT4_I(inode)->i_extra_isize = new_extra_isize;
>> +		spin_unlock(&EXT4_I(inode)->raw_inode_lock);
>> 		return 0;
>> 	}
>> 
>> 	/* try to expand with EAs present */
>> +	spin_unlock(&EXT4_I(inode)->raw_inode_lock);
>> 	return ext4_expand_extra_isize_ea(inode, new_extra_isize,
>> 					  raw_inode, handle);
>> }
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 4e8983a..4012753 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -39,6 +39,7 @@
>> #include <linux/ctype.h>
>> #include <linux/log2.h>
>> #include <linux/crc16.h>
>> +#include <linux/crc32c.h>
>> #include <asm/uaccess.h>
>> 
>> #include "ext4.h"
>> @@ -70,6 +71,8 @@ static void ext4_write_super(struct super_block *sb);
>> static int ext4_freeze(struct super_block *sb);
>> static int ext4_get_sb(struct file_system_type *fs_type, int flags,
>> 		       const char *dev_name, void *data, struct vfsmount *mnt);
>> +static void ext4_super_update_crc(struct super_block *sb,
>> +				  struct ext4_super_block *es);
>> 
>> #if !defined(CONFIG_EXT3_FS) && !defined(CONFIG_EXT3_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT23)
>> static struct file_system_type ext3_fs_type = {
>> @@ -342,7 +345,14 @@ static void ext4_handle_error(struct super_block *sb)
>> 		ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only");
>> 		sb->s_flags |= MS_RDONLY;
>> 	}
>> +	/*
>> +	 * Unfortunately must take the lock here, to make sure there
>> +	 * is consistent super state for the checksum. This is very easy to get
>> +	 * wrong in the caller.
>> +	 */
>> +	lock_super(sb);
>> 	ext4_commit_super(sb, 1);
>> +	unlock_super(sb);
>> 	if (test_opt(sb, ERRORS_PANIC))
>> 		panic("EXT4-fs (device %s): panic forced after error\n",
>> 			sb->s_id);
>> @@ -531,7 +541,9 @@ __acquires(bitlock)
>> 	if (test_opt(sb, ERRORS_CONT)) {
>> 		EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
>> 		es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
>> +		lock_super(sb);
>> 		ext4_commit_super(sb, 0);
>> +		unlock_super(sb);
>> 		return;
>> 	}
>> 	ext4_unlock_group(sb, grp);
>> @@ -659,9 +671,12 @@ static void ext4_put_super(struct super_block *sb)
>> 	if (sbi->s_journal) {
>> 		err = jbd2_journal_destroy(sbi->s_journal);
>> 		sbi->s_journal = NULL;
>> -		if (err < 0)
>> +		if (err < 0) {
>> +			unlock_super(sb);
>> 			ext4_abort(sb, __func__,
>> 				   "Couldn't clean up the journal");
>> +			lock_super(sb);
>> +		}
>> 	}
>> 
>> 	ext4_release_system_zone(sb);
>> @@ -758,6 +773,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
>> 	ei->i_da_metadata_calc_len = 0;
>> 	ei->i_delalloc_reserved_flag = 0;
>> 	spin_lock_init(&(ei->i_block_reservation_lock));
>> +	spin_lock_init(&(ei->raw_inode_lock));
>> #ifdef CONFIG_QUOTA
>> 	ei->i_reserved_quota = 0;
>> #endif
>> @@ -1161,6 +1177,7 @@ enum {
>> 	Opt_inode_readahead_blks, Opt_journal_ioprio,
>> 	Opt_dioread_nolock, Opt_dioread_lock,
>> 	Opt_discard, Opt_nodiscard,
>> + 	Opt_noinode_sanity, Opt_ignore_crc,
>> };
>> 
>> static const match_table_t tokens = {
>> @@ -1231,6 +1248,8 @@ static const match_table_t tokens = {
>> 	{Opt_dioread_lock, "dioread_lock"},
>> 	{Opt_discard, "discard"},
>> 	{Opt_nodiscard, "nodiscard"},
>> + 	{Opt_noinode_sanity, "noinode_sanity"},
>> + 	{Opt_ignore_crc, "ignore_crc"},
>> 	{Opt_err, NULL},
>> };
>> 
>> @@ -1408,6 +1427,12 @@ static int parse_options(char *options, struct super_block *sb,
>> 		case Opt_orlov:
>> 			clear_opt(sbi->s_mount_opt, OLDALLOC);
>> 			break;
>> +		case Opt_noinode_sanity:
>> +			clear_opt(sbi->s_mount_opt, INODE_SANITY);
>> +			break;
>> +		case Opt_ignore_crc:
>> +			set_opt(sbi->s_mount_opt, IGNORE_CRC);
>> +			break;
>> #ifdef CONFIG_EXT4_FS_XATTR
>> 		case Opt_user_xattr:
>> 			set_opt(sbi->s_mount_opt, XATTR_USER);
>> @@ -1946,6 +1971,7 @@ static int ext4_check_descriptors(struct super_block *sb)
>> 	}
>> 
>> 	ext4_free_blocks_count_set(sbi->s_es, ext4_count_free_blocks(sb));
>> + 	ext4_super_update_crc(sb, sbi->s_es);
>> 	sbi->s_es->s_free_inodes_count =cpu_to_le32(ext4_count_free_inodes(sb));
>> 	return 1;
>> }
>> @@ -2431,6 +2457,50 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
>> 	return 1;
>> }
>> 
>> +/* could be removed for SBs once e2fsutils are fixed to always compute
>> +   the CRC when the feature is turned on. */
>> +#define ZERO_MAGIC 1
>> +
>> +/*
>> + * Manage CRC32 for the superblock.
>> + */
>> +static int ext4_super_check_crc(struct super_block *sb,
>> +				struct ext4_super_block *es)
>> +{
>> +	u32 crc, check;
>> +
>> +	if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC))
>> +		return 0;
>> +	crc = le32_to_cpu(es->s_crc);
>> +	if (crc == 0 || test_opt(sb, IGNORE_CRC)) {
>> +		ext4_msg(sb, KERN_INFO, "super block checksum ignored");
>> +		return 0;
>> +	}
>> +	es->s_crc = 0;
>> +	check = crc32c(~0U, es, sizeof(struct ext4_super_block));
>> +	if (check == 0)
>> +		check = ZERO_MAGIC;
>> +	if (check != crc)
>> +		return -1;
>> +	/* Remove me */
>> +	ext4_msg(sb, KERN_INFO, "super block checksum passed");
>> +	return 0;
>> +}
>> +
>> +static void ext4_super_update_crc(struct super_block *sb,
>> +				  struct ext4_super_block *es)
>> +{
>> +	u32 crc;
>> +
>> +	if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC))
>> +		return;
>> +	es->s_crc = 0;
>> +        crc = crc32c(~0U, es, sizeof(struct ext4_super_block));
>> +	if (crc == 0)
>> +		crc = ZERO_MAGIC;
>> +	es->s_crc = cpu_to_le32(crc);
>> +}
>> +
>> static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>> 				__releases(kernel_lock)
>> 				__acquires(kernel_lock)
>> @@ -2554,6 +2624,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>> 	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
>> 
>> 	set_opt(sbi->s_mount_opt, BARRIER);
>> +	set_opt(sbi->s_mount_opt, INODE_SANITY);
>> 
>> 	/*
>> 	 * enable delayed allocation by default
>> @@ -2585,6 +2656,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>> 	if (!ext4_feature_set_ok(sb, (sb->s_flags & MS_RDONLY)))
>> 		goto failed_mount;
>> 
>> +	if (ext4_super_check_crc(sb, es) < 0) {
>> +		ext4_msg(sb, KERN_ERR, "Invalid checksum in super block");
>> +		goto failed_mount;
>> +	}
>> +
>> 	blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
>> 
>> 	if (blocksize < EXT4_MIN_BLOCK_SIZE ||
>> @@ -2675,6 +2751,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>> 
>> 	for (i = 0; i < 4; i++)
>> 		sbi->s_hash_seed[i] = le32_to_cpu(es->s_hash_seed[i]);
>> +	memcpy(sbi->s_uuid, es->s_uuid, 16);
>> 	sbi->s_def_hash_version = es->s_def_hash_version;
>> 	i = le32_to_cpu(es->s_flags);
>> 	if (i & EXT2_FLAGS_UNSIGNED_HASH)
>> @@ -3393,6 +3470,9 @@ static int ext4_commit_super(struct super_block *sb, int sync)
>> 	es->s_free_inodes_count = cpu_to_le32(percpu_counter_sum_positive(
>> 					&EXT4_SB(sb)->s_freeinodes_counter));
>> 	sb->s_dirt = 0;
>> +	/* can be removed if this is properly journaled, see
>> +         * http://bugzilla.kernel.org/show_bug.cgi?id=14354 */
>> +	ext4_super_update_crc(sb, es);
>> 	BUFFER_TRACE(sbh, "marking dirty");
>> 	mark_buffer_dirty(sbh);
>> 	if (sync) {
>> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
>> index 0433800..fbba95a 100644
>> --- a/fs/ext4/xattr.c
>> +++ b/fs/ext4/xattr.c
>> @@ -1171,6 +1171,7 @@ retry:
>> 
>> 	free = ext4_xattr_free_space(last, &min_offs, base, &total_ino);
>> 	if (free >= new_extra_isize) {
>> +		spin_lock(&EXT4_I(inode)->raw_inode_lock);
>> 		entry = IFIRST(header);
>> 		ext4_xattr_shift_entries(entry,	EXT4_I(inode)->i_extra_isize
>> 				- new_extra_isize, (void *)raw_inode +
>> @@ -1179,6 +1180,7 @@ retry:
>> 				inode->i_sb->s_blocksize);
>> 		EXT4_I(inode)->i_extra_isize = new_extra_isize;
>> 		error = 0;
>> +		spin_unlock(&EXT4_I(inode)->raw_inode_lock);
>> 		goto cleanup;
>> 	}
>> 
>> @@ -1301,6 +1303,7 @@ retry:
>> 		if (error)
>> 			goto cleanup;
>> 
>> +		spin_lock(&EXT4_I(inode)->raw_inode_lock);
>> 		entry = IFIRST(header);
>> 		if (entry_size + EXT4_XATTR_SIZE(size) >= new_extra_isize)
>> 			shift_bytes = new_extra_isize;
>> @@ -1312,6 +1315,7 @@ retry:
>> 			EXT4_GOOD_OLD_INODE_SIZE + extra_isize + shift_bytes,
>> 			(void *)header, total_ino - entry_size,
>> 			inode->i_sb->s_blocksize);
>> +		spin_unlock(&EXT4_I(inode)->raw_inode_lock);
>> 
>> 		extra_isize += shift_bytes;
>> 		new_extra_isize -= shift_bytes;
>> 
>> 
>> -- 
>> ak@linux.intel.com -- Speaking for myself only


Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index e1def17..e98141a 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -359,6 +359,14 @@  nodiscard(*)		commands to the underlying block device when
 			and sparse/thinly-provisioned LUNs, but it is off
 			by default until sufficient testing has been done.
 
+nosanity_check		Don't check for zero m/c/a times when reading a inode.
+			Should not normally be needed.
+
+ignore_crc		Ignore checksum failures while reading the super block
+			or inodes, but still update the checksums on writing
+			(if you don't want to update the checksums, clear the
+			checksum feature bit in the super block)
+
 Data Mode
 =========
 There are 3 different data modes:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..3a99769 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -611,6 +611,7 @@  struct ext4_inode {
 	__le32  i_crtime;       /* File Creation time */
 	__le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
 	__le32  i_version_hi;	/* high 32 bits for 64-bit version */
+	__le32  i_crc;		/* CRC32 for this inode */
 };
 
 struct move_extent {
@@ -836,6 +837,12 @@  struct ext4_inode_info {
 	 */
 	tid_t i_sync_tid;
 	tid_t i_datasync_tid;
+
+	/* 
+	 * Protect raw inode modifications, mostly to ensure
+	 * stability for checksumming.
+	 */
+	spinlock_t raw_inode_lock;
 };
 
 /*
@@ -881,10 +888,12 @@  struct ext4_inode_info {
 #define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
 #define EXT4_MOUNT_I_VERSION            0x2000000 /* i_version support */
+#define EXT4_MOUNT_IGNORE_CRC           0x4000000 /* Ignore CRCs on read */
 #define EXT4_MOUNT_DELALLOC		0x8000000 /* Delalloc support */
 #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
 #define EXT4_MOUNT_DISCARD		0x40000000 /* Issue DISCARD requests */
+#define EXT4_MOUNT_INODE_SANITY         0x80000000 /* Inode sanity check */
 
 #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
 #define set_opt(o, opt)			o |= EXT4_MOUNT_##opt
@@ -1003,7 +1012,8 @@  struct ext4_super_block {
 	__u8	s_reserved_char_pad2;
 	__le16  s_reserved_pad;
 	__le64	s_kbytes_written;	/* nr of lifetime kilobytes written */
-	__u32   s_reserved[160];        /* Padding to the end of the block */
+	__le32  s_crc;			/* CRC32 for this super block */
+	__u32   s_reserved[159];        /* Padding to the end of the block */
 };
 
 #ifdef __KERNEL__
@@ -1051,6 +1061,7 @@  struct ext4_sb_info {
 	u32 s_hash_seed[4];
 	int s_def_hash_version;
 	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
+	u8 s_uuid[16];
 	struct percpu_counter s_freeblocks_counter;
 	struct percpu_counter s_freeinodes_counter;
 	struct percpu_counter s_dirs_counter;
@@ -1265,6 +1276,7 @@  EXT4_INODE_BIT_FNS(state, state_flags)
 #define EXT4_FEATURE_RO_COMPAT_GDT_CSUM		0x0010
 #define EXT4_FEATURE_RO_COMPAT_DIR_NLINK	0x0020
 #define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE	0x0040
+#define EXT4_FEATURE_RO_COMPAT_SBI_CRC		0x0080 /* sb and inode CRCs */
 
 #define EXT4_FEATURE_INCOMPAT_COMPRESSION	0x0001
 #define EXT4_FEATURE_INCOMPAT_FILETYPE		0x0002
@@ -1291,7 +1303,8 @@  EXT4_INODE_BIT_FNS(state, state_flags)
 					 EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
 					 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
 					 EXT4_FEATURE_RO_COMPAT_BTREE_DIR |\
-					 EXT4_FEATURE_RO_COMPAT_HUGE_FILE)
+					 EXT4_FEATURE_RO_COMPAT_HUGE_FILE |\
+					 EXT4_FEATURE_RO_COMPAT_SBI_CRC)
 
 /*
  * Default values for user and/or group using reserved blocks
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42272d6..8ba2f24 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -37,6 +37,7 @@ 
 #include <linux/namei.h>
 #include <linux/uio.h>
 #include <linux/bio.h>
+#include <linux/crc32c.h>
 #include <linux/workqueue.h>
 #include <linux/kernel.h>
 #include <linux/slab.h>
@@ -72,6 +73,141 @@  static int ext4_inode_is_fast_symlink(struct inode *inode)
 	return (S_ISLNK(inode->i_mode) && inode->i_blocks - ea_blocks == 0);
 }
 
+#define ZERO_MAGIC 1
+
+static __le32 inode_crc(struct super_block *sb, struct ext4_inode *raw_inode, long ino)
+{
+	struct ext4_csum_header {
+		__le64 ino;
+		__le64 pad;
+		u8     uuid[16];
+	} __attribute__((aligned)) hdr;
+	u32 crc;
+
+	/*
+	 * First CRC in the inode number and the file system UUID
+	 * to detect inodes from before the last mkfs and misplaced inode
+	 * writes.
+	 */
+	hdr.ino = cpu_to_le64(ino);
+	memcpy(&hdr.uuid, EXT4_SB(sb)->s_uuid, 16);
+	hdr.pad = 0;
+
+	raw_inode->i_crc = 0;
+	crc = crc32c(~0U, &hdr, sizeof(struct ext4_csum_header));
+	/* Could CRC precompute the common zero tail? (if it's really common) */
+	crc = crc32c(crc, raw_inode, EXT4_INODE_SIZE(sb));
+	if (crc == 0)
+		crc = ZERO_MAGIC;
+	return cpu_to_le32(crc);
+}
+
+/*
+ * Update CRC in a inode, including all additional fields after the
+ * standard inode structure.
+ *
+ * This relies on the raw_inode_lock protecting against all writes
+ * to the raw inode state in the bh. Right now the JBD locking
+ * is not good enough for that.
+ *
+ * This always precomputes the full checksum. In principle it would
+ * be possible to be more clever and do incremental changes at least
+ * for some state changes.
+ */
+static void inode_update_crc(struct super_block *sb, struct ext4_inode *raw_inode,
+			     long ino)
+{
+	if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC))
+		return;
+	raw_inode->i_crc = inode_crc(sb, raw_inode, ino);
+}
+
+/*
+ * This is needed because nfsd might try to access dead inodes
+ * the test is that same one that e2fsck uses
+ * NeilBrown 1999oct15
+ */
+static int inode_deleted(struct super_block *sb, struct ext4_inode *raw_inode)
+{
+	if (raw_inode->i_links_count == 0) {
+		if (raw_inode->i_mode == 0 ||
+		    !(EXT4_SB(sb)->s_mount_state & EXT4_ORPHAN_FS))
+			return -ESTALE;
+		/*
+		 * The only unlinked inodes we let through here have
+		 * valid i_mode and are being read by the orphan
+		 * recovery code: that's fine, we're about to complete
+		 * the process of deleting those.
+		 */
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * Does this checksum-less inode look like a valid inode?
+ * Do a few sanity checks.
+ */
+static int inode_sanity_check(struct super_block *sb, struct ext4_inode *raw_inode, char **msg)
+{
+	int err = inode_deleted(sb, raw_inode);
+	if (err)
+		return err;
+	if (!test_opt(sb, INODE_SANITY))
+		return 0;
+	if (raw_inode->i_mode == 0 ||
+	    raw_inode->i_atime == 0 ||
+	    raw_inode->i_ctime == 0 ||
+	    raw_inode->i_mtime == 0) {
+		*msg = "inode has invalid zero times";
+		return -1;
+	}
+	/* More sanity checks? */
+	return 0;
+}
+
+/*
+ * Check CRC for a newly read inode.
+ */
+static int inode_check_crc(struct super_block *sb, struct ext4_inode *raw_inode,
+			   long ino, char **msg)
+{
+	__le32 crc, check;
+
+	/*
+	 * Special case: zero CRC.  This is handled like no checksum yet,
+	 * otherwise tune2fs would need to rewrite all inodes when CRCs
+	 * are turned on. The CRC will be updated when the inode
+	 * is written out.
+	 *
+	 * This however means that if zeroes are blasted over the inodes
+	 * we would think the checksum is not there. So instead do
+	 * some special sanity checks in the other fields when this happens,
+	 * to catch this case.
+ 	 * This is not 100% fool proof, but should be reasonable.
+	 * When the checksum function returns a real zero we turn that
+	 * into a one to avoid ambiguity.
+	 *
+	 * The sanity check is done unconditionally even if the checksum passed
+	 * because it's cheap enough.
+	 */
+	crc = raw_inode->i_crc;
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC) && crc
+	    && !test_opt(sb, IGNORE_CRC)) {
+ 		check = inode_crc(sb, raw_inode, ino);
+		if (check != crc) {
+			*msg = "inode has invalid checksum";
+			/*
+			 * Restore bad CRC so that if the inode is reread it'll fail
+			 * the check again.
+			 */
+			raw_inode->i_crc = crc;
+			return -1;
+		}
+	}
+	return inode_sanity_check(sb, raw_inode, msg);
+}
+
 /*
  * Work out how many blocks we need to proceed with the next chunk of a
  * truncate transaction.
@@ -4996,6 +5132,7 @@  struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 	journal_t *journal = EXT4_SB(sb)->s_journal;
 	long ret;
 	int block;
+	char *msg = NULL;
 
 	inode = iget_locked(sb, ino);
 	if (!inode)
@@ -5010,6 +5147,26 @@  struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 	if (ret < 0)
 		goto bad_inode;
 	raw_inode = ext4_raw_inode(&iloc);
+
+ 	/*
+ 	 * Relies on the inode lock to protect the raw_inode bh contents.
+ 	 */
+ 	ret = inode_check_crc(sb, raw_inode, ino, &msg);
+ 	if (ret < 0) {
+ 		/*
+ 		 * Here would be the place to send a "read other mirror"
+ 		 * retry hint to the block layer.
+ 		 */
+ 		brelse(iloc.bh);
+ 		if (ret != -ESTALE)
+ 			ext4_error(sb,
+ 				   "%s: inode=%lu, block=%llu", msg,
+ 				   ino, 
+				   (unsigned long long)iloc.bh->b_blocknr);
+ 		goto bad_inode;
+ 	}
+ 	ret = 0;
+
 	inode->i_mode = le16_to_cpu(raw_inode->i_mode);
 	inode->i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
 	inode->i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
@@ -5022,23 +5179,6 @@  struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 	ei->i_state_flags = 0;
 	ei->i_dir_start_lookup = 0;
 	ei->i_dtime = le32_to_cpu(raw_inode->i_dtime);
-	/* We now have enough fields to check if the inode was active or not.
-	 * This is needed because nfsd might try to access dead inodes
-	 * the test is that same one that e2fsck uses
-	 * NeilBrown 1999oct15
-	 */
-	if (inode->i_nlink == 0) {
-		if (inode->i_mode == 0 ||
-		    !(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS)) {
-			/* this inode is deleted */
-			ret = -ESTALE;
-			goto bad_inode;
-		}
-		/* The only unlinked inodes we let through here have
-		 * valid i_mode and are being read by the orphan
-		 * recovery code: that's fine, we're about to complete
-		 * the process of deleting those. */
-	}
 	ei->i_flags = le32_to_cpu(raw_inode->i_flags);
 	inode->i_blocks = ext4_inode_blocks(raw_inode, ei);
 	ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl_lo);
@@ -5232,11 +5372,19 @@  static int ext4_do_update_inode(handle_t *handle,
 				struct inode *inode,
 				struct ext4_iloc *iloc)
 {
+	struct super_block *sb = inode->i_sb;
 	struct ext4_inode *raw_inode = ext4_raw_inode(iloc);
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct buffer_head *bh = iloc->bh;
 	int err = 0, rc, block;
 
+	/*
+	 * Protect the on disk inode against parallel modification
+	 * until we compute the checksum and pass the resulting block
+	 * to JBD, which protects it then.
+	 */
+	spin_lock(&ei->raw_inode_lock);
+
 	/* For fields not not tracking in the in-memory inode,
 	 * initialise them to zero for new inodes. */
 	if (ext4_test_inode_state(inode, EXT4_STATE_NEW))
@@ -5291,18 +5439,23 @@  static int ext4_do_update_inode(handle_t *handle,
 				EXT4_FEATURE_RO_COMPAT_LARGE_FILE) ||
 				EXT4_SB(sb)->s_es->s_rev_level ==
 				cpu_to_le32(EXT4_GOOD_OLD_REV)) {
+ 			spin_unlock(&ei->raw_inode_lock);
 			/* If this is the first large file
 			 * created, add a flag to the superblock.
 			 */
 			err = ext4_journal_get_write_access(handle,
 					EXT4_SB(sb)->s_sbh);
-			if (err)
+			if (err) {
+ 				spin_lock(&ei->raw_inode_lock);
 				goto out_brelse;
+			}
 			ext4_update_dynamic_rev(sb);
 			EXT4_SET_RO_COMPAT_FEATURE(sb,
 					EXT4_FEATURE_RO_COMPAT_LARGE_FILE);
 			sb->s_dirt = 1;
 			ext4_handle_sync(handle);
+ 			spin_lock(&ei->raw_inode_lock);
+ 			inode_update_crc(sb, raw_inode, inode->i_ino);
 			err = ext4_handle_dirty_metadata(handle, NULL,
 					EXT4_SB(sb)->s_sbh);
 		}
@@ -5330,6 +5483,7 @@  static int ext4_do_update_inode(handle_t *handle,
 			cpu_to_le32(inode->i_version >> 32);
 		raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
 	}
+	inode_update_crc(sb, raw_inode, inode->i_ino);
 
 	BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
 	rc = ext4_handle_dirty_metadata(handle, NULL, bh);
@@ -5339,6 +5493,7 @@  static int ext4_do_update_inode(handle_t *handle,
 
 	ext4_update_inode_fsync_trans(handle, inode, 0);
 out_brelse:
+	spin_unlock(&ei->raw_inode_lock);
 	brelse(bh);
 	ext4_std_error(inode->i_sb, err);
 	return err;
@@ -5759,6 +5914,8 @@  static int ext4_expand_extra_isize(struct inode *inode,
 	if (EXT4_I(inode)->i_extra_isize >= new_extra_isize)
 		return 0;
 
+	spin_lock(&EXT4_I(inode)->raw_inode_lock);
+
 	raw_inode = ext4_raw_inode(&iloc);
 
 	header = IHDR(inode, raw_inode);
@@ -5770,10 +5927,12 @@  static int ext4_expand_extra_isize(struct inode *inode,
 		memset((void *)raw_inode + EXT4_GOOD_OLD_INODE_SIZE, 0,
 			new_extra_isize);
 		EXT4_I(inode)->i_extra_isize = new_extra_isize;
+		spin_unlock(&EXT4_I(inode)->raw_inode_lock);
 		return 0;
 	}
 
 	/* try to expand with EAs present */
+	spin_unlock(&EXT4_I(inode)->raw_inode_lock);
 	return ext4_expand_extra_isize_ea(inode, new_extra_isize,
 					  raw_inode, handle);
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e8983a..4012753 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -39,6 +39,7 @@ 
 #include <linux/ctype.h>
 #include <linux/log2.h>
 #include <linux/crc16.h>
+#include <linux/crc32c.h>
 #include <asm/uaccess.h>
 
 #include "ext4.h"
@@ -70,6 +71,8 @@  static void ext4_write_super(struct super_block *sb);
 static int ext4_freeze(struct super_block *sb);
 static int ext4_get_sb(struct file_system_type *fs_type, int flags,
 		       const char *dev_name, void *data, struct vfsmount *mnt);
+static void ext4_super_update_crc(struct super_block *sb,
+				  struct ext4_super_block *es);
 
 #if !defined(CONFIG_EXT3_FS) && !defined(CONFIG_EXT3_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT23)
 static struct file_system_type ext3_fs_type = {
@@ -342,7 +345,14 @@  static void ext4_handle_error(struct super_block *sb)
 		ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only");
 		sb->s_flags |= MS_RDONLY;
 	}
+	/*
+	 * Unfortunately must take the lock here, to make sure there
+	 * is consistent super state for the checksum. This is very easy to get
+	 * wrong in the caller.
+	 */
+	lock_super(sb);
 	ext4_commit_super(sb, 1);
+	unlock_super(sb);
 	if (test_opt(sb, ERRORS_PANIC))
 		panic("EXT4-fs (device %s): panic forced after error\n",
 			sb->s_id);
@@ -531,7 +541,9 @@  __acquires(bitlock)
 	if (test_opt(sb, ERRORS_CONT)) {
 		EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS;
 		es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
+		lock_super(sb);
 		ext4_commit_super(sb, 0);
+		unlock_super(sb);
 		return;
 	}
 	ext4_unlock_group(sb, grp);
@@ -659,9 +671,12 @@  static void ext4_put_super(struct super_block *sb)
 	if (sbi->s_journal) {
 		err = jbd2_journal_destroy(sbi->s_journal);
 		sbi->s_journal = NULL;
-		if (err < 0)
+		if (err < 0) {
+			unlock_super(sb);
 			ext4_abort(sb, __func__,
 				   "Couldn't clean up the journal");
+			lock_super(sb);
+		}
 	}
 
 	ext4_release_system_zone(sb);
@@ -758,6 +773,7 @@  static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_da_metadata_calc_len = 0;
 	ei->i_delalloc_reserved_flag = 0;
 	spin_lock_init(&(ei->i_block_reservation_lock));
+	spin_lock_init(&(ei->raw_inode_lock));
 #ifdef CONFIG_QUOTA
 	ei->i_reserved_quota = 0;
 #endif
@@ -1161,6 +1177,7 @@  enum {
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard,
+ 	Opt_noinode_sanity, Opt_ignore_crc,
 };
 
 static const match_table_t tokens = {
@@ -1231,6 +1248,8 @@  static const match_table_t tokens = {
 	{Opt_dioread_lock, "dioread_lock"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
+ 	{Opt_noinode_sanity, "noinode_sanity"},
+ 	{Opt_ignore_crc, "ignore_crc"},
 	{Opt_err, NULL},
 };
 
@@ -1408,6 +1427,12 @@  static int parse_options(char *options, struct super_block *sb,
 		case Opt_orlov:
 			clear_opt(sbi->s_mount_opt, OLDALLOC);
 			break;
+		case Opt_noinode_sanity:
+			clear_opt(sbi->s_mount_opt, INODE_SANITY);
+			break;
+		case Opt_ignore_crc:
+			set_opt(sbi->s_mount_opt, IGNORE_CRC);
+			break;
 #ifdef CONFIG_EXT4_FS_XATTR
 		case Opt_user_xattr:
 			set_opt(sbi->s_mount_opt, XATTR_USER);
@@ -1946,6 +1971,7 @@  static int ext4_check_descriptors(struct super_block *sb)
 	}
 
 	ext4_free_blocks_count_set(sbi->s_es, ext4_count_free_blocks(sb));
+ 	ext4_super_update_crc(sb, sbi->s_es);
 	sbi->s_es->s_free_inodes_count =cpu_to_le32(ext4_count_free_inodes(sb));
 	return 1;
 }
@@ -2431,6 +2457,50 @@  static int ext4_feature_set_ok(struct super_block *sb, int readonly)
 	return 1;
 }
 
+/* could be removed for SBs once e2fsutils are fixed to always compute
+   the CRC when the feature is turned on. */
+#define ZERO_MAGIC 1
+
+/*
+ * Manage CRC32 for the superblock.
+ */
+static int ext4_super_check_crc(struct super_block *sb,
+				struct ext4_super_block *es)
+{
+	u32 crc, check;
+
+	if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC))
+		return 0;
+	crc = le32_to_cpu(es->s_crc);
+	if (crc == 0 || test_opt(sb, IGNORE_CRC)) {
+		ext4_msg(sb, KERN_INFO, "super block checksum ignored");
+		return 0;
+	}
+	es->s_crc = 0;
+	check = crc32c(~0U, es, sizeof(struct ext4_super_block));
+	if (check == 0)
+		check = ZERO_MAGIC;
+	if (check != crc)
+		return -1;
+	/* Remove me */
+	ext4_msg(sb, KERN_INFO, "super block checksum passed");
+	return 0;
+}
+
+static void ext4_super_update_crc(struct super_block *sb,
+				  struct ext4_super_block *es)
+{
+	u32 crc;
+
+	if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_SBI_CRC))
+		return;
+	es->s_crc = 0;
+        crc = crc32c(~0U, es, sizeof(struct ext4_super_block));
+	if (crc == 0)
+		crc = ZERO_MAGIC;
+	es->s_crc = cpu_to_le32(crc);
+}
+
 static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 				__releases(kernel_lock)
 				__acquires(kernel_lock)
@@ -2554,6 +2624,7 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
 
 	set_opt(sbi->s_mount_opt, BARRIER);
+	set_opt(sbi->s_mount_opt, INODE_SANITY);
 
 	/*
 	 * enable delayed allocation by default
@@ -2585,6 +2656,11 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	if (!ext4_feature_set_ok(sb, (sb->s_flags & MS_RDONLY)))
 		goto failed_mount;
 
+	if (ext4_super_check_crc(sb, es) < 0) {
+		ext4_msg(sb, KERN_ERR, "Invalid checksum in super block");
+		goto failed_mount;
+	}
+
 	blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
 
 	if (blocksize < EXT4_MIN_BLOCK_SIZE ||
@@ -2675,6 +2751,7 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 
 	for (i = 0; i < 4; i++)
 		sbi->s_hash_seed[i] = le32_to_cpu(es->s_hash_seed[i]);
+	memcpy(sbi->s_uuid, es->s_uuid, 16);
 	sbi->s_def_hash_version = es->s_def_hash_version;
 	i = le32_to_cpu(es->s_flags);
 	if (i & EXT2_FLAGS_UNSIGNED_HASH)
@@ -3393,6 +3470,9 @@  static int ext4_commit_super(struct super_block *sb, int sync)
 	es->s_free_inodes_count = cpu_to_le32(percpu_counter_sum_positive(
 					&EXT4_SB(sb)->s_freeinodes_counter));
 	sb->s_dirt = 0;
+	/* can be removed if this is properly journaled, see
+         * http://bugzilla.kernel.org/show_bug.cgi?id=14354 */
+	ext4_super_update_crc(sb, es);
 	BUFFER_TRACE(sbh, "marking dirty");
 	mark_buffer_dirty(sbh);
 	if (sync) {
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 0433800..fbba95a 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1171,6 +1171,7 @@  retry:
 
 	free = ext4_xattr_free_space(last, &min_offs, base, &total_ino);
 	if (free >= new_extra_isize) {
+		spin_lock(&EXT4_I(inode)->raw_inode_lock);
 		entry = IFIRST(header);
 		ext4_xattr_shift_entries(entry,	EXT4_I(inode)->i_extra_isize
 				- new_extra_isize, (void *)raw_inode +
@@ -1179,6 +1180,7 @@  retry:
 				inode->i_sb->s_blocksize);
 		EXT4_I(inode)->i_extra_isize = new_extra_isize;
 		error = 0;
+		spin_unlock(&EXT4_I(inode)->raw_inode_lock);
 		goto cleanup;
 	}
 
@@ -1301,6 +1303,7 @@  retry:
 		if (error)
 			goto cleanup;
 
+		spin_lock(&EXT4_I(inode)->raw_inode_lock);
 		entry = IFIRST(header);
 		if (entry_size + EXT4_XATTR_SIZE(size) >= new_extra_isize)
 			shift_bytes = new_extra_isize;
@@ -1312,6 +1315,7 @@  retry:
 			EXT4_GOOD_OLD_INODE_SIZE + extra_isize + shift_bytes,
 			(void *)header, total_ino - entry_size,
 			inode->i_sb->s_blocksize);
+		spin_unlock(&EXT4_I(inode)->raw_inode_lock);
 
 		extra_isize += shift_bytes;
 		new_extra_isize -= shift_bytes;