ext4: add support for multiple mount protection

Message ID: 1302631493-9778-1-git-send-email-johann@whamcloud.com
State: Superseded, archived

Commit Message

Johann Lombardi April 12, 2011, 6:04 p.m. UTC
Prevent an ext4 filesystem from being mounted multiple times.
A sequence number is stored on disk and is periodically updated (every 5
seconds by default) by a mounted filesystem.
At mount time, we now wait for s_mmp_update_interval seconds to make sure
that the MMP sequence does not change.
In case of failure, the nodename, bdevname and the time at which the MMP
block was last updated are displayed.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
---
 fs/ext4/ext4.h  |   56 ++++++++-
 fs/ext4/super.c |  363 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 416 insertions(+), 3 deletions(-)
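
The mount-time decision described in the commit message can be sketched in user-space C. This is a simplified illustration, not the kernel code: `mmp_seq_state`, `mmp_classify_seq()` and `mmp_other_node_active()` are hypothetical names, while the constants mirror the `EXT4_MMP_SEQ_*` values from the patch. The "unknown" state follows the comment in ext4.h about codes above SEQ_MAX; the posted mount path only tests the CLEAN and FSCK values explicitly.

```c
#include <stdint.h>

/* Values copied from the EXT4_MMP_SEQ_* defines in the patch. */
#define MMP_SEQ_CLEAN 0xFF4D4D50U /* written on clean unmount */
#define MMP_SEQ_FSCK  0xE24D4D50U /* written while fsck is running */
#define MMP_SEQ_MAX   0xE24D4D4FU /* largest ordinary sequence value */

/* Hypothetical classification of a sequence value read at mount time. */
enum mmp_seq_state {
	MMP_STATE_CLEAN,   /* clean unmount: no need to wait before claiming */
	MMP_STATE_FSCK,    /* fsck owns the filesystem: refuse to mount */
	MMP_STATE_ACTIVE,  /* ordinary value: wait, then re-read */
	MMP_STATE_UNKNOWN  /* unknown code above SEQ_MAX: treat as unsafe */
};

static enum mmp_seq_state mmp_classify_seq(uint32_t seq)
{
	if (seq == MMP_SEQ_CLEAN)
		return MMP_STATE_CLEAN;
	if (seq == MMP_SEQ_FSCK)
		return MMP_STATE_FSCK;
	if (seq <= MMP_SEQ_MAX)
		return MMP_STATE_ACTIVE;
	return MMP_STATE_UNKNOWN;
}

/*
 * After the wait, the mounter re-reads the block. Any change in the
 * sequence number means another node is still updating it.
 */
static int mmp_other_node_active(uint32_t seq_before, uint32_t seq_after)
{
	return seq_before != seq_after;
}
```

A mounter would call `mmp_classify_seq()` on the first read and, for an ordinary value, compare the first and second reads with `mmp_other_node_active()`.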

Comments

Bernd Schubert April 12, 2011, 7:20 p.m. UTC | #1
Hello Johann,

> +/*
> + * Default interval in seconds to update the MMP sequence number.
> + */
> +#define EXT4_MMP_UPDATE_INTERVAL   1

this define is nowhere used (e2fsprogs already sets the update-interval 
in the superblock). And it is also too small, which may cause a huge 
performance issue (please remember my corresponding lustre bugzilla...).

Once you are going to post the e2fsprogs patch, could you please also 
take care of the 2TiB mmp-block read limitation bug (please see another 
Lustre bugzilla )?

It also would be nice to explain somewhere (and not hidden in the code) 
what is the difference between "update interval" and "check interval". 
For example something like this:

The 'mmp update interval' is how often the mmp block is supposed to be
written. Values smaller than 5s may reduce performance.
The 'mmp check interval' is used to verify if the mmp block has been 
updated on the device. The value is updated based on the maximum time to 
write the mmp-block during an update-cycle. Its minimum is
5 * mmp-update-interval
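
As a rough illustration of that relationship, the initial check interval in the posted patch is derived from the update interval and clamped to a floor. The helper name below is made up for this sketch; the factor 5 and the 5 s floor are taken from `kmmpd()` and `EXT4_MMP_MIN_CHECK_INTERVAL`.

```c
/* Floor from EXT4_MMP_MIN_CHECK_INTERVAL in the patch, in seconds. */
#define MMP_MIN_CHECK_INTERVAL 5UL

/*
 * Hypothetical helper: initial check interval (seconds) for a given
 * update interval (seconds), mirroring
 *   max(5UL * mmp_update_interval, EXT4_MMP_MIN_CHECK_INTERVAL)
 */
static unsigned long mmp_initial_check_interval(unsigned long update_interval)
{
	unsigned long check = 5UL * update_interval;

	return check > MMP_MIN_CHECK_INTERVAL ? check : MMP_MIN_CHECK_INTERVAL;
}
```

With an update interval of 1 s this gives 5 s; with 5 s it gives 25 s.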



Thanks,
Bernd



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen April 12, 2011, 8:39 p.m. UTC | #2
On 4/12/11 1:04 PM, Johann Lombardi wrote:
> Prevent an ext4 filesystem from being mounted multiple times.
> A sequence number is stored on disk and is periodically updated (every 5
> seconds by default) by a mounted filesystem.
> At mount time, we now wait for s_mmp_update_interval seconds to make sure
> that the MMP sequence does not change.
> In case of failure, the nodename, bdevname and the time at which the MMP
> block was last updated is displayed.
> 
> Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
> Signed-off-by: Johann Lombardi <johann@whamcloud.com>
> ---
>  fs/ext4/ext4.h  |   56 ++++++++-
>  fs/ext4/super.c |  363 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 416 insertions(+), 3 deletions(-)
> 

There was a lot of skepticism about this last time, and I imagine there still is...

400 new lines of kernel code for this, and if the other machine is hung up for 5 seconds and doesn't update, it can still be multiply-mounted anyway, right?

BUG: soft lockup - CPU#0 stuck for 10s! anyone?  :(

I don't see the value in it for upstream ext4, but then hey, ext4 rarely meets a feature it doesn't like ;)

-Eric
--
Andreas Dilger April 12, 2011, 9:08 p.m. UTC | #3
On 2011-04-12, at 4:39 PM, Eric Sandeen wrote:
> On 4/12/11 1:04 PM, Johann Lombardi wrote:
>> Prevent an ext4 filesystem from being mounted multiple times.
>> A sequence number is stored on disk and is periodically updated (every 5
>> seconds by default) by a mounted filesystem.
>> At mount time, we now wait for s_mmp_update_interval seconds to make sure
>> that the MMP sequence does not change.
>> In case of failure, the nodename, bdevname and the time at which the MMP
>> block was last updated is displayed.
>> 
>> Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
>> Signed-off-by: Johann Lombardi <johann@whamcloud.com>
>> ---
>> fs/ext4/ext4.h  |   56 ++++++++-
>> fs/ext4/super.c |  363 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 2 files changed, 416 insertions(+), 3 deletions(-)
>> 
> 
> There was a lot of skepticism about this last time, and I imagine there still is...
> 
> 400 new lines of kernel code for this, and if the other machine is hung up for 5 seconds and doesn't update, it can still be multiply-mounted anyway, right?
> 
> BUG: soft lockup - CPU#0 stuck for 10s! anyone?  :(

No, that isn't true, or the whole patch would be completely useless...

If the owning node is blocked for longer than the update interval, the kmmpd thread will detect this (it tracks the last time the IO was completed). It will then attempt to "re-acquire" the MMP block (read, wait, re-read, update-if-unchanged) or mark the filesystem in error (mounted with errors={remount-ro,panic}) to block the node from accessing the filesystem again.
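
The stall handling described above can be sketched as a pure function. This is a user-space simplification with hypothetical names (`mmp_snapshot`, `mmp_recheck`); in the patch the equivalent logic lives in `kmmpd()`, which re-reads the block once more than the check interval has elapsed, compares `mmp_seq` and `mmp_nodename`, and calls `ext4_error()` on a mismatch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of the two fields kmmpd() compares. */
struct mmp_snapshot {
	uint32_t seq;
	char nodename[64];
};

enum mmp_verdict {
	MMP_ON_TIME,     /* update cycle finished within the check interval */
	MMP_STILL_OURS,  /* we stalled, but the on-disk block is unchanged */
	MMP_TAKEN_OVER   /* another node wrote the block: stop using the fs */
};

/*
 * elapsed_s: seconds since our last MMP write was started
 * check_s:   current check interval in seconds
 * ours:      what we last wrote
 * on_disk:   what a fresh read of the MMP block returned
 */
static enum mmp_verdict mmp_recheck(unsigned long elapsed_s,
				    unsigned long check_s,
				    const struct mmp_snapshot *ours,
				    const struct mmp_snapshot *on_disk)
{
	if (elapsed_s <= check_s)
		return MMP_ON_TIME;

	if (ours->seq != on_disk->seq ||
	    memcmp(ours->nodename, on_disk->nodename,
		   sizeof(ours->nodename)) != 0)
		return MMP_TAKEN_OVER;

	return MMP_STILL_OURS;
}
```

Only the `MMP_TAKEN_OVER` case corresponds to the error path; in the other two cases the thread simply continues with its next update.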

Note also that MMP isn't intended to be a primary HA failover manager (i.e. having all nodes trying to mount a shared filesystem and depending on MMP to decide on a winner), but rather a failsafe for broken HA managers that may fail for many reasons, as we have found in the past (STONITH failure, admin mounting on backup node, mounting fs while e2fsck is running, broken HA scripts, etc).

> I don't see the value in it for upstream ext4, but then hey, ext4 rarely meets a feature it doesn't like ;)

While it is true that this is 400 lines of code, the only change to the main codepath is a few lines at mount/remount/unmount to start/stop the kmmpd thread.  The risk of this causing any problem for someone not using MMP is virtually non-existent, IMHO.


Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.



--
Johann Lombardi April 12, 2011, 9:11 p.m. UTC | #4
On Tue, Apr 12, 2011 at 03:39:33PM -0500, Eric Sandeen wrote:
> There was a lot of skepticism about this last time, and I imagine there still is...

MMP has absolutely no overhead if you don't turn it on. HA configurations (e.g. for NFS/CIFS servers) are very common out there.
Don't you think that MMP could be used in conjunction with Red Hat Cluster Suite to avoid corrupting the backend filesystem when things go wrong?

> 400 new lines of kernel code for this, and if the other machine is hung up for 5 seconds and doesn't update, it can still be multiply-mounted anyway, right?
> 
> BUG: soft lockup - CPU#0 stuck for 10s! anyone?  :(

If the kmmpd thread misses some MMP sequence updates, it reads the MMP block and calls ext4_error() if the sequence was updated behind its back.

Johann
--
Bernd Schubert April 12, 2011, 9:41 p.m. UTC | #5
On 04/12/2011 10:39 PM, Eric Sandeen wrote:
> On 4/12/11 1:04 PM, Johann Lombardi wrote:
>> Prevent an ext4 filesystem from being mounted multiple times. A
>> sequence number is stored on disk and is periodically updated
>> (every 5 seconds by default) by a mounted filesystem. At mount
>> time, we now wait for s_mmp_update_interval seconds to make sure 
>> that the MMP sequence does not change. In case of failure, the
>> nodename, bdevname and the time at which the MMP block was last
>> updated is displayed.
>> 
>> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> 
>> Signed-off-by: Johann Lombardi <johann@whamcloud.com> --- 
>> fs/ext4/ext4.h  |   56 ++++++++- fs/ext4/super.c |  363
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files
>> changed, 416 insertions(+), 3 deletions(-)
>> 
> 
> There was a lot of skepticism about this last time, and I imagine
> there still is...
> 
> 400 new lines of kernel code for this, and if the other machine is
> hung up for 5 seconds and doesn't update, it can still be
> multiply-mounted anyway, right?
> 
> BUG: soft lockup - CPU#0 stuck for 10s! anyone?  :(

Please see my other comment about the two different intervals. Yes,
there is a minimal chance of a race. But firstly, 5s are too small,
already for performance reasons (setting the update-interval to 5s will
increase the min check-interval to 25s). Secondly, the mount-wait time is

+	wait_time = min(mmp_check_interval * 2 + 1,
+			mmp_check_interval + 60);

So even with Johann's patch it is at least 12s.

Thirdly, the check-interval is automatically increased, if updating the
mmp block takes too long. This value will also be saved in the
mmp-block. Of course, it has a disadvantage - the mount time increases.
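
The two calculations discussed in this reply can be lifted out of the patch into plain C for illustration. The function names are hypothetical; the formulas and the 5 s / 300 s bounds come from `kmmpd()` and `ext4_multi_mount_protect()`. The kernel works in jiffies, so whole seconds here are a simplification.

```c
#define MMP_MIN_CHECK_INTERVAL 5UL   /* seconds */
#define MMP_MAX_CHECK_INTERVAL 300UL /* seconds */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * New check interval after an update cycle that took cycle_s seconds,
 * mirroring max(min(5 * diff / HZ, MAX), MIN) in kmmpd().
 */
static unsigned long mmp_adjust_check_interval(unsigned long cycle_s)
{
	return max_ul(min_ul(5UL * cycle_s, MMP_MAX_CHECK_INTERVAL),
		      MMP_MIN_CHECK_INTERVAL);
}

/* Seconds a mounter sleeps before re-reading the MMP block. */
static unsigned long mmp_mount_wait(unsigned long check_s)
{
	return min_ul(check_s * 2UL + 1UL, check_s + 60UL);
}
```

For example, a slow device that needs 20 s per update cycle pushes the check interval to 100 s, and a mounter reading that value would then sleep min(201, 160) = 160 s per wait.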

> 
> I don't see the value in it for upstream ext4, but then hey, ext4
> rarely meets a feature it doesn't like ;)

Is ext4 only used on desktop systems? IMHO, every HA solution that
does not use SCSI reservations or another way to check if a device is
already in use needs a solution like this. I have seen so many problems
with heartbeat/pacemaker failing to properly detect an already mounted
device (*), and this MMP patch has already protected so many HA Lustre
installations from data corruption due to double mounts...
So why shouldn't other HA solutions benefit from such a nice feature?

Usually, heartbeat/pacemaker fails to detect a mounted device because
the information about what is mounted is unreliable: /etc/mtab is
entirely unreliable, and /proc/mounts does not always show whether a
device is mounted.
However, even if that somehow worked perfectly, without the MMP
patch there is still zero protection from user errors. It can easily
happen that an admin forgets about a mounted device and runs e2fsck or
manually mounts the device on another machine again.

So please, let this patch go in.

Thanks,
Bernd
--
Eric Sandeen April 12, 2011, 9:44 p.m. UTC | #6
On 4/12/11 4:41 PM, Bernd Schubert wrote:

> So please, let this patch go in.

Well, it's not up to me :)

-Eric
 
> Thanks,
> Bernd

--
Johann Lombardi May 2, 2011, 1:43 p.m. UTC | #7
Hi Bernd,

On Tue, Apr 12, 2011 at 09:20:03PM +0200, Bernd Schubert wrote:
> >+#define EXT4_MMP_UPDATE_INTERVAL   1
> 
> this define is nowhere used (e2fsprogs already sets the
> update-interval in the superblock).

Indeed, I have removed it.

> And it is also too small, which may cause a huge performance issue
> (please remember my corresponding lustre bugzilla...).

The patches I have just sent increase the update interval to 5s and decrease the multiplier used to compute the check interval from 5 to 2.

> Once you are going to post the e2fsprogs patch, could you please
> also take care of the 2TiB mmp-block read limitation bug (please see
> another Lustre bugzilla )?

The updated e2fsprogs & kernel patches use blk64_t/ext4_fsblk_t, so this should work now. I will give this a try ASAP.

> It also would be nice to explain somewhere (and not hidden in the
> code) what is the difference between "update interval" and "check
> interval".

Done.

Thanks for the feedback.

Cheers,
Johann
--
Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4daaf2b..f130ac7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1028,7 +1028,7 @@  struct ext4_super_block {
 	__le16	s_want_extra_isize; 	/* New inodes should reserve # bytes */
 	__le32	s_flags;		/* Miscellaneous flags */
 	__le16  s_raid_stride;		/* RAID stride */
-	__le16  s_mmp_interval;         /* # seconds to wait in MMP checking */
+	__le16  s_mmp_update_interval;  /* # seconds to wait in MMP checking */
 	__le64  s_mmp_block;            /* Block for multi-mount protection */
 	__le32  s_raid_stripe_width;    /* blocks on all data disks (N*stride)*/
 	__u8	s_log_groups_per_flex;  /* FLEX_BG group size */
@@ -1201,6 +1201,9 @@  struct ext4_sb_info {
 	struct ext4_li_request *s_li_request;
 	/* Wait multiplier for lazy initialization thread */
 	unsigned int s_li_wait_mult;
+
+	/* Kernel thread for multiple mount protection */
+	struct task_struct *s_mmp_tsk;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1357,7 +1360,8 @@  static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
 					 EXT4_FEATURE_INCOMPAT_META_BG| \
 					 EXT4_FEATURE_INCOMPAT_EXTENTS| \
 					 EXT4_FEATURE_INCOMPAT_64BIT| \
-					 EXT4_FEATURE_INCOMPAT_FLEX_BG)
+					 EXT4_FEATURE_INCOMPAT_FLEX_BG| \
+					 EXT4_FEATURE_INCOMPAT_MMP)
 #define EXT4_FEATURE_RO_COMPAT_SUPP	(EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 					 EXT4_FEATURE_RO_COMPAT_LARGE_FILE| \
 					 EXT4_FEATURE_RO_COMPAT_GDT_CSUM| \
@@ -1615,6 +1619,50 @@  struct ext4_features {
 };
 
 /*
+ * This structure will be used for multiple mount protection. It will be
+ * written into the block number saved in the s_mmp_block field in the
+ * superblock. Programs that check MMP should assume that if
+ * SEQ_FSCK (or any unknown code above SEQ_MAX) is present then it is NOT safe
+ * to use the filesystem, regardless of how old the timestamp is.
+ */
+#define EXT4_MMP_MAGIC     0x004D4D50U /* ASCII for MMP */
+#define EXT4_MMP_SEQ_CLEAN 0xFF4D4D50U /* mmp_seq value for clean unmount */
+#define EXT4_MMP_SEQ_FSCK  0xE24D4D50U /* mmp_seq value when being fscked */
+#define EXT4_MMP_SEQ_MAX   0xE24D4D4FU /* maximum valid mmp_seq value */
+
+struct mmp_struct {
+	__le32	mmp_magic;
+	__le32	mmp_seq;
+	__le64	mmp_time;
+	char	mmp_nodename[64];
+	char	mmp_bdevname[32];
+	__le16	mmp_check_interval;
+	__le16	mmp_pad1;
+	__le32	mmp_pad2[227];
+};
+
+/* arguments passed to the mmp thread */
+struct mmpd_data {
+	struct buffer_head *bh; /* bh from initial read_mmp_block() */
+	struct super_block *sb;  /* super block of the fs */
+};
+
+/*
+ * Default interval in seconds to update the MMP sequence number.
+ */
+#define EXT4_MMP_UPDATE_INTERVAL   1
+
+/*
+ * Minimum interval for MMP checking in seconds.
+ */
+#define EXT4_MMP_MIN_CHECK_INTERVAL	5UL
+
+/*
+ * Maximum interval for MMP checking in seconds.
+ */
+#define EXT4_MMP_MAX_CHECK_INTERVAL	300UL
+
+/*
  * Function prototypes
  */
 
@@ -1788,6 +1836,10 @@  extern void __ext4_warning(struct super_block *, const char *, unsigned int,
 						       __LINE__, ## message)
 extern void ext4_msg(struct super_block *, const char *, const char *, ...)
 	__attribute__ ((format (printf, 3, 4)));
+extern void __dump_mmp_msg(struct super_block *, struct mmp_struct *mmp,
+			   const char *, unsigned int, const char *);
+#define dump_mmp_msg(sb, mmp, msg)	__dump_mmp_msg(sb, mmp, __func__, \
+						       __LINE__, msg)
 extern void __ext4_grp_locked_error(const char *, unsigned int, \
 				    struct super_block *, ext4_group_t, \
 				    unsigned long, ext4_fsblk_t, \
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 22546ad..b8b37c5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -39,6 +39,8 @@ 
 #include <linux/log2.h>
 #include <linux/crc16.h>
 #include <asm/uaccess.h>
+#include <linux/kthread.h>
+#include <linux/utsname.h>
 
 #include <linux/kthread.h>
 #include <linux/freezer.h>
@@ -789,6 +791,8 @@  static void ext4_put_super(struct super_block *sb)
 		invalidate_bdev(sbi->journal_bdev);
 		ext4_blkdev_remove(sbi);
 	}
+	if (sbi->s_mmp_tsk)
+		kthread_stop(sbi->s_mmp_tsk);
 	sb->s_fs_info = NULL;
 	/*
 	 * Now that we are completely done shutting down the
@@ -1088,6 +1092,349 @@  static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	return 0;
 }
 
+
+/*
+ * Write the MMP block using WRITE_SYNC to try to get the block on-disk
+ * faster.
+ */
+static int write_mmp_block(struct buffer_head *bh)
+{
+	mark_buffer_dirty(bh);
+	lock_buffer(bh);
+	bh->b_end_io = end_buffer_write_sync;
+	get_bh(bh);
+	submit_bh(WRITE_SYNC, bh);
+	wait_on_buffer(bh);
+	if (unlikely(!buffer_uptodate(bh)))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Read the MMP block. It _must_ be read from disk and hence we clear the
+ * uptodate flag on the buffer.
+ */
+static int read_mmp_block(struct super_block *sb, struct buffer_head **bh,
+			  unsigned long mmp_block)
+{
+	struct mmp_struct *mmp;
+
+	if (*bh)
+		clear_buffer_uptodate(*bh);
+
+	/* This would be sb_bread(sb, mmp_block), except we need to be sure
+	 * that the MD RAID device cache has been bypassed, and that the read
+	 * is not blocked in the elevator. */
+	if (!*bh)
+		*bh = sb_getblk(sb, mmp_block);
+	if (*bh) {
+		get_bh(*bh);
+		lock_buffer(*bh);
+		(*bh)->b_end_io = end_buffer_read_sync;
+		submit_bh(READ_SYNC, *bh);
+		wait_on_buffer(*bh);
+		if (!buffer_uptodate(*bh)) {
+			brelse(*bh);
+			*bh = NULL;
+		}
+	}
+	if (!*bh) {
+		ext4_warning(sb, "Error while reading MMP block %lu",
+		             mmp_block);
+		return -EIO;
+	}
+
+	mmp = (struct mmp_struct *)((*bh)->b_data);
+	if (le32_to_cpu(mmp->mmp_magic) != EXT4_MMP_MAGIC)
+		return -EINVAL;
+
+	return 0;
+}
+
+/*
+ * Dump as much information as possible to help the admin.
+ */
+void __dump_mmp_msg(struct super_block *sb, struct mmp_struct *mmp,
+		    const char *function, unsigned int line, const char *msg)
+{
+	__ext4_warning(sb, function, line, msg);
+	__ext4_warning(sb, function, line,
+		       "MMP failure info: last update time: %llu, last update "
+		       "node: %s, last update device: %s\n",
+		       (long long unsigned int) le64_to_cpu(mmp->mmp_time),
+		       mmp->mmp_nodename, mmp->mmp_bdevname);
+}
+
+/*
+ * kmmpd will update the MMP sequence every s_mmp_update_interval seconds
+ */
+static int kmmpd(void *data)
+{
+	struct super_block *sb = ((struct mmpd_data *) data)->sb;
+	struct buffer_head *bh = ((struct mmpd_data *) data)->bh;
+	struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+	struct mmp_struct *mmp;
+	unsigned long mmp_block;
+	u32 seq = 0;
+	unsigned long failed_writes = 0;
+	int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
+	unsigned mmp_check_interval;
+	unsigned long last_update_time;
+	unsigned long diff;
+	int retval;
+
+	mmp_block = le64_to_cpu(es->s_mmp_block);
+	mmp = (struct mmp_struct *)(bh->b_data);
+	mmp->mmp_time = cpu_to_le64(get_seconds());
+	/*
+	 * Start with the higher mmp_check_interval and reduce it if
+	 * the MMP block is being updated on time.
+	 */
+	mmp_check_interval = max(5UL * mmp_update_interval,
+				 EXT4_MMP_MIN_CHECK_INTERVAL);
+	mmp->mmp_check_interval = cpu_to_le16(mmp_check_interval);
+	bdevname(bh->b_bdev, mmp->mmp_bdevname);
+
+	memcpy(mmp->mmp_nodename, init_utsname()->nodename,
+	       sizeof(mmp->mmp_nodename));
+
+	while (!kthread_should_stop()) {
+		if (++seq > EXT4_MMP_SEQ_MAX)
+			seq = 1;
+
+		mmp->mmp_seq = cpu_to_le32(seq);
+		mmp->mmp_time = cpu_to_le64(get_seconds());
+		last_update_time = jiffies;
+
+		retval = write_mmp_block(bh);
+		/*
+		 * Don't spew too many error messages. Print one every
+		 * (s_mmp_update_interval * 60) seconds.
+		 */
+		if (retval) {
+			if ((failed_writes++ % 60) == 0)
+				ext4_error(sb, "Error writing to MMP block");
+		}
+
+		if (!(le32_to_cpu(es->s_feature_incompat) &
+		    EXT4_FEATURE_INCOMPAT_MMP)) {
+			ext4_warning(sb, "kmmpd being stopped since MMP feature"
+				     " has been disabled.");
+			EXT4_SB(sb)->s_mmp_tsk = NULL;
+			goto failed;
+		}
+
+		if (sb->s_flags & MS_RDONLY) {
+			ext4_warning(sb, "kmmpd being stopped since filesystem "
+				     "has been remounted as readonly.");
+			EXT4_SB(sb)->s_mmp_tsk = NULL;
+			goto failed;
+		}
+
+		diff = jiffies - last_update_time;
+		if (diff < mmp_update_interval * HZ)
+			schedule_timeout_interruptible(mmp_update_interval *
+						       HZ - diff);
+
+		/*
+		 * We need to make sure that more than mmp_check_interval
+		 * seconds have not passed since writing. If that has happened
+		 * we need to check if the MMP block is as we left it.
+		 */
+		diff = jiffies - last_update_time;
+		if (diff > mmp_check_interval * HZ) {
+			struct buffer_head *bh_check = NULL;
+			struct mmp_struct *mmp_check;
+
+			retval = read_mmp_block(sb, &bh_check, mmp_block);
+			if (retval) {
+				ext4_error(sb, "error reading MMP data: %d",
+					   retval);
+
+				EXT4_SB(sb)->s_mmp_tsk = NULL;
+				goto failed;
+			}
+
+			mmp_check = (struct mmp_struct *)(bh_check->b_data);
+			if (mmp->mmp_seq != mmp_check->mmp_seq ||
+			    memcmp(mmp->mmp_nodename, mmp_check->mmp_nodename,
+				   sizeof(mmp->mmp_nodename))) {
+				dump_mmp_msg(sb, mmp_check,
+					     "Error while updating MMP info. "
+					     "The filesystem seems to have been"
+					     " multiply mounted.");
+				ext4_error(sb, "abort");
+				goto failed;
+			}
+			put_bh(bh_check);
+		}
+
+		/*
+		 * Adjust the mmp_check_interval depending on how much time
+		 * it took for the MMP block to be written.
+		 */
+		mmp_check_interval = max(min(5 * diff / HZ,
+					     EXT4_MMP_MAX_CHECK_INTERVAL),
+					 EXT4_MMP_MIN_CHECK_INTERVAL);
+		mmp->mmp_check_interval = cpu_to_le16(mmp_check_interval);
+	}
+
+	/*
+	 * Unmount seems to be clean.
+	 */
+	mmp->mmp_seq = cpu_to_le32(EXT4_MMP_SEQ_CLEAN);
+	mmp->mmp_time = cpu_to_le64(get_seconds());
+
+	retval = write_mmp_block(bh);
+
+failed:
+	kfree(data);
+	brelse(bh);
+	return retval;
+}
+
+/*
+ * Get a random new sequence number but make sure it is not greater than
+ * EXT4_MMP_SEQ_MAX.
+ */
+static unsigned int mmp_new_seq(void)
+{
+	u32 new_seq;
+
+	do {
+		get_random_bytes(&new_seq, sizeof(u32));
+	} while (new_seq > EXT4_MMP_SEQ_MAX);
+
+	return new_seq;
+}
+
+/*
+ * Protect the filesystem from being mounted more than once.
+ */
+static int ext4_multi_mount_protect(struct super_block *sb,
+				    unsigned long mmp_block)
+{
+	struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+	struct buffer_head *bh = NULL;
+	struct mmp_struct *mmp = NULL;
+	struct mmpd_data *mmpd_data;
+	u32 seq;
+	unsigned int mmp_check_interval = le16_to_cpu(es->s_mmp_update_interval);
+	unsigned int wait_time = 0;
+	int retval;
+
+	if (mmp_block < le32_to_cpu(es->s_first_data_block) ||
+	    mmp_block >= ext4_blocks_count(es)) {
+		ext4_warning(sb, "Invalid MMP block in superblock");
+		goto failed;
+	}
+
+	retval = read_mmp_block(sb, &bh, mmp_block);
+	if (retval)
+		goto failed;
+
+	mmp = (struct mmp_struct *)(bh->b_data);
+
+	if (mmp_check_interval < EXT4_MMP_MIN_CHECK_INTERVAL)
+		mmp_check_interval = EXT4_MMP_MIN_CHECK_INTERVAL;
+
+	/*
+	 * If check_interval in MMP block is larger, use that instead of
+	 * update_interval from the superblock.
+	 */
+	if (le16_to_cpu(mmp->mmp_check_interval) > mmp_check_interval)
+		mmp_check_interval = le16_to_cpu(mmp->mmp_check_interval);
+
+	seq = le32_to_cpu(mmp->mmp_seq);
+	if (seq == EXT4_MMP_SEQ_CLEAN)
+		goto skip;
+
+	if (seq == EXT4_MMP_SEQ_FSCK) {
+		dump_mmp_msg(sb, mmp, "fsck is running on the filesystem");
+		goto failed;
+	}
+
+	wait_time = min(mmp_check_interval * 2 + 1,
+			mmp_check_interval + 60);
+
+	/* Print MMP interval if more than 20 secs. */
+	if (wait_time > EXT4_MMP_MIN_CHECK_INTERVAL * 4)
+		ext4_warning(sb, "MMP interval %u higher than expected, please"
+		             " wait.\n", wait_time * 2);
+
+	if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
+		ext4_warning(sb, "MMP startup interrupted, failing mount\n");
+		goto failed;
+	}
+
+	retval = read_mmp_block(sb, &bh, mmp_block);
+	if (retval)
+		goto failed;
+	mmp = (struct mmp_struct *)(bh->b_data);
+	if (seq != le32_to_cpu(mmp->mmp_seq)) {
+		dump_mmp_msg(sb, mmp,
+			     "Device is already active on another node.");
+		goto failed;
+	}
+
+skip:
+	/*
+	 * write a new random sequence number.
+	 */
+	mmp->mmp_seq = seq = cpu_to_le32(mmp_new_seq());
+
+	retval = write_mmp_block(bh);
+	if (retval)
+		goto failed;
+
+	/*
+	 * wait for MMP interval and check mmp_seq.
+	 */
+	if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
+		ext4_warning(sb, "MMP startup interrupted, failing mount\n");
+		goto failed;
+	}
+
+	retval = read_mmp_block(sb, &bh, mmp_block);
+	if (retval)
+		goto failed;
+	mmp = (struct mmp_struct *)(bh->b_data);
+	if (seq != le32_to_cpu(mmp->mmp_seq)) {
+		dump_mmp_msg(sb, mmp,
+			     "Device is already active on another node.");
+		goto failed;
+	}
+
+	mmpd_data = kmalloc(sizeof(struct mmpd_data), GFP_KERNEL);
+	if (!mmpd_data) {
+		ext4_warning(sb, "not enough memory for mmpd_data");
+		goto failed;
+	}
+	mmpd_data->sb = sb;
+	mmpd_data->bh = bh;
+
+	/*
+	 * Start a kernel thread to update the MMP block periodically.
+	 */
+	EXT4_SB(sb)->s_mmp_tsk = kthread_run(kmmpd, mmpd_data, "kmmpd-%s",
+					     bdevname(bh->b_bdev,
+						      mmp->mmp_bdevname));
+	if (IS_ERR(EXT4_SB(sb)->s_mmp_tsk)) {
+		EXT4_SB(sb)->s_mmp_tsk = NULL;
+		kfree(mmpd_data);
+		ext4_warning(sb, "Unable to create kmmpd thread for %s.",
+			     sb->s_id);
+		goto failed;
+	}
+
+	return 0;
+
+failed:
+	brelse(bh);
+	return 1;
+}
+
 static struct inode *ext4_nfs_get_inode(struct super_block *sb,
 					u64 ino, u32 generation)
 {
@@ -3432,6 +3779,11 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 			  EXT4_HAS_INCOMPAT_FEATURE(sb,
 				    EXT4_FEATURE_INCOMPAT_RECOVER));
 
+	if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_MMP) &&
+	    !(sb->s_flags & MS_RDONLY))
+		if (ext4_multi_mount_protect(sb, le64_to_cpu(es->s_mmp_block)))
+			goto failed_mount3;
+
 	/*
 	 * The first inode we look at is the journal inode.  Don't try
 	 * root first: it may be modified in the journal!
@@ -3682,6 +4034,8 @@  failed_mount3:
 	percpu_counter_destroy(&sbi->s_freeinodes_counter);
 	percpu_counter_destroy(&sbi->s_dirs_counter);
 	percpu_counter_destroy(&sbi->s_dirtyblocks_counter);
+	if (sbi->s_mmp_tsk)
+		kthread_stop(sbi->s_mmp_tsk);
 failed_mount2:
 	for (i = 0; i < db_count; i++)
 		brelse(sbi->s_group_desc[i]);
@@ -4212,7 +4566,7 @@  static int ext4_remount(struct super_block *sb, int *flags, char *data)
 	int enable_quota = 0;
 	ext4_group_t g;
 	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
-	int err;
+	int err = 0;
 #ifdef CONFIG_QUOTA
 	int i;
 #endif
@@ -4338,6 +4692,13 @@  static int ext4_remount(struct super_block *sb, int *flags, char *data)
 				goto restore_opts;
 			if (!ext4_setup_super(sb, es, 0))
 				sb->s_flags &= ~MS_RDONLY;
+			if (EXT4_HAS_INCOMPAT_FEATURE(sb,
+						     EXT4_FEATURE_INCOMPAT_MMP))
+				if (ext4_multi_mount_protect(sb,
+						le64_to_cpu(es->s_mmp_block))) {
+					err = -EROFS;
+					goto restore_opts;
+				}
 			enable_quota = 1;
 		}
 	}
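
The on-disk layout of `struct mmp_struct` from the ext4.h hunk can be sanity-checked in user space. This is only a sketch: `struct mmp_layout` is a hypothetical mirror that uses fixed-width types in place of the kernel's `__le16`/`__le32`/`__le64`, endianness is not modeled, and the offsets assume a common ABI with natural alignment. The padding appears chosen so that the structure fills 1024 bytes, the smallest ext4 block size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of struct mmp_struct from the patch. */
struct mmp_layout {
	uint32_t mmp_magic;
	uint32_t mmp_seq;
	uint64_t mmp_time;
	char     mmp_nodename[64];
	char     mmp_bdevname[32];
	uint16_t mmp_check_interval;
	uint16_t mmp_pad1;
	uint32_t mmp_pad2[227];
};

/* 4 + 4 + 8 + 64 + 32 + 2 + 2 + 227 * 4 = 1024 bytes, no implicit padding. */
static_assert(sizeof(struct mmp_layout) == 1024,
	      "mmp_layout should be exactly 1024 bytes");
static_assert(offsetof(struct mmp_layout, mmp_nodename) == 16,
	      "nodename should follow magic, seq and time");
static_assert(offsetof(struct mmp_layout, mmp_check_interval) == 112,
	      "check interval should follow the two name fields");
```

If the compile-time checks pass, the mirror has the same field offsets a tool reading the MMP block would rely on.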