Patchwork [2/2] OCFS2: Allow huge (> 16 TiB) volumes to mount

Submitter Patrick J. LoPresti
Date July 11, 2010, 5:04 p.m.
Message ID <87zkxyvtjt.fsf@patl.com>
Permalink /patch/58530/
State Not Applicable

Comments

Patrick J. LoPresti - July 11, 2010, 5:04 p.m.
The OCFS2 developers have already done all of the hard work to allow
volumes larger than 16 TiB.  But there is still a "sanity check" in
fs/ocfs2/super.c that prevents the mounting of such volumes, even when
the cluster size and journal options would allow it.

This patch replaces that sanity check with a more sophisticated one that
allows a huge volume to mount provided that (a) it is addressable by the
raw word/address size of the system (borrowing a test from ext4); (b) the
volume is using JBD2; and (c) the JBD2_FEATURE_INCOMPAT_64BIT flag is
set on the journal.
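
As an illustration only, the order in which conditions (a) through (c) are
applied can be sketched in user-space C.  The struct and function names
below are hypothetical stand-ins, not OCFS2 code, and the result of the
addressability test (a) is taken as a precomputed input rather than
derived from real kernel types.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the facts the check consults at mount time. */
struct volume_model {
	uint64_t max_block;	/* highest block number on the volume */
	bool addressable;	/* (a) block layer and page cache can reach it */
	bool has_jbd2_sb;	/* (b) volume is using JBD2 */
	bool journal_64bit;	/* (c) journal has the 64-bit feature set */
};

/* Returns 0 if the volume may be mounted, -EFBIG otherwise. */
static int model_check_volume(const struct volume_model *v)
{
	/* (a) applies to every volume, regardless of size. */
	if (!v->addressable)
		return -EFBIG;

	/* A 32-bit block number is always fine for the journal. */
	if (v->max_block <= UINT32_MAX)
		return 0;

	/* Huge volume: require both (b) and (c). */
	if (!(v->has_jbd2_sb && v->journal_64bit))
		return -EFBIG;

	return 0;
}
```

In this model a small volume passes without any journal feature, while a
huge volume passes only when both journal conditions hold.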

I factored out the sanity check into its own function.  I also moved it
from ocfs2_initialize_super() down to ocfs2_check_volume(); any earlier,
and the journal will not have been initialized yet.

This patch is one of a pair, and it depends on the other ("JBD2: Allow
feature checks before journal recovery").

I have tested this patch on small volumes, huge volumes, and huge
volumes without 64-bit block support in the journal.  All of them appear
to work or to fail gracefully, as appropriate.

Signed-off-by: Patrick LoPresti <lopresti@gmail.com>


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger - July 13, 2010, 12:21 a.m.
On 2010-07-11, at 11:04, Patrick J. LoPresti wrote:
> +/* Check to make sure entire volume is addressable on this system.
> +   Requires osb_clusters_at_boot to be valid and for the journal to
> +   have been initialized by ocfs2_journal_init(). */
> +static int ocfs2_check_addressable(struct ocfs2_super *osb)
> +{
> +	/* Absolute addressability check (borrowed from ext4/super.c) */
> +	if ((max_block >
> +	     (sector_t)(~0LL) >> (osb->sb->s_blocksize_bits - 9)) ||
> +	    (max_block > (pgoff_t)(~0LL) >> (PAGE_CACHE_SHIFT -
> +					     osb->sb->s_blocksize_bits))) {
> +		mlog(ML_ERROR, "Volume too large "
> +		     "to mount safely on this system");
> +		status = -EFBIG;
> +		goto out;
> +	}

This hunk of code is actually in several filesystems.  It wouldn't be a bad idea to make it a library function that can be called by the filesystem to check that the kernel page cache and block layer can handle these large filesystems.

Cheers, Andreas





Patrick J. LoPresti - July 13, 2010, 1:08 a.m.
On Mon, Jul 12, 2010 at 5:21 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On 2010-07-11, at 11:04, Patrick J. LoPresti wrote:
> >
>> +     /* Absolute addressability check (borrowed from ext4/super.c) */
>> +     if ((max_block >
>> +          (sector_t)(~0LL) >> (osb->sb->s_blocksize_bits - 9)) ||
>> +         (max_block > (pgoff_t)(~0LL) >> (PAGE_CACHE_SHIFT -
>> +                                          osb->sb->s_blocksize_bits))) {
>> +             mlog(ML_ERROR, "Volume too large "
>> +                  "to mount safely on this system");
>> +             status = -EFBIG;
>> +             goto out;
>> +     }
>
> This hunk of code is actually in several filesystems.  It wouldn't be a bad idea to make it a library function that can be called by the filesystem to check that the kernel page cache and block layer can handle these large filesystems.

True, but some of them do it differently (e.g. see the #if switch in
xfs_sb_validate_fsb_count).  Tracking down all variants and changing
them is a much larger task than my simple patch.

Are you suggesting I need to do this before my patch is accepted at
all?  Or is this a refactoring that can happen later?

 - Pat
Dave Chinner - July 13, 2010, 1:25 a.m.
On Mon, Jul 12, 2010 at 06:08:51PM -0700, Patrick J. LoPresti wrote:
> On Mon, Jul 12, 2010 at 5:21 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > On 2010-07-11, at 11:04, Patrick J. LoPresti wrote:
> > >
> >> +     /* Absolute addressability check (borrowed from ext4/super.c) */
> >> +     if ((max_block >
> >> +          (sector_t)(~0LL) >> (osb->sb->s_blocksize_bits - 9)) ||
> >> +         (max_block > (pgoff_t)(~0LL) >> (PAGE_CACHE_SHIFT -
> >> +                                          osb->sb->s_blocksize_bits))) {
> >> +             mlog(ML_ERROR, "Volume too large "
> >> +                  "to mount safely on this system");
> >> +             status = -EFBIG;
> >> +             goto out;
> >> +     }
> >
> > This hunk of code is actually in several filesystems.  It wouldn't be a bad idea to make it a library function that can be called by the filesystem to check that the kernel page cache and block layer can handle these large filesystems.
> 
> True, but some of them do it differently (e.g. see the #if switch in
> xfs_sb_validate_fsb_count).  Tracking down all variants and changing
> them is a much larger task than my simple patch.

The XFS code is different to the above because there is still a 16TB
size limit on 32 bit systems (i.e. page cache address limits). IOWs,
you can't just remove the above 16TB check unless you (i.e. OCFS2)
handle >16TB block devices on 32 bit systems correctly...
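
As a rough worked example of that page cache limit, assuming 4 KiB pages
(the thread does not state a page size) and a 32-bit page index:

```latex
2^{32}\ \text{pages} \times 2^{12}\ \text{bytes/page}
  = 2^{44}\ \text{bytes} = 16\ \text{TiB}
```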

Cheers,

Dave.
Patrick J. LoPresti - July 13, 2010, 1:37 a.m.
On Mon, Jul 12, 2010 at 6:25 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> The XFS code is different to the above because there is still a 16TB
> size limit on 32 bit systems (i.e. page cache address limits). IOWs,
> you can't just remove the above 16TB check unless you (i.e. OCFS2)
> handle >16TB block devices on 32 bit systems correctly...

If you look at my patch, you will see that is precisely what it does.
As the comments indicate, it uses the exact same check as ext4, which
will correctly refuse to mount huge volumes on 32-bit systems.

The XFS test appears to be the same thing written a little
differently.  Andreas is suggesting that somebody should factor out
this check into a common library routine.  That sounds like a fine
idea, but it also sounds orthogonal to the (simple and useful) patch I
am attempting to submit.

 - Pat
Andreas Dilger - July 13, 2010, 4:46 a.m.
On 2010-07-12, at 19:08, Patrick J. LoPresti wrote:
> On Mon, Jul 12, 2010 at 5:21 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>> On 2010-07-11, at 11:04, Patrick J. LoPresti wrote:
>>> 
>>> +     /* Absolute addressability check (borrowed from ext4/super.c) */
>>> +     if ((max_block >
>>> +          (sector_t)(~0LL) >> (osb->sb->s_blocksize_bits - 9)) ||
>>> +         (max_block > (pgoff_t)(~0LL) >> (PAGE_CACHE_SHIFT -
>>> +                                          osb->sb->s_blocksize_bits))) {
>>> +             mlog(ML_ERROR, "Volume too large "
>>> +                  "to mount safely on this system");
>>> +             status = -EFBIG;
>>> +             goto out;
>>> +     }
>> 
>> This hunk of code is actually in several filesystems.  It wouldn't be a bad idea to make it a library function that can be called by the filesystem to check that the kernel page cache and block layer can handle these large filesystems.
> 
> True, but some of them do it differently (e.g. see the #if switch in
> xfs_sb_validate_fsb_count).  Tracking down all variants and changing
> them is a much larger task than my simple patch.
> 
> Are you suggesting I need to do this before my patch is accepted at
> all?  Or is this a refactoring that can happen later?

I'm just suggesting it should be done at some point.  I thought it would be better to do it first, rather than add yet another copy of this code.  That said, I hate to block useful fixes because of cleanup (and I have no control over OCFS2 anyway :-).  However, I've found that once the fix is in people usually forget (or become too busy) to do the cleanup and it just lingers on unseen.

Cheers, Andreas





Patrick J. LoPresti - July 13, 2010, 5 a.m.
On Mon, Jul 12, 2010 at 9:46 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On 2010-07-12, at 19:08, Patrick J. LoPresti wrote:
>>
>> Are you suggesting I need to do this before my patch is accepted at
>> all?  Or is this a refactoring that can happen later?
>
> I'm just suggesting it should be done at some point.  I thought it would be better to do it first, rather than add yet another copy of this code.  That said, I hate to block useful fixes because of cleanup (and I have no control over OCFS2 anyway :-).  However, I've found that once the fix is in people usually forget (or become too busy) to do the cleanup and it just lingers on unseen.

I hear you.

I do not object to factoring out the basic addressability test and
using it in my patch, leaving it for others -- like yourself :-) -- to
modify other file systems to invoke it.

Does that sound like a reasonable compromise?  If so, where should the
function live and what should it be called, do you think?

 - Pat
Joel Becker - July 13, 2010, 8:10 a.m.
On Mon, Jul 12, 2010 at 10:00:10PM -0700, Patrick J. LoPresti wrote:
> On Mon, Jul 12, 2010 at 9:46 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > On 2010-07-12, at 19:08, Patrick J. LoPresti wrote:
> >>
> >> Are you suggesting I need to do this before my patch is accepted at
> >> all?  Or is this a refactoring that can happen later?
> >
> > I'm just suggesting it should be done at some point.  I thought it would be better to do it first, rather than add yet another copy of this code.  That said, I hate to block useful fixes because of cleanup (and I have no control over OCFS2 anyway :-).  However, I've found that once the fix is in people usually forget (or become too busy) to do the cleanup and it just lingers on unseen.
> 
> I hear you.
> 
> I do not object to factoring out the basic addressability test and
> using it in my patch, leaving it for others -- like yourself :-) -- to
> modify other file systems to invoke it.

	I think you should modify ext3 and xfs, as they clearly are
partaking of this functionality.  I'll happily review it for you.  Put
the call in fs/libfs.c.  Call it generic_check_addressable(struct
super_block *super).

Joel
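
A shared helper of the kind suggested above might look roughly like the
following user-space sketch.  It is not kernel code: `host_model`,
`model_type_max`, `model_check_addressable` and `MODEL_PAGE_SHIFT` are
hypothetical names, the widths of sector_t and pgoff_t are passed in
explicitly so that a 32-bit host can be modelled, and the signature takes
a block size and block count instead of the struct super_block argument
proposed in the thread.  The sketch also converts the last block to a
page index before comparing, whereas the quoted hunk shifts the limit;
the two agree when the block size equals the page size, and the quoted
form is more conservative for smaller blocks.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12u	/* assumption: 4 KiB pages */

/* Hypothetical description of the host's index types. */
struct host_model {
	unsigned sector_bits;	/* width of sector_t in bits */
	unsigned pgoff_bits;	/* width of pgoff_t in bits */
};

/* Largest value representable in an unsigned type of the given width. */
static uint64_t model_type_max(unsigned bits)
{
	return bits >= 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/*
 * Returns 0 if blocks 0..num_blocks-1 are reachable by both the block
 * layer and the page cache, -EFBIG if not, -EINVAL for a block size
 * outside the range 512 bytes..page size.
 */
static int model_check_addressable(const struct host_model *host,
				   unsigned blocksize_bits,
				   uint64_t num_blocks)
{
	uint64_t last_block, last_page;

	if (blocksize_bits < 9 || blocksize_bits > MODEL_PAGE_SHIFT)
		return -EINVAL;
	if (num_blocks == 0)
		return 0;

	last_block = num_blocks - 1;

	/* Block layer: every 512-byte sector of the last block must fit. */
	if (last_block >
	    (model_type_max(host->sector_bits) >> (blocksize_bits - 9)))
		return -EFBIG;

	/* Page cache: the page holding the last block must fit. */
	last_page = last_block >> (MODEL_PAGE_SHIFT - blocksize_bits);
	if (last_page > model_type_max(host->pgoff_bits))
		return -EFBIG;

	return 0;
}
```

With a 32-bit pgoff_t and 4 KiB blocks, this model accepts a volume of
exactly 2^32 blocks (16 TiB) and rejects one block more; with 64-bit
index types the same volume is accepted.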

Patch

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 0eaa929..b809508 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1991,6 +1991,47 @@  static int ocfs2_setup_osb_uuid(struct ocfs2_super *osb, const unsigned char *uu
 	return 0;
 }
 
+/* Check to make sure entire volume is addressable on this system.
+   Requires osb_clusters_at_boot to be valid and for the journal to
+   have been initialized by ocfs2_journal_init(). */
+static int ocfs2_check_addressable(struct ocfs2_super *osb)
+{
+	int status = 0;
+	u64 max_block =
+		ocfs2_clusters_to_blocks(osb->sb,
+					 osb->osb_clusters_at_boot) - 1;
+
+	/* Absolute addressability check (borrowed from ext4/super.c) */
+	if ((max_block >
+	     (sector_t)(~0LL) >> (osb->sb->s_blocksize_bits - 9)) ||
+	    (max_block > (pgoff_t)(~0LL) >> (PAGE_CACHE_SHIFT -
+					     osb->sb->s_blocksize_bits))) {
+		mlog(ML_ERROR, "Volume too large "
+		     "to mount safely on this system");
+		status = -EFBIG;
+		goto out;
+	}
+
+	/* 32-bit block number is always OK. */
+	if (max_block <= (u32)~0UL)
+		goto out;
+
+	/* Volume is "huge", so see if our journal is new enough to
+	   support it. */
+	if (!(OCFS2_HAS_COMPAT_FEATURE(osb->sb,
+				       OCFS2_FEATURE_COMPAT_JBD2_SB) &&
+	      jbd2_journal_check_used_features(osb->journal->j_journal, 0, 0,
+					       JBD2_FEATURE_INCOMPAT_64BIT))) {
+		mlog(ML_ERROR, "The journal cannot address the entire volume. "
+		     "Enable the 'block64' journal option with tunefs.ocfs2");
+		status = -EFBIG;
+		goto out;
+	}
+
+ out:
+	return status;
+}
+
 static int ocfs2_initialize_super(struct super_block *sb,
 				  struct buffer_head *bh,
 				  int sector_size,
@@ -2215,14 +2256,6 @@  static int ocfs2_initialize_super(struct super_block *sb,
 		goto bail;
 	}
 
-	if (ocfs2_clusters_to_blocks(osb->sb, le32_to_cpu(di->i_clusters) - 1)
-	    > (u32)~0UL) {
-		mlog(ML_ERROR, "Volume might try to write to blocks beyond "
-		     "what jbd can address in 32 bits.\n");
-		status = -EINVAL;
-		goto bail;
-	}
-
 	if (ocfs2_setup_osb_uuid(osb, di->id2.i_super.s_uuid,
 				 sizeof(di->id2.i_super.s_uuid))) {
 		mlog(ML_ERROR, "Out of memory trying to setup our uuid.\n");
@@ -2381,6 +2414,12 @@  static int ocfs2_check_volume(struct ocfs2_super *osb)
 		goto finally;
 	}
 
+	/* Now that journal has been initialized, check to make sure
+	   entire volume is addressable. */
+	status = ocfs2_check_addressable(osb);
+	if (status)
+		goto finally;
+
 	/* If the journal was unmounted cleanly then we don't want to
 	 * recover anything. Otherwise, journal_load will do that
 	 * dirty work for us :) */