
Question on fallocate/ftruncate sequence

Message ID 5df78e1d0908281740w7bc0f283x5004ca5b231b3af5@mail.gmail.com
State Superseded, archived

Commit Message

Jiaying Zhang Aug. 29, 2009, 12:40 a.m. UTC
On Fri, Aug 28, 2009 at 3:14 PM, Andreas Dilger<adilger@sun.com> wrote:
> On Aug 28, 2009  14:44 -0700, Jiaying Zhang wrote:
>> On Fri, Aug 28, 2009 at 12:40 PM, Andreas Dilger<adilger@sun.com> wrote:
>> > This isn't really correct, however, because i_blocks also contains
>> > non-data blocks (indirect/index, EA, etc.), so even with small
>> > files with ACLs i_blocks may always be larger than ia_size >> 9, and
>> > for ext2/3 at least this will ALWAYS be true for files > 48kB in size.
>>
>> I see. I guess we need to use a special flag then. Or are there any
>> other suggestions? I also have another question related to this
>> problem. Why are those fallocated blocks not marked as preallocated
>> blocks that will then be automatically freed in ext4_release_file?
>
> Because fallocate() means "persistent allocation on disk", not "in memory
> preallocation".  The "in memory" preallocation already happens in ext4,
> and it is released when the inode is cleaned up.

Right. Thanks for pointing this out!

RFC: here is a patch that Frank and I have been working on. It introduces
a new fs flag to mark that the file has blocks allocated beyond its EOF, as
discussed previously in this thread. The flag is cleared in the subsequent
vmtruncate or in a fallocate without KEEPSIZE. A vmtruncate may be called
unnecessarily if the file has since been written beyond the allocated
size, but I think that cost is acceptable to get correctness.

15:37:45.000000000 -0700
+++ fs/ext4/extents.c	2009-08-28 17:27:27.000000000 -0700
@@ -3095,7 +3095,13 @@ static void ext4_falloc_update_inode(str
 			i_size_write(inode, new_size);
 		if (new_size > EXT4_I(inode)->i_disksize)
 			ext4_update_i_disksize(inode, new_size);
+		inode->i_flags &= ~FS_KEEPSIZE_FL;
 	} else {
+		/*
+		 * Mark that we allocate beyond EOF so the subsequent truncate
+		 * can proceed even if the new size is the same as i_size.
+		 */
+		inode->i_flags |= FS_KEEPSIZE_FL;
 	}
 }

--- .pc/fallocate_keepsizse.patch/fs/ext4/inode.c	2009-08-16 14:19:38.000000000 -0700
+++ fs/ext4/inode.c	2009-08-28 16:59:42.000000000 -0700
@@ -3973,6 +3973,8 @@ void ext4_truncate(struct inode *inode)
 	if (!ext4_can_truncate(inode))
 		return;

+	inode->i_flags &= ~FS_KEEPSIZE_FL;
+
 	if (inode->i_size == 0 && !test_opt(inode->i_sb, NO_AUTO_DA_ALLOC))
 		ei->i_state |= EXT4_STATE_DA_ALLOC_CLOSE;

--- .pc/fallocate_keepsizse.patch/include/linux/fs.h	2009-08-28 15:44:27.000000000 -0700
+++ include/linux/fs.h	2009-08-28 17:00:47.000000000 -0700
@@ -343,6 +343,7 @@ struct inodes_stat_t {
 #define FS_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
 #define FS_EXTENT_FL			0x00080000 /* Extents */
 #define FS_DIRECTIO_FL			0x00100000 /* Use direct i/o */
+#define FS_KEEPSIZE_FL			0x00200000 /* Blocks allocated beyond EOF */
 #define FS_RESERVED_FL			0x80000000 /* reserved for ext2 lib */

 #define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */

Jiaying

>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Theodore Ts'o Aug. 30, 2009, 2:52 a.m. UTC | #1
On Fri, Aug 28, 2009 at 05:40:54PM -0700, Jiaying Zhang wrote:
> --- .pc/fallocate_keepsizse.patch/fs/attr.c	2009-08-28 15:38:46.000000000 -0700
> +++ fs/attr.c	2009-08-28 17:01:04.000000000 -0700
> @@ -68,7 +68,8 @@ int inode_setattr(struct inode * inode,
>  	unsigned int ia_valid = attr->ia_valid;
> 
>  	if (ia_valid & ATTR_SIZE &&
> -	    (attr->ia_size != i_size_read(inode)) {
> +	    (attr->ia_size != i_size_read(inode) ||
> +	     (inode->i_flags & FS_KEEPSIZE_FL))) {
>  		int error = vmtruncate(inode, attr->ia_size);
>  		if (error)
>  			return error;

Instead of doing this in the generic code, it really should be done in
ext4_setattr.  Technically speaking, we don't actually need the
FS_KEEPSIZE_FL to solve this problem; instead we can simply have the
ext4 code look in the extent tree to see if there are any blocks
mapped beyond the logical block:

       i_size_read(inode) >> inode->i_sb->s_blocksize_bits
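
A rough user-space sketch of the check described above may help. It does not use the real ext4 extent structures: `struct model_extent`, `first_block_past_eof()` and `has_blocks_beyond_eof()` are hypothetical names invented for illustration, and rounding i_size up to the next block boundary is an assumption added here so that a partially filled last block is not counted as "beyond EOF".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an extent (not struct ext4_extent). */
struct model_extent {
	uint32_t first_block;	/* first logical block covered */
	uint16_t len;		/* number of blocks covered */
};

/* First logical block that lies entirely past i_size (assumption: round up). */
static uint64_t first_block_past_eof(uint64_t i_size, unsigned int blocksize_bits)
{
	uint64_t blocksize = (uint64_t)1 << blocksize_bits;

	return (i_size + blocksize - 1) >> blocksize_bits;
}

/* Returns 1 if any extent maps a block at or after the first block past EOF. */
static int has_blocks_beyond_eof(const struct model_extent *ext, size_t n,
				 uint64_t i_size, unsigned int blocksize_bits)
{
	uint64_t limit = first_block_past_eof(i_size, blocksize_bits);
	size_t i;

	for (i = 0; i < n; i++) {
		/* exclusive end of this extent, in logical blocks */
		uint64_t end = (uint64_t)ext[i].first_block + ext[i].len;

		if (end > limit)
			return 1;
	}
	return 0;
}
```

In the kernel only the last extent in the tree would need to be examined; the linear scan here just keeps the model short.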

Having a flag as Andreas suggested does help with the issue of e2fsck
noticing whether or not i_size is incorrect (and should be fixed) or
the file has been extended.  So keeping the flag is an OK thing
to do, but we need to be careful about a particularly subtle
overloading problem.  The flags FS_*_FL as defined in
include/linux/fs.h are technically only for in-memory use.  The ext4
on-disk format flags are EXT4_*_FL, and are defined in ext4.h.

The flags were originally defined for use in ext2/3/4, but later on
other filesystems adopted those flags so that e2fsprogs's chattr and
lsattr programs could be used for their filesystems as well.  It just
so happens that for ext2/3/4 the on-disk encoding of those flags and
the in-memory encoding of those flags in i_flags are the same, but
that means that the flags need to be defined in both places to avoid
assignment overlaps.  We also need to be clear whether the flags are
internal flags for ext4's use only, or flags meant for use by all
filesystems.  This is why the testing for FS_KEEPSIZE_FL in fs/attr is
particularly bad, if the flag is going to be set in fs/ext4/extents.c.

It's better to define the flag as EXT4_KEEPSIZE_FL, and to use it as
EXT4_KEEPSIZE_FL, but make a note of that bitfield position as being
reserved in include/linux/fs.h.

							- Ted
Jiaying Zhang Aug. 31, 2009, 7:40 p.m. UTC | #2
Thanks a lot for the comments and suggestions!

On Sat, Aug 29, 2009 at 7:52 PM, Theodore Tso<tytso@mit.edu> wrote:
> On Fri, Aug 28, 2009 at 05:40:54PM -0700, Jiaying Zhang wrote:
>> --- .pc/fallocate_keepsizse.patch/fs/attr.c   2009-08-28 15:38:46.000000000 -0700
>> +++ fs/attr.c 2009-08-28 17:01:04.000000000 -0700
>> @@ -68,7 +68,8 @@ int inode_setattr(struct inode * inode,
>>       unsigned int ia_valid = attr->ia_valid;
>>
>>       if (ia_valid & ATTR_SIZE &&
>> -         (attr->ia_size != i_size_read(inode)) {
>> +         (attr->ia_size != i_size_read(inode) ||
>> +          (inode->i_flags & FS_KEEPSIZE_FL))) {
>>               int error = vmtruncate(inode, attr->ia_size);
>>               if (error)
>>                       return error;
>
> Instead of doing this in the generic code, it really should be done in
> ext4_setattr.  Technically speaking, we don't actually need the
> FS_KEEPSIZE_FL to solve this problem; instead we can simply have the
> ext4 code look in the extent tree to see if there are any blocks
> mapped beyond the logical block:
>
>       i_size_read(inode) >> inode->i_sb->s_blocksize_bits

Is it relatively cheap to scan the extent tree? Will this add
overhead to truncate?

>
> Having a flag as Andreas suggested does help with the issue of e2fsck
> noticing whether or not i_size is incorrect (and should be fixed) or
> the file has been extended.  So keeping the flag is an OK thing
> to do, but we need to be careful about a particularly subtle
> overloading problem.  The flags FS_*_FL as defined in
> include/linux/fs.h are technically only for in-memory use.  The ext4
> on-disk format flags are EXT4_*_FL, and are defined in ext4.h.
>
> The flags were originally defined for use in ext2/3/4, but later on
> other filesystems adopted those flags so that e2fsprogs's chattr and
> lsattr programs could be used for their filesystems as well.  It just
> so happens that for ext2/3/4 the on-disk encoding of those flags and
> the in-memory encoding of those flags in i_flags are the same, but
> that means that the flags need to be defined in both places to avoid
> assignment overlaps.  We also need to be clear whether the flags are
> internal flags for ext4's use only, or flags meant for use by all
> filesystems.  This is why the testing for FS_KEEPSIZE_FL in fs/attr is
> particularly bad, if the flag is going to be set in fs/ext4/extents.c.
>
> It's better to define the flag as EXT4_KEEPSIZE_FL, and to use it as
> EXT4_KEEPSIZE_FL, but make a note of that bitfield position as being
> reserved in include/linux/fs.h.

Here is the modified patch based on your suggestions. I am sticking with the
KEEPSIZE_FL approach, which I think lets us handle the special
truncation accordingly during fsck. Other file systems can also re-use
this flag when they want to support fallocate with KEEP_SIZE. As you
suggested, I moved the EXT4_KEEPSIZE_FL check to ext4_setattr,
which now calls vmtruncate if the KEEPSIZE flag is set in i_flags.
Please let me know what you think about this proposed patch.
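
To make the intended effect of the ext4_setattr change concrete, here is a small user-space model of the size test before and after the patch. It is only a sketch: `MODEL_KEEPSIZE_FL` and both function names are made up for illustration (the bit value is copied from the patch), and the model covers only the condition that gates the truncate-preparation branch, not the journalling or orphan handling around it.

```c
#include <stdint.h>

/* Illustrative copy of the bit proposed in the patch; not a kernel header. */
#define MODEL_KEEPSIZE_FL 0x00200000u

/* Condition before the patch: only a shrinking size takes the truncate path. */
static int old_enters_truncate_path(uint64_t ia_size, uint64_t i_size)
{
	return ia_size < i_size;
}

/*
 * Condition after the patch: the flag also forces the truncate path, so a
 * truncate to the current size can release blocks allocated beyond EOF.
 */
static int new_enters_truncate_path(uint64_t ia_size, uint64_t i_size,
				    uint32_t i_flags)
{
	return ia_size < i_size || (i_flags & MODEL_KEEPSIZE_FL) != 0;
}
```

With ia_size equal to i_size, the old test is false, which is why the ftruncate in the original report left the preallocated blocks in place.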

Thanks a lot!

Jiaying

--- .pc/fallocate_keepsizse.patch/fs/ext4/extents.c	2009-08-31 12:08:10.000000000 -0700
+++ fs/ext4/extents.c	2009-08-31 12:12:16.000000000 -0700
@@ -3095,7 +3095,13 @@ static void ext4_falloc_update_inode(str
 			i_size_write(inode, new_size);
 		if (new_size > EXT4_I(inode)->i_disksize)
 			ext4_update_i_disksize(inode, new_size);
+		inode->i_flags &= ~EXT4_KEEPSIZE_FL;
 	} else {
+		/*
+		 * Mark that we allocate beyond EOF so the subsequent truncate
+		 * can proceed even if the new size is the same as i_size.
+		 */
+		inode->i_flags |= EXT4_KEEPSIZE_FL;
 	}
 }

--- .pc/fallocate_keepsizse.patch/fs/ext4/inode.c	2009-08-31 12:08:10.000000000 -0700
+++ fs/ext4/inode.c	2009-08-31 12:12:16.000000000 -0700
@@ -3973,6 +3973,8 @@ void ext4_truncate(struct inode *inode)
 	if (!ext4_can_truncate(inode))
 		return;

+	inode->i_flags &= ~EXT4_KEEPSIZE_FL;
+
 	if (inode->i_size == 0 && !test_opt(inode->i_sb, NO_AUTO_DA_ALLOC))
 		ei->i_state |= EXT4_STATE_DA_ALLOC_CLOSE;

@@ -4807,7 +4809,9 @@ int ext4_setattr(struct dentry *dentry,
 	}

 	if (S_ISREG(inode->i_mode) &&
-	    attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
+	    attr->ia_valid & ATTR_SIZE &&
+	    (attr->ia_size < inode->i_size ||
+	     (inode->i_flags & EXT4_KEEPSIZE_FL))) {
 		handle_t *handle;

 		handle = ext4_journal_start(inode, 3);
@@ -4838,6 +4842,11 @@ int ext4_setattr(struct dentry *dentry,
 				goto err_out;
 			}
 		}
+		if ((inode->i_flags & EXT4_KEEPSIZE_FL)) {
+			rc = vmtruncate(inode, attr->ia_size);
+			if (rc)
+				goto err_out;
+		}
 	}

 	rc = inode_setattr(inode, attr);
--- .pc/fallocate_keepsizse.patch/include/linux/fs.h	2009-08-31 12:08:10.000000000 -0700
+++ include/linux/fs.h	2009-08-31 12:12:16.000000000 -0700
@@ -343,6 +343,7 @@ struct inodes_stat_t {
 #define FS_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
 #define FS_EXTENT_FL			0x00080000 /* Extents */
 #define FS_DIRECTIO_FL			0x00100000 /* Use direct i/o */
+#define FS_KEEPSIZE_FL			0x00200000 /* Blocks allocated beyond EOF */
 #define FS_RESERVED_FL			0x80000000 /* reserved for ext2 lib */

 #define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
--- .pc/fallocate_keepsizse.patch/fs/ext4/ext4.h	2009-08-31 12:08:10.000000000 -0700
+++ fs/ext4/ext4.h	2009-08-31 12:12:16.000000000 -0700
@@ -235,6 +235,7 @@ struct flex_groups {
 #define EXT4_HUGE_FILE_FL               0x00040000 /* Set to each huge file */
 #define EXT4_EXTENTS_FL			0x00080000 /* Inode uses extents */
 #define EXT4_EXT_MIGRATE		0x00100000 /* Inode is migrating */
+#define EXT4_KEEPSIZE_FL		0x00200000 /* Blocks allocated beyond EOF (bit reserved in fs.h) */
 #define EXT4_RESERVED_FL		0x80000000 /* reserved for ext4 lib */

 #define EXT4_FL_USER_VISIBLE		0x000BDFFF /* User visible flags */


>
>                                                        - Ted
>
Andreas Dilger Aug. 31, 2009, 9:56 p.m. UTC | #3
On Aug 31, 2009  12:40 -0700, Jiaying Zhang wrote:
> > It's better to define the flag as EXT4_KEEPSIZE_FL, and to use it as
> > EXT4_KEEPSIZE_FL, but make a note of that bitfield position as being
> > reserved in include/linux/fs.h.
> 
> Here is the modified patch based on your suggestions. I stick with the
> KEEPSIZE_FL approach that I think can allow us to handle the special
> truncation accordingly during fsck. Other file systems can also re-use
> this flag when they want to support fallocate with KEEP_SIZE. As you
> suggested, I moved the EXT4_KEEPSIZE_FL checking to ext4_setattr
> that now calls vmtruncate if the KEEPSIZE flag is set in the i_flag.
> Please let me know what you think about this proposed patch.
> 
> --- .pc/fallocate_keepsizse.patch/fs/ext4/extents.c	2009-08-31 12:08:10.000000000 -0700
> +++ fs/ext4/extents.c	2009-08-31 12:12:16.000000000 -0700
> @@ -3095,7 +3095,13 @@ static void ext4_falloc_update_inode(str
>  			i_size_write(inode, new_size);
>  		if (new_size > EXT4_I(inode)->i_disksize)
>  			ext4_update_i_disksize(inode, new_size);
> +		inode->i_flags &= ~EXT4_KEEPSIZE_FL;

Note that fallocate can be called multiple times for a file.  The
EXT4_KEEPSIZE_FL should only be cleared if there were writes to
the end of the fallocated space.  In that regard, I think the name
of this flag should be changed to something like "EXT4_EOFBLOCKS_FL"
to indicate that blocks are allocated beyond the end of file (i_size).

>  	} else {
> +		/*
> +		 * Mark that we allocate beyond EOF so the subsequent truncate
> +		 * can proceed even if the new size is the same as i_size.
> +		 */
> +		inode->i_flags |= EXT4_KEEPSIZE_FL;

Similarly, this should only be done in case the fallocate is actually
beyond i_size.  While that is the most common case, it isn't necessarily
ALWAYS going to be true (e.g. if multiple threads are calling fallocate()
on a single file, or if a program always calls fallocate() on a file
without first checking if the file size is large enough).

> +++ include/linux/fs.h	2009-08-31 12:12:16.000000000 -0700
>  #define FS_DIRECTIO_FL			0x00100000 /* Use direct i/o */


> +++ fs/ext4/ext4.h	2009-08-31 12:12:16.000000000 -0700
>  #define EXT4_EXT_MIGRATE		0x00100000 /* Inode is migrating */

Should we redefine EXT4_EXT_MIGRATE not to conflict with FS_DIRECTIO_FL?
I don't think much, if any, use has been made of this flag, and I can
imagine a major headache in the future if this isn't changed now.

Also, EXT4_EXT_MIGRATE doesn't necessarily belong in the i_flags space,
since it is only used in-memory rather than on-disk as all of the others
are.
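
The overlap raised here can be checked mechanically. The sketch below copies the three values from the hunks quoted in this thread and tests whether two flag definitions claim the same bit; `flags_overlap()` is a hypothetical helper written for this illustration and is not a kernel function.

```c
#include <stdint.h>

/* Values copied from the fs.h and ext4.h hunks quoted in this thread. */
#define FS_EXTENT_FL		0x00080000u /* Extents */
#define FS_DIRECTIO_FL		0x00100000u /* Use direct i/o */
#define EXT4_EXT_MIGRATE	0x00100000u /* Inode is migrating */

/* Returns 1 if the two flag values share at least one bit. */
static int flags_overlap(uint32_t a, uint32_t b)
{
	return (a & b) != 0;
}
```

Here `flags_overlap(EXT4_EXT_MIGRATE, FS_DIRECTIO_FL)` is true: both names use bit 0x00100000 with unrelated meanings, which is the conflict described above.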

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Jiaying Zhang Aug. 31, 2009, 11:33 p.m. UTC | #4
On Mon, Aug 31, 2009 at 2:56 PM, Andreas Dilger<adilger@sun.com> wrote:
> On Aug 31, 2009  12:40 -0700, Jiaying Zhang wrote:
>> > It's better to define the flag as EXT4_KEEPSIZE_FL, and to use it as
>> > EXT4_KEEPSIZE_FL, but make a note of that bitfield position as being
>> > reserved in include/linux/fs.h.
>>
>> Here is the modified patch based on your suggestions. I stick with the
>> KEEPSIZE_FL approach that I think can allow us to handle the special
>> truncation accordingly during fsck. Other file systems can also re-use
>> this flag when they want to support fallocate with KEEP_SIZE. As you
>> suggested, I moved the EXT4_KEEPSIZE_FL checking to ext4_setattr
>> that now calls vmtruncate if the KEEPSIZE flag is set in the i_flag.
>> Please let me know what you think about this proposed patch.
>>
>> --- .pc/fallocate_keepsizse.patch/fs/ext4/extents.c   2009-08-31 12:08:10.000000000 -0700
>> +++ fs/ext4/extents.c 2009-08-31 12:12:16.000000000 -0700
>> @@ -3095,7 +3095,13 @@ static void ext4_falloc_update_inode(str
>>                       i_size_write(inode, new_size);
>>               if (new_size > EXT4_I(inode)->i_disksize)
>>                       ext4_update_i_disksize(inode, new_size);
>> +             inode->i_flags &= ~EXT4_KEEPSIZE_FL;
>
> Note that fallocate can be called multiple times for a file.  The
> EXT4_KEEPSIZE_FL should only be cleared if there were writes to
> the end of the fallocated space.  In that regard, I think the name
> of this flag should be changed to something like "EXT4_EOFBLOCKS_FL"
> to indicate that blocks are allocated beyond the end of file (i_size).

Thanks for catching this! I changed the patch to only clear the flag
when new_size is larger than i_size and changed the flag name
as you suggested. It would be nice to clear the flag only when we
write beyond the fallocated space, but this seems hard to detect
because we no longer have the allocated size once the keepsize
fallocate call returns.

>
>>       } else {
>> +             /*
>> +              * Mark that we allocate beyond EOF so the subsequent truncate
>> +              * can proceed even if the new size is the same as i_size.
>> +              */
>> +             inode->i_flags |= EXT4_KEEPSIZE_FL;
>
> Similarly, this should only be done in case the fallocate is actually
> beyond i_size.  While that is the most common case, it isn't necessarily
> ALWAYS going to be true (e.g. if multiple threads are calling fallocate()
> on a single file, or if a program always calls fallocate() on a file
> without first checking if the file size is large enough).

Also fixed.

>
>> +++ include/linux/fs.h        2009-08-31 12:12:16.000000000 -0700
>>  #define FS_DIRECTIO_FL                       0x00100000 /* Use direct i/o */
>
>
>> +++ fs/ext4/ext4.h    2009-08-31 12:12:16.000000000 -0700
>>  #define EXT4_EXT_MIGRATE             0x00100000 /* Inode is migrating */
>
> Should we redefine EXT4_EXT_MIGRATE not to conflict with FS_DIRECTIO_FL?
> I don't think much, if any, use has been made of this flag, and I can
> imagine a major headache in the future if this isn't changed now.
>
> Also, EXT4_EXT_MIGRATE doesn't necessarily belong in the i_flags space,
> since it is only used in-memory rather than on-disk as all of the others
> are.

I will leave this out of my patch since it seems to belong to a more general
cleanup and I don't know much about the EXT4_EXT_MIGRATE flag :).

Here is the new patch:

--- .pc/fallocate_keepsizse.patch/fs/ext4/extents.c	2009-08-31 12:08:10.000000000 -0700
+++ fs/ext4/extents.c	2009-08-31 15:51:13.000000000 -0700
@@ -3091,11 +3091,19 @@ static void ext4_falloc_update_inode(str
 	 * the file size.
 	 */
 	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
-		if (new_size > i_size_read(inode))
+		if (new_size > i_size_read(inode)) {
 			i_size_write(inode, new_size);
+			inode->i_flags &= ~EXT4_EOFBLOCKS_FL;
+		}
 		if (new_size > EXT4_I(inode)->i_disksize)
 			ext4_update_i_disksize(inode, new_size);
 	} else {
+		/*
+		 * Mark that we allocate beyond EOF so the subsequent truncate
+		 * can proceed even if the new size is the same as i_size.
+		 */
+		if (new_size > i_size_read(inode))
+			inode->i_flags |= EXT4_EOFBLOCKS_FL;
 	}
 }

--- .pc/fallocate_keepsizse.patch/fs/ext4/inode.c	2009-08-31 12:08:10.000000000 -0700
+++ fs/ext4/inode.c	2009-08-31 15:50:56.000000000 -0700
@@ -3973,6 +3973,8 @@ void ext4_truncate(struct inode *inode)
 	if (!ext4_can_truncate(inode))
 		return;

+	inode->i_flags &= ~EXT4_EOFBLOCKS_FL;
+
 	if (inode->i_size == 0 && !test_opt(inode->i_sb, NO_AUTO_DA_ALLOC))
 		ei->i_state |= EXT4_STATE_DA_ALLOC_CLOSE;

@@ -4807,7 +4809,9 @@ int ext4_setattr(struct dentry *dentry,
 	}

 	if (S_ISREG(inode->i_mode) &&
-	    attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
+	    attr->ia_valid & ATTR_SIZE &&
+	    (attr->ia_size < inode->i_size ||
+	     (inode->i_flags & EXT4_EOFBLOCKS_FL))) {
 		handle_t *handle;

 		handle = ext4_journal_start(inode, 3);
@@ -4838,6 +4842,11 @@ int ext4_setattr(struct dentry *dentry,
 				goto err_out;
 			}
 		}
+		if ((inode->i_flags & EXT4_EOFBLOCKS_FL)) {
+			rc = vmtruncate(inode, attr->ia_size);
+			if (rc)
+				goto err_out;
+		}
 	}

 	rc = inode_setattr(inode, attr);
--- .pc/fallocate_keepsizse.patch/include/linux/fs.h	2009-08-31 12:08:10.000000000 -0700
+++ include/linux/fs.h	2009-08-31 16:21:44.000000000 -0700
@@ -343,6 +343,7 @@ struct inodes_stat_t {
 #define FS_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
 #define FS_EXTENT_FL			0x00080000 /* Extents */
 #define FS_DIRECTIO_FL			0x00100000 /* Use direct i/o */
+#define FS_EOFBLOCKS_FL			0x00200000 /* Blocks allocated beyond EOF */
 #define FS_RESERVED_FL			0x80000000 /* reserved for ext2 lib */

 #define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
--- .pc/fallocate_keepsizse.patch/fs/ext4/ext4.h	2009-08-31 12:08:10.000000000 -0700
+++ fs/ext4/ext4.h	2009-08-31 15:52:34.000000000 -0700
@@ -235,6 +235,7 @@ struct flex_groups {
 #define EXT4_HUGE_FILE_FL               0x00040000 /* Set to each huge file */
 #define EXT4_EXTENTS_FL			0x00080000 /* Inode uses extents */
 #define EXT4_EXT_MIGRATE		0x00100000 /* Inode is migrating */
+#define EXT4_EOFBLOCKS_FL		0x00200000 /* Blocks allocated beyond EOF (bit reserved in fs.h) */
 #define EXT4_RESERVED_FL		0x80000000 /* reserved for ext4 lib */

 #define EXT4_FL_USER_VISIBLE		0x000BDFFF /* User visible flags */

Jiaying

>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
Andreas Dilger Sept. 2, 2009, 8:41 a.m. UTC | #5
On Aug 31, 2009  16:33 -0700, Jiaying Zhang wrote:
> > EXT4_KEEPSIZE_FL should only be cleared if there were writes to
> > the end of the fallocated space.  In that regard, I think the name
> > of this flag should be changed to something like "EXT4_EOFBLOCKS_FL"
> > to indicate that blocks are allocated beyond the end of file (i_size).
> 
> Thanks for catching this! I changed the patch to only clear the flag
> when the new_size is larger than i_size and changed the flag name
> as you suggested. It would be nice if we only clear the flag when we
> write beyond the fallocated space, but this seems hard to detect
> because we no longer have the allocated size once that keepsize
> fallocate call returns.

The problem is that if e2fsck depends on the EXT4_EOFBLOCKS_FL set
for fallocate-beyond-EOF then it is worse to clear it than to leave
it set.  At worst, leaving the flag set results in too many truncates
on the file.  Clearing the flag when not correct may result in user
visible data corruption if the file size is extended...

> Here is the new patch:
> 
> --- .pc/fallocate_keepsizse.patch/fs/ext4/extents.c	2009-08-31 12:08:10.000000000 -0700
> +++ fs/ext4/extents.c	2009-08-31 15:51:13.000000000 -0700
> @@ -3091,11 +3091,19 @@ static void ext4_falloc_update_inode(str
>  	 * the file size.
>  	 */
>  	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
> +		if (new_size > i_size_read(inode)) {
>  			i_size_write(inode, new_size);
> +			inode->i_flags &= ~EXT4_EOFBLOCKS_FL;

This again isn't quite correct, since the EOFBLOCKS_FL shouldn't
be cleared unless new_size is beyond the allocated size.  The
allocation code itself might be a better place to clear this,
since it knows whether there were new blocks being added beyond
the current max allocated block.

> +#define FS_EOFBLOCKS_FL			0x00200000 /* Blocks allocated beyond EOF */
>  #define FS_RESERVED_FL			0x80000000 /* reserved for ext2 lib */
> 
>  #define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */

It probably isn't a bad idea to make this flag user-visible, since it
would allow scanning for files that have excess space reserved (e.g.
if the filesystem is getting full).  Making it user-settable (i.e.
clearable) would essentially mean truncating the file to i_size without
updating the timestamps so that the reserved space is discarded.  I
don't think there is any value in allowing a user to turn this flag on
for a file.
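
As a small illustration of what "user-visible" means at the bit level: with the values quoted above, the new bit is not part of the FS_FL_USER_VISIBLE mask, so it would be filtered out of whatever is reported to user space until the mask is widened. The helper names below are hypothetical and exist only for this sketch; the widened mask is simply the OR of the two quoted constants, not a value proposed anywhere in this thread.

```c
#include <stdint.h>

/* Values copied from the fs.h hunk quoted above. */
#define FS_EOFBLOCKS_FL		0x00200000u /* Blocks allocated beyond EOF */
#define FS_FL_USER_VISIBLE	0x0003DFFFu /* User visible flags */

/* Returns 1 if every bit of 'flag' is included in 'mask'. */
static int is_in_mask(uint32_t flag, uint32_t mask)
{
	return (flag & mask) == flag;
}

/* Flags as they would appear after masking with a user-visible mask. */
static uint32_t user_view(uint32_t i_flags, uint32_t mask)
{
	return i_flags & mask;
}
```

Widening the mask to `FS_FL_USER_VISIBLE | FS_EOFBLOCKS_FL` (0x0023DFFF) is one way the bit could be exposed; whether it should also be user-modifiable is a separate mask and a separate decision.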

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Jiaying Zhang Sept. 3, 2009, 5:20 a.m. UTC | #6
On Wed, Sep 2, 2009 at 1:41 AM, Andreas Dilger<adilger@sun.com> wrote:
> On Aug 31, 2009  16:33 -0700, Jiaying Zhang wrote:
>> > EXT4_KEEPSIZE_FL should only be cleared if there were writes to
>> > the end of the fallocated space.  In that regard, I think the name
>> > of this flag should be changed to something like "EXT4_EOFBLOCKS_FL"
>> > to indicate that blocks are allocated beyond the end of file (i_size).
>>
>> Thanks for catching this! I changed the patch to only clear the flag
>> when the new_size is larger than i_size and changed the flag name
>> as you suggested. It would be nice if we only clear the flag when we
>> write beyond the fallocated space, but this seems hard to detect
>> because we no longer have the allocated size once that keepsize
>> fallocate call returns.
>
> The problem is that if e2fsck depends on the EXT4_EOFBLOCKS_FL set
> for fallocate-beyond-EOF then it is worse to clear it than to leave
> it set.  At worst, leaving the flag set results in too many truncates
> on the file.  Clearing the flag when not correct may result in user
> visible data corruption if the file size is extended...
>
>> Here is the new patch:
>>
>> --- .pc/fallocate_keepsizse.patch/fs/ext4/extents.c   2009-08-31 12:08:10.000000000 -0700
>> +++ fs/ext4/extents.c 2009-08-31 15:51:13.000000000 -0700
>> @@ -3091,11 +3091,19 @@ static void ext4_falloc_update_inode(str
>>        * the file size.
>>        */
>>       if (!(mode & FALLOC_FL_KEEP_SIZE)) {
>> +             if (new_size > i_size_read(inode)) {
>>                       i_size_write(inode, new_size);
>> +                     inode->i_flags &= ~EXT4_EOFBLOCKS_FL;
>
> This again isn't quite correct, since the EOFBLOCKS_FL shouldn't
> be cleared unless new_size is beyond the allocated size.  The
> allocation code itself might be a better place to clear this,
> since it knows whether there were new blocks being added beyond
> the current max allocated block.

We were thinking of clearing this flag when we need to allocate new
blocks, but I was not sure how to get the current max allocated
block -- that is mostly because I just started working on the ext4
code. After digging into the ext4 allocation code today, I think we
can put the check&clear in ext4_ext_get_blocks:

@@ -2968,6 +2968,14 @@ int ext4_ext_get_blocks(handle_t *handle
 	newex.ee_len = cpu_to_le16(ar.len);
 	if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
 		ext4_ext_mark_uninitialized(&newex);
+
+	if (unlikely(inode->i_flags & EXT4_EOFBLOCKS_FL)) {
+		BUG_ON(!eh->eh_entries);
+		last_ex = EXT_LAST_EXTENT(eh);
+		if (iblock + max_blocks > le32_to_cpu(last_ex->ee_block)
+					+ ext4_ext_get_actual_len(last_ex))
+			inode->i_flags &= ~EXT4_EOFBLOCKS_FL;
+	}
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
 	if (err) {
 		/* free data blocks we just allocated */

Again, I just started looking at this part of the code, so please let me know
if I am heading in the right direction.
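
The comparison in the hunk above can be modelled outside the kernel as follows. This is a sketch under simplifying assumptions: `struct last_extent_model`, `MODEL_EOFBLOCKS_FL` and `update_eofblocks()` are hypothetical names, the extent fields are already in CPU byte order, and the caller is assumed to pass the last extent of a non-empty tree (the case the BUG_ON guards against is not modelled).

```c
#include <stdint.h>

/* Illustrative copy of the bit used in the patch; not a kernel header. */
#define MODEL_EOFBLOCKS_FL 0x00200000u

/* Hypothetical stand-in for the last extent in the tree. */
struct last_extent_model {
	uint32_t first_block;	/* first logical block of the extent */
	uint16_t len;		/* actual length in blocks */
};

/*
 * Returns the updated flags: the EOF-blocks bit is dropped only when the
 * requested range [iblock, iblock + max_blocks) ends past the last extent.
 */
static uint32_t update_eofblocks(uint32_t i_flags,
				 const struct last_extent_model *last_ex,
				 uint32_t iblock, uint32_t max_blocks)
{
	uint64_t req_end, last_end;

	if (!(i_flags & MODEL_EOFBLOCKS_FL))
		return i_flags;

	/* 64-bit arithmetic avoids wrap-around near the 32-bit block limit */
	req_end = (uint64_t)iblock + max_blocks;
	last_end = (uint64_t)last_ex->first_block + last_ex->len;

	if (req_end > last_end)
		i_flags &= ~MODEL_EOFBLOCKS_FL;
	return i_flags;
}
```

An allocation that stays inside the already allocated range leaves the flag alone, which matches the earlier point that a stale set flag only costs an extra truncate.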

Another thing I am not sure about is whether we can allocate a non-data block,
like extended attributes, beyond the current max block without changing
i_size. In that case, clearing the EOFBLOCKS flag would be wrong.

>>  #define FS_FL_USER_VISIBLE           0x0003DFFF /* User visible flags */
>
> It probably isn't a bad idea to make this flag user-visible, since it
> would allow scanning for files that have excess space reserved (e.g.
> if the filesystem is getting full).  Making it user-settable (i.e.
> clearable) would essentially mean truncating the file to i_size without
> updating the timestamps so that the reserved space is discarded.  I
> don't think there is any value in allowing a user to turn this flag on
> for a file.

So to make it user-settable, we need to add handling in ext4_ioctl
that calls vmtruncate when the flag is to be cleared. But how can we get
the right size to truncate to in that case? Can we just set it to the
max initialized block shifted by the block size? But that may also truncate
the blocks reserved without the KEEP_SIZE flag.

Jiaying

>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
Jiaying Zhang Sept. 3, 2009, 5:32 a.m. UTC | #7
On Wed, Sep 2, 2009 at 10:20 PM, Jiaying Zhang<jiayingz@google.com> wrote:
> On Wed, Sep 2, 2009 at 1:41 AM, Andreas Dilger<adilger@sun.com> wrote:
>> On Aug 31, 2009  16:33 -0700, Jiaying Zhang wrote:
>>> > EXT4_KEEPSIZE_FL should only be cleared if there were writes to
>>> > the end of the fallocated space.  In that regard, I think the name
>>> > of this flag should be changed to something like "EXT4_EOFBLOCKS_FL"
>>> > to indicate that blocks are allocated beyond the end of file (i_size).
>>>
>>> Thanks for catching this! I changed the patch to only clear the flag
>>> when the new_size is larger than i_size and changed the flag name
>>> as you suggested. It would be nice if we only clear the flag when we
>>> write beyond the fallocated space, but this seems hard to detect
>>> because we no longer have the allocated size once that keepsize
>>> fallocate call returns.
>>
>> The problem is that if e2fsck depends on the EXT4_EOFBLOCKS_FL set
>> for fallocate-beyond-EOF then it is worse to clear it than to leave
>> it set.  At worst, leaving the flag set results in too many truncates
>> on the file.  Clearing the flag when not correct may result in user
>> visible data corruption if the file size is extended...
>>
>>> Here is the new patch:
>>>
>>> --- .pc/fallocate_keepsizse.patch/fs/ext4/extents.c   2009-08-31 12:08:10.000000000 -0700
>>> +++ fs/ext4/extents.c 2009-08-31 15:51:13.000000000 -0700
>>> @@ -3091,11 +3091,19 @@ static void ext4_falloc_update_inode(str
>>>        * the file size.
>>>        */
>>>       if (!(mode & FALLOC_FL_KEEP_SIZE)) {
>>> +             if (new_size > i_size_read(inode)) {
>>>                       i_size_write(inode, new_size);
>>> +                     inode->i_flags &= ~EXT4_EOFBLOCKS_FL;
>>
>> This again isn't quite correct, since the EOFBLOCKS_FL shouldn't
>> be cleared unless new_size is beyond the allocated size.  The
>> allocation code itself might be a better place to clear this,
>> since it knows whether there were new blocks being added beyond
>> the current max allocated block.
>
> We were thinking to clear this flag when we need to allocate new
> blocks, but I was not sure how to get the current max allocated
> block -- that is mostly because I just started working on the ext4
> code. After digging into the ext4 allocation code today, I think we
> can put the check&clear in ext4_ext_get_blocks:
>
> @@ -2968,6 +2968,14 @@ int ext4_ext_get_blocks(handle_t *handle
>        newex.ee_len = cpu_to_le16(ar.len);
>        if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
>                ext4_ext_mark_uninitialized(&newex);
> +
> +       if (unlikely(inode->i_flags & EXT4_EOFBLOCKS_FL)) {
> +               BUG_ON(!eh->eh_entries);
> +               last_ex = EXT_LAST_EXTENT(eh);
> +               if (iblock + max_blocks > le32_to_cpu(last_ex->ee_block)
> +                                       + ext4_ext_get_actual_len(last_ex))
> +                       inode->i_flags &= ~EXT4_EOFBLOCKS_FL;
> +       }
>        err = ext4_ext_insert_extent(handle, inode, path, &newex);
>        if (err) {
>                /* free data blocks we just allocated */
>
> Again, I just started looking at this part of code, so please let me know
> if I am in the right direction.
>
> Another thing I am not sure is whether we can allocate a non-data block,
> like extended attributes, beyond the current max block without changing
> the i_size. In that case, clearing the EOFBLOCKS flag will be wrong.
>
>>>  #define FS_FL_USER_VISIBLE           0x0003DFFF /* User visible flags */
>>
>> It probably isn't a bad idea to make this flag user-visible, since it
>> would allow scanning for files that have excess space reserved (e.g.
>> if the filesystem is getting full).  Making it user-settable (i.e.
>> clearable) would essentially mean truncating the file to i_size without
>> updating the timestamps so that the reserved space is discarded.  I
>> don't think there is any value in allowing a user to turn this flag on
>> for a file.
>
> So to make it user-settable, we need to add the handling in ext4_ioctl
> that calls vmtruncate when the flag to be cleared. But how can we get
> the right size to truncate in that case? Can we just set that to the
> max initialized block shift with block size? But that may also truncate
> the blocks reserved without the KEEP_SIZE flag.

Never mind, that is a stupid question. We can just truncate to the
current i_size.

Jiaying

>
> Jiaying
>
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>>
>>
>

Patch

--- .pc/fallocate_keepsizse.patch/fs/attr.c	2009-08-28 15:38:46.000000000 -0700
+++ fs/attr.c	2009-08-28 17:01:04.000000000 -0700
@@ -68,7 +68,8 @@  int inode_setattr(struct inode * inode,
 	unsigned int ia_valid = attr->ia_valid;

 	if (ia_valid & ATTR_SIZE &&
-	    (attr->ia_size != i_size_read(inode)) {
+	    (attr->ia_size != i_size_read(inode) ||
+	     (inode->i_flags & FS_KEEPSIZE_FL))) {
 		int error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
--- .pc/fallocate_keepsizse.patch/fs/ext4/extents.c	2009-08-28