diff mbox

ext4: use vmtruncate() instead of ext4_truncate() in ext4_setattr()

Message ID 20110517225926.8B4A94225B@ruihe.smo.corp.google.com
State Superseded, archived
Headers show

Commit Message

Jiaying Zhang May 17, 2011, 10:59 p.m. UTC
There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
flag and then ftruncate the file to a size larger than the file's i_size,
any allocated but unwritten blocks will be freed but the file size is set
to the size that ftruncate specifies.

Here is a simple test to reproduce the problem:
  1. fallocate a 12k size file with KEEP_SIZE flag
  2. write the first 4k
  3. ftruncate the file to 8k
Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
shows the file has only the first written block left.

Below is the proposed patch to fix the bug:

ext4: use vmtruncate() instead of ext4_truncate() in ext4_setattr().

Change ext4_setattr() to use vmtruncate(inode, attr->ia_size) instead
of ext4_truncate(inode) when it needs to truncate an inode so that
if the inode has EXT4_EOFBLOCKS_FL flag set and we are trying to truncate
to a size larger than the inode's i_size, we will only truncate the blocks
beyond the specified truncate size instead of all of blocks beyond i_size.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Yongqiang Yang May 18, 2011, 2:46 a.m. UTC | #1
On Wed, May 18, 2011 at 6:59 AM, Jiaying Zhang <jiayingz@google.com> wrote:
> There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
> intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
> flag and then ftruncate the file to a size larger than the file's i_size,
> any allocated but unwritten blocks will be freed but the file size is set
> to the size that ftruncate specifies.
>
> Here is a simple test to reproduce the problem:
>  1. fallocate a 12k size file with KEEP_SIZE flag
>  2. write the first 4k
>  3. ftruncate the file to 8k
> Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
> shows the file has only the first written block left.
>
> Below is the proposed patch to fix the bug:
>
> ext4: use vmtruncate() instead of ext4_truncate() in ext4_setattr().
>
> Change ext4_setattr() to use vmtruncate(inode, attr->ia_size) instead
> of ext4_truncate(inode) when it needs to truncate an inode so that
> if the inode has EXT4_EOFBLOCKS_FL flag set and we are trying to truncate
> to a size larger than the inode's i_size, we will only truncate the blocks
> beyond the specified truncate size instead of all of blocks beyond i_size.
>
> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3424e82..3bfad57 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5347,8 +5347,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>                        }
>                }
>                /* ext4_truncate will clear the flag */
> -               if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)))
> -                       ext4_truncate(inode);
> +               if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))) {
> +                       rc = vmtruncate(inode, attr->ia_size);
> +                       if (rc)
> +                               goto err_out;
> +               }
Hi there,
It seems that the error handling has problems.  1)We should handle
error as the below code does. or 2) we can add a OR condition to the
if statement below so that it can handle case here.  I prefer the 2nd
solution.  it looks like:
-                /* ext4_truncate will clear the flag */
-               if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)))
-                       ext4_truncate(inode);

-	if ((attr->ia_valid & ATTR_SIZE) &&
-	    attr->ia_size != i_size_read(inode))
+	if (((attr->ia_valid & ATTR_SIZE) &&
+	    attr->ia_size != i_size_read(inode)) ||
+          ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))
		rc = vmtruncate(inode, attr->ia_size);

Yongqiang.
>        }
>
>        if ((attr->ia_valid & ATTR_SIZE) &&


> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Yongqiang Yang May 18, 2011, 2:56 a.m. UTC | #2
On Wed, May 18, 2011 at 6:59 AM, Jiaying Zhang <jiayingz@google.com> wrote:
> There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
> intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
> flag and then ftruncate the file to a size larger than the file's i_size,
> any allocated but unwritten blocks will be freed but the file size is set
> to the size that ftruncate specifies.
>
> Here is a simple test to reproduce the problem:
>  1. fallocate a 12k size file with KEEP_SIZE flag
>  2. write the first 4k
>  3. ftruncate the file to 8k
> Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
> shows the file has only the first written block left.
>
> Below is the proposed patch to fix the bug:
>
> ext4: use vmtruncate() instead of ext4_truncate() in ext4_setattr().
>
> Change ext4_setattr() to use vmtruncate(inode, attr->ia_size) instead
> of ext4_truncate(inode) when it needs to truncate an inode so that
> if the inode has EXT4_EOFBLOCKS_FL flag set and we are trying to truncate
> to a size larger than the inode's i_size, we will only truncate the blocks
> beyond the specified truncate size instead of all of blocks beyond i_size.
>
> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3424e82..3bfad57 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5347,8 +5347,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>                        }
>                }
>                /* ext4_truncate will clear the flag */
> -               if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)))
> -                       ext4_truncate(inode);
> +               if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))) {
> +                       rc = vmtruncate(inode, attr->ia_size);
> +                       if (rc)
> +                               goto err_out;
> +               }
>        }
Another question, does vmtruncate() write zeros to blocks beyond old
EOF?  I had a rough look and could not find the code doing this, maybe
due to my density.

Yongqiang.
>
>        if ((attr->ia_valid & ATTR_SIZE) &&
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Eric Sandeen May 18, 2011, 3:19 a.m. UTC | #3
On 5/17/11 5:59 PM, Jiaying Zhang wrote:
> There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
> intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
> flag and then ftruncate the file to a size larger than the file's i_size,
> any allocated but unwritten blocks will be freed but the file size is set
> to the size that ftruncate specifies.
> 
> Here is a simple test to reproduce the problem:
>   1. fallocate a 12k size file with KEEP_SIZE flag
>   2. write the first 4k
>   3. ftruncate the file to 8k
> Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
> shows the file has only the first written block left.

To be honest I'm not 100% certain what the fiesystem -should- do in this case.

If I go through that same sequence on xfs, I get 4k written / 8k unwritten:

# xfs_bmap -vp testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
   0: [0..7]:          2648750760..2648750767  3 (356066400..356066407)     8 00000
   1: [8..23]:         2648750768..2648750783  3 (356066408..356066423)    16 10000

size 8k:
# ls -l testfile
-rw-r--r-- 1 root root 8192 May 17 22:33 testfile

and diskspace used 12k:
# du -hc testfile
12K	testfile
12K	total

I think this is a different result from ext4, either with or without your patch.

On ext4 I get size 8k, but only the first 4k mapped, as you say.

I don't recall when truncate is supposed to free fallocated blocks, and from what point?

-Eric

> Below is the proposed patch to fix the bug:
> 
> ext4: use vmtruncate() instead of ext4_truncate() in ext4_setattr().
> 
> Change ext4_setattr() to use vmtruncate(inode, attr->ia_size) instead
> of ext4_truncate(inode) when it needs to truncate an inode so that
> if the inode has EXT4_EOFBLOCKS_FL flag set and we are trying to truncate
> to a size larger than the inode's i_size, we will only truncate the blocks
> beyond the specified truncate size instead of all of blocks beyond i_size.
> 
> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3424e82..3bfad57 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5347,8 +5347,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>  			}
>  		}
>  		/* ext4_truncate will clear the flag */
> -		if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)))
> -			ext4_truncate(inode);
> +		if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))) {
> +			rc = vmtruncate(inode, attr->ia_size);
> +			if (rc)
> +				goto err_out;
> +		}
>  	}
>  
>  	if ((attr->ia_valid & ATTR_SIZE) &&
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yongqiang Yang May 18, 2011, 3:27 a.m. UTC | #4
On Wed, May 18, 2011 at 10:56 AM, Yongqiang Yang <xiaoqiangnk@gmail.com> wrote:
> On Wed, May 18, 2011 at 6:59 AM, Jiaying Zhang <jiayingz@google.com> wrote:
>> There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
>> intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
>> flag and then ftruncate the file to a size larger than the file's i_size,
>> any allocated but unwritten blocks will be freed but the file size is set
>> to the size that ftruncate specifies.
>>
>> Here is a simple test to reproduce the problem:
>>  1. fallocate a 12k size file with KEEP_SIZE flag
>>  2. write the first 4k
>>  3. ftruncate the file to 8k
>> Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
>> shows the file has only the first written block left.
>>
>> Below is the proposed patch to fix the bug:
>>
>> ext4: use vmtruncate() instead of ext4_truncate() in ext4_setattr().
>>
>> Change ext4_setattr() to use vmtruncate(inode, attr->ia_size) instead
>> of ext4_truncate(inode) when it needs to truncate an inode so that
>> if the inode has EXT4_EOFBLOCKS_FL flag set and we are trying to truncate
>> to a size larger than the inode's i_size, we will only truncate the blocks
>> beyond the specified truncate size instead of all of blocks beyond i_size.
>>
>> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 3424e82..3bfad57 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -5347,8 +5347,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>>                        }
>>                }
>>                /* ext4_truncate will clear the flag */
>> -               if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)))
>> -                       ext4_truncate(inode);
>> +               if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))) {
>> +                       rc = vmtruncate(inode, attr->ia_size);
>> +                       if (rc)
>> +                               goto err_out;
>> +               }
>>        }
> Another question, does vmtruncate() write zeros to blocks beyond old
> EOF?  I had a rough look and could not find the code doing this, maybe
> due to my density.
Sorry, ignore this, the extent is unwritten, no need to zero.
>
> Yongqiang.
>>
>>        if ((attr->ia_valid & ATTR_SIZE) &&
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
>
>
> --
> Best Wishes
> Yongqiang Yang
>
Andreas Dilger May 18, 2011, 5:35 a.m. UTC | #5
On May 17, 2011, at 21:19, Eric Sandeen wrote:
> On 5/17/11 5:59 PM, Jiaying Zhang wrote:
>> There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
>> intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
>> flag and then ftruncate the file to a size larger than the file's i_size,
>> any allocated but unwritten blocks will be freed but the file size is set
>> to the size that ftruncate specifies.
>> 
>> Here is a simple test to reproduce the problem:
>> 1. fallocate a 12k size file with KEEP_SIZE flag
>> 2. write the first 4k
>> 3. ftruncate the file to 8k
>> Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
>> shows the file has only the first written block left.
> 
> To be honest I'm not 100% certain what the fiesystem -should- do in this case.

I think it makes sense from a usage POV to discard the blocks after EOF when a truncate is being done.  For something like a PVR that is recording a show, but doesn't know the exact total size, it makes sense to fallocate() some estimated amount of space, and then when the show is finished recording it uses ftruncate() of the current size to drop the fallocated space.

> If I go through that same sequence on xfs, I get 4k written / 8k unwritten:
> 
> # xfs_bmap -vp testfile
> testfile:
> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
> 0: [0..7]:          2648750760..2648750767  3 (356066400..356066407)     8 00000
> 1: [8..23]:         2648750768..2648750783  3 (356066408..356066423)    16 10000
> 
> size 8k:
> # ls -l testfile
> -rw-r--r-- 1 root root 8192 May 17 22:33 testfile
> 
> and diskspace used 12k:
> # du -hc testfile
> 12K	testfile
> 12K	total
> 
> I think this is a different result from ext4, either with or without your patch.
> 
> On ext4 I get size 8k, but only the first 4k mapped, as you say.
> 
> I don't recall when truncate is supposed to free fallocated blocks, and from what point?
> 
> -Eric
> 
>> Below is the proposed patch to fix the bug:
>> 
>> ext4: use vmtruncate() instead of ext4_truncate() in ext4_setattr().
>> 
>> Change ext4_setattr() to use vmtruncate(inode, attr->ia_size) instead
>> of ext4_truncate(inode) when it needs to truncate an inode so that
>> if the inode has EXT4_EOFBLOCKS_FL flag set and we are trying to truncate
>> to a size larger than the inode's i_size, we will only truncate the blocks
>> beyond the specified truncate size instead of all of blocks beyond i_size.
>> 
>> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
>> 
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 3424e82..3bfad57 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -5347,8 +5347,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>> 			}
>> 		}
>> 		/* ext4_truncate will clear the flag */
>> -		if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)))
>> -			ext4_truncate(inode);
>> +		if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))) {
>> +			rc = vmtruncate(inode, attr->ia_size);
>> +			if (rc)
>> +				goto err_out;
>> +		}
>> 	}
>> 
>> 	if ((attr->ia_valid & ATTR_SIZE) &&
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner May 18, 2011, 6:13 a.m. UTC | #6
On Tue, May 17, 2011 at 10:19:05PM -0500, Eric Sandeen wrote:
> On 5/17/11 5:59 PM, Jiaying Zhang wrote:
> > There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
> > intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
> > flag and then ftruncate the file to a size larger than the file's i_size,
> > any allocated but unwritten blocks will be freed but the file size is set
> > to the size that ftruncate specifies.
> > 
> > Here is a simple test to reproduce the problem:
> >   1. fallocate a 12k size file with KEEP_SIZE flag
> >   2. write the first 4k
> >   3. ftruncate the file to 8k
> > Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
> > shows the file has only the first written block left.
> 
> To be honest I'm not 100% certain what the fiesystem -should- do in this case.
> 
> If I go through that same sequence on xfs, I get 4k written / 8k unwritten:
> 
> # xfs_bmap -vp testfile
> testfile:
>  EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
>    0: [0..7]:          2648750760..2648750767  3 (356066400..356066407)     8 00000
>    1: [8..23]:         2648750768..2648750783  3 (356066408..356066423)    16 10000

Ok, so that's the case for a _truncate up_ from 4k to 8k:

$ rm /mnt/test/foo
$ xfs_io -f -c "resvsp 0 12k" -c stat -c "bmap -vp" -c "pwrite 0 4k" -c "fsync" -c "bmap -vp" -c "t 8k" -c "bmap -vp" -c stat /mnt/test/foo
fd.path = "/mnt/test/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 71
stat.type = regular file
stat.size = 0
stat.blocks = 24
fsxattr.xflags = 0x2 [-p------------]
fsxattr.projid = 0
fsxattr.extsize = 0
fsxattr.nextents = 1
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
/mnt/test/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..23]:         9712..9735        0 (9712..9735)        24 10000
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0000 sec (156 MiB/sec and 40000.0000 ops/sec)
/mnt/test/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          9712..9719        0 (9712..9719)         8 00000
   1: [8..23]:         9720..9735        0 (9720..9735)        16 10000
/mnt/test/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          9712..9719        0 (9712..9719)         8 00000
   1: [8..23]:         9720..9735        0 (9720..9735)        16 10000
fd.path = "/mnt/test/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 71
stat.type = regular file
stat.size = 8192
stat.blocks = 24
fsxattr.xflags = 0x2 [-p------------]
fsxattr.projid = 0
fsxattr.extsize = 0
fsxattr.nextents = 2
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136

But you get a different result on truncate down:

$rm /mnt/test/foo
$ xfs_io -f -c "truncate 12k" -c "resvsp 0 12k" -c stat -c "bmap -vp" -c "pwrite 0 4k" -c "fsync" -c "bmap -vp" -c "t 8k" -c "bmap -vp" -c stat /mnt/test/foo
fd.path = "/mnt/test/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 71
stat.type = regular file
stat.size = 12288
stat.blocks = 24
fsxattr.xflags = 0x2 [-p------------]
fsxattr.projid = 0
fsxattr.extsize = 0
fsxattr.nextents = 1
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
/mnt/test/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..23]:         9584..9607        0 (9584..9607)        24 10000
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0000 sec (217.014 MiB/sec and 55555.5556 ops/sec)
/mnt/test/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          9584..9591        0 (9584..9591)         8 00000
   1: [8..23]:         9592..9607        0 (9592..9607)        16 10000
/mnt/test/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          9584..9591        0 (9584..9591)         8 00000
   1: [8..15]:         9592..9599        0 (9592..9599)         8 10000
fd.path = "/mnt/test/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 71
stat.type = regular file
stat.size = 8192
stat.blocks = 16
fsxattr.xflags = 0x2 [-p------------]
fsxattr.projid = 0
fsxattr.extsize = 0
fsxattr.nextents = 2
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136

IOWs, on XFS a truncate up does not change the preallocation at all,
while a truncate down will _always_ remove preallocation beyond the
new EOF.  It's always had this behaviour w.r.t. to truncate(2) and
preallocation beyond EOF.

> I think this is a different result from ext4, either with or without your patch.
> 
> On ext4 I get size 8k, but only the first 4k mapped, as you say.
> 
> I don't recall when truncate is supposed to free fallocated blocks, and from what point?

It's entirely up to the filesystem how it treats blocks beyond EOF
during truncation. XFS frees them on truncate down, because it is
much safer to just truncate away everything beyond the new EOF than
to leave written extents beyond EOF as potential landmines.

Indeed, that's why calling vmtruncate() as a bad fix. If you have:


	       NUUUUUUUUUUWWWWWWWWWOUUUUUUUUU
       ....----+----------+--------+--------+
               A	  B        C        D

Where	A = new EOF (N)
	A->B = unwritten (U)
	B->C = written (W)
	C = old EOF (O)
	C->D = unwritten (U)

Then just calling vmtruncate() will leave the blocks in the range
B->C as written blocks. Hence then doing an extending truncate back
out to D will expose stale data rather than zeros in the range
B->C....

Cheers,

Dave.
Eric Sandeen May 18, 2011, 2:05 p.m. UTC | #7
On 5/18/11 1:13 AM, Dave Chinner wrote:

> It's entirely up to the filesystem how it treats blocks beyond EOF
> during truncation. XFS frees them on truncate down, because it is
> much safer to just truncate away everything beyond the new EOF than
> to leave written extents beyond EOF as potential landmines.
> 
> Indeed, that's why calling vmtruncate() as a bad fix. If you have:
> 
> 
> 	       NUUUUUUUUUUWWWWWWWWWOUUUUUUUUU
>        ....----+----------+--------+--------+
>                A	  B        C        D
> 
> Where	A = new EOF (N)
> 	A->B = unwritten (U)
> 	B->C = written (W)
> 	C = old EOF (O)
> 	C->D = unwritten (U)
> 
> Then just calling vmtruncate() will leave the blocks in the range
> B->C as written blocks. Hence then doing an extending truncate back
> out to D will expose stale data rather than zeros in the range
> B->C....

Hm, running recent xfstests which includes fsx-with-fallocate should probably
eventually catch that then.

-Eric
 
> Cheers,
> 
> Dave.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang May 18, 2011, 8:29 p.m. UTC | #8
On Tue, May 17, 2011 at 8:19 PM, Eric Sandeen <sandeen@redhat.com> wrote:
>
> On 5/17/11 5:59 PM, Jiaying Zhang wrote:
> > There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
> > intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
> > flag and then ftruncate the file to a size larger than the file's i_size,
> > any allocated but unwritten blocks will be freed but the file size is set
> > to the size that ftruncate specifies.
> >
> > Here is a simple test to reproduce the problem:
> >   1. fallocate a 12k size file with KEEP_SIZE flag
> >   2. write the first 4k
> >   3. ftruncate the file to 8k
> > Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
> > shows the file has only the first written block left.
>
> To be honest I'm not 100% certain what the fiesystem -should- do in this case.
>
> If I go through that same sequence on xfs, I get 4k written / 8k unwritten:
>
> # xfs_bmap -vp testfile
> testfile:
>  EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
>   0: [0..7]:          2648750760..2648750767  3 (356066400..356066407)     8 00000
>   1: [8..23]:         2648750768..2648750783  3 (356066408..356066423)    16 10000
>
> size 8k:
> # ls -l testfile
> -rw-r--r-- 1 root root 8192 May 17 22:33 testfile
>
> and diskspace used 12k:
> # du -hc testfile
> 12K     testfile
> 12K     total
>
> I think this is a different result from ext4, either with or without your patch.
>
> On ext4 I get size 8k, but only the first 4k mapped, as you say.
I agree that truncating to a size larger than i_size is un-specified by
POSIX. However, I think the problem with the current behavior is that
we have an inconsistency between file's i_size and its extent tree.
Now we have 8k i_size but the file has only 4k space allocated. That
can confuse applications.

Jiaying
>
> I don't recall when truncate is supposed to free fallocated blocks, and from what point?
>
> -Eric
>
> > Below is the proposed patch to fix the bug:
> >
> > ext4: use vmtruncate() instead of ext4_truncate() in ext4_setattr().
> >
> > Change ext4_setattr() to use vmtruncate(inode, attr->ia_size) instead
> > of ext4_truncate(inode) when it needs to truncate an inode so that
> > if the inode has EXT4_EOFBLOCKS_FL flag set and we are trying to truncate
> > to a size larger than the inode's i_size, we will only truncate the blocks
> > beyond the specified truncate size instead of all of blocks beyond i_size.
> >
> > Signed-off-by: Jiaying Zhang <jiayingz@google.com>
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 3424e82..3bfad57 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -5347,8 +5347,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
> >                       }
> >               }
> >               /* ext4_truncate will clear the flag */
> > -             if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)))
> > -                     ext4_truncate(inode);
> > +             if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))) {
> > +                     rc = vmtruncate(inode, attr->ia_size);
> > +                     if (rc)
> > +                             goto err_out;
> > +             }
> >       }
> >
> >       if ((attr->ia_valid & ATTR_SIZE) &&
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen May 18, 2011, 8:31 p.m. UTC | #9
On 5/18/11 3:28 PM, Jiaying Zhang wrote:
> 
> 
> On Tue, May 17, 2011 at 8:19 PM, Eric Sandeen <sandeen@redhat.com <mailto:sandeen@redhat.com>> wrote:
> 
>     On 5/17/11 5:59 PM, Jiaying Zhang wrote:
>     > There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
>     > intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
>     > flag and then ftruncate the file to a size larger than the file's i_size,
>     > any allocated but unwritten blocks will be freed but the file size is set
>     > to the size that ftruncate specifies.
>     >
>     > Here is a simple test to reproduce the problem:
>     >   1. fallocate a 12k size file with KEEP_SIZE flag
>     >   2. write the first 4k
>     >   3. ftruncate the file to 8k
>     > Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
>     > shows the file has only the first written block left.
> 
>     To be honest I'm not 100% certain what the fiesystem -should- do in this case.
> 
>     If I go through that same sequence on xfs, I get 4k written / 8k unwritten:
> 
>     # xfs_bmap -vp testfile
>     testfile:
>      EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
>       0: [0..7]:          2648750760..2648750767  3 (356066400..356066407)     8 00000
>       1: [8..23]:         2648750768..2648750783  3 (356066408..356066423)    16 10000
> 
>     size 8k:
>     # ls -l testfile
>     -rw-r--r-- 1 root root 8192 May 17 22:33 testfile
> 
>     and diskspace used 12k:
>     # du -hc testfile
>     12K     testfile
>     12K     total
> 
>     I think this is a different result from ext4, either with or without your patch.
> 
>     On ext4 I get size 8k, but only the first 4k mapped, as you say.
> 
> I agree that truncating to a size larger than i_size is un-specified by
> POSIX. However, I think the problem with the current behavior is that
> we have an inconsistency between file's i_size and its extent tree.
> Now we have 8k i_size but the file has only 4k space allocated. That
> can confuse applications.

That's called "a sparse file" right?  Apps should not be confused by that ...

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang May 18, 2011, 8:32 p.m. UTC | #10
On Tue, May 17, 2011 at 10:35 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On May 17, 2011, at 21:19, Eric Sandeen wrote:
>> On 5/17/11 5:59 PM, Jiaying Zhang wrote:
>>> There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
>>> intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
>>> flag and then ftruncate the file to a size larger than the file's i_size,
>>> any allocated but unwritten blocks will be freed but the file size is set
>>> to the size that ftruncate specifies.
>>>
>>> Here is a simple test to reproduce the problem:
>>> 1. fallocate a 12k size file with KEEP_SIZE flag
>>> 2. write the first 4k
>>> 3. ftruncate the file to 8k
>>> Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
>>> shows the file has only the first written block left.
>>
>> To be honest I'm not 100% certain what the fiesystem -should- do in this case.
>
> I think it makes sense from a usage POV to discard the blocks after EOF when a truncate is being done.  For something like a PVR that is recording a show, but doesn't know the exact total size, it makes sense to fallocate() some estimated amount of space, and then when the show is finished recording it uses ftruncate() of the current size to drop the fallocated space.
>
Indeed we have applications that are doing exactly the same as you
described. They always fallocate files to a pre-defined size with
KEEP_SIZE flag and if they end up using less than the allocated size,
they ftruncate files to their written size later.

Jiaying

>> If I go through that same sequence on xfs, I get 4k written / 8k unwritten:
>>
>> # xfs_bmap -vp testfile
>> testfile:
>> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
>> 0: [0..7]:          2648750760..2648750767  3 (356066400..356066407)     8 00000
>> 1: [8..23]:         2648750768..2648750783  3 (356066408..356066423)    16 10000
>>
>> size 8k:
>> # ls -l testfile
>> -rw-r--r-- 1 root root 8192 May 17 22:33 testfile
>>
>> and diskspace used 12k:
>> # du -hc testfile
>> 12K   testfile
>> 12K   total
>>
>> I think this is a different result from ext4, either with or without your patch.
>>
>> On ext4 I get size 8k, but only the first 4k mapped, as you say.
>>
>> I don't recall when truncate is supposed to free fallocated blocks, and from what point?
>>
>> -Eric
>>
>>> Below is the proposed patch to fix the bug:
>>>
>>> ext4: use vmtruncate() instead of ext4_truncate() in ext4_setattr().
>>>
>>> Change ext4_setattr() to use vmtruncate(inode, attr->ia_size) instead
>>> of ext4_truncate(inode) when it needs to truncate an inode so that
>>> if the inode has EXT4_EOFBLOCKS_FL flag set and we are trying to truncate
>>> to a size larger than the inode's i_size, we will only truncate the blocks
>>> beyond the specified truncate size instead of all of blocks beyond i_size.
>>>
>>> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
>>>
>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>> index 3424e82..3bfad57 100644
>>> --- a/fs/ext4/inode.c
>>> +++ b/fs/ext4/inode.c
>>> @@ -5347,8 +5347,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>>>                      }
>>>              }
>>>              /* ext4_truncate will clear the flag */
>>> -            if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)))
>>> -                    ext4_truncate(inode);
>>> +            if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))) {
>>> +                    rc = vmtruncate(inode, attr->ia_size);
>>> +                    if (rc)
>>> +                            goto err_out;
>>> +            }
>>>      }
>>>
>>>      if ((attr->ia_valid & ATTR_SIZE) &&
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> Cheers, Andreas
>
>
>
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang May 18, 2011, 8:38 p.m. UTC | #11
On Wed, May 18, 2011 at 1:31 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> On 5/18/11 3:28 PM, Jiaying Zhang wrote:
>>
>>
>> On Tue, May 17, 2011 at 8:19 PM, Eric Sandeen <sandeen@redhat.com <mailto:sandeen@redhat.com>> wrote:
>>
>>     On 5/17/11 5:59 PM, Jiaying Zhang wrote:
>>     > There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
>>     > intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
>>     > flag and then ftruncate the file to a size larger than the file's i_size,
>>     > any allocated but unwritten blocks will be freed but the file size is set
>>     > to the size that ftruncate specifies.
>>     >
>>     > Here is a simple test to reproduce the problem:
>>     >   1. fallocate a 12k size file with KEEP_SIZE flag
>>     >   2. write the first 4k
>>     >   3. ftruncate the file to 8k
>>     > Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
>>     > shows the file has only the first written block left.
>>
>>     To be honest I'm not 100% certain what the fiesystem -should- do in this case.
>>
>>     If I go through that same sequence on xfs, I get 4k written / 8k unwritten:
>>
>>     # xfs_bmap -vp testfile
>>     testfile:
>>      EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
>>       0: [0..7]:          2648750760..2648750767  3 (356066400..356066407)     8 00000
>>       1: [8..23]:         2648750768..2648750783  3 (356066408..356066423)    16 10000
>>
>>     size 8k:
>>     # ls -l testfile
>>     -rw-r--r-- 1 root root 8192 May 17 22:33 testfile
>>
>>     and diskspace used 12k:
>>     # du -hc testfile
>>     12K     testfile
>>     12K     total
>>
>>     I think this is a different result from ext4, either with or without your patch.
>>
>>     On ext4 I get size 8k, but only the first 4k mapped, as you say.
>>
>> I agree that truncating to a size larger than i_size is un-specified by
>> POSIX. However, I think the problem with the current behavior is that
>> we have an inconsistency between file's i_size and its extent tree.
>> Now we have 8k i_size but the file has only 4k space allocated. That
>> can confuse applications.
>
> That's called "a sparse file" right?  Apps should not be confused by that ...
But applications don't intend to create a sparse file in this case. They may
create a lot of such files in this way with 'df' showing there is
still plenty of
free space available but as they start writing to these files they will fill up
space and get ENOSPC unexpectedly.

Jiaying
>
> -Eric
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang May 18, 2011, 8:42 p.m. UTC | #12
On Tue, May 17, 2011 at 11:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, May 17, 2011 at 10:19:05PM -0500, Eric Sandeen wrote:
>> On 5/17/11 5:59 PM, Jiaying Zhang wrote:
>> > There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
>> > intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
>> > flag and then ftruncate the file to a size larger than the file's i_size,
>> > any allocated but unwritten blocks will be freed but the file size is set
>> > to the size that ftruncate specifies.
>> >
>> > Here is a simple test to reproduce the problem:
>> >   1. fallocate a 12k size file with KEEP_SIZE flag
>> >   2. write the first 4k
>> >   3. ftruncate the file to 8k
>> > Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
>> > shows the file has only the first written block left.
>>
>> To be honest I'm not 100% certain what the fiesystem -should- do in this case.
>>
>> If I go through that same sequence on xfs, I get 4k written / 8k unwritten:
>>
>> # xfs_bmap -vp testfile
>> testfile:
>>  EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
>>    0: [0..7]:          2648750760..2648750767  3 (356066400..356066407)     8 00000
>>    1: [8..23]:         2648750768..2648750783  3 (356066408..356066423)    16 10000
>
> Ok, so that's the case for a _truncate up_ from 4k to 8k:
>
> $ rm /mnt/test/foo
> $ xfs_io -f -c "resvsp 0 12k" -c stat -c "bmap -vp" -c "pwrite 0 4k" -c "fsync" -c "bmap -vp" -c "t 8k" -c "bmap -vp" -c stat /mnt/test/foo
> fd.path = "/mnt/test/foo"
> fd.flags = non-sync,non-direct,read-write
> stat.ino = 71
> stat.type = regular file
> stat.size = 0
> stat.blocks = 24
> fsxattr.xflags = 0x2 [-p------------]
> fsxattr.projid = 0
> fsxattr.extsize = 0
> fsxattr.nextents = 1
> fsxattr.naextents = 0
> dioattr.mem = 0x200
> dioattr.miniosz = 512
> dioattr.maxiosz = 2147483136
> /mnt/test/foo:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..23]:         9712..9735        0 (9712..9735)        24 10000
> wrote 4096/4096 bytes at offset 0
> 4 KiB, 1 ops; 0.0000 sec (156 MiB/sec and 40000.0000 ops/sec)
> /mnt/test/foo:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..7]:          9712..9719        0 (9712..9719)         8 00000
>   1: [8..23]:         9720..9735        0 (9720..9735)        16 10000
> /mnt/test/foo:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..7]:          9712..9719        0 (9712..9719)         8 00000
>   1: [8..23]:         9720..9735        0 (9720..9735)        16 10000
> fd.path = "/mnt/test/foo"
> fd.flags = non-sync,non-direct,read-write
> stat.ino = 71
> stat.type = regular file
> stat.size = 8192
> stat.blocks = 24
> fsxattr.xflags = 0x2 [-p------------]
> fsxattr.projid = 0
> fsxattr.extsize = 0
> fsxattr.nextents = 2
> fsxattr.naextents = 0
> dioattr.mem = 0x200
> dioattr.miniosz = 512
> dioattr.maxiosz = 2147483136
>
> But you get a different result on truncate down:
>
> $rm /mnt/test/foo
> $ xfs_io -f -c "truncate 12k" -c "resvsp 0 12k" -c stat -c "bmap -vp" -c "pwrite 0 4k" -c "fsync" -c "bmap -vp" -c "t 8k" -c "bmap -vp" -c stat /mnt/test/foo
> fd.path = "/mnt/test/foo"
> fd.flags = non-sync,non-direct,read-write
> stat.ino = 71
> stat.type = regular file
> stat.size = 12288
> stat.blocks = 24
> fsxattr.xflags = 0x2 [-p------------]
> fsxattr.projid = 0
> fsxattr.extsize = 0
> fsxattr.nextents = 1
> fsxattr.naextents = 0
> dioattr.mem = 0x200
> dioattr.miniosz = 512
> dioattr.maxiosz = 2147483136
> /mnt/test/foo:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..23]:         9584..9607        0 (9584..9607)        24 10000
> wrote 4096/4096 bytes at offset 0
> 4 KiB, 1 ops; 0.0000 sec (217.014 MiB/sec and 55555.5556 ops/sec)
> /mnt/test/foo:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..7]:          9584..9591        0 (9584..9591)         8 00000
>   1: [8..23]:         9592..9607        0 (9592..9607)        16 10000
> /mnt/test/foo:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..7]:          9584..9591        0 (9584..9591)         8 00000
>   1: [8..15]:         9592..9599        0 (9592..9599)         8 10000
> fd.path = "/mnt/test/foo"
> fd.flags = non-sync,non-direct,read-write
> stat.ino = 71
> stat.type = regular file
> stat.size = 8192
> stat.blocks = 16
> fsxattr.xflags = 0x2 [-p------------]
> fsxattr.projid = 0
> fsxattr.extsize = 0
> fsxattr.nextents = 2
> fsxattr.naextents = 0
> dioattr.mem = 0x200
> dioattr.miniosz = 512
> dioattr.maxiosz = 2147483136
>
> IOWs, on XFS a truncate up does not change the preallocation at all,
> while a truncate down will _always_ remove preallocation beyond the
> new EOF.  It's always had this behaviour w.r.t. to truncate(2) and
> preallocation beyond EOF.
>
>> I think this is a different result from ext4, either with or without your patch.
>>
>> On ext4 I get size 8k, but only the first 4k mapped, as you say.
>>
>> I don't recall when truncate is supposed to free fallocated blocks, and from what point?
>
> It's entirely up to the filesystem how it treats blocks beyond EOF
> during truncation. XFS frees them on truncate down, because it is
> much safer to just truncate away everything beyond the new EOF than
> to leave written extents beyond EOF as potential landmines.
>
> Indeed, that's why calling vmtruncate() as a bad fix. If you have:
>
>
>               NUUUUUUUUUUWWWWWWWWWOUUUUUUUUU
>       ....----+----------+--------+--------+
>               A          B        C        D
>
> Where   A = new EOF (N)
>        A->B = unwritten (U)
>        B->C = written (W)
>        C = old EOF (O)
>        C->D = unwritten (U)
>
> Then just calling vmtruncate() will leave the blocks in the range
> B->C as written blocks. Hence then doing an extending truncate back
> out to D will expose stale data rather than zeros in the range
> B->C....
Sorry I am a little confused. If I understand correctly, in the situation
you described, we call a truncate that causes EOF to change from
C to A. On ext4, we should free all of blocks after A. And when we
do an extending truncate to D, any blocks beyond A should be treated
as unwritten blocks so we should not expose any stale data, right?

Jiaying
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger May 18, 2011, 8:45 p.m. UTC | #13
On May 18, 2011, at 14:32, Jiaying Zhang wrote:
> On Tue, May 17, 2011 at 10:35 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>> On May 17, 2011, at 21:19, Eric Sandeen wrote:
>>> On 5/17/11 5:59 PM, Jiaying Zhang wrote:
>>>> There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
>>>> intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
>>>> flag and then ftruncate the file to a size larger than the file's i_size,
>>>> any allocated but unwritten blocks will be freed but the file size is set
>>>> to the size that ftruncate specifies.
>>>> 
>>>> Here is a simple test to reproduce the problem:
>>>> 1. fallocate a 12k size file with KEEP_SIZE flag
>>>> 2. write the first 4k
>>>> 3. ftruncate the file to 8k
>>>> Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
>>>> shows the file has only the first written block left.
>>> 
>>> To be honest I'm not 100% certain what the fiesystem -should- do in this case.
>> 
>> I think it makes sense from a usage POV to discard the blocks after EOF when a truncate is being done.  For something like a PVR that is recording a show, but doesn't know the exact total size, it makes sense to fallocate() some estimated amount of space, and then when the show is finished recording it uses ftruncate() of the current size to drop the fallocated space.
>> 
> 
> Indeed we have applications that are doing exactly the same as you
> described. They always fallocate files to a pre-defined size with
> KEEP_SIZE flag and if they end up using less than the allocated size,
> they ftruncate files to their written size later.

If XFS is already handling truncate-up and truncate-down differently, I don't mind keeping consistent behaviour with XFS.  I had thought about this also, that truncate-up isn't intending to throw away space while truncate-down is.  However, I didn't mention it in my email because I thought the semantics of having different behaviour for truncate-up vs. truncate-down was confusing.

If XFS is already doing this, then it seems that this is at least somewhat expected by applications and/or is more efficient in the long run for the on-disk allocation.

>>> If I go through that same sequence on xfs, I get 4k written / 8k unwritten:
>>> 
>>> # xfs_bmap -vp testfile
>>> testfile:
>>> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
>>> 0: [0..7]:          2648750760..2648750767  3 (356066400..356066407)     8 00000
>>> 1: [8..23]:         2648750768..2648750783  3 (356066408..356066423)    16 10000
>>> 
>>> size 8k:
>>> # ls -l testfile
>>> -rw-r--r-- 1 root root 8192 May 17 22:33 testfile
>>> 
>>> and diskspace used 12k:
>>> # du -hc testfile
>>> 12K   testfile
>>> 12K   total
>>> 
>>> I think this is a different result from ext4, either with or without your patch.
>>> 
>>> On ext4 I get size 8k, but only the first 4k mapped, as you say.
>>> 
>>> I don't recall when truncate is supposed to free fallocated blocks, and from what point?
>>> 
>>> -Eric
>>> 
>>>> Below is the proposed patch to fix the bug:
>>>> 
>>>> ext4: use vmtruncate() instead of ext4_truncate() in ext4_setattr().
>>>> 
>>>> Change ext4_setattr() to use vmtruncate(inode, attr->ia_size) instead
>>>> of ext4_truncate(inode) when it needs to truncate an inode so that
>>>> if the inode has EXT4_EOFBLOCKS_FL flag set and we are trying to truncate
>>>> to a size larger than the inode's i_size, we will only truncate the blocks
>>>> beyond the specified truncate size instead of all of blocks beyond i_size.
>>>> 
>>>> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
>>>> 
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 3424e82..3bfad57 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -5347,8 +5347,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>>>>                      }
>>>>              }
>>>>              /* ext4_truncate will clear the flag */
>>>> -            if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)))
>>>> -                    ext4_truncate(inode);
>>>> +            if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))) {
>>>> +                    rc = vmtruncate(inode, attr->ia_size);
>>>> +                    if (rc)
>>>> +                            goto err_out;
>>>> +            }
>>>>      }
>>>> 
>>>>      if ((attr->ia_valid & ATTR_SIZE) &&
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 
>> Cheers, Andreas
>> 
>> 
>> 
>> 
>> 
>> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang May 18, 2011, 8:57 p.m. UTC | #14
On Wed, May 18, 2011 at 1:45 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On May 18, 2011, at 14:32, Jiaying Zhang wrote:
>> On Tue, May 17, 2011 at 10:35 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>>> On May 17, 2011, at 21:19, Eric Sandeen wrote:
>>>> On 5/17/11 5:59 PM, Jiaying Zhang wrote:
>>>>> There is a bug in commit c8d46e41 "ext4: Add flag to files with blocks
>>>>> intentionally past EOF" that if we fallocate a file with FALLOC_FL_KEEP_SIZE
>>>>> flag and then ftruncate the file to a size larger than the file's i_size,
>>>>> any allocated but unwritten blocks will be freed but the file size is set
>>>>> to the size that ftruncate specifies.
>>>>>
>>>>> Here is a simple test to reproduce the problem:
>>>>> 1. fallocate a 12k size file with KEEP_SIZE flag
>>>>> 2. write the first 4k
>>>>> 3. ftruncate the file to 8k
>>>>> Then 'ls -l' shows that the i_size of the file becomes 8k but debugfs
>>>>> shows the file has only the first written block left.
>>>>
>>>> To be honest I'm not 100% certain what the fiesystem -should- do in this case.
>>>
>>> I think it makes sense from a usage POV to discard the blocks after EOF when a truncate is being done.  For something like a PVR that is recording a show, but doesn't know the exact total size, it makes sense to fallocate() some estimated amount of space, and then when the show is finished recording it uses ftruncate() of the current size to drop the fallocated space.
>>>
>>
>> Indeed we have applications that are doing exactly the same as you
>> described. They always fallocate files to a pre-defined size with
>> KEEP_SIZE flag and if they end up using less than the allocated size,
>> they ftruncate files to their written size later.
>
> If XFS is already handling truncate-up and truncate-down differently, I don't mind keeping consistent behaviour with XFS.  I had thought about this also, that truncate-up isn't intending to throw away space while truncate-down is.  However, I didn't mention it in my email because I thought the semantics of having different behaviour for truncate-up vs. truncate-down was confusing.
>
> If XFS is already doing this, then it seems that this is at least somewhat expected by applications and/or is more efficient in the long run for the on-disk allocation.
With my patch, it is still a little different from what xfs does in
truncating-up case. As Eric mentioned, when an application fallocates
12k, writes 4k, and then truncates to 8k, on xfs there will be 12k
allocated blocks left, but on ext4 with my patch there will be 8k
allocated blocks left. The reason I think we may want to free any
blocks beyond the truncate size is because there may be situations
that applications are running out of space and want to shrink
fallocated files to a smaller size that is still larger than the
current i_size.

Jiaying
>
>>>> If I go through that same sequence on xfs, I get 4k written / 8k unwritten:
>>>>
>>>> # xfs_bmap -vp testfile
>>>> testfile:
>>>> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
>>>> 0: [0..7]:          2648750760..2648750767  3 (356066400..356066407)     8 00000
>>>> 1: [8..23]:         2648750768..2648750783  3 (356066408..356066423)    16 10000
>>>>
>>>> size 8k:
>>>> # ls -l testfile
>>>> -rw-r--r-- 1 root root 8192 May 17 22:33 testfile
>>>>
>>>> and diskspace used 12k:
>>>> # du -hc testfile
>>>> 12K   testfile
>>>> 12K   total
>>>>
>>>> I think this is a different result from ext4, either with or without your patch.
>>>>
>>>> On ext4 I get size 8k, but only the first 4k mapped, as you say.
>>>>
>>>> I don't recall when truncate is supposed to free fallocated blocks, and from what point?
>>>>
>>>> -Eric
>>>>
>>>>> Below is the proposed patch to fix the bug:
>>>>>
>>>>> ext4: use vmtruncate() instead of ext4_truncate() in ext4_setattr().
>>>>>
>>>>> Change ext4_setattr() to use vmtruncate(inode, attr->ia_size) instead
>>>>> of ext4_truncate(inode) when it needs to truncate an inode so that
>>>>> if the inode has EXT4_EOFBLOCKS_FL flag set and we are trying to truncate
>>>>> to a size larger than the inode's i_size, we will only truncate the blocks
>>>>> beyond the specified truncate size instead of all of blocks beyond i_size.
>>>>>
>>>>> Signed-off-by: Jiaying Zhang <jiayingz@google.com>
>>>>>
>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>> index 3424e82..3bfad57 100644
>>>>> --- a/fs/ext4/inode.c
>>>>> +++ b/fs/ext4/inode.c
>>>>> @@ -5347,8 +5347,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>>>>>                      }
>>>>>              }
>>>>>              /* ext4_truncate will clear the flag */
>>>>> -            if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)))
>>>>> -                    ext4_truncate(inode);
>>>>> +            if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))) {
>>>>> +                    rc = vmtruncate(inode, attr->ia_size);
>>>>> +                    if (rc)
>>>>> +                            goto err_out;
>>>>> +            }
>>>>>      }
>>>>>
>>>>>      if ((attr->ia_valid & ATTR_SIZE) &&
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>> Cheers, Andreas
>>>
>>>
>>>
>>>
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> Cheers, Andreas
>
>
>
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3424e82..3bfad57 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5347,8 +5347,11 @@  int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 			}
 		}
 		/* ext4_truncate will clear the flag */
-		if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)))
-			ext4_truncate(inode);
+		if ((ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS))) {
+			rc = vmtruncate(inode, attr->ia_size);
+			if (rc)
+				goto err_out;
+		}
 	}
 
 	if ((attr->ia_valid & ATTR_SIZE) &&