diff mbox

ext4 mballoc: fix tail allocation

Message ID 1428348701-47574-1-git-send-email-jhe@cs.wisc.edu
State Accepted, archived
Headers show

Commit Message

Jun He April 6, 2015, 7:31 p.m. UTC
This patch addresses the tail allocation problem found at paper "Reducing File System Tail Latencies with Chopper". https://www.usenix.org/system/files/conference/fast15/fast15-paper-he.pdf . The paper refers the tail allocation problem as the Special End problem.

Here is a description of the problem:

A tail extent is the last extent of a file. The last block of the tail extent corresponds to the last logical block of the file. When a file is closed and the tail extent is being allocated, ext4 marks the allocation request with the hint EXT4_MB_HINT_NOPREALLOC. The intention is to avoid preallocation when we know the file is final (it won't change until the next open.). But the implementation leads to some problems. The following program attacks the problem.

/****************************************/
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;
    int size, off;
    char buf[4096];

    size = 4096;

    fd = open(argv[1], O_WRONLY|O_CREAT);
    if ( fd == -1 ) {
        perror("opening file");
        exit(1);
    }
    
    off = 0;
    pwrite(fd, buf, size, off);
    printf("wrote at %d, size %d bytes\n", off, size);
    fsync(fd);

    off = off + size;
    pwrite(fd, buf, size, off);
    printf("wrote at %d, size %d bytes\n", off, size);

    close(fd);
    sync();

}
/****************************************/

Mount an ext4 on /mnt/ext4onloop and run the program by 
$ ./a.out /mnt/ext4onloop/testfile

Check the extents by
$ filefrag -sv /mnt/ext4onloop/testfile
Filesystem type is: ef53
File size of /mnt/ext4onloop/testfile is 8192 (2 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0    33280               1 
   1       1    33025    33281      1 eof
/mnt/ext4onloop/testfile: 2 extents found

The first 4KB of the file is forced to disk by fsync() and allocated from a locality group preallocation. The second 4KB of the file is forced to disk by sync() (If you don't call sync(), the effect would be the same as the write back thread will eventually kick it and force the data to disk.). Since the second 4KB is allocated when the file is closed and the extent is the tail extent, it is allocated with the hint EXT4_MB_HINT_NOPREALLOC. This hint prevents the second 4KB from being allocated from locality group preallocation. So the second 4KB is not allocated next to the first 4KB.  

The program above demonstrates the tail allocation problem. The Chopper paper demonstrates that the problem could drag extents of the same file very far (GBs) away from each other. 

The EXT4_MB_HINT_NOPREALLOC hint currently means: 
1. if there is an i-node preallocation for this file, use it; 
2. do not use LG (locality group) preallocations, even they are available (I think this is not intended); 
3. do not create any new preallocations.

EXT4_MB_HINT_NOPREALLOC is not the right hint for tail allocation. What we really want is:
1. if file is large and there is an i-node preallocation for this file, use it; 
2. if file is large and there is no i-node preallocation for this file, do not create new inode preallocation. Simply allocate as needed.
3. if file is small and there is LG preallocations, use it. 
4. if file is small and there is no LG preallocations, create a LG preallocation and allocate the tail extent from it. LG preallocations are always useful because small files almost always come. In the case that a write back thread writes a lot of new small files, this design will group the small files together and reduce seeks. Also, allocating many small files from LG preallocation space should be faster than allocating them from buddy cache separately. 

To get the correct behaviors:

EXT4_MB_HINT_NOPREALLOC is not the right hint for special end. If we remove current check for tail in ext4_mb_group_or_file() , 1, 3, 4 will be satisfied. Now, we only need to handle the tail of large file (2) by checking tail when normalizing. If it is a tail of large file, we simply do not normalize it. 

The patch with Linux 4.0rc6 has been tested with kvm-xfstests. generic/256 failed. But I wouldn't worry about it because the vanilla kernel also failed...


Signed-off-by: Jun He <jhe@cs.wisc.edu>

---
 fs/ext4/mballoc.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

Comments

Theodore Ts'o April 8, 2015, 2:44 a.m. UTC | #1
On Mon, Apr 06, 2015 at 01:31:41PM -0600, Jun He wrote:
> This patch addresses the tail allocation problem found at paper "Reducing File System Tail Latencies with Chopper". https://www.usenix.org/system/files/conference/fast15/fast15-paper-he.pdf . The paper refers the tail allocation problem as the Special End problem.

Hi Jun He,

Kernel patches should have the commit body line-wrapped at 72 columns
(more or less).  The idea is that the commit should be easily readable
on an 80 column screen, with some extra space left for identation (git
log will ident the commit body by 4 characters).

> A tail extent is the last extent of a file. The last block of the
> tail extent corresponds to the last logical block of the file. When
> a file is closed and the tail extent is being allocated, ext4 marks
> the allocation request with the hint EXT4_MB_HINT_NOPREALLOC. The
> intention is to avoid preallocation when we know the file is final
> (it won't change until the next open.). But the implementation leads
> to some problems. The following program attacks the problem.

I think you mean "the following program _demonstrates_ the problem".


> EXT4_MB_HINT_NOPREALLOC is not the right hint for special end. If we
> remove current check for tail in ext4_mb_group_or_file() , 1, 3, 4
> will be satisfied. Now, we only need to handle the tail of large
> file (2) by checking tail when normalizing. If it is a tail of large
> file, we simply do not normalize it.

The problem with skipping the normalization is this also by passes the
goal block optimizations based on pleft, pright, lleft, and lright.
In the case of a small file, it's not always true that you want to
allocate from a LG preallocation.  If there is free space immediately
adjacent to file, we want to use that in preference to the LG
preallocation --- since that would result in a contiguous file.

> +
> +	/* don't normalize tail of large files */
> +	if ((size == isize) &&
> +	    !ext4_fs_is_busy(sbi) &&
> +	    (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
> +		return;
> +	}

The comment is a bit misleading; it's not checking for large files at
all.  It also looks like that you copied the ext4_fs_is_busy(sbi) from
the original check.

Perhaps it would be helpful to explain the design goal of the original
code.  As originally conceived, the problem we were trying to fix was
that in the case of unpacking a tar file with a large number of small
files, we wanted to make sure all of the files would be packed
togehter, without extra space reserved (caused by the normalization).
This was achieved by disabing the use of the locality group.  The
problem is that if there are a large number of processes writing into
the same block group, this would cause contention, and so in the case
where we had contention, we would use the locality group (since the
whole point was to eliminate the need to constantly take the block
group lock while allocaitng out of the block group; that's why the
locality groups are per-cpu).  That's what the ext4_fs_is_busy() is
all about.  We try to measure contention and disable the optimization
if that was the case.

I suspect we may want to take a step back and what's the best way of
handling all of the high level goals, and refomulate a better strategy:

1) If the file doesn't have any blocks allocated yet, and it is closed
by the time we need to allocate all of its blocks (thanks to delayed
allocation), we should allocate exactly the number of blocks needed,
and try to avoid fragmenting free space.

2) In the case where we are doing any tails at all, it's a different
optimization altogether, and the main goal should be to make sure the
tail is connected the previous blocks if at all possible.  To the
extent that we have per-CPU LG groups, this can be... problematic.

3) The primary goal of the per-group locality group is to avoid lock
contention when multiple CPU's are trying to access the blocks at the
same time.  Depending on why we are doing block allocation, this may
not be an issue.  We should probably into account why it is we re
doing block allocation at particular moment:

   *) In the case of delayed allocation w/o fsync(2), the writeback
      will be taking place from the writeback daemon.  In that case,
      all of allocations will be issued from a writeback thread, so
      contention is much less of an issue.  In addition, in the case
      where the entire file has been written within 30 seconds, we
      will be allocating the entire file at once, as described in (1).
      This is a fairly common case, so we should optimize for it.

   *) In the case of block allocations resulting from an fallocate(2),
      we should assume userspace knew what it was doing, and we
      shouldn't try to do any kind of size normalization.  Also,
      fallocate() calls generally don't happen with high frequency, so
      worrying about CPU scalability is probalby not a major issue.

   *) In the case of nodelalloc (or ext3 compatibility mode).  In that
      case, we will be doing lots of singleton allocation, and it's in
      this case where normalization and preallocation is most important.

   *) The Lustre file calls ext4_map_blocks() directly.  We'll need to
      check with Andreas, but the user is writing a lot of small files
      over Lustre, it may be that lock contention will be an issue,
      and we might want to use LG preallocation groups.

   *) Writes caused by fsync(2).  (For example, from SQLite files.)
      In this case, lock contention might very well be an issue, and
      use of the LG preallocation group some kind of file
      normalization may very well be a good idea.  In fact, we might
      want to have a hueristic which notices files which are written
      using either a slow append or an append followed by an fsync
      workload, and make sure we use an inode perallocation group, and
      in the case of a slow append workload, we might want to hold the
      preallocation over a file close (with some way of releasing
      blocks in an ENOSPC situation).

Some of the mballoc failure modes which you found in the Chopper paper
was specifically we weren't taking these various different scenarios
into account, and so we used an optimization that worked well when we
were allocating the whole file at a time (thanks to delalloc and
writebacks), and tht hueristic failed miserably in the case of large
files where we happened to be writing a tail.

So before we start playing with changing hueristics, it would be a
good idea if we make sure the block allocator has enough context so it
understands how it is being called and by whom, and make sure we that
we don't use the wrong hueristic by accident.

Does this make sense?

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jun He April 8, 2015, 4:44 p.m. UTC | #2
Resending in plain text mode..

On Wed, Apr 8, 2015 at 11:24 AM, Jun He <jhe@cs.wisc.edu> wrote:
>
> Hi Ted,
>
> On Tue, Apr 7, 2015 at 9:44 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>>
>> On Mon, Apr 06, 2015 at 01:31:41PM -0600, Jun He wrote:
>> > This patch addresses the tail allocation problem found at paper "Reducing File System Tail Latencies with Chopper". https://www.usenix.org/system/files/conference/fast15/fast15-paper-he.pdf . The paper refers the tail allocation problem as the Special End problem.
>>
>> Hi Jun He,
>>
>> Kernel patches should have the commit body line-wrapped at 72 columns
>> (more or less).  The idea is that the commit should be easily readable
>> on an 80 column screen, with some extra space left for identation (git
>> log will ident the commit body by 4 characters).
>>
>> > A tail extent is the last extent of a file. The last block of the
>> > tail extent corresponds to the last logical block of the file. When
>> > a file is closed and the tail extent is being allocated, ext4 marks
>> > the allocation request with the hint EXT4_MB_HINT_NOPREALLOC. The
>> > intention is to avoid preallocation when we know the file is final
>> > (it won't change until the next open.). But the implementation leads
>> > to some problems. The following program attacks the problem.
>>
>> I think you mean "the following program _demonstrates_ the problem".
>>
>>
>> > EXT4_MB_HINT_NOPREALLOC is not the right hint for special end. If we
>> > remove current check for tail in ext4_mb_group_or_file() , 1, 3, 4
>> > will be satisfied. Now, we only need to handle the tail of large
>> > file (2) by checking tail when normalizing. If it is a tail of large
>> > file, we simply do not normalize it.
>>
>> The problem with skipping the normalization is this also by passes the
>> goal block optimizations based on pleft, pright, lleft, and lright.
>
>
> Actually bypassing this optimization is OK because similar goal has been set by ext4_ext_find_goal() before. The goal optimization in ext4_mb_normalize_request() is a re-alignment for normalized requests. If ext4 doesn't normalize the request, it won't need to re-align.
>
>>
>> In the case of a small file, it's not always true that you want to
>> allocate from a LG preallocation.  If there is free space immediately
>> adjacent to file, we want to use that in preference to the LG
>> preallocation --- since that would result in a contiguous file.
>
>
> In the current code, the space that is immediately after a small file cannot be used for the same file because EXT4_MB_HINT_NOPREALLOC prevents the tail extent from using it (this is the Special End issues in the Chopper paper). The hint actually makes the file discontiguous. The attached program shows the problem. The patch fixes the problem.
>
>>
>> > +
>> > +     /* don't normalize tail of large files */
>> > +     if ((size == isize) &&
>> > +         !ext4_fs_is_busy(sbi) &&
>> > +         (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
>> > +             return;
>> > +     }
>>
>> The comment is a bit misleading; it's not checking for large files at
>> all.  It also looks like that you copied the ext4_fs_is_busy(sbi) from
>> the original check.
>
>
> If the execution reaches here, the file must be large -- if it's small, it would have gone out of the ext4_mb_normalize_request() function. It will be clear if the comment states it though.
>
> I did just copy ext4_fs_is_busy()..
>
>>
>>
>> Perhaps it would be helpful to explain the design goal of the original
>> code.  As originally conceived, the problem we were trying to fix was
>> that in the case of unpacking a tar file with a large number of small
>> files, we wanted to make sure all of the files would be packed
>> togehter, without extra space reserved (caused by the normalization).
>> This was achieved by disabing the use of the locality group.  The
>> problem is that if there are a large number of processes writing into
>> the same block group, this would cause contention, and so in the case
>> where we had contention, we would use the locality group (since the
>> whole point was to eliminate the need to constantly take the block
>> group lock while allocaitng out of the block group; that's why the
>> locality groups are per-cpu).  That's what the ext4_fs_is_busy() is
>> all about.  We try to measure contention and disable the optimization
>> if that was the case.
>
>
> Now I understand ext4_fs_is_busy() is used to detect contention. If there is contention, ext4 uses preallocation to avoid further contentions. If not, ext4 allocates without preallocation to pack files together. But why not just use LG preallocation? It can do both: packing small files together and avoiding contention. The space in LG preallocation is not wasted. It can always be used later. Besides, LG preallocations are created often anyway.
>
>>
>> I suspect we may want to take a step back and what's the best way of
>> handling all of the high level goals, and refomulate a better strategy:
>>
>> 1) If the file doesn't have any blocks allocated yet, and it is closed
>> by the time we need to allocate all of its blocks (thanks to delayed
>> allocation), we should allocate exactly the number of blocks needed,
>> and try to avoid fragmenting free space.
>>
>> 2) In the case where we are doing any tails at all, it's a different
>> optimization altogether, and the main goal should be to make sure the
>> tail is connected the previous blocks if at all possible.  To the
>> extent that we have per-CPU LG groups, this can be... problematic.
>>
> That's why I suggest always put small files to LG preallocations (and remove scheduler dependency), which allows small files to be contiguous. For large files, tail's goal should be set to the location after the existing extents.
>
>> 3) The primary goal of the per-group locality group is to avoid lock
>> contention when multiple CPU's are trying to access the blocks at the
>> same time.  Depending on why we are doing block allocation, this may
>> not be an issue.  We should probably into account why it is we re
>> doing block allocation at particular moment:
>>
>>    *) In the case of delayed allocation w/o fsync(2), the writeback
>>       will be taking place from the writeback daemon.  In that case,
>>       all of allocations will be issued from a writeback thread, so
>>       contention is much less of an issue.  In addition, in the case
>>       where the entire file has been written within 30 seconds, we
>>       will be allocating the entire file at once, as described in (1).
>>       This is a fairly common case, so we should optimize for it.
>
> I think currently ext4 handles this fine. One thing that I am worried about is that the write back thread may sit on different CPU every 30 seconds, and use different LG preallocations that may be far away from each other.
>>
>>
>>    *) In the case of block allocations resulting from an fallocate(2),
>>       we should assume userspace knew what it was doing, and we
>>       shouldn't try to do any kind of size normalization.  Also,
>>       fallocate() calls generally don't happen with high frequency, so
>>       worrying about CPU scalability is probalby not a major issue.
>
> I agree.
>>
>>
>>    *) In the case of nodelalloc (or ext3 compatibility mode).  In that
>>       case, we will be doing lots of singleton allocation, and it's in
>>       this case where normalization and preallocation is most important.
>
> And maybe O_DIRECT mode?
>>
>>
>>    *) The Lustre file calls ext4_map_blocks() directly.  We'll need to
>>       check with Andreas, but the user is writing a lot of small files
>>       over Lustre, it may be that lock contention will be an issue,
>>       and we might want to use LG preallocation groups.
>
> Again, I suggest small files always use LG preallocations. Files are packed together and there's little contention.
>>
>>
>>    *) Writes caused by fsync(2).  (For example, from SQLite files.)
>>       In this case, lock contention might very well be an issue, and
>>       use of the LG preallocation group some kind of file
>>       normalization may very well be a good idea.  In fact, we might
>>       want to have a hueristic which notices files which are written
>>       using either a slow append or an append followed by an fsync
>>       workload, and make sure we use an inode perallocation group, and
>>       in the case of a slow append workload, we might want to hold the
>>       preallocation over a file close (with some way of releasing
>>       blocks in an ENOSPC situation).
>
> Maybe SQLite can pass a hint to block allocator. Maybe ext4 should predict by previous pattern.
>>
>>
>> Some of the mballoc failure modes which you found in the Chopper paper
>> was specifically we weren't taking these various different scenarios
>> into account, and so we used an optimization that worked well when we
>> were allocating the whole file at a time (thanks to delalloc and
>> writebacks), and tht hueristic failed miserably in the case of large
>> files where we happened to be writing a tail.
>>
>> So before we start playing with changing hueristics, it would be a
>> good idea if we make sure the block allocator has enough context so it
>> understands how it is being called and by whom, and make sure we that
>> we don't use the wrong hueristic by accident.
>>
>> Does this make sense?
>
> Yes. I'll keep it in mind while doing my research.
>
> Thanks,
> Jun
>
>>
>>
>>                                         - Ted
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8d1e602..49c8547 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3004,7 +3004,7 @@  ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int bsbits, max;
 	ext4_lblk_t end;
-	loff_t size, start_off;
+	loff_t size, isize, start_off;
 	loff_t orig_size __maybe_unused;
 	ext4_lblk_t start;
 	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
@@ -3019,8 +3019,7 @@  ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
 		return;
 
-	/* caller may indicate that preallocation isn't
-	 * required (it's a tail, for example) */
+	/* caller may indicate that preallocation isn't required */
 	if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
 		return;
 
@@ -3029,11 +3028,21 @@  ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 		return ;
 	}
 
-	bsbits = ac->ac_sb->s_blocksize_bits;
-
 	/* first, let's learn actual file size
 	 * given current request is allocated */
 	size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
+
+	bsbits = ac->ac_sb->s_blocksize_bits;
+	isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
+		>> bsbits;
+
+	/* don't normalize tail of large files */
+	if ((size == isize) &&
+	    !ext4_fs_is_busy(sbi) &&
+	    (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
+		return;
+	}
+
 	size = size << bsbits;
 	if (size < i_size_read(ac->ac_inode))
 		size = i_size_read(ac->ac_inode);
@@ -4107,13 +4116,6 @@  static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
 		>> bsbits;
 
-	if ((size == isize) &&
-	    !ext4_fs_is_busy(sbi) &&
-	    (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
-		ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
-		return;
-	}
-
 	if (sbi->s_mb_group_prealloc <= 0) {
 		ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
 		return;