ext4 mballoc: fix tail allocation

This patch addresses the tail allocation problem described in the paper "Reducing File System Tail Latencies with Chopper" (https://www.usenix.org/system/files/conference/fast15/fast15-paper-he.pdf). The paper refers to the tail allocation problem as the Special End problem.

Here is a description of the problem:

A tail extent is the last extent of a file. The last block of the tail extent corresponds to the last logical block of the file. When a file is closed and the tail extent is being allocated, ext4 marks the allocation request with the hint EXT4_MB_HINT_NOPREALLOC. The intention is to avoid preallocation when we know the file is final (it won't change until the next open). But the implementation leads to some problems. The following program reproduces the problem.

/****************************************/
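#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;
    int size, off;
    char buf[4096];

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        exit(1);
    }

    size = 4096;
    memset(buf, 0, sizeof(buf));    /* do not write uninitialized stack data */

    /* O_CREAT requires an explicit mode argument */
    fd = open(argv[1], O_WRONLY|O_CREAT, 0644);
    if (fd == -1) {
        perror("opening file");
        exit(1);
    }

    /* first 4KB: forced to disk by fsync() while the file is still open */
    off = 0;
    pwrite(fd, buf, size, off);
    printf("wrote at %d, size %d bytes\n", off, size);
    fsync(fd);

    /* second 4KB: becomes the tail extent, allocated after close() */
    off = off + size;
    pwrite(fd, buf, size, off);
    printf("wrote at %d, size %d bytes\n", off, size);

    close(fd);
    sync();

    return 0;
}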
/****************************************/

Mount an ext4 file system on /mnt/ext4onloop and run the program with:
$ ./a.out /mnt/ext4onloop/testfile

Check the extents with:
$ filefrag -sv /mnt/ext4onloop/testfile
Filesystem type is: ef53
File size of /mnt/ext4onloop/testfile is 8192 (2 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0    33280               1 
   1       1    33025    33281      1 eof
/mnt/ext4onloop/testfile: 2 extents found

The first 4KB of the file is forced to disk by fsync() and allocated from a locality group preallocation. The second 4KB of the file is forced to disk by sync(). (If you don't call sync(), the effect is the same, because the writeback thread will eventually pick up the dirty data and force it to disk.) Since the second 4KB is allocated when the file is closed and the extent is the tail extent, it is allocated with the hint EXT4_MB_HINT_NOPREALLOC. This hint prevents the second 4KB from being allocated from the locality group preallocation, so the second 4KB is not allocated next to the first 4KB.
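
The "expected" column in the filefrag output above can be reproduced with a few lines of C. This is only an illustrative sketch: struct extent, expected_physical() and count_discontiguities() are hypothetical names made up for this example, and the check assumes the extents are passed in logical order and looks at physical adjacency only.

```c
#include <assert.h>
#include <stddef.h>

/* One row of `filefrag -v` output, in file system blocks.
 * Hypothetical type for illustration; not a kernel or e2fsprogs structure. */
struct extent {
    unsigned long long logical;
    unsigned long long physical;
    unsigned long long length;
};

/* Physical block where the next extent would start if it directly
 * followed `prev` on disk. */
unsigned long long expected_physical(const struct extent *prev)
{
    assert(prev != NULL);
    return prev->physical + prev->length;
}

/* Count extents that do not start where the previous extent ended.
 * Assumes ext[] is sorted by logical block. */
size_t count_discontiguities(const struct extent *ext, size_t n)
{
    size_t breaks = 0;

    assert(ext != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (ext[i].physical != expected_physical(&ext[i - 1]))
            breaks++;
    }
    return breaks;
}
```

With the two extents reported above, (0, 33280, 1) and (1, 33025, 1), the expected start of the second extent is 33281, but it was placed at 33025, so the file has one discontiguity.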

The program above demonstrates the tail allocation problem. The Chopper paper shows that the problem can place extents of the same file very far (GBs) away from each other.

The EXT4_MB_HINT_NOPREALLOC hint currently means:
1. if there is an inode preallocation for this file, use it;
2. do not use LG (locality group) preallocations, even if they are available (I think this is not intended);
3. do not create any new preallocations.

EXT4_MB_HINT_NOPREALLOC is not the right hint for tail allocation. What we really want is:
1. if the file is large and there is an inode preallocation for this file, use it;
2. if the file is large and there is no inode preallocation for this file, do not create a new inode preallocation. Simply allocate as needed.
3. if the file is small and there is an LG preallocation, use it.
4. if the file is small and there is no LG preallocation, create an LG preallocation and allocate the tail extent from it. LG preallocations are always useful because small files almost always keep coming. When a writeback thread writes a lot of new small files, this design groups the small files together and reduces seeks. Also, allocating many small files from LG preallocation space should be faster than allocating each of them from the buddy cache separately.
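
The two lists above can be compared side by side with a small user-space model. This is a sketch, not the code in fs/ext4/mballoc.c: every identifier below (tail_source, tail_request, current_policy, desired_policy) is hypothetical, and whether a file counts as "small" is passed in as a flag because the cutoff is not stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration only; none of them exist in mballoc.c. */
enum tail_source {
    USE_INODE_PA,   /* take blocks from an existing inode preallocation */
    USE_LG_PA,      /* take blocks from an existing LG preallocation */
    NEW_LG_PA,      /* create an LG preallocation, then allocate from it */
    PLAIN_ALLOC     /* allocate only what is needed, create no preallocation */
};

struct tail_request {
    bool is_small;      /* file is treated as a small file (assumed input) */
    bool has_inode_pa;  /* an inode preallocation exists for this file */
    bool has_lg_pa;     /* a usable LG preallocation exists */
};

/* Model of what EXT4_MB_HINT_NOPREALLOC currently means for a tail extent. */
enum tail_source current_policy(const struct tail_request *r)
{
    assert(r != NULL);
    if (r->has_inode_pa)
        return USE_INODE_PA;    /* current rule 1 */
    return PLAIN_ALLOC;         /* current rules 2 and 3: no LG use, no new PA */
}

/* Model of the behavior the commit message asks for. */
enum tail_source desired_policy(const struct tail_request *r)
{
    assert(r != NULL);
    if (!r->is_small)           /* desired rules 1 and 2 */
        return r->has_inode_pa ? USE_INODE_PA : PLAIN_ALLOC;
    return r->has_lg_pa ? USE_LG_PA : NEW_LG_PA;    /* desired rules 3 and 4 */
}
```

For the test program above (a small file, no inode preallocation, and assuming the LG preallocation used for the first 4KB is still available), current_policy() returns PLAIN_ALLOC while desired_policy() returns USE_LG_PA, which is the difference that separates the two extents.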

To get the correct behavior:

If we remove the current check for the tail in ext4_mb_group_or_file(), cases 1, 3 and 4 are satisfied. We then only need to handle the tail of a large file (case 2) by checking for a tail when normalizing the request. If it is the tail of a large file, we simply do not normalize it.

The patch, applied to Linux 4.0-rc6, has been tested with kvm-xfstests. generic/256 failed, but I wouldn't worry about it because the vanilla kernel fails that test as well.

Signed-off-by: Jun He <jhe@cs.wisc.edu>

---
 fs/ext4/mballoc.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)
