Message ID | 20090721001750.GD4231@webber.adilger.int |
---|---|
State | Accepted, archived |
Headers | show |
On Mon, Jul 20, 2009 at 06:17:50PM -0600, Andreas Dilger wrote: > Attached is the e2freefrag tool. It grabs the block bitmaps, creates > buddy bitmaps from them and displays the total/free chunks (default > 1MB chunk size), and a histogram of free space. > > It could probably be enhanced to print the chunk sizes based on the > RAID chunk size stored in the superblock, but I just thought of that > this minute... Thanks, checked in with some minor changes to fix some printf warnings. Here's the output on my root filesystem (which has been in use since February): Device: /dev/ssd/root Blocksize: 4096 bytes Total blocks: 18350080 Free blocks: 10774142 (58.7%) Chunksize: 1048576 bytes (256 blocks) Total chunks: 71681 Free chunks: 21792 (30.4%) Min free chunk: 4 KB Max free chunk: 568232 KB Avg free chunk: 184 KB HISTOGRAM OF FREE CHUNK SIZES: Chunk Size Range : Free chunks 4K... 8K- : 35005 8K... 16K- : 33639 16K... 32K- : 31419 32K... 64K- : 33953 64K... 128K- : 26397 128K... 256K- : 7314 256K... 512K- : 1855 512K... 1024K- : 1612 1M... 2M- : 1160 2M... 4M- : 567 4M... 8M- : 303 8M... 16M- : 106 16M... 32M- : 40 32M... 64M- : 51 64M... 128M- : 123 128M... 256M- : 8 512M... 1024M- : 1 Yeah.... pretty fragmented. :-( - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: ... > Here's the output on my root filesystem (which has been in use since > February): > > Device: /dev/ssd/root > Blocksize: 4096 bytes > Total blocks: 18350080 > Free blocks: 10774142 (58.7%) > > Chunksize: 1048576 bytes (256 blocks) > Total chunks: 71681 > Free chunks: 21792 (30.4%) > > Min free chunk: 4 KB > Max free chunk: 568232 KB > Avg free chunk: 184 KB > > HISTOGRAM OF FREE CHUNK SIZES: > Chunk Size Range : Free chunks > 4K... 8K- : 35005 > 8K... 16K- : 33639 > 16K... 32K- : 31419 > 32K... 64K- : 33953 > 64K... 128K- : 26397 > 128K... 256K- : 7314 > 256K... 512K- : 1855 > 512K... 1024K- : 1612 > 1M... 2M- : 1160 > 2M... 4M- : 567 > 4M... 8M- : 303 > 8M... 16M- : 106 > 16M... 32M- : 40 > 32M... 64M- : 51 > 64M... 128M- : 123 > 128M... 256M- : 8 > 512M... 1024M- : 1 > > Yeah.... pretty fragmented. :-( > Just for comparison, here's a 30G xfs root that has run for a year or two, currently about 70% full: xfs_db> freesp -s from to extents blocks pct 1 1 1849 1849 0.08 2 3 1383 3293 0.14 4 7 1034 5429 0.23 8 15 1061 12260 0.53 16 31 641 13261 0.57 32 63 355 15601 0.67 64 127 221 19940 0.86 128 255 195 35841 1.54 256 511 173 63066 2.71 512 1023 122 89824 3.86 1024 2047 51 70032 3.01 2048 4095 22 60982 2.62 4096 8191 20 116580 5.01 8192 16383 10 109896 4.72 16384 32767 7 152026 6.53 32768 65535 4 206283 8.87 65536 131071 3 285744 12.28 262144 524287 1 509811 21.91 524288 1048575 1 554838 23.85 total free extents 7153 total free blocks 2326556 average free extent size 325.256 from...to units are in 4k blocks. Maybe the fancy ext4 defragger will have a good second use case in cleaning up some of that freespace fragmentation. -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
So I started looking to see how we might be able to improve mballoc to avoid freespace fragmentation, and I came up with the following high level design. Does this look sane? Have I overlooked anything? 1) In ext4_mb_normalize_request(), if the inode that we are allocating does not have any open file descriptors for write (i.e., it's already closed and we're allocating via delalloc) _and_ the inode was previously opened with O_CREAT and without O_APPEND (checked via a flag in EXT4_I(inode)), then do not normalize the size to a power of two, but rather to the filesystem blocksize. The idea here is that we should be trying to find an exact fit, since most of the time (except for log files, which get appended; hence the O_CREAT && !O_APPEND test) once a file is written, that is probably the final size for the file. So normalizing the size for the preallocation area to a power of two will be counterproductive for most files. 2) If the there has been less than X files opened in Y jiffies the parent directory (using the dentry path used to open the file), then do not set EXT4_MB_HINT_GROUP_ALLOC in ext4_mb_group_or_file(). We can simulate this for without creating this patch to test #1 by setting mb_stream_request to 0 (which should completely disable group preallocation). - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > So I started looking to see how we might be able to improve mballoc to > avoid freespace fragmentation, and I came up with the following high > level design. Does this look sane? Have I overlooked anything? > > 1) In ext4_mb_normalize_request(), if the inode that we are allocating > does not have any open file descriptors for write (i.e., it's already > closed and we're allocating via delalloc) _and_ the inode was > previously opened with O_CREAT and without O_APPEND (checked via a > flag in EXT4_I(inode)), then do not normalize the size to a power of > two, but rather to the filesystem blocksize. > > The idea here is that we should be trying to find an exact fit, since > most of the time (except for log files, which get appended; hence the > O_CREAT && !O_APPEND test) once a file is written, that is probably > the final size for the file. So normalizing the size for the > preallocation area to a power of two will be counterproductive for > most files. I'm sort of woefully ignorant of a lot of the mballoc stuff. When you say once a file is written that's probably the final size... do you mean when writes are done and it's closed, or when the first write to the file is complete? I think an awful lot of normal cases write to a file in sub-file-sized chunks (think mp3 or flac encoding, file downloading, etc). Also, I get the !O_APPEND test, but is O_CREAT necessary? I wonder how much of a hint that really gives us. > 2) If the there has been less than X files opened in Y jiffies the > parent directory (using the dentry path used to open the file), then > do not set EXT4_MB_HINT_GROUP_ALLOC in ext4_mb_group_or_file(). We > can simulate this for without creating this patch to test #1 by > setting mb_stream_request to 0 (which should completely disable group > preallocation). Hm have to try hard to parse that ;) But that sounds reasonable I think. I'm talking to the Fedora infrastructure folks to see if there's a way to recreate snapshots of, say, the F10 repos from initial release to today, to be able to sort of fast-forward root filesystem updates. It'd be a nice way to do accelerated aging tests for any changes we make, at least for one usecase ... -Eric > - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > So I started looking to see how we might be able to improve mballoc to > avoid freespace fragmentation, and I came up with the following high > level design. Does this look sane? Have I overlooked anything? > > 1) In ext4_mb_normalize_request(), if the inode that we are allocating > does not have any open file descriptors for write (i.e., it's already > closed and we're allocating via delalloc) _and_ the inode was > previously opened with O_CREAT and without O_APPEND (checked via a > flag in EXT4_I(inode)), then do not normalize the size to a power of > two, but rather to the filesystem blocksize. > > The idea here is that we should be trying to find an exact fit, since > most of the time (except for log files, which get appended; hence the > O_CREAT && !O_APPEND test) once a file is written, that is probably > the final size for the file. So normalizing the size for the > preallocation area to a power of two will be counterproductive for > most files. > > I am trying to understand what user cases prefer normalize allocation request size? If they are uncommon cases, perhaps we should disable the normalize the allocation size disabled by default, unless the apps opens the files with O_APPEND? > 2) If the there has been less than X files opened in Y jiffies the > parent directory (using the dentry path used to open the file), then > do not set EXT4_MB_HINT_GROUP_ALLOC in ext4_mb_group_or_file(). We > can simulate this for without creating this patch to test #1 by > setting mb_stream_request to 0 (which should completely disable group > preallocation). > > - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 23, 2009 at 12:43:47PM -0500, Eric Sandeen wrote: > > 1) In ext4_mb_normalize_request(), if the inode that we are allocating > > does not have any open file descriptors for write (i.e., it's already > > closed and we're allocating via delalloc) _and_ the inode was > > previously opened with O_CREAT and without O_APPEND (checked via a > > flag in EXT4_I(inode)), then do not normalize the size to a power of > > two, but rather to the filesystem blocksize. > > I'm sort of woefully ignorant of a lot of the mballoc stuff. > > When you say once a file is written that's probably the final size... do > you mean when writes are done and it's closed, or when the first write > to the file is complete? > > I think an awful lot of normal cases write to a file in sub-file-sized > chunks (think mp3 or flac encoding, file downloading, etc). I meant when the writes are done and the files are closed; hence my proposal that we do this do #1 above only if there are no open file descriptors for write. That is, if the file can be written and closed by the userspace process before any delayed allocation blocks are attempted to be written by the filesystem, we can probably safely assume that the file won't grown in size later on. > Also, I get the !O_APPEND test, but is O_CREAT necessary? I wonder how > much of a hint that really gives us. Well, it probably should be O_CREAT || O_TRUNC. The basic idea here is to distinguish between a file which gets appended to via syslog, or via a mail delivery program that writes 4k of data to the end of a mail spool file. In some cases, such as the mail delivery program, it might not use O_APPEND, but instead it might lock the file, seek to end of the file, and then right the 4k worth of e-mail. So if the file wasn't freshly created (or truncated) at the last open, maybe we should use a more aggressive preallocation --- and in the case of /var/mail spool delivery, perhaps the preallocation should persist beyond the file getting closed. (In the future we might want to have some hueristics where if we notice that the pattern of file writes is a repeated open, write-causing-block-allocation, close, maybe we should do some kind of block reservation style scheme while the filesystem is mounted and the inode stays in the inode cache.) - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 23, 2009 at 10:51:58AM -0700, Mingming Cao wrote: > I am trying to understand what user cases prefer normalize allocation > request size? If they are uncommon cases, perhaps > we should disable the normalize the allocation size disabled by default, > unless the apps opens the files with O_APPEND? The case where we would want to round the allocation size up would be if we are writing a large file (say, like a large mp3 or mpeg4 file), which takes a while for the audio/video encoder to write out the blocks. In that case, doing file-based preallocation is a good thing. Normally, if we are doing block allocations for files greater than 16 blocks (i.e, 64k), we use file-based preallocation. Otherwise we use block group allocations. The problem with using block group allocations is that way it works is that first time we try to allocate out of a block group, we try to find a free extent which is 512 blocks long. If we can't find a free extent which is 512 blocks long, we'll try another block group. Hence, for small files, once a block group gets fragmented to the point where there isn't a free chunk which is 512 blocks long, we'll try to find another block group --- even if that means finding another block group far, FAR away from the block group where the directory is contained. Worse yet, if we unmount and remount the filesystem, we forget the fact that we were using a particular (now-partially filled) preallocation group, so the next time we try to allocate space for a small file, we will find *another* free 512 block chunk to allocate small files. Given that there is 32,768 blocks in block group, after 64 interations of "mount, write one 4k file in a directory, unmount", that block group will have 64 files, each separated by 511 blocks, and that block group will no longer have any free 512 chunks for block allocations. (And given that the block preallocation is per-CPU, it becomes even worse on an SMP system.) Put this baldly, it may be that we need to do a fundamental rethink on how we do per-cpu, per-blockgroup preallocations for small files. Maybe instead of trying to find a 512 extent which is completely full, we should instead be looking for a 512 extent which has at least mb_stream_req free blocks (i.e. by default 16 free blocks). - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Theodore Tso wrote: > On Thu, Jul 23, 2009 at 12:43:47PM -0500, Eric Sandeen wrote: >>> 1) In ext4_mb_normalize_request(), if the inode that we are allocating >>> does not have any open file descriptors for write (i.e., it's already >>> closed and we're allocating via delalloc) _and_ the inode was >>> previously opened with O_CREAT and without O_APPEND (checked via a >>> flag in EXT4_I(inode)), then do not normalize the size to a power of >>> two, but rather to the filesystem blocksize. >> I'm sort of woefully ignorant of a lot of the mballoc stuff. >> >> When you say once a file is written that's probably the final size... do >> you mean when writes are done and it's closed, or when the first write >> to the file is complete? >> >> I think an awful lot of normal cases write to a file in sub-file-sized >> chunks (think mp3 or flac encoding, file downloading, etc). > > I meant when the writes are done and the files are closed; hence my > proposal that we do this do #1 above only if there are no open file > descriptors for write. That is, if the file can be written and closed > by the userspace process before any delayed allocation blocks are > attempted to be written by the filesystem, we can probably safely > assume that the file won't grown in size later on. Ah, ok. Sorry, I misunderstood. Yep, that seems reasonable. It should probably get tested with workloads like video transcoding, where there will be incremental writes that span many minutes or hours. >> Also, I get the !O_APPEND test, but is O_CREAT necessary? I wonder how >> much of a hint that really gives us. > > Well, it probably should be O_CREAT || O_TRUNC. The basic idea here is > to distinguish between a file which gets appended to via syslog, or > via a mail delivery program that writes 4k of data to the end of a > mail spool file. In some cases, such as the mail delivery program, it > might not use O_APPEND, but instead it might lock the file, seek to > end of the file, and then right the 4k worth of e-mail. So if the > file wasn't freshly created (or truncated) at the last open, maybe we > should use a more aggressive preallocation --- and in the case of > /var/mail spool delivery, perhaps the preallocation should persist > beyond the file getting closed. (In the future we might want to have > some hueristics where if we notice that the pattern of file writes is > a repeated open, write-causing-block-allocation, close, maybe we > should do some kind of block reservation style scheme while the > filesystem is mounted and the inode stays in the inode cache.) sounds fancy ;) -Eric > - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Sandeen wrote: > Theodore Tso wrote: >> On Thu, Jul 23, 2009 at 12:43:47PM -0500, Eric Sandeen wrote: >>>> 1) In ext4_mb_normalize_request(), if the inode that we are allocating >>>> does not have any open file descriptors for write (i.e., it's already >>>> closed and we're allocating via delalloc) _and_ the inode was >>>> previously opened with O_CREAT and without O_APPEND (checked via a >>>> flag in EXT4_I(inode)), then do not normalize the size to a power of >>>> two, but rather to the filesystem blocksize. >>> I'm sort of woefully ignorant of a lot of the mballoc stuff. >>> >>> When you say once a file is written that's probably the final size... do >>> you mean when writes are done and it's closed, or when the first write >>> to the file is complete? >>> >>> I think an awful lot of normal cases write to a file in sub-file-sized >>> chunks (think mp3 or flac encoding, file downloading, etc). >> I meant when the writes are done and the files are closed; hence my >> proposal that we do this do #1 above only if there are no open file >> descriptors for write. That is, if the file can be written and closed >> by the userspace process before any delayed allocation blocks are >> attempted to be written by the filesystem, we can probably safely >> assume that the file won't grown in size later on. > > Ah, ok. Sorry, I misunderstood. Yep, that seems reasonable. > > It should probably get tested with workloads like video transcoding, > where there will be incremental writes that span many minutes or hours. Ugh right after I sent this I think I'm finally making sense of it. :) In that case, come allocation time there =would= be file descriptors open, and we'd go back to the old method of normalizing the allocation. You're just talking about changing things where an entire series of file writes have come & gone, everything is closed & done, and -now- we're allocating. Sorry for being slow. :) -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Jul 23, 2009 20:23 -0400, Theodore Ts'o wrote: > On Thu, Jul 23, 2009 at 12:43:47PM -0500, Eric Sandeen wrote: > > > 1) In ext4_mb_normalize_request(), if the inode that we are allocating > > > does not have any open file descriptors for write (i.e., it's already > > > closed and we're allocating via delalloc) _and_ the inode was > > > previously opened with O_CREAT and without O_APPEND (checked via a > > > flag in EXT4_I(inode)), then do not normalize the size to a power of > > > two, but rather to the filesystem blocksize. > > > > I'm sort of woefully ignorant of a lot of the mballoc stuff. > > > > When you say once a file is written that's probably the final size... do > > you mean when writes are done and it's closed, or when the first write > > to the file is complete? > > > > I think an awful lot of normal cases write to a file in sub-file-sized > > chunks (think mp3 or flac encoding, file downloading, etc). > > I meant when the writes are done and the files are closed; hence my > proposal that we do this do #1 above only if there are no open file > descriptors for write. That is, if the file can be written and closed > by the userspace process before any delayed allocation blocks are > attempted to be written by the filesystem, we can probably safely > assume that the file won't grown in size later on. Right, this is a reasonable default I think. > > Also, I get the !O_APPEND test, but is O_CREAT necessary? I wonder how > > much of a hint that really gives us. > > Well, it probably should be O_CREAT || O_TRUNC. The basic idea here is > to distinguish between a file which gets appended to via syslog, or > via a mail delivery program that writes 4k of data to the end of a > mail spool file. In some cases, such as the mail delivery program, it > might not use O_APPEND, but instead it might lock the file, seek to > end of the file, and then right the 4k worth of e-mail. So if the > file wasn't freshly created (or truncated) at the last open, maybe we > should use a more aggressive preallocation --- and in the case of > /var/mail spool delivery, perhaps the preallocation should persist > beyond the file getting closed. (In the future we might want to have > some hueristics where if we notice that the pattern of file writes is > a repeated open, write-causing-block-allocation, close, maybe we > should do some kind of block reservation style scheme while the > filesystem is mounted and the inode stays in the inode cache.) I think you are on the right track with the !O_TRUNC check. Namely, any file which is a non-zero size and gets an extending write at a non-zero offset should probably get some persistent preallocation (fallocate). Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Index: e2fsprogs-1.41.4/misc/e2freefrag.8.in =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ e2fsprogs-1.41.4/misc/e2freefrag.8.in 2009-04-21 13:24:09.000000000 -0600 @@ -0,0 +1,99 @@ +.\" -*- nroff -*- +.TH E2FREEFRAG 8 +.SH NAME +e2freefrag \- report free space fragmentation information +.SH SYNOPSIS +.B e2freefrag +[ +.B \-c chunk_kb +] +[ +.B \-h +] +.B filesys + +.SH DESCRIPTION +.B e2freefrag +is used to report free space fragmentation on ext2/3/4 file systems. +.I filesys +is the filesystem device name (e.g. +.IR /dev/hdc1 ", " /dev/md0 ). +The +.B e2freefrag +program will scan the block bitmap information to check how many free blocks +are present as contiguous and aligned free space. The percentage of contiguous +free blocks of size and of alignment +.IR chunk_kb +is reported. It also displays the minimum/maximum/average free chunk size in +the filesystem, along with a histogram of all free chunks. This information +can be used to gauge the level of free space fragmentation in the filesystem. +.SH OPTIONS +.TP +.BI \-c " chunk_kb" +Desired size of chunk. It is specified in units of kilobytes (KB). If no +.I chunk_kb +is specified on the command line, then the default value is 1024KB. +.TP +.BI \-h +Print the usage of the program. +.SH EXAMPLE +# e2freefrag /dev/vgroot/lvhome +.br +Device: /dev/vgroot/lvhome +.br +Blocksize: 4096 bytes +.br +Total blocks: 5120710 +.br +Free blocks: 831744 (16.2%) +.br +Chunk size: 1048576 bytes (256 blocks) +.br +Total chunks: 20003 +.br +Free chunks: 2174 (10.9%) +.br + +Min free chunk: 4 KB +.br +Max free chunk: 24576 KB +.br +Avg. free chunk: 340 KB +.br + +HISTOGRAM OF FREE CHUNK SIZES: +.br + Range Free chunks +.br + 4K... 8K- : 2824 +.br + 8K... 16K- : 1760 +.br + 16K... 32K- : 1857 +.br + 32K... 64K- : 1003 +.br + 64K... 128K- : 616 +.br + 128K... 256K- : 479 +.br + 256K... 512K- : 302 +.br + 512K... 1024K- : 238 +.br + 1M... 2M- : 213 +.br + 2M... 4M- : 173 +.br + 4M... 8M- : 287 +.br + 8M... 16M- : 4 +.br + 16M... 32M- : 1 +.SH AUTHOR +This version of e2freefrag was written by Rupesh Thakare, and modified by +Andreas Dilger <adilger@sun.com>, and Kalpak Shah. +.SH SEE ALSO +.IR debugfs (8), +.IR dumpe2fs (8), +.IR e2fsck (8) Index: e2fsprogs-1.41.4/misc/e2freefrag.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ e2fsprogs-1.41.4/misc/e2freefrag.c 2009-04-21 13:18:30.000000000 -0600 @@ -0,0 +1,275 @@ +/* + * e2freefrag - report filesystem free-space fragmentation + * + * Copyright (C) 2009 Sun Microsystems, Inc. + * + * Author: Rupesh Thakare <rupesh@sun.com> + * Andreas Dilger <adilger@sun.com> + * + * %Begin-Header% + * This file may be redistributed under the terms of the GNU Public + * License version 2. + * %End-Header% + */ +#include <stdio.h> +#ifdef HAVE_UNISTD_H +#include <unistd.h> +#endif +#ifdef HAVE_STDLIB_H +#include <stdlib.h> +#endif +#ifdef HAVE_GETOPT_H +#include <getopt.h> +#else +extern char *optarg; +extern int optind; +#endif + +#include "ext2fs/ext2_fs.h" +#include "ext2fs/ext2fs.h" +#include "e2freefrag.h" + +void usage(const char *prog) +{ + fprintf(stderr, "usage: %s [-c chunksize in kb] [-h] " + "device_name\n", prog); + exit(1); +} + +static int ul_log2(unsigned long arg) +{ + int l = 0; + + arg >>= 1; + while (arg) { + l++; + arg >>= 1; + } + return l; +} + +void init_chunk_info(ext2_filsys fs, struct chunk_info *info) +{ + int i; + + info->chunkbits = ul_log2(info->chunkbytes); + info->blocksize_bits = ul_log2((unsigned long)fs->blocksize); + info->blks_in_chunk = info->chunkbytes >> info->blocksize_bits; + + info->min = ~0UL; + info->max = info->avg = 0; + info->real_free_chunks = 0; + + for (i = 0; i < MAX_HIST; i++) + info->histogram.fc_buckets[i] = 0; +} + +void scan_block_bitmap(ext2_filsys fs, struct chunk_info *info) +{ + unsigned long long blocks_count = fs->super->s_blocks_count; + unsigned long long chunks = (blocks_count + info->blks_in_chunk) >> + (info->chunkbits - info->blocksize_bits); + unsigned long long chunk_num; + unsigned long last_chunk_size = 0; + unsigned long long chunk_start_blk = 0; + + for (chunk_num = 0; chunk_num < chunks; chunk_num++) { + unsigned long long blk, num_blks; + int chunk_free; + + /* Last chunk may be smaller */ + if (chunk_start_blk + info->blks_in_chunk > blocks_count) + num_blks = blocks_count - chunk_start_blk; + else + num_blks = info->blks_in_chunk; + + chunk_free = 0; + + /* Initialize starting block for first chunk correctly else + * there is a segfault when blocksize = 1024 in which case + * block_map->start = 1 */ + for (blk = (chunk_num == 0 ? fs->super->s_first_data_block : 0); + blk < num_blks; blk++, chunk_start_blk++) { + int used = ext2fs_fast_test_block_bitmap(fs->block_map, + chunk_start_blk); + if (!used) { + last_chunk_size++; + chunk_free++; + } + + if (used && last_chunk_size != 0) { + unsigned long index; + + index = ul_log2(last_chunk_size) + 1; + info->histogram.fc_buckets[index]++; + + if (last_chunk_size > info->max) + info->max = last_chunk_size; + if (last_chunk_size < info->min) + info->min = last_chunk_size; + info->avg += last_chunk_size; + + info->real_free_chunks++; + last_chunk_size = 0; + } + } + + if (chunk_free == info->blks_in_chunk) + info->free_chunks++; + } +} + +errcode_t get_chunk_info(ext2_filsys fs, struct chunk_info *info) +{ + unsigned long total_chunks; + char *unitp = "KMGTPEZY"; + int units = 10; + unsigned long start = 0, end, cum; + int i, retval = 0; + + scan_block_bitmap(fs, info); + + printf("Total blocks: %lu\nFree blocks: %lu (%0.1f%%)\n", + fs->super->s_blocks_count, fs->super->s_free_blocks_count, + (double)fs->super->s_free_blocks_count * 100 / + fs->super->s_blocks_count); + + printf("\nChunksize: %u bytes (%u blocks)\n", + info->chunkbytes, info->blks_in_chunk); + total_chunks = (fs->super->s_blocks_count + info->blks_in_chunk) >> + (info->chunkbits - info->blocksize_bits); + printf("Total chunks: %lu\nFree chunks: %lu (%0.1f%%)\n", + total_chunks, info->free_chunks, + (double)info->free_chunks * 100 / total_chunks); + + /* Display chunk information in KB */ + if (info->real_free_chunks) { + info->min = (info->min * fs->blocksize) >> 10; + info->max = (info->max * fs->blocksize) >> 10; + info->avg = (info->avg / info->real_free_chunks * + fs->blocksize) >> 10; + } else { + info->min = 0; + } + + printf("\nMin free chunk: %lu KB \nMax free chunk: %lu KB\n" + "Avg free chunk: %lu KB\n", info->min, info->max, info->avg); + + printf("\nHISTOGRAM OF FREE CHUNK SIZES:\n"); + printf("%s\t%10s\n", "Chunk Size Range :", "Free chunks"); + for (i = 0; i < MAX_HIST; i++) { + end = 1 << (i + info->blocksize_bits - units); + if (info->histogram.fc_buckets[i] != 0) + printf("%5lu%c...%5lu%c- : %10lu\n", start, *unitp, + end, *unitp, info->histogram.fc_buckets[i]); + start = end; + if (start == 1<<10) { + start = 1; + units += 10; + unitp++; + } + } + + return retval; +} + +void close_device(char *device_name, ext2_filsys fs) +{ + int retval = ext2fs_close(fs); + + if (retval) + com_err(device_name, retval, "while closing the filesystem.\n"); +} + +void collect_info(ext2_filsys fs, struct chunk_info *chunk_info) +{ + unsigned int retval = 0, i, free_blks; + + printf("Device: %s\n", fs->device_name); + printf("Blocksize: %u bytes\n", fs->blocksize); + + retval = ext2fs_read_block_bitmap(fs); + if (retval) { + com_err(fs->device_name, retval, "while reading block bitmap"); + close_device(fs->device_name, fs); + exit(1); + } + + init_chunk_info(fs, chunk_info); + + retval = get_chunk_info(fs, chunk_info); + if (retval) { + com_err(fs->device_name, retval, "while collecting chunk info"); + close_device(fs->device_name, fs); + exit(1); + } +} + +void open_device(char *device_name, ext2_filsys *fs) +{ + int retval; + int flag = EXT2_FLAG_FORCE; + + retval = ext2fs_open(device_name, flag, 0, 0, unix_io_manager, fs); + if (retval) { + com_err(device_name, retval, "while opening filesystem"); + exit(1); + } +} + +int main(int argc, char *argv[]) +{ + struct chunk_info chunk_info = { .chunkbytes = DEFAULT_CHUNKSIZE }; + errcode_t retval = 0; + ext2_filsys fs = NULL; + char *device_name; + char *progname; + char c, *end; + + progname = argv[0]; + + while ((c = getopt(argc, argv, "c:h")) != EOF) { + switch (c) { + case 'c': + chunk_info.chunkbytes = strtoull(optarg, &end, 0); + if (*end != '\0') { + fprintf(stderr, "%s: bad chunk size '%s'\n", + progname, optarg); + usage(progname); + } + if (chunk_info.chunkbytes & + (chunk_info.chunkbytes - 1)) { + fprintf(stderr, "%s: chunk size must be a " + "power of 2."); + usage(progname); + } + chunk_info.chunkbytes *= 1024; + break; + default: + fprintf(stderr, "%s: bad option '%c'\n", + progname, c); + case 'h': + usage(progname); + break; + } + } + + if (optind == argc) { + fprintf(stderr, "%s: missing device name.\n", progname); + usage(progname); + } + + device_name = argv[optind]; + + open_device(device_name, &fs); + + if (chunk_info.chunkbytes < fs->blocksize) { + fprintf(stderr, "%s: chunksize must be greater than or equal " + "to filesystem blocksize.\n", progname); + exit(1); + } + collect_info(fs, &chunk_info); + close_device(device_name, fs); + + return retval; +} Index: e2fsprogs-1.41.4/misc/e2freefrag.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ e2fsprogs-1.41.4/misc/e2freefrag.h 2009-04-21 13:19:48.000000000 -0600 @@ -0,0 +1,19 @@ +#include <sys/types.h> + +#define DEFAULT_CHUNKSIZE (1024*1024) + +#define MAX_HIST 32 +struct free_chunk_histogram { + unsigned long fc_buckets[MAX_HIST]; +}; + +struct chunk_info { + unsigned long chunkbytes; /* chunk size in bytes */ + int chunkbits; /* chunk size in bits */ + unsigned long free_chunks; /* total free chunks of given size */ + unsigned long real_free_chunks; /* free chunks of any size */ + int blocksize_bits; /* fs blocksize in bits */ + int blks_in_chunk; /* number of blocks in a chunk */ + unsigned long min, max, avg; /* chunk size stats */ + struct free_chunk_histogram histogram; /* histogram of all chunk sizes*/ +}; Index: e2fsprogs-1.41.4/e2fsprogs.spec.in =================================================================== --- e2fsprogs-1.41.4.orig/e2fsprogs.spec.in 2009-04-14 05:56:43.000000000 -0600 +++ e2fsprogs-1.41.4/e2fsprogs.spec.in 2009-04-21 13:25:19.000000000 -0600 @@ -143,6 +143,7 @@ %{_root_sbindir}/tune2fs %{_sbindir}/filefrag %{_sbindir}/mklost+found +%{_sbindir}/e2freefrag %{_root_libdir}/libblkid.so.* %{_root_libdir}/libcom_err.so.* @@ -187,6 +188,7 @@ %{_mandir}/man8/resize2fs.8* %{_mandir}/man8/tune2fs.8* %{_mandir}/man8/filefrag.8* +%{_mandir}/man8/e2freefrag.8* %files devel %defattr(-,root,root) Index: e2fsprogs-1.41.4/misc/Makefile.in =================================================================== --- e2fsprogs-1.41.4.orig/misc/Makefile.in 2009-04-14 05:56:43.000000000 -0600 +++ e2fsprogs-1.41.4/misc/Makefile.in 2009-04-14 06:09:57.000000000 -0600 @@ -19,10 +19,10 @@ SPROGS= mke2fs badblocks tune2fs dumpe2fs $(BLKID_PROG) logsave \ $(E2IMAGE_PROG) @FSCK_PROG@ e2undo -USPROGS= mklost+found filefrag $(UUIDD_PROG) +USPROGS= mklost+found filefrag e2freefrag $(UUIDD_PROG) SMANPAGES= tune2fs.8 mklost+found.8 mke2fs.8 dumpe2fs.8 badblocks.8 \ e2label.8 $(FINDFS_MAN) $(BLKID_MAN) $(E2IMAGE_MAN) \ - logsave.8 filefrag.8 e2undo.8 $(UUIDD_MAN) @FSCK_MAN@ + logsave.8 filefrag.8 e2freefrag.8 e2undo.8 $(UUIDD_MAN) @FSCK_MAN@ FMANPAGES= mke2fs.conf.5 UPROGS= chattr lsattr uuidgen @@ -44,6 +44,7 @@ BLKID_OBJS= blkid.o FILEFRAG_OBJS= filefrag.o E2UNDO_OBJS= e2undo.o +E2FREEFRAG_OBJS= e2freefrag.o PROFILED_TUNE2FS_OBJS= profiled/tune2fs.o profiled/util.o PROFILED_MKLPF_OBJS= profiled/mklost+found.o @@ -71,7 +72,7 @@ $(srcdir)/uuidgen.c $(srcdir)/blkid.c $(srcdir)/logsave.c \ $(srcdir)/filefrag.c $(srcdir)/base_device.c \ $(srcdir)/ismounted.c $(srcdir)/../e2fsck/profile.c \ - $(srcdir)/e2undo.c + $(srcdir)/e2undo.c $(srcdir)/e2freefrag.c LIBS= $(LIBEXT2FS) $(LIBCOM_ERR) DEPLIBS= $(LIBEXT2FS) $(LIBCOM_ERR) @@ -276,6 +277,10 @@ @echo " LD $@" @$(CC) $(ALL_LDFLAGS) -g -pg -o logsave.profiled profiled/logsave.o +e2freefrag: $(E2FREEFRAG_OBJS) + @echo "LD $@" + @$(CC) $(ALL_LDFLAGS) -o e2freefrag $(E2FREEFRAG_OBJS) $(LIBS) + filefrag: $(FILEFRAG_OBJS) @echo " LD $@" @$(CC) $(ALL_LDFLAGS) -o filefrag $(FILEFRAG_OBJS) @@ -361,6 +366,10 @@ @echo " SUBST $@" @$(SUBSTITUTE_UPTIME) $(srcdir)/blkid.1.in blkid.1 +e2freefrag.8: $(DEP_SUBSTITUTE) $(srcdir)/e2freefrag.8.in + @echo " SUBST $@" + @$(SUBSTITUTE_UPTIME) $(srcdir)/e2freefrag.8.in e2freefrag.8 + filefrag.8: $(DEP_SUBSTITUTE) $(srcdir)/filefrag.8.in @echo " SUBST $@" @$(SUBSTITUTE_UPTIME) $(srcdir)/filefrag.8.in filefrag.8 @@ -522,7 +531,7 @@ clean: $(RM) -f $(SPROGS) $(USPROGS) $(UPROGS) $(UMANPAGES) $(SMANPAGES) \ $(FMANPAGES) \ - base_device base_device.out mke2fs.static filefrag \ + base_device base_device.out mke2fs.static filefrag e2freefrag \ e2initrd_helper partinfo prof_err.[ch] default_profile.c \ uuidd e2image tune2fs.static tst_ismounted fsck.profiled \ blkid.profiled tune2fs.profiled e2image.profiled \ @@ -603,6 +612,9 @@ blkid.o: $(srcdir)/blkid.c $(top_srcdir)/lib/blkid/blkid.h \ $(top_builddir)/lib/blkid/blkid_types.h logsave.o: $(srcdir)/logsave.c +e2freefrag.o: $(srcdir)/e2freefrag.c e2freefrag.h \ + $(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_srcdir)/lib/ext2fs/ext2fs.h \ + $(top_srcdir)/lib/ext2fs/bitops.h filefrag.o: $(srcdir)/filefrag.c base_device.o: $(srcdir)/base_device.c $(srcdir)/fsck.h ismounted.o: $(srcdir)/ismounted.c $(top_srcdir)/lib/et/com_err.h