diff mbox

e2freefrag utility

Message ID 20090721001750.GD4231@webber.adilger.int
State Accepted, archived
Headers show

Commit Message

Andreas Dilger July 21, 2009, 12:17 a.m. UTC
Attached is the e2freefrag tool.  It grabs the block bitmaps, creates
buddy bitmaps from them and displays the total/free chunks (default
1MB chunk size), and a histogram of free space.

It could probably be enhanced to print the chunk sizes based on the
RAID chunk size stored in the superblock, but I just thought of that
this minute...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Comments

Theodore Ts'o July 22, 2009, 7:43 a.m. UTC | #1
On Mon, Jul 20, 2009 at 06:17:50PM -0600, Andreas Dilger wrote:
> Attached is the e2freefrag tool.  It grabs the block bitmaps, creates
> buddy bitmaps from them and displays the total/free chunks (default
> 1MB chunk size), and a histogram of free space.
> 
> It could probably be enhanced to print the chunk sizes based on the
> RAID chunk size stored in the superblock, but I just thought of that
> this minute...

Thanks, checked in with some minor changes to fix some printf
warnings.

Here's the output on my root filesystem (which has been in use since
February):

Device: /dev/ssd/root
Blocksize: 4096 bytes
Total blocks: 18350080
Free blocks: 10774142 (58.7%)

Chunksize: 1048576 bytes (256 blocks)
Total chunks: 71681
Free chunks: 21792 (30.4%)

Min free chunk: 4 KB 
Max free chunk: 568232 KB
Avg free chunk: 184 KB

HISTOGRAM OF FREE CHUNK SIZES:
Chunk Size Range :	Free chunks
    4K...    8K- :       35005
    8K...   16K- :       33639
   16K...   32K- :       31419
   32K...   64K- :       33953
   64K...  128K- :       26397
  128K...  256K- :        7314
  256K...  512K- :        1855
  512K... 1024K- :        1612
    1M...    2M- :        1160
    2M...    4M- :         567
    4M...    8M- :         303
    8M...   16M- :         106
   16M...   32M- :          40
   32M...   64M- :          51
   64M...  128M- :         123
  128M...  256M- :           8
  512M... 1024M- :           1

Yeah.... pretty fragmented.   :-(

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen July 23, 2009, 4:59 a.m. UTC | #2
Theodore Tso wrote:
...

> Here's the output on my root filesystem (which has been in use since
> February):
> 
> Device: /dev/ssd/root
> Blocksize: 4096 bytes
> Total blocks: 18350080
> Free blocks: 10774142 (58.7%)
> 
> Chunksize: 1048576 bytes (256 blocks)
> Total chunks: 71681
> Free chunks: 21792 (30.4%)
> 
> Min free chunk: 4 KB 
> Max free chunk: 568232 KB
> Avg free chunk: 184 KB
> 
> HISTOGRAM OF FREE CHUNK SIZES:
> Chunk Size Range :	Free chunks
>     4K...    8K- :       35005
>     8K...   16K- :       33639
>    16K...   32K- :       31419
>    32K...   64K- :       33953
>    64K...  128K- :       26397
>   128K...  256K- :        7314
>   256K...  512K- :        1855
>   512K... 1024K- :        1612
>     1M...    2M- :        1160
>     2M...    4M- :         567
>     4M...    8M- :         303
>     8M...   16M- :         106
>    16M...   32M- :          40
>    32M...   64M- :          51
>    64M...  128M- :         123
>   128M...  256M- :           8
>   512M... 1024M- :           1
> 
> Yeah.... pretty fragmented.   :-(
> 


Just for comparison, here's a 30G xfs root that has run for a year or
two, currently about 70% full:

xfs_db> freesp -s
   from      to extents  blocks    pct
      1       1    1849    1849   0.08
      2       3    1383    3293   0.14
      4       7    1034    5429   0.23
      8      15    1061   12260   0.53
     16      31     641   13261   0.57
     32      63     355   15601   0.67
     64     127     221   19940   0.86
    128     255     195   35841   1.54
    256     511     173   63066   2.71
    512    1023     122   89824   3.86
   1024    2047      51   70032   3.01
   2048    4095      22   60982   2.62
   4096    8191      20  116580   5.01
   8192   16383      10  109896   4.72
  16384   32767       7  152026   6.53
  32768   65535       4  206283   8.87
  65536  131071       3  285744  12.28
 262144  524287       1  509811  21.91
 524288 1048575       1  554838  23.85
total free extents 7153
total free blocks 2326556
average free extent size 325.256

from...to units are in 4k blocks.

Maybe the fancy ext4 defragger will have a good second use case in
cleaning up some of that freespace fragmentation.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o July 23, 2009, 1:45 p.m. UTC | #3
So I started looking to see how we might be able to improve mballoc to
avoid freespace fragmentation, and I came up with the following high
level design.  Does this look sane?   Have I overlooked anything?

1) In ext4_mb_normalize_request(), if the inode that we are allocating
does not have any open file descriptors for write (i.e., it's already
closed and we're allocating via delalloc) _and_ the inode was
previously opened with O_CREAT and without O_APPEND (checked via a
flag in EXT4_I(inode)), then do not normalize the size to a power of
two, but rather to the filesystem blocksize.

The idea here is that we should be trying to find an exact fit, since
most of the time (except for log files, which get appended; hence the
O_CREAT && !O_APPEND test) once a file is written, that is probably
the final size for the file.  So normalizing the size for the
preallocation area to a power of two will be counterproductive for
most files.

2) If the there has been less than X files opened in Y jiffies the
parent directory (using the dentry path used to open the file), then
do not set EXT4_MB_HINT_GROUP_ALLOC in ext4_mb_group_or_file().  We
can simulate this for without creating this patch to test #1 by
setting mb_stream_request to 0 (which should completely disable group
preallocation).

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen July 23, 2009, 5:43 p.m. UTC | #4
Theodore Tso wrote:
> So I started looking to see how we might be able to improve mballoc to
> avoid freespace fragmentation, and I came up with the following high
> level design.  Does this look sane?   Have I overlooked anything?
> 
> 1) In ext4_mb_normalize_request(), if the inode that we are allocating
> does not have any open file descriptors for write (i.e., it's already
> closed and we're allocating via delalloc) _and_ the inode was
> previously opened with O_CREAT and without O_APPEND (checked via a
> flag in EXT4_I(inode)), then do not normalize the size to a power of
> two, but rather to the filesystem blocksize.
> 
> The idea here is that we should be trying to find an exact fit, since
> most of the time (except for log files, which get appended; hence the
> O_CREAT && !O_APPEND test) once a file is written, that is probably
> the final size for the file.  So normalizing the size for the
> preallocation area to a power of two will be counterproductive for
> most files.

I'm sort of woefully ignorant of a lot of the mballoc stuff.

When you say once a file is written that's probably the final size... do
you mean when writes are done and it's closed, or when the first write
to the file is complete?

I think an awful lot of normal cases write to a file in sub-file-sized
chunks (think mp3 or flac encoding, file downloading, etc).

Also, I get the !O_APPEND test, but is O_CREAT necessary?  I wonder how
much of a hint that really gives us.

> 2) If the there has been less than X files opened in Y jiffies the
> parent directory (using the dentry path used to open the file), then
> do not set EXT4_MB_HINT_GROUP_ALLOC in ext4_mb_group_or_file().  We
> can simulate this for without creating this patch to test #1 by
> setting mb_stream_request to 0 (which should completely disable group
> preallocation).

Hm have to try hard to parse that ;)  But that sounds reasonable I think.

I'm talking to the Fedora infrastructure folks to see if there's a way
to recreate snapshots of, say, the F10 repos from initial release to
today, to be able to sort of fast-forward root filesystem updates.  It'd
be a nice way to do accelerated aging tests for any changes we make, at
least for one usecase ...

-Eric

> 						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mingming Cao July 23, 2009, 5:51 p.m. UTC | #5
Theodore Tso wrote:
> So I started looking to see how we might be able to improve mballoc to
> avoid freespace fragmentation, and I came up with the following high
> level design.  Does this look sane?   Have I overlooked anything?
>
> 1) In ext4_mb_normalize_request(), if the inode that we are allocating
> does not have any open file descriptors for write (i.e., it's already
> closed and we're allocating via delalloc) _and_ the inode was
> previously opened with O_CREAT and without O_APPEND (checked via a
> flag in EXT4_I(inode)), then do not normalize the size to a power of
> two, but rather to the filesystem blocksize.
>
> The idea here is that we should be trying to find an exact fit, since
> most of the time (except for log files, which get appended; hence the
> O_CREAT && !O_APPEND test) once a file is written, that is probably
> the final size for the file.  So normalizing the size for the
> preallocation area to a power of two will be counterproductive for
> most files.
>
>   
I am trying to understand what user cases prefer normalize allocation 
request size? If they are uncommon cases, perhaps
we should disable the normalize the allocation size disabled by default, 
unless the apps opens the files with O_APPEND?
> 2) If the there has been less than X files opened in Y jiffies the
> parent directory (using the dentry path used to open the file), then
> do not set EXT4_MB_HINT_GROUP_ALLOC in ext4_mb_group_or_file().  We
> can simulate this for without creating this patch to test #1 by
> setting mb_stream_request to 0 (which should completely disable group
> preallocation).
>
> 						- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o July 24, 2009, 12:23 a.m. UTC | #6
On Thu, Jul 23, 2009 at 12:43:47PM -0500, Eric Sandeen wrote:
> > 1) In ext4_mb_normalize_request(), if the inode that we are allocating
> > does not have any open file descriptors for write (i.e., it's already
> > closed and we're allocating via delalloc) _and_ the inode was
> > previously opened with O_CREAT and without O_APPEND (checked via a
> > flag in EXT4_I(inode)), then do not normalize the size to a power of
> > two, but rather to the filesystem blocksize.
> 
> I'm sort of woefully ignorant of a lot of the mballoc stuff.
> 
> When you say once a file is written that's probably the final size... do
> you mean when writes are done and it's closed, or when the first write
> to the file is complete?
> 
> I think an awful lot of normal cases write to a file in sub-file-sized
> chunks (think mp3 or flac encoding, file downloading, etc).

I meant when the writes are done and the files are closed; hence my
proposal that we do this do #1 above only if there are no open file
descriptors for write.  That is, if the file can be written and closed
by the userspace process before any delayed allocation blocks are
attempted to be written by the filesystem, we can probably safely
assume that the file won't grown in size later on.

> Also, I get the !O_APPEND test, but is O_CREAT necessary?  I wonder how
> much of a hint that really gives us.

Well, it probably should be O_CREAT || O_TRUNC.  The basic idea here is
to distinguish between a file which gets appended to via syslog, or
via a mail delivery program that writes 4k of data to the end of a
mail spool file.  In some cases, such as the mail delivery program, it
might not use O_APPEND, but instead it might lock the file, seek to
end of the file, and then right the 4k worth of e-mail.  So if the
file wasn't freshly created (or truncated) at the last open, maybe we
should use a more aggressive preallocation --- and in the case of
/var/mail spool delivery, perhaps the preallocation should persist
beyond the file getting closed.  (In the future we might want to have
some hueristics where if we notice that the pattern of file writes is
a repeated open, write-causing-block-allocation, close, maybe we
should do some kind of block reservation style scheme while the
filesystem is mounted and the inode stays in the inode cache.)

	      	      	      	    	     - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o July 24, 2009, 12:43 a.m. UTC | #7
On Thu, Jul 23, 2009 at 10:51:58AM -0700, Mingming Cao wrote:
> I am trying to understand what user cases prefer normalize allocation  
> request size? If they are uncommon cases, perhaps
> we should disable the normalize the allocation size disabled by default,  
> unless the apps opens the files with O_APPEND?

The case where we would want to round the allocation size up would be
if we are writing a large file (say, like a large mp3 or mpeg4 file),
which takes a while for the audio/video encoder to write out the
blocks.   In that case, doing file-based preallocation is a good thing.

Normally, if we are doing block allocations for files greater than 16
blocks (i.e, 64k), we use file-based preallocation.  Otherwise we use
block group allocations.  The problem with using block group
allocations is that way it works is that first time we try to allocate
out of a block group, we try to find a free extent which is 512 blocks
long.  If we can't find a free extent which is 512 blocks long, we'll
try another block group.  Hence, for small files, once a block group
gets fragmented to the point where there isn't a free chunk which is
512 blocks long, we'll try to find another block group --- even if
that means finding another block group far, FAR away from the block
group where the directory is contained.

Worse yet, if we unmount and remount the filesystem, we forget the
fact that we were using a particular (now-partially filled)
preallocation group, so the next time we try to allocate space for a
small file, we will find *another* free 512 block chunk to allocate
small files.  Given that there is 32,768 blocks in block group, after
64 interations of "mount, write one 4k file in a directory, unmount",
that block group will have 64 files, each separated by 511 blocks, and
that block group will no longer have any free 512 chunks for block
allocations.  (And given that the block preallocation is per-CPU, it
becomes even worse on an SMP system.)

Put this baldly, it may be that we need to do a fundamental rethink on
how we do per-cpu, per-blockgroup preallocations for small files.
Maybe instead of trying to find a 512 extent which is completely full,
we should instead be looking for a 512 extent which has at least
mb_stream_req free blocks (i.e. by default 16 free blocks).

	      	   	  	   	      	   - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen July 24, 2009, 2:18 a.m. UTC | #8
Theodore Tso wrote:
> On Thu, Jul 23, 2009 at 12:43:47PM -0500, Eric Sandeen wrote:
>>> 1) In ext4_mb_normalize_request(), if the inode that we are allocating
>>> does not have any open file descriptors for write (i.e., it's already
>>> closed and we're allocating via delalloc) _and_ the inode was
>>> previously opened with O_CREAT and without O_APPEND (checked via a
>>> flag in EXT4_I(inode)), then do not normalize the size to a power of
>>> two, but rather to the filesystem blocksize.
>> I'm sort of woefully ignorant of a lot of the mballoc stuff.
>>
>> When you say once a file is written that's probably the final size... do
>> you mean when writes are done and it's closed, or when the first write
>> to the file is complete?
>>
>> I think an awful lot of normal cases write to a file in sub-file-sized
>> chunks (think mp3 or flac encoding, file downloading, etc).
> 
> I meant when the writes are done and the files are closed; hence my
> proposal that we do this do #1 above only if there are no open file
> descriptors for write.  That is, if the file can be written and closed
> by the userspace process before any delayed allocation blocks are
> attempted to be written by the filesystem, we can probably safely
> assume that the file won't grown in size later on.

Ah, ok.  Sorry, I misunderstood.  Yep, that seems reasonable.

It should probably get tested with workloads like video transcoding,
where there will be incremental writes that span many minutes or hours.

>> Also, I get the !O_APPEND test, but is O_CREAT necessary?  I wonder how
>> much of a hint that really gives us.
> 
> Well, it probably should be O_CREAT || O_TRUNC.  The basic idea here is
> to distinguish between a file which gets appended to via syslog, or
> via a mail delivery program that writes 4k of data to the end of a
> mail spool file.  In some cases, such as the mail delivery program, it
> might not use O_APPEND, but instead it might lock the file, seek to
> end of the file, and then right the 4k worth of e-mail.  So if the
> file wasn't freshly created (or truncated) at the last open, maybe we
> should use a more aggressive preallocation --- and in the case of
> /var/mail spool delivery, perhaps the preallocation should persist
> beyond the file getting closed.  (In the future we might want to have
> some hueristics where if we notice that the pattern of file writes is
> a repeated open, write-causing-block-allocation, close, maybe we
> should do some kind of block reservation style scheme while the
> filesystem is mounted and the inode stays in the inode cache.)


sounds fancy ;)

-Eric

> 	      	      	      	    	     - Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen July 24, 2009, 2:25 a.m. UTC | #9
Eric Sandeen wrote:
> Theodore Tso wrote:
>> On Thu, Jul 23, 2009 at 12:43:47PM -0500, Eric Sandeen wrote:
>>>> 1) In ext4_mb_normalize_request(), if the inode that we are allocating
>>>> does not have any open file descriptors for write (i.e., it's already
>>>> closed and we're allocating via delalloc) _and_ the inode was
>>>> previously opened with O_CREAT and without O_APPEND (checked via a
>>>> flag in EXT4_I(inode)), then do not normalize the size to a power of
>>>> two, but rather to the filesystem blocksize.
>>> I'm sort of woefully ignorant of a lot of the mballoc stuff.
>>>
>>> When you say once a file is written that's probably the final size... do
>>> you mean when writes are done and it's closed, or when the first write
>>> to the file is complete?
>>>
>>> I think an awful lot of normal cases write to a file in sub-file-sized
>>> chunks (think mp3 or flac encoding, file downloading, etc).
>> I meant when the writes are done and the files are closed; hence my
>> proposal that we do this do #1 above only if there are no open file
>> descriptors for write.  That is, if the file can be written and closed
>> by the userspace process before any delayed allocation blocks are
>> attempted to be written by the filesystem, we can probably safely
>> assume that the file won't grown in size later on.
> 
> Ah, ok.  Sorry, I misunderstood.  Yep, that seems reasonable.
> 
> It should probably get tested with workloads like video transcoding,
> where there will be incremental writes that span many minutes or hours.

Ugh right after I sent this I think I'm finally making sense of it.  :)
 In that case, come allocation time there =would= be file descriptors
open, and we'd go back to the old method of normalizing the allocation.
 You're just talking about changing things where an entire series of
file writes have come & gone, everything is closed & done, and -now-
we're allocating.

Sorry for being slow.  :)

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger July 24, 2009, 2:30 a.m. UTC | #10
On Jul 23, 2009  20:23 -0400, Theodore Ts'o wrote:
> On Thu, Jul 23, 2009 at 12:43:47PM -0500, Eric Sandeen wrote:
> > > 1) In ext4_mb_normalize_request(), if the inode that we are allocating
> > > does not have any open file descriptors for write (i.e., it's already
> > > closed and we're allocating via delalloc) _and_ the inode was
> > > previously opened with O_CREAT and without O_APPEND (checked via a
> > > flag in EXT4_I(inode)), then do not normalize the size to a power of
> > > two, but rather to the filesystem blocksize.
> > 
> > I'm sort of woefully ignorant of a lot of the mballoc stuff.
> > 
> > When you say once a file is written that's probably the final size... do
> > you mean when writes are done and it's closed, or when the first write
> > to the file is complete?
> > 
> > I think an awful lot of normal cases write to a file in sub-file-sized
> > chunks (think mp3 or flac encoding, file downloading, etc).
> 
> I meant when the writes are done and the files are closed; hence my
> proposal that we do this do #1 above only if there are no open file
> descriptors for write.  That is, if the file can be written and closed
> by the userspace process before any delayed allocation blocks are
> attempted to be written by the filesystem, we can probably safely
> assume that the file won't grown in size later on.

Right, this is a reasonable default I think.

> > Also, I get the !O_APPEND test, but is O_CREAT necessary?  I wonder how
> > much of a hint that really gives us.
> 
> Well, it probably should be O_CREAT || O_TRUNC.  The basic idea here is
> to distinguish between a file which gets appended to via syslog, or
> via a mail delivery program that writes 4k of data to the end of a
> mail spool file.  In some cases, such as the mail delivery program, it
> might not use O_APPEND, but instead it might lock the file, seek to
> end of the file, and then right the 4k worth of e-mail.  So if the
> file wasn't freshly created (or truncated) at the last open, maybe we
> should use a more aggressive preallocation --- and in the case of
> /var/mail spool delivery, perhaps the preallocation should persist
> beyond the file getting closed.  (In the future we might want to have
> some hueristics where if we notice that the pattern of file writes is
> a repeated open, write-causing-block-allocation, close, maybe we
> should do some kind of block reservation style scheme while the
> filesystem is mounted and the inode stays in the inode cache.)

I think you are on the right track with the !O_TRUNC check.  Namely,
any file which is a non-zero size and gets an extending write at a
non-zero offset should probably get some persistent preallocation
(fallocate).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: e2fsprogs-1.41.4/misc/e2freefrag.8.in
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ e2fsprogs-1.41.4/misc/e2freefrag.8.in	2009-04-21 13:24:09.000000000 -0600
@@ -0,0 +1,99 @@ 
+.\" -*- nroff -*-
+.TH E2FREEFRAG 8
+.SH NAME
+e2freefrag \- report free space fragmentation information
+.SH SYNOPSIS
+.B e2freefrag
+[
+.B \-c chunk_kb
+]
+[
+.B \-h
+]
+.B filesys
+
+.SH DESCRIPTION
+.B e2freefrag
+is used to report free space fragmentation on ext2/3/4 file systems.
+.I filesys
+is the filesystem device name (e.g.
+.IR /dev/hdc1 ", " /dev/md0 ).
+The
+.B e2freefrag
+program will scan the block bitmap information to check how many free blocks
+are present as contiguous and aligned free space. The percentage of contiguous
+free blocks of size and of alignment
+.IR chunk_kb
+is reported.  It also displays the minimum/maximum/average free chunk size in
+the filesystem, along with a histogram of all free chunks.  This information
+can be used to gauge the level of free space fragmentation in the filesystem.
+.SH OPTIONS
+.TP
+.BI \-c " chunk_kb"
+Desired size of chunk. It is specified in units of kilobytes (KB). If no
+.I chunk_kb
+is specified on the command line, then the default value is 1024KB.
+.TP
+.BI \-h
+Print the usage of the program.
+.SH EXAMPLE
+# e2freefrag /dev/vgroot/lvhome
+.br
+Device: /dev/vgroot/lvhome
+.br
+Blocksize: 4096 bytes
+.br
+Total blocks: 5120710
+.br
+Free blocks: 831744 (16.2%)
+.br
+Chunk size: 1048576 bytes (256 blocks)
+.br
+Total chunks: 20003
+.br
+Free chunks: 2174 (10.9%)
+.br
+
+Min free chunk: 4 KB
+.br
+Max free chunk: 24576 KB
+.br
+Avg. free chunk: 340 KB
+.br
+
+HISTOGRAM OF FREE CHUNK SIZES:
+.br
+          Range         Free chunks
+.br
+    4K...    8K- :        2824
+.br
+    8K...   16K- :        1760
+.br
+   16K...   32K- :        1857
+.br
+   32K...   64K- :        1003
+.br
+   64K...  128K- :         616
+.br
+  128K...  256K- :         479
+.br
+  256K...  512K- :         302
+.br
+  512K... 1024K- :         238
+.br
+    1M...    2M- :         213
+.br
+    2M...    4M- :         173
+.br
+    4M...    8M- :         287
+.br
+    8M...   16M- :           4
+.br
+   16M...   32M- :           1
+.SH AUTHOR
+This version of e2freefrag was written by Rupesh Thakare, and modified by
+Andreas Dilger <adilger@sun.com>, and Kalpak Shah.
+.SH SEE ALSO
+.IR debugfs (8),
+.IR dumpe2fs (8),
+.IR e2fsck (8)
Index: e2fsprogs-1.41.4/misc/e2freefrag.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ e2fsprogs-1.41.4/misc/e2freefrag.c	2009-04-21 13:18:30.000000000 -0600
@@ -0,0 +1,275 @@ 
+/*
+ * e2freefrag - report filesystem free-space fragmentation
+ *
+ * Copyright (C) 2009 Sun Microsystems, Inc.
+ *
+ * Author: Rupesh Thakare <rupesh@sun.com>
+ *         Andreas Dilger <adilger@sun.com>
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Public
+ * License version 2.
+ * %End-Header%
+ */
+#include <stdio.h>
+#ifdef HAVE_UNISTD_H
+#include <unistd.h>
+#endif
+#ifdef HAVE_STDLIB_H
+#include <stdlib.h>
+#endif
+#ifdef HAVE_GETOPT_H
+#include <getopt.h>
+#else
+extern char *optarg;
+extern int optind;
+#endif
+
+#include "ext2fs/ext2_fs.h"
+#include "ext2fs/ext2fs.h"
+#include "e2freefrag.h"
+
+void usage(const char *prog)
+{
+	fprintf(stderr, "usage: %s [-c chunksize in kb] [-h] "
+		"device_name\n", prog);
+	exit(1);
+}
+
+static int ul_log2(unsigned long arg)
+{
+        int     l = 0;
+
+        arg >>= 1;
+        while (arg) {
+                l++;
+                arg >>= 1;
+        }
+        return l;
+}
+
+void init_chunk_info(ext2_filsys fs, struct chunk_info *info)
+{
+	int i;
+
+	info->chunkbits = ul_log2(info->chunkbytes);
+	info->blocksize_bits = ul_log2((unsigned long)fs->blocksize);
+	info->blks_in_chunk = info->chunkbytes >> info->blocksize_bits;
+
+	info->min = ~0UL;
+	info->max = info->avg = 0;
+	info->real_free_chunks = 0;
+
+	for (i = 0; i < MAX_HIST; i++)
+		info->histogram.fc_buckets[i] = 0;
+}
+
+void scan_block_bitmap(ext2_filsys fs, struct chunk_info *info)
+{
+	unsigned long long blocks_count = fs->super->s_blocks_count;
+	unsigned long long chunks = (blocks_count + info->blks_in_chunk) >>
+				(info->chunkbits - info->blocksize_bits);
+	unsigned long long chunk_num;
+	unsigned long last_chunk_size = 0;
+	unsigned long long chunk_start_blk = 0;
+
+	for (chunk_num = 0; chunk_num < chunks; chunk_num++) {
+		unsigned long long blk, num_blks;
+		int chunk_free;
+
+		/* Last chunk may be smaller */
+		if (chunk_start_blk + info->blks_in_chunk > blocks_count)
+			num_blks = blocks_count - chunk_start_blk;
+		else
+			num_blks = info->blks_in_chunk;
+
+		chunk_free = 0;
+
+		/* Initialize starting block for first chunk correctly else
+		 * there is a segfault when blocksize = 1024 in which case
+		 * block_map->start = 1 */
+		for (blk = (chunk_num == 0 ? fs->super->s_first_data_block : 0);
+		     blk < num_blks; blk++, chunk_start_blk++) {
+			int used = ext2fs_fast_test_block_bitmap(fs->block_map,
+							       chunk_start_blk);
+			if (!used) {
+				last_chunk_size++;
+				chunk_free++;
+			}
+
+			if (used && last_chunk_size != 0) {
+				unsigned long index;
+
+				index = ul_log2(last_chunk_size) + 1;
+				info->histogram.fc_buckets[index]++;
+
+				if (last_chunk_size > info->max)
+					info->max = last_chunk_size;
+				if (last_chunk_size < info->min)
+					info->min = last_chunk_size;
+				info->avg += last_chunk_size;
+
+				info->real_free_chunks++;
+				last_chunk_size = 0;
+			}
+		}
+
+		if (chunk_free == info->blks_in_chunk)
+			info->free_chunks++;
+	}
+}
+
+errcode_t get_chunk_info(ext2_filsys fs, struct chunk_info *info)
+{
+	unsigned long total_chunks;
+	char *unitp = "KMGTPEZY";
+	int units = 10;
+	unsigned long start = 0, end, cum;
+	int i, retval = 0;
+
+	scan_block_bitmap(fs, info);
+
+	printf("Total blocks: %lu\nFree blocks: %lu (%0.1f%%)\n",
+	       fs->super->s_blocks_count, fs->super->s_free_blocks_count,
+	       (double)fs->super->s_free_blocks_count * 100 /
+						fs->super->s_blocks_count);
+
+	printf("\nChunksize: %u bytes (%u blocks)\n",
+	       info->chunkbytes, info->blks_in_chunk);
+	total_chunks = (fs->super->s_blocks_count + info->blks_in_chunk) >>
+                                       (info->chunkbits - info->blocksize_bits);
+	printf("Total chunks: %lu\nFree chunks: %lu (%0.1f%%)\n",
+	       total_chunks, info->free_chunks,
+	       (double)info->free_chunks * 100 / total_chunks);
+
+	/* Display chunk information in KB */
+	if (info->real_free_chunks) {
+		info->min = (info->min * fs->blocksize) >> 10;
+		info->max = (info->max * fs->blocksize) >> 10;
+		info->avg = (info->avg / info->real_free_chunks *
+			     fs->blocksize) >> 10;
+	} else {
+		info->min = 0;
+	}
+
+	printf("\nMin free chunk: %lu KB \nMax free chunk: %lu KB\n"
+	       "Avg free chunk: %lu KB\n", info->min, info->max, info->avg);
+
+	printf("\nHISTOGRAM OF FREE CHUNK SIZES:\n");
+	printf("%s\t%10s\n", "Chunk Size Range :", "Free chunks");
+	for (i = 0; i < MAX_HIST; i++) {
+		end = 1 << (i + info->blocksize_bits - units);
+		if (info->histogram.fc_buckets[i] != 0)
+			printf("%5lu%c...%5lu%c- :  %10lu\n", start, *unitp,
+			       end, *unitp, info->histogram.fc_buckets[i]);
+		start = end;
+		if (start == 1<<10) {
+			start = 1;
+			units += 10;
+			unitp++;
+		}
+	}
+
+	return retval;
+}
+
+void close_device(char *device_name, ext2_filsys fs)
+{
+	int retval = ext2fs_close(fs);
+
+	if (retval)
+		com_err(device_name, retval, "while closing the filesystem.\n");
+}
+
+void collect_info(ext2_filsys fs, struct chunk_info *chunk_info)
+{
+	unsigned int retval = 0, i, free_blks;
+
+	printf("Device: %s\n", fs->device_name);
+	printf("Blocksize: %u bytes\n", fs->blocksize);
+
+	retval = ext2fs_read_block_bitmap(fs);
+	if (retval) {
+		com_err(fs->device_name, retval, "while reading block bitmap");
+		close_device(fs->device_name, fs);
+		exit(1);
+	}
+
+	init_chunk_info(fs, chunk_info);
+
+	retval = get_chunk_info(fs, chunk_info);
+	if (retval) {
+		com_err(fs->device_name, retval, "while collecting chunk info");
+                close_device(fs->device_name, fs);
+		exit(1);
+	}
+}
+
+void open_device(char *device_name, ext2_filsys *fs)
+{
+	int retval;
+	int flag = EXT2_FLAG_FORCE;
+
+	retval = ext2fs_open(device_name, flag, 0, 0, unix_io_manager, fs);
+	if (retval) {
+		com_err(device_name, retval, "while opening filesystem");
+		exit(1);
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	struct chunk_info chunk_info = { .chunkbytes = DEFAULT_CHUNKSIZE };
+	errcode_t retval = 0;
+	ext2_filsys fs = NULL;
+	char *device_name;
+	char *progname;
+	char c, *end;
+
+	progname = argv[0];
+
+	while ((c = getopt(argc, argv, "c:h")) != EOF) {
+		switch (c) {
+		case 'c':
+			chunk_info.chunkbytes = strtoull(optarg, &end, 0);
+			if (*end != '\0') {
+				fprintf(stderr, "%s: bad chunk size '%s'\n",
+					progname, optarg);
+				usage(progname);
+			}
+			if (chunk_info.chunkbytes &
+			    (chunk_info.chunkbytes - 1)) {
+				fprintf(stderr, "%s: chunk size must be a "
+					"power of 2.");
+				usage(progname);
+			}
+			chunk_info.chunkbytes *= 1024;
+			break;
+		default:
+			fprintf(stderr, "%s: bad option '%c'\n",
+				progname, c);
+		case 'h':
+			usage(progname);
+			break;
+		}
+	}
+
+	if (optind == argc) {
+		fprintf(stderr, "%s: missing device name.\n", progname);
+		usage(progname);
+	}
+
+	device_name = argv[optind];
+
+	open_device(device_name, &fs);
+
+	if (chunk_info.chunkbytes < fs->blocksize) {
+		fprintf(stderr, "%s: chunksize must be greater than or equal "
+			"to filesystem blocksize.\n", progname);
+		exit(1);
+	}
+	collect_info(fs, &chunk_info);
+	close_device(device_name, fs);
+
+	return retval;
+}
Index: e2fsprogs-1.41.4/misc/e2freefrag.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ e2fsprogs-1.41.4/misc/e2freefrag.h	2009-04-21 13:19:48.000000000 -0600
@@ -0,0 +1,19 @@ 
+#include <sys/types.h>
+
+#define DEFAULT_CHUNKSIZE (1024*1024)
+
+#define MAX_HIST	32
+struct free_chunk_histogram {
+	unsigned long fc_buckets[MAX_HIST];
+};
+
+struct chunk_info {
+	unsigned long chunkbytes;	/* chunk size in bytes */
+	int chunkbits;			/* chunk size in bits */
+	unsigned long free_chunks;	/* total free chunks of given size */
+	unsigned long real_free_chunks; /* free chunks of any size */
+	int blocksize_bits;		/* fs blocksize in bits */
+	int blks_in_chunk;		/* number of blocks in a chunk */
+	unsigned long min, max, avg;	/* chunk size stats */
+	struct free_chunk_histogram histogram; /* histogram of all chunk sizes*/
+};
Index: e2fsprogs-1.41.4/e2fsprogs.spec.in
===================================================================
--- e2fsprogs-1.41.4.orig/e2fsprogs.spec.in	2009-04-14 05:56:43.000000000 -0600
+++ e2fsprogs-1.41.4/e2fsprogs.spec.in	2009-04-21 13:25:19.000000000 -0600
@@ -143,6 +143,7 @@ 
 %{_root_sbindir}/tune2fs
 %{_sbindir}/filefrag
 %{_sbindir}/mklost+found
+%{_sbindir}/e2freefrag
 
 %{_root_libdir}/libblkid.so.*
 %{_root_libdir}/libcom_err.so.*
@@ -187,6 +188,7 @@ 
 %{_mandir}/man8/resize2fs.8*
 %{_mandir}/man8/tune2fs.8*
 %{_mandir}/man8/filefrag.8*
+%{_mandir}/man8/e2freefrag.8*
 
 %files devel
 %defattr(-,root,root)
Index: e2fsprogs-1.41.4/misc/Makefile.in
===================================================================
--- e2fsprogs-1.41.4.orig/misc/Makefile.in	2009-04-14 05:56:43.000000000 -0600
+++ e2fsprogs-1.41.4/misc/Makefile.in	2009-04-14 06:09:57.000000000 -0600
@@ -19,10 +19,10 @@ 
 
 SPROGS=		mke2fs badblocks tune2fs dumpe2fs $(BLKID_PROG) logsave \
 			$(E2IMAGE_PROG) @FSCK_PROG@ e2undo
-USPROGS=	mklost+found filefrag $(UUIDD_PROG)
+USPROGS=	mklost+found filefrag e2freefrag $(UUIDD_PROG)
 SMANPAGES=	tune2fs.8 mklost+found.8 mke2fs.8 dumpe2fs.8 badblocks.8 \
 			e2label.8 $(FINDFS_MAN) $(BLKID_MAN) $(E2IMAGE_MAN) \
-			logsave.8 filefrag.8 e2undo.8 $(UUIDD_MAN) @FSCK_MAN@
+			logsave.8 filefrag.8 e2freefrag.8 e2undo.8 $(UUIDD_MAN) @FSCK_MAN@
 FMANPAGES=	mke2fs.conf.5
 
 UPROGS=		chattr lsattr uuidgen
@@ -44,6 +44,7 @@ 
 BLKID_OBJS=	blkid.o
 FILEFRAG_OBJS=	filefrag.o
 E2UNDO_OBJS=  e2undo.o
+E2FREEFRAG_OBJS= e2freefrag.o
 
 PROFILED_TUNE2FS_OBJS=	profiled/tune2fs.o profiled/util.o
 PROFILED_MKLPF_OBJS=	profiled/mklost+found.o
@@ -71,7 +72,7 @@ 
 		$(srcdir)/uuidgen.c $(srcdir)/blkid.c $(srcdir)/logsave.c \
 		$(srcdir)/filefrag.c $(srcdir)/base_device.c \
 		$(srcdir)/ismounted.c $(srcdir)/../e2fsck/profile.c \
-		$(srcdir)/e2undo.c
+		$(srcdir)/e2undo.c $(srcdir)/e2freefrag.c
 
 LIBS= $(LIBEXT2FS) $(LIBCOM_ERR) 
 DEPLIBS= $(LIBEXT2FS) $(LIBCOM_ERR) 
@@ -276,6 +277,10 @@ 
 	@echo "	LD $@"
 	@$(CC) $(ALL_LDFLAGS) -g -pg -o logsave.profiled profiled/logsave.o
 
+e2freefrag: $(E2FREEFRAG_OBJS)
+	@echo "LD $@"
+	@$(CC) $(ALL_LDFLAGS) -o e2freefrag $(E2FREEFRAG_OBJS) $(LIBS)
+
 filefrag: $(FILEFRAG_OBJS)
 	@echo "	LD $@"
 	@$(CC) $(ALL_LDFLAGS) -o filefrag $(FILEFRAG_OBJS) 
@@ -361,6 +366,10 @@ 
 	@echo "	SUBST $@"
 	@$(SUBSTITUTE_UPTIME) $(srcdir)/blkid.1.in blkid.1 
 
+e2freefrag.8: $(DEP_SUBSTITUTE) $(srcdir)/e2freefrag.8.in
+	@echo "	SUBST $@"
+	@$(SUBSTITUTE_UPTIME) $(srcdir)/e2freefrag.8.in e2freefrag.8
+
 filefrag.8: $(DEP_SUBSTITUTE) $(srcdir)/filefrag.8.in
 	@echo "	SUBST $@"
 	@$(SUBSTITUTE_UPTIME) $(srcdir)/filefrag.8.in filefrag.8
@@ -522,7 +531,7 @@ 
 clean:
 	$(RM) -f $(SPROGS) $(USPROGS) $(UPROGS) $(UMANPAGES) $(SMANPAGES) \
 		$(FMANPAGES) \
-		base_device base_device.out mke2fs.static filefrag \
+		base_device base_device.out mke2fs.static filefrag e2freefrag \
 		e2initrd_helper partinfo prof_err.[ch] default_profile.c \
 		uuidd e2image tune2fs.static tst_ismounted fsck.profiled \
 		blkid.profiled tune2fs.profiled e2image.profiled \
@@ -603,6 +612,9 @@ 
 blkid.o: $(srcdir)/blkid.c $(top_srcdir)/lib/blkid/blkid.h \
  $(top_builddir)/lib/blkid/blkid_types.h
 logsave.o: $(srcdir)/logsave.c
+e2freefrag.o: $(srcdir)/e2freefrag.c e2freefrag.h \
+ $(top_srcdir)/lib/ext2fs/ext2_fs.h $(top_srcdir)/lib/ext2fs/ext2fs.h \
+ $(top_srcdir)/lib/ext2fs/bitops.h
 filefrag.o: $(srcdir)/filefrag.c
 base_device.o: $(srcdir)/base_device.c $(srcdir)/fsck.h
 ismounted.o: $(srcdir)/ismounted.c $(top_srcdir)/lib/et/com_err.h