ext4: first write to large ext3 filesystem takes 96 seconds

On Mon, Jul 07, 2014 at 11:54:05PM -0400, Theodore Ts'o wrote:
> On Mon, Jul 07, 2014 at 09:35:11PM -0400, Benjamin LaHaise wrote:
> > 
> > Sure -- I put a copy at http://www.kvack.org/~bcrl/mb_groups as it's a bit 
> > too big for the mailing list.  The filesystem in question has a couple of 
> > 11GB files on it, with the remainder of the space being taken up by files 
> > 7200016 bytes in size.  
> 
> Right, so looking at mb_groups we see a bunch of the problems.  There
> are a large number block groups which look like this:
> 
> #group: free  frags first [ 2^0   2^1   2^2   2^3   2^4   2^5   2^6   2^7   2^8   2^9   2^10  2^11  2^12  2^13  ]
> #288  : 1540  7     13056 [ 0     0     1     0     0     0     0     0     6     0     0     0     0     0     ]
> 
> It would be very interesting to see what allocation pattern resulted
> in so many block groups with this layout.  Before we read in
> allocation bitmap, all we know from the block group descriptors is
> that there are 1540 free blocks.  What we don't know is that they are
> broken up into 6 256 block free regions, plus a 4 block region.

I did have to make a change to the ext4 inode allocator to bias things 
towards allocating inodes at the beginning of the disk (see below).  
Without that change the allocation pattern of writes to the filesystem 
resulted in a significant performance regression relative to ext3, 
owing mostly to the fact that fallocate() on ext4 is unimplemented for 
indirect style metadata.  (Note that we mount the filesystem with this 
noorlov mount option.)

With that change, the workload essentially consists of writing 7200016 
files in one write() operation rotating between 100 subdirectories off 
the root of the filesystem.

> If we try to allocate a 1024 block region, we'll end up searching a
> large number of these block groups before find one which is suitable.
> 
> Or there is a large collection of block groups that look like this:
> 
> #834  : 4900  39    514   [ 0     20    5     5     16    6     4     8     6     1     1     0     0     0     ]
> 
> Similarly, we could try to look for a contiguous 2048 range, but even
> though there is 4900 blocks available, we can't tell the difference
> between something a free block layout which looks like like the above,
> versus one that looks like this:
> 
> #834  : 4900  39    514   [ 0      6    0     1     3    5     1     4     0     0     0     2     0     0     ]
> 
> We could try going straight for the largely empty block groups, but
> that's more likely to fragment the file system more quickly, and then
> once those largely empty block groups are partially used, then we'll
> end up taking a long time while we scan all of the block groups.

Fragmentation is not a significant concern for the workload in question.  
Write performance is much more important to us than read performance, and 
read performance tends to degrade to random reads owing to the fact that 
the system can have many queues (~16k) issuing reads.  Hence, getting the 
block allocator to make writes get allocated as close to sequential on 
disk as possible is an important corner of performance.  Ext4 with indirect 
blocks has a tendency to leave gaps between files, which degrades 
performance for this workload, since files tend not to be packed as closely 
together as they were with ext3.  Ext4 with extents + fallocate() packs 
files on disk without any gaps, but turning on extents is not an option 
(unfortunately, as a 20+ minute fsck time / outage as part of an upgrade is 
not viable).

		-ben

>        	      	     	  	   	- Ted
>

ext4: first write to large ext3 filesystem takes 96 seconds

Commit Message

Patch