diff mbox

[50/74] libext2fs: support allocating uninit blocks in bmap2()

Message ID 20131211012353.30655.82545.stgit@birch.djwong.org
State Superseded, archived
Headers show

Commit Message

Darrick Wong Dec. 11, 2013, 1:23 a.m. UTC
In order to support fallocate, we need to be able to have
ext2fs_bmap2() allocate blocks and put them into uninitialized
extents.  There's a flag to do this in the extent code, but it's not
exposed to the bmap2 interface, so plumb that in.  Eventually fuse2fs
or somebody will use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 lib/ext2fs/bmap.c   |   49 +++++++++++++++++++++++++++++++++++++++++++++++--
 lib/ext2fs/ext2fs.h |    1 +
 2 files changed, 48 insertions(+), 2 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Theodore Ts'o Jan. 11, 2014, 10:57 p.m. UTC | #1
On Tue, Dec 10, 2013 at 05:23:53PM -0800, Darrick J. Wong wrote:
> @@ -336,6 +370,12 @@ errcode_t ext2fs_bmap2(ext2_filsys fs, ext2_ino_t ino, struct ext2_inode *inode,
>  		goto done;
>  	}
>  
> +	if ((bmap_flags & BMAP_SET) && (bmap_flags & BMAP_UNINIT)) {
> +		retval = zero_block(fs, *phys_blk);
> +		if (retval)
> +			goto done;
> +	}
> +

We should use a new flag (say, BMAP_ZERO) if we want ext2fs_bmap2() to
zero out the data block.  Otherwise, a number of tools which are
currently using ext2fs_bmap, or debugfs "write" command to copy files
into a file system will end up doing double writes into the file
system --- once to zero the block, and a second time to write data
into said block.

The libext2fs library is designed to be used for low-level tools, so
we shouldn't presume that we should force blocks to be zero'ed unless
the application really wants it.

The other thing to note about this patch is that if you want to
implement fallocate, ext2fs_bmap2() is really the wrong tool to use.
I've been working on a program for work which pre-creates a bunch of
llarge files allocated contiguously on the disk as part of the mke2fs
process, and it turns out that if you try to allocate several
gigabytes worth of files using ext2fs_bmap2(), you end up burning a
huge amount of CPU time (as in around 30 seconds of CPU times while
fallocating a 10GB worth of blocks; so if you try to allocate a
terabyte or three worth of blocks, it would take a truly long time,
while you turn your CPU into a space heater :-).

The top profile user was update_path() in fs/ext4/extents.c, which is
caused by the very large number of extent operations that are needed
for each extent operation.  The second largest profile user is
ext2fs_crc16(), caused by the large number of calls to
ext2fs_block_alloc_stats2(), which causes the the block group
descriptors to get incremented one at a time.

What we need to do if we want create an optimized fallocate() is to
allocate blocks until we either exceed the max number of blocks in an
extent, or we get a non-contiguous allocation, and then insert the
extent into extent tree one extent at a time.  Similarly, we need to
update the block group descriptors a batched chunks, instead of after
each individual block allocation.

Similarly, as far as calling zero_block(), you really don't want to
issue each 4k write separately.

Cheers,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick Wong Jan. 15, 2014, 9:11 p.m. UTC | #2
On Sat, Jan 11, 2014 at 05:57:55PM -0500, Theodore Ts'o wrote:
> On Tue, Dec 10, 2013 at 05:23:53PM -0800, Darrick J. Wong wrote:
> > @@ -336,6 +370,12 @@ errcode_t ext2fs_bmap2(ext2_filsys fs, ext2_ino_t ino, struct ext2_inode *inode,
> >  		goto done;
> >  	}
> >  
> > +	if ((bmap_flags & BMAP_SET) && (bmap_flags & BMAP_UNINIT)) {
> > +		retval = zero_block(fs, *phys_blk);
> > +		if (retval)
> > +			goto done;
> > +	}
> > +
> 
> We should use a new flag (say, BMAP_ZERO) if we want ext2fs_bmap2() to
> zero out the data block.  Otherwise, a number of tools which are
> currently using ext2fs_bmap, or debugfs "write" command to copy files
> into a file system will end up doing double writes into the file
> system --- once to zero the block, and a second time to write data
> into said block.

Ok, I'll create a BMAP_ZERO to do this.

> The libext2fs library is designed to be used for low-level tools, so
> we shouldn't presume that we should force blocks to be zero'ed unless
> the application really wants it.
> 
> The other thing to note about this patch is that if you want to
> implement fallocate, ext2fs_bmap2() is really the wrong tool to use.
> I've been working on a program for work which pre-creates a bunch of

I think that ext2fs_fallocate would be a good addition to the library.  Is your
program far enough along to share?  fuse2fs would benefit greatly.

That said, I've also found a couple of bugs in the extent code by implementing
fallocate in such a stupid way. :)  It turns out that if (a) we need to split
an extent into three pieces (say we write to a block in the middle of an
unwritten extent and don't want to convert the whole extent) and (b) either of
the extent_insert calls requires us to split the extent block and (c) we ENOSPC
while trying to allocate a new extent block, we don't put the extent tree back
the way it was before the split, and all the blocks after that point are lost.

I will send patches to avoid this corruption by checking for enough space soon.
I think your local git tree has patches in it that aren't on kernel.org yet, so
I'll hold off until I see them show up.

Fortunately there are only 5 new patches since last month. :)

> llarge files allocated contiguously on the disk as part of the mke2fs
> process, and it turns out that if you try to allocate several
> gigabytes worth of files using ext2fs_bmap2(), you end up burning a
> huge amount of CPU time (as in around 30 seconds of CPU times while
> fallocating a 10GB worth of blocks; so if you try to allocate a
> terabyte or three worth of blocks, it would take a truly long time,
> while you turn your CPU into a space heater :-).
> 
> The top profile user was update_path() in fs/ext4/extents.c, which is
> caused by the very large number of extent operations that are needed
> for each extent operation.  The second largest profile user is
> ext2fs_crc16(), caused by the large number of calls to
> ext2fs_block_alloc_stats2(), which causes the the block group
> descriptors to get incremented one at a time.
> 
> What we need to do if we want create an optimized fallocate() is to
> allocate blocks until we either exceed the max number of blocks in an
> extent, or we get a non-contiguous allocation, and then insert the
> extent into extent tree one extent at a time.  Similarly, we need to
> update the block group descriptors a batched chunks, instead of after
> each individual block allocation.
> 
> Similarly, as far as calling zero_block(), you really don't want to
> issue each 4k write separately.

Alternately, we could simply not allow BMAP_UNINIT for non-extent files.
That's the only reason why there's any zeroing going on at all.

--D
> 
> Cheers,
> 
> 						- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Jan. 15, 2014, 10:19 p.m. UTC | #3
On Wed, Jan 15, 2014 at 01:11:22PM -0800, Darrick J. Wong wrote:
> > The other thing to note about this patch is that if you want to
> > implement fallocate, ext2fs_bmap2() is really the wrong tool to use.
> > I've been working on a program for work which pre-creates a bunch of
> 
> I think that ext2fs_fallocate would be a good addition to the library.  Is your
> program far enough along to share?  fuse2fs would benefit greatly.

An ext2fs_fallocate() is way more difficult than what I've done, since
you have to deal with all sorts of corner cases where the file has
pre-existing sparse extents, which may or may not be initialized, and
making sure that it works in that case.  Allocating blocks to a file
which you know started as a zero length file is in fact much easier.
Here are the key bits from my program:

/*
 * This should eventually be cleaned up and put into libext2fs
 * This is much faster than calling ext2fs_block_alloc_stats() for each
 * block, since it requires recalculating the bg descriptor checksum
 * for every single block that you allocate.
 */
static void ext2fs_block_alloc_stats_range(ext2_filsys fs, blk64_t blk,
					   blk_t num, int inuse)
{
	int	group;

#ifndef OMIT_COM_ERR
	if (blk + num >= ext2fs_blocks_count(fs->super)) {
		com_err("ext2fs_block_alloc_stats_range", 0,
			"Illegal block range: %llu (%u) ",
			(unsigned long long) blk, num);
		return;
	}
#endif
	if (inuse == 0)
		return;
	if (inuse > 0) {
		ext2fs_mark_block_bitmap_range2(fs->block_map, blk, num);
		inuse = 1;
	} else {
		ext2fs_unmark_block_bitmap_range2(fs->block_map, blk, num);
		inuse = -1;
	}
	while (num) {
		group = ext2fs_group_of_blk2(fs, blk);
		blk64_t last_blk = ext2fs_group_last_block2(fs, group);
		blk_t n = num;

		if (blk + num > last_blk)
			n = last_blk - blk + 1;

		ext2fs_bg_free_blocks_count_set(fs, group, 
			ext2fs_bg_free_blocks_count(fs, group) -
			inuse*n/EXT2FS_CLUSTER_RATIO(fs));
		ext2fs_bg_flags_clear(fs, group, EXT2_BG_BLOCK_UNINIT);
		ext2fs_group_desc_csum_set(fs, group);

		ext2fs_free_blocks_count_add(fs->super, -inuse * n);
		ext2fs_mark_super_dirty(fs);
		ext2fs_mark_bb_dirty(fs);
		blk += n;
		num -= n;
	}
}

/* 
 * ext2fs_allocate_tables() is not optimally allocating blocks in all
 * situations.  We need to take a look at this at some point.  For
 * now, just replace it with something simple and stupid.
 */ 
errcode_t my_allocate_tables(ext2_filsys fs)
{
	errcode_t	retval;
	dgrp_t		i;

	for (i = 0; i < fs->group_desc_count; i++) {
		retval = ext2fs_new_block2(fs, goal, NULL, &goal);
		if (retval)
			return retval;
		ext2fs_block_alloc_stats2(fs, goal, +1);
		ext2fs_block_bitmap_loc_set(fs, i, goal);
	}
	for (i = 0; i < fs->group_desc_count; i++) {
		retval = ext2fs_new_block2(fs, goal, NULL, &goal);
		if (retval)
			return retval;
		ext2fs_block_alloc_stats2(fs, goal, +1);
		ext2fs_inode_bitmap_loc_set(fs, i, goal);
	}
	for (i = 0; i < fs->group_desc_count; i++) {
		blk64_t end = ext2fs_blocks_count(fs->super) - 1;
		retval = ext2fs_get_free_blocks2(fs, goal, end,
						 fs->inode_blocks_per_group,
						 fs->block_map, &goal);
		if (retval)
			return retval;
		ext2fs_block_alloc_stats_range(fs, goal,
					       fs->inode_blocks_per_group, +1);
		ext2fs_inode_table_loc_set(fs, i, goal);
	}
	return 0;
}

/* 
 * Some of this could eventually get turned into fallocate, but that's
 * actually a much more difficult and tricking thing to implement.
 */
static errcode_t mk_hugefile(ext2_filsys fs, unsigned int num, 
			     ext2_ino_t dir, int idx, ext2_ino_t *ino)

{
	errcode_t		retval;
	blk64_t			lblk, blk, bend;
	__u64			size;
	unsigned int		i;
	struct ext2_inode	inode;
	ext2_extent_handle_t	handle;
	char			fn[32];

	retval = ext2fs_new_inode(fs, 0, LINUX_S_IFREG, NULL, ino);
	if (retval)
		return retval;

	memset(&inode, 0, sizeof(struct ext2_inode));
	inode.i_mode = LINUX_S_IFREG | 0600;
	ext2fs_iblk_set(fs, &inode, num / EXT2FS_CLUSTER_RATIO(fs));
	size = (__u64) num * fs->blocksize;
	inode.i_size = size & 0xffffffff;
	inode.i_size_high = (size >> 32);
	inode.i_links_count = 1;

	retval = ext2fs_write_new_inode(fs, *ino, &inode);
	if (retval)
		return retval;

	ext2fs_inode_alloc_stats2(fs, *ino, +1, 0);

	retval = ext2fs_extent_open2(fs, *ino, &inode, &handle);
	if (retval)
		return retval;

	{
		struct ext2_inode t;

		ext2fs_read_inode(fs, *ino, &t);
		printf("eo: i_size_high: %lu size: %llu\n", t.i_size_high,
		       EXT2_I_SIZE(&t));
	}
	lblk = 0;
	while (num) {
		blk64_t pblk, end;
		blk_t n = num;

		retval =  ext2fs_find_first_zero_block_bitmap2(fs->block_map,
			goal, ext2fs_blocks_count(fs->super) - 1, &end);
		if (retval)
			return ENOSPC;
		goal = end;

		retval =  ext2fs_find_first_set_block_bitmap2(fs->block_map, goal,
			       ext2fs_blocks_count(fs->super) - 1, &bend);
		if (bend == ENOENT)
			bend = ext2fs_blocks_count(fs->super);
		if (bend - goal < num)
			n = bend - goal;
		printf("goal %llu bend %llu num %u n %u\n", goal, bend, num, n);
		pblk = goal;
		num -= n;
		goal += n;
		ext2fs_block_alloc_stats_range(fs, pblk, n, +1);

		while (n) {
			blk_t l = n;
			struct ext2fs_extent newextent;

	{
		struct ext2_inode t;

		ext2fs_read_inode(fs, *ino, &t);
		printf("i_size_high: %lu size: %llu\n", t.i_size_high,
		       EXT2_I_SIZE(&t));
	}

			if (l > EXT_INIT_MAX_LEN)
				l = EXT_INIT_MAX_LEN;

			newextent.e_len = l;
			newextent.e_pblk = pblk;
			newextent.e_lblk = lblk;
			newextent.e_flags = 0;

			printf("inserting extent: %llu %llu %u\n", lblk, pblk, l);
			retval = ext2fs_extent_insert(handle,
					EXT2_EXTENT_INSERT_AFTER, &newextent);
			if (retval)
				return retval;
			pblk += l;
			lblk += l;
			n -= l;
		}
	}

	{
		struct ext2_inode t;

		ext2fs_read_inode(fs, *ino, &t);
		printf("i_size_high: %lu size: %llu\n", t.i_size_high,
		       EXT2_I_SIZE(&t));
	}
	sprintf(fn, "hugefile%05d", idx);
retry:
	retval = ext2fs_link(fs, dir, fn, *ino, EXT2_FT_REG_FILE);
	if (retval == EXT2_ET_DIR_NO_SPACE) {
		retval = ext2fs_expand_dir(fs, dir);
		if (retval)
			goto errout;
		goto retry;
	}

	if (retval)
		goto errout;

errout:
	if (handle)
		ext2fs_extent_free(handle);

	return retval;
}

Note that this requires some of the test patches I've been sending
out, since it uses ext2fs_find_first_{set,zero}_block_bitmap2().

There are also some bugs in the versions which I sent out; I'm working
on fixing them....

> That said, I've also found a couple of bugs in the extent code by implementing
> fallocate in such a stupid way. :)  It turns out that if (a) we need to split
> an extent into three pieces (say we write to a block in the middle of an
> unwritten extent and don't want to convert the whole extent) and (b) either of
> the extent_insert calls requires us to split the extent block and (c) we ENOSPC
> while trying to allocate a new extent block, we don't put the extent tree back
> the way it was before the split, and all the blocks after that point are lost.

Well, I found a bug in extfs_extent_insert() which showed up when I
tried to implement the block allocation in an intelligent way.  :-)
I'll send out that bug fix a bit.


> I will send patches to avoid this corruption by checking for enough space soon.
> I think your local git tree has patches in it that aren't on kernel.org yet, so
> I'll hold off until I see them show up.

Yeah, some of those patches still need some clean up, so I haven't
pushed my maint branch to kernel.org yet.

But anyway, the above code will give you an idea where I'm going ---
this is **way** faster than trying to allocate blocks using the
set_bmap() function.  :-)

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Jan. 15, 2014, 10:23 p.m. UTC | #4
On Wed, Jan 15, 2014 at 05:19:51PM -0500, Theodore Ts'o wrote:
> 	{
> 		struct ext2_inode t;
> 
> 		ext2fs_read_inode(fs, *ino, &t);
> 		printf("i_size_high: %lu size: %llu\n", t.i_size_high,
> 		       EXT2_I_SIZE(&t));
> 	}

Oops, ignore this debugging code.  This was to find a bug in
ext2fs_extents_insert() where under some circumstances it end up
running into a buffer overrun bug which causes it to wipe out
i_size_high.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/lib/ext2fs/bmap.c b/lib/ext2fs/bmap.c
index 3a18d76..ca89eb1 100644
--- a/lib/ext2fs/bmap.c
+++ b/lib/ext2fs/bmap.c
@@ -33,6 +33,32 @@  extern errcode_t ext2fs_bmap(ext2_filsys fs, ext2_ino_t ino,
 
 #define inode_bmap(inode, nr) ((inode)->i_block[(nr)])
 
+static errcode_t zero_block(ext2_filsys fs, blk64_t blk)
+{
+	void *b;
+	errcode_t retval;
+
+	if (io_channel_discard_zeroes_data(fs->io)) {
+		retval = io_channel_discard(fs->io, blk, 1);
+		if (retval == 0)
+			return 0;
+	}
+
+	retval = ext2fs_get_memzero(fs->blocksize, &b);
+	if (retval)
+		return retval;
+
+	memset(b, 0, fs->blocksize);
+
+	retval = io_channel_write_blk64(fs->io, blk, 1, b);
+	if (retval)
+		goto out;
+
+out:
+	ext2fs_free_mem(&b);
+	return retval;
+}
+
 static _BMAP_INLINE_ errcode_t block_ind_bmap(ext2_filsys fs, int flags,
 					      blk_t ind, char *block_buf,
 					      int *blocks_alloc,
@@ -72,6 +98,11 @@  static _BMAP_INLINE_ errcode_t block_ind_bmap(ext2_filsys fs, int flags,
 					    block_buf + fs->blocksize, &b);
 		if (retval)
 			return retval;
+		if (flags & BMAP_UNINIT) {
+			retval = zero_block(fs, b);
+			if (retval)
+				return retval;
+		}
 
 #ifdef WORDS_BIGENDIAN
 		((blk_t *) block_buf)[nr] = ext2fs_swab32(b);
@@ -214,10 +245,13 @@  static errcode_t extent_bmap(ext2_filsys fs, ext2_ino_t ino,
 	errcode_t		retval = 0;
 	blk64_t			blk64 = 0;
 	int			alloc = 0;
+	int			set_flags;
+
+	set_flags = bmap_flags & BMAP_UNINIT ? EXT2_EXTENT_SET_BMAP_UNINIT : 0;
 
 	if (bmap_flags & BMAP_SET) {
 		retval = ext2fs_extent_set_bmap(handle, block,
-						*phys_blk, 0);
+						*phys_blk, set_flags);
 		return retval;
 	}
 	retval = ext2fs_extent_goto(handle, block);
@@ -254,7 +288,7 @@  got_block:
 		alloc++;
 	set_extent:
 		retval = ext2fs_extent_set_bmap(handle, block,
-						blk64, 0);
+						blk64, set_flags);
 		if (retval)
 			return retval;
 		/* Update inode after setting extent */
@@ -336,6 +370,12 @@  errcode_t ext2fs_bmap2(ext2_filsys fs, ext2_ino_t ino, struct ext2_inode *inode,
 		goto done;
 	}
 
+	if ((bmap_flags & BMAP_SET) && (bmap_flags & BMAP_UNINIT)) {
+		retval = zero_block(fs, *phys_blk);
+		if (retval)
+			goto done;
+	}
+
 	if (block < EXT2_NDIR_BLOCKS) {
 		if (bmap_flags & BMAP_SET) {
 			b = *phys_blk;
@@ -351,6 +391,11 @@  errcode_t ext2fs_bmap2(ext2_filsys fs, ext2_ino_t ino, struct ext2_inode *inode,
 			retval = ext2fs_alloc_block(fs, b, block_buf, &b);
 			if (retval)
 				goto done;
+			if (bmap_flags & BMAP_UNINIT) {
+				retval = zero_block(fs, b);
+				if (retval)
+					goto done;
+			}
 			inode_bmap(inode, block) = b;
 			blocks_alloc++;
 			*phys_blk = b;
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index da518df..316e6f5 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -525,6 +525,7 @@  typedef struct ext2_icount *ext2_icount_t;
  */
 #define BMAP_ALLOC	0x0001
 #define BMAP_SET	0x0002
+#define BMAP_UNINIT	0x0004
 
 /*
  * Returned flags from ext2fs_bmap