
ext4: online defrag: Enable to reuse blocks by multiple defrag

Message ID 493DD75D.60504@rs.jp.nec.com
State Superseded, archived

Commit Message

Akira Fujita Dec. 9, 2008, 2:26 a.m. UTC
ext4: online defrag: Enable to reuse blocks by multiple defrag.

From: Akira Fujita <a-fujita@rs.jp.nec.com>

This patch is for defrag ver0.97 in the ext4 patch queue.
If the journal is not in writeback mode, commit the transaction
before block allocation so that blocks released by a previous defrag
can be reused.

I'm redesigning ext4 online defrag based on the comments from Ted.
Probably defrag's block allocation method will be changed greatly.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
---
 defrag.c |   22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

Comments

Theodore Ts'o Dec. 9, 2008, 5:47 a.m. UTC | #1
On Tue, Dec 09, 2008 at 11:26:37AM +0900, Akira Fujita wrote:
> I'm redesigning ext4 online defrag based on the comments from Ted.
> Probably defrag's block allocation method will be changed greatly.

Akira-san,

FYI, there was a discussion about defrag on today's ext4 call.  One of
the ideas that was kicked around was to completely change the
primitives used by defrag, and to design things around three
primitive, general purpose interfaces.

We didn't go into complete detail on the call, but let me give you a
strawman proposal for consideration/discussion:

(1) An (ioctl-based) interface which allows a privileged program to
specify one or more ranges of blocks which the filesystem's block
allocator must NOT allocate from.  (We may want to have a flag for
each block range which makes the block lockout either advisory, such
that if the block allocator can't find blocks anywhere else, it may
invade the reserved block area --- or mandatory, where if there are no
other blocks, the filesystem returns ENOSPC).  This allows the
defragmenter to work on an area of the disk without worrying about
concurrent allocations by other processes getting in the way.

(2) An (ioctl-based) interface which associates with an inode
preferred range(s) of blocks which the block allocator will try using
first; if those blocks are not available, or the block range(s) is
exhausted, the block allocator uses its normal algorithms to pick the
best available block.  The set of preferred blocks is only guaranteed
to persist while the inode is in memory.

(3) An (ioctl-based) interface which takes two inode numbers, and
allows a privileged program to "defrag" an inode by using blocks from
a donor inode and using them as the new blocks for the destination
inode, preserving the contents of the destination inode.
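
To make the advisory/mandatory distinction in (1) concrete, here is a
minimal user-space model in C. It is only a sketch, not kernel code: the
names lockout_mode, block_range and pick_block are hypothetical, and the
free-block map is assumed to be a plain byte array (1 = free). The
proposal above does not define any of these details.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-range flag modelled on proposal (1). */
enum lockout_mode { LOCKOUT_ADVISORY, LOCKOUT_MANDATORY };

/* Hypothetical description of one locked-out block range. */
struct block_range {
	uint64_t start;
	uint64_t len;
	enum lockout_mode mode;
};

/*
 * Return 1 if blk lies inside any locked range. *mandatory is set to 1
 * if at least one of the covering ranges is a mandatory lockout.
 */
static int is_locked(const struct block_range *r, size_t n,
		     uint64_t blk, int *mandatory)
{
	int locked = 0;

	*mandatory = 0;
	for (size_t i = 0; i < n; i++) {
		if (blk >= r[i].start && blk - r[i].start < r[i].len) {
			locked = 1;
			if (r[i].mode == LOCKOUT_MANDATORY)
				*mandatory = 1;
		}
	}
	return locked;
}

/*
 * Pick one free block. Blocks outside every locked range are tried
 * first; advisory ranges are invaded only when nothing else is free;
 * mandatory ranges are never used, so the result may be -ENOSPC.
 */
int pick_block(const uint8_t *free_map, uint64_t nblocks,
	       const struct block_range *locked, size_t nlocked,
	       uint64_t *out)
{
	int mandatory;

	/* Pass 1: free blocks that no lockout range covers. */
	for (uint64_t b = 0; b < nblocks; b++) {
		if (free_map[b] && !is_locked(locked, nlocked, b, &mandatory)) {
			*out = b;
			return 0;
		}
	}
	/* Pass 2: free blocks covered only by advisory lockouts. */
	for (uint64_t b = 0; b < nblocks; b++) {
		if (free_map[b] && is_locked(locked, nlocked, b, &mandatory) &&
		    !mandatory) {
			*out = b;
			return 0;
		}
	}
	return -ENOSPC;
}
```

With only blocks 2 and 3 free and a lockout covering both, the advisory
case hands out block 2, while the mandatory case returns -ENOSPC until
some block outside the range becomes free.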

The advantage of this implementation strategy is that each of the
interfaces can be implemented one at a time, with very well defined
semantics, and which can be independently tested.  The semantics can
also be used in different combinations to solve alternate problems.
For example, a combination of (1) and (2) can be used to reserve
blocks for use by a directory that is expected to grow, so the
directory can use contiguous blocks.  Or, they could be used to
implement an "online shrink" that would allow a filesystem to be
resized to a smaller size.

One other thing that comes to mind.  If it turns out that these
interfaces have multiple users, and in some cases the reservations or
block allocation restrictions are expected to last for longer than a
process lifetime, it may be useful to tag them with a short (8-16
character) name, so that it is possible to list the current set of
reservations, and so they can be removed by a privileged user.  This
could be overdesigning the interface; but the whole *point* of
thinking about the interfaces from a more generic point of view (as
opposed to use by a specific program for which the kernel interfaces
are custom-designed) is that hopefully they will have multiple use
cases and multiple users, in which case we need to worry about how
multiple users can co-exist.

Thoughts, comments?

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Akira Fujita Dec. 10, 2008, 8 a.m. UTC | #2
Hi Ted,

Thank you for letting me know.
I think the new defrag can be implemented with your proposal.
First, I plan to implement ordinary defrag (without any options)
in the following steps.
Please check whether my approach is fine.

(U: user space, K: kernel)
1:U  Create the donor inode and then unlink it.

2:U  Allocate contiguous blocks to the donor inode with fallocate().

3:U  Call the FS_IOC_FIEMAP ioctl to get the extent information of the
      donor inode, and check that the donor inode has fewer extents than
      the defrag target inode.

4:U  Call the EXT4_IOC_DEFRAG ioctl to exchange the data between
      the target inode and the donor inode.

5:K  The EXT4_IOC_DEFRAG ioctl calls ext4_defrag() in the kernel
      (I'm going to change the current ext4_defrag() to do only the
      data exchange).

* Steps 4 and 5 correspond to Ted's (3) ioctl.

6:U  Close fd of donor inode.
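
The user-space steps 1 to 3 above could be sketched roughly as follows.
This is an illustration under assumptions, not the defrag tool's code:
the helper names (create_donor, count_extents, donor_is_less_fragmented)
are hypothetical, posix_fallocate() stands in for fallocate() so that no
feature-test macro is needed, and a FIEMAP call with fm_extent_count set
to 0 is used only to obtain the number of extents.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>  /* struct fiemap, FIEMAP_* */

/*
 * Steps 1 and 2: create the donor file, unlink it right away so it
 * disappears when the fd is closed (step 6), then preallocate blocks.
 * Returns an open fd, or a negative errno value.
 */
int create_donor(const char *path, off_t size)
{
	int err;
	int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

	if (fd < 0)
		return -errno;
	if (unlink(path) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	/* posix_fallocate() returns the error number directly. */
	err = posix_fallocate(fd, 0, size);
	if (err) {
		close(fd);
		return -err;
	}
	return fd;
}

/*
 * Step 3: ask FIEMAP how many extents map the file. With
 * fm_extent_count == 0 the kernel fills in fm_mapped_extents without
 * copying any extent records. Returns the count or a negative errno.
 */
long count_extents(int fd)
{
	struct fiemap fm;

	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;
	fm.fm_flags = FIEMAP_FLAG_SYNC;
	fm.fm_extent_count = 0;
	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
		return -errno;
	return (long)fm.fm_mapped_extents;
}

/* Step 3 check: only proceed if the donor has fewer extents. */
int donor_is_less_fragmented(long donor_extents, long target_extents)
{
	return donor_extents > 0 && donor_extents < target_extents;
}
```

A caller would compare count_extents() of the donor fd and the target fd
with donor_is_less_fragmented() before going on to step 4, and skip the
file otherwise.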


The new EXT4_IOC_DEFRAG would be implemented as follows.

#define EXT4_IOC_DEFRAG                 _IOW('f', 15, struct move_extent)

struct move_extent {
	int org_fd;		/* file descriptor of defrag target file */
	int dest_fd;		/* file descriptor of donor file */
	long long start;	/* logical block offset of target file */
	long long len;		/* exchange data length in blocks */
};
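
As a rough illustration of how user space might fill in this structure,
the sketch below converts a byte range into the block-based start/len
fields and issues the ioctl. The structure and ioctl number are copied
from the proposal above; the helper names make_move_extent and
defrag_range, and the choice to issue the ioctl on the target file's
descriptor, are assumptions made for this example. A kernel that does not
implement the proposed ioctl will simply fail the call.

```c
#include <assert.h>
#include <sys/ioctl.h>

/* Proposed argument structure, as given in this thread. */
struct move_extent {
	int org_fd;		/* file descriptor of defrag target file */
	int dest_fd;		/* file descriptor of donor file */
	long long start;	/* logical block offset of target file */
	long long len;		/* exchange data length in blocks */
};

#define EXT4_IOC_DEFRAG	_IOW('f', 15, struct move_extent)

/*
 * Hypothetical helper: describe the blocks that cover the byte range
 * [byte_off, byte_off + byte_len) of the target file.
 */
struct move_extent make_move_extent(int org_fd, int dest_fd,
				    long long byte_off, long long byte_len,
				    long long blocksize)
{
	struct move_extent me;
	long long first, last;

	assert(blocksize > 0 && byte_off >= 0 && byte_len > 0);
	first = byte_off / blocksize;			/* first block touched */
	last = (byte_off + byte_len - 1) / blocksize;	/* last block touched */

	me.org_fd = org_fd;
	me.dest_fd = dest_fd;
	me.start = first;
	me.len = last - first + 1;
	return me;
}

/*
 * Hypothetical helper: issue the proposed ioctl on the target file
 * (assumption). Returns 0 on success or -1 with errno set.
 */
int defrag_range(const struct move_extent *me)
{
	return ioctl(me->org_fd, EXT4_IOC_DEFRAG, me);
}
```

For example, with a 4096-byte block size, a range of 4096 bytes starting
at byte offset 100 touches blocks 0 and 1, so start is 0 and len is 2.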


The defrag -r and -f options can also be implemented with (1) and (2)
from your previous post.  I will address them after implementing ordinary
defrag.


Regards,
Akira Fujita

SandeepKsinha Jan. 6, 2010, 5:06 a.m. UTC | #3
Hi Ted,

Sorry for the late reply...


tytso wrote:
> 
> 
> On Tue, Dec 09, 2008 at 11:26:37AM +0900, Akira Fujita wrote:
>> I'm redesigning ext4 online defrag based on the comments from Ted.
>> Probably defrag's block allocation method will be changed greatly.
> 
> Akira-san,
> 
> FYI, there was a discussion about defrag on today's ext4 call.  One of
> the ideas that was kicked around was to completely change the
> primitives used by defrag, and to design things around three
> primitive, general purpose interfaces.
> 
> We didn't go into complete detail on the call, but let me give you a
> strawman proposal for consideration/discussion:
> 
> (1) An (ioctl-based) interface which allows a privileged program to
> specify one or more ranges of blocks which the filesystem's block
> allocator must NOT allocate from.  (We may want to have a flag for
> each block range which makes the block lockout either advisory, such
> that if the block allocator can't find blocks anywhere else, it may
> invade the reserved block area --- or mandatory, where if there are no
> other blocks, the filesystem returns ENOSPC).  This allows the
> defragmenter to work on an area of the disk without worrying about
> concurrent allocations by other processes getting in the way.
>
> (2) An (ioctl-based) interface which associates with an inode
> preferred range(s) of blocks which the block allocator will try using
> first; if those blocks are not available, or the block range(s) is
> exhausted, the block allocator uses its normal algorithms to pick the
> best available block.  The set of preferred blocks is only guaranteed
> to persist while the inode is in memory.
> 

What exactly do we mean by the preferred range(s) of blocks here?
A couple of months back, in a similar context, Akira submitted a patch,
and it was suggested that he rework it on top of the existing PA
(preallocation) mechanism.

In the case of PA, the allocation must start strictly from pa_pstart,
which does not truly amount to preferred block RANGES. Consider a case
where inode A has a PA with block range = {100, 500}, and blocks
{100, 300} are later allocated to some inode B.

When a block allocation request for 150 blocks then comes in for inode A,
the allocation through PA would fail, even though blocks {301-500} in the
PA range are free.
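
The scenario above can be written down as a small toy model in C. It
encodes only the concern as stated in this message (an allocation tied to
pa_pstart versus a search over the whole range) and is not a model of the
real ext4 allocator; toy_pa, alloc_strict and alloc_in_range are
hypothetical names, and the used-block map is a plain byte array
(1 = in use).

```c
#include <stdint.h>

/* Toy stand-in for a preallocation: blocks pstart .. pstart + len - 1. */
struct toy_pa {
	uint64_t pstart;
	uint64_t len;
};

/* Return 1 if all of blocks start .. start + count - 1 are unused. */
static int range_free(const uint8_t *used, uint64_t start, uint64_t count)
{
	for (uint64_t b = start; b < start + count; b++)
		if (used[b])
			return 0;
	return 1;
}

/*
 * "Strict" behaviour: the allocation must begin at pstart, so it fails
 * as soon as the head of the range is occupied. Returns the first
 * allocated block, or -1.
 */
int64_t alloc_strict(const uint8_t *used, const struct toy_pa *pa,
		     uint64_t count)
{
	if (count > pa->len || !range_free(used, pa->pstart, count))
		return -1;
	return (int64_t)pa->pstart;
}

/*
 * "Preferred range" behaviour: first fit anywhere inside the range.
 * Returns the first block of a free run of count blocks, or -1.
 */
int64_t alloc_in_range(const uint8_t *used, const struct toy_pa *pa,
		       uint64_t count)
{
	uint64_t run = 0;

	for (uint64_t b = pa->pstart; b < pa->pstart + pa->len; b++) {
		run = used[b] ? 0 : run + 1;	/* length of current free run */
		if (run == count)
			return (int64_t)(b - count + 1);
	}
	return -1;
}
```

With the range {100, 500} and blocks 100 to 300 marked as used, a request
for 150 blocks fails in alloc_strict() but succeeds in alloc_in_range(),
starting at block 301.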

So, if we implement the above-mentioned ioctl-based interface, will it truly
serve the purpose? Or how about thinking of it along the lines of preferred
block group ranges, which should serve the above-mentioned purpose?

My understanding is that PAs are to be used purely by and within the block
allocator, to make the best judgement when deciding the preferred block
ranges (to avoid fragmentation), and should not be allowed to be
manipulated through any ioctl-based or other external interface.




Patch

diff -Nrup -X linux-2.6.28-rc6-a/Documentation/dontdiff linux-2.6.28-rc6-a/fs/ext4/defrag.c linux-2.6.28-rc6-b/fs/ext4/defrag.c
--- linux-2.6.28-rc6-a/fs/ext4/defrag.c	2008-12-05 15:14:58.000000000 +0900
+++ linux-2.6.28-rc6-b/fs/ext4/defrag.c	2008-12-05 15:56:26.000000000 +0900
@@ -1279,7 +1279,6 @@  ext4_defrag_new_extent_tree(struct inode
 			ext4_fsblk_t goal, int phase)
 {
 	handle_t *handle;
-	struct ext4_sb_info *sbi = EXT4_SB(org_inode->i_sb);
 	struct ext4_extent_header *eh = NULL;
 	struct ext4_allocation_request ar;
 	struct ext4_ext_path *dest_path = NULL;
@@ -1287,7 +1286,7 @@  ext4_defrag_new_extent_tree(struct inode
 	ext4_fsblk_t alloc_total = 0;
 	ext4_fsblk_t newblock = 0;
 	ext4_lblk_t req_end = req_start + req_blocks - 1;
-	ext4_lblk_t rest_blocks = 0;
+	ext4_lblk_t rest_blocks = req_blocks;
 	ext4_group_t dest_group_no, goal_group_no;
 	ext4_grpblk_t dest_blk_off, goal_blk_off;
 	int sum_tmp = 0;
@@ -1308,6 +1307,17 @@  ext4_defrag_new_extent_tree(struct inode
 	ext4_defrag_fill_ar(org_inode, tmp_inode, &ar, org_path,
 				dest_path, req_blocks, iblock, goal, phase);

+	/*
+	 * If journal is not writeback mode,
+	 * commit the transaction to reuse the blocks
+	 * which previous defrag released.
+	 */
+	if (!ext4_should_writeback_data(org_inode)) {
+		ret = ext4_force_commit(org_inode->i_sb);
+		if (ret)
+			goto out2;
+	}
+
 	handle = ext4_journal_start(tmp_inode, 0);
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
@@ -1323,8 +1333,6 @@  ext4_defrag_new_extent_tree(struct inode
 						&ar, dest_path, &newblock);
 		if (ret < 0)
 			goto out;
-		/* Claimed blocks are already reserved */
-		EXT4_I(ar.inode)->i_delalloc_reserved_flag = 1;

 		ext4_get_group_no_and_offset(tmp_inode->i_sb, newblock,
 					&dest_group_no, &dest_blk_off);
@@ -1362,12 +1370,6 @@  ext4_defrag_new_extent_tree(struct inode
 out:
 	if (ret < 0 && ar.len)
 		ext4_free_blocks(handle, tmp_inode, newblock, ar.len, metadata);
-	/*
-	 * Update dirty-blocks counter if we cannot allocate the all of
-	 * requested blocks.
-	 */
-	if (rest_blocks)
-		percpu_counter_sub(&sbi->s_dirtyblocks_counter, rest_blocks);

 	ext4_journal_stop(handle);