mbox series

[RFC,0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.

Message ID 158272427715.281342.10873281294835953645.stgit@localhost.localdomain
Headers show
Series fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag. | expand

Message

Kirill Tkhai Feb. 26, 2020, 1:40 p.m. UTC
When discard granuality of a block device is bigger than filesystem block size,
fstrim does not effectively release device blocks. During the filesystem life,
some files become deleted, some remain alive, and this results in that many
device blocks are used incomletely (of course, the reason is not only in this,
but since this is not a problem of a filesystem, this is not a subject
of the patchset). This results in space lose for thin provisioning devices.

Say, a filesystem on a block device, which is provided by another filesystem
(say, distributed network filesystem). Semi-used blocks of the block device
result in bad performance and worse space usage of underlining filesystem.
Another example is ext4 with 4k block on loop on ext4 with 1m block. This
case also results in bad disk space usage.

Choosing a bigger block size is not a solution here, since small files become
taking much more disk space, than they used before, and the result excess
disk usage is the same.

The proposed solution is defragmentation of files based on block device
discard granuality knowledge. Files, which were not modified for a long time,
and read-only files, small files, etc, may be placed in the same block device
block together. I.e., compaction of some device blocks, which results
in releasing another device blocks.

The problem is current fallocate() does not allow to implement effective
way for such the defragmentation. The below describes the situation for ext4,
but this should touch all filesystems.

fallocate() goes thru standard blocks allocator, which try to behave very
good for life allocation cases: block placement and future file size
prediction, delayed blocks allocation, etc. But it almost impossible
to allocate blocks from specified place for our specific case. The only
ext4 block allocator option possible to use is that the allocator firstly
tries to allocate blocks from the same block group, that inode is related to.
But this is not enough for effective files compaction.

This patchset implements an extension of fallocate():

	fallocate2(int fd, int mode, loff_t offset, loff_t len,
		   unsigned long long physical)

The new argument is @physical offset from start of device, which is must
for block allocation. In case of [@physical, @physical + len] block range
is available for allocation, the syscall assigns the corresponding extent/
extents to inode. In case of the range or its part is occupied, the syscall
returns with error (maybe, smaller range will be allocated. The behavior
is the same as when fallocate() meets no space in the middle).

This interface allows to solve the formulated problem. Also, note, this
interface may allow to improve existing e4defrag algorithm: decrease
number of file extents more effective.

[1-2/5] are refactoring.
[3/5] adds fallocate2() body.
[4/5] prepares ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY
[5/5] adds fallocate2() support for ext4

Any comments are welcomed!

---

Kirill Tkhai (5):
      fs: Add new argument to file_operations::fallocate()
      fs: Add new argument to vfs_fallocate()
      fs: Add fallocate2() syscall
      ext4: Prepare ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY
      ext4: Add fallocate2() support


 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 arch/x86/ia32/sys_ia32.c               |   10 +++++++
 drivers/block/loop.c                   |    2 +
 drivers/nvme/target/io-cmd-file.c      |    4 +--
 drivers/staging/android/ashmem.c       |    2 +
 drivers/target/target_core_file.c      |    2 +
 fs/block_dev.c                         |    4 +--
 fs/btrfs/file.c                        |    4 ++-
 fs/ceph/file.c                         |    5 +++-
 fs/cifs/cifsfs.c                       |    7 +++--
 fs/cifs/smb2ops.c                      |    5 +++-
 fs/ext4/ext4.h                         |    5 +++-
 fs/ext4/extents.c                      |   35 ++++++++++++++++++++-----
 fs/ext4/inode.c                        |   14 ++++++++++
 fs/ext4/mballoc.c                      |   45 +++++++++++++++++++++++++-------
 fs/f2fs/file.c                         |    4 ++-
 fs/fat/file.c                          |    7 ++++-
 fs/fuse/file.c                         |    5 +++-
 fs/gfs2/file.c                         |    5 +++-
 fs/hugetlbfs/inode.c                   |    5 +++-
 fs/io_uring.c                          |    2 +
 fs/ioctl.c                             |    5 ++--
 fs/nfs/nfs4file.c                      |    6 ++++
 fs/nfsd/vfs.c                          |    2 +
 fs/ocfs2/file.c                        |    4 ++-
 fs/open.c                              |   21 +++++++++++----
 fs/overlayfs/file.c                    |    8 ++++--
 fs/xfs/xfs_file.c                      |    5 +++-
 include/linux/fs.h                     |    4 +--
 include/linux/syscalls.h               |    8 +++++-
 ipc/shm.c                              |    6 ++--
 mm/madvise.c                           |    2 +
 mm/shmem.c                             |    4 ++-
 34 files changed, 190 insertions(+), 59 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

Comments

Ritesh Harjani Feb. 27, 2020, 10:39 a.m. UTC | #1
> fallocate() goes thru standard blocks allocator, which try to behave very
> good for life allocation cases: block placement and future file size
> prediction, delayed blocks allocation, etc. But it almost impossible
> to allocate blocks from specified place for our specific case. The only
> ext4 block allocator option possible to use is that the allocator firstly
> tries to allocate blocks from the same block group, that inode is related to.
> But this is not enough for effective files compaction.
> 
> This patchset implements an extension of fallocate():
> 
> 	fallocate2(int fd, int mode, loff_t offset, loff_t len,
> 		   unsigned long long physical)
> 
> The new argument is @physical offset from start of device, which is must
> for block allocation. In case of [@physical, @physical + len] block range
> is available for allocation, the syscall assigns the corresponding extent/
> extents to inode. In case of the range or its part is occupied, the syscall
> returns with error (maybe, smaller range will be allocated. The behavior
> is the same as when fallocate() meets no space in the middle).

Doesn't this interface kills the whole philosophy of letting filesystems
to decide which block it sees as most fit for allocation. IMHO user
passing over actual physical location from where the FS should allocate,
does not sound like a good interface.


-ritesh
xiaohui li Feb. 28, 2020, 7:07 a.m. UTC | #2
hi Kirill Tkhai:

I agree with your idea very much.
I had also implemented a similar fallocate interface with the unique
flag which call tell ext4 filesystems to allocate the special free
physical extents.
I had done the same work as your fallocate2 work last year.

but i think it has modified the core ext4 physical blocks allocator a
lot, so the ext4 community may not accept it.
so I didn't share it openly.

but i think this fallocate2 interface just same as my work is also
very useful in android mobile phone world.

today android phone has large capacity memory and storage, just the
same as a computer.
and current days, customer treat phone and make full use of it by just
the same way as they treat computer in past days.
so after install many software and unstall them for many times in
android phone,  the ext4 physical layout on disk has become more
fragmented
(the number of extents per group is large).
at this moment, when run a sequential write workload, you will find
the sequential write performance is very bad.

but after do defragment work on some fragmented ext4 groups, and then
run the same workload again, the sequential write performance has
improved greatly.
and the fallocate2 interface is a necessary component of this
defragment work shown above.

so i also think this fallocate2 interface is an important and useful
tools for ext4 filesystems.









On Wed, Feb 26, 2020 at 9:41 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> When discard granuality of a block device is bigger than filesystem block size,
> fstrim does not effectively release device blocks. During the filesystem life,
> some files become deleted, some remain alive, and this results in that many
> device blocks are used incomletely (of course, the reason is not only in this,
> but since this is not a problem of a filesystem, this is not a subject
> of the patchset). This results in space lose for thin provisioning devices.
>
> Say, a filesystem on a block device, which is provided by another filesystem
> (say, distributed network filesystem). Semi-used blocks of the block device
> result in bad performance and worse space usage of underlining filesystem.
> Another example is ext4 with 4k block on loop on ext4 with 1m block. This
> case also results in bad disk space usage.
>
> Choosing a bigger block size is not a solution here, since small files become
> taking much more disk space, than they used before, and the result excess
> disk usage is the same.
>
> The proposed solution is defragmentation of files based on block device
> discard granuality knowledge. Files, which were not modified for a long time,
> and read-only files, small files, etc, may be placed in the same block device
> block together. I.e., compaction of some device blocks, which results
> in releasing another device blocks.
>
> The problem is current fallocate() does not allow to implement effective
> way for such the defragmentation. The below describes the situation for ext4,
> but this should touch all filesystems.
>
> fallocate() goes thru standard blocks allocator, which try to behave very
> good for life allocation cases: block placement and future file size
> prediction, delayed blocks allocation, etc. But it almost impossible
> to allocate blocks from specified place for our specific case. The only
> ext4 block allocator option possible to use is that the allocator firstly
> tries to allocate blocks from the same block group, that inode is related to.
> But this is not enough for effective files compaction.
>
> This patchset implements an extension of fallocate():
>
>         fallocate2(int fd, int mode, loff_t offset, loff_t len,
>                    unsigned long long physical)
>
> The new argument is @physical offset from start of device, which is must
> for block allocation. In case of [@physical, @physical + len] block range
> is available for allocation, the syscall assigns the corresponding extent/
> extents to inode. In case of the range or its part is occupied, the syscall
> returns with error (maybe, smaller range will be allocated. The behavior
> is the same as when fallocate() meets no space in the middle).
>
> This interface allows to solve the formulated problem. Also, note, this
> interface may allow to improve existing e4defrag algorithm: decrease
> number of file extents more effective.
>
> [1-2/5] are refactoring.
> [3/5] adds fallocate2() body.
> [4/5] prepares ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY
> [5/5] adds fallocate2() support for ext4
>
> Any comments are welcomed!
>
> ---
>
> Kirill Tkhai (5):
>       fs: Add new argument to file_operations::fallocate()
>       fs: Add new argument to vfs_fallocate()
>       fs: Add fallocate2() syscall
>       ext4: Prepare ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY
>       ext4: Add fallocate2() support
>
>
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 +
>  arch/x86/ia32/sys_ia32.c               |   10 +++++++
>  drivers/block/loop.c                   |    2 +
>  drivers/nvme/target/io-cmd-file.c      |    4 +--
>  drivers/staging/android/ashmem.c       |    2 +
>  drivers/target/target_core_file.c      |    2 +
>  fs/block_dev.c                         |    4 +--
>  fs/btrfs/file.c                        |    4 ++-
>  fs/ceph/file.c                         |    5 +++-
>  fs/cifs/cifsfs.c                       |    7 +++--
>  fs/cifs/smb2ops.c                      |    5 +++-
>  fs/ext4/ext4.h                         |    5 +++-
>  fs/ext4/extents.c                      |   35 ++++++++++++++++++++-----
>  fs/ext4/inode.c                        |   14 ++++++++++
>  fs/ext4/mballoc.c                      |   45 +++++++++++++++++++++++++-------
>  fs/f2fs/file.c                         |    4 ++-
>  fs/fat/file.c                          |    7 ++++-
>  fs/fuse/file.c                         |    5 +++-
>  fs/gfs2/file.c                         |    5 +++-
>  fs/hugetlbfs/inode.c                   |    5 +++-
>  fs/io_uring.c                          |    2 +
>  fs/ioctl.c                             |    5 ++--
>  fs/nfs/nfs4file.c                      |    6 ++++
>  fs/nfsd/vfs.c                          |    2 +
>  fs/ocfs2/file.c                        |    4 ++-
>  fs/open.c                              |   21 +++++++++++----
>  fs/overlayfs/file.c                    |    8 ++++--
>  fs/xfs/xfs_file.c                      |    5 +++-
>  include/linux/fs.h                     |    4 +--
>  include/linux/syscalls.h               |    8 +++++-
>  ipc/shm.c                              |    6 ++--
>  mm/madvise.c                           |    2 +
>  mm/shmem.c                             |    4 ++-
>  34 files changed, 190 insertions(+), 59 deletions(-)
>
> --
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>
Kirill Tkhai Feb. 28, 2020, 12:46 p.m. UTC | #3
Hi, Xiaohui,

On 28.02.2020 10:07, xiaohui li wrote:
> hi Kirill Tkhai:
> 
> I agree with your idea very much.
> I had also implemented a similar fallocate interface with the unique
> flag which call tell ext4 filesystems to allocate the special free
> physical extents.
> I had done the same work as your fallocate2 work last year.
> 
> but i think it has modified the core ext4 physical blocks allocator a
> lot, so the ext4 community may not accept it.
> so I didn't share it openly.
> 
> but i think this fallocate2 interface just same as my work is also
> very useful in android mobile phone world.
> 
> today android phone has large capacity memory and storage, just the
> same as a computer.
> and current days, customer treat phone and make full use of it by just
> the same way as they treat computer in past days.
> so after install many software and unstall them for many times in
> android phone,  the ext4 physical layout on disk has become more
> fragmented
> (the number of extents per group is large).
> at this moment, when run a sequential write workload, you will find
> the sequential write performance is very bad.
> 
> but after do defragment work on some fragmented ext4 groups, and then
> run the same workload again, the sequential write performance has
> improved greatly.
> and the fallocate2 interface is a necessary component of this
> defragment work shown above.
> 
> so i also think this fallocate2 interface is an important and useful
> tools for ext4 filesystems.

I hope fs people in their comments will help to formulate an interface,
which is suitable fs philosophy and for our necessities.

Do you have your defrag util hosted on some of github repositories?

Kirill
Theodore Ts'o March 2, 2020, 4:56 p.m. UTC | #4
Kirill,

In a couple of your comments on this patch series, you mentioned
"defragmentation".  Is that because you're trying to use this as part
of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?

If that's the case, you should note that input parameter for that
ioctl is:

struct move_extent {
	__u32 reserved;		/* should be zero */
	__u32 donor_fd;		/* donor file descriptor */
	__u64 orig_start;	/* logical start offset in block for orig */
	__u64 donor_start;	/* logical start offset in block for donor */
	__u64 len;		/* block length to be moved */
	__u64 moved_len;	/* moved block length */
};

Note that the donor_start is separate from the start of the file that
is being defragged.  So you could have the userspace application
fallocate a large chunk of space for that donor file, and then use
that donor file to defrag multiple files if you want to close pack
them.

Many years ago, back when LSF/MM colocated with a larger
storage-focused conference so we could manage to origanize an ext4
developer's workshop, we had talked about ways we create kernel
support for a more powerful userspace defragger, which could also
defragment the free space, so that future block allocations were more
likely to be successful.

The discussions surrounded interfaces where userspace could block (or
at least strongly dissuade unless the only other alternative was
returning ENOSPC) the kernel from allocating out of a certain number
of block groups.  And then also to have an interface where for a
particular process (namely, the defragger), to make the kernel
strongly prefer that allocations come out of an ordered list of block
groups.

(Of course these days, now that the cool kids are all embracing eBPF,
one could imagine a privileged interface where the defragger could
install some kind of eBPF program which provided enhanced policy to
ext4's block allocator.)

No one ever really followed through with this, in part because the
details of allowing userspace (and it would have to be privileged
userspace) to dictate policy to the block allocator has all sorts of
potential pitfalls, and in part because no company was really
interested in funding the engineering work.  In addition, I'll note
that the windows world, the need and interest for defragging has gone
done significantly with the advent more sophisticated file systems
like NTFSv5, which doesn't need defragging nearly as often as say, the
FAT file system.  And I think if anything, the interst in doing work
with e4defrag has decreased even more over the years.

That being said, there has been some interest in making changes to
both the block allocator and some kind of on-line defrag which is
optimized for low-end flash (such as the kind found in android
handsets).  There, the need to be careful that we don't end up
increasing the write wearout becomes even more critical, although the
GC work which f2fs does involve extra moving around of data blocks,
and phones have seemed to do fine.  Of course, the typical phone only
has to last 2-3 years before the battery dies, the screen gets
cracked, and/or the owner decides they want the latest cool toy from
the phone manufacturers.  :-)

In any case, if your goal is really some interface to support on-line
defragmentation for ext4, you want to consider whether the
EXT4_IOC_MOVE_EXTENT interface is sufficiently powerful such that you
don't really need to mess around with new block allocation hints.

Cheers,

						- Ted
Kirill Tkhai March 3, 2020, 9:57 a.m. UTC | #5
Hi, Ted,

On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
> Kirill,
> 
> In a couple of your comments on this patch series, you mentioned
> "defragmentation".  Is that because you're trying to use this as part
> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
> 
> If that's the case, you should note that input parameter for that
> ioctl is:
> 
> struct move_extent {
> 	__u32 reserved;		/* should be zero */
> 	__u32 donor_fd;		/* donor file descriptor */
> 	__u64 orig_start;	/* logical start offset in block for orig */
> 	__u64 donor_start;	/* logical start offset in block for donor */
> 	__u64 len;		/* block length to be moved */
> 	__u64 moved_len;	/* moved block length */
> };
> 
> Note that the donor_start is separate from the start of the file that
> is being defragged.  So you could have the userspace application
> fallocate a large chunk of space for that donor file, and then use
> that donor file to defrag multiple files if you want to close pack
> them.

The practice shows it's not so. Your suggestion was the first thing we tried,
but it works bad and just doubles/triples IO.

Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:

[[512Kb file][512Kb free]][[512Kb file][512Kb free]]

We want to pack both of files in the same 1Mb cluster. Packed together on block device,
they will be in the same server of underlining distributed storage file system.
This gives a big performance improvement, and this is the price I aimed.

In case of I fallocate a large hunk for both of them, I have to move them
both to this new hunk. So, instead of moving 512Kb of data, we will have to move
1Mb of data, i.e. double size, which is counterproductive.

Imaging another situation, when we have 
[[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]

Here we may just move [4Kb file] into [4Kb free]. But your suggestion again forces
us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is terrible!
And this is the thing I try prevent with finding a suitable new interface.

> Many years ago, back when LSF/MM colocated with a larger
> storage-focused conference so we could manage to origanize an ext4
> developer's workshop, we had talked about ways we create kernel
> support for a more powerful userspace defragger, which could also
> defragment the free space, so that future block allocations were more
> likely to be successful.
> 
> The discussions surrounded interfaces where userspace could block (or
> at least strongly dissuade unless the only other alternative was
> returning ENOSPC) the kernel from allocating out of a certain number
> of block groups.  And then also to have an interface where for a
> particular process (namely, the defragger), to make the kernel
> strongly prefer that allocations come out of an ordered list of block
> groups.
> 
> (Of course these days, now that the cool kids are all embracing eBPF,
> one could imagine a privileged interface where the defragger could
> install some kind of eBPF program which provided enhanced policy to
> ext4's block allocator.)
> 
> No one ever really followed through with this, in part because the
> details of allowing userspace (and it would have to be privileged
> userspace) to dictate policy to the block allocator has all sorts of
> potential pitfalls, and in part because no company was really
> interested in funding the engineering work.  In addition, I'll note
> that the windows world, the need and interest for defragging has gone
> done significantly with the advent more sophisticated file systems
> like NTFSv5, which doesn't need defragging nearly as often as say, the
> FAT file system.  And I think if anything, the interst in doing work
> with e4defrag has decreased even more over the years.
> 
> That being said, there has been some interest in making changes to
> both the block allocator and some kind of on-line defrag which is
> optimized for low-end flash (such as the kind found in android
> handsets).  There, the need to be careful that we don't end up
> increasing the write wearout becomes even more critical, although the
> GC work which f2fs does involve extra moving around of data blocks,
> and phones have seemed to do fine.  Of course, the typical phone only
> has to last 2-3 years before the battery dies, the screen gets
> cracked, and/or the owner decides they want the latest cool toy from
> the phone manufacturers.  :-)
> 
> In any case, if your goal is really some interface to support on-line
> defragmentation for ext4, you want to consider whether the
> EXT4_IOC_MOVE_EXTENT interface is sufficiently powerful such that you
> don't really need to mess around with new block allocation hints.

It's powerful, but it does not allow to create an effective defragmentation
tool for my usecase. See the examples above. I do not want to replace
EXT4_IOC_MOVE_EXTENT I just want an interface to be able to allocate
a space close to some existing file and reduce IO at defragmentation time.
This is just only thing I need in this patchset.

I can't climb into maintainers heads and find a thing, which will be suitable
for you. I did my try and suggested the interface. In case of it's not OK
for you, could you, please, suggest another one, which will work for my usecase?
The thesis "EXT4_IOC_MOVE_EXTENT is enough for everything" does not work for me :(
Are you OK with interface suggested by Andreas?

Thanks,
Kirill
Theodore Ts'o March 3, 2020, 4:55 p.m. UTC | #6
On Tue, Mar 03, 2020 at 12:57:15PM +0300, Kirill Tkhai wrote:
> The practice shows it's not so. Your suggestion was the first thing we tried,
> but it works bad and just doubles/triples IO.
> 
> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
> 
> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
> 
> We want to pack both of files in the same 1Mb cluster. Packed together on block device,
> they will be in the same server of underlining distributed storage file system.
> This gives a big performance improvement, and this is the price I aimed.
> 
> In case of I fallocate a large hunk for both of them, I have to move them
> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
> 1Mb of data, i.e. double size, which is counterproductive.
> 
> Imaging another situation, when we have 
> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
> 
> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again forces
> us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is terrible!
> And this is the thing I try prevent with finding a suitable new interface.

OK, so you aren't trying to *defragment*.  You want to have files
placed "properly" ab initio.

It sounds like what you *think* is the best way to go is to simply
have files backed tightly together.  So effectively what you want as a
block allocation strategy is something which just finds the next free
space big enough for the requested fallocate space, and just plop it
down right there.

OK, so what happens once you've allocated all of the free space, and
the pattern of deletes leaves the file system with a lot of holes?

I could imagine trying to implement this as a mount option which uses
an alternate block allocation strategy, but it's not clear what your
end game is after all of the "easy" spaces have been taken.  It's much
like proposals I've seen for a log-structured file system, where the
garbage collector is left as a "we'll get to it later" TODO item.  (If
I had a dollar each time I've read a paper proposing a log structured
file system which leaves out the garbage collector as an
implementation detail....)

> It's powerful, but it does not allow to create an effective defragmentation
> tool for my usecase. See the examples above. I do not want to replace
> EXT4_IOC_MOVE_EXTENT I just want an interface to be able to allocate
> a space close to some existing file and reduce IO at defragmentation time.
> This is just only thing I need in this patchset.

"At defragmentation time"?   So you do want to run a defragger?

It might be helpful to see the full design of what you have in mind,
and not just a request for interfaces....

> I can't climb into maintainers heads and find a thing, which will be suitable
> for you. I did my try and suggested the interface. In case of it's not OK
> for you, could you, please, suggest another one, which will work for my usecase?
> The thesis "EXT4_IOC_MOVE_EXTENT is enough for everything" does not work for me :(
> Are you OK with interface suggested by Andreas?

Like you, I can't climb into your head and figure out exactly how your
entire system design is going to work.  And I'd really rather not
proposal or bless an interface until I do, since it may be that it's
better to make some minor changes to your system design, instead of
trying to twist ext4 for your particular use case....

       	  	     	      		     - Ted
Kirill Tkhai March 3, 2020, 5:36 p.m. UTC | #7
On 03.03.2020 19:55, Theodore Y. Ts'o wrote:
> On Tue, Mar 03, 2020 at 12:57:15PM +0300, Kirill Tkhai wrote:
>> The practice shows it's not so. Your suggestion was the first thing we tried,
>> but it works bad and just doubles/triples IO.
>>
>> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
>>
>> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
>>
>> We want to pack both of files in the same 1Mb cluster. Packed together on block device,
>> they will be in the same server of underlining distributed storage file system.
>> This gives a big performance improvement, and this is the price I aimed.
>>
>> In case of I fallocate a large hunk for both of them, I have to move them
>> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
>> 1Mb of data, i.e. double size, which is counterproductive.
>>
>> Imaging another situation, when we have 
>> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
>>
>> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again forces
>> us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is terrible!
>> And this is the thing I try prevent with finding a suitable new interface.
> 
> OK, so you aren't trying to *defragment*.  You want to have files
> placed "properly" ab initio.
> 
> It sounds like what you *think* is the best way to go is to simply
> have files backed tightly together.  So effectively what you want as a
> block allocation strategy is something which just finds the next free
> space big enough for the requested fallocate space, and just plop it
> down right there.
> 
> OK, so what happens once you've allocated all of the free space, and
> the pattern of deletes leaves the file system with a lot of holes?

We defrag not all files, but a specific subset of files. Say, webserver
may have a lot of static content, a lot of small files, which are never
changed. So, we found files, which were not modified for months or years,
and pack them together. Also we pack RO files, and the most cases they
never changed in the future.

So, it's not for all files, it's for specific files, which are chosen
by defragger algorithm.

> I could imagine trying to implement this as a mount option which uses
> an alternate block allocation strategy, but it's not clear what your
> end game is after all of the "easy" spaces have been taken.  It's much
> like proposals I've seen for a log-structured file system, where the
> garbage collector is left as a "we'll get to it later" TODO item.  (If
> I had a dollar each time I've read a paper proposing a log structured
> file system which leaves out the garbage collector as an
> implementation detail....)

Mount option acts at runtime. I'm OK with block placement at runtime. We
defrag files with old modification time, when we know they are unmodifiable,
some time later. So, I'm not sure this will help.
If I understood you wrong, please, explain, whether you mean something else.

>> It's powerful, but it does not allow to create an effective defragmentation
>> tool for my usecase. See the examples above. I do not want to replace
>> EXT4_IOC_MOVE_EXTENT I just want an interface to be able to allocate
>> a space close to some existing file and reduce IO at defragmentation time.
>> This is just only thing I need in this patchset.
> 
> "At defragmentation time"?   So you do want to run a defragger?
> 
> It might be helpful to see the full design of what you have in mind,
> and not just a request for interfaces....

Yes, I run defragger. And it detects, which files may be packed together,
and then it tries to pack them.
>> I can't climb into maintainers heads and find a thing, which will be suitable
>> for you. I did my try and suggested the interface. In case of it's not OK
>> for you, could you, please, suggest another one, which will work for my usecase?
>> The thesis "EXT4_IOC_MOVE_EXTENT is enough for everything" does not work for me :(
>> Are you OK with interface suggested by Andreas?
> 
> Like you, I can't climb into your head and figure out exactly how your
> entire system design is going to work.  And I'd really rather not
> proposal or bless an interface until I do, since it may be that it's
> better to make some minor changes to your system design, instead of
> trying to twist ext4 for your particular use case....

Let I try to give a description:

1)defragger scans whole filesystem and it divides fs in 1Mb cluster.
There is populated a statistics of every scanned cluster: filling percent,
percent of long-time-unmodifiable blocks, percent of RO blocks etc.

Also relation of some directories to block groups are cached.

2)then it marks clusters suitable for compaction and for relocation.

3)then some data becomes compacted or relocated. For compaction we try
to allocate blocks nearly existing RO data to decrease IO (as I wrote
in previous email). The only way we can do it is to use ext4 block
allocation option: usually it tries to allocate blocks for a new inode
in the same block group, where directory is placed. Here we use information
about relationship between directories and block groups. We use specific
directory to create donor file. In case of success, fallocate(donor)
returns blocks from correct 1Mb cluster. Otherwise, cluster is fully
relocated (1Mb block is allocated and we write everything there).

3.1)We have:

[     Cluster 1      ][      Cluster 2    ][    Cluster 3       ]
[1020Kb file][4K hole][4Kfile][1020Kb hole][1020Kb file][4K hole]

3.2)Create a donor file in the directory, which is related to the same
block group, where cluster 1 is placed.

3.3)Several calls of fallocate(donor, 4K)-> are we in Cluster 1 (or 3, or similar)?
                                                /              \
                                             (OK)move 4K file   (Fail)fallocate(1M) and move both 1020Kb file and 4K file.


The practice shows that probability of successful fallocate() results in the specific cluster
is small. So, it the most case we have to move both 1020Kb and 4K files.

My experience is "less full filesystem is, less probability to receive suitable extent from fallocate()".
For me it looks like ext4 tries to spread all files over the disk, so this helps to build files
with less number of extents. This is very nice at allocation time, and it really works good.
But I need some opposite at defragmentation time...

What do you think about all of this?

Thanks,
Kirill
Andreas Dilger March 11, 2020, 7:26 p.m. UTC | #8
On Mar 3, 2020, at 2:57 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> 
> On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
>> Kirill,
>> 
>> In a couple of your comments on this patch series, you mentioned
>> "defragmentation".  Is that because you're trying to use this as part
>> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
>> 
>> If that's the case, you should note that input parameter for that
>> ioctl is:
>> 
>> struct move_extent {
>> 	__u32 reserved;		/* should be zero */
>> 	__u32 donor_fd;		/* donor file descriptor */
>> 	__u64 orig_start;	/* logical start offset in block for orig */
>> 	__u64 donor_start;	/* logical start offset in block for donor */
>> 	__u64 len;		/* block length to be moved */
>> 	__u64 moved_len;	/* moved block length */
>> };
>> 
>> Note that the donor_start is separate from the start of the file that
>> is being defragged.  So you could have the userspace application
>> fallocate a large chunk of space for that donor file, and then use
>> that donor file to defrag multiple files if you want to close pack
>> them.
> 
> The practice shows it's not so. Your suggestion was the first thing we tried,
> but it works bad and just doubles/triples IO.
> 
> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
> 
> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
> 
> We want to pack both of files in the same 1Mb cluster. Packed together on block
> device, they will be in the same server of underlining distributed storage file
> system. This gives a big performance improvement, and this is the price I aimed.
> 
> In case of I fallocate a large hunk for both of them, I have to move them
> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
> 1Mb of data, i.e. double size, which is counterproductive.
> 
> Imaging another situation, when we have
> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
> 
> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again
> forces us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is
> terrible! And this is the thing I try prevent with finding a new interface.

One idea I had, which may work for your use case, is to run fallocate() on
the *1MB-4KB file* to allocate the last 4KB in that hunk, then use that block
as the donor file for the 1MB+4KB file.  The ext4 allocation algorithms should
always give you that 4KB chunk if it is free, and that avoids the need to try
and force the allocator to select that block through some other method.

Cheers, Andreas
Kirill Tkhai March 11, 2020, 8:29 p.m. UTC | #9
On 11.03.2020 22:26, Andreas Dilger wrote:
> On Mar 3, 2020, at 2:57 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
>>> Kirill,
>>>
>>> In a couple of your comments on this patch series, you mentioned
>>> "defragmentation".  Is that because you're trying to use this as part
>>> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
>>>
>>> If that's the case, you should note that input parameter for that
>>> ioctl is:
>>>
>>> struct move_extent {
>>> 	__u32 reserved;		/* should be zero */
>>> 	__u32 donor_fd;		/* donor file descriptor */
>>> 	__u64 orig_start;	/* logical start offset in block for orig */
>>> 	__u64 donor_start;	/* logical start offset in block for donor */
>>> 	__u64 len;		/* block length to be moved */
>>> 	__u64 moved_len;	/* moved block length */
>>> };
>>>
>>> Note that the donor_start is separate from the start of the file that
>>> is being defragged.  So you could have the userspace application
>>> fallocate a large chunk of space for that donor file, and then use
>>> that donor file to defrag multiple files if you want to close pack
>>> them.
>>
>> The practice shows it's not so. Your suggestion was the first thing we tried,
>> but it works bad and just doubles/triples IO.
>>
>> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
>>
>> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
>>
>> We want to pack both of files in the same 1Mb cluster. Packed together on block
>> device, they will be in the same server of underlining distributed storage file
>> system. This gives a big performance improvement, and this is the price I aimed.
>>
>> In case of I fallocate a large hunk for both of them, I have to move them
>> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
>> 1Mb of data, i.e. double size, which is counterproductive.
>>
>> Imaging another situation, when we have
>> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
>>
>> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again
>> forces us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is
>> terrible! And this is the thing I try prevent with finding a new interface.
> 
> One idea I had, which may work for your use case, is to run fallocate() on
> the *1MB-4KB file* to allocate the last 4KB in that hunk, then use that block
> as the donor file for the 1MB+4KB file.  The ext4 allocation algorithms should
> always give you that 4KB chunk if it is free, and that avoids the need to try
> and force the allocator to select that block through some other method.

Do you mean the following:

1)fallocate() 4K at the end of *1MB-4KB* the first file (==> this increases the file length).
2)EXT4_IOC_MOVE_EXT *4KB* the second file in that new hunk.
3)truncate 4KB at the end of the first file.

?

If so, this can't be an online defrag, since some process may want to increase *1MB-4KB*
file in between. This will just bring to data corruption.
Another problem is that power lose between 1 and 3 will result in that file length remain
*1MB* instead of *1MB-4KB*.

So, we still need some kernel support to implement this.
Andreas Dilger March 12, 2020, 12:31 a.m. UTC | #10
On Mar 11, 2020, at 2:29 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 11.03.2020 22:26, Andreas Dilger wrote:
>> On Mar 3, 2020, at 2:57 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>> 
>>> On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
>>>> Kirill,
>>>> 
>>>> In a couple of your comments on this patch series, you mentioned
>>>> "defragmentation".  Is that because you're trying to use this as part
>>>> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
>>>> 
>>>> If that's the case, you should note that input parameter for that
>>>> ioctl is:
>>>> 
>>>> struct move_extent {
>>>> 	__u32 reserved;		/* should be zero */
>>>> 	__u32 donor_fd;		/* donor file descriptor */
>>>> 	__u64 orig_start;	/* logical start offset in block for orig */
>>>> 	__u64 donor_start;	/* logical start offset in block for donor */
>>>> 	__u64 len;		/* block length to be moved */
>>>> 	__u64 moved_len;	/* moved block length */
>>>> };
>>>> 
>>>> Note that the donor_start is separate from the start of the file that
>>>> is being defragged.  So you could have the userspace application
>>>> fallocate a large chunk of space for that donor file, and then use
>>>> that donor file to defrag multiple files if you want to close pack
>>>> them.
>>> 
>>> The practice shows it's not so. Your suggestion was the first thing we tried,
>>> but it works bad and just doubles/triples IO.
>>> 
>>> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
>>> 
>>> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
>>> 
>>> We want to pack both of files in the same 1Mb cluster. Packed together on block
>>> device, they will be in the same server of underlining distributed storage file
>>> system. This gives a big performance improvement, and this is the price I aimed.
>>> 
>>> In case of I fallocate a large hunk for both of them, I have to move them
>>> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
>>> 1Mb of data, i.e. double size, which is counterproductive.
>>> 
>>> Imaging another situation, when we have
>>> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
>>> 
>>> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again
>>> forces us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is
>>> terrible! And this is the thing I try prevent with finding a new interface.
>> 
>> One idea I had, which may work for your use case, is to run fallocate() on
>> the *1MB-4KB file* to allocate the last 4KB in that hunk, then use that block
>> as the donor file for the 1MB+4KB file.  The ext4 allocation algorithms should
>> always give you that 4KB chunk if it is free, and that avoids the need to try
>> and force the allocator to select that block through some other method.
> 
> Do you mean the following:
> 
> 1)fallocate() 4K at the end of *1MB-4KB* the first file (==> this increases the file length).

You can use FALLOCATE_KEEP_SIZE to avoid changing the size of the file.

> 2)EXT4_IOC_MOVE_EXT *4KB* the second file in that new hunk.
> 3)truncate 4KB at the end of the first file.
> 
> If so, this can't be an online defrag, since some process may want to increase
> *1MB-4KB* file in between. This will just bring to data corruption.

You previously stated that one of the main reasons to do the defrag is because
the files are not being modified.  It would be possible to detect the case of
the file being modified by the file version and/or size and/or time change
before removing the fallocated block.

> Another problem is that power lose between 1 and 3 will result in that file
> length remain *1MB* instead of *1MB-4KB*.

With FALLOCATE_KEEP_SIZE you can just use this file as the donor file to
allocate the blocks, then migrate it to another file without having written
anything into it.

Cheers, Andreas
Kirill Tkhai March 12, 2020, 9:23 a.m. UTC | #11
On 12.03.2020 03:31, Andreas Dilger wrote:
> On Mar 11, 2020, at 2:29 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 11.03.2020 22:26, Andreas Dilger wrote:
>>> On Mar 3, 2020, at 2:57 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>
>>>> On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
>>>>> Kirill,
>>>>>
>>>>> In a couple of your comments on this patch series, you mentioned
>>>>> "defragmentation".  Is that because you're trying to use this as part
>>>>> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
>>>>>
>>>>> If that's the case, you should note that input parameter for that
>>>>> ioctl is:
>>>>>
>>>>> struct move_extent {
>>>>> 	__u32 reserved;		/* should be zero */
>>>>> 	__u32 donor_fd;		/* donor file descriptor */
>>>>> 	__u64 orig_start;	/* logical start offset in block for orig */
>>>>> 	__u64 donor_start;	/* logical start offset in block for donor */
>>>>> 	__u64 len;		/* block length to be moved */
>>>>> 	__u64 moved_len;	/* moved block length */
>>>>> };
>>>>>
>>>>> Note that the donor_start is separate from the start of the file that
>>>>> is being defragged.  So you could have the userspace application
>>>>> fallocate a large chunk of space for that donor file, and then use
>>>>> that donor file to defrag multiple files if you want to close pack
>>>>> them.
>>>>
>>>> The practice shows it's not so. Your suggestion was the first thing we tried,
>>>> but it works bad and just doubles/triples IO.
>>>>
>>>> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
>>>>
>>>> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
>>>>
>>>> We want to pack both of files in the same 1Mb cluster. Packed together on block
>>>> device, they will be in the same server of underlining distributed storage file
>>>> system. This gives a big performance improvement, and this is the price I aimed.
>>>>
>>>> In case of I fallocate a large hunk for both of them, I have to move them
>>>> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
>>>> 1Mb of data, i.e. double size, which is counterproductive.
>>>>
>>>> Imaging another situation, when we have
>>>> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
>>>>
>>>> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again
>>>> forces us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is
>>>> terrible! And this is the thing I try prevent with finding a new interface.
>>>
>>> One idea I had, which may work for your use case, is to run fallocate() on
>>> the *1MB-4KB file* to allocate the last 4KB in that hunk, then use that block
>>> as the donor file for the 1MB+4KB file.  The ext4 allocation algorithms should
>>> always give you that 4KB chunk if it is free, and that avoids the need to try
>>> and force the allocator to select that block through some other method.
>>
>> Do you mean the following:
>>
>> 1)fallocate() 4K at the end of *1MB-4KB* the first file (==> this increases the file length).
> 
> You can use FALLOCATE_KEEP_SIZE to avoid changing the size of the file.

Ok, but there still remains the problem with fallocation of a block in front of target file:

[4KB hole][1MB-4KB file][hole][4KB file]

Is there a high-probably way that ext4 allocator returns a block before target file?

>> 2)EXT4_IOC_MOVE_EXT *4KB* the second file in that new hunk.
>> 3)truncate 4KB at the end of the first file.
>>
>> If so, this can't be an online defrag, since some process may want to increase
>> *1MB-4KB* file in between. This will just bring to data corruption.
> 
> You previously stated that one of the main reasons to do the defrag is because
> the files are not being modified.  It would be possible to detect the case of
> the file being modified by the file version and/or size and/or time change
> before removing the fallocated block.

Yes, files should not be modified in parallel, but there is no 100% guarantee.
It's almost 100% by statistics, but since there is multi-user system (I defrag
fs on VPS), there are possible exceptions. And the whole architecture must be
safe in any cases.

File version does not look acceptable for online defrag, since there is no a way,
which allows to check the version and to remove fallocated block *atomically*.

>> Another problem is that power lose between 1 and 3 will result in that file
>> length remain *1MB* instead of *1MB-4KB*.
> 
> With FALLOCATE_KEEP_SIZE you can just use this file as the donor file to
> allocate the blocks, then migrate it to another file without having written
> anything into it.

In case of some task decides to write into fallocated block, there will be data
corruption.