diff mbox

[1/3] fs: Introduce new flag FALLOC_FL_COLLAPSE_RANGE

Message ID 1375281721-15840-1-git-send-email-linkinjeon@gmail.com
State Rejected, archived
Headers show

Commit Message

Namjae Jeon July 31, 2013, 2:42 p.m. UTC
From: Namjae Jeon <namjae.jeon@samsung.com>

Fallocate now supports new FALLOC_FL_COLLAPSE_RANGE flag.
The semantics of this flag are following:
1) It collapses the range lying between offset and length by removing any data
   blocks which are present in this range and than updates all the logical
   offsets of extents beyond "offset + len" to nullify the hole created by
   removing blocks. In short, it does not leave a hole.
1) It should be used exclusively. No other fallocate flag in combination.
2) Offset and length supplied to fallocate should be fs block size aligned.
3) It wokrs beyond "EOF", so the extents which are pre-allocated beyond "EOF"
   are also updated.
4) It reduces the i_size of inode by the amount of collapse range which lies
   within i_size. So, if offset >= i_size, i_size won't be changed at all.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
---
 fs/open.c                   |   19 ++++++++++++++++---
 include/uapi/linux/falloc.h |    5 +++--
 2 files changed, 19 insertions(+), 5 deletions(-)

Comments

Theodore Ts'o July 31, 2013, 10:01 p.m. UTC | #1
Have you considered what happens if you have a 10 megabyte file, of
which the first 5 megs are mmap'ed into a userspace process.

Now suppose you call COLLAPASE_RANGE on a one megabyte range starting
at offset 1024k from the beginning of the file.

Does the right thing happen to the mmap'ed region in memory?

Cheers,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Aug. 1, 2013, 12:22 a.m. UTC | #2
On Wed, Jul 31, 2013 at 11:42:00PM +0900, Namjae Jeon wrote:
> From: Namjae Jeon <namjae.jeon@samsung.com>
> 
> Fallocate now supports new FALLOC_FL_COLLAPSE_RANGE flag.
> The semantics of this flag are following:
> 1) It collapses the range lying between offset and length by removing any data
>    blocks which are present in this range and than updates all the logical
>    offsets of extents beyond "offset + len" to nullify the hole created by
>    removing blocks. In short, it does not leave a hole.
> 1) It should be used exclusively. No other fallocate flag in combination.
> 2) Offset and length supplied to fallocate should be fs block size aligned.

Given that the rest of fallocate() interfaces are byte based, this
is going to cause some confusion if it's not well documented. i.e.
this restriction needs to be documented in the header file that is
exposed to userspace, as well as in the fallocate(2) man page.

> 3) It wokrs beyond "EOF", so the extents which are pre-allocated beyond "EOF"
>    are also updated.

I don't think that's valid for this operation. If the range is
beyond EOF, you are not modifying anything visible to userspace, and
that makes it the same as a hole punch operation. So, I'd get rid of
thisnasty implementation complexity altogether.

> 4) It reduces the i_size of inode by the amount of collapse range which lies
>    within i_size. So, if offset >= i_size, i_size won't be changed at all.

Similarly, I don't think that asking for a range that overlaps EOF
is valid, either. The moment you overlap or go beyond EOF, you are
effectively truncating the file as there is no data to collapse into
the range

As it is, I don't see these EOF rules enforced by this patch, nor
are they documented at all in the patch.

Regardless of what semantics we decide on, this needs xfs_io
support and extensive corner case tests added to xfstests so that we
can confirm the implementation in every filesystem is correct and we
don't introduce regressions in future. it also needs documentation
in the fallocate(2) man page.

> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 990c4cc..7567c8c 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -1,9 +1,10 @@
>  #ifndef _UAPI_FALLOC_H_
>  #define _UAPI_FALLOC_H_
>  
> -#define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
> -#define FALLOC_FL_PUNCH_HOLE	0x02 /* de-allocates range */
> +#define FALLOC_FL_KEEP_SIZE		0x01 /* default is extend size */
> +#define FALLOC_FL_PUNCH_HOLE		0x02 /* de-allocates range */
>  #define FALLOC_FL_NO_HIDE_STALE	0x04 /* reserved codepoint */
> +#define FALLOC_FL_COLLAPSE_RANGE	0x08 /* it does not leave a hole */

I'd suggest that you need to explain what FALLOC_FL_COLLAPSE_RANGE
does in great detail right here.

Cheers,

Dave.
Dave Chinner Aug. 1, 2013, 12:23 a.m. UTC | #3
On Wed, Jul 31, 2013 at 06:01:54PM -0400, Theodore Ts'o wrote:
> Have you considered what happens if you have a 10 megabyte file, of
> which the first 5 megs are mmap'ed into a userspace process.
> 
> Now suppose you call COLLAPASE_RANGE on a one megabyte range starting
> at offset 1024k from the beginning of the file.
> 
> Does the right thing happen to the mmap'ed region in memory?

Implementation detail. like a hole punch, it needs to invalidate the
range that it is operating over so mmap()d regions are refaulted
after the operation is done.

Cheers,

Dave.
Theodore Ts'o Aug. 1, 2013, 12:46 a.m. UTC | #4
On Thu, Aug 01, 2013 at 10:23:41AM +1000, Dave Chinner wrote:
> On Wed, Jul 31, 2013 at 06:01:54PM -0400, Theodore Ts'o wrote:
> > Have you considered what happens if you have a 10 megabyte file, of
> > which the first 5 megs are mmap'ed into a userspace process.
> > 
> > Now suppose you call COLLAPASE_RANGE on a one megabyte range starting
> > at offset 1024k from the beginning of the file.
> > 
> > Does the right thing happen to the mmap'ed region in memory?
> 
> Implementation detail. like a hole punch, it needs to invalidate the
> range that it is operating over so mmap()d regions are refaulted
> after the operation is done.

It's not just the range that it's operating on, but also the region
beyond the range that's been collapsed out.  That's because the page
that had formerly located at offset 3072k would now be located at
2048k.  And you had better make sure that any dirty pages located
beyond that are written out to disk before you do the invalidate and
collapse.  As a result, there's going to be potential races that don't
exist with truncate and punch.

A quick eyeball of the patch didn't seem to show any code that handled
this, which is why I asked the question.

      	    				- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Aug. 1, 2013, 12:54 a.m. UTC | #5
On Wed, Jul 31, 2013 at 08:46:45PM -0400, Theodore Ts'o wrote:
> On Thu, Aug 01, 2013 at 10:23:41AM +1000, Dave Chinner wrote:
> > On Wed, Jul 31, 2013 at 06:01:54PM -0400, Theodore Ts'o wrote:
> > > Have you considered what happens if you have a 10 megabyte file, of
> > > which the first 5 megs are mmap'ed into a userspace process.
> > > 
> > > Now suppose you call COLLAPASE_RANGE on a one megabyte range starting
> > > at offset 1024k from the beginning of the file.
> > > 
> > > Does the right thing happen to the mmap'ed region in memory?
> > 
> > Implementation detail. like a hole punch, it needs to invalidate the
> > range that it is operating over so mmap()d regions are refaulted
> > after the operation is done.
> 
> It's not just the range that it's operating on, but also the region
> beyond the range that's been collapsed out.

Yes, that's part of "the range that it is operating over".

> A quick eyeball of the patch didn't seem to show any code that handled
> this, which is why I asked the question.

Right, but really it's the least of the problems I've noticed - the
XFS code is fundamentally broken in many ways - once I've finished
commenting on it, I'll have a quick look to see if the ext4 code has
the same fundamental flaws....

Cheers,

Dave.
Theodore Ts'o Aug. 1, 2013, 1:07 a.m. UTC | #6
On Thu, Aug 01, 2013 at 10:54:47AM +1000, Dave Chinner wrote:
> > It's not just the range that it's operating on, but also the region
> > beyond the range that's been collapsed out.
> 
> Yes, that's part of "the range that it is operating over".
> 
> > A quick eyeball of the patch didn't seem to show any code that handled
> > this, which is why I asked the question.
> 
> Right, but really it's the least of the problems I've noticed - the
> XFS code is fundamentally broken in many ways - once I've finished
> commenting on it, I'll have a quick look to see if the ext4 code has
> the same fundamental flaws....

Well, the fundamental flaw of potential races if the file being
collapsed has been mmap'ed and there is another process making changes
beyond the range that is being collapsed is I suspect one that is
going to be very hard to solve, short of not allowing the collapse
while there are any read/write mmap's for the file in question.

And I would think these sorts of VM issues should be handled with some
generic library functions, instead of reimplementing them from scratch
in each file system.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Aug. 1, 2013, 2:59 a.m. UTC | #7
On Wed, Jul 31, 2013 at 09:07:52PM -0400, Theodore Ts'o wrote:
> On Thu, Aug 01, 2013 at 10:54:47AM +1000, Dave Chinner wrote:
> > > It's not just the range that it's operating on, but also the region
> > > beyond the range that's been collapsed out.
> > 
> > Yes, that's part of "the range that it is operating over".
> > 
> > > A quick eyeball of the patch didn't seem to show any code that handled
> > > this, which is why I asked the question.
> > 
> > Right, but really it's the least of the problems I've noticed - the
> > XFS code is fundamentally broken in many ways - once I've finished
> > commenting on it, I'll have a quick look to see if the ext4 code has
> > the same fundamental flaws....
> 
> Well, the fundamental flaw of potential races if the file being
> collapsed has been mmap'ed and there is another process making changes
> beyond the range that is being collapsed is I suspect one that is
> going to be very hard to solve, short of not allowing the collapse
> while there are any read/write mmap's for the file in question.

This funtionality is not introducing any new problems w.r.t. mmap().
In terms of truncating a mmap'd file, that can already occur and
the behaviour is well known. The current patches don't do the
operations in the correct order to ensure this is handled correctly
(i.e. the inode size has to be changed before the page cache
invalidation, not after), but otherwise there is no new issues
introduced here.

The real problem is that it has the same problem as hole punching
w.r.t. not being able to serialise page faults against the extent
manipulations after the page cache invalidation has occurred.
That's what Jan Kara's range locking patches we talked about at
LSFMM will address, and so is beyond the scope of this patchset.

> And I would think these sorts of VM issues should be handled with some
> generic library functions, instead of reimplementing them from scratch
> in each file system.

Well, yes, we have one - truncate_page_cache_range(), and it's used
correctly in the ext4 patch....

Cheers,

Dave.
Theodore Ts'o Aug. 1, 2013, 4:06 a.m. UTC | #8
On Thu, Aug 01, 2013 at 12:59:14PM +1000, Dave Chinner wrote:
> 
> This funtionality is not introducing any new problems w.r.t. mmap().
> In terms of truncating a mmap'd file, that can already occur and
> the behaviour is well known.

That's not race I'm worried about.  Consider the following scenario.
We have a 10 meg file, and the first five megs are mapped read/write.
We do a collapse range of one meg starting at the one meg boundary.
Menwhile another process is constantly modifying the pages between the
3 meg and 4 meg mark, via a mmap'ed region of memory.

You can have the code which does the collapse range call
filemap_fdatawrite_range() to flush out to disk all of the inode's
dirty pages in the page cache, but between the call to
filemap_fdatawrite_range() and truncate_page_cache_range(), more pages
could have gotten dirtied, and those changes will get lost.

The difference between punch_hole and collapse_range is that
previously, we only needed to call truncate_page_cache_range() on the
pages that were going to be disappearing from the page cache, so it
didn't matter if you had someone dirtying the pages racing with the
punch_hole operation.  But in the collapse_range case, we need to call
truncate_page_cache_range() on data that is not going to disappear,
but rather be _renumbered_.

I'll also note that depending on the use case, doing the renumbering
of the pages by throwing all of the pages from the page cache and
forcing them to be read back in from disk might not all be friendly to
a performance sensitive application.

In the case where the page size == the fs block size, instead of
throwing away all of the pages, we could also have VM code which
remaps the pages in the page cache (including potentially editing the
page tables for any mmap'ed pages).  This of course gets _much_ more
complicated, and we still have to deal with the case where the
collapse_range call is aligned with the fs block size, but which is
not aligned or is not a muliple of the page size.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Aug. 1, 2013, 4:32 a.m. UTC | #9
On Thu, Aug 01, 2013 at 12:06:05AM -0400, Theodore Ts'o wrote:
> On Thu, Aug 01, 2013 at 12:59:14PM +1000, Dave Chinner wrote:
> > 
> > This funtionality is not introducing any new problems w.r.t. mmap().
> > In terms of truncating a mmap'd file, that can already occur and
> > the behaviour is well known.
> 
> That's not race I'm worried about.  Consider the following scenario.
> We have a 10 meg file, and the first five megs are mapped read/write.
> We do a collapse range of one meg starting at the one meg boundary.
> Menwhile another process is constantly modifying the pages between the
> 3 meg and 4 meg mark, via a mmap'ed region of memory.
> 
> You can have the code which does the collapse range call
> filemap_fdatawrite_range() to flush out to disk all of the inode's
> dirty pages in the page cache, but between the call to
> filemap_fdatawrite_range() and truncate_page_cache_range(), more pages
> could have gotten dirtied, and those changes will get lost.

That's one of the many race conditions that Jan's patches are
intended to solve. Locking the range and then unmapping the pages
will then cause the refault to block until the range is unlocked.

While the range is locked, we can then writeback all the dirty
pages, invalidate the page cache and modify the underlying extents
without having to care about user page faults racing with the
operation.

Let's face it, without that we current cannot serialise *any*
multi-page page cache operation against mmap() - direct IO has
exactly the same problems as you are describing, as does hole
punching, and anything else that relies on the i_mutex or other
IO-based locks for serialisation.

> The difference between punch_hole and collapse_range is that
> previously, we only needed to call truncate_page_cache_range() on the
> pages that were going to be disappearing from the page cache, so it
> didn't matter if you had someone dirtying the pages racing with the
> punch_hole operation.  But in the collapse_range case, we need to call
> truncate_page_cache_range() on data that is not going to disappear,
> but rather be _renumbered_.

Sure - as it said up front "the range of the pages being operated
on" need to be invalidating. And, if you look at the ext4 patch, is
actaully does exactly that....

> I'll also note that depending on the use case, doing the renumbering
> of the pages by throwing all of the pages from the page cache and
> forcing them to be read back in from disk might not all be friendly to
> a performance sensitive application.

Application problem.

Besides, this is for accelerating video stream editting (i.e. NLE
workflows). Such applications tend to use direct IO for recording
and playback, and they most certainly won't have concurrent access
to the video stream either by mmap or direct/buferred IO while the
non linear editor is punching the advertisements out of the recorded
video stream......

> In the case where the page size == the fs block size, instead of
> throwing away all of the pages, we could also have VM code which
> remaps the pages in the page cache (including potentially editing the
> page tables for any mmap'ed pages).

That way lies insanity. Not to mention that it's completely
unnecessary for the use case this ioctl is actually accelerating.

Cheers,

Dave.
Namjae Jeon Aug. 1, 2013, 5:07 a.m. UTC | #10
2013/8/1, Dave Chinner <david@fromorbit.com>:
> On Wed, Jul 31, 2013 at 11:42:00PM +0900, Namjae Jeon wrote:
>> From: Namjae Jeon <namjae.jeon@samsung.com>
>>
>> Fallocate now supports new FALLOC_FL_COLLAPSE_RANGE flag.
>> The semantics of this flag are following:
>> 1) It collapses the range lying between offset and length by removing any
>> data
>>    blocks which are present in this range and than updates all the
>> logical
>>    offsets of extents beyond "offset + len" to nullify the hole created
>> by
>>    removing blocks. In short, it does not leave a hole.
>> 1) It should be used exclusively. No other fallocate flag in combination.
>> 2) Offset and length supplied to fallocate should be fs block size
>> aligned.
>
Hi Dave.

> Given that the rest of fallocate() interfaces are byte based, this
> is going to cause some confusion if it's not well documented. i.e.
> this restriction needs to be documented in the header file that is
> exposed to userspace, as well as in the fallocate(2) man page.
Agree.

>
>> 3) It wokrs beyond "EOF", so the extents which are pre-allocated beyond
>> "EOF"
>>    are also updated.
>
> I don't think that's valid for this operation. If the range is
> beyond EOF, you are not modifying anything visible to userspace, and
> that makes it the same as a hole punch operation. So, I'd get rid of
> thisnasty implementation complexity altogether.
The basic idea behind collapse range is that it does'nt leaves a hole.
And this is achieved by updating starting block of each of the extents
which lies beyond hole.
When we are updating an extent, we are actually moving hole to the
right, towards the last allocated block in the file space.
We continue doing this for each extent and finally when this operation
completes, the hole is gone.
If the file has preallocated extent(s) beyond EOF, something like this=>
hole, extent1, extent2, extent_beyond_EOF1.
And we start updating extents one by one, to nullify the effect of hole=>
STEP1: After updating extent1: extent1, hole, extent2, extent_beyond_EOF1.
STEP2: After updating extent2: extent1, extent2, hole, extent_beyond_EOF1.
STEP3: After updating extent_beyond_EOF1: extent1, extent2,
extent_beyond_EOF1, hole => And this removes hole from the file.
If we stop updating just after extent2, this will not remove hole from
the file space but instead, it will place the hole between EOF and 1st
preallcoated extent beyond EOF as in STEP2.
So, if user does an extending truncate, hole will became visible as it
is still there.

>
>> 4) It reduces the i_size of inode by the amount of collapse range which
>> lies
>>    within i_size. So, if offset >= i_size, i_size won't be changed at
>> all.
>
> Similarly, I don't think that asking for a range that overlaps EOF
> is valid, either. The moment you overlap or go beyond EOF, you are
> effectively truncating the file as there is no data to collapse into
> the range
Yes, we could put a restriction on that, something like: if (len +
offset > i_size) { return -EINVAL;}

>
> As it is, I don't see these EOF rules enforced by this patch, nor
> are they documented at all in the patch.
The EOF file rules are handled in the fs specific patches. As in XFS =>
+       if (mode & FALLOC_FL_COLLAPSE_RANGE) {
+               error = -xfs_collapse_file_space(ip, offset + len, len);
+               if (error || offset >= i_size_read(inode))
+                       goto out_unlock;
+
+               /* Shrink size in case of FALLOC_FL_COLLAPSE_RANGE */
+               if ((offset + len) > i_size_read(inode))
+                       new_size = offset;
+               else
+                       new_size = i_size_read(inode) - len;
+               error = -xfs_update_file_size(ip, new_size);
But it seems ok to not allow (offset + len) > i_size.

>
> Regardless of what semantics we decide on, this needs xfs_io
> support and extensive corner case tests added to xfstests so that we
> can confirm the implementation in every filesystem is correct and we
> don't introduce regressions in future. it also needs documentation
> in the fallocate(2) man page.
Yes, we are currently working on it.
With next version of collapse_range patches, we will also send xfs_io
and xfstests patches and manpage also.

>
>> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
>> index 990c4cc..7567c8c 100644
>> --- a/include/uapi/linux/falloc.h
>> +++ b/include/uapi/linux/falloc.h
>> @@ -1,9 +1,10 @@
>>  #ifndef _UAPI_FALLOC_H_
>>  #define _UAPI_FALLOC_H_
>>
>> -#define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
>> -#define FALLOC_FL_PUNCH_HOLE	0x02 /* de-allocates range */
>> +#define FALLOC_FL_KEEP_SIZE		0x01 /* default is extend size */
>> +#define FALLOC_FL_PUNCH_HOLE		0x02 /* de-allocates range */
>>  #define FALLOC_FL_NO_HIDE_STALE	0x04 /* reserved codepoint */
>> +#define FALLOC_FL_COLLAPSE_RANGE	0x08 /* it does not leave a hole */
>
> I'd suggest that you need to explain what FALLOC_FL_COLLAPSE_RANGE
> does in great detail right here.
Okay!

Thanks for review!
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Aug. 2, 2013, 2:37 a.m. UTC | #11
On Thu, Aug 01, 2013 at 02:07:39PM +0900, Namjae Jeon wrote:
> 2013/8/1, Dave Chinner <david@fromorbit.com>:
> > On Wed, Jul 31, 2013 at 11:42:00PM +0900, Namjae Jeon wrote:
> >> From: Namjae Jeon <namjae.jeon@samsung.com>
> >>
> >> Fallocate now supports new FALLOC_FL_COLLAPSE_RANGE flag.
> >> The semantics of this flag are following:
> >> 1) It collapses the range lying between offset and length by removing any
> >> data
> >>    blocks which are present in this range and than updates all the
> >> logical
> >>    offsets of extents beyond "offset + len" to nullify the hole created
> >> by
> >>    removing blocks. In short, it does not leave a hole.
> >> 1) It should be used exclusively. No other fallocate flag in combination.
> >> 2) Offset and length supplied to fallocate should be fs block size
> >> aligned.
> >
> Hi Dave.
> 
> > Given that the rest of fallocate() interfaces are byte based, this
> > is going to cause some confusion if it's not well documented. i.e.
> > this restriction needs to be documented in the header file that is
> > exposed to userspace, as well as in the fallocate(2) man page.
> Agree.
> 
> >
> >> 3) It wokrs beyond "EOF", so the extents which are pre-allocated beyond
> >> "EOF"
> >>    are also updated.
> >
> > I don't think that's valid for this operation. If the range is
> > beyond EOF, you are not modifying anything visible to userspace, and
> > that makes it the same as a hole punch operation. So, I'd get rid of
> > thisnasty implementation complexity altogether.
> The basic idea behind collapse range is that it does'nt leaves a hole.

I know what collapse range is and what it's supposed to do. read
again what I said - collapsing a range beyond EOF simply removes
extents that the user can't otherwise see. That makes it the same as
a hole punch operation. i.e. you don't need to use collapse range
for this, because you can simply punch the same number of extents of
the end of the preallocated range and get exactly the same result.

i.e. we already have functionality to manipulate extents beyond EOF
and so we don't need another.

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/open.c b/fs/open.c
index d53e298..e076390 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -225,12 +225,15 @@  int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 {
 	struct inode *inode = file_inode(file);
 	long ret;
+	unsigned i_blkbits = ACCESS_ONCE(inode->i_blkbits);
+	unsigned blksize_mask = (1 << i_blkbits) - 1;
 
 	if (offset < 0 || len <= 0)
 		return -EINVAL;
 
 	/* Return error if mode is not supported */
-	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+		     FALLOC_FL_COLLAPSE_RANGE))
 		return -EOPNOTSUPP;
 
 	/* Punch hole must have keep size set */
@@ -241,8 +244,12 @@  int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 
-	/* It's not possible punch hole on append only file */
-	if (mode & FALLOC_FL_PUNCH_HOLE && IS_APPEND(inode))
+	/*
+	 * It's not possible to punch hole or perform collapse range
+	 * on append only file
+	 */
+	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE)
+	    && IS_APPEND(inode))
 		return -EPERM;
 
 	if (IS_IMMUTABLE(inode))
@@ -270,6 +277,12 @@  int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
 		return -EFBIG;
 
+	 /* Collapse range works only on fs block size aligned offsets */
+	if ((mode & FALLOC_FL_COLLAPSE_RANGE) &&
+	    (offset & blksize_mask || len & blksize_mask ||
+	     mode & ~FALLOC_FL_COLLAPSE_RANGE))
+			return -EINVAL;
+
 	if (!file->f_op->fallocate)
 		return -EOPNOTSUPP;
 
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 990c4cc..7567c8c 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -1,9 +1,10 @@ 
 #ifndef _UAPI_FALLOC_H_
 #define _UAPI_FALLOC_H_
 
-#define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
-#define FALLOC_FL_PUNCH_HOLE	0x02 /* de-allocates range */
+#define FALLOC_FL_KEEP_SIZE		0x01 /* default is extend size */
+#define FALLOC_FL_PUNCH_HOLE		0x02 /* de-allocates range */
 #define FALLOC_FL_NO_HIDE_STALE	0x04 /* reserved codepoint */
+#define FALLOC_FL_COLLAPSE_RANGE	0x08 /* it does not leave a hole */
 
 
 #endif /* _UAPI_FALLOC_H_ */