Message ID | 1375281721-15840-1-git-send-email-linkinjeon@gmail.com |
---|---|
State | Rejected, archived |
Headers | show |
Have you considered what happens if you have a 10 megabyte file, of which the first 5 megs are mmap'ed into a userspace process. Now suppose you call COLLAPASE_RANGE on a one megabyte range starting at offset 1024k from the beginning of the file. Does the right thing happen to the mmap'ed region in memory? Cheers, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 31, 2013 at 11:42:00PM +0900, Namjae Jeon wrote: > From: Namjae Jeon <namjae.jeon@samsung.com> > > Fallocate now supports new FALLOC_FL_COLLAPSE_RANGE flag. > The semantics of this flag are following: > 1) It collapses the range lying between offset and length by removing any data > blocks which are present in this range and than updates all the logical > offsets of extents beyond "offset + len" to nullify the hole created by > removing blocks. In short, it does not leave a hole. > 1) It should be used exclusively. No other fallocate flag in combination. > 2) Offset and length supplied to fallocate should be fs block size aligned. Given that the rest of fallocate() interfaces are byte based, this is going to cause some confusion if it's not well documented. i.e. this restriction needs to be documented in the header file that is exposed to userspace, as well as in the fallocate(2) man page. > 3) It wokrs beyond "EOF", so the extents which are pre-allocated beyond "EOF" > are also updated. I don't think that's valid for this operation. If the range is beyond EOF, you are not modifying anything visible to userspace, and that makes it the same as a hole punch operation. So, I'd get rid of thisnasty implementation complexity altogether. > 4) It reduces the i_size of inode by the amount of collapse range which lies > within i_size. So, if offset >= i_size, i_size won't be changed at all. Similarly, I don't think that asking for a range that overlaps EOF is valid, either. The moment you overlap or go beyond EOF, you are effectively truncating the file as there is no data to collapse into the range As it is, I don't see these EOF rules enforced by this patch, nor are they documented at all in the patch. Regardless of what semantics we decide on, this needs xfs_io support and extensive corner case tests added to xfstests so that we can confirm the implementation in every filesystem is correct and we don't introduce regressions in future. it also needs documentation in the fallocate(2) man page. > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h > index 990c4cc..7567c8c 100644 > --- a/include/uapi/linux/falloc.h > +++ b/include/uapi/linux/falloc.h > @@ -1,9 +1,10 @@ > #ifndef _UAPI_FALLOC_H_ > #define _UAPI_FALLOC_H_ > > -#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */ > -#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */ > +#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */ > +#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */ > #define FALLOC_FL_NO_HIDE_STALE 0x04 /* reserved codepoint */ > +#define FALLOC_FL_COLLAPSE_RANGE 0x08 /* it does not leave a hole */ I'd suggest that you need to explain what FALLOC_FL_COLLAPSE_RANGE does in great detail right here. Cheers, Dave.
On Wed, Jul 31, 2013 at 06:01:54PM -0400, Theodore Ts'o wrote: > Have you considered what happens if you have a 10 megabyte file, of > which the first 5 megs are mmap'ed into a userspace process. > > Now suppose you call COLLAPASE_RANGE on a one megabyte range starting > at offset 1024k from the beginning of the file. > > Does the right thing happen to the mmap'ed region in memory? Implementation detail. like a hole punch, it needs to invalidate the range that it is operating over so mmap()d regions are refaulted after the operation is done. Cheers, Dave.
On Thu, Aug 01, 2013 at 10:23:41AM +1000, Dave Chinner wrote: > On Wed, Jul 31, 2013 at 06:01:54PM -0400, Theodore Ts'o wrote: > > Have you considered what happens if you have a 10 megabyte file, of > > which the first 5 megs are mmap'ed into a userspace process. > > > > Now suppose you call COLLAPASE_RANGE on a one megabyte range starting > > at offset 1024k from the beginning of the file. > > > > Does the right thing happen to the mmap'ed region in memory? > > Implementation detail. like a hole punch, it needs to invalidate the > range that it is operating over so mmap()d regions are refaulted > after the operation is done. It's not just the range that it's operating on, but also the region beyond the range that's been collapsed out. That's because the page that had formerly located at offset 3072k would now be located at 2048k. And you had better make sure that any dirty pages located beyond that are written out to disk before you do the invalidate and collapse. As a result, there's going to be potential races that don't exist with truncate and punch. A quick eyeball of the patch didn't seem to show any code that handled this, which is why I asked the question. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 31, 2013 at 08:46:45PM -0400, Theodore Ts'o wrote: > On Thu, Aug 01, 2013 at 10:23:41AM +1000, Dave Chinner wrote: > > On Wed, Jul 31, 2013 at 06:01:54PM -0400, Theodore Ts'o wrote: > > > Have you considered what happens if you have a 10 megabyte file, of > > > which the first 5 megs are mmap'ed into a userspace process. > > > > > > Now suppose you call COLLAPASE_RANGE on a one megabyte range starting > > > at offset 1024k from the beginning of the file. > > > > > > Does the right thing happen to the mmap'ed region in memory? > > > > Implementation detail. like a hole punch, it needs to invalidate the > > range that it is operating over so mmap()d regions are refaulted > > after the operation is done. > > It's not just the range that it's operating on, but also the region > beyond the range that's been collapsed out. Yes, that's part of "the range that it is operating over". > A quick eyeball of the patch didn't seem to show any code that handled > this, which is why I asked the question. Right, but really it's the least of the problems I've noticed - the XFS code is fundamentally broken in many ways - once I've finished commenting on it, I'll have a quick look to see if the ext4 code has the same fundamental flaws.... Cheers, Dave.
On Thu, Aug 01, 2013 at 10:54:47AM +1000, Dave Chinner wrote: > > It's not just the range that it's operating on, but also the region > > beyond the range that's been collapsed out. > > Yes, that's part of "the range that it is operating over". > > > A quick eyeball of the patch didn't seem to show any code that handled > > this, which is why I asked the question. > > Right, but really it's the least of the problems I've noticed - the > XFS code is fundamentally broken in many ways - once I've finished > commenting on it, I'll have a quick look to see if the ext4 code has > the same fundamental flaws.... Well, the fundamental flaw of potential races if the file being collapsed has been mmap'ed and there is another process making changes beyond the range that is being collapsed is I suspect one that is going to be very hard to solve, short of not allowing the collapse while there are any read/write mmap's for the file in question. And I would think these sorts of VM issues should be handled with some generic library functions, instead of reimplementing them from scratch in each file system. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 31, 2013 at 09:07:52PM -0400, Theodore Ts'o wrote: > On Thu, Aug 01, 2013 at 10:54:47AM +1000, Dave Chinner wrote: > > > It's not just the range that it's operating on, but also the region > > > beyond the range that's been collapsed out. > > > > Yes, that's part of "the range that it is operating over". > > > > > A quick eyeball of the patch didn't seem to show any code that handled > > > this, which is why I asked the question. > > > > Right, but really it's the least of the problems I've noticed - the > > XFS code is fundamentally broken in many ways - once I've finished > > commenting on it, I'll have a quick look to see if the ext4 code has > > the same fundamental flaws.... > > Well, the fundamental flaw of potential races if the file being > collapsed has been mmap'ed and there is another process making changes > beyond the range that is being collapsed is I suspect one that is > going to be very hard to solve, short of not allowing the collapse > while there are any read/write mmap's for the file in question. This funtionality is not introducing any new problems w.r.t. mmap(). In terms of truncating a mmap'd file, that can already occur and the behaviour is well known. The current patches don't do the operations in the correct order to ensure this is handled correctly (i.e. the inode size has to be changed before the page cache invalidation, not after), but otherwise there is no new issues introduced here. The real problem is that it has the same problem as hole punching w.r.t. not being able to serialise page faults against the extent manipulations after the page cache invalidation has occurred. That's what Jan Kara's range locking patches we talked about at LSFMM will address, and so is beyond the scope of this patchset. > And I would think these sorts of VM issues should be handled with some > generic library functions, instead of reimplementing them from scratch > in each file system. Well, yes, we have one - truncate_page_cache_range(), and it's used correctly in the ext4 patch.... Cheers, Dave.
On Thu, Aug 01, 2013 at 12:59:14PM +1000, Dave Chinner wrote: > > This funtionality is not introducing any new problems w.r.t. mmap(). > In terms of truncating a mmap'd file, that can already occur and > the behaviour is well known. That's not race I'm worried about. Consider the following scenario. We have a 10 meg file, and the first five megs are mapped read/write. We do a collapse range of one meg starting at the one meg boundary. Menwhile another process is constantly modifying the pages between the 3 meg and 4 meg mark, via a mmap'ed region of memory. You can have the code which does the collapse range call filemap_fdatawrite_range() to flush out to disk all of the inode's dirty pages in the page cache, but between the call to filemap_fdatawrite_range() and truncate_page_cache_range(), more pages could have gotten dirtied, and those changes will get lost. The difference between punch_hole and collapse_range is that previously, we only needed to call truncate_page_cache_range() on the pages that were going to be disappearing from the page cache, so it didn't matter if you had someone dirtying the pages racing with the punch_hole operation. But in the collapse_range case, we need to call truncate_page_cache_range() on data that is not going to disappear, but rather be _renumbered_. I'll also note that depending on the use case, doing the renumbering of the pages by throwing all of the pages from the page cache and forcing them to be read back in from disk might not all be friendly to a performance sensitive application. In the case where the page size == the fs block size, instead of throwing away all of the pages, we could also have VM code which remaps the pages in the page cache (including potentially editing the page tables for any mmap'ed pages). This of course gets _much_ more complicated, and we still have to deal with the case where the collapse_range call is aligned with the fs block size, but which is not aligned or is not a muliple of the page size. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 01, 2013 at 12:06:05AM -0400, Theodore Ts'o wrote: > On Thu, Aug 01, 2013 at 12:59:14PM +1000, Dave Chinner wrote: > > > > This funtionality is not introducing any new problems w.r.t. mmap(). > > In terms of truncating a mmap'd file, that can already occur and > > the behaviour is well known. > > That's not race I'm worried about. Consider the following scenario. > We have a 10 meg file, and the first five megs are mapped read/write. > We do a collapse range of one meg starting at the one meg boundary. > Menwhile another process is constantly modifying the pages between the > 3 meg and 4 meg mark, via a mmap'ed region of memory. > > You can have the code which does the collapse range call > filemap_fdatawrite_range() to flush out to disk all of the inode's > dirty pages in the page cache, but between the call to > filemap_fdatawrite_range() and truncate_page_cache_range(), more pages > could have gotten dirtied, and those changes will get lost. That's one of the many race conditions that Jan's patches are intended to solve. Locking the range and then unmapping the pages will then cause the refault to block until the range is unlocked. While the range is locked, we can then writeback all the dirty pages, invalidate the page cache and modify the underlying extents without having to care about user page faults racing with the operation. Let's face it, without that we current cannot serialise *any* multi-page page cache operation against mmap() - direct IO has exactly the same problems as you are describing, as does hole punching, and anything else that relies on the i_mutex or other IO-based locks for serialisation. > The difference between punch_hole and collapse_range is that > previously, we only needed to call truncate_page_cache_range() on the > pages that were going to be disappearing from the page cache, so it > didn't matter if you had someone dirtying the pages racing with the > punch_hole operation. But in the collapse_range case, we need to call > truncate_page_cache_range() on data that is not going to disappear, > but rather be _renumbered_. Sure - as it said up front "the range of the pages being operated on" need to be invalidating. And, if you look at the ext4 patch, is actaully does exactly that.... > I'll also note that depending on the use case, doing the renumbering > of the pages by throwing all of the pages from the page cache and > forcing them to be read back in from disk might not all be friendly to > a performance sensitive application. Application problem. Besides, this is for accelerating video stream editting (i.e. NLE workflows). Such applications tend to use direct IO for recording and playback, and they most certainly won't have concurrent access to the video stream either by mmap or direct/buferred IO while the non linear editor is punching the advertisements out of the recorded video stream...... > In the case where the page size == the fs block size, instead of > throwing away all of the pages, we could also have VM code which > remaps the pages in the page cache (including potentially editing the > page tables for any mmap'ed pages). That way lies insanity. Not to mention that it's completely unnecessary for the use case this ioctl is actually accelerating. Cheers, Dave.
2013/8/1, Dave Chinner <david@fromorbit.com>: > On Wed, Jul 31, 2013 at 11:42:00PM +0900, Namjae Jeon wrote: >> From: Namjae Jeon <namjae.jeon@samsung.com> >> >> Fallocate now supports new FALLOC_FL_COLLAPSE_RANGE flag. >> The semantics of this flag are following: >> 1) It collapses the range lying between offset and length by removing any >> data >> blocks which are present in this range and than updates all the >> logical >> offsets of extents beyond "offset + len" to nullify the hole created >> by >> removing blocks. In short, it does not leave a hole. >> 1) It should be used exclusively. No other fallocate flag in combination. >> 2) Offset and length supplied to fallocate should be fs block size >> aligned. > Hi Dave. > Given that the rest of fallocate() interfaces are byte based, this > is going to cause some confusion if it's not well documented. i.e. > this restriction needs to be documented in the header file that is > exposed to userspace, as well as in the fallocate(2) man page. Agree. > >> 3) It wokrs beyond "EOF", so the extents which are pre-allocated beyond >> "EOF" >> are also updated. > > I don't think that's valid for this operation. If the range is > beyond EOF, you are not modifying anything visible to userspace, and > that makes it the same as a hole punch operation. So, I'd get rid of > thisnasty implementation complexity altogether. The basic idea behind collapse range is that it does'nt leaves a hole. And this is achieved by updating starting block of each of the extents which lies beyond hole. When we are updating an extent, we are actually moving hole to the right, towards the last allocated block in the file space. We continue doing this for each extent and finally when this operation completes, the hole is gone. If the file has preallocated extent(s) beyond EOF, something like this=> hole, extent1, extent2, extent_beyond_EOF1. And we start updating extents one by one, to nullify the effect of hole=> STEP1: After updating extent1: extent1, hole, extent2, extent_beyond_EOF1. STEP2: After updating extent2: extent1, extent2, hole, extent_beyond_EOF1. STEP3: After updating extent_beyond_EOF1: extent1, extent2, extent_beyond_EOF1, hole => And this removes hole from the file. If we stop updating just after extent2, this will not remove hole from the file space but instead, it will place the hole between EOF and 1st preallcoated extent beyond EOF as in STEP2. So, if user does an extending truncate, hole will became visible as it is still there. > >> 4) It reduces the i_size of inode by the amount of collapse range which >> lies >> within i_size. So, if offset >= i_size, i_size won't be changed at >> all. > > Similarly, I don't think that asking for a range that overlaps EOF > is valid, either. The moment you overlap or go beyond EOF, you are > effectively truncating the file as there is no data to collapse into > the range Yes, we could put a restriction on that, something like: if (len + offset > i_size) { return -EINVAL;} > > As it is, I don't see these EOF rules enforced by this patch, nor > are they documented at all in the patch. The EOF file rules are handled in the fs specific patches. As in XFS => + if (mode & FALLOC_FL_COLLAPSE_RANGE) { + error = -xfs_collapse_file_space(ip, offset + len, len); + if (error || offset >= i_size_read(inode)) + goto out_unlock; + + /* Shrink size in case of FALLOC_FL_COLLAPSE_RANGE */ + if ((offset + len) > i_size_read(inode)) + new_size = offset; + else + new_size = i_size_read(inode) - len; + error = -xfs_update_file_size(ip, new_size); But it seems ok to not allow (offset + len) > i_size. > > Regardless of what semantics we decide on, this needs xfs_io > support and extensive corner case tests added to xfstests so that we > can confirm the implementation in every filesystem is correct and we > don't introduce regressions in future. it also needs documentation > in the fallocate(2) man page. Yes, we are currently working on it. With next version of collapse_range patches, we will also send xfs_io and xfstests patches and manpage also. > >> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h >> index 990c4cc..7567c8c 100644 >> --- a/include/uapi/linux/falloc.h >> +++ b/include/uapi/linux/falloc.h >> @@ -1,9 +1,10 @@ >> #ifndef _UAPI_FALLOC_H_ >> #define _UAPI_FALLOC_H_ >> >> -#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */ >> -#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */ >> +#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */ >> +#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */ >> #define FALLOC_FL_NO_HIDE_STALE 0x04 /* reserved codepoint */ >> +#define FALLOC_FL_COLLAPSE_RANGE 0x08 /* it does not leave a hole */ > > I'd suggest that you need to explain what FALLOC_FL_COLLAPSE_RANGE > does in great detail right here. Okay! Thanks for review! > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 01, 2013 at 02:07:39PM +0900, Namjae Jeon wrote: > 2013/8/1, Dave Chinner <david@fromorbit.com>: > > On Wed, Jul 31, 2013 at 11:42:00PM +0900, Namjae Jeon wrote: > >> From: Namjae Jeon <namjae.jeon@samsung.com> > >> > >> Fallocate now supports new FALLOC_FL_COLLAPSE_RANGE flag. > >> The semantics of this flag are following: > >> 1) It collapses the range lying between offset and length by removing any > >> data > >> blocks which are present in this range and than updates all the > >> logical > >> offsets of extents beyond "offset + len" to nullify the hole created > >> by > >> removing blocks. In short, it does not leave a hole. > >> 1) It should be used exclusively. No other fallocate flag in combination. > >> 2) Offset and length supplied to fallocate should be fs block size > >> aligned. > > > Hi Dave. > > > Given that the rest of fallocate() interfaces are byte based, this > > is going to cause some confusion if it's not well documented. i.e. > > this restriction needs to be documented in the header file that is > > exposed to userspace, as well as in the fallocate(2) man page. > Agree. > > > > >> 3) It wokrs beyond "EOF", so the extents which are pre-allocated beyond > >> "EOF" > >> are also updated. > > > > I don't think that's valid for this operation. If the range is > > beyond EOF, you are not modifying anything visible to userspace, and > > that makes it the same as a hole punch operation. So, I'd get rid of > > thisnasty implementation complexity altogether. > The basic idea behind collapse range is that it does'nt leaves a hole. I know what collapse range is and what it's supposed to do. read again what I said - collapsing a range beyond EOF simply removes extents that the user can't otherwise see. That makes it the same as a hole punch operation. i.e. you don't need to use collapse range for this, because you can simply punch the same number of extents of the end of the preallocated range and get exactly the same result. i.e. we already have functionality to manipulate extents beyond EOF and so we don't need another. Cheers, Dave.
diff --git a/fs/open.c b/fs/open.c index d53e298..e076390 100644 --- a/fs/open.c +++ b/fs/open.c @@ -225,12 +225,15 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { struct inode *inode = file_inode(file); long ret; + unsigned i_blkbits = ACCESS_ONCE(inode->i_blkbits); + unsigned blksize_mask = (1 << i_blkbits) - 1; if (offset < 0 || len <= 0) return -EINVAL; /* Return error if mode is not supported */ - if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | + FALLOC_FL_COLLAPSE_RANGE)) return -EOPNOTSUPP; /* Punch hole must have keep size set */ @@ -241,8 +244,12 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len) if (!(file->f_mode & FMODE_WRITE)) return -EBADF; - /* It's not possible punch hole on append only file */ - if (mode & FALLOC_FL_PUNCH_HOLE && IS_APPEND(inode)) + /* + * It's not possible to punch hole or perform collapse range + * on append only file + */ + if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE) + && IS_APPEND(inode)) return -EPERM; if (IS_IMMUTABLE(inode)) @@ -270,6 +277,12 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len) if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0)) return -EFBIG; + /* Collapse range works only on fs block size aligned offsets */ + if ((mode & FALLOC_FL_COLLAPSE_RANGE) && + (offset & blksize_mask || len & blksize_mask || + mode & ~FALLOC_FL_COLLAPSE_RANGE)) + return -EINVAL; + if (!file->f_op->fallocate) return -EOPNOTSUPP; diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h index 990c4cc..7567c8c 100644 --- a/include/uapi/linux/falloc.h +++ b/include/uapi/linux/falloc.h @@ -1,9 +1,10 @@ #ifndef _UAPI_FALLOC_H_ #define _UAPI_FALLOC_H_ -#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */ -#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */ +#define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */ +#define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */ #define FALLOC_FL_NO_HIDE_STALE 0x04 /* reserved codepoint */ +#define FALLOC_FL_COLLAPSE_RANGE 0x08 /* it does not leave a hole */ #endif /* _UAPI_FALLOC_H_ */