Message ID | 49019EF6.4000706@rs.jp.nec.com |
---|---|
State | Deferred |
Headers | show |
On Oct 24, 2008 19:09 +0900, Akira Fujita wrote: > The EXT4_IOC_FIEMAP_INO is used to get extents information of > inode which set to ioctl. > The defragger uses this ioctl to check the fragment condition > and to get extents information in the specified block group. Instead of having a separate IOC number for each such ioctl, instead we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that passes in an inode number and a second IOC number so that arbitrary file ioctls can be run on any inode by root. This was mentioned last time these patches were posted, but there was no reply from you. Christoph suggested a more generic VFS open-by-inum, which isn't impossible to do but would cause a lot of controversy I think, while the EXT4_IOC_WRAPPER is at least contained within ext4, but is more generically useful than EXT4_IOC_FIEMAP_INO. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Oct 26, 2008 at 02:40:48AM -0600, Andreas Dilger wrote: > This was mentioned last time these patches were posted, but there was > no reply from you. Christoph suggested a more generic VFS open-by-inum, > which isn't impossible to do but would cause a lot of controversy I > think, while the EXT4_IOC_WRAPPER is at least contained within ext4, > but is more generically useful than EXT4_IOC_FIEMAP_INO. I'll hack up a generic open_by_handle and then we can gather the reaction - it shouldn't be more than about one or two hundred lines of code. Note that you really want an open by handle and not just inum for a defragmentation tool - without the generation you can easily run into races. Btw, any reason the XFS approach of passing in filehandles for both the inode to be defragmented and the "donor" inode doesn't work for you? -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Oct 26, 2008 at 04:48:48AM -0400, Christoph Hellwig wrote: > Btw, any reason the XFS approach of passing in filehandles for both > the inode to be defragmented and the "donor" inode doesn't work for you? Sorry should be file descriptor and not filehandle above. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Andreas, Andreas Dilger wrote: > On Oct 24, 2008 19:09 +0900, Akira Fujita wrote: >> The EXT4_IOC_FIEMAP_INO is used to get extents information of >> inode which set to ioctl. >> The defragger uses this ioctl to check the fragment condition >> and to get extents information in the specified block group. > > Instead of having a separate IOC number for each such ioctl, instead > we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that > passes in an inode number and a second IOC number so that arbitrary file > ioctls can be run on any inode by root. The EXT4_IOC_WRAPPER ioctl seems to be usuful for many situations. But the EXT4_IOC_FIEMAP_INO ioctl is used not only root user but also non-root user to call fiemap, so we cannot use the current EXT4_IOC_WRAPPER ioctl for defrag. > This was mentioned last time these patches were posted, but there was > no reply from you. Christoph suggested a more generic VFS open-by-inum, > which isn't impossible to do but would cause a lot of controversy I > think, while the EXT4_IOC_WRAPPER is at least contained within ext4, > but is more generically useful than EXT4_IOC_FIEMAP_INO. > Do you plan to add EXT4_IOC_WRAPPER into the ext4 patch queue? Regards, Akira Fujita -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Oct 27, 2008 19:21 +0900, Akira Fujita wrote: > Andreas Dilger wrote: >> On Oct 24, 2008 19:09 +0900, Akira Fujita wrote: >>> The EXT4_IOC_FIEMAP_INO is used to get extents information of >>> inode which set to ioctl. >>> The defragger uses this ioctl to check the fragment condition >>> and to get extents information in the specified block group. >> >> Instead of having a separate IOC number for each such ioctl, instead >> we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that >> passes in an inode number and a second IOC number so that arbitrary file >> ioctls can be run on any inode by root. > > The EXT4_IOC_WRAPPER ioctl seems to be usuful for many situations. > But the EXT4_IOC_FIEMAP_INO ioctl is used not only root user but also > non-root user to call fiemap, > so we cannot use the current EXT4_IOC_WRAPPER ioctl for defrag. Why does a regular user need to do the ioctl on a file that it may not have read permission to access? I can see this is useful for root doing a defrag of the whole filesystem instead of opening and closing all of the files, but for regular users we need to validate via the full path to ensure they can even access the file before defragmenting it. >> This was mentioned last time these patches were posted, but there was >> no reply from you. Christoph suggested a more generic VFS open-by-inum, >> which isn't impossible to do but would cause a lot of controversy I >> think, while the EXT4_IOC_WRAPPER is at least contained within ext4, >> but is more generically useful than EXT4_IOC_FIEMAP_INO. > > Do you plan to add EXT4_IOC_WRAPPER into the ext4 patch queue? If there is interest, yes. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andreas Dilger Wrote: > On Oct 27, 2008 19:21 +0900, Akira Fujita wrote: >> Andreas Dilger wrote: >>> On Oct 24, 2008 19:09 +0900, Akira Fujita wrote: >>>> The EXT4_IOC_FIEMAP_INO is used to get extents information of >>>> inode which set to ioctl. >>>> The defragger uses this ioctl to check the fragment condition >>>> and to get extents information in the specified block group. >>> Instead of having a separate IOC number for each such ioctl, instead >>> we implemented EXT4_IOC_WRAPPER, which is an root-specific ioctl that >>> passes in an inode number and a second IOC number so that arbitrary file >>> ioctls can be run on any inode by root. >> The EXT4_IOC_WRAPPER ioctl seems to be usuful for many situations. >> But the EXT4_IOC_FIEMAP_INO ioctl is used not only root user but also >> non-root user to call fiemap, >> so we cannot use the current EXT4_IOC_WRAPPER ioctl for defrag. > > Why does a regular user need to do the ioctl on a file that it may not > have read permission to access? I can see this is useful for root > doing a defrag of the whole filesystem instead of opening and closing > all of the files, but for regular users we need to validate via the > full path to ensure they can even access the file before defragmenting it. The FIEMAP_INO ioctl just passes a inode number belongs to the target block group from user space to kernel space and then the owner check is done in the kernel space. If the regular user (defrag -f excecutant) is owner of a file, defrag handles this file as the candidate of victim file which would be moved to the other block group to make free space. So I think the full path check is unneeded because the owner check is done in the kernel space (I'm not sure it's good enough). If it's not good in the security point of view, I will make defrag -f mode be done only by root user. >>> This was mentioned last time these patches were posted, but there was >>> no reply from you. Christoph suggested a more generic VFS open-by-inum, >>> which isn't impossible to do but would cause a lot of controversy I >>> think, while the EXT4_IOC_WRAPPER is at least contained within ext4, >>> but is more generically useful than EXT4_IOC_FIEMAP_INO. >> Do you plan to add EXT4_IOC_WRAPPER into the ext4 patch queue? > > If there is interest, yes. How do the other ext4 developers think about implementing EXT4_IOC_WRAPPER? Will it be used only for defrag so far? Regards, Akira Fujita -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Akira, can you please comment on these issues before going on? I think the generation issue is a particularly important one if you want to allow defrag by normal users. On Sun, Oct 26, 2008 at 04:48:48AM -0400, Christoph Hellwig wrote: > On Sun, Oct 26, 2008 at 02:40:48AM -0600, Andreas Dilger wrote: > > This was mentioned last time these patches were posted, but there was > > no reply from you. Christoph suggested a more generic VFS open-by-inum, > > which isn't impossible to do but would cause a lot of controversy I > > think, while the EXT4_IOC_WRAPPER is at least contained within ext4, > > but is more generically useful than EXT4_IOC_FIEMAP_INO. > > I'll hack up a generic open_by_handle and then we can gather the > reaction - it shouldn't be more than about one or two hundred lines of > code. Note that you really want an open by handle and not just inum for > a defragmentation tool - without the generation you can easily run into > races. > > Btw, any reason the XFS approach of passing in *file descriptors* for both > the inode to be defragmented and the "donor" inode doesn't work for you? -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Oct 31, 2008 18:46 +0900, Akira Fujita wrote: >> Why does a regular user need to do the ioctl on a file that it may not >> have read permission to access? I can see this is useful for root >> doing a defrag of the whole filesystem instead of opening and closing >> all of the files, but for regular users we need to validate via the >> full path to ensure they can even access the file before defragmenting it. > > The FIEMAP_INO ioctl just passes a inode number belongs to > the target block group from user space to kernel space > and then the owner check is done in the kernel space. > > If the regular user (defrag -f excecutant) is owner of a file, > defrag handles this file as the candidate of victim file which would > be moved to the other block group to make free space. > > So I think the full path check is unneeded because the owner check > is done in the kernel space (I'm not sure it's good enough). > If it's not good in the security point of view, > I will make defrag -f mode be done only by root user. If the defrag operation is limited to the owner of the file (or root via CAP_DAC_OVERRIDE) then this is probably OK also. The data never gets to userspace so there is relatively little risk to this operation. >>>> This was mentioned last time these patches were posted, but there was >>>> no reply from you. Christoph suggested a more generic VFS open-by-inum, >>>> which isn't impossible to do but would cause a lot of controversy I >>>> think, while the EXT4_IOC_WRAPPER is at least contained within ext4, >>>> but is more generically useful than EXT4_IOC_FIEMAP_INO. > > How do the other ext4 developers think about > implementing EXT4_IOC_WRAPPER? > Will it be used only for defrag so far? I expect the initial users of this ioctl will be FIEMAP and DEFRAG, but it might also be useful for other ioctls in the future. I haven't really asked other ext4 developers about it yet, and nobody else has commented the last time I posted the patch. I don't have an objection to Christoph's open-by-FH API, if there is acceptance of this from other kernel developers (Al Viro in particular), but that exposes a lot more security issues than just the ioctl wrapper. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Christoph, Christoph Hellwig wrote: > Akira, can you please comment on these issues before going on? > I think the generation issue is a particularly important one if you > want to allow defrag by normal users. For a regular user defrag (-f), I'll add the check to solve the generation issue in the following procedures (1-4). Please check whether my approach is right or not. 1. Acquire the extents information of inode in kernel space and then return it to user space with EXT4_IOC_FIEMAP_INO. 2. Calculate the victim extents from the combination of extents (the result of 1) and free space extents. 3. Pass the victim extents (move to the other block group to make free space) from user space to kernel space with EXT4_IOC_MOVE_VICTIM. 4. In kernel space, make sure the permission and the extents construction of the passed inode (the passed inode still covers the victim extent area or not). If there is no problem, defrag continues its process. If check (4) failed, it means victim extents was changed and that area might be already used by other users, so defrag will fail. If my approach doesn't seem to be a problem, I will work on it. Regards, Akira Fujita > On Sun, Oct 26, 2008 at 04:48:48AM -0400, Christoph Hellwig wrote: >> On Sun, Oct 26, 2008 at 02:40:48AM -0600, Andreas Dilger wrote: >>> This was mentioned last time these patches were posted, but there was >>> no reply from you. Christoph suggested a more generic VFS open-by-inum, >>> which isn't impossible to do but would cause a lot of controversy I >>> think, while the EXT4_IOC_WRAPPER is at least contained within ext4, >>> but is more generically useful than EXT4_IOC_FIEMAP_INO. >> I'll hack up a generic open_by_handle and then we can gather the >> reaction - it shouldn't be more than about one or two hundred lines of >> code. Note that you really want an open by handle and not just inum for >> a defragmentation tool - without the generation you can easily run into >> races. >> >> Btw, any reason the XFS approach of passing in *file descriptors* for both >> the inode to be defragmented and the "donor" inode doesn't work for you? -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Oct 31, 2008 at 06:05:47AM -0400, Christoph Hellwig wrote: > > I'll hack up a generic open_by_handle and then we can gather the > > reaction - it shouldn't be more than about one or two hundred lines of > > code. Note that you really want an open by handle and not just inum for > > a defragmentation tool - without the generation you can easily run into > > races. I think having a generic open_by_handle is a Good Thing, but it actually isn't quite enough for defrag, because that brings up the question of how defrag can create the handle in the first place. In the case of Aryan's desire to get the list of files that were read during boot, it's pretty obvious how we can define an interface which would make available a set of file handles corresponding to the files that were opened during the boot process, and then that list of file handles can be saved to a file and used on the subsequent boot to do the readahead. Fairly straight forward. In the case of the defrag situation, though, we need to step back and figure out what we are trying to do. What the userspace program is apparently trying to do is to get the block extent maps used by all of the inodes in the block group. The problem is we aren't opening the inodes by pathname, so we couldn't create a handle in the first place (since in order to create a handle, we need the containing directory). The bigger question is whether the defrag code is asking the right question in the first place. The issue is that is that it currently assumes that in order to find the owner of a particular block (or more generally, extent), you should search the inodes in the block's blockgroup. The problem is that for a very fragmented filesystem, most of the blocks' owners might not be in their block group. In fact, over time, if you use defrag -f frequently, it will move blocks belonging to inodes in one block group to other block groups, which will make defrag -f's job even harder, and eventually impossible, for inodes belonging to other block groups. > Akira, can you please comment on these issues before going on? > I think the generation issue is a particularly important one if you > want to allow defrag by normal users. I'm not at all sure that it makes sense to allow "defrag -f" to be used by normal users. The problem here is we're fighting multiple constraints. First of all, we want to keep policy in the kernel to an absolute minimum, since debugging kernel code is a mess, and I don't think we want the complexity of a full-fledge defragger in the kernel. Secondly, though, if we are going to do this in userspace, we need to push a huge amount of information to the userspace defrag program, that immediately raises some very serious security issues, because we don't want to leak information unduly to random userspace programs. > > Btw, any reason the XFS approach of passing in *file descriptors* for both > > the inode to be defragmented and the "donor" inode doesn't work for you? I agree this is probably the better approach; it would certainly reduce the amount of new code that needs to be added to the kernel. Right now the "donor"/temporary inode is created and allocated in the kernel, and then the kernel decides whether or not the temporary inode is an improvement. If we make the userspace code responsible for creating the temporary inode and then using fallocate() to allocate the new blocks, then userspace can call FIEMAP and decide whether it is an improvement. - Ted P.S. I've been looking at ext4_defrag_partial(), and the page locking looks wrong. The page is only locked if it it hasn't been read into memory yet? And at the end of the function, we do this? if (PageLocked(page)) unlock_page(page); -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext4/defrag.c b/fs/ext4/defrag.c index 891e599..67030bc 100644 --- a/fs/ext4/defrag.c +++ b/fs/ext4/defrag.c @@ -20,6 +20,8 @@ #include "ext4_extents.h" #include "group.h" +#define FIEMAP_MAX_EXTENTS (UINT_MAX / sizeof(struct fiemap_extent)) + /** * ext4_defrag_next_extent - Search for the next extent and set it to "extent" * @@ -91,6 +93,117 @@ err: } /** + * ext4_defrag_fiemap_ino - Get extents information by inode number + * + * @filp: pointer to file + * @arg: pointer to fiemap_ino + * @fiemap_ino->ino: an inode number which is used to get + * extent information + * @fiemap_ino->fiemap: request for fiemap ioctl + * + * This function returns 0 if succeed, otherwise returns error value. + */ +int +ext4_defrag_fiemap_ino(struct file *filp, unsigned long arg) +{ + struct fiemap_ino fiemap_ino; + struct fiemap_extent_info fieinfo = { 0, }; + struct inode *inode; + struct super_block *sb = filp->f_dentry->d_inode->i_sb; + u64 len = 0; + int err = 0; + + if (copy_from_user(&fiemap_ino, (struct fiemap_ino __user *)arg, + sizeof(struct fiemap_ino))) + return -EFAULT; + + /* Special inodes shouldn't be choiced */ + if (fiemap_ino.ino < EXT4_GOOD_OLD_FIRST_INO) + return -ENOENT; + + inode = ext4_iget(sb, fiemap_ino.ino); + if (IS_ERR(inode)) + return PTR_ERR(inode); + + /* + * If we do not have the permission to access the inode, + * just skip it. + */ + if (!capable(CAP_DAC_OVERRIDE)) { + if ((inode->i_mode & S_IRUSR) != S_IRUSR) + return -EACCES; + if (current->fsuid != inode->i_uid) + return -EACCES; + } + + /* Return -ENOENT if a file does not exist */ + if (!inode->i_nlink || !S_ISREG(inode->i_mode)) { + err = -ENOENT; + goto out; + } + + if (!inode->i_op->fiemap) { + err = -EOPNOTSUPP; + goto out; + } + + if (fiemap_ino.fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS) { + err = -EINVAL; + goto out; + } + + if (fiemap_ino.fiemap.fm_length == 0) { + err = -EINVAL; + goto out; + } + + if (fiemap_ino.fiemap.fm_start > sb->s_maxbytes) { + err = -EFBIG; + goto out; + } + + /* + * Check offset and length + * If the specified range exceeds the max file size, + * adjust the length. + */ + if ((fiemap_ino.fiemap.fm_length > sb->s_maxbytes) || + (sb->s_maxbytes - fiemap_ino.fiemap.fm_length) + < fiemap_ino.fiemap.fm_start) + len = sb->s_maxbytes - fiemap_ino.fiemap.fm_start; + else + len = fiemap_ino.fiemap.fm_length; + + fieinfo.fi_flags = fiemap_ino.fiemap.fm_flags; + fieinfo.fi_extents_max = fiemap_ino.fiemap.fm_extent_count; + fieinfo.fi_extents_start = + (struct fiemap_extent *)(arg + sizeof(fiemap_ino)); + + if (fiemap_ino.fiemap.fm_extent_count != 0 && + !access_ok(VERIFY_WRITE, fieinfo.fi_extents_start, + fieinfo.fi_extents_max * sizeof(struct fiemap_extent))) { + err = -EFAULT; + goto out; + } + + if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC) + filemap_write_and_wait(inode->i_mapping); + + err = inode->i_op->fiemap(inode, &fieinfo, + fiemap_ino.fiemap.fm_start, len); + if (!err) { + fiemap_ino.fiemap.fm_flags = fieinfo.fi_flags; + fiemap_ino.fiemap.fm_mapped_extents = fieinfo.fi_extents_mapped; + if (copy_to_user((char *)arg, &fiemap_ino, sizeof(fiemap_ino))) + err = -EFAULT; + } + +out: + iput(inode); + return err; +} + +/** * ext4_defrag_merge_across_blocks - Merge extents across leaf block * * @handle: journal handle diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index f7b092d..c72703f 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -305,6 +305,7 @@ struct ext4_new_group_data { #define EXT4_IOC_DEFRAG _IOW('f', 15, struct ext4_ext_defrag_data) #define EXT4_IOC_SUPER_INFO _IOR('f', 16, struct ext4_super_block) #define EXT4_IOC_FREE_BLOCKS_INFO _IOWR('f', 17, struct ext4_extents_info) +#define EXT4_IOC_FIEMAP_INO _IOWR('f', 18, struct fiemap_ino) /* * ioctl commands in 32 bit emulation @@ -352,6 +353,11 @@ struct ext4_extents_info { struct ext4_extent_data ext[DEFRAG_MAX_ENT]; }; +struct fiemap_ino { + __u64 ino; + struct fiemap fiemap; +}; + #define EXT4_TRANS_META_BLOCKS 4 /* bitmap + group desc + sb + inode */ /* @@ -1185,6 +1191,7 @@ extern int ext4_ext_journal_restart(handle_t *handle, int needed); /* defrag.c */ extern int ext4_defrag(struct file *filp, ext4_lblk_t block_start, ext4_lblk_t defrag_size, ext4_fsblk_t goal); +extern int ext4_defrag_fiemap_ino(struct file *filp, unsigned long arg); static inline ext4_fsblk_t ext4_blocks_count(struct ext4_super_block *es) { diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index 5709574..b69b54a 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -256,6 +256,13 @@ setversion_out: return 0; } + case EXT4_IOC_FIEMAP_INO: { + int err; + + err = ext4_defrag_fiemap_ino(filp, arg); + return err; + } + case EXT4_IOC_FREE_BLOCKS_INFO: { struct ext4_extents_info ext_info; int err;