From patchwork Mon Jun 25 08:51:59 2012
X-Patchwork-Submitter: Zheng Liu
X-Patchwork-Id: 167007
Date: Mon, 25 Jun 2012 16:51:59 +0800
From: Zheng Liu
To: Fredrick
Cc: linux-ext4@vger.kernel.org, Andreas Dilger
Subject: Re: ext4_fallocate
Message-ID: <20120625085159.GA18931@gmail.com>
In-Reply-To: <4FE8086F.4070506@zoho.com>
References: <4FE8086F.4070506@zoho.com>
X-Mailing-List: linux-ext4@vger.kernel.org

On Sun, Jun 24, 2012 at 11:42:55PM -0700, Fredrick wrote:
> Hello Ext4 developers,
>
> When calling fallocate on ext4 fs, ext4_fallocate does not initialize
> the extents. The extents are allocated only when they are actually
> written. This is causing a problem for us. Our programs create many
> "write only once" files as large as 1G on ext4 very rapidly at times.
> We thought fallocate would solve the problem. But it didn't.
> If I change the EXT4_GET_BLOCKS_CREATE_UNINIT_EXT to
> just EXT4_GET_BLOCKS_CREATE in the ext4_map_blocks in the
> ext4_fallocate call,
> the extents get created in fallocate call itself. This is helping us.
> Now the write throughput to the disk was close to 98%. When extents
> were not initialized, our disk throughput was only 70%.
>
> Can this change be made to ext4_fallocate?

Hi Fredrick,

I think this patch may help you. :-)

I wanted to send you a URL from the linux mailing list archive, but I
cannot find it. After applying this patch, you can call ioctl(2) to
enable the expose_stale_data flag; after that, when you call
fallocate(2), ext4 creates initialized extents for you. This patch
cannot be merged into the upstream kernel because it opens a huge
security hole.

Regards,
Zheng

From: Zheng Liu
Date: Wed, 6 Jun 2012 11:10:57 +0800
Subject: [RFC][PATCH v2] ext4: add expose_stale_data flag in fallocate

Here is the v2 of FALLOC_FL_NO_HIDE_STALE in fallocate. No new flag is
added to the vfs any more, in order to reduce the impact and avoid a
huge security hole. The application can no longer pass a new flag to
fallocate; instead it calls ioctl to enable or disable this feature.
Meanwhile, in ioctl, the filesystem checks CAP_SYS_RAWIO to ensure that
the user has the privilege to switch it on or off. Currently, I have
only implemented it in ext4.

Even though I try to reduce its impact, this feature still opens a
security hole. So the application must ensure that it initializes the
preallocated extent itself before reading it, and the feature should
only be used in a limited environment.

v1 -> v2:
* remove FALLOC_FL_NO_HIDE_STALE flag in vfs
* add 'i_expose_stale_data' in ext4 to enable/disable it

Signed-off-by: Zheng Liu
---
 fs/ext4/ext4.h    |    5 +++++
 fs/ext4/extents.c |    6 +++++-
 fs/ext4/ioctl.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/super.c   |    1 +
 4 files changed, 54 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index cfc4e01..61da070 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -606,6 +606,8 @@ enum {
 #define EXT4_IOC_ALLOC_DA_BLKS		_IO('f', 12)
 #define EXT4_IOC_MOVE_EXT		_IOWR('f', 15, struct move_extent)
 #define EXT4_IOC_RESIZE_FS		_IOW('f', 16, __u64)
+#define EXT4_IOC_GET_EXPOSE_STALE	_IOR('f', 17, int)
+#define EXT4_IOC_SET_EXPOSE_STALE	_IOW('f', 18, int)

 #if defined(__KERNEL__) && defined(CONFIG_COMPAT)
 /*
@@ -925,6 +927,9 @@ struct ext4_inode_info {

 	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
 	__u32 i_csum_seed;
+
+	/* expose stale data in creating a new extent */
+	int i_expose_stale_data;
 };

 /*
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 91341ec..9ef883c 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4336,6 +4336,7 @@ static void ext4_falloc_update_inode(struct inode *inode,
 long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
+	struct ext4_inode_info *ei = EXT4_I(inode);
 	handle_t *handle;
 	loff_t new_size;
 	unsigned int max_blocks;
@@ -4379,7 +4380,10 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
 		return ret;
 	}
-	flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
+	if (ei->i_expose_stale_data)
+		flags = EXT4_GET_BLOCKS_CREATE;
+	else
+		flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
 	if (mode & FALLOC_FL_KEEP_SIZE)
 		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 	/*
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 8ad112a..fffb3eb 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -445,6 +445,47 @@ resizefs_out:
 		return 0;
 	}

+	case EXT4_IOC_GET_EXPOSE_STALE: {
+		int enable;
+
+		/* security check */
+		if (!capable(CAP_SYS_RAWIO))
+			return -EPERM;
+
+		/*
+		 * currently only extent-based files support (pre)allocate with
+		 * EXPOSE_STALE_DATA flag
+		 */
+		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+			return -EOPNOTSUPP;
+
+		enable = ei->i_expose_stale_data;
+
+		return put_user(enable, (int __user *) arg);
+	}
+
+	case EXT4_IOC_SET_EXPOSE_STALE: {
+		int enable;
+
+		/* security check */
+		if (!capable(CAP_SYS_RAWIO))
+			return -EPERM;
+
+		/*
+		 * currently only extent-based files support (pre)allocate with
+		 * EXPOSE_STALE_DATA flag
+		 */
+		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+			return -EOPNOTSUPP;
+
+		if (get_user(enable, (int __user *) arg))
+			return -EFAULT;
+
+		ei->i_expose_stale_data = enable;
+
+		return 0;
+	}
+
 	default:
 		return -ENOTTY;
 	}
@@ -508,6 +549,8 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case EXT4_IOC_MOVE_EXT:
 	case FITRIM:
 	case EXT4_IOC_RESIZE_FS:
+	case EXT4_IOC_GET_EXPOSE_STALE:
+	case EXT4_IOC_SET_EXPOSE_STALE:
 		break;
 	default:
 		return -ENOIOCTLCMD;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index eb7aa3e..3654bb8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -988,6 +988,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_datasync_tid = 0;
 	atomic_set(&ei->i_ioend_count, 0);
 	atomic_set(&ei->i_aiodio_unwritten, 0);
+	ei->i_expose_stale_data = 0;

 	return &ei->vfs_inode;
 }
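
For reference, here is a rough user-space sketch of how an application
might drive the two ioctls from this patch. It is only an illustration
under assumptions: the ioctl numbers are copied from the ext4.h hunk
above and exist only on a kernel carrying this RFC patch, and the helper
names (set_expose_stale, get_expose_stale, prealloc_initialized) are
made up for this example. The caller needs CAP_SYS_RAWIO and an
extent-based file; on an unpatched kernel the ioctl calls should simply
fail.

```c
#define _GNU_SOURCE		/* for fallocate(2) */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/types.h>

/* Copied from the fs/ext4/ext4.h hunk of this patch; not in mainline. */
#define EXT4_IOC_GET_EXPOSE_STALE	_IOR('f', 17, int)
#define EXT4_IOC_SET_EXPOSE_STALE	_IOW('f', 18, int)

/* Hypothetical helper: switch the per-inode flag on (1) or off (0).
 * Returns 0 on success or a negative errno value. */
int set_expose_stale(int fd, int enable)
{
	if (ioctl(fd, EXT4_IOC_SET_EXPOSE_STALE, &enable) < 0)
		return -errno;
	return 0;
}

/* Hypothetical helper: read the current flag value into *enabled.
 * Returns 0 on success or a negative errno value. */
int get_expose_stale(int fd, int *enabled)
{
	assert(enabled != NULL);
	if (ioctl(fd, EXT4_IOC_GET_EXPOSE_STALE, enabled) < 0)
		return -errno;
	return 0;
}

/* Hypothetical helper: preallocate len bytes from offset 0 with the
 * flag enabled, then switch the flag off again.
 * Returns 0 on success or a negative errno value. */
int prealloc_initialized(int fd, off_t len)
{
	int ret = set_expose_stale(fd, 1);

	if (ret)
		return ret;

	if (fallocate(fd, 0, 0, len) < 0)
		ret = -errno;

	/* The flag is kept in the in-memory inode, so it affects every
	 * fallocate call on this file until it is cleared. The result
	 * of clearing it is deliberately ignored here. */
	set_expose_stale(fd, 0);
	return ret;
}
```

An application would open the file for writing, call
prealloc_initialized(fd, size) once, and then write the whole range
before anything reads it, because the preallocated blocks still hold
whatever was on the disk before. The sketch clears the flag right after
fallocate because i_expose_stale_data is stored in ext4_inode_info and
is not tied to one file descriptor.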