From patchwork Mon Sep 23 05:44:06 2019
X-Patchwork-Submitter: Ritesh Harjani <riteshh@linux.ibm.com>
X-Patchwork-Id: 1165803
From: Ritesh Harjani <riteshh@linux.ibm.com>
To: jack@suse.cz, tytso@mit.edu, linux-ext4@vger.kernel.org, adilger@dilger.ca
Cc: joseph.qi@linux.alibaba.com, david@fromorbit.com, hch@infradead.org,
    riteshh@linux.ibm.com, mbobrowski@mbobrowski.org, rgoldwyn@suse.de,
    aneesh.kumar@linux.ibm.com, linux-kernel@vger.kernel.org
Subject: [RFC-v2 1/2] ext4: Add ext4_ilock & ext4_iunlock API
Date: Mon, 23 Sep 2019 11:14:06 +0530
Message-Id: <20190923054407.8102-2-riteshh@linux.ibm.com>
In-Reply-To: <20190923054407.8102-1-riteshh@linux.ibm.com>
References: <20190923054407.8102-1-riteshh@linux.ibm.com>
X-Mailing-List: linux-ext4@vger.kernel.org

This adds the ext4_ilock/ext4_iunlock family of APIs. These are
preparatory APIs that make shared locking/unlocking, and restarting
with exclusive locking/unlocking, easier to implement in the next
patch.

Along with the above, this also addresses the AIM7 regression that was
earlier fixed only for XFS in commit 942491c9e6d6 ("xfs: fix AIM7
regression").
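As an illustration, a typical caller would look like this (usage
sketch only, not taken from the patch; needs_exclusive() is a
hypothetical stand-in for whatever condition forces exclusive
locking):

	unsigned int iolock = EXT4_IOLOCK_SHARED;

	ext4_ilock(inode, iolock);
	if (needs_exclusive(inode)) {
		/* restart the operation under the exclusive lock */
		ext4_iunlock(inode, iolock);
		iolock = EXT4_IOLOCK_EXCL;
		ext4_ilock(inode, iolock);
	}
	/* ... do the IO ... */
	ext4_iunlock(inode, iolock);

Passing the lock type around as a value is what makes this
unlock/relock-and-retry pattern straightforward compared to
open-coding inode_lock_shared()/inode_lock() at every call site.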
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/ext4.h    | 33 +++++++++++++++++++++++++
 fs/ext4/extents.c | 16 ++++++------
 fs/ext4/file.c    | 63 +++++++++++++++++++++++++----------------------
 fs/ext4/inode.c   |  4 +--
 fs/ext4/ioctl.c   | 16 ++++++------
 fs/ext4/super.c   | 12 ++++-----
 fs/ext4/xattr.c   | 16 ++++++------
 7 files changed, 98 insertions(+), 62 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2ab91815f52d..9ffafbe6bc3f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2945,6 +2945,39 @@ static inline int ext4_update_inode_size(struct inode *inode, loff_t newsize)
 	return changed;
 }
 
+#define EXT4_IOLOCK_EXCL	(1 << 0)
+#define EXT4_IOLOCK_SHARED	(1 << 1)
+
+static inline void ext4_ilock(struct inode *inode, unsigned int iolock)
+{
+	if (iolock == EXT4_IOLOCK_EXCL)
+		inode_lock(inode);
+	else
+		inode_lock_shared(inode);
+}
+
+static inline void ext4_iunlock(struct inode *inode, unsigned int iolock)
+{
+	if (iolock == EXT4_IOLOCK_EXCL)
+		inode_unlock(inode);
+	else
+		inode_unlock_shared(inode);
+}
+
+static inline int ext4_ilock_nowait(struct inode *inode, unsigned int iolock)
+{
+	if (iolock == EXT4_IOLOCK_EXCL)
+		return inode_trylock(inode);
+	else
+		return inode_trylock_shared(inode);
+}
+
+static inline void ext4_ilock_demote(struct inode *inode, unsigned int iolock)
+{
+	BUG_ON(iolock != EXT4_IOLOCK_EXCL);
+	downgrade_write(&inode->i_rwsem);
+}
+
 int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 				      loff_t len);
 
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a869e206bd81..ef37f4d4ee7e 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4680,7 +4680,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	else
 		max_blocks -= lblk;
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 
 	/*
 	 * Indirect files do not support unwritten extnets
@@ -4790,7 +4790,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	ext4_journal_stop(handle);
 out_mutex:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	return ret;
 }
 
@@ -4856,7 +4856,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (mode & FALLOC_FL_KEEP_SIZE)
 		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 
 	/*
 	 * We only support preallocation for extent-based files only
@@ -4887,7 +4887,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 				EXT4_I(inode)->i_sync_tid);
 	}
 out:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
 	return ret;
 }
@@ -5387,7 +5387,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 			return ret;
 	}
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 	/*
 	 * There is no need to overlap collapse range with EOF, in which case
 	 * it is effectively a truncate operation
@@ -5486,7 +5486,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 out_mmap:
 	up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	return ret;
 }
 
@@ -5537,7 +5537,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 			return ret;
 	}
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 	/* Currently just for extent based files */
 	if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
 		ret = -EOPNOTSUPP;
@@ -5664,7 +5664,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 out_mmap:
 	up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	return ret;
 }
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d2ff383a8b9f..ce1cecbae932 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -57,14 +57,15 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	/*
 	 * Get exclusion from truncate and other inode operations.
 	 */
-	if (!inode_trylock_shared(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
 			return -EAGAIN;
-		inode_lock_shared(inode);
+	} else {
+		ext4_ilock(inode, EXT4_IOLOCK_SHARED);
 	}
 
 	if (!ext4_dio_checks(inode)) {
-		inode_unlock_shared(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 		/*
 		 * Fallback to buffered IO if the operation being
 		 * performed on the inode is not supported by direct
@@ -77,7 +78,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	}
 
 	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL);
-	inode_unlock_shared(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
 	return ret;
@@ -89,22 +90,23 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
-	if (!inode_trylock_shared(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
 			return -EAGAIN;
-		inode_lock_shared(inode);
+	} else {
+		ext4_ilock(inode, EXT4_IOLOCK_SHARED);
 	}
 	/*
 	 * Recheck under inode lock - at this point we are sure it cannot
 	 * change anymore
 	 */
 	if (!IS_DAX(inode)) {
-		inode_unlock_shared(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 		/* Fallback to buffered IO in case we cannot support DAX */
 		return generic_file_read_iter(iocb, to);
 	}
 	ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
-	inode_unlock_shared(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
 	return ret;
@@ -241,7 +243,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		return -EOPNOTSUPP;
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 	ret = ext4_write_checks(iocb, from);
 	if (ret <= 0)
 		goto out;
@@ -250,7 +252,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
 	current->backing_dev_info = NULL;
 out:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	if (likely(ret > 0)) {
 		iocb->ki_pos += ret;
 		ret = generic_write_sync(iocb, ret);
@@ -374,15 +376,17 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	loff_t offset = iocb->ki_pos;
 	struct inode *inode = file_inode(iocb->ki_filp);
 	bool extend = false, overwrite = false, unaligned_aio = false;
+	unsigned int iolock = EXT4_IOLOCK_EXCL;
 
-	if (!inode_trylock(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!ext4_ilock_nowait(inode, iolock))
 			return -EAGAIN;
-		inode_lock(inode);
+	} else {
+		ext4_ilock(inode, iolock);
 	}
 
 	if (!ext4_dio_checks(inode)) {
-		inode_unlock(inode);
+		ext4_iunlock(inode, iolock);
 		/*
 		 * Fallback to buffered IO if the operation on the
 		 * inode is not supported by direct IO.
@@ -392,7 +396,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	ret = ext4_write_checks(iocb, from);
 	if (ret <= 0) {
-		inode_unlock(inode);
+		ext4_iunlock(inode, iolock);
 		return ret;
 	}
 
@@ -416,7 +420,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
 	    ext4_should_dioread_nolock(inode)) {
 		overwrite = true;
-		downgrade_write(&inode->i_rwsem);
+		ext4_ilock_demote(inode, iolock);
+		iolock = EXT4_IOLOCK_SHARED;
 	}
 
 	if (offset + count > i_size_read(inode) ||
@@ -438,10 +443,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
 		inode_dio_wait(inode);
 
-	if (overwrite)
-		inode_unlock_shared(inode);
-	else
-		inode_unlock(inode);
+	ext4_iunlock(inode, iolock);
 
 	if (ret >= 0 && iov_iter_count(from))
 		return ext4_buffered_write_iter(iocb, from);
@@ -457,10 +459,11 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	loff_t offset;
 	struct inode *inode = file_inode(iocb->ki_filp);
 
-	if (!inode_trylock(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_EXCL))
 			return -EAGAIN;
-		inode_lock(inode);
+	} else {
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 	}
 
 	ret = ext4_write_checks(iocb, from);
@@ -480,7 +483,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (error)
 		ret = error;
 out:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	if (ret > 0)
 		ret = generic_write_sync(iocb, ret);
 	return ret;
@@ -707,14 +710,14 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
 		return generic_file_llseek_size(file, offset, whence,
 						maxbytes, i_size_read(inode));
 	case SEEK_HOLE:
-		inode_lock_shared(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_SHARED);
 		offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
-		inode_unlock_shared(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 		break;
 	case SEEK_DATA:
-		inode_lock_shared(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_SHARED);
 		offset = iomap_seek_data(inode, offset, &ext4_iomap_ops);
-		inode_unlock_shared(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 		break;
 	}
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a4f0749527c7..2870699ee504 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3914,7 +3914,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 			return ret;
 	}
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 
 	/* No need to punch hole beyond i_size */
 	if (offset >= inode->i_size)
@@ -4021,7 +4021,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 out_dio:
 	up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	return ret;
 }
 
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 442f7ef873fc..c6ae48567207 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -787,13 +787,13 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (err)
 			return err;
 
-		inode_lock(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		err = ext4_ioctl_check_immutable(inode,
 				from_kprojid(&init_user_ns, ei->i_projid),
 				flags);
 		if (!err)
 			err = ext4_ioctl_setflags(inode, flags);
-		inode_unlock(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 		mnt_drop_write_file(filp);
 		return err;
 	}
@@ -824,7 +824,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 			goto setversion_out;
 		}
 
-		inode_lock(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
 		if (IS_ERR(handle)) {
 			err = PTR_ERR(handle);
@@ -839,7 +839,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		ext4_journal_stop(handle);
 
 unlock_out:
-		inode_unlock(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 setversion_out:
 		mnt_drop_write_file(filp);
 		return err;
@@ -958,9 +958,9 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		 * ext4_ext_swap_inode_data before we switch the
 		 * inode format to prevent read.
 		 */
-		inode_lock((inode));
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		err = ext4_ext_migrate(inode);
-		inode_unlock((inode));
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 		mnt_drop_write_file(filp);
 		return err;
 	}
@@ -1150,7 +1150,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (err)
 			return err;
 
-		inode_lock(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		ext4_fill_fsxattr(inode, &old_fa);
 		err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
 		if (err)
@@ -1165,7 +1165,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 			goto out;
 		err = ext4_ioctl_setproject(filp, fa.fsx_projid);
 out:
-		inode_unlock(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 		mnt_drop_write_file(filp);
 		return err;
 	}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4079605d437a..45519036de83 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2681,12 +2681,12 @@ static void ext4_orphan_cleanup(struct super_block *sb,
 					__func__, inode->i_ino, inode->i_size);
 			jbd_debug(2, "truncating inode %lu to %lld bytes\n",
 				  inode->i_ino, inode->i_size);
-			inode_lock(inode);
+			ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 			truncate_inode_pages(inode->i_mapping, inode->i_size);
 			ret = ext4_truncate(inode);
 			if (ret)
 				ext4_std_error(inode->i_sb, ret);
-			inode_unlock(inode);
+			ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 			nr_truncates++;
 		} else {
 			if (test_opt(sb, DEBUG))
@@ -5763,7 +5763,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
 		 * files. If this fails, we return success anyway since quotas
 		 * are already enabled and this is not a hard failure.
 		 */
-		inode_lock(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		handle = ext4_journal_start(inode, EXT4_HT_QUOTA, 1);
 		if (IS_ERR(handle))
 			goto unlock_inode;
@@ -5773,7 +5773,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
 		ext4_mark_inode_dirty(handle, inode);
 		ext4_journal_stop(handle);
 	unlock_inode:
-		inode_unlock(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	}
 	return err;
 }
@@ -5865,7 +5865,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
 	if (err || ext4_has_feature_quota(sb))
 		goto out_put;
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 	/*
 	 * Update modification times of quota files when userspace can
 	 * start looking at them. If we fail, we return success anyway since
@@ -5880,7 +5880,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
 	ext4_mark_inode_dirty(handle, inode);
 	ext4_journal_stop(handle);
 out_unlock:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 out_put:
 	lockdep_set_quota_inode(inode, I_DATA_SEM_NORMAL);
 	iput(inode);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 491f9ee4040e..dbe3e2900c24 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -422,9 +422,9 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
 		ext4_set_inode_state(inode, EXT4_STATE_LUSTRE_EA_INODE);
 		ext4_xattr_inode_set_ref(inode, 1);
 	} else {
-		inode_lock(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		inode->i_flags |= S_NOQUOTA;
-		inode_unlock(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	}
 
 	*ea_inode = inode;
@@ -1025,7 +1025,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
 	u32 hash;
 	int ret;
 
-	inode_lock(ea_inode);
+	ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
 
 	ret = ext4_reserve_inode_write(handle, ea_inode, &iloc);
 	if (ret)
@@ -1079,7 +1079,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
 		ext4_warning_inode(ea_inode, "ext4_mark_iloc_dirty() failed ret=%d", ret);
 out:
-	inode_unlock(ea_inode);
+	ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
 	return ret;
 }
 
@@ -1400,10 +1400,10 @@ static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
 		block += 1;
 	}
 
-	inode_lock(ea_inode);
+	ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
 	i_size_write(ea_inode, wsize);
 	ext4_update_i_disksize(ea_inode, wsize);
-	inode_unlock(ea_inode);
+	ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
 
 	ext4_mark_inode_dirty(handle, ea_inode);
 
@@ -1452,9 +1452,9 @@ static struct inode *ext4_xattr_inode_create(handle_t *handle,
 		 */
 		dquot_free_inode(ea_inode);
 		dquot_drop(ea_inode);
-		inode_lock(ea_inode);
+		ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
 		ea_inode->i_flags |= S_NOQUOTA;
-		inode_unlock(ea_inode);
+		ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
 	}
 
 	return ea_inode;

From patchwork Mon Sep 23 05:44:07 2019
X-Patchwork-Submitter: Ritesh Harjani <riteshh@linux.ibm.com>
X-Patchwork-Id: 1165802
From: Ritesh Harjani <riteshh@linux.ibm.com>
To: jack@suse.cz, tytso@mit.edu, linux-ext4@vger.kernel.org, adilger@dilger.ca
Cc: joseph.qi@linux.alibaba.com, david@fromorbit.com, hch@infradead.org,
    riteshh@linux.ibm.com, mbobrowski@mbobrowski.org, rgoldwyn@suse.de,
    aneesh.kumar@linux.ibm.com, linux-kernel@vger.kernel.org
Subject: [RFC-v2 2/2] ext4: Improve DIO writes locking sequence
Date: Mon, 23 Sep 2019 11:14:07 +0530
Message-Id: <20190923054407.8102-3-riteshh@linux.ibm.com>
In-Reply-To: <20190923054407.8102-1-riteshh@linux.ibm.com>
References: <20190923054407.8102-1-riteshh@linux.ibm.com>
X-Mailing-List: linux-ext4@vger.kernel.org

Earlier there was no shared lock in the DIO read path. Commit
16c54688592ce ("ext4: Allow parallel DIO reads") simplified some of the
locking mechanism while still allowing parallel DIO reads by taking a
shared lock in the inode DIO read path.

But this created a problem for mixed read/write workloads. In the DIO
write path we first take the exclusive lock and downgrade it only once
we determine that the IO is an overwrite, which penalizes the DIO reads
that, since the above commit, use shared locking. So this patch fixes
the issue by starting with the shared lock and switching to the
exclusive lock only when required, based on ext4_dio_write_checks().

Other than that, it also simplifies the cases below:

1. Renamed the ext4_unaligned_aio API to ext4_unaligned_io. The
previous API was abused in the sense that it never really checked for
AIO anywhere, and it also checked for extending writes. It was
therefore renamed and simplified to ext4_unaligned_io(), which only
checks whether the IO is really unaligned; in all other cases
inode_dio_wait() becomes a no-op anyway, so there is no need to
over-complicate things by checking every condition here.

2. Added the ext4_extending_io API, which checks whether the IO extends
the file. Now we only check for unaligned_io, extend, dioread_nolock &
overwrite.
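For reference, the new locking decision boils down to the following
condensed sketch (illustrative only, mirroring the hunks below; the
lock helpers come from patch 1/2):

	unsigned int iolock = EXT4_IOLOCK_SHARED;

	/* Unaligned or extending writes always need the exclusive lock. */
	if (ext4_unaligned_io(inode, from, offset) ||
	    ext4_extending_io(inode, offset, count) ||
	    !ext4_should_dioread_nolock(inode))
		iolock = EXT4_IOLOCK_EXCL;

	ext4_ilock(inode, iolock);
	/*
	 * ext4_dio_write_checks() may still find a reason to be
	 * exclusive (e.g. the write is not a pure overwrite): it then
	 * drops the shared lock, takes the exclusive lock and restarts
	 * all of the checks.
	 */

So the common dioread_nolock overwrite case runs entirely under the
shared lock, while everything else falls back to exclusive locking.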
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/file.c | 213 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 159 insertions(+), 54 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ce1cecbae932..faa8dcf9c08d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -166,19 +166,25 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
  * threads are at work on the same unwritten block, they must be synchronized
  * or one thread will zero the other's data, causing corruption.
  */
-static int
-ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
+static bool
+ext4_unaligned_io(struct inode *inode, struct iov_iter *from, loff_t pos)
 {
 	struct super_block *sb = inode->i_sb;
-	int blockmask = sb->s_blocksize - 1;
-
-	if (pos >= ALIGN(i_size_read(inode), sb->s_blocksize))
-		return 0;
+	unsigned long blockmask = sb->s_blocksize - 1;
 
 	if ((pos | iov_iter_alignment(from)) & blockmask)
-		return 1;
+		return true;
 
-	return 0;
+	return false;
+}
+
+static bool
+ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
+{
+	if (offset + len > i_size_read(inode) ||
+	    offset + len > EXT4_I(inode)->i_disksize)
+		return true;
+	return false;
 }
 
 /* Is IO overwriting allocated and initialized blocks? */
@@ -204,7 +210,9 @@ static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
 	return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
 }
 
-static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+
+static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
+					 struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
@@ -216,10 +224,6 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if (ret <= 0)
 		return ret;
 
-	ret = file_modified(iocb->ki_filp);
-	if (ret)
-		return 0;
-
 	/*
 	 * If we have encountered a bitmap-format file, the size limit
 	 * is smaller than s_maxbytes, which is for extent-mapped files.
@@ -231,9 +235,26 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 			return -EFBIG;
 		iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
 	}
+
 	return iov_iter_count(from);
 }
 
+static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t ret;
+	ssize_t count;
+
+	count = ext4_generic_write_checks(iocb, from);
+	if (count <= 0)
+		return count;
+
+	ret = file_modified(iocb->ki_filp);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
 static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 					struct iov_iter *from)
 {
@@ -336,6 +357,83 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
 	return 0;
 }
 
+/*
+ * The intention here is to start with the shared lock acquired
+ * (except in the unaligned IO & extending writes cases),
+ * then see if any condition requires an exclusive inode
+ * lock. If yes, then we restart the whole operation by
+ * releasing the shared lock and acquiring the exclusive lock.
+ *
+ * - For unaligned IO we never take the shared lock, as it
+ *   may cause data corruption when two unaligned IOs try
+ *   to modify the same block.
+ *
+ * - For the extending writes case we don't take
+ *   the shared lock, since it requires updating
+ *   inode i_disksize under the exclusive lock.
+ *
+ * - Shared locking will mostly remain in effect only in case of
+ *   overwrites with dioread_nolock mode.
+ *   Otherwise we will switch to the exclusive locking mode.
+ */
+static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
+				     unsigned int *iolock, bool *unaligned_io,
+				     bool *extend)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	loff_t offset = iocb->ki_pos;
+	loff_t final_size;
+	size_t count;
+	ssize_t ret;
+
+restart:
+	if (!ext4_dio_checks(inode)) {
+		ext4_iunlock(inode, *iolock);
+		return ext4_buffered_write_iter(iocb, from);
+	}
+
+	ret = ext4_generic_write_checks(iocb, from);
+	if (ret <= 0) {
+		ext4_iunlock(inode, *iolock);
+		return ret;
+	}
+
+	/* Recalculate since offset & count may change above. */
+	offset = iocb->ki_pos;
+	count = iov_iter_count(from);
+	final_size = offset + count;
+
+	if (ext4_unaligned_io(inode, from, offset))
+		*unaligned_io = true;
+
+	if (ext4_extending_io(inode, offset, count))
+		*extend = true;
+
+	/*
+	 * Determine whether the IO operation will overwrite allocated
+	 * and initialized blocks. If so, check to see whether it is
+	 * possible to take the dioread_nolock path.
+	 *
+	 * We need exclusive i_rwsem for changing security info
+	 * in file_modified().
+	 */
+	if (*iolock == EXT4_IOLOCK_SHARED &&
+	    (!IS_NOSEC(inode) || *unaligned_io || *extend ||
+	     !ext4_should_dioread_nolock(inode) ||
+	     !ext4_overwrite_io(inode, offset, count))) {
+		ext4_iunlock(inode, *iolock);
+		*iolock = EXT4_IOLOCK_EXCL;
+		ext4_ilock(inode, *iolock);
+		goto restart;
+	}
+
+	ret = file_modified(file);
+	if (ret) {
+		ext4_iunlock(inode, *iolock);
+		return ret;
+	}
+
+	return count;
+}
+
 /*
  * For a write that extends the inode size, ext4_dio_write_iter() will
  * wait for the write to complete. Consequently, operations performed
@@ -371,64 +469,71 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
 
 static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	ssize_t ret;
-	size_t count;
 	loff_t offset = iocb->ki_pos;
+	size_t count = iov_iter_count(from);
 	struct inode *inode = file_inode(iocb->ki_filp);
-	bool extend = false, overwrite = false, unaligned_aio = false;
-	unsigned int iolock = EXT4_IOLOCK_EXCL;
+	bool extend = false, unaligned_io = false;
+	unsigned int iolock = EXT4_IOLOCK_SHARED;
+
+	/*
+	 * We initially start with the shared inode lock
+	 * unless it is unaligned IO, which needs
+	 * the exclusive lock anyway.
+	 */
+	if (ext4_unaligned_io(inode, from, offset)) {
+		unaligned_io = true;
+		iolock = EXT4_IOLOCK_EXCL;
+	}
+	/*
+	 * Extending writes need the exclusive lock
+	 * to update i_disksize.
+	 */
+	if (ext4_extending_io(inode, offset, count)) {
+		extend = true;
+		iolock = EXT4_IOLOCK_EXCL;
+	}
+
+	if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
+		iolock = EXT4_IOLOCK_EXCL;
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
+		/*
+		 * Unaligned IO may anyway require a wait
+		 * at other places, so bail out.
+		 */
+		if (unaligned_io)
+			return -EAGAIN;
 		if (!ext4_ilock_nowait(inode, iolock))
 			return -EAGAIN;
 	} else {
 		ext4_ilock(inode, iolock);
 	}
 
-	if (!ext4_dio_checks(inode)) {
-		ext4_iunlock(inode, iolock);
-		/*
-		 * Fallback to buffered IO if the operation on the
-		 * inode is not supported by direct IO.
-		 */
-		return ext4_buffered_write_iter(iocb, from);
-	}
-
-	ret = ext4_write_checks(iocb, from);
-	if (ret <= 0) {
-		ext4_iunlock(inode, iolock);
+	ret = ext4_dio_write_checks(iocb, from, &iolock, &unaligned_io,
+				    &extend);
+	if (ret <= 0)
 		return ret;
-	}
 
 	/*
-	 * Unaligned direct AIO must be serialized among each other as
+	 * Unaligned direct IO must be serialized among each other as
 	 * the zeroing of partial blocks of two competing unaligned
-	 * AIOs can result in data corruption.
+	 * IOs can result in data corruption. This can mainly
+	 * happen since we may start with shared locking, and for the
+	 * dioread_nolock and overwrite cases we may continue to be
+	 * in shared locking mode. In that case two parallel unaligned
+	 * IOs may cause data corruption.
+	 *
+	 * So we make sure we don't allow any unaligned IO in flight.
+	 * For IOs where we need not wait (like unaligned non-AIO DIO),
+	 * the inode_dio_wait() below may anyway become a no-op,
+	 * since we start with the exclusive lock.
 	 */
-	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
-	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
-		unaligned_aio = true;
+	if (unaligned_io)
 		inode_dio_wait(inode);
-	}
-
-	/*
-	 * Determine whether the IO operation will overwrite allocated
-	 * and initialized blocks. If so, check to see whether it is
-	 * possible to take the dioread_nolock path.
-	 */
-	count = iov_iter_count(from);
-	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
-	    ext4_should_dioread_nolock(inode)) {
-		overwrite = true;
-		ext4_ilock_demote(inode, iolock);
-		iolock = EXT4_IOLOCK_SHARED;
-	}
 
-	if (offset + count > i_size_read(inode) ||
-	    offset + count > EXT4_I(inode)->i_disksize) {
+	if (extend)
 		ext4_update_i_disksize(inode, inode->i_size);
-		extend = true;
-	}
 
 	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
 
@@ -440,7 +545,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	 * routines in ext4_dio_write_end_io() are covered by the
 	 * inode_lock().
 	 */
-	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
+	if (ret == -EIOCBQUEUED && (unaligned_io || extend))
 		inode_dio_wait(inode);
 
 	ext4_iunlock(inode, iolock);
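
As a closing illustration of the renamed helper's semantics (example
numbers only, not taken from the patch): with a 4K block size,
ext4_unaligned_io() reduces to a simple mask test.

	/* blocksize 4096 => blockmask 0xfff */
	unsigned long blockmask = 4096 - 1;

	/*
	 * pos = 8192 with a block-aligned buffer: (8192 | 0) & 0xfff == 0,
	 * so the IO is aligned and the shared-lock fast path stays open.
	 *
	 * pos = 6144: 6144 & 0xfff == 0x800 != 0, so the IO is unaligned;
	 * the write then takes the exclusive lock and serializes against
	 * in-flight DIO via inode_dio_wait().
	 */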