Message ID | 20220605143815.2330891-4-willy@infradead.org |
---|---|
State | Rejected |
Headers | show |
Series | Cache quota files in the page cache | expand |
On Sun 05-06-22 15:38:15, Matthew Wilcox (Oracle) wrote: > The comment about the page cache is rather stale; the buffer cache will > read into the page cache if the buffer isn't present, and the page cache > will not take any locks if the page is present. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> This will not work for couple of reasons, see below. BTW, I don't think the comment about page cache was stale (but lacking details I admit ;). As far as I remember (and it was really many years ago - definitely pre-git era) the problem was (mainly on the write side) that before current state of the code we were using calls like vfs_read() / vfs_write() to get quota information and that was indeed prone to deadlocks. > @@ -6924,20 +6882,21 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type, > return -EIO; > } > > - do { > - bh = ext4_bread(handle, inode, blk, > - EXT4_GET_BLOCKS_CREATE | > - EXT4_GET_BLOCKS_METADATA_NOFAIL); > - } while (PTR_ERR(bh) == -ENOSPC && > - ext4_should_retry_alloc(inode->i_sb, &retries)); > - if (IS_ERR(bh)) > - return PTR_ERR(bh); > - if (!bh) > + folio = read_mapping_folio(inode->i_mapping, off / PAGE_SIZE, NULL); > + if (IS_ERR(folio)) > + return PTR_ERR(folio); > + head = folio_buffers(folio); > + if (!head) > + head = alloc_page_buffers(&folio->page, sb->s_blocksize, false); > + if (!head) > goto out; > + bh = head; > + while ((bh_offset(bh) + sb->s_blocksize) <= (off % PAGE_SIZE)) > + bh = bh->b_this_page; We miss proper handling of blocks that are currently beyond i_size (we are extending the quota file), plus we also miss any mapping of buffers to appropriate disk blocks here... It could be all fixed by replicating what we do in ext4_write_begin() but I'm not quite convinced using inode's page cache is really worth it... Honza > BUFFER_TRACE(bh, "get write access"); > err = ext4_journal_get_write_access(handle, sb, bh, EXT4_JTR_NONE); > if (err) { > - brelse(bh); > + folio_put(folio); > return err; > } > lock_buffer(bh); > @@ -6945,14 +6904,14 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type, > flush_dcache_page(bh->b_page); > unlock_buffer(bh); > err = ext4_handle_dirty_metadata(handle, NULL, bh); > - brelse(bh); > out: > + folio_put(folio); > + if (err) > + return err; > if (inode->i_size < off + len) { > i_size_write(inode, off + len); > EXT4_I(inode)->i_disksize = inode->i_size; > - err2 = ext4_mark_inode_dirty(handle, inode); > - if (unlikely(err2 && !err)) > - err = err2; > + err = ext4_mark_inode_dirty(handle, inode); > } > return err ? err : len; > } > -- > 2.35.1 >
On Mon, Jun 06, 2022 at 10:38:14AM +0200, Jan Kara wrote: > On Sun 05-06-22 15:38:15, Matthew Wilcox (Oracle) wrote: > > The comment about the page cache is rather stale; the buffer cache will > > read into the page cache if the buffer isn't present, and the page cache > > will not take any locks if the page is present. > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > This will not work for couple of reasons, see below. BTW, I don't think the > comment about page cache was stale (but lacking details I admit ;). As far > as I remember (and it was really many years ago - definitely pre-git era) > the problem was (mainly on the write side) that before current state of the > code we were using calls like vfs_read() / vfs_write() to get quota > information and that was indeed prone to deadlocks. Ah yes, vfs_write() might indeed be prone to deadlocks. Particularly if we're doing it under the dq_mutex and any memory allocation might have recursed into reclaim ;-) I actually found the commit in linux-fullhistory. Changelog for context: commit b72debd66a6ed Author: Jan Kara <jack@suse.cz> Date: Mon Jan 3 04:12:24 2005 -0800 [PATCH] Fix of quota deadlock on pagelock: quota core The four patches in this series fix deadlocks with quotas of pagelock (the problem was lock inversion on PageLock and transaction start - quota code needed to first start a transaction and then write the data which subsequent ly needed acquisition of PageLock while the standard ordering - PageLock first and transaction start later - was used e.g. by pdflush). They implement a new way of quota access to disk: Every filesystem that would like to impleme nt quotas now has to provide quota_read() and quota_write() functions. These functions must obey quota lock ordering (in particular they should not take PageLock inside a transaction). The first patch implements the changes in the quota core, the other three patches implement needed functions in ext2, ext3 and reiserfs. The patch for reiserfs also fixes several other lock inversion problems (similar as ext3 had) and implements the journaled quota functionality (which comes almost for free after the locking fixes...). The quota core patch makes quota support in other filesystems (except XFS which implements everything on its own ;)) unfunctional (quotaon() will refuse to turn on quotas on them). When the patches get reasonable wide testing and it will seem that no major changes will be needed I can make fixes also for the other filesystems (JFS, UDF, UFS). This patch: The patch implements the new way of quota io in the quota core. Every filesystem wanting to support quotas has to provide functions quota_read() and quota_write() obeying quota locking rules. As the writes and reads bypass the pagecache there is some ugly stuff ensuring that userspace can see all the data after quotaoff() (or Q_SYNC quotactl). In future I plan to make quota files inaccessible from userspace (with the exception of quotacheck(8) which will take care about the cache flushing and such stuff itself) so that this synchronization stuff can be removed... The rewrite of the quota core. Quota uses the filesystem read() and write() functions no more to avoid possible deadlocks on PageLock. From now on every filesystem supporting quotas must provide functions quota_read() and quota_write() which obey the quota locking rules (e.g. they cannot acquire the PageLock). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> > > @@ -6924,20 +6882,21 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type, > > return -EIO; > > } > > > > - do { > > - bh = ext4_bread(handle, inode, blk, > > - EXT4_GET_BLOCKS_CREATE | > > - EXT4_GET_BLOCKS_METADATA_NOFAIL); > > - } while (PTR_ERR(bh) == -ENOSPC && > > - ext4_should_retry_alloc(inode->i_sb, &retries)); > > - if (IS_ERR(bh)) > > - return PTR_ERR(bh); > > - if (!bh) > > + folio = read_mapping_folio(inode->i_mapping, off / PAGE_SIZE, NULL); > > + if (IS_ERR(folio)) > > + return PTR_ERR(folio); > > + head = folio_buffers(folio); > > + if (!head) > > + head = alloc_page_buffers(&folio->page, sb->s_blocksize, false); > > + if (!head) > > goto out; > > + bh = head; > > + while ((bh_offset(bh) + sb->s_blocksize) <= (off % PAGE_SIZE)) > > + bh = bh->b_this_page; > > We miss proper handling of blocks that are currently beyond i_size > (we are extending the quota file), plus we also miss any mapping of buffers > to appropriate disk blocks here... > > It could be all fixed by replicating what we do in ext4_write_begin() but > I'm not quite convinced using inode's page cache is really worth it... Ah, yes, write_begin. Of course that's what I should have used. I'm looking at this from the point of view of removing buffer_heads where possible. Of course, it's not possible for ext4 while the journal relies on buffer_heads, but if we can steer filesystems away from using sb_bread() (or equivalents), I think that's a good thing.
On Wed 08-06-22 02:42:21, Matthew Wilcox wrote: > On Mon, Jun 06, 2022 at 10:38:14AM +0200, Jan Kara wrote: > > On Sun 05-06-22 15:38:15, Matthew Wilcox (Oracle) wrote: > > > The comment about the page cache is rather stale; the buffer cache will > > > read into the page cache if the buffer isn't present, and the page cache > > > will not take any locks if the page is present. > > > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > > > This will not work for couple of reasons, see below. BTW, I don't think the > > comment about page cache was stale (but lacking details I admit ;). As far > > as I remember (and it was really many years ago - definitely pre-git era) > > the problem was (mainly on the write side) that before current state of the > > code we were using calls like vfs_read() / vfs_write() to get quota > > information and that was indeed prone to deadlocks. > > Ah yes, vfs_write() might indeed be prone to deadlocks. Particularly > if we're doing it under the dq_mutex and any memory allocation might > have recursed into reclaim ;-) > > I actually found the commit in linux-fullhistory. Changelog for > context: > > commit b72debd66a6ed > Author: Jan Kara <jack@suse.cz> > Date: Mon Jan 3 04:12:24 2005 -0800 > > [PATCH] Fix of quota deadlock on pagelock: quota core \o/ to you history searching skills :) > > > @@ -6924,20 +6882,21 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type, > > > return -EIO; > > > } > > > > > > - do { > > > - bh = ext4_bread(handle, inode, blk, > > > - EXT4_GET_BLOCKS_CREATE | > > > - EXT4_GET_BLOCKS_METADATA_NOFAIL); > > > - } while (PTR_ERR(bh) == -ENOSPC && > > > - ext4_should_retry_alloc(inode->i_sb, &retries)); > > > - if (IS_ERR(bh)) > > > - return PTR_ERR(bh); > > > - if (!bh) > > > + folio = read_mapping_folio(inode->i_mapping, off / PAGE_SIZE, NULL); > > > + if (IS_ERR(folio)) > > > + return PTR_ERR(folio); > > > + head = folio_buffers(folio); > > > + if (!head) > > > + head = alloc_page_buffers(&folio->page, sb->s_blocksize, false); > > > + if (!head) > > > goto out; > > > + bh = head; > > > + while ((bh_offset(bh) + sb->s_blocksize) <= (off % PAGE_SIZE)) > > > + bh = bh->b_this_page; > > > > We miss proper handling of blocks that are currently beyond i_size > > (we are extending the quota file), plus we also miss any mapping of buffers > > to appropriate disk blocks here... > > > > It could be all fixed by replicating what we do in ext4_write_begin() but > > I'm not quite convinced using inode's page cache is really worth it... > > Ah, yes, write_begin. Of course that's what I should have used. > > I'm looking at this from the point of view of removing buffer_heads > where possible. Of course, it's not possible for ext4 while the journal > relies on buffer_heads, but if we can steer filesystems away from using > sb_bread() (or equivalents), I think that's a good thing. Well, ext4 uses sb_bread() (sb_getblk()) for all its metadata so quota code, which is rather well localized, is the least of your worries I'm afraid ;). Honza
Greeting, FYI, we noticed the following commit (built with gcc-11): commit: fa964903692268d3913cdaf489f80db849a3e928 ("[PATCH 3/3] ext4: Use generic_quota_read()") url: https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox-Oracle/Cache-quota-files-in-the-page-cache/20220606-021629 base: https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git dev patch link: https://lore.kernel.org/linux-ext4/20220605143815.2330891-4-willy@infradead.org in testcase: ltp version: ltp-x86_64-14c1f76-1_20220604 with following parameters: disk: 1HDD fs: btrfs test: fs-03 ucode: 0xec test-description: The LTP testsuite contains a collection of tools for testing the Linux kernel and related features. test-url: http://linux-test-project.github.io/ on test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz with 32G memory caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): If you fix the issue, kindly add following tag Reported-by: kernel test robot <oliver.sang@intel.com> [ 366.801777][ T5247] WARNING: CPU: 5 PID: 5247 at fs/ext4/inode.c:3228 ext4_invalidate_folio (fs/ext4/inode.c:3228) [ 366.811124][ T5247] Modules linked in: loop binfmt_misc dm_mod btrfs blake2b_generic xor raid6_pq intel_rapl_msr intel_rapl_common zstd_compress libcrc32c i915 sd_mod t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 sg x86_pkg_temp_thermal intel_gtt drm_buddy intel_powerclamp ipmi_devintf ipmi_msghandler coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel drm_dp_helper wmi_bmof mei_wdt ttm rapl intel_cstate drm_kms_helper ahci libahci syscopyarea sysfillrect i2c_designware_platform intel_uncore mei_me libata sysimgblt i2c_designware_core mei idma64 fb_sys_fops wmi video intel_pmc_core acpi_pad drm fuse ip_tables [ 366.869509][ T5247] CPU: 5 PID: 5247 Comm: umount Tainted: G I 5.18.0-rc5-00028-gfa9649036922 #1 [ 366.879632][ T5247] Hardware name: Dell Inc. OptiPlex 7050/062KRH, BIOS 1.2.0 12/22/2016 [ 366.887739][ T5247] RIP: 0010:ext4_invalidate_folio (fs/ext4/inode.c:3228) [ 366.893597][ T5247] Code: e8 f5 12 c7 ff 48 89 da 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 c1 00 00 00 48 8b 03 a9 00 00 01 00 74 02 <0f> 0b 5b 4c 89 ea 4c 89 e6 48 89 ef 5d 41 5c 41 5d e9 3a 4a df ff All code ======== 0: e8 f5 12 c7 ff callq 0xffffffffffc712fa 5: 48 89 da mov %rbx,%rdx 8: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax f: fc ff df 12: 48 c1 ea 03 shr $0x3,%rdx 16: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) 1a: 0f 85 c1 00 00 00 jne 0xe1 20: 48 8b 03 mov (%rbx),%rax 23: a9 00 00 01 00 test $0x10000,%eax 28: 74 02 je 0x2c 2a:* 0f 0b ud2 <-- trapping instruction 2c: 5b pop %rbx 2d: 4c 89 ea mov %r13,%rdx 30: 4c 89 e6 mov %r12,%rsi 33: 48 89 ef mov %rbp,%rdi 36: 5d pop %rbp 37: 41 5c pop %r12 39: 41 5d pop %r13 3b: e9 3a 4a df ff jmpq 0xffffffffffdf4a7a Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: 5b pop %rbx 3: 4c 89 ea mov %r13,%rdx 6: 4c 89 e6 mov %r12,%rsi 9: 48 89 ef mov %rbp,%rdi c: 5d pop %rbp d: 41 5c pop %r12 f: 41 5d pop %r13 11: e9 3a 4a df ff jmpq 0xffffffffffdf4a50 [ 366.913124][ T5247] RSP: 0018:ffffc90000e3fab0 EFLAGS: 00010206 [ 366.919089][ T5247] RAX: 000000000011601b RBX: ffff8887f2c18dc8 RCX: ffffffff81c7828b [ 366.926940][ T5247] RDX: 1ffff110fe5831b9 RSI: 0000000000000008 RDI: ffff8887f2c18dc8 [ 366.934793][ T5247] RBP: ffffea00072cd700 R08: 0000000000000000 R09: ffff8887f2c18dcf [ 366.942641][ T5247] R10: ffffed10fe5831b9 R11: 0000000000000001 R12: 0000000000000000 [ 366.950492][ T5247] R13: 0000000000001000 R14: 0000000000000002 R15: 0000000000000001 [ 366.958343][ T5247] FS: 00007f5852a94840(0000) GS:ffff8886aaa80000(0000) knlGS:0000000000000000 [ 366.967151][ T5247] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 366.973607][ T5247] CR2: 0000561684be9559 CR3: 00000007f60fc006 CR4: 00000000003706e0 [ 366.981457][ T5247] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 366.989303][ T5247] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 366.997156][ T5247] Call Trace: [ 367.000322][ T5247] <TASK> [ 367.003130][ T5247] truncate_cleanup_folio (mm/truncate.c:159 mm/truncate.c:179) [ 367.008372][ T5247] truncate_inode_pages_range (include/linux/pagevec.h:119 (discriminator 3) mm/truncate.c:368 (discriminator 3)) [ 367.013965][ T5247] ? truncate_inode_partial_folio (mm/truncate.c:332) [ 367.019900][ T5247] ? jbd2_journal_set_features (fs/jbd2/journal.c:682) [ 367.025580][ T5247] ? _raw_write_lock (arch/x86/include/asm/atomic.h:202 include/linux/atomic/atomic-instrumented.h:543 include/asm-generic/qrwlock.h:94 include/linux/rwlock_api_smp.h:210 kernel/locking/spinlock.c:300) [ 367.030300][ T5247] ? prepare_to_swait_exclusive (kernel/sched/wait.c:414) [ 367.036082][ T5247] ? inode_to_bdi (mm/backing-dev.c:995 mm/backing-dev.c:985) [ 367.040540][ T5247] ? filemap_check_errors (arch/x86/include/asm/bitops.h:207 include/asm-generic/bitops/instrumented-non-atomic.h:135 mm/filemap.c:351) [ 367.045712][ T5247] ? watch_queue_set_size (kernel/watch_queue.c:219) [ 367.050864][ T5247] ? filemap_fdatawait_keep_errors (mm/filemap.c:669) [ 367.056715][ T5247] ? __cond_resched (kernel/sched/core.c:8177) [ 367.061262][ T5247] ? down_write (arch/x86/include/asm/atomic64_64.h:34 include/linux/atomic/atomic-long.h:41 include/linux/atomic/atomic-instrumented.h:1280 kernel/locking/rwsem.c:138 kernel/locking/rwsem.c:255 kernel/locking/rwsem.c:1258 kernel/locking/rwsem.c:1268 kernel/locking/rwsem.c:1515) [ 367.065562][ T5247] ? down_write_killable (kernel/locking/rwsem.c:1512) [ 367.070715][ T5247] ? v2_free_file_info (fs/quota/quota_v2.c:384) [ 367.075530][ T5247] ? kfree (mm/slub.c:1754 mm/slub.c:3510 mm/slub.c:4552) [ 367.079385][ T5247] dquot_disable (fs/quota/dquot.c:2355) [ 367.083850][ T5247] ext4_quota_off (fs/ext4/super.c:6831) [ 367.088314][ T5247] ext4_put_super (fs/ext4/super.c:1173 fs/ext4/super.c:1218) [ 367.092779][ T5247] generic_shutdown_super (fs/super.c:464) [ 367.098029][ T5247] kill_block_super (fs/super.c:1395) [ 367.102680][ T5247] deactivate_locked_super (fs/super.c:339) [ 367.107923][ T5247] cleanup_mnt (fs/namespace.c:138 fs/namespace.c:1187) [ 367.112210][ T5247] ? path_umount (fs/namespace.c:1808) [ 367.116682][ T5247] task_work_run (kernel/task_work.c:166 (discriminator 1)) [ 367.121091][ T5247] exit_to_user_mode_loop (include/linux/resume_user_mode.h:49 kernel/entry/common.c:169) [ 367.126338][ T5247] exit_to_user_mode_prepare (kernel/entry/common.c:201) [ 367.131756][ T5247] syscall_exit_to_user_mode (arch/x86/include/asm/jump_label.h:27 include/linux/context_tracking_state.h:31 include/linux/context_tracking.h:40 kernel/entry/common.c:132 kernel/entry/common.c:296) [ 367.137112][ T5247] do_syscall_64 (arch/x86/entry/common.c:87) [ 367.141398][ T5247] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:115) [ 367.147174][ T5247] RIP: 0033:0x7f5852cd3e27 [ 367.151460][ T5247] Code: 00 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 00 0c 00 f7 d8 64 89 01 48 All code ======== 0: 00 0c 00 add %cl,(%rax,%rax,1) 3: f7 d8 neg %eax 5: 64 89 01 mov %eax,%fs:(%rcx) 8: 48 83 c8 ff or $0xffffffffffffffff,%rax c: c3 retq d: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) 13: 31 f6 xor %esi,%esi 15: e9 09 00 00 00 jmpq 0x23 1a: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 21: 00 00 23: b8 a6 00 00 00 mov $0xa6,%eax 28: 0f 05 syscall 2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction 30: 73 01 jae 0x33 32: c3 retq 33: 48 8b 0d 39 00 0c 00 mov 0xc0039(%rip),%rcx # 0xc0073 3a: f7 d8 neg %eax 3c: 64 89 01 mov %eax,%fs:(%rcx) 3f: 48 rex.W Code starting with the faulting instruction =========================================== 0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax 6: 73 01 jae 0x9 8: c3 retq 9: 48 8b 0d 39 00 0c 00 mov 0xc0039(%rip),%rcx # 0xc0049 10: f7 d8 neg %eax 12: 64 89 01 mov %eax,%fs:(%rcx) 15: 48 rex.W To reproduce: git clone https://github.com/intel/lkp-tests.git cd lkp-tests sudo bin/lkp install job.yaml # job file is attached in this email bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run sudo bin/lkp run generated-yaml-file # if come across any failure that blocks the test, # please remove ~/.lkp and /lkp dir to run from a clean state.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 450c918d68fc..1780649ed224 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1506,8 +1506,6 @@ static int ext4_mark_dquot_dirty(struct dquot *dquot); static int ext4_write_info(struct super_block *sb, int type); static int ext4_quota_on(struct super_block *sb, int type, int format_id, const struct path *path); -static ssize_t ext4_quota_read(struct super_block *sb, int type, char *data, - size_t len, loff_t off); static ssize_t ext4_quota_write(struct super_block *sb, int type, const char *data, size_t len, loff_t off); static int ext4_quota_enable(struct super_block *sb, int type, int format_id, @@ -1535,7 +1533,7 @@ static const struct dquot_operations ext4_quota_operations = { static const struct quotactl_ops ext4_qctl_operations = { .quota_on = ext4_quota_on, .quota_off = ext4_quota_off, - .quota_sync = dquot_quota_sync, + .quota_sync = generic_quota_sync, .get_state = dquot_get_state, .set_info = dquot_set_dqinfo, .get_dqblk = dquot_get_dqblk, @@ -1559,7 +1557,7 @@ static const struct super_operations ext4_sops = { .statfs = ext4_statfs, .show_options = ext4_show_options, #ifdef CONFIG_QUOTA - .quota_read = ext4_quota_read, + .quota_read = generic_quota_read, .quota_write = ext4_quota_write, .get_dquots = ext4_get_dquots, #endif @@ -6856,55 +6854,15 @@ static int ext4_quota_off(struct super_block *sb, int type) return dquot_quota_off(sb, type); } -/* Read data from quotafile - avoid pagecache and such because we cannot afford - * acquiring the locks... As quota files are never truncated and quota code - * itself serializes the operations (and no one else should touch the files) - * we don't have to be afraid of races */ -static ssize_t ext4_quota_read(struct super_block *sb, int type, char *data, - size_t len, loff_t off) -{ - struct inode *inode = sb_dqopt(sb)->files[type]; - ext4_lblk_t blk = off >> EXT4_BLOCK_SIZE_BITS(sb); - int offset = off & (sb->s_blocksize - 1); - int tocopy; - size_t toread; - struct buffer_head *bh; - loff_t i_size = i_size_read(inode); - - if (off > i_size) - return 0; - if (off+len > i_size) - len = i_size-off; - toread = len; - while (toread > 0) { - tocopy = sb->s_blocksize - offset < toread ? - sb->s_blocksize - offset : toread; - bh = ext4_bread(NULL, inode, blk, 0); - if (IS_ERR(bh)) - return PTR_ERR(bh); - if (!bh) /* A hole? */ - memset(data, 0, tocopy); - else - memcpy(data, bh->b_data+offset, tocopy); - brelse(bh); - offset = 0; - toread -= tocopy; - data += tocopy; - blk++; - } - return len; -} - /* Write to quotafile (we know the transaction is already started and has * enough credits) */ static ssize_t ext4_quota_write(struct super_block *sb, int type, const char *data, size_t len, loff_t off) { struct inode *inode = sb_dqopt(sb)->files[type]; - ext4_lblk_t blk = off >> EXT4_BLOCK_SIZE_BITS(sb); - int err = 0, err2 = 0, offset = off & (sb->s_blocksize - 1); - int retries = 0; - struct buffer_head *bh; + int err = 0, offset = off & (sb->s_blocksize - 1); + struct buffer_head *bh, *head; + struct folio *folio; handle_t *handle = journal_current_handle(); if (!handle) { @@ -6924,20 +6882,21 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type, return -EIO; } - do { - bh = ext4_bread(handle, inode, blk, - EXT4_GET_BLOCKS_CREATE | - EXT4_GET_BLOCKS_METADATA_NOFAIL); - } while (PTR_ERR(bh) == -ENOSPC && - ext4_should_retry_alloc(inode->i_sb, &retries)); - if (IS_ERR(bh)) - return PTR_ERR(bh); - if (!bh) + folio = read_mapping_folio(inode->i_mapping, off / PAGE_SIZE, NULL); + if (IS_ERR(folio)) + return PTR_ERR(folio); + head = folio_buffers(folio); + if (!head) + head = alloc_page_buffers(&folio->page, sb->s_blocksize, false); + if (!head) goto out; + bh = head; + while ((bh_offset(bh) + sb->s_blocksize) <= (off % PAGE_SIZE)) + bh = bh->b_this_page; BUFFER_TRACE(bh, "get write access"); err = ext4_journal_get_write_access(handle, sb, bh, EXT4_JTR_NONE); if (err) { - brelse(bh); + folio_put(folio); return err; } lock_buffer(bh); @@ -6945,14 +6904,14 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type, flush_dcache_page(bh->b_page); unlock_buffer(bh); err = ext4_handle_dirty_metadata(handle, NULL, bh); - brelse(bh); out: + folio_put(folio); + if (err) + return err; if (inode->i_size < off + len) { i_size_write(inode, off + len); EXT4_I(inode)->i_disksize = inode->i_size; - err2 = ext4_mark_inode_dirty(handle, inode); - if (unlikely(err2 && !err)) - err = err2; + err = ext4_mark_inode_dirty(handle, inode); } return err ? err : len; }
The comment about the page cache is rather stale; the buffer cache will read into the page cache if the buffer isn't present, and the page cache will not take any locks if the page is present. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> --- fs/ext4/super.c | 81 ++++++++++++------------------------------------- 1 file changed, 20 insertions(+), 61 deletions(-)