diff mbox series

[RFCv2,2/5] ext4: Remove PAGE_SIZE assumption of folio from mpage_submit_folio

Message ID 74182f5607ccfc3b1e7f08737fcb3442b42a2124.1684122756.git.ritesh.list@gmail.com
State Accepted
Headers show
Series ext4: misc left over folio changes | expand

Commit Message

Ritesh Harjani (IBM) May 15, 2023, 10:40 a.m. UTC
mpage_submit_folio() was converted to take folio. Even though
folio_size() in ext4 as of now is PAGE_SIZE, but it's better to
remove that assumption which I am assuming is a missed left over from
patch[1].

[1]: https://lore.kernel.org/linux-ext4/20230324180129.1220691-7-willy@infradead.org/

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext4/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Matthew Wilcox (Oracle) May 16, 2023, 7:14 p.m. UTC | #1
On Mon, May 15, 2023 at 04:10:41PM +0530, Ritesh Harjani (IBM) wrote:
> mpage_submit_folio() was converted to take folio. Even though
> folio_size() in ext4 as of now is PAGE_SIZE, but it's better to
> remove that assumption which I am assuming is a missed left over from
> patch[1].
> 
> [1]: https://lore.kernel.org/linux-ext4/20230324180129.1220691-7-willy@infradead.org/
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Theodore Ts'o June 11, 2023, 5:58 a.m. UTC | #2
On Mon, May 15, 2023 at 04:10:41PM +0530, Ritesh Harjani (IBM) wrote:
> mpage_submit_folio() was converted to take folio. Even though
> folio_size() in ext4 as of now is PAGE_SIZE, but it's better to
> remove that assumption which I am assuming is a missed left over from
> patch[1].
> 
> [1]: https://lore.kernel.org/linux-ext4/20230324180129.1220691-7-willy@infradead.org/
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

I didn't notice this right away, because the failure is not 100%
reliable, but this commit will sometimes cause "kvm-xfstests -c
ext4/encrypt generic/068" to crash.  Reverting the patch fixes the
problem, so I plan to drop this patch from my tree.

      	    		      	      	   	- Ted

generic/068 42s ...  [01:56:09][    7.014363] run fstests generic/068 at 2023-06-11 01:56:09
[    7.538841] EXT4-fs (vdc): Test dummy encryption mode enabled
[   11.407307] traps: PANIC: double fault, error_code: 0x0
[   11.407313] double fault: 0000 [#1] PREEMPT SMP NOPTI
[   11.407315] CPU: 1 PID: 3358 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-lockdep-00069-gfc362247e79f #169
[   11.407316] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   11.407317] RIP: 0010:__switch_to_asm+0x33/0x80
[   11.407322] Code: 55 41 56 41 57 48 89 a7 d8 17 00 00 48 8b a6 d8 17 00 00 48 8b 9e 40 04 00 00 65 48 89 1c 25 28 00 00 00 49 c7 c4 10 00 00 00 <e8> 01 00 00 00 cc e8 01 00 00 00 cc 48 83 c4 10 49 ff cc 75 eb 0f
[   11.407323] RSP: 0018:ffffc90003ec7e18 EFLAGS: 00010046
[   11.407324] RAX: 0000000000000001 RBX: 961d22f2e2e05800 RCX: 00000002afbf75a9
[   11.407325] RDX: 0000000000000003 RSI: ffff88800d174080 RDI: ffff88800d0ae200
[   11.407325] RBP: ffffc90003fd7af0 R08: 0000000000000001 R09: 0000000000000001
[   11.407326] R10: 00000000000003cc R11: 0000000000000001 R12: 0000000000000010
[   11.407326] R13: ffffe8ffffc29c50 R14: ffff88807ddee998 R15: ffff88800d174080
[   11.407327] FS:  00007f144aee4740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[   11.407329] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.407330] CR2: ffffc90003ec7e08 CR3: 000000000cb3e004 CR4: 0000000000770ee0
[   11.407330] PKRU: 55555554
[   11.407330] Call Trace:
[   11.407331]  <#DF>
[   11.407332]  ? die+0x36/0x80
[   11.407334]  ? exc_double_fault+0xf1/0x1b0
[   11.407336]  ? asm_exc_double_fault+0x23/0x30
[   11.407338]  ? __switch_to_asm+0x33/0x80
[   11.407339]  </#DF>
[   11.413852] ---[ end trace 0000000000000000 ]---
[   11.413853] RIP: 0010:__switch_to_asm+0x33/0x80
[   11.413856] Code: 55 41 56 41 57 48 89 a7 d8 17 00 00 48 8b a6 d8 17 00 00 48 8b 9e 40 04 00 00 65 48 89 1c 25 28 00 00 00 49 c7 c4 10 00 00 00 <e8> 01 00 00 00 cc e8 01 00 00 00 cc 48 83 c4 10 49 ff cc 75 eb 0f
[   11.413857] RSP: 0018:ffffc90003ec7e18 EFLAGS: 00010046
[   11.413857] RAX: 0000000000000001 RBX: 961d22f2e2e05800 RCX: 00000002afbf75a9
[   11.413858] RDX: 0000000000000003 RSI: ffff88800d174080 RDI: ffff88800d0ae200
[   11.413858] RBP: ffffc90003fd7af0 R08: 0000000000000001 R09: 0000000000000001
[   11.413859] R10: 00000000000003cc R11: 0000000000000001 R12: 0000000000000010
[   11.413859] R13: ffffe8ffffc29c50 R14: ffff88807ddee998 R15: ffff88800d174080
[   11.413860] FS:  00007f144aee4740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[   11.413861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.413862] CR2: ffffc90003ec7e08 CR3: 000000000cb3e004 CR4: 0000000000770ee0
[   11.413863] PKRU: 55555554
[   11.413863] Kernel panic - not syncing: Fatal exception in interrupt
[   11.413889] BUG: unable to handle page fault for address: ffffc90003ebfe88
[   11.414112] #PF: supervisor read access in kernel mode
[   11.414320] #PF: error_code(0x0009) - reserved bit violation
[   11.415151] PGD 5000067 P4D 5000067 PUD 5219067 PMD d278067 PTE 1e914974aa550b07
[   11.417015] Oops: 0009 [#2] PREEMPT SMP NOPTI
[   11.417375] CPU: 0 PID: 29 Comm: kworker/u4:2 Tainted: G      D            6.4.0-rc5-xfstests-lockdep-00069-gfc362247e79f #169
[   11.417641] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   11.417962] Workqueue: writeback wb_workfn (flush-254:32)
[   11.418683] RIP: 0010:timerqueue_add+0x28/0xb0
[   11.418916] Code: 90 90 66 0f 1f 00 55 53 48 3b 36 48 89 f3 0f 85 96 00 00 00 48 8b 07 48 85 c0 74 55 48 8b 73 18 bd 01 00 00 00 eb 03 48 89 d0 <48> 3b 70 18 48 8d 48 10 7c 06 48 8d 48 08 31 ed 48 8b 11 48 85 d2
[   11.419173] RSP: 0018:ffffc90000003f00 EFLAGS: 00010082
[   11.419710] RAX: ffffc90003ebfe70 RBX: ffff88807dbe0210 RCX: ffff88800d07a3e8
[   11.420219] RDX: ffffc90003ebfe70 RSI: 00000002a849a0e0 RDI: ffff88807dbdfb58
[   11.420634] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
[   11.420877] R10: 0000000000000000 R11: 0000000000000659 R12: ffff88807dbdfa40
[   11.421082] R13: 0000000000000002 R14: ffff888005d3c180 R15: ffff88807dbdfb00
[   11.421924] FS:  0000000000000000(0000) GS:ffff88807da00000(0000) knlGS:0000000000000000
[   11.422165] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.422487] CR2: ffffc90003ebfe88 CR3: 000000000c81c001 CR4: 0000000000770ef0
[   11.422810] PKRU: 55555554
[   11.423133] Call Trace:
[   11.423451]  <IRQ>
[   11.423778]  ? __die+0x23/0x60
[   11.424139]  ? page_fault_oops+0xa4/0x170
[   11.424399]  ? exc_page_fault+0xfa/0x1e0
[   11.424741]  ? asm_exc_page_fault+0x26/0x30
[   11.424884]  ? timerqueue_add+0x28/0xb0
[   11.425001]  enqueue_hrtimer+0x42/0xa0
[   11.425097]  __hrtimer_run_queues+0x304/0x380
[   11.425241]  hrtimer_interrupt+0xf8/0x230
[   11.425426]  __sysvec_apic_timer_interrupt+0x75/0x190
[   11.425605]  sysvec_apic_timer_interrupt+0x65/0x80
[   11.425794]  </IRQ>
[   11.425966]  <TASK>
[   11.426139]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   11.426344] RIP: 0010:aesni_xts_encrypt+0x2d/0x1d0
[   11.426529] Code: 00 66 0f 6f 3d 24 ff 93 01 41 0f 10 18 44 8b 8f e0 01 00 00 48 83 e9 40 0f 8c f3 00 00 00 66 0f 6f c3 f3 0f 6f 0a 66 0f ef c1 <f3> 0f 7f 1e 66 0f 70 d3 13 66 0f d4 db 66 0f 72 e2 1f 66 0f db d7
[   11.426757] RSP: 0018:ffffc9000052f558 EFLAGS: 00010206
[   11.427074] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000fc0
[   11.427172] RDX: ffff88801281a000 RSI: ffff88800d7c3000 RDI: ffff88800d180220
[   11.427406] RBP: ffffc9000052f720 R08: ffffc9000052f780 R09: 0000000000000020
[   11.427624] R10: ffff88800d1800a0 R11: 0000000000000018 R12: ffffc9000052f580
[   11.428468] R13: ffff88800d180220 R14: 0000000000001000 R15: 0000000000000001
[   11.428705]  ? aesni_enc+0x13/0x20
[   11.429027]  xts_crypt+0x10f/0x340
[   11.429349]  ? lock_release+0x65/0x100
[   11.429667]  ? do_raw_spin_unlock+0x4e/0xa0
[   11.429987]  ? _raw_spin_unlock+0x23/0x40
[   11.430312]  ? lock_is_held_type+0x9d/0x110
[   11.430471]  fscrypt_crypt_block+0x268/0x320
[   11.430627]  ? mempool_alloc+0x94/0x1e0
[   11.430803]  fscrypt_encrypt_pagecache_blocks+0xde/0x150
[   11.430991]  ext4_bio_write_folio+0x371/0x500
[   11.431172]  mpage_submit_folio+0x6f/0x90
[   11.431363]  mpage_map_and_submit_buffers+0xc5/0x180
[   11.431558]  mpage_map_and_submit_extent+0x55/0x300
[   11.431739]  ext4_do_writepages+0x70d/0x810
[   11.431981]  ext4_writepages+0xf1/0x290
[   11.432182]  do_writepages+0xd2/0x1e0
[   11.432366]  ? __lock_release.isra.0+0x15e/0x2a0
[   11.432595]  __writeback_single_inode+0x54/0x300
[   11.432817]  ? do_raw_spin_unlock+0x4e/0xa0
[   11.433006]  writeback_sb_inodes+0x1fc/0x500
[   11.433183]  wb_writeback+0xf2/0x370
[   11.433352]  wb_do_writeback+0x9e/0x2e0
[   11.433560]  ? set_worker_desc+0xc7/0xd0
[   11.433772]  wb_workfn+0x6a/0x2b0
[   11.433964]  ? __lock_release.isra.0+0x15e/0x2a0
[   11.434157]  ? process_one_work+0x21b/0x540
[   11.434322]  process_one_work+0x286/0x540
[   11.434500]  worker_thread+0x53/0x3c0
[   11.434678]  ? __pfx_worker_thread+0x10/0x10
[   11.434831]  kthread+0xf2/0x130
[   11.435042]  ? __pfx_kthread+0x10/0x10
[   11.435233]  ret_from_fork+0x29/0x50
[   11.435417]  </TASK>
[   11.435584] CR2: ffffc90003ebfe88
[   11.435931] ---[ end trace 0000000000000000 ]---
[   11.436101] RIP: 0010:__switch_to_asm+0x33/0x80
[   11.436265] Code: 55 41 56 41 57 48 89 a7 d8 17 00 00 48 8b a6 d8 17 00 00 48 8b 9e 40 04 00 00 65 48 89 1c 25 28 00 00 00 49 c7 c4 10 00 00 00 <e8> 01 00 00 00 cc e8 01 00 00 00 cc 48 83 c4 10 49 ff cc 75 eb 0f
[   11.436367] RSP: 0018:ffffc90003ec7e18 EFLAGS: 00010046
[   11.436727] RAX: 0000000000000001 RBX: 961d22f2e2e05800 RCX: 00000002afbf75a9
[   11.436938] RDX: 0000000000000003 RSI: ffff88800d174080 RDI: ffff88800d0ae200
[   11.437766] RBP: ffffc90003fd7af0 R08: 0000000000000001 R09: 0000000000000001
[   11.438000] R10: 00000000000003cc R11: 0000000000000001 R12: 0000000000000010
[   11.438322] R13: ffffe8ffffc29c50 R14: ffff88807ddee998 R15: ffff88800d174080
[   11.438641] FS:  0000000000000000(0000) GS:ffff88807da00000(0000) knlGS:0000000000000000
[   11.438967] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.439285] CR2: ffffc90003ebfe88 CR3: 000000000c81c001 CR4: 0000000000770ef0
[   11.439604] PKRU: 55555554
[   12.433529] Shutting down cpus with NMI
[   12.433728] Kernel Offset: disabled
QEMU: Terminated
Matthew Wilcox (Oracle) June 11, 2023, 2:15 p.m. UTC | #3
On Sun, Jun 11, 2023 at 01:58:31AM -0400, Theodore Ts'o wrote:
> On Mon, May 15, 2023 at 04:10:41PM +0530, Ritesh Harjani (IBM) wrote:
> > mpage_submit_folio() was converted to take folio. Even though
> > folio_size() in ext4 as of now is PAGE_SIZE, but it's better to
> > remove that assumption which I am assuming is a missed left over from
> > patch[1].
> > 
> > [1]: https://lore.kernel.org/linux-ext4/20230324180129.1220691-7-willy@infradead.org/
> > 
> > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> 
> I didn't notice this right away, because the failure is not 100%
> reliable, but this commit will sometimes cause "kvm-xfstests -c
> ext4/encrypt generic/068" to crash.  Reverting the patch fixes the
> problem, so I plan to drop this patch from my tree.

Hrm.  Well, let's think about how this can go wrong:

@@ -1885,7 +1885,7 @@ static int mpage_submit_folio(struct mpage_da_data *mpd,
+struct folio *folio)
        len = folio_size(folio);
        if (folio_pos(folio) + len > size &&
            !ext4_verity_in_progress(mpd->inode))
-               len = size & ~PAGE_MASK;
+        	len = size - folio_pos(folio);
        err = ext4_bio_write_folio(&mpd->io_submit, folio, len);
        if (!err)
                mpd->wbc->nr_to_write--;

Just off-camera is:

        size = i_size_read(mpd->inode);

Now, nothing is preventing i_size to be truncated to far below this
folio, right?  So if that happened before this patch, we'd write some
randomly sized fragment of the page.  Now we'll get a negative result
... which is assigned to size_t, so is exabytes in size.

So do we care if we write a random fragment of a page after a truncate?
If so, we should add:

	if (folio_pos(folio) >= size)
		return 0; /* Do we need to account nr_to_write? */

If we simply don't care that we're doing a spurious write, then we can
do something like:

-		len = size & ~PAGE_MASK;
+		len = size & (len - 1);
Ritesh Harjani (IBM) June 11, 2023, 2:25 p.m. UTC | #4
Please ignore the previous email.

"Theodore Ts'o" <tytso@mit.edu> writes:

> On Mon, May 15, 2023 at 04:10:41PM +0530, Ritesh Harjani (IBM) wrote:
>> mpage_submit_folio() was converted to take folio. Even though
>> folio_size() in ext4 as of now is PAGE_SIZE, but it's better to
>> remove that assumption which I am assuming is a missed left over from
>> patch[1].
>>
>> [1]: https://lore.kernel.org/linux-ext4/20230324180129.1220691-7-willy@infradead.org/
>>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
> I didn't notice this right away, because the failure is not 100%
> reliable, but this commit will sometimes cause "kvm-xfstests -c
> ext4/encrypt generic/068" to crash.  Reverting the patch fixes the
> problem, so I plan to drop this patch from my tree.
>

Sorry about the crash. I am now able to reproduce the problem on my
setup as well. I will debug this and will update once I have some more info.

From the initial look, it looks like the problem might be occurring when
folio_pos(folio) itself is > i_size_read(inode).

If that is indeed the case, then I think even doing this with folio
conversion (below code after folio conversion) looks incorrect for case
when size is not PAGE_SIZE aligned.

However, I will spend some more time debugging this.

static int mpage_submit_folio(struct mpage_da_data *mpd, struct folio *folio)
{
	size_t len;
	loff_t size;
	int err;

	BUG_ON(folio->index != mpd->first_page);
	folio_clear_dirty_for_io(folio);
	/*
	 * We have to be very careful here!  Nothing protects writeback path
	 * against i_size changes and the page can be writeably mapped into
	 * page tables. So an application can be growing i_size and writing
	 * data through mmap while writeback runs. folio_clear_dirty_for_io()
	 * write-protects our page in page tables and the page cannot get
	 * written to again until we release folio lock. So only after
	 * folio_clear_dirty_for_io() we are safe to sample i_size for
	 * ext4_bio_write_page() to zero-out tail of the written page. We rely
	 * on the barrier provided by TestClearPageDirty in
	 * folio_clear_dirty_for_io() to make sure i_size is really sampled only
	 * after page tables are updated.
	 */
	size = i_size_read(mpd->inode);
	len = folio_size(folio);
	if (folio_pos(folio) + len > size &&
	    !ext4_verity_in_progress(mpd->inode))
		len = size & ~PAGE_MASK;
	err = ext4_bio_write_page(&mpd->io_submit, &folio->page, len);
	if (!err)
		mpd->wbc->nr_to_write--;

	return err;
}

>       	    		      	      	   	- Ted
>
> generic/068 42s ...  [01:56:09][    7.014363] run fstests generic/068 at 2023-06-11 01:56:09
> [    7.538841] EXT4-fs (vdc): Test dummy encryption mode enabled
> [   11.407307] traps: PANIC: double fault, error_code: 0x0
> [   11.407313] double fault: 0000 [#1] PREEMPT SMP NOPTI
> [   11.407315] CPU: 1 PID: 3358 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-lockdep-00069-gfc362247e79f #169
> [   11.407316] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [   11.407317] RIP: 0010:__switch_to_asm+0x33/0x80
> [   11.407322] Code: 55 41 56 41 57 48 89 a7 d8 17 00 00 48 8b a6 d8 17 00 00 48 8b 9e 40 04 00 00 65 48 89 1c 25 28 00 00 00 49 c7 c4 10 00 00 00 <e8> 01 00 00 00 cc e8 01 00 00 00 cc 48 83 c4 10 49 ff cc 75 eb 0f
> [   11.407323] RSP: 0018:ffffc90003ec7e18 EFLAGS: 00010046
> [   11.407324] RAX: 0000000000000001 RBX: 961d22f2e2e05800 RCX: 00000002afbf75a9
> [   11.407325] RDX: 0000000000000003 RSI: ffff88800d174080 RDI: ffff88800d0ae200
> [   11.407325] RBP: ffffc90003fd7af0 R08: 0000000000000001 R09: 0000000000000001
> [   11.407326] R10: 00000000000003cc R11: 0000000000000001 R12: 0000000000000010
> [   11.407326] R13: ffffe8ffffc29c50 R14: ffff88807ddee998 R15: ffff88800d174080
> [   11.407327] FS:  00007f144aee4740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
> [   11.407329] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   11.407330] CR2: ffffc90003ec7e08 CR3: 000000000cb3e004 CR4: 0000000000770ee0
> [   11.407330] PKRU: 55555554
> [   11.407330] Call Trace:
> [   11.407331]  <#DF>
> [   11.407332]  ? die+0x36/0x80
> [   11.407334]  ? exc_double_fault+0xf1/0x1b0
> [   11.407336]  ? asm_exc_double_fault+0x23/0x30
> [   11.407338]  ? __switch_to_asm+0x33/0x80
> [   11.407339]  </#DF>
> [   11.413852] ---[ end trace 0000000000000000 ]---
> [   11.413853] RIP: 0010:__switch_to_asm+0x33/0x80
> [   11.413856] Code: 55 41 56 41 57 48 89 a7 d8 17 00 00 48 8b a6 d8 17 00 00 48 8b 9e 40 04 00 00 65 48 89 1c 25 28 00 00 00 49 c7 c4 10 00 00 00 <e8> 01 00 00 00 cc e8 01 00 00 00 cc 48 83 c4 10 49 ff cc 75 eb 0f
> [   11.413857] RSP: 0018:ffffc90003ec7e18 EFLAGS: 00010046
> [   11.413857] RAX: 0000000000000001 RBX: 961d22f2e2e05800 RCX: 00000002afbf75a9
> [   11.413858] RDX: 0000000000000003 RSI: ffff88800d174080 RDI: ffff88800d0ae200
> [   11.413858] RBP: ffffc90003fd7af0 R08: 0000000000000001 R09: 0000000000000001
> [   11.413859] R10: 00000000000003cc R11: 0000000000000001 R12: 0000000000000010
> [   11.413859] R13: ffffe8ffffc29c50 R14: ffff88807ddee998 R15: ffff88800d174080
> [   11.413860] FS:  00007f144aee4740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
> [   11.413861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   11.413862] CR2: ffffc90003ec7e08 CR3: 000000000cb3e004 CR4: 0000000000770ee0
> [   11.413863] PKRU: 55555554
> [   11.413863] Kernel panic - not syncing: Fatal exception in interrupt
> [   11.413889] BUG: unable to handle page fault for address: ffffc90003ebfe88
> [   11.414112] #PF: supervisor read access in kernel mode
> [   11.414320] #PF: error_code(0x0009) - reserved bit violation
> [   11.415151] PGD 5000067 P4D 5000067 PUD 5219067 PMD d278067 PTE 1e914974aa550b07
> [   11.417015] Oops: 0009 [#2] PREEMPT SMP NOPTI
> [   11.417375] CPU: 0 PID: 29 Comm: kworker/u4:2 Tainted: G      D            6.4.0-rc5-xfstests-lockdep-00069-gfc362247e79f #169
> [   11.417641] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [   11.417962] Workqueue: writeback wb_workfn (flush-254:32)
> [   11.418683] RIP: 0010:timerqueue_add+0x28/0xb0
> [   11.418916] Code: 90 90 66 0f 1f 00 55 53 48 3b 36 48 89 f3 0f 85 96 00 00 00 48 8b 07 48 85 c0 74 55 48 8b 73 18 bd 01 00 00 00 eb 03 48 89 d0 <48> 3b 70 18 48 8d 48 10 7c 06 48 8d 48 08 31 ed 48 8b 11 48 85 d2
> [   11.419173] RSP: 0018:ffffc90000003f00 EFLAGS: 00010082
> [   11.419710] RAX: ffffc90003ebfe70 RBX: ffff88807dbe0210 RCX: ffff88800d07a3e8
> [   11.420219] RDX: ffffc90003ebfe70 RSI: 00000002a849a0e0 RDI: ffff88807dbdfb58
> [   11.420634] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> [   11.420877] R10: 0000000000000000 R11: 0000000000000659 R12: ffff88807dbdfa40
> [   11.421082] R13: 0000000000000002 R14: ffff888005d3c180 R15: ffff88807dbdfb00
> [   11.421924] FS:  0000000000000000(0000) GS:ffff88807da00000(0000) knlGS:0000000000000000
> [   11.422165] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   11.422487] CR2: ffffc90003ebfe88 CR3: 000000000c81c001 CR4: 0000000000770ef0
> [   11.422810] PKRU: 55555554
> [   11.423133] Call Trace:
> [   11.423451]  <IRQ>
> [   11.423778]  ? __die+0x23/0x60
> [   11.424139]  ? page_fault_oops+0xa4/0x170
> [   11.424399]  ? exc_page_fault+0xfa/0x1e0
> [   11.424741]  ? asm_exc_page_fault+0x26/0x30
> [   11.424884]  ? timerqueue_add+0x28/0xb0
> [   11.425001]  enqueue_hrtimer+0x42/0xa0
> [   11.425097]  __hrtimer_run_queues+0x304/0x380
> [   11.425241]  hrtimer_interrupt+0xf8/0x230
> [   11.425426]  __sysvec_apic_timer_interrupt+0x75/0x190
> [   11.425605]  sysvec_apic_timer_interrupt+0x65/0x80
> [   11.425794]  </IRQ>
> [   11.425966]  <TASK>
> [   11.426139]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [   11.426344] RIP: 0010:aesni_xts_encrypt+0x2d/0x1d0
> [   11.426529] Code: 00 66 0f 6f 3d 24 ff 93 01 41 0f 10 18 44 8b 8f e0 01 00 00 48 83 e9 40 0f 8c f3 00 00 00 66 0f 6f c3 f3 0f 6f 0a 66 0f ef c1 <f3> 0f 7f 1e 66 0f 70 d3 13 66 0f d4 db 66 0f 72 e2 1f 66 0f db d7
> [   11.426757] RSP: 0018:ffffc9000052f558 EFLAGS: 00010206
> [   11.427074] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000fc0
> [   11.427172] RDX: ffff88801281a000 RSI: ffff88800d7c3000 RDI: ffff88800d180220
> [   11.427406] RBP: ffffc9000052f720 R08: ffffc9000052f780 R09: 0000000000000020
> [   11.427624] R10: ffff88800d1800a0 R11: 0000000000000018 R12: ffffc9000052f580
> [   11.428468] R13: ffff88800d180220 R14: 0000000000001000 R15: 0000000000000001
> [   11.428705]  ? aesni_enc+0x13/0x20
> [   11.429027]  xts_crypt+0x10f/0x340
> [   11.429349]  ? lock_release+0x65/0x100
> [   11.429667]  ? do_raw_spin_unlock+0x4e/0xa0
> [   11.429987]  ? _raw_spin_unlock+0x23/0x40
> [   11.430312]  ? lock_is_held_type+0x9d/0x110
> [   11.430471]  fscrypt_crypt_block+0x268/0x320
> [   11.430627]  ? mempool_alloc+0x94/0x1e0
> [   11.430803]  fscrypt_encrypt_pagecache_blocks+0xde/0x150
> [   11.430991]  ext4_bio_write_folio+0x371/0x500
> [   11.431172]  mpage_submit_folio+0x6f/0x90
> [   11.431363]  mpage_map_and_submit_buffers+0xc5/0x180
> [   11.431558]  mpage_map_and_submit_extent+0x55/0x300
> [   11.431739]  ext4_do_writepages+0x70d/0x810
> [   11.431981]  ext4_writepages+0xf1/0x290
> [   11.432182]  do_writepages+0xd2/0x1e0
> [   11.432366]  ? __lock_release.isra.0+0x15e/0x2a0
> [   11.432595]  __writeback_single_inode+0x54/0x300
> [   11.432817]  ? do_raw_spin_unlock+0x4e/0xa0
> [   11.433006]  writeback_sb_inodes+0x1fc/0x500
> [   11.433183]  wb_writeback+0xf2/0x370
> [   11.433352]  wb_do_writeback+0x9e/0x2e0
> [   11.433560]  ? set_worker_desc+0xc7/0xd0
> [   11.433772]  wb_workfn+0x6a/0x2b0
> [   11.433964]  ? __lock_release.isra.0+0x15e/0x2a0
> [   11.434157]  ? process_one_work+0x21b/0x540
> [   11.434322]  process_one_work+0x286/0x540
> [   11.434500]  worker_thread+0x53/0x3c0
> [   11.434678]  ? __pfx_worker_thread+0x10/0x10
> [   11.434831]  kthread+0xf2/0x130
> [   11.435042]  ? __pfx_kthread+0x10/0x10
> [   11.435233]  ret_from_fork+0x29/0x50
> [   11.435417]  </TASK>
> [   11.435584] CR2: ffffc90003ebfe88
> [   11.435931] ---[ end trace 0000000000000000 ]---
> [   11.436101] RIP: 0010:__switch_to_asm+0x33/0x80
> [   11.436265] Code: 55 41 56 41 57 48 89 a7 d8 17 00 00 48 8b a6 d8 17 00 00 48 8b 9e 40 04 00 00 65 48 89 1c 25 28 00 00 00 49 c7 c4 10 00 00 00 <e8> 01 00 00 00 cc e8 01 00 00 00 cc 48 83 c4 10 49 ff cc 75 eb 0f
> [   11.436367] RSP: 0018:ffffc90003ec7e18 EFLAGS: 00010046
> [   11.436727] RAX: 0000000000000001 RBX: 961d22f2e2e05800 RCX: 00000002afbf75a9
> [   11.436938] RDX: 0000000000000003 RSI: ffff88800d174080 RDI: ffff88800d0ae200
> [   11.437766] RBP: ffffc90003fd7af0 R08: 0000000000000001 R09: 0000000000000001
> [   11.438000] R10: 00000000000003cc R11: 0000000000000001 R12: 0000000000000010
> [   11.438322] R13: ffffe8ffffc29c50 R14: ffff88807ddee998 R15: ffff88800d174080
> [   11.438641] FS:  0000000000000000(0000) GS:ffff88807da00000(0000) knlGS:0000000000000000
> [   11.438967] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   11.439285] CR2: ffffc90003ebfe88 CR3: 000000000c81c001 CR4: 0000000000770ef0
> [   11.439604] PKRU: 55555554
> [   12.433529] Shutting down cpus with NMI
> [   12.433728] Kernel Offset: disabled
> QEMU: Terminated

Thanks for letting me know. I will look more into this.

-ritesh
Ritesh Harjani (IBM) June 12, 2023, 5:25 p.m. UTC | #5
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:

> Please ignore the previous email.
>
> "Theodore Ts'o" <tytso@mit.edu> writes:
>
>> On Mon, May 15, 2023 at 04:10:41PM +0530, Ritesh Harjani (IBM) wrote:
>>> mpage_submit_folio() was converted to take folio. Even though
>>> folio_size() in ext4 as of now is PAGE_SIZE, but it's better to
>>> remove that assumption which I am assuming is a missed left over from
>>> patch[1].
>>>
>>> [1]: https://lore.kernel.org/linux-ext4/20230324180129.1220691-7-willy@infradead.org/
>>>
>>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>>
>> I didn't notice this right away, because the failure is not 100%
>> reliable, but this commit will sometimes cause "kvm-xfstests -c
>> ext4/encrypt generic/068" to crash.  Reverting the patch fixes the
>> problem, so I plan to drop this patch from my tree.
>>
>
> Sorry about the crash. I am now able to reproduce the problem on my
> setup as well. I will debug this and will update once I have some more info.
>
> From the initial look, it looks like the problem might be occurring when
> folio_pos(folio) itself is > i_size_read(inode).
>
> If that is indeed the case, then I think even doing this with folio
> conversion (below code after folio conversion) looks incorrect for case
> when size is not PAGE_SIZE aligned.
>
> However, I will spend some more time debugging this.

I am still looking into this. I would like to make sure I go through
all the paths where i_size can be modified.
- buffered-IO
- writeback
- direct-IO
- page fault
- truncate
- fallocate (punch/collapse)
- evict (not relevant though)

It is easily recreatable if we have one thread doing buffered-io +
sync and other thread trying to truncate down inode->i_size.
Kernel panic maybe is happening only with -O encrypt mkfs option +
-o test_dummy_encryption mount option, but the size - folio_pos(folio)
is definitely wrong because inode->i_size is not protected in writeback path.

More on this later...

-ritesh
Matthew Wilcox (Oracle) June 12, 2023, 5:43 p.m. UTC | #6
On Mon, Jun 12, 2023 at 10:55:37PM +0530, Ritesh Harjani wrote:
> It is easily recreatable if we have one thread doing buffered-io +
> sync and other thread trying to truncate down inode->i_size.
> Kernel panic maybe is happening only with -O encrypt mkfs option +
> -o test_dummy_encryption mount option, but the size - folio_pos(folio)
> is definitely wrong because inode->i_size is not protected in writeback path.

Did you not see the email I sent right before you sent your previous
email?
Ritesh Harjani (IBM) June 12, 2023, 6:25 p.m. UTC | #7
Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Jun 12, 2023 at 10:55:37PM +0530, Ritesh Harjani wrote:
>> It is easily recreatable if we have one thread doing buffered-io +
>> sync and other thread trying to truncate down inode->i_size.
>> Kernel panic maybe is happening only with -O encrypt mkfs option +
>> -o test_dummy_encryption mount option, but the size - folio_pos(folio)
>> is definitely wrong because inode->i_size is not protected in writeback path.
>
> Did you not see the email I sent right before you sent your previous
> email?

Aah yes, Matthew. I had seen that email yesterday after I sent my email.
Sorry I forgot to acknowdledge it today and thanks for pointing things
out.

I couldn't respond to your change because I still had some confusion
around this suggestion - 

> So do we care if we write a random fragment of a page after a truncate?
> If so, we should add:
> 
>         if (folio_pos(folio) >= size)
>                 return 0; /* Do we need to account nr_to_write? */

I was not sure whether if go with above case then whether it will
work with collapse_range. I initially thought that collapse_range will
truncate the pages between start and end of the file and then
it can also reduce the inode->i_size. That means writeback can find an
inode->i_size smaller than folio_pos(folio) which it is writing to.
But in this case we can't skip the write in writeback case like above
because that write is still required (a spurious write) even though
i_size is reduced as it's corresponding FS blocks are not truncated.

But just now looking at ext4_collapse_range() code it doesn't look like
it is the problem because it waits for any dirty data to be written
before truncate. So no matter which folio_pos(folio) the writeback is
writing, there should not be an issue if we simply return 0 like how
you suggested above.

    static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)

    <...>
        ioffset = round_down(offset, PAGE_SIZE);
        /*
        * Write tail of the last page before removed range since it will get
        * removed from the page cache below.
        */

        ret = filemap_write_and_wait_range(mapping, ioffset, offset);
        if (ret)
            goto out_mmap;
        /*
        * Write data that will be shifted to preserve them when discarding
        * page cache below. We are also protected from pages becoming dirty
        * by i_rwsem and invalidate_lock.
        */
        ret = filemap_write_and_wait_range(mapping, offset + len,
                        LLONG_MAX);
        truncate_pagecache(inode, ioffset);

        <... within i_data_sem>
        i_size_write(inode, new_size);

    <...>


However to avoid problems like this I felt, I will do some more code
reading. And then I was mostly considering your second suggestion which
is this. This will ensure we keep the current behavior as is and not
change that.

> If we simply don't care that we're doing a spurious write, then we can
> do something like:
> 
> -               len = size & ~PAGE_MASK;
> +               len = size & (len - 1);


-ritesh
Matthew Wilcox (Oracle) June 12, 2023, 7:16 p.m. UTC | #8
On Mon, Jun 12, 2023 at 11:55:55PM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> I couldn't respond to your change because I still had some confusion
> around this suggestion - 
> 
> > So do we care if we write a random fragment of a page after a truncate?
> > If so, we should add:
> > 
> >         if (folio_pos(folio) >= size)
> >                 return 0; /* Do we need to account nr_to_write? */
> 
> I was not sure whether if go with above case then whether it will
> work with collapse_range. I initially thought that collapse_range will
> truncate the pages between start and end of the file and then
> it can also reduce the inode->i_size. That means writeback can find an
> inode->i_size smaller than folio_pos(folio) which it is writing to.
> But in this case we can't skip the write in writeback case like above
> because that write is still required (a spurious write) even though
> i_size is reduced as it's corresponding FS blocks are not truncated.
> 
> But just now looking at ext4_collapse_range() code it doesn't look like
> it is the problem because it waits for any dirty data to be written
> before truncate. So no matter which folio_pos(folio) the writeback is
> writing, there should not be an issue if we simply return 0 like how
> you suggested above.
> 
>     static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
> 
>     <...>
>         ioffset = round_down(offset, PAGE_SIZE);
>         /*
>         * Write tail of the last page before removed range since it will get
>         * removed from the page cache below.
>         */
> 
>         ret = filemap_write_and_wait_range(mapping, ioffset, offset);
>         if (ret)
>             goto out_mmap;
>         /*
>         * Write data that will be shifted to preserve them when discarding
>         * page cache below. We are also protected from pages becoming dirty
>         * by i_rwsem and invalidate_lock.
>         */
>         ret = filemap_write_and_wait_range(mapping, offset + len,
>                         LLONG_MAX);
>         truncate_pagecache(inode, ioffset);
> 
>         <... within i_data_sem>
>         i_size_write(inode, new_size);
> 
>     <...>
> 
> 
> However to avoid problems like this I felt, I will do some more code
> reading. And then I was mostly considering your second suggestion which
> is this. This will ensure we keep the current behavior as is and not
> change that.
> 
> > If we simply don't care that we're doing a spurious write, then we can
> > do something like:
> > 
> > -               len = size & ~PAGE_MASK;
> > +               len = size & (len - 1);

For all I know, I've found a bug here.  I don't know enough about ext4; if
we have truncated a file, and then writeback a page that is past i_size,
will the block its writing to have been freed?  Is this potentially a
silent data corruptor?
Ritesh Harjani (IBM) June 13, 2023, 3:57 a.m. UTC | #9
I am hoping Jan and Ted could correct me if any of my understanding
is incorrect. But here is my view...

Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Jun 12, 2023 at 11:55:55PM +0530, Ritesh Harjani wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>> I couldn't respond to your change because I still had some confusion
>> around this suggestion -
>>
>> > So do we care if we write a random fragment of a page after a truncate?
>> > If so, we should add:
>> >
>> >         if (folio_pos(folio) >= size)
>> >                 return 0; /* Do we need to account nr_to_write? */
>>
>> I was not sure whether if go with above case then whether it will
>> work with collapse_range. I initially thought that collapse_range will
>> truncate the pages between start and end of the file and then
>> it can also reduce the inode->i_size. That means writeback can find an
>> inode->i_size smaller than folio_pos(folio) which it is writing to.
>> But in this case we can't skip the write in writeback case like above
>> because that write is still required (a spurious write) even though
>> i_size is reduced as it's corresponding FS blocks are not truncated.
>>
>> But just now looking at ext4_collapse_range() code it doesn't look like
>> it is the problem because it waits for any dirty data to be written
>> before truncate. So no matter which folio_pos(folio) the writeback is
>> writing, there should not be an issue if we simply return 0 like how
>> you suggested above.
>>
>>     static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
>>
>>     <...>
>>         ioffset = round_down(offset, PAGE_SIZE);
>>         /*
>>         * Write tail of the last page before removed range since it will get
>>         * removed from the page cache below.
>>         */
>>
>>         ret = filemap_write_and_wait_range(mapping, ioffset, offset);
>>         if (ret)
>>             goto out_mmap;
>>         /*
>>         * Write data that will be shifted to preserve them when discarding
>>         * page cache below. We are also protected from pages becoming dirty
>>         * by i_rwsem and invalidate_lock.
>>         */
>>         ret = filemap_write_and_wait_range(mapping, offset + len,
>>                         LLONG_MAX);
>>         truncate_pagecache(inode, ioffset);
>>
>>         <... within i_data_sem>
>>         i_size_write(inode, new_size);
>>
>>     <...>
>>
>>
>> However to avoid problems like this I felt, I will do some more code
>> reading. And then I was mostly considering your second suggestion which
>> is this. This will ensure we keep the current behavior as is and not
>> change that.
>>
>> > If we simply don't care that we're doing a spurious write, then we can
>> > do something like:
>> >
>> > -               len = size & ~PAGE_MASK;
>> > +               len = size & (len - 1);
>
> For all I know, I've found a bug here.  I don't know enough about ext4; if
> we have truncated a file, and then writeback a page that is past i_size,
> will the block its writing to have been freed?

I don't think so. If we look at truncate code, it first reduces i_size,
then call truncate_pagecache(inode, newsize) and then we will call
ext4_truncate() which will free the corresponding blocks.
Since writeback happens with folio lock held until completion, hence I
think truncate_pagecache() should block on that folio until it's lock
has been released.

- IIUC, if truncate would have completed then the folio won't be in the
foliocache for writeback to happen. Foliocache is kept consistent
via
    - first truncate the folio in the foliocache and then remove/free
    the blocks on device.

- Also the reason we update i_size "before" calling truncate_pagecache()
  is to synchronize with mmap/pagefault.

> Is this potentially a silent data corruptor?

- Let's consider a case when folio_pos > i_size but both still belongs
to the last block. i.e. it's a straddle write case.
In such case we require writeback to write the data of this last folio
straddling i_size. Because truncate will not remove/free this last folio
straddling i_size & neither the last block will be freed. And I think
writeback is supposed to write this last folio to the disk to keep the
cache and disk data consistent. Because truncate will only zero out
the rest of the folio in the foliocache. But I don't think it will go and
write that folio out (It's not required because i_size means that the
rest of the folio beyond i_size should remain zero).

So, IMO writeback is supposed to write this last folio to the disk. And,
if we skip this writeout, then I think it may cause silent data corruption.

But I am not sure about the rest of the write beyond the last block of
i_size. I think those could just be spurious writes which won't cause
any harm because truncate will eventually first remove this folio from
file mapping and then will release the corresponding disk blocks.
So writing those out should does no harm


-ritesh
Jan Kara June 13, 2023, 9:59 a.m. UTC | #10
On Tue 13-06-23 09:27:38, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > On Mon, Jun 12, 2023 at 11:55:55PM +0530, Ritesh Harjani wrote:
> >> Matthew Wilcox <willy@infradead.org> writes:
> >> I couldn't respond to your change because I still had some confusion
> >> around this suggestion -
> >>
> >> > So do we care if we write a random fragment of a page after a truncate?
> >> > If so, we should add:
> >> >
> >> >         if (folio_pos(folio) >= size)
> >> >                 return 0; /* Do we need to account nr_to_write? */
> >>
> >> I was not sure whether if go with above case then whether it will
> >> work with collapse_range. I initially thought that collapse_range will
> >> truncate the pages between start and end of the file and then
> >> it can also reduce the inode->i_size. That means writeback can find an
> >> inode->i_size smaller than folio_pos(folio) which it is writing to.
> >> But in this case we can't skip the write in writeback case like above
> >> because that write is still required (a spurious write) even though
> >> i_size is reduced as it's corresponding FS blocks are not truncated.
> >>
> >> But just now looking at ext4_collapse_range() code it doesn't look like
> >> it is the problem because it waits for any dirty data to be written
> >> before truncate. So no matter which folio_pos(folio) the writeback is
> >> writing, there should not be an issue if we simply return 0 like how
> >> you suggested above.
> >>
> >>     static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
> >>
> >>     <...>
> >>         ioffset = round_down(offset, PAGE_SIZE);
> >>         /*
> >>         * Write tail of the last page before removed range since it will get
> >>         * removed from the page cache below.
> >>         */
> >>
> >>         ret = filemap_write_and_wait_range(mapping, ioffset, offset);
> >>         if (ret)
> >>             goto out_mmap;
> >>         /*
> >>         * Write data that will be shifted to preserve them when discarding
> >>         * page cache below. We are also protected from pages becoming dirty
> >>         * by i_rwsem and invalidate_lock.
> >>         */
> >>         ret = filemap_write_and_wait_range(mapping, offset + len,
> >>                         LLONG_MAX);
> >>         truncate_pagecache(inode, ioffset);
> >>
> >>         <... within i_data_sem>
> >>         i_size_write(inode, new_size);
> >>
> >>     <...>
> >>
> >>
> >> However to avoid problems like this I felt, I will do some more code
> >> reading. And then I was mostly considering your second suggestion which
> >> is this. This will ensure we keep the current behavior as is and not
> >> change that.
> >>
> >> > If we simply don't care that we're doing a spurious write, then we can
> >> > do something like:
> >> >
> >> > -               len = size & ~PAGE_MASK;
> >> > +               len = size & (len - 1);
> >
> > For all I know, I've found a bug here.  I don't know enough about ext4; if
> > we have truncated a file, and then writeback a page that is past i_size,
> > will the block its writing to have been freed?
> 
> I don't think so. If we look at truncate code, it first reduces i_size,
> then call truncate_pagecache(inode, newsize) and then we will call
> ext4_truncate() which will free the corresponding blocks.
> Since writeback happens with folio lock held until completion, hence I
> think truncate_pagecache() should block on that folio until it's lock
> has been released.
> 
> - IIUC, if truncate would have completed then the folio won't be in the
> foliocache for writeback to happen. Foliocache is kept consistent
> via
>     - first truncate the folio in the foliocache and then remove/free
>     the blocks on device.

Yes, correct.

> - Also the reason we update i_size "before" calling truncate_pagecache()
>   is to synchronize with mmap/pagefault.

Yes, but these days mapping->invalidate_lock works for that instead for
ext4.

> > Is this potentially a silent data corruptor?
> 
> - Let's consider a case when folio_pos > i_size but both still belongs
> to the last block. i.e. it's a straddle write case.
> In such case we require writeback to write the data of this last folio
> straddling i_size. Because truncate will not remove/free this last folio
> straddling i_size & neither the last block will be freed. And I think
> writeback is supposed to write this last folio to the disk to keep the
> cache and disk data consistent. Because truncate will only zero out
> the rest of the folio in the foliocache. But I don't think it will go and
> write that folio out (It's not required because i_size means that the
> rest of the folio beyond i_size should remain zero).
> 
> So, IMO writeback is supposed to write this last folio to the disk. And,
> if we skip this writeout, then I think it may cause silent data corruption.
> 
> But I am not sure about the rest of the write beyond the last block of
> i_size. I think those could just be spurious writes which won't cause
> any harm because truncate will eventually first remove this folio from
> file mapping and then will release the corresponding disk blocks.
> So writing those out should does no harm

Correct. The block straddling i_size must be written out, the blocks fully
beyond new i_size (but below old i_size) may or may not be written out. As
you say these extra racing writes to blocks that will get truncated cause
no harm.

								Honza
Ritesh Harjani (IBM) June 13, 2023, 7:39 p.m. UTC | #11
Jan Kara <jack@suse.cz> writes:

> On Tue 13-06-23 09:27:38, Ritesh Harjani wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>> > On Mon, Jun 12, 2023 at 11:55:55PM +0530, Ritesh Harjani wrote:
>> >> Matthew Wilcox <willy@infradead.org> writes:
>> >> I couldn't respond to your change because I still had some confusion
>> >> around this suggestion -
>> >>
>> >> > So do we care if we write a random fragment of a page after a truncate?
>> >> > If so, we should add:
>> >> >
>> >> >         if (folio_pos(folio) >= size)
>> >> >                 return 0; /* Do we need to account nr_to_write? */
>> >>
>> >> I was not sure whether if go with above case then whether it will
>> >> work with collapse_range. I initially thought that collapse_range will
>> >> truncate the pages between start and end of the file and then
>> >> it can also reduce the inode->i_size. That means writeback can find an
>> >> inode->i_size smaller than folio_pos(folio) which it is writing to.
>> >> But in this case we can't skip the write in writeback case like above
>> >> because that write is still required (a spurious write) even though
>> >> i_size is reduced as it's corresponding FS blocks are not truncated.
>> >>
>> >> But just now looking at ext4_collapse_range() code it doesn't look like
>> >> it is the problem because it waits for any dirty data to be written
>> >> before truncate. So no matter which folio_pos(folio) the writeback is
>> >> writing, there should not be an issue if we simply return 0 like how
>> >> you suggested above.
>> >>
>> >>     static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
>> >>
>> >>     <...>
>> >>         ioffset = round_down(offset, PAGE_SIZE);
>> >>         /*
>> >>         * Write tail of the last page before removed range since it will get
>> >>         * removed from the page cache below.
>> >>         */
>> >>
>> >>         ret = filemap_write_and_wait_range(mapping, ioffset, offset);
>> >>         if (ret)
>> >>             goto out_mmap;
>> >>         /*
>> >>         * Write data that will be shifted to preserve them when discarding
>> >>         * page cache below. We are also protected from pages becoming dirty
>> >>         * by i_rwsem and invalidate_lock.
>> >>         */
>> >>         ret = filemap_write_and_wait_range(mapping, offset + len,
>> >>                         LLONG_MAX);
>> >>         truncate_pagecache(inode, ioffset);
>> >>
>> >>         <... within i_data_sem>
>> >>         i_size_write(inode, new_size);
>> >>
>> >>     <...>
>> >>
>> >>
>> >> However to avoid problems like this I felt, I will do some more code
>> >> reading. And then I was mostly considering your second suggestion which
>> >> is this. This will ensure we keep the current behavior as is and not
>> >> change that.
>> >>
>> >> > If we simply don't care that we're doing a spurious write, then we can
>> >> > do something like:
>> >> >
>> >> > -               len = size & ~PAGE_MASK;
>> >> > +               len = size & (len - 1);
>> >
>> > For all I know, I've found a bug here.  I don't know enough about ext4; if
>> > we have truncated a file, and then writeback a page that is past i_size,
>> > will the block its writing to have been freed?
>> 
>> I don't think so. If we look at truncate code, it first reduces i_size,
>> then call truncate_pagecache(inode, newsize) and then we will call
>> ext4_truncate() which will free the corresponding blocks.
>> Since writeback happens with folio lock held until completion, hence I
>> think truncate_pagecache() should block on that folio until it's lock
>> has been released.
>> 
>> - IIUC, if truncate would have completed then the folio won't be in the
>> foliocache for writeback to happen. Foliocache is kept consistent
>> via
>>     - first truncate the folio in the foliocache and then remove/free
>>     the blocks on device.
>
> Yes, correct.
>
>> - Also the reason we update i_size "before" calling truncate_pagecache()
>>   is to synchronize with mmap/pagefault.
>
> Yes, but these days mapping->invalidate_lock works for that instead for
> ext4.
>
>> > Is this potentially a silent data corruptor?
>> 
>> - Let's consider a case when folio_pos > i_size but both still belongs
>> to the last block. i.e. it's a straddle write case.
>> In such case we require writeback to write the data of this last folio
>> straddling i_size. Because truncate will not remove/free this last folio
>> straddling i_size & neither the last block will be freed. And I think
>> writeback is supposed to write this last folio to the disk to keep the
>> cache and disk data consistent. Because truncate will only zero out
>> the rest of the folio in the foliocache. But I don't think it will go and
>> write that folio out (It's not required because i_size means that the
>> rest of the folio beyond i_size should remain zero).
>> 
>> So, IMO writeback is supposed to write this last folio to the disk. And,
>> if we skip this writeout, then I think it may cause silent data corruption.
>> 
>> But I am not sure about the rest of the write beyond the last block of
>> i_size. I think those could just be spurious writes which won't cause
>> any harm because truncate will eventually first remove this folio from
>> file mapping and then will release the corresponding disk blocks.
>> So writing those out should does no harm
>
> Correct. The block straddling i_size must be written out, the blocks fully
> beyond new i_size (but below old i_size) may or may not be written out. As
> you say these extra racing writes to blocks that will get truncated cause
> no harm.
>

Thanks Jan for confirming. So, I think we should make below change.
(note the code which was doing "size - folio_pos(folio)" in
mpage_submit_folio() is dropped by Ted in the latest tree).

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 43be684dabcb..006eba9be5e6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1859,9 +1859,9 @@ static int mpage_submit_folio(struct mpage_da_data *mpd, struct folio *folio)
         */
        size = i_size_read(mpd->inode);
        len = folio_size(folio);
-       if (folio_pos(folio) + len > size &&
+       if ((folio_pos(folio) >= size || (folio_pos(folio) + len > size)) &&
            !ext4_verity_in_progress(mpd->inode))
-               len = size & ~PAGE_MASK;
+               len = size & (len - 1);
        err = ext4_bio_write_folio(&mpd->io_submit, folio, len);
        if (!err)
                mpd->wbc->nr_to_write--;
@@ -2329,9 +2329,9 @@ static int mpage_journal_page_buffers(handle_t *handle,
        folio_clear_checked(folio);
        mpd->wbc->nr_to_write--;

-       if (folio_pos(folio) + len > size &&
+       if ((folio_pos(folio) >= size || (folio_pos(folio) + len > size)) &&
            !ext4_verity_in_progress(inode))
-               len = size - folio_pos(folio);
+               len = size & (len - 1);

        return ext4_journal_folio_buffers(handle, folio, len);
 }


I will give it some more thoughts and testing.

-ritesh
Matthew Wilcox (Oracle) June 13, 2023, 7:45 p.m. UTC | #12
On Wed, Jun 14, 2023 at 01:09:59AM +0530, Ritesh Harjani wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Tue 13-06-23 09:27:38, Ritesh Harjani wrote:
> >> Matthew Wilcox <willy@infradead.org> writes:
> >> > On Mon, Jun 12, 2023 at 11:55:55PM +0530, Ritesh Harjani wrote:
> >> >> Matthew Wilcox <willy@infradead.org> writes:
> >> >> I couldn't respond to your change because I still had some confusion
> >> >> around this suggestion -
> >> >>
> >> >> > So do we care if we write a random fragment of a page after a truncate?
> >> >> > If so, we should add:
> >> >> >
> >> >> >         if (folio_pos(folio) >= size)
> >> >> >                 return 0; /* Do we need to account nr_to_write? */
> >> >>
> >> >> I was not sure whether if go with above case then whether it will
> >> >> work with collapse_range. I initially thought that collapse_range will
> >> >> truncate the pages between start and end of the file and then
> >> >> it can also reduce the inode->i_size. That means writeback can find an
> >> >> inode->i_size smaller than folio_pos(folio) which it is writing to.
> >> >> But in this case we can't skip the write in writeback case like above
> >> >> because that write is still required (a spurious write) even though
> >> >> i_size is reduced as it's corresponding FS blocks are not truncated.
> >> >>
> >> >> But just now looking at ext4_collapse_range() code it doesn't look like
> >> >> it is the problem because it waits for any dirty data to be written
> >> >> before truncate. So no matter which folio_pos(folio) the writeback is
> >> >> writing, there should not be an issue if we simply return 0 like how
> >> >> you suggested above.
> >> >>
> >> >>     static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
> >> >>
> >> >>     <...>
> >> >>         ioffset = round_down(offset, PAGE_SIZE);
> >> >>         /*
> >> >>         * Write tail of the last page before removed range since it will get
> >> >>         * removed from the page cache below.
> >> >>         */
> >> >>
> >> >>         ret = filemap_write_and_wait_range(mapping, ioffset, offset);
> >> >>         if (ret)
> >> >>             goto out_mmap;
> >> >>         /*
> >> >>         * Write data that will be shifted to preserve them when discarding
> >> >>         * page cache below. We are also protected from pages becoming dirty
> >> >>         * by i_rwsem and invalidate_lock.
> >> >>         */
> >> >>         ret = filemap_write_and_wait_range(mapping, offset + len,
> >> >>                         LLONG_MAX);
> >> >>         truncate_pagecache(inode, ioffset);
> >> >>
> >> >>         <... within i_data_sem>
> >> >>         i_size_write(inode, new_size);
> >> >>
> >> >>     <...>
> >> >>
> >> >>
> >> >> However to avoid problems like this I felt, I will do some more code
> >> >> reading. And then I was mostly considering your second suggestion which
> >> >> is this. This will ensure we keep the current behavior as is and not
> >> >> change that.
> >> >>
> >> >> > If we simply don't care that we're doing a spurious write, then we can
> >> >> > do something like:
> >> >> >
> >> >> > -               len = size & ~PAGE_MASK;
> >> >> > +               len = size & (len - 1);
> >> >
> >> > For all I know, I've found a bug here.  I don't know enough about ext4; if
> >> > we have truncated a file, and then writeback a page that is past i_size,
> >> > will the block its writing to have been freed?
> >> 
> >> I don't think so. If we look at truncate code, it first reduces i_size,
> >> then call truncate_pagecache(inode, newsize) and then we will call
> >> ext4_truncate() which will free the corresponding blocks.
> >> Since writeback happens with folio lock held until completion, hence I
> >> think truncate_pagecache() should block on that folio until it's lock
> >> has been released.
> >> 
> >> - IIUC, if truncate would have completed then the folio won't be in the
> >> foliocache for writeback to happen. Foliocache is kept consistent
> >> via
> >>     - first truncate the folio in the foliocache and then remove/free
> >>     the blocks on device.
> >
> > Yes, correct.
> >
> >> - Also the reason we update i_size "before" calling truncate_pagecache()
> >>   is to synchronize with mmap/pagefault.
> >
> > Yes, but these days mapping->invalidate_lock works for that instead for
> > ext4.
> >
> >> > Is this potentially a silent data corruptor?
> >> 
> >> - Let's consider a case when folio_pos > i_size but both still belongs
> >> to the last block. i.e. it's a straddle write case.
> >> In such case we require writeback to write the data of this last folio
> >> straddling i_size. Because truncate will not remove/free this last folio
> >> straddling i_size & neither the last block will be freed. And I think
> >> writeback is supposed to write this last folio to the disk to keep the
> >> cache and disk data consistent. Because truncate will only zero out
> >> the rest of the folio in the foliocache. But I don't think it will go and
> >> write that folio out (It's not required because i_size means that the
> >> rest of the folio beyond i_size should remain zero).
> >> 
> >> So, IMO writeback is supposed to write this last folio to the disk. And,
> >> if we skip this writeout, then I think it may cause silent data corruption.
> >> 
> >> But I am not sure about the rest of the write beyond the last block of
> >> i_size. I think those could just be spurious writes which won't cause
> >> any harm because truncate will eventually first remove this folio from
> >> file mapping and then will release the corresponding disk blocks.
> >> So writing those out should does no harm
> >
> > Correct. The block straddling i_size must be written out, the blocks fully
> > beyond new i_size (but below old i_size) may or may not be written out. As
> > you say these extra racing writes to blocks that will get truncated cause
> > no harm.
> >
> 
> Thanks Jan for confirming. So, I think we should make below change.
> (note the code which was doing "size - folio_pos(folio)" in
> mpage_submit_folio() is dropped by Ted in the latest tree).
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 43be684dabcb..006eba9be5e6 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1859,9 +1859,9 @@ static int mpage_submit_folio(struct mpage_da_data *mpd, struct folio *folio)
>          */
>         size = i_size_read(mpd->inode);
>         len = folio_size(folio);
> -       if (folio_pos(folio) + len > size &&
> +       if ((folio_pos(folio) >= size || (folio_pos(folio) + len > size)) &&
>             !ext4_verity_in_progress(mpd->inode))
> -               len = size & ~PAGE_MASK;
> +               len = size & (len - 1);
>         err = ext4_bio_write_folio(&mpd->io_submit, folio, len);
>         if (!err)
>                 mpd->wbc->nr_to_write--;
> @@ -2329,9 +2329,9 @@ static int mpage_journal_page_buffers(handle_t *handle,
>         folio_clear_checked(folio);
>         mpd->wbc->nr_to_write--;
> 
> -       if (folio_pos(folio) + len > size &&
> +       if ((folio_pos(folio) >= size || (folio_pos(folio) + len > size)) &&
>             !ext4_verity_in_progress(inode))
> -               len = size - folio_pos(folio);
> +               len = size & (len - 1);
> 
>         return ext4_journal_folio_buffers(handle, folio, len);
>  }
> 
> 
> I will give it some more thoughts and testing.

Why should ext4 be different from other filesystems which simply do:

	if (folio_pos(folio) >= size)
		return 0;
Ritesh Harjani (IBM) June 13, 2023, 8:43 p.m. UTC | #13
Matthew Wilcox <willy@infradead.org> writes:

> On Wed, Jun 14, 2023 at 01:09:59AM +0530, Ritesh Harjani wrote:
>> Jan Kara <jack@suse.cz> writes:
>>
>> > On Tue 13-06-23 09:27:38, Ritesh Harjani wrote:
>> >> Matthew Wilcox <willy@infradead.org> writes:
>> >> > On Mon, Jun 12, 2023 at 11:55:55PM +0530, Ritesh Harjani wrote:
>> >> >> Matthew Wilcox <willy@infradead.org> writes:
>> >> >> I couldn't respond to your change because I still had some confusion
>> >> >> around this suggestion -
>> >> >>
>> >> >> > So do we care if we write a random fragment of a page after a truncate?
>> >> >> > If so, we should add:
>> >> >> >
>> >> >> >         if (folio_pos(folio) >= size)
>> >> >> >                 return 0; /* Do we need to account nr_to_write? */
>> >> >>
>> >> >> I was not sure whether if go with above case then whether it will
>> >> >> work with collapse_range. I initially thought that collapse_range will
>> >> >> truncate the pages between start and end of the file and then
>> >> >> it can also reduce the inode->i_size. That means writeback can find an
>> >> >> inode->i_size smaller than folio_pos(folio) which it is writing to.
>> >> >> But in this case we can't skip the write in writeback case like above
>> >> >> because that write is still required (a spurious write) even though
>> >> >> i_size is reduced as it's corresponding FS blocks are not truncated.
>> >> >>
>> >> >> But just now looking at ext4_collapse_range() code it doesn't look like
>> >> >> it is the problem because it waits for any dirty data to be written
>> >> >> before truncate. So no matter which folio_pos(folio) the writeback is
>> >> >> writing, there should not be an issue if we simply return 0 like how
>> >> >> you suggested above.
>> >> >>
>> >> >>     static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
>> >> >>
>> >> >>     <...>
>> >> >>         ioffset = round_down(offset, PAGE_SIZE);
>> >> >>         /*
>> >> >>         * Write tail of the last page before removed range since it will get
>> >> >>         * removed from the page cache below.
>> >> >>         */
>> >> >>
>> >> >>         ret = filemap_write_and_wait_range(mapping, ioffset, offset);
>> >> >>         if (ret)
>> >> >>             goto out_mmap;
>> >> >>         /*
>> >> >>         * Write data that will be shifted to preserve them when discarding
>> >> >>         * page cache below. We are also protected from pages becoming dirty
>> >> >>         * by i_rwsem and invalidate_lock.
>> >> >>         */
>> >> >>         ret = filemap_write_and_wait_range(mapping, offset + len,
>> >> >>                         LLONG_MAX);
>> >> >>         truncate_pagecache(inode, ioffset);
>> >> >>
>> >> >>         <... within i_data_sem>
>> >> >>         i_size_write(inode, new_size);
>> >> >>
>> >> >>     <...>
>> >> >>
>> >> >>
>> >> >> However to avoid problems like this I felt, I will do some more code
>> >> >> reading. And then I was mostly considering your second suggestion which
>> >> >> is this. This will ensure we keep the current behavior as is and not
>> >> >> change that.
>> >> >>
>> >> >> > If we simply don't care that we're doing a spurious write, then we can
>> >> >> > do something like:
>> >> >> >
>> >> >> > -               len = size & ~PAGE_MASK;
>> >> >> > +               len = size & (len - 1);
>> >> >
>> >> > For all I know, I've found a bug here.  I don't know enough about ext4; if
>> >> > we have truncated a file, and then writeback a page that is past i_size,
>> >> > will the block its writing to have been freed?
>> >>
>> >> I don't think so. If we look at truncate code, it first reduces i_size,
>> >> then call truncate_pagecache(inode, newsize) and then we will call
>> >> ext4_truncate() which will free the corresponding blocks.
>> >> Since writeback happens with folio lock held until completion, hence I
>> >> think truncate_pagecache() should block on that folio until it's lock
>> >> has been released.
>> >>
>> >> - IIUC, if truncate would have completed then the folio won't be in the
>> >> foliocache for writeback to happen. Foliocache is kept consistent
>> >> via
>> >>     - first truncate the folio in the foliocache and then remove/free
>> >>     the blocks on device.
>> >
>> > Yes, correct.
>> >
>> >> - Also the reason we update i_size "before" calling truncate_pagecache()
>> >>   is to synchronize with mmap/pagefault.
>> >
>> > Yes, but these days mapping->invalidate_lock works for that instead for
>> > ext4.
>> >
>> >> > Is this potentially a silent data corruptor?
>> >>
>> >> - Let's consider a case when folio_pos > i_size but both still belongs
>> >> to the last block. i.e. it's a straddle write case.
>> >> In such case we require writeback to write the data of this last folio
>> >> straddling i_size. Because truncate will not remove/free this last folio
>> >> straddling i_size & neither the last block will be freed. And I think
>> >> writeback is supposed to write this last folio to the disk to keep the
>> >> cache and disk data consistent. Because truncate will only zero out
>> >> the rest of the folio in the foliocache. But I don't think it will go and
>> >> write that folio out (It's not required because i_size means that the
>> >> rest of the folio beyond i_size should remain zero).
>> >>
>> >> So, IMO writeback is supposed to write this last folio to the disk. And,
>> >> if we skip this writeout, then I think it may cause silent data corruption.
>> >>
>> >> But I am not sure about the rest of the write beyond the last block of
>> >> i_size. I think those could just be spurious writes which won't cause
>> >> any harm because truncate will eventually first remove this folio from
>> >> file mapping and then will release the corresponding disk blocks.
>> >> So writing those out should does no harm
>> >
>> > Correct. The block straddling i_size must be written out, the blocks fully
>> > beyond new i_size (but below old i_size) may or may not be written out. As
>> > you say these extra racing writes to blocks that will get truncated cause
>> > no harm.
>> >
>>
>> Thanks Jan for confirming. So, I think we should make below change.
>> (note the code which was doing "size - folio_pos(folio)" in
>> mpage_submit_folio() is dropped by Ted in the latest tree).
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 43be684dabcb..006eba9be5e6 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1859,9 +1859,9 @@ static int mpage_submit_folio(struct mpage_da_data *mpd, struct folio *folio)
>>          */
>>         size = i_size_read(mpd->inode);
>>         len = folio_size(folio);
>> -       if (folio_pos(folio) + len > size &&
>> +       if ((folio_pos(folio) >= size || (folio_pos(folio) + len > size)) &&
>>             !ext4_verity_in_progress(mpd->inode))
>> -               len = size & ~PAGE_MASK;
>> +               len = size & (len - 1);
>>         err = ext4_bio_write_folio(&mpd->io_submit, folio, len);
>>         if (!err)
>>                 mpd->wbc->nr_to_write--;
>> @@ -2329,9 +2329,9 @@ static int mpage_journal_page_buffers(handle_t *handle,
>>         folio_clear_checked(folio);
>>         mpd->wbc->nr_to_write--;
>>
>> -       if (folio_pos(folio) + len > size &&
>> +       if ((folio_pos(folio) >= size || (folio_pos(folio) + len > size)) &&
>>             !ext4_verity_in_progress(inode))
>> -               len = size - folio_pos(folio);
>> +               len = size & (len - 1);
>>
>>         return ext4_journal_folio_buffers(handle, folio, len);
>>  }
>>
>>
>> I will give it some more thoughts and testing.
>
> Why should ext4 be different from other filesystems which simply do:
>
> 	if (folio_pos(folio) >= size)
> 		return 0;

Yes, this case was bothering me and I was just thinking of this case. So,
since folio_pos(folio) starts at some pagesize boundary, then anyways
truncate will remove the entire page. So we need not bother about
writing this out.

Also should we just reduce nr_to_write because we could have written
that page-out but we know truncate is anyway going to remove it?

So the code should look like this then?

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 43be684dabcb..976e84507236 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1840,7 +1840,7 @@ static int mpage_submit_folio(struct mpage_da_data *mpd, struct folio *folio)
 {
        size_t len;
        loff_t size;
-       int err;
+       int err = 0;

        BUG_ON(folio->index != mpd->first_page);
        folio_clear_dirty_for_io(folio);
@@ -1859,10 +1859,19 @@ static int mpage_submit_folio(struct mpage_da_data *mpd, struct folio *folio)
         */
        size = i_size_read(mpd->inode);
        len = folio_size(folio);
+
+       /*
+        * Truncate should take care of truncating the entire folio anyways.
+        * So don't bother writing it out.
+        */
+       if (folio_pos(folio) >= size)
+               goto out;
+
        if (folio_pos(folio) + len > size &&
            !ext4_verity_in_progress(mpd->inode))
-               len = size & ~PAGE_MASK;
+               len = size & (len - 1);
        err = ext4_bio_write_folio(&mpd->io_submit, folio, len);
+out:
        if (!err)
                mpd->wbc->nr_to_write--;

@@ -2329,9 +2338,16 @@ static int mpage_journal_page_buffers(handle_t *handle,
        folio_clear_checked(folio);
        mpd->wbc->nr_to_write--;

+       /*
+        * Truncate should take care of truncating the entire folio anyways.
+        * So don't bother writing it out.
+        */
+       if (folio_pos(folio) >= size)
+               return 0;
+
        if (folio_pos(folio) + len > size &&
            !ext4_verity_in_progress(inode))
-               len = size - folio_pos(folio);
+               len = size & (len - 1);

        return ext4_journal_folio_buffers(handle, folio, len);
 }


I will have to read more on returning 0 from
mpage_journal_page_buffers() function to make sure we don't need any
special handling for folio in data=journal mode.

-ritesh
diff mbox series

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce5f21b6c2b3..b9fa7c30a9a5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1885,7 +1885,7 @@  static int mpage_submit_folio(struct mpage_da_data *mpd, struct folio *folio)
 	len = folio_size(folio);
 	if (folio_pos(folio) + len > size &&
 	    !ext4_verity_in_progress(mpd->inode))
-		len = size & ~PAGE_MASK;
+		len = size - folio_pos(folio);
 	err = ext4_bio_write_folio(&mpd->io_submit, folio, len);
 	if (!err)
 		mpd->wbc->nr_to_write--;