
[v3,4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint

Message ID 20230606135928.434610-5-yi.zhang@huaweicloud.com
State Awaiting Upstream
Series jbd2: fix several checkpoint inconsistent issues

Commit Message

Zhang Yi June 6, 2023, 1:59 p.m. UTC
From: Zhihao Cheng <chengzhihao1@huawei.com>

The following process,

jbd2_journal_commit_transaction
// there are several dirty buffer heads in transaction->t_checkpoint_list
          P1                   wb_workfn
jbd2_log_do_checkpoint
 if (buffer_locked(bh)) // false
                            __block_write_full_page
                             trylock_buffer(bh)
                             test_clear_buffer_dirty(bh)
 if (!buffer_dirty(bh))
  __jbd2_journal_remove_checkpoint(jh)
   if (buffer_write_io_error(bh)) // false
                             >> bh IO error occurs <<
 jbd2_cleanup_journal_tail
  __jbd2_update_log_tail
   jbd2_write_superblock
   // The bh won't be replayed in next mount.
, which could corrupt the ext4 image; a reproducer can be fetched from [Link].

Since the writeback process clears a buffer's dirty bit after locking
the buffer head, we can fix this by trylocking the buffer and checking
its dirtiness while the buffer is locked: the buffer head can be
removed only if it is neither locked nor dirty (a condensed sketch of
the reordered check follows the diffstat below).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)
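In short, the reordered check in jbd2_log_do_checkpoint() looks roughly
like this (condensed and simplified from the diff at the end of this
page; the surrounding loop and the write-out path are omitted):

    if (!trylock_buffer(bh)) {
            /* Locked: it may be under writeback or being reused by
             * another transaction, so wait for it and re-check. */
            get_bh(bh);
            spin_unlock(&journal->j_list_lock);
            wait_on_buffer(bh);
            __brelse(bh);
            goto retry;
    } else if (!buffer_dirty(bh)) {
            /* Clean and not under I/O: any earlier writeback has
             * completed, so a write error (if any) is already visible
             * to __jbd2_journal_remove_checkpoint(). */
            unlock_buffer(bh);
            __jbd2_journal_remove_checkpoint(jh);
    } else {
            /* Still dirty: fall through and write it out. */
            unlock_buffer(bh);
    }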

Comments

Theodore Ts'o June 13, 2023, 4:31 a.m. UTC | #1
There is something about this patch which is causing test runs to hang
when running "gce-xfstests -c ext4/adv -C 10 generic/475" at least
60-70% of the time.

When I took a closer look, the problem seems to be e2fsck is hanging
after a SEGV when running e2fsck -nf on the block device.  This then
causes the check script to hang, until the test appliance's safety
timer triggers and forces a shutdown of the test VM and aborts the
test run.

The cause of the hang is clearly an e2fsprogs bug --- no matter how
corrupted the file system is, e2fsck should never crash or hang.  So
something is clearly going wrong with e2fsck:

    ...
    Symlink /p1/dc/d14/dee/l154 (inode #2898) is invalid.
    Clear? no

    Entry 'l154' in /p1/dc/d14/dee (2753) has an incorrect filetype (was 7, should be 0).
    Fix? no

    corrupted size vs. prev_size
    Signal (6) SIGABRT si_code=SI_TKILL 

    (Note: "corrupted size vs. prev_size" is issued by glibc when
    malloc's internal data structures have been corrupted.  So
    there is definitely something going very wrong with e2fsck.)
    
That being said, if I run the same test on the parent commit (patch
3/6, jbd2: remove journal_clean_one_cp_list()), e2fsck does *not* hang
or crash, and the regression tests complete.  So this patch is
changing the behavior of the kernel in terms of the file system that
is left behind after a large number of injected I/O errors.

My plan therefore is to drop patches 4/6 through 6/6 of this patch
series.  This will allow at least the "long standing metadata
corruption issue that happens from time to time" to be addressed, and
it will give us time to study what's going on here in more detail.
I've captured the compressed file system image which is causing e2fsck
(version 1.47.0) to corrupt malloc's data structures, and I'll try to
see what Address Sanitizer or valgrind shows about what's going on.

Looking at the patch, it looks pretty innocuous, and I don't
understand how this could be making a significant enough difference
that it's causing e2fsck, which had previously been working fine, to
now start tossing its cookies.  If you could double check the patch
and see if you spot anything that I might have missed in my code
review, I'd really appreciate it.

Thanks,

					- Ted
Zhihao Cheng June 13, 2023, 8:13 a.m. UTC | #2
On 2023/6/13 12:31, Theodore Ts'o wrote:
> There is something about this patch which is causing test runs to hang
> when running "gce-xfstests -c ext4/adv -C 10 generic/475" at least
> 60-70% of the time.
> 
> When I took a closer look, the problem seems to be e2fsck is hanging
> after a SEGV when running e2fsck -nf on the block device.  This then
> causes the check script to hang, until the test appliance's safety
> timer triggers and forces a shutdown of the test VM and aborts the
> test run.
> 
> The cause of the hang is clearly an e2fsprogs bug --- no matter how
> corrupted the file system is, e2fsck should never crash or hang.  So
> something is clearly going wrong with e2fsck:
> 
>      ...
>      Symlink /p1/dc/d14/dee/l154 (inode #2898) is invalid.
>      Clear? no
> 
>      Entry 'l154' in /p1/dc/d14/dee (2753) has an incorrect filetype (was 7, should be 0).
>      Fix? no
> 
>      corrupted size vs. prev_size
>      Signal (6) SIGABRT si_code=SI_TKILL
> 
>      (Note: "corrutped size vs prev_size" is issued by glibc when
>      malloc's internal data structures have been corrupted.  So
>      there is definitely something going very wrong with e2fsck.)
>      
> That being said, if I run the same test on the parent commit (patch
> 3/6, jbd2: remove journal_clean_one_cp_list()), e2fsck does *not* hang
> or crash, and the regression tests complete.  So this patch is
> changing the behavior of the kernel in terms of the file system that
> is left behind after a large number of injected I/O errors.
> 
> My plan therefore is to drop patches 4/6 through 6/6 of this patch
> series.  This will allow at least the "long standing metadata
> corruption issue that happens from to time" to be addressed, and it
> will give us time study what's going on here in more detail.  I've
> captured the compressed file system image which is causing e2fsck
> (version 1.47.0) to corrupt malloc's data structure, and I'll try see
> what using Address Sanitizer or valgrind show about what's going on.
> 

Hi Ted, I tried to run './check generic/475' for many rounds (e2fsprogs
1.47.0, 5-Feb-2023), and I cannot reproduce the problem with this patch
applied. Could you send me a compressed image which triggers the
problem with 'fsck -fn'?

I agree we should get to the bottom of the problem before applying this
patch.

> Looking at the patch, it looks pretty innocuous, and I don't
> understand how this could be making a significant enough difference
> that it's causing e2fsck, which had previously been working fine, to
> now start tossing its cookies.  If you could double check the patch
> and see you see anything that I might have missed in my code review,
> I'd really appreciate it.
> 
> Thanks,
> 
> 					- Ted
> 
> .
>
Theodore Ts'o June 13, 2023, 5:27 p.m. UTC | #3
On Tue, Jun 13, 2023 at 04:13:06PM +0800, Zhihao Cheng wrote:
> 
> Hi Ted, I tried to run './check generic/475' many rounds(1.47.0,
> 5-Feb-2023), and I cannot reproduce the problem with this patch.

What file system configuration (e.g., mke2fs options) were you using
when you ran generic/475?  I reproduced the problem with
CONFIG_LOCKDEP enabled, with the ext4/adv configuration[1], which
means that the file system was created using "mke2fs -t ext4 -O
inline_data,fast_commit".  The size of the test file system was 5 GiB.

[1] https://github.com/tytso/xfstests-bld/blob/master/test-appliance/files/root/fs/ext4/cfg/adv

At this point, it looks like the problem is timing specific.  When I
built a kernel at patch 3/6 of your patch series, I was no longer able
to trigger the failure using the CONFIG_LOCKDEP kernel --- specifically,
using a kernel config generated using install-kconfig[2] with the
--lockdep command-line option.

[2] https://github.com/tytso/xfstests-bld/blob/master/kernel-build/install-kconfig

However, when I built a kernel config without --lockdep, I was able to
trigger the problem for both the ext4/adv and the ext4/ext3[3]
file system test scenarios.  That is, doing a full regression test
run using "gce-xfstests ltm -c ext4/all -g auto", the VMs for the
ext4/adv and ext4/ext3 configurations both hung while running
generic/475.  And using a non-lockdep kernel, the command
"gce-xfstests -c ext4/adv,ext4/ext3 generic/475" would hang.  I ran
this command twice, to make sure there were no timing-related false
negatives; once it hung while running generic/475 for ext4/adv, and
once while running it for ext4/ext3:

% gce-xfstests ls -l
  ...
xfstests-tytso-20230613115748 34.172.36.63 - 6.4.0-rc5-xfstests-00057-ge86f802ab8d4 - 12:07 ext4/adv generic/475 - RUNNING
xfstests-tytso-20230613115802 34.133.66.61 - 6.4.0-rc5-xfstests-00057-ge86f802ab8d4 - 12:06 ext4/ext3 generic/475 - RUNNING

[3] https://github.com/tytso/xfstests-bld/blob/master/test-appliance/files/root/fs/ext4/cfg/ext3

Furthermore, when I rewound the git repo to just before this patch
series (which is currently at the end of the dev branch), the full
regression test suite ("-c ext4/all -g all") and the more specific
test run ("-C 5 -c ext4/adv,ext4/ext3 generic/475") did not hang.  I
am currently doing another bisect run using a non-lockdep kernel to
see if I can get more detail.

> Could you send me a compressed image which can trigger the problem
> with 'fsck -fn'?

Sure.  I'll have to send that under a separate e-mail message, since
it's 15 megabytes.  It was created using "dd if=/dev/mapper/xt-vdc |
gzip -9 > broken-image-which-causes-e2fsck-to-segv.gz".
Unfortunately, I was not able to create a metadata-only dump because
the filesystem was too corrupted.  An attempt to run "e2image -Q
/dev/mapper/xt-vdc broken.qcow2" failed with:

e2image 1.47.0 (5-Feb-2023)
e2image: Corrupt extent header while iterating over inode 6016

I was able to run e2fsck compiled with clang's asan enabled, and
here's the ASAN report (this is against the master branch in
e2fsprogs's git repo, so it's a bit ahead of 1.47.0):

e2fsck 1.47.0 (5-Feb-2023)
=================================================================
==25033==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x625000009900 at pc 0x564bbf8ae405 bp 0x7ffdd82bf0e0 sp 0x7ffdd82be8b0
WRITE of size 4096 at 0x625000009900 thread T0
    #0 0x564bbf8ae404 in pread64 (/build/e2fsprogs-asan/e2fsck/e2fsck+0x24a404) (BuildId: e291b1c8655954ec4293b8635a561dc29c81a785)
    #1 0x564bbfd47532 in raw_read_blk /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/unix_io.c:240:12
    #2 0x564bbfd3965b in unix_read_blk64 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/unix_io.c:1079:17
    #3 0x564bbfc6cecd in io_channel_read_blk64 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/io_manager.c:78:10
    #4 0x564bbf9a3791 in e2fsck_pass1_check_symlink /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass1.c:241:7
    #5 0x564bbfa28a9f in e2fsck_process_bad_inode /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:1990:8
    #6 0x564bbfa21e9c in check_dir_block /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:1525:8
    #7 0x564bbfa18a3b in check_dir_block2 /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:1034:8
    #8 0x564bbfb9ee4a in ext2fs_dblist_iterate3 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/dblist.c:216:9
    #9 0x564bbfb9ef79 in ext2fs_dblist_iterate2 /usr/projects/e2fsprogs/e2fsprogs/lib/ext2fs/dblist.c:229:9
    #10 0x564bbfa14529 in e2fsck_pass2 /usr/projects/e2fsprogs/e2fsprogs/e2fsck/pass2.c:190:20
    #11 0x564bbf980660 in e2fsck_run /usr/projects/e2fsprogs/e2fsprogs/e2fsck/e2fsck.c:262:3
    #12 0x564bbf95c774 in main /usr/projects/e2fsprogs/e2fsprogs/e2fsck/unix.c:1931:15
    #13 0x7f65d4e4e189 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #14 0x7f65d4e4e244 in __libc_start_main csu/../csu/libc-start.c:381:3
    #15 0x564bbf892840 in _start (/build/e2fsprogs-asan/e2fsck/e2fsck+0x22e840) (BuildId: e291b1c8655954ec4293b8635a561dc29c81a785)

I'm still digging into finding the root cause; I'll let you know if I
find more.

Cheers,

						- Ted
Theodore Ts'o June 14, 2023, 5:42 a.m. UTC | #4
OK, some more updates.  First of all, the e2fsck hang in the ext4/adv
case is an inline_data bug in e2fsck/pass2.c:check_dir_block(); the
code is clearly buggy, and I'll be sending out a fix in the next day
or two.

I still don't understand why this patch series is changing the kernel
behaviour enough to change the resulting file system in such a way as
to unmask this bug.  The bug is triggered by file system corruption,
so the question is whether this patch series is somehow causing the
file system to be more corrupted than it otherwise would be.  I'm not
sure.

However, the ext4/ext3 hang *is* a real hang in the kernel space, and
generic/475 is not completing because the kernel seems to have ended
up deadlocking somehow.  With just the first patch in this patch
series ("jbd2: recheck chechpointing non-dirty buffer") we're getting
a kernel NULL pointer dereference:

[  310.447568] EXT4-fs error (device dm-7): ext4_check_bdev_write_error:223: comm fsstress: Error while async write back metadata
[  310.458038] EXT4-fs error (device dm-7): __ext4_get_inode_loc_noinmem:4467: inode #99400: block 393286: comm fsstress: unable to read itable block
[  310.458421] JBD2: IO error reading journal superblock
[  310.484755] EXT4-fs warning (device dm-7): ext4_end_bio:343: I/O error 10 writing to inode 36066 starting block 19083)
[  310.490956] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  310.490959] #PF: supervisor write access in kernel mode
[  310.490961] #PF: error_code(0x0002) - not-present page
[  310.490963] PGD 0 P4D 0 
[  310.490966] Oops: 0002 [#1] PREEMPT SMP PTI
[  310.490970] CPU: 1 PID: 15600 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-00055-gd3ab1bca26b4 #190
[  310.490974] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
[  310.490976] RIP: 0010:jbd2_journal_set_features+0x13d/0x430
[  310.490985] Code: 0f 94 c0 44 20 e8 0f 85 e0 00 00 00 be 94 01 00 00 48 c7 c7 a1 33 59 b4 48 89 0c 24 4c 8b 7d 38 e8 a8 dc c5 ff 2e 2e 2e 31 c0 <f0> 49 0f ba 2f 02 48 8b 0c 24 0f 82 24 02 00 00 45 84 ed 8b 41 28
[  310.490988] RSP: 0018:ffffb9b441043b30 EFLAGS: 00010246
[  310.490990] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8edb447b8100
[  310.490993] RDX: 0000000000000000 RSI: 0000000000000194 RDI: ffffffffb45933a1
[  310.490994] RBP: ffff8edb45a62800 R08: ffffffffb460d6c0 R09: 0000000000000000
[  310.490996] R10: 204f49203a324442 R11: 4f49203a3244424a R12: 0000000000000000
[  310.490997] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[  310.490999] FS:  00007f2940cca740(0000) GS:ffff8edc19500000(0000) knlGS:0000000000000000
[  310.491005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  310.491007] CR2: 0000000000000000 CR3: 000000012543e003 CR4: 00000000003706e0
[  310.491009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  310.491011] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  310.491012] Call Trace:
[  310.491016]  <TASK>
[  310.491019]  ? __die+0x23/0x60
[  310.491025]  ? page_fault_oops+0xa4/0x170
[  310.491029]  ? exc_page_fault+0x67/0x170
[  310.491032]  ? asm_exc_page_fault+0x26/0x30
[  310.491039]  ? jbd2_journal_set_features+0x13d/0x430
[  310.491043]  jbd2_journal_revoke+0x47/0x1e0
[  310.491046]  __ext4_forget+0xc3/0x1b0
[  310.491051]  ext4_free_blocks+0x214/0x2f0
[  310.491056]  ext4_free_branches+0xeb/0x270
[  310.491061]  ext4_ind_truncate+0x2bf/0x320
[  310.491065]  ext4_truncate+0x1e4/0x490
[  310.491069]  ext4_handle_inode_extension+0x1bd/0x2a0
[  310.491073]  ? iomap_dio_complete+0xaf/0x1d0
[  310.511141] ------------[ cut here ]------------
[  310.516121]  ext4_dio_write_iter+0x346/0x3e0
[  310.516132]  ? __handle_mm_fault+0x171/0x200
[  310.516135]  vfs_write+0x21a/0x3e0
[  310.516140]  ksys_write+0x6f/0xf0
[  310.516142]  do_syscall_64+0x3b/0x90
[  310.516147]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  310.516154] RIP: 0033:0x7f2940eb2fb3
[  310.516158] Code: 75 05 48 83 c4 58 c3 e8 cb 41 ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
[  310.516161] RSP: 002b:00007ffe9a322cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  310.516165] RAX: ffffffffffffffda RBX: 0000000000003000 RCX: 00007f2940eb2fb3
[  310.516167] RDX: 0000000000003000 RSI: 0000556ba1e31000 RDI: 0000000000000003
[  310.516168] RBP: 0000000000000003 R08: 0000556ba1e31000 R09: 00007f2940e9bbe0
[  310.516170] R10: 0000556b9fedbf59 R11: 0000000000000246 R12: 0000000000000024
[  310.516172] R13: 00000000000cf000 R14: 0000556ba1e31000 R15: 0000000000000000
[  310.516174]  </TASK>
[  310.516178] CR2: 0000000000000000
[  310.516181] ---[ end trace 0000000000000000 ]---

This is then causing fsstress to wedge:

# ps -ax -o pid,user,wchan:20,args --sort pid
    PID USER     WCHAN                COMMAND
	...
  12860 root     do_wait              /bin/bash /root/xfstests/tests/generic/475
  13086 root     rescuer_thread       [kdmflush/253:7]
  15593 root     rescuer_thread       [ext4-rsv-conver]
  15598 root     jbd2_log_wait_commit ./ltp/fsstress -d /xt-vdc -n 999999 -p 4
  15600 root     ext4_release_file    [fsstress]
  15601 root     exit_aio             [fsstress]

So at this point, I'm going to drop this entire patch series from the
dev tree, since this *does* seem to be some kind of regression
triggered by the first patch in the patch series.

Regards,

					- Ted
Zhang Yi June 14, 2023, 1:25 p.m. UTC | #5
On 2023/6/14 13:42, Theodore Ts'o wrote:
> OK, some more updates.  First of all, the e2fsck hang in the ext4/adv
> case is an inline_data bug in e2fsck/pass2.c:check_dir_block(); the
> code is clearly buggy, and I'll be sending out a fix in the next day
> or two.
> 
> I still don't understand why this patch series is changing the kernel
> behaviour enough to change the resulting file system in such a way as
> to unmask this bug.  The bug is triggered by file system corruption,
> so the question is whether this patch series is somehow causing the
> file system to be more corrupted than it otherwise would be.  I'm not
> sure.
> 
> However, the ext4/ext3 hang *is* a real hang in the kernel space, and
> generic/475 is not completing because the kernel seems to have ended
> up deadlocking somehow.  With just the first patch in this patch
> series ("jbd2: recheck chechpointing non-dirty buffer") we're getting
> a kernel NULL pointer derefence:
> 
> [  310.447568] EXT4-fs error (device dm-7): ext4_check_bdev_write_error:223: comm fsstress: Error while async write back metadata
> [  310.458038] EXT4-fs error (device dm-7): __ext4_get_inode_loc_noinmem:4467: inode #99400: block 393286: comm fsstress: unable to read itable block
> [  310.458421] JBD2: IO error reading journal superblock
> [  310.484755] EXT4-fs warning (device dm-7): ext4_end_bio:343: I/O error 10 writing to inode 36066 starting block 19083)
> [  310.490956] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [  310.490959] #PF: supervisor write access in kernel mode
> [  310.490961] #PF: error_code(0x0002) - not-present page
> [  310.490963] PGD 0 P4D 0 
> [  310.490966] Oops: 0002 [#1] PREEMPT SMP PTI
> [  310.490970] CPU: 1 PID: 15600 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-00055-gd3ab1bca26b4 #190
> [  310.490974] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
> [  310.490976] RIP: 0010:jbd2_journal_set_features+0x13d/0x430
> [  310.490985] Code: 0f 94 c0 44 20 e8 0f 85 e0 00 00 00 be 94 01 00 00 48 c7 c7 a1 33 59 b4 48 89 0c 24 4c 8b 7d 38 e8 a8 dc c5 ff 2e 2e 2e 31 c0 <f0> 49 0f ba 2f 02 48 8b 0c 24 0f 82 24 02 00 00 45 84 ed 8b 41 28
> [  310.490988] RSP: 0018:ffffb9b441043b30 EFLAGS: 00010246
> [  310.490990] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8edb447b8100
> [  310.490993] RDX: 0000000000000000 RSI: 0000000000000194 RDI: ffffffffb45933a1
> [  310.490994] RBP: ffff8edb45a62800 R08: ffffffffb460d6c0 R09: 0000000000000000
> [  310.490996] R10: 204f49203a324442 R11: 4f49203a3244424a R12: 0000000000000000
> [  310.490997] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> [  310.490999] FS:  00007f2940cca740(0000) GS:ffff8edc19500000(0000) knlGS:0000000000000000
> [  310.491005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  310.491007] CR2: 0000000000000000 CR3: 000000012543e003 CR4: 00000000003706e0
> [  310.491009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  310.491011] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  310.491012] Call Trace:
> [  310.491016]  <TASK>
> [  310.491019]  ? __die+0x23/0x60
> [  310.491025]  ? page_fault_oops+0xa4/0x170
> [  310.491029]  ? exc_page_fault+0x67/0x170
> [  310.491032]  ? asm_exc_page_fault+0x26/0x30
> [  310.491039]  ? jbd2_journal_set_features+0x13d/0x430
> [  310.491043]  jbd2_journal_revoke+0x47/0x1e0
> [  310.491046]  __ext4_forget+0xc3/0x1b0
> [  310.491051]  ext4_free_blocks+0x214/0x2f0
> [  310.491056]  ext4_free_branches+0xeb/0x270
> [  310.491061]  ext4_ind_truncate+0x2bf/0x320
> [  310.491065]  ext4_truncate+0x1e4/0x490
> [  310.491069]  ext4_handle_inode_extension+0x1bd/0x2a0
> [  310.491073]  ? iomap_dio_complete+0xaf/0x1d0
> [  310.511141] ------------[ cut here ]------------
> [  310.516121]  ext4_dio_write_iter+0x346/0x3e0
> [  310.516132]  ? __handle_mm_fault+0x171/0x200
> [  310.516135]  vfs_write+0x21a/0x3e0
> [  310.516140]  ksys_write+0x6f/0xf0
> [  310.516142]  do_syscall_64+0x3b/0x90
> [  310.516147]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [  310.516154] RIP: 0033:0x7f2940eb2fb3
> [  310.516158] Code: 75 05 48 83 c4 58 c3 e8 cb 41 ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
> [  310.516161] RSP: 002b:00007ffe9a322cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  310.516165] RAX: ffffffffffffffda RBX: 0000000000003000 RCX: 00007f2940eb2fb3
> [  310.516167] RDX: 0000000000003000 RSI: 0000556ba1e31000 RDI: 0000000000000003
> [  310.516168] RBP: 0000000000000003 R08: 0000556ba1e31000 R09: 00007f2940e9bbe0
> [  310.516170] R10: 0000556b9fedbf59 R11: 0000000000000246 R12: 0000000000000024
> [  310.516172] R13: 00000000000cf000 R14: 0000556ba1e31000 R15: 0000000000000000
> [  310.516174]  </TASK>
> [  310.516178] CR2: 0000000000000000
> [  310.516181] ---[ end trace 0000000000000000 ]---
> 

Sorry about the regression. I found that this issue is not introduced
by the first patch in this patch series ("jbd2: recheck chechpointing
non-dirty buffer"), but by d9eafe0afafa ("jbd2: factor out journal
initialization from journal_get_superblock()") [1] on your dev branch.

The problem is that the journal superblock failed to be written out
due to an injected I/O fault; its uptodate bit was cleared by
end_buffer_write_sync() and had not yet been reset in
jbd2_write_superblock(). This raced with jbd2_journal_revoke()->
jbd2_journal_set_features()->jbd2_journal_check_used_features()->
journal_get_superblock()->bh_read(); unfortunately, the read I/O also
failed, so the error handling in journal_fail_superblock() cleared
journal->j_sb_buffer, finally leading to the NULL pointer dereference
above.

I think the fix could be to simply move the buffer_verified(bh) check
in front of bh_read(). I can send out the fix after testing.
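A rough sketch of that idea (my illustration only, not the actual fix;
the real journal_get_superblock() in fs/jbd2/journal.c does more
validation than shown here):

    static int journal_get_superblock(journal_t *journal)
    {
            struct buffer_head *bh = journal->j_sb_buffer;
            int err;

            /*
             * If the superblock has already been read and verified,
             * don't read it from disk again.  Otherwise a failed
             * superblock write (which clears the uptodate bit) forces
             * a re-read here, and if that read also fails,
             * journal_fail_superblock() tears down journal->j_sb_buffer
             * while other paths may still dereference it.
             */
            if (buffer_verified(bh))
                    return 0;

            err = bh_read(bh, 0);
            if (err < 0) {
                    printk(KERN_ERR "JBD2: IO error reading journal superblock\n");
                    journal_fail_superblock(journal);
                    return err;
            }

            /* ... validate magic, block size, checksum, then: */
            set_buffer_verified(bh);
            return 0;
    }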

[1] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=d9eafe0afafaa519953735498c2a065d223c519b

Thanks,
Yi.

> This is then causing fsstress to wedge:
> 
> # ps -ax -o pid,user,wchan:20,args --sort pid
>     PID USER     WCHAN                COMMAND
> 	...
>   12860 root     do_wait              /bin/bash /root/xfstests/tests/generic/475
>   13086 root     rescuer_thread       [kdmflush/253:7]
>   15593 root     rescuer_thread       [ext4-rsv-conver]
>   15598 root     jbd2_log_wait_commit ./ltp/fsstress -d /xt-vdc -n 999999 -p 4
>   15600 root     ext4_release_file    [fsstress]
>   15601 root     exit_aio             [fsstress]
> 
> So at this point, I'm going to drop this entire patch series from the
> dev tree, since this *does* seem to be some kind of regression
> triggered by the first patch in the patch series.
> 
> Regards,
> 
> 					- Ted
>
Theodore Ts'o June 14, 2023, 8:37 p.m. UTC | #6
On Wed, Jun 14, 2023 at 09:25:28PM +0800, Zhang Yi wrote:
> 
> Sorry about the regression, I found that this issue is not introduced
> by the first patch in this patch series ("jbd2: recheck chechpointing
> non-dirty buffer"), is d9eafe0afafa ("jbd2: factor out journal
> initialization from journal_get_superblock()") [1] on your dev branch.
> 
> The problem is the journal super block had been failed to write out
> due to IO fault, it's uptodate bit was cleared by
> end_buffer_write_syn() and didn't reset yet in jbd2_write_superblock().
> And it raced by jbd2_journal_revoke()->jbd2_journal_set_features()->
> jbd2_journal_check_used_features()->journal_get_superblock()->bh_read(),
> unfortunately, the read IO is also fail, so the error handling in
> journal_fail_superblock() clear the journal->j_sb_buffer, finally lead
> to above NULL pointer dereference issue.

Thanks for looking into this.  What I believe you are saying is that
the root cause is that earlier patch, but it is still something about
the patch "jbd2: recheck chechpointing non-dirty buffer" which is
changing the timing enough that we're hitting this bug (because
without the "recheck checkpointing" patch, I'm not seeing the NULL
pointer dereference).

As for the e2fsck bug that was causing the hang in the ext4/adv
test scenario, the patch was a simple one, and I have also checked in
a test case which is a reliable reproducer of the problem.  (See
attached for the patches and more detail.)

It is really interesting that the "recheck checkpointing" patch is
making enough of a difference that it is unmasking these bugs.  If you
could take a look at these changes and perhaps think about how this
patch series could be changing the nature of the corruption
(specifically, how symlink inodes referenced from inline directories
could be corrupted with "recheck checkpointing", thus unmasking the
e2fsprogs bug), I'd really appreciate it.

Thanks,

					- Ted
Zhang Yi June 15, 2023, 3:56 a.m. UTC | #7
On 2023/6/15 4:37, Theodore Ts'o wrote:
> On Wed, Jun 14, 2023 at 09:25:28PM +0800, Zhang Yi wrote:
>>
>> Sorry about the regression, I found that this issue is not introduced
>> by the first patch in this patch series ("jbd2: recheck chechpointing
>> non-dirty buffer"), is d9eafe0afafa ("jbd2: factor out journal
>> initialization from journal_get_superblock()") [1] on your dev branch.
>>
>> The problem is the journal super block had been failed to write out
>> due to IO fault, it's uptodate bit was cleared by
>> end_buffer_write_syn() and didn't reset yet in jbd2_write_superblock().
>> And it raced by jbd2_journal_revoke()->jbd2_journal_set_features()->
>> jbd2_journal_check_used_features()->journal_get_superblock()->bh_read(),
>> unfortunately, the read IO is also fail, so the error handling in
>> journal_fail_superblock() clear the journal->j_sb_buffer, finally lead
>> to above NULL pointer dereference issue.
> 
> Thanks for looking into this.  What I believe you are saying is that
> the root cause is that earlier patch, but it is still something about
> the patch "jbd2: recheck chechpointing non-dirty buffer" which is
> changing the timing enough that we're hitting this buffer (because
> without the "recheck checkpointing" patch, I'm not seeing the NULL
> pointer dereference.

I have sent out a separate patch named "jbd2: skip reading super block
if it has been verified" to fix the above NULL pointer dereference
issue. I have been running ext3 generic/475 for about 12 hours and
have not reproduced the problem again (I will also do more tests
later). Please take a look at it.

> 
> As far as the e2fsck bug that was causing it to hang in the ext4/adv
> test scenario, the patch was a simple one, and I have also checked in
> a test case which was a reliable reproducer of the problem.  (See
> attached for the patches and more detail.)
> 
> It is really interesting that "recheck checkpointing" patch is making
> enough of a difference that it is unmasking these bugs.  If you could
> take a look at these changes and perhaps think about how this patch
> series could be changing the nature of the corruption (specifically,
> how symlink inodes referenced from inline directories could be
> corupted with "rechecking checkpointing", thus unmasking the
> e2fsprogs, I'd really appreciate it.
> 

Sure, we will take a close look at it.

Thanks,
Yi.
Zhang Yi June 26, 2023, 7:36 a.m. UTC | #8
On 2023/6/15 4:37, Theodore Ts'o wrote:
> On Wed, Jun 14, 2023 at 09:25:28PM +0800, Zhang Yi wrote:
>>
>> Sorry about the regression, I found that this issue is not introduced
>> by the first patch in this patch series ("jbd2: recheck chechpointing
>> non-dirty buffer"), is d9eafe0afafa ("jbd2: factor out journal
>> initialization from journal_get_superblock()") [1] on your dev branch.
>>
>> The problem is the journal super block had been failed to write out
>> due to IO fault, it's uptodate bit was cleared by
>> end_buffer_write_syn() and didn't reset yet in jbd2_write_superblock().
>> And it raced by jbd2_journal_revoke()->jbd2_journal_set_features()->
>> jbd2_journal_check_used_features()->journal_get_superblock()->bh_read(),
>> unfortunately, the read IO is also fail, so the error handling in
>> journal_fail_superblock() clear the journal->j_sb_buffer, finally lead
>> to above NULL pointer dereference issue.
> 
> Thanks for looking into this.  What I believe you are saying is that
> the root cause is that earlier patch, but it is still something about
> the patch "jbd2: recheck chechpointing non-dirty buffer" which is
> changing the timing enough that we're hitting this buffer (because
> without the "recheck checkpointing" patch, I'm not seeing the NULL
> pointer dereference.
> 
> As far as the e2fsck bug that was causing it to hang in the ext4/adv
> test scenario, the patch was a simple one, and I have also checked in
> a test case which was a reliable reproducer of the problem.  (See
> attached for the patches and more detail.)
> 
> It is really interesting that "recheck checkpointing" patch is making
> enough of a difference that it is unmasking these bugs.  If you could
> take a look at these changes and perhaps think about how this patch
> series could be changing the nature of the corruption (specifically,
> how symlink inodes referenced from inline directories could be
> corupted with "rechecking checkpointing", thus unmasking the
> e2fsprogs, I'd really appreciate it.
> 

Hello Ted.

I have found the root cause of what triggers the e2fsck bug. It is a
real kernel bug which was introduced in 5b849b5f96b4 ("jbd2: fast
commit recovery path") when the fast commit feature was merged.

The problem is that when fast commit is enabled, the kernel doesn't
replay the journal completely at mount time (there is a bug in
do_one_pass(), see below for details) if the unrecovered transactions
wrap back to the head of the normal journal area. If the missing
transactions contain a revoke record for a symlink block and a reuse
record of the same block, and the reused block has been written back
to disk before the last umount, the incomplete journal replay can
leave a "Symlink xxx is invalid" corruption behind. If that symlink
also belongs to an inline directory, it triggers the e2fsck bug.

For example, suppose we have the uncleanly-unmounted image below.


 | normal journal area                             | fast commit area |
 +-------------------------------------------------+------------------+
 | tX+1(rear half)|tX+2|...| tX | tX+1(front half) |       ....       |
 +-------------------------------------------------+------------------+
                       /        /                  /                  /
             (real head)  s_start    journal->j_last journal->j_fc_last

Transaction X (checkpointed):
 - Create symlink 'foo' (uses block A, containing a 500-byte link name)
   in inline directory 'dir';
Transaction X+1 (uncheckpointed):
 - Remove symlink 'foo' (block A is also freed);
Transaction X+2 (uncheckpointed):
 - Create symlink 'bar' (reuses block A, containing a 400-byte link
   name).

The above transactions are recorded in the journal, and block A is
also successfully written back by the background writeback process.

If the fast_commit feature is enabled, journal->j_last points to the
first unused block behind the normal journal area instead of the end
of the whole log area, and journal->j_fc_last points to the first
unused block behind the fast_commit journal area. While doing journal
recovery, do_one_pass(PASS_SCAN) should first scan tX+1 in the normal
journal area and turn around to the first block once it meets
journal->j_last, but the wrap() macro misuses journal->j_fc_last, so
it cannot read tX+1's commit block; the recovery mistakenly quits
early and misses tX+1 and tX+2. So, after we mount the filesystem, we
are left with an invalid symlink 'foo' in the inline directory, which
triggers the issue.

I can reproduce this issue either with or without this patch series
("jbd2: recheck chechpointing non-dirty buffer"); sometimes it takes
longer, sometimes less. It's easy to reproduce the "Symlink xxx is
invalid" issue, but it's a little harder to trigger the e2fsck bug
(which requires the invalid symlink to be in an inline directory).
In particular, I could reproduce the fast_commit recovery bug 100% of
the time when running generic/475. So I couldn't find any relation
between this issue and this patch series.

I've sent a patch to fix the fast commit bug separately[1]. I cannot
reproduce this issue again with that patch applied; please take a look
at it.

[1] https://lore.kernel.org/linux-ext4/20230626073322.3956567-1-yi.zhang@huaweicloud.com/T/#u

Thanks,
Yi.

Patch

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index b94f847960c2..42b34cab64fb 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -204,20 +204,6 @@  int jbd2_log_do_checkpoint(journal_t *journal)
 		jh = transaction->t_checkpoint_list;
 		bh = jh2bh(jh);
 
-		/*
-		 * The buffer may be writing back, or flushing out in the
-		 * last couple of cycles, or re-adding into a new transaction,
-		 * need to check it again until it's unlocked.
-		 */
-		if (buffer_locked(bh)) {
-			get_bh(bh);
-			spin_unlock(&journal->j_list_lock);
-			wait_on_buffer(bh);
-			/* the journal_head may have gone by now */
-			BUFFER_TRACE(bh, "brelse");
-			__brelse(bh);
-			goto retry;
-		}
 		if (jh->b_transaction != NULL) {
 			transaction_t *t = jh->b_transaction;
 			tid_t tid = t->t_tid;
@@ -252,7 +238,22 @@  int jbd2_log_do_checkpoint(journal_t *journal)
 			spin_lock(&journal->j_list_lock);
 			goto restart;
 		}
-		if (!buffer_dirty(bh)) {
+		if (!trylock_buffer(bh)) {
+			/*
+			 * The buffer is locked, it may be writing back, or
+			 * flushing out in the last couple of cycles, or
+			 * re-adding into a new transaction, need to check
+			 * it again until it's unlocked.
+			 */
+			get_bh(bh);
+			spin_unlock(&journal->j_list_lock);
+			wait_on_buffer(bh);
+			/* the journal_head may have gone by now */
+			BUFFER_TRACE(bh, "brelse");
+			__brelse(bh);
+			goto retry;
+		} else if (!buffer_dirty(bh)) {
+			unlock_buffer(bh);
 			BUFFER_TRACE(bh, "remove from checkpoint");
 			/*
 			 * If the transaction was released or the checkpoint
@@ -262,6 +263,7 @@  int jbd2_log_do_checkpoint(journal_t *journal)
 			    !transaction->t_checkpoint_list)
 				goto out;
 		} else {
+			unlock_buffer(bh);
 			/*
 			 * We are about to write the buffer, it could be
 			 * raced by some other transaction shrink or buffer