
e2fsck: do not skip deeper checkers when s_last_orphan list has truncated inodes

Message ID 647cc60d-18d4-ab53-6c91-52c1f6d29c3a@huawei.com
State Rejected

Commit Message

zhanchengbin March 15, 2022, 8:01 a.m. UTC
If the system crashes while a file is being truncated, we are left with a
problematic inode that is on the fs->super->s_last_orphan list.
When we run `e2fsck -a img`, the s_last_orphan list is traversed and its
entries are released.
During this pass, orphan inodes with i_links_count == 0 are deleted, while
orphan inodes with i_links_count != 0 (e.g. the truncated inode) are not
deleted. However, even when the list contains orphan inodes with
i_links_count != 0, EXT2_VALID_FS is still set in fs->super->s_state, so the
deeper checks are skipped and some inconsistencies are left behind.
This patch clears the EXT2_VALID_FS flag when an orphan inode with
i_links_count != 0 is found, so that the deeper checks are run.
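
The branch described above can be modelled in isolation. The following is only a sketch: `struct sketch_orphan_inode` and `sketch_release_orphan()` are hypothetical names that reduce the real inode and function to the two fields involved, and only the EXT2_VALID_FS value is taken from ext2_fs.h.

```c
#include <stdint.h>

/* Value of the "cleanly unmounted" state bit, as defined in ext2_fs.h. */
#define EXT2_VALID_FS 0x0001

/* Hypothetical, reduced stand-in for the inode fields involved. */
struct sketch_orphan_inode {
    uint16_t i_links_count; /* 0 for unlinked files, non-zero for truncated ones */
    uint32_t i_dtime;       /* deletion time; 0 means "not deleted" */
};

/*
 * Model of the branch touched by the patch: handle one orphan inode and
 * return the superblock state that results.
 */
static uint16_t sketch_release_orphan(struct sketch_orphan_inode *inode,
                                      uint16_t s_state, uint32_t now)
{
    if (inode->i_links_count == 0) {
        /* Unlinked inode: it is deleted, so record the deletion time. */
        inode->i_dtime = now;
    } else {
        /* Truncated inode: it stays in use ... */
        inode->i_dtime = 0;
        /* ... and, with this patch, the file system is no longer marked valid. */
        s_state = (uint16_t)(s_state & ~EXT2_VALID_FS);
    }
    return s_state;
}
```

With an unlinked inode the returned state keeps EXT2_VALID_FS; with a truncated inode the bit is cleared, which is what makes a later preen run do a full check.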

The problem with truncated files looks like this:
     [root@localhost ~]# e2fsck -a img
     img: recovering journal
     img: Truncating orphaned inode 188 (uid=0, gid=0, mode=0100666, size=0)
     img: Truncating orphaned inode 174 (uid=0, gid=0, mode=0100666, size=0)
     img: clean, 484/128016 files, 118274/512000 blocks
     [root@localhost ~]# e2fsck -fn img
     e2fsck 1.46.5 (30-Dec-2021)
     Pass 1: Checking inodes, blocks, and sizes
     Inode 174, i_blocks is 2, should be 0.  Fix? no

     Inode 188, i_blocks is 2, should be 0.  Fix? no

     Pass 2: Checking directory structure
     Pass 3: Checking directory connectivity
     Pass 4: Checking reference counts
     Pass 5: Checking group summary information

     img: ********** WARNING: Filesystem still has errors **********

     img: 484/128016 files (24.6% non-contiguous), 118274/512000 blocks
     [root@localhost ~]# e2fsck -a img
     img: clean, 484/128016 files, 118274/512000 blocks

However, if `e2fsck -f img` is run first, the EXT2_VALID_FS flag is cleared,
so running `e2fsck -a img` again fixes the problem.

     [root@localhost ~]# e2fsck -f img
     e2fsck 1.46.5 (30-Dec-2021)
     Pass 1: Checking inodes, blocks, and sizes
     Inode 174, i_blocks is 2, should be 0.  Fix<y>? no
     Inode 188, i_blocks is 2, should be 0.  Fix<y>? no
     Pass 2: Checking directory structure
     Pass 3: Checking directory connectivity
     Pass 4: Checking reference counts
     Pass 5: Checking group summary information

     img: ********** WARNING: Filesystem still has errors **********

     img: 484/128016 files (24.6% non-contiguous), 118274/512000 blocks
     [root@localhost ~]# e2fsck -a img
     img was not cleanly unmounted, check forced.
     img: Inode 174, i_blocks is 2, should be 0.  FIXED.
     img: Inode 188, i_blocks is 2, should be 0.  FIXED.
     img: 484/128016 files (24.6% non-contiguous), 118274/512000 blocks
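
The difference between the two transcripts comes down to whether a preen run decides a full check is needed. A simplified sketch of that decision follows; `sketch_full_check_needed()` is a hypothetical name, and the real e2fsck also considers other conditions (for example mount counts and check intervals) that are left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* State bits as defined in ext2_fs.h. */
#define EXT2_VALID_FS 0x0001 /* unmounted cleanly */
#define EXT2_ERROR_FS 0x0002 /* errors detected */

/*
 * Simplified model: returns true when the full set of passes should run,
 * false when the file system would be reported as "clean" and skipped.
 */
static bool sketch_full_check_needed(uint16_t s_state, bool force)
{
    if (force)
        return true;              /* -f always checks */
    if (s_state & EXT2_ERROR_FS)
        return true;              /* errors were recorded */
    if (!(s_state & EXT2_VALID_FS))
        return true;              /* "was not cleanly unmounted, check forced" */
    return false;                 /* "clean": deeper checks are skipped */
}
```

In the first transcript the state still has EXT2_VALID_FS set after orphan processing, so the second case never triggers and the i_blocks errors are missed.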

Signed-off-by: zhanchengbin <zhanchengbin1@huawei.com>
---
  e2fsck/super.c | 1 +
  1 file changed, 1 insertion(+)


Comments

Theodore Ts'o March 15, 2022, 5:54 p.m. UTC | #1
On Tue, Mar 15, 2022 at 04:01:45PM +0800, zhanchengbin wrote:
> If the system crashes when a file is being truncated, we will get a
> problematic inode,
> and it will be added into fs->super->s_last_orphan.
> When we run `e2fsck -a img`, the s_last_orphan list will be traversed and
> deleted.
> During this period, orphan inodes in the s_last_orphan list with
> i_links_count==0 can
> be deleted, and orphan inodes with  i_links_count !=0 (ex. the truncated
> inode)
> cannot be deleted. However, when there are some orphan inodes with
> i_links_count !=0,
> the EXT2_VALID_FS is still assigned to fs->super->s_state, the deeper
> checkers are skipped
> with some inconsistency problems.

That's not supposed to happen.  We regularly put inodes on the orphan
list when they are being truncated so that if we crash, the truncation
operation can be completed as part of the journal recovery and remount
operation.  This is true regardless of whether the recovery is done by
e2fsck or by the kernel.

If a crash during a truncate leads to an inconsistent file system
after the file system is mounted, or after e2fsck does the journal
replay and orphan inode list processing, that's a kernel bug, and we
should fix the bug in the kernel.

Do you have a reliable reproducer for this situation?

Thanks,

						- Ted
zhanchengbin March 18, 2022, 10:14 a.m. UTC | #2
On 2022/3/16 1:54, Theodore Ts'o wrote:
> On Tue, Mar 15, 2022 at 04:01:45PM +0800, zhanchengbin wrote:
>> If the system crashes when a file is being truncated, we will get a
>> problematic inode,
>> and it will be added into fs->super->s_last_orphan.
>> When we run `e2fsck -a img`, the s_last_orphan list will be traversed and
>> deleted.
>> During this period, orphan inodes in the s_last_orphan list with
>> i_links_count==0 can
>> be deleted, and orphan inodes with  i_links_count !=0 (ex. the truncated
>> inode)
>> cannot be deleted. However, when there are some orphan inodes with
>> i_links_count !=0,
>> the EXT2_VALID_FS is still assigned to fs->super->s_state, the deeper
>> checkers are skipped
>> with some inconsistency problems.
> 
> That's not supposed to happen.  We regularly put inodes on the orphan
> list when they are being truncated so that if we crash, the truncation
> operation can be completed as part of the journal recovery and remount
> operation.  This is true regardless of whether the recovery is done by
> e2fsck or by the kernel.

Yes, you are right.
The truncate has completed, and the file ACL has been set to zero in
release_inode_blocks(), but the ACL block was not subtracted from i_blocks,
so i_blocks is inconsistent.
Li Jinlin sent a patch yesterday to fix it.
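
To illustrate the accounting (this is a sketch, not Li Jinlin's patch): i_blocks is normally counted in 512-byte units, so dropping one ACL block on a file system with 1 KiB blocks should reduce it by 2, which would match the "i_blocks is 2, should be 0" messages above. The 1 KiB block size is an assumption about the test image, and `struct sketch_acl_inode` and `sketch_release_acl_block()` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for the inode fields involved. */
struct sketch_acl_inode {
    uint32_t i_blocks;   /* allocated space, in 512-byte units */
    uint32_t i_file_acl; /* block number of the ACL/xattr block, 0 if none */
};

/*
 * Drop the inode's reference to its ACL block and keep i_blocks in step.
 * fs_block_size is the file system block size in bytes.
 */
static void sketch_release_acl_block(struct sketch_acl_inode *inode,
                                     uint32_t fs_block_size)
{
    uint32_t units = fs_block_size / 512; /* 512-byte units per fs block */

    if (inode->i_file_acl == 0)
        return; /* nothing to release */

    inode->i_file_acl = 0;
    /* The step that was missing: subtract the ACL block from i_blocks. */
    inode->i_blocks = (inode->i_blocks >= units) ? inode->i_blocks - units : 0;
}
```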

> 
> If a crash during a truncate leads to an inconsistent file system
> after the file system is mounted, or after e2fsck does the journal
> replay and orphan inode list processing, that's a kernel bug, and we
> should fix the bug in the kernel.
> 
> Do you have a reliable reproducer for this situation?

I have a reproducer, but it is not necessarily reliable:
#!/bin/bash
disk_list=$(multipath -ll | grep filedisk | awk '{print $1}')

for disk in ${disk_list}
do
     mkfs.ext4 -F /dev/mapper/$disk
     mkdir ${disk}
done

function err_inject()
{
     iscsiadm -m node -p 127.0.0.1 -u &> /dev/null
     iscsiadm -m node -p 127.0.0.1 -l &> /dev/null
     sleep 1
     iscsiadm -m node -p 9.82.236.206 -u &> /dev/null
     iscsiadm -m node -p 9.82.236.206 -l &> /dev/null
     sleep 1

     iscsiadm -m node -p 127.0.0.1 -u &> /dev/null
     iscsiadm -m node -p 127.0.0.1 -l &> /dev/null
     iscsiadm -m node -p 9.82.236.206 -u &> /dev/null
     iscsiadm -m node -p 9.82.236.206 -l &> /dev/null
     sleep 1
}



count=0
while true
do
     ((count=count+1))
     for disk in ${disk_list}
     do
         while true
         do
             mount -o data_err=abort,errors=remount-ro /dev/mapper/$disk $disk && break
             sleep 0.1
         done
         nohup fsstress -d $(pwd)/$disk -l 10 -n 1000 -p 10 &>/dev/null &
     done

     sleep 5

     for disk in ${disk_list}
     do
         dm=$(multipath -ll | grep -w $disk | awk '{print $2}')
         aqu_sz=$(iostat -x 1 -d 2 | grep -w $dm | tail -1 | awk '{print $(NF-1)}')
         util=$(iostat -x 1 -d 2 | grep -w $dm | tail -1 | awk '{print $NF}')
         #if [ "${aqu_sz}" == "0.00" -o "$util" == "0.00" ];then
         #    iostat -x 1 -d 2
         #    exit 1
         #fi
         mount | grep $disk | grep '(ro' && exit 1
     done

     err_inject

     while [ -n "`pidof fsstress`" ]
     do
         sleep 1
     done

     for disk in ${disk_list}
     do
         umount $disk
         dm=$(multipath -ll | grep -w $disk | awk '{print $2}')
         aqu_sz=$(iostat -x 1 -d 2 | grep -w $dm | tail -1 | awk '{print $(NF-1)}')
         util=$(iostat -x 1 -d 2 | grep -w $dm | tail -1 | awk '{print $NF}')
         if [ "${aqu_sz}" != "0.00" -o "$util" != "0.00" ];then
             iostat -x 1 -d 2
             exit 1
         fi

         dd bs=1M if=/dev/mapper/$disk of=/root/dockerback

         fsck.ext4 -a /dev/mapper/$disk
             ret=$?
             if [ $ret -ne 0 -a $ret -ne 1 ]; then
                 exit 1
             fi

         fsck.ext4 -fn /dev/mapper/$disk
             ret=$?
             if [ $ret -ne 0 ]; then
                 exit 1
             fi
     done

     if [ $count -gt 5 ];then
         echo 3 > /proc/sys/vm/drop_caches
         sleep 1
         cat /proc/meminfo >> mem.txt
         echo "" >> mem.txt
         slabtop -o >> slab.txt
         echo "" >> slab.txt
         count=0
     fi
done
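
For reference, the two exit-status tests in the script can be written out as follows. The bit values are the ones documented in the e2fsck man page; the macro and function names are hypothetical and exist only for this sketch.

```c
#include <stdbool.h>

/* Exit status bits documented in the e2fsck man page. */
#define SKETCH_FSCK_OK          0 /* no errors */
#define SKETCH_FSCK_CORRECTED   1 /* file system errors corrected */
#define SKETCH_FSCK_UNCORRECTED 4 /* file system errors left uncorrected */

/* Mirrors the test after `fsck.ext4 -a`: only 0 or 1 lets the loop go on. */
static bool sketch_preen_status_ok(int status)
{
    return status == SKETCH_FSCK_OK || status == SKETCH_FSCK_CORRECTED;
}

/* Mirrors the test after `fsck.ext4 -fn`: anything but 0 stops the script. */
static bool sketch_verify_status_ok(int status)
{
    return status == SKETCH_FSCK_OK;
}

/* The status is a sum of conditions, so test the bit rather than equality. */
static bool sketch_errors_left_uncorrected(int status)
{
    return (status & SKETCH_FSCK_UNCORRECTED) != 0;
}
```

In the failing case, the preen run reports the file system as clean, and it is the following read-only forced check that finds errors left uncorrected and makes the script exit.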

> 
> Thanks,
> 
> 						- Ted
> .
>

Patch

diff --git a/e2fsck/super.c b/e2fsck/super.c
index 9495e029..f4a414b7 100644
--- a/e2fsck/super.c
+++ b/e2fsck/super.c
@@ -351,6 +351,7 @@  static int release_orphan_inode(e2fsck_t ctx, ext2_ino_t *ino, char *block_buf)
          inode.i_dtime = ctx->now;
      } else {
          inode.i_dtime = 0;
+        fs->super->s_state &= ~EXT2_VALID_FS;
      }
      e2fsck_write_inode_full(ctx, *ino, EXT2_INODE(&inode),
                  sizeof(inode), "delete_file");