mbox series

[GIT,PULL] ext4 updates for 4.15

Message ID 20171113031502.f6mctmlmgk5psh77@thunk.org
State Not Applicable, archived
Headers show
Series [GIT,PULL] ext4 updates for 4.15 | expand

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus

Message

Theodore Ts'o Nov. 13, 2017, 3:15 a.m. UTC
The following changes since commit 9e66317d3c92ddaab330c125dfe9d06eee268aff:

  Linux 4.14-rc3 (2017-10-01 14:54:54 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus

for you to fetch changes up to 232530680290ba94ca37852ab10d9556ea28badf:

  ext4: improve smp scalability for inode generation (2017-11-08 22:23:20 -0500)

----------------------------------------------------------------
Add support for online resizing of file systems with bigalloc.  Fix a
two data corruption bugs involving DAX, as well as a corruption bug
after a crash during a racing fallocate and delayed allocation.
Finally, a number of cleanups and optimizations.

----------------------------------------------------------------
Andreas Gruenbacher (3):
      iomap: Switch from blkno to disk offset
      iomap: Add IOMAP_F_DATA_INLINE flag
      ext4: Add iomap support for inline data

Christoph Hellwig (1):
      ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA

Kees Cook (2):
      jbd2: convert timers to use timer_setup()
      ext4: convert timers to use timer_setup()

Pavel Machek (1):
      Documentation: fix little inconsistencies

Ross Zwisler (5):
      ext4: prevent data corruption with inline data + DAX
      ext4: prevent data corruption with journaling + DAX
      ext4: add sanity check for encryption + DAX
      ext4: add ext4_should_use_dax()
      ext4: remove duplicate extended attributes defs

Simon Ruderich (1):
      ext4: mention noload when recovering on read-only device

Theodore Ts'o (3):
      ext4: retry allocations conservatively
      ext4: fix interaction between i_size, fallocate, and delalloc after a crash
      ext4: improve smp scalability for inode generation

harshads (1):
      ext4: add support for online resizing with bigalloc

 Documentation/filesystems/ext4.txt |   8 +--
 fs/buffer.c                        |   4 +-
 fs/dax.c                           |   2 +-
 fs/ext2/inode.c                    |   4 +-
 fs/ext4/Kconfig                    |   1 +
 fs/ext4/balloc.c                   |  15 ++--
 fs/ext4/ext4.h                     |  50 ++-----------
 fs/ext4/extents.c                  |   6 +-
 fs/ext4/file.c                     | 263 ++++-----------------------------------------------------------------
 fs/ext4/ialloc.c                   |   4 +-
 fs/ext4/inline.c                   |  43 +++++++++---
 fs/ext4/inode.c                    | 153 ++++++++++++++++++----------------------
 fs/ext4/ioctl.c                    |  30 ++++----
 fs/ext4/mballoc.c                  |  28 ++++----
 fs/ext4/resize.c                   | 104 +++++++++++++++++----------
 fs/ext4/super.c                    |  27 +++----
 fs/iomap.c                         |  13 ++--
 fs/jbd2/journal.c                  |   9 ++-
 fs/nfsd/blocklayout.c              |   4 +-
 fs/xfs/xfs_iomap.c                 |   6 +-
 include/linux/iomap.h              |  15 ++--
 21 files changed, 278 insertions(+), 511 deletions(-)

Comments

Theodore Ts'o Nov. 13, 2017, 4:25 p.m. UTC | #1
I forgot to mention, there's a merge conflict when pulling the ext4
and fscrypt trees.  The fixup is relatively straightforward:

commit daf886f04e60eda3bbc957e79d81d72965afd947
Merge: a0b3bc855374 232530680290
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sun Nov 12 22:03:15 2017 -0500

    Merge tag 'ext4_for_linus' into test
    
    Add support for online resizing of file systems with bigalloc.  Fix a
    two data corruption bugs involving DAX, as well as a corruption bug
    after a crash during a racing fallocate and delayed allocation.
    Finally, a number of cleanups and optimizations.

diff --cc fs/ext4/inode.c
index 617c7feced24,9f836e2ec18c..737c43d724fb
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@@ -4587,15 -4640,10 +4640,13 @@@ void ext4_set_inode_flags(struct inode 
  		new_fl |= S_NOATIME;
  	if (flags & EXT4_DIRSYNC_FL)
  		new_fl |= S_DIRSYNC;
- 	if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode) &&
- 	    !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
- 	    !(flags & EXT4_ENCRYPT_FL))
+ 	if (ext4_should_use_dax(inode))
  		new_fl |= S_DAX;
 +	if (flags & EXT4_ENCRYPT_FL)
 +		new_fl |= S_ENCRYPTED;
  	inode_set_flags(inode, new_fl,
 -			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
 +			S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX|
 +			S_ENCRYPTED);
  }
  
  static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
Linus Torvalds Nov. 14, 2017, 8:59 p.m. UTC | #2
On Mon, Nov 13, 2017 at 8:25 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> I forgot to mention, there's a merge conflict when pulling the ext4
> and fscrypt trees.  The fixup is relatively straightforward:

It doesn't actually look all that straightforward, and in particular,
the resolution you sent me doesn't actually seem correct:

>                 new_fl |= S_NOATIME;
>         if (flags & EXT4_DIRSYNC_FL)
>                 new_fl |= S_DIRSYNC;
> -       if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode) &&
> -           !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
> -           !(flags & EXT4_ENCRYPT_FL))
> +       if (ext4_should_use_dax(inode))
>                 new_fl |= S_DAX;

This now loses the "!(flags & EXT4_ENCRYPT_FL)" test when it sets S_DAX.

Yes, in ext4_should_use_dax(), it has this code

        if (ext4_encrypted_inode(inode))
                return false;

but that test was what commit 2ee6a576be56 changed in favor of just
checking !(flags & EXT4_ENCRYPT_FL).

So that suggested merge resolkution actually undoes some of that
commit 2ee6a576be56.

Of course,

        (flags & EXT4_ENCRYPT_FL)

_should_ be the same as

        ext4_test_inode_flag(inode, EXT4_INODE_ENCRYPT);

so It does seem to be harmless, but it's a bit dodgy.

I'll do that suggested resolution, but I have to say that the ext4 bit
testing is incredibly broken and non-obvious. Just as an example:

  fs/ext4/ext4.h:#define EXT4_ENCRYPT_FL                  0x00000800
/* encrypted file */
  fs/ext4/ext4.h: EXT4_INODE_ENCRYPT      = 11,   /* Encrypted file */

yeah, it's the same bit, but it sure as hell isn't obvious. Why the
two totally different ways to define that data?

            Linus
Theodore Ts'o Nov. 15, 2017, 12:56 a.m. UTC | #3
On Tue, Nov 14, 2017 at 12:59:17PM -0800, Linus Torvalds wrote:
> Of course,
> 
>         (flags & EXT4_ENCRYPT_FL)
> 
> _should_ be the same as
> 
>         ext4_test_inode_flag(inode, EXT4_INODE_ENCRYPT);

And in the second is the preferred way to do things, actually.

> I'll do that suggested resolution, but I have to say that the ext4 bit
> testing is incredibly broken and non-obvious. Just as an example:
> 
>   fs/ext4/ext4.h:#define EXT4_ENCRYPT_FL                  0x00000800
> /* encrypted file */
>   fs/ext4/ext4.h: EXT4_INODE_ENCRYPT      = 11,   /* Encrypted file */
> 
> yeah, it's the same bit, but it sure as hell isn't obvious. Why the
> two totally different ways to define that data?

Yes, it's non-obvious and ugly.  Sorry about that.

We originally used EXT4_*_FL, and we needed to use the bit number
encoding so we could use test_bit().  We just never converted all the
way over.

We do have a way to make sure the two ways of defining a bit position
are in sync; see ext4_check_flag_values() and CHECK_FLAG_VALUE in
ext4.h.  It's a bit gross, and we probably should clean this up, at
least in the kernel.  The e2fsprogs user space libraries all use
EXT4_*_FL, and we can't change that without breaking applications
depending on userspace, but we can keep things consistent in the
kernel, and that probably means completely converting away from
EXT4_*_FL, if possible.

       	   	      	 	  - Ted
Linus Torvalds Nov. 15, 2017, 1:11 a.m. UTC | #4
On Tue, Nov 14, 2017 at 4:56 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>
>    The e2fsprogs user space libraries all use
> EXT4_*_FL, and we can't change that without breaking applications
> depending on userspace, but we can keep things consistent in the
> kernel, and that probably means completely converting away from
> EXT4_*_FL, if possible.

So I don't think you'd necessarily need to convert from one to the
other, but wouldn't it be nice if you at least defined one in terms of
the other, ie something like

  #define EXT4_ENCRYPT_FL (1u << EXT4_INODE_ENCRYPT)

so that when you grep for one you see how they are directly related.

Now it was much less obvious, and I was nervous because that whole
series did introduce _different_ bits that were not in the same space
at all, and encoded the same thing (ie that S_ENCRYPTED bit).

Maybe this normally doesn't come up, but it was not all that obvious,
particularly since there was a lot of indirection:

 ext4_encrypted_inode() ->
    ext4_test_inode_flag(inode, EXT4_INODE_ENCRYPT) ->
        EXT4_INODE_BIT_FNS()

That EXT4_INODE_BIT_FNS thing was really fascinating to see.

So just confirming that yes,

   ext4_encrypted_inode()

is the same thing as

   EXT4_I(inode)->i_flags & EXT4_ENCRYPT_FL

was a real adventure.

Making it clear that EXT4_ENCRYPT_FL and EXT4_INODE_ENCRYPT are the
same bit would maybe have lessened the confusion at least a tiny bit.

Of course, not having five different ways to test the same bit would
have been even better.

Ok, I'm exaggerating.

But there really does seem to be a lot of different ways to check
i_flags bits, with some uses checking it directly, the places
_setting_ it using ext4_set_inode_flag(), and then other testers using
bit-specific helper.

And that somewhat confusing model seems to be true of pretty much all the bits.

As long as you can keep track of it, I guess it's fine.

            Linus
Jiri Slaby Feb. 12, 2018, 10:02 a.m. UTC | #5
Hi,

On 11/13/2017, 04:15 AM, Theodore Ts'o wrote:
> The following changes since commit 9e66317d3c92ddaab330c125dfe9d06eee268aff:
> 
>   Linux 4.14-rc3 (2017-10-01 14:54:54 -0700)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus
...
> Christoph Hellwig (1):
>       ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA

So I bisected a 32bit hole problem in 4.15 down to this one.

tar selftests never finish during build on the 32bit kernel. The stack
trace is something like this:
tar             R  running task        0 24780  24590 0x00000000
Call Trace:
 ? ext4_map_blocks+0x2fa/0x5f0
 ? ext4_map_blocks+0x348/0x5f0
 ? ext4_iomap_begin+0x39b/0x4d0
 ? ext4_evict_inode+0x5c0/0x5c0
 ? iomap_apply+0x61/0x160
 ? iomap_to_fiemap+0xd0/0xd0
 ? iomap_seek_data+0xac/0x110
 ? iomap_seek_hole_actor+0x80/0x80
 ? ext4_llseek+0x13a/0x140
 ? ext4_file_mmap+0x90/0x90
 ? SyS_llseek+0x7b/0xc0
 ? do_int80_syscall_32+0x51/0x100
 ? entry_INT80_32+0x36/0x36


The tar test does this:
AT_TAR_CHECK([
AT_DATA([mapfile],
[0 =2560
m4_for([i], 1, 999, 1, [10M =2560
])])
genfile --sparse --file BIGFILE --block-size 4K - < mapfile || AT_SKIP_TEST
tar -f - -c --sparse --posix BIGFILE | tar tvf - | awk '{ print $3, $(NF) }'
],

I.e. it generates a file with 1000 blocks of 2560*4K bytes. The blocks
are 10M from each other, that is 999 10M holes. Tar then tries to
read&seek in the file, but the seek above never finishes (or maybe
finishes after a long time).

Given this happens only on 32bit kernel, I assume some 32bit overflow.
But I am unable to see it (yet).

thanks,
Jiri Slaby Feb. 12, 2018, 12:14 p.m. UTC | #6
On 02/12/2018, 11:02 AM, Jiri Slaby wrote:
> Given this happens only on 32bit kernel, I assume some 32bit overflow.
> But I am unable to see it (yet).

Just to add, a diff of strace in good and bad kernels:
@@ -655,14 +655,4 @@
 _llseek(3, 4275568640, [4286054400], SEEK_DATA) = 0
 _llseek(3, 4286054400, [4288675840], SEEK_HOLE) = 0
 _llseek(3, 4288675840, [4299161600], SEEK_DATA) = 0
-_llseek(3, 4299161600, [4301783040], SEEK_HOLE) = 0
+_llseek(3, 4299161600, [4299161600], SEEK_HOLE) = 2621440

llseek returns a very invalid value when it comes to 0x100400000.

regards,
Jiri Slaby Feb. 12, 2018, 1:34 p.m. UTC | #7
On 02/12/2018, 01:14 PM, Jiri Slaby wrote:
> On 02/12/2018, 11:02 AM, Jiri Slaby wrote:
>> Given this happens only on 32bit kernel, I assume some 32bit overflow.
>> But I am unable to see it (yet).
> 
> Just to add, a diff of strace in good and bad kernels:
> @@ -655,14 +655,4 @@
>  _llseek(3, 4275568640, [4286054400], SEEK_DATA) = 0
>  _llseek(3, 4286054400, [4288675840], SEEK_HOLE) = 0
>  _llseek(3, 4288675840, [4299161600], SEEK_DATA) = 0
> -_llseek(3, 4299161600, [4301783040], SEEK_HOLE) = 0
> +_llseek(3, 4299161600, [4299161600], SEEK_HOLE) = 2621440
> 
> llseek returns a very invalid value when it comes to 0x100400000.

It gets cropped to 0x400000, so:

--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3523,7 +3523,7 @@ static int ext4_iomap_begin(struct inode *inode,
loff_t offset, loff_t length,
                iomap->flags |= IOMAP_F_DIRTY;
        iomap->bdev = inode->i_sb->s_bdev;
        iomap->dax_dev = sbi->s_daxdev;
-       iomap->offset = first_block << blkbits;
+       iomap->offset = (u64)first_block << blkbits;
        iomap->length = (u64)map.m_len << blkbits;

        if (ret == 0) {

sounds about right?

> regards,
Jiri Slaby Feb. 12, 2018, 1:38 p.m. UTC | #8
On 02/12/2018, 02:34 PM, Jiri Slaby wrote:
> @@ -3523,7 +3523,7 @@ static int ext4_iomap_begin(struct inode *inode,
> loff_t offset, loff_t length,
>                 iomap->flags |= IOMAP_F_DIRTY;
>         iomap->bdev = inode->i_sb->s_bdev;
>         iomap->dax_dev = sbi->s_daxdev;
> -       iomap->offset = first_block << blkbits;
> +       iomap->offset = (u64)first_block << blkbits;
>         iomap->length = (u64)map.m_len << blkbits;
> 
>         if (ret == 0) {
> 
> sounds about right?

Or even simpler:
-       iomap->offset = first_block << blkbits;
+       iomap->offset = offset;

>> regards,-- 
js
suse labs
Theodore Ts'o Feb. 12, 2018, 4:10 p.m. UTC | #9
On Mon, Feb 12, 2018 at 01:14:07PM +0100, Jiri Slaby wrote:
> On 02/12/2018, 11:02 AM, Jiri Slaby wrote:
> > Given this happens only on 32bit kernel, I assume some 32bit overflow.
> > But I am unable to see it (yet).
> 
> Just to add, a diff of strace in good and bad kernels:
> @@ -655,14 +655,4 @@
>  _llseek(3, 4275568640, [4286054400], SEEK_DATA) = 0
>  _llseek(3, 4286054400, [4288675840], SEEK_HOLE) = 0
>  _llseek(3, 4288675840, [4299161600], SEEK_DATA) = 0
> -_llseek(3, 4299161600, [4301783040], SEEK_HOLE) = 0
> +_llseek(3, 4299161600, [4299161600], SEEK_HOLE) = 2621440
> 
> llseek returns a very invalid value when it comes to 0x100400000.

Thanks for the bugreport!  Can you send me the output of "filefrag -v"
on the file so I can make sure we can exactly replicate what you're
seeing?

Thanks,

						- Ted