diff mbox series

[RFC] ext4: destroy inline data immediately when converting to extent

Message ID 20240208165808.5494-1-lhenriques@suse.de
State New
Headers show
Series [RFC] ext4: destroy inline data immediately when converting to extent | expand

Commit Message

Luis Henriques Feb. 8, 2024, 4:58 p.m. UTC
When writing to an inode that has inline data and the amount of data written
exceeds the maximum inline data length, that data is un-inlined, i.e. it is
converted into an extent.  However, when delayed allocation is enabled the
destruction of the data is postponed until the data writeback.  This causes
consistency problems.  Here's a very simple test case, run on a filesystem
with delayed allocation and inline data features enabled:

 $ dd if=/dev/zero of=test-file bs=64 count=3 status=none
 $ lsattr test-file
 ------------------N--- test-file

The 'lsattr' command shows that the file has data stored inline.  However,
that is not correct because writing 192 bytes (3 * 64) has forced the data
to be un-inlined.  Doing a 'sync' before running the 'lsattr' fixes it.

Note that this bug doesn't happen if the filesytems is mount using the
'nodelalloc' option.

(There's a similar test case using read() instead in the bugzilla linked
bellow.)

This patch fixes the issue in the delayed allocation path by destroying the
inline data immediately after converting it to an extent instead of delaying
that operation until the writeback.  This is done by invoking function
ext4_destroy_inline_data_nolock(), which is going to clean-up all the
missing data structures, including clearing ĨNLINE_DATA and setting the
EXTENTS inode flags.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200681
Cc: Daniel Dawson <danielcdawson@gmail.com>
Signed-off-by: Luis Henriques <lhenriques@suse.de>
---
Hi!

I'm sending this patch as an RFC because, although I've done a good amount
of testing, I'm still not convinced it is correct.  I.e. there may exist a
good reason for postponing the call to ext4_destroy_inline_data_nolock() and
that I'm failing to see it.  Please let me know what you think.

 fs/ext4/inline.c | 20 ++++++++++----------
 fs/ext4/inode.c  | 18 +-----------------
 2 files changed, 11 insertions(+), 27 deletions(-)

Comments

Daniel Dawson Feb. 9, 2024, 1:31 a.m. UTC | #1
On 2/8/24 8:58 AM, Luis Henriques wrote:
> The 'lsattr' command shows that the file has data stored inline.  However,
> that is not correct because writing 192 bytes (3 * 64) has forced the data
> to be un-inlined.  Doing a 'sync' before running the 'lsattr' fixes it.
I know I'm the one who initially reported this, but I suppose there is 
an argument to be made that this in itself is not a bug, as Ted seemed 
to say before. What is a problem, however, is lseek() giving ENOENT when 
the kernel finds file data where it expects to find a block/extent map, 
and I can confirm this patch solves that (tested against 6.8.0-rc3). Not 
sure if it's the best thing to do so by forcing immediate allocation, 
but I feel it's still an improvement.
diff mbox series

Patch

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index d5bd1e3a5d36..e19a176cfc93 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -830,11 +830,12 @@  int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
  *    update and dirty so that ext4_da_writepages can handle it. We don't
  *    need to start the journal since the file's metadata isn't changed now.
  */
-static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
+static int ext4_da_convert_inline_data_to_extent(handle_t *handle,
+						 struct address_space *mapping,
 						 struct inode *inode,
 						 void **fsdata)
 {
-	int ret = 0, inline_size;
+	int ret = 0, inline_size, no_expand;
 	struct folio *folio;
 
 	folio = __filemap_get_folio(mapping, 0, FGP_WRITEBEGIN,
@@ -842,7 +843,7 @@  static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
 	if (IS_ERR(folio))
 		return PTR_ERR(folio);
 
-	down_read(&EXT4_I(inode)->xattr_sem);
+	ext4_write_lock_xattr(inode, &no_expand);
 	if (!ext4_has_inline_data(inode)) {
 		ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
 		goto out;
@@ -859,20 +860,18 @@  static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
 	ret = __block_write_begin(&folio->page, 0, inline_size,
 				  ext4_da_get_block_prep);
 	if (ret) {
-		up_read(&EXT4_I(inode)->xattr_sem);
+		ext4_write_unlock_xattr(inode, &no_expand);
 		folio_unlock(folio);
 		folio_put(folio);
 		ext4_truncate_failed_write(inode);
 		return ret;
 	}
 
-	folio_mark_dirty(folio);
-	folio_mark_uptodate(folio);
-	ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
+	ret = ext4_destroy_inline_data_nolock(handle, inode);
 	*fsdata = (void *)CONVERT_INLINE_DATA;
 
 out:
-	up_read(&EXT4_I(inode)->xattr_sem);
+	ext4_write_unlock_xattr(inode, &no_expand);
 	if (folio) {
 		folio_unlock(folio);
 		folio_put(folio);
@@ -916,10 +915,11 @@  int ext4_da_write_inline_data_begin(struct address_space *mapping,
 		goto out_journal;
 
 	if (ret == -ENOSPC) {
-		ext4_journal_stop(handle);
-		ret = ext4_da_convert_inline_data_to_extent(mapping,
+		ret = ext4_da_convert_inline_data_to_extent(handle,
+							    mapping,
 							    inode,
 							    fsdata);
+		ext4_journal_stop(handle);
 		if (ret == -ENOSPC &&
 		    ext4_should_retry_alloc(inode->i_sb, &retries))
 			goto retry_journal;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2ccf3b5e3a7c..43fa930fafa0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2548,23 +2548,7 @@  static int ext4_do_writepages(struct mpage_da_data *mpd)
 		goto out_writepages;
 	}
 
-	/*
-	 * If we have inline data and arrive here, it means that
-	 * we will soon create the block for the 1st page, so
-	 * we'd better clear the inline data here.
-	 */
-	if (ext4_has_inline_data(inode)) {
-		/* Just inode will be modified... */
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			goto out_writepages;
-		}
-		BUG_ON(ext4_test_inode_state(inode,
-				EXT4_STATE_MAY_INLINE_DATA));
-		ext4_destroy_inline_data(handle, inode);
-		ext4_journal_stop(handle);
-	}
+	WARN_ON_ONCE(ext4_has_inline_data(inode));
 
 	/*
 	 * data=journal mode does not do delalloc so we just need to writeout /