[v2] ext4: fix fast commit inode enqueueing during a full journal commit

Message ID 20240523111618.17012-1-luis.henriques@linux.dev
State New

Commit Message

Luis Henriques (SUSE) May 23, 2024, 11:16 a.m. UTC
When a full journal commit is on-going, any fast commit has to be enqueued
into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
is done only once, i.e. if an inode is already queued in a previous fast
commit entry it won't be enqueued again.  However, if a full commit starts
_after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
be done into FC_Q_STAGING.  And this is not being done in function
ext4_fc_track_template().
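
The enqueue-once behaviour described above can be illustrated with a small
userspace model.  This is only a sketch: the names model_queue, model_inode
and model_track() are hypothetical, and a single 'commit_ongoing' flag stands
in for the journal flags checked in ext4_fc_track_template().

```c
#include <stdbool.h>

/* Hypothetical names: a userspace model, not the ext4 data structures. */
enum model_queue { MODEL_Q_NONE, MODEL_Q_MAIN, MODEL_Q_STAGING };

struct model_inode {
	enum model_queue queue;	/* which fast commit queue holds the inode */
};

/*
 * Models only the queueing step of ext4_fc_track_template(): an inode that
 * is not queued yet goes to STAGING while a commit is ongoing and to MAIN
 * otherwise; an inode that is already queued is left where it is.
 */
static void model_track(struct model_inode *inode, bool commit_ongoing)
{
	if (inode->queue != MODEL_Q_NONE)
		return;	/* already queued: the new update is not re-routed */
	inode->queue = commit_ongoing ? MODEL_Q_STAGING : MODEL_Q_MAIN;
}
```

In this model, an inode tracked once with no commit running and then again
after a commit has started stays in the MAIN queue, which is the situation
the patch is trying to handle.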

This patch fixes the issue by flagging an inode that is already enqueued in
either queue.  Later, during the fast commit clean-up callback, if the
inode has a tid that is bigger than the one being handled, that inode is
re-enqueued into STAGING and then spliced back into MAIN.

This bug was found using fstest generic/047.  This test creates several 32k
byte files, sync'ing each of them after its creation, and then shutting
down the filesystem.  Some data may be lost in this operation; for example a
file may have its size truncated to zero.

Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
---
Hi!

(Now Cc'ing Harshad, as I should have done in the initial RFC.)

This v2 is a completely different solution, hinted by Jan Kara.  I hope my
understanding of his suggestion is correct.  Also, I've dropped the second
patch as it didn't make sense, as Jan also pointed out.

Finally, I haven't yet done a review of Harshad's patchset [1] (hope to
get to it soon), but a quick test shows the issue is still present there.
The good news is that this patch can be trivially applied on top of it.

[1] https://lore.kernel.org/all/20240520055153.136091-1-harshadshirwadkar@gmail.com

Cheers,
--
Luis

 fs/ext4/ext4.h        | 11 ++++++++++-
 fs/ext4/fast_commit.c | 11 +++++++++++
 fs/ext4/super.c       |  1 +
 3 files changed, 22 insertions(+), 1 deletion(-)

Comments

Jan Kara May 24, 2024, 4:22 p.m. UTC | #1
On Thu 23-05-24 12:16:18, Luis Henriques (SUSE) wrote:
> When a full journal commit is on-going, any fast commit has to be enqueued
> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
> is done only once, i.e. if an inode is already queued in a previous fast
> commit entry it won't be enqueued again.  However, if a full commit starts
> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
> be done into FC_Q_STAGING.  And this is not being done in function
> ext4_fc_track_template().
> 
> This patch fixes the issue by flagging an inode that is already enqueued in
> either queues.  Later, during the fast commit clean-up callback, if the
> inode has a tid that is bigger than the one being handled, that inode is
> re-enqueued into STAGING and the spliced back into MAIN.
> 
> This bug was found using fstest generic/047.  This test creates several 32k
> bytes files, sync'ing each of them after it's creation, and then shutting
> down the filesystem.  Some data may be loss in this operation; for example a
> file may have it's size truncated to zero.
> 
> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>

Thanks for the fix. Some comments below:

> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 983dad8c07ec..4c308c18c3da 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1062,9 +1062,18 @@ struct ext4_inode_info {
>  	/* Fast commit wait queue for this inode */
>  	wait_queue_head_t i_fc_wait;
>  
> -	/* Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len */
> +	/*
> +	 * Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len,
> +	 * i_fc_next
> +	 */
>  	struct mutex i_fc_lock;
>  
> +	/*
> +	 * Used to flag an inode as part of the next fast commit; will be
> +	 * reset during fast commit clean-up
> +	 */
> +	tid_t i_fc_next;
> +

Do we really need new tid in the inode? I'd be kind of hoping we could use
EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
ext4_fc_track_template() and used for similar comparisons in fast commit
code.

> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 87c009e0c59a..bfdf249f0783 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -402,6 +402,8 @@ static int ext4_fc_track_template(
>  				 sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) ?
>  				&sbi->s_fc_q[FC_Q_STAGING] :
>  				&sbi->s_fc_q[FC_Q_MAIN]);
> +	else
> +		ei->i_fc_next = tid;
>  	spin_unlock(&sbi->s_fc_lock);
>  
>  	return ret;
> @@ -1280,6 +1282,15 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>  	list_for_each_entry_safe(iter, iter_n, &sbi->s_fc_q[FC_Q_MAIN],
>  				 i_fc_list) {
>  		list_del_init(&iter->i_fc_list);
> +		if (iter->i_fc_next == tid)
> +			iter->i_fc_next = 0;
> +		else if (iter->i_fc_next > tid)
			 ^^^ careful here, TIDs do wrap so you need to use
tid_geq() for comparison.

> +			/*
> +			 * re-enqueue inode into STAGING, which will later be
> +			 * splice back into MAIN
> +			 */
> +			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
> +				      &sbi->s_fc_q[FC_Q_STAGING]);
>  		ext4_clear_inode_state(&iter->vfs_inode,
>  				       EXT4_STATE_FC_COMMITTING);
>  		if (iter->i_sync_tid <= tid)
				     ^^^ and I can see this is buggy as
well and needs tid_geq() (not your fault obviously).

								Honza
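
The wrap-around concern raised in this review can be shown in userspace.
The sketch below assumes tid_t is a 32-bit unsigned counter and follows the
idea behind the tid_gt()/tid_geq() helpers in include/linux/jbd2.h (compare
the signed difference rather than the raw values); the typedef and the
tid_wrap_demo() function are stand-ins added here for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's tid_t, assumed to be a 32-bit unsigned counter. */
typedef uint32_t tid_t;

/* Wrap-safe "x is newer than y": look at the sign of the difference. */
static inline bool tid_gt(tid_t x, tid_t y)
{
	int32_t difference = (int32_t)(x - y);

	return difference > 0;
}

/* Wrap-safe "x is newer than or equal to y". */
static inline bool tid_geq(tid_t x, tid_t y)
{
	int32_t difference = (int32_t)(x - y);

	return difference >= 0;
}

/* Hypothetical demo: a tid just after the wrap versus one just before it. */
static void tid_wrap_demo(void)
{
	tid_t before_wrap = 0xfffffffeu;
	tid_t after_wrap = 2u;

	assert(!(after_wrap > before_wrap));	/* plain '>' gets it wrong */
	assert(tid_gt(after_wrap, before_wrap));	/* helper gets it right */
	assert(tid_geq(after_wrap, after_wrap));	/* equal tids count as geq */
}
```

With a plain '>' the inode whose tid has wrapped would look older than the
transaction being cleaned up, so it would not be re-enqueued.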
Luis Henriques (SUSE) May 27, 2024, 8:29 a.m. UTC | #2
On Fri 24 May 2024 06:22:31 PM +02, Jan Kara wrote;

> On Thu 23-05-24 12:16:18, Luis Henriques (SUSE) wrote:
>> When a full journal commit is on-going, any fast commit has to be enqueued
>> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN.  This enqueueing
>> is done only once, i.e. if an inode is already queued in a previous fast
>> commit entry it won't be enqueued again.  However, if a full commit starts
>> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
>> be done into FC_Q_STAGING.  And this is not being done in function
>> ext4_fc_track_template().
>> 
>> This patch fixes the issue by flagging an inode that is already enqueued in
>> either queues.  Later, during the fast commit clean-up callback, if the
>> inode has a tid that is bigger than the one being handled, that inode is
>> re-enqueued into STAGING and the spliced back into MAIN.
>> 
>> This bug was found using fstest generic/047.  This test creates several 32k
>> bytes files, sync'ing each of them after it's creation, and then shutting
>> down the filesystem.  Some data may be loss in this operation; for example a
>> file may have it's size truncated to zero.
>> 
>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
>
> Thanks for the fix. Some comments below:
>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 983dad8c07ec..4c308c18c3da 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1062,9 +1062,18 @@ struct ext4_inode_info {
>>  	/* Fast commit wait queue for this inode */
>>  	wait_queue_head_t i_fc_wait;
>>  
>> -	/* Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len */
>> +	/*
>> +	 * Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len,
>> +	 * i_fc_next
>> +	 */
>>  	struct mutex i_fc_lock;
>>  
>> +	/*
>> +	 * Used to flag an inode as part of the next fast commit; will be
>> +	 * reset during fast commit clean-up
>> +	 */
>> +	tid_t i_fc_next;
>> +
>
> Do we really need new tid in the inode? I'd be kind of hoping we could use
> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
> ext4_fc_track_template() and used for similar comparisons in fast commit
> code.

Ah, true.  It looks like it could be used indeed.  We'll still need a flag
here, but a simple bool should be enough for that.

>
>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
>> index 87c009e0c59a..bfdf249f0783 100644
>> --- a/fs/ext4/fast_commit.c
>> +++ b/fs/ext4/fast_commit.c
>> @@ -402,6 +402,8 @@ static int ext4_fc_track_template(
>>  				 sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) ?
>>  				&sbi->s_fc_q[FC_Q_STAGING] :
>>  				&sbi->s_fc_q[FC_Q_MAIN]);
>> +	else
>> +		ei->i_fc_next = tid;
>>  	spin_unlock(&sbi->s_fc_lock);
>>  
>>  	return ret;
>> @@ -1280,6 +1282,15 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>>  	list_for_each_entry_safe(iter, iter_n, &sbi->s_fc_q[FC_Q_MAIN],
>>  				 i_fc_list) {
>>  		list_del_init(&iter->i_fc_list);
>> +		if (iter->i_fc_next == tid)
>> +			iter->i_fc_next = 0;
>> +		else if (iter->i_fc_next > tid)
> 			 ^^^ careful here, TIDs do wrap so you need to use
> tid_geq() for comparison.
>

Yikes!  Thanks, I'll update the code to do that.

>> +			/*
>> +			 * re-enqueue inode into STAGING, which will later be
>> +			 * splice back into MAIN
>> +			 */
>> +			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
>> +				      &sbi->s_fc_q[FC_Q_STAGING]);
>>  		ext4_clear_inode_state(&iter->vfs_inode,
>>  				       EXT4_STATE_FC_COMMITTING);
>>  		if (iter->i_sync_tid <= tid)
> 				     ^^^ and I can see this is buggy as
> well and needs tid_geq() (not your fault obviously).

Yeah, good point.  I can do that too in v3.

Again, thanks a lot for your review!

Cheers,
Luis Henriques (SUSE) May 27, 2024, 3:48 p.m. UTC | #3
On Mon 27 May 2024 09:29:40 AM +01, Luis Henriques wrote;

<snip>

>>> +	/*
>>> +	 * Used to flag an inode as part of the next fast commit; will be
>>> +	 * reset during fast commit clean-up
>>> +	 */
>>> +	tid_t i_fc_next;
>>> +
>>
>> Do we really need new tid in the inode? I'd be kind of hoping we could use
>> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
>> ext4_fc_track_template() and used for similar comparisons in fast commit
>> code.
>
> Ah, true.  It looks like it could be used indeed.  We'll still need a flag
> here, but a simple bool should be enough for that.

After looking again at the code, I'm not 100% sure that this is actually
doable.  For example, if I replace the above by

	bool i_fc_next;

and set it to 'true' below:

>>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
>>> index 87c009e0c59a..bfdf249f0783 100644
>>> --- a/fs/ext4/fast_commit.c
>>> +++ b/fs/ext4/fast_commit.c
>>> @@ -402,6 +402,8 @@ static int ext4_fc_track_template(
>>>  				 sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) ?
>>>  				&sbi->s_fc_q[FC_Q_STAGING] :
>>>  				&sbi->s_fc_q[FC_Q_MAIN]);
>>> +	else
>>> +		ei->i_fc_next = tid;

		ei->i_fc_next = true;

Then, when we get to the ext4_fc_cleanup(), the value of iter->i_sync_tid
may have changed in the meantime from, e.g., ext4_do_update_inode() or
__ext4_iget().  This would cause the clean-up code to be bogus if it still
implements the logic below, by comparing the tid with i_sync_tid.
(Although, to be honest, I couldn't see any visible effect in the quick
testing I've done.)  Or am I missing something, and this is *exactly* the
behaviour you'd expect?

Cheers,
Jan Kara May 28, 2024, 10:36 a.m. UTC | #4
On Mon 27-05-24 16:48:24, Luis Henriques wrote:
> On Mon 27 May 2024 09:29:40 AM +01, Luis Henriques wrote;
> >>> +	/*
> >>> +	 * Used to flag an inode as part of the next fast commit; will be
> >>> +	 * reset during fast commit clean-up
> >>> +	 */
> >>> +	tid_t i_fc_next;
> >>> +
> >>
> >> Do we really need new tid in the inode? I'd be kind of hoping we could use
> >> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
> >> ext4_fc_track_template() and used for similar comparisons in fast commit
> >> code.
> >
> > Ah, true.  It looks like it could be used indeed.  We'll still need a flag
> > here, but a simple bool should be enough for that.
> 
> After looking again at the code, I'm not 100% sure that this is actually
> doable.  For example, if I replace the above by
> 
> 	bool i_fc_next;
> 
> and set to to 'true' below:
> 
> >>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> >>> index 87c009e0c59a..bfdf249f0783 100644
> >>> --- a/fs/ext4/fast_commit.c
> >>> +++ b/fs/ext4/fast_commit.c
> >>> @@ -402,6 +402,8 @@ static int ext4_fc_track_template(
> >>>  				 sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) ?
> >>>  				&sbi->s_fc_q[FC_Q_STAGING] :
> >>>  				&sbi->s_fc_q[FC_Q_MAIN]);
> >>> +	else
> >>> +		ei->i_fc_next = tid;
> 
> 		ei->i_fc_next = true;
> 
> Then, when we get to the ext4_fc_cleanup(), the value of iter->i_sync_tid
> may have changed in the meantime from, e.g., ext4_do_update_inode() or
> __ext4_iget().  This would cause the clean-up code to be bogus if it still
> implements a the logic below, by comparing the tid with i_sync_tid.
> (Although, to be honest, I couldn't see any visible effect in the quick
> testing I've done.)  Or am I missing something, and this is *exactly* the
> behaviour you'd expect?

Yes, this is the behavior I'd expect. The rationale is that if i_sync_tid
points to the running transaction, it means the inode was modified in it,
which means fastcommit needs to write it out. In fact the
ext4_update_inode_fsync_trans() calls usually happen together with
ext4_fc_track_...() calls. This could use some cleanup so that we don't set
i_sync_tid in two places unnecessarily but that's for some other time...

								Honza
Jan Kara May 28, 2024, 10:52 a.m. UTC | #5
On Tue 28-05-24 12:36:02, Jan Kara wrote:
> On Mon 27-05-24 16:48:24, Luis Henriques wrote:
> > On Mon 27 May 2024 09:29:40 AM +01, Luis Henriques wrote;
> > >>> +	/*
> > >>> +	 * Used to flag an inode as part of the next fast commit; will be
> > >>> +	 * reset during fast commit clean-up
> > >>> +	 */
> > >>> +	tid_t i_fc_next;
> > >>> +
> > >>
> > >> Do we really need new tid in the inode? I'd be kind of hoping we could use
> > >> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
> > >> ext4_fc_track_template() and used for similar comparisons in fast commit
> > >> code.
> > >
> > > Ah, true.  It looks like it could be used indeed.  We'll still need a flag
> > > here, but a simple bool should be enough for that.
> > 
> > After looking again at the code, I'm not 100% sure that this is actually
> > doable.  For example, if I replace the above by
> > 
> > 	bool i_fc_next;
> > 
> > and set to to 'true' below:

Forgot to comment on this one: I don't think you even need 'bool i_fc_next'
- simply whenever i_sync_tid is greater than committing transaction's tid,
you move the inode to FC_Q_STAGING list in ext4_fc_cleanup().

								Honza
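
The decision suggested here can be modelled in userspace as follows.  This
is a sketch of the idea only, not the v3 patch: model_inode,
model_fc_cleanup() and the MODEL_Q_* values are hypothetical, 'sync_tid'
stands in for i_sync_tid, and the comparison is done wrap-safely as
requested earlier in the review.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's tid_t, assumed to be a 32-bit unsigned counter. */
typedef uint32_t tid_t;

/* Hypothetical names: a userspace model, not the ext4 data structures. */
enum model_queue { MODEL_Q_NONE, MODEL_Q_MAIN, MODEL_Q_STAGING };

struct model_inode {
	tid_t sync_tid;		/* models ext4_inode_info.i_sync_tid */
	enum model_queue queue;	/* which fast commit queue holds the inode */
};

/* Wrap-safe "x is newer than y". */
static bool tid_gt(tid_t x, tid_t y)
{
	return (int32_t)(x - y) > 0;
}

/*
 * Model of the clean-up pass for the transaction 'tid': every inode on the
 * MAIN queue is dequeued, and those modified in a newer transaction are
 * moved to STAGING instead of being dropped.  Returns how many were moved.
 */
static size_t model_fc_cleanup(struct model_inode *inodes, size_t n, tid_t tid)
{
	size_t requeued = 0;

	for (size_t i = 0; i < n; i++) {
		if (inodes[i].queue != MODEL_Q_MAIN)
			continue;
		if (tid_gt(inodes[i].sync_tid, tid)) {
			inodes[i].queue = MODEL_Q_STAGING;
			requeued++;
		} else {
			inodes[i].queue = MODEL_Q_NONE;
		}
	}
	return requeued;
}
```

No extra per-inode field is needed in this model: the inode's own sync tid
is enough to decide whether it still has to be written by a later commit.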
Luis Henriques (SUSE) May 28, 2024, 3:50 p.m. UTC | #6
On Tue 28 May 2024 12:52:03 PM +02, Jan Kara wrote;

> On Tue 28-05-24 12:36:02, Jan Kara wrote:
>> On Mon 27-05-24 16:48:24, Luis Henriques wrote:
>> > On Mon 27 May 2024 09:29:40 AM +01, Luis Henriques wrote;
>> > >>> +	/*
>> > >>> +	 * Used to flag an inode as part of the next fast commit; will be
>> > >>> +	 * reset during fast commit clean-up
>> > >>> +	 */
>> > >>> +	tid_t i_fc_next;
>> > >>> +
>> > >>
>> > >> Do we really need new tid in the inode? I'd be kind of hoping we could use
>> > >> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
>> > >> ext4_fc_track_template() and used for similar comparisons in fast commit
>> > >> code.
>> > >
>> > > Ah, true.  It looks like it could be used indeed.  We'll still need a flag
>> > > here, but a simple bool should be enough for that.
>> > 
>> > After looking again at the code, I'm not 100% sure that this is actually
>> > doable.  For example, if I replace the above by
>> > 
>> > 	bool i_fc_next;
>> > 
>> > and set to to 'true' below:
>
> Forgot to comment on this one: I don't think you even need 'bool i_fc_next'
> - simply whenever i_sync_tid is greater than committing transaction's tid,
> you move the inode to FC_Q_STAGING list in ext4_fc_cleanup().

Yeah, I got that from your other comment in the previous email.  And that
means the actual fix will be a pretty small patch (almost a one-liner).

I'm running some more tests on v3, I'll probably send it later today or
tomorrow.  Thanks a lot for your review (and patience), Jan.

Cheers,
harshad shirwadkar May 29, 2024, 12:01 a.m. UTC | #7
Sorry for getting back late on your patchset - I was on vacation and
checked your patch just now. This is a good catch! My patchset does
not fix this issue. Looking forward to your V3 fix.

Also, using i_sync_tid as Jan suggested sounds like a good way to handle this.

- Harshad


On Tue, May 28, 2024 at 8:50 AM Luis Henriques <luis.henriques@linux.dev> wrote:
>
> On Tue 28 May 2024 12:52:03 PM +02, Jan Kara wrote;
>
> > On Tue 28-05-24 12:36:02, Jan Kara wrote:
> >> On Mon 27-05-24 16:48:24, Luis Henriques wrote:
> >> > On Mon 27 May 2024 09:29:40 AM +01, Luis Henriques wrote;
> >> > >>> +      /*
> >> > >>> +       * Used to flag an inode as part of the next fast commit; will be
> >> > >>> +       * reset during fast commit clean-up
> >> > >>> +       */
> >> > >>> +      tid_t i_fc_next;
> >> > >>> +
> >> > >>
> >> > >> Do we really need new tid in the inode? I'd be kind of hoping we could use
> >> > >> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
> >> > >> ext4_fc_track_template() and used for similar comparisons in fast commit
> >> > >> code.
> >> > >
> >> > > Ah, true.  It looks like it could be used indeed.  We'll still need a flag
> >> > > here, but a simple bool should be enough for that.
> >> >
> >> > After looking again at the code, I'm not 100% sure that this is actually
> >> > doable.  For example, if I replace the above by
> >> >
> >> >    bool i_fc_next;
> >> >
> >> > and set to to 'true' below:
> >
> > Forgot to comment on this one: I don't think you even need 'bool i_fc_next'
> > - simply whenever i_sync_tid is greater than committing transaction's tid,
> > you move the inode to FC_Q_STAGING list in ext4_fc_cleanup().
>
> Yeah, I got that from your other comment in the previous email.  And that
> means the actual fix will be a pretty small patch (almost a one-liner).
>
> I'm running some more tests on v3, I'll probably send it later today or
> tomorrow.  Thanks a lot for your review (and patience), Jan.
>
> Cheers,
> --
> Luís

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 983dad8c07ec..4c308c18c3da 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1062,9 +1062,18 @@  struct ext4_inode_info {
 	/* Fast commit wait queue for this inode */
 	wait_queue_head_t i_fc_wait;
 
-	/* Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len */
+	/*
+	 * Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len,
+	 * i_fc_next
+	 */
 	struct mutex i_fc_lock;
 
+	/*
+	 * Used to flag an inode as part of the next fast commit; will be
+	 * reset during fast commit clean-up
+	 */
+	tid_t i_fc_next;
+
 	/*
 	 * i_disksize keeps track of what the inode size is ON DISK, not
 	 * in memory.  During truncate, i_size is set to the new size by
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 87c009e0c59a..bfdf249f0783 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -402,6 +402,8 @@  static int ext4_fc_track_template(
 				 sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) ?
 				&sbi->s_fc_q[FC_Q_STAGING] :
 				&sbi->s_fc_q[FC_Q_MAIN]);
+	else
+		ei->i_fc_next = tid;
 	spin_unlock(&sbi->s_fc_lock);
 
 	return ret;
@@ -1280,6 +1282,15 @@  static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 	list_for_each_entry_safe(iter, iter_n, &sbi->s_fc_q[FC_Q_MAIN],
 				 i_fc_list) {
 		list_del_init(&iter->i_fc_list);
+		if (iter->i_fc_next == tid)
+			iter->i_fc_next = 0;
+		else if (iter->i_fc_next > tid)
+			/*
+			 * re-enqueue inode into STAGING, which will later be
+			 * splice back into MAIN
+			 */
+			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
+				      &sbi->s_fc_q[FC_Q_STAGING]);
 		ext4_clear_inode_state(&iter->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
 		if (iter->i_sync_tid <= tid)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 893ab80dafba..56f416656d96 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1437,6 +1437,7 @@  static struct inode *ext4_alloc_inode(struct super_block *sb)
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
 	ext4_fc_init_inode(&ei->vfs_inode);
 	mutex_init(&ei->i_fc_lock);
+	ei->i_fc_next = 0;
 	return &ei->vfs_inode;
 }