Message ID | 20240523111618.17012-1-luis.henriques@linux.dev |
---|---|
State | New |
Series | [v2] ext4: fix fast commit inode enqueueing during a full journal commit |
On Thu 23-05-24 12:16:18, Luis Henriques (SUSE) wrote:
> When a full journal commit is on-going, any fast commit has to be enqueued
> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN. This enqueueing
> is done only once, i.e. if an inode is already queued in a previous fast
> commit entry it won't be enqueued again. However, if a full commit starts
> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
> be done into FC_Q_STAGING. And this is not being done in function
> ext4_fc_track_template().
>
> This patch fixes the issue by flagging an inode that is already enqueued in
> either queue. Later, during the fast commit clean-up callback, if the
> inode has a tid that is bigger than the one being handled, that inode is
> re-enqueued into STAGING and then spliced back into MAIN.
>
> This bug was found using fstest generic/047. This test creates several
> 32k-byte files, sync'ing each of them after its creation, and then shutting
> down the filesystem. Some data may be lost in this operation; for example a
> file may have its size truncated to zero.
>
> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>

Thanks for the fix. Some comments below:

> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 983dad8c07ec..4c308c18c3da 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1062,9 +1062,18 @@ struct ext4_inode_info {
>  	/* Fast commit wait queue for this inode */
>  	wait_queue_head_t i_fc_wait;
>
> -	/* Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len */
> +	/*
> +	 * Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len,
> +	 * i_fc_next
> +	 */
>  	struct mutex i_fc_lock;
>
> +	/*
> +	 * Used to flag an inode as part of the next fast commit; will be
> +	 * reset during fast commit clean-up
> +	 */
> +	tid_t i_fc_next;
> +

Do we really need a new tid in the inode? I'd be kind of hoping we could use
EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
ext4_fc_track_template() and used for similar comparisons in fast commit
code.

> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 87c009e0c59a..bfdf249f0783 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -402,6 +402,8 @@ static int ext4_fc_track_template(
>  				sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) ?
>  				&sbi->s_fc_q[FC_Q_STAGING] :
>  				&sbi->s_fc_q[FC_Q_MAIN]);
> +	else
> +		ei->i_fc_next = tid;
>  	spin_unlock(&sbi->s_fc_lock);
>
>  	return ret;
> @@ -1280,6 +1282,15 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>  	list_for_each_entry_safe(iter, iter_n, &sbi->s_fc_q[FC_Q_MAIN],
>  				 i_fc_list) {
>  		list_del_init(&iter->i_fc_list);
> +		if (iter->i_fc_next == tid)
> +			iter->i_fc_next = 0;
> +		else if (iter->i_fc_next > tid)
				^^^ careful here, TIDs do wrap so you need to use
tid_geq() for comparison.

> +			/*
> +			 * re-enqueue inode into STAGING, which will later be
> +			 * splice back into MAIN
> +			 */
> +			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
> +				      &sbi->s_fc_q[FC_Q_STAGING]);
>  		ext4_clear_inode_state(&iter->vfs_inode,
>  				       EXT4_STATE_FC_COMMITTING);
>  		if (iter->i_sync_tid <= tid)
				^^^ and I can see this is buggy as well and needs
tid_geq() (not your fault obviously).

								Honza
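To make the wraparound concern concrete, here is a small userspace sketch, not kernel code. It re-implements tid_gt() and tid_geq() the way jbd2 is understood to define them (a signed difference of two unsigned TIDs); the typedef and the helpers naive_gt_is_wrong() and tid_wrap_selfcheck() are hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <limits.h>

/* Userspace stand-in for the kernel's tid_t (assumed to be an unsigned int). */
typedef unsigned int tid_t;

/* Wrap-safe "x is newer than y", modelled on jbd2's tid_gt(). */
static inline int tid_gt(tid_t x, tid_t y)
{
	int difference = (int)(x - y);

	return difference > 0;
}

/* Wrap-safe "x is newer than or equal to y", modelled on jbd2's tid_geq(). */
static inline int tid_geq(tid_t x, tid_t y)
{
	int difference = (int)(x - y);

	return difference >= 0;
}

/* Hypothetical helper: returns 1 when a plain '>' disagrees with tid_gt(). */
static int naive_gt_is_wrong(tid_t newer, tid_t older)
{
	return (newer > older) != tid_gt(newer, older);
}

/* Hypothetical self-check exercising the comparison around the wrap point. */
static void tid_wrap_selfcheck(void)
{
	tid_t before_wrap = UINT_MAX;       /* last TID before the counter wraps */
	tid_t after_wrap = before_wrap + 2; /* unsigned overflow wraps this to 1 */

	assert(after_wrap == 1);
	assert(tid_gt(after_wrap, before_wrap));   /* helper sees it as newer */
	assert(!(after_wrap > before_wrap));       /* plain '>' does not */
	assert(naive_gt_is_wrong(after_wrap, before_wrap));
	assert(!naive_gt_is_wrong(10, 5));         /* far from the wrap, both agree */
}
```

With helpers like these, a check written as `a > b` would become `tid_gt(a, b)`, and `a <= b` would become `tid_geq(b, a)`.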
On Fri 24 May 2024 06:22:31 PM +02, Jan Kara wrote:

> On Thu 23-05-24 12:16:18, Luis Henriques (SUSE) wrote:
>> When a full journal commit is on-going, any fast commit has to be enqueued
>> into a different queue: FC_Q_STAGING instead of FC_Q_MAIN. This enqueueing
>> is done only once, i.e. if an inode is already queued in a previous fast
>> commit entry it won't be enqueued again. However, if a full commit starts
>> _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
>> be done into FC_Q_STAGING. And this is not being done in function
>> ext4_fc_track_template().
>>
>> This patch fixes the issue by flagging an inode that is already enqueued in
>> either queue. Later, during the fast commit clean-up callback, if the
>> inode has a tid that is bigger than the one being handled, that inode is
>> re-enqueued into STAGING and then spliced back into MAIN.
>>
>> This bug was found using fstest generic/047. This test creates several
>> 32k-byte files, sync'ing each of them after its creation, and then shutting
>> down the filesystem. Some data may be lost in this operation; for example a
>> file may have its size truncated to zero.
>>
>> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
>
> Thanks for the fix. Some comments below:
>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 983dad8c07ec..4c308c18c3da 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1062,9 +1062,18 @@ struct ext4_inode_info {
>>  	/* Fast commit wait queue for this inode */
>>  	wait_queue_head_t i_fc_wait;
>>
>> -	/* Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len */
>> +	/*
>> +	 * Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len,
>> +	 * i_fc_next
>> +	 */
>>  	struct mutex i_fc_lock;
>>
>> +	/*
>> +	 * Used to flag an inode as part of the next fast commit; will be
>> +	 * reset during fast commit clean-up
>> +	 */
>> +	tid_t i_fc_next;
>> +
>
> Do we really need a new tid in the inode? I'd be kind of hoping we could use
> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
> ext4_fc_track_template() and used for similar comparisons in fast commit
> code.

Ah, true. It looks like it could be used indeed. We'll still need a flag
here, but a simple bool should be enough for that.

>
>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
>> index 87c009e0c59a..bfdf249f0783 100644
>> --- a/fs/ext4/fast_commit.c
>> +++ b/fs/ext4/fast_commit.c
>> @@ -402,6 +402,8 @@ static int ext4_fc_track_template(
>>  				sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) ?
>>  				&sbi->s_fc_q[FC_Q_STAGING] :
>>  				&sbi->s_fc_q[FC_Q_MAIN]);
>> +	else
>> +		ei->i_fc_next = tid;
>>  	spin_unlock(&sbi->s_fc_lock);
>>
>>  	return ret;
>> @@ -1280,6 +1282,15 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
>>  	list_for_each_entry_safe(iter, iter_n, &sbi->s_fc_q[FC_Q_MAIN],
>>  				 i_fc_list) {
>>  		list_del_init(&iter->i_fc_list);
>> +		if (iter->i_fc_next == tid)
>> +			iter->i_fc_next = 0;
>> +		else if (iter->i_fc_next > tid)
> 				^^^ careful here, TIDs do wrap so you need to use
> tid_geq() for comparison.
>

Yikes! Thanks, I'll update the code to do that.

>> +			/*
>> +			 * re-enqueue inode into STAGING, which will later be
>> +			 * splice back into MAIN
>> +			 */
>> +			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
>> +				      &sbi->s_fc_q[FC_Q_STAGING]);
>>  		ext4_clear_inode_state(&iter->vfs_inode,
>>  				       EXT4_STATE_FC_COMMITTING);
>>  		if (iter->i_sync_tid <= tid)
> 				^^^ and I can see this is buggy as
> well and needs tid_geq() (not your fault obviously).

Yeah, good point. I can do that too in v3. Again, thanks a lot for your
review!

Cheers,
On Mon 27 May 2024 09:29:40 AM +01, Luis Henriques wrote:

<snip>

>>> +	/*
>>> +	 * Used to flag an inode as part of the next fast commit; will be
>>> +	 * reset during fast commit clean-up
>>> +	 */
>>> +	tid_t i_fc_next;
>>> +
>>
>> Do we really need a new tid in the inode? I'd be kind of hoping we could use
>> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
>> ext4_fc_track_template() and used for similar comparisons in fast commit
>> code.
>
> Ah, true. It looks like it could be used indeed. We'll still need a flag
> here, but a simple bool should be enough for that.

After looking again at the code, I'm not 100% sure that this is actually
doable. For example, if I replace the above by

	bool i_fc_next;

and set it to 'true' below:

>>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
>>> index 87c009e0c59a..bfdf249f0783 100644
>>> --- a/fs/ext4/fast_commit.c
>>> +++ b/fs/ext4/fast_commit.c
>>> @@ -402,6 +402,8 @@ static int ext4_fc_track_template(
>>>  				sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) ?
>>>  				&sbi->s_fc_q[FC_Q_STAGING] :
>>>  				&sbi->s_fc_q[FC_Q_MAIN]);
>>> +	else
>>> +		ei->i_fc_next = tid;

		ei->i_fc_next = true;

Then, when we get to ext4_fc_cleanup(), the value of iter->i_sync_tid
may have changed in the meantime from, e.g., ext4_do_update_inode() or
__ext4_iget(). This would cause the clean-up code to be bogus if it still
implements the logic below, by comparing the tid with i_sync_tid.
(Although, to be honest, I couldn't see any visible effect in the quick
testing I've done.) Or am I missing something, and this is *exactly* the
behaviour you'd expect?

Cheers,
On Mon 27-05-24 16:48:24, Luis Henriques wrote:
> On Mon 27 May 2024 09:29:40 AM +01, Luis Henriques wrote:
> >>> +	/*
> >>> +	 * Used to flag an inode as part of the next fast commit; will be
> >>> +	 * reset during fast commit clean-up
> >>> +	 */
> >>> +	tid_t i_fc_next;
> >>> +
> >>
> >> Do we really need a new tid in the inode? I'd be kind of hoping we could use
> >> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
> >> ext4_fc_track_template() and used for similar comparisons in fast commit
> >> code.
> >
> > Ah, true. It looks like it could be used indeed. We'll still need a flag
> > here, but a simple bool should be enough for that.
>
> After looking again at the code, I'm not 100% sure that this is actually
> doable. For example, if I replace the above by
>
> 	bool i_fc_next;
>
> and set it to 'true' below:
>
> >>> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> >>> index 87c009e0c59a..bfdf249f0783 100644
> >>> --- a/fs/ext4/fast_commit.c
> >>> +++ b/fs/ext4/fast_commit.c
> >>> @@ -402,6 +402,8 @@ static int ext4_fc_track_template(
> >>>  				sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) ?
> >>>  				&sbi->s_fc_q[FC_Q_STAGING] :
> >>>  				&sbi->s_fc_q[FC_Q_MAIN]);
> >>> +	else
> >>> +		ei->i_fc_next = tid;
>
> 		ei->i_fc_next = true;
>
> Then, when we get to ext4_fc_cleanup(), the value of iter->i_sync_tid
> may have changed in the meantime from, e.g., ext4_do_update_inode() or
> __ext4_iget(). This would cause the clean-up code to be bogus if it still
> implements the logic below, by comparing the tid with i_sync_tid.
> (Although, to be honest, I couldn't see any visible effect in the quick
> testing I've done.) Or am I missing something, and this is *exactly* the
> behaviour you'd expect?

Yes, this is the behavior I'd expect. The rationale is that if i_sync_tid
points to the running transaction, it means the inode was modified in it,
which means fastcommit needs to write it out. In fact the
ext4_update_inode_fsync_trans() calls usually happen together with
ext4_fc_track_...() calls. This could use some cleanup so that we don't set
i_sync_tid in two places unnecessarily but that's for some other time...

								Honza
On Tue 28-05-24 12:36:02, Jan Kara wrote:
> On Mon 27-05-24 16:48:24, Luis Henriques wrote:
> > On Mon 27 May 2024 09:29:40 AM +01, Luis Henriques wrote:
> > >>> +	/*
> > >>> +	 * Used to flag an inode as part of the next fast commit; will be
> > >>> +	 * reset during fast commit clean-up
> > >>> +	 */
> > >>> +	tid_t i_fc_next;
> > >>> +
> > >>
> > >> Do we really need a new tid in the inode? I'd be kind of hoping we could use
> > >> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
> > >> ext4_fc_track_template() and used for similar comparisons in fast commit
> > >> code.
> > >
> > > Ah, true. It looks like it could be used indeed. We'll still need a flag
> > > here, but a simple bool should be enough for that.
> >
> > After looking again at the code, I'm not 100% sure that this is actually
> > doable. For example, if I replace the above by
> >
> > 	bool i_fc_next;
> >
> > and set it to 'true' below:

Forgot to comment on this one: I don't think you even need 'bool i_fc_next' -
simply whenever i_sync_tid is greater than the committing transaction's tid,
you move the inode to the FC_Q_STAGING list in ext4_fc_cleanup().

								Honza
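The suggestion above can be modelled in userspace roughly as follows. This is only a sketch of the idea under stated assumptions, not the v3 patch: struct toy_inode, the FC_Q_NONE value and toy_fc_cleanup() are hypothetical, an array stands in for the kernel's linked lists, and tid_gt() is a local copy of the wrap-safe comparison.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's tid_t (assumed to be an unsigned int). */
typedef unsigned int tid_t;

/* FC_Q_NONE is a hypothetical marker for "not on any fast-commit queue". */
enum fc_queue { FC_Q_NONE, FC_Q_MAIN, FC_Q_STAGING };

/* Toy inode carrying only the two fields this sketch needs. */
struct toy_inode {
	tid_t i_sync_tid;     /* last transaction that modified the inode */
	enum fc_queue queue;  /* queue the inode currently sits on */
};

/* Wrap-safe "x is newer than y" (local copy for this sketch). */
static int tid_gt(tid_t x, tid_t y)
{
	int difference = (int)(x - y);

	return difference > 0;
}

/*
 * Toy clean-up pass for the transaction 'tid' that has just committed.
 * Inodes on MAIN that were modified by a newer transaction are moved to
 * STAGING instead of being dropped; STAGING is then spliced back into MAIN.
 * Returns how many inodes were re-queued from MAIN.
 */
static size_t toy_fc_cleanup(struct toy_inode *inodes, size_t n, tid_t tid)
{
	size_t requeued = 0;

	for (size_t i = 0; i < n; i++) {
		if (inodes[i].queue != FC_Q_MAIN)
			continue;
		if (tid_gt(inodes[i].i_sync_tid, tid)) {
			inodes[i].queue = FC_Q_STAGING;
			requeued++;
		} else {
			inodes[i].queue = FC_Q_NONE;
		}
	}

	/* Splice step: everything on STAGING becomes the new MAIN queue. */
	for (size_t i = 0; i < n; i++) {
		if (inodes[i].queue == FC_Q_STAGING)
			inodes[i].queue = FC_Q_MAIN;
	}

	return requeued;
}
```

In this model no extra per-inode flag is needed: the decision depends only on i_sync_tid and the tid passed to the clean-up pass.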
On Tue 28 May 2024 12:52:03 PM +02, Jan Kara wrote:

> On Tue 28-05-24 12:36:02, Jan Kara wrote:
>> On Mon 27-05-24 16:48:24, Luis Henriques wrote:
>> > On Mon 27 May 2024 09:29:40 AM +01, Luis Henriques wrote:
>> > >>> +	/*
>> > >>> +	 * Used to flag an inode as part of the next fast commit; will be
>> > >>> +	 * reset during fast commit clean-up
>> > >>> +	 */
>> > >>> +	tid_t i_fc_next;
>> > >>> +
>> > >>
>> > >> Do we really need a new tid in the inode? I'd be kind of hoping we could use
>> > >> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
>> > >> ext4_fc_track_template() and used for similar comparisons in fast commit
>> > >> code.
>> > >
>> > > Ah, true. It looks like it could be used indeed. We'll still need a flag
>> > > here, but a simple bool should be enough for that.
>> >
>> > After looking again at the code, I'm not 100% sure that this is actually
>> > doable. For example, if I replace the above by
>> >
>> > 	bool i_fc_next;
>> >
>> > and set it to 'true' below:
>
> Forgot to comment on this one: I don't think you even need 'bool i_fc_next' -
> simply whenever i_sync_tid is greater than the committing transaction's tid,
> you move the inode to the FC_Q_STAGING list in ext4_fc_cleanup().

Yeah, I got that from your other comment in the previous email. And that
means the actual fix will be a pretty small patch (almost a one-liner).

I'm running some more tests on v3, I'll probably send it later today or
tomorrow. Thanks a lot for your review (and patience), Jan.

Cheers,
Sorry for getting back late on your patchset - I was on vacation and
checked your patch just now. This is a good catch! My patchset does
not fix this issue. Looking forward to your V3 fix. Also, using
i_sync_tid as Jan suggested sounds like a good way to handle this.

- Harshad

On Tue, May 28, 2024 at 8:50 AM Luis Henriques <luis.henriques@linux.dev> wrote:
>
> On Tue 28 May 2024 12:52:03 PM +02, Jan Kara wrote:
>
> > On Tue 28-05-24 12:36:02, Jan Kara wrote:
> >> On Mon 27-05-24 16:48:24, Luis Henriques wrote:
> >> > On Mon 27 May 2024 09:29:40 AM +01, Luis Henriques wrote:
> >> > >>> +	/*
> >> > >>> +	 * Used to flag an inode as part of the next fast commit; will be
> >> > >>> +	 * reset during fast commit clean-up
> >> > >>> +	 */
> >> > >>> +	tid_t i_fc_next;
> >> > >>> +
> >> > >>
> >> > >> Do we really need a new tid in the inode? I'd be kind of hoping we could use
> >> > >> EXT4_I(inode)->i_sync_tid for this - I can see we even already set it in
> >> > >> ext4_fc_track_template() and used for similar comparisons in fast commit
> >> > >> code.
> >> > >
> >> > > Ah, true. It looks like it could be used indeed. We'll still need a flag
> >> > > here, but a simple bool should be enough for that.
> >> >
> >> > After looking again at the code, I'm not 100% sure that this is actually
> >> > doable. For example, if I replace the above by
> >> >
> >> > 	bool i_fc_next;
> >> >
> >> > and set it to 'true' below:
> >
> > Forgot to comment on this one: I don't think you even need 'bool i_fc_next' -
> > simply whenever i_sync_tid is greater than the committing transaction's tid,
> > you move the inode to the FC_Q_STAGING list in ext4_fc_cleanup().
>
> Yeah, I got that from your other comment in the previous email. And that
> means the actual fix will be a pretty small patch (almost a one-liner).
>
> I'm running some more tests on v3, I'll probably send it later today or
> tomorrow. Thanks a lot for your review (and patience), Jan.
>
> Cheers,
> --
> Luís
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 983dad8c07ec..4c308c18c3da 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1062,9 +1062,18 @@ struct ext4_inode_info {
 	/* Fast commit wait queue for this inode */
 	wait_queue_head_t i_fc_wait;
 
-	/* Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len */
+	/*
+	 * Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len,
+	 * i_fc_next
+	 */
 	struct mutex i_fc_lock;
 
+	/*
+	 * Used to flag an inode as part of the next fast commit; will be
+	 * reset during fast commit clean-up
+	 */
+	tid_t i_fc_next;
+
 	/*
 	 * i_disksize keeps track of what the inode size is ON DISK, not
 	 * in memory. During truncate, i_size is set to the new size by
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 87c009e0c59a..bfdf249f0783 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -402,6 +402,8 @@ static int ext4_fc_track_template(
 				sbi->s_journal->j_flags & JBD2_FAST_COMMIT_ONGOING) ?
 				&sbi->s_fc_q[FC_Q_STAGING] :
 				&sbi->s_fc_q[FC_Q_MAIN]);
+	else
+		ei->i_fc_next = tid;
 	spin_unlock(&sbi->s_fc_lock);
 
 	return ret;
@@ -1280,6 +1282,15 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 	list_for_each_entry_safe(iter, iter_n, &sbi->s_fc_q[FC_Q_MAIN],
 				 i_fc_list) {
 		list_del_init(&iter->i_fc_list);
+		if (iter->i_fc_next == tid)
+			iter->i_fc_next = 0;
+		else if (iter->i_fc_next > tid)
+			/*
+			 * re-enqueue inode into STAGING, which will later be
+			 * splice back into MAIN
+			 */
+			list_add_tail(&EXT4_I(&iter->vfs_inode)->i_fc_list,
+				      &sbi->s_fc_q[FC_Q_STAGING]);
 		ext4_clear_inode_state(&iter->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
 		if (iter->i_sync_tid <= tid)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 893ab80dafba..56f416656d96 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1437,6 +1437,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
 	ext4_fc_init_inode(&ei->vfs_inode);
 	mutex_init(&ei->i_fc_lock);
+	ei->i_fc_next = 0;
 	return &ei->vfs_inode;
 }
When a full journal commit is on-going, any fast commit has to be enqueued
into a different queue: FC_Q_STAGING instead of FC_Q_MAIN. This enqueueing
is done only once, i.e. if an inode is already queued in a previous fast
commit entry it won't be enqueued again. However, if a full commit starts
_after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
be done into FC_Q_STAGING. And this is not being done in function
ext4_fc_track_template().

This patch fixes the issue by flagging an inode that is already enqueued in
either queue. Later, during the fast commit clean-up callback, if the
inode has a tid that is bigger than the one being handled, that inode is
re-enqueued into STAGING and then spliced back into MAIN.

This bug was found using fstest generic/047. This test creates several
32k-byte files, sync'ing each of them after its creation, and then shutting
down the filesystem. Some data may be lost in this operation; for example a
file may have its size truncated to zero.

Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
---
Hi!

(Now Cc'ing Harshad, as I should have done in the initial RFC.)

This v2 is a completely different solution, hinted by Jan Kara. I hope my
understanding of his suggestion is correct. Also, I've dropped the second
patch as it didn't make sense, as Jan also pointed out.

Finally, I haven't yet done a review of Harshad's patchset [1] (hope to get to
it soon), but a quick test shows the issue is still present there. The good
news is that this patch can be trivially applied on top of it.

[1] https://lore.kernel.org/all/20240520055153.136091-1-harshadshirwadkar@gmail.com

Cheers,
--
Luis

 fs/ext4/ext4.h        | 11 ++++++++++-
 fs/ext4/fast_commit.c | 11 +++++++++++
 fs/ext4/super.c       |  1 +
 3 files changed, 22 insertions(+), 1 deletion(-)
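The enqueue-once behaviour described in the commit message can be reproduced in a toy userspace model, which may help when reasoning about the bug. Everything here is an assumption made for illustration: struct toy_inode, FC_Q_NONE and the toy_*() functions are hypothetical, and the clean-up models the pre-fix behaviour only (drop everything on MAIN, then splice STAGING into MAIN).

```c
#include <stdbool.h>

/* FC_Q_NONE is a hypothetical marker for "not on any fast-commit queue". */
enum fc_queue { FC_Q_NONE, FC_Q_MAIN, FC_Q_STAGING };

/* Toy inode that only remembers which queue it is on. */
struct toy_inode {
	enum fc_queue queue;
};

/*
 * Toy tracking step: the queue is chosen only when the inode is first
 * enqueued. An inode that is already queued is left where it is, even if
 * a commit has started since.
 */
static void toy_track(struct toy_inode *inode, bool commit_ongoing)
{
	if (inode->queue != FC_Q_NONE)
		return;
	inode->queue = commit_ongoing ? FC_Q_STAGING : FC_Q_MAIN;
}

/* Toy pre-fix clean-up: MAIN entries are dropped, STAGING becomes MAIN. */
static void toy_cleanup_unfixed(struct toy_inode *inode)
{
	if (inode->queue == FC_Q_MAIN)
		inode->queue = FC_Q_NONE;
	else if (inode->queue == FC_Q_STAGING)
		inode->queue = FC_Q_MAIN;
}

/*
 * Returns true if an update made while a commit is on-going is still
 * tracked (inode on MAIN) after the clean-up for that commit has run.
 */
static bool toy_tracked_after_cleanup(bool queued_before_commit)
{
	struct toy_inode inode = { .queue = FC_Q_NONE };

	if (queued_before_commit)
		toy_track(&inode, false); /* first update: no commit running */
	toy_track(&inode, true);          /* update while the commit is on-going */
	toy_cleanup_unfixed(&inode);

	return inode.queue == FC_Q_MAIN;
}
```

In this model an inode first touched during the commit survives the clean-up, while an inode that was already on MAIN loses its later update, which matches the scenario the patch is trying to address.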