Message ID | 20230531115100.2779605-5-yi.zhang@huaweicloud.com
---|---
State | Awaiting Upstream
Series | jbd2: fix several checkpoint inconsistent issues
On Wed 31-05-23 19:50:59, Zhang Yi wrote:
> From: Zhihao Cheng <chengzhihao1@huawei.com>
>
> Following process,
>
> jbd2_journal_commit_transaction
>  // there are several dirty buffer heads in transaction->t_checkpoint_list
>         P1                    wb_workfn
> jbd2_log_do_checkpoint
>  if (buffer_locked(bh)) // false
>                        __block_write_full_page
>                         trylock_buffer(bh)
>                         test_clear_buffer_dirty(bh)
>  if (!buffer_dirty(bh))
>   __jbd2_journal_remove_checkpoint(jh)
>                         if (buffer_write_io_error(bh)) // false
>                        >> bh IO error occurs <<
>  jbd2_cleanup_journal_tail
>   __jbd2_update_log_tail
>    jbd2_write_superblock
>    // The bh won't be replayed in next mount.
>
> , which could corrupt the ext4 image, fetch a reproducer in [Link].
>
> Since writeback process clears buffer dirty after locking buffer head,
> we can fix it by checking buffer dirty firstly and then checking buffer
> locked, the buffer head can be removed if it is neither dirty nor locked.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
> Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

OK, the analysis is correct but I'm afraid the fix won't be that easy. The
reordering of tests you did below doesn't really help because the CPU or the
compiler is free to order the loads (and stores) in whatever way it wishes.
You'd have to use memory barriers when reading and modifying bh flags
(although the modification side is implicitly handled by the bitlock code)
to make this work reliably. But that is IMHO too subtle for this code.

What we should be doing to avoid these races is to lock the bh. So
something like:

	if (jh->b_transaction != NULL) {
		do stuff
	}
	if (!trylock_buffer(bh)) {
		buffer_locked() branch
	}
	... Now we have the buffer locked and can safely check for dirtiness

And we need to do a similar treatment for journal_clean_one_cp_list() and
journal_shrink_one_cp_list().
BTW, I think we could merge journal_clean_one_cp_list() and
journal_shrink_one_cp_list() into a single common function. I think we can
drop the nr_to_scan argument and just always clean up the whole checkpoint
list and return the number of freed buffers. That way we have one less
function to deal with checkpoint list cleaning.

Thinking about it some more, maybe we can have a function like:

	int jbd2_try_remove_checkpoint(struct journal_head *jh)
	{
		struct buffer_head *bh = jh2bh(jh);

		if (!trylock_buffer(bh))
			return -EBUSY;
		if (buffer_dirty(bh)) {
			unlock_buffer(bh);
			return -EBUSY;
		}
		/*
		 * Buffer is clean and the IO has finished (we hold the buffer
		 * lock) so the checkpoint is done. We can safely remove the
		 * buffer from this transaction.
		 */
		unlock_buffer(bh);
		return __jbd2_journal_remove_checkpoint(jh);
	}

and that can be used with a bit of care in the checkpointing functions as
well as in jbd2_journal_forget(), __journal_try_to_free_buffer() and
journal_unmap_buffer().

								Honza
On 2023/6/1 17:41, Jan Kara wrote:

Hi, Jan

> On Wed 31-05-23 19:50:59, Zhang Yi wrote:
>> From: Zhihao Cheng <chengzhihao1@huawei.com>
>>
>> [snip patch description]
>
> OK, the analysis is correct but I'm afraid the fix won't be that easy. The
> reordering of tests you did below doesn't really help because the CPU or the
> compiler is free to order the loads (and stores) in whatever way it wishes.
> You'd have to use memory barriers when reading and modifying bh flags
> (although the modification side is implicitly handled by the bitlock code)
> to make this work reliably. But that is IMHO too subtle for this code.
>

Do you mean there might be a sequence like the following:

  jbd2_log_do_checkpoint
   if (buffer_dirty(bh))
   else if (buffer_locked(bh))
   else
    __jbd2_journal_remove_checkpoint(jh)

where the CPU re-arranges the order of reading the buffer state:

   reg_1 = buffer_locked(bh) // false
                              lock_buffer(bh)
                              clear_buffer_dirty(bh)
   reg_2 = buffer_dirty(bh)  // false

Then, jbd2_log_do_checkpoint() could become:

   if (reg_2)
   else if (reg_1)
   else
    __jbd2_journal_remove_checkpoint(jh) // enter !

Am I understanding this right?

> What we should be doing to avoid these races is to lock the bh. So
> something like:
>
> 	if (jh->b_transaction != NULL) {
> 		do stuff
> 	}
> 	if (!trylock_buffer(bh)) {
> 		buffer_locked() branch
> 	}
> 	... Now we have the buffer locked and can safely check for dirtiness
>
> And we need to do a similar treatment for journal_clean_one_cp_list() and
> journal_shrink_one_cp_list().
>
> [snip merge and jbd2_try_remove_checkpoint() suggestions]
>
> 								Honza
On 2023/6/1 21:44, Zhihao Cheng wrote:
> On 2023/6/1 17:41, Jan Kara wrote:
>
> Hi, Jan
>> On Wed 31-05-23 19:50:59, Zhang Yi wrote:
>>> [snip patch description]
>>
>> OK, the analysis is correct but I'm afraid the fix won't be that easy. The
>> reordering of tests you did below doesn't really help because the CPU or the
>> compiler is free to order the loads (and stores) in whatever way it wishes.
>> You'd have to use memory barriers when reading and modifying bh flags
>> (although the modification side is implicitly handled by the bitlock code)
>> to make this work reliably. But that is IMHO too subtle for this code.
>>

I wrote two litmus-test files in tools/memory-model to check the memory model
of these two cases, as jbd2_log_do_checkpoint() and __cp_buffer_busy() do.

1. test-ww-rr.litmus  // simulates our changes in jbd2_log_do_checkpoint()

C WW-RR

(*
 * Result: Never
 *
 * This test shows that the write-write ordering in P0()
 * is visible to the external process P1().
 *
 * bit 0: lock
 * bit 1: dirty
 *)

{ x=2; }

P0(int *x)
{
	int r1;

	r1 = READ_ONCE(*x);
	r1 = r1 | 1;		// lock
	WRITE_ONCE(*x, r1);
	r1 = READ_ONCE(*x);
	r1 = r1 & 253;		// &~2, clear dirty
	WRITE_ONCE(*x, r1);
}

P1(int *x)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*x);
	r1 = READ_ONCE(*x);
}

exists (1:r1=2 /\ 1:r0=1)

The test results are:

$ herd7 -conf linux-kernel.cfg litmus-tests/test-ww-rr.litmus
Test WW-RR Allowed
States 6
1:r0=1; 1:r1=1;
1:r0=2; 1:r1=1;
1:r0=2; 1:r1=2;
1:r0=2; 1:r1=3;
1:r0=3; 1:r1=1;
1:r0=3; 1:r1=3;
No
Witnesses
Positive: 0 Negative: 6
Condition exists (1:r1=2 /\ 1:r0=1)
Observation WW-RR Never 0 6
Time WW-RR 0.02
Hash=d91ecee2379f8ac878b8d06f17967874

2. test-ww-r.litmus  // simulates our changes in __cp_buffer_busy()

C WW-R

(*
 * Result: Never
 *
 * This test shows that the write-write ordering in P0()
 * is visible to the external process P1().
 *
 * bit 0: lock
 * bit 1: dirty
 *)

{ x=2; }

P0(int *x)
{
	int r1;

	r1 = READ_ONCE(*x);
	r1 = r1 | 1;		// lock
	WRITE_ONCE(*x, r1);
	r1 = READ_ONCE(*x);
	r1 = r1 & 253;		// &~2, clear dirty
	WRITE_ONCE(*x, r1);
}

P1(int *x)
{
	int r0;

	r0 = READ_ONCE(*x);
}

exists (1:r0=0)

The test results are:

$ herd7 -conf linux-kernel.cfg litmus-tests/test-ww-r.litmus
Test WW-R Allowed
States 3
1:r0=1;
1:r0=2;
1:r0=3;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (1:r0=0)
Observation WW-R Never 0 3
Time WW-R 0.01
Hash=d76bb39c367dc0cbd0c363a87c6c9eb7

So it looks like the out-of-order situation cannot happen, am I missing
something?

Thanks,
Yi.

> Do you mean there might be a sequence like the following:
>
> [snip reordering example]
>
> Am I understanding this right?
>
>> [snip Jan's suggestions]
On Thu 01-06-23 22:20:38, Zhang Yi wrote:
> On 2023/6/1 21:44, Zhihao Cheng wrote:
>> On 2023/6/1 17:41, Jan Kara wrote:
>>
>> Hi, Jan
>>> [snip patch description and my earlier reply]
>
> I wrote two litmus-test files in tools/memory-model to check the memory model
> of these two cases, as jbd2_log_do_checkpoint() and __cp_buffer_busy() do.

<snip litmus tests>

> So it looks like the out-of-order situation cannot happen, am I missing
> something?

After thinking about it a bit, indeed the CPU cannot reorder the two loads
because they are from the same location in memory. Thanks for correcting me
on this. I'm not sure whether a C compiler could still reorder the tests -
technically I suspect the C standard does not forbid this, although it would
have to be a really evil compiler to do this.

But still, I think with the helper function I've suggested things are much
more obviously correct.

								Honza
On 2023/6/2 0:31, Jan Kara wrote:
> On Thu 01-06-23 22:20:38, Zhang Yi wrote:
>> On 2023/6/1 21:44, Zhihao Cheng wrote:
>>> [snip patch description and earlier discussion]
>>
>> I wrote two litmus-test files in tools/memory-model to check the memory model
>> of these two cases, as jbd2_log_do_checkpoint() and __cp_buffer_busy() do.
>
> <snip litmus tests>
>
>> So it looks like the out-of-order situation cannot happen, am I missing
>> something?
>
> After thinking about it a bit, indeed the CPU cannot reorder the two loads
> because they are from the same location in memory. Thanks for correcting me
> on this. I'm not sure whether a C compiler could still reorder the tests -
> technically I suspect the C standard does not forbid this, although it would
> have to be a really evil compiler to do this.
>
> But still, I think with the helper function I've suggested things are much
> more obviously correct.

Thanks for the suggestions, we will modify it in the next iteration.
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 3f52560912a9..620f3d345f3d 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -204,20 +204,6 @@ int jbd2_log_do_checkpoint(journal_t *journal)
 		jh = transaction->t_checkpoint_list;
 		bh = jh2bh(jh);
 
-		/*
-		 * The buffer may be writing back, or flushing out in the
-		 * last couple of cycles, or re-adding into a new transaction,
-		 * need to check it again until it's unlocked.
-		 */
-		if (buffer_locked(bh)) {
-			get_bh(bh);
-			spin_unlock(&journal->j_list_lock);
-			wait_on_buffer(bh);
-			/* the journal_head may have gone by now */
-			BUFFER_TRACE(bh, "brelse");
-			__brelse(bh);
-			goto retry;
-		}
 		if (jh->b_transaction != NULL) {
 			transaction_t *t = jh->b_transaction;
 			tid_t tid = t->t_tid;
@@ -252,16 +238,7 @@ int jbd2_log_do_checkpoint(journal_t *journal)
 			spin_lock(&journal->j_list_lock);
 			goto restart;
 		}
-		if (!buffer_dirty(bh)) {
-			BUFFER_TRACE(bh, "remove from checkpoint");
-			/*
-			 * If the transaction was released or the checkpoint
-			 * list was empty, we're done.
-			 */
-			if (__jbd2_journal_remove_checkpoint(jh) ||
-			    !transaction->t_checkpoint_list)
-				goto out;
-		} else {
+		if (buffer_dirty(bh)) {
 			/*
 			 * We are about to write the buffer, it could be
 			 * raced by some other transaction shrink or buffer
@@ -275,6 +252,29 @@ int jbd2_log_do_checkpoint(journal_t *journal)
 			journal->j_chkpt_bhs[batch_count++] = bh;
 			transaction->t_chp_stats.cs_written++;
 			transaction->t_checkpoint_list = jh->b_cpnext;
+		} else if (buffer_locked(bh)) {
+			/*
+			 * The buffer may be writing back, or flushing out
+			 * in the last couple of cycles, or re-adding into
+			 * a new transaction, need to check it again until
+			 * it's unlocked.
+			 */
+			get_bh(bh);
+			spin_unlock(&journal->j_list_lock);
+			wait_on_buffer(bh);
+			/* the journal_head may have gone by now */
+			BUFFER_TRACE(bh, "brelse");
+			__brelse(bh);
+			goto retry;
+		} else {
+			BUFFER_TRACE(bh, "remove from checkpoint");
+			/*
+			 * If the transaction was released or the checkpoint
+			 * list was empty, we're done.
+			 */
+			if (__jbd2_journal_remove_checkpoint(jh) ||
+			    !transaction->t_checkpoint_list)
+				goto out;
 		}
 
 		if ((batch_count == JBD2_NR_BATCH) ||