diff mbox series

[4/5] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint

Message ID 20230531115100.2779605-5-yi.zhang@huaweicloud.com
State Awaiting Upstream
Headers show
Series jbd2: fix several checkpoint inconsistent issues | expand

Commit Message

Zhang Yi May 31, 2023, 11:50 a.m. UTC
From: Zhihao Cheng <chengzhihao1@huawei.com>

Following process,

jbd2_journal_commit_transaction
// there are several dirty buffer heads in transaction->t_checkpoint_list
          P1                   wb_workfn
jbd2_log_do_checkpoint
 if (buffer_locked(bh)) // false
                            __block_write_full_page
                             trylock_buffer(bh)
                             test_clear_buffer_dirty(bh)
 if (!buffer_dirty(bh))
  __jbd2_journal_remove_checkpoint(jh)
   if (buffer_write_io_error(bh)) // false
                             >> bh IO error occurs <<
 jbd2_cleanup_journal_tail
  __jbd2_update_log_tail
   jbd2_write_superblock
   // The bh won't be replayed in next mount.
, which could corrupt the ext4 image, fetch a reproducer in [Link].

Since writeback process clears buffer dirty after locking buffer head,
we can fix it by checking buffer dirty firstly and then checking buffer
locked, the buffer head can be removed if it is neither dirty nor locked.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/jbd2/checkpoint.c | 48 ++++++++++++++++++++++----------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

Comments

Jan Kara June 1, 2023, 9:41 a.m. UTC | #1
On Wed 31-05-23 19:50:59, Zhang Yi wrote:
> From: Zhihao Cheng <chengzhihao1@huawei.com>
> 
> Following process,
> 
> jbd2_journal_commit_transaction
> // there are several dirty buffer heads in transaction->t_checkpoint_list
>           P1                   wb_workfn
> jbd2_log_do_checkpoint
>  if (buffer_locked(bh)) // false
>                             __block_write_full_page
>                              trylock_buffer(bh)
>                              test_clear_buffer_dirty(bh)
>  if (!buffer_dirty(bh))
>   __jbd2_journal_remove_checkpoint(jh)
>    if (buffer_write_io_error(bh)) // false
>                              >> bh IO error occurs <<
>  jbd2_cleanup_journal_tail
>   __jbd2_update_log_tail
>    jbd2_write_superblock
>    // The bh won't be replayed in next mount.
> , which could corrupt the ext4 image, fetch a reproducer in [Link].
> 
> Since writeback process clears buffer dirty after locking buffer head,
> we can fix it by checking buffer dirty firstly and then checking buffer
> locked, the buffer head can be removed if it is neither dirty nor locked.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
> Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

OK, the analysis is correct but I'm afraid the fix won't be that easy.  The
reordering of tests you did below doesn't really help because CPU or the
compiler are free to order the loads (and stores) in whatever way they
wish. You'd have to use memory barriers when reading and modifying bh flags
(although the modification side is implicitely handled by the bitlock
code) to make this work reliably. But that is IMHO too subtle for this
code.

What we should be doing to avoid these races is to lock the bh. So
something like:

	if (jh->b_transaction != NULL) {
		do stuff
	}
	if (!trylock_buffer(bh)) {
		buffer_locked() branch
	}
	... Now we have the buffer locked and can safely check for dirtyness

And we need to do a similar treatment for journal_clean_one_cp_list() and
journal_shrink_one_cp_list().

BTW, I think we could merge journal_clean_one_cp_list() and
journal_shrink_one_cp_list() into a single common function. I think we can
drop the nr_to_scan argument and just always cleanup the whole checkpoint
list and return the number of freed buffers. That way we have one less
function to deal with checkpoint list cleaning.

Thinking about it some more maybe we can have a function like:

int jbd2_try_remove_checkpoint(struct journal_head *jh)
{
	struct buffer_head *bh = jh2bh(jh);

	if (!trylock_buffer(bh) || buffer_dirty(bh))
		return -EBUSY;
	/*
	 * Buffer is clean and the IO has finished (we hold the buffer lock) so
	 * the checkpoint is done. We can safely remove the buffer from this
	 * transaction.
	 */
	unlock_buffer(bh);
	return __jbd2_journal_remove_checkpoint(jh);
}

and that can be used with a bit of care in the checkpointing functions as
well as in jbd2_journal_forget(), __journal_try_to_free_buffer(),
journal_unmap_buffer().

								Honza
Zhihao Cheng June 1, 2023, 1:44 p.m. UTC | #2
在 2023/6/1 17:41, Jan Kara 写道:

Hi, Jan
> On Wed 31-05-23 19:50:59, Zhang Yi wrote:
>> From: Zhihao Cheng <chengzhihao1@huawei.com>
>>
>> Following process,
>>
>> jbd2_journal_commit_transaction
>> // there are several dirty buffer heads in transaction->t_checkpoint_list
>>            P1                   wb_workfn
>> jbd2_log_do_checkpoint
>>   if (buffer_locked(bh)) // false
>>                              __block_write_full_page
>>                               trylock_buffer(bh)
>>                               test_clear_buffer_dirty(bh)
>>   if (!buffer_dirty(bh))
>>    __jbd2_journal_remove_checkpoint(jh)
>>     if (buffer_write_io_error(bh)) // false
>>                               >> bh IO error occurs <<
>>   jbd2_cleanup_journal_tail
>>    __jbd2_update_log_tail
>>     jbd2_write_superblock
>>     // The bh won't be replayed in next mount.
>> , which could corrupt the ext4 image, fetch a reproducer in [Link].
>>
>> Since writeback process clears buffer dirty after locking buffer head,
>> we can fix it by checking buffer dirty firstly and then checking buffer
>> locked, the buffer head can be removed if it is neither dirty nor locked.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
>> Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
>> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> OK, the analysis is correct but I'm afraid the fix won't be that easy.  The
> reordering of tests you did below doesn't really help because CPU or the
> compiler are free to order the loads (and stores) in whatever way they
> wish. You'd have to use memory barriers when reading and modifying bh flags
> (although the modification side is implicitely handled by the bitlock
> code) to make this work reliably. But that is IMHO too subtle for this
> code.
> 

Do you mean there might be a sequence like following:

jbd2_log_do_checkpoint
  if (buffer_dirty(bh))
  else if (buffer_locked(bh))
  else
    __jbd2_journal_remove_checkpoint(jh)

CPU re-arranges the order of getting buffer state.
reg_1 = buffer_locked(bh)  // false
                            lock_buffer(bh)
                            clear_buffer(bh)
reg_2 = buffer_dirty(bh)   // false

Then, jbd2_log_do_checkpoint() could become:
if (reg_2)
else if (reg_1)
else
   __jbd2_journal_remove_checkpoint(jh)  // enter !

Am I understanding right?

> What we should be doing to avoid these races is to lock the bh. So
> something like:
> 
> 	if (jh->b_transaction != NULL) {
> 		do stuff
> 	}
> 	if (!trylock_buffer(bh)) {
> 		buffer_locked() branch
> 	}
> 	... Now we have the buffer locked and can safely check for dirtyness
> 
> And we need to do a similar treatment for journal_clean_one_cp_list() and
> journal_shrink_one_cp_list().
> 
> BTW, I think we could merge journal_clean_one_cp_list() and
> journal_shrink_one_cp_list() into a single common function. I think we can
> drop the nr_to_scan argument and just always cleanup the whole checkpoint
> list and return the number of freed buffers. That way we have one less
> function to deal with checkpoint list cleaning.
> 
> Thinking about it some more maybe we can have a function like:
> 
> int jbd2_try_remove_checkpoint(struct journal_head *jh)
> {
> 	struct buffer_head *bh = jh2bh(jh);
> 
> 	if (!trylock_buffer(bh) || buffer_dirty(bh))
> 		return -EBUSY;
> 	/*
> 	 * Buffer is clean and the IO has finished (we hold the buffer lock) so
> 	 * the checkpoint is done. We can safely remove the buffer from this
> 	 * transaction.
> 	 */
> 	unlock_buffer(bh);
> 	return __jbd2_journal_remove_checkpoint(jh);
> }
> 
> and that can be used with a bit of care in the checkpointing functions as
> well as in jbd2_journal_forget(), __journal_try_to_free_buffer(),
> journal_unmap_buffer().
> 
> 								Honza
>
Zhang Yi June 1, 2023, 2:20 p.m. UTC | #3
On 2023/6/1 21:44, Zhihao Cheng wrote:
> 在 2023/6/1 17:41, Jan Kara 写道:
> 
> Hi, Jan
>> On Wed 31-05-23 19:50:59, Zhang Yi wrote:
>>> From: Zhihao Cheng <chengzhihao1@huawei.com>
>>>
>>> Following process,
>>>
>>> jbd2_journal_commit_transaction
>>> // there are several dirty buffer heads in transaction->t_checkpoint_list
>>>            P1                   wb_workfn
>>> jbd2_log_do_checkpoint
>>>   if (buffer_locked(bh)) // false
>>>                              __block_write_full_page
>>>                               trylock_buffer(bh)
>>>                               test_clear_buffer_dirty(bh)
>>>   if (!buffer_dirty(bh))
>>>    __jbd2_journal_remove_checkpoint(jh)
>>>     if (buffer_write_io_error(bh)) // false
>>>                               >> bh IO error occurs <<
>>>   jbd2_cleanup_journal_tail
>>>    __jbd2_update_log_tail
>>>     jbd2_write_superblock
>>>     // The bh won't be replayed in next mount.
>>> , which could corrupt the ext4 image, fetch a reproducer in [Link].
>>>
>>> Since writeback process clears buffer dirty after locking buffer head,
>>> we can fix it by checking buffer dirty firstly and then checking buffer
>>> locked, the buffer head can be removed if it is neither dirty nor locked.
>>>
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
>>> Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
>>> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>
>> OK, the analysis is correct but I'm afraid the fix won't be that easy.  The
>> reordering of tests you did below doesn't really help because CPU or the
>> compiler are free to order the loads (and stores) in whatever way they
>> wish. You'd have to use memory barriers when reading and modifying bh flags
>> (although the modification side is implicitely handled by the bitlock
>> code) to make this work reliably. But that is IMHO too subtle for this
>> code.
>>
> 

I write two litmus-test files in tools/memory-model to check the memory module
of these two cases as jbd2_log_do_checkpoint() and __cp_buffer_busy() does.

**1. test-ww-rr.litmus**  //simulate our changes in jbd2_log_do_checkpoint()

  C WW-RR

  (*
   * Result: Never
   *
   * This test shows that write-write ordering
   * (in P0()) is visible to external process P2().
   *
   * bit 0: lock
   * bit 1: dirty
   *)

  {
          x=2;
  }

  P0(int *x)
  {
          int r1;

          r1 = READ_ONCE(*x);
          r1 = r1 | 1;  //lock
          WRITE_ONCE(*x, r1);

          r1 = READ_ONCE(*x);
          r1 = r1 & 253;  //&~2, clear dirty
          WRITE_ONCE(*x, r1);
  }

  P1(int *x)
  {
          int r0;
          int r1;

          r0 = READ_ONCE(*x);
          r1 = READ_ONCE(*x);
  }

  exists (1:r1=2 /\ 1:r0=1)

The test results are:

  $ herd7 -conf linux-kernel.cfg litmus-tests/test-ww-rr.litmus
  Test WW-RR Allowed
  States 6
  1:r0=1; 1:r1=1;
  1:r0=2; 1:r1=1;
  1:r0=2; 1:r1=2;
  1:r0=2; 1:r1=3;
  1:r0=3; 1:r1=1;
  1:r0=3; 1:r1=3;
  No
  Witnesses
  Positive: 0 Negative: 6
  Condition exists (1:r1=2 /\ 1:r0=1)
  Observation WW-RR Never 0 6
  Time WW-RR 0.02
  Hash=d91ecee2379f8ac878b8d06f17967874

**2. test-ww-r.litmus**  //simulate our changes in __cp_buffer_busy()

  C WW-R

  (*
   * Result: Never
   *
   * This test shows that write-write ordering
   * (in P0()) is visible to external process P2().
   *
   * bit 0: lock
   * bit 1: dirty
   *)

  {
          x=2;
  }

  P0(int *x)
  {
          int r1;

          r1 = READ_ONCE(*x);
          r1 = r1 | 1;  //lock
          WRITE_ONCE(*x, r1);

          r1 = READ_ONCE(*x);
          r1 = r1 & 253;  //&~2, clear dirty
          WRITE_ONCE(*x, r1);
  }

  P1(int *x)
  {
          int r0;

          r0 = READ_ONCE(*x);
  }

  exists (1:r0=0)

The test results are:

  $ herd7 -conf linux-kernel.cfg litmus-tests/test-ww-r.litmus
  Test WW-R Allowed
  States 3
  1:r0=1;
  1:r0=2;
  1:r0=3;
  No
  Witnesses
  Positive: 0 Negative: 3
  Condition exists (1:r0=0)
  Observation WW-R Never 0 3
  Time WW-R 0.01
  Hash=d76bb39c367dc0cbd0c363a87c6c9eb7

So it looks like the out-of-order situation cannot happen, am I miss something?

Thanks,
Yi.

> Do you mean there might be a sequence like following:
> 
> jbd2_log_do_checkpoint
>  if (buffer_dirty(bh))
>  else if (buffer_locked(bh))
>  else
>    __jbd2_journal_remove_checkpoint(jh)
> 
> CPU re-arranges the order of getting buffer state.
> reg_1 = buffer_locked(bh)  // false
>                            lock_buffer(bh)
>                            clear_buffer(bh)
> reg_2 = buffer_dirty(bh)   // false
> 
> Then, jbd2_log_do_checkpoint() could become:
> if (reg_2)
> else if (reg_1)
> else
>   __jbd2_journal_remove_checkpoint(jh)  // enter !
> 
> Am I understanding right?
> 
>> What we should be doing to avoid these races is to lock the bh. So
>> something like:
>>
>>     if (jh->b_transaction != NULL) {
>>         do stuff
>>     }
>>     if (!trylock_buffer(bh)) {
>>         buffer_locked() branch
>>     }
>>     ... Now we have the buffer locked and can safely check for dirtyness
>>
>> And we need to do a similar treatment for journal_clean_one_cp_list() and
>> journal_shrink_one_cp_list().
>>
>> BTW, I think we could merge journal_clean_one_cp_list() and
>> journal_shrink_one_cp_list() into a single common function. I think we can
>> drop the nr_to_scan argument and just always cleanup the whole checkpoint
>> list and return the number of freed buffers. That way we have one less
>> function to deal with checkpoint list cleaning.
>>
>> Thinking about it some more maybe we can have a function like:
>>
>> int jbd2_try_remove_checkpoint(struct journal_head *jh)
>> {
>>     struct buffer_head *bh = jh2bh(jh);
>>
>>     if (!trylock_buffer(bh) || buffer_dirty(bh))
>>         return -EBUSY;
>>     /*
>>      * Buffer is clean and the IO has finished (we hold the buffer lock) so
>>      * the checkpoint is done. We can safely remove the buffer from this
>>      * transaction.
>>      */
>>     unlock_buffer(bh);
>>     return __jbd2_journal_remove_checkpoint(jh);
>> }
>>
>> and that can be used with a bit of care in the checkpointing functions as
>> well as in jbd2_journal_forget(), __journal_try_to_free_buffer(),
>> journal_unmap_buffer().
>>
>>                                 Honza
>>
Jan Kara June 1, 2023, 4:31 p.m. UTC | #4
On Thu 01-06-23 22:20:38, Zhang Yi wrote:
> On 2023/6/1 21:44, Zhihao Cheng wrote:
> > 在 2023/6/1 17:41, Jan Kara 写道:
> > 
> > Hi, Jan
> >> On Wed 31-05-23 19:50:59, Zhang Yi wrote:
> >>> From: Zhihao Cheng <chengzhihao1@huawei.com>
> >>>
> >>> Following process,
> >>>
> >>> jbd2_journal_commit_transaction
> >>> // there are several dirty buffer heads in transaction->t_checkpoint_list
> >>>            P1                   wb_workfn
> >>> jbd2_log_do_checkpoint
> >>>   if (buffer_locked(bh)) // false
> >>>                              __block_write_full_page
> >>>                               trylock_buffer(bh)
> >>>                               test_clear_buffer_dirty(bh)
> >>>   if (!buffer_dirty(bh))
> >>>    __jbd2_journal_remove_checkpoint(jh)
> >>>     if (buffer_write_io_error(bh)) // false
> >>>                               >> bh IO error occurs <<
> >>>   jbd2_cleanup_journal_tail
> >>>    __jbd2_update_log_tail
> >>>     jbd2_write_superblock
> >>>     // The bh won't be replayed in next mount.
> >>> , which could corrupt the ext4 image, fetch a reproducer in [Link].
> >>>
> >>> Since writeback process clears buffer dirty after locking buffer head,
> >>> we can fix it by checking buffer dirty firstly and then checking buffer
> >>> locked, the buffer head can be removed if it is neither dirty nor locked.
> >>>
> >>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
> >>> Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
> >>> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
> >>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> OK, the analysis is correct but I'm afraid the fix won't be that easy.  The
> >> reordering of tests you did below doesn't really help because CPU or the
> >> compiler are free to order the loads (and stores) in whatever way they
> >> wish. You'd have to use memory barriers when reading and modifying bh flags
> >> (although the modification side is implicitely handled by the bitlock
> >> code) to make this work reliably. But that is IMHO too subtle for this
> >> code.
> >>
> > 
> 
> I write two litmus-test files in tools/memory-model to check the memory module
> of these two cases as jbd2_log_do_checkpoint() and __cp_buffer_busy() does.

<snip litmus tests>

> So it looks like the out-of-order situation cannot happen, am I miss something?

After thinking about it a bit, indeed CPU cannot reorder the two loads
because they are from the same location in memory. Thanks for correcting me
on this. I'm not sure whether a C compiler still could not reorder the
tests - technically I suspect the C standard does not forbid this although
it would have to be really evil compiler to do this.

But still I think with the helper function I've suggested things are much
more obviously correct.

								Honza
Zhihao Cheng June 2, 2023, 1:52 a.m. UTC | #5
在 2023/6/2 0:31, Jan Kara 写道:
> On Thu 01-06-23 22:20:38, Zhang Yi wrote:
>> On 2023/6/1 21:44, Zhihao Cheng wrote:
>>> 在 2023/6/1 17:41, Jan Kara 写道:
>>>
>>> Hi, Jan
>>>> On Wed 31-05-23 19:50:59, Zhang Yi wrote:
>>>>> From: Zhihao Cheng <chengzhihao1@huawei.com>
>>>>>
>>>>> Following process,
>>>>>
>>>>> jbd2_journal_commit_transaction
>>>>> // there are several dirty buffer heads in transaction->t_checkpoint_list
>>>>>             P1                   wb_workfn
>>>>> jbd2_log_do_checkpoint
>>>>>    if (buffer_locked(bh)) // false
>>>>>                               __block_write_full_page
>>>>>                                trylock_buffer(bh)
>>>>>                                test_clear_buffer_dirty(bh)
>>>>>    if (!buffer_dirty(bh))
>>>>>     __jbd2_journal_remove_checkpoint(jh)
>>>>>      if (buffer_write_io_error(bh)) // false
>>>>>                                >> bh IO error occurs <<
>>>>>    jbd2_cleanup_journal_tail
>>>>>     __jbd2_update_log_tail
>>>>>      jbd2_write_superblock
>>>>>      // The bh won't be replayed in next mount.
>>>>> , which could corrupt the ext4 image, fetch a reproducer in [Link].
>>>>>
>>>>> Since writeback process clears buffer dirty after locking buffer head,
>>>>> we can fix it by checking buffer dirty firstly and then checking buffer
>>>>> locked, the buffer head can be removed if it is neither dirty nor locked.
>>>>>
>>>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
>>>>> Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
>>>>> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> OK, the analysis is correct but I'm afraid the fix won't be that easy.  The
>>>> reordering of tests you did below doesn't really help because CPU or the
>>>> compiler are free to order the loads (and stores) in whatever way they
>>>> wish. You'd have to use memory barriers when reading and modifying bh flags
>>>> (although the modification side is implicitely handled by the bitlock
>>>> code) to make this work reliably. But that is IMHO too subtle for this
>>>> code.
>>>>
>>>
>>
>> I write two litmus-test files in tools/memory-model to check the memory module
>> of these two cases as jbd2_log_do_checkpoint() and __cp_buffer_busy() does.
> 
> <snip litmus tests>
> 
>> So it looks like the out-of-order situation cannot happen, am I miss something?
> 
> After thinking about it a bit, indeed CPU cannot reorder the two loads
> because they are from the same location in memory. Thanks for correcting me
> on this. I'm not sure whether a C compiler still could not reorder the
> tests - technically I suspect the C standard does not forbid this although
> it would have to be really evil compiler to do this.
> 
> But still I think with the helper function I've suggested things are much
> more obviously correct.

Thanks for suggestions, we will modify it in next iteration.
diff mbox series

Patch

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 3f52560912a9..620f3d345f3d 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -204,20 +204,6 @@  int jbd2_log_do_checkpoint(journal_t *journal)
 		jh = transaction->t_checkpoint_list;
 		bh = jh2bh(jh);
 
-		/*
-		 * The buffer may be writing back, or flushing out in the
-		 * last couple of cycles, or re-adding into a new transaction,
-		 * need to check it again until it's unlocked.
-		 */
-		if (buffer_locked(bh)) {
-			get_bh(bh);
-			spin_unlock(&journal->j_list_lock);
-			wait_on_buffer(bh);
-			/* the journal_head may have gone by now */
-			BUFFER_TRACE(bh, "brelse");
-			__brelse(bh);
-			goto retry;
-		}
 		if (jh->b_transaction != NULL) {
 			transaction_t *t = jh->b_transaction;
 			tid_t tid = t->t_tid;
@@ -252,16 +238,7 @@  int jbd2_log_do_checkpoint(journal_t *journal)
 			spin_lock(&journal->j_list_lock);
 			goto restart;
 		}
-		if (!buffer_dirty(bh)) {
-			BUFFER_TRACE(bh, "remove from checkpoint");
-			/*
-			 * If the transaction was released or the checkpoint
-			 * list was empty, we're done.
-			 */
-			if (__jbd2_journal_remove_checkpoint(jh) ||
-			    !transaction->t_checkpoint_list)
-				goto out;
-		} else {
+		if (buffer_dirty(bh)) {
 			/*
 			 * We are about to write the buffer, it could be
 			 * raced by some other transaction shrink or buffer
@@ -275,6 +252,29 @@  int jbd2_log_do_checkpoint(journal_t *journal)
 			journal->j_chkpt_bhs[batch_count++] = bh;
 			transaction->t_chp_stats.cs_written++;
 			transaction->t_checkpoint_list = jh->b_cpnext;
+		} else if (buffer_locked(bh)) {
+			/*
+			 * The buffer may be writing back, or flushing out
+			 * in the last couple of cycles, or re-adding into
+			 * a new transaction, need to check it again until
+			 * it's unlocked.
+			 */
+			get_bh(bh);
+			spin_unlock(&journal->j_list_lock);
+			wait_on_buffer(bh);
+			/* the journal_head may have gone by now */
+			BUFFER_TRACE(bh, "brelse");
+			__brelse(bh);
+			goto retry;
+		} else {
+			BUFFER_TRACE(bh, "remove from checkpoint");
+			/*
+			 * If the transaction was released or the checkpoint
+			 * list was empty, we're done.
+			 */
+			if (__jbd2_journal_remove_checkpoint(jh) ||
+			    !transaction->t_checkpoint_list)
+				goto out;
 		}
 
 		if ((batch_count == JBD2_NR_BATCH) ||