jbd2: set freed flag while revoking a buffer which belongs to older transaction

Message ID 1547100722-132243-1-git-send-email-yi.zhang@huawei.com
State Superseded
Headers show
Series
  • jbd2: set freed flag while revoking a buffer which belongs to older transaction
Related show

Commit Message

zhangyi (F) Jan. 10, 2019, 6:12 a.m.
Now, we capture a data corruption problem on ext4 while we're truncating
an extent index block. Imaging that if we are revoking a buffer which
has been journaled by the committing transaction, the buffer's jbddirty
flag will not be cleared in jbd2_journal_forget(), so the commit code
will set the buffer dirty flag again after refile the buffer.

fsx                               kjournald2
                                  jbd2_journal_commit_transaction
jbd2_journal_revoke                commit phase 1~5...
 jbd2_journal_forget
   belongs to older transaction    commit phase 6
   jbddirty not clear               __jbd2_journal_refile_buffer
                                     __jbd2_journal_unfile_buffer
                                      test_clear_buffer_jbddirty
                                       mark_buffer_dirty

Finally, if the freed extent index block was allocated again as data
block by some other files, it may corrupt the file data when writing
cached pages later, such as during umount time.

This patch mark buffer as freed when it already belongs to the
committing transaction in jbd2_journal_forget(), so that commit code
knows it should clear dirty bits when it is done with the buffer.

This problem can be reproduced by xfstests generic/455 easily with
seeds (3246 3247 3248 3249).

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Cc: stable@vger.kernel.org
---
 fs/jbd2/transaction.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Jan Kara Jan. 10, 2019, 11:20 a.m. | #1
On Thu 10-01-19 14:12:02, zhangyi (F) wrote:
> Now, we capture a data corruption problem on ext4 while we're truncating
> an extent index block. Imaging that if we are revoking a buffer which
> has been journaled by the committing transaction, the buffer's jbddirty
> flag will not be cleared in jbd2_journal_forget(), so the commit code
> will set the buffer dirty flag again after refile the buffer.
> 
> fsx                               kjournald2
>                                   jbd2_journal_commit_transaction
> jbd2_journal_revoke                commit phase 1~5...
>  jbd2_journal_forget
>    belongs to older transaction    commit phase 6
>    jbddirty not clear               __jbd2_journal_refile_buffer
>                                      __jbd2_journal_unfile_buffer
>                                       test_clear_buffer_jbddirty
>                                        mark_buffer_dirty
> 
> Finally, if the freed extent index block was allocated again as data
> block by some other files, it may corrupt the file data when writing
> cached pages later, such as during umount time.
> 
> This patch mark buffer as freed when it already belongs to the
> committing transaction in jbd2_journal_forget(), so that commit code
> knows it should clear dirty bits when it is done with the buffer.
> 
> This problem can be reproduced by xfstests generic/455 easily with
> seeds (3246 3247 3248 3249).
> 
> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
> Cc: stable@vger.kernel.org

Thanks a lot for the analysis and the patch! I fully agree with your
analysis however I think just setting buffer as freed isn't completely
correct. The problem is following: The metadata buffer X has been modified
by the commiting transaction - let's call it A. It has been freed in the
currently running transaction B. Now jbd2_journal_forget() clears
b_next_transaction and if you set buffer freed flag, X will not be added to
the checkpoint list. So when transaction A finishes commit, it can get
checkpointed (without writing out X) before transaction B commits. So if a
crash occurs before B commits, we'd loose modification of X from
transaction A and thus cause filesystem corruption.

What rather needs to happen is the same thing that is done in
journal_unmap_buffer() in this case: We set buffer freed flag and we also
set b_next_transaction to the currently running transaction (B). This will
prevent A from being checkpointed before B commits and thus avoids the
problem above.

								Honza

> ---
>  fs/jbd2/transaction.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 4b51177..fcb65f2 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1592,6 +1592,12 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
>  			if (was_modified)
>  				drop_reserve = 1;
>  		}
> +
> +		/*
> +		 * Mark buffer as freed so that commit code know it should
> +		 * clear dirty bits when it is done with the buffer.
> +		 */
> +		set_buffer_freed(bh);
>  	}
>  
>  not_jbd:
> -- 
> 2.7.4
>
zhangyi (F) Jan. 11, 2019, 6:11 a.m. | #2
On 2019/1/10 19:20, Jan Kara Wrote:
> On Thu 10-01-19 14:12:02, zhangyi (F) wrote:
>> Now, we capture a data corruption problem on ext4 while we're truncating
>> an extent index block. Imaging that if we are revoking a buffer which
>> has been journaled by the committing transaction, the buffer's jbddirty
>> flag will not be cleared in jbd2_journal_forget(), so the commit code
>> will set the buffer dirty flag again after refile the buffer.
>>
>> fsx                               kjournald2
>>                                   jbd2_journal_commit_transaction
>> jbd2_journal_revoke                commit phase 1~5...
>>  jbd2_journal_forget
>>    belongs to older transaction    commit phase 6
>>    jbddirty not clear               __jbd2_journal_refile_buffer
>>                                      __jbd2_journal_unfile_buffer
>>                                       test_clear_buffer_jbddirty
>>                                        mark_buffer_dirty
>>
>> Finally, if the freed extent index block was allocated again as data
>> block by some other files, it may corrupt the file data when writing
>> cached pages later, such as during umount time.
>>
>> This patch mark buffer as freed when it already belongs to the
>> committing transaction in jbd2_journal_forget(), so that commit code
>> knows it should clear dirty bits when it is done with the buffer.
>>
>> This problem can be reproduced by xfstests generic/455 easily with
>> seeds (3246 3247 3248 3249).
>>
>> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
>> Cc: stable@vger.kernel.org
> 
> Thanks a lot for the analysis and the patch! I fully agree with your
> analysis however I think just setting buffer as freed isn't completely
> correct. The problem is following: The metadata buffer X has been modified
> by the commiting transaction - let's call it A. It has been freed in the
> currently running transaction B. Now jbd2_journal_forget() clears
> b_next_transaction and if you set buffer freed flag, X will not be added to
> the checkpoint list. So when transaction A finishes commit, it can get
> checkpointed (without writing out X) before transaction B commits. So if a
> crash occurs before B commits, we'd loose modification of X from
> transaction A and thus cause filesystem corruption.
> 
Thanks for your explanation! There are still two points I don't quite
understand.

I check all three cases of doing checkpoint. IIUC, both jbd2_journal_destroy()
and jbd2_journal_flush() wait the current running transaction B to complete
before doing checkpoint besides __jbd2_log_wait_for_space(). So I guess this is
the case that you mentioned of transaction A could be checkpointed before B
commits, am I right?

For another case, jbd2_update_log_tail() will be invoked after transaction B
complete, so the problem above also can't happen here, right?

> What rather needs to happen is the same thing that is done in
> journal_unmap_buffer() in this case: We set buffer freed flag and we also
> set b_next_transaction to the currently running transaction (B). This will
> prevent A from being checkpointed before B commits and thus avoids the
> problem above.
> 
Sorry, I don't get this point. I find that the difference between setting
b_next_transaction or not is just re-added the buffer X to the BJ_Reserved
list or not. How could we avoid the problem above.

BTW, I am thinking of a similar case. If we modify buffer X instead of
revork it in the transaction B, we also need to avoid transaction A from
being checkpointed before B commits, because current buffer X contains the
modified data (modified by B). So we should prevent writing it before
B commits, otherwise it will corrupt metadata. How do we handle this
situation now?

Thanks,
Yi.

>> ---
>>  fs/jbd2/transaction.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index 4b51177..fcb65f2 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -1592,6 +1592,12 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
>>  			if (was_modified)
>>  				drop_reserve = 1;
>>  		}
>> +
>> +		/*
>> +		 * Mark buffer as freed so that commit code know it should
>> +		 * clear dirty bits when it is done with the buffer.
>> +		 */
>> +		set_buffer_freed(bh);
>>  	}
>>  
>>  not_jbd:
>> -- 
>> 2.7.4
>>
Jan Kara Jan. 11, 2019, 10:30 a.m. | #3
On Fri 11-01-19 14:11:31, zhangyi (F) wrote:
> On 2019/1/10 19:20, Jan Kara Wrote:
> > On Thu 10-01-19 14:12:02, zhangyi (F) wrote:
> >> Now, we capture a data corruption problem on ext4 while we're truncating
> >> an extent index block. Imaging that if we are revoking a buffer which
> >> has been journaled by the committing transaction, the buffer's jbddirty
> >> flag will not be cleared in jbd2_journal_forget(), so the commit code
> >> will set the buffer dirty flag again after refile the buffer.
> >>
> >> fsx                               kjournald2
> >>                                   jbd2_journal_commit_transaction
> >> jbd2_journal_revoke                commit phase 1~5...
> >>  jbd2_journal_forget
> >>    belongs to older transaction    commit phase 6
> >>    jbddirty not clear               __jbd2_journal_refile_buffer
> >>                                      __jbd2_journal_unfile_buffer
> >>                                       test_clear_buffer_jbddirty
> >>                                        mark_buffer_dirty
> >>
> >> Finally, if the freed extent index block was allocated again as data
> >> block by some other files, it may corrupt the file data when writing
> >> cached pages later, such as during umount time.
> >>
> >> This patch mark buffer as freed when it already belongs to the
> >> committing transaction in jbd2_journal_forget(), so that commit code
> >> knows it should clear dirty bits when it is done with the buffer.
> >>
> >> This problem can be reproduced by xfstests generic/455 easily with
> >> seeds (3246 3247 3248 3249).
> >>
> >> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
> >> Cc: stable@vger.kernel.org
> > 
> > Thanks a lot for the analysis and the patch! I fully agree with your
> > analysis however I think just setting buffer as freed isn't completely
> > correct. The problem is following: The metadata buffer X has been modified
> > by the commiting transaction - let's call it A. It has been freed in the
> > currently running transaction B. Now jbd2_journal_forget() clears
> > b_next_transaction and if you set buffer freed flag, X will not be added to
> > the checkpoint list. So when transaction A finishes commit, it can get
> > checkpointed (without writing out X) before transaction B commits. So if a
> > crash occurs before B commits, we'd loose modification of X from
> > transaction A and thus cause filesystem corruption.
> > 
> Thanks for your explanation! There are still two points I don't quite
> understand.
> 
> I check all three cases of doing checkpoint. IIUC, both jbd2_journal_destroy()
> and jbd2_journal_flush() wait the current running transaction B to complete
> before doing checkpoint besides __jbd2_log_wait_for_space(). So I guess this is
> the case that you mentioned of transaction A could be checkpointed before B
> commits, am I right?

Yes, __jbd2_log_wait_for_space() can checkpoint already committed
transactions (i.e., A in our case) without waiting for the running
transaction (B in our case).

> For another case, jbd2_update_log_tail() will be invoked after transaction B
> complete, so the problem above also can't happen here, right?

I'm not sure which "another case" you speak about here...

> > What rather needs to happen is the same thing that is done in
> > journal_unmap_buffer() in this case: We set buffer freed flag and we also
> > set b_next_transaction to the currently running transaction (B). This will
> > prevent A from being checkpointed before B commits and thus avoids the
> > problem above.
> > 
> Sorry, I don't get this point. I find that the difference between setting
> b_next_transaction or not is just re-added the buffer X to the BJ_Reserved
> list or not. How could we avoid the problem above.

Currently, X will be removed from transaction B by jbd2_journal_revoke().
So once A commits, it will not be in the running transaction and thus
checkpoint of A can complete before B is committed.

If we set X->b_next_transaction to B, X will be part of transaction B. The
handling of buffer_freed() buffer in commit code thus will not clear
jbddirty bit and X will get inserted in X as buffer for checkpointing. And
thus checkpoint of A will not be able to complete before B commits, fixing
the problem I have described.
 
> BTW, I am thinking of a similar case. If we modify buffer X instead of
> revork it in the transaction B, we also need to avoid transaction A from
> being checkpointed before B commits, because current buffer X contains the
> modified data (modified by B). So we should prevent writing it before
> B commits, otherwise it will corrupt metadata. How do we handle this
> situation now?

Buffers that are part of the running transaction never have buffer_dirty
bit set (look how jbd2_journal_file_buffer() clears this bit). Thus
background writeback will not write these buffers. Also checkpointing code
checks whether the buffer is part of running / committing transaction and
handles these buffers specially exactly because they cannot be written out
directly.

								Honza
zhangyi (F) Jan. 11, 2019, 1:44 p.m. | #4
Thanks a lot for your in-depth explanation, I check the code and get it now.
Will modify the patch as you suggested and post v2 after test.

Thanks,
Yi.

On 2019/1/11 18:30, Jan Kara Wrote:
> On Fri 11-01-19 14:11:31, zhangyi (F) wrote:
>> On 2019/1/10 19:20, Jan Kara Wrote:
>>> On Thu 10-01-19 14:12:02, zhangyi (F) wrote:
>>>> Now, we capture a data corruption problem on ext4 while we're truncating
>>>> an extent index block. Imaging that if we are revoking a buffer which
>>>> has been journaled by the committing transaction, the buffer's jbddirty
>>>> flag will not be cleared in jbd2_journal_forget(), so the commit code
>>>> will set the buffer dirty flag again after refile the buffer.
>>>>
>>>> fsx                               kjournald2
>>>>                                   jbd2_journal_commit_transaction
>>>> jbd2_journal_revoke                commit phase 1~5...
>>>>  jbd2_journal_forget
>>>>    belongs to older transaction    commit phase 6
>>>>    jbddirty not clear               __jbd2_journal_refile_buffer
>>>>                                      __jbd2_journal_unfile_buffer
>>>>                                       test_clear_buffer_jbddirty
>>>>                                        mark_buffer_dirty
>>>>
>>>> Finally, if the freed extent index block was allocated again as data
>>>> block by some other files, it may corrupt the file data when writing
>>>> cached pages later, such as during umount time.
>>>>
>>>> This patch mark buffer as freed when it already belongs to the
>>>> committing transaction in jbd2_journal_forget(), so that commit code
>>>> knows it should clear dirty bits when it is done with the buffer.
>>>>
>>>> This problem can be reproduced by xfstests generic/455 easily with
>>>> seeds (3246 3247 3248 3249).
>>>>
>>>> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
>>>> Cc: stable@vger.kernel.org
>>>
>>> Thanks a lot for the analysis and the patch! I fully agree with your
>>> analysis however I think just setting buffer as freed isn't completely
>>> correct. The problem is following: The metadata buffer X has been modified
>>> by the commiting transaction - let's call it A. It has been freed in the
>>> currently running transaction B. Now jbd2_journal_forget() clears
>>> b_next_transaction and if you set buffer freed flag, X will not be added to
>>> the checkpoint list. So when transaction A finishes commit, it can get
>>> checkpointed (without writing out X) before transaction B commits. So if a
>>> crash occurs before B commits, we'd loose modification of X from
>>> transaction A and thus cause filesystem corruption.
>>>
>> Thanks for your explanation! There are still two points I don't quite
>> understand.
>>
>> I check all three cases of doing checkpoint. IIUC, both jbd2_journal_destroy()
>> and jbd2_journal_flush() wait the current running transaction B to complete
>> before doing checkpoint besides __jbd2_log_wait_for_space(). So I guess this is
>> the case that you mentioned of transaction A could be checkpointed before B
>> commits, am I right?
> 
> Yes, __jbd2_log_wait_for_space() can checkpoint already committed
> transactions (i.e., A in our case) without waiting for the running
> transaction (B in our case).
> 
>> For another case, jbd2_update_log_tail() will be invoked after transaction B
>> complete, so the problem above also can't happen here, right?
> 
> I'm not sure which "another case" you speak about here...
> 
>>> What rather needs to happen is the same thing that is done in
>>> journal_unmap_buffer() in this case: We set buffer freed flag and we also
>>> set b_next_transaction to the currently running transaction (B). This will
>>> prevent A from being checkpointed before B commits and thus avoids the
>>> problem above.
>>>
>> Sorry, I don't get this point. I find that the difference between setting
>> b_next_transaction or not is just re-added the buffer X to the BJ_Reserved
>> list or not. How could we avoid the problem above.
> 
> Currently, X will be removed from transaction B by jbd2_journal_revoke().
> So once A commits, it will not be in the running transaction and thus
> checkpoint of A can complete before B is committed.
> 
> If we set X->b_next_transaction to B, X will be part of transaction B. The
> handling of buffer_freed() buffer in commit code thus will not clear
> jbddirty bit and X will get inserted in X as buffer for checkpointing. And
> thus checkpoint of A will not be able to complete before B commits, fixing
> the problem I have described.
>  
>> BTW, I am thinking of a similar case. If we modify buffer X instead of
>> revork it in the transaction B, we also need to avoid transaction A from
>> being checkpointed before B commits, because current buffer X contains the
>> modified data (modified by B). So we should prevent writing it before
>> B commits, otherwise it will corrupt metadata. How do we handle this
>> situation now?
> 
> Buffers that are part of the running transaction never have buffer_dirty
> bit set (look how jbd2_journal_file_buffer() clears this bit). Thus
> background writeback will not write these buffers. Also checkpointing code
> checks whether the buffer is part of running / committing transaction and
> handles these buffers specially exactly because they cannot be written out
> directly.
> 
> 								Honza
>
Eryu Guan Jan. 12, 2019, 7:39 a.m. | #5
On Thu, Jan 10, 2019 at 02:12:02PM +0800, zhangyi (F) wrote:
> Now, we capture a data corruption problem on ext4 while we're truncating
> an extent index block. Imaging that if we are revoking a buffer which
> has been journaled by the committing transaction, the buffer's jbddirty
> flag will not be cleared in jbd2_journal_forget(), so the commit code
> will set the buffer dirty flag again after refile the buffer.
> 
> fsx                               kjournald2
>                                   jbd2_journal_commit_transaction
> jbd2_journal_revoke                commit phase 1~5...
>  jbd2_journal_forget
>    belongs to older transaction    commit phase 6
>    jbddirty not clear               __jbd2_journal_refile_buffer
>                                      __jbd2_journal_unfile_buffer
>                                       test_clear_buffer_jbddirty
>                                        mark_buffer_dirty
> 
> Finally, if the freed extent index block was allocated again as data
> block by some other files, it may corrupt the file data when writing
> cached pages later, such as during umount time.
> 
> This patch mark buffer as freed when it already belongs to the
> committing transaction in jbd2_journal_forget(), so that commit code
> knows it should clear dirty bits when it is done with the buffer.
> 
> This problem can be reproduced by xfstests generic/455 easily with
> seeds (3246 3247 3248 3249).

Would you please capture the fsx ops sequences that could reproduce the
problem and replay it in a targeted regression test, like what
generic/{499,511} do? Thanks!

Eryu
zhangyi (F) Jan. 12, 2019, 9:32 a.m. | #6
On 2019/1/12 15:39, Eryu Guan Wrote:
> On Thu, Jan 10, 2019 at 02:12:02PM +0800, zhangyi (F) wrote:
>> Now, we capture a data corruption problem on ext4 while we're truncating
>> an extent index block. Imaging that if we are revoking a buffer which
>> has been journaled by the committing transaction, the buffer's jbddirty
>> flag will not be cleared in jbd2_journal_forget(), so the commit code
>> will set the buffer dirty flag again after refile the buffer.
>>
>> fsx                               kjournald2
>>                                   jbd2_journal_commit_transaction
>> jbd2_journal_revoke                commit phase 1~5...
>>  jbd2_journal_forget
>>    belongs to older transaction    commit phase 6
>>    jbddirty not clear               __jbd2_journal_refile_buffer
>>                                      __jbd2_journal_unfile_buffer
>>                                       test_clear_buffer_jbddirty
>>                                        mark_buffer_dirty
>>
>> Finally, if the freed extent index block was allocated again as data
>> block by some other files, it may corrupt the file data when writing
>> cached pages later, such as during umount time.
>>
>> This patch mark buffer as freed when it already belongs to the
>> committing transaction in jbd2_journal_forget(), so that commit code
>> knows it should clear dirty bits when it is done with the buffer.
>>
>> This problem can be reproduced by xfstests generic/455 easily with
>> seeds (3246 3247 3248 3249).
> 
> Would you please capture the fsx ops sequences that could reproduce the
> problem and replay it in a targeted regression test, like what
> generic/{499,511} do? Thanks!
> 

Yes, I will do it. But this problem is timing dependent, so I am afraid
this targeted regression test cannot always reproduce it (not even
generic/455 with above seeds).

BTW, we only test and capture this problem on ext4, I am not sure other
file systems have the same problem or not. So better to categorize this
test to tests/ext4 group?

Thanks,
Yi.
Eryu Guan Jan. 13, 2019, 3:12 p.m. | #7
On Sat, Jan 12, 2019 at 05:32:21PM +0800, zhangyi (F) wrote:
> On 2019/1/12 15:39, Eryu Guan Wrote:
> > On Thu, Jan 10, 2019 at 02:12:02PM +0800, zhangyi (F) wrote:
> >> Now, we capture a data corruption problem on ext4 while we're truncating
> >> an extent index block. Imaging that if we are revoking a buffer which
> >> has been journaled by the committing transaction, the buffer's jbddirty
> >> flag will not be cleared in jbd2_journal_forget(), so the commit code
> >> will set the buffer dirty flag again after refile the buffer.
> >>
> >> fsx                               kjournald2
> >>                                   jbd2_journal_commit_transaction
> >> jbd2_journal_revoke                commit phase 1~5...
> >>  jbd2_journal_forget
> >>    belongs to older transaction    commit phase 6
> >>    jbddirty not clear               __jbd2_journal_refile_buffer
> >>                                      __jbd2_journal_unfile_buffer
> >>                                       test_clear_buffer_jbddirty
> >>                                        mark_buffer_dirty
> >>
> >> Finally, if the freed extent index block was allocated again as data
> >> block by some other files, it may corrupt the file data when writing
> >> cached pages later, such as during umount time.
> >>
> >> This patch mark buffer as freed when it already belongs to the
> >> committing transaction in jbd2_journal_forget(), so that commit code
> >> knows it should clear dirty bits when it is done with the buffer.
> >>
> >> This problem can be reproduced by xfstests generic/455 easily with
> >> seeds (3246 3247 3248 3249).
> > 
> > Would you please capture the fsx ops sequences that could reproduce the
> > problem and replay it in a targeted regression test, like what
> > generic/{499,511} do? Thanks!
> > 
> 
> Yes, I will do it. But this problem is timing dependent, so I am afraid
> this targeted regression test cannot always reproduce it (not even
> generic/455 with above seeds).

That's fine, if there're multiple sequences could reproduce the bug, we
could replay them all in a test.

> 
> BTW, we only test and capture this problem on ext4, I am not sure other
> file systems have the same problem or not. So better to categorize this
> test to tests/ext4 group?

If there's nothing specific to ext4, a generic test would be fine.

Thanks!

Eryu

Patch

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 4b51177..fcb65f2 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1592,6 +1592,12 @@  int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 			if (was_modified)
 				drop_reserve = 1;
 		}
+
+		/*
+		 * Mark buffer as freed so that commit code know it should
+		 * clear dirty bits when it is done with the buffer.
+		 */
+		set_buffer_freed(bh);
 	}
 
 not_jbd: