diff mbox series

[1/1] quota: flush quota_release_work upon quota writeback

Message ID 20241115183449.2058590-2-ojaswin@linux.ibm.com
State New
Headers show
Series Fix generic/390 failure due to quota release after freeze | expand

Commit Message

Ojaswin Mujoo Nov. 15, 2024, 6:34 p.m. UTC
One of the paths quota writeback is called from is:

freeze_super()
  sync_filesystem()
    ext4_sync_fs()
      dquot_writeback_dquots()

Since we currently don't always flush the quota_release_work queue in
this path, we can end up with the following race:

 1. dquot are added to releasing_dquots list during regular operations.
 2. FS freeze starts, however, this does not flush the quota_release_work queue.
 3. Freeze completes.
 4. Kernel eventually tries to flush the workqueue while FS is frozen which
    hits a WARN_ON since transaction gets started during frozen state:

  ext4_journal_check_start+0x28/0x110 [ext4] (unreliable)
  __ext4_journal_start_sb+0x64/0x1c0 [ext4]
  ext4_release_dquot+0x90/0x1d0 [ext4]
  quota_release_workfn+0x43c/0x4d0

Which is the following line:

  WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);

Which ultimately results in generic/390 failing due to dmesg
noise. This was detected on powerpc machine 15 cores.

To avoid this, make sure to flush the workqueue during
dquot_writeback_dquots() so we dont have any pending workitems after
freeze.

Reported-by: Disha Goel <disgoel@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/quota/dquot.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Ritesh Harjani (IBM) Nov. 15, 2024, 8:50 p.m. UTC | #1
Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:

> One of the paths quota writeback is called from is:
>
> freeze_super()
>   sync_filesystem()
>     ext4_sync_fs()
>       dquot_writeback_dquots()
>
> Since we currently don't always flush the quota_release_work queue in
> this path, we can end up with the following race:
>
>  1. dquot are added to releasing_dquots list during regular operations.
>  2. FS freeze starts, however, this does not flush the quota_release_work queue.
>  3. Freeze completes.
>  4. Kernel eventually tries to flush the workqueue while FS is frozen which
>     hits a WARN_ON since transaction gets started during frozen state:
>
>   ext4_journal_check_start+0x28/0x110 [ext4] (unreliable)
>   __ext4_journal_start_sb+0x64/0x1c0 [ext4]
>   ext4_release_dquot+0x90/0x1d0 [ext4]
>   quota_release_workfn+0x43c/0x4d0
>
> Which is the following line:
>
>   WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
>
> Which ultimately results in generic/390 failing due to dmesg
> noise. This was detected on powerpc machine 15 cores.
>
> To avoid this, make sure to flush the workqueue during
> dquot_writeback_dquots() so we dont have any pending workitems after
> freeze.

Not just that, sync_filesystem can also be called from other places and
quota_release_workfn() could write out and and release the dquot
structures if such are found during processing of releasing_dquots list. 
IIUC, this was earlier done in the same dqput() context but had races
with dquot_mark_dquot_dirty(). Hence the final dqput() will now add the
dquot structures to releasing_dquots list and will schedule a delayed
workfn which will process the releasing_dquots list. 

And so after the final dqput and before the release_workfn gets
scheduled, if dquot gets marked as dirty or dquot_transfer gets called -
then I am suspecting that it could lead to a dirty or an active dquot.

Hence, flushing the delayed quota_release_work at the end of
dquot_writeback_dquots() looks like the right thing to do IMO.

But I can give another look as this part of the code is not that well
known to me. 

>
> Reported-by: Disha Goel <disgoel@linux.ibm.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---

Maybe a fixes tag as well?

>  fs/quota/dquot.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index 3dd8d6f27725..2782cfc8c302 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -729,6 +729,8 @@ int dquot_writeback_dquots(struct super_block *sb, int type)
>  			sb->dq_op->write_info(sb, cnt);
>  	dqstats_inc(DQST_SYNCS);
>  
> +	flush_delayed_work(&quota_release_work);
> +
>  	return ret;
>  }
>  EXPORT_SYMBOL(dquot_writeback_dquots);
> -- 
> 2.43.5
Ojaswin Mujoo Nov. 16, 2024, 5:59 p.m. UTC | #2
On Sat, Nov 16, 2024 at 02:20:26AM +0530, Ritesh Harjani wrote:
> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> 
> > One of the paths quota writeback is called from is:
> >
> > freeze_super()
> >   sync_filesystem()
> >     ext4_sync_fs()
> >       dquot_writeback_dquots()
> >
> > Since we currently don't always flush the quota_release_work queue in
> > this path, we can end up with the following race:
> >
> >  1. dquot are added to releasing_dquots list during regular operations.
> >  2. FS freeze starts, however, this does not flush the quota_release_work queue.
> >  3. Freeze completes.
> >  4. Kernel eventually tries to flush the workqueue while FS is frozen which
> >     hits a WARN_ON since transaction gets started during frozen state:
> >
> >   ext4_journal_check_start+0x28/0x110 [ext4] (unreliable)
> >   __ext4_journal_start_sb+0x64/0x1c0 [ext4]
> >   ext4_release_dquot+0x90/0x1d0 [ext4]
> >   quota_release_workfn+0x43c/0x4d0
> >
> > Which is the following line:
> >
> >   WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
> >
> > Which ultimately results in generic/390 failing due to dmesg
> > noise. This was detected on powerpc machine 15 cores.
> >
> > To avoid this, make sure to flush the workqueue during
> > dquot_writeback_dquots() so we dont have any pending workitems after
> > freeze.
> 
> Not just that, sync_filesystem can also be called from other places and
> quota_release_workfn() could write out and and release the dquot
> structures if such are found during processing of releasing_dquots list. 
> IIUC, this was earlier done in the same dqput() context but had races
> with dquot_mark_dquot_dirty(). Hence the final dqput() will now add the
> dquot structures to releasing_dquots list and will schedule a delayed
> workfn which will process the releasing_dquots list. 
Hi Ritesh,

Ohh right, thanks for the context. I see this was done here:

  dabc8b207566 quota: fix dqput() to follow the guarantees dquot_srcu
  should provide

Which went in v6.5. Let me cc Baokun as well.
> 
> And so after the final dqput and before the release_workfn gets
> scheduled, if dquot gets marked as dirty or dquot_transfer gets called -
> then I am suspecting that it could lead to a dirty or an active dquot.
> 
> Hence, flushing the delayed quota_release_work at the end of
> dquot_writeback_dquots() looks like the right thing to do IMO.
> 
> But I can give another look as this part of the code is not that well
> known to me. 
> 
> >
> > Reported-by: Disha Goel <disgoel@linux.ibm.com>
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > ---
> 
> Maybe a fixes tag as well?

Right, I'll add that in the next revision. I believe it would be:

Fixes: dabc8b207566 ("quota: fix dqput() to follow the guarantees dquot_srcu should provide")

Regards,
ojaswin

> 
> >  fs/quota/dquot.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> > index 3dd8d6f27725..2782cfc8c302 100644
> > --- a/fs/quota/dquot.c
> > +++ b/fs/quota/dquot.c
> > @@ -729,6 +729,8 @@ int dquot_writeback_dquots(struct super_block *sb, int type)
> >  			sb->dq_op->write_info(sb, cnt);
> >  	dqstats_inc(DQST_SYNCS);
> >  
> > +	flush_delayed_work(&quota_release_work);
> > +
> >  	return ret;
> >  }
> >  EXPORT_SYMBOL(dquot_writeback_dquots);
> > -- 
> > 2.43.5
Baokun Li Nov. 18, 2024, 1:29 a.m. UTC | #3
On 2024/11/17 1:59, Ojaswin Mujoo wrote:
> On Sat, Nov 16, 2024 at 02:20:26AM +0530, Ritesh Harjani wrote:
>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>>
>>> One of the paths quota writeback is called from is:
>>>
>>> freeze_super()
>>>    sync_filesystem()
>>>      ext4_sync_fs()
>>>        dquot_writeback_dquots()
>>>
>>> Since we currently don't always flush the quota_release_work queue in
>>> this path, we can end up with the following race:
>>>
>>>   1. dquot are added to releasing_dquots list during regular operations.
>>>   2. FS freeze starts, however, this does not flush the quota_release_work queue.
>>>   3. Freeze completes.
>>>   4. Kernel eventually tries to flush the workqueue while FS is frozen which
>>>      hits a WARN_ON since transaction gets started during frozen state:
>>>
>>>    ext4_journal_check_start+0x28/0x110 [ext4] (unreliable)
>>>    __ext4_journal_start_sb+0x64/0x1c0 [ext4]
>>>    ext4_release_dquot+0x90/0x1d0 [ext4]
>>>    quota_release_workfn+0x43c/0x4d0
>>>
>>> Which is the following line:
>>>
>>>    WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
>>>
>>> Which ultimately results in generic/390 failing due to dmesg
>>> noise. This was detected on powerpc machine 15 cores.
>>>
>>> To avoid this, make sure to flush the workqueue during
>>> dquot_writeback_dquots() so we dont have any pending workitems after
>>> freeze.
>> Not just that, sync_filesystem can also be called from other places and
>> quota_release_workfn() could write out and and release the dquot
>> structures if such are found during processing of releasing_dquots list.
>> IIUC, this was earlier done in the same dqput() context but had races
>> with dquot_mark_dquot_dirty(). Hence the final dqput() will now add the
>> dquot structures to releasing_dquots list and will schedule a delayed
>> workfn which will process the releasing_dquots list.
> Hi Ritesh,
>
> Ohh right, thanks for the context. I see this was done here:
>
>    dabc8b207566 quota: fix dqput() to follow the guarantees dquot_srcu
>    should provide
>
> Which went in v6.5. Let me cc Baokun as well.
Hello Ojaswin,

Nice catch! Thanks for fixing this up!

Have you tested the performance impact of this patch? It looks like the
unconditional call to flush_delayed_work() in dquot_writeback_dquots()
may have some performance impact for frequent chown/sync scenarios.

When calling release_dquot(), we will only remove the quota of an object
(user/group/project) from disk if it is not quota-limited and does not
use any inode or block.

Asynchronous removal is now much more performance friendly, not only does
it make full use of the multi-core, but for scenarios where we have to
repeatedly chown between two objects, delayed release avoids the need to
repeatedly allocate/free space in memory and on disk.

Overall, since the actual dirty data is already on the disk, there is no
consistency issue here as it is just clearing unreferenced quota on the
disk, so I thought maybe it would be better to call flush_delayed_work()
in the freeze context.


Thanks,
Baokun
>> And so after the final dqput and before the release_workfn gets
>> scheduled, if dquot gets marked as dirty or dquot_transfer gets called -
>> then I am suspecting that it could lead to a dirty or an active dquot.
>>
>> Hence, flushing the delayed quota_release_work at the end of
>> dquot_writeback_dquots() looks like the right thing to do IMO.
>>
>> But I can give another look as this part of the code is not that well
>> known to me.
>>
>>> Reported-by: Disha Goel <disgoel@linux.ibm.com>
>>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>>> ---
>> Maybe a fixes tag as well?
> Right, I'll add that in the next revision. I believe it would be:
>
> Fixes: dabc8b207566 ("quota: fix dqput() to follow the guarantees dquot_srcu should provide")
>
> Regards,
> ojaswin
>
>>>   fs/quota/dquot.c | 2 ++
>>>   1 file changed, 2 insertions(+)
>>>
>>> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
>>> index 3dd8d6f27725..2782cfc8c302 100644
>>> --- a/fs/quota/dquot.c
>>> +++ b/fs/quota/dquot.c
>>> @@ -729,6 +729,8 @@ int dquot_writeback_dquots(struct super_block *sb, int type)
>>>   			sb->dq_op->write_info(sb, cnt);
>>>   	dqstats_inc(DQST_SYNCS);
>>>   
>>> +	flush_delayed_work(&quota_release_work);
>>> +
>>>   	return ret;
>>>   }
>>>   EXPORT_SYMBOL(dquot_writeback_dquots);
>>> -- 
>>> 2.43.5
Jan Kara Nov. 18, 2024, 12:53 p.m. UTC | #4
On Mon 18-11-24 09:29:19, Baokun Li wrote:
> On 2024/11/17 1:59, Ojaswin Mujoo wrote:
> > On Sat, Nov 16, 2024 at 02:20:26AM +0530, Ritesh Harjani wrote:
> > > Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
> > > 
> > > > One of the paths quota writeback is called from is:
> > > > 
> > > > freeze_super()
> > > >    sync_filesystem()
> > > >      ext4_sync_fs()
> > > >        dquot_writeback_dquots()
> > > > 
> > > > Since we currently don't always flush the quota_release_work queue in
> > > > this path, we can end up with the following race:
> > > > 
> > > >   1. dquot are added to releasing_dquots list during regular operations.
> > > >   2. FS freeze starts, however, this does not flush the quota_release_work queue.
> > > >   3. Freeze completes.
> > > >   4. Kernel eventually tries to flush the workqueue while FS is frozen which
> > > >      hits a WARN_ON since transaction gets started during frozen state:
> > > > 
> > > >    ext4_journal_check_start+0x28/0x110 [ext4] (unreliable)
> > > >    __ext4_journal_start_sb+0x64/0x1c0 [ext4]
> > > >    ext4_release_dquot+0x90/0x1d0 [ext4]
> > > >    quota_release_workfn+0x43c/0x4d0
> > > > 
> > > > Which is the following line:
> > > > 
> > > >    WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
> > > > 
> > > > Which ultimately results in generic/390 failing due to dmesg
> > > > noise. This was detected on powerpc machine 15 cores.
> > > > 
> > > > To avoid this, make sure to flush the workqueue during
> > > > dquot_writeback_dquots() so we dont have any pending workitems after
> > > > freeze.
> > > Not just that, sync_filesystem can also be called from other places and
> > > quota_release_workfn() could write out and and release the dquot
> > > structures if such are found during processing of releasing_dquots list.
> > > IIUC, this was earlier done in the same dqput() context but had races
> > > with dquot_mark_dquot_dirty(). Hence the final dqput() will now add the
> > > dquot structures to releasing_dquots list and will schedule a delayed
> > > workfn which will process the releasing_dquots list.
> > Hi Ritesh,
> > 
> > Ohh right, thanks for the context. I see this was done here:
> > 
> >    dabc8b207566 quota: fix dqput() to follow the guarantees dquot_srcu
> >    should provide

Yup.

> Nice catch! Thanks for fixing this up!
> 
> Have you tested the performance impact of this patch? It looks like the
> unconditional call to flush_delayed_work() in dquot_writeback_dquots()
> may have some performance impact for frequent chown/sync scenarios.

Well, but sync(2) or so is expensive anyway. Also dquot_writeback_dquots()
should persist all pending quota modifications and it is true that pending
dquot_release() calls can remove quota structures from the quota file and
thus are by definition pending modifications. So I agree with Ojaswin that
putting the workqueue flush there makes sense and is practically required
for data consistency guarantees.

> When calling release_dquot(), we will only remove the quota of an object
> (user/group/project) from disk if it is not quota-limited and does not
> use any inode or block.
> 
> Asynchronous removal is now much more performance friendly, not only does
> it make full use of the multi-core, but for scenarios where we have to
> repeatedly chown between two objects, delayed release avoids the need to
> repeatedly allocate/free space in memory and on disk.

True, but unless you call sync(2) in between these two calls this is going
to still hold.

> Overall, since the actual dirty data is already on the disk, there is no
> consistency issue here as it is just clearing unreferenced quota on the
> disk, so I thought maybe it would be better to call flush_delayed_work()
> in the freeze context.

To summarise, I don't think real-life workloads are going to observe the
benefit and conceptually the call really belongs more to
dquot_writeback_dquots().

								Honza
Jan Kara Nov. 18, 2024, 1:15 p.m. UTC | #5
On Sat 16-11-24 00:04:49, Ojaswin Mujoo wrote:
> One of the paths quota writeback is called from is:
> 
> freeze_super()
>   sync_filesystem()
>     ext4_sync_fs()
>       dquot_writeback_dquots()
> 
> Since we currently don't always flush the quota_release_work queue in
> this path, we can end up with the following race:
> 
>  1. dquot are added to releasing_dquots list during regular operations.
>  2. FS freeze starts, however, this does not flush the quota_release_work queue.
>  3. Freeze completes.
>  4. Kernel eventually tries to flush the workqueue while FS is frozen which
>     hits a WARN_ON since transaction gets started during frozen state:
> 
>   ext4_journal_check_start+0x28/0x110 [ext4] (unreliable)
>   __ext4_journal_start_sb+0x64/0x1c0 [ext4]
>   ext4_release_dquot+0x90/0x1d0 [ext4]
>   quota_release_workfn+0x43c/0x4d0
> 
> Which is the following line:
> 
>   WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
> 
> Which ultimately results in generic/390 failing due to dmesg
> noise. This was detected on powerpc machine 15 cores.
> 
> To avoid this, make sure to flush the workqueue during
> dquot_writeback_dquots() so we dont have any pending workitems after
> freeze.
> 
> Reported-by: Disha Goel <disgoel@linux.ibm.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Thanks for debugging this!

> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index 3dd8d6f27725..2782cfc8c302 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -729,6 +729,8 @@ int dquot_writeback_dquots(struct super_block *sb, int type)
>  			sb->dq_op->write_info(sb, cnt);
>  	dqstats_inc(DQST_SYNCS);
>  
> +	flush_delayed_work(&quota_release_work);
> +

I'd rather do this at the start of dquot_writeback_dquots(). Chances are
this saves some retry loops in the dirty list iterations. That being said I
don't think this is enough as I'm thinking about it. iput() can be called
anytime while the filesystem is frozen (just freeze the filesystem and do
echo 3 >/proc/sys/vm/drop_caches) which will consequently call dquot_drop()
-> dqput(). This should not be really freeing the dquot on-disk structure
(the inode itself is still accounted there) but nevertheless it may end up
dropping the last dquot in-memory reference and ext4_release_dquot() will
call ext4_journal_start() and complain. So I think on top of this patch
which makes sense on its own and deals with 99.9% of cases, we also need
ext4 specific fix which uses sb_start_intwrite() to get freeze protection
in ext4_release_dquot() (and in principle we always needed this, delayed
dquot releasing does not influence this particular problem). Some care will
be needed if the transaction is already started when ext4_release_dquot()
is called - you can take inspiration in how ext4_evict_inode() handles
this.

								Honza
Ojaswin Mujoo Nov. 19, 2024, 5:42 a.m. UTC | #6
On Mon, Nov 18, 2024 at 02:15:12PM +0100, Jan Kara wrote:
> On Sat 16-11-24 00:04:49, Ojaswin Mujoo wrote:
> > One of the paths quota writeback is called from is:
> > 
> > freeze_super()
> >   sync_filesystem()
> >     ext4_sync_fs()
> >       dquot_writeback_dquots()
> > 
> > Since we currently don't always flush the quota_release_work queue in
> > this path, we can end up with the following race:
> > 
> >  1. dquot are added to releasing_dquots list during regular operations.
> >  2. FS freeze starts, however, this does not flush the quota_release_work queue.
> >  3. Freeze completes.
> >  4. Kernel eventually tries to flush the workqueue while FS is frozen which
> >     hits a WARN_ON since transaction gets started during frozen state:
> > 
> >   ext4_journal_check_start+0x28/0x110 [ext4] (unreliable)
> >   __ext4_journal_start_sb+0x64/0x1c0 [ext4]
> >   ext4_release_dquot+0x90/0x1d0 [ext4]
> >   quota_release_workfn+0x43c/0x4d0
> > 
> > Which is the following line:
> > 
> >   WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
> > 
> > Which ultimately results in generic/390 failing due to dmesg
> > noise. This was detected on powerpc machine 15 cores.
> > 
> > To avoid this, make sure to flush the workqueue during
> > dquot_writeback_dquots() so we dont have any pending workitems after
> > freeze.
> > 
> > Reported-by: Disha Goel <disgoel@linux.ibm.com>
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> Thanks for debugging this!
> 
> > diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> > index 3dd8d6f27725..2782cfc8c302 100644
> > --- a/fs/quota/dquot.c
> > +++ b/fs/quota/dquot.c
> > @@ -729,6 +729,8 @@ int dquot_writeback_dquots(struct super_block *sb, int type)
> >  			sb->dq_op->write_info(sb, cnt);
> >  	dqstats_inc(DQST_SYNCS);
> >  
> > +	flush_delayed_work(&quota_release_work);
> > +
> 
> I'd rather do this at the start of dquot_writeback_dquots(). Chances are
> this saves some retry loops in the dirty list iterations. That being said I

Hi Jan, thanks for review :)


> don't think this is enough as I'm thinking about it. iput() can be called
> anytime while the filesystem is frozen (just freeze the filesystem and do
> echo 3 >/proc/sys/vm/drop_caches) which will consequently call dquot_drop()
> -> dqput(). This should not be really freeing the dquot on-disk structure
> (the inode itself is still accounted there) but nevertheless it may end up
> dropping the last dquot in-memory reference and ext4_release_dquot() will
> call ext4_journal_start() and complain. So I think on top of this patch
> which makes sense on its own and deals with 99.9% of cases, we also need
> ext4 specific fix which uses sb_start_intwrite() to get freeze protection
> in ext4_release_dquot() (and in principle we always needed this, delayed
> dquot releasing does not influence this particular problem). Some care will
> be needed if the transaction is already started when ext4_release_dquot()
> is called - you can take inspiration in how ext4_evict_inode() handles
> this.

That's a good point Jan, this could indeed happen if we drop caches
destroying an inode pinned in the lru cache. Thanks for the pointers,
I'll try to look into hardening ext4_release_dquot() as you suggested
and send a v2.

Regards,
ojaswin
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Baokun Li Nov. 19, 2024, 6:29 a.m. UTC | #7
On 2024/11/18 20:53, Jan Kara wrote:
> On Mon 18-11-24 09:29:19, Baokun Li wrote:
>> On 2024/11/17 1:59, Ojaswin Mujoo wrote:
>>> On Sat, Nov 16, 2024 at 02:20:26AM +0530, Ritesh Harjani wrote:
>>>> Ojaswin Mujoo <ojaswin@linux.ibm.com> writes:
>>>>
>>>>> One of the paths quota writeback is called from is:
>>>>>
>>>>> freeze_super()
>>>>>     sync_filesystem()
>>>>>       ext4_sync_fs()
>>>>>         dquot_writeback_dquots()
>>>>>
>>>>> Since we currently don't always flush the quota_release_work queue in
>>>>> this path, we can end up with the following race:
>>>>>
>>>>>    1. dquot are added to releasing_dquots list during regular operations.
>>>>>    2. FS freeze starts, however, this does not flush the quota_release_work queue.
>>>>>    3. Freeze completes.
>>>>>    4. Kernel eventually tries to flush the workqueue while FS is frozen which
>>>>>       hits a WARN_ON since transaction gets started during frozen state:
>>>>>
>>>>>     ext4_journal_check_start+0x28/0x110 [ext4] (unreliable)
>>>>>     __ext4_journal_start_sb+0x64/0x1c0 [ext4]
>>>>>     ext4_release_dquot+0x90/0x1d0 [ext4]
>>>>>     quota_release_workfn+0x43c/0x4d0
>>>>>
>>>>> Which is the following line:
>>>>>
>>>>>     WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
>>>>>
>>>>> Which ultimately results in generic/390 failing due to dmesg
>>>>> noise. This was detected on powerpc machine 15 cores.
>>>>>
>>>>> To avoid this, make sure to flush the workqueue during
>>>>> dquot_writeback_dquots() so we dont have any pending workitems after
>>>>> freeze.
>>>> Not just that, sync_filesystem can also be called from other places and
>>>> quota_release_workfn() could write out and and release the dquot
>>>> structures if such are found during processing of releasing_dquots list.
>>>> IIUC, this was earlier done in the same dqput() context but had races
>>>> with dquot_mark_dquot_dirty(). Hence the final dqput() will now add the
>>>> dquot structures to releasing_dquots list and will schedule a delayed
>>>> workfn which will process the releasing_dquots list.
>>> Hi Ritesh,
>>>
>>> Ohh right, thanks for the context. I see this was done here:
>>>
>>>     dabc8b207566 quota: fix dqput() to follow the guarantees dquot_srcu
>>>     should provide
> Yup.
>
>> Nice catch! Thanks for fixing this up!
>>
>> Have you tested the performance impact of this patch? It looks like the
>> unconditional call to flush_delayed_work() in dquot_writeback_dquots()
>> may have some performance impact for frequent chown/sync scenarios.
> Well, but sync(2) or so is expensive anyway. Also dquot_writeback_dquots()
> should persist all pending quota modifications and it is true that pending
> dquot_release() calls can remove quota structures from the quota file and
> thus are by definition pending modifications. So I agree with Ojaswin that
> putting the workqueue flush there makes sense and is practically required
> for data consistency guarantees.
Make sense.
>> When calling release_dquot(), we will only remove the quota of an object
>> (user/group/project) from disk if it is not quota-limited and does not
>> use any inode or block.
>>
>> Asynchronous removal is now much more performance friendly, not only does
>> it make full use of the multi-core, but for scenarios where we have to
>> repeatedly chown between two objects, delayed release avoids the need to
>> repeatedly allocate/free space in memory and on disk.
> True, but unless you call sync(2) in between these two calls this is going
> to still hold.
Yeah without sync or syncfs, it's the same as before.
>> Overall, since the actual dirty data is already on the disk, there is no
>> consistency issue here as it is just clearing unreferenced quota on the
>> disk, so I thought maybe it would be better to call flush_delayed_work()
>> in the freeze context.
> To summarise, I don't think real-life workloads are going to observe the
> benefit and conceptually the call really belongs more to
> dquot_writeback_dquots().
>
> 								Honza

Okay, thanks for the feedback!


Regards,
Baokun
diff mbox series

Patch

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 3dd8d6f27725..2782cfc8c302 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -729,6 +729,8 @@  int dquot_writeback_dquots(struct super_block *sb, int type)
 			sb->dq_op->write_info(sb, cnt);
 	dqstats_inc(DQST_SYNCS);
 
+	flush_delayed_work(&quota_release_work);
+
 	return ret;
 }
 EXPORT_SYMBOL(dquot_writeback_dquots);