Message ID: 20200910091252.525346-1-yebin10@huawei.com
State: Superseded
Series: [v3] ext4: Fix dead loop in ext4_mb_new_blocks
On Thu 10-09-20 17:12:52, Ye Bin wrote:
> As we test disk offline/online with running fsstress, we find fsstress
> process is keeping running state.
> kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
> ....
> kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
>
> ext4_mb_new_blocks
> repeat:
> 	ext4_mb_discard_preallocations_should_retry(sb, ac, &seq)
> 		freed = ext4_mb_discard_preallocations
> 			ext4_mb_discard_group_preallocations
> 				this_cpu_inc(discard_pa_seq);
> 		---> freed == 0
> 		seq_retry = ext4_get_discard_pa_seq_sum
> 			for_each_possible_cpu(__cpu)
> 				__seq += per_cpu(discard_pa_seq, __cpu);
> 		if (seq_retry != *seq) {
> 			*seq = seq_retry;
> 			ret = true;
> 		}
>
> As we see seq_retry is sum of discard_pa_seq every cpu, if
> ext4_mb_discard_group_preallocations return zero discard_pa_seq in this
> cpu maybe increase one, so condition "seq_retry != *seq" have always
> been met.
> To Fix this problem, in ext4_mb_discard_group_preallocations function increase
> discard_pa_seq only when it found preallocation to discard.
>
> Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
> Signed-off-by: Ye Bin <yebin10@huawei.com>

Thanks for the patch. One comment below.

> ---
>  fs/ext4/mballoc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index f386fe62727d..fd55264dc3fe 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4191,7 +4191,6 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  	INIT_LIST_HEAD(&list);
>  repeat:
>  	ext4_lock_group(sb, group);
> -	this_cpu_inc(discard_pa_seq);
>  	list_for_each_entry_safe(pa, tmp,
>  				&grp->bb_prealloc_list, pa_group_list) {
>  		spin_lock(&pa->pa_lock);
> @@ -4233,6 +4232,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>  		goto out;
>  	}
>
> +	/* only increase when find reallocation to discard */
> +	this_cpu_inc(discard_pa_seq);
> +

This is a good place to increment the counter but I think you also need to
handle the case:

	if (free < needed && busy) {
		busy = 0;
		ext4_unlock_group(sb, group);
		cond_resched();
		goto repeat;
	}

We can unlock the group here after removing some preallocations and thus
other processes checking discard_pa_seq could miss we did this. In fact I
think the code is somewhat buggy here and we should also discard extents
accumulated on "list" so far before unlocking the group. Ritesh?

								Honza
Hello Ye,

Please excuse if there is something horribly wrong with my email
formatting. Have yesterday received this laptop and still setting up
few things.

On 9/10/20 9:47 PM, Jan Kara wrote:
> On Thu 10-09-20 17:12:52, Ye Bin wrote:
>> As we test disk offline/online with running fsstress, we find fsstress
>> process is keeping running state.
[...]
>> @@ -4233,6 +4232,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
>>  		goto out;
>>  	}
>>
>> +	/* only increase when find reallocation to discard */
>> +	this_cpu_inc(discard_pa_seq);
>> +
>
> This is a good place to increment the counter but I think you also need to
> handle the case:
>
> 	if (free < needed && busy) {
> 		busy = 0;
> 		ext4_unlock_group(sb, group);
> 		cond_resched();
> 		goto repeat;
> 	}
>
> We can unlock the group here after removing some preallocations and thus
> other processes checking discard_pa_seq could miss we did this. In fact I
> think the code is somewhat buggy here and we should also discard extents
> accumulated on "list" so far before unlocking the group. Ritesh?

mmm, so even though this code is not discarding those buffers b4
unlocking, but it has still removed those from the grp->bb_prealloc_list
and added it to the local list. And since it will at some point get
scheduled and start operating from repeat: label so functionally wise
this should be ok. Am I missing anything?

Although I agree, that if we remove at least the current pa's before
unlocking the group may be a good idea, but we should also check
why was this done like this at the first place.

I agree with Jan, that we should increment discard_pa_seq once we
actually have something to discard. I should have written a comment here
to explain why we did this here. But I think commit msg should have all
the history (since I have a habit of writing long commit msgs ;)

But IIRC, it was done since in case if there is a parallel thread
which is discarding all the preallocations so the current thread may
return 0 since it checks the list_empty(&grp->bb_prealloc_list) check in
ext4_mb_discard_group_preallocations() and returns 0 directly.

And why the discard_pa_seq counter at other places may not help since
we remove the pa nodes from grp->bb_prealloc_list into a local list and
then start operating on that. So meanwhile some thread may comes and
just checks that the list is empty and return 0 while some other thread
may start discarding from it's local list.

So I guess the main problem was that in the current code we remove
the pa from grp->bb_prealloc_list and add it to local list. So if
someone else comes and checks that grp->bb_prealloc_list is empty then
it will directly return 0.

So, maybe we could do something like this then?

repeat:
	ext4_lock_group(sb, group);
-	this_cpu_inc(discard_pa_seq);
	list_for_each_entry_safe(pa, tmp,
			&grp->bb_prealloc_list, pa_group_list) {
	<...>
+		if (!free)
+			this_cpu_inc(discard_pa_seq); // we should do this here before calling list_del(&pa->pa_group_list);

		/* we can trust pa_free ... */
		free += pa->pa_free;
		spin_unlock(&pa->pa_lock);

		list_del(&pa->pa_group_list);
		list_add(&pa->u.pa_tmp_list, &list);
	}

I have some test cases around this to test for cases which were
failing. Since in world of parallelism you can't be 100% certain of some
corner case (like this one you just reported).
But, I don't have my other box rite now where I kept all of those -
due to some technical issues. I think I should be able to get those by
next week, if not, I anyways will setup my current machine for testing
this.

-ritesh
In fact, we didn't free the available space, and other processes
couldn't get it even if they tried again. According to your opinions,
I made two different revisions. Which one do you think is better?

(1) Free PAs before repeat

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 132c118d12e1..4ab76882350d 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4189,7 +4189,6 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	INIT_LIST_HEAD(&list);
 repeat:
 	ext4_lock_group(sb, group);
-	this_cpu_inc(discard_pa_seq);
 	list_for_each_entry_safe(pa, tmp,
 				&grp->bb_prealloc_list, pa_group_list) {
 		spin_lock(&pa->pa_lock);
@@ -4215,22 +4214,6 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		list_add(&pa->u.pa_tmp_list, &list);
 	}
 
-	/* if we still need more blocks and some PAs were used, try again */
-	if (free < needed && busy) {
-		busy = 0;
-		ext4_unlock_group(sb, group);
-		cond_resched();
-		goto repeat;
-	}
-
-	/* found anything to free? */
-	if (list_empty(&list)) {
-		BUG_ON(free != 0);
-		mb_debug(sb, "Someone else may have freed PA for this group %u\n",
-			 group);
-		goto out;
-	}
-
 	/* now free all selected PAs */
 	list_for_each_entry_safe(pa, tmp,
 				&list, u.pa_tmp_list) {
@@ -4248,6 +4231,14 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
 	}
 
+	/* if we still need more blocks and some PAs were used, try again */
+	if (free < needed && busy) {
+		busy = 0;
+		ext4_unlock_group(sb, group);
+		cond_resched();
+		goto repeat;
+	}
+
 out:
 	ext4_unlock_group(sb, group);
 	ext4_mb_unload_buddy(&e4b);
--

(2) Increment discard_pa_seq in the busy-retry path when PAs were collected

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 132c118d12e1..188772bbf679 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4189,7 +4189,6 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	INIT_LIST_HEAD(&list);
 repeat:
 	ext4_lock_group(sb, group);
-	this_cpu_inc(discard_pa_seq);
 	list_for_each_entry_safe(pa, tmp,
 				&grp->bb_prealloc_list, pa_group_list) {
 		spin_lock(&pa->pa_lock);
@@ -4217,6 +4216,8 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 
 	/* if we still need more blocks and some PAs were used, try again */
 	if (free < needed && busy) {
+		if (free)
+			this_cpu_inc(discard_pa_seq);
 		busy = 0;
 		ext4_unlock_group(sb, group);
 		cond_resched();

On 2020/9/11 21:20, Ritesh Harjani wrote:
> Hello Ye,
[...]
> So, maybe we could do something like this then?
>
> repeat:
> 	ext4_lock_group(sb, group);
> -	this_cpu_inc(discard_pa_seq);
> 	list_for_each_entry_safe(pa, tmp,
> 			&grp->bb_prealloc_list, pa_group_list) {
> 	<...>
> +		if (!free)
> +			this_cpu_inc(discard_pa_seq);
>
> 		/* we can trust pa_free ... */
> 		free += pa->pa_free;
> 		spin_unlock(&pa->pa_lock);
>
> 		list_del(&pa->pa_group_list);
> 		list_add(&pa->u.pa_tmp_list, &list);
> 	}
[...]
> -ritesh
Hello Ritesh,

On Fri 11-09-20 18:50:22, Ritesh Harjani wrote:
> On 9/10/20 9:47 PM, Jan Kara wrote:
[...]
> > > +	/* only increase when find reallocation to discard */
> > > +	this_cpu_inc(discard_pa_seq);
> > > +
> >
> > This is a good place to increment the counter but I think you also need to
> > handle the case:
> >
> > 	if (free < needed && busy) {
> > 		busy = 0;
> > 		ext4_unlock_group(sb, group);
> > 		cond_resched();
> > 		goto repeat;
> > 	}
> >
> > We can unlock the group here after removing some preallocations and thus
> > other processes checking discard_pa_seq could miss we did this. In fact I
> > think the code is somewhat buggy here and we should also discard extents
> > accumulated on "list" so far before unlocking the group. Ritesh?
>
> mmm, so even though this code is not discarding those buffers b4
> unlocking, but it has still removed those from the grp->bb_prealloc_list
> and added it to the local list. And since it will at some point get
> scheduled and start operating from repeat: label so functionally wise
> this should be ok. Am I missing anything?

Yes, this function itself will be working correctly. But if we unlock the
group before discarding preallocations we have in our local list the
following can happen:

TASK0						TASK1
ext4_mb_new_blocks()
  - no free block found
  ext4_mb_discard_preallocations_should_retry()
    ext4_mb_discard_preallocations()
      ext4_mb_discard_group_preallocations()
        ext4_lock_group(sb, group);
        this_cpu_inc(discard_pa_seq);
        scans group, moves some
          preallocations to local list
        if (free < needed && busy) {
          ext4_unlock_group(sb, group);
						- finds no pa to discard
						seq_retry = ext4_get_discard_pa_seq_sum() - ok
						  seq_retry != seq -> retry
						- new search still didn't return anything
						  because preallocations are sitting in
						  TASK0 local list and are not freed yet

> Although I agree, that if we remove at least the current pa's before
> unlocking the group may be a good idea, but we should also check
> why was this done like this at the first place.

If we don't care about somewhat premature ENOSPC error, then not discarding
preallocations before unlocking the group is not a problem. That's why
previously it was done like this I believe.

> I agree with Jan, that we should increment discard_pa_seq once we
> actually have something to discard. I should have written a comment here
> to explain why we did this here. But I think commit msg should have all
> the history (since I have a habit of writing long commit msgs ;)
>
> But IIRC, it was done since in case if there is a parallel thread which
> is discarding all the preallocations so the current thread may return 0
> since it checks the list_empty(&grp->bb_prealloc_list) check in
> ext4_mb_discard_group_preallocations() and returns 0 directly.

Ah, OK. I forgot about the unlocked check for empty bb_prealloc_list.

> And why the discard_pa_seq counter at other places may not help since we
> remove the pa nodes from grp->bb_prealloc_list into a local list and then
> start operating on that. So meanwhile some thread may comes and just
> checks that the list is empty and return 0 while some other thread may
> start discarding from it's local list. So I guess the main problem was
> that in the current code we remove the pa from grp->bb_prealloc_list and
> add it to local list. So if someone else comes and checks that
> grp->bb_prealloc_list is empty then it will directly return 0.
>
> So, maybe we could do something like this then?
>
> repeat:
> 	ext4_lock_group(sb, group);
> -	this_cpu_inc(discard_pa_seq);
> 	list_for_each_entry_safe(pa, tmp,
> 			&grp->bb_prealloc_list, pa_group_list) {
> 	<...>
>
> +	if (!free)
> +		this_cpu_inc(discard_pa_seq);

Yup, that looks good.

> 	/* we can trust pa_free ... */
> 	free += pa->pa_free;
> 	spin_unlock(&pa->pa_lock);
>
> 	list_del(&pa->pa_group_list);
> 	list_add(&pa->u.pa_tmp_list, &list);
> 	}
>
> I have some test cases around this to test for cases which were
> failing.
[...]

								Honza
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f386fe62727d..fd55264dc3fe 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4191,7 +4191,6 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 	INIT_LIST_HEAD(&list);
 repeat:
 	ext4_lock_group(sb, group);
-	this_cpu_inc(discard_pa_seq);
 	list_for_each_entry_safe(pa, tmp,
 				&grp->bb_prealloc_list, pa_group_list) {
 		spin_lock(&pa->pa_lock);
@@ -4233,6 +4232,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 		goto out;
 	}
 
+	/* only increment when we find preallocations to discard */
+	this_cpu_inc(discard_pa_seq);
+
 	/* now free all selected PAs */
 	list_for_each_entry_safe(pa, tmp,
 				&list, u.pa_tmp_list) {
While testing disk offline/online under a running fsstress workload, we
found an fsstress process stuck in the running state:

kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
....
kworker/u32:3-262 [004] ...1 140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114

ext4_mb_new_blocks
repeat:
	ext4_mb_discard_preallocations_should_retry(sb, ac, &seq)
		freed = ext4_mb_discard_preallocations
			ext4_mb_discard_group_preallocations
				this_cpu_inc(discard_pa_seq);
		---> freed == 0
		seq_retry = ext4_get_discard_pa_seq_sum
			for_each_possible_cpu(__cpu)
				__seq += per_cpu(discard_pa_seq, __cpu);
		if (seq_retry != *seq) {
			*seq = seq_retry;
			ret = true;
		}

seq_retry is the sum of discard_pa_seq over all CPUs. Even when
ext4_mb_discard_group_preallocations() returns zero, it still increments
discard_pa_seq on the local CPU, so the condition "seq_retry != *seq" is
always met and ext4_mb_new_blocks() loops forever.

Fix this by incrementing discard_pa_seq in
ext4_mb_discard_group_preallocations() only when a preallocation to
discard is actually found.

Fixes: 07b5b8e1ac40 ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/ext4/mballoc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)