Patchwork [block/for-3.3/core] block: an exiting task should be allowed to create io_context

login
register
mail settings
Submitter Tejun Heo
Date Dec. 25, 2011, 1:02 a.m.
Message ID <20111225010238.GA6013@htj.dyndns.org>
Download mbox | patch
Permalink /patch/133168/
State Not Applicable
Delegated to: David Miller
Headers show

Comments

Tejun Heo - Dec. 25, 2011, 1:02 a.m.
While fixing io_context creation / task exit race condition,
6e736be7f2 "block: make ioc get/put interface more conventional and
fix race on alloction" also prevented an exiting (%PF_EXITING) task
from creating its own io_context.  This is incorrect as exit path may
issue IOs, e.g. from exit_files(), and if those IOs are the first ones
issued by the task, io_context needs to be created to process the IOs.

Combined with the existing problem of io_context / io_cq creation
failure having the possibility of stalling IO, this problem results in
deterministic full IO lockup with certain workloads.

Fix it by allowing io_context creation regardless of %PF_EXITING for
%current.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Hugh Dickins <hughd@google.com>
---
Thanks a lot for the hint, Hugh.  My testing stuff (fio, dd and some
adhoc rawio testing programs) was issuing IOs before exiting, so I
didn't hit the problem and I suspect the reason why I didn't see the
boot failure Andrew was seeing was because of systemd - boot process
used to be dominated by lots of short-lived programs, many of which
touching/modifying files, and thus it triggered the first IO on exit
paths with Andrew's old userland.  With systemd, most of those are
gone, so...

Thanks.

 block/blk-ioc.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe - Dec. 25, 2011, 1:29 p.m.
On 2011-12-25 02:02, Tejun Heo wrote:
> While fixing io_context creation / task exit race condition,
> 6e736be7f2 "block: make ioc get/put interface more conventional and
> fix race on alloction" also prevented an exiting (%PF_EXITING) task
> from creating its own io_context.  This is incorrect as exit path may
> issue IOs, e.g. from exit_files(), and if those IOs are the first ones
> issued by the task, io_context needs to be created to process the IOs.
> 
> Combined with the existing problem of io_context / io_cq creation
> failure having the possibility of stalling IO, this problem results in
> deterministic full IO lockup with certain workloads.
> 
> Fix it by allowing io_context creation regardless of %PF_EXITING for
> %current.

Thanks, applied.
Andrew Morton - Dec. 27, 2011, 10:07 p.m.
On Sun, 25 Dec 2011 14:29:34 +0100
Jens Axboe <axboe@kernel.dk> wrote:

> On 2011-12-25 02:02, Tejun Heo wrote:
> > While fixing io_context creation / task exit race condition,
> > 6e736be7f2 "block: make ioc get/put interface more conventional and
> > fix race on alloction" also prevented an exiting (%PF_EXITING) task
> > from creating its own io_context.  This is incorrect as exit path may
> > issue IOs, e.g. from exit_files(), and if those IOs are the first ones
> > issued by the task, io_context needs to be created to process the IOs.
> > 
> > Combined with the existing problem of io_context / io_cq creation
> > failure having the possibility of stalling IO, this problem results in
> > deterministic full IO lockup with certain workloads.
> > 
> > Fix it by allowing io_context creation regardless of %PF_EXITING for
> > %current.

The patch works for me.

> Thanks, applied.

So we get another great big bisection hole in mainline.  I feel
duty bound to rewhine about this :(

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hugh Dickins - Dec. 28, 2011, 8:33 a.m.
On Sat, 24 Dec 2011, Tejun Heo wrote:

> While fixing io_context creation / task exit race condition,
> 6e736be7f2 "block: make ioc get/put interface more conventional and
> fix race on alloction" also prevented an exiting (%PF_EXITING) task
> from creating its own io_context.  This is incorrect as exit path may
> issue IOs, e.g. from exit_files(), and if those IOs are the first ones
> issued by the task, io_context needs to be created to process the IOs.
> 
> Combined with the existing problem of io_context / io_cq creation
> failure having the possibility of stalling IO, this problem results in
> deterministic full IO lockup with certain workloads.
> 
> Fix it by allowing io_context creation regardless of %PF_EXITING for
> %current.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Andrew Morton <akpm@linux-foundation.org>
> Reported-by: Hugh Dickins <hughd@google.com>

Thanks, I think I've now built enough kernels on -next plus your patch
to say that it does indeed solve that problem.

However, there are a couple of other unhealthy symptoms I've noticed
under load in -next's block/cfq layer, both with and without your patch.

One is kernel BUG at block/cfq-iosched.c:2585!
	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));

cfq_dispatch_request+0x1a
cfq_dispatch_requests+0x5c
blk_peek_request+0x195
scsi_request_fn+0x6a
__blk_run_queue+0x16
scsi_run_queue+0x18a
scsi_next_command+0x36
scsi_io_completion+0x426
scsi_finish_command+0xaf
scsi_softirq_done+0xdd
blk_done_softirq+0x6c
__do_softirq+0x80
call_softirq+0x1c
do_softirq+0x33
irq_exit+0x3f
do_IRQ+0x97
ret_from_intr

I've had that one four times now on different machines; but quicker
to reproduce are these warnings from CONFIG_DEBUG_LIST=y:

------------[ cut here ]------------
WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
Hardware name: 4174AY9
list_del corruption. prev->next should be ffff880005aa1380, but was 6b6b6b6b6b6b6b6b
Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device
Pid: 29241, comm: cc1 Tainted: G        W    3.2.0-rc6-next-20111222 #18
Call Trace:
 <IRQ>  [<ffffffff810544b4>] warn_slowpath_common+0x80/0x98
 [<ffffffff81054560>] warn_slowpath_fmt+0x41/0x43
 [<ffffffff811fc1a1>] __list_del_entry+0x8d/0x98
 [<ffffffff811df8ab>] cfq_remove_request+0x3b/0xdf
 [<ffffffff811df989>] cfq_dispatch_insert+0x3a/0x87
 [<ffffffff811dfb3b>] cfq_dispatch_request+0x65/0x92
 [<ffffffff811dfbc4>] cfq_dispatch_requests+0x5c/0x133
 [<ffffffff812e103e>] ? scsi_request_fn+0x3b6/0x3d3
 [<ffffffff811d3069>] blk_peek_request+0x195/0x1a6
 [<ffffffff812e103e>] ? scsi_request_fn+0x3b6/0x3d3
 [<ffffffff812e0cf5>] scsi_request_fn+0x6d/0x3d3
 [<ffffffff811d0730>] __blk_run_queue+0x19/0x1b
 [<ffffffff811d0bfd>] blk_run_queue+0x21/0x35
 [<ffffffff812e08c4>] scsi_run_queue+0x11f/0x1b9
 [<ffffffff812e205c>] scsi_next_command+0x36/0x46
 [<ffffffff812e24dc>] scsi_io_completion+0x426/0x4a9
 [<ffffffff812dc0b2>] scsi_finish_command+0xaf/0xb8
 [<ffffffff812e200c>] scsi_softirq_done+0xdd/0xe5
 [<ffffffff811d79c6>] blk_done_softirq+0x76/0x8a
 [<ffffffff8105a28d>] __do_softirq+0x98/0x136
 [<ffffffff814e649c>] call_softirq+0x1c/0x30
 [<ffffffff8102f187>] do_softirq+0x38/0x81
 [<ffffffff8105a596>] irq_exit+0x4e/0xb6
 [<ffffffff8102ee9e>] do_IRQ+0x97/0xae
 [<ffffffff814e49f0>] common_interrupt+0x70/0x70
 <EOI>  [<ffffffff814e4a8e>] ? retint_swapgs+0xe/0x13
---[ end trace 61fdaa1b260613d1 ]---

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo - Dec. 28, 2011, 4:48 p.m.
Hello, Hugh.

On Wed, Dec 28, 2011 at 12:33:01AM -0800, Hugh Dickins wrote:
> Thanks, I think I've now built enough kernels on -next plus your patch
> to say that it does indeed solve that problem.

Awesome, thanks for verifying the fix.

> However, there are a couple of other unhealthy symptoms I've noticed
> under load in -next's block/cfq layer, both with and without your patch.
> 
> One is kernel BUG at block/cfq-iosched.c:2585!
> 	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
> 
> cfq_dispatch_request+0x1a
> cfq_dispatch_requests+0x5c
> blk_peek_request+0x195
> scsi_request_fn+0x6a
> __blk_run_queue+0x16
> scsi_run_queue+0x18a
> scsi_next_command+0x36
> scsi_io_completion+0x426
> scsi_finish_command+0xaf
> scsi_softirq_done+0xdd
> blk_done_softirq+0x6c
> __do_softirq+0x80
> call_softirq+0x1c
> do_softirq+0x33
> irq_exit+0x3f
> do_IRQ+0x97
> ret_from_intr
> 
> I've had that one four times now on different machines; but quicker
> to reproduce are these warnings from CONFIG_DEBUG_LIST=y:
> 
> ------------[ cut here ]------------
> WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
> Hardware name: 4174AY9
> list_del corruption. prev->next should be ffff880005aa1380, but was 6b6b6b6b6b6b6b6b
> Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device
> Pid: 29241, comm: cc1 Tainted: G        W    3.2.0-rc6-next-20111222 #18
> Call Trace:
>  <IRQ>  [<ffffffff810544b4>] warn_slowpath_common+0x80/0x98
>  [<ffffffff81054560>] warn_slowpath_fmt+0x41/0x43
>  [<ffffffff811fc1a1>] __list_del_entry+0x8d/0x98
>  [<ffffffff811df8ab>] cfq_remove_request+0x3b/0xdf
>  [<ffffffff811df989>] cfq_dispatch_insert+0x3a/0x87
>  [<ffffffff811dfb3b>] cfq_dispatch_request+0x65/0x92
>  [<ffffffff811dfbc4>] cfq_dispatch_requests+0x5c/0x133
>  [<ffffffff812e103e>] ? scsi_request_fn+0x3b6/0x3d3
>  [<ffffffff811d3069>] blk_peek_request+0x195/0x1a6
>  [<ffffffff812e103e>] ? scsi_request_fn+0x3b6/0x3d3
>  [<ffffffff812e0cf5>] scsi_request_fn+0x6d/0x3d3
>  [<ffffffff811d0730>] __blk_run_queue+0x19/0x1b
>  [<ffffffff811d0bfd>] blk_run_queue+0x21/0x35
>  [<ffffffff812e08c4>] scsi_run_queue+0x11f/0x1b9
>  [<ffffffff812e205c>] scsi_next_command+0x36/0x46
>  [<ffffffff812e24dc>] scsi_io_completion+0x426/0x4a9
>  [<ffffffff812dc0b2>] scsi_finish_command+0xaf/0xb8
>  [<ffffffff812e200c>] scsi_softirq_done+0xdd/0xe5
>  [<ffffffff811d79c6>] blk_done_softirq+0x76/0x8a
>  [<ffffffff8105a28d>] __do_softirq+0x98/0x136
>  [<ffffffff814e649c>] call_softirq+0x1c/0x30
>  [<ffffffff8102f187>] do_softirq+0x38/0x81
>  [<ffffffff8105a596>] irq_exit+0x4e/0xb6
>  [<ffffffff8102ee9e>] do_IRQ+0x97/0xae
>  [<ffffffff814e49f0>] common_interrupt+0x70/0x70
>  <EOI>  [<ffffffff814e4a8e>] ? retint_swapgs+0xe/0x13
> ---[ end trace 61fdaa1b260613d1 ]---

Hmm... that looks like cfqq being freed before unlinked.  I'll try to
reproduce it.  Is there any particular workload you were running?

Thanks.
Hugh Dickins - Dec. 28, 2011, 5:50 p.m.
On Wed, 28 Dec 2011, Tejun Heo wrote:
> On Wed, Dec 28, 2011 at 12:33:01AM -0800, Hugh Dickins wrote:
> 
> > However, there are a couple of other unhealthy symptoms I've noticed
> > under load in -next's block/cfq layer, both with and without your patch.
> > 
> > One is kernel BUG at block/cfq-iosched.c:2585!
> > 	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
> > 
> > cfq_dispatch_request+0x1a
> > cfq_dispatch_requests+0x5c
> > blk_peek_request+0x195
> > scsi_request_fn+0x6a
> > __blk_run_queue+0x16
> > scsi_run_queue+0x18a
> > scsi_next_command+0x36
> > scsi_io_completion+0x426
> > scsi_finish_command+0xaf
> > scsi_softirq_done+0xdd
> > blk_done_softirq+0x6c
> > __do_softirq+0x80
> > call_softirq+0x1c
> > do_softirq+0x33
> > irq_exit+0x3f
> > do_IRQ+0x97
> > ret_from_intr
> > 
> > I've had that one four times now on different machines; but quicker
> > to reproduce are these warnings from CONFIG_DEBUG_LIST=y:
> > 
> > ------------[ cut here ]------------
> > WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
> > Hardware name: 4174AY9
> > list_del corruption. prev->next should be ffff880005aa1380, but was 6b6b6b6b6b6b6b6b
> > Modules linked in: snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device
> > Pid: 29241, comm: cc1 Tainted: G        W    3.2.0-rc6-next-20111222 #18
> > Call Trace:
> >  <IRQ>  [<ffffffff810544b4>] warn_slowpath_common+0x80/0x98
> >  [<ffffffff81054560>] warn_slowpath_fmt+0x41/0x43
> >  [<ffffffff811fc1a1>] __list_del_entry+0x8d/0x98
> >  [<ffffffff811df8ab>] cfq_remove_request+0x3b/0xdf
> >  [<ffffffff811df989>] cfq_dispatch_insert+0x3a/0x87
> >  [<ffffffff811dfb3b>] cfq_dispatch_request+0x65/0x92
> >  [<ffffffff811dfbc4>] cfq_dispatch_requests+0x5c/0x133
> >  [<ffffffff812e103e>] ? scsi_request_fn+0x3b6/0x3d3
> >  [<ffffffff811d3069>] blk_peek_request+0x195/0x1a6
> >  [<ffffffff812e103e>] ? scsi_request_fn+0x3b6/0x3d3
> >  [<ffffffff812e0cf5>] scsi_request_fn+0x6d/0x3d3
> >  [<ffffffff811d0730>] __blk_run_queue+0x19/0x1b
> >  [<ffffffff811d0bfd>] blk_run_queue+0x21/0x35
> >  [<ffffffff812e08c4>] scsi_run_queue+0x11f/0x1b9
> >  [<ffffffff812e205c>] scsi_next_command+0x36/0x46
> >  [<ffffffff812e24dc>] scsi_io_completion+0x426/0x4a9
> >  [<ffffffff812dc0b2>] scsi_finish_command+0xaf/0xb8
> >  [<ffffffff812e200c>] scsi_softirq_done+0xdd/0xe5
> >  [<ffffffff811d79c6>] blk_done_softirq+0x76/0x8a
> >  [<ffffffff8105a28d>] __do_softirq+0x98/0x136
> >  [<ffffffff814e649c>] call_softirq+0x1c/0x30
> >  [<ffffffff8102f187>] do_softirq+0x38/0x81
> >  [<ffffffff8105a596>] irq_exit+0x4e/0xb6
> >  [<ffffffff8102ee9e>] do_IRQ+0x97/0xae
> >  [<ffffffff814e49f0>] common_interrupt+0x70/0x70
> >  <EOI>  [<ffffffff814e4a8e>] ? retint_swapgs+0xe/0x13
> > ---[ end trace 61fdaa1b260613d1 ]---
> 
> Hmm... that looks like cfqq being freed before unlinked.  I'll try to
> reproduce it.  Is there any particular workload you were running?

"It's the tmpfs swapping test that I've been running, with variations,
for years.  System booted with mem=700M and 1.5G swap, two repetitious
make -j20 kernel builds (of a 2.6.24 kernel: I stuck with that because
the balance of built to unbuilt source grows smaller with later kernels),
one directly in a tmpfs, the other in a 1k-block ext2 (that I drive with
ext4's CONFIG_EXT4_USE_FOR_EXT23) on /dev/loop0 on a 450MB tmpfs file."

I doubt much of that (quoted from an older mail to someone else about
one of the many other bugs it's found) is relevant: maybe just plenty
of file I/O and swapping.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo - Dec. 28, 2011, 5:55 p.m.
Hello,

On Wed, Dec 28, 2011 at 9:50 AM, Hugh Dickins <hughd@google.com> wrote:
> "It's the tmpfs swapping test that I've been running, with variations,
> for years.  System booted with mem=700M and 1.5G swap, two repetitious
> make -j20 kernel builds (of a 2.6.24 kernel: I stuck with that because
> the balance of built to unbuilt source grows smaller with later kernels),
> one directly in a tmpfs, the other in a 1k-block ext2 (that I drive with
> ext4's CONFIG_EXT4_USE_FOR_EXT23) on /dev/loop0 on a 450MB tmpfs file."
>
> I doubt much of that (quoted from an older mail to someone else about
> one of the many other bugs it's found) is relevant: maybe just plenty
> of file I/O and swapping.

Plain -j4 build isn't triggering anything.  I'll try to replicate the condition.

Thanks.
Tejun Heo - Dec. 28, 2011, 9:19 p.m.
On Wed, Dec 28, 2011 at 09:55:12AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Wed, Dec 28, 2011 at 9:50 AM, Hugh Dickins <hughd@google.com> wrote:
> > "It's the tmpfs swapping test that I've been running, with variations,
> > for years.  System booted with mem=700M and 1.5G swap, two repetitious
> > make -j20 kernel builds (of a 2.6.24 kernel: I stuck with that because
> > the balance of built to unbuilt source grows smaller with later kernels),
> > one directly in a tmpfs, the other in a 1k-block ext2 (that I drive with
> > ext4's CONFIG_EXT4_USE_FOR_EXT23) on /dev/loop0 on a 450MB tmpfs file."
> >
> > I doubt much of that (quoted from an older mail to someone else about
> > one of the many other bugs it's found) is relevant: maybe just plenty
> > of file I/O and swapping.
> 
> Plain -j4 build isn't triggering anything.  I'll try to replicate the condition.

It's not too reliable but I can reproduce it with -j 22 allmodconfig
build inside qemu w/ 512M of memory.  I'll try to find out what's
going on.

Thanks.
Tejun Heo - Jan. 3, 2012, 5:35 p.m.
Happy new year, guys.

On Wed, Dec 28, 2011 at 01:19:18PM -0800, Tejun Heo wrote:
> > On Wed, Dec 28, 2011 at 9:50 AM, Hugh Dickins <hughd@google.com> wrote:
> > > "It's the tmpfs swapping test that I've been running, with variations,
> > > for years.  System booted with mem=700M and 1.5G swap, two repetitious
> > > make -j20 kernel builds (of a 2.6.24 kernel: I stuck with that because
> > > the balance of built to unbuilt source grows smaller with later kernels),
> > > one directly in a tmpfs, the other in a 1k-block ext2 (that I drive with
> > > ext4's CONFIG_EXT4_USE_FOR_EXT23) on /dev/loop0 on a 450MB tmpfs file."
> > >
> > > I doubt much of that (quoted from an older mail to someone else about
> > > one of the many other bugs it's found) is relevant: maybe just plenty
> > > of file I/O and swapping.
> > 
> > Plain -j4 build isn't triggering anything.  I'll try to replicate the condition.
> 
> It's not too reliable but I can reproduce it with -j 22 allmodconfig
> build inside qemu w/ 512M of memory.  I'll try to find out what's
> going on.

I misread the code, the problem is empty cfqq on the cfq prio tree.  I
don't think this is caused by recent io_context changes.  It looks
like somebody is forgetting to remove cfqq from the dispatch prio tree
after emptying a cfqq by removing a request from it.  Jens, any ideas?

Thanks.
Tejun Heo - Jan. 3, 2012, 5:59 p.m.
Hello, again.

Adding Shaohua Li as he fixed a similar issue in 4a0b75c7d0 "block,
cfq: fix empty queue crash caused by request merge".  The original
thread can be read from

  http://thread.gmane.org/gmane.linux.kernel.next/20064/focus=20159

On Tue, Jan 03, 2012 at 09:35:00AM -0800, Tejun Heo wrote:
> Happy new year, guys.
> 
> On Wed, Dec 28, 2011 at 01:19:18PM -0800, Tejun Heo wrote:
> > > On Wed, Dec 28, 2011 at 9:50 AM, Hugh Dickins <hughd@google.com> wrote:
> > > > "It's the tmpfs swapping test that I've been running, with variations,
> > > > for years.  System booted with mem=700M and 1.5G swap, two repetitious
> > > > make -j20 kernel builds (of a 2.6.24 kernel: I stuck with that because
> > > > the balance of built to unbuilt source grows smaller with later kernels),
> > > > one directly in a tmpfs, the other in a 1k-block ext2 (that I drive with
> > > > ext4's CONFIG_EXT4_USE_FOR_EXT23) on /dev/loop0 on a 450MB tmpfs file."
> > > >
> > > > I doubt much of that (quoted from an older mail to someone else about
> > > > one of the many other bugs it's found) is relevant: maybe just plenty
> > > > of file I/O and swapping.
> > > 
> > > Plain -j4 build isn't triggering anything.  I'll try to replicate the condition.
> > 
> > It's not too reliable but I can reproduce it with -j 22 allmodconfig
> > build inside qemu w/ 512M of memory.  I'll try to find out what's
> > going on.
> 
> I misread the code, the problem is empty cfqq on the cfq prio tree.  I
> don't think this is caused by recent io_context changes.  It looks
> like somebody is forgetting to remove cfqq from the dispatch prio tree
> after emptying a cfqq by removing a request from it.  Jens, any ideas?

That should have been service tree.  I couldn't find more missing
removals other than the one Shaohua's patch already fixed.  Close
cooperator selection in cfq_select_queue() seems suspicious tho.  I
can't see what prevents it from returning an empty coopeator cfqq.
I'm trying to verify whether that's the case.  Will update when I know
more.

Thanks.
Tejun Heo - Jan. 3, 2012, 8:09 p.m.
On Tue, Jan 03, 2012 at 09:59:22AM -0800, Tejun Heo wrote:
> That should have been service tree.  I couldn't find more missing
> removals other than the one Shaohua's patch already fixed.  Close
> cooperator selection in cfq_select_queue() seems suspicious tho.  I
> can't see what prevents it from returning an empty coopeator cfqq.
> I'm trying to verify whether that's the case.  Will update when I know
> more.

While testing, found another bug.

 Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
 Last user: [<ffffffff813a82ee>](cfq_put_queue+0x7e/0xd0)
 070: e8 32 ab 1d 00 88 ff ff e8 32 ab 1d 00 88 ff ff  .2.......2......
 Prev obj: start=ffff88001dab3178, len=232
 Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
 Last user: [<ffffffff813a82ee>](cfq_put_queue+0x7e/0xd0)
 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
 Next obj: start=ffff88001dab3378, len=232
 Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
 Last user: [<ffffffff813a8e53>](cfq_get_queue+0x153/0x670)
 000: 02 00 00 00 21 01 00 00 e0 c9 b1 1d 00 88 ff ff  ....!...........
 010: 89 96 ae 18 00 88 ff ff 00 00 00 00 00 00 00 00  ................

The field at 0x70 which is being updated after being freed is
cfqq->fifo.  Interestingly, it didn't lead to any visible failure.

Thanks.
Jens Axboe - Jan. 3, 2012, 8:20 p.m.
On 2012-01-03 21:09, Tejun Heo wrote:
> On Tue, Jan 03, 2012 at 09:59:22AM -0800, Tejun Heo wrote:
>> That should have been service tree.  I couldn't find more missing
>> removals other than the one Shaohua's patch already fixed.  Close
>> cooperator selection in cfq_select_queue() seems suspicious tho.  I
>> can't see what prevents it from returning an empty coopeator cfqq.
>> I'm trying to verify whether that's the case.  Will update when I know
>> more.
> 
> While testing, found another bug.
> 
>  Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
>  Last user: [<ffffffff813a82ee>](cfq_put_queue+0x7e/0xd0)
>  070: e8 32 ab 1d 00 88 ff ff e8 32 ab 1d 00 88 ff ff  .2.......2......
>  Prev obj: start=ffff88001dab3178, len=232
>  Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
>  Last user: [<ffffffff813a82ee>](cfq_put_queue+0x7e/0xd0)
>  000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
>  010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
>  Next obj: start=ffff88001dab3378, len=232
>  Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
>  Last user: [<ffffffff813a8e53>](cfq_get_queue+0x153/0x670)
>  000: 02 00 00 00 21 01 00 00 e0 c9 b1 1d 00 88 ff ff  ....!...........
>  010: 89 96 ae 18 00 88 ff ff 00 00 00 00 00 00 00 00  ................
> 
> The field at 0x70 which is being updated after being freed is
> cfqq->fifo.  Interestingly, it didn't lead to any visible failure.

That's pretty odd. Given Hughs report as well, it sure does sound like
we now have some life time issues with cfqq's.
Tejun Heo - Jan. 3, 2012, 10:13 p.m.
On Tue, Jan 03, 2012 at 09:20:44PM +0100, Jens Axboe wrote:
> On 2012-01-03 21:09, Tejun Heo wrote:
> > On Tue, Jan 03, 2012 at 09:59:22AM -0800, Tejun Heo wrote:
> >> That should have been service tree.  I couldn't find more missing
> >> removals other than the one Shaohua's patch already fixed.  Close
> >> cooperator selection in cfq_select_queue() seems suspicious tho.  I
> >> can't see what prevents it from returning an empty coopeator cfqq.
> >> I'm trying to verify whether that's the case.  Will update when I know
> >> more.
> > 
> > While testing, found another bug.
> > 
> >  Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
> >  Last user: [<ffffffff813a82ee>](cfq_put_queue+0x7e/0xd0)
> >  070: e8 32 ab 1d 00 88 ff ff e8 32 ab 1d 00 88 ff ff  .2.......2......
> >  Prev obj: start=ffff88001dab3178, len=232
> >  Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
> >  Last user: [<ffffffff813a82ee>](cfq_put_queue+0x7e/0xd0)
> >  000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
> >  010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
> >  Next obj: start=ffff88001dab3378, len=232
> >  Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
> >  Last user: [<ffffffff813a8e53>](cfq_get_queue+0x153/0x670)
> >  000: 02 00 00 00 21 01 00 00 e0 c9 b1 1d 00 88 ff ff  ....!...........
> >  010: 89 96 ae 18 00 88 ff ff 00 00 00 00 00 00 00 00  ................
> > 
> > The field at 0x70 which is being updated after being freed is
> > cfqq->fifo.  Interestingly, it didn't lead to any visible failure.
> 
> That's pretty odd. Given Hughs report as well, it sure does sound like
> we now have some life time issues with cfqq's.

Hmmm... I disabled cfqq merge logic (commented out
cfq_close_cooperator() and the following cfq_setup_merge() calls) in
cfq_select_queue() and neither is triggering for quite a while now.
Maybe cfqq refcnt is getting borked over cfqq merging / splitting?  It
would also explain the low frequency of the issue too.  I'll try to
further isolate it but It would be awesome if someone more familiar
with the logic can go over that part.

Thanks.
Tejun Heo - Jan. 3, 2012, 10:35 p.m.
On Tue, Jan 03, 2012 at 02:13:01PM -0800, Tejun Heo wrote:
> > That's pretty odd. Given Hughs report as well, it sure does sound like
> > we now have some life time issues with cfqq's.
> 
> Hmmm... I disabled cfqq merge logic (commented out
> cfq_close_cooperator() and the following cfq_setup_merge() calls) in
> cfq_select_queue() and neither is triggering for quite a while now.
> Maybe cfqq refcnt is getting borked over cfqq merging / splitting?  It
> would also explain the low frequency of the issue too.  I'll try to
> further isolate it but It would be awesome if someone more familiar
> with the logic can go over that part.

Scrap that.  It triggered and yeah cfq_get_next_queue() is retrieving
empty cfqq from the service tree.

Thanks.

Patch

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index ce9b35a..33fae7d 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -281,9 +281,16 @@  void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_flags,
 	INIT_HLIST_HEAD(&ioc->icq_list);
 	INIT_WORK(&ioc->release_work, ioc_release_fn);
 
-	/* try to install, somebody might already have beaten us to it */
+	/*
+	 * Try to install.  ioc shouldn't be installed if someone else
+	 * already did or @task, which isn't %current, is exiting.  Note
+	 * that we need to allow ioc creation on exiting %current as exit
+	 * path may issue IOs from e.g. exit_files().  The exit path is
+	 * responsible for not issuing IO after exit_io_context().
+	 */
 	task_lock(task);
-	if (!task->io_context && !(task->flags & PF_EXITING))
+	if (!task->io_context &&
+	    (task == current || !(task->flags & PF_EXITING)))
 		task->io_context = ioc;
 	else
 		kmem_cache_free(iocontext_cachep, ioc);