
[1/2] workqueue: Catch more locking problems with flush_work()

Message ID 20120420060101.GA16563@zhy
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Yong Zhang April 20, 2012, 6:01 a.m. UTC
On Fri, Apr 20, 2012 at 01:26:33PM +0800, Yong Zhang wrote:
> On Thu, Apr 19, 2012 at 11:36:32AM -0700, Stephen Boyd wrote:
> > Does looking at the second patch help? Basically schedule_work() can run
> > the callback right between the time the mutex is acquired and
> > flush_work() is called:
> > 
> > CPU0                        CPU1
> > 
> > <irq>
> >   schedule_work()           mutex_lock(&mutex)
> > <irq return>
> >     my_work()               flush_work() 
> >       mutex_lock(&mutex)    
> >       <deadlock>
> 
> Got your point. It is a problem. But your patch could introduce a false
> positive, since when flush_work() is called that very work may have
> already finished running.
> 
> So I think we need the lock_map_acquire()/lock_map_release() only while
> the work is being processed, no?

But start_flush_work() has tried to take care of this issue, except that
it doesn't add work->lockdep_map into the chain.

So does the patch below help?

Thanks,
Yong

---
From: Yong Zhang <yong.zhang@windriver.com>
Date: Fri, 20 Apr 2012 13:44:16 +0800
Subject: [PATCH] workqueue:lockdep: make flush_work notice deadlock

Connect the lock chain by acquiring work->lockdep_map when
the to-be-flushed work is running.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Reported-by: Stephen Boyd <sboyd@codeaurora.org>
---
 kernel/workqueue.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

Comments

Stephen Boyd April 20, 2012, 6:26 a.m. UTC | #1
On 4/19/2012 11:01 PM, Yong Zhang wrote:
> On Fri, Apr 20, 2012 at 01:26:33PM +0800, Yong Zhang wrote:
>> On Thu, Apr 19, 2012 at 11:36:32AM -0700, Stephen Boyd wrote:
>>> Does looking at the second patch help? Basically schedule_work() can run
>>> the callback right between the time the mutex is acquired and
>>> flush_work() is called:
>>>
>>> CPU0                        CPU1
>>>
>>> <irq>
>>>   schedule_work()           mutex_lock(&mutex)
>>> <irq return>
>>>     my_work()               flush_work() 
>>>       mutex_lock(&mutex)    
>>>       <deadlock>
>> Got your point. It is a problem. But your patch could introduce a false
>> positive, since when flush_work() is called that very work may have
>> already finished running.
>>
>> So I think we need the lock_map_acquire()/lock_map_release() only while
>> the work is being processed, no?
> But start_flush_work() has tried to take care of this issue, except that
> it doesn't add work->lockdep_map into the chain.
>
> So does the patch below help?
>
[snip]
> @@ -2461,6 +2461,8 @@ static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
>  		lock_map_acquire(&cwq->wq->lockdep_map);
>  	else
>  		lock_map_acquire_read(&cwq->wq->lockdep_map);
> +	lock_map_acquire(&work->lockdep_map);
> +	lock_map_release(&work->lockdep_map);
>  	lock_map_release(&cwq->wq->lockdep_map);
>  
>  	return true;

No, this doesn't help. The whole point of the patch is to get lockdep to
complain in the case where the work is not queued. That case is not a
false positive. We will get a lockdep warning if the work is running
(when start_flush_work() returns true) solely with the
lock_map_acquire() on cwq->wq->lockdep_map.
Yong Zhang April 20, 2012, 7:18 a.m. UTC | #2
On Thu, Apr 19, 2012 at 11:26:47PM -0700, Stephen Boyd wrote:
> complain in the case where the work is not queued. That case is not a
> false positive. We will get a lockdep warning if the work is running

IIRC, flush_work() is just a nop when the work is neither queued nor running.

> (when start_flush_work() returns true) solely with the
> lock_map_acquire() on cwq->wq->lockdep_map.

Yeah, that is the point of using lockdep to detect deadlocks for workqueues.

But looking at start_flush_work(), in the case where
!(cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER),
lock_map_acquire_read() is called, but a recursive read is not added to
the chain list. So when lock_map_acquire_read(&cwq->wq->lockdep_map)
is called, the deadlock will not be detected. I hope you don't hit that
special case.

Thanks,
Yong
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stephen Boyd April 20, 2012, 8:18 a.m. UTC | #3
On 4/20/2012 12:18 AM, Yong Zhang wrote:
> On Thu, Apr 19, 2012 at 11:26:47PM -0700, Stephen Boyd wrote:
>> complain in the case where the work is not queued. That case is not a
>> false positive. We will get a lockdep warning if the work is running
> IIRC, flush_work() is just a nop when the work is neither queued nor running.

Agreed, but it's better to always print a lockdep warning rather than
only when the deadlock is actually going to occur.

>
>> (when start_flush_work() returns true) solely with the
>> lock_map_acquire() on cwq->wq->lockdep_map.
> Yeah, that is the point of using lockdep to detect deadlocks for workqueues.
>
> But looking at start_flush_work(), in the case where
> !(cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER),
> lock_map_acquire_read() is called, but a recursive read is not added to
> the chain list. So when lock_map_acquire_read(&cwq->wq->lockdep_map)
> is called, the deadlock will not be detected. I hope you don't hit that
> special case.

Hmm. Originally I had what you suggested in my patch but I left it out
because I wasn't sure if it would cause false positives. Do you see any
possibility for false positives? I'll add it back in if not.
Yong Zhang April 20, 2012, 8:32 a.m. UTC | #4
On Fri, Apr 20, 2012 at 01:18:19AM -0700, Stephen Boyd wrote:
> On 4/20/2012 12:18 AM, Yong Zhang wrote:
> > On Thu, Apr 19, 2012 at 11:26:47PM -0700, Stephen Boyd wrote:
> >> complain in the case where the work is not queued. That case is not a
> >> false positive. We will get a lockdep warning if the work is running
> > IIRC, flush_work() is just a nop when the work is neither queued nor running.
> 
> Agreed, but it's better to always print a lockdep warning rather than
> only when the deadlock is actually going to occur.

It will, IMHO.

> 
> >
> >> (when start_flush_work() returns true) solely with the
> >> lock_map_acquire() on cwq->wq->lockdep_map.
> > Yeah, that is the point of using lockdep to detect deadlocks for workqueues.
> >
> > But looking at start_flush_work(), in the case where
> > !(cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER),
> > lock_map_acquire_read() is called, but a recursive read is not added to
> > the chain list. So when lock_map_acquire_read(&cwq->wq->lockdep_map)
> > is called, the deadlock will not be detected. I hope you don't hit that
> > special case.
> 
> Hmm. Originally I had what you suggested in my patch but I left it out
> because I wasn't sure if it would cause false positives.
> Do you see any
> possibility for false positives?

No, I don't. My testing indeed shows nothing (just build and boot).

> I'll add it back in if not.

It would be great if you could try it :)

Thanks,
Yong
Yong Zhang April 21, 2012, 12:32 a.m. UTC | #5
On Fri, Apr 20, 2012 at 4:32 PM, Yong Zhang <yong.zhang0@gmail.com> wrote:
> On Fri, Apr 20, 2012 at 01:18:19AM -0700, Stephen Boyd wrote:
>> On 4/20/2012 12:18 AM, Yong Zhang wrote:
>> > On Thu, Apr 19, 2012 at 11:26:47PM -0700, Stephen Boyd wrote:
>> >> complain in the case where the work is not queued. That case is not a
>> >> false positive. We will get a lockdep warning if the work is running
>> > IIRC, flush_work() is just a nop when the work is neither queued nor running.
>>
>> Agreed, but it's better to always print a lockdep warning rather than
>> only when the deadlock is actually going to occur.
>
> It will, IMHO.

After reconsidering this issue, I realize I was wrong from the beginning.
My suggestion would only warn on a real deadlock, which doesn't follow
what lockdep wants to achieve. Sorry for that.

BTW, your latest patch should work :)

Thanks,
Yong

>
>>
>> >
>> >> (when start_flush_work() returns true) solely with the
>> >> lock_map_acquire() on cwq->wq->lockdep_map.
>> > Yeah, that is the point of using lockdep to detect deadlocks for workqueues.
>> >
>> > But looking at start_flush_work(), in the case where
>> > !(cwq->wq->saved_max_active == 1 || cwq->wq->flags & WQ_RESCUER),
>> > lock_map_acquire_read() is called, but a recursive read is not added to
>> > the chain list. So when lock_map_acquire_read(&cwq->wq->lockdep_map)
>> > is called, the deadlock will not be detected. I hope you don't hit that
>> > special case.
>>
>> Hmm. Originally I had what you suggested in my patch but I left it out
>> because I wasn't sure if it would cause false positives.
>> Do you see any
>> possibility for false positives?
>
> No, I don't. My testing indeed shows nothing (just build and boot).
>
>> I'll add it back in if not.
>
>> It would be great if you could try it :)
>
> Thanks,
> Yong

Patch

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bc867e8..c096b05 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2461,6 +2461,8 @@  static bool start_flush_work(struct work_struct *work, struct wq_barrier *barr,
 		lock_map_acquire(&cwq->wq->lockdep_map);
 	else
 		lock_map_acquire_read(&cwq->wq->lockdep_map);
+	lock_map_acquire(&work->lockdep_map);
+	lock_map_release(&work->lockdep_map);
 	lock_map_release(&cwq->wq->lockdep_map);
 
 	return true;