diff mbox series

fs/direct-io.c: avoid workqueue allocation race

Message ID 20200308055221.1088089-1-ebiggers@kernel.org
State New

Commit Message

Eric Biggers March 8, 2020, 5:52 a.m. UTC
From: Eric Biggers <ebiggers@google.com>

When a thread loses the workqueue allocation race in
sb_init_dio_done_wq(), lockdep reports that the call to
destroy_workqueue() can deadlock waiting for work to complete.  This is
a false positive since the workqueue is empty.  But we shouldn't simply
skip the lockdep check for empty workqueues for everyone.

Just avoid this issue by using a mutex to serialize the workqueue
allocation.  We still keep the preliminary check for ->s_dio_done_wq, so
this doesn't affect direct I/O performance.

Also fix the preliminary check for ->s_dio_done_wq to use READ_ONCE(),
since it's a data race.  (That part wasn't actually found by syzbot yet,
but it could be detected by KCSAN in the future.)

Note: the lockdep false positive could alternatively be fixed by
introducing a new function like "destroy_unused_workqueue()" to the
workqueue API as previously suggested.  But I think it makes sense to
avoid the double allocation anyway.

Reported-by: syzbot+a50c7541a4a55cd49b02@syzkaller.appspotmail.com
Reported-by: syzbot+5cd33f0e6abe2bb3e397@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/direct-io.c       | 39 ++++++++++++++++++++-------------------
 fs/internal.h        |  9 ++++++++-
 fs/iomap/direct-io.c |  3 +--
 3 files changed, 29 insertions(+), 22 deletions(-)
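
The serialization described in the commit message (a lock-free preliminary
check, then a mutex-protected re-check and allocation) can be sketched in
userspace C roughly as follows. This is only an illustration under stated
assumptions, not the kernel code: struct fake_sb, struct fake_wq and both
functions are hypothetical stand-ins, pthread and C11 atomics replace the
kernel mutex and smp_store_release(), and an acquire load is used where the
patch uses READ_ONCE().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct super_block and its workqueue. */
struct fake_wq { int id; };
struct fake_sb {
	_Atomic(struct fake_wq *) dio_done_wq;
	int allocations;	/* written only under init_mutex */
};

static pthread_mutex_t init_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Slow path: serialize allocation so that only one caller ever allocates. */
static int fake_sb_init_wq_slow(struct fake_sb *sb)
{
	struct fake_wq *wq;
	int err = 0;

	pthread_mutex_lock(&init_mutex);
	/* Re-check under the lock: another thread may already have won. */
	if (atomic_load_explicit(&sb->dio_done_wq, memory_order_relaxed))
		goto out;
	wq = calloc(1, sizeof(*wq));
	if (!wq) {
		err = -1;	/* the kernel code returns -ENOMEM here */
		goto out;
	}
	sb->allocations++;
	/* Publish the initialized object; pairs with the acquire load below. */
	atomic_store_explicit(&sb->dio_done_wq, wq, memory_order_release);
out:
	pthread_mutex_unlock(&init_mutex);
	return err;
}

/* Fast path: a lock-free check, taken on every call after the first. */
static int fake_sb_init_wq(struct fake_sb *sb)
{
	if (atomic_load_explicit(&sb->dio_done_wq, memory_order_acquire))
		return 0;
	return fake_sb_init_wq_slow(sb);
}
```

Because the loser of the race re-checks the pointer under the mutex, it never
allocates a second object, so there is nothing left to destroy.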

Comments

Dave Chinner March 8, 2020, 11:12 p.m. UTC | #1
On Sat, Mar 07, 2020 at 09:52:21PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> When a thread loses the workqueue allocation race in
> sb_init_dio_done_wq(), lockdep reports that the call to
> destroy_workqueue() can deadlock waiting for work to complete.  This is
> a false positive since the workqueue is empty.  But we shouldn't simply
> skip the lockdep check for empty workqueues for everyone.

Why not? If the wq is empty, it can't deadlock, so this is a problem
with the workqueue lockdep annotations, not a problem with code that
is destroying an empty workqueue.

> Just avoid this issue by using a mutex to serialize the workqueue
> allocation.  We still keep the preliminary check for ->s_dio_done_wq, so
> this doesn't affect direct I/O performance.
> 
> Also fix the preliminary check for ->s_dio_done_wq to use READ_ONCE(),
> since it's a data race.  (That part wasn't actually found by syzbot yet,
> but it could be detected by KCSAN in the future.)
> 
> Note: the lockdep false positive could alternatively be fixed by
> introducing a new function like "destroy_unused_workqueue()" to the
> workqueue API as previously suggested.  But I think it makes sense to
> avoid the double allocation anyway.

Fix the infrastructure, don't work around it by placing constraints
on how the callers can use the infrastructure to work around
problems internal to the infrastructure.

Cheers,

Dave.

Eric Biggers March 9, 2020, 1:24 a.m. UTC | #2
On Mon, Mar 09, 2020 at 10:12:53AM +1100, Dave Chinner wrote:
> On Sat, Mar 07, 2020 at 09:52:21PM -0800, Eric Biggers wrote:
> > From: Eric Biggers <ebiggers@google.com>
> > 
> > When a thread loses the workqueue allocation race in
> > sb_init_dio_done_wq(), lockdep reports that the call to
> > destroy_workqueue() can deadlock waiting for work to complete.  This is
> > a false positive since the workqueue is empty.  But we shouldn't simply
> > skip the lockdep check for empty workqueues for everyone.
> 
> Why not? If the wq is empty, it can't deadlock, so this is a problem
> with the workqueue lockdep annotations, not a problem with code that
> is destroying an empty workqueue.

Skipping the lockdep check when flushing an empty workqueue would reduce the
ability of lockdep to detect deadlocks when flushing that workqueue.  I.e., it
could cause lots of false negatives, since there are many cases where workqueues
are *usually* empty when flushed/destroyed but it's still possible that they are
nonempty.

> 
> > Just avoid this issue by using a mutex to serialize the workqueue
> > allocation.  We still keep the preliminary check for ->s_dio_done_wq, so
> > this doesn't affect direct I/O performance.
> > 
> > Also fix the preliminary check for ->s_dio_done_wq to use READ_ONCE(),
> > since it's a data race.  (That part wasn't actually found by syzbot yet,
> > but it could be detected by KCSAN in the future.)
> > 
> > Note: the lockdep false positive could alternatively be fixed by
> > introducing a new function like "destroy_unused_workqueue()" to the
> > workqueue API as previously suggested.  But I think it makes sense to
> > avoid the double allocation anyway.
> 
> Fix the infrastructure, don't work around it by placing constraints
> on how the callers can use the infrastructure to work around
> problems internal to the infrastructure.

Well, it's also preferable not to make our debugging tools less effective to
support people doing weird things that they shouldn't really be doing anyway.

(BTW, we need READ_ONCE() on ->s_dio_done_wq anyway to properly annotate
the data race.  That could be split into a separate patch though.)

Another idea that came up is to make each workqueue_struct track whether work
has been queued on it or not yet, and make flush_workqueue() skip the lockdep
check if the workqueue has always been empty.  (That could still cause lockdep
false negatives, but not as many as if we checked if the workqueue is
*currently* empty.)  Would you prefer that solution?  Adding more overhead to
workqueues would be undesirable though, so I think it would have to be
conditional on CONFIG_LOCKDEP, like (untested):

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 301db4406bc37..72222c09bcaeb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -263,6 +263,7 @@ struct workqueue_struct {
 	char			*lock_name;
 	struct lock_class_key	key;
 	struct lockdep_map	lockdep_map;
+	bool			used;
 #endif
 	char			name[WQ_NAME_LEN]; /* I: workqueue name */
 
@@ -1404,6 +1405,9 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	lockdep_assert_irqs_disabled();
 
 	debug_work_activate(work);
+#ifdef CONFIG_LOCKDEP
+	WRITE_ONCE(wq->used, true);
+#endif
 
 	/* if draining, only works from the same workqueue are allowed */
 	if (unlikely(wq->flags & __WQ_DRAINING) &&
@@ -2772,8 +2776,12 @@ void flush_workqueue(struct workqueue_struct *wq)
 	if (WARN_ON(!wq_online))
 		return;
 
-	lock_map_acquire(&wq->lockdep_map);
-	lock_map_release(&wq->lockdep_map);
+#ifdef CONFIG_LOCKDEP
+	if (READ_ONCE(wq->used)) {
+		lock_map_acquire(&wq->lockdep_map);
+		lock_map_release(&wq->lockdep_map);
+	}
+#endif
 
 	mutex_lock(&wq->mutex);
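
The difference between "has ever been used" and "is currently empty" can be
illustrated with a small single-threaded model. This is a sketch only: struct
model_wq and the model_* functions are hypothetical names, not workqueue APIs,
and the lockdep_checks counter merely stands in for the
lock_map_acquire()/lock_map_release() pair.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a workqueue; none of these names are kernel APIs. */
struct model_wq {
	atomic_bool used;	/* set once any work has ever been queued */
	int pending;		/* queued work items that have not run yet */
	int lockdep_checks;	/* how many times the modelled check ran */
};

/* Queueing work marks the workqueue as used, as in the __queue_work() hunk. */
static void model_queue_work(struct model_wq *wq)
{
	atomic_store_explicit(&wq->used, true, memory_order_relaxed);
	wq->pending++;
}

/* Pretend every pending item ran to completion. */
static void model_run_all(struct model_wq *wq)
{
	wq->pending = 0;
}

/* Flushing runs the modelled lockdep check only if work was ever queued. */
static void model_flush(struct model_wq *wq)
{
	if (atomic_load_explicit(&wq->used, memory_order_relaxed))
		wq->lockdep_checks++;
	wq->pending = 0;
}
```

In this model, flushing a workqueue that never had work queued skips the
check, while a workqueue that was used and has since drained is still checked.
That is why this variant should lose fewer reports than testing whether the
workqueue is empty at flush time.
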

Darrick Wong March 10, 2020, 4:27 p.m. UTC | #3
On Sun, Mar 08, 2020 at 06:24:24PM -0700, Eric Biggers wrote:
> On Mon, Mar 09, 2020 at 10:12:53AM +1100, Dave Chinner wrote:
> > On Sat, Mar 07, 2020 at 09:52:21PM -0800, Eric Biggers wrote:
> > > From: Eric Biggers <ebiggers@google.com>
> > > 
> > > When a thread loses the workqueue allocation race in
> > > sb_init_dio_done_wq(), lockdep reports that the call to
> > > destroy_workqueue() can deadlock waiting for work to complete.  This is
> > > a false positive since the workqueue is empty.  But we shouldn't simply
> > > skip the lockdep check for empty workqueues for everyone.
> > 
> > Why not? If the wq is empty, it can't deadlock, so this is a problem
> > with the workqueue lockdep annotations, not a problem with code that
> > is destroying an empty workqueue.
> 
> Skipping the lockdep check when flushing an empty workqueue would reduce the
> ability of lockdep to detect deadlocks when flushing that workqueue.  I.e., it
> could cause lots of false negatives, since there are many cases where workqueues
> are *usually* empty when flushed/destroyed but it's still possible that they are
> nonempty.
> 
> > 
> > > Just avoid this issue by using a mutex to serialize the workqueue
> > > allocation.  We still keep the preliminary check for ->s_dio_done_wq, so
> > > this doesn't affect direct I/O performance.
> > > 
> > > Also fix the preliminary check for ->s_dio_done_wq to use READ_ONCE(),
> > > since it's a data race.  (That part wasn't actually found by syzbot yet,
> > > but it could be detected by KCSAN in the future.)
> > > 
> > > Note: the lockdep false positive could alternatively be fixed by
> > > introducing a new function like "destroy_unused_workqueue()" to the
> > > workqueue API as previously suggested.  But I think it makes sense to
> > > avoid the double allocation anyway.
> > 
> > Fix the infrastructure, don't work around it by placing constraints
> > on how the callers can use the infrastructure to work around
> > problems internal to the infrastructure.
> 
> Well, it's also preferable not to make our debugging tools less effective to
> support people doing weird things that they shouldn't really be doing anyway.
> 
> (BTW, we need READ_ONCE() on ->s_dio_done_wq anyway to properly annotate
> the data race.  That could be split into a separate patch though.)
> 
> Another idea that came up is to make each workqueue_struct track whether work
> has been queued on it or not yet, and make flush_workqueue() skip the lockdep
> check if the workqueue has always been empty.  (That could still cause lockdep
> false negatives, but not as many as if we checked if the workqueue is
> *currently* empty.)  Would you prefer that solution?  Adding more overhead to
> workqueues would be undesirable though, so I think it would have to be
> conditional on CONFIG_LOCKDEP, like (untested):

I can't speak for Dave, but if the problem here really is that lockdep's
modelling of flush_workqueue()'s behavior could be improved to eliminate
false reports, then this seems reasonable to me...

--D

> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 301db4406bc37..72222c09bcaeb 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -263,6 +263,7 @@ struct workqueue_struct {
>  	char			*lock_name;
>  	struct lock_class_key	key;
>  	struct lockdep_map	lockdep_map;
> +	bool			used;
>  #endif
>  	char			name[WQ_NAME_LEN]; /* I: workqueue name */
>  
> @@ -1404,6 +1405,9 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
>  	lockdep_assert_irqs_disabled();
>  
>  	debug_work_activate(work);
> +#ifdef CONFIG_LOCKDEP
> +	WRITE_ONCE(wq->used, true);
> +#endif
>  
>  	/* if draining, only works from the same workqueue are allowed */
>  	if (unlikely(wq->flags & __WQ_DRAINING) &&
> @@ -2772,8 +2776,12 @@ void flush_workqueue(struct workqueue_struct *wq)
>  	if (WARN_ON(!wq_online))
>  		return;
>  
> -	lock_map_acquire(&wq->lockdep_map);
> -	lock_map_release(&wq->lockdep_map);
> +#ifdef CONFIG_LOCKDEP
> +	if (READ_ONCE(wq->used)) {
> +		lock_map_acquire(&wq->lockdep_map);
> +		lock_map_release(&wq->lockdep_map);
> +	}
> +#endif
>  
>  	mutex_lock(&wq->mutex);

Dave Chinner March 10, 2020, 10:22 p.m. UTC | #4
[ Sorry, my responses are limited at the moment because I took a
chunk out of a fingertip a couple of days ago and I can only do
about half an hour before my hand and arm start to cramp from the
weird positions and motions 3 finger typing results in.... ]

On Tue, Mar 10, 2020 at 09:27:58AM -0700, Darrick J. Wong wrote:
> On Sun, Mar 08, 2020 at 06:24:24PM -0700, Eric Biggers wrote:
> > On Mon, Mar 09, 2020 at 10:12:53AM +1100, Dave Chinner wrote:
> > > On Sat, Mar 07, 2020 at 09:52:21PM -0800, Eric Biggers wrote:
> > > > From: Eric Biggers <ebiggers@google.com>
> > > > 
> > > > When a thread loses the workqueue allocation race in
> > > > sb_init_dio_done_wq(), lockdep reports that the call to
> > > > destroy_workqueue() can deadlock waiting for work to complete.  This is
> > > > a false positive since the workqueue is empty.  But we shouldn't simply
> > > > skip the lockdep check for empty workqueues for everyone.
> > > 
> > > Why not? If the wq is empty, it can't deadlock, so this is a problem
> > > with the workqueue lockdep annotations, not a problem with code that
> > > is destroying an empty workqueue.
> > 
> > Skipping the lockdep check when flushing an empty workqueue would reduce the
> > ability of lockdep to detect deadlocks when flushing that workqueue.  I.e., it
> > could cause lots of false negatives, since there are many cases where workqueues
> > are *usually* empty when flushed/destroyed but it's still possible that they are
> > nonempty.
> > 
> > > 
> > > > Just avoid this issue by using a mutex to serialize the workqueue
> > > > allocation.  We still keep the preliminary check for ->s_dio_done_wq, so
> > > > this doesn't affect direct I/O performance.
> > > > 
> > > > Also fix the preliminary check for ->s_dio_done_wq to use READ_ONCE(),
> > > > since it's a data race.  (That part wasn't actually found by syzbot yet,
> > > > but it could be detected by KCSAN in the future.)
> > > > 
> > > > Note: the lockdep false positive could alternatively be fixed by
> > > > introducing a new function like "destroy_unused_workqueue()" to the
> > > > workqueue API as previously suggested.  But I think it makes sense to
> > > > avoid the double allocation anyway.
> > > 
> > > Fix the infrastructure, don't work around it by placing constraints
> > > on how the callers can use the infrastructure to work around
> > > problems internal to the infrastructure.
> > 
> > Well, it's also preferable not to make our debugging tools less effective to
> > support people doing weird things that they shouldn't really be doing anyway.
> > 
> > (BTW, we need READ_ONCE() on ->s_dio_done_wq anyway to properly annotate
> > the data race.  That could be split into a separate patch though.)
> > 
> > Another idea that came up is to make each workqueue_struct track whether work
> > has been queued on it or not yet, and make flush_workqueue() skip the lockdep
> > check if the workqueue has always been empty.  (That could still cause lockdep
> > false negatives, but not as many as if we checked if the workqueue is
> > *currently* empty.)  Would you prefer that solution?  Adding more overhead to
> > workqueues would be undesirable though, so I think it would have to be
> > conditional on CONFIG_LOCKDEP, like (untested):
> 
> I can't speak for Dave, but if the problem here really is that lockdep's
> modelling of flush_workqueue()'s behavior could be improved to eliminate
> false reports, then this seems reasonable to me...

Yeah, that's what I've been trying to say. It seems much more
reasonable to fix it for everyone once with a few lines of code than
have to re-write every caller that might trip over this. e.g. think
of all the failure teardown paths that destroy workqueues without
having used them...

So, yeah, this seems like a much better approach....

> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 301db4406bc37..72222c09bcaeb 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -263,6 +263,7 @@ struct workqueue_struct {
> >  	char			*lock_name;
> >  	struct lock_class_key	key;
> >  	struct lockdep_map	lockdep_map;
> > +	bool			used;
> >  #endif
> >  	char			name[WQ_NAME_LEN]; /* I: workqueue name */
> >  
> > @@ -1404,6 +1405,9 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
> >  	lockdep_assert_irqs_disabled();
> >  
> >  	debug_work_activate(work);
> > +#ifdef CONFIG_LOCKDEP
> > +	WRITE_ONCE(wq->used, true);
> > +#endif

....with an appropriate comment to explain why this code is needed.

Cheers,

Dave.

Patch

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 00b4d15bb811..8b73a2501c03 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -590,22 +590,25 @@  static inline int dio_bio_reap(struct dio *dio, struct dio_submit *sdio)
  * filesystems that don't need it and also allows us to create the workqueue
  * late enough so the we can include s_id in the name of the workqueue.
  */
-int sb_init_dio_done_wq(struct super_block *sb)
+int __sb_init_dio_done_wq(struct super_block *sb)
 {
-	struct workqueue_struct *old;
-	struct workqueue_struct *wq = alloc_workqueue("dio/%s",
-						      WQ_MEM_RECLAIM, 0,
-						      sb->s_id);
-	if (!wq)
-		return -ENOMEM;
-	/*
-	 * This has to be atomic as more DIOs can race to create the workqueue
-	 */
-	old = cmpxchg(&sb->s_dio_done_wq, NULL, wq);
-	/* Someone created workqueue before us? Free ours... */
-	if (old)
-		destroy_workqueue(wq);
-	return 0;
+	static DEFINE_MUTEX(sb_init_dio_done_wq_mutex);
+	struct workqueue_struct *wq;
+	int err = 0;
+
+	mutex_lock(&sb_init_dio_done_wq_mutex);
+	if (sb->s_dio_done_wq)
+		goto out;
+	wq = alloc_workqueue("dio/%s", WQ_MEM_RECLAIM, 0, sb->s_id);
+	if (!wq) {
+		err = -ENOMEM;
+		goto out;
+	}
+	/* pairs with READ_ONCE() in sb_init_dio_done_wq() */
+	smp_store_release(&sb->s_dio_done_wq, wq);
+out:
+	mutex_unlock(&sb_init_dio_done_wq_mutex);
+	return err;
 }
 
 static int dio_set_defer_completion(struct dio *dio)
@@ -615,9 +618,7 @@  static int dio_set_defer_completion(struct dio *dio)
 	if (dio->defer_completion)
 		return 0;
 	dio->defer_completion = true;
-	if (!sb->s_dio_done_wq)
-		return sb_init_dio_done_wq(sb);
-	return 0;
+	return sb_init_dio_done_wq(sb);
 }
 
 /*
@@ -1250,7 +1251,7 @@  do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		retval = 0;
 		if (iocb->ki_flags & IOCB_DSYNC)
 			retval = dio_set_defer_completion(dio);
-		else if (!dio->inode->i_sb->s_dio_done_wq) {
+		else {
 			/*
 			 * In case of AIO write racing with buffered read we
 			 * need to defer completion. We can't decide this now,
diff --git a/fs/internal.h b/fs/internal.h
index f3f280b952a3..7813dae1dbcd 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -183,7 +183,14 @@  extern void mnt_pin_kill(struct mount *m);
 extern const struct dentry_operations ns_dentry_operations;
 
 /* direct-io.c: */
-int sb_init_dio_done_wq(struct super_block *sb);
+int __sb_init_dio_done_wq(struct super_block *sb);
+static inline int sb_init_dio_done_wq(struct super_block *sb)
+{
+	/* pairs with smp_store_release() in __sb_init_dio_done_wq() */
+	if (likely(READ_ONCE(sb->s_dio_done_wq)))
+		return 0;
+	return __sb_init_dio_done_wq(sb);
+}
 
 /*
  * fs/stat.c:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..5d81faada8a0 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -484,8 +484,7 @@  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		dio_warn_stale_pagecache(iocb->ki_filp);
 	ret = 0;
 
-	if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
-	    !inode->i_sb->s_dio_done_wq) {
+	if (iov_iter_rw(iter) == WRITE && !wait_for_completion) {
 		ret = sb_init_dio_done_wq(inode->i_sb);
 		if (ret < 0)
 			goto out_free_dio;
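
For contrast, the allocate-then-cmpxchg pattern that the fs/direct-io.c hunk
removes can be modelled in userspace as below. This is a hedged sketch, not
the original code: struct old_sb, struct old_wq, old_sb_init_wq() and the
discarded counter are hypothetical, and C11 atomic_compare_exchange_strong()
stands in for the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the workqueue and the superblock. */
struct old_wq { int id; };
struct old_sb {
	_Atomic(struct old_wq *) dio_done_wq;
};

/* Counts objects that were allocated and then thrown away by a loser. */
static int discarded;

static int old_sb_init_wq(struct old_sb *sb)
{
	struct old_wq *expected = NULL;
	struct old_wq *wq = calloc(1, sizeof(*wq));	/* always allocates first */

	if (!wq)
		return -1;	/* the kernel code returns -ENOMEM here */
	/* Install our object only if nobody else has installed one yet. */
	if (!atomic_compare_exchange_strong(&sb->dio_done_wq, &expected, wq)) {
		/* Lost the race. The kernel calls destroy_workqueue() here. */
		free(wq);
		discarded++;
	}
	return 0;
}
```

Calling old_sb_init_wq() a second time on the same object plays the role of
the thread that loses the race: it allocates, fails the compare-and-exchange,
and frees its never-used object. That discard step corresponds to the
destroy_workqueue() call that lockdep flagged.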