diff mbox series

[SRU,T/X/A/B,C,1/1] UBUNTU: SAUCE: CacheFiles: fix a read_waiter/read_copier race

Message ID 20180604064216.32075-2-daniel.axtens@canonical.com
State New
Headers show
Series [SRU,T/X/A/B,C,1/1] UBUNTU: SAUCE: CacheFiles: fix a read_waiter/read_copier race | expand

Commit Message

Daniel Axtens June 4, 2018, 6:42 a.m. UTC
From: Lei Xue <carmark.dlut@gmail.com>

BugLink: https://bugs.launchpad.net/bugs/1774336

There is a potential race in fscache operation enqueuing for reading and
copying multiple pages from cachefiles to netfs.

If this race occurs, an oops similar to the following is seen:

[585042.202316] FS-Cache:
[585042.202343] FS-Cache: Assertion failed
[585042.202367] FS-Cache: 6 == 5 is false
[585042.202452] ------------[ cut here ]------------
[585042.202480] kernel BUG at fs/fscache/operation.c:494!
...
[585042.209600] Call Trace:
[585042.211233]  [<ffffffffc034c29a>] fscache_op_work_func+0x2a/0x50 [fscache]
[585042.212677]  [<ffffffff81095a70>] process_one_work+0x150/0x3f0
[585042.213550]  [<ffffffff810961ea>] worker_thread+0x11a/0x470
...

The race occurs in the following situation:

One thread is in cachefiles_read_waiter:
 1) object->work_lock is taken.
 2) the operation is added to the to_do list.
 3) the work lock is dropped.
 4) fscache_enqueue_retrieval is called, which takes a reference.

Another thread is in cachefiles_read_copier:
 1) object->work_lock is taken
 2) an item is popped off the to_do list.
 3) object->work_lock is dropped.
 4) some processing is done on the item, and fscache_put_retrieval()
    is called, dropping a reference.

Now if the this process in cachefiles_read_copier takes place
*between* steps 3 and 4 in cachefiles_read_waiter, a reference will be
dropped before it is taken, which leads to the object's reference count
hitting zero, which leads to lifecycle events for the object happening
too soon, leading to the assertion failure later on.

Move fscache_enqueue_retrieval under the lock in
cachefiles_read_waiter. This means that the object cannot be popped
off the to_do list until it is in a fully consistent state with the
reference taken.

Signed-off-by: Lei Xue <carmark.dlut@gmail.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
[dja: rewrite and expand commit message]
(From https://www.redhat.com/archives/linux-cachefs/2018-February/msg00000.html
 This patch has been sitting on the mailing list for months with no
 response from the maintainer. A similar patch fixing the same issue
 was posted as far back as May 2017, and likewise had no response:
 https://www.redhat.com/archives/linux-cachefs/2017-May/msg00002.html
 I poked the list recently and also got nothing:
 https://www.redhat.com/archives/linux-cachefs/2018-May/msg00000.html
 and the problem was again reported and this patch validated by
 another user:
 https://www.redhat.com/archives/linux-cachefs/2018-May/msg00001.html
 Hence the submission as a sauce patch.)
Signed-off-by: Daniel Axtens <daniel.axtens@canonical.com>
---
 fs/cachefiles/rdwr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Stefan Bader June 5, 2018, 7:51 p.m. UTC | #1
On 03.06.2018 23:42, Daniel Axtens wrote:
> From: Lei Xue <carmark.dlut@gmail.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1774336
> 
> There is a potential race in fscache operation enqueuing for reading and
> copying multiple pages from cachefiles to netfs.
> 
> If this race occurs, an oops similar to the following is seen:
> 
> [585042.202316] FS-Cache:
> [585042.202343] FS-Cache: Assertion failed
> [585042.202367] FS-Cache: 6 == 5 is false
> [585042.202452] ------------[ cut here ]------------
> [585042.202480] kernel BUG at fs/fscache/operation.c:494!
> ...
> [585042.209600] Call Trace:
> [585042.211233]  [<ffffffffc034c29a>] fscache_op_work_func+0x2a/0x50 [fscache]
> [585042.212677]  [<ffffffff81095a70>] process_one_work+0x150/0x3f0
> [585042.213550]  [<ffffffff810961ea>] worker_thread+0x11a/0x470
> ...
> 
> The race occurs in the following situation:
> 
> One thread is in cachefiles_read_waiter:
>  1) object->work_lock is taken.
>  2) the operation is added to the to_do list.
>  3) the work lock is dropped.
>  4) fscache_enqueue_retrieval is called, which takes a reference.
> 
> Another thread is in cachefiles_read_copier:
>  1) object->work_lock is taken
>  2) an item is popped off the to_do list.
>  3) object->work_lock is dropped.
>  4) some processing is done on the item, and fscache_put_retrieval()
>     is called, dropping a reference.
> 
> Now if the this process in cachefiles_read_copier takes place
> *between* steps 3 and 4 in cachefiles_read_waiter, a reference will be
> dropped before it is taken, which leads to the object's reference count
> hitting zero, which leads to lifecycle events for the object happening
> too soon, leading to the assertion failure later on.
> 
> Move fscache_enqueue_retrieval under the lock in
> cachefiles_read_waiter. This means that the object cannot be popped
> off the to_do list until it is in a fully consistent state with the
> reference taken.
> 
> Signed-off-by: Lei Xue <carmark.dlut@gmail.com>
> Reviewed-by: Daniel Axtens <dja@axtens.net>
> [dja: rewrite and expand commit message]
> (From https://www.redhat.com/archives/linux-cachefs/2018-February/msg00000.html
>  This patch has been sitting on the mailing list for months with no
>  response from the maintainer. A similar patch fixing the same issue
>  was posted as far back as May 2017, and likewise had no response:
>  https://www.redhat.com/archives/linux-cachefs/2017-May/msg00002.html
>  I poked the list recently and also got nothing:
>  https://www.redhat.com/archives/linux-cachefs/2018-May/msg00000.html
>  and the problem was again reported and this patch validated by
>  another user:
>  https://www.redhat.com/archives/linux-cachefs/2018-May/msg00001.html
>  Hence the submission as a sauce patch.)
> Signed-off-by: Daniel Axtens <daniel.axtens@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>

> ---
>  fs/cachefiles/rdwr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
> index c0f3da3926a0..d9bb47d1e16d 100644
> --- a/fs/cachefiles/rdwr.c
> +++ b/fs/cachefiles/rdwr.c
> @@ -58,9 +58,9 @@ static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
>  
>  	spin_lock(&object->work_lock);
>  	list_add_tail(&monitor->op_link, &monitor->op->to_do);
> +	fscache_enqueue_retrieval(monitor->op);
>  	spin_unlock(&object->work_lock);
>  
> -	fscache_enqueue_retrieval(monitor->op);
>  	return 0;
>  }
>  
>
Thadeu Lima de Souza Cascardo June 25, 2018, 8:01 p.m. UTC | #2
Applied to unstable master branch.

Thanks.
Cascardo.

Applied-to: unstable/master
diff mbox series

Patch

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index c0f3da3926a0..d9bb47d1e16d 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -58,9 +58,9 @@  static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
 
 	spin_lock(&object->work_lock);
 	list_add_tail(&monitor->op_link, &monitor->op->to_do);
+	fscache_enqueue_retrieval(monitor->op);
 	spin_unlock(&object->work_lock);
 
-	fscache_enqueue_retrieval(monitor->op);
 	return 0;
 }