From patchwork Thu Aug 2 04:18:44 2018
X-Patchwork-Submitter: Daniel Axtens
X-Patchwork-Id: 952567
From: Daniel Axtens
To: kernel-team@lists.canonical.com
Subject: [SRU T][PATCH 0/6] NFS FSCache Fixes: LP: #1774336, #1776277, #1776254
Date: Thu, 2 Aug 2018 14:18:44 +1000
Message-Id: <20180802041850.22961-1-daniel.axtens@canonical.com>

Hi all,

This is a series of patches that solves a set of bugs experienced when
an NFS share with an FSCache/CacheFiles backend comes under heavy load.

There are 3 relevant LP bugs:

LP: #1774336 - FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false
  We carried a sauce patch to fix this because upstream hadn't done
  anything in months. Upstream has since acted, but took a different
  patch to the one I proposed. Revert our patch (#1) and bring in the
  upstream replacement (#2, #3).

LP: #1776277 - an error when tearing down an NFS share clashes with an
  attempt to use an fscache object. Fixed by #4.

LP: #1776254 - a race where a new cookie for an object can clash with a
  cookie in the process of being destroyed. Fixed by #5 and #6.

SRU Justifications:
 - LP: #1774336 - inlined below.
 - LP: #1776277 - inlined in patch 4.
 - LP: #1776254 - inlined in patch 5.

Apologies that this is a bit messy. I didn't want to split the series
because it went into upstream as one series, but I can split it to
match the LP bugs and submit it piecewise if anyone would prefer.

== SRU Justification (LP: #1774336) ==

[Impact]

Oops during heavy NFS + FSCache use:

[81738.886634] FS-Cache:
[81738.888281] FS-Cache: Assertion failed
[81738.889461] FS-Cache: 6 == 5 is false
[81738.890625] ------------[ cut here ]------------
[81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!

"6 == 5" represents an operation being DEAD when it was not expected
to be.

[Cause]

There is a race in fscache and cachefiles.

One thread is in cachefiles_read_waiter:

 1) object->work_lock is taken.
 2) the operation is added to the to_do list.
 3) the work_lock is dropped.
 4) fscache_enqueue_retrieval is called, which takes a reference.

Another thread is in cachefiles_read_copier:

 1) object->work_lock is taken.
 2) an item is popped off the to_do list.
 3) object->work_lock is dropped.
 4) some processing is done on the item, and fscache_put_retrieval()
    is called, dropping a reference.

Now if the pass through cachefiles_read_copier takes place *between*
steps 3 and 4 of cachefiles_read_waiter, a reference is dropped before
it is taken. The object's reference count hits zero, lifecycle events
for the object happen too soon, and the assertion fails later on.

(This is simplified and clarified from the original upstream analysis
for this patch at
https://www.redhat.com/archives/linux-cachefs/2018-February/msg00001.html
and from a similar patch with a different approach to fixing the bug at
https://www.redhat.com/archives/linux-cachefs/2017-June/msg00002.html)

[Fix]

(Old sauce patch being reverted) Move fscache_enqueue_retrieval under
the lock in cachefiles_read_waiter. This means that the object cannot
be popped off the to_do list until it is in a fully consistent state
with the reference taken.

(New upstream patch) Explicitly take a reference to the object while
it is being enqueued. Adjust another part of the code to deal with the
greater range of object states this exposes.
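To make the interleaving concrete, here is a tiny userspace sketch. It
is illustration only, not kernel code: struct op, op_get() and op_put()
are invented stand-ins for the fscache operation and its usage count
(fscache_get_retrieval()/fscache_put_retrieval() in the real code), and
the two "threads" are replayed as a fixed sequence of steps rather than
raced for real.

/* Illustration only -- not kernel code. */
#include <assert.h>
#include <stdio.h>

struct op { int usage; };           /* stand-in for the operation */

static void op_get(struct op *op) { op->usage++; }
static int  op_put(struct op *op) { return --op->usage == 0; } /* 1 => last ref gone */

int main(void)
{
        /*
         * Racy ordering: the waiter enqueues the item and drops
         * work_lock; the copier pops it, processes it and puts its
         * reference; only then does the waiter's enqueue take a
         * reference.
         */
        struct op a = { .usage = 1 };   /* single reference held so far */
        /* waiter: lock, add to to_do, unlock ... */
        int dead = op_put(&a);          /* copier: usage 1 -> 0, object torn down */
        op_get(&a);                     /* waiter: reference taken too late */
        printf("racy ordering:  torn down early = %d, usage = %d\n", dead, a.usage);

        /*
         * Fixed ordering: the reference is taken while the item is
         * still protected by work_lock, so the copier's put cannot
         * drop the count to zero prematurely.
         */
        struct op b = { .usage = 1 };
        op_get(&b);                     /* waiter: reference taken under the lock */
        dead = op_put(&b);              /* copier: usage 2 -> 1, object stays alive */
        printf("fixed ordering: torn down early = %d, usage = %d\n", dead, b.usage);
        assert(!dead && b.usage == 1);
        return 0;
}

Both the reverted sauce patch and the upstream replacement close the
window in essentially this way: the reference exists before the item
can be seen by cachefiles_read_copier, matching the "fixed ordering"
above.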
[Testcase]

A user has run ~100 hours of NFS stress tests and not seen this bug
recur.

[Regression Potential]

 - Limited to fscache/cachefiles.
 - The change makes things more conservative (taking more references),
   which is reassuring.
 - There may be performance impacts, but none have been observed so
   far.

=================

Daniel Axtens (1):
  Revert "UBUNTU: SAUCE: CacheFiles: fix a read_waiter/read_copier race"

Kiran Kumar Modukuri (5):
  fscache: Allow cancelled operations to be enqueued
  cachefiles: Fix refcounting bug in backing-file read monitoring
  fscache: Fix reference overput in fscache_attach_object() error handling
  cachefiles: Fix missing clear of the CACHEFILES_OBJECT_ACTIVE flag
  cachefiles: Wait rather than BUG'ing on "Unexpected object collision"

 fs/cachefiles/bind.c   |  3 ++-
 fs/cachefiles/namei.c  |  4 ++--
 fs/cachefiles/rdwr.c   | 17 ++++++++++++-----
 fs/fscache/cache.c     |  2 +-
 fs/fscache/cookie.c    |  7 ++++---
 fs/fscache/object.c    |  1 +
 fs/fscache/operation.c |  6 ++++--
 7 files changed, 26 insertions(+), 14 deletions(-)

Acked-by: Kamal Mostafa
Acked-by: Khalid Elmously