From patchwork Mon Jul 8 00:50:29 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mauricio Faria de Oliveira X-Patchwork-Id: 1128790 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=lists.ubuntu.com (client-ip=91.189.94.19; helo=huckleberry.canonical.com; envelope-from=kernel-team-bounces@lists.ubuntu.com; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=canonical.com Received: from huckleberry.canonical.com (huckleberry.canonical.com [91.189.94.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 45hn2f1Vq5z9sNx; Mon, 8 Jul 2019 10:51:38 +1000 (AEST) Received: from localhost ([127.0.0.1] helo=huckleberry.canonical.com) by huckleberry.canonical.com with esmtp (Exim 4.86_2) (envelope-from ) id 1hkHsT-0002VK-SE; Mon, 08 Jul 2019 00:51:33 +0000 Received: from youngberry.canonical.com ([91.189.89.112]) by huckleberry.canonical.com with esmtps (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:128) (Exim 4.86_2) (envelope-from ) id 1hkHsN-0002Si-KV for kernel-team@lists.ubuntu.com; Mon, 08 Jul 2019 00:51:27 +0000 Received: from mail-qt1-f199.google.com ([209.85.160.199]) by youngberry.canonical.com with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.76) (envelope-from ) id 1hkHsN-0004cY-73 for kernel-team@lists.ubuntu.com; Mon, 08 Jul 2019 00:51:27 +0000 Received: by mail-qt1-f199.google.com with SMTP id m25so5325822qtn.18 for ; Sun, 07 Jul 2019 17:51:27 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:subject:date:message-id:in-reply-to :references; bh=NHOAkcPOtG5iRGB3ROrqyqFMEs0niDpnwecgpMP0wOc=; b=p6Lg+/xcXiwnSv0E4jl6mnYhsaHih8u/rXMS82bUf7doPcCLjGRFJnK0/CgO6CQ4l3 5ANLaAoBAsMttufG2C7hSx6Gfjc/10ztOJPfC4eMPvkdB5c5Ho5LdBDotl5PTFeIpoV6 3S4q35fDZlfZmvpPoQ6W1KRKZY71gLnnAbd1Pfe8lpeev+Q66Jwh1JaEtu96zYNhxIr3 r37lXDqDmJ6vKECVqxD082n/jLP+K1obhYlpL7YbZnmsMPJRSYC7ZLp8c2bVc08OYBxA DLiDDW3qMYSyghaEkT8IFEp2qHeLxtDKMZZi0/iwHc08yFYbgXN7dWuM2R75TpPzSGpC GFYA== X-Gm-Message-State: APjAAAVkIVMEDFmraPRZfEQn1qrqNH+WT3JMH3oyD3yYNiblc8eQHWf3 ebH9P04bThVgzsCLCM+yKEAZuf9EkP37iofyuTDoVrx1bWV3Y1U9goVih47CRF5ngf28lHlsJuV MCAA2Bjz1UApEoKKlh+RYo49nNCRT+HZ+W412CWs+cA== X-Received: by 2002:a37:7ec1:: with SMTP id z184mr12055579qkc.491.1562547086183; Sun, 07 Jul 2019 17:51:26 -0700 (PDT) X-Google-Smtp-Source: APXvYqxBaxSuVder8q/jatNaNnuPMrz1SAa2JCjtKgkaqA4wObRwcwJpc92W/soUANU2KBLYwEUjOw== X-Received: by 2002:a37:7ec1:: with SMTP id z184mr12055572qkc.491.1562547085950; Sun, 07 Jul 2019 17:51:25 -0700 (PDT) Received: from localhost.localdomain ([2804:14c:4e7:1017:3da7:3d04:ea25:3a0]) by smtp.gmail.com with ESMTPSA id t197sm6697527qke.2.2019.07.07.17.51.24 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Sun, 07 Jul 2019 17:51:25 -0700 (PDT) From: Mauricio Faria de Oliveira To: kernel-team@lists.ubuntu.com Subject: [B][PATCH 02/11] bcache: add stop_when_cache_set_failed option to backing device Date: Sun, 7 Jul 2019 21:50:29 -0300 Message-Id: <20190708005038.13184-3-mfo@canonical.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190708005038.13184-1-mfo@canonical.com> References: <20190708005038.13184-1-mfo@canonical.com> X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" From: Coly Li BugLink: https://bugs.launchpad.net/bugs/1829563 When there are too many I/O errors on cache device, current bcache code will retire the whole cache set, and detach all bcache devices. But the detached bcache devices are not stopped, which is problematic when bcache is in writeback mode. If the retired cache set has dirty data of backing devices, continue writing to bcache device will write to backing device directly. If the LBA of write request has a dirty version cached on cache device, next time when the cache device is re-registered and backing device re-attached to it again, the stale dirty data on cache device will be written to backing device, and overwrite latest directly written data. This situation causes a quite data corruption. But we cannot simply stop all attached bcache devices when the cache set is broken or disconnected. For example, use bcache to accelerate performance of an email service. In such workload, if cache device is broken but no dirty data lost, keep the bcache device alive and permit email service continue to access user data might be a better solution for the cache device failure. Nix points out the issue and provides the above example to explain why it might be necessary to not stop bcache device for broken cache device. Pavel Goran provides a brilliant suggestion to provide "always" and "auto" options to per-cached device sysfs file stop_when_cache_set_failed. If cache set is retiring and the backing device has no dirty data on cache, it should be safe to keep the bcache device alive. In this case, if stop_when_cache_set_failed is set to "auto", the device failure handling code will not stop this bcache device and permit application to access the backing device with a unattached bcache device. Changelog: [mlyle: edited to not break string constants across lines] v3: fix typos pointed out by Nix. v2: change option values of stop_when_cache_set_failed from 1/0 to "auto"/"always". v1: initial version, stop_when_cache_set_failed can be 0 (not stop) or 1 (always stop). Signed-off-by: Coly Li Reviewed-by: Michael Lyle Signed-off-by: Michael Lyle Cc: Nix Cc: Pavel Goran Cc: Junhui Tang Cc: Hannes Reinecke Signed-off-by: Jens Axboe (backported from commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482) [ mfo: backport: super.c, hunk 3: in __cache_set_unregister(), s/c->devices_max_used/c->nr_uuids/ due to lack of non-required upstream commit 2831231d4c3f ("bcache: reduce cache_set devices iteration by devices_max_used"), which is just an optimization, not a fix for index range, and requires one more Fixes: commit.] Signed-off-by: Mauricio Faria de Oliveira --- drivers/md/bcache/bcache.h | 9 +++++ drivers/md/bcache/super.c | 78 +++++++++++++++++++++++++++++++++----- drivers/md/bcache/sysfs.c | 17 +++++++++ 3 files changed, 94 insertions(+), 10 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index ed3c8dd27122..a195abfd3eae 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -288,6 +288,12 @@ struct io { sector_t last; }; +enum stop_on_failure { + BCH_CACHED_DEV_STOP_AUTO = 0, + BCH_CACHED_DEV_STOP_ALWAYS, + BCH_CACHED_DEV_STOP_MODE_MAX, +}; + struct cached_dev { struct list_head list; struct bcache_device disk; @@ -371,6 +377,8 @@ struct cached_dev { unsigned writeback_rate_i_term_inverse; unsigned writeback_rate_p_term_inverse; unsigned writeback_rate_minimum; + + enum stop_on_failure stop_when_cache_set_failed; }; enum alloc_reserve { @@ -922,6 +930,7 @@ void bch_write_bdev_super(struct cached_dev *, struct closure *); extern struct workqueue_struct *bcache_wq; extern const char * const bch_cache_modes[]; +extern const char * const bch_stop_on_failure_modes[]; extern struct mutex bch_register_lock; extern struct list_head bch_cache_sets; diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index e431cee2eb47..12b9776f4b5a 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -47,6 +47,14 @@ const char * const bch_cache_modes[] = { NULL }; +/* Default is -1; we skip past it for stop_when_cache_set_failed */ +const char * const bch_stop_on_failure_modes[] = { + "default", + "auto", + "always", + NULL +}; + static struct kobject *bcache_kobj; struct mutex bch_register_lock; LIST_HEAD(bch_cache_sets); @@ -1201,6 +1209,9 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size) max(dc->disk.disk->queue->backing_dev_info->ra_pages, q->backing_dev_info->ra_pages); + /* default to auto */ + dc->stop_when_cache_set_failed = BCH_CACHED_DEV_STOP_AUTO; + bch_cached_dev_request_init(dc); bch_cached_dev_writeback_init(dc); return 0; @@ -1477,25 +1488,72 @@ static void cache_set_flush(struct closure *cl) closure_return(cl); } +/* + * This function is only called when CACHE_SET_IO_DISABLE is set, which means + * cache set is unregistering due to too many I/O errors. In this condition, + * the bcache device might be stopped, it depends on stop_when_cache_set_failed + * value and whether the broken cache has dirty data: + * + * dc->stop_when_cache_set_failed dc->has_dirty stop bcache device + * BCH_CACHED_STOP_AUTO 0 NO + * BCH_CACHED_STOP_AUTO 1 YES + * BCH_CACHED_DEV_STOP_ALWAYS 0 YES + * BCH_CACHED_DEV_STOP_ALWAYS 1 YES + * + * The expected behavior is, if stop_when_cache_set_failed is configured to + * "auto" via sysfs interface, the bcache device will not be stopped if the + * backing device is clean on the broken cache device. + */ +static void conditional_stop_bcache_device(struct cache_set *c, + struct bcache_device *d, + struct cached_dev *dc) +{ + if (dc->stop_when_cache_set_failed == BCH_CACHED_DEV_STOP_ALWAYS) { + pr_warn("stop_when_cache_set_failed of %s is \"always\", stop it for failed cache set %pU.", + d->disk->disk_name, c->sb.set_uuid); + bcache_device_stop(d); + } else if (atomic_read(&dc->has_dirty)) { + /* + * dc->stop_when_cache_set_failed == BCH_CACHED_STOP_AUTO + * and dc->has_dirty == 1 + */ + pr_warn("stop_when_cache_set_failed of %s is \"auto\" and cache is dirty, stop it to avoid potential data corruption.", + d->disk->disk_name); + bcache_device_stop(d); + } else { + /* + * dc->stop_when_cache_set_failed == BCH_CACHED_STOP_AUTO + * and dc->has_dirty == 0 + */ + pr_warn("stop_when_cache_set_failed of %s is \"auto\" and cache is clean, keep it alive.", + d->disk->disk_name); + } +} + static void __cache_set_unregister(struct closure *cl) { struct cache_set *c = container_of(cl, struct cache_set, caching); struct cached_dev *dc; + struct bcache_device *d; size_t i; mutex_lock(&bch_register_lock); - for (i = 0; i < c->nr_uuids; i++) - if (c->devices[i]) { - if (!UUID_FLASH_ONLY(&c->uuids[i]) && - test_bit(CACHE_SET_UNREGISTERING, &c->flags)) { - dc = container_of(c->devices[i], - struct cached_dev, disk); - bch_cached_dev_detach(dc); - } else { - bcache_device_stop(c->devices[i]); - } + for (i = 0; i < c->nr_uuids; i++) { + d = c->devices[i]; + if (!d) + continue; + + if (!UUID_FLASH_ONLY(&c->uuids[i]) && + test_bit(CACHE_SET_UNREGISTERING, &c->flags)) { + dc = container_of(d, struct cached_dev, disk); + bch_cached_dev_detach(dc); + if (test_bit(CACHE_SET_IO_DISABLE, &c->flags)) + conditional_stop_bcache_device(c, d, dc); + } else { + bcache_device_stop(d); } + } mutex_unlock(&bch_register_lock); diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 6cecf45ab461..48198b47df14 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -75,6 +75,7 @@ rw_attribute(congested_write_threshold_us); rw_attribute(sequential_cutoff); rw_attribute(data_csum); rw_attribute(cache_mode); +rw_attribute(stop_when_cache_set_failed); rw_attribute(writeback_metadata); rw_attribute(writeback_running); rw_attribute(writeback_percent); @@ -123,6 +124,12 @@ SHOW(__bch_cached_dev) bch_cache_modes + 1, BDEV_CACHE_MODE(&dc->sb)); + if (attr == &sysfs_stop_when_cache_set_failed) + return bch_snprint_string_list(buf, PAGE_SIZE, + bch_stop_on_failure_modes + 1, + dc->stop_when_cache_set_failed); + + sysfs_printf(data_csum, "%i", dc->disk.data_csum); var_printf(verify, "%i"); var_printf(bypass_torture_test, "%i"); @@ -244,6 +251,15 @@ STORE(__cached_dev) } } + if (attr == &sysfs_stop_when_cache_set_failed) { + v = bch_read_string_list(buf, bch_stop_on_failure_modes + 1); + + if (v < 0) + return v; + + dc->stop_when_cache_set_failed = v; + } + if (attr == &sysfs_label) { if (size > SB_LABEL_SIZE) return -EINVAL; @@ -323,6 +339,7 @@ static struct attribute *bch_cached_dev_files[] = { &sysfs_data_csum, #endif &sysfs_cache_mode, + &sysfs_stop_when_cache_set_failed, &sysfs_writeback_metadata, &sysfs_writeback_running, &sysfs_writeback_delay,