From patchwork Sun May 13 17:56:20 2018
X-Patchwork-Submitter: Eric Whitney
X-Patchwork-Id: 912548
From: Eric Whitney
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, Eric Whitney
Subject: [RFC PATCH 1/5] ext4: fix reserved cluster accounting at delayed write time
Date: Sun, 13 May 2018 13:56:20 -0400
Message-Id: <20180513175624.12887-2-enwlinux@gmail.com>
In-Reply-To: <20180513175624.12887-1-enwlinux@gmail.com>
References: <20180513175624.12887-1-enwlinux@gmail.com>
X-Mailing-List: linux-ext4@vger.kernel.org

The code in ext4_da_map_blocks sometimes reserves space for more delayed
allocated clusters than it should, resulting in premature ENOSPC, exceeded
quota, and inaccurate free space reporting.  It fails to check the extents
status tree for written and unwritten clusters which have already been
allocated and which share a cluster with the block being delayed allocated.
In addition, it fails to go on to search the extent tree when no delayed
allocated or allocated clusters have been found in the extents status tree.
Written extents may be purged from the extents status tree under memory
pressure.  Only delayed and unwritten delayed extents are guaranteed to be
retained.

Signed-off-by: Eric Whitney
---
 fs/ext4/ext4.h           |   4 ++
 fs/ext4/extents.c        | 114 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/extents_status.c |  64 ++++++++++++++++++++++++++
 fs/ext4/extents_status.h |  11 +++++
 fs/ext4/inode.c          |  30 ++++++++++---
 5 files changed, 218 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a42e71203e53..6ee2fded64bf 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3122,6 +3122,9 @@ extern int ext4_find_delalloc_range(struct inode *inode,
 				    ext4_lblk_t lblk_start,
 				    ext4_lblk_t lblk_end);
 extern int ext4_find_delalloc_cluster(struct inode *inode, ext4_lblk_t lblk);
+extern int ext4_find_cluster(struct inode *inode,
+			     int (*match_fn)(struct extent_status *es),
+			     ext4_lblk_t lblk);
 extern ext4_lblk_t ext4_ext_next_allocated_block(struct ext4_ext_path *path);
 extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       __u64 start, __u64 len);
@@ -3132,6 +3135,7 @@ extern int ext4_swap_extents(handle_t *handle, struct inode *inode1,
 			     struct inode *inode2, ext4_lblk_t lblk1,
 			     ext4_lblk_t lblk2, ext4_lblk_t count,
 			     int mark_unwritten,int *err);
+extern int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu);
 
 /* move_extent.c */
 extern void ext4_double_down_write_data_sem(struct inode *first,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c969275ce3ee..872782ba8bc3 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3846,6 +3846,47 @@ int ext4_find_delalloc_cluster(struct inode *inode, ext4_lblk_t lblk)
 	return ext4_find_delalloc_range(inode, lblk_start, lblk_end);
 }
 
+/*
+ * If the block range specified by @start and @end contains an es extent
+ * matched by the matching function specified by @match_fn, return 1.  Else,
+ * return 0.
+ */
+int ext4_find_range(struct inode *inode,
+		    int (*match_fn)(struct extent_status *es),
+		    ext4_lblk_t start,
+		    ext4_lblk_t end)
+{
+	struct extent_status es;
+
+	ext4_es_find_extent_range(inode, match_fn, start, end, &es);
+	if (es.es_len == 0)
+		return 0; /* there is no matching extent in this tree */
+	else if (es.es_lblk <= start &&
+		 start < es.es_lblk + es.es_len)
+		return 1;
+	else if (start <= es.es_lblk && es.es_lblk <= end)
+		return 1;
+	else
+		return 0;
+}
+
+/*
+ * If the cluster containing @lblk is contained in an es extent matched by
+ * the matching function specified by @match_fn, return 1.  Else, return 0.
+ */
+int ext4_find_cluster(struct inode *inode,
+		      int (*match_fn)(struct extent_status *es),
+		      ext4_lblk_t lblk)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	ext4_lblk_t lblk_start, lblk_end;
+
+	lblk_start = EXT4_LBLK_CMASK(sbi, lblk);
+	lblk_end = lblk_start + sbi->s_cluster_ratio - 1;
+
+	return ext4_find_range(inode, match_fn, lblk_start, lblk_end);
+}
+
 /**
  * Determines how many complete clusters (out of those specified by the 'map')
  * are under delalloc and were reserved quota for.
@@ -5935,3 +5976,76 @@ ext4_swap_extents(handle_t *handle, struct inode *inode1,
 	}
 	return replaced_count;
 }
+
+/*
+ * Returns true if any block in the cluster containing lblk is mapped,
+ * else returns false.  Can also return errors.
+ *
+ * Derived from ext4_ext_map_blocks().
+ */
+int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	struct ext4_ext_path *path;
+	int depth, mapped = 0, err = 0;
+	struct ext4_extent *extent;
+	ext4_lblk_t first_lblk, first_lclu, last_lclu;
+
+	/* search for the extent closest to the first block in the cluster */
+	path = ext4_find_extent(inode, EXT4_C2B(sbi, lclu), NULL, 0);
+	if (IS_ERR(path)) {
+		err = PTR_ERR(path);
+		path = NULL;
+		goto out;
+	}
+
+	depth = ext_depth(inode);
+
+	/*
+	 * A consistent leaf must not be empty.  This situation is possible,
+	 * though, _during_ tree modification, and it's why an assert can't
+	 * be put in ext4_find_extent().
+	 */
+	if (unlikely(path[depth].p_ext == NULL && depth != 0)) {
+		EXT4_ERROR_INODE(inode, "bad extent address "
+				 "lblock: %lu, depth: %d, pblock %lld",
+				 (unsigned long) EXT4_C2B(sbi, lclu),
+				 depth, path[depth].p_block);
+		err = -EFSCORRUPTED;
+		goto out;
+	}
+
+	extent = path[depth].p_ext;
+
+	/* can't be mapped if the extent tree is empty */
+	if (extent == NULL)
+		goto out;
+
+	first_lblk = le32_to_cpu(extent->ee_block);
+	first_lclu = EXT4_B2C(sbi, first_lblk);
+
+	/*
+	 * Three possible outcomes at this point - found extent spanning
+	 * the target cluster, to the left of the target cluster, or to the
+	 * right of the target cluster.  The first two cases are handled here.
+	 * The last case indicates the target cluster is not mapped.
+	 */
+	if (lclu >= first_lclu) {
+		last_lclu = EXT4_B2C(sbi, first_lblk +
+				     ext4_ext_get_actual_len(extent) - 1);
+		if (lclu <= last_lclu) {
+			mapped = 1;
+		} else {
+			first_lblk = ext4_ext_next_allocated_block(path);
+			first_lclu = EXT4_B2C(sbi, first_lblk);
+			if (lclu == first_lclu)
+				mapped = 1;
+		}
+	}
+
+out:
+	ext4_ext_drop_refs(path);
+	kfree(path);
+
+	return err ? err : mapped;
+}
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 763ef185dd17..0c395b5a57a2 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -296,6 +296,70 @@ void ext4_es_find_delayed_extent_range(struct inode *inode,
 	trace_ext4_es_find_delayed_extent_range_exit(inode, es);
 }
 
+/*
+ * Find the first es extent in the block range specified by @lblk and @end
+ * that satisfies the matching function specified by @match_fn.  If a
+ * match is found, it's returned in @es.  If not, a struct extents_status
+ * is returned in @es whose es_lblk, es_len, and es_pblk components are 0.
+ *
+ * This function is derived from ext4_es_find_delayed_extent_range().
+ * In the future, it could be used to replace it with the use of a suitable
+ * matching function to avoid redundant code.
+ */
+void ext4_es_find_extent_range(struct inode *inode,
+			       int (*match_fn)(struct extent_status *es),
+			       ext4_lblk_t lblk, ext4_lblk_t end,
+			       struct extent_status *es)
+{
+	struct ext4_es_tree *tree = NULL;
+	struct extent_status *es1 = NULL;
+	struct rb_node *node;
+
+	BUG_ON(es == NULL);
+	BUG_ON(end < lblk);
+
+	read_lock(&EXT4_I(inode)->i_es_lock);
+	tree = &EXT4_I(inode)->i_es_tree;
+
+	/* see if the extent has been cached */
+	es->es_lblk = es->es_len = es->es_pblk = 0;
+	if (tree->cache_es) {
+		es1 = tree->cache_es;
+		if (in_range(lblk, es1->es_lblk, es1->es_len)) {
+			es_debug("%u cached by [%u/%u) %llu %x\n",
+				 lblk, es1->es_lblk, es1->es_len,
+				 ext4_es_pblock(es1), ext4_es_status(es1));
+			goto out;
+		}
+	}
+
+	es1 = __es_tree_search(&tree->root, lblk);
+
+out:
+	if (es1 && !match_fn(es1)) {
+		while ((node = rb_next(&es1->rb_node)) != NULL) {
+			es1 = rb_entry(node, struct extent_status, rb_node);
+			if (es1->es_lblk > end) {
+				es1 = NULL;
+				break;
+			}
+			if (match_fn(es1))
+				break;
+		}
+	}
+
+	if (es1 && match_fn(es1)) {
+		tree->cache_es = es1;
+		es->es_lblk = es1->es_lblk;
+		es->es_len = es1->es_len;
+		es->es_pblk = es1->es_pblk;
+	}
+
+	read_unlock(&EXT4_I(inode)->i_es_lock);
+}
+
 static void ext4_es_list_add(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 8efdeb903d6b..439143f7504f 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -93,6 +93,10 @@ extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 extern void ext4_es_find_delayed_extent_range(struct inode *inode,
 					ext4_lblk_t lblk, ext4_lblk_t end,
 					struct extent_status *es);
+extern void ext4_es_find_extent_range(struct inode *inode,
+				      int (*match_fn)(struct extent_status *es),
+				      ext4_lblk_t lblk, ext4_lblk_t end,
+				      struct extent_status *es);
 extern int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
 				 struct extent_status *es);
@@ -126,6 +130,13 @@ static inline int ext4_es_is_hole(struct extent_status *es)
 	return (ext4_es_type(es) & EXTENT_STATUS_HOLE) != 0;
 }
 
+static inline int ext4_es_is_claimed(struct extent_status *es)
+{
+	return (ext4_es_type(es) &
+		(EXTENT_STATUS_WRITTEN | EXTENT_STATUS_UNWRITTEN |
+		 EXTENT_STATUS_DELAYED)) != 0;
+}
+
 static inline void ext4_es_set_referenced(struct extent_status *es)
 {
 	es->es_pblk |= ((ext4_fsblk_t)EXTENT_STATUS_REFERENCED) << ES_SHIFT;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1e50c5efae67..8f5235b2c094 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1788,6 +1788,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 			      struct ext4_map_blocks *map,
 			      struct buffer_head *bh)
 {
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct extent_status es;
 	int retval;
 	sector_t invalid_block = ~((sector_t) 0xffff);
@@ -1857,17 +1858,36 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
 add_delayed:
 	if (retval == 0) {
 		int ret;
+		bool reserve = true;	/* reserve one cluster */
 
 		/*
 		 * XXX: __block_prepare_write() unmaps passed block,
 		 * is it OK?
 		 */
 		/*
-		 * If the block was allocated from previously allocated cluster,
-		 * then we don't need to reserve it again. However we still need
-		 * to reserve metadata for every block we're going to write.
+		 * If the cluster containing m_lblk is shared with a delayed,
+		 * written, or unwritten extent in a bigalloc file system,
+		 * it's already been claimed and does not need to be reserved.
+		 * Written extents may have been purged from the extents status
+		 * tree if the system has been under memory pressure, so it's
+		 * necessary to examine the extent tree to confirm the
+		 * reservation.
 		 */
-		if (EXT4_SB(inode->i_sb)->s_cluster_ratio == 1 ||
-		    !ext4_find_delalloc_cluster(inode, map->m_lblk)) {
+		if (sbi->s_cluster_ratio > 1) {
+			reserve = !ext4_find_cluster(inode,
+						     &ext4_es_is_claimed,
+						     map->m_lblk);
+			if (reserve) {
+				ret = ext4_clu_mapped(inode,
+					EXT4_B2C(sbi, map->m_lblk));
+				if (ret < 0) {
+					retval = ret;
+					goto out_unlock;
+				}
+				reserve = !ret;
+			}
+		}
+
+		if (reserve) {
 			ret = ext4_da_reserve_space(inode);
 			if (ret) {
 				/* not enough space to reserve */

From patchwork Sun May 13 17:56:21 2018
X-Patchwork-Submitter: Eric Whitney
X-Patchwork-Id: 912549
From: Eric Whitney
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, Eric Whitney
Subject: [RFC PATCH 2/5] ext4: reduce reserved cluster count by number of allocated clusters
Date: Sun, 13 May 2018 13:56:21 -0400
Message-Id: <20180513175624.12887-3-enwlinux@gmail.com>
In-Reply-To: <20180513175624.12887-1-enwlinux@gmail.com>
References: <20180513175624.12887-1-enwlinux@gmail.com>
X-Mailing-List: linux-ext4@vger.kernel.org

Ext4 does not always reduce the reserved cluster count by the number of
clusters allocated when mapping a delayed extent.  It sometimes adds back
one or more clusters after allocation if delalloc blocks adjacent to the
range allocated by ext4_ext_map_blocks() share the clusters newly allocated
for that range.  However, this overcounts the number of clusters needed to
satisfy future mapping requests, holding one or more reservations for
clusters that have already been allocated, and results in premature ENOSPC
and quota failures.

The current ext4 code does not reduce the reserved cluster count when
allocating clusters for non-delalloc writes that have also been previously
reserved for delalloc writes.  This also results in a reserved cluster
overcount.

To make it possible to handle reserved cluster accounting for fallocated
regions in the same manner as used for other non-delayed writes, do the
reserved cluster accounting for them at the time of allocation.  In the
current code, this is only done later when a delayed extent sharing the
fallocated region is finally mapped.  This behavior can also result in a
temporary reserved cluster overcount.
Signed-off-by: Eric Whitney
Reviewed-by: Jan Kara
---
 fs/ext4/extents.c        | 186 +++++++-----------------------------------
 fs/ext4/extents_status.c |  81 +++++++++++++++++++++
 fs/ext4/extents_status.h |   3 +
 3 files changed, 112 insertions(+), 158 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 872782ba8bc3..d3e4e482a475 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3887,81 +3887,6 @@ int ext4_find_cluster(struct inode *inode,
 	return ext4_find_range(inode, match_fn, lblk_start, lblk_end);
 }
 
-/**
- * Determines how many complete clusters (out of those specified by the 'map')
- * are under delalloc and were reserved quota for.
- * This function is called when we are writing out the blocks that were
- * originally written with their allocation delayed, but then the space was
- * allocated using fallocate() before the delayed allocation could be resolved.
- * The cases to look for are:
- * ('=' indicated delayed allocated blocks
- *  '-' indicates non-delayed allocated blocks)
- * (a) partial clusters towards beginning and/or end outside of allocated range
- *     are not delalloc'ed.
- *	Ex:
- *	|----c---=|====c====|====c====|===-c----|
- *	         |++++++ allocated ++++++|
- *	==> 4 complete clusters in above example
- *
- * (b) partial cluster (outside of allocated range) towards either end is
- *     marked for delayed allocation. In this case, we will exclude that
- *     cluster.
- *	Ex:
- *	|----====c========|========c========|
- *	         |++++++ allocated ++++++|
- *	==> 1 complete clusters in above example
- *
- *	Ex:
- *	|================c================|
- *	   |++++++ allocated ++++++|
- *	==> 0 complete clusters in above example
- *
- * The ext4_da_update_reserve_space will be called only if we
- * determine here that there were some "entire" clusters that span
- * this 'allocated' range.
- * In the non-bigalloc case, this function will just end up returning num_blks
- * without ever calling ext4_find_delalloc_range.
- */
-static unsigned int
-get_reserved_cluster_alloc(struct inode *inode, ext4_lblk_t lblk_start,
-			   unsigned int num_blks)
-{
-	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-	ext4_lblk_t alloc_cluster_start, alloc_cluster_end;
-	ext4_lblk_t lblk_from, lblk_to, c_offset;
-	unsigned int allocated_clusters = 0;
-
-	alloc_cluster_start = EXT4_B2C(sbi, lblk_start);
-	alloc_cluster_end = EXT4_B2C(sbi, lblk_start + num_blks - 1);
-
-	/* max possible clusters for this allocation */
-	allocated_clusters = alloc_cluster_end - alloc_cluster_start + 1;
-
-	trace_ext4_get_reserved_cluster_alloc(inode, lblk_start, num_blks);
-
-	/* Check towards left side */
-	c_offset = EXT4_LBLK_COFF(sbi, lblk_start);
-	if (c_offset) {
-		lblk_from = EXT4_LBLK_CMASK(sbi, lblk_start);
-		lblk_to = lblk_from + c_offset - 1;
-
-		if (ext4_find_delalloc_range(inode, lblk_from, lblk_to))
-			allocated_clusters--;
-	}
-
-	/* Now check towards right. */
-	c_offset = EXT4_LBLK_COFF(sbi, lblk_start + num_blks);
-	if (allocated_clusters && c_offset) {
-		lblk_from = lblk_start + num_blks;
-		lblk_to = lblk_from + (sbi->s_cluster_ratio - c_offset) - 1;
-
-		if (ext4_find_delalloc_range(inode, lblk_from, lblk_to))
-			allocated_clusters--;
-	}
-
-	return allocated_clusters;
-}
-
 static int
 convert_initialized_extent(handle_t *handle, struct inode *inode,
 			   struct ext4_map_blocks *map,
@@ -4143,23 +4068,6 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 	}
 	map->m_len = allocated;
 
-	/*
-	 * If we have done fallocate with the offset that is already
-	 * delayed allocated, we would have block reservation
-	 * and quota reservation done in the delayed write path.
-	 * But fallocate would have already updated quota and block
-	 * count for this offset. So cancel these reservation
-	 */
-	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
-		unsigned int reserved_clusters;
-		reserved_clusters = get_reserved_cluster_alloc(inode,
-				map->m_lblk, map->m_len);
-		if (reserved_clusters)
-			ext4_da_update_reserve_space(inode,
-						     reserved_clusters,
-						     0);
-	}
-
 map_out:
 	map->m_flags |= EXT4_MAP_MAPPED;
 	if ((flags & EXT4_GET_BLOCKS_KEEP_SIZE) == 0) {
@@ -4548,77 +4456,39 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	map->m_flags |= EXT4_MAP_NEW;
 
 	/*
-	 * Update reserved blocks/metadata blocks after successful
-	 * block allocation which had been deferred till now.
+	 * Reduce the reserved cluster count to reflect successful deferred
+	 * allocation of delayed allocated clusters or direct allocation of
+	 * clusters discovered to be delayed allocated.  Once allocated, a
+	 * cluster is not included in the reserved count.
 	 */
-	if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
-		unsigned int reserved_clusters;
-		/*
-		 * Check how many clusters we had reserved this allocated range
-		 */
-		reserved_clusters = get_reserved_cluster_alloc(inode,
-						map->m_lblk, allocated);
-		if (!map_from_cluster) {
-			BUG_ON(allocated_clusters < reserved_clusters);
-			if (reserved_clusters < allocated_clusters) {
-				struct ext4_inode_info *ei = EXT4_I(inode);
-				int reservation = allocated_clusters -
-						  reserved_clusters;
-				/*
-				 * It seems we claimed few clusters outside of
-				 * the range of this allocation. We should give
-				 * it back to the reservation pool. This can
-				 * happen in the following case:
-				 *
-				 * * Suppose s_cluster_ratio is 4 (i.e., each
-				 *   cluster has 4 blocks. Thus, the clusters
-				 *   are [0-3],[4-7],[8-11]...
-				 * * First comes delayed allocation write for
-				 *   logical blocks 10 & 11. Since there were no
-				 *   previous delayed allocated blocks in the
-				 *   range [8-11], we would reserve 1 cluster
-				 *   for this write.
-				 * * Next comes write for logical blocks 3 to 8.
-				 *   In this case, we will reserve 2 clusters
-				 *   (for [0-3] and [4-7]; and not for [8-11] as
-				 *   that range has a delayed allocated blocks.
-				 *   Thus total reserved clusters now becomes 3.
-				 * * Now, during the delayed allocation writeout
-				 *   time, we will first write blocks [3-8] and
-				 *   allocate 3 clusters for writing these
-				 *   blocks. Also, we would claim all these
-				 *   three clusters above.
-				 * * Now when we come here to writeout the
-				 *   blocks [10-11], we would expect to claim
-				 *   the reservation of 1 cluster we had made
-				 *   (and we would claim it since there are no
-				 *   more delayed allocated blocks in the range
-				 *   [8-11]. But our reserved cluster count had
-				 *   already gone to 0.
-				 *
-				 * Thus, at the step 4 above when we determine
-				 * that there are still some unwritten delayed
-				 * allocated blocks outside of our current
-				 * block range, we should increment the
-				 * reserved clusters count so that when the
-				 * remaining blocks finally gets written, we
-				 * could claim them.
-				 */
-				dquot_reserve_block(inode,
-						EXT4_C2B(sbi, reservation));
-				spin_lock(&ei->i_block_reservation_lock);
-				ei->i_reserved_data_blocks += reservation;
-				spin_unlock(&ei->i_block_reservation_lock);
-			}
+	if (test_opt(inode->i_sb, DELALLOC) && !map_from_cluster) {
+		if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
 			/*
-			 * We will claim quota for all newly allocated blocks.
-			 * We're updating the reserved space *after* the
-			 * correction above so we do not accidentally free
-			 * all the metadata reservation because we might
-			 * actually need it later on.
+			 * when allocating delayed allocated clusters, simply
+			 * reduce the reserved cluster count and claim quota
 			 */
 			ext4_da_update_reserve_space(inode, allocated_clusters,
 							1);
+		} else {
+			ext4_lblk_t lblk, len;
+			unsigned int n;
+
+			/*
+			 * When allocating non-delayed allocated clusters
+			 * (from fallocate, filemap, DIO, or clusters
+			 * allocated when delalloc has been disabled by
+			 * ext4_nonda_switch), reduce the reserved cluster
+			 * count by the number of allocated clusters that
+			 * have previously been delayed allocated.  Quota
+			 * has been claimed by ext4_mb_new_blocks() above,
+			 * so release the quota reservations made for any
+			 * previously delayed allocated clusters.
+			 */
+			lblk = EXT4_LBLK_CMASK(sbi, map->m_lblk);
+			len = allocated_clusters << sbi->s_cluster_bits;
+			n = ext4_es_delayed_clu(inode, lblk, len);
+			if (n > 0)
+				ext4_da_update_reserve_space(inode, (int) n, 0);
 		}
 	}
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 0c395b5a57a2..afcfa70b95d8 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1317,3 +1317,84 @@ static int es_reclaim_extents(struct ext4_inode_info *ei, int *nr_to_scan)
 	ei->i_es_tree.cache_es = NULL;
 	return nr_shrunk;
 }
+
+/*
+ * Returns the number of clusters containing delalloc blocks in the
+ * range specified by @start and @end.  Any cluster or part of a cluster
+ * within the range and containing a delalloc block is counted as a whole
+ * cluster.
+ */
+static unsigned int __es_delayed_clu(struct inode *inode, ext4_lblk_t start,
+				     ext4_lblk_t end)
+{
+	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
+	struct extent_status *es;
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	struct rb_node *node;
+	ext4_lblk_t first_lclu, last_lclu;
+	unsigned long long last_counted_lclu;
+	unsigned int n = 0;
+
+	/* guaranteed to be unequal to any ext4_lblk_t value */
+	last_counted_lclu = ~0;
+
+	es = __es_tree_search(&tree->root, start);
+
+	while (es && (es->es_lblk <= end)) {
+		if (ext4_es_is_delayed(es) && !ext4_es_is_unwritten(es)) {
+			if (es->es_lblk <= start)
+				first_lclu = EXT4_B2C(sbi, start);
+			else
+				first_lclu = EXT4_B2C(sbi, es->es_lblk);
+
+			if (ext4_es_end(es) >= end)
+				last_lclu = EXT4_B2C(sbi, end);
+			else
+				last_lclu = EXT4_B2C(sbi, ext4_es_end(es));
+
+			if (first_lclu == last_counted_lclu)
+				n += last_lclu - first_lclu;
+			else
+				n += last_lclu - first_lclu + 1;
+			last_counted_lclu = last_lclu;
+		}
+		node = rb_next(&es->rb_node);
+		if (!node)
+			break;
+		es = rb_entry(node, struct extent_status, rb_node);
+	}
+
+	return n;
+}
+
+/*
+ * Returns the number of clusters containing delalloc blocks in the
+ * range specified by @lblk and @len.  Any cluster or part of a cluster
+ * within the range and containing a delalloc block is counted as a whole
+ * cluster.  Typically used in cases where the start of the range is
+ * aligned on the start of a cluster and the end of the range on the end
+ * of a cluster.
+ *
+ * This is a simple wrapper for independent use of __es_delayed_clu().
+ */
+unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk,
+				 ext4_lblk_t len)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	ext4_lblk_t end;
+	unsigned int n;
+
+	if (len == 0)
+		return 0;
+
+	end = lblk + len - 1;
+	BUG_ON(end < lblk);
+
+	read_lock(&ei->i_es_lock);
+
+	n = __es_delayed_clu(inode, lblk, end);
+
+	read_unlock(&ei->i_es_lock);
+
+	return n;
+}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 439143f7504f..da76394108c8 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -186,4 +186,7 @@ extern void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi);
 
 extern int ext4_seq_es_shrinker_info_show(struct seq_file *seq, void *v);
 
+extern unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk,
+					ext4_lblk_t len);
+
 #endif	/* _EXT4_EXTENTS_STATUS_H */

From patchwork Sun May 13 17:56:22 2018
X-Patchwork-Submitter: Eric Whitney
X-Patchwork-Id: 912551
[73.60.226.25]) by smtp.gmail.com with ESMTPSA id z123-v6sm616164qkc.43.2018.05.13.10.56.45 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 13 May 2018 10:56:46 -0700 (PDT) From: Eric Whitney To: linux-ext4@vger.kernel.org Cc: tytso@mit.edu, Eric Whitney Subject: [RFC PATCH 3/5] ext4: adjust reserved cluster count when removing extents Date: Sun, 13 May 2018 13:56:22 -0400 Message-Id: <20180513175624.12887-4-enwlinux@gmail.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180513175624.12887-1-enwlinux@gmail.com> References: <20180513175624.12887-1-enwlinux@gmail.com> Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org After applying the previous two patches in this series, it's not uncommon to see kernel error messages reporting a shortage of cluster reservations when extents are mapped after files have been truncated or punched. This can lead to data loss because a cluster may not be available for allocation at the time a delayed extent must be mapped. The overcounting addressed by the previous two patches appears to have masked problems with the code in ext4_da_page_release_reservation() that reduces the reserved cluster count on truncation, hole punching, etc. For example, the reserved cluster count may actually need to be incremented when a page is invalidated if that page was the last page in an allocated cluster shared with a delayed extent. This situation could arise if a write to a portion of a cluster was mapped prior to a delayed write to another portion of the same cluster. No reservation would be made on the second write, since the cluster was allocated, but liability for that write would remain if the allocated cluster was freed before the second write was mapped. It's also not clear that it's checking the proper blocks for delayed allocation status when reducing the reserved cluster count. This should only be done for those pages whose buffer delay bit has been set. 
This patch and the following two patches address these problems. Modify ext4_ext_remove_space() and the code it calls to correct the reserved cluster count for delayed allocated clusters shared with allocated blocks when a block range is removed from the extent tree. A shared cluster (referred to as a partial cluster in the code) can occur at the ends of a written or unwritten extent when the starting or ending cluster is not aligned on a starting or ending cluster boundary, respectively. The reserved cluster count is incremented if the portion of a partial cluster not shared by the block range to be removed is shared with at least one delayed extent but not shared with any other written or unwritten extent. This reflects the fact that the partial delayed cluster requires a new reservation to reserve space now that the allocated partial cluster has been freed. Add a new function, ext4_rereserve_cluster(), to reapply a reservation on a delayed allocated cluster sharing blocks with a freed allocated cluster. To avoid ENOSPC on reservation, a flag is applied to ext4_free_blocks() to briefly defer updating the freeclusters counter when an allocated cluster is freed. This prevents another thread from allocating the freed block before a check can be made to determine whether a reservation should be reapplied; ext4_rereserve_cluster() performs the check and the deferred freecluster counter update. 
Signed-off-by: Eric Whitney --- fs/ext4/ext4.h | 1 + fs/ext4/extents.c | 309 +++++++++++++++++++++++++++++++++++------------------- fs/ext4/mballoc.c | 11 +- 3 files changed, 209 insertions(+), 112 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 6ee2fded64bf..d16064104cd2 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -617,6 +617,7 @@ enum { #define EXT4_FREE_BLOCKS_NO_QUOT_UPDATE 0x0008 #define EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER 0x0010 #define EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER 0x0020 +#define EXT4_FREE_BLOCKS_RERESERVE_CLUSTER 0x0040 /* * ioctl commands diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index d3e4e482a475..66e0df0860b6 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -45,6 +45,13 @@ #define EXT4_EXT_DATA_VALID1 0x8 /* first half contains valid data */ #define EXT4_EXT_DATA_VALID2 0x10 /* second half contains valid data */ +struct partial_cluster { + ext4_fsblk_t pclu; + ext4_lblk_t lblk; + enum {initial, tofree, nofree} state; + bool fragment; +}; + static __le32 ext4_extent_block_csum(struct inode *inode, struct ext4_extent_header *eh) { @@ -2484,102 +2491,172 @@ static inline int get_default_free_blocks_flags(struct inode *inode) return 0; } +/* + * Used when freeing a delayed partial cluster containing @lblk when + * the RERESERVE_CLUSTER flag has been applied to ext4_free_blocks(). 
+ */ +static void ext4_rereserve_cluster(struct inode *inode, ext4_lblk_t lblk) +{ + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); + struct ext4_inode_info *ei = EXT4_I(inode); + + /* + * rereserve if cluster is being freed and if any block in the + * cluster is delayed - if so, the reserved count was decremented + * for the allocated cluster we're freeing + */ + if (ext4_find_delalloc_cluster(inode, lblk)) { + spin_lock(&ei->i_block_reservation_lock); + ei->i_reserved_data_blocks++; + percpu_counter_add(&sbi->s_dirtyclusters_counter, 1); + spin_unlock(&ei->i_block_reservation_lock); + } + + /* + * the free_clusters counter is always corrected to reflect + * ext4_free_blocks() freeing the cluster - it's just being + * deferred until we can determine whether the dirtyclusters_counter + * should be incremented. + */ + percpu_counter_add(&sbi->s_freeclusters_counter, 1); +} + static int ext4_remove_blocks(handle_t *handle, struct inode *inode, struct ext4_extent *ex, - long long *partial_cluster, + struct partial_cluster *partial, ext4_lblk_t from, ext4_lblk_t to) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); unsigned short ee_len = ext4_ext_get_actual_len(ex); - ext4_fsblk_t pblk; - int flags = get_default_free_blocks_flags(inode); + ext4_fsblk_t last_pblk, pblk; + ext4_lblk_t num; + int flags; + + /* only extent tail removal is allowed */ + if (from < le32_to_cpu(ex->ee_block) || + to != le32_to_cpu(ex->ee_block) + ee_len - 1) { + ext4_error(sbi->s_sb, "strange request: removal(2) " + "%u-%u from %u:%u", + from, to, le32_to_cpu(ex->ee_block), ee_len); + return 0; + } + +#ifdef EXTENTS_STATS + spin_lock(&sbi->s_ext_stats_lock); + sbi->s_ext_blocks += ee_len; + sbi->s_ext_extents++; + if (ee_len < sbi->s_ext_min) + sbi->s_ext_min = ee_len; + if (ee_len > sbi->s_ext_max) + sbi->s_ext_max = ee_len; + if (ext_depth(inode) > sbi->s_depth_max) + sbi->s_depth_max = ext_depth(inode); + spin_unlock(&sbi->s_ext_stats_lock); +#endif + + /* temporary - fix partial 
reporting */ + trace_ext4_remove_blocks(inode, ex, from, to, + (long long) partial->pclu); + + /* + * if we have a partial cluster, and it's different from the + * cluster of the last block in the extent, we free it + */ + last_pblk = ext4_ext_pblock(ex) + ee_len - 1; + + if (partial->state != initial && + partial->pclu != EXT4_B2C(sbi, last_pblk)) { + if (partial->state == tofree) { + flags = get_default_free_blocks_flags(inode); + if (test_opt(inode->i_sb, DELALLOC)) + flags |= EXT4_FREE_BLOCKS_RERESERVE_CLUSTER; + ext4_free_blocks(handle, inode, NULL, + EXT4_C2B(sbi, partial->pclu), + sbi->s_cluster_ratio, flags); + if (test_opt(inode->i_sb, DELALLOC)) + ext4_rereserve_cluster(inode, partial->lblk); + } + partial->state = initial; + } + + num = le32_to_cpu(ex->ee_block) + ee_len - from; + pblk = ext4_ext_pblock(ex) + ee_len - num; + + /* + * We free the partial cluster at the end of the extent (if any), + * unless the cluster is used by another extent (partial_cluster + * state is nofree). If a partial cluster exists here, it must be + * shared with the last block in the extent. 
+ */ + flags = get_default_free_blocks_flags(inode); + + if (partial->state == nofree) { + flags |= EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER; + } else { + /* partial, left end cluster aligned, right end unaligned */ + if (test_opt(inode->i_sb, DELALLOC) && + EXT4_LBLK_CMASK(sbi, to) >= from && + EXT4_LBLK_COFF(sbi, to) != sbi->s_cluster_ratio - 1) { + if (partial->state == initial || + (partial->state == tofree && partial->fragment) || + (partial->state == tofree && to + 1 != + partial->lblk)) { + flags |= EXT4_FREE_BLOCKS_RERESERVE_CLUSTER; + ext4_free_blocks(handle, inode, NULL, + EXT4_PBLK_CMASK(sbi, last_pblk), + sbi->s_cluster_ratio, flags); + ext4_rereserve_cluster(inode, to); + partial->state = initial; + flags = get_default_free_blocks_flags(inode); + flags |= EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER; + } + } + } /* * For bigalloc file systems, we never free a partial cluster - * at the beginning of the extent. Instead, we make a note - * that we tried freeing the cluster, and check to see if we + * at the beginning of the extent. Instead, we check to see if we * need to free it on a subsequent call to ext4_remove_blocks, * or at the end of ext4_ext_rm_leaf or ext4_ext_remove_space. */ + flags |= EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER; + ext4_free_blocks(handle, inode, NULL, pblk, num, flags); + + /* reset the partial cluster if we've freed past it */ + if (partial->state != initial && partial->pclu != EXT4_B2C(sbi, pblk)) + partial->state = initial; - trace_ext4_remove_blocks(inode, ex, from, to, *partial_cluster); /* - * If we have a partial cluster, and it's different from the - * cluster of the last block, we need to explicitly free the - * partial cluster here. + * If we've freed the entire extent but the beginning is not left + * cluster aligned and is not marked as ineligible for freeing record + * the partial cluster at the beginning of the extent. 
It wasn't freed + * by the preceding ext4_free_blocks() call, and we need to look + * farther to the left to determine if it's to be freed (not shared + * with another extent). Else, reset the partial cluster - we're either + * done freeing or the beginning of the extent is left cluster aligned. */ - pblk = ext4_ext_pblock(ex) + ee_len - 1; - if (*partial_cluster > 0 && - *partial_cluster != (long long) EXT4_B2C(sbi, pblk)) { - ext4_free_blocks(handle, inode, NULL, - EXT4_C2B(sbi, *partial_cluster), - sbi->s_cluster_ratio, flags); - *partial_cluster = 0; - } - -#ifdef EXTENTS_STATS - { - struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); - spin_lock(&sbi->s_ext_stats_lock); - sbi->s_ext_blocks += ee_len; - sbi->s_ext_extents++; - if (ee_len < sbi->s_ext_min) - sbi->s_ext_min = ee_len; - if (ee_len > sbi->s_ext_max) - sbi->s_ext_max = ee_len; - if (ext_depth(inode) > sbi->s_depth_max) - sbi->s_depth_max = ext_depth(inode); - spin_unlock(&sbi->s_ext_stats_lock); + if (EXT4_LBLK_COFF(sbi, from) && num == ee_len) { + if (partial->state == initial) { + partial->pclu = EXT4_B2C(sbi, pblk); + partial->lblk = from; + partial->fragment = + (bool) (EXT4_LBLK_CMASK(sbi, to) < from); + partial->state = tofree; + } + if (partial->state == tofree) { + /* + * Look for gap between right edge and outstanding + * partial cluster. Left edge handled in case above. + */ + if (!partial->fragment && to + 1 != partial->lblk) + partial->fragment = true; + partial->lblk = from; + } + } else { + partial->state = initial; } -#endif - if (from >= le32_to_cpu(ex->ee_block) - && to == le32_to_cpu(ex->ee_block) + ee_len - 1) { - /* tail removal */ - ext4_lblk_t num; - long long first_cluster; - - num = le32_to_cpu(ex->ee_block) + ee_len - from; - pblk = ext4_ext_pblock(ex) + ee_len - num; - /* - * Usually we want to free partial cluster at the end of the - * extent, except for the situation when the cluster is still - * used by any other extent (partial_cluster is negative). 
- */ - if (*partial_cluster < 0 && - *partial_cluster == -(long long) EXT4_B2C(sbi, pblk+num-1)) - flags |= EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER; - ext_debug("free last %u blocks starting %llu partial %lld\n", - num, pblk, *partial_cluster); - ext4_free_blocks(handle, inode, NULL, pblk, num, flags); - /* - * If the block range to be freed didn't start at the - * beginning of a cluster, and we removed the entire - * extent and the cluster is not used by any other extent, - * save the partial cluster here, since we might need to - * delete if we determine that the truncate or punch hole - * operation has removed all of the blocks in the cluster. - * If that cluster is used by another extent, preserve its - * negative value so it isn't freed later on. - * - * If the whole extent wasn't freed, we've reached the - * start of the truncated/punched region and have finished - * removing blocks. If there's a partial cluster here it's - * shared with the remainder of the extent and is no longer - * a candidate for removal. 
- */ - if (EXT4_PBLK_COFF(sbi, pblk) && ee_len == num) { - first_cluster = (long long) EXT4_B2C(sbi, pblk); - if (first_cluster != -*partial_cluster) - *partial_cluster = first_cluster; - } else { - *partial_cluster = 0; - } - } else - ext4_error(sbi->s_sb, "strange request: removal(2) " - "%u-%u from %u:%u", - from, to, le32_to_cpu(ex->ee_block), ee_len); return 0; } @@ -2602,7 +2679,7 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode, static int ext4_ext_rm_leaf(handle_t *handle, struct inode *inode, struct ext4_ext_path *path, - long long *partial_cluster, + struct partial_cluster *partial, ext4_lblk_t start, ext4_lblk_t end) { struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); @@ -2634,7 +2711,8 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode, ex_ee_block = le32_to_cpu(ex->ee_block); ex_ee_len = ext4_ext_get_actual_len(ex); - trace_ext4_ext_rm_leaf(inode, start, ex, *partial_cluster); + /* temporary - fix partial reporting */ + trace_ext4_ext_rm_leaf(inode, start, ex, (long long) partial->pclu); while (ex >= EXT_FIRST_EXTENT(eh) && ex_ee_block + ex_ee_len > start) { @@ -2665,8 +2743,8 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode, */ if (sbi->s_cluster_ratio > 1) { pblk = ext4_ext_pblock(ex); - *partial_cluster = - -(long long) EXT4_B2C(sbi, pblk); + partial->pclu = EXT4_B2C(sbi, pblk); + partial->state = nofree; } ex--; ex_ee_block = le32_to_cpu(ex->ee_block); @@ -2708,8 +2786,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode, if (err) goto out; - err = ext4_remove_blocks(handle, inode, ex, partial_cluster, - a, b); + err = ext4_remove_blocks(handle, inode, ex, partial, a, b); if (err) goto out; @@ -2763,18 +2840,23 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode, * If there's a partial cluster and at least one extent remains in * the leaf, free the partial cluster if it isn't shared with the * current extent. 
If it is shared with the current extent - * we zero partial_cluster because we've reached the start of the + * we reset the partial cluster because we've reached the start of the * truncated/punched region and we're done removing blocks. */ - if (*partial_cluster > 0 && ex >= EXT_FIRST_EXTENT(eh)) { + if (partial->state == tofree && ex >= EXT_FIRST_EXTENT(eh)) { pblk = ext4_ext_pblock(ex) + ex_ee_len - 1; - if (*partial_cluster != (long long) EXT4_B2C(sbi, pblk)) { + if (partial->pclu != EXT4_B2C(sbi, pblk)) { + int flags = get_default_free_blocks_flags(inode); + + if (test_opt(inode->i_sb, DELALLOC)) + flags |= EXT4_FREE_BLOCKS_RERESERVE_CLUSTER; ext4_free_blocks(handle, inode, NULL, - EXT4_C2B(sbi, *partial_cluster), - sbi->s_cluster_ratio, - get_default_free_blocks_flags(inode)); + EXT4_C2B(sbi, partial->pclu), + sbi->s_cluster_ratio, flags); + if (test_opt(inode->i_sb, DELALLOC)) + ext4_rereserve_cluster(inode, partial->lblk); } - *partial_cluster = 0; + partial->state = initial; } /* if this leaf is free, then we should @@ -2813,10 +2895,15 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); int depth = ext_depth(inode); struct ext4_ext_path *path = NULL; - long long partial_cluster = 0; + struct partial_cluster partial; handle_t *handle; int i = 0, err = 0; + partial.pclu = 0; + partial.lblk = 0; + partial.state = initial; + partial.fragment = false; + ext_debug("truncate since %u to %u\n", start, end); /* probably first extent we're gonna free will be last in block */ @@ -2876,8 +2963,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, */ if (sbi->s_cluster_ratio > 1) { pblk = ext4_ext_pblock(ex) + end - ee_block + 2; - partial_cluster = - -(long long) EXT4_B2C(sbi, pblk); + partial.pclu = EXT4_B2C(sbi, pblk); + partial.state = nofree; } /* @@ -2905,9 +2992,10 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, &ex); if (err) goto out; - if (pblk) - 
partial_cluster = - -(long long) EXT4_B2C(sbi, pblk); + if (pblk) { + partial.pclu = EXT4_B2C(sbi, pblk); + partial.state = nofree; + } } } /* @@ -2942,8 +3030,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, if (i == depth) { /* this is leaf block */ err = ext4_ext_rm_leaf(handle, inode, path, - &partial_cluster, start, - end); + &partial, start, end); /* root level has p_bh == NULL, brelse() eats this */ brelse(path[i].p_bh); path[i].p_bh = NULL; @@ -3015,21 +3102,25 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start, } } + /* temporary - fix partial reporting */ trace_ext4_ext_remove_space_done(inode, start, end, depth, - partial_cluster, path->p_hdr->eh_entries); + (long long) partial.pclu, path->p_hdr->eh_entries); /* - * If we still have something in the partial cluster and we have removed - * even the first extent, then we should free the blocks in the partial - * cluster as well. (This code will only run when there are no leaves - * to the immediate left of the truncated/punched region.) + * If there's a partial cluster and we have removed the first extent + * in the file, then we should free the partial cluster as well. 
*/ - if (partial_cluster > 0 && err == 0) { - /* don't zero partial_cluster since it's not used afterwards */ + if (partial.state == tofree && err == 0) { + int flags = get_default_free_blocks_flags(inode); + + if (test_opt(inode->i_sb, DELALLOC)) + flags |= EXT4_FREE_BLOCKS_RERESERVE_CLUSTER; ext4_free_blocks(handle, inode, NULL, - EXT4_C2B(sbi, partial_cluster), - sbi->s_cluster_ratio, - get_default_free_blocks_flags(inode)); + EXT4_C2B(sbi, partial.pclu), + sbi->s_cluster_ratio, flags); + if (test_opt(inode->i_sb, DELALLOC)) + ext4_rereserve_cluster(inode, partial.lblk); + partial.state = initial; } /* TODO: flexible tree reduction should be here */ diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 769a62708b1c..0f0b75a2ce60 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4934,9 +4934,14 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode, &sbi->s_flex_groups[flex_group].free_clusters); } - if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE)) - dquot_free_block(inode, EXT4_C2B(sbi, count_clusters)); - percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters); + if (flags & EXT4_FREE_BLOCKS_RERESERVE_CLUSTER) { + dquot_reclaim_block(inode, EXT4_C2B(sbi, count_clusters)); + } else { + if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE)) + dquot_free_block(inode, EXT4_C2B(sbi, count_clusters)); + percpu_counter_add(&sbi->s_freeclusters_counter, + count_clusters); + } ext4_mb_unload_buddy(&e4b);

From patchwork Sun May 13 17:56:23 2018 X-Patchwork-Submitter: Eric Whitney X-Patchwork-Id: 912550 From: Eric Whitney To: linux-ext4@vger.kernel.org Cc: tytso@mit.edu, Eric Whitney Subject: [RFC PATCH 4/5] ext4: release delayed allocated clusters when removing block ranges Date: Sun, 13 May 2018 13:56:23 -0400 Message-Id: <20180513175624.12887-5-enwlinux@gmail.com> In-Reply-To: <20180513175624.12887-1-enwlinux@gmail.com> References: <20180513175624.12887-1-enwlinux@gmail.com>

Once it's possible to accurately determine the number of reserved clusters outstanding after all allocated blocks in a block range have been removed from both the extent tree and extents status tree, determining the number of reserved clusters to be subtracted from the reserved cluster count is a relatively straightforward matter of counting the number of clusters belonging to delayed extents in the extents status tree which are not shared with any other allocated or delayed allocated extents. This can be achieved by reversing the current order in which ext4_ext_remove_space() and ext4_es_remove_extent() are called. For now, a call to a new function to count the delayed clusters in the extents status tree and to adjust the reserved cluster total is inserted between these calls. This could also be integrated in a new version of ext4_es_remove_extent() to avoid a second pass over the extents status tree if performance becomes a concern.
Determining whether a delayed allocated cluster is to be included in the total to be subtracted when a block range is removed is a little involved in the code, but the principle is straightforward. A delayed allocated cluster wholly contained within the block range to be removed is counted unconditionally. A delayed allocated cluster that is not wholly contained within the range (referred to as a partial cluster in the code) only counts towards the total if none of the blocks in the cluster outside the range are included in either another delayed allocated or allocated extent. Signed-off-by: Eric Whitney --- fs/ext4/ext4.h | 6 ++ fs/ext4/extents.c | 68 ++++++++++++++++++-- fs/ext4/extents_status.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/extents_status.h | 6 ++ fs/ext4/inode.c | 32 +++++++--- 5 files changed, 254 insertions(+), 16 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index d16064104cd2..5bc0903f66c5 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -283,6 +283,9 @@ struct ext4_io_submit { ~((ext4_fsblk_t) (s)->s_cluster_ratio - 1)) #define EXT4_LBLK_CMASK(s, lblk) ((lblk) & \ ~((ext4_lblk_t) (s)->s_cluster_ratio - 1)) +/* Set the low bits to get the last block in a cluster */ +#define EXT4_LBLK_CFILL(s, lblk) ((lblk) | \ + ((ext4_lblk_t) (s)->s_cluster_ratio - 1)) /* Get the cluster offset */ #define EXT4_PBLK_COFF(s, pblk) ((pblk) & \ ((ext4_fsblk_t) (s)->s_cluster_ratio - 1)) @@ -2468,6 +2471,7 @@ extern int ext4_page_mkwrite(struct vm_fault *vmf); extern int ext4_filemap_fault(struct vm_fault *vmf); extern qsize_t *ext4_get_reserved_space(struct inode *inode); extern int ext4_get_projid(struct inode *inode, kprojid_t *projid); +extern void ext4_da_release_space(struct inode *inode, int to_free); extern void ext4_da_update_reserve_space(struct inode *inode, int used, int quota_claim); extern int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk, @@ -3137,6 +3141,8 @@ extern int
ext4_swap_extents(handle_t *handle, struct inode *inode1, ext4_lblk_t lblk2, ext4_lblk_t count, int mark_unwritten,int *err); extern int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu); +extern void ext4_release_reservations(struct inode *inode, ext4_lblk_t start, + ext4_lblk_t len); /* move_extent.c */ extern void ext4_double_down_write_data_sem(struct inode *first, diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 66e0df0860b6..13582cd4905c 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4627,6 +4627,19 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode) last_block = (inode->i_size + sb->s_blocksize - 1) >> EXT4_BLOCK_SIZE_BITS(sb); + + /* + * call to ext4_ext_remove_space() must precede ext4_es_remove_extent() + * for correct cluster reservation accounting + */ + err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1); + if (err) + return err; + + if (test_opt(inode->i_sb, DELALLOC)) + ext4_release_reservations(inode, last_block, + EXT_MAX_BLOCKS - last_block); + retry: err = ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block); @@ -4635,9 +4648,7 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode) congestion_wait(BLK_RW_ASYNC, HZ/50); goto retry; } - if (err) - return err; - return ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1); + return err; } static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset, @@ -4975,6 +4986,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) } out: inode_unlock(inode); + trace_ext4_fallocate_exit(inode, offset, max_blocks, ret); return ret; } @@ -5528,18 +5540,27 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) down_write(&EXT4_I(inode)->i_data_sem); ext4_discard_preallocations(inode); - ret = ext4_es_remove_extent(inode, punch_start, - EXT_MAX_BLOCKS - punch_start); + /* + * call to ext4_ext_remove_space() must precede ext4_es_remove_extent() + * for correct cluster 
reservation accounting + */ + ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1); if (ret) { up_write(&EXT4_I(inode)->i_data_sem); goto out_stop; } - ret = ext4_ext_remove_space(inode, punch_start, punch_stop - 1); + if (test_opt(inode->i_sb, DELALLOC)) + ext4_release_reservations(inode, punch_start, + EXT_MAX_BLOCKS - punch_start); + + ret = ext4_es_remove_extent(inode, punch_start, + EXT_MAX_BLOCKS - punch_start); if (ret) { up_write(&EXT4_I(inode)->i_data_sem); goto out_stop; } + ext4_discard_preallocations(inode); ret = ext4_ext_shift_extents(inode, handle, punch_stop, @@ -6010,3 +6031,38 @@ int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu) return err ? err : mapped; } + +/* + * releases the reservations on the delayed allocated clusters found in + * the block range extending from @start for @len blocks, inclusive + */ +void ext4_release_reservations(struct inode *inode, ext4_lblk_t start, + ext4_lblk_t len) +{ + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); + unsigned int n = 0; + unsigned long long start_partial, end_partial; + int ret; + + n = ext4_es_delayed_clu_partials(inode, start, len, &start_partial, + &end_partial); + + if (sbi->s_cluster_ratio > 1) { + if (start_partial != ~0) { + ret = ext4_clu_mapped(inode, start_partial); + if (ret < 0) + goto out; + n++; + } + + if ((end_partial != ~0) && (end_partial != start_partial)) { + ret = ext4_clu_mapped(inode, end_partial); + if (ret < 0) + goto out; + n++; + } + } + +out: + ext4_da_release_space(inode, (int) n); +} diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c index afcfa70b95d8..55212f6f973d 100644 --- a/fs/ext4/extents_status.c +++ b/fs/ext4/extents_status.c @@ -1398,3 +1398,161 @@ unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk, return n; } + +/* + * Returns true if there is at least one delayed and not unwritten extent + * (a delayed extent whose blocks have not been allocated for an unwritten + * extent) in the range specified by 
@start and @end. Returns false if not.
+ */
+static bool __es_delayed_range(struct inode *inode, ext4_lblk_t start,
+			       ext4_lblk_t end)
+{
+	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
+	struct rb_node *node;
+	struct extent_status *es;
+
+	es = __es_tree_search(&tree->root, start);
+
+	while (es && (es->es_lblk <= end)) {
+		if (ext4_es_is_delayed(es) && !ext4_es_is_unwritten(es))
+			return true;
+		node = rb_next(&es->rb_node);
+		if (!node)
+			break;
+		es = rb_entry(node, struct extent_status, rb_node);
+	}
+	return false;
+}
+
+/*
+ * Returns true if there are no extents marked written, unwritten, or
+ * delayed anywhere in the range specified by @start and @end. Returns
+ * false otherwise.
+ */
+static bool __es_empty_range(struct inode *inode, ext4_lblk_t start,
+			     ext4_lblk_t end)
+{
+	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
+	struct rb_node *node;
+	struct extent_status *es;
+
+	es = __es_tree_search(&tree->root, start);
+
+	while (es && (es->es_lblk <= end)) {
+		if (!ext4_es_is_hole(es))
+			return false;
+		node = rb_next(&es->rb_node);
+		if (!node)
+			break;
+		es = rb_entry(node, struct extent_status, rb_node);
+	}
+	return true;
+}
+
+/*
+ * This function makes a potentially approximate count of the number of
+ * delalloc clusters in the range specified by @lblk and @len. It returns two
+ * kinds of information. It returns the number of whole clusters that
+ * contain delalloc blocks within the specified range. If these clusters are
+ * free of written or unwritten blocks, this is the number of cluster
+ * reservations that should be released if these clusters were to be
+ * deleted.
+ *
+ * It also returns the logical block numbers of partial clusters (if any) at
+ * the start and end of the specified range that could contribute to the
+ * number of reservations that should be released if the entire range was
+ * to be deleted via @start_partial and @end_partial. If a partial cluster
+ * candidate is found, it does not contain written or unwritten blocks
+ * and the remainder of the cluster as found in the extent status tree
+ * does not contain written, unwritten, or delayed blocks. A partial cluster
+ * can contribute to the total delalloc cluster count if the remainder of
+ * the cluster does not contain a written block as recorded in the
+ * extent tree. If a starting or ending delalloc partial cluster candidate
+ * is not found, @start_partial or @end_partial will be set to ~0.
+ *
+ * This function's interface is meant to be similar to ext4_es_remove_extent()
+ * to facilitate integration with that or a similar function in the future
+ * to avoid an extra pass over the extents status tree.
+ */
+unsigned int ext4_es_delayed_clu_partials(struct inode *inode, ext4_lblk_t lblk,
+					  ext4_lblk_t len,
+					  unsigned long long *start_partial,
+					  unsigned long long *end_partial)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	ext4_lblk_t end, next, last, end_clu;
+	unsigned int n = 0;
+
+	/* guaranteed to be unequal to any ext4_lblk_t value */
+	*start_partial = *end_partial = ~0;
+
+	if (len == 0)
+		return 0;
+
+	end = lblk + len - 1;
+	BUG_ON(end < lblk);
+
+	read_lock(&ei->i_es_lock);
+
+	/*
+	 * Examine the starting partial cluster, if any, for a possible delalloc
+	 * cluster candidate
+	 */
+	end_clu = EXT4_LBLK_CFILL(sbi, lblk);
+	if (EXT4_LBLK_COFF(sbi, lblk)) {
+		/* find first cluster's last block - cluster end or range end */
+		if (end_clu < end)
+			last = end_clu;
+		else
+			last = end;
+		if (__es_empty_range(inode, EXT4_LBLK_CMASK(sbi, lblk),
+				     lblk - 1) &&
+		    __es_delayed_range(inode, lblk, last)) {
+			*start_partial = EXT4_B2C(sbi, lblk);
+		}
+		next = last + 1;
+	} else {
+		next = lblk;
+	}
+
+	/*
+	 * Count the delayed clusters in the cluster-aligned region, if
+	 * present. next will be aligned on the start of a cluster.
+	 */
+	if ((next <= end) && (EXT4_LBLK_CFILL(sbi, next) <= end)) {
+		if (EXT4_LBLK_CFILL(sbi, end) == end)
+			/* single cluster case */
+			last = end;
+		else
+			/* multiple cluster case */
+			last = EXT4_LBLK_CMASK(sbi, end) - 1;
+		n = __es_delayed_clu(inode, next, last);
+		next = last + 1;
+	}
+
+	/*
+	 * Examine the ending partial cluster, if any, for a possible delalloc
+	 * cluster candidate
+	 */
+	end_clu = EXT4_LBLK_CFILL(sbi, end);
+	if (end != end_clu) {
+		if (next <= end) {
+			/* ending partial cluster case */
+			if (__es_delayed_range(inode, next, end) &&
+			    __es_empty_range(inode, end + 1, end_clu)) {
+				*end_partial = EXT4_B2C(sbi, end);
+			}
+		} else {
+			/* single partial cluster in range case */
+			if ((*start_partial != ~0) &&
+			    (!__es_empty_range(inode, end + 1, end_clu))) {
+				*start_partial = ~0;
+			}
+		}
+	}
+
+	read_unlock(&ei->i_es_lock);
+
+	return n;
+}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index da76394108c8..3f3ffa152daf 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -186,6 +186,12 @@
 extern void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi);
 extern int ext4_seq_es_shrinker_info_show(struct seq_file *seq, void *v);
 
+extern unsigned int ext4_es_delayed_clu_partials(struct inode *inode,
+						 ext4_lblk_t lblk,
+						 ext4_lblk_t len,
+						 unsigned long long *start_partial,
+						 unsigned long long *end_partial);
+
 extern unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk,
 					ext4_lblk_t len);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8f5235b2c094..a55b4db4a29c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -378,9 +378,9 @@ void ext4_da_update_reserve_space(struct inode *inode,
 		dquot_claim_block(inode, EXT4_C2B(sbi, used));
 	else {
 		/*
-		 * We did fallocate with an offset that is already delayed
+		 * We allocated a block with an offset that is already delayed
		 * allocated. So on delayed allocated writeback we should
-		 * not re-claim the quota for fallocated blocks.
+		 * not re-claim the quota for a previously allocated block.
		 */
		dquot_release_reservation_block(inode, EXT4_C2B(sbi, used));
	}
@@ -1593,7 +1593,7 @@ static int ext4_da_reserve_space(struct inode *inode)
	return 0;       /* success */
 }
 
-static void ext4_da_release_space(struct inode *inode, int to_free)
+void ext4_da_release_space(struct inode *inode, int to_free)
 {
	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
	struct ext4_inode_info *ei = EXT4_I(inode);
@@ -4325,19 +4325,31 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
	down_write(&EXT4_I(inode)->i_data_sem);
	ext4_discard_preallocations(inode);
 
-	ret = ext4_es_remove_extent(inode, first_block,
-				    stop_block - first_block);
-	if (ret) {
-		up_write(&EXT4_I(inode)->i_data_sem);
-		goto out_stop;
-	}
-
+	/*
+	 * call to ext4_ext_remove_space() must precede ext4_es_remove_extent()
+	 * for correct cluster reservation accounting
+	 */
	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
		ret = ext4_ext_remove_space(inode, first_block,
					    stop_block - 1);
	else
		ret = ext4_ind_remove_space(handle, inode, first_block,
					    stop_block);
+	if (ret) {
+		up_write(&EXT4_I(inode)->i_data_sem);
+		goto out_stop;
+	}
+
+	if (test_opt(inode->i_sb, DELALLOC))
+		ext4_release_reservations(inode, first_block,
+					  stop_block - first_block);
+
+	ret = ext4_es_remove_extent(inode, first_block,
+				    stop_block - first_block);
+	if (ret) {
+		up_write(&EXT4_I(inode)->i_data_sem);
+		goto out_stop;
+	}
 
	up_write(&EXT4_I(inode)->i_data_sem);
	if (IS_SYNC(inode))

From patchwork Sun May 13 17:56:24 2018
X-Patchwork-Submitter: Eric Whitney
X-Patchwork-Id: 912552
From: Eric Whitney
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, Eric Whitney
Subject: [RFC PATCH 5/5] ext4: don't release delalloc clusters when invalidating page
Date: Sun, 13 May 2018 13:56:24 -0400
Message-Id: <20180513175624.12887-6-enwlinux@gmail.com>
In-Reply-To: <20180513175624.12887-1-enwlinux@gmail.com>
References: <20180513175624.12887-1-enwlinux@gmail.com>

With the preceding two patches, it's now possible to accurately determine
the number of delayed allocated clusters to be released when removing a
block range from the extent tree and extents status tree in
ext4_ext_truncate(), ext4_punch_hole(), and ext4_collapse_range(). Since
it's not possible to do this when invalidating pages in
ext4_da_page_release_reservation(), remove those operations from it. It's
still appropriate to clear the delayed bits for invalidated buffers there.

Removal of block ranges in ext4_ext_truncate() and other functions appears
redundant with removal at page invalidate time, and removal of a block
range as a unit should generally incur less CPU overhead than page by page
(block by block) removal.

Note: this change does result in a regression for generic/036 when running
at least the 4k, bigalloc, and bigalloc_1k test cases using the
xfstests-bld test appliance.
The test passes, but new kernel error
messages of the form "Page cache invalidation failure on direct I/O.
Possible data corruption due to collision with buffered I/O!" appear in
dmesg. It's likely this patch violates a direct I/O implementation
requirement, perhaps making DIO vulnerable to read races with buffered
I/O. To be addressed.

Signed-off-by: Eric Whitney
---
 fs/ext4/inode.c | 51 +++++++++++++--------------------------------------
 1 file changed, 13 insertions(+), 38 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a55b4db4a29c..6a903a850d95 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1628,62 +1628,37 @@ void ext4_da_release_space(struct inode *inode, int to_free)
	dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free));
 }
 
+/*
+ * This code doesn't work and needs to be replaced. Not deleting delayed
+ * blocks from the extents status tree here and deferring until blocks
+ * are deleted from the extent tree causes problems for DIO. One avenue
+ * to explore is a partial reversion of the code here, altering the
+ * calls to delete blocks from the extents status tree to calls to
+ * invalidate those blocks, hopefully avoiding problems with the DIO code.
+ * They would then be deleted and accounted for when the extent tree is
+ * modified.
+ */
 static void ext4_da_page_release_reservation(struct page *page,
					     unsigned int offset,
					     unsigned int length)
 {
-	int to_release = 0, contiguous_blks = 0;
	struct buffer_head *head, *bh;
	unsigned int curr_off = 0;
-	struct inode *inode = page->mapping->host;
-	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
	unsigned int stop = offset + length;
-	int num_clusters;
-	ext4_fsblk_t lblk;
+	unsigned int next_off;
 
	BUG_ON(stop > PAGE_SIZE || stop < length);
 
	head = page_buffers(page);
	bh = head;
	do {
-		unsigned int next_off = curr_off + bh->b_size;
-
+		next_off = curr_off + bh->b_size;
		if (next_off > stop)
			break;
-
-		if ((offset <= curr_off) && (buffer_delay(bh))) {
-			to_release++;
-			contiguous_blks++;
+		if ((offset <= curr_off) && (buffer_delay(bh)))
			clear_buffer_delay(bh);
-		} else if (contiguous_blks) {
-			lblk = page->index <<
-			       (PAGE_SHIFT - inode->i_blkbits);
-			lblk += (curr_off >> inode->i_blkbits) -
-				contiguous_blks;
-			ext4_es_remove_extent(inode, lblk, contiguous_blks);
-			contiguous_blks = 0;
-		}
		curr_off = next_off;
	} while ((bh = bh->b_this_page) != head);
-
-	if (contiguous_blks) {
-		lblk = page->index << (PAGE_SHIFT - inode->i_blkbits);
-		lblk += (curr_off >> inode->i_blkbits) - contiguous_blks;
-		ext4_es_remove_extent(inode, lblk, contiguous_blks);
-	}
-
-	/* If we have released all the blocks belonging to a cluster, then we
-	 * need to release the reserved space for that cluster. */
-	num_clusters = EXT4_NUM_B2C(sbi, to_release);
-	while (num_clusters > 0) {
-		lblk = (page->index << (PAGE_SHIFT - inode->i_blkbits)) +
-			((num_clusters - 1) << sbi->s_cluster_bits);
-		if (sbi->s_cluster_ratio == 1 ||
-		    !ext4_find_delalloc_cluster(inode, lblk))
-			ext4_da_release_space(inode, 1);
-
-		num_clusters--;
-	}
 }
 
 /*