From patchwork Fri Aug 21 01:55:15 2020
X-Patchwork-Submitter: harshad shirwadkar
X-Patchwork-Id: 1348753
From: Harshad Shirwadkar
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, lyx1209@gmail.com, Harshad Shirwadkar, Andreas Dilger
Subject: [RFC PATCH v2 1/9] ext4: add handling for extended mount options
Date: Thu, 20 Aug 2020 18:55:15 -0700
Message-Id: <20200821015523.1698374-2-harshads@google.com>
In-Reply-To: <20200821015523.1698374-1-harshads@google.com>
References: <20200821015523.1698374-1-harshads@google.com>
X-Mailing-List: linux-ext4@vger.kernel.org

From: Harshad Shirwadkar

We are running out of mount option bits. Add handling for using
s_mount_opt2. This patch was originally submitted as part of the fast
commit patch series and is resent here with this series as well.

Signed-off-by: Harshad Shirwadkar
Reviewed-by: Andreas Dilger
---
 fs/ext4/super.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 13bdddc081e0..656eccf6fc9b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1738,6 +1738,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
 #define MOPT_EXT4_ONLY	(MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING	0x0400
 #define MOPT_SKIP	0x0800
+#define MOPT_2		0x1000
 
 static const struct mount_opts {
 	int	token;
@@ -2207,10 +2208,17 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 			WARN_ON(1);
 			return -1;
 		}
-		if (arg != 0)
-			sbi->s_mount_opt |= m->mount_opt;
-		else
-			sbi->s_mount_opt &= ~m->mount_opt;
+		if (m->flags & MOPT_2) {
+			if (arg != 0)
+				sbi->s_mount_opt2 |= m->mount_opt;
+			else
+				sbi->s_mount_opt2 &= ~m->mount_opt;
+		} else {
+			if (arg != 0)
+				sbi->s_mount_opt |= m->mount_opt;
+			else
+				sbi->s_mount_opt &= ~m->mount_opt;
+		}
 	}
 	return 1;
 }
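For illustration, the userspace sketch below (not part of the patch; everything
except MOPT_2, s_mount_opt and s_mount_opt2 is made up for the example) mimics
the dispatch added above: an option whose table entry carries MOPT_2 toggles a
bit in s_mount_opt2, all other options keep operating on s_mount_opt.

/*
 * Userspace sketch only -- not kernel code.
 */
#include <stdio.h>

#define MOPT_SET	0x0001	/* value chosen for the sketch */
#define MOPT_2		0x1000	/* same value as in the patch */

struct fake_sbi {
	unsigned long s_mount_opt;
	unsigned long s_mount_opt2;
};

static void apply_opt(struct fake_sbi *sbi, int flags,
		      unsigned long mount_opt, int arg)
{
	unsigned long *dst = (flags & MOPT_2) ? &sbi->s_mount_opt2
					      : &sbi->s_mount_opt;

	if (arg != 0)
		*dst |= mount_opt;	/* set the option bit */
	else
		*dst &= ~mount_opt;	/* clear the option bit */
}

int main(void)
{
	struct fake_sbi sbi = { 0, 0 };

	/* a hypothetical s_mount_opt2 option bit (0x20) being turned on */
	apply_opt(&sbi, MOPT_SET | MOPT_2, 0x20, 1);
	printf("s_mount_opt=%#lx s_mount_opt2=%#lx\n",
	       sbi.s_mount_opt, sbi.s_mount_opt2);
	return 0;
}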
From patchwork Fri Aug 21 01:55:16 2020
X-Patchwork-Submitter: harshad shirwadkar
X-Patchwork-Id: 1348755
From: Harshad Shirwadkar
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, lyx1209@gmail.com, Harshad Shirwadkar
Subject: [RFC PATCH v2 2/9] ext4: rename ext4_mb_load_buddy to ext4_mb_load_allocator
Date: Thu, 20 Aug 2020 18:55:16 -0700
Message-Id: <20200821015523.1698374-3-harshads@google.com>
In-Reply-To: <20200821015523.1698374-1-harshads@google.com>
References: <20200821015523.1698374-1-harshads@google.com>
X-Mailing-List: linux-ext4@vger.kernel.org

From: Harshad Shirwadkar

This patch renames ext4_mb_load_buddy and ext4_mb_unload_buddy to
ext4_mb_load_allocator and ext4_mb_unload_allocator, and adds a flags
argument to ext4_mb_load_allocator() that is currently unused. This
significantly reduces the size of the following patch, "ext4: add
freespace tree allocator". To keep this patchset shorter, the ext4_buddy
structure and the e4b variable names are not renamed here; that is
recorded as a TODO item instead.

Signed-off-by: Harshad Shirwadkar
---
 fs/ext4/mballoc.c | 86 ++++++++++++++++++++++++-----------------------
 1 file changed, 44 insertions(+), 42 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 132c118d12e1..48d791304bf1 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -29,6 +29,7 @@
  * - don't normalize tails
  * - quota
  * - reservation for superuser
+ * - rename ext4_buddy to ext4_allocator and e4b variables to allocator
  *
  * TODO v3:
  * - bitmap read-ahead (proposed by Oleg Drokin aka green)
@@ -92,7 +93,7 @@
  * mapped to the buddy and bitmap information regarding different
  * groups. The buddy information is attached to buddy cache inode so that
  * we can access them through the page cache. The information regarding
- * each group is loaded via ext4_mb_load_buddy.
The information involve + * each group is loaded via ext4_mb_load_allocator. The information involve * block bitmap and buddy information. The information are stored in the * inode as: * @@ -845,7 +846,7 @@ static void mb_regenerate_buddy(struct ext4_buddy *e4b) /* The buddy information is attached the buddy cache inode * for convenience. The information regarding each group - * is loaded via ext4_mb_load_buddy. The information involve + * is loaded via ext4_mb_load_allocator. The information involve * block bitmap and buddy information. The information are * stored in the inode as * @@ -1105,7 +1106,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp) * This ensures that we don't reinit the buddy cache * page which map to the group from which we are already * allocating. If we are looking at the buddy cache we would - * have taken a reference using ext4_mb_load_buddy and that + * have taken a reference using ext4_mb_load_allocator and that * would have pinned buddy page to page cache. * The call to ext4_mb_get_buddy_page_lock will mark the * page accessed. @@ -1157,8 +1158,8 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp) * calling this routine! */ static noinline_for_stack int -ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group, - struct ext4_buddy *e4b, gfp_t gfp) +ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group, + struct ext4_buddy *e4b, gfp_t gfp, int flags) { int blocks_per_page; int block; @@ -1293,13 +1294,13 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group, return ret; } -static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group, - struct ext4_buddy *e4b) +static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group, + struct ext4_buddy *e4b, int flags) { - return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS); + return ext4_mb_load_allocator_gfp(sb, group, e4b, GFP_NOFS, flags); } -static void ext4_mb_unload_buddy(struct ext4_buddy *e4b) +static void ext4_mb_unload_allocator(struct ext4_buddy *e4b) { if (e4b->bd_bitmap_page) put_page(e4b->bd_bitmap_page); @@ -1859,7 +1860,7 @@ int ext4_mb_try_best_found(struct ext4_allocation_context *ac, int err; BUG_ON(ex.fe_len <= 0); - err = ext4_mb_load_buddy(ac->ac_sb, group, e4b); + err = ext4_mb_load_allocator(ac->ac_sb, group, e4b, 0); if (err) return err; @@ -1872,7 +1873,7 @@ int ext4_mb_try_best_found(struct ext4_allocation_context *ac, } ext4_unlock_group(ac->ac_sb, group); - ext4_mb_unload_buddy(e4b); + ext4_mb_unload_allocator(e4b); return 0; } @@ -1893,12 +1894,12 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac, if (grp->bb_free == 0) return 0; - err = ext4_mb_load_buddy(ac->ac_sb, group, e4b); + err = ext4_mb_load_allocator(ac->ac_sb, group, e4b, 0); if (err) return err; if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) { - ext4_mb_unload_buddy(e4b); + ext4_mb_unload_allocator(e4b); return 0; } @@ -1936,7 +1937,7 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac, ext4_mb_use_best_found(ac, e4b); } ext4_unlock_group(ac->ac_sb, group); - ext4_mb_unload_buddy(e4b); + ext4_mb_unload_allocator(e4b); return 0; } @@ -2417,7 +2418,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) continue; } - err = ext4_mb_load_buddy(sb, group, &e4b); + err = ext4_mb_load_allocator(sb, group, &e4b, 0); if (err) goto out; @@ -2430,7 +2431,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) ret = ext4_mb_good_group(ac, group, cr); if (ret 
== 0) { ext4_unlock_group(sb, group); - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); continue; } @@ -2444,7 +2445,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) ext4_mb_complex_scan_group(ac, &e4b); ext4_unlock_group(sb, group); - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); if (ac->ac_status != AC_STATUS_CONTINUE) break; @@ -2543,7 +2544,7 @@ static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v) grinfo = ext4_get_group_info(sb, group); /* Load the group info in memory only if not already loaded. */ if (unlikely(EXT4_MB_GRP_NEED_INIT(grinfo))) { - err = ext4_mb_load_buddy(sb, group, &e4b); + err = ext4_mb_load_allocator(sb, group, &e4b, 0); if (err) { seq_printf(seq, "#%-5u: I/O error\n", group); return 0; @@ -2554,7 +2555,7 @@ static int ext4_mb_seq_groups_show(struct seq_file *seq, void *v) memcpy(&sg, ext4_get_group_info(sb, group), i); if (buddy_loaded) - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); seq_printf(seq, "#%-5u: %-5u %-5u %-5u [", group, sg.info.bb_free, sg.info.bb_fragments, sg.info.bb_first_free); @@ -3049,7 +3050,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb, mb_debug(sb, "gonna free %u blocks in group %u (0x%p):", entry->efd_count, entry->efd_group, entry); - err = ext4_mb_load_buddy(sb, entry->efd_group, &e4b); + err = ext4_mb_load_allocator(sb, entry->efd_group, &e4b, 0); /* we expect to find existing buddy because it's pinned */ BUG_ON(err != 0); @@ -3084,7 +3085,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb, } ext4_unlock_group(sb, entry->efd_group); kmem_cache_free(ext4_free_data_cachep, entry); - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); mb_debug(sb, "freed %d blocks in %d structures\n", count, count2); @@ -3558,12 +3559,13 @@ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac) if (pa == NULL) { if (ac->ac_f_ex.fe_len == 0) return; - err = ext4_mb_load_buddy(ac->ac_sb, ac->ac_f_ex.fe_group, &e4b); + err = ext4_mb_load_allocator(ac->ac_sb, ac->ac_f_ex.fe_group, + &e4b, 0); if (err) { /* * This should never happen since we pin the * pages in the ext4_allocation_context so - * ext4_mb_load_buddy() should never fail. + * ext4_mb_load_allocator() should never fail. 
*/ WARN(1, "mb_load_buddy failed (%d)", err); return; @@ -3572,7 +3574,7 @@ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac) mb_free_blocks(ac->ac_inode, &e4b, ac->ac_f_ex.fe_start, ac->ac_f_ex.fe_len); ext4_unlock_group(ac->ac_sb, ac->ac_f_ex.fe_group); - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); return; } if (pa->pa_type == MB_INODE_PA) @@ -4175,7 +4177,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb, goto out_dbg; } - err = ext4_mb_load_buddy(sb, group, &e4b); + err = ext4_mb_load_allocator(sb, group, &e4b, 0); if (err) { ext4_warning(sb, "Error %d loading buddy information for %u", err, group); @@ -4250,7 +4252,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb, out: ext4_unlock_group(sb, group); - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); put_bh(bitmap_bh); out_dbg: mb_debug(sb, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n", @@ -4347,8 +4349,8 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed) BUG_ON(pa->pa_type != MB_INODE_PA); group = ext4_get_group_number(sb, pa->pa_pstart); - err = ext4_mb_load_buddy_gfp(sb, group, &e4b, - GFP_NOFS|__GFP_NOFAIL); + err = ext4_mb_load_allocator_gfp(sb, group, &e4b, + GFP_NOFS|__GFP_NOFAIL, 0); if (err) { ext4_error_err(sb, -err, "Error %d loading buddy information for %u", err, group); @@ -4360,7 +4362,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed) err = PTR_ERR(bitmap_bh); ext4_error_err(sb, -err, "Error %d reading block bitmap for %u", err, group); - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); continue; } @@ -4369,7 +4371,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed) ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa); ext4_unlock_group(sb, group); - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); put_bh(bitmap_bh); list_del(&pa->u.pa_tmp_list); @@ -4642,8 +4644,8 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb, int err; group = ext4_get_group_number(sb, pa->pa_pstart); - err = ext4_mb_load_buddy_gfp(sb, group, &e4b, - GFP_NOFS|__GFP_NOFAIL); + err = ext4_mb_load_allocator_gfp(sb, group, &e4b, + GFP_NOFS|__GFP_NOFAIL, 0); if (err) { ext4_error_err(sb, -err, "Error %d loading buddy information for %u", err, group); @@ -4654,7 +4656,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb, ext4_mb_release_group_pa(&e4b, pa); ext4_unlock_group(sb, group); - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); list_del(&pa->u.pa_tmp_list); call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback); } @@ -5241,8 +5243,8 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode, trace_ext4_mballoc_free(sb, inode, block_group, bit, count_clusters); /* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. 
*/ - err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b, - GFP_NOFS|__GFP_NOFAIL); + err = ext4_mb_load_allocator_gfp(sb, block_group, &e4b, + GFP_NOFS|__GFP_NOFAIL, 0); if (err) goto error_return; @@ -5316,7 +5318,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode, count_clusters); } - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); /* We dirtied the bitmap block */ BUFFER_TRACE(bitmap_bh, "dirtied bitmap block"); @@ -5434,7 +5436,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb, } } - err = ext4_mb_load_buddy(sb, block_group, &e4b); + err = ext4_mb_load_allocator(sb, block_group, &e4b, 0); if (err) goto error_return; @@ -5462,7 +5464,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb, flex_group)->free_clusters); } - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); /* We dirtied the bitmap block */ BUFFER_TRACE(bitmap_bh, "dirtied bitmap block"); @@ -5550,7 +5552,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group, trace_ext4_trim_all_free(sb, group, start, max); - ret = ext4_mb_load_buddy(sb, group, &e4b); + ret = ext4_mb_load_allocator(sb, group, &e4b, 0); if (ret) { ext4_warning(sb, "Error %d loading buddy information for %u", ret, group); @@ -5604,7 +5606,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group, } out: ext4_unlock_group(sb, group); - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); ext4_debug("trimmed %d blocks in the group %d\n", count, group); @@ -5718,7 +5720,7 @@ ext4_mballoc_query_range( struct ext4_buddy e4b; int error; - error = ext4_mb_load_buddy(sb, group, &e4b); + error = ext4_mb_load_allocator(sb, group, &e4b, 0); if (error) return error; bitmap = e4b.bd_bitmap; @@ -5747,7 +5749,7 @@ ext4_mballoc_query_range( ext4_unlock_group(sb, group); out_unload: - ext4_mb_unload_buddy(&e4b); + ext4_mb_unload_allocator(&e4b); return error; } From patchwork Fri Aug 21 01:55:17 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: harshad shirwadkar X-Patchwork-Id: 1348756 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=A7Q7QiD3; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BXl3X06Fsz9sTM for ; Fri, 21 Aug 2020 11:55:52 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727063AbgHUBzu (ORCPT ); Thu, 20 Aug 2020 21:55:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40138 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726978AbgHUBzf (ORCPT ); Thu, 20 Aug 2020 21:55:35 -0400 Received: from mail-pf1-x442.google.com (mail-pf1-x442.google.com [IPv6:2607:f8b0:4864:20::442]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 03864C061386 for ; Thu, 20 Aug 2020 18:55:35 -0700 (PDT) Received: by mail-pf1-x442.google.com with SMTP id u20so304162pfn.0 for ; Thu, 20 Aug 2020 18:55:34 -0700 (PDT) DKIM-Signature: v=1; 
From: Harshad Shirwadkar
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, lyx1209@gmail.com, Harshad Shirwadkar
Subject: [RFC PATCH v2 3/9] ext4: add freespace tree allocator
Date: Thu, 20 Aug 2020 18:55:17 -0700
Message-Id: <20200821015523.1698374-4-harshads@google.com>
In-Reply-To: <20200821015523.1698374-1-harshads@google.com>
References: <20200821015523.1698374-1-harshads@google.com>
X-Mailing-List: linux-ext4@vger.kernel.org

From: Yuexin Li

This patch adds a new freespace tree allocator. The allocator organizes
the free space regions of each flex_bg in two in-memory rbtrees, one
sorted by length and the other by offset. These per-flex-bg trees let us
quickly scan the length-sorted free space regions and pick the best
region for a given request. The feature is enabled by passing the
"-o freespace_tree" mount option.
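As a rough, drastically simplified illustration of the length-sorted scan
described above (not part of the patch; plain sorted arrays stand in for the
kernel's rbtrees and every name below is invented), the userspace toy walks
the extents from longest to shortest and remembers the last one that can
still satisfy the request, i.e. the tightest adequate fit.

/*
 * Toy model only -- userspace, not the kernel code. Each free extent of a
 * flex_bg is kept in a view sorted by decreasing length (the kernel keeps
 * an rbtree, frsp_len_root, for this purpose).
 */
#include <stdio.h>
#include <stdlib.h>

struct extent {			/* loosely mirrors ext4_frsp_node */
	unsigned int offset;	/* offset within the flex_bg, in blocks */
	unsigned int len;	/* length in blocks */
};

static int by_len_desc(const void *a, const void *b)
{
	const struct extent *x = a, *y = b;

	return (x->len < y->len) - (x->len > y->len);	/* longest first */
}

static const struct extent *find_fit(struct extent *v, size_t n,
				     unsigned int want)
{
	const struct extent *best = NULL;

	qsort(v, n, sizeof(*v), by_len_desc);
	for (size_t i = 0; i < n && v[i].len >= want; i++)
		best = &v[i];	/* keep walking: later entries fit tighter */
	return best;		/* NULL means "relax the allocation criteria" */
}

int main(void)
{
	struct extent free_space[] = { { 0, 8 }, { 100, 512 }, { 700, 64 } };
	const struct extent *e = find_fit(free_space, 3, 48);

	if (e)
		printf("allocate 48 blocks at flex offset %u (extent len %u)\n",
		       e->offset, e->len);
	return 0;
}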
Signed-off-by: Yuexin Li Co-developed-by: Harshad Shirwadkar Signed-off-by: Harshad Shirwadkar --- fs/ext4/ext4.h | 45 +++ fs/ext4/mballoc.c | 984 ++++++++++++++++++++++++++++++++++++++++++++-- fs/ext4/mballoc.h | 61 ++- fs/ext4/resize.c | 8 + fs/ext4/super.c | 35 +- 5 files changed, 1084 insertions(+), 49 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 523e00d7b392..513c8473f45f 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -363,6 +363,7 @@ struct flex_groups { atomic64_t free_clusters; atomic_t free_inodes; atomic_t used_dirs; + struct ext4_frsp_tree *frsp_tree; }; #define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */ @@ -1214,6 +1215,12 @@ struct ext4_inode_info { #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM 0x00000008 /* User explicitly specified journal checksum */ + +#define EXT4_MOUNT2_FREESPACE_TREE 0x00000020 /* Enable freespace tree + * allocator (turns buddy + * allocator off) + */ + #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \ ~EXT4_MOUNT_##opt #define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \ @@ -1419,6 +1426,30 @@ struct ext4_super_block { #define ext4_has_strict_mode(sbi) \ (sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL) +/* + * Freespace tree structure + */ +struct ext4_frsp_tree { + struct rb_root_cached frsp_offset_root; /* Root for offset + * sorted tree + */ + struct rb_root_cached frsp_len_root; /* Root for Length + * sorted tree + */ + struct mutex frsp_lock; + __u8 frsp_flags; + __u32 frsp_max_free_len; /* Max free extent in + * this flex bg + */ + __u32 frsp_index; /* Tree index (flex bg + * number) + */ +}; + +/* Freespace tree flags */ + +/* Tree is loaded in memory */ +#define EXT4_MB_FRSP_FLAG_LOADED 0x0001 /* * fourth extended-fs super-block data in memory */ @@ -1557,6 +1588,9 @@ struct ext4_sb_info { struct flex_groups * __rcu *s_flex_groups; ext4_group_t s_flex_groups_allocated; + /* freespace trees stuff */ + int s_mb_num_frsp_trees; + /* workqueue for reserved extent conversions (buffered io) */ struct workqueue_struct *rsv_conversion_wq; @@ -2676,6 +2710,7 @@ extern int ext4_init_inode_table(struct super_block *sb, extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate); /* mballoc.c */ +extern int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups); extern const struct seq_operations ext4_mb_seq_groups_ops; extern long ext4_mb_stats; extern long ext4_mb_max_to_scan; @@ -3100,6 +3135,16 @@ static inline ext4_group_t ext4_flex_group(struct ext4_sb_info *sbi, return block_group >> sbi->s_log_groups_per_flex; } +static inline struct ext4_frsp_tree * +ext4_get_frsp_tree(struct super_block *sb, ext4_group_t flex_bg) +{ + struct flex_groups *fg; + + fg = sbi_array_rcu_deref(EXT4_SB(sb), s_flex_groups, flex_bg); + + return fg->frsp_tree; +} + static inline unsigned int ext4_flex_bg_size(struct ext4_sb_info *sbi) { return 1 << sbi->s_log_groups_per_flex; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 48d791304bf1..4b1c405543f0 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -333,6 +333,7 @@ static struct kmem_cache *ext4_pspace_cachep; static struct kmem_cache *ext4_ac_cachep; static struct kmem_cache *ext4_free_data_cachep; +static struct kmem_cache *ext4_freespace_node_cachep; /* We create slab caches for groupinfo data structures based on the * superblock block size. 
There will be one per mounted filesystem for @@ -352,6 +353,11 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap, ext4_group_t group); static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac); +static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group, + struct ext4_buddy *e4b, int flags); +static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex); +static void ext4_mb_unload_allocator(struct ext4_buddy *e4b); + /* * The algorithm using this percpu seq counter goes below: * 1. We sample the percpu discard_pa_seq counter before trying for block @@ -395,6 +401,33 @@ static inline void *mb_correct_addr_and_bit(int *bit, void *addr) return addr; } +static inline int ext4_blkno_to_flex_offset(struct super_block *sb, + ext4_fsblk_t blkno) +{ + return blkno % (ext4_flex_bg_size(EXT4_SB(sb)) * + EXT4_SB(sb)->s_blocks_per_group); +} + +static inline ext4_fsblk_t ext4_flex_offset_to_blkno(struct super_block *sb, + int flex_bg, int flex_offset) +{ + return flex_bg * ext4_flex_bg_size(EXT4_SB(sb)) * + EXT4_SB(sb)->s_blocks_per_group + flex_offset; +} + +static inline int ext4_mb_frsp_on(struct super_block *sb) +{ + return test_opt2(sb, FREESPACE_TREE) && + EXT4_SB(sb)->s_es->s_log_groups_per_flex; +} + +static inline unsigned int ext4_num_grps_to_flexbg(struct super_block *sb, + int ngroups) +{ + return (ngroups + ext4_flex_bg_size(EXT4_SB(sb)) - 1) >> + (EXT4_SB(sb)->s_es->s_log_groups_per_flex); +} + static inline int mb_test_bit(int bit, void *addr) { /* @@ -453,6 +486,7 @@ static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) { char *bb; + WARN_ON(ext4_mb_frsp_on(e4b->bd_sb)); BUG_ON(e4b->bd_bitmap == e4b->bd_buddy); BUG_ON(max == NULL); @@ -620,6 +654,9 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file, void *buddy; void *buddy2; + if (ext4_mb_frsp_on(sb)) + return 0; + { static int mb_check_counter; if (mb_check_counter++ % 100 != 0) @@ -706,6 +743,729 @@ static int __mb_check_buddy(struct ext4_buddy *e4b, char *file, #define mb_check_buddy(e4b) #endif +/* Generic macro for inserting a new node in cached rbtree */ +#define ext4_mb_rb_insert(__root, __new, __node_t, __entry, __cmp) do { \ + struct rb_node **iter = &((__root)->rb_root.rb_node), *parent = NULL; \ + bool leftmost = true; \ + __node_t *this = NULL; \ + \ + while (*iter) { \ + this = rb_entry(*iter, __node_t, __entry); \ + parent = *iter; \ + if (__cmp((__new), this)) { \ + iter = &((*iter)->rb_left); \ + } else { \ + iter = &((*iter)->rb_right); \ + leftmost = false; \ + } \ + } \ + rb_link_node(&(__new)->__entry, parent, iter); \ + rb_insert_color_cached(&(__new)->__entry, __root, leftmost); \ +} while (0) + +static inline int ext4_mb_frsp_offset_cmp(struct ext4_frsp_node *arg1, + struct ext4_frsp_node *arg2) +{ + return (arg1->frsp_offset < arg2->frsp_offset); +} + +/* + * Free space extents length sorting function, the nodes are sorted in + * decreasing order of length + */ +static inline int ext4_mb_frsp_len_cmp(struct ext4_frsp_node *arg1, + struct ext4_frsp_node *arg2) +{ + return (arg1->frsp_len > arg2->frsp_len); +} + +/* insert to offset-indexed tree */ +static void ext4_mb_frsp_insert_off(struct ext4_frsp_tree *tree, + struct ext4_frsp_node *new_entry) +{ + memset(&new_entry->frsp_node, 0, sizeof(new_entry->frsp_node)); + ext4_mb_rb_insert(&tree->frsp_offset_root, new_entry, + struct ext4_frsp_node, frsp_node, ext4_mb_frsp_offset_cmp); +} + +/* insert to tree sorted by length */ +static void 
ext4_mb_frsp_insert_len(struct ext4_frsp_tree *tree, + struct ext4_frsp_node *new_entry) +{ + memset(&new_entry->frsp_len_node, 0, sizeof(new_entry->frsp_len_node)); + ext4_mb_rb_insert(&tree->frsp_len_root, new_entry, + struct ext4_frsp_node, frsp_len_node, + ext4_mb_frsp_len_cmp); +} + +#ifdef CONFIG_EXT4_DEBUG +/* print freespace_tree in pre-order traversal */ +void ext4_mb_frsp_print_tree_len(struct super_block *sb, + struct ext4_frsp_tree *tree) +{ + unsigned int count = 0; + ext4_fsblk_t blk = 0, blk_end = 0; + ext4_group_t group = 0, group_end = 0; + struct ext4_frsp_node *entry = NULL; + struct ext4_sb_info *sbi = EXT4_SB(sb); + struct rb_node *cur; + + cur = rb_first_cached(&tree->frsp_len_root); + mb_debug(sb, "\nTree\tNode\tlength\tblock\tgroup\n"); + + while (cur) { + entry = rb_entry(cur, struct ext4_frsp_node, frsp_len_node); + count++; + blk = ext4_flex_offset_to_blkno(sb, tree->frsp_index, + entry->frsp_offset); + blk_end = ext4_flex_offset_to_blkno(sb, tree->frsp_index, + entry->frsp_offset + entry->frsp_len - 1); + group = blk / sbi->s_blocks_per_group; + group_end = (blk_end-1) / sbi->s_blocks_per_group; + mb_debug(sb, "%u\t%u\t%u\t%u(%lld)--%llu\t%u--%u\n", + tree->frsp_index, count, entry->frsp_len, + entry->frsp_offset, blk, blk_end, group, group_end); + cur = rb_next(cur); + } +} +#endif + +static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb) +{ + struct ext4_frsp_node *node; + + node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS); + if (!node) + return NULL; + + RB_CLEAR_NODE(&node->frsp_node); + RB_CLEAR_NODE(&node->frsp_len_node); + + return node; +} + +static void ext4_mb_frsp_free_node(struct super_block *sb, + struct ext4_frsp_node *node) +{ + kmem_cache_free(ext4_freespace_node_cachep, node); +} + +/* Evict a tree from memory */ +void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree) +{ + struct ext4_frsp_node *frsp_node; + struct rb_node *node; + + mb_debug(sb, "Evicting tree %d\n", tree->frsp_index); + mutex_lock(&tree->frsp_lock); + if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED)) + goto out; + + node = rb_first_cached(&tree->frsp_offset_root); + while (node) { + frsp_node = rb_entry(node, struct ext4_frsp_node, frsp_node); + rb_erase_cached(node, &tree->frsp_offset_root); + rb_erase_cached(&frsp_node->frsp_len_node, + &tree->frsp_len_root); + ext4_mb_frsp_free_node(sb, frsp_node); + node = rb_first_cached(&tree->frsp_offset_root); + } + tree->frsp_flags &= ~EXT4_MB_FRSP_FLAG_LOADED; + tree->frsp_offset_root = RB_ROOT_CACHED; + tree->frsp_len_root = RB_ROOT_CACHED; +out: + mutex_unlock(&tree->frsp_lock); +} + +/* + * Search tree by flexbg offset. Returns the node containing "target". If + * no such node is present, returns NULL. Must be called with tree mutex held. + */ +struct ext4_frsp_node *ext4_mb_frsp_search_by_off(struct super_block *sb, + struct ext4_frsp_tree *tree, + ext4_grpblk_t target) +{ + struct rb_root *root = &tree->frsp_offset_root.rb_root; + struct rb_node *node = root->rb_node; + struct ext4_frsp_node *this = NULL; + + while (node) { + this = rb_entry(node, struct ext4_frsp_node, frsp_node); + if (this->frsp_offset == target) + return this; + else if (target < this->frsp_offset) + node = node->rb_left; + else + node = node->rb_right; + } + return NULL; +} + +/* + * Check if two entries in freespace tree can be merged together. Nodes + * can merge together only if they are physically contiguous and belong + * to the same block group. 
+ */ +int ext4_mb_frsp_node_can_merge(struct super_block *sb, + struct ext4_frsp_node *prev_entry, struct ext4_frsp_node *cur_entry) +{ + if (prev_entry->frsp_offset + prev_entry->frsp_len != + cur_entry->frsp_offset) + return 0; + if (prev_entry->frsp_offset / EXT4_SB(sb)->s_blocks_per_group != + cur_entry->frsp_offset / EXT4_SB(sb)->s_blocks_per_group) + return 0; + return 1; +} + +/* + * Add new free space region to tree. We insert the new node in both, the + * length sorted and the flex-bg offset sorted trees. Must be called with + * tree mutex held. + */ +int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree, + ext4_grpblk_t offset, ext4_grpblk_t len) +{ + struct ext4_frsp_node *new_entry = NULL, *next_entry = NULL, + *prev_entry = NULL; + struct rb_node *left = NULL, *right = NULL; + + new_entry = ext4_mb_frsp_alloc_node(sb); + if (!new_entry) + return -ENOMEM; + + new_entry->frsp_offset = offset; + new_entry->frsp_len = len; + + ext4_mb_frsp_insert_off(tree, new_entry); + /* try merge to left and right */ + /* left */ + left = rb_prev(&new_entry->frsp_node); + if (left) { + prev_entry = rb_entry(left, struct ext4_frsp_node, frsp_node); + if (ext4_mb_frsp_node_can_merge(sb, prev_entry, new_entry)) { + new_entry->frsp_offset = prev_entry->frsp_offset; + new_entry->frsp_len += prev_entry->frsp_len; + rb_erase_cached(left, &tree->frsp_offset_root); + rb_erase_cached(&prev_entry->frsp_len_node, + &tree->frsp_len_root); + ext4_mb_frsp_free_node(sb, prev_entry); + } + } + + /* right */ + right = rb_next(&new_entry->frsp_node); + if (right) { + next_entry = rb_entry(right, struct ext4_frsp_node, frsp_node); + if (ext4_mb_frsp_node_can_merge(sb, new_entry, next_entry)) { + new_entry->frsp_len += next_entry->frsp_len; + rb_erase_cached(right, &tree->frsp_offset_root); + rb_erase_cached(&next_entry->frsp_len_node, + &tree->frsp_len_root); + ext4_mb_frsp_free_node(sb, next_entry); + } + } + ext4_mb_frsp_insert_len(tree, new_entry); + + return 0; +} + +/* + * Free length number of blocks starting at block number block in free space + * tree. + */ +int ext4_mb_frsp_free_blocks(struct super_block *sb, ext4_group_t group, + ext4_grpblk_t block, ext4_grpblk_t length) +{ + struct ext4_sb_info *sbi = EXT4_SB(sb); + struct ext4_frsp_tree *tree = + ext4_get_frsp_tree(sb, ext4_flex_group(sbi, group)); + int err = 0; + + mutex_lock(&tree->frsp_lock); + err = ext4_mb_frsp_add_region(sb, tree, + ext4_blkno_to_flex_offset(sb, block), length); + mutex_unlock(&tree->frsp_lock); + + return err; +} + +/* + * Create freespace tree from on-disk bitmap. Must be called with tree mutex + * held. Returns 0 on success, error otherwise. + */ +int ext4_mb_frsp_bb_to_tree(struct ext4_buddy *e4b, ext4_group_t group, + struct buffer_head *bh) +{ + struct super_block *sb = e4b->bd_sb; + int length = 0, bit = 0, next; + int end = EXT4_SB(sb)->s_blocks_per_group; + ext4_fsblk_t block; + int ret = 0; + + /* find all unused blocks in bitmap, convert them to new tree node */ + while (bit < end) { + bit = mb_find_next_zero_bit(bh->b_data, end, bit); + if (bit >= end) + break; + + next = mb_find_next_bit(bh->b_data, end, bit); + length = next - bit; + block = ext4_group_first_block_no(sb, group) + bit; + + ret = ext4_mb_frsp_add_region(sb, e4b->frsp_tree, + ext4_blkno_to_flex_offset(sb, block), length); + if (ret) + break; + bit = next + 1; + } + + return ret; +} + +/* + * Load one freespace_tree from disk. Assume holding mutex lock on tree. 
+ */ +int ext4_mb_frsp_load(struct ext4_buddy *e4b) +{ + ext4_group_t ngroups, group_start, group_end, grp; + struct ext4_sb_info *sbi = EXT4_SB(e4b->bd_sb); + int i, ret = 0; + struct buffer_head **bh = NULL; + + if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED) + return 0; + + group_start = e4b->frsp_tree->frsp_index * ext4_flex_bg_size(sbi); + group_end = min(group_start + ext4_flex_bg_size(sbi), + ext4_get_groups_count(e4b->bd_sb)) - 1; + ngroups = group_end - group_start + 1; + + bh = kcalloc(ngroups, sizeof(*bh), GFP_KERNEL); + if (!bh) + return -ENOMEM; + for (i = 0; i < ngroups; i++) { + grp = i + group_start; + bh[i] = ext4_read_block_bitmap_nowait(e4b->bd_sb, grp, false); + if (IS_ERR(bh[i])) { + ret = PTR_ERR(bh[i]); + goto out; + } + } + for (i = 0; i < ngroups; i++) { + grp = i + group_start; + if (!bitmap_uptodate(bh[i])) { + ret = ext4_wait_block_bitmap(e4b->bd_sb, grp, bh[i]); + if (ret) + goto out; + } + ret = ext4_mb_frsp_bb_to_tree(e4b, grp, bh[i]); + if (ret) + goto out; + } + e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_LOADED; +out: + for (i = 0; i < ngroups; i++) { + if (!IS_ERR_OR_NULL(bh[i])) + put_bh(bh[i]); + if (!ret && IS_ERR(bh[i])) + ret = PTR_ERR(bh[i]); + } + kfree(bh); + return ret; + +} + +/* + * Determine if node with length len is better than what we have found until + * now. Return 1 if that is the case, 0 otherwise. If len is exact match, + * set *best = 1. + */ +static int ext4_mb_frsp_is_better(struct ext4_allocation_context *ac, + int len, int *best) +{ + struct ext4_tree_extent *btx = &ac->ac_b_tree_ex; + struct ext4_free_extent *gex = &ac->ac_g_ex; + + if (best) + *best = 0; + if (len == gex->fe_len) { + if (best) + *best = 1; + return 1; + } + if (ac->ac_criteria == 0 && len < gex->fe_len) + return 0; + /* + * See if the new node is cloer to goal length than + * the best extent found so far + */ + if (btx->te_len < gex->fe_len && len > btx->te_len) + return 1; + if (len > gex->fe_len && len < btx->te_len) + return 1; + return 0; +} + +/* + * check if we have scanned sufficient freespace candidates + * stop scanning if reached/exceeded s_max_to_scan + */ +static void ext4_mb_frsp_check_limits(struct ext4_allocation_context *ac) +{ + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); + + if (ac->ac_status == AC_STATUS_FOUND) + return; + + /* + * Exceeded max number of nodes to scan + */ + if (ac->ac_found > sbi->s_mb_max_to_scan && + !(ac->ac_flags & EXT4_MB_HINT_FIRST)) + ac->ac_status = AC_STATUS_BREAK; +} + +/* + * Mark free space in selected tree node as used and update the tree. + * This must be called with tree lock. 
+ */ +static void ext4_mb_frsp_use_best_found(struct ext4_allocation_context *ac, + ext4_group_t flex, struct ext4_frsp_node *selected) +{ + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); + struct ext4_tree_extent *btx = &ac->ac_b_tree_ex; + struct ext4_free_extent *bex; + unsigned int group_no; + struct ext4_buddy e4b; + + WARN_ON(ac->ac_status == AC_STATUS_FOUND); + btx->te_len = min(btx->te_len, ac->ac_g_ex.fe_len); + ac->ac_status = AC_STATUS_FOUND; + + /* update ac best-found-extent */ + bex = &ac->ac_b_ex; + group_no = (btx->te_flex * ext4_flex_bg_size(sbi)) + + (btx->te_flex_start / sbi->s_blocks_per_group); + bex->fe_start = btx->te_flex_start % sbi->s_blocks_per_group; + bex->fe_group = group_no; + bex->fe_len = btx->te_len; + bex->fe_node = selected; + + /* Add free blocks to our tree */ + ext4_mb_load_allocator(ac->ac_sb, group_no, &e4b, + EXT4_ALLOCATOR_FRSP_NOLOAD); + mb_mark_used(&e4b, bex); + ext4_mb_unload_allocator(&e4b); +} +/* + * The routine checks whether found tree node is good enough. If it is, + * then the tree node gets marked used and flag is set to the context + * to stop scanning. + * + * Otherwise, the tree node is compared with the previous found extent and + * if new one is better, then it's stored in the context. + * + * Later, the best found tree node will be used, if mballoc can't find good + * enough extent. + */ +static int ext4_mb_frsp_measure_node(struct ext4_allocation_context *ac, + int tree_idx, struct ext4_frsp_node *cur, + int goal) +{ + struct ext4_tree_extent *btx = &ac->ac_b_tree_ex; + ext4_fsblk_t start; + int best_found = 0, max; + + WARN_ON(btx->te_len < 0); + WARN_ON(ac->ac_status != AC_STATUS_CONTINUE); + + if (goal) { + start = ext4_group_first_block_no(ac->ac_sb, + ac->ac_g_ex.fe_group) + + ac->ac_g_ex.fe_start; + if (cur->frsp_offset > ext4_blkno_to_flex_offset(ac->ac_sb, + start)) + return 0; + max = cur->frsp_offset + cur->frsp_len - + ext4_blkno_to_flex_offset(ac->ac_sb, start); + if (max >= ac->ac_g_ex.fe_len && + ac->ac_g_ex.fe_len == EXT4_SB(ac->ac_sb)->s_stripe) { + if (do_div(start, EXT4_SB(ac->ac_sb)->s_stripe) != 0) + return 0; + best_found = 1; + } else if (max >= ac->ac_g_ex.fe_len) { + best_found = 1; + } else if (max > 0 && (ac->ac_flags & EXT4_MB_HINT_MERGE)) { + /* + * Sometimes, caller may want to merge even small + * number of blocks to an existing extent + */ + best_found = 1; + + } else { + return 0; + } + ac->ac_found++; + goto use_extent; + } + ac->ac_found++; + + /* The special case - take what you catch first */ + if (unlikely(ac->ac_flags & EXT4_MB_HINT_FIRST)) { + best_found = 1; + goto use_extent; + } + + /* + * Prefer not allocating blocks in first group in flex bg for data + * blocks. 
+ */ + if (unlikely((ac->ac_criteria < 2) && + (ac->ac_flags & EXT4_MB_HINT_DATA) && + (cur->frsp_offset < EXT4_BLOCKS_PER_GROUP(ac->ac_sb)))) + return 1; + + + /* If this is first found extent, just store it in the context */ + if (btx->te_len == 0) + goto use_extent; + + if (ext4_mb_frsp_is_better(ac, cur->frsp_len, &best_found)) + goto use_extent; + + ext4_mb_frsp_check_limits(ac); + return 0; + +use_extent: + btx->te_flex = tree_idx; + if (goal) { + btx->te_flex_start = ext4_blkno_to_flex_offset(ac->ac_sb, + start); + btx->te_len = max; + } else { + btx->te_flex_start = cur->frsp_offset; + btx->te_len = cur->frsp_len; + } + if (best_found) + ext4_mb_frsp_use_best_found(ac, tree_idx, cur); + ext4_mb_frsp_check_limits(ac); + return 1; +} + +/* Search by goal */ +static noinline_for_stack +void ext4_mb_frsp_find_by_goal(struct ext4_allocation_context *ac) +{ + struct ext4_frsp_node *cur = NULL; + unsigned int tree_blk; + ext4_fsblk_t blk; + struct ext4_buddy e4b; + int ret; + + if (!(ac->ac_flags & EXT4_MB_HINT_TRY_GOAL)) + return; + + /* compute start node offset in tree */ + blk = ext4_group_first_block_no(ac->ac_sb, ac->ac_g_ex.fe_group) + + ac->ac_g_ex.fe_start; + tree_blk = ext4_blkno_to_flex_offset(ac->ac_sb, blk); + + ret = ext4_mb_load_allocator(ac->ac_sb, ac->ac_g_ex.fe_group, &e4b, 0); + if (ret) + return; + + /* try goal block and its freespace_tree first */ + mutex_lock(&e4b.frsp_tree->frsp_lock); + cur = ext4_mb_frsp_search_by_off(ac->ac_sb, e4b.frsp_tree, tree_blk); + if (cur && tree_blk < cur->frsp_offset + cur->frsp_len - 1) + ext4_mb_frsp_measure_node(ac, e4b.frsp_tree->frsp_index, cur, + 1 /* searching by goal */); + + mutex_unlock(&e4b.frsp_tree->frsp_lock); + ext4_mb_unload_allocator(&e4b); +} + +static void ext4_mb_frsp_process(struct ext4_allocation_context *ac, + struct ext4_frsp_tree *tree) +{ + struct ext4_frsp_node *cur = NULL; + struct rb_node *node = NULL; + int ret; + + node = rb_first_cached(&tree->frsp_len_root); + mb_debug(ac->ac_sb, "Processing tree index %d, flags = %x\n", + tree->frsp_index, tree->frsp_flags); + /* + * In order to serve the "No data blocks in first group in a flexbg" + * requirement, we cannot do a O(Log N) search here. TODO: it's possible + * to that at CR=2, where that requirement doesn't apply. 
+ */ + while (node && ac->ac_status == AC_STATUS_CONTINUE) { + cur = rb_entry(node, struct ext4_frsp_node, frsp_len_node); + if (ac->ac_criteria == 0 && cur->frsp_len < ac->ac_g_ex.fe_len) + return; + ret = ext4_mb_frsp_measure_node(ac, tree->frsp_index, cur, + 0 /* searching by len */); + if (ret == 0) + return; + node = rb_next(node); + } +} + +/* allocator for freespace_tree */ +static noinline_for_stack int +ext4_mb_tree_allocator(struct ext4_allocation_context *ac) +{ + struct ext4_buddy e4b; + struct ext4_sb_info *sbi; + struct super_block *sb; + struct ext4_frsp_tree *tree = NULL; + struct ext4_frsp_node *cur = NULL; + struct ext4_tree_extent *btx = NULL; + int ret = 0, start_idx = 0, tree_idx, j; + + sb = ac->ac_sb; + btx = &ac->ac_b_tree_ex; + sbi = EXT4_SB(sb); + + start_idx = ext4_flex_group(sbi, ac->ac_g_ex.fe_group); + mb_debug(sb, "requested size %d\n", ac->ac_g_ex.fe_len); + /* First try searching from goal blk in start-idx-th freespace tree */ + ext4_mb_frsp_find_by_goal(ac); + if (ac->ac_status == AC_STATUS_FOUND) + goto out; + + if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY)) + goto out; + + ac->ac_criteria = 0; + ac->ac_groups_scanned = 0; +repeat: + + /* Loop through the rest of trees (flex_bg) */ + for (j = start_idx; + (ac->ac_groups_scanned < sbi->s_mb_num_frsp_trees) && + ac->ac_status == AC_STATUS_CONTINUE; j++) { + ac->ac_groups_scanned++; + tree_idx = (j % sbi->s_mb_num_frsp_trees); + + ret = ext4_mb_load_allocator(sb, + tree_idx * ext4_flex_bg_size(sbi), &e4b, 0); + if (ret) + goto out; + mutex_lock(&e4b.frsp_tree->frsp_lock); + ext4_mb_frsp_process(ac, e4b.frsp_tree); + mutex_unlock(&e4b.frsp_tree->frsp_lock); + ext4_mb_unload_allocator(&e4b); + } + + if (ac->ac_status != AC_STATUS_FOUND) { + if (ac->ac_criteria < 2) { + ac->ac_criteria++; + ac->ac_groups_scanned = 0; + mb_debug(sb, "Falling back to CR=%d", ac->ac_criteria); + goto repeat; + } + if (btx->te_len > 0 && !(ac->ac_flags & EXT4_MB_HINT_FIRST)) { + e4b.frsp_flags = 0; + ret = ext4_mb_load_allocator(sb, btx->te_flex * + ext4_flex_bg_size(sbi), &e4b, 0); + if (ret) + goto out; + tree = e4b.frsp_tree; + mutex_lock(&tree->frsp_lock); + cur = ext4_mb_frsp_search_by_off(sb, tree, + btx->te_flex_start); + if (cur) { + ext4_mb_frsp_use_best_found(ac, btx->te_flex, + cur); + mutex_unlock(&tree->frsp_lock); + ext4_mb_unload_allocator(&e4b); + + } else { + /* + * Someone else took this freespace node before + * us. Reset the best-found tree extent, and + * turn on FIRST HINT (greedy). + */ + mutex_unlock(&tree->frsp_lock); + ac->ac_b_tree_ex.te_flex_start = 0; + ac->ac_b_tree_ex.te_flex = 0; + ac->ac_b_tree_ex.te_len = 0; + ac->ac_status = AC_STATUS_CONTINUE; + ac->ac_flags |= EXT4_MB_HINT_FIRST; + ac->ac_groups_scanned = 0; + ext4_mb_unload_allocator(&e4b); + goto repeat; + } + } + } + + ret = btx->te_len ? 
0 : -ENOSPC; + +out: + return ret; +} + +static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index) +{ + tree->frsp_offset_root = RB_ROOT_CACHED; + tree->frsp_len_root = RB_ROOT_CACHED; + mutex_init(&(tree->frsp_lock)); + tree->frsp_flags = 0; + tree->frsp_index = index; +} + +int ext4_mb_init_freespace_trees(struct super_block *sb) +{ + struct ext4_sb_info *sbi = EXT4_SB(sb); + struct flex_groups *fg; + int i; + + sbi->s_mb_num_frsp_trees = + ext4_num_grps_to_flexbg(sb, ext4_get_groups_count(sb)); + + for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) { + fg = sbi_array_rcu_deref(sbi, s_flex_groups, i); + fg->frsp_tree = kzalloc(sizeof(struct ext4_frsp_tree), + GFP_KERNEL); + if (!fg->frsp_tree) + return -ENOMEM; + ext4_mb_frsp_init_tree(fg->frsp_tree, i); + } + + return 0; +} + +int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups) +{ + struct ext4_sb_info *sbi = EXT4_SB(sb); + int flex_bg_count, old_trees_count = sbi->s_mb_num_frsp_trees; + int i; + + if (!ext4_mb_frsp_on(sb)) + return 0; + + flex_bg_count = ext4_num_grps_to_flexbg(sb, ngroups); + if (old_trees_count > 0) + ext4_mb_frsp_free_tree(sb, ext4_get_frsp_tree(sb, + old_trees_count - 1)); + + for (i = old_trees_count; i < flex_bg_count; i++) { + struct flex_groups *fg; + + fg = sbi_array_rcu_deref(sbi, s_flex_groups, i); + fg->frsp_tree = kzalloc(sizeof(*fg->frsp_tree), GFP_KERNEL); + if (!fg->frsp_tree) + return -ENOMEM; + ext4_mb_frsp_init_tree(fg->frsp_tree, i); + } + sbi->s_mb_num_frsp_trees = flex_bg_count; + + return 0; +} + /* * Divide blocks started from @first with length @len into * smaller chunks with power of 2 blocks. @@ -1042,6 +1802,9 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb, e4b->bd_buddy_page = NULL; e4b->bd_bitmap_page = NULL; + if (ext4_mb_frsp_on(sb)) + return 0; + blocks_per_page = PAGE_SIZE / sb->s_blocksize; /* * the buddy cache inode stores the block bitmap @@ -1099,6 +1862,7 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp) struct page *page; int ret = 0; + WARN_ON(ext4_mb_frsp_on(sb)); might_sleep(); mb_debug(sb, "init group %u\n", group); this_grp = ext4_get_group_info(sb, group); @@ -1156,6 +1920,11 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group, gfp_t gfp) * Locking note: This routine calls ext4_mb_init_cache(), which takes the * block group lock of all groups for this page; do not hold the BG lock when * calling this routine! + * + * For freespace trees, do not hold tree mutex while calling this routine. + * It is okay to hold that mutex only if EXT4_ALLOCATOR_FRSP_NOLOAD flag is + * set in e4b->frsp_flags. If that flag is set, this function doesn't try + * to load an unloaded tree. 
*/ static noinline_for_stack int ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group, @@ -1166,7 +1935,7 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group, int pnum; int poff; struct page *page; - int ret; + int ret = 0; struct ext4_group_info *grp; struct ext4_sb_info *sbi = EXT4_SB(sb); struct inode *inode = sbi->s_buddy_cache; @@ -1183,6 +1952,22 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group, e4b->bd_group = group; e4b->bd_buddy_page = NULL; e4b->bd_bitmap_page = NULL; + if (ext4_mb_frsp_on(sb)) { + e4b->frsp_tree = ext4_get_frsp_tree(sb, + ext4_flex_group(sbi, group)); + e4b->frsp_flags = flags; + if (flags & EXT4_ALLOCATOR_FRSP_NOLOAD) + return 0; + + mutex_lock(&e4b->frsp_tree->frsp_lock); + if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED) { + mutex_unlock(&e4b->frsp_tree->frsp_lock); + return 0; + } + ret = ext4_mb_frsp_load(e4b); + mutex_unlock(&e4b->frsp_tree->frsp_lock); + return ret; + } if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { /* @@ -1302,6 +2087,8 @@ static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group, static void ext4_mb_unload_allocator(struct ext4_buddy *e4b) { + if (ext4_mb_frsp_on(e4b->bd_sb)) + return; if (e4b->bd_bitmap_page) put_page(e4b->bd_bitmap_page); if (e4b->bd_buddy_page) @@ -1494,6 +2281,16 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b, if (first < e4b->bd_info->bb_first_free) e4b->bd_info->bb_first_free = first; + if (ext4_mb_frsp_on(sb)) { + ext4_grpblk_t first_blk = EXT4_C2B(EXT4_SB(sb), first) + + ext4_group_first_block_no(sb, e4b->bd_group); + + ext4_unlock_group(sb, e4b->bd_group); + ext4_mb_frsp_free_blocks(sb, e4b->bd_group, first_blk, count); + ext4_lock_group(sb, e4b->bd_group); + return; + } + /* access memory sequentially: check left neighbour, * clear range and then check right neighbour */ @@ -1560,6 +2357,9 @@ static int mb_find_extent(struct ext4_buddy *e4b, int block, assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group)); BUG_ON(ex == NULL); + if (ext4_mb_frsp_on(e4b->bd_sb)) + return 0; + buddy = mb_find_buddy(e4b, 0, &max); BUG_ON(buddy == NULL); BUG_ON(block >= max); @@ -1624,17 +2424,74 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex) unsigned ret = 0; int len0 = len; void *buddy; + ext4_grpblk_t flex_offset; BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3)); BUG_ON(e4b->bd_group != ex->fe_group); + + e4b->bd_info->bb_free -= len; + if (e4b->bd_info->bb_first_free == start) + e4b->bd_info->bb_first_free += len; + + if (ext4_mb_frsp_on(e4b->bd_sb)) { + struct ext4_frsp_node *new; + + flex_offset = ext4_blkno_to_flex_offset(e4b->bd_sb, + ext4_group_first_block_no(e4b->bd_sb, + e4b->bd_group) + ex->fe_start); + if (!ex->fe_node) { + ex->fe_node = ext4_mb_frsp_search_by_off(e4b->bd_sb, + e4b->frsp_tree, flex_offset); + if (!ex->fe_node) + return 0; + } + /* + * Remove the node from the trees before we modify these since + * we will change the length and / or offset of this node. 
+ */ + rb_erase_cached(&ex->fe_node->frsp_node, + &e4b->frsp_tree->frsp_offset_root); + rb_erase_cached(&ex->fe_node->frsp_len_node, + &e4b->frsp_tree->frsp_len_root); + RB_CLEAR_NODE(&ex->fe_node->frsp_node); + RB_CLEAR_NODE(&ex->fe_node->frsp_len_node); + if (flex_offset > ex->fe_node->frsp_offset) { + if (flex_offset + ex->fe_len != + ex->fe_node->frsp_offset + + ex->fe_node->frsp_len) { + /* Need to split the node */ + new = ext4_mb_frsp_alloc_node(e4b->bd_sb); + if (!new) + return -ENOMEM; + new->frsp_offset = flex_offset + ex->fe_len; + new->frsp_len = (ex->fe_node->frsp_offset + + ex->fe_node->frsp_len) - + new->frsp_offset; + ext4_mb_frsp_insert_len(e4b->frsp_tree, new); + ext4_mb_frsp_insert_off(e4b->frsp_tree, new); + } + ex->fe_node->frsp_len = flex_offset - + ex->fe_node->frsp_offset; + } else if (ex->fe_len < ex->fe_node->frsp_len) { + /* used only a part: update node */ + ex->fe_node->frsp_offset += ex->fe_len; + ex->fe_node->frsp_len -= ex->fe_len; + } else { + ext4_mb_frsp_free_node(e4b->bd_sb, ex->fe_node); + return 0; + } + + ext4_mb_frsp_insert_len(e4b->frsp_tree, ex->fe_node); + ext4_mb_frsp_insert_off(e4b->frsp_tree, ex->fe_node); + + return 0; + } + assert_spin_locked(ext4_group_lock_ptr(e4b->bd_sb, e4b->bd_group)); mb_check_buddy(e4b); mb_mark_used_double(e4b, start, len); this_cpu_inc(discard_pa_seq); - e4b->bd_info->bb_free -= len; - if (e4b->bd_info->bb_first_free == start) - e4b->bd_info->bb_first_free += len; /* let's maintain fragments counter */ if (start != 0) @@ -2773,6 +3630,13 @@ static int ext4_mb_init_backend(struct super_block *sb) rcu_read_lock(); kvfree(rcu_dereference(sbi->s_group_info)); rcu_read_unlock(); + if (ext4_mb_frsp_on(sb)) { + for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) { + struct ext4_frsp_tree *tree = ext4_get_frsp_tree(sb, i); + + kfree(tree); + } + } return -ENOMEM; } @@ -2869,6 +3733,13 @@ int ext4_mb_init(struct super_block *sb) i++; } while (i <= sb->s_blocksize_bits + 1); + /* init for freespace trees */ + if (ext4_mb_frsp_on(sb)) { + ret = ext4_mb_init_freespace_trees(sb); + if (ret) + goto out; + } + spin_lock_init(&sbi->s_md_lock); spin_lock_init(&sbi->s_bal_lock); sbi->s_mb_free_pending = 0; @@ -2936,6 +3807,13 @@ int ext4_mb_init(struct super_block *sb) sbi->s_mb_offsets = NULL; kfree(sbi->s_mb_maxs); sbi->s_mb_maxs = NULL; + if (ext4_mb_frsp_on(sb)) { + for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) { + struct ext4_frsp_tree *tree = ext4_get_frsp_tree(sb, i); + + kfree(tree); + } + } return ret; } @@ -3080,8 +3958,10 @@ static void ext4_free_data_in_buddy(struct super_block *sb, /* No more items in the per group rb tree * balance refcounts from ext4_mb_free_metadata() */ - put_page(e4b.bd_buddy_page); - put_page(e4b.bd_bitmap_page); + if (!ext4_mb_frsp_on(sb)) { + put_page(e4b.bd_buddy_page); + put_page(e4b.bd_bitmap_page); + } } ext4_unlock_group(sb, entry->efd_group); kmem_cache_free(ext4_free_data_cachep, entry); @@ -3159,9 +4039,14 @@ int __init ext4_init_mballoc(void) SLAB_RECLAIM_ACCOUNT); if (ext4_free_data_cachep == NULL) goto out_ac_free; + ext4_freespace_node_cachep = KMEM_CACHE(ext4_frsp_node, + SLAB_RECLAIM_ACCOUNT); + if (ext4_freespace_node_cachep == NULL) + goto out_frsp_free; return 0; - +out_frsp_free: + kmem_cache_destroy(ext4_free_data_cachep); out_ac_free: kmem_cache_destroy(ext4_ac_cachep); out_pa_free: @@ -3180,6 +4065,7 @@ void ext4_exit_mballoc(void) kmem_cache_destroy(ext4_pspace_cachep); kmem_cache_destroy(ext4_ac_cachep); kmem_cache_destroy(ext4_free_data_cachep); + 
kmem_cache_destroy(ext4_freespace_node_cachep); ext4_groupinfo_destroy_slabs(); } @@ -3682,6 +4568,9 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac) if (!(ac->ac_flags & EXT4_MB_HINT_DATA)) return false; + if (ext4_mb_frsp_on(ac->ac_sb)) + return 0; + /* first, try per-file preallocation */ rcu_read_lock(); list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) { @@ -4384,6 +5273,8 @@ static int ext4_mb_pa_alloc(struct ext4_allocation_context *ac) struct ext4_prealloc_space *pa; BUG_ON(ext4_pspace_cachep == NULL); + if (ext4_mb_frsp_on(ac->ac_sb)) + return 0; pa = kmem_cache_zalloc(ext4_pspace_cachep, GFP_NOFS); if (!pa) return -ENOMEM; @@ -4396,6 +5287,8 @@ static void ext4_mb_pa_free(struct ext4_allocation_context *ac) { struct ext4_prealloc_space *pa = ac->ac_pa; + if (ext4_mb_frsp_on(ac->ac_sb)) + return; BUG_ON(!pa); ac->ac_pa = NULL; WARN_ON(!atomic_dec_and_test(&pa->pa_count)); @@ -4569,6 +5462,13 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac, ac->ac_o_ex.fe_len = len; ac->ac_g_ex = ac->ac_o_ex; ac->ac_flags = ar->flags; + if (ext4_mb_frsp_on(sb)) + ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC; + + /* set up best-found tree node */ + ac->ac_b_tree_ex.te_flex_start = 0; + ac->ac_b_tree_ex.te_flex = 0; + ac->ac_b_tree_ex.te_len = 0; /* we have to define context: we'll work with a file or * locality group. this is a policy, actually */ @@ -4919,7 +5819,11 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, goto errout; repeat: /* allocate space in core */ - *errp = ext4_mb_regular_allocator(ac); + if (ext4_mb_frsp_on(sb)) + *errp = ext4_mb_tree_allocator(ac); + else + *errp = ext4_mb_regular_allocator(ac); + /* * pa allocated above is added to grp->bb_prealloc_list only * when we were able to allocate some block i.e. when @@ -4932,8 +5836,13 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, ext4_discard_allocated_blocks(ac); goto errout; } - if (ac->ac_status == AC_STATUS_FOUND && - ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len) + /* + * Freespace trees should never return more than what was asked + * for. + */ + if (!ext4_mb_frsp_on(sb) && + (ac->ac_status == AC_STATUS_FOUND && + ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len)) ext4_mb_pa_free(ac); } if (likely(ac->ac_status == AC_STATUS_FOUND)) { @@ -5024,13 +5933,12 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b, struct rb_node *parent = NULL, *new_node; BUG_ON(!ext4_handle_valid(handle)); - BUG_ON(e4b->bd_bitmap_page == NULL); - BUG_ON(e4b->bd_buddy_page == NULL); + BUG_ON(!ext4_mb_frsp_on(sb) && e4b->bd_bitmap_page == NULL); new_node = &new_entry->efd_node; cluster = new_entry->efd_start_cluster; - if (!*n) { + if (!*n && !ext4_mb_frsp_on(sb)) { /* first free block exent. 
We need to protect buddy cache from being freed, * otherwise we'll refresh it from @@ -5509,6 +6417,7 @@ __acquires(bitlock) ex.fe_start = start; ex.fe_group = group; ex.fe_len = count; + ex.fe_node = NULL; /* * Mark blocks used, so no one can reuse them while @@ -5548,6 +6457,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group, void *bitmap; ext4_grpblk_t next, count = 0, free_count = 0; struct ext4_buddy e4b; + struct buffer_head *bh = NULL; int ret = 0; trace_ext4_trim_all_free(sb, group, start, max); @@ -5558,7 +6468,17 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group, ret, group); return ret; } - bitmap = e4b.bd_bitmap; + + if (ext4_mb_frsp_on(sb)) { + bh = ext4_read_block_bitmap(sb, group); + if (IS_ERR(bh)) { + ret = PTR_ERR(bh); + goto out; + } + bitmap = bh->b_data; + } else { + bitmap = e4b.bd_bitmap; + } ext4_lock_group(sb, group); if (EXT4_MB_GRP_WAS_TRIMMED(e4b.bd_info) && @@ -5605,6 +6525,8 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group, EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info); } out: + if (!IS_ERR_OR_NULL(bh)) + brelse(bh); ext4_unlock_group(sb, group); ext4_mb_unload_allocator(&e4b); @@ -5665,7 +6587,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range) for (group = first_group; group <= last_group; group++) { grp = ext4_get_group_info(sb, group); /* We only do this if the grp has never been initialized */ - if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { + if (unlikely(!ext4_mb_frsp_on(sb) && EXT4_MB_GRP_NEED_INIT(grp))) { ret = ext4_mb_init_group(sb, group, GFP_NOFS); if (ret) break; @@ -5719,16 +6641,25 @@ ext4_mballoc_query_range( ext4_grpblk_t next; struct ext4_buddy e4b; int error; + struct buffer_head *bh = NULL; - error = ext4_mb_load_allocator(sb, group, &e4b, 0); - if (error) - return error; - bitmap = e4b.bd_bitmap; - + if (ext4_mb_frsp_on(sb)) { + bh = ext4_read_block_bitmap(sb, group); + if (IS_ERR(bh)) { + error = PTR_ERR(bh); + goto out_unload; + } + bitmap = bh->b_data; + } else { + error = ext4_mb_load_allocator(sb, group, &e4b, 0); + if (error) + return error; + bitmap = e4b.bd_bitmap; + } ext4_lock_group(sb, group); - - start = (e4b.bd_info->bb_first_free > start) ? - e4b.bd_info->bb_first_free : start; + if (!ext4_mb_frsp_on(sb)) + start = (e4b.bd_info->bb_first_free > start) ? 
+ e4b.bd_info->bb_first_free : start; if (end >= EXT4_CLUSTERS_PER_GROUP(sb)) end = EXT4_CLUSTERS_PER_GROUP(sb) - 1; @@ -5737,19 +6668,20 @@ ext4_mballoc_query_range( if (start > end) break; next = mb_find_next_bit(bitmap, end + 1, start); - ext4_unlock_group(sb, group); error = formatter(sb, group, start, next - start, priv); if (error) goto out_unload; ext4_lock_group(sb, group); - start = next + 1; } ext4_unlock_group(sb, group); out_unload: - ext4_mb_unload_allocator(&e4b); + if (!IS_ERR_OR_NULL(bh)) + brelse(bh); + if (!ext4_mb_frsp_on(sb)) + ext4_mb_unload_allocator(&e4b); return error; } diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index e75b4749aa1c..af61651258a3 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -78,6 +78,20 @@ */ #define MB_DEFAULT_MAX_INODE_PREALLOC 512 +/* + * Struct for tree node in freespace_tree + */ +struct ext4_frsp_node { + ext4_grpblk_t frsp_offset; /* Start block offset inside + * current flexible group + */ + ext4_grpblk_t frsp_len; /* + * Length of the free space in + * number of clusters + */ + struct rb_node frsp_node; + struct rb_node frsp_len_node; +}; struct ext4_free_data { /* this links the free block information from sb_info */ struct list_head efd_list; @@ -125,6 +139,7 @@ struct ext4_free_extent { ext4_grpblk_t fe_start; /* In cluster units */ ext4_group_t fe_group; ext4_grpblk_t fe_len; /* In cluster units */ + struct ext4_frsp_node *fe_node; }; /* @@ -145,7 +160,14 @@ struct ext4_locality_group { spinlock_t lg_prealloc_lock; }; +struct ext4_tree_extent { + ext4_group_t te_flex; /* flex_bg index (tree index) */ + ext4_grpblk_t te_flex_start; /* block offset w.r.t flex bg */ + ext4_grpblk_t te_len; /* length */ +}; + struct ext4_allocation_context { + __u32 ac_id; struct inode *ac_inode; struct super_block *ac_sb; @@ -158,8 +180,16 @@ struct ext4_allocation_context { /* the best found extent */ struct ext4_free_extent ac_b_ex; - /* copy of the best found extent taken before preallocation efforts */ - struct ext4_free_extent ac_f_ex; + /* With freespace trees, we don't use preallocation anymore. */ + union { + /* + * copy of the best found extent taken before + * preallocation efforts + */ + struct ext4_free_extent ac_f_ex; + /* the best found tree extent */ + struct ext4_tree_extent ac_b_tree_ex; + }; __u16 ac_groups_scanned; __u16 ac_found; @@ -181,14 +211,31 @@ struct ext4_allocation_context { #define AC_STATUS_FOUND 2 #define AC_STATUS_BREAK 3 +/* + * Freespace tree flags + */ +#define EXT4_ALLOCATOR_FRSP_NOLOAD 0x0001 /* + * Don't load freespace + * tree, if it's not + * in memory. 
+ */ + struct ext4_buddy { - struct page *bd_buddy_page; - void *bd_buddy; - struct page *bd_bitmap_page; - void *bd_bitmap; + union { + struct { + struct page *bd_buddy_page; + void *bd_buddy; + struct page *bd_bitmap_page; + void *bd_bitmap; + __u16 bd_blkbits; + }; + struct { + struct ext4_frsp_tree *frsp_tree; + __u32 frsp_flags; + }; + }; struct ext4_group_info *bd_info; struct super_block *bd_sb; - __u16 bd_blkbits; ext4_group_t bd_group; }; diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index a50b51270ea9..6a0e1fc18e95 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -1679,6 +1679,10 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input) if (err) goto out; + err = ext4_mb_add_frsp_trees(sb, input->group + 1); + if (err) + goto out; + flex_gd.count = 1; flex_gd.groups = input; flex_gd.bg_flags = &bg_flags; @@ -2051,6 +2055,10 @@ int ext4_resize_fs(struct super_block *sb, ext4_fsblk_t n_blocks_count) if (err) goto out; + err = ext4_mb_add_frsp_trees(sb, n_group + 1); + if (err) + goto out; + flex_gd = alloc_flex_gd(flexbg_size); if (flex_gd == NULL) { err = -ENOMEM; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 656eccf6fc9b..55ff4e8be976 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1526,7 +1526,7 @@ enum { Opt_dioread_nolock, Opt_dioread_lock, Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable, Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache, - Opt_prefetch_block_bitmaps, + Opt_prefetch_block_bitmaps, Opt_freespace_tree, }; static const match_table_t tokens = { @@ -1613,6 +1613,7 @@ static const match_table_t tokens = { {Opt_init_itable, "init_itable=%u"}, {Opt_init_itable, "init_itable"}, {Opt_noinit_itable, "noinit_itable"}, + {Opt_freespace_tree, "freespace_tree"}, {Opt_max_dir_size_kb, "max_dir_size_kb=%u"}, {Opt_test_dummy_encryption, "test_dummy_encryption=%s"}, {Opt_test_dummy_encryption, "test_dummy_encryption"}, @@ -1839,6 +1840,8 @@ static const struct mount_opts { {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET}, {Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS, MOPT_SET}, + {Opt_freespace_tree, EXT4_MOUNT2_FREESPACE_TREE, + MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY}, {Opt_err, 0, 0} }; @@ -4745,12 +4748,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) } } + if (ext4_has_feature_flex_bg(sb)) + if (!ext4_fill_flex_info(sb)) { + ext4_msg(sb, KERN_ERR, + "unable to initialize flex_bg meta info!"); + goto failed_mount5; + } + ext4_ext_init(sb); err = ext4_mb_init(sb); if (err) { ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)", err); - goto failed_mount5; + goto failed_mount6; } block = ext4_count_free_clusters(sb); @@ -4780,14 +4790,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) goto failed_mount6; } - if (ext4_has_feature_flex_bg(sb)) - if (!ext4_fill_flex_info(sb)) { - ext4_msg(sb, KERN_ERR, - "unable to initialize " - "flex_bg meta info!"); - goto failed_mount6; - } - err = ext4_register_li_request(sb, first_not_zeroed); if (err) goto failed_mount6; @@ -4872,7 +4874,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) ext4_unregister_li_request(sb); failed_mount6: ext4_mb_release(sb); + percpu_counter_destroy(&sbi->s_freeclusters_counter); + percpu_counter_destroy(&sbi->s_freeinodes_counter); + percpu_counter_destroy(&sbi->s_dirs_counter); + percpu_counter_destroy(&sbi->s_dirtyclusters_counter); + percpu_free_rwsem(&sbi->s_writepages_rwsem); rcu_read_lock(); + +failed_mount5: flex_groups = 
rcu_dereference(sbi->s_flex_groups); if (flex_groups) { for (i = 0; i < sbi->s_flex_groups_allocated; i++) @@ -4880,12 +4889,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) kvfree(flex_groups); } rcu_read_unlock(); - percpu_counter_destroy(&sbi->s_freeclusters_counter); - percpu_counter_destroy(&sbi->s_freeinodes_counter); - percpu_counter_destroy(&sbi->s_dirs_counter); - percpu_counter_destroy(&sbi->s_dirtyclusters_counter); - percpu_free_rwsem(&sbi->s_writepages_rwsem); -failed_mount5: ext4_ext_release(sb); ext4_release_system_zone(sb); failed_mount4a: From patchwork Fri Aug 21 01:55:18 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: harshad shirwadkar X-Patchwork-Id: 1348754 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=tkzHQDmw; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BXl3S1fF3z9sPC for ; Fri, 21 Aug 2020 11:55:48 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727050AbgHUBzr (ORCPT ); Thu, 20 Aug 2020 21:55:47 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40140 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726980AbgHUBzf (ORCPT ); Thu, 20 Aug 2020 21:55:35 -0400 Received: from mail-pj1-x1041.google.com (mail-pj1-x1041.google.com [IPv6:2607:f8b0:4864:20::1041]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C012BC061385 for ; Thu, 20 Aug 2020 18:55:34 -0700 (PDT) Received: by mail-pj1-x1041.google.com with SMTP id t6so144161pjr.0 for ; Thu, 20 Aug 2020 18:55:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=C0aISXWgy4f0tPia037eRIOFaXF5QGy/8flcEPhgruI=; b=tkzHQDmwgTwKblSmOj+d25KOBN3gplAtQvaq9f3eUkxoy8ZUXpSpOE/e+NBT5apqOC fDgg9wWLDwk9rJdkaGVa3ZyeEAOMPxRxIkWv307KLS0Irn57uH9r6AafBVs0rB79sRKx yFVMugQW2lAH09/VO0kJgdNRaoPqjChTOz9lknOLnLdgVYlc/bNbXZtLu3KzwLyr2lA7 t3JVPA4f/sqaHCjhbKI3Qw2W9gglYu77VWLF6vpAzRgWQf5coGTsdEg2ac60CrC7TjI7 1R43kYEJaJtwIoj/oiMgQgfzRXghdaoJcszweC9WXm/WLuiwqGAgnHEt8GrG16iT60e6 vyTQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=C0aISXWgy4f0tPia037eRIOFaXF5QGy/8flcEPhgruI=; b=AkcQDZ5P9dURM2urwrauYiq6RGoeN5Dy3pB2PDE/89cwPYiZbUf5g3vrw+ix8rtHLv 6MTPptXlkfT1+xZeTbibHOEKjCjksT2WAP94jW38C0NzYf9t9jJZBiB1V4aiq92XS+za T+K2VycypHEtRIWLtOjsP5DGGN5sKTg3tJELcW1dxNqjOk0gN1CWTQU82QjZxaWxF6Lr QbFetN1DINBP6l8I+CgYOglCBnD1MR+ZUcwGG9ZZAlcVBhUi8S5A2DKCz9CMMJ1wtFFX Cc6+bwGtXcVHKt4Cp/GCc0PtOpsHgUX9ldK+blCV8+v5JWpuhS1flc5paqpd5/6QfnSn AIOg== X-Gm-Message-State: AOAM532Wz1ezL7IoPpYEBS55MWUu2KblP2B8yZK8WXYlk75Sf6P/kfMt YrhHjmu9NqFl5ZwuqEdxF/Roxa1EonA= X-Google-Smtp-Source: 
ABdhPJy7Of56wgOGFeAI7pJwg05DHhRTcDGtwrUauP1wFGKd0R3NMciAD6Yf+HELXhbFUoEEmktbjQ== X-Received: by 2002:a17:902:c410:: with SMTP id k16mr519649plk.340.1597974933942; Thu, 20 Aug 2020 18:55:33 -0700 (PDT) Received: from harshads-520.kir.corp.google.com ([2620:15c:17:10:a6ae:11ff:fe11:86a2]) by smtp.googlemail.com with ESMTPSA id o15sm370191pfu.167.2020.08.20.18.55.33 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 20 Aug 2020 18:55:33 -0700 (PDT) From: Harshad Shirwadkar X-Google-Original-From: Harshad Shirwadkar To: linux-ext4@vger.kernel.org Cc: tytso@mit.edu, lyx1209@gmail.com, Harshad Shirwadkar Subject: [RFC PATCH v2 4/9] ext4: add prefetching support for freespace trees Date: Thu, 20 Aug 2020 18:55:18 -0700 Message-Id: <20200821015523.1698374-5-harshads@google.com> X-Mailer: git-send-email 2.28.0.297.g1956fa8f8d-goog In-Reply-To: <20200821015523.1698374-1-harshads@google.com> References: <20200821015523.1698374-1-harshads@google.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Harshad Shirwadkar This patch supports creation of freespace trees for prefetched block bitmaps. Signed-off-by: Harshad Shirwadkar --- fs/ext4/mballoc.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 4b1c405543f0..168e9708257f 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3137,6 +3137,9 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group, void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group, unsigned int nr) { + struct ext4_buddy e4b; + int ret; + while (nr-- > 0) { struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL); @@ -3151,8 +3154,15 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group, ext4_free_group_clusters(sb, gdp) > 0 && !(ext4_has_group_desc_csum(sb) && (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))) { - if (ext4_mb_init_group(sb, group, GFP_NOFS)) + if (ext4_mb_frsp_on(sb)) { + ret = ext4_mb_load_allocator(sb, group, &e4b, + 0); + if (ret) + break; + ext4_mb_unload_allocator(&e4b); + } else if (ext4_mb_init_group(sb, group, GFP_NOFS)) { break; + } } } } From patchwork Fri Aug 21 01:55:19 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: harshad shirwadkar X-Patchwork-Id: 1348757 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=MLan2qBq; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BXl3Y3L5Dz9sPC for ; Fri, 21 Aug 2020 11:55:53 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727064AbgHUBzw (ORCPT ); Thu, 20 Aug 2020 21:55:52 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40150 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726810AbgHUBzh (ORCPT ); Thu, 20 Aug 2020 21:55:37 -0400 Received: 
from mail-pf1-x444.google.com (mail-pf1-x444.google.com [IPv6:2607:f8b0:4864:20::444]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CA72AC06138B for ; Thu, 20 Aug 2020 18:55:36 -0700 (PDT) Received: by mail-pf1-x444.google.com with SMTP id m71so301956pfd.1 for ; Thu, 20 Aug 2020 18:55:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=6IyhBx7W+MiKC0dWGrWkGj+rJtFq3n92XRbcI6FP5ag=; b=MLan2qBq72XX/uDKc8/TnlKCTAv8S4LrNqu1lrnp1wTgCcBceBm4kC1y3Rh2YDstKs Ee3sDbtdL44lIP5SOrhkW13eDqJOshbzokAlanMVSC5s2fX33Vqfqjm8HiecWYUBWgih NjbSVzZqMAalLojcPpWNqlgJcCcREhjkc9Dlg4yMcMb/88o3goW/tStxUV/L1XmrvdW/ z9TK5OCKgKc3jgE/pqL4HYdrlf9wFcl3hrS5rArgyNGZAAMltmuw9yeYz+DgFLldLWWt eZTP0SofPqGIC4g98Rehh/DbxmN/jsjndWJ9OMoJk1C1dIEhj/+P+75b9Pbxnh5m3Izh TxFw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=6IyhBx7W+MiKC0dWGrWkGj+rJtFq3n92XRbcI6FP5ag=; b=YV5rgB8GfT22yXz9B4qvLhtRuKa6DKlmP/cBAq8wIy9eurVWstQ5FYpbfTY0QSH9lc 18+OMKAQoV4SrT7XT2fe0ToDafxst3Fv57TY9kAqb5HuIwYqBVV35DYP+iuzpB2bwC3x xd4fMi/ZpC7FKUvoCjrbgh7a0MhuLdKP6nYx0Zhg9A4skucTgo8khjptE0XErsXrlPJd BMYXmHjFvM15fQcTsFe050BKKjLdJCbfprguARVkmf0cTcjW7aiCC9hqQCRov4P+54sW fcWqa1AJShqCKqJV++C9P3jy9FNd+E7nZhdZnN0ck1nI5GiphIpYuRTCvy1uxxYR61Ar e2hg== X-Gm-Message-State: AOAM530+jfNWeK18YUwHbP7VfZSN8EULPTT3Gq09/T/N95ifC6Xpo6dE WzyoSJHYN8xQptvEnAXeJey/iUVpvBA= X-Google-Smtp-Source: ABdhPJxF37P08R/iER6Ha/96JpoRIYJUMHuRAnFk46DrsOyb/5dmk0+S08HmlzC//dAA/bjiZmsUSw== X-Received: by 2002:a62:7bcf:: with SMTP id w198mr623384pfc.90.1597974935592; Thu, 20 Aug 2020 18:55:35 -0700 (PDT) Received: from harshads-520.kir.corp.google.com ([2620:15c:17:10:a6ae:11ff:fe11:86a2]) by smtp.googlemail.com with ESMTPSA id o15sm370191pfu.167.2020.08.20.18.55.34 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 20 Aug 2020 18:55:34 -0700 (PDT) From: Harshad Shirwadkar X-Google-Original-From: Harshad Shirwadkar To: linux-ext4@vger.kernel.org Cc: tytso@mit.edu, lyx1209@gmail.com, Harshad Shirwadkar Subject: [RFC PATCH v2 5/9] ext4: add freespace tree optimizations Date: Thu, 20 Aug 2020 18:55:19 -0700 Message-Id: <20200821015523.1698374-6-harshads@google.com> X-Mailer: git-send-email 2.28.0.297.g1956fa8f8d-goog In-Reply-To: <20200821015523.1698374-1-harshads@google.com> References: <20200821015523.1698374-1-harshads@google.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Harshad Shirwadkar This patch adds an optimization on top of our freespace tree allocator. We add a new meta-tree which contains tree node sorted by length of their largest extent. We use this tree to optimize an allocation request so that it immediately gets a hit or it falls back to next CR level. 
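To make the meta-tree idea concrete, below is a minimal userspace sketch in plain C (an illustration only, not the kernel code in this patch; the names frsp_meta and frsp_meta_pick are invented for the example). Each freespace tree is summarized by the length of its largest free extent, the summaries are kept in descending length order, and a request for goal_len clusters is routed to the tightest tree that can still satisfy it; a return of -1 stands for "no tree qualifies at this criteria, fall back to the next CR level". The patch itself keeps these summaries in an rb-tree (s_mb_frsp_meta_tree, keyed by frsp_max_free_len) rather than a sorted array.

#include <stdio.h>
#include <stdlib.h>

struct frsp_meta {
	int tree_index;     /* flex_bg number (tree index) */
	int max_free_len;   /* largest free extent in that tree, in clusters */
};

/* Sort descending by max_free_len, mirroring the length-ordered meta-tree. */
static int meta_cmp(const void *a, const void *b)
{
	const struct frsp_meta *x = a, *y = b;

	return y->max_free_len - x->max_free_len;
}

/*
 * Pick the tightest tree whose largest free extent still covers goal_len.
 * Returns the tree index, or -1 meaning "nothing can satisfy this request
 * at the current criteria; fall back to the next CR level".
 */
static int frsp_meta_pick(const struct frsp_meta *meta, int nr, int goal_len)
{
	int i, pick = -1;

	for (i = 0; i < nr; i++) {
		if (meta[i].max_free_len < goal_len)
			break;                     /* descending: the rest are smaller */
		pick = meta[i].tree_index;         /* tightest fit seen so far */
	}
	return pick;
}

int main(void)
{
	struct frsp_meta meta[] = {
		{ .tree_index = 0, .max_free_len = 12 },
		{ .tree_index = 1, .max_free_len = 512 },
		{ .tree_index = 2, .max_free_len = 64 },
	};
	int nr = sizeof(meta) / sizeof(meta[0]);

	qsort(meta, nr, sizeof(meta[0]), meta_cmp);

	printf("goal 50   -> tree %d\n", frsp_meta_pick(meta, nr, 50));   /* tree 2 (64) */
	printf("goal 100  -> tree %d\n", frsp_meta_pick(meta, nr, 100));  /* tree 1 (512) */
	printf("goal 1000 -> tree %d\n", frsp_meta_pick(meta, nr, 1000)); /* -1: next CR */
	return 0;
}
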
Signed-off-by: Harshad Shirwadkar --- fs/ext4/ext4.h | 17 +++++ fs/ext4/mballoc.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/mballoc.h | 1 + fs/ext4/super.c | 9 +++ 4 files changed, 205 insertions(+) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 513c8473f45f..15e6ce9f1afa 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -153,6 +153,8 @@ enum SHIFT_DIRECTION { #define EXT4_MB_USE_RESERVED 0x2000 /* Do strict check for free blocks while retrying block allocation */ #define EXT4_MB_STRICT_CHECK 0x4000 +/* Disable freespace optimizations on ac */ +#define EXT4_MB_FRSP_NO_OPTIMIZE 0x8000 struct ext4_allocation_request { /* target inode for block we're allocating */ @@ -1444,12 +1446,22 @@ struct ext4_frsp_tree { __u32 frsp_index; /* Tree index (flex bg * number) */ + struct list_head frsp_list_node; + struct rb_node frsp_len_node; }; /* Freespace tree flags */ /* Tree is loaded in memory */ #define EXT4_MB_FRSP_FLAG_LOADED 0x0001 +/* Tree is inserted into meta tree */ +#define EXT4_MB_FRSP_FLAG_CACHED 0x2 + +/* Freespace tree cache aggression levels */ +#define EXT4_MB_FRSP_MIN_CACHE_AGGRESSION 0x0 +#define EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION 0x1 +#define EXT4_MB_FRSP_MAX_CACHE_AGGRESSION 0x3 + /* * fourth extended-fs super-block data in memory */ @@ -1590,6 +1602,11 @@ struct ext4_sb_info { /* freespace trees stuff */ int s_mb_num_frsp_trees; + struct rb_root_cached s_mb_frsp_meta_tree; + rwlock_t s_mb_frsp_lock; + atomic_t s_mb_num_frsp_trees_cached; + struct list_head s_mb_uncached_trees; + u32 s_mb_frsp_cache_aggression; /* workqueue for reserved extent conversions (buffered io) */ struct workqueue_struct *rsv_conversion_wq; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 168e9708257f..1da63afdbb3d 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -779,6 +779,13 @@ static inline int ext4_mb_frsp_len_cmp(struct ext4_frsp_node *arg1, return (arg1->frsp_len > arg2->frsp_len); } +/* Compare function for meta tree */ +static inline int ext4_mb_frsp_meta_cmp(struct ext4_frsp_tree *arg1, + struct ext4_frsp_tree *arg2) +{ + return (arg1->frsp_max_free_len > arg2->frsp_max_free_len); +} + /* insert to offset-indexed tree */ static void ext4_mb_frsp_insert_off(struct ext4_frsp_tree *tree, struct ext4_frsp_node *new_entry) @@ -798,6 +805,35 @@ static void ext4_mb_frsp_insert_len(struct ext4_frsp_tree *tree, ext4_mb_frsp_len_cmp); } +void ext4_mb_frsp_meta_reinsert(struct super_block *sb, + struct ext4_frsp_tree *tree) +{ + struct ext4_sb_info *sbi = EXT4_SB(sb); + struct ext4_frsp_node *node; + struct rb_node *first = rb_first_cached(&tree->frsp_len_root); + struct rb_root_cached *meta_root = &EXT4_SB(sb)->s_mb_frsp_meta_tree; + int expected_len = 0; + + if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED)) + return; + + if (first) { + node = rb_entry(first, struct ext4_frsp_node, frsp_len_node); + expected_len = node->frsp_len; + } + + if (tree->frsp_max_free_len == expected_len) + return; + + write_lock(&sbi->s_mb_frsp_lock); + tree->frsp_max_free_len = expected_len; + rb_erase_cached(&tree->frsp_len_node, &sbi->s_mb_frsp_meta_tree); + RB_CLEAR_NODE(&tree->frsp_len_node); + ext4_mb_rb_insert(meta_root, tree, struct ext4_frsp_tree, frsp_len_node, + ext4_mb_frsp_meta_cmp); + write_unlock(&sbi->s_mb_frsp_lock); +} + #ifdef CONFIG_EXT4_DEBUG /* print freespace_tree in pre-order traversal */ void ext4_mb_frsp_print_tree_len(struct super_block *sb, @@ -966,6 +1002,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree, } } 
ext4_mb_frsp_insert_len(tree, new_entry); + ext4_mb_frsp_meta_reinsert(sb, tree); return 0; } @@ -1063,6 +1100,9 @@ int ext4_mb_frsp_load(struct ext4_buddy *e4b) if (ret) goto out; } + if (!(e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_CACHED)) + atomic_inc(&sbi->s_mb_num_frsp_trees_cached); + e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_CACHED; e4b->frsp_tree->frsp_flags |= EXT4_MB_FRSP_FLAG_LOADED; out: for (i = 0; i < ngroups; i++) { @@ -1156,6 +1196,7 @@ static void ext4_mb_frsp_use_best_found(struct ext4_allocation_context *ac, ext4_mb_load_allocator(ac->ac_sb, group_no, &e4b, EXT4_ALLOCATOR_FRSP_NOLOAD); mb_mark_used(&e4b, bex); + ext4_mb_frsp_meta_reinsert(ac->ac_sb, e4b.frsp_tree); ext4_mb_unload_allocator(&e4b); } /* @@ -1286,6 +1327,124 @@ void ext4_mb_frsp_find_by_goal(struct ext4_allocation_context *ac) ext4_mb_unload_allocator(&e4b); } +/* + * Determine if caching of trees is necessary. This function returns 1 if it is, + * 0 if it is not. + */ +static int ext4_mb_frsp_should_cache(struct ext4_allocation_context *ac) +{ + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); + + if (list_empty(&sbi->s_mb_uncached_trees)) + return 0; + + if (sbi->s_mb_frsp_cache_aggression >= + EXT4_MB_FRSP_MAX_CACHE_AGGRESSION) + return 1; + + if (sbi->s_mb_frsp_cache_aggression == + EXT4_MB_FRSP_MIN_CACHE_AGGRESSION) + return 0; + + /* At cache aggression level 1, skip caching at CR 0 */ + if (sbi->s_mb_frsp_cache_aggression == 1 && ac->ac_criteria == 0) + return 0; + + /* + * At cache aggression level 2, perform caching for every alternate + * optimization. + */ + return (ac->ac_num_optimizations & 0x1); +} + +/* + * Optimize allocation request. This function _tries_ to lookup the meta-tree + * and if it can optimize the search in any way, it does so. As a result + * this function returns 1, if the optimization was performed. In this case, + * the caller should restart the search from tree mentioned in *tree_idx. + */ +int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx) +{ + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); + struct ext4_frsp_tree *cur = NULL; + struct rb_node *node = NULL; + int found = 0, best = 0, cache_more_trees = 0, better_len = 0, ret = 0; + + if (ac->ac_flags & EXT4_MB_HINT_FIRST || + ac->ac_flags & EXT4_MB_FRSP_NO_OPTIMIZE || + ac->ac_status != AC_STATUS_CONTINUE) + return 0; + + ac->ac_num_optimizations++; + if (!read_trylock(&sbi->s_mb_frsp_lock)) + return 0; + + node = sbi->s_mb_frsp_meta_tree.rb_root.rb_node; + while (node) { + cur = rb_entry(node, struct ext4_frsp_tree, frsp_len_node); + if (ext4_mb_frsp_is_better(ac, cur->frsp_max_free_len, &best)) { + /* + * This tree definitely has a better node than the best + * found so far. + */ + found = 1; + ret = 1; + *tree_idx = cur->frsp_index; + better_len = cur->frsp_max_free_len; + if (best) + break; + } + if (cur->frsp_max_free_len > ac->ac_g_ex.fe_len) + node = node->rb_right; + else + node = node->rb_left; + } + + if (ac->ac_found == 0 && !found) { + /* + * If we haven't found a good match above, and we hadn't found + * any match before us, that means we need to loosen our + * criteria. Note that, if we had found something earlier, + * not finding a better node doesn't imply that there is no + * better node available. + * TODO - in this case determine probabilistically which tree + * may have a better node and direct our allocator there. 
+ */ + if (ext4_mb_frsp_should_cache(ac)) { + cache_more_trees = 1; + } else if (ac->ac_criteria < 2) { + ac->ac_criteria++; + ac->ac_groups_scanned = 0; + *tree_idx = 0; + ret = 1; + } else { + ac->ac_flags |= EXT4_MB_HINT_FIRST; + } + } else if (!best && !list_empty(&sbi->s_mb_uncached_trees)) { + cache_more_trees = ext4_mb_frsp_should_cache(ac); + } + + if (cache_more_trees) { + cur = list_first_entry(&sbi->s_mb_uncached_trees, + struct ext4_frsp_tree, frsp_list_node); + list_del_init(&cur->frsp_list_node); + *tree_idx = cur->frsp_index; + ret = 1; + } + read_unlock(&sbi->s_mb_frsp_lock); + + /* + * If we couldn't optimize now, it's unlikely that we'll be able to + * optimize this request anymore + */ + if (!ret) + ac->ac_flags |= EXT4_MB_FRSP_NO_OPTIMIZE; + mb_debug(ac->ac_sb, + "Optimizer suggestion: found = %d, tree = %d, len = %d, cr = %d\n", + found, *tree_idx, better_len, ac->ac_criteria); + return ret; +} + static void ext4_mb_frsp_process(struct ext4_allocation_context *ac, struct ext4_frsp_tree *tree) { @@ -1324,6 +1483,7 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac) struct ext4_frsp_node *cur = NULL; struct ext4_tree_extent *btx = NULL; int ret = 0, start_idx = 0, tree_idx, j; + int optimize; sb = ac->ac_sb; btx = &ac->ac_b_tree_ex; @@ -1341,6 +1501,8 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac) ac->ac_criteria = 0; ac->ac_groups_scanned = 0; + ext4_mb_frsp_optimize(ac, &start_idx); + repeat: /* Loop through the rest of trees (flex_bg) */ @@ -1357,13 +1519,17 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac) mutex_lock(&e4b.frsp_tree->frsp_lock); ext4_mb_frsp_process(ac, e4b.frsp_tree); mutex_unlock(&e4b.frsp_tree->frsp_lock); + optimize = ext4_mb_frsp_optimize(ac, &start_idx); ext4_mb_unload_allocator(&e4b); + if (optimize) + goto repeat; } if (ac->ac_status != AC_STATUS_FOUND) { if (ac->ac_criteria < 2) { ac->ac_criteria++; ac->ac_groups_scanned = 0; + ac->ac_flags &= ~EXT4_MB_FRSP_NO_OPTIMIZE; mb_debug(sb, "Falling back to CR=%d", ac->ac_criteria); goto repeat; } @@ -1415,6 +1581,7 @@ static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index) mutex_init(&(tree->frsp_lock)); tree->frsp_flags = 0; tree->frsp_index = index; + INIT_LIST_HEAD(&tree->frsp_list_node); } int ext4_mb_init_freespace_trees(struct super_block *sb) @@ -1425,6 +1592,8 @@ int ext4_mb_init_freespace_trees(struct super_block *sb) sbi->s_mb_num_frsp_trees = ext4_num_grps_to_flexbg(sb, ext4_get_groups_count(sb)); + sbi->s_mb_frsp_meta_tree = RB_ROOT_CACHED; + INIT_LIST_HEAD(&sbi->s_mb_uncached_trees); for (i = 0; i < sbi->s_mb_num_frsp_trees; i++) { fg = sbi_array_rcu_deref(sbi, s_flex_groups, i); @@ -1433,7 +1602,11 @@ int ext4_mb_init_freespace_trees(struct super_block *sb) if (!fg->frsp_tree) return -ENOMEM; ext4_mb_frsp_init_tree(fg->frsp_tree, i); + list_add(&fg->frsp_tree->frsp_list_node, + &sbi->s_mb_uncached_trees); } + rwlock_init(&sbi->s_mb_frsp_lock); + atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0); return 0; } @@ -1460,6 +1633,11 @@ int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups) if (!fg->frsp_tree) return -ENOMEM; ext4_mb_frsp_init_tree(fg->frsp_tree, i); + write_lock(&sbi->s_mb_frsp_lock); + list_add(&fg->frsp_tree->frsp_list_node, + &sbi->s_mb_uncached_trees); + write_unlock(&sbi->s_mb_frsp_lock); + ext4_mb_frsp_meta_reinsert(sb, fg->frsp_tree); } sbi->s_mb_num_frsp_trees = flex_bg_count; diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index af61651258a3..1fcdd3e6f7d5 100644 --- a/fs/ext4/mballoc.h +++ 
b/fs/ext4/mballoc.h @@ -200,6 +200,7 @@ struct ext4_allocation_context { __u8 ac_criteria; __u8 ac_2order; /* if request is to allocate 2^N blocks and * N > 0, the field stores N, otherwise 0 */ + __u8 ac_num_optimizations; __u8 ac_op; /* operation, for history only */ struct page *ac_bitmap_page; struct page *ac_buddy_page; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 55ff4e8be976..6a2d43d81b14 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1527,6 +1527,8 @@ enum { Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable, Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache, Opt_prefetch_block_bitmaps, Opt_freespace_tree, + Opt_frsp_cache_aggression, + }; static const match_table_t tokens = { @@ -1620,6 +1622,7 @@ static const match_table_t tokens = { {Opt_nombcache, "nombcache"}, {Opt_nombcache, "no_mbcache"}, /* for backward compatibility */ {Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"}, + {Opt_frsp_cache_aggression, "frsp_cache_aggression=%u"}, {Opt_removed, "check=none"}, /* mount option from ext2/3 */ {Opt_removed, "nocheck"}, /* mount option from ext2/3 */ {Opt_removed, "reservation"}, /* mount option from ext2/3 */ @@ -1836,6 +1839,7 @@ static const struct mount_opts { {Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT}, {Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT}, {Opt_max_dir_size_kb, 0, MOPT_GTE0}, + {Opt_frsp_cache_aggression, 0, MOPT_GTE0}, {Opt_test_dummy_encryption, 0, MOPT_STRING}, {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET}, {Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS, @@ -2043,6 +2047,10 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token, sbi->s_li_wait_mult = arg; } else if (token == Opt_max_dir_size_kb) { sbi->s_max_dir_size_kb = arg; + } else if (token == Opt_frsp_cache_aggression) { + sbi->s_mb_frsp_cache_aggression = + min(EXT4_MB_FRSP_MAX_CACHE_AGGRESSION, + max(EXT4_MB_FRSP_MIN_CACHE_AGGRESSION, arg)); } else if (token == Opt_stripe) { sbi->s_stripe = arg; } else if (token == Opt_resuid) { @@ -3962,6 +3970,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ; sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME; sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME; + sbi->s_mb_frsp_cache_aggression = EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION; if ((def_mount_opts & EXT4_DEFM_NOBARRIER) == 0) set_opt(sb, BARRIER); From patchwork Fri Aug 21 01:55:20 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: harshad shirwadkar X-Patchwork-Id: 1348758 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=Wjq6fyjz; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BXl3b44lXz9sPB for ; Fri, 21 Aug 2020 11:55:55 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727066AbgHUBzx (ORCPT ); Thu, 20 Aug 2020 21:55:53 -0400 Received: from lindbergh.monkeyblade.net 
([23.128.96.19]:40156 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726982AbgHUBzh (ORCPT ); Thu, 20 Aug 2020 21:55:37 -0400 Received: from mail-pg1-x541.google.com (mail-pg1-x541.google.com [IPv6:2607:f8b0:4864:20::541]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 88F95C061387 for ; Thu, 20 Aug 2020 18:55:37 -0700 (PDT) Received: by mail-pg1-x541.google.com with SMTP id o13so265556pgf.0 for ; Thu, 20 Aug 2020 18:55:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=/a+jf5V1NS62ktK9skuew7zctJMbM5+wgF7KkJObv6A=; b=Wjq6fyjz8ZPUJmvozb60jEw1EI+MmnwQ2EGUojO1fZiG2Q15bMiWqOqpmzi8Dt/8jT iloZ1RAR5fUBZd4taY+jI6SVY7ozGM7GDrXd2hhMBCKzpWACFwC2QIO5EVuCcbCyjftc A5XH2z5TDCVZ19n6UsbhJF+gn9FGCoXT84uRAyedn6TW6OWKc/QCovjuXckSqeHRghRu 2W4JmI7LIjZfwG/celjLDulOi5iqR6VzQq+jjoODtgZ1mTP/9t4is0bjL3q/TwbVptam WtCgdTOt4Reibx+WLaC8HXXEk5BUCoIchEr23jUnTah8tf7PIEzkY4R2sxxPF9UjBppD XBEw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=/a+jf5V1NS62ktK9skuew7zctJMbM5+wgF7KkJObv6A=; b=JDuVo23BeVSykZtGLHhi44jiEQ8q1FJDXqPuahNoFiw/82GsWhs8T65/EQhTbIED3i ztSsBYc3GF07Jugbiyols3T8TTUuB0ff37yX8eWj53jZ7Az2q1ecmKuwNWCUfmIYW0a6 ioEkqas+uoSJ6uwKPaC5hR6hh8KWmstvjIxIrvfcFdoQxEOQ0uwkm0PMHjqVoDiLlklK RJm38ydz+4IC0ihjeYGLnQX1HpdupfuPMKnv8HRKtf1ti1RlYUUMVygCKmoFOvd7+yeX 5UkW0WJ0dYWxbrWyPLuu1dNnCJoSkmFLEL/WKFc0VhTNmmit4hBBv6y0NXrVDAiuK8vs MzAw== X-Gm-Message-State: AOAM5305Yj5EA/2u8W6zrWWZBMMHsUTFgquW4wWHOVZPgWhWKV1BXqLV 7tEoDF508TOkWt97QUcm17qEs0avuEQ= X-Google-Smtp-Source: ABdhPJzLB3EKuaE6EXjBqOpgHK6FJWNlitYqxyO7k12GgduKQYIF34MYavPjjxsCadExSu5wnS95yA== X-Received: by 2002:aa7:9ac2:: with SMTP id x2mr561895pfp.57.1597974936620; Thu, 20 Aug 2020 18:55:36 -0700 (PDT) Received: from harshads-520.kir.corp.google.com ([2620:15c:17:10:a6ae:11ff:fe11:86a2]) by smtp.googlemail.com with ESMTPSA id o15sm370191pfu.167.2020.08.20.18.55.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 20 Aug 2020 18:55:35 -0700 (PDT) From: Harshad Shirwadkar X-Google-Original-From: Harshad Shirwadkar To: linux-ext4@vger.kernel.org Cc: tytso@mit.edu, lyx1209@gmail.com, Harshad Shirwadkar Subject: [RFC PATCH v2 6/9] ext4: add memory usage tracker for freespace trees Date: Thu, 20 Aug 2020 18:55:20 -0700 Message-Id: <20200821015523.1698374-7-harshads@google.com> X-Mailer: git-send-email 2.28.0.297.g1956fa8f8d-goog In-Reply-To: <20200821015523.1698374-1-harshads@google.com> References: <20200821015523.1698374-1-harshads@google.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Harshad Shirwadkar Freespace trees can occupy a lot of memory with as the fragmentation increases. This patch adds a sysfs file to monitor the memory usage of the freespace tree allocator. Also, added a sysfs config to control maximum memory that the allocator can use. If the allocator exceeds this threshold, file system enters "FRSP_MEM_CRUNCH" state. The next patch in the series performs LRU eviction when this state is reached. 
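As a rough userspace model of the accounting introduced here (an illustration only, not the kernel code; every name below is invented for the example): the allocator counts live tree nodes, the administrator sets a byte limit, and the "memory crunch" state is simply the live-node footprint crossing that limit. The patch itself does this bookkeeping with the s_mb_num_fragments atomic counter, the mb_frsp_max_mem sysfs knob (s_mb_frsp_mem_limit), and the FRSP_MEM_CRUNCH flag, with EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES() converting the byte limit into a node count.

#include <stdio.h>
#include <stdbool.h>

/* Stand-in for struct ext4_frsp_node; only its size matters for the model. */
struct frsp_node_model { long off, len; void *links[4]; };

static long nr_nodes;           /* models sbi->s_mb_num_fragments */
static long mem_limit = 4096;   /* models sbi->s_mb_frsp_mem_limit, in bytes */
static bool mem_crunch;         /* models the FRSP_MEM_CRUNCH state */

static long limit_in_nodes(void)
{
	/* bytes -> node count, as EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES() does */
	return mem_limit / (long)sizeof(struct frsp_node_model);
}

static void account_alloc(void)
{
	nr_nodes++;
	mem_crunch = mem_limit && nr_nodes > limit_in_nodes();
}

static void account_free(void)
{
	nr_nodes--;
	if (!mem_limit || nr_nodes < limit_in_nodes())
		mem_crunch = false;     /* freeing can only clear the crunch state */
}

int main(void)
{
	int i;

	for (i = 0; i < 200; i++)
		account_alloc();
	/* what the frsp_tree_usage sysfs file would report, in bytes */
	printf("usage=%ld bytes, crunch=%d\n",
	       nr_nodes * (long)sizeof(struct frsp_node_model), mem_crunch);

	while (mem_crunch)
		account_free();         /* models eviction shrinking the trees */
	printf("usage=%ld bytes, crunch=%d\n",
	       nr_nodes * (long)sizeof(struct frsp_node_model), mem_crunch);
	return 0;
}
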
Signed-off-by: Harshad Shirwadkar --- fs/ext4/ext4.h | 8 ++++++++ fs/ext4/mballoc.c | 20 ++++++++++++++++++++ fs/ext4/mballoc.h | 4 ++++ fs/ext4/sysfs.c | 11 +++++++++++ 4 files changed, 43 insertions(+) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 15e6ce9f1afa..93bf2fe35cf1 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1223,6 +1223,12 @@ struct ext4_inode_info { * allocator off) */ +#define EXT4_MOUNT2_FRSP_MEM_CRUNCH 0x00000040 /* + * Freespace tree allocator + * is in a tight memory + * situation. + */ + #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \ ~EXT4_MOUNT_##opt #define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \ @@ -1607,6 +1613,8 @@ struct ext4_sb_info { atomic_t s_mb_num_frsp_trees_cached; struct list_head s_mb_uncached_trees; u32 s_mb_frsp_cache_aggression; + atomic_t s_mb_num_fragments; + u32 s_mb_frsp_mem_limit; /* workqueue for reserved extent conversions (buffered io) */ struct workqueue_struct *rsv_conversion_wq; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 1da63afdbb3d..b28b7fb0506e 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -869,6 +869,7 @@ void ext4_mb_frsp_print_tree_len(struct super_block *sb, static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb) { struct ext4_frsp_node *node; + struct ext4_sb_info *sbi = EXT4_SB(sb); node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS); if (!node) @@ -877,13 +878,31 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb) RB_CLEAR_NODE(&node->frsp_node); RB_CLEAR_NODE(&node->frsp_len_node); + atomic_inc(&sbi->s_mb_num_fragments); + + if (sbi->s_mb_frsp_mem_limit && + atomic_read(&sbi->s_mb_num_fragments) > + EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb)) + set_opt2(sb, FRSP_MEM_CRUNCH); + else + clear_opt2(sb, FRSP_MEM_CRUNCH); + + return node; } static void ext4_mb_frsp_free_node(struct super_block *sb, struct ext4_frsp_node *node) { + struct ext4_sb_info *sbi = EXT4_SB(sb); + kmem_cache_free(ext4_freespace_node_cachep, node); + atomic_dec(&sbi->s_mb_num_fragments); + + if (!sbi->s_mb_frsp_mem_limit || + atomic_read(&sbi->s_mb_num_fragments) < + EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(sb)) + clear_opt2(sb, FRSP_MEM_CRUNCH); } /* Evict a tree from memory */ @@ -1607,6 +1626,7 @@ int ext4_mb_init_freespace_trees(struct super_block *sb) } rwlock_init(&sbi->s_mb_frsp_lock); atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0); + atomic_set(&sbi->s_mb_num_fragments, 0); return 0; } diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index 1fcdd3e6f7d5..6cfb228e4da2 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -92,6 +92,10 @@ struct ext4_frsp_node { struct rb_node frsp_node; struct rb_node frsp_len_node; }; + +#define EXT4_FRSP_MEM_LIMIT_TO_NUM_NODES(__sb) \ + ((sbi->s_mb_frsp_mem_limit / sizeof(struct ext4_frsp_node))) + struct ext4_free_data { /* this links the free block information from sb_info */ struct list_head efd_list; diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c index bfabb799fa45..19301b10944b 100644 --- a/fs/ext4/sysfs.c +++ b/fs/ext4/sysfs.c @@ -8,6 +8,7 @@ * */ +#include "mballoc.h" #include #include #include @@ -24,6 +25,7 @@ typedef enum { attr_session_write_kbytes, attr_lifetime_write_kbytes, attr_reserved_clusters, + attr_frsp_tree_usage, attr_inode_readahead, attr_trigger_test_error, attr_first_error_time, @@ -208,6 +210,7 @@ EXT4_ATTR_FUNC(delayed_allocation_blocks, 0444); EXT4_ATTR_FUNC(session_write_kbytes, 0444); EXT4_ATTR_FUNC(lifetime_write_kbytes, 0444); EXT4_ATTR_FUNC(reserved_clusters, 0644); 
+EXT4_ATTR_FUNC(frsp_tree_usage, 0444); EXT4_ATTR_OFFSET(inode_readahead_blks, 0644, inode_readahead, ext4_sb_info, s_inode_readahead_blks); @@ -248,6 +251,7 @@ EXT4_ATTR(last_error_time, 0444, last_error_time); EXT4_ATTR(journal_task, 0444, journal_task); EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch); EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit); +EXT4_RW_ATTR_SBI_UI(mb_frsp_max_mem, s_mb_frsp_mem_limit); static unsigned int old_bump_val = 128; EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val); @@ -257,6 +261,7 @@ static struct attribute *ext4_attrs[] = { ATTR_LIST(session_write_kbytes), ATTR_LIST(lifetime_write_kbytes), ATTR_LIST(reserved_clusters), + ATTR_LIST(frsp_tree_usage), ATTR_LIST(inode_readahead_blks), ATTR_LIST(inode_goal), ATTR_LIST(mb_stats), @@ -296,6 +301,7 @@ static struct attribute *ext4_attrs[] = { #endif ATTR_LIST(mb_prefetch), ATTR_LIST(mb_prefetch_limit), + ATTR_LIST(mb_frsp_max_mem), NULL, }; ATTRIBUTE_GROUPS(ext4); @@ -378,6 +384,11 @@ static ssize_t ext4_attr_show(struct kobject *kobj, return snprintf(buf, PAGE_SIZE, "%llu\n", (unsigned long long) atomic64_read(&sbi->s_resv_clusters)); + case attr_frsp_tree_usage: + return snprintf(buf, PAGE_SIZE, "%llu\n", + (unsigned long long) + atomic_read(&sbi->s_mb_num_fragments) * + sizeof(struct ext4_frsp_node)); case attr_inode_readahead: case attr_pointer_ui: if (!ptr) From patchwork Fri Aug 21 01:55:21 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: harshad shirwadkar X-Patchwork-Id: 1348759 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=bmwoVvcp; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BXl3d0bB6z9sPC for ; Fri, 21 Aug 2020 11:55:57 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727068AbgHUBzx (ORCPT ); Thu, 20 Aug 2020 21:55:53 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40164 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726987AbgHUBzk (ORCPT ); Thu, 20 Aug 2020 21:55:40 -0400 Received: from mail-pl1-x643.google.com (mail-pl1-x643.google.com [IPv6:2607:f8b0:4864:20::643]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 018FCC061342 for ; Thu, 20 Aug 2020 18:55:39 -0700 (PDT) Received: by mail-pl1-x643.google.com with SMTP id bh1so162747plb.12 for ; Thu, 20 Aug 2020 18:55:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=ZFS9GNHqqK+EAPWE0eVP9FHbiH67Dx9oq9Ra6pNYcCI=; b=bmwoVvcpSWObBnr9EEgoPT1MnVXiGhm5p855kmgk37cJwxxOm8nVfeOp8pedCy4tpm qH+ay7OdTi58txT5KyCr1/x4V3zEIbwsSbOXCKjGHEzNlNyOGnxpWTsVn7dgustzg/rG hcIrv9ObQk6kU93kzfoO7B1EDHR0vUiaGwwQU2LyLtqYtqGGIuJNAzeMYmdUn7hEgjoe KYkteZa21nafX2nP0+Km8BWw+C8KA/SC/02ERwRoC6j8PWUWXi8SdlPA4KKv9nzjhVcM 
iLJCj1foJ7J63ONAKhIAwIOyFuufkDhOc/JnXnSjcvfb5MSgNcdHJRqAKr2mlPxfIJby Fiiw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=ZFS9GNHqqK+EAPWE0eVP9FHbiH67Dx9oq9Ra6pNYcCI=; b=gOg21cqmPg2GbsoWm95vvovARhX7duqZtN6ePfwo1KDWfKsvCgXJJzeJtx1yBQ/ZJJ cjbLeeKohkJKGbvKS8/LlgTou8kXxnZK7mScxP2fQv1zSGr5S9ttY6bjKcFgfpmKHMqb NMOQfN7ouH+9HdlI6Rfewu+cCjQujdTy/9fmCUyIPA0pNLssAjWrImUqbxNuGHQALQ0z BLXSmjTv++1TGPEg8mZi9fV7+c+IAtrm/d7zXNiR5RjVSNkuVCJ66BDDdX6QpyYo0zTj Msu5vvf3bYGlYwcGut7lFVXdjTuDMNeuKpsYXqwB8uhRYBo+YhL7wyZkr6k9+jOLweWb PyVA== X-Gm-Message-State: AOAM530i+Ij/Y3bv1MuhLbmstIHACLQpxquKrx7b4gQB+0LKdaqCeHlK jTgwa0XOPFcp7EpOHtiesW3aLsI+OP8= X-Google-Smtp-Source: ABdhPJxYIEVHOWlau1rEVEMJV08UNZqfoWgN3qWZxj6IjTTjgnPuw9J4j+YO4fZNjnqToWOS1e70/w== X-Received: by 2002:a17:902:323:: with SMTP id 32mr540344pld.59.1597974937949; Thu, 20 Aug 2020 18:55:37 -0700 (PDT) Received: from harshads-520.kir.corp.google.com ([2620:15c:17:10:a6ae:11ff:fe11:86a2]) by smtp.googlemail.com with ESMTPSA id o15sm370191pfu.167.2020.08.20.18.55.36 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 20 Aug 2020 18:55:36 -0700 (PDT) From: Harshad Shirwadkar X-Google-Original-From: Harshad Shirwadkar To: linux-ext4@vger.kernel.org Cc: tytso@mit.edu, lyx1209@gmail.com, Harshad Shirwadkar Subject: [RFC PATCH v2 7/9] ext4: add LRU eviction for free space trees Date: Thu, 20 Aug 2020 18:55:21 -0700 Message-Id: <20200821015523.1698374-8-harshads@google.com> X-Mailer: git-send-email 2.28.0.297.g1956fa8f8d-goog In-Reply-To: <20200821015523.1698374-1-harshads@google.com> References: <20200821015523.1698374-1-harshads@google.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Harshad Shirwadkar This patch adds LRU eviction policy for freespace trees. In order to avoid contention on one LRU lock, the LRU scheme is implemented as two lists - an active list and an inactive list. Trees in active list aren't moved inside the list thereby avoiding the need of a lock. Trees in inactive list become active only if they are accessed twice in a short interval thereby avoiding outliers to enter the active list. Signed-off-by: Harshad Shirwadkar --- fs/ext4/ext4.h | 46 +++++++++++++ fs/ext4/mballoc.c | 164 ++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 197 insertions(+), 13 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 93bf2fe35cf1..64d0dbbcd517 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1454,6 +1454,10 @@ struct ext4_frsp_tree { */ struct list_head frsp_list_node; struct rb_node frsp_len_node; + atomic_t frsp_fragments; + struct list_head frsp_lru_active_node; + unsigned long frsp_last_access; + atomic_t frsp_ref; }; /* Freespace tree flags */ @@ -1468,6 +1472,47 @@ struct ext4_frsp_tree { #define EXT4_MB_FRSP_DEFAULT_CACHE_AGGRESSION 0x1 #define EXT4_MB_FRSP_MAX_CACHE_AGGRESSION 0x3 +/* LRU management for free space trees */ +struct ext4_mb_frsp_lru { + rwlock_t frsp_lru_lock; + atomic_t frsp_active_fragments; /* Current #of fragments in + * the active queue + */ + u32 frsp_max_active_fragments; + /* + * List of active trees. Trees at tail are oldest trees in active set. + */ + struct list_head frsp_lru_active; + /* + * List of inactive trees but loaded trees. 
+ */ + struct list_head frsp_lru_inactive; + struct super_block *frsp_lru_sb; +}; + +/* + * Minimum memory needed for our allocator. The pathological worst case for + * freespace trees is when every other block is allocated. In this case, + * the allocator will end up storing an extent node of length 1 for each free + * block. We need to make sure that the minimum memory available for the + * allocator is as much memory needed for 1 worst-case tree. This ensures that + * we can at-least keep 1 tree in memory. This way we avoid thrashing. + */ +#define EXT4_MB_FRSP_MEM_MIN(sb) \ + ((sizeof(struct ext4_frsp_node) * ext4_flex_bg_size(EXT4_SB(sb))) \ + * (EXT4_BLOCKS_PER_GROUP(sb) / 2)) + +/* Half of the total memory available is allowed for active list */ +#define EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) \ + ((EXT4_SB(sb)->s_mb_frsp_mem_limit / \ + (2 * sizeof(struct ext4_frsp_node)))) + +/* + * Maximum number of jiffies allowed between two successive hits on a freespace + * tree before we move it to the "active" queue. + */ +#define EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES 100 + /* * fourth extended-fs super-block data in memory */ @@ -1615,6 +1660,7 @@ struct ext4_sb_info { u32 s_mb_frsp_cache_aggression; atomic_t s_mb_num_fragments; u32 s_mb_frsp_mem_limit; + struct ext4_mb_frsp_lru s_mb_frsp_lru; /* workqueue for reserved extent conversions (buffered io) */ struct workqueue_struct *rsv_conversion_wq; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index b28b7fb0506e..aea0eb8d28da 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -866,10 +866,12 @@ void ext4_mb_frsp_print_tree_len(struct super_block *sb, } #endif -static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb) +static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb, + struct ext4_frsp_tree *tree) { struct ext4_frsp_node *node; struct ext4_sb_info *sbi = EXT4_SB(sb); + struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru; node = kmem_cache_alloc(ext4_freespace_node_cachep, GFP_NOFS); if (!node) @@ -879,6 +881,11 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb) RB_CLEAR_NODE(&node->frsp_len_node); atomic_inc(&sbi->s_mb_num_fragments); + atomic_inc(&tree->frsp_fragments); + read_lock(&lru->frsp_lru_lock); + if (!list_empty(&tree->frsp_lru_active_node)) + atomic_inc(&lru->frsp_active_fragments); + read_unlock(&lru->frsp_lru_lock); if (sbi->s_mb_frsp_mem_limit && atomic_read(&sbi->s_mb_num_fragments) > @@ -892,12 +899,18 @@ static struct ext4_frsp_node *ext4_mb_frsp_alloc_node(struct super_block *sb) } static void ext4_mb_frsp_free_node(struct super_block *sb, - struct ext4_frsp_node *node) + struct ext4_frsp_tree *tree, struct ext4_frsp_node *node) { struct ext4_sb_info *sbi = EXT4_SB(sb); + struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru; kmem_cache_free(ext4_freespace_node_cachep, node); atomic_dec(&sbi->s_mb_num_fragments); + atomic_dec(&tree->frsp_fragments); + read_lock(&lru->frsp_lru_lock); + if (!list_empty(&tree->frsp_lru_active_node)) + atomic_dec(&lru->frsp_active_fragments); + read_unlock(&lru->frsp_lru_lock); if (!sbi->s_mb_frsp_mem_limit || atomic_read(&sbi->s_mb_num_fragments) < @@ -915,6 +928,8 @@ void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree) mutex_lock(&tree->frsp_lock); if (!(tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED)) goto out; + if (atomic_read(&tree->frsp_ref)) + goto out; node = rb_first_cached(&tree->frsp_offset_root); while (node) { @@ -922,7 +937,7 @@ void 
ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree) rb_erase_cached(node, &tree->frsp_offset_root); rb_erase_cached(&frsp_node->frsp_len_node, &tree->frsp_len_root); - ext4_mb_frsp_free_node(sb, frsp_node); + ext4_mb_frsp_free_node(sb, tree, frsp_node); node = rb_first_cached(&tree->frsp_offset_root); } tree->frsp_flags &= ~EXT4_MB_FRSP_FLAG_LOADED; @@ -985,7 +1000,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree, *prev_entry = NULL; struct rb_node *left = NULL, *right = NULL; - new_entry = ext4_mb_frsp_alloc_node(sb); + new_entry = ext4_mb_frsp_alloc_node(sb, tree); if (!new_entry) return -ENOMEM; @@ -1004,7 +1019,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree, rb_erase_cached(left, &tree->frsp_offset_root); rb_erase_cached(&prev_entry->frsp_len_node, &tree->frsp_len_root); - ext4_mb_frsp_free_node(sb, prev_entry); + ext4_mb_frsp_free_node(sb, tree, prev_entry); } } @@ -1017,7 +1032,7 @@ int ext4_mb_frsp_add_region(struct super_block *sb, struct ext4_frsp_tree *tree, rb_erase_cached(right, &tree->frsp_offset_root); rb_erase_cached(&next_entry->frsp_len_node, &tree->frsp_len_root); - ext4_mb_frsp_free_node(sb, next_entry); + ext4_mb_frsp_free_node(sb, tree, next_entry); } } ext4_mb_frsp_insert_len(tree, new_entry); @@ -1601,6 +1616,10 @@ static void ext4_mb_frsp_init_tree(struct ext4_frsp_tree *tree, int index) tree->frsp_flags = 0; tree->frsp_index = index; INIT_LIST_HEAD(&tree->frsp_list_node); + atomic_set(&tree->frsp_fragments, 0); + tree->frsp_last_access = 0; + INIT_LIST_HEAD(&tree->frsp_lru_active_node); + atomic_set(&tree->frsp_ref, 0); } int ext4_mb_init_freespace_trees(struct super_block *sb) @@ -1627,6 +1646,15 @@ int ext4_mb_init_freespace_trees(struct super_block *sb) rwlock_init(&sbi->s_mb_frsp_lock); atomic_set(&sbi->s_mb_num_frsp_trees_cached, 0); atomic_set(&sbi->s_mb_num_fragments, 0); + INIT_LIST_HEAD(&sbi->s_mb_frsp_lru.frsp_lru_active); + INIT_LIST_HEAD(&sbi->s_mb_frsp_lru.frsp_lru_inactive); + rwlock_init(&sbi->s_mb_frsp_lru.frsp_lru_lock); + /* Set the default hard-limit to be as much as buddy bitmaps */ + sbi->s_mb_frsp_mem_limit = ext4_get_groups_count(sb) << + (sb->s_blocksize_bits + 1); + if (sbi->s_mb_frsp_mem_limit < EXT4_MB_FRSP_MEM_MIN(sb)) + sbi->s_mb_frsp_mem_limit = EXT4_MB_FRSP_MEM_MIN(sb); + atomic_set(&sbi->s_mb_frsp_lru.frsp_active_fragments, 0); return 0; } @@ -1664,6 +1692,99 @@ int ext4_mb_add_frsp_trees(struct super_block *sb, int ngroups) return 0; } +/* + * Evict one tree from inactive list. + */ +static void ext4_frsp_evict(struct super_block *sb) +{ + struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru; + struct ext4_frsp_tree *tree = NULL; + bool found = false; + + write_lock(&lru->frsp_lru_lock); + if (!list_empty(&lru->frsp_lru_inactive)) { + /* Evict from front, insert at tail */ + found = 0; + list_for_each_entry(tree, &lru->frsp_lru_inactive, + frsp_list_node) { + if (!atomic_read(&tree->frsp_ref)) { + found = true; + break; + } + } + if (found) + list_del_init(&tree->frsp_list_node); + } + write_unlock(&lru->frsp_lru_lock); + if (found) + ext4_mb_frsp_free_tree(sb, tree); +} + +/* + * This function maintains LRU lists. "tree" has just been accessed. 
+ */ +static void ext4_mb_frsp_maintain_lru(struct super_block *sb, + struct ext4_frsp_tree *tree) +{ + struct ext4_mb_frsp_lru *lru = &EXT4_SB(sb)->s_mb_frsp_lru; + struct ext4_frsp_tree *last; + unsigned long current_jiffies = jiffies; + + read_lock(&lru->frsp_lru_lock); + if (!list_empty(&tree->frsp_lru_active_node)) { + /* Already active, nothing to do */ + read_unlock(&lru->frsp_lru_lock); + goto out; + } + + /* + * Check if this tree needs to be moved to active list. We move it to + * active list if one of the following three conditions is true: + * - This is the first access to this tree + * - We haven't yet reached the max active fragments threshold, so there + * is space in active list. + * - This tree was accessed twice in + * EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES interval. + */ + if (tree->frsp_last_access && + EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) && + atomic_read(&lru->frsp_active_fragments) > + EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) && + current_jiffies - tree->frsp_last_access > + EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES) { + read_unlock(&lru->frsp_lru_lock); + goto out; + } + read_unlock(&lru->frsp_lru_lock); + + write_lock(&lru->frsp_lru_lock); + /* Check again just in case */ + if (!list_empty(&tree->frsp_lru_active_node)) { + write_unlock(&lru->frsp_lru_lock); + goto out; + } + list_add(&tree->frsp_lru_active_node, &lru->frsp_lru_active); + list_del_init(&tree->frsp_list_node); + atomic_add(atomic_read(&tree->frsp_fragments), + &lru->frsp_active_fragments); + /* Remove trees from active queue until we are below the limit */ + while (EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb) && + atomic_read(&lru->frsp_active_fragments) > + EXT4_MB_FRSP_MAX_ACTIVE_FRAGMENTS(sb)) { + last = list_last_entry(&lru->frsp_lru_active, + struct ext4_frsp_tree, frsp_lru_active_node); + list_del_init(&last->frsp_lru_active_node); + /* Evict from front, insert at tail */ + list_add_tail(&last->frsp_list_node, &lru->frsp_lru_inactive); + atomic_sub(atomic_read(&last->frsp_fragments), + &lru->frsp_active_fragments); + } + write_unlock(&lru->frsp_lru_lock); + +out: + tree->frsp_last_access = current_jiffies; +} + /* * Divide blocks started from @first with length @len into * smaller chunks with power of 2 blocks. 
@@ -2154,10 +2275,12 @@ ext4_mb_load_allocator_gfp(struct super_block *sb, ext4_group_t group, e4b->frsp_tree = ext4_get_frsp_tree(sb, ext4_flex_group(sbi, group)); e4b->frsp_flags = flags; + ext4_mb_frsp_maintain_lru(sb, e4b->frsp_tree); if (flags & EXT4_ALLOCATOR_FRSP_NOLOAD) return 0; mutex_lock(&e4b->frsp_tree->frsp_lock); + atomic_inc(&e4b->frsp_tree->frsp_ref); if (e4b->frsp_tree->frsp_flags & EXT4_MB_FRSP_FLAG_LOADED) { mutex_unlock(&e4b->frsp_tree->frsp_lock); return 0; @@ -2285,8 +2408,15 @@ static int ext4_mb_load_allocator(struct super_block *sb, ext4_group_t group, static void ext4_mb_unload_allocator(struct ext4_buddy *e4b) { - if (ext4_mb_frsp_on(e4b->bd_sb)) + if (ext4_mb_frsp_on(e4b->bd_sb)) { + if (!e4b->frsp_tree || + e4b->frsp_flags & EXT4_ALLOCATOR_FRSP_NOLOAD) + return; + atomic_dec(&e4b->frsp_tree->frsp_ref); + if (test_opt2(e4b->bd_sb, FRSP_MEM_CRUNCH)) + ext4_frsp_evict(e4b->bd_sb); return; + } if (e4b->bd_bitmap_page) put_page(e4b->bd_bitmap_page); if (e4b->bd_buddy_page) @@ -2658,7 +2788,8 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex) ex->fe_node->frsp_offset + ex->fe_node->frsp_len) { /* Need to split the node */ - new = ext4_mb_frsp_alloc_node(e4b->bd_sb); + new = ext4_mb_frsp_alloc_node(e4b->bd_sb, + e4b->frsp_tree); if (!new) return -ENOMEM; new->frsp_offset = flex_offset + ex->fe_len; @@ -2675,7 +2806,8 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex) ex->fe_node->frsp_offset += ex->fe_len; ex->fe_node->frsp_len -= ex->fe_len; } else { - ext4_mb_frsp_free_node(e4b->bd_sb, ex->fe_node); + ext4_mb_frsp_free_node(e4b->bd_sb, e4b->frsp_tree, + ex->fe_node); return 0; } @@ -4166,7 +4298,9 @@ static void ext4_free_data_in_buddy(struct super_block *sb, /* No more items in the per group rb tree * balance refcounts from ext4_mb_free_metadata() */ - if (!ext4_mb_frsp_on(sb)) { + if (ext4_mb_frsp_on(sb)) { + atomic_dec(&e4b.frsp_tree->frsp_ref); + } else { put_page(e4b.bd_buddy_page); put_page(e4b.bd_bitmap_page); } @@ -6146,14 +6280,18 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b, new_node = &new_entry->efd_node; cluster = new_entry->efd_start_cluster; - if (!*n && !ext4_mb_frsp_on(sb)) { + if (!*n) { /* first free block exent. 
We need to protect buddy cache from being freed, * otherwise we'll refresh it from * on-disk bitmap and lose not-yet-available * blocks */ - get_page(e4b->bd_buddy_page); - get_page(e4b->bd_bitmap_page); + if (ext4_mb_frsp_on(sb)) { + atomic_inc(&e4b->frsp_tree->frsp_ref); + } else { + get_page(e4b->bd_buddy_page); + get_page(e4b->bd_bitmap_page); + } } while (*n) { parent = *n; From patchwork Fri Aug 21 01:55:22 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: harshad shirwadkar X-Patchwork-Id: 1348760 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=Jl0DsrM/; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BXl3j4lHBz9sPB for ; Fri, 21 Aug 2020 11:56:01 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727048AbgHUB4B (ORCPT ); Thu, 20 Aug 2020 21:56:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40166 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727006AbgHUBzk (ORCPT ); Thu, 20 Aug 2020 21:55:40 -0400 Received: from mail-pl1-x641.google.com (mail-pl1-x641.google.com [IPv6:2607:f8b0:4864:20::641]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 43682C061343 for ; Thu, 20 Aug 2020 18:55:40 -0700 (PDT) Received: by mail-pl1-x641.google.com with SMTP id s14so178281plp.4 for ; Thu, 20 Aug 2020 18:55:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=x7iIRNguo7tU3fAQgSA+TJ9RVxJTULM3GQNAS+Z/n/U=; b=Jl0DsrM/C8FX9KcVPY5JsVzGCFlI1tz8DbDwmbCGVWbhcAX7EWIQ/pH4eU8q/9BCuN IDZOagNs0K58B694Z3faKZ9/xUMJWnuf6ctTxEto8Mm6cpWYm31bNxO/FMbIdzo9UmaM EQRHJjyA2H0idYGXzhKRfkDPdq5Fss5/pIk9BWh22HcjTJ8hBg8fR0ZRKrdzrLDdvkHQ YztNRPLKT7IZoAHP6Ei+ZcuKgVlr+MRi4f182Nxt5IG4NlSuM8trOt3WWeY0X5+RIKBz ETHUn+/kNz+xIExRTTdeihCjpptGJk/P9atKB+ggixyDT30z0FGebTXedFIky3ATuNxX h1fg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=x7iIRNguo7tU3fAQgSA+TJ9RVxJTULM3GQNAS+Z/n/U=; b=j0BGNxnelplhg9TLfcRO3gOZkiBLGnIVglvGr3xayT4MY1eEg+LqMPSBRtUqILozKf nx8VeJcWB5OC+61b90zbwOJ5coCqmNun+tTsdS4BxPXkCPuKNMzCJdzIIifLRUve8K7R 5jOn6/cPbiCAJkKQyefe/LzWwORw6mIs/AGtqEsF1uDkXDbC1jGdNLThERaGYc+abElS iIEcFdgy1q6dCng+VLnhzhTT/L2bd5ZUan8RFYZAt0cAEyYhkhvve6lLBn6604orbK00 w0021ShIbcBllGP8zw8a9BWd8qoReIs4kbpND8aFTwfMm+DRCBOBuUa3YC721iKT4Lya vXHQ== X-Gm-Message-State: AOAM533iBByrFXgZVtdSHVnp6J24BNU+4Nla+7FjiwPsbzfMKvMNDXNd v1af7UikjguJa9pWZFvTWEIbuJSqD5U= X-Google-Smtp-Source: ABdhPJyxfwGaIKUHmcObXg1rJY2Mfhm8VGMbH+nZtzI/NZUa+3fcKd3T/XlUt83IBTpNODCokC7DTA== X-Received: by 2002:a17:902:9009:: with SMTP id a9mr505786plp.252.1597974939324; Thu, 20 Aug 2020 18:55:39 -0700 (PDT) Received: from 
harshads-520.kir.corp.google.com ([2620:15c:17:10:a6ae:11ff:fe11:86a2]) by smtp.googlemail.com with ESMTPSA id o15sm370191pfu.167.2020.08.20.18.55.38 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 20 Aug 2020 18:55:38 -0700 (PDT) From: Harshad Shirwadkar X-Google-Original-From: Harshad Shirwadkar To: linux-ext4@vger.kernel.org Cc: tytso@mit.edu, lyx1209@gmail.com, Harshad Shirwadkar Subject: [RFC PATCH v2 8/9] ext4: add tracepoints for free space trees Date: Thu, 20 Aug 2020 18:55:22 -0700 Message-Id: <20200821015523.1698374-9-harshads@google.com> X-Mailer: git-send-email 2.28.0.297.g1956fa8f8d-goog In-Reply-To: <20200821015523.1698374-1-harshads@google.com> References: <20200821015523.1698374-1-harshads@google.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Harshad Shirwadkar This patch adds a few trace points to track the flow of allocation requests with free space trees. Signed-off-by: Harshad Shirwadkar --- fs/ext4/ext4.h | 1 + fs/ext4/mballoc.c | 9 +++ fs/ext4/mballoc.h | 3 +- include/trace/events/ext4.h | 112 ++++++++++++++++++++++++++++++++++++ 4 files changed, 124 insertions(+), 1 deletion(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 64d0dbbcd517..22850763c5a8 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1661,6 +1661,7 @@ struct ext4_sb_info { atomic_t s_mb_num_fragments; u32 s_mb_frsp_mem_limit; struct ext4_mb_frsp_lru s_mb_frsp_lru; + atomic_t s_mb_ac_id; /* workqueue for reserved extent conversions (buffered io) */ struct workqueue_struct *rsv_conversion_wq; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index aea0eb8d28da..eb99e2fb9a68 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -945,6 +945,7 @@ void ext4_mb_frsp_free_tree(struct super_block *sb, struct ext4_frsp_tree *tree) tree->frsp_len_root = RB_ROOT_CACHED; out: mutex_unlock(&tree->frsp_lock); + trace_ext4_mb_frsp_free_tree(sb, tree->frsp_index); } /* @@ -1476,6 +1477,7 @@ int ext4_mb_frsp_optimize(struct ext4_allocation_context *ac, int *tree_idx) mb_debug(ac->ac_sb, "Optimizer suggestion: found = %d, tree = %d, len = %d, cr = %d\n", found, *tree_idx, better_len, ac->ac_criteria); + trace_ext4_mb_frsp_optimize(ac, found, cache_more_trees, *tree_idx); return ret; } @@ -1546,6 +1548,11 @@ ext4_mb_tree_allocator(struct ext4_allocation_context *ac) ac->ac_groups_scanned++; tree_idx = (j % sbi->s_mb_num_frsp_trees); + trace_ext4_mb_tree_allocator(ac, tree_idx, + sbi->s_mb_num_frsp_trees, + atomic_read(&sbi->s_mb_num_frsp_trees_cached), + atomic_read(&sbi->s_mb_num_fragments) * + sizeof(struct ext4_frsp_node)); ret = ext4_mb_load_allocator(sb, tree_idx * ext4_flex_bg_size(sbi), &e4b, 0); if (ret) @@ -1655,6 +1662,7 @@ int ext4_mb_init_freespace_trees(struct super_block *sb) if (sbi->s_mb_frsp_mem_limit < EXT4_MB_FRSP_MEM_MIN(sb)) sbi->s_mb_frsp_mem_limit = EXT4_MB_FRSP_MEM_MIN(sb); atomic_set(&sbi->s_mb_frsp_lru.frsp_active_fragments, 0); + atomic_set(&sbi->s_mb_ac_id, 0); return 0; } @@ -5794,6 +5802,7 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac, ext4_get_group_no_and_offset(sb, goal, &group, &block); /* set up allocation goals */ + ac->ac_id = atomic_inc_return(&sbi->s_mb_ac_id); ac->ac_b_ex.fe_logical = EXT4_LBLK_CMASK(sbi, ar->logical); ac->ac_status = AC_STATUS_CONTINUE; ac->ac_sb = sb; diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index 6cfb228e4da2..e734c05572b6 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -171,7 +171,6 @@ struct 
ext4_tree_extent { }; struct ext4_allocation_context { - __u32 ac_id; struct inode *ac_inode; struct super_block *ac_sb; @@ -206,6 +205,8 @@ struct ext4_allocation_context { * N > 0, the field stores N, otherwise 0 */ __u8 ac_num_optimizations; __u8 ac_op; /* operation, for history only */ + __u8 ac_id; /* allocation ID for tracking purpose */ + struct page *ac_bitmap_page; struct page *ac_buddy_page; struct ext4_prealloc_space *ac_pa; diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h index 4c8b99ec8606..e5b4c1576097 100644 --- a/include/trace/events/ext4.h +++ b/include/trace/events/ext4.h @@ -874,6 +874,118 @@ TRACE_EVENT(ext4_allocate_blocks, __entry->pleft, __entry->pright) ); +TRACE_EVENT(ext4_mb_frsp_free_tree, + TP_PROTO(struct super_block *sb, int tree), + + TP_ARGS(sb, tree), + + TP_STRUCT__entry( + __field( dev_t, dev ) + __field( int, tree ) + ), + + TP_fast_assign( + __entry->dev = sb->s_dev; + __entry->tree = tree; + ), + + TP_printk("dev %d,%d tree %d", + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tree) +); + +TRACE_EVENT(ext4_mb_frsp_optimize, + TP_PROTO(struct ext4_allocation_context *ac, int found, + int cache_more_trees, int tree), + + TP_ARGS(ac, found, cache_more_trees, tree), + + TP_STRUCT__entry( + __field( dev_t, dev ) + __field( ino_t, ino ) + __field( unsigned int, len ) + __field( unsigned int, flags ) + __field( int, found ) + __field( int, tree ) + __field( int, num_found ) + __field( int, attempts ) + __field( int, ac_id ) + __field( int, ac_criteria ) + __field( int, ac_groups_scanned ) + __field( int, cache_more_trees ) + ), + + TP_fast_assign( + __entry->dev = ac->ac_sb->s_dev; + __entry->ino = ac->ac_inode->i_ino; + __entry->len = ac->ac_b_tree_ex.te_len; + __entry->flags = ac->ac_flags; + __entry->found = found; + __entry->tree = tree; + __entry->attempts = ac->ac_num_optimizations; + __entry->ac_id = ac->ac_id; + __entry->num_found = ac->ac_found; + __entry->ac_criteria = ac->ac_criteria; + __entry->ac_groups_scanned = ac->ac_groups_scanned; + __entry->cache_more_trees = cache_more_trees; + ), + + TP_printk("dev %d,%d ino %lu ac %d flags %s best-len %u num-found %d " + "suggest-tree %d cache_more %d attempt %d status %d cr %d " + "nflexgrps %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, __entry->ac_id, + show_mballoc_flags(__entry->flags), + __entry->len, __entry->num_found, __entry->tree, + __entry->cache_more_trees, __entry->attempts, __entry->found, + __entry->ac_criteria, __entry->ac_groups_scanned) +); + +TRACE_EVENT(ext4_mb_tree_allocator, + TP_PROTO(struct ext4_allocation_context *ac, int tree, int num_trees, + int num_trees_loaded, int bytes_usage), + + TP_ARGS(ac, tree, num_trees, num_trees_loaded, bytes_usage), + + TP_STRUCT__entry( + __field( dev_t, dev ) + __field( ino_t, ino ) + __field( unsigned int, len ) + __field( unsigned int, flags ) + __field( int, num_trees ) + __field( int, num_trees_loaded ) + __field( int, tree ) + __field( int, ac_id ) + __field( int, ac_found ) + __field( int, ac_criteria ) + __field( int, ac_groups_scanned ) + __field( int, bytes_usage ) + ), + + TP_fast_assign( + __entry->dev = ac->ac_sb->s_dev; + __entry->ino = ac->ac_inode->i_ino; + __entry->len = ac->ac_b_tree_ex.te_len; + __entry->flags = ac->ac_flags; + __entry->num_trees = num_trees; + __entry->num_trees_loaded = num_trees_loaded; + __entry->tree = tree; + __entry->ac_id = ac->ac_id; + __entry->ac_found = ac->ac_found; + __entry->ac_criteria = ac->ac_criteria; + __entry->ac_groups_scanned = 
ac->ac_groups_scanned; + __entry->bytes_usage = bytes_usage; + ), + + TP_printk("dev %d,%d ino %lu ac %d flags %s best-len %u found %d current-tree %d " + "num-trees %d num-trees-loaded %d cr %d nflexgrps %d mem_bytes %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, __entry->ac_id, + show_mballoc_flags(__entry->flags), + __entry->len, __entry->ac_found, __entry->tree, __entry->num_trees, + __entry->num_trees_loaded, __entry->ac_criteria, + __entry->ac_groups_scanned, __entry->bytes_usage) +); + TRACE_EVENT(ext4_free_blocks, TP_PROTO(struct inode *inode, __u64 block, unsigned long count, int flags), From patchwork Fri Aug 21 01:55:23 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: harshad shirwadkar X-Patchwork-Id: 1348761 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=G8WryVUC; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4BXl3k25jwz9sPC for ; Fri, 21 Aug 2020 11:56:02 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727006AbgHUB4B (ORCPT ); Thu, 20 Aug 2020 21:56:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40172 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727016AbgHUBzm (ORCPT ); Thu, 20 Aug 2020 21:55:42 -0400 Received: from mail-pj1-x1044.google.com (mail-pj1-x1044.google.com [IPv6:2607:f8b0:4864:20::1044]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9A884C061344 for ; Thu, 20 Aug 2020 18:55:41 -0700 (PDT) Received: by mail-pj1-x1044.google.com with SMTP id q93so251169pjq.0 for ; Thu, 20 Aug 2020 18:55:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=Qk1QyQp/EHlAlgUoyVynmKWx6LwkC5t8DoDAHkTggKk=; b=G8WryVUCMb0evTZIvRRF3+KizEvxY98nsa1N/K7B6PrZ+RJZkIWqI5z6aFFJ03DsoZ LigoJcH3q46xF9rBno52wzRbcMC0fczL9y1iKFIqWdoVZ3uZ93Z7t9jzjBPYxja8k9K8 yE0D3naqed4eO0pjOHhuMTuRxbgORsEcsYDPpz8cEQd3sLIzzAj5QDxJpR2mvK1uwqKp ZV9Q3hpVBI3aAffLRDbkec5abyIbDYtIRLCBaPkJnXllKoDU0A8U/+xYA18iFGg2CO9D 5FtPVCDX/B981hXpZ55xU2MPMWlDaLIqge1HFBTP1CBPHDgVkbZnkbq9vXRFgSG2uNIC g0kg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=Qk1QyQp/EHlAlgUoyVynmKWx6LwkC5t8DoDAHkTggKk=; b=BW75bC52grsVNdefMpmIrZ3bqJx6UkeI39RZKS5MJlkQynVwAcre/8YkURN1aROg7i q6O8UeEX6ygyxS6TqGIDdVPBQMB1MCezGLoX1m4e4eUTu+Q2+fiy/jACekbtv2s0e3hF H2ANPtMGgDPbBAPyKKY6PLIc6EiihbJweaeP6TJieE75vWOqX2JAtuj26VS+oTlK+89h 8SNYeZiPXs7oTNDjv1b5eKCTatgW4z3ezqhdYeggJCmSIdQZOOwqt7eyGPoBaRHohE7g mI41hCxu/H6c2VgNooNotNrnyPOSNn0PyDgBp3iUt7IxGg0HBlYHpiBINMqjx5hp9/51 H5RA== X-Gm-Message-State: AOAM5322DN08RBzHhgBtQf5ThGobJUyZ6q7n0biYeY6kFWxz7Fp08h/6 NECuHjR+9s+ps2mxhlCS8rIhxxNTJvo= 
X-Google-Smtp-Source: ABdhPJyWtNYCibeEvG8mm6AAAQ+S3Mh7ck020M+f8GCfWj7CQOkPD+6TNYWVwH/u4vXP5TX9n4+fGg== X-Received: by 2002:a17:90a:ce94:: with SMTP id g20mr505845pju.61.1597974940594; Thu, 20 Aug 2020 18:55:40 -0700 (PDT) Received: from harshads-520.kir.corp.google.com ([2620:15c:17:10:a6ae:11ff:fe11:86a2]) by smtp.googlemail.com with ESMTPSA id o15sm370191pfu.167.2020.08.20.18.55.39 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 20 Aug 2020 18:55:39 -0700 (PDT) From: Harshad Shirwadkar X-Google-Original-From: Harshad Shirwadkar To: linux-ext4@vger.kernel.org Cc: tytso@mit.edu, lyx1209@gmail.com, Harshad Shirwadkar Subject: [RFC PATCH v2 9/9] ext4: add freespace trees documentation in code Date: Thu, 20 Aug 2020 18:55:23 -0700 Message-Id: <20200821015523.1698374-10-harshads@google.com> X-Mailer: git-send-email 2.28.0.297.g1956fa8f8d-goog In-Reply-To: <20200821015523.1698374-1-harshads@google.com> References: <20200821015523.1698374-1-harshads@google.com> MIME-Version: 1.0 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Harshad Shirwadkar This patch adds a comment in mballoc.c describing the design and flow of free-space trees. Signed-off-by: Harshad Shirwadkar --- fs/ext4/mballoc.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 116 insertions(+) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index eb99e2fb9a68..b5d55a52daff 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -330,6 +330,122 @@ * object * */ + +/* + * The freespace tree allocator + * ----------------------------- + * + * High Level Design + * ================= + * This allocator maintains the free space information about the file system in + * per-flexible block group level trees. For every flexible block group, we + * maintain individual freespace nodes in two trees, one sorted by flex-bg + * offset and other sorted by length. This gives us two benefits: i) In + * allocation path, our search time complexity is O(Log(Number of freespace + * regions in the flex-bg)). ii) In free path, in the same time complexity we + * can merge the adjacent nodes thereby reducing the size of the tree + * efficiently. + * + * Along with flexible block group level trees, we also introduce a top level + * meta-tree which consists of individual trees. This tree is sorted by length + * of largest extents found in flex-bgs. The key advantage that this tree gives + * us is this: During an allocation request, the allocator is now able to + * consult this tree and directly (in O(Log(Number of Flex BGs)) jump to a + * flexible block group which definitely has at least one (the largest) extent + * that can satisfy this request. If no flexible block group exists which can + * satisfy this request, the allocator can now immediately drop the CR level. + * + * In order to preseve the parallel allocation / free performance, the allocator + * only *tries* to consult this tree. It does so by calling read_trylock() + * function and if the meta-tree is busy, the allocator continues its linear + * search until it is able to grab a read-lock on this tree. + * + * Memory Footprint + * ================ + * + * In a less fragmented file system, the memory footprint of the new allocator + * is significantly lower than buddy bitmaps. However, as the fragmentation + * level increases, the memory footprint of this allocator increases + * significantly. 
+ * The memory usage of the freespace tree allocator can be controlled using
+ * the sysfs tunable /sys/fs/ext4//mb_frsp_max_mem. The default value is
+ * max(memory needed for buddy, maximum memory needed for one tree in the
+ * worst case). Accounting for the maximum memory needed for one tree allows
+ * us to keep at least one tree in memory even in the worst case. This avoids
+ * thrashing. Once the memory threshold is reached, the allocator starts
+ * evicting trees from memory in LRU fashion. However, we don't remove the
+ * tree's entry from the meta-tree. This allows the allocator to efficiently
+ * reconstruct only the relevant trees from on-disk bitmaps. In our
+ * evaluations, we found that the freespace tree allocator still manages to
+ * outperform the buddy allocator in memory-crunch situations.
+ *
+ * LRU
+ * ===
+ *
+ * We maintain two lists - an active list and an inactive list of trees.
+ * Trees in the active list stay there until evicted. In order to reduce
+ * contention on the active list lock, once a tree is in the active list it
+ * is not moved within the list. Also, a tree moves from the inactive list to
+ * the active list only if it is accessed twice within an
+ * EXT4_MB_FRSP_ACTIVE_THRESHOLD_JIFFIES window.
+ *
+ * Caching Tree Metadata
+ * =====================
+ * As noted in our experiments, caching tree metadata (the largest available
+ * extent in a tree) in the meta-tree significantly boosts allocation
+ * performance. However, if the allocator waits for the cache to fill up
+ * before starting to serve allocation requests, that may severely degrade
+ * allocation performance on large disks. Thus, it is important to tune the
+ * tree caching behavior according to the underlying block device. This
+ * patchset provides four cache aggression levels. The cache aggression level
+ * can be specified via the mount-time parameter "frsp_cache_aggression".
+ * Here is the meaning of these different levels:
+ *
+ * Cache Aggression 0: Try to serve requests from cached trees only.
+ * Cache Aggression 1 (Default): Try to serve requests from cached trees only
+ *    at CR 0. At all other CRs, try to use an uncached tree for every
+ *    alternate request.
+ * Cache Aggression 2: Try to use an uncached tree for every alternate request
+ *    at all CR levels.
+ * Cache Aggression 3: Try to use uncached trees for every request.
+ *
+ * Moreover, if the file system is mounted with "prefetch_block_bitmaps",
+ * tree caching starts at mount time.
+ *
+ * Locking Order
+ * =============
+ *
+ * - Tree lock
+ * - Meta tree lock (sbi->s_mb_frsp_lock)
+ * - LRU lock
+ *
+ * Critical sections under the meta-tree lock and the LRU lock are kept as
+ * small as possible, and you shouldn't need to take the meta-tree lock and
+ * the LRU lock simultaneously.
+ *
+ * High Level Algorithm
+ * ====================
+ * - Consult the meta-tree, asking which flex-bg the allocator should look at
+ * - One of the following things can happen:
+ *   - The meta-tree may find a suitable flex-bg for the request
+ *     - In this case we start searching in that tree
+ *   - The meta-tree may realize that there is no suitable flex-bg
+ *     - In this case we increase CR and restart
+ *   - Based on the caching mode, the meta-tree may redirect us to an
+ *     uncached tree
+ *   - The meta-tree is busy
+ *     - In this case, we don't wait for the meta-tree to become available;
+ *       instead, we continue our search linearly.
+ * - Pick a tree (either based on what the meta-tree told us, or linearly
+ *   from the last one)
+ * - Manage the LRU structure (ext4_mb_frsp_maintain_lru())
+ *   - Move the tree to the active list if needed
+ *   - Move trees from the active list to the inactive list if needed
+ * - Perform the search by length and pick matching extents
+ * - Repeat until the best extent is found
+ * - Perform the memory-crunch check and evict trees in LRU fashion if needed
+ *   (ext4_mb_unload_allocator())
+ */
+
 static struct kmem_cache *ext4_pspace_cachep;
 static struct kmem_cache *ext4_ac_cachep;
 static struct kmem_cache *ext4_free_data_cachep;
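As a companion to the "High Level Design" section above, here is a small standalone userspace C sketch of the core bookkeeping idea: free regions kept sorted by offset so a freed range can be merged with its neighbours, plus a best-fit lookup by length. The patch keeps two rbtrees per flex-bg for O(log n) behaviour; this sketch deliberately uses a single sorted linked list (O(n)) to stay short. It is an illustration only; the frsp_demo_* names do not exist in the patch.

#include <stdio.h>
#include <stdlib.h>

struct frsp_demo_node {
	unsigned long offset;
	unsigned long len;
	struct frsp_demo_node *next;	/* list kept sorted by offset */
};

/* Return a freed region to the list, merging with touching neighbours. */
static void frsp_demo_free(struct frsp_demo_node **head,
			   unsigned long offset, unsigned long len)
{
	struct frsp_demo_node *prev = NULL, *cur = *head, *n;

	while (cur && cur->offset < offset) {
		prev = cur;
		cur = cur->next;
	}
	if (prev && prev->offset + prev->len == offset) {
		/* Merge backwards into @prev ... */
		prev->len += len;
		/* ... and forwards into @cur if the gap just closed. */
		if (cur && prev->offset + prev->len == cur->offset) {
			prev->len += cur->len;
			prev->next = cur->next;
			free(cur);
		}
		return;
	}
	if (cur && offset + len == cur->offset) {
		/* Merge forwards into @cur. */
		cur->offset = offset;
		cur->len += len;
		return;
	}
	/* No touching neighbour: insert a new node. */
	n = malloc(sizeof(*n));
	if (!n)
		return;
	n->offset = offset;
	n->len = len;
	n->next = cur;
	if (prev)
		prev->next = n;
	else
		*head = n;
}

/* Best-fit lookup: the smallest region that still satisfies @len. */
static struct frsp_demo_node *frsp_demo_best_fit(struct frsp_demo_node *head,
						 unsigned long len)
{
	struct frsp_demo_node *best = NULL;

	for (; head; head = head->next)
		if (head->len >= len && (!best || head->len < best->len))
			best = head;
	return best;
}

int main(void)
{
	struct frsp_demo_node *head = NULL, *hit;

	frsp_demo_free(&head, 0, 8);
	frsp_demo_free(&head, 16, 4);
	frsp_demo_free(&head, 8, 8);	/* coalesces everything into [0, 20) */
	hit = frsp_demo_best_fit(head, 12);
	if (hit)
		printf("best fit: offset %lu len %lu\n", hit->offset, hit->len);
	return 0;
}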
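The four frsp_cache_aggression levels described under "Caching Tree Metadata" boil down to one decision per request. The sketch below is one possible reading of that table, not code from the patch; in particular, counting "every alternate request" with a simple attempt counter is an assumption made purely for the illustration.

#include <stdbool.h>
#include <stdio.h>

/* May this request fall back to a tree that is not yet cached in memory? */
static bool demo_may_use_uncached(int aggression, int cr, unsigned int attempt)
{
	switch (aggression) {
	case 0:
		return false;			/* cached trees only */
	case 1:
		return cr > 0 && (attempt & 1);	/* never at CR 0 */
	case 2:
		return attempt & 1;		/* every alternate request */
	default:
		return true;			/* level 3: every request */
	}
}

int main(void)
{
	printf("level 1, cr 0, attempt 1: %d\n", demo_may_use_uncached(1, 0, 1));
	printf("level 1, cr 1, attempt 1: %d\n", demo_may_use_uncached(1, 1, 1));
	printf("level 3, cr 0, attempt 0: %d\n", demo_may_use_uncached(3, 0, 0));
	return 0;
}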