From patchwork Tue Feb 9 20:28:53 2021
X-Patchwork-Submitter: harshad shirwadkar
X-Patchwork-Id: 1438681
From: Harshad Shirwadkar
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, bzzz@whamcloud.com, artem.blagodarenko@gmail.com,
    sihara@ddn.com, adilger@dilger.ca, Harshad Shirwadkar
Subject: [PATCH v2 1/5] ext4: drop s_mb_bal_lock and convert protected fields to atomic
Date: Tue, 9 Feb 2021 12:28:53 -0800
Message-Id: <20210209202857.4185846-2-harshadshirwadkar@gmail.com>
In-Reply-To: <20210209202857.4185846-1-harshadshirwadkar@gmail.com>

s_mb_buddies_generated gets used later in this patch series to determine
whether the cr 0 and cr 1 optimizations should be performed. Currently,
s_mb_buddies_generated is protected by a spin_lock. In the allocation
path, it is better if we don't depend on the lock and instead read the
value atomically. In order to do that, we drop s_bal_lock altogether and
convert the only two fields it protects, s_mb_buddies_generated and
s_mb_generation_time, to atomic types.

Signed-off-by: Harshad Shirwadkar
Reviewed-by: Andreas Dilger
---
 fs/ext4/ext4.h    |  5 ++---
 fs/ext4/mballoc.c | 13 +++++--------
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 64f25ea2fa7a..6dd127942208 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1552,9 +1552,8 @@ struct ext4_sb_info {
 	atomic_t s_bal_goals;	/* goal hits */
 	atomic_t s_bal_breaks;	/* too long searches */
 	atomic_t s_bal_2orders;	/* 2^order hits */
-	spinlock_t s_bal_lock;
-	unsigned long s_mb_buddies_generated;
-	unsigned long long s_mb_generation_time;
+	atomic_t s_mb_buddies_generated;	/* number of buddies generated */
+	atomic64_t s_mb_generation_time;
 	atomic_t s_mb_lost_chunks;
 	atomic_t s_mb_preallocated;
 	atomic_t s_mb_discarded;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 99bf091fee10..07b78a3cc421 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -816,10 +816,8 @@ void ext4_mb_generate_buddy(struct super_block *sb,
 	clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));

 	period = get_cycles() - period;
-	spin_lock(&sbi->s_bal_lock);
-	sbi->s_mb_buddies_generated++;
-	sbi->s_mb_generation_time += period;
-	spin_unlock(&sbi->s_bal_lock);
+	atomic_inc(&sbi->s_mb_buddies_generated);
+	atomic64_add(period, &sbi->s_mb_generation_time);
 }

 /* The buddy information is attached the buddy cache inode
@@ -2843,7 +2841,6 @@ int ext4_mb_init(struct super_block *sb)
 	} while (i <= sb->s_blocksize_bits + 1);

 	spin_lock_init(&sbi->s_md_lock);
-	spin_lock_init(&sbi->s_bal_lock);
 	sbi->s_mb_free_pending = 0;

 	INIT_LIST_HEAD(&sbi->s_freed_data_list);
@@ -2979,9 +2976,9 @@ int ext4_mb_release(struct super_block *sb)
 				atomic_read(&sbi->s_bal_breaks),
 				atomic_read(&sbi->s_mb_lost_chunks));
 		ext4_msg(sb, KERN_INFO,
-		       "mballoc: %lu generated and it took %Lu",
-				sbi->s_mb_buddies_generated,
-				sbi->s_mb_generation_time);
+		       "mballoc: %u generated and it took %llu",
+				atomic_read(&sbi->s_mb_buddies_generated),
+				atomic64_read(&sbi->s_mb_generation_time));
 		ext4_msg(sb, KERN_INFO,
 		       "mballoc: %u preallocated, %u discarded",
 				atomic_read(&sbi->s_mb_preallocated),
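[Editor's aside, not part of the patch: for readers less familiar with this
conversion pattern, the sketch below (struct and field names invented for
illustration) shows the before/after shape of replacing a spinlock-protected
pair of counters with lock-free atomics. This is safe here because the
counters are pure statistics; no reader ever needs the two values to be
mutually consistent.]

	#include <linux/atomic.h>
	#include <linux/spinlock.h>
	#include <linux/types.h>

	/* Before: both counters share one lock, taken on every update. */
	struct stats_before {
		spinlock_t lock;
		unsigned long events;
		unsigned long long total_time;
	};

	/* After: each counter is updated atomically on its own. */
	struct stats_after {
		atomic_t events;	/* 32 bits is enough for a count */
		atomic64_t total_time;	/* cycle counts need 64 bits */
	};

	static void record_event(struct stats_after *s, u64 elapsed)
	{
		/*
		 * No lock: each field is updated atomically. Acceptable
		 * because readers only want approximate statistics and
		 * never compare the two fields against each other.
		 */
		atomic_inc(&s->events);
		atomic64_add(elapsed, &s->total_time);
	}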
From patchwork Tue Feb 9 20:28:54 2021
X-Patchwork-Submitter: harshad shirwadkar
X-Patchwork-Id: 1438680
From: Harshad Shirwadkar
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, bzzz@whamcloud.com, artem.blagodarenko@gmail.com,
    sihara@ddn.com, adilger@dilger.ca, Harshad Shirwadkar
Subject: [PATCH v2 2/5] ext4: add mballoc stats proc file
Date: Tue, 9 Feb 2021 12:28:54 -0800
Message-Id: <20210209202857.4185846-3-harshadshirwadkar@gmail.com>
In-Reply-To: <20210209202857.4185846-1-harshadshirwadkar@gmail.com>

Add new stats for measuring the performance of mballoc.
This patch is forked from Artem Blagodarenko's work that can be found
here:

https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch

Signed-off-by: Harshad Shirwadkar
Reviewed-by: Andreas Dilger
---
 fs/ext4/ext4.h    |  4 ++++
 fs/ext4/mballoc.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++-
 fs/ext4/mballoc.h |  1 +
 fs/ext4/sysfs.c   |  2 ++
 4 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6dd127942208..317b43420ecf 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1549,6 +1549,8 @@ struct ext4_sb_info {
 	atomic_t s_bal_success;	/* we found long enough chunks */
 	atomic_t s_bal_allocated;	/* in blocks */
 	atomic_t s_bal_ex_scanned;	/* total extents scanned */
+	atomic_t s_bal_groups_considered; /* number of groups considered */
+	atomic_t s_bal_groups_scanned;	/* number of groups scanned */
 	atomic_t s_bal_goals;	/* goal hits */
 	atomic_t s_bal_breaks;	/* too long searches */
 	atomic_t s_bal_2orders;	/* 2^order hits */
@@ -1558,6 +1560,7 @@ struct ext4_sb_info {
 	atomic_t s_mb_preallocated;
 	atomic_t s_mb_discarded;
 	atomic_t s_lock_busy;
+	atomic64_t s_bal_cX_failed[4];	/* cX loop didn't find blocks */

 	/* locality groups */
 	struct ext4_locality_group __percpu *s_locality_groups;
@@ -2808,6 +2811,7 @@ int __init ext4_fc_init_dentry_cache(void);
 extern const struct seq_operations ext4_mb_seq_groups_ops;
 extern long ext4_mb_stats;
 extern long ext4_mb_max_to_scan;
+extern int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset);
 extern int ext4_mb_init(struct super_block *);
 extern int ext4_mb_release(struct super_block *);
 extern ext4_fsblk_t ext4_mb_new_blocks(handle_t *,
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 07b78a3cc421..fffd0770e930 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2083,6 +2083,7 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,

 	BUG_ON(cr < 0 || cr >= 4);
+	ac->ac_groups_considered++;
 	if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
 		return false;
@@ -2420,6 +2421,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			if (ac->ac_status != AC_STATUS_CONTINUE)
 				break;
 		}
+		/* Processed all groups and haven't found blocks */
+		if (sbi->s_mb_stats && i == ngroups)
+			atomic64_inc(&sbi->s_bal_cX_failed[cr]);
 	}

 	if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
@@ -2548,6 +2552,48 @@ const struct seq_operations ext4_mb_seq_groups_ops = {
 	.show   = ext4_mb_seq_groups_show,
 };

+int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
+{
+	struct super_block *sb = (struct super_block *)seq->private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	seq_puts(seq, "mballoc:\n");
+	if (!sbi->s_mb_stats) {
+		seq_puts(seq, "\tmb stats collection turned off.\n");
+		seq_puts(seq, "\tTo enable, please write \"1\" to sysfs file mb_stats.\n");
+		return 0;
+	}
+	seq_printf(seq, "\treqs: %u\n", atomic_read(&sbi->s_bal_reqs));
+	seq_printf(seq, "\tsuccess: %u\n", atomic_read(&sbi->s_bal_success));
+
+	seq_printf(seq, "\tgroups_scanned: %u\n", atomic_read(&sbi->s_bal_groups_scanned));
+	seq_printf(seq, "\tgroups_considered: %u\n", atomic_read(&sbi->s_bal_groups_considered));
+	seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
+	seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals));
+	seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders));
+	seq_printf(seq, "\t\tbreaks: %u\n", atomic_read(&sbi->s_bal_breaks));
+	seq_printf(seq, "\t\tlost: %u\n",
+		   atomic_read(&sbi->s_mb_lost_chunks));
+
+	seq_printf(seq, "\tuseless_c0_loops: %llu\n",
+		   (unsigned long long)atomic64_read(&sbi->s_bal_cX_failed[0]));
+	seq_printf(seq, "\tuseless_c1_loops: %llu\n",
+		   (unsigned long long)atomic64_read(&sbi->s_bal_cX_failed[1]));
+	seq_printf(seq, "\tuseless_c2_loops: %llu\n",
+		   (unsigned long long)atomic64_read(&sbi->s_bal_cX_failed[2]));
+	seq_printf(seq, "\tuseless_c3_loops: %llu\n",
+		   (unsigned long long)atomic64_read(&sbi->s_bal_cX_failed[3]));
+	seq_printf(seq, "\tbuddies_generated: %u/%u\n",
+		   atomic_read(&sbi->s_mb_buddies_generated),
+		   ext4_get_groups_count(sb));
+	seq_printf(seq, "\tbuddies_time_used: %llu\n",
+		   atomic64_read(&sbi->s_mb_generation_time));
+	seq_printf(seq, "\tpreallocated: %u\n",
+		   atomic_read(&sbi->s_mb_preallocated));
+	seq_printf(seq, "\tdiscarded: %u\n",
+		   atomic_read(&sbi->s_mb_discarded));
+	return 0;
+}
+
 static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
 {
 	int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
@@ -2968,9 +3014,10 @@ int ext4_mb_release(struct super_block *sb)
 				atomic_read(&sbi->s_bal_reqs),
 				atomic_read(&sbi->s_bal_success));
 		ext4_msg(sb, KERN_INFO,
-		      "mballoc: %u extents scanned, %u goal hits, "
+		      "mballoc: %u extents scanned, %u groups scanned, %u goal hits, "
 		      "%u 2^N hits, %u breaks, %u lost",
 				atomic_read(&sbi->s_bal_ex_scanned),
+				atomic_read(&sbi->s_bal_groups_scanned),
 				atomic_read(&sbi->s_bal_goals),
 				atomic_read(&sbi->s_bal_2orders),
 				atomic_read(&sbi->s_bal_breaks),
@@ -3579,6 +3626,8 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
 		if (ac->ac_b_ex.fe_len >= ac->ac_o_ex.fe_len)
 			atomic_inc(&sbi->s_bal_success);
 		atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned);
+		atomic_add(ac->ac_groups_scanned, &sbi->s_bal_groups_scanned);
+		atomic_add(ac->ac_groups_considered, &sbi->s_bal_groups_considered);
 		if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
 				ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
 			atomic_inc(&sbi->s_bal_goals);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index e75b4749aa1c..7597330dbdf8 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -161,6 +161,7 @@ struct ext4_allocation_context {
 	/* copy of the best found extent taken before preallocation efforts */
 	struct ext4_free_extent ac_f_ex;

+	__u32 ac_groups_considered;
 	__u16 ac_groups_scanned;
 	__u16 ac_found;
 	__u16 ac_tail;
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 4e27fe6ed3ae..752d1c261e2a 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -527,6 +527,8 @@ int ext4_register_sysfs(struct super_block *sb)
 				ext4_fc_info_show, sb);
 		proc_create_seq_data("mb_groups", S_IRUGO, sbi->s_proc,
 				&ext4_mb_seq_groups_ops, sb);
+		proc_create_single_data("mb_stats", 0444, sbi->s_proc,
+				ext4_seq_mb_stats_show, sb);
 	}
 	return 0;
 }
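[Editor's aside, not part of the patch: the new file is registered in the
per-filesystem ext4 proc directory, so it should be readable as
/proc/fs/ext4/<dev>/mb_stats (device name varies) once collection is enabled
by writing "1" to the mb_stats sysfs tunable mentioned in the code above.
Going purely by the seq_printf format strings, the output should be shaped
roughly like this, with all values invented for illustration:

	mballoc:
		reqs: 1024
		success: 1010
		groups_scanned: 2048
		groups_considered: 4096
		extents_scanned: 8192
			goal_hits: 512
			2^n_hits: 128
			breaks: 16
			lost: 0
		useless_c0_loops: 3
		useless_c1_loops: 2
		useless_c2_loops: 1
		useless_c3_loops: 0
		buddies_generated: 64/64
		buddies_time_used: 123456
		preallocated: 4096
		discarded: 512
]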
From patchwork Tue Feb 9 20:28:55 2021
X-Patchwork-Submitter: harshad shirwadkar
X-Patchwork-Id: 1438679
From: Harshad Shirwadkar
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, bzzz@whamcloud.com, artem.blagodarenko@gmail.com,
    sihara@ddn.com, adilger@dilger.ca, Harshad Shirwadkar
Subject: [PATCH v2 3/5] ext4: add MB_NUM_ORDERS macro
Date: Tue, 9 Feb 2021 12:28:55 -0800
Message-Id: <20210209202857.4185846-4-harshadshirwadkar@gmail.com>
In-Reply-To: <20210209202857.4185846-1-harshadshirwadkar@gmail.com>

A few arrays in mballoc.c use the total number of valid orders as their
size. Currently, this value is spelled out as "sb->s_blocksize_bits + 2",
which makes the code harder to read. So, instead, add a new macro
MB_NUM_ORDERS(sb) to make the code more readable.
Signed-off-by: Harshad Shirwadkar
Reviewed-by: Andreas Dilger
---
 fs/ext4/mballoc.c | 15 ++++++++-------
 fs/ext4/mballoc.h |  5 +++++
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index fffd0770e930..b7f25120547d 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -756,7 +756,7 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)

 	grp->bb_largest_free_order = -1; /* uninit */

-	bits = sb->s_blocksize_bits + 1;
+	bits = MB_NUM_ORDERS(sb) - 1;
 	for (i = bits; i >= 0; i--) {
 		if (grp->bb_counters[i] > 0) {
 			grp->bb_largest_free_order = i;
@@ -1928,7 +1928,7 @@ void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac,
 	int max;

 	BUG_ON(ac->ac_2order <= 0);
-	for (i = ac->ac_2order; i <= sb->s_blocksize_bits + 1; i++) {
+	for (i = ac->ac_2order; i < MB_NUM_ORDERS(sb); i++) {
 		if (grp->bb_counters[i] == 0)
 			continue;
@@ -2314,13 +2314,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	 * We also support searching for power-of-two requests only for
 	 * requests upto maximum buddy size we have constructed.
 	 */
-	if (i >= sbi->s_mb_order2_reqs && i <= sb->s_blocksize_bits + 2) {
+	if (i >= sbi->s_mb_order2_reqs && i <= MB_NUM_ORDERS(sb)) {
 		/*
 		 * This should tell if fe_len is exactly power of 2
 		 */
 		if ((ac->ac_g_ex.fe_len & (~(1 << (i - 1)))) == 0)
 			ac->ac_2order = array_index_nospec(i - 1,
-							   sb->s_blocksize_bits + 2);
+							   MB_NUM_ORDERS(sb));
 	}

 	/* if stream allocation is enabled, use global goal */
@@ -2850,7 +2850,7 @@ int ext4_mb_init(struct super_block *sb)
 	unsigned max;
 	int ret;

-	i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_offsets);
+	i = MB_NUM_ORDERS(sb) * sizeof(*sbi->s_mb_offsets);

 	sbi->s_mb_offsets = kmalloc(i, GFP_KERNEL);
 	if (sbi->s_mb_offsets == NULL) {
@@ -2858,7 +2858,7 @@ int ext4_mb_init(struct super_block *sb)
 		goto out;
 	}

-	i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_maxs);
+	i = MB_NUM_ORDERS(sb) * sizeof(*sbi->s_mb_maxs);
 	sbi->s_mb_maxs = kmalloc(i, GFP_KERNEL);
 	if (sbi->s_mb_maxs == NULL) {
 		ret = -ENOMEM;
@@ -2884,7 +2884,8 @@ int ext4_mb_init(struct super_block *sb)
 		offset_incr = offset_incr >> 1;
 		max = max >> 1;
 		i++;
-	} while (i <= sb->s_blocksize_bits + 1);
+	} while (i < MB_NUM_ORDERS(sb));
+
 	spin_lock_init(&sbi->s_md_lock);
 	sbi->s_mb_free_pending = 0;
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 7597330dbdf8..02861406932f 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -78,6 +78,11 @@
  */
 #define MB_DEFAULT_MAX_INODE_PREALLOC	512

+/*
+ * Number of valid buddy orders
+ */
+#define MB_NUM_ORDERS(sb)		((sb)->s_blocksize_bits + 2)
+
 struct ext4_free_data {
 	/* this links the free block information from sb_info */
 	struct list_head efd_list;
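[Editor's aside, not part of the patch: to make the macro's value concrete,
here is a small standalone sketch (illustration only; it takes
s_blocksize_bits directly instead of a super_block pointer). For 4 KiB
blocks, s_blocksize_bits is 12, so there are 14 valid orders (0 through 13)
and the largest buddy chunk is 2^13 = 8192 blocks.]

	#include <stdio.h>

	/* Stand-in for the kernel macro, taking s_blocksize_bits directly. */
	#define MB_NUM_ORDERS_BITS(bits)	((bits) + 2)

	int main(void)
	{
		/* 1 KiB, 2 KiB and 4 KiB block sizes */
		int bits[] = { 10, 11, 12 };

		for (int i = 0; i < 3; i++) {
			int n = MB_NUM_ORDERS_BITS(bits[i]);

			/*
			 * Valid orders are 0 .. n-1, so the largest buddy
			 * chunk is 2^(n-1) blocks.
			 */
			printf("block bits %d: %d orders, max chunk %d blocks\n",
			       bits[i], n, 1 << (n - 1));
		}
		return 0;
	}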
From patchwork Tue Feb 9 20:28:56 2021
X-Patchwork-Submitter: harshad shirwadkar
X-Patchwork-Id: 1438697
From: Harshad Shirwadkar
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, bzzz@whamcloud.com, artem.blagodarenko@gmail.com,
    sihara@ddn.com, adilger@dilger.ca, Harshad Shirwadkar
Subject: [PATCH v2 4/5] ext4: improve cr 0 / cr 1 group scanning
Date: Tue, 9 Feb 2021 12:28:56 -0800
Message-Id: <20210209202857.4185846-5-harshadshirwadkar@gmail.com>
In-Reply-To: <20210209202857.4185846-1-harshadshirwadkar@gmail.com>

Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch, we
maintain lists for each possible order and insert each group into a list
based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders.
This allows us to find a group with the best available cr 0 match in
constant time. If nothing can be found, we fall back to cr 1 immediately.

At cr 1, the story is slightly different. We want to traverse groups in
order of increasing average fragment size, so we maintain an rb tree of
group infos sorted by average fragment size. At cr 1, instead of
traversing linearly, we walk this tree starting at the most optimal
group. This brings cr 1 search complexity down to log(num groups).

For cr >= 2, we just perform the linear search as before. Also, in case
of lock contention, we intermittently fall back to linear search even in
the cr 0 and cr 1 cases. This allows the allocation path to make
progress even under high contention.

There is an opportunity to optimize at cr 2 too, since at cr 2 we only
consider groups where the bb_free counter (number of free blocks) is
greater than the requested extent size. That is left as future work.

All the changes introduced in this patch are protected under a new mount
option "mb_optimize_scan".

Signed-off-by: Harshad Shirwadkar
Reported-by: kernel test robot
Reported-by: Dan Carpenter
---
 fs/ext4/ext4.h    |  13 +-
 fs/ext4/mballoc.c | 316 ++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/mballoc.h |   1 +
 fs/ext4/super.c   |   6 +-
 4 files changed, 322 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 317b43420ecf..0601c997c87f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -162,6 +162,8 @@ enum SHIFT_DIRECTION {
 #define EXT4_MB_USE_RESERVED		0x2000
 /* Do strict check for free blocks while retrying block allocation */
 #define EXT4_MB_STRICT_CHECK		0x4000
+/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
+#define EXT4_MB_CR1_OPTIMIZED		0x8000

 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
@@ -1247,7 +1249,9 @@ struct ext4_inode_info {
 #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT	0x00000010 /* Journal fast commit */
 #define EXT4_MOUNT2_DAX_NEVER		0x00000020 /* Do not allow Direct Access */
 #define EXT4_MOUNT2_DAX_INODE		0x00000040 /* For printing options only */
-
+#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN	0x00000080 /* Optimize group
+						    * scanning in mballoc
+						    */
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
 						~EXT4_MOUNT_##opt
@@ -1527,6 +1531,10 @@ struct ext4_sb_info {
 	unsigned int s_mb_free_pending;
 	struct list_head s_freed_data_list;	/* List of blocks to be freed
 						   after commit completed */
+	struct rb_root s_mb_avg_fragment_size_root;
+	rwlock_t s_mb_rb_lock;
+	struct list_head *s_mb_largest_free_orders;
+	rwlock_t *s_mb_largest_free_orders_locks;

 	/* tunables */
 	unsigned long s_stripe;
@@ -3308,11 +3316,14 @@ struct ext4_group_info {
 	ext4_grpblk_t	bb_free;	/* total free blocks */
 	ext4_grpblk_t	bb_fragments;	/* nr of freespace fragments */
 	ext4_grpblk_t	bb_largest_free_order;/* order of largest frag in BG */
+	ext4_group_t	bb_group;	/* Group number */
 	struct          list_head bb_prealloc_list;
 #ifdef DOUBLE_CHECK
 	void            *bb_bitmap;
 #endif
 	struct rw_semaphore alloc_sem;
+	struct rb_node	bb_avg_fragment_size_rb;
+	struct list_head bb_largest_free_order_node;
 	ext4_grpblk_t	bb_counters[];	/* Nr of free power-of-two-block
 					 * regions, index is order.
 					 * bb_counters[3] = 5 means
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b7f25120547d..63562f5f42f1 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -147,7 +147,12 @@
 * the group specified as the goal value in allocation context via
 * ac_g_ex. Each group is first checked based on the criteria whether it
 * can be used for allocation. ext4_mb_good_group explains how the groups are
- * checked.
+ * checked. If "mb_optimize_scan" mount option is set, instead of traversing
+ * groups linearly starting at the goal, the groups are traversed in an optimal
+ * order according to each cr level, so as to minimize considering groups which
+ * would anyway be rejected by ext4_mb_good_group. This has a side effect
+ * though - subsequent allocations may not be close to each other. And so,
+ * the underlying device may get filled up in a non-linear fashion.
 *
 * Both the prealloc space are getting populated as above. So for the first
 * request we will hit the buddy cache which will result in this prealloc
@@ -299,6 +304,8 @@
 *  - bitlock on a group	(group)
 *  - object (inode/locality)	(object)
 *  - per-pa lock		(pa)
+ *  - cr0 lists lock		(cr0)
+ *  - cr1 tree lock		(cr1)
 *
 * Paths:
 *  - new pa
@@ -328,6 +335,9 @@
 *    group
 *        object
 *
+ *  - allocation path (ext4_mb_regular_allocator)
+ *    group
+ *    cr0/cr1
 */
 static struct kmem_cache *ext4_pspace_cachep;
 static struct kmem_cache *ext4_ac_cachep;
@@ -351,6 +361,9 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
 						ext4_group_t group);
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);

+static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
+			       ext4_group_t group, int cr);
+
 /*
 * The algorithm using this percpu seq counter goes below:
 * 1. We sample the percpu discard_pa_seq counter before trying for block
@@ -744,6 +757,243 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
 	}
 }

+static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
+			int (*cmp)(struct rb_node *, struct rb_node *))
+{
+	struct rb_node **iter = &root->rb_node, *parent = NULL;
+
+	while (*iter) {
+		parent = *iter;
+		if (cmp(new, *iter))
+			iter = &((*iter)->rb_left);
+		else
+			iter = &((*iter)->rb_right);
+	}
+
+	rb_link_node(new, parent, iter);
+	rb_insert_color(new, root);
+}
+
+static int
+ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
+{
+	struct ext4_group_info *grp1 = rb_entry(rb1,
+						struct ext4_group_info,
+						bb_avg_fragment_size_rb);
+	struct ext4_group_info *grp2 = rb_entry(rb2,
+						struct ext4_group_info,
+						bb_avg_fragment_size_rb);
+	int num_frags_1, num_frags_2;
+
+	num_frags_1 = grp1->bb_fragments ?
+		grp1->bb_free / grp1->bb_fragments : 0;
+	num_frags_2 = grp2->bb_fragments ?
+		grp2->bb_free / grp2->bb_fragments : 0;
+
+	return (num_frags_1 < num_frags_2);
+}
+
+/*
+ * Reinsert grpinfo into the avg_fragment_size tree with new average
+ * fragment size.
+ */
+static void
+mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	if (!test_opt2(sb, MB_OPTIMIZE_SCAN))
+		return;
+
+	write_lock(&sbi->s_mb_rb_lock);
+	if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
+		rb_erase(&grp->bb_avg_fragment_size_rb,
+				&sbi->s_mb_avg_fragment_size_root);
+		RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
+	}
+
+	ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
+		&grp->bb_avg_fragment_size_rb,
+		ext4_mb_avg_fragment_size_cmp);
+	write_unlock(&sbi->s_mb_rb_lock);
+}
+
+/*
+ * Choose next group by traversing largest_free_order lists. Return 0 if next
+ * group was selected optimally. Return 1 if next group was not selected
+ * optimally. Updates *new_cr if cr level needs an update.
+ */
+static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
+			int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	struct ext4_group_info *iter, *grp;
+	int i;
+
+	if (ac->ac_status == AC_STATUS_FOUND)
+		return 1;
+
+	grp = NULL;
+	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
+		if (list_empty(&sbi->s_mb_largest_free_orders[i]))
+			continue;
+		read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
+		if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
+			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+			continue;
+		}
+		grp = NULL;
+		list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
+				    bb_largest_free_order_node) {
+			/*
+			 * Perform this check without a lock, once we lock
+			 * the group, we'll perform this check again.
+			 */
+			if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
+				grp = iter;
+				break;
+			}
+		}
+		read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+		if (grp)
+			break;
+	}
+
+	if (!grp) {
+		/* Increment cr and search again */
+		*new_cr = 1;
+	} else {
+		*group = grp->bb_group;
+		ac->ac_last_optimal_group = *group;
+	}
+	return 0;
+}
+
+/*
+ * Choose next group by traversing average fragment size tree. Return 0 if next
+ * group was selected optimally. Return 1 if next group could not be selected
+ * optimally (due to lock contention). Updates *new_cr if cr level needs an
+ * update.
+ */
+static int ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
+			int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	int avg_fragment_size, best_so_far;
+	struct rb_node *node, *found;
+	struct ext4_group_info *grp;
+
+	/*
+	 * If there is contention on the lock, instead of waiting for the lock
+	 * to become available, just continue searching linearly. We'll resume
+	 * our rb tree search later starting at ac->ac_last_optimal_group.
+	 */
+	if (!read_trylock(&sbi->s_mb_rb_lock))
+		return 1;
+
+	if (ac->ac_flags & EXT4_MB_CR1_OPTIMIZED) {
+		/* We have found something at CR 1 in the past */
+		grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
+		for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
+		     found = rb_next(found)) {
+			grp = rb_entry(found, struct ext4_group_info,
+				       bb_avg_fragment_size_rb);
+			/*
+			 * Perform this check without locking, we'll lock later
+			 * to confirm.
+			 */
+			if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
+				break;
+		}
+
+		goto done;
+	}
+
+	node = sbi->s_mb_avg_fragment_size_root.rb_node;
+	best_so_far = 0;
+	found = NULL;
+
+	while (node) {
+		grp = rb_entry(node, struct ext4_group_info,
+			       bb_avg_fragment_size_rb);
+		/*
+		 * Perform this check without locking, we'll lock later to confirm.
+		 */
+		if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
+			avg_fragment_size = grp->bb_fragments ?
+				grp->bb_free / grp->bb_fragments : 0;
+			if (!best_so_far || avg_fragment_size < best_so_far) {
+				best_so_far = avg_fragment_size;
+				found = node;
+			}
+		}
+		if (avg_fragment_size > ac->ac_g_ex.fe_len)
+			node = node->rb_right;
+		else
+			node = node->rb_left;
+	}
+
+done:
+	if (found) {
+		grp = rb_entry(found, struct ext4_group_info,
+			       bb_avg_fragment_size_rb);
+		*group = grp->bb_group;
+		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
+	} else {
+		*new_cr = 2;
+	}
+
+	read_unlock(&sbi->s_mb_rb_lock);
+	ac->ac_last_optimal_group = *group;
+	return 0;
+}
+
+/*
+ * ext4_mb_choose_next_group: choose next group for allocation.
+ *
+ * @ac        Allocation Context
+ * @new_cr    This is an output parameter. If there is no good group available
+ *            at the current cr level, this field is updated to indicate the
+ *            new cr level that should be used.
+ * @group     This is an input / output parameter. As an input it indicates the
+ *            last group used for allocation. As output, this field indicates
+ *            the next group that should be used.
+ * @ngroups   Total number of groups
+ */
+static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
+		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+	int ret;
+
+	*new_cr = ac->ac_criteria;
+
+	if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
+	    *new_cr >= 2 ||
+	    !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
+		goto inc_and_return;
+
+	if (*new_cr == 0) {
+		ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
+		if (ret)
+			goto inc_and_return;
+	}
+	if (*new_cr == 1) {
+		ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
+		if (ret)
+			goto inc_and_return;
+	}
+	return;
+
+inc_and_return:
+	/*
+	 * Artificially restricted ngroups for non-extent
+	 * files makes group > ngroups possible on first loop.
+	 */
+	*group = *group + 1;
+	if (*group >= ngroups)
+		*group = 0;
+}
+
 /*
 * Cache the order of the largest free extent we have available in this block
 * group.
@@ -751,18 +1001,32 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
 static void
 mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 {
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	int i;
-	int bits;

+	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
+		write_lock(&sbi->s_mb_largest_free_orders_locks[
+					      grp->bb_largest_free_order]);
+		list_del_init(&grp->bb_largest_free_order_node);
+		write_unlock(&sbi->s_mb_largest_free_orders_locks[
+					      grp->bb_largest_free_order]);
+	}
 	grp->bb_largest_free_order = -1; /* uninit */

-	bits = MB_NUM_ORDERS(sb) - 1;
-	for (i = bits; i >= 0; i--) {
+	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
 		if (grp->bb_counters[i] > 0) {
 			grp->bb_largest_free_order = i;
 			break;
 		}
 	}
+	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
+		write_lock(&sbi->s_mb_largest_free_orders_locks[
+					      grp->bb_largest_free_order]);
+		list_add_tail(&grp->bb_largest_free_order_node,
+		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
+		write_unlock(&sbi->s_mb_largest_free_orders_locks[
+					      grp->bb_largest_free_order]);
+	}
 }

 static noinline_for_stack
@@ -818,6 +1082,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
 	period = get_cycles() - period;
 	atomic_inc(&sbi->s_mb_buddies_generated);
 	atomic64_add(period, &sbi->s_mb_generation_time);
+	mb_update_avg_fragment_size(sb, grp);
 }

 /* The buddy information is attached the buddy cache inode
@@ -1517,6 +1782,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,

 done:
 	mb_set_largest_free_order(sb, e4b->bd_info);
+	mb_update_avg_fragment_size(sb, e4b->bd_info);
 	mb_check_buddy(e4b);
 }

@@ -1653,6 +1919,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 	}
 	mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
+	mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);

 	ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
 	mb_check_buddy(e4b);

@@ -2346,17 +2613,20 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	 * from the goal value specified
 	 */
 	group = ac->ac_g_ex.fe_group;
+	ac->ac_last_optimal_group = group;
 	prefetch_grp = group;

-	for (i = 0; i < ngroups; group++, i++) {
-		int ret = 0;
+	for (i = 0; i < ngroups; i++) {
+		int ret = 0, new_cr;
+
 		cond_resched();
-		/*
-		 * Artificially restricted ngroups for non-extent
-		 * files makes group > ngroups possible on first loop.
- */ - if (group >= ngroups) - group = 0; + + ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups); + + if (new_cr != cr) { + cr = new_cr; + goto repeat; + } /* * Batch reads of the block allocation bitmaps @@ -2696,7 +2966,10 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group, INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list); init_rwsem(&meta_group_info[i]->alloc_sem); meta_group_info[i]->bb_free_root = RB_ROOT; + INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node); + RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb); meta_group_info[i]->bb_largest_free_order = -1; /* uninit */ + meta_group_info[i]->bb_group = group; mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group); return 0; @@ -2886,6 +3159,22 @@ int ext4_mb_init(struct super_block *sb) i++; } while (i < MB_NUM_ORDERS(sb)); + sbi->s_mb_avg_fragment_size_root = RB_ROOT; + sbi->s_mb_largest_free_orders = + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head), + GFP_KERNEL); + if (!sbi->s_mb_largest_free_orders) + goto out; + sbi->s_mb_largest_free_orders_locks = + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t), + GFP_KERNEL); + if (!sbi->s_mb_largest_free_orders_locks) + goto out; + for (i = 0; i < MB_NUM_ORDERS(sb); i++) { + INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]); + rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]); + } + rwlock_init(&sbi->s_mb_rb_lock); spin_lock_init(&sbi->s_md_lock); sbi->s_mb_free_pending = 0; @@ -2949,6 +3238,8 @@ int ext4_mb_init(struct super_block *sb) free_percpu(sbi->s_locality_groups); sbi->s_locality_groups = NULL; out: + kfree(sbi->s_mb_largest_free_orders); + kfree(sbi->s_mb_largest_free_orders_locks); kfree(sbi->s_mb_offsets); sbi->s_mb_offsets = NULL; kfree(sbi->s_mb_maxs); @@ -3005,6 +3296,7 @@ int ext4_mb_release(struct super_block *sb) kvfree(group_info); rcu_read_unlock(); } + kfree(sbi->s_mb_largest_free_orders); kfree(sbi->s_mb_offsets); kfree(sbi->s_mb_maxs); iput(sbi->s_buddy_cache); diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index 02861406932f..1e86a8a0460d 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -166,6 +166,7 @@ struct ext4_allocation_context { /* copy of the best found extent taken before preallocation efforts */ struct ext4_free_extent ac_f_ex; + ext4_group_t ac_last_optimal_group; __u32 ac_groups_considered; __u16 ac_groups_scanned; __u16 ac_found; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 0f0db49031dc..a14363654cfd 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -154,6 +154,7 @@ static inline void __ext4_read_bh(struct buffer_head *bh, int op_flags, clear_buffer_verified(bh); bh->b_end_io = end_io ? 
+	get_bh(bh);
 	submit_bh(REQ_OP_READ, op_flags, bh);
 }

@@ -1687,7 +1688,7 @@ enum {
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
-	Opt_prefetch_block_bitmaps,
+	Opt_prefetch_block_bitmaps, Opt_mb_optimize_scan,
 #ifdef CONFIG_EXT4_DEBUG
 	Opt_fc_debug_max_replay, Opt_fc_debug_force
 #endif
@@ -1788,6 +1789,7 @@ static const match_table_t tokens = {
 	{Opt_nombcache, "nombcache"},
 	{Opt_nombcache, "no_mbcache"},	/* for backward compatibility */
 	{Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
+	{Opt_mb_optimize_scan, "mb_optimize_scan"},
 	{Opt_removed, "check=none"},	/* mount option from ext2/3 */
 	{Opt_removed, "nocheck"},	/* mount option from ext2/3 */
 	{Opt_removed, "reservation"},	/* mount option from ext2/3 */
@@ -2008,6 +2010,8 @@ static const struct mount_opts {
 	{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
 	{Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
 	 MOPT_SET},
+	{Opt_mb_optimize_scan, EXT4_MOUNT2_MB_OPTIMIZE_SCAN,
+	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
 #ifdef CONFIG_EXT4_DEBUG
 	{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
 	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
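[Editor's aside, not part of the patch. The new behavior is opt-in; assuming
a device and mount point as placeholders, it would be enabled with something
like:

	mount -o mb_optimize_scan /dev/sda1 /mnt

To make the cr 1 ordering concrete, here is a tiny userspace sketch of the
sort key the rb tree uses (field names borrowed from struct ext4_group_info;
the struct itself and the values are invented). Groups are ordered by
bb_free / bb_fragments, so a group whose free space sits in a few large
chunks sorts after a group whose space is shredded into many small
fragments:

	#include <stdio.h>

	/* Stand-in for struct ext4_group_info, keeping only the two
	 * fields the cr 1 sort key reads.
	 */
	struct group_stub {
		int bb_free;		/* total free blocks */
		int bb_fragments;	/* nr of free-space fragments */
	};

	/* Same expression the patch uses as the rb tree key. */
	static int avg_fragment_size(const struct group_stub *g)
	{
		return g->bb_fragments ? g->bb_free / g->bb_fragments : 0;
	}

	int main(void)
	{
		struct group_stub a = { .bb_free = 1024, .bb_fragments = 4 };
		struct group_stub b = { .bb_free = 1024, .bb_fragments = 128 };

		/* avg(a) = 256, avg(b) = 8: a's free space is chunkier, so
		 * a sorts after b and is the better candidate for larger
		 * requests.
		 */
		printf("avg(a)=%d avg(b)=%d\n",
		       avg_fragment_size(&a), avg_fragment_size(&b));
		return 0;
	}
]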
From patchwork Tue Feb 9 20:28:57 2021
X-Patchwork-Submitter: harshad shirwadkar
X-Patchwork-Id: 1438677
From: Harshad Shirwadkar
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, bzzz@whamcloud.com, artem.blagodarenko@gmail.com,
    sihara@ddn.com, adilger@dilger.ca, Harshad Shirwadkar
Subject: [PATCH v2 5/5] ext4: add proc files to monitor new structures
Date: Tue, 9 Feb 2021 12:28:57 -0800
Message-Id: <20210209202857.4185846-6-harshadshirwadkar@gmail.com>
In-Reply-To: <20210209202857.4185846-1-harshadshirwadkar@gmail.com>

This patch adds a new file "mb_structs_summary" which allows us to see
the summary of the new allocator structures added in this series.
Signed-off-by: Harshad Shirwadkar
---
 fs/ext4/ext4.h    |  1 +
 fs/ext4/mballoc.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/sysfs.c   |  2 ++
 3 files changed, 87 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0601c997c87f..39830c07c27e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2817,6 +2817,7 @@ int __init ext4_fc_init_dentry_cache(void);

 /* mballoc.c */
 extern const struct seq_operations ext4_mb_seq_groups_ops;
+extern const struct seq_operations ext4_mb_seq_structs_summary_ops;
 extern long ext4_mb_stats;
 extern long ext4_mb_max_to_scan;
 extern int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 63562f5f42f1..d9cb74787a47 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2864,6 +2864,90 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 	return 0;
 }

+static void *ext4_mb_seq_structs_summary_start(struct seq_file *seq, loff_t *pos)
+{
+	struct super_block *sb = PDE_DATA(file_inode(seq->file));
+	unsigned long position;
+
+	read_lock(&EXT4_SB(sb)->s_mb_rb_lock);
+
+	if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+		return NULL;
+	position = *pos + 1;
+	return (void *) ((unsigned long) position);
+}
+
+static void *ext4_mb_seq_structs_summary_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct super_block *sb = PDE_DATA(file_inode(seq->file));
+	unsigned long position;
+
+	++*pos;
+	if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+		return NULL;
+	position = *pos + 1;
+	return (void *) ((unsigned long) position);
+}
+
+static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = PDE_DATA(file_inode(seq->file));
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	unsigned long position = ((unsigned long) v);
+	struct ext4_group_info *grp;
+	struct rb_node *n;
+	int count, min, max;
+
+	position--;
+	if (position >= MB_NUM_ORDERS(sb)) {
+		seq_puts(seq, "Tree\n");
+		n = rb_first(&sbi->s_mb_avg_fragment_size_root);
+		if (!n) {
+			seq_puts(seq, "\n");
+			return 0;
+		}
+		grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
+		min = grp->bb_fragments ? grp->bb_free / grp->bb_fragments : 0;
+		count = 1;
+		while (rb_next(n)) {
+			count++;
+			n = rb_next(n);
+		}
+		grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
+		max = grp->bb_fragments ?
+			grp->bb_free / grp->bb_fragments : 0;
+
+		seq_printf(seq, "Min: %d, Max: %d, Num Nodes: %d\n",
+			   min, max, count);
+		return 0;
+	}
+
+	if (position == 0)
+		seq_puts(seq, "Largest Free Order Lists:\n");
+
+	seq_printf(seq, "Order %ld list: ", position);
+	count = 0;
+	list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[position],
+			    bb_largest_free_order_node)
+		count++;
+	seq_printf(seq, "%d Groups\n", count);
+	return 0;
+}
+
+static void ext4_mb_seq_structs_summary_stop(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = PDE_DATA(file_inode(seq->file));
+
+	read_unlock(&EXT4_SB(sb)->s_mb_rb_lock);
+}
+
+const struct seq_operations ext4_mb_seq_structs_summary_ops = {
+	.start	= ext4_mb_seq_structs_summary_start,
+	.next	= ext4_mb_seq_structs_summary_next,
+	.stop	= ext4_mb_seq_structs_summary_stop,
+	.show	= ext4_mb_seq_structs_summary_show,
+};
+
 static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
 {
 	int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 752d1c261e2a..b78bc6b57bce 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -529,6 +529,8 @@ int ext4_register_sysfs(struct super_block *sb)
 				&ext4_mb_seq_groups_ops, sb);
 		proc_create_single_data("mb_stats", 0444, sbi->s_proc,
 				ext4_seq_mb_stats_show, sb);
+		proc_create_seq_data("mb_structs_summary", 0444, sbi->s_proc,
+				&ext4_mb_seq_structs_summary_ops, sb);
 	}
 	return 0;
 }
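[Editor's aside, not part of the patch: going by the seq_printf format
strings above, reading the new file, e.g.

	cat /proc/fs/ext4/sda1/mb_structs_summary

(device name is a placeholder) should produce output shaped roughly like the
sample below. The values are invented; with 4 KiB blocks there would be 14
order lines, one per valid buddy order, followed by the avg-fragment-size
tree summary:

	Largest Free Order Lists:
	Order 0 list: 3 Groups
	Order 1 list: 0 Groups
	...
	Order 13 list: 42 Groups
	Tree
	Min: 4, Max: 2048, Num Nodes: 64
]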