From patchwork Wed Nov 20 10:35:38 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Alex Zhuravlev
X-Patchwork-Id: 1198015
From: Alex Zhuravlev
To: "linux-ext4@vger.kernel.org"
Subject: [RFC] improve malloc for large filesystems
Date: Wed, 20 Nov 2019 10:35:38 +0000
Message-ID: <8738E8FF-820F-48A5-9150-7FF64219ED42@whamcloud.com>
Sender: linux-ext4-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-ext4@vger.kernel.org

Hi,

We’ve seen a few reports where a huge, fragmented filesystem spends a lot
of time looking for good chunks of free space. Two issues have been
identified so far:

 1) mballoc tries too hard to find the best chunk, which is
    counterproductive - it makes sense to limit this search (the patch
    caps the cr == 0 and cr == 1 passes at mb_toscan0 and mb_toscan1
    groups, 1024 and 4096 by default)
 2) during scanning, the bitmaps are loaded one by one, synchronously -
    it makes sense to prefetch a few groups at once (mb_prefetch groups,
    32 by default)

Here is a patch for comments. It has not really been tested at scale, but
it’d be great to get feedback.

Thanks in advance, Alex

Signed-off-by: Alex Zhuravlev

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 0b202e00d93f..76547601384b 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -404,7 +404,8 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
  * Return buffer_head on success or NULL in case of failure.
  */
 struct buffer_head *
-ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group)
+ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group,
+			      int ignore_locked)
 {
 	struct ext4_group_desc *desc;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -435,6 +436,13 @@ ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group)
 	if (bitmap_uptodate(bh))
 		goto verify;
 
+	if (ignore_locked && buffer_locked(bh)) {
+		/* buffer under IO already, do not wait
+		 * if called for prefetching */
+		err = 0;
+		goto out;
+	}
+
 	lock_buffer(bh);
 	if (bitmap_uptodate(bh)) {
 		unlock_buffer(bh);
@@ -524,7 +532,7 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
 	struct buffer_head *bh;
 	int err;
 
-	bh = ext4_read_block_bitmap_nowait(sb, block_group);
+	bh = ext4_read_block_bitmap_nowait(sb, block_group, 0);
 	if (IS_ERR(bh))
 		return bh;
 	err = ext4_wait_block_bitmap(sb, block_group, bh);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 03db3e71676c..2320d7e2f8d6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1480,6 +1480,9 @@ struct ext4_sb_info {
 	/* where last allocation was done - for stream allocation */
 	unsigned long s_mb_last_group;
 	unsigned long s_mb_last_start;
+	unsigned int s_mb_toscan0;
+	unsigned int s_mb_toscan1;
+	unsigned int s_mb_prefetch;
 
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
@@ -2333,7 +2336,8 @@ extern struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
 extern int ext4_should_retry_alloc(struct super_block *sb, int *retries);
 
 extern struct buffer_head *ext4_read_block_bitmap_nowait(struct super_block *sb,
-						ext4_group_t block_group);
+						ext4_group_t block_group,
+						int ignore_locked);
 extern int ext4_wait_block_bitmap(struct super_block *sb,
 				  ext4_group_t block_group,
 				  struct buffer_head *bh);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a3e2767bdf2f..eac4ee225527 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -861,7 +861,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp)
 			bh[i] = NULL;
 			continue;
 		}
-		bh[i] = ext4_read_block_bitmap_nowait(sb, group);
+		bh[i] = ext4_read_block_bitmap_nowait(sb, group, 0);
 		if (IS_ERR(bh[i])) {
 			err = PTR_ERR(bh[i]);
 			bh[i] = NULL;
@@ -2095,10 +2095,53 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 	return 0;
 }
 
+/*
+ * each allocation context (i.e. a thread doing allocation) has its own
+ * sliding prefetch window of @s_mb_prefetch size which starts at the
+ * very first goal and moves ahead of scanning.
+ * a side effect is that subsequent allocations will likely find
+ * the bitmaps in cache or at least in-flight.
+ */
+static void
+ext4_mb_prefetch(struct ext4_allocation_context *ac,
+		 ext4_group_t start)
+{
+	ext4_group_t ngroups = ext4_get_groups_count(ac->ac_sb);
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	struct ext4_group_info *grp;
+	ext4_group_t group = start;
+	struct buffer_head *bh;
+	int nr;
+
+	/* batch prefetching to get a few READs in flight */
+	if (group + (sbi->s_mb_prefetch >> 1) < ac->ac_prefetch)
+		return;
+
+	nr = sbi->s_mb_prefetch;
+	while (nr > 0) {
+		if (++group >= ngroups)
+			group = 0;
+		if (unlikely(group == start))
+			break;
+		grp = ext4_get_group_info(ac->ac_sb, group);
+		/* ignore groups with no free blocks - those will be
+		 * skipped during the scanning as well */
+		if (grp->bb_free == 0)
+			continue;
+		nr--;
+		if (!EXT4_MB_GRP_NEED_INIT(grp))
+			continue;
+		bh = ext4_read_block_bitmap_nowait(ac->ac_sb, group, 1);
+		if (!IS_ERR(bh))
+			brelse(bh);
+	}
+	ac->ac_prefetch = group;
+}
+
 static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
-	ext4_group_t ngroups, group, i;
+	ext4_group_t ngroups, toscan, group, i;
 	int cr;
 	int err = 0, first_err = 0;
 	struct ext4_sb_info *sbi;
@@ -2160,6 +2203,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	 * cr == 0 try to get exact allocation,
 	 * cr == 3 try to get anything
 	 */
+
+	ac->ac_prefetch = ac->ac_g_ex.fe_group;
+
 repeat:
 	for (; cr < 4 && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
@@ -2169,7 +2215,15 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		 */
 		group = ac->ac_g_ex.fe_group;
 
-		for (i = 0; i < ngroups; group++, i++) {
+		/* limit number of groups to scan at the first two rounds
+		 * when we hope to find something really good */
+		toscan = ngroups;
+		if (cr == 0)
+			toscan = sbi->s_mb_toscan0;
+		else if (cr == 1)
+			toscan = sbi->s_mb_toscan1;
+
+		for (i = 0; i < toscan; group++, i++) {
 			int ret = 0;
 			cond_resched();
 			/*
@@ -2179,6 +2233,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			if (group >= ngroups)
 				group = 0;
 
+			ext4_mb_prefetch(ac, group);
+
 			/* This now checks without needing the buddy page */
 			ret = ext4_mb_good_group(ac, group, cr);
 			if (ret <= 0) {
@@ -2872,6 +2928,9 @@ void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid)
 			bio_put(discard_bio);
 		}
 	}
+	sbi->s_mb_toscan0 = 1024;
+	sbi->s_mb_toscan1 = 4096;
+	sbi->s_mb_prefetch = 32;
 
 	list_for_each_entry_safe(entry, tmp, &freed_data_list, efd_list)
 		ext4_free_data_in_buddy(sb, entry);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 88c98f17e3d9..9ba5c75e6490 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -175,6 +175,7 @@ struct ext4_allocation_context {
 	struct page *ac_buddy_page;
 	struct ext4_prealloc_space *ac_pa;
 	struct ext4_locality_group *ac_lg;
+	ext4_group_t ac_prefetch;
 };
 
 #define AC_STATUS_CONTINUE	1
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index eb1efad0e20a..4476d828439b 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -198,6 +198,9 @@ EXT4_RO_ATTR_ES_UI(errors_count, s_error_count);
 EXT4_ATTR(first_error_time, 0444, first_error_time);
 EXT4_ATTR(last_error_time, 0444, last_error_time);
 EXT4_ATTR(journal_task, 0444, journal_task);
+EXT4_RW_ATTR_SBI_UI(mb_toscan0, s_mb_toscan0);
+EXT4_RW_ATTR_SBI_UI(mb_toscan1, s_mb_toscan1);
+EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
 
 static unsigned int old_bump_val = 128;
 EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
@@ -228,6 +231,9 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(first_error_time),
 	ATTR_LIST(last_error_time),
 	ATTR_LIST(journal_task),
+	ATTR_LIST(mb_toscan0),
+	ATTR_LIST(mb_toscan1),
+	ATTR_LIST(mb_prefetch),
 	NULL,
 };
 ATTRIBUTE_GROUPS(ext4);
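
To make the behaviour of the sliding prefetch window easier to reason about,
here is a minimal userspace sketch of the logic in ext4_mb_prefetch(). It is
not kernel code: struct group_state, struct scan_ctx and prefetch_window()
are hypothetical names introduced only for this illustration, and the
"cached" flag is an assumption that stands in for the early returns in
ext4_read_block_bitmap_nowait() when the bitmap is already up to date or
already under IO.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-group state the patch consults. */
struct group_state {
	unsigned int free_blocks;	/* plays the role of grp->bb_free */
	bool need_init;			/* plays the role of EXT4_MB_GRP_NEED_INIT(grp) */
	bool cached;			/* assumption: bitmap read or read in flight */
};

/* Hypothetical stand-in for the allocation context. */
struct scan_ctx {
	unsigned int prefetch_head;	/* plays the role of ac->ac_prefetch */
};

/*
 * Model of one prefetch call made while the scan is at group @start.
 * Returns the number of new bitmap reads that would be submitted.
 */
static unsigned int prefetch_window(struct group_state *groups,
				    unsigned int ngroups,
				    struct scan_ctx *ctx,
				    unsigned int start,
				    unsigned int window)
{
	unsigned int group = start;
	unsigned int nr = window;
	unsigned int issued = 0;

	assert(ngroups > 0 && start < ngroups);

	/* the scan is still more than half a window behind the head */
	if (group + (window >> 1) < ctx->prefetch_head)
		return 0;

	while (nr > 0) {
		if (++group >= ngroups)
			group = 0;
		if (group == start)
			break;		/* walked the whole filesystem */
		if (groups[group].free_blocks == 0)
			continue;	/* the scan skips these as well */
		nr--;
		if (!groups[group].need_init || groups[group].cached)
			continue;	/* nothing to read for this group */
		groups[group].cached = true;
		issued++;
	}
	ctx->prefetch_head = group;
	return issued;
}
```

With 16 groups that all have free space, a window of 4 and the head starting
at group 0, a call for group 0 submits 4 reads, a call for group 1 submits
none (the scan is still well behind the head), and a call for group 2 submits
2 more. Batching this way keeps several reads in flight instead of issuing
one read per scanned group.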