From patchwork Mon Dec 2 09:08:36 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Alex Zhuravlev
X-Patchwork-Id: 1203035
From: Alex Zhuravlev
To: "linux-ext4@vger.kernel.org"
Subject: [RFC] improve mballoc for large filesystems: prefetch bitmaps
Date: Mon, 2 Dec 2019 09:08:36 +0000
X-Mailing-List: linux-ext4@vger.kernel.org

Hi,

Here is another patch for prefetching, reworked a bit:
- flex_bg is taken into account, as several bitmaps are supposed to be
  fetched with a single IO
- the number of prefetches at cr=0 is limited so that mballoc doesn’t try
  to load all bitmaps

Thanks, Alex

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 0b202e00d93f..0bc694c5dcfe 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -404,7 +404,8 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
  * Return buffer_head on success or NULL in case of failure.
  */
 struct buffer_head *
-ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group)
+ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group,
+			      int ignore_locked)
 {
 	struct ext4_group_desc *desc;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -435,6 +436,13 @@ ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group)
 	if (bitmap_uptodate(bh))
 		goto verify;
 
+	if (ignore_locked && buffer_locked(bh)) {
+		/* buffer under IO already, do not wait
+		 * if called for prefetching */
+		put_bh(bh);
+		return NULL;
+	}
+
 	lock_buffer(bh);
 	if (bitmap_uptodate(bh)) {
 		unlock_buffer(bh);
@@ -524,7 +532,7 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
 	struct buffer_head *bh;
 	int err;
 
-	bh = ext4_read_block_bitmap_nowait(sb, block_group);
+	bh = ext4_read_block_bitmap_nowait(sb, block_group, 0);
 	if (IS_ERR(bh))
 		return bh;
 	err = ext4_wait_block_bitmap(sb, block_group, bh);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f8578caba40d..4a7f4ccd8641 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1485,6 +1485,8 @@ struct ext4_sb_info {
 	/* where last allocation was done - for stream allocation */
 	unsigned long s_mb_last_group;
 	unsigned long s_mb_last_start;
+	unsigned int s_mb_prefetch;
+	unsigned int s_mb_prefetch_limit;
 
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
@@ -2339,7 +2341,8 @@ extern struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
 extern int ext4_should_retry_alloc(struct super_block *sb, int *retries);
 
 extern struct buffer_head *ext4_read_block_bitmap_nowait(struct super_block *sb,
-						ext4_group_t block_group);
+						ext4_group_t block_group,
+						int ignore_locked);
 extern int ext4_wait_block_bitmap(struct super_block *sb,
 				  ext4_group_t block_group,
 				  struct buffer_head *bh);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 7c6c34fd8e1c..e4d93c9a6b77 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -861,7 +861,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore, gfp_t gfp)
 			bh[i] = NULL;
 			continue;
 		}
-		bh[i] = ext4_read_block_bitmap_nowait(sb, group);
+		bh[i] = ext4_read_block_bitmap_nowait(sb, group, 0);
 		if (IS_ERR(bh[i])) {
 			err = PTR_ERR(bh[i]);
 			bh[i] = NULL;
@@ -2103,6 +2103,87 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 	return 0;
 }
 
+/*
+ * each allocation context (i.e. a thread doing allocation) has its own
+ * sliding prefetch window of @s_mb_prefetch size which starts at the
+ * very first goal and moves ahead of scanning.
+ * a side effect is that subsequent allocations will likely find
+ * the bitmaps in cache or at least in-flight.
+ */
+static void
+ext4_mb_prefetch(struct ext4_allocation_context *ac,
+		    ext4_group_t start)
+{
+	struct super_block *sb = ac->ac_sb;
+	ext4_group_t ngroups = ext4_get_groups_count(sb);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_group_info *grp;
+	ext4_group_t group = start;
+	struct buffer_head *bh;
+	int nr;
+
+	/* limit prefetching at cr=0, otherwise mballoc can
+	 * spend a lot of time loading imperfect groups */
+	if (!ac->ac_criteria && ac->ac_prefetch_ios >= sbi->s_mb_prefetch_limit)
+		return;
+
+	/* batch prefetching to get a few READs in flight */
+	nr = ac->ac_prefetch - group;
+	if (ac->ac_prefetch < group)
+		/* wrapped to the first groups */
+		nr += ngroups;
+	if (nr > 0)
+		return;
+	BUG_ON(nr < 0);
+
+	nr = sbi->s_mb_prefetch;
+	if (ext4_has_feature_flex_bg(ac->ac_sb)) {
+		/* align to flex_bg to get more bitmaps with a single IO */
+		nr = (group / sbi->s_mb_prefetch) * sbi->s_mb_prefetch;
+		nr = nr + sbi->s_mb_prefetch - group;
+	}
+	while (nr-- > 0) {
+		grp = ext4_get_group_info(sb, group);
+		/* ignore empty groups - those will be skipped
+		 * during the scanning as well */
+		if (grp->bb_free > 0 && EXT4_MB_GRP_NEED_INIT(grp)) {
+			bh = ext4_read_block_bitmap_nowait(sb, group, 1);
+			if (bh && !IS_ERR(bh)) {
+				if (!buffer_uptodate(bh))
+					ac->ac_prefetch_ios++;
+				brelse(bh);
+			}
+		}
+		if (++group >= ngroups)
+			group = 0;
+	}
+	ac->ac_prefetch = group;
+}
+
+static void
+ext4_mb_prefetch_fini(struct ext4_allocation_context *ac)
+{
+	struct ext4_group_info *grp;
+	ext4_group_t group;
+	int nr, rc;
+
+	/* initialize last window of prefetched groups */
+	nr = ac->ac_prefetch_ios;
+	if (nr > EXT4_SB(ac->ac_sb)->s_mb_prefetch)
+		nr = EXT4_SB(ac->ac_sb)->s_mb_prefetch;
+	group = ac->ac_prefetch;
+	while (nr-- > 0) {
+		grp = ext4_get_group_info(ac->ac_sb, group);
+		if (grp->bb_free > 0 && EXT4_MB_GRP_NEED_INIT(grp)) {
+			rc = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
+			if (rc)
+				break;
+		}
+		if (group-- == 0)
+			group = ext4_get_groups_count(ac->ac_sb) - 1;
+	}
+}
+
 static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
@@ -2175,7 +2256,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		 * searching for the right group start
 		 * from the goal value specified
 		 */
-		group = ac->ac_g_ex.fe_group;
+		group = ac->ac_g_ex.fe_group + 1;
+		ac->ac_prefetch = group;
 
 		for (i = 0; i < ngroups; group++, i++) {
 			int ret = 0;
@@ -2187,6 +2269,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			if (group >= ngroups)
 				group = 0;
 
+			ext4_mb_prefetch(ac, group);
+
 			/* This now checks without needing the buddy page */
 			ret = ext4_mb_good_group(ac, group, cr);
 			if (ret <= 0) {
@@ -2259,6 +2343,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 out:
 	if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
 		err = first_err;
+	/* use prefetched bitmaps to init buddy so that read info is not lost */
+	ext4_mb_prefetch_fini(ac);
 	return err;
 }
 
@@ -2880,6 +2966,22 @@ void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid)
 			bio_put(discard_bio);
 		}
 	}
+	if (ext4_has_feature_flex_bg(sb)) {
+		/* a single flex group is supposed to be read by a single IO */
+		sbi->s_mb_prefetch = 1 << sbi->s_es->s_log_groups_per_flex;
+		sbi->s_mb_prefetch *= 8; /* 8 prefetch IOs in flight at most */
+	} else {
+		sbi->s_mb_prefetch = 32;
+	}
+	if (sbi->s_mb_prefetch >= ext4_get_groups_count(sb) >> 2)
+		sbi->s_mb_prefetch = ext4_get_groups_count(sb) >> 2;
+	/* how many real IOs to prefetch within a single allocation at cr=0
+	 * given cr=0 is a CPU-related optimization we shouldn't try to
+	 * load too many groups, at some point we should start to use what
+	 * we've got in memory.
+	 * with an average random access time 5ms, it'd take a second to get
+	 * 200 groups (* N with flex_bg), so let's make this limit 256 */
+	sbi->s_mb_prefetch_limit = sbi->s_mb_prefetch * 256;
 
 	list_for_each_entry_safe(entry, tmp, &freed_data_list, efd_list)
 		ext4_free_data_in_buddy(sb, entry);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 88c98f17e3d9..c96a2bd81f72 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -175,6 +175,8 @@ struct ext4_allocation_context {
 	struct page *ac_buddy_page;
 	struct ext4_prealloc_space *ac_pa;
 	struct ext4_locality_group *ac_lg;
+	ext4_group_t ac_prefetch;
+	int ac_prefetch_ios; /* number of initiated prefetch IOs */
 };
 
 #define AC_STATUS_CONTINUE	1
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index eb1efad0e20a..a14ce23c1444 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -186,6 +186,8 @@ EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
 EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
 EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
 EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
+EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
+EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
 EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
 EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
 EXT4_RW_ATTR_SBI_UI(err_ratelimit_interval_ms, s_err_ratelimit_state.interval);
@@ -215,6 +217,8 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(mb_order2_req),
 	ATTR_LIST(mb_stream_req),
 	ATTR_LIST(mb_group_prealloc),
+	ATTR_LIST(mb_prefetch),
+	ATTR_LIST(mb_prefetch_limit),
 	ATTR_LIST(max_writeback_mb_bump),
 	ATTR_LIST(extent_max_zeroout_kb),
 	ATTR_LIST(trigger_fs_error),
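
For readers who want to check the window arithmetic outside the kernel, here
is a minimal userspace sketch of the sliding prefetch window and the default
sizing described in the patch. It is only a model under stated assumptions:
the helper names (window_left, batch_size, prefetch_advance, default_prefetch)
are hypothetical and not part of ext4, group numbers are plain unsigned ints,
and no IO or buddy initialization is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: groups left between the scan position and the end
 * of the prefetched window, with wraparound at ngroups. Mirrors the
 * "nr = ac->ac_prefetch - group; if (wrapped) nr += ngroups" step. */
static unsigned int window_left(unsigned int next, unsigned int group,
				unsigned int ngroups)
{
	return (next + ngroups - group) % ngroups;
}

/* Hypothetical helper: size of the next batch starting at group. With
 * flex_bg the batch is cut so that it ends on a multiple of prefetch. */
static unsigned int batch_size(unsigned int group, unsigned int prefetch,
			       bool flex_bg)
{
	assert(prefetch > 0);
	if (!flex_bg)
		return prefetch;
	return (group / prefetch) * prefetch + prefetch - group;
}

/* Hypothetical helper: returns the new window end (the first group not
 * yet prefetched) after the scan has reached group. */
static unsigned int prefetch_advance(unsigned int next, unsigned int group,
				     unsigned int ngroups,
				     unsigned int prefetch, bool flex_bg)
{
	unsigned int nr;

	/* the scan has not caught up with the window yet: nothing to do */
	if (window_left(next, group, ngroups) > 0)
		return next;

	nr = batch_size(group, prefetch, flex_bg);
	while (nr-- > 0) {
		/* a real implementation would start a bitmap read here */
		if (++group >= ngroups)
			group = 0;
	}
	return group;
}

/* Hypothetical helper: default window size as set up in the patch,
 * 8 flex groups with flex_bg, else 32 groups, capped at ngroups / 4. */
static unsigned int default_prefetch(bool flex_bg,
				     unsigned int log_groups_per_flex,
				     unsigned int ngroups)
{
	unsigned int prefetch = flex_bg ? (1u << log_groups_per_flex) * 8 : 32;

	if (prefetch >= ngroups >> 2)
		prefetch = ngroups >> 2;
	return prefetch;
}
```

With a window of 32 groups and flex_bg enabled, a scan that reaches group 40
gets a batch of 24 groups, so the window ends at group 64 and later batches
start on aligned boundaries. The cr=0 limit in the patch is then
default_prefetch(...) * 256 prefetch IOs per allocation.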