From patchwork Tue Aug 8 05:05:17 2017
X-Patchwork-Submitter: Wang Shilong
X-Patchwork-Id: 799014
From: Wang Shilong
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, wshilong@ddn.com, adilger@dilger.ca, sihara@ddn.com, lixi@ddn.com
Subject: [PATCH v4] ext4: reduce lock contention in __ext4_new_inode
Date: Tue, 8 Aug 2017 13:05:17 +0800
Message-Id: <20170808050517.7160-1-wshilong@ddn.com>
X-Mailing-List: linux-ext4@vger.kernel.org

From: Wang Shilong

While running a number of file-creation threads concurrently, we found
heavy lock contention on the group spinlock:

FUNC                            TOTAL_TIME(us)  COUNT      AVG(us)
ext4_create                     1707443399      1440000    1185.72
_raw_spin_lock                  1317641501      180899929  7.28
jbd2__journal_start             287821030       1453950    197.96
jbd2_journal_get_write_access   33441470        73077185   0.46
ext4_add_nondir                 29435963        1440000    20.44
ext4_add_entry                  26015166        1440049    18.07
ext4_dx_add_entry               25729337        1432814    17.96
ext4_mark_inode_dirty           12302433        5774407    2.13

Most of the CPU time is spent in _raw_spin_lock. Here are some test
numbers with and without the patch.
Test environment:
Server : SuperMicro server (2 x E5-2690 v3 @ 2.60GHz, 128GB 2133MHz DDR4 memory, 8Gb FC)
Storage: 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB Read Intensive SSD)

format command:
	mkfs.ext4 -J size=4096

test commands:
	mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
		-r -i 1 -v -p 10 -u	# first run to load inodes
	mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
		-r -i 5 -v -p 10 -u

Kernel version: 4.13.0-rc3

Test 1,440,000 files with 48 directories by 48 processes:

Without patch:

File Creation   File removal
       79,033        289,569   ops/per second
       81,463        285,359
       79,875        288,475
       79,917        284,624
       79,420        290,91

With patch:

File Creation   File removal
      691,528        296,574   ops/per second
      691,946        297,106
      692,030        296,238
      691,005        299,249
      692,871        300,664

Creation performance is improved more than 8x with a large journal size.

The main problem is that we scan the bitmap and then perform checks and
journal operations that may sleep before taking the group lock; only then
do we test-and-set the bit with the lock held. That window is racy and
allows the inode to be stolen by another process. However, after the
first attempt we know the handle has been started and the inode bitmap
buffer has been journaled, so on retry we can find and set the bit with
the lock held directly, which mostly guarantees success on the second
attempt.

This patch doesn't change the logic in no-journal mode; fortunately,
that is not a common use case, I believe.

Tested-by: Shuichi Ihara
Signed-off-by: Wang Shilong
---
v3->v4: code cleanup and avoid sleeping.
---
 fs/ext4/ialloc.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 507bfb3..23380f39 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -761,6 +761,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	ext4_group_t flex_group;
 	struct ext4_group_info *grp;
 	int encrypt = 0;
+	bool hold_lock;
 
 	/* Cannot create files in a deleted directory */
 	if (!dir || !dir->i_nlink)
@@ -917,17 +918,40 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			continue;
 		}
 
+		hold_lock = false;
 repeat_in_this_group:
+		/* if @hold_lock is true, that means the journal
+		 * is properly set up and the inode bitmap buffer has
+		 * been journaled already, so we can directly hold the
+		 * lock and set the bit if found; this will mostly
+		 * guarantee forward progress for each thread.
+		 */
+		if (hold_lock)
+			ext4_lock_group(sb, group);
 		ino = ext4_find_next_zero_bit((unsigned long *)
					      inode_bitmap_bh->b_data,
					      EXT4_INODES_PER_GROUP(sb), ino);
-		if (ino >= EXT4_INODES_PER_GROUP(sb))
+		if (ino >= EXT4_INODES_PER_GROUP(sb)) {
+			if (hold_lock)
+				ext4_unlock_group(sb, group);
 			goto next_group;
+		}
 		if (group == 0 && (ino+1) < EXT4_FIRST_INO(sb)) {
+			if (hold_lock)
+				ext4_unlock_group(sb, group);
 			ext4_error(sb, "reserved inode found cleared - "
 				   "inode=%lu", ino + 1);
 			continue;
 		}
+
+		if (hold_lock) {
+			ext4_set_bit(ino, inode_bitmap_bh->b_data);
+			ext4_unlock_group(sb, group);
+			ino++;
+			goto got;
+		}
+
 		if ((EXT4_SB(sb)->s_journal == NULL) &&
 		    recently_deleted(sb, group, ino)) {
 			ino++;
@@ -950,6 +974,10 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			ext4_std_error(sb, err);
 			goto out;
 		}
+
+		if (EXT4_SB(sb)->s_journal)
+			hold_lock = true;
+
 		ext4_lock_group(sb, group);
 		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
 		ext4_unlock_group(sb, group);