From patchwork Tue Aug 8 05:05:17 2017
X-Patchwork-Submitter: Wang Shilong
X-Patchwork-Id: 799014
From: Wang Shilong
To: linux-ext4@vger.kernel.org
Cc: tytso@mit.edu, wshilong@ddn.com, adilger@dilger.ca, sihara@ddn.com, lixi@ddn.com
Subject: [PATCH v4] ext4: reduce lock contention in __ext4_new_inode
Date: Tue, 8 Aug 2017 13:05:17 +0800
Message-Id: <20170808050517.7160-1-wshilong@ddn.com>
X-Mailing-List: linux-ext4@vger.kernel.org

From: Wang Shilong

While running a number of file-creation threads concurrently, we found
heavy lock contention on the group spinlock:

FUNC                            TOTAL_TIME(us)  COUNT      AVG(us)
ext4_create                     1707443399      1440000    1185.72
_raw_spin_lock                  1317641501      180899929  7.28
jbd2__journal_start             287821030       1453950    197.96
jbd2_journal_get_write_access   33441470        73077185   0.46
ext4_add_nondir                 29435963        1440000    20.44
ext4_add_entry                  26015166        1440049    18.07
ext4_dx_add_entry               25729337        1432814    17.96
ext4_mark_inode_dirty           12302433        5774407    2.13

Most of the CPU time is spent in _raw_spin_lock. Here are some test
numbers with and without the patch.
Test environment:
Server : SuperMicro server (2 x E5-2690 v3 @ 2.60GHz, 128GB 2133MHz DDR4 memory, 8Gb FC)
Storage: 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB Read Intensive SSD)

format command:
	mkfs.ext4 -J size=4096

test commands:
	mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
		-r -i 1 -v -p 10 -u	# first run to load inodes
	mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
		-r -i 5 -v -p 10 -u

Kernel version: 4.13.0-rc3

Test 1,440,000 files with 48 directories by 48 processes:

Without patch:

File Creation   File removal
       79,033        289,569   ops/per second
       81,463        285,359
       79,875        288,475
       79,917        284,624
       79,420        290,91

With patch:

File Creation   File removal
      691,528        296,574   ops/per second
      691,946        297,106
      692,030        296,238
      691,005        299,249
      692,871        300,664

Creation performance is improved more than 8x with a large journal size.

The main problem is that we scan the bitmap and then perform checks and
journal operations that may sleep before taking the group lock; only then
do we test-and-set the bit with the lock held. That window is racy and
allows the inode to be stolen by another process. However, after the
first attempt we know the handle has been started and the inode bitmap
buffer has been journaled, so on retry we can find and set the bit with
the lock held directly, which mostly guarantees success on the second
attempt.

This patch doesn't change the logic in no-journal mode; fortunately,
that is not a common use case, I believe.

Tested-by: Shuichi Ihara
Signed-off-by: Wang Shilong
---
v3->v4: code cleanup and avoid sleeping.
---
 fs/ext4/ialloc.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 507bfb3..23380f39 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -761,6 +761,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	ext4_group_t flex_group;
 	struct ext4_group_info *grp;
 	int encrypt = 0;
+	bool hold_lock;
 
 	/* Cannot create files in a deleted directory */
 	if (!dir || !dir->i_nlink)
@@ -917,17 +918,40 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			continue;
 		}
 
+		hold_lock = false;
 repeat_in_this_group:
+		/* if @hold_lock is true, that means the journal
+		 * is properly set up and the inode bitmap buffer has
+		 * been journaled already, so we can directly hold the
+		 * lock and set the bit if found; this will mostly
+		 * guarantee forward progress for each thread.
+		 */
+		if (hold_lock)
+			ext4_lock_group(sb, group);
 		ino = ext4_find_next_zero_bit((unsigned long *)
					      inode_bitmap_bh->b_data,
					      EXT4_INODES_PER_GROUP(sb), ino);
-		if (ino >= EXT4_INODES_PER_GROUP(sb))
+		if (ino >= EXT4_INODES_PER_GROUP(sb)) {
+			if (hold_lock)
+				ext4_unlock_group(sb, group);
 			goto next_group;
+		}
 		if (group == 0 && (ino+1) < EXT4_FIRST_INO(sb)) {
+			if (hold_lock)
+				ext4_unlock_group(sb, group);
 			ext4_error(sb, "reserved inode found cleared - "
 				   "inode=%lu", ino + 1);
 			continue;
 		}
+
+		if (hold_lock) {
+			ext4_set_bit(ino, inode_bitmap_bh->b_data);
+			ext4_unlock_group(sb, group);
+			ino++;
+			goto got;
+		}
+
 		if ((EXT4_SB(sb)->s_journal == NULL) &&
 		    recently_deleted(sb, group, ino)) {
 			ino++;
@@ -950,6 +974,10 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			ext4_std_error(sb, err);
 			goto out;
 		}
+
+		if (EXT4_SB(sb)->s_journal)
+			hold_lock = true;
+
 		ext4_lock_group(sb, group);
 		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
 		ext4_unlock_group(sb, group);