From patchwork Thu Apr 23 19:08:17 2009
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Andreas Dilger
X-Patchwork-Id: 26381
Return-Path:
X-Original-To: patchwork-incoming@bilbo.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from ozlabs.org (ozlabs.org [203.10.76.45])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client CN "mx.ozlabs.org", Issuer "CA Cert Signing Authority" (verified OK))
 by bilbo.ozlabs.org (Postfix) with ESMTPS id E0BD4B7063
 for ; Fri, 24 Apr 2009 05:08:40 +1000 (EST)
Received: by ozlabs.org (Postfix)
 id CF493DE133; Fri, 24 Apr 2009 05:08:40 +1000 (EST)
Delivered-To: patchwork-incoming@ozlabs.org
Received: from vger.kernel.org (vger.kernel.org [209.132.176.167])
 by ozlabs.org (Postfix) with ESMTP id 710ADDDF19
 for ; Fri, 24 Apr 2009 05:08:40 +1000 (EST)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1752328AbZDWTIi (ORCPT ); Thu, 23 Apr 2009 15:08:38 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
 id S1753192AbZDWTIi (ORCPT ); Thu, 23 Apr 2009 15:08:38 -0400
Received: from sca-es-mail-2.Sun.COM ([192.18.43.133]:33942
 "EHLO sca-es-mail-2.sun.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1752328AbZDWTIi (ORCPT );
 Thu, 23 Apr 2009 15:08:38 -0400
Received: from fe-sfbay-10.sun.com ([192.18.43.129])
 by sca-es-mail-2.sun.com (8.13.7+Sun/8.12.9) with ESMTP id n3NJ8XNX007923
 for ; Thu, 23 Apr 2009 12:08:33 -0700 (PDT)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-disposition: inline
Content-type: text/plain; charset=us-ascii
Received: from conversion-daemon.fe-sfbay-10.sun.com by fe-sfbay-10.sun.com
 (Sun Java(tm) System Messaging Server 7.0-5.01 64bit (built Feb 19 2009))
 id <0KIK00600I9FCB00@fe-sfbay-10.sun.com> for linux-ext4@vger.kernel.org;
 Thu, 23 Apr 2009 12:08:33 -0700 (PDT)
Received: from webber.adilger.int ([unknown] [68.147.169.220])
 by fe-sfbay-10.sun.com
 (Sun Java(tm) System Messaging Server 7.0-5.01 64bit (built Feb 19 2009))
 with ESMTPSA id <0KIK00691II4EV20@fe-sfbay-10.sun.com>;
 Thu, 23 Apr 2009 12:08:29 -0700 (PDT)
Date: Thu, 23 Apr 2009 13:08:17 -0600
From: Andreas Dilger
Subject: Re: Question on block group allocation
In-reply-to: <6601abe90904230941x5cdd590ck2d51410326df2fc5@mail.gmail.com>
To: Curt Wohlgemuth
Cc: ext4 development
Message-id: <20090423190817.GN3209@webber.adilger.int>
X-GPG-Key: 1024D/0D35BED6
X-GPG-Fingerprint: 7A37 5D79 BF1B CECA D44F 8A29 A488 39F5 0D35 BED6
References: <6601abe90904230941x5cdd590ck2d51410326df2fc5@mail.gmail.com>
User-Agent: Mutt/1.5.18 (2008-05-17)
Sender: linux-ext4-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-ext4@vger.kernel.org

On Apr 23, 2009 09:41 -0700, Curt Wohlgemuth wrote:
> I'm seeing a performance problem on ext4 vs ext2, and in trying to
> narrow it down, I've got a question about block allocation in ext4
> that I'm having trouble figuring out.
>
> Using dd, I created (in this order) two 4GB files and a 10GB file in
> the mount directory.
>
> The extent blocks are reasonably close together for the two 4GB files,
> but the extents for the 10GB file show a huge gap, which seems to hurt
> the random read performance pretty substantially. Here's the output
> from debugfs:
>
> BLOCKS:
> (IND):8396832, (0-106495):8282112-8388607,
> (106496-399359):11241472-11534335, (399360-888831):20482048-20971519,
> (888832-1116159):23889920-24117247,
> (1116160-1277951):71665664-71827455,
> (1277952-1767423):78678016-79167487,
> (1767424-2125823):102402048-102760447,
> (2125824-2148351):102768672-102791199,
> (2148352-2621439):102793216-103266303
> TOTAL: 2621441
>
> Note the gap between blocks 79167487 and 102402048. Well, there are
> other even larger gaps for other chunks of the file.
>
> I was lucky enough to capture the mb_history from this 10GB create:
>
> 29109 14 735/30720/32758@1114112 735/30720/2048@1114112
> 735/30720/2048@1114112 1 0 0 1568 M 0 0
> 29109 14 736/0/32758@1116160 736/0/2048@1116160
> 2187/2048/2048@1116160 1 1 0 1568 0 0
> 29109 14 2187/4096/32758@1118208 2187/4096/2048@1118208
> 2187/4096/2048@1118208 1 0 0 1568 M 2048 4096
>
> I've been staring at ext4_mb_regular_allocator() trying to understand
> why an allocation with a goal block of 736 ends up with a best found
> extent group of 2187, and I'm stuck -- at least without a lot of
> printk messages. It seems to me that we just cycle through the block
> groups starting with the goal group until we find a group that fits.
> Again, according to dumpe2fs, block groups 737, 738, 739, ... all have
> 32768 free blocks. So why we end up with a best fit group of 2187 is
> a mystery to me.

This is likely the "uninit_bg" feature, which is causing the allocations
to skip groups that are marked BLOCK_UNINIT. In some sense, skipping the
block bitmap read during e2fsck is probably not worth the cost of the
extra seeking during IO. As the filesystem gets more full, the
BLOCK_UNINIT flags would be cleared anyway, so we might as well just
keep the early allocations contiguous.

A simple change to verify this would be something like the following,
but it hasn't actually been tested.

Cheers, Andreas

Signed-off-by: Andreas Dilger
---
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
--- ./fs/ext4/mballoc.c.uninit	2009-04-08 19:13:13.000000000 -0600
+++ ./fs/ext4/mballoc.c	2009-04-23 13:02:22.000000000 -0600
@@ -1742,10 +1723,6 @@ static int ext4_mb_good_group(struct ext
 	switch (cr) {
 	case 0:
 		BUG_ON(ac->ac_2order == 0);
-		/* If this group is uninitialized, skip it initially */
-		desc = ext4_get_group_desc(ac->ac_sb, group, NULL);
-		if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
-			return 0;

 		bits = ac->ac_sb->s_blocksize_bits + 1;
 		for (i = ac->ac_2order; i <= bits; i++)
@@ -2039,9 +2035,7 @@ repeat:

 			ac->ac_groups_scanned++;
 			desc = ext4_get_group_desc(sb, group, NULL);
-			if (cr == 0 || (desc->bg_flags &
-					cpu_to_le16(EXT4_BG_BLOCK_UNINIT) &&
-					ac->ac_2order != 0))
+			if (cr == 0)
 				ext4_mb_simple_scan_group(ac, &e4b);
 			else if (cr == 1 &&
 					ac->ac_g_ex.fe_len == sbi->s_stripe)