From patchwork Thu Nov 21 23:41:16 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 293275
Date: Fri, 22 Nov 2013 10:41:16 +1100
From: Dave Chinner
To: Martin Boutin
Cc: Eric Sandeen, "Kernel.org-Linux-RAID", xfs-oss, "Kernel.org-Linux-EXT4"
Subject: Re: Filesystem writes on RAID5 too slow
Message-ID: <20131121234116.GD6502@dastard>
References: <528A5C45.4080906@redhat.com> <20131119005740.GY6188@dastard> <20131121092606.GU11434@dastard>
X-Mailing-List: linux-ext4@vger.kernel.org

On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote:
> $ uname -a
> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
> i686 GNU/Linux

Oh,
it's a 32 bit system. Things you don't know from the obfuscating
codenames everyone uses these days...

> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
> $ mount -t xfs /dev/md0 /tmp/diskmnt/
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
....
> $ cat /proc/mounts
> (...)
> /dev/md0 /tmp/diskmnt xfs
> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0

sunit/swidth is 512k/1MB

> # same layout for other disks
> $ fdisk -c -u /dev/sda
....
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1            2048    20565247    10281600   83  Linux

Aligned to 1 MB.

> /dev/sda2        20565248  1953525167   966479960   83  Linux

And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is
aligned to 4k, though, so there shouldn't be any hardware RMW
cycles.

> $ xfs_info /dev/md0
> meta-data=/dev/md0     isize=256    agcount=32, agsize=15101312 blks
>          =             sectsz=4096  attr=2
> data     =             bsize=4096   blocks=483239168, imaxpct=5
>          =             sunit=12

sunit/swidth of 512k/1MB, so it matches the MD device.

> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
> /tmp/diskmnt/filewr.zero:
>  EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET           TOTAL   FLAGS
>    0: [0..2047999]:    2049056..4097055   0 (2049056..4097055)  2048000 01111
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
> # this does not look good, does it?

Yup, looks broken.

/me digs through git.

Yup, commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") broke
the code that sets stripe unit alignment for the initial allocation
way back in 3.2.

[ Hmmm, that would explain the very occasional failure that
generic/223 throws out (maybe once a month I see it fail). ]

Which means MD is doing RMW cycles for its parity calculations, and
that's where performance is going south.
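
For anyone who wants to repeat the alignment arithmetic above on their
own layout, here is a minimal sketch. It is not part of the patch. The
helper name sector_is_aligned() is made up for illustration, and it
assumes that the fdisk start sectors and the xfs_bmap block ranges are
both in 512-byte units.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: fdisk -u and xfs_bmap both report 512-byte units. */
#define SECTOR_BYTES 512ULL

/*
 * Hypothetical helper (not an XFS or MD interface): returns true if an
 * offset given in 512-byte sectors is a multiple of align_bytes.
 */
static inline bool sector_is_aligned(uint64_t sector, uint64_t align_bytes)
{
	assert(align_bytes != 0);	/* guard the modulo below */
	return (sector * SECTOR_BYTES) % align_bytes == 0;
}
```

With the numbers quoted above, sector_is_aligned(2048, 1024 * 1024) is
true for sda1, while sda2's start sector 20565248 passes the 4096 byte
check but fails the 1MB one. The extent start 2049056 fails the 512k
stripe unit check (it is 32 sectors past a stripe unit boundary), which
agrees with the 01111 flags.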

Current code:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET           TOTAL   FLAGS
   0: [0..2097151]:    1056..2098207      0 (1056..2098207)     2097152 11111
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec)
$

Which indicates that even if we take direct IO based allocation out
of the picture, the allocation does not get aligned properly. This
is on a 3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k.

With a fixed kernel:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET           TOTAL   FLAGS
   0: [0..2097151]:    6293504..8390655   0 (6293504..8390655)  2097152 10000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec)
$

It's clear we have completely stripe width aligned allocation and
it's 25% faster.
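
If you want to check the FLAGS column from a script rather than by
eye, something like the following sketch would do. It is not part of
the patch. The enum and function names are invented here, and it
assumes the FLAGS field is an octal bitmask, which is how the "FLAG
Values" legend above reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Values copied from the "FLAG Values" legend; the names are hypothetical. */
enum {
	BMAP_UNWRITTEN      = 010000,	/* Unwritten preallocated extent */
	BMAP_BEGIN_NOT_SU   = 001000,	/* Doesn't begin on stripe unit  */
	BMAP_END_NOT_SU     = 000100,	/* Doesn't end on stripe unit    */
	BMAP_BEGIN_NOT_SW   = 000010,	/* Doesn't begin on stripe width */
	BMAP_END_NOT_SW     = 000001,	/* Doesn't end on stripe width   */
};

/* Parse a FLAGS field such as "11111", assuming it is printed in octal. */
static inline unsigned long bmap_parse_flags(const char *field)
{
	assert(field != NULL);
	return strtoul(field, NULL, 8);
}

/* True if any of the four stripe unit/width misalignment bits is set. */
static inline bool bmap_is_misaligned(unsigned long flags)
{
	const unsigned long mask = BMAP_BEGIN_NOT_SU | BMAP_END_NOT_SU |
				   BMAP_BEGIN_NOT_SW | BMAP_END_NOT_SW;

	return (flags & mask) != 0;
}
```

Fed the two results above, "11111" from the current code is reported
as misaligned, and "10000" from the fixed kernel is not: only the
unwritten bit is left set.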

Take fallocate out of the picture so the direct IO does the
allocation:

$ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec)
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET           TOTAL   FLAGS
   0: [0..2097151]:    2099200..4196351   0 (2099200..4196351)  2097152 00000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width

It's slower than with preallocation (no surprise - no allocation
overhead per write(2) call after preallocation is done) but the
allocation is still correctly aligned.

The patch below should fix the unaligned allocation problem you are
seeing, but because XFS defaults to stripe unit alignment for large
allocations, you might still see RMW cycles when it aligns to a
stripe unit that is not the first in an MD stripe. I'll have a quick
look at fixing that behaviour when the swalloc mount option is
specified....

Cheers,

Dave.

Reviewed-by: Christoph Hellwig

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 3ef11b2..8401f11 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -1635,7 +1635,7 @@ xfs_bmap_last_extent(
  * blocks at the end of the file which do not start at the previous data block,
  * we will try to align the new blocks at stripe unit boundaries.
  *
- * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be
+ * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be
  * at, or past the EOF.
  */
 STATIC int
@@ -1650,9 +1650,14 @@ xfs_bmap_isaeof(
 	bma->aeof = 0;
 	error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec,
 				     &is_empty);
-	if (error || is_empty)
+	if (error)
 		return error;
 
+	if (is_empty) {
+		bma->aeof = 1;
+		return 0;
+	}
+
 	/*
 	 * Check if we are allocation or past the last extent, or at least into
 	 * the last delayed allocated extent.