Patchwork [3/3] mke2fs: document bigalloc and cluster-size

Submitter Zheng Liu
Date Jan. 13, 2013, 9:08 a.m.
Message ID <1358068095-9034-3-git-send-email-wenqing.lz@taobao.com>
Permalink /patch/211604/
State Accepted

Comments

Zheng Liu - Jan. 13, 2013, 9:08 a.m.
From: Zheng Liu <wenqing.lz@taobao.com>

Bigalloc feature has been used for a long time, but the documentation in mke2fs
is still missing.  So add it.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 misc/mke2fs.8.in | 11 +++++++++++
 1 file changed, 11 insertions(+)
Theodore Ts'o - Jan. 15, 2013, 3:10 a.m.
On Sun, Jan 13, 2013 at 05:08:15PM +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> Bigalloc feature has been used for a long time, but the documentation in mke2fs
> is still missing.  So add it.
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>

Applied (with some changes to improve the english/wording).

	      	   	      	      - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Jan. 15, 2013, 7:12 p.m.
On Mon, Jan 14, 2013 at 10:10:06PM -0500, Theodore Ts'o wrote:
> On Sun, Jan 13, 2013 at 05:08:15PM +0800, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > Bigalloc feature has been used for a long time, but the documentation in mke2fs
> > is still missing.  So add it.
> > 
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> 
> Applied (with some changes to improve the english/wording).

BTW, I used the following modified text:

                   bigalloc
                          This feature enables clustered allocation, so that
                          the unit of allocation is a power of two number of
                          blocks.  That is, each bit in what had traditionally
                          been known as the block allocation bitmap now
                          indicates whether a cluster is in use or not, where
                          a cluster is by default composed of 16 blocks.  This
                          feature can decrease the time spent on block
                          allocation and reduce fragmentation, especially for
                          large files.  The cluster size can be specified
                          using the -C option.

                          Warning: The bigalloc feature is still under
                          development, and may not be fully supported with
                          your kernel or may have various bugs.  Please see
                          the web page
                          http://ext4.wiki.kernel.org/index.php/Bigalloc
                          for details.

And I populated the page on the ext4 wiki so that when we finally fix
the delalloc/bigalloc problem, we can update the ext4 wiki to reflect
this.

					- Ted
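
To illustrate the cluster bitmap described in the man page text above, here is a minimal sketch of how block numbers would map onto bitmap bits when one bit covers a whole cluster. It is a toy model only: the function names are hypothetical and are not e2fsprogs or kernel APIs, and it ignores block groups and the rest of the on-disk layout.

```python
# Toy model of clustered allocation. All names are hypothetical and
# chosen for illustration; this is not how e2fsprogs is implemented.

def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def cluster_of(block: int, blocks_per_cluster: int = 16) -> int:
    """Index of the cluster (and so the bitmap bit) that contains `block`."""
    if not is_power_of_two(blocks_per_cluster):
        raise ValueError("blocks_per_cluster must be a power of two")
    return block // blocks_per_cluster


def bitmap_bits(total_blocks: int, blocks_per_cluster: int = 16) -> int:
    """Number of bitmap bits needed to cover `total_blocks` blocks."""
    if not is_power_of_two(blocks_per_cluster):
        raise ValueError("blocks_per_cluster must be a power of two")
    # Ceiling division: a partially filled last cluster still needs a bit.
    return (total_blocks + blocks_per_cluster - 1) // blocks_per_cluster
```

With the default of 16 blocks per cluster, blocks 0 through 15 share one bit, so the bitmap needs one sixteenth as many bits as a per-block bitmap would.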
Phillip Susi - Jan. 15, 2013, 7:46 p.m.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 1/15/2013 2:12 PM, Theodore Ts'o wrote:
> BTW, I used the following modified text:
> 
> bigalloc This feature enables clustered allocation, so that the
> unit of allocation is a power of two number of blocks.  That is,
> each bit in what had traditionally been known as the block
> allocation bitmap now indicates whether a cluster is in use or not,
> where a cluster is by default composed of 16 blocks.  This feature
> can decrease the time spent on block allocation and reduce
> fragmentation, especially for large files.  The cluster size can be
> specified using the -C option.
> 
> Warning: The bigalloc feature is still under development, and may
> not be fully supported with your kernel or may have various bugs.
> Please see the web page
> http://ext4.wiki.kernel.org/index.php/Bigalloc for details.

Does this mean that a cluster is the minimum allocation unit, or can
two small files allocate different blocks in the same cluster, leaving
the cluster partially used?  If the former, then how is this different
than just using a larger block size?


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJQ9bIJAAoJEJrBOlT6nu75WskIAM6eNjA1updKy6Kh2SrMWavB
bX7EeTGmXMrxbQtMDgmG1+V2kOy9RoYtCZ5+pXijqJHzrovEtyIwHVdzntKSTtZi
tYSqjZrOpJ/bTJpXuP5AIew9mXRTKzGF8lNyPZkLIgX0AyhTsbC4cccpcmfsnGEX
RfuwDd2Z2NEKhmsXH4SI3HXDM2f4EGZmPqPG8It/B49HXrzfDq+YqzKwVqdrDJ5V
jdTLV5xjJ4E9Y+/P3EC1l2KvfDf0KjJjA2CiuG4sqrthwwQGfdEFK+MF2bfz5nMi
VBsuZQRF5kFgekpsHXy7b0Do9Qa3wMm9FL8Sv2QMy7xf92FxCwrLJFlpIZ9iuSQ=
=BKGM
-----END PGP SIGNATURE-----
Theodore Ts'o - Jan. 15, 2013, 7:57 p.m.
On Tue, Jan 15, 2013 at 02:46:17PM -0500, Phillip Susi wrote:
> 
> Does this mean that a cluster is the minimum allocation unit, or can
> two small files allocate different blocks in the same cluster, leaving
> the cluster partially used?  If the former, then how is this different
> than just using a larger block size?

The former.  The difference is that the indirect blocks and extents
still use units of blocks.  The reason is a pretty fundamental
limitation baked into the MM layer: the file system block size must be
less than or equal to the page size.  So on architectures where we
have 16k page sizes, we can use a 16k block size, but then you won't
be able to mount that file system on an x86 system.

So bigalloc is basically a hack because it was easier to make this
change in the file system than it is to deal with block sizes greater
than the page size.

						- Ted
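
The constraint in the reply above can be put into numbers. The sketch below is an illustration only, with hypothetical helper names; it assumes the 4096-byte base page size of x86 and simply encodes the rule that the block size may not exceed the page size, while the allocation unit under bigalloc is the block size times the cluster ratio.

```python
# Illustration of "block size <= page size". Helper names are
# hypothetical; this is not kernel code.

X86_PAGE_SIZE = 4096  # base page size on x86, in bytes


def can_mount(block_size: int, page_size: int) -> bool:
    """The MM-layer rule from the thread: block size must not exceed page size."""
    return block_size <= page_size


def allocation_unit(block_size: int, blocks_per_cluster: int = 1) -> int:
    """Bytes handed out per allocation: one block, or one cluster with bigalloc."""
    return block_size * blocks_per_cluster


# A 16k-block file system is fine on a 16k-page machine but not on x86.
print(can_mount(16384, 16384), can_mount(16384, X86_PAGE_SIZE))

# With bigalloc, 4k blocks stay mountable on x86 while the allocation
# unit grows to 16 blocks.
print(can_mount(4096, X86_PAGE_SIZE), allocation_unit(4096, 16))
```

The point of the comparison is that bigalloc enlarges the allocation unit without touching the block size that the MM layer sees.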
Phillip Susi - Jan. 15, 2013, 8:38 p.m.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 1/15/2013 2:57 PM, Theodore Ts'o wrote:
> So bigalloc is basically a hack because it was easier to make this 
> change in the file system than it is to deal with block sizes
> greater than the page size.

If it is only to get around the mm pagesize limit, then why not just
have the fs automatically lie to the kernel about the block size and
shift the references back and forth on the fly when it detects a
larger blocksize?


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJQ9b5XAAoJEJrBOlT6nu75ankH/39qiU2tahSVRekQ6kkyFeaY
RnGMydXcKgSacKAlvmIgP6VOnqPmBCKtdM8oxlob4+orJUksWiLmT7nvMIukUOqs
QGqPzsHtpIzNZnPB6soc6ToRbx+b53EM4fQ+XIt9egnJ4p6gDRiS83xKKjyZnywq
94ZPH5Zg84Xr+zmyUFRqs/cDG2tmbo/6qgkqkVUeFfdLzygq2K4LO/dFpuRg3oqV
ceUqVCieCfEplcjnyClT1uOv3RBrjCcyFW1j46UjEYiHkFENYsCSi/Hk4qOYDls+
i/bETjWABSXz23BMD2/B0wZwhGwkrsX2Y5g1CjtRksgPFthW/rmk0HzvWniC9J4=
=kJ8x
-----END PGP SIGNATURE-----
Theodore Ts'o - Jan. 15, 2013, 10:28 p.m.
On Tue, Jan 15, 2013 at 03:38:47PM -0500, Phillip Susi wrote:
> 
> If it is only to get around the mm pagesize limit, then why not just
> have the fs automatically lie to the kernel about the block size and
> shift the references back and forth on the fly when it detects a
> larger blocksize?

Because of the pain in dealing with how to handle random writes into a
sparse file.  We need to either track which blocks in the large block
have been initialized, or we would need to erase the entire large
block before writing the first page into the large block (and then you
still need to track whether or not you are writing that first or
subsequent page into a large block).

What we're doing with bigalloc is effectively tracking which blocks in
the cluster have been initialized by using entries in the extent tree,
since entries in the allocation bitmaps are in units of clusters, but
entries in the extent tree are in units of blocks.

Looking back at how complicated it has been to get delalloc right, it
might have been simpler to just do a brute-force sb_issue_zeroout when
the large block is freshly allocated, unless the request passed to
ext4_writepages() exactly covered the large block.  Getting the Direct
I/O path right would have been messy, but perhaps it would have been
less work in the end.

       	   	      	    	      - Ted
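
The split described above, bitmaps in units of clusters and extents in units of blocks, can be sketched with a toy model. Everything here is an assumption made for illustration: extents are plain (start_block, length) pairs over physical block numbers, and the helper names are hypothetical. It is not ext4's extent tree format.

```python
# Toy model: the bitmap says a whole cluster is allocated, while the
# extents say which blocks inside it are actually mapped. Names and
# data layout are hypothetical.

def cluster_blocks(cluster: int, blocks_per_cluster: int = 16) -> range:
    """Block numbers that belong to the given cluster."""
    start = cluster * blocks_per_cluster
    return range(start, start + blocks_per_cluster)


def mapped_blocks(extents, cluster: int, blocks_per_cluster: int = 16) -> set:
    """Blocks of `cluster` covered by some (start_block, length) extent."""
    blocks = cluster_blocks(cluster, blocks_per_cluster)
    lo, hi = blocks.start, blocks.stop
    mapped = set()
    for start, length in extents:
        # Clip each extent to the cluster's block range.
        mapped.update(range(max(start, lo), min(start + length, hi)))
    return mapped


def unmapped_blocks(extents, cluster: int, blocks_per_cluster: int = 16) -> set:
    """Blocks reserved by the cluster allocation but not mapped by any extent."""
    all_blocks = set(cluster_blocks(cluster, blocks_per_cluster))
    return all_blocks - mapped_blocks(extents, cluster, blocks_per_cluster)
```

For example, a single 4-block write into cluster 2 (blocks 32 to 47) leaves 12 blocks of that cluster allocated in the bitmap but absent from the extents, which is the state a large-block scheme would have to track by other means.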

Patch

diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
index d4fbe00..ca3083d 100644
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -187,6 +187,11 @@  Check the device for bad blocks before creating the file system.  If
 this option is specified twice, then a slower read-write
 test is used instead of a fast read-only test.
 .TP
+.B \-C " cluster-size"
+Specify the size of cluster in bytes.  Valid cluster-size values are from
+2048 to 256M bytes per cluster.  If omitted, cluster-size is 64KB by
+default.
+.TP
 .B \-D
 Use direct I/O when writing to the disk.  This avoids mke2fs dirtying a
 lot of buffer cache memory, which may impact other applications running
@@ -516,6 +521,12 @@  prefix the feature name with a  caret ('^') character.  The
 pseudo-filesystem feature "none" will clear all filesystem features.
 .RS 1.2i
 .TP
+.B bigalloc
+Allow to allocate block-size beyond the 4096 bytes.  That can decrease the time
+spent on doing block allocation and brings smaller fragmentation, especially
+for large files.  The size can be specified using the
+.BR \-C " option."
+.TP
 .B dir_index
 Use hashed b-trees to speed up lookups in large directories.
 .TP
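
The -C limits stated in the patch text, together with the power-of-two rule from the reworded bigalloc entry in the thread, can be sketched as a small check. This is not mke2fs's actual validation code: the function and constant names are hypothetical, "256M" is assumed to mean 256 MiB, and the default is taken as 16 blocks per cluster (64KB with 4096-byte blocks). A matching invocation would look something like `mke2fs -t ext4 -O bigalloc -C 65536 <device>`.

```python
# Hypothetical validation of a -C cluster-size value, based only on the
# limits quoted in the patch and thread. Not taken from mke2fs sources.

MIN_CLUSTER_SIZE = 2048               # bytes, lower bound from the patch text
MAX_CLUSTER_SIZE = 256 * 1024 * 1024  # bytes, assuming "256M" means 256 MiB


def default_cluster_size(block_size: int = 4096) -> int:
    """Default of 16 blocks per cluster (64KB for 4096-byte blocks)."""
    return 16 * block_size


def check_cluster_size(cluster_size: int, block_size: int = 4096) -> int:
    """Return the blocks-per-cluster ratio, or raise ValueError if invalid."""
    if not MIN_CLUSTER_SIZE <= cluster_size <= MAX_CLUSTER_SIZE:
        raise ValueError("cluster size out of range")
    ratio, remainder = divmod(cluster_size, block_size)
    if remainder != 0 or ratio < 1:
        raise ValueError("cluster size must be a multiple of the block size")
    if ratio & (ratio - 1):
        raise ValueError("cluster must be a power of two number of blocks")
    return ratio
```

Returning the ratio rather than a boolean keeps the number of blocks per cluster available to the caller, since the rest of the thread reasons in terms of that ratio.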