From patchwork Tue Jul 10 17:18:00 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Darrick Wong X-Patchwork-Id: 942181 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-ext4-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=oracle.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=oracle.com header.i=@oracle.com header.b="cebhFZo8"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 41QDH75dMKz9ryt for ; Wed, 11 Jul 2018 06:25:55 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732416AbeGJU0f (ORCPT ); Tue, 10 Jul 2018 16:26:35 -0400 Received: from aserp2130.oracle.com ([141.146.126.79]:45894 "EHLO aserp2130.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732332AbeGJU0f (ORCPT ); Tue, 10 Jul 2018 16:26:35 -0400 Received: from pps.filterd (aserp2130.oracle.com [127.0.0.1]) by aserp2130.oracle.com (8.16.0.22/8.16.0.22) with SMTP id w6AHEPMV044334; Tue, 10 Jul 2018 17:18:02 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=subject : from : to : cc : date : message-id : in-reply-to : references : mime-version : content-type : content-transfer-encoding; s=corp-2018-07-02; bh=m3mGURPTmNKVFusPzrOK4LVSq8gDZhQ2JQgKiNOckNA=; b=cebhFZo8W3xGgrHT9VhRej5QXMmNPGZ0sDufgJK1fiDyD2ylqLfQmSAW7y+dF8pQQRtf JaCw5XPLEfmmc4qnuoRdU8jM3v+c1Fyox1qsHd3xIcQHGUHIq6I/k1i6y/yERi19rFHj Rh5bdgxlPmL29jEdFQKtbtFMIrSrZiYR3Yh6mO0OpSXMv471XhQ6U1X7WIfqlsAv+L3t 4G5PGH3HTCsYtXBQHbYOSLyNyPOE1m/XrKvHE656LDtHyJ1S2qor6JRJ9U0scKFIgmva YWvCJIePLeMHR4pmP3RVSw0sbN4Ftac92UcpBDNfrwsbTVGlz1vC9oxTSSHu+69VuY0J dg== Received: from aserv0021.oracle.com (aserv0021.oracle.com [141.146.126.233]) by aserp2130.oracle.com with ESMTP id 2k2p75twt7-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 10 Jul 2018 17:18:02 +0000 Received: from userv0121.oracle.com (userv0121.oracle.com [156.151.31.72]) by aserv0021.oracle.com (8.14.4/8.14.4) with ESMTP id w6AHI1ek022507 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 10 Jul 2018 17:18:02 GMT Received: from abhmp0017.oracle.com (abhmp0017.oracle.com [141.146.116.23]) by userv0121.oracle.com (8.14.4/8.13.8) with ESMTP id w6AHI1TC018475; Tue, 10 Jul 2018 17:18:01 GMT Received: from localhost (/67.169.218.210) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Tue, 10 Jul 2018 10:18:01 -0700 Subject: [PATCH 12/13] ext4: import directory layout chapter from wiki page From: "Darrick J. Wong" To: tytso@mit.edu, darrick.wong@oracle.com Cc: linux-ext4@vger.kernel.org Date: Tue, 10 Jul 2018 10:18:00 -0700 Message-ID: <153124308013.17949.7356126977659638635.stgit@magnolia> In-Reply-To: <153124300257.17949.1127788286714678589.stgit@magnolia> References: <153124300257.17949.1127788286714678589.stgit@magnolia> User-Agent: StGit/0.17.1-dirty MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=8950 signatures=668706 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=4 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1806210000 definitions=main-1807100184 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org From: Darrick J. Wong Import the chapter about directory layout from the on-disk format wiki page into the kernel documentation. Signed-off-by: Darrick J. Wong --- .../filesystems/ext4/ondisk/directory.rst | 426 ++++++++++++++++++++ Documentation/filesystems/ext4/ondisk/dynamic.rst | 1 2 files changed, 427 insertions(+) create mode 100644 Documentation/filesystems/ext4/ondisk/directory.rst diff --git a/Documentation/filesystems/ext4/ondisk/directory.rst b/Documentation/filesystems/ext4/ondisk/directory.rst new file mode 100644 index 000000000000..8fcba68c2884 --- /dev/null +++ b/Documentation/filesystems/ext4/ondisk/directory.rst @@ -0,0 +1,426 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Directory Entries +----------------- + +In an ext4 filesystem, a directory is more or less a flat file that maps +an arbitrary byte string (usually ASCII) to an inode number on the +filesystem. There can be many directory entries across the filesystem +that reference the same inode number--these are known as hard links, and +that is why hard links cannot reference files on other filesystems. As +such, directory entries are found by reading the data block(s) +associated with a directory file for the particular directory entry that +is desired. + +Linear (Classic) Directories +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +By default, each directory lists its entries in an “almost-linear” +array. I write “almost” because it's not a linear array in the memory +sense because directory entries are not split across filesystem blocks. +Therefore, it is more accurate to say that a directory is a series of +data blocks and that each block contains a linear array of directory +entries. The end of each per-block array is signified by reaching the +end of the block; the last entry in the block has a record length that +takes it all the way to the end of the block. The end of the entire +directory is of course signified by reaching the end of the file. Unused +directory entries are signified by inode = 0. By default the filesystem +uses ``struct ext4_dir_entry_2`` for directory entries unless the +“filetype” feature flag is not set, in which case it uses +``struct ext4_dir_entry``. + +The original directory entry format is ``struct ext4_dir_entry``, which +is at most 263 bytes long, though on disk you'll need to reference +``dirent.rec_len`` to know for sure. + +.. list-table:: + :widths: 1 1 1 77 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - \_\_le32 + - inode + - Number of the inode that this directory entry points to. + * - 0x4 + - \_\_le16 + - rec\_len + - Length of this directory entry. Must be a multiple of 4. + * - 0x6 + - \_\_le16 + - name\_len + - Length of the file name. + * - 0x8 + - char + - name[EXT4\_NAME\_LEN] + - File name. + +Since file names cannot be longer than 255 bytes, the new directory +entry format shortens the rec\_len field and uses the space for a file +type flag, probably to avoid having to load every inode during directory +tree traversal. This format is ``ext4_dir_entry_2``, which is at most +263 bytes long, though on disk you'll need to reference +``dirent.rec_len`` to know for sure. + +.. list-table:: + :widths: 1 1 1 77 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - \_\_le32 + - inode + - Number of the inode that this directory entry points to. + * - 0x4 + - \_\_le16 + - rec\_len + - Length of this directory entry. + * - 0x6 + - \_\_u8 + - name\_len + - Length of the file name. + * - 0x7 + - \_\_u8 + - file\_type + - File type code, see ftype_ table below. + * - 0x8 + - char + - name[EXT4\_NAME\_LEN] + - File name. + +.. _ftype: + +The directory file type is one of the following values: + +.. list-table:: + :widths: 1 79 + :header-rows: 1 + + * - Value + - Description + * - 0x0 + - Unknown. + * - 0x1 + - Regular file. + * - 0x2 + - Directory. + * - 0x3 + - Character device file. + * - 0x4 + - Block device file. + * - 0x5 + - FIFO. + * - 0x6 + - Socket. + * - 0x7 + - Symbolic link. + +In order to add checksums to these classic directory blocks, a phony +``struct ext4_dir_entry`` is placed at the end of each leaf block to +hold the checksum. The directory entry is 12 bytes long. The inode +number and name\_len fields are set to zero to fool old software into +ignoring an apparently empty directory entry, and the checksum is stored +in the place where the name normally goes. The structure is +``struct ext4_dir_entry_tail``: + +.. list-table:: + :widths: 1 1 1 77 + :header-rows: 1 + + * - Offset + - Size + - Name + - Description + * - 0x0 + - \_\_le32 + - det\_reserved\_zero1 + - Inode number, which must be zero. + * - 0x4 + - \_\_le16 + - det\_rec\_len + - Length of this directory entry, which must be 12. + * - 0x6 + - \_\_u8 + - det\_reserved\_zero2 + - Length of the file name, which must be zero. + * - 0x7 + - \_\_u8 + - det\_reserved\_ft + - File type, which must be 0xDE. + * - 0x8 + - \_\_le32 + - det\_checksum + - Directory leaf block checksum. + +The leaf directory block checksum is calculated against the FS UUID, the +directory's inode number, the directory's inode generation number, and +the entire directory entry block up to (but not including) the fake +directory entry. + +Hash Tree Directories +~~~~~~~~~~~~~~~~~~~~~ + +A linear array of directory entries isn't great for performance, so a +new feature was added to ext3 to provide a faster (but peculiar) +balanced tree keyed off a hash of the directory entry name. If the +EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a +hashed btree (htree) to organize and find directory entries. For +backwards read-only compatibility with ext2, this tree is actually +hidden inside the directory file, masquerading as “empty” directory data +blocks! It was stated previously that the end of the linear directory +entry table was signified with an entry pointing to inode 0; this is +(ab)used to fool the old linear-scan algorithm into thinking that the +rest of the directory block is empty so that it moves on. + +The root of the tree always lives in the first data block of the +directory. By ext2 custom, the '.' and '..' entries must appear at the +beginning of this first block, so they are put here as two +``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of +the root node contains metadata about the tree and finally a hash->block +map to find nodes that are lower in the htree. If +``dx_root.info.indirect_levels`` is non-zero then the htree has two +levels; the data block pointed to by the root node's map is an interior +node, which is indexed by a minor hash. Interior nodes in this tree +contains a zeroed out ``struct ext4_dir_entry_2`` followed by a +minor\_hash->block map to find leafe nodes. Leaf nodes contain a linear +array of all ``struct ext4_dir_entry_2``; all of these entries +(presumably) hash to the same value. If there is an overflow, the +entries simply overflow into the next leaf node, and the +least-significant bit of the hash (in the interior node map) that gets +us to this next leaf node is set. + +To traverse the directory as a htree, the code calculates the hash of +the desired file name and uses it to find the corresponding block +number. If the tree is flat, the block is a linear array of directory +entries that can be searched; otherwise, the minor hash of the file name +is computed and used against this second block to find the corresponding +third block number. That third block number will be a linear array of +directory entries. + +To traverse the directory as a linear array (such as the old code does), +the code simply reads every data block in the directory. The blocks used +for the htree will appear to have no entries (aside from '.' and '..') +and so only the leaf nodes will appear to have any interesting content. + +The root of the htree is in ``struct dx_root``, which is the full length +of a data block: + +.. list-table:: + :widths: 1 1 1 77 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - \_\_le32 + - dot.inode + - inode number of this directory. + * - 0x4 + - \_\_le16 + - dot.rec\_len + - Length of this record, 12. + * - 0x6 + - u8 + - dot.name\_len + - Length of the name, 1. + * - 0x7 + - u8 + - dot.file\_type + - File type of this entry, 0x2 (directory) (if the feature flag is set). + * - 0x8 + - char + - dot.name[4] + - “.\\0\\0\\0” + * - 0xC + - \_\_le32 + - dotdot.inode + - inode number of parent directory. + * - 0x10 + - \_\_le16 + - dotdot.rec\_len + - block\_size - 12. The record length is long enough to cover all htree + data. + * - 0x12 + - u8 + - dotdot.name\_len + - Length of the name, 2. + * - 0x13 + - u8 + - dotdot.file\_type + - File type of this entry, 0x2 (directory) (if the feature flag is set). + * - 0x14 + - char + - dotdot\_name[4] + - “..\\0\\0” + * - 0x18 + - \_\_le32 + - struct dx\_root\_info.reserved\_zero + - Zero. + * - 0x1C + - u8 + - struct dx\_root\_info.hash\_version + - Hash type, see dirhash_ table below. + * - 0x1D + - u8 + - struct dx\_root\_info.info\_length + - Length of the tree information, 0x8. + * - 0x1E + - u8 + - struct dx\_root\_info.indirect\_levels + - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR + feature is set; cannot be larger than 2 otherwise. + * - 0x1F + - u8 + - struct dx\_root\_info.unused\_flags + - + * - 0x20 + - \_\_le16 + - limit + - Maximum number of dx\_entries that can follow this header, plus 1 for + the header itself. + * - 0x22 + - \_\_le16 + - count + - Actual number of dx\_entries that follow this header, plus 1 for the + header itself. + * - 0x24 + - \_\_le32 + - block + - The block number (within the directory file) that goes with hash=0. + * - 0x28 + - struct dx\_entry + - entries[0] + - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. + +.. _dirhash: + +The directory hash is one of the following values: + +.. list-table:: + :widths: 1 79 + :header-rows: 1 + + * - Value + - Description + * - 0x0 + - Legacy. + * - 0x1 + - Half MD4. + * - 0x2 + - Tea. + * - 0x3 + - Legacy, unsigned. + * - 0x4 + - Half MD4, unsigned. + * - 0x5 + - Tea, unsigned. + +Interior nodes of an htree are recorded as ``struct dx_node``, which is +also the full length of a data block: + +.. list-table:: + :widths: 1 1 1 77 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - \_\_le32 + - fake.inode + - Zero, to make it look like this entry is not in use. + * - 0x4 + - \_\_le16 + - fake.rec\_len + - The size of the block, in order to hide all of the dx\_node data. + * - 0x6 + - u8 + - name\_len + - Zero. There is no name for this “unused” directory entry. + * - 0x7 + - u8 + - file\_type + - Zero. There is no file type for this “unused” directory entry. + * - 0x8 + - \_\_le16 + - limit + - Maximum number of dx\_entries that can follow this header, plus 1 for + the header itself. + * - 0xA + - \_\_le16 + - count + - Actual number of dx\_entries that follow this header, plus 1 for the + header itself. + * - 0xE + - \_\_le32 + - block + - The block number (within the directory file) that goes with the lowest + hash value of this block. This value is stored in the parent block. + * - 0x12 + - struct dx\_entry + - entries[0] + - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. + +The hash maps that exist in both ``struct dx_root`` and +``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes +long: + +.. list-table:: + :widths: 1 1 1 77 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - \_\_le32 + - hash + - Hash code. + * - 0x4 + - \_\_le32 + - block + - Block number (within the directory file, not filesystem blocks) of the + next node in the htree. + +(If you think this is all quite clever and peculiar, so does the +author.) + +If metadata checksums are enabled, the last 8 bytes of the directory +block (precisely the length of one dx\_entry) are used to store a +``struct dx_tail``, which contains the checksum. The ``limit`` and +``count`` entries in the dx\_root/dx\_node structures are adjusted as +necessary to fit the dx\_tail into the block. If there is no space for +the dx\_tail, the user is notified to run e2fsck -D to rebuild the +directory index (which will ensure that there's space for the checksum. +The dx\_tail structure is 8 bytes long and looks like this: + +.. list-table:: + :widths: 1 1 1 77 + :header-rows: 1 + + * - Offset + - Type + - Name + - Description + * - 0x0 + - u32 + - dt\_reserved + - Zero. + * - 0x4 + - \_\_le32 + - dt\_checksum + - Checksum of the htree directory block. + +The checksum is calculated against the FS UUID, the htree index header +(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in +use, and the tail block (dx\_tail). diff --git a/Documentation/filesystems/ext4/ondisk/dynamic.rst b/Documentation/filesystems/ext4/ondisk/dynamic.rst index f090de8dd1c1..f2f14822b0f5 100644 --- a/Documentation/filesystems/ext4/ondisk/dynamic.rst +++ b/Documentation/filesystems/ext4/ondisk/dynamic.rst @@ -8,3 +8,4 @@ allocated to files. .. include:: inodes.rst .. include:: ifork.rst +.. include:: directory.rst