[09/13] ext4: import journal chapter from wiki page

Message ID 153124306096.17949.3938060770047064298.stgit@magnolia
State Awaiting Upstream
Series
  • ext4: major documentation surgery

Commit Message

Darrick J. Wong July 10, 2018, 5:17 p.m.
From: Darrick J. Wong <darrick.wong@oracle.com>

Import the chapter about the journal from the on-disk format wiki
page into the kernel documentation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Documentation/filesystems/ext4/ondisk/globals.rst |    1 
 Documentation/filesystems/ext4/ondisk/journal.rst |  611 +++++++++++++++++++++
 2 files changed, 612 insertions(+)
 create mode 100644 Documentation/filesystems/ext4/ondisk/journal.rst

Patch

diff --git a/Documentation/filesystems/ext4/ondisk/globals.rst b/Documentation/filesystems/ext4/ondisk/globals.rst
index fe6e107bf515..368bf7662b96 100644
--- a/Documentation/filesystems/ext4/ondisk/globals.rst
+++ b/Documentation/filesystems/ext4/ondisk/globals.rst
@@ -10,3 +10,4 @@  have static metadata at fixed locations.
 .. include:: group_descr.rst
 .. include:: bitmaps.rst
 .. include:: mmp.rst
+.. include:: journal.rst
diff --git a/Documentation/filesystems/ext4/ondisk/journal.rst b/Documentation/filesystems/ext4/ondisk/journal.rst
new file mode 100644
index 000000000000..e7031af86876
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/journal.rst
@@ -0,0 +1,611 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+Journal (jbd2)
+--------------
+
+The ext4 filesystem uses a journal, introduced in ext3, to protect the
+filesystem against corruption in the case of a system crash. A small
+contiguous region of disk (default 128MiB) is reserved inside the
+filesystem as a place to land “important” data writes on-disk as quickly
+as possible. Once the important data transaction is fully written to the
+disk and flushed from the disk write cache, a record of the data being
+committed is also written to the journal. At some later point in time,
+the journal code writes the transactions to their final locations on
+disk (this could involve a lot of seeking or a lot of small
+read-write-erases) before erasing the commit record. Should the system
+crash during the second slow write, the journal can be replayed all the
+way to the latest commit record, guaranteeing the atomicity of whatever
+gets written through the journal to the disk. The effect of this is to
+guarantee that the filesystem does not become stuck midway through a
+metadata update.
+
+For performance reasons, ext4 by default only writes filesystem metadata
+through the journal. This means that file data blocks are *not*
+guaranteed to be in any consistent state after a crash. If this default
+guarantee level (``data=ordered``) is not satisfactory, there is a mount
+option to control journal behavior. If ``data=journal``, all data and
+metadata are written to disk through the journal. This is slower but
+safest. If ``data=writeback``, dirty data blocks are not flushed to the
+disk before the metadata are written to disk through the journal.
+
+The journal inode is typically inode 8. The first 68 bytes of the
+journal inode are replicated in the ext4 superblock. The journal itself
+is a normal (but hidden) file within the filesystem. The file usually
+consumes an entire block group, though mke2fs tries to put it in the
+middle of the disk.
+
+All fields in jbd2 are written to disk in big-endian order. This is the
+opposite of ext4.
+
+NOTE: Both ext4 and ocfs2 use jbd2.
+
+The maximum size of a journal embedded in an ext4 filesystem is 2^32
+blocks. jbd2 itself does not seem to care.
+
+Layout
+~~~~~~
+
+Generally speaking, the journal has this format:
+
+.. list-table::
+   :widths: 1 1 78
+   :header-rows: 1
+
+   * - Superblock
+     - descriptor\_block (data\_blocks or revocation\_block) [more data or
+       revocations] commit\_block
+     - [more transactions...]
+   * - 
+     - One transaction
+     -
+
+Notice that a transaction begins with either a descriptor and some data,
+or a block revocation list. A finished transaction always ends with a
+commit. If there is no commit record (or the checksums don't match), the
+transaction will be discarded during replay.
+
+External Journal
+~~~~~~~~~~~~~~~~
+
+Optionally, an ext4 filesystem can be created with an external journal
+device (as opposed to an internal journal, which uses a reserved inode).
+In this case, on the filesystem device, ``s_journal_inum`` should be
+zero and ``s_journal_uuid`` should be set. On the journal device there
+will be an ext4 super block in the usual place, with a matching UUID.
+The journal superblock will be in the next full block after the
+superblock.
+
+.. list-table::
+   :widths: 1 1 1 1 76
+   :header-rows: 1
+
+   * - 1024 bytes of padding
+     - ext4 Superblock
+     - Journal Superblock
+     - descriptor\_block (data\_blocks or revocation\_block) [more data or
+       revocations] commit\_block
+     - [more transactions...]
+   * - 
+     -
+     -
+     - One transaction
+     -
+
+Block Header
+~~~~~~~~~~~~
+
+Every block in the journal starts with a common 12-byte header
+``struct journal_header_s``:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - \_\_be32
+     - h\_magic
+     - jbd2 magic number, 0xC03B3998.
+   * - 0x4
+     - \_\_be32
+     - h\_blocktype
+     - Description of what this block contains. See the jbd2_blocktype_ table
+       below.
+   * - 0x8
+     - \_\_be32
+     - h\_sequence
+     - The transaction ID that goes with this block.
+
+.. _jbd2_blocktype:
+
+The journal block type can be any one of:
+
+.. list-table::
+   :widths: 1 79
+   :header-rows: 1
+
+   * - Value
+     - Description
+   * - 1
+     - Descriptor. This block precedes a series of data blocks that were
+       written through the journal during a transaction.
+   * - 2
+     - Block commit record. This block signifies the completion of a
+       transaction.
+   * - 3
+     - Journal superblock, v1.
+   * - 4
+     - Journal superblock, v2.
+   * - 5
+     - Block revocation records. This speeds up recovery by enabling the
+       journal to skip writing blocks that were subsequently rewritten.
+
+Super Block
+~~~~~~~~~~~
+
+The super block for the journal is much simpler than ext4's.
+The key data kept within are the size of the journal, and where to find the
+start of the log of transactions.
+
+The journal superblock is recorded as ``struct journal_superblock_s``,
+which is 1024 bytes long:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * -
+     -
+     -
+     - Static information describing the journal.
+   * - 0x0
+     - journal\_header\_t (12 bytes)
+     - s\_header
+     - Common header identifying this as a superblock.
+   * - 0xC
+     - \_\_be32
+     - s\_blocksize
+     - Journal device block size.
+   * - 0x10
+     - \_\_be32
+     - s\_maxlen
+     - Total number of blocks in this journal.
+   * - 0x14
+     - \_\_be32
+     - s\_first
+     - First block of log information.
+   * -
+     -
+     -
+     - Dynamic information describing the current state of the log.
+   * - 0x18
+     - \_\_be32
+     - s\_sequence
+     - First commit ID expected in log.
+   * - 0x1C
+     - \_\_be32
+     - s\_start
+     - Block number of the start of log. Contrary to the comments, this field
+       being zero does not imply that the journal is clean!
+   * - 0x20
+     - \_\_be32
+     - s\_errno
+     - Error value, as set by jbd2\_journal\_abort().
+   * -
+     -
+     -
+     - The remaining fields are only valid in a v2 superblock.
+   * - 0x24
+     - \_\_be32
+     - s\_feature\_compat
+     - Compatible feature set. See the table jbd2_compat_ below.
+   * - 0x28
+     - \_\_be32
+     - s\_feature\_incompat
+     - Incompatible feature set. See the table jbd2_incompat_ below.
+   * - 0x2C
+     - \_\_be32
+     - s\_feature\_ro\_compat
+     - Read-only compatible feature set. There aren't any of these currently.
+   * - 0x30
+     - \_\_u8
+     - s\_uuid[16]
+     - 128-bit uuid for journal. This is compared against the copy in the ext4
+       super block at mount time.
+   * - 0x40
+     - \_\_be32
+     - s\_nr\_users
+     - Number of file systems sharing this journal.
+   * - 0x44
+     - \_\_be32
+     - s\_dynsuper
+     - Location of dynamic super block copy. (Not used?)
+   * - 0x48
+     - \_\_be32
+     - s\_max\_transaction
+     - Limit of journal blocks per transaction. (Not used?)
+   * - 0x4C
+     - \_\_be32
+     - s\_max\_trans\_data
+     - Limit of data blocks per transaction. (Not used?)
+   * - 0x50
+     - \_\_u8
+     - s\_checksum\_type
+     - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
+       more info.
+   * - 0x51
+     - \_\_u8[3]
+     - s\_padding2
+     -
+   * - 0x54
+     - \_\_u32
+     - s\_padding[42]
+     -
+   * - 0xFC
+     - \_\_be32
+     - s\_checksum
+     - Checksum of the entire superblock, with this field set to zero.
+   * - 0x100
+     - \_\_u8
+     - s\_users[16\*48]
+     - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
+       shared external journals, but I imagine Lustre (or ocfs2?), which use
+       the jbd2 code, might.
+
+.. _jbd2_compat:
+
+The journal compat features are any combination of the following:
+
+.. list-table::
+   :widths: 1 79
+   :header-rows: 1
+
+   * - Value
+     - Description
+   * - 0x1
+     - Journal maintains checksums on the data blocks.
+       (JBD2\_FEATURE\_COMPAT\_CHECKSUM)
+
+.. _jbd2_incompat:
+
+The journal incompat features are any combination of the following:
+
+.. list-table::
+   :widths: 1 79
+   :header-rows: 1
+
+   * - Value
+     - Description
+   * - 0x1
+     - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
+   * - 0x2
+     - Journal can deal with 64-bit block numbers.
+       (JBD2\_FEATURE\_INCOMPAT\_64BIT)
+   * - 0x4
+     - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
+   * - 0x8
+     - This journal uses v2 of the checksum on-disk format. Each journal
+       metadata block gets its own checksum, and the block tags in the
+       descriptor table contain checksums for each of the data blocks in the
+       journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
+   * - 0x10
+     - This journal uses v3 of the checksum on-disk format. This is the same as
+       v2, but the journal block tag size is fixed regardless of the size of
+       block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)
+
+.. _jbd2_checksum_type:
+
+Journal checksum type codes are one of the following.  crc32 or crc32c are the
+most likely choices.
+
+.. list-table::
+   :widths: 1 79
+   :header-rows: 1
+
+   * - Value
+     - Description
+   * - 1
+     - CRC32
+   * - 2
+     - MD5
+   * - 3
+     - SHA1
+   * - 4
+     - CRC32C
+
+Descriptor Block
+~~~~~~~~~~~~~~~~
+
+The descriptor block contains an array of journal block tags that
+describe the final locations of the data blocks that follow in the
+journal. Descriptor blocks are open-coded instead of being completely
+described by a data structure, but here is the block structure anyway.
+Descriptor blocks consume at least 36 bytes, but use a full block:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - journal\_header\_t
+     - (open coded)
+     - Common block header.
+   * - 0xC
+     - struct journal\_block\_tag\_s
+     - open coded array[]
+     - Enough tags either to fill up the block or to describe all the data
+       blocks that follow this descriptor block.
+
+Journal block tags have any of the following formats, depending on which
+journal feature and block tag flags are set.
+
+If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
+defined as ``struct journal_block_tag3_s``, which looks like the
+following. The size is 16 or 32 bytes.
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - \_\_be32
+     - t\_blocknr
+     - Lower 32 bits of the location of where the corresponding data block
+       should end up on disk.
+   * - 0x4
+     - \_\_be32
+     - t\_flags
+     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
+       more info.
+   * - 0x8
+     - \_\_be32
+     - t\_blocknr\_high
+     - Upper 32 bits of the location of where the corresponding data block
+       should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
+       not enabled.
+   * - 0xC
+     - \_\_be32
+     - t\_checksum
+     - Checksum of the journal UUID, the sequence number, and the data block.
+   * -
+     -
+     -
+     - This field appears to be open coded. It always comes at the end of the
+       tag, after t_checksum. This field is not present if the "same UUID" flag
+       is set.
+   * - 0x10
+     - char
+     - uuid[16]
+     - A UUID to go with this tag. This field appears to be copied from the
+       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
+       field.
+
+.. _jbd2_tag_flags:
+
+The journal tag flags are any combination of the following:
+
+.. list-table::
+   :widths: 1 79
+   :header-rows: 1
+
+   * - Value
+     - Description
+   * - 0x1
+     - On-disk block is escaped. The first four bytes of the data block just
+       happened to match the jbd2 magic number.
+   * - 0x2
+     - This block has the same UUID as previous, therefore the UUID field is
+       omitted.
+   * - 0x4
+     - The data block was deleted by the transaction. (Not used?)
+   * - 0x8
+     - This is the last tag in this descriptor block.
+
+If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
+is defined as ``struct journal_block_tag_s``, which looks like the
+following. The size is 8, 12, 24, or 28 bytes:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - \_\_be32
+     - t\_blocknr
+     - Lower 32 bits of the location of where the corresponding data block
+       should end up on disk.
+   * - 0x4
+     - \_\_be16
+     - t\_checksum
+     - Checksum of the journal UUID, the sequence number, and the data block.
+       Note that only the lower 16 bits are stored.
+   * - 0x6
+     - \_\_be16
+     - t\_flags
+     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
+       more info.
+   * -
+     -
+     -
+     - This next field is only present if the super block indicates support for
+       64-bit block numbers.
+   * - 0x8
+     - \_\_be32
+     - t\_blocknr\_high
+     - Upper 32 bits of the location of where the corresponding data block
+       should end up on disk.
+   * -
+     -
+     -
+     - This field appears to be open coded. It always comes at the end of the
+       tag, after t_flags or t_blocknr_high. This field is not present if the
+       "same UUID" flag is set.
+   * - 0x8 or 0xC
+     - char
+     - uuid[16]
+     - A UUID to go with this tag. This field appears to be copied from the
+       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
+       field.
+
+If either JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
+JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the block is a
+``struct jbd2_journal_block_tail``, which looks like this:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - \_\_be32
+     - t\_checksum
+     - Checksum of the journal UUID + the descriptor block, with this field set
+       to zero.
+
+Data Block
+~~~~~~~~~~
+
+In general, the data blocks being written to disk through the journal
+are written verbatim into the journal file after the descriptor block.
+However, if the first four bytes of the block match the jbd2 magic
+number then those four bytes are replaced with zeroes and the “escaped”
+flag is set in the descriptor block tag.
+
+Revocation Block
+~~~~~~~~~~~~~~~~
+
+A revocation block is used to prevent replay of a block in an earlier
+transaction. This is used to mark blocks that were journalled at one
+time but are no longer journalled. Typically this happens if a metadata
+block is freed and re-allocated as a file data block; in this case, a
+journal replay after the file block was written to disk will cause
+corruption.
+
+**NOTE**: This mechanism is NOT used to express “this journal block is
+superseded by this other journal block”, as the author (djwong)
+mistakenly thought. Any block being added to a transaction will cause
+the removal of all existing revocation records for that block.
+
+Revocation blocks are described in
+``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
+length, but use a full block:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - journal\_header\_t
+     - r\_header
+     - Common block header.
+   * - 0xC
+     - \_\_be32
+     - r\_count
+     - Number of bytes used in this block.
+   * - 0x10
+     - \_\_be32 or \_\_be64
+     - blocks[0]
+     - Blocks to revoke.
+
+After r\_count is a linear array of block numbers that are effectively
+revoked by this transaction. The size of each block number is 8 bytes if
+the superblock advertises 64-bit block number support, or 4 bytes
+otherwise.
+
+If either JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
+JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the revocation
+block is a ``struct jbd2_journal_revoke_tail``, which has this format:
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - \_\_be32
+     - r\_checksum
+     - Checksum of the journal UUID + revocation block
+
+Commit Block
+~~~~~~~~~~~~
+
+The commit block is a sentinel that indicates that a transaction has been
+completely written to the journal. Once this commit block reaches the
+journal, the data stored with this transaction can be written to their
+final locations on disk.
+
+The commit block is described by ``struct commit_header``, which is 60
+bytes long (but uses a full block):
+
+.. list-table::
+   :widths: 1 1 1 77
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Description
+   * - 0x0
+     - journal\_header\_s
+     - (open coded)
+     - Common block header.
+   * - 0xC
+     - unsigned char
+     - h\_chksum\_type
+     - The type of checksum to use to verify the integrity of the data blocks
+       in the transaction. See jbd2_checksum_type_ for more info.
+   * - 0xD
+     - unsigned char
+     - h\_chksum\_size
+     - The number of bytes used by the checksum. Most likely 4.
+   * - 0xE
+     - unsigned char
+     - h\_padding[2]
+     -
+   * - 0x10
+     - \_\_be32
+     - h\_chksum[JBD2\_CHECKSUM\_BYTES]
+     - 32 bytes of space to store checksums. If
+       JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
+       is set, the first ``__be32`` is the checksum of the journal UUID and
+       the entire commit block, with this field zeroed. If
+       JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
+       crc32 of all the blocks already written to the transaction.
+   * - 0x30
+     - \_\_be64
+     - h\_commit\_sec
+     - The time that the transaction was committed, in seconds since the epoch.
+   * - 0x38
+     - \_\_be32
+     - h\_commit\_nsec
+     - Nanoseconds component of the above timestamp.
+
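Illustrative note (not part of the patch): the block header and block tag
layouts documented in journal.rst can be exercised with a short Python sketch.
It assumes only what the tables state: a 12-byte big-endian header with magic
0xC03B3998, block types 1 through 5, and tag sizes of 16 or 32 bytes with
CSUM_V3, or 8, 12, 24, or 28 bytes without it. The names
``parse_block_header`` and ``block_tag_size`` are hypothetical helpers invented
for this example; they are not jbd2 or e2fsprogs functions.

```python
import struct

# Values taken from the Block Header and block type tables in journal.rst.
JBD2_MAGIC = 0xC03B3998
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revocation",
}


def parse_block_header(block):
    """Decode the common 12-byte header (h_magic, h_blocktype, h_sequence).

    All jbd2 fields are big-endian. Returns None when the magic number does
    not match, i.e. the block is not a journal metadata block.
    """
    if len(block) < 12:
        raise ValueError("a journal block header needs at least 12 bytes")
    magic, blocktype, sequence = struct.unpack(">III", block[:12])
    if magic != JBD2_MAGIC:
        return None
    return {
        "blocktype": blocktype,
        "type": BLOCK_TYPES.get(blocktype, "unknown"),
        "sequence": sequence,
    }


def block_tag_size(csum_v3, has_64bit, same_uuid):
    """Size in bytes of one journal block tag, per the tag tables.

    csum_v3:   JBD2_FEATURE_INCOMPAT_CSUM_V3 is set (fixed 16-byte tag).
    has_64bit: JBD2_FEATURE_INCOMPAT_64BIT is set (adds t_blocknr_high
               to the non-v3 tag).
    same_uuid: the "same UUID" tag flag (0x2) is set, so uuid[16] is omitted.
    """
    if csum_v3:
        size = 16
    else:
        size = 8 + (4 if has_64bit else 0)
    if not same_uuid:
        size += 16
    return size
```

Passing the first 12 bytes of any journal block to ``parse_block_header``
gives the block type and transaction ID. A data block normally yields None,
since data blocks that happen to start with the magic number are escaped.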