.. SPDX-License-Identifier: GPL-2.0

The journal, introduced in ext3, protects the ext4 filesystem against
metadata inconsistencies in the case of a system crash. Up
to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
size limits) can be reserved inside the filesystem as a place to land
“important” data writes on-disk as quickly as possible. Once the important
data transaction is fully written to the disk and flushed from the disk write
cache, a record of the data being committed is also written to the journal. At
some later point in time, the journal code writes the transactions to their
final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is the slowest but
safest mode. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, ext4 only stores the minimal delta needed to recreate the
affected metadata in a fast commit area that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it and thus empties the fast commit area for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.

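
The replay rule above can be illustrated with a toy model. This is not jbd2
code: the sketch assumes a hypothetical in-memory log made of
``(block kind, transaction ID)`` pairs and only shows which transactions end
in a commit record and would therefore be replayed.

```python
from typing import Iterable, List, Tuple

# Hypothetical string labels for the block kinds named in the text.
DESCRIPTOR, DATA, REVOKE, COMMIT = "descriptor", "data", "revoke", "commit"


def committed_transactions(log: Iterable[Tuple[str, int]]) -> List[int]:
    """Return the IDs of transactions that end with a commit record.

    Scanning stops at the first transaction without a commit record,
    because nothing after that point can be trusted.
    """
    committed: List[int] = []
    current = None                      # ID of the transaction being scanned
    for kind, tid in log:
        if current is None:
            if kind not in (DESCRIPTOR, REVOKE):
                break                   # a transaction must open with one of these
            current = tid
        if tid != current:
            break                       # the previous transaction never committed
        if kind == COMMIT:
            committed.append(tid)
            current = None
    return committed


log = [
    (DESCRIPTOR, 7), (DATA, 7), (COMMIT, 7),
    (REVOKE, 8), (DESCRIPTOR, 8), (DATA, 8), (COMMIT, 8),
    (DESCRIPTOR, 9), (DATA, 9),         # crash before the commit block
]
print(committed_transactions(log))      # [7, 8]
```

Transaction 9 has no commit record, so the model discards it, which mirrors
the behaviour described above. Checksum verification is left out.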
Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

- jbd2 magic number, 0xC03B3998.
- Description of what this block contains. See the jbd2_blocktype_ table
  below.
- The transaction ID that goes with this block.

The journal block type can be any one of:

- Descriptor. This block precedes a series of data blocks that were
  written through the journal during a transaction.
- Block commit record. This block signifies the completion of a
  transaction.
- Journal superblock, v1.
- Journal superblock, v2.
- Block revocation records. This speeds up recovery by enabling the
  journal to skip writing blocks that were subsequently rewritten.

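
As an illustration of the common header, here is a minimal sketch that decodes
the first 12 bytes of a journal block as three big-endian 32-bit values. The
numeric block type codes (1 through 5, in the order listed above) are stated
from memory of the kernel's ``include/linux/jbd2.h`` and should be treated as
an assumption; the function and variable names are made up for this example.

```python
import struct

JBD2_MAGIC_NUMBER = 0xC03B3998

# Assumed block type codes, in the order the types are listed above.
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revocation",
}


def parse_block_header(block: bytes):
    """Decode the 12-byte header; return None if the magic does not match."""
    if len(block) < 12:
        return None
    # All jbd2 fields are big-endian: magic, block type, transaction ID.
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC_NUMBER:
        return None
    return BLOCK_TYPES.get(blocktype, "unknown"), sequence


sample = struct.pack(">III", JBD2_MAGIC_NUMBER, 2, 41) + bytes(1012)
print(parse_block_header(sample))   # ('commit', 41)
```

A block whose first four bytes are not the magic number is not a journal
metadata block; it may simply be a data block.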
The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

- Static information describing the journal.
- journal_header_t (12 bytes)
- Common header identifying this as a superblock.
- Journal device block size.
- Total number of blocks in this journal.
- First block of log information.
- Dynamic information describing the current state of the log.
- First commit ID expected in log.
- Block number of the start of log. Contrary to the comments, this field
  being zero does not imply that the journal is clean!
- Error value, as set by jbd2_journal_abort().
- The remaining fields are only valid in a v2 superblock.
- Compatible feature set. See the table jbd2_compat_ below.
- Incompatible feature set. See the table jbd2_incompat_ below.
- s_feature_ro_compat
- Read-only compatible feature set. There aren't any of these currently.
- 128-bit uuid for journal. This is compared against the copy in the ext4
  super block at mount time.
- Number of file systems sharing this journal.
- Location of dynamic super block copy. (Not used?)
- Limit of journal blocks per transaction. (Not used?)
- Limit of data blocks per transaction. (Not used?)
- Checksum algorithm used for the journal. See jbd2_checksum_type_ for
  more info.
- Number of fast commit blocks in the journal.
- Block number of the head (first unused block) of the journal, only
  up-to-date when the journal is empty.
- Checksum of the entire superblock, with this field set to zero.
- ids of all file systems sharing the log. e2fsprogs/Linux don't allow
  shared external journals, but I imagine Lustre (or ocfs2?), which use
  the jbd2 code, might.

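
The leading part of the superblock can be decoded with a short sketch like
the one below. It assumes the fields appear back to back in the order
described above (header, static fields, dynamic fields, feature sets, UUID,
number of users), each as a big-endian 32-bit value except the 16-byte UUID,
and that block type codes 3 and 4 denote the v1 and v2 superblock. The
dictionary keys are illustrative names, not the struct member names.

```python
import struct
import uuid

# header (3 x be32), static (3 x be32), dynamic (3 x be32),
# feature sets (3 x be32), 16-byte UUID, number of users (be32).
SB_PREFIX = struct.Struct(">3I3I3I3I16sI")


def parse_journal_superblock(raw: bytes) -> dict:
    (magic, blocktype, _seq,
     blocksize, total_blocks, first_block,
     first_commit_id, log_start, error,
     compat, incompat, ro_compat,
     raw_uuid, nr_users) = SB_PREFIX.unpack_from(raw, 0)
    if magic != 0xC03B3998:
        raise ValueError("not a jbd2 block")
    if blocktype not in (3, 4):         # assumed codes for superblock v1 / v2
        raise ValueError("not a journal superblock")
    info = {
        "version": 1 if blocktype == 3 else 2,
        "blocksize": blocksize,
        "total_blocks": total_blocks,
        "first_block": first_block,
        "first_commit_id": first_commit_id,
        "log_start": log_start,
        "error": error,
    }
    if blocktype == 4:                  # the remaining fields are v2-only
        info.update(compat=compat, incompat=incompat, ro_compat=ro_compat,
                    uuid=uuid.UUID(bytes=raw_uuid), nr_users=nr_users)
    return info


sample_sb = SB_PREFIX.pack(0xC03B3998, 4, 0, 4096, 32768, 1,
                           100, 1, 0, 0, 0x12, 0,
                           bytes(range(16)), 1).ljust(1024, b"\0")
print(parse_journal_superblock(sample_sb)["blocksize"])   # 4096
```

The sketch stops before the checksum type, fast commit, and checksum fields;
verifying the superblock checksum is out of scope here.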
The journal compat features are any combination of the following:

- Journal maintains checksums on the data blocks.
  (JBD2_FEATURE_COMPAT_CHECKSUM)

The journal incompat features are any combination of the following:

- Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
- Journal can deal with 64-bit block numbers.
  (JBD2_FEATURE_INCOMPAT_64BIT)
- Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
- This journal uses v2 of the checksum on-disk format. Each journal
  metadata block gets its own checksum, and the block tags in the
  descriptor table contain checksums for each of the data blocks in the
  journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
- This journal uses v3 of the checksum on-disk format. This is the same as
  v2, but the journal block tag size is fixed regardless of the size of
  block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
- Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)

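
A feature word can be decoded into names with a few lines of code. The bit
values below are given from memory of the kernel header and are an
assumption to verify against ``include/linux/jbd2.h``; the class and
function names are invented for this sketch.

```python
from enum import IntFlag


class Incompat(IntFlag):
    # Assumed bit values for the JBD2_FEATURE_INCOMPAT_* flags.
    REVOKE = 0x1
    BIT64 = 0x2
    ASYNC_COMMIT = 0x4
    CSUM_V2 = 0x8
    CSUM_V3 = 0x10
    FAST_COMMIT = 0x20


def describe_incompat(value: int):
    """Split a feature word into known flag names and leftover unknown bits."""
    known_mask = 0
    names = []
    for flag in Incompat:
        known_mask |= flag.value
        if value & flag.value:
            names.append(flag.name)
    return names, value & ~known_mask


print(describe_incompat(0x12))   # (['BIT64', 'CSUM_V3'], 0)
print(describe_incompat(0x41))   # (['REVOKE'], 64)
```

Unknown bits are reported separately so that a caller can decide how to
handle a journal that uses features it does not understand.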
.. _jbd2_checksum_type:

Journal checksum type codes are one of the following. crc32 or crc32c are the
most likely choices.

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

- Common block header.
- struct journal_block_tag_s
- Enough tags either to fill up the block or to describe all the data
  blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

- Lower 32-bits of the location of where the corresponding data block
  should end up on disk.
- Flags that go with the descriptor. See the table jbd2_tag_flags_ for
  more info.
- Upper 32-bits of the location of where the corresponding data block
  should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
  not enabled.
- Checksum of the journal UUID, the sequence number, and the data block.
- This field appears to be open coded. It always comes at the end of the
  tag, after t_checksum. This field is not present if the "same UUID" flag
  is set.
- A UUID to go with this tag. This field appears to be copied from the
  ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
  field.

The journal tag flags are any combination of the following:

- On-disk block is escaped. The first four bytes of the data block just
  happened to match the jbd2 magic number.
- This block has the same UUID as previous, therefore the UUID field is
  omitted.
- The data block was deleted by the transaction. (Not used?)
- This is the last tag in this descriptor block.

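
To show how the flags drive parsing, the sketch below walks the tags of a
descriptor block in the CSUM_V3 format. It assumes the flag values 0x1
(escaped), 0x2 (same UUID), 0x4 (deleted), and 0x8 (last tag), a 16-byte tag
laid out as lower block number, flags, upper block number, checksum, and a
4-byte checksum tail at the end of the block. It does not verify any
checksum, and the names are illustrative.

```python
import struct

FLAG_ESCAPE = 0x1       # assumed values of the tag flags listed above
FLAG_SAME_UUID = 0x2
FLAG_DELETED = 0x4
FLAG_LAST_TAG = 0x8

HEADER_BYTES = 12
TAG3_BYTES = 16         # lower blocknr, flags, upper blocknr, checksum
UUID_BYTES = 16
TAIL_BYTES = 4          # checksum tail at the end of the block


def walk_tags_v3(block: bytes):
    """Yield (final_block_number, flags) for each tag in a descriptor block."""
    offset = HEADER_BYTES
    limit = len(block) - TAIL_BYTES
    while offset + TAG3_BYTES <= limit:
        low, flags, high, _csum = struct.unpack_from(">IIII", block, offset)
        offset += TAG3_BYTES
        if not flags & FLAG_SAME_UUID:
            offset += UUID_BYTES        # skip the open-coded UUID
        yield (high << 32) | low, flags
        if flags & FLAG_LAST_TAG:
            break


descriptor = (
    struct.pack(">III", 0xC03B3998, 1, 41)
    + struct.pack(">IIII", 100, 0, 0, 0) + bytes(16)            # tag + UUID
    + struct.pack(">IIII", 5, FLAG_SAME_UUID | FLAG_LAST_TAG, 1, 0)
).ljust(1024, b"\0")
print(list(walk_tags_v3(descriptor)))   # [(100, 0), (4294967301, 10)]
```

The first tag carries a UUID; the second reuses it and ends the descriptor.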
If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

- Lower 32-bits of the location of where the corresponding data block
  should end up on disk.
- Checksum of the journal UUID, the sequence number, and the data block.
  Note that only the lower 16 bits are stored.
- Flags that go with the descriptor. See the table jbd2_tag_flags_ for
  more info.
- This next field is only present if the super block indicates support for
  64-bit block numbers.
- Upper 32-bits of the location of where the corresponding data block
  should end up on disk.
- This field appears to be open coded. It always comes at the end of the
  tag, after t_flags or t_blocknr_high. This field is not present if the
  "same UUID" flag is set.
- A UUID to go with this tag. This field appears to be copied from the
  ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
  field.

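
The tag sizes quoted for the two formats (16 or 32 bytes with CSUM_V3; 8, 12,
24, or 28 bytes without) follow from three yes/no choices. The helper below
is a small sketch of that arithmetic; its name and parameters are made up
for the example.

```python
from itertools import product


def tag_bytes(csum_v3: bool, has_64bit: bool, same_uuid: bool) -> int:
    """Size in bytes of one journal block tag, including any trailing UUID."""
    if csum_v3:
        size = 16                       # fixed, regardless of block number size
    else:
        size = 8                        # blocknr (4) + checksum (2) + flags (2)
        if has_64bit:
            size += 4                   # upper 32 bits of the block number
    if not same_uuid:
        size += 16                      # open-coded UUID follows the tag
    return size


v3_sizes = sorted({tag_bytes(True, b, u) for b, u in product((False, True), repeat=2)})
old_sizes = sorted({tag_bytes(False, b, u) for b, u in product((False, True), repeat=2)})
print(v3_sizes)    # [16, 32]
print(old_sizes)   # [8, 12, 24, 28]
```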
If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

- Checksum of the journal UUID + the descriptor block, with this field set
  to zero.

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.

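
The escaping rule can be sketched as a pair of helpers. This assumes the
"escaped" flag has the value 0x1; the function names are hypothetical and
the code works on in-memory byte strings only.

```python
import struct

JBD2_MAGIC = struct.pack(">I", 0xC03B3998)   # magic number as big-endian bytes
FLAG_ESCAPE = 0x1                            # assumed value of the "escaped" flag


def escape_for_journal(data: bytes, flags: int = 0):
    """Return (bytes to store in the journal, updated tag flags)."""
    if data[:4] == JBD2_MAGIC:
        return bytes(4) + data[4:], flags | FLAG_ESCAPE
    return data, flags


def unescape_on_replay(data: bytes, flags: int) -> bytes:
    """Restore the original first four bytes if the tag says they were zeroed."""
    if flags & FLAG_ESCAPE:
        return JBD2_MAGIC + data[4:]
    return data


original = JBD2_MAGIC + b"payload that happens to start with the magic"
stored, tag_flags = escape_for_journal(original)
print(stored[:4], tag_flags)                 # b'\x00\x00\x00\x00' 1
print(unescape_on_replay(stored, tag_flags) == original)   # True
```

Without escaping, a scan of the journal could mistake such a data block for
a journal metadata block.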
A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled. Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

- Common block header.
- Number of bytes used in this block.

After r_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.

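
A revocation block can be read with a loop like the one sketched below. It
assumes that ``r_count`` counts bytes from the start of the block, so it
includes the 16 bytes of header and ``r_count`` itself; that assumption
should be checked against the jbd2 recovery code. Checksum tails are ignored
and the function name is illustrative.

```python
import struct


def revoked_blocks(block: bytes, has_64bit: bool):
    """Return the list of revoked block numbers stored in a revocation block."""
    _magic, _blocktype, _sequence, r_count = struct.unpack_from(">IIII", block, 0)
    record = ">Q" if has_64bit else ">I"    # 8-byte or 4-byte block numbers
    size = struct.calcsize(record)
    end = min(r_count, len(block))          # never read past the block
    offset = 16                             # records start right after r_count
    numbers = []
    while offset + size <= end:
        numbers.append(struct.unpack_from(record, block, offset)[0])
        offset += size
    return numbers


revoke = (
    struct.pack(">IIII", 0xC03B3998, 5, 41, 16 + 3 * 4)
    + struct.pack(">III", 10, 20, 30)
).ljust(1024, b"\0")
print(revoked_blocks(revoke, has_64bit=False))   # [10, 20, 30]
```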
If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

- Checksum of the journal UUID + revocation block

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, which is 60
bytes long (but uses a full block):

- Common block header.
- The type of checksum to use to verify the integrity of the data blocks
  in the transaction. See jbd2_checksum_type_ for more info.
- The number of bytes used by the checksum. Most likely 4.
- h_chksum[JBD2_CHECKSUM_BYTES]
- 32 bytes of space to store checksums. If
  JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
  is set, the first ``__be32`` is the checksum of the journal UUID and
  the entire commit block, with this field zeroed. If
  JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
  crc32 of all the blocks already written to the transaction.
- The time that the transaction was committed, in seconds since the epoch.
- Nanoseconds component of the above timestamp.

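
The commit block fields can be unpacked as sketched below. The layout is an
assumption based on the kernel's ``struct commit_header``: the 12-byte common
header, a one-byte checksum type, a one-byte checksum size, two padding
bytes, eight big-endian 32-bit checksum words, a 64-bit seconds value, and a
32-bit nanoseconds value. Dictionary keys are illustrative names.

```python
import struct

# header (3 x be32), type (u8), size (u8), 2 pad bytes,
# 8 checksum words (be32), seconds (be64), nanoseconds (be32).
COMMIT = struct.Struct(">3IBB2x8IQI")


def parse_commit_block(block: bytes) -> dict:
    fields = COMMIT.unpack_from(block, 0)
    magic, _blocktype, sequence, csum_type, csum_size = fields[:5]
    if magic != 0xC03B3998:
        raise ValueError("not a jbd2 block")
    return {
        "transaction_id": sequence,
        "checksum_type": csum_type,
        "checksum_size": csum_size,
        "checksum": fields[5],           # only the first word is described above
        "commit_sec": fields[13],
        "commit_nsec": fields[14],
    }


commit = COMMIT.pack(0xC03B3998, 2, 41, 4, 4,
                     0xDEADBEEF, 0, 0, 0, 0, 0, 0, 0,
                     1700000000, 500).ljust(1024, b"\0")
print(COMMIT.size)                                   # 60
print(parse_commit_block(commit)["commit_sec"])      # 1700000000
```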
The fast commit area is organized as a log of tag-length-value (TLV) records.
Each TLV has a ``struct ext4_fc_tl`` in the beginning which stores the tag and
the length of the entire field. It is followed by a variable-length,
tag-specific value. Here is the list of supported tags and their meanings:

.. list-table::
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and extent to be added in this inode
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offsets from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry
   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.
   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit, CRC of the fast commit of which this tag
       represents the end of

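
A generic TLV walk over the fast commit area might look like the sketch
below. Several details are assumptions rather than facts taken from this
document: the tag and length are each taken to be 16-bit little-endian
values (fast commit records are written by ext4 rather than jbd2), the
length is taken to count only the value bytes, and the numeric tag values
used for PAD and TAIL are placeholders to check against
``fs/ext4/fast_commit.h``.

```python
import struct

TL = struct.Struct("<HH")       # assumed: 16-bit tag, 16-bit value length
TAG_PAD = 7                     # assumed numeric tag values
TAG_TAIL = 8


def walk_tlvs(area: bytes):
    """Yield (tag, value bytes) until a TAIL tag or a truncated record."""
    offset = 0
    while offset + TL.size <= len(area):
        tag, length = TL.unpack_from(area, offset)
        value = area[offset + TL.size: offset + TL.size + length]
        if len(value) < length:
            break                # record runs past the end of the area
        yield tag, value
        offset += TL.size + length
        if tag == TAG_TAIL:
            break                # end of this fast commit


area = (
    TL.pack(4, 9) + b"dentry..."         # hypothetical LINK record
    + TL.pack(TAG_PAD, 3) + bytes(3)
    + TL.pack(TAG_TAIL, 8) + bytes(8)
    + b"stale bytes after the tail"
)
print([tag for tag, _ in walk_tlvs(area)])   # [4, 7, 8]
```

The walk stops at the first tail, so it covers a single fast commit; CRC
verification of the tail is not attempted.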
689 Fast Commit Replay Idempotence
690 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fast commit tags are idempotent in nature provided the recovery code follows
certain rules. The guiding principle that the commit path follows while
committing is that it stores the result of a particular operation instead of
storing the procedure.

Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
was associated with inode 10. During fast commit, instead of storing this
operation as a procedure "rename a to b", we store the resulting file system
state as a "series" of outcomes:

- Link dirent b to inode 10
- Unlink dirent a
- Inode 10 with valid refcount

Now when recovery code runs, it needs to "enforce" this state on the file
system. This is what guarantees idempotence of fast commit replay.

Let's take an example of a procedure that is not idempotent and see how fast
commits make it idempotent. Consider the following sequence of operations:

1) rm A
2) mv B A
3) read A

If we store this sequence of operations as is then the replay is not idempotent.
Let's say while in replay, we crash after (2). During the second replay,
file A (which was actually created as a result of "mv B A" operation) would get
deleted. Thus, file named A would be absent when we try to read A. So, this
sequence of operations is not idempotent. However, as mentioned above, instead
of storing the procedure fast commits store the outcome of each procedure. Thus
the fast commit log for above procedure would be as follows:

(Let's assume dirent A was linked to inode 10 and dirent B was linked to
inode 11 before the replay)

1) Unlink A
2) Link A to inode 11
3) Unlink B
4) Inode 11

If we crash after (3) we will have file A linked to inode 11. During the second
replay, we will remove file A (inode 11). But we will create it back and make
it point to inode 11. We won't find B, so we'll just skip that step. At this
point, the refcount for inode 11 is not reliable, but that gets fixed by the
replay of last inode 11 tag. Thus, by converting a non-idempotent procedure
into a series of idempotent outcomes, fast commits ensured idempotence during
the replay.

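
The difference between replaying procedures and replaying outcomes can be
reproduced with a toy model. The sketch below represents the directory as a
plain dictionary from name to inode number; it is not the ext4 replay code,
and it leaves out the final inode tag that repairs the refcount.

```python
def replay_procedures(dirents: dict, steps) -> None:
    """Replay a log that records operations (not idempotent)."""
    for op, *args in steps:
        if op == "rm":
            dirents.pop(args[0], None)
        elif op == "mv":
            src, dst = args
            if src in dirents:
                dirents[dst] = dirents.pop(src)


def replay_outcomes(dirents: dict, steps) -> None:
    """Replay a log that records the resulting state (idempotent)."""
    for op, *args in steps:
        if op == "unlink":
            dirents.pop(args[0], None)      # already gone: skip
        elif op == "link":
            name, inode = args
            dirents[name] = inode           # create or overwrite


procedures = [("rm", "A"), ("mv", "B", "A")]
outcomes = [("unlink", "A"), ("link", "A", 11), ("unlink", "B")]

by_procedure = {"A": 10, "B": 11}
replay_procedures(by_procedure, procedures)   # first replay, then a crash
replay_procedures(by_procedure, procedures)   # second replay from the start

by_outcome = {"A": 10, "B": 11}
replay_outcomes(by_outcome, outcomes)
replay_outcomes(by_outcome, outcomes)

print(by_procedure)   # {}
print(by_outcome)     # {'A': 11}
```

Replaying the procedure log twice loses file A, while replaying the outcome
log any number of times leaves A linked to inode 11.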
Journal Checkpoint
~~~~~~~~~~~~~~~~~~
Checkpointing the journal ensures all transactions and their associated buffers
are submitted to the disk. In-progress transactions are waited upon and included
in the checkpoint. Checkpointing is used internally during critical updates to
the filesystem including journal recovery, filesystem resizing, and freeing of
the journal_t structure.

A journal checkpoint can be triggered from userspace via the ioctl
EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags.
Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns an error if there is any
invalid input; otherwise it returns success without performing
any checkpointing. This can be used to check whether the ioctl exists on a
system and to verify there are no issues with arguments or flags. The
other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
discarded or zero-filled, respectively, after the journal checkpoint is
complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
cannot both be set. The ioctl may be useful when snapshotting a system or for
complying with content deletion SLOs.
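
The flag rules above can be checked in userspace before issuing the ioctl.
The sketch below only models those rules; it does not call the ioctl. The
numeric flag values are assumptions recalled from ``fs/ext4/ext4.h`` and
must be verified against the kernel headers in use, and the helper name is
hypothetical.

```python
# Assumed values of the EXT4_IOC_CHECKPOINT_FLAG_* constants.
FLAG_DISCARD = 0x1
FLAG_ZEROOUT = 0x2
FLAG_DRY_RUN = 0x4
FLAG_VALID = FLAG_DISCARD | FLAG_ZEROOUT | FLAG_DRY_RUN


def checkpoint_flags_valid(flags: int) -> bool:
    """Return True if the flag combination follows the documented rules."""
    if flags & ~FLAG_VALID:
        return False                        # unknown bits are set
    if flags & FLAG_DISCARD and flags & FLAG_ZEROOUT:
        return False                        # the two are mutually exclusive
    return True


print(checkpoint_flags_valid(FLAG_DRY_RUN | FLAG_DISCARD))   # True
print(checkpoint_flags_valid(FLAG_DISCARD | FLAG_ZEROOUT))   # False
```

On a real system the dry-run flag remains the authoritative check, since the
kernel performs its own validation.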