   1 .. SPDX-License-Identifier: GPL-2.0
   2
   3 Journal (jbd2)
   4 --------------
   5
   6 Introduced in ext3, the ext4 filesystem employs a journal to protect the
   7 filesystem against metadata inconsistencies in the case of a system crash. Up
   8 to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
   9 size limits) can be reserved inside the filesystem as a place to land
  10 “important” data writes on-disk as quickly as possible. Once the important
  11 data transaction is fully written to the disk and flushed from the disk write
  12 cache, a record of the data being committed is also written to the journal. At
  13 some later point in time, the journal code writes the transactions to their
  14 final locations on disk (this could involve a lot of seeking or a lot of small
  15 read-write-erases) before erasing the commit record. Should the system
  16 crash during the second slow write, the journal can be replayed all the
  17 way to the latest commit record, guaranteeing the atomicity of whatever
  18 gets written through the journal to the disk. The effect of this is to
  19 guarantee that the filesystem does not become stuck midway through a
  20 metadata update.
  21
  22 For performance reasons, ext4 by default only writes filesystem metadata
  23 through the journal. This means that file data blocks are *not*
  24 guaranteed to be in any consistent state after a crash. If this default
  25 guarantee level (``data=ordered``) is not satisfactory, there is a mount
  26 option to control journal behavior. If ``data=journal``, all data and
  27 metadata are written to disk through the journal. This is slower but
  28 safest. If ``data=writeback``, dirty data blocks are not flushed to the
  29 disk before the metadata are written to disk through the journal.
  30
  31 In case of ``data=ordered`` mode, Ext4 also supports fast commits which
  32 help reduce commit latency significantly. The default ``data=ordered``
  33 mode works by logging metadata blocks to the journal. In fast commit
  34 mode, Ext4 only stores the minimal delta needed to recreate the
  35 affected metadata in fast commit space that is shared with JBD2.
  36 Once the fast commit area fills up, or if a fast commit is not possible,
  37 or if the JBD2 commit timer goes off, Ext4 performs a traditional full commit.
  38 A full commit invalidates all the fast commits that happened before
  39 it and thus it makes the fast commit area empty for further fast
  40 commits. This feature needs to be enabled at mkfs time.
  41
  42 The journal inode is typically inode 8. The first 68 bytes of the
  43 journal inode are replicated in the ext4 superblock. The journal itself
  44 is a normal (but hidden) file within the filesystem. The file usually
  45 consumes an entire block group, though mke2fs tries to put it in the
  46 middle of the disk.
  47
  48 All fields in jbd2 are written to disk in big-endian order. This is the
  49 opposite of ext4.
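Because the byte order differs between the two formats, a parser has to decode
every on-disk integer explicitly. The following is a minimal sketch in C; the
helper names ``get_be32`` and ``get_le32`` are illustrative only and are not
kernel functions.

.. code-block:: c

   #include <stdint.h>

   /* Decode a 32-bit big-endian value, as used by jbd2 on-disk fields. */
   static inline uint32_t get_be32(const uint8_t *p)
   {
       return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
              ((uint32_t)p[2] << 8) | (uint32_t)p[3];
   }

   /* Decode a 32-bit little-endian value, as used by ext4 on-disk fields. */
   static inline uint32_t get_le32(const uint8_t *p)
   {
       return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
              ((uint32_t)p[1] << 8) | (uint32_t)p[0];
   }

The same four bytes therefore decode to different values depending on which
structure they belong to.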
  50
  51 NOTE: Both ext4 and ocfs2 use jbd2.
  52
  53 The maximum size of a journal embedded in an ext4 filesystem is 2^32
  54 blocks. jbd2 itself does not seem to care.
  55
  56 Layout
  57 ~~~~~~
  58
  59 Generally speaking, the journal has this format:
  60
  61 .. list-table::
  62    :widths: 16 48 16
  63    :header-rows: 1
  64
  65    * - Superblock
  66      - descriptor_block (data_blocks or revocation_block) [more data or
  67        revocations] commit_block
  68      - [more transactions...]
  69    * -
  70      - One transaction
  71      -
  72
  73 Notice that a transaction begins with either a descriptor and some data,
  74 or a block revocation list. A finished transaction always ends with a
  75 commit. If there is no commit record (or the checksums don't match), the
  76 transaction will be discarded during replay.
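The discard rule can be sketched as a scan over the header-bearing blocks of the
log. This is a simplified model, not the kernel's recovery code: data blocks
(which carry no header) and checksum verification are left out, and the names
``struct log_block`` and ``count_committed`` are hypothetical.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* Block types from the jbd2 block header (h_blocktype). */
   enum { BT_DESCRIPTOR = 1, BT_COMMIT = 2, BT_REVOKE = 5 };

   /* Hypothetical in-memory view of one header-bearing journal block. */
   struct log_block {
       uint32_t blocktype;
       uint32_t sequence;   /* transaction ID (h_sequence) */
   };

   /*
    * Count the transactions that end in a commit record, starting at
    * first_seq. The scan stops at the first block whose transaction ID
    * is not the one expected next, so a trailing transaction without a
    * commit record is never counted.
    */
   static inline unsigned count_committed(const struct log_block *log,
                                          size_t n, uint32_t first_seq)
   {
       uint32_t expect = first_seq;
       unsigned committed = 0;

       for (size_t i = 0; i < n; i++) {
           if (log[i].sequence != expect)
               break;
           if (log[i].blocktype == BT_COMMIT) {
               committed++;
               expect++;
           }
       }
       return committed;
   }

A log holding a full transaction 5 followed by a transaction 6 that lacks its
commit record yields a count of one; only transaction 5 would be replayed.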
  77
  78 External Journal
  79 ~~~~~~~~~~~~~~~~
  80
  81 Optionally, an ext4 filesystem can be created with an external journal
  82 device (as opposed to an internal journal, which uses a reserved inode).
  83 In this case, on the filesystem device, ``s_journal_inum`` should be
  84 zero and ``s_journal_uuid`` should be set. On the journal device there
  85 will be an ext4 super block in the usual place, with a matching UUID.
  86 The journal superblock will be in the next full block after the
  87 superblock.
  88
  89 .. list-table::
  90    :widths: 12 12 12 32 12
  91    :header-rows: 1
  92
  93    * - 1024 bytes of padding
  94      - ext4 Superblock
  95      - Journal Superblock
  96      - descriptor_block (data_blocks or revocation_block) [more data or
  97        revocations] commit_block
  98      - [more transactions...]
  99    * -
 100      -
 101      -
 102      - One transaction
 103      -
 104
 105 Block Header
 106 ~~~~~~~~~~~~
 107
 108 Every block in the journal starts with a common 12-byte header
 109 ``struct journal_header_s``:
 110
 111 .. list-table::
 112    :widths: 8 8 24 40
 113    :header-rows: 1
 114
 115    * - Offset
 116      - Type
 117      - Name
 118      - Description
 119    * - 0x0
 120      - __be32
 121      - h_magic
 122      - jbd2 magic number, 0xC03B3998.
 123    * - 0x4
 124      - __be32
 125      - h_blocktype
 126      - Description of what this block contains. See the jbd2_blocktype_ table
 127        below.
 128    * - 0x8
 129      - __be32
 130      - h_sequence
 131      - The transaction ID that goes with this block.
 132
 133 .. _jbd2_blocktype:
 134
 135 The journal block type can be any one of:
 136
 137 .. list-table::
 138    :widths: 16 64
 139    :header-rows: 1
 140
 141    * - Value
 142      - Description
 143    * - 1
 144      - Descriptor. This block precedes a series of data blocks that were
 145        written through the journal during a transaction.
 146    * - 2
 147      - Block commit record. This block signifies the completion of a
 148        transaction.
 149    * - 3
 150      - Journal superblock, v1.
 151    * - 4
 152      - Journal superblock, v2.
 153    * - 5
 154      - Block revocation records. This speeds up recovery by enabling the
 155        journal to skip writing blocks that were subsequently rewritten.
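As an illustration of the two tables above, the sketch below decodes the
12-byte header from a raw block and rejects anything that does not carry the
jbd2 magic number or a known block type. ``struct jh_sketch``, ``jh_be32`` and
``jh_parse`` are hypothetical names chosen for this example.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   #define JBD2_MAGIC 0xC03B3998u   /* h_magic value from the table above */

   /* Host-endian copy of the common block header. */
   struct jh_sketch {
       uint32_t magic;
       uint32_t blocktype;
       uint32_t sequence;
   };

   static inline uint32_t jh_be32(const uint8_t *p)
   {
       return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
              ((uint32_t)p[2] << 8) | (uint32_t)p[3];
   }

   /* Returns 0 and fills *out if buf starts with a valid header, else -1. */
   static inline int jh_parse(const uint8_t *buf, size_t len,
                              struct jh_sketch *out)
   {
       if (len < 12)
           return -1;
       out->magic = jh_be32(buf);
       out->blocktype = jh_be32(buf + 4);
       out->sequence = jh_be32(buf + 8);
       if (out->magic != JBD2_MAGIC)
           return -1;
       if (out->blocktype < 1 || out->blocktype > 5)
           return -1;
       return 0;
   }

Data blocks carry no such header, so a reader only applies this check to the
blocks it expects to be journal metadata.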
 156
 157 Super Block
 158 ~~~~~~~~~~~
 159
 160 The super block for the journal is much simpler than ext4's.
 161 The key data kept within are the size of the journal and where to find the
 162 start of the log of transactions.
 163
 164 The journal superblock is recorded as ``struct journal_superblock_s``,
 165 which is 1024 bytes long:
 166
 167 .. list-table::
 168    :widths: 8 8 24 40
 169    :header-rows: 1
 170
 171    * - Offset
 172      - Type
 173      - Name
 174      - Description
 175    * -
 176      -
 177      -
 178      - Static information describing the journal.
 179    * - 0x0
 180      - journal_header_t (12 bytes)
 181      - s_header
 182      - Common header identifying this as a superblock.
 183    * - 0xC
 184      - __be32
 185      - s_blocksize
 186      - Journal device block size.
 187    * - 0x10
 188      - __be32
 189      - s_maxlen
 190      - Total number of blocks in this journal.
 191    * - 0x14
 192      - __be32
 193      - s_first
 194      - First block of log information.
 195    * -
 196      -
 197      -
 198      - Dynamic information describing the current state of the log.
 199    * - 0x18
 200      - __be32
 201      - s_sequence
 202      - First commit ID expected in log.
 203    * - 0x1C
 204      - __be32
 205      - s_start
 206      - Block number of the start of log. Contrary to the comments, this field
 207        being zero does not imply that the journal is clean!
 208    * - 0x20
 209      - __be32
 210      - s_errno
 211      - Error value, as set by jbd2_journal_abort().
 212    * -
 213      -
 214      -
 215      - The remaining fields are only valid in a v2 superblock.
 216    * - 0x24
 217      - __be32
 218      - s_feature_compat
 219      - Compatible feature set. See the table jbd2_compat_ below.
 220    * - 0x28
 221      - __be32
 222      - s_feature_incompat
 223      - Incompatible feature set. See the table jbd2_incompat_ below.
 224    * - 0x2C
 225      - __be32
 226      - s_feature_ro_compat
 227      - Read-only compatible feature set. There aren't any of these currently.
 228    * - 0x30
 229      - __u8
 230      - s_uuid[16]
 231      - 128-bit uuid for journal. This is compared against the copy in the ext4
 232        super block at mount time.
 233    * - 0x40
 234      - __be32
 235      - s_nr_users
 236      - Number of file systems sharing this journal.
 237    * - 0x44
 238      - __be32
 239      - s_dynsuper
 240      - Location of dynamic super block copy. (Not used?)
 241    * - 0x48
 242      - __be32
 243      - s_max_transaction
 244      - Limit of journal blocks per transaction. (Not used?)
 245    * - 0x4C
 246      - __be32
 247      - s_max_trans_data
 248      - Limit of data blocks per transaction. (Not used?)
 249    * - 0x50
 250      - __u8
 251      - s_checksum_type
 252      - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
 253        more info.
 254    * - 0x51
 255      - __u8[3]
 256      - s_padding2
 257      -
 258    * - 0x54
 259      - __be32
 260      - s_num_fc_blocks
 261      - Number of fast commit blocks in the journal.
 262    * - 0x58
 263      - __be32
 264      - s_head
 265      - Block number of the head (first unused block) of the journal, only
 266        up-to-date when the journal is empty.
 267    * - 0x5C
 268      - __u32
 269      - s_padding[40]
 270      -
 271    * - 0xFC
 272      - __be32
 273      - s_checksum
 274      - Checksum of the entire superblock, with this field set to zero.
 275    * - 0x100
 276      - __u8
 277      - s_users[16*48]
 278      - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
 279        shared external journals, but I imagine Lustre (or ocfs2?), which use
 280        the jbd2 code, might.
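The offsets in the table above can be cross-checked by writing the layout down
as a C structure and letting the compiler verify it. This is a sketch for
illustration rather than the kernel's definition: ``struct jsb_sketch`` is a
hypothetical name, the common header is flattened into three 32-bit words, and
the multi-byte fields hold raw big-endian values that still need decoding.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   struct jsb_sketch {
       uint32_t s_header[3];          /* h_magic, h_blocktype, h_sequence */
       uint32_t s_blocksize;
       uint32_t s_maxlen;
       uint32_t s_first;
       uint32_t s_sequence;
       uint32_t s_start;
       uint32_t s_errno;
       uint32_t s_feature_compat;     /* v2 fields start here */
       uint32_t s_feature_incompat;
       uint32_t s_feature_ro_compat;
       uint8_t  s_uuid[16];
       uint32_t s_nr_users;
       uint32_t s_dynsuper;
       uint32_t s_max_transaction;
       uint32_t s_max_trans_data;
       uint8_t  s_checksum_type;
       uint8_t  s_padding2[3];
       uint32_t s_num_fc_blocks;
       uint32_t s_head;
       uint32_t s_padding[40];
       uint32_t s_checksum;
       uint8_t  s_users[16 * 48];
   };

   /* Compile-time checks against the documented offsets. */
   _Static_assert(offsetof(struct jsb_sketch, s_uuid) == 0x30, "s_uuid");
   _Static_assert(offsetof(struct jsb_sketch, s_checksum_type) == 0x50,
                  "s_checksum_type");
   _Static_assert(offsetof(struct jsb_sketch, s_head) == 0x58, "s_head");
   _Static_assert(offsetof(struct jsb_sketch, s_checksum) == 0xFC,
                  "s_checksum");
   _Static_assert(offsetof(struct jsb_sketch, s_users) == 0x100, "s_users");
   _Static_assert(sizeof(struct jsb_sketch) == 1024, "total size");

Every member is naturally aligned, so no packing attribute is needed for the
layout to match.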
 281
 282 .. _jbd2_compat:
 283
 284 The journal compat features are any combination of the following:
 285
 286 .. list-table::
 287    :widths: 16 64
 288    :header-rows: 1
 289
 290    * - Value
 291      - Description
 292    * - 0x1
 293      - Journal maintains checksums on the data blocks.
 294        (JBD2_FEATURE_COMPAT_CHECKSUM)
 295
 296 .. _jbd2_incompat:
 297
 298 The journal incompat features are any combination of the following:
 299
 300 .. list-table::
 301    :widths: 16 64
 302    :header-rows: 1
 303
 304    * - Value
 305      - Description
 306    * - 0x1
 307      - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
 308    * - 0x2
 309      - Journal can deal with 64-bit block numbers.
 310        (JBD2_FEATURE_INCOMPAT_64BIT)
 311    * - 0x4
 312      - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
 313    * - 0x8
 314      - This journal uses v2 of the checksum on-disk format. Each journal
 315        metadata block gets its own checksum, and the block tags in the
 316        descriptor table contain checksums for each of the data blocks in the
 317        journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
 318    * - 0x10
 319      - This journal uses v3 of the checksum on-disk format. This is the same as
 320        v2, but the journal block tag size is fixed regardless of the size of
 321        block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
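A reader might test the incompat bits along the following lines. The macro and
function names are shortened, hypothetical stand-ins for the
``JBD2_FEATURE_INCOMPAT_*`` constants, and treating v3 as taking precedence
over v2 is an assumption of this sketch.

.. code-block:: c

   #include <stdint.h>

   #define INCOMPAT_REVOKE        0x01u
   #define INCOMPAT_64BIT         0x02u
   #define INCOMPAT_ASYNC_COMMIT  0x04u
   #define INCOMPAT_CSUM_V2       0x08u
   #define INCOMPAT_CSUM_V3       0x10u
   #define INCOMPAT_FAST_COMMIT   0x20u
   #define INCOMPAT_KNOWN         0x3Fu   /* OR of all bits listed above */

   /* Nonzero if every incompat bit set is one this sketch knows about. */
   static inline int incompat_supported(uint32_t incompat)
   {
       return (incompat & ~INCOMPAT_KNOWN) == 0;
   }

   /* Returns 3 or 2 for the checksum format in use, 0 if neither is set. */
   static inline int csum_version(uint32_t incompat)
   {
       if (incompat & INCOMPAT_CSUM_V3)
           return 3;
       if (incompat & INCOMPAT_CSUM_V2)
           return 2;
       return 0;
   }

The value passed in is ``s_feature_incompat`` after conversion from
big-endian.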
 322    * - 0x20
 323      - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
 324
 325 .. _jbd2_checksum_type:
 326
 327 Journal checksum type codes are one of the following.  crc32 or crc32c are the
 328 most likely choices.
 329
 330 .. list-table::
 331    :widths: 16 64
 332    :header-rows: 1
 333
 334    * - Value
 335      - Description
 336    * - 1
 337      - CRC32
 338    * - 2
 339      - MD5
 340    * - 3
 341      - SHA1
 342    * - 4
 343      - CRC32C
 344
 345 Descriptor Block
 346 ~~~~~~~~~~~~~~~~
 347
 348 The descriptor block contains an array of journal block tags that
 349 describe the final locations of the data blocks that follow in the
 350 journal. Descriptor blocks are open-coded instead of being completely
 351 described by a data structure, but here is the block structure anyway.
 352 Descriptor blocks consume at least 36 bytes, but use a full block:
 353
 354 .. list-table::
 355    :widths: 8 8 24 40
 356    :header-rows: 1
 357
 358    * - Offset
 359      - Type
 360      - Name
 361      - Description
 362    * - 0x0
 363      - journal_header_t
 364      - (open coded)
 365      - Common block header.
 366    * - 0xC
 367      - struct journal_block_tag_s
 368      - open coded array[]
 369      - Enough tags either to fill up the block or to describe all the data
 370        blocks that follow this descriptor block.
 371
 372 Journal block tags have any of the following formats, depending on which
 373 journal feature and block tag flags are set.
 374
 375 If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
 376 defined as ``struct journal_block_tag3_s``, which looks like the
 377 following. The size is 16 or 32 bytes.
 378
 379 .. list-table::
 380    :widths: 8 8 24 40
 381    :header-rows: 1
 382
 383    * - Offset
 384      - Type
 385      - Name
 386      - Description
 387    * - 0x0
 388      - __be32
 389      - t_blocknr
 390      - Lower 32-bits of the location of where the corresponding data block
 391        should end up on disk.
 392    * - 0x4
 393      - __be32
 394      - t_flags
 395      - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
 396        more info.
 397    * - 0x8
 398      - __be32
 399      - t_blocknr_high
 400      - Upper 32-bits of the location of where the corresponding data block
 401        should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
 402        not enabled.
 403    * - 0xC
 404      - __be32
 405      - t_checksum
 406      - Checksum of the journal UUID, the sequence number, and the data block.
 407    * -
 408      -
 409      -
 410      - This field appears to be open coded. It always comes at the end of the
 411        tag, after t_checksum. This field is not present if the "same UUID" flag
 412        is set.
 413    * - 0x10
 414      - char
 415      - uuid[16]
 416      - A UUID to go with this tag. This field appears to be copied from the
 417        ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
 418        field.
 419
 420 .. _jbd2_tag_flags:
 421
 422 The journal tag flags are any combination of the following:
 423
 424 .. list-table::
 425    :widths: 16 64
 426    :header-rows: 1
 427
 428    * - Value
 429      - Description
 430    * - 0x1
 431      - On-disk block is escaped. The first four bytes of the data block just
 432        happened to match the jbd2 magic number.
 433    * - 0x2
 434      - This block has the same UUID as previous, therefore the UUID field is
 435        omitted.
 436    * - 0x4
 437      - The data block was deleted by the transaction. (Not used?)
 438    * - 0x8
 439      - This is the last tag in this descriptor block.
 440
 441 If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
 442 is defined as ``struct journal_block_tag_s``, which looks like the
 443 following. The size is 8, 12, 24, or 28 bytes:
 444
 445 .. list-table::
 446    :widths: 8 8 24 40
 447    :header-rows: 1
 448
 449    * - Offset
 450      - Type
 451      - Name
 452      - Description
 453    * - 0x0
 454      - __be32
 455      - t_blocknr
 456      - Lower 32-bits of the location of where the corresponding data block
 457        should end up on disk.
 458    * - 0x4
 459      - __be16
 460      - t_checksum
 461      - Checksum of the journal UUID, the sequence number, and the data block.
 462        Note that only the lower 16 bits are stored.
 463    * - 0x6
 464      - __be16
 465      - t_flags
 466      - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
 467        more info.
 468    * -
 469      -
 470      -
 471      - This next field is only present if the super block indicates support for
 472        64-bit block numbers.
 473    * - 0x8
 474      - __be32
 475      - t_blocknr_high
 476      - Upper 32-bits of the location of where the corresponding data block
 477        should end up on disk.
 478    * -
 479      -
 480      -
 481      - This field appears to be open coded. It always comes at the end of the
 482        tag, after t_flags or t_blocknr_high. This field is not present if the
 483        "same UUID" flag is set.
 484    * - 0x8 or 0xC
 485      - char
 486      - uuid[16]
 487      - A UUID to go with this tag. This field appears to be copied from the
 488        ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
 489        field.
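The tag sizes quoted for both formats follow from the feature bits and the
"same UUID" tag flag. A minimal sketch of that arithmetic, with hypothetical
macro and function names:

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   #define INCOMPAT_64BIT      0x02u   /* JBD2_FEATURE_INCOMPAT_64BIT */
   #define INCOMPAT_CSUM_V3    0x10u   /* JBD2_FEATURE_INCOMPAT_CSUM_V3 */
   #define TAG_FLAG_SAME_UUID  0x2u    /* UUID field omitted from this tag */

   /* Number of bytes one journal block tag occupies in a descriptor block. */
   static inline size_t tag_bytes(uint32_t incompat, uint32_t tag_flags)
   {
       size_t size;

       if (incompat & INCOMPAT_CSUM_V3)
           size = 16;                  /* fixed-size journal_block_tag3_s */
       else
           size = (incompat & INCOMPAT_64BIT) ? 12 : 8;

       if (!(tag_flags & TAG_FLAG_SAME_UUID))
           size += 16;                 /* trailing uuid[16] */
       return size;
   }

A descriptor block parser advances by this amount after each tag until it sees
the "last tag" flag.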
 490
 491 If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
 492 JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the block is a
 493 ``struct jbd2_journal_block_tail``, which looks like this:
 494
 495 .. list-table::
 496    :widths: 8 8 24 40
 497    :header-rows: 1
 498
 499    * - Offset
 500      - Type
 501      - Name
 502      - Description
 503    * - 0x0
 504      - __be32
 505      - t_checksum
 506      - Checksum of the journal UUID + the descriptor block, with this field set
 507        to zero.
 508
 509 Data Block
 510 ~~~~~~~~~~
 511
 512 In general, the data blocks being written to disk through the journal
 513 are written verbatim into the journal file after the descriptor block.
 514 However, if the first four bytes of the block match the jbd2 magic
 515 number then those four bytes are replaced with zeroes and the “escaped”
 516 flag is set in the descriptor block tag.
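The escaping step, and the matching restore step on replay, can be sketched as
follows. The function names are hypothetical; the restore step puts back the
four bytes that were zeroed when the block was journalled.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   #define TAG_FLAG_ESCAPE 0x1u   /* "on-disk block is escaped" tag flag */

   /* The jbd2 magic number 0xC03B3998 as it appears on disk (big-endian). */
   static const uint8_t jbd2_magic_be[4] = {0xC0, 0x3B, 0x39, 0x98};

   /*
    * Journal-write side: zero the first four bytes if they collide with
    * the magic number. Returns the flag to OR into the tag's t_flags.
    */
   static inline uint32_t escape_block(uint8_t *block)
   {
       if (memcmp(block, jbd2_magic_be, 4) != 0)
           return 0;
       memset(block, 0, 4);
       return TAG_FLAG_ESCAPE;
   }

   /* Replay side: restore the magic bytes of an escaped block. */
   static inline void unescape_block(uint8_t *block, uint32_t tag_flags)
   {
       if (tag_flags & TAG_FLAG_ESCAPE)
           memcpy(block, jbd2_magic_be, 4);
   }

Without this step, a data block that happens to start with the magic number
could be mistaken for journal metadata during a scan of the log.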
 517
 518 Revocation Block
 519 ~~~~~~~~~~~~~~~~
 520
 521 A revocation block is used to prevent replay of a block in an earlier
 522 transaction. This is used to mark blocks that were journalled at one
 523 time but are no longer journalled. Typically this happens if a metadata
 524 block is freed and re-allocated as a file data block; in this case, a
 525 journal replay after the file block was written to disk will cause
 526 corruption.
 527
 528 **NOTE**: This mechanism is NOT used to express “this journal block is
 529 superseded by this other journal block”, as the author (djwong)
 530 mistakenly thought. Any block being added to a transaction will cause
 531 the removal of all existing revocation records for that block.
 532
 533 Revocation blocks are described in
 534 ``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
 535 length, but use a full block:
 536
 537 .. list-table::
 538    :widths: 8 8 24 40
 539    :header-rows: 1
 540
 541    * - Offset
 542      - Type
 543      - Name
 544      - Description
 545    * - 0x0
 546      - journal_header_t
 547      - r_header
 548      - Common block header.
 549    * - 0xC
 550      - __be32
 551      - r_count
 552      - Number of bytes used in this block.
 553    * - 0x10
 554      - __be32 or __be64
 555      - blocks[0]
 556      - Blocks to revoke.
 557
 558 After r_count is a linear array of block numbers that are effectively
 559 revoked by this transaction. The size of each block number is 8 bytes if
 560 the superblock advertises 64-bit block number support, or 4 bytes
 561 otherwise.
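Given ``r_count`` and the block number width, the number of revoked block
numbers in one revocation block can be derived as sketched below. The sketch
assumes that ``r_count`` includes the 16 bytes of header and ``r_count`` field
themselves; ``revoke_entries`` is a hypothetical helper.

.. code-block:: c

   #include <stdint.h>

   #define REVOKE_HEADER_BYTES 16u   /* 12-byte block header + r_count */

   /*
    * Returns the number of block numbers stored in a revocation block,
    * or -1 if r_count is inconsistent with the entry width.
    */
   static inline long revoke_entries(uint32_t r_count, int has_64bit)
   {
       uint32_t width = has_64bit ? 8u : 4u;

       if (r_count < REVOKE_HEADER_BYTES)
           return -1;
       if ((r_count - REVOKE_HEADER_BYTES) % width != 0)
           return -1;
       return (long)((r_count - REVOKE_HEADER_BYTES) / width);
   }

For example, an ``r_count`` of 24 describes two 32-bit entries or one 64-bit
entry.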
 562
 563 If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
 564 JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the revocation
 565 block is a ``struct jbd2_journal_revoke_tail``, which has this format:
 566
 567 .. list-table::
 568    :widths: 8 8 24 40
 569    :header-rows: 1
 570
 571    * - Offset
 572      - Type
 573      - Name
 574      - Description
 575    * - 0x0
 576      - __be32
 577      - r_checksum
 578      - Checksum of the journal UUID + revocation block
 579
 580 Commit Block
 581 ~~~~~~~~~~~~
 582
 583 The commit block is a sentry that indicates that a transaction has been
 584 completely written to the journal. Once this commit block reaches the
 585 journal, the data stored with this transaction can be written to their
 586 final locations on disk.
 587
 588 The commit block is described by ``struct commit_header``, which is 60
 589 bytes long (but uses a full block):
 590
 591 .. list-table::
 592    :widths: 8 8 24 40
 593    :header-rows: 1
 594
 595    * - Offset
 596      - Type
 597      - Name
 598      - Description
 599    * - 0x0
 600      - journal_header_s
 601      - (open coded)
 602      - Common block header.
 603    * - 0xC
 604      - unsigned char
 605      - h_chksum_type
 606      - The type of checksum to use to verify the integrity of the data blocks
 607        in the transaction. See jbd2_checksum_type_ for more info.
 608    * - 0xD
 609      - unsigned char
 610      - h_chksum_size
 611      - The number of bytes used by the checksum. Most likely 4.
 612    * - 0xE
 613      - unsigned char
 614      - h_padding[2]
 615      -
 616    * - 0x10
 617      - __be32
 618      - h_chksum[JBD2_CHECKSUM_BYTES]
 619      - 32 bytes of space to store checksums. If
 620        JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
 621        is set, the first ``__be32`` is the checksum of the journal UUID and
 622        the entire commit block, with this field zeroed. If
 623        JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
 624        crc32 of all the blocks already written to the transaction.
 625    * - 0x30
 626      - __be64
 627      - h_commit_sec
 628      - The time that the transaction was committed, in seconds since the epoch.
 629    * - 0x38
 630      - __be32
 631      - h_commit_nsec
 632      - Nanoseconds component of the above timestamp.
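As a small example of reading this layout, the sketch below extracts the commit
timestamp from a raw commit block using the offsets in the table above.
``struct commit_time`` and ``commit_time_from_block`` are hypothetical names,
and the caller is assumed to pass a buffer of at least 0x3C bytes.

.. code-block:: c

   #include <stdint.h>

   /* Host-endian commit timestamp. */
   struct commit_time {
       uint64_t sec;    /* h_commit_sec, seconds since the epoch */
       uint32_t nsec;   /* h_commit_nsec */
   };

   static inline struct commit_time commit_time_from_block(const uint8_t *blk)
   {
       struct commit_time t = {0, 0};

       /* Both fields are big-endian: most significant byte first. */
       for (int i = 0; i < 8; i++)
           t.sec = (t.sec << 8) | blk[0x30 + i];
       for (int i = 0; i < 4; i++)
           t.nsec = (t.nsec << 8) | blk[0x38 + i];
       return t;
   }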
 633
 634 Fast commits
 635 ~~~~~~~~~~~~
 636
 637 The fast commit area is organized as a log of tag-length-value (TLV) records.
 638 Each TLV begins with a ``struct ext4_fc_tl`` that stores the tag and the length
 639 of the entire field, followed by a variable-length, tag-specific value.
 640 Here is the list of supported tags and their meanings:
 641
 642 .. list-table::
 643    :widths: 8 20 20 32
 644    :header-rows: 1
 645
 646    * - Tag
 647      - Meaning
 648      - Value struct
 649      - Description
 650    * - EXT4_FC_TAG_HEAD
 651      - Fast commit area header
 652      - ``struct ext4_fc_head``
 653      - Stores the TID of the transaction after which these fast commits should
 654        be applied.
 655    * - EXT4_FC_TAG_ADD_RANGE
 656      - Add extent to inode
 657      - ``struct ext4_fc_add_range``
 658      - Stores the inode number and extent to be added in this inode
 659    * - EXT4_FC_TAG_DEL_RANGE
 660      - Remove logical offset range from inode
 661      - ``struct ext4_fc_del_range``
 662      - Stores the inode number and the logical offset range that needs to be
 663        removed
 664    * - EXT4_FC_TAG_CREAT
 665      - Create directory entry for a newly created file
 666      - ``struct ext4_fc_dentry_info``
 667      - Stores the parent inode number, inode number and directory entry of the
 668        newly created file
 669    * - EXT4_FC_TAG_LINK
 670      - Link a directory entry to an inode
 671      - ``struct ext4_fc_dentry_info``
 672      - Stores the parent inode number, inode number and directory entry
 673    * - EXT4_FC_TAG_UNLINK
 674      - Unlink a directory entry of an inode
 675      - ``struct ext4_fc_dentry_info``
 676      - Stores the parent inode number, inode number and directory entry
 677
 678    * - EXT4_FC_TAG_PAD
 679      - Padding (unused area)
 680      - None
 681      - Unused bytes in the fast commit area.
 682
 683    * - EXT4_FC_TAG_TAIL
 684      - Mark the end of a fast commit
 685      - ``struct ext4_fc_tail``
 686      - Stores the TID of the commit and the CRC of the fast commit that this
 687        tag terminates
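Walking such a log is a matter of reading a header, skipping the value, and
repeating. The sketch below is a generic TLV walker, not the ext4 replay code.
It assumes a 4-byte header made of a 16-bit tag followed by a 16-bit length,
both little-endian, with the length counting only the value bytes; these
layout details and the name ``fc_walk`` are assumptions of the example and
should be checked against ``struct ext4_fc_tl`` in the kernel sources.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   #define TL_HEADER_BYTES 4u   /* assumed: 16-bit tag + 16-bit length */

   static inline uint16_t get_le16(const uint8_t *p)
   {
       return (uint16_t)(p[0] | (p[1] << 8));
   }

   /*
    * Record the tag of each complete TLV found in area[0..len) into tags[],
    * up to max entries. Stops at the first truncated record. Returns the
    * number of tags stored.
    */
   static inline size_t fc_walk(const uint8_t *area, size_t len,
                                uint16_t *tags, size_t max)
   {
       size_t off = 0, n = 0;

       while (len - off >= TL_HEADER_BYTES && n < max) {
           uint16_t tag = get_le16(area + off);
           uint16_t vlen = get_le16(area + off + 2);

           if (vlen > len - off - TL_HEADER_BYTES)
               break;              /* value runs past the end of the area */
           tags[n++] = tag;
           off += TL_HEADER_BYTES + vlen;
       }
       return n;
   }

A real replay would additionally dispatch on the tag and stop at the tail tag
once its CRC has been verified.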
 688
 689 Fast Commit Replay Idempotence
 690 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 691
 692 Fast commit tags are idempotent provided the recovery code follows
 693 certain rules. The guiding principle that the commit path follows while
 694 committing is that it stores the result of a particular operation instead of
 695 storing the procedure.
 696
 697 Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
 698 was associated with inode 10. During fast commit, instead of storing this
 699 operation as a procedure "rename a to b", we store the resulting file system
 700 state as a "series" of outcomes:
 701
 702 - Link dirent b to inode 10
 703 - Unlink dirent a
 704 - Inode 10 with valid refcount
 705
 706 Now when the recovery code runs, it needs to "enforce" this state on the file
 707 system. This is what guarantees idempotence of fast commit replay.
 708
 709 Let's take an example of a procedure that is not idempotent and see how fast
 710 commits make it idempotent. Consider the following sequence of operations:
 711
 712 1) rm A
 713 2) mv B A
 714 3) read A
 715
 716 If we store this sequence of operations as is, then the replay is not idempotent.
 717 Let's say that during replay, we crash after (2). During the second replay,
 718 file A (which was actually created as a result of the "mv B A" operation) would get
 719 deleted. Thus, the file named A would be absent when we try to read A. So, this
 720 sequence of operations is not idempotent. However, as mentioned above, instead
 721 of storing the procedure, fast commits store the outcome of each procedure. Thus
 722 the fast commit log for the above procedure would be as follows:
 723
 724 (Let's assume dirent A was linked to inode 10 and dirent B was linked to
 725 inode 11 before the replay)
 726
 727 1) Unlink A
 728 2) Link A to inode 11
 729 3) Unlink B
 730 4) Inode 11
 731
 732 If we crash after (3) we will have file A linked to inode 11. During the second
 733 replay, we will remove file A (inode 11). But we will create it back and make
 734 it point to inode 11. We won't find B, so we'll just skip that step. At this
 735 point, the refcount for inode 11 is not reliable, but that gets fixed by the
 736 replay of the last inode 11 tag. Thus, by converting a non-idempotent procedure
 737 into a series of idempotent outcomes, fast commits ensure idempotence during
 738 the replay.
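The example can be modelled with a toy directory to show that replaying the
outcome log twice leaves the same state as replaying it once. Everything in the
sketch is hypothetical: names are single letters, the directory is a flat
array, and the final inode tag (which repairs the refcount) is left out.

.. code-block:: c

   #include <stddef.h>

   enum fc_op { FC_UNLINK, FC_LINK };

   /* One logged outcome: unlink a name, or link a name to an inode. */
   struct fc_rec {
       enum fc_op op;
       char name;       /* 'A'..'Z' */
       unsigned ino;    /* target inode for FC_LINK, unused for FC_UNLINK */
   };

   /* Toy directory: ino[i] is the inode linked to name 'A' + i, 0 = absent. */
   struct toy_dir {
       unsigned ino['Z' - 'A' + 1];
   };

   /* Enforce each logged outcome, whatever the current state is. */
   static inline void replay(struct toy_dir *d, const struct fc_rec *log,
                             size_t n)
   {
       for (size_t i = 0; i < n; i++) {
           unsigned *slot = &d->ino[log[i].name - 'A'];

           if (log[i].op == FC_UNLINK)
               *slot = 0;            /* already absent: nothing changes */
           else
               *slot = log[i].ino;   /* create or overwrite the dirent */
       }
   }

Starting from A linked to inode 10 and B linked to inode 11, running
``replay()`` on the three link and unlink records once or twice ends with A
linked to inode 11 and B absent.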
 739
 740 Journal Checkpoint
 741 ~~~~~~~~~~~~~~~~~~
 742
 743 Checkpointing the journal ensures all transactions and their associated buffers
 744 are submitted to the disk. In-progress transactions are waited upon and included
 745 in the checkpoint. Checkpointing is used internally during critical updates to
 746 the filesystem including journal recovery, filesystem resizing, and freeing of
 747 the journal_t structure.
 748
 749 A journal checkpoint can be triggered from userspace via the ioctl
 750 EXT4_IOC_CHECKPOINT. This ioctl takes a single u64 argument for flags.
 751 Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
 752 can be used to verify input to the ioctl. It returns an error if there is any
 753 invalid input; otherwise it returns success without performing
 754 any checkpointing. This can be used to check whether the ioctl exists on a
 755 system and to verify there are no issues with arguments or flags. The
 756 other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
 757 EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
 758 discarded or zero-filled, respectively, after the journal checkpoint is
 759 complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
 760 cannot both be set. The ioctl may be useful when snapshotting a system or for
 761 complying with content deletion SLOs.
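The flag rules described above can be mirrored in a userspace pre-check before
the ioctl is issued. The sketch below is not the kernel's validation code: the
``CKPT_FLAG_*`` names are local to the example, their numeric values are
assumed for illustration, and a real program should take the
``EXT4_IOC_CHECKPOINT_FLAG_*`` definitions from the kernel's UAPI headers.

.. code-block:: c

   #include <stdint.h>

   /* Assumed values, for illustration only. */
   #define CKPT_FLAG_DISCARD  0x1u
   #define CKPT_FLAG_ZEROOUT  0x2u
   #define CKPT_FLAG_DRY_RUN  0x4u
   #define CKPT_FLAG_VALID \
       (CKPT_FLAG_DISCARD | CKPT_FLAG_ZEROOUT | CKPT_FLAG_DRY_RUN)

   /* Returns 0 if the flag combination is acceptable, -1 otherwise. */
   static inline int ckpt_flags_ok(uint64_t flags)
   {
       if (flags & ~(uint64_t)CKPT_FLAG_VALID)
           return -1;    /* unknown flag bit */
       if ((flags & CKPT_FLAG_DISCARD) && (flags & CKPT_FLAG_ZEROOUT))
           return -1;    /* discard and zeroout cannot both be set */
       return 0;
   }

Combining the dry-run flag with either of the other two remains valid, which
matches its use as a way to probe for support without checkpointing.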