Based on kernel version 6.8. Page generated on 2024-03-11 21:26 EST.

.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

The ext4 filesystem employs a journal, first introduced in ext3, to protect
the filesystem against metadata inconsistencies in the case of a system crash. Up
to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
size limits) can be reserved inside the filesystem as a place to land
“important” data writes on-disk as quickly as possible. Once the important
data transaction is fully written to the disk and flushed from the disk write
cache, a record of the data being committed is also written to the journal. At
some later point in time, the journal code writes the transactions to their
final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In case of ``data=ordered`` mode, Ext4 also supports fast commits which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, Ext4 only stores the minimal delta needed to recreate the
affected metadata in fast commit space that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, Ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it and thus it makes the fast commit area empty for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * - 
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.
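
The rule that an unfinished transaction is discarded can be sketched as
follows. This is a simplified illustration, not the kernel's recovery code:
the function name ``committed_transactions`` is hypothetical, the input is an
already-parsed list of (block type, sequence) pairs, and checksum verification
is left out. The block type codes are the ones listed under Block Header.

```python
DESCRIPTOR, COMMIT, REVOKE = 1, 2, 5  # h_blocktype codes (see Block Header)


def committed_transactions(headers, first_tid):
    """Return the IDs of transactions that end with a commit block.

    `headers` holds (h_blocktype, h_sequence) pairs for the journal metadata
    blocks in log order; data blocks are omitted from this simplified input.
    Scanning stops at the first block that does not belong to the expected
    transaction, so a transaction without a commit is never reported.
    """
    done = []
    expected = first_tid
    for blocktype, sequence in headers:
        if sequence != expected or blocktype not in (DESCRIPTOR, COMMIT, REVOKE):
            break  # stale or foreign block: end of the valid log
        if blocktype == COMMIT:
            done.append(sequence)
            expected += 1  # the next transaction must carry the next ID
    return done


log = [(DESCRIPTOR, 7), (COMMIT, 7), (REVOKE, 8), (DESCRIPTOR, 8)]
print(committed_transactions(log, first_tid=7))  # transaction 8 has no commit
```

Here only transaction 7 is reported, because transaction 8 never reached its
commit block.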

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * - 
     -
     -
     - One transaction
     -
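
As a worked example of the layout above, the byte offset of the journal
superblock on an external journal device can be computed from the block size.
This is a small sketch; the helper name ``journal_sb_offset`` is hypothetical,
and it assumes that "the next full block" means the first block boundary after
the block holding the ext4 superblock.

```python
EXT4_SB_OFFSET = 1024  # the ext4 superblock follows 1024 bytes of padding


def journal_sb_offset(block_size):
    """Byte offset of the journal superblock on an external journal device."""
    sb_block = EXT4_SB_OFFSET // block_size  # block holding the ext4 superblock
    return (sb_block + 1) * block_size       # start of the block after it


for bs in (1024, 2048, 4096):
    print(bs, journal_sb_offset(bs))
```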

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - h_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - __be32
     - h_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - __be32
     - h_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.
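
The common header and the block type codes above can be decoded with a few
lines of Python. This is a minimal sketch for inspecting a journal image from
userspace; ``parse_block_header`` and ``BLOCK_TYPES`` are illustrative names,
not kernel symbols. Note the big-endian (``>``) format, as all jbd2 fields are
stored big-endian.

```python
import struct

JBD2_MAGIC = 0xC03B3998
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revocation",
}


def parse_block_header(block):
    """Decode the 12-byte common header; return None if the magic is absent."""
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC:
        return None  # no jbd2 magic: not a journal metadata block
    return {
        "type": BLOCK_TYPES.get(blocktype, "unknown"),
        "sequence": sequence,
    }


sample = struct.pack(">III", JBD2_MAGIC, 2, 42) + bytes(4084)
print(parse_block_header(sample))
```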

Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal_header_t (12 bytes)
     - s_header
     - Common header identifying this as a superblock.
   * - 0xC
     - __be32
     - s_blocksize
     - Journal device block size.
   * - 0x10
     - __be32
     - s_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - __be32
     - s_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - __be32
     - s_sequence
     - First commit ID expected in log.
   * - 0x1C
     - __be32
     - s_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - __be32
     - s_errno
     - Error value, as set by jbd2_journal_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - __be32
     - s_feature_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - __be32
     - s_feature_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - __be32
     - s_feature_ro_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - __u8
     - s_uuid[16]
     - 128-bit uuid for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - __be32
     - s_nr_users
     - Number of file systems sharing this journal.
   * - 0x44
     - __be32
     - s_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - __be32
     - s_max_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - __be32
     - s_max_trans_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - __u8
     - s_checksum_type
     - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - __u8[3]
     - s_padding2
     -
   * - 0x54
     - __be32
     - s_num_fc_blocks
     - Number of fast commit blocks in the journal.
   * - 0x58
     - __be32
     - s_head
     - Block number of the head (first unused block) of the journal, only
       up-to-date when the journal is empty.
   * - 0x5C
     - __u32
     - s_padding[40]
     -
   * - 0xFC
     - __be32
     - s_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - __u8
     - s_users[16*48]
     - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.
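
The offsets in the table above translate directly into ``struct`` format
strings. The following sketch decodes a subset of the superblock fields from a
1024-byte buffer; the function name ``parse_journal_superblock`` is
hypothetical, the sample buffer is synthetic, and the superblock checksum is
read but not verified.

```python
import struct


def parse_journal_superblock(buf):
    """Decode selected fields of a 1024-byte journal superblock."""
    if len(buf) < 1024:
        raise ValueError("journal superblock is 1024 bytes long")
    magic, blocktype, _seq = struct.unpack_from(">III", buf, 0x0)
    blocksize, maxlen, first = struct.unpack_from(">III", buf, 0xC)
    sequence, start, errno = struct.unpack_from(">III", buf, 0x18)
    sb = {
        "magic": magic, "blocktype": blocktype,
        "s_blocksize": blocksize, "s_maxlen": maxlen, "s_first": first,
        "s_sequence": sequence, "s_start": start, "s_errno": errno,
    }
    if blocktype == 4:  # the remaining fields are only valid in a v2 superblock
        compat, incompat, ro_compat = struct.unpack_from(">III", buf, 0x24)
        sb.update(
            s_feature_compat=compat,
            s_feature_incompat=incompat,
            s_feature_ro_compat=ro_compat,
            s_uuid=bytes(buf[0x30:0x40]),
            s_nr_users=struct.unpack_from(">I", buf, 0x40)[0],
            s_checksum_type=buf[0x50],
            s_checksum=struct.unpack_from(">I", buf, 0xFC)[0],
        )
    return sb


# Synthetic v2 superblock used only to exercise the parser.
raw = bytearray(1024)
struct.pack_into(">III", raw, 0x0, 0xC03B3998, 4, 0)   # common header
struct.pack_into(">III", raw, 0xC, 4096, 32768, 1)     # blocksize, maxlen, first
struct.pack_into(">III", raw, 0x24, 0, 0x12, 0)        # incompat: 64BIT | CSUM_V3
raw[0x50] = 4                                          # checksum type: CRC32C
sb = parse_journal_superblock(raw)
print(sb["s_blocksize"], sb["s_maxlen"], hex(sb["s_feature_incompat"]))
```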

.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2_FEATURE_COMPAT_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2_FEATURE_INCOMPAT_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
   * - 0x20
     - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
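
A feature mask such as ``s_feature_incompat`` can be split into the flag names
from the table above. This is an illustrative sketch; ``decode_incompat`` is a
hypothetical helper. Because an incompatible feature changes the on-disk
format, a reader would normally refuse to touch a journal that reports bits it
does not recognize, which is why the unknown bits are returned separately.

```python
INCOMPAT_FEATURES = {
    0x1: "JBD2_FEATURE_INCOMPAT_REVOKE",
    0x2: "JBD2_FEATURE_INCOMPAT_64BIT",
    0x4: "JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT",
    0x8: "JBD2_FEATURE_INCOMPAT_CSUM_V2",
    0x10: "JBD2_FEATURE_INCOMPAT_CSUM_V3",
    0x20: "JBD2_FEATURE_INCOMPAT_FAST_COMMIT",
}

KNOWN_MASK = 0
for _bit in INCOMPAT_FEATURES:
    KNOWN_MASK |= _bit  # OR of every documented flag


def decode_incompat(mask):
    """Split a feature mask into known flag names and leftover unknown bits."""
    names = [name for bit, name in INCOMPAT_FEATURES.items() if mask & bit]
    return names, mask & ~KNOWN_MASK


print(decode_incompat(0x13))
```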

.. _jbd2_checksum_type:

Journal checksum type codes are one of the following.  crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal_block_tag_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be32
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
       not enabled.
   * - 0xC
     - __be32
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.
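
Putting the v3 tag layout and the tag flags together, a descriptor block can
be walked tag by tag. The sketch below assumes JBD2_FEATURE_INCOMPAT_CSUM_V3
is set (16-byte tags) and that the last four bytes of the block hold the
descriptor block checksum; ``parse_tags_v3`` is a hypothetical name, the
sample block is synthetic, and no checksum is verified.

```python
import struct

FLAG_ESCAPE, FLAG_SAME_UUID, FLAG_DELETED, FLAG_LAST_TAG = 0x1, 0x2, 0x4, 0x8
TAG3_SIZE = 16  # t_blocknr, t_flags, t_blocknr_high, t_checksum


def parse_tags_v3(block):
    """Yield (target_block, flags, checksum) for each tag in a descriptor block."""
    offset = 0xC            # tags start right after the common block header
    limit = len(block) - 4  # assumed: the final 4 bytes hold the block checksum
    while offset + TAG3_SIZE <= limit:
        lo, flags, hi, csum = struct.unpack_from(">IIII", block, offset)
        offset += TAG3_SIZE
        if not flags & FLAG_SAME_UUID:
            offset += 16    # skip the open-coded uuid[16] that follows the tag
        yield (hi << 32) | lo, flags, csum
        if flags & FLAG_LAST_TAG:
            break


# Synthetic descriptor block with two tags; the first one carries a UUID.
block = bytearray(4096)
struct.pack_into(">III", block, 0x0, 0xC03B3998, 1, 7)
struct.pack_into(">IIII", block, 0xC, 100, 0, 1, 0xDEAD)
struct.pack_into(">IIII", block, 0x2C, 200, FLAG_SAME_UUID | FLAG_LAST_TAG, 0, 0xBEEF)
tags = list(parse_tags_v3(block))
print(tags)
```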

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be16
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - __be16
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.
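
The tag sizes quoted for the two formats follow from simple arithmetic over
the fields in the tables. The sketch below reproduces that arithmetic; the
helper ``documented_tag_size`` is hypothetical and mirrors only the sizes
stated in this document, so the kernel's own size computation should be
consulted for details, in particular for journals using the v2 checksum
format.

```python
def documented_tag_size(csum_v3, is_64bit, same_uuid):
    """Tag size in bytes according to the two tag tables in this section."""
    if csum_v3:
        size = 16      # fixed: t_blocknr, t_flags, t_blocknr_high, t_checksum
    else:
        size = 8       # t_blocknr, t_checksum, t_flags
        if is_64bit:
            size += 4  # t_blocknr_high
    if not same_uuid:
        size += 16     # trailing uuid[16]
    return size


old_sizes = sorted({documented_tag_size(False, b64, same)
                    for b64 in (False, True) for same in (False, True)})
v3_sizes = sorted({documented_tag_size(True, b64, same)
                   for b64 in (False, True) for same in (False, True)})
print(old_sizes, v3_sizes)
```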

If either JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.
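
The escaping rule is easy to express in code. The following is a sketch of
the transformation described above and its inverse at replay time; the
function names are hypothetical, and the flag value 0x1 is the "escaped" bit
from the tag flags table.

```python
import struct

JBD2_MAGIC_BYTES = struct.pack(">I", 0xC03B3998)  # magic as stored on disk
FLAG_ESCAPE = 0x1


def escape_for_journal(data, flags=0):
    """Return (journal_copy, flags) for a data block about to be logged."""
    if data[:4] == JBD2_MAGIC_BYTES:
        # Zero the colliding bytes so the block cannot be mistaken for
        # journal metadata, and record that fact in the tag flags.
        return bytes(4) + data[4:], flags | FLAG_ESCAPE
    return data, flags


def unescape_on_replay(journal_copy, flags):
    """Restore the original contents before writing to the final location."""
    if flags & FLAG_ESCAPE:
        return JBD2_MAGIC_BYTES + journal_copy[4:]
    return journal_copy


original = JBD2_MAGIC_BYTES + b"payload!"
logged, tag_flags = escape_for_journal(original)
print(logged[:4], tag_flags, unescape_on_replay(logged, tag_flags) == original)
```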

Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled. Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - r_header
     - Common block header.
   * - 0xC
     - __be32
     - r_count
     - Number of bytes used in this block.
   * - 0x10
     - __be32 or __be64
     - blocks[0]
     - Blocks to revoke.

After r_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.
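
A revocation block can be decoded along these lines. This sketch assumes that
``r_count`` counts bytes from the start of the block, so it includes the
16 bytes occupied by the header and ``r_count`` itself; the function name
``parse_revoke_block`` is hypothetical and the sample block is synthetic.

```python
import struct


def parse_revoke_block(block, is_64bit):
    """Return the block numbers listed in a revocation block."""
    r_count = struct.unpack_from(">I", block, 0xC)[0]
    fmt, width = (">Q", 8) if is_64bit else (">I", 4)
    revoked = []
    offset = 0x10  # records start right after r_count
    end = min(r_count, len(block))
    while offset + width <= end:
        revoked.append(struct.unpack_from(fmt, block, offset)[0])
        offset += width
    return revoked


# Synthetic revocation block holding three 32-bit block numbers.
blk = bytearray(1024)
struct.pack_into(">IIII", blk, 0x0, 0xC03B3998, 5, 9, 0x10 + 3 * 4)
struct.pack_into(">III", blk, 0x10, 1000, 2000, 3000)
print(parse_revoke_block(blk, is_64bit=False))
```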

If either JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - r_checksum
     - Checksum of the journal UUID + revocation block

Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, whose fields
occupy 60 bytes (but the record uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h_chksum_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h_chksum_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h_padding[2]
     -
   * - 0x10
     - __be32
     - h_chksum[JBD2_CHECKSUM_BYTES]
     - 32 bytes of space to store checksums. If either
       JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
       is set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - __be64
     - h_commit_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - __be32
     - h_commit_nsec
     - Nanoseconds component of the above timestamp.
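
The commit block fields can be read at the offsets given in the table. The
sketch below decodes them and converts the commit time to a UTC timestamp;
``parse_commit_block`` is a hypothetical name, the sample block is synthetic,
and the stored checksum is returned without being verified.

```python
import struct
from datetime import datetime, timezone


def parse_commit_block(block):
    """Decode the fields of a commit block laid out as in the table above."""
    _magic, blocktype, sequence = struct.unpack_from(">III", block, 0x0)
    chksum_type, chksum_size = struct.unpack_from(">BB", block, 0xC)
    chksum = struct.unpack_from(">8I", block, 0x10)  # 32 bytes of checksum space
    sec, nsec = struct.unpack_from(">QI", block, 0x30)
    return {
        "blocktype": blocktype,
        "sequence": sequence,
        "h_chksum_type": chksum_type,
        "h_chksum_size": chksum_size,
        "h_chksum": chksum,
        "committed": datetime.fromtimestamp(sec, tz=timezone.utc),
        "h_commit_nsec": nsec,
    }


# Synthetic commit block for transaction 42.
cblk = bytearray(4096)
struct.pack_into(">III", cblk, 0x0, 0xC03B3998, 2, 42)
struct.pack_into(">BB", cblk, 0xC, 4, 4)
struct.pack_into(">QI", cblk, 0x30, 1700000000, 500)
commit = parse_commit_block(cblk)
print(commit["sequence"], commit["committed"].isoformat(), commit["h_commit_nsec"])
```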

Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag-length-value (TLV) records.
Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag and the
length of the entire field, followed by a variable-length, tag-specific value.
Here is the list of supported tags and their meanings:

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and extent to be added in this inode
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offsets from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry

   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.

   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit and the CRC of the fast commit that this
       tag terminates.
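
Walking a TLV log follows the usual pattern of reading a small header and then
skipping the value it announces. The sketch below is generic and rests on
assumptions that this document does not spell out: that the tag and length
are each 16 bits wide and little-endian (as ext4 structures are), and that
the length counts only the value bytes. The tag numbers in the sample are
arbitrary, and ``walk_tlvs`` is a hypothetical helper; check
``struct ext4_fc_tl`` in the kernel sources before relying on any of this.

```python
import struct

TL_HEADER = struct.Struct("<HH")  # assumed layout: 16-bit tag, 16-bit length


def walk_tlvs(area):
    """Yield (tag, value_bytes) for each TLV record in a byte string."""
    offset = 0
    while offset + TL_HEADER.size <= len(area):
        tag, length = TL_HEADER.unpack_from(area, offset)
        offset += TL_HEADER.size
        if offset + length > len(area):
            break  # truncated record: stop rather than read past the area
        yield tag, bytes(area[offset:offset + length])
        offset += length


# Two records with arbitrary tag numbers: one 4-byte value, one empty value.
area = struct.pack("<HH", 1, 4) + b"abcd" + struct.pack("<HH", 2, 0)
records = list(walk_tlvs(area))
print(records)
```

A real reader would additionally stop at the tail tag and verify its CRC
before trusting the records that precede it.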

Fast Commit Replay Idempotence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fast commit tags are idempotent in nature, provided the recovery code follows
certain rules. The guiding principle that the commit path follows while
committing is that it stores the result of a particular operation instead of
storing the procedure.

Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
was associated with inode 10. During fast commit, instead of storing this
operation as a procedure "rename a to b", we store the resulting file system
state as a "series" of outcomes:

- Link dirent b to inode 10
- Unlink dirent a
- Inode 10 with valid refcount

Now when recovery code runs, it needs to "enforce" this state on the file
system. This is what guarantees idempotence of fast commit replay.

Let's take an example of a procedure that is not idempotent and see how fast
commits make it idempotent. Consider the following sequence of operations:

1) rm A
2) mv B A
3) read A

If we store this sequence of operations as is then the replay is not idempotent.
Let's say while in replay, we crash after (2). During the second replay,
file A (which was actually created as a result of "mv B A" operation) would get
deleted. Thus, file named A would be absent when we try to read A. So, this
sequence of operations is not idempotent. However, as mentioned above, instead
of storing the procedure fast commits store the outcome of each procedure. Thus
the fast commit log for above procedure would be as follows:

(Let's assume dirent A was linked to inode 10 and dirent B was linked to
inode 11 before the replay)

1) Unlink A
2) Link A to inode 11
3) Unlink B
4) Inode 11

If we crash after (3) we will have file A linked to inode 11. During the second
replay, we will remove file A (inode 11). But we will create it back and make
it point to inode 11. We won't find B, so we'll just skip that step. At this
point, the refcount for inode 11 is not reliable, but that gets fixed by the
replay of last inode 11 tag. Thus, by converting a non-idempotent procedure
into a series of idempotent outcomes, fast commits ensured idempotence during
the replay.
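
The crash-and-replay scenario above can be simulated with a toy model. This
is only an illustration of the principle, not the ext4 replay code: the
directory is a dictionary, the record format is made up, and the inode record
is assumed to carry the final link count (1 here).

```python
def replay(dirents, refcounts, log, crash_after=None):
    """Enforce each recorded outcome; optionally stop early to mimic a crash."""
    for step, record in enumerate(log, start=1):
        op, args = record[0], record[1:]
        if op == "unlink":
            dirents.pop(args[0], None)    # a missing dirent is simply skipped
        elif op == "link":
            dirents[args[0]] = args[1]    # force the dirent to point at the inode
        elif op == "inode":
            refcounts[args[0]] = args[1]  # overwrite with the recorded count
        if step == crash_after:
            return False  # simulated crash: remaining records are not applied
    return True


log = [("unlink", "A"), ("link", "A", 11), ("unlink", "B"), ("inode", 11, 1)]

# Replay that crashes after step 3 and is then restarted from the beginning.
dirents, refcounts = {"A": 10, "B": 11}, {11: 1}
replay(dirents, refcounts, log, crash_after=3)
replay(dirents, refcounts, log)

# Reference run without any crash.
clean_dirents, clean_refcounts = {"A": 10, "B": 11}, {11: 1}
replay(clean_dirents, clean_refcounts, log)

print(dirents, dirents == clean_dirents and refcounts == clean_refcounts)
```

Both runs end with dirent A pointing at inode 11 and no dirent B, which is
the property the outcome-based log is designed to provide.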

Journal Checkpoint
~~~~~~~~~~~~~~~~~~

Checkpointing the journal ensures all transactions and their associated buffers
are submitted to the disk. In-progress transactions are waited upon and included
in the checkpoint. Checkpointing is used internally during critical updates to
the filesystem including journal recovery, filesystem resizing, and freeing of
the journal_t structure.

A journal checkpoint can be triggered from userspace via the ioctl
EXT4_IOC_CHECKPOINT. This ioctl takes a single u64 argument for flags.
Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns an error if there is any
invalid input, otherwise it returns success without performing
any checkpointing. This can be used to check whether the ioctl exists on a
system and to verify there are no issues with arguments or flags. The
other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
discarded or zero-filled, respectively, after the journal checkpoint is
complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
cannot both be set. The ioctl may be useful when snapshotting a system or for
complying with content deletion SLOs.
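
The flag rules described above can be mirrored in a small userspace check
before issuing the ioctl. This sketch does not call the ioctl itself. The
numeric flag values are assumptions that should be checked against the
kernel's ext4 headers, and ``check_checkpoint_flags`` is a hypothetical
helper.

```python
# Assumed values; verify against the kernel's ext4 headers before use.
FLAG_DISCARD = 0x1
FLAG_ZEROOUT = 0x2
FLAG_DRY_RUN = 0x4
VALID_FLAGS = FLAG_DISCARD | FLAG_ZEROOUT | FLAG_DRY_RUN


def check_checkpoint_flags(flags):
    """Return None if the flag combination is acceptable, else an error string."""
    if flags & ~VALID_FLAGS:
        return "unknown flag bits set"
    if (flags & FLAG_DISCARD) and (flags & FLAG_ZEROOUT):
        return "DISCARD and ZEROOUT cannot both be set"
    return None


for candidate in (0, FLAG_DISCARD | FLAG_DRY_RUN, FLAG_DISCARD | FLAG_ZEROOUT, 0x8):
    print(hex(candidate), check_checkpoint_flags(candidate))
```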