Based on kernel version 6.18. Page generated on 2025-12-02 09:03 EST.

.. SPDX-License-Identifier: GPL-2.0

=================================
dm-pcache — Persistent Cache
=================================

*Author: Dongsheng Yang <dongsheng.yang@linux.dev>*

This document describes *dm-pcache*, a Device-Mapper target that lets a
byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
high-performance, crash-persistent cache in front of a slower block
device.  The code lives in ``drivers/md/dm-pcache/``.

Quick feature summary
=====================

* *Write-back* caching (only mode currently supported).
* *16 MiB segments* allocated on the pmem device.
* *Data CRC32* verification (optional, per cache).
* Crash-safe: every metadata structure is duplicated
  (``PCACHE_META_INDEX_MAX == 2``) and protected with a CRC and a sequence
  number.
* *Multi-tree indexing* (indexing trees sharded by logical address) for high
  pmem parallelism.
* Pure *DAX path* I/O – no extra BIO round-trips.
* *Log-structured write-back* that preserves backend crash-consistency.


Constructor
===========

::

    pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>]

=========================  ====================================================
``cache_dev``               Any DAX-capable block device (``/dev/pmem0``…).
                            All metadata *and* cached blocks are stored here.

``backing_dev``             The slow block device to be cached.

``cache_mode``              Optional.  Only ``writeback`` is accepted at the
                            moment.

``data_crc``                Optional; defaults to ``false``.

                            * ``true``  – store a CRC32 for every cached entry
                              and verify it on reads
                            * ``false`` – skip the CRC (faster)
=========================  ====================================================

Example
-------

.. code-block:: shell

   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

The first time a pmem device is used, dm-pcache formats it automatically
(super-block, cache_info, etc.).
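
When the table line is generated by a script, it can help to build and inspect
the string before handing it to ``dmsetup``.  The following is a minimal
sketch; ``pcache_table`` is a hypothetical helper name, and the count of ``4``
simply mirrors the example above (two key/value pairs).

.. code-block:: shell

   # Hypothetical helper: print a pcache table line without touching any device.
   # Arguments: <size_in_512B_sectors> <cache_dev> <backing_dev> <data_crc>
   pcache_table() {
       size=$1 cache_dev=$2 backing_dev=$3 data_crc=$4
       case "$data_crc" in
           true|false) ;;
           *) echo "data_crc must be true or false" >&2; return 1 ;;
       esac
       # Four optional words follow the count: two key/value pairs.
       echo "0 $size pcache $cache_dev $backing_dev 4 cache_mode writeback data_crc $data_crc"
   }

   # Usage (sector count as reported by blockdev --getsz):
   #   dmsetup create pcache_sdb --table \
   #     "$(pcache_table "$(blockdev --getsz /dev/sdb)" /dev/pmem0 /dev/sdb true)"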


Status line
===========

``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:

::

   <sb_flags> <seg_total> <cache_segs> <segs_used> \
   <gc_percent> <cache_flags> \
   <key_head_seg>:<key_head_off> \
   <dirty_tail_seg>:<dirty_tail_off> \
   <key_tail_seg>:<key_tail_off>

Field meanings
--------------

===============================  =============================================
``sb_flags``                     Super-block flags (e.g. endian marker).

``seg_total``                    Number of physical *pmem* segments.

``cache_segs``                   Number of segments used for cache.

``segs_used``                    Segments currently allocated (bitmap weight).

``gc_percent``                   Current GC high-water mark (0-90).

``cache_flags``                  * Bit 0 – DATA_CRC enabled
                                 * Bit 1 – INIT_DONE (cache initialised)
                                 * Bits 2-5 – cache mode (0 == WB)

``key_head``                     Where new key-sets are being written.

``dirty_tail``                   First dirty key-set that still needs
                                 write-back to the backing device.

``key_tail``                     First key-set that may be reclaimed by GC.
===============================  =============================================
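
The bit layout of ``cache_flags`` and the ``<seg>:<off>`` position pairs can be
decoded with plain shell arithmetic.  This is a sketch under assumptions:
``decode_cache_flags`` and ``split_pos`` are hypothetical helper names, and the
numeric base in which the driver prints each field is not stated here, so the
value is passed through shell arithmetic as given.  Note that ``dmsetup
status`` prefixes the target-specific fields with the start sector, length and
target name.

.. code-block:: shell

   # Hypothetical helper: decode a cache_flags value using the bit layout
   # documented in the field table.
   decode_cache_flags() {
       flags=$(( $1 ))
       crc=$(( flags & 1 ))            # bit 0: DATA_CRC
       init=$(( (flags >> 1) & 1 ))    # bit 1: INIT_DONE
       mode=$(( (flags >> 2) & 15 ))   # bits 2-5: cache mode, 0 == write-back
       echo "data_crc=$crc init_done=$init cache_mode=$mode"
   }

   # Hypothetical helper: split a "<seg>:<off>" position into its two parts.
   split_pos() {
       echo "seg=${1%%:*} off=${1#*:}"
   }

For example, ``decode_cache_flags 3`` reports both DATA_CRC and INIT_DONE set
with cache mode 0.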


Messages
========

*Change GC trigger*

::

   dmsetup message <dev> 0 gc_percent <0-90>
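
A script that adjusts the trigger may want to check the documented 0-90 range
before sending the message.  The wrapper below is a hypothetical sketch
(``gc_percent_cmd`` is not part of dm-pcache or dmsetup); it only prints the
command so that it can be reviewed or passed to ``sh``.

.. code-block:: shell

   # Hypothetical wrapper: validate the range, then print the dmsetup command
   # instead of running it.
   gc_percent_cmd() {
       dev=$1 pct=$2
       case "$pct" in
           ''|*[!0-9]*) echo "gc_percent must be an integer" >&2; return 1 ;;
       esac
       if [ "$pct" -gt 90 ]; then
           echo "gc_percent must be in 0-90" >&2
           return 1
       fi
       echo "dmsetup message $dev 0 gc_percent $pct"
   }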


Theory of operation
===================

Sub-devices
-----------

====================  =========================================================
backing_dev             Any block device (SSD/HDD/loop/LVM, etc.).
cache_dev               DAX device; must expose direct-access memory.
====================  =========================================================

Segments and key-sets
---------------------

* The pmem space is divided into *16 MiB segments*.
* Each write allocates space from a per-CPU *data_head* inside a segment.
* A *cache-key* records a logical range on the origin and where it lives
  inside pmem (segment + offset + generation).
* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
  and are themselves crash-safe (CRC).
* The pair *(key_tail, dirty_tail)* delimits clean/dirty and live/dead ksets.
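
Because segments are a fixed 16 MiB, an upper bound on the segment count of a
pmem region follows from simple division.  The sketch below assumes nothing
about metadata overhead, so the ``seg_total`` actually reported may be smaller;
``max_segments`` is a hypothetical helper name.

.. code-block:: shell

   # Upper bound on the number of 16 MiB segments a pmem region can hold.
   # Assumption: metadata overhead is ignored, so seg_total may be smaller.
   max_segments() {
       bytes=$1
       seg_size=$(( 16 * 1024 * 1024 ))
       echo $(( bytes / seg_size ))
   }

   # Usage (size in bytes as reported by blockdev --getsize64):
   #   max_segments "$(blockdev --getsize64 /dev/pmem0)"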

Write-back
----------

Dirty keys are queued into a tree; a background worker copies data
back to the backing_dev and advances *dirty_tail*.  A FLUSH/FUA bio from the
upper layers forces an immediate metadata commit.

Garbage collection
------------------

GC starts when ``segs_used >= seg_total * gc_percent / 100``.  It walks
from *key_tail*, frees segments whose every key has been invalidated, and
advances *key_tail*.
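
The trigger condition can be checked against the values from the status line.
This is a minimal sketch that assumes integer arithmetic, as in the expression
above; ``gc_should_start`` is a hypothetical helper name.

.. code-block:: shell

   # Succeeds (exit status 0) when the documented condition holds:
   #   segs_used >= seg_total * gc_percent / 100   (integer arithmetic assumed)
   gc_should_start() {
       segs_used=$1 seg_total=$2 gc_percent=$3
       [ "$segs_used" -ge $(( seg_total * gc_percent / 100 )) ]
   }

   # Usage:
   #   gc_should_start 70 100 70 && echo "GC would start"

With ``gc_percent`` set to 0 the threshold is 0, so the condition always holds.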

CRC verification
----------------

If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached data
range when it is inserted and stores it in the on-media key.  Reads
validate the CRC before copying to the caller.
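
The store-on-insert, verify-on-read pattern can be illustrated in user space.
This sketch is not the driver's code: ``cksum`` computes the POSIX CRC rather
than the kernel's CRC32, the "key" is just a pair of shell variables, and all
function names are hypothetical.

.. code-block:: shell

   # Illustration only: cksum implements the POSIX CRC, not the kernel's CRC32.
   crc_of() {
       printf '%s' "$1" | cksum | cut -d ' ' -f 1
   }

   # Insert: remember the checksum next to the cached data.
   cache_insert() {
       cached_data=$1
       cached_crc=$(crc_of "$1")
   }

   # Read: return the data only if the stored checksum still matches.
   cache_read() {
       if [ "$(crc_of "$cached_data")" != "$cached_crc" ]; then
           echo "CRC mismatch" >&2
           return 1
       fi
       printf '%s\n' "$cached_data"
   }

After ``cache_insert "hello"``, ``cache_read`` prints the data; if
``cached_data`` is altered afterwards, ``cache_read`` fails instead of
returning the corrupted value.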


Failure handling
================

* *pmem media errors* – all metadata copies are read with
  ``copy_mc_to_kernel``; an uncorrectable error is logged and initialisation
  is aborted.
* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
  dm-pcache retries internally (request deferral).
* *System crash* – on attach, the driver replays ksets from *key_tail* to
  rebuild the in-core trees; every segment’s generation guards against
  use-after-free keys.


Limitations & TODO
==================

* Only *write-back* mode is supported; other modes are planned.
* Only FIFO cache invalidation is supported; other policies (LRU, ARC, ...)
  are planned.
* Table reload is not currently supported.
* Discard support is planned.


Example workflow
================

.. code-block:: shell

   # 1.  Create devices
   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

   # 2.  Put a filesystem on top
   mkfs.ext4 /dev/mapper/pcache_sdb
   mount /dev/mapper/pcache_sdb /mnt

   # 3.  Tune GC threshold to 80 %
   dmsetup message pcache_sdb 0 gc_percent 80

   # 4.  Observe status
   watch -n1 'dmsetup status pcache_sdb'

   # 5.  Shutdown
   umount /mnt
   dmsetup remove pcache_sdb


``dm-pcache`` is under active development; feedback, bug reports and patches
are very welcome!