Linux Kernel Documentation / userspace-api / mseal.rst

Documentation / userspace-api / mseal.rst

Based on kernel version 6.19. Page generated on 2026-02-12 08:38 EST.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209
.. SPDX-License-Identifier: GPL-2.0

=====================
Introduction of mseal
=====================

:Author: Jeff Xu <jeffxu@chromium.org>

Modern CPUs support memory permissions such as RW and NX bits. The memory
permission feature improves security stance on memory corruption bugs, i.e.
the attacker can’t just write to arbitrary memory and point the code to it,
the memory has to be marked with X bit, or else an exception will happen.

Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.

A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].

SYSCALL
=======
mseal syscall signature
-----------------------
   ``int mseal(void *addr, size_t len, unsigned long flags)``

   **addr**/**len**: virtual memory address range.
      The address range set by **addr**/**len** must meet:
         - The start address must be in an allocated VMA.
         - The start address must be page aligned.
         - The end address (**addr** + **len**) must be in an allocated VMA.
         - no gap (unallocated memory) between start and end address.

      The ``len`` will be paged aligned implicitly by the kernel.

   **flags**: reserved for future use.

   **Return values**:
      - **0**: Success.
      - **-EINVAL**:
         * Invalid input ``flags``.
         * The start address (``addr``) is not page aligned.
         * Address range (``addr`` + ``len``) overflow.
      - **-ENOMEM**:
         * The start address (``addr``) is not allocated.
         * The end address (``addr`` + ``len``) is not allocated.
         * A gap (unallocated memory) between start and end address.
      - **-EPERM**:
         * sealing is supported only on 64-bit CPUs, 32-bit is not supported.

   **Note about error return**:
      - For above error cases, users can expect the given memory range is
        unmodified, i.e. no partial update.
      - There might be other internal errors/cases not listed here, e.g.
        error during merging/splitting VMAs, or the process reaching the maximum
        number of supported VMAs. In those cases, partial updates to the given
        memory range could happen. However, those cases should be rare.

   **Architecture support**:
      mseal only works on 64-bit CPUs, not 32-bit CPUs.

   **Idempotent**:
      users can call mseal multiple times. mseal on an already sealed memory
      is a no-action (not error).

   **no munseal**
      Once mapping is sealed, it can't be unsealed. The kernel should never
      have munseal, this is consistent with other sealing feature, e.g.
      F_SEAL_SEAL for file.

Blocked mm syscall for sealed mapping
-------------------------------------
   It might be important to note: **once the mapping is sealed, it will
   stay in the process's memory until the process terminates**.

   Example::

         *ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
         rc = mseal(ptr, 4096, 0);
         /* munmap will fail */
         rc = munmap(ptr, 4096);
         assert(rc < 0);

   Blocked mm syscall:
      - munmap
      - mmap
      - mremap
      - mprotect and pkey_mprotect
      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
        MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK

   The first set of syscalls to block is munmap, mremap, mmap. They can
   either leave an empty space in the address space, therefore allowing
   replacement with a new mapping with new set of attributes, or can
   overwrite the existing mapping with another mapping.

   mprotect and pkey_mprotect are blocked because they changes the
   protection bits (RWX) of the mapping.

   Certain destructive madvise behaviors, specifically MADV_DONTNEED,
   MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
   risks when applied to anonymous memory by threads lacking write
   permissions. Consequently, these operations are prohibited under such
   conditions. The aforementioned behaviors have the potential to modify
   region contents by discarding pages, effectively performing a memset(0)
   operation on the anonymous memory.

   Kernel will return -EPERM for blocked syscalls.

   When blocked syscall return -EPERM due to sealing, the memory regions may
   or may not be changed, depends on the syscall being blocked:

      - munmap: munmap is atomic. If one of VMAs in the given range is
        sealed, none of VMAs are updated.
      - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
        when mprotect over multiple VMAs, mprotect might update the beginning
        VMAs before reaching the sealed VMA and return -EPERM.
      - mmap and mremap: undefined behavior.

Use cases
=========
- glibc:
  The dynamic linker, during loading ELF executables, can apply sealing to
  mapping segments.

- Chrome browser: protect some security sensitive data structures.

- System mappings:
  The system mappings are created by the kernel and includes vdso, vvar,
  vvar_vclock, vectors (arm compat-mode), sigpage (arm compat-mode), uprobes.

  Those system mappings are readonly only or execute only, memory sealing can
  protect them from ever changing to writable or unmmap/remapped as different
  attributes. This is useful to mitigate memory corruption issues where a
  corrupted pointer is passed to a memory management system.

  If supported by an architecture (CONFIG_ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS),
  the CONFIG_MSEAL_SYSTEM_MAPPINGS seals all system mappings of this
  architecture.

  The following architectures currently support this feature: x86-64, arm64,
  loongarch and s390.

  WARNING: This feature breaks programs which rely on relocating
  or unmapping system mappings. Known broken software at the time
  of writing includes CHECKPOINT_RESTORE, UML, gVisor, rr. Therefore
  this config can't be enabled universally.

When not to use mseal
=====================
Applications can apply sealing to any virtual memory region from userspace,
but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
apply the sealing. This is because the sealed mapping *won’t be unmapped*
until the process terminates or the exec system call is invoked.

For example:
   - aio/shm
     aio/shm can call mmap and  munmap on behalf of userspace, e.g.
     ksys_shmdt() in shm.c. The lifetimes of those mapping are not tied to
     the lifetime of the process. If those memories are sealed from userspace,
     then munmap will fail, causing leaks in VMA address space during the
     lifetime of the process.

   - ptr allocated by malloc (heap)
     Don't use mseal on the memory ptr return from malloc().
     malloc() is implemented by allocator, e.g. by glibc. Heap manager might
     allocate a ptr from brk or mapping created by mmap.
     If an app calls mseal on a ptr returned from malloc(), this can affect
     the heap manager's ability to manage the mappings; the outcome is
     non-deterministic.

     Example::

        ptr = malloc(size);
        /* don't call mseal on ptr return from malloc. */
        mseal(ptr, size);
        /* free will success, allocator can't shrink heap lower than ptr */
        free(ptr);

mseal doesn't block
===================
In a nutshell, mseal blocks certain mm syscall from modifying some of VMA's
attributes, such as protection bits (RWX). Sealed mappings doesn't mean the
memory is immutable.

As Jann Horn pointed out in [3], there are still a few ways to write
to RO memory, which is, in a way, by design. And those could be blocked
by different security measures.

Those cases are:

   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
   - userfaultfd.

The idea that inspired this patch comes from Stephen Röttger’s work in V8
CFI [4]. Chrome browser in ChromeOS will be the first user of this API.

Reference
=========
- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
- [2] https://man.openbsd.org/mimmutable.2
- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc