Based on kernel version 7.0. Page generated on 2026-04-23 09:49 EST.

.. SPDX-License-Identifier: GPL-2.0

Idmappings
==========

Most filesystem developers will have encountered idmappings. They are used when
reading from or writing ownership to disk, reporting ownership to userspace, or
for permission checking. This document is aimed at filesystem developers who
want to know how idmappings work.

Formal notes
------------

An idmapping is essentially a translation of a range of ids into another or the
same range of ids. The notational convention for idmappings that is widely used
in userspace is::

u:k:r

``u`` indicates the first element in the upper idmapset ``U`` and ``k``
indicates the first element in the lower idmapset ``K``. The ``r`` parameter
indicates the range of the idmapping, i.e. how many ids are mapped. From now
on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
we're talking about an id in the upper or lower idmapset.

To see what this looks like in practice, let's take the following idmapping::

u22:k10000:r3

and write down the mappings it will generate::

u22 -> k10000
u23 -> k10001
u24 -> k10002

From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
the set of all possible ids usable on a given system.

Looking at this mathematically briefly will help us highlight some properties
that make it easier to understand how we can translate between idmappings. For
example, we know that the inverse idmapping is an order isomorphism as well::

k10000 -> u22
k10001 -> u23
k10002 -> u24

Given that we are dealing with order isomorphisms plus the fact that we're
dealing with subsets we can embed idmappings into each other, i.e. we can
sensibly translate between different idmappings. For example, assume we've been
given the three idmappings::

1. u0:k10000:r10000
2. u0:k20000:r10000
3. u0:k30000:r10000

and id ``k11000`` which has been generated by the first idmapping by mapping
``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.

Because we're dealing with order isomorphic subsets it is meaningful to ask
what id ``k11000`` corresponds to in the second or third idmapping. The
straightforward algorithm to use is to apply the inverse of the first idmapping,
mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
either the second or the third idmapping. The second idmapping would map
``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down to
``k31000``.

If we were given the same task for the following three idmappings::

1. u0:k10000:r10000
2. u0:k20000:r200
3. u0:k30000:r300

we would fail to translate as the sets aren't order isomorphic over the full
range of the first idmapping anymore (however, they are order isomorphic over
the full range of the second idmapping). Neither the second nor the third
idmapping contains ``u1000`` in the upper idmapset ``U``. This is equivalent to
not having an id mapped. We can simply say that ``u1000`` is unmapped in the
second and third idmapping. Internally the kernel represents an unmapped id as
the invalid id ``(uid_t)-1`` or ``(gid_t)-1``; when reporting ownership to
userspace it substitutes the overflowuid or overflowgid (65534 by default).

The algorithm to calculate what a given id maps to is pretty simple. First, we
need to verify that the range can contain our target id. We will skip this step
for simplicity. After that if we want to know what ``id`` maps to we can do
simple calculations:

- If we want to map from left to right::

u:k:r
id - u + k = n

- If we want to map from right to left::

u:k:r
id - k + u = n

Instead of "left to right" we can also say "down" and instead of "right to
left" we can also say "up". Obviously mapping down and up invert each other.

To see whether the simple formulas above work, consider the following two
idmappings::

1. u0:k20000:r10000
2. u500:k30000:r10000

Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
want to know what id this was mapped from in the upper idmapset of the first
idmapping. So we're mapping up in the first idmapping::

id - k + u = n
k21000 - k20000 + u0 = u1000

Now assume we are given the id ``u1100`` in the upper idmapset of the second
idmapping and we want to know what this id maps down to in the lower idmapset
of the second idmapping. This means we're mapping down in the second
idmapping::

id - u + k = n
u1100 - u500 + k30000 = k30600
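
The two formulas, including the range check that the text skips, can be
sketched in plain C. This is only an illustration: ``struct idmap``,
``map_down()``, ``map_up()`` and ``ID_INVALID`` are hypothetical names chosen
for this sketch and are not the kernel's data structures or helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the "unmapped" result; plays the role of (uid_t)-1. */
#define ID_INVALID UINT32_MAX

/* Hypothetical container for one u:k:r idmapping. */
struct idmap {
	uint32_t u; /* first id in the upper idmapset */
	uint32_t k; /* first id in the lower idmapset */
	uint32_t r; /* number of ids mapped */
};

/* Map down (left to right): id - u + k, after the range check. */
uint32_t map_down(struct idmap m, uint32_t id)
{
	if (id < m.u || id - m.u >= m.r)
		return ID_INVALID;
	return id - m.u + m.k;
}

/* Map up (right to left): id - k + u, after the range check. */
uint32_t map_up(struct idmap m, uint32_t id)
{
	if (id < m.k || id - m.k >= m.r)
		return ID_INVALID;
	return id - m.k + m.u;
}

/* Re-check the two worked examples from this section. */
void check_formal_notes_examples(void)
{
	struct idmap first = { 0, 20000, 10000 };    /* u0:k20000:r10000 */
	struct idmap second = { 500, 30000, 10000 }; /* u500:k30000:r10000 */

	assert(map_up(first, 21000) == 1000);        /* k21000 -> u1000 */
	assert(map_down(second, 1100) == 30600);     /* u1100 -> k30600 */
	assert(map_down(second, 499) == ID_INVALID); /* below the range */
}
```

Subtracting the base before comparing against ``r`` keeps the range check free
of integer overflow even for a range as large as ``r4294967295``.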

General notes
-------------

In the context of the kernel an idmapping can be interpreted as mapping a range
of userspace ids into a range of kernel ids::

userspace-id:kernel-id:range

A userspace id is always an element in the upper idmapset of an idmapping of
type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.

The kernel is mostly concerned with kernel ids. They are used when performing
permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` fields.
A userspace id on the other hand is an id that is reported to userspace by the
kernel, or is passed by userspace to the kernel, or a raw device id that is
written to or read from disk.

Note that we are only concerned with idmappings as the kernel stores them, not
with how userspace would specify them.

For the rest of this document we will prefix all userspace ids with ``u`` and
all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
an idmapping will be written as ``u0:k10000:r10000``.

For example, within this idmapping, the id ``u1000`` is an id in the upper
idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to
``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset"
starting with ``k10000``.

A kernel id is always created by an idmapping. Such idmappings are associated
with user namespaces. Since we mainly care about how idmappings work we're not
going to be concerned with how idmappings are created nor how they are used
outside of the filesystem context. This is best left to an explanation of user
namespaces.

The initial user namespace is special. It always has an idmapping of the
following form::

u0:k0:r4294967295

which is an identity idmapping over the full range of ids available on this
system.

Other user namespaces usually have non-identity idmappings such as::

u0:k10000:r10000

When a process creates or wants to change ownership of a file, or when the
ownership of a file is read from disk by a filesystem, the userspace id is
immediately translated into a kernel id according to the idmapping associated
with the relevant user namespace.

For instance, consider a file that is stored on disk by a filesystem as being
owned by ``u1000``:

- If a filesystem were to be mounted in the initial user namespace (as most
filesystems are) then the initial idmapping will be used. As we saw this is
simply the identity idmapping. This would mean id ``u1000`` read from disk
would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` fields
would contain ``k1000``.

- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
``i_uid`` and ``i_gid`` would contain ``k11000``.

Translation algorithms
----------------------

We've already seen briefly that it is possible to translate between different
idmappings. We'll now take a closer look at how that works.

Crossmapping
~~~~~~~~~~~~

This translation algorithm is used by the kernel in quite a few places. For
example, it is used when reporting back the ownership of a file to userspace
via the ``stat()`` system call family.

If we've been given ``k11000`` from one idmapping we can map that id up in
another idmapping. In order for this to work both idmappings need to contain
the same kernel id in their kernel idmapsets. For example, consider the
following idmappings::

1. u0:k10000:r10000
2. u20000:k10000:r10000

and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
then translate ``k11000`` into a userspace id in the second idmapping using the
kernel idmapset of the second idmapping::

/* Map the kernel id up into a userspace id in the second idmapping. */
from_kuid(u20000:k10000:r10000, k11000) = u21000

Note how we can get back to the kernel id in the first idmapping by inverting
the algorithm::

/* Map the userspace id down into a kernel id in the second idmapping. */
make_kuid(u20000:k10000:r10000, u21000) = k11000

/* Map the kernel id up into a userspace id in the first idmapping. */
from_kuid(u0:k10000:r10000, k11000) = u1000

This algorithm allows us to answer the question what userspace id a given
kernel id corresponds to in a given idmapping. In order to be able to answer
this question both idmappings need to contain the same kernel id in their
respective kernel idmapsets.

For example, when the kernel reads a raw userspace id from disk it maps it down
into a kernel id according to the idmapping associated with the filesystem.
Let's assume the filesystem was mounted with an idmapping of
``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
the inode's ``i_uid`` and ``i_gid`` fields.

When someone in userspace calls ``stat()`` or a related function to get
ownership information about the file the kernel can't simply map the id back up
according to the filesystem's idmapping as this would give the wrong owner if
the caller is using a different idmapping.

So the kernel will map the id back up in the idmapping of the caller. Let's
assume the caller has the somewhat unconventional idmapping
``u3000:k20000:r10000`` then ``k21000`` would map back up to ``u4000``.
Consequently the user would see that this file is owned by ``u4000``.
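
The crossmapping steps in this section can be sketched in plain C. All names
here (``struct idmap``, ``map_down()``, ``map_up()``, ``crossmap()``,
``ID_INVALID``) are hypothetical stand-ins chosen for illustration; the sketch
models a single ``u:k:r`` triple and is not how the kernel implements
``make_kuid()`` or ``from_kuid()``.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for an unmapped id; plays the role of (uid_t)-1. */
#define ID_INVALID UINT32_MAX

/* Hypothetical u:k:r triple; not the kernel's real representation. */
struct idmap {
	uint32_t u; /* first userspace id */
	uint32_t k; /* first kernel id */
	uint32_t r; /* number of ids mapped */
};

/* Userspace id -> kernel id, in the spirit of make_kuid(). */
uint32_t map_down(struct idmap m, uint32_t uid)
{
	if (uid < m.u || uid - m.u >= m.r)
		return ID_INVALID;
	return uid - m.u + m.k;
}

/* Kernel id -> userspace id, in the spirit of from_kuid(). */
uint32_t map_up(struct idmap m, uint32_t kid)
{
	if (kid < m.k || kid - m.k >= m.r)
		return ID_INVALID;
	return kid - m.k + m.u;
}

/* Crossmap: down in the "from" idmapping, then up in the "to" idmapping. */
uint32_t crossmap(struct idmap from, struct idmap to, uint32_t uid)
{
	uint32_t kid = map_down(from, uid);

	return kid == ID_INVALID ? ID_INVALID : map_up(to, kid);
}

void check_crossmapping(void)
{
	struct idmap first = { 0, 10000, 10000 };      /* u0:k10000:r10000 */
	struct idmap second = { 20000, 10000, 10000 }; /* u20000:k10000:r10000 */
	struct idmap fs = { 0, 20000, 10000 };         /* u0:k20000:r10000 */
	struct idmap caller = { 3000, 20000, 10000 };  /* u3000:k20000:r10000 */

	assert(crossmap(first, second, 1000) == 21000); /* via k11000 */
	assert(crossmap(second, first, 21000) == 1000); /* the inverse */
	assert(crossmap(fs, caller, 1000) == 4000);     /* the stat() case */
}
```

The shared element in a crossmapping is the kernel id: both idmappings must
contain it in their lower idmapset, otherwise ``crossmap()`` yields
``ID_INVALID``.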

Remapping
~~~~~~~~~

It is possible to translate a kernel id from one idmapping to another one via
the userspace idmapset of the two idmappings. This is equivalent to remapping
a kernel id.

Let's look at an example. We are given the following two idmappings::

1. u0:k10000:r10000
2. u0:k20000:r10000

and we are given ``k11000`` in the first idmapping. In order to translate this
kernel id in the first idmapping into a kernel id in the second idmapping we
need to perform two steps:

1. Map the kernel id up into a userspace id in the first idmapping::

/* Map the kernel id up into a userspace id in the first idmapping. */
from_kuid(u0:k10000:r10000, k11000) = u1000

2. Map the userspace id down into a kernel id in the second idmapping::

/* Map the userspace id down into a kernel id in the second idmapping. */
make_kuid(u0:k20000:r10000, u1000) = k21000

As you can see we used the userspace idmapset in both idmappings to translate
the kernel id in one idmapping to a kernel id in another idmapping.

This allows us to answer the question what kernel id we would need to use to
get the same userspace id in another idmapping. In order to be able to answer
this question both idmappings need to contain the same userspace id in their
respective userspace idmapsets.

Note how we can easily get back to the kernel id in the first idmapping by
inverting the algorithm:

1. Map the kernel id up into a userspace id in the second idmapping::

/* Map the kernel id up into a userspace id in the second idmapping. */
from_kuid(u0:k20000:r10000, k21000) = u1000

2. Map the userspace id down into a kernel id in the first idmapping::

/* Map the userspace id down into a kernel id in the first idmapping. */
make_kuid(u0:k10000:r10000, u1000) = k11000

Another way to look at this translation is to treat it as inverting one
idmapping and applying another idmapping if both idmappings have the relevant
userspace id mapped. This will come in handy when working with idmapped mounts.
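
As a rough sketch, remapping is the mirror image of crossmapping: invert one
idmapping, then apply the other. The names below (``struct idmap``,
``map_down()``, ``map_up()``, ``remap()``, ``ID_INVALID``) are hypothetical and
only model a single ``u:k:r`` triple.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for an unmapped id; plays the role of (uid_t)-1. */
#define ID_INVALID UINT32_MAX

/* Hypothetical u:k:r triple; not the kernel's real representation. */
struct idmap {
	uint32_t u; /* first userspace id */
	uint32_t k; /* first kernel id */
	uint32_t r; /* number of ids mapped */
};

/* Userspace id -> kernel id, in the spirit of make_kuid(). */
uint32_t map_down(struct idmap m, uint32_t uid)
{
	if (uid < m.u || uid - m.u >= m.r)
		return ID_INVALID;
	return uid - m.u + m.k;
}

/* Kernel id -> userspace id, in the spirit of from_kuid(). */
uint32_t map_up(struct idmap m, uint32_t kid)
{
	if (kid < m.k || kid - m.k >= m.r)
		return ID_INVALID;
	return kid - m.k + m.u;
}

/* Remap: up in the "from" idmapping, then down in the "to" idmapping. */
uint32_t remap(struct idmap from, struct idmap to, uint32_t kid)
{
	uint32_t uid = map_up(from, kid);

	return uid == ID_INVALID ? ID_INVALID : map_down(to, uid);
}

void check_remapping(void)
{
	struct idmap first = { 0, 10000, 10000 };  /* u0:k10000:r10000 */
	struct idmap second = { 0, 20000, 10000 }; /* u0:k20000:r10000 */
	struct idmap small = { 0, 20000, 200 };    /* u0:k20000:r200 */

	assert(remap(first, second, 11000) == 21000);     /* via u1000 */
	assert(remap(second, first, 21000) == 11000);     /* the inverse */
	assert(remap(first, small, 11000) == ID_INVALID); /* u1000 unmapped */
}
```

The last assertion shows the precondition stated above: when the target
idmapping does not contain the userspace id, the remapping has no result.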

Invalid translations
~~~~~~~~~~~~~~~~~~~~

It is never valid to use an id in the kernel idmapset of one idmapping as the
id in the userspace idmapset of another or the same idmapping. The kernel
idmapset always contains ids in the kernel id space, whereas the userspace
idmapset contains userspace ids. So the following translations are forbidden::

/* Map the userspace id down into a kernel id in the first idmapping. */
make_kuid(u0:k10000:r10000, u1000) = k11000

/* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
make_kuid(u10000:k20000:r10000, k11000) = k21000
~~~~~~

and equally wrong::

/* Map the kernel id up into a userspace id in the first idmapping. */
from_kuid(u0:k10000:r10000, k11000) = u1000

/* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
from_kuid(u20000:k0:r10000, u1000) = u21000
~~~~~

Since userspace ids have type ``uid_t`` and ``gid_t`` and kernel ids have type
``kuid_t`` and ``kgid_t`` the compiler will throw an error when they are
conflated. So the two examples above would cause a compilation failure.
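
The reason the compiler can catch this is that a kernel id is wrapped in
a structure rather than being a bare integer. The following sketch mimics that
idea with hypothetical names (``raw_uid_t``, ``kid_t``, ``make_kid()``,
``from_kid()``); it is an illustration of the technique, not the kernel's
definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for an unmapped id; plays the role of (uid_t)-1. */
#define ID_INVALID UINT32_MAX

typedef uint32_t raw_uid_t;              /* stands in for uid_t */
typedef struct { raw_uid_t val; } kid_t; /* stands in for kuid_t */

/* Hypothetical u:k:r triple. */
struct idmap {
	uint32_t u, k, r;
};

/* Only accepts a userspace id and only returns a kernel id. */
kid_t make_kid(struct idmap m, raw_uid_t uid)
{
	if (uid < m.u || uid - m.u >= m.r)
		return (kid_t){ ID_INVALID };
	return (kid_t){ uid - m.u + m.k };
}

/* Only accepts a kernel id and only returns a userspace id. */
raw_uid_t from_kid(struct idmap m, kid_t kid)
{
	if (kid.val < m.k || kid.val - m.k >= m.r)
		return ID_INVALID;
	return kid.val - m.k + m.u;
}

void check_typed_ids(void)
{
	struct idmap first = { 0, 10000, 10000 }; /* u0:k10000:r10000 */
	kid_t kid = make_kid(first, 1000);

	assert(kid.val == 11000);
	assert(from_kid(first, kid) == 1000);
	/* make_kid(first, kid) is rejected: kid_t is not an integer. */
	/* from_kid(first, 1000) is rejected: an integer is not a kid_t. */
}
```

Uncommenting either of the two calls in the final comments makes the
translation unit fail to compile, which is the protection described above.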

Idmappings when creating filesystem objects
-------------------------------------------

The concepts of mapping an id down or mapping an id up are expressed in the two
kernel functions filesystem developers are rather familiar with and which we've
already used in this document::

/* Map the userspace id down into a kernel id. */
make_kuid(idmapping, uid)

/* Map the kernel id up into a userspace id. */
from_kuid(idmapping, kuid)

We will take an abbreviated look into how idmappings figure into creating
filesystem objects. For simplicity we will only look at what happens when the
VFS has already completed path lookup right before it calls into the filesystem
itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
called. We will also assume that the directory we're creating filesystem
objects in is readable and writable for everyone.

When creating a filesystem object the kernel will look at the caller's
filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
but they are exclusively used when determining file ownership which is why they
are called "filesystem ids". They are usually identical to the uid and gid of
the caller but can differ. We will just assume they are always identical to not
get lost in too many details.

When the caller enters the kernel two things happen:

1. Map the caller's userspace ids down into kernel ids in the caller's
idmapping.
(To be precise, the kernel will simply look at the kernel ids stashed in the
credentials of the current task but for our education we'll pretend this
translation happens just in time.)
2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
filesystem's idmapping.

The second step is important as regular filesystems will ultimately need to
map the kernel id back up into a userspace id when writing to disk.
So with the second step the kernel guarantees that a valid userspace id can be
written to disk. If it can't the kernel will refuse the creation request to not
even remotely risk filesystem corruption.

The astute reader will have realized that this is simply a variation of the
crossmapping algorithm we mentioned above in a previous section. First, the
kernel maps the caller's userspace id down into a kernel id according to the
caller's idmapping and then maps that kernel id up according to the
filesystem's idmapping.

From an implementation point of view it's worth mentioning how idmappings are
represented. All idmappings are taken from the corresponding user namespace.

- caller's idmapping (usually taken from ``current_user_ns()``)
- filesystem's idmapping (``sb->s_user_ns``)
- mount's idmapping (``mnt_idmap(vfsmnt)``)

Let's see some examples with caller/filesystem idmapping but without mount
idmappings. This will exhibit some problems we can hit. After that we will
revisit/reconsider these examples, this time using mount idmappings, to see how
they can solve the problems we observed before.

Example 1
~~~~~~~~~

caller id: u1000
caller idmapping: u0:k0:r4294967295
filesystem idmapping: u0:k0:r4294967295

Both the caller and the filesystem use the identity idmapping:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping.

For this second step the kernel will call the function
``fsuidgid_has_mapping()`` which ultimately boils down to calling
``from_kuid()``::

from_kuid(u0:k0:r4294967295, k1000) = u1000

In this example both idmappings are the same so there's nothing exciting going
on. Ultimately the userspace id that lands on disk will be ``u1000``.

Example 2
~~~~~~~~~

caller id: u1000
caller idmapping: u0:k10000:r10000
filesystem idmapping: u0:k20000:r10000

1. Map the caller's userspace ids down into kernel ids in the caller's
idmapping::

make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
filesystem's idmapping::

from_kuid(u0:k20000:r10000, k11000) = u-1

It's immediately clear that while the caller's userspace id could be
successfully mapped down into kernel ids in the caller's idmapping the kernel
ids could not be mapped up according to the filesystem's idmapping. So the
kernel will deny this creation request.

Note that while this example is less common, because most filesystems can't be
mounted with non-initial idmappings, this is a general problem as we can see in
the next examples.

Example 3
~~~~~~~~~

caller id: u1000
caller idmapping: u0:k10000:r10000
filesystem idmapping: u0:k0:r4294967295

1. Map the caller's userspace ids down into kernel ids in the caller's
idmapping::

make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
filesystem's idmapping::

from_kuid(u0:k0:r4294967295, k11000) = u11000

We can see that the translation always succeeds. The userspace id that the
filesystem will ultimately put to disk will always be identical to the value of
the kernel id that was created in the caller's idmapping. This has two main
consequences.

First, we can't allow a caller to ultimately write to disk with another
userspace id. We could only do this if we were to mount the whole filesystem
with the caller's or another idmapping. That solution is limited to a few
filesystems and not very flexible, even though this use-case is pretty
important in containerized workloads.

Second, the caller will usually not be able to create any files or access
directories that have stricter permissions because none of the filesystem's
kernel ids map up into valid userspace ids in the caller's idmapping:

1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::

make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map kernel ids up to userspace ids in the caller's idmapping::

from_kuid(u0:k10000:r10000, k1000) = u-1
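
Examples 1 to 3 all run the same two steps, so they can be replayed with one
small function. This is a sketch under simplifying assumptions: each idmapping
is a single ``u:k:r`` triple, and ``struct idmap``, ``map_down()``,
``map_up()``, ``uid_for_disk()`` and ``ID_INVALID`` are hypothetical names. The
real check is performed by ``fsuidgid_has_mapping()``, which this code does not
reproduce.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for an unmapped id; written as u-1 in the examples. */
#define ID_INVALID UINT32_MAX

/* Hypothetical u:k:r triple; not the kernel's real representation. */
struct idmap {
	uint32_t u; /* first userspace id */
	uint32_t k; /* first kernel id */
	uint32_t r; /* number of ids mapped */
};

/* Userspace id -> kernel id, in the spirit of make_kuid(). */
uint32_t map_down(struct idmap m, uint32_t uid)
{
	if (uid < m.u || uid - m.u >= m.r)
		return ID_INVALID;
	return uid - m.u + m.k;
}

/* Kernel id -> userspace id, in the spirit of from_kuid(). */
uint32_t map_up(struct idmap m, uint32_t kid)
{
	if (kid < m.k || kid - m.k >= m.r)
		return ID_INVALID;
	return kid - m.k + m.u;
}

/*
 * Step 1: map the caller's id down in the caller's idmapping.
 * Step 2: map the kernel id up in the filesystem's idmapping.
 * ID_INVALID means the creation request would be refused.
 */
uint32_t uid_for_disk(struct idmap caller, struct idmap fs, uint32_t uid)
{
	uint32_t kid = map_down(caller, uid);

	return kid == ID_INVALID ? ID_INVALID : map_up(fs, kid);
}

void check_creation_examples(void)
{
	struct idmap initial = { 0, 0, 4294967295u }; /* u0:k0:r4294967295 */
	struct idmap ten = { 0, 10000, 10000 };       /* u0:k10000:r10000 */
	struct idmap twenty = { 0, 20000, 10000 };    /* u0:k20000:r10000 */

	assert(uid_for_disk(initial, initial, 1000) == 1000);  /* Example 1 */
	assert(uid_for_disk(ten, twenty, 1000) == ID_INVALID); /* Example 2 */
	assert(uid_for_disk(ten, initial, 1000) == 11000);     /* Example 3 */
}
```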

Example 4
~~~~~~~~~

file id: u1000
caller idmapping: u0:k10000:r10000
filesystem idmapping: u0:k0:r4294967295

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
idmapping::

make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map the kernel id up into a userspace id in the caller's idmapping::

from_kuid(u0:k10000:r10000, k1000) = u-1

The crossmapping algorithm fails in this case because the kernel id in the
filesystem idmapping cannot be mapped up to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Example 5
~~~~~~~~~

file id: u1000
caller idmapping: u0:k10000:r10000
filesystem idmapping: u0:k20000:r10000

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
idmapping::

make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

from_kuid(u0:k10000:r10000, k21000) = u-1

Again, the crossmapping algorithm fails in this case because the kernel id in
the filesystem idmapping cannot be mapped to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Note how in the last two examples things would be simple if the caller were
using the initial idmapping. For a filesystem mounted with the initial
idmapping it would be trivial. So we only consider a filesystem with an
idmapping of ``u0:k20000:r10000``:

1. Map the userspace id on disk down into a kernel id in the filesystem's
idmapping::

make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

from_kuid(u0:k0:r4294967295, k21000) = u21000
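
The reporting path of Examples 4 and 5, including the fallback to the
overflowid, can be sketched as follows. The names are hypothetical
(``struct idmap``, ``map_down()``, ``map_up()``, ``reported_owner()``), and
``OVERFLOW_ID`` assumes the default overflowuid of 65534; on a running system
that value is configurable.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for an unmapped id; written as u-1 in the examples. */
#define ID_INVALID UINT32_MAX
/* Assumption: the default overflowuid/overflowgid value. */
#define OVERFLOW_ID 65534u

/* Hypothetical u:k:r triple; not the kernel's real representation. */
struct idmap {
	uint32_t u; /* first userspace id */
	uint32_t k; /* first kernel id */
	uint32_t r; /* number of ids mapped */
};

/* Userspace id -> kernel id, in the spirit of make_kuid(). */
uint32_t map_down(struct idmap m, uint32_t uid)
{
	if (uid < m.u || uid - m.u >= m.r)
		return ID_INVALID;
	return uid - m.u + m.k;
}

/* Kernel id -> userspace id, in the spirit of from_kuid(). */
uint32_t map_up(struct idmap m, uint32_t kid)
{
	if (kid < m.k || kid - m.k >= m.r)
		return ID_INVALID;
	return kid - m.k + m.u;
}

/* Crossmap the on-disk id for the caller; fall back to the overflowid. */
uint32_t reported_owner(struct idmap fs, struct idmap caller, uint32_t disk)
{
	uint32_t kid = map_down(fs, disk);
	uint32_t uid = kid == ID_INVALID ? ID_INVALID : map_up(caller, kid);

	return uid == ID_INVALID ? OVERFLOW_ID : uid;
}

void check_reporting_examples(void)
{
	struct idmap initial = { 0, 0, 4294967295u }; /* u0:k0:r4294967295 */
	struct idmap ten = { 0, 10000, 10000 };       /* u0:k10000:r10000 */
	struct idmap twenty = { 0, 20000, 10000 };    /* u0:k20000:r10000 */

	assert(reported_owner(initial, ten, 1000) == OVERFLOW_ID); /* Example 4 */
	assert(reported_owner(twenty, ten, 1000) == OVERFLOW_ID);  /* Example 5 */
	assert(reported_owner(twenty, initial, 1000) == 21000);    /* initial caller */
}
```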

Idmappings on idmapped mounts
-----------------------------

The examples we've seen in the previous section where the caller's idmapping
and the filesystem's idmapping are incompatible cause various issues for
workloads. For a more complex but common example, consider two containers
started on the host. To completely prevent the two containers from affecting
each other, an administrator may often use different non-overlapping idmappings
for the two containers::

container1 idmapping: u0:k10000:r10000
container2 idmapping: u0:k20000:r10000
filesystem idmapping: u0:k30000:r10000

An administrator wanting to provide easy read-write access to the following set
of files::

dir id: u0
dir/file1 id: u1000
dir/file2 id: u2000

to both containers currently can't.

Of course the administrator has the option to recursively change ownership via
``chown()``. For example, they could change ownership so that ``dir`` and all
files below it can be crossmapped from the filesystem's into the container's
idmapping. Let's assume they change ownership so it is compatible with the
first container's idmapping::

dir id: u10000
dir/file1 id: u11000
dir/file2 id: u12000

This would still leave ``dir`` rather useless to the second container. In fact,
``dir`` and all files below it would continue to appear owned by the overflowid
for the second container.

Or consider another increasingly popular example. Some service managers such as
systemd implement a concept called "portable home directories". A user may want
to use their home directories on different machines where they are assigned
different login userspace ids. Most users will have ``u1000`` as the login id
on their machine at home and all files in their home directory will usually be
owned by ``u1000``. At uni or at work they may have another login id such as
``u1125``. This makes it rather difficult to interact with their home directory
on their work machine.

In both cases changing ownership recursively has grave implications. The most
obvious one is that ownership is changed globally and permanently. In the home
directory case this change in ownership would even need to happen every time the
user switches from their home to their work machine. For really large sets of
files this becomes increasingly costly.

If the user is lucky, they are dealing with a filesystem that is mountable
inside user namespaces. But this would also change ownership globally and the
change in ownership is tied to the lifetime of the filesystem mount, i.e. the
superblock. The only way to change ownership is to completely unmount the
filesystem and mount it again in another user namespace. This is usually
impossible because it would mean that all users currently accessing the
filesystem can't anymore. And it means that ``dir`` still can't be shared
between two containers with different idmappings.
But usually the user doesn't even have this option since most filesystems
aren't mountable inside containers. And not having them mountable might be
desirable as it doesn't require the filesystem to deal with malicious
filesystem images.

But the use cases mentioned above and more can be handled by idmapped mounts.
They allow the same set of dentries to be exposed with different ownership at
different mounts. This is achieved by marking the mounts with a user namespace
through the ``mount_setattr()`` system call. The idmapping associated with it
is then used to translate from the caller's idmapping to the filesystem's
idmapping and vice versa using the remapping algorithm we introduced above.

Idmapped mounts make it possible to change ownership in a temporary and
localized way. The ownership changes are restricted to a specific mount and the
ownership changes are tied to the lifetime of the mount. All other users and
locations where the filesystem is exposed are unaffected.

Filesystems that support idmapped mounts don't have any real reason to support
being mountable inside user namespaces. A filesystem could be exposed
completely under an idmapped mount to get the same effect. This has the
advantage that filesystems can leave the creation of the superblock to
privileged users in the initial user namespace.

However, it is perfectly possible to combine idmapped mounts with filesystems
mountable inside user namespaces. We will touch on this further below.

Filesystem types vs idmapped mount types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the introduction of idmapped mounts we need to distinguish between
filesystem ownership and mount ownership of a VFS object such as an inode. The
owner of an inode might be different when looked at from a filesystem
perspective than when looked at from an idmapped mount. Such fundamental
conceptual distinctions should almost always be clearly expressed in the code.
So, to distinguish idmapped mount ownership from filesystem ownership separate
types have been introduced.

If a uid or gid has been generated using the filesystem or caller's idmapping
then we will use the ``kuid_t`` and ``kgid_t`` types. However, if a uid or gid
has been generated using a mount idmapping then we will be using the dedicated
``vfsuid_t`` and ``vfsgid_t`` types.

All VFS helpers that generate or take uids and gids as arguments use the
``vfsuid_t`` and ``vfsgid_t`` types and we will be able to rely on the compiler
to catch errors that originate from conflating filesystem and VFS uids and gids.

The ``vfsuid_t`` and ``vfsgid_t`` types are often mapped from and to ``kuid_t``
and ``kgid_t`` types similar to how ``kuid_t`` and ``kgid_t`` types are mapped
from and to ``uid_t`` and ``gid_t`` types::

uid_t <--> kuid_t <--> vfsuid_t
gid_t <--> kgid_t <--> vfsgid_t

Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type,
e.g., during ``stat()``, or store ownership information in a shared VFS object
based on a ``vfsuid_t`` or ``vfsgid_t`` type, e.g., during ``chown()`` we can
use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers.

To illustrate why these helpers exist, consider what happens when we
change ownership of an inode from an idmapped mount. After we have generated
a ``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping we later commit
this ``vfsuid_t`` or ``vfsgid_t`` as the new filesystem-wide ownership.
Thus, we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t``
or ``kgid_t``. And this can be done by using ``vfsuid_into_kuid()`` and
``vfsgid_into_kgid()``.

Note, whenever a shared VFS object, e.g., a cached ``struct inode`` or a cached
``struct posix_acl``, stores ownership information, a filesystem or "global"
``kuid_t`` and ``kgid_t`` must be used. Ownership expressed via ``vfsuid_t``
and ``vfsgid_t`` is specific to an idmapped mount.

We already noted that ``vfsuid_t`` and ``vfsgid_t`` types are generated based
on mount idmappings whereas ``kuid_t`` and ``kgid_t`` types are generated based
on filesystem idmappings. To prevent abusing filesystem idmappings to generate
``vfsuid_t`` or ``vfsgid_t`` types, or mount idmappings to generate ``kuid_t``
or ``kgid_t`` types, filesystem idmappings and mount idmappings are different
types as well.

All helpers that map to or from ``vfsuid_t`` and ``vfsgid_t`` types require
a mount idmapping to be passed which is of type ``struct mnt_idmap``. Passing
a filesystem or caller idmapping will cause a compilation error.

Similar to how we prefix all userspace ids in this document with ``u`` and all
kernel ids with ``k`` we will prefix all VFS ids with ``v``. So a mount
idmapping will be written as: ``u0:v10000:r10000``.

Remapping helpers
~~~~~~~~~~~~~~~~~

Idmapping functions were added that translate between idmappings. They make use
of the remapping algorithm we've introduced earlier. We're going to look at:

- ``i_uid_into_vfsuid()`` and ``i_gid_into_vfsgid()``

The ``i_*id_into_vfs*id()`` functions translate the filesystem's kernel ids
into VFS ids in the mount's idmapping::

/* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
from_kuid(filesystem, kid) = uid

/* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
make_kuid(mount, uid) = kuid

- ``mapped_fsuid()`` and ``mapped_fsgid()``

The ``mapped_fs*id()`` functions translate the caller's kernel ids into
kernel ids in the filesystem's idmapping. This translation is achieved by
remapping the caller's VFS ids using the mount's idmapping::

/* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
from_kuid(mount, kid) = uid

/* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(filesystem, uid) = kuid

- ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()``

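Whenever a ``vfsuid_t`` or ``vfsgid_t`` has to be reported or stored as a
filesystem wide ``kuid_t`` or ``kgid_t``, these helpers perform the conversion,
as described above.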

Note that ``i_*id_into_vfs*id()`` and ``mapped_fs*id()`` invert each other.
Consider the following idmappings::

caller idmapping: u0:k10000:r10000
filesystem idmapping: u0:k20000:r10000
mount idmapping: u0:v10000:r10000

Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
to ``k21000`` according to its idmapping. This is what is stored in the
inode's ``i_uid`` and ``i_gid`` fields.

When the caller queries the ownership of this file via ``stat()`` the kernel
would usually simply use the crossmapping algorithm and map the filesystem's
kernel id up to a userspace id in the caller's idmapping.

But when the caller is accessing the file on an idmapped mount the kernel will
first call ``i_uid_into_vfsuid()`` thereby translating the filesystem's kernel
id into a VFS id in the mount's idmapping::

i_uid_into_vfsuid(k21000):
/* Map the filesystem's kernel id up into a userspace id. */
from_kuid(u0:k20000:r10000, k21000) = u1000

/* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
make_kuid(u0:v10000:r10000, u1000) = v11000

Finally, when the kernel reports the owner to the caller it will turn the
VFS id in the mount's idmapping into a userspace id in the caller's
idmapping::

k11000 = vfsuid_into_kuid(v11000)
from_kuid(u0:k10000:r10000, k11000) = u1000
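The reporting path above can be sketched as a small userspace model. It
represents each idmapping as a single range, and every ``model_*`` name is
hypothetical. The kernel's real helpers take ``struct mnt_idmap`` and user
namespace arguments and handle multiple ranges, so this is only a simplified
sketch of the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_INVALID_ID UINT32_MAX /* stands for "no mapping exists" */

/* One range "uFIRST:kLOWER:rCOUNT" (or "vLOWER" for a mount idmapping). */
struct model_idmap {
    uint32_t first; /* first userspace id */
    uint32_t lower; /* first kernel or VFS id */
    uint32_t count; /* length of the range */
};

/* Map down: userspace id -> kernel or VFS id, in the spirit of make_kuid(). */
static uint32_t model_map_down(struct model_idmap m, uint32_t id)
{
    if (id == MODEL_INVALID_ID || id < m.first || id - m.first >= m.count)
        return MODEL_INVALID_ID;
    return m.lower + (id - m.first);
}

/* Map up: kernel or VFS id -> userspace id, in the spirit of from_kuid(). */
static uint32_t model_map_up(struct model_idmap m, uint32_t id)
{
    if (id == MODEL_INVALID_ID || id < m.lower || id - m.lower >= m.count)
        return MODEL_INVALID_ID;
    return m.first + (id - m.lower);
}

/* Modeled after i_uid_into_vfsuid(): up in the filesystem, down in the mount. */
static uint32_t model_i_uid_into_vfsuid(struct model_idmap fs,
                                        struct model_idmap mnt, uint32_t i_uid)
{
    return model_map_down(mnt, model_map_up(fs, i_uid));
}

/* Report ownership: the VFS id is mapped up in the caller's idmapping. */
static uint32_t model_stat_uid(struct model_idmap caller, struct model_idmap fs,
                               struct model_idmap mnt, uint32_t i_uid)
{
    /* vfsuid_into_kuid() is modeled as keeping the number unchanged. */
    return model_map_up(caller, model_i_uid_into_vfsuid(fs, mnt, i_uid));
}

/* The walk-through above: k21000 in the inode is reported as u1000. */
static void model_stat_example(void)
{
    struct model_idmap caller = { 0, 10000, 10000 }; /* u0:k10000:r10000 */
    struct model_idmap fs = { 0, 20000, 10000 };     /* u0:k20000:r10000 */
    struct model_idmap mnt = { 0, 10000, 10000 };    /* u0:v10000:r10000 */

    assert(model_i_uid_into_vfsuid(fs, mnt, 21000) == 11000);
    assert(model_stat_uid(caller, fs, mnt, 21000) == 1000);
}
```

In this model, mapping ``k21000`` up in the caller's idmapping directly, i.e.,
without going through the mount's idmapping, yields ``MODEL_INVALID_ID``
because ``k21000`` lies outside the caller's range.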

We can test whether this algorithm really works by verifying what happens when
we create a new file. Let's say the user is creating a file with ``u1000``.

The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
kernel would now apply the crossmapping, verifying that ``k11000`` can be
mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
be mapped up in the filesystem's idmapping directly this creation request
fails.

But when the caller is accessing the file on an idmapped mount the kernel will
first call ``mapped_fs*id()`` thereby translating the caller's kernel id into
a kernel id in the filesystem's idmapping via the mount's idmapping::

mapped_fsuid(v11000):
/* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
from_kuid(u0:v10000:r10000, v11000) = u1000

/* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k20000:r10000, u1000) = k21000

When finally writing to disk the kernel will then map ``k21000`` up into a
userspace id in the filesystem's idmapping::

from_kuid(u0:k20000:r10000, k21000) = u1000

As we can see, we end up with an invertible and therefore
information-preserving algorithm. A file created by ``u1000`` on an idmapped
mount will also be reported as being owned by ``u1000`` and vice versa.
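To make the invertibility claim concrete, the sketch below checks a
create-then-report round trip for every id of a mount's range. It uses a
simplified single-range model of an idmapping, and all ``model_*`` names are
hypothetical stand-ins rather than kernel functions:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_INVALID_ID UINT32_MAX /* stands for "no mapping exists" */

/* One range "uFIRST:kLOWER:rCOUNT" (or "vLOWER" for a mount idmapping). */
struct model_idmap {
    uint32_t first; /* first userspace id */
    uint32_t lower; /* first kernel or VFS id */
    uint32_t count; /* length of the range */
};

/* Map down: userspace id -> kernel or VFS id, in the spirit of make_kuid(). */
static uint32_t model_map_down(struct model_idmap m, uint32_t id)
{
    if (id == MODEL_INVALID_ID || id < m.first || id - m.first >= m.count)
        return MODEL_INVALID_ID;
    return m.lower + (id - m.first);
}

/* Map up: kernel or VFS id -> userspace id, in the spirit of from_kuid(). */
static uint32_t model_map_up(struct model_idmap m, uint32_t id)
{
    if (id == MODEL_INVALID_ID || id < m.lower || id - m.lower >= m.count)
        return MODEL_INVALID_ID;
    return m.first + (id - m.lower);
}

/* Modeled after mapped_fsuid(): up in the mount, down in the filesystem. */
static uint32_t model_mapped_fsuid(struct model_idmap mnt,
                                   struct model_idmap fs, uint32_t vfsid)
{
    return model_map_down(fs, model_map_up(mnt, vfsid));
}

/* Modeled after i_uid_into_vfsuid(): up in the filesystem, down in the mount. */
static uint32_t model_i_uid_into_vfsuid(struct model_idmap fs,
                                        struct model_idmap mnt, uint32_t i_uid)
{
    return model_map_down(mnt, model_map_up(fs, i_uid));
}

/* Return 1 if every VFS id of the mount survives the round trip, else 0. */
static int model_round_trip_ok(struct model_idmap mnt, struct model_idmap fs)
{
    for (uint32_t off = 0; off < mnt.count; off++) {
        uint32_t vfsid = mnt.lower + off;
        uint32_t kid = model_mapped_fsuid(mnt, fs, vfsid);

        if (kid == MODEL_INVALID_ID ||
            model_i_uid_into_vfsuid(fs, mnt, kid) != vfsid)
            return 0;
    }
    return 1;
}

/* The idmappings used above: u0:v10000:r10000 and u0:k20000:r10000. */
static void model_round_trip_example(void)
{
    struct model_idmap mnt = { 0, 10000, 10000 };
    struct model_idmap fs = { 0, 20000, 10000 };

    assert(model_mapped_fsuid(mnt, fs, 11000) == 21000);
    assert(model_i_uid_into_vfsuid(fs, mnt, 21000) == 11000);
    assert(model_round_trip_ok(mnt, fs));
}
```

The round trip only holds for ids that both idmappings cover. In this model, a
filesystem idmapping with a shorter range than the mount's makes
``model_round_trip_ok()`` return 0.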

Let's now briefly reconsider the failing examples from earlier in the context
of idmapped mounts.

Example 2 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

caller id: u1000
caller idmapping: u0:k10000:r10000
filesystem idmapping: u0:k20000:r10000
mount idmapping: u0:v10000:r10000

When the caller is using a non-initial idmapping the common case is to attach
the same idmapping to the mount. We now perform three steps:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

make_kuid(u0:k10000:r10000, u1000) = k11000

2. Translate the caller's VFS id into a kernel id in the filesystem's
idmapping::

mapped_fsuid(v11000):
/* Map the VFS id up into a userspace id in the mount's idmapping. */
from_kuid(u0:v10000:r10000, v11000) = u1000

/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k20000:r10000, u1000) = k21000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping::

from_kuid(u0:k20000:r10000, k21000) = u1000

So the ownership that lands on disk will be ``u1000``.

Example 3 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

caller id: u1000
caller idmapping: u0:k10000:r10000
filesystem idmapping: u0:k0:r4294967295
mount idmapping: u0:v10000:r10000

The same translation algorithm works with the third example.

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

make_kuid(u0:k10000:r10000, u1000) = k11000

2. Translate the caller's VFS id into a kernel id in the filesystem's
idmapping::

mapped_fsuid(v11000):
/* Map the VFS id up into a userspace id in the mount's idmapping. */
from_kuid(u0:v10000:r10000, v11000) = u1000

/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k0:r4294967295, u1000) = k1000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping::

from_kuid(u0:k0:r4294967295, k1000) = u1000

So the ownership that lands on disk will be ``u1000``.
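The three creation steps used in Examples 2 and 3 can be sketched in a few
lines of C. This is a simplified model that assumes single-range idmappings,
and the ``model_*`` names are hypothetical. For contrast it also includes the
plain crossmapping without an idmapped mount, which fails for the idmappings of
Example 2 as described earlier:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_INVALID_ID UINT32_MAX /* stands for "no mapping exists" */

/* One range "uFIRST:kLOWER:rCOUNT" (or "vLOWER" for a mount idmapping). */
struct model_idmap {
    uint32_t first; /* first userspace id */
    uint32_t lower; /* first kernel or VFS id */
    uint32_t count; /* length of the range */
};

/* Map down: userspace id -> kernel or VFS id, in the spirit of make_kuid(). */
static uint32_t model_map_down(struct model_idmap m, uint32_t id)
{
    if (id == MODEL_INVALID_ID || id < m.first || id - m.first >= m.count)
        return MODEL_INVALID_ID;
    return m.lower + (id - m.first);
}

/* Map up: kernel or VFS id -> userspace id, in the spirit of from_kuid(). */
static uint32_t model_map_up(struct model_idmap m, uint32_t id)
{
    if (id == MODEL_INVALID_ID || id < m.lower || id - m.lower >= m.count)
        return MODEL_INVALID_ID;
    return m.first + (id - m.lower);
}

/* Userspace id that lands on disk when creating through an idmapped mount. */
static uint32_t model_create_uid(struct model_idmap caller,
                                 struct model_idmap mnt,
                                 struct model_idmap fs, uint32_t uid)
{
    /* Step 1: map the caller's userspace id down in the caller's idmapping. */
    uint32_t kid = model_map_down(caller, uid);
    /* Step 2: remap, up in the mount's and down in the filesystem's idmapping. */
    uint32_t fs_kid = model_map_down(fs, model_map_up(mnt, kid));

    /* Step 3: map up in the filesystem's idmapping to get the on-disk id. */
    return model_map_up(fs, fs_kid);
}

/* The same request without an idmapped mount: plain crossmapping. */
static uint32_t model_create_uid_nomount(struct model_idmap caller,
                                         struct model_idmap fs, uint32_t uid)
{
    return model_map_up(fs, model_map_down(caller, uid));
}

static void model_create_example(void)
{
    struct model_idmap caller = { 0, 10000, 10000 }; /* u0:k10000:r10000 */
    struct model_idmap mnt = { 0, 10000, 10000 };    /* u0:v10000:r10000 */
    struct model_idmap fs2 = { 0, 20000, 10000 };    /* Example 2 */
    struct model_idmap fs3 = { 0, 0, UINT32_MAX };   /* Example 3 */

    assert(model_create_uid(caller, mnt, fs2, 1000) == 1000);
    assert(model_create_uid(caller, mnt, fs3, 1000) == 1000);
    /* Without the mount, k11000 has no mapping in u0:k20000:r10000. */
    assert(model_create_uid_nomount(caller, fs2, 1000) == MODEL_INVALID_ID);
}
```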

Example 4 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

file id: u1000
caller idmapping: u0:k10000:r10000
filesystem idmapping: u0:k0:r4294967295
mount idmapping: u0:v10000:r10000

In order to report ownership to userspace the kernel now performs three steps
using the translation algorithm we introduced earlier:

1. Map the userspace id on disk down into a kernel id in the filesystem's
idmapping::

make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Translate the kernel id into a VFS id in the mount's idmapping::

i_uid_into_vfsuid(k1000):
/* Map the kernel id up into a userspace id in the filesystem's idmapping. */
from_kuid(u0:k0:r4294967295, k1000) = u1000

/* Map the userspace id down into a VFS id in the mount's idmapping. */
make_kuid(u0:v10000:r10000, u1000) = v11000

3. Map the VFS id up into a userspace id in the caller's idmapping::

k11000 = vfsuid_into_kuid(v11000)
from_kuid(u0:k10000:r10000, k11000) = u1000

Earlier, the file's kernel id couldn't be crossmapped into the caller's
idmapping. With the idmapped mount in place it can now be crossmapped into the
caller's idmapping via the mount's idmapping. The file is now reported as owned
by ``u1000`` according to the mount's idmapping.

Example 5 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

file id: u1000
caller idmapping: u0:k10000:r10000
filesystem idmapping: u0:k20000:r10000
mount idmapping: u0:v10000:r10000

Again, in order to report ownership to userspace the kernel now does three
steps using the translation algorithm we introduced earlier:

1. Map the userspace id on disk down into a kernel id in the filesystem's
idmapping::

make_kuid(u0:k20000:r10000, u1000) = k21000

2. Translate the kernel id into a VFS id in the mount's idmapping::

i_uid_into_vfsuid(k21000):
/* Map the kernel id up into a userspace id in the filesystem's idmapping. */
from_kuid(u0:k20000:r10000, k21000) = u1000

/* Map the userspace id down into a VFS id in the mount's idmapping. */
make_kuid(u0:v10000:r10000, u1000) = v11000

3. Map the VFS id up into a userspace id in the caller's idmapping::

k11000 = vfsuid_into_kuid(v11000)
from_kuid(u0:k10000:r10000, k11000) = u1000

Earlier, the file's kernel id couldn't be crossmapped into the caller's
idmapping. With the idmapped mount in place it can now be crossmapped into the
caller's idmapping via the mount's idmapping. The file is now owned by
``u1000`` according to the mount's idmapping.
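Examples 4 and 5 differ only in the filesystem's idmapping, which the
following sketch makes explicit by running the same three reporting steps for
both. It is a simplified single-range model with hypothetical ``model_*``
names, not the kernel's implementation:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_INVALID_ID UINT32_MAX /* stands for "no mapping exists" */

/* One range "uFIRST:kLOWER:rCOUNT" (or "vLOWER" for a mount idmapping). */
struct model_idmap {
    uint32_t first; /* first userspace id */
    uint32_t lower; /* first kernel or VFS id */
    uint32_t count; /* length of the range */
};

/* Map down: userspace id -> kernel or VFS id, in the spirit of make_kuid(). */
static uint32_t model_map_down(struct model_idmap m, uint32_t id)
{
    if (id == MODEL_INVALID_ID || id < m.first || id - m.first >= m.count)
        return MODEL_INVALID_ID;
    return m.lower + (id - m.first);
}

/* Map up: kernel or VFS id -> userspace id, in the spirit of from_kuid(). */
static uint32_t model_map_up(struct model_idmap m, uint32_t id)
{
    if (id == MODEL_INVALID_ID || id < m.lower || id - m.lower >= m.count)
        return MODEL_INVALID_ID;
    return m.first + (id - m.lower);
}

/* Userspace id reported to the caller for an on-disk id, via the mount. */
static uint32_t model_report_uid(struct model_idmap caller,
                                 struct model_idmap mnt,
                                 struct model_idmap fs, uint32_t disk_uid)
{
    /* Step 1: map the on-disk id down in the filesystem's idmapping. */
    uint32_t kid = model_map_down(fs, disk_uid);
    /* Step 2: translate the kernel id into a VFS id of the mount. */
    uint32_t vfsid = model_map_down(mnt, model_map_up(fs, kid));

    /* Step 3: map the VFS id up in the caller's idmapping. */
    return model_map_up(caller, vfsid);
}

/* The same report without an idmapped mount: plain crossmapping. */
static uint32_t model_report_uid_nomount(struct model_idmap caller,
                                         struct model_idmap fs,
                                         uint32_t disk_uid)
{
    return model_map_up(caller, model_map_down(fs, disk_uid));
}

static void model_report_example(void)
{
    struct model_idmap caller = { 0, 10000, 10000 }; /* u0:k10000:r10000 */
    struct model_idmap mnt = { 0, 10000, 10000 };    /* u0:v10000:r10000 */
    struct model_idmap fs4 = { 0, 0, UINT32_MAX };   /* Example 4 */
    struct model_idmap fs5 = { 0, 20000, 10000 };    /* Example 5 */

    assert(model_report_uid(caller, mnt, fs4, 1000) == 1000);
    assert(model_report_uid(caller, mnt, fs5, 1000) == 1000);
    /* Without the mount neither k1000 nor k21000 maps into the caller's range. */
    assert(model_report_uid_nomount(caller, fs4, 1000) == MODEL_INVALID_ID);
    assert(model_report_uid_nomount(caller, fs5, 1000) == MODEL_INVALID_ID);
}
```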

Changing ownership on a home directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We've seen above how idmapped mounts can be used to translate between
idmappings when either the caller, the filesystem or both use a non-initial
idmapping. A wide range of use cases exist when the caller is using
a non-initial idmapping. This mostly happens in the context of containerized
workloads. The consequence is, as we have seen, that for both filesystems
mounted with the initial idmapping and filesystems mounted with non-initial
idmappings, access to the filesystem isn't working because the kernel ids can't
be crossmapped between the caller's and the filesystem's idmapping.

As we've seen above, idmapped mounts provide a solution to this by remapping
the caller's or filesystem's idmapping according to the mount's idmapping.

Aside from containerized workloads, idmapped mounts have the advantage that
they also work when both the caller and the filesystem use the initial
idmapping which means users on the host can change the ownership of directories
and files on a per-mount basis.

Consider our previous example where a user has their home directory on portable
storage. At home they have id ``u1000`` and all files in their home directory
are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.

Taking their home directory with them becomes problematic. They can't easily
access their files, they might not be able to write to disk without applying
lax permissions or ACLs and even if they can, they will end up with an annoying
mix of files and directories owned by ``u1000`` and ``u1125``.

Idmapped mounts solve this problem. A user can create an idmapped mount for
their home directory on their work computer or their computer at home,
depending on what ownership they would prefer to end up on the portable storage
itself.

Let's assume they want all files on disk to belong to ``u1000``. When the user
plugs in their portable storage at their workstation they can set up a job that
creates an idmapped mount with the minimal idmapping ``u1000:v1125:r1``. So now
when they create a file the kernel performs the following steps we already know
from above::

caller id: u1125
caller idmapping: u0:k0:r4294967295
filesystem idmapping: u0:k0:r4294967295
mount idmapping: u1000:v1125:r1

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

make_kuid(u0:k0:r4294967295, u1125) = k1125

2. Translate the caller's VFS id into a kernel id in the filesystem's
idmapping::

mapped_fsuid(v1125):
/* Map the VFS id up into a userspace id in the mount's idmapping. */
from_kuid(u1000:v1125:r1, v1125) = u1000

/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k0:r4294967295, u1000) = k1000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping::

from_kuid(u0:k0:r4294967295, k1000) = u1000

So ultimately the file will be created with ``u1000`` on disk.

Now let's briefly look at what ownership the caller with id ``u1125`` will see
on their work computer:

file id: u1000
caller idmapping: u0:k0:r4294967295
filesystem idmapping: u0:k0:r4294967295
mount idmapping: u1000:v1125:r1

1. Map the userspace id on disk down into a kernel id in the filesystem's
idmapping::

make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Translate the kernel id into a VFS id in the mount's idmapping::

i_uid_into_vfsuid(k1000):
/* Map the kernel id up into a userspace id in the filesystem's idmapping. */
from_kuid(u0:k0:r4294967295, k1000) = u1000

/* Map the userspace id down into a VFS id in the mount's idmapping. */
make_kuid(u1000:v1125:r1, u1000) = v1125

3. Map the VFS id up into a userspace id in the caller's idmapping::

k1125 = vfsuid_into_kuid(v1125)
from_kuid(u0:k0:r4294967295, k1125) = u1125

So ultimately the file will be reported to the caller as belonging to ``u1125``
which is the caller's userspace id on their workstation in our example.

The raw userspace id that is put on disk is ``u1000``, so when the user takes
their home directory back to their home computer, where they are assigned
``u1000`` using the initial idmapping, and mounts the filesystem with the
initial idmapping they will see all those files owned by ``u1000``.
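The home directory scenario can be replayed with a small userspace model of
the minimal mount idmapping ``u1000:v1125:r1``. The sketch assumes single-range
idmappings and uses hypothetical ``model_*`` names. It also shows that, in this
model, ids outside the one-element range have no mapping through the mount:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_INVALID_ID UINT32_MAX /* stands for "no mapping exists" */

/* One range "uFIRST:kLOWER:rCOUNT" (or "vLOWER" for a mount idmapping). */
struct model_idmap {
    uint32_t first; /* first userspace id */
    uint32_t lower; /* first kernel or VFS id */
    uint32_t count; /* length of the range */
};

/* Map down: userspace id -> kernel or VFS id, in the spirit of make_kuid(). */
static uint32_t model_map_down(struct model_idmap m, uint32_t id)
{
    if (id == MODEL_INVALID_ID || id < m.first || id - m.first >= m.count)
        return MODEL_INVALID_ID;
    return m.lower + (id - m.first);
}

/* Map up: kernel or VFS id -> userspace id, in the spirit of from_kuid(). */
static uint32_t model_map_up(struct model_idmap m, uint32_t id)
{
    if (id == MODEL_INVALID_ID || id < m.lower || id - m.lower >= m.count)
        return MODEL_INVALID_ID;
    return m.first + (id - m.lower);
}

/* Creation: caller's userspace id -> userspace id that lands on disk. */
static uint32_t model_create_uid(struct model_idmap caller,
                                 struct model_idmap mnt,
                                 struct model_idmap fs, uint32_t uid)
{
    uint32_t kid = model_map_down(caller, uid);
    uint32_t fs_kid = model_map_down(fs, model_map_up(mnt, kid));

    return model_map_up(fs, fs_kid);
}

/* Reporting: on-disk userspace id -> userspace id shown to the caller. */
static uint32_t model_report_uid(struct model_idmap caller,
                                 struct model_idmap mnt,
                                 struct model_idmap fs, uint32_t disk_uid)
{
    uint32_t kid = model_map_down(fs, disk_uid);
    uint32_t vfsid = model_map_down(mnt, model_map_up(fs, kid));

    return model_map_up(caller, vfsid);
}

static void model_home_example(void)
{
    struct model_idmap init = { 0, 0, UINT32_MAX }; /* u0:k0:r4294967295 */
    struct model_idmap mnt = { 1000, 1125, 1 };     /* u1000:v1125:r1 */

    /* At work: u1125 creates a file that lands on disk as u1000. */
    assert(model_create_uid(init, mnt, init, 1125) == 1000);
    /* At work: the on-disk u1000 is reported as u1125. */
    assert(model_report_uid(init, mnt, init, 1000) == 1125);
    /* Ids outside the single mapped id have no mapping in this model. */
    assert(model_create_uid(init, mnt, init, 1126) == MODEL_INVALID_ID);
    assert(model_report_uid(init, mnt, init, 1001) == MODEL_INVALID_ID);
}
```

A range of length one keeps the remapping as narrow as possible: only the
user's own id is translated, and every other id stays unmapped on that mount.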