Linux Kernel Documentation / powerpc / papr

Documentation / powerpc / papr_hcalls.rst

Based on kernel version 6.6. Page generated on 2023-10-31 12:55 EST.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302
.. SPDX-License-Identifier: GPL-2.0

===========================
Hypercall Op-codes (hcalls)
===========================

Overview
=========

Virtualization on 64-bit Power Book3S Platforms is based on the PAPR
specification [1]_ which describes the run-time environment for a guest
operating system and how it should interact with the hypervisor for
privileged operations. Currently there are two PAPR compliant hypervisors:

- **IBM PowerVM (PHYP)**: IBM's proprietary hypervisor that supports AIX,
  IBM-i and  Linux as supported guests (termed as Logical Partitions
  or LPARS). It supports the full PAPR specification.

- **Qemu/KVM**: Supports PPC64 linux guests running on a PPC64 linux host.
  Though it only implements a subset of PAPR specification called LoPAPR [2]_.

On PPC64 arch a guest kernel running on top of a PAPR hypervisor is called
a *pSeries guest*. A pseries guest runs in a supervisor mode (HV=0) and must
issue hypercalls to the hypervisor whenever it needs to perform an action
that is hypervisor privileged [3]_ or for other services managed by the
hypervisor.

Hence a Hypercall (hcall) is essentially a request by the pseries guest
asking hypervisor to perform a privileged operation on behalf of the guest. The
guest issues a with necessary input operands. The hypervisor after performing
the privilege operation returns a status code and output operands back to the
guest.

HCALL ABI
=========
The ABI specification for a hcall between a pseries guest and PAPR hypervisor
is covered in section 14.5.3 of ref [2]_. Switch to the  Hypervisor context is
done via the instruction **HVCS** that expects the Opcode for hcall is set in *r3*
and any in-arguments for the hcall are provided in registers *r4-r12*. If values
have to be passed through a memory buffer, the data stored in that buffer should be
in Big-endian byte order.

Once control returns back to the guest after hypervisor has serviced the
'HVCS' instruction the return value of the hcall is available in *r3* and any
out values are returned in registers *r4-r12*. Again like in case of in-arguments,
any out values stored in a memory buffer will be in Big-endian byte order.

Powerpc arch code provides convenient wrappers named **plpar_hcall_xxx** defined
in a arch specific header [4]_ to issue hcalls from the linux kernel
running as pseries guest.

Register Conventions
====================

Any hcall should follow same register convention as described in section 2.2.1.1
of "64-Bit ELF V2 ABI Specification: Power Architecture"[5]_. Table below
summarizes these conventions:

+----------+----------+-------------------------------------------+
| Register |Volatile  |  Purpose                                  |
| Range    |(Y/N)     |                                           |
+==========+==========+===========================================+
|   r0     |    Y     |  Optional-usage                           |
+----------+----------+-------------------------------------------+
|   r1     |    N     |  Stack Pointer                            |
+----------+----------+-------------------------------------------+
|   r2     |    N     |  TOC                                      |
+----------+----------+-------------------------------------------+
|   r3     |    Y     |  hcall opcode/return value                |
+----------+----------+-------------------------------------------+
|  r4-r10  |    Y     |  in and out values                        |
+----------+----------+-------------------------------------------+
|   r11    |    Y     |  Optional-usage/Environmental pointer     |
+----------+----------+-------------------------------------------+
|   r12    |    Y     |  Optional-usage/Function entry address at |
|          |          |  global entry point                       |
+----------+----------+-------------------------------------------+
|   r13    |    N     |  Thread-Pointer                           |
+----------+----------+-------------------------------------------+
|  r14-r31 |    N     |  Local Variables                          |
+----------+----------+-------------------------------------------+
|    LR    |    Y     |  Link Register                            |
+----------+----------+-------------------------------------------+
|   CTR    |    Y     |  Loop Counter                             |
+----------+----------+-------------------------------------------+
|   XER    |    Y     |  Fixed-point exception register.          |
+----------+----------+-------------------------------------------+
|  CR0-1   |    Y     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  CR2-4   |    N     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  CR5-7   |    Y     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  Others  |    N     |                                           |
+----------+----------+-------------------------------------------+

DRC & DRC Indexes
=================
::

     DR1                                  Guest
     +--+        +------------+         +---------+
     |  | <----> |            |         |  User   |
     +--+  DRC1  |            |   DRC   |  Space  |
                 |    PAPR    |  Index  +---------+
     DR2         | Hypervisor |         |         |
     +--+        |            | <-----> |  Kernel |
     |  | <----> |            |  Hcall  |         |
     +--+  DRC2  +------------+         +---------+

PAPR hypervisor terms shared hardware resources like PCI devices, NVDIMMs etc
available for use by LPARs as Dynamic Resource (DR). When a DR is allocated to
an LPAR, PHYP creates a data-structure called Dynamic Resource Connector (DRC)
to manage LPAR access. An LPAR refers to a DRC via an opaque 32-bit number
called DRC-Index. The DRC-index value is provided to the LPAR via device-tree
where its present as an attribute in the device tree node associated with the
DR.

HCALL Return-values
===================

After servicing the hcall, hypervisor sets the return-value in *r3* indicating
success or failure of the hcall. In case of a failure an error code indicates
the cause for error. These codes are defined and documented in arch specific
header [4]_.

In some cases a hcall can potentially take a long time and need to be issued
multiple times in order to be completely serviced. These hcalls will usually
accept an opaque value *continue-token* within there argument list and a
return value of *H_CONTINUE* indicates that hypervisor hasn't still finished
servicing the hcall yet.

To make such hcalls the guest need to set *continue-token == 0* for the
initial call and use the hypervisor returned value of *continue-token*
for each subsequent hcall until hypervisor returns a non *H_CONTINUE*
return value.

HCALL Op-codes
==============

Below is a partial list of HCALLs that are supported by PHYP. For the
corresponding opcode values please look into the arch specific header [4]_:

**H_SCM_READ_METADATA**

| Input: *drcIndex, offset, buffer-address, numBytesToRead*
| Out: *numBytesRead*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_Hardware*

Given a DRC Index of an NVDIMM, read N-bytes from the metadata area
associated with it, at a specified offset and copy it to provided buffer.
The metadata area stores configuration information such as label information,
bad-blocks etc. The metadata area is located out-of-band of NVDIMM storage
area hence a separate access semantics is provided.

**H_SCM_WRITE_METADATA**

| Input: *drcIndex, offset, data, numBytesToWrite*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P4, H_Hardware*

Given a DRC Index of an NVDIMM, write N-bytes to the metadata area
associated with it, at the specified offset and from the provided buffer.

**H_SCM_BIND_MEM**

| Input: *drcIndex, startingScmBlockIndex, numScmBlocksToBind,*
| *targetLogicalMemoryAddress, continue-token*
| Out: *continue-token, targetLogicalMemoryAddress, numScmBlocksToBound*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_P4, H_Overlap,*
| *H_Too_Big, H_P5, H_Busy*

Given a DRC-Index of an NVDIMM, map a continuous SCM blocks range
*(startingScmBlockIndex, startingScmBlockIndex+numScmBlocksToBind)* to the guest
at *targetLogicalMemoryAddress* within guest physical address space. In
case *targetLogicalMemoryAddress == 0xFFFFFFFF_FFFFFFFF* then hypervisor
assigns a target address to the guest. The HCALL can fail if the Guest has
an active PTE entry to the SCM block being bound.

**H_SCM_UNBIND_MEM**
| Input: drcIndex, startingScmLogicalMemoryAddress, numScmBlocksToUnbind
| Out: numScmBlocksUnbound
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Overlap,*
| *H_Busy, H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Given a DRC-Index of an NVDimm, unmap *numScmBlocksToUnbind* SCM blocks starting
at *startingScmLogicalMemoryAddress* from guest physical address space. The
HCALL can fail if the Guest has an active PTE entry to the SCM block being
unbound.

**H_SCM_QUERY_BLOCK_MEM_BINDING**

| Input: *drcIndex, scmBlockIndex*
| Out: *Guest-Physical-Address*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a DRC-Index and an SCM Block index return the guest physical address to
which the SCM block is mapped to.

**H_SCM_QUERY_LOGICAL_MEM_BINDING**

| Input: *Guest-Physical-Address*
| Out: *drcIndex, scmBlockIndex*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a guest physical address return which DRC Index and SCM block is mapped
to that address.

**H_SCM_UNBIND_ALL**

| Input: *scmTargetScope, drcIndex*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Busy,*
| *H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Depending on the Target scope unmap all SCM blocks belonging to all NVDIMMs
or all SCM blocks belonging to a single NVDIMM identified by its drcIndex
from the LPAR memory.

**H_SCM_HEALTH**

| Input: drcIndex
| Out: *health-bitmap (r4), health-bit-valid-bitmap (r5)*
| Return Value: *H_Success, H_Parameter, H_Hardware*

Given a DRC Index return the info on predictive failure and overall health of
the PMEM device. The asserted bits in the health-bitmap indicate one or more states
(described in table below) of the PMEM device and health-bit-valid-bitmap indicate
which bits in health-bitmap are valid. The bits are reported in
reverse bit ordering for example a value of 0xC400000000000000
indicates bits 0, 1, and 5 are valid.

Health Bitmap Flags:

+------+-----------------------------------------------------------------------+
|  Bit |               Definition                                              |
+======+=======================================================================+
|  00  | PMEM device is unable to persist memory contents.                     |
|      | If the system is powered down, nothing will be saved.                 |
+------+-----------------------------------------------------------------------+
|  01  | PMEM device failed to persist memory contents. Either contents were   |
|      | not saved successfully on power down or were not restored properly on |
|      | power up.                                                             |
+------+-----------------------------------------------------------------------+
|  02  | PMEM device contents are persisted from previous IPL. The data from   |
|      | the last boot were successfully restored.                             |
+------+-----------------------------------------------------------------------+
|  03  | PMEM device contents are not persisted from previous IPL. There was no|
|      | data to restore from the last boot.                                   |
+------+-----------------------------------------------------------------------+
|  04  | PMEM device memory life remaining is critically low                   |
+------+-----------------------------------------------------------------------+
|  05  | PMEM device will be garded off next IPL due to failure                |
+------+-----------------------------------------------------------------------+
|  06  | PMEM device contents cannot persist due to current platform health    |
|      | status. A hardware failure may prevent data from being saved or       |
|      | restored.                                                             |
+------+-----------------------------------------------------------------------+
|  07  | PMEM device is unable to persist memory contents in certain conditions|
+------+-----------------------------------------------------------------------+
|  08  | PMEM device is encrypted                                              |
+------+-----------------------------------------------------------------------+
|  09  | PMEM device has successfully completed a requested erase or secure    |
|      | erase procedure.                                                      |
+------+-----------------------------------------------------------------------+
|10:63 | Reserved / Unused                                                     |
+------+-----------------------------------------------------------------------+

**H_SCM_PERFORMANCE_STATS**

| Input: drcIndex, resultBuffer Addr
| Out: None
| Return Value:  *H_Success, H_Parameter, H_Unsupported, H_Hardware, H_Authority, H_Privilege*

Given a DRC Index collect the performance statistics for NVDIMM and copy them
to the resultBuffer.

**H_SCM_FLUSH**

| Input: *drcIndex, continue-token*
| Out: *continue-token*
| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*

Given a DRC Index Flush the data to backend NVDIMM device.

The hcall returns H_BUSY when the flush takes longer time and the hcall needs
to be issued multiple times in order to be completely serviced. The
*continue-token* from the output to be passed in the argument list of
subsequent hcalls to the hypervisor until the hcall is completely serviced
at which point H_SUCCESS or other error is returned by the hypervisor.

References
==========
.. [1] "Power Architecture Platform Reference"
       https://en.wikipedia.org/wiki/Power_Architecture_Platform_Reference
.. [2] "Linux on Power Architecture Platform Reference"
       https://members.openpowerfoundation.org/document/dl/469
.. [3] "Definitions and Notation" Book III-Section 14.5.3
       https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
.. [4] arch/powerpc/include/asm/hvcall.h
.. [5] "64-Bit ELF V2 ABI Specification: Power Architecture"
       https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specification-power-architecture