Documentation / accel / qaic / qaic.rst


Based on kernel version 6.11. Page generated on 2024-09-24 08:21 EST.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
.. SPDX-License-Identifier: GPL-2.0-only

=============
 QAIC driver
=============

The QAIC driver is the Kernel Mode Driver (KMD) for the AIC100 family of AI
accelerator products.

Interrupts
==========

IRQ Storm Mitigation
--------------------

While the AIC100 DMA Bridge hardware implements an IRQ storm mitigation
mechanism, it is still possible for an IRQ storm to occur. A storm can happen
if the workload is particularly quick, and the host is responsive. If the host
can drain the response FIFO as quickly as the device can insert elements into
it, then the device will frequently transition the response FIFO from empty to
non-empty and generate MSIs at a rate equivalent to the speed of the
workload's ability to process inputs. The lprnet (license plate reader network)
workload is known to trigger this condition, and can generate in excess of 100k
MSIs per second. It has been observed that most systems cannot tolerate this
for long, and will crash due to some form of watchdog due to the overhead of
the interrupt controller interrupting the host CPU.

To mitigate this issue, the QAIC driver implements specific IRQ handling. When
QAIC receives an IRQ, it disables that line. This prevents the interrupt
controller from interrupting the CPU. Then AIC drains the FIFO. Once the FIFO
is drained, QAIC implements a "last chance" polling algorithm where QAIC will
sleep for a time to see if the workload will generate more activity. The IRQ
line remains disabled during this time. If no activity is detected, QAIC exits
polling mode and reenables the IRQ line.

This mitigation in QAIC is very effective. The same lprnet usecase that
generates 100k IRQs per second (per /proc/interrupts) is reduced to roughly 64
IRQs over 5 minutes while keeping the host system stable, and having the same
workload throughput performance (within run to run noise variation).

Single MSI Mode
---------------

MultiMSI is not well supported on all systems; virtualized ones even less so
(circa 2023). Between hypervisors masking the PCIe MSI capability structure to
large memory requirements for vIOMMUs (required for supporting MultiMSI), it is
useful to be able to fall back to a single MSI when needed.

To support this fallback, we allow the case where only one MSI is able to be
allocated, and share that one MSI between MHI and the DBCs. The device detects
when only one MSI has been configured and directs the interrupts for the DBCs
to the interrupt normally used for MHI. Unfortunately this means that the
interrupt handlers for every DBC and MHI wake up for every interrupt that
arrives; however, the DBC threaded irq handlers only are started when work to be
done is detected (MHI will always start its threaded handler).

If the DBC is configured to force MSI interrupts, this can circumvent the
software IRQ storm mitigation mentioned above. Since the MSI is shared it is
never disabled, allowing each new entry to the FIFO to trigger a new interrupt.


Neural Network Control (NNC) Protocol
=====================================

The implementation of NNC is split between the KMD (QAIC) and UMD. In general
QAIC understands how to encode/decode NNC wire protocol, and elements of the
protocol which require kernel space knowledge to process (for example, mapping
host memory to device IOVAs). QAIC understands the structure of a message, and
all of the transactions. QAIC does not understand commands (the payload of a
passthrough transaction).

QAIC handles and enforces the required little endianness and 64-bit alignment,
to the degree that it can. Since QAIC does not know the contents of a
passthrough transaction, it relies on the UMD to satisfy the requirements.

The terminate transaction is of particular use to QAIC. QAIC is not aware of
the resources that are loaded onto a device since the majority of that activity
occurs within NNC commands. As a result, QAIC does not have the means to
roll back userspace activity. To ensure that a userspace client's resources
are fully released in the case of a process crash, or a bug, QAIC uses the
terminate command to let QSM know when a user has gone away, and the resources
can be released.

QSM can report a version number of the NNC protocol it supports. This is in the
form of a Major number and a Minor number.

Major number updates indicate changes to the NNC protocol which impact the
message format, or transactions (impacts QAIC).

Minor number updates indicate changes to the NNC protocol which impact the
commands (does not impact QAIC).

uAPI
====

QAIC creates an accel device per phsyical PCIe device. This accel device exists
for as long as the PCIe device is known to Linux.

The PCIe device may not be in the state to accept requests from userspace at
all times. QAIC will trigger KOBJ_ONLINE/OFFLINE uevents to advertise when the
device can accept requests (ONLINE) and when the device is no longer accepting
requests (OFFLINE) because of a reset or other state transition.

QAIC defines a number of driver specific IOCTLs as part of the userspace API.

DRM_IOCTL_QAIC_MANAGE
  This IOCTL allows userspace to send a NNC request to the QSM. The call will
  block until a response is received, or the request has timed out.

DRM_IOCTL_QAIC_CREATE_BO
  This IOCTL allows userspace to allocate a buffer object (BO) which can send
  or receive data from a workload. The call will return a GEM handle that
  represents the allocated buffer. The BO is not usable until it has been
  sliced (see DRM_IOCTL_QAIC_ATTACH_SLICE_BO).

DRM_IOCTL_QAIC_MMAP_BO
  This IOCTL allows userspace to prepare an allocated BO to be mmap'd into the
  userspace process.

DRM_IOCTL_QAIC_ATTACH_SLICE_BO
  This IOCTL allows userspace to slice a BO in preparation for sending the BO
  to the device. Slicing is the operation of describing what portions of a BO
  get sent where to a workload. This requires a set of DMA transfers for the
  DMA Bridge, and as such, locks the BO to a specific DBC.

DRM_IOCTL_QAIC_EXECUTE_BO
  This IOCTL allows userspace to submit a set of sliced BOs to the device. The
  call is non-blocking. Success only indicates that the BOs have been queued
  to the device, but does not guarantee they have been executed.

DRM_IOCTL_QAIC_PARTIAL_EXECUTE_BO
  This IOCTL operates like DRM_IOCTL_QAIC_EXECUTE_BO, but it allows userspace
  to shrink the BOs sent to the device for this specific call. If a BO
  typically has N inputs, but only a subset of those is available, this IOCTL
  allows userspace to indicate that only the first M bytes of the BO should be
  sent to the device to minimize data transfer overhead. This IOCTL dynamically
  recomputes the slicing, and therefore has some processing overhead before the
  BOs can be queued to the device.

DRM_IOCTL_QAIC_WAIT_BO
  This IOCTL allows userspace to determine when a particular BO has been
  processed by the device. The call will block until either the BO has been
  processed and can be re-queued to the device, or a timeout occurs.

DRM_IOCTL_QAIC_PERF_STATS_BO
  This IOCTL allows userspace to collect performance statistics on the most
  recent execution of a BO. This allows userspace to construct an end to end
  timeline of the BO processing for a performance analysis.

DRM_IOCTL_QAIC_PART_DEV
  This IOCTL allows userspace to request a duplicate "shadow device". This extra
  accelN device is associated with a specific partition of resources on the
  AIC100 device and can be used for limiting a process to some subset of
  resources.

DRM_IOCTL_QAIC_DETACH_SLICE_BO
  This IOCTL allows userspace to remove the slicing information from a BO that
  was originally provided by a call to DRM_IOCTL_QAIC_ATTACH_SLICE_BO. This
  is the inverse of DRM_IOCTL_QAIC_ATTACH_SLICE_BO. The BO must be idle for
  DRM_IOCTL_QAIC_DETACH_SLICE_BO to be called. After a successful detach slice
  operation the BO may have new slicing information attached with a new call
  to DRM_IOCTL_QAIC_ATTACH_SLICE_BO. After detach slice, the BO cannot be
  executed until after a new attach slice operation. Combining attach slice
  and detach slice calls allows userspace to use a BO with multiple workloads.

Userspace Client Isolation
==========================

AIC100 supports multiple clients. Multiple DBCs can be consumed by a single
client, and multiple clients can each consume one or more DBCs. Workloads
may contain sensitive information therefore only the client that owns the
workload should be allowed to interface with the DBC.

Clients are identified by the instance associated with their open(). A client
may only use memory they allocate, and DBCs that are assigned to their
workloads. Attempts to access resources assigned to other clients will be
rejected.

Module parameters
=================

QAIC supports the following module parameters:

**datapath_polling (bool)**

Configures QAIC to use a polling thread for datapath events instead of relying
on the device interrupts. Useful for platforms with broken multiMSI. Must be
set at QAIC driver initialization. Default is 0 (off).

**mhi_timeout_ms (unsigned int)**

Sets the timeout value for MHI operations in milliseconds (ms). Must be set
at the time the driver detects a device. Default is 2000 (2 seconds).

**control_resp_timeout_s (unsigned int)**

Sets the timeout value for QSM responses to NNC messages in seconds (s). Must
be set at the time the driver is sending a request to QSM. Default is 60 (one
minute).

**wait_exec_default_timeout_ms (unsigned int)**

Sets the default timeout for the wait_exec ioctl in milliseconds (ms). Must be
set prior to the waic_exec ioctl call. A value specified in the ioctl call
overrides this for that call. Default is 5000 (5 seconds).

**datapath_poll_interval_us (unsigned int)**

Sets the polling interval in microseconds (us) when datapath polling is active.
Takes effect at the next polling interval. Default is 100 (100 us).

**timesync_delay_ms (unsigned int)**

Sets the time interval in milliseconds (ms) between two consecutive timesync
operations. Default is 1000 (1000 ms).