Documentation / networking / tls-offload.rst

Based on kernel version 6.9. Page generated on 2024-05-14 10:02 EST.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539
.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

Kernel TLS offload

Kernel TLS operation

Linux kernel provides TLS connection offload infrastructure. Once a TCP
connection is in ``ESTABLISHED`` state user space can enable the TLS Upper
Layer Protocol (ULP) and install the cryptographic connection state.
For details regarding the user-facing interface refer to the TLS
documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.

``ktls`` can operate in three modes:

 * Software crypto mode (``TLS_SW``) - CPU handles the cryptography.
   In most basic cases only crypto operations synchronous with the CPU
   can be used, but depending on calling context CPU may utilize
   asynchronous crypto accelerators. The use of accelerators introduces extra
   latency on socket reads (decryption only starts when a read syscall
   is made) and additional I/O load on the system.
 * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
   on a packet by packet basis, provided the packets arrive in order.
   This mode integrates best with the kernel stack and is described in detail
   in the remaining part of this document
   (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).
 * Full TCP NIC offload mode (``TLS_HW_RECORD``) - mode of operation where
   NIC driver and firmware replace the kernel networking stack
   with its own TCP handling, it is not usable in production environments
   making use of the Linux networking stack for example any firewalling
   abilities or QoS and packet scheduling (``ethtool`` flag ``tls-hw-record``).

The operation mode is selected automatically based on device configuration,
offload opt-in or opt-out on per-connection basis is not currently supported.


At a high level user write requests are turned into a scatter list, the TLS ULP
intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
mode) and then hands the modified scatter list to the TCP layer. From this
point on the TCP stack proceeds as normal.

In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
Instead packets reach a device driver, the driver will mark the packets
for crypto offload based on the socket the packet is attached to,
and send them to the device for encryption and transmission.


On the receive side if the device handled decryption and authentication
successfully, the driver will set the decrypted bit in the associated
:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
are handled normally. ``ktls`` is informed when data is queued to the socket
and the ``strparser`` mechanism is used to delineate the records. Upon read
request, records are retrieved from the socket and passed to decryption routine.
If device decrypted all the segments of the record the decryption is skipped,
otherwise software path handles decryption.

.. kernel-figure::  tls-offload-layers.svg
   :alt:	TLS offload layers
   :align:	center
   :figwidth:	28em

   Layers of Kernel TLS stack

Device configuration

During driver initialization device sets the ``NETIF_F_HW_TLS_RX`` and
``NETIF_F_HW_TLS_TX`` features and installs its
:c:type:`struct tlsdev_ops <tlsdev_ops>`
pointer in the :c:member:`tlsdev_ops` member of the
:c:type:`struct net_device <net_device>`.

When TLS cryptographic connection state is installed on a ``ktls`` socket
(note that it is done twice, once for RX and once for TX direction,
and the two are completely independent), the kernel checks if the underlying
network device is offload-capable and attempts the offload. In case offload
fails the connection is handled entirely in software using the same mechanism
as if the offload was never tried.

Offload request is performed via the :c:member:`tls_dev_add` callback of
:c:type:`struct tlsdev_ops <tlsdev_ops>`:

.. code-block:: c

	int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
			   enum tls_offload_ctx_dir direction,
			   struct tls_crypto_info *crypto_info,
			   u32 start_offload_tcp_sn);

``direction`` indicates whether the cryptographic information is for
the received or transmitted packets. Driver uses the ``sk`` parameter
to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
Cryptographic information in ``crypto_info`` includes the key, iv, salt
as well as TLS record sequence number. ``start_offload_tcp_sn`` indicates
which TCP sequence number corresponds to the beginning of the record with
sequence number from ``crypto_info``. The driver can add its state
at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer


After TX state is installed, the stack guarantees that the first segment
of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
number, simplifying TCP sequence number matching.

TX offload being fully initialized does not imply that all segments passing
through the driver and which belong to the offloaded socket will be after
the expected sequence number and will have kernel record information.
In particular, already encrypted data may have been queued to the socket
before installing the connection state in the kernel.


In RX direction local networking stack has little control over the segmentation,
so the initial records' TCP sequence number may be anywhere inside the segment.

Normal operation

At the minimum the device maintains the following state for each connection, in
each direction:

 * crypto secrets (key, iv, salt)
 * crypto processing state (partial blocks, partial authentication tag, etc.)
 * record metadata (sequence number, processing offset and length)
 * expected TCP sequence number

There are no guarantees on record length or record segmentation. In particular
segments may start at any point of a record and contain any number of records.
Assuming segments are received in order, the device should be able to perform
crypto operations and authentication regardless of segmentation. For this
to be possible device has to keep small amount of segment-to-segment state.
This includes at least:

 * partial headers (if a segment carried only a part of the TLS header)
 * partial data block
 * partial authentication tag (all data had been seen but part of the
   authentication tag has to be written or read from the subsequent segment)

Record reassembly is not necessary for TLS offload. If the packets arrive
in order the device should be able to handle them separately and make
forward progress.


The kernel stack performs record framing reserving space for the authentication
tag and populating all other TLS header and tailer fields.

Both the device and the driver maintain expected TCP sequence numbers
due to the possibility of retransmissions and the lack of software fallback
once the packet reaches the device.
For segments passed in order, the driver marks the packets with
a connection identifier (note that a 5-tuple lookup is insufficient to identify
packets requiring HW offload, see the :ref:`5tuple_problems` section)
and hands them to the device. The device identifies the packet as requiring
TLS handling and confirms the sequence number matches its expectation.
The device performs encryption and authentication of the record data.
It replaces the authentication tag and TCP checksum with correct values.


Before a packet is DMAed to the host (but after NIC's embedded switching
and packet transformation functions) the device validates the Layer 4
checksum and performs a 5-tuple lookup to find any TLS connection the packet
may belong to (technically a 4-tuple
lookup is sufficient - IP addresses and TCP port numbers, as the protocol
is always TCP). If connection is matched device confirms if the TCP sequence
number is the expected one and proceeds to TLS handling (record delineation,
decryption, authentication for each record in the packet). The device leaves
the record framing unmodified, the stack takes care of record decapsulation.
Device indicates successful handling of TLS offload in the per-packet context
(descriptor) passed to the host.

Upon reception of a TLS offloaded packet, the driver sets
the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
corresponding to the segment. Networking stack makes sure decrypted
and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
and takes care of partial decryption.

Resync handling

In presence of packet drops or network packet reordering, the device may lose
synchronization with the TLS stream, and require a resync with the kernel's
TCP stack.

Note that resync is only attempted for connections which were successfully
added to the device table and are in TLS_HW mode. For example,
if the table was full when cryptographic state was installed in the kernel,
such connection will never get offloaded. Therefore the resync request
does not carry any cryptographic connection state.


Segments transmitted from an offloaded socket can get out of sync
in similar ways to the receive side-retransmissions - local drops
are possible, though network reorders are not. There are currently
two mechanisms for dealing with out of order segments.

Crypto state rebuilding

Whenever an out of order segment is transmitted the driver provides
the device with enough information to perform cryptographic operations.
This means most likely that the part of the record preceding the current
segment has to be passed to the device as part of the packet context,
together with its TCP sequence number and TLS record number. The device
can then initialize its crypto state, process and discard the preceding
data (to be able to insert the authentication tag) and move onto handling
the actual packet.

In this mode depending on the implementation the driver can either ask
for a continuation with the crypto state and the new sequence number
(next expected segment is the one after the out of order one), or continue
with the previous stream state - assuming that the out of order segment
was just a retransmission. The former is simpler, and does not require
retransmission detection therefore it is the recommended method until
such time it is proven inefficient.

Next record sync

Whenever an out of order segment is detected the driver requests
that the ``ktls`` software fallback code encrypt it. If the segment's
sequence number is lower than expected the driver assumes retransmission
and doesn't change device state. If the segment is in the future, it
may imply a local drop, the driver asks the stack to sync the device
to the next record state and falls back to software.

Resync request is indicated with:

.. code-block:: c

  void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq)

Until resync is complete driver should not access its expected TCP
sequence number (as it will be updated from a different context).
Following helper should be used to test if resync is complete:

.. code-block:: c

  bool tls_offload_tx_resync_pending(struct sock *sk)

Next time ``ktls`` pushes a record it will first send its TCP sequence number
and TLS record number to the driver. Stack will also make sure that
the new record will start on a segment boundary (like it does when
the connection is initially added).


A small amount of RX reorder events may not require a full resynchronization.
In particular the device should not lose synchronization
when record boundary can be recovered:

.. kernel-figure::  tls-offload-reorder-good.svg
   :alt:	reorder of non-header segment
   :align:	center

   Reorder of non-header segment

Green segments are successfully decrypted, blue ones are passed
as received on wire, red stripes mark start of new records.

In above case segment 1 is received and decrypted successfully.
Segment 2 was dropped so 3 arrives out of order. The device knows
the next record starts inside 3, based on record length in segment 1.
Segment 3 is passed untouched, because due to lack of data from segment 2
the remainder of the previous record inside segment 3 cannot be handled.
The device can, however, collect the authentication algorithm's state
and partial block from the new record in segment 3 and when 4 and 5
arrive continue decryption. Finally when 2 arrives it's completely outside
of expected window of the device so it's passed as is without special
handling. ``ktls`` software fallback handles the decryption of record
spanning segments 1, 2 and 3. The device did not get out of sync,
even though two segments did not get decrypted.

Kernel synchronization may be necessary if the lost segment contained
a record header and arrived after the next record header has already passed:

.. kernel-figure::  tls-offload-reorder-bad.svg
   :alt:	reorder of header segment
   :align:	center

   Reorder of segment with a TLS header

In this example segment 2 gets dropped, and it contains a record header.
Device can only detect that segment 4 also contains a TLS header
if it knows the length of the previous record from segment 2. In this case
the device will lose synchronization with the stream.

Stream scan resynchronization

When the device gets out of sync and the stream reaches TCP sequence
numbers more than a max size record past the expected TCP sequence number,
the device starts scanning for a known header pattern. For example
for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur
in the SSL/TLS version field of the header. Once pattern is matched
the device continues attempting parsing headers at expected locations
(based on the length fields at guessed locations).
Whenever the expected location does not contain a valid header the scan
is restarted.

When the header is matched the device sends a confirmation request
to the kernel, asking if the guessed location is correct (if a TLS record
really starts there), and which record sequence number the given header had.
The kernel confirms the guessed location was correct and tells the device
the record sequence number. Meanwhile, the device had been parsing
and counting all records since the just-confirmed one, it adds the number
of records it had seen to the record number provided by the kernel.
At this point the device is in sync and can resume decryption at next
segment boundary.

In a pathological case the device may latch onto a sequence of matching
headers and never hear back from the kernel (there is no negative
confirmation from the kernel). The implementation may choose to periodically
restart scan. Given how unlikely falsely-matching stream is, however,
periodic restart is not deemed necessary.

Special care has to be taken if the confirmation request is passed
asynchronously to the packet stream and record may get processed
by the kernel before the confirmation request.

Stack-driven resynchronization

The driver may also request the stack to perform resynchronization
whenever it sees the records are no longer getting decrypted.
If the connection is configured in this mode the stack automatically
schedules resynchronization after it has received two completely encrypted

The stack waits for the socket to drain and informs the device about
the next expected record number and its TCP sequence number. If the
records continue to be received fully encrypted stack retries the
synchronization with an exponential back off (first after 2 encrypted
records, then after 4 records, after 8, after 16... up until every
128 records).

Error handling


Packets may be redirected or rerouted by the stack to a different
device than the selected TLS offload device. The stack will handle
such condition using the :c:func:`sk_validate_xmit_skb` helper
(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
Offload maintains information about all records until the data is
fully acknowledged, so if skbs reach the wrong device they can be handled
by software fallback.

Any device TLS offload handling error on the transmission side must result
in the packet being dropped. For example if a packet got out of order
due to a bug in the stack or the device, reached the device and can't
be encrypted such packet must be dropped.


If the device encounters any problems with TLS offload on the receive
side it should pass the packet to the host's networking stack as it was
received on the wire.

For example authentication failure for any record in the segment should
result in passing the unmodified packet to the software fallback. This means
packets should not be modified "in place". Splitting segments to handle partial
decryption is not advised. In other words either all records in the packet
had been handled successfully and authenticated or the packet has to be passed
to the host's stack as it was on the wire (recovering original packet in the
driver if device provides precise error is sufficient).

The Linux networking stack does not provide a way of reporting per-packet
decryption and authentication errors, packets with errors must simply not
have the :c:member:`decrypted` mark set.

A packet should also not be handled by the TLS offload if it contains
incorrect checksums.

Performance metrics

TLS offload can be characterized by the following basic metrics:

 * max connection count
 * connection installation rate
 * connection installation latency
 * total cryptographic performance

Note that each TCP connection requires a TLS session in both directions,
the performance may be reported treating each direction separately.

Max connection count

The number of connections device can support can be exposed via
``devlink resource`` API.

Total cryptographic performance

Offload performance may depend on segment and record size.

Overload of the cryptographic subsystem of the device should not have
significant performance impact on non-offloaded streams.


Following minimum set of TLS-related statistics should be reported
by the driver:

 * ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets
   which were part of a TLS stream.
 * ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
   which were successfully decrypted.
 * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
 * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
   (connection has finished).
 * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
 * ``rx_tls_resync_req_start`` - number of times the TLS async resync request
    was started.
 * ``rx_tls_resync_req_end`` - number of times the TLS async resync request
    properly ended with providing the HW tracked tcp-seq.
 * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
    procedure was started by not properly ended.
 * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
    the driver was successfully handled.
 * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
    the driver was terminated unsuccessfully.
 * ``rx_tls_err`` - number of RX packets which were part of a TLS stream
   but were not decrypted due to unexpected error in the state machine.
 * ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
   for encryption of their TLS payload.
 * ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
   passed to the device for encryption.
 * ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for
 * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
   but did not arrive in the expected order.
 * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of
   a TLS stream and arrived out-of-order, but skipped the HW offload routine
   and went to the regular transmit flow as they were retransmissions of the
   connection handshake.
 * ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
   a TLS stream dropped, because they arrived out of order and associated
   record could not be found.
 * ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS
   stream dropped, because they contain both data that has been encrypted by
   software and data that expects hardware crypto offload.

Notable corner cases, exceptions and additional requirements

.. _5tuple_problems:

5-tuple matching limitations

The device can only recognize received packets based on the 5-tuple
of the socket. Current ``ktls`` implementation will not offload sockets
routed through software interfaces such as those used for tunneling
or virtual networking. However, many packet transformations performed
by the networking stack (most notably any BPF logic) do not require
any intermediate software device, therefore a 5-tuple match may
consistently miss at the device level. In such cases the device
should still be able to perform TX offload (encryption) and should
fallback cleanly to software decryption (RX).

Out of order

Introducing extra processing in NICs should not cause packets to be
transmitted or received out of order, for example pure ACK packets
should not be reordered with respect to data segments.

Ingress reorder

A device is permitted to perform packet reordering for consecutive
TCP segments (i.e. placing packets in the correct order) but any form
of additional buffering is disallowed.

Coexistence with standard networking offload features

Offloaded ``ktls`` sockets should support standard TCP stack features
transparently. Enabling device TLS offload should not cause any difference
in packets as seen on the wire.

Transport layer transparency

The device should not modify any packet headers for the purpose
of the simplifying TLS offload.

The device should not depend on any packet headers beyond what is strictly
necessary for TLS offload.

Segment drops

Dropping packets is acceptable only in the event of catastrophic
system errors and should never be used as an error handling mechanism
in cases arising from normal operation. In other words, reliance
on TCP retransmissions to handle corner cases is not acceptable.

TLS device features

Drivers should ignore the changes to the TLS device feature flags.
These flags will be acted upon accordingly by the core ``ktls`` code.
TLS device feature flags only control adding of new TLS connection
offloads, old connections will remain active after flags are cleared.

TLS encryption cannot be offloaded to devices without checksum calculation
offload. Hence, TLS TX device feature flag requires TX csum offload being set.
Disabling the latter implies clearing the former. Disabling TX checksum offload
should not affect old connections, and drivers should make sure checksum
calculation does not break for them.
Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user
does not want to enable RX csum offload, TLS RX device feature is disabled
as well.