.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

====================================
Marvell OcteonTx2 RVU Kernel Drivers
====================================

Copyright (c) 2020 Marvell International Ltd.

Contents
========

- `Overview`_
- `Drivers`_
- `Basic packet flow`_
- `Devlink health reporters`_
- `Quality of service`_

Overview
========

The resource virtualization unit (RVU) on Marvell's OcteonTX2 SoC maps HW
resources from the network, crypto and other functional blocks into
PCI-compatible physical and virtual functions. Each functional block in
turn has multiple local functions (LFs) for provisioning to PCI devices.
RVU supports multiple PCIe SRIOV physical functions (PFs) and virtual
functions (VFs). PF0 is called the administrative / admin function (AF)
and has the privilege to provision LFs of the RVU functional blocks to
each of the PFs/VFs.

RVU managed networking functional blocks
 - Network pool or buffer allocator (NPA)
 - Network interface controller (NIX)
 - Network parser CAM (NPC)
 - Schedule/Synchronize/Order unit (SSO)
 - Loopback interface (LBK)

RVU managed non-networking functional blocks
 - Crypto accelerator (CPT)
 - Scheduled timers unit (TIM)
 - Schedule/Synchronize/Order unit (SSO)
   Used for both networking and non-networking use cases

Resource provisioning examples
 - A PF/VF with NIX-LF & NPA-LF resources works as a pure network device.
 - A PF/VF with CPT-LF resource works as a pure crypto offload device.

RVU functional blocks are highly configurable as per software requirements.

Firmware sets up the following before the kernel boots
 - Enables the required number of RVU PFs based on the number of physical links.
 - The number of VFs per PF is either static or configurable at compile time.
   Based on the config, firmware assigns VFs to each of the PFs.
 - Also assigns MSI-X vectors to each PF and VF.
 - These are not changed after kernel boot.

Drivers
=======

The Linux kernel has multiple drivers registering to different PFs and VFs
of RVU. With respect to networking there are three flavours of drivers.

Admin Function driver
---------------------

As mentioned above, RVU PF0 is called the admin function (AF). This driver
supports resource provisioning and configuration of functional blocks and
doesn't handle any I/O. It sets up a few basic things, but most of the
functionality is achieved via configuration requests from PFs and VFs.

PFs/VFs communicate with the AF via a shared memory region (mailbox). Upon
receiving a request, the AF does resource provisioning and other HW configuration.
The AF is always attached to the host kernel, but PFs and their VFs may be used by
the host kernel itself, or attached to VMs or to userspace applications like
DPDK etc. So the AF has to handle provisioning/configuration requests sent
by any device from any domain.

The AF driver also interacts with the underlying firmware to
 - Manage physical ethernet links i.e. CGX LMACs.
 - Retrieve information like speed, duplex, autoneg etc.
 - Retrieve PHY EEPROM and stats.
 - Configure FEC, PAM modes
 - etc

From a pure networking perspective the AF driver supports the following functionality.
 - Map a physical link to a RVU PF to which a netdev is registered.
 - Attach NIX and NPA block LFs to RVU PF/VF which provide buffer pools, RQs, SQs
   for regular networking functionality.
 - Flow control (pause frames) enable/disable/config.
 - HW PTP timestamping related config.
 - NPC parser profile config, basically how to parse pkt and what info to extract.
 - NPC extract profile config, what to extract from the pkt to match data in MCAM entries.
 - Manage NPC MCAM entries, upon request can frame and install requested packet forwarding rules.
 - Defines receive side scaling (RSS) algorithms.
 - Defines segmentation offload algorithms (e.g. TSO)
 - VLAN stripping, capture and insertion config.
 - SSO and TIM blocks config which provide packet scheduling support.
 - Debugfs support, to check current resource provisioning, current status of
   NPA pools, NIX RQs, SQs and CQs, various stats etc, which helps in debugging
   issues (see the example after this list).
 - And many more.
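
As a minimal sketch of the debugfs support mentioned above: assuming the AF driver
creates its debugfs directory as ``octeontx2`` under ``/sys/kernel/debug`` (the
directory name and file layout here are assumptions and depend on the silicon and
kernel version), current resource provisioning could be inspected with::

	~# ls /sys/kernel/debug/octeontx2/            # directory name is an assumption
	~# cat /sys/kernel/debug/octeontx2/rsrc_alloc # per PF/VF LF allocation; file name is an assumption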

Physical Function driver
------------------------

This RVU PF handles I/O, is mapped to a physical ethernet link, and this
driver registers a netdev. It supports SR-IOV. As said above, this driver
communicates with the AF via a mailbox. To retrieve information about physical
links this driver talks to the AF; the AF gets that info from firmware and
responds back, i.e. the PF driver cannot talk to firmware directly.
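
VFs on a PF are enabled via the standard SR-IOV sysfs interface; a minimal sketch,
assuming the PF netdev is named ``eth0`` (the interface name and VF count are
illustrative only)::

	~# cat /sys/class/net/eth0/device/sriov_totalvfs    # max VFs provisioned by firmware
	~# echo 4 > /sys/class/net/eth0/device/sriov_numvfs # enable 4 VFs on this PF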

Supports ethtool for configuring links, RSS, queue count, queue size,
flow control, ntuple filters, dumping the PHY EEPROM, configuring FEC etc;
a few illustrative invocations are shown below.
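
These examples assume the PF netdev is named ``eth0``; the interface name, queue
counts and FEC mode are illustrative only, and the exact set of supported options
depends on the silicon and kernel version::

	~# ethtool eth0                        # link speed, duplex, autoneg
	~# ethtool -L eth0 rx 8 tx 8           # queue count
	~# ethtool -G eth0 rx 4096 tx 4096     # queue (ring) size
	~# ethtool -X eth0 equal 8             # spread RSS over 8 receive queues
	~# ethtool -A eth0 rx on tx on         # flow control (pause frames)
	~# ethtool -m eth0                     # dump PHY/module EEPROM
	~# ethtool --set-fec eth0 encoding rs  # configure FEC mode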

Virtual Function driver
-----------------------

There are two types of VFs: VFs that share the physical link with their parent
SR-IOV PF, and VFs which work in pairs using internal HW loopback channels (LBK).

Type1:
 - These VFs and their parent PF share a physical link and are used for external communication.
 - VFs cannot communicate with the AF directly; they send a mbox message to the PF and the PF
   forwards that to the AF. The AF, after processing, responds back to the PF and the PF forwards
   the reply to the VF.
 - From a functionality point of view there is no difference between PF and VF, as the same type
   of HW resources are attached to both. But some things can only be configured from the PF, as
   the PF is treated as the owner/admin of the link.

Type2:
 - RVU PF0, i.e. the admin function, creates these VFs and maps them to the loopback block's channels.
 - A set of two VFs (VF0 & VF1, VF2 & VF3 and so on) works as a pair, i.e. packets sent out of
   VF0 will be received by VF1 and vice versa.
 - These VFs can be used by applications or virtual machines to communicate with each other
   without sending traffic outside. There is no switch present in HW, hence the support
   for loopback VFs.
 - These communicate directly with the AF (PF0) via mbox.

Except for the I/O channels or links used for packet reception and transmission, there is
no other difference between these VF types. The AF driver takes care of the I/O channel
mapping, hence the same VF driver works for both types of devices. An illustrative loopback
setup is shown below.
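
As a minimal sketch of using a loopback VF pair, assuming the two VFs of a pair
appear as netdevs named ``lbk0`` and ``lbk1`` (the interface names and addresses
are illustrative assumptions), two network namespaces can talk to each other
without any traffic leaving the chip::

	~# ip netns add ns0
	~# ip netns add ns1
	~# ip link set lbk0 netns ns0
	~# ip link set lbk1 netns ns1
	~# ip -n ns0 addr add 192.168.10.1/24 dev lbk0
	~# ip -n ns1 addr add 192.168.10.2/24 dev lbk1
	~# ip -n ns0 link set lbk0 up
	~# ip -n ns1 link set lbk1 up
	~# ip netns exec ns0 ping -c 3 192.168.10.2   # VF0 -> VF1 over the LBK channel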

Basic packet flow
=================

Ingress
-------

1. CGX LMAC receives packet.
2. Forwards the packet to the NIX block.
3. The packet is then submitted to the NPC block for parsing, followed by an MCAM lookup to get the destination RVU device.
4. The NIX LF attached to the destination RVU device allocates a buffer from the RQ mapped buffer pool of the NPA block LF.
5. The RQ may be selected by RSS or by configuring an MCAM rule with a RQ number (for example via an ntuple filter; see the sketch after this list).
6. Packet is DMA'ed and driver is notified.
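
As a sketch of step 5, an RQ can be selected explicitly by installing an ntuple
filter with ethtool, which the driver requests the AF to install as an NPC MCAM
rule; the interface name, match fields and queue number are illustrative only::

	~# ethtool -N eth0 flow-type tcp4 dst-ip 10.0.0.2 dst-port 5201 action 3
	~# ethtool -n eth0                            # list installed ntuple rules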

Egress
------

1. Driver prepares a send descriptor and submits to SQ for transmission.
2. The SQ is already configured (by AF) to transmit on a specific link/channel.
3. The SQ descriptor ring is maintained in buffers allocated from SQ mapped pool of NPA block LF.
4. NIX block transmits the pkt on the designated channel.
5. NPC MCAM entries can be installed to divert pkt onto a different channel.

Devlink health reporters
========================

NPA Reporters
-------------
The NPA reporters are responsible for reporting and recovering the following groups of errors:

1. GENERAL events

   - Error due to operation of unmapped PF.
   - Error due to disabled alloc/free for other HW blocks (NIX, SSO, TIM, DPI and AURA).

2. ERROR events

   - Fault due to NPA_AQ_INST_S read or NPA_AQ_RES_S write.
   - AQ Doorbell Error.

3. RAS events

   - RAS Error Reporting for NPA_AQ_INST_S/NPA_AQ_RES_S.

4. RVU events

   - Error due to unmapped slot.

Sample Output::

	~# devlink health
	pci/0002:01:00.0:
	  reporter hw_npa_intr
	      state healthy error 2872 recover 2872 last_dump_date 2020-12-10 last_dump_time 09:39:09 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_gen
	      state healthy error 2872 recover 2872 last_dump_date 2020-12-11 last_dump_time 04:43:04 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_err
	      state healthy error 2871 recover 2871 last_dump_date 2020-12-10 last_dump_time 09:39:17 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_ras
	      state healthy error 0 recover 0 last_dump_date 2020-12-10 last_dump_time 09:32:40 grace_period 0 auto_recover true auto_dump true

Each reporter dumps the

 - Error Type
 - Error Register value
 - Reason in words

For example::

	~# devlink health dump show  pci/0002:01:00.0 reporter hw_npa_gen
	 NPA_AF_GENERAL:
	         NPA General Interrupt Reg : 1
	         NIX0: free disabled RX
	~# devlink health dump show  pci/0002:01:00.0 reporter hw_npa_intr
	 NPA_AF_RVU:
	         NPA RVU Interrupt Reg : 1
	         Unmap Slot Error
	~# devlink health dump show  pci/0002:01:00.0 reporter hw_npa_err
	 NPA_AF_ERR:
	        NPA Error Interrupt Reg : 4096
	        AQ Doorbell Error


NIX Reporters
-------------
The NIX reporters are responsible for reporting and recovering the following groups of errors:

1. GENERAL events

   - Receive mirror/multicast packet drop due to insufficient buffer.
   - SMQ Flush operation.

2. ERROR events

   - Memory Fault due to WQE read/write from multicast/mirror buffer.
   - Receive multicast/mirror replication list error.
   - Receive packet on an unmapped PF.
   - Fault due to NIX_AQ_INST_S read or NIX_AQ_RES_S write.
   - AQ Doorbell Error.

3. RAS events

   - RAS Error Reporting for NIX Receive Multicast/Mirror Entry Structure.
   - RAS Error Reporting for WQE/Packet Data read from Multicast/Mirror Buffer.
   - RAS Error Reporting for NIX_AQ_INST_S/NIX_AQ_RES_S.

4. RVU events

   - Error due to unmapped slot.

Sample Output::

	~# ./devlink health
	pci/0002:01:00.0:
	  reporter hw_npa_intr
	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_gen
	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_err
	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
	  reporter hw_npa_ras
	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
	  reporter hw_nix_intr
	    state healthy error 1121 recover 1121 last_dump_date 2021-01-19 last_dump_time 05:42:26 grace_period 0 auto_recover true auto_dump true
	  reporter hw_nix_gen
	    state healthy error 949 recover 949 last_dump_date 2021-01-19 last_dump_time 05:42:43 grace_period 0 auto_recover true auto_dump true
	  reporter hw_nix_err
	    state healthy error 1147 recover 1147 last_dump_date 2021-01-19 last_dump_time 05:42:59 grace_period 0 auto_recover true auto_dump true
	  reporter hw_nix_ras
	    state healthy error 409 recover 409 last_dump_date 2021-01-19 last_dump_time 05:43:16 grace_period 0 auto_recover true auto_dump true

Each reporter dumps the

 - Error Type
 - Error Register value
 - Reason in words

For example::

	~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_intr
	 NIX_AF_RVU:
	        NIX RVU Interrupt Reg : 1
	        Unmap Slot Error
	~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_gen
	 NIX_AF_GENERAL:
	        NIX General Interrupt Reg : 1
	        Rx multicast pkt drop
	~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_err
	 NIX_AF_ERR:
	        NIX Error Interrupt Reg : 64
	        Rx on unmapped PF_FUNC


Quality of service
==================


Hardware algorithms used in scheduling
--------------------------------------

The OcteonTx2 silicon and CN10K transmit interface consists of five transmit levels
starting from SMQ/MDQ, TL4 to TL1. Each packet will traverse the MDQ, TL4 to TL1
levels. Each level contains an array of queues to support scheduling and shaping.
The hardware uses the algorithms below depending on the priority of the scheduler queues.
Once the user creates tc classes with different priorities, the driver configures the
schedulers allocated to the class with the specified priority along with the rate-limiting
configuration.

1. Strict Priority

      -  Once packets are submitted to the MDQ, hardware picks all active MDQs having different
         priorities using strict priority.

2. Round Robin

      - Active MDQs having the same priority level are chosen using round robin.


Setup HTB offload
-----------------

1. Enable HW TC offload on the interface::

        # ethtool -K <interface> hw-tc-offload on

2. Create htb root::

        # tc qdisc add dev <interface> clsact
        # tc qdisc replace dev <interface> root handle 1: htb offload

3. Create tc classes with different priorities::

        # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1

        # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7

4. Create tc classes with the same priority and different quantum values::

        # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 2 quantum 409600

        # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 2 quantum 188416

        # tc class add dev <interface> parent 1: classid 1:3 htb rate 10Gbit prio 2 quantum 32768
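
5. Classify traffic into the created classes. One common approach, shown here only as an
   illustrative sketch (the interface name, the flower match and the class ID are
   assumptions), is a clsact egress filter that sets the skb priority to the class ID
   using flower and skbedit::

        # tc filter add dev <interface> egress protocol ip flower ip_proto tcp dst_port 80 \
                action skbedit priority 1:1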