

Based on kernel version 5.17. Page generated on 2022-03-28 08:42 EST.

.. SPDX-License-Identifier: GPL-2.0
.. _imc:

===================================
IMC (In-Memory Collection Counters)
===================================

Anju T Sudhakar, 10 May 2019

.. contents::
    :depth: 3


Basic overview
==============

IMC (In-Memory Collection Counters) is a hardware monitoring facility that
collects large numbers of hardware performance events at Nest level (these are
on-chip but off-core), Core level and Thread level.

The Nest PMU counters are handled by Nest IMC microcode that runs in the OCC
(On-Chip Controller) complex. The microcode collects the counter data and moves
it to memory.

The Core and Thread IMC PMU counters are handled in the core. Core-level PMU
counters provide the IMC counter data per core, and thread-level PMU counters
provide it per CPU thread.

OPAL obtains the IMC PMU and supported events information from the IMC Catalog
and passes it on to the kernel via the device tree. The information for each
event contains:

- Event name
- Event Offset
- Event description

and possibly also:

- Event scale
- Event unit

Some PMUs may have common scale and unit values for all their supported
events. In those cases, the scale and unit properties of the events must be
inherited from the PMU.
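
The inheritance rule above can be sketched as follows. This is a minimal
illustration in Python, not the kernel's implementation; the dictionary keys,
the PMU name and the sample values are all hypothetical.

.. code-block:: python

  def resolve_event(pmu, event):
      """Return event attributes, falling back to PMU-wide scale and unit."""
      return {
          "name": event["name"],
          "offset": event["offset"],
          # An event-level value wins; otherwise inherit from the PMU.
          "scale": event.get("scale", pmu.get("scale")),
          "unit": event.get("unit", pmu.get("unit")),
      }


  # Hypothetical PMU with a common scale and unit for all its events.
  pmu = {"name": "example_pmu", "scale": "1.2207e-4", "unit": "MiB"}
  inherited = resolve_event(pmu, {"name": "EV_A", "offset": 0x118})
  overridden = resolve_event(pmu, {"name": "EV_B", "offset": 0x120, "unit": "KiB"})

Here `inherited` takes both scale and unit from the PMU, while `overridden`
keeps its own unit and inherits only the scale.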

The event offset in the memory is where the counter data gets accumulated.

IMC catalog is available at:
	https://github.com/open-power/ima-catalog

The kernel discovers the IMC counters information in the device tree at the
`imc-counters` device node, which has the compatible field
`ibm,opal-in-memory-counters`. From the device tree, the kernel parses the PMUs
and their events' information and registers each PMU and its attributes in the
kernel.
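
Once a PMU is registered, its events are normally visible through the generic
perf sysfs interface. The sketch below reads one PMU directory; it assumes the
usual `events/<name>`, `events/<name>.scale` and `events/<name>.unit` file
layout used by perf PMUs, and the function names are hypothetical helpers, not
part of any kernel or perf API.

.. code-block:: python

  import os


  def _read(path):
      with open(path) as f:
          return f.read().strip()


  def read_pmu_events(pmu_dir):
      """Map event name -> {"config", "scale", "unit"} for one PMU directory."""
      events_dir = os.path.join(pmu_dir, "events")
      events = {}
      for entry in sorted(os.listdir(events_dir)):
          # "<event>.scale" and "<event>.unit" are attributes, not events.
          if entry.endswith((".scale", ".unit")):
              continue
          info = {"config": _read(os.path.join(events_dir, entry))}
          for attr in ("scale", "unit"):
              path = os.path.join(events_dir, entry + "." + attr)
              info[attr] = _read(path) if os.path.exists(path) else None
          events[entry] = info
      return events

On a system where the PMU is present, it could be called as
`read_pmu_events("/sys/bus/event_source/devices/core_imc")`; whether that
directory exists depends on the platform and firmware.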

IMC example usage
=================

.. code-block:: sh

  # perf list
  [...]
  nest_mcs01/PM_MCS01_64B_RD_DISP_PORT01/            [Kernel PMU event]
  nest_mcs01/PM_MCS01_64B_RD_DISP_PORT23/            [Kernel PMU event]
  [...]
  core_imc/CPM_0THRD_NON_IDLE_PCYC/                  [Kernel PMU event]
  core_imc/CPM_1THRD_NON_IDLE_INST/                  [Kernel PMU event]
  [...]
  thread_imc/CPM_0THRD_NON_IDLE_PCYC/                [Kernel PMU event]
  thread_imc/CPM_1THRD_NON_IDLE_INST/                [Kernel PMU event]
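
To group such a listing by PMU, the output can be post-processed. The following
is a small sketch that assumes the `pmu/event/ [Kernel PMU event]` line format
shown above; `group_events_by_pmu` is a hypothetical helper name.

.. code-block:: python

  import re
  from collections import defaultdict

  # Matches lines such as "core_imc/CPM_0THRD_NON_IDLE_PCYC/  [Kernel PMU event]".
  EVENT_RE = re.compile(
      r"^\s*(?P<pmu>[^/\s]+)/(?P<event>[^/\s]+)/\s+\[Kernel PMU event\]"
  )


  def group_events_by_pmu(perf_list_output):
      """Return {pmu name: [event names]} for kernel PMU events in the text."""
      groups = defaultdict(list)
      for line in perf_list_output.splitlines():
          match = EVENT_RE.match(line)
          if match:
              groups[match.group("pmu")].append(match.group("event"))
      return dict(groups)


  # A few lines taken from the listing above.
  sample = "\n".join([
      "nest_mcs01/PM_MCS01_64B_RD_DISP_PORT01/            [Kernel PMU event]",
      "[...]",
      "core_imc/CPM_0THRD_NON_IDLE_PCYC/                  [Kernel PMU event]",
      "core_imc/CPM_1THRD_NON_IDLE_INST/                  [Kernel PMU event]",
      "thread_imc/CPM_0THRD_NON_IDLE_PCYC/                [Kernel PMU event]",
  ])
  groups = group_events_by_pmu(sample)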

To see per chip data for nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/:

.. code-block:: sh

  # ./perf stat -e "nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/" -a --per-socket

To see non-idle instructions for core 0:

.. code-block:: sh

  # ./perf stat -e "core_imc/CPM_NON_IDLE_INST/" -C 0 -I 1000

To see non-idle cycles for a "make":

.. code-block:: sh

  # ./perf stat -e "thread_imc/CPM_NON_IDLE_PCYC/" make


IMC Trace-mode
===============

POWER9 supports two IMC modes: Accumulation mode and Trace mode. In
Accumulation mode, event counts are accumulated in system memory, and the
hypervisor reads the posted counts periodically or on request. In Trace mode,
the 64-bit trace SCOM value is initialized with the event information. The
CPMCxSEL and CPMC_LOAD fields in the trace SCOM specify the event to be
monitored and the sampling duration. On each overflow in the CPMCxSEL, hardware
snapshots the program counter along with the event counts and writes them to
the memory pointed to by LDBAR.

LDBAR is a 64-bit special-purpose per-thread register. It has bits to indicate
whether hardware is configured for accumulation or trace mode.

LDBAR Register Layout
---------------------

  +-------+----------------------+
  | 0     | Enable/Disable       |
  +-------+----------------------+
  | 1     | 0: Accumulation Mode |
  |       +----------------------+
  |       | 1: Trace Mode        |
  +-------+----------------------+
  | 2:3   | Reserved             |
  +-------+----------------------+
  | 4:6   | PB scope             |
  +-------+----------------------+
  | 7     | Reserved             |
  +-------+----------------------+
  | 8:50  | Counter Address      |
  +-------+----------------------+
  | 51:63 | Reserved             |
  +-------+----------------------+
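
The layout above can be decoded with a few shifts and masks. The sketch below
assumes IBM bit numbering, where bit 0 is the most significant bit of the
64-bit register; this document does not state the numbering explicitly, so
treat that as an assumption. The helper names and the example value are
hypothetical.

.. code-block:: python

  def ibm_field(value, start, end, width=64):
      """Extract bits start..end (inclusive), where bit 0 is the MSB."""
      nbits = end - start + 1
      shift = (width - 1) - end
      return (value >> shift) & ((1 << nbits) - 1)


  def decode_ldbar(ldbar):
      """Split a 64-bit LDBAR value into the fields of the table above."""
      return {
          "enable": ibm_field(ldbar, 0, 0),
          "trace_mode": ibm_field(ldbar, 1, 1),  # 0: accumulation, 1: trace
          "pb_scope": ibm_field(ldbar, 4, 6),
          "counter_address": ibm_field(ldbar, 8, 50),
      }


  # Hypothetical value: enabled, trace mode, arbitrary address field.
  example = (1 << 63) | (1 << 62) | (0x1234 << 13)
  fields = decode_ldbar(example)

Note that `counter_address` is the raw content of bits 8:50; how that field
maps to a physical address is not described here.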

TRACE_IMC_SCOM bit representation
---------------------------------

  +-------+------------+
  | 0:1   | SAMPSEL    |
  +-------+------------+
  | 2:33  | CPMC_LOAD  |
  +-------+------------+
  | 34:40 | CPMC1SEL   |
  +-------+------------+
  | 41:47 | CPMC2SEL   |
  +-------+------------+
  | 48:50 | BUFFERSIZE |
  +-------+------------+
  | 51:63 | RESERVED   |
  +-------+------------+
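
As an illustration of this layout, the following sketch packs and unpacks a
TRACE_IMC_SCOM value. It assumes bit 0 is the most significant bit of the
64-bit value; the function names and the field values used are hypothetical and
chosen only to exercise the layout.

.. code-block:: python

  # (name, first bit, last bit) with bit 0 as the most significant bit.
  TRACE_IMC_SCOM_FIELDS = [
      ("sampsel", 0, 1),
      ("cpmc_load", 2, 33),
      ("cpmc1sel", 34, 40),
      ("cpmc2sel", 41, 47),
      ("buffersize", 48, 50),
  ]


  def pack_trace_scom(**values):
      """Build the 64-bit value from named fields; missing fields are 0."""
      scom = 0
      for name, start, end in TRACE_IMC_SCOM_FIELDS:
          value = values.get(name, 0)
          nbits = end - start + 1
          if not 0 <= value < (1 << nbits):
              raise ValueError(f"{name} does not fit in {nbits} bits")
          scom |= value << (63 - end)
      return scom


  def unpack_trace_scom(scom):
      """Return a dict of field values extracted from the 64-bit value."""
      return {
          name: (scom >> (63 - end)) & ((1 << (end - start + 1)) - 1)
          for name, start, end in TRACE_IMC_SCOM_FIELDS
      }


  scom = pack_trace_scom(sampsel=1, cpmc_load=0x10, cpmc1sel=2,
                         cpmc2sel=3, buffersize=4)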

CPMC_LOAD contains the sampling duration. SAMPSEL and CPMCxSEL determine the
event to count. BUFFERSIZE indicates the memory range. On each overflow,
hardware snapshots the program counter along with the event counts, updates the
memory and reloads the CPMC_LOAD value for the next sampling duration. IMC
hardware does not support exceptions, so it quietly wraps around when the
memory buffer reaches its end.
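
The wrap-around behaviour can be pictured with a toy model. This is only a
sketch of a buffer that wraps silently; the record contents, the number of
records and the class name are hypothetical and do not reflect the actual
hardware record format.

.. code-block:: python

  class TraceBuffer:
      """Fixed-size buffer whose write position wraps silently at the end."""

      def __init__(self, num_records):
          self.records = [None] * num_records
          self.next_slot = 0

      def write(self, program_counter, count):
          # No overflow exception: the oldest record is simply overwritten.
          self.records[self.next_slot] = (program_counter, count)
          self.next_slot = (self.next_slot + 1) % len(self.records)


  buf = TraceBuffer(num_records=4)
  for i in range(6):
      buf.write(program_counter=0x1000 + 4 * i, count=i)

After six writes into four slots, the first two records have been overwritten
by the last two, which is why a consumer of such a buffer would need to read it
before it wraps if no samples are to be lost.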

*Currently the event monitored for trace-mode is fixed as cycle.*

Trace IMC example usage
=======================

.. code-block:: sh

  # perf list
  [....]
  trace_imc/trace_cycles/                            [Kernel PMU event]

To record an application/process with trace-imc event:

.. code-block:: sh

  # perf record -e trace_imc/trace_cycles/ yes > /dev/null
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.012 MB perf.data (21 samples) ]

The generated `perf.data` can be read using `perf report`.

Benefits of using IMC trace-mode
================================

PMI (Performance Monitoring Interrupt) handling is avoided, since IMC trace
mode snapshots the program counter and writes it to memory. This also provides
a way for the operating system to do instruction sampling in real time without
PMI processing overhead.

The following compares the PMI counts when `perf top` is run without and then
with the trace-imc event:

.. code-block:: sh

  # grep PMI /proc/interrupts
  PMI:          0          0          0          0   Performance monitoring interrupts
  # ./perf top
  ...
  # grep PMI /proc/interrupts
  PMI:      39735       8710      17338      17801   Performance monitoring interrupts
  # ./perf top -e trace_imc/trace_cycles/
  ...
  # grep PMI /proc/interrupts
  PMI:      39735       8710      17338      17801   Performance monitoring interrupts


That is, the PMI interrupt counts do not increment when using the `trace_imc` event.
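
The same comparison can be automated by parsing the `PMI:` line before and
after a run. The sketch below assumes the `/proc/interrupts` line format shown
above (per-CPU counts followed by a description); `pmi_counts` and `pmi_delta`
are hypothetical helper names.

.. code-block:: python

  def pmi_counts(interrupts_text):
      """Return the per-CPU PMI counts from /proc/interrupts-style text."""
      for line in interrupts_text.splitlines():
          fields = line.split()
          if fields and fields[0] == "PMI:":
              counts = []
              for field in fields[1:]:
                  if not field.isdigit():
                      break  # the description text starts here
                  counts.append(int(field))
              return counts
      raise ValueError("no PMI line found")


  def pmi_delta(before, after):
      """Per-CPU difference between two snapshots of the PMI counts."""
      return [b - a for a, b in zip(before, after)]


  # The PMI line seen both before and after the trace_imc run shown above.
  line = ("PMI:      39735       8710      17338      17801"
          "   Performance monitoring interrupts")
  before = pmi_counts(line)
  after = pmi_counts(line)
  delta = pmi_delta(before, after)

A delta of zero on every CPU corresponds to the observation above that no PMIs
were taken during the `trace_imc` run.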