Based on kernel version 6.11
. Page generated on 2024-09-24 08:21 EST
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 | ========================= Capacity Aware Scheduling ========================= 1. CPU Capacity =============== 1.1 Introduction ---------------- Conventional, homogeneous SMP platforms are composed of purely identical CPUs. Heterogeneous platforms on the other hand are composed of CPUs with different performance characteristics - on such platforms, not all CPUs can be considered equal. CPU capacity is a measure of the performance a CPU can reach, normalized against the most performant CPU in the system. Heterogeneous systems are also called asymmetric CPU capacity systems, as they contain CPUs of different capacities. Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems from two factors: - not all CPUs may have the same microarchitecture (µarch). - with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be physically able to attain the higher Operating Performance Points (OPP). Arm big.LITTLE systems are an example of both. The big CPUs are more performance-oriented than the LITTLE ones (more pipeline stages, bigger caches, smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones can. CPU performance is usually expressed in Millions of Instructions Per Second (MIPS), which can also be expressed as a given amount of instructions attainable per Hz, leading to:: capacity(cpu) = work_per_hz(cpu) * max_freq(cpu) 1.2 Scheduler terms ------------------- Two different capacity values are used within the scheduler. A CPU's ``original capacity`` is its maximum attainable capacity, i.e. its maximum attainable performance level. This original capacity is returned by the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original capacity`` to which some loss of available performance (e.g. time spent handling IRQs) is subtracted. Note that a CPU's ``capacity`` is solely intended to be used by the CFS class, while ``original capacity`` is class-agnostic. The rest of this document will use the term ``capacity`` interchangeably with ``original capacity`` for the sake of brevity. 1.3 Platform examples --------------------- 1.3.1 Identical OPPs ~~~~~~~~~~~~~~~~~~~~ Consider an hypothetical dual-core asymmetric CPU capacity system where - work_per_hz(CPU0) = W - work_per_hz(CPU1) = W/2 - all CPUs are running at the same fixed frequency By the above definition of capacity: - capacity(CPU0) = C - capacity(CPU1) = C/2 To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would be a LITTLE. With a workload that periodically does a fixed amount of work, you will get an execution trace like so:: CPU0 work ^ | ____ ____ ____ | | | | | | | +----+----+----+----+----+----+----+----+----+----+-> time CPU1 work ^ | _________ _________ ____ | | | | | | +----+----+----+----+----+----+----+----+----+----+-> time CPU0 has the highest capacity in the system (C), and completes a fixed amount of work W in T units of time. On the other hand, CPU1 has half the capacity of CPU0, and thus only completes W/2 in T. 1.3.2 Different max OPPs ~~~~~~~~~~~~~~~~~~~~~~~~ Usually, CPUs of different capacity values also have different maximum OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with: - max_freq(CPU0) = F - max_freq(CPU1) = 2/3 * F This yields: - capacity(CPU0) = C - capacity(CPU1) = C/3 Executing the same workload as described in 1.3.1, which each CPU running at its maximum frequency results in:: CPU0 work ^ | ____ ____ ____ | | | | | | | +----+----+----+----+----+----+----+----+----+----+-> time workload on CPU1 CPU1 work ^ | ______________ ______________ ____ | | | | | | +----+----+----+----+----+----+----+----+----+----+-> time 1.4 Representation caveat ------------------------- It should be noted that having a *single* value to represent differences in CPU performance is somewhat of a contentious point. The relative performance difference between two different µarchs could be X% on integer operations, Y% on floating point operations, Z% on branches, and so on. Still, results using this simple approach have been satisfactory for now. 2. Task utilization =================== 2.1 Introduction ---------------- Capacity aware scheduling requires an expression of a task's requirements with regards to CPU capacity. Each scheduler class can express this differently, and while task utilization is specific to CFS, it is convenient to describe it here in order to introduce more generic concepts. Task utilization is a percentage meant to represent the throughput requirements of a task. A simple approximation of it is the task's duty cycle, i.e.:: task_util(p) = duty_cycle(p) On an SMP system with fixed frequencies, 100% utilization suggests the task is a busy loop. Conversely, 10% utilization hints it is a small periodic task that spends more time sleeping than executing. Variable CPU frequencies and asymmetric CPU capacities complexify this somewhat; the following sections will expand on these. 2.2 Frequency invariance ------------------------ One issue that needs to be taken into account is that a workload's duty cycle is directly impacted by the current OPP the CPU is running at. Consider running a periodic workload at a given frequency F:: CPU work ^ | ____ ____ ____ | | | | | | | +----+----+----+----+----+----+----+----+----+----+-> time This yields duty_cycle(p) == 25%. Now, consider running the *same* workload at frequency F/2:: CPU work ^ | _________ _________ ____ | | | | | | +----+----+----+----+----+----+----+----+----+----+-> time This yields duty_cycle(p) == 50%, despite the task having the exact same behaviour (i.e. executing the same amount of work) in both executions. The task utilization signal can be made frequency invariant using the following formula:: task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu)) Applying this formula to the two examples above yields a frequency invariant task utilization of 25%. 2.3 CPU invariance ------------------ CPU capacity has a similar effect on task utilization in that running an identical workload on CPUs of different capacity values will yield different duty cycles. Consider the system described in 1.3.2., i.e.:: - capacity(CPU0) = C - capacity(CPU1) = C/3 Executing a given periodic workload on each CPU at their maximum frequency would result in:: CPU0 work ^ | ____ ____ ____ | | | | | | | +----+----+----+----+----+----+----+----+----+----+-> time CPU1 work ^ | ______________ ______________ ____ | | | | | | +----+----+----+----+----+----+----+----+----+----+-> time IOW, - duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency - duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency The task utilization signal can be made CPU invariant using the following formula:: task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity) with ``max_capacity`` being the highest CPU capacity value in the system. Applying this formula to the above example above yields a CPU invariant task utilization of 25%. 2.4 Invariant task utilization ------------------------------ Both frequency and CPU invariance need to be applied to task utilization in order to obtain a truly invariant signal. The pseudo-formula for a task utilization that is both CPU and frequency invariant is thus, for a given task p:: curr_frequency(cpu) capacity(cpu) task_util_inv(p) = duty_cycle(p) * ------------------- * ------------- max_frequency(cpu) max_capacity In other words, invariant task utilization describes the behaviour of a task as if it were running on the highest-capacity CPU in the system, running at its maximum frequency. Any mention of task utilization in the following sections will imply its invariant form. 2.5 Utilization estimation -------------------------- Without a crystal ball, task behaviour (and thus task utilization) cannot accurately be predicted the moment a task first becomes runnable. The CFS class maintains a handful of CPU and task signals based on the Per-Entity Load Tracking (PELT) mechanism, one of those yielding an *average* utilization (as opposed to instantaneous). This means that while the capacity aware scheduling criteria will be written considering a "true" task utilization (using a crystal ball), the implementation will only ever be able to use an estimator thereof. 3. Capacity aware scheduling requirements ========================================= 3.1 CPU capacity ---------------- Linux cannot currently figure out CPU capacity on its own, this information thus needs to be handed to it. Architectures must define arch_scale_cpu_capacity() for that purpose. The arm, arm64, and RISC-V architectures directly map this to the arch_topology driver CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see Documentation/devicetree/bindings/cpu/cpu-capacity.txt. 3.2 Frequency invariance ------------------------ As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task utilization. Architectures must define arch_scale_freq_capacity(cpu) for that purpose. Implementing this function requires figuring out at which frequency each CPU have been running at. One way to implement this is to leverage hardware counters whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86, AMU on arm64). Another is to directly hook into cpufreq frequency transitions, when the kernel is aware of the switched-to frequency (also employed by arm/arm64). 4. Scheduler topology ===================== During the construction of the sched domains, the scheduler will figure out whether the system exhibits asymmetric CPU capacities. Should that be the case: - The sched_asym_cpucapacity static key will be enabled. - The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain level that spans all unique CPU capacity values. - The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans CPUs with any range of asymmetry. The sched_asym_cpucapacity static key is intended to guard sections of code that cater to asymmetric CPU capacity systems. Do note however that said key is *system-wide*. Imagine the following setup using cpusets:: capacity C/2 C ________ ________ / \ / \ CPUs 0 1 2 3 4 5 6 7 \__/ \______________/ cpusets cs0 cs1 Which could be created via: .. code-block:: sh mkdir /sys/fs/cgroup/cpuset/cs0 echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems mkdir /sys/fs/cgroup/cpuset/cs1 echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance Since there *is* CPU capacity asymmetry in the system, the sched_asym_cpucapacity static key will be enabled. However, the sched_domain hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't set in that hierarchy, it describes an SMP island and should be treated as such. Therefore, the 'canonical' pattern for protecting codepaths that cater to asymmetric CPU capacities is to: - Check the sched_asym_cpucapacity static key - If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific CPU or group thereof) 5. Capacity aware scheduling implementation =========================================== 5.1 CFS ------- 5.1.1 Capacity fitness ~~~~~~~~~~~~~~~~~~~~~~ The main capacity scheduling criterion of CFS is:: task_util(p) < capacity(task_cpu(p)) This is commonly called the capacity fitness criterion, i.e. CFS must ensure a task "fits" on its CPU. If it is violated, the task will need to achieve more work than what its CPU can provide: it will be CPU-bound. Furthermore, uclamp lets userspace specify a minimum and a maximum utilization value for a task, either via sched_setattr() or via the cgroup interface (see Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to clamp task_util() in the previous criterion. 5.1.2 Wakeup CPU selection ~~~~~~~~~~~~~~~~~~~~~~~~~~ CFS task wakeup CPU selection follows the capacity fitness criterion described above. On top of that, uclamp is used to clamp the task utilization values, which lets userspace have more leverage over the CPU selection of CFS tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies:: clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu) By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run on any CPU by giving it a low uclamp.max value. Conversely, it can force a small periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by giving it a high uclamp.min value. .. note:: Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling (EAS), which is described in Documentation/scheduler/sched-energy.rst. 5.1.3 Load balancing ~~~~~~~~~~~~~~~~~~~~ A pathological case in the wakeup CPU selection occurs when a task rarely sleeps, if at all - it thus rarely wakes up, if at all. Consider:: w == wakeup event capacity(CPU0) = C capacity(CPU1) = C / 3 workload on CPU0 CPU work ^ | _________ _________ ____ | | | | | | +----+----+----+----+----+----+----+----+----+----+-> time w w w workload on CPU1 CPU work ^ | ____________________________________________ | | +----+----+----+----+----+----+----+----+----+----+-> w This workload should run on CPU0, but if the task either: - was improperly scheduled from the start (inaccurate initial utilization estimation) - was properly scheduled from the start, but suddenly needs more processing power then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``; the CPU capacity scheduling criterion is violated, and there may not be any more wakeup event to fix this up via wakeup CPU selection. Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism put in place to handle this shares the same name. Misfit task migration leverages the CFS load balancer, more specifically the active load balance part (which caters to migrating currently running tasks). When load balance happens, a misfit active load balance will be triggered if a misfit task can be migrated to a CPU with more capacity than its current one. 5.2 RT ------ 5.2.1 Wakeup CPU selection ~~~~~~~~~~~~~~~~~~~~~~~~~~ RT task wakeup CPU selection searches for a CPU that satisfies:: task_uclamp_min(p) <= capacity(task_cpu(cpu)) while still following the usual priority constraints. If none of the candidate CPUs can satisfy this capacity criterion, then strict priority based scheduling is followed and CPU capacities are ignored. 5.3 DL ------ 5.3.1 Wakeup CPU selection ~~~~~~~~~~~~~~~~~~~~~~~~~~ DL task wakeup CPU selection searches for a CPU that satisfies:: task_bandwidth(p) < capacity(task_cpu(p)) while still respecting the usual bandwidth and deadline constraints. If none of the candidate CPUs can satisfy this capacity criterion, then the task will remain on its current CPU. |