Based on kernel version 4.10.8. Page generated on 2017-04-01 14:42 EST.

Intel P-State driver
--------------------

This driver provides an interface to control the P-State selection for the
SandyBridge+ Intel processors.

The following document explains P-States:
http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
As stated in the document, P-State doesn’t exactly mean a frequency. However, for
the sake of the relationship with cpufreq, P-State and frequency are used
interchangeably.

Understanding the cpufreq core governors and policies is important before
discussing more details about the Intel P-State driver. Based on what callbacks
a cpufreq driver provides to the cpufreq core, it can support two types of
drivers:
- with target_index() callback: In this mode, the drivers using cpufreq core
  simply provide the minimum and maximum frequency limits and an additional
  interface target_index() to set the current frequency. The cpufreq subsystem
  has a number of scaling governors ("performance", "powersave", "ondemand",
  etc.). Depending on which governor is in use, cpufreq core will call for
  transitions to a specific frequency using target_index() callback.
- with setpolicy() callback: In this mode, drivers do not provide the
  target_index() callback, so cpufreq core can't request a transition to a
  specific frequency. The driver provides minimum and maximum frequency limits
  and callbacks to set a policy. The policy in cpufreq sysfs is referred to as
  the "scaling governor". The cpufreq core can request the driver to operate in
  either of two policies: "performance" and "powersave". The driver decides
  which frequency to use based on the above policy selection, considering the
  minimum and maximum frequency limits.

The Intel P-State driver falls under the latter category, which implements the
setpolicy() callback. This driver decides what P-State to use based on the
requested policy from the cpufreq core.
If the processor is capable of selecting its next P-State internally, then the
driver will offload this responsibility to the processor (aka HWP: Hardware
P-States). If not, the driver implements algorithms to select the next P-State.

Since these policies are implemented in the driver, they are not the same as the
cpufreq scaling governors implementation, even if they have the same name in
the cpufreq sysfs (scaling_governors). For example, the "performance" policy is
similar to cpufreq’s "performance" governor, but "powersave" is completely
different from the cpufreq "powersave" governor. The strategy here is similar
to cpufreq "ondemand", where the requested P-State is related to the system load.

Sysfs Interface

In addition to the frequency-controlling interfaces provided by the cpufreq
core, the driver provides its own sysfs files to control the P-State selection.
These files have been added to /sys/devices/system/cpu/intel_pstate/.
Any changes made to these files are applicable to all CPUs (even in a
multi-package system; refer to the later section on placing "Per-CPU limits").

      max_perf_pct: Limits the maximum P-State that will be requested by
      the driver. It states it as a percentage of the available performance. The
      available (P-State) performance may be reduced by the no_turbo
      setting described below.

      min_perf_pct: Limits the minimum P-State that will be requested by
      the driver. It states it as a percentage of the max (non-turbo)
      performance level.

      no_turbo: Limits the driver to selecting P-States below the turbo
      frequency range.

      turbo_pct: Displays the percentage of the total performance that
      is supported by hardware that is in the turbo range. This number
      is independent of whether turbo has been disabled or not.

      num_pstates: Displays the number of P-States that are supported
      by hardware.
      This number is independent of whether turbo has been disabled or not.

For example, if a system has these parameters:
      Max 1 core turbo ratio: 0x21 (Max 1 core ratio is the maximum P-State)
      Max non turbo ratio: 0x17
      Minimum ratio: 0x08 (Here the ratio is called max efficiency ratio)

Sysfs will show:
      max_perf_pct:100, which corresponds to 1 core ratio
      min_perf_pct:24, max_efficiency_ratio / max 1 Core ratio
      no_turbo:0, turbo is not disabled
      num_pstates:26 = (max 1 Core ratio - Max Efficiency Ratio + 1)
      turbo_pct:39 = (max 1 core ratio - max non turbo ratio) / num_pstates

Refer to "Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3: System Programming Guide" to understand ratios.

cpufreq sysfs for Intel P-State

Since this driver registers with cpufreq, cpufreq sysfs is also presented.
There are some important differences, which need to be considered.

      scaling_cur_freq: This displays the real frequency which was used during
      the last sample period instead of what is requested. Some other cpufreq
      drivers, like acpi-cpufreq, display what is requested (some changes are
      on the way to fix this for the acpi-cpufreq driver). The same is true for
      frequencies displayed at /proc/cpuinfo.

      scaling_governor: This displays the current active policy. Since each CPU
      has a cpufreq sysfs, it is possible to set a scaling governor for each
      CPU. But this is not possible with Intel P-States, as there is one common
      policy for all CPUs. Here, the last requested policy will be applicable
      to all CPUs. It is suggested that one use the cpupower utility to change
      the policy on all CPUs at the same time.

      scaling_setspeed: This attribute can never be used with Intel P-State.

      scaling_max_freq/scaling_min_freq: This interface can be used similarly to
      the max_perf_pct/min_perf_pct of Intel P-State sysfs.
      However, since frequencies are converted to the nearest possible P-State,
      this is prone to rounding errors. This method is not preferred to limit
      performance.

      affected_cpus: Not used
      related_cpus: Not used

For contemporary Intel processors, the frequency is controlled by the
processor itself and the P-State exposed to software is related to
performance levels. The idea that frequency can be set to a single
frequency is fictional for Intel Core processors. Even if the scaling
driver selects a single P-State, the actual frequency the processor
will run at is selected by the processor itself.

Per-CPU limits

The kernel command line option "intel_pstate=per_cpu_perf_limits" forces
the intel_pstate driver to use per-CPU performance limits. When it is set,
the sysfs control interface described above is subject to limitations.
- The following controls are not available for either read or write:
      /sys/devices/system/cpu/intel_pstate/max_perf_pct
      /sys/devices/system/cpu/intel_pstate/min_perf_pct
- The following controls can be used to set performance limits, as far as the
  architecture of the processor permits:
      /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
      /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
      /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
- The user can still observe the turbo percent and number of P-States from:
      /sys/devices/system/cpu/intel_pstate/turbo_pct
      /sys/devices/system/cpu/intel_pstate/num_pstates
- The user can read and write the system-wide turbo status:
      /sys/devices/system/cpu/intel_pstate/no_turbo

Support of energy performance hints
It is possible to provide hints to the HWP algorithms in the processor
to be more performance centric or more energy centric. When the driver
is using HWP, two additional cpufreq sysfs attributes are presented for
each logical CPU.
These attributes are:
      - energy_performance_available_preferences
      - energy_performance_preference

To get the list of supported hints:
      $ cat energy_performance_available_preferences
      default performance balance_performance balance_power power

The current preference can be read or changed via the cpufreq sysfs
attribute "energy_performance_preference". Reading from this attribute
will display the current effective setting. The user can write any of the
valid preference strings to this attribute. The user can always restore the
power-on default by writing "default".

Since threads can migrate to different CPUs, it is possible that the
new CPU may have a different energy performance preference than the previous
one. To avoid such issues, either pin threads to specific CPUs or set the
same energy performance preference value on all CPUs.

Tuning Intel P-State driver

When the performance can be tuned using the PID (Proportional Integral
Derivative) controller, debugfs files are provided for adjusting performance.
They are presented under:
      /sys/kernel/debug/pstate_snb/

The PID tunable parameters are:
      deadband
      d_gain_pct
      i_gain_pct
      p_gain_pct
      sample_rate_ms
      setpoint

To adjust these parameters, some understanding of the driver implementation is
necessary. There are some tweaks described here, but be very careful. Adjusting
them requires an expert-level understanding of the power and performance
relationship. These limits are only useful when the "powersave" policy is
active.

- To make the system more responsive to load changes, sample_rate_ms can
  be adjusted (current default is 10ms).
- To make the system use higher performance, even if the load is lower, setpoint
  can be adjusted to a lower number. This will also lead to a faster ramp up
  time to reach the maximum P-State.
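
The effect of setpoint on ramp up time can be illustrated with a small
simulation. This is only a sketch, not the driver's code: it assumes the
proportional-only rule used in this document (next P-State = current P-State -
((setpoint - load) * p_gain_pct), with d_gain_pct = i_gain_pct = 0 and
deadband = 0), rounds to the nearest integer, and clamps to the example ratios
0x08 and 0x21. The function names next_pstate() and samples_to_max() are
hypothetical.

```python
def next_pstate(current, load, setpoint=97, p_gain_pct=20,
                min_pstate=0x08, max_pstate=0x21):
    """One proportional-only step, clamped to [min_pstate, max_pstate]."""
    raw = current - (setpoint - load) * p_gain_pct / 100
    return max(min_pstate, min(max_pstate, round(raw)))


def samples_to_max(setpoint, load=100, start=0x08, max_pstate=0x21):
    """Count sample periods needed to reach max_pstate under a constant load."""
    current, samples = start, 0
    while current < max_pstate:
        nxt = next_pstate(current, load, setpoint=setpoint,
                          max_pstate=max_pstate)
        if nxt <= current:
            # Load does not exceed the setpoint enough to move up a P-State.
            raise ValueError("P-State never ramps up for this load/setpoint")
        current, samples = nxt, samples + 1
    return samples


if __name__ == "__main__":
    for sp in (97, 60):
        n = samples_to_max(sp)
        print(f"setpoint={sp}: {n} samples (~{n * 10} ms at sample_rate_ms=10)")
```

Under these assumptions, a constant load of 100 takes 25 sample periods to
climb from 0x08 to 0x21 with setpoint = 97, but only 4 with setpoint = 60.
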
If there are no derivative and integral coefficients, the next P-State will be
equal to:
      current P-State - ((setpoint - current cpu load) * p_gain_pct)

For example, if the current PID parameters are (which are the defaults for Core
processors like SandyBridge):
      deadband = 0
      d_gain_pct = 0
      i_gain_pct = 0
      p_gain_pct = 20
      sample_rate_ms = 10
      setpoint = 97

If the current P-State = 0x08 and current load = 100, this will result in the
next P-State = 0x08 - ((97 - 100) * 0.2) = 8.6 (rounded to 9). Here the P-State
goes up by only 1. If during the next sample interval the current load doesn't
change and is still 100, then the P-State goes up by one again. This process
will continue as long as the load is more than the setpoint until the maximum
P-State is reached.

For the same load at setpoint = 60, this will result in the next P-State
= 0x08 - ((60 - 100) * 0.2) = 16
So by changing the setpoint from 97 to 60, there is an increase of the
next P-State from 9 to 16. This makes the processor execute at a higher
P-State for the same CPU load. If the load continues to be more than the
setpoint during the next sample intervals, then the P-State will go up again
until the maximum P-State is reached. But the ramp up time to reach the maximum
P-State will be much faster when the setpoint is 60 compared to 97.

Debugging Intel P-State driver

Event tracing
To debug P-State transitions, the Linux event tracing interface can be used.
There are two specific events, which can be enabled (provided the kernel
configs related to event tracing are enabled).

      # cd /sys/kernel/debug/tracing/
      # echo 1 > events/power/pstate_sample/enable
      # echo 1 > events/power/cpu_frequency/enable
      # cat trace
      gnome-terminal--4510  [001] ..s.  1177.680733: pstate_sample: core_busy=107
        scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618
        freq=2474476
      cat-5235  [002] ..s.  1177.681723: cpu_frequency: state=2900000 cpu_id=2


Using ftrace

If function level tracing is required, the Linux ftrace interface can be used.
For example, if we want to check how often a function to set a P-State is
called, we can set the ftrace filter to intel_pstate_set_pstate.

      # cd /sys/kernel/debug/tracing/
      # cat available_filter_functions | grep -i pstate
      intel_pstate_set_pstate
      intel_pstate_cpu_init
      ...

      # echo intel_pstate_set_pstate > set_ftrace_filter
      # echo function > current_tracer
      # cat trace | head -15
      # tracer: function
      #
      # entries-in-buffer/entries-written: 80/80   #P:4
      #
      #                              _-----=> irqs-off
      #                             / _----=> need-resched
      #                            | / _---=> hardirq/softirq
      #                            || / _--=> preempt-depth
      #                            ||| /     delay
      #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
      #              | |       |   ||||       |         |
                  Xorg-3129  [000] ..s.  2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
       gnome-terminal--4510  [002] ..s.  2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
           gnome-shell-3409  [001] ..s.  2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
                 <idle>-0    [000] ..s.  2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
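
The pstate_sample and cpu_frequency events shown in the event tracing section
carry their data as key=value fields, which makes them easy to post-process.
The following is a minimal sketch of such a parser, not a tool shipped with the
kernel; parse_trace_event() is a hypothetical helper, and it assumes each event
occupies a single line in the trace file (the sample above is wrapped only for
display) and that all field values are decimal integers.

```python
import re

# Event name, a colon, then one or more key=value fields up to end of line.
EVENT_RE = re.compile(r"(\w+):\s+((?:\w+=\d+\s*)+)$")


def parse_trace_event(line):
    """Return (event_name, {field: int}) or None if the line is not an event."""
    match = EVENT_RE.search(line.strip())
    if match is None:
        return None
    fields = {key: int(value)
              for key, value in re.findall(r"(\w+)=(\d+)", match.group(2))}
    return match.group(1), fields


if __name__ == "__main__":
    sample = ("gnome-terminal--4510 [001] ..s. 1177.680733: pstate_sample: "
              "core_busy=107 scaled=94 from=26 to=26 mperf=1143818 "
              "aperf=1230607 tsc=29838618 freq=2474476")
    name, fields = parse_trace_event(sample)
    print(name, fields["from"], fields["to"], fields["freq"])
    # In this sample, aperf / mperf expressed as a percentage lines up with
    # core_busy; treating that as a general rule is an assumption.
    print(fields["aperf"] * 100 // fields["mperf"], fields["core_busy"])
```

Lines that are not key=value events, such as comment headers or function tracer
entries, return None and can simply be skipped by the caller.
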