Linux Kernel Documentation

About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog
Documentation / ftrace.txt

Based on kernel version 2.6.29. Page generated on 2009-03-25 22:22 EST.
1			ftrace - Function Tracer
2			========================
3	
4	Copyright 2008 Red Hat Inc.
5	   Author:   Steven Rostedt <srostedt[AT]redhat[DOT]com>
6	  License:   The GNU Free Documentation License, Version 1.2
7	               (dual licensed under the GPL v2)
8	Reviewers:   Elias Oltmanns, Randy Dunlap, Andrew Morton,
9		     John Kacur, and David Teigland.
10	
11	Written for: 2.6.28-rc2
12	
13	Introduction
14	------------
15	
16	Ftrace is an internal tracer designed to help out developers and
17	designers of systems to find what is going on inside the kernel.
18	It can be used for debugging or analyzing latencies and performance
19	issues that take place outside of user-space.
20	
21	Although ftrace is the function tracer, it also includes an
22	infrastructure that allows for other types of tracing. Some of the
23	tracers that are currently in ftrace include a tracer to trace
24	context switches, the time it takes for a high priority task to
25	run after it was woken up, the time interrupts are disabled, and
26	more (ftrace allows for tracer plugins, which means that the list of
27	tracers can always grow).
28	
29	
30	The File System
31	---------------
32	
33	Ftrace uses the debugfs file system to hold the control files as well
34	as the files to display output.
35	
36	To mount the debugfs system:
37	
38	  # mkdir /debug
39	  # mount -t debugfs nodev /debug
40	
41	(Note: it is more common to mount at /sys/kernel/debug, but for simplicity
42	 this document will use /debug)
43	
44	That's it! (assuming that you have ftrace configured into your kernel)
45	
46	After mounting the debugfs, you can see a directory called
47	"tracing".  This directory contains the control and output files
48	of ftrace. Here is a list of some of the key files:
49	
50	
51	 Note: all time values are in microseconds.
52	
53	  current_tracer: This is used to set or display the current tracer
54			that is configured.
55	
56	  available_tracers: This holds the different types of tracers that
57			have been compiled into the kernel. The tracers
58			listed here can be configured by echoing their name
59			into current_tracer.
60	
61	  tracing_enabled: This sets or displays whether the current_tracer
62			is activated and tracing or not. Echo 0 into this
63			file to disable the tracer or 1 to enable it.
64	
65	  trace: This file holds the output of the trace in a human readable
66			format (described below).
67	
68	  latency_trace: This file shows the same trace but the information
69			is organized more to display possible latencies
70			in the system (described below).
71	
72	  trace_pipe: The output is the same as the "trace" file but this
73			file is meant to be streamed with live tracing.
74			Reads from this file will block until new data
75			is retrieved. Unlike the "trace" and "latency_trace"
76			files, this file is a consumer. This means reading
77			from this file causes sequential reads to display
78			more current data. Once data is read from this
79			file, it is consumed, and will not be read
80			again with a sequential read. The "trace" and
81			"latency_trace" files are static, and if the
82			tracer is not adding more data, they will display
83			the same information every time they are read.
84	
85	  trace_options: This file lets the user control the amount of data
86			that is displayed in one of the above output
87			files.
88	
89	  trace_max_latency: Some of the tracers record the max latency.
90			For example, the time interrupts are disabled.
91			This time is saved in this file. The max trace
92			will also be stored, and displayed by either
93			"trace" or "latency_trace".  A new max trace will
94			only be recorded if the latency is greater than
95			the value in this file. (in microseconds)
96	
97	  buffer_size_kb: This sets or displays the number of kilobytes each CPU
98			buffer can hold. The tracer buffers are the same size
99			for each CPU. The displayed number is the size of the
100			CPU buffer and not total size of all buffers. The
101			trace buffers are allocated in pages (blocks of memory
102			that the kernel uses for allocation, usually 4 KB in size).
103			If the last page allocated has room for more bytes
104			than requested, the rest of the page will be used,
105			making the actual allocation bigger than requested.
106			(Note, the size may not be a multiple of the page size due
107			to buffer managment overhead.)
108	
109			This can only be updated when the current_tracer
110			is set to "nop".
111	
112	  tracing_cpumask: This is a mask that lets the user only trace
113			on specified CPUS. The format is a hex string
114			representing the CPUS.
115	
116	  set_ftrace_filter: When dynamic ftrace is configured in (see the
117			section below "dynamic ftrace"), the code is dynamically
118			modified (code text rewrite) to disable calling of the
119			function profiler (mcount). This lets tracing be configured
120			in with practically no overhead in performance.  This also
121			has a side effect of enabling or disabling specific functions
122			to be traced. Echoing names of functions into this file
123			will limit the trace to only those functions.
124	
125	  set_ftrace_notrace: This has an effect opposite to that of
126			set_ftrace_filter. Any function that is added here will not
127			be traced. If a function exists in both set_ftrace_filter
128			and set_ftrace_notrace,	the function will _not_ be traced.
129	
130	  set_ftrace_pid: Have the function tracer only trace a single thread.
131	
132	  available_filter_functions: This lists the functions that ftrace
133			has processed and can trace. These are the function
134			names that you can pass to "set_ftrace_filter" or
135			"set_ftrace_notrace". (See the section "dynamic ftrace"
136			below for more details.)
137	
138	
139	The Tracers
140	-----------
141	
142	Here is the list of current tracers that may be configured.
143	
144	  function - function tracer that uses mcount to trace all functions.
145	
146	  sched_switch - traces the context switches between tasks.
147	
148	  irqsoff - traces the areas that disable interrupts and saves
149	  		the trace with the longest max latency.
150			See tracing_max_latency.  When a new max is recorded,
151			it replaces the old trace. It is best to view this
152			trace via the latency_trace file.
153	
154	  preemptoff - Similar to irqsoff but traces and records the amount of
155			time for which preemption is disabled.
156	
157	  preemptirqsoff - Similar to irqsoff and preemptoff, but traces and
158			 records the largest time for which irqs and/or preemption
159			 is disabled.
160	
161	  wakeup - Traces and records the max latency that it takes for
162			the highest priority task to get scheduled after
163			it has been woken up.
164	
165	  nop - This is not a tracer. To remove all tracers from tracing
166			simply echo "nop" into current_tracer.
167	
168	
169	Examples of using the tracer
170	----------------------------
171	
172	Here are typical examples of using the tracers when controlling them only
173	with the debugfs interface (without using any user-land utilities).
174	
175	Output format:
176	--------------
177	
178	Here is an example of the output format of the file "trace"
179	
180	                             --------
181	# tracer: function
182	#
183	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
184	#              | |      |          |         |
185	            bash-4251  [01] 10152.583854: path_put <-path_walk
186	            bash-4251  [01] 10152.583855: dput <-path_put
187	            bash-4251  [01] 10152.583855: _atomic_dec_and_lock <-dput
188	                             --------
189	
190	A header is printed with the tracer name that is represented by the trace.
191	In this case the tracer is "function". Then a header showing the format. Task
192	name "bash", the task PID "4251", the CPU that it was running on
193	"01", the timestamp in <secs>.<usecs> format, the function name that was
194	traced "path_put" and the parent function that called this function
195	"path_walk". The timestamp is the time at which the function was
196	entered.
197	
198	The sched_switch tracer also includes tracing of task wakeups and
199	context switches.
200	
201	     ksoftirqd/1-7     [01]  1453.070013:      7:115:R   +  2916:115:S
202	     ksoftirqd/1-7     [01]  1453.070013:      7:115:R   +    10:115:S
203	     ksoftirqd/1-7     [01]  1453.070013:      7:115:R ==>    10:115:R
204	        events/1-10    [01]  1453.070013:     10:115:S ==>  2916:115:R
205	     kondemand/1-2916  [01]  1453.070013:   2916:115:S ==>     7:115:R
206	     ksoftirqd/1-7     [01]  1453.070013:      7:115:S ==>     0:140:R
207	
208	Wake ups are represented by a "+" and the context switches are shown as
209	"==>".  The format is:
210	
211	 Context switches:
212	
213	       Previous task              Next Task
214	
215	  <pid>:<prio>:<state>  ==>  <pid>:<prio>:<state>
216	
217	 Wake ups:
218	
219	       Current task               Task waking up
220	
221	  <pid>:<prio>:<state>    +  <pid>:<prio>:<state>
222	
223	The prio is the internal kernel priority, which is the inverse of the
224	priority that is usually displayed by user-space tools. Zero represents
225	the highest priority (99). Prio 100 starts the "nice" priorities with
226	100 being equal to nice -20 and 139 being nice 19. The prio "140" is
227	reserved for the idle task which is the lowest priority thread (pid 0).
228	
229	
230	Latency trace format
231	--------------------
232	
233	For traces that display latency times, the latency_trace file gives
234	somewhat more information to see why a latency happened. Here is a typical
235	trace.
236	
237	# tracer: irqsoff
238	#
239	irqsoff latency trace v1.1.5 on 2.6.26-rc8
240	--------------------------------------------------------------------
241	 latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
242	    -----------------
243	    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
244	    -----------------
245	 => started at: apic_timer_interrupt
246	 => ended at:   do_softirq
247	
248	#                _------=> CPU#
249	#               / _-----=> irqs-off
250	#              | / _----=> need-resched
251	#              || / _---=> hardirq/softirq
252	#              ||| / _--=> preempt-depth
253	#              |||| /
254	#              |||||     delay
255	#  cmd     pid ||||| time  |   caller
256	#     \   /    |||||   \   |   /
257	  <idle>-0     0d..1    0us+: trace_hardirqs_off_thunk (apic_timer_interrupt)
258	  <idle>-0     0d.s.   97us : __do_softirq (do_softirq)
259	  <idle>-0     0d.s1   98us : trace_hardirqs_on (do_softirq)
260	
261	
262	
263	This shows that the current tracer is "irqsoff" tracing the time for which
264	interrupts were disabled. It gives the trace version and the version
265	of the kernel upon which this was executed on (2.6.26-rc8). Then it displays
266	the max latency in microsecs (97 us). The number of trace entries displayed
267	and the total number recorded (both are three: #3/3). The type of
268	preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero
269	and are reserved for later use. #P is the number of online CPUS (#P:2).
270	
271	The task is the process that was running when the latency occurred.
272	(swapper pid: 0).
273	
274	The start and stop (the functions in which the interrupts were disabled and
275	enabled respectively) that caused the latencies:
276	
277	  apic_timer_interrupt is where the interrupts were disabled.
278	  do_softirq is where they were enabled again.
279	
280	The next lines after the header are the trace itself. The header
281	explains which is which.
282	
283	  cmd: The name of the process in the trace.
284	
285	  pid: The PID of that process.
286	
287	  CPU#: The CPU which the process was running on.
288	
289	  irqs-off: 'd' interrupts are disabled. '.' otherwise.
290		    Note: If the architecture does not support a way to
291			  read the irq flags variable, an 'X' will always
292			  be printed here.
293	
294	  need-resched: 'N' task need_resched is set, '.' otherwise.
295	
296	  hardirq/softirq:
297		'H' - hard irq occurred inside a softirq.
298		'h' - hard irq is running
299		's' - soft irq is running
300		'.' - normal context.
301	
302	  preempt-depth: The level of preempt_disabled
303	
304	The above is mostly meaningful for kernel developers.
305	
306	  time: This differs from the trace file output. The trace file output
307		includes an absolute timestamp. The timestamp used by the
308		latency_trace file is relative to the start of the trace.
309	
310	  delay: This is just to help catch your eye a bit better. And
311		needs to be fixed to be only relative to the same CPU.
312		The marks are determined by the difference between this
313		current trace and the next trace.
314		 '!' - greater than preempt_mark_thresh (default 100)
315		 '+' - greater than 1 microsecond
316		 ' ' - less than or equal to 1 microsecond.
317	
318	  The rest is the same as the 'trace' file.
319	
320	
321	trace_options
322	-------------
323	
324	The trace_options file is used to control what gets printed in the trace
325	output. To see what is available, simply cat the file:
326	
327	  cat /debug/tracing/trace_options
328	  print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
329	 noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj
330	
331	To disable one of the options, echo in the option prepended with "no".
332	
333	  echo noprint-parent > /debug/tracing/trace_options
334	
335	To enable an option, leave off the "no".
336	
337	  echo sym-offset > /debug/tracing/trace_options
338	
339	Here are the available options:
340	
341	  print-parent - On function traces, display the calling function
342			as well as the function being traced.
343	
344	  print-parent:
345	   bash-4000  [01]  1477.606694: simple_strtoul <-strict_strtoul
346	
347	  noprint-parent:
348	   bash-4000  [01]  1477.606694: simple_strtoul
349	
350	
351	  sym-offset - Display not only the function name, but also the offset
352			in the function. For example, instead of seeing just
353			"ktime_get", you will see "ktime_get+0xb/0x20".
354	
355	  sym-offset:
356	   bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0
357	
358	  sym-addr - this will also display the function address as well as
359			the function name.
360	
361	  sym-addr:
362	   bash-4000  [01]  1477.606694: simple_strtoul <c0339346>
363	
364	  verbose - This deals with the latency_trace file.
365	
366	    bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
367	    (+0.000ms): simple_strtoul (strict_strtoul)
368	
369	  raw - This will display raw numbers. This option is best for use with
370		user applications that can translate the raw numbers better than
371		having it done in the kernel.
372	
373	  hex - Similar to raw, but the numbers will be in a hexadecimal format.
374	
375	  bin - This will print out the formats in raw binary.
376	
377	  block - TBD (needs update)
378	
379	  stacktrace - This is one of the options that changes the trace itself.
380			When a trace is recorded, so is the stack of functions.
381			This allows for back traces of trace sites.
382	
383	  userstacktrace - This option changes the trace.
384			   It records a stacktrace of the current userspace thread.
385	
386	  sym-userobj - when user stacktrace are enabled, look up which object the
387			address belongs to, and print a relative address
388			This is especially useful when ASLR is on, otherwise you don't
389			get a chance to resolve the address to object/file/line after the app is no
390			longer running
391	
392			The lookup is performed when you read trace,trace_pipe,latency_trace. Example:
393	
394			a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
395	x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
396	
397	  sched-tree - TBD (any users??)
398	
399	
400	sched_switch
401	------------
402	
403	This tracer simply records schedule switches. Here is an example
404	of how to use it.
405	
406	 # echo sched_switch > /debug/tracing/current_tracer
407	 # echo 1 > /debug/tracing/tracing_enabled
408	 # sleep 1
409	 # echo 0 > /debug/tracing/tracing_enabled
410	 # cat /debug/tracing/trace
411	
412	# tracer: sched_switch
413	#
414	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
415	#              | |      |          |         |
416	            bash-3997  [01]   240.132281:   3997:120:R   +  4055:120:R
417	            bash-3997  [01]   240.132284:   3997:120:R ==>  4055:120:R
418	           sleep-4055  [01]   240.132371:   4055:120:S ==>  3997:120:R
419	            bash-3997  [01]   240.132454:   3997:120:R   +  4055:120:S
420	            bash-3997  [01]   240.132457:   3997:120:R ==>  4055:120:R
421	           sleep-4055  [01]   240.132460:   4055:120:D ==>  3997:120:R
422	            bash-3997  [01]   240.132463:   3997:120:R   +  4055:120:D
423	            bash-3997  [01]   240.132465:   3997:120:R ==>  4055:120:R
424	          <idle>-0     [00]   240.132589:      0:140:R   +     4:115:S
425	          <idle>-0     [00]   240.132591:      0:140:R ==>     4:115:R
426	     ksoftirqd/0-4     [00]   240.132595:      4:115:S ==>     0:140:R
427	          <idle>-0     [00]   240.132598:      0:140:R   +     4:115:S
428	          <idle>-0     [00]   240.132599:      0:140:R ==>     4:115:R
429	     ksoftirqd/0-4     [00]   240.132603:      4:115:S ==>     0:140:R
430	           sleep-4055  [01]   240.133058:   4055:120:S ==>  3997:120:R
431	 [...]
432	
433	
434	As we have discussed previously about this format, the header shows
435	the name of the trace and points to the options. The "FUNCTION"
436	is a misnomer since here it represents the wake ups and context
437	switches.
438	
439	The sched_switch file only lists the wake ups (represented with '+')
440	and context switches ('==>') with the previous task or current task
441	first followed by the next task or task waking up. The format for both
442	of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the KERNEL-PRIO
443	is the inverse of the actual priority with zero (0) being the highest
444	priority and the nice values starting at 100 (nice -20). Below is
445	a quick chart to map the kernel priority to user land priorities.
446	
447	  Kernel priority: 0 to 99    ==> user RT priority 99 to 0
448	  Kernel priority: 100 to 139 ==> user nice -20 to 19
449	  Kernel priority: 140        ==> idle task priority
450	
451	The task states are:
452	
453	 R - running : wants to run, may not actually be running
454	 S - sleep   : process is waiting to be woken up (handles signals)
455	 D - disk sleep (uninterruptible sleep) : process must be woken up
456						(ignores signals)
457	 T - stopped : process suspended
458	 t - traced  : process is being traced (with something like gdb)
459	 Z - zombie  : process waiting to be cleaned up
460	 X - unknown
461	
462	
463	ftrace_enabled
464	--------------
465	
466	The following tracers (listed below) give different output depending
467	on whether or not the sysctl ftrace_enabled is set. To set ftrace_enabled,
468	one can either use the sysctl function or set it via the proc
469	file system interface.
470	
471	  sysctl kernel.ftrace_enabled=1
472	
473	 or
474	
475	  echo 1 > /proc/sys/kernel/ftrace_enabled
476	
477	To disable ftrace_enabled simply replace the '1' with '0' in
478	the above commands.
479	
480	When ftrace_enabled is set the tracers will also record the functions
481	that are within the trace. The descriptions of the tracers
482	will also show an example with ftrace enabled.
483	
484	
485	irqsoff
486	-------
487	
488	When interrupts are disabled, the CPU can not react to any other
489	external event (besides NMIs and SMIs). This prevents the timer
490	interrupt from triggering or the mouse interrupt from letting the
491	kernel know of a new mouse event. The result is a latency with the
492	reaction time.
493	
494	The irqsoff tracer tracks the time for which interrupts are disabled.
495	When a new maximum latency is hit, the tracer saves the trace leading up
496	to that latency point so that every time a new maximum is reached, the old
497	saved trace is discarded and the new trace is saved.
498	
499	To reset the maximum, echo 0 into tracing_max_latency. Here is an
500	example:
501	
502	 # echo irqsoff > /debug/tracing/current_tracer
503	 # echo 0 > /debug/tracing/tracing_max_latency
504	 # echo 1 > /debug/tracing/tracing_enabled
505	 # ls -ltr
506	 [...]
507	 # echo 0 > /debug/tracing/tracing_enabled
508	 # cat /debug/tracing/latency_trace
509	# tracer: irqsoff
510	#
511	irqsoff latency trace v1.1.5 on 2.6.26
512	--------------------------------------------------------------------
513	 latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
514	    -----------------
515	    | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0)
516	    -----------------
517	 => started at: sys_setpgid
518	 => ended at:   sys_setpgid
519	
520	#                _------=> CPU#
521	#               / _-----=> irqs-off
522	#              | / _----=> need-resched
523	#              || / _---=> hardirq/softirq
524	#              ||| / _--=> preempt-depth
525	#              |||| /
526	#              |||||     delay
527	#  cmd     pid ||||| time  |   caller
528	#     \   /    |||||   \   |   /
529	    bash-3730  1d...    0us : _write_lock_irq (sys_setpgid)
530	    bash-3730  1d..1    1us+: _write_unlock_irq (sys_setpgid)
531	    bash-3730  1d..2   14us : trace_hardirqs_on (sys_setpgid)
532	
533	
534	Here we see that that we had a latency of 12 microsecs (which is
535	very good). The _write_lock_irq in sys_setpgid disabled interrupts.
536	The difference between the 12 and the displayed timestamp 14us occurred
537	because the clock was incremented between the time of recording the max
538	latency and the time of recording the function that had that latency.
539	
540	Note the above example had ftrace_enabled not set. If we set the
541	ftrace_enabled, we get a much larger output:
542	
543	# tracer: irqsoff
544	#
545	irqsoff latency trace v1.1.5 on 2.6.26-rc8
546	--------------------------------------------------------------------
547	 latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
548	    -----------------
549	    | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0)
550	    -----------------
551	 => started at: __alloc_pages_internal
552	 => ended at:   __alloc_pages_internal
553	
554	#                _------=> CPU#
555	#               / _-----=> irqs-off
556	#              | / _----=> need-resched
557	#              || / _---=> hardirq/softirq
558	#              ||| / _--=> preempt-depth
559	#              |||| /
560	#              |||||     delay
561	#  cmd     pid ||||| time  |   caller
562	#     \   /    |||||   \   |   /
563	      ls-4339  0...1    0us+: get_page_from_freelist (__alloc_pages_internal)
564	      ls-4339  0d..1    3us : rmqueue_bulk (get_page_from_freelist)
565	      ls-4339  0d..1    3us : _spin_lock (rmqueue_bulk)
566	      ls-4339  0d..1    4us : add_preempt_count (_spin_lock)
567	      ls-4339  0d..2    4us : __rmqueue (rmqueue_bulk)
568	      ls-4339  0d..2    5us : __rmqueue_smallest (__rmqueue)
569	      ls-4339  0d..2    5us : __mod_zone_page_state (__rmqueue_smallest)
570	      ls-4339  0d..2    6us : __rmqueue (rmqueue_bulk)
571	      ls-4339  0d..2    6us : __rmqueue_smallest (__rmqueue)
572	      ls-4339  0d..2    7us : __mod_zone_page_state (__rmqueue_smallest)
573	      ls-4339  0d..2    7us : __rmqueue (rmqueue_bulk)
574	      ls-4339  0d..2    8us : __rmqueue_smallest (__rmqueue)
575	[...]
576	      ls-4339  0d..2   46us : __rmqueue_smallest (__rmqueue)
577	      ls-4339  0d..2   47us : __mod_zone_page_state (__rmqueue_smallest)
578	      ls-4339  0d..2   47us : __rmqueue (rmqueue_bulk)
579	      ls-4339  0d..2   48us : __rmqueue_smallest (__rmqueue)
580	      ls-4339  0d..2   48us : __mod_zone_page_state (__rmqueue_smallest)
581	      ls-4339  0d..2   49us : _spin_unlock (rmqueue_bulk)
582	      ls-4339  0d..2   49us : sub_preempt_count (_spin_unlock)
583	      ls-4339  0d..1   50us : get_page_from_freelist (__alloc_pages_internal)
584	      ls-4339  0d..2   51us : trace_hardirqs_on (__alloc_pages_internal)
585	
586	
587	
588	Here we traced a 50 microsecond latency. But we also see all the
589	functions that were called during that time. Note that by enabling
590	function tracing, we incur an added overhead. This overhead may
591	extend the latency times. But nevertheless, this trace has provided
592	some very helpful debugging information.
593	
594	
595	preemptoff
596	----------
597	
598	When preemption is disabled, we may be able to receive interrupts but
599	the task cannot be preempted and a higher priority task must wait
600	for preemption to be enabled again before it can preempt a lower
601	priority task.
602	
603	The preemptoff tracer traces the places that disable preemption.
604	Like the irqsoff tracer, it records the maximum latency for which preemption
605	was disabled. The control of preemptoff tracer is much like the irqsoff
606	tracer.
607	
608	 # echo preemptoff > /debug/tracing/current_tracer
609	 # echo 0 > /debug/tracing/tracing_max_latency
610	 # echo 1 > /debug/tracing/tracing_enabled
611	 # ls -ltr
612	 [...]
613	 # echo 0 > /debug/tracing/tracing_enabled
614	 # cat /debug/tracing/latency_trace
615	# tracer: preemptoff
616	#
617	preemptoff latency trace v1.1.5 on 2.6.26-rc8
618	--------------------------------------------------------------------
619	 latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
620	    -----------------
621	    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
622	    -----------------
623	 => started at: do_IRQ
624	 => ended at:   __do_softirq
625	
626	#                _------=> CPU#
627	#               / _-----=> irqs-off
628	#              | / _----=> need-resched
629	#              || / _---=> hardirq/softirq
630	#              ||| / _--=> preempt-depth
631	#              |||| /
632	#              |||||     delay
633	#  cmd     pid ||||| time  |   caller
634	#     \   /    |||||   \   |   /
635	    sshd-4261  0d.h.    0us+: irq_enter (do_IRQ)
636	    sshd-4261  0d.s.   29us : _local_bh_enable (__do_softirq)
637	    sshd-4261  0d.s1   30us : trace_preempt_on (__do_softirq)
638	
639	
640	This has some more changes. Preemption was disabled when an interrupt
641	came in (notice the 'h'), and was enabled while doing a softirq.
642	(notice the 's'). But we also see that interrupts have been disabled
643	when entering the preempt off section and leaving it (the 'd').
644	We do not know if interrupts were enabled in the mean time.
645	
646	# tracer: preemptoff
647	#
648	preemptoff latency trace v1.1.5 on 2.6.26-rc8
649	--------------------------------------------------------------------
650	 latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
651	    -----------------
652	    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
653	    -----------------
654	 => started at: remove_wait_queue
655	 => ended at:   __do_softirq
656	
657	#                _------=> CPU#
658	#               / _-----=> irqs-off
659	#              | / _----=> need-resched
660	#              || / _---=> hardirq/softirq
661	#              ||| / _--=> preempt-depth
662	#              |||| /
663	#              |||||     delay
664	#  cmd     pid ||||| time  |   caller
665	#     \   /    |||||   \   |   /
666	    sshd-4261  0d..1    0us : _spin_lock_irqsave (remove_wait_queue)
667	    sshd-4261  0d..1    1us : _spin_unlock_irqrestore (remove_wait_queue)
668	    sshd-4261  0d..1    2us : do_IRQ (common_interrupt)
669	    sshd-4261  0d..1    2us : irq_enter (do_IRQ)
670	    sshd-4261  0d..1    2us : idle_cpu (irq_enter)
671	    sshd-4261  0d..1    3us : add_preempt_count (irq_enter)
672	    sshd-4261  0d.h1    3us : idle_cpu (irq_enter)
673	    sshd-4261  0d.h.    4us : handle_fasteoi_irq (do_IRQ)
674	[...]
675	    sshd-4261  0d.h.   12us : add_preempt_count (_spin_lock)
676	    sshd-4261  0d.h1   12us : ack_ioapic_quirk_irq (handle_fasteoi_irq)
677	    sshd-4261  0d.h1   13us : move_native_irq (ack_ioapic_quirk_irq)
678	    sshd-4261  0d.h1   13us : _spin_unlock (handle_fasteoi_irq)
679	    sshd-4261  0d.h1   14us : sub_preempt_count (_spin_unlock)
680	    sshd-4261  0d.h1   14us : irq_exit (do_IRQ)
681	    sshd-4261  0d.h1   15us : sub_preempt_count (irq_exit)
682	    sshd-4261  0d..2   15us : do_softirq (irq_exit)
683	    sshd-4261  0d...   15us : __do_softirq (do_softirq)
684	    sshd-4261  0d...   16us : __local_bh_disable (__do_softirq)
685	    sshd-4261  0d...   16us+: add_preempt_count (__local_bh_disable)
686	    sshd-4261  0d.s4   20us : add_preempt_count (__local_bh_disable)
687	    sshd-4261  0d.s4   21us : sub_preempt_count (local_bh_enable)
688	    sshd-4261  0d.s5   21us : sub_preempt_count (local_bh_enable)
689	[...]
690	    sshd-4261  0d.s6   41us : add_preempt_count (__local_bh_disable)
691	    sshd-4261  0d.s6   42us : sub_preempt_count (local_bh_enable)
692	    sshd-4261  0d.s7   42us : sub_preempt_count (local_bh_enable)
693	    sshd-4261  0d.s5   43us : add_preempt_count (__local_bh_disable)
694	    sshd-4261  0d.s5   43us : sub_preempt_count (local_bh_enable_ip)
695	    sshd-4261  0d.s6   44us : sub_preempt_count (local_bh_enable_ip)
696	    sshd-4261  0d.s5   44us : add_preempt_count (__local_bh_disable)
697	    sshd-4261  0d.s5   45us : sub_preempt_count (local_bh_enable)
698	[...]
699	    sshd-4261  0d.s.   63us : _local_bh_enable (__do_softirq)
700	    sshd-4261  0d.s1   64us : trace_preempt_on (__do_softirq)
701	
702	
703	The above is an example of the preemptoff trace with ftrace_enabled
704	set. Here we see that interrupts were disabled the entire time.
705	The irq_enter code lets us know that we entered an interrupt 'h'.
706	Before that, the functions being traced still show that it is not
707	in an interrupt, but we can see from the functions themselves that
708	this is not the case.
709	
710	Notice that __do_softirq when called does not have a preempt_count.
711	It may seem that we missed a preempt enabling. What really happened
712	is that the preempt count is held on the thread's stack and we
713	switched to the softirq stack (4K stacks in effect). The code
714	does not copy the preempt count, but because interrupts are disabled,
715	we do not need to worry about it. Having a tracer like this is good
716	for letting people know what really happens inside the kernel.
717	
718	
719	preemptirqsoff
720	--------------
721	
722	Knowing the locations that have interrupts disabled or preemption
723	disabled for the longest times is helpful. But sometimes we would
724	like to know when either preemption and/or interrupts are disabled.
725	
726	Consider the following code:
727	
728	    local_irq_disable();
729	    call_function_with_irqs_off();
730	    preempt_disable();
731	    call_function_with_irqs_and_preemption_off();
732	    local_irq_enable();
733	    call_function_with_preemption_off();
734	    preempt_enable();
735	
736	The irqsoff tracer will record the total length of
737	call_function_with_irqs_off() and
738	call_function_with_irqs_and_preemption_off().
739	
740	The preemptoff tracer will record the total length of
741	call_function_with_irqs_and_preemption_off() and
742	call_function_with_preemption_off().
743	
744	But neither will trace the time that interrupts and/or preemption
745	is disabled. This total time is the time that we can not schedule.
746	To record this time, use the preemptirqsoff tracer.
747	
748	Again, using this trace is much like the irqsoff and preemptoff tracers.
749	
750	 # echo preemptirqsoff > /debug/tracing/current_tracer
751	 # echo 0 > /debug/tracing/tracing_max_latency
752	 # echo 1 > /debug/tracing/tracing_enabled
753	 # ls -ltr
754	 [...]
755	 # echo 0 > /debug/tracing/tracing_enabled
756	 # cat /debug/tracing/latency_trace
757	# tracer: preemptirqsoff
758	#
759	preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
760	--------------------------------------------------------------------
761	 latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
762	    -----------------
763	    | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0)
764	    -----------------
765	 => started at: apic_timer_interrupt
766	 => ended at:   __do_softirq
767	
768	#                _------=> CPU#
769	#               / _-----=> irqs-off
770	#              | / _----=> need-resched
771	#              || / _---=> hardirq/softirq
772	#              ||| / _--=> preempt-depth
773	#              |||| /
774	#              |||||     delay
775	#  cmd     pid ||||| time  |   caller
776	#     \   /    |||||   \   |   /
777	      ls-4860  0d...    0us!: trace_hardirqs_off_thunk (apic_timer_interrupt)
778	      ls-4860  0d.s.  294us : _local_bh_enable (__do_softirq)
779	      ls-4860  0d.s1  294us : trace_preempt_on (__do_softirq)
780	
781	
782	
783	The trace_hardirqs_off_thunk is called from assembly on x86 when
784	interrupts are disabled in the assembly code. Without the function
785	tracing, we do not know if interrupts were enabled within the preemption
786	points. We do see that it started with preemption enabled.
787	
788	Here is a trace with ftrace_enabled set:
789	
790	
791	# tracer: preemptirqsoff
792	#
793	preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
794	--------------------------------------------------------------------
795	 latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
796	    -----------------
797	    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
798	    -----------------
799	 => started at: write_chan
800	 => ended at:   __do_softirq
801	
802	#                _------=> CPU#
803	#               / _-----=> irqs-off
804	#              | / _----=> need-resched
805	#              || / _---=> hardirq/softirq
806	#              ||| / _--=> preempt-depth
807	#              |||| /
808	#              |||||     delay
809	#  cmd     pid ||||| time  |   caller
810	#     \   /    |||||   \   |   /
811	      ls-4473  0.N..    0us : preempt_schedule (write_chan)
812	      ls-4473  0dN.1    1us : _spin_lock (schedule)
813	      ls-4473  0dN.1    2us : add_preempt_count (_spin_lock)
814	      ls-4473  0d..2    2us : put_prev_task_fair (schedule)
815	[...]
816	      ls-4473  0d..2   13us : set_normalized_timespec (ktime_get_ts)
817	      ls-4473  0d..2   13us : __switch_to (schedule)
818	    sshd-4261  0d..2   14us : finish_task_switch (schedule)
819	    sshd-4261  0d..2   14us : _spin_unlock_irq (finish_task_switch)
820	    sshd-4261  0d..1   15us : add_preempt_count (_spin_lock_irqsave)
821	    sshd-4261  0d..2   16us : _spin_unlock_irqrestore (hrtick_set)
822	    sshd-4261  0d..2   16us : do_IRQ (common_interrupt)
823	    sshd-4261  0d..2   17us : irq_enter (do_IRQ)
824	    sshd-4261  0d..2   17us : idle_cpu (irq_enter)
825	    sshd-4261  0d..2   18us : add_preempt_count (irq_enter)
826	    sshd-4261  0d.h2   18us : idle_cpu (irq_enter)
827	    sshd-4261  0d.h.   18us : handle_fasteoi_irq (do_IRQ)
828	    sshd-4261  0d.h.   19us : _spin_lock (handle_fasteoi_irq)
829	    sshd-4261  0d.h.   19us : add_preempt_count (_spin_lock)
830	    sshd-4261  0d.h1   20us : _spin_unlock (handle_fasteoi_irq)
831	    sshd-4261  0d.h1   20us : sub_preempt_count (_spin_unlock)
832	[...]
833	    sshd-4261  0d.h1   28us : _spin_unlock (handle_fasteoi_irq)
834	    sshd-4261  0d.h1   29us : sub_preempt_count (_spin_unlock)
835	    sshd-4261  0d.h2   29us : irq_exit (do_IRQ)
836	    sshd-4261  0d.h2   29us : sub_preempt_count (irq_exit)
837	    sshd-4261  0d..3   30us : do_softirq (irq_exit)
838	    sshd-4261  0d...   30us : __do_softirq (do_softirq)
839	    sshd-4261  0d...   31us : __local_bh_disable (__do_softirq)
840	    sshd-4261  0d...   31us+: add_preempt_count (__local_bh_disable)
841	    sshd-4261  0d.s4   34us : add_preempt_count (__local_bh_disable)
842	[...]
843	    sshd-4261  0d.s3   43us : sub_preempt_count (local_bh_enable_ip)
844	    sshd-4261  0d.s4   44us : sub_preempt_count (local_bh_enable_ip)
845	    sshd-4261  0d.s3   44us : smp_apic_timer_interrupt (apic_timer_interrupt)
846	    sshd-4261  0d.s3   45us : irq_enter (smp_apic_timer_interrupt)
847	    sshd-4261  0d.s3   45us : idle_cpu (irq_enter)
848	    sshd-4261  0d.s3   46us : add_preempt_count (irq_enter)
849	    sshd-4261  0d.H3   46us : idle_cpu (irq_enter)
850	    sshd-4261  0d.H3   47us : hrtimer_interrupt (smp_apic_timer_interrupt)
851	    sshd-4261  0d.H3   47us : ktime_get (hrtimer_interrupt)
852	[...]
853	    sshd-4261  0d.H3   81us : tick_program_event (hrtimer_interrupt)
854	    sshd-4261  0d.H3   82us : ktime_get (tick_program_event)
855	    sshd-4261  0d.H3   82us : ktime_get_ts (ktime_get)
856	    sshd-4261  0d.H3   83us : getnstimeofday (ktime_get_ts)
857	    sshd-4261  0d.H3   83us : set_normalized_timespec (ktime_get_ts)
858	    sshd-4261  0d.H3   84us : clockevents_program_event (tick_program_event)
859	    sshd-4261  0d.H3   84us : lapic_next_event (clockevents_program_event)
860	    sshd-4261  0d.H3   85us : irq_exit (smp_apic_timer_interrupt)
861	    sshd-4261  0d.H3   85us : sub_preempt_count (irq_exit)
862	    sshd-4261  0d.s4   86us : sub_preempt_count (irq_exit)
863	    sshd-4261  0d.s3   86us : add_preempt_count (__local_bh_disable)
864	[...]
865	    sshd-4261  0d.s1   98us : sub_preempt_count (net_rx_action)
866	    sshd-4261  0d.s.   99us : add_preempt_count (_spin_lock_irq)
867	    sshd-4261  0d.s1   99us+: _spin_unlock_irq (run_timer_softirq)
868	    sshd-4261  0d.s.  104us : _local_bh_enable (__do_softirq)
869	    sshd-4261  0d.s.  104us : sub_preempt_count (_local_bh_enable)
870	    sshd-4261  0d.s.  105us : _local_bh_enable (__do_softirq)
871	    sshd-4261  0d.s1  105us : trace_preempt_on (__do_softirq)
872	
873	
874	This is a very interesting trace. It started with the preemption of
875	the ls task. We see that the task had the "need_resched" bit set
876	via the 'N' in the trace.  Interrupts were disabled before the spin_lock
877	at the beginning of the trace. We see that a schedule took place to run
878	sshd.  When the interrupts were enabled, we took an interrupt.
879	On return from the interrupt handler, the softirq ran. We took another
880	interrupt while running the softirq as we see from the capital 'H'.
881	
882	
883	wakeup
884	------
885	
886	In a Real-Time environment it is very important to know the wakeup
887	time it takes for the highest priority task that is woken up to the
888	time that it executes. This is also known as "schedule latency".
889	I stress the point that this is about RT tasks. It is also important
890	to know the scheduling latency of non-RT tasks, but the average
891	schedule latency is better for non-RT tasks. Tools like
892	LatencyTop are more appropriate for such measurements.
893	
894	Real-Time environments are interested in the worst case latency.
895	That is the longest latency it takes for something to happen, and
896	not the average. We can have a very fast scheduler that may only
897	have a large latency once in a while, but that would not work well
898	with Real-Time tasks.  The wakeup tracer was designed to record
899	the worst case wakeups of RT tasks. Non-RT tasks are not recorded
900	because the tracer only records one worst case and tracing non-RT
901	tasks that are unpredictable will overwrite the worst case latency
902	of RT tasks.
903	
904	Since this tracer only deals with RT tasks, we will run this slightly
905	differently than we did with the previous tracers. Instead of performing
906	an 'ls', we will run 'sleep 1' under 'chrt' which changes the
907	priority of the task.
908	
909	 # echo wakeup > /debug/tracing/current_tracer
910	 # echo 0 > /debug/tracing/tracing_max_latency
911	 # echo 1 > /debug/tracing/tracing_enabled
912	 # chrt -f 5 sleep 1
913	 # echo 0 > /debug/tracing/tracing_enabled
914	 # cat /debug/tracing/latency_trace
915	# tracer: wakeup
916	#
917	wakeup latency trace v1.1.5 on 2.6.26-rc8
918	--------------------------------------------------------------------
919	 latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
920	    -----------------
921	    | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5)
922	    -----------------
923	
924	#                _------=> CPU#
925	#               / _-----=> irqs-off
926	#              | / _----=> need-resched
927	#              || / _---=> hardirq/softirq
928	#              ||| / _--=> preempt-depth
929	#              |||| /
930	#              |||||     delay
931	#  cmd     pid ||||| time  |   caller
932	#     \   /    |||||   \   |   /
933	  <idle>-0     1d.h4    0us+: try_to_wake_up (wake_up_process)
934	  <idle>-0     1d..4    4us : schedule (cpu_idle)
935	
936	
937	
938	Running this on an idle system, we see that it only took 4 microseconds
939	to perform the task switch.  Note, since the trace marker in the
940	schedule is before the actual "switch", we stop the tracing when
941	the recorded task is about to schedule in. This may change if
942	we add a new marker at the end of the scheduler.
943	
944	Notice that the recorded task is 'sleep' with the PID of 4901 and it
945	has an rt_prio of 5. This priority is user-space priority and not
946	the internal kernel priority. The policy is 1 for SCHED_FIFO and 2
947	for SCHED_RR.
948	
949	Doing the same with chrt -r 5 and ftrace_enabled set.
950	
951	# tracer: wakeup
952	#
953	wakeup latency trace v1.1.5 on 2.6.26-rc8
954	--------------------------------------------------------------------
955	 latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
956	    -----------------
957	    | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5)
958	    -----------------
959	
960	#                _------=> CPU#
961	#               / _-----=> irqs-off
962	#              | / _----=> need-resched
963	#              || / _---=> hardirq/softirq
964	#              ||| / _--=> preempt-depth
965	#              |||| /
966	#              |||||     delay
967	#  cmd     pid ||||| time  |   caller
968	#     \   /    |||||   \   |   /
969	ksoftirq-7     1d.H3    0us : try_to_wake_up (wake_up_process)
970	ksoftirq-7     1d.H4    1us : sub_preempt_count (marker_probe_cb)
971	ksoftirq-7     1d.H3    2us : check_preempt_wakeup (try_to_wake_up)
972	ksoftirq-7     1d.H3    3us : update_curr (check_preempt_wakeup)
973	ksoftirq-7     1d.H3    4us : calc_delta_mine (update_curr)
974	ksoftirq-7     1d.H3    5us : __resched_task (check_preempt_wakeup)
975	ksoftirq-7     1d.H3    6us : task_wake_up_rt (try_to_wake_up)
976	ksoftirq-7     1d.H3    7us : _spin_unlock_irqrestore (try_to_wake_up)
977	[...]
978	ksoftirq-7     1d.H2   17us : irq_exit (smp_apic_timer_interrupt)
979	ksoftirq-7     1d.H2   18us : sub_preempt_count (irq_exit)
980	ksoftirq-7     1d.s3   19us : sub_preempt_count (irq_exit)
981	ksoftirq-7     1..s2   20us : rcu_process_callbacks (__do_softirq)
982	[...]
983	ksoftirq-7     1..s2   26us : __rcu_process_callbacks (rcu_process_callbacks)
984	ksoftirq-7     1d.s2   27us : _local_bh_enable (__do_softirq)
985	ksoftirq-7     1d.s2   28us : sub_preempt_count (_local_bh_enable)
986	ksoftirq-7     1.N.3   29us : sub_preempt_count (ksoftirqd)
987	ksoftirq-7     1.N.2   30us : _cond_resched (ksoftirqd)
988	ksoftirq-7     1.N.2   31us : __cond_resched (_cond_resched)
989	ksoftirq-7     1.N.2   32us : add_preempt_count (__cond_resched)
990	ksoftirq-7     1.N.2   33us : schedule (__cond_resched)
991	ksoftirq-7     1.N.2   33us : add_preempt_count (schedule)
992	ksoftirq-7     1.N.3   34us : hrtick_clear (schedule)
993	ksoftirq-7     1dN.3   35us : _spin_lock (schedule)
994	ksoftirq-7     1dN.3   36us : add_preempt_count (_spin_lock)
995	ksoftirq-7     1d..4   37us : put_prev_task_fair (schedule)
996	ksoftirq-7     1d..4   38us : update_curr (put_prev_task_fair)
997	[...]
998	ksoftirq-7     1d..5   47us : _spin_trylock (tracing_record_cmdline)
999	ksoftirq-7     1d..5   48us : add_preempt_count (_spin_trylock)
1000	ksoftirq-7     1d..6   49us : _spin_unlock (tracing_record_cmdline)
1001	ksoftirq-7     1d..6   49us : sub_preempt_count (_spin_unlock)
1002	ksoftirq-7     1d..4   50us : schedule (__cond_resched)
1003	
1004	The interrupt went off while running ksoftirqd. This task runs at
1005	SCHED_OTHER. Why did not we see the 'N' set early? This may be
1006	a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K stacks
1007	configured, the interrupt and softirq run with their own stack.
1008	Some information is held on the top of the task's stack (need_resched
1009	and preempt_count are both stored there). The setting of the NEED_RESCHED
1010	bit is done directly to the task's stack, but the reading of the
1011	NEED_RESCHED is done by looking at the current stack, which in this case
1012	is the stack for the hard interrupt. This hides the fact that NEED_RESCHED
1013	has been set. We do not see the 'N' until we switch back to the task's
1014	assigned stack.
1015	
1016	function
1017	--------
1018	
1019	This tracer is the function tracer. Enabling the function tracer
1020	can be done from the debug file system. Make sure the ftrace_enabled is
1021	set; otherwise this tracer is a nop.
1022	
1023	 # sysctl kernel.ftrace_enabled=1
1024	 # echo function > /debug/tracing/current_tracer
1025	 # echo 1 > /debug/tracing/tracing_enabled
1026	 # usleep 1
1027	 # echo 0 > /debug/tracing/tracing_enabled
1028	 # cat /debug/tracing/trace
1029	# tracer: function
1030	#
1031	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1032	#              | |      |          |         |
1033	            bash-4003  [00]   123.638713: finish_task_switch <-schedule
1034	            bash-4003  [00]   123.638714: _spin_unlock_irq <-finish_task_switch
1035	            bash-4003  [00]   123.638714: sub_preempt_count <-_spin_unlock_irq
1036	            bash-4003  [00]   123.638715: hrtick_set <-schedule
1037	            bash-4003  [00]   123.638715: _spin_lock_irqsave <-hrtick_set
1038	            bash-4003  [00]   123.638716: add_preempt_count <-_spin_lock_irqsave
1039	            bash-4003  [00]   123.638716: _spin_unlock_irqrestore <-hrtick_set
1040	            bash-4003  [00]   123.638717: sub_preempt_count <-_spin_unlock_irqrestore
1041	            bash-4003  [00]   123.638717: hrtick_clear <-hrtick_set
1042	            bash-4003  [00]   123.638718: sub_preempt_count <-schedule
1043	            bash-4003  [00]   123.638718: sub_preempt_count <-preempt_schedule
1044	            bash-4003  [00]   123.638719: wait_for_completion <-__stop_machine_run
1045	            bash-4003  [00]   123.638719: wait_for_common <-wait_for_completion
1046	            bash-4003  [00]   123.638720: _spin_lock_irq <-wait_for_common
1047	            bash-4003  [00]   123.638720: add_preempt_count <-_spin_lock_irq
1048	[...]
1049	
1050	
1051	Note: function tracer uses ring buffers to store the above entries.
1052	The newest data may overwrite the oldest data. Sometimes using echo to
1053	stop the trace is not sufficient because the tracing could have overwritten
1054	the data that you wanted to record. For this reason, it is sometimes better to
1055	disable tracing directly from a program. This allows you to stop the
1056	tracing at the point that you hit the part that you are interested in.
1057	To disable the tracing directly from a C program, something like following
1058	code snippet can be used:
1059	
1060	int trace_fd;
1061	[...]
1062	int main(int argc, char *argv[]) {
1063		[...]
1064		trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY);
1065		[...]
1066		if (condition_hit()) {
1067			write(trace_fd, "0", 1);
1068		}
1069		[...]
1070	}
1071	
1072	Note: Here we hard coded the path name. The debugfs mount is not
1073	guaranteed to be at /debug (and is more commonly at /sys/kernel/debug).
1074	For simple one time traces, the above is sufficent. For anything else,
1075	a search through /proc/mounts may be needed to find where the debugfs
1076	file-system is mounted.
1077	
1078	
1079	Single thread tracing
1080	---------------------
1081	
1082	By writing into /debug/tracing/set_ftrace_pid you can trace a
1083	single thread. For example:
1084	
1085	# cat /debug/tracing/set_ftrace_pid
1086	no pid
1087	# echo 3111 > /debug/tracing/set_ftrace_pid
1088	# cat /debug/tracing/set_ftrace_pid
1089	3111
1090	# echo function > /debug/tracing/current_tracer
1091	# cat /debug/tracing/trace | head
1092	 # tracer: function
1093	 #
1094	 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
1095	 #              | |       |          |         |
1096	     yum-updatesd-3111  [003]  1637.254676: finish_task_switch <-thread_return
1097	     yum-updatesd-3111  [003]  1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
1098	     yum-updatesd-3111  [003]  1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
1099	     yum-updatesd-3111  [003]  1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
1100	     yum-updatesd-3111  [003]  1637.254685: fget_light <-do_sys_poll
1101	     yum-updatesd-3111  [003]  1637.254686: pipe_poll <-do_sys_poll
1102	# echo -1 > /debug/tracing/set_ftrace_pid
1103	# cat /debug/tracing/trace |head
1104	 # tracer: function
1105	 #
1106	 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
1107	 #              | |       |          |         |
1108	 ##### CPU 3 buffer started ####
1109	     yum-updatesd-3111  [003]  1701.957688: free_poll_entry <-poll_freewait
1110	     yum-updatesd-3111  [003]  1701.957689: remove_wait_queue <-free_poll_entry
1111	     yum-updatesd-3111  [003]  1701.957691: fput <-free_poll_entry
1112	     yum-updatesd-3111  [003]  1701.957692: audit_syscall_exit <-sysret_audit
1113	     yum-updatesd-3111  [003]  1701.957693: path_put <-audit_syscall_exit
1114	
1115	If you want to trace a function when executing, you could use
1116	something like this simple program:
1117	
1118	#include <stdio.h>
1119	#include <stdlib.h>
1120	#include <sys/types.h>
1121	#include <sys/stat.h>
1122	#include <fcntl.h>
1123	#include <unistd.h>
1124	
1125	int main (int argc, char **argv)
1126	{
1127	        if (argc < 1)
1128	                exit(-1);
1129	
1130	        if (fork() > 0) {
1131	                int fd, ffd;
1132	                char line[64];
1133	                int s;
1134	
1135	                ffd = open("/debug/tracing/current_tracer", O_WRONLY);
1136	                if (ffd < 0)
1137	                        exit(-1);
1138	                write(ffd, "nop", 3);
1139	
1140	                fd = open("/debug/tracing/set_ftrace_pid", O_WRONLY);
1141	                s = sprintf(line, "%d\n", getpid());
1142	                write(fd, line, s);
1143	
1144	                write(ffd, "function", 8);
1145	
1146	                close(fd);
1147	                close(ffd);
1148	
1149	                execvp(argv[1], argv+1);
1150	        }
1151	
1152	        return 0;
1153	}
1154	
1155	dynamic ftrace
1156	--------------
1157	
1158	If CONFIG_DYNAMIC_FTRACE is set, the system will run with
1159	virtually no overhead when function tracing is disabled. The way
1160	this works is the mcount function call (placed at the start of
1161	every kernel function, produced by the -pg switch in gcc), starts
1162	of pointing to a simple return. (Enabling FTRACE will include the
1163	-pg switch in the compiling of the kernel.)
1164	
1165	At compile time every C file object is run through the
1166	recordmcount.pl script (located in the scripts directory). This
1167	script will process the C object using objdump to find all the
1168	locations in the .text section that call mcount. (Note, only
1169	the .text section is processed, since processing other sections
1170	like .init.text may cause races due to those sections being freed).
1171	
1172	A new section called "__mcount_loc" is created that holds references
1173	to all the mcount call sites in the .text section. This section is
1174	compiled back into the original object. The final linker will add
1175	all these references into a single table.
1176	
1177	On boot up, before SMP is initialized, the dynamic ftrace code
1178	scans this table and updates all the locations into nops. It also
1179	records the locations, which are added to the available_filter_functions
1180	list.  Modules are processed as they are loaded and before they are
1181	executed.  When a module is unloaded, it also removes its functions from
1182	the ftrace function list. This is automatic in the module unload
1183	code, and the module author does not need to worry about it.
1184	
1185	When tracing is enabled, kstop_machine is called to prevent races
1186	with the CPUS executing code being modified (which can cause the
1187	CPU to do undesireable things), and the nops are patched back
1188	to calls. But this time, they do not call mcount (which is just
1189	a function stub). They now call into the ftrace infrastructure.
1190	
1191	One special side-effect to the recording of the functions being
1192	traced is that we can now selectively choose which functions we
1193	wish to trace and which ones we want the mcount calls to remain as
1194	nops.
1195	
1196	Two files are used, one for enabling and one for disabling the tracing
1197	of specified functions. They are:
1198	
1199	  set_ftrace_filter
1200	
1201	and
1202	
1203	  set_ftrace_notrace
1204	
1205	A list of available functions that you can add to these files is listed
1206	in:
1207	
1208	   available_filter_functions
1209	
1210	 # cat /debug/tracing/available_filter_functions
1211	put_prev_task_idle
1212	kmem_cache_create
1213	pick_next_task_rt
1214	get_online_cpus
1215	pick_next_task_fair
1216	mutex_lock
1217	[...]
1218	
1219	If I am only interested in sys_nanosleep and hrtimer_interrupt:
1220	
1221	 # echo sys_nanosleep hrtimer_interrupt \
1222			> /debug/tracing/set_ftrace_filter
1223	 # echo ftrace > /debug/tracing/current_tracer
1224	 # echo 1 > /debug/tracing/tracing_enabled
1225	 # usleep 1
1226	 # echo 0 > /debug/tracing/tracing_enabled
1227	 # cat /debug/tracing/trace
1228	# tracer: ftrace
1229	#
1230	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1231	#              | |      |          |         |
1232	          usleep-4134  [00]  1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt
1233	          usleep-4134  [00]  1317.070111: sys_nanosleep <-syscall_call
1234	          <idle>-0     [00]  1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
1235	
1236	To see which functions are being traced, you can cat the file:
1237	
1238	 # cat /debug/tracing/set_ftrace_filter
1239	hrtimer_interrupt
1240	sys_nanosleep
1241	
1242	
1243	Perhaps this is not enough. The filters also allow simple wild cards.
1244	Only the following are currently available
1245	
1246	  <match>*  - will match functions that begin with <match>
1247	  *<match>  - will match functions that end with <match>
1248	  *<match>* - will match functions that have <match> in it
1249	
1250	These are the only wild cards which are supported.
1251	
1252	  <match>*<match> will not work.
1253	
1254	Note: It is better to use quotes to enclose the wild cards, otherwise
1255	  the shell may expand the parameters into names of files in the local
1256	  directory.
1257	
1258	 # echo 'hrtimer_*' > /debug/tracing/set_ftrace_filter
1259	
1260	Produces:
1261	
1262	# tracer: ftrace
1263	#
1264	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1265	#              | |      |          |         |
1266	            bash-4003  [00]  1480.611794: hrtimer_init <-copy_process
1267	            bash-4003  [00]  1480.611941: hrtimer_start <-hrtick_set
1268	            bash-4003  [00]  1480.611956: hrtimer_cancel <-hrtick_clear
1269	            bash-4003  [00]  1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel
1270	          <idle>-0     [00]  1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt
1271	          <idle>-0     [00]  1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt
1272	          <idle>-0     [00]  1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt
1273	          <idle>-0     [00]  1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt
1274	          <idle>-0     [00]  1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt
1275	
1276	
1277	Notice that we lost the sys_nanosleep.
1278	
1279	 # cat /debug/tracing/set_ftrace_filter
1280	hrtimer_run_queues
1281	hrtimer_run_pending
1282	hrtimer_init
1283	hrtimer_cancel
1284	hrtimer_try_to_cancel
1285	hrtimer_forward
1286	hrtimer_start
1287	hrtimer_reprogram
1288	hrtimer_force_reprogram
1289	hrtimer_get_next_event
1290	hrtimer_interrupt
1291	hrtimer_nanosleep
1292	hrtimer_wakeup
1293	hrtimer_get_remaining
1294	hrtimer_get_res
1295	hrtimer_init_sleeper
1296	
1297	
1298	This is because the '>' and '>>' act just like they do in bash.
1299	To rewrite the filters, use '>'
1300	To append to the filters, use '>>'
1301	
1302	To clear out a filter so that all functions will be recorded again:
1303	
1304	 # echo > /debug/tracing/set_ftrace_filter
1305	 # cat /debug/tracing/set_ftrace_filter
1306	 #
1307	
1308	Again, now we want to append.
1309	
1310	 # echo sys_nanosleep > /debug/tracing/set_ftrace_filter
1311	 # cat /debug/tracing/set_ftrace_filter
1312	sys_nanosleep
1313	 # echo 'hrtimer_*' >> /debug/tracing/set_ftrace_filter
1314	 # cat /debug/tracing/set_ftrace_filter
1315	hrtimer_run_queues
1316	hrtimer_run_pending
1317	hrtimer_init
1318	hrtimer_cancel
1319	hrtimer_try_to_cancel
1320	hrtimer_forward
1321	hrtimer_start
1322	hrtimer_reprogram
1323	hrtimer_force_reprogram
1324	hrtimer_get_next_event
1325	hrtimer_interrupt
1326	sys_nanosleep
1327	hrtimer_nanosleep
1328	hrtimer_wakeup
1329	hrtimer_get_remaining
1330	hrtimer_get_res
1331	hrtimer_init_sleeper
1332	
1333	
1334	The set_ftrace_notrace prevents those functions from being traced.
1335	
1336	 # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace
1337	
1338	Produces:
1339	
1340	# tracer: ftrace
1341	#
1342	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1343	#              | |      |          |         |
1344	            bash-4043  [01]   115.281644: finish_task_switch <-schedule
1345	            bash-4043  [01]   115.281645: hrtick_set <-schedule
1346	            bash-4043  [01]   115.281645: hrtick_clear <-hrtick_set
1347	            bash-4043  [01]   115.281646: wait_for_completion <-__stop_machine_run
1348	            bash-4043  [01]   115.281647: wait_for_common <-wait_for_completion
1349	            bash-4043  [01]   115.281647: kthread_stop <-stop_machine_run
1350	            bash-4043  [01]   115.281648: init_waitqueue_head <-kthread_stop
1351	            bash-4043  [01]   115.281648: wake_up_process <-kthread_stop
1352	            bash-4043  [01]   115.281649: try_to_wake_up <-wake_up_process
1353	
1354	We can see that there's no more lock or preempt tracing.
1355	
1356	trace_pipe
1357	----------
1358	
1359	The trace_pipe outputs the same content as the trace file, but the effect
1360	on the tracing is different. Every read from trace_pipe is consumed.
1361	This means that subsequent reads will be different. The trace
1362	is live.
1363	
1364	 # echo function > /debug/tracing/current_tracer
1365	 # cat /debug/tracing/trace_pipe > /tmp/trace.out &
1366	[1] 4153
1367	 # echo 1 > /debug/tracing/tracing_enabled
1368	 # usleep 1
1369	 # echo 0 > /debug/tracing/tracing_enabled
1370	 # cat /debug/tracing/trace
1371	# tracer: function
1372	#
1373	#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1374	#              | |      |          |         |
1375	
1376	 #
1377	 # cat /tmp/trace.out
1378	            bash-4043  [00] 41.267106: finish_task_switch <-schedule
1379	            bash-4043  [00] 41.267106: hrtick_set <-schedule
1380	            bash-4043  [00] 41.267107: hrtick_clear <-hrtick_set
1381	            bash-4043  [00] 41.267108: wait_for_completion <-__stop_machine_run
1382	            bash-4043  [00] 41.267108: wait_for_common <-wait_for_completion
1383	            bash-4043  [00] 41.267109: kthread_stop <-stop_machine_run
1384	            bash-4043  [00] 41.267109: init_waitqueue_head <-kthread_stop
1385	            bash-4043  [00] 41.267110: wake_up_process <-kthread_stop
1386	            bash-4043  [00] 41.267110: try_to_wake_up <-wake_up_process
1387	            bash-4043  [00] 41.267111: select_task_rq_rt <-try_to_wake_up
1388	
1389	
1390	Note, reading the trace_pipe file will block until more input is added.
1391	By changing the tracer, trace_pipe will issue an EOF. We needed
1392	to set the function tracer _before_ we "cat" the trace_pipe file.
1393	
1394	
1395	trace entries
1396	-------------
1397	
1398	Having too much or not enough data can be troublesome in diagnosing
1399	an issue in the kernel. The file buffer_size_kb is used to modify
1400	the size of the internal trace buffers. The number listed
1401	is the number of entries that can be recorded per CPU. To know
1402	the full size, multiply the number of possible CPUS with the
1403	number of entries.
1404	
1405	 # cat /debug/tracing/buffer_size_kb
1406	1408 (units kilobytes)
1407	
1408	Note, to modify this, you must have tracing completely disabled. To do that,
1409	echo "nop" into the current_tracer. If the current_tracer is not set
1410	to "nop", an EINVAL error will be returned.
1411	
1412	 # echo nop > /debug/tracing/current_tracer
1413	 # echo 10000 > /debug/tracing/buffer_size_kb
1414	 # cat /debug/tracing/buffer_size_kb
1415	10000 (units kilobytes)
1416	
1417	The number of pages which will be allocated is limited to a percentage
1418	of available memory. Allocating too much will produce an error.
1419	
1420	 # echo 1000000000000 > /debug/tracing/buffer_size_kb
1421	-bash: echo: write error: Cannot allocate memory
1422	 # cat /debug/tracing/buffer_size_kb
1423	85
Hide Line Numbers
About Kernel Documentation Linux Kernel Contact Linux Resources Linux Blog
Information is copyright its respective author.
All material is available from the Linux Kernel Source distributed under a GPL License.
Hosted by mjmwired.net.