Documentation / trace / eprobetrace.rst


Based on kernel version 6.17. Page generated on 2025-10-03 10:04 EST.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269
.. SPDX-License-Identifier: GPL-2.0

==================================
Eprobe - Event-based Probe Tracing
==================================

:Author: Steven Rostedt <rostedt@goodmis.org>

- Written for v6.17

Overview
========

Eprobes are dynamic events that are placed on existing events to either
dereference a field that is a pointer, or simply to limit what fields are
recorded in the trace event.

Eprobes depend on kprobe events so to enable this feature, build your kernel
with CONFIG_EPROBE_EVENTS=y.

Eprobes are created via the /sys/kernel/tracing/dynamic_events file.

Synopsis of eprobe_events
-------------------------
::

  e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS]	: Set a probe
  -:[EGRP/][EEVENT]				: Clear a probe

 EGRP		: Group name of the new event. If omitted, use "eprobes" for it.
 EEVENT		: Event name. If omitted, the event name is generated and will
		  be the same event name as the event it attached to.
 GRP		: Group name of the event to attach to.
 EVENT		: Event name of the event to attach to.

 FETCHARGS	: Arguments. Each probe can have up to 128 args.
  $FIELD	: Fetch the value of the event field called FIELD.
  @ADDR		: Fetch memory at ADDR (ADDR should be in kernel)
  @SYM[+|-offs]	: Fetch memory at SYM +|- offs (SYM should be a data symbol)
  $comm		: Fetch current task comm.
  +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
  \IMM		: Store an immediate value to the argument.
  NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
  FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
		  (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
                  "string", "ustring", "symbol", "symstr" and "bitfield" are
                  supported.

Types
-----
The FETCHARGS above is very similar to the kprobe events as described in
Documentation/trace/kprobetrace.rst.

The difference between eprobes and kprobes FETCHARGS is that eprobes has a
$FIELD command that returns the content of the event field of the event
that is attached. Eprobes do not have access to registers, stacks and function
arguments that kprobes has.

If a field argument is a pointer, it may be dereferenced just like a memory
address using the FETCHARGS syntax.


Attaching to dynamic events
---------------------------

Eprobes may attach to dynamic events as well as to normal events. It may
attach to a kprobe event, a synthetic event or a fprobe event. This is useful
if the type of a field needs to be changed. See Example 2 below.

Usage examples
==============

Example 1
---------

The basic usage of eprobes is to limit the data that is being recorded into
the tracing buffer. For example, a common event to trace is the sched_switch
trace event. That has a format of::

	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:char prev_comm[16];	offset:8;	size:16;	signed:0;
	field:pid_t prev_pid;	offset:24;	size:4;	signed:1;
	field:int prev_prio;	offset:28;	size:4;	signed:1;
	field:long prev_state;	offset:32;	size:8;	signed:1;
	field:char next_comm[16];	offset:40;	size:16;	signed:0;
	field:pid_t next_pid;	offset:56;	size:4;	signed:1;
	field:int next_prio;	offset:60;	size:4;	signed:1;

The first four fields are common to all events and can not be limited. But the
rest of the event has 60 bytes of information. It records the names of the
previous and next tasks being scheduled out and in, as well as their pids and
priorities. It also records the state of the previous task. If only the pids
of the tasks are of interest, why waste the ring buffer with all the other
fields?

An eprobe can limit what gets recorded. Note, it does not help in performance,
as all the fields are recorded in a temporary buffer to process the eprobe.
::

 # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events
 # echo 1 > /sys/kernel/tracing/events/sched/switch/enable
 # cat /sys/kernel/tracing/trace

 # tracer: nop
 #
 # entries-in-buffer/entries-written: 2721/2721   #P:8
 #
 #                                _-----=> irqs-off/BH-disabled
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| / _-=> migrate-disable
 #                              |||| /     delay
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
     sshd-session-1082    [004] d..4.  5041.239906: switch: (sched.sched_switch) prev=1082 next=0
             bash-1085    [001] d..4.  5041.240198: switch: (sched.sched_switch) prev=1085 next=141
    kworker/u34:5-141     [001] d..4.  5041.240259: switch: (sched.sched_switch) prev=141 next=1085
           <idle>-0       [004] d..4.  5041.240354: switch: (sched.sched_switch) prev=0 next=1082
             bash-1085    [001] d..4.  5041.240385: switch: (sched.sched_switch) prev=1085 next=141
    kworker/u34:5-141     [001] d..4.  5041.240410: switch: (sched.sched_switch) prev=141 next=1085
             bash-1085    [001] d..4.  5041.240478: switch: (sched.sched_switch) prev=1085 next=0
     sshd-session-1082    [004] d..4.  5041.240526: switch: (sched.sched_switch) prev=1082 next=0
           <idle>-0       [001] d..4.  5041.247524: switch: (sched.sched_switch) prev=0 next=90
           <idle>-0       [002] d..4.  5041.247545: switch: (sched.sched_switch) prev=0 next=16
      kworker/1:1-90      [001] d..4.  5041.247580: switch: (sched.sched_switch) prev=90 next=0
        rcu_sched-16      [002] d..4.  5041.247591: switch: (sched.sched_switch) prev=16 next=0
           <idle>-0       [002] d..4.  5041.257536: switch: (sched.sched_switch) prev=0 next=16
        rcu_sched-16      [002] d..4.  5041.257573: switch: (sched.sched_switch) prev=16 next=0

Note, without adding the "u32" after the prev_pid and next_pid, the values
would default showing in hexadecimal.

Example 2
---------

If a specific system call is to be recorded but the syscalls events are not
enabled, the raw_syscalls can still be used (syscalls are system call
events are not normal events, but are created from the raw_syscalls events
within the kernel). In order to trace the openat system call, one can create
an event probe on top of the raw_syscalls event:
::

 # cd /sys/kernel/tracing
 # cat events/raw_syscalls/sys_enter/format
 name: sys_enter
 ID: 395
 format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:long id;	offset:8;	size:8;	signed:1;
	field:unsigned long args[6];	offset:16;	size:48;	signed:0;

 print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]

From the source code, the sys_openat() has:
::

 int sys_openat(int dirfd, const char *path, int flags, mode_t mode)
 {
	return my_syscall4(__NR_openat, dirfd, path, flags, mode);
 }

The path is the second parameter, and that is what is wanted.
::

 # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events

This is being run on x86_64 where the word size is 8 bytes and the openat
system call __NR_openat is set at 257.
::

 # echo 'nr == 257' > events/eprobes/openat/filter

Now enable the event and look at the trace.
::

 # echo 1 > events/eprobes/openat/enable
 # cat trace

 # tracer: nop
 #
 # entries-in-buffer/entries-written: 4/4   #P:8
 #
 #                                _-----=> irqs-off/BH-disabled
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| / _-=> migrate-disable
 #                              |||| /     delay
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
              cat-1298    [003] ...2.  2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
              cat-1298    [003] ...2.  2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
              cat-1298    [003] ...2.  2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
              cat-1298    [003] ...2.  2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)

The filename shows "(fault)". This is likely because the filename has not been
pulled into memory yet and currently trace events cannot fault in memory that
is not present. When an eprobe tries to read memory that has not been faulted
in yet, it will show the "(fault)" text.

To get around this, as the kernel will likely pull in this filename and make
it present, attaching it to a synthetic event that can pass the address of the
filename from the entry of the event to the end of the event, this can be used
to show the filename when the system call returns.

Remove the old eprobe::

 # echo 1 > events/eprobes/openat/enable
 # echo '-:openat' >> dynamic_events

This time make an eprobe where the address of the filename is saved::

 # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events

Create a synthetic event that passes the address of the filename to the
end of the event::

 # echo 's:filename u64 file' >> dynamic_events
 # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger
 # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger

Now that the address of the filename has been passed to the end of the
system call, create another eprobe to attach to the exit event to show the
string::

 # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
 # echo 1 > events/eprobes/openat/enable
 # cat trace

 # tracer: nop
 #
 # entries-in-buffer/entries-written: 4/4   #P:8
 #
 #                                _-----=> irqs-off/BH-disabled
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| / _-=> migrate-disable
 #                              |||| /     delay
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
              cat-1331    [001] ...5.  2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache"
              cat-1331    [001] ...5.  2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6"
              cat-1331    [001] ...5.  2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive"
              cat-1331    [001] ...5.  2944.831362: openat: (synthetic.filename) filename="trace"

Example 3
---------

If syscall trace events are available, the above would not need the first
eprobe, but it would still need the last one::

 # echo 's:filename u64 file' >> dynamic_events
 # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger
 # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger
 # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
 # echo 1 > events/eprobes/openat/enable

And this would produce the same result as Example 2.