Tracing in the Linux kernel

Contents

Using the Linux Kernel Tracepoints
Purpose of tracepoints
Usage
Event Tracing
1. Introduction
2. Using Event Tracing
2.1 Via the ‘set_event’ interface
2.2 Via the ‘enable’ toggle
2.3 Boot option
3. Defining an event-enabled tracepoint
4. Event formats
5. Event filtering
5.1 Expression syntax
5.2 Setting filters
5.3 Clearing filters
5.3 Subsystem filters
5.4 PID filtering
6. Event triggers
6.1 Expression syntax
6.2 Supported trigger commands

Using the Linux Kernel Tracepoints

This document introduces Linux Kernel Tracepoints and their use. It provides examples of how to insert tracepoints in the kernel and connect probe functions to them and provides some examples of probe functions.

Purpose of tracepoints

A tracepoint placed in code provides a hook to call a function (probe) that you can provide at runtime. A tracepoint can be “on” (a probe is connected to it) or “off” (no probe is attached). When a tracepoint is “off” it has no effect, except for adding a tiny time penalty (checking a condition for a branch) and space penalty (adding a few bytes for the function call at the end of the instrumented function and adding a data structure in a separate section). When a tracepoint is “on”, the function you provide is called each time the tracepoint is executed, in the execution context of the caller. When the function provided ends its execution, it returns to the caller (continuing from the tracepoint site).

You can put tracepoints at important locations in the code. They are lightweight hooks that can pass an arbitrary number of parameters, whose prototypes are described in a tracepoint declaration placed in a header file.

They can be used for tracing and performance accounting.

Usage

Two elements are required for tracepoints:

A tracepoint definition, placed in a header file.

The tracepoint statement, in C code.

In order to use tracepoints, you should include linux/tracepoint.h.

In the header file (include/trace/events/subsys.h in the upstream example), the tracepoint is declared with DECLARE_TRACE(), which takes the tracepoint name subsys_eventname followed by TP_PROTO() and TP_ARGS(). In subsys/file.c (where the tracing statement must be added), the header is included and the tracing statement is a call to trace_subsys_eventname(). In these:

subsys_eventname is an identifier unique to your event.

subsys is the name of your subsystem.

eventname is the name of the event to trace.

TP_PROTO(int firstarg, struct task_struct *p) is the prototype of the function called by this tracepoint.

TP_ARGS(firstarg, p) are the parameter names, same as found in the prototype.

If you use the header in multiple source files, #define CREATE_TRACE_POINTS should appear only in one source file.

Connecting a function (probe) to a tracepoint is done by providing a probe (function to call) for the specific tracepoint through register_trace_subsys_eventname(). Removing a probe is done through unregister_trace_subsys_eventname().

tracepoint_synchronize_unregister() must be called before the end of the module exit function to make sure there is no caller left using the probe. This, and the fact that preemption is disabled around the probe call, make sure that probe removal and module unload are safe.

The tracepoint mechanism supports inserting multiple instances of the same tracepoint, but a single definition must be made of a given tracepoint name over all the kernel to make sure no type conflict will occur. Name mangling of the tracepoints is done using the prototypes to make sure typing is correct. Verification of probe type correctness is done at the registration site by the compiler. Tracepoints can be put in inline functions, inlined static functions, and unrolled loops as well as regular functions.

The naming scheme “subsys_event” is suggested here as a convention intended to limit collisions. Tracepoint names are global to the kernel: they are considered as being the same whether they are in the core kernel image or in modules.

If the tracepoint has to be used in kernel modules, an EXPORT_TRACEPOINT_SYMBOL_GPL() or EXPORT_TRACEPOINT_SYMBOL() can be used to export the defined tracepoints.

If you need to do a bit of work for a tracepoint parameter, and that work is only used for the tracepoint, that work can be encapsulated within an if statement that tests trace_<tracepoint>_enabled(), with both the extra work and the trace_<tracepoint>() call placed inside its body.

All trace_<tracepoint>() calls have a matching trace_<tracepoint>_enabled() function defined that returns true if the tracepoint is enabled and false otherwise. The trace_<tracepoint>() call should always be within the block of the if (trace_<tracepoint>_enabled()) to prevent races between the tracepoint being enabled and the check being seen.

The advantage of using trace_<tracepoint>_enabled() is that it uses the static_key of the tracepoint to allow the if statement to be implemented with jump labels and avoid conditional branches.

The convenience macro TRACE_EVENT provides an alternative way to define tracepoints. Check http://lwn.net/Articles/379903, http://lwn.net/Articles/381064 and http://lwn.net/Articles/383362 for a series of articles with more details.

If you need to call a tracepoint from a header file, it is not recommended to call it directly or to use the trace_<tracepoint>_enabled() function call. Tracepoints in header files can have side effects if the header is included from a file that has CREATE_TRACE_POINTS set, and trace_<tracepoint>() is not a small inline, so it can bloat the kernel if used by other inlined functions. Instead, include tracepoint-defs.h and use tracepoint_enabled().


Event Tracing

Author:	Theodore Ts’o
Updated:	Li Zefan and Tom Zanussi

1. Introduction

Tracepoints (see Documentation/trace/tracepoints.rst) can be used without creating custom kernel modules to register probe functions using the event tracing infrastructure.

Not all tracepoints can be traced using the event tracing system; the kernel developer must provide code snippets which define how the tracing information is saved into the tracing buffer, and how the tracing information should be printed.

2. Using Event Tracing

2.1 Via the ‘set_event’ interface

The events which are available for tracing can be found in the file /sys/kernel/debug/tracing/available_events.

To enable a particular event, such as ‘sched_wakeup’, append its name to /sys/kernel/debug/tracing/set_event using the shell’s ‘>>’ redirection.

‘>>’ is necessary; with a plain ‘>’ the write first disables all the events.
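The append described above can be wrapped in a small shell helper. This is a minimal sketch rather than part of the kernel interface: the function name enable_event and the TRACING variable are assumptions introduced here so that the path can be overridden for experiments.

```shell
# Assumption: TRACING points at the tracing directory used in the text.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# enable_event EVENT (hypothetical helper name)
# Appends with '>>' so that events which are already enabled stay enabled.
enable_event() {
    echo "$1" >> "$TRACING/set_event"
}

# Example (as root): enable_event sched_wakeup
```

Calling the helper twice with different event names leaves both events selected, which is the behaviour a plain ‘>’ would not give.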

To disable an event, echo the event name to the set_event file prefixed with an exclamation point (for example, !sched_wakeup).

To disable all events, echo an empty line to the set_event file.

To enable all events, echo *:* or *: to the set_event file.
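The three forms above (disable one event, disable everything, enable everything) might be scripted as follows. The helper names and the TRACING variable are illustrative assumptions; only the strings written to set_event come from the text.

```shell
# Assumption: TRACING points at the tracing directory used in the text.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# Disable one event by appending its name prefixed with '!'.
# printf with a single-quoted format keeps interactive shells from
# applying history expansion to the '!'.
disable_event() {
    printf '!%s\n' "$1" >> "$TRACING/set_event"
}

# Disable every event by writing an empty line.
disable_all_events() {
    echo > "$TRACING/set_event"
}

# Enable every event of every subsystem.
enable_all_events() {
    echo '*:*' > "$TRACING/set_event"
}
```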

The events are organized into subsystems, such as ext4, irq, sched, etc., and a full event name looks like this: <subsystem>:<event>. The subsystem name is optional, but it is displayed in the available_events file. All of the events in a subsystem can be specified via the syntax <subsystem>:*; for example, to enable all irq events, write irq:* to the set_event file.
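Selecting a whole subsystem can be sketched the same way. The helper name enable_subsystem and the TRACING variable are assumptions of this example, not kernel names.

```shell
# Assumption: TRACING points at the tracing directory used in the text.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# enable_subsystem SUBSYSTEM: "irq" is written as "irq:*".
# '>' replaces the current selection; switch to '>>' to add to it instead.
enable_subsystem() {
    echo "$1:*" > "$TRACING/set_event"
}

# Example (as root): enable_subsystem irq
```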

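2.2 Via the ‘enable’ toggle

The events available are also listed in the /sys/kernel/debug/tracing/events/ hierarchy of directories.

To enable event ‘sched_wakeup’, write 1 to events/sched/sched_wakeup/enable; writing 0 to the same file disables it again.

To enable all events in the sched subsystem, write 1 to events/sched/enable.

To enable all events, write 1 to events/enable.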
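Since the three levels differ only in which enable file is written, one helper can cover them. This is a sketch under the assumption that TRACING names the tracing directory; set_enable is a hypothetical name chosen for this example.

```shell
# Assumption: TRACING points at the tracing directory used in the text.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# set_enable TARGET VALUE
#   TARGET is a path below events/: "sched/sched_wakeup" for one event,
#   "sched" for a whole subsystem, "." for every event.
#   VALUE is 1 to enable or 0 to disable.
set_enable() {
    echo "$2" > "$TRACING/events/$1/enable"
}

# Examples (as root):
#   set_enable sched/sched_wakeup 1
#   set_enable sched 1
#   set_enable . 1
```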

When reading one of these enable files, there are four results:

0 — all events this file affects are disabled
1 — all events this file affects are enabled
X — there is a mixture of events enabled and disabled
? — this file does not affect any event
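A script that reads an enable file has to handle all four values. The following sketch maps each one to a description; the function name describe_enable and the wording of the messages are choices made for this example.

```shell
# describe_enable FILE: describe the state read from an 'enable' file.
describe_enable() {
    state=$(cat "$1") || return 1
    case "$state" in
        0)   echo "all events disabled" ;;
        1)   echo "all events enabled" ;;
        X)   echo "mixture of enabled and disabled events" ;;
        '?') echo "file does not affect any event" ;;   # quoted: a bare ? is a wildcard
        *)   echo "unexpected value: $state" ;;
    esac
}

# Example: describe_enable /sys/kernel/debug/tracing/events/sched/enable
```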

2.3 Boot option

In order to facilitate early boot debugging, use the boot option trace_event=[event-list].

event-list is a comma-separated list of events. See section 2.1 for event format.

3. Defining an event-enabled tracepoint

See the example provided in samples/trace_events.

4. Event formats

Each trace event has a ‘format’ file associated with it that contains a description of each field in a logged event. This information can be used to parse the binary trace stream, and is also the place to find the field names that can be used in event filters (see section 5).

It also displays the format string that will be used to print the event in text mode, along with the event name and ID used for profiling.

Every event has a set of common fields associated with it; these are the fields prefixed with common_. The other fields vary between events and correspond to the fields defined in the TRACE_EVENT definition for that event.

Each field in the format has the form field:field-type field-name; offset:N; size:N;

where offset is the offset of the field in the trace record and size is the size of the data item, in bytes.
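As an illustration of how the field description can be consumed, the sketch below extracts the name, offset and size of every field line it reads from standard input. The function name field_layout is hypothetical, and the parsing assumes the semicolon-separated layout described above; extra attributes after size are ignored.

```shell
# field_layout: read a 'format' file on stdin, print "name offset size" per field.
field_layout() {
    awk -F';' '/^[ \t]*field:/ {
        # The declaration is "field:<type> <name>"; the name is its last word.
        n = split($1, decl, " ")
        name = decl[n]
        off = $2;  sub(/.*offset:/, "", off)
        size = $3; sub(/.*size:/, "", size)
        print name, off, size
    }'
}

# Example:
#   field_layout < /sys/kernel/debug/tracing/events/sched/sched_wakeup/format
```

Lines that do not describe a field, such as the event name, ID and print format, are skipped.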

For example, the information for the ‘sched_wakeup’ event is displayed by reading events/sched/sched_wakeup/format.

In the listing this text was written against, the event contains 10 fields, the first 5 common and the remaining 5 event-specific. All the fields for this event are numeric, except for ‘comm’ which is a string, a distinction important for event filtering.

5. Event filtering

Trace events can be filtered in the kernel by associating boolean ‘filter expressions’ with them. As soon as an event is logged into the trace buffer, its fields are checked against the filter expression associated with that event type. An event with field values that ‘match’ the filter will appear in the trace output, and an event whose values don’t match will be discarded. An event with no filter associated with it matches everything, and is the default when no filter has been set for an event.

5.1 Expression syntax

A filter expression consists of one or more ‘predicates’ that can be combined using the logical operators ‘&&’ and ‘||’. A predicate is simply a clause of the form field-name relational-operator value: it compares the value of a field contained within a logged event with a constant value and returns either 0 or 1 depending on whether the field value matched (1) or didn’t match (0).

Parentheses can be used to provide arbitrary logical groupings and double-quotes can be used to prevent the shell from interpreting operators as shell metacharacters.

The field-names available for use in filters can be found in the ‘format’ files for trace events (see section 4).

The relational-operators depend on the type of the field being tested:

The operators available for numeric fields are ==, !=, <, <=, >, >= and &.

And for string fields they are ==, != and ~.

The glob (~) accepts a wild card character (*,?) and character classes ([). For example, prev_comm ~ "*sh" matches any prev_comm value that ends in sh.

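5.2 Setting filters

A filter for an individual event is set by writing a filter expression to the ‘filter’ file for the given event. For example, the expression common_preempt_count > 4 can be written to events/sched/sched_wakeup/filter.

A slightly more involved example is ((sig >= 10 && sig < 15) || sig == 17) && comm != bash, written to events/signal/signal_generate/filter.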
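Setting a per-event filter is a single write, so a thin wrapper is enough to script it. The sketch below assumes TRACING names the tracing directory; set_filter is a hypothetical helper name. The expression is passed as one quoted argument so the shell does not interpret >, && or ||.

```shell
# Assumption: TRACING points at the tracing directory used in the text.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# set_filter SUBSYSTEM/EVENT EXPRESSION
set_filter() {
    echo "$2" > "$TRACING/events/$1/filter"
}

# Examples (as root):
#   set_filter sched/sched_wakeup 'common_preempt_count > 4'
#   set_filter signal/signal_generate '((sig >= 10 && sig < 15) || sig == 17) && comm != bash'
#   set_filter sched/sched_switch 'prev_comm ~ "*sh"'
```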

If there is an error in the expression, you’ll get an ‘Invalid argument’ error when setting it, and the erroneous string along with an error message can be seen by reading the event’s filter file.

Currently the caret (‘^’) for an error always appears at the beginning of the filter string; the error message should still be useful though even without more accurate position info.

5.3 Clearing filters

To clear the filter for an event, write a ‘0’ to the event’s filter file.

To clear the filters for all events in a subsystem, write a ‘0’ to the subsystem’s filter file.
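Both cases write the same value and differ only in the directory, which the following sketch makes explicit. TRACING and the helper name clear_filter are assumptions of this example.

```shell
# Assumption: TRACING points at the tracing directory used in the text.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# clear_filter TARGET
#   TARGET is "subsystem/event" for one event, or "subsystem" for all of
#   the events in that subsystem.
clear_filter() {
    echo 0 > "$TRACING/events/$1/filter"
}

# Examples (as root):
#   clear_filter sched/sched_wakeup
#   clear_filter sched
```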

5.3 Subsystem filters

For convenience, filters for every event in a subsystem can be set or cleared as a group by writing a filter expression into the filter file at the root of the subsystem. Note however, that if a filter for any event within the subsystem lacks a field specified in the subsystem filter, or if the filter can’t be applied for any other reason, the filter for that event will retain its previous setting. This can result in an unintended mixture of filters, which could confuse a user who believes a single filter is in effect. Only filters that reference just the common fields can be guaranteed to propagate successfully to all events.

Here are a few subsystem filter examples that also illustrate the above points:

Clear the filters on all events in the sched subsystem by writing 0 to events/sched/filter.

Set a filter using only common fields for all events in the sched subsystem, for example common_pid == 0 (all events end up with the same filter).

Attempt to set a filter using a non-common field for all events in the sched subsystem, for example prev_pid == 0 (all events but those that have a prev_pid field retain their old filters).
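Because a subsystem filter may be applied to some events and skipped for others, it can help to list the filter each event actually ended up with after the write. The sketch below pairs a subsystem-level write with such a listing; both helper names and the TRACING variable are assumptions of this example.

```shell
# Assumption: TRACING points at the tracing directory used in the text.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# set_subsystem_filter SUBSYSTEM EXPRESSION
set_subsystem_filter() {
    echo "$2" > "$TRACING/events/$1/filter"
}

# show_filters SUBSYSTEM: print "<subsystem>/<event>/filter: <current filter>"
# for every event, so a mixture of old and new filters is visible.
show_filters() {
    for f in "$TRACING/events/$1"/*/filter; do
        printf '%s: %s\n' "${f#"$TRACING/events/"}" "$(cat "$f")"
    done
}

# Example (as root):
#   set_subsystem_filter sched 'prev_pid == 0'
#   show_filters sched
```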

5.4 PID filtering

The set_event_pid file, located in the same directory as the top-level events directory, restricts event tracing to tasks whose PID is listed in it; events from any other task are filtered out.

For example, writing the shell’s own PID ($$) to set_event_pid and then enabling all events will only trace events for the current task.

To add more PIDs without losing the PIDs already included, use ‘>>’.
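The difference between replacing and extending the PID list can be sketched with two helpers. Their names and the TRACING variable are assumptions made for this example.

```shell
# Assumption: TRACING points at the tracing directory used in the text.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# trace_only_pids PID...: replace the PID list ('>').
trace_only_pids() {
    echo "$*" > "$TRACING/set_event_pid"
}

# also_trace_pids PID...: extend the PID list without losing entries ('>>').
also_trace_pids() {
    echo "$*" >> "$TRACING/set_event_pid"
}

# Example (as root): trace only the current shell, then add three more PIDs.
#   trace_only_pids $$
#   also_trace_pids 123 244 1
```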

6. Event triggers

Trace events can be made to conditionally invoke trigger ‘commands’ which can take various forms and are described in detail below; examples would be enabling or disabling other trace events or invoking a stack trace whenever the trace event is hit. Whenever a trace event with attached triggers is invoked, the set of trigger commands associated with that event is invoked. Any given trigger can additionally have an event filter of the same form as described in section 5 (Event filtering) associated with it — the command will only be invoked if the event being invoked passes the associated filter. If no filter is associated with the trigger, it always passes.

Triggers are added to and removed from a particular event by writing trigger expressions to the ‘trigger’ file for the given event.

A given event can have any number of triggers associated with it, subject to any restrictions that individual commands may have in that regard.

Event triggers are implemented on top of “soft” mode, which means that whenever a trace event has one or more triggers associated with it, the event is activated even if it isn’t actually enabled, but is disabled in a “soft” mode. That is, the tracepoint will be called, but just will not be traced, unless of course it’s actually enabled. This scheme allows triggers to be invoked even for events that aren’t enabled, and also allows the current event filter implementation to be used for conditionally invoking triggers.

The syntax for event triggers is roughly based on the syntax for set_ftrace_filter ‘ftrace filter commands’ (see the ‘Filter commands’ section of Documentation/trace/ftrace.rst), but there are major differences and the implementation isn’t currently tied to it in any way, so beware about making generalizations between the two.

Note: Writing into trace_marker (See Documentation/trace/ftrace.rst) can also enable triggers that are written into /sys/kernel/tracing/events/ftrace/print/trigger

6.1 Expression syntax

Triggers are added by echoing a command of the form command[:count] [if filter] to the ‘trigger’ file.

Triggers are removed by echoing the same command but starting with ‘!’ to the ‘trigger’ file, that is, !command[:count] [if filter].

The [if filter] part isn’t used in matching commands when removing, so leaving that off in a ‘!’ command will accomplish the same thing as having it in.

The filter syntax is the same as that described in the ‘Event filtering’ section above.

For ease of use, writing to the trigger file using ‘>’ currently just adds or removes a single trigger and there’s no explicit ‘>>’ support (‘>’ actually behaves like ‘>>’) or truncation support to remove all triggers (you have to use ‘!’ for each one added.)
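Adding and removing a trigger therefore differ only in the leading ‘!’. A minimal sketch, assuming TRACING names the tracing directory and using hypothetical helper names:

```shell
# Assumption: TRACING points at the tracing directory used in the text.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# add_trigger SUBSYSTEM/EVENT 'command[:count] [if filter]'
add_trigger() {
    echo "$2" > "$TRACING/events/$1/trigger"
}

# remove_trigger SUBSYSTEM/EVENT 'command[:count]'
# The same command is written with a leading '!'; the [if filter] part may be
# left off. printf keeps interactive shells from history-expanding the '!'.
remove_trigger() {
    printf '!%s\n' "$2" > "$TRACING/events/$1/trigger"
}

# Example (as root):
#   add_trigger syscalls/sys_enter_read 'enable_event:kmem:kmalloc:1'
#   remove_trigger syscalls/sys_enter_read 'enable_event:kmem:kmalloc:1'
```

Each trigger that was added needs its own remove_trigger call, since there is no truncation that clears them all.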

6.2 Supported trigger commands

The following commands are supported:

enable_event/disable_event

These commands can enable or disable another trace event whenever the triggering event is hit. When these commands are registered, the other trace event is activated, but disabled in a “soft” mode. That is, the tracepoint will be called, but just will not be traced. The event tracepoint stays in this mode as long as there’s a trigger in effect that can trigger it.

For example, the trigger enable_event:kmem:kmalloc:1, written to the trigger file of the syscalls/sys_enter_read event, causes kmalloc events to be traced when a read system call is entered; the :1 at the end specifies that this enablement happens only once.

The trigger disable_event:kmem:kmalloc, written to the trigger file of the syscalls/sys_exit_read event, causes kmalloc events to stop being traced when a read system call exits. This disablement happens on every read system call exit.
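The two triggers described above can be installed together. The sketch below is one way to script that; the function name and the TRACING variable are assumptions of this example, while the trigger strings and event paths follow the description in the text.

```shell
# Assumption: TRACING points at the tracing directory used in the text.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# Trace kmalloc events between entry to and exit from a read system call.
trace_kmalloc_during_read() {
    # Enable kmalloc tracing once, the first time read() is entered.
    echo 'enable_event:kmem:kmalloc:1' \
        > "$TRACING/events/syscalls/sys_enter_read/trigger" || return 1
    # Disable kmalloc tracing every time read() exits.
    echo 'disable_event:kmem:kmalloc' \
        > "$TRACING/events/syscalls/sys_exit_read/trigger"
}

# Example (as root): trace_kmalloc_during_read
```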
