Profiling in the Linux kernel

Contents

Using gcov with the Linux kernel
Preparation
Customization
Files
Modules
Separated build and test machines
Note on compilers
Troubleshooting
Appendix A: gather_on_build.sh
Appendix B: gather_on_test.sh
Linux profiling mechanisms
Kernel tracepoints
kprobes
Perf events
Conclusion
Linux kernel profiling features
May 12, 2014
Intro
Kernel tracepoints
Kernel probes
Perf events
Summary
To read

Using gcov with the Linux kernel

gcov profiling kernel support enables the use of GCC’s coverage testing tool gcov with the Linux kernel. Coverage data of a running kernel is exported in gcov-compatible format via the “gcov” debugfs directory. To get coverage data for a specific file, change to the kernel build directory and use gcov with the -o option as follows (requires root):
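As a hedged illustration, the invocation can be assembled as in the sketch below. It assumes debugfs is mounted at /sys/kernel/debug and that the gcov directory mirrors the absolute path of the build directory. The build directory /tmp/linux-out, the file kernel/spinlock.c and the helper name gcov_command are example values, not part of the kernel tooling.

```python
import posixpath

# Assumption: debugfs is mounted at /sys/kernel/debug.
GCOV_DEBUGFS_ROOT = "/sys/kernel/debug/gcov"


def gcov_command(build_dir, source_file):
    """Return the gcov argv for one source file, given as a path relative to build_dir."""
    obj_dir = posixpath.dirname(source_file)
    # The gcov debugfs tree mirrors the absolute path of the build directory.
    data_dir = posixpath.normpath(
        posixpath.join(GCOV_DEBUGFS_ROOT, build_dir.strip("/"), obj_dir)
    )
    return ["gcov", "-o", data_dir, posixpath.basename(source_file)]


# Hypothetical build directory and source file, for illustration only.
cmd = gcov_command("/tmp/linux-out", "kernel/spinlock.c")
print(" ".join(cmd))
```

The resulting command is meant to be run as root from inside the build directory.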

This will create source code files annotated with execution counts in the current directory. In addition, graphical gcov front-ends such as lcov can be used to automate the process of collecting data for the entire kernel and provide coverage overviews in HTML format.

Possible uses:

debugging (has this line been reached at all?)

test improvement (how do I change my test to cover these lines?)

minimizing kernel configurations (do I need this option if the associated code is never run?)

Preparation

Configure the kernel with CONFIG_DEBUG_FS=y and CONFIG_GCOV_KERNEL=y, and to get coverage data for the entire kernel also set CONFIG_GCOV_PROFILE_ALL=y.

Note that kernels compiled with profiling flags will be significantly larger and run slower. Also CONFIG_GCOV_PROFILE_ALL may not be supported on all architectures.

Profiling data will only become accessible once debugfs has been mounted, for example with mount -t debugfs none /sys/kernel/debug.

Customization

To enable profiling for specific files or directories, add a line similar to the following to the respective kernel Makefile:

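For a single file (e.g. main.o): GCOV_PROFILE_main.o := y

For all files in one directory: GCOV_PROFILE := y

To exclude files from being profiled even when CONFIG_GCOV_PROFILE_ALL is specified, use GCOV_PROFILE_main.o := n for a single file or GCOV_PROFILE := n for a whole directory.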

Only files which are linked to the main kernel image or are compiled as kernel modules are supported by this mechanism.

Files

The gcov kernel support creates the following files in debugfs:

/sys/kernel/debug/gcov
Parent directory for all gcov-related files.

/sys/kernel/debug/gcov/reset
Global reset file: resets all coverage data to zero when written to.

/sys/kernel/debug/gcov/path/to/compile/dir/file.gcda
The actual gcov data file as understood by the gcov tool. Resets file coverage data to zero when written to.

/sys/kernel/debug/gcov/path/to/compile/dir/file.gcno
Symbolic link to a static data file required by the gcov tool. This file is generated by gcc when compiling with option -ftest-coverage.

Modules

Kernel modules may contain cleanup code which is only run during module unload time. The gcov mechanism provides a means to collect coverage data for such code by keeping a copy of the data associated with the unloaded module. This data remains available through debugfs. Once the module is loaded again, the associated coverage counters are initialized with the data from its previous instantiation.

This behavior can be deactivated by specifying the gcov_persist kernel parameter as gcov_persist=0.

At run-time, a user can also choose to discard data for an unloaded module by writing to its data file or the global reset file.

Separated build and test machines

The gcov kernel profiling infrastructure is designed to work out of the box for setups where kernels are built and run on the same machine. In cases where the kernel runs on a separate machine, special preparations must be made, depending on where the gcov tool is used:

a) gcov is run on the TEST machine

The gcov tool version on the test machine must be compatible with the gcc version used for kernel build. Also the following files need to be copied from build to test machine:

from the source tree:
- all C source files + headers

from the build tree:
- all C source files + headers
- all .gcda and .gcno files
- all links to directories

It is important to note that these files need to be placed into the exact same file system location on the test machine as on the build machine. If any of the path components is a symbolic link, the actual directory needs to be used instead (due to make’s CURDIR handling).

b) gcov is run on the BUILD machine

The following files need to be copied after each test case from test to build machine:

from the gcov directory in sysfs:
- all .gcda files
- all links to .gcno files

These files can be copied to any location on the build machine. gcov must then be called with the -o option pointing to that directory.

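Example directory setup on the build machine: /tmp/linux holds the kernel source tree, /tmp/out is the kernel build directory as specified by make O=, and /tmp/coverage is the location of the files copied from the test machine. With that layout, gcov is run from /tmp/out as gcov -o /tmp/coverage/tmp/out/init main.c.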

Note on compilers

GCC and LLVM gcov tools are not necessarily compatible. Use gcov to work with GCC-generated .gcno and .gcda files, and use llvm-cov for Clang.

Build differences between GCC and Clang gcov are handled by Kconfig. It automatically selects the appropriate gcov format depending on the detected toolchain.

Troubleshooting

Problem: Compilation aborts during linker step.

Cause: Profiling flags are specified for source files which are not linked to the main kernel or which are linked by a custom linker procedure.

Solution: Exclude affected source files from profiling by specifying GCOV_PROFILE := n or GCOV_PROFILE_basename.o := n in the corresponding Makefile.

Problem: Files copied from sysfs appear empty or incomplete.

Cause: Due to the way seq_file works, some tools such as cp or tar may not correctly copy files from sysfs.

Solution: Use cat to read .gcda files and cp -d to copy links. Alternatively use the mechanism shown in Appendix B.

Appendix A: gather_on_build.sh

Sample script to gather coverage meta files on the build machine (see Separated build and test machines, a.):

Appendix B: gather_on_test.sh

Sample script to gather coverage data files on the test machine (see Separated build and test machines, b.):
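The following is a minimal Python sketch of the same idea. It is not the gather_on_test.sh script itself, and the function name gather_coverage and its arguments are hypothetical. Following the Troubleshooting advice, it reads each .gcda file to the end of the stream instead of trusting its reported size (as cat does) and re-creates .gcno links as links (as cp -d does).

```python
import os


def gather_coverage(gcov_dir, dest_dir):
    """Copy .gcda contents and re-create .gcno links from gcov_dir under dest_dir."""
    copied = []
    for root, _dirs, files in os.walk(gcov_dir):
        rel = os.path.relpath(root, gcov_dir)
        target_root = os.path.normpath(os.path.join(dest_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            if name.endswith(".gcda"):
                # Read until EOF rather than relying on the file size (like cat).
                with open(src, "rb") as fin, open(dst, "wb") as fout:
                    fout.write(fin.read())
                copied.append(dst)
            elif name.endswith(".gcno") and os.path.islink(src):
                # Keep the link itself, not the file it points to (like cp -d).
                if os.path.lexists(dst):
                    os.remove(dst)
                os.symlink(os.readlink(src), dst)
                copied.append(dst)
    return sorted(copied)
```

On a real test machine gcov_dir would be the gcov directory in debugfs (typically /sys/kernel/debug/gcov) and the call needs root; dest_dir can then be archived and moved to the build machine.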


Linux profiling mechanisms

For the last couple of years I have been writing code for the Linux kernel, and I often see people suffer because they do not know about long-established, widely accepted and (almost) convenient tools. For example, we were once debugging networking on yet another reincarnation of our device and trying to understand what strange things were happening with packet processing. Our first impulse was to open the kernel sources, insert printk calls in the right places, collect the logs, process them with Python and then think long and hard. But I had not been reading lwn.net for nothing. I remembered that the kernel has ready-made, well-working mechanisms for tracing and profiling: the basic mechanisms that let you collect measurements from the kernel and then analyze them.

The Linux kernel has 3 such mechanisms:

tracepoints
kprobes
perf events

Absolutely all profilers and tracers available for Linux, including ftrace, perf, SystemTap, ktap and others, are built on these 3 kernel features. In this article I describe what these mechanisms provide and how they work; the specific tools will be covered in later articles.

Kernel tracepoints

Kernel tracepoints are a framework for tracing the kernel, built on static instrumentation of the code. Yes, you understood correctly: most of the important kernel functions (networking, memory management, the scheduler) are statically instrumented. On my kernel the number of tracepoints is:

Among them is kmalloc:

And this is how it actually looks in the kernel function __do_kmalloc:

trace_kmalloc is the tracepoint.

Such tracepoints write their output to a debug ring buffer, which can be viewed in /sys/kernel/debug/tracing/trace:

Note that the tracepoint “kmem:kmalloc”, like all the others, is disabled by default, so you have to enable it by writing 1 to its enable file.
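A small sketch of that last step, assuming the events tree lives under /sys/kernel/debug/tracing/events and is laid out as subsystem/event/enable. The helper names enable_file and set_enabled are hypothetical, and writing to the real file requires root.

```python
import os

# Assumption: debugfs is mounted at /sys/kernel/debug.
TRACING_ROOT = "/sys/kernel/debug/tracing"


def enable_file(event, tracing_root=TRACING_ROOT):
    """Map 'subsystem:event' (e.g. 'kmem:kmalloc') to its enable control file."""
    subsystem, name = event.split(":", 1)
    return os.path.join(tracing_root, "events", subsystem, name, "enable")


def set_enabled(event, on, tracing_root=TRACING_ROOT):
    """Write 1 or 0 to the event's enable file."""
    with open(enable_file(event, tracing_root), "w") as f:
        f.write("1" if on else "0")


print(enable_file("kmem:kmalloc"))
```

After set_enabled("kmem:kmalloc", True), the events show up in the trace file mentioned above.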

You can write a tracepoint for your own kernel module. But, first, it is not that simple, because tracepoints are not the most convenient or understandable API: see the examples in samples/trace_events/ (in general, these tracepoints are black magic that you will not grasp right away even if you really want to). Second, they most likely will not work in your module as long as CONFIG_MODULE_SIG is enabled (it almost always is) and you have no private key for signing (the vendor of your distribution’s kernel holds it). See the heart-rending details in lkml [1], [2].

In short, tracepoints are a simple and lightweight thing, but using them by hand is inconvenient and not recommended; use ftrace or perf instead.

kprobes

If tracepoints are static instrumentation marks, kprobes is a mechanism for dynamic instrumentation of code. With kprobes you can interrupt the execution of kernel code at any place, call your own handler, do whatever you want in it and return as if nothing had happened.

How it is done: you write your own kernel module, in which you register a handler for a particular kernel symbol (read: a function) or for any address at all.

It all works as follows:

We create our own kernel module and write our handler in it.
We register the handler on some address A, be it a plain address or a function.
The kprobe subsystem copies the instructions at address A and replaces them with a CPU trap (int 3 on x86).
Now, when execution reaches address A, an exception is raised; the registers are saved and control passes to the exception handler, which eventually turns out to be kprobes.
The kprobes subsystem looks at the exception address, finds who was registered at address A and calls our handler.
When our handler finishes, the registers are restored, the saved instructions are executed, and execution continues.

There are 3 kinds of kprobes in the kernel:

kprobes: the “basic” probe, which lets you break in at any place in the kernel.
jprobes: jump probes, inserted only at the start of a function, but giving the handler a convenient way to access the arguments of the probed function. The name comes from the setjmp/longjmp-like trick used to hand control to the handler and back; the probe itself is still planted as a regular kprobe at the function entry, so it is not cheaper than a plain kprobe.
kretprobes: return probes, inserted before the function exits and giving convenient access to the function’s result.

With kprobes we can trace anything, including the code of third-party modules. Let’s do this for our miscdevice driver. I want to know what is trying to write to my device, at which offset and how many bytes.
In my miscdevice driver the function looks like this:

I wrote a simple jprobe module that writes the number of bytes and the offset to the kernel log.

In short, it is a powerful thing, but not very convenient to use: there is no access to local variables (only via an offset from ebp), and you have to write a kernel module, debug it, load it, and so on. There are examples available in samples/kprobes. But why bother with all this when there is SystemTap?

Perf events

Let me say right away that “perf events” should not be confused with the perf program; the program will be discussed separately.

“Perf events” is an interface for accessing the counters in the PMU (Performance Monitoring Unit), which is part of the CPU. Thanks to these metrics you can easily ask the kernel to show you how many L1 cache misses occurred, regardless of your architecture, be it ARM or amd64. Your processor does need to be supported in the kernel, though :-) Reasonably up-to-date information on this can be found here.

To access these counters, as well as a huge pile of other goodies, the perf program was written. With it you can see which hardware events are available to us.

You can see that x86 is richer in such things.

To access “perf events” a special system call, perf_event_open, was created. You pass it the event itself and a config describing what you want to do with that event. In return you get a file descriptor from which you can read the data collected by perf for the event.


On top of this, perf provides many different features such as event grouping, filtering, output in various formats, analysis of collected profiles and so on. That is why everything possible is now being stuffed into perf, from tracepoints to eBPF, up to the point where there are plans to make all of ftrace part of perf [3] [4].

In short, “perf_events” on their own are of little interest, while perf itself deserves a separate article, so as a teaser I will show a simple example.

And we get this marvel:
Perf timechart

Conclusion

So, knowing a bit more about tracing and profiling in the kernel, you can make life much easier for yourself and your colleagues, especially if you learn to use real tools such as ftrace, perf and SystemTap, but more on that another time.


Linux kernel profiling features

May 12, 2014

Intro

Sometimes, when you’re facing a really hard performance problem, profiling your application is not enough. As we saw while profiling our application with gprof, gcov and Valgrind, the problem is somewhere underneath our application: something is holding pread in long I/O wait cycles.

How to trace a system call is not clear at first sight: there are various kernel profilers, and each of them works in its own way and requires its own configuration, methods, analysis and so on. Yes, it’s really hard to figure out. Being the biggest open-source project, developed by a massive community, Linux has absorbed several different and sometimes conflicting profiling facilities. And in some sense it’s getting even worse: while some profilers tend to merge (ftrace and perf), other tools emerge; the latest example is ktap.

To understand that bazaar, let’s start from the bottom: what does the kernel have that makes it possible to profile it? Basically, there are only 3 kernel facilities that enable profiling:

Kernel tracepoints
Kernel probes
Perf events

These are the features that give us access to the kernel internals. By using them we can measure kernel function execution, trace access to devices, analyze CPU states and so on.

These features are really awkward to use directly and are accessible only from the kernel. Well, if you really want to, you can write your own Linux kernel module that uses these facilities for your custom purposes, but it’s pretty much pointless. That’s why people have created a few really good general-purpose profilers:

ftrace
perf
SystemTap
ktap

All of them are based on these features and will be discussed more thoroughly later, but for now let’s review the features themselves.

Kernel tracepoints

Kernel tracepoints are a framework for tracing kernel functions via static instrumentation (note 1).

A tracepoint is a place in the code where you can bind your callback. Tracepoints can be disabled (no callback) or enabled (has a callback). There may be several callbacks, yet it is still lightweight: when no callback is attached, the tracepoint boils down to a check like if (unlikely(tracepoint.enabled)).
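To make the idea concrete, here is a toy user-space model in Python. It is not kernel code: the Tracepoint class and the do_kmalloc stand-in are invented for illustration, and they only mimic the behaviour described above (a hook compiled into the function, a cheap enabled check, and zero or more callbacks).

```python
class Tracepoint:
    """Toy model of a static tracepoint: a named hook with optional callbacks."""

    def __init__(self, name):
        self.name = name
        self.callbacks = []

    @property
    def enabled(self):
        # The tracepoint counts as enabled only while a callback is attached.
        return bool(self.callbacks)

    def register(self, callback):
        self.callbacks.append(callback)

    def unregister(self, callback):
        self.callbacks.remove(callback)

    def fire(self, *args):
        # Fast path: a disabled tracepoint costs a single check.
        if not self.enabled:
            return
        for callback in self.callbacks:
            callback(*args)


trace_kmalloc = Tracepoint("kmem:kmalloc")
log = []


def do_kmalloc(size):
    """Stand-in for an instrumented function; the hook call is always present."""
    ptr = 0x1000 + size  # fake allocation result
    trace_kmalloc.fire(size, ptr)
    return ptr


do_kmalloc(32)  # disabled: nothing is recorded
trace_kmalloc.register(lambda size, ptr: log.append((size, hex(ptr))))
do_kmalloc(64)  # enabled: the callback runs
print(log)
```

Only the second call is recorded, because the callback is attached after the first one.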

Tracepoint output is written to a ring buffer that is exported through debugfs at /sys/kernel/debug/tracing/trace. There is also a whole tree of traceable events at /sys/kernel/debug/tracing/events that exports control files to enable or disable a particular event.

Despite their name, tracepoints are the base for event-based profiling because, besides tracing, you can do anything in the callback, e.g. timestamping and measuring resource usage. The Linux kernel has been instrumented with tracepoints in many places since 2.6.28. For example, __do_kmalloc:

trace_kmalloc is a tracepoint. There are many others in other critical parts of the kernel such as the scheduler, block I/O, networking and even interrupt handlers. Most profilers use them because they have minimal overhead, fire on the event and save you from modifying the kernel.

OK, so by now you may be eager to insert them in all of your kernel modules and profile them to hell, but BEWARE. If you want to add tracepoints you need a lot of patience and skill, because writing your own tracepoints is really ugly and awkward. You can see examples in samples/trace_events/. Under the hood a tracepoint is C macro black magic that only bold and fearless people can understand.

And even if you write all those crazy macro declarations and struct definitions, it might simply not work at all if you have CONFIG_MODULE_SIG=y and don’t sign the module. It might seem like a strange configuration, but in reality it’s the default for all major distributions, including Fedora and Ubuntu. That said, after 9 circles of hell, you will end up with nothing.

So, just remember:

USE ONLY EXISTING TRACEPOINTS IN KERNEL, DO NOT CREATE YOUR OWN.

Now I’m going to explain why this happens. If you are tired of tracepoints, just skip ahead to kprobes.

OK, so some time ago, while preparing kernel 3.1, this code was added:

If the module is tainted, we will NOT register ANY of its tracepoints. Later the check became more reasonable:

Like, OK, it may be out-of-tree (TAINT_OOT_MODULE) or staging (TAINT_CRAP), but any other taint is a no-no.

Seems legit, right? Now, what do you think happens if your kernel is compiled with CONFIG_MODULE_SIG enabled and your pretty module is not signed? Well, the module loader will set the TAINT_FORCED_MODULE flag for it. And now your pretty module will never pass the condition in tracepoint_module_coming and will never show you any tracepoint output. And as I said earlier, this stupid option is set for all major distributions, including Fedora and Ubuntu, since kernel version 3.1.
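The relaxed condition can be modelled in a few lines of Python. This is a sketch of the logic described above, not the kernel source. The taint bit numbers are what I believe include/linux/kernel.h defined at the time and should be treated as an assumption, and the function name tracepoints_registered is invented.

```python
# Assumed taint bit numbers (kernel headers of that era).
TAINT_FORCED_MODULE = 1
TAINT_CRAP = 10
TAINT_OOT_MODULE = 12

# Taints that do not prevent tracepoint registration.
ALLOWED_TAINTS = (1 << TAINT_OOT_MODULE) | (1 << TAINT_CRAP)


def tracepoints_registered(taints):
    """Model of the relaxed check: any taint outside the allowed set blocks tracepoints."""
    return (taints & ~ALLOWED_TAINTS) == 0


out_of_tree = 1 << TAINT_OOT_MODULE
unsigned_out_of_tree = out_of_tree | (1 << TAINT_FORCED_MODULE)

print(tracepoints_registered(out_of_tree))
print(tracepoints_registered(unsigned_out_of_tree))
```

An out-of-tree module passes the check, while the same module with the forced-module taint (the unsigned case) does not.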

If you think “Well, let’s sign the goddamn module!”, you’re wrong. Modules must be signed with the kernel private key, which is held by your Linux distro vendor and, of course, is not available to you.

The whole terrifying story is available in lkml 1, 2.

As for me, I’ll just cite my favorite quote from Steven Rostedt (ftrace maintainer and one of the tracepoint developers):

Kernel tracepoints are a lightweight tracing and profiling facility.
The Linux kernel is heavily instrumented with tracepoints, which are used by most profilers, especially perf and ftrace.
Tracepoints are C macro black magic and almost impossible to use in kernel modules.
They will NOT work in your LKM if:
- Kernel version >=3.1 (might be fixed in 3.15)
- CONFIG_MODULE_SIG=y
- Your module is not signed with the kernel private key.

Kernel probes

Kernel probes are a dynamic debugging and profiling mechanism that allows you to break into kernel code, invoke your custom function, called a probe, and return everything back.

Basically, it’s done by writing a kernel module in which you register a handler for some address or symbol in the kernel code. struct kprobe also has an offset field, so a probe can be placed at an offset from the symbol. In your registered handler you can do really anything: write to the log or to some buffer exported via sysfs, measure time, and countless other things. That’s really nifty compared with tracepoints, where you can only read logs from debugfs.

There are 3 types of probes:

kprobes: the basic probe that allows you to break into any kernel address.
jprobes: jump probes that are inserted at the start of a function and give you handy access to the function arguments; it’s something like a proxy function.
kretprobes: return probes that are inserted at the return point of a function.

The last 2 types are based on basic kprobes.

All of this generally works like this:

We register a probe on some address A.
The kprobe subsystem finds A.
kprobe copies the instruction at address A.
kprobe replaces the instruction at A with a breakpoint (int 3 in the case of x86).
Now, when execution hits the probed address A, a CPU trap occurs.
Registers are saved.
The CPU transfers control to kprobes via the notifier_call_chain mechanism.
And finally, kprobes invokes our handler.
After the handler returns, the registers are restored, the saved copy of the instruction from A is executed, and execution continues.
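The steps above can be played through in a toy Python simulation. Nothing here touches a real kernel: "memory" is a dictionary of instruction names, and the ToyKprobes class is a made-up stand-in that only mirrors the copy, replace, trap, handle and resume sequence.

```python
BREAKPOINT = "int3"


class ToyKprobes:
    """Toy model of kprobes over a 'memory' of named instructions."""

    def __init__(self, memory):
        self.memory = memory  # address -> instruction name
        self.probes = {}      # address -> (saved instruction, handler)

    def register(self, addr, handler):
        saved = self.memory[addr]        # copy the original instruction
        self.probes[addr] = (saved, handler)
        self.memory[addr] = BREAKPOINT   # replace it with a breakpoint

    def unregister(self, addr):
        saved, _handler = self.probes.pop(addr)
        self.memory[addr] = saved        # put the original instruction back

    def run(self, start, end, regs):
        """Execute addresses start..end-1 and return the instructions that ran."""
        executed = []
        for addr in range(start, end):
            insn = self.memory[addr]
            if insn == BREAKPOINT:
                saved, handler = self.probes[addr]  # trap: find the probe by address
                handler(addr, dict(regs))           # handler gets a register snapshot
                insn = saved                        # then the saved instruction runs
            executed.append(insn)
        return executed


memory = {0: "push", 1: "mov", 2: "call", 3: "ret"}
hits = []
kp = ToyKprobes(memory)
kp.register(2, lambda addr, regs: hits.append((addr, regs["ax"])))
trace = kp.run(0, 4, {"ax": 7})
print(memory[2], trace, hits)
```

The probed code still executes its original instruction sequence; the only visible difference is that the handler was called once at address 2.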

Our handler usually gets as arguments the address where the breakpoint happened and the register values in a pt_regs structure. kprobes handler prototype:

Outside of debugging this info is rarely useful, because we have jprobes. A jprobe handler has exactly the same prototype as the function it intercepts. For example, this is a handler for do_fork:

Note that jprobes do not avoid the trap: a jprobe is implemented as a kprobe placed at the function entry, and the setjmp/longjmp-like mechanism is only used to run the handler with the original arguments and then return to the probed function.

And finally, the most convenient tool for profiling is kretprobes. It allows you to register 2 handlers: one invoked on function entry and the other invoked at the end. But the really cool feature is that it lets you save state between those 2 calls, such as a timestamp or counters.
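As a rough user-space analogy (not the kernel API), the entry/return pairing with shared per-call state looks like the Python decorator below. The names kretprobe, on_entry, on_return and do_work are all invented for the sketch.

```python
import functools
import time


def kretprobe(entry_handler, return_handler):
    """Toy analogue of a kretprobe: wrap a function with entry and return handlers."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            state = {}  # per-call data shared by both handlers
            entry_handler(state, args)
            result = func(*args, **kwargs)
            return_handler(state, result)
            return result
        return wrapper
    return decorate


latencies = []


def on_entry(state, args):
    # Save a timestamp when the function is entered.
    state["start"] = time.perf_counter()


def on_return(state, result):
    # Use the saved timestamp to compute how long the call took.
    latencies.append((result, time.perf_counter() - state["start"]))


@kretprobe(on_entry, on_return)
def do_work(n):
    return sum(range(n))


do_work(1000)
print(latencies[0][0])
```

Each call gets its own state dictionary, which is what makes per-call latency measurement possible even when calls overlap.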

Instead of a thousand words, look at the absolutely astonishing examples in samples/kprobes.

kprobes is a beautiful hack for dynamic debugging, tracing and profiling.
It’s a fundamental kernel feature for non-invasive profiling.

Perf events

perf_events is an interface to hardware metrics implemented in the PMU (Performance Monitoring Unit), which is part of the CPU.

Thanks to perf_events you can easily ask the kernel to show you the L1 cache miss count regardless of which architecture you are on, x86 or ARM. The CPUs supported by perf are listed here.

In addition to that, perf includes various kernel metrics such as the software context switch count (PERF_COUNT_SW_CONTEXT_SWITCHES).

And in addition to that, perf includes tracepoint support via ftrace.

To access perf_events there is a special syscall, perf_event_open. You pass the type of event (hardware, kernel, tracepoint) and a so-called config, where you specify what exactly you want depending on the type. It’s going to be a tracepoint ID (read from the event’s id file under /sys/kernel/debug/tracing/events) in the case of a tracepoint, some CPU metric in the case of hardware, and so on.
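To show how type and config fit together, here is a small Python sketch that only computes the two values; it does not call perf_event_open. The numeric constants are copied from the kernel’s perf_event.h UAPI header as I remember them and should be checked against your headers, and the helper names hw_cache_config and describe_event are invented.

```python
# Assumed values from the perf_event.h UAPI header.
PERF_TYPE_HARDWARE = 0
PERF_TYPE_SOFTWARE = 1
PERF_TYPE_TRACEPOINT = 2
PERF_TYPE_HW_CACHE = 3

PERF_COUNT_SW_CONTEXT_SWITCHES = 3

PERF_COUNT_HW_CACHE_L1D = 0
PERF_COUNT_HW_CACHE_OP_READ = 0
PERF_COUNT_HW_CACHE_RESULT_MISS = 1


def hw_cache_config(cache, op, result):
    """Pack a cache event into the config field: cache | op << 8 | result << 16."""
    return cache | (op << 8) | (result << 16)


def describe_event(kind):
    """Return the (type, config) pair for a couple of named example events."""
    if kind == "l1d-read-misses":
        return PERF_TYPE_HW_CACHE, hw_cache_config(
            PERF_COUNT_HW_CACHE_L1D,
            PERF_COUNT_HW_CACHE_OP_READ,
            PERF_COUNT_HW_CACHE_RESULT_MISS,
        )
    if kind == "context-switches":
        return PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CONTEXT_SWITCHES
    raise ValueError("unknown example event: " + kind)


for name in ("l1d-read-misses", "context-switches"):
    event_type, config = describe_event(name)
    print(name, event_type, hex(config))
```

The same (type, config) pair is what would go into the attribute structure passed to the syscall; everything else (sampling, groups, filters) is layered on top.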

And on top of that, there is a whole lot of stuff like event groups, filters, sampling, various output formats and more. And all of that is constantly breaking (note 3), which is why the only thing you should use to access perf_events is the special perf utility, the only userspace utility that is part of the kernel tree.

perf_events and everything related to it have spread like a plague through the kernel, and now ftrace is going to become part of perf (1, 2). Some people overreact to perf-related things, though it’s useless, because perf is developed by kernel big fish: Ingo Molnar (note 4) and Peter Zijlstra.

I really can’t say anything more about perf_events in isolation from perf, so I’ll finish here.

Summary

There are a few Linux kernel features that enable profiling:

Kernel tracepoints
Kernel probes
Perf events

All Linux kernel profilers use some combination of these features; read the details in the article on the particular profiler.

To read

https://events.linuxfoundation.org/sites/events/files/slides/kernel_profiling_debugging_tools_0.pdf
http://events.linuxfoundation.org/sites/events/files/lcjp13_zannoni.pdf
tracepoints:
- Documentation/trace/tracepoints.txt
- http://lttng.org/files/thesis/desnoyers-dissertation-2009-12-v27.pdf
- http://lwn.net/Articles/379903/
- http://lwn.net/Articles/381064/
- http://lwn.net/Articles/383362/
kprobes:
- Documentation/kprobes.txt
- https://lwn.net/Articles/132196/
perf_events:
- http://web.eece.maine.edu/

Notes:
Note 1: Tracepoints are an improvement on an earlier feature called kernel markers.
Note 3: And that’s intended behaviour. The kernel ABI is in no sense stable; the API is.
Note 4: Ingo Molnar is the author of the current default process scheduler, CFS (Completely Fair Scheduler), as well as of the earlier O(1) scheduler.
  