Netlink linux

Contents

Linux, Netlink, and Go — Part 1: netlink
What is netlink?
Creating netlink sockets
Netlink message format
Sending and receiving netlink messages
Large messages
Multi-part messages
Netlink error numbers
Sequence number and PID validation
Multicast groups
Netlink attributes
Summary
Updates
References
A simple Linux network interface monitor using netlink
From theory to practice.

Linux, Netlink, and Go — Part 1: netlink

Feb 21, 2017

I am moving my blog content to mdlayher.com. Please see the updated version of this content there.

I’m a big fan of Prometheus. I use it quite a lot at both home and work, and greatly enjoy having insight into what my systems are doing at any given moment. One of the most widely used Prometheus exporters is the node_exporter: a daemon that can extract a wide variety of metrics from UNIX-like machines.

As I was browsing the repository, I noticed an open issue requesting the addition of WiFi metrics to node_exporter. The idea intrigued me, and I realized that I would certainly make use of such a feature on my Linux laptop. I began exploring options for retrieving WiFi device information on Linux.

After a couple of weeks of experimentation (including the legacy ioctl() wireless extensions API), I authored two Go packages which work together to interact with WiFi devices on Linux:

netlink: provides low-level access to Linux netlink sockets.
wifi: provides access to IEEE 802.11 WiFi device actions and statistics.

This series of posts will describe some of the lessons I learned while implementing these packages in Go, and hopefully provide a nice reference for others who wish to experiment with netlink and/or WiFi devices in their language of choice.

The pseudo-code in this series will use Go’s x/sys/unix package and types from my netlink and wifi packages. I plan to break up the series as follows (links to come as more are posted):

Part 1: netlink (this post): an introduction to netlink.
Part 2: generic netlink: an introduction to generic netlink, a netlink family meant to simplify creation of new families.
Part 3: packages netlink, genetlink, and wifi: using Go to drive interactions with netlink, generic netlink, and nl80211.

What is netlink?

Netlink is a Linux kernel inter-process communication mechanism, enabling communication between a userspace process and the kernel, or between multiple userspace processes. Netlink sockets are the primitive which enables this communication.

This post will provide a primer on netlink sockets, messages, multicast groups, and attributes. In addition, this post will focus on communication between userspace and the kernel, rather than communication between two userspace processes.

Creating netlink sockets

Netlink makes use of the standard BSD sockets API. This should be quite familiar to anyone who has done network programming in C. If you are unfamiliar with BSD sockets, I recommend the excellent Beej’s Guide to Network Programming for a primer on the topic.

It is important to note that netlink communications never traverse beyond the local host. With this in mind, let’s begin diving into how netlink sockets work!

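To communicate with netlink, a netlink socket must be opened. This is done using the socket() system call, passing AF_NETLINK as the address family, SOCK_RAW as the socket type, and a netlink family as the protocol.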

The family parameter specifies a particular netlink family: essentially, a kernel subsystem which can be communicated with using netlink sockets. These families may offer functionality such as:

NETLINK_ROUTE : manipulation of Linux’s network interfaces, routes, IP addresses, etc.
NETLINK_GENERIC : a building block for simplified addition of new netlink families, like nl80211, Open vSwitch, etc.

Once the socket is created, bind() must be called to prepare it to send and receive messages.

At this point, the netlink socket is now ready to send and receive messages to and from the kernel.
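The two steps above, socket() followed by bind(), can be sketched in Go as follows. This is a minimal illustration rather than the code of the netlink package: the helper name dial is hypothetical, and the standard library's syscall package is used only to keep the example free of external dependencies (x/sys/unix provides identically named Socket, Bind, and SockaddrNetlink). It builds on Linux only.

```go
package main

import (
	"fmt"
	"syscall"
)

// netlinkRoute mirrors NETLINK_ROUTE, the family used in this sketch.
const netlinkRoute = syscall.NETLINK_ROUTE

// dial (a hypothetical helper) opens a netlink socket for the given
// family and binds it. Leaving Pid and Groups at zero lets netlink pick
// the PID and joins no multicast groups.
func dial(family int) (int, error) {
	fd, err := syscall.Socket(syscall.AF_NETLINK, syscall.SOCK_RAW, family)
	if err != nil {
		return -1, fmt.Errorf("socket: %w", err)
	}

	addr := &syscall.SockaddrNetlink{Family: syscall.AF_NETLINK}
	if err := syscall.Bind(fd, addr); err != nil {
		syscall.Close(fd)
		return -1, fmt.Errorf("bind: %w", err)
	}
	return fd, nil
}

func main() {
	fd, err := dial(netlinkRoute)
	if err != nil {
		fmt.Println("netlink unavailable:", err)
		return
	}
	defer syscall.Close(fd)
	fmt.Println("netlink socket ready, fd:", fd)
}
```

Returning the raw file descriptor keeps the sketch close to the system calls; a real package would more likely wrap it in a type with Send, Receive, and Close methods.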

Netlink message format

Netlink messages follow a very particular format. All messages must be aligned to a 4 byte boundary. As an example, a 16 byte message must be sent as is, but a 17 byte message must be padded to 20 bytes.
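The padding rule is a round-up to the next multiple of 4. A minimal sketch, with the helper name nlmsgAlign chosen here to echo the C macro NLMSG_ALIGN:

```go
package main

import "fmt"

// nlmsgAlignTo is the alignment boundary for netlink messages, in bytes.
const nlmsgAlignTo = 4

// nlmsgAlign rounds n up to the next multiple of nlmsgAlignTo.
func nlmsgAlign(n int) int {
	return (n + nlmsgAlignTo - 1) &^ (nlmsgAlignTo - 1)
}

func main() {
	for _, n := range []int{16, 17, 20} {
		fmt.Printf("%d bytes -> %d bytes on the wire\n", n, nlmsgAlign(n))
	}
}
```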

It is very important to note that, unlike typical network communications, netlink uses the host byte order, or endianness, for encoding and decoding integers, instead of the common network byte order (big endian). As a result, code which must convert between byte and integer representations of data must keep this in mind.

Netlink message headers have a fixed 16-byte layout (see the diagram in RFC 3549).

These fields contain the following information:

Length (32 bits): the length of the entire message, including both headers and payload.
Type (16 bits): what kind of information the message contains, such as an error, end of multi-part message, etc.
Flags (16 bits): bit flags which indicate that a message is a request, a multi-part message, an acknowledgement of a request, etc.
Sequence Number (32 bits): a number used to correlate requests and responses; incremented on each request.
Process ID (PID) (32 bits): sometimes referred to as port ID; a number used to uniquely identify a particular netlink socket; may or may not be the process’s ID.

Finally, a payload may immediately follow a netlink header. Again, note that the payload must be padded to a 4 byte boundary.

An example netlink message which sends a request to the kernel may resemble the following in Go:
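The sketch below is one way such a message could look. The Header and Message types are defined locally for illustration and only approximate the types in package netlink; the Marshal method, the message type value, and the payload bytes are assumptions made for the example. It relies on binary.NativeEndian, which requires Go 1.21 or newer.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	headerLen   = 16  // size of the fixed netlink header, in bytes
	flagRequest = 0x1 // NLM_F_REQUEST
)

// Header mirrors the five header fields described above.
type Header struct {
	Length   uint32
	Type     uint16
	Flags    uint16
	Sequence uint32
	PID      uint32
}

// Message is a header followed by an optional payload.
type Message struct {
	Header Header
	Data   []byte
}

// Marshal encodes m in host byte order. It computes Length itself
// (header plus unpadded payload) and zero-pads the result to 4 bytes.
func (m Message) Marshal() []byte {
	m.Header.Length = uint32(headerLen + len(m.Data))
	b := make([]byte, (int(m.Header.Length)+3)&^3)

	order := binary.NativeEndian
	order.PutUint32(b[0:4], m.Header.Length)
	order.PutUint16(b[4:6], m.Header.Type)
	order.PutUint16(b[6:8], m.Header.Flags)
	order.PutUint32(b[8:12], m.Header.Sequence)
	order.PutUint32(b[12:16], m.Header.PID)
	copy(b[headerLen:], m.Data)
	return b
}

func main() {
	msg := Message{
		Header: Header{
			Type:     18, // RTM_GETLINK, chosen only for illustration
			Flags:    flagRequest,
			Sequence: 1,
			PID:      0, // let netlink assign the PID
		},
		Data: []byte{0x01, 0x02, 0x03, 0x04, 0x05},
	}
	b := msg.Marshal()
	fmt.Printf("%d bytes on the wire: % x\n", len(b), b)
}
```

Here the Length field follows the convention of the C macro NLMSG_LENGTH and excludes the trailing padding; the padding bytes are still sent so that the next message starts on a 4 byte boundary.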

Sending and receiving netlink messages

Now that we are familiar with some of the basics of netlink sockets, we can send and receive data using a socket.

Once a message has been prepared, it can be sent to the kernel using sendto():
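A minimal sketch of that call, assuming the message has already been encoded into a byte slice. The helper name sendToKernel is hypothetical, the standard library's syscall package stands in for x/sys/unix to avoid an external dependency, and the NLMSG_NOOP message in main is only a harmless placeholder. It builds on Linux only and needs Go 1.21 or newer for binary.NativeEndian.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"syscall"
)

// sendToKernel (a hypothetical helper) writes one encoded netlink message
// to the kernel. A destination address with PID 0 addresses the kernel.
func sendToKernel(fd int, msg []byte) error {
	kernel := &syscall.SockaddrNetlink{Family: syscall.AF_NETLINK}
	if err := syscall.Sendto(fd, msg, 0, kernel); err != nil {
		return fmt.Errorf("sendto: %w", err)
	}
	return nil
}

func main() {
	fd, err := syscall.Socket(syscall.AF_NETLINK, syscall.SOCK_RAW, syscall.NETLINK_ROUTE)
	if err != nil {
		fmt.Println("netlink unavailable:", err)
		return
	}
	defer syscall.Close(fd)

	if err := syscall.Bind(fd, &syscall.SockaddrNetlink{Family: syscall.AF_NETLINK}); err != nil {
		fmt.Println("bind:", err)
		return
	}

	// A bare 16-byte header of type NLMSG_NOOP with no payload,
	// encoded in host byte order.
	msg := make([]byte, 16)
	binary.NativeEndian.PutUint32(msg[0:4], 16) // Length
	binary.NativeEndian.PutUint16(msg[4:6], 1)  // Type: NLMSG_NOOP
	binary.NativeEndian.PutUint32(msg[8:12], 1) // Sequence

	fmt.Println("send result:", sendToKernel(fd, msg))
}
```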

Read-only requests to netlink typically do not require any special privileges. Operations which modify the state of a subsystem using netlink, or require locking its internal state, typically require elevated privileges. This may mean running the program as root or using CAP_NET_ADMIN to:

Send a write request to make changes to a subsystem using netlink.
Send a read request with the NLM_F_ATOMIC flag, to receive an atomic snapshot of data from netlink.

Receiving messages from a netlink socket using recvfrom() can be slightly more complicated, depending on a variety of factors. Netlink may reply with:

Very small or very large messages.
Multi-part messages, broken into multiple pieces.
An explicit error number, when header type is “error”.

In addition, the sequence number and PID of each message should be validated as well. When working with raw system calls, it’s up to the socket’s user to handle these cases.

Large messages

To deal with large messages, I’ve employed a technique of allocating a single page of memory, peeking at the buffer (without draining it), and then doubling the size of the buffer if it’s too small to read the entire message. Thanks, Dominik Honnef, for your insight on this problem.

Error handling omitted for brevity. Please check your errors.

In theory, a netlink message may be of a size up to 4GiB (maximum 32-bit unsigned integer), but in practice, messages are much smaller.
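The peek-and-double technique described in this section can be sketched as follows. The function name recvGrow is hypothetical, and the demo pushes a datagram through a local AF_UNIX socket pair as a stand-in for a netlink socket, so that the sketch does not depend on kernel replies; the receive logic is the same for any datagram-style descriptor. The standard library's syscall package is used instead of x/sys/unix to avoid external dependencies.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// recvGrow reads one whole datagram from fd. It peeks with a page-sized
// buffer and doubles the buffer until the peeked data no longer fills it.
func recvGrow(fd int) ([]byte, error) {
	b := make([]byte, os.Getpagesize())
	for {
		// MSG_PEEK copies data out without removing it from the queue.
		n, _, err := syscall.Recvfrom(fd, b, syscall.MSG_PEEK)
		if err != nil {
			return nil, fmt.Errorf("peek: %w", err)
		}
		// A completely filled buffer may mean the data was truncated.
		if n < len(b) {
			break
		}
		b = make([]byte, len(b)*2)
	}

	// The buffer is now large enough, so drain the datagram for real.
	n, _, err := syscall.Recvfrom(fd, b, 0)
	if err != nil {
		return nil, fmt.Errorf("recvfrom: %w", err)
	}
	return b[:n], nil
}

// demo sends a datagram of the given size through a local socket pair
// and reports how many bytes recvGrow read back.
func demo(size int) (int, error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_DGRAM, 0)
	if err != nil {
		return 0, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	if _, err := syscall.Write(fds[0], make([]byte, size)); err != nil {
		return 0, err
	}
	b, err := recvGrow(fds[1])
	return len(b), err
}

func main() {
	size := os.Getpagesize() + 100 // forces at least one doubling
	n, err := demo(size)
	fmt.Printf("sent %d bytes, received %d bytes, err: %v\n", size, n, err)
}
```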

Multi-part messages

For certain types of messages, netlink may reply with a “multi-part message”. In this case, each message before the final one will have the “multi” flag set. The final message will have a type of “done”.

When returning multi-part messages, the first recvfrom() will return all messages with the “multi” flag set. Next, recvfrom() must be called again to retrieve the final message with header type “done”. This step is essential: otherwise netlink will simply hang on subsequent requests, waiting for the caller to drain the final “done” message.

The code for this isn’t as trivial as other examples, but you can take a look at my implementation if you’d like a reference.
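As a rough illustration of the loop described above (not the implementation in package netlink), the sketch below keeps reading until it sees a message of type “done”. The names message, encode, parse, and collect are hypothetical, and recv is a stand-in for a single recvfrom() call so that the example can run against simulated buffers. It needs Go 1.21 or newer for binary.NativeEndian.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	headerLen = 16
	typeDone  = 0x3 // NLMSG_DONE
	flagMulti = 0x2 // NLM_F_MULTI
)

// message holds the parts of a netlink message this sketch needs.
type message struct {
	Type, Flags uint16
	Data        []byte
}

// encode builds one message in host byte order, padded to 4 bytes.
func encode(typ, flags uint16, data []byte) []byte {
	length := headerLen + len(data)
	b := make([]byte, (length+3)&^3)
	binary.NativeEndian.PutUint32(b[0:4], uint32(length))
	binary.NativeEndian.PutUint16(b[4:6], typ)
	binary.NativeEndian.PutUint16(b[6:8], flags)
	copy(b[headerLen:], data)
	return b
}

// parse splits one receive buffer into the messages it contains.
func parse(b []byte) ([]message, error) {
	var msgs []message
	for len(b) > 0 {
		if len(b) < headerLen {
			return nil, errors.New("netlink: truncated header")
		}
		length := int(binary.NativeEndian.Uint32(b[0:4]))
		if length < headerLen || length > len(b) {
			return nil, errors.New("netlink: invalid message length")
		}
		msgs = append(msgs, message{
			Type:  binary.NativeEndian.Uint16(b[4:6]),
			Flags: binary.NativeEndian.Uint16(b[6:8]),
			Data:  b[headerLen:length],
		})
		// Step over the message and its padding.
		next := (length + 3) &^ 3
		if next > len(b) {
			next = len(b)
		}
		b = b[next:]
	}
	return msgs, nil
}

// collect calls recv until the reply is complete. recv is a hypothetical
// stand-in for a single recvfrom() call on the netlink socket.
func collect(recv func() ([]byte, error)) ([]message, error) {
	var out []message
	for {
		b, err := recv()
		if err != nil {
			return nil, err
		}
		msgs, err := parse(b)
		if err != nil {
			return nil, err
		}
		multi := false
		for _, m := range msgs {
			if m.Type == typeDone {
				return out, nil // final message drained
			}
			multi = multi || m.Flags&flagMulti != 0
			out = append(out, m)
		}
		if !multi {
			return out, nil // ordinary, single-part reply
		}
	}
}

func main() {
	// Two simulated reads: the "multi" messages, then the "done" message.
	reads := [][]byte{
		append(encode(16, flagMulti, []byte{1}), encode(16, flagMulti, []byte{2})...),
		encode(typeDone, flagMulti, nil),
	}
	i := 0
	msgs, err := collect(func() ([]byte, error) {
		b := reads[i]
		i++
		return b, nil
	})
	fmt.Println(len(msgs), "messages,", i, "reads, err:", err)
}
```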

Netlink error numbers

If netlink cannot satisfy a request for whatever reason, it will return an explicit error number in the payload of a message containing header type “error”. These error numbers are the same as Linux’s classic error numbers, such as ENOENT for “no such file or directory”, or EPERM for “permission denied”.

If a message’s header type indicates an error, the error number will be encoded as a signed 32 bit integer (note: also uses system endianness) in the first 4 bytes of the message’s payload.
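A sketch of decoding that error number in Go. The helper names checkError and encodeErrno are hypothetical. One detail goes beyond the text above and is stated here as background: the kernel stores the number negated (for example -2 for ENOENT), and a value of zero is an acknowledgement rather than an error. The sketch needs Go 1.21 or newer for binary.NativeEndian, and the error strings shown by main come from the Go runtime on the local system.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"syscall"
)

const typeError = 0x2 // NLMSG_ERROR

// encodeErrno builds the first 4 payload bytes of an error message.
func encodeErrno(code int32) []byte {
	b := make([]byte, 4)
	binary.NativeEndian.PutUint32(b, uint32(code))
	return b
}

// checkError returns the error carried by a message, or nil if the
// message is not an error or merely acknowledges a request.
func checkError(msgType uint16, payload []byte) error {
	if msgType != typeError {
		return nil
	}
	if len(payload) < 4 {
		return errors.New("netlink: error message payload too short")
	}

	// Signed 32-bit integer in host byte order.
	code := int32(binary.NativeEndian.Uint32(payload[0:4]))
	switch {
	case code == 0:
		return nil // zero acknowledges success
	case code < 0:
		code = -code // the kernel stores the number negated
	}
	return syscall.Errno(code)
}

func main() {
	fmt.Println(checkError(typeError, encodeErrno(-int32(syscall.ENOENT))))
	fmt.Println(checkError(typeError, encodeErrno(-int32(syscall.EPERM))))
	fmt.Println(checkError(typeError, encodeErrno(0)))
}
```

Returning a syscall.Errno lets callers compare the result against familiar constants such as syscall.ENOENT.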

Sequence number and PID validation

To ensure a netlink reply from the kernel is in response to one of our requests, we must also validate the sequence number and PID on each received message. In the majority of cases, these should match exactly what was sent to the kernel with a request. Subsequent requests should increment the sequence number before sending another message to netlink.

PID validation may vary slightly, depending on several conditions.

If a message is received in userspace on behalf of a multicast group, it will have a PID of 0, meaning the message originated in the kernel.
If a request is sent to the kernel with a PID of 0, netlink will assign a PID for a given socket on the first response. This PID should be used (and validated) in subsequent communications.

Assuming you didn’t specify a PID in bind(), when opening multiple netlink sockets in a single application, the first one will be assigned a PID equal to the process’s ID. Subsequent ones will be assigned a random number chosen by netlink. In my experience, it is much easier to just let netlink assign all PIDs itself, and make sure you keep track of which numbers it assigns for each socket.
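The validation rules from this section could be expressed roughly as below. The Header type, the validate function, and the error values are hypothetical names for this sketch. The relaxations for a zero sequence number or zero PID are one reasonable reading of the conditions listed above, not a definitive rule set.

```go
package main

import (
	"errors"
	"fmt"
)

// Header carries only the two fields needed for validation.
type Header struct {
	Sequence uint32
	PID      uint32
}

var (
	errSequence = errors.New("netlink: mismatched sequence number")
	errPID      = errors.New("netlink: mismatched PID")
)

// validate checks that a reply header matches the request it answers.
func validate(req, rep Header) error {
	// A request without a sequence number means we are most likely
	// looking at a multicast message, so the sequence is not compared.
	if req.Sequence != 0 && rep.Sequence != req.Sequence {
		return errSequence
	}

	// The PID is not compared when the request had PID 0 (netlink has
	// not assigned one yet) or the reply has PID 0 (a multicast message
	// that originated in the kernel).
	if req.PID != 0 && rep.PID != 0 && rep.PID != req.PID {
		return errPID
	}
	return nil
}

func main() {
	req := Header{Sequence: 7, PID: 4242}
	fmt.Println(validate(req, Header{Sequence: 7, PID: 4242})) // matches
	fmt.Println(validate(req, Header{Sequence: 8, PID: 4242})) // wrong sequence
	fmt.Println(validate(req, Header{Sequence: 7, PID: 1}))    // wrong PID
}
```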

Multicast groups

In addition to the classic request/response socket paradigm, netlink sockets also provide multicast groups to enable subscribing to certain events as they occur.

A multicast group can be joined using two different methods:

Specifying a groups bitmask during bind(). This is considered the “legacy” method.
Joining and leaving groups using setsockopt(). This is the preferred, modern method.

Joining and leaving groups using setsockopt() is a matter of swapping a single constant. In Go, this is done using uint32 “group” values.

Once a group is joined, you can listen for messages using recvfrom() as usual. Leaving the group will cause no further messages to be delivered for a given multicast group.
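A minimal sketch of joining and leaving a group with setsockopt(). The helper names joinGroup and leaveGroup are hypothetical, and the three constants are written out locally with the values from the Linux headers rather than taken from a package. The group number 1 used in main is RTNLGRP_LINK from the route family, picked only as an example. The sketch builds on Linux only.

```go
package main

import (
	"fmt"
	"syscall"
)

// Values from the Linux headers, defined locally for this sketch.
const (
	solNetlink     = 0x10e // SOL_NETLINK
	addMembership  = 0x1   // NETLINK_ADD_MEMBERSHIP
	dropMembership = 0x2   // NETLINK_DROP_MEMBERSHIP
)

// joinGroup subscribes the socket to a multicast group.
func joinGroup(fd int, group uint32) error {
	return syscall.SetsockoptInt(fd, solNetlink, addMembership, int(group))
}

// leaveGroup is identical apart from the option constant.
func leaveGroup(fd int, group uint32) error {
	return syscall.SetsockoptInt(fd, solNetlink, dropMembership, int(group))
}

func main() {
	fd, err := syscall.Socket(syscall.AF_NETLINK, syscall.SOCK_RAW, syscall.NETLINK_ROUTE)
	if err != nil {
		fmt.Println("netlink unavailable:", err)
		return
	}
	defer syscall.Close(fd)

	if err := syscall.Bind(fd, &syscall.SockaddrNetlink{Family: syscall.AF_NETLINK}); err != nil {
		fmt.Println("bind:", err)
		return
	}

	const linkGroup = 1 // RTNLGRP_LINK, used here only as an example
	fmt.Println("join:", joinGroup(fd, linkGroup))
	fmt.Println("leave:", leaveGroup(fd, linkGroup))
}
```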

Netlink attributes

To wrap up our primer on netlink sockets, we will discuss a very common data format for netlink message payloads: attributes.

Netlink attributes are unusual in that they are in LTV (length, type, value) format, instead of the typical TLV (type, length, value). As with every other integer in netlink sockets, the type and length values are also encoded with host endianness. Finally, netlink attributes must also be padded to a 4 byte boundary, just like netlink messages.

Each field contains the following information:

Length (16 bits): the length of the entire attribute, including length, type, and value fields. This value is not necessarily a multiple of 4 bytes. For example, if length is 17 bytes, the attribute will be padded to 20 bytes, but the 3 bytes of padding should not be interpreted as meaningful.
Type (16 bits): the type of an attribute, typically defined as a constant in some netlink family or header.
Value (variable bytes): the raw payload of an attribute. May contain nested attributes, which are stored in the same format. Those nested attributes may contain even more nested attributes!

There are two special flags which may be present in the type field of netlink attributes, though I have yet to encounter them in my work.

NLA_F_NESTED: specifies a nested attribute; used as a hint for parsing. Doesn’t always appear to be used, even if nested attributes are present.
NLA_F_NET_BYTEORDER: attribute data is stored in network byte order (big endian) instead of host endianness.

Consult the documentation of a given netlink family to determine if either of these flags should be checked.
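To make the attribute layout concrete, here is a sketch that encodes and decodes attributes, including one nested level. The Attribute type and the functions marshalAttr and parseAttrs are hypothetical names for this example. It assumes the two flags occupy the top two bits of the 16-bit type field, as in the Linux headers, and masks them off to recover the plain type. It needs Go 1.21 or newer for binary.NativeEndian.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	attrHeaderLen    = 4       // 2 bytes length + 2 bytes type
	flagNested       = 1 << 15 // NLA_F_NESTED
	flagNetByteOrder = 1 << 14 // NLA_F_NET_BYTEORDER
	typeMask         = ^uint16(flagNested | flagNetByteOrder)
)

// Attribute is one decoded length/type/value triple.
type Attribute struct {
	Type         uint16
	Nested       bool
	NetByteOrder bool
	Data         []byte
}

// marshalAttr encodes one attribute in host byte order, padded to 4 bytes.
func marshalAttr(typ uint16, data []byte) []byte {
	length := attrHeaderLen + len(data)
	b := make([]byte, (length+3)&^3)
	binary.NativeEndian.PutUint16(b[0:2], uint16(length)) // excludes padding
	binary.NativeEndian.PutUint16(b[2:4], typ)
	copy(b[attrHeaderLen:], data)
	return b
}

// parseAttrs decodes every attribute in b, skipping padding in between.
func parseAttrs(b []byte) ([]Attribute, error) {
	var attrs []Attribute
	for len(b) >= attrHeaderLen {
		length := int(binary.NativeEndian.Uint16(b[0:2]))
		typ := binary.NativeEndian.Uint16(b[2:4])
		if length < attrHeaderLen || length > len(b) {
			return nil, errors.New("netlink: invalid attribute length")
		}
		attrs = append(attrs, Attribute{
			Type:         typ & typeMask,
			Nested:       typ&flagNested != 0,
			NetByteOrder: typ&flagNetByteOrder != 0,
			Data:         b[attrHeaderLen:length],
		})
		next := (length + 3) &^ 3
		if next > len(b) {
			next = len(b)
		}
		b = b[next:]
	}
	return attrs, nil
}

func main() {
	// An outer attribute of type 2 whose value is another attribute.
	inner := marshalAttr(1, []byte("abcde"))
	outer := marshalAttr(2|flagNested, inner)

	attrs, err := parseAttrs(outer)
	if err != nil {
		fmt.Println(err)
		return
	}
	nested, err := parseAttrs(attrs[0].Data)
	fmt.Printf("outer type %d nested=%v, inner %q, err: %v\n",
		attrs[0].Type, attrs[0].Nested, nested[0].Data, err)
}
```

Because nested attributes use the same format, the same parseAttrs function is simply applied again to the value of the outer attribute.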

Summary

Now that we are familiar with using netlink sockets and messages, the next post in the series will build upon this knowledge to dive into generic netlink.

Hope you enjoyed this post! If you have questions or comments, feel free to reach out via the comments, Twitter, or Gophers Slack (username: mdlayher).

Updates

2/22/2017: moved background information about BSD sockets API to the “Creating netlink sockets” section.
2/22/2017: noted need for root or CAP_NET_ADMIN for many netlink write operations, and when using NLM_F_ATOMIC. Thanks, Steven Hartland from the golang-nuts thread.
2/23/2017: noted ability to specify a PID for a socket in bind(). Thanks, Dan Williams from a libnl thread.
2/27/2017: changed pseudocode to use x/sys/unix instead of syscall, since syscall is frozen.

References

The following links were used frequently as a reference as I built out package netlink, and authored this post:


A simple Linux network interface monitor using netlink

What is netlink?

Netlink is a convenient way for userspace to communicate with the Linux kernel. Communication takes place over an ordinary socket that uses a dedicated address family, AF_NETLINK.
Netlink lets you interact with a large number of kernel subsystems: interfaces, routing, the network packet filter. You can also talk to your own kernel module, provided, of course, that the module implements support for this kind of communication.
Each netlink message consists of a header, represented by the nlmsghdr structure, followed by a certain number of bytes of payload. The payload may be some structure or just raw data. A message may be split into several parts during delivery. In that case each part is marked with the NLM_F_MULTI flag, and the last one has the type NLMSG_DONE. A whole set of macros for parsing messages is defined in the header files netlink.h and rtnetlink.h.

Creating a netlink socket.

Declaring a netlink socket looks quite standard:

socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE)

Here AF_NETLINK is the netlink address family,
SOCK_RAW is the socket type,
NETLINK_ROUTE is the netlink protocol family.

The last parameter can vary, depending on what exactly we want from netlink.
Here is a table of the most interesting values (the full list can be found in the documentation):

NETLINK_ROUTE: receive notifications about changes to the routing table and network interfaces; can also be used to modify all parameters of these objects.
NETLINK_USERSOCK: reserved for user-defined protocols.
NETLINK_FIREWALL: passes IPv4 packets from the network filter to userspace.
NETLINK_INET_DIAG: monitoring of inet sockets.
NETLINK_NFLOG: ULOG for the network/packet filter.
NETLINK_SELINUX: receive notifications from SELinux.
NETLINK_NETFILTER: work with the netfilter subsystem.
NETLINK_KOBJECT_UEVENT: receive kernel messages.

The created socket can then be used to send messages, for example with send, and to receive messages with recvmsg.

The netlink message.

The message header is represented by the nlmsghdr structure.

The nlmsg_type field may specify one of the standard message types:
NLMSG_NOOP: messages of this type are ignored.
NLMSG_ERROR: an error message; the payload contains an nlmsgerr structure (more on it below).
NLMSG_DONE: a message of this type terminates a multi-part message.

The nlmsg_flags field may carry one or more flags (several flags are combined with a bitwise OR):

NLM_F_REQUEST: the message is a request.
NLM_F_MULTI: the message is one part of a multi-part message.
NLM_F_ACK: the message requests an acknowledgement.
NLM_F_ECHO: echo this request; the usual direction is from the kernel to userspace.
NLM_F_ROOT: the request returns an entire table rather than a single entry.
NLM_F_MATCH: the request returns all matching entries.
NLM_F_ATOMIC: the request returns an atomic snapshot of a table.
NLM_F_DUMP: equivalent to NLM_F_ROOT|NLM_F_MATCH.

NLM_F_REPLACE: replace an existing matching object.
NLM_F_EXCL: do not replace the object if it already exists.
NLM_F_CREATE: create the object if it does not exist.
NLM_F_APPEND: append the object to the end of the existing list.
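The flags are plain bit values, so a request's nlmsg_flags field is built with a bitwise OR. The sketch below writes the values from linux/netlink.h as local Go constants; the constant names and the has helper are hypothetical. Note that the two groups of modifier flags reuse the same bits, so their meaning depends on the kind of request.

```go
package main

import "fmt"

// Flag values as defined in linux/netlink.h.
const (
	flagRequest = 0x001 // NLM_F_REQUEST
	flagMulti   = 0x002 // NLM_F_MULTI
	flagAck     = 0x004 // NLM_F_ACK
	flagEcho    = 0x008 // NLM_F_ECHO

	// Modifiers for requests that read data.
	flagRoot   = 0x100                // NLM_F_ROOT
	flagMatch  = 0x200                // NLM_F_MATCH
	flagAtomic = 0x400                // NLM_F_ATOMIC
	flagDump   = flagRoot | flagMatch // NLM_F_DUMP

	// Modifiers for requests that create objects; same bits, new meaning.
	flagReplace = 0x100 // NLM_F_REPLACE
	flagExcl    = 0x200 // NLM_F_EXCL
	flagCreate  = 0x400 // NLM_F_CREATE
	flagAppend  = 0x800 // NLM_F_APPEND
)

// has reports whether every bit of want is set in flags.
func has(flags, want uint16) bool {
	return flags&want == want
}

func main() {
	// "Dump the whole table" request.
	var dump uint16 = flagRequest | flagDump
	// "Create, fail if it exists, and acknowledge" request.
	var create uint16 = flagRequest | flagAck | flagCreate | flagExcl

	fmt.Printf("dump   = %#05x, has ROOT: %v\n", dump, has(dump, flagRoot))
	fmt.Printf("create = %#05x, has ACK:  %v\n", create, has(create, flagAck))
}
```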

To identify clients (in the kernel and in userspace) there is a special address structure, sockaddr_nl. Its two key fields are nl_pid and nl_groups.

nl_pid is the unique address of the socket. For clients in the kernel it is always zero. For userspace clients it equals the ID of the process that owns the socket. Every identifier must be unique, so you may run into a problem when you try to create several netlink sockets in a multithreaded application: creating a new socket fails with the error «Operation not permitted». To work around this limitation, assign nl_pid the value of the following expression:
pthread_self()
The identifier must be assigned before bind() is called on the socket.
The identifier can also be set to zero. In that case the kernel generates unique identifiers, but the first socket created in an application is always assigned that application's process ID.

nl_groups is a bitmask in which each bit represents a netlink group number. When calling bind() on a netlink socket, specify the bitmask of the groups the application wants to listen to. Several groups can be combined with a bitwise OR.
The main groups are defined in the netlink header files.
Some examples:

RTMGRP_LINK: this group receives notifications about changes to network interfaces (an interface was removed, added, went down, came up).
RTMGRP_IPV4_IFADDR: this group receives notifications about changes to the IPv4 addresses of interfaces (an address was added or removed).
RTMGRP_IPV6_IFADDR: this group receives notifications about changes to the IPv6 addresses of interfaces (an address was added or removed).
RTMGRP_IPV4_ROUTE: this group receives notifications about changes to the routing table for IPv4 addresses.
RTMGRP_IPV6_ROUTE: this group receives notifications about changes to the routing table for IPv6 addresses.
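The following Go sketch shows how such a bitmask is combined and inspected. The constant values are those of linux/rtnetlink.h, written out locally; the Go names and the describeGroups helper are hypothetical. The resulting mask is what would be stored in nl_groups before calling bind().

```go
package main

import "fmt"

// Legacy group bitmask values from linux/rtnetlink.h.
const (
	rtmgrpLink       = 0x001 // RTMGRP_LINK
	rtmgrpIPv4Ifaddr = 0x010 // RTMGRP_IPV4_IFADDR
	rtmgrpIPv4Route  = 0x040 // RTMGRP_IPV4_ROUTE
	rtmgrpIPv6Ifaddr = 0x100 // RTMGRP_IPV6_IFADDR
	rtmgrpIPv6Route  = 0x400 // RTMGRP_IPV6_ROUTE
)

// group pairs a bit with its name for printing.
type group struct {
	bit  uint32
	name string
}

var groups = []group{
	{rtmgrpLink, "RTMGRP_LINK"},
	{rtmgrpIPv4Ifaddr, "RTMGRP_IPV4_IFADDR"},
	{rtmgrpIPv4Route, "RTMGRP_IPV4_ROUTE"},
	{rtmgrpIPv6Ifaddr, "RTMGRP_IPV6_IFADDR"},
	{rtmgrpIPv6Route, "RTMGRP_IPV6_ROUTE"},
}

// describeGroups lists the names of the groups set in mask.
func describeGroups(mask uint32) []string {
	var names []string
	for _, g := range groups {
		if mask&g.bit != 0 {
			names = append(names, g.name)
		}
	}
	return names
}

func main() {
	// Subscribe to link, IPv4 address, and IPv4 route notifications.
	mask := uint32(rtmgrpLink | rtmgrpIPv4Ifaddr | rtmgrpIPv4Route)
	fmt.Printf("nl_groups = %#x %v\n", mask, describeGroups(mask))
}
```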

The nlmsghdr header structure is always followed by the data block. It can be accessed with the macros described next.

Netlink macros

The most useful macros here are:
NLMSG_ALIGN: rounds the size of a netlink message up to the nearest properly aligned value.
NLMSG_LENGTH: takes the payload size and returns the value to store in the nlmsg_len field of the nlmsghdr header (the payload size plus the aligned header size).
NLMSG_SPACE: returns the number of bytes that a payload of the given length occupies in a netlink packet.
NLMSG_DATA: returns a pointer to the data associated with the given nlmsghdr header.
NLMSG_NEXT: returns the next part of a multi-part message. The macro takes the current nlmsghdr header and yields the next one. The caller must check whether the current header has the type NLMSG_DONE, because the macro does not return NULL when the message ends. The second parameter is the remaining length of the message buffer; the macro decrements it by the aligned length of the current message.
NLMSG_OK: returns true if the message was not truncated and is safe to parse.
NLMSG_PAYLOAD: returns the size of the payload associated with the nlmsghdr header.
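For readers who prefer code to prose, the arithmetic behind the size-related macros can be mirrored in Go roughly as follows. The function names are hypothetical equivalents of the C macros; NLMSG_DATA is omitted because it is pure pointer arithmetic, and nlmsgNext returns offsets instead of a pointer.

```go
package main

import "fmt"

const (
	alignTo = 4  // NLMSG_ALIGNTO
	hdrLen  = 16 // NLMSG_HDRLEN: aligned size of struct nlmsghdr
)

// nlmsgAlign mirrors NLMSG_ALIGN: round up to a multiple of 4.
func nlmsgAlign(n int) int { return (n + alignTo - 1) &^ (alignTo - 1) }

// nlmsgLength mirrors NLMSG_LENGTH: the value for nlmsg_len.
func nlmsgLength(payload int) int { return payload + hdrLen }

// nlmsgSpace mirrors NLMSG_SPACE: bytes occupied including padding.
func nlmsgSpace(payload int) int { return nlmsgAlign(nlmsgLength(payload)) }

// nlmsgPayload mirrors NLMSG_PAYLOAD: bytes left in a message of length
// msgLen after a fixed-size family header of length fixed.
func nlmsgPayload(msgLen, fixed int) int { return msgLen - nlmsgSpace(fixed) }

// nlmsgOK mirrors NLMSG_OK: the message fits in the remaining buffer.
func nlmsgOK(msgLen, remaining int) bool {
	return remaining >= hdrLen && msgLen >= hdrLen && msgLen <= remaining
}

// nlmsgNext mirrors NLMSG_NEXT: how far to advance and what remains.
func nlmsgNext(msgLen, remaining int) (advance, left int) {
	advance = nlmsgAlign(msgLen)
	return advance, remaining - advance
}

func main() {
	fmt.Println("NLMSG_LENGTH(5)  =", nlmsgLength(5))
	fmt.Println("NLMSG_SPACE(5)   =", nlmsgSpace(5))
	fmt.Println("NLMSG_PAYLOAD    =", nlmsgPayload(40, 16))
	fmt.Println("NLMSG_OK(21, 24) =", nlmsgOK(21, 24))
	advance, left := nlmsgNext(21, 48)
	fmt.Println("NLMSG_NEXT       =", advance, left)
}
```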

From theory to practice.

Well then. I have probably bored you with dry theory by now 🙂 Some of it may have seemed confusing or unclear; I will try to spell everything out with concrete examples, and there is really nothing complicated in it.
Below is the promised application, which receives notifications about changes to network interfaces and the routing table.
The example introduces a number of new structures:

The iovec structure stores the payload passed through netlink sockets. Its iov_base field is assigned a pointer to a byte array, and the message data is written into that array.

The msghdr structure is what is actually passed through the socket. It contains a pointer to the payload block, the number of such blocks, and a number of additional flags and fields that mostly come from the BSD platform.

The ifinfomsg structure represents a network device: its family, type, index, and flags.

The ifaddrmsg structure represents a network address assigned to a network interface.

The rtattr structure stores a single parameter of a link or an address.

Source code of the monitor

Compiling the program:
gcc monitor.c -o monitor

And the result of running it:

Notes on the code.
After the program starts, we create a netlink socket and check that it was created successfully. Next, the necessary variables are declared and the local address structure is filled in. Here we specify the message groups we want to subscribe to: RTMGRP_LINK, RTMGRP_IPV4_IFADDR, RTMGRP_IPV4_ROUTE.
We also declare the message structure and attach a single data block to it.
After that, the address is bound to the socket with bind(). From that point on we are subscribed to messages for the specified groups and can receive messages through the socket.
Next comes an infinite loop that receives messages from the socket. Since a received block of data may contain several headers with their associated data, we iterate over all the received data using the netlink macros.
Each new message is available through the pointer struct nlmsghdr *h.
Now the message itself can be parsed. We look at the nlmsg_type field to find out what kind of message has arrived. If it relates to the routing table, we print a notice and move on to the next message. If not, we dig into the details.
Arrays of rtattr attributes are declared to hold all the data we need. The helper function parseRtattr fetches this data: it uses the netlink macros to fill the given array with all the attributes from the data block that follows the ifinfomsg or ifaddrmsg structure.
Once the arrays are filled with attributes, we can work with these values, analyze them, and print them.
Each attribute is accessed by its index. All indexes are defined in the netlink header files and are commented.
In this case we use the following indexes:
IFLA_IFNAME: the index of the attribute holding the interface name.
IFA_LOCAL: the index of the attribute holding the local IP address.
After all this we have complete information about what happened and can print it to the screen.
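The parsing stage described above can be sketched in Go as follows. This is not a port of the C monitor: it covers only the step that turns one message type and payload into a line of text, and it runs against hand-built buffers instead of a socket. The names describe, parseRtattr, and rtattr are hypothetical helpers, and the numeric constants are written out locally with the values from linux/rtnetlink.h, linux/if_link.h, and linux/if_addr.h. It needs Go 1.21 or newer for binary.NativeEndian.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"net"
)

const (
	rtmNewLink, rtmDelLink   = 16, 17 // RTM_NEWLINK, RTM_DELLINK
	rtmNewAddr, rtmDelAddr   = 20, 21 // RTM_NEWADDR, RTM_DELADDR
	rtmNewRoute, rtmDelRoute = 24, 25 // RTM_NEWROUTE, RTM_DELROUTE

	sizeofIfInfomsg = 16 // struct ifinfomsg
	sizeofIfAddrmsg = 8  // struct ifaddrmsg
	iflaIfname      = 3  // IFLA_IFNAME
	ifaLocal        = 2  // IFA_LOCAL
)

// rtattr encodes one attribute; it exists only to build demo input.
func rtattr(typ uint16, data []byte) []byte {
	length := 4 + len(data)
	b := make([]byte, (length+3)&^3)
	binary.NativeEndian.PutUint16(b[0:2], uint16(length))
	binary.NativeEndian.PutUint16(b[2:4], typ)
	copy(b[4:], data)
	return b
}

// parseRtattr indexes every attribute in b by its type.
func parseRtattr(b []byte) map[uint16][]byte {
	attrs := make(map[uint16][]byte)
	for len(b) >= 4 {
		length := int(binary.NativeEndian.Uint16(b[0:2]))
		typ := binary.NativeEndian.Uint16(b[2:4])
		if length < 4 || length > len(b) {
			break // malformed attribute, stop parsing
		}
		attrs[typ] = b[4:length]
		next := (length + 3) &^ 3
		if next > len(b) {
			next = len(b)
		}
		b = b[next:]
	}
	return attrs
}

// describe turns one message type and payload into a line of text.
func describe(msgType uint16, payload []byte) string {
	switch msgType {
	case rtmNewRoute, rtmDelRoute:
		return "routing table changed"
	case rtmNewLink, rtmDelLink:
		if len(payload) < sizeofIfInfomsg {
			return "truncated link message"
		}
		attrs := parseRtattr(payload[sizeofIfInfomsg:])
		name := string(bytes.TrimRight(attrs[iflaIfname], "\x00"))
		if msgType == rtmDelLink {
			return "interface " + name + " removed"
		}
		return "interface " + name + " changed"
	case rtmNewAddr, rtmDelAddr:
		if len(payload) < sizeofIfAddrmsg {
			return "truncated address message"
		}
		attrs := parseRtattr(payload[sizeofIfAddrmsg:])
		ip := net.IP(attrs[ifaLocal]).String()
		if msgType == rtmDelAddr {
			return "address " + ip + " removed"
		}
		return "address " + ip + " added"
	}
	return "unhandled message type"
}

func main() {
	// Hand-built payloads: a zeroed fixed header followed by attributes.
	link := append(make([]byte, sizeofIfInfomsg), rtattr(iflaIfname, []byte("eth0\x00"))...)
	addr := append(make([]byte, sizeofIfAddrmsg), rtattr(ifaLocal, []byte{192, 0, 2, 1})...)

	fmt.Println(describe(rtmNewLink, link))
	fmt.Println(describe(rtmNewAddr, addr))
	fmt.Println(describe(rtmNewRoute, nil))
}
```

A map keyed by attribute type replaces the fixed-size rtattr array of the C version; both give direct access to an attribute by its index.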

That's all. I very much hope this material will be useful to someone.
If enough people are interested (more than one :) ), I can write a follow-up covering, for example, interaction with a kernel module or working with IPv6.