Linux with read blocking

Contents

File locking in Linux
Introduction
Advisory locking
Common features
Differing features
File descriptors and i-nodes
BSD locks (flock)
POSIX record locks (fcntl)
lockf function
Open file description locks (fcntl)
Emulating Open file description locks
Test program
Command-line tools
Mandatory locking
Example usage
Linux Kernel 5.0: writing a Simple Block Device for blk-mq

File locking in Linux

Introduction

File locking is a mutual-exclusion mechanism for files. Linux supports two major kinds of file locks:

advisory locks
mandatory locks

Below we discuss all lock types available in POSIX and Linux and provide usage examples.

Advisory locking

Traditionally, locks are advisory in Unix. They work only when a process explicitly acquires and releases locks, and are ignored if a process is not aware of locks.

There are several types of advisory locks available in Linux:

BSD locks (flock)
POSIX record locks (fcntl, lockf)
Open file description locks (fcntl)

All locks except the lockf function are reader-writer locks, i.e. support exclusive and shared modes.

Note that flockfile and friends have nothing to do with file locks. They manage the internal mutex of the FILE object from stdio.

File Locks, GNU libc manual
Open File Description Locks, GNU libc manual
File-private POSIX locks, an LWN article about the predecessor of open file description locks

Common features

The following features are common for locks of all types:

All locks support blocking and non-blocking operations.
Locks are allowed only on files, but not directories.
Locks are automatically removed when the process exits or terminates. It’s guaranteed that if a lock is acquired, the process acquiring the lock is still alive.

Differing features

This table summarizes the differences between the lock types. A more detailed description and usage examples are provided below.

Feature	BSD locks	lockf function	POSIX record locks	Open file description locks
Portability	widely available	POSIX (XSI)	POSIX (base standard)	Linux 3.15+
Associated with	File object	[i-node, pid] pair	[i-node, pid] pair	File object
Applying to byte range	no	yes	yes	yes
Support exclusive and shared modes	yes	no	yes	yes
Atomic mode switch	no	—	yes	yes
Works on NFS (Linux)	Linux 2.6.12+	yes	yes	yes

File descriptors and i-nodes

A file descriptor is an index in the per-process file descriptor table (on the left of the picture). Each file descriptor table entry contains a reference to a file object, stored in the file table (in the middle of the picture). Each file object contains a reference to an i-node, stored in the i-node table (on the right of the picture).

A file descriptor is just a number that is used to refer to a file object from user space. A file object represents an opened file. It contains things like the current read/write offset, the non-blocking flag, and other non-persistent state. An i-node represents a filesystem object. It contains things like file meta-information (e.g. owner and permissions) and references to data blocks.

File descriptors created by several open() calls for the same file path point to different file objects, but these file objects point to the same i-node. Duplicated file descriptors created by dup2() or fork() point to the same file object.
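The difference between duplicated and independently opened descriptors can be observed through the shared read/write offset. The following is a minimal sketch, not part of the original article: the helper names and the temporary path template are hypothetical, and error handling is kept short.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 if an offset change made through fd_a is visible through fd_b,
 * i.e. both descriptors refer to the same file object; 0 if not; -1 on error. */
static int shares_offset(int fd_a, int fd_b)
{
    if (lseek(fd_a, 0, SEEK_SET) < 0 || lseek(fd_b, 0, SEEK_SET) < 0)
        return -1;
    if (lseek(fd_a, 5, SEEK_SET) < 0)
        return -1;
    return lseek(fd_b, 0, SEEK_CUR) == 5;
}

/* Creates a temporary file and compares a duplicated descriptor with an
 * independently opened one. Returns 0 on success and fills both flags. */
static int demo_fd_sharing(int *dup_shares, int *open_shares)
{
    char path[] = "/tmp/fd_demo_XXXXXX"; /* hypothetical location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int fd_dup = dup(fd);             /* same file object */
    int fd_open = open(path, O_RDWR); /* new file object, same i-node */
    int rc = -1;
    if (fd_dup >= 0 && fd_open >= 0) {
        *dup_shares = shares_offset(fd, fd_dup);
        *open_shares = shares_offset(fd, fd_open);
        rc = 0;
    }

    if (fd_dup >= 0)
        close(fd_dup);
    if (fd_open >= 0)
        close(fd_open);
    close(fd);
    unlink(path);
    return rc;
}
```

Under these assumptions the duplicated descriptor follows the offset of the original, while the independently opened descriptor keeps its own offset.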

BSD locks and Open file description locks are associated with a file object, while a POSIX record lock is associated with an [i-node, pid] pair. We’ll discuss it below.

BSD locks (flock)

The simplest and most common file locks are provided by flock(2).

not specified in POSIX, but widely available on various Unix systems
always lock the entire file
associated with a file object
do not guarantee atomic switch between the locking modes (exclusive and shared)
up to Linux 2.6.11, didn’t work on NFS; since Linux 2.6.12, flock() locks on NFS are emulated using fcntl() POSIX record byte-range locks on the entire file (unless the emulation is disabled in the NFS mount options)

The lock acquisition is associated with a file object, i.e.:

duplicated file descriptors, e.g. created using dup2 or fork, share the lock acquisition;
independent file descriptors, e.g. created using two open calls (even for the same file), don’t share the lock acquisition.

This means that with BSD locks, threads or processes can’t be synchronized on the same or duplicated file descriptor, but nevertheless, both can be synchronized on independent file descriptors.
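A short sketch of this behaviour, assuming a Linux system where flock() is available: one descriptor takes an exclusive lock, then a duplicated descriptor and an independently opened descriptor each try to take it without blocking. The helper names and the temporary path are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* Tries to take an exclusive BSD lock without blocking.
 * Returns 1 if acquired, 0 if another file object holds it, -1 on error. */
static int try_flock_exclusive(int fd)
{
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
        return 1;
    return errno == EWOULDBLOCK ? 0 : -1;
}

/* Locks a temporary file through one descriptor and probes it through a
 * duplicated and an independent descriptor. Returns 0 on success. */
static int demo_flock(int *via_dup, int *via_open)
{
    char path[] = "/tmp/flock_demo_XXXXXX"; /* hypothetical location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int fd_dup = dup(fd);
    int fd_open = open(path, O_RDWR);
    int rc = -1;
    if (fd_dup >= 0 && fd_open >= 0 && flock(fd, LOCK_EX) == 0) {
        *via_dup = try_flock_exclusive(fd_dup);   /* same file object */
        *via_open = try_flock_exclusive(fd_open); /* different file object */
        rc = 0;
    }

    if (fd_dup >= 0)
        close(fd_dup);
    if (fd_open >= 0)
        close(fd_open);
    close(fd);
    unlink(path);
    return rc;
}
```

The duplicated descriptor is expected to succeed because it already shares the lock, while the independent descriptor is refused even though it belongs to the same process.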

flock() doesn’t guarantee atomic mode switch. From the man page:

Converting a lock (shared to exclusive, or vice versa) is not guaranteed to be atomic: the existing lock is first removed, and then a new lock is established. Between these two steps, a pending lock request by another process may be granted, with the result that the conversion either blocks, or fails if LOCK_NB was specified. (This is the original BSD behaviour, and occurs on many other implementations.)

This problem is solved by POSIX record locks and Open file description locks.

POSIX record locks (fcntl)

POSIX record locks, also known as process-associated locks, are provided by fcntl(2); see the “Advisory record locking” section in the man page.

specified in POSIX (base standard)
can be applied to a byte range
associated with an [i-node, pid] pair instead of a file object
guarantee atomic switch between the locking modes (exclusive and shared)
work on NFS (on Linux)

The lock acquisition is associated with an [i-node, pid] pair, i.e.:

file descriptors opened by the same process for the same file share the lock acquisition (even independent file descriptors, e.g. created using two open calls);
file descriptors opened by different processes don’t share the lock acquisition;

This means that with POSIX record locks, it is possible to synchronize processes, but not threads. All threads belonging to the same process always share the lock acquisition of a file, which means that:

the lock acquired through some file descriptor by some thread may be released through another file descriptor by another thread;
when any thread calls close on any descriptor referring to the given file, the lock is released for the whole process, even if there are other opened descriptors referring to this file.
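The close() pitfall can be reproduced with a small experiment. In this sketch (hypothetical helper names, temporary path assumed writable), the process locks a file through one descriptor, then opens and closes a second descriptor for the same file; a forked child uses F_GETLK to report whether the parent's lock is still in place.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sets or clears a whole-file POSIX record lock (F_RDLCK, F_WRLCK, F_UNLCK). */
static int set_record_lock(int fd, short type)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0; /* 0 means "until end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/* Forks a child that asks, via F_GETLK, whether another process (the parent)
 * holds a conflicting lock. Returns 1 if locked, 0 if not, -1 on error. */
static int lock_visible_to_child(int fd)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock fl = {0};
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        if (fcntl(fd, F_GETLK, &fl) < 0)
            _exit(2);
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) <= 1 ? WEXITSTATUS(status) : -1;
}

/* Returns 0 on success; *before and *after tell whether the lock was
 * visible before and after closing an unrelated descriptor. */
static int demo_close_releases_lock(int *before, int *after)
{
    char path[] = "/tmp/posix_lock_demo_XXXXXX"; /* hypothetical location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (set_record_lock(fd, F_WRLCK) == 0) {
        *before = lock_visible_to_child(fd);
        int fd2 = open(path, O_RDONLY);
        if (fd2 >= 0) {
            close(fd2); /* drops every record lock the process holds on the file */
            *after = lock_visible_to_child(fd);
            rc = 0;
        }
    }
    close(fd);
    unlink(path);
    return rc;
}
```

The lock is taken through fd and never explicitly released, yet closing fd2 is enough for the child to see the file as unlocked.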

This problem is solved by Open file description locks.

lockf function

The lockf(3) function is a simplified version of POSIX record locks.

specified in POSIX (XSI)
can be applied to a byte range (optionally automatically expanding when data is appended in future)
associated with an [i-node, pid] pair instead of a file object
supports only exclusive locks
works on NFS (on Linux)

Since lockf locks are associated with an [i-node, pid] pair, they have the same problems as POSIX record locks described above.

The interaction between lockf and other types of locks is not specified by POSIX. On Linux, lockf is just a wrapper for POSIX record locks.

Open file description locks (fcntl)

Open file description locks are Linux-specific and combine advantages of the BSD locks and POSIX record locks. They are provided by fcntl(2); see the “Open file description locks (non-POSIX)” section in the man page.

Linux-specific, not specified in POSIX
can be applied to a byte range
associated with a file object
guarantee atomic switch between the locking modes (exclusive and shared)
work on NFS (on Linux)

Thus, Open file description locks combine advantages of BSD locks and POSIX record locks: they provide both atomic switch between the locking modes, and the ability to synchronize both threads and processes.

These locks are available since the 3.15 kernel.

The API is the same as for POSIX record locks (see above). It uses struct flock too. The only difference is in fcntl command names:

F_OFD_SETLK instead of F_SETLK
F_OFD_SETLKW instead of F_SETLKW
F_OFD_GETLK instead of F_GETLK
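A sketch of the F_OFD_SETLK command, assuming Linux 3.15+ and glibc (which exposes the F_OFD_* constants under _GNU_SOURCE). The helper names are hypothetical, and the numeric fallback for F_OFD_SETLK is only there for older headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes F_OFD_* in <fcntl.h> on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef F_OFD_SETLK
#define F_OFD_SETLK 37 /* fallback: value used by the Linux UAPI headers */
#endif

/* Tries to take a whole-file open file description lock without blocking.
 * Returns 1 if acquired, 0 on conflict, -1 on other errors. */
static int try_ofd_lock(int fd, short type)
{
    struct flock fl = {0}; /* l_pid must be 0 for the OFD commands */
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    if (fcntl(fd, F_OFD_SETLK, &fl) == 0)
        return 1;
    return (errno == EAGAIN || errno == EACCES) ? 0 : -1;
}

/* Locks a temporary file through one descriptor and probes it through a
 * duplicated and an independent descriptor. Returns 0 on success. */
static int demo_ofd(int *via_dup, int *via_open)
{
    char path[] = "/tmp/ofd_demo_XXXXXX"; /* hypothetical location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int fd_dup = dup(fd);
    int fd_open = open(path, O_RDWR);
    int rc = -1;
    if (fd_dup >= 0 && fd_open >= 0 && try_ofd_lock(fd, F_WRLCK) == 1) {
        *via_dup = try_ofd_lock(fd_dup, F_WRLCK);   /* same file object */
        *via_open = try_ofd_lock(fd_open, F_WRLCK); /* different file object */
        rc = 0;
    }

    if (fd_dup >= 0)
        close(fd_dup);
    if (fd_open >= 0)
        close(fd_open);
    close(fd);
    unlink(path);
    return rc;
}
```

Unlike POSIX record locks, the independent descriptor is refused here even though it belongs to the same process, which is what makes these locks usable between threads.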

Emulating Open file description locks

What do we have for multithreading and atomicity so far?

BSD locks allow thread synchronization but don’t allow atomic mode switch.
POSIX record locks don’t allow thread synchronization but allow atomic mode switch.
Open file description locks allow both but are available only on recent Linux kernels.

If you need both features but can’t use Open file description locks (e.g. you’re using some embedded system with an outdated Linux kernel), you can emulate them on top of the POSIX record locks.

Here is one possible approach:

Implement your own API for file locks. Ensure that all threads always use this API instead of using fcntl() directly. Ensure that threads never open and close lock-files directly.

In the API, implement a process-wide singleton (shared by all threads) holding all currently acquired locks.

Associate two additional objects with every acquired lock:

a counter
an RW-mutex

Now, you can implement lock operations as follows:

Acquire lock

First, acquire the RW-mutex. If the user requested the shared mode, acquire a read lock. If the user requested the exclusive mode, acquire a write lock.
Check the counter. If it’s zero, also acquire the file lock using fcntl().
Increment the counter.

Release lock

Decrement the counter.
If the counter becomes zero, release the file lock using fcntl().
Release the RW-mutex.

This approach makes both thread and process synchronization possible.

Test program

I’ve prepared a small program that helps to learn the behavior of different lock types.

The program starts two threads or processes, both of which wait to acquire the lock, then sleep for one second, and then release the lock. It has three parameters:

lock mode: flock (BSD locks), lockf , fcntl_posix (POSIX record locks), fcntl_linux (Open file description locks)

access mode: same_fd (access lock via the same descriptor), dup_fd (access lock via duplicated descriptors), two_fds (access lock via two descriptors opened independently for the same path)

concurrency mode: threads (access lock from two threads), processes (access lock from two processes)

Below you can find some examples.

Threads are not serialized if they use BSD locks on duplicated descriptors:

But they are serialized if they are used on two independent descriptors:

Threads are not serialized if they use POSIX record locks on two independent descriptors:

But processes are serialized:

Command-line tools

The following tools may be used to acquire and release file locks from the command line:

flock

Provided by the util-linux package. Uses the flock() function.

There are two ways to use this tool:

run a command while holding a lock:

flock will acquire the lock, run the command, and release the lock.

open a file descriptor in bash and use flock to acquire and release the lock manually:

You can try to run these two snippets in parallel in different terminals and see that while one is sleeping while holding the lock, another is blocked in flock.

Another tool is provided by the procmail package.

It runs the given command while holding a lock. It can use either the flock(), lockf(), or fcntl() function, depending on what’s available on the system.

There are also two ways to inspect the currently acquired locks:

lslocks

Provided by the util-linux package.

Lists all the currently held file locks in the entire system. Allows filtering by PID and configuring the output format.

/proc/locks

A file in the procfs virtual file system that shows current file locks of all types. The lslocks tool relies on this file.

Mandatory locking

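Linux has limited support for mandatory file locking, and the feature is no longer supported at all since Linux 5.15, so the description below applies to older kernels. See the “Mandatory locking” section in the fcntl(2) man page.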

A mandatory lock is activated for a file when all of these conditions are met:

The partition was mounted with the mand option.
The set-group-ID bit is on and group-execute bit is off for the file.
A POSIX record lock is acquired.

Note that the set-group-ID bit has its regular meaning of elevating privileges when the group-execute bit is on and a special meaning of enabling mandatory locking when the group-execute bit is off.
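The permission-bit condition can be set from code as well as with chmod. A minimal sketch, with a hypothetical function name: it turns the set-group-ID bit on and the group-execute bit off while leaving the other permission bits unchanged. Whether the kernel honours the combination still depends on the mount option and the kernel version.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Sets set-group-ID on and group-execute off for an open file,
 * keeping all other permission bits. Returns 0 on success, -1 on error. */
static int mark_for_mandatory_locking(int fd)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return -1;

    mode_t mode = st.st_mode & 07777; /* permission bits only */
    mode &= ~(mode_t)S_IXGRP;         /* group-execute off */
    mode |= S_ISGID;                  /* set-group-ID on */
    return fchmod(fd, mode);
}
```

A POSIX record lock acquired on such a file is then the third condition from the list above.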

When a mandatory lock is activated, it affects regular system calls on the file:

When an exclusive or shared lock is acquired, all system calls that modify the file (e.g. open() and truncate()) are blocked until the lock is released.

When an exclusive lock is acquired, all system calls that read from the file (e.g. read()) are blocked until the lock is released.

However, the documentation mentions that the current implementation is not reliable, in particular:

races are possible when locks are acquired concurrently with read() or write()
races are possible when using mmap()

Since mandatory locks are not allowed for directories and are ignored by unlink() and rename() calls, you can’t prevent file deletion or renaming using these locks.

Example usage

Below you can find a usage example of mandatory locking.

Mount the partition and create a file with the mandatory locking enabled:

Acquire a lock in the first terminal:

Try to read the file in the second terminal:

Linux Kernel 5.0: writing a Simple Block Device for blk-mq

Good News, Everyone!

Linux kernel 5.0 is already here and is showing up in experimental distributions such as Arch, openSUSE Tumbleweed, and Fedora.

And if you look at the release candidates of Ubuntu Disco Dingo and Red Hat 8, it becomes clear that kernel 5.0 will soon migrate from enthusiasts' desktops to serious servers.
Some will say: so what? Just another release, nothing special. Linus Torvalds himself said:

I’d like to point out (yet again) that we don’t do feature-based releases, and that “5.0” doesn’t mean anything more than that the 4.x numbers started getting big enough that I ran out of fingers and toes.

However, the module for floppy disks (for those who don't know, these are disks the size of a shirt breast pocket with a capacity of 1.44 MB) did get fixed…
And here is why:

It is all about the multi-queue block layer (blk-mq). There are plenty of introductory articles about it on the internet, so let's get straight to the point. The transition to blk-mq started long ago and progressed slowly. Multi-queue SCSI appeared (the scsi_mod.use_blk_mq kernel parameter), new schedulers appeared (mq-deadline, bfq, and so on)…

By the way, which one do you use?

The number of block device drivers that work the old way kept shrinking. In 5.0 the blk_init_queue() function was removed as no longer needed. As a result, the good old code from lwn.net/Articles/58720, dating from 2003, not only fails to build but has also lost its relevance. Moreover, new distributions that are being prepared for release this year use the multi-queue block layer in their default configuration. For example, Manjaro 18 ships a 4.19 kernel, yet blk-mq is the default there.

So we can consider the transition to blk-mq complete in kernel 5.0. For me this is an important event that will require rewriting code and additional testing. That in itself promises bugs large and small, as well as a few crashed servers (You must, Fedya, you must! (c)).

By the way, if anyone thinks that this turning point has not arrived for rhel8 because the kernel there was "frozen" at version 4.18, they are mistaken. In the fresh RC of rhel8 the novelties from 5.0 have already been migrated, and the blk_init_queue() function has been cut out as well (probably while pulling yet another commit from github.com/torvalds/linux into their own sources).
In general, a "freeze" of the kernel version has long been a marketing notion for Linux distributors such as SUSE and Red Hat. The system reports, for example, version 4.4, while the functionality actually comes from a fresh 4.8 vanilla kernel. Meanwhile the official site displays something like: "In the new distribution we have kept the stable 4.4 kernel for you."

But we digress…

So, we need a new simple block device driver to make it clearer how this works.
The source is at github.com/CodeImp/sblkdev. I suggest discussing it, sending pull requests, and filing issues; I will fix them. QA has not checked it yet.

In the rest of the article I will try to describe what is there and why, so a lot of code follows.
I apologize in advance that the Linux kernel coding style is not fully observed, and yes, I don't like goto.

So, let's start with the entry points.

Obviously, the sblkdev_init() function runs when the module is loaded, and sblkdev_exit() runs when it is unloaded.
The register_blkdev() function registers a block device and allocates a major number for it. unregister_blkdev() releases that number.

The key structure of our module is sblkdev_device_t.

It contains all the information about the device that the kernel module needs, in particular: the capacity of the block device, the data itself (it is simple, after all), and pointers to the disk and the queue.

All initialization of the block device is performed in the sblkdev_add_device() function.

We allocate memory for the structure and allocate a buffer for storing the data. Nothing special here.
Next we initialize the request queue, either with the single function blk_mq_init_sq_queue() or with two functions: blk_mq_alloc_tag_set() + blk_mq_init_queue().

By the way, if you look at the source of blk_mq_init_sq_queue(), you will see that it is just a wrapper around blk_mq_alloc_tag_set() and blk_mq_init_queue() that appeared in kernel 4.20. It hides many queue parameters from us, but it looks much simpler. It is up to you which option is better, but I prefer the more explicit one.

The key element in this code is the global variable _mq_ops.

This is where the function that handles requests lives, but more on it a bit later. The main thing is that we have designated the entry point into the request handler.

Now that we have created the queue, we can create the disk instance.

There are no significant changes here. The disk is allocated, its parameters are set, and the disk is added to the system. I want to explain the disk->flags parameter. It lets you tell the system that the disk is removable or, for example, that it contains no partitions and there is no need to look for them.

The _fops structure is used to manage the disk.

The _open and _release entry points are not very interesting for a simple block device module yet. Apart from an atomic increment and decrement of a counter, there is nothing there. I also left compat_ioctl unimplemented, because systems with a 64-bit kernel and a 32-bit user-space environment do not look promising to me.

The _ioctl entry point, on the other hand, lets you handle system requests to this disk. When a disk appears, the system tries to learn more about it. You may answer some requests as you see fit (for example, to pretend to be a new CD), but the general rule is this: if you don't want to answer requests you are not interested in, just return the error code -ENOTTY. By the way, if needed, you can also add your own request handlers specific to this disk here.

So, we have added the device, and now we need to take care of releasing resources. This is not Rust, after all.

In principle, everything is obvious: we remove the disk object from the system and release the queue, and then we free our own buffers (data areas).

And now the most important part: request processing in the queue_rq() function.

First, let's look at the parameters. The first one, struct blk_mq_hw_ctx *hctx, is the state of the hardware queue. In our case we do without a hardware queue, so it is unused.

The second parameter, const struct blk_mq_queue_data* bd, is a very laconic structure, which I am not afraid to present to you in full:

It turns out that in essence it is the same old request that has come down to us from times that even the chronicler elixir.bootlin.com no longer remembers. So we take the request and start processing it, notifying the kernel by calling blk_mq_start_request(). When the request has been processed, we tell the kernel by calling blk_mq_end_request().

A small remark here: blk_mq_end_request() is essentially a wrapper around the calls blk_update_request() + __blk_mq_end_request(). When using blk_mq_end_request(), you cannot specify how many bytes were actually processed. It assumes that everything was processed.

The alternative has a different peculiarity: the blk_update_request function is exported only for GPL-only modules. That is, if you want to create a proprietary kernel module (may your PM spare you this thorny path), you will not be able to use blk_update_request(). So the choice here is yours.

I moved the actual shuffling of bytes from the request into the buffer and back into the do_simple_request() function.

Nothing new here: rq_for_each_segment iterates over all the bio structures, and within them over all the bio_vec structures, letting us reach the pages with the request data.
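The copying logic can be illustrated outside the kernel. The following is a user-space sketch only, not the author's do_simple_request(): struct segment is a hypothetical stand-in for a bio_vec, and a plain array stands in for the device buffer. It walks the segments in order and copies each one to or from the buffer, stopping at the end of the device.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one bio_vec: a chunk of request data. */
struct segment {
    unsigned char *data;
    size_t len;
};

/* Copies segments to or from a flat device buffer, starting at byte offset
 * pos. to_device != 0 models a write request, 0 models a read request.
 * Returns the number of bytes transferred. */
static size_t copy_segments(unsigned char *dev, size_t capacity, size_t pos,
                            struct segment *segs, size_t nsegs, int to_device)
{
    size_t done = 0;

    for (size_t i = 0; i < nsegs && pos < capacity; i++) {
        size_t n = segs[i].len;
        if (n > capacity - pos)
            n = capacity - pos; /* clamp at the end of the device */

        if (to_device)
            memcpy(dev + pos, segs[i].data, n);
        else
            memcpy(segs[i].data, dev + pos, n);

        pos += n;
        done += n;
    }
    return done;
}
```

In the real driver the starting position would come from the request's sector number, and each segment would be a mapped page fragment rather than a plain pointer.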

What are your impressions? Does it all seem simple? Request processing is really just copying data between the request pages and the internal buffer. Quite respectable for a simple block device driver, right?

But there is a problem: this is not for real-world use!

The crux of the problem is that the request handler queue_rq() is called in a loop that processes requests from a list. I don't know exactly which kind of locking is used for that list, spin or RCU (I don't want to lie; if anyone knows, please correct me), but if you try to use, for example, a mutex in the request handler, a debug kernel complains and warns that sleeping is not allowed here. In other words, you cannot use ordinary synchronization primitives or virtually contiguous memory (the kind that is allocated with vmalloc and may be swapped out, with all that entails), because the process cannot go into a waiting state.

So the options are either only spin or RCU locks and a buffer in the form of an array of pages, a list, or a tree, as implemented in ..\linux\drivers\block\brd.c, or deferred processing in another thread, as implemented in ..\linux\drivers\block\loop.c.

I don't think I need to describe how to build the module, load it into the system, and unload it. Nothing new on that front, and thanks for that 🙂 So if anyone wants to try it, I'm sure they will figure it out. Just don't do it right away on your favorite laptop! Spin up a virtual machine, or at least make a backup to a network share.

By the way, Veeam Backup for Linux 3.0.1.1046 is already available. Just don't try to run VAL 3.0.1.1046 on kernel 5.0 or later: veeamsnap will not build. And some of the multi-queue changes are still being tested.