Linux cluster file system

Contents

Choosing a distributed file system for Linux: a few words about Ceph and the others
File systems
Types of file systems
Journaling
FUSE-based file systems
Stackable file systems
Read-only file systems
Clustered file systems
Shared-disk file system
Identify existing file systems
Create a file system
Mount a file system
List mounted file systems

Choosing a distributed file system for Linux: a few words about Ceph and the others

There are dozens of file systems, and all of them give users an interface for storing data. Each is good in its own way. However, in this age of heavy loads and petabytes of data to process, finding the right one turns out to be rather hard as soon as you start thinking about distributed data, distributed load, multiple read-write (rw) mounts, and other cluster delights.

Task: set up distributed file storage that meets these requirements:
- no custom-built kernels, modules, or patches,
- multiple simultaneous mounts in rw mode,
- POSIX compatibility,
- fault tolerance,
- compatibility with the technologies we already use,
- reasonable I/O overhead compared with local file systems,
- simple configuration, maintenance, and administration.

At work we use Proxmox with OpenVZ container virtualization. It is convenient, it is fast, and it has more advantages than comparable products, at least for our projects and in our circumstances.
The storage itself is attached everywhere over Fibre Channel (FC).

OCFS2

We had used this file system successfully before, so we decided to try it first. Proxmox recently switched to a Red Hat kernel in which ocfs2 support is disabled. The module is there in the kernel, but the OpenVZ and Proxmox forums advise against enabling it. We tried anyway and rebuilt the kernel. The setup: module version 1.5.0, a cluster of 4 physical machines running Debian squeeze, Proxmox 2.0beta3, kernel 2.6.32-6-pve. We used stress for testing. The problems are the same as they were several years ago. Everything came up, and configuring this stack takes half an hour at most. Under load, however, the cluster can spontaneously fall apart, which leads to a kernel panic on all servers at once. Over one day of testing the machines rebooted five times in total. This can be cured, but getting such a system into a workable state is quite hard. We also had to rebuild the kernel to enable ocfs2. Verdict: minus.

GFS2

Although the kernel is Red Hat's and the module is enabled by default, we never got it running here either. The culprit is Proxmox, which since version 2 has come up with its own cluster, with all the bells and whistles, for storing its configs. It uses cman, corosync, and other packages from gfs2-tools, but all of them rebuilt specifically for pve. The gfs2 tooling therefore cannot simply be installed from packages, because the installation first proposes removing all of Proxmox, which we could not do. After three hours we beat the dependencies, but it all ended in a kernel panic again. An attempt to adapt the Proxmox packages to our needs failed, and after two hours we decided to abandon the idea.

Ceph

For now we have settled on it.

It is POSIX compatible, fast, and scales very well, with several bold and interesting implementation approaches.

The file system consists of the following components:

1. Clients. The users of the data.
2. Metadata servers. They cache and synchronize the distributed metadata. Thanks to the metadata, a client knows at any moment where the data it needs is located. Metadata servers also handle the placement of new data.
3. Object storage cluster. Both data and metadata are stored here as objects.
4. Cluster monitors. They monitor the health of the system as a whole.

Actual file I/O happens between the client and the object storage cluster. Thus, high-level POSIX functions (open, close, and rename) are managed by the metadata servers, while ordinary POSIX functions (read and write) are handled directly by the object storage cluster.
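
The split between the metadata path and the data path can be pictured with a toy model. This is only a sketch of the description above: the names METADATA_OPS, DATA_OPS and route are made up for illustration and are not part of any Ceph API.

```python
# Toy model of the split described above; all names are illustrative
# assumptions, not Ceph interfaces.
METADATA_OPS = {"open", "close", "rename"}   # managed by the metadata servers
DATA_OPS = {"read", "write"}                 # go straight to object storage


def route(operation: str) -> str:
    """Return which component serves a POSIX operation in this toy model."""
    if operation in METADATA_OPS:
        return "metadata server"
    if operation in DATA_OPS:
        return "object storage cluster"
    raise ValueError(f"operation not covered by this sketch: {operation}")
```

In this model, route("rename") names the metadata server, while route("write") names the object storage cluster, which is why bulk I/O does not have to pass through the metadata servers.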

There can be several of any component, depending on the administrator's needs.

The file system can be mounted either directly, using the kernel module, or through FUSE. From the user's point of view, the Ceph file system is transparent. Users simply have access to a huge storage system and are unaware of the metadata servers, monitors, and individual devices that make up the massive storage pool. They just see a mount point where standard file I/O operations can be performed. From the administrator's point of view, the cluster can be expanded transparently by adding as many components as needed: monitors, storage nodes, metadata servers.

The developers proudly call Ceph an ecosystem.

We did not consider GPFS, Lustre, other file systems, or add-on layers this time: they are either very hard to configure, not actively developed, or do not fit the task.

Configuration and testing

The configuration is standard, taken entirely from the Ceph wiki. Overall the file system left a good impression. We assembled a 2 TB array, half SAS and half SATA disks (block devices exported over FC), with ext3 partitions.
The Ceph storage is mounted inside 12 virtual machines on 4 hardware nodes, with reads and writes performed from all mount points. The fourth day of stress tests is going fine; write I/O averages 75 MB/s at peak.

We have not yet looked at Ceph's remaining features (and there are still quite a few), and there are problems with FUSE. But although the developers warn that the system is experimental and should not be used in production, we believe that if you really want to, you can -_-

Anyone interested, and anyone sympathetic, is welcome to send a private message. The topic is very interesting, and we are looking for like-minded people to discuss the problems we ran into and find ways to solve them.

File systems

In computing, a file system or filesystem controls how data is stored and retrieved. Without a file system, information placed in a storage medium would be one large body of data with no way to tell where one piece of information stops and the next begins. By separating the data into pieces and giving each piece a name, the information is easily isolated and identified. Taking its name from the way paper-based information systems are named, each group of data is called a "file". The structure and logic rules used to manage the groups of information and their names are called a "file system".

Individual drive partitions can be set up using one of the many different available filesystems. Each has its own advantages, disadvantages, and unique idiosyncrasies. A brief overview of supported filesystems follows; the links are to Wikipedia pages that provide much more information.

Types of file systems

See filesystems(5) for a general overview and Wikipedia:Comparison of file systems for a detailed feature comparison. File systems supported by the kernel are listed in /proc/filesystems.
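
As a rough illustration of what /proc/filesystems contains, the sketch below parses its text. Each line holds a file system name, optionally preceded by the word nodev for types that are not backed by a block device. The function name parse_proc_filesystems is an assumption made for this example.

```python
def parse_proc_filesystems(text: str) -> list[tuple[str, bool]]:
    """Parse /proc/filesystems content into (name, needs_block_device) pairs."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Lines look like "nodev<TAB>sysfs" (virtual) or "<TAB>ext4" (block-backed).
        nodev = len(fields) == 2 and fields[0] == "nodev"
        result.append((fields[-1], not nodev))
    return result
```

On a Linux machine, passing the contents of /proc/filesystems to this function would list what the running kernel currently supports; modules that are not loaded yet do not appear there.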

In-tree and FUSE file systems

File system	Creation command	Userspace utilities	Archiso	Kernel documentation	Notes
Btrfs	mkfs.btrfs(8)	btrfs-progs	Yes	btrfs.html	Stability status
VFAT	mkfs.fat(8)	dosfstools	Yes	vfat.html	Windows 9x file system
exFAT	mkfs.exfat(8)	exfatprogs	Yes	-	Native file system in Linux 5.4.
exFAT	mkexfatfs(8)	exfat-utils	No	N/A (FUSE-based)	-
F2FS	mkfs.f2fs(8)	f2fs-tools	Yes	f2fs.html	Flash-based devices
ext3	mkfs.ext3(8)	e2fsprogs	Yes	ext3.html	-
ext4	mkfs.ext4(8)	e2fsprogs	Yes	ext4.html	-
HFS	mkfs.hfsplus(8)	hfsprogs (AUR)	No	hfs.html	Classic Mac OS file system
HFS+	mkfs.hfsplus(8)	hfsprogs (AUR)	No	hfsplus.html	macOS (8–10.12) file system
JFS	mkfs.jfs(8)	jfsutils	Yes	jfs.html	-
NILFS2	mkfs.nilfs2(8)	nilfs-utils	Yes	nilfs2.html	Raw flash devices, e.g. SD card
NTFS	-	-	No	ntfs.html	Windows NT file system. Kernel’s in-built driver has very limited write support. Officially supported kernels are built without CONFIG_NTFS_FS so this driver is not available.
NTFS	mkfs.ntfs(8)	ntfs-3g	Yes	N/A (FUSE-based)	FUSE driver with extended capabilities.
ReiserFS	mkfs.reiserfs(8)	reiserfsprogs	Yes	-	-
UDF	mkfs.udf(8)	udftools	Yes	udf.html	-
XFS	mkfs.xfs(8)	xfsprogs	Yes	-	-

Out-of-tree file systems

File system	Creation command	Kernel patchset	Userspace utilities	Notes
APFS	mkapfs(8)	linux-apfs-rw-dkms-git (AUR)	apfsprogs-git (AUR)	macOS (10.13 and newer) file system. Read only, experimental.
Bcachefs	bcachefs(8)	linux-bcachefs-git (AUR)	bcachefs-tools-git (AUR)	-
NTFS3	-	ntfs3-dkms (AUR)	-	Paragon NTFS3 driver FAQ
Reiser4	mkfs.reiser4(8)	-	reiser4progs (AUR)	-
ZFS	-	zfs-linux (AUR), zfs-dkms (AUR)	zfs-utils (AUR)	OpenZFS port

Journaling

All of the above filesystems use journaling, with the exception of exFAT, ext2, FAT16/32, Reiser4 (optional), Btrfs and ZFS. Journaling provides fault resilience by logging changes before they are committed to the filesystem. In the event of a system crash or power failure, such file systems are faster to bring back online and less likely to become corrupted. The logging takes place in a dedicated area of the filesystem.

Not all journaling techniques are the same. Ext3 and ext4 offer data-mode journaling, which logs both data and metadata, as well as the option to journal only metadata changes. Data-mode journaling comes with a speed penalty and is not enabled by default. In the same vein, Reiser4 offers so-called "transaction models", which change not only the features it provides but also its journaling mode. It uses different journaling techniques: a special model called wandering logs, which eliminates the need to write to the disk twice; write-anywhere, a pure copy-on-write approach (mostly equivalent to Btrfs' default but with a fundamentally different "tree" design); and a combined approach called hybrid, which heuristically alternates between the two.

The other filesystems provide ordered-mode journaling, which only logs metadata. While all journaling will return a filesystem to a valid state after a crash, data-mode journaling offers the greatest protection against corruption and data loss. There is a compromise in system performance, however, because data-mode journaling does two write operations: first to the journal and then to the disk (which Reiser4 avoids with its "wandering logs" feature). The trade-off between system speed and data safety should be considered when choosing the filesystem type. Reiser4 is the only filesystem that by design operates on full atomicity (operations either occur entirely or not at all, so a half-completed operation cannot corrupt or destroy data) and also provides checksums for both metadata and inline data; by design it is therefore much less prone to data loss than other file systems like Btrfs.

Filesystems based on copy-on-write (also known as write-anywhere), such as Reiser4, Btrfs and ZFS, have no need for a traditional journal to protect metadata, because metadata is never updated in place. Although Btrfs still has a journal-like log tree, it is only used to speed up fdatasync/fsync.
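
The log-before-commit idea behind journaling can be sketched with a toy in-memory model. This is an illustration under simplifying assumptions, not how any of the listed file systems is implemented: the class JournaledStore and its crash_before_commit switch are invented for the example, and a real journal lives on persistent storage rather than in a Python list.

```python
class JournaledStore:
    """Toy key-value 'disk' with a write-ahead journal (illustration only)."""

    def __init__(self):
        self.disk = {}      # committed state
        self.journal = []   # changes logged but not yet committed

    def write(self, key, value, crash_before_commit=False):
        # Step 1: log the intended change before touching the main area.
        self.journal.append((key, value))
        if crash_before_commit:
            return  # simulated power loss: the journal has the entry, the disk does not
        # Step 2: apply the change, then retire the journal entry.
        self.disk[key] = value
        self.journal.pop()

    def recover(self):
        # After a crash, replaying the short journal is enough;
        # there is no need to scan the whole disk for inconsistencies.
        for key, value in self.journal:
            self.disk[key] = value
        self.journal.clear()
```

Every change in this model is written twice (journal, then disk), which mirrors the speed penalty that the text attributes to data-mode journaling.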

FUSE-based file systems

Stackable file systems

aufs — Advanced Multi-layered Unification Filesystem, a union filesystem and a complete rewrite of Unionfs. It was rejected from the Linux mainline, and OverlayFS was merged into the Linux kernel instead.

http://aufs.sourceforge.net || linux-aufs (AUR)

eCryptfs — The Enterprise Cryptographic Filesystem is a package of disk encryption software for Linux. It is implemented as a POSIX-compliant filesystem-level encryption layer, aiming to offer functionality similar to that of GnuPG at the operating system level.

https://ecryptfs.org || ecryptfs-utils

mergerfs — a FUSE based union filesystem.

https://github.com/trapexit/mergerfs || mergerfs (AUR)

mhddfs — Multi-HDD FUSE filesystem, a FUSE based union filesystem.

http://mhddfs.uvw.ru || mhddfs (AUR)

overlayfs — OverlayFS is a filesystem service for Linux which implements a union mount for other file systems.

https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html || linux

Unionfs — Unionfs is a filesystem service for Linux, FreeBSD and NetBSD which implements a union mount for other file systems.

https://unionfs.filesystems.org/ || not packaged? search in AUR

unionfs-fuse — A user space Unionfs implementation.

https://github.com/rpodgorny/unionfs-fuse || unionfs-fuse
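
The union-mount idea shared by several entries above can be sketched in a few lines. This is a deliberately simplified model: plain dictionaries stand in for directory trees, the function names union_lookup and union_listing are made up for the example, and details that real implementations handle, such as deletions and copy-up on write, are left out.

```python
def union_lookup(path, layers):
    """Return the content of `path` from the topmost layer that has it.

    `layers` is ordered from top (upper) to bottom (lower); each layer is a
    plain dict mapping paths to contents.
    """
    for layer in layers:
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)


def union_listing(layers):
    """Merged view of all layers, with upper layers hiding lower ones."""
    merged = {}
    for layer in reversed(layers):   # apply the bottom layer first, the top last
        merged.update(layer)
    return merged
```

With an upper layer {"/etc/motd": "upper"} over a lower layer that also contains /etc/motd, a lookup returns the upper copy, while paths that exist only in the lower layer remain visible.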

Read-only file systems

EROFS — Enhanced Read-Only File System is a lightweight read-only file system that aims to improve performance and save storage space through compression.

https://www.kernel.org/doc/html/latest/filesystems/erofs.html || erofs-utils

SquashFS — SquashFS is a compressed read only filesystem. SquashFS compresses files, inodes and directories, and supports block sizes up to 1 MB for greater compression.

https://github.com/plougher/squashfs-tools || squashfs-tools

Clustered file systems

Ceph — Unified, distributed storage system designed for excellent performance, reliability and scalability.

https://ceph.com/ || ceph

Glusterfs — Cluster file system capable of scaling to several peta-bytes.

https://www.gluster.org/ || glusterfs

IPFS — A peer-to-peer hypermedia protocol to make the web faster, safer, and more open. IPFS aims to replace HTTP and build a better web for all of us. It uses blocks to store parts of a file, each network node stores only the content it is interested in, and it provides deduplication, distribution, and a scalable system limited only by its users. (currently in alpha)

https://ipfs.io/ || go-ipfs

MooseFS — MooseFS is a fault tolerant, highly available and high performance scale-out network distributed file system.

https://moosefs.com || moosefs

OpenAFS — Open source implementation of the AFS distributed file system

https://www.openafs.org || openafs (AUR)

OrangeFS — OrangeFS is a scale-out network file system designed for transparently accessing multi-server-based disk storage, in parallel. Has optimized MPI-IO support for parallel and distributed applications. Simplifies the use of parallel storage not only for Linux clients, but also for Windows, Hadoop, and WebDAV. POSIX-compatible. Part of Linux kernel since version 4.6.

https://www.orangefs.org/ || not packaged? search in AUR

Sheepdog — Distributed object storage system for volume and container services and manages the disks and nodes intelligently.

https://sheepdog.github.io/sheepdog/ || sheepdog (AUR)

Tahoe-LAFS — Tahoe Least-Authority Filesystem is a free and open, secure, decentralized, fault-tolerant, peer-to-peer distributed data store and distributed file system.

https://tahoe-lafs.org/ || tahoe-lafs (AUR)

Shared-disk file system

GFS2 — GFS2 allows all members of a cluster to have direct concurrent access to the same shared block storage

https://pagure.io/gfs2-utils || gfs2-utils (AUR)

OCFS2 — The Oracle Cluster File System (version 2) is a shared disk file system developed by Oracle Corporation and released under the GNU General Public License

https://oss.oracle.com/projects/ocfs2/ || ocfs2-tools (AUR)

VMware VMFS — VMware’s VMFS (Virtual Machine File System) is used by the company’s flagship server virtualization suite, vSphere.

https://www.vmware.com/products/vi/esx/vmfs.html || vmfs-tools (AUR)

Identify existing file systems

To identify existing file systems, run lsblk -f.

An existing file system, if present, will be shown in the FSTYPE column. If mounted, it will appear in the MOUNTPOINT column.
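
For scripting, the same information can be read from lsblk's JSON output instead of the human-readable table. The sketch below assumes the util-linux lsblk with its --json option and the NAME, FSTYPE and MOUNTPOINT columns; the helper names lsblk_json and filesystems_from_lsblk are invented for this example.

```python
import json
import subprocess


def lsblk_json() -> str:
    """Run lsblk and return its JSON output (needs util-linux lsblk installed)."""
    return subprocess.run(
        ["lsblk", "--json", "-o", "NAME,FSTYPE,MOUNTPOINT"],
        check=True, capture_output=True, text=True,
    ).stdout


def filesystems_from_lsblk(json_text: str) -> list[dict]:
    """Flatten lsblk JSON into rows for devices that carry a file system."""
    rows = []

    def walk(devices):
        for dev in devices:
            if dev.get("fstype"):            # empty/None means no file system found
                rows.append({
                    "name": dev["name"],
                    "fstype": dev["fstype"],
                    "mountpoint": dev.get("mountpoint"),   # None if not mounted
                })
            walk(dev.get("children", []))    # partitions nest under their disk

    walk(json.loads(json_text).get("blockdevices", []))
    return rows
```

Calling filesystems_from_lsblk(lsblk_json()) on a Linux machine would return one row per device or partition with a detected file system; a row whose mountpoint is None is not mounted.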

Create a file system

File systems are usually created on a partition, inside logical containers such as LVM, RAID and dm-crypt, or on a regular file (see Wikipedia:Loop device). This section describes the partition case.

Before continuing, identify the device where the file system will be created and whether or not it is mounted, for example with lsblk -f.

Mounted file systems must be unmounted before proceeding. For example, an existing filesystem on /dev/sda2 that is mounted at /mnt would be unmounted with umount /dev/sda2.

To find just mounted file systems, see #List mounted file systems.

To create a new file system, use mkfs(8). See #Types of file systems for the exact type, as well as userspace utilities you may wish to install for a particular file system.

For example, to create a new file system of type ext4 (common for Linux data partitions) on /dev/sda1, run mkfs.ext4 /dev/sda1 as root.

The new file system can now be mounted to a directory of choice.

Mount a file system

To manually mount a filesystem located on a device (e.g., a partition) to a directory, use mount(8). For example, mount /dev/sda1 /mnt mounts /dev/sda1 to /mnt.

This attaches the filesystem on /dev/sda1 at the directory /mnt, making the contents of the filesystem visible. Any data that existed at /mnt before this action is made invisible until the device is unmounted.

fstab contains information on how devices should be automatically mounted if present. See the fstab article for more information on how to modify this behavior.

If a device is specified in /etc/fstab and only the device or mount point is given on the command line, that information will be used in mounting. For example, if /etc/fstab contains a line indicating that /dev/sda1 should be mounted to /mnt, then mount /dev/sda1 will automatically mount the device to that location.
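
The lookup that makes a single argument sufficient can be approximated as follows. This sketch only models matching on the literal first two fstab fields; it does not resolve UUID= or LABEL= specifications the way mount does, and the names parse_fstab and resolve are assumptions made for the example.

```python
def parse_fstab(text: str) -> list[dict]:
    """Parse fstab-style text into entries, skipping comments and blank lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed line; a real parser would report it
        entries.append({
            "spec": fields[0],               # device, UUID=..., LABEL=...
            "file": fields[1],               # mount point
            "vfstype": fields[2],            # file system type
            "mntops": fields[3].split(","),  # mount options
        })
    return entries


def resolve(arg: str, entries: list[dict]):
    """Return the first entry whose device or mount point equals `arg`, else None."""
    for entry in entries:
        if arg in (entry["spec"], entry["file"]):
            return entry
    return None
```

Given a line such as "/dev/sda1 /mnt ext4 defaults 0 2", both resolve("/dev/sda1", entries) and resolve("/mnt", entries) find the same entry, which supplies the missing half of the mount command.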

mount accepts many options, many of which depend on the file system specified. The options can be changed by:

using flags on the command line with mount
editing fstab
creating udev rules
compiling the kernel yourself
using filesystem-specific mount scripts (located at /usr/bin/mount.*)

See the article for the filesystem of interest for more information.

List mounted file systems

To list all mounted file systems, run findmnt(8) without arguments.

findmnt takes a variety of arguments which can filter the output and show additional information. For example, it can take a device or mount point as an argument to show only information on what is specified, as in findmnt /dev/sda1.

findmnt gathers information from /etc/fstab, /etc/mtab, and /proc/self/mounts.
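
To show what such a mount table looks like to a program, here is a minimal parser for text in the /proc/self/mounts format (device, mount point, type, options, and two numeric fields per line). It is a sketch and not a reimplementation of findmnt; parse_mounts, find_mount and unescape are names chosen for the example.

```python
import re


def unescape(field: str) -> str:
    """Decode the octal escapes (e.g. \\040 for a space) used in mount tables."""
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def parse_mounts(text: str) -> list[dict]:
    """Parse /proc/self/mounts-style text into one dict per mounted file system."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mounts.append({
            "source": unescape(fields[0]),
            "target": unescape(fields[1]),
            "fstype": fields[2],
            "options": fields[3].split(","),
        })
    return mounts


def find_mount(arg: str, mounts: list[dict]) -> list[dict]:
    """Keep only entries whose device or mount point equals `arg`."""
    return [m for m in mounts if arg in (m["source"], m["target"])]
```

On Linux, feeding the contents of /proc/self/mounts to parse_mounts and then calling find_mount("/mnt", ...) gives a rough equivalent of filtering by a mount point.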